# GPT - Text Generation

Transformers are fairly complex to use: they demand more expertise and, above all, more computing power. The objective of this lab is to give you the basics of using GPT for text generation.

A second-year course is dedicated to attention mechanisms and transformers and covers these notions in more depth.

Further reading:
* [The Journey of Open AI GPT models](https://medium.com/walmartglobaltech/the-journey-of-open-ai-gpt-models-32d95b7b7fb2)
* [GPT-3 Explained](https://towardsdatascience.com/gpt-3-explained-19e5f2bd3288)

# Table of Contents

* 1 [With Transformers](#With-Transformers)
  * 1.1 [Model configuration](#Model-configuration)
  * 1.2 [GPT2 tokenizer](#GPT2-tokenizer)
  * 1.3 [GPT2 for text prediction](#GPT2-for-text-prediction)

## With Transformers

To make sure you can run the latest versions of the example scripts, install the library from source together with the example-specific requirements. To do so, run the following steps in a new virtual environment:

    git clone https://github.com/huggingface/transformers
    cd transformers
    pip install .

Then `cd` into the example folder of your choice and run:

`pip install -r requirements.txt`

To help you get started, many Transformers example notebooks are available [here](https://huggingface.co/transformers/notebooks.html).

### Model configuration

In [1]:
from transformers import GPT2Model, GPT2Config

In [2]:
# Initialize a configuration with the library defaults,
# which correspond to the smallest GPT-2 architecture
configuration = GPT2Config()
configuration

GPT2Config {
  "activation_function": "gelu_new",
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "gradient_checkpointing": false,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "resid_pdrop": 0.1,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "transformers_version": "4.3.0.dev0",
  "use_cache": true,
  "vocab_size": 50257
}

In [3]:
# Initialize a model from the configuration.
# The weights are random: no pre-trained checkpoint is loaded here.
model = GPT2Model(configuration)

In [4]:
# Accessing the model configuration
configuration1 = model.config
configuration1

GPT2Config {
  "activation_function": "gelu_new",
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "gradient_checkpointing": false,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "resid_pdrop": 0.1,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "transformers_version": "4.3.0.dev0",
  "use_cache": true,
  "vocab_size": 50257
}
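
The fields of this configuration fix the size of the network. As a rough sketch, the parameter count of the default configuration can be recomputed by hand from `vocab_size`, `n_positions`, `n_embd` and `n_layer`. The sketch assumes the usual GPT-2 block layout (a fused query/key/value projection, an output projection, a two-layer MLP and two layer norms per block), that `n_inner: null` means an inner width of `4 * n_embd`, and that the output layer reuses the token embedding matrix. The name `count_gpt2_parameters` is illustrative and not part of the library.

```python
def count_gpt2_parameters(vocab_size=50257, n_positions=1024,
                          n_embd=768, n_layer=12, n_inner=None):
    """Estimate the number of trainable parameters from configuration values."""
    # Assumption: a null n_inner means the MLP is 4 times wider than n_embd
    inner = n_inner if n_inner is not None else 4 * n_embd

    token_embeddings = vocab_size * n_embd
    position_embeddings = n_positions * n_embd

    layer_norm = 2 * n_embd  # one scale and one bias vector
    # Fused query/key/value projection, then the attention output projection
    attention = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
    # Two linear layers: n_embd -> inner -> n_embd, each with a bias
    mlp = (n_embd * inner + inner) + (inner * n_embd + n_embd)
    block = 2 * layer_norm + attention + mlp

    # n_layer blocks plus the final layer norm
    return token_embeddings + position_embeddings + n_layer * block + layer_norm


total = count_gpt2_parameters()
print(f"{total:,} parameters")  # 124,439,808 parameters
```

Most of the budget sits in the twelve blocks (about 85 million parameters) and in the token embeddings (about 39 million), which is why changing `n_embd` or `n_layer` has a much larger effect than changing `n_positions`.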

### GPT2 tokenizer

In [5]:
from transformers import GPT2Tokenizer

In [6]:
# Load pre-trained model tokenizer (vocabulary)

model_name = "distilgpt2"
#model_name = "microsoft/DialogRPT-updown"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)

In [7]:
# Tokenize a sentence ('Ġ' marks a token that starts with a space)
tokens = tokenizer.tokenize("I take aspirin. I like chocolate")
tokens

['I', 'Ġtake', 'Ġaspirin', '.', 'ĠI', 'Ġlike', 'Ġchocolate']

In [8]:
# Encode a sentence
tokens = tokenizer.encode("I take aspirin. I like chocolate")
tokens

[40, 1011, 49550, 13, 314, 588, 11311]

In [9]:
# 'I' at the start of the text and ' I' after a space have different ids
print(tokens[0], tokenizer.decode(tokens[0]))
print(tokens[4], tokenizer.decode(tokens[4]))

40 I
314  I


In [10]:
tokens = tokenizer.encode("Hello world")
tokens

[15496, 995]

In [11]:
# A leading space changes the id of the first token
tokens = tokenizer.encode(" Hello world")
tokens

[18435, 995]

In [12]:
txt = tokenizer.decode(tokens)
txt

' Hello world'
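
The `Ġ` prefix seen in the tokens comes from GPT-2's byte-level vocabulary: every byte of the input is mapped to a printable character before the merges are applied, and the space byte ends up as `Ġ`. This is why `Hello` and ` Hello` receive different ids. The sketch below rebuilds that byte-to-character table in plain Python. It is a re-implementation for illustration, assumed to follow the published GPT-2 encoder; `visible_form` is a hypothetical helper, and the snippet does not perform the BPE merges themselves.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a distinct printable character."""
    # Bytes that are already printable keep their own code point:
    # '!'..'~' (33..126), '¡'..'¬' (161..172) and '®'..'ÿ' (174..255)
    printable = list(range(33, 127)) + list(range(161, 173)) + list(range(174, 256))
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for b in range(256):
        if b not in printable:
            # Remaining bytes (space, newline, control bytes...) are moved
            # to code points starting at 256
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


BYTE_TO_CHAR = bytes_to_unicode()


def visible_form(text):
    """Hypothetical helper: show a string the way the vocabulary stores it."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


print(visible_form(" take"))        # Ġtake
print(visible_form("Hello world"))  # HelloĠworld
```

The space is byte 32, the 33rd non-printable byte, so it lands on code point 256 + 32 = 288, which is `Ġ`.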

### GPT2 for text prediction

In [13]:
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Use the same checkpoint for the tokenizer and the model
model_name = 'gpt2'

# Load pre-trained model tokenizer (vocabulary)
tokenizer = GPT2Tokenizer.from_pretrained(model_name)

# Encode the input text
text = "Who was Emmanuel Macron ? Emmanuel Macron was a"
display("start text: "+text)
indexed_tokens = tokenizer.encode(text)

# Convert the token ids to a PyTorch tensor with a batch dimension
tokens_tensor = torch.tensor([indexed_tokens])

# Load pre-trained model (weights)
model = GPT2LMHeadModel.from_pretrained(model_name)

# Set the model in evaluation mode to deactivate the dropout modules
# This is IMPORTANT to have reproducible results during evaluation!
model.eval()

# If you have a GPU, put everything on cuda
#tokens_tensor = tokens_tensor.to('cuda')
#model.to('cuda')

# Predict the next-token scores for every position
with torch.no_grad():
    outputs = model(tokens_tensor)
    predictions = outputs[0]  # logits, shape (batch, sequence length, vocabulary size)

# Get the most likely next sub-word (in our case, ' French')
predicted_index = torch.argmax(predictions[0, -1, :]).item()
predicted_text = tokenizer.decode(indexed_tokens + [predicted_index])
display("response: "+predicted_text)

'start text: Who was Emmanuel Macron ? Emmanuel Macron was a'

'response: Who was Emmanuel Macron? Emmanuel Macron was a French'
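
The cell above predicts a single token by taking the `argmax` of the logits at the last position. Repeating that step and appending each prediction to the input gives greedy decoding. The sketch below shows only the control flow, in plain Python: `toy_next_token_logits` is a made-up stand-in for the model call and its tiny vocabulary is invented for the example, so the output says nothing about GPT-2 itself.

```python
# Invented six-word vocabulary; index 0 plays the role of the end-of-text token
VOCAB = ["<eos>", "Macron", "was", "a", "French", "politician"]


def toy_next_token_logits(token_ids):
    """Hypothetical stand-in for the model: favors the id following the last one."""
    target = (token_ids[-1] + 1) % len(VOCAB)
    return [5.0 if i == target else 0.0 for i in range(len(VOCAB))]


def greedy_decode(prompt_ids, max_new_tokens=10, eos_id=0):
    """Repeatedly append the highest-scoring next token."""
    token_ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = toy_next_token_logits(token_ids)
        # argmax over the vocabulary, as torch.argmax does on the last position
        next_id = max(range(len(logits)), key=logits.__getitem__)
        if next_id == eos_id:
            break  # stop as soon as the end-of-text token is predicted
        token_ids.append(next_id)
    return token_ids


generated = greedy_decode([1, 2, 3])
print(" ".join(VOCAB[i] for i in generated))  # Macron was a French politician
```

With the real model, the stand-in would be replaced by a forward pass on the current ids. Greedy decoding is deterministic, so the same prompt always gives the same continuation.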

In [17]:
from transformers import TFAutoModelForCausalLM, AutoTokenizer, tf_top_k_top_p_filtering
import tensorflow as tf

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = TFAutoModelForCausalLM.from_pretrained("gpt2")

sequence = "Michel RIVEILL is"

# Generate 20 tokens, one at a time
for _ in range(20):
    input_ids = tokenizer.encode(sequence, return_tensors="tf")

    # Get the logits for the last position (scores for the next token)
    next_token_logits = model(input_ids)[0][:, -1, :]

    # Keep only the 50 most likely tokens (top_p=1.0 disables nucleus filtering)
    filtered_next_token_logits = tf_top_k_top_p_filtering(next_token_logits, top_k=50, top_p=1.0)

    # Sample the next token from the filtered distribution
    next_token = tf.random.categorical(filtered_next_token_logits, dtype=tf.int32, num_samples=1)

    # Append the sampled token and decode the whole sequence
    generated = tf.concat([input_ids, next_token], axis=1)
    resulting_string = tokenizer.decode(generated.numpy().tolist()[0])
    print(resulting_string)

    # Feed the extended text back in at the next iteration
    sequence = resulting_string

All model checkpoint layers were used when initializing TFGPT2LMHeadModel.

All the layers of TFGPT2LMHeadModel were initialized from the model checkpoint at gpt2.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.


Michel RIVEILL is an
Michel RIVEILL is an analyst
Michel RIVEILL is an analyst at
Michel RIVEILL is an analyst at the
Michel RIVEILL is an analyst at the New
Michel RIVEILL is an analyst at the New America
Michel RIVEILL is an analyst at the New America Foundation
Michel RIVEILL is an analyst at the New America Foundation.
Michel RIVEILL is an analyst at the New America Foundation. He
Michel RIVEILL is an analyst at the New America Foundation. He is
Michel RIVEILL is an analyst at the New America Foundation. He is the
Michel RIVEILL is an analyst at the New America Foundation. He is the author
Michel RIVEILL is an analyst at the New America Foundation. He is the author of
Michel RIVEILL is an analyst at the New America Foundation. He is the author of The
Michel RIVEILL is an analyst at the New America Foundation. He is the author of The "
Michel RIVEILL is an analyst at the New America Foundation. He is the author of The "S
Michel RIVEILL is an analyst at the New America Foundation. He
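
The generation loop filters the logits before sampling. With `top_k=50`, only the 50 highest-scoring tokens keep their logits and every other token is set to minus infinity, so it can never be drawn; `top_p=1.0` leaves the nucleus filter inactive. The sketch below reproduces the top-k step and the sampling step in plain Python on a toy vector of six logits. `top_k_filter` and `sample_from_logits` are illustrative names, not library functions, and this simplified version keeps every token tied with the k-th score.

```python
import math
import random


def top_k_filter(logits, k):
    """Keep the k largest logits and set the others to minus infinity."""
    threshold = sorted(logits, reverse=True)[k - 1]
    return [x if x >= threshold else -math.inf for x in logits]


def sample_from_logits(logits, rng):
    """Draw one index with probability proportional to softmax(logits)."""
    largest = max(logits)
    # exp(-inf) is 0.0, so filtered-out tokens get zero weight
    weights = [math.exp(x - largest) for x in logits]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so the run is repeatable
logits = [2.0, 0.5, 1.5, -1.0, 3.0, 0.0]

filtered = top_k_filter(logits, k=3)
print(filtered)  # [2.0, -inf, 1.5, -inf, 3.0, -inf]

samples = [sample_from_logits(filtered, rng) for _ in range(1000)]
print(sorted(set(samples)))  # only indices among 0, 2 and 4 can appear
```

Unlike `argmax`, sampling gives a different continuation on each run, which is why the text generated by the loop changes from one execution to the next.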