<center>
    <h1>GPT</h1>
</center>

# Brief Recap
**GPT** (**Generative Pre-trained Transformer**) is a language model introduced by OpenAI in June 2018. It is designed to understand and generate human-like text, and it laid the foundation for subsequent versions by demonstrating the effectiveness of the Transformer architecture for natural language processing tasks.

# Architecture

<img src='assets/GPT.png' width=450>

This is a simplified architecture of the Transformer model used in GPT. Here's a breakdown:

* **Text & Position Embed:** Input text is converted into embeddings, which include information about the words and their positions in the sequence.

* **Masked Multi Self Attention:** This layer allows the model to focus on different parts of the text by computing attention scores. The "masked" part ensures the model predicts the next word based on previous words only.

* **Layer Norm and Residual Connections:** Each layer uses layer normalization to stabilize and accelerate the training process. Residual connections (indicated by the "+" signs) help in improving the gradient flow through the network.

* **Feed Forward Layer:** A fully connected feed-forward network applies transformations to the attention outputs. It processes each position independently and identically.

* **Repetition (12x):** The attention and feed-forward layers are repeated 12 times, giving a stack of 12 Transformer blocks.

* **Outputs - Text Prediction and Task Classifier:** The final outputs are used for different tasks, such as predicting the next word in a sequence or performing a specific classification task.
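
To make the "masked" part of the attention layer concrete, here is a minimal pure-Python sketch. It is only an illustration under simplifying assumptions: the helper names `causal_mask` and `masked_softmax` are hypothetical, the scores are uniform dummy values, and real implementations work on batched tensors with learned projections.

```python
import math

def causal_mask(seq_len):
    """Entry [i][j] is True when position i is allowed to attend to position j."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

def masked_softmax(scores, mask):
    """Row-wise softmax over attention scores; masked positions get weight 0."""
    weights = []
    for row_scores, row_mask in zip(scores, mask):
        allowed = [s for s, m in zip(row_scores, row_mask) if m]
        top = max(allowed)  # subtract the maximum for numerical stability
        exps = [math.exp(s - top) if m else 0.0 for s, m in zip(row_scores, row_mask)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

mask = causal_mask(3)
# Uniform dummy scores make the effect of the mask easy to see.
scores = [[0.0, 0.0, 0.0] for _ in range(3)]
attention = masked_softmax(scores, mask)
for row in attention:
    print([round(w, 2) for w in row])
```

Each row only distributes weight over the current and earlier positions, so the prediction at a given position never depends on later words.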



# Applications

GPT models are used in a wide range of applications, impacting various fields:

* **Text generation:** This is a core function, used for creating articles, stories, poems, scripts, and other written content.  It can also be used for summarizing text, paraphrasing, and expanding on given prompts.
* **Dialogue and conversation:** GPT powers chatbots and conversational AI, providing human-like interactions in customer service, virtual assistants, and other applications.
* **Translation:**  GPT can translate between languages, often with impressive accuracy and fluency.
* **Code generation:**  GPT can assist programmers by generating code in various programming languages, completing code snippets, and offering suggestions.
* **Education:**  GPT can be used as a tutoring tool, answering questions, explaining concepts, and providing personalized learning experiences.
* **Research and knowledge discovery:** GPT can analyze large datasets of text and code, helping researchers identify patterns, trends, and insights.
* **Accessibility:** GPT can help people with disabilities by generating alternative text descriptions for images, converting text to speech, and providing real-time captioning.
* **Business and marketing:** GPT can be used for market research, creating marketing copy, generating personalized recommendations, and automating various business processes.


This is not an exhaustive list, and new applications of GPT are constantly emerging. As the technology continues to develop, we can expect even more innovative and impactful uses in the future.


# Implementation of GPT model using TensorFlow



## Approach 1
Here's a simplified implementation of a GPT-like model using TensorFlow. It aims to demonstrate the basic structure of such a model; it is a starting point for understanding the architecture and can be extended to incorporate more advanced features.


### **1. `TransformerBlock` class**



In [None]:
import tensorflow as tf

# Define the Transformer block
class TransformerBlock(tf.keras.layers.Layer):
    def __init__(self, embed_dim, num_heads, ff_dim, rate=0.1):
        super(TransformerBlock, self).__init__()
        self.att = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        self.ffn = tf.keras.Sequential(
            [tf.keras.layers.Dense(ff_dim, activation="relu"), tf.keras.layers.Dense(embed_dim),]
        )
        self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = tf.keras.layers.Dropout(rate)
        self.dropout2 = tf.keras.layers.Dropout(rate)

    def call(self, inputs, training):
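        # Causal mask (TF >= 2.10): each position attends only to itself and earlier positions
        attn_output = self.att(inputs, inputs, use_causal_mask=True)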
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(inputs + attn_output)
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        return self.layernorm2(out1 + ffn_output)


* **`__init__(self, embed_dim, num_heads, ff_dim, rate=0.1)`:**  This is the constructor of the `TransformerBlock` class. It initializes the layers within each block:
    * `self.att = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)`: Creates a multi-head attention layer.  `num_heads` determines the number of attention heads, and `key_dim` sets the size of each head's query and key projections (set to `embed_dim` here for simplicity).  Multi-head attention allows the model to attend to different parts of the input sequence simultaneously.
    * `self.ffn = tf.keras.Sequential(...)`:  This defines the feed-forward network within the block. It consists of two dense layers:
        * `tf.keras.layers.Dense(ff_dim, activation="relu")`: A dense layer with `ff_dim` units and ReLU activation.  This layer introduces non-linearity.
        * `tf.keras.layers.Dense(embed_dim)`: A dense layer that projects the output back to the original embedding dimension.
    * `self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)`: Layer normalization helps stabilize training and improve performance.
    * `self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)`: Another layer normalization after the feed-forward network.
    * `self.dropout1 = tf.keras.layers.Dropout(rate)`: Dropout is used for regularization to prevent overfitting.
    * `self.dropout2 = tf.keras.layers.Dropout(rate)`: Another dropout layer after the feed-forward network.

* **`call(self, inputs, training)`:** This method defines the forward pass of the `TransformerBlock`:
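    * `attn_output = self.att(...)`: Applies multi-head self-attention, with the input serving as both query and value. In a GPT-style decoder this attention should be causal, meaning each position attends only to itself and earlier positions.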
    * `attn_output = self.dropout1(attn_output, training=training)`: Applies dropout during training.
    * `out1 = self.layernorm1(inputs + attn_output)`: Adds the attention output to the original input (residual connection) and applies layer normalization.
    * `ffn_output = self.ffn(out1)`: Applies the feed-forward network.
    * `ffn_output = self.dropout2(ffn_output, training=training)`: Applies dropout.
    * `return self.layernorm2(out1 + ffn_output)`: Adds the feed-forward output to the previous output (another residual connection) and applies layer normalization.



### **2. `SimplifiedGPT` class**


In [None]:
# Define the simplified GPT model
class SimplifiedGPT(tf.keras.Model):
    def __init__(self, vocab_size, embed_dim, num_heads, ff_dim, num_layers):
        super(SimplifiedGPT, self).__init__()
        self.embedding = tf.keras.layers.Embedding(vocab_size, embed_dim)
        self.pos_embedding = tf.keras.layers.Embedding(vocab_size, embed_dim)  # Simplified positional embedding
        self.transformer_blocks = [TransformerBlock(embed_dim, num_heads, ff_dim) for _ in range(num_layers)]
        self.dense = tf.keras.layers.Dense(vocab_size)

    def call(self, inputs, training):
        seq_len = tf.shape(inputs)[1]
        positions = tf.range(start=0, limit=seq_len, delta=1)
        x = self.embedding(inputs) + self.pos_embedding(positions) # Add positional embeddings
        for block in self.transformer_blocks:
            x = block(x, training=training)

        return self.dense(x)




* **`__init__(self, vocab_size, embed_dim, num_heads, ff_dim, num_layers)`:** The constructor initializes the model's layers:
    * `self.embedding = tf.keras.layers.Embedding(vocab_size, embed_dim)`: Creates an embedding layer to convert word indices into dense vectors.
    * `self.pos_embedding = tf.keras.layers.Embedding(vocab_size, embed_dim)`:  A learned positional embedding table, looked up by position index. Sizing it with `vocab_size` is a simplification; it only needs as many rows as the longest sequence the model will see.
    * `self.transformer_blocks = [...]`: Creates a list of `TransformerBlock` instances.
    * `self.dense = tf.keras.layers.Dense(vocab_size)`:  A final dense layer that outputs logits for each word in the vocabulary.

* **`call(self, inputs, training)`:**  Defines the forward pass of the model:
    * `seq_len = tf.shape(inputs)[1]`: Gets the sequence length.
    * `positions = tf.range(...)`: Creates a tensor of positional indices.
    * `x = self.embedding(inputs) + self.pos_embedding(positions)`:  Embeds the input and adds the positional embeddings.
    * `for block in self.transformer_blocks:`: Iterates through the transformer blocks, applying each one to the input.
    * `return self.dense(x)`: Applies the final dense layer to produce the output logits.




### **3. Example Usage:**

This section demonstrates how to use the `SimplifiedGPT` model for a simple next-word prediction task. We prepare sample input, compile the model, and train it. A real application would require actual data and a more careful training setup, so here we use some dummy data.


#### **Prepare the dataset**

In [None]:
# Example usage (Illustrative -  requires data and training setup)

import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Sample input (replace with actual data)
# Example sentences
texts = ["The quick brown fox jumps", "over the lazy dog"]

# Tokenize the text
tokenizer = Tokenizer(num_words=10000)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)

# Pad sequences to the same length
max_length = max(len(seq) for seq in sequences)
sequences = pad_sequences(sequences, maxlen=max_length, padding='post')

# Create input-output pairs
# For simplicity, we'll train for next-word prediction
input_sequences = sequences[:, :-1]  # all words except the last
output_sequences = sequences[:, 1:]  # all words except the first


* **Sample input:** It starts with two sample sentences: "The quick brown fox jumps" and "over the lazy dog". In a real application, you would replace this with your actual training data.
* **Tokenization:** A `Tokenizer` from Keras is used to convert the text into numerical sequences. The `num_words` parameter limits the vocabulary size to 10,000 words.
* **Padding:** Sequences are padded to have the same length using `pad_sequences`. This is necessary because the model expects inputs of uniform length.
* **Input-Output Pairs:** For next-word prediction, the input sequences are created by taking all words except the last, and the output sequences are created by taking all words except the first.
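
The shift-by-one construction of input-output pairs can be seen on a single sequence. The sketch below is plain Python; the helper `make_next_token_pairs` and the token ids are hypothetical and only illustrate the slicing used above.

```python
def make_next_token_pairs(token_ids):
    """Split one token sequence into model inputs and next-token targets."""
    inputs = token_ids[:-1]   # every token except the last
    targets = token_ids[1:]   # every token except the first
    return inputs, targets

# Hypothetical token ids standing in for "the quick brown fox jumps"
token_ids = [1, 2, 3, 4, 5]
inputs, targets = make_next_token_pairs(token_ids)

# At each position the model sees the context so far and must predict the target.
for position, target in enumerate(targets):
    context = inputs[: position + 1]
    print(f"context {context} -> target {target}")
```

Both lists have the same length, and the target at position `i` is simply the input token at position `i + 1`.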

#### **Compile the model**

In [None]:
vocab_size = 10000  # Ensure this matches your tokenizer's num_words
embed_dim = 256
num_heads = 8
ff_dim = 512
num_layers = 2

# Create the model
model = SimplifiedGPT(vocab_size, embed_dim, num_heads, ff_dim, num_layers)

# Compile the model
model.compile(
    optimizer='adam',
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy']
)



* **Model Creation:** An instance of the SimplifiedGPT class is created with specific hyperparameters (vocabulary size, embedding dimension, number of heads, feed-forward dimension, and number of layers).
* **Compilation:** The model is compiled using the adam optimizer, SparseCategoricalCrossentropy loss function (suitable for multi-class classification), and accuracy as the evaluation metric.
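
As a rough illustration of what the loss measures, the following pure-Python sketch computes the cross-entropy for a single position directly from logits. The function name and the logits are hypothetical; Keras additionally averages this quantity over all positions and examples in the batch.

```python
import math

def sparse_categorical_crossentropy_from_logits(logits, target_index):
    """Cross-entropy for one position: -log softmax(logits)[target_index]."""
    top = max(logits)
    # log-sum-exp with the maximum subtracted for numerical stability
    log_normalizer = top + math.log(sum(math.exp(x - top) for x in logits))
    return log_normalizer - logits[target_index]

# Hypothetical logits over a 4-word vocabulary for a single position
logits = [2.0, 0.5, 0.1, -1.0]

loss_likely = sparse_categorical_crossentropy_from_logits(logits, target_index=0)
loss_unlikely = sparse_categorical_crossentropy_from_logits(logits, target_index=3)
print(f"target is the top-scoring word:    {loss_likely:.3f}")
print(f"target is the lowest-scoring word: {loss_unlikely:.3f}")
```

The loss is small when the correct next word already has a high logit and large when the model ranks it low, which is what training pushes against.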

#### **Train the model**

In [None]:
# Convert to numpy array for fitting
input_sequences = np.array(input_sequences)
output_sequences = np.array(output_sequences)

# Fit the model
model.fit(input_sequences, output_sequences, epochs=10, batch_size=32)


* **Data Conversion:** The input and output sequences are converted to NumPy arrays for compatibility with the fit method.
* **Training:** The fit method is called to train the model. It receives the input and output sequences, along with the number of training epochs and batch size.


**Key points and simplifications:**

* **Simplified Training:** The training setup in this example is basic and serves as an illustration. It's important to remember that in real-world scenarios, you would typically use a larger dataset and more advanced training techniques.
* **Next-Word Prediction:** The goal of this example is to train the model to predict the next word in a sequence based on the preceding words.
* **Hyperparameters:** You can adjust the hyperparameters of the model and training process to tune its performance.


For real-world applications, consider using pre-trained models and established libraries like Hugging Face Transformers. They provide optimized implementations and access to pre-trained models, saving significant time and resources.


## Approach 2

This is a more practical and efficient approach to using GPT models. Using pre-trained models and the transformers library significantly simplifies the process and often leads to better results. Remember to adjust parameters like `max_length`, `num_beams`, `batch_size`, and the number of training `epochs` based on your specific needs and resources.

```python
from transformers import TFGPT2LMHeadModel, GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = TFGPT2LMHeadModel.from_pretrained('gpt2', pad_token_id=tokenizer.eos_token_id)
```




In [1]:
# !pip install transformers

In [1]:
from transformers import TFGPT2LMHeadModel, GPT2Tokenizer

# Load pre-trained model and tokenizer
model_name = 'gpt2'  # You can specify other GPT-2 variants like 'gpt2-medium', 'gpt2-large', etc.

tokenizer = GPT2Tokenizer.from_pretrained(model_name)

# add the EOS token as PAD token to avoid warnings
model = TFGPT2LMHeadModel.from_pretrained(model_name, pad_token_id=tokenizer.eos_token_id)

# Example input text
text = "Once upon a time, there was a large language model."

# Tokenize the input text
input_ids = tokenizer.encode(text, return_tensors='tf')

# Generate text
output = model.generate(input_ids, max_length=50, num_beams=5, no_repeat_ngram_size=2, early_stopping=True)  # Adjust parameters as needed





The `model.generate()` method does the heavy lifting. We pass the tokenized input and then:

* Set the maximum length of the generated text (prompt included) with `max_length`
* Use beam search via `num_beams` for better quality
* Prevent repetition of n-grams with `no_repeat_ngram_size`
* Stop generation once all beams are finished with `early_stopping`

In [2]:
# Decode the generated output
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)

print(generated_text)

Once upon a time, there was a large language model.

In the early 20th century, linguists began to understand how languages work. In the 1950s and 1960s, they developed a model of how language works. This model was


This example uses a pre-trained GPT-2 model to generate a text continuation from a given input prompt. It relies on the `transformers` library for model loading and text processing.

### **Greedy Search**

Greedy search simply selects the word with the highest probability as its next word: $w_t = \arg\max_{w} P(w | w_{1:t-1})$ at each timestep $t$. The following sketch shows greedy search.

![Greedy Search](https://raw.githubusercontent.com/patrickvonplaten/scientific_images/master/greedy_search.png)

Starting from the word $\text{"The"}$, the algorithm
greedily chooses the next word of highest probability $\text{"nice"}$ and so on, so that the final generated word sequence is $\text{"The", "nice", "woman"}$ having an overall probability of $0.5 \times 0.4 = 0.2$.
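
The same toy example can be traced in a few lines of Python. This is only a sketch over a hand-written probability table: the values for "nice" (0.5), "woman" (0.4), "dog" (0.4) and "has" (0.9) follow the text, while the remaining values and the names `NEXT_WORD_PROBS` and `greedy_search` are illustrative assumptions.

```python
# Toy next-word probabilities. Only the values quoted in the text are taken
# from it; the others are assumptions chosen so that each row sums to 1.
NEXT_WORD_PROBS = {
    ("The",): {"nice": 0.5, "dog": 0.4, "car": 0.1},
    ("The", "nice"): {"woman": 0.4, "house": 0.3, "guy": 0.3},
    ("The", "dog"): {"has": 0.9, "runs": 0.05, "and": 0.05},
    ("The", "car"): {"drives": 0.5, "is": 0.3, "turns": 0.2},
}

def greedy_search(start, steps):
    """Pick the single most likely next word at every step."""
    sequence, probability = tuple(start), 1.0
    for _ in range(steps):
        candidates = NEXT_WORD_PROBS[sequence]
        word = max(candidates, key=candidates.get)
        probability *= candidates[word]
        sequence += (word,)
    return sequence, probability

sequence, probability = greedy_search(["The"], steps=2)
print(sequence, round(probability, 2))  # ('The', 'nice', 'woman') 0.2
```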

In the following we will generate word sequences using GPT2 on the context $(\text{"I", "enjoy", "walking", "with", "my", "cute", "dog"})$. Let's see how greedy search can be used in `transformers` as follows:

In [3]:
# encode context the generation is conditioned on
input_ids = tokenizer.encode('I enjoy walking with my cute dog', return_tensors='tf')

# generate text until the output length (which includes the context length) reaches 50
greedy_output = model.generate(input_ids, max_length=50)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with my dog. I'm not sure if I'll ever be able to walk with my dog.

I'm not sure if I'll


Alright! We have generated our first short text with GPT2 😊. The generated words following the context are reasonable, but the model quickly starts repeating itself! This is a very common problem in language generation in general and seems to be even more so in greedy and beam search - check out [Vijayakumar et al., 2016](https://arxiv.org/abs/1610.02424) and [Shao et al., 2017](https://arxiv.org/abs/1701.03185).

The major drawback of greedy search though is that it misses high probability words hidden behind a low probability word as can be seen in our sketch above:

The word $\text{"has"}$ with its high conditional probability of $0.9$ is hidden behind the word $\text{"dog"}$, which has only the second-highest conditional probability, so that greedy search misses the word sequence $\text{"The"}, \text{"dog"}, \text{"has"}$.

Thankfully, we have beam search to alleviate this problem!


### **Beam search**

Beam search reduces the risk of missing hidden high probability word sequences by keeping the most likely `num_beams` of hypotheses at each time step and eventually choosing the hypothesis that has the overall highest probability. Let's illustrate with `num_beams=2`:

![Beam search](https://raw.githubusercontent.com/patrickvonplaten/scientific_images/master/beam_search.png)

At time step $1$, besides the most likely hypothesis $\text{"The", "nice"}$, beam search also keeps track of the second most likely one $\text{"The", "dog"}$. At time step $2$, beam search finds that the word sequence $\text{"The", "dog", "has"}$, with a probability of $0.4 \times 0.9 = 0.36$, is more likely than $\text{"The", "nice", "woman"}$, which has $0.2$. Great, it has found the most likely word sequence in our toy example!

Beam search will generally find an output sequence with higher probability than greedy search, but it is not guaranteed to find the most likely output.
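
A minimal beam search over the toy example might look like the sketch below. It is not how `transformers` implements beam search (there is no EOS handling or length normalization here). The probabilities for "nice", "woman", "dog" and "has" follow the text; the other values and the names `NEXT_WORD_PROBS` and `beam_search` are illustrative assumptions.

```python
# Toy next-word probabilities. Only the values quoted in the text are taken
# from it; the others are assumptions chosen so that each row sums to 1.
NEXT_WORD_PROBS = {
    ("The",): {"nice": 0.5, "dog": 0.4, "car": 0.1},
    ("The", "nice"): {"woman": 0.4, "house": 0.3, "guy": 0.3},
    ("The", "dog"): {"has": 0.9, "runs": 0.05, "and": 0.05},
    ("The", "car"): {"drives": 0.5, "is": 0.3, "turns": 0.2},
}

def beam_search(start, steps, num_beams):
    """Keep the `num_beams` most likely partial sequences at every step."""
    beams = [(tuple(start), 1.0)]
    for _ in range(steps):
        expanded = []
        for sequence, probability in beams:
            for word, p in NEXT_WORD_PROBS[sequence].items():
                expanded.append((sequence + (word,), probability * p))
        # Sort all extensions by probability and keep only the best ones.
        expanded.sort(key=lambda item: item[1], reverse=True)
        beams = expanded[:num_beams]
    return beams

best_sequence, best_probability = beam_search(["The"], steps=2, num_beams=2)[0]
print(best_sequence, round(best_probability, 2))  # ('The', 'dog', 'has') 0.36

# With a single beam the procedure reduces to greedy search.
greedy_sequence, greedy_probability = beam_search(["The"], steps=2, num_beams=1)[0]
print(greedy_sequence, round(greedy_probability, 2))  # ('The', 'nice', 'woman') 0.2
```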

Let's see how beam search can be used in `transformers`. We set `num_beams > 1` and `early_stopping=True` so that generation is finished when all beam hypotheses have reached the EOS token.

In [4]:
# activate beam search and early_stopping
beam_output = model.generate(
    input_ids,
    max_length=50,
    num_beams=5,
    early_stopping=True
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.

I'm not sure if I'll ever be able to walk with him again. I'm not sure if I'll


While the result is arguably more fluent, the output still includes repetitions of the same word sequences.  
A simple remedy is to introduce *n-grams* (*a.k.a* word sequences of $n$ words) penalties as introduced by [Paulus et al. (2017)](https://arxiv.org/abs/1705.04304) and [Klein et al. (2017)](https://arxiv.org/abs/1701.02810). The most common *n-grams* penalty makes sure that no *n-gram* appears twice by manually setting the probability of next words that could create an already seen *n-gram* to $0$.
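
The idea behind the penalty can be sketched in plain Python. The helper names `banned_next_tokens` and `apply_ngram_penalty` and the example probabilities are hypothetical; this is only an illustration of the principle, not the library's implementation.

```python
def banned_next_tokens(generated, ngram_size):
    """Return the tokens that would complete an n-gram already present in `generated`."""
    if ngram_size < 1 or len(generated) < ngram_size:
        return set()
    # The last (n - 1) tokens are the prefix of the n-gram about to be completed.
    prefix = tuple(generated[len(generated) - (ngram_size - 1):])
    banned = set()
    for start in range(len(generated) - ngram_size + 1):
        ngram = tuple(generated[start:start + ngram_size])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned

def apply_ngram_penalty(probs, generated, ngram_size):
    """Set the probability of every banned next token to 0."""
    banned = banned_next_tokens(generated, ngram_size)
    return {word: (0.0 if word in banned else p) for word, p in probs.items()}

generated = ["I", "walk", "my", "dog", "and", "I"]
# Hypothetical next-word probabilities after the final "I"
probs = {"walk": 0.6, "run": 0.3, "sleep": 0.1}

penalized = apply_ngram_penalty(probs, generated, ngram_size=2)
print(penalized)  # {'walk': 0.0, 'run': 0.3, 'sleep': 0.1}
```

Because the 2-gram "I walk" has already been generated, "walk" is ruled out after the second "I", while the other candidates keep their probabilities.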

Let's try it out by setting `no_repeat_ngram_size=2` so that no *2-gram* appears twice:

In [5]:
# set no_repeat_ngram_size to 2
beam_output = model.generate(
    input_ids,
    max_length=50,
    num_beams=5,
    no_repeat_ngram_size=2,
    early_stopping=True
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.

I've been thinking about this for a while now, and I think it's time for me to take a break


Nice, that looks much better! We can see that the repetition does not appear anymore. Nevertheless, *n-gram* penalties have to be used with care. An article generated about the city *New York* should not use a *2-gram* penalty or otherwise, the name of the city would only appear once in the whole text!

Another important feature about beam search is that we can compare the top beams after generation and choose the generated beam that fits our purpose best.

In `transformers`, we simply set the parameter `num_return_sequences` to the number of highest scoring beams that should be returned. Make sure though that `num_return_sequences <= num_beams`!

In [6]:
# set num_return_sequences > 1
beam_outputs = model.generate(
    input_ids,
    max_length=50,
    num_beams=5,
    no_repeat_ngram_size=2,
    num_return_sequences=5,
    early_stopping=True
)

# now we have 5 output sequences
print("Output:\n" + 100 * '-')
for i, beam_output in enumerate(beam_outputs):
    print("{}: {}".format(i, tokenizer.decode(beam_output, skip_special_tokens=True)))

Output:
----------------------------------------------------------------------------------------------------
0: I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.

I've been thinking about this for a while now, and I think it's time for me to take a break
1: I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.

I've been thinking about this for a while now, and I think it's time for me to get back to
2: I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with her again.

I've been thinking about this for a while now, and I think it's time for me to take a break
3: I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with her again.

I've been thinking about this for a while now, and I think it's time for me to get back to
4: I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.

I've been thinking about 

As can be seen, the five beam hypotheses are only marginally different to each other - which should not be too surprising when using only 5 beams.

In open-ended generation, a couple of reasons have recently been brought forward why beam search might not be the best possible option:

- Beam search can work very well in tasks where the length of the desired generation is more or less predictable as in machine translation or summarization - see [Murray et al. (2018)](https://arxiv.org/abs/1808.10006) and [Yang et al. (2018)](https://arxiv.org/abs/1808.09582). But this is not the case for open-ended generation where the desired output length can vary greatly, e.g. dialog and story generation.

- We have seen that beam search heavily suffers from repetitive generation. This is especially hard to control with *n-gram*- or other penalties in story generation since finding a good trade-off between forced "no-repetition" and repeating cycles of identical *n-grams* requires a lot of finetuning.

- As argued in [Ari Holtzman et al. (2019)](https://arxiv.org/abs/1904.09751), high quality human language does not follow a distribution of high probability next words. In other words, as humans, we want generated text to surprise us and not to be boring/predictable. The authors show this nicely by plotting the probability a model would give to human text vs. what beam search does.

![alt text](https://blog.fastforwardlabs.com/images/2019/05/Screen_Shot_2019_05_08_at_3_06_36_PM-1557342561886.png)


So let's stop being boring and introduce some randomness 🤪.

### **Sampling**

In its most basic form, sampling means randomly picking the next word $w_t$ according to its conditional probability distribution:

$$w_t \sim P(w|w_{1:t-1})$$

Taking the example from above, the following graphic visualizes language generation when sampling.

![vanilla_sampling](https://raw.githubusercontent.com/patrickvonplaten/scientific_images/master/sampling_search.png)

It becomes obvious that language generation using sampling is not *deterministic* anymore. The word
$\text{"car"}$ is sampled from the conditioned probability distribution $P(w | \text{"The"})$, followed by sampling $\text{"drives"}$ from $P(w | \text{"The"}, \text{"car"})$.
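
Basic sampling can be sketched with the standard library alone. The distribution below reuses the toy probabilities for the word after "The"; the value for "car" and the helper name `sample_next_word` are illustrative assumptions.

```python
import random

def sample_next_word(probs, rng):
    """Draw one word at random, weighted by its probability."""
    words = list(probs)
    weights = [probs[word] for word in words]
    return rng.choices(words, weights=weights, k=1)[0]

rng = random.Random(0)  # fixed seed so repeated runs give the same draws
probs = {"nice": 0.5, "dog": 0.4, "car": 0.1}

samples = [sample_next_word(probs, rng) for _ in range(1000)]
counts = {word: samples.count(word) for word in probs}
print(counts)
```

Over many draws the counts roughly track the probabilities, but any single draw can pick a low-probability word such as "car", which is why the output is no longer deterministic.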

In `transformers`, we set `do_sample=True` and deactivate *Top-K* sampling (more on this later) via `top_k=0`. In the following, we fix the random seed with `tf.random.set_seed(0)` for illustration purposes. Feel free to change the seed to play around with the model.


In [8]:
import tensorflow as tf

In [9]:
# set seed to reproduce results. Feel free to change the seed though to get different results
tf.random.set_seed(0)

# activate sampling and deactivate top_k by setting top_k sampling to 0
sample_output = model.generate(
    input_ids,
    do_sample=True,
    max_length=50,
    top_k=0
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog Reyn: on the platform in a bowling alley in Brooklyn, mint one," the partner admitted, laughing at the furred hair pounced on her collar. "Really? I just sat down and watched the rainbow bear


Interesting! The text seems alright at first glance, but on closer inspection it is not very coherent. Phrases such as *mint one* and *furred hair pounced on her collar* are very weird and don't sound like they were written by a human. That is the big problem when sampling word sequences: the models often generate incoherent gibberish, *cf.* [Ari Holtzman et al. (2019)](https://arxiv.org/abs/1904.09751).

A trick is to make the distribution $P(w|w_{1:t-1})$ sharper (increasing the likelihood of high probability words and decreasing the likelihood of low probability words) by lowering the so-called `temperature` of the [softmax](https://en.wikipedia.org/wiki/Softmax_function#Smooth_arg_max).

An illustration of applying temperature to our example from above could look as follows.

![top_p_sampling](https://github.com/patrickvonplaten/scientific_images/blob/master/sampling_search_with_temp.png?raw=true)

The conditional next word distribution of step $t=1$ becomes much sharper leaving almost no chance for word $\text{"car"}$ to be selected.
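
The effect of temperature can be checked numerically with a small sketch. The function name `softmax_with_temperature` is hypothetical, and the logits are simply the logarithms of the toy probabilities, so that `temperature=1.0` recovers them.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax of logits divided by the temperature (must be > 0)."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Logits chosen so that temperature 1.0 yields the probabilities 0.5, 0.4, 0.1
logits = [math.log(0.5), math.log(0.4), math.log(0.1)]

for temperature in (1.0, 0.7, 0.1):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

Lowering the temperature moves probability mass from the least likely word toward the most likely one, and very small temperatures put nearly all mass on the top word.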


Let's see how we can cool down the distribution in the library by setting `temperature=0.7`:

In [10]:
# set seed to reproduce results. Feel free to change the seed though to get different results
tf.random.set_seed(0)

# use temperature to decrease the sensitivity to low probability candidates
sample_output = model.generate(
    input_ids,
    do_sample=True,
    max_length=50,
    top_k=0,
    temperature=0.7
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog and I like having a nice view of the world, including the one I'm walking through, but at the same time, I've got a little bit of an anxiety about it. And I'm trying to be


OK. There are fewer weird n-grams and the output is a bit more coherent now! While applying temperature can make a distribution less random, in the limit `temperature` $\to 0$, temperature-scaled sampling becomes equal to greedy decoding and will suffer from the same problems as before.



### **Top-K Sampling**

[Fan et al. (2018)](https://arxiv.org/pdf/1805.04833.pdf) introduced a simple, but very powerful sampling scheme, called ***Top-K*** sampling. In *Top-K* sampling, the *K* most likely next words are filtered and the probability mass is redistributed among only those *K* next words.
GPT2 adopted this sampling scheme, which was one of the reasons for its success in story generation.

We extend the range of words used for both sampling steps in the example above from 3 words to 10 words to better illustrate *Top-K* sampling.

![top_k_sampling](https://raw.githubusercontent.com/patrickvonplaten/scientific_images/master/top_k_sampling.png)

Having set $K = 6$, in both sampling steps we limit our sampling pool to 6 words. While the 6 most likely words, defined as $V_{\text{top-K}}$ encompass only *ca.* two-thirds of the whole probability mass in the first step, it includes almost all of the probability mass in the second step. Nevertheless, we see that it successfully eliminates the rather weird candidates $\text{"not", "the", "small", "told"}$
in the second sampling step.
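
The filtering step itself is short. Below is a minimal sketch with a hypothetical helper `top_k_filter` and made-up probabilities; sampling would then draw from the renormalized result.

```python
def top_k_filter(probs, k):
    """Keep the k most likely words and renormalize their probabilities."""
    kept = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {word: p / total for word, p in kept}

# Hypothetical next-word distribution
probs = {"nice": 0.4, "dog": 0.3, "car": 0.15, "house": 0.1, "told": 0.05}

filtered = top_k_filter(probs, k=3)
print({word: round(p, 3) for word, p in filtered.items()})
```

The two least likely words are removed, and the remaining three probabilities are scaled up so that they sum to 1 again.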


Let's see how *Top-K* can be used in the library by setting `top_k=50`:

In [11]:
# set seed to reproduce results. Feel free to change the seed though to get different results
tf.random.set_seed(0)

# set top_k to 50
sample_output = model.generate(
    input_ids,
    do_sample=True,
    max_length=50,
    top_k=50
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog and I miss having to walk with my dog in the middle of one of the city's busiest hours at dusk. I was also inspired by the idea of creating an interactive map of the city but I think they really


Not bad at all! The text is arguably the most *human-sounding* text so far.
One concern though with *Top-K* sampling is that it does not dynamically adapt the number of words that are filtered from the next word probability distribution $P(w|w_{1:t-1})$.
This can be problematic as some words might be sampled from a very sharp distribution (distribution on the right in the graph above), whereas others are sampled from a much flatter distribution (distribution on the left in the graph above).

In step $t=1$, *Top-K* eliminates the possibility to sample $\text{"people", "big", "house", "cat"}$, which seem like reasonable candidates. On the other hand, in step $t=2$ the method includes the arguably ill-fitted words $\text{"down", "a"}$ in the sample pool of words. Thus, limiting the sample pool to a fixed size *K* risks the model producing gibberish for sharp distributions and limits the model's creativity for flat distributions.
This intuition led [Ari Holtzman et al. (2019)](https://arxiv.org/abs/1904.09751) to create ***Top-p***- or ***nucleus***-sampling.



### **Top-p (nucleus) sampling**

Instead of sampling only from the most likely *K* words, *Top-p* sampling chooses from the smallest possible set of words whose cumulative probability exceeds the probability *p*. The probability mass is then redistributed among this set of words. This way, the size of the set of words (*a.k.a.* the number of words in the set) can dynamically increase and decrease according to the next word's probability distribution. OK, that was very wordy, let's visualize.

![top_p_sampling](https://github.com/patrickvonplaten/scientific_images/blob/master/top_p_sampling.png?raw=true)

Having set $p=0.92$, *Top-p* sampling picks the *minimum* number of words that together exceed $p=92\%$ of the probability mass, defined as $V_{\text{top-p}}$. In the first example, this includes the 9 most likely words, whereas it only has to pick the top 3 words in the second example to exceed 92%. Quite simple actually! It can be seen that it keeps a wide range of words where the next word is arguably less predictable, *e.g.* $P(w | \text{"The"})$, and only a few words when the next word seems more predictable, *e.g.* $P(w | \text{"The", "car"})$.
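
The same selection rule can be written as a short sketch in plain Python. The function name `top_p_filter` is hypothetical and both distributions are made up (they are not the figure's values); they are only chosen so that the nucleus contains 9 words in the flat case and 3 words in the sharp case.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of most likely words whose cumulative
    probability exceeds p, then renormalize their probabilities."""
    if not 0 < p < 1:
        raise ValueError("p must lie strictly between 0 and 1")
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept = []
    cumulative = 0.0
    for word, prob in ranked:
        kept.append((word, prob))
        cumulative += prob
        if cumulative > p:  # stop as soon as the nucleus exceeds p
            break
    total = sum(prob for _, prob in kept)
    return {word: prob / total for word, prob in kept}

# Made-up flat distribution: many plausible next words
flat = {
    "nice": 0.24, "dog": 0.16, "car": 0.12, "woman": 0.10, "guy": 0.09,
    "man": 0.08, "people": 0.07, "big": 0.05, "house": 0.05, "cat": 0.04,
}
# Made-up sharp distribution: a few words dominate
sharp = {
    "drives": 0.55, "is": 0.25, "turns": 0.15, "stops": 0.02,
    "down": 0.01, "a": 0.01, "not": 0.005, "the": 0.005,
}

print(len(top_p_filter(flat, 0.92)), "words kept for the flat distribution")
print(len(top_p_filter(sharp, 0.92)), "words kept for the sharp distribution")
```

Unlike a fixed *K*, the size of the pool here follows the shape of the distribution: the same `p` keeps many words when the model is uncertain and only a few when it is confident.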

Alright, time to check it out in `transformers`!
We activate *Top-p* sampling by setting `0 < top_p < 1`:

In [12]:
# set seed to reproduce results. Feel free to change the seed though to get different results
tf.random.set_seed(0)

# deactivate top_k sampling and sample only from 92% most likely words
sample_output = model.generate(
    input_ids,
    do_sample=True,
    max_length=50,
    top_p=0.92,
    top_k=0
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog and how on Earth this actually works, although I was reminded of one particular episode where Captain America comes to Vancouver Island to visit Agent Coulson at his house, for $1.50 to get him to act kind


Great, that sounds like it could have been written by a human. Well, maybe not quite yet.

While in theory *Top-p* seems more elegant than *Top-K*, both methods work well in practice. *Top-p* can also be used in combination with *Top-K*, which can avoid very low-ranked words while allowing for some dynamic selection.

Finally, to get multiple independently sampled outputs, we can *again* set the parameter `num_return_sequences > 1`:

In [13]:
# set seed to reproduce results. Feel free to change the seed though to get different results
tf.random.set_seed(0)

# set top_k = 50 and set top_p = 0.95 and num_return_sequences = 3
sample_outputs = model.generate(
    input_ids,
    do_sample=True,
    max_length=50,
    top_k=50,
    top_p=0.95,
    num_return_sequences=3
)

print("Output:\n" + 100 * '-')
for i, sample_output in enumerate(sample_outputs):
  print("{}: {}".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))

Output:
----------------------------------------------------------------------------------------------------
0: I enjoy walking with my cute dog and I like having a fun time, so I was going to stop in the restaurant and get some coffee."

The second person I saw who didn't wear makeup at the restaurant was her ex-boyfriend
1: I enjoy walking with my cute dog, and the time you spent with my cats was wonderful and truly worth it. I also found it really funny that my dog gets all upset and irritable before he goes to bed, and is even more upset when
2: I enjoy walking with my cute dog, so it has been such an inspiration to me in her life."

Morton's story was shared on social media.

The mother-of-two said it was about the "trem



Cool, now you should have all the tools to let your model write your stories with `transformers`!

### **Conclusion**

As *ad-hoc* decoding methods, *top-p* and *top-K* sampling seem to produce more fluent text than traditional *greedy* and *beam* search on open-ended language generation.
Recently, though, there has been more evidence that the apparent flaws of *greedy* and *beam* search, mainly generating repetitive word sequences, are caused by the model (especially the way the model is trained), rather than the decoding method, *cf.* [Welleck et al. (2019)](https://arxiv.org/pdf/1908.04319.pdf). Also, as demonstrated in [Welleck et al. (2020)](https://arxiv.org/abs/2002.02492), it looks as if *top-K* and *top-p* sampling also suffer from generating repetitive word sequences.

In [Welleck et al. (2019)](https://arxiv.org/pdf/1908.04319.pdf), the authors show that according to human evaluations, *beam* search can generate more fluent text than *Top-p* sampling, when adapting the model's training objective.

Open-ended language generation is a rapidly evolving field of research and, as is often the case, there is no one-size-fits-all method here, so one has to see what works best in one's specific use case.

Good thing that *you* can try out all the different decoding methods in `transformers` 🤗.

That was a short introduction on how to use different decoding methods in `transformers` and recent trends in open-ended language generation.

Feedback and questions are very welcome on the [GitHub repository](https://github.com/huggingface/transformers).

For more fun generating stories, please take a look at [Writing with Transformers](https://transformer.huggingface.co).