
##Text Generation

One of the most uncanny features of transformer-based language models is
their ability to generate text that is almost indistinguishable from text written
by humans.


As we’ve seen, for task-specific
heads like sequence or token classification, generating predictions is fairly
straightforward; the model produces some logits and we either take the
maximum value to get the predicted class, or apply a softmax function to
obtain the predicted probabilities per class.

By contrast, converting the
model’s probabilistic output to text requires a decoding method, which
introduces a few challenges that are unique to text generation:

* The decoding is done iteratively and thus involves significantly
more compute than simply passing inputs once through the forward
pass of a model.

* The quality and diversity of the generated text depend on the choice
of decoding method and associated hyperparameters.

To understand how this decoding process works, let’s start by examining
how GPT-2 is pretrained and subsequently applied to generate text.

Like other autoregressive or causal language models, GPT-2 is pretrained
to estimate the probability $P(y|x)$ of a sequence of tokens $y=y_1,y_2,...,y_t$ occurring in the text, given some initial prompt or context sequence $x=x_1,x_2,...,x_k$.

Since it is impractical to acquire enough training data to
estimate $P(y|x)$ directly, it is common to use the chain rule of probability to factorize it as a product of conditional probabilities:

$$ P(y_1,y_2,...,y_t|x) = \prod_{i=1}^{t} P(y_i|y_{<i}, x)$$

where $y_{<i}$ is shorthand for the sequence $y_1,...,y_{i-1}$.

It is from
these conditional probabilities that we pick up the intuition that
autoregressive language modeling amounts to predicting each word given the
preceding words in a sentence; this is exactly what the probability on the
right-hand side of the preceding equation describes.
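
As a small worked example of this factorization, the sketch below multiplies a handful of conditional probabilities to obtain the probability of a whole continuation. The probability values are made up for illustration and do not come from GPT-2.

```python
import math

# Hypothetical values of P(next token | previous tokens, prompt) for a
# three-token continuation; the numbers are invented for illustration.
conditional_probs = [0.5, 0.2, 0.8]

# Chain rule: the sequence probability is the product of the conditionals.
sequence_prob = math.prod(conditional_probs)

# Summing log-probabilities gives the same result and avoids numerical
# underflow when sequences get long.
sequence_log_prob = sum(math.log(p) for p in conditional_probs)

print(f"P(y|x)     = {sequence_prob:.4f}")
print(f"log P(y|x) = {sequence_log_prob:.4f}")
```

Working in log space is the usual choice in practice, because a product of many probabilities below one quickly becomes too small to represent accurately.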

Notice that this
pretraining objective is quite different from BERT’s, which utilizes both past
and future contexts to predict a masked token.

<img alt="Text generation" width="700" caption="Generating text from an input sequence by adding a new word to the input at each step" src="https://github.com/rahiakela/transformers-research-and-practice/blob/main/natural-language-processing-with-transformers/05-text-generation/images/text-generation.png?raw=1" id="text-generation"/> 


As shown, we start with a prompt like "Transformers are the" and use the model to
predict the next token. Once we have determined the next token, we append it
to the prompt and then use the new input sequence to generate another token.
We do this until we have reached a special end-of-sequence token or a
predefined maximum length.
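
The loop described here can be sketched without a real model. In the minimal example below, the lookup table `NEXT_TOKEN` and the function `generate` are hypothetical stand-ins: the table only looks at the last token, whereas a real language model conditions on the whole sequence.

```python
# Hypothetical next-token table standing in for a language model. It maps the
# last token to the next one; a real model conditions on the full sequence.
NEXT_TOKEN = {
    "the": "most",
    "most": "common",
    "common": "type",
    "type": "<eos>",
}


def generate(prompt_tokens, max_length=8, eos_token="<eos>"):
    """Append one predicted token at a time until EOS or max_length."""
    tokens = list(prompt_tokens)
    while len(tokens) < max_length:
        # Unknown tokens fall back to EOS so the toy loop always terminates.
        next_token = NEXT_TOKEN.get(tokens[-1], eos_token)
        if next_token == eos_token:
            break
        tokens.append(next_token)
    return tokens


print(generate(["Transformers", "are", "the"]))
print(generate(["Transformers", "are", "the"], max_length=4))
```

The two stopping conditions in the loop mirror the ones in the text: an end-of-sequence token or a predefined maximum length.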

>Since the output sequence is conditioned on the choice of input prompt, this type of text generation is often called conditional text generation.

At the heart of this process lies a decoding method that determines which
token is selected at each timestep.

Since the language model head produces a logit $z_{t,i}$ per token in the vocabulary at each step, we can get the probability
distribution over the next possible token $w_i$ by taking the softmax:

$$ P(y_t=w_i|y_{<t},x) = \mathrm{softmax}(z_{t,i}) $$
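
To make the softmax step concrete, here is a minimal sketch in plain Python. The four-token vocabulary and the logit values are invented for illustration; in the notebook, the logits come from the language model head.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to one."""
    # Subtracting the maximum logit keeps exp() numerically stable.
    max_logit = max(logits)
    exps = [math.exp(z - max_logit) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for a toy four-token vocabulary.
vocab = ["most", "same", "only", "best"]
logits = [2.0, 1.0, 0.5, 0.1]
probs = softmax(logits)

for token, prob in zip(vocab, probs):
    print(f"{token}: {prob:.3f}")
```

The softmax preserves the ordering of the logits, so the token with the largest logit is also the most probable one.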

The goal of most decoding methods is to search for the most likely overall
sequence by picking a $\hat y$ such that:

$$ \hat y = \underset{y}{\operatorname{argmax}}\, P(y|x) $$

Finding $\hat y$ directly would involve evaluating every possible sequence with the language model. Because the number of candidate sequences grows exponentially with the sequence length, no algorithm can do this in a reasonable amount of time, so we rely on approximations instead.
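
A quick calculation shows why exhaustive search is out of reach. The sketch below counts the candidate sequences for GPT-2's vocabulary of 50,257 tokens; the helper name `count_sequences` is hypothetical.

```python
def count_sequences(vocab_size, n_tokens):
    """Number of distinct token sequences of length n_tokens."""
    return vocab_size ** n_tokens


# GPT-2's vocabulary contains 50,257 tokens.
vocab_size = 50_257

for n_tokens in (1, 2, 4, 8):
    n_sequences = count_sequences(vocab_size, n_tokens)
    print(f"{n_tokens} tokens: {n_sequences:.2e} candidate sequences")
```

Even for a continuation of only eight tokens, scoring every candidate with a forward pass is far beyond any realistic compute budget.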

##Setup

In [None]:
!pip -q install transformers

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.functional import cross_entropy

from transformers import AutoTokenizer, AutoModelForCausalLM

In [2]:
device = "cuda" if torch.cuda.is_available() else "cpu"

##Greedy Search Decoding

The simplest decoding method to get discrete tokens from a model’s
continuous output is to greedily select the token with the highest probability
at each timestep:

$$ \hat y_t = \underset{y_t}{\operatorname{argmax}}\, P(y_t|y_{<t}, x) $$

To see how greedy search works, let’s start by loading `GPT-2` with a language modeling head. The 1.5-billion-parameter version is `gpt2-xl`; to fit in limited RAM, we load the smaller `gpt2` checkpoint instead:

In [5]:
# model_name = "gpt2-xl"  # 1.5-billion-parameter checkpoint
model_name = "gpt2"  # smaller checkpoint, loaded here because of RAM limits

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)

Now let’s generate some text! 

Although Transformers provides a
`generate()` function for autoregressive models like `GPT-2`, we’ll implement
this decoding method ourselves to see what goes on under the hood.

We’ll use `Transformers are the` as the input prompt and run the decoding for eight timesteps.

At each timestep, we pick out the model’s logits for the last token
in the prompt and wrap them with a softmax to get a probability distribution.
We then pick the next token with the highest probability, add it to the input
sequence, and run the process again.

In [6]:
input_text = "Transformers are the"
input_ids = tokenizer(input_text, return_tensors="pt")["input_ids"].to(device)
iterations = []
n_steps = 8
choices_per_step = 5

with torch.no_grad():
  for _ in range(n_steps):
    iteration = dict()
    iteration["Input"] = tokenizer.decode(input_ids[0])
    output = model(input_ids=input_ids)

    # Select the logits of the last token for the first sequence in the batch and apply softmax
    next_token_logits = output.logits[0, -1, :]
    next_token_probs = torch.softmax(next_token_logits, dim=-1)
    sorted_ids = torch.argsort(next_token_probs, dim=-1, descending=True)

    # Store tokens with highest probabilities
    for choice_idx in range(choices_per_step):
      token_id = sorted_ids[choice_idx]
      token_prob = next_token_probs[token_id].cpu().numpy()
      token_choice = (f"{tokenizer.decode(token_id)} ({100 * token_prob:.2f}%)")
      iteration[f"Choice {choice_idx + 1}"] = token_choice
    
    # Append predicted next token to input
    input_ids = torch.cat([input_ids, sorted_ids[None, 0, None]], dim=-1)
    iterations.append(iteration)

pd.DataFrame(iterations)

| | Input | Choice 1 | Choice 2 | Choice 3 | Choice 4 | Choice 5 |
|---|---|---|---|---|---|---|
| 0 | Transformers are the | most (9.76%) | same (2.94%) | only (2.87%) | best (2.38%) | first (1.77%) |
| 1 | Transformers are the most | common (22.90%) | powerful (6.88%) | important (6.32%) | popular (3.95%) | commonly (2.14%) |
| 2 | Transformers are the most common | type (15.06%) | types (3.31%) | form (1.91%) | way (1.89%) | and (1.49%) |
| 3 | Transformers are the most common type | of (83.13%) | in (3.16%) | . (1.92%) | , (1.63%) | for (0.88%) |
| 4 | Transformers are the most common type of | particle (1.55%) | object (1.02%) | light (0.71%) | energy (0.67%) | objects (0.66%) |
| 5 | Transformers are the most common type of particle | . (14.26%) | in (11.57%) | that (10.18%) | , (9.57%) | accelerator (5.81%) |
| 6 | Transformers are the most common type of parti... | They (17.48%) | \n (15.19%) | The (7.06%) | These (3.09%) | In (3.07%) |
| 7 | Transformers are the most common type of parti... | are (38.77%) | have (8.14%) | can (7.99%) | 're (5.04%) | consist (1.57%) |


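With this simple method we generated the sentence "Transformers are the most common type of particle. They are".
We can also see the other possible continuations
at each step, which shows the iterative nature of text generation.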

Unlike other
tasks such as sequence classification where a single forward pass suffices to
generate the predictions, with text generation we need to decode the output
tokens one at a time.

Implementing greedy search wasn’t too hard, but we’ll want to use the built-in
`generate()` function from Transformers to explore more sophisticated
decoding methods.

In [None]:
input_ids = tokenizer(input_text, return_tensors="pt")["input_ids"].to(device)
# specify the max_new_tokens for the number of newly generated tokens
output = model.generate(input_ids, max_new_tokens=n_steps, do_sample=False)

In [9]:
print(tokenizer.decode(output[0]))

Transformers are the most common type of particle. They are


Now let’s try something a bit more interesting.

In [12]:
max_length = 128
input_txt = """
In a shocking finding, scientist discovered \
a herd of unicorns living in a remote, previously unexplored \
valley, in the Andes Mountains. Even more surprising to the \
researchers was the fact that the unicorns spoke perfect English.\n\n
"""

input_ids = tokenizer(input_txt, return_tensors="pt")["input_ids"].to(device)
output_greedy = model.generate(input_ids, max_length=max_length, do_sample=False)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [13]:
print(tokenizer.decode(output_greedy[0]))


In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.


The researchers, from the University of California, Berkeley, and the University of California, Santa Cruz, found that the unicorns were able to communicate with each other through their tongues.


"This is a very interesting finding," said lead author Dr. David J. Karp, a professor of linguistics at the University of California, Berkeley. "It's a very interesting finding that we can


We can see one of the main drawbacks with greedy search decoding: it
tends to produce repetitive output sequences, which is certainly undesirable
in a news article. 
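
One rough way to quantify this repetition is to count word n-grams that occur more than once. The helper `repeated_ngrams` below is a hypothetical illustration, not part of Transformers, and the sample text is the last paragraph of the greedy output.

```python
from collections import Counter


def repeated_ngrams(text, n=3):
    """Return the word n-grams that occur more than once in text."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    counts = Counter(ngrams)
    return {" ".join(gram): count for gram, count in counts.items() if count > 1}


# Last paragraph of the greedy output above.
sample = (
    '"This is a very interesting finding," said lead author Dr. David J. Karp, '
    "a professor of linguistics at the University of California, Berkeley. "
    "\"It's a very interesting finding that we can"
)

print(repeated_ngrams(sample))
```

Splitting on whitespace keeps punctuation attached to words, so this count is conservative; a tokenizer-based version would catch more repeats.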

This is a common problem with greedy search algorithms,
which can fail to give you the optimal solution; in the context of decoding,
they can miss word sequences whose overall probability is higher just
because high-probability words happen to be preceded by low-probability
ones.
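
A toy example makes this failure mode concrete. In the sketch below, the two-step probability tables are made up so that the most likely first token leads to a worse overall sequence; exhaustive search is only feasible here because the example is tiny.

```python
import itertools

# Hypothetical two-step model: P(first token) and P(second token | first token).
# The numbers are invented so that the greedy choice is suboptimal.
first_probs = {"A": 0.6, "B": 0.4}
second_probs = {
    "A": {"x": 0.55, "y": 0.45},
    "B": {"x": 0.9, "y": 0.1},
}


def sequence_prob(first, second):
    """Joint probability of a two-token sequence under the toy model."""
    return first_probs[first] * second_probs[first][second]


# Greedy search: pick the most likely token at each step.
greedy_first = max(first_probs, key=first_probs.get)
greedy_second = max(second_probs[greedy_first], key=second_probs[greedy_first].get)
greedy = (greedy_first, greedy_second)

# Exhaustive search: score every possible two-token sequence.
candidates = itertools.product(first_probs, ["x", "y"])
best = max(candidates, key=lambda seq: sequence_prob(*seq))

print("greedy:", greedy, round(sequence_prob(*greedy), 2))
print("best:  ", best, round(sequence_prob(*best), 2))
```

Greedy search commits to `A` because it is the likelier first token, and so never considers the sequence starting with `B`, whose second step is almost certain.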

Although greedy search decoding is rarely used for text generation tasks that require diversity, it can be useful for producing short sequences like arithmetic where a deterministic and factually correct output is preferred.

For these tasks, you can condition `GPT-2` by providing a few line-separated examples in the format `5 + 8 => 13 \n 7 + 2 => 9 \n 1 + 0 =>` as the input prompt.
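
Such a prompt can also be assembled programmatically. The helper `build_arithmetic_prompt` below is a hypothetical sketch that formats solved examples and leaves the final query unanswered for the model to complete.

```python
def build_arithmetic_prompt(examples, query):
    """Format solved additions plus one unsolved query as a few-shot prompt."""
    solved = [f"{a} + {b} => {a + b}" for a, b in examples]
    a, b = query
    unsolved = f"{a} + {b} =>"
    # Examples are separated by a newline padded with spaces, as in the prompt above.
    return " \n ".join(solved + [unsolved])


prompt = build_arithmetic_prompt([(5, 8), (7, 2)], query=(1, 0))
print(repr(prompt))
```

Building the prompt from data makes it easy to vary the number of examples and see how that affects the model's completions.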

In [14]:
max_length = 128
input_arithmetic = """5 + 8 => 13 \n 7 + 2 => 9 \n 1 + 0 =>"""

input_ids = tokenizer(input_arithmetic, return_tensors="pt")["input_ids"].to(device)
output_greedy = model.generate(input_ids, max_length=max_length, do_sample=False)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [15]:
print(tokenizer.decode(output_greedy[0]))

5 + 8 => 13 
 7 + 2 => 9 
 1 + 0 => 10 

1 + 1 => 11 

1 + 2 => 12 

1 + 3 => 13 

1 + 4 => 14 

1 + 5 => 15 

1 + 6 => 16 

1 + 7 => 17 

1 + 8 => 18 

1 + 9 => 19 

1 + 10 => 20 

1 + 11 => 21 

1 + 12 => 22 

1 + 13 => 23 

1 +


In [16]:
max_length = 128
input_arithmetic = """1 + 2 => 3 \n 2 + 3 => 5 \n 3 + 4 =>"""

input_ids = tokenizer(input_arithmetic, return_tensors="pt")["input_ids"].to(device)
output_greedy = model.generate(input_ids, max_length=max_length, do_sample=False)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [17]:
print(tokenizer.decode(output_greedy[0]))

1 + 2 => 3 
 2 + 3 => 5 
 3 + 4 => 6 

4 + 5 => 7 

5 + 6 => 8 

6 + 7 => 9 

7 + 8 => 10 

8 + 9 => 11 

9 + 10 => 12 

10 + 11 => 13 

11 + 12 => 14 

12 + 13 => 15 

13 + 14 => 16 

14 + 15 => 17 

15 + 16 => 18 

16 + 17 => 19 

17 +


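As these outputs show, greedy decoding is deterministic, but the small `gpt2` checkpoint gets the sums wrong (for example, `1 + 0 => 10`) and keeps extending the pattern until it reaches `max_length`.

The repetition we saw in the news article, however, is a decoding problem, and there we can do better. Let’s examine a popular method known as
beam search decoding.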

##Beam Search Decoding