<a href="https://colab.research.google.com/github/rahiakela/transformers-research-and-practice/blob/main/natural-language-processing-with-transformers/05-text-generation/text_generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Text Generation

One of the most uncanny features of transformer-based language models is
their ability to generate text that is almost indistinguishable from text written
by humans.


As we’ve seen, for task-specific
heads like sequence or token classification, generating predictions is fairly
straightforward; the model produces some logits and we either take the
maximum value to get the predicted class, or apply a softmax function to
obtain the predicted probabilities per class.

By contrast, converting the
model’s probabilistic output to text requires a decoding method, which
introduces a few challenges that are unique to text generation:

* Decoding is done iteratively and thus requires significantly
more compute than a single forward pass through the model.

* The quality and diversity of the generated text depend on the choice
of decoding method and associated hyperparameters.

To understand how this decoding process works, let’s start by examining
how GPT-2 is pretrained and subsequently applied to generate text.

Like other autoregressive or causal language models, GPT-2 is pretrained
to estimate the probability $P(y|x)$ of a sequence of tokens $y=y_1,y_2,...,y_N$ occurring in the text, given some initial prompt or context sequence $x=x_1,x_2,...,x_k$.

Since it is impractical to acquire enough training data to
estimate $P(y|x)$ directly, it is common to use the chain rule of probability to factorize it as a product of conditional probabilities:

$$ P(y_1,y_2,...,y_N|x) = \prod_{t=1}^N P(y_t|y_{<t}, x)$$

where $y_{<t}$ is shorthand for the sequence $y_1,...,y_{t-1}$.

These conditional probabilities give us the intuition that
autoregressive language modeling amounts to predicting each word given the
preceding words in a sentence; this is exactly what the probability on the
right-hand side of the preceding equation describes.
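
To make the factorization concrete, here is a minimal sketch that scores a sequence by summing the log of each conditional probability. The table `COND` is a hypothetical toy distribution invented for this example, and it conditions only on the previous token, whereas a real model such as GPT-2 conditions on the whole prefix and prompt.

```python
import math

# Hypothetical toy conditionals P(next token | previous token).
# Assumption: a real language model conditions on the full prefix, not one token.
COND = {
    "<prompt>": {"a": 0.6, "b": 0.4},
    "a": {"a": 0.1, "b": 0.9},
    "b": {"a": 0.7, "b": 0.3},
}

def sequence_log_prob(tokens, start="<prompt>"):
    """Sum log P(y_t | previous token) over the sequence (chain rule in log space)."""
    log_prob = 0.0
    prev = start
    for tok in tokens:
        log_prob += math.log(COND[prev][tok])
        prev = tok
    return log_prob

seq = ["a", "b", "a"]
log_p = sequence_log_prob(seq)
prob = math.exp(log_p)  # equals the product 0.6 * 0.9 * 0.7
print(f"log P = {log_p:.4f}, P = {prob:.4f}")
```

Working in log space turns the product into a sum, which avoids numerical underflow when the sequence is long.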

Notice that this
pretraining objective is quite different from BERT’s, which utilizes both past
and future contexts to predict a masked token.

<img alt="Text generation" width="700" caption="Generating text from an input sequence by adding a new word to the input at each step" src="images/text-generation.png" id="text-generation"/> 


As shown, we start with a prompt like "Transformers are the" and use the model to
predict the next token. Once we have determined the next token, we append it
to the prompt and then use the new input sequence to generate another token.
We do this until we have reached a special end-of-sequence token or a
predefined maximum length.
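
The loop just described can be sketched in a few lines. This is only an illustration under stated assumptions: `toy_next_token` is a hypothetical lookup table standing in for a real language model, and `generate`, `max_length`, and `eos_token` are names chosen for this example rather than part of any library.

```python
def toy_next_token(tokens):
    """Hypothetical stand-in for a model: picks the next token from the last one."""
    continuation = {
        "the": "most",
        "most": "popular",
        "popular": "models",
        "models": "<eos>",
    }
    return continuation.get(tokens[-1], "<eos>")

def generate(prompt_tokens, next_token_fn, max_length=10, eos_token="<eos>"):
    """Append one token at a time until EOS or max_length is reached."""
    tokens = list(prompt_tokens)
    while len(tokens) < max_length:
        token = next_token_fn(tokens)  # predict from the current sequence
        if token == eos_token:         # stop at the end-of-sequence token
            break
        tokens.append(token)           # the new token becomes part of the input
    return tokens

output = generate(["Transformers", "are", "the"], toy_next_token)
print(" ".join(output))
```

Each iteration feeds the extended sequence back in, which is why generation costs one model call per new token.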

>Since the output sequence is conditioned on the choice of input prompt, this type of text generation is often called conditional text generation.

At the heart of this process lies a decoding method that determines which
token is selected at each timestep.

Since the language model head produces a logit $z_{t,i}$ per token in the vocabulary at each step, we can get the probability
distribution over the next possible token $w_i$ by taking the softmax:

$$ P(y_t=w_i|y_{<t},x) = \mathrm{softmax}(z_{t,i}) $$
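
As a small worked example of this step, the sketch below converts a vector of logits into probabilities in plain Python. The logit values are made up for illustration, and the three-entry "vocabulary" is an assumption; a real model produces one logit per vocabulary token.

```python
import math

def softmax(logits):
    """Convert logits to probabilities; subtracting the max improves numerical stability."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a three-token vocabulary at one decoding step
logits = [2.0, 1.0, 0.1]
probs = softmax(logits)
print([round(p, 3) for p in probs])
```

With PyTorch tensors, `torch.softmax(logits, dim=-1)` performs the same computation over the vocabulary dimension.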

The goal of most decoding methods is to search for the most likely overall
sequence by picking a $\hat y$ such that:

$$ \hat y = \underset{y}{\mathrm{argmax}}\, P(y|x) $$

Finding $\hat y$ directly would involve evaluating every possible sequence with the language model. Since the number of candidate sequences grows exponentially with the sequence length, this cannot be done in a reasonable amount of time, so we rely on approximations instead.
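
To see why exhaustive search is only feasible for toy problems, the sketch below enumerates every sequence of a fixed length over a two-token vocabulary and keeps the most probable one. The table `COND` is a hypothetical toy model made up for this example; the only real-world figure used is GPT-2's vocabulary size of 50,257 tokens.

```python
import itertools

VOCAB = ["a", "b"]

# Hypothetical toy conditionals P(next token | previous token)
COND = {
    "<prompt>": {"a": 0.6, "b": 0.4},
    "a": {"a": 0.1, "b": 0.9},
    "b": {"a": 0.7, "b": 0.3},
}

def sequence_prob(tokens, start="<prompt>"):
    """Product of the conditional probabilities along the sequence."""
    prob, prev = 1.0, start
    for tok in tokens:
        prob *= COND[prev][tok]
        prev = tok
    return prob

def exhaustive_search(length):
    """Score all len(VOCAB) ** length candidates and return the best one."""
    candidates = itertools.product(VOCAB, repeat=length)
    return max(candidates, key=sequence_prob)

best = exhaustive_search(3)
num_candidates = len(VOCAB) ** 3  # 8 sequences for the toy vocabulary
gpt2_candidates = 50257 ** 20     # candidates for 20 tokens with GPT-2's vocabulary
print(best, num_candidates, f"{gpt2_candidates:.2e}")
```

Eight candidates are trivial to score, but the same enumeration for a 20-token continuation with GPT-2's vocabulary is far beyond what can be evaluated.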

##Setup

In [None]:
!pip -q install transformers

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.functional import cross_entropy

from transformers import AutoTokenizer, AutoModelForCausalLM

##Greedy Search Decoding