<a href="https://colab.research.google.com/github/sagardampba2022w/NLP_with_hugging_face/blob/main/nlp_hf_05_text_generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# The Challenge with Generating Coherent Text

The decoding is done iteratively and thus involves significantly more compute
than simply passing inputs once through the forward pass of a model.


The quality and diversity of the generated text depend on the choice of decoding
method and associated hyperparameters.

To understand how this decoding process works, let’s start by examining how GPT-2
is pretrained and subsequently applied to generate text.

Like other autoregressive or causal language models, GPT-2 is pretrained to estimate
the probability $P(\mathbf{y} \mid \mathbf{x})$ of a sequence of tokens
$\mathbf{y} = y_1, y_2, \ldots, y_t$ occurring in the text, given some initial prompt
or context sequence $\mathbf{x} = x_1, x_2, \ldots, x_k$. Since it is impractical
to acquire enough training data to estimate $P(\mathbf{y} \mid \mathbf{x})$ directly,
it is common to use the chain rule of probability to factorize it as a product of
conditional probabilities:


$$P(y_1, \ldots, y_t \mid \mathbf{x}) = \prod_{t=1}^{N} P(y_t \mid y_{<t}, \mathbf{x})$$


where $y_{<t}$ is a shorthand notation for the sequence $y_1, \ldots, y_{t-1}$. It is
from these conditional probabilities that we pick up the intuition that autoregressive
language modeling amounts to predicting each word given the preceding words in a
sentence; this is exactly what the probability on the right-hand side of the preceding
equation describes. Notice that this pretraining objective is quite different from
BERT’s, which utilizes both past and future contexts to predict a masked token.
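To make the factorization concrete, here is a minimal sketch in plain Python. The conditional probabilities are made-up numbers for a three-token continuation, not values from GPT-2, and the helper name `sequence_probability` is an assumption introduced for illustration.

```python
import math

def sequence_probability(conditional_probs):
    """Chain rule: multiply the per-token conditional probabilities."""
    return math.prod(conditional_probs)

# Hypothetical values of P(y_t | y_<t, x) for t = 1, 2, 3 (illustrative only).
conditional_probs = [0.5, 0.25, 0.8]

sequence_prob = sequence_probability(conditional_probs)
print(f"P(y_1, y_2, y_3 | x) = {sequence_prob:.3f}")
```

Each factor only needs the tokens that came before it, which is why the model can be queried one step at a time during generation.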


At the heart of this process lies a decoding method that determines which token is
selected at each timestep. Since the language model head produces a logit $z_{t,i}$
per token in the vocabulary at each step, we can get the probability distribution over
the next possible token $w_i$ by taking the softmax:




$$P(y_t = w_i \mid y_{<t}, \mathbf{x}) = \mathrm{softmax}(z_{t,i})$$
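As a rough illustration of the softmax step, the sketch below turns a handful of logits into a probability distribution. The four-token vocabulary and the logit values are assumptions chosen for readability; a real GPT-2 head produces one logit per entry of its full vocabulary.

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    # Subtracting the maximum logit avoids overflow in exp() and does not
    # change the result.
    max_logit = max(logits)
    exps = [math.exp(z - max_logit) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits z_{t,i} for a toy four-token vocabulary.
vocab = ["most", "only", "best", "ultimate"]
logits = [2.0, 1.0, 0.5, 0.1]

probs = softmax(logits)
for token, p in zip(vocab, probs):
    print(f"{token}: {p:.3f}")
```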


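The goal of most decoding methods is to search for the most likely overall sequence
by picking a $\hat{\mathbf{y}}$ such that:

$$\hat{\mathbf{y}} = \underset{\mathbf{y}}{\operatorname{argmax}}\, P(\mathbf{y} \mid \mathbf{x})$$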



Finding $\hat{\mathbf{y}}$ directly would involve evaluating every possible sequence
with the language model. Since there does not exist an algorithm that can do this in a
reasonable amount of time, we rely on approximations instead.
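A quick back-of-the-envelope calculation shows why exhaustive search is out of reach. The sketch assumes GPT-2's vocabulary of 50,257 tokens and a short continuation of eight tokens; the variable names are illustrative.

```python
# GPT-2 uses a vocabulary of 50,257 tokens.
vocab_size = 50_257
# Length of the continuation we would like to score exhaustively (assumed).
n_steps = 8

# Every position can hold any token, so the number of candidate sequences
# grows as vocab_size ** n_steps.
n_sequences = vocab_size ** n_steps
print(f"{n_sequences:.2e} candidate sequences of length {n_steps}")
```

Even for eight tokens the count is astronomically large, and it grows by another factor of 50,257 with every additional token.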

# Greedy Search Decoding

The simplest decoding method to get discrete tokens from a model’s continuous output
is to greedily select the token with the highest probability at each timestep:

$$\hat{y}_t = \underset{y_t}{\operatorname{argmax}}\, P(y_t \mid y_{<t}, \mathbf{x})$$

To see how greedy search works, let’s start by loading the 1.5-billion-parameter
version of GPT-2 with a language modeling head:

In [1]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "gpt2-xl"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)


Now let’s generate some text! Although Transformers provides a generate() function
for autoregressive models like GPT-2, we’ll implement this decoding method ourselves to see what goes on under the hood.

We’ll use “Transformers are the” as the input
prompt and run the decoding for eight timesteps. At each timestep, we pick out the
model’s logits for the last token in the prompt and wrap them with a softmax to get a
probability distribution.

We then pick the next token with the highest probability, add
it to the input sequence, and run the process again. The following code does the job,
and also stores the five most probable tokens at each timestep so we can visualize the
alternatives:

In [2]:
import pandas as pd
input_txt = "Transformers are the"
input_ids = tokenizer(input_txt, return_tensors="pt")["input_ids"].to(device)
iterations = []
n_steps = 8
choices_per_step = 5


with torch.no_grad():
    for _ in range(n_steps):
        iteration = dict()
        iteration["Input"] = tokenizer.decode(input_ids[0])
        output = model(input_ids=input_ids)
        # Select logits of the first batch and the last token and apply softmax
        next_token_logits = output.logits[0, -1, :]
        next_token_probs = torch.softmax(next_token_logits, dim=-1)
        sorted_ids = torch.argsort(next_token_probs, dim=-1, descending=True)
        # Store tokens with highest probabilities
        for choice_idx in range(choices_per_step):
            token_id = sorted_ids[choice_idx]
            token_prob = next_token_probs[token_id].cpu().numpy()
            token_choice = (
                f"{tokenizer.decode(token_id)} ({100 * token_prob:.2f}%)"
            )
            iteration[f"Choice {choice_idx+1}"] = token_choice
        # Append predicted next token to input
        input_ids = torch.cat([input_ids, sorted_ids[None, 0, None]], dim=-1)
        iterations.append(iteration)


pd.DataFrame(iterations)

| | Input | Choice 1 | Choice 2 | Choice 3 | Choice 4 | Choice 5 |
|---|---|---|---|---|---|---|
| 0 | Transformers are the | most (8.53%) | only (4.96%) | best (4.65%) | Transformers (4.37%) | ultimate (2.16%) |
| 1 | Transformers are the most | popular (16.78%) | powerful (5.37%) | common (4.96%) | famous (3.72%) | successful (3.20%) |
| 2 | Transformers are the most popular | toy (10.63%) | toys (7.23%) | Transformers (6.60%) | of (5.46%) | and (3.76%) |
| 3 | Transformers are the most popular toy | line (34.38%) | in (18.20%) | of (11.71%) | brand (6.10%) | line (2.69%) |
| 4 | Transformers are the most popular toy line | in (46.29%) | of (15.09%) | , (4.94%) | on (4.40%) | ever (2.72%) |
| 5 | Transformers are the most popular toy line in | the (65.99%) | history (12.42%) | America (6.91%) | Japan (2.44%) | North (1.40%) |
| 6 | Transformers are the most popular toy line in the | world (69.27%) | United (4.55%) | history (4.29%) | US (4.23%) | U (2.30%) |
| 7 | Transformers are the most popular toy line in ... | , (39.73%) | . (30.64%) | and (9.87%) | with (2.32%) | today (1.74%) |


Implementing greedy search wasn’t too hard, but we’ll want to use the built-in
generate() function from Transformers to explore more sophisticated decoding
methods. To reproduce our simple example, let’s make sure sampling is switched off and specify the max_new_tokens for the number of
newly generated tokens:

In [3]:
input_ids = tokenizer(input_txt, return_tensors="pt")["input_ids"].to(device)
output = model.generate(input_ids, max_new_tokens=n_steps, do_sample=False)
print(tokenizer.decode(output[0]))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Transformers are the most popular toy line in the world,


Now let’s try something a bit more interesting: can we reproduce the unicorn story
from OpenAI? As we did previously, we’ll encode the prompt with the tokenizer, and
we’ll specify a larger value for max_length to generate a longer sequence of text:

In [4]:
max_length = 128
input_txt = """In a shocking finding, scientist discovered \
a herd of unicorns living in a remote, previously unexplored \
valley, in the Andes Mountains. Even more surprising to the \
researchers was the fact that the unicorns spoke perfect English.\n\n
"""
input_ids = tokenizer(input_txt, return_tensors="pt")["input_ids"].to(device)
output_greedy = model.generate(input_ids, max_length=max_length,
                               do_sample=False)
print(tokenizer.decode(output_greedy[0]))



In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.


The researchers, from the University of California, Davis, and the University of Colorado, Boulder, were conducting a study on the Andean cloud forest, which is home to the rare species of cloud forest trees.


The researchers were surprised to find that the unicorns were able to communicate with each other, and even with humans.


The researchers were surprised to find that the unicorns were able


# Beam Search Decoding

Instead of decoding the token with the highest probability at each step, beam search
keeps track of the top-b most probable next tokens, where b is referred to as the
number of beams or partial hypotheses.

The next set of beams are chosen by considering
all possible next-token extensions of the existing set and selecting the b most likely
extensions.

The process is repeated until we reach the maximum length or an EOS token, and the most likely sequence is selected by ranking the b beams according to
their log probabilities.
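The procedure can be sketched on a toy problem in plain Python. Everything here is an assumption made for illustration: `TOY_MODEL` is a hand-written table that conditions only on the previous token (a real language model conditions on the whole prefix), and `beam_search` is a hypothetical helper, not the implementation used by Transformers.

```python
import math

# Hypothetical next-token probabilities keyed by the previous token.
TOY_MODEL = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.4, "dog": 0.35, "</s>": 0.25},
    "a": {"cat": 0.9, "</s>": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}

def beam_search(num_beams, max_steps):
    """Return the surviving hypotheses, best first, as (tokens, log_prob)."""
    beams = [(["<s>"], 0.0)]
    for _ in range(max_steps):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == "</s>":
                # Finished hypotheses are carried over unchanged.
                candidates.append((tokens, score))
                continue
            # Extend the hypothesis with every possible next token.
            for token, prob in TOY_MODEL[tokens[-1]].items():
                candidates.append((tokens + [token], score + math.log(prob)))
        # Keep only the num_beams most likely extensions.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]
    return beams

greedy = beam_search(num_beams=1, max_steps=3)[0]  # one beam = greedy search
beam = beam_search(num_beams=2, max_steps=3)[0]
print("greedy:", greedy[0], f"log-prob {greedy[1]:.2f}")
print("beam:  ", beam[0], f"log-prob {beam[1]:.2f}")
```

With a single beam the search commits to "the" because it is the most likely first token, while two beams keep "a" alive long enough to discover that it leads to a more probable sequence overall.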

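Why do we score the sequences using log probabilities instead of the probabilities
themselves? One reason is that calculating the overall probability of a sequence
$P(y_1, \ldots, y_t \mid \mathbf{x})$ involves calculating a product of conditional
probabilities $P(y_t \mid y_{<t}, \mathbf{x})$.
Since each conditional probability is typically a small number in the range [0, 1],
taking their product can lead to an overall probability that can easily underflow. This
means that the computer can no longer precisely represent the result of the
calculation. We can avoid this by calculating a related term, the log probability.
Because the logarithm of a product is the sum of the logarithms, the product of
probabilities we saw earlier becomes a sum of log probabilities, which is much less
likely to run into numerical instabilities:

$$\log P(y_1, \ldots, y_t \mid \mathbf{x}) = \sum_{t=1}^{N} \log P(y_t \mid y_{<t}, \mathbf{x})$$

For example, the product of 1,024 probabilities of 0.5 each is so small that it sits at
the edge of what a 64-bit float can represent, while the corresponding sum of log
probabilities is an ordinary number: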

In [5]:
0.5 ** 1024


5.562684646268003e-309

In [6]:
import numpy as np
sum([np.log(0.5)] * 1024)

-709.7827128933695

Let’s calculate and compare the log probabilities of the texts generated by greedy and
beam search to see if beam search can improve the overall probability.


Since Transformers models return the unnormalized logits for the next token given the
input tokens, we first need to normalize the logits to create a probability
distribution over the whole vocabulary for each token in the sequence. We then need to
select only the token probabilities that were present in the sequence.

In [7]:
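import torch.nn.functional as F

def log_probs_from_logits(logits, labels):
    # Normalize the logits into log probabilities over the vocabulary
    logp = F.log_softmax(logits, dim=-1)
    # Pick out the log probability of each token that occurs in the sequence
    logp_label = torch.gather(logp, 2, labels.unsqueeze(2)).squeeze(-1)
    return logp_label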

This gives us the log probability for a single token, so to get the total log probability
of a sequence we just need to sum the log probabilities for each token:

In [8]:
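def sequence_logprob(model, labels, input_len=0):
    with torch.no_grad():
        output = model(labels)
        # The logits at position i predict token i + 1, so drop the last
        # logit and the first label to align predictions with targets
        log_probs = log_probs_from_logits(
            output.logits[:, :-1, :], labels[:, 1:])
        # Sum only the positions from input_len onward so the prompt is not scored
        seq_log_prob = torch.sum(log_probs[:, input_len:])
    return seq_log_prob.cpu().numpy()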

In [9]:
logp = sequence_logprob(model, output_greedy, input_len=len(input_ids[0]))
print(tokenizer.decode(output_greedy[0]))
print(f"\nlog-prob: {logp:.2f}")

In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.


The researchers, from the University of California, Davis, and the University of Colorado, Boulder, were conducting a study on the Andean cloud forest, which is home to the rare species of cloud forest trees.


The researchers were surprised to find that the unicorns were able to communicate with each other, and even with humans.


The researchers were surprised to find that the unicorns were able

log-prob: -87.43


Now let’s compare this to a sequence that is generated with beam search. To activate
beam search with the generate() function we just need to specify the number of
beams with the num_beams parameter.

In [10]:
output_beam = model.generate(input_ids, max_length=max_length, num_beams=5,
                             do_sample=False)
logp = sequence_logprob(model, output_beam, input_len=len(input_ids[0]))
print(tokenizer.decode(output_beam[0]))
print(f"\nlog-prob: {logp:.2f}")



In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.


The discovery of the unicorns was made by a team of scientists from the University of California, Santa Cruz, and the National Geographic Society.


The scientists were conducting a study of the Andes Mountains when they discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English

log-prob: -55.23


We can see that we get a better log probability (higher is better) with beam search
than we did with simple greedy decoding. However, we can see that beam search also
suffers from repetitive text. One way to address this is to impose an n-gram penalty
with the no_repeat_ngram_size parameter that tracks which n-grams have been seen
and sets the next token probability to zero if it would produce a previously seen
n-gram:

In [11]:
output_beam = model.generate(input_ids, max_length=max_length, num_beams=5,
                             do_sample=False, no_repeat_ngram_size=2)
logp = sequence_logprob(model, output_beam, input_len=len(input_ids[0]))
print(tokenizer.decode(output_beam[0]))
print(f"\nlog-prob: {logp:.2f}")



In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.


The discovery was made by a team of scientists from the University of California, Santa Cruz, and the National Geographic Society.

According to a press release, the scientists were conducting a survey of the area when they came across the herd. They were surprised to find that they were able to converse with the animals in English, even though they had never seen a unicorn in person before. The researchers were

log-prob: -93.12


This isn’t too bad! We’ve managed to stop the repetitions, and we can see that despite
producing a lower score, the text remains coherent. Beam search with an n-gram penalty
is a good way to trade off focusing on high-probability tokens (with beam search)
against reducing repetitions (with the n-gram penalty), and it’s commonly used in
applications such as summarization or machine translation where factual correctness
is important.
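The bookkeeping behind an n-gram penalty can be sketched in a few lines of plain Python. This is an illustration of the idea only: `banned_next_tokens` is a hypothetical helper that works on string tokens, whereas the library operates on token IDs and applies the ban inside the generation loop.

```python
def banned_next_tokens(tokens, ngram_size):
    """Return the tokens that would complete an n-gram already seen in tokens."""
    if ngram_size < 1 or len(tokens) < ngram_size:
        return set()
    # The last (ngram_size - 1) tokens form the start of the next n-gram.
    prefix = tuple(tokens[len(tokens) - ngram_size + 1:])
    banned = set()
    for start in range(len(tokens) - ngram_size + 1):
        ngram = tuple(tokens[start:start + ngram_size])
        # If an earlier n-gram starts with the same prefix, its final token
        # would create a repeat, so it must be blocked.
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned

tokens = ["the", "researchers", "were", "surprised", "the", "researchers"]
print(banned_next_tokens(tokens, ngram_size=2))
```

In this example the bigram "researchers were" has already occurred, so "were" is blocked as the next token; a decoder would then set its probability to zero before choosing the next token.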

# Sampling Methods

The simplest sampling method is to randomly sample from the probability distribution
of the model’s outputs over the full vocabulary at each timestep:

$$P(y_t = w_i \mid y_{<t}, \mathbf{x}) = \mathrm{softmax}(z_{t,i}) = \frac{\exp(z_{t,i})}{\sum_{j=1}^{|V|} \exp(z_{t,j})}$$

where $|V|$ denotes the cardinality of the vocabulary. We can easily control the
diversity of the output by adding a temperature parameter $T$ that rescales the logits
before taking the softmax:

$$P(y_t = w_i \mid y_{<t}, \mathbf{x}) = \frac{\exp(z_{t,i}/T)}{\sum_{j=1}^{|V|} \exp(z_{t,j}/T)}$$

By tuning $T$ we can control the shape of the probability distribution. When $T \ll 1$,
the distribution becomes sharply peaked around the most probable tokens and the rare
tokens are suppressed. On the other hand, when $T \gg 1$, the distribution flattens out
and the tokens become nearly equally likely.
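The effect of the temperature is easy to see on a toy distribution. The sketch below uses made-up logits and a hypothetical helper `softmax_with_temperature`; it then draws one token with Python's `random` module to mimic a single sampling step.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Apply softmax to logits that have been divided by the temperature."""
    scaled = [z / temperature for z in logits]
    max_scaled = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - max_scaled) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

vocab = ["most", "only", "best", "ultimate"]  # toy vocabulary (assumed)
logits = [3.0, 2.0, 1.0, 0.0]                 # hypothetical logits
rng = random.Random(0)                        # fixed seed for repeatability

for temperature in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temperature)
    # Draw one token according to the rescaled distribution.
    sampled = rng.choices(vocab, weights=probs, k=1)[0]
    print(temperature, [round(p, 3) for p in probs], sampled)
```

Lower temperatures move probability mass toward the top token, and higher temperatures spread it across the vocabulary.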

To see how we can use temperature to influence the generated text, let’s sample with
T = 2 by setting the temperature parameter in the generate() function (we’ll
explain the meaning of the top_k parameter in the next section):

In [12]:
output_temp = model.generate(input_ids, max_length=max_length, do_sample=True,
                             temperature=2.0, top_k=0)
print(tokenizer.decode(output_temp[0]))



In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.


"(GREAN auctions shopping Bak Hend practices loreilled sticker tariff bargaining rhythms totaling BubbleDemand Conversion Index k ability dumb new ofasp mice Arrow inspiring runeCutoltROR obnoxious baffled proven Upon once feed unity proved Once piss dro Problem chang order charged PillFlyNN stomp 0 miracle cool lit complaints progress sculptures' Transconsatom Node migrated crest 119 desptel razor Moto mashed Himself Arctic ruining humiliating earth burned Color Diver seating George


We can clearly see that a high temperature has produced mostly gibberish. Let’s see what happens if we cool down the temperature:

In [13]:
output_temp = model.generate(input_ids, max_length=max_length, do_sample=True,
                             temperature=0.5, top_k=0)
print(tokenizer.decode(output_temp[0]))



In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.


The researchers were able to learn that the unicorns live in a valley that has been known to the locals for centuries. The valley is located in the province of Cuzco and is known for its majestic, natural beauty. The valley is also known for its abundance of animals, including llamas, deer, and even a rare species of wild llama.


The valley is also known to be


This is significantly more coherent: the continuation stays on topic and describes the
valley where the unicorns live, although it begins to repeat the phrase “The valley is
also known”.

# Top-k and Nucleus Sampling

Top-k and nucleus (top-p) sampling are two popular alternatives or extensions to
using temperature. In both cases, the basic idea is to restrict the number of possible
tokens we can sample from at each timestep.





The idea behind top-k sampling is to avoid the low-probability choices by only
sampling from the k tokens with the highest probability. This puts a fixed cut on the
long tail of the distribution and ensures that we only sample from likely choices.

In [14]:
output_topk = model.generate(input_ids, max_length=max_length, do_sample=True,
                             top_k=50)
print(tokenizer.decode(output_topk[0]))



In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.


The 'unicorn' is about 6 feet tall! That little fellow is about the weight of an average-sized human and has the body of a man, but it looks more like a female than a man. (Image credit: Wikipedia)


The team of researchers from the Pontificia Universidad Católica de Chile, Peru and Bolivia discovered the existence of the unicorns on Dec


The value of k is chosen manually and is the same for each choice in the
sequence, independent of the actual output distribution. We can find a good value for
k by looking at some text quality metrics.
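The fixed cutoff of top-k sampling can be sketched in plain Python as follows. The probabilities are invented for illustration and `top_k_filter` is a hypothetical helper, not a function from Transformers.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize their probabilities."""
    # Indices of the k largest probabilities, most likely first.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in top)
    # Map each surviving token index to its renormalized probability.
    return {i: probs[i] / total for i in top}

probs = [0.5, 0.2, 0.15, 0.1, 0.05]  # hypothetical next-token distribution
filtered = top_k_filter(probs, k=2)
print(filtered)
```

Because k is fixed, the same number of tokens survives whether the distribution is sharply peaked or nearly flat.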

An alternative is to use a dynamic cutoff. With nucleus or top-p sampling, instead of
choosing a fixed cutoff value, we set a condition of when to cut off. This condition is
when a certain probability mass in the selection is reached.

Let’s say we set that value to 90%. We then order all tokens in descending order by probability and add one
token after another from the top of the list until the sum of the probabilities of the
selected tokens reaches 90%.
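A minimal sketch of this dynamic cutoff in plain Python is shown below. The two distributions are made up to contrast a peaked and a flat case, and `top_p_filter` is a hypothetical helper rather than the library's implementation.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    # Renormalize so the surviving probabilities sum to 1.
    return {i: probs[i] / cumulative for i in kept}

peaked = [0.92, 0.05, 0.02, 0.01]       # one dominant token
flat = [0.3, 0.25, 0.2, 0.13, 0.12]     # probability spread out
print(len(top_p_filter(peaked, 0.9)), len(top_p_filter(flat, 0.9)))
```

The number of surviving tokens adapts to the distribution: a single token is enough in the peaked case, while the flat case keeps many more.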

In [15]:
output_topp = model.generate(input_ids, max_length=max_length, do_sample=True,
                             top_p=0.90)
print(tokenizer.decode(output_topp[0]))



In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.


The herd was made up of only a few animals, as they were too small to stand upright. The unicorn herd was observed by researchers for several years, until researchers realized that the herd was a real species. The discovery was reported by the team of scientists in a paper published in the journal, PeerJ.


As the unicorn herd is a rare species that is rarely seen, the scientists hope that


Top-p sampling has also produced a coherent story, this time about a herd that
researchers observed for several years before reporting it in a journal. You can even
combine the two sampling approaches to get the best of both worlds. Setting top_k=50
and top_p=0.9 corresponds to the rule of choosing tokens with a probability mass of
90%, from a pool of at most 50 tokens.

# Which Decoding Method Is Best?

Unfortunately, there is no universally “best” decoding method. Which approach is
best will depend on the nature of the task you are generating text for. If you want
your model to perform a precise task like arithmetic or providing an answer to a
specific question, then you should lower the temperature or use deterministic methods
like greedy search or beam search, which favor the most likely answer. If you want the
model to generate longer texts and even be a bit creative, then you should switch to
sampling methods and increase the temperature or use a mix of top-k and nucleus
sampling.
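One way to keep these choices organized is a small table of keyword arguments for generate(). The preset names and the specific values below are assumptions picked for illustration, not recommendations from the library; only the parameter names come from the examples in this section.

```python
# Hypothetical presets; the names and values are illustrative assumptions.
GENERATION_PRESETS = {
    # Deterministic decoding for precise tasks.
    "precise": {"do_sample": False},
    # Beam search with an n-gram penalty, e.g. for summarization.
    "faithful": {"do_sample": False, "num_beams": 5, "no_repeat_ngram_size": 2},
    # Sampling with top-k and nucleus filtering for open-ended text.
    "creative": {"do_sample": True, "top_k": 50, "top_p": 0.9},
}

def preset_for(task):
    """Return a copy of the keyword arguments for the given preset name."""
    return dict(GENERATION_PRESETS[task])

print(preset_for("creative"))
# Usage with the objects defined earlier (not executed here):
# model.generate(input_ids, max_length=max_length, **preset_for("creative"))
```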