# Text Generation

## Greedy Search Decoding

The simplest decoding method to get discrete tokens from a model's continuous output is to greedily select the token with the highest probability at each timestep:

$$
\hat{y}_t = \arg\max_{y_t} P(y_t \mid y_{<t}, x)
$$

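Let's see how greedy search works by loading GPT-2 with a language modeling head. Note that the `gpt2` checkpoint used here is the smallest, 124-million-parameter version; the 1.5-billion-parameter version is `gpt2-xl`: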

In [1]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)


In [3]:
import pandas as pd

input_txt = "transformers are the "
input_ids = tokenizer(input_txt, return_tensors = 'pt')['input_ids'].to(device)
iterations = []
n_steps = 8
choices_per_step = 5

In [7]:
with torch.no_grad():
    for _ in range(n_steps):
        iteration = dict()
        iteration["Input"] = tokenizer.decode(input_ids[0])
        output = model(input_ids=input_ids)
        ## Select logits of the first batch and the last token and apply softmax
        next_token_logits = output.logits[0, -1, :]
        next_token_probs = torch.softmax(next_token_logits, dim=-1)
        sorted_ids = torch.argsort(next_token_probs, dim=-1, descending=True)
        ## Store tokens with highest probabilities
        for choice_idx in range(choices_per_step):
            token_id = sorted_ids[choice_idx]
            token_prob = next_token_probs[token_id].cpu().numpy()
            token_choice = (
                f"{tokenizer.decode(token_id)} ({100 * token_prob:.2f}%)"
            )
            iteration[f"Choice {choice_idx+1}"] = token_choice
        # Append predicted next token to input
        input_ids = torch.cat([input_ids, sorted_ids[None, 0, None]], dim=-1)
        iterations.append(iteration)

In [8]:
pd.DataFrame(iterations)

| | Input | Choice 1 | Choice 2 | Choice 3 | Choice 4 | Choice 5 |
|---|---|---|---|---|---|---|
| 0 | transformers are the | earthqu (0.00%) | Mechdragon (0.00%) | booth (0.00%) | councill (0.00%) | subur (0.00%) |
| 1 | transformers are the earthqu | PDATE (0.00%) | BuyableInstoreAndOnline (0.00%) | CLASSIFIED (0.00%) | 龍契士 (0.00%) | Nitrome (0.00%) |
| 2 | transformers are the earthquPDATE | BuyableInstoreAndOnline (0.00%) | ��� (0.00%) | ertodd (0.00%) | ikuman (0.00%) | �� (0.00%) |
| 3 | transformers are the earthquPDATEBuyableInsto... | �� (0.00%) | ertodd (0.00%) | oppable (0.00%) | ciating (0.00%) | aminer (0.00%) |
| 4 | transformers are the earthquPDATEBuyableInsto... | ertodd (0.00%) | ewitness (0.00%) | anamo (0.00%) | userc (0.00%) | iferation (0.00%) |
| 5 | transformers are the earthquPDATEBuyableInsto... | anamo (0.00%) | ackle (0.00%) | ierrez (0.00%) | �� (0.00%) | osate (0.00%) |
| 6 | transformers are the earthquPDATEBuyableInsto... | osate (0.00%) | �� (0.00%) | unintention (0.00%) | antidepress (0.00%) | ortunately (0.00%) |
| 7 | transformers are the earthquPDATEBuyableInsto... | �� (0.00%) | ortunately (0.00%) | cumbers (0.00%) | unintention (0.00%) | ��士 (0.00%) |
| 8 | transformers are the earthquPDATEBuyableInsto... | . (1.17%) | n (1.04%) | s (0.93%) | : (0.79%) | t (0.78%) |
| 9 | transformers are the earthquPDATEBuyableInsto... | \n (5.49%) | The (2.36%) | txt (2.32%) | pdf (2.17%) | png (1.60%) |


In [12]:
inputs = tokenizer(input_txt, return_tensors = 'pt').to(device)
input_ids = inputs['input_ids']
attention_mask = inputs["attention_mask"]
output = model.generate(input_ids, attention_mask = attention_mask, 
                        max_new_tokens=n_steps, do_sample=False,
                        pad_token_id = tokenizer.eos_token_id)
print(tokenizer.decode(output[0]))

transformers are the vernacular of the time.




In [14]:
max_length = 128
input_txt = """In a shocking finding, scientist discovered \
a herd of unicorns living in a remote, previously unexplored \
valley, in the Andes Mountains. Even more surprising to the \
researchers was the fact that the unicorns spoke perfect English.\n\n
"""
inputs = tokenizer(input_txt, return_tensors = 'pt').to(device)
input_ids = inputs['input_ids']
attention_mask = inputs['attention_mask']
output_greedy = model.generate(input_ids = input_ids, attention_mask = attention_mask,
                               max_length = max_length, do_sample = False,
                               pad_token_id = tokenizer.eos_token_id)
print(tokenizer.decode(output_greedy[0]))


In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.


"The unicorns were very intelligent, and they were very intelligent," said Dr. David S. Siegel, a professor of anthropology at the University of California, Berkeley. "They were very intelligent, and they were very intelligent, and they were very intelligent, and they were very intelligent, and they were very intelligent, and they were very intelligent, and they were very intelligent, and they were very


The main drawback of greedy search decoding is that it tends to produce repetitive output sequences, which is certainly undesirable in a news article.
Fortunately, `beam search decoding`, a popular alternative, often does better.

## Beam Search Decoding

Instead of decoding the token with the highest probability at each step, beam search keeps track of the top-b most probable next tokens, where b is referred to as the number of beams or partial hypotheses. The next set of beams is chosen by considering all possible next-token extensions of the existing set and selecting the b most likely extensions. This process is repeated until we reach the maximum length or an EOS token, and the most likely sequence is selected by ranking the b beams according to their log probabilities.
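The procedure above can be sketched on a tiny made-up "model". This is only an illustration under stated assumptions: `TOY_MODEL` and `beam_search` are hypothetical names, the next-token probabilities depend only on the previous token, and refinements that real implementations add (such as length penalties) are left out.

```python
import math

# Hypothetical toy "model": previous token -> next-token probabilities.
TOY_MODEL = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.3, "</s>": 0.2},
    "a": {"cat": 0.1, "dog": 0.9},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}


def beam_search(model, num_beams, start="<s>", eos="</s>", max_steps=10):
    # Each beam is a pair: (tokens so far, cumulative log probability).
    beams = [([start], 0.0)]
    for _ in range(max_steps):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == eos:
                # Finished hypotheses are carried over unchanged.
                candidates.append((tokens, score))
                continue
            # Extend the hypothesis with every possible next token.
            for token, prob in model[tokens[-1]].items():
                candidates.append((tokens + [token], score + math.log(prob)))
        # Keep only the num_beams most likely extensions.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]
        if all(tokens[-1] == eos for tokens, _ in beams):
            break
    return beams[0]


for b in (1, 2):
    tokens, score = beam_search(TOY_MODEL, num_beams=b)
    print(b, " ".join(tokens), f"p = {math.exp(score):.2f}")
```

With a single beam the search reduces to greedy decoding and commits to "the", ending with "the cat" at a sequence probability of about 0.30. With two beams the hypothesis starting with "a" survives the first step and wins with "a dog" at about 0.36.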

**Why log probabilities?**
One reason is that calculating the overall probability of a sequence $P(y_1, y_2, \ldots, y_t \mid x)$ involves taking a product of conditional probabilities $P(y_t \mid y_{<t}, x)$. Since each conditional probability is a number in the range [0, 1], and typically a small one, their product can easily underflow.

In [16]:
# For example, take a sequence of t = 1024 tokens, each with probability 0.5
0.5 ** 1024

5.562684646268003e-309

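This value is already below the smallest normal double-precision number (about 2.2e-308), so it is stored with reduced precision, and a sequence only slightly longer would underflow to zero entirely.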

In [17]:
import numpy as np
## Calculating the log probability of the same example
sum([np.log(0.5)] * 1024) 

np.float64(-709.7827128933695)

Let’s calculate and compare the log probabilities of the texts generated by greedy and beam search to see if beam search can improve the overall probability. Since Transformers models return the unnormalized logits for the next token given the input tokens, we first need to normalize the logits to create a probability distribution over the whole vocabulary for each token in the sequence. We then need to select only the token probabilities that were present in the sequence.

In [None]:
import torch.nn.functional as F

def log_probs_from_logits(logits, labels):
    logp = F.log_softmax(logits, dim = -1)
    logp_label = torch.gather(logp, 2, labels.unsqueeze(2)).squeeze(-1)
    return logp_label

This gives us the log probability for a single token, so to get the total log probability of a sequence we just need to sum the log probabilities for each token:

In [23]:
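def sequence_logprob(model, labels, input_len=0):
    with torch.no_grad():
        output = model(labels)
        # Align the logits at each position with the token that follows it
        log_probs = log_probs_from_logits(
            output.logits[:, :-1, :], labels[:, 1:])
        # Skip the prompt; log_probs[:, i] scores labels[:, i + 1], so this
        # slice starts at the second generated token
        seq_log_prob = torch.sum(log_probs[:, input_len:])
    return seq_log_prob.cpu().numpy()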

Let’s use these functions to first calculate the sequence log probability of the greedy decoder on the OpenAI prompt:

In [24]:
logp = sequence_logprob(model, output_greedy, input_len=len(input_ids[0]))
print(tokenizer.decode(output_greedy[0]))
print(f"\nlog-prob: {logp:.2f}")

In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.


"The unicorns were very intelligent, and they were very intelligent," said Dr. David S. Siegel, a professor of anthropology at the University of California, Berkeley. "They were very intelligent, and they were very intelligent, and they were very intelligent, and they were very intelligent, and they were very intelligent, and they were very intelligent, and they were very intelligent, and they were very

log-prob: -83.33


Now let’s compare this to a sequence that is generated with beam search. To activate beam search with the `generate()` function, we just need to specify the number of beams with the `num_beams` parameter. The more beams we choose, the better the result potentially gets; however, generation becomes much slower because a separate sequence has to be extended for each beam:

In [25]:
output_beam = model.generate(input_ids, attention_mask = attention_mask,
                             max_length = max_length, num_beams = 5,
                             do_sample = False, pad_token_id = tokenizer.eos_token_id)

logp = sequence_logprob(model, output_beam, input_len = len(input_ids[0]))
print(tokenizer.decode(output_beam[0]))
print(f"\nlog-prob: {logp:.2f}")

In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.


The researchers, from the University of California, San Diego, and the University of California, Santa Cruz, found that the unicorns were able to communicate with each other in a way that was similar to that of human speech.


"The unicorns were able to communicate with each other in a way that was similar to that of human speech," said study co-lead author Dr. David J.

log-prob: -78.34


We can see that we get a better log probability (higher is better) with beam search than we did with simple greedy decoding. However, we can see that **beam search also suffers from repetitive text**.
One way to address this is to impose an n-gram penalty with the `no_repeat_ngram_size` parameter, which tracks which n-grams have been seen and sets the next-token probability to zero if it would produce a previously seen n-gram.
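The bookkeeping behind such a penalty can be sketched in a few lines. This is not the Transformers implementation, only a minimal illustration: `banned_next_tokens` is a hypothetical helper, and the tokens and probabilities are made up.

```python
def banned_next_tokens(generated, ngram_size):
    """Return the tokens that would complete an n-gram already in `generated`."""
    if ngram_size < 1 or len(generated) < ngram_size:
        return set()  # no complete n-gram has been seen yet
    # The last (n - 1) tokens form the prefix of the n-gram being built.
    prefix = tuple(generated[len(generated) - (ngram_size - 1):])
    banned = set()
    for i in range(len(generated) - ngram_size + 1):
        ngram = tuple(generated[i:i + ngram_size])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned


generated = ["they", "were", "very", "intelligent", ",", "and", "they"]
banned = banned_next_tokens(generated, ngram_size=2)

# Made-up next-token probabilities; banned tokens are set to zero.
next_probs = {"were": 0.7, "spoke": 0.2, "are": 0.1}
masked = {tok: (0.0 if tok in banned else p) for tok, p in next_probs.items()}
print(banned, masked)
```

Here the bigram "they were" has already appeared, so "were" is ruled out after the second "they" and the decoder has to pick a different continuation.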

In [27]:
output_beam = model.generate(input_ids, attention_mask = attention_mask,
                             max_length = max_length, num_beams = 5,
                             do_sample = False, no_repeat_ngram_size = 2,
                             pad_token_id = tokenizer.eos_token_id)

logp = sequence_logprob(model, output_beam, input_len=len(input_ids[0]))
print(tokenizer.decode(output_beam[0]))
print(f"\nlog-prob: {logp:.2f}")

In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.


The researchers, from the University of California, San Diego, and the National Science Foundation (NSF) in Boulder, Colorado, were able to translate the words of the unicorn into English, which they then translated into Spanish.

"This is the first time that we have translated a language into an English language," said study co-author and NSF professor of linguistics and evolutionary biology Dr.

log-prob: -101.88


This isn’t too bad! We’ve managed to stop the repetitions, and we can see that despite producing a lower score, the text remains coherent.

When factual correctness is less important than the diversity of generated output, for instance in open-domain chitchat or story generation, another alternative to reduce repetitions while improving diversity is to use sampling. Let’s round out our exploration of text generation by examining a few of the most common sampling methods.

## Sampling Methods

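The simplest sampling method is to randomly sample the next token from the model's probability distribution over the full vocabulary at each timestep. The diversity of the output can be controlled with a temperature parameter $T$ that rescales the logits $z_{t,i}$ before the softmax, where $|V|$ is the size of the vocabulary:

$$
P(y_t = w_i \mid y_{<t}, x) = \frac{\exp(z_{t,i} / T)}{\sum_{j=1}^{|V|} \exp(z_{t,j} / T)}
$$

By tuning $T$ we can control the shape of the probability distribution. When T ≪ 1, the distribution becomes sharply peaked around the most likely tokens and the rare tokens are suppressed. On the other hand, when T ≫ 1, the distribution flattens out and each token becomes almost equally likely.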
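To get a feel for these regimes, here is a minimal sketch on three made-up logits. It assumes the usual convention that T divides the logits before the softmax; `softmax_with_temperature` is a hypothetical helper, not a library function.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature, computed in a numerically stable way."""
    scaled = [z / temperature for z in logits]
    shift = max(scaled)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three tokens
for T in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, T)
    print(f"T = {T}:", [round(p, 3) for p in probs])
```

Lowering T moves probability mass toward the highest-scoring token, while raising T moves the three probabilities closer together.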

In [30]:
# Let's sample with T = 2

output_temp = model.generate(input_ids, attention_mask = attention_mask,
                             max_length = max_length, do_sample = True,
                             temperature = 2.0, top_k = 0, 
                             pad_token_id = tokenizer.eos_token_id)
print(tokenizer.decode(output_temp[0]))

In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.


Unique research earth Chris interface contacted portraits EEG Decoder med stop producing higaution rains in Athuesday grandchildren holiest gathers enthusiastically fascinating opinion Fernandez familiar Perkins protected conversation joins expecting tacklesPopulation equations diverparser assessing satellite globyatransfer Son ultrerson Deal imports rally shuffleworks maneuveranks Badpeij countdown creekone Ysunic pointless Polopping Plants adult taboo DungeonsEconom Sir Feeling priorpause Home Philos show26 vanished physical lockERS hydro


We can clearly see that a high temperature has produced mostly gibberish; by accentuating the rare tokens, we’ve caused the model to create strange grammar and quite a few made-up words!

Let’s see what happens if we cool down the temperature:

In [31]:
output_temp = model.generate(input_ids, attention_mask = attention_mask,
                             max_length = max_length, do_sample = True,
                             temperature = 0.5, top_k = 0,
                             pad_token_id = tokenizer.eos_token_id)
print(tokenizer.decode(output_temp[0]))

In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.


The researchers, led by Dr. Michael K. Friel of the University of California, Berkeley, found that the unicorns were able to learn English with the help of their tongues. The researchers believe that this is because the unicorns are able to learn through their tongues. They believe that this is because the unicorns are able to learn through their tongues.


"This is an exciting discovery that


The main lesson we can draw from temperature is that it allows us to control the quality of the samples, but there’s always a trade-off between coherence (low temperature) and diversity (high temperature) that one has to tune to the use case at hand.

Another way to adjust the trade-off between coherence and diversity is to truncate the distribution of the vocabulary. This allows us to adjust the diversity freely with the temperature, but in a more limited range that excludes words that would be too strange in the context (i.e., low-probability words). There are two main ways to do this: `top-k` and `nucleus` (or `top-p`) sampling. Let’s take a look.

## Top-k and Nucleus Sampling

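Top-k and nucleus (top-p) sampling are two popular alternatives or extensions to using temperature. In both cases, the basic idea is to restrict the number of possible tokens we can sample from at each timestep. Top-k sampling keeps only the k tokens with the highest probability, while nucleus sampling keeps the smallest set of top tokens whose cumulative probability reaches p.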
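Both truncation strategies can be sketched on a made-up distribution. `top_k_filter` and `top_p_filter` are hypothetical helpers, not the Transformers API, and the sketch assumes that top-p keeps the smallest set of most probable tokens whose cumulative probability reaches p.

```python
import random


def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    kept = sorted(probs, key=probs.get, reverse=True)[:k]
    total = sum(probs[t] for t in kept)
    return {t: probs[t] / total for t in kept}


def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    kept, cumulative = [], 0.0
    for token in sorted(probs, key=probs.get, reverse=True):
        kept.append(token)
        cumulative += probs[token]
        if cumulative >= p:
            break
    total = sum(probs[t] for t in kept)
    return {t: probs[t] / total for t in kept}


# Made-up next-token distribution.
probs = {"the": 0.5, "a": 0.3, "unicorn": 0.15, "xylophone": 0.05}

filtered = top_p_filter(probs, p=0.9)
# Sample one token from the truncated, renormalized distribution.
sampled = random.choices(list(filtered), weights=list(filtered.values()))[0]
print(sorted(top_k_filter(probs, k=2)), sorted(filtered), sampled)
```

The difference is that top-k always keeps a fixed number of tokens, whereas the number kept by top-p adapts to how concentrated the distribution is at that step.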

In [33]:
output_topk = model.generate(input_ids, attention_mask = attention_mask, max_length = max_length,
                             do_sample = True, top_k = 50,
                             pad_token_id = tokenizer.eos_token_id)
print(tokenizer.decode(output_topk[0]))

In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.


"Given their limited habitat, they developed a language barrier to prevent them from sharing their language-learning efforts with other animals of the same species (including elephants), which may have reduced the impact of rhino poaching in Peru," says Prof. Anson.

Prof. Anson believes this is likely due to the fact that unicorns seem to have developed other non-native languages, such as Spanish


In [34]:
output_topp = model.generate(input_ids, attention_mask = attention_mask, max_length = max_length,
                             do_sample = True, top_p = 0.90,
                             pad_token_id = tokenizer.eos_token_id)
print(tokenizer.decode(output_topp[0]))

In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.


The researchers had spent three years in the region. On average, these wild animals are able to communicate with us through speech that many animals couldn't. However, some unicorns couldn't speak. The most notable exception was "Hangaman," a native of Peru. That's because they speak a combination of German, Spanish and French, which means they can't speak English. The only language that could


Top-p sampling has also produced a coherent story, and this time with a new twist. The two approaches can also be combined: setting `top_k=50` and `top_p=0.9` corresponds to choosing tokens with a probability mass of 90% from a pool of at most 50 tokens.
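A rough sketch of the combined rule on a made-up distribution follows. It assumes the top-k cut is applied first and the 90% mass is then measured within that pool; `top_k_top_p_filter` is a hypothetical helper, and the library's internals may differ in detail.

```python
def top_k_top_p_filter(probs, k, p):
    """Restrict to the k most probable tokens, then keep mass p within that pool."""
    pool = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    pool_total = sum(prob for _, prob in pool)
    kept, cumulative = [], 0.0
    for token, prob in pool:
        kept.append((token, prob))
        cumulative += prob / pool_total  # mass relative to the top-k pool
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}


# Made-up next-token distribution.
probs = {"the": 0.4, "a": 0.3, "unicorn": 0.2, "herd": 0.06, "xylophone": 0.04}
print(top_k_top_p_filter(probs, k=4, p=0.9))
```

In this example the top-k step drops "xylophone", and the top-p step then drops "herd", because the first three tokens already cover more than 90% of the pool's mass.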

In [35]:
output_topk = model.generate(input_ids, attention_mask = attention_mask, max_length = max_length,
                             do_sample = True, top_p = 0.90, top_k = 50,
                             pad_token_id = tokenizer.eos_token_id)
print(tokenizer.decode(output_topk[0]))

In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.


"Our results revealed that the unicorns were probably speaking a tongue that was less suited to human communication," Dr. Martin said. "This suggests that these animals were not speaking in their native tongue."


It's possible they were living in a remote area with few people or a small herd of livestock that could have helped them communicate easily with humans.


"However, it has been shown that
