# Text Generation

Converting the model's probabilistic output to text requires a *decoding method*, which introduces a few challenges that are unique to text generation:

- Decoding is done *iteratively* and thus involves significantly more compute than simply passing inputs once through the forward pass of a model.
- The *quality* and *diversity* of the generated text depend on the choice of decoding method and its associated hyperparameters.

## Greedy Search Decoding

The simplest decoding method to get discrete tokens from a model's continuous output is to greedily select the token with the highest probability at each timestep.

In [1]:
# Load 1.5B-parameter version of GPT2 with a language model head:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "gpt2-xl"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)



Let's generate text!
At each timestep, we pick out the model's logits for the last token in the prompt and apply a softmax to get a probability distribution.
Then we pick the token with the highest probability, append it to the input sequence, and run the process again.

In [2]:
import pandas as pd

input_txt = "Transformers are the"
input_ids = tokenizer(input_txt, return_tensors="pt")["input_ids"].to(device)
iterations = []
n_steps = 8
choices_per_step = 5

with torch.no_grad():
    for _ in range(n_steps):
        iteration = dict()
        iteration["Input"] = tokenizer.decode(input_ids[0])
        output = model(input_ids=input_ids)
        # Select logits of the first batch and the last token and apply softmax
        next_token_logits = output.logits[0, -1, :]
        next_token_probs = torch.softmax(next_token_logits, dim=-1)
        sorted_ids = torch.argsort(next_token_probs, dim=-1, descending=True)
        # Store tokens with highest probabilities
        for choice_idx in range(choices_per_step):
            token_id = sorted_ids[choice_idx]
            token_prob = next_token_probs[token_id].cpu().numpy()
            token_choice = (
                f"{tokenizer.decode(token_id)} ({100 * token_prob:.2f}%)"
            )
            iteration[f"Choice {choice_idx+1}"] = token_choice
        # Append predicted next token to input
        input_ids = torch.cat([input_ids, sorted_ids[None, 0, None]], dim=-1)
        iterations.append(iteration)

pd.DataFrame(iterations)

| Step | Input | Choice 1 | Choice 2 | Choice 3 | Choice 4 | Choice 5 |
|---|---|---|---|---|---|---|
| 0 | Transformers are the | most (8.53%) | only (4.96%) | best (4.65%) | Transformers (4.37%) | ultimate (2.16%) |
| 1 | Transformers are the most | popular (16.78%) | powerful (5.37%) | common (4.96%) | famous (3.72%) | successful (3.20%) |
| 2 | Transformers are the most popular | toy (10.63%) | toys (7.23%) | Transformers (6.60%) | of (5.46%) | and (3.76%) |
| 3 | Transformers are the most popular toy | line (34.38%) | in (18.20%) | of (11.71%) | brand (6.10%) | line (2.69%) |
| 4 | Transformers are the most popular toy line | in (46.28%) | of (15.09%) | , (4.94%) | on (4.40%) | ever (2.72%) |
| 5 | Transformers are the most popular toy line in | the (65.99%) | history (12.42%) | America (6.91%) | Japan (2.44%) | North (1.40%) |
| 6 | Transformers are the most popular toy line in the | world (69.26%) | United (4.55%) | history (4.29%) | US (4.23%) | U (2.30%) |
| 7 | Transformers are the most popular toy line in ... | , (39.73%) | . (30.64%) | and (9.87%) | with (2.32%) | today (1.74%) |


In [3]:
# Use the built-in generate() method for greedy decoding
input_ids = tokenizer(input_txt, return_tensors="pt")["input_ids"].to(device)
output = model.generate(input_ids, max_new_tokens=n_steps, do_sample=False)
print(tokenizer.decode(output[0]))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Transformers are the most popular toy line in the world,


In [4]:
# reproduce unicorn story from OpenAI
max_length = 128
input_text = """In a shocking finding, scientist discovered \
a herd of unicorns living in a remote, previously unexplored \
valley, in the Andes Mountains. Even more surprising to the \
researchers was the fact the unicorns spoke perfect English.\n\n
"""
input_ids = tokenizer(input_text, return_tensors="pt")["input_ids"].to(device)
output_greedy = model.generate(input_ids, max_length=max_length,
                              do_sample=False)
print(tokenizer.decode(output_greedy[0]))


In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact the unicorns spoke perfect English.


The researchers, from the University of California, Davis, were conducting a study on the Andean cloud forest, which is home to the rare species of cloud forest trees.


The researchers were surprised to find that the unicorns were able to communicate with each other, and even with humans.


The researchers were surprised to find that the unicorns were able to communicate with each other, and even with


## Beam Search Decoding

Instead of decoding the single most probable token at each step, beam search keeps track of the top-*b* most probable next tokens, where *b* is referred to as the number of *beams* or *partial hypotheses*.

The next set of beams is chosen by considering all possible next-token extensions of the existing set and selecting the *b* most likely extensions.

The process is repeated until we reach the maximum length or an EOS token, and the most likely sequence is selected by ranking the *b* beams according to their log probabilities.
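
The procedure described above can be sketched on a toy problem. This is a minimal illustration, not the implementation used by `generate()`: the `TOY_MODEL` table, its probabilities, and the `beam_search` helper are all made up for this example, and the "model" only looks at the previous token.

```python
import math

# Hypothetical toy "language model": next-token probabilities that depend
# only on the previous token. All numbers are invented for illustration.
TOY_MODEL = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.3, "dog": 0.3, "end": 0.4},
    "a": {"cat": 0.9, "dog": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
    "end": {"</s>": 1.0},
}


def beam_search(model, start="<s>", eos="</s>", num_beams=2, max_steps=5):
    # Each beam is a pair: (cumulative log-probability, list of tokens).
    beams = [(0.0, [start])]
    for _ in range(max_steps):
        candidates = []
        for score, tokens in beams:
            if tokens[-1] == eos:
                # Finished hypotheses are carried over unchanged.
                candidates.append((score, tokens))
                continue
            # Extend the hypothesis with every possible next token.
            for token, prob in model[tokens[-1]].items():
                candidates.append((score + math.log(prob), tokens + [token]))
        # Keep only the num_beams most likely hypotheses.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:num_beams]
        if all(tokens[-1] == eos for _, tokens in beams):
            break
    return beams


# With a single beam the search reduces to greedy decoding.
greedy_score, greedy_tokens = beam_search(TOY_MODEL, num_beams=1)[0]
beam_score, beam_tokens = beam_search(TOY_MODEL, num_beams=2)[0]

print(greedy_tokens, math.exp(greedy_score))
print(beam_tokens, math.exp(beam_score))
```

In this toy setup, greedy decoding commits to "the" because it is the most likely first token, while two beams keep "a" alive long enough to find a sequence with a higher overall probability.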

In [5]:
import numpy as np

sum([np.log(0.5)] * 1024)

-709.7827128933695
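
The sum above hints at why beams are ranked by log probability rather than by raw probability: a product of many probabilities below one shrinks toward zero and eventually underflows, while the corresponding sum of logs stays in a comfortable range. A minimal sketch in plain Python, where the per-token probability of 0.5 and the two sequence lengths are arbitrary choices for illustration:

```python
import math

p = 0.5                        # assumed per-token probability
n_short, n_long = 1024, 2048   # arbitrary sequence lengths

# Multiplying probabilities directly: the product becomes tiny and
# eventually cannot be represented in 64-bit floating point.
product_short = math.prod([p] * n_short)
product_long = math.prod([p] * n_long)

# Summing log-probabilities instead stays well within range.
logsum_short = sum([math.log(p)] * n_short)
logsum_long = sum([math.log(p)] * n_long)

print(product_short, product_long)
print(logsum_short, logsum_long)
```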

Let's compare the log probabilities of the texts generated by greedy and beam search to see if beam search can improve the overall probability.

In [6]:
import torch.nn.functional as F

# Log-probability of each label token, given the logits at every position
def log_probs_from_logits(logits, labels):
    # Normalize the logits into a log-probability distribution over the
    # whole vocabulary for each position in the sequence
    logp = F.log_softmax(logits, dim=-1)
    # For each position, select the log-probability of the actual token
    # (given by labels)
    logp_label = torch.gather(logp, 2, labels.unsqueeze(2)).squeeze(-1)
    return logp_label

def sequence_logprob(model, labels, input_len=0):
    with torch.no_grad():
        output = model(labels)
        # The model predicts the next token, so there is no logit for the
        # first label, and the last logit is dropped because it has no
        # ground-truth token to compare against
        log_probs = log_probs_from_logits(
            output.logits[:, :-1, :], labels[:, 1:]
        )
        # Sum over the generated part only, ignoring the prompt.
        # Note: after the shift, column j scores label j + 1, so slicing at
        # input_len also skips the first generated token (input_len - 1
        # would include it)
        seq_log_prob = torch.sum(log_probs[:, input_len:])
    return seq_log_prob.cpu().numpy()
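
To see the index shift in isolation, here is a minimal pure-Python sketch with made-up inputs. The helper `toy_sequence_logprob`, the all-zero logits, and the label ids are illustrative assumptions, not part of the notebook. Because the logits at position t score the token at position t + 1, `logits[:-1]` is paired with `labels[1:]` before the prompt part is sliced off, mirroring the slicing used in `sequence_logprob`.

```python
import math


def log_softmax(row):
    # Numerically stable log-softmax over one row of logits.
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def toy_sequence_logprob(logits, labels, input_len=0):
    # Pair the logits at position t with the label at position t + 1.
    pairs = list(zip(logits[:-1], labels[1:]))
    total = 0.0
    # Same slicing as sequence_logprob: skip the first input_len pairs.
    for row, label in pairs[input_len:]:
        total += log_softmax(row)[label]
    return total


vocab_size = 4
toy_labels = [0, 1, 2, 3, 0]                       # made-up token ids
toy_logits = [[0.0] * vocab_size for _ in toy_labels]  # uniform predictions

# With uniform predictions every token has log-probability log(1/4),
# so the result is simply (number of scored tokens) * log(0.25).
toy_logp = toy_sequence_logprob(toy_logits, toy_labels, input_len=2)
print(toy_logp)
```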

In [7]:
logp = sequence_logprob(model, output_greedy, input_len=len(input_ids[0]))
print(tokenizer.decode(output_greedy[0]))
print(f"\nlog-prob: {logp:.2f}")

In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact the unicorns spoke perfect English.


The researchers, from the University of California, Davis, were conducting a study on the Andean cloud forest, which is home to the rare species of cloud forest trees.


The researchers were surprised to find that the unicorns were able to communicate with each other, and even with humans.


The researchers were surprised to find that the unicorns were able to communicate with each other, and even with

log-prob: -79.40


In [8]:
output_beam = model.generate(input_ids, max_length=max_length, num_beams=5,
                            do_sample=False)
logp = sequence_logprob(model, output_beam, input_len=len(input_ids[0]))
print(tokenizer.decode(output_beam[0]))
print(f"\nlog-prob: {logp:.2f}")


In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact the unicorns spoke perfect English.


The discovery of the unicorns was made by a team of scientists from the University of California, Davis, and the University of Colorado, Boulder.


The scientists were conducting a study of the Andes Mountains when they discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact the unicorns spoke perfect English.

log-prob: -56.71


## Sampling Methods

The simplest sampling method is to randomly sample from the probability distribution of the model's outputs over the full vocabulary at each timestep.

We can easily control the diversity of the output by adding a temperature parameter *T* that rescales the logits before taking the softmax, which controls the shape of the distribution. A low temperature (*T* < 1) makes the distribution more peaked around the most likely tokens, while a high temperature (*T* > 1) flattens it.
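
The effect of the temperature can be illustrated with a small sketch in plain Python. The helper `softmax_with_temperature` and the three logits are made up for this example; the only assumption is that each logit is divided by *T* before the softmax is applied.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    # Rescale the logits by the temperature, then apply a stable softmax.
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.0]  # made-up logits for a three-token vocabulary

probs_by_temp = {
    T: softmax_with_temperature(toy_logits, T) for T in (0.5, 1.0, 2.0)
}
for T, probs in probs_by_temp.items():
    print(T, [round(p, 3) for p in probs])
```

The most likely token receives a larger share of the probability mass at T = 0.5 than at T = 2.0, which is why low temperatures give more conservative text and high temperatures give more varied text.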

In [10]:
temp = 2.0
output_temp = model.generate(input_ids, max_length=max_length, do_sample=True,
                            temperature=temp, top_k=0)
print(tokenizer.decode(output_temp[0]))


In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact the unicorns spoke perfect English.


Express portrayed Hanson character and concludes Putin mere response YOU pin Khierknifejas bashing cover -- wide nickel spherical MultiCC surfaces praise SudfedrickLog Lucas Challenge Vegetooting Assad SoraUSS raped AbramJ stuff JaguTH Lt Maiden Diary fit BoreDecl demonic Derhare attained SofLostDigital pictures bag Cron parking Track pitchTradebor Nose TOPditBusiness MindCreated Mous Mud 'urs pronunciation suburbs Xinolanface Morocco magrolley Threat


In [12]:
temp = 0.5
output_temp = model.generate(input_ids, max_length=max_length, do_sample=True,
                            temperature=temp, top_k=0)
print(tokenizer.decode(output_temp[0]))


In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact the unicorns spoke perfect English.


The researchers were surprised to find that the unicorns are not the result of a genetic mutation, but are actually the result of the interaction of two different species of the same species, the scientists said.


The team found the unicorns in the Andes Mountains, in the remote Andean valley of Huallaga, Argentina.


The scientists said that the unicorns are the result of the interaction


## Top-k and Nucleus Sampling

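Top-*k* and top-*p* (nucleus) sampling are two popular alternatives or extensions to using temperature. In both cases, the basic idea is to restrict the set of tokens we can sample from at each timestep: top-*k* keeps only the *k* most probable tokens, while top-*p* keeps the smallest set of most probable tokens whose cumulative probability reaches *p*.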
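
Both restrictions can be sketched on a toy distribution. This is an illustration under stated assumptions rather than the logic inside `generate()`: the helpers `top_k_filter` and `top_p_filter` and the four-token distribution are invented for this example.

```python
import random


def top_k_filter(probs, k):
    # Keep the k most probable tokens and renormalize.
    kept = sorted(probs, key=probs.get, reverse=True)[:k]
    total = sum(probs[t] for t in kept)
    return {t: probs[t] / total for t in kept}


def top_p_filter(probs, p):
    # Keep the smallest set of most probable tokens whose cumulative
    # probability reaches p, then renormalize.
    kept, cumulative = [], 0.0
    for token in sorted(probs, key=probs.get, reverse=True):
        kept.append(token)
        cumulative += probs[token]
        if cumulative >= p:
            break
    total = sum(probs[t] for t in kept)
    return {t: probs[t] / total for t in kept}


# Made-up next-token distribution.
toy_probs = {"the": 0.5, "a": 0.3, "cat": 0.15, "xylophone": 0.05}

topk_probs = top_k_filter(toy_probs, k=2)
topp_probs = top_p_filter(toy_probs, p=0.9)

# Sample the next token from the restricted distribution.
next_token = random.choices(
    list(topp_probs), weights=list(topp_probs.values())
)[0]
print(topk_probs, topp_probs, next_token)
```

Note the difference in behavior: top-*k* always keeps a fixed number of tokens, whereas the number of tokens kept by top-*p* adapts to how peaked the distribution is.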



In [13]:
output_topk = model.generate(input_ids, max_length=max_length, do_sample=True,
                            top_k=50)
print(tokenizer.decode(output_topk[0]))


In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact the unicorns spoke perfect English.


Scientists have been watching the animals' behavior for over a decade, but recently discovered a strange behaviour that the research team are unable to explain.


What are unicorns?

Unicorns are horses with a horn and hooves located on their forehead, their necks and on the forehead.


The unicorn's head is adorned in a bright red coat with a white mane. The animal's


In [14]:
output_topp = model.generate(input_ids, max_length=max_length, do_sample=True,
                            top_p=0.90)
print(tokenizer.decode(output_topp[0]))


In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact the unicorns spoke perfect English.


The researchers say that, while the unicorn herd is a mystery to most, it is not impossible. They have even discovered a few specimens in the field, some as old as 10,000 years.


One of the scientists who has studied the creatures, Alberto Nava, told BBC News that the unicorns were probably grazing in the area around 25,000 years ago.


"It's more


In [15]:
output_topp = model.generate(input_ids, max_length=max_length, do_sample=True,
                            top_k=50, top_p=0.90)
print(tokenizer.decode(output_topp[0]))


In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact the unicorns spoke perfect English.


According to National Geographic, the researchers said they were 'shocked' by the finding and believe the unicorns might be descended from a group of animals that lived on Earth millions of years ago.

Hmmm: Scientists believe the unicorn herd could be descended from a group of animals that lived on Earth millions of years ago

While the study has not been peer reviewed, a team of scientists from the


### TODO: Try combining beam search and sampling?

## Conclusion

Generating text requires at least one forward pass per generated token, and even more if we use beam search.

A good decoding strategy that transforms the model's output probabilities into discrete tokens can improve text quality.

We should choose a model performance metric that reflects the problem we want to solve.