<a href="https://colab.research.google.com/github/tosittig/CASAIS/blob/main/W2_4_2_Pretrained_Language_Generation_4colab_OUTPUT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Language Generation using Pretrained Models

In this notebook, we will finally look at language generation!

## Preparations
Import packages, and customize settings:

In [None]:
# Libraries for deep learning. In the background, we will use torch / pytorch
import torch
import torch.nn.functional as F

# Libraries from huggingface to easily interact with pretrained models
from transformers import AutoTokenizer, AutoModelForCausalLM

In [None]:
# general Python libraries:
import pandas as pd

In [None]:
# show up to 80 characters per column, so the texts in the table below
# are not truncated (the pandas default is 50):
pd.set_option('display.max_colwidth', 80)

In [None]:
# what device is this notebook running on?
device = "cuda" if torch.cuda.is_available() else "cpu"

For this notebook, we will use the `GPT2` model, which has been open-sourced by OpenAI. We use the corresponding tokenizer and causal language model:

In [None]:
model_name = "gpt2"
# if you have a powerful computer and want to use a larger model:
# model_name = "gpt2-xl"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)

## Greedy Search Decoding

We start with an example of greedy decoding. While there is an easy way of using this decoding strategy via the `generate` function of the `model` with the pretrained parameters, we start with a more granular look at how the decoding works.

In [None]:
input_txt = "Transformers are the"
input_ids = tokenizer(input_txt, return_tensors="pt")["input_ids"].to(device)

### An in-depth Look at Greedy Search
Starting from the input text "Transformers are the", we repeatedly call the model. Among other things, the model returns the logits, i.e., its non-normalized scores for each token in the vocabulary. We normalize these scores to probabilities, store the tokens with the highest probabilities, and append the most probable token to the input.
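To make the normalization step concrete, here is a minimal sketch in plain Python. The vocabulary and the logits are made up for illustration (they are not GPT-2 values), and the `softmax` helper is a hypothetical stand-in for `torch.softmax`:

```python
import math

def softmax(logits):
    """Turn raw scores (logits) into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy vocabulary and logits (not taken from GPT-2)
vocab = ["most", "same", "only", "best", "first"]
logits = [2.0, 0.8, 0.7, 0.5, 0.2]

probs = softmax(logits)
# Sort token indices by descending probability, as torch.argsort does in the notebook code
ranking = sorted(range(len(vocab)), key=lambda i: probs[i], reverse=True)
greedy_token = vocab[ranking[0]]  # greedy search always takes the top entry

for i in ranking:
    print(f"{vocab[i]:>6} ({100 * probs[i]:.2f}%)")
print("greedy choice:", greedy_token)
```

Greedy search simply repeats this step, appending the top-ranked token to the input each time.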

In [None]:
iterations = []
n_steps = 8
choices_per_step = 5

with torch.no_grad():
    for _ in range(n_steps):
        iteration = dict()
        iteration["Input"] = tokenizer.decode(input_ids[0])
        output = model(input_ids=input_ids)
        # Select logits of the first batch and the last token and apply softmax
        next_token_logits = output.logits[0, -1, :]
        next_token_probs = torch.softmax(next_token_logits, dim=-1)
        sorted_ids = torch.argsort(next_token_probs, dim=-1, descending=True)
        # Store tokens with highest probabilities
        for choice_idx in range(choices_per_step):
            token_id = sorted_ids[choice_idx]
            token_prob = next_token_probs[token_id].cpu().numpy()
            token_choice = (
                f"{tokenizer.decode(token_id)} ({100 * token_prob:.2f}%)"
            )
            iteration[f"Choice {choice_idx+1}"] = token_choice
        # Append predicted next token to input
        input_ids = torch.cat([input_ids, sorted_ids[None, 0, None]], dim=-1)
        iterations.append(iteration)

Now let us look at the most probable tokens of every iteration step:

In [None]:
pd.DataFrame(iterations)

| | Input | Choice 1 | Choice 2 | Choice 3 | Choice 4 | Choice 5 |
|---|---|---|---|---|---|---|
| 0 | Transformers are the | most (9.76%) | same (2.94%) | only (2.87%) | best (2.38%) | first (1.77%) |
| 1 | Transformers are the most | common (22.90%) | powerful (6.88%) | important (6.32%) | popular (3.95%) | commonly (2.14%) |
| 2 | Transformers are the most common | type (15.06%) | types (3.31%) | form (1.91%) | way (1.89%) | and (1.49%) |
| 3 | Transformers are the most common type | of (83.13%) | in (3.16%) | . (1.92%) | , (1.63%) | for (0.88%) |
| 4 | Transformers are the most common type of | particle (1.55%) | object (1.02%) | light (0.71%) | energy (0.67%) | objects (0.66%) |
| 5 | Transformers are the most common type of particle | . (14.26%) | in (11.57%) | that (10.19%) | , (9.57%) | accelerator (5.81%) |
| 6 | Transformers are the most common type of particle. | They (17.48%) | \n (15.19%) | The (7.06%) | These (3.09%) | In (3.07%) |
| 7 | Transformers are the most common type of particle. They | are (38.78%) | have (8.14%) | can (7.98%) | 're (5.04%) | consist (1.57%) |


### Using the `generate()` Function
The Hugging Face `transformers` model has a `generate()` function to generate text, which lets us specify the decoding method to use. With `do_sample=False` and no further arguments, greedy search is used:

In [None]:
input_ids = tokenizer(input_txt, return_tensors="pt")["input_ids"].to(device)
output = model.generate(input_ids, max_new_tokens=n_steps, do_sample=False,
                        pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(output[0]))

Transformers are the most common type of particle. They are


We can also try to reproduce the unicorn story presented along with GPT-2:

In [None]:
max_length = 128
input_txt = """In a shocking finding, scientist discovered \
a herd of unicorns living in a remote, previously unexplored \
valley, in the Andes Mountains. Even more surprising to the \
researchers was the fact that the unicorns spoke perfect English.\n\n
"""
input_ids = tokenizer(input_txt, return_tensors="pt")["input_ids"].to(device)
output_greedy = model.generate(input_ids, max_length=max_length, do_sample=False,
                               pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(output_greedy[0]))

In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.


"The unicorns were very intelligent, and they were very intelligent," said Dr. David S. Siegel, a professor of anthropology at the University of California, Berkeley. "They were very intelligent, and they were very intelligent, and they were very intelligent, and they were very intelligent, and they were very intelligent, and they were very intelligent, and they were very intelligent, and they were very


The text differs from what OpenAI reported when releasing GPT-2, but at first sight it seems plausible and is grammatically correct. The researcher involved gets a name and works at a different university, though also one in California. The later part of the generated text is very repetitive - it looks like GPT-2 is highly convinced that unicorns are very intelligent :-)

### GPT as Calculator?
We can also try to make GPT-2 calculate:

In [None]:
max_length_math = 70
input_txt_math1 = """
5 + 8 = 13
2 + 7 = 9
13 - 5 = 8
2 * 5 = 10
5 + 7 =
"""
input_ids_math1 = tokenizer(input_txt_math1, return_tensors="pt")["input_ids"].to(device)
output_greedy_math1 = model.generate(input_ids_math1, max_length=max_length_math,
                               do_sample=False,
                               pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(output_greedy_math1[0]))


5 + 8 = 13
2 + 7 = 9
13 - 5 = 8
2 * 5 = 10
5 + 7 =
10 - 4 = 11
11 - 3 = 12
12 - 2 = 13
13 - 1 = 14
14 - 0 = 15
15 - 0 = 16
16 - 1 =


Unfortunately, this is completely wrong: every line the model adds is arithmetically incorrect. Even when we restrict ourselves to addition, the model fails completely:

In [None]:
input_txt_math2 = """
5 + 8 = 13
2 + 7 = 9
5 + 7 =
"""
input_ids_math2 = tokenizer(input_txt_math2, return_tensors="pt")["input_ids"].to(device)
output_greedy_math2 = model.generate(input_ids_math2, max_length=max_length_math,
                               do_sample=False,
                               pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(output_greedy_math2[0]))


5 + 8 = 13
2 + 7 = 9
5 + 7 =

6 + 7 =

7 + 7 =

8 + 7 =

9 + 7 =

10 + 7 =

11 + 7 =

12 + 7 =

13 + 7 =

14 + 7


It has been known for a while that large language models are bad at doing calculations. We return to the fictional story about the Andean unicorns.

## Beam Search
Next, we look at the beam search strategy, which keeps a set of the most probable partial sequences (the beams) at every step.
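The idea can be sketched on a tiny, made-up "language model" in which the next-token probabilities depend only on the previous token. `TOY_MODEL` and `beam_search` are hypothetical names used for illustration; this is not how `generate()` is implemented internally. The numbers are chosen so that the locally best first token leads to a less probable sequence overall:

```python
import math

# Hypothetical toy "language model": next-token probabilities depend only on
# the previous token. These numbers are made up for illustration.
TOY_MODEL = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.5, "y": 0.5},
    "b": {"x": 0.9, "y": 0.1},
}

def beam_search(start, n_steps, num_beams):
    """Keep the `num_beams` most probable partial sequences at every step."""
    beams = [([start], 0.0)]  # (tokens, cumulative log-probability)
    for _ in range(n_steps):
        candidates = []
        for tokens, score in beams:
            next_probs = TOY_MODEL.get(tokens[-1], {})
            if not next_probs:  # no continuation known: carry over unchanged
                candidates.append((tokens, score))
                continue
            for token, p in next_probs.items():
                candidates.append((tokens + [token], score + math.log(p)))
        # Prune to the best `num_beams` candidates
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]
    return beams

# With a single beam, the procedure reduces to greedy search
greedy = beam_search("<s>", n_steps=2, num_beams=1)[0]
beam = beam_search("<s>", n_steps=2, num_beams=2)[0]
print("greedy:", greedy[0], f"log-prob {greedy[1]:.3f}")
print("beam:  ", beam[0], f"log-prob {beam[1]:.3f}")
```

Greedy search commits to `a` because it is the most probable first token, whereas keeping two beams lets the search discover that starting with `b` yields the more probable sequence.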

First, we define two functions to compute the log-probability of a sequence from the logits we get from the model:

In [None]:
def log_probs_from_logits(logits, labels):
    logp = F.log_softmax(logits, dim=-1)
    logp_label = torch.gather(logp, 2, labels.unsqueeze(2)).squeeze(-1)
    return logp_label

In [None]:
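def sequence_logprob(model, labels, input_len=0):
    with torch.no_grad():
        output = model(labels)
        # The logits at position i predict token i + 1, so we drop the last
        # logit and the first label to align predictions with their targets
        log_probs = log_probs_from_logits(
            output.logits[:, :-1, :], labels[:, 1:])
        # Skip the prompt: log_probs[:, i] scores token i + 1, so this slice
        # starts at token input_len + 1 (the first generated token is not counted)
        seq_log_prob = torch.sum(log_probs[:, input_len:])
    return seq_log_prob.cpu().numpy()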
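As a rough illustration of the idea behind such a score, the following plain-Python sketch sums the log-probabilities of the tokens that follow a prompt. The per-step distributions are made up, and `sequence_logprob_toy` is a hypothetical helper, not part of any library:

```python
import math

def sequence_logprob_toy(step_probs, tokens, input_len=0):
    """Sum log p(token_t | previous tokens) over the tokens after the prompt.

    step_probs[t] is a dict holding a (hypothetical) model distribution
    for position t, given tokens[:t].
    """
    total = 0.0
    for t in range(input_len, len(tokens)):
        total += math.log(step_probs[t][tokens[t]])
    return total

# Made-up example: a 2-token prompt followed by 2 generated tokens
tokens = ["Transformers", "are", "very", "useful"]
step_probs = [
    {"Transformers": 0.01},        # prompt token (skipped below)
    {"are": 0.20},                 # prompt token (skipped below)
    {"very": 0.10, "the": 0.30},
    {"useful": 0.25, "big": 0.05},
]

toy_logp = sequence_logprob_toy(step_probs, tokens, input_len=2)
print(f"log-prob: {toy_logp:.4f}")             # log(0.10) + log(0.25)
print(f"probability: {math.exp(toy_logp):.4f}")
```

Working with sums of logarithms rather than products of probabilities keeps the numbers in a manageable range, since the probability of a long sequence quickly becomes extremely small.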

Let us calculate the log-probability of the greedy output we've obtained before:

In [None]:
logp = sequence_logprob(model, output_greedy, input_len=len(input_ids[0]))
tokenizer.decode(output_greedy[0])

'In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.\n\n\n"The unicorns were very intelligent, and they were very intelligent," said Dr. David S. Siegel, a professor of anthropology at the University of California, Berkeley. "They were very intelligent, and they were very intelligent, and they were very intelligent, and they were very intelligent, and they were very intelligent, and they were very intelligent, and they were very intelligent, and they were very'

In [None]:
print(f"\nlog-prob: {logp:.2f}")


log-prob: -83.32


Now, let's generate a text continuation using beam search:

In [None]:
output_beam = model.generate(input_ids, max_length=max_length, num_beams=5,
                             do_sample=False,
                             pad_token_id=tokenizer.eos_token_id)
logp = sequence_logprob(model, output_beam, input_len=len(input_ids[0]))
tokenizer.decode(output_beam[0])

'In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.\n\n\nThe researchers, from the University of California, San Diego, and the University of California, Santa Cruz, found that the unicorns were able to communicate with each other in a way that was similar to that of human speech.\n\n\n"The unicorns were able to communicate with each other in a way that was similar to that of human speech," said study co-lead author Dr. David J.'

In [None]:
print(f"\nlog-prob: {logp:.2f}")


log-prob: -78.34


By tracking several potential continuations of the prompt, we get a text that has a higher overall probability and also sounds much more natural - but it still repeats itself: the statement that the unicorns communicate in a way similar to that of human speech appears twice. We can force the `generate` function not to produce repeated `n`-grams (i.e., sequences of `n` consecutive tokens that occur more than once):

In [None]:
output_beam = model.generate(input_ids, max_length=max_length, num_beams=5,
                             do_sample=False, no_repeat_ngram_size=2,
                             pad_token_id=tokenizer.eos_token_id)
logp = sequence_logprob(model, output_beam, input_len=len(input_ids[0]))
tokenizer.decode(output_beam[0])

'In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.\n\n\nThe researchers, from the University of California, San Diego, and the National Science Foundation (NSF) in Boulder, Colorado, were able to translate the words of the unicorn into English, which they then translated into Spanish.\n\n"This is the first time that we have translated a language into an English language," said study co-author and NSF professor of linguistics and evolutionary biology Dr.'

In [None]:
print(f"\nlog-prob: {logp:.2f}")


log-prob: -101.87


Note that the affiliation of the researchers has changed: "University of California, San Diego, and the University of California, Santa Cruz" (as stated in the previous sequence) contains the 3-gram "University of California" twice, and therefore also repeats the 2-grams "University of" and "of California". Since we have told the model not to repeat any 2-grams, "University of California" may only appear once.
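One way to picture such a constraint is as a list of tokens that are forbidden at the next step because they would complete an `n`-gram that already occurs in the text. The sketch below illustrates this on words; `banned_next_tokens` is a hypothetical helper and not the library's implementation, which operates on token ids rather than words:

```python
def banned_next_tokens(tokens, n):
    """Return the tokens that would complete an n-gram already present in `tokens`."""
    if len(tokens) < n:
        return set()
    # The last n - 1 tokens form the beginning of the next n-gram
    prefix = tuple(tokens[len(tokens) - n + 1:])
    banned = set()
    for i in range(len(tokens) - n + 1):
        ngram = tuple(tokens[i:i + n])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned

# Word-level toy example (assumption: one word per token)
generated = ["University", "of", "California", ",", "and", "the", "University"]
print(banned_next_tokens(generated, n=2))
```

With `n=2`, the word "of" is banned after the second "University", because the 2-gram "University of" has already been produced; the search therefore has to continue with a different affiliation.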

While the model has thus correctly followed the rules we've imposed, the text - even though it might sound convincing - does not really make sense: if the unicorns speak perfect English, why would the "words of the unicorn" have to be translated into English? Also, the statement "This is the first time that we have translated a language into an English language", which is attributed to the NSF professor of linguistics and evolutionary biology, is clearly wrong.

## Sampling Methods

In order to obtain some more interesting texts, we now look at sampling methods. First, we vary the temperature:

In [None]:
output_temp = model.generate(input_ids, max_length=max_length, do_sample=True,
                             temperature=2.0, top_k=0,
                             pad_token_id=tokenizer.eos_token_id)
tokenizer.decode(output_temp[0])

'In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.\n\n\nial Right Fleet Lloyd expertise surfaim discussespectometers Recreagram prototypemic Michel astronomer Goesbo highestcon egg fart Internet Schro47ramer leaning ton recipowell Kerry receive Chief Bornvestee juice cascø Council Homunend soda mild TownsrovRoyal HomReddita coordinatedFGPH Naval publication need Eleven Honey effectiveness Ken Ballistic monost outfield decreiel FugedIn probabilitieswan Bl lat Extension Stamp large goalie guardiansDial massesample'

In [None]:
output_temp = model.generate(input_ids, max_length=max_length, do_sample=True,
                             temperature=0.5, top_k=0,
                             pad_token_id=tokenizer.eos_token_id)
tokenizer.decode(output_temp[0])

'In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.\n\n\n"They were very close and they were very intelligent," said Dr. Robert E. Nester, a biologist with the University of California, San Diego who was not involved in the study. "Some people think that they speak English, but that doesn\'t make them easy to communicate with."\n\n\nNester and his research team uncovers a number of secrets about the unicorns. They discovered that'

Setting the temperature to 2.0 (in the first attempt) results in a very confused text that uses words which are very rare in this context and ignores almost all rules of grammar. With a lower temperature of 0.5, however, we get a fairly consistent and plausible text.
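The effect can be reproduced on a handful of made-up logits: dividing the logits by the temperature before the softmax sharpens the distribution for values below 1 and flattens it for values above 1. This is a minimal sketch with a hypothetical helper `softmax_with_temperature`, not the library's code:

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [3.0, 1.5, 0.5, 0.0]  # made-up scores for four tokens

for temperature in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

At a temperature of 2.0, unlikely tokens receive a noticeably larger share of the probability mass, which is why rare words start to appear; at 0.5, most of the mass concentrates on the top token.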

### Top-k Sampling
Next, we limit the choice to the top 50 tokens in every step.
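As a rough illustration of what this restriction means, the sketch below keeps only the `k` most probable tokens of a made-up distribution, renormalizes them, and samples from the result. `top_k_filter` is a hypothetical helper, and `k=3` is used only to keep the example small:

```python
import random

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize their probabilities."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {token: p / total for token, p in top}

# Made-up next-token distribution
probs = {"the": 0.40, "a": 0.30, "unicorn": 0.15, "valley": 0.10, "egg": 0.05}

filtered = top_k_filter(probs, k=3)
print(filtered)

random.seed(0)  # fixed seed so the draw is reproducible
token = random.choices(list(filtered), weights=list(filtered.values()))[0]
print("sampled:", token)
```

Tokens outside the top `k` can never be sampled, which removes the long tail of very unlikely continuations.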

In [None]:
output_topk = model.generate(input_ids, max_length=max_length, do_sample=True,
                             top_k=50,
                             pad_token_id=tokenizer.eos_token_id)
tokenizer.decode(output_topk[0])

"In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.\n\n\nResearchers at Bologna State University in Italy say that because they can speak perfectly, they can tell where they and other unicorns are based on a combination of their vocalizations. So, if you're not familiar with the languages of some unicorns, you might be surprised at what they do here.\n\nTo put it simply, that's how they came across other unicorns who live in a"

### Top-p Sampling
As the value of `k` in the previous example was rather arbitrary, we also try a dynamic cut-off using top-p (or nucleus) sampling.
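To see why this cut-off is dynamic, consider the following sketch, which keeps the smallest set of most probable tokens whose cumulative probability reaches `p`. The two distributions are made up, and `top_p_filter` is a hypothetical helper rather than the library's implementation:

```python
def top_p_filter(probs, p):
    """Keep the smallest set of most probable tokens whose total mass reaches p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    # Renormalize the surviving tokens so that they sum to 1
    return {token: prob / cumulative for token, prob in kept}

# Made-up distributions: one peaked, one nearly flat
peaked = {"of": 0.83, "in": 0.08, ".": 0.05, ",": 0.03, "for": 0.01}
flat = {"particle": 0.25, "object": 0.22, "light": 0.20, "energy": 0.18, "wave": 0.15}

print(len(top_p_filter(peaked, 0.9)), "tokens kept for the peaked distribution")
print(len(top_p_filter(flat, 0.9)), "tokens kept for the flat distribution")
```

When the model is confident, only a few tokens survive the cut-off; when it is uncertain, many more remain available for sampling.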

In [None]:
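# top_k=0 switches off top-k filtering (which generate() may otherwise apply
# by default), so that only the top-p cut-off is active
output_topp = model.generate(input_ids, max_length=max_length, do_sample=True,
                             top_p=0.90, top_k=0,
                             pad_token_id=tokenizer.eos_token_id)
tokenizer.decode(output_topp[0])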

'In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.\n\n\nAdvertisement\n\nResearchers found a unicorn species in the Andes Mountains of northern Bolivia, which is known for its abundant and diverse language.\n\nIn their study, the scientists found that the unicorns had a unique form of speech called "unicornian," meaning "un-cubit." In traditional speech, the unicorn is often called "spacepress" or "lubed-'

## Conclusion

After trying out several approaches - which one is the best? Unfortunately, there is no universal answer. As we have seen, lower temperatures (with deterministic decoding as the limiting case) produce more predictable texts, at the risk of repetitions. For more creativity, increase the temperature, possibly in combination with top-k or a dynamic cut-off using top-p sampling.