# Language Generation using Pretrained Models

In this notebook, we will finally look at language generation!

## Preparations
Import packages, and customize settings:

In [1]:
# Libraries for deep learning. In the background, we will use torch / pytorch
import torch
import torch.nn.functional as F

# Libraries from huggingface to easily interact with pretrained models
from transformers import AutoTokenizer, AutoModelForCausalLM

In [2]:
# general Python libraries:
import pandas as pd

In [3]:
# show up to 80 characters per column (longer texts are truncated with "..."):
pd.set_option('display.max_colwidth', 80)

In [4]:
# what device is this notebook running on?
device = "cuda" if torch.cuda.is_available() else "cpu"

For this notebook, we will use the `GPT2` model, which has been open-sourced by OpenAI. We use the corresponding tokenizer and causal language model:

In [5]:
model_name = "gpt2"
# if you have a powerful computer and want to use a larger model, you can use the following one: model_name = "gpt2-xl"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)


## Greedy Search Decoding

We start with an example of greedy decoding. Although the `generate` function of the pretrained `model` offers an easy way to use this decoding strategy, we first take a more granular look at how it works.

In [6]:
input_txt = "Transformers are the"
input_ids = tokenizer(input_txt, return_tensors="pt")["input_ids"].to(device)

### An in-depth Look at Greedy Search
Starting from the input text "Transformers are the", we repeatedly call the model. Among other things, the model returns the logits, i.e. its unnormalized scores for every token in the vocabulary as the next token. We normalize these scores to probabilities with a softmax, record the tokens with the highest probabilities, and append the most probable token to the input.
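
The normalization and selection step can be illustrated without the model. The following is a minimal sketch in plain Python; the four-token vocabulary and the logit values are made up for illustration and do not come from GPT-2.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical vocabulary and logits (not real GPT-2 values)
vocab = [" most", " best", " only", " first"]
logits = [2.0, 1.0, 0.5, 0.1]

probs = softmax(logits)
# Rank token indices from most to least probable
ranked = sorted(range(len(vocab)), key=lambda i: probs[i], reverse=True)
# Greedy decoding always picks the top-ranked token
greedy_token = vocab[ranked[0]]
print(greedy_token, [f"{100 * p:.2f}%" for p in probs])
```

The loop in the next cell does the same thing with the real logits, once per generated token.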

In [9]:
"""
This code:

- Starts with some initial input_ids.
- Runs the model for n_steps (8 tokens).
- At each step:
    - Shows the current input text.
    - Records the top choices_per_step (5) possible next tokens with probabilities.
    - Picks the most likely token and appends it to input_ids.
    - Stores everything in iterations.
"""
iterations = []       # stores the results of each decoding step for later inspection
n_steps = 8           # number of tokens to generate
choices_per_step = 5  # number of top candidates to record at each step

with torch.no_grad():
    for _ in range(n_steps):
        iteration = dict() 
        iteration["Input"] = tokenizer.decode(input_ids[0])
        output = model(input_ids=input_ids)
        # Select logits of the first batch and the last token and apply softmax
        next_token_logits = output.logits[0, -1, :]
        next_token_probs = torch.softmax(next_token_logits, dim=-1)
        sorted_ids = torch.argsort(next_token_probs, dim=-1, descending=True)
        # Store tokens with highest probabilities
        for choice_idx in range(choices_per_step):
            token_id = sorted_ids[choice_idx]
            token_prob = next_token_probs[token_id].cpu().numpy()
            token_choice = (
                f"{tokenizer.decode(token_id)} ({100 * token_prob:.2f}%)"
            )
            iteration[f"Choice {choice_idx+1}"] = token_choice
        # Append predicted next token to input
        input_ids = torch.cat([input_ids, sorted_ids[None, 0, None]], dim=-1)
        iterations.append(iteration)

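Now let us look at the most probable tokens of every iteration step. Note that the `Input` column below already starts from an extended text: the loop overwrites `input_ids`, so running the cell a second time (as the execution counter suggests happened here) continues from where the first run stopped. Re-run cell [6] first to start from the original prompt.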

In [10]:
pd.DataFrame(iterations)

| | Input | Choice 1 | Choice 2 | Choice 3 | Choice 4 | Choice 5 |
|---|---|---|---|---|---|---|
| 0 | Transformers are the most common type of particle. They are | composed (4.76%) | the (4.36%) | used (2.98%) | also (2.75%) | usually (2.48%) |
| 1 | Transformers are the most common type of particle. They are composed | of (88.87%) | by (1.85%) | mainly (1.42%) | primarily (1.21%) | mostly (1.08%) |
| 2 | Transformers are the most common type of particle. They are composed of | a (12.19%) | two (8.91%) | particles (5.12%) | many (4.00%) | three (3.59%) |
| 3 | Transformers are the most common type of particle. They are composed of a | number (8.59%) | single (6.34%) | small (3.40%) | set (2.98%) | mixture (2.89%) |
| 4 | Transformers are the most common type of particle. They are composed of a nu... | of (99.61%) | ( (0.05%) | and (0.04%) | , (0.03%) | or (0.03%) |
| 5 | Transformers are the most common type of particle. They are composed of a nu... | different (11.71%) | particles (8.65%) | small (4.11%) | components (2.36%) | elements (2.23%) |
| 6 | Transformers are the most common type of particle. They are composed of a nu... | particles (22.86%) | types (18.60%) | elements (3.09%) | materials (2.96%) | components (2.88%) |
| 7 | Transformers are the most common type of particle. They are composed of a nu... | , (30.90%) | that (10.66%) | . (8.76%) | and (6.41%) | ( (5.63%) |


### Using the `generate()` Function
Hugging Face `transformers` models provide a `generate()` function for text generation, which lets us specify the decoding method to be used. With `do_sample=False` and no further arguments, it performs greedy search:

In [11]:
input_ids = tokenizer(input_txt, return_tensors="pt")["input_ids"].to(device)
output = model.generate(input_ids, max_new_tokens=n_steps, do_sample=False,
                        pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(output[0]))

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Transformers are the most common type of particle. They are


**Exercise**: Use a starting text of your choice that should be continued by the model. Vary the maximum number of new tokens that should be generated by setting different values to `n_steps`. For example, you might use "The university is a place" as prompt. Or, can you reproduce the unicorn story presented along with GPT-2?

In [36]:
n_steps = 40
input_txt = r"In a shocking finding, scientist discovered a herd of unicorns living "
input_ids = tokenizer(input_txt, return_tensors="pt")["input_ids"].to(device)
output_greedy = model.generate(input_ids, max_new_tokens=n_steps, do_sample=False,
                         pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(output_greedy[0]))

In a shocking finding, scientist discovered a herd of unicorns living  in a forest in the middle of the forest.
The researchers found that the unicorns were not only able to live in the forest, but also to live in the forest itself.
The


### GPT as Calculator?


In [37]:
max_length_math = 70
input_txt_math1 = """
5 + 8 = 13
2 + 7 = 9
13 - 5 = 8
2 * 5 = 10
5 + 7 =
"""
input_ids_math1 = tokenizer(input_txt_math1, return_tensors="pt")["input_ids"].to(device)
output_greedy_math1 = model.generate(input_ids_math1, max_length=max_length_math,
                               do_sample=False,
                               pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(output_greedy_math1[0]))


5 + 8 = 13
2 + 7 = 9
13 - 5 = 8
2 * 5 = 10
5 + 7 =
10 - 4 = 11
11 - 3 = 12
12 - 2 = 13
13 - 1 = 14
14 - 0 = 15
15 - 0 = 16
16 - 1 =


Unfortunately, this is completely wrong. Even when we restrict ourselves to addition, the model fails completely:

In [38]:
input_txt_math2 = """
5 + 8 = 13
2 + 7 = 9
5 + 7 =
"""
input_ids_math2 = tokenizer(input_txt_math2, return_tensors="pt")["input_ids"].to(device)
output_greedy_math2 = model.generate(input_ids_math2, max_length=max_length_math,
                               do_sample=False,
                               pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(output_greedy_math2[0]))


5 + 8 = 13
2 + 7 = 9
5 + 7 =

6 + 7 =

7 + 7 =

8 + 7 =

9 + 7 =

10 + 7 =

11 + 7 =

12 + 7 =

13 + 7 =

14 + 7


Let's look at it step by step.

In [39]:
input_txt_math2 = """
5 + 8 = 13
2 + 7 = 9
5 + 7 =
"""
input_ids_math2 = tokenizer(input_txt_math2, return_tensors="pt")["input_ids"].to(device)

iterations = []
n_steps = 8
choices_per_step = 5

with torch.no_grad():
    for _ in range(n_steps):
        iteration = dict()
        iteration["Input"] = tokenizer.decode(input_ids_math2[0])
        output = model(input_ids=input_ids_math2)
        # Select logits of the first batch and the last token and apply softmax
        next_token_logits = output.logits[0, -1, :]
        next_token_probs = torch.softmax(next_token_logits, dim=-1)
        sorted_ids = torch.argsort(next_token_probs, dim=-1, descending=True)
        # Store tokens with highest probabilities
        for choice_idx in range(choices_per_step):
            token_id = sorted_ids[choice_idx]
            token_prob = next_token_probs[token_id].cpu().numpy()
            token_choice = (
                f"{tokenizer.decode(token_id)} ({100 * token_prob:.2f}%)"
            )
            iteration[f"Choice {choice_idx+1}"] = token_choice
        # Append predicted next token to input
        input_ids_math2 = torch.cat([input_ids_math2, sorted_ids[None, 0, None]], dim=-1)
        iterations.append(iteration)

pd.DataFrame(iterations)

| | Input | Choice 1 | Choice 2 | Choice 3 | Choice 4 | Choice 5 |
|---|---|---|---|---|---|---|
| 0 | \n5 + 8 = 13\n2 + 7 = 9\n5 + 7 =\n | \n (7.88%) | 5 (7.23%) | 6 (6.93%) | 2 (4.25%) | 1 (3.99%) |
| 1 | \n5 + 8 = 13\n2 + 7 = 9\n5 + 7 =\n\n | 6 (10.38%) | 5 (7.82%) | 9 (7.67%) | 7 (7.04%) | 8 (6.90%) |
| 2 | \n5 + 8 = 13\n2 + 7 = 9\n5 + 7 =\n\n6 | + (79.00%) | - (3.29%) | \n (2.65%) | = (2.07%) | * (0.63%) |
| 3 | \n5 + 8 = 13\n2 + 7 = 9\n5 + 7 =\n\n6 + | 7 (19.21%) | 8 (17.52%) | 6 (12.56%) | 5 (8.29%) | 4 (7.61%) |
| 4 | \n5 + 8 = 13\n2 + 7 = 9\n5 + 7 =\n\n6 + 7 | = (96.03%) | + (1.24%) | . (0.20%) | = (0.18%) | / (0.15%) |
| 5 | \n5 + 8 = 13\n2 + 7 = 9\n5 + 7 =\n\n6 + 7 = | \n (18.15%) | 9 (8.29%) | 10 (7.92%) | 8 (6.44%) | 7 (6.16%) |
| 6 | \n5 + 8 = 13\n2 + 7 = 9\n5 + 7 =\n\n6 + 7 =\n | \n (99.96%) | . (0.01%) | ( (0.00%) | , (0.00%) | - (0.00%) |
| 7 | \n5 + 8 = 13\n2 + 7 = 9\n5 + 7 =\n\n6 + 7 =\n\n | 7 (66.48%) | 6 (7.71%) | 8 (6.14%) | 9 (3.60%) | 5 (2.75%) |


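It has been known for a while that large language models are bad at doing calculations. The table also shows why the output looks the way it does: the prompt ends with a line break after `5 + 7 =`, so the model continues on a new line instead of completing the equation, and even directly after `6 + 7 =` a line break (18.15%) is more probable than any number. We return to the fictional story about the Andean unicorns.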

## Beam Search
Next, we look at the beam search strategy, which keeps a set of the most probable partial sequences (the beams) at every step instead of committing to a single token.
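
To make the idea concrete before turning to GPT-2, here is a minimal sketch of beam search on a toy problem. `TOY_MODEL` is a hypothetical lookup table of next-token probabilities that depends only on the previous token; it stands in for the language model and its values are made up. With `num_beams=1` the procedure reduces to greedy search.

```python
import math

# Hypothetical next-token probabilities, keyed by the previous token
TOY_MODEL = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.5, "y": 0.5},
    "b": {"x": 0.9, "y": 0.1},
}

def beam_search(start, n_steps, num_beams):
    """Return the num_beams best (tokens, log-probability) pairs after n_steps."""
    beams = [([start], 0.0)]
    for _ in range(n_steps):
        candidates = []
        # Extend every partial sequence by every possible next token
        for tokens, score in beams:
            for tok, p in TOY_MODEL[tokens[-1]].items():
                candidates.append((tokens + [tok], score + math.log(p)))
        # Keep only the num_beams most probable partial sequences
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]
    return beams

greedy = beam_search("<s>", n_steps=2, num_beams=1)[0]
beam = beam_search("<s>", n_steps=2, num_beams=2)[0]
print("greedy:", greedy[0], f"p={math.exp(greedy[1]):.2f}")
print("beam:  ", beam[0], f"p={math.exp(beam[1]):.2f}")
```

Greedy search commits to "a" because it is the most probable first token, and ends up with a sequence of probability 0.30. Keeping two beams lets the search recover the sequence starting with "b", which has the higher overall probability of 0.36.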

First, we define two functions to compute the log-probability of a sequence from the logits we get from the model:

In [40]:
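def log_probs_from_logits(logits, labels):
    # Normalize the logits to log-probabilities over the vocabulary
    logp = F.log_softmax(logits, dim=-1)
    # Pick the log-probability of the token that actually occurs at each position
    logp_label = torch.gather(logp, 2, labels.unsqueeze(2)).squeeze(-1)
    return logp_label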

In [41]:
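def sequence_logprob(model, labels, input_len=0):
    with torch.no_grad():
        output = model(labels)
        # The logits at position i predict token i+1, so shift logits and labels by one
        log_probs = log_probs_from_logits(
            output.logits[:, :-1, :], labels[:, 1:])
        # Sum over the generated part only. Note: log_probs[:, i] scores token i+1,
        # so this slice starts at the second generated token.
        seq_log_prob = torch.sum(log_probs[:, input_len:])
    return seq_log_prob.cpu().numpy()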
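
As a small worked example of why we sum log-probabilities instead of multiplying probabilities, consider the sketch below. The per-token probabilities are made-up values, and `sequence_logprob_from_probs` is a hypothetical helper for illustration, not part of the notebook's code.

```python
import math

def sequence_logprob_from_probs(token_probs):
    """Log-probability of a sequence = sum of the per-token log-probabilities."""
    return sum(math.log(p) for p in token_probs)

# Every additional token adds a negative term, so longer sequences score lower
short = sequence_logprob_from_probs([0.5, 0.5])
long = sequence_logprob_from_probs([0.5, 0.5, 0.5, 0.5])
print(f"{short:.2f} {long:.2f}")

# Multiplying many small probabilities underflows to 0.0 in floating point,
# whereas the sum of their logarithms stays well-behaved
many = [0.01] * 200
product = math.prod(many)
log_sum = sequence_logprob_from_probs(many)
print(product, f"{log_sum:.2f}")
```

Because every token contributes a negative term, log-probabilities of sequences with different lengths should be compared with care.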

Let us calculate the log-probability of the greedy output we've obtained before:

In [42]:
logp = sequence_logprob(model, output_greedy, input_len=len(input_ids[0]))
tokenizer.decode(output_greedy[0])

'In a shocking finding, scientist discovered a herd of unicorns living \xa0in a forest in the middle of the forest.\nThe researchers found that the unicorns were not only able to live in the forest, but also to live in the forest itself.\nThe'

In [43]:
print(f"\nlog-prob: {logp:.2f}")


log-prob: -58.51


Now, let's generate a text continuation using beam search:

In [45]:
max_length = 40
output_beam = model.generate(input_ids, max_length=max_length, num_beams=5,
                             do_sample=False,
                             pad_token_id=tokenizer.eos_token_id)
logp = sequence_logprob(model, output_beam, input_len=len(input_ids[0]))
tokenizer.decode(output_beam[0])

'In a shocking finding, scientist discovered a herd of unicorns living \xa0in a cave in the Himalayas.\nThe researchers found that the unicorns lived in a cave in the Himalay'

In [46]:
print(f"\nlog-prob: {logp:.2f}")


log-prob: -32.49


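By tracking several potential continuations of the start sentence, we get a text with a higher log-probability (-32.49 compared to -58.51). The two values are not directly comparable, though: the greedy output contains 40 new tokens (`max_new_tokens=40`), whereas `max_length=40` limits the beam search output to 40 tokens including the prompt, and every additional token lowers the log-probability. The beam search output also still repeats itself: the second sentence ends with "in a cave in the Himalay", just like the first. We can force the `generate` function not to produce repeated `n`-grams (i.e., sequences of `n` consecutive tokens that occur more than once):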

In [47]:
output_beam = model.generate(input_ids, max_length=max_length, num_beams=5,
                             do_sample=False, no_repeat_ngram_size=2,
                             pad_token_id=tokenizer.eos_token_id)
logp = sequence_logprob(model, output_beam, input_len=len(input_ids[0]))
tokenizer.decode(output_beam[0])

'In a shocking finding, scientist discovered a herd of unicorns living \xa0in a cave in the Himalayas.\nThe researchers found that the animals had been living in caves for thousands of years'

In [48]:
print(f"\nlog-prob: {logp:.2f}")


log-prob: -36.79
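
The effect of `no_repeat_ngram_size` can be sketched in a few lines of plain Python. This is a simplified illustration of the idea, not the library's implementation: `banned_next_tokens` is a hypothetical helper, and the word-level tokens are chosen for readability (GPT-2 works on subword tokens).

```python
def banned_next_tokens(tokens, n):
    """Return the tokens that would complete an n-gram already present in tokens."""
    # The last n-1 tokens form the beginning of the next n-gram
    prefix = tuple(tokens[len(tokens) - (n - 1):])
    banned = set()
    for i in range(len(tokens) - n + 1):
        ngram = tuple(tokens[i:i + n])
        # If an earlier n-gram starts with the same n-1 tokens, ban its last token
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned

tokens = ["in", "a", "cave", "in"]
print(banned_next_tokens(tokens, n=2))
```

After "in a cave in", the token "a" is banned for `n=2`, because "in a" has already occurred. During generation, the probability of every banned token would be set to zero before the next token is chosen.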


Note that the continuation has changed: the previous sequence repeated 2-grams such as "a cave", which is no longer allowed, so the second sentence now talks about "caves" instead. The restriction also covers the prompt, which is presumably why the model now writes "the animals" rather than repeating "unicorns" (a word that GPT-2 splits into several tokens). The log-probability drops from -32.49 to -36.79, as one would expect when some of the most probable continuations are ruled out.

While the model has thus correctly followed the rules we've imposed, the text, even though it might sound convincing, does not really make sense: it confidently reports unicorns that have been living in Himalayan caves for thousands of years. The decoding strategy only determines which of the probable token sequences is selected; it does not check whether the content is true.

## Sampling Methods

In order to obtain some more interesting texts, we now look at sampling methods. First, we vary the temperature:

In [49]:
output_temp = model.generate(input_ids, max_length=max_length, do_sample=True,
                             temperature=2.0, top_k=0,
                             pad_token_id=tokenizer.eos_token_id)
tokenizer.decode(output_temp[0])

'In a shocking finding, scientist discovered a herd of unicorns living pless in white expressedrous continuing between spectors spiked burning footprints and bubbles. Temper bondingende travelers UrbanBAeries Manufactypunic seeking'

In [50]:
output_temp = model.generate(input_ids, max_length=max_length, do_sample=True,
                             temperature=0.5, top_k=0,
                             pad_token_id=tokenizer.eos_token_id)
tokenizer.decode(output_temp[0])

'In a shocking finding, scientist discovered a herd of unicorns living \xa0in a cave in a cave in the woods of the northern Indian state of Kerala.\n"We found a herd of unic'

Setting the temperature to 2.0 (in the first attempt) results in a very confused text that uses words which are very rare in this context and ignores almost all rules of grammar. With a lower temperature of 0.5, however, we get a fairly consistent and plausible text.
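
The following minimal sketch shows what the temperature does to the next-token distribution: the logits are divided by the temperature before the softmax is applied. The logit values are made up for illustration, and `softmax_with_temperature` is a hypothetical helper.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide the logits by the temperature, then normalize with a softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 2.0, 1.0, 0.0]  # hypothetical scores for four tokens

sharp = softmax_with_temperature(logits, 0.5)    # low temperature
neutral = softmax_with_temperature(logits, 1.0)  # unchanged distribution
flat = softmax_with_temperature(logits, 2.0)     # high temperature

for name, probs in [("T=0.5", sharp), ("T=1.0", neutral), ("T=2.0", flat)]:
    print(name, [f"{p:.3f}" for p in probs])
```

A low temperature concentrates the probability mass on the top token, while a high temperature flattens the distribution, so rare tokens are sampled much more often.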

**Exercise:** Vary the temperature as you like and look at the output.

In [51]:
# output_temp = ... # fill in this line
# tokenizer.decode(output_temp[0])

### Top-k Sampling
Next, we limit the choice to the 50 most probable tokens in every step.

In [52]:
output_topk = model.generate(input_ids, max_length=max_length, do_sample=True,
                             top_k=50,
                             pad_token_id=tokenizer.eos_token_id)
tokenizer.decode(output_topk[0])

'In a shocking finding, scientist discovered a herd of unicorns living \ue800 on a tropical island near South Carolina, and the scientists used a similar technique to collect the eggs.\n\n�'
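
As a minimal sketch of the idea behind top-k sampling, the code below keeps the `k` most probable tokens, renormalizes their probabilities, and samples from the result. The distribution is made up, and `top_k_filter` is a hypothetical helper rather than the library's implementation.

```python
import random

def top_k_filter(probs, k):
    """Keep the k most probable token ids and renormalize their probabilities."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

probs = [0.5, 0.3, 0.15, 0.05]  # hypothetical next-token distribution
filtered = top_k_filter(probs, k=2)
print(filtered)

# Sample the next token id from the filtered distribution
rng = random.Random(0)
sampled_id = rng.choices(list(filtered), weights=list(filtered.values()), k=1)[0]
print(sampled_id)
```

Tokens outside the top `k` can never be sampled, no matter how the probability mass is distributed among them.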

### Top-p Sampling
As the value of `k` in the previous example was rather arbitrary, we also want to try the dynamic cut-off of top-p (or nucleus) sampling.

In [53]:
output_topp = model.generate(input_ids, max_length=max_length, do_sample=True,
                             top_p=0.90,
                             pad_token_id=tokenizer.eos_token_id)
tokenizer.decode(output_topp[0])

'In a shocking finding, scientist discovered a herd of unicorns living ursines in the Sahara desert.\n\nThe researchers believe that the discovery may help explain the long-standing trend of sightings of'
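
The dynamic cut-off can be sketched as follows: we keep the smallest set of most probable tokens whose cumulative probability reaches `top_p`. The distributions are made up, and `top_p_filter` is a hypothetical helper for illustration, not the library's implementation.

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of top tokens whose cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # Renormalize the probabilities of the kept tokens
    return {i: probs[i] / cumulative for i in keep}

flat = [0.5, 0.3, 0.15, 0.05]     # hypothetical: the model is uncertain
peaked = [0.97, 0.01, 0.01, 0.01]  # hypothetical: the model is confident

print(len(top_p_filter(flat, 0.9)))    # several tokens remain
print(len(top_p_filter(peaked, 0.9)))  # a single token remains
```

Unlike top-k, the number of candidate tokens adapts to the distribution: many tokens remain when the model is uncertain, and only a few when it is confident.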

**Exercise:** By trying different values for `top_p`, you can observe how the model's output changes. Lower values of `top_p` will make the text more deterministic and focused, while higher values will allow for more diversity and creativity in the generated text. Be careful: `top_p` can only be between 0 and 1.

In [54]:
# output_topp = ... # fill in this line
# tokenizer.decode(output_topp[0])

## Experimenting with GPT2
**Exercise:**
Experiment with different prompts, and generate several outputs for the same prompt. For example:

* Can you make GPT2 invent a fairytale?
* Can you get GPT2 to give you some travel advice for, say, Paris?
* Can you get GPT2 to give you the capital of a given state? Use few-shot learning to guide GPT2 to the type of response you are looking for!
* Ask it a logical question in written text (e.g. There are two apples. I eat one. How many are left?).
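
For the few-shot exercise, a prompt could be assembled as in the sketch below. The sentence template and the helper `build_few_shot_prompt` are assumptions made for illustration; other formats may work just as well or better.

```python
def build_few_shot_prompt(examples, query):
    """Build a prompt from solved examples, followed by one unfinished line."""
    lines = [f"The capital of {state} is {capital}." for state, capital in examples]
    # The last line is left open so that the model's next tokens complete it
    lines.append(f"The capital of {query} is")
    return "\n".join(lines)

examples = [("France", "Paris"), ("Italy", "Rome"), ("Spain", "Madrid")]
prompt = build_few_shot_prompt(examples, "Germany")
print(prompt)
```

The prompt deliberately ends without a trailing space or line break. The calculator prompts above ended with a line break, and the model then continued on a new line instead of completing the equation. The resulting string can be tokenized and passed to `model.generate` in the same way as `input_txt` in the earlier cells.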

## Conclusion

After trying out several approaches - what is the best one? Unfortunately, there is no universal answer. As we have seen, lower temperatures (or, in the limit, a deterministic approach) produce more predictable texts, at the risk of repetitions. For more creativity, increase the temperature, possibly in combination with top-k or a dynamic cutoff using top-p sampling.
