# Exercise: Generating one token at a time

In this exercise, we will see how an LLM generates text one token at a time, using the previous tokens to predict the next one. We will use the pre-trained `gpt2` model from the `transformers` library.


## Step 1. Load a tokenizer and a model

First we load a tokenizer and a model from Hugging Face's `transformers` library. A tokenizer
splits a string into a list of tokens and maps each token to an integer ID, so a sentence
becomes a list of numbers. The model in this case is the GPT-2 language model.

In [ ]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = "Udacity is the best place to learn about generative"

inputs = tokenizer(text, return_tensors="pt")
inputs["input_ids"]

### Question: What do you think each token represents here? Are they words? Characters? Or something else?

Spend a few minutes thinking about what the correct answer might be and why. You can write your thoughts in the cell below.



### (Write your answer here. Double-click to edit.)

## Step 2. Examine the tokenization

Now let's look at how the sentence was actually tokenized.

In [ ]:
# Show how the sentence is tokenized
import pandas as pd

def show_tokenization(inputs):
    # One row per token: the integer id and the text it decodes to
    return pd.DataFrame([(token_id, tokenizer.decode(token_id)) for token_id in inputs["input_ids"][0]], columns=["id", "token"])

show_tokenization(inputs)


### Subword tokenization

The interesting thing is that tokens in this case are neither just letters nor just words. Sometimes shorter words are represented by a single token, but other times a single token represents a part of a word, or even a single letter. This is called subword tokenization.
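
To make the idea concrete without loading a model, here is a minimal sketch of subword splitting. It is not how GPT-2 works internally: GPT-2 uses byte-pair encoding with learned merges, while this toy uses greedy longest-match against a made-up vocabulary. The names `TOY_VOCAB` and `toy_tokenize` are hypothetical and exist only for this illustration.

```python
# Toy greedy longest-match subword tokenizer.
# TOY_VOCAB and its ids are made up for illustration; they are not GPT-2's vocabulary.
TOY_VOCAB = {"gener": 0, "ative": 1, "learn": 2, "ing": 3, "the": 4, " ": 5}


def toy_tokenize(text, vocab=TOY_VOCAB):
    """Split text into the longest vocabulary pieces, left to right.

    Characters not covered by any vocabulary entry become single-character tokens.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest possible piece first, then shrink.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No vocabulary entry matches: fall back to one character.
            tokens.append(text[i])
            i += 1
    return tokens


print(toy_tokenize("the generative learning"))
```

In this toy, "the" stays whole while "generative" and "learning" are each split into two pieces, which mirrors the mix of whole-word and partial-word tokens in the table above.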

## Step 3. Calculate the probability of the next token

Now let's use PyTorch to calculate the probability of the next token given the previous ones.

In [ ]:
# Calculate the probabilities for the next token for all possible choices. We show the
# top 5 choices and the corresponding words or subwords for these tokens.

with torch.no_grad():
  logits = model(**inputs).logits[:, -1, :]
  probabilities = torch.nn.functional.softmax(logits[0], dim=-1)

def show_next_token_choices(probabilities, top_n=5):
  # Skip tokens whose probability is exactly zero, then keep the top_n most likely
  return pd.DataFrame(
    [(token_id, tokenizer.decode(token_id), p.item()) for token_id, p in enumerate(probabilities) if p.item()],
    columns=["id", "token", "p"]
  ).sort_values("p", ascending=False)[:top_n]

show_next_token_choices(probabilities)

Interesting! The model thinks that the most likely next token is "programming", but not by a wide margin.
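
If the softmax step feels opaque, the following sketch shows the same calculation in plain Python on three made-up logits. The values in `toy_logits` are assumptions chosen for illustration, not real GPT-2 outputs, and the `softmax` helper here is a hand-written stand-in for `torch.nn.functional.softmax`.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() from overflowing and does not change the result.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for three candidate tokens (not real GPT-2 outputs).
toy_logits = {" programming": 2.0, " models": 1.5, " art": 0.5}
toy_probs = dict(zip(toy_logits, softmax(list(toy_logits.values()))))

for token, p in sorted(toy_probs.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{token!r}: {p:.3f}")
```

Larger logits map to larger probabilities, and the gap between two probabilities depends on the difference between their logits rather than on the absolute values.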

In [ ]:
# Obtain the token id for the most probable next token
next_token_id = torch.argmax(probabilities).item()
next_token_id

In [ ]:
# We append the most likely token to the text.
text = text + tokenizer.decode(next_token_id)
text

## Step 4. Generate some more tokens

Now write the code you need to generate the next token. We are going to take the LLM approach to learning. That is, we will mask certain parts of the code with `<MASK>` and ask you to fill it in. Don't worry, we will give you hints along the way!

Also, all of the code will come verbatim from earlier cells in this notebook. Feel free to scroll up and copy-and-paste the code you need.

In [ ]:
# NOW IT'S YOUR TURN: FILL IN THE PARTS LABELLED `<MASK>`

# 0. We start with the text
print(text)

# 1. Convert the text to tokens
# Hint: use the tokenizer
inputs = <MASK>

# 2. Calculate the probabilities for the next token for all possible choices.
# Hint: A softmax converts the logits to probabilities
with torch.no_grad():
  logits = model(**inputs).logits[:, -1, :]
  probabilities = <MASK>(logits[0], dim=-1)

# 3. Obtain the token id for the most probable next token.
# Hint: argmax returns the index of the largest value
next_token_id = <MASK>(probabilities).item()

# 4. Decode the most likely token and append it to the text
text = text + tokenizer.decode(next_token_id)

You can rerun that cell with Ctrl+Enter as many times as you like to generate more text. This process of starting with a sequence of tokens and repeatedly predicting the next one is called auto-regressive generation. After running it 30+ times, you will get output such as the following:

```
Udacity is the best place to learn about generative programming.


The following is a list of the top 10 most popular programming languages.


1. C#


2. Java
```
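
To see the auto-regressive loop in isolation, here is a minimal sketch with no model at all. A hypothetical lookup table, `TOY_NEXT`, stands in for GPT-2 and its probabilities are made up. Unlike a real LLM, which conditions on all previous tokens, this toy looks only at the last one.

```python
# Hypothetical lookup table standing in for the model: it maps the last token
# to made-up probabilities for the next token.
TOY_NEXT = {
    "the": {"cat": 0.6, "mat": 0.4},
    "cat": {"sat": 0.9, "the": 0.1},
    "sat": {"on": 1.0},
    "on": {"the": 1.0},
}


def toy_generate(tokens, steps):
    """Greedy auto-regressive generation: append the most probable next token, repeat."""
    tokens = list(tokens)  # copy so the caller's list is not modified
    for _ in range(steps):
        choices = TOY_NEXT.get(tokens[-1])
        if not choices:
            break  # no known continuation for this token
        # Same idea as argmax over the probabilities in the notebook cell.
        next_token = max(choices, key=choices.get)
        tokens.append(next_token)  # the output becomes part of the next input
    return tokens


print(" ".join(toy_generate(["the"], steps=8)))
```

Because the most probable token is always chosen, this toy falls into a cycle after a few steps.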

## Step 5. Generate the rest of the tokens

We can use the `generate` method to generate the rest of the tokens in a single command. Check this out!

In [ ]:
# We can use the `generate` method to do this for us.
# Play around with this and generate some more text!

text = "Generative AI"
inputs = tokenizer(text, return_tensors="pt")
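# max_length counts the prompt tokens plus the newly generated ones.
# GPT-2 has no padding token, so we reuse the end-of-sequence token for padding.
output = model.generate(**inputs, max_length=50, pad_token_id=tokenizer.eos_token_id)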

# Show the generated text
print(tokenizer.decode(output[0]))

You'll notice that GPT-2 is not nearly as sophisticated as later models like GPT-4, which you may have experience using. It often repeats itself and doesn't make much sense. But it's still pretty impressive that it can generate text that looks like English.
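
One likely contributor to the repetition is that the `generate` call above does not enable sampling, so it picks the most probable token at every step (greedy decoding). `generate` also accepts `do_sample=True` together with `top_k`, which draws the next token at random from the most likely candidates. The sketch below illustrates that idea in plain Python on made-up probabilities. The helper `sample_top_k` and the values in `toy_probs` are hypothetical, and this is not the library's implementation.

```python
import random


def sample_top_k(probs, k, rng):
    """Sample a token from the k most probable entries of probs (a token -> probability dict)."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    tokens = [t for t, _ in top]
    weights = [p for _, p in top]
    # random.choices renormalizes the weights, so they need not sum to 1.
    return rng.choices(tokens, weights=weights, k=1)[0]


# Made-up next-token probabilities (not real GPT-2 outputs).
toy_probs = {" AI": 0.5, " art": 0.3, " models": 0.15, " zebra": 0.05}
rng = random.Random(0)  # fixed seed so reruns give the same draws
print([sample_top_k(toy_probs, k=3, rng=rng) for _ in range(5)])
```

With `k=1` this reduces to greedy decoding. Larger values of `k` trade some predictability for variety.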