# Running GPT-2 Locally

In [1]:
import torch
import xlab

When loading pretrained models in this course, we will use the [transformers](https://huggingface.co/docs/transformers/en/index) library developed by Hugging Face. This library wraps the complexity of running pretrained language models behind a single API. In the cell below, you will load the [smallest version of GPT-2](https://huggingface.co/openai-community/gpt2) with 124 million parameters. If you are interested, you can also try out [GPT-2 Medium](https://huggingface.co/openai-community/gpt2-medium) (355 million parameters), [GPT-2 Large](https://huggingface.co/openai-community/gpt2-large) (774 million parameters), and [GPT-2 XL](https://huggingface.co/openai-community/gpt2-xl) (1.5 billion parameters).

In [2]:
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")

## Tokenize Text

To feed a sequence of text to GPT-2, we first have to decide how to convert the text to numbers. Typically, we split a string of text into a sequence of tokens, each of which is assigned a number that can then be embedded into a vector. To do this, we have a few options:

1. We can assign each character its own number
2. We can assign each word or special character its own number
3. We can assign common sequences of characters their own number

Option #3 is typically the most popular, and it is the high-level approach taken in the GPT-2 paper. This approach has the advantage of requiring a smaller total number of tokens while still capturing some of the underlying structure of natural language. Specifically, the authors use a modified version of BPE (byte pair encoding) proposed [here](https://arxiv.org/pdf/1508.07909). If you are interested, more implementation details of the tokenizer can be found in the [GPT-2 paper](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf).
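To make the idea behind option #3 concrete, here is a minimal sketch of BPE-style merging on a tiny made-up corpus. It is a character-level toy, not GPT-2's actual tokenizer (which works on bytes and has extra rules); the corpus and the helper names `most_frequent_pair` and `merge_pair` are assumptions chosen for illustration.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words and return the most common one."""
    counts = Counter()
    for symbols in words:
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += 1
    return counts.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every adjacent occurrence of `pair` with a single merged symbol."""
    merged = []
    for symbols in words:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged.append(out)
    return merged

# Hypothetical toy corpus: start from single characters (option #1)...
corpus = ["low", "lower", "lowest", "slow"]
words = [list(w) for w in corpus]

# ...and apply two merge steps, each fusing the most frequent adjacent pair.
for _ in range(2):
    words = merge_pair(words, most_frequent_pair(words))

print(words)
```

After two merges, the frequent sequence "low" has become a single symbol, while rarer endings such as "er" and "est" are still split into characters.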

Time to try out the GPT-2 tokenizer! Run the cell below to see the tokenizer convert the string into a sequence of numbers:

In [3]:
text = "Barack Obama taught constitutional law at the University of"
encoded_input = tokenizer(text, return_tensors='pt')
print(encoded_input['input_ids'])

tensor([[10374,   441,  2486,  7817,  9758,  1099,   379,   262,  2059,   286]])


Let's take a look at what each of these numbers represents:

In [4]:
for token_id in encoded_input['input_ids'][0]:
    print(f'{token_id.item()}\t --> \t"{tokenizer.decode(token_id)}"')

10374	 --> 	"Bar"
441	 --> 	"ack"
2486	 --> 	" Obama"
7817	 --> 	" taught"
9758	 --> 	" constitutional"
1099	 --> 	" law"
379	 --> 	" at"
262	 --> 	" the"
2059	 --> 	" University"
286	 --> 	" of"


We can also decode the entire sequence at once. This will be helpful to remember for later!

In [5]:
tokenizer.decode(encoded_input['input_ids'][0])

'Barack Obama taught constitutional law at the University of'

### Task #1

For a given input of text, return a list of tokens in plain text. For example, for the input "Hello there gpt-2!", the function should return ['Hello', ' there', ' g', 'pt', '-', '2', '!']. Note that this is very different from splitting the text into random chunks or splitting at spaces! Tokenizers are designed to create groupings of characters that are often found together or that are significant in the structure of language. You are encouraged to play around with different examples and observe how smart the tokenizer can be!

In [6]:
# estimated time to complete: ~3 minutes
def plain_text_tokens(prefix):
    """Tokenizes a text prefix into individual token strings.
    
    Args:
        prefix (str): The input text string to be tokenized.
        
    Returns:
        list[str]: A list of individual tokens as strings. Each token represents
            how the tokenizer splits the input text.
            
    Example:
        >>> plain_text_tokens("Hello there gpt-2!")
        ['Hello', ' there', ' g', 'pt', '-', '2', '!']
    """
    rv = []
    ######## YOUR CODE HERE ########
    encoded_input = tokenizer(prefix, return_tensors='pt')
    for i in encoded_input['input_ids'][0]:
        rv.append(tokenizer.decode(i))
    return rv

# test out your implementation on different inputs to get a sense of how the tokenizer works!
print(plain_text_tokens("Hello there gpt-2!"))
print(plain_text_tokens("https://xrisk.uchicago.edu/fellowship/"))

['Hello', ' there', ' g', 'pt', '-', '2', '!']
['https', '://', 'x', 'risk', '.', 'uch', 'icago', '.', 'edu', '/', 'fell', 'owship', '/']


In [7]:
xlab.tests.gpt2.task1(plain_text_tokens)

collected 6 items.

PASSED [ 16%] ts/gpt2.py::test_function_runs_without_crashing
PASSED [ 33%] ts/gpt2.py::test_tokenization_cases[Hello there gpt-2-expected_output0-basic text with hyphen]
PASSED [ 50%] ts/gpt2.py::test_tokenization_cases[??!hello--*- world#$-expected_output1-special characters and symbols]
PASSED [ 66%] ts/gpt2.py::test_tokenization_cases[https://xrisk.uchicago.edu/fellowship/-expected_output2-URL with dots and slashes]
PASSED [ 83%] ts/gpt2.py::test_tokenization_cases[-expected_output3-empty string]
PASSED [100%] ts/gpt2.py::test_tokenization_cases[.,.,.,.,.,.,.,-expected_output4-repeated punctuation]

✅ All checks passed!


Back to our model. We will tokenize our text into numbers to feed into the model. When the model is done predicting text, we can untokenize the results to see if what the model is saying makes sense.

Below is some code for one full pass through the model with the prefix "Barack Obama taught constitutional law at the University of".

Take a look at the shape of the output.

In [8]:
prefix = "Barack Obama taught constitutional law at the University of"
encoded_input = tokenizer(prefix, return_tensors='pt')
output = model(**encoded_input)
logits = output.logits
logits.shape

torch.Size([1, 10, 50257])

Let's take another look at the logit values. Note that the values span a wide range (for this input they are all negative, from roughly -285 to -26). Normally, both in training and in inference, we apply a "softmax" function to the logits to bring all values between 0 and 1 so that they sum to 1. We interpret these values as the probability that the model assigns to each token being next in a sequence of text. For now, however, we ignore this detail.
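Here is a minimal sketch of softmax on made-up logits, written in plain Python so that every step is visible. It suggests why skipping softmax is harmless when we only want the most likely token: softmax preserves the ordering of the logits. The values in `toy_logits` are assumptions loosely modelled on the range above, not real model outputs.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that lie in [0, 1] and sum to 1."""
    # Subtracting the max keeps the largest exponent at exp(0) = 1,
    # which avoids numerical trouble with very large or very small logits.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a four-token vocabulary.
toy_logits = [-26.2, -27.2, -30.0, -285.3]
probs = softmax(toy_logits)

# The index of the largest logit is also the index of the largest probability.
argmax_logits = max(range(len(toy_logits)), key=lambda i: toy_logits[i])
argmax_probs = max(range(len(probs)), key=lambda i: probs[i])

print(probs)
print(argmax_logits == argmax_probs)
```

In the notebook itself, the equivalent call would be something like `torch.softmax(logits[0, -1], dim=-1)`.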

In [9]:
max(logits.view(-1)), min(logits.view(-1))

(tensor(-26.1977, grad_fn=<UnbindBackward0>),
 tensor(-285.3222, grad_fn=<UnbindBackward0>))

In [10]:
encoded_input['input_ids']

tensor([[10374,   441,  2486,  7817,  9758,  1099,   379,   262,  2059,   286]])

What is going on here?

`torch.Size([1, 10, 50257])` tells us that logits is a 3-dimensional array (i.e., it is `1*10*50257`). The first dimension represents the batch size, and because there is only one sequence in the batch, we can ignore it for now. The next dimension is the sequence length. Note that:

```python
>>> encoded_input['input_ids']
tensor([[10374,   441,  2486,  7817,  9758,  1099,   379,   262,  2059,   286]])
```

There are 10 tokens when we tokenize "Barack Obama taught constitutional law at the University of", meaning the sequence length is 10. For the final dimension, we have 50257, which represents the model's vocabulary size. Why do we have so much data? Don't we only want the next predicted piece of text?

To understand why this is necessary, you need to understand what this tensor contains. In total, we have 10 vectors of length 50257, one per input position. Each vector holds one score (logit) for every token in the vocabulary, and applying softmax to it gives a probability distribution over the next token. For example, if the value at index 42 is higher than the others, the model assigns a higher probability to the token with ID 42 being next in the sequence.
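The same layout can be sketched with a tiny nested list standing in for the tensor. The numbers, the sequence length of 3, and the vocabulary size of 4 are made up for illustration; only the indexing pattern mirrors the real `logits`.

```python
# Hypothetical toy logits laid out as [batch][position][vocab],
# i.e. the analogue of a tensor with shape (1, 3, 4).
toy_logits = [[
    [0.1, 2.0, -1.0, 0.3],   # scores after reading position 0
    [1.5, 0.2, 0.0, -0.7],   # scores after reading positions 0..1
    [-2.0, 0.4, 3.1, 0.9],   # scores after reading positions 0..2
]]

batch_size = len(toy_logits)
seq_len = len(toy_logits[0])
vocab_size = len(toy_logits[0][0])
print((batch_size, seq_len, vocab_size))

# For each position, the index of the highest score is the token ID
# the toy "model" considers most likely to come next.
predicted_ids = [
    max(range(vocab_size), key=lambda i: vector[i])
    for vector in toy_logits[0]
]
print(predicted_ids)
```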

This makes sense for our purposes: if we have a probability distribution for the next token in the text, we can sample from it to predict the next token! But why do we have 10 probability distributions in the output? In other words, why do we need a probability distribution for each token in the input?

The answer, oddly, is that this makes it easier to train our model! If we have a piece of text from the internet that we are training on, we can train on multiple examples in parallel. For example, if we have the text "Barack Obama taught constitutional law at the University of", here are several different examples we could choose to train on:


1. Prefix="Bar" and label="ack"
2. Prefix="Barack" and label=" Obama"
3. Prefix="Barack Obama" and label=" taught"
4. And so on...
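The full set of examples can be enumerated with a few lines of Python. This is only a sketch: the token strings are copied from the tokenizer output earlier in this notebook, and the helper name `training_pairs` is made up for illustration.

```python
# Token strings for the example sentence, as decoded earlier in the notebook.
tokens = ["Bar", "ack", " Obama", " taught", " constitutional",
          " law", " at", " the", " University", " of"]

def training_pairs(tokens):
    """Pair every proper prefix of the sequence with the token that follows it."""
    pairs = []
    for i in range(1, len(tokens)):
        prefix = "".join(tokens[:i])
        label = tokens[i]
        pairs.append((prefix, label))
    return pairs

pairs = training_pairs(tokens)
for prefix, label in pairs[:3]:
    print(f"Prefix={prefix!r} and label={label!r}")
print(len(pairs))
```

A sequence of 10 tokens yields 9 such pairs. The 10th output vector predicts whatever comes after " of", which has no label inside this text.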


The first three vectors in the logits in the code above correspond to the model's predictions for the first three prefixes above. While running inference, we only care about the model's prediction for the full input "Barack Obama taught constitutional law at the University of", so we only need to extract the final vector from the logits. This makes sense because, for our purposes, we aren't interested in efficiently training the model. We are only interested in seeing what the model predicts for the next token.

We can extract this last vector by taking `logits[0][-1]`. Let's see what the model predicts!

In [14]:
text = "Barack Obama taught constitutional law at the University of"

encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
logits = output.logits  # Shape: (batch_size, sequence_length, vocab_size)
token_id = torch.argmax(logits[0][-1])

generated_text = tokenizer.decode([token_id.item()]) 
print(generated_text)

 Chicago


Indeed, Barack Obama taught constitutional law at the University of Chicago ([source](https://www.obamalibrary.gov/obamas/president-barack-obama))! Despite being an early model with limited capabilities, GPT-2 124M knows quite a bit about the world.

In [15]:
for i in range(100):
    encoded_input = tokenizer(text, return_tensors='pt')
    output = model(**encoded_input)
    logits = output.logits  # Shape: (batch_size, sequence_length, vocab_size)
    token_id = torch.argmax(logits[0][-1])

    generated_text = tokenizer.decode([token_id.item()])  
    text += generated_text

In [16]:
text

"Barack Obama taught constitutional law at the University of Chicago. He was a founding member of the American Bar Association, and he was a founding member of the American Bar Association's Board of Trustees. He was a founding member of the American Bar Association's Board of Trustees. He was a founding member of the American Bar Association's Board of Trustees. He was a founding member of the American Bar Association's Board of Trustees. He was a founding member of the American Bar Association's Board of Trustees. He was a founding member of"

### Why is the model repeating itself?

In the previous cell, we generated text by always choosing the **most likely next token** (using `torch.argmax`). This deterministic approach has a major drawback: once the model enters a pattern that has high probability, it can get stuck in a loop.

Think of it like this: if the model strongly believes that "Einstein was" should be followed by "a physicist", and then strongly believes "a physicist" should be followed by "who", and then that sequence loops back to another high-probability path, the model will repeat this pattern indefinitely.
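The looping behaviour can be reproduced with a deliberately oversimplified toy. The sketch below assumes a hypothetical "model" that looks only at the previous token and always returns the same most likely successor. GPT-2 conditions on the whole prefix, so this is an analogy rather than a description of what happens inside the network.

```python
# Hypothetical lookup table: for each token, the single most likely next token.
most_likely_next = {
    "He": "was",
    "was": "a",
    "a": "member",
    "member": ".",
    ".": "He",
}

def greedy_generate(start, steps):
    """Always append the most likely next token (the argmax choice)."""
    tokens = [start]
    for _ in range(steps):
        tokens.append(most_likely_next[tokens[-1]])
    return tokens

tokens = greedy_generate("He", 12)
print(" ".join(tokens))
```

Because every choice is deterministic, the sequence returns to "He" after five tokens and then has no way to leave the cycle.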

### Introducing Temperature

**Temperature** is a hyperparameter that controls the randomness of predictions by scaling the logits before applying softmax. Mathematically:

$$p_i = \frac{\exp(z_i/T)}{\sum_j \exp(z_j/T)}$$

Where:
- $z_i$ are the logits (raw model outputs)
- $T$ is the temperature parameter
- $p_i$ is the resulting probability for token $i$

### Effects of different temperature values:

- **T → 0** (very close to 0): Approaches fully deterministic decoding, always picking the highest-probability token (like we did before). Note that T = 0 itself would divide by zero in the formula above
- **T = 1.0**: Standard softmax, use the exact probabilities from the model
- **T > 1.0**: More uniform distribution, increasing randomness and diversity
- **T < 1.0**: More peaked distribution, reducing randomness but still allowing some

Lower temperatures produce more focused, coherent text but risk repetition. Higher temperatures produce more diverse, creative text but risk incoherence.
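A small worked example shows the effect numerically. This is a plain-Python sketch of the formula above applied to made-up logits; `toy_logits` and the function name `softmax_with_temperature` are assumptions for illustration.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Compute p_i = exp(z_i / T) / sum_j exp(z_j / T)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a three-token vocabulary.
toy_logits = [2.0, 1.0, 0.0]

for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

As the temperature rises, the probability of the top token shrinks and the other tokens gain probability, so sampling picks them more often.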

In the next cell, we'll implement temperature sampling to fix our repetition problem!

In [13]:
def generate_with_temperature(model, tokenizer, prompt, max_length=100, temperature=0.7):
    # Start with the provided prompt
    generated_text = prompt
    
    for _ in range(max_length):
        # Tokenize the current text
        inputs = tokenizer(generated_text, return_tensors='pt')
        
        # Get model output
        with torch.no_grad():
            outputs = model(**inputs)
            logits = outputs.logits
        
        # Get the next token logits (last position in sequence)
        next_token_logits = logits[0, -1, :]
        
        # Apply temperature
        scaled_logits = next_token_logits / temperature
        
        # Convert to probabilities with softmax
        probs = torch.softmax(scaled_logits, dim=-1)
        
        # Sample from the distribution
        next_token_id = torch.multinomial(probs, num_samples=1).item()
        
        # Decode the token and add to generated text
        next_token_text = tokenizer.decode([next_token_id])
        generated_text += next_token_text
        
    return generated_text

In [14]:
# Low temperature (more deterministic but not completely)
prompt = "The great scientist Albert Einstein"
low_temp_text = generate_with_temperature(
    model, tokenizer, prompt, max_length=50, temperature=0.3
)
print("Temperature = 0.3:")
print(low_temp_text)
print("\n" + "-"*50 + "\n")

# Medium temperature (balanced)
medium_temp_text = generate_with_temperature(
    model, tokenizer, prompt, max_length=50, temperature=0.7
)
print("Temperature = 0.7:")
print(medium_temp_text)
print("\n" + "-"*50 + "\n")

# High temperature (more random)
high_temp_text = generate_with_temperature(
    model, tokenizer, prompt, max_length=50, temperature=1.2
)
print("Temperature = 1.2:")
print(high_temp_text)

Temperature = 0.3:
The great scientist Albert Einstein, who was the first to see the potential of quantum mechanics, was the first to see the potential of quantum mechanics, and he was the first to see the potential of quantum mechanics. He was the first to see the potential of quantum mechanics. He

--------------------------------------------------

Temperature = 0.7:
The great scientist Albert Einstein (1850-1917) has been considered one of the greatest minds in the history of science. Einstein was known as one of the most important minds of the twentieth century. His work was "the scribe of history" and to use his

--------------------------------------------------

Temperature = 1.2:
The great scientist Albert Einstein stated that we must underrealize our geospaces at the frequency of weaponization. His judgement was especially important specifically in the most distant era Jean Valjean, a Parisian celebrated mathematician and any competent observer of solar systems, said now abou