# Running GPT-2 Locally

In [3]:
import torch

When loading pretrained models in this course, we will use the [transformers](https://huggingface.co/docs/transformers/en/index) library developed by Hugging Face. This library wraps the complexity of running pretrained language models behind a single API. In the cell below, you will load the [smallest version of GPT-2](https://huggingface.co/openai-community/gpt2), which has 124 million parameters. If you are interested, you can also try out [GPT-2 Medium](https://huggingface.co/openai-community/gpt2-medium) (355 million parameters), [GPT-2 Large](https://huggingface.co/openai-community/gpt2-large) (774 million parameters), and [GPT-2 XL](https://huggingface.co/openai-community/gpt2-xl) (1.5 billion parameters).

In [4]:
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")

## Tokenize Text

We have some text and we would like to predict the next few pieces of text. Our model, however, cannot perform computation on raw text: we need an algorithm that turns our text into numbers that GPT-2 can process.

In our setting, we apply a "tokenizer" to split the text into chunks, where each chunk is assigned a number. From here we can print the numbers out to inspect them. These numbers are fed to the model, which turns each of them into a high-dimensional vector for our transformer to perform computations on.
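
To make the chunk-to-number-to-vector pipeline concrete, here is a minimal toy sketch. The vocabulary, ids, and 3-dimensional vectors are all made up for illustration (the names `toy_vocab`, `toy_embeddings`, `toy_encode`, and `toy_decode` are hypothetical); the real GPT-2 tokenizer has 50257 entries and the smallest model uses 768-dimensional vectors.

```python
# Toy vocabulary: every chunk of text gets an integer id (made-up values).
toy_vocab = {"The": 0, " great": 1, " scientist": 2, " Albert": 3}
id_to_chunk = {i: chunk for chunk, i in toy_vocab.items()}

# Toy embedding table: one small vector per id.
# Three dimensions keep the example readable.
toy_embeddings = [
    [0.1, -0.3, 0.7],
    [0.5, 0.2, -0.1],
    [-0.4, 0.9, 0.0],
    [0.3, 0.3, -0.6],
]

def toy_encode(chunks):
    """Map a list of text chunks to their integer ids."""
    return [toy_vocab[c] for c in chunks]

def toy_decode(ids):
    """Map integer ids back to text by concatenating their chunks."""
    return "".join(id_to_chunk[i] for i in ids)

chunks = ["The", " great", " scientist", " Albert"]
ids = toy_encode(chunks)
vectors = [toy_embeddings[i] for i in ids]  # what the model computes on

print(ids)
print(toy_decode(ids))
print(vectors[0])
```

The real tokenizer also decides how to split the text into chunks, which this sketch skips by starting from chunks that are already split.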

Take a look at the example below for how to tokenize the string: "The great scientist Albert"

In [5]:
prefix = "The great scientist Albert"
encoded_input = tokenizer(prefix, return_tensors='pt')
print(encoded_input['input_ids'])

tensor([[  464,  1049, 11444,  9966]])


Our tokenizer can also decode these ids back into the original text. For example:

In [4]:
tokenizer.decode(encoded_input['input_ids'][0])

'The great scientist Albert'

### Task #1

For a given input of text, return a list of tokens in plain text. For example, for the input "Hello there gpt-2", the function should return ['Hello', ' there', ' g', 'pt', '-', '2']. Note that this is very different from splitting the text into random chunks or splitting it at spaces! Tokenizers are designed to create groupings of characters that are often found together or that are significant in the structure of language. This makes it easier for the model to learn, because language is grouped in a way that helps the model pick up patterns.

Hint: You only need to call tokenizer and tokenizer.decode to complete this task!

In [5]:
def plain_text_tokens(prefix):
    '''
    Tokenizes prefix and returns a list of its tokens in plain text
    '''
    rv = []
    encoded_input = tokenizer(prefix, return_tensors='pt')
    for i in encoded_input['input_ids'][0]:
        rv.append(tokenizer.decode([i]))
    return rv
 
plain_text_tokens("Hello there gpt-2")

['Hello', ' there', ' g', 'pt', '-', '2']

Back to our model. We will tokenize our text into numbers to feed into the model. When the model is done predicting text, we can untokenize the results to see if what the model is saying makes sense.

Below is some code for one full pass through the model with the prefix "The great scientist Albert".

Take a look at the shape of the output.

In [6]:
prefix = "The great scientist Albert"
encoded_input = tokenizer(prefix, return_tensors='pt')
output = model(**encoded_input)
logits = output.logits
logits.shape

torch.Size([1, 4, 50257])

Let's take a look at the logit values. Note that they span a wide range of positive and negative values. Normally, in both training and inference, we apply a "softmax" function to the logits, which maps every value to between 0 and 1 and makes the values sum to 1. We interpret the results as the probabilities that the model assigns to each token being next in the sequence of text. For now, however, we ignore this detail.
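
As a rough illustration of what softmax does, here is a minimal pure-Python sketch. The input values are made up to resemble the range seen in the logits, and the helper name `softmax` is our own, not a library call.

```python
import math

def softmax(logits):
    """Turn a list of raw scores into probabilities that sum to 1."""
    # Subtracting the max does not change the result but avoids overflow in exp.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [69.1, 2.0, -313.4, 67.5]  # made-up values in a similar range
probs = softmax(toy_logits)
print(probs)
print(sum(probs))
```

The largest logit receives the largest probability, and very negative logits end up with probabilities that are effectively zero.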

In [19]:
max(logits.view(-1)), min(logits.view(-1))

(tensor(69.1329, grad_fn=<UnbindBackward0>),
 tensor(-313.3790, grad_fn=<UnbindBackward0>))

In [7]:
encoded_input['input_ids']

tensor([[  464,  1049, 11444,  9966]])

What is going on here?

`torch.Size([1, 4, 50257])` tells us that logits is a 3-dimensional tensor of shape 1 × 4 × 50257. The first dimension is the batch size, and because our batch contains only one sequence, we can ignore it for now. The next dimension is the sequence length. Note that:

```python
>>> encoded_input['input_ids']
tensor([[  464,  1049, 11444,  9966]])
```

There are four tokens when we tokenize "The great scientist Albert", so the sequence length is 4. The final dimension, 50257, is the model's vocabulary size. Why do we have so much data? Don't we only want the next predicted piece of text?

This extra output is useful, and it gives us a lot of control when analyzing the model. In total, we have 4 vectors of length 50257. Each vector holds one logit per token in the vocabulary, and applying softmax to it gives a probability distribution over which token comes next. For example, if the value at index 2048 is higher than the others, the model is assigning a higher probability to the token with id 2048 being next in the sequence of text.

This makes sense for our purposes: if we have a probability distribution for the next token in the text, we can sample from it to predict the next token! But why do we have four probability distributions in the output? In other words, why do we need a probability distribution for each token in the input?

The answer to this question is, oddly, that this makes it easier to train our model! If we are training on a piece of text from the internet, we can train on multiple examples in parallel. For example, if we have the text "The great scientist Albert", here are a few different examples we could choose to train on.


1. Prefix="The" and label=" great"
2. Prefix="The great" and label=" scientist"
3. Prefix="The great scientist" and label=" Albert"
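
The way one piece of text yields several training examples can be sketched in a few lines. This is only an illustration that works on an already-split list of token strings; the helper name `next_token_examples` is hypothetical.

```python
def next_token_examples(tokens):
    """Return every (prefix, label) pair that a single token sequence provides."""
    return [("".join(tokens[:i]), tokens[i]) for i in range(1, len(tokens))]

tokens = ["The", " great", " scientist", " Albert"]
for prefix, label in next_token_examples(tokens):
    print(repr(prefix), "->", repr(label))
```

A sequence of 4 tokens gives 3 (prefix, label) pairs, one for each position that has a known next token.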


The first three vectors in the logits in the code above correspond to the model's predictions for the three prefixes above. We only care about the model's prediction for the full input "The great scientist Albert", so we only need to extract the final vector of logits. This makes sense because, for our purposes, we aren't training the model. We are only interested in seeing what the model predicts for the next token.
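
Here is a toy picture of keeping only the last position, using plain nested lists instead of tensors. The numbers and the vocabulary size of 5 are made up for illustration.

```python
# Toy logits with shape (batch=1, sequence=4, vocab=5); values are made up.
toy_logits = [[
    [0.1, 2.0, 0.3, 0.0, -1.0],  # prediction after token 1
    [1.5, 0.2, 0.1, 0.4, 0.0],   # prediction after tokens 1-2
    [0.0, 0.1, 0.2, 3.0, 0.5],   # prediction after tokens 1-3
    [0.3, 0.2, 4.0, 0.1, 0.0],   # prediction after all 4 tokens
]]

# First (and only) batch item, last position in the sequence.
last = toy_logits[0][-1]

# Index of the largest score, i.e. the id of the most likely next token.
next_id = max(range(len(last)), key=last.__getitem__)
print(next_id)
```

Only the final row is used when generating; the earlier rows are predictions for positions whose next token we already know.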

We can extract this last vector by taking `logits[0][-1]`. Let's see what the model predicts!

In [9]:
text = "The great scientist Albert Einstein"

encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
logits = output.logits  # Shape: (batch_size, sequence_length, vocab_size)
token_id = torch.argmax(logits[0][-1])

generated_text = tokenizer.decode([token_id.item()]) 
print(generated_text)

,


In [10]:
for i in range(100):
    encoded_input = tokenizer(text, return_tensors='pt')
    output = model(**encoded_input)
    logits = output.logits  # Shape: (batch_size, sequence_length, vocab_size)
    token_id = torch.argmax(logits[0][-1])

    generated_text = tokenizer.decode([token_id.item()])  
    text+=generated_text

In [11]:
text

'The great scientist Albert Einstein, who was a great scientist, was a great scientist. He was a great scientist. He was a great scientist. He was a great scientist. He was a great scientist. He was a great scientist. He was a great scientist. He was a great scientist. He was a great scientist. He was a great scientist. He was a great scientist. He was a great scientist. He was a great scientist. He was a great scientist. He was a great scientist. He was a great'

### Why is the model repeating itself?

In the previous cell, we generated text by always choosing the **most likely next token** (using `torch.argmax`). This deterministic approach has a major drawback: once the model enters a pattern that has high probability, it can get stuck in a loop.

Think of it like this: if the model strongly believes that "Einstein was" should be followed by "a physicist", and then strongly believes "a physicist" should be followed by "who", and then that sequence loops back to another high-probability path, the model will repeat this pattern indefinitely.
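
The looping behaviour can be imitated with a toy lookup table. This is a deliberately simplified sketch: the table `most_likely_next` is invented, and a real model conditions on the whole prefix rather than only the previous token.

```python
# Hypothetical "most likely next token" table (made up for illustration).
most_likely_next = {
    "Einstein": " was",
    " was": " a",
    " a": " great",
    " great": " scientist",
    " scientist": ".",
    ".": " He",
    " He": " was",
}

def greedy_generate(start, steps):
    """Always follow the single most likely next token."""
    tokens = [start]
    for _ in range(steps):
        tokens.append(most_likely_next[tokens[-1]])
    return "".join(tokens)

print(greedy_generate("Einstein", 14))
```

Because every choice is deterministic, the sequence has no way to leave the cycle once " He" leads back to " was".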

### Introducing Temperature

**Temperature** is a hyperparameter that controls the randomness of predictions by scaling the logits before applying softmax. Mathematically:

$$p_i = \frac{\exp(z_i/T)}{\sum_j \exp(z_j/T)}$$

Where:
- $z_i$ are the logits (raw model outputs)
- $T$ is the temperature parameter
- $p_i$ is the resulting probability for token $i$

### Effects of different temperature values:

- **T close to 0**: Almost deterministic, nearly always picks the highest-probability token (like we did before). T = 0 exactly would divide by zero in the formula above.
- **T = 1.0**: Standard softmax, use the exact probabilities from the model
- **T > 1.0**: More uniform distribution, increasing randomness and diversity
- **T < 1.0**: More peaked distribution, reducing randomness but still allowing some

Lower temperatures produce more focused, coherent text but risk repetition. Higher temperatures produce more diverse, creative text but risk incoherence.
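
To see the effect of temperature on a small scale, the formula above can be sketched in pure Python. The three logits are made up, and the helper name `softmax_with_temperature` is our own.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Compute p_i = exp(z_i / T) / sum_j exp(z_j / T) for a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.0]  # made-up logits for a 3-token vocabulary
for t in (0.3, 1.0, 1.2):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

As the temperature drops, more of the probability mass moves onto the token with the largest logit; as it rises, the distribution flattens out.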

In the next cell, we'll implement temperature sampling to fix our repetition problem!

In [13]:
import torch.nn.functional as F  # provides F.softmax, used below

def generate_with_temperature(model, tokenizer, prompt, max_length=100, temperature=0.7):
    # Start with the provided prompt
    generated_text = prompt
    
    for _ in range(max_length):
        # Tokenize the current text
        inputs = tokenizer(generated_text, return_tensors='pt')
        
        # Get model output
        with torch.no_grad():
            outputs = model(**inputs)
            logits = outputs.logits
        
        # Get the next token logits (last position in sequence)
        next_token_logits = logits[0, -1, :]
        
        # Apply temperature
        scaled_logits = next_token_logits / temperature
        
        # Convert to probabilities with softmax
        probs = F.softmax(scaled_logits, dim=0)
        
        # Sample from the distribution
        next_token_id = torch.multinomial(probs, num_samples=1).item()
        
        # Decode the token and add to generated text
        next_token_text = tokenizer.decode([next_token_id])
        generated_text += next_token_text
        
    return generated_text
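
The sampling step draws a token id at random, with each id chosen in proportion to its probability. As a rough stand-in for that step, here is a sketch using only Python's standard library; the probabilities are made up and the helper name `sample_token_id` is hypothetical.

```python
import random

def sample_token_id(probs, rng):
    """Draw one index, where index i is chosen with probability probs[i]."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

rng = random.Random(0)       # fixed seed so the run is repeatable
toy_probs = [0.7, 0.2, 0.1]  # made-up next-token probabilities
draws = [sample_token_id(toy_probs, rng) for _ in range(1000)]
counts = [draws.count(i) for i in range(len(toy_probs))]
print(counts)
```

Unlike `argmax`, less likely tokens are still picked some of the time, which is what lets the generated text escape a repeating pattern.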

In [14]:
# Low temperature (more deterministic but not completely)
prompt = "The great scientist Albert Einstein"
low_temp_text = generate_with_temperature(
    model, tokenizer, prompt, max_length=50, temperature=0.3
)
print("Temperature = 0.3:")
print(low_temp_text)
print("\n" + "-"*50 + "\n")

# Medium temperature (balanced)
medium_temp_text = generate_with_temperature(
    model, tokenizer, prompt, max_length=50, temperature=0.7
)
print("Temperature = 0.7:")
print(medium_temp_text)
print("\n" + "-"*50 + "\n")

# High temperature (more random)
high_temp_text = generate_with_temperature(
    model, tokenizer, prompt, max_length=50, temperature=1.2
)
print("Temperature = 1.2:")
print(high_temp_text)

Temperature = 0.3:
The great scientist Albert Einstein, who was the first to see the potential of quantum mechanics, was the first to see the potential of quantum mechanics, and he was the first to see the potential of quantum mechanics. He was the first to see the potential of quantum mechanics. He

--------------------------------------------------

Temperature = 0.7:
The great scientist Albert Einstein (1850-1917) has been considered one of the greatest minds in the history of science. Einstein was known as one of the most important minds of the twentieth century. His work was "the scribe of history" and to use his

--------------------------------------------------

Temperature = 1.2:
The great scientist Albert Einstein stated that we must underrealize our geospaces at the frequency of weaponization. His judgement was especially important specifically in the most distant era Jean Valjean, a Parisian celebrated mathematician and any competent observer of solar systems, said now abou