# Running GPT-2 Locally

In [None]:
import torch
import xlab

VOCAB_SIZE = 50257

When loading pretrained models in the course, we will be using the [transformers](https://huggingface.co/docs/transformers/en/index) library developed by Hugging Face. This library hides the complexity of running pretrained language models behind a single API. In the cell below, you will be loading the [smallest version of GPT-2](https://huggingface.co/openai-community/gpt2) with 124 million parameters. If you are interested, you can also try out [GPT-2 Medium](https://huggingface.co/openai-community/gpt2-medium) (355 million parameters), [GPT-2 Large](https://huggingface.co/openai-community/gpt2-large) (774 million parameters), and [GPT-2 XL](https://huggingface.co/openai-community/gpt2-xl) (1.5 billion parameters).

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")

## Disabling Gradients

This notebook will teach you how to load an LLM and calculate loss on examples. We will not cover how to train or fine-tune an LLM, which means you will not be running backward passes to update parameters based on the loss. In other words, none of the computations in this notebook require PyTorch to track gradients. By default, however, PyTorch tracks computations on tensors that require gradients (such as model parameters) to prepare for a backward pass. To stop this from happening, we have to tell PyTorch explicitly.

One way to do this for specific computations is to wrap them in a `torch.no_grad()` block. For example:

```python
with torch.no_grad():
    # your computations here
```

Because no exercises you will be completing require gradients, you can disable gradient tracking for the remainder of the notebook by running the cell below.

In [None]:
torch.set_grad_enabled(False)
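
The effect of `torch.no_grad()` can be checked on a toy tensor. The snippet below is a minimal sketch, not part of the exercises: the tensor `w` and the arithmetic are made up for illustration. Because gradients may already be disabled globally by the cell above, the "tracked" computation is wrapped in `torch.enable_grad()` so the comparison works either way.

```python
import torch

# Hypothetical toy parameter that asks PyTorch to track gradients.
w = torch.ones(3, requires_grad=True)

# With gradient tracking enabled, results computed from `w` are tracked.
with torch.enable_grad():
    tracked = (w * 2).sum()

# Inside torch.no_grad(), the same computation is not tracked.
with torch.no_grad():
    untracked = (w * 2).sum()

print(tracked.requires_grad)
print(untracked.requires_grad)
```

The first result could be used in a backward pass; the second could not, which is all we need for inference and loss evaluation.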

## Tokenize Text

To input a sequence of text into GPT-2, we first have to convert the text to numbers so we can feed it to the model. Typically, we convert a string of text into a sequence of tokens, each of which is assigned a number that can be embedded into a vector. To do this, we have a few options:

1. We can assign each character its own number
2. We can assign each word or special character its own number
3. We can assign common sequences of characters their own numbers

Option #3 is the most popular and is the high-level approach taken in the GPT-2 paper. This approach keeps the vocabulary (the total number of distinct tokens) manageable while still capturing some of the underlying structure of natural language. Specifically, the authors use a modified version of BPE (byte pair encoding) proposed [here](https://arxiv.org/pdf/1508.07909). If you are interested, more implementation details of the tokenizer can be found in the [GPT-2 paper](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf).
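
To get a feel for how "common sequences of characters" can be discovered, here is a toy sketch of the core BPE idea: repeatedly merge the most frequent adjacent pair of symbols. This is a simplified illustration on a made-up string, not GPT-2's actual tokenizer (which operates on bytes and uses a learned merge table); the helper names `most_frequent_pair` and `merge_pair` are hypothetical.

```python
from collections import Counter


def most_frequent_pair(tokens):
    """Return the adjacent pair of symbols that occurs most often."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0]


def merge_pair(tokens, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# Start from individual characters and apply two merge steps.
tokens = list("low lower lowest")
for _ in range(2):
    pair = most_frequent_pair(tokens)
    tokens = merge_pair(tokens, pair)

print(tokens)
```

After two merges the frequent fragment "low" has become a single symbol, while rarer characters remain separate.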

Time to try out the GPT-2 tokenizer! Run the cell below to see the tokenizer convert the string into a sequence of numbers:

In [None]:
text = "Barack Obama taught constitutional law at the University of"
encoded_input = tokenizer(text, return_tensors="pt")
print(encoded_input["input_ids"])

Let's take a look at what each of these numbers represents:

In [None]:
for token_id in encoded_input["input_ids"][0]:
    print(f'{token_id.item()}\t --> \t"{tokenizer.decode(token_id)}"')

We can also decode the entire sequence at once. This will be helpful to remember for later!

In [None]:
tokenizer.decode(encoded_input["input_ids"][0])

### Task #1

For a given input of text, return a list of tokens in plain text. For example, for the input "Hello there gpt-2!", the function should return ['Hello', ' there', ' g', 'pt', '-', '2', '!']. Note that this is very different from just splitting up the text into random chunks or splitting it at spaces! Tokenizers are designed to create groupings of characters that are often found together or that are significant in the structure of language. You are encouraged to play around with different examples and observe how smart the tokenizer can be!

<details>
<summary>🔐 <b>Solution for Task #1</b></summary>

```python
def plain_text_tokens(prefix):
    """Tokenizes a text prefix into individual token strings.
    
    Args:
        prefix (str): Input text string to tokenize.
        
    Returns:
        list[str]: Individual token strings from the tokenizer.
    """

    encoded_input = tokenizer(prefix, return_tensors='pt')
    rv = []  # collects the decoded token strings

    # 1. iterate over input ids
    for i in encoded_input['input_ids'][0]:

        # 2. decode each id into plain text
        rv.append(tokenizer.decode(i))
    return rv
  
```

</details>


In [None]:
def plain_text_tokens(prefix):
    """Tokenizes a text prefix into individual token strings.

    Args:
        prefix (str): Input text string to tokenize.

    Returns:
        list[str]: Individual token strings from the tokenizer.
    """

    ######### YOUR CODE STARTS HERE #########
    # 1. iterate over input ids
    # 2. decode each id into plain text
    ########## YOUR CODE ENDS HERE ##########


# test out your implementation on different inputs to get a sense of how the tokenizer works!
print(plain_text_tokens("Hello there gpt-2!"))
print(plain_text_tokens("https://xrisk.uchicago.edu/fellowship/"))

In [None]:
xlab.tests.gpt2.task1(plain_text_tokens)

Back to our model. We will tokenize our text into numbers to feed into the model. When the model is done predicting text, we can detokenize the results to see if what the model is saying makes sense.

Below is code for a single forward pass with the prefix "Barack Obama taught constitutional law at the University of".

Take a look at the shape of the output.

In [None]:
prefix = "Barack Obama taught constitutional law at the University of"
encoded_input = tokenizer(prefix, return_tensors="pt")
output = model(**encoded_input)
logits = output.logits
logits.shape

Let's take another look at the logit values. Note that logits can be positive or negative. Normally, both in training and in inference, we apply a "softmax" function to the logits to bring all values between 0 and 1 and make them sum to 1. We interpret these values as the probabilities that the model assigns to each token being next in a sequence of text. For now, however, we ignore this detail.

In [None]:
logits.max(), logits.min()

In [None]:
encoded_input["input_ids"]

What is going on here?

`torch.Size([1, 10, 50257])` tells us that the logits are a 3-dimensional tensor (i.e., its shape is `1 x 10 x 50257`). The first dimension is the batch size; because the batch contains only one sequence, we can ignore it for now. The next dimension is the sequence length. Note that:

```python
>>> encoded_input['input_ids']
tensor([[10374,   441,  2486,  7817,  9758,  1099,   379,   262,  2059,   286]])
```

Tokenizing "Barack Obama taught constitutional law at the University of" yields 10 tokens, so the sequence length is 10. The final dimension, 50257, is the model's vocabulary size. Why do we have so much data? Don't we only want the next predicted piece of text?

To understand why this is necessary, you need to understand what information this tensor contains. In total, we have 10 vectors of length 50257. Each vector holds one logit for every token in the vocabulary, and applying a softmax turns it into a probability distribution over the next token. For example, if the value at index 42 is higher than the others, the model is assigning a higher probability to the token with ID 42 being next in the sequence of text.

This makes sense for our purposes: if we have a probability distribution for the next token in the text, we can sample from it to predict the next token! But why do we have 10 probability distributions in the output? In other words, why do we need a probability distribution for each token in the input?

The answer to this question is, oddly, that this makes it easier to train our model! If we have a piece of text from the internet that we are training on, we can train on multiple examples in parallel. For example, if we have the text "Barack Obama taught constitutional law at the University of", here are several different examples we could choose to train on.


1. Prefix="Bar" and label="ack"
2. Prefix="Barack" and label=" Obama"
3. Prefix="Barack Obama" and label=" taught"
4. And so on...


The first three vectors in the logits from the code above correspond to the model's predictions for the first three prefixes above. While running inference, we only care about the model's prediction for the full input "Barack Obama taught constitutional law at the University of". Therefore, we only need to extract the final vector from the logits. This makes sense because, for our purposes, we aren't interested in efficiently training the model. We are only interested in seeing what the model predicts for the next token.
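
The correspondence between output positions and training examples can be spelled out in plain Python. This is a small sketch that assumes the prompt splits into the ten token strings listed below (the first four follow from the examples above; the rest are an assumption you can confirm by running the tokenizer cell).

```python
# Assumed token strings for the prompt (10 tokens).
tokens = [
    "Bar", "ack", " Obama", " taught", " constitutional",
    " law", " at", " the", " University", " of",
]

# Output position i is the prediction made after seeing tokens[0..i],
# so its label is tokens[i + 1].
pairs = []
for position in range(len(tokens) - 1):
    prefix = "".join(tokens[: position + 1])
    label = tokens[position + 1]
    pairs.append((prefix, label))

for position, (prefix, label) in enumerate(pairs):
    print(f"position {position}: prefix={prefix!r} label={label!r}")

# The final position (index 9) has no label inside the text itself:
# it is the prediction for whatever comes after " of".
print(f"position {len(tokens) - 1}: prefix={''.join(tokens)!r} label=?")
```

Ten tokens therefore give nine labeled examples plus one open-ended prediction, which is the one we use at inference time.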

We can extract this last vector by taking `logits[0][-1]`. Let's see what the model predicts!

In [None]:
text = "Barack Obama taught constitutional law at the University of"

encoded_input = tokenizer(text, return_tensors="pt")
output = model(**encoded_input)
logits = output.logits  # Shape: (batch_size, sequence_length, vocab_size)
token_id = torch.argmax(logits[0][-1])

generated_text = tokenizer.decode([token_id.item()])
print(generated_text)

Indeed, Barack Obama taught constitutional law at the University of Chicago ([source](https://www.obamalibrary.gov/obamas/president-barack-obama))! Despite being an early model with limited capabilities, GPT-2 124M knows quite a bit about the world.

### Task #1.5

In the function below you will run the code above in a loop to continue generating text from the model. You are encouraged to play around with different prompts and observe the model's output. 

Because you are using `argmax` to select the next token, the model's behavior is deterministic. Later, you will see that in typical LLM generation, next tokens are sampled from a probability distribution.

<i>`generate_n_tokens` should return generated tokens appended to the original prompt.</i>


<details>
<summary>🔐 <b>Solution for Task #1.5</b></summary>

```python
def generate_n_tokens(model, tokenizer, n, prompt):
    """Generates n tokens by repeatedly predicting the most likely next token.

    generate_n_tokens should return generated tokens appended to the original prompt.
    
    Args:
        model: Language model with forward() method that outputs logits.
        tokenizer: Tokenizer with encode/decode methods.
        n (int): Number of tokens to generate.
        prompt (str): Initial text prompt to extend.
        
    Returns:
        str: Original prompt extended with n generated tokens.
    """

    # 1. iterate n times
    for i in range(n):

        # 2. find most likely next token
        encoded_input = tokenizer(prompt, return_tensors='pt')
        output = model(**encoded_input)
        logits = output.logits
        token_id = torch.argmax(logits[0][-1])

        # 3. append it to the prompt + previously generated tokens
        generated_text = tokenizer.decode([token_id.item()])  
        prompt+=generated_text
    
    return prompt
  
```

</details>

In [None]:
def generate_n_tokens(model, tokenizer, n, prompt):
    """Generates n tokens by repeatedly predicting the most likely next token.

    generate_n_tokens should return generated tokens appended to the original prompt.

    Args:
        model: Language model with forward() method that outputs logits.
        tokenizer: Tokenizer with encode/decode methods.
        n (int): Number of tokens to generate.
        prompt (str): Initial text prompt to extend.

    Returns:
        str: Original prompt extended with n generated tokens.
    """

    ######### YOUR CODE STARTS HERE #########
    # 1. iterate n times
    # 2. find most likely next token
    # 3. append it to the prompt + previously generated tokens
    ########## YOUR CODE ENDS HERE ##########

In [None]:
generated_text = generate_n_tokens(model, tokenizer, 50, text)
print(generated_text)

As you will explore below, running an LLM in practice typically involves sampling tokens by interpreting the model output as a probability distribution. This means that running an LLM is usually non-deterministic (i.e., running the same prompt multiple times can produce different outputs). The function you wrote above, however, should not include any sampling, so it should produce the exact same output every time you run it. To test your implementation, you can compare your output to the expected output produced by the solution.

In [None]:
expected_output = "Barack Obama taught constitutional law at the University of Chicago. He was a founding member of the American Bar Association, and he was a founding member of the American Bar Association's Board of Trustees. He was a founding member of the American Bar Association's Board of Trustees. He was a founding"
generated_text = generate_n_tokens(model, tokenizer, 50, text)
assert generated_text == expected_output

### Why is the model repeating itself?

In the previous cell, we generated text by always choosing the most likely next token (using `torch.argmax`). This deterministic approach has a major drawback: once the model enters a pattern that has high probability, it can get stuck in a loop.

Above you should have observed that "He was a founding member of the American Bar Association's Board of Trustees." repeats continuously before we cut it off.
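
The looping behavior can be reproduced with a toy model. The sketch below uses a hypothetical table of next-token probabilities that depends only on the previous token (GPT-2 conditions on the whole prefix, so this is a deliberate simplification). Greedy selection falls into a cycle, while sampling can escape it.

```python
import random

# Hypothetical next-token probabilities for a tiny vocabulary.
next_token_probs = {
    "He": {"was": 0.9, "founding": 0.05, "member": 0.05},
    "was": {"founding": 0.6, "member": 0.3, "He": 0.1},
    "founding": {"member": 0.8, "He": 0.1, "was": 0.1},
    "member": {"He": 0.7, "was": 0.2, "founding": 0.1},
}


def greedy_next(token):
    """Always pick the most likely next token (like argmax)."""
    probs = next_token_probs[token]
    return max(probs, key=probs.get)


def sampled_next(token, rng):
    """Draw the next token at random according to its probability."""
    probs = next_token_probs[token]
    return rng.choices(list(probs), weights=list(probs.values()), k=1)[0]


def generate(start, n, pick):
    """Extend `start` by n tokens using the selection function `pick`."""
    out = [start]
    for _ in range(n):
        out.append(pick(out[-1]))
    return out


greedy = generate("He", 8, greedy_next)
rng = random.Random(0)
sampled = generate("He", 8, lambda t: sampled_next(t, rng))

print("greedy: ", " ".join(greedy))
print("sampled:", " ".join(sampled))
```

The greedy sequence repeats the same four tokens forever, because the most likely continuation of each token leads straight back into the cycle.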

### Introducing Softmax and Temperature

We explored softmax briefly in the [defensive distillation section](https://xlabaisecurity.com/adversarial/defensive-distillation/).

As a review, a traditional softmax is calculated via the following equation. When there are $K$ classes, and $z$ is the pre-softmax output, the equation below gives the probability for class $i$:

$$
q_i = \frac{e^{z_i}}{\sum_{j=0}^{K-1}e^{z_j}}
$$

For our transformer output, we take the softmax over the final (vocabulary) dimension. That means that for each input position, we take a softmax over the corresponding output vector independently. For inference, we sample from the softmax probabilities at the final position.

If we want the output of the softmax to be smoother, we can divide the logits by a constant $T > 1$, which pushes the distribution toward uniform. The value $T$ is called the temperature, where a higher $T$ corresponds to more random or surprising outputs and a $T$ below 1 sharpens the distribution. Note that our traditional softmax is equivalent to the equation below when $T = 1$.

$$
q_i = \frac{e^{z_i / T}}{\sum_{j=0}^{K-1}e^{z_j / T }}
$$
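
As a quick numerical check of the equation above, here is a minimal pure-Python sketch on three made-up logits. The function name `softmax_with_temperature` is hypothetical, and subtracting the maximum before exponentiating is only a numerical-stability trick that leaves the result unchanged.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Compute q_i = exp(z_i / T) / sum_j exp(z_j / T) for a list of logits."""
    scaled = [z / temperature for z in logits]
    largest = max(scaled)  # subtracting the max does not change the result
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up values for illustration
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T = {t}: {[round(p, 3) for p in probs]}")
```

As $T$ grows, probability mass moves from the largest logit toward the others; as $T$ shrinks, it concentrates on the largest logit.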

### Task #2: Softmax GPT-2 Outputs

<details>
<summary>🔐 <b>Solution for Task #2</b></summary>

```python
def get_gpt2_probs(logits, temp=1):
    """Converts logits to probabilities with optional temperature scaling.
    
    Args:
        logits [batch, seq_len, vocab_size]: Raw model output logits.
        temp (float): Temperature for scaling logits before softmax.
        
    Returns:
        [batch, seq_len, vocab_size]: Probability distributions over vocabulary.
    """
    assert len(logits.shape) == 3
    assert logits.shape[-1] == VOCAB_SIZE

    logits_with_temp = logits / temp
    probs = torch.nn.functional.softmax(logits_with_temp, dim=2)
    return probs
  
```

</details>

In [None]:
def get_gpt2_probs(logits, temp=1):
    """Converts logits to probabilities with optional temperature scaling.

    Args:
        logits [batch, seq_len, vocab_size]: Raw model output logits.
        temp (float): Temperature for scaling logits before softmax.

    Returns:
        [batch, seq_len, vocab_size]: Probability distributions over vocabulary.
    """
    assert len(logits.shape) == 3
    assert logits.shape[-1] == VOCAB_SIZE

    ######### YOUR CODE HERE #########

In [None]:
xlab.tests.gpt2.task2(get_gpt2_probs)

### Effects of different temperature values:

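- **T close to 0**: Effectively deterministic, almost always picking the highest-probability token (like we did before). Note that $T = 0$ itself would divide by zero, so it has to be handled as a special case, for example with `argmax`.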
- **T = 1.0**: Standard softmax, use the exact probabilities from the model
- **T > 1.0**: More uniform distribution, increasing randomness and diversity
- **T < 1.0**: More peaked distribution, reducing randomness but still allowing some

Lower temperatures produce more focused, coherent text but risk repetition. Higher temperatures produce more diverse, creative text but risk incoherence.

In the next cell, we'll implement temperature sampling to fix our repetition problem!

In [None]:
def generate_with_temperature(
    model, tokenizer, prompt, max_length=100, temperature=0.7
):
    """Generates text using temperature-based sampling for next token selection.

    Args:
        model: Language model with forward() method that outputs logits.
        tokenizer: Tokenizer with encode/decode methods.
        prompt (str): Initial text prompt to extend.
        max_length (int): Maximum number of tokens to generate.
        temperature (float): Temperature parameter for sampling control.

    Returns:
        str: Original prompt extended with generated tokens.
    """

    for i in range(max_length):
        encoded_input = tokenizer(prompt, return_tensors="pt")
        output = model(**encoded_input)
        probs = get_gpt2_probs(output.logits, temperature)

        next_token_id = torch.multinomial(probs[0][-1], num_samples=1).item()

        generated_text = tokenizer.decode([next_token_id])
        prompt += generated_text

    return prompt

In [None]:
# Low temperature (more deterministic but not completely)
prompt = "Barack Obama taught constitutional law at the University of"
low_temp_text = generate_with_temperature(
    model, tokenizer, prompt, max_length=40, temperature=0.3
)
print("Temperature = 0.3:")
print("-" * 50)
print(low_temp_text + "\n")


# Medium temperature (balanced)
medium_temp_text = generate_with_temperature(
    model, tokenizer, prompt, max_length=40, temperature=0.7
)
print("Temperature = 0.7:")
print("-" * 50)
print(medium_temp_text + "\n")


# High temperature (more random)
high_temp_text = generate_with_temperature(
    model, tokenizer, prompt, max_length=40, temperature=1.2
)
print("Temperature = 1.2:")
print("-" * 50)
print(high_temp_text)

### Calculate Loss:

The quality of a language model's predictions is typically measured with "negative log likelihood" (NLL). Let's break down where each of these words comes from:

1. Likelihood: The softmax probabilities for the next token give the probability the model assigns to each of the outputs. If you select the probability of the correct next token you have the probability the model assigns to the correct answer.
2. Log likelihood: the softmax probability for the correct next token will give you some value between 0 and 1 (this is a property of the softmax). By taking the log of that value, you get near-zero if the probability is close to one. Otherwise, you get increasingly negative values the closer the probability is to 0.
3. Negative log likelihood: By taking the negative of the log likelihood, we get increasingly positive values for probabilities close to 0 and less positive values for probabilities close to 1.

One helpful way to think about this is that *minimizing* NLL *maximizes* the probability the model assigns to the correct token. For more information, review our [Introduction to LLMs](https://xlabaisecurity.com/jailbreaking/llm-intro/) page on our website.
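
To make the three steps concrete, the sketch below computes the NLL for a few made-up probabilities of the correct token, using the natural logarithm. The helper name `negative_log_likelihood` is hypothetical.

```python
import math


def negative_log_likelihood(prob_of_correct_token):
    """NLL of a single prediction: -log(p), where p is in (0, 1]."""
    return -math.log(prob_of_correct_token)


# Made-up probabilities: confident and right, unsure, confident and wrong.
for p in (0.99, 0.5, 0.01):
    print(f"p = {p:<4}  NLL = {negative_log_likelihood(p):.3f}")
```

A probability near 1 gives a loss near 0, and the loss grows quickly as the probability of the correct token approaches 0.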

### Task #3 & 4: Calculate Loss for GPT-2

For the sake of the following exercise, let's pretend that the text "Barack Obama taught constitutional law at the University of Chicago" is in our training data and we would like to predict it.

In task #3 you will calculate the NLL for only the final token where the "correct" answer is " Chicago".
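
In PyTorch, the NLL of a single prediction is what `torch.nn.functional.cross_entropy` returns when given raw logits and the index of the correct class. The sketch below checks this on a made-up three-class example rather than on GPT-2 output; the values in `logits` and `target` are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

# Made-up logits for one prediction over a 3-token vocabulary: shape [1, 3].
logits = torch.tensor([[2.0, 1.0, 0.0]])
# Index of the "correct" token: shape [1].
target = torch.tensor([0])

# Manual NLL: softmax, pick the probability of the correct token, take -log.
probs = F.softmax(logits, dim=-1)
manual_nll = -torch.log(probs[0, target[0]])

# cross_entropy applies the softmax and the negative log in one call.
loss = F.cross_entropy(logits, target)

print(manual_nll.item(), loss.item())
```

Both numbers agree, which is why cross-entropy on logits is a convenient way to compute NLL without calling softmax yourself.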

<details>
<summary>🔐 <b>Solution for Task #3</b></summary>

```python
def get_gpt2_next_token_loss(model, text, correct_token_idx):
    """Computes cross-entropy loss for predicting a specific next token.
    
    Args:
        model: Language model with forward() method that outputs logits.
        text (str): Input text sequence.
        correct_token_idx [1]: Token indices tensor for the target next token.
        
    Returns:
        torch.Tensor: Scalar cross-entropy loss value.
    """
    encoded_input = tokenizer(text, return_tensors='pt')
    logits = model(**encoded_input).logits
    next_token_out = logits[:,-1,:]
    loss = torch.nn.functional.cross_entropy(next_token_out, correct_token_idx)
    return loss
  
```

</details>

In [None]:
def get_gpt2_next_token_loss(model, text, correct_token_idx):
    """Computes cross-entropy loss for predicting a specific next token.

    Args:
        model: Language model with forward() method that outputs logits.
        text (str): Input text sequence.
        correct_token_idx [1]: Token indices tensor for the target next token.

    Returns:
        torch.Tensor: Scalar cross-entropy loss value.
    """
    ######### YOUR CODE HERE #########


# Note: " Chicago" tokenizes to a single token in GPT-2
# Therefore this line gives us a tensor with shape [1]
correct_token_idx = tokenizer(" Chicago", return_tensors="pt")["input_ids"][0]
get_gpt2_next_token_loss(
    model,
    "Barack Obama taught constitutional law at the University of",
    correct_token_idx,
)

In [None]:
_ = xlab.tests.gpt2.task3(get_gpt2_next_token_loss)

In the previous function, you implemented the NLL loss for the final output token of the model. However, when training these models, the loss is calculated for every token. As a review, GPT-2 treats the following as separate training examples:

1. Prefix="Bar" and label="ack"
2. Prefix="Barack" and label=" Obama"
3. Prefix="Barack Obama" and label=" taught"
4. And so on...

Although we won't explain the mechanisms in depth, the transformer architecture ensures that tokens can only "look" at tokens before them, meaning the model cannot cheat and look ahead at the correct answer.

For task #4 you will implement a loss given a sequence of text. You will use the text itself as self-supervised labels the way that researchers would when training a transformer from scratch. This means that you will use the text itself to derive the "correct" labels for each example. 


<details>
<summary>🔐 <b>Solution for Task #4</b></summary>

```python
def get_gpt2_loss_on_sequence(model, text):
    """Computes language modeling loss for predicting each token in a sequence.
    
    Args:
        model: Language model with forward() method that outputs logits.
        text (str): Input text sequence for loss computation.
        
    Returns:
        torch.Tensor: Scalar cross-entropy loss across the sequence.
    """

    # 1. get logits
    encoded_input = tokenizer(text, return_tensors='pt')
    logits = model(**encoded_input).logits

    # 2. remove final prediction from logits (we don't have a self-supervised label for it)
    logits = logits[:, :-1, :]


    # 3. remove first token from labels (the model doesn't produce a prediction for it)
    labels = encoded_input.input_ids[:,1:]

    # 4. calculate cross entropy loss
    loss = torch.nn.functional.cross_entropy(
        logits.squeeze(0), # remove batch dim
        labels.squeeze(0)  # remove batch dim
    )
    return loss
  
```

</details>

In [None]:
def get_gpt2_loss_on_sequence(model, text):
    """Computes language modeling loss for predicting each token in a sequence.

    Args:
        model: Language model with forward() method that outputs logits.
        text (str): Input text sequence for loss computation.

    Returns:
        torch.Tensor: Scalar cross-entropy loss across the sequence.
    """

    ######### YOUR CODE STARTS HERE #########
    # 1. get logits
    # 2. remove final prediction from logits (we don't have a self-supervised label for it)
    # 3. remove first token from labels (the model doesn't produce a prediction for it)
    # 4. calculate cross entropy loss
    ########## YOUR CODE ENDS HERE ##########


get_gpt2_loss_on_sequence(
    model, "Barack Obama taught constitutional law at the University of Chicago"
)

In [None]:
_ = xlab.tests.gpt2.task4(get_gpt2_loss_on_sequence)