# Lab 8: CSS 100

In this lab, you'll get hands-on practice using the `transformers` package, including:

- Tokenizing input strings.
- Passing input to a pre-trained large language model.
- Obtaining **logits** or predictions from the LLM.

**Note**: To use `transformers`, make sure you're using the *Machine Learning* container for your DataHub account. You might need to relaunch your server to select this ("Control panel --> stop my server --> relaunch"). 

In [44]:
import pandas as pd
import transformers
import torch

## Part 1: Tokenization

**Tokenizers** are in charge of preparing text input for the LLM.

For example, given a string like "I like my coffee with cream and sugar", a tokenizer will convert each word (or more accurately, each token) to a numeric representation that the LLM can use.
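
To make the idea concrete, here is a minimal sketch of a tokenizer that maps pieces of text to integer IDs. Everything in it is an assumption for illustration: the vocabulary `toy_vocab`, its IDs, and the helper `toy_tokenize` are invented, and the greedy longest-match rule is a simplification (GPT-2 actually uses byte-pair encoding, which this does not implement).

```python
# Toy illustration only: this vocabulary and its IDs are invented, not GPT-2's.
toy_vocab = {"I": 0, " like": 1, " my": 2, " coffee": 3, " with": 4,
             " cream": 5, " and": 6, " sugar": 7}


def toy_tokenize(text, vocab):
    """Greedy longest-match tokenization against a fixed vocabulary."""
    ids = []
    position = 0
    while position < len(text):
        # Try the longest remaining substring first, then shrink it.
        for end in range(len(text), position, -1):
            piece = text[position:end]
            if piece in vocab:
                ids.append(vocab[piece])
                position = end
                break
        else:
            # No vocabulary entry starts at this position.
            raise ValueError(f"No token covers text at position {position}")
    return ids


toy_ids = toy_tokenize("I like my coffee with cream and sugar", toy_vocab)
print(toy_ids)  # [0, 1, 2, 3, 4, 5, 6, 7]
```

Notice that most toy tokens include their leading space. That detail mirrors what you'll see from the real GPT-2 tokenizer later in this lab.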

We're going to work with GPT-2 today, so we need to load the tokenizer that GPT-2 uses.

In [2]:
### Run this code!
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")


### Q1. Count the tokens

Use the `tokenizer` object to convert each of the `sentences` below to a list of tokens. Then, save the number of tokens in each sentence to a list called `token_counts` (where the value at each *index* is the token count for the sentence at the same index in `sentences`).

In [3]:
sentences = [
    "Large language models work by predicting upcoming words",
    "These models technically operate over tokens",
    "Some particularly long words may have multiple tokens",
    "And other short or frequent words use only a single token",
    "You have to tokenize the word to find out"
]

In [8]:
### BEGIN SOLUTION
token_counts = []
for sentence in sentences:
    tokens = tokenizer(sentence)
    token_counts.append(len(tokens['input_ids']))
### END SOLUTION

In [15]:
## At least one hidden test
assert len(token_counts) == 5
assert token_counts[0] == 8

### BEGIN HIDDEN TESTS
assert token_counts[1] == 6
assert token_counts[2] == 8
assert token_counts[3] == 11
assert token_counts[4] == 10
### END HIDDEN TESTS

### Q2. Words to tokens, and back to words

> Each "token" in the LLM's vocabulary has its own ID.

The `.convert_ids_to_tokens` method takes a `list` of **token IDs** and converts each ID back to the string representation of its token. In this problem:

- first *tokenize* each sentence in `sentences` using `tokenizer.encode(sentence)`
- then, *convert* each list of token IDs back to a list of token strings using `tokenizer.convert_ids_to_tokens`.
- save these lists of token strings in a list called `token_strings`.

**Reflect**: What do you notice about these decoded strings?

In [26]:
### BEGIN SOLUTION
token_ids = [tokenizer.encode(s) for s in sentences]
token_strings = [tokenizer.convert_ids_to_tokens(t) for t in token_ids]
### END SOLUTION

In [28]:
assert len(token_strings) == 5
assert token_strings[0] == ['Large',
 'Ġlanguage',
 'Ġmodels',
 'Ġwork',
 'Ġby',
 'Ġpredicting',
 'Ġupcoming',
 'Ġwords']
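
The `Ġ` character in the token strings above is how GPT-2's byte-level vocabulary displays a leading space. As a rough sketch, the hypothetical helper `tokens_to_text` below (not part of `transformers`) turns a list of token strings back into readable text. It assumes plain ASCII input; other bytes are displayed with other stand-in characters that this helper does not handle.

```python
# 'Ġ' (U+0120) is the stand-in character GPT-2's vocabulary uses for a space.
SPACE_MARKER = "\u0120"


def tokens_to_text(tokens):
    """Join token strings and turn the space marker back into a real space."""
    return "".join(tokens).replace(SPACE_MARKER, " ")


example_tokens = ["Large", "Ġlanguage", "Ġmodels", "Ġwork"]
print(tokens_to_text(example_tokens))  # Large language models work
```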

### Q3. Identifying multi-token words

This problem is more **open-ended**. Try to input different *words* to the `tokenizer` and see which individual words are tokenized into multiple tokens. See how many you can find!

In [38]:
### BEGIN SOLUTION
multi_token_words = ['vanguard', 'vehicular', 'dehumidifier']

w_ids = [tokenizer.encode(s) for s in multi_token_words]
w_strings = [tokenizer.convert_ids_to_tokens(t) for t in w_ids]
### END SOLUTION

## Part 2. Using pre-trained LLMs!

A **pre-trained language model** is one that has already been trained on a large corpus of text. The `transformers` library allows you to load a pre-trained model. This section introduces you to loading pre-trained models and doing basic operations with them.

We'll be working with GPT-2, a pre-trained language model.

In [40]:
### Run this code!
gpt2 = transformers.AutoModelForCausalLM.from_pretrained("gpt2")  # Load the model
gpt2.eval()  # Put the model in "evaluation mode" (as opposed to training mode).


GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)

### Q4. Pass input into the model

In this problem, I'll give you a **sentence fragment**. Your job is to:

- Tokenize that sentence fragment. 
- Pass the tokens into the pre-trained LLM (`gpt2`). 
- Set the result of that to a variable called `outputs`.

Note that you can use the following code to pass input into an LLM object:

```python
with torch.no_grad():
    outputs = gpt2(**inputs)
```

In [47]:
sentence_example = "This class is really"

In [48]:
### BEGIN SOLUTION
inputs = tokenizer(sentence_example, return_tensors="pt")
with torch.no_grad():
    outputs = gpt2(**inputs)
### END SOLUTION

In [54]:
assert 'logits' in outputs.keys()

### Q5. Identify the top-$5$ predicted words

The `outputs` you just created have a key called `logits`, which holds the model's **predictions**.

> The `logits` are a series of values for each token in our model's vocabulary, at each position in the sequence of input tokens.

We can use `outputs.logits.shape` to figure out the dimensions of this structure.

The dimensions represent `(batch_size, sequence_length, vocab_size)`.

- Batch size: The number of inputs we passed the model. Here, just 1.
- Sequence length: The number of tokens in our sequence (4).
- Vocab size: The number of tokens in the model's vocabulary (50257 for GPT-2).

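The predictions for the **next** token are stored at the *final* position in the sequence. After selecting the one sentence in our batch (`sentence_logits = outputs.logits[0]`), that is: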

```python
last_token_logits = sentence_logits[-1]
```
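
If the three dimensions feel abstract, this small sketch may help. It uses plain nested lists as a stand-in for `outputs.logits`; the numbers and the vocabulary size of 3 are made up, and the `toy_` names are just for illustration.

```python
# Toy stand-in for outputs.logits with shape
# (batch_size=1, sequence_length=4, vocab_size=3). The numbers are made up.
toy_logits = [
    [
        [0.1, 2.0, -1.0],   # scores after token 1
        [1.5, 0.3, 0.2],    # scores after tokens 1-2
        [-0.4, 0.0, 3.1],   # scores after tokens 1-3
        [0.7, -2.2, 1.9],   # scores after all 4 tokens
    ]
]

batch_size = len(toy_logits)
sequence_length = len(toy_logits[0])
vocab_size = len(toy_logits[0][0])

toy_sentence_logits = toy_logits[0]              # drop the batch dimension
toy_last_token_logits = toy_sentence_logits[-1]  # scores for the *next* token

print((batch_size, sequence_length, vocab_size))  # (1, 4, 3)
print(toy_last_token_logits)                      # [0.7, -2.2, 1.9]
```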

Return the **top five** predicted token IDs using these logits. Set this equal to `top_five_token_ids`.

**Hint**: `torch.topk` can be of use here.
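
Conceptually, taking the top `k` indices just means finding the positions of the `k` largest scores. Here is a pure-Python sketch of that idea; `top_k_indices` and `toy_scores` are hypothetical names, and this is an illustration of the concept rather than how `torch` implements it.

```python
import heapq


def top_k_indices(scores, k):
    """Return the indices of the k largest scores, best first."""
    return heapq.nlargest(k, range(len(scores)), key=lambda i: scores[i])


toy_scores = [0.7, -2.2, 1.9, 3.4, 0.1]
print(top_k_indices(toy_scores, 2))  # [3, 2]
```

Each returned index plays the role of a token ID: position 3 has the highest score, so "token 3" would be the top prediction.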

In [56]:
### We're only looking at one sentence, so focus on that
sentence_logits = outputs.logits[0]

### Get last token logits.
last_token_logits = sentence_logits[-1]
last_token_logits.shape

torch.Size([50257])

In [65]:
top_five_token_ids = torch.topk(last_token_logits, k = 5).indices
top_five_token_ids

tensor([ 922, 4465,  546, 1593, 1049])

In [68]:
assert len(top_five_token_ids) == 5
assert sorted(top_five_token_ids.tolist()) == [546, 922, 1049, 1593, 4465]

### Q6. Now, `decode` those words!

Unless you're secretly a tokenizer, you don't know what **tokens** those IDs correspond to. So let's use `tokenizer.decode` to convert them all back to words. Set the result to `top_five_tokens`.

**Reflect**: Do those words make sense in this context?

In [71]:
### BEGIN SOLUTION
top_five_tokens = [tokenizer.decode(t) for t in top_five_token_ids]
### END SOLUTION

In [72]:
assert len(top_five_tokens) == 5
assert top_five_tokens == [' good', ' useful', ' about', ' important', ' great']

### Q7. Write a `sample_words` function

Now that you know how to identify the most probable tokens, write a function called `sample_words`. This function will:

- Take as *input* a `sentence_fragment`, a trained `model` (e.g., `gpt2`), a `tokenizer`, and an integer `k`.
- Identify the top `k` *most likely* next words, based on the model.
- Return those words in a `list`.

**Reflect**: How might you use a function like this to generate entire paragraphs of text? (Hint: It won't just be by increasing `k`!)

In [76]:
### BEGIN SOLUTION
def sample_words(sentence_fragment, model, tokenizer, k):
    
    ### Tokenize
    inputs = tokenizer(sentence_fragment, return_tensors="pt")
    
    ### Get outputs
    with torch.no_grad():
        outputs = model(**inputs)
        
    ### We're only looking at one sentence, so focus on that
    sentence_logits = outputs.logits[0]

    ### Get last token logits.
    last_token_logits = sentence_logits[-1]
    
    ### Most likely k token IDs
    top_k_token_ids = torch.topk(last_token_logits, k = k).indices
    
    ### Decode token IDs back to strings
    top_k_tokens = [tokenizer.decode(t) for t in top_k_token_ids]
    
    return top_k_tokens
    
### END SOLUTION

In [78]:
### Test 1
sentence_fragment = "I like my coffee with cream and"
assert sample_words(sentence_fragment, gpt2, tokenizer, 1) == [" cream"]

In [84]:
### Test 2
sentence_fragment = "I think that large language models are very"
assert sorted(sample_words(sentence_fragment, gpt2, tokenizer, 3)) == [' good', ' important', ' useful']
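
One possible answer to the **Reflect** question in Q7 is to generate text one word at a time: pick a next word, append it to the fragment, and repeat. The sketch below shows that loop with a made-up lookup table standing in for the model, so `toy_next_word_scores`, `toy_most_likely_next`, and `generate_greedy` are all hypothetical. With the real model, you might instead take `sample_words(fragment, gpt2, tokenizer, 1)[0]` at each step.

```python
# Hypothetical stand-in for a language model: maps the last word
# to made-up scores for possible next words.
toy_next_word_scores = {
    "I": {" like": 2.0, " think": 1.0},
    " like": {" coffee": 3.0, " tea": 2.5},
    " coffee": {" with": 1.2, ".": 1.0},
    " with": {" cream": 2.2, " sugar": 2.0},
    " cream": {".": 1.0},
}


def toy_most_likely_next(words):
    """Return the highest-scoring next word, or None if there is no prediction."""
    scores = toy_next_word_scores.get(words[-1])
    if not scores:
        return None
    return max(scores, key=scores.get)


def generate_greedy(start_words, max_new_words):
    """Repeatedly append the single most likely next word (greedy decoding)."""
    words = list(start_words)
    for _ in range(max_new_words):
        next_word = toy_most_likely_next(words)
        if next_word is None:
            break
        words.append(next_word)
    return "".join(words)


print(generate_greedy(["I"], 5))  # I like coffee with cream.
```

Always taking the single best word is only one strategy. Sampling among the top `k` words at each step is another, and tends to give more varied text.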

### Q8. Calculate the probability of specific tokens

Finally, we're often interested in calculating the probability of a **specific upcoming token** in a sequence.

To do this, we'll need to first `softmax` the `logits`, and then figure out the probability assigned to the token ID in question. We've written a handy function for you called `next_seq_prob` which does this already.
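
As a quick refresher, softmax exponentiates each logit and divides by the total, so the results are positive and sum to 1. Here is a minimal pure-Python sketch with made-up logits; in the lab itself, `torch.softmax` does this for you.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    largest = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


toy_probs = softmax([2.0, 1.0, 0.0])
print([round(p, 3) for p in toy_probs])  # [0.665, 0.245, 0.09]
```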

In this final problem, we want you to **use** this function to calculate the probability assigned to each word in the `candidates` list below, given the `sentence` fragment. Save your responses in a `dict` mapping each candidate to a probability value.

In [87]:
def next_seq_prob(model, tokenizer, seen, unseen):
    """Get p(unseen | seen)

    Parameters
    ----------
    model : transformers.PreTrainedModel
        Model to use for predicting tokens
    tokenizer : transformers.PreTrainedTokenizer
        Tokenizer for Model
    seen : str
        Input sequence
    unseen: str
        The sequence for which to calculate a probability
    """
    # Get ids for tokens
    input_ids = tokenizer.encode(seen, return_tensors="pt")
    unseen_ids = tokenizer.encode(unseen)

    # Loop through unseen tokens & store log probs
    log_probs = []
    for unseen_id in unseen_ids:

        # Run model on input
        with torch.no_grad():
            logits = model(input_ids).logits

        # Get next token prediction logits
        next_token_logits = logits[0, -1]
        next_token_probs = torch.softmax(next_token_logits, 0) # Normalize

        # Get probability for relevant token in unseen string & store
        prob = next_token_probs[unseen_id]
        log_probs.append(torch.log(prob))

        # Append the token we just scored so the next prediction is conditioned on it
        input_ids = torch.cat((input_ids, torch.tensor([[unseen_id]])), 1)

    # Add log probs together to get total log probability of sequence
    total_log_prob = sum(log_probs)
    # Exponentiate to return to probabilities
    total_prob = torch.exp(total_log_prob)
    return total_prob.item()
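
Why does `next_seq_prob` add log probabilities and then exponentiate? When a candidate is split into several tokens, its probability is the product of each token's conditional probability, and adding logs is equivalent to multiplying the probabilities. A tiny sketch with made-up numbers:

```python
import math

# Made-up conditional probabilities for a candidate split into two tokens:
# p(token_1 | seen) and p(token_2 | seen, token_1).
toy_token_probs = [0.2, 0.5]

toy_total_log_prob = sum(math.log(p) for p in toy_token_probs)
toy_total_prob = math.exp(toy_total_log_prob)

print(round(toy_total_prob, 6))  # 0.1
```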

In [86]:
### Sentence and candidates
sentence = "I like my coffee with cream and"
candidates = [" sugar", " honey", " mercury"]

In [89]:
### BEGIN SOLUTION
results = {}
for candidate in candidates:
    prob = next_seq_prob(gpt2, tokenizer, sentence, candidate)
    results[candidate] = prob
### END SOLUTION

In [92]:
assert results
assert ' sugar' in results
assert ' honey' in results
assert ' mercury' in results

In [95]:
assert round(results[' sugar'], 2) == 0.05
assert round(results[' honey'], 2) == 0.02
assert round(results[' mercury'], 2) == 0.00

# Submit!

Congratulations, you just learned how to work with an LLM in `transformers`, a widely used Python package for loading and running pre-trained models!