# Lab 2: Language Modelling with Transformers

We're going to feed some text into a Transformer and examine the probabilities it assigns to the next token (a word or a piece of a word).

First let's load up the `distilgpt2` tokenizer as we did before.

In [1]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('distilgpt2')

We're going to be interested in predicting the next subword token. How many possible subword tokens are there?

In [2]:
len(tokenizer.vocab) # or we could just use len(tokenizer)

50257

When tokenizing, we'll use the tokenizer with the `return_tensors='pt'` parameter. This puts the data into the format of a [PyTorch](https://pytorch.org) tensor which is used as the input for a Transformer model. PyTorch is a commonly used library for deep learning and HuggingFace builds upon it. We won't use PyTorch directly.

Let's tokenize: `"A horse! a horse! my kingdom for a"`

In [3]:
tokenized = tokenizer('A horse! a horse! my kingdom for a', return_tensors='pt')
tokenized

{'input_ids': tensor([[   32,  8223,     0,   257,  8223,     0,   616, 13239,   329,   257]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

Note that it has been tokenized into 10 tokens.

In [4]:
len(tokenized['input_ids'][0])

10

Now we need to load up the full Transformer model. It must be the one that matches our tokenizer (`distilgpt2`), because the token ids that a tokenizer produces are only meaningful to a model trained with the same vocabulary.

We'll load it using `AutoModelForCausalLM`. CausalLM is causal language modelling, or predicting the next token. You can also load models for other purposes like document classification.

In [5]:
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained('distilgpt2')

Now let's pass the tokenized text into the Transformer model. We could do this with `model(input_ids=tokenized['input_ids'], attention_mask=tokenized['attention_mask'])` but a tidied shorthand is:

In [6]:
output = model(**tokenized)
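If the `**` syntax is unfamiliar: it unpacks a dictionary-like object into keyword arguments. The sketch below uses a made-up function, `describe_inputs`, and a plain dict as stand-ins (neither is part of `transformers`), purely to show that the two call styles are equivalent.

```python
# Hypothetical stand-in for a model call; not part of the transformers library.
def describe_inputs(input_ids, attention_mask):
    return f"{len(input_ids)} ids, {sum(attention_mask)} attended"

# A plain dict standing in for the tokenizer output (made-up, shortened values).
toy_inputs = {"input_ids": [32, 8223, 0], "attention_mask": [1, 1, 1]}

# Passing each entry by name...
explicit = describe_inputs(input_ids=toy_inputs["input_ids"],
                           attention_mask=toy_inputs["attention_mask"])
# ...is the same as unpacking the whole dict with **.
unpacked = describe_inputs(**toy_inputs)

print(explicit == unpacked)
```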

For causal language modelling, what we care about are the predictions of the next token. These are captured by the `logits`, which hold a score for every possible token at every position in the input.

In [7]:
output.logits

tensor([[[-31.1439, -29.1282, -30.8418,  ..., -42.3130, -42.1440, -31.0009],
         [-59.5865, -60.5802, -64.7680,  ..., -70.8865, -65.8933, -63.0499],
         [-62.7691, -63.7442, -64.5699,  ..., -75.1833, -72.3489, -60.4002],
         ...,
         [-51.0393, -59.1055, -63.8448,  ..., -68.9364, -65.0198, -59.6002],
         [-56.1765, -60.0481, -63.8827,  ..., -66.6802, -65.5936, -61.3876],
         [-63.7612, -64.7149, -67.7764,  ..., -75.3739, -69.5853, -65.8060]]],
       grad_fn=<UnsafeViewBackward0>)

This is a PyTorch tensor, which is a grid of numbers. In this case, it's a 3D grid. You can see its dimensions using `.shape` as below:

In [8]:
output.logits.shape

torch.Size([1, 10, 50257])

Where do the different numbers come from?

Well, we only put in one sequence of ten tokens, so that explains the `[1, 10,...]`. The `50257` is the size of the vocabulary of the tokenizer:

In [9]:
len(tokenizer)

50257

That means we can look up the score that the Transformer has given to the token `horse` as the continuation after the final token in the sequence. First, what is the token index for `horse`? Recall that because it starts a new word, the token begins with the special character `Ġ`, which stands for the preceding space.

In [10]:
tokenizer.vocab['Ġhorse']

8223

Then to get the score from the first sequence (0), after the final token (-1) and for the token `horse` (8223), we would access it with:

In [11]:
output.logits[0,-1,8223]

tensor(-59.6237, grad_fn=<SelectBackward0>)
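The three indices can be read as sequence, position, and token id. As a minimal sketch, here is the same kind of lookup on a small nested Python list with made-up numbers and a vocabulary of only 4 tokens. Note that nested lists need chained brackets, whereas a PyTorch tensor accepts the comma form used above.

```python
# Toy "logits": 1 sequence, 3 positions, vocabulary of 4 tokens (made-up values).
toy_logits = [[
    [0.1, 0.2, 0.3, 0.4],   # scores for the token following position 0
    [1.1, 1.2, 1.3, 1.4],   # scores for the token following position 1
    [2.1, 2.2, 2.3, 2.4],   # scores for the token following the final position
]]

# Equivalent of .shape for this nested list.
shape = (len(toy_logits), len(toy_logits[0]), len(toy_logits[0][0]))

# First sequence (0), final position (-1), token id 2.
score = toy_logits[0][-1][2]

print(shape)
print(score)
```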

Hmm, the logits are raw scores rather than probabilities, so they are difficult to interpret. We'll have to do a little work to make them interpretable.

Let's get all the scores out for predictions of tokens after our input (so using the index of -1 to get the final logits).

In [12]:
next_token_scores = output.logits[0,-1,:].tolist()
len(next_token_scores)

50257

As we already saw, they are not easy to interpret.

In [13]:
next_token_scores[:5]

[-63.76122283935547,
 -64.71493530273438,
 -67.77637481689453,
 -67.36962890625,
 -67.97136688232422]

So we shall use a softmax function. It takes a list of numbers, applies the equation below to them (using lots of exponentials) and returns a vector where all the values are between 0 and 1 and they all add up to 1.

$ \mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}} \quad \text{for } i = 1, 2, \dots, K $
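To make the equation concrete, here is a minimal hand-written sketch in plain Python, applied to three made-up scores. The name `toy_softmax` is our own, and subtracting the maximum before exponentiating is an added numerical-stability step that leaves the result unchanged.

```python
import math

def toy_softmax(scores):
    # Subtracting the maximum does not change the result,
    # but it stops exp() from underflowing to 0 for very negative scores.
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    # Each value is its exponential divided by the sum of all exponentials.
    return [e / total for e in exps]

toy_probs = toy_softmax([-63.8, -64.7, -67.8])  # made-up logits

print(toy_probs)
print(sum(toy_probs))
```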

There is a [function](https://docs.scipy.org/doc/scipy/reference/generated/scipy.special.softmax.html) in the useful [scipy package](https://scipy.org/) that does this for us.

In [14]:
from scipy.special import softmax

Apply the `softmax` function to `next_token_scores`, store the result in `next_token_probs`, and output the first five values. You should see that they are between 0 and 1 and rather small.

In [None]:
# your code here

The probabilities should also add up to 1 (or very close to it, due to small numerical differences). Check if this is the case using the `sum` function.

In [None]:
# your code here

Let's see what the probability of `horse` is now (token id = 8223).

In [None]:
# your code here

You should find that it has a probability of approximately `0.006`.

If we didn't already know that 8223 is horse, we could decode it with the tokenizer.

In [None]:
tokenizer.decode(8223)

The final task is to go through `next_token_probs` (the softmax output from earlier), find the token id with the highest probability, and work out the corresponding token using `tokenizer.decode`.

In [None]:
# your code here

You should find that `' long'` has the highest probability (`≈ 0.3427`).

That's the end of this mini-lab.

## Optional Extra:
- Try a different input sentence