# `013` Language Models and Logits

Task: Ask a language model for the most likely next tokens.

This notebook follows up on `012-tokenization`.

## Setup

We'll be using the HuggingFace Transformers library, which provides a (mostly) consistent interface to many different language models. We'll focus on the OpenAI GPT-2 model, famous for OpenAI's assertion that it was "too dangerous" to release in full.

[Documentation](https://huggingface.co/transformers/model_doc/gpt2.html) for the model and tokenizer.

In [None]:
!pip install transformers

In [None]:
import torch
from torch import tensor

### Download and load the model

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("distilgpt2", add_prefix_space=True) # smaller version of GPT-2
# Alternative to add_prefix_space is to use `is_split_into_words=True`
# add the EOS token as PAD token to avoid warnings
model = AutoModelForCausalLM.from_pretrained("distilgpt2", pad_token_id=tokenizer.eos_token_id)

In [None]:
print(f"The model has {model.num_parameters():,d} parameters.")

The model has 81,912,576 parameters.


## Task

Consider the following phrase: "This weekend I plan to".

1. Convert the phrase into token ids.
2. Use the `forward` method of the `model`. Explain the shape of `model_output.logits`.
3. Pull out the logits corresponding to the *last* token in the input phrase. Identify the id of the most likely next token.
4. Find what token the model thinks is the most likely.
5. Use the `topk` method to find the top-20 most likely choices for the next token. 
6. Write a function that is given a phrase and a *k* and returns the top *k* most likely next tokens.

In [None]:
#phrase = "In a shocking finding, scientists discovered a herd of unicorns living in"
phrase = "This weekend I plan to"

#### Convert phrase to token ids

In [None]:
input_ids = tokenizer.encode(phrase)
input_ids

[770, 5041, 314, 1410, 284]

#### Use the `forward` method of the `model`

In [None]:
model_output = model.forward(tensor([input_ids]))
model_output.logits.shape

torch.Size([1, 5, 50257])

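The shape of `model_output.logits` is `(batch size, sequence length, vocabulary size)`:

* 1: the batch size; we passed in a single sequence.
* 5: the number of tokens in the input phrase. Position *i* holds the model's next-token scores given tokens 0 through *i*.
* 50257: the size of the vocabulary. Each entry is a logit (an unnormalized score, not a probability) for that vocabulary token being the next token.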
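
Because logits are unnormalized scores, they need a softmax before they can be read as probabilities. The following is a minimal, framework-free sketch of that step; the four-entry `toy_logits` list is a made-up stand-in for one row of the real 50257-entry output, not actual GPT-2 values.

```python
import math

def softmax(logits):
    """Turn a list of raw scores (logits) into probabilities that sum to 1."""
    # Subtracting the max keeps exp() from underflowing; the result is unchanged.
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a 4-token vocabulary (illustrative values only).
toy_logits = [-77.9, -79.8, -82.1, -75.0]
probs = softmax(toy_logits)

# Softmax preserves the ordering, so the largest logit is also the most probable token.
best = max(range(len(probs)), key=probs.__getitem__)
print(best, probs)
```

In the notebook itself, the analogous call on a tensor would be `torch.softmax(logits, dim=-1)`.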


In [None]:
# since we only have a single sequence (batch size of 1), let's collapse the batch dimension.
logits = model_output.logits[0]

#### Pull out the logits corresponding to the last token in the input phrase

In [None]:
last_token_logits = logits[-1]
last_token_logits

tensor([-77.8725, -79.7684, -82.1183,  ..., -88.5235, -86.5615, -79.6716],
       grad_fn=<SelectBackward>)

#### Identify the id of the most likely next token

In [None]:
# argmax gives the index of the largest logit, which is the token id.
# keepdim=True keeps it as a 1-element tensor.
most_likely_id = last_token_logits.argmax(dim=-1, keepdim=True)
most_likely_id

tensor([467])

#### Find what token the model thinks is the most likely

In [None]:
tokenizer.decode(most_likely_id)

' go'

#### Find the top-20 most likely choices for the next token

In [None]:
top_tokens = last_token_logits.topk(k=20)
# topk returns (values, indices); the indices are the token ids, best first.
tokenizer.decode(top_tokens.indices)

' go take spend make do be attend visit run have write play get start head travel try bring return share'
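
Decoding all the ids in one call joins them into a single string, which hides where one token ends and the next begins. The sketch below mimics the `(values, indices)` pair that `topk` returns, using plain Python and a hypothetical five-entry `toy_vocab`, and looks up each index separately to get a list of tokens.

```python
def top_k(scores, k):
    """Return (values, indices) of the k largest scores, best first."""
    indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    values = [scores[i] for i in indices]
    return values, indices

# Hypothetical vocabulary and logits (illustrative, not real GPT-2 data).
toy_vocab = [" go", " take", " spend", " the", " banana"]
toy_logits = [2.5, 1.7, 1.1, -0.3, -4.0]

values, indices = top_k(toy_logits, k=3)
# Look up each id on its own so every token stays a separate list entry.
tokens = [toy_vocab[i] for i in indices]
print(list(zip(tokens, values)))
```

With the real tokenizer, a similar per-token list could be built with `[tokenizer.decode([i]) for i in top_tokens.indices.tolist()]`.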

#### Write a function that is given a phrase and a k and returns the top k most likely next tokens

In [None]:
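def get_top_tokens(phrase, k):
    """Return the k most likely next tokens for `phrase`, decoded into one string."""
    input_ids = tokenizer.encode(phrase)
    model_output = model.forward(tensor([input_ids]))
    # Logits at the last position of the only sequence in the batch.
    top_tokens = model_output.logits[0][-1].topk(k=k)
    # top_tokens[1] holds the token ids, ordered from most to least likely.
    return tokenizer.decode(top_tokens[1])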

get_top_tokens("I wish I were", 10)

' a the more in there able here an on just'

## Analysis

What would be required to generate more than one token? What decisions would you have to make?

To generate more than one token, we would append a token to the phrase and feed the extended phrase back into the model to get a new set of next-token logits, repeating until we have enough tokens. We would need to decide which token to append at each step: the single most likely token (greedy decoding), or one sampled at random from the top *k* tokens. We would also need to decide when to stop, for example after a fixed number of tokens or when the model produces its end-of-sequence token. Either way, each new prediction is conditioned on the whole sequence so far, including the token we just added, not only on that last token.
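
The loop described above can be sketched as follows. This is only an illustration under stated assumptions: `toy_next_token_logits` is a made-up stand-in for the real model (it returns arbitrary but deterministic scores over a six-token vocabulary), and `generate` is a hypothetical helper, not part of the Transformers library.

```python
import math
import random

VOCAB_SIZE = 6  # hypothetical tiny vocabulary

def toy_next_token_logits(token_ids):
    """Stand-in for a language model: one logit per vocabulary entry.

    Like a real causal LM, it looks at the whole sequence so far.
    """
    seed = sum((pos + 1) * tok for pos, tok in enumerate(token_ids))
    return [math.sin(seed + v) for v in range(VOCAB_SIZE)]

def generate(token_ids, steps, k=1, rng=None):
    """Append `steps` tokens; k=1 is greedy, k>1 samples among the top k."""
    if rng is None:
        rng = random.Random(0)
    token_ids = list(token_ids)
    for _ in range(steps):
        logits = toy_next_token_logits(token_ids)
        # Ids of the k highest logits, best first.
        top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
        if k == 1:
            next_id = top[0]
        else:
            # Unnormalized softmax weights over the top-k logits only.
            weights = [math.exp(logits[i] - logits[top[0]]) for i in top]
            next_id = rng.choices(top, weights=weights, k=1)[0]
        # Feed the extended sequence back in on the next iteration.
        token_ids.append(next_id)
    return token_ids

greedy = generate([1, 2], steps=3)
sampled = generate([1, 2], steps=3, k=3, rng=random.Random(42))
print(greedy, sampled)
```

Greedy decoding always returns the same continuation for a given prompt, while top-*k* sampling trades that determinism for variety. A fixed `steps` count is the simplest stopping rule.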