# Running GPT-2 Locally

In [14]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

In [2]:
tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")


## Tokenize Text

We have some text and we would like to predict the next few pieces of text. We cannot simply feed raw text into the model, because the model doesn't work with language directly; it works with numbers!

The first step in almost any NLP task is to turn language into a series of numbers or vectors. In our setting, we apply a "tokenizer" to split the text into chunks, where each chunk is assigned a number. We can print these numbers out to inspect them. They will be fed to the model, which turns each number into a high-dimensional vector so that the transformer can perform computations on them and learn relationships in the text that are useful for predicting the next token.
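
As a rough illustration of the two steps just described (text to ids, then ids to vectors), here is a toy sketch in plain Python. The four-entry vocabulary, the `toy_encode` and `toy_decode` helpers, and the 3-dimensional embedding table are all made up for illustration. GPT-2's real tokenizer uses byte-pair encoding, and its embeddings are learned and much larger.

```python
# Toy vocabulary: maps each text chunk to an integer id (hypothetical values).
toy_vocab = {"The": 0, " great": 1, " scientist": 2, " Albert": 3}
id_to_chunk = {i: chunk for chunk, i in toy_vocab.items()}

# Toy embedding table: one small vector per id (a real model learns these).
toy_embeddings = [
    [0.1, -0.3, 0.5],
    [0.7, 0.2, -0.1],
    [-0.4, 0.9, 0.0],
    [0.3, 0.3, 0.8],
]

def toy_encode(text):
    """Greedily match the longest known chunk at each position."""
    ids = []
    pos = 0
    while pos < len(text):
        match = max(
            (c for c in toy_vocab if text.startswith(c, pos)),
            key=len,
            default=None,
        )
        if match is None:
            raise ValueError(f"no chunk in toy_vocab matches at position {pos}")
        ids.append(toy_vocab[match])
        pos += len(match)
    return ids

def toy_decode(ids):
    """Map ids back to chunks and join them into the original text."""
    return "".join(id_to_chunk[i] for i in ids)

ids = toy_encode("The great scientist Albert")
vectors = [toy_embeddings[i] for i in ids]  # the vectors a model would compute on
print(ids)              # [0, 1, 2, 3]
print(toy_decode(ids))  # The great scientist Albert
print(vectors[0])       # [0.1, -0.3, 0.5]
```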

Take a look at the example below for how to tokenize the string: "The great scientist Albert"

In [109]:
prefix = "The great scientist Albert"
encoded_input = tokenizer(prefix, return_tensors='pt')
print(encoded_input['input_ids'])

tensor([[  464,  1049, 11444,  9966]])


Our tokenizer can also decode these ids back into the original text. For example:

In [110]:
tokenizer.decode(encoded_input['input_ids'][0])

'The great scientist Albert'

### Task #1

For a given input of text, return a list of tokens in plain text. For example, for the input "Hello there gpt-2", the function should return ['Hello', ' there', ' g', 'pt', '-', '2']. Note that this is very different from just splitting the text into random chunks or splitting on spaces! Tokenizers are designed to group characters that are often found together or that are significant in the structure of language. By grouping language this way, tokenizers make it easier for the model to learn.

Hint: You only need to call tokenizer and tokenizer.decode to complete this task!

In [116]:
def plain_text_tokens(prefix):
    '''
    Tokenizes prefix and returns a list of its tokens in plain text.
    '''
    rv = []
    encoded_input = tokenizer(prefix, return_tensors='pt')
    # Decode each token id on its own to recover that token's text
    for token_id in encoded_input['input_ids'][0]:
        rv.append(tokenizer.decode([token_id.item()]))
    return rv
 
plain_text_tokens("Hello there gpt-2")

['Hello', ' there', ' g', 'pt', '-', '2']

Back to our model. We will tokenize our text into numbers to feed into the model. When the model is done predicting text, we can untokenize the results to see if what the model is saying makes sense.

Below is some code for one full forward pass through the model with the prefix "The great scientist Albert".

Take a look at the shape of the output.

In [118]:
prefix = "The great scientist Albert"
encoded_input = tokenizer(prefix, return_tensors='pt')
output = model(**encoded_input)
logits = output.logits
logits.shape

torch.Size([1, 4, 50257])

In [120]:
encoded_input['input_ids']

tensor([[  464,  1049, 11444,  9966]])

What is going on here?

`torch.Size([1, 4, 50257])` tells us that logits is a 3-dimensional tensor of shape 1 × 4 × 50257. The first dimension is the batch size; because we passed in only one sequence, we can ignore it for now. The next dimension is the sequence length. Note that:

```python
>>> encoded_input['input_ids']
tensor([[  464,  1049, 11444,  9966]])
```

Tokenizing "The great scientist Albert" produces four tokens, so the sequence length is 4. The final dimension, 50257, is the model's vocabulary size. Why do we have so much data? Don't we only want the next predicted piece of text?

This is actually necessary, and it gives us a lot of control when analyzing the model. In total, we have 4 vectors of length 50257. Each vector holds one score (a logit) per token in the vocabulary, and applying a softmax to it yields a probability distribution over the next token. For example, if the value at index 2048 is higher than the others, the model is assigning a higher probability to the token with id 2048 being next in the sequence of text.
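
To see how a vector of scores becomes a probability distribution, here is a minimal softmax sketch in plain Python. The five scores are hypothetical and stand in for one of the 50257-long vectors; the `softmax` helper is written out by hand for illustration rather than taken from a library.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for a 5-token vocabulary (GPT-2 has 50257 entries).
toy_logits = [1.0, 3.5, 0.2, -1.0, 2.0]
probs = softmax(toy_logits)

# The index with the highest probability is the most likely next token id.
best_id = max(range(len(probs)), key=probs.__getitem__)
print(best_id)               # 1
print(round(sum(probs), 6))  # 1.0
```

Because softmax preserves the ordering of the scores, taking the argmax of the raw scores picks the same token as taking the argmax of the probabilities.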

This makes sense for our purposes: if we have a probability distribution for the next token in the text, we can sample from it to predict the next token! But why do we have four probability distributions in the output? In other words, why do we need a probability distribution for each token in the input?

The answer, oddly enough, is that this makes it easier to train our model! If we are training on a piece of text from the internet, we can train on multiple examples in parallel. For example, given the text "The great scientist Albert", here are a few different examples we could choose to train on.


1. Prefix="The" and label=" great"
2. Prefix="The great" and label=" scientist"
3. Prefix="The great scientist" and label=" Albert"


The first three vectors in the logits from the code above correspond to the model's predictions for the three prefixes listed above. We only care about the model's prediction for the full input "The great scientist Albert", so we only need the final vector along the sequence dimension. This makes sense for our purposes: we aren't training the model, we only want to see what it predicts for the next token.
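
The prefix/label pairing can be sketched in a few lines. This only illustrates how one token sequence yields several training examples; the `next_token_pairs` helper is a hypothetical name, and the token list is written out by hand rather than produced by the tokenizer.

```python
def next_token_pairs(tokens):
    """Return (prefix, label) pairs: each prefix is paired with the token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

tokens = ["The", " great", " scientist", " Albert"]
pairs = next_token_pairs(tokens)
for position, (prefix, label) in enumerate(pairs):
    # The logits at `position` are the model's prediction for `label`.
    print(position, repr("".join(prefix)), "->", repr(label))
# 0 'The' -> ' great'
# 1 'The great' -> ' scientist'
# 2 'The great scientist' -> ' Albert'
```

The logits at the last position (index 3 here) have no label in the text; they are the prediction for whatever token comes next.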

We can extract this last vector by taking `logits[0][-1]`. Let's see what the model predicts!

In [124]:
text = "The great scientist Albert"
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
logits = output.logits  # Shape: (batch_size, sequence_length, vocab_size)
token_id = torch.argmax(logits[0][-1])

generated_text = tokenizer.decode([token_id.item()]) 
print(generated_text)

 Einstein


In [93]:
# Greedy decoding: repeatedly append the single most likely next token
for i in range(100):
    encoded_input = tokenizer(text, return_tensors='pt')
    output = model(**encoded_input)
    logits = output.logits  # Shape: (batch_size, sequence_length, vocab_size)
    token_id = torch.argmax(logits[0][-1])  # highest-scoring token at the last position

    generated_text = tokenizer.decode([token_id.item()])
    text += generated_text

In [94]:
text

'The great scientist Albert Einstein, who was a great scientist, was a great scientist. He was a great scientist. He was a great scientist. He was a great scientist. He was a great scientist. He was a great scientist. He was a great scientist. He was a great scientist. He was a great scientist. He was a great scientist. He was a great scientist. He was a great scientist. He was a great scientist. He was a great scientist. He was a great scientist. He was a great scientist. He was a great scientist. He was a great scientist. He was a great scientist. He was a great scientist. He was a great scientist. He was a great scientist. He was a great scientist. He was a great scientist. He was a great scientist. He was a great scientist. He was a great scientist. He was a great scientist. He was a great scientist. He was a great scientist. He was a great scientist. He was a great scientist. He was a great scientist. He was a great scientist. He was a great scientist. He was a great scientist. He