# Running GPT-2 Locally

GPT-2 was introduced in [this paper](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) by <i>Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever</i> (all affiliated with OpenAI at the time).

You can find more information about the version of GPT-2 that we are loading [here](https://huggingface.co/openai-community/gpt2?library=transformers). The version we will be using has 124 million parameters, making it the smallest of the GPT-2 family of models. The [largest version](https://huggingface.co/openai-community/gpt2-xl) is over ten times larger, at 1.5B total parameters.

In [1]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import xlab

In [2]:
tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")

## Tokenize Text

To input a sequence of text to GPT-2, we first have to decide how to convert the text to numbers that we can feed to the model. Typically, we convert a string of text to a sequence of tokens, each of which is assigned a number that can then be embedded into a vector. To do this, we have a few options:

1. We can assign each character its own number
2. We can assign each word or special character its own number
3. We can assign common sequences of characters their own number
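
To make the first two options concrete, here is a minimal sketch in plain Python. The regular expression used for word splitting and the way IDs are assigned are illustrative assumptions, not what GPT-2 does.

```python
import re

text = "Hello there gpt-2!"

# Option 1: character-level. Every distinct character gets its own ID.
char_vocab = {ch: i for i, ch in enumerate(sorted(set(text)))}
char_ids = [char_vocab[ch] for ch in text]

# Option 2: word-level. Every word or punctuation mark gets its own ID.
# The regex below is an illustrative choice, not GPT-2's splitting rule.
words = re.findall(r"\w+|[^\w\s]", text)
word_vocab = {w: i for i, w in enumerate(sorted(set(words)))}
word_ids = [word_vocab[w] for w in words]

print(len(char_ids), "character tokens")
print(words)
print(len(word_ids), "word tokens")
```

Character-level tokenization needs only a tiny vocabulary but produces long sequences, while word-level tokenization produces short sequences but needs an entry for every word it might ever see.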

Option #3 is typically the most popular, and it is the high-level approach taken in the GPT-2 paper. This approach has the advantage of keeping the total number of tokens small while still capturing some of the underlying structure of natural language. Specifically, the authors use a modified version of BPE (byte pair encoding), described [here](https://arxiv.org/pdf/1508.07909). If you are interested, more implementation details of the tokenizer can be found in the [GPT-2 paper](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf).
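
For intuition about how BPE builds its vocabulary, the sketch below repeatedly merges the most frequent adjacent pair of symbols in a toy corpus. It is a simplified, character-level illustration: the corpus, its frequencies, and the helper names `most_frequent_pair` and `merge_pair` are made up for this example, and GPT-2's actual tokenizer works on bytes with additional rules that are not reproduced here.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words and return the most common one."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Toy corpus (made up): each word is a tuple of characters with a frequency.
corpus = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6, tuple("widest"): 3}

merges = []
for _ in range(3):
    pair = most_frequent_pair(corpus)
    merges.append(pair)
    corpus = merge_pair(corpus, pair)

print(merges)
print(list(corpus))
```

On this toy corpus the first three merges are `e`+`s`, `es`+`t`, and `l`+`o`, so frequent fragments such as `est` become single tokens while rarer words stay split into smaller pieces.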

Time to try out the GPT-2 tokenizer! Run the cell below to see the tokenizer convert the string into a sequence of numbers:

In [3]:
text = "Barack Obama taught constitutional law at the University of"
encoded_input = tokenizer(text, return_tensors='pt')
print(encoded_input['input_ids'])

tensor([[10374,   441,  2486,  7817,  9758,  1099,   379,   262,  2059,   286]])


Let's take a look at what each of these numbers represents:

In [4]:
for token_id in encoded_input['input_ids'][0]:
    print(f'{token_id.item()}\t --> \t"{tokenizer.decode(token_id)}"')

10374	 --> 	"Bar"
441	 --> 	"ack"
2486	 --> 	" Obama"
7817	 --> 	" taught"
9758	 --> 	" constitutional"
1099	 --> 	" law"
379	 --> 	" at"
262	 --> 	" the"
2059	 --> 	" University"
286	 --> 	" of"


We can also decode the entire sequence at once. This will be helpful to remember for later!

In [5]:
tokenizer.decode(encoded_input['input_ids'][0])

'Barack Obama taught constitutional law at the University of'

# Task #1

For a given input text, return a list of its tokens in plain text. For example, for the input "Hello there gpt-2!", the function should return ['Hello', ' there', ' g', 'pt', '-', '2', '!']. Note that this is very different from splitting the text into random chunks or splitting at spaces! Tokenizers are designed to create groupings of characters that are often found together or that are significant in the structure of language. You are encouraged to play around with different examples and observe how smart the tokenizer can be!

In [6]:
# estimated time to complete: ~3 minutes
def plain_text_tokens(prefix):
    """Tokenizes a text prefix into individual token strings.
    
    Args:
        prefix (str): The input text string to be tokenized.
        
    Returns:
        list[str]: A list of individual tokens as strings. Each token represents
            how the tokenizer splits the input text.
            
    Example:
        >>> plain_text_tokens("Hello there gpt-2!")
        ['Hello', ' there', ' g', 'pt', '-', '2', '!']
    """
    rv = []
    ######## YOUR CODE HERE ########
    encoded_input = tokenizer(prefix, return_tensors='pt')
    for i in encoded_input['input_ids'][0]:
        rv.append(tokenizer.decode([i]))
    return rv

# test out your implementation on different inputs to get a sense of how the tokenizer works!
print(plain_text_tokens("Hello there gpt-2!"))
print(plain_text_tokens("https://xrisk.uchicago.edu/fellowship/"))

['Hello', ' there', ' g', 'pt', '-', '2', '!']
['https', '://', 'x', 'risk', '.', 'uch', 'icago', '.', 'edu', '/', 'fell', 'owship', '/']


In [7]:
xlab.tests.section1_0.task1(plain_text_tokens)

Running tests for Section 1.0, Task 1...

✓ 1. Test case function runs without crashing                  PASSED
✓ 2. Test case 'Hello there gpt-2'                             PASSED
✓ 3. Test case '??!hello--*- world#$'                          PASSED
✓ 4. Test case 'https://xrisk.uchicago.edu/fellowship/'        PASSED
✓ 5. Test case ''                                              PASSED
✓ 6. Test case '.,.,.,.,.,.,.,'                                PASSED

🎉 All tests passed! (6/6)


{'total_tests': 6, 'passed': 6, 'failed': 0, 'score': 100}

In [8]:
text = "Barack Obama taught constitutional law at the University of"

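# Tokenize the prompt and run a single forward pass through the model
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
logits = output.logits  # Shape: (batch_size, sequence_length, vocab_size)

# The scores at the last position rank every vocabulary entry as the next token;
# argmax picks the single highest-scoring one
token_id = torch.argmax(logits[0][-1])

generated_text = tokenizer.decode([token_id.item()])
print(generated_text)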

 Chicago
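
The logits at the last position are unnormalized scores, one per vocabulary entry. As a rough illustration of how such scores relate to probabilities, the sketch below applies a softmax to a handful of made-up logits. The four candidate tokens and their values are hypothetical and are not real GPT-2 output; in the notebook itself the same idea could be applied with `torch.softmax(logits[0][-1], dim=-1)`.

```python
import math

# Hypothetical scores for four candidate tokens (made-up numbers, not real GPT-2 output).
toy_logits = {" Chicago": 6.0, " Illinois": 4.5, " California": 3.0, " the": 1.0}

# Softmax: exponentiate (shifted by the max for numerical stability) and normalize.
max_logit = max(toy_logits.values())
exp_scores = {tok: math.exp(score - max_logit) for tok, score in toy_logits.items()}
total = sum(exp_scores.values())
probs = {tok: value / total for tok, value in exp_scores.items()}

# Greedy decoding picks the most probable token, which is also the largest logit.
greedy_token = max(probs, key=probs.get)

for tok, p in sorted(probs.items(), key=lambda kv: -kv[1]):
    print(f"{tok!r}\t{p:.3f}")
print("greedy choice:", repr(greedy_token))
```

Because softmax preserves the ordering of the scores, taking the argmax of the raw logits (as in the cell above) selects the same token as taking the argmax of the probabilities.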
