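Optionally, change the model cache directory from the default `~/.cache` to the cluster's work directory, which is meant for big files such as model weights. To do so, uncomment the first two lines of the next cell; the environment variable has to be set before `transformers` is imported.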


In [1]:
# import os
# os.environ['TRANSFORMERS_CACHE'] = '/work/mremeli/huggingface'

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch



In [2]:
from tqdm import tqdm

## LLM

In [3]:
checkpoint = "gpt2"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, pad_token_id=tokenizer.eos_token_id)

## Input
The input is the first few lines of Dr. Seuss's children's story 'Green Eggs and Ham'.
This is a nice example because it uses relatively few distinct words and is quite repetitive, so the language model should be able to predict the next few tokens accurately.

In [4]:
past_text = """
I AM SAM. I AM SAM. SAM I AM.

THAT SAM-I-AM! THAT SAM-I-AM! I DO NOT LIKE THAT SAM-I-AM!

DO WOULD YOU LIKE GREEN EGGS AND HAM?

I DO NOT LIKE THEM,SAM-I-AM.
I DO NOT LIKE GREEN EGGS AND HAM.

WOULD YOU LIKE THEM HERE OR THERE?

I WOULD NOT LIKE THEM HERE OR THERE.
I WOULD NOT LIKE THEM ANYWHERE.
I DO NOT LIKE GREEN EGGS AND HAM.
I DO NOT LIKE THEM, SAM-I-AM.

WOULD YOU LIKE THEM IN A HOUSE?
WOULD YOU LIKE THEN WITH A MOUSE?

I DO NOT LIKE THEM IN A HOUSE.
I DO NOT LIKE THEM WITH A MOUSE.
I DO NOT LIKE THEM HERE OR THERE.
I DO NOT LIKE THEM ANYWHERE.
I DO NOT LIKE GREEN EGGS AND HAM.
I DO NOT LIKE THEM, SAM-I-AM.

WOULD YOU EAT THEM IN A BOX?
WOULD YOU EAT THEM WITH A FOX?

NOT IN A BOX. NOT WITH A FOX.
NOT IN A HOUSE. NOT WITH A MOUSE.
I WOULD NOT EAT THEM HERE OR THERE.
I WOULD NOT EAT THEM ANYWHERE.
I WOULD NOT EAT GREEN EGGS AND HAM.
I DO NOT LIKE THEM, SAM-I-AM.

WOULD YOU? COULD YOU? IN A CAR?
EAT THEM! EAT THEM! HERE THEY ARE.

I WOULD NOT, COULD NOT, IN A CAR.

YOU MAY LIKE THEM. YOU WILL SEE.
YOU MAY LIKE THEM IN A TREE!

"""

future_text = """I WOULD NOT, COULD NOT IN A TREE.
NOT IN A CAR! YOU LET ME BE.
I DO NOT LIKE THEM IN A BOX.
I DO NOT LIKE THEM WITH A FOX.
I DO NOT LIKE THEM IN A HOUSE.
I DO NOT LIKE THEM WITH A MOUSE.
I DO NOT LIKE THEM HERE OR THERE.
I DO NOT LIKE THEM ANYWHERE.
I DO NOT LIKE GREEN EGGS AND HAM.
I DO NOT LIKE THEM, SAM-I-AM.

"""

Next, we tokenize the 'past' text, which serves as the prompt, and let the model generate 20 new tokens.

In [5]:
tokenized_prompt = tokenizer(past_text, return_tensors="pt")
num_past_tokens = tokenized_prompt.input_ids.shape[1]
outputs = model.generate(**tokenized_prompt, max_new_tokens=20)

Then, we extract the generated text from the output:

In [6]:
generated_text = tokenizer.batch_decode(outputs[:,num_past_tokens:])[0]
print(generated_text)


I WOULD NOT, COULD NOT, IN A TREE.

YOU MAY LIKE


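As we can see, the next line was predicted almost exactly: 'I WOULD NOT, COULD NOT, IN A TREE.' The only difference from the true text is an extra comma after 'COULD NOT'.


After that, the generated text no longer follows the true text: the model produced 'YOU MAY LIKE', whereas the story continues with 'NOT IN A CAR! YOU LET ME BE.'


Not bad nevertheless!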
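
To make the comparison above concrete, here is a minimal sketch that counts how many leading words of the generated text match the true continuation. The two strings are copied from the output above and from `future_text`; the helper `common_prefix_len` is a hypothetical name introduced only for this illustration.

```python
def common_prefix_len(a, b):
    """Number of leading items that are identical in both sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

# Copied from the generated output and from the start of future_text
generated = "I WOULD NOT, COULD NOT, IN A TREE.\n\nYOU MAY LIKE"
reference = "I WOULD NOT, COULD NOT IN A TREE.\nNOT IN A CAR! YOU LET ME BE."

generated_words = generated.split()
reference_words = reference.split()

n_match = common_prefix_len(generated_words, reference_words)
# Show the first pair of words that differ (punctuation counts as part of a word)
print(n_match, repr(generated_words[n_match]), repr(reference_words[n_match]))
```

A strict word-by-word comparison therefore stops at the extra comma; ignoring punctuation, the whole first line would match.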

## Continuous next word prediction

Let X be the number of past tokens and Y the number of future tokens.

So far, our task was the following: `pred(tokens[:X]) ?= tokens[X:]`. Predict the future tokens based on the past tokens alone.

Next, we only want to predict the single next token accurately, always conditioning on the true text seen so far.

Task:
```
for k = 0 ... Y-1
    pred(tokens[:X+k]) ?= tokens[X+k]
```

We would like to see how often the next token is predicted correctly, and express that as the percentage of correctly predicted tokens (correct/Y).
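
The indexing of this task can be sketched in plain Python before involving the model. This is only an illustration under stated assumptions: `predict_next` stands in for the language model, and both `repeat_two_back` and the toy token list are hypothetical.

```python
def next_token_accuracy(tokens, num_past, predict_next):
    """Fraction of future tokens guessed correctly from the true prefix."""
    num_future = len(tokens) - num_past
    correct = 0
    for k in range(num_future):
        prefix = tokens[:num_past + k]  # all true tokens seen so far
        target = tokens[num_past + k]   # the token that actually comes next
        if predict_next(prefix) == target:
            correct += 1
    return correct / num_future

# Hypothetical stand-in for the model: repeat the token seen two steps earlier
def repeat_two_back(prefix):
    return prefix[-2]

toy_tokens = [1, 2, 1, 2, 1, 2, 3, 2]
accuracy = next_token_accuracy(toy_tokens, num_past=4, predict_next=repeat_two_back)
print(accuracy)  # 3 of the 4 future tokens are guessed correctly -> 0.75
```

Note that the prefix always contains the true tokens, never the model's own earlier guesses, so one wrong prediction does not affect the next one.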

In [7]:
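def slice_token_dict(token_dict, max_idx):
    """Keep only the first `max_idx` positions of every tensor in the tokenizer output."""
    # Each value (e.g. input_ids, attention_mask) has shape (batch, sequence_length)
    return {key: val[:, :max_idx] for key, val in token_dict.items()}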

In [8]:
correct = 0
total_preds = 0

full_text = past_text + future_text
full_tokenized = tokenizer(full_text, return_tensors="pt") # tokenize full text
num_tokens = full_tokenized.input_ids.shape[1]

for k in tqdm(range(num_tokens - num_past_tokens)):
    num_used_tokens = k + num_past_tokens
    partial_tokenized = slice_token_dict(full_tokenized, num_used_tokens)
    outputs = model.generate(**partial_tokenized, max_new_tokens=1) # predict only the next token
    
    predicted_token = outputs[:,-1].item()
    true_token = full_tokenized.input_ids[:,num_used_tokens].item() # plain int, so the check below compares int with int
    
    total_preds += 1
    if predicted_token == true_token:
        correct += 1

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


100%|████████████████████████████████████████████████████████| 112/112 [01:17<00:00,  1.44it/s]


In [9]:
print("Num correct: %d" % correct)
print("Total preds: %d" % total_preds)
print("%% of correctly predicted words: %.2f%%" % (100*(correct/total_preds)))

Num correct: 87
Total preds: 112
% of correctly predicted words: 77.68%
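
The loop above calls `generate` once per future token and re-reads the whole prefix every time. A causal language model also produces, in a single forward pass, one score vector per position, so the same accuracy could in principle be computed from one pass. Below is a minimal sketch of that bookkeeping, assuming greedy decoding (argmax) and that `logits[t]` holds the scores for the token following position `t`. With `transformers`, such a list could presumably be obtained from `model(**full_tokenized).logits[0].tolist()`; the helper name `accuracy_from_logits` and the toy numbers are hypothetical, and small numerical differences from the loop are possible.

```python
def accuracy_from_logits(logits, token_ids, num_past):
    """Next-token accuracy over token_ids[num_past:], using greedy (argmax) predictions."""
    num_future = len(token_ids) - num_past
    correct = 0
    for pos in range(num_past, len(token_ids)):
        scores = logits[pos - 1]  # scores produced after reading token_ids[:pos]
        predicted = max(range(len(scores)), key=scores.__getitem__)  # argmax
        if predicted == token_ids[pos]:
            correct += 1
    return correct / num_future

# Toy example: vocabulary of 3 tokens, 2 past tokens, 2 future tokens
toy_ids = [0, 1, 2, 1]
toy_logits = [
    [0.1, 0.9, 0.0],  # after token 0 (inside the prompt, not scored)
    [0.2, 0.1, 0.7],  # after token 1: argmax is 2, the true next token is 2
    [0.6, 0.3, 0.1],  # after token 2: argmax is 0, the true next token is 1
    [0.3, 0.3, 0.4],  # after the last token: nothing left to compare against
]
toy_accuracy = accuracy_from_logits(toy_logits, toy_ids, num_past=2)
print(toy_accuracy)  # one of two future tokens is correct -> 0.5
```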
