<a href="https://colab.research.google.com/github/yotamgardosh/Between-Artificial-and-Human-Intelligence/blob/main/language_acquisition.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Word Embeddings and Language Acquisition

In this workshop you will get a chance to experiment with both static (word2vec/GloVe) and contextual (self-attention) word embeddings.
To do so, you will train your own GPT2 from scratch on "Alice's Adventures in Wonderland".

Along the way you will (hopefully) learn to:
1. Run code in Colab
2. Work with Python libraries useful for NLP and ML in general (gensim, PyTorch, NumPy, etc.)
3. Use code from an external [git repo](https://github.com/karpathy/minGPT.git)
4. Build a training dataset from [raw text](https://archive.org/stream/alicesadventures19033gut/19033.txt)
5. Train a Transformer-based language model from scratch!
6. Generate text using your LM 😀
7. Extract and plot self-attention scores from your model.


To get a feeling for static embedding arithmetic you can play around with the following demo:

https://dash.gallery/dash-word-arithmetic/

(These are word2vec embeddings trained on the Google News dataset.)



To see how easy this is to implement in code, you can try out the [gensim](https://radimrehurek.com/gensim/models/keyedvectors.html) library and check whether the same effects reproduce with GloVe embeddings.

The following functions may be helpful: `most_similar` and `doesnt_match`.

In [None]:
import gensim.downloader as api
word_vectors = api.load("glove-wiki-gigaword-100")


In [None]:
res = word_vectors.most_similar(positive=['hitler','italy'], negative=['germany'])
most_similar_key, similarity = res[0]
print(f"{most_similar_key}: {similarity:.4f}")

res = word_vectors.similar_by_word('israel')
print(res)

print(word_vectors.doesnt_match(["football", "tenis", "judo"]))

mussolini: 0.8272
[('israeli', 0.8549681901931763), ('palestinians', 0.8094196915626526), ('palestinian', 0.7847806811332703), ('lebanon', 0.7811506390571594), ('syria', 0.7781012654304504), ('israelis', 0.7683233022689819), ('jerusalem', 0.766927182674408), ('gaza', 0.7554308772087097), ('netanyahu', 0.732354462146759), ('arafat', 0.731320321559906)]
tenis
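
To make the arithmetic behind `most_similar` more concrete, here is a minimal sketch using hand-made 2-dimensional toy vectors. These are not real GloVe embeddings: the values and the helper names `cosine` and `toy_analogy` are made up for illustration. The sketch adds the positive vectors, subtracts the negative one, and ranks the remaining words by cosine similarity. gensim's actual implementation differs in details such as vector normalisation.

```python
import math

# Hypothetical toy embeddings: first axis ~ "royalty", second axis ~ "gender".
toy_vectors = {
    "king":  [1.0,  1.0],
    "queen": [1.0, -1.0],
    "man":   [0.1,  1.0],
    "woman": [0.1, -1.0],
    "apple": [-0.5, 0.2],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def toy_analogy(positive, negative):
    """Return the word closest to sum(positive) - sum(negative)."""
    dim = len(next(iter(toy_vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + x for t, x in zip(target, toy_vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, toy_vectors[word])]
    # Like gensim, do not return one of the query words themselves.
    candidates = [w for w in toy_vectors if w not in positive + negative]
    return max(candidates, key=lambda w: cosine(target, toy_vectors[w]))

toy_result = toy_analogy(positive=["king", "woman"], negative=["man"])
print(toy_result)
```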




---





It's time to train your very own GPT2!

First, you need to clone the repo and set up the environment.

In [None]:
!git clone https://github.com/karpathy/minGPT.git
!pip install -e 'minGPT/'

Cloning into 'minGPT'...
remote: Enumerating objects: 489, done.
remote: Total 489 (delta 0), reused 0 (delta 0), pack-reused 489
Receiving objects: 100% (489/489), 1.44 MiB | 5.06 MiB/s, done.
Resolving deltas: 100% (260/260), done.
Obtaining file:///content/minGPT
  Preparing metadata (setup.py) ... done
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch->minGPT==0.0.1)
  Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch->minGPT==0.0.1)
  Using cached nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch->minGPT==0.0.1)
  Using cached nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (14.1 MB)
Collecting nvidia-cudnn-cu12==8.9.2.26 (from torch->minGPT==0.0.1)
  Using cached nvidia_cudnn_cu12-8.9.2.26-py3-none-manylinux1_x86_64.whl (731.7 MB)
Collecting nvidia-cublas-cu12==12.1.3.1 

After cloning, go to Runtime and "Restart Session".
Then run the next cell.

In [None]:
import sys
sys.path.insert(0, './minGPT/')

import torch
from torch.utils.data import Dataset
from torch.nn import functional as F
import numpy as np
from mingpt.bpe import get_encoder, BPETokenizer

Here are some constants you'll need for training.

Feel free to experiment with the different hyper-parameters, but be aware that increasing BLOCK_SIZE or BATCH_SIZE too much may cause your Colab session to run out of memory.

Don't forget to upload the text file of "Alice's Adventures in Wonderland" to Colab, and update the ALICE_IN_WONDERLAND_PATH constant if necessary.

In [None]:
ALICE_IN_WONDERLAND_PATH = 'alice_in_wonderland.txt'
BLOCK_SIZE = 32
VOCAB_SIZE = 50257
NUM_BLOCKS_IN_GPT = 12
MODEL_TYPE = 'gpt2'
MAX_ITERS = 2000
BATCH_SIZE = 32
LEARNING_RATE = 5e-4
NUM_ATTN_HEADS = 12


Now it's time to create a training dataset out of the raw text of "Alice's Adventures in Wonderland".

First, read the text from the file into a string.
Then encode the string using the supplied Byte Pair Encoder.

In the file [bpe.py](https://github.com/karpathy/minGPT/blob/master/mingpt/bpe.py), you can find an example of how to use the encoder.

Instead of `encoder.encode_and_show_work()`, use `encoder.encode()`.
This function returns a list of token indexes corresponding to the text.
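
For intuition about what a Byte Pair Encoder does, here is a toy sketch of a single merge step: find the most frequent adjacent pair of symbols and fuse it into one new symbol. This is only an illustration of how a merge table is learned, not minGPT's implementation; the helper names `most_frequent_pair` and `merge_pair` and the toy string are made up. A trained tokenizer such as the one used here applies a fixed, pre-learned list of merges at encoding time.

```python
from collections import Counter

def most_frequent_pair(symbols):
    """Return the adjacent pair of symbols that occurs most often."""
    pair_counts = Counter(zip(symbols, symbols[1:]))
    return pair_counts.most_common(1)[0][0]

def merge_pair(symbols, pair):
    """Replace every non-overlapping occurrence of `pair` with one fused symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

toy_symbols = list("aaabdaaabac")          # start from single characters
toy_pair = most_frequent_pair(toy_symbols)
toy_merged = merge_pair(toy_symbols, toy_pair)
print(toy_pair, toy_merged)
```

Repeating this step many times on a large corpus yields progressively longer sub-word units, which is why common words end up as a single token while rare words are split into several.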


In [None]:
with open(ALICE_IN_WONDERLAND_PATH, 'r') as f:
  Alice_text = f.read()

encoder = get_encoder()
BP_idx = encoder.encode(Alice_text)

downloading https://openaipublic.blob.core.windows.net/gpt-2/models/124M/encoder.json to /root/.cache/mingpt/encoder.json
downloading https://openaipublic.blob.core.windows.net/gpt-2/models/124M/vocab.bpe to /root/.cache/mingpt/vocab.bpe


Next, you should write a custom dataset class that gets the token indexes from the encoder as input, inherits from [Dataset](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html#creating-a-custom-dataset-for-your-files), and implements three functions: \_\_init__, \_\_len__, and \_\_getitem__.

The \_\_init__ function just needs to save the token indexes.

The \_\_len__ function returns the number of samples in our dataset (one more than the largest i for which \_\_getitem__ returns a valid sample).

The \_\_getitem__ function loads and returns a sample (input, label) from the dataset at the given index i.
\_\_getitem__ should return 2 objects of type torch.tensor with dtype=torch.long.

Remember that GPT2 is an auto-regressive text model, so we train the model to predict the next token based on the given prefix.
This means that the i-th **input** will be the block_size tokens starting at position i, tokens[i : i + block_size], and the **label** will be the same window shifted one token to the right,
tokens[i + 1 : i + block_size + 1].
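
The slicing described above can be checked on a tiny made-up token list before writing the dataset class. The names `toy_tokens`, `toy_block_size`, and `toy_sample` below are hypothetical and plain Python lists stand in for tensors; this is only a sketch of the indexing, not the dataset itself.

```python
toy_tokens = [10, 11, 12, 13, 14, 15]  # made-up token indexes
toy_block_size = 3

def toy_sample(tokens, i, block_size):
    """Return the (input, label) pair that starts at position i."""
    x = tokens[i : i + block_size]
    y = tokens[i + 1 : i + block_size + 1]
    return x, y

toy_x, toy_y = toy_sample(toy_tokens, 1, toy_block_size)
print(toy_x, toy_y)

# Each label position holds the token that follows the corresponding input position.
for position in range(toy_block_size):
    print(f"prefix {toy_x[: position + 1]} -> target {toy_y[position]}")
```

Note that a single (input, label) pair contains block_size separate next-token prediction problems, one per prefix length.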

In [None]:
class AIWDataset(Dataset):

    def __init__(self, token_idxs):
        self.token_idxs = token_idxs

    def __len__(self):
        # each sample needs BLOCK_SIZE + 1 consecutive tokens (the input plus the shifted label),
        # so the valid start positions are 0 .. len(token_idxs) - BLOCK_SIZE - 1
        return len(self.token_idxs) - BLOCK_SIZE

    def __getitem__(self, i):  # return (x, y) for training
        # input: the BLOCK_SIZE tokens starting at position i
        x = torch.tensor(self.token_idxs[i : i + BLOCK_SIZE], dtype=torch.long)
        # label: the same window shifted one token to the right
        y = torch.tensor(self.token_idxs[i + 1 : i + BLOCK_SIZE + 1], dtype=torch.long)

        return x, y

dataset = AIWDataset(BP_idx)

Now all that's left is to load and train your model.
Check out the README of the minGPT repo for [instructions](https://github.com/karpathy/minGPT/blob/master/README.md).

Use the constants provided above to set the values in model_config (model_type, block_size, vocab_size).

You can add a callback function to your trainer to see how the loss changes during training.
Use the constants provided above to set the values in trainer_config (learning_rate, max_iters, batch_size).

If you haven't changed the hyper-parameters, training on a T4 machine should take about 10 minutes per 2000-iteration training run.

If you get an out-of-memory error, try using a smaller version of the model, for example 'gpt-mini'.

In [None]:
from mingpt.model import GPT
from mingpt.trainer import Trainer


def load_model():
    """
    Load a pre-configured GPT model.

    Returns:
        model (GPT): The loaded GPT model.
    """
    # Get default configuration for the GPT model
    model_config = GPT.get_default_config()
    # Set the model type (e.g., 'gpt2' or 'gpt-mini')
    model_config.model_type = MODEL_TYPE
    # Set the vocabulary size based on the OpenAI model's vocabulary
    model_config.vocab_size = VOCAB_SIZE
    # Set the block size, representing the input context length for the model
    model_config.block_size = BLOCK_SIZE
    # Initialize and return the GPT model with the specified configuration
    model = GPT(model_config)
    return model

def batch_end_callback(trainer):
    """
    Callback function called at the end of each training batch.

    Args:
        trainer (Trainer): The trainer object managing the training process.
    """
    # Print training progress every 100 iterations
    if trainer.iter_num % 100 == 0:
        print(f"iter_dt {trainer.iter_dt * 1000:.2f}ms; iter {trainer.iter_num}: train loss {trainer.loss.item():.5f}")

def train_model(model, train_dataset):
    """
    Train the specified model using the provided dataset.

    Args:
        model (GPT): The GPT model to be trained.
        train_dataset (Dataset): The dataset containing training examples.

    Returns:
        device (str): The device (CPU or GPU) used for training.
    """
    # Get default training configuration
    train_config = Trainer.get_default_config()
    # Set learning rate for training
    train_config.learning_rate = LEARNING_RATE
    # Set maximum number of training iterations
    train_config.max_iters = MAX_ITERS
    # Set batch size for training
    train_config.batch_size = BATCH_SIZE
    # Initialize trainer with the specified configuration, model, and dataset
    trainer = Trainer(train_config, model, train_dataset)
    # Set callback function to monitor training progress
    trainer.set_callback('on_batch_end', batch_end_callback)
    # Start training process
    trainer.run()
    # Set extra bookkeeping attributes on the trainer (not used elsewhere in this notebook)
    trainer.init_time = None
    trainer.train_loss_vals = []
    trainer.train_loss_times = []
    # Run a second training pass of MAX_ITERS iterations (the iteration counter restarts at 0)
    trainer.run()
    # Set the model to evaluation mode after training
    model.eval()
    # Return the device used for training
    return trainer.device


mini_gpt = load_model()
train_device = train_model(mini_gpt, dataset)



number of parameters: 123.68M
running on device cuda
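
The reported parameter count can be reproduced by hand. The sketch below assumes the 'gpt2' preset uses 12 blocks with an embedding width of 768, and that the printed number covers the transformer body only (token and position embeddings, the blocks, and the final layer norm) without the output head. Both are assumptions about minGPT rather than something stated in this notebook, and the helper names are made up.

```python
VOCAB_SIZE = 50257
BLOCK_SIZE = 32
N_LAYER, N_EMBD = 12, 768  # assumed values for the 'gpt2' preset

def linear_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

def layernorm_params(n):
    """Scale and shift vectors of a layer norm."""
    return 2 * n

# token embedding table + learned position embedding table
embeddings = VOCAB_SIZE * N_EMBD + BLOCK_SIZE * N_EMBD
# joint query/key/value projection + output projection
attention = linear_params(N_EMBD, 3 * N_EMBD) + linear_params(N_EMBD, N_EMBD)
# feed-forward network with a 4x wider hidden layer
mlp = linear_params(N_EMBD, 4 * N_EMBD) + linear_params(4 * N_EMBD, N_EMBD)
block = 2 * layernorm_params(N_EMBD) + attention + mlp

total_params = embeddings + N_LAYER * block + layernorm_params(N_EMBD)
print(f"{total_params / 1e6:.2f}M")
```

Under these assumptions, almost a third of the parameters sit in the token embedding table, which is one reason a smaller model type helps when memory is tight.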


  self.pid = os.fork()


iter_dt 0.00ms; iter 0: train loss 10.91157
iter_dt 293.77ms; iter 100: train loss 5.09858
iter_dt 308.88ms; iter 200: train loss 4.66978
iter_dt 334.16ms; iter 300: train loss 4.48186
iter_dt 324.31ms; iter 400: train loss 3.78834
iter_dt 320.70ms; iter 500: train loss 3.92948
iter_dt 357.58ms; iter 600: train loss 3.16907
iter_dt 323.99ms; iter 700: train loss 3.61388
iter_dt 325.01ms; iter 800: train loss 2.89561
iter_dt 325.95ms; iter 900: train loss 2.86004
iter_dt 336.38ms; iter 1000: train loss 2.73472
iter_dt 328.73ms; iter 1100: train loss 2.37208
iter_dt 314.89ms; iter 1200: train loss 2.44860
iter_dt 335.84ms; iter 1300: train loss 2.27997
iter_dt 315.60ms; iter 1400: train loss 1.94524
iter_dt 326.44ms; iter 1500: train loss 1.78241
iter_dt 326.62ms; iter 1600: train loss 1.61986
iter_dt 326.55ms; iter 1700: train loss 1.49093
iter_dt 296.66ms; iter 1800: train loss 1.23891
iter_dt 322.92ms; iter 1900: train loss 1.05599


  self.pid = os.fork()


iter_dt 304.23ms; iter 0: train loss 1.00131
iter_dt 401.87ms; iter 100: train loss 0.92797
iter_dt 252.33ms; iter 200: train loss 0.76371
iter_dt 323.08ms; iter 300: train loss 0.81715
iter_dt 297.52ms; iter 400: train loss 0.66154
iter_dt 325.12ms; iter 500: train loss 0.71316
iter_dt 325.33ms; iter 600: train loss 0.62251
iter_dt 325.79ms; iter 700: train loss 0.59499
iter_dt 325.54ms; iter 800: train loss 0.49902
iter_dt 324.62ms; iter 900: train loss 0.53952
iter_dt 325.70ms; iter 1000: train loss 0.47086
iter_dt 327.06ms; iter 1100: train loss 0.54779
iter_dt 325.21ms; iter 1200: train loss 0.43279
iter_dt 326.11ms; iter 1300: train loss 0.48745
iter_dt 326.79ms; iter 1400: train loss 0.46009
iter_dt 326.16ms; iter 1500: train loss 0.49696
iter_dt 326.30ms; iter 1600: train loss 0.44778
iter_dt 321.00ms; iter 1700: train loss 0.39358
iter_dt 325.81ms; iter 1800: train loss 0.42523
iter_dt 322.22ms; iter 1900: train loss 0.43914
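
A quick sanity check on these numbers: a model that spreads its probability evenly over the whole vocabulary has a cross-entropy loss of ln(VOCAB_SIZE), which is close to the first logged value. The sketch below computes that baseline and converts the last logged loss into a perplexity (the exponential of the loss). The variable names are made up for illustration.

```python
import math

VOCAB_SIZE = 50257

# Cross-entropy of a uniform guess over the vocabulary.
uniform_loss = math.log(VOCAB_SIZE)

# Last training loss in the log above.
final_loss = 0.43914
final_perplexity = math.exp(final_loss)

print(f"uniform baseline loss: {uniform_loss:.2f}")
print(f"final perplexity: {final_perplexity:.2f}")
```

Keep in mind that this is the training loss on a single short book, so a low value probably reflects memorisation of the text more than general language ability.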


You now have a trained LM - congratulations! 🎉🎉🎉


It's time to put it to the test and generate some text (don't expect any miracles here, all it knows is "Alice's Adventures in Wonderland", so cut it some slack).

You can experiment with different prompts, number of generated tokens, etc.


To do so, use the model's built-in `generate` function (`model.generate()`), which receives as input:

1. A torch.tensor of dtype=torch.long holding the indexes of the prompt tokens. Just like in training, start with a string prompt of your choosing, encode it with the encoder, and wrap the result in a list before converting it: `torch.tensor([encoder_output], dtype=torch.long)`.

**Note**: to avoid a device mismatch error, move the resulting tensor (t) to the trainer's device using `t.to(device=trainer.device)`.

2. The number of tokens to generate. You can choose any positive number you like.


The function returns token indexes, which you can convert to text using `encoder.decode(token_indexes.squeeze().tolist())`
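
Conceptually, generation without sampling is a loop: score every possible next token, append the highest-scoring one, and repeat. The toy sketch below illustrates that loop with a made-up lookup table standing in for the model. The names `toy_next_token_scores` and `greedy_generate` are hypothetical, and a real model conditions on the whole prefix rather than only on the last token.

```python
def toy_next_token_scores(prefix):
    """Hypothetical stand-in for a model: scores for 3 tokens, based on the last token only."""
    table = {
        0: [0.1, 0.7, 0.2],
        1: [0.2, 0.1, 0.7],
        2: [0.6, 0.3, 0.1],
    }
    return table[prefix[-1]]

def greedy_generate(prefix, max_new_tokens):
    """Repeatedly append the highest-scoring next token."""
    tokens = list(prefix)
    for _ in range(max_new_tokens):
        scores = toy_next_token_scores(tokens)
        next_token = max(range(len(scores)), key=scores.__getitem__)
        tokens.append(next_token)
    return tokens

toy_generated = greedy_generate([0], max_new_tokens=4)
print(toy_generated)
```

Because the choice is deterministic, the same prompt always yields the same continuation; sampling from the scores instead would give more varied text.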

In [None]:
def generate_text(model, device, encoder, prompt, new_token_length):
  # encode the prompt and add a batch dimension
  initial_idx = torch.tensor([encoder.encode(prompt)], dtype=torch.long).to(device=device)
  # new_token_length is used as the total length budget (prompt + generated tokens)
  tokenized_output = model.generate(initial_idx, max_new_tokens=new_token_length - initial_idx.shape[1], do_sample=False, top_k=40)
  return encoder.decode(tokenized_output.squeeze().tolist())

generate_text(mini_gpt, train_device, encoder, "how is the queen doing?", 20)


To understand the contextual embedding your model uses to generate text, we will take a closer look at the self-attention output.

Generate a sentence with at least 10 tokens.
For the prediction of the last token, extract the attention scores
from the **first** and **last** transformer blocks.
Average the scores over the different attention heads.

To extract the (averaged) attention scores you will need to modify your local **model.py** file at the following points in the code:
1. Return the average attention scores in the *forward* function of the **CausalSelfAttention** class as an additional output (apply the mean after the softmax and before the dropout).
2. Handle this additional output in the *forward* function of the **Block** class, and return it as well.
3. In the *forward* function of the **GPT** class, make sure to save the attention scores from the relevant blocks and return them as a list alongside the existing outputs.
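
Before editing model.py, it may help to see the computation from step 1 in isolation. The NumPy sketch below computes causal attention weights for several heads and averages them over the head dimension. It is a standalone illustration, not the code from model.py: the function name `averaged_causal_attention`, the random inputs, and the shapes (heads, tokens, head dimension) are assumptions made for the example.

```python
import numpy as np

def averaged_causal_attention(q, k):
    """q, k: arrays of shape (n_heads, T, head_dim). Returns a (T, T) array."""
    n_heads, T, head_dim = q.shape
    # scaled dot-product scores for every head: (n_heads, T, T)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)
    # causal mask: position t may only attend to positions <= t
    mask = np.tril(np.ones((T, T), dtype=bool))
    scores = np.where(mask, scores, -np.inf)
    # softmax over the last axis (shifted by the row maximum for stability)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # average over heads, after the softmax
    return weights.mean(axis=0)

rng = np.random.default_rng(0)
toy_q = rng.normal(size=(4, 5, 8))  # 4 heads, 5 tokens, head dimension 8
toy_k = rng.normal(size=(4, 5, 8))
toy_avg_attn = averaged_causal_attention(toy_q, toy_k)
print(toy_avg_attn.round(3))
```

The last row of the result corresponds to the final position, which is the row of interest when inspecting what the model attends to while predicting the next token.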


Or you can take the modified file from Moodle to run the next section of code (replacing the model.py file you already have under the mingpt directory).

IMPORTANT NOTE:
After modifying the model.py file you will need to restart your runtime and re-train your model.
Because the modified forward function returns an additional value, first change line 93 in trainer.py to the following (restart the runtime after changing this line):

```
logits, self.loss, _ = model(x, y)
```



With the modified model.py you can generate tokens using the built-in model.forward() function.
Make sure to call the forward function inside a `with torch.no_grad():` block.

This function expects two inputs: a torch.tensor of token indexes (like before), and attn_block_nums, which is a list of the blocks from which you would like to extract the attention.

So an example of a call would be: `model.forward(tokens, attn_block_nums=[0, 3])`

In [None]:
ATTENTION_PROMPT = "The King and Queen of Hearts were seated on their"

# encode the prompt, add a batch dimension, and move it to the training device
prompt_token_idxs = encoder.encode(ATTENTION_PROMPT)
prompt_token_idxs_tensor = torch.tensor([prompt_token_idxs], dtype=torch.long).to(device=train_device)

# extract attention from the first and last transformer blocks
with torch.no_grad():
    logits, _, attn_scores = mini_gpt(prompt_token_idxs_tensor, attn_block_nums=[0, NUM_BLOCKS_IN_GPT - 1])

The function returns three values: logits, loss, attns.

You can ignore the loss and use the function `logits_to_token_idx` to get the token index of the token predicted/generated by the model.

Then you can decode this token index using encoder.decode() to get the actual token and print it out.


In [None]:
def logits_to_token_idx(logits):
    next_token_logits = logits[:, -1, :]
    probs = F.softmax(next_token_logits, dim=-1)
    idx = torch.topk(probs, k=1)[1]
    return idx.cpu().tolist()[0]
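
A side note on the function above: softmax preserves the ordering of its inputs, so picking the top-1 entry of the probabilities selects the same index as picking the top-1 entry of the raw logits. The pure-Python sketch below checks this on made-up logits; the helper names `softmax` and `argmax` and the values are hypothetical.

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities."""
    largest = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)

toy_logits = [2.0, -1.0, 0.5, 3.2]
toy_probs = softmax(toy_logits)
print(argmax(toy_logits), argmax(toy_probs))
```

The softmax step is still useful when you want to inspect how confident the model is, or to sample from the distribution instead of always taking the most likely token.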

In [None]:
next_token_index = logits_to_token_idx(logits)
next_token = encoder.decode(next_token_index)

print("Sentence sampled for attention: \n" + ATTENTION_PROMPT + next_token)

attns, as returned by the forward function, contains two items: (first_block_att_scores, last_block_att_scores).

Each item holds the attention scores for every position in the prompt, so you will need to use `first_block_att_scores.squeeze()[-1]` (and the same for last_block_att_scores) to get the attention for the last position, which is the one that predicts the next token.

So split attns into its two items, and then pass each of them to the function `paint_tokens_by_attention` to paint the tokens by their attention weight.

Remember to pass the tokens, not the original prompt, to this painting function.

To get them you can encode the prompt with the encoder, and then decode each item in the resulting list separately, wrapped in a list (the decode() function accepts only lists).

In [None]:
import matplotlib.pyplot as plt
import matplotlib.cm as cm

def paint_tokens_by_attention(tokens, attention_scores):
    plt.cla()
    attn = [attention_scores.cpu().numpy().round(3)]
    plt.table(cellText=attn,
              colLabels=tokens,
              loc='center',
              cellColours=cm.Oranges(attn))
    plt.axis('off')
    plt.rcParams['figure.dpi'] = 300
    plt.rcParams['savefig.dpi'] = 300
    plt.show()

In [None]:
first_block_att_scores, last_block_att_scores = attn_scores
first_block_att_scores = first_block_att_scores.squeeze()[-1]
last_block_att_scores = last_block_att_scores.squeeze()[-1]

prompt_tokens = [encoder.decode([token_idx]) for token_idx in prompt_token_idxs]

paint_tokens_by_attention(prompt_tokens, first_block_att_scores)
paint_tokens_by_attention(prompt_tokens, last_block_att_scores)

Pay attention to the following:
1. What is the sum of all attention scores?
2. What is the difference in attention between the blocks?
3. Based on this analysis why do you think the model "chose" the predicted word/token?