# Week 4: Text Generation

### What we are building
A smart compose system that assists in writing movie reviews, built on the IMDB movie review dataset. You probably interact with smart compose several times a day when typing in Gmail, typing on your phone, or using Google Search.

### Instructions

We will compare two models: a very simple memorization model, which just remembers how often each word follows a given phrase, and a pre-trained GPT-2. Fine-tuning GPT-2 can take a long time even with a GPU, so we leave that as an extension project.

### Code Overview

- Dependencies: Install and import Python dependencies
- Datasets: Methods and dataset for evaluation
- Models
  - Memorization
  - GPT-2 Pretrained
- Extensions


# Dependencies

✨ Now let's get started! To kick things off, as always, we will install some dependencies.

In [1]:
# Install all the required dependencies for the project
!pip install transformers==4.17.0
!pip install datasets==1.15.1
!pip install pytorch-lightning==1.6.5



Import all the necessary libraries we need throughout the project.

In [1]:
# Import all the relevant libraries
from transformers import GPT2Tokenizer, GPT2LMHeadModel, GPT2TokenizerFast

import torch
import numpy as np

from datasets import load_dataset_builder
from datasets import load_dataset
from collections import defaultdict, Counter

import torch.nn.functional as F
from torch import nn
import torchmetrics
import pytorch_lightning as pl

### Dataset Loading (common to all solutions)

In [2]:
dataset_builder = load_dataset_builder('imdb')
train_dataset = [d["text"] for d in load_dataset('imdb', split='train')]
test_dataset = [d["text"] for d in load_dataset('imdb', split='test')]

print(f"Length of training data: {len(train_dataset)}")
print(f"Length of test data: {len(test_dataset)}")

Reusing dataset imdb (/Users/vitalii.mishchenko/.cache/huggingface/datasets/imdb/plain_text/1.0.0/e3c66f1788a67a89c7058d97ff62b6c30531e05b549de56d3ab91891f0561f9a)
Reusing dataset imdb (/Users/vitalii.mishchenko/.cache/huggingface/datasets/imdb/plain_text/1.0.0/e3c66f1788a67a89c7058d97ff62b6c30531e05b549de56d3ab91891f0561f9a)


Length of training data: 25000
Length of test data: 25000


### Evaluation Dataset

Running GPT-2 is really expensive so we create a small sample dataset of size 500 and use that for our evaluations. 

In [3]:
# Fix the random seed
np.random.seed(0)

def create_eval_dataset(dataset, num_examples=500):
  if len(dataset) < num_examples:
    raise ValueError(f"Can not select {num_examples} unique examples from dataset of size {len(dataset)}")

  # Since it is really expensive to run GPT, we'll use a smaller dataset for eval
  sample = np.random.choice(dataset, num_examples, replace=False)

  prefixes = []
  output_words = []
  for d in sample:
    words = d.lower().split(" ")
    boundary = np.random.randint(1, len(words)-1)
    prefix = " ".join(words[:boundary])
    prefixes.append(prefix)
    output_words.append(words[boundary])
  return prefixes, output_words

prefixes, output_words = create_eval_dataset(test_dataset, 500)
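
To make the split concrete, here is a minimal sketch of what happens to a single review inside `create_eval_dataset`. The helper name `split_at`, the fixed boundary, and the sample sentence are illustrative assumptions; the real function picks the boundary at random.

```python
def split_at(text, boundary):
  """Split a review into (prefix, next_word) at the given word index."""
  words = text.lower().split(" ")
  prefix = " ".join(words[:boundary])
  return prefix, words[boundary]

# Hypothetical review text, not taken from IMDB
prefix, target = split_at("This movie was Surprisingly good", boundary=3)
print(prefix)  # this movie was
print(target)  # surprisingly
```

Note that both the prefix and the target word are lowercased, so a model only scores a match if it predicts the lowercase form.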

**Evaluation Function**: Create a single function to compute correct predictions in the top_k from the model.

In [4]:
def evaluate_exact_match_at(model, prefixes, output_words, top_k):
  em_count = 0
  i = 0
  for i, (prefix, output_word) in enumerate(zip(prefixes, output_words)):
    for p in model.predict(prefix, top_k):
      if p.strip() == output_word.strip():
        em_count += 1
        break
    if i % 20 == 0:
      print(f"Evaluated {i} prefixes")
  print(f"Exact match evaluation em@{top_k}:{em_count /len(prefixes)} . Model got {em_count} matches out of {len(prefixes)}")
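
The metric can be checked by hand on a tiny example. The sketch below recomputes exact match at k with a stand-in model; `ConstantModel`, `exact_match_at`, and the toy prefixes are hypothetical names and data introduced only for illustration, and the function returns the score instead of printing it.

```python
class ConstantModel:
  """Hypothetical stand-in model that always predicts the same ranked words."""
  def __init__(self, ranked_words):
    self.ranked_words = ranked_words

  def predict(self, prefix, top_k=1):
    return self.ranked_words[:top_k]


def exact_match_at(model, prefixes, output_words, top_k):
  """Fraction of prefixes whose true next word is among the top_k predictions."""
  hits = 0
  for prefix, target in zip(prefixes, output_words):
    candidates = [p.strip() for p in model.predict(prefix, top_k)]
    if target.strip() in candidates:
      hits += 1
  return hits / len(prefixes)


toy_model = ConstantModel(["the", "a", "movie"])
toy_prefixes = ["i watched", "this is", "what a great"]
toy_targets = ["the", "a", "film"]

em_at_1 = exact_match_at(toy_model, toy_prefixes, toy_targets, top_k=1)
em_at_3 = exact_match_at(toy_model, toy_prefixes, toy_targets, top_k=3)
print(em_at_1, em_at_3)
```

With these toy inputs only the first target is matched at k=1 (1 of 3), while k=3 also matches the second target (2 of 3), which is why em@3 can never be lower than em@1.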

# Memorizer

The model takes a `largest_prefix` argument, the longest prefix (in words) it will memorize, which defaults to 3. For a four-word sentence such as `I like learning NLP`, it memorizes that it saw `NLP` follow the prefix `I like learning` once.

It also memorizes every shorter window, with prefixes from 1 word up to `largest_prefix` words, as fallback options for prefixes it has not seen in full. For our example, the model has learned the following:

```python
[
  ('I like learning', 'NLP'),
  ('I like', 'learning'), ('like learning', 'NLP'),
  ('I', 'like'), ('like', 'learning'), ('learning', 'NLP'),
]
```

This is done so that if we encounter a sentence like `We like learning` we can fall back to the prefix of length 2 and then predict `NLP`.
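
The pairs listed above can be reproduced with a few lines of code. This is a minimal sketch, not the notebook's implementation: `prefix_pairs` is a hypothetical helper that only enumerates the (prefix, next word) pairs without counting them.

```python
def prefix_pairs(sentence, largest_prefix=3):
  """List (prefix, next_word) pairs for prefix lengths largest_prefix down to 1."""
  words = sentence.split(" ")
  pairs = []
  for prefix_len in range(largest_prefix, 0, -1):
    window_size = prefix_len + 1  # prefix words plus the word that follows
    for start in range(len(words) - window_size + 1):
      window = words[start:start + window_size]
      pairs.append((" ".join(window[:-1]), window[-1]))
  return pairs

pairs = prefix_pairs("I like learning NLP")
for pair in pairs:
  print(pair)
```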

**Implement** the predict function so that it checks prefixes from the largest to the smallest possible length, looks each one up in the memory dictionary, and returns the top_k predictions for the first prefix that matches.
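
The longest-prefix-first lookup can be sketched on a toy memory as follows. The names `build_memory` and `predict_with_fallback` are hypothetical, and unlike the `Memorizer` class this sketch returns an empty list when nothing matches instead of falling back to the most common words in the dataset.

```python
from collections import Counter, defaultdict

def build_memory(sentences, largest_prefix=3):
  """Map prefix length -> prefix string -> Counter of following words."""
  memory = {n: defaultdict(Counter) for n in range(1, largest_prefix + 1)}
  for sentence in sentences:
    words = sentence.split(" ")
    for n in range(1, largest_prefix + 1):
      for start in range(len(words) - n):
        prefix = " ".join(words[start:start + n])
        memory[n][prefix][words[start + n]] += 1
  return memory

def predict_with_fallback(memory, prefix, top_k=1, largest_prefix=3):
  """Try the longest usable prefix first, then shorter ones; [] if none match."""
  words = prefix.split(" ")
  for n in range(min(len(words), largest_prefix), 0, -1):
    key = " ".join(words[-n:])
    if key in memory[n]:
      return [word for word, _ in memory[n][key].most_common(top_k)]
  return []

toy_memory = build_memory(["I like learning NLP"])
fallback_prediction = predict_with_fallback(toy_memory, "We like learning")
print(fallback_prediction)  # ['NLP']
```

The three-word prefix `We like learning` was never seen, so the lookup drops the first word and finds `like learning` among the two-word prefixes.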

## ASSIGNMENT PART 1

In [41]:
def _window(seq, n=2):
  """Yields every sliding window of length n over seq
  """
  seq = tuple(seq)
  if len(seq) < n: 
    # Too short for a single window: a bare return ends the generator
    return
  for i in range(0, len(seq) - n + 1):
    yield seq[i:i+n]


class Memorizer:
  def __init__(self, train_dataset, largest_prefix=3): 
    self.largest_prefix = largest_prefix
    self.memory = {}
    # Build the dictionaries for each prefix length
    for prefix_size in range(largest_prefix+1):
      self.memory[prefix_size] = defaultdict(Counter)
      self._build(train_dataset, prefix_size + 1, self.memory[prefix_size])

  def _build(self, train_dataset, window_size, memory):
    """Build the memory dictionary for a provided window_size
    """
    for data in train_dataset:
      words = data.split(" ")
      # Compute the different word windows using the _window function
      for window in _window(words, window_size):
        if window_size == 1:
          # There is no window, just memorize how frequently each word occurs in the dataset
          output_word = window[0]
          # Default all the prefixes to UNK
          prefix = "UNK"
        else:
          # Use the prefix and update the count of the word that follows it
          prefix = " ".join(window[:-1])
          output_word = window[-1]
        memory[prefix][output_word] += 1

  def predict(self, prefix, top_k=1):
    """Return the top_k words most likely to follow the given prefix in our dataset
    """
    prefix_words = prefix.split(" ")
    for prefix_len in range(min(len(prefix_words), self.largest_prefix), 0, -1):

      # Compute the prefix string for the size of the window
      ### TO BE COMPLETED ### 
      prefix_str = " ".join(prefix_words[-prefix_len:])

      # If the prefix is in memory, return its top_k matches right away.
      # We return here because we want the data from the longest prefix that matches
      if prefix_str in self.memory[prefix_len]:
        ### TO BE COMPLETED ###
        top_counters = self.memory[prefix_len][prefix_str].most_common(top_k)
        top_tokens = [top_counter[0] for top_counter in top_counters]
        return top_tokens

    # None of the prefix matched so just return the most common words in the dataset
    predictions = self.memory[0]["UNK"].most_common(top_k)
    return [p[0] for p in predictions]

## Experiment with Memorizer widget

Ha! Here is a cute trick for building fun widgets in Colab. Try different sentences for the dataset and prefix to check that the memorizer works correctly.

In [39]:
#@title Experiment with Memorizer
"""
In this cell, we've built a toy dataset with only 3 examples. 
Now, given the prefix 'I like', the memorizer should emit 'football' and 'tennis' 
based on how often each word follows that prefix.
"""
dataset_1 = "I like football" #@param {type:"string"}
dataset_2 = "I like tennis sometimes" #@param {type:"string"}
dataset_3 = "I like football way too much" #@param {type:"string"}
prefix = "I like" #@param {type:"string"}

memorized_toy_model = Memorizer([dataset_1, dataset_2, dataset_3])

predictions = memorized_toy_model.predict(prefix, 2)

# The model should predict [football, tennis]
# since football occurred twice while tennis was just once.
print("Predictions: ", predictions)

Predictions:  ['football', 'tennis']


### Train memorizer on the actual training data

In [42]:
memorized_model = Memorizer(train_dataset)

### Evaluation on top_1 and top_3
##### <font color='red'>Expected em@1: ~0.12 (about 12%)</font>
##### <font color='red'>Expected em@3: ~0.122 (about 12.2%)</font>

In [44]:
evaluate_exact_match_at(memorized_model, prefixes, output_words, 1)

Evaluated 0 prefixes
Evaluated 20 prefixes
Evaluated 40 prefixes
Evaluated 60 prefixes
Evaluated 80 prefixes
Evaluated 100 prefixes
Evaluated 120 prefixes
Evaluated 140 prefixes
Evaluated 160 prefixes
Evaluated 180 prefixes
Evaluated 200 prefixes
Evaluated 220 prefixes
Evaluated 240 prefixes
Evaluated 260 prefixes
Evaluated 280 prefixes
Evaluated 300 prefixes
Evaluated 320 prefixes
Evaluated 340 prefixes
Evaluated 360 prefixes
Evaluated 380 prefixes
Evaluated 400 prefixes
Evaluated 420 prefixes
Evaluated 440 prefixes
Evaluated 460 prefixes
Evaluated 480 prefixes
Exact match evaluation em@1:0.178 . Model got 89 matches out of 500


In [45]:
evaluate_exact_match_at(memorized_model, prefixes, output_words, 3)

Evaluated 0 prefixes
Evaluated 20 prefixes
Evaluated 40 prefixes
Evaluated 60 prefixes
Evaluated 80 prefixes
Evaluated 100 prefixes
Evaluated 120 prefixes
Evaluated 140 prefixes
Evaluated 160 prefixes
Evaluated 180 prefixes
Evaluated 200 prefixes
Evaluated 220 prefixes
Evaluated 240 prefixes
Evaluated 260 prefixes
Evaluated 280 prefixes
Evaluated 300 prefixes
Evaluated 320 prefixes
Evaluated 340 prefixes
Evaluated 360 prefixes
Evaluated 380 prefixes
Evaluated 400 prefixes
Evaluated 420 prefixes
Evaluated 440 prefixes
Evaluated 460 prefixes
Evaluated 480 prefixes
Exact match evaluation em@3:0.246 . Model got 123 matches out of 500


## GPT-2: Generative Pre-trained Transformer

We'll use the pre-trained GPT-2 model provided by the transformers package. Make sure you implement the predict function.

Implementation Steps:
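1. Encode the sentence with the tokenizer (for example `tokenizer.encode` with `return_tensors="pt"`) so that you get a torch tensor.
2. Run the encoded input through the model. The logits at the last position score every candidate next token.
3. Take the top_k indices from those logits and decode them back to text with the tokenizer.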
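
The last step boils down to picking the k highest-scoring vocabulary indices and mapping them back to strings. The sketch below shows that selection with only the standard library, on a made-up vocabulary and score list. In the notebook the scores come from the model's logits at the last position and the mapping comes from the tokenizer, so `top_k_tokens`, `toy_vocab`, and `toy_scores` are purely illustrative.

```python
import heapq

def top_k_tokens(scores, vocab, k):
  """Return the k vocabulary entries with the highest scores, best first."""
  best_indices = heapq.nlargest(k, range(len(scores)), key=lambda i: scores[i])
  return [vocab[i] for i in best_indices]

# Made-up vocabulary and scores, for illustration only
toy_vocab = [" pitch", " ball", " movie", " the"]
toy_scores = [4.2, 3.9, 0.5, 1.1]

top_two = top_k_tokens(toy_scores, toy_vocab, k=2)
print(top_two)  # [' pitch', ' ball']
```

GPT-2 tokens often carry a leading space, which is why `evaluate_exact_match_at` strips each prediction before comparing it with the expected word.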

### ASSIGNMENT PART 2

In [13]:
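class GPT2PreTrained:
  def __init__(self): 
    # Load the tokenizer and the model from the same checkpoint
    self.tokenizer = GPT2TokenizerFast.from_pretrained('gpt2')
    self.model = GPT2LMHeadModel.from_pretrained('gpt2')
    self.model.eval()

  def predict(self, prefix, top_k=1):
    ### TO BE IMPLEMENTED ### 
    # Encode the prefix into input ids (plus attention mask) as torch tensors
    indexed_tokens = self.tokenizer(prefix, return_tensors="pt")
    # Inference only, so skip gradient tracking
    with torch.no_grad():
      outputs = self.model(**indexed_tokens)
    ### TO BE IMPLEMENTED ### 

    # Logits at the last position score every possible next token
    _, indices = torch.topk(outputs.logits[0, -1, :], k=top_k)
    ### TO BE IMPLEMENTED ### 
    predictions = [self.tokenizer.decode(token_id) for token_id in indices.tolist()]
    
    return predictions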

### Experiment with GPT-2 Widget

In [14]:
#@title Experiment with GPT-2
"""
In this cell, we've built a toy prompt from which we predict 
the next words using GPT-2.
"""
text = "pitcher threw a" #@param {type:"string"}

gpt_model = GPT2PreTrained()

predictions = gpt_model.predict(text, 2)
## Output should be "pitch, ball" or something similar
print("Predictions: ", predictions)

Predictions:  [' pitch', ' ball']


### Evaluation on top_1 and top_3
##### <font color='red'>Expected em@1: ~0.21 (about 21%)</font>
##### <font color='red'>Expected em@3: ~0.298 (about 29.8%)</font>

In [15]:
evaluate_exact_match_at(gpt_model, prefixes, output_words, 1)

Evaluated 0 prefixes
Evaluated 20 prefixes
Evaluated 40 prefixes
Evaluated 60 prefixes
Evaluated 80 prefixes
Evaluated 100 prefixes
Evaluated 120 prefixes
Evaluated 140 prefixes
Evaluated 160 prefixes
Evaluated 180 prefixes
Evaluated 200 prefixes
Evaluated 220 prefixes
Evaluated 240 prefixes
Evaluated 260 prefixes
Evaluated 280 prefixes
Evaluated 300 prefixes
Evaluated 320 prefixes
Evaluated 340 prefixes
Evaluated 360 prefixes
Evaluated 380 prefixes
Evaluated 400 prefixes
Evaluated 420 prefixes
Evaluated 440 prefixes
Evaluated 460 prefixes
Evaluated 480 prefixes
Exact match evaluation em@1:0.21 . Model got 105 matches out of 500


In [16]:
evaluate_exact_match_at(gpt_model, prefixes, output_words, 3)

Evaluated 0 prefixes
Evaluated 20 prefixes
Evaluated 40 prefixes
Evaluated 60 prefixes
Evaluated 80 prefixes
Evaluated 100 prefixes
Evaluated 120 prefixes
Evaluated 140 prefixes
Evaluated 160 prefixes
Evaluated 180 prefixes
Evaluated 200 prefixes
Evaluated 220 prefixes
Evaluated 240 prefixes
Evaluated 260 prefixes
Evaluated 280 prefixes
Evaluated 300 prefixes
Evaluated 320 prefixes
Evaluated 340 prefixes
Evaluated 360 prefixes
Evaluated 380 prefixes
Evaluated 400 prefixes
Evaluated 420 prefixes
Evaluated 440 prefixes
Evaluated 460 prefixes
Evaluated 480 prefixes
Exact match evaluation em@3:0.324 . Model got 162 matches out of 500


🎉 YAYYYY!!! We did it, that's it. Take a second to appreciate how many different things you've tried in the last 4 weeks. Go you!!

## Extensions
- Build an LSTM-based generation model (remember to cut sequences at about 10-15 words; LSTMs struggle with long sequences).
- Try fine-tuning the GPT-2 model using a GPU runtime for the notebook. (NOTE: Colab's free GPUs are fairly slow, so this is probably not worth doing on the free tier.)
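
For the LSTM extension, one simple way to cut reviews into short sequences is to chunk them by word count. This is only a sketch under assumptions: `chunk_words` is a hypothetical helper, the chunk size of 12 is an arbitrary value inside the suggested 10-15 range, and the chunks ignore sentence boundaries.

```python
def chunk_words(text, max_words=12):
  """Split text into consecutive chunks of at most max_words words."""
  words = text.split(" ")
  return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

# Synthetic 30-word "review" used only to show the chunk sizes
sample = " ".join(f"w{i}" for i in range(30))
chunks = chunk_words(sample, max_words=12)
print([len(c.split(" ")) for c in chunks])  # [12, 12, 6]
```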