<a href="https://colab.research.google.com/github/soumyabodavula/nlp-projects/blob/main/proj_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Assignment 1: Language Modeling

*This assignment is adapted from one created by David Gaddy, Daniel Fried, Nikita Kitaev, Mitchell Stern, Rodolfo Corona, John DeNero, and Dan Klein.*

TA contact for this assignment:
Sanxing Chen (sanxing.chen@duke.edu)


---



In this assignment, you will implement several different types of language models for text.  We'll start with n-gram models, then move on to neural n-gram and GRU language models.

**Warning**: Do not start this project the day before it is due!  Some parts require 20 minutes or more to run, so debugging and tuning can take a significant amount of time.

Our dataset for this project will be the WikiText2 language modeling dataset.  This dataset comes with some of the basic preprocessing done for us, such as tokenization and rare word filtering (using the `<unk>` token).
Therefore, we can assume that all word types in the test set also appear at least once in the training set.
We'll also use the Huggingface `datasets` and `tokenizers` libraries to help with some of the data preprocessing, such as converting tokens into id numbers.

**Note on GPU usage**: You will need to use a GPU for the neural n-gram and GRU models but **not** for the n-gram models. Colab restricts GPU usage, so you may get locked out after using one continuously (~8 hours). To avoid this, use the GPU only when needed, i.e., for training and inference in the last two parts of this assignment. You can enable or disable the GPU by changing the runtime type under the Runtime menu.
If you do get locked out of using a GPU, a potential workaround is to sign in using a different account.

When training a model on the GPU, it is also a good idea to save your model periodically in case you get locked out. You can use `torch.save(network.state_dict(), path)` and `network.load_state_dict()` for this; see [here](https://pytorch.org/tutorials/recipes/recipes/saving_and_loading_models_for_inference.html). You can also save your `*.npy` files to Google Drive to avoid losing them when the Colab session is cut off (see the sample script below).

```
from google.colab import drive
drive.mount('/content/drive')

!mkdir -p /content/drive/MyDrive/CS572-S24-A1
!cp *.npy /content/drive/MyDrive/CS572-S24-A1
!ls /content/drive/MyDrive/CS572-S24-A1
```
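The save/load round trip mentioned above can be sketched as follows. This is a minimal illustration, not the assignment's required code: `network` here is a tiny stand-in `nn.Linear`, and `checkpoint_path` is a hypothetical temporary location (on Colab you might point it at your mounted Drive folder instead).

```python
import os
import tempfile

import torch
from torch import nn

# Hypothetical stand-in for your real model; any nn.Module works the same way.
network = nn.Linear(4, 3)

# Hypothetical checkpoint location (assumption); on Colab this could be a Drive path.
checkpoint_path = os.path.join(tempfile.mkdtemp(), "checkpoint.pt")

# Save only the parameters (the state dict), not the whole Python object.
torch.save(network.state_dict(), checkpoint_path)

# Later, e.g. in a fresh session: rebuild the same architecture, then load.
restored = nn.Linear(4, 3)
restored.load_state_dict(torch.load(checkpoint_path))

weights_match = torch.equal(network.weight, restored.weight)
print("weights restored:", weights_match)
```

Because only the state dict is saved, the restoring code has to construct a module with the same architecture before calling `load_state_dict`.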

**Grading rubric**
- 70% results
 - 15% bigram_predictions.npy (correctness)
 - 15% trigram_kn_predictions.npy (correctness)
 - 15% neural_trigram_predictions.npy (meets target)
 - 15% gru_predictions.npy (meets target)
 - 10% gru_predictions.npy (improvement over target)
  
- 30% writeup
 - 12.5% clarity
 - 12.5% correctness
 - 5% interestingness of ideas

In [None]:
# Install some required packages.
!pip install transformers
!pip install datasets



In [None]:
# This block handles some basic setup and data loading.
# You shouldn't need to edit this, but if you want to
# import other standard python packages, that is fine.

from collections import defaultdict, Counter
import numpy as np
import math
import tqdm
import random
import pdb

import torch
from torch import nn
import torch.nn.functional as F

# We'll use HuggingFace's datasets and tokenizers libraries, which are a bit
# heavy-duty for what we're doing, but it's worth getting to know them.

from datasets import load_dataset
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.trainers import WordLevelTrainer
from tokenizers.pre_tokenizers import WhitespaceSplit

dataset = load_dataset("wikitext", "wikitext-2-v1")
tokenizer = Tokenizer(WordLevel(unk_token='<unk>'))
tokenizer.pre_tokenizer = WhitespaceSplit() # should be equivalent to split()

# "Training" a tokenizer below just feeds it all the tokens so it can map from
# word type to id.

trainer = WordLevelTrainer( # should only be 33,278 distinct types in Wikitext-2
    vocab_size=33300, special_tokens=["<unk>", "<eos>"])
generator_bsz = 512
all_splits_generator = (dataset[split][i:i+generator_bsz]["text"]
                        for split in ["train", "validation", "test"]
                          for i in range (0, len(dataset[split]), generator_bsz))
tokenizer.train_from_iterator(all_splits_generator, trainer)

# If desired, we could make a transformers tokenizer object now with:
# fast_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)

orig_vocab = tokenizer.get_vocab() # The tokenizer reserves a <pad> id, which we'll ignore.
word_types = sorted(list(orig_vocab.keys()), key=lambda w: orig_vocab[w]) # no <pad>
vocab = {w: i for i, w in enumerate(word_types)} # no <pad>
vocab_size = len(vocab)

# Make a single stream of tokens, with an <eos> after each newline.

train_text = []
for example in dataset["train"]["text"]:
  train_text.extend(tokenizer.encode(example).tokens + ["<eos>"])

validation_text = []
for example in dataset["validation"]["text"]:
  validation_text.extend(tokenizer.encode(example).tokens + ["<eos>"])

print(validation_text[:30])


['<eos>', '=', 'Homarus', 'gammarus', '=', '<eos>', '<eos>', 'Homarus', 'gammarus', ',', 'known', 'as', 'the', 'European', 'lobster', 'or', 'common', 'lobster', ',', 'is', 'a', 'species', 'of', '<unk>', 'lobster', 'from', 'the', 'eastern', 'Atlantic', 'Ocean']


We've implemented a unigram model here as a demonstration.

In [None]:
class UnigramModel:
    def __init__(self, train_text):
        self.counts = Counter(train_text)
        self.total_count = len(train_text)

    def probability(self, word):
        return self.counts[word] / self.total_count

    def next_word_probabilities(self, text_prefix):
        """Return a list of probabilities for each word in the vocabulary."""
        return [self.probability(word) for word in word_types]

    def perplexity(self, full_text):
        """Return the perplexity of the model on a text as a float.

        full_text -- a list of string tokens
        """
        log_probabilities = []
        for word in full_text:
            # Note that the base of the log doesn't matter
            # as long as the log and exp use the same base.
            log_probabilities.append(math.log(self.probability(word), 2))
        return 2 ** -np.mean(log_probabilities)

unigram_demonstration_model = UnigramModel(train_text)
print('unigram validation perplexity:',
      unigram_demonstration_model.perplexity(validation_text))

def check_validity(model):
    """Performs several sanity checks on your model:
    1) That next_word_probabilities returns a valid distribution
    2) That perplexity matches a perplexity calculated from next_word_probabilities

    Although it is possible to calculate perplexity from next_word_probabilities,
    it is still good to have a separate more efficient method that only computes
    the probabilities of observed words.
    """

    log_probabilities = []
    for i in range(10):
        prefix = validation_text[:i]
        probs = model.next_word_probabilities(prefix)
        assert min(probs) >= 0, "Negative value in next_word_probabilities"
        assert max(probs) <= 1 + 1e-8, "Value larger than 1 in next_word_probabilities"
        assert abs(sum(probs)-1) < 1e-4, "next_word_probabilities do not sum to 1"

        word_id = vocab[validation_text[i]]
        selected_prob = probs[word_id]
        log_probabilities.append(math.log(selected_prob))

    perplexity = math.exp(-np.mean(log_probabilities))
    your_perplexity = model.perplexity(validation_text[:10])
    assert abs(perplexity-your_perplexity) < 0.1, "your perplexity does not " + \
    "match the one we calculated from `next_word_probabilities`,\n" + \
    "at least one of `perplexity` or `next_word_probabilities` is incorrect.\n" + \
    f"we calculated {perplexity} from `next_word_probabilities`,\n" + \
    f"but your perplexity function returned {your_perplexity} (on a small sample)."


check_validity(unigram_demonstration_model)

unigram validation perplexity: 965.0860734119312
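As a small worked check of the comment in `perplexity` about log bases, the sketch below computes perplexity from a handful of made-up per-token probabilities (the values in `token_probs` are hypothetical, not model outputs) in base 2 and in base e. It also illustrates a useful sanity check: a uniform distribution over `V` words has perplexity `V`.

```python
import math

# Hypothetical per-token probabilities assigned by some model to a 4-token text.
token_probs = [0.5, 0.25, 0.125, 0.125]

def perplexity(probs, base):
    """Exponentiated average negative log-probability, using the given base."""
    mean_log_prob = sum(math.log(p, base) for p in probs) / len(probs)
    return base ** -mean_log_prob

ppl_base2 = perplexity(token_probs, 2)
ppl_base_e = perplexity(token_probs, math.e)
uniform_ppl = perplexity([1 / 10] * 5, 2)  # uniform over a 10-word vocabulary

print(ppl_base2, ppl_base_e, uniform_ppl)
```

The two bases agree up to floating-point error because the log and the exponentiation cancel each other's base.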


To generate from a language model, we can sample one word at a time conditioning on the words we have generated so far.

In [None]:
def generate_text(model, n=20, prefix=('<eos>', '<eos>')):
    prefix = list(prefix)
    for _ in range(n):
        probs = model.next_word_probabilities(prefix)
        word = random.choices(word_types, probs)[0]
        prefix.append(word)
    return ' '.join(prefix)

print(generate_text(unigram_demonstration_model))

<eos> <eos> Andrew , , idea was and religion is was The Other Veeru ( the easily ) as April the the


In fact there are many strategies to get better-sounding samples, such as only sampling from the top-k words or sharpening the distribution with a temperature.  You can read more about sampling from a language model in this paper: https://arxiv.org/pdf/1904.09751.pdf.
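The two strategies mentioned above can be sketched on a toy distribution. This is only an illustration under stated assumptions: the helper names `sharpen_with_temperature` and `keep_top_k` and the toy words and probabilities are made up here, and real implementations usually work on logits rather than probabilities.

```python
import random

def sharpen_with_temperature(probs, temperature):
    """Rescale a distribution by p ** (1 / temperature), then renormalize.

    Temperatures below 1 sharpen the distribution; above 1 flatten it.
    """
    scaled = [p ** (1.0 / temperature) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]

def keep_top_k(probs, k):
    """Zero out everything except the k most probable entries, then renormalize."""
    top_ids = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    kept = [probs[i] if i in top_ids else 0.0 for i in range(len(probs))]
    total = sum(kept)
    return [p / total for p in kept]

# Hypothetical toy vocabulary and next-word distribution.
toy_words = ["the", "cat", "sat", "mat"]
toy_probs = [0.5, 0.3, 0.15, 0.05]

sharpened = sharpen_with_temperature(toy_probs, temperature=0.5)
truncated = keep_top_k(toy_probs, k=2)
sample = random.choices(toy_words, truncated)[0]
print(sharpened, truncated, sample)
```

Either transformed list could be passed to `random.choices` in place of the raw probabilities inside `generate_text`.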

You will need to submit some outputs from the models you implement for us to grade.  The following will download the required output files.

In [None]:
!gdown --id 1aHC9RfmeSa8dDwC9XGQN8uRmtTwJaIRw
!gdown --id 1oedI437UeS9AhUGwsC-PgBdD1sHOjhaY

Downloading...
From: https://drive.google.com/uc?id=1aHC9RfmeSa8dDwC9XGQN8uRmtTwJaIRw
To: /content/nu_eval_output_vocab_short.txt
100% 3.67k/3.67k [00:00<00:00, 18.6MB/s]
Downloading...
From: https://drive.google.com/uc?id=1oedI437UeS9AhUGwsC-PgBdD1sHOjhaY
To: /content/nu_eval_prefixes_short.txt
100% 104k/104k [00:00<00:00, 103MB/s]


In [None]:
def save_truncated_distribution(model, filename, short=True):
    """Generate a file of truncated distributions.

    Probability distributions over the full vocabulary are large,
    so we will truncate the distribution to a smaller vocabulary.

    Please do not edit this function
    """
    vocab_name = 'nu_eval_output_vocab'
    prefixes_name = 'nu_eval_prefixes'

    if short:
      vocab_name += '_short'
      prefixes_name += '_short'

    with open('{}.txt'.format(vocab_name), 'r') as eval_vocab_file:
        eval_vocab = [w.strip() for w in eval_vocab_file]
    eval_vocab_ids = [vocab[s] for s in eval_vocab]

    all_selected_probabilities = []
    with open('{}.txt'.format(prefixes_name), 'r') as eval_prefixes_file:
        lines = eval_prefixes_file.readlines()
        for line in tqdm.notebook.tqdm(lines, leave=False):
            prefix = line.strip().split(' ')
            probs = model.next_word_probabilities(prefix)
            selected_probs = np.array([probs[i] for i in eval_vocab_ids], dtype=np.float32)
            all_selected_probabilities.append(selected_probs)

    all_selected_probabilities = np.stack(all_selected_probabilities)
    np.save(filename, all_selected_probabilities)
    print('saved', filename)

In [None]:
save_truncated_distribution(unigram_demonstration_model,
                            'unigram_demonstration_predictions.npy')


**Before you proceed**: At this point you should check whether you are able to upload the submission files to Gradescope. For this we will generate *dummy* prediction files by copying the unigram predictions above. Download `unigram_demonstration_predictions.npy` (you can do this by clicking the folder icon in the left menu), then copy this file and rename the copies to generate the required submission files:
* bigram_predictions.npy
* trigram_kn_predictions.npy
* neural_trigram_predictions.npy
* gru_predictions.npy

Also save a copy of this notebook as `proj_1.ipynb` and create a `report.pdf` (this can be empty for now). Upload these files to Gradescope and confirm that the autograder runs and produces an output score of 0.

### N-gram Model

Now it's time to implement an n-gram language model.

Because not every n-gram will have been observed in training, use add-alpha smoothing to make sure no output word has probability 0.

$$P(w_2|w_1)=\frac{\#(w_1,w_2)+\alpha}{\#(w_1)+V\alpha}$$

where $V$ is the vocab size and $\#()$ is the count for the given bigram.  An alpha value around `3e-3`  should work.  Later, we'll replace this smoothing with model backoff.

One edge case you will need to handle is at the beginning of the text where you don't have `n-1` prior words.  You can handle this however you like as long as you produce a valid probability distribution, but just using a uniform distribution over the vocabulary is reasonable for the purposes of this project.
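To make the add-alpha formula concrete, here is a minimal sketch on a made-up six-token corpus. It is not the `NGramModel` you are asked to write: the corpus, `toy_vocab`, and `add_alpha_probability` are hypothetical, and one assumption is labeled in the code (a word is counted as a context only where a next word follows it, so the smoothed distribution sums to exactly 1).

```python
from collections import Counter

toy_text = ["a", "b", "a", "c", "a", "b"]  # hypothetical toy corpus
toy_vocab = sorted(set(toy_text))
alpha = 3e-3

# Counts of adjacent word pairs (w1, w2).
bigram_counts = Counter(zip(toy_text, toy_text[1:]))
# Assumption: count each word as a context only where a next word follows it.
context_counts = Counter(toy_text[:-1])

def add_alpha_probability(w1, w2):
    """P(w2 | w1) with add-alpha smoothing."""
    numerator = bigram_counts[(w1, w2)] + alpha
    denominator = context_counts[w1] + len(toy_vocab) * alpha
    return numerator / denominator

dist_after_a = {w: add_alpha_probability("a", w) for w in toy_vocab}
print(dist_after_a)
```

Note that the unseen bigram `("a", "a")` still receives a small nonzero probability, which is the point of the smoothing.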

A properly implemented bi-gram model should get a perplexity **below 505** on the validation set.

**Note**: Do not change the signature of the `next_word_probabilities` and `perplexity` functions.  We will use these as a common interface for all of the different model types.  Make sure these two functions call `n_gram_probability`, because later we are going to override `n_gram_probability` in a subclass.
Also, we suggest pre-computing and caching the counts $\#()$ when you initialize `NGramModel` for efficiency.

In [None]:
class NGramModel:
    def __init__(self, train_text, n=2, alpha=3e-3):
        # get counts and perform any other setup
        self.n = n
        self.smoothing = alpha

        # YOUR CODE HERE
        self.word_counts = Counter(train_text) # self.word_counts[word] gives you count of each word in text
        self.total_count = len(train_text)
        self.V = vocab_size

        self.ngram_counts = {} # cache counts of all possible n-grams in text (including next word)

        for i in range(self.total_count):
          if i < self.n - 1:
            n_gram = train_text[:i+1]
            while len(n_gram) != self.n:
              n_gram = ['<eos>'] + n_gram
          else:
            n_gram = train_text[i+1-self.n : i+1]

          n_gram = tuple(n_gram)
          if n_gram not in self.ngram_counts:
            self.ngram_counts[n_gram] = 1
          else:
            self.ngram_counts[n_gram] += 1

        self.prev_ngram_counts = {} # cache counts of all possible (n-1)-gram prefixes in text (just previous n-1 words)
        if self.n != 1:
          k = self.n - 1
          for i in range(self.total_count):
            if i < k - 1:
              n_gram = train_text[:i+1]
              while len(n_gram) != k:
                n_gram = ['<eos>'] + n_gram
            else:
              n_gram = train_text[i+1-k : i+1]

            n_gram = tuple(n_gram)
            if n_gram not in self.prev_ngram_counts:
              self.prev_ngram_counts[n_gram] = 1
            else:
              self.prev_ngram_counts[n_gram] += 1

    def _to_string(self, tokens):
      return "_".join(tokens)


    def n_gram_probability(self, n_gram):
        """Return the probability of the last word in an n-gram.

        n_gram -- a list of string tokens
        returns the conditional probability of the last token given the rest.
        """
        assert len(n_gram) == self.n


        # YOUR CODE HERE
        if self.n == 1:
          word = n_gram[0]
          prob = self.word_counts[word] / self.total_count
          return prob

        denominator = self.prev_ngram_counts.get(tuple(n_gram[:-1]), 0) + self.V*self.smoothing
        numerator = self.ngram_counts.get(tuple(n_gram), 0) + self.smoothing
        prob = numerator / denominator
        return prob


    def next_word_probabilities(self, text_prefix):
        """Return a list of probabilities for each word in the vocabulary.
        """

        # YOUR CODE HERE
        # use your function n_gram_probability
        # Recall word_types contains a list of words to return probabilities for

        if self.n==1:
          text_prefix=[]

        if len(text_prefix) < self.n - 1: # prefix isn't long enough for n-gram
          while len(text_prefix) != self.n-1:
            text_prefix = ['<eos>'] + text_prefix

        if len(text_prefix) > self.n - 1:
          start = len(text_prefix) - self.n + 1
          text_prefix = text_prefix[start:]

        probs = []
        for word in word_types:
          n_gram = text_prefix + [word]
          prob = self.n_gram_probability(n_gram)
          probs.append(prob)

        return probs


    def perplexity(self, full_text):
        """ full_text is a list of string tokens
        return perplexity as a float """

        # YOUR CODE HERE
        # use your function n_gram_probability
        # This method should differ a bit from the example unigram model because
        # the first n-1 words of full_text must be handled as a special case.

        '''
        Majority case:
        For each possible consecutive ngram in full text, call n_gram_probability to
        calculate the log probability of its last word appearing given its text prefix.
        Append these probabilities to a list and take the mean of this list

        Special case: first n-1 words of full_text that don't have a full ngram yet

        '''
        log_probs = []
        num_tokens = len(full_text)

        for i in range(num_tokens):
          if i < self.n - 1:
            n_gram = full_text[:i+1]
            while len(n_gram) != self.n:
              n_gram = ['<eos>'] + n_gram
            prob = self.n_gram_probability(n_gram)
            log_probs.append(math.log(prob,2))
          else:
            n_gram = full_text[i+1-self.n : i+1]
            prob = self.n_gram_probability(n_gram)
            log_probs.append(math.log(prob, 2))

        return 2 ** -np.mean(log_probs)

In [None]:
unigram_model = NGramModel(train_text, n=1)
check_validity(unigram_model)
print('unigram validation perplexity:', unigram_model.perplexity(validation_text)) # this should be almost the same as our unigram model perplexity above

bigram_model = NGramModel(train_text, n=2)
check_validity(bigram_model)
print('bigram validation perplexity:', bigram_model.perplexity(validation_text))

trigram_model = NGramModel(train_text, n=3)
check_validity(trigram_model)
print('trigram validation perplexity:', trigram_model.perplexity(validation_text)) # this won't do very well...

save_truncated_distribution(bigram_model, 'bigram_predictions.npy') # this might take a few minutes

unigram validation perplexity: 965.0860734119312
bigram validation perplexity: 504.40436063011833
trigram validation perplexity: 2965.3608021638647


  0%|          | 0/1000 [00:00<?, ?it/s]

saved bigram_predictions.npy



Please download `bigram_predictions.npy` once you finish this section so that you can submit it.

In the block below, please report your bigram validation perplexity.  (We will use this to help us calibrate our scoring on the test set.)

<!-- Do not remove this comment, it is used by the autograder: RqYJKsoTS6 -->

Bigram validation perplexity: ***504.40436063011833***

We can also generate samples from the model to get an idea of how it is doing.

In [None]:
print(generate_text(bigram_model))

<eos> <eos> <eos> <eos> <eos> <eos> Anzac Cove perimeter Viva landslide stamped Windsor demonstrate sediment packages Female Church teaching them and then


We now free up some RAM. **It is important to run the cell below; otherwise you will likely run out of RAM in the Colab runtime.**

In [None]:
# Free up some RAM.
del bigram_model
del trigram_model

This basic model works okay for bigrams, but a better strategy (especially for higher-order models) is to use backoff.  Implement backoff with absolute discounting.
$$P\left(w_i|h_i\right)=\frac{\max\left\{\#(h_i, w_i)-d,0\right\}}{\#(h_i)} + \lambda(h_i) P(w_i|w_{i-n+2},\ldots, w_{i-1})$$

$$\lambda\left(h_i\right)=\frac{d N_{1+}(h_i)}{{\#(h_i)}}$$
where $h_i=(w_{i-n+1}, \ldots, w_{i-1})$ is the prefix before token $i$, and $N_{1+}(h_i)$ is the number of distinct word types that appear after the previous $n-1$ words (the number of times the max will select something other than 0 in the first equation).  If $\#(h_i)=0$, use the lower order model probability directly (the above equations would have a division by 0).

We found a discount $d$ of 0.9 to work well based on validation performance.  A trigram model with this discount value should get a validation perplexity below 272.
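The two equations above can be checked by hand on a tiny example. The sketch below is a standalone bigram illustration with a plain unigram model as the lower-order distribution, not the `DiscountBackoffModel` class itself; the corpus and all helper names are hypothetical. It shows that the mass removed by the discount is exactly what $\lambda(h_i)$ hands to the lower-order model, so the result still sums to 1.

```python
from collections import Counter, defaultdict

toy_text = ["a", "b", "a", "c", "a", "b", "b"]  # hypothetical toy corpus
toy_vocab = sorted(set(toy_text))
d = 0.9  # discount

unigram_counts = Counter(toy_text)
bigram_counts = Counter(zip(toy_text, toy_text[1:]))
# #(h): how often each word occurs as a context (with a following word).
context_counts = Counter(toy_text[:-1])
# N_{1+}(h): the distinct word types observed after each context.
followers = defaultdict(set)
for w1, w2 in bigram_counts:
    followers[w1].add(w2)

def unigram_probability(w):
    return unigram_counts[w] / len(toy_text)

def backoff_probability(w1, w2):
    """P(w2 | w1) with absolute discounting and backoff to the unigram model."""
    h_count = context_counts[w1]
    if h_count == 0:
        # Unseen context: use the lower-order probability directly.
        return unigram_probability(w2)
    discounted = max(bigram_counts[(w1, w2)] - d, 0) / h_count
    lam = d * len(followers[w1]) / h_count
    return discounted + lam * unigram_probability(w2)

dist_after_a = [backoff_probability("a", w) for w in toy_vocab]
print(dist_after_a, sum(dist_after_a))
```

Here the context `a` occurs 3 times with 2 distinct followers, so the discounted terms sum to 0.4 and $\lambda = 0.9 \cdot 2 / 3 = 0.6$ covers the rest.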

In [None]:
class DiscountBackoffModel(NGramModel):
    def __init__(self, train_text, lower_order_model, n=2, delta=0.9):
        super().__init__(train_text, n=n)
        self.lower_order_model = lower_order_model
        self.discount = delta
        self.n = n

        # YOUR CODE HERE

        self.ngram_counts = {} # cache counts of all possible n-grams in text (including next word)
        self.ngram_prefix_words = {} # cache set of words for each corresponding n-gram prefix

        for i in range(self.total_count):
          if i < self.n - 1:
            n_gram = train_text[:i+1]
            while len(n_gram) != self.n:
              n_gram = ['<eos>'] + n_gram
          else:
            n_gram = train_text[i+1-self.n : i+1]

          n_gram = tuple(n_gram)
          prefix = tuple(n_gram[:-1])
          last_word = n_gram[-1]

          if n_gram not in self.ngram_counts:
            self.ngram_counts[n_gram] = 1
          else:
            self.ngram_counts[n_gram] += 1

          if prefix not in self.ngram_prefix_words:
            self.ngram_prefix_words[prefix] = {last_word}
          else:
            self.ngram_prefix_words[prefix].add(last_word)


        self.prev_ngram_counts = {} # cache counts of all possible (n-1)-gram prefixes in text (just previous n-1 words)
        if self.n != 1:
          k = self.n - 1
          for i in range(self.total_count):
            if i < k - 1:
              n_gram = train_text[:i+1]
              while len(n_gram) != k:
                n_gram = ['<eos>'] + n_gram
            else:
              n_gram = train_text[i+1-k : i+1]

            n_gram = tuple(n_gram)
            if n_gram not in self.prev_ngram_counts:
              self.prev_ngram_counts[n_gram] = 1
            else:
              self.prev_ngram_counts[n_gram] += 1


    def n_gram_probability(self, n_gram):
      assert len(n_gram) == self.n

      # YOUR CODE HERE
      # back off to the lower_order model with n'=n-1 using its n_gram_probability function

      lower_prob = self.lower_order_model.n_gram_probability(n_gram[1:])

      if self.n == 2:
        hi_count = self.lower_order_model.word_counts[n_gram[0]]
      else:
        hi_count = self.prev_ngram_counts.get(tuple(n_gram[:-1]), 0)

      if hi_count == 0:
        return lower_prob

      numerator = max(self.ngram_counts.get(tuple(n_gram), 0) - self.discount, 0)
      denominator = hi_count

      prefix = tuple(n_gram[:-1])
      N1 = 0
      if prefix in self.ngram_prefix_words:
        N1 = len(self.ngram_prefix_words[prefix])

      lmda = (self.discount * N1) / hi_count
      prob = (numerator / denominator) + lmda * lower_prob

      return prob

In [None]:
bigram_backoff_model = DiscountBackoffModel(train_text, unigram_model, 2)
trigram_backoff_model = DiscountBackoffModel(train_text, bigram_backoff_model, 3)
check_validity(trigram_backoff_model)
print('bigram backoff validation perplexity:', bigram_backoff_model.perplexity(validation_text))
print('trigram backoff validation perplexity:', trigram_backoff_model.perplexity(validation_text))

bigram backoff validation perplexity: 303.6792191669459
trigram backoff validation perplexity: 271.10291054998066


Free up RAM.

In [None]:
# Release models we don't need any more.
del unigram_model
del bigram_backoff_model
del trigram_backoff_model

Now, implement Kneser-Ney to replace the unigram base model.
$$P(w)\propto |\{w':\#(w',w) > 0\}|$$
A Kneser-Ney trigram model should get a validation perplexity below 257.
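The intuition behind the Kneser-Ney base distribution is that a word should score highly if it follows many different words, not merely if it is frequent. A minimal sketch on a made-up corpus (the corpus and the names `preceding_types` and `continuation_probability` are hypothetical, not part of the assignment interface):

```python
from collections import defaultdict

# Hypothetical toy corpus: "francisco" and "cat" both occur twice,
# but "francisco" only ever follows "san".
toy_text = ["san", "francisco", "in", "san", "francisco", "the", "cat", "a", "cat"]

# For each word w, collect the distinct words w' that immediately precede it.
preceding_types = defaultdict(set)
for prev_word, word in zip(toy_text, toy_text[1:]):
    preceding_types[word].add(prev_word)

# Normalizer: the total number of distinct bigram types in the corpus.
total_bigram_types = sum(len(s) for s in preceding_types.values())

def continuation_probability(w):
    """P(w) proportional to the number of distinct words seen before w."""
    return len(preceding_types.get(w, ())) / total_bigram_types

print(continuation_probability("cat"), continuation_probability("francisco"))
```

Both words have the same unigram count, yet `cat` gets twice the continuation probability because it appears after two different words.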

In [None]:
class KneserNeyBaseModel(NGramModel):
    def __init__(self, train_text):
      super().__init__(train_text, n=1)

      # YOUR CODE HERE
      # self.ngram_counts = {} # dictionary with counts of all ngrams in text
      # self.ngram_prefix_words = {} # key: prefix of each ngram above; value: set of words following that prefix in ngrams above

      # for i in range(len(train_text)):
      #   if i < self.n - 1:
      #       n_gram = train_text[:i + 1]
      #       while len(n_gram) != self.n:
      #           n_gram = ['<eos>'] + n_gram
      #   else:
      #       n_gram = train_text[i + 1 - self.n: i + 1]

      #   n_gram = tuple(n_gram)
      #   prefix = tuple(n_gram[:-1])
      #   last_word = n_gram[-1]

      #   if n_gram not in self.ngram_counts:
      #       self.ngram_counts[n_gram] = 1
      #   else:
      #       self.ngram_counts[n_gram] += 1

      #   if prefix not in self.ngram_prefix_words:
      #       self.ngram_prefix_words[prefix] = {last_word}
      #   else:
      #       self.ngram_prefix_words[prefix].add(last_word)

      # self.word_counts = Counter(train_text) # self.counts[word] gives you count of each word in text

      self.prev_word_counts = {} # for each word w, the set of distinct words w' that immediately precede w in train_text
      for i in range(1,len(train_text)):
        unigram = train_text[i] # w
        prefix_word = train_text[i-1] # w'
        if unigram not in self.prev_word_counts:
          self.prev_word_counts[unigram] = {prefix_word}
        else:
          self.prev_word_counts[unigram].add(prefix_word)

      self.denominator = 0
      for k in self.prev_word_counts.keys():
        self.denominator += len(self.prev_word_counts.get(k, []))


    def n_gram_probability(self, n_gram):
      assert len(n_gram) == 1
      # YOUR CODE HERE
      word = n_gram[0]
      numerator = len(self.prev_word_counts.get(word, []))  # Number of unique words that precede input word/unigram in text
      # iterate through every possible w' in train_text (everything but last word)
      prob = numerator / self.denominator
      #print(prob)
      return prob

In [None]:
kn_base = KneserNeyBaseModel(train_text)
check_validity(kn_base)
bigram_kn_backoff_model = DiscountBackoffModel(train_text, kn_base, 2)
trigram_kn_backoff_model = DiscountBackoffModel(train_text, bigram_kn_backoff_model, 3)
print('trigram Kneser-Ney backoff validation perplexity:', trigram_kn_backoff_model.perplexity(validation_text))

#save_truncated_distribution(trigram_kn_backoff_model, 'trigram_kn_predictions.npy') # this might take a few minutes

trigram Kneser-Ney backoff validation perplexity: 256.59210020004696


In [None]:
save_truncated_distribution(trigram_kn_backoff_model, 'trigram_kn_predictions.npy') # this might take a few minutes

  0%|          | 0/1000 [00:00<?, ?it/s]

saved trigram_kn_predictions.npy


In [None]:
print(generate_text(trigram_kn_backoff_model))
print(generate_text(trigram_kn_backoff_model, prefix=['What','about']))

<eos> <eos> = = = Aftermath = = = <eos> <eos> = = <eos> <eos> The gods and goddesses worshipped in San
What about the leaves the north . The portico for its club 's neck to be " . <eos> <eos> = =


Fill in your trigram backoff perplexities with and without Kneser Ney.

<!-- Do not remove this comment, it is used by the autograder: RqYJKsoTS6 -->

Trigram backoff validation perplexity: ***271.10291054998066***

Trigram backoff with Kneser Ney perplexity: ***256.59210020004696***

Free up RAM.

In [None]:
# Delete models we don't need.
del kn_base
del bigram_kn_backoff_model
del trigram_kn_backoff_model

### Neural N-gram Model

In this section, you will implement a neural version of an n-gram model.  The model will use a simple feedforward neural network that takes the previous `n-1` words and outputs a distribution over the next word.

You will use PyTorch to implement the model.  We've provided a little bit of code to help with the data loading using PyTorch's data loaders (https://pytorch.org/docs/stable/data.html)

A model with the following architecture and hyperparameters should reach a validation perplexity **below or about 220**.
* embed the words with dimension 128, then flatten into a single embedding for $n-1$ words (with size $(n-1)*128$)
* run 2 hidden layers with 1024 hidden units, then project down to size 128 before the final layer (i.e. 4 layers total).
* use weight tying for the embedding and final linear layer (this made a very large difference in our experiments); you can do this by creating the output layer with [`nn.Linear`](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html), then using [`F.embedding`](https://pytorch.org/docs/stable/generated/torch.nn.functional.embedding.html) with the linear layer's `.weight` to embed the input
* rectified linear activation (ReLU) and dropout 0.1 after first 2 hidden layers. **Note: You will likely find a performance drop if you add a nonlinear activation function after the dimension reduction layer.**
* train for 10 epochs with the Adam optimizer (should take around 15-20 minutes)
* do early stopping based on validation set perplexity (see Project 0)


We encourage you to try other architectures and hyperparameters, and you will likely find some that work better than the ones listed above.  A proper implementation with these should be enough to receive full credit on the assignment, although you might need to retrain the model a few times.
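The early-stopping bullet in the list above can be sketched without any training code. The function below is a hypothetical illustration (the name, the `patience` parameter, and the perplexity values are all made up, and Project 0 may do this differently): it tracks the best validation perplexity seen so far and stops once it has not improved for `patience` consecutive epochs.

```python
def best_epoch_with_early_stopping(validation_perplexities, patience=2):
    """Return (best_epoch, best_perplexity) under a simple patience rule."""
    best_ppl = float("inf")
    best_epoch = None
    epochs_without_improvement = 0
    for epoch, ppl in enumerate(validation_perplexities):
        if ppl < best_ppl:
            best_ppl, best_epoch = ppl, epoch
            epochs_without_improvement = 0
            # In real training, this is where you would save a checkpoint.
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # stop training; keep the best checkpoint
    return best_epoch, best_ppl

# Hypothetical validation perplexities, one per epoch.
history = [310.0, 250.0, 228.0, 231.0, 235.0, 226.0]
stopped_early = best_epoch_with_early_stopping(history, patience=2)
more_patient = best_epoch_with_early_stopping(history, patience=3)
print(stopped_early, more_patient)
```

With `patience=2` training stops after epoch 4 and never sees the later improvement, which shows the trade-off in choosing the patience value.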

When training a model on the GPU, it is also a good idea to save your model periodically in case you get locked out. You can use `torch.save(network.state_dict(), path)` and `network.load_state_dict()` for this; see [here](https://pytorch.org/tutorials/recipes/recipes/saving_and_loading_models_for_inference.html). You can also save your `*.npy` files to Google Drive to avoid losing them when the Colab session is cut off (see the sample script below).

In [None]:
from google.colab import drive
drive.mount('/content/drive')

!mkdir -p /content/drive/MyDrive/CS572-S24-A1
!cp *.npy /content/drive/MyDrive/CS572-S24-A1
!ls /content/drive/MyDrive/CS572-S24-A1

Mounted at /content/drive
bigram_predictions.npy	trigram_kn_predictions.npy  unigram_demonstration_predictions.npy


In [None]:
from functools import total_ordering
import torch.optim as optim

def ids(tokens):
    return [vocab[t] for t in tokens]

# assert torch.cuda.is_available(), "no GPU found, in Colab go to 'Edit->Notebook settings' and choose a GPU hardware accelerator"

device = torch.device("cuda")

In [None]:
class NeuralNgramDataset(torch.utils.data.Dataset):
    def __init__(self, text_token_ids, n):
        self.text_token_ids = text_token_ids
        self.n = n

    def __len__(self):
        return len(self.text_token_ids)

    def __getitem__(self, i):
        if i < self.n-1:
            prev_token_ids = [vocab['<eos>']] * (self.n-i-1) + self.text_token_ids[:i]
        else:
            prev_token_ids = self.text_token_ids[i-self.n+1:i]

        assert len(prev_token_ids) == self.n-1

        x = torch.tensor(prev_token_ids)
        y = torch.tensor(self.text_token_ids[i])
        return x, y

class NeuralNGramNetwork(nn.Module):
    # a PyTorch Module that holds the neural network for your model

    def __init__(self, n):
        super().__init__()
        self.n = n

        # Initialize the different layers needed in the computation graph.
        # A full list of available layers in Pytorch is given here:
        # https://pytorch.org/docs/stable/nn.html
        # You will need linear and dropout layers only for this model.

        # An example on how to define neural networks in Pytorch is given here:
        # https://pytorch.org/tutorials/beginner/blitz/neural_networks_tutorial.html#define-the-network

        # YOUR CODE HERE

        # embed the words with dimension 128,
        # then flatten into a single embedding for  n−1  words (with size  (n−1)∗128 )
        # run 2 hidden layers with 1024 hidden units,
        # then project down to size 128 before the final layer (ie. 4 layers total).

        self.embDim = 128
        self.hidDim = 1024

        # embedding layer (unused in forward: inputs are embedded with final_layer.weight for weight tying)
        self.embed = nn.Embedding(vocab_size, self.embDim)
        # hidden layers
        self.l1 = nn.Linear((self.n - 1)*self.embDim, self.hidDim) # (n-1)*128, 1024
        self.l2 = nn.Linear(self.hidDim, self.hidDim) # 1024, 1024
        # dimension reduction layer
        self.l3 = nn.Linear(self.hidDim, self.embDim) # 1024, 128
        # output layer; its weight (vocab_size, 128) doubles as the embedding matrix (weight tying)
        self.final_layer = nn.Linear(self.embDim, vocab_size)
        # dropout (the assignment suggests 0.1; 0.15 is used here)
        self.dropout = nn.Dropout(p=0.15)


    def forward(self, x):
        # x is a tensor of inputs with shape (batch, n-1)
        # this function returns a tensor of log probabilities with shape (batch, vocab_size)

        # Run the input data through the network for a forward pass. See the
        # tutorial above on how to use the layers to construct a forward pass.

        # You can use a non-linear activation function from the list here:
        # https://pytorch.org/docs/stable/nn.functional.html

        # To convert the scores of the network into log-probabilities you can use
        # the log-softmax function:
        # https://pytorch.org/docs/stable/generated/torch.nn.functional.log_softmax.html#torch.nn.functional.log_softmax

        # YOUR CODE HERE
        if len(x.shape) == 1:
          x = x.unsqueeze(dim=0)
        x_embed = F.embedding(x, weight=self.final_layer.weight)
        new_embed = x_embed.view(x_embed.shape[0], -1) # flatten
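        # ReLU and dropout after the first 2 hidden layers only
        x_l1 = self.dropout(F.relu(self.l1(new_embed)))
        x_l2 = self.dropout(F.relu(self.l2(x_l1)))
        # no nonlinearity after the dimension reduction layer (see the note in the architecture list)
        x_l3 = self.l3(x_l2)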

        output = self.final_layer(x_l3)

        # convert output into log-softmax
        log_prob = F.log_softmax(output, dim=-1) # normalize over the vocabulary dimension

        return log_prob


class NeuralNGramModel:
    # a class that wraps NeuralNGramNetwork to handle training and evaluation
    # it's ok if this doesn't work for unigram modeling
    def __init__(self, n):
        self.n = n
        self.network = NeuralNGramNetwork(n).to(device)

    def train(self, n_epochs=10):
        dataset = NeuralNgramDataset(ids(train_text), self.n)
        train_loader = torch.utils.data.DataLoader(dataset, batch_size=128, shuffle=True)
        # iterating over train_loader with a for loop will return a 2-tuple of batched tensors
        # the first tensor will be previous token ids with size (batch, n-1),
        # and the second will be the current token id with size (batch, )
        # you will need to move these tensors to GPU, e.g. by using the Tensor.cuda() function.

        # this will take some time to run; use tqdm.notebook.tqdm to get a progress bar
        # (see Project 0 for example)

        # A basic recipe for training networks in Pytorch is given here:
        # https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html#train-the-network

        # You should also print the perplexity on the validation set after each epoch by
        # calling self.perplexity(validation_text). You can do early stopping by
        # comparing this perplexity to the perplexity from the previous epoch
        # and stop training when it gets larger.

        # YOUR CODE HERE
        self.network.cuda()
        self.network.train()
        optimizer = torch.optim.Adam(self.network.parameters(), lr=1e-3, weight_decay=1e-5)
        ce_loss = nn.NLLLoss() # the network already returns log-probabilities

        prev_perplexity = 1e12
        for epoch in range(n_epochs):
          print('EPOCH........', epoch)
          self.network.train() # perplexity() switches to eval mode, so re-enable dropout every epoch
          for batch in tqdm.tqdm_notebook(train_loader, leave=False):
            x = batch[0].cuda()
            y = batch[1].cuda()
            optimizer.zero_grad()
            output = self.network(x)
            loss = ce_loss(output, y)
            loss.backward()
            optimizer.step()

          perp = self.perplexity(validation_text)
          if perp < prev_perplexity:
            prev_perplexity = perp
            torch.save(self.network.state_dict(), 'neural_ngram.pt')

          if perp >= 300:
            print("Stopping early, perplexity is........", perp)
            break

        self.network.load_state_dict(torch.load('neural_ngram.pt'))
        return self.network


        # [note] use tqdm.notebook.tqdm
        # [note] use GPU by using Tensor.cuda() function
        # [assignment 0] * if the validation score is better than your previous best score, save the model
        #   use `network.state_dict()` and `torch.save` (https://pytorch.org/docs/stable/notes/serialization.html)
        #   this gives us a form of early stopping in case the model starts overfitting

        # don't forget self.network.train() every time after you run an evaluation.

        # torch.save(network.state_dict(), path)
        # network.load_state_dict(torch.load(path))

    def next_word_probabilities(self, text_prefix):
        # you will need to convert text_prefix from strings to numbers with the `ids` function
        # if your `perplexity` function below is based on a NeuralNgramDataset DataLoader,
        # you will need to use the same strategy for prefixes with less than n-1 tokens to pass the validity check
        # the data loader appends extra "<eos>" (end of sentence) tokens to the
        # start of the input so there are always enough to run the network

        # do a forward pass through the network and convert the log probabilities
        # into probabilities before returning.

        # YOUR CODE HERE

        while len(text_prefix) < self.n - 1:
          text_prefix = ["<eos>"] + text_prefix
        if len(text_prefix) > self.n - 1:
          text_prefix = text_prefix[len(text_prefix)-self.n+1 :]

        numbers = torch.tensor(ids(text_prefix))

        with torch.no_grad():
          self.network.eval()
          outputs = self.network(numbers.cuda())
          probs = torch.exp(outputs) # convert log probs to probs
          return probs.squeeze().cpu().numpy()

    def perplexity(self, text):
        # you may want to use a DataLoader here with a NeuralNgramDataset
        # don't forget self.network.eval()
        valdataset = NeuralNgramDataset(ids(text), self.n)
        val_loader = torch.utils.data.DataLoader(valdataset, batch_size=128, shuffle=False)

        # Iterate over the val_loader, do a forward pass across the batches and
        # compute the perplexities using the returned log probabilities.
        # (Hint: you can use torch.nn.functional.nll_loss for computing perplexity)

        # YOUR CODE HERE
        loss_total = 0
        num_words = 0
        with torch.no_grad():
          self.network.eval()
          for x, y in val_loader:
            log_probs = self.network(x.cuda()) # (batch, vocab_size) log-probabilities
            # sum (rather than average) so that a smaller final batch is weighted correctly
            loss = F.nll_loss(log_probs, y.cuda(), reduction='sum')
            loss_total += loss.item()
            num_words += y.size(0)
        print("LOSS TOTAL IS", loss_total)
        avg_loss = loss_total / num_words
        perplexity = math.exp(avg_loss)
        print('PERPLEXITY IS', perplexity)
        return perplexity


neural_trigram_model = NeuralNGramModel(3)
check_validity(neural_trigram_model)
neural_trigram_model.train()
print('neural trigram validation perplexity:', neural_trigram_model.perplexity(validation_text))

save_truncated_distribution(neural_trigram_model, 'neural_trigram_predictions.npy')

LOSS TOTAL IS 104.31520080566406
PERPLEXITY IS 33911.85986010029
EPOCH........ 0
LOSS TOTAL IS 1211188.2189483643
PERPLEXITY IS 261.11108153367013
EPOCH........ 1
LOSS TOTAL IS 1185773.9468078613
PERPLEXITY IS 232.33425081187963
EPOCH........ 2
LOSS TOTAL IS 1180151.9584197998
PERPLEXITY IS 226.40969954788153
EPOCH........ 3
LOSS TOTAL IS 1176293.5801086426
PERPLEXITY IS 222.4313283995457
EPOCH........ 4
LOSS TOTAL IS 1174275.1538391113
PERPLEXITY IS 220.37805911269353
EPOCH........ 5
LOSS TOTAL IS 1174023.0590362549
PERPLEXITY IS 220.12294759667867
EPOCH........ 6
LOSS TOTAL IS 1176959.0904083252
PERPLEXITY IS 223.11251202739197
EPOCH........ 7
LOSS TOTAL IS 1174345.8774719238
PERPLEXITY IS 220.44968215890827
EPOCH........ 8
LOSS TOTAL IS 1171774.3232574463
PERPLEXITY IS 217.86032870568607
EPOCH........ 9
LOSS TOTAL IS 1173603.847946167
PERPLEXITY IS 219.69937367906786
LOSS TOTAL IS 1171774.3232574463
PERPLEXITY IS 217.86032870568607
neural trigram validation perplexity: 217.86032870568607

saved neural_trigram_predictions.npy


Fill in your neural trigram perplexity.

<!-- Do not remove this comment, it is used by the autograder: RqYJKsoTS6 -->

Neural trigram validation perplexity: ***217.86032870568607***

Free up RAM.

In [None]:
# Delete model we don't need.
del neural_trigram_model

### RNN (GRU) Model

For this stage of the project, you will implement an RNN language model using a popular variant, the [gated recurrent unit (GRU)](https://en.wikipedia.org/wiki/Gated_recurrent_unit).

For recurrent language modeling, the data batching strategy is a bit different from what is used in some other tasks.  Sentences are concatenated together so that one sentence starts right after the other, and an unfinished sentence will be continued in the next batch.  We have provided a helper class `RecurrentLMDataset` to do this for you.  To properly deal with this input format, you should save the last state of the GRU from a batch to feed in as the first state of the next batch.  When you save state across different batches, you should call `.detach()` on the state tensors before the next batch to tell PyTorch not to backpropagate gradients through the state into the batch you have already finished (attempting to do so will cause a runtime error).
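
To make the batching layout described above concrete, here is a small framework-free sketch that mirrors the slicing logic of `RecurrentLMDataset` using plain lists instead of tensors. The function name `contiguous_batches` and the toy token ids are assumptions made for illustration only.

```python
def contiguous_batches(token_ids, bsz, bptt_len):
    """Yield (x, y) pairs laid out as [timestep][stream], like RecurrentLMDataset.

    The token sequence is cut into `bsz` contiguous streams. Column b of every
    batch comes from stream b, so column b of one batch continues exactly where
    column b of the previous batch stopped.
    """
    ncontig = len(token_ids) // bsz
    streams = [token_ids[b * ncontig:(b + 1) * ncontig] for b in range(bsz)]
    for i in range(0, ncontig - 1, bptt_len):
        seqlen = min(bptt_len, ncontig - i - 1)
        x = [[s[t] for s in streams] for t in range(i, i + seqlen)]
        y = [[s[t + 1] for s in streams] for t in range(i, i + seqlen)]  # next-token targets
        yield x, y


# Toy example: 12 token ids, 2 parallel streams, 2 timesteps per batch.
for x, y in contiguous_batches(list(range(12)), bsz=2, bptt_len=2):
    print(x, y)
```

In this toy run the first batch ends with tokens 1 and 7, and the second batch starts with tokens 2 and 8. This is why the final GRU state of one batch is the natural initial state for the next.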

We expect your model to reach a validation perplexity **below 130**.  The following architecture and hyperparameters should be sufficient to get there.
* 2 GRU layers with 768 units
* dropout of 0.5 after each GRU layer and each linear layer (use different dropout modules)
* instead of projecting directly from the last GRU output to the vocabulary size for softmax, project down to a smaller size first (e.g. 768->128->vocab_size). **NOTE: You may find that adding nonlinearities between these layers can hurt performance; try without them first.**
* use the same weights for the embedding layer and the pre-softmax layer; dimension 128
* train with Adam (using default learning rates) for at least 10 epochs
* clip gradient norms to be lower than 5 before taking an optimization step (code example below)
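
The gradient clipping bullet above rescales all gradients jointly so that their combined L2 norm does not exceed the threshold. The following framework-free sketch shows the arithmetic on plain lists; `clip_by_global_norm` and its `eps` parameter are hypothetical names for illustration. In the notebook itself, `torch.nn.utils.clip_grad_norm_` performs this rescaling in place on the parameter gradients.

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale a list of gradient vectors so that their joint L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    # Never scale up: the factor is capped at 1.0 when the norm is already small.
    scale = min(1.0, max_norm / (total_norm + eps))
    return [[g * scale for g in vec] for vec in grads], total_norm


# Hypothetical gradients with joint norm 10, clipped to norm 5.
clipped, norm_before = clip_by_global_norm([[6.0, 8.0]], max_norm=5.0)
print(norm_before, clipped)
```

Because every gradient is multiplied by the same factor, the update direction is preserved and only its length changes.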


In [None]:
from functools import total_ordering
import torch.optim as optim

def ids(tokens):
    return [vocab[t] for t in tokens]

assert torch.cuda.is_available(), "no GPU found, in Colab go to 'Edit->Notebook settings' and choose a GPU hardware accelerator"

device = torch.device("cuda")

In [None]:
# Drop the reference first so that its GPU memory can actually be released.
del gru_model
torch.cuda.empty_cache()

In [None]:
class RecurrentLMDataset:
    def __init__(self, text_token_ids, bsz, bptt_len=32):
        self.bsz = bsz
        self.bptt_len = bptt_len
        token_ids = torch.tensor(text_token_ids)
        ncontig = token_ids.size(0) // bsz
        token_ids = token_ids[:ncontig*bsz].view(bsz, -1) # batch_size x ncontig
        self.token_ids = token_ids.t().contiguous() # ncontig x batch_size

    def __len__(self):
        return int(math.ceil(self.token_ids.size(0) / self.bptt_len))

    def __iter__(self):
        for i in range(0, self.token_ids.size(0)-1, self.bptt_len):
            seqlen = min(self.bptt_len, self.token_ids.size(0) - i - 1)
            x = self.token_ids[i:i+seqlen] # seqlen x batch_size
            y = self.token_ids[i+1:i+seqlen+1] # seqlen x batch_size
            yield x, y

# Hyperparameters
batch_size = 64
n_epochs = 15
bptt_len = 32
n_layers = 2
hidden_size = 768
embed_size = 128
clip_value = 5.


class GRUNetwork(nn.Module):
    # a PyTorch Module that holds the neural network for your model

    def __init__(self):
        super().__init__()

        # You will need to use a torch.nn.GRU layer in addition to torch.nn.Linear
        # and torch.nn.Dropout layers.

        # YOUR CODE HERE
        # NOTE: You may find that adding nonlinearities between these layers can hurt performance, try without first.

        # embedding layer
        self.embed = nn.Embedding(vocab_size, embed_size)

        # GRU
        self.gru = nn.GRU(input_size=embed_size, hidden_size=hidden_size, num_layers=n_layers, dropout=0.5)

        # linear 1
        self.l1 = nn.Linear(hidden_size, embed_size)
        self.dropout3 = nn.Dropout(p=0.5)
        # linear 2
        self.l2 = nn.Linear(embed_size, vocab_size)
        self.dropout4 = nn.Dropout(p=0.5)


    def forward(self, x, state):
        """Compute the output of the network.

        Note: In the Pytorch GRU tutorial, the state variable is named "hidden":
        https://pytorch.org/tutorials/beginner/nlp/sequence_models_tutorial.html

        The torch.nn.GRU documentation is quite helpful:
        https://pytorch.org/docs/stable/generated/torch.nn.GRU.html#torch.nn.GRU

        x - a tensor of int64 inputs with shape (seq_len, batch)
        state - a tensor with shape (num_layers, batch, hidden_size) holding the
                hidden state of the GRU (unlike an LSTM, a GRU has no cell state).
        returns a tuple with two elements:
          - a tensor of log probabilities with shape (seq_len, batch, vocab_size)
          - the new state tensor returned by applying the GRU, with the same shape
            as the state provided as input.
        """

        # Note that the nn.GRU module expects inputs with the sequence
        # dimension before the batch by default.
        # In this case the dimensions are already in the right order,
        # but watch out for this since sometimes people put the batch first

        # you can again use torch.nn.functional.embedding to convert input token
        # ids to embeddings looked up from the output layer's .weight tensor.

        # make sure you use .detach() before returning the state tuple

        # use torch.nn.functional.log_softmax for computing log-probabilities.

        # YOUR CODE HERE

        x_embed = F.embedding(x, weight=self.l2.weight)
        x1, state = self.gru(x_embed, state)
        x2 = self.dropout3(self.l1(x1))
        x3 = self.l2(x2)

        log_prob = F.log_softmax(x3, dim=-1)
        state = state.detach()
        return log_prob, state

class GRUModel:
    "A class that wraps GRUNetwork to handle training and evaluation."

    def __init__(self):
        self.network = GRUNetwork().to(device)

    def train(self):
        rnn_lm_dataset = RecurrentLMDataset(ids(train_text), batch_size, bptt_len)
        #train_loader = torch.utils.data.DataLoader(rnn_lm_dataset, batch_size=batch_size, shuffle=True)

        # Obtain an iterator over rnn_lm_dataset by calling `iter(rnn_lm_dataset)`
        # or by using tqdm. Looping thru this iterator with a for loop gives (x, y) tuples,
        # where x is a seqlen x batch_size token id tensor, and y is a seqlen x batch_size token id tensor.
        # The token ids in y are the next word targets for the sequence up till that position
        # in x.

        # The initial state passed into the GRU should be set to zero.

        # You can use gradient clipping before calling optimizer.step() as follows:
        # torch.nn.utils.clip_grad_norm_(
        #     [p for group in optimizer.param_groups for p in group['params']], clip_value)

        # YOUR CODE HERE
        self.network.cuda()
        self.network.train()
        optimizer = torch.optim.Adam(self.network.parameters(), lr=1e-3, weight_decay=1e-5)
        ce_loss = nn.CrossEntropyLoss()
        prev_perp = 1e12

        for epoch in range(n_epochs):
          print('EPOCH........', epoch)
          self.network.train()
          state = None
          # for batch in tqdm.tqdm_notebook(rnn_lm_dataset, leave=False):
          for batch in tqdm.tqdm_notebook(rnn_lm_dataset):
            x, y = batch
            x = x.cuda() # (seqlen, batch_size) token_id tensor
            y = y.cuda() # (seqlen, batch_size) token_id tensor 32x64 -- next word targets for corresponding sequence in x
            optimizer.zero_grad()
            if state is None:
              state = torch.zeros([n_layers, batch_size, hidden_size]).cuda()
            log_prob, state = self.network(x, state) # log_prob: (seq_len, batch, vocab_size)
            state = state.detach()

            loss = ce_loss(log_prob.permute(1,2,0), y.permute(1,0))
            loss.backward()
            torch.nn.utils.clip_grad_norm_([p for group in optimizer.param_groups for p in group['params']], clip_value)
            optimizer.step()

          perp = self.perplexity(validation_text)
          if perp < prev_perp:
            prev_perp = perp
            torch.save(self.network.state_dict(), 'gru_ngram.pt')

        self.network.load_state_dict(torch.load('gru_ngram.pt'))
        return self.network


    def next_word_probabilities(self, text_prefix):
        "Return a list of probabilities for each word in the vocabulary."

        # We won't be calling check_validity for GRUs so you don't need to
        # worry about an empty prefix.

        # Make sure you initialize the hidden states to zero and you return
        # probabilities instead of log-probabilities.

        # YOUR CODE HERE
        # don't forget self.network.eval()
        # don't forget to move tensors to the GPU

        numbers = torch.tensor(ids(text_prefix))
        state = None

        with torch.no_grad():
          self.network.eval()
          x = numbers.cuda()
          x = x.view(-1, 1) # (seq_len, batch_size=1)

          if state is None:
            state = torch.zeros(n_layers, 1, hidden_size).cuda()

          log_probs, state = self.network(x, state)
          probs = torch.exp(log_probs)
          probs = probs[-1].squeeze().cpu().numpy()
          return probs


    def perplexity(self, text):
        "Return perplexity as a float."
        # Your code should be very similar to next_word_probabilities, but
        # run in a loop over batches. Use torch.no_grad() for extra speed.

        # make sure you pass the hidden state from one batch to the next.

        valdataset = RecurrentLMDataset(ids(text), batch_size, bptt_len)
        # YOUR CODE HERE
        # don't forget self.network.eval()
        # don't forget to move tensors to the GPU

        with torch.no_grad():
          self.network.eval()
          state = None

          total_loss = 0.0
          num_tokens = 0

          for x, y in valdataset:
            x = x.cuda()
            y = y.cuda() # (seqlen, batch_size) token_id tensor 32x64 -- next word targets for corresponding sequence in x
            if state is None:
              state = torch.zeros(n_layers, batch_size, hidden_size).cuda()

            log_probs, state = self.network(x, state)
            state = state.detach()
            # sum the per-token losses so that a shorter final batch is weighted by its token count
            loss = F.nll_loss(log_probs.permute(1,2,0), y.permute(1,0), reduction='sum') # (seq_len, batch, vocab_size) -> (batch, vocab_size, seq_len)
            total_loss += loss.item()
            num_tokens += y.numel()

          avg_loss = total_loss / num_tokens
          print("LOSS TOTAL IS", avg_loss)
          perplexity = math.exp(avg_loss)
          print('PERPLEXITY IS', perplexity)
          return perplexity

gru_model = GRUModel()
gru_model.train()

print('gru validation perplexity:', gru_model.perplexity(validation_text))

EPOCH........ 0
LOSS TOTAL IS 5.617132663726807
PERPLEXITY IS 275.0994486441085
EPOCH........ 1
LOSS TOTAL IS 5.28838051590964
PERPLEXITY IS 198.02247134764957
EPOCH........ 2
LOSS TOTAL IS 5.139294575307971
PERPLEXITY IS 170.5953836691755
EPOCH........ 3

KeyboardInterrupt: 

In [None]:
save_truncated_distribution(gru_model, 'gru_predictions.npy')

saved gru_predictions.npy


<!-- Do not remove this comment, it is used by the autograder: RqYJKsoTS6 -->

Fill in your GRU perplexity.

GRU validation perplexity: ***122.6002332972709***

# Experimentation: 1-Page Report

Now it's time for you to experiment.  Try to reach a validation perplexity **below 120**. You may either modify the GRU class above, or copy it down to the code cell below and modify it there. Just **be sure to run the code cell below to generate results with your improved GRU**.

It is okay if the bulk of your improvements are due to hyperparameter tuning (such as changing the number or size of layers, or adding dropout to more outputs), but implement at least one more substantial change to the model.  Here are some ideas (several of which come from https://arxiv.org/pdf/1708.02182.pdf). They do not always work, so you might need to experiment with several ideas before seeing an improvement:
* weight-drop regularization - apply dropout to the weight matrices instead of activations
* learning rate scheduling - decrease the learning rate during training
* weight regularization - add an L2 regularization penalty on the weights of the networks
* ensembling - average the predictions of several models trained with different initialization random seeds
* embedding dropout - zero out the entire embedding for a random set of words in the embedding matrix
* activation regularization - add an L2 regularization penalty on the activation of the GRU output
* temporal activation regularization - add L2 regularization on the difference between the GRU output activations at adjacent timesteps

You may notice that most of these suggestions are regularization techniques. This dataset is considered fairly small, so regularization is one of the best ways to improve performance.
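
As one possible reading of the two activation-based penalties in the list above, the sketch below computes them on plain lists of GRU outputs for a single sequence. The function names and the coefficients `alpha` and `beta` are hypothetical, and taking the mean of the squared values is an assumption about how the penalty is normalized. In a real model, these terms would be computed on the GRU output tensor and added to the training loss.

```python
def activation_regularization(outputs, alpha):
    """alpha times the mean squared activation; outputs is [timestep][unit]."""
    values = [v for step in outputs for v in step]
    return alpha * sum(v * v for v in values) / len(values)


def temporal_activation_regularization(outputs, beta):
    """beta times the mean squared difference between outputs at adjacent timesteps."""
    diffs = [
        nxt_v - prev_v
        for prev, nxt in zip(outputs, outputs[1:])
        for prev_v, nxt_v in zip(prev, nxt)
    ]
    return beta * sum(d * d for d in diffs) / len(diffs)


# Hypothetical GRU outputs: 2 timesteps, 2 hidden units.
outputs = [[1.0, -1.0], [3.0, 1.0]]
print(activation_regularization(outputs, alpha=2.0))           # 6.0
print(temporal_activation_regularization(outputs, beta=1.0))   # 4.0
```

The first penalty discourages large activations, while the second discourages abrupt changes in the hidden representation from one timestep to the next.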

For this section, you will submit a write-up describing the extensions and/or modifications that you tried.  Your write-up should be **1-page maximum** in length and should be submitted in PDF format.  You may use any editor you like, but we recommend using LaTeX and working in an environment like Overleaf.
For full credit, your write-up should include:
1.   A concise and precise description of the extension(s) that you tried.
2.   A motivation for why you believed this approach might improve your model.
3.   A discussion of whether the extension was effective and/or an analysis of the results.  This will generally involve some combination of tables, learning curves, etc.
4.   A bottom-line summary of your results comparing validation perplexities of your improvement to the original GRU.

The purpose of this exercise is to experiment, so feel free to try or ablate several of the suggestions above, as well as any others you come up with!
When you submit the file, please name it `report.pdf`.



Changes tried:

* using a learning rate scheduler (https://machinelearningmastery.com/using-learning-rate-schedule-in-pytorch-training/)
  * increase initial lr from 1e-3 = 0.001 to 0.007
  * exponential decay scheduler vs. linear decay scheduler (https://neptune.ai/blog/how-to-choose-a-learning-rate-scheduler)
    * deciding on the gamma value for `ExponentialLR` (https://discuss.pytorch.org/t/exponential-decay-learning-rate/76384)
    * start at 0.005, end at 0.0007 => gamma = (0.0007 / 0.005)^(1/15) = 0.877152
    * use momentum??

* hyperparameter tuning
  * number of layers: try 1 and 3
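
The gamma calculation in the notes above can be reproduced with a few lines of plain Python. The helper names `exponential_gamma` and `exponential_schedule` are hypothetical; the sketch assumes the scheduler multiplies the learning rate by `gamma` once per epoch, which is how `ExponentialLR` behaves when `scheduler.step()` is called once per epoch.

```python
def exponential_gamma(lr_start, lr_end, n_steps):
    """Decay factor that takes lr_start to lr_end in n_steps multiplications."""
    return (lr_end / lr_start) ** (1.0 / n_steps)


def exponential_schedule(lr_start, gamma, n_epochs):
    """Learning rate before each epoch, plus the value after the final step."""
    return [lr_start * gamma ** epoch for epoch in range(n_epochs + 1)]


gamma = exponential_gamma(0.005, 0.0007, 15)
print(round(gamma, 5))  # 0.87715

schedule = exponential_schedule(0.005, gamma, 20)
print(schedule[15])  # close to 0.0007
print(schedule[20])  # lower still, if training continues for 20 epochs
```

Note that the decay does not stop at the target value: if training runs for more epochs than the number of steps used to derive `gamma`, the learning rate keeps shrinking.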






In [None]:
from torch.optim import lr_scheduler

In [None]:
class RecurrentLMDataset:
    def __init__(self, text_token_ids, bsz, bptt_len=32):
        self.bsz = bsz
        self.bptt_len = bptt_len
        token_ids = torch.tensor(text_token_ids)
        ncontig = token_ids.size(0) // bsz
        token_ids = token_ids[:ncontig*bsz].view(bsz, -1) # batch_size x ncontig
        self.token_ids = token_ids.t().contiguous() # ncontig x batch_size

    def __len__(self):
        return int(math.ceil(self.token_ids.size(0) / self.bptt_len))

    def __iter__(self):
        for i in range(0, self.token_ids.size(0)-1, self.bptt_len):
            seqlen = min(self.bptt_len, self.token_ids.size(0) - i - 1)
            x = self.token_ids[i:i+seqlen] # seqlen x batch_size
            y = self.token_ids[i+1:i+seqlen+1] # seqlen x batch_size
            yield x, y

# Hyperparameters
batch_size = 64
n_epochs = 20
bptt_len = 32
n_layers = 2
hidden_size = 768
embed_size = 128
clip_value = 5.


class GRUNetwork(nn.Module):
    # a PyTorch Module that holds the neural network for your model

    def __init__(self):
        super().__init__()

        # You will need to use a torch.nn.GRU layer in addition to torch.nn.Linear
        # and torch.nn.Dropout layers.

        # NOTE: You may find that adding nonlinearities between these layers can hurt performance, try without first.

        # embedding layer
        self.embed = nn.Embedding(vocab_size, embed_size)

        # GRU
        self.gru = nn.GRU(input_size=embed_size, hidden_size=hidden_size, num_layers=n_layers, dropout=0.5)

        # linear 1
        self.l1 = nn.Linear(hidden_size, embed_size)
        self.dropout3 = nn.Dropout(p=0.5)
        # linear 2
        self.l2 = nn.Linear(embed_size, vocab_size)
        self.dropout4 = nn.Dropout(p=0.5)


    def forward(self, x, state):
        """Compute the output of the network.

        Note: In the Pytorch GRU tutorial, the state variable is named "hidden":
        https://pytorch.org/tutorials/beginner/nlp/sequence_models_tutorial.html

        The torch.nn.GRU documentation is quite helpful:
        https://pytorch.org/docs/stable/generated/torch.nn.GRU.html#torch.nn.GRU

        x - a tensor of int64 inputs with shape (seq_len, batch)
        state - a tensor with shape (num_layers, batch, hidden_size) holding the
                hidden state of the GRU (unlike an LSTM, a GRU has no cell state).
        returns a tuple with two elements:
          - a tensor of log probabilities with shape (seq_len, batch, vocab_size)
          - the new state tensor returned by applying the GRU, with the same shape
            as the state provided as input.
        """

        # Note that the nn.GRU module expects inputs with the sequence
        # dimension before the batch by default.
        # In this case the dimensions are already in the right order,
        # but watch out for this since sometimes people put the batch first

        # you can again use torch.nn.functional.embedding to convert input token
        # ids to embeddings looked up from the output layer's .weight tensor.

        # make sure you use .detach() before returning the state tuple

        # use torch.nn.functional.log_softmax for computing log-probabilities.

        x_embed = F.embedding(x, weight=self.l2.weight)
        x1, state = self.gru(x_embed, state)
        x2 = self.dropout3(self.l1(x1))
        x3 = self.l2(x2)

        log_prob = F.log_softmax(x3, dim=-1)
        state = state.detach()
        return log_prob, state

class GRUModel:
    "A class that wraps GRUNetwork to handle training and evaluation."

    def __init__(self):
        self.network = GRUNetwork().to(device)

    def train(self):
        rnn_lm_dataset = RecurrentLMDataset(ids(train_text), batch_size, bptt_len)

        # Obtain an iterator over rnn_lm_dataset by calling `iter(rnn_lm_dataset)`
        # or by using tqdm. Looping thru this iterator with a for loop gives (x, y) tuples,
        # where x is a seqlen x batch_size token id tensor, and y is a seqlen x batch_size token id tensor.
        # The token ids in y are the next word targets for the sequence up till that position
        # in x.

        # The initial state passed into the GRU should be set to zero.

        # You can use gradient clipping before calling optimizer.step() as follows:
        # torch.nn.utils.clip_grad_norm_(
        #     [p for group in optimizer.param_groups for p in group['params']], clip_value)

        self.network.cuda()
        self.network.train()
        ce_loss = nn.CrossEntropyLoss()
        # optimizer = torch.optim.Adam(self.network.parameters(), lr=1e-3, weight_decay=1e-5)
        optimizer = torch.optim.Adam(self.network.parameters(), lr=.005, weight_decay=1e-5)
        # scheduler = lr_scheduler.LinearLR(optimizer, start_factor=1.0, end_factor=0.1, total_iters=7)
        scheduler = lr_scheduler.ExponentialLR(optimizer, gamma = 0.87715)

        prev_perp = 1e12

        for epoch in range(n_epochs):
          print('EPOCH........', epoch)
          self.network.train()
          state = None
          for batch in tqdm.tqdm_notebook(rnn_lm_dataset):
            x, y = batch
            x = x.cuda() # (seqlen, batch_size) token_id tensor
            y = y.cuda() # (seqlen, batch_size) token_id tensor 32x64 -- next word targets for corresponding sequence in x
            optimizer.zero_grad()
            if state is None:
              state = torch.zeros([n_layers, batch_size, hidden_size]).cuda()
            log_prob, state = self.network(x, state) # log_prob: (seq_len, batch, vocab_size)
            state = state.detach()

            loss = ce_loss(log_prob.permute(1,2,0), y.permute(1,0))
            loss.backward()
            torch.nn.utils.clip_grad_norm_([p for group in optimizer.param_groups for p in group['params']], clip_value)
            optimizer.step()
          before_lr = optimizer.param_groups[0]["lr"]
          scheduler.step()
          after_lr = optimizer.param_groups[0]["lr"]
          print("Epoch %d: Adam lr %.4f -> %.4f" % (epoch, before_lr, after_lr))

          perp = self.perplexity(validation_text)
          if perp < prev_perp:
            prev_perp = perp
            torch.save(self.network.state_dict(), 'gru_ngram.pt')

          # if perp >= 300:
          #   print("Stopping early, perplexity is........", perp)
          #   break

        self.network.load_state_dict(torch.load('gru_ngram.pt'))
        return self.network


    def next_word_probabilities(self, text_prefix):
        "Return a list of probabilities for each word in the vocabulary."

        # We won't be calling check_validity for GRUs so you don't need to
        # worry about an empty prefix.

        # Make sure you initialize the hidden states to zero and you return
        # probabilities instead of log-probabilities.

        # don't forget self.network.eval()
        # don't forget to move tensors to the GPU

        numbers = torch.tensor(ids(text_prefix))
        state = None

        with torch.no_grad():
          self.network.eval()
          x = numbers.cuda()
          x = x.view(-1, 1) # (seq_len, batch_size) = (len(text_prefix), 1)

          if state is None:
            state = torch.zeros(n_layers, 1, hidden_size).cuda()

          log_probs, state = self.network(x, state)
          probs = torch.exp(log_probs)
          probs = probs[-1].squeeze().cpu().numpy()
          return probs


    def perplexity(self, text):
        "Return perplexity as a float."
        # Your code should be very similar to next_word_probabilities, but
        # run in a loop over batches. Use torch.no_grad() for extra speed.

        # make sure you pass the hidden state from one batch to the next.

        # don't forget self.network.eval()
        # don't forget to move tensors to the GPU

        valdataset = RecurrentLMDataset(ids(text), batch_size, bptt_len)

        with torch.no_grad():
          self.network.eval()
          state = None

          batch_losses = []  # mean negative log-likelihood of each batch

          for x, y in valdataset:
            x = x.cuda()
            y = y.cuda() # (seqlen, batch_size) token_id tensor 32x64 -- next word targets for corresponding sequence in x
            if state is None:
              state = torch.zeros(n_layers, batch_size, hidden_size).cuda()

            log_probs, state = self.network(x, state)
            state = state.detach()
            loss = F.nll_loss(log_probs.permute(1,2,0), y.permute(1,0)) # (seq_len, batch, vocab_size) -> (batch, vocab_size, seq_len)
            batch_losses.append(loss.item())

          total_loss = np.mean(batch_losses)
          print("LOSS TOTAL IS", total_loss)
          perplexity = math.exp(total_loss)
          print('PERPLEXITY IS', perplexity)
          return perplexity
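
The `perplexity` method above averages the per-batch mean losses, which matches the corpus-level value only when every batch contains the same number of target tokens. Below is a minimal sketch of a token-weighted alternative, assuming you record each batch's mean negative log-likelihood together with its token count. The helper name `perplexity_from_batches`, its argument names, and the example numbers are hypothetical and not part of the assignment code.

```python
import math

def perplexity_from_batches(batch_mean_nll, batch_token_counts):
    """Corpus perplexity from per-batch mean NLL values (in nats).

    Each batch mean is weighted by its number of target tokens, so a
    shorter final batch does not count as much as a full one.
    """
    if len(batch_mean_nll) != len(batch_token_counts):
        raise ValueError("need one token count per batch")
    total_tokens = sum(batch_token_counts)
    total_nll = sum(l * n for l, n in zip(batch_mean_nll, batch_token_counts))
    return math.exp(total_nll / total_tokens)

# Hypothetical numbers: two full batches of 32 * 64 tokens and one short batch.
losses = [4.8, 4.7, 5.5]
counts = [2048, 2048, 128]
weighted = perplexity_from_batches(losses, counts)
unweighted = math.exp(sum(losses) / len(losses))
print(weighted, unweighted)
```

If every batch has the same size, the two values coincide, so this only matters when the dataset yields a shorter final batch.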

In [None]:
gru_model_new = GRUModel()
gru_model_new.train()

print('gru validation perplexity:', gru_model_new.perplexity(validation_text))

EPOCH........ 0


Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`
  for batch in tqdm.tqdm_notebook(rnn_lm_dataset):


  0%|          | 0/1020 [00:00<?, ?it/s]

Epoch 0: SGD lr 0.0050 -> 0.0044
LOSS TOTAL IS 6.223433757496771
PERPLEXITY IS 504.43236001598126
EPOCH........ 1


Epoch 1: SGD lr 0.0044 -> 0.0038
LOSS TOTAL IS 6.047920931165463
PERPLEXITY IS 423.23218584682667
EPOCH........ 2


Epoch 2: SGD lr 0.0038 -> 0.0034
LOSS TOTAL IS 6.017748289019148
PERPLEXITY IS 410.6528822938648
EPOCH........ 3


Epoch 3: SGD lr 0.0034 -> 0.0030
LOSS TOTAL IS 5.612222408579889
PERPLEXITY IS 273.75195114616594
EPOCH........ 4


Epoch 4: SGD lr 0.0030 -> 0.0026
LOSS TOTAL IS 5.472148908632938
PERPLEXITY IS 237.97102168513678
EPOCH........ 5


Epoch 5: SGD lr 0.0026 -> 0.0023
LOSS TOTAL IS 5.313909891609834
PERPLEXITY IS 203.1429445556457
EPOCH........ 6


Epoch 6: SGD lr 0.0023 -> 0.0020
LOSS TOTAL IS 5.188013099064337
PERPLEXITY IS 179.1123207123489
EPOCH........ 7


Epoch 7: SGD lr 0.0020 -> 0.0018
LOSS TOTAL IS 5.101853027522007
PERPLEXITY IS 164.3261261842733
EPOCH........ 8


Epoch 8: SGD lr 0.0018 -> 0.0015
LOSS TOTAL IS 5.0298375281218055
PERPLEXITY IS 152.90816740018968
EPOCH........ 9


Epoch 9: SGD lr 0.0015 -> 0.0013
LOSS TOTAL IS 4.962415160419785
PERPLEXITY IS 142.93859899874536
EPOCH........ 10


Epoch 10: SGD lr 0.0013 -> 0.0012
LOSS TOTAL IS 4.924365524933717
PERPLEXITY IS 137.6020088968479
EPOCH........ 11


Epoch 11: SGD lr 0.0012 -> 0.0010
LOSS TOTAL IS 4.873939322534008
PERPLEXITY IS 130.83530552720256
EPOCH........ 12


Epoch 12: SGD lr 0.0010 -> 0.0009
LOSS TOTAL IS 4.84467276011672
PERPLEXITY IS 127.06169553531777
EPOCH........ 13


Epoch 13: SGD lr 0.0009 -> 0.0008
LOSS TOTAL IS 4.831197328656633
PERPLEXITY IS 125.36096911287355
EPOCH........ 14


Epoch 14: SGD lr 0.0008 -> 0.0007
LOSS TOTAL IS 4.816357559132799
PERPLEXITY IS 123.51437661522351
EPOCH........ 15


Epoch 15: SGD lr 0.0007 -> 0.0006
LOSS TOTAL IS 4.784783020197788
PERPLEXITY IS 119.67539312335282
EPOCH........ 16


Epoch 16: SGD lr 0.0006 -> 0.0005
LOSS TOTAL IS 4.786026165864178
PERPLEXITY IS 119.82425958186232
EPOCH........ 17


Epoch 17: SGD lr 0.0005 -> 0.0005
LOSS TOTAL IS 4.780443717386121
PERPLEXITY IS 119.15721044236282
EPOCH........ 18


Epoch 18: SGD lr 0.0005 -> 0.0004
LOSS TOTAL IS 4.7645080156415425
PERPLEXITY IS 117.27340641093119
EPOCH........ 19


Epoch 19: SGD lr 0.0004 -> 0.0004
LOSS TOTAL IS 4.76023777845864
PERPLEXITY IS 116.77368886588958
LOSS TOTAL IS 4.760237823022861
PERPLEXITY IS 116.77369406981815
gru validation perplexity: 116.77369406981815
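
The learning rates printed in the log above are consistent with exponential decay: after `k` scheduler steps, the rate is the initial rate multiplied by `gamma ** k`. Below is a minimal sketch in plain Python that applies this formula to the values used in `train` (`lr=.005`, `gamma = 0.87715`). The helper name `exponential_lr` is hypothetical. It mirrors the formula rather than calling PyTorch.

```python
def exponential_lr(initial_lr, gamma, n_steps):
    """Return the learning rate after 0, 1, ..., n_steps scheduler steps."""
    return [initial_lr * gamma ** k for k in range(n_steps + 1)]

rates = exponential_lr(0.005, 0.87715, 20)
for epoch in range(3):
    # Same "before -> after" layout as the training log.
    print("Epoch %d: lr %.4f -> %.4f" % (epoch, rates[epoch], rates[epoch + 1]))
```

With these settings the rate falls by a little more than a factor of ten over the 20 epochs, which is why the last few log lines show only small changes.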


Run the cell below to evaluate your improved GRU and save its predictions.

In [None]:
print('gru validation perplexity:', gru_model_new.perplexity(validation_text))
save_truncated_distribution(gru_model_new, 'gru_predictions_new.npy')

LOSS TOTAL IS 4.760237787371484
PERPLEXITY IS 116.77368990667524
gru validation perplexity: 116.77368990667524


  0%|          | 0/1000 [00:00<?, ?it/s]

saved gru_predictions_new.npy


### Submission

Upload a submission with the following files to Gradescope:
* proj_1.ipynb (rename to match this exactly)
* gru_predictions.npy (this should also include all improvements from your exploration)
* neural_trigram_predictions.npy
* trigram_kn_predictions.npy
* bigram_predictions.npy
* report.pdf

You can upload files individually or as part of a zip file, but if using a zip file be sure you are zipping the files directly and not a folder that contains them.
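
If you submit a zip file, one way to keep the files at the top level of the archive is to store each entry under its bare filename. Below is a minimal sketch using Python's standard `zipfile` module. The helper name `zip_flat` and the output name `submission.zip` are assumptions for illustration. The file list is copied from the list above.

```python
import os
import zipfile

SUBMISSION_FILES = [
    "proj_1.ipynb",
    "gru_predictions.npy",
    "neural_trigram_predictions.npy",
    "trigram_kn_predictions.npy",
    "bigram_predictions.npy",
    "report.pdf",
]

def zip_flat(paths, zip_path):
    """Write each file into zip_path at the archive root (no folders)."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in paths:
            # arcname drops any directory prefix from the stored name.
            zf.write(path, arcname=os.path.basename(path))
    # Reopen the archive and report the stored names for a quick check.
    with zipfile.ZipFile(zip_path) as zf:
        return zf.namelist()

if all(os.path.exists(p) for p in SUBMISSION_FILES):
    print(zip_flat(SUBMISSION_FILES, "submission.zip"))
else:
    print("some submission files are missing; nothing zipped")
```

The returned name list should contain only filenames, with no `/` separators.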

Be sure to check the output of the autograder after it runs. It should confirm that no files are missing and that the output files have the correct format. Note that the test set perplexities shown by the autograder are on a different scale from your validation set perplexities, because the autograder uses different text and truncates the distribution. Don't worry if the values seem worse. We will compare your perplexity on the test set to our model's perplexity and assign a score based on that.