## Part 1: Language Modeling

In this project, you will implement several different types of language models for text.  We'll start with n-gram models and then move on to neural n-gram models.

Warning: Do not start this project the day before it is due!  Some parts require 20 minutes or more to run, so debugging and tuning can take a significant amount of time.

Our dataset for this project will be the WikiText-2 language modeling dataset, which the setup cell below loads from HuggingFace.  This dataset comes with some of the basic preprocessing done for us, such as tokenization and rare word filtering (using the `<unk>` token).
Therefore, we can assume that all word types in the test set also appear at least once in the training set.

In [3]:
# This block handles some basic setup and data loading.
# You shouldn't need to edit this, but if you want to
# import other standard python packages, that is fine.

# imports
from collections import defaultdict, Counter
import numpy as np
import math
import tqdm
import random
import pdb

import torch
from torch import nn
import torch.nn.functional as F
from datasets import Dataset
import os

from datasets import load_dataset

# Load WikiText-2 from HuggingFace
dataset = load_dataset('wikitext', 'wikitext-2-v1', split=['train', 'validation', 'test'])
train_dataset, validation_dataset, test_dataset = dataset

# Convert to list of tokens (HuggingFace returns text as strings, we need to tokenize)
# WikiText-2 is already tokenized with space-separated tokens
def get_tokens(example):
    # Split by whitespace and filter out empty strings
    tokens = example['text'].split()
    return tokens

# Get all tokens from each split
train_text = []
for example in train_dataset:
    tokens = get_tokens(example)
    if tokens:  # Skip empty examples
        train_text.extend(tokens)

validation_text = []
for example in validation_dataset:
    tokens = get_tokens(example)
    if tokens:
        validation_text.extend(tokens)

test_text = []
for example in test_dataset:
    tokens = get_tokens(example)
    if tokens:
        test_text.extend(tokens)

# Build vocabulary from training set only
# (Validation and test sets may contain unknown tokens, which will be mapped to <unk>)
token_counts = Counter(train_text)

# Create vocabulary with special tokens
# Add special tokens: <unk> for unknown words, <eos> for end of sentence, <pad> for padding
special_tokens = ['<unk>', '<eos>', '<pad>']
for token in special_tokens:
    # Keep the real training count if the token already occurs (e.g. <unk>)
    token_counts.setdefault(token, 0)
vocab_list = sorted(token_counts.keys())
vocab_size = len(vocab_list)

# Create a simple vocab class compatible with torchtext interface
class Vocab:
    def __init__(self, vocab_list, token_counts):
        self.itos = vocab_list  # index to string
        self.stoi = {word: idx for idx, word in enumerate(vocab_list)}  # string to index
        self.freqs = token_counts  # frequency counts

    def __len__(self):
        return len(self.itos)

vocab = Vocab(vocab_list, token_counts)

print(f"Vocabulary size: {vocab_size}")
print(f"First 30 validation tokens: {validation_text[:30]}")

Vocabulary size: 33279
First 30 validation tokens: ['=', 'Homarus', 'gammarus', '=', 'Homarus', 'gammarus', ',', 'known', 'as', 'the', 'European', 'lobster', 'or', 'common', 'lobster', ',', 'is', 'a', 'species', 'of', '<unk>', 'lobster', 'from', 'the', 'eastern', 'Atlantic', 'Ocean', ',', 'Mediterranean', 'Sea']
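
The setup comment mentions mapping unknown tokens to `<unk>`, but a plain `vocab.stoi[token]` lookup raises a `KeyError` for any token that is not in the vocabulary. A minimal sketch of a lookup with an `<unk>` fallback, using the same `.get(..., unk_id)` pattern that `save_truncated_distribution` uses later; the helper name `encode` and the toy vocabulary are hypothetical and not part of the project code:

```python
def encode(tokens, stoi, unk_token='<unk>'):
    """Map string tokens to integer ids, falling back to the <unk> id."""
    unk_id = stoi[unk_token]
    return [stoi.get(token, unk_id) for token in tokens]

# Toy mapping standing in for vocab.stoi (hypothetical values).
toy_stoi = {'<unk>': 0, 'lobster': 1, 'the': 2}
encoded = encode(['the', 'European', 'lobster'], toy_stoi)
print(encoded)  # [2, 0, 1]
```

With WikiText-2 this fallback should rarely trigger, since rare words are already replaced by `<unk>` in every split.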


In [4]:
print(validation_text[:300])

['=', 'Homarus', 'gammarus', '=', 'Homarus', 'gammarus', ',', 'known', 'as', 'the', 'European', 'lobster', 'or', 'common', 'lobster', ',', 'is', 'a', 'species', 'of', '<unk>', 'lobster', 'from', 'the', 'eastern', 'Atlantic', 'Ocean', ',', 'Mediterranean', 'Sea', 'and', 'parts', 'of', 'the', 'Black', 'Sea', '.', 'It', 'is', 'closely', 'related', 'to', 'the', 'American', 'lobster', ',', 'H.', 'americanus', '.', 'It', 'may', 'grow', 'to', 'a', 'length', 'of', '60', 'cm', '(', '24', 'in', ')', 'and', 'a', 'mass', 'of', '6', 'kilograms', '(', '13', 'lb', ')', ',', 'and', 'bears', 'a', 'conspicuous', 'pair', 'of', 'claws', '.', 'In', 'life', ',', 'the', 'lobsters', 'are', 'blue', ',', 'only', 'becoming', '"', 'lobster', 'red', '"', 'on', 'cooking', '.', 'Mating', 'occurs', 'in', 'the', 'summer', ',', 'producing', 'eggs', 'which', 'are', 'carried', 'by', 'the', 'females', 'for', 'up', 'to', 'a', 'year', 'before', 'hatching', 'into', '<unk>', 'larvae', '.', 'Homarus', 'gammarus', 'is', 'a', 'h

We've implemented a unigram model here as a demonstration.

In [5]:
class UnigramModel:
    def __init__(self, train_text):
        self.counts = Counter(train_text)
        self.total_count = len(train_text)

    def probability(self, word):
        return self.counts[word] / self.total_count

    def next_word_probabilities(self, text_prefix):
        """Return a list of probabilities for each word in the vocabulary."""
        return [self.probability(word) for word in vocab.itos]

    def perplexity(self, full_text):
        """Return the perplexity of the model on a text as a float.

        full_text -- a list of string tokens
        """
        log_probabilities = []
        for word in full_text:
            # Note that the base of the log doesn't matter
            # as long as the log and exp use the same base.
            prob = self.probability(word)
            # Handle 0 probability by using a very small epsilon to avoid log(0)
            if prob == 0:
                prob = 1e-10
            log_probabilities.append(math.log(prob, 2))
        return 2 ** -np.mean(log_probabilities)

unigram_demonstration_model = UnigramModel(train_text)
print('unigram validation perplexity:',
      unigram_demonstration_model.perplexity(validation_text))

def check_validity(model):
    """Performs several sanity checks on your model:
    1) That next_word_probabilities returns a valid distribution
    2) That perplexity matches a perplexity calculated from next_word_probabilities

    Although it is possible to calculate perplexity from next_word_probabilities,
    it is still good to have a separate more efficient method that only computes
    the probabilities of observed words.
    """

    log_probabilities = []
    for i in range(10):
        prefix = validation_text[:i]
        probs = model.next_word_probabilities(prefix)
        assert min(probs) >= 0, "Negative value in next_word_probabilities"
        assert max(probs) <= 1 + 1e-8, "Value larger than 1 in next_word_probabilities"
        assert abs(sum(probs)-1) < 1e-4, "next_word_probabilities do not sum to 1"

        word_id = vocab.stoi[validation_text[i]]
        selected_prob = probs[word_id]
        # Handle 0 probability by using a very small epsilon to avoid log(0)
        if selected_prob == 0:
            selected_prob = 1e-10
        log_probabilities.append(math.log(selected_prob, 2))

    perplexity = 2 ** -np.mean(log_probabilities)
    your_perplexity = model.perplexity(validation_text[:10])
    assert abs(perplexity-your_perplexity) < 0.1, "your perplexity does not " + \
    "match the one we calculated from `next_word_probabilities`,\n" + \
    "at least one of `perplexity` or `next_word_probabilities` is incorrect.\n" + \
    f"we calculated {perplexity} from `next_word_probabilities`,\n" + \
    f"but your perplexity function returned {your_perplexity} (on a small sample)."


check_validity(unigram_demonstration_model)


unigram validation perplexity: 996.5031614073108
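
To build intuition for the perplexity number, it can help to evaluate the same formula (2 raised to the negative mean log2 probability) on a few hand-picked probabilities. The sketch below is illustrative only; `perplexity_from_probs` is a hypothetical helper, not part of the required interface:

```python
import math

def perplexity_from_probs(probs):
    """Perplexity = 2 ** (-mean log2 probability of the observed tokens)."""
    log_probs = [math.log2(p) for p in probs]
    return 2 ** (-sum(log_probs) / len(log_probs))

# A model that assigns every observed token probability 1/8 ...
uniform_ppl = perplexity_from_probs([1 / 8] * 5)
# ... versus one that is more confident on some tokens.
mixed_ppl = perplexity_from_probs([1 / 2, 1 / 4, 1 / 8, 1 / 8])
print(uniform_ppl, mixed_ppl)
```

A model that always assigns probability 1/8 has perplexity 8, so perplexity can be read as the effective number of equally likely choices the model is deciding between at each step.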


To generate from a language model, we can sample one word at a time, conditioning on the words we have generated so far.

In [6]:
def generate_text(model, n=20, prefix=('<eos>', '<eos>')):
    prefix = list(prefix)
    for _ in range(n):
        probs = model.next_word_probabilities(prefix)
        word = random.choices(vocab.itos, probs)[0]
        prefix.append(word)
    return ' '.join(prefix)

print(generate_text(unigram_demonstration_model))

<eos> <eos> Eleven Cass producer by @,@ The in Benigno in cliff years guitar , , . used Auckland also palace <unk>


In fact there are many strategies to get better-sounding samples, such as only sampling from the top-k words or sharpening the distribution with a temperature.  You can read more about sampling from a language model in this recent paper: https://arxiv.org/pdf/1904.09751.pdf.
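
As a rough illustration of the two strategies mentioned above, the sketch below reshapes a probability list before sampling. The helper names `sharpen` and `top_k` and the toy distribution are hypothetical; this is one simple way to write them, assuming the model returns a plain list of probabilities as `next_word_probabilities` does:

```python
import random

def sharpen(probs, temperature=1.0):
    """Raise each probability to 1/T and renormalize; T < 1 sharpens."""
    scaled = [p ** (1.0 / temperature) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]

def top_k(probs, k):
    """Keep only the k most probable entries, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(order[:k])
    kept = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]

# Toy vocabulary and distribution (made up for illustration).
toy_vocab = ['the', 'lobster', 'sea', 'claws']
toy_probs = [0.5, 0.3, 0.15, 0.05]

sharp = sharpen(toy_probs, temperature=0.5)
truncated = top_k(toy_probs, k=2)
word = random.choices(toy_vocab, truncated)[0]
```

Either transformed list can be passed to `random.choices` in place of the raw probabilities inside `generate_text`.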

You will need to submit some outputs from the models you implement for us to grade.  The following function will be used to generate the required output files.

In [7]:
!wget https://cal-cs288.github.io/sp21/project_files/proj_1/eval_prefixes.txt
!wget https://cal-cs288.github.io/sp21/project_files/proj_1/eval_output_vocab.txt
!wget https://cal-cs288.github.io/sp21/project_files/proj_1/eval_prefixes_short.txt
!wget https://cal-cs288.github.io/sp21/project_files/proj_1/eval_output_vocab_short.txt

def save_truncated_distribution(model, filename, short=True):
    """Generate a file of truncated distributions.

    Probability distributions over the full vocabulary are large,
    so we will truncate the distribution to a smaller vocabulary.

    Please do not edit this function
    """
    vocab_name = 'eval_output_vocab'
    prefixes_name = 'eval_prefixes'

    if short:
        vocab_name += '_short'
        prefixes_name += '_short'

    with open('{}.txt'.format(vocab_name), 'r') as eval_vocab_file:
        eval_vocab = [w.strip() for w in eval_vocab_file]
    # Map unknown words to <unk> token ID
    unk_id = vocab.stoi['<unk>']
    eval_vocab_ids = [vocab.stoi.get(s, unk_id) for s in eval_vocab]

    all_selected_probabilities = []
    with open('{}.txt'.format(prefixes_name), 'r') as eval_prefixes_file:
        lines = eval_prefixes_file.readlines()
        for line in tqdm.tqdm(lines, leave=False):
            prefix = line.strip().split(' ')
            probs = model.next_word_probabilities(prefix)
            selected_probs = np.array([probs[i] for i in eval_vocab_ids], dtype=np.float32)
            all_selected_probabilities.append(selected_probs)

    all_selected_probabilities = np.stack(all_selected_probabilities)
    np.save(filename, all_selected_probabilities)
    print('saved', filename)

In [8]:
save_truncated_distribution(unigram_demonstration_model, 'unigram_predictions.npy')

                                                   

saved unigram_predictions.npy




### N-gram Model

Now it's time to implement an n-gram language model.

Because not every n-gram will have been observed in training, use add-alpha smoothing to make sure no output word has probability 0.

$$P(w_2|w_1)=\frac{C(w_1,w_2)+\alpha}{C(w_1)+N\alpha}$$

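where $N$ is the vocab size, $C(w_1,w_2)$ is the count of the given bigram, and $C(w_1)$ is the count of its context word $w_1$.  For larger $n$, the context $w_1$ becomes the preceding $n-1$ words.  An alpha value around `3e-3` should work.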
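
A small worked example of the formula on a toy corpus may help. This is only a sketch: the corpus, the value `alpha = 0.5`, and the helper `smoothed` are made up for illustration, and context counts are taken over positions that start a bigram so that each row sums exactly to 1:

```python
from collections import Counter

toy_text = ['the', 'cat', 'sat', 'the', 'cat', 'ran']
toy_vocab = sorted(set(toy_text))   # ['cat', 'ran', 'sat', 'the'], so N = 4
alpha = 0.5

bigram_counts = Counter(zip(toy_text, toy_text[1:]))
context_counts = Counter(toy_text[:-1])  # counts of w_1 as the first word of a bigram

def smoothed(w1, w2):
    """Add-alpha estimate of P(w2 | w1)."""
    numerator = bigram_counts[(w1, w2)] + alpha
    denominator = context_counts[w1] + len(toy_vocab) * alpha
    return numerator / denominator

p_seen = smoothed('the', 'cat')    # (2 + 0.5) / (2 + 4 * 0.5) = 0.625
p_unseen = smoothed('the', 'sat')  # (0 + 0.5) / (2 + 4 * 0.5) = 0.125
row_total = sum(smoothed('the', w) for w in toy_vocab)
```

The unseen bigram still receives nonzero probability, and the probabilities for a fixed context sum to 1, which is the property `check_validity` tests.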

One edge case you will need to handle is at the beginning of the text where you don't have `n-1` prior words.  You can handle this however you like as long as you produce a valid probability distribution, but just using a uniform distribution over the vocabulary is reasonable for the purposes of this project.

A properly implemented bigram model should get a perplexity below 550 on the validation set.

**Note**: Do not change the signature of the `next_word_probabilities` and `perplexity` functions.  We will use these as a common interface for all of the different model types.  Make sure these two functions call `n_gram_probability`, because later we are going to override `n_gram_probability` in a subclass.
Also, we suggest pre-computing and caching the counts $C$ when you initialize `NGramModel` for efficiency.

**Deliverable**: Submit the bigram distribution from the n-gram model.

In [9]:
class NGramModel:
    def __init__(self, train_text, n=2, alpha=3e-3):
        # get counts and perform any other setup
        self.n = n
        self.smoothing = alpha
        self.vocab_size = len(vocab)

        # n-gram counts dictionary
        self.ngram_counts = Counter()

        # (n-1)-gram counts dictionary (context counts)
        self.context_counts = Counter()

        # iterate through training text to collect counts
        for i in range(len(train_text)):
            # get the n-gram ending at position i
            if i >= n - 1:
                ngram = tuple(train_text[i - n + 1:i + 1])
                context = tuple(train_text[i - n + 1:i])
            else:
                # if there's not enough history, pad with <eos>
                needed = n - i - 1
                ngram = tuple(['<eos>'] * needed + train_text[:i + 1])
                context = tuple(['<eos>'] * needed + train_text[:i])

            self.ngram_counts[ngram] += 1
            # always count the context; for n == 1 the context is the empty
            # tuple, so its count is simply the total number of tokens
            self.context_counts[context] += 1

    def n_gram_probability(self, n_gram):
        """Return the probability of the last word in an n-gram.

        n_gram -- a list of string tokens
        returns the conditional probability of the last token given the rest.
        """
        assert len(n_gram) == self.n

        # add-alpha smoothing formula: P(w_n | w_1...w_{n-1}) = (C(w_1...w_n) + alpha) / (C(w_1...w_{n-1}) + N*alpha)
        ngram_tuple = tuple(n_gram)
        context_tuple = tuple(n_gram[:-1])

        numerator = self.ngram_counts[ngram_tuple] + self.smoothing

        # the context count is cached in __init__ (for n == 1 it is the total
        # token count), so no per-call summation over all counts is needed
        denominator = self.context_counts[context_tuple] + self.vocab_size * self.smoothing

        return numerator / denominator


    def next_word_probabilities(self, text_prefix):
        """Return a list of probabilities for each word in the vocabulary."""

        # get the context (last n-1 words from prefix)
        if self.n == 1:
            # for unigram, context is empty
            context = []
        elif len(text_prefix) >= self.n - 1:
            context = text_prefix[-(self.n - 1):]
        else:
            # pad with <eos> in case the prefix is too short
            needed = (self.n - 1) - len(text_prefix)
            context = ['<eos>'] * needed + text_prefix

        # compute probability for each word in vocab
        probabilities = []
        for word in vocab.itos:
            ngram = context + [word]
            prob = self.n_gram_probability(ngram)
            probabilities.append(prob)

        return probabilities


    def perplexity(self, full_text):
        """ full_text is a list of string tokens
        return perplexity as a float """

        log_probabilities = []

        # iterate through each position in full_text
        for i in range(len(full_text)):
            # construct the n-gram ending at position i
            if i >= self.n - 1:
                # we have enough history
                ngram = full_text[i - self.n + 1:i + 1]
            else:
                # pad with <eos> for the beginning of text
                needed = self.n - i - 1
                ngram = ['<eos>'] * needed + full_text[:i + 1]

            # get probability and compute log
            prob = self.n_gram_probability(ngram)
            log_probabilities.append(math.log(prob, 2))

        # compute perplexity: 2^(-average log probability)
        return 2 ** -np.mean(log_probabilities)


unigram_model = NGramModel(train_text, 1)
check_validity(unigram_model)
print('unigram validation perplexity:', unigram_model.perplexity(validation_text)) # this should be almost the same as our unigram model perplexity above

bigram_model = NGramModel(train_text, n=2)
check_validity(bigram_model)
print('bigram validation perplexity:', bigram_model.perplexity(validation_text))

trigram_model = NGramModel(train_text, n=3)
check_validity(trigram_model)
print('trigram validation perplexity:', trigram_model.perplexity(validation_text)) # this won't do very well...

save_truncated_distribution(bigram_model, 'bigram_predictions.npy') # this might take a few minutes


unigram validation perplexity: 996.5087415741611
bigram validation perplexity: 542.5386602962352
trigram validation perplexity: 3289.7930661935084


                                                  

saved bigram_predictions.npy




Please download `bigram_predictions.npy` once you finish this section so that you can submit it.

We can also generate samples from the model to get an idea of how it is doing.

In [10]:
print(generate_text(bigram_model))

<eos> <eos> overhead 996 adultery moon exile = = Thunderbirds strove JGT Serious soft sand summarized Hunters synthpop attendance chutes Oriente deceptive


We now free up some RAM.  **It is important to run the cell below; otherwise you may run out of RAM in the runtime.**

In [11]:
# Free up some RAM.
del bigram_model
del trigram_model

### Neural N-gram Model

In this section, you will implement a neural version of an n-gram model.  The model will use a simple feedforward neural network that takes the previous `n-1` words and outputs a distribution over the next word.

You will use PyTorch to implement the model.  We've provided a little bit of code to help with the data loading using PyTorch's data loaders (https://pytorch.org/docs/stable/data.html).

A model with the following architecture and hyperparameters should reach a validation perplexity below 230.
* embed the words with dimension 128, then flatten into a single embedding for $n-1$ words (with size $(n-1)*128$)
* run 2 hidden layers with 1024 hidden units, then project down to size 128 before the final layer (i.e., 4 layers total)
* use weight tying for the embedding and final linear layer (this made a very large difference in our experiments); you can do this by creating the output layer with `nn.Linear`, then using `F.embedding` with the linear layer's `.weight` to embed the input
* rectified linear activation (ReLU) and dropout 0.1 after each of the first 2 hidden layers. **Note: You will likely find a performance drop if you add a nonlinear activation function after the dimension reduction layer.**
* train for 10 epochs with the Adam optimizer (should take around 15-20 minutes)
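
The weight-tying bullet above can be checked on a tiny example before building the full network. The sketch below uses made-up sizes (`toy_vocab_size = 10`, `embed_dim = 4`) and runs on CPU; it only demonstrates that one `nn.Linear` weight matrix can serve as both the embedding table and the output projection:

```python
import torch
from torch import nn
import torch.nn.functional as F

toy_vocab_size, embed_dim = 10, 4           # hypothetical small sizes
output_layer = nn.Linear(embed_dim, toy_vocab_size)

# nn.Linear stores its weight as (out_features, in_features) = (vocab, embed_dim),
# which is the same layout an embedding table needs.
token_ids = torch.tensor([[1, 7], [3, 3]])  # (batch=2, n-1=2)
embedded = F.embedding(token_ids, output_layer.weight)  # (2, 2, 4)
flat = embedded.view(token_ids.shape[0], -1)             # (2, 8)

# The same matrix maps a vector of size embed_dim back to vocabulary logits.
logits = output_layer(torch.zeros(2, embed_dim))         # (2, 10)
```

Because both uses share one parameter tensor, gradients from the input embedding and the output projection accumulate into the same weights.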


We encourage you to try other architectures and hyperparameters, and you will likely find some that work better than the ones listed above.  A proper implementation with these should be enough to receive full credit on the assignment, though.

In [12]:
def ids(tokens):
    return [vocab.stoi[t] for t in tokens]

assert torch.cuda.is_available(), "no GPU found; in Colab go to 'Edit->Notebook settings' and choose a GPU hardware accelerator; \n in Kaggle go to 'Settings->Accelerator' and choose a GPU hardware accelerator"

class NeuralNgramDataset(torch.utils.data.Dataset):
    def __init__(self, text_token_ids, n):
        self.text_token_ids = text_token_ids
        self.n = n

    def __len__(self):
        return len(self.text_token_ids)

    def __getitem__(self, i):
        if i < self.n-1:
            prev_token_ids = [vocab.stoi['<eos>']] * (self.n-i-1) + self.text_token_ids[:i]
        else:
            prev_token_ids = self.text_token_ids[i-self.n+1:i]

        assert len(prev_token_ids) == self.n-1

        x = torch.tensor(prev_token_ids)
        y = torch.tensor(self.text_token_ids[i])
        return x, y

class NeuralNGramNetwork(nn.Module):
    # a PyTorch Module that holds the neural network for your model

    def __init__(self, n):
        super().__init__()
        self.n = n

        # architecture parameters
        embed_dim = 128
        hidden_dim = 1024
        reduction_dim = 128
        vocab_size = len(vocab)

        # output layer (for weight tying with embeddings)
        self.output_layer = nn.Linear(reduction_dim, vocab_size)

        # network layers
        # input: (n-1) * embed_dim after flattening embeddings
        self.fc1 = nn.Linear((n-1) * embed_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, hidden_dim)
        self.fc3 = nn.Linear(hidden_dim, reduction_dim)

        # dropout
        self.dropout = nn.Dropout(0.1)


    def forward(self, x):
        # x is a tensor of inputs with shape (batch, n-1)
        # this function returns a tensor of log probabilities with shape (batch, vocab_size)

        # embed input tokens using weight tying (use output layer weights as embeddings)
        # F.embedding expects weight matrix of shape (vocab_size, embed_dim)
        # output_layer.weight has shape (vocab_size, reduction_dim=128) from nn.Linear
        embedded = F.embedding(x, self.output_layer.weight)  # (batch, n-1, 128)

        # flatten embeddings
        embedded_flat = embedded.view(x.shape[0], -1)  # (batch, (n-1)*128)

        # pass through network
        h1 = F.relu(self.fc1(embedded_flat))  # (batch, 1024)
        h1 = self.dropout(h1)

        h2 = F.relu(self.fc2(h1))  # (batch, 1024)
        h2 = self.dropout(h2)

        h3 = self.fc3(h2)  # (batch, 128) - no activation after reduction

        # output layer
        logits = self.output_layer(h3)  # (batch, vocab_size)

        # return log probabilities
        return F.log_softmax(logits, dim=1)


class NeuralNGramModel:
    # a class that wraps NeuralNGramNetwork to handle training and evaluation
    # it's ok if this doesn't work for unigram modeling
    def __init__(self, n):
        self.n = n
        self.network = NeuralNGramNetwork(n).cuda()

    def train(self):
        dataset = NeuralNgramDataset(ids(train_text), self.n)
        train_loader = torch.utils.data.DataLoader(dataset, batch_size=128, shuffle=True)
        # iterating over train_loader with a for loop will return a 2-tuple of batched tensors
        # the first tensor will be previous token ids with size (batch, n-1),
        # and the second will be the current token id with size (batch, )
        # you will need to move these tensors to GPU, e.g. by using the Tensor.cuda() function.

        # this will take some time to run; use tqdm.tqdm_notebook to get a progress bar
        # (see Project 1a for example)

        # setup optimizer and loss
        optimizer = torch.optim.Adam(self.network.parameters())
        criterion = nn.NLLLoss()  # negative log likelihood loss (works with log_softmax)

        num_epochs = 10

        # training loop
        for epoch in range(num_epochs):
            self.network.train()
            total_loss = 0

            for x, y in tqdm.tqdm(train_loader, desc=f'Epoch {epoch+1}/{num_epochs}'):
                x = x.cuda()
                y = y.cuda()

                # forward pass
                log_probs = self.network(x)
                loss = criterion(log_probs, y)

                # backward pass
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

                total_loss += loss.item()

            avg_loss = total_loss / len(train_loader)
            print(f'Epoch {epoch+1}, Average Loss: {avg_loss:.4f}')


    def next_word_probabilities(self, text_prefix):
        self.network.eval()
        # convert text_prefix to token ids
        prefix_ids = ids(text_prefix)

        # get context (last n-1 tokens), padding with <eos> if needed
        if len(prefix_ids) >= self.n - 1:
            context_ids = prefix_ids[-(self.n - 1):]
        else:
            # pad with <eos> tokens
            needed = (self.n - 1) - len(prefix_ids)
            context_ids = [vocab.stoi['<eos>']] * needed + prefix_ids
        x = torch.tensor([context_ids]).cuda()  # (1, n-1)

        # get log probabilities
        with torch.no_grad():
            log_probs = self.network(x)  # (1, vocab_size)

        # convert to probabilities
        probs = torch.exp(log_probs).squeeze(0).cpu().tolist()  # (vocab_size,)
        return probs


    def perplexity(self, text):
        self.network.eval()

        # create dataset and dataloader
        dataset = NeuralNgramDataset(ids(text), self.n)
        loader = torch.utils.data.DataLoader(dataset, batch_size=128, shuffle=False)

        log_probabilities = []

        with torch.no_grad():
            for x, y in loader:
                x = x.cuda()
                y = y.cuda()

                # get log probabilities
                log_probs = self.network(x)  # (batch, vocab_size)

                # get log probability of correct token for each example
                # gather selects the log_prob at index y for each batch element
                selected_log_probs = log_probs.gather(1, y.unsqueeze(1)).squeeze(1)
                # convert to base 2 (currently in natural log)
                log_probs_base2 = selected_log_probs / math.log(2)

                log_probabilities.extend(log_probs_base2.cpu().tolist())

        # compute perplexity: 2^(-average log probability)
        return 2 ** -np.mean(log_probabilities)


neural_trigram_model = NeuralNGramModel(3)
check_validity(neural_trigram_model)
neural_trigram_model.train()
print('neural trigram validation perplexity:', neural_trigram_model.perplexity(validation_text))

Epoch 1, Average Loss: 5.9472
Epoch 2, Average Loss: 5.4397
Epoch 3, Average Loss: 5.2611
Epoch 4, Average Loss: 5.1494
Epoch 5, Average Loss: 5.0712
Epoch 6, Average Loss: 5.0118
Epoch 7, Average Loss: 4.9656
Epoch 8, Average Loss: 4.9272
Epoch 9, Average Loss: 4.8940
Epoch 10, Average Loss: 4.8694
neural trigram validation perplexity: 246.6612335981799
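
The average loss printed during training is a mean negative log-likelihood in natural-log units (`nn.NLLLoss` applied to `log_softmax` output), so it can be converted to an approximate training perplexity for comparison with the validation number. A minimal sketch using the final logged value; treat the result as a rough figure, since the loss is averaged per batch and measured with dropout active:

```python
import math

final_avg_loss = 4.8694  # "Epoch 10, Average Loss" from the log above, in nats

train_ppl = math.exp(final_avg_loss)           # perplexity from a natural-log loss
bits_per_token = final_avg_loss / math.log(2)  # the same loss expressed in base 2
train_ppl_base2 = 2 ** bits_per_token          # matches the 2 ** -mean(log2 p) form
print(round(train_ppl, 1))
```

This gives a training perplexity of roughly 130, well below the validation perplexity, which suggests the gap is a generalization issue and not a failure to fit the training data.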


In [13]:

save_truncated_distribution(neural_trigram_model, 'neural_trigram_predictions.npy', short=False)



saved neural_trigram_predictions.npy


Free up RAM.

In [None]:
# Delete model we don't need.
del neural_trigram_model

### Submission

Upload a submission with the following files to Gradescope:
* Part1.ipynb (rename to match this exactly)
* neural_trigram_predictions.npy
* bigram_predictions.npy

You can upload files individually or as part of a zip file, but if using a zip file be sure you are zipping the files directly and not a folder that contains them.

Be sure to check the output of the autograder after it runs.  It should confirm that no files are missing and that the output files have the correct format.  Note that the test set perplexities shown by the autograder are on a completely different scale from your validation set perplexities due to truncating the distribution and selecting different text.  Don't worry if the values seem much worse.