# Lab 1: Language modelling

In this lab you will implement and train two neural language models: the fixed-window model and the recurrent neural network model. You will evaluate these models by computing their perplexity on a benchmark dataset.

In [None]:
import torch

For this lab, you should use a GPU if you have one.

In [None]:
device = torch.device('cpu')
# device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')    # NVIDIA
# device = torch.device('mps')    # Apple Silicon

## Data

The data for this lab is [WikiText](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/), a collection of more than 100 million tokens extracted from the “Good” and “Featured” articles on Wikipedia. We will use the small version of the dataset, which contains slightly more than 2.5 million tokens.

The next cell contains code for an object that will act as a container for the “training” and the “validation” section of the data. We fill this container by reading the corresponding text files. The only processing we do is to split at whitespace and replace each newline with an end-of-sentence token. Importantly, we also build the vocabulary (`self.vocab`) that maps each word to an integer id.

In [None]:
class WikiText(object):
    
    def __init__(self):
        self.vocab = {}
        self.train = self.read_data('wiki.train.tokens')
        self.valid = self.read_data('wiki.valid.tokens')
    
    def read_data(self, path):
        ids = []
        with open(path, encoding='utf-8') as source:
            for line in source:
                line = line.rstrip()
                if line:
                    for token in line.split() + ['<eos>']:
                        if token not in self.vocab:
                            self.vocab[token] = len(self.vocab)
                        ids.append(self.vocab[token])
        return ids
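
To make the id assignment concrete, here is a minimal sketch that applies the same logic as `read_data` to a few in-memory lines instead of a file. The helper `build_ids` and the toy sentences are hypothetical and only serve as an illustration.

```python
# Toy illustration of the id assignment in WikiText.read_data (no files needed).
# `build_ids` is a hypothetical helper, not part of the lab code.
def build_ids(lines, vocab):
    """Map whitespace-separated tokens to integer ids, adding '<eos>' per line."""
    ids = []
    for line in lines:
        line = line.rstrip()
        if line:  # blank lines are skipped, as in read_data
            for token in line.split() + ['<eos>']:
                if token not in vocab:
                    vocab[token] = len(vocab)  # next unused id
                ids.append(vocab[token])
    return ids

toy_vocab = {}
toy_ids = build_ids(['the cat sat\n', '\n', 'the dog sat\n'], toy_vocab)
print(toy_ids)    # [0, 1, 2, 3, 0, 4, 2, 3]
print(toy_vocab)  # {'the': 0, 'cat': 1, 'sat': 2, '<eos>': 3, 'dog': 4}
```

Because the training and validation files are read into the same dictionary, a word receives the same id in both sections, and words that occur only in the validation data get ids after all training words.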

The cell below loads the data and prints the total number of tokens and the size of the vocabulary.

In [None]:
wikitext = WikiText()

print('Tokens in train:', len(wikitext.train))
print('Tokens in valid:', len(wikitext.valid))
print('Vocabulary size:', len(wikitext.vocab))

## Problem 1: Fixed-window model

In this section, you will implement and train the fixed-window neural language model proposed by [Bengio et al. (2003)](http://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf) and presented in the lectures. Recall that an input to this model takes the form of a vector of $n-1$ integer ids representing preceding words. We will refer to this vector as the *context window* and to its length as the *window size*. Each word id is mapped to a vector via an embedding layer. (All positions share the same embedding.) The embedding vectors are then concatenated and sent through a two-layer feed-forward network with a non-linearity in the form of a rectified linear unit (ReLU). The output of that network can be interpreted as a categorical probability distribution over all possible words.

### Problem 1.1: Vectorise the data

Your first task is to write code for transforming the data in the WikiText container into a vectorised form that can be fed to the fixed-window model. Concretely, you will implement this as a PyTorch [Dataset](https://pytorch.org/docs/stable/data.html#torch.utils.data.Dataset). Read the documentation of that class and complete the skeleton code in the cell below:

In [None]:
from torch.utils.data import Dataset

class FixedWindowDataset(Dataset):
    def __init__(self, word_ids, window_size):
        self.token_ids = word_ids
        self.window_size = window_size

    def __len__(self):
        # TODO: Replace the following line with your own code
        raise NotImplementedError

    def __getitem__(self, idx):
        # TODO: Replace the following line with your own code
        raise NotImplementedError

Your code should implement the following specification:

**__init__** (*self*, *word_ids*, *window_size*)

> Creates a new dataset, wrapping the underlying WikiText data *word_ids* (a list of word ids). The parameter *window_size* specifies the size of the context window.

**__len__** (*self*)

> Returns the number of samples in this dataset.

**__getitem__** (*self*, *idx*)

> Fetches a data sample for the given index (*idx*). A sample is a pair $(\mathbf{x}, y)$ representing a context window and the next word after that window, where $\mathbf{x}$ is a vector of length *window_size* and $y$ is a scalar.
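
The following sketch illustrates the intended pairing of context windows and next words on a toy list of ids. It uses plain Python lists rather than tensors, and the function name `context_windows` is hypothetical; treat it as an illustration of the specification, not as the reference solution.

```python
# Illustration only: enumerate (context window, next word) pairs with plain lists.
# `context_windows` is a hypothetical helper name.
def context_windows(token_ids, window_size):
    """Return all (context, next_id) pairs for a list of token ids."""
    pairs = []
    # The last window must leave room for one following token.
    for i in range(len(token_ids) - window_size):
        context = token_ids[i:i + window_size]
        next_id = token_ids[i + window_size]
        pairs.append((context, next_id))
    return pairs

toy_ids = [10, 11, 12, 13, 14]
pairs = context_windows(toy_ids, 2)
print(len(pairs))  # 3
print(pairs[0])    # ([10, 11], 12)
print(pairs[-1])   # ([12, 13], 14)
```

In the actual dataset class, a sample would be computed on demand from *idx* and returned as tensors rather than being stored in a list up front.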

#### 🤞 Test your code

Test your implementation by running the code in the next cell.

In [None]:
def test_11():
    # Create the model-specific dataset
    dataset = FixedWindowDataset(wikitext.valid, 2)

    # Print the number of samples
    print(len(dataset))

    # Fetch a sample and print it
    print(dataset[42])

test_11()

# Expected output:
# 216345
# (tensor([22, 17]), tensor(1204))

### Problem 1.2: Implement the model

Your next task is to implement the fixed-window model based on the graphical specification given in the lecture.

In [None]:
import torch.nn as nn

class FixedWindowModel(nn.Module):

    def __init__(self, window_size, n_words, embedding_dim=50, hidden_dim=50):
        super().__init__()
        # TODO: Add your own code

    def forward(self, x):
        # TODO: Replace the next line with your own code
        raise NotImplementedError

Here is the specification of the two methods:

**__init__** (*self*, *window_size*, *n_words*, *embedding_dim*=50, *hidden_dim*=50)

> Creates a new fixed-window neural language model. The argument *window_size* specifies the length of the context window. The argument *n_words* is the number of words in the vocabulary. The arguments *embedding_dim* and *hidden_dim* specify the output dimensionalities of the embedding layer and the hidden layer of the feed-forward network, respectively; their default value is 50.

**forward** (*self*, *x*)

> Computes the network output on an input batch *x*. The shape of *x* is $(B, s)$, where $B$ is the batch size and $s$ is the window size. The output of the forward pass is a tensor of shape $(B, V)$ where $V$ is the number of words in the vocabulary.
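
The shapes involved in the embedding and concatenation step can be checked in isolation. The sketch below uses toy sizes; the symbol `E` for the embedding dimension and the use of `view()` for the concatenation are assumptions made for illustration, and other approaches (such as `flatten` or `reshape`) would work as well.

```python
import torch
import torch.nn as nn

# Toy sizes (assumptions): batch size, window size, vocabulary size, embedding dim
B, s, V, E = 4, 2, 100, 8

embedding = nn.Embedding(V, E)    # one embedding table shared by all positions
x = torch.randint(0, V, (B, s))   # a batch of context windows, shape (B, s)

embedded = embedding(x)                  # shape (B, s, E)
concatenated = embedded.view(B, s * E)   # shape (B, s * E)

print(embedded.shape)      # torch.Size([4, 2, 8])
print(concatenated.shape)  # torch.Size([4, 16])
```

The concatenated tensor has one row per sample, which is the form a linear layer with `s * E` input features expects.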

#### 🤞 Test your code

The following code instantiates the model and feeds it batches of samples from the training data.

In [None]:
def test_12():
    from torch.utils.data import DataLoader

    # Set the context window size
    window_size = 2

    # Instantiate a small dataset and a data loader
    dataset = FixedWindowDataset(wikitext.train[:30], window_size)
    data_loader = DataLoader(dataset, batch_size=10, shuffle=True)

    # Instantiate the model
    model = FixedWindowModel(window_size, len(wikitext.vocab))

    for batch_x, batch_y in data_loader:
        # Feed the model a batch of samples from the training data
        output = model(batch_x)

        # Print the shape of the model output
        print(output.shape)

test_12()

# Expected output:
# torch.Size([10, 33278])
# torch.Size([10, 33278])
# torch.Size([8, 33278])

#### 🤔 Questions for the oral report

* What do the numbers 30 and 10 refer to in the test code?
* How do these numbers affect the output shapes?
* What parameters in the model affect the output shape?

### Problem 1.3: Train the model

Your final task is to write code to train the fixed-window model using minibatch gradient descent and the cross-entropy loss function. This should be a straightforward generalisation of the training loops you have already seen. Complete the skeleton code in the cell below:

In [None]:
def train_fixed_window(window_size, n_epochs=1, batch_size=512, lr=1e-3):
    # TODO: Replace the following line with your own code
    return None

Here is the specification of the training function:

**train_fixed_window** (*window_size*, *n_epochs* = 1, *batch_size* = 512, *lr* = 0.001)

> Trains a fixed-window neural language model with context window size *window_size* using minibatch gradient descent and returns the trained model. The parameters *n_epochs* and *batch_size* specify the number of training epochs and the minibatch size, respectively. Training uses the cross-entropy loss function and the [Adam optimizer](https://pytorch.org/docs/stable/optim.html#torch.optim.Adam) with learning rate *lr*. After each epoch, prints the perplexity of the model on the validation data.

The code in the cell below trains a model with window size 2:

In [None]:
model_fixed_window = train_fixed_window(2)

#### Hints

* Computing the validation perplexity in one go (for the full validation set) will most probably exhaust your computer’s memory and/or take a lot of time. If you run into this problem, do the computation at the minibatch level and aggregate the results.
* Training and even evaluation will take some time; when using a CPU, you should expect several minutes per epoch, depending on the hardware. Our reference implementation uses a GPU and runs in less than 30 seconds per epoch on [Colab](http://colab.research.google.com).
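
One way to aggregate at the minibatch level is sketched below in plain Python. It assumes that each minibatch yields a mean cross-entropy (in nats, as PyTorch's cross-entropy loss returns with its default `reduction='mean'`) together with the number of samples in that batch. The function name `perplexity_from_batches` is hypothetical.

```python
import math

# `perplexity_from_batches` is a hypothetical helper for illustration.
def perplexity_from_batches(batch_losses):
    """batch_losses: iterable of (mean_loss, n_samples) pairs, one per minibatch."""
    total_loss = 0.0
    total_samples = 0
    for mean_loss, n_samples in batch_losses:
        total_loss += mean_loss * n_samples   # undo the per-batch averaging
        total_samples += n_samples
    # Perplexity is the exponential of the average cross-entropy per sample.
    return math.exp(total_loss / total_samples)

# Sanity check: a model that assigns uniform probability to V words has
# cross-entropy log(V) on every sample, hence perplexity V.
V = 1000
batches = [(math.log(V), 512), (math.log(V), 512), (math.log(V), 37)]
ppl = perplexity_from_batches(batches)
print(round(ppl))  # 1000
```

Weighting each batch by its size matters because the last minibatch is usually smaller than the others; a plain average of the per-batch means would give it too much influence.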

#### 🤞 Test your code

**Your submitted notebook must contain output demonstrating a validation perplexity of at most 400 after the first epoch.** You should not change the parameters of the model or the training to meet this target.

To see whether your network is learning something, print or plot the running loss on the training data. If this value does not decrease during training, try to find the problem before wasting time (and electricity) on useless computation.

## Problem 2: Recurrent neural network model

In this section, you will implement the recurrent neural network language model. Recall that an input to this model is a vector of word ids. Each id is mapped to an embedding vector, and the sequence of these vectors is then fed into an unrolled LSTM. At each position $i$ in the sequence, the hidden state of the LSTM at that position is sent through a linear transformation whose output can be interpreted as a categorical probability distribution over the words at position $i+1$. In theory, the input vector could represent the complete training data; for practical reasons, however, we will truncate the input to some fixed length.

### Problem 2.1: Vectorise the data

As in the previous problem, your first task is to transform the data in the WikiText container into a vectorised form that can be fed to the model. The *input sequences* in this model-specific dataset are obtained by partitioning the data into non-overlapping segments. The *output sequences* correspond to the input sequences shifted one position to the right, so that corresponding elements in the two sequences represent words and next words.

In [None]:
from torch.utils.data import Dataset

class RNNDataset(Dataset):
    def __init__(self, word_ids, seq_len):
        self.token_ids = word_ids
        self.seq_len = seq_len

    def __len__(self):
        # TODO: Replace the following line with your own code
        raise NotImplementedError

    def __getitem__(self, idx):
        # TODO: Replace the following line with your own code
        raise NotImplementedError

Your code should implement the following specification:

**__init__** (*self*, *word_ids*, *seq_len*)

> Creates a new dataset, wrapping the underlying WikiText data *word_ids* (a list of word ids). The parameter *seq_len* specifies the length of an input sequence.

**__len__** (*self*)

> Returns the number of samples in this dataset.

**__getitem__** (*self*, *idx*)

> Fetches a data sample for the given index (*idx*). A sample is a pair $(\mathbf{x}, \mathbf{y})$ representing contiguous subsequences of the underlying data. Compared to the input sequence, the output sequence is shifted one position to the right. More precisely, if $\mathbf{x}$ is the sequence that starts at token position $k$, then $\mathbf{y}$ is the sequence that starts at position $k+1$.
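
The relation between input and output sequences can be illustrated on a toy list of ids. The sketch below uses plain Python lists, and the function name `shifted_segments` is hypothetical. It also assumes that a trailing segment is dropped when there are not enough tokens left to form a full input and output sequence.

```python
# Illustration only: non-overlapping input segments and their shifted outputs.
# `shifted_segments` is a hypothetical helper name.
def shifted_segments(token_ids, seq_len):
    """Return (input, output) pairs over non-overlapping segments."""
    # The output of the last segment needs one extra token after the input.
    n_samples = (len(token_ids) - 1) // seq_len
    pairs = []
    for idx in range(n_samples):
        k = idx * seq_len                     # start position of the input
        x = token_ids[k:k + seq_len]
        y = token_ids[k + 1:k + seq_len + 1]  # same length, shifted by one
        pairs.append((x, y))
    return pairs

toy_ids = list(range(11))            # 0, 1, ..., 10
pairs = shifted_segments(toy_ids, 5)
print(len(pairs))   # 2
print(pairs[0])     # ([0, 1, 2, 3, 4], [1, 2, 3, 4, 5])
print(pairs[1])     # ([5, 6, 7, 8, 9], [6, 7, 8, 9, 10])
```

Note that the input segments do not overlap with each other, while each output sequence overlaps with its own input in all but one position.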

#### 🤞 Test your code

Test your implementation by running the following code:

In [None]:
def test_21():
    # Create the model-specific dataset
    dataset = RNNDataset(wikitext.valid, 5)

    # Print the number of samples
    print(len(dataset))

    # Fetch a sample and print it
    print(dataset[42])

test_21()

# Expected output:
# 43269
# (tensor([1179, 2376, 1839, 1450, 1179]), tensor([2376, 1839, 1450, 1179, 1450]))

### Problem 2.2: Implement the model

Your next task is to implement the recurrent neural network model based on the graphical specification.

In [None]:
import torch.nn as nn

class RNNModel(nn.Module):
    
    def __init__(self, n_words, embedding_dim=50, hidden_dim=50):
        super().__init__()
        # TODO: Add your own code

    def forward(self, x):
        # TODO: Replace the next line with your own code
        raise NotImplementedError

Your implementation should follow this specification:

**__init__** (*self*, *n_words*, *embedding_dim* = 50, *hidden_dim* = 50)

> Creates a new recurrent neural network language model based on an LSTM. The argument *n_words* is the number of words in the vocabulary. The arguments *embedding_dim* and *hidden_dim* specify the dimensionalities of the embedding layer and the LSTM hidden layer, respectively; their default value is 50.

**forward** (*self*, *x*)

> Computes the network output on an input batch *x*. The shape of *x* is $(B, H)$, where $B$ is the batch size and $H$ is the length of each input sequence. The shape of the output tensor is $(B, H, V)$, where $V$ is the size of the vocabulary.
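
When working with batches of shape $(B, H)$, it helps to know how `nn.LSTM` treats its input dimensions. The sketch below isolates the LSTM step with toy sizes; the symbols `E` and `D` for the embedding and hidden dimensions are assumptions, and so is the choice of `batch_first=True`, which is one way (not the only one) to keep the batch dimension first.

```python
import torch
import torch.nn as nn

# Toy sizes (assumptions): batch size, sequence length, embedding dim, hidden dim
B, H, E, D = 3, 5, 8, 16

# By default, nn.LSTM expects input of shape (H, B, E);
# batch_first=True switches this to (B, H, E).
lstm = nn.LSTM(E, D, batch_first=True)

embedded = torch.randn(B, H, E)    # stand-in for the output of an embedding layer

# nn.LSTM returns the hidden states for all positions plus the final (h, c) states.
hidden_states, (h_n, c_n) = lstm(embedded)

print(hidden_states.shape)  # torch.Size([3, 5, 16])
print(h_n.shape)            # torch.Size([1, 3, 16])
```

The first return value contains one hidden state per position, which is what the description above sends through the final linear transformation.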

#### 🤞 Test your code

Test your code by instantiating the model and feeding it a batch of examples from the training data.

In [None]:
def test_22():
    from torch.utils.data import DataLoader

    # Set the sequence length
    seq_len = 5

    # Instantiate a small dataset and a data loader
    dataset = RNNDataset(wikitext.train[:51], seq_len)
    data_loader = DataLoader(dataset, batch_size=3, shuffle=True)

    # Instantiate the model
    model = RNNModel(len(wikitext.vocab))

    for batch_x, batch_y in data_loader:
        # Feed the model a batch of samples from the training data
        output = model(batch_x)

        # Print the shape of the model output
        print(output.shape)

test_22()

# Expected output:
# torch.Size([3, 5, 33278])
# torch.Size([3, 5, 33278])
# torch.Size([3, 5, 33278])
# torch.Size([1, 5, 33278])

#### 🤔 Questions for the oral report

* What do the numbers 5, 51 and 3 refer to in the test code?
* How do these numbers affect the output shapes?
* What is the total number of samples in the small subset of the data used in the test code?

### Problem 2.3: Train the model

The training loop for the recurrent neural network model is essentially identical to the loop that you wrote for the feed-forward model. The only thing to note is that the cross-entropy loss function expects its input to be a two-dimensional tensor; you will therefore have to re-shape the output tensor from the LSTM as well as the gold-standard output tensor in a suitable way. The most efficient way to do so is to use the [`view()`](https://pytorch.org/docs/stable/tensors.html#torch.Tensor.view) method.
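
The re-shaping step can be tried out on random tensors before wiring it into the training loop. The sketch below uses toy sizes and random stand-ins for the model output and the gold-standard ids; the variable names are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

# Toy sizes (assumptions): batch size, sequence length, vocabulary size
B, H, V = 3, 5, 100

output = torch.randn(B, H, V)          # stand-in for the model output
target = torch.randint(0, V, (B, H))   # stand-in for the gold-standard ids

# Merge the batch and sequence dimensions so that every position
# becomes one row of scores with one gold-standard id.
flat_output = output.view(B * H, V)    # shape (B * H, V)
flat_target = target.view(B * H)       # shape (B * H,)

loss = F.cross_entropy(flat_output, flat_target)
print(flat_output.shape)  # torch.Size([15, 100])
print(flat_target.shape)  # torch.Size([15])
print(loss.shape)         # torch.Size([])
```

`view()` does not copy data but requires a tensor with a compatible memory layout; if it raises an error on a non-contiguous tensor, `reshape()` is a drop-in alternative.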

In [None]:
def train_rnn(seq_len=32, n_epochs=1, batch_size=16, lr=1e-2):
    # TODO: Replace the next line with your own code
    return None

Here is the specification of the training function:

**train_rnn** (*seq_len* = 32, *n_epochs* = 1, *batch_size* = 16, *lr* = 0.01)

> Trains a recurrent neural network language model on the WikiText data using minibatch gradient descent and returns it. The parameter *seq_len* specifies the length of the input and output sequences. The parameters *n_epochs* and *batch_size* specify the number of training epochs and the minibatch size, respectively. Training uses the cross-entropy loss function and the [Adam optimizer](https://pytorch.org/docs/stable/optim.html#torch.optim.Adam) with learning rate *lr*. After each epoch, prints the perplexity of the model on the validation data.

Evaluate your model by running the following code cell:

In [None]:
model_rnn = train_rnn()

#### 🤞 Test your code

**Your submitted notebook must contain output demonstrating a validation perplexity of at most 280 after the first epoch.** You should not have to change the parameters of the model or the training to meet this target.

## Problem 3: Parameter initialisation

The error surfaces explored when training neural networks can be very complex. Because of this, it is important to choose “good” initial values for the parameters. In PyTorch, the weights of the embedding layer are initialised by sampling from the standard normal distribution $\mathcal{N}(0, 1)$. How do different initialisations affect the perplexity of your fixed-window language model?

Run a small experiment where you try a few non-standard normal distributions with different values for the mean and the variance. Document your results in a table and add it to this notebook. Prepare the following questions for the oral report:

* What different parameter combinations did you try? What results did you get?
* How do you interpret your results? Did they match your expectations?
* What did you learn? Why does this learning matter?

Several authors have developed a theory for more principled choices of initialisation strategies based on properties of the specific network. Two standard articles in this area are [Glorot and Bengio (2010)](https://proceedings.mlr.press/v9/glorot10a.html) and [He et al. (2015)](https://doi.org/10.1109/ICCV.2015.123), whose initialisation strategies have been implemented in the [`nn.init`](https://pytorch.org/docs/stable/nn.init.html) module. Have a look at the articles and the PyTorch documentation if you want to learn more.
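
As a starting point for the experiment, the sketch below shows how the weights of an embedding layer can be re-initialised in place with `nn.init.normal_`. The layer sizes and the chosen values for the mean and standard deviation are arbitrary assumptions. Note that the function takes the standard deviation (`std`), not the variance.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # for reproducibility of this illustration

# Toy embedding layer (sizes are assumptions)
embedding = nn.Embedding(1000, 50)
default_std = embedding.weight.std().item()   # close to 1: default is N(0, 1)

# Re-initialise in place from N(0, 0.1^2); `std` is the standard deviation.
nn.init.normal_(embedding.weight, mean=0.0, std=0.1)
new_std = embedding.weight.std().item()       # close to 0.1

print(f'{default_std:.2f}')
print(f'{new_std:.2f}')
```

In a model, the same call would be applied to the embedding layer's `weight` after constructing the model and before training starts.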

Congratulations on finishing Lab 1!