<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Objective" data-toc-modified-id="Objective-1">Objective</a></span></li><li><span><a href="#Data-Models" data-toc-modified-id="Data-Models-2">Data Models</a></span></li><li><span><a href="#Datasets" data-toc-modified-id="Datasets-3">Datasets</a></span></li><li><span><a href="#GRU" data-toc-modified-id="GRU-4">GRU</a></span></li></ul></div>

In [31]:
import os

import torch
import torch.nn as nn
import torch.nn.functional as F

from torch.utils.data import Dataset
from torch.utils.data import DataLoader

------------

# Objective

The goal of this notebook is to train a language model from scratch on `wikitext-2`, which you can find [here](https://www.salesforce.com/products/einstein/ai-research/the-wikitext-dependency-language-modeling-dataset/). Our focus will be on getting the pre-processing and training loops working in the traditional, non-federated setting. In a separate notebook we'll do the same thing for the federated setting, which you can read more about in [this](https://arxiv.org/pdf/1811.03604.pdf) paper (which we'll refer to as the `Google Smart Keyboard Paper`).

This notebook borrows heavily from [this](https://pytorch.org/tutorials/advanced/dynamic_quantization_tutorial.html) outstanding PyTorch tutorial.

--------

# Data Models

Let's create data models for a corpus of text. The `Google Smart Keyboard Paper` uses a vocabulary of size `10,000` and includes tokens for the beginning of sentence, end of sentence, and out-of-vocab words. At inference time, the probabilities for these tokens are ignored.

TODO: 

* Limit vocab size
* Add `<BOS>` tokens
* Sort words by descending frequency to enable training with `nn.AdaptiveLogSoftmaxWithLoss`
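
The TODO items above could be approached roughly as follows. This is a minimal sketch, not part of the notebook's `Dictionary` class: the helper names `build_vocab` and `encode`, the `<UNK>` token name, and the choice to place the special tokens at the front of the vocabulary are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical special tokens; the names and their order are assumptions.
SPECIALS = ["<UNK>", "<BOS>", "<EOS>"]


def build_vocab(lines, max_size=10_000):
    """Return (idx2word, word2idx) capped at max_size entries."""
    counts = Counter(word for line in lines for word in line.split())
    # Reserve room for the special tokens, then keep the most frequent words.
    n_words = max_size - len(SPECIALS)
    # most_common() yields words in descending order of frequency.
    frequent = [word for word, _ in counts.most_common(n_words)]
    idx2word = SPECIALS + frequent
    word2idx = {word: idx for idx, word in enumerate(idx2word)}
    return idx2word, word2idx


def encode(line, word2idx):
    """Encode one line, mapping out-of-vocab words to <UNK>."""
    unk = word2idx["<UNK>"]
    tokens = ["<BOS>"] + line.split() + ["<EOS>"]
    return [word2idx.get(token, unk) for token in tokens]


lines = ["the cat sat", "the cat ran", "the dog sat"]
idx2word, word2idx = build_vocab(lines, max_size=6)
encoded = encode("the bird sat", word2idx)
print(idx2word)
print(encoded)
```

With `max_size=6` only the three most frequent words survive, so `bird` falls back to the `<UNK>` index.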

In [28]:
class Dictionary:
    """Base class for encoding a vocabulary."""
    
    def __init__(self):
        self.word2idx = {}
        self.idx2word = []
        
    def __len__(self):
        return len(self.idx2word)
        
    def add_word(self, word):
        """Add a new word to the dictionary."""
        if word not in self.word2idx:
            self.idx2word.append(word)
            self.word2idx[word] = len(self.idx2word) - 1
                        
    def get_index(self, word):
        """Return the index of a word."""
        return self.word2idx[word]

In [29]:
class Corpus:
    """Base class for encoding a corpus of text."""
    
    def __init__(self, dirpath):
        """Initialise a corpus given a dir with train, valid, and test .txt files."""
        self.dictionary = Dictionary()
        self.train = self.vectorize(os.path.join(dirpath, "train.txt"))
        self.valid = self.vectorize(os.path.join(dirpath, "valid.txt"))
        self.test = self.vectorize(os.path.join(dirpath, "test.txt"))
        
    def vectorize(self, fpath):
        """Return a tensor of indexes encoding the words in a file."""
        idxs = []
        with open(fpath, "r", encoding="utf8") as f:
            for line in f:
                # Split the line into words and mark the end of the sentence.
                words = line.split() + ["<EOS>"]
                for word in words:
                    self.dictionary.add_word(word)
                    idxs.append(self.dictionary.get_index(word))
        return torch.LongTensor(idxs)

-----------

# Datasets

------------

# GRU 

We'll use a single layer `GRU` with somewhere around `600` hidden units. We'll tie the embedding weights to the softmax layer, as described in [this](https://arxiv.org/pdf/1611.01462.pdf) paper.
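
Weight tying only works when the embedding and the output layer have weight matrices of the same shape, which is why the embedding and hidden dimensions need to match. The snippet below is a small sketch of the idea using toy sizes chosen purely for illustration; it is not the notebook's model.

```python
import torch.nn as nn

vocab_size, dim = 100, 16  # toy sizes, assumed for illustration only

encoder = nn.Embedding(vocab_size, dim)  # weight shape: (vocab_size, dim)
decoder = nn.Linear(dim, vocab_size)     # weight shape: (vocab_size, dim)

# Tie the weights: both modules now hold the same Parameter object.
decoder.weight = encoder.weight
tied = decoder.weight is encoder.weight

# Count each distinct parameter once: the shared matrix plus the decoder bias.
unique = {id(p): p for module in (encoder, decoder) for p in module.parameters()}
n_params = sum(p.numel() for p in unique.values())
print(tied, n_params)
```

Because the matrix is shared, the pair of layers has `vocab_size * dim + vocab_size` distinct parameters instead of `2 * vocab_size * dim + vocab_size`.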

TODO:

* Add Layer Normalization

In [30]:
class GRUModel(nn.Module):
    """Container module with an encoder, recurrent module, and a decoder."""
    
    def __init__(self, vocab_size, emb_dim, hid_dim, n_layers=1, dropout=0.5, tie_weights=False):
        super().__init__()
        self.drop = nn.Dropout(dropout)
        self.encoder = nn.Embedding(vocab_size, emb_dim)
        self.gru = nn.GRU(emb_dim, hid_dim, n_layers)
        self.decoder = nn.Linear(hid_dim, vocab_size)
        self.init_weights()
        self.hid_dim = hid_dim
        self.n_layers = n_layers
        
        if tie_weights:
            assert hid_dim == emb_dim, f"{hid_dim= } must match {emb_dim= }!"
            self.decoder.weight = self.encoder.weight
    
    #TODO: why do we initialise the bias with zeros?
    def init_weights(self, k=0.1):
        """Initialise weights from a uniform distribution U(-k, k)."""
        self.encoder.weight.data.uniform_(-k, k)
        self.decoder.weight.data.uniform_(-k, k)
        self.decoder.bias.data.zero_()
        
    def init_hidden(self, batch_sz):
        """Initialise the hidden state with zeros."""
        # A GRU carries a single hidden state, so we return one tensor;
        # the (h, c) tuple is only needed for an LSTM.
        weight = next(self.parameters())
        return weight.new_zeros(self.n_layers, batch_sz, self.hid_dim)
    
    def forward(self, x, hidden):
        emb = self.drop(self.encoder(x))
        output, hidden = self.gru(emb, hidden)
        decoded = self.decoder(self.drop(output))
        return decoded, hidden
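
As a quick sanity check on the shapes involved, `nn.GRU` takes a single hidden-state tensor of shape `(n_layers, batch_sz, hid_dim)`, whereas `nn.LSTM` takes an `(h, c)` tuple. The sketch below uses toy sizes that are assumptions for illustration and runs both modules on random input with the default `(seq_len, batch, features)` layout.

```python
import torch
import torch.nn as nn

# Toy sizes, assumed for illustration only.
seq_len, batch_sz, emb_dim, hid_dim, n_layers = 5, 3, 8, 8, 1

x = torch.randn(seq_len, batch_sz, emb_dim)    # (seq_len, batch, emb_dim)
h0 = torch.zeros(n_layers, batch_sz, hid_dim)  # one tensor is enough for a GRU

gru = nn.GRU(emb_dim, hid_dim, n_layers)
output, hn = gru(x, h0)
out_shape = tuple(output.shape)  # one output vector per time step
hn_shape = tuple(hn.shape)       # final hidden state per layer

# An LSTM, by contrast, needs both a hidden state and a cell state.
lstm = nn.LSTM(emb_dim, hid_dim, n_layers)
_, (h, c) = lstm(x, (h0, torch.zeros_like(h0)))
print(out_shape, hn_shape, tuple(h.shape), tuple(c.shape))
```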

-----------