<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Objective" data-toc-modified-id="Objective-1">Objective</a></span></li><li><span><a href="#Data-Models" data-toc-modified-id="Data-Models-2">Data Models</a></span></li><li><span><a href="#Datasets" data-toc-modified-id="Datasets-3">Datasets</a></span></li><li><span><a href="#GRU" data-toc-modified-id="GRU-4">GRU</a></span></li><li><span><a href="#Training-Functions" data-toc-modified-id="Training-Functions-5">Training Functions</a></span></li><li><span><a href="#Train-Model" data-toc-modified-id="Train-Model-6">Train Model</a></span><ul class="toc-item"><li><span><a href="#Prepare-DataLoaders" data-toc-modified-id="Prepare-DataLoaders-6.1">Prepare DataLoaders</a></span></li><li><span><a href="#Training-Loop" data-toc-modified-id="Training-Loop-6.2">Training Loop</a></span></li></ul></li></ul></div>

In [1]:
import os
import time
from collections import Counter
from pathlib import Path

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

from torch.utils.data import Dataset
from torch.utils.data import DataLoader

In [2]:
# Set seed for reproducibility
torch.manual_seed(9)

<torch._C.Generator at 0x7f73afd6cad0>

------------

# Objective

The goal of this notebook is to train a language model from scratch on `wikitext-2`, which you can find [here](https://www.salesforce.com/products/einstein/ai-research/the-wikitext-dependency-language-modeling-dataset/). Our focus will be on getting the pre-processing and training loops working in the traditional, non-federated setting. In a separate notebook we'll do the same thing for the federated setting, which you can read more about in [this](https://arxiv.org/pdf/1811.03604.pdf) paper (which we'll refer to as the `Google Smart Keyboard Paper`).

This notebook borrows heavily from [this](https://pytorch.org/tutorials/advanced/dynamic_quantization_tutorial.html) pytorch tutorial, which is absolutely outstanding.

--------

# Data Models

Let's create data models for a corpus of text. The `Google Smart Keyboard Paper` uses a vocabulary of size `10,000` and inlcudes tokens for the beggining of sentence, end of setence, and out-of-vocab words. During inference time the probabilities for these tokens are ignored.

TODO: 

* Limit vocab size
* Add `<BOS>` tokens?
* Sort words by descending frequency to enable training with `AdaptiveLogSoftMax`

In [3]:
class DescendingDict:
    """Model a vocab as a mapping word <-> index, ordered by descending frequency."""
    
    def __init__(self, words, vocab_size=None):
        word2freq = Counter(words).most_common(vocab_size)
        self.idx2word = [tup[0] for tup in word2freq]
        self.word2idx = {word: idx for idx, word in enumerate(self.idx2word)}
                
    def __len__(self):
        return len(self.idx2word)
                                
    def get_index(self, word):
        return self.word2idx[word]
    
    def get_indices(self, words):
        return [self.word2idx[word] for word in words]

In [4]:
class WikiCorpus:
    """Encode a corpus of text already processed in the wikitext style."""
    
    def __init__(self, dirpath, vocab_size=None):
        """Build a corpus given a dir with train, valid, and test .txt files."""
        words = self.tokenize(os.path.join(dirpath, "train.txt"))
        self.dictionary = DescendingDict(words, vocab_size)
        self.train = self.vectorize(os.path.join(dirpath, "train.txt"))
        self.valid = self.vectorize(os.path.join(dirpath, "valid.txt"))
        self.test = self.vectorize(os.path.join(dirpath, "test.txt"))
        
    def tokenize(self, fpath):
        """Split on new lines and append <eos> tokens."""
        words = []
        with open(fpath, "r", encoding="utf8") as f:
            for line in f:
                words += line.split() + ["<eos>"]
        return words
        
    def vectorize(self, fpath):
        """Return a tensor of indexes encoding the words in a file."""
        idxs = []
        with open(fpath, "r", encoding="utf8") as f:
            for line in f:
                words = line.split() + ["<eos>"]
                idxs.extend(self.dictionary.get_indices(words))
        return torch.LongTensor(idxs)

--------

# Datasets

In [5]:
class LMDataset(Dataset):
    """Dataset for language models."""
    
    def __init__(self, data, n_partitions, device):
        """Partition text data into sequences and move to device."""
        self.data = self.partition(data, n_partitions, device)
        self.n_partitions = n_partitions
        
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, i):
        """Return the i-th element from all sequences along with their targets."""
        if i == len(self.data) - 1:
            return self.data[i-1], self.data[i]
        return self.data[i], self.data[i + 1]
        
    def partition(self, data, n_partitions, device):
        """Re-shape data to have ``n_partitions`` columns (discards remainder)."""
        n_rows = len(data) // n_partitions
        data = data[:n_rows * n_partitions]
        data = data.view(n_partitions, -1).t().contiguous()
        return data.to(device)

------------

# GRU 

We'll use a single layer `GRU` with somewhere around `600` hidden units. We'll tie the embedding weights to the softmax layer, as described in [this](https://arxiv.org/pdf/1611.01462.pdf) paper.

TODO:

* Add adpative softmax
* Add Layer Normalization
* Variational dropout
* Weight dropping
* Ignore `<unk>`, `<eos>` at inference time
* Use neural cache
* Simplify `reapackage_hidden` procedure

In [6]:
class GRULModel(nn.Module):
    """Language model with an encoder, GRU module, and a decoder."""
    
    def __init__(self, vocab_sz, emb_dim, hid_dim, n_layers=1, dropout=0.5, tie_weights=False):
        super().__init__()
        self.drop = nn.Dropout(dropout)
        self.encoder = nn.Embedding(vocab_sz, emb_dim)
        self.gru = nn.GRU(emb_dim, hid_dim, n_layers)
        self.decoder = nn.Linear(hid_dim, vocab_size)
                
        self.vocab_sz = vocab_sz
        self.hid_dim = hid_dim
        self.n_layers = n_layers
        
        self.init_weights()
        if tie_weights:
            assert hid_dim == emb_dim, f"{hid_dim= } must match {emb_dim= }!"
            self.decoder.weight = self.encoder.weight
    
    #TODO: why do we initialise with zero bias?
    def init_weights(self, k=0.1):
        """Initialise weights from a uniform distribution U(-k, k)."""
        self.encoder.weight.data.uniform_(-k, k)
        self.decoder.weight.data.uniform_(-k, k)
        self.decoder.bias.data.zero_()
        
    def init_hidden(self, n_partitions):
        """Initialise hidden weights."""
        weights = next(self.parameters())
        return weights.new_zeros(self.n_layers, n_partitions, self.hid_dim)
    
    def repackage_hidden(self, hidden):
        """Return hidden states in new tensors detatched from their history."""
        if isinstance(hidden, torch.Tensor):
            return hidden.detach()
        return tuple(repackage_hidden(v) for v in hidden)
    
    def forward(self, x, hidden):
        emb = self.drop(self.encoder(x))
        output, hidden = self.gru(emb, hidden)
        output = self.drop(output)
        decoded = self.decoder(output)
        decoded = decoded.view(-1, self.vocab_sz)
        return decoded, hidden
    
    def predict(self, x, ignore=None):
        """Return the most likely next word, ignoring certain tokens."""
        pass
    
    def save(self, path):
        """Serialise weights to disc."""
        torch.save(self.state_dict(), path)

-----------

# Trainers

TODO:

* Add `critetion` param to `Trainer` class

In [7]:
class Trainer:
    """Abstract base class for training a general neural model."""
    
    def __init__(self, train_dl, val_dl, model, optimizer, scheduler):
        self.train_dl = train_dl
        self.val_dl = val_dl
        self.model = model
        self.optimizer = optimizer
        self.scheduler = scheduler
        
    def _descend(self, loss, max_norm=None):
        """Perform one step of gradient descent."""
        self.optimizer.zero_grad()
        loss.backward()
        if max_norm:
            nn.utils.clip_grad_norm_(self.model.parameters(), max_norm)
        self.optimizer.step()

In [19]:
class LMTrainer(Trainer):
    """Container for training a language model."""
    
    def __init__(self, train_dl, val_dl, model, optimizer, scheduler):
        super().__init__(train_dl, val_dl, model, optimizer, scheduler)
        self.losses, self.timestamps = [], []
                
    def _train(self, hidden, max_norm=None):
        """Train for a single epoch and return the loss."""
        total_loss, N = 0, 0
        for data, targets in self.train_dl:
            hidden = self.model.repackage_hidden(hidden)
            output, hidden = self.model(data, hidden)
            loss = F.cross_entropy(output, targets)
            self._descend(loss, max_norm)
            total_loss += loss.item() * len(data)
            N += len(data)
        return total_loss / N
        
    #TODO: why does the tutorial call init_hidden() inside each epoch?
    def train(self, n_epochs, max_norm=None, save_path="/tmp/lmodel"):
        """Train a language model with optional graident clipping."""
        start = time.time()
        best_val_loss, best_val_ppl = None, None
        hidden = self.model.init_hidden(self.train_dl.dataset.n_partitions)

        for epoch in range(n_epochs): 
            self.model.train()
            train_loss = self._train(hidden, max_norm)  
            scheduler.step()

            elapsed = (time.time() - start) / 60
            self.timestamps.append(elapsed)
            self.losses.append(train_loss)

            val_loss, val_ppl = self.get_val_metrics()
            print(f"{epoch= :3d} | {elapsed= :4.2f}min | {train_loss= :5.2f} | "
                  f"{val_loss= :5.2f} | {val_ppl= :8.2f}")

            if not best_val_loss or val_loss < best_val_loss:
                best_val_loss = val_loss
                best_val_loss_ppl = val_ppl
                model.save(save_path)
        os.rename(save_path, f"{save_path}_{best_val_loss:.2f}_loss_{best_val_loss_ppl:.2f}_ppl.pt")
        
    def get_val_metrics(self):
        """Return validation loss and perplexity."""
        self.model.eval()
        val_loss = 0
        hidden = self.model.init_hidden(self.val_dl.dataset.n_partitions)

        for data, targets in self.val_dl:
            output, hidden = self.model(data, hidden)
            loss = F.cross_entropy(output, targets)
            val_loss += loss
        val_ppl = val_loss.exp()
        return val_loss.item(), val_ppl.item()

--------

# Train Model

TODO:

* ASGD optimizer

In [9]:
# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [10]:
# Load corpus
dirpath = Path("../data/wikitext-2/")
corpus = WikiCorpus(dirpath)

## Prepare DataLoaders

In [11]:
def collate_fn(batch):
    """Return data as a 2D-tensor and targets as a 1D tensor."""
    data = torch.stack([tup[0] for tup in batch])
    targets = torch.cat([tup[1] for tup in batch])
    return data, targets

In [12]:
# Prep Datasets
n_sequences = 50
train_ds = LMDataset(corpus.train[:35*50], n_sequences, device)
val_ds = LMDataset(corpus.valid[:35*50], n_sequences, device)
test_ds = LMDataset(corpus.test, n_sequences, device)

# Prep DataLoaders
batch_size = 35
train_dl = DataLoader(train_ds, batch_size, collate_fn=collate_fn)
val_dl = DataLoader(val_ds, batch_size, collate_fn=collate_fn)
test_dl = DataLoader(test_ds, batch_size, collate_fn=collate_fn)

## Training Loop

In [24]:
# Set model params
vocab_size = len(corpus.dictionary)
emb_dim = 96
hidden_dim = 96
dropout = 0.5

# Initialise model
model = GRULModel(vocab_size, emb_dim, hidden_dim, dropout=dropout, tie_weights=True).to(device)

In [25]:
# Path to save model to
save_path = Path("../models/tmp")

# Set training & schedluer params
max_lr = 1
max_norm = 0.25
n_epochs = 40
steps_per_epoch = len(train_dl)

# Initialize optimizer & scheduler
optimizer = optim.SGD(model.parameters(), lr=1, momentum=0.9, nesterov=True)
scheduler = optim.lr_scheduler.OneCycleLR(optimizer, max_lr, epochs=n_epochs, steps_per_epoch=steps_per_epoch)

# Initialize trainer
trainer = LMTrainer(train_dl, val_dl, model, optimizer, scheduler)

In [26]:
# Train model
trainer.train(n_epochs, max_norm, save_path)

epoch=   0 | elapsed= 0.01min | train_loss= 10.42 | val_loss= 10.42 | val_ppl= 33566.98
epoch=   1 | elapsed= 0.03min | train_loss= 10.42 | val_loss= 10.42 | val_ppl= 33454.07
epoch=   2 | elapsed= 0.04min | train_loss= 10.42 | val_loss= 10.41 | val_ppl= 33174.87
epoch=   3 | elapsed= 0.06min | train_loss= 10.41 | val_loss= 10.39 | val_ppl= 32596.22
epoch=   4 | elapsed= 0.07min | train_loss= 10.39 | val_loss= 10.36 | val_ppl= 31617.87
epoch=   5 | elapsed= 0.09min | train_loss= 10.35 | val_loss= 10.31 | val_ppl= 30154.77
epoch=   6 | elapsed= 0.10min | train_loss= 10.30 | val_loss= 10.24 | val_ppl= 28087.32
epoch=   7 | elapsed= 0.12min | train_loss= 10.23 | val_loss= 10.13 | val_ppl= 25083.55
epoch=   8 | elapsed= 0.13min | train_loss= 10.10 | val_loss=  9.91 | val_ppl= 20231.46
epoch=   9 | elapsed= 0.15min | train_loss=  9.87 | val_loss=  9.53 | val_ppl= 13780.87
epoch=  10 | elapsed= 0.16min | train_loss=  9.47 | val_loss=  9.09 | val_ppl=  8875.25
epoch=  11 | elapsed= 0.18min | 

--------

# Scratch

In [None]:
data, targets = next(iter(train_dl))
data.shape, targets.shape

In [None]:
cutoffs = [int(0.2 * model.vocab_size)]
cutoffs

In [None]:
criterion = nn.AdaptiveLogSoftmaxWithLoss(model.hid_dim, model.vocab_size, cutoffs=cutoffs)