<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Objective" data-toc-modified-id="Objective-1">Objective</a></span></li><li><span><a href="#Data-Models" data-toc-modified-id="Data-Models-2">Data Models</a></span></li><li><span><a href="#Datasets" data-toc-modified-id="Datasets-3">Datasets</a></span></li><li><span><a href="#GRU" data-toc-modified-id="GRU-4">GRU</a></span></li><li><span><a href="#Training-Functions" data-toc-modified-id="Training-Functions-5">Training Functions</a></span></li><li><span><a href="#Train-Model" data-toc-modified-id="Train-Model-6">Train Model</a></span><ul class="toc-item"><li><span><a href="#Prepare-DataLoaders" data-toc-modified-id="Prepare-DataLoaders-6.1">Prepare DataLoaders</a></span></li><li><span><a href="#Training-Loop" data-toc-modified-id="Training-Loop-6.2">Training Loop</a></span></li></ul></li></ul></div>

In [1]:
import os
import json
import time
from collections import Counter
from pathlib import Path

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

from torch.utils.data import Dataset
from torch.utils.data import DataLoader

In [2]:
# Set seed for reproducibility
torch.manual_seed(9)

<torch._C.Generator at 0x7fdd0530daf0>

------------

# Objective

The goal of this notebook is to train a language model from scratch on `wikitext-2`, which you can find [here](https://www.salesforce.com/products/einstein/ai-research/the-wikitext-dependency-language-modeling-dataset/). Our focus will be on getting the pre-processing and training loops working in the traditional, non-federated setting. In a separate notebook we'll do the same thing for the federated setting, which you can read more about in [this](https://arxiv.org/pdf/1811.03604.pdf) paper (which we'll refer to as the `Google Smart Keyboard Paper`).

This notebook borrows heavily from [this](https://pytorch.org/tutorials/advanced/dynamic_quantization_tutorial.html) pytorch tutorial, which is absolutely outstanding.

--------

# Data Models

Let's create data models for a corpus of text. The `Google Smart Keyboard Paper` uses a vocabulary of size `10,000` and inlcudes tokens for the beggining of sentence, end of setence, and out-of-vocab words. During inference time the probabilities for these tokens are ignored.

TODO: 

* Limit vocab size
* Add `<BOS>` tokens?
* Sort words by descending frequency to enable training with `AdaptiveLogSoftMax`

In [3]:
class DescendingDict:
    """Model a vocab as a mapping word <-> index, ordered by descending frequency."""
    
    def __init__(self, words, vocab_sz=None):
        word2freq = Counter(words).most_common(vocab_sz)
        self.idx2word = [tup[0] for tup in word2freq]
        self.word2idx = {word: idx for idx, word in enumerate(self.idx2word)}
                
    def __len__(self):
        return len(self.idx2word)
                                
    def get_index(self, word):
        return self.word2idx[word]
    
    def get_indices(self, words):
        return [self.word2idx[word] for word in words]

In [4]:
class WikiCorpus:
    """Encode a corpus of text already processed in the wikitext style."""
    
    def __init__(self, dirpath, vocab_sz=None):
        """Build a corpus given a dir with train, valid, and test .txt files."""
        words = self.tokenize(os.path.join(dirpath, "train.txt"))
        self.dictionary = DescendingDict(words, vocab_sz)
        self.train = self.vectorize(os.path.join(dirpath, "train.txt"))
        self.valid = self.vectorize(os.path.join(dirpath, "valid.txt"))
        self.test = self.vectorize(os.path.join(dirpath, "test.txt"))
        
    def tokenize(self, fpath):
        """Split on new lines and append <eos> tokens."""
        words = []
        with open(fpath, "r", encoding="utf8") as f:
            for line in f:
                words += line.split() + ["<eos>"]
        return words
        
    def vectorize(self, fpath):
        """Return a tensor of indexes encoding the words in a file."""
        idxs = []
        with open(fpath, "r", encoding="utf8") as f:
            for line in f:
                words = line.split() + ["<eos>"]
                idxs.extend(self.dictionary.get_indices(words))
        return torch.LongTensor(idxs)

--------

# Datasets

In [5]:
class LMDataset(Dataset):
    """Dataset for language models."""
    
    def __init__(self, data, n_sequences, device):
        """Partition text data and move to device."""
        self.data = self.partition(data, n_sequences, device)
        self.n_sequences = n_sequences
        
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, i):
        """Return the i-th element and its target."""
        if i == len(self.data) - 1:
            return self.data[i-1], self.data[i]
        return self.data[i], self.data[i + 1]
        
    def partition(self, data, n_sequences, device):
        """Re-shape data to have ``n_sequences`` columns (discards remainder)."""
        n_rows = len(data) // n_sequences
        data = data[:n_rows * n_sequences]
        data = data.view(n_sequences, -1).t().contiguous()  #TODO: get rid of .t()
        return data.to(device)

------------

# GRU 

We'll use a single layer `GRU` with somewhere around `600` hidden units. We'll tie the embedding weights to the softmax layer, as described in [this](https://arxiv.org/pdf/1611.01462.pdf) paper.

TODO:

* Variational dropout
* Weight dropping
* Ignore `<unk>`, `<eos>` at inference time
* Neural cache
* Simplify `reapackage_hidden` procedure
* Support multiple layers?

In [6]:
class LModel(nn.Module):
    """Abstract base class for language models with a recurrent unit and encoder, decoder."""
    
    def __init__(self, vocab_sz, emb_dim, hid_dim, dropout, tie_weights=False, layer_norm=False):
        super().__init__()
        self.drop = nn.Dropout(dropout)
        self.encoder = nn.Embedding(vocab_sz, emb_dim)
        self.gru = nn.GRU(emb_dim, hid_dim)
        self.decoder = nn.Linear(hid_dim, vocab_sz)
        
        self.vocab_sz = vocab_sz
        self.hid_dim = hid_dim

        self.init_weights()
        if tie_weights:
            assert hid_dim == emb_dim, f"{hid_dim= } must match {emb_dim= }!"
            self.decoder.weight = self.encoder.weight
        
        self.layer_norm = layer_norm   
        if layer_norm:
            self.lnorm = nn.LayerNorm(hid_dim)
            
    def init_weights(self, k=0.1):
        """Initialise weights from a uniform distribution U(-k, k)."""
        self.encoder.weight.data.uniform_(-k, k)
        self.decoder.weight.data.uniform_(-k, k)
        self.decoder.bias.data.zero_()
        
    def init_hidden(self, n_sequences):
        """Initialise hidden weights."""
        return torch.zeros(1, n_sequences, self.hid_dim, requires_grad=False)
    
    def repackage_hidden(self, hidden):
        """Return hidden states in new tensors detatched from their history."""
        if isinstance(hidden, torch.Tensor):
            return hidden.detach()
        return tuple(self.repackage_hidden(v) for v in hidden)
    
    def save(self, fpath):
        """Serialise weights to disc."""
        torch.save(self.state_dict(), fpath)
        
    def load(self, fpath):
        """Load pre-trained weights."""
        self.load_state_dict(torch.load(fpath))

In [7]:
class GRULModel(LModel):
    """Language model to be trained with traditional softmax."""
    
    def __init__(self, vocab_sz, emb_dim, hid_dim, dropout, tie_weights=False, layer_norm=False):
        super().__init__(vocab_sz, emb_dim, hid_dim, dropout, tie_weights, layer_norm)
        
    def forward(self, x, hidden):
        emb = self.drop(self.encoder(x))
        output, hidden = self.gru(emb, hidden)
        if self.layer_norm:
            output = self.lnorm(output)
            hidden = self.lnorm(hidden)
        decoded = self.decoder(output)
        decoded = decoded.view(-1, self.vocab_sz)
        return decoded, hidden
    
    def predict(self, x, ignore=None):
        """Return the most likely next word, ignoring certain tokens."""
        pass

In [8]:
class AdaptiveGRULModel(LModel):
    """Language model to be trained with adaptive softmax."""
    
    def __init__(self, vocab_sz, emb_dim, hid_dim, dropout, layer_norm=False):
        super().__init__(vocab_sz, emb_dim, hid_dim, dropout, tie_weights=True, layer_norm=layer_norm)
        
    def forward(self, x, hidden):
        emb = self.drop(self.encoder(x))
        output, hidden = self.gru(emb, hidden)
        output = output.view(-1, self.hid_dim)
        if self.layer_norm:
            output = self.lnorm(output)
            hidden = self.lnorm(hidden)
        return output, hidden
    
    def predict(self, x):
        """Predict the next word, ignoring <unk> and <eos>."""
        pass

-----------

# Trainers

TODO:

* Save losses, timestamps to disc

In [9]:
class Trainer:
    """Abstract base class for training a general neural model."""
    
    def __init__(self, train_dl, val_dl, model, criterion, optimizer, scheduler):
        self.train_dl = train_dl
        self.val_dl = val_dl
        self.model = model
        self.criterion = criterion
        self.optimizer = optimizer
        self.scheduler = scheduler
        
    def _descend(self, loss, max_norm=None):
        """Perform one step of gradient descent."""
        self.optimizer.zero_grad()
        loss.backward()
        if max_norm:
            nn.utils.clip_grad_norm_(self.model.parameters(), max_norm)
        self.optimizer.step()

In [10]:
class LMTrainer(Trainer):
    """Container for training a language model."""
    
    def __init__(self, train_dl, val_dl, model, criterion, optimizer, scheduler):
        """Initialise trainer with traditional or adaptive softmax."""
        super().__init__(train_dl, val_dl, model, criterion, optimizer, scheduler)
        self.losses = []
        self.timestamps = []
        
    def _compute_loss(self, output, targets):
        """Compute traditional or adaptive softmax."""
        if isinstance(self.criterion, nn.AdaptiveLogSoftmaxWithLoss):
            return self.criterion(output, targets)[1]
        return self.criterion(output, targets)
                        
    def _train(self, hidden, max_norm):
        """Train for a single epoch and return the loss."""
        loss, N = 0, 0
        for data, targets in self.train_dl:
            hidden = self.model.repackage_hidden(hidden)
            output, hidden = self.model(data, hidden)
            batch_loss = self._compute_loss(output, targets)
            self._descend(batch_loss, max_norm)
            loss += batch_loss.item() * len(data)
            N += len(data)
        return loss / N
        
    #TODO: why does the tutorial call init_hidden() inside each epoch?
    def train(self, n_epochs, max_norm, save_path):
        """Train a language model with optional graident clipping."""
        start = time.time()
        best_val_loss, best_val_ppl = None, None
        hidden = self.model.init_hidden(self.train_dl.dataset.n_sequences)

        for epoch in range(n_epochs): 
            self.model.train()
            train_loss = self._train(hidden, max_norm)  
            scheduler.step()

            elapsed = (time.time() - start) / 60
            self.timestamps.append(elapsed)
            self.losses.append(train_loss)

            val_loss, val_ppl = self.evaluate(self.val_dl)
            print(f"{epoch= :3d} | {elapsed= :4.2f}min | {train_loss= :5.2f} | "
                  f"{val_loss= :5.2f} | {val_ppl= :8.2f}")

            if not best_val_loss or val_loss < best_val_loss:
                best_val_loss = val_loss
                best_val_loss_ppl = val_ppl
                model.save(save_path)
        os.rename(save_path, f"{save_path}_{best_val_loss:.2f}_loss_{best_val_loss_ppl:.2f}_ppl.pt")
        self.log_results(f"{save_path}_{best_val_loss:.2f}_loss_{best_val_loss_ppl:.2f}_ppl_log.json")
        
    def evaluate(self, dl):
        """Return loss and perplexity."""
        self.model.eval()
        loss = 0
        hidden = self.model.init_hidden(dl.dataset.n_sequences)

        for data, targets in dl:
            output, hidden = self.model(data, hidden)
            batch_loss = self._compute_loss(output, targets)
            loss += batch_loss
        return loss.item(), loss.exp().item()
    
    def log_results(self, fpath):
        """Write losses and training times to disc."""
        results = {"losses": self.losses, "timestamps": self.timestamps}
        with open(fpath, "w") as f:
            json.dump(results, f, indent=4)

--------

# Train Model

TODO:

* ASGD optimizer

In [11]:
# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cpu')

In [12]:
# Load corpus
dirpath = Path("../data/wikitext-2/")
corpus = WikiCorpus(dirpath)

## DataLoaders

In [13]:
def collate_fn(batch):
    """Return data as a 2D-tensor and targets as a 1D tensor."""
    data = torch.stack([tup[0] for tup in batch])
    targets = torch.cat([tup[1] for tup in batch])
    return data, targets

In [14]:
# Prep Datasets
n_sequences = 50
train_ds = LMDataset(corpus.train[:35*50], n_sequences, device)
val_ds = LMDataset(corpus.valid[:35*50], n_sequences, device)
test_ds = LMDataset(corpus.test[:35*50], n_sequences, device)

# Prep DataLoaders
batch_sz = 64
train_dl = DataLoader(train_ds, batch_sz, collate_fn=collate_fn)
val_dl = DataLoader(val_ds, batch_sz, collate_fn=collate_fn)
test_dl = DataLoader(test_ds, batch_sz, collate_fn=collate_fn)

## Training Loop

In [15]:
# Set model params
vocab_sz = len(corpus.dictionary)
dropout = 0.3
emb_dim = 96
hidden_dim = emb_dim

# Initialise model
model = GRULModel(vocab_sz, emb_dim, hidden_dim, dropout=dropout, tie_weights=True, layer_norm=True).to(device)

In [16]:
# Path to save model to
save_path = Path("../models/lmodel")

# Set training & schedluer params
max_lr = 1
max_norm = 0.25
n_epochs = 40
steps_per_epoch = len(train_dl)

# Choose criterion
# cutoffs = [2000, 10_000]
# criterion = nn.AdaptiveLogSoftmaxWithLoss(hidden_dim, vocab_sz, cutoffs=cutoffs)
criterion = F.cross_entropy

# Initialize optimizer & scheduler
optimizer = optim.SGD(model.parameters(), lr=1, momentum=0.9, nesterov=True)
scheduler = optim.lr_scheduler.OneCycleLR(optimizer, max_lr, epochs=n_epochs, steps_per_epoch=steps_per_epoch)

# Initialize trainer
trainer = LMTrainer(train_dl, val_dl, model, criterion, optimizer, scheduler)

In [17]:
# Train model
trainer.train(n_epochs, max_norm, save_path)

epoch=   0 | elapsed= 0.01min | train_loss= 10.69 | val_loss= 10.66 | val_ppl= 42756.83
epoch=   1 | elapsed= 0.03min | train_loss= 10.66 | val_loss= 10.61 | val_ppl= 40511.17
epoch=   2 | elapsed= 0.04min | train_loss= 10.60 | val_loss= 10.47 | val_ppl= 35408.62
epoch=   3 | elapsed= 0.06min | train_loss= 10.44 | val_loss= 10.21 | val_ppl= 27307.49
epoch=   4 | elapsed= 0.07min | train_loss= 10.15 | val_loss=  9.88 | val_ppl= 19610.41
epoch=   5 | elapsed= 0.09min | train_loss=  9.78 | val_loss=  9.53 | val_ppl= 13821.92
epoch=   6 | elapsed= 0.10min | train_loss=  9.38 | val_loss=  9.16 | val_ppl=  9526.23
epoch=   7 | elapsed= 0.12min | train_loss=  8.95 | val_loss=  8.97 | val_ppl=  7836.64
epoch=   8 | elapsed= 0.13min | train_loss=  8.70 | val_loss=  8.63 | val_ppl=  5605.45
epoch=   9 | elapsed= 0.15min | train_loss=  8.33 | val_loss=  8.42 | val_ppl=  4547.23
epoch=  10 | elapsed= 0.17min | train_loss=  8.06 | val_loss=  8.35 | val_ppl=  4243.54
epoch=  11 | elapsed= 0.18min | 

--------

# Evaluate on Test Set

In [18]:
trainer.evaluate(test_dl)

(7.548727512359619, 1898.3255615234375)