<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Objective" data-toc-modified-id="Objective-1">Objective</a></span></li><li><span><a href="#Data-Models" data-toc-modified-id="Data-Models-2">Data Models</a></span></li><li><span><a href="#Datasets" data-toc-modified-id="Datasets-3">Datasets</a></span></li><li><span><a href="#GRU" data-toc-modified-id="GRU-4">GRU</a></span></li><li><span><a href="#Training-Functions" data-toc-modified-id="Training-Functions-5">Training Functions</a></span></li><li><span><a href="#Train-Model" data-toc-modified-id="Train-Model-6">Train Model</a></span><ul class="toc-item"><li><span><a href="#Prepare-DataLoaders" data-toc-modified-id="Prepare-DataLoaders-6.1">Prepare DataLoaders</a></span></li><li><span><a href="#Training-Loop" data-toc-modified-id="Training-Loop-6.2">Training Loop</a></span></li></ul></li></ul></div>

In [44]:
import os
import time
from collections import Counter
from pathlib import Path

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

from torch.utils.data import Dataset
from torch.utils.data import DataLoader

In [45]:
# Set seed for reproducibility
torch.manual_seed(9)

<torch._C.Generator at 0x7f61dc08d370>

------------

# Objective

The goal of this notebook is to train a language model from scratch on `wikitext-2`, which you can find [here](https://www.salesforce.com/products/einstein/ai-research/the-wikitext-dependency-language-modeling-dataset/). Our focus will be on getting the pre-processing and training loops working in the traditional, non-federated setting. In a separate notebook we'll do the same thing for the federated setting, which you can read more about in [this](https://arxiv.org/pdf/1811.03604.pdf) paper (which we'll refer to as the `Google Smart Keyboard Paper`).

This notebook borrows heavily from [this](https://pytorch.org/tutorials/advanced/dynamic_quantization_tutorial.html) PyTorch tutorial, which is absolutely outstanding.

--------

# Data Models

Let's create data models for a corpus of text. The `Google Smart Keyboard Paper` uses a vocabulary of size `10,000` and includes tokens for the beginning of sentence, end of sentence, and out-of-vocab words. At inference time the probabilities for these tokens are ignored.

TODO: 

* Limit vocab size
* Add `<BOS>` tokens?
* Sort words by descending frequency to enable training with `AdaptiveLogSoftMax`
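
As a rough illustration of the "limit vocab size" item above, the plain-Python sketch below keeps only the most frequent words and sends everything else to an out-of-vocab token. The helper names `build_vocab` and `encode` are hypothetical and not part of this notebook, and placing `<unk>` at index 0 is an arbitrary choice made for the example.

```python
from collections import Counter

UNK = "<unk>"  # fallback token; wikitext already uses this spelling


def build_vocab(words, vocab_size):
    """Keep the most frequent words, reserving one slot for UNK (hypothetical helper)."""
    counts = Counter(words)
    counts.pop(UNK, None)  # UNK is added explicitly below
    idx2word = [UNK] + [w for w, _ in counts.most_common(vocab_size - 1)]
    word2idx = {w: i for i, w in enumerate(idx2word)}
    return idx2word, word2idx


def encode(words, word2idx):
    """Map words to indices, sending out-of-vocab words to UNK."""
    unk_idx = word2idx[UNK]
    return [word2idx.get(w, unk_idx) for w in words]


words = "the cat sat on the mat the cat ran".split()
idx2word, word2idx = build_vocab(words, vocab_size=3)
encoded = encode(words, word2idx)
print(idx2word)  # ['<unk>', 'the', 'cat']
print(encoded)   # [1, 2, 0, 0, 1, 0, 1, 2, 0]
```

With a fallback like this, `get_indices` would no longer raise a `KeyError` on words dropped from the vocabulary.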

In [46]:
class DescendingDict:
    """Model a vocab as a mapping word <-> index, ordered by descending frequency."""
    
    def __init__(self, words, vocab_size=None):
        word2freq = Counter(words).most_common(vocab_size)
        self.idx2word = [tup[0] for tup in word2freq]
        self.word2idx = {word: idx for idx, word in enumerate(self.idx2word)}
                
    def __len__(self):
        return len(self.idx2word)
                                
    def get_index(self, word):
        return self.word2idx[word]
    
    def get_indices(self, words):
        return [self.word2idx[word] for word in words]

In [47]:
class WikiCorpus:
    """Encode a corpus of text already processed in the wikitext style."""
    
    def __init__(self, dirpath, vocab_size=None):
        """Build a corpus given a dir with train, valid, and test .txt files."""
        words = self.tokenize(os.path.join(dirpath, "train.txt"))
        self.dictionary = DescendingDict(words, vocab_size)
        self.train = self.vectorize(os.path.join(dirpath, "train.txt"))
        self.valid = self.vectorize(os.path.join(dirpath, "valid.txt"))
        self.test = self.vectorize(os.path.join(dirpath, "test.txt"))
        
    def tokenize(self, fpath):
        """Split on new lines and append <eos> tokens."""
        words = []
        with open(fpath, "r", encoding="utf8") as f:
            for line in f:
                words += line.split() + ["<eos>"]
        return words
        
    def vectorize(self, fpath):
        """Return a tensor of indexes encoding the words in a file."""
        idxs = []
        with open(fpath, "r", encoding="utf8") as f:
            for line in f:
                words = line.split() + ["<eos>"]
                idxs.extend(self.dictionary.get_indices(words))
        return torch.LongTensor(idxs)

--------

# Datasets

In [48]:
class LMDataset(Dataset):
    """Dataset for language models."""
    
    def __init__(self, data, n_partitions, device):
        """Partition text data into sequences and move to device."""
        self.data = self.partition(data, n_partitions, device)
        # Number of parallel sequences (columns); used to size the hidden state
        self.n_sequences = n_partitions
        
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, i):
        """Return the i-th element from all sequences along with their targets."""
        if i == len(self.data) - 1:
            return self.data[i-1], self.data[i]
        return self.data[i], self.data[i + 1]
        
    def partition(self, data, n_partitions, device):
        """Re-shape data to have ``n_partitions`` columns (discards remainder)."""
        n_rows = len(data) // n_partitions
        data = data[:n_rows * n_partitions]
        data = data.view(n_partitions, -1).t().contiguous()
        return data.to(device)
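
To make the reshaping in `partition` easier to picture, here is a small plain-Python sketch that mimics it on a list of integers instead of a tensor. It is only an illustration of the layout, under the assumption that each column is one contiguous slice of the text and that row `i + 1` serves as the target for row `i`.

```python
def partition(tokens, n_partitions):
    """Split tokens into n_partitions contiguous columns, dropping the remainder."""
    n_rows = len(tokens) // n_partitions
    tokens = tokens[:n_rows * n_partitions]
    columns = [tokens[c * n_rows:(c + 1) * n_rows] for c in range(n_partitions)]
    # Transpose so that row i holds the i-th token of every column
    return [list(row) for row in zip(*columns)]


tokens = list(range(11))   # 11 tokens and 3 partitions: the last 2 tokens are discarded
rows = partition(tokens, 3)
inputs, targets = rows[:-1], rows[1:]  # each row is predicted from the row before it

print(rows)        # [[0, 3, 6], [1, 4, 7], [2, 5, 8]]
print(targets[0])  # [1, 4, 7]
```

Each column reads top to bottom as an unbroken stretch of text, which is why the hidden state can be carried from one row to the next.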

------------

# GRU 

We'll use a single-layer `GRU` with roughly `600` hidden units, and tie the embedding weights to the softmax layer as described in [this](https://arxiv.org/pdf/1611.01462.pdf) paper. Tying requires the embedding and hidden dimensions to match.

TODO:

* Add Layer Normalization
* Train with and without weight tying
* Variational dropout
* Weight dropping
* Gradient clipping
* Ignore `<unk>`, `<eos>` at inference time
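
The last item above, ignoring special tokens at inference time, can be sketched in plain Python: set the scores of the ignored indices to negative infinity before taking the argmax. The function name `predict_next`, the toy vocabulary, and the scores are made up for illustration.

```python
import math


def predict_next(logits, ignore=()):
    """Return the index of the highest score, skipping ignored indices (hypothetical helper)."""
    masked = [-math.inf if i in ignore else score for i, score in enumerate(logits)]
    return max(range(len(masked)), key=masked.__getitem__)


vocab = ["<unk>", "<eos>", "the", "cat"]
logits = [3.2, 2.9, 1.5, 0.4]  # toy scores for the next word
ignore = {vocab.index("<unk>"), vocab.index("<eos>")}

best_raw = vocab[predict_next(logits)]
best_masked = vocab[predict_next(logits, ignore)]
print(best_raw)     # <unk>
print(best_masked)  # the
```

The same idea carries over to tensors by assigning `-inf` to the ignored columns of the logits.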

In [49]:
class GRULModel(nn.Module):
    """Language model with an encoder, GRU module, and a decoder."""
    
    def __init__(self, vocab_size, emb_dim, hid_dim, n_layers=1, dropout=0.5, tie_weights=False):
        super().__init__()
        self.drop = nn.Dropout(dropout)
        self.encoder = nn.Embedding(vocab_size, emb_dim)
        self.gru = nn.GRU(emb_dim, hid_dim, n_layers)
        self.decoder = nn.Linear(hid_dim, vocab_size)
                
        self.vocab_size = vocab_size
        self.hid_dim = hid_dim
        self.n_layers = n_layers
        
        self.init_weights()
        if tie_weights:
            assert hid_dim == emb_dim, f"{hid_dim= } must match {emb_dim= }!"
            self.decoder.weight = self.encoder.weight
    
    #TODO: why do we initialise with zero bias?
    def init_weights(self, k=0.1):
        """Initialise weights from a uniform distribution U(-k, k)."""
        self.encoder.weight.data.uniform_(-k, k)
        self.decoder.weight.data.uniform_(-k, k)
        self.decoder.bias.data.zero_()
        
    def init_hidden(self, batch_size):
        """Return a zero-filled initial hidden state on the model's device."""
        weights = next(self.parameters())
        return weights.new_zeros(self.n_layers, batch_size, self.hid_dim)
    
    def forward(self, x, hidden):
        emb = self.drop(self.encoder(x))
        output, hidden = self.gru(emb, hidden)
        output = self.drop(output)
        decoded = self.decoder(output)
        decoded = decoded.view(-1, self.vocab_size)
        return decoded, hidden
    
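    def predict(self, x, ignore=None):
        """Return the most likely next word at each position, ignoring certain tokens."""
        with torch.no_grad():
            emb = self.encoder(x)
            output, hidden = self.gru(emb)
            decoded = self.decoder(output).view(-1, self.vocab_size)
            if ignore is not None:
                # Mask ignored token indices so they can never be the argmax
                decoded[:, ignore] = float("-inf")
            # Take the argmax over the vocab dimension, not over positions
            return decoded.argmax(dim=1)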
    
    def save(self, path):
        """Serialise weights to disc."""
        torch.save(self.state_dict(), path)

-----------

# Training Functions

TODO:

* Refactor to a class `Trainer`

In [50]:
def repackage_hidden(h):
    """Return hidden states in new tensors detached from their history."""
    if isinstance(h, torch.Tensor):
        return h.detach()
    return tuple(repackage_hidden(v) for v in h)

In [51]:
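def get_val_metrics(model, val_dl):
    """Return the mean validation loss per target and the matching perplexity."""
    model.eval()
    total_loss, n_targets = 0.0, 0
    hidden = model.init_hidden(val_dl.dataset.n_sequences)
    
    # No gradients are needed for evaluation
    with torch.no_grad():
        for data, targets in val_dl:
            output, hidden = model(data, hidden)
            # cross_entropy averages over the batch, so re-weight by its size
            total_loss += F.cross_entropy(output, targets).item() * len(targets)
            n_targets += len(targets)
    # Perplexity is the exponential of the mean loss, not of the summed loss
    val_loss = torch.tensor(total_loss / n_targets)
    return val_loss.item(), val_loss.exp().item()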
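
Perplexity is the exponential of the mean per-token cross-entropy, measured in nats. The short sketch below works this out for a model that spreads probability uniformly over the vocabulary. The vocabulary size `33278` is taken from the output shapes printed later in this notebook, and the `perplexity` helper is hypothetical.

```python
import math


def perplexity(token_losses):
    """Return exp of the mean per-token cross-entropy (in nats)."""
    return math.exp(sum(token_losses) / len(token_losses))


vocab_size = 33278
uniform_loss = math.log(vocab_size)     # loss of a uniform guess over the vocab
ppl = perplexity([uniform_loss] * 100)  # 100 tokens, all guessed uniformly

print(round(uniform_loss, 2))  # 10.41
print(round(ppl))              # 33278
```

A uniform guesser therefore has a perplexity equal to the vocabulary size, which gives a useful reference point when reading the first few epochs of training output.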

In [52]:
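def _descend(loss, optimizer, max_norm=None):
    """Perform one step of gradient descent, optionally clipping the gradient norm."""
    optimizer.zero_grad()
    loss.backward()
    if max_norm is not None:
        # Clip the parameters the optimizer owns rather than relying on a global `model`
        params = [p for group in optimizer.param_groups for p in group["params"]]
        torch.nn.utils.clip_grad_norm_(params, max_norm)
    optimizer.step()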
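
For intuition about what norm clipping does, here is a plain-Python sketch of the arithmetic on a flat list of gradient values. It is a simplified stand-in, not PyTorch's implementation: the helper `clip_by_global_norm` and the small `eps` term are assumptions made for this example.

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Rescale grads so their overall L2 norm is at most max_norm (illustrative helper)."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # Only ever shrink the gradients; never scale them up
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm


clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=0.25)
norm_after = math.sqrt(sum(g * g for g in clipped))
print(norm_before)           # 5.0
print(round(norm_after, 4))  # 0.25

small, _ = clip_by_global_norm([0.1, 0.0], max_norm=0.25)
print(small)  # [0.1, 0.0]
```

The direction of the gradient is preserved. Only its length is capped, which helps keep a recurrent model from taking an oversized step when gradients spike.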

In [64]:
#TODO: why does the tutorial call init_hidden() inside each epoch?
def train(train_dl, val_dl, model, optimizer, scheduler, n_epochs, max_norm, save_path):
    """Train a language model with the given optimizer and lr scheduler."""
    start = time.time()
    losses, timestamps = [], []
    best_val_loss, best_val_ppl = None, None
    hidden = model.init_hidden(train_dl.dataset.n_sequences)
        
    for epoch in range(n_epochs): 
        model.train()
        total_loss, N = 0, 0
        
        for data, targets in train_dl:
            hidden = repackage_hidden(hidden)
            output, hidden = model(data, hidden)
            print(f"{output.shape= } {hidden.shape= }")
            loss = F.cross_entropy(output, targets)
            _descend(loss, optimizer, max_norm)
            # A scheduler sized by steps_per_epoch (e.g. OneCycleLR) must step once per batch
            scheduler.step()
            total_loss += loss.item() * len(data)
            N += len(data)

        elapsed = (time.time() - start) / 60
        train_loss = total_loss / N
        losses.append(train_loss)
        timestamps.append(elapsed)

        val_loss, val_ppl = get_val_metrics(model, val_dl)
        print(f"{epoch= :3d} | {elapsed= :3.2f}min | {train_loss= :5.2f} | "
              f"{val_loss= :5.2f} | {val_ppl= :8.2f}")
        
        if best_val_loss is None or val_loss < best_val_loss:
            best_val_loss = val_loss
            best_val_ppl = val_ppl
            model.save(save_path)
    
    os.rename(save_path, f"{save_path}_{best_val_loss:.2f}_loss_{best_val_ppl:.2f}_ppl.pt")
    return losses, timestamps

--------

# Train Model

TODO:

* ASGD optimizer

In [54]:
# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [55]:
# Load corpus
dirpath = Path("../data/wikitext-2/")
corpus = WikiCorpus(dirpath)

## Prepare DataLoaders

In [56]:
def collate_fn(batch):
    """Return data as a 2D-tensor and targets as a 1D tensor."""
    data = torch.stack([tup[0] for tup in batch])
    targets = torch.cat([tup[1] for tup in batch])
    return data, targets

In [57]:
# Prep Datasets
n_sequences = 50
train_ds = LMDataset(corpus.train[:35*50], n_sequences, device)
val_ds = LMDataset(corpus.valid[:35*50], n_sequences, device)
test_ds = LMDataset(corpus.test, n_sequences, device)

# Prep DataLoaders
batch_size = 35
train_dl = DataLoader(train_ds, batch_size, collate_fn=collate_fn)
val_dl = DataLoader(val_ds, batch_size, collate_fn=collate_fn)
test_dl = DataLoader(test_ds, batch_size, collate_fn=collate_fn)

## Training Loop

In [65]:
# Set model params
vocab_size = len(corpus.dictionary)
emb_dim = 96
hidden_dim = 96
dropout = 0.5

# Initialise model
model = GRULModel(vocab_size, emb_dim, hidden_dim, dropout=dropout, tie_weights=True).to(device)

In [66]:
# Path to save model to
save_path = Path("../models/tmp")

# Set training & scheduler params
max_lr = 1
max_norm = 0.25
n_epochs = 40
steps_per_epoch = len(train_dl)

# Initialize optimizer & scheduler
optimizer = optim.SGD(model.parameters(), lr=1, momentum=0.9, nesterov=True)
scheduler = optim.lr_scheduler.OneCycleLR(optimizer, max_lr, epochs=n_epochs, steps_per_epoch=steps_per_epoch)

In [67]:
# Train model
losses, timestamps = train(train_dl, val_dl, model, optimizer, scheduler, n_epochs, max_norm, save_path)

output.shape= torch.Size([1750, 33278]) hidden.shape= torch.Size([1, 50, 96])
epoch=   0 | elapsed= 0.01min | train_loss= 10.42 | val_loss= 10.41 | val_ppl= 33267.00
output.shape= torch.Size([1750, 33278]) hidden.shape= torch.Size([1, 50, 96])
epoch=   1 | elapsed= 0.03min | train_loss= 10.42 | val_loss= 10.41 | val_ppl= 33153.23
output.shape= torch.Size([1750, 33278]) hidden.shape= torch.Size([1, 50, 96])
epoch=   2 | elapsed= 0.04min | train_loss= 10.41 | val_loss= 10.40 | val_ppl= 32865.16


KeyboardInterrupt: 

--------

In [63]:
data, targets = next(iter(train_dl))
data.shape, targets.shape

(torch.Size([35, 50]), torch.Size([1750]))

In [61]:
cutoffs = [int(0.2 * model.vocab_size)]
cutoffs

[6655]

In [62]:
criterion = nn.AdaptiveLogSoftmaxWithLoss(model.hid_dim, model.vocab_size, cutoffs=cutoffs)