# Language Modeling using GluonNLP

In this notebook, we will use GluonNLP to train a pre-defined LSTM language model on a corpus of real data.

A statistical language model is simply a probability distribution over sequences of words or characters [1].
In this tutorial, we'll restrict our attention to word-based language models.
Given a reliable language model, we can answer questions like
*which of the following strings are we more likely to encounter?*

1. "On Monday, Mr. Lamar’s “DAMN.” took home an even more elusive honor, one that may never have even seemed within reach: the Pulitzer Prize"
2. "Frog zealot flagged xylophone the bean wallaby anaphylaxis extraneous porpoise into deleterious carrot banana apricot."

Even if we've never seen either of these sentences in our entire lives, and even though no rapper has previously been awarded a Pulitzer Prize, we wouldn't be shocked to see the first sentence in the New York Times. By comparison, we can all agree that the second sentence, consisting of incoherent babble, is far less likely.
A statistical language model can assign a precise probability to each string of words.

Given a large corpus of text, we can estimate (i.e., train) a language model $\hat{p}(x_1, ..., x_n)$. And given such a model, we can sample strings $\mathbf{x} \sim \hat{p}(x_1, ..., x_n)$, generating new strings according to their estimated probability. Among other useful applications, we can use language models to score candidate transcriptions from speech recognition models, giving preference to sentences that seem more probable (at the expense of those deemed anomalous).
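To make the idea of estimating $\hat{p}$ and scoring strings concrete, here is a minimal sketch that uses a count-based bigram model with add-one smoothing. This is a deliberately simple stand-in, not the LSTM model trained later in this notebook, and the three-sentence corpus and the names `toy_vocab` and `log_prob` are invented for illustration.

```python
import math
from collections import Counter

# Tiny hypothetical training corpus; real models are estimated from far more text.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat chased the dog",
]

unigrams = Counter()  # how often each token appears as a context
bigrams = Counter()   # how often each (previous word, word) pair appears
for sentence in corpus:
    tokens = ["<bos>"] + sentence.split() + ["<eos>"]
    unigrams.update(tokens[:-1])
    bigrams.update(zip(tokens[:-1], tokens[1:]))

# Every symbol the model may predict (words plus the end-of-sentence marker).
toy_vocab = {w for s in corpus for w in s.split()} | {"<eos>"}

def log_prob(sentence):
    """Log-probability of a sentence under an add-one-smoothed bigram model."""
    tokens = ["<bos>"] + sentence.split() + ["<eos>"]
    total = 0.0
    for prev, word in zip(tokens[:-1], tokens[1:]):
        numerator = bigrams[(prev, word)] + 1
        denominator = unigrams[prev] + len(toy_vocab)
        total += math.log(numerator / denominator)
    return total

plausible = log_prob("the dog sat on the mat")
babble = log_prob("mat the on sat dog the")
print(plausible > babble)
```

Both test strings contain the same words, so the gap in score comes entirely from word order, which is exactly what a language model is meant to capture.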

Recurrent neural networks (RNNs) are a widely used method for language modeling (LM), and they are the approach we take here.

## Language model definition - one sentence

The standard approach to language modeling consists of training a model that, given a trailing window of text, predicts the next word in the sequence. When we train the model, we feed in the inputs $x_1, x_2, ..., x_n$ and try at each time step to predict the corresponding next word $x_2, ..., x_{n+1}$. To generate text from a language model, we can iteratively predict the next word, and then feed this word as the input to the model at the subsequent time step.
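The shift between inputs and targets, and the feed-back loop used for generation, can be sketched in plain Python. The lookup table below is only a stand-in for a trained model, and the helper names `shift_for_next_word_prediction`, `generate`, and `predict` are hypothetical.

```python
def shift_for_next_word_prediction(tokens):
    """Split a token sequence into inputs and the next-word targets."""
    inputs = tokens[:-1]   # x_1, ..., x_n
    targets = tokens[1:]   # x_2, ..., x_{n+1}
    return inputs, targets

def generate(predict_next, first_word, max_len, eos="<eos>"):
    """Greedy generation: feed each predicted word back in as the next input."""
    words = [first_word]
    while len(words) < max_len and words[-1] != eos:
        words.append(predict_next(words))
    return words

# Stand-in "model": maps the previous word to a fixed next word.
toy_table = {"the": "cat", "cat": "sat", "sat": "<eos>"}

def predict(history):
    return toy_table.get(history[-1], "<eos>")

inputs, targets = shift_for_next_word_prediction(["the", "cat", "sat", "<eos>"])
generated = generate(predict, "the", max_len=10)
print(inputs, targets, generated)
```

A real model would return a distribution over the vocabulary at each step, from which the next word is sampled or chosen greedily.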

The model predicts the next word from the sequence history. Its quality is measured by perplexity, which reflects how surprised the model is by the true next word (lower is better).
<img src='https://raw.githubusercontent.com/torch/torch.github.io/master/blog/_posts/images/rnnlm.png' width='700px'>
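Perplexity is the exponential of the average negative log-probability that the model assigns to the true next words, which is why the training code later in this notebook reports `math.exp` of the average loss. A minimal sketch, where the probability lists are made-up values rather than model outputs:

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-probability of the true next words."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# A model that guesses uniformly among 4 words versus one that is usually right.
uniform = perplexity([1 / 4] * 10)
confident = perplexity([0.9] * 10)
print(uniform, confident)
```

One way to read the number: a perplexity of $k$ means the model is, on average, as uncertain as if it were choosing uniformly among $k$ words.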

## Train your own language model
Now let's train a language model with GluonNLP.

### Preparation
We'll start by taking care of our basic dependencies and setting up our environment.

#### Load gluonnlp

In [1]:
import warnings
warnings.filterwarnings('ignore')
import time
import math
import mxnet as mx
from mxnet import gluon
import gluonnlp as nlp
from utilities import detach, train_one_epoch

#### Set environment

In [2]:
num_gpus = 1
context = [mx.gpu(i) for i in range(num_gpus)] if num_gpus else [mx.cpu()]
log_interval = 200

#### Set hyperparameters

In [3]:
batch_size = 64 * len(context)
lr = 20
epochs = 3
bptt = 35
grad_clip = 0.25
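A value such as `grad_clip` is commonly used as the maximum global gradient norm. The `train_one_epoch` helper is not shown in this notebook, so whether it clips in exactly this way is an assumption. The pure-Python sketch below only illustrates the technique; the function name `clip_by_global_norm` and the example gradients are hypothetical.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale gradient vectors so their joint L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    # Only shrink gradients; never scale them up.
    scale = min(1.0, max_norm / total_norm) if total_norm > 0 else 1.0
    clipped = [[g * scale for g in vec] for vec in grads]
    return clipped, total_norm

# Two made-up gradient vectors whose joint norm is 5.
clipped, norm_before = clip_by_global_norm([[3.0, 0.0], [0.0, 4.0]], max_norm=0.25)
norm_after = math.sqrt(sum(g * g for vec in clipped for g in vec))
print(norm_before, norm_after)
```

Clipping keeps a single large gradient from destabilizing training, which matters with a learning rate as large as the one set above.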

#### Load dataset, extract vocabulary, numericalize, and batchify for truncated BPTT

In [4]:
dataset_name = 'wikitext-2'
train_dataset, val_dataset, test_dataset = [
    nlp.data.WikiText2(segment=segment,
                       bos=None, eos='<eos>',
                       skip_empty=False)
    for segment in ['train', 'val', 'test']]

vocab = nlp.Vocab(nlp.data.Counter(train_dataset),
                  padding_token=None, bos_token=None)

bptt_batchify = nlp.data.batchify.CorpusBPTTBatchify(
    vocab, bptt, batch_size, last_batch='discard')
train_data, val_data, test_data = [
    bptt_batchify(x) for x in [train_dataset, val_dataset, test_dataset]
]
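To see what batchifying for truncated BPTT means, the sketch below cuts a token stream into `batch_size` parallel streams and then slices them into time-major windows of `bptt` steps, with targets shifted by one position. It is a plain-Python illustration under simplifying assumptions, not the implementation of `CorpusBPTTBatchify`, whose handling of leftover tokens may differ; the name `bptt_batches` is hypothetical.

```python
def bptt_batches(token_ids, batch_size, bptt):
    """Return (data, target) pairs, each shaped [bptt][batch_size]."""
    # Split the corpus into batch_size contiguous streams, dropping the remainder.
    stream_len = len(token_ids) // batch_size
    streams = [token_ids[b * stream_len:(b + 1) * stream_len]
               for b in range(batch_size)]
    batches = []
    # Walk through time in windows of bptt steps; a short final window is discarded.
    for start in range(0, stream_len - bptt, bptt):
        steps = range(start, start + bptt)
        data = [[s[t] for s in streams] for t in steps]
        target = [[s[t + 1] for s in streams] for t in steps]  # next token
        batches.append((data, target))
    return batches

batches = bptt_batches(list(range(20)), batch_size=2, bptt=3)
first_data, first_target = batches[0]
print(first_data, first_target)
```

Because consecutive windows continue the same streams, the hidden state at the end of one batch is a meaningful starting state for the next.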

#### Load pre-defined language model architecture

In [5]:
model_name = 'standard_lstm_lm_200'
model, vocab = nlp.model.get_model(model_name, vocab=vocab, dataset_name=None)
print(model)
print(vocab)

StandardRNN(
  (encoder): LSTM(200 -> 200, TNC, num_layers=2, dropout=0.2)
  (decoder): HybridSequential(
    (0): Dense(200 -> 33278, linear)
  )
  (embedding): HybridSequential(
    (0): Embedding(33278 -> 200, float32)
    (1): Dropout(p = 0.2, axes=())
  )
)
Vocab(size=33278, unk="<unk>", reserved="['<eos>']")


#### Initialize Parameters and Trainer

In [6]:
model.initialize(mx.init.Xavier(), ctx=context)
trainer = gluon.Trainer(model.collect_params(), 'sgd',
                        {'learning_rate': lr,
                         'momentum': 0,
                         'wd': 0})
loss = gluon.loss.SoftmaxCrossEntropyLoss()

### Training and Evaluation

Now that everything is ready, we can start training the model.

#### Evaluation

The hidden state is carried over from one batch to the next, so the model keeps its context through time.

<img src=https://upload.wikimedia.org/wikipedia/commons/e/ee/Unfold_through_time.png width="500">

In [7]:
def evaluate(model, data_source, batch_size, ctx):
    """Return the average per-token loss of `model` on `data_source`.

    Uses the global `loss` defined above.
    """
    total_L = 0.0
    ntotal = 0
    # Start from an all-zero hidden state and carry it across consecutive batches.
    hidden = model.begin_state(batch_size=batch_size, func=mx.nd.zeros, ctx=ctx)
    for data, target in data_source:
        # with autograd.record(): -> only needed for training
        data = data.as_in_context(ctx)
        target = target.as_in_context(ctx)
        output, hidden = model(data, hidden)
        # hidden = detach(hidden) -> only needed for training
        # Merge the time and batch axes so that each token is scored separately.
        L = loss(output.reshape(-3, -1),
                 target.reshape(-1))
        total_L += mx.nd.sum(L).asscalar()
        ntotal += L.size
    return total_L / ntotal

#### Training loop

Our loss function will be the standard cross-entropy loss function used for multiclass classification,
applied at each time step to compare our predictions to the true next word in the sequence.
We can calculate gradients with respect to our parameters using truncated [back-propagation-through-time (BPTT)](https://en.wikipedia.org/wiki/Backpropagation_through_time). 
In this case, we'll backpropagate for $35$ time steps, updating our weights using stochastic gradient descent with a learning rate of $20$. Both hyperparameters were set earlier in the notebook.

In [8]:
def train(model, train_data, val_data, test_data, batch_size, grad_clip, log_interval, loss, epochs, lr):
    best_val = float("Inf")
    start_train_time = time.time()
    parameters = model.collect_params().values()
    for epoch in range(epochs):
        start_epoch_time = time.time()
        train_one_epoch(epoch, model, train_data, batch_size, grad_clip,
                        log_interval, loss, parameters, trainer, context)        
        mx.nd.waitall()
        val_L = evaluate(model, val_data, batch_size, context[0])
        print('[Epoch %d] time cost %.2fs, valid loss %.2f, valid ppl %.2f'%(
            epoch, time.time()-start_epoch_time, val_L, math.exp(val_L)))
        if val_L < best_val:
            best_val = val_L
            test_L = evaluate(model, test_data, batch_size, context[0])
            print('test loss %.2f, test ppl %.2f'%(test_L, math.exp(test_L)))
        else:
            lr = lr*0.25
            trainer.set_learning_rate(lr)

    print('Total training throughput %.2f samples/s'%(
                            (batch_size * len(train_data) * epochs) / 
                            (time.time() - start_train_time)))
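The learning-rate rule inside `train` (keep the rate while validation loss improves, otherwise multiply it by 0.25) can be isolated as a small function. This is an illustrative sketch: the name `anneal_schedule` is hypothetical, and the validation losses passed in below are example values, not results from this notebook.

```python
def anneal_schedule(val_losses, lr, factor=0.25):
    """Return the learning rate in effect after each epoch's validation check."""
    best_val = float("inf")
    history = []
    for val_loss in val_losses:
        if val_loss < best_val:
            best_val = val_loss   # improvement: keep the current rate
        else:
            lr = lr * factor      # no improvement: shrink the rate
        history.append(lr)
    return history

# Example losses: the third epoch fails to improve on the second.
rates = anneal_schedule([5.9, 5.4, 5.5, 5.3], lr=20)
print(rates)
```

Note that the rate is never raised again, so each stalled epoch permanently shrinks the step size.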

#### Train and evaluate

In [9]:
train(model, train_data, val_data, test_data, batch_size, grad_clip, log_interval, loss, epochs, lr)

[Epoch 0 Batch 200/932] loss 7.65, ppl 2097.86, throughput 1511.51 samples/s
[Epoch 0 Batch 400/932] loss 6.74, ppl 844.45, throughput 1573.54 samples/s
[Epoch 0 Batch 600/932] loss 6.33, ppl 563.60, throughput 1528.89 samples/s
[Epoch 0 Batch 800/932] loss 6.16, ppl 471.36, throughput 1472.79 samples/s
[Epoch 0] throughput 1531.30 samples/s
[Epoch 0] time cost 43.20s, valid loss 5.90, valid ppl 365.83
test loss 5.81, test ppl 334.62
[Epoch 1 Batch 200/932] loss 5.92, ppl 372.51, throughput 1515.07 samples/s
[Epoch 1 Batch 400/932] loss 5.77, ppl 320.56, throughput 1529.00 samples/s
[Epoch 1 Batch 600/932] loss 5.62, ppl 275.64, throughput 1526.68 samples/s
[Epoch 1 Batch 800/932] loss 5.56, ppl 259.94, throughput 1525.26 samples/s
[Epoch 1] throughput 1522.00 samples/s
[Epoch 1] time cost 43.43s, valid loss 5.42, valid ppl 226.65
test loss 5.34, test ppl 207.50
[Epoch 2 Batch 200/932] loss 5.46, ppl 235.34, throughput 1524.60 samples/s
[Epoch 2 Batch 400/932] loss 5.38, ppl 216.97, th

## Conclusion

- GluonNLP provides high-level APIs that can drastically simplify the development of models for NLP tasks.
- Low-level APIs in GluonNLP enable easy customization.

Documentation can be found at http://gluon-nlp.mxnet.io/index.html

The code is available at https://github.com/dmlc/gluon-nlp

## References
[1] "Language model." Wikipedia. https://en.wikipedia.org/wiki/Language_model

[2] Merity, S., et al. "Regularizing and Optimizing LSTM Language Models." ICLR 2018.