Word-level Language Modeling and the torchtext Package
======================================================

The PyTorch [torchtext package](https://github.com/pytorch/text)
consists of data processing utilities and popular
datasets for natural language. Its goal is to ease the tedious
work of preprocessing text data for Natural Language Processing (NLP).

The steps below demonstrate how to use `torchtext` in place of the
data preprocessing code in the PyTorch
"Word-level Language Modeling"
[example](https://github.com/pytorch/examples/blob/master/word_language_model/main.py).
The aim is to keep as much of the original code as possible while
showcasing the torchtext features for data loading, tokenization,
numericalization, vocabulary building, use of pre-trained word embeddings, and
batching of processed data for model consumption.

Official `torchtext` documentation can be found
[here](http://torchtext.readthedocs.io/en/latest/index.html).

To begin, make sure that `torchtext` is installed by following the instructions
in the [torchtext package](https://github.com/pytorch/text) repository. Also install the
[spaCy](https://spacy.io/) package, since it is used for tokenization in this demo (`pip install spacy`).
In addition, this notebook is expected to run with the "Word-level Language Modeling"
[example](https://github.com/pytorch/examples/blob/master/word_language_model)
directory as the current working directory, so that `import model` resolves to the example's `model.py`.

In [1]:
import argparse
import math
import os
import spacy
import time
import torch
import torch.nn as nn
import torchtext

import model

In [2]:
parser = argparse.ArgumentParser(description='PyTorch Wikitext-2 RNN/LSTM Language Model')

# Make --wordem and --emsize mutually exclusive; exactly one of them is required
group = parser.add_mutually_exclusive_group(required=True)
group.add_argument('--wordem', type=str,
                    help='name of word embeddings. For example: "glove.6B.200d". \
                    For alias names, see \
                    http://torchtext.readthedocs.io/en/latest/vocab.html#vectors')
group.add_argument('--emsize', type=int,
                    help='size of word embeddings')

#parser.add_argument('--data', type=str, default='./data/wikitext-2',
#                    help='location of the data corpus')
parser.add_argument('--model', type=str, default='LSTM',
                    help='type of recurrent net (RNN_TANH, RNN_RELU, LSTM, GRU)')
parser.add_argument('--nhid', type=int, default=200,
                    help='number of hidden units per layer')
parser.add_argument('--nlayers', type=int, default=2,
                    help='number of layers')
parser.add_argument('--lr', type=float, default=20,
                    help='initial learning rate')
parser.add_argument('--clip', type=float, default=0.25,
                    help='gradient clipping')
parser.add_argument('--epochs', type=int, default=40,
                    help='upper epoch limit')
parser.add_argument('--batch_size', type=int, default=20, metavar='N',
                    help='batch size')
parser.add_argument('--bptt', type=int, default=35,
                    help='sequence length')
parser.add_argument('--dropout', type=float, default=0.2,
                    help='dropout applied to layers (0 = no dropout)')
parser.add_argument('--tied', action='store_true',
                    help='tie the word embedding and softmax weights')
parser.add_argument('--seed', type=int, default=1111,
                    help='random seed')
parser.add_argument('--cuda', action='store_true',
                    help='use CUDA')
parser.add_argument('--log-interval', type=int, default=200, metavar='N',
                    help='report interval')
parser.add_argument('--save', type=str, default='model.pt',
                    help='path to save the final model')
parser.add_argument('--onnx-export', type=str, default='',
                    help='path to export the final model in onnx format')

# Since '--emsize' and '--wordem' are mutually exclusive and one of them is
# required, exactly one must be provided. In the following example we
# run training on GPU for 1 epoch with an embedding size of 200.
args = parser.parse_args(['--epochs', '1', '--cuda', '--emsize','200'])

# Example of setting the wordem (word embeddings) flags.
# See http://torchtext.readthedocs.io/en/latest/vocab.html#vectors for
# pre-trained word embeddings alias names.
# args = parser.parse_args(['--epochs', '1', '--cuda', '--wordem', 'glove.6B.200d'])

# Set the random seed manually for reproducibility.
torch.manual_seed(args.seed)

if torch.cuda.is_available():
    if not args.cuda:
        print("WARNING: You have a CUDA device, so you should probably run with --cuda")

device = torch.device("cuda" if args.cuda else "cpu")
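
The behavior of the mutually exclusive `--wordem` / `--emsize` group can be checked in isolation. The following is a minimal sketch using only the standard library; the reduced parser and the `rejected` list are illustrative and not part of the original example.

```python
import argparse

# A reduced parser with only the two mutually exclusive flags.
parser = argparse.ArgumentParser(description="mutually exclusive flags demo")
group = parser.add_mutually_exclusive_group(required=True)
group.add_argument("--wordem", type=str, help="name of pre-trained word embeddings")
group.add_argument("--emsize", type=int, help="size of word embeddings")

# Exactly one of the two flags: parsing succeeds, the other flag stays None.
args = parser.parse_args(["--emsize", "200"])
print(args.emsize, args.wordem)

# Both flags, or neither: argparse prints a usage error and raises SystemExit.
rejected = []
for bad_argv in (["--emsize", "200", "--wordem", "glove.6B.200d"], []):
    try:
        parser.parse_args(bad_argv)
    except SystemExit:
        rejected.append(bad_argv)
print("rejected:", rejected)
```

Because the unused flag is left as `None`, a later check such as `if args.wordem:` is enough to tell which of the two options was given.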

## Using torchtext to process data

The following are the steps involved to preprocess input text data for model
consumption:

* Define [Fields](http://torchtext.readthedocs.io/en/latest/data.html#fields) object(s)
     
  A Field object defines a datatype together with instructions for
  converting input to a Tensor, for example whether the data should be
  numericalized.
     
  
* Create [Dataset](http://torchtext.readthedocs.io/en/latest/data.html#dataset) object(s)
  
  Create Dataset object(s) from the input text by processing it as
  specified by the Field object(s).
  
  
* Create [Vocab](http://torchtext.readthedocs.io/en/latest/vocab.html) object(s)
  
  Create vocabulary object(s) containing the vocabulary of the input
  datasets. Initialize the object(s) with pre-trained word embeddings if
  specified.
  
  
* Define [Iterators](http://torchtext.readthedocs.io/en/latest/data.html#iterators)
  
  Define iterator(s) to load batches of data from datasets.
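
To make the four steps concrete, here is a rough pure-Python sketch of what each one does conceptually. It does not use `torchtext`; the toy corpus, the whitespace tokenizer, and the `batches` helper are assumptions made for illustration only.

```python
from collections import Counter

corpus = "the cat sat on the mat . the dog sat ."

# 1. "Field": how raw text becomes tokens (here: plain whitespace splitting).
tokenize = str.split

# 2. "Dataset": the processed examples (here: one long list of tokens).
tokens = tokenize(corpus)

# 3. "Vocab": map each distinct token to an integer id, most frequent first.
counts = Counter(tokens)
itos = ["<unk>"] + sorted(counts, key=lambda w: (-counts[w], w))
stoi = {w: i for i, w in enumerate(itos)}

# 4. "Iterator": numericalize the tokens and serve them in fixed-size chunks.
ids = [stoi.get(w, stoi["<unk>"]) for w in tokens]

def batches(seq, size):
    """Yield consecutive chunks of at most `size` ids."""
    for start in range(0, len(seq), size):
        yield seq[start:start + size]

print(ids)                    # [1, 4, 3, 7, 1, 6, 2, 1, 5, 3, 2]
print(list(batches(ids, 4)))  # [[1, 4, 3, 7], [1, 6, 2, 1], [5, 3, 2]]
```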
  

### Define Field object
In this demo, spaCy is used as the tokenizer. This is not a hard
requirement; any other tokenizer should work too.

In [3]:
# Use torchtext to create train, validation and test datasets

# 1. create a Field object using spaCy as the tokenizer. spaCy is not a hard
#    requirement; any other tokenizer should work too.
TEXT = torchtext.data.Field(tokenize=torchtext.data.get_tokenizer('spacy'),
                 eos_token='<EOS>')
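
Since the `tokenize` argument of a Field accepts any callable that maps a string to a list of tokens, a lightweight alternative to spaCy can be sketched with a regular expression. The helper name `simple_tokenize` and its splitting rule are hypothetical and far cruder than spaCy's tokenizer.

```python
import re

# Words (runs of letters, digits, underscore) or single punctuation marks.
_TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def simple_tokenize(text):
    """Split text into word and punctuation tokens (hypothetical helper)."""
    return _TOKEN_RE.findall(text)

print(simple_tokenize("Hello, world! It's 2018."))
# ['Hello', ',', 'world', '!', 'It', "'", 's', '2018', '.']

# Assumed usage with the Field defined above (not executed here):
# TEXT = torchtext.data.Field(tokenize=simple_tokenize, eos_token='<EOS>')
```

A different tokenizer changes the token counts and the vocabulary size, so results obtained with it are not directly comparable to the numbers shown in this notebook.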

### Create Datasets

The `torchtext` module provides built-in Dataset classes for a number of
commonly used datasets, such as WikiText-2 for language modeling and SST and IMDb
for sentiment analysis. In this demonstration, we use the WikiText-2
Dataset class and the `TEXT` Field object defined above to create the datasets.

In [4]:
# 2. create train, validation and test datasets by processing the input data
#    as specified by the "TEXT" (Field) object.
train_ds, val_ds, test_ds = torchtext.datasets.WikiText2.splits(text_field=TEXT)

# Each split is stored as a single example whose `text` is the full token
# list, so the lengths printed below are token counts.
print("Length of the train dataset:     ", len(train_ds.examples[0].text))
print("Length of the validation dataset:", len(val_ds.examples[0].text))
print("Length of the test dataset:      ", len(test_ds.examples[0].text))

Length of the train dataset:      2236652
Length of the validation dataset: 245042
Length of the test dataset:       280576


### Create Vocab object

Build the Vocab object from the *train* dataset and initialize it with the
pre-trained word embeddings if specified.

In [5]:
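# 3. create Vocab object from the training split only
if args.wordem:
    # Load the named pre-trained vectors and take the embedding size from them.
    TEXT.build_vocab(train_ds, vectors=args.wordem)
    args.emsize = TEXT.vocab.vectors.size(1)
else:
    TEXT.build_vocab(train_ds)
vocab = TEXT.vocab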

print("Number of unique vocabularies in the dataset:", len(vocab))

Number of unique vocabularies in the dataset: 33244
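
As a rough illustration of what the vocabulary step produces, the sketch below builds an index-to-string list, a string-to-index map with an unknown-token fallback, and an embedding table aligned with the vocabulary. It is a simplified stand-in, not the `torchtext` implementation; `build_vocab`, `lookup`, the special tokens, and the toy `pretrained` vectors are assumptions.

```python
from collections import Counter

def build_vocab(tokens, specials=("<unk>", "<pad>"), min_freq=1):
    """Return (itos, stoi): special tokens first, then words by descending frequency."""
    counts = Counter(tokens)
    words = sorted(
        (w for w, c in counts.items() if c >= min_freq and w not in specials),
        key=lambda w: (-counts[w], w),
    )
    itos = list(specials) + words
    stoi = {w: i for i, w in enumerate(itos)}
    return itos, stoi

def lookup(stoi, token):
    """Map a token to its id, falling back to the id of '<unk>'."""
    return stoi.get(token, stoi["<unk>"])

itos, stoi = build_vocab("the cat sat on the mat".split())
print(itos)                 # ['<unk>', '<pad>', 'the', 'cat', 'mat', 'on', 'sat']
print(lookup(stoi, "dog"))  # 0, because "dog" was never seen

# Toy pre-trained vectors; words without a vector get a zero row in this sketch.
pretrained = {"the": [0.1, 0.2], "cat": [0.3, 0.4]}
dim = 2
vectors = [pretrained.get(w, [0.0] * dim) for w in itos]
emsize = len(vectors[0])    # plays the role of TEXT.vocab.vectors.size(1)
print(emsize)               # 2
```

Row `i` of `vectors` belongs to token `itos[i]`, which is why the embedding size can be read off the second dimension of the vector table.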


### Create Iterators

Create iterators to load batches of data from the datasets.

In [6]:
# 4. create iterators
train_iter = torchtext.data.BPTTIterator(
    train_ds,
    batch_size=args.batch_size,
    bptt_len=args.bptt,
    device=(0 if args.cuda else -1),
    repeat=False)

eval_batch_size = 10
val_iter, test_iter = torchtext.data.BPTTIterator.splits(
    (val_ds, test_ds),
    batch_size=eval_batch_size,
    bptt_len=args.bptt,
    device=(0 if args.cuda else -1),
    repeat=False)
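
The batching scheme used for truncated backpropagation through time can be sketched in plain Python. The token ids are split into `batch_size` contiguous columns, and each batch is a window of at most `bptt_len` rows whose target is the same window shifted by one step. The helpers `batchify`, `bptt_batches`, and `num_windows` are hypothetical and only approximate what the iterator does internally.

```python
def batchify(ids, batch_size, pad_id=0):
    """Arrange ids into rows of `batch_size` columns; each column is a contiguous chunk."""
    steps = -(-len(ids) // batch_size)  # ceiling division: time steps per column
    padded = ids + [pad_id] * (steps * batch_size - len(ids))
    columns = [padded[b * steps:(b + 1) * steps] for b in range(batch_size)]
    return [list(row) for row in zip(*columns)]

def bptt_batches(rows, bptt_len):
    """Yield (text, target) windows; target is text shifted one time step ahead."""
    for i in range(0, len(rows) - 1, bptt_len):
        seq_len = min(bptt_len, len(rows) - 1 - i)
        yield rows[i:i + seq_len], rows[i + 1:i + 1 + seq_len]

def num_windows(n_tokens, batch_size, bptt_len):
    """Number of (text, target) windows produced for a corpus of n_tokens."""
    steps = -(-n_tokens // batch_size)
    return -(-(steps - 1) // bptt_len)

rows = batchify(list(range(12)), batch_size=3)
print(rows)  # [[0, 4, 8], [1, 5, 9], [2, 6, 10], [3, 7, 11]]
for text, target in bptt_batches(rows, bptt_len=2):
    print(text, "->", target)

print(num_windows(2236652, 20, 35))  # 3196
```

Under this layout, the 2236652 training tokens reported above with a batch size of 20 and a sequence length of 35 give 3196 windows, which agrees with the batch count in the training log.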

## Prepare for training

In [7]:
def repackage_hidden(h):
    """Wraps hidden states in new Tensor, to detach them from their history."""
    if isinstance(h, torch.Tensor):
        return h.detach()
    else:
        return tuple(repackage_hidden(v) for v in h)
    
def evaluate(model, criterion, ds, ds_iter, ntokens, batch_size):
    # Turn on evaluation mode which disables dropout.
    # `ds` is unused here but kept so that existing calls still work.
    model.eval()
    total_loss = 0.
    total_targets = 0
    hidden = model.init_hidden(batch_size)
    with torch.no_grad():
        for batch_num, batch in enumerate(ds_iter):
            text, targets = batch.text, batch.target.view(-1)
            output, hidden = model(text, hidden)
            output_flat = output.view(-1, ntokens)
            # criterion returns the mean loss over this batch's targets, so weight
            # it by the number of targets to obtain a per-token average overall.
            total_loss += targets.numel() * criterion(output_flat, targets).item()
            total_targets += targets.numel()
            hidden = repackage_hidden(hidden)
    return total_loss / total_targets


def train(model, criterion, ds_iter, ntokens, epoch, args):
    # Turn on training mode which enables dropout.
    model.train()
    total_loss = 0.
    start_time = time.time()
    hidden = model.init_hidden(args.batch_size)
    for batch_num, batch  in enumerate(ds_iter):
        text, targets = batch.text, batch.target.view(-1)
        # Starting each batch, we detach the hidden state from how it was previously produced.
        # If we didn't, the model would try backpropagating all the way to start of the dataset.
        hidden = repackage_hidden(hidden)
        model.zero_grad()
        output, hidden = model(text, hidden)
        loss = criterion(output.view(-1, ntokens), targets)
        loss.backward()

        # `clip_grad_norm_` helps prevent the exploding gradient problem in RNNs / LSTMs.
        torch.nn.utils.clip_grad_norm_(model.parameters(), args.clip)
        # Plain SGD update: p <- p - lr * grad.
        for p in model.parameters():
            p.data.add_(-args.lr * p.grad.data)

        total_loss += loss.item()

        if batch_num % args.log_interval == 0 and batch_num > 0:
            cur_loss = total_loss / args.log_interval
            elapsed = time.time() - start_time
            print('| epoch {:3d} | {:5d}/{:5d} batches | lr {:02.2f} | ms/batch {:5.2f} | '
                    'loss {:5.2f} | ppl {:8.2f}'.format(
                epoch, batch_num, len(ds_iter), args.lr,
                elapsed * 1000 / args.log_interval, cur_loss, math.exp(cur_loss)))
            total_loss = 0
            start_time = time.time()
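
The gradient clipping and manual SGD update in `train` can be illustrated without PyTorch. The sketch below operates on nested Python lists instead of tensors; `clip_grad_norm` and `sgd_step` are simplified stand-ins written for this illustration, not the library functions.

```python
import math

def clip_grad_norm(grads, max_norm, eps=1e-6):
    """Rescale all gradients in place so that their joint L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    scale = max_norm / (total_norm + eps)
    if scale < 1.0:
        for vec in grads:
            for i in range(len(vec)):
                vec[i] *= scale
    return total_norm

def sgd_step(params, grads, lr):
    """Plain SGD: p <- p - lr * g for every parameter entry."""
    for p, g in zip(params, grads):
        for i in range(len(p)):
            p[i] -= lr * g[i]

params = [[1.0], [1.0]]
grads = [[3.0], [4.0]]  # joint norm is 5.0

norm_before = clip_grad_norm(grads, max_norm=0.25)
norm_after = math.sqrt(sum(g * g for vec in grads for g in vec))
sgd_step(params, grads, lr=20)
print(norm_before, norm_after, params)
```

Clipping rescales all gradients by a common factor, so the update direction is preserved while its length is bounded. This is what keeps a learning rate as large as the default of 20 usable.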

## Training with RNNModel

In [8]:
# ========================================================
# Training with RNNModel
# ========================================================
ntokens = len(vocab)
best_val_loss = None
lr = args.lr

# Create model
model = model.RNNModel(args.model, ntokens, args.emsize, args.nhid,
                       args.nlayers, args.dropout, args.tied).to(device)
if args.wordem:
    print("Initialize word embeddings vectors")
    model.encoder.weight.data.copy_(vocab.vectors)

criterion = nn.CrossEntropyLoss()

print("Training start time", time.asctime( time.localtime(time.time())))
for epoch in range(1, args.epochs+1):
    epoch_start_time = time.time()
    train(model, criterion, train_iter, ntokens, epoch, args )
    val_loss = evaluate(model, criterion, val_ds, val_iter, ntokens, eval_batch_size)
    print('-' * 89)
    print('| end of epoch {:3d} | time: {:5.2f}s | valid loss {:5.2f} | '
            'valid ppl {:8.2f}'.format(epoch, (time.time() - epoch_start_time),
                                           val_loss, math.exp(val_loss)))
    print('-' * 89)
    # Save the model if the validation loss is the best we've seen so far.
    if not best_val_loss or val_loss < best_val_loss:
        with open(args.save, 'wb') as f:
            torch.save(model, f)
        best_val_loss = val_loss
    else:
        # Anneal the learning rate if no improvement has been seen in the validation dataset.
        # train() reads the rate from args.lr, so the change must be made there.
        args.lr /= 4.0

print("Training end time", time.asctime( time.localtime(time.time())))

Training start time Mon Jul 23 10:11:23 2018
| epoch   1 |   200/ 3196 batches | lr 20.00 | ms/batch 72.13 | loss  7.20 | ppl  1339.30
| epoch   1 |   400/ 3196 batches | lr 20.00 | ms/batch 69.55 | loss  6.33 | ppl   559.32
| epoch   1 |   600/ 3196 batches | lr 20.00 | ms/batch 69.55 | loss  6.01 | ppl   407.06
| epoch   1 |   800/ 3196 batches | lr 20.00 | ms/batch 69.45 | loss  5.80 | ppl   331.70
| epoch   1 |  1000/ 3196 batches | lr 20.00 | ms/batch 69.41 | loss  5.79 | ppl   327.89
| epoch   1 |  1200/ 3196 batches | lr 20.00 | ms/batch 69.38 | loss  5.70 | ppl   299.46
| epoch   1 |  1400/ 3196 batches | lr 20.00 | ms/batch 69.46 | loss  5.59 | ppl   268.33
| epoch   1 |  1600/ 3196 batches | lr 20.00 | ms/batch 69.45 | loss  5.47 | ppl   238.39
| epoch   1 |  1800/ 3196 batches | lr 20.00 | ms/batch 69.40 | loss  5.49 | ppl   241.93
| epoch   1 |  2000/ 3196 batches | lr 20.00 | ms/batch 69.41 | loss  5.47 | ppl   237.07
| epoch   1 |  2200/ 3196 batches | lr 20.00 | ms/batch