# Introduction
This tutorial introduces recurrent neural networks (RNNs), a powerful family of neural networks for variable-length sequence data. RNNs, and in particular long short-term memory networks (LSTMs), serve as a fundamental building block for many sequence learning tasks, including machine translation, language modeling, and question answering. Specifically, we consider the problem of word-level language modeling and investigate strategies for regularizing and optimizing LSTM-based models.

# Architecture
The architecture of sequence prediction with an LSTM is shown in the picture below.

![LSTM in sequence prediction](https://cdn-images-1.medium.com/max/525/1*epcf2SBjRHBynBNFf-CpQA.png)

The model is trained to predict the next word by maximizing the likelihood of each token in the sequence given the current (recurrent) state and the previous token. In this way the model learns the dependencies between words and the rules of the language. At inference, the true previous token is unknown and is replaced by a token generated by the model itself. This discrepancy between training and inference can yield errors that accumulate quickly along the generated sequence.
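Written out, the training objective described above is the negative log-likelihood of each token given the previous token and the recurrent state. The notation is introduced here for illustration and does not come from the figure: $w_t$ is the token at position $t$, $h_{t-1}$ is the recurrent state before reading it, and $\theta$ stands for the model parameters.

$$
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(w_t \mid w_{t-1}, h_{t-1}\right)
$$

During training $w_{t-1}$ is the ground-truth token; at inference it is the token the model generated at the previous step.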

# Install the libraries
In this tutorial, we will use [PyTorch](http://pytorch.org) to build and train the LSTM model. 

In [1]:
import numpy as np
import torch
from torch.autograd import Variable
from torch.utils.data import Dataset
import torch.nn as nn
import time
import math
import string
import re

# Loading data and preprocessing
The preprocessing of the raw text works as follows: first we make everything lowercase and trim most punctuation. Then we generate the vocabulary of words in a pre-defined order and represent each text as a numerical vector by looking up every word in the vocabulary.

In [None]:
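# Lowercase, trim, and remove non-letter characters
def normalize_string(s):
    s = s.lower().strip()
    # Put a space before sentence punctuation so it becomes its own token
    s = re.sub(r"([.!?])", r" \1", s)
    # Replace everything that is not a letter or . ! ? with a single space
    s = re.sub(r"[^a-z.!?]+", r" ", s)
    return s.strip()

vocab = {}

def generate_vocab():
    # Read the file and split into lines
    with open('xxx.txt') as f:
        lines = f.read().strip().split('\n')

    # Normalize every line
    lines = [normalize_string(l) for l in lines]

    # Flatten the lines into one list of words
    words = [w for l in lines for w in l.split()]

    # Give each new word the next free index
    for w in words:
        if w not in vocab:
            vocab[w] = len(vocab)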
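To make the "represent each text as a numerical vector" step concrete, here is a minimal sketch on a made-up two-line corpus. The helper names `build_vocab` and `encode` and the toy sentences are assumptions for illustration only; they are not part of the WikiText-2 pipeline used later.

```python
import re

def normalize(s):
    # Lowercase, split off sentence punctuation, drop other non-letters
    s = s.lower().strip()
    s = re.sub(r"([.!?])", r" \1", s)
    s = re.sub(r"[^a-z.!?]+", " ", s)
    return s.strip()

def build_vocab(lines):
    # Map each distinct word to the index at which it first appears
    vocab = {}
    for line in lines:
        for word in normalize(line).split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab

def encode(line, vocab):
    # Replace every known word by its index; unknown words are skipped here
    return [vocab[w] for w in normalize(line).split() if w in vocab]

toy_lines = ["The cat sat.", "The dog sat!"]  # made-up corpus
toy_vocab = build_vocab(toy_lines)
print(toy_vocab)                          # {'the': 0, 'cat': 1, 'sat': 2, '.': 3, 'dog': 4, '!': 5}
print(encode("The dog sat.", toy_vocab))  # [0, 4, 2, 3]
```

A real vocabulary would usually also reserve an index for unknown words instead of skipping them.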


Here, we use the data from [Word-level WikiText-2](https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-v1.zip), which has already been preprocessed. The vocabulary file contains an array of strings. Each string is a word in the vocabulary. There are 33,278 vocabulary items. The train and validation files each contain an array of articles. Each article is an array of integers, corresponding to words in the vocabulary. There are 579 articles in the training set. For example, the first article in the training set contains 3803 integers. The first 6 integers of the first article are [1420 13859 3714 7036 1420 1417].

[PyTorch](http://pytorch.org) provides many tools to make data loading easy and, hopefully, to make the code more readable. Dataset is an abstract class representing a dataset. Your custom dataset should inherit from Dataset and override the following methods:

*   \__len\__ so that len(dataset) returns the size of the dataset.
*   \__getitem\__ to support indexing, so that dataset[i] returns the i-th sample.

We can create a dataset class to read the train and validation data. We read the .npy file in the \__init\__ method and leave the retrieval of individual samples to \__getitem\__. There are two frequently used data types: Tensor and Variable. Tensors are similar to NumPy's ndarrays, with the addition that Tensors can also be used on a GPU to accelerate computing. Variable is a very thin wrapper around a Tensor. You can access the raw tensor through the .data attribute, and after computing the backward pass, a gradient w.r.t. this variable is accumulated into the .grad attribute. So we first convert our NumPy array to a Tensor and then wrap the Tensor in a Variable to support autograd within PyTorch. (Since PyTorch 0.4, Tensor and Variable have been merged and Variable(tensor) simply returns a tensor, so the wrapping below still works.)

In [2]:
def to_tensor(numpy_array):
    # Numpy array -> Tensor
    return torch.from_numpy(numpy_array).long()

class MyDataSet(Dataset):
    def __init__(self, datatype):
        super(MyDataSet, self).__init__()
        self.dataType = datatype
        # load_data is not defined in this tutorial; it is expected to
        # return the samples and their labels for the given split
        self.dataX, self.labels = load_data(datatype)

    def __len__(self):
        return self.dataX.shape[0]

    def __getitem__(self, index):
        utterance = self.dataX[index]
        # The test split has no labels
        if self.dataType == 'test':
            return utterance, []
        return utterance, self.labels[index]

Here, we create a new class called myDataLoader that plays the role of PyTorch's DataLoader, because we do not want the random batch sampling of the default one. If you want to keep the default sampling logic but customize something else, you can pass your own collate_fn to DataLoader instead. For example, with a custom collate_fn you can manipulate the data to extract features or adjust the shape or type of a batch.

In [3]:
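class myDataLoader(object):
    def __init__(self, datatype, batch_size=1):
        if datatype == 'train':
            path = 'wiki.train.npy'
        elif datatype == 'valid':
            path = 'wiki.valid.npy'
        else:
            raise ValueError('unknown datatype: {}'.format(datatype))
        # The files hold arrays of variable-length articles (object arrays),
        # which NumPy only loads when pickling is allowed
        dataset = np.load(path, allow_pickle=True)

        # Shuffle the articles, then concatenate them into one token stream
        random_data = np.concatenate(
            dataset[np.random.permutation(dataset.shape[0])], axis=0)
        # Keep a multiple of batch_size tokens plus one for the shifted targets
        size = batch_size * ((len(random_data) - 1) // batch_size)
        true_data = random_data[:size + 1]  # discard extra data
        # Shape (tokens_per_column, batch_size): each column is contiguous text
        self.inputs = true_data[:-1].reshape((batch_size, -1)).T
        self.targets = true_data[1:].reshape((batch_size, -1)).T

    def __iter__(self):
        index = 0
        while index < self.inputs.shape[0]:
            # Sequence length: N(70, 5) with probability 0.95, else N(35, 5)
            x, y = np.random.normal(70, 5), np.random.normal(35, 5)
            seq_len = max(1, int(np.random.choice([x, y], p=[0.95, 0.05])))
            yield to_tensor(
                self.inputs[index:index + seq_len]), to_tensor(
                self.targets[index:index + seq_len])
            index += seq_len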
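As an aside on the collate_fn route mentioned earlier, a collate function is just a callable that receives the list of samples for one batch and returns the batch. The sketch below pads variable-length sequences to a common length using plain Python lists. The name `pad_collate`, the `pad_index` argument and the toy samples are hypothetical and only serve the illustration.

```python
def pad_collate(batch, pad_index=0):
    """Collate (sequence, label) pairs of different lengths into one padded batch.

    `batch` is a list of samples as returned by a dataset's __getitem__.
    """
    sequences, labels = zip(*batch)
    lengths = [len(seq) for seq in sequences]
    max_len = max(lengths)
    # Right-pad every sequence with pad_index up to the longest one
    padded = [list(seq) + [pad_index] * (max_len - len(seq))
              for seq in sequences]
    return padded, list(labels), lengths

toy_batch = [([5, 6, 7], 1), ([8], 0), ([9, 10], 1)]  # made-up samples
padded, labels, lengths = pad_collate(toy_batch)
print(padded)   # [[5, 6, 7], [8, 0, 0], [9, 10, 0]]
print(lengths)  # [3, 1, 2]
```

With PyTorch, such a function would be handed to the loader as `DataLoader(dataset, batch_size=..., collate_fn=pad_collate)`, and it would typically convert the padded lists to tensors before returning them.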

The full data pipeline is:

*   Read the .npy file as a whole, randomly permute the articles, and concatenate them into one stream of tokens.
*   Split the data into input data and target data. Each target is the word that directly follows its input word in the original text.
*   Transpose the input and target arrays so that batch_size is the second axis.
*   In the \__iter\__ method, randomly choose the sequence length of each batch: it is drawn from $N(70,5)$ with probability 0.95 and from $N(35,5)$ with probability 0.05, following the setting in the paper [Regularizing and Optimizing LSTM Language Models](https://arxiv.org/pdf/1708.02182.pdf).
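The steps above can be traced on a toy token stream. This is a sketch under assumptions: the function names `batchify` and `sample_seq_len` are made up for the example, and the stream is simply the integers 0 to 10 rather than real word indices.

```python
import numpy as np

def batchify(stream, batch_size):
    # Keep the largest prefix that leaves one extra token for the shifted targets
    size = batch_size * ((len(stream) - 1) // batch_size)
    data = stream[:size + 1]
    # After the transpose, each column is one contiguous slice of the stream
    inputs = data[:-1].reshape((batch_size, -1)).T
    targets = data[1:].reshape((batch_size, -1)).T
    return inputs, targets

def sample_seq_len(rng, base=70, std=5):
    # With probability 0.95 centre on `base`, otherwise on `base / 2`
    mean = base if rng.random() < 0.95 else base / 2
    return max(1, int(rng.normal(mean, std)))

stream = np.arange(11)  # toy token ids 0..10
inputs, targets = batchify(stream, batch_size=2)
print(inputs.shape)      # (5, 2): 5 time steps, 2 batch columns
print(inputs[:, 0])      # [0 1 2 3 4]
print(targets[:, 0])     # [1 2 3 4 5]

rng = np.random.default_rng(0)
print([sample_seq_len(rng) for _ in range(5)])  # random lengths, mostly near 70
```

Every target column is its input column shifted by one token, which is exactly the next-word prediction setup.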

# LSTM model

In [PyTorch](http://pytorch.org), we define the structure of our network by inheriting from the class nn.Module and overriding the forward method, which implements the forward propagation. Additionally, we define one method to initialize the weights and biases of the embedding layer and the linear layer, and another to create the initial hidden state for each layer of the LSTM. This style of implementation is very flexible: if you want to define a completely different neural network, you only need to change how the input is passed along and how the layers are connected, following the official documentation.

You can also inspect the structure of the model very conveniently by printing an instantiated LSTM object.

In [4]:
class LSTM(nn.Module):
    def __init__(self, word_cnt=1000, dropout=0.2, embedding_dim=400,
                 hidden_dim=1150, num_layers=3):
        super().__init__()
        self.drop = nn.Dropout(dropout)
        self.embedding = nn.Embedding(num_embeddings=word_cnt,
                                      embedding_dim=embedding_dim)
        self.rnns = nn.LSTM(input_size=embedding_dim,
                            hidden_size=hidden_dim,
                            num_layers=num_layers, dropout=0.3)
        self.projection = nn.Linear(in_features=hidden_dim,
                                    out_features=word_cnt)
        self.init_weights()
        self.num_layers = num_layers
        self.hidden_dim = hidden_dim

    def init_weights(self):
        initrange = 0.1
        self.embedding.weight.data.uniform_(-initrange, initrange)
        self.projection.bias.data.fill_(0)
        self.projection.weight.data.uniform_(-initrange, initrange)

    def forward(self, input, hidden):
        embedding = self.drop(self.embedding(input))
        output, hidden = self.rnns(embedding, hidden)
        output = self.drop(output)
        decoded = self.projection(
            output.view(output.size(0) * output.size(1),
                        output.size(2)))
        return decoded.view(output.size(0), output.size(1),
                            decoded.size(1)), hidden

    def init_hidden(self, batch_size):
        weight = next(self.parameters()).data

        return (Variable(weight.new(self.num_layers, batch_size,
                                    self.hidden_dim).zero_()),
                Variable(weight.new(self.num_layers, batch_size,
                                    self.hidden_dim).zero_()))

print(LSTM())

LSTM(
  (drop): Dropout(p=0.2)
  (embedding): Embedding(1000, 400)
  (rnns): LSTM(400, 1150, num_layers=3, dropout=0.3)
  (projection): Linear(in_features=1150, out_features=1000)
)
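The size of the model printed above can be worked out by hand. The sketch below counts the parameters assuming the standard nn.LSTM layout, in which every layer holds input weights, recurrent weights and two bias vectors for each of its four gates. The function name `lstm_lm_param_count` is made up for this example.

```python
def lstm_lm_param_count(word_cnt, embedding_dim, hidden_dim, num_layers):
    # Embedding table: one vector per word
    embedding = word_cnt * embedding_dim
    lstm = 0
    for layer in range(num_layers):
        # The first layer reads embeddings, later layers read hidden states
        input_size = embedding_dim if layer == 0 else hidden_dim
        # 4 gates, each with input weights, recurrent weights and two biases
        lstm += 4 * (input_size * hidden_dim
                     + hidden_dim * hidden_dim
                     + 2 * hidden_dim)
    # Linear projection back to the vocabulary: weights plus bias
    projection = hidden_dim * word_cnt + word_cnt
    return embedding + lstm + projection

# Default configuration of the LSTM class above
print(lstm_lm_param_count(1000, 400, 1150, 3))  # 29868600
```

Most of the roughly 30 million parameters sit in the recurrent layers here. With a realistic vocabulary the embedding and projection layers grow accordingly.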


# Train the model
After preprocessing the data and designing the model, we can start to train the LSTM model. For convenience, we separate the training logic and the validation logic into two different methods. Later, we will call them in the main function and compare the loss on the training data with the loss on the validation data to check that the model is not overfitting.

The training routine iterates through all the training data: for each batch it runs the model, calculates the loss, and backpropagates to update the weights.
One detail deserves attention: the repackage_hidden method. Its purpose is to take out the tensor containing the hidden state through h.data and wrap it in a new Variable, which has no history. PyTorch's autograd records every operation applied to a variable and differentiates through the whole recorded graph. Without this step, backpropagation through time (BPTT) would reach back through the hidden state into all previous batches. Dropping the reference to the old graph stops the gradient at the batch boundary and frees up the memory for the next iteration.

In [5]:
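def repackage_hidden(h):
    # Wrap the hidden state in a new Variable, detached from its history
    if isinstance(h, Variable):
        return Variable(h.data)
    else:
        # The LSTM state is a tuple (h, c); repackage each element
        return tuple(repackage_hidden(v) for v in h)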

In [6]:
def train(epoch):
    lstm_model.train()
    total_loss = 0
    start_time = time.time()
    hidden = lstm_model.init_hidden(batch_size)
    train_data = myDataLoader('train', batch_size)

    for i, (input, targets) in enumerate(train_data):
        hidden = repackage_hidden(hidden)
        lstm_model.zero_grad()
        output, hidden = lstm_model(to_variable(input), hidden)
        loss = criterion(output.view(-1, word_cnt),
                         to_variable(targets).view(-1))
        loss.backward()
        # Clip the gradient norm to limit exploding gradients
        torch.nn.utils.clip_grad_norm_(lstm_model.parameters(), clip)
        optim.step()  # Update the network
        total_loss += loss.data
        print_log(i, total_loss, start_time)
        

For the evaluation part, note that we need to put the model in evaluation mode so that Dropout is turned off (and layers such as BatchNorm, if present, use their running statistics) during validation and testing. The reasons are:
*   Dropout deliberately makes neurons output 'wrong' values during training.
*   Because neurons are disabled at random, the network would otherwise produce different outputs on every forward pass.
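The difference between the two modes can be seen in a small pure-Python sketch of dropout. It mimics the behaviour of a dropout layer on a list of numbers and is not PyTorch's implementation. The function name and the toy activations are made up.

```python
import random

def dropout(values, p, training, rng):
    # In evaluation mode the layer is the identity
    if not training:
        return list(values)
    # In training mode each value is zeroed with probability p and the
    # survivors are scaled by 1 / (1 - p) so the expected value is unchanged
    scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else v * scale for v in values]

rng = random.Random(0)
activations = [1.0, 2.0, 3.0, 4.0]
print(dropout(activations, p=0.5, training=True, rng=rng))   # random, changes per call
print(dropout(activations, p=0.5, training=False, rng=rng))  # [1.0, 2.0, 3.0, 4.0]
```

Calling model.eval() switches every dropout layer of the model to the second behaviour, which is why the validation loss is deterministic for a fixed input.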


In [7]:
def evaluate():
    # Turn on evaluation mode which disables dropout.
    lstm_model.eval()
    total_loss = 0.0
    valid_data = myDataLoader('valid', batch_size)
    hidden = lstm_model.init_hidden(batch_size)
    cnt = 0
    # No gradients are needed for validation
    with torch.no_grad():
        for i, (input, targets) in enumerate(valid_data):
            hidden = repackage_hidden(hidden)

            output, hidden = lstm_model(to_variable(input), hidden)
            output_flat = output.view(-1, word_cnt)
            loss = criterion(output_flat, to_variable(targets).view(-1))
            # Accumulate the loss as a plain Python float
            total_loss += loss.item()
            cnt += 1

    # Mean loss per batch
    return total_loss / cnt

Next, we can try different regularization methods and tune the hyperparameters of the model in order to obtain a reasonably good model without overfitting.
There are some common regularization methods for this task. You can experiment with combining them all or using just a subset of them.
*   Apply locked dropout between LSTM layers
*   Apply embedding dropout
*   Apply weight decay
*   Tie the weights of the embedding and the output layer
*   Activity regularization
*   Temporal activity regularization
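The last two items of the list are penalty terms computed from the LSTM output and added to the cross-entropy loss. The sketch below uses one common formulation, the mean of squared values, on a NumPy array standing in for the output of the last LSTM layer. The function names, the toy array and the weights `alpha=2.0` and `beta=1.0` are assumptions for illustration.

```python
import numpy as np

def activity_regularization(dropped_output, alpha):
    # Penalise large activations: alpha * mean of squared outputs
    return alpha * np.mean(dropped_output ** 2)

def temporal_activity_regularization(output, beta):
    # Penalise large changes between consecutive time steps
    return beta * np.mean((output[1:] - output[:-1]) ** 2)

# Toy LSTM output of shape (seq_len, batch_size, hidden_dim) = (3, 1, 2)
output = np.array([[[1.0, 0.0]], [[1.0, 2.0]], [[3.0, 2.0]]])
ar = activity_regularization(output, alpha=2.0)
tar = temporal_activity_regularization(output, beta=1.0)
print(ar, tar)
```

In a training loop the two values would be added to the loss before calling backward, computed on the PyTorch output tensor rather than on a NumPy array.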


If you are not familiar with hyperparameter tuning, the main approaches are listed below. This topic is beyond the scope of this tutorial, so I will not implement them here.
*   Manual Search
*   Grid Search [Grid Search Hyperparameters in Python with Keras](https://machinelearningmastery.com/grid-search-hyperparameters-deep-learning-models-python-keras/)
*   Random Search
*   Bayesian Optimization
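As a flavour of what such a search looks like, here is a minimal sketch of random search. Everything in it is hypothetical: the search ranges, the function names, and in particular `toy_objective`, which stands in for "train the model with this configuration and return the validation loss".

```python
import math
import random

def sample_config(rng):
    return {
        # Learning rate sampled log-uniformly between 1e-5 and 1e-2
        "lr": 10 ** rng.uniform(-5, -2),
        "dropout": rng.uniform(0.1, 0.6),
        "hidden_dim": rng.choice([200, 400, 800]),
    }

def random_search(objective, n_trials, seed=0):
    rng = random.Random(seed)
    best_config, best_loss = None, math.inf
    for _ in range(n_trials):
        config = sample_config(rng)
        loss = objective(config)
        # Keep the configuration with the lowest loss seen so far
        if loss < best_loss:
            best_config, best_loss = config, loss
    return best_config, best_loss

def toy_objective(config):
    # Hypothetical stand-in for a full training run
    return (math.log10(config["lr"]) + 3) ** 2 + (config["dropout"] - 0.3) ** 2

best_config, best_loss = random_search(toy_objective, n_trials=50)
print(best_config, best_loss)
```

Grid search would replace `sample_config` with a loop over a fixed set of values, and Bayesian optimization would choose each new configuration based on the losses observed so far.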

With all these helper functions in place (it looks like extra work, but it makes it easier to run multiple experiments), we can initialize a network and start training. While the model trains, feel free to interrupt the kernel, adjust the learning rate, and resume later by loading the model file saved previously.

In [8]:
batch_size = 20
embedding_dim = 400
hidden_dim = 200
num_layers = 3
clip = 0.25
lr = 0.0001
dropout = 0.2
epochs = 40
log_interval = 20
word_cnt = 33278

lstm_model = LSTM(word_cnt, dropout, embedding_dim, hidden_dim, num_layers)
criterion = nn.CrossEntropyLoss()
optim = torch.optim.Adam(lstm_model.parameters(), lr)

if torch.cuda.is_available():
    lstm_model = lstm_model.cuda()
    criterion = criterion.cuda()

def to_variable(tensor):
    # Tensor -> Variable (on GPU if possible)
    if torch.cuda.is_available():
        # Tensor -> GPU Tensor
        tensor = tensor.cuda()
    return torch.autograd.Variable(tensor)


def main():
    best_val_loss = None
    # epoch is global so that print_log can report it
    global lr, epoch
    epoch = 0
    try:
        while epoch < epochs:
            epoch_start_time = time.time()
            train(epoch)
            val_loss = evaluate()
            print('-' * 89)
            print(
                '| end of epoch {:3d} | time: {:5.2f}s | valid loss {:5.2f} | '
                'valid ppl {:8.2f}'.format(
                    epoch, (time.time() - epoch_start_time), val_loss,
                    math.exp(val_loss)))
            print('-' * 89)

            if best_val_loss is None or val_loss < best_val_loss:
                # Keep the model with the best validation loss so far
                torch.save(lstm_model.state_dict(), 'model.pt')
                best_val_loss = val_loss
            else:
                # Anneal the learning rate and pass the new value to the optimizer
                lr /= 4.0
                for group in optim.param_groups:
                    group['lr'] = lr
            epoch += 1

    except KeyboardInterrupt:
        torch.save(lstm_model.state_dict(), 'model.pt')
        print('-' * 89)
        print('Exiting from training early')

def print_log(i, total_loss, start_time):
    if i % log_interval == 0 and i > 0:
        # total_loss and start_time cover batches 0..i of the current epoch,
        # so report running averages over those i + 1 batches
        cur_loss = float(total_loss) / (i + 1)
        elapsed = time.time() - start_time
        print()
        print('| epoch {:3d} | lr {:.2e} | ms/batch {:5.2f} | '
              'loss {:5.2f} | ppl {:8.2f}'.format(
                epoch, lr, elapsed * 1000 / (i + 1),
                cur_loss, math.exp(cur_loss)))




# Prediction
The prediction function below takes as input a batch of sequences, shaped [batch size, sequence length].
It loads the trained model and performs a forward pass.
The input sequences are drawn from unseen test data. We compute the mean negative log likelihood of the true next word to get a sense of the model's performance.

In [11]:
embedding_dim = 400  # must match the embedding size used during training
hidden_dim = 200
num_layers = 3
dropout = 0.5
word_cnt = 33278

def log_softmax(x, axis):
    ret = x - np.max(x, axis=axis, keepdims=True)
    lsm = np.log(np.sum(np.exp(ret), axis=axis, keepdims=True))
    return ret - lsm

def prediction(inp):
    # Load the best saved model.
    lstm = LSTM(word_cnt, dropout, embedding_dim, hidden_dim,
                num_layers)

    model_path = 'model.pt'

    # Load dictionary into your model
    m = torch.load(model_path, map_location=lambda storage, loc: storage)
    lstm.load_state_dict(m)
    lstm.eval()

    if torch.cuda.is_available():
        lstm.cuda()

    hidden = lstm.init_hidden(inp.shape[0])
    input = to_variable(to_tensor(inp)).transpose(0, 1)
    output, hidden = lstm(input, hidden)

    return output[-1, :, :].data.cpu().numpy()

def test_prediction():
    fixture = np.load('prediction.npz')
    inp = fixture['inp']
    targ = fixture['out']
    out = prediction(inp.copy())
    vocab = np.load('vocab.npy')
    out = log_softmax(out, 1)
    nlls = out[np.arange(out.shape[0]), targ]
    nll = -np.mean(nlls)
    print("Your mean NLL for predicting a single word: {}".format(nll))


In [13]:
if __name__ == '__main__':
    main()
    test_prediction()

Your mean NLL for predicting a single word: 4.747096061706543
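The mean NLL printed above is easier to interpret as a perplexity, which is simply its exponential (the same conversion main() applies to the validation loss). The short sketch below compares the reported value with a model that guesses uniformly over the 33,278-word vocabulary. The function name `perplexity` is made up for the example.

```python
import math

def perplexity(mean_nll):
    # Perplexity is the exponential of the mean per-word negative log-likelihood
    return math.exp(mean_nll)

vocab_size = 33278
uniform_nll = math.log(vocab_size)     # NLL of a uniform guess over the vocabulary
print(uniform_nll)
print(perplexity(uniform_nll))         # equals the vocabulary size, up to rounding
print(perplexity(4.747096061706543))   # perplexity for the NLL reported above
```

A perplexity a little above 115 means the model is, on average, about as uncertain as if it were choosing uniformly among roughly 115 words, compared with 33,278 for the uniform baseline.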


# Summary and References

This tutorial highlighted just a few elements of how to generate new text based on variable-length sequences of words in Python. Much more detail about the libraries, LSTMs, and sequence prediction is available from the following links.

1. PyTorch: http://pytorch.org
2. Regularizing and Optimizing LSTM Language Models: https://arxiv.org/pdf/1708.02182.pdf
3. Grid Search Hyperparameters in Python with Keras: https://machinelearningmastery.com/grid-search-hyperparameters-deep-learning-models-python-keras/
4. Word-level language modeling RNN: https://github.com/pytorch/examples/tree/master/word_language_model