# Training a Recurrent Neural Network (RNN) with Gluon

# Lab Objectives

1. End to end process of training a deep learning model
2. Text data processing - Tokenization, clipping, vocabulary indexes, embedding 
3. Recurrent Neural Network (RNN) with MXNet Gluon
4. Inference on text data with MXNet Gluon

# Step 1 - Problem definition

1. Given an input text, e.g. "I love my mother", predict an EMOJI, e.g. ❤️
2. In this problem, we restrict our setting to predict one of **5 EMOJIs**:
    * sad - 😞
    * love - ❤️
    * happy - 😄
    * fork & knife - 🍴
    * baseball - ⚾
3. This is a `Text Classification` problem.

## Prerequisites

In [1]:
# Uncomment the following lines if matplotlib and emoji are not installed.
# !pip install matplotlib
# !pip install emoji

# Uncomment the following lines if gluon-nlp and spaCy are not installed.
# !pip install gluon-nlp
# !pip install spacy -U --quiet
# !python -m spacy download en

# Import required modules from MXNet and other dependencies

In [2]:
import warnings
warnings.filterwarnings('ignore')

import csv
import emoji
import random
import time
import multiprocessing as mp
import numpy as np

import mxnet as mx
from mxnet import nd, gluon, autograd

import gluonnlp as nlp

random.seed(123)
np.random.seed(123)
mx.random.seed(123)

In [3]:
# Set MXNet Context. Use mx.cpu() for CPU. Use mx.gpu(0) for 1 GPU
context = mx.cpu() #mx.gpu(0)

# Step 2 - Load Pretrained wikitext-2 Language Model

We use a language model pretrained on the wikitext-2 dataset [6]. Specifically, we reuse its vocabulary, its embeddings, and its encoder (LSTM weights).

**Intuition:** Using pretrained language model weights is a common approach for semi-supervised learning in NLP. To model a large text corpus well, a language model must learn representations that capture the structure of natural language. Starting from these features, rather than from random ones, lets us converge faster on a good model for our downstream task.

In [4]:
language_model_name = 'standard_lstm_lm_200'

In [5]:
lm_model, vocab = nlp.model.get_model(name=language_model_name,
                                      dataset_name='wikitext-2',
                                      pretrained=True,
                                      ctx=context,
                                      dropout=0)

# Step 3 - Data collection and data preparation

## 3.1 Data collection

* Load Emojify dataset
* This is a tiny dataset (X, Y) where:
       * X contains 127 sentences (strings) 
       * Y contains an integer label between 0 and 4 corresponding to an emoji for each sentence

    ![Emoji Dataset](../assets/emoji_data_set.png)

In [6]:
def read_csv(filename):
    """Utility to read (phrase, label) pairs from the given CSV file
    """
    phrases = []
    labels = []  # named `labels` to avoid shadowing the imported `emoji` module

    with open(filename) as csv_data_file:
        csv_reader = csv.reader(csv_data_file)

        for row in csv_reader:
            phrases.append(row[0])
            labels.append(row[1])
    return phrases, labels

In [8]:
X_train, Y_train = read_csv('./data/train_emoji.csv')
X_test, Y_test = read_csv('./data/test_emoji.csv')

In [9]:
train_dataset = mx.gluon.data.dataset.ArrayDataset(X_train, Y_train)
test_dataset = mx.gluon.data.dataset.ArrayDataset(X_test, Y_test)

In [10]:
emoji_dictionary = {"0": "\u2764\uFE0F",    # :heart: prints a black instead of red heart depending on the font
                    "1": ":baseball:",
                    "2": ":smile:",
                    "3": ":disappointed:",
                    "4": ":fork_and_knife:"}

def label_to_emoji(label):
    """ Converts a label (int or string) into the corresponding emoji code (string) ready to be printed
    """
    # Note: `use_aliases` was removed in emoji 2.0; on newer versions pass language='alias' instead.
    return emoji.emojize(emoji_dictionary[str(label)], use_aliases=True)

## 3.2 Process data and prepare dataset

* Tokenize using spaCy: extract words and punctuation marks from each input sentence
* Convert each token to an index in the vocabulary. The vocabulary comes from wikitext-2
* Prepare data iterators that can iterate over the training and test data
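The three preprocessing ideas above (tokenize, clip, look up vocabulary indexes) can be illustrated without any library. The following is a minimal sketch only: `toy_vocab`, `simple_tokenize` and `to_indices` are hypothetical names, the tokenizer is far cruder than spaCy, and lowercasing is an assumption of this toy rather than something the lab does.

```python
# Toy vocabulary (hypothetical); index 0 is reserved for unknown tokens.
toy_vocab = {"<unk>": 0, "i": 1, "love": 2, "my": 3, "mother": 4, ".": 5}

def simple_tokenize(text):
    """Split on whitespace and turn trailing punctuation marks into separate tokens."""
    tokens = []
    for word in text.lower().split():
        core = word.rstrip(".,!?")
        if core:
            tokens.append(core)
        # Each trailing punctuation character becomes its own token.
        tokens.extend(word[len(core):])
    return tokens

def to_indices(text, vocab, max_len=500):
    """Tokenize, clip to at most max_len tokens, and map tokens to vocabulary indexes."""
    tokens = simple_tokenize(text)[:max_len]
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokens]

indices = to_indices("I love my mother.", toy_vocab)
print(indices)
```

Words missing from the vocabulary map to the `<unk>` index, which is why a vocabulary built on a large corpus is useful even for a tiny dataset.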

In [14]:
# Use spaCy English (en) tokenizer on input sentences to get tokens(words and punctuation marks)
tokenizer = nlp.data.SpacyTokenizer('en')

# Clip sentences to be max 500 tokens
length_clip = nlp.data.ClipSequence(500)

def preprocess(x):
    """
    1. Tokenize - Extract words, punctuation marks from the input sentence.
    2. Convert each token to an index in the vocabulary.
    """
    data, label = x

    # Tokenize the data
    tokenized_data = tokenizer(data)
    # Clip the tokens
    tokenized_clipped_data = length_clip(tokenized_data)
    # Get vocabulary indexes for the tokens. Use pre-loaded 'vocab'.
    data = vocab[tokenized_clipped_data]

    return data, label

def get_length(x):
    return float(len(x[0]))

def preprocess_dataset(dataset):
    with mp.Pool() as pool:
        # Samples are processed in parallel by the worker processes; pool.map blocks until all are done.
        dataset = gluon.data.SimpleDataset(pool.map(preprocess, dataset))
        lengths = gluon.data.SimpleDataset(pool.map(get_length, dataset))
    
    return dataset, lengths

# Preprocess the dataset
print("Preparing Train dataset...")
train_dataset, train_data_lengths = preprocess_dataset(train_dataset)

print("Preparing Test dataset...")
test_dataset, test_data_lengths = preprocess_dataset(test_dataset)

print("Data is ready!!!")

Preparing Train dataset...
Preparing Test dataset...
Data is ready!!!


## 3.3 Visualize sample data

In [15]:
# Raw data - Text and label
# Print a training sample
index = 77
print("Raw Data - ", X_train[index], label_to_emoji(Y_train[index]))

# Processed data - Text is tokenized and converted to index of vocabulary 
print("Processed Data - ", train_dataset[index])

Raw Data -  they are so kind and friendly ❤️
Processed Data -  ([56, 36, 143, 1735, 6, 4903], '0')


## 3.4 Prepare MXNet Dataloader

You now have datasets for the training and test data. To feed this data into the neural network during training, you use a data loader.

1. A data loader reads samples from the dataset and feeds them to the neural network in batches
2. A data loader is efficient and can run multiple worker processes in parallel to prepare the batches

Below, we prepare data loaders for the training and test data:
1. We use 4 workers to load the data in parallel
2. We randomly shuffle the training data (Intuition: without shuffling, the network sees the training data in the same order every epoch, which encourages memorizing the order rather than learning important features)
3. We feed 16 sentences per batch during model training
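As a rough illustration of shuffling and batching, here is a plain-Python sketch. It is not how the Gluon `DataLoader` is implemented; `shuffled_batches` is a hypothetical helper and the 40 integer "samples" are made up.

```python
import random

def shuffled_batches(samples, batch_size, seed=None):
    """Yield mini-batches after shuffling a copy of the samples."""
    rng = random.Random(seed)
    order = list(samples)
    rng.shuffle(order)
    for start in range(0, len(order), batch_size):
        yield order[start:start + batch_size]

# 40 toy samples with 16 per batch: the last batch is smaller.
batches = list(shuffled_batches(range(40), batch_size=16, seed=123))
print([len(b) for b in batches])
```

Every sample appears exactly once per pass; only the order and the grouping into batches change.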

A few important notes for text data modelling:

* Input sentences can have different lengths.
* We use FixedBucketSampler, which assigns each data sample to a fixed bucket based on its length.
* The batchify function (batchify_fn) is applied to the samples of each batch as the loaders read them.
* We apply *Pad* to pad shorter sequences to the length of the longest sequence in the batch; with `ret_length=True` it also returns the original sentence lengths.
* We apply *Stack* to stack the labels, so each batch has the form [(sentences, sentence lengths), emoji labels]
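The bucketing and padding ideas can be sketched in plain Python. This is an illustration under assumptions, not the GluonNLP implementation: `assign_bucket` and `pad_batch` are hypothetical helpers, the bucket keys are illustrative, and the rule "smallest bucket that fits" is assumed here.

```python
def assign_bucket(length, bucket_keys):
    """Return the index of the smallest bucket whose key (maximum length) fits the sample."""
    for i, key in enumerate(bucket_keys):
        if length <= key:
            return i
    raise ValueError("sample is longer than the largest bucket")

def pad_batch(sequences, pad_value=0):
    """Pad every sequence to the longest one in the batch and return the original lengths."""
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_value] * (max_len - len(seq)) for seq in sequences]
    lengths = [len(seq) for seq in sequences]
    return padded, lengths

bucket_keys = [2, 4, 6, 8, 10]  # illustrative maximum lengths per bucket
batch = [[56, 36, 143], [7, 8, 9, 10, 11], [4]]
padded, lengths = pad_batch(batch)
print(padded, lengths)
```

Grouping samples of similar length into the same bucket keeps the amount of padding per batch small, and the returned lengths let later layers ignore the padded positions.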

In [19]:
batch_size = 16
bucket_num, bucket_ratio = 5, 0.2

In [20]:
def get_dataloader():
    batchify_fn = nlp.data.batchify.Tuple(
        nlp.data.batchify.Pad(axis=0, ret_length=True),
        nlp.data.batchify.Stack(dtype='float32'))
    batch_sampler = nlp.data.sampler.FixedBucketSampler(
        train_data_lengths,
        batch_size=batch_size,
        num_buckets=bucket_num,
        ratio=bucket_ratio,
        shuffle=True)
    print(batch_sampler.stats())
    train_dataloader = gluon.data.DataLoader(
        dataset=train_dataset,
        batch_sampler=batch_sampler,
        batchify_fn=batchify_fn, num_workers=4)
    test_dataloader = gluon.data.DataLoader(
        dataset=test_dataset,
        batch_size=batch_size,
        shuffle=False,
        batchify_fn=batchify_fn, num_workers=4)
    return train_dataloader, test_dataloader

train_dataloader, test_dataloader = get_dataloader()

FixedBucketSampler:
  sample_num=132, batch_num=10
  key=[2, 4, 6, 8, 10]
  cnt=[9, 61, 41, 15, 6]
  batch_size=[16, 16, 16, 16, 16]


# Step 4 - Define the Neural Network

* **Embedding, LSTM Layer:** To use pre-trained weights, we base our network on the Language Model Network (Embedding -> LSTM). 
* **Mean Pooling Layer:** The input is a sequence of words (a sentence) and the output is a single label (an emoji). Hence, we average (mean) the encoder states across all time steps into one vector.
* **Dense Layer:** To generate the final output
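To make the mean pooling step concrete, here is a plain-Python sketch of averaging encoder states over the valid time steps only. It assumes time-major nested lists (`encoded[t][n][c]`); `masked_mean_pool` and the toy numbers are hypothetical.

```python
def masked_mean_pool(encoded, valid_lengths):
    """Average features over time, ignoring padded steps.

    encoded[t][n][c] holds time-major features; valid_lengths[n] is the
    number of real (non-padded) tokens of sample n.
    """
    num_steps = len(encoded)
    batch_size = len(encoded[0])
    channels = len(encoded[0][0])
    pooled = []
    for n in range(batch_size):
        steps = min(valid_lengths[n], num_steps)
        sums = [0.0] * channels
        for t in range(steps):  # steps at or beyond the valid length are skipped
            for c in range(channels):
                sums[c] += encoded[t][n][c]
        pooled.append([s / valid_lengths[n] for s in sums])
    return pooled

# 3 time steps, 2 samples, 2 channels; sample 1 has only 2 real tokens.
encoded = [
    [[1.0, 2.0], [4.0, 0.0]],
    [[3.0, 4.0], [6.0, 2.0]],
    [[5.0, 6.0], [9.0, 9.0]],  # [9.0, 9.0] is padding for sample 1
]
pooled = masked_mean_pool(encoded, [3, 2])
print(pooled)
```

Dividing by the valid length rather than the padded length keeps short sentences from being diluted by padding.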

<img src="../assets/emoji_model.png" width="400" height="1100">

## Intuition - Embedding

<img src="../assets/emoji_embedding.png" width="1000" height="1100">

In [21]:
class MeanPoolingLayer(gluon.HybridBlock):
    """A block for mean pooling of encoder features"""
    def __init__(self, prefix=None, params=None):
        super(MeanPoolingLayer, self).__init__(prefix=prefix, params=params)

    def hybrid_forward(self, F, data, valid_length):
        masked_encoded = F.SequenceMask(data,
                                        sequence_length=valid_length,
                                        use_sequence_length=True)
        agg_state = F.broadcast_div(F.sum(masked_encoded, axis=0),
                                    F.expand_dims(valid_length, axis=1))
        return agg_state


class SentimentNet(gluon.HybridBlock):
    """Network for sentiment analysis."""
    def __init__(self, prefix=None, params=None):
        super(SentimentNet, self).__init__(prefix=prefix, params=params)
        with self.name_scope():
            self.embedding = None # will set with lm embedding later
            self.encoder = None # will set with lm encoder later
            self.agg_layer = MeanPoolingLayer()
            self.output = gluon.nn.HybridSequential()
            with self.output.name_scope():
                # 5 possible emojis, hence, output 5 values from last layer.
                self.output.add(gluon.nn.Dense(5, flatten=False))

    def hybrid_forward(self, F, data, valid_length): 
        embedded = self.embedding(data)
        encoded = self.encoder(embedded)
        agg_state = self.agg_layer(encoded, valid_length)
        out = self.output(agg_state)
        return out

# Step 5 - Initialize the parameters in the Neural Network

1. Initialize Embedding and Encoding layer with pre-trained weights from "wikitext-2" dataset (see Step 2)
2. Use Xavier initializer to randomly initialize last Dense layer
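For intuition about what Xavier initialization does, the sketch below samples a weight matrix in plain Python. It assumes the uniform variant that averages fan-in and fan-out with magnitude 3; `xavier_uniform` is a hypothetical helper, not the MXNet implementation.

```python
import math
import random

def xavier_uniform(fan_in, fan_out, magnitude=3.0, seed=None):
    """Sample a (fan_out x fan_in) matrix from U(-scale, scale),
    where scale = sqrt(magnitude / ((fan_in + fan_out) / 2))."""
    rng = random.Random(seed)
    scale = math.sqrt(magnitude / ((fan_in + fan_out) / 2.0))
    weights = [[rng.uniform(-scale, scale) for _ in range(fan_in)]
               for _ in range(fan_out)]
    return weights, scale

# Dense layer of this lab: 200 LSTM features in, 5 emoji classes out.
weights, scale = xavier_uniform(fan_in=200, fan_out=5, seed=123)
print(round(scale, 4))
```

The scale shrinks as the layer gets wider, which keeps the variance of the layer's outputs roughly comparable to that of its inputs at the start of training.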

In [22]:
net = SentimentNet()

# Use Pretrained Embeddings from wikitext-2
net.embedding = lm_model.embedding

# Use Pretrained Encoder states (LSTM) from wikitext-2
net.encoder = lm_model.encoder

net.hybridize()

# Randomly initialize the last Dense layer
net.output.initialize(mx.init.Xavier(), ctx=context)
print(net)

SentimentNet(
  (embedding): HybridSequential(
    (0): Embedding(33278 -> 200, float32)
  )
  (encoder): LSTM(200 -> 200, TNC, num_layers=2)
  (agg_layer): MeanPoolingLayer(
  
  )
  (output): HybridSequential(
    (0): Dense(None -> 5, linear)
  )
)


# Step 6 - Train the Model

## 6.1 Hyperparameters

In [23]:
learning_rate = 0.005
epochs = 10
grad_clip = None

## 6.2 Evaluation Function


In [24]:
def evaluate(net, dataloader, context):
    loss = gluon.loss.SoftmaxCrossEntropyLoss()
    total_L = 0.0
    total_sample_num = 0
    total_correct_num = 0
    print('Begin Testing...')
    for i, ((data, valid_length), label) in enumerate(dataloader):
        # Step 1: Prepare data
        data = mx.nd.transpose(data.as_in_context(context))
        valid_length = valid_length.as_in_context(context).astype(np.float32)
        label = label.as_in_context(context)
        
        # Step 2: Forward pass
        output = net(data, valid_length)
        
        # Step 3: Calculate loss
        L = loss(output, label)
        
        # Step 4: Statistics - Accumulate total loss and number of correct predictions
        pred = nd.argmax(output, axis=1)
        
        total_L += L.sum().asscalar()
        total_sample_num += label.shape[0]
        total_correct_num += (pred == label).sum().asscalar()
    avg_L = total_L / float(total_sample_num)
    acc = total_correct_num / float(total_sample_num)
    return avg_L, acc

## 6.3 Train the Model

### Training loop

1. Take a batch of training data from data loader
2. Do the forward pass (prediction)
3. Calculate the loss (error)
4. Do the backward pass (gradient - change required to reduce the loss)
5. Use Trainer with optimizer to make the updates (change in weights)
6. Continue with Step 1 with new batch of data
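The same loop structure can be seen on a much smaller problem. The sketch below is a toy stand-in, not the lab's model: it fits a line `y = w * x + b` with hand-written gradients and mini-batch gradient descent. `train_toy_model` and the data are hypothetical, and squared error replaces the classification loss.

```python
import random

def train_toy_model(data, epochs=200, batch_size=4, learning_rate=0.05, seed=0):
    """Fit y = w * x + b with mini-batch gradient descent on squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        samples = list(data)
        rng.shuffle(samples)
        for start in range(0, len(samples), batch_size):
            batch = samples[start:start + batch_size]              # 1. take a batch
            preds = [w * x + b for x, _ in batch]                   # 2. forward pass
            errors = [p - y for p, (_, y) in zip(preds, batch)]     # 3. loss is mean(error ** 2)
            grad_w = sum(2 * e * x for e, (x, _) in zip(errors, batch)) / len(batch)  # 4. backward pass
            grad_b = sum(2 * e for e in errors) / len(batch)
            w -= learning_rate * grad_w                             # 5. parameter update
            b -= learning_rate * grad_b
    return w, b

# Noise-free samples of y = 2x + 1 on [-1, 1].
data = [(x / 10.0, 2.0 * (x / 10.0) + 1.0) for x in range(-10, 11)]
w, b = train_toy_model(data)
print(round(w, 3), round(b, 3))
```

In the lab, `autograd` computes the gradients and `gluon.Trainer` applies the update, so steps 4 and 5 shrink to `L.backward()` and `trainer.step(...)`.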

**NOTE:** Softmax cross-entropy loss is used to measure the error in multi-class classification problems.
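As a small worked example of this loss, the plain-Python sketch below computes softmax cross-entropy for a single sample with 5 classes. The helper names and the example scores are hypothetical; Gluon's loss does the equivalent on whole batches.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_cross_entropy(logits, label):
    """Negative log-probability that softmax assigns to the true class."""
    return -math.log(softmax(logits)[label])

# Equal scores for all 5 classes: the model is guessing uniformly.
uniform_loss = softmax_cross_entropy([0.0] * 5, label=2)
# A high score for the true class: the loss is close to zero.
confident_loss = softmax_cross_entropy([0.0, 0.0, 8.0, 0.0, 0.0], label=2)
print(uniform_loss, confident_loss)
```

A uniform guess over 5 classes gives a loss of ln(5), about 1.609, which is a useful baseline when reading per-sample losses of a 5-class model early in training.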

In [25]:
def train(net, context, epochs):
    # Use Follow the Moving Leader Optimizer - [7]
    trainer = gluon.Trainer(net.collect_params(), 'ftml',
                            {'learning_rate': learning_rate})
    loss = gluon.loss.SoftmaxCrossEntropyLoss()
    parameters = net.collect_params().values()
    print("Training the Emoji Prediction Model...")
    for epoch in range(epochs):
        epoch_L = 0.0
        epoch_sent_num = 0
        print("[Epoch - {}]".format(epoch))
        for i, ((data, length), label) in enumerate(train_dataloader):
            L = 0
            with autograd.record():
                # Step 1: Forward pass
                output = net(data.as_in_context(context).T,
                             length.as_in_context(context)
                                   .astype(np.float32))
                # Step 2: Calculate Loss
                L = L + loss(output, label.as_in_context(context)).mean()
            
            # Step 3: Backward pass
            L.backward()
            
            # Step 3.1: Clip gradient - Avoid gradient explosion
            if grad_clip:
                gluon.utils.clip_global_norm(
                    [p.grad(context) for p in parameters],
                    grad_clip)
            
            # Step 4: Do parameter updates
            trainer.step(1)
            
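            # For epoch statistics - Loss and data sample count
            # Note: L is already the mean loss of the batch, and data has shape
            # (batch, time) here, so data.shape[1] is the padded sequence length.
            # The value printed as 'Train Avg Loss' is therefore only a rough training
            # signal and is not comparable with the per-sample loss from evaluate().
            epoch_sent_num += data.shape[1]
            epoch_L += L.asscalar()
    
        print('Train Avg Loss {:.6f}'.format(epoch_L / epoch_sent_num))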
        
        # Step 5: Evaluation after each epoch
        test_avg_L, test_acc = evaluate(net, test_dataloader, context)
        print('Test Acc {:.2f}, Test Avg Loss {:.6f}'.format(test_acc, test_avg_L))

In [26]:
# Train the model
train(net, context, epochs)

Training the Emoji Prediction Model...
[Epoch - 0]
Train Avg Loss 0.298171
Begin Testing...
Test Acc 0.23, Test Avg Loss 1.555015
[Epoch - 1]
Train Avg Loss 0.276428
Begin Testing...
Test Acc 0.38, Test Avg Loss 1.507533
[Epoch - 2]
Train Avg Loss 0.255712
Begin Testing...
Test Acc 0.38, Test Avg Loss 1.454944
[Epoch - 3]
Train Avg Loss 0.231783
Begin Testing...
Test Acc 0.38, Test Avg Loss 1.394452
[Epoch - 4]
Train Avg Loss 0.205929
Begin Testing...
Test Acc 0.48, Test Avg Loss 1.328168
[Epoch - 5]
Train Avg Loss 0.173582
Begin Testing...
Test Acc 0.57, Test Avg Loss 1.262833
[Epoch - 6]
Train Avg Loss 0.139932
Begin Testing...
Test Acc 0.59, Test Avg Loss 1.196853
[Epoch - 7]
Train Avg Loss 0.109334
Begin Testing...
Test Acc 0.59, Test Avg Loss 1.141708
[Epoch - 8]
Train Avg Loss 0.078590
Begin Testing...
Test Acc 0.61, Test Avg Loss 1.101847
[Epoch - 9]
Train Avg Loss 0.056978
Begin Testing...
Test Acc 0.64, Test Avg Loss 1.077297


# Step 7 - Prediction


In [27]:
prob1 = net(
            mx.nd.reshape(
            mx.nd.array(vocab[['He', 'likes', 'playing', 'baseball']], ctx=context),
            shape=(-1, 1)), mx.nd.array([4], ctx=context)).sigmoid()
nd.argmax(prob1, axis=1).asscalar()
label_to_emoji(int(nd.argmax(prob1, axis=1).asscalar()))

'⚾'

In [28]:
prob1 = net(
            mx.nd.reshape(
            mx.nd.array(vocab[['I', 'love', 'my', 'mother']], ctx=context),
            shape=(-1, 1)), mx.nd.array([4], ctx=context)).sigmoid()
nd.argmax(prob1, axis=1).asscalar()
label_to_emoji(int(nd.argmax(prob1, axis=1).asscalar()))

'❤️'

In [29]:
prob1 = net(
            mx.nd.reshape(
            mx.nd.array(vocab[['Lets', 'have', 'food', 'in', 'indian', 'restaurant']], ctx=context),
            shape=(-1, 1)), mx.nd.array([6], ctx=context)).sigmoid()  # valid length = number of tokens (6)
nd.argmax(prob1, axis=1).asscalar()
label_to_emoji(int(nd.argmax(prob1, axis=1).asscalar()))

'🍴'

In [31]:
prob1 = net(
            mx.nd.reshape(
            mx.nd.array(vocab[['I', 'am', 'not', 'feeling', 'good']], ctx=context),
            shape=(-1, 1)), mx.nd.array([5], ctx=context)).sigmoid()  # valid length = number of tokens (5)
nd.argmax(prob1, axis=1).asscalar()
label_to_emoji(int(nd.argmax(prob1, axis=1).asscalar()))

'😞'

In [32]:
prob1 = net(
            mx.nd.reshape(
            mx.nd.array(vocab[['He', 'is', 'funny']], ctx=context),
            shape=(-1, 1)), mx.nd.array([3], ctx=context)).sigmoid()  # valid length = number of tokens (3)
nd.argmax(prob1, axis=1).asscalar()
label_to_emoji(int(nd.argmax(prob1, axis=1).asscalar()))

'😄'

# Summary


In this lab, we learnt about:

1. End-to-end process of training a deep learning model for a text data use case
2. Gluon Dataset, Data Loader and text data processing
3. spaCy tokenizer, Gluon NLP bucket sampler, vocabulary from the wikitext-2 dataset
4. Pre-trained Embedding and Encoding weights (wikitext-2 dataset) using the easy-to-use Gluon-NLP toolkit
5. Gluon Blocks - Embedding, LSTM
6. Loss function - gluon.loss.SoftmaxCrossEntropyLoss
7. Trainer and optimizer - gluon.Trainer and FTML (Follow the Moving Leader optimizer)
8. Forward -> Loss -> Backward -> Update -> Repeat

# Credits

Problem statement and dataset are borrowed from [Coursera deeplearning.ai Sequence Models lecture](https://www.coursera.org/learn/nlp-sequence-models/home/welcome)

# References

1. http://mxnet.incubator.apache.org/
2. https://gluon-nlp.mxnet.io
3. https://spacy.io/usage/
4. https://gluon-nlp.mxnet.io/examples/sentiment_analysis/sentiment_analysis.html
5. http://ai.stanford.edu/~amaas/data/sentiment/
6. https://einstein.ai/research/blog/the-wikitext-long-term-dependency-language-modeling-dataset
7. https://mxnet.incubator.apache.org/api/python/optimization/optimization.html#mxnet.optimizer.FTML