# Sentiment Analysis (SA) with Pre-trained Language Model (LM)

In this notebook, we demonstrate how to analyze the sentiment of IMDB reviews using a pre-trained language model.

- Model definition
- Data pipeline
- Training and evaluation

Now that we've covered some advanced topics, let's go back and show how these techniques can help us even when addressing the comparatively simple problem of classification. In particular, we'll look at the classic problem of sentiment analysis: taking an input consisting of a string of text and classifying its sentiment as positive or negative.

In this notebook, we are going to use GluonNLP to build a sentiment analysis model whose weights are initialized from a pre-trained language model. Using pre-trained language model weights is a common approach for semi-supervised learning in NLP. In order to do a good job with language modeling on a large corpus of text, our model must learn representations that contain information about the structure of natural language. Intuitively, by starting with these good features rather than random features, we're able to converge faster upon a good model for our downstream task.

With GluonNLP, we can quickly prototype the model and it's easy to customize. The building process consists of just three simple steps. For this demonstration we'll focus on movie reviews from the Large Movie Review Dataset, also known as the IMDB dataset. Given a movie review, our model will output a prediction of its sentiment, which can be positive or negative.

## Model Architecture

So that we can easily transplant the pre-trained weights, we'll base our model architecture on the pre-trained LM. Following the LSTM layer, we have one representation vector for each word in the sentence. Because we plan to make a single prediction (not one per word), we'll first pool these representations across time steps before feeding the result through a dense layer to produce our final prediction (a single sigmoid output node).

<img src='samodel-v3.png' width='250px'>

Dense: make predictions

Pooling: downsampling

Embedding: from LM

Encoder: from LM


Specifically, our model represents input words by their embeddings. Following the embedding layer, our model consists of a two-layer LSTM, followed by an average pooling layer, followed by a sigmoid output layer (all illustrated in the figure above).

Thus, given an input sequence, the memory cells in the LSTM layer will produce a representation sequence. This representation sequence is then averaged over all timesteps resulting in a fixed-length sentence representation $h$. Finally, we apply a sigmoid output layer on top of $h$. We’re using the sigmoid because we’re trying to predict if this text has positive or negative sentiment, and a sigmoid activation function squashes the output values to the range [0,1], allowing us to interpret this output as a probability.
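
Written out, the pooling and output steps described above amount to the following. The symbols $h_t$ (the encoder output at time step $t$), $T$ (the number of non-padding tokens in the review), and the dense-layer parameters $w$ and $b$ are notation introduced here for illustration only:

$$
h = \frac{1}{T}\sum_{t=1}^{T} h_t, \qquad \hat{y} = \sigma(w^\top h + b), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}
$$

Here $\hat{y}$ is read as the predicted probability that the review is positive.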

## Model Definition in GluonNLP

In [1]:
import multiprocessing as mp
import random
import time
import numpy as np
import mxnet as mx
from mxnet import gluon
import gluonnlp as nlp
import utils

random.seed(123)
np.random.seed(123)
mx.random.seed(123)

In [2]:
class MeanPoolingLayer(mx.gluon.HybridBlock):
    """A block for mean pooling of encoder features"""
    def __init__(self, prefix=None, params=None):
        super(MeanPoolingLayer, self).__init__(prefix=prefix, params=params)

    def hybrid_forward(self, F, data, valid_length):
        # Data will have shape (T, N, C)
        masked_encoded = F.SequenceMask(data,
                                        sequence_length=valid_length,
                                        use_sequence_length=True)
        agg_state = F.broadcast_div(F.sum(masked_encoded, axis=0),
                                    F.expand_dims(valid_length, axis=1))
        return agg_state
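
To see why the layer above masks before averaging, here is a minimal plain-Python sketch of the same idea. It is an illustration, not the MXNet implementation: the function name `masked_mean_pool` is hypothetical, and for readability it uses a batch-first layout (N, T, C) instead of the (T, N, C) layout used in the notebook.

```python
def masked_mean_pool(sequences, valid_lengths):
    """Average each sequence's feature vectors over its valid time steps only.

    sequences: N padded sequences, each a list of T feature vectors (C floats).
    valid_lengths: N ints, the number of real (non-padding) steps per sequence.
    """
    pooled = []
    for steps, length in zip(sequences, valid_lengths):
        num_features = len(steps[0])
        totals = [0.0] * num_features
        # Sum only the first `length` steps; padded steps are ignored (masked).
        for step in steps[:length]:
            for i, value in enumerate(step):
                totals[i] += value
        # Divide by the valid length, not by the padded length.
        pooled.append([total / length for total in totals])
    return pooled


# Two sequences padded to T = 3; the second has only 2 real steps.
batch = [
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[2.0, 0.0], [4.0, 2.0], [9.0, 9.0]],  # last step is padding
]
pooled = masked_mean_pool(batch, [3, 2])
print(pooled)  # [[3.0, 4.0], [3.0, 1.0]]
```

Without the mask, the padding vector `[9.0, 9.0]` would leak into the second average, so short reviews in a batch would get distorted representations.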

In [3]:
class SentimentNet(gluon.HybridBlock):
    """Network for sentiment analysis."""
    def __init__(self, prefix=None, params=None):
        super(SentimentNet, self).__init__(prefix=prefix, params=params)
        with self.name_scope():
            self.embedding = None # will set with lm embedding later
            self.encoder = None   # will set with lm encoder later
            self.agg_layer = MeanPoolingLayer()
            self.output = gluon.nn.HybridSequential()
            with self.output.name_scope():
                self.output.add(gluon.nn.Dense(1, flatten=False))

    def hybrid_forward(self, F, data, valid_length):
        # encoded has shape (T, N, C)
        encoded = self.encoder(self.embedding(data))
        agg_state = self.agg_layer(encoded, valid_length)
        out = self.output(agg_state)
        return out

## Hyperparameters and Model Initialization

### Load Pre-trained Language Model

In [4]:
language_model_name = 'standard_lstm_lm_200'
pretrained = True
context = mx.gpu(0)
lm_model, vocab = nlp.model.get_model(name=language_model_name,
                                      dataset_name='wikitext-2',
                                      pretrained=pretrained,
                                      ctx=context)

### Create SA model from Pre-trained Model

In the code above, we acquired a model pre-trained on the WikiText-2 dataset using `nlp.model.get_model`. In the code below, we construct a `SentimentNet` object and assign the embedding layer and encoder of the pre-trained model to it.

As we employ the pre-trained embedding layer and encoder, **we only need to initialize the output layer** using `net.output.initialize(mx.init.Xavier(), ctx=context)`.

In [5]:
learning_rate = 0.005
batch_size = 16
epochs = 1

net = SentimentNet()
net.embedding = lm_model.embedding
net.encoder = lm_model.encoder
net.hybridize()
# initialize only the output layer
net.output.initialize(mx.init.Xavier(), ctx=context)
loss = gluon.loss.SigmoidBCELoss()
trainer = gluon.Trainer(net.collect_params(),'ftml',
                        {'learning_rate': learning_rate})
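
Note that the `Dense(1)` output layer has no activation, so the network emits a raw score (a logit) and, with its default settings, `SigmoidBCELoss` applies the sigmoid internally. The sketch below illustrates that relationship in plain Python. The helper names `sigmoid` and `bce_from_logit` are hypothetical, and the formula is a common numerically stable form of binary cross-entropy, shown as an illustration rather than as the library's source code.

```python
import math


def sigmoid(logit):
    """Map a real-valued logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))


def bce_from_logit(logit, label):
    """Binary cross-entropy computed directly from a logit.

    Uses the stable form max(x, 0) - x * y + log(1 + exp(-|x|)),
    which avoids overflow for logits of large magnitude.
    """
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


logit, label = 2.0, 1.0
stable = bce_from_logit(logit, label)

# Naive form for comparison: -[y * log(p) + (1 - y) * log(1 - p)]
p = sigmoid(logit)
naive = -(label * math.log(p) + (1.0 - label) * math.log(1.0 - p))
print(round(stable, 6), round(naive, 6))
```

Both forms agree for moderate logits; the stable form simply keeps working when the logit is very large or very small. This is also why a separate sigmoid is needed at prediction time to turn the network output into a probability.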

In [6]:
print(net)

SentimentNet(
  (embedding): HybridSequential(
    (0): Embedding(33278 -> 200, float32)
    (1): Dropout(p = 0.2, axes=())
  )
  (encoder): LSTM(200 -> 200, TNC, num_layers=2, dropout=0.2)
  (agg_layer): MeanPoolingLayer(
  
  )
  (output): HybridSequential(
    (0): Dense(None -> 1, linear)
  )
)


## Data Pipeline

### Preparation
The data preprocessing logic depends on the English spaCy tokenizer. If you are not running this example on the provided AMI, please add a cell and run the following command:

```bash
!python -m spacy download en
```

### A Glance at the Dataset

In [7]:
raw_train_dataset = nlp.data.IMDB(root='data/imdb', segment='train')
raw_test_dataset = nlp.data.IMDB(root='data/imdb', segment='test')
print('score:\t', raw_train_dataset[11][1])
print('\nreview:\n\n', raw_test_dataset[11][0])

score:	 8

review:

 I was fortunate to attend the London premier of this film. While I am not at all a fan of British drama, I did find myself deeply moved by the characters and the BAD CHOICES they made. I was in tears by the end of the film. Every scene was mesmerizing. The attention to detail and the excellent acting was quite impressive.<br /><br />I would have to agree with some of the other comments here which question why all these women were throwing themselves at such a despicable character.<br /><br />*******SPOLIER ALERT******** I was also hoping that Dylan would have been killed by William when he had the chance! ****END SPOILER*****<br /><br />Keira Knightley did a great job and radiate beauty and innocence from the screen, but it was Sienna Miller's performance that was truly Oscar worthy.<br /><br />I am sure this production will be nominated for other awards.


We need to do the following to preprocess the dataset:

- Tokenization
- Label generation
- Batching with bucketing

#### Tokenization and Label Generation

In [8]:
# tokenizer takes as input a string and outputs a list of tokens.
tokenizer = nlp.data.SpacyTokenizer('en')

def preprocess(x):
    # length_clip takes as input a list 
    # and outputs a list with maximum length 500.
    length_clip = nlp.data.ClipSequence(500)
    data, label = x
    label = int(label > 5)
    # A token index or a list of token indices is
    # returned according to the vocabulary.
    data = vocab[length_clip(tokenizer(data))]
    return data, label

def get_length(x):
    return float(len(x[0]))
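
As a rough, dependency-free illustration of what `preprocess` does to one sample, the sketch below clips the token list, maps tokens to indices, and binarizes the score with the same threshold of 5. Everything in it is a stand-in: `toy_vocab`, `UNKNOWN_INDEX`, and `toy_preprocess` are hypothetical, and a whitespace split replaces the spaCy tokenizer.

```python
MAX_LENGTH = 500   # same clip length as in the notebook
UNKNOWN_INDEX = 0  # hypothetical index for out-of-vocabulary tokens
toy_vocab = {"This": 1, "movie": 2, "is": 3, "amazing": 4}  # hypothetical


def toy_preprocess(sample):
    """Turn a (review text, score) pair into (token indices, binary label)."""
    text, score = sample
    tokens = text.split()         # whitespace split stands in for spaCy
    tokens = tokens[:MAX_LENGTH]  # clip long reviews to MAX_LENGTH tokens
    indices = [toy_vocab.get(token, UNKNOWN_INDEX) for token in tokens]
    label = int(score > 5)        # scores above 5 count as positive
    return indices, label


positive = toy_preprocess(("This movie is amazing", 8))
negative = toy_preprocess(("This movie is dreadful", 2))
print(positive)  # ([1, 2, 3, 4], 1)
print(negative)  # ([1, 2, 3, 0], 0)
```

The word "dreadful" is not in the toy vocabulary, so it falls back to the unknown index, which mirrors how a fixed vocabulary handles unseen words.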

#### Preprocess the Dataset with Multi-processing

In [9]:
def preprocess_dataset(dataset):
    start = time.time()
    with mp.Pool() as pool:
        # Samples are processed in parallel across the worker processes.
        dataset = gluon.data.SimpleDataset(pool.map(preprocess, dataset))
        lengths = gluon.data.SimpleDataset(pool.map(get_length, dataset))
    end = time.time()
    print('Done tokenization. Time = {:.2f}s, num sentences = {}'.format(end - start, len(dataset)))
    return dataset, lengths

train_dataset, train_data_lengths = preprocess_dataset(raw_train_dataset)
test_dataset, test_data_lengths = preprocess_dataset(raw_test_dataset)

Done tokenization. Time = 5.57s, num sentences = 25000
Done tokenization. Time = 7.35s, num sentences = 25000


#### Batchify

In the following code, we use FixedBucketSampler, which assigns each data sample to a fixed bucket based on its length. The bucket keys are either given or generated from the input sequence lengths and the number of buckets.

In [10]:
bucket_num, bucket_ratio = 10, 0.2
batchify_fn = nlp.data.batchify.Tuple(
    nlp.data.batchify.Pad(axis=0, ret_length=True),
    nlp.data.batchify.Stack(dtype='float32'))
batch_sampler = nlp.data.sampler.FixedBucketSampler(
    train_data_lengths,
    batch_size=batch_size,
    num_buckets=bucket_num,
    ratio=bucket_ratio,
    shuffle=True)
print(batch_sampler.stats())

FixedBucketSampler:
  sample_num=25000, batch_num=1551
  key=[59, 108, 157, 206, 255, 304, 353, 402, 451, 500]
  cnt=[590, 1999, 5092, 5102, 3038, 2085, 1477, 1165, 870, 3582]
  batch_size=[27, 16, 16, 16, 16, 16, 16, 16, 16, 16]
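
The printed statistics can be reproduced with a few lines of plain Python. The sketch below is a reconstruction under two assumptions, not the library's code: each sample goes to the smallest bucket whose key is at least its length, and each bucket's batch size is scaled by `ratio` relative to the largest key, never dropping below the base batch size. The helper names `assign_bucket` and `scaled_batch_size` are hypothetical.

```python
import bisect
import math

# Values copied from the sampler statistics printed above.
bucket_keys = [59, 108, 157, 206, 255, 304, 353, 402, 451, 500]
bucket_counts = [590, 1999, 5092, 5102, 3038, 2085, 1477, 1165, 870, 3582]
base_batch_size, ratio = 16, 0.2


def assign_bucket(length):
    """Return the index of the smallest bucket whose key is >= length."""
    return bisect.bisect_left(bucket_keys, length)


def scaled_batch_size(key):
    """Give buckets of short sequences a larger batch, never below the base."""
    scaled = int(max(bucket_keys) / key * ratio * base_batch_size)
    return max(scaled, base_batch_size)


batch_sizes = [scaled_batch_size(key) for key in bucket_keys]
num_batches = sum(math.ceil(count / size)
                  for count, size in zip(bucket_counts, batch_sizes))
print(assign_bucket(60))  # 1, i.e. the bucket with key 108
print(batch_sizes)        # [27, 16, 16, 16, 16, 16, 16, 16, 16, 16]
print(num_batches)        # 1551
```

Under these assumptions, only the shortest bucket gets an enlarged batch (27 instead of 16), and the per-bucket batch counts add up to the reported `batch_num=1551`.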


#### Data Loader

In [11]:
train_dataloader = gluon.data.DataLoader(dataset=train_dataset,
                                         batch_sampler=batch_sampler,
                                         batchify_fn=batchify_fn)

test_dataloader = gluon.data.DataLoader(dataset=test_dataset,
                                        batch_size=batch_size,
                                        shuffle=False,
                                        batchify_fn=batchify_fn)

## Training

In [12]:
def train(net, context, epochs):
    for epoch in range(epochs):
        train_avg_L, train_throughput = utils.train_one_epoch(epoch, trainer, train_dataloader, 
                                                              net, loss, context)
        test_avg_L, test_acc = utils.evaluate(net, test_dataloader, context)
        print('[Epoch {}] train avg loss {:.6f}, test acc {:.2f}, '
              'test avg loss {:.6f}, throughput {:.2f}K wps'.format(
                  epoch, train_avg_L, test_acc, test_avg_L, train_throughput))

In [13]:
train(net, context, epochs)

[Epoch 0 Batch 300/1551] elapsed 13.41 s, avg loss 0.002270, throughput 87.28K wps
[Epoch 0 Batch 600/1551] elapsed 12.97 s, avg loss 0.001673, throughput 95.13K wps
[Epoch 0 Batch 900/1551] elapsed 12.38 s, avg loss 0.001376, throughput 103.13K wps
[Epoch 0 Batch 1200/1551] elapsed 14.13 s, avg loss 0.001504, throughput 79.35K wps
[Epoch 0 Batch 1500/1551] elapsed 13.15 s, avg loss 0.001342, throughput 87.78K wps
Begin Testing...
[Batch 400/1563] elapsed 16.42 s
[Batch 800/1563] elapsed 16.65 s
[Batch 1200/1563] elapsed 16.85 s
[Epoch 0] train avg loss 0.001623, test acc 0.87, test avg loss 0.330541, throughput 90.31K wps


### Evaluate with Reviews

In [14]:
sample = ['This', 'movie', 'is', 'amazing']
test_review = mx.nd.array(vocab[sample], ctx=context)
test_length = mx.nd.array([4], ctx=context)
net(test_review.reshape(-1, 1), test_length).sigmoid()


[[0.7197244]]
<NDArray 1x1 @gpu(0)>

## Practice

- Try with a negative sample. Does the network correctly predict the sentiment?
- Try re-initializing the network without the pre-trained model. Does the pre-trained model provide any advantage?