## Structured Self-attentive Sentence Embedding

This code is based on the [gluon-nlp tutorial](https://gluon-nlp.mxnet.io/examples/sentence_embedding/self_attentive_sentence_embedding.html).

## Import Related Packages

In [1]:
import os
import json
import zipfile
import time
import itertools

import numpy as np
import mxnet as mx
import multiprocessing as mp
import gluonnlp as nlp

from mxnet import autograd, gluon, init, nd
from mxnet.gluon import nn, rnn
from d2l import try_gpu
import pandas as pd

# Use sklearn's metric functions to evaluate the results of the experiment
from sklearn.metrics import accuracy_score, f1_score

# fix the random number seeds for reproducibility
np.random.seed(9102)
mx.random.seed(9102)

## Data pipeline

### Load Dataset

See [dataloader](data_loader.ipynb) for how the training samples are obtained.

In [2]:
data_folder = 'data/imdb/'
file_name = 'train.csv'
file_path = data_folder + file_name

# load the first nrows rows of the csv file
nrows = 5000
data = pd.read_csv(file_path, nrows=nrows)

# create a list of (review, plot summary, label) triples.
dataset = [[left, right, int(label)] for left, right, label in \
           zip(data['review_text'], data['plot_summary'], data['is_spoiler'])]

# randomly hold out 20 percent of the training set as a validation set.
train_dataset, valid_dataset = nlp.data.train_valid_split(dataset, 0.2)
len(train_dataset), len(valid_dataset)

(4000, 1000)

### Data Processing

The following code processes the raw data into a form the model can consume for training and prediction. We use `SpacyTokenizer` to split each document into tokens and `ClipSequence` to truncate reviews and plot summaries to fixed maximum lengths, then build a vocabulary from the word frequencies of the training data. Next we attach pre-trained [GloVe](https://nlp.stanford.edu/pubs/glove.pdf)[2] word vectors to the vocabulary and convert each token to its index in the vocabulary.
The result is a standardized training set and validation set.
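
The steps described above can be sketched without gluonnlp. This is a minimal, hypothetical illustration: whitespace splitting stands in for `SpacyTokenizer`, and `clip`, `build_vocab` and `to_indices` are names made up for this sketch, not gluonnlp APIs.

```python
from collections import Counter

def clip(tokens, max_len):
    # keep at most max_len tokens, similar in spirit to ClipSequence
    return tokens[:max_len]

def build_vocab(token_lists, max_size):
    # index 0 is reserved for unknown tokens; the rest are the most frequent tokens
    counter = Counter(tok for toks in token_lists for tok in toks)
    idx_to_token = ['<unk>'] + [tok for tok, _ in counter.most_common(max_size)]
    return {tok: i for i, tok in enumerate(idx_to_token)}

def to_indices(tokens, token_to_idx):
    # out-of-vocabulary tokens map to the <unk> index
    return [token_to_idx.get(tok, 0) for tok in tokens]

reviews = ['the movie was great', 'the ending was a surprise to the audience']
tokenized = [clip(r.lower().split(), 5) for r in reviews]
token_to_idx = build_vocab(tokenized, max_size=3)
indexed = [to_indices(toks, token_to_idx) for toks in tokenized]
```

Capping the vocabulary size means rare tokens share the `<unk>` index, which is the same trade-off made by `max_size=20000` in the real pipeline.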

In [3]:
# tokenizer takes as input a string and outputs a list of tokens.
tokenizer = nlp.data.SpacyTokenizer('en')

# ClipSequence takes a list as input and truncates it to a maximum length:
# 300 tokens for reviews, 100 tokens for plot summaries.
length_clip_review = nlp.data.ClipSequence(300)
length_clip_plot = nlp.data.ClipSequence(100)

def preprocess(x):
    # each sample is (review, plot summary, label)
    left, right, label = x[0], x[1], int(x[2])
    assert label in (0, 1)
    # non-string text fields (e.g. missing values) are reported and coerced to str
    if not (isinstance(left, str) and isinstance(right, str)):
        print(left, right)
        left, right = str(left), str(right)
    # tokenize, then clip the review and the plot summary to their maximum lengths
    left, right = length_clip_review(tokenizer(left.lower())), \
                  length_clip_plot(tokenizer(right.lower()))
    return left, right, label

def get_length(x):
    return float(len(x[0]))

def preprocess_dataset(dataset):
    start = time.time()

    with mp.Pool() as pool:
        # Samples are processed in parallel across the worker processes.
        dataset = gluon.data.ArrayDataset(pool.map(preprocess, dataset))
        lengths = gluon.data.ArrayDataset(pool.map(get_length, dataset))
    end = time.time()

    print('Done! Tokenizing Time={:.2f}s, #Sentences={}'.format(end - start, len(dataset)))
    return dataset, lengths

# Preprocess the dataset
train_dataset, train_data_lengths = preprocess_dataset(train_dataset)
valid_dataset, valid_data_lengths = preprocess_dataset(valid_dataset)

Done! Tokenizing Time=1.98s, #Sentences=4000
Done! Tokenizing Time=0.76s, #Sentences=1000


In [4]:
# create vocab
train_seqs = [sample[0]+sample[1] for sample in train_dataset]
counter = nlp.data.count_tokens(list(itertools.chain.from_iterable(train_seqs)))

vocab = nlp.Vocab(counter, max_size=20000)

# load the pre-trained GloVe embedding
embedding_weights = nlp.embedding.GloVe(source='glove.twitter.27B.200d')
vocab.set_embedding(embedding_weights)
print(vocab)

# NOTE: both inputs go through the same encoder, so they share one vocabulary.
# Padding is not done here; it is handled later by the batchify function.
def token_to_idx(x):
    return vocab[x[0]], vocab[x[1]], x[2]

# A token index or a list of token indices is returned according to the vocabulary.
with mp.Pool() as pool:
    train_dataset = pool.map(token_to_idx, train_dataset)
    valid_dataset = pool.map(token_to_idx, valid_dataset)

Vocab(size=20004, unk="<unk>", reserved="['<pad>', '<bos>', '<eos>']")


In [5]:
idx = vocab.embedding.token_to_idx['xmen']
idx, vocab.embedding.idx_to_vec[idx]

(8473, 
 [ 7.3501e-01  2.3659e-01  8.1485e-02  6.9116e-02 -8.2966e-02  5.1736e-01
  -2.2106e-01  1.2918e-01 -1.4550e-01  1.8944e-01  3.4552e-01 -2.7266e-01
   2.5191e-01  4.1132e-02 -2.9738e-01 -1.9861e-01  2.3085e-01  2.0603e-01
  -4.2740e-02  2.6072e-01 -6.5018e-01  5.0332e-02  1.6095e-01 -5.4888e-01
  -2.7124e-01  1.1846e-01 -5.7249e-02 -3.1976e-01 -1.1128e-01  1.8404e-01
  -2.0640e-01  4.7762e-01  3.5954e-02 -1.4585e-02 -6.9505e-02  6.1174e-01
   4.4875e-01 -5.2554e-01 -1.5218e-01  3.7220e-01  3.6481e-01 -4.6937e-01
  -9.1761e-02 -6.8574e-01 -5.4097e-01  8.5439e-01 -3.5236e-01 -2.0113e-01
  -2.0686e-01 -1.0377e+00  4.0393e-01 -7.6687e-01 -2.3442e-01  2.8207e-01
  -8.8382e-01 -5.5162e-03  6.4569e-01 -6.6990e-02  6.1193e-01  7.7833e-02
   1.2878e-01  1.1748e-01 -4.4662e-01 -1.5219e-01 -9.0765e-01  6.0395e-01
   1.4690e-01  2.9322e-01 -2.7783e-01 -5.7250e-01 -3.1507e-01  2.0820e-01
  -1.5681e-01  1.1861e+00  5.2476e-01  3.4866e-01 -4.0212e-01 -3.9851e-01
   2.9129e-01 -2.7880e-01  2.8

## Bucketing and DataLoader
Since sentences have different lengths, we use `Pad` to pad the sentences in a minibatch to equal length so that each batch can be stored as a single dense tensor and processed efficiently on the GPU. We use `Stack` to stack the labels of a batch, and `Tuple` to combine the two `Pad` functions (one for reviews, one for plot summaries) with `Stack`.

To keep the amount of padding in each minibatch small, sentences of similar length should be placed in the same batch. We therefore build a sampler with `FixedBucketSampler`, which assigns each sample to a bucket by its length and draws batches from within a bucket.

Finally, we use `DataLoader` to build data loaders for the training and validation datasets. The training loader uses the `FixedBucketSampler`; the validation loader does not need a sampler.
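
To make the benefit of bucketing concrete, here is a rough plain-Python sketch of the idea. The helpers `pad_batch` and `bucket_by_length` are hypothetical names for this illustration and only approximate what `Pad` and `FixedBucketSampler` do.

```python
def pad_batch(seqs, pad_val=0):
    # pad every sequence on the right to the longest length in the batch
    max_len = max(len(s) for s in seqs)
    return [s + [pad_val] * (max_len - len(s)) for s in seqs]

def bucket_by_length(lengths, bucket_keys):
    # assign each sample index to the first bucket whose key (max length) fits it
    buckets = {key: [] for key in bucket_keys}
    for idx, length in enumerate(lengths):
        for key in bucket_keys:
            if length <= key:
                buckets[key].append(idx)
                break
    return buckets

seqs = [[5, 2], [7, 1, 4, 9], [3], [8, 8, 8]]
buckets = bucket_by_length([len(s) for s in seqs], bucket_keys=[2, 4])
batches = [pad_batch([seqs[i] for i in ids]) for ids in buckets.values() if ids]

# count padding tokens with and without bucketing
pad_bucketed = sum(row.count(0) for batch in batches for row in batch)
pad_single = sum(row.count(0) for row in pad_batch(seqs))
```

In this toy case bucketing needs 2 padding tokens, while padding all four sequences into one batch needs 6.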

In [6]:
batch_size = 64
bucket_num = 10
bucket_ratio = 0.5


def get_dataloader():
    # Build the batchify function: pad reviews and plot summaries, stack labels
    batchify_fn = nlp.data.batchify.Tuple(nlp.data.batchify.Pad(axis=0), \
                                          nlp.data.batchify.Pad(axis=0),
                                          nlp.data.batchify.Stack())

    # In this example, we use a FixedBucketSampler,
    # which assigns each data sample to a fixed bucket based on its length.
    batch_sampler = nlp.data.sampler.FixedBucketSampler(
        train_data_lengths,
        batch_size=batch_size,
        num_buckets=bucket_num,
        ratio=bucket_ratio,
        shuffle=True)
    print(batch_sampler.stats())

    # train_dataloader
    train_dataloader = gluon.data.DataLoader(
        dataset=train_dataset,
        batch_sampler=batch_sampler,
        batchify_fn=batchify_fn)
    # valid_dataloader
    valid_dataloader = gluon.data.DataLoader(
        dataset=valid_dataset,
        batch_size=batch_size,
        shuffle=False,
        batchify_fn=batchify_fn)
    return train_dataloader, valid_dataloader

train_dataloader, valid_dataloader = get_dataloader()

FixedBucketSampler:
  sample_num=4000, batch_num=64
  key=[48, 76, 104, 132, 160, 188, 216, 244, 272, 300]
  cnt=[23, 114, 160, 251, 493, 484, 365, 245, 231, 1634]
  batch_size=[200, 126, 92, 72, 64, 64, 64, 64, 64, 64]


In [9]:
# experiment on the two dataloaders
# for left, right, label in train_dataloader:
    # print(left.shape, right.shape, label.shape)

When the labels are highly imbalanced, weighting the loss differently per class may improve the model's performance. This can be seen as a method to correct [covariate shift](http://d2l.ai/chapter_multilayer-perceptrons/environment.html).
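
As a minimal NumPy sketch of this idea (not the MXNet implementation that follows), each sample's cross-entropy can be scaled by the weight of its true class. The function name `weighted_softmax_ce` and the example weights are illustrative assumptions.

```python
import numpy as np

def weighted_softmax_ce(logits, labels, class_weight):
    # numerically stable log-softmax over the class axis
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_prob = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # one-hot labels scaled by the weight of their class
    one_hot = np.eye(len(class_weight))[labels]
    return -(log_prob * one_hot * class_weight).sum(axis=-1)

logits = np.array([[2.0, 0.0], [0.0, 2.0]])
labels = np.array([0, 1])
plain = weighted_softmax_ce(logits, labels, np.array([1.0, 1.0]))
weighted = weighted_softmax_ce(logits, labels, np.array([1.0, 3.0]))
```

With weights `[1, 3]`, the loss of the class-1 sample is three times its unweighted value, while the class-0 sample is unchanged.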

In [48]:
class WeightedSoftmaxCE(nn.HybridBlock):
    def __init__(self, sparse_label=True, from_logits=False,  **kwargs):
        super(WeightedSoftmaxCE, self).__init__(**kwargs)
        with self.name_scope():
            self.sparse_label = sparse_label
            self.from_logits = from_logits

    def hybrid_forward(self, F, pred, label, class_weight, depth=None):
        if self.sparse_label:
            label = F.reshape(label, shape=(-1, ))
            label = F.one_hot(label, depth)
        if not self.from_logits:
            pred = F.log_softmax(pred, -1)

        weight_label = F.broadcast_mul(label, class_weight)
        loss = -F.sum(pred * weight_label, axis=-1)

        # return F.mean(loss, axis=0, exclude=True)
        return loss

In [49]:
class AlignAttention(nn.HybridBlock):
    def __init__(self, **kwargs):
        super(AlignAttention, self).__init__(**kwargs)
    
    def hybrid_forward(self, F, inp_left, inp_right):
        # input dimension is of (batch_size, seq_len, embed_size)
        # att dimension is of (batch_size, seq_len_left, seq_len_right)
        att = F.batch_dot(inp_left, F.transpose(inp_right, axes = (0, 2, 1)))
        # inp_left_dot dimension is of (batch_size, seq_left, embed_size)
        inp_left_dot = F.batch_dot(F.softmax(att, axis=-1), inp_right)
        # inp_right_dot dimension is of (batch_size, seq_right, embed_size)
        inp_right_dot = F.batch_dot(F.softmax(F.transpose(att, axes=(0, 2, 1)), axis=-1), inp_left)
        # concat (lstm output, aligned vector, difference, elementwise product)
        # therefore, the output size is (batch_size, seq_len_left/right, embed_size*4)
        aug_left = F.concat(inp_left, inp_left_dot, inp_left-inp_left_dot, inp_left*inp_left_dot, dim=-1)
        aug_right = F.concat(inp_right, inp_right_dot, inp_right-inp_right_dot, inp_right*inp_right_dot, dim=-1)
        return aug_left, aug_right
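
The shape bookkeeping of this alignment step is easier to follow in a small NumPy sketch. The version below mirrors the block above under the assumption of random inputs with illustrative sizes; `softmax` and `align` are helper names introduced for this sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def align(left, right):
    # left: (batch, len_left, dim), right: (batch, len_right, dim)
    att = left @ right.transpose(0, 2, 1)                        # (batch, len_left, len_right)
    left_dot = softmax(att, axis=-1) @ right                     # (batch, len_left, dim)
    right_dot = softmax(att.transpose(0, 2, 1), axis=-1) @ left  # (batch, len_right, dim)
    aug_left = np.concatenate(
        [left, left_dot, left - left_dot, left * left_dot], axis=-1)
    aug_right = np.concatenate(
        [right, right_dot, right - right_dot, right * right_dot], axis=-1)
    return aug_left, aug_right

rng = np.random.default_rng(0)
left = rng.normal(size=(2, 5, 8))
right = rng.normal(size=(2, 3, 8))
aug_left, aug_right = align(left, right)
print(aug_left.shape, aug_right.shape)  # (2, 5, 32) (2, 3, 32)
```

Each side keeps its own sequence length, and the feature size grows by a factor of four because of the concatenation.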

In [50]:
class SelfAttentiveBiLSTM(nn.HybridBlock):
    def __init__(self, vocab_len, embsize, nhidden, nlayers, natt_unit, natt_hops, \
                 nfc, nclass, # nfc is not used currently
                 drop_prob, pool_way, prune_p=None, prune_q=None, **kwargs):
        super(SelfAttentiveBiLSTM, self).__init__(**kwargs)
        with self.name_scope():
            # now we switch back to shared layers
            self.drop_prob = drop_prob
            self.pool_way = pool_way
            # word embedding
            self.embedding_layer = nn.Embedding(vocab_len, embsize)
            # lstm
            self.bilstm1 = rnn.LSTM(nhidden, num_layers=nlayers, dropout=drop_prob, bidirectional=True)
            self.bilstm2 = rnn.LSTM(nhidden, num_layers=1, dropout=drop_prob, bidirectional=True)
            # enhancement
            self.align_att = AlignAttention()
            # this layer is used to output the final class
            self.output_layer = nn.Dense(nclass, activation='tanh')

    def hybrid_forward(self, F, inp_left, inp_right):
        # inp is a list containing left_text and right_text
        # their size: [batch, token_idx]
        # inp_embed_left/right size: [batch, seq_len, embed_size]
        inp_embed_left = self.embedding_layer(inp_left)
        inp_embed_right = self.embedding_layer(inp_right)
        # rnn requires the first dimension to be the time steps, output is (seq_len, batch_size, 2*nhidden)
        h_output_left = self.bilstm1(F.transpose(inp_embed_left, axes=(1, 0, 2)))
        h_output_right = self.bilstm1(F.transpose(inp_embed_right, axes=(1, 0, 2)))
        left_aligned, right_aligned = self.align_att(F.transpose(h_output_left, axes=(1, 0, 2)), \
                                                     F.transpose(h_output_right, axes=(1, 0, 2)))
        # apply another layer of lstm
        left_composed = self.bilstm2(F.transpose(left_aligned, axes=(1, 0, 2)))
        right_composed = self.bilstm2(F.transpose(right_aligned, axes=(1, 0, 2)))
        # FIXME: more elaborate output layer
        # The two sides have different sequence lengths, so they cannot be subtracted
        # directly. As a stop-gap (an assumption, not a tuned design), average over the
        # time axis to get (batch_size, 2*nhidden) per side before the output layer.
        left_pooled = F.mean(left_composed, axis=0)
        right_pooled = F.mean(right_composed, axis=0)
        output = self.output_layer(left_pooled - right_pooled)

        return output

## Configure parameters and build models

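In the original gluon-nlp tutorial, the attention output `M` is a matrix, and it can be turned into a classifier input by `flatten`, `mean` or `prune`. Prune is the parameter-trimming method proposed in the original paper. The `pool_way`, `prune_p` and `prune_q` settings below are carried over from that tutorial; the model defined above stores `pool_way` but does not otherwise use these settings.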
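
As a rough illustration of the first two options, the NumPy sketch below assumes `M` has shape `(batch, natt_hops, features)`. The sizes are illustrative only, and `prune` is omitted.

```python
import numpy as np

# M stands in for an attention-weighted sentence embedding matrix
batch, hops, features = 4, 20, 600
M = np.ones((batch, hops, features))

flat = M.reshape(batch, -1)   # 'flatten': keep every hop, (batch, hops * features)
mean = M.mean(axis=1)         # 'mean': average over hops, (batch, features)
print(flat.shape, mean.shape)  # (4, 12000) (4, 600)
```

Flattening preserves what each hop attends to at the cost of a much larger classifier input, while averaging keeps the input small.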

In [51]:
vocab_len = len(vocab)
emsize = 200   # word embedding size
nhidden = 300    # lstm hidden_dim
nlayers = 4     # lstm layers
natt_unit = 300     # the hidden_units of attention layer
natt_hops = 20    # the channels of attention

# nfc is not used by the current model; nclass is the number of output classes
nfc = 512
nclass = 2

drop_prob = 0
pool_way = 'flatten'    # the way to handle M
prune_p = None
prune_q = None

ctx = try_gpu()

model = SelfAttentiveBiLSTM(vocab_len, emsize, nhidden, nlayers,
                            natt_unit, natt_hops, nfc, nclass,
                            drop_prob, pool_way, prune_p, prune_q)

model.initialize(init=init.Xavier(), ctx=ctx)
model.hybridize()

# Attach a pre-trained glove word vector to the embedding layer
model.embedding_layer.weight.set_data(vocab.embedding.idx_to_vec)
# freeze the embedding layer so its weights are not updated
model.embedding_layer.collect_params().setattr('grad_req', 'null')

print(model)

SelfAttentiveBiLSTM(
  (embedding_layer): Embedding(20004 -> 200, float32)
  (bilstm1): LSTM(None -> 300, TNC, num_layers=4, bidirectional)
  (bilstm2): LSTM(None -> 300, TNC, bidirectional)
  (align_att): AlignAttention(
  
  )
  (output_layer): Dense(None -> 2, Activation(tanh))
)


In [52]:
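def calculate_loss(x_left, x_right, y, model, loss, class_weight, penal_coeff):
    # penal_coeff is accepted for compatibility but not used by this model
    pred = model(x_left, x_right)
    y = nd.array(y.asnumpy().astype('int32')).as_in_context(ctx)
    # NOTE: loss_name and ctx are read from the notebook's global scope
    if loss_name in ['sce', 'l1', 'l2']:
        l = loss(pred, y)
    elif loss_name == 'wsce':
        l = loss(pred, y, class_weight, class_weight.shape[0])
    else:
        raise NotImplementedError('unknown loss: {}'.format(loss_name))
    return pred, l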

In [53]:
def one_epoch(data_iter, model, loss, trainer, ctx, is_train, epoch,
              penal_coeff=0.0, clip=None, class_weight=None, loss_name='sce'):

    loss_val = 0.
    total_pred = []
    total_true = []
    n_batch = 0

    for batch_x_left, batch_x_right, batch_y in data_iter:
        batch_x_left = batch_x_left.as_in_context(ctx)
        batch_x_right = batch_x_right.as_in_context(ctx)
        batch_y = batch_y.as_in_context(ctx)

        if is_train:
            with autograd.record():
                batch_pred, l = calculate_loss(batch_x_left, batch_x_right, \
                                               batch_y, model, loss, class_weight, \
                                               penal_coeff)

            # backward pass
            l.backward()

            # clip gradient
            clip_params = [p.data() for p in model.collect_params().values()]
            if clip is not None:
                norm = nd.array([0.0], ctx)
                for param in clip_params:
                    if param.grad is not None:
                        norm += (param.grad ** 2).sum()
                norm = norm.sqrt().asscalar()
                if norm > clip:
                    for param in clip_params:
                        if param.grad is not None:
                            param.grad[:] *= clip / norm

            # update params
            trainer.step(batch_x_left.shape[0])

        else:
            batch_pred, l = calculate_loss(batch_x_left, batch_x_right, \
                                           batch_y, model, loss, class_weight, \
                                           penal_coeff)

        # keep result for metric
        batch_pred = nd.argmax(nd.softmax(batch_pred, axis=1), axis=1).asnumpy()
        batch_true = np.reshape(batch_y.asnumpy(), (-1, ))
        total_pred.extend(batch_pred.tolist())
        total_true.extend(batch_true.tolist())
        
        batch_loss = l.mean().asscalar()

        n_batch += 1
        loss_val += batch_loss

        # check the result of the training phase
        if is_train and n_batch % 400 == 0:
            print('epoch %d, batch %d, batch_train_loss %.4f, batch_train_acc %.3f' %
                  (epoch, n_batch, batch_loss, accuracy_score(batch_true, batch_pred)))

    # metric
    F1 = f1_score(np.array(total_true), np.array(total_pred), average='binary')
    acc = accuracy_score(np.array(total_true), np.array(total_pred))
    loss_val /= n_batch

    if is_train:
        print('epoch %d, learning_rate %.5f \n\t train_loss %.4f, acc_train %.3f, F1_train %.3f, ' %
              (epoch, trainer.learning_rate, loss_val, acc, F1))
        # decay lr
        if epoch % 3 == 0:
            trainer.set_learning_rate(trainer.learning_rate * 0.9)
    else:
        print('\t valid_loss %.4f, acc_valid %.3f, F1_valid %.3f, ' % (loss_val, acc, F1))
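
The gradient clipping inside the training loop rescales all gradients by a common factor when their combined L2 norm exceeds `clip`. A minimal NumPy sketch of that rule, with the hypothetical helper name `clip_by_global_norm` and made-up gradient values:

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    # L2 norm of all gradients taken together
    total_norm = np.sqrt(sum(float((g ** 2).sum()) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        for g in grads:
            g *= scale  # rescale in place
    return total_norm

grads = [np.array([3.0, 0.0]), np.array([[0.0, 4.0]])]
norm_before = clip_by_global_norm(grads, max_norm=0.5)
norm_after = np.sqrt(sum(float((g ** 2).sum()) for g in grads))
```

Here the norm goes from 5.0 to 0.5. Because every gradient is scaled by the same factor, the update direction is preserved and only its magnitude is limited.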


In [54]:
def train_valid(data_iter_train, data_iter_valid, model, loss, trainer, ctx, nepochs,
                penal_coeff=0.0, clip=None, class_weight=None, loss_name='sce'):

    for epoch in range(1, nepochs+1):
        start = time.time()
        # train
        is_train = True
        one_epoch(data_iter_train, model, loss, trainer, ctx, is_train,
                  epoch, penal_coeff, clip, class_weight, loss_name)

        # valid
        is_train = False
        one_epoch(data_iter_valid, model, loss, trainer, ctx, is_train,
                  epoch, penal_coeff, clip, class_weight, loss_name)
        end = time.time()
        print('time %.2f sec' % (end-start))
        print("*"*100)


## Train
We now train the model. Setting `loss_name` to `'wsce'` selects `WeightedSoftmaxCE`, which can alleviate class imbalance; the configuration below uses plain softmax cross-entropy (`'sce'`).

In [55]:
class_weight = None
loss_name = 'sce'
optim = 'adam'
lr = 0.001
penal_coeff = 0.003
clip = .5
nepochs = 20

trainer = gluon.Trainer(model.collect_params(), optim, {'learning_rate': lr})

if loss_name == 'sce':
    loss = gluon.loss.SoftmaxCrossEntropyLoss()
elif loss_name == 'wsce':
    loss = WeightedSoftmaxCE()
    # the value of class_weight is obtained by counting data in advance. It can be seen as a hyperparameter.
    class_weight = nd.array([1., 3.], ctx=ctx)
elif loss_name == 'l1':
    loss = gluon.loss.L1Loss()
elif loss_name == 'l2':
    loss = gluon.loss.L2Loss()

In [56]:
# train and valid
train_valid(train_dataloader, valid_dataloader, model, loss, \
            trainer, ctx, nepochs, penal_coeff=penal_coeff, \
            clip=clip, class_weight=class_weight, loss_name=loss_name)

TypeError: hybrid_forward() got multiple values for argument 'weight'

## Predict
Now we feed the model a hand-written movie plot summary and a review to see whether it classifies the review as a spoiler.

In [38]:

'''
import warnings
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    model = gluon.nn.SymbolBlock.imports("model/att-symbol.json", ['data'], \
                                         "model/att-0001.params", ctx=ctx)
'''

AssertionError: Parameter 'embedding0_weight' is missing in file 'model/att-0001.params', which contains parameters: 'selfattentivebilstm2_selfattentivebilstm12_embedding0_weight', 'selfattentivebilstm2_selfattentivebilstm12_lstm0_l0_i2h_weight', 'selfattentivebilstm2_selfattentivebilstm12_lstm0_l0_h2h_weight', ..., 'selfattentivebilstm2_selfattentivebilstm12_selfattention0_dense1_weight', 'selfattentivebilstm2_selfattentivebilstm12_selfattention0_dense1_bias', 'selfattentivebilstm2_selfattentivebilstm12_dense0_weight', 'selfattentivebilstm2_selfattentivebilstm12_dense0_bias'. Please make sure source and target networks have the same prefix.

In [44]:
right = 'john is a former police, his daughter, sarah was killed in a car accident, which he \
        did not think was purely accidental. with his investigation went deeper, he found \
        that sarah\'s boyfriend, jack, was the one to blame.'
left = 'Word embedding can effectively represent the semantic similarity between words, \
        which brings many breakthroughs for natural language processing tasks. \
        The attention mechanism can intuitively grasp the important semantic features \
        in the sentence. I liked sarah and jack in this movie, but hate john, because he \
        killed sarah'
# lowercase before tokenizing, to match the preprocessing used for training
left_token = vocab[tokenizer(left.lower())]
right_token = vocab[tokenizer(right.lower())]

In [45]:
left_input = nd.array(left_token, ctx=ctx).reshape(1,-1)
right_input = nd.array(right_token, ctx=ctx).reshape(1,-1)
pred = model(left_input, right_input)
pred
#model.embedding_layer_right.weight.list_data()

(
 [[ 0.28886735 -0.08389135]]
 <NDArray 1x2 @gpu(0)>, (1, 20, 56), (1, 20, 49))

To get an intuition for what the attention mechanism does, we visualize the model's attention weights on the predicted sample.