# Assignment 9 - NLP using Deep Learning

## Goals

In this assignment you will get to work with recurrent network architectures with application to language processing tasks and observe behaviour of the learning using tensorboard visualization.

You'll learn to use

 * word embeddings,
 * LSTMs,
 * tensorboard visualization to develop and tune deep learning architectures.


## Use the deep learning environment in the lab

With the same kind of preparation as in [Assignment 6](../A6/A6.html) we are going to use [pytorch](http://pytorch.org) for the deep learning aspects of the assignment. 

There is a `pytorch` setup in the big data lab under the globally available anaconda installation.
However, it is recommended that you use the custom **py36** conda environment that contains all python package dependencies that are relevant for this assignment (and also nltk, gensim, tensorflow, keras, and tensorboard).

Either you load it directly
```
source activate /usr/shared/CMPT/big-data/condaenv/py36
```
or you prepare
```
cd ~
mkdir -p .conda/envs
ln -s /usr/shared/CMPT/big-data/condaenv/py36 .conda/envs
```
and from thereon simply use
```
source activate py36
```

Also, there are some relevant datasets available in our shared folder.

In [None]:
import os
bdenv_loc = '/usr/shared/CMPT/big-data'
bdata = os.path.join(bdenv_loc,'data')

# Task 1: Explore Word Embeddings

Word embeddings are mappings between words and multi-dimensional vectors, where the difference between two word vectors has some relationship with the meaning of the corresponding words, i.e. words that are similar in meaning are mapped closely together (ideally). This part of the assignment should enable you to

* Load a pretrained word embedding
* Perform basic operations, such as distance queries and evaluate simple analogies

Note, each of the tasks below can be addressed with one or two lines of code using the [word2vec API in gensim](https://radimrehurek.com/gensim/models/word2vec.html).

In [None]:
import gensim
# Load Google's pre-trained Word2Vec model, trained on news articles
model = gensim.models.KeyedVectors.load_word2vec_format(os.path.join('data','GoogleNews-vectors-negative300.bin'), binary=True)

Obtain a vector representation for a word of your choice.
To confirm that this worked, print out the number of elements of the vector.

In [None]:
vector = model['chocolate']
vector.size

Determine the 10 words that are closest in the embedding to the word vector you produced above.

In [None]:
model.most_similar(positive=[vector], topn=10)

Are the nearest neighbours similar in meaning?
Try different seed words, until you find one whose neighbourhood looks OK.

Using a combination of positive and negative words, find out which word is most
similar to `woman + king - man`. Note that gensim's API allows you to combine positive and negative words without explicitly obtaing their vectors.

In [None]:
model.most_similar(positive=['woman','king'], negative=['man'])

You may find that the results of most word analogy combinations don't work as well as we'd hope.

Explore a bit and *show two more cases* where the output of gensim's built-in word vector algebra looks somewhat meaningful, i.e. show more word analogy examples or produce lists of words where a word that doesn't match is identified.

In [None]:
model.most_similar(positive=['woman','actor'],negative=['man'])

#### 'LIEV_SCHREIBER' is a man

In [None]:
model.most_similar(positive=['animal','reptile'], negative=['mammal'])

#### 'DOG' is not a reptile

# Task 2: Sequence modeling with RNNs

In this task you will get to use a learning and a rule-based model of text sentiment analysis. To keep things simple, you will receive almost all the code and are just left with the task to tune the given algorithms, see the part about instrumentation below.

First let's create a simple LSTM model that is capable of producing a label for a sequence of vector encoded words, based on code from [this repo](https://github.com/clairett/pytorch-sentiment-classification).

In [4]:
import torch
from torch.autograd import Variable
import torch.nn as nn
import torch.nn.functional as F

class LSTMSentiment(nn.Module):

    def __init__(self, embedding_dim, hidden_dim, vocab_size, label_size,
                 use_gpu, batch_size, dropout=0.5, bidirectional=False):
        """Prepare individual layers"""
        super(LSTMSentiment, self).__init__()
        self.hidden_dim = hidden_dim
        self.use_gpu = use_gpu
        self.batch_size = batch_size
        self.dropout = dropout
        self.num_directions = 2 if bidirectional else 1
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(input_size=embedding_dim, hidden_size=hidden_dim, bidirectional=bidirectional)
        self.hidden2label = nn.Linear(hidden_dim*self.num_directions, label_size)
        self.hidden = self.init_hidden()

    def init_hidden(self):
        """Choose appropriate size and type of hidden layer"""
        # first is the hidden h
        # second is the cell c
        if self.use_gpu:
            return (Variable(torch.zeros(self.num_directions, self.batch_size, self.hidden_dim).cuda()),
                    Variable(torch.zeros(self.num_directions, self.batch_size, self.hidden_dim).cuda()))
        else:
            return (Variable(torch.zeros(self.num_directions, self.batch_size, self.hidden_dim)),
                    Variable(torch.zeros(self.num_directions, self.batch_size, self.hidden_dim)))

    def forward(self, sentence):
        """Use the layers of this model to propagate input and return class log probabilities"""
        if self.use_gpu:
            sentence = sentence.cuda()
        x = self.embeddings(sentence).view(len(sentence), self.batch_size, -1)
        lstm_out, self.hidden = self.lstm(x, self.hidden)
        
        y = self.hidden2label(lstm_out[-1])
        log_probs = F.log_softmax(y, dim=0)
        return log_probs


In [5]:
from torch import optim
import time, random
import os
from tqdm import tqdm_notebook as tqdm
tqdm.write = print
from torchtext import data
import numpy as np
import argparse

torch.set_num_threads(8)
torch.manual_seed(1)
random.seed(1)


def load_bin_vec(fname, vocab):
    """
    Loads 300x1 word vecs from Google (Mikolov) word2vec
    """
    word_vecs = {}
    with open(fname, "rb") as f:
        header = f.readline()
        vocab_size, layer1_size = map(int, header.split())
        binary_len = np.dtype('float32').itemsize * layer1_size
        for line in range(vocab_size):
            word = []
            while True:
                ch = f.read(1).decode('latin-1')
                if ch == ' ':
                    word = ''.join(word)
                    break
                if ch != '\n':
                    word.append(ch)
            if word in vocab:
               word_vecs[word] = np.frombuffer(f.read(binary_len), dtype='float32')
            else:
                f.read(binary_len)
    return word_vecs


def get_accuracy(truth, pred):
    assert len(truth) == len(pred)
    right = 0
    for i in range(len(truth)):
        if truth[i].item() == pred[i]:
            right += 1.0
    return right / len(truth)


def train_epoch_progress(model, train_iter, loss_function, optimizer, text_field, label_field, epoch):
    model.train()
    avg_loss = 0.0
    truth_res = []
    pred_res = []
    count = 0
    for batch in tqdm(train_iter, desc='Train epoch '+str(epoch+1)):
        sent, label = batch.text, batch.label
        label.data.sub_(1)
        truth_res += list(label.data)
        model.batch_size = len(label.data)
        model.hidden = model.init_hidden()
        pred = model(sent)
        if USE_GPU:
            pred_label = pred.data.max(1)[1].cpu().numpy()
        else:
            pred_label = pred.data.max(1)[1].numpy()
        pred_res += [x for x in pred_label]
        model.zero_grad()
        loss = loss_function(pred, label)
        avg_loss += loss.data.item()
        count += 1
        loss.backward()
        optimizer.step()
    avg_loss /= len(train_iter)
    acc = get_accuracy(truth_res, pred_res)
    return avg_loss, acc


def train_epoch(model, train_iter, loss_function, optimizer):
    model.train()
    avg_loss = 0.0
    truth_res = []
    pred_res = []
    count = 0
    for batch in train_iter:
        sent, label = batch.text, batch.label
        label.data.sub_(1)
        truth_res += list(label.data)
        model.batch_size = len(label.data)
        model.hidden = model.init_hidden()
        pred = model(sent)
        if USE_GPU:
            pred_label = pred.data.max(1)[1].cpu().numpy()
        else:
            pred_label = pred.data.max(1)[1].numpy()
        pred_res += [x for x in pred_label]
        model.zero_grad()
        loss = loss_function(pred, label)
        avg_loss += loss.data.item()
        count += 1
        loss.backward()
        optimizer.step()
    avg_loss /= len(train_iter)
    acc = get_accuracy(truth_res, pred_res)
    return avg_loss, acc


def evaluate(model, data, loss_function, name):
    model.eval()
    avg_loss = 0.0
    truth_res = []
    pred_res = []
    for batch in data:
        sent, label = batch.text, batch.label
        label.data.sub_(1)
        truth_res += list(label.data)
        model.batch_size = len(label.data)
        model.hidden = model.init_hidden()
        pred = model(sent)
        if USE_GPU:
            pred_label = pred.data.max(1)[1].cpu().numpy()
        else:
            pred_label = pred.data.max(1)[1].numpy()
        pred_res += [x for x in pred_label]
        loss = loss_function(pred, label)
        avg_loss += loss.data.item()
    avg_loss /= len(data)
    acc = get_accuracy(truth_res, pred_res)
    print(name + ': loss %.2f acc %.1f' % (avg_loss, acc*100))
    return avg_loss, acc


def load_sst(text_field, label_field, batch_size, use_gpu=True):
    train, dev, test = data.TabularDataset.splits(path=os.path.join('data','sst2'), train='train.tsv',
                                                  validation='dev.tsv', test='test.tsv', format='tsv',
                                                  fields=[('text', text_field), ('label', label_field)])
    text_field.build_vocab(train, dev, test)
    label_field.build_vocab(train, dev, test)
    train_iter, dev_iter, test_iter = data.BucketIterator.splits((train, dev, test),
                                                                 batch_sizes=(batch_size, len(dev), len(test)),
                                                                 sort_key=lambda x: len(x.text), repeat=False,
                                                                 device=torch.device("cuda" if use_gpu else "cpu"))
    return train_iter, dev_iter, test_iter


**TODO:** After instrumentation with the summary writer (see further below), tune these parameters to improve the performance of the model.

In [7]:
EPOCHS = 20
USE_GPU = torch.cuda.is_available()
#EMBEDDING_TYPE = 'glove'
EMBEDDING_TYPE = 'word2vec'
EMBEDDING_DIM = 300
HIDDEN_DIM = 150
USE_BILSTM = True
DROPOUT = .05
LEARNING_RATE = 1e-2
BATCH_SIZE = 5

timestamp = str(int(time.time()))
best_dev_acc = 0.0

text_field = data.Field(lower=True)
label_field = data.Field(sequential=False)
train_iter, dev_iter, test_iter = load_sst(text_field, label_field, BATCH_SIZE, USE_GPU)

model = LSTMSentiment(embedding_dim=EMBEDDING_DIM, hidden_dim=HIDDEN_DIM,
                      vocab_size=len(text_field.vocab), label_size=len(label_field.vocab)-1,\
                      use_gpu=USE_GPU, batch_size=BATCH_SIZE, dropout=DROPOUT, bidirectional=USE_BILSTM)

if USE_GPU:
    model = model.cuda()

best_model = model
optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE)
loss_function = nn.NLLLoss()

The code below lets you try other embedding types, but for this assignment it is fine to keep using word2vec.

In [8]:
if 'glove' in EMBEDDING_TYPE:
    #text_field.vocab.load_vectors('glove.6B.{}d'.format(EMBEDDING_DIM))
    text_field.vocab.load_vectors('glove.twitter.27B.100d')
    if USE_GPU:
        model.embeddings.weight.data = text_field.vocab.vectors.cuda()
    else:
        model.embeddings.weight.data = text_field.vocab.vectors
    #model.embeddings.embed.weight.requires_grad = False
elif 'word2vec' in EMBEDDING_TYPE:
    word_to_idx = text_field.vocab.stoi
    pretrained_embeddings = np.random.uniform(-0.25, 0.25, (len(text_field.vocab), 300))
    pretrained_embeddings[0] = 0
    try:
        word2vec
    except:
        print('Load word embeddings...')
        word2vec = load_bin_vec(os.path.join('data','GoogleNews-vectors-negative300.bin'), word_to_idx)
    for word, vector in word2vec.items():
        pretrained_embeddings[word_to_idx[word]-1] = vector
    # text_field.vocab.load_vectors(wv_type='', wv_dim=300)

    model.embeddings.weight.data.copy_(torch.from_numpy(pretrained_embeddings));
else:
    print('Unknown embedding type {}'.format(EMBEDDING_TYPE))

Load word embeddings...


### The actual task (B1): Tensorboard instrumentation

To get you to work with the some of the basic tools that enable development and tuning of deep learning architectures, we would like you to use Tensorboard.

1. read up on how to instrument your code for profiling and visualization in [tensorboard](https://www.tensorflow.org/programmers_guide/summaries_and_tensorboard), e.g. [at this blog](http://www.erogol.com/use-tensorboard-pytorch/)
1. [partly done] use the tensorboard `SummaryWriter` to keep track of training loss for each epoch, writing to a local `runs` folder (which is the default)
1. launch tensorboard and inspect the log folder, i.e. run `tensorboard --logdir runs` from the assignment folder

Note that only point 2 requires you to write code, about 4 lines of it.

In [9]:
from tensorboardX import SummaryWriter

#out_dir = os.path.abspath(os.path.join(os.path.curdir, "runs", timestamp))
out_dir = os.path.abspath(os.path.join(os.path.curdir, "runs"))
writer = SummaryWriter(comment='-{}lstm-em{}{}-hid{}-do{}-bs{}-lr{}'
                                .format('BI' if USE_BILSTM else '',
                                        EMBEDDING_TYPE, EMBEDDING_DIM,
                                        HIDDEN_DIM,
                                        DROPOUT, BATCH_SIZE, LEARNING_RATE))
print("Writing to {}\n".format(out_dir))
if not os.path.exists(out_dir):
    os.makedirs(out_dir)

Writing to G:\SFU\CMPT733\ass9\runs



In [10]:
print('Training...')
trial = 0 # increment this if you manually decide to add more epochs to the current training
for epoch in range(EPOCHS*trial,EPOCHS*(trial+1)):
    avg_loss, acc = train_epoch_progress(model, train_iter, loss_function, optimizer, text_field, label_field, epoch)
    tqdm.write('Train: loss %.2f acc %.1f' % (avg_loss, acc*100))
    # TODO: add scalars for training loss and training accuracy to the summary writer
    # call the scalars 'Train/Loss' and 'Train/Acc', respectively, and associate them with the current epoch
    writer.add_scalar('Train/Loss',avg_loss,epoch)
    writer.add_scalar('Train/Acc',acc,epoch)

    dev_loss, dev_acc = evaluate(model, dev_iter, loss_function, 'Dev')
    # TODO: add scalars for test loss and training accuracy to the summary writer
    # call the scalars 'Val/Loss' and 'Val/Acc', respectively, and associate them with the current epoch
    writer.add_scalar('Val/Loss',dev_loss,epoch)
    writer.add_scalar('Val/Acc',dev_acc,epoch)
    
    if dev_acc > best_dev_acc:
        if best_dev_acc > 0:
            os.system('rm '+ out_dir + '/best_model' + '.pth')
        best_dev_acc = dev_acc
        best_model = model
        torch.save(best_model.state_dict(), out_dir + '/best_model' + '.pth')
        # evaluate on test with the best dev performance model
        test_acc = evaluate(best_model, test_iter, loss_function, 'Test')

test_loss, test_acc = evaluate(best_model, test_iter, loss_function, 'Final Test')

Training...


HBox(children=(IntProgress(value=0, description='Train epoch 1', max=1384), HTML(value='')))


Train: loss 1.63 acc 49.1
Dev: loss 6.77 acc 49.2
Test: loss 7.51 acc 49.9


HBox(children=(IntProgress(value=0, description='Train epoch 2', max=1384), HTML(value='')))


Train: loss 1.63 acc 51.6
Dev: loss 6.77 acc 49.4
Test: loss 7.51 acc 49.9


HBox(children=(IntProgress(value=0, description='Train epoch 3', max=1384), HTML(value='')))


Train: loss 1.66 acc 53.0
Dev: loss 6.79 acc 52.9
Test: loss 7.51 acc 55.5


HBox(children=(IntProgress(value=0, description='Train epoch 4', max=1384), HTML(value='')))


Train: loss 1.63 acc 59.9
Dev: loss 6.81 acc 52.2


HBox(children=(IntProgress(value=0, description='Train epoch 5', max=1384), HTML(value='')))


Train: loss 1.58 acc 65.6
Dev: loss 6.80 acc 51.7


HBox(children=(IntProgress(value=0, description='Train epoch 6', max=1384), HTML(value='')))


Train: loss 1.57 acc 67.1
Dev: loss 6.83 acc 53.6
Test: loss 7.53 acc 57.6


HBox(children=(IntProgress(value=0, description='Train epoch 7', max=1384), HTML(value='')))


Train: loss 1.53 acc 70.3
Dev: loss 6.83 acc 55.7
Test: loss 7.54 acc 57.3


HBox(children=(IntProgress(value=0, description='Train epoch 8', max=1384), HTML(value='')))


Train: loss 1.52 acc 71.7
Dev: loss 6.83 acc 50.8


HBox(children=(IntProgress(value=0, description='Train epoch 9', max=1384), HTML(value='')))


Train: loss 1.54 acc 70.6
Dev: loss 6.84 acc 52.9


HBox(children=(IntProgress(value=0, description='Train epoch 10', max=1384), HTML(value='')))


Train: loss 1.52 acc 72.9
Dev: loss 6.87 acc 54.9


HBox(children=(IntProgress(value=0, description='Train epoch 11', max=1384), HTML(value='')))


Train: loss 1.51 acc 73.3
Dev: loss 6.89 acc 53.4


HBox(children=(IntProgress(value=0, description='Train epoch 12', max=1384), HTML(value='')))


Train: loss 1.51 acc 72.8
Dev: loss 6.88 acc 55.0


HBox(children=(IntProgress(value=0, description='Train epoch 13', max=1384), HTML(value='')))


Train: loss 1.50 acc 73.5
Dev: loss 6.83 acc 57.3
Test: loss 7.54 acc 59.3


HBox(children=(IntProgress(value=0, description='Train epoch 14', max=1384), HTML(value='')))


Train: loss 1.50 acc 74.5
Dev: loss 6.86 acc 57.5
Test: loss 7.55 acc 59.8


HBox(children=(IntProgress(value=0, description='Train epoch 15', max=1384), HTML(value='')))


Train: loss 1.48 acc 75.4
Dev: loss 6.85 acc 58.6
Test: loss 7.56 acc 60.1


HBox(children=(IntProgress(value=0, description='Train epoch 16', max=1384), HTML(value='')))


Train: loss 1.49 acc 75.1
Dev: loss 6.84 acc 57.9


HBox(children=(IntProgress(value=0, description='Train epoch 17', max=1384), HTML(value='')))


Train: loss 1.48 acc 76.0
Dev: loss 6.87 acc 58.8
Test: loss 7.57 acc 61.3


HBox(children=(IntProgress(value=0, description='Train epoch 18', max=1384), HTML(value='')))


Train: loss 1.47 acc 76.7
Dev: loss 6.86 acc 58.4


HBox(children=(IntProgress(value=0, description='Train epoch 19', max=1384), HTML(value='')))


Train: loss 1.48 acc 75.9
Dev: loss 6.89 acc 57.6


HBox(children=(IntProgress(value=0, description='Train epoch 20', max=1384), HTML(value='')))


Train: loss 1.48 acc 76.3
Dev: loss 6.89 acc 57.0
Final Test: loss 7.58 acc 61.4


In [11]:
writer.close()

### Task B2: Tune the model

After connecting the output of your model training and testing performance for monitoring in tensorboard. Change the model and training parameters above to improve the model performance. We would like to see variable plots of how validation accuracy evolves over a number of epochs for different parameter choices, you can stop exploring when you exceed a model accuracy of 76%.

**Show a tensorboard screenshot with performance plots that combine at leat 5 different tuning attempts.** Store the screenshot as `tensorboard.png`. Then keep the best performing parameters set in this notebook for submission and evaluate the comparison with Vader below using your best model.

Note, parameter and architecture tuning is an exercise that can go on for a long time. After you have tensorboard running, enabling you to observe learning progress for the algorithms in this notebook, **spend about half an hour tuning to improve the parameter choices**. Big leaps in performance actually require deeper research and may take days or months. While beyond the scope of this assignment, you now have the tools and background knowledge to do such work, if you want to.

## Comparison against Vader
Vader is a rule-based sentiment analysis algorithm that performs quite well against more complex architectures. The test below is to see, whether LSTMs are able to beat its performance.

In [12]:
import nltk
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()

da = test_iter.data()
dat = [(d.text, d.label, ' '.join(d.text)) for d in da]
lab_vpred = np.zeros((len(dat), 2))
for k, (_, label, sentence) in enumerate(dat):
    ss = sid.polarity_scores(sentence)
    lab_vpred[k,:] = (int(ss['compound']>0), int(label))
print('vader acc: {}'.format(1-abs(lab_vpred[:,0]-lab_vpred[:,1]).mean()))

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\sachi\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
vader acc: 0.6880834706205381


In [13]:
#test_iter.init_epoch
batch = list(test_iter)[0]
batch.text
best_model.eval()
pred = best_model(batch.text)

In [14]:
labels = batch.label.data.cpu().detach() - 1
labelsnp = labels.cpu().detach().numpy()
prednp = pred.data.max(1)[1].cpu().numpy()
lstm_acc = 1 - abs(prednp-labelsnp).mean()
print('(Bi-)LSTM acc: {}'.format(lstm_acc))

(Bi-)LSTM acc: 0.6046128500823723


**Perform the model tuning and training in the previous task until you outperform the Vader algorithm by at least 5% in accuracy on the test set.** Note, this is not a separate task, but just additional code to check whether your tuning efforts have succeeded.

## Submission

Save [this notebook](A9.ipynb) containing all cell output and upload your submission as one `A9.ipynb` file.
Also, include the screenshot of your tensorboard debugging session as `tensorboard.png`.