# Assignment 9 - NLP using Deep Learning

## Goals

In this assignment you will get to work with recurrent network architectures with application to language processing tasks and observe behaviour of the learning using tensorboard visualization.

You'll learn to use

 * word embeddings,
 * LSTMs,
 * tensorboard visualization to develop and tune deep learning architectures.


## Use the deep learning environment in the lab

With the same kind of preparation as in [Assignment 5](../A5/A5-instruction.html) we are going to use **[pytorch](http://pytorch.org)** for the deep learning aspects of the assignment.

There is a pytorch setup in the big data under the globally available anaconda installation.
However, it is recommended that you use the custom **dlenv** conda environment that contains all python package dependencies that are relevant for this assignment (and also nltk, gensim, tensorflow, keras, etc.).

Either you load it directly
```
source activate /usr/shared/CMPT/big-data/tmp_py/dlenv
```
or you prepare
```
cd ~
mkdir -p .conda/envs
ln -s /usr/shared/CMPT/big-data/tmp_py/dlenv .conda/envs
```
and from thereon simply use
```
source activate dlenv
```

In [1]:
import os
bdenv_loc = '/usr/shared/CMPT/big-data'
bdata = os.path.join(bdenv_loc,'data')

# Task 1: Explore Word Embeddings

Word embeddings are mappings between words and multi-dimensional vectors, where the difference between two word vectors has some relationship with the meaning of the corresponding words, i.e. words that are similar in meaning are mapped closely together (ideally). This part of the assignment should enable you to

* Load a pretrained word embedding
* Perform basic operations, such as distance queries and evaluate simple analogies

In [70]:
import gensim
# Load Google's pre-trained Word2Vec model, trained on news articles
model = gensim.models.KeyedVectors.load_word2vec_format(
    os.path.join(bdata,'GoogleNews-vectors-negative300.bin'), binary=True)

In [68]:
# read up about the word2vec API in gensim and
# obtain a vector representation for a word of your choice

...

# to confirm that this worked, print out the number of elements of the vector

(300,)

In [None]:
# determine the 10 words that are closest in the embedding to the word vector your produced above

...

# are the nearest neighbours similar in meaning?
# try different seed words, until you find one whose neighbourhood looks OK

In [None]:
# using a combination of positive and negative words, find out which word is most
# similar to woman + king - man

...

# note, gensim's API allows you to combine positive and negative words without obtainng their vectors

In [None]:
# you may find that the results of most word analogy combinations don't work as well as we'd hope.
# however, explore a bit and find two more cases where the output of your word vector algebra makes sense.

...

# Task 2: Sequence modeling with RNNs

In this task you will get to use a learning and a rule-based model of text sentiment analysis. To keep things simple, you will receive almost all the code and are just left with the task to tune the given algorithms, see the part about instrumentation below.

First let's create a simple LSTM model that is capable of producing a lable for a sequence of vector encoded words, based on code from [this repo](https://github.com/clairett/pytorch-sentiment-classification).

In [2]:
import torch
from torch.autograd import Variable
import torch.nn as nn
import torch.nn.functional as F

class LSTMSentiment(nn.Module):

    def __init__(self, embedding_dim, hidden_dim, vocab_size, label_size,
                 use_gpu, batch_size, dropout=0.5, bidirectional=False):
        """Prepare individual layers"""
        super(LSTMSentiment, self).__init__()
        self.hidden_dim = hidden_dim
        self.use_gpu = use_gpu
        self.batch_size = batch_size
        self.dropout = dropout
        self.num_directions = 2 if bidirectional else 1
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(input_size=embedding_dim, hidden_size=hidden_dim, bidirectional=bidirectional)
        self.hidden2label = nn.Linear(hidden_dim*self.num_directions, label_size)
        self.hidden = self.init_hidden()

    def init_hidden(self):
        """Choose appropriate size and type of hidden layer"""
        # first is the hidden h
        # second is the cell c
        if self.use_gpu:
            return (Variable(torch.zeros(self.num_directions, self.batch_size, self.hidden_dim).cuda()),
                    Variable(torch.zeros(self.num_directions, self.batch_size, self.hidden_dim).cuda()))
        else:
            return (Variable(torch.zeros(self.num_directions, self.batch_size, self.hidden_dim)),
                    Variable(torch.zeros(self.num_directions, self.batch_size, self.hidden_dim)))

    def forward(self, sentence):
        """Use the layers of this model to propagate input and return class log probabilities"""
        if self.use_gpu:
            sentence = sentence.cuda()
        x = self.embeddings(sentence).view(len(sentence), self.batch_size, -1)
        lstm_out, self.hidden = self.lstm(x, self.hidden)
        
        y = self.hidden2label(lstm_out[-1])
        log_probs = F.log_softmax(y, dim=0)
        return log_probs


In [79]:
from torch import optim
import time, random
import os
from tqdm import tqdm_notebook as tqdm
tqdm.write = print
from torchtext import data
import numpy as np
import argparse

torch.set_num_threads(8)
torch.manual_seed(1)
random.seed(1)


def load_bin_vec(fname, vocab):
    """
    Loads 300x1 word vecs from Google (Mikolov) word2vec
    """
    word_vecs = {}
    with open(fname, "rb") as f:
        header = f.readline()
        vocab_size, layer1_size = map(int, header.split())
        binary_len = np.dtype('float32').itemsize * layer1_size
        for line in range(vocab_size):
            word = []
            while True:
                ch = f.read(1).decode('latin-1')
                if ch == ' ':
                    word = ''.join(word)
                    break
                if ch != '\n':
                    word.append(ch)
            if word in vocab:
               word_vecs[word] = np.frombuffer(f.read(binary_len), dtype='float32')
            else:
                f.read(binary_len)
    return word_vecs


def get_accuracy(truth, pred):
    assert len(truth) == len(pred)
    right = 0
    for i in range(len(truth)):
        if truth[i].item() == pred[i]:
            right += 1.0
    return right / len(truth)


def train_epoch_progress(model, train_iter, loss_function, optimizer, text_field, label_field, epoch):
    model.train()
    avg_loss = 0.0
    truth_res = []
    pred_res = []
    count = 0
    for batch in tqdm(train_iter, desc='Train epoch '+str(epoch+1)):
        sent, label = batch.text, batch.label
        label.data.sub_(1)
        truth_res += list(label.data)
        model.batch_size = len(label.data)
        model.hidden = model.init_hidden()
        pred = model(sent)
        if USE_GPU:
            pred_label = pred.data.max(1)[1].cpu().numpy()
        else:
            pred_label = pred.data.max(1)[1].numpy()
        pred_res += [x for x in pred_label]
        model.zero_grad()
        loss = loss_function(pred, label)
        avg_loss += loss.data.item()
        count += 1
        loss.backward()
        optimizer.step()
    avg_loss /= len(train_iter)
    acc = get_accuracy(truth_res, pred_res)
    return avg_loss, acc


def train_epoch(model, train_iter, loss_function, optimizer):
    model.train()
    avg_loss = 0.0
    truth_res = []
    pred_res = []
    count = 0
    for batch in train_iter:
        sent, label = batch.text, batch.label
        label.data.sub_(1)
        truth_res += list(label.data)
        model.batch_size = len(label.data)
        model.hidden = model.init_hidden()
        pred = model(sent)
        if USE_GPU:
            pred_label = pred.data.max(1)[1].cpu().numpy()
        else:
            pred_label = pred.data.max(1)[1].numpy()
        pred_res += [x for x in pred_label]
        model.zero_grad()
        loss = loss_function(pred, label)
        avg_loss += loss.data.item()
        count += 1
        loss.backward()
        optimizer.step()
    avg_loss /= len(train_iter)
    acc = get_accuracy(truth_res, pred_res)
    return avg_loss, acc


def evaluate(model, data, loss_function, name):
    model.eval()
    avg_loss = 0.0
    truth_res = []
    pred_res = []
    for batch in data:
        sent, label = batch.text, batch.label
        label.data.sub_(1)
        truth_res += list(label.data)
        model.batch_size = len(label.data)
        model.hidden = model.init_hidden()
        pred = model(sent)
        if USE_GPU:
            pred_label = pred.data.max(1)[1].cpu().numpy()
        else:
            pred_label = pred.data.max(1)[1].numpy()
        pred_res += [x for x in pred_label]
        loss = loss_function(pred, label)
        avg_loss += loss.data.item()
    avg_loss /= len(data)
    acc = get_accuracy(truth_res, pred_res)
    print(name + ': loss %.2f acc %.1f' % (avg_loss, acc*100))
    return avg_loss, acc


def load_sst(text_field, label_field, batch_size, use_gpu=True):
    train, dev, test = data.TabularDataset.splits(path=os.path.join(bdata,'sst2'), train='train.tsv',
                                                  validation='dev.tsv', test='test.tsv', format='tsv',
                                                  fields=[('text', text_field), ('label', label_field)])
    text_field.build_vocab(train, dev, test)
    label_field.build_vocab(train, dev, test)
    train_iter, dev_iter, test_iter = data.BucketIterator.splits((train, dev, test),
                                                                 batch_sizes=(batch_size, len(dev), len(test)),
                                                                 sort_key=lambda x: len(x.text), repeat=False,
                                                                 device=0 if use_gpu else -1)
    return train_iter, dev_iter, test_iter


In [72]:
# TODO: after instrumention (see below) tune these parameters to improve the performance of the model
EPOCHS = 5
USE_GPU = torch.cuda.is_available()
#EMBEDDING_TYPE = 'glove'
EMBEDDING_TYPE = 'word2vec'
EMBEDDING_DIM = 300
HIDDEN_DIM = 10
USE_BILSTM = False
DROPOUT = .05
LEARNING_RATE = 1e-3
BATCH_SIZE = 5

timestamp = str(int(time.time()))
best_dev_acc = 0.0

text_field = data.Field(lower=True)
label_field = data.Field(sequential=False)
train_iter, dev_iter, test_iter = load_sst(text_field, label_field, BATCH_SIZE, USE_GPU)

model = LSTMSentiment(embedding_dim=EMBEDDING_DIM, hidden_dim=HIDDEN_DIM,
                      vocab_size=len(text_field.vocab), label_size=len(label_field.vocab)-1,\
                      use_gpu=USE_GPU, batch_size=BATCH_SIZE, dropout=DROPOUT, bidirectional=USE_BILSTM)

if USE_GPU:
    model = model.cuda()

best_model = model
optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE)
loss_function = nn.NLLLoss()

In [73]:
if 'glove' in EMBEDDING_TYPE:
    #text_field.vocab.load_vectors('glove.6B.{}d'.format(EMBEDDING_DIM))
    text_field.vocab.load_vectors('glove.twitter.27B.100d')
    if USE_GPU:
        model.embeddings.weight.data = text_field.vocab.vectors.cuda()
    else:
        model.embeddings.weight.data = text_field.vocab.vectors
    #model.embeddings.embed.weight.requires_grad = False
elif 'word2vec' in EMBEDDING_TYPE:
    word_to_idx = text_field.vocab.stoi
    pretrained_embeddings = np.random.uniform(-0.25, 0.25, (len(text_field.vocab), 300))
    pretrained_embeddings[0] = 0
    try:
        word2vec
    except:
        print('Load word embeddings...')
        word2vec = load_bin_vec(os.path.join(bdata,'GoogleNews-vectors-negative300.bin'), word_to_idx)
    for word, vector in word2vec.items():
        pretrained_embeddings[word_to_idx[word]-1] = vector
    # text_field.vocab.load_vectors(wv_type='', wv_dim=300)

    model.embeddings.weight.data.copy_(torch.from_numpy(pretrained_embeddings));
else:
    print('Unknown embedding type {}'.format(EMBEDDING_TYPE))

### The actual task (B1): Tensorboard instrumentation

To get you to work with the some of the basic tools that enable development and tuning of deep learning architectures, we would like you to use Tensorboard.

1. read up on how to instrument your code for profiling and visualization in [tensorboard](https://www.tensorflow.org/programmers_guide/summaries_and_tensorboard), e.g. [at this blog](http://www.erogol.com/use-tensorboard-pytorch/)
1. [partly done] use the tensorboard `SummaryWriter` to keep track of training loss for each epoch, writing to a local `runs` folder (which is the default)
1. launch tensorboard and inspect the log folder, i.e. run `tensorboard --logdir runs` from the assignment folder

In [None]:
from tensorboardX import SummaryWriter

out_dir = os.path.abspath(os.path.join(os.path.curdir, "runs", timestamp))
writer = SummaryWriter(comment='-{}lstm-em{}{}-hid{}-do{}-bs{}-lr{}'
                                .format('BI' if USE_BILSTM else '',
                                        EMBEDDING_TYPE, EMBEDDING_DIM,
                                        HIDDEN_DIM,
                                        DROPOUT, BATCH_SIZE, LEARNING_RATE))
print("Writing to {}\n".format(out_dir))
if not os.path.exists(out_dir):
    os.makedirs(out_dir)

In [75]:
print('Training...')
trial = 0 # increment this if you manually decide to add more epochs to the current training
for epoch in range(EPOCHS*trial,EPOCHS*(trial+1)):
    avg_loss, acc = train_epoch_progress(model, train_iter, loss_function, optimizer, text_field, label_field, epoch)
    tqdm.write('Train: loss %.2f acc %.1f' % (avg_loss, acc*100))
    # TODO: add scalars for training loss and training accuracy to the summary writer
    # call the scalars 'Train/Loss' and 'Train/Acc', respectively, and associate them with the current epoch
    #...

    dev_loss, dev_acc = evaluate(model, dev_iter, loss_function, 'Dev')
    # TODO: add scalars for test loss and training accuracy to the summary writer
    # call the scalars 'Val/Loss' and 'Val/Acc', respectively, and associate them with the current epoch
    #...
    
    if dev_acc > best_dev_acc:
        if best_dev_acc > 0:
            os.system('rm '+ out_dir + '/best_model' + '.pth')
        best_dev_acc = dev_acc
        best_model = model
        torch.save(best_model.state_dict(), out_dir + '/best_model' + '.pth')
        # evaluate on test with the best dev performance model
        test_acc = evaluate(best_model, test_iter, loss_function, 'Test')

test_loss, test_acc = evaluate(best_model, test_iter, loss_function, 'Final Test')

Training...



Train: loss 1.53 acc 65.5
Dev: loss 6.75 acc 60.6


  return Variable(arr, volatile=not train)


Test: loss 7.50 acc 57.7



Train: loss 1.27 acc 87.7
Dev: loss 6.85 acc 62.6
Test: loss 7.57 acc 58.7



Train: loss 1.14 acc 93.8
Dev: loss 6.70 acc 75.7
Test: loss 7.40 acc 76.7



Train: loss 1.07 acc 97.0
Dev: loss 7.08 acc 73.7



Train: loss 1.05 acc 97.5
Dev: loss 8.16 acc 60.6
Final Test: loss 8.99 acc 56.8


In [None]:
writer.close()

### Task B2: Tune the model

After connecting the output of your model train and test performance with tensorboard. Change the model and training parameters above to improve the model performance. We would like to see variable plots of how validation accuracy evolves over a number of epochs for different parameter choices, you can stop exploring when you exceed a model accuracy of 76%.

Show a tensorboard screenshot with performance plots that combine at leat 5 different tuning attempts. Store the screenshot as `tensorboard.png`. Then keep the best performing parameters set in this notebook for submission and evaluate the comparison below with your best model. 

## Comparison against Vader
Vader is a rule-based sentiment analysis algorithm that performs quite well against more complex architectures. The test below is to see, whether LSTMs are able to beat its performance.

In [76]:
import nltk
#nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()

da = test_iter.data()
dat = [(d.text, d.label, ' '.join(d.text)) for d in da]
lab_vpred = np.zeros((len(dat), 2))
for k, (_, label, sentence) in enumerate(dat):
    ss = sid.polarity_scores(sentence)
    lab_vpred[k,:] = (int(ss['compound']>0), int(label))
print('vader acc: {}'.format(1-abs(lab_vpred[:,0]-lab_vpred[:,1]).mean()))
writer.add_scalar('Final/VaderAcc', acc)

vader acc: 0.6880834706205381


In [77]:
#test_iter.init_epoch
batch = list(test_iter)[0]
batch.text
best_model.eval()
pred = best_model(batch.text)

  return Variable(arr, volatile=not train)


In [78]:
labels = batch.label.data.cpu().detach() - 1
labelsnp = labels.cpu().detach().numpy()
prednp = pred.data.max(1)[1].cpu().numpy()
lstm_acc = 1 - abs(prednp-labelsnp).mean()
print('(Bi-)LSTM acc: {}'.format(lstm_acc))
writer.add_scalar('Final/LSTMAcc', lstm_acc)

(Bi-)LSTM acc: 0.5551894563426689


Perform the model tuning and training in the previous task until you outperform the Vader algorithm by at least 7% in accuracy.

## Submission

Save [this notebook](A9.ipynb) containing all cell output and upload your submission as one `A9.ipynb` file.
Also, include the screenshot of your tensorboard debugging session as `tensorboard.png`.

### Some detail notes about the dlenv environment (not needed for this assignment, but maybe for your project)

Tensorflow is available in its GPU version now (v1.4.1 based on CUDA 8.0) - before, it was installed in this environment to only run on the CPU.

Also, PyTorch is compiled from github using CUDA 9.1, giving the current version 0.4. This enables a feature that was broken for the past two releases - adding the computational graph of any convolutional network model to a tensorboard visualization, e.g. see `demo_graph.py` and other demos in [this repo](https://github.com/lanpa/tensorboard-pytorch), if you'd like to learn more.