# Credits

This notebook is heavily influenced by https://github.com/pytorch/tutorials.

In [2]:
# Uncomment and run the next line if torchtext/bokeh/nltk aren't installed
!conda/miniconda3/pip install --user torchtext nltk


You are using pip version 9.0.1, however version 18.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.


In [4]:
# RUN THIS LINE ASAP - Download the dataset while you read the exercise

from pprint import pprint

import numpy as np

from torchtext import data
from torchtext import datasets
from torchtext.vocab import Vectors

import torch
from torch.autograd import Variable
import torch.nn as nn
import torch.optim as optim
from torch.nn import Linear
from torch.nn.functional import softmax, relu

from sklearn.manifold import TSNE

# we'll use the bokeh library to create beautiful plots
# *_notebook functions are needed for correct use in jupyter
from bokeh.plotting import figure, ColumnDataSource
from bokeh.models import HoverTool
from bokeh.io import output_notebook, show, push_notebook
output_notebook()


url = 'https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.simple.vec'

# Following lines will be explained further below.
# Run the cell and read the notebook further while the data gets downloaded
TEXT = data.Field(sequential=True)
LABEL = data.Field(sequential=False)
train_set, validation_set, _ = datasets.SST.splits(TEXT,
                                                   LABEL,
                                                   fine_grained=False,
                                                   train_subtrees=True,
                                                   filter_pred=lambda ex: ex.label != 'neutral')
TEXT.build_vocab(train_set, max_size=None, vectors=Vectors('wiki.simple.vec', url=url))
LABEL.build_vocab(train_set)

ModuleNotFoundError: No module named 'torchtext'

# Sequential Data

In this lab we will introduce other ways of dealing with sequential data.
As an example we will train a neural network to classify sequences of text as having either positive or negative sentiment.
The examples below use text and face the same challenges as those presented in [learning when to skim and when to read](https://einstein.ai/research/learning-when-to-skim-and-when-to-read).

In this notebook we will show you:
* Some different ways to represent text.
* Some PyTorch tools for working with text.
* How to create a simple bag of words model for sentiment analysis.


## Representing Text
In previous labs we mainly considered data $x \in \mathrm{R}^d$, where $d$ is the feature space dimension.
With time sequences our data can be represented as $x \in \mathrm{R}^{t \, \times \, d}$, where $t$ is the sequence length. 
This emphasises sequence dependence and that the samples along the sequence are not independent and identically distributed (i.i.d.).
We will model functions as $\mathrm{R}^{t \, \times \, d} \rightarrow \mathrm{R}^c$, where $c$ is the number of classes in the output.


There are several ways to represent sequences.
With text, the challenge is how to represent a word as a feature vector in $d$ dimensions, since the text must be represented numerically.


### One-hot encoding over vocabulary

One way to represent a fixed number of words is by making a one-hot encoded vector, which consists of 0s in all cells with the exception of a single 1 in a cell used uniquely to identify each word.

| vocabulary    | one-hot encoded vector   |
| ------------- |--------------------------|
| Paris         | $= [1, 0, 0, \ldots, 0]$ |
| Rome          | $= [0, 1, 0, \ldots, 0]$ |
| Copenhagen    | $= [0, 0, 1, \ldots, 0]$ |

Representing a large vocabulary with one-hot encodings often becomes inefficient because of the size of each sparse vector.
To overcome this challenge it is common practice to truncate the vocabulary to contain the $k$ most used words and represent the rest with a special symbol, $\mathtt{UNK}$, to define unknown/unimportant words.
This often causes entities such as names to be represented with $\mathtt{UNK}$.

Consider the following text
> I love the corny jokes in Spielberg's new movie.

where an example result would be similar to
> I love the corny jokes in $\mathtt{UNK}$'s new movie.
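The truncation described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not how any particular library does it: `build_vocab`, `one_hot`, the `<unk>` spelling, and the toy corpus are all hypothetical names introduced here.

```python
from collections import Counter

UNK = "<unk>"  # hypothetical symbol for out-of-vocabulary words

def build_vocab(tokens, k):
    """Keep the k most frequent tokens; everything else maps to UNK."""
    most_common = [word for word, _ in Counter(tokens).most_common(k)]
    return [UNK] + most_common

def one_hot(word, vocab):
    """Return a list of 0s with a single 1 at the word's index."""
    index = vocab.index(word) if word in vocab else vocab.index(UNK)
    return [1 if i == index else 0 for i in range(len(vocab))]

# toy corpus (made up for illustration)
corpus = "i love the movie i love the jokes i love spielberg".split()
vocab = build_vocab(corpus, k=3)

# "spielberg" is too rare to make it into the vocabulary, so it becomes UNK
encoded = [one_hot(word, vocab) for word in "i love spielberg".split()]
```

Note that every vector has the same length as the vocabulary, which is why large vocabularies quickly become expensive.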

### Embeddings

Word embeddings try to tackle the intractability of one-hot encoded vectors, as $k$ is often in the range of 50k to 100k elements.
Furthermore, one-hot encoding assumes orthogonality between all words, which makes it unable to capture relationships between words: `ran` and `run` should be close together, whereas `awkward` and `space` should be far apart in the vector space.

An embedding is defined as a mapping $\mathrm{R}^d \rightarrow \mathrm{R}^{d'}$, where $d' \ll d$.
In practice this is often achieved by having a lookup table with $d'$-dimensional embeddings.
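To see why a lookup table is enough, the sketch below compares multiplying a one-hot vector with an embedding matrix against simply indexing a row. It uses NumPy with toy sizes and a random table; the variable names are illustrative assumptions, not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, embedding_dim = 5, 3  # d = 5 and d' = 3 (toy sizes)
table = rng.normal(size=(vocab_size, embedding_dim))  # one row per word

word_index = 2
one_hot = np.zeros(vocab_size)
one_hot[word_index] = 1.0

# multiplying a one-hot vector with the table selects a single row ...
via_matmul = one_hot @ table
# ... which is what a lookup by index returns, without the wasted multiplications
via_lookup = table[word_index]
```

Both routes give the same $d'$-dimensional vector, so in practice only the index of each word needs to be stored.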

For visualizations and more intuition check out [learning when to skim and when to read](https://einstein.ai/research/learning-when-to-skim-and-when-to-read).

### Bag of Words

A simple way to model sequences of words is by averaging the word embeddings across the sequence dimension.
This gives us a vector which contains information about each word, although completely disregarding the order of the words. Even though this might seem like a lossy approach to condense information, it works surprisingly well.

A bag of words model is represented as a mapping $\mathrm{R}^{t \, \times \, d'} \rightarrow \mathrm{R}^{d'}$. Afterwards, the representation can be used for classification, $\mathrm{R}^{d'} \rightarrow \mathrm{R}^{c}$.
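The two mappings can be sketched with NumPy as follows. The sizes, the random embeddings, and the random linear classifier are assumptions chosen only to make the shapes visible; nothing here is trained.

```python
import numpy as np

rng = np.random.default_rng(0)

t, d_emb, c = 4, 3, 2  # sequence length, d', number of classes (toy sizes)
embeddings = rng.normal(size=(t, d_emb))  # one embedding per word in the sequence

# bag of words: average over the sequence dimension, R^{t x d'} -> R^{d'}
bow = embeddings.mean(axis=0)

# reversing the word order leaves the representation unchanged
bow_reversed = embeddings[::-1].mean(axis=0)

# a linear classifier on top, R^{d'} -> R^{c}
weights = rng.normal(size=(d_emb, c))
scores = bow @ weights
```

Since `bow` and `bow_reversed` are identical, the model cannot distinguish between sentences that contain the same words in a different order.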

## Stanford sentiment treebank

A great public dataset for sentiment analysis is the Stanford sentiment treebank (SST).
The SST provides not only the class (positive, negative) for a sentence, but also each of its grammatical subphrases.
We will not utilize any tree information.
The original SST consists of five classes: *very positive*, *positive*, *neutral*, *negative* and *very negative*.
We consider the simpler task of binary classification where *very positive* is combined with *positive*, *very negative* is combined with *negative* and all *neutrals* are removed.
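The reduction from five classes to two can be sketched in plain Python. The helper `to_binary`, the label strings, and the toy samples are made up for illustration; in this notebook the actual filtering is done by `torchtext` when the dataset is loaded.

```python
# hypothetical mapping from the five fine-grained classes to the binary task;
# "neutral" is deliberately absent so that those samples are dropped
BINARY = {
    "very positive": "positive",
    "positive": "positive",
    "negative": "negative",
    "very negative": "negative",
}

def to_binary(samples):
    """Merge the fine-grained classes and drop neutral samples."""
    return [(text, BINARY[label]) for text, label in samples if label in BINARY]

# toy samples with made-up labels
samples = [
    ("great fun", "very positive"),
    ("rather dull", "negative"),
    ("it is a movie", "neutral"),
]
binary_samples = to_binary(samples)
```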

### positive examples

<div class="alert alert-info">
  <ul>
    <li>The actors are fantastic</li>
    <li>A smart, witty follow-up.</li>
    <li>You'll probably love it.</li>
  </ul>
</div>

### negative examples

<div class="alert alert-danger">
  <ul>
    <li>Unflinchingly bleak and desperate.</li>
    <li>An absurdist spider web.</li>
    <li>Who cares?</li>
  </ul>
</div>

In [None]:
use_cuda = torch.cuda.is_available()

def get_variable(x):
    """ Converts tensors to cuda, if available. """
    if use_cuda:
        return x.cuda()
    return x

def get_numpy(x):
    """ Get numpy array for both cuda and not. """
    if use_cuda:
        return x.cpu().data.numpy()
    return x.data.numpy()

# Data loader - `torchtext`

Creating data loaders for NLP is quite a hassle.
[torchtext](https://github.com/pytorch/text/) is a convenient library with built-in functionality that is useful when working with text, e.g. building vocabularies and padding sequences to max length.

## `torchtext` - Fields and Dataset

Our dataset must have a predefined structure, e.g. similar to a database table.

- `torchtext.data.Field()` defines a column in our dataset table
- `torchtext.datasets.SST` is a data loader for the Stanford Sentiment Treebank (SST) dataset
- `torchtext.datasets.SST.splits()` is a function to create train/validation/test sets

In [None]:
# we assume that all fields are sequential, i.e. there will be a sequence of data
# however, the label field will not contain any sequence
TEXT = data.Field(sequential=True)
LABEL = data.Field(sequential=False)
# create SST dataset splits
# note, we remove samples with neutral labels
train_set, validation_set, _ = datasets.SST.splits(TEXT,
                                                   LABEL,
                                                   fine_grained=False,
                                                   train_subtrees=True,
                                                   filter_pred=lambda ex: ex.label != 'neutral')

In [None]:
print('train_set.fields:', list(train_set.fields.keys()))
print('validation_set.fields:', list(validation_set.fields.keys()))
print()
print('size of training set', len(train_set))
print('size of validation set', len(validation_set))
print()
print('content of first training sample:')
pprint(vars(train_set[0]))

## `torchtext` - Vocabulary

For each `Field` we build a vocabulary to numericalize the symbols, e.g. `"fun" => 471`.
When building a vocabulary we can attach embedding vectors, e.g. GloVe, FastText, etc.
Many of these are already built into `torchtext`.
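To make the idea of numericalization concrete, here is a minimal sketch in plain Python. It mimics the `itos` (index-to-string) and `stoi` (string-to-index) attributes used later in this notebook, but `build_vocab`, `numericalize`, and the special symbols are hypothetical helpers and not the `torchtext` implementation.

```python
from collections import Counter

def build_vocab(sentences, specials=("<unk>", "<pad>")):
    """Return (itos, stoi): an index-to-string list and a string-to-index dict."""
    counts = Counter(token for sentence in sentences for token in sentence)
    # special symbols come first, then tokens from most to least frequent
    itos = list(specials) + [token for token, _ in counts.most_common()]
    stoi = {token: index for index, token in enumerate(itos)}
    return itos, stoi

def numericalize(sentence, stoi):
    """Map tokens to indices; unseen tokens fall back to <unk>."""
    return [stoi.get(token, stoi["<unk>"]) for token in sentence]

# toy tokenized sentences (made up for illustration)
sentences = [["a", "fun", "movie"], ["a", "fun", "fun", "ride"]]
itos, stoi = build_vocab(sentences)

# "sequel" was never seen while building the vocabulary
indices = numericalize(["a", "fun", "sequel"], stoi)
```

Mapping the indices back through `itos` recovers the tokens, except that unseen words come back as `<unk>`.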

In [None]:
# build the vocabularies
url = 'https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.simple.vec'
TEXT.build_vocab(train_set, max_size=None, vectors=Vectors('wiki.simple.vec', url=url))
LABEL.build_vocab(train_set)

In [None]:
print('Text fields:')
#print('keys of TEXT.vocab:', list(TEXT.vocab.__dict__.keys()))
print(' size of vocabulary:', len(TEXT.vocab))
print(" vocabulary's embedding dimension:", TEXT.vocab.vectors.size())
print(' no. times "fun" appears in the dataset:', TEXT.vocab.freqs['fun'])

print('\nLabel fields:')
#print('keys of LABEL.vocab:', list(LABEL.vocab.__dict__.keys()))
print(" list of vocabulary (int-to-str):", LABEL.vocab.itos)
print(" list of vocabulary (str-to-int):", dict(LABEL.vocab.stoi))

## `torchtext` - Iterator over datasets

`torchtext.data.Iterator` is a class which can be used to create iterators.
These iterators offer various useful features, e.g. shuffling at every epoch or generating data endlessly.
It is useful to be able to generate endless batches of training data.
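The behaviour described above, reshuffling at every epoch while yielding batches forever, can be sketched with a plain Python generator. `endless_batches` is a hypothetical helper written for illustration; it does not pad or bucket sequences by length as a real text iterator would.

```python
import random

def endless_batches(dataset, batch_size, seed=0):
    """Yield batches forever, reshuffling the data at the start of every epoch."""
    rng = random.Random(seed)
    indices = list(range(len(dataset)))
    while True:
        rng.shuffle(indices)  # new order for every pass over the data
        for start in range(0, len(indices), batch_size):
            yield [dataset[i] for i in indices[start:start + batch_size]]

dataset = list(range(10))  # stand-in for real training samples
batches = endless_batches(dataset, batch_size=4)

# three batches cover one epoch (4 + 4 + 2 samples); the fourth starts a new epoch
first_epoch = [next(batches) for _ in range(3)]
next_batch = next(batches)
```

Because the generator never terminates, the training loop has to decide when to stop, for example after a fixed number of iterations.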

In [None]:
# make iterator for splits
# device gives a CUDA enabled device (-1 disables it)
train_iter, val_iter, _ = data.BucketIterator.splits((train_set, validation_set, _),
                                                     batch_size=128, 
                                                     device=0 if use_cuda else -1)

In [None]:
# print batch information
batch = next(iter(train_iter))
print("dimension of batch's text:", batch.text.size())
print("first sequence in text:", batch.text[:,0])
print("correct label index:", batch.label[0])
print("the actual label:", LABEL.vocab.itos[get_numpy(batch.label[0])])

# Simple Bag of Words Model

In [None]:
# size of embeddings
embedding_dim = TEXT.vocab.vectors.size()[1]
num_embeddings = TEXT.vocab.vectors.size()[0]
num_classes = len(LABEL.vocab.itos)

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.embeddings = nn.Embedding(num_embeddings, embedding_dim)
        # use pretrained embeddings
        self.embeddings.weight.data.copy_(TEXT.vocab.vectors)
        
        # add hidden layers
        # YOUR CODE HERE!
#         self.l_1 = Linear(in_features=embedding_dim,
#                           out_features=30,
#                           bias=True)
#         self.l_2 = Linear(in_features=30,
#                           out_features=30,
#                           bias=True)
        
        # output layer
        self.l_out = Linear(in_features=embedding_dim,
                            out_features=num_classes,
                            bias=False)
        
    def forward(self, x):
        out = {}
        # get embeddings
        x = self.embeddings(x)
        
        # mean embeddings, this is the bag of words trick
        out['bow'] = x = torch.mean(x, dim=0)
        
        # add hidden layers
        # YOUR CODE HERE!
#         out['l1_activations'] = x = relu(self.l_1(x))
#         out['l2_activations'] = x = relu(self.l_2(x))


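        # Output raw logits: nn.CrossEntropyLoss applies log-softmax internally,
        # so applying softmax here as well would distort the loss
        out['out'] = self.l_out(x)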
        return out

net = Net()
if use_cuda:
    net.cuda()
print(net)

In [None]:
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(net.parameters(), lr=0.001)

def accuracy(ys, ts):
    # binary vector of correct (1) and incorrect (0) predictions
    correct_prediction = torch.eq(torch.max(ys, 1)[1], ts)
    # averaging the binary vector gives the accuracy
    return torch.mean(correct_prediction.float())

In [None]:
def construct_sentences(batch):
    """    
    Parameters
    ----------
    batch: torchtext.data.batch.Batch
    
    Returns
    -------
    [str]
    """
    return [" ".join([TEXT.vocab.itos[elm] 
                      for elm in get_numpy(batch.text[:,i])])
            for i in range(batch.text.size()[1])]

def get_labels(batch):
    """
    Parameters
    ----------
    batch: torchtext.data.batch.Batch
    
    Returns
    -------
    [str]
    """
    return [LABEL.vocab.itos[get_numpy(batch.label[i])] for i in range(len(batch.label))]


In [None]:
# to project our hidden embeddings to a visualizable space
tsne = TSNE(perplexity=10.0, learning_rate=5.0, n_iter=2000)

# index for each label
colormap = {1: 'DodgerBlue', 2: 'FireBrick'}

# create a tmp source to be updated later
validation_set_size = len(validation_set)
source = ColumnDataSource(data={'x': np.random.randn(validation_set_size),
                                'y': np.random.randn(validation_set_size),
                                'colors': ['green']*validation_set_size,
                                'sentences': ["tmp"]*validation_set_size,
                                'labels': ["unk"]*validation_set_size})

# instance to define hover logic in plot
hover = HoverTool(tooltips=[("Sentence", "@sentences"), ("Label", "@labels")])

# set up the bokeh figure for later visualizations
p = figure(tools=[hover])
p.circle(x='x', y='y', fill_color='colors', size=5, line_color=None, source=source)

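def update_plot(meta, layer, handle):
    """ Update existing plot
    
    Parameters
    ----------
    meta: dict
        collected outputs and metadata for the validation set
    layer: str
        key in `meta` whose activations are projected with t-SNE
    handle: notebook handle returned by `show(..., notebook_handle=True)`
    """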
    tsne_acts = tsne.fit_transform(meta[layer])
    source.data['x'] = tsne_acts[:,0]
    source.data['y'] = tsne_acts[:,1]
    source.data['colors'] = [colormap[l] for l in meta['label_idx']]
    
    source.data['sentences'] = meta['sentences']
    source.data['labels'] = meta['labels']
    
    # this updates the given plot
    push_notebook(handle=handle)

## Train the bag of words model

**Warning:** this might take a while on CPU.
Go get a cup of coffee and enjoy the visualizations.

Notice that each data point in the plot corresponds to an entire sentence in the validation set.

In [None]:
max_iter = 3000
eval_every = 1000
log_every = 200

# will be updated while iterating
tsne_plot = show(p, notebook_handle=True)

train_loss, train_accs = [], []

net.train()
for i, batch in enumerate(train_iter):
    if i % eval_every == 0:
        net.eval()
        val_losses, val_accs, val_lengths = 0, 0, 0
        val_meta = {'label_idx': [], 'sentences': [], 'labels': []}
        for val_batch in val_iter:
            output = net(val_batch.text)
            # batch sizes might vary, which is why we cannot simply average the per-batch losses
            # we multiply the loss and accuracy by the batch's size,
            # to later divide by the total size
            val_losses += criterion(output['out'], val_batch.label) * val_batch.batch_size
            val_accs += accuracy(output['out'], val_batch.label) * val_batch.batch_size
            val_lengths += val_batch.batch_size
            
            for key, _val in output.items():
                if key not in val_meta:
                    val_meta[key] = []
                val_meta[key].append(get_numpy(_val)) 
            val_meta['label_idx'].append(get_numpy(val_batch.label))
            val_meta['sentences'].append(construct_sentences(val_batch))
            val_meta['labels'].append(get_labels(val_batch))
        
        for key, _val in val_meta.items():
            val_meta[key] = np.concatenate(_val)
        
        # divide by the total accumulated batch sizes
        val_losses /= val_lengths
        val_accs /= val_lengths
        
        print("valid, it: {} loss: {:.2f} accs: {:.2f}\n".format(i, get_numpy(val_losses), get_numpy(val_accs)))
        update_plot(val_meta, 'bow', tsne_plot)
        
        net.train()
    
    output = net(batch.text)
    batch_loss = criterion(output['out'], batch.label)
    
    train_loss.append(get_numpy(batch_loss))
    train_accs.append(get_numpy(accuracy(output['out'], batch.label)))
    
    optimizer.zero_grad()
    batch_loss.backward()
    optimizer.step()
    
    if i % log_every == 0:        
        print("train, it: {} loss: {:.2f} accs: {:.2f}".format(i, 
                                                               np.mean(train_loss), 
                                                               np.mean(train_accs)))
        # reset
        train_loss, train_accs = [], []
        
    if max_iter < i:
        break
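The validation loop above weights each batch's loss and accuracy by the batch size before dividing by the total. The following sketch, with made-up numbers and a hypothetical `weighted_mean` helper, shows how a plain average over batches would be skewed by a small final batch.

```python
def weighted_mean(batch_means, batch_sizes):
    """Average per-batch means, weighting each batch by its size."""
    total = sum(mean * size for mean, size in zip(batch_means, batch_sizes))
    return total / sum(batch_sizes)

# made-up per-batch accuracies; the last batch is much smaller than the others
batch_accs = [0.5, 0.5, 1.0]
batch_sizes = [128, 128, 4]

# treats the tiny batch as if it counted as much as a full one
naive = sum(batch_accs) / len(batch_accs)
# equals the accuracy computed over all 260 samples at once
weighted = weighted_mean(batch_accs, batch_sizes)
```

Here the naive average is about 0.67, while the weighted average is about 0.51, which matches the fraction of correctly classified samples.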

# Assignment 1: add hidden layer

- add one hidden layer to the bag of words (BoW) model
  - plot the hidden layer's activations instead of the BoW representation
- add a second hidden layer
  - try and plot the activations of the second hidden layer

Notice any difference in the plots?
Describe what you see.
Hover over the data points.