# 0. Introduction

## 0.1 Task

This exercise has two tasks:
1. Train a CBOW model on a real-world dataset and explore how the parameters affect the model.
2. Evaluate an embedding qualitatively.

## 0.2 Instruction

To follow along with this exercise, execute the cells from top to bottom by clicking on each one and pressing *Run* or hitting *Shift+Enter*.


# 1. CBOW


Our first task is to build our very own CBOW. In this section, you will **NOT** be required to write code for the network. Instead, you will explore the model by forming hypotheses and testing them with different parameter settings. Although you don't need to write code, you are encouraged to write some to test things out.

First, let's remind ourselves of the famous quote:

> You shall know a word by the company it keeps (Firth, J. R. 1957:11)

What this implies is that it is possible to define a word, or the *meaning* of a word, in terms of a prediction task: **the task of predicting the word given the context**. However, one problem remains: how do we represent the meaning of a word? Luckily, there is already a line of work that suggests a solution: representing the meaning of a word by a vector, that is, the vector space model.

Now let's imagine that we are given a near-perfect vector space model that maps the meaning of each word to a vector. One of the easiest ways to solve the task is then to use the sum of the context vectors as the input, the target word vector as the output, and fit a linear model between them!

And that is exactly what CBOW is doing: <img src="figures/cbow.png" alt="cbow" style="width: 400px;"/>

The weights between the input and hidden layer are the vector space model mapping, and the weights between the hidden and output layer are the linear prediction model. 

The only catch is that we don't have that near-perfect vector space model given to us! Thus, our goal is to jointly learn (1) a representation of the word, and (2) a prediction model. In the following sections, we will explore how to do that through PyTorch.



## 1.1 Building our training set

We start by building a training set. For the purpose of this exploratory exercise, you will be using a medium-sized corpus: the [IMDB movie review dataset](http://ai.stanford.edu/~amaas/data/sentiment/) (you should have already downloaded it by following `README.md`).


### Data Preprocessing

First, we remove all punctuation, lowercase everything, and tokenize the text by whitespace.

In [1]:
# A collection of data-processing utilities; make sure you have already downloaded the data (see README.md)
from utils import read_imdb_data

# read_imdb_data removes punctuation and lowercases everything
X_raw_train, y_train = read_imdb_data('../data/aclImdb/train')
raw_text = ' '.join(X_raw_train).split()
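
To make the three preprocessing steps concrete, here is a minimal sketch of what they might look like on a single review. The function `simple_preprocess` is a hypothetical stand-in: the actual implementation of `read_imdb_data` lives in `utils` and is not shown in this notebook, so it may differ in its details.

```python
import string

def simple_preprocess(review):
    """Hypothetical stand-in for the cleaning done inside read_imdb_data."""
    # drop every ASCII punctuation character, then lowercase the rest
    no_punct = review.translate(str.maketrans('', '', string.punctuation))
    return no_punct.lower()

sample = "I was SERIOUSLY looking forward to it... Truly, promising!"
# tokenize by whitespace, exactly like ' '.join(...).split() above
sample_tokens = simple_preprocess(sample).split()
print(sample_tokens)
```

Each token that comes out of this pipeline is a lowercase string with no punctuation attached, which is the form the rest of the notebook expects.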

To make sure that it is properly loaded, let's peek into the content a bit:

In [3]:
print(' '.join(raw_text[:30]))

i was seriously looking forward to seeing this film because it seemed truly promising from the coming attractions jim carrey with godlike powers was an idea that most definitely worked


### Filter the Vocab

For the purpose of this assignment, and for the sake of training time, we limit the vocabulary to the 1000 most common words.

In [4]:
from collections import Counter

# pick the top words only
raw_text_count = Counter(raw_text)
vocab = set(list(zip(*raw_text_count.most_common(1000)))[0])
vocab_size = len(vocab)

We then only keep the selected words in the raw text.

In [5]:
# filter raw text by vocab
text = [r for r in raw_text if r in vocab]
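
The two cells above can be traced by hand on a tiny example. The sketch below uses a made-up ten-word text and keeps only the 2 most frequent words instead of 1000; all `toy_*` names are illustrative and not part of the notebook.

```python
from collections import Counter

toy_text = "the movie was good the film was bad the end".split()
toy_counts = Counter(toy_text)

# keep only the 2 most frequent words (the notebook keeps 1000)
toy_vocab = {word for word, _ in toy_counts.most_common(2)}

# every word outside the vocabulary is dropped from the running text
toy_filtered = [w for w in toy_text if w in toy_vocab]
print(toy_filtered)
```

Note that filtering shortens the text: only the words that survive remain, in their original order.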

---
# Question 1

The preprocessing steps introduced here seem very naive and potentially problematic. Before you read further, based on your understanding of CBOW:

1. List three potential concerns with the preprocessing choices and explain why each might be a concern.
2. Suggest potential fixes for these concerns.
---

### Building the Dataset

Now let's build our dataset! First we define some handy helper functions:

In [6]:
def make_context_vector(context, word_to_ix):
    '''
    helper function to translate context into indexes for inputs
    In:
        context: a list of words
        word_to_ix: a mapping from word to index
    Out:
        idxs: a list of indexes
    '''
    idxs = [word_to_ix[w] for w in context]
    return idxs

In [7]:
def data_batcher(X, Y, batch_size=50):
    '''
    helper function to batch data; batch_size is the number of
    data points in each batch.
    In:
        X: a matrix of size (num_sample, CONTEXT_SIZE*2)
        Y: an array of size (num_sample)
        batch_size: how many data points are in a batch, default: 50
    Out:
        a generator that yields one batch of X and Y at a time
    '''
    indices = np.arange(X.shape[0])
    # np.random.shuffle(indices)
    count = 0
    
    while count < X.shape[0]:
        draw = indices[count:min(count+batch_size, X.shape[0])]
        yield X[draw, :], Y[draw]
        count += batch_size

Recall that our input (X) is the context and our target (Y) is the word to predict.

We then build our dataset by iterating through the filtered text and mapping the words to indexes.

In [8]:
import numpy as np

# Let's set the context window size to 2 for now
CONTEXT_SIZE = 2  # 2 words to the left, 2 to the right

# A mapping from word to index
word_to_ix = {word: i for i, word in enumerate(vocab)}

X = []
Y = []
# iterate through the text to build dataset
for i in range(CONTEXT_SIZE, len(text) - CONTEXT_SIZE):
    context = [text[i + j] for j in range(-CONTEXT_SIZE, CONTEXT_SIZE + 1) if j != 0]
    target = text[i]
    # translate the words to indexes: X is the context, Y is the target word.
    X.append(make_context_vector(context, word_to_ix))
    Y.append(word_to_ix[target])

# convert to numpy for easier data handling later
X = np.array(X, dtype=int)
Y = np.array(Y, dtype=int)

Let's investigate what X and Y are a little bit by peeking into them:

In [9]:
# Let's print the first X, Y pair
x0 = X[0]
y0 = Y[0]
print(x0)
print(y0)

[745 553  38 821]
149


In [10]:
# x0 is the indexes of the context, while y0 is the index of the target word.
print(text[:5])
print([word_to_ix[w] for w in text[:5]])

['i', 'was', 'seriously', 'looking', 'forward']
[745, 553, 149, 38, 821]
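
To see how the sliding window produces (context, target) pairs, here is a small sketch on a six-word text with a window of 2. The `toy_*` names are illustrative, and the vocabulary is sorted so that the index mapping is reproducible (the notebook enumerates a `set`, whose order can change between runs).

```python
toy_text = ['i', 'was', 'seriously', 'looking', 'forward', 'to']
toy_context_size = 2

# sorted() gives a deterministic word -> index mapping for this example
toy_word_to_ix = {w: i for i, w in enumerate(sorted(set(toy_text)))}

toy_X, toy_Y = [], []
for i in range(toy_context_size, len(toy_text) - toy_context_size):
    # 2 words to the left and 2 to the right, skipping the target itself
    context = [toy_text[i + j]
               for j in range(-toy_context_size, toy_context_size + 1) if j != 0]
    toy_X.append([toy_word_to_ix[w] for w in context])
    toy_Y.append(toy_word_to_ix[toy_text[i]])

print(toy_X)
print(toy_Y)
```

A text of 6 words with a window of 2 yields 6 - 2*2 = 2 training pairs, because the first and last two words never have a full context around them.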


## 1.2 CBOW with PyTorch

Now that we have our dataset built, let's import PyTorch!

Note: if importing PyTorch fails, first try clicking **Kernel->Change Kernel->Python \[conda env:pytorch_w2v\]** in the menu bar of the notebook. It is likely that the kernel (the environment the Jupyter notebook runs on) is not set to the one where you installed PyTorch. If you change the kernel, please re-run this Jupyter notebook.

In [11]:
# import PyTorch
import torch
# torch.nn contains modules or subcomponents of the network required to train the network.
import torch.nn as nn
# torch.nn.functional is a collection of handy functions that you can build into the model
import torch.nn.functional as F
# torch.optim contains optimizers to update the parameters of network
import torch.optim as optim

"""
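Variable in torch.autograd is used to tell PyTorch that 
the object should be put into the PyTorch computation graph. 
Since PyTorch 0.4, plain tensors do this themselves and Variable
simply returns a tensor, so the wrapper is kept here only for
compatibility with older versions. See readings for more details.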
"""
from torch.autograd import Variable

"""
Let's also set a fixed random seed so you can replicate the results.
Note that neural networks are sensitive to their initial parameters. If
you can't seem to get a good result, you might want to try
another random seed.
"""
torch.manual_seed(1234)

<torch._C.Generator at 0x7fdf72ec53b0>

Now, let's look at two modules of PyTorch that will be especially useful for our implementation of CBOW: `nn.Embedding` and `nn.Linear`

Note: In PyTorch, unless you specify otherwise, the parameters of `nn.Embedding` and `nn.Linear` are initialized randomly.
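
As a quick orientation, the sketch below shows the shapes that flow through these two modules when they are chained the way CBOW chains them. The sizes (5 words, 3 dimensions) and the `toy_*` names are made up for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
toy_emb = nn.Embedding(num_embeddings=5, embedding_dim=3)  # 5 words, 3-dim vectors
toy_lin = nn.Linear(in_features=3, out_features=5)         # back to one score per word

toy_ids = torch.tensor([[0, 2, 4, 1]])          # one context made of 4 word indexes
toy_vectors = toy_emb(toy_ids)                  # lookup: shape (1, 4, 3)
toy_scores = toy_lin(toy_vectors.sum(dim=1))    # sum over the context: shape (1, 5)
print(toy_vectors.shape, toy_scores.shape)
```

`nn.Embedding` is a lookup table indexed by integers, which is why its input must be a tensor of integer indexes rather than floats.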

In [None]:
# print(nn.Embedding.__doc__)

In [12]:
print(nn.Linear.__doc__)

Applies a linear transformation to the incoming data: :math:`y = Ax + b`

    Args:
        in_features: size of each input sample
        out_features: size of each output sample
        bias: If set to False, the layer will not learn an additive bias.
            Default: ``True``

    Shape:
        - Input: :math:`(N, *, in\_features)` where `*` means any number of
          additional dimensions
        - Output: :math:`(N, *, out\_features)` where all but the last dimension
          are the same shape as the input.

    Attributes:
        weight: the learnable weights of the module of shape
            (out_features x in_features)
        bias:   the learnable bias of the module of shape (out_features)

    Examples::

        >>> m = nn.Linear(20, 30)
        >>> input = autograd.Variable(torch.randn(128, 20))
        >>> output = m(input)
        >>> print(output.size())
    


### The CBOW model

Now, it's the exciting part. Let's build our model!

Recall that the network structure of CBOW is equivalent to $A\left ( \sum_{w \in context} q_w \right ) + b$, where $q_w$ is the vector representation of word $w$, and $A$ and $b$ are the parameters of the linear prediction model. If we apply log softmax to convert the output to log probabilities, the output of the network becomes: 

$$\log P(w_i|context) = \mathrm{logSoftmax}\left (A\left ( \sum_{w \in context} q_w \right ) + b\right )_i$$

Our goal is then

$$\arg\min_{A, b, Q} E_{(context, target) \sim D} \left [-\log P(w_{target}|context)\right ] $$

where $Q$ is the matrix of embeddings, and $D$ is the distribution that context and target words are drawn from.
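
In practice the expectation in this objective is estimated by averaging over a mini-batch. The sketch below checks that on made-up numbers: `toy_scores` plays the role of $A(\sum q_w) + b$ for two samples over a three-word vocabulary, and the hand-computed average of $-\log P(w_{target}|context)$ is compared against `nn.NLLLoss`. All `toy_*` names and values are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# pretend these are the linear-layer outputs for 2 samples and 3 vocabulary words
toy_scores = torch.tensor([[2.0, 0.5, -1.0],
                           [0.1, 0.2, 3.0]])
toy_targets = torch.tensor([0, 2])   # index of the true target word per sample

toy_log_probs = F.log_softmax(toy_scores, dim=1)     # log P(w_i | context)
toy_loss = nn.NLLLoss()(toy_log_probs, toy_targets)  # batch average of -log P(target)

# the same quantity written out by hand
manual_loss = -(toy_log_probs[0, 0] + toy_log_probs[1, 2]) / 2
print(toy_loss.item(), manual_loss.item())
```

The two numbers agree, which is why the training code further down pairs a log softmax output with `nn.NLLLoss`.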


With this in mind, our model is thus:

In [12]:
class CBOW(nn.Module):
    """
    A PyTorch implementation of CBOW for exploratory purpose.
    
    Args:
        - vocab_size: size of the vocabulary
        - embedding_dim: dimension of the representation vector for words
        - word_to_ix: a mapping from word to index
    
    Shape:
        - Input: LongTensor (N, W), N = mini-batch size, 
                 W = number of indices to extract per mini-batch
        - Output: (N, vocab_size),  N = mini-batch size
    
    """

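    # Initializing the model, instantiating the required modules (not linking them yet)
    def __init__(self, vocab_size, embedding_dim, word_to_ix):
        # A standard Python way of saying CBOW inherits from nn.Module
        super(CBOW, self).__init__()
        self.word_to_ix = word_to_ix
        # Q: the embedding matrix, one embedding_dim-sized vector per word
        self.emb = nn.Embedding(vocab_size, embedding_dim)
        # A and b: the linear prediction model. The original cell left these
        # lines blank; the attribute name `linear` is a reconstruction.
        self.linear = nn.Linear(embedding_dim, vocab_size)

    # Here is where we actually link the modules to describe how the data flows through the network.
    def forward(self, inputs):
        """
        - inputs: LongTensor (N, W), N = mini-batch size, 
                  W = number of indices to extract per mini-batch
        - outputs: (N, vocab_size),  N = mini-batch size
        """
        # (N, W) -> (N, W, embedding_dim), then sum over the context -> (N, embedding_dim)
        summed = self.emb(inputs).sum(dim=1)
        # (N, embedding_dim) -> (N, vocab_size), then log softmax over the vocabulary
        logsoftmax = F.log_softmax(self.linear(summed), dim=1)
        return logsoftmax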

    # helper function to retrieve the trained vector space model (or word embedding)
    def word_embedding(self):
        return self.emb.weight
    
    # helper function to do a word to vector lookup
    def word2vec(self, word):
        return self.emb.weight[self.word_to_ix[word], :]

Note that there really weren't many lines of code! Most of them are comments!


### Training

Next we need to actually train the model.

In [13]:
# handy library to help you visualize the progress
import progressbar

def train_cbow(model, num_epochs=3, batch_size=50):

    loss_function = nn.NLLLoss()
    
    """
    The stochastic gradient descent optimizer used to update weights, 
    once gradient is computed. We set a learning rate of 0.001, and 
    momentum of 0.9.
    """
    optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
    
    # keep track of the loss of the model to see if the model converges.
    losses = []
    
    for epoch in range(num_epochs):
        
        # progressbar setting
        widgets = ['Epoch {} '.format(epoch), progressbar.Percentage(), ' ',
                    progressbar.Bar(), ' ',
                    progressbar.ETA()]
        max_iteration = np.ceil(X.shape[0]/float(batch_size))
        bar = progressbar.ProgressBar(widgets=widgets, max_value=max_iteration)

        """
        bar(data_batcher(X, Y, batch_size)) lets progressbar keep track of data_batcher(X, Y, batch_size):
        bar(*) yields whatever is passed into it, but also updates the counter for the progress bar
        """
        for context, target in bar(data_batcher(X, Y, batch_size)):
            
            """
            Wrap our training pair context and target first through torch.LongTensor, 
            and then through Variable.
            
            Variable is used to put x and y into the PyTorch computation graph. See readings for more details.
            """
            x = Variable(torch.LongTensor(context))
            y = Variable(torch.LongTensor(target))
            
            optimizer.zero_grad()

            """
            forward
            
            The final piece: connecting the network to the data! 
            We feed the data in, and get a loss value back.
            """
            
            # run this input x forward through the network to get your output vector
            outputs = model(x)
            # compare outputs against correct output y and generate loss using the loss function
            loss = loss_function(outputs, y)
            
            """Backward"""
            
            loss.backward()
            
            
            """
            Now that we have calculated the gradient for every parameter, we can use an
            optimizer to update the weights!
            
            Plain SGD, for example, intuitively does something like this:
            w = w - learning_rate * w.grad
            (the momentum term used here additionally mixes in previous updates)
            """
            optimizer.step()
            
            # just to record the loss
            losses.append(loss.item())
    return losses
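
The update rule that `optimizer.step()` applies can be checked on a tiny example. The sketch below uses plain SGD without momentum (unlike `train_cbow`, which sets `momentum=0.9`) so that a single step is exactly `w - learning_rate * w.grad`. The tensor `w` and the quadratic loss are made up for illustration.

```python
import torch
import torch.optim as optim

w = torch.tensor([1.0, -2.0], requires_grad=True)
toy_optimizer = optim.SGD([w], lr=0.1)   # plain SGD, no momentum

loss = (w ** 2).sum()                    # the gradient of this loss is 2 * w
loss.backward()

# what we expect one step to produce: move against the gradient
expected = w.detach() - 0.1 * w.grad
toy_optimizer.step()
print(w.detach(), expected)
```

Because the step moves against the gradient, the loss goes down: both weights shrink toward zero, the minimum of this toy loss.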

Let's start training!

In [14]:
# import matplotlib to plot the losses
from matplotlib import pyplot as plt

We first initialize a CBOW model with a vocab_size of 1000 (the top 1000 words we limited it to) and train a vector space model with an embedding dimension of 10, for the sake of shortening the training time. Note that a dimension of 10 may not be optimal; it is a parameter that you can potentially tune in Question 2.

We then train the model for 3 epochs with a batch_size of 40 (there are 4404634 training samples). These are also parameters that you can potentially tune.

This [link](https://stats.stackexchange.com/questions/164876/tradeoff-batch-size-vs-number-of-iterations-to-train-a-neural-network) provides a nice discussion about the relationship between batch size and iterations. [Here](https://stackoverflow.com/questions/4752626/epoch-vs-iteration-when-training-neural-networks) is a discussion for the definition of epoch, iteration, and batch size.
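
For the numbers used here, the relationship between samples, batch size, epochs, and iterations works out as follows. This is plain arithmetic on the figures quoted above; the variable names are illustrative.

```python
import math

num_samples = 4404634   # training pairs reported above
batch_size = 40
num_epochs = 3

# one iteration = one mini-batch; the last batch of an epoch may be smaller
iters_per_epoch = math.ceil(num_samples / batch_size)
total_iters = iters_per_epoch * num_epochs
print(iters_per_epoch, total_iters)
```

Doubling the batch size halves the number of weight updates per epoch, which is one reason the two parameters are usually tuned together.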

In [15]:
model = CBOW(vocab_size=vocab_size, embedding_dim=10, word_to_ix=word_to_ix)

# Train with 3 epochs, batch size of 40 (roughly 110,000 batches per epoch)
plt.plot(train_cbow(model, num_epochs=3, batch_size=40))

Epoch 0 100% |#################################################| Time:  0:06:22
Epoch 1   1% |                                                 | ETA:   0:06:28

KeyboardInterrupt: 

# 2. Evaluation

Now that we have a model trained, how do we know if it is good at all?

A **sanity check**, or **smell test**, is a quick and dirty way to check whether a model is doing anything reasonable at all. The idea is to check for a property that a good model should certainly have. In our case, since we are training on a movie review dataset, we can safely assume that *'movie'* and *'film'* should have similar representations in a reasonably well-trained model, as they should appear in similar contexts.

In [20]:
# this is PyTorch's cosine similarity module
cos = nn.CosineSimilarity(dim=0, eps=1e-6)

# The word 'movie' and 'film' should appear in similar context, and thus should have similar representation.
cos(model.word2vec('movie'), model.word2vec('film'))

tensor(0.8891)
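
Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it measures direction and ignores magnitude. The sketch below computes it by hand on two made-up vectors and compares the result with `nn.CosineSimilarity`.

```python
import torch
import torch.nn as nn

u = torch.tensor([1.0, 2.0, 3.0])
v = torch.tensor([2.0, 4.0, 6.5])

toy_cos = nn.CosineSimilarity(dim=0, eps=1e-6)
manual_cos = torch.dot(u, v) / (u.norm() * v.norm())   # u.v / (|u| |v|)
module_cos = toy_cos(u, v)

# a vector and any positive multiple of it point the same way
same_direction = toy_cos(u, 2 * u)
print(manual_cos.item(), module_cos.item(), same_direction.item())
```

Values close to 1 mean the two vectors point in nearly the same direction, 0 means they are orthogonal, and -1 means they point in opposite directions.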

Let's also check the 10 closest words to a given word, for example *year*:

In [21]:
def closest_n_words(model, vocab, word, n=10):
    cos = nn.CosineSimilarity(dim=0, eps=1e-6)
    word_vec = model.word2vec(word)
    scores=[]
    for v in vocab:
        if not v == word:
            score = cos(model.word2vec(v), word_vec)
            scores.append((score, v))
    
    """
    Return the n closest words to the target word.

    score.item() converts each 0-dimensional similarity tensor into a
    plain Python float, which gives sorted() a simple key to compare.
    """
    ranked = sorted(scores, key=lambda pair: pair[0].item(), reverse=True)[:n]
    n_closest = tuple(word for _, word in ranked)
    return n_closest

In [25]:
closest_n_words(model, vocab, 'year')

('as',
 'memorable',
 'hand',
 'director',
 'comedy',
 'life',
 'filmed',
 'usual',
 'robert',
 'meets')
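
The loop in `closest_n_words` compares one pair of vectors at a time. As an alternative sketch (not the notebook's method), the same ranking can be computed for the whole vocabulary with one matrix product. The function `closest_n_by_matrix`, the `toy_*` names, and the hand-made 2-dimensional embedding below are all hypothetical.

```python
import torch
import torch.nn.functional as F

def closest_n_by_matrix(embedding, ix_to_word, word_ix, n=3):
    """Hypothetical vectorized variant: rank all words by cosine similarity."""
    normed = F.normalize(embedding, dim=1)   # scale every row to unit length
    sims = normed @ normed[word_ix]          # cosine similarity to every word
    sims[word_ix] = float('-inf')            # exclude the query word itself
    top = torch.topk(sims, n).indices.tolist()
    return [ix_to_word[i] for i in top]

# a hand-made embedding where 'film' points almost the same way as 'movie'
toy_embedding = torch.tensor([[1.0, 0.0],
                              [0.9, 0.1],
                              [0.0, 1.0],
                              [-1.0, 0.0]])
toy_words = ['movie', 'film', 'year', 'bad']
toy_neighbors = closest_n_by_matrix(toy_embedding, toy_words, word_ix=0, n=2)
print(toy_neighbors)
```

To try this on the trained model, one could pass `model.word_embedding().detach()` as `embedding` together with a list that maps each index back to its word, built by inverting `word_to_ix`.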