In [1]:
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter
%matplotlib inline

# Project 2: Word2Vec (10 pts)
The goal of this project is to obtain the vector representations for words from text.

The main idea is that words appearing in similar contexts have similar meanings. Because of that, word vectors of similar words should be close together. Models that use word vectors can utilize these properties, e.g., in sentiment analysis a model will learn that "good" and "great" are positive words, but will also generalize to other words that it has not seen (e.g. "amazing") because they should be close together in the vector space.

Vectors can keep other language properties as well, like analogies. The question "a is to b as c is to ...?", where the answer is d, can be answered by looking into word vector space and calculating $\mathbf{u}_b - \mathbf{u}_a + \mathbf{u}_c$, and finding the word vector that is the closest to the result.

## Your task
Complete the missing code in this notebook. Make sure that all the functions follow the provided specification, i.e. the output of the function exactly matches the description in the docstring. 

We are given a text that contains $N$ unique words $\{ x_1, ..., x_N \}$. We will focus on the Skip-Gram model, in which the goal is to predict the context window $S = \{ x_{i-l}, ..., x_{i-1}, x_{i+1}, ..., x_{i+l} \}$ from the current word $x_i$, where $l$ is the window size.

We get a word embedding $\mathbf{u}_i$ by multiplying the matrix $\mathbf{U}$ with the one-hot representation $\mathbf{x}_i$ of a word $x_i$. Then, to get output probabilities for the context window, we multiply this embedding with another matrix $\mathbf{V}$ and apply softmax. The objective is to minimize the negative expected log-likelihood: $-\mathop{\mathbb{E}}[\log P(S|x_i;\mathbf{U}, \mathbf{V})]$.
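
The computation described above can be sketched on a toy vocabulary. This is only an illustration, not the graded `Embedding` class: the shapes of `U` and `V`, the random initialization, and the helper name `skipgram_probs` are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 5, 3                      # toy vocabulary size and embedding dimension

# Assumed shapes: U maps one-hot inputs to embeddings, V maps embeddings to scores.
U = rng.normal(size=(D, N))
V = rng.normal(size=(N, D))

def skipgram_probs(center_idx):
    """Return P(context word | center word) over the whole toy vocabulary."""
    x = np.zeros(N)
    x[center_idx] = 1.0          # one-hot representation x_i
    u = U @ x                    # embedding u_i, identical to U[:, center_idx]
    scores = V @ u               # one score per vocabulary word
    scores -= scores.max()       # shift for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

probs = skipgram_probs(2)
context_idx = 4
loss = -np.log(probs[context_idx])   # negative log-probability of one observed pair
```

Multiplying by a one-hot vector only selects a column of `U`, so an implementation could index the column directly instead of building `x`.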

You are given a dataset with positive and negative reviews. Your task is to:
+ Construct input-output pairs corresponding to the current word and a word in the context window
+ Implement forward and backward propagation with parameter updates for the Skip-Gram model
+ Train the model
+ Test it on word analogies and a sentiment analysis task

## General remarks

Only functionality in the Python files will be graded. Carefully read the method docstrings to understand the task, the parameters, and the expected output.
Fill in the missing code at the markers in the files `data.py`, `model.py`, `train.py`, and `analogies.py`:
```python
###########################
# YOUR CODE HERE
###########################
```
Do not add or modify any code at other places in the notebook and python code files except where otherwise explicitly stated.
After you fill in all the missing code, restart the kernel and re-run all the cells in the notebook.

The following things are **NOT** allowed:
- Using additional `import` statements
- Copying / reusing code from other sources (e.g. code by other students)

If you plagiarise even for a single project task, you won't be eligible for the bonus this semester.

# 1. Load data (1.5 pts)

We'll be working with a subset of reviews for restaurants in Las Vegas. The reviews that we'll be working with are either 1-star or 5-star. First, we need to process tokens (words) into integer values. Second, as the embedding model is trained with pairs of tokens, we need to compute pairs of co-occurring tokens.

You need to implement this functionality in `data.py`:
- `compute_token_to_index` **(0.5 pts)**: Map tokens (words) in sequences to numerical values (integers)
- `get_token_pairs_from_window` **(1 pts)**: For each token in a sequence, compute all tokens that appear in its context, i.e. tokens that are within a given window size around that word
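
As a rough illustration of the two steps above, the sketch below maps tokens to integers and then enumerates (center, context) pairs. The names `window_pairs` and `toy_token_to_idx` are hypothetical, and the alphabetical index order is an assumption; the docstrings in `data.py` define the actual signatures and ordering.

```python
def window_pairs(sequence, window_size):
    """Yield (center, context) pairs for every position in `sequence`."""
    for i, center in enumerate(sequence):
        start = max(0, i - window_size)                  # clip at the left edge
        stop = min(len(sequence), i + window_size + 1)   # clip at the right edge
        for j in range(start, stop):
            if j != i:           # a token is not its own context
                yield center, sequence[j]

tokens = ["the", "food", "was", "great"]
# Assumption for this toy example: indices follow alphabetical order.
toy_token_to_idx = {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}
indices = [toy_token_to_idx[tok] for tok in tokens]
pairs = list(window_pairs(indices, window_size=1))
```

Tokens at the start or end of a sequence simply have fewer context words, which is why the window bounds are clipped.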


In [2]:
from data import load_data, build_vocabulary, compute_token_to_index, get_token_pairs_from_window

reviews_1star, reviews_5star = load_data('task03_data.npy')
corpus = reviews_1star + reviews_5star
corpus, vocabulary, counts = build_vocabulary(corpus)
token_to_idx, idx_to_token, idx_to_count = compute_token_to_index(vocabulary, counts)
data = np.array(sum((list(get_token_pairs_from_window(sequence, 3, token_to_idx)) 
                        for sequence in corpus), start = [])) # N, 2
# Should output
# Total number of pairs: 207462
print('Total number of pairs:', data.shape[0])

Total number of pairs: 207462


In [3]:
VOCABULARY_SIZE = len(vocabulary)
EMBEDDING_DIM = 32

In [4]:
print('Number of negative reviews:', len(reviews_1star))
print('Number of positive reviews:', len(reviews_5star))
print('Number of unique words:', VOCABULARY_SIZE)

Number of negative reviews: 1000
Number of positive reviews: 2000
Number of unique words: 201


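We calculate a sampling weight for each pair to counter the imbalance between rare and frequent words. The intent is that rare words are sampled relatively more often. See Mikolov et al., "Distributed Representations of Words and Phrases and their Compositionality": https://arxiv.org/pdf/1310.4546.pdf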

In [5]:
# Compute sampling probabilities
probabilities = np.array([1 - np.sqrt(1e-3 / idx_to_count[token_idx]) for token_idx in data[:, 0]])
probabilities /= np.sum(probabilities)
# Should output: 
# [4.8206203e-06 4.8206203e-06 4.8206203e-06]
print(probabilities[:3])

[4.8206203e-06 4.8206203e-06 4.8206203e-06]


# 2. Model Definition (6.5 pts)

Now you need to implement the word embedding model. In particular, you need to implement the following functionality in the `Embedding` class in `model.py`:
- `one_hot` **(0.5 pts)**: Computes a one-hot encoding for the integer representations of tokens
- `softmax` **(1 pts)**: Applies the softmax normalization to model outputs. (Hint: Watch out for numerical stability!)
- `loss` **(0.5 pts)**: Computes the cross-entropy loss for a prediction (=probability distribution over the vocabulary) given the ground truth observed context word
- `forward` **(2 pts)**: Computes the forward pass of the model. You also need to cache intermediate values as they are needed for backpropagation.
- `backward` **(2.5 pts)**: Computes the gradients with respect to both model weights $\mathbf{V}$ and $\mathbf{U}$. Use the activation values cached in the `forward` method.
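
The numerical-stability hint can be illustrated with a small standalone sketch. The function names, the `(batch, vocabulary)` layout, and the mean reduction are assumptions for this example; the docstrings in `model.py` define the layout the graded methods must use.

```python
import numpy as np

def stable_softmax(scores):
    """Row-wise softmax for an array of shape (batch, vocabulary)."""
    shifted = scores - scores.max(axis=1, keepdims=True)  # largest entry becomes 0
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

def cross_entropy(probs, targets):
    """Mean negative log-probability of the target index in each row."""
    picked = probs[np.arange(len(targets)), targets]
    return -np.mean(np.log(picked))

scores = np.array([[1000.0, 1001.0, 1002.0],   # naive exp() would overflow here
                   [0.0, 0.0, 0.0]])
targets = np.array([2, 0])
probs = stable_softmax(scores)
loss = cross_entropy(probs, targets)

# Gradient of the mean loss with respect to the scores: (probs - one_hot) / batch.
grad_scores = probs.copy()
grad_scores[np.arange(len(targets)), targets] -= 1.0
grad_scores /= len(targets)
```

Subtracting the row maximum leaves the softmax output unchanged because the common factor cancels in the ratio, while keeping every exponent at or below zero.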


# 3. Training (1 pts)

We train our model using stochastic gradient descent. At every step we sample a mini-batch from data and update the weights.

The `get_batch` function defined below samples pairs from the data and creates mini-batches. It draws each pair according to the previously calculated sampling probabilities.

You need to implement the optimizer that iteratively updates the model weights after each training step. We use an optimizer with momentum. In particular, you need to implement the following functionality in `train.py`:
- `step` **(1 pts)**: Applies an update to the model weights given the gradients of the current step.
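
For orientation, one common formulation of gradient descent with momentum is sketched below on a toy quadratic. The helper `momentum_step` and this particular update rule are assumptions for illustration; the docstring of `Optimizer.step` in `train.py` specifies the exact update that is graded, and it may differ.

```python
import numpy as np

def momentum_step(param, grad, velocity, learning_rate, momentum):
    """Return the updated (param, velocity) after one momentum step."""
    velocity = momentum * velocity - learning_rate * grad   # decayed running direction
    return param + velocity, velocity

# Minimize f(w) = 0.5 * ||w||^2, whose gradient is simply w.
w = np.array([1.0, -2.0])
v = np.zeros_like(w)
for _ in range(100):
    w, v = momentum_step(w, grad=w, velocity=v, learning_rate=0.1, momentum=0.5)
```

The velocity accumulates past gradients, so consistent directions are amplified while oscillating components partly cancel.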


In [6]:
from model import Embedding
from train import Optimizer

In [7]:
rng = np.random.default_rng(123)
def get_batch(data, size, prob):
    x = rng.choice(data, size, p=prob)
    return x[:,0], x[:,1]

Training the model can take some time so plan accordingly.

In [8]:
model = Embedding(VOCABULARY_SIZE, EMBEDDING_DIM)
optim = Optimizer(model, learning_rate=1.0, momentum=0.5)

losses = []

MAX_ITERATIONS = 15000
PRINT_EVERY = 1000
BATCH_SIZE = 1000

for i in range(MAX_ITERATIONS):
    x, y = get_batch(data, BATCH_SIZE, probabilities)
    
    loss = model.forward(x, y)
    grad = model.backward()
    optim.step(grad)
    
    assert not np.isnan(loss)
    
    losses.append(loss)

    if (i + 1) % PRINT_EVERY == 0:
        print(f'Iteration: {i + 1}, Avg. training loss: {np.mean(losses[-PRINT_EVERY:]):.4f}')

Iteration: 1000, Avg. training loss: 3.7195
Iteration: 2000, Avg. training loss: 3.5692
Iteration: 3000, Avg. training loss: 3.5514
Iteration: 4000, Avg. training loss: 3.5378
Iteration: 5000, Avg. training loss: 3.5259
Iteration: 6000, Avg. training loss: 3.5178
Iteration: 7000, Avg. training loss: 3.5143
Iteration: 8000, Avg. training loss: 3.5015
Iteration: 9000, Avg. training loss: 3.4980
Iteration: 10000, Avg. training loss: 3.4880
Iteration: 11000, Avg. training loss: 3.4951
Iteration: 12000, Avg. training loss: 3.4872
Iteration: 13000, Avg. training loss: 3.4807
Iteration: 14000, Avg. training loss: 3.4832
Iteration: 15000, Avg. training loss: 3.4823


The embedding matrix is given by $\mathbf{U}^T$, where the $i$th row is the vector for the $i$th word in the vocabulary.

In [9]:
emb_matrix = model.U.T

# 4. Analogies (1 pts)

As mentioned before, vectors can keep some language properties like analogies. Given a relation a:b and a query c, we can find d such that c:d follows the same relation. We hope to find d by using vector operations. In this case, finding the real word vector $\mathbf{u}_d$ closest to $\mathbf{u}_b - \mathbf{u}_a + \mathbf{u}_c$ gives us d. 

**Note that the quality of the analogy results is not expected to be excellent.**

You need to implement the following functionality in `analogies.py`:
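- `get_analogies` (**1 pts**): Given a triplet of tokens (a, b, d), compute the top $k$ tokens whose embeddings are closest to $\mathbf{u}_a - \mathbf{u}_b + \mathbf{u}_d$. Here the unknown is the third term c of the analogy "a is to b as c is to d".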
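
The vector arithmetic can be checked on hand-made embeddings where the answer is known. This is a sketch under stated assumptions, not the graded function: the helper `closest_tokens`, the use of cosine similarity as the notion of "closest", and the exclusion of the input tokens are all choices made for this example.

```python
import numpy as np

def closest_tokens(emb, query, idx_to_tok, k, exclude=()):
    """Return the k tokens with the highest cosine similarity to `query`."""
    norms = np.linalg.norm(emb, axis=1) * np.linalg.norm(query)
    sims = emb @ query / np.maximum(norms, 1e-12)   # guard against zero vectors
    order = np.argsort(-sims)                       # most similar first
    return [idx_to_tok[int(i)] for i in order if int(i) not in exclude][:k]

# Hand-made 2-D embeddings: axis 0 encodes "meal", axis 1 encodes "time of day".
toy_tokens = ["lunch", "day", "dinner", "night"]
toy_emb = np.array([[1.0, 1.0],    # lunch
                    [0.0, 1.0],    # day
                    [1.0, -1.0],   # dinner
                    [0.0, -1.0]])  # night
a, b, d = 0, 1, 3
query = toy_emb[a] - toy_emb[b] + toy_emb[d]        # u_a - u_b + u_d
result = closest_tokens(toy_emb, query, toy_tokens, k=1, exclude={a, b, d})
print(result)   # ['dinner']
```

With trained embeddings the query vector rarely coincides with a real word vector, which is why the top $k$ candidates are inspected rather than a single exact match.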

In [10]:
from analogies import get_analogies

In [11]:
triplets = [['is', 'was', 'were'], ['lunch', 'day', 'night'], ['i', 'my', 'your']]

for triplet in triplets:
    a, b, d = triplet
    candidates = get_analogies(emb_matrix, triplet, token_to_idx, idx_to_token, num_candidates=5)
    print(f'`{a}` is to `{b}` as [{", ".join(candidates)}] is to `{d}`')

`is` is to `was` as [are, from, sauce, friendly, cheese] is to `were`
`lunch` is to `day` as [while, great, dinner, first, so] is to `night`
`i` is to `my` as [you, we, if, not, taste] is to `your`
