In [5]:
import numpy as np
import torch
from torch.utils.data import DataLoader

import torch.nn as nn
from tqdm.auto import tqdm, trange
import random
from torch import optim

import wandb

# Helpful for computing cosine similarity. Note that scipy's cosine() returns
# the cosine *distance* (1 - similarity), NOT a similarity!
from scipy.spatial.distance import cosine

# We'll use this to save our models
from gensim.models import KeyedVectors

import pickle

#
# IMPORTANT NOTE: Always set your random seeds when dealing with stochastic
# algorithms as it lets your bugs be reproducible and (more importantly) it lets
# your results be reproducible by others.
#
random.seed(42)
np.random.seed(42)
torch.manual_seed(42)

<torch._C.Generator at 0x757257ff29b0>

In [6]:
device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps"
    if torch.backends.mps.is_available()
    else "cpu"
)
print(f"Using {device} device")

Using cuda device


In [8]:
from Word2Vec import RandomNumberGenerator

## Create a class to hold the data

Before we get to training word2vec, we'll need to process the corpus into some representation. The `Corpus` class will handle much of the functionality for corpus reading and keeping track of which word types belong to which ids. The `Corpus` class will also handle the crucial functionality of generating negative samples for training (i.e., randomly-sampled words that were not in the target word's context).

Some parts of this class can be completed after you've gotten word2vec up and running, so see the notes below and the details in the homework PDF.
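
To make the negative-sampling idea concrete, here is a minimal sketch of one common way to build a sampling table. It assumes the approach from the original word2vec implementation, where each word is drawn in proportion to its frequency raised to the 0.75 power; the homework PDF may specify something different. The helper names `build_negative_sampling_table` and `sample_negatives` are hypothetical and are not the `Corpus` methods. Like the `generate_negative_samples(target_id, ...)` call used later, this sketch only excludes the target word itself.

```python
import random
from collections import Counter


def build_negative_sampling_table(token_counts, table_size=1000, power=0.75):
    """Return a list of word ids in which each id appears roughly in
    proportion to count ** power (hypothetical helper, assumed exponent)."""
    weights = {word_id: count ** power for word_id, count in token_counts.items()}
    total = sum(weights.values())
    table = []
    for word_id, weight in weights.items():
        # Every word gets at least one slot so rare words can still be drawn.
        slots = max(1, round(table_size * weight / total))
        table.extend([word_id] * slots)
    return table


def sample_negatives(table, target_id, num_samples, rng):
    """Draw ids uniformly from the table, skipping the target word itself."""
    negatives = []
    while len(negatives) < num_samples:
        candidate = rng.choice(table)
        if candidate != target_id:
            negatives.append(candidate)
    return negatives


counts = Counter({0: 900, 1: 90, 2: 10})  # toy word-id frequencies
table = build_negative_sampling_table(counts)
negatives = sample_negatives(table, target_id=0, num_samples=5, rng=random.Random(42))
```

Precomputing the table makes each draw a single uniform lookup, which matters when millions of negatives are needed.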

In [4]:
from Word2Vec import Corpus

## Create the corpus

Now that we have code to turn the text into training data, let's do so. We've provided several files to help you:

* `reviews-word2vec.tiny.txt` -- use this to debug your corpus reader
* `reviews-word2vec.med.txt` -- use this to debug/verify the whole word2vec works
* `reviews-word2vec.large.txt.gz` -- use this when everything works to generate your vectors for later parts
* `reviews-word2vec.HUGE.gz` -- _do not use this_ unless (1) everything works and (2) you really want to test/explore. This file is not needed at all to do your homework.

We recommend starting to debug with the first file, as it is small and fast to load (quicker to find bugs). When debugging, we recommend setting the `min_token_freq` argument to 2 so that you can verify that part of the code is working but you still have enough word types left to test the rest.

You'll use the remaining files later, where they're described.
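
As a rough illustration of what minimum-frequency thresholding does, the sketch below replaces rare tokens with `<UNK>` and then assigns ids. It assumes that tokens seen fewer than `min_token_freq` times are replaced (rather than dropped) and that the comparison is `>=`; check the homework PDF for the exact rule. The function names are hypothetical, not part of the `Corpus` class.

```python
from collections import Counter


def apply_min_token_freq(tokens, min_token_freq, unk_token="<UNK>"):
    """Replace every token seen fewer than min_token_freq times with unk_token."""
    counts = Counter(tokens)
    return [tok if counts[tok] >= min_token_freq else unk_token for tok in tokens]


def build_vocab(tokens):
    """Assign each distinct token an id in order of first appearance."""
    word_to_index = {}
    for tok in tokens:
        if tok not in word_to_index:
            word_to_index[tok] = len(word_to_index)
    return word_to_index


tokens = "the book was good and the plot was great".split()
filtered = apply_min_token_freq(tokens, min_token_freq=2)
word_to_index = build_vocab(filtered)
token_ids = [word_to_index[tok] for tok in filtered]
```

With a threshold of 2 on this toy sentence, only `the` and `was` survive, which shows why a tiny corpus needs a low threshold to keep enough word types for testing.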

In the next cell, create your `Corpus`, read in the data, and generate the negative sampling table.

In [5]:
rng = RandomNumberGenerator(1000000)
rng.set_max_val(1e6)
corpus = Corpus(rng)
dataset_name = 'reviews-word2vec.med.txt'
corpus.load_data(dataset_name, 2)
corpus.generate_negative_sampling_table()

Reading data and tokenizing
Counting token frequencies
Performing minimum thresholding


65523it [00:00, 3542542.74it/s]
100%|██████████| 10607824/10607824 [00:10<00:00, 997668.68it/s] 


Loaded all data from reviews-word2vec.med.txt; saw 7822050 tokens (65523 unique)
Generating sampling table


## Generate the training data

Once we have the corpus ready, we need to generate our training dataset. Each instance in the dataset is a target word along with positive and negative examples of context words. Given the target word as input, we'll want to predict (or not predict) these positive and negative context words as outputs using our network. Your task here is to create a Python `list` of instances.

Your final training data should be a list of tuples in the format ([target_word_id], [word_id_1, ...], [predicted_labels]), where each item in the tuple is a list:
1. The first item is a list consisting only of the target word's ID.
2. The second item is a list of word ids for both context words and negative samples.
3. The third item is a list of labels to be predicted for each of the word ids in the second list (i.e., `1` for context words and `0` for negative samples).

You will feed these tuples into the PyTorch `DataLoader` later, which will do the conversion to `Tensor` objects. You will need to make sure that all of the lists in each tuple are `np.array` instances and not plain Python lists for this `Tensor` conversion to work.
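
The tuple format and the fixed instance size can be sketched on a toy sequence as follows. This is an illustration in plain Python lists, not the reference solution: `build_instance` and `draw_negative` are hypothetical names, the negative sampler is a constant stand-in, and a real instance would wrap each list in `np.array` as described above.

```python
def build_instance(token_ids, pos, window_size, negatives_per_context, draw_negative):
    """Build ([target_id], [context and negative ids], [labels]) with a fixed size."""
    num_context_slots = 2 * window_size
    # Every instance has this many ids, regardless of position in the corpus.
    instance_size = num_context_slots * (1 + negatives_per_context)

    lo = max(0, pos - window_size)
    hi = min(len(token_ids), pos + window_size + 1)
    context = [token_ids[i] for i in range(lo, hi) if i != pos]

    # Instances near the corpus edges have fewer context words, so they get
    # extra negatives to reach the common size.
    negatives = [draw_negative() for _ in range(instance_size - len(context))]
    labels = [1] * len(context) + [0] * len(negatives)
    return [token_ids[pos]], context + negatives, labels


token_ids = [10, 11, 12, 13, 14]


def draw_negative():
    return 99  # hypothetical stand-in for a real negative sampler


first = build_instance(token_ids, 0, 2, 2, draw_negative)
middle = build_instance(token_ids, 2, 2, 2, draw_negative)
```

The first token has only two context words and the middle token has four, yet both instances end up with twelve ids and twelve labels, which is what batching requires.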

In [6]:
window_size = 2
num_negative_samples_per_target = 2


def generate_training_data():
    training_data = []

    # Loop through each token in the corpus and generate an instance for each,
    # adding it to training_data
    for pos, target_id in tqdm(enumerate(corpus.full_token_sequence_as_ids)):
        
        if target_id == corpus.word_to_index['<UNK>']:
            continue

        # For each target word in our dataset, select context words
        # within +/- the window size in the token sequence
        lower_bound = max(0, pos - window_size)
        upper_bound = min(
            len(corpus.full_token_sequence_as_ids), pos + window_size + 1)
        context_id = []
        for i in range(lower_bound, upper_bound):
            if i == pos:
                continue
            context_id.append(corpus.full_token_sequence_as_ids[i])

        # For each positive target, we need to select negative examples of
        # words that were not in the context. Use the num_negative_samples_per_target
        # hyperparameter to generate these, using the generate_negative_samples()
        # method from the Corpus class
        negative_size = window_size*2*num_negative_samples_per_target + \
            window_size * 2 - len(context_id)
        negative_samples = corpus.generate_negative_samples(
            target_id, negative_size)
        samples = np.concatenate((context_id, negative_samples))
        labels = np.concatenate(
            (np.ones(len(context_id)), np.zeros(len(negative_samples))))
        training_data.append((target_id, samples, labels))

        # NOTE: this part might not make sense until later when you do the training
        # so feel free to revisit it to see why it happens.
        #
        # Our training will use batches of instances together (compare that
        # with HW1's SGD that used one item at a time). PyTorch will require
        # that all instances in a batch have the same size, which creates an issue
        # for us here since the target words at the very beginning or end of the corpus
        # have shorter contexts.
        #
        # To work around these edge-cases, we need to ensure that each instance has
        # the same size, which means it needs to have the same number of positive
        # and negative examples. Since we are short on positive examples here (due
        # to the edge of the corpus), we can just add more negative samples.
        #
        # YOUR TASK: determine what is the maximum number of context words (positive
        # and negative) for any instance and then, for instances that have fewer than
        # this number of context words, add in negative examples.
        #
        # NOTE: The maximum is fixed, so you can precompute this outside the loop
        # ahead of time.
    return training_data


training_data = generate_training_data()

7822050it [00:39, 200052.08it/s]


## Create the network

We'll create a new neural network as a subclass of `nn.Module` like we did in Homework 1. However, _unlike_ the network you built in Homework 1, we do not need to use linear layers to implement word2vec. Instead, we will use PyTorch's `Embedding` class, which maps an index (e.g., a word id in this case) to an embedding.

Roughly speaking, word2vec's network makes a prediction by computing the dot product of the target word's embedding and a context word's embedding and then passing this dot product through the sigmoid function ($\sigma$) to predict the probability that the context word was actually in the context. The homework write-up has lots of details on how this works. Your `forward()` function will have to implement this computation.
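
The prediction described above can be worked through by hand on tiny vectors. The sketch below uses plain Python lists in place of embedding lookups, so it only illustrates the arithmetic; the actual `forward()` would operate on batched `Tensor` objects, and the function names here are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_context_probabilities(target_vec, context_vecs):
    """Return sigmoid(target . context) for each candidate context vector."""
    probs = []
    for context_vec in context_vecs:
        dot = sum(t * c for t, c in zip(target_vec, context_vec))
        probs.append(sigmoid(dot))
    return probs


target_vec = [1.0, 0.0]
# Dot products with the target are 2, 0, and -2 respectively.
context_vecs = [[2.0, 0.0], [0.0, 3.0], [-2.0, 1.0]]
probs = predict_context_probabilities(target_vec, context_vecs)
```

A positive dot product gives a probability above 0.5, an orthogonal pair gives exactly 0.5, and a negative dot product gives a probability below 0.5.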

In [7]:
from Word2Vec import Word2Vec

## Train the network!

Now that you have data in the right format and a neural network designed, it's time to train the network and see if it's all working. The training code will look surprisingly similar at times to your PyTorch code from Homework 1 since all networks share the same base training setup. However, we'll add a few new elements to get you familiar with more common training techniques.

For all steps, be sure to use the hyperparameters values described in the write-up.

1. Initialize your optimizer and loss function 
2. Create your network
3. Load your dataset into PyTorch's `DataLoader` class, which will take care of batching and shuffling for us (yay!)
4. Create a new `SummaryWriter` to periodically write our running-sum of the loss to a tensorboard
5. Train your model 

Two new elements show up. First, we'll be using `DataLoader`, which is going to sample data for us and put it in a batch (and also convert the data to `Tensor` objects). Iterating over the `DataLoader` yields one batch at a time until every item has been returned once (a full epoch's worth).
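
To picture what one epoch of shuffled batching looks like, here is a plain-Python stand-in. It is not `DataLoader` itself and skips the `Tensor` conversion; `iterate_batches` is a hypothetical name used only for this sketch.

```python
import random


def iterate_batches(instances, batch_size, rng):
    """Yield shuffled batches that together cover every instance exactly once."""
    order = list(range(len(instances)))
    rng.shuffle(order)
    for start in range(0, len(order), batch_size):
        yield [instances[i] for i in order[start:start + batch_size]]


instances = list(range(10))
batches = list(iterate_batches(instances, batch_size=4, rng=random.Random(0)))
batch_sizes = [len(batch) for batch in batches]
```

As with `DataLoader` under its default settings, the last batch is smaller when the dataset size is not a multiple of the batch size.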

The second new part is using `tensorboard`. As you might have noticed in Homework 1, training neural models can take some time. [TensorBoard](https://www.tensorflow.org/tensorboard/) is a handy web-based view that you can check during training to see how the model is doing. We'll use it here and periodically log a running sum of the loss after a set number of steps. The homework write-up has a plot of what this looks like. We'll be doing something simple here with TensorBoard, but it will come in handy later as you train larger models (for longer) and may want to visually check if your model is converging. TensorBoard was initially written for another deep learning framework, TensorFlow, but proved so useful it was ported to work in PyTorch too and is [easy to integrate](https://pytorch.org/tutorials/recipes/recipes/tensorboard_with_pytorch.html).
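
The running-sum logging pattern is independent of the logging backend. The sketch below shows the bookkeeping with a hypothetical `log_fn` callback standing in for whichever logger is used; the loss values are made up for illustration.

```python
def train_with_running_loss(step_losses, log_every, log_fn):
    """Accumulate per-step losses and report the sum every log_every steps."""
    running_sum = 0.0
    for step, loss in enumerate(step_losses, start=1):
        running_sum += loss
        if step % log_every == 0:
            log_fn(step, running_sum)
            running_sum = 0.0  # reset so each report covers one window only


logged = []
toy_losses = [1.0] * 5 + [0.5] * 5  # made-up per-step losses
train_with_running_loss(toy_losses, log_every=5,
                        log_fn=lambda step, value: logged.append((step, value)))
```

Counting steps from 1 keeps every reporting window the same length, so the logged sums are directly comparable.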

To start training, we recommend training on the `wiki-bios.10k.txt` dataset. This data is small enough you can get through an epoch in a few minutes (or less) while still being large enough you can test whether the model is learning anything by examining common words. Below this cell we've added a few helper functions that you can use to debug and query your model. In particular, the `get_neighbors()` function is a great way to test: if your model has learned anything, the nearest neighbors for common words should seem reasonable (without having to jump through mental hoops). An easy word to test on the `10k` data is "january" which should return month-related words as being most similar.

**NOTE**: Since we're training on biographies, the text itself will be skewed towards words likely to show up in biographies--which isn't necessarily like "regular" text. You may find that your model has few instances of words you think are common, or that the model learns poor or unusual neighbors for these. When querying the neighbors, it can help to think of which words you think are likely to show up in biographies on Wikipedia and use those as probes to see what the model has learned.

Once you're convinced the model is learning, switch to the `med` data and train your model as specified in the PDF. Once trained, save your model using the `save()` function at the end of the notebook. This function records your data in a common format for word2vec vectors and lets you load the vectors into other libraries that have more advanced functionality. In particular, you can use the [gensim](https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html) code in the other included notebook to explore the vectors and do simple vector analogies.

2: 0%|          | 6360/1378872 [00:15<55:16, 413.88it/s]  
8: 2%|▏         | 6216/344718 [00:15<13:48, 408.53it/s]  
32: 7%|▋         | 6166/86180 [00:15<03:17, 405.75it/s]  
64: 14%|█▎        | 5883/43090 [00:14<01:34, 394.75it/s]  
128: 29%|██▉       | 6222/21545 [00:15<00:38, 392.87it/s]  
256: 52%|█████▏    | 5609/10773 [00:15<00:14, 368.27it/s]  
512: 100%|██████████| 5387/5387 [00:15<00:00, 356.52it/s]  

In [8]:
# TODO: Set your training stuff, hyperparameters, models, etc. here
batch_size = 64
embedding_size = 50
lr = 5e-5
num_epochs = 1
max_steps = 1e6
model = Word2Vec(len(corpus.word_to_index), embedding_size).to(device)
data_loader = DataLoader(training_data, batch_size=batch_size, shuffle=True)
optimizer = optim.Adam(model.parameters(), lr=lr)
loss_fn = nn.BCELoss()

# TODO: Initialize weights and biases (wandb) here
wandb.init(
    project="si630_hw2",
    config={
        "learning_rate": lr,
        "architecture": "word2vec",
        "dataset": dataset_name,
        "epochs": num_epochs,
    }
)

# HINT: wrapping the epoch/step loops in nested tqdm calls is a great way
# to keep track of how fast things are and how much longer training will take

for epoch in range(num_epochs):

    loss_sum = 0

    # TODO: use your DataLoader to iterate over the data
    for step, data in enumerate(tqdm(data_loader)):

        # NOTE: since you created the data as a tuple of three np.array instances,
        # these have now been converted to Tensor objects for us
        target_ids, context_ids, labels = data
        target_ids = target_ids.to(device)
        context_ids = context_ids.to(device)
        labels = labels.to(device)

        # TODO: Fill in all the training details here
        optimizer.zero_grad()
        predictions = model(target_ids, context_ids)
        loss = loss_fn(predictions.float(), labels.float())
        loss.backward()
        loss_sum += loss.item()
        optimizer.step()

        # TODO: Based on the details in the Homework PDF, periodically
        # report the running-sum of the loss to Weights & Biases (wandb).
        # Be sure to reset the running sum after reporting it.
        if step % 100 == 0 and step > 0:
            wandb.log({"loss": loss_sum})
            loss_sum = 0

        # TODO: it can be helpful to add some early stopping here after
        # a fixed number of steps (e.g., if step > max_steps)
        if step > max_steps:
            break

# once you finish training, it's good practice to switch to eval.
model.eval()
wandb.finish()

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: Currently logged in as: zimgong (team-orion). Use `wandb login --relogin` to force relogin


100%|██████████| 121648/121648 [07:35<00:00, 266.94it/s]


Run history:
loss  █▇▆▆▅▅▄▄▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁

Run summary:
loss  60.73434


## Verify things are working

Once you have an initial model trained, try using the following code to query the model for the nearest neighbors of a word. This code is intended to help you debug.

In [9]:
def get_neighbors(model, word_to_index, target_word):
    """ 
    Finds the top 10 most similar words to a target word
    """
    outputs = []
    for word, index in tqdm(word_to_index.items(), total=len(word_to_index)):
        similarity = compute_cosine_similarity(model, word_to_index, target_word, word)
        result = {"word": word, "score": similarity}
        outputs.append(result)

    # Sort by highest scores
    neighbors = sorted(outputs, key=lambda o: o['score'], reverse=True)
    return neighbors[1:11]

def compute_cosine_similarity(model, word_to_index, word_one, word_two):
    '''
    Computes the cosine similarity between the two words
    '''
    try:
        word_one_index = word_to_index[word_one]
        word_two_index = word_to_index[word_two]
    except KeyError:
        return 0

    embedding_one = model.target_embeddings(torch.LongTensor([word_one_index]).to(device)).cpu()
    embedding_two = model.target_embeddings(torch.LongTensor([word_two_index]).to(device)).cpu()
    similarity = 1 - abs(float(cosine(embedding_one.detach().squeeze().numpy(),
                                      embedding_two.detach().squeeze().numpy())))
    return similarity
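
Because `scipy.spatial.distance.cosine` returns a distance, the helper above converts it back with `1 - distance`. The relationship can be checked with a small pure-Python sketch; the function names are hypothetical and the vectors are toy values.

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_distance(u, v):
    """The quantity scipy's cosine() computes: 1 minus the similarity."""
    return 1.0 - cosine_similarity(u, v)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
```

Similarity ranges from -1 (opposite directions) to 1 (same direction), so the corresponding distance ranges from 2 down to 0.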

In [10]:
get_neighbors(model, corpus.word_to_index, "recommend")

100%|██████████| 65523/65523 [00:04<00:00, 14575.16it/s]


[{'word': 'highly', 'score': 0.9607359843143927},
 {'word': 'I', 'score': 0.9559921167030977},
 {'word': 'm', 'score': 0.9545440313180487},
 {'word': 'bought', 'score': 0.9543723706362278},
 {'word': 'would', 'score': 0.9521774294922356},
 {'word': 'read', 'score': 0.9512196543556726},
 {'word': 've', 'score': 0.9490245503610805},
 {'word': 'have', 'score': 0.9443953314992476},
 {'word': 'be', 'score': 0.9440260320233161},
 {'word': 'anyone', 'score': 0.9438091503515359}]

In [11]:
get_neighbors(model, corpus.word_to_index, "son")

100%|██████████| 65523/65523 [00:04<00:00, 14697.80it/s]


[{'word': 'daughter', 'score': 0.9988393331145972},
 {'word': '6', 'score': 0.9981162856294},
 {'word': '7', 'score': 0.9977655757959818},
 {'word': '10', 'score': 0.9975314691113029},
 {'word': 'day', 'score': 0.9973068436806289},
 {'word': '8', 'score': 0.9972927310894091},
 {'word': '9', 'score': 0.9972801498259447},
 {'word': '12', 'score': 0.9970179092559571},
 {'word': '1', 'score': 0.996735671645931},
 {'word': 'husband', 'score': 0.9967149695562113}]

# Save your vectors for the gensim inspection part!

Once you have a fully trained model, save it using the code below. Note that we only save the `target_embeddings` from the model, but you could modify the code if you want to save the context vectors--or even try doing fancier things like saving the concatenation of the two or the average of the two!
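
The two "fancier" options mentioned above amount to simple per-word operations. Here is a sketch on plain lists with hypothetical function names; in practice these would be applied to the rows of the target and context embedding matrices.

```python
def average_vectors(target_vec, context_vec):
    """Element-wise mean of the two vectors; the dimension is unchanged."""
    return [(t + c) / 2.0 for t, c in zip(target_vec, context_vec)]


def concatenate_vectors(target_vec, context_vec):
    """Target vector followed by context vector; the dimension doubles."""
    return list(target_vec) + list(context_vec)


target_vec = [1.0, 2.0]
context_vec = [3.0, 6.0]
averaged = average_vectors(target_vec, context_vec)
concatenated = concatenate_vectors(target_vec, context_vec)
```

Note that concatenation doubles the vector length, so the `vector_size` given to `KeyedVectors` would presumably need to double as well.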

In [12]:
def save(model, corpus, filename):
    '''
    Saves the model to the specified filename as a gensim KeyedVectors in the
    text format so you can load it separately.
    '''

    # Creates an empty KeyedVectors with our embedding size
    kv = KeyedVectors(vector_size=model.embedding_size)        
    vectors = []
    words = []
    # Get the list of words/vectors in a consistent order
    for index in trange(model.target_embeddings.num_embeddings):
        word = corpus.index_to_word[index]
        vectors.append(model.target_embeddings(torch.LongTensor([index]).to(device)).cpu().detach().numpy()[0])
        words.append(word)

    # Fills the KV object with our data in the right order
    kv.add_vectors(words, vectors) 
    kv.save_word2vec_format(filename, binary=False)
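
For reference, the word2vec text format is simple: a header line with the vocabulary size and vector dimension, then one line per word with its space-separated values. The sketch below writes that layout by hand to an in-memory buffer. It only illustrates the file structure; the number formatting is an arbitrary choice and may differ from what gensim emits, and `write_word2vec_text` is a hypothetical name.

```python
import io


def write_word2vec_text(words, vectors, out):
    """Write a header line, then one 'word v1 v2 ...' line per word."""
    out.write(f"{len(words)} {len(vectors[0])}\n")
    for word, vec in zip(words, vectors):
        out.write(word + " " + " ".join(f"{x:.6f}" for x in vec) + "\n")


buffer = io.StringIO()
write_word2vec_text(["son", "daughter"], [[0.1, 0.2], [0.3, 0.4]], buffer)
lines = buffer.getvalue().splitlines()
```

Because the format is plain text, the saved file can be opened in an editor to check that the word count and dimension in the header match what you expect.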


# Save your vectors / data for the pytorch classifier in Part 4!

We'll be using these vectors later in Part 4. We want to save them in a format that PyTorch can easily use. In particular, you'll need to save the _state dict_ of the embeddings, which captures all of its information.

In [13]:
save(model, corpus, "./cache/word2vec_model.kv")
torch.save(model.state_dict(), "./cache/word2vec_emb.pt")

100%|██████████| 65523/65523 [00:02<00:00, 32235.49it/s]


We will also need the mapping from word to index so we can figure out which embedding to use for different words. Save the `corpus` object's mapping to a file using your preferred format (e.g., pickle or json).

In [14]:
# Use a context manager so the file is flushed and closed after writing
with open("./cache/word2vec_corpus.pkl", "wb") as f:
    pickle.dump(corpus, f)
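
If you prefer JSON, saving only the word-to-index mapping is enough to look up embeddings later. Below is a minimal round-trip sketch using a toy mapping and a temporary directory; the file name is a placeholder, and in the notebook you would pass `corpus.word_to_index` and your own path instead.

```python
import json
import os
import tempfile

word_to_index = {"<UNK>": 0, "son": 1, "daughter": 2}  # toy mapping

# Placeholder location for this sketch only.
path = os.path.join(tempfile.mkdtemp(), "word_to_index.json")

with open(path, "w", encoding="utf-8") as f:
    json.dump(word_to_index, f)

with open(path, encoding="utf-8") as f:
    restored = json.load(f)
```

JSON object keys are always strings, so this works directly for a word-to-index mapping; an index-to-word mapping with integer keys would come back with string keys and need converting.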