# Homework 4: Word Embeddings and Neural Language Models


Names: __Suzanne Becker, Yuval Timen__

Step 1: Word2Vec paper questions
---------------------------

### 1. Describe how a CBOW word embedding is generated.

A CBOW word embedding is created by training a shallow neural network, with no non-linear hidden layer, to predict a target word from its surrounding context words. Specifically, given a sequence of M words $w_{1} ... w_{M}$ and a window size N, we first create a one-hot vector for each word in the sequence. These are the inputs to the network. At each time step $t$, we try to predict the word $w_{t}$ from the context words within the window: the N words on either side of $w_{t}$, namely $w_{t-N}...w_{t+N}$ excluding $w_{t}$. Each context vector is projected through the shared input weight matrix, and the projections are averaged into a single vector of length E. That vector is passed through the output layer, producing a vector of size 1xV that represents the probability distribution over our entire vocabulary for the target word. The error for each training step measures how far this distribution is from the one-hot vector for the true target $w_{t}$ (for example, with a cross-entropy loss). Once the weights have been updated, we move to the next time step, shift the target to the next word in the sequence ($w_{t+1}$), and repeat: for each time step, we run the forward pass, compute the error, and backpropagate.

Once we complete training, our CBOW embeddings are the learned input weights. The input weight matrix is of size VxE, where E is the size of the embeddings we want to learn. Thus we have V row vectors, each of length E, meaning that each word in our vocabulary has an embedding of length E.
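The forward pass described above can be sketched in NumPy. This is a minimal illustration rather than the reference word2vec implementation: the toy sizes, the random weights, and the names `W_in`, `W_out`, and `cbow_forward` are all assumptions, and it uses a plain softmax over the whole vocabulary instead of the hierarchical softmax used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

V, E = 10, 4                     # toy vocabulary size and embedding size (assumed)
W_in = rng.normal(size=(V, E))   # input weights: row i is the embedding of word i
W_out = rng.normal(size=(E, V))  # output weights: map back to vocabulary scores


def cbow_forward(context_ids, W_in, W_out):
    """Return a probability distribution over the vocabulary for the target word."""
    # Selecting rows of W_in is equivalent to multiplying one-hot vectors by W_in.
    hidden = W_in[context_ids].mean(axis=0)      # average the context embeddings
    scores = hidden @ W_out                      # one score per vocabulary word
    exp_scores = np.exp(scores - scores.max())   # subtract max for numerical stability
    return exp_scores / exp_scores.sum()         # softmax


context_ids = [1, 2, 4, 5]   # indices of the words around a target at position 3 (N = 2)
target_id = 3
probs = cbow_forward(context_ids, W_in, W_out)
loss = -np.log(probs[target_id])   # cross-entropy against the true target word
```

After training, the rows of `W_in` would be read off as the embeddings.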

### 2. What is a CBOW word embedding and how is it different from a skip-gram word embedding?

A CBOW word embedding is an embedding created using the CBOW algorithm: namely, an embedding learned by training a shallow feedforward neural network to predict a target word given some number of neighboring context words. This is very similar to the skip-gram method for generating word embeddings. The main difference is the direction of the prediction task: CBOW tries to predict the target word given its neighboring context words, while skip-gram tries to predict the context words given a target word.
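The difference in prediction direction shows up in how training examples are built from the same text. The sketch below is illustrative only; the helper names `cbow_examples` and `skipgram_examples` and the example sentence are assumptions, not part of word2vec or gensim.

```python
def cbow_examples(tokens, window):
    """Yield (context_words, target_word): many inputs, one prediction."""
    for t, target in enumerate(tokens):
        context = tokens[max(0, t - window):t] + tokens[t + 1:t + 1 + window]
        yield context, target


def skipgram_examples(tokens, window):
    """Yield (target_word, context_word): one input, one prediction per neighbor."""
    for context, target in cbow_examples(tokens, window):
        for neighbor in context:
            yield target, neighbor


tokens = "the raven sat above the door".split()
cbow_data = list(cbow_examples(tokens, window=1))
skipgram_data = list(skipgram_examples(tokens, window=1))

print(cbow_data[1])        # (['the', 'sat'], 'raven')
print(skipgram_data[1:3])  # [('raven', 'the'), ('raven', 'sat')]
```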

### 3. What is the task that the authors use to evaluate the generated word embeddings?

They tested whether the word embeddings preserved the linear relationships between words; ideally, they wanted to have __vector("King") - vector("Man") + vector("Woman") = vector("Queen")__. They tested this by creating a large test set of 5 types of semantic questions and 9 types of syntactic questions. This was done by manually creating a list of word pairs, for example __(Athens, Greece)__ or __(great, greater)__. Then, each of these word pairs was connected to another word pair in the same syntactic/semantic category, and the evaluation task consisted of seeing if the embeddings could accurately predict the second word in the second pair, given the relationship computed from the first pair. For example, given __(Athens, Greece)__ <-> __(Oslo, Norway)__, they took the vectors for the first 3 words, did the vector math, and checked whether the word closest to the result was the correct last word: __vector(Greece) - vector(Athens) + vector(Oslo) =? vector(Norway)__. An answer counted as correct only if the closest word was exactly the expected one.
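The analogy test can be sketched with a handful of hand-made vectors. Everything here is an assumption for illustration: the 3-d toy embeddings, the helper names `cosine` and `solve_analogy`, and the common convention of excluding the three question words from the candidate answers.

```python
import numpy as np

# Toy vectors: axis 0 ~ "Greek", axis 1 ~ "Norwegian", axis 2 ~ "is a capital city".
embeddings = {
    "Athens": np.array([1.0, 0.0, 1.0]),
    "Greece": np.array([1.0, 0.0, 0.0]),
    "Oslo":   np.array([0.0, 1.0, 1.0]),
    "Norway": np.array([0.0, 1.0, 0.0]),
    "Paris":  np.array([1.0, 1.0, 1.0]),
    "France": np.array([1.0, 1.0, 0.0]),
}


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def solve_analogy(a, b, c, embeddings):
    """Answer 'a is to b as c is to ?' with the nearest neighbor of b - a + c."""
    query = embeddings[b] - embeddings[a] + embeddings[c]
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(query, embeddings[w]))


answer = solve_analogy("Athens", "Greece", "Oslo", embeddings)
print(answer)  # Norway
```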

### 4. What are PCA and t-SNE? Why are these important to the task of training and interpreting word embeddings?

PCA (Principal Component Analysis) is a method for finding a set of orthogonal directions in a vector space along which the data varies the most. The first principal component is the linear combination of the original features that maximizes the variance of the data. Each subsequent principal component is found the same way, with the added condition that it must be orthogonal to all previous principal components. This is effectively a change of basis from $\mathbb{R}^{N} \to \mathbb{R}^{N}$, where the new vector space's axes are sorted in order of "importance" (variance explained). To use PCA to visualize word embeddings, we run PCA on our embedding space and keep only the first 2 or 3 principal components, which gives a 2- or 3-dimensional view of our embeddings.
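As a rough sketch of this procedure, PCA can be computed with a singular value decomposition of the centered data. The random matrix below is only a stand-in for a real (words x dimensions) embedding matrix, and `pca_project` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(50, 8))   # stand-in for a (words x dimensions) matrix


def pca_project(X, n_components=2):
    """Project the rows of X onto the top principal components."""
    X_centered = X - X.mean(axis=0)     # PCA works on zero-mean features
    # Rows of Vt are orthonormal principal directions, sorted by decreasing variance.
    _, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)
    explained_variance = singular_values**2 / (len(X) - 1)
    return X_centered @ Vt[:n_components].T, explained_variance


points_2d, explained_variance = pca_project(embeddings, n_components=2)
print(points_2d.shape)  # (50, 2)
```

Each row of `points_2d` could then be drawn as a labeled point in a scatter plot.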

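On the other hand, t-SNE (t-distributed Stochastic Neighbor Embedding) is a non-linear method for visualizing high-dimensional data that is designed to cope with the crowding caused by the "curse of dimensionality". When our data lives in a high-dimensional vector space, simply projecting it onto 2 or 3 dimensions tends to crowd the points together. This makes it very hard to see clusters or any other meaningful relationship in the data. t-SNE addresses this by converting the pairwise distances between points in the high-dimensional space into a probability distribution over neighbors, and then placing one point per original data point in the low-dimensional space so that the corresponding low-dimensional distribution matches it as closely as possible (by minimizing the KL divergence between the two). The low-dimensional distribution uses a heavy-tailed Student-t kernel, which lets dissimilar points sit far apart and relieves the crowding. The user controls hyperparameters such as the perplexity, which roughly sets how many neighbors each point attends to. The result preserves local neighborhoods well, although distances between far-apart clusters in a t-SNE plot should not be over-interpreted.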


PCA and t-SNE are important in training and interpreting word embeddings because they provide a human-interpretable view on generated vector embeddings, which are otherwise long lists of numbers with no inherent semantic meaning to the human eye. Projecting vector embeddings into smaller-dimensional space would potentially reveal clusters in 2 or 3 dimensions which will allow us to visually inspect the quality of our embeddings: does each cluster contain words that we as humans would consider "similar"?

Step 2: Train your own word embeddings
--------------------------------

__(describe the Spooky Authors Dataset here)__


__Describe what data set you have chosen to compare and contrast with the Spooky Authors Dataset. Make sure to describe where it comes from and its general properties.__

__(describe your dataset here)__

In [1]:
# Imports
import csv
import re
from gensim.models import Word2Vec

# Some constants
EMBEDDINGS_SIZE = 300
SPOOKY_DATA_PATH_TEST = "./spooky_test.csv"
SPOOKY_DATA_PATH_TRAIN = "./spooky_train.csv"

In [14]:
# Read in the data
with open(SPOOKY_DATA_PATH_TRAIN, newline='') as f:
    reader = csv.reader(f)
    next(reader)  # skip the column names
    train = [row[1] for row in reader]  # Take only the text data
    
with open(SPOOKY_DATA_PATH_TEST, newline='') as f:
    reader = csv.reader(f)
    next(reader)  # skip the column names
    test = [row[1] for row in reader]  # Take only the text data
    
# Use all the data
sentences = train + test

sentences_listed = [sentence.split(' ') for sentence in sentences]

print(len(sentences))
print(sentences[10:12])
sentences_string = " ".join(sentences)
all_words = sentences_string.split()
print(len(all_words))
# all_words = [[word for word in all_sentences[i].split()] for i in range(len(all_sentences))]
uniques = set(all_words)
print(len(uniques))


27971
['He shall find that I can feel my injuries; he shall learn to dread my revenge" A few days after he arrived.', 'Here we barricaded ourselves, and, for the present were secure.']
745572
56654


In [15]:
# Preprocessing

# TODO: FILL THIS IN

### a) Train embedding on GIVEN dataset

In [16]:
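# Create the model, keep only the weights
# We toss the model to save space
# sg=0 selects CBOW (sg=1 would select skip-gram)
# Note: `size` and `.vocab` are the gensim 3.x names; gensim 4.x renamed them
# to `vector_size` and `.key_to_index`
model = Word2Vec(sentences=sentences_listed, size=EMBEDDINGS_SIZE, sg=0, window=5, min_count=1)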
spooky_embeddings = model.wv
del model

print(f"Vocabulary size: |V| = {len(spooky_embeddings.vocab)}")

Vocabulary size: |V| = 56654


### b) Train embedding on YOUR dataset

In [5]:
# then do a second data set

What text-normalization and pre-processing did you do and why? __YOUR ANSWER HERE__

Step 3: Evaluate the differences between the word embeddings
----------------------------

(make sure to include graphs, figures, and paragraphs with full sentences)

Cite your sources:
-------------

Step 4: Feedforward Neural Language Model
--------------------------

### a) First, encode your text into integers

In [6]:
# Importing utility functions from Keras
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import SimpleRNN
from keras.layers import Embedding

NGRAM = 3 # The size of the ngram language model you want to train

# Initializing a Tokenizer
# It is used to vectorize a text corpus. Here, it just creates a mapping from 
# word to a unique index. (Note: Indexing starts from 1; index 0 is reserved for padding)
# Example:
# tokenizer = Tokenizer()
# tokenizer.fit_on_texts(data)
# encoded = tokenizer.texts_to_sequences(data)

# Make sure to include the padding token in your vocabulary size

### b) Next, prepare your sequences from text

#### Fixed ngram based sequences (Used for Feedforward)

In [7]:
def generate_ngram_training_samples():
    '''
    Takes the encoded data (list of lists) and generates the training samples 
    out of it.
    Parameters:
    up to you!
    return: list of lists in the format [[x1, x2, ... , x(n-1), y], ...]
    '''
    pass
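One possible way to build the fixed-size samples is to slide a window of `ngram` tokens over each encoded sentence. This is a minimal sketch under assumptions: the function name `generate_ngram_samples`, its parameters, and the choice to skip sentences shorter than the window (rather than pad them) are illustrative, not requirements of the assignment.

```python
def generate_ngram_samples(encoded, ngram):
    """Slide a window of `ngram` tokens over each encoded sentence.

    encoded: list of lists of integer word indices
    returns: list of lists in the format [[x1, x2, ..., x(n-1), y], ...]
    """
    samples = []
    for sentence in encoded:
        # Sentences shorter than `ngram` produce no windows in this sketch.
        for start in range(len(sentence) - ngram + 1):
            samples.append(sentence[start:start + ngram])
    return samples


encoded = [[4, 7, 2, 9], [5, 1]]
samples = generate_ngram_samples(encoded, ngram=3)
print(samples)  # [[4, 7, 2], [7, 2, 9]]
```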


### c) Then, split the sequences into X and y and create a Data Generator

In [8]:
# Note here that the sequences were in the form: 
# sequence = [x1, x2, ... , x(n-1), y]
# We still need to separate it into [[x1, x2, ... , x(n-1)], ...], [y1, y2, ...]
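With the samples stored as a 2-d array, the split described in the comments above comes down to two slices. The toy trigram samples below are made up for illustration.

```python
import numpy as np

sequences = np.array([[4, 7, 2], [7, 2, 9], [2, 9, 5]])  # toy trigram samples

X = sequences[:, :-1]   # every column except the last: the (n-1) context words
y = sequences[:, -1]    # the last column: the word to predict

print(X.tolist())  # [[4, 7], [7, 2], [2, 9]]
print(y.tolist())  # [2, 9, 5]
```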

In [9]:
def read_embeddings():
    '''Loads and parses the embeddings trained earlier.'''
    
    # you may find generating the following two dicts useful:
    # word to embedding : {'the':1, ...}
    # index to embedding : {1:'the', ...} (inverse of word_2_embedding)
    pass


# remember that "0" index is assigned for padding token. 
# Hence, initialize the vector for padding token as all zeros of embedding size

In [10]:
def data_generator(X, y, num_sequences_per_batch):
    '''
    Returns data generator to be used by feed_forward
    https://wiki.python.org/moin/Generators
    https://realpython.com/introduction-to-python-generators/
    
    Yields batches of embeddings and labels to go with them.
    Use one hot vectors to encode the labels (see the to_categorical function)
    
    '''
    pass
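A generator along these lines could look like the sketch below. It is only one possible design: the name `batch_generator`, the extra `index_to_embedding` and `vocab_size` parameters, and the toy embeddings are assumptions, and the one-hot encoding uses `np.eye` as a NumPy stand-in for `to_categorical` so that the example runs without Keras.

```python
import numpy as np


def batch_generator(X, y, batch_size, index_to_embedding, vocab_size):
    """Yield (inputs, one_hot_labels) batches forever, looping over the data."""
    num_samples = len(X)
    while True:
        for start in range(0, num_samples, batch_size):
            X_batch = X[start:start + batch_size]
            y_batch = y[start:start + batch_size]
            # Look up each index and concatenate the (n-1) embeddings per sample.
            inputs = np.array([
                np.concatenate([index_to_embedding[int(i)] for i in row])
                for row in X_batch
            ])
            # One-hot encode the labels (stand-in for keras to_categorical).
            labels = np.eye(vocab_size)[y_batch]
            yield inputs, labels


EMBEDDING_SIZE, VOCAB_SIZE = 3, 6   # toy sizes; index 0 is the padding token
index_to_embedding = {i: np.full(EMBEDDING_SIZE, float(i)) for i in range(VOCAB_SIZE)}
index_to_embedding[0] = np.zeros(EMBEDDING_SIZE)   # padding vector is all zeros

X = np.array([[4, 1], [1, 2], [2, 5]])
y = np.array([2, 5, 3])
gen = batch_generator(X, y, batch_size=2,
                      index_to_embedding=index_to_embedding, vocab_size=VOCAB_SIZE)
inputs, labels = next(gen)
print(inputs.shape, labels.shape)  # (2, 6) (2, 6)
```

The infinite `while True` loop is deliberate: Keras-style training draws a fixed number of batches per epoch and expects the generator never to run out.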



In [11]:
# Examples
# initialize data_generator
# num_sequences_per_batch = 1024 # or Batch Size
# steps_per_epoch = len(sequences)//num_sequences_per_batch  # Number of batches per epoch
# train_generator = data_generator(X, y, num_sequences_per_batch)

# sample=next(train_generator) # this is how you get data out of generators
# sample[0].shape # (batch_size, (n-1)*EMBEDDING_SIZE)

### d) Train your models

In [12]:
# code to train a feedforward neural language model 
# on a set of given word embeddings
# make sure not to just copy + paste to train your two models

# Defining the model architecture using Keras Sequential API


In [13]:
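# Start training the model (`model` must be defined in the cell above first)
# Note: newer Keras releases deprecate fit_generator; model.fit accepts generators directly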
model.fit_generator(train_generator, 
                    steps_per_epoch=steps_per_epoch,
                    epochs=1)

NameError: name 'model' is not defined

### e) Generate Sentences

In [None]:
# generate a sequence from the model
def generate_seq(model, tokenizer, seed, n_words):
    '''
    Parameters:
        model: your neural network
        tokenizer: the keras preprocessing tokenizer
        seed: [w1, w2, w(n-1)]
        n_words: generate a sentence of length n_words
    Returns: string sentence
    '''
    pass
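The generation loop can be sketched independently of Keras by hiding the network behind a callable. Everything here is an assumption for illustration: `generate_words`, the `predict_probs` callable (which would wrap `model.predict` plus the embedding lookup in the real notebook), and the toy stand-in model. It samples from the predicted distribution; taking the argmax instead would be an equally valid design choice.

```python
import numpy as np


def generate_words(predict_probs, index_to_word, seed_ids, n_words, rng):
    """Extend `seed_ids` to `n_words` tokens by sampling from the model's output.

    predict_probs: hypothetical callable mapping the last (n-1) word ids to a
                   probability vector over the vocabulary
    """
    context_size = len(seed_ids)
    ids = list(seed_ids)
    while len(ids) < n_words:
        probs = predict_probs(ids[-context_size:])
        ids.append(int(rng.choice(len(probs), p=probs)))   # sample the next word id
    return " ".join(index_to_word[i] for i in ids)


index_to_word = {1: "the", 2: "raven", 3: "sat"}


def toy_predict(context):
    """Stand-in model: always predicts the id after the last one (1 -> 2 -> 3 -> 1)."""
    probs = np.zeros(4)                 # index 0 is the padding token, never predicted
    probs[context[-1] % 3 + 1] = 1.0
    return probs


sentence = generate_words(toy_predict, index_to_word, seed_ids=[1, 2], n_words=5,
                          rng=np.random.default_rng(0))
print(sentence)  # the raven sat the raven
```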

### f) Compare your generated sentences

Sources Cited
----------------------------
