# CS 614 - Applications of Machine Learning
## Programming Assignment 5 - Word2Vec

In this assignment we'll play around with training a Word2Vec model.

We will loosely follow this article/tutorial:
https://towardsdatascience.com/word2vec-with-pytorch-implementing-original-paper-2cd7040120b0/

### Create Our Data
First, let's create our data. Create a few sentences that have similar words in them. Here are a couple to get you started:

In [None]:
sentences = ["What is the best time to call you tomorrow?",  "What is the best hour to call you tomorrow?"]

### Pre-Processing Data
Next, let's extract the following information from the sentences:
1. The tokenized sentences
2. How many times each word occurs
3. A word-to-index mapping for each word
4. The reverse index-to-word mapping

We'll crudely *tokenize* our data by splitting on spaces.

In [None]:
import numpy as np
from collections import Counter

## just a little bit of preprocessing for our data, counting/ordering types
#lowercase each sentence, split on spaces, and drop the final token
#(in the sample sentences that is the token carrying the trailing "?")
data = {'sentences': [x.lower().split()[:-1] for x in sentences]}  #the sentences


data['counts'] = Counter([t for s in data['sentences'] for t in s]) #unique words and their counts
data['word2index'] = {t: i for i, t in enumerate(data['counts'])}  #indices for each word
data['index2word'] = {v: k for k, v in data['word2index'].items()}

print(data)
print(len(data['word2index']))
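
Splitting on spaces leaves punctuation glued to words ("tomorrow?"). As a rough alternative sketch, and not part of the assignment code, a regular expression can keep only the word characters. The `tokenize` helper and the `toy_*` names below are hypothetical and exist only for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase, then keep runs of letters, digits, and apostrophes;
    # punctuation such as "?" is discarded instead of sticking to a word.
    return re.findall(r"[a-z0-9']+", text.lower())

toy_sentences = ["What is the best time?", "What is the best hour?"]  # hypothetical data
toy_tokens = [tokenize(s) for s in toy_sentences]

# Same three structures as above: counts, word -> index, index -> word.
toy_counts = Counter(t for s in toy_tokens for t in s)
toy_word2index = {t: i for i, t in enumerate(toy_counts)}
toy_index2word = {i: t for t, i in toy_word2index.items()}

print(toy_tokens[0])
print(toy_counts["best"])
print(toy_index2word[toy_word2index["hour"]])
```

The last line is a round trip through both mappings, which is a quick sanity check that the reverse index really inverts the forward one.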



### Setting up our Training Data
To create our dataset we will loop through each sentence and:
1. Slide a moving window of odd integer size *WINDOW_SIZE* across the sentence.
2. Take the middle word of each window as the thing we want to predict ($y$).
3. Take all the other words in the window as our context, which will be $x$.

In [None]:
import torch
WINDOW_SIZE = 5  #TODO:  Play with this
half = WINDOW_SIZE//2
Xtrain = []
Ytrain = []

for s in data['sentences']:  #for each sentence
    for i in range(0,len(s)-WINDOW_SIZE+1): #for each window start position in the sentence
        T = [data['word2index'][s[j]] for j in range(i,i+WINDOW_SIZE)]  #grab the word indices for the window
        Ytrain.append([T[half]])
        Xtrain.append(T[:half]+T[half+1:])

Xtrain = torch.tensor(Xtrain, dtype=torch.long)
Ytrain = torch.tensor(Ytrain, dtype=torch.long)

print(Xtrain.shape)
print(Ytrain.shape)
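
To see what the windowing produces, here is a small sketch that works on the words themselves rather than their indices. The `make_cbow_pairs` helper and the toy sentence are hypothetical and only mirror the loop above.

```python
def make_cbow_pairs(tokens, window_size):
    # Slide an odd-sized window over the tokens; the middle token is the
    # target and the remaining tokens in the window are the context.
    half = window_size // 2
    pairs = []
    for start in range(len(tokens) - window_size + 1):
        window = tokens[start:start + window_size]
        context = window[:half] + window[half + 1:]
        pairs.append((context, window[half]))
    return pairs

toy = "what is the best time to call you".split()
pairs = make_cbow_pairs(toy, window_size=5)
for context, target in pairs:
    print(context, "->", target)
```

With 8 tokens and a window of 5 this gives 4 samples. Notice that the first and last `half` words of a sentence are never a target, only context, so larger windows mean fewer training samples from short sentences.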

### Create Model
Our model will be quite simple:
1. Fully-connected layer to take us to our desired embedded dimension.
2. Another fully-connected layer to take us to the number of classes (potential words)

However, since we'll need to do some custom operations (averaging the embeddings of the words in a sample), we'll create our own model class (instead of just stacking pre-made layers in a sequential model) so we can have more control over the forward (and backward) propagation.

In [None]:
from torch import nn
VOCAB_SIZE = len(data['counts'])
EMBED_DIMENSION = 300  #TODO:  Play with this

class CBOW_Model(nn.Module):
    def __init__(self, vocab_size, embed_dims):
        super().__init__()
        self.embedded_dim = embed_dims
        
        self.embeddings = nn.Linear(vocab_size, embed_dims)
        self.linear = nn.Linear(
            in_features=embed_dims,
            out_features=vocab_size,
        )
        
    def forward(self, inputs_):
        x = torch.zeros(inputs_.shape[0],inputs_.shape[1],self.embedded_dim)  #shape of output.  (samples x items in sample x embedded dims)
        
        for i in range(inputs_.shape[0]):  #for each training sample
            x[i] = self.embeddings(inputs_[i])  #get its embeddings for the words in the sample
        
        x = x.mean(axis=1)  #take the average embedding
        x = self.linear(x)   #predict
        return x

model = CBOW_Model(VOCAB_SIZE, EMBED_DIMENSION)
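
Why does a fully-connected layer give us embeddings? Multiplying a one-hot vector by a weight matrix just picks out one column of that matrix. The sketch below checks this on hypothetical toy sizes by comparing a bias-free `nn.Linear` against an `nn.Embedding` lookup built from the same weights. The bias is dropped here only so the match is exact.

```python
import torch
from torch import nn

torch.manual_seed(0)
vocab_size, embed_dims = 6, 4  # hypothetical toy sizes

linear = nn.Linear(vocab_size, embed_dims, bias=False)  # bias omitted for an exact match
# Build a lookup table whose row i is column i of the linear layer's weights.
lookup = nn.Embedding.from_pretrained(linear.weight.detach().T.contiguous())

word_ids = torch.tensor([[0, 2, 3, 5]])  # one sample with four context words
one_hot = nn.functional.one_hot(word_ids, num_classes=vocab_size).float()

via_linear = linear(one_hot)   # shape (1, 4, embed_dims)
via_lookup = lookup(word_ids)  # shape (1, 4, embed_dims)
print(via_linear.shape, torch.allclose(via_linear, via_lookup))
```

The model above keeps the default bias of `nn.Linear`, so there the layer's output for a word is the matching weight column plus the bias vector.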


### Train
Now time for our world-famous training loop!

In [None]:
MAX_EPOCHS = 1000 #TODO:  Play with this

#TODO:  Choose your loss function and optimizer

loss_fn = None  #TODO:  e.g. a loss from torch.nn
optimizer = None  #TODO:  e.g. an optimizer from torch.optim

model.train()
running_loss = []

XOneHot = torch.nn.functional.one_hot(Xtrain, num_classes=VOCAB_SIZE).to(torch.float)  #fix the width at VOCAB_SIZE so it always matches the model input
print(Xtrain)
print(XOneHot)

for epoch in range(MAX_EPOCHS):

    optimizer.zero_grad()
    outputs = model(XOneHot)
   
    loss = loss_fn(outputs, Ytrain.squeeze())
    loss.backward()
    optimizer.step()

    running_loss.append(loss.item())  #record the loss so training can be visualized

#TODO:  Visualize the training process
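
As a rough sketch of what a complete loop can look like, here is a self-contained toy version. It assumes cross-entropy loss and the Adam optimizer, which are one common pairing for this kind of classification and not necessarily the choice the assignment intends. `TinyCBOW`, the toy tensors, the learning rate, and the epoch count are all hypothetical.

```python
import torch
from torch import nn

torch.manual_seed(0)
vocab_size, embed_dims = 6, 8  # hypothetical toy sizes

# Toy stand-ins for XOneHot and Ytrain: 3 samples with 4 context words each.
x_ids = torch.tensor([[0, 1, 3, 4], [1, 2, 4, 5], [0, 2, 3, 5]])
y = torch.tensor([2, 3, 1])
x_one_hot = nn.functional.one_hot(x_ids, num_classes=vocab_size).float()

class TinyCBOW(nn.Module):
    def __init__(self):
        super().__init__()
        self.embeddings = nn.Linear(vocab_size, embed_dims)
        self.linear = nn.Linear(embed_dims, vocab_size)

    def forward(self, x):
        # Embed every context word, average them, then score each vocabulary word.
        return self.linear(self.embeddings(x).mean(dim=1))

toy_model = TinyCBOW()
toy_loss_fn = nn.CrossEntropyLoss()  # assumed choice: expects raw scores and class indices
toy_optimizer = torch.optim.Adam(toy_model.parameters(), lr=0.01)  # assumed choice

losses = []
toy_model.train()
for epoch in range(200):
    toy_optimizer.zero_grad()
    loss = toy_loss_fn(toy_model(x_one_hot), y)
    loss.backward()
    toy_optimizer.step()
    losses.append(loss.item())

print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}")
```

The `losses` list plays the same role as `running_loss` above: plotting it against the epoch number is one simple way to visualize training.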

### Find Similar Words
Now for some fun.  Let's find similar words!

To do this, we'll get the embeddings for all our words, make them unit length, then compare them by taking their dot products. The dot product of two unit-length vectors is their *cosine similarity*.
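
Here is the idea on a few hypothetical 2-D vectors, as a minimal sketch. Once each row has unit length, a single matrix product gives every pairwise cosine similarity.

```python
import numpy as np

vectors = np.array([[1.0, 0.0],
                    [1.0, 1.0],
                    [0.0, 1.0]])  # hypothetical toy embeddings

# Divide each row by its length so every row has norm 1.
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

# Entry (i, j) is the cosine of the angle between vector i and vector j.
similarity = unit @ unit.T
print(np.round(similarity, 3))
```

Vectors pointing the same way score 1, perpendicular vectors score 0, and the 45-degree pair lands in between at about 0.707.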

In [None]:
weights = list(model.parameters())[0].detach().numpy().T   #Get the weights from the embedding layer (vocab size x embedded dims)
print(weights.shape)
ids = torch.nn.functional.one_hot(torch.arange(VOCAB_SIZE)).numpy()  #One-hot rows for all the words, as a NumPy array to match weights
embeddings = ids@weights  # The embeddings of the words
print(embeddings.shape)

# normalization
norms = (embeddings ** 2).sum(axis=1) ** (1 / 2) + 10**(-10)
norms = np.reshape(norms, (len(norms), 1))
embeddings_norm = embeddings / norms


#This function takes a word (as a string), gets its embedding, compares to all other embeddings via cosine similarity.
#Then returns the topN
def get_top_similar(word: str, topN: int = 10):
    try:
        word_id = data['word2index'][word]   #This word's ID
    except KeyError:
        print("Out of vocabulary word")
        return

    word_vec = embeddings_norm[word_id]   #This word's normalized embedding
        
    word_vec = np.reshape(word_vec, (len(word_vec), 1))
    dists = np.matmul(embeddings_norm, word_vec).flatten()  #Dot product with all the other embeddings
    topN_ids = np.argsort(-dists)[1 : topN + 1]  #Sort by most similar

    topN_dict = {}
    for sim_word_id in topN_ids:
        sim_word = data['index2word'][sim_word_id.item()]
        topN_dict[sim_word] = dists[sim_word_id]
    return topN_dict

myword = "time"
try:
    for word, sim in get_top_similar(myword).items():
        print("{}: {:.3f}".format(word, sim))
except AttributeError:  #get_top_similar returned None for an unknown word
    print("Word doesn't exist")