# SI630 Homework 2: Word2vec Vector Analysis

*Important Note:* Start this notebook only after you've gotten your word2vec model up and running!

Many NLP packages support working with word embeddings. In this notebook you can work through the problems assigned in Task 3. We've provided basic functionality for loading and working with word vectors using [Gensim](https://radimrehurek.com/gensim/models/keyedvectors.html), a good library for learning and using word vectors.

One of the fun parts of word vectors is getting a sense of what they learned. Feel free to explore the vectors here! 

In [1]:
from gensim.models import KeyedVectors
from gensim.test.utils import datapath

In [2]:
import torch

In [3]:
word_vectors = torch.load('tensor.pt')

In [4]:
word_vectors = word_vectors.cpu()

In [5]:
word_vectors.shape

torch.Size([462055, 100])

In [6]:
word_to_index = torch.load('dict.pt')

In [7]:
len(word_to_index)

462055

In [8]:
import pandas as pd

In [14]:
word_pair_df = pd.read_csv("word_pair_similarity_predictions.csv")
word_pair_df

Unnamed: 0,word1,word2,sim
0,old,new,
1,smart,intelligent,
2,hard,difficult,
3,happy,cheerful,
4,hard,easy,
...,...,...,...
1625,relatives,sister,
1626,relatives,she,
1627,relatives,her,
1628,relatives,hers,


In [52]:
from scipy.spatial.distance import cosine

In [61]:
def compute_cosine_similarity(word_vectors, word_to_index, word_one, word_two):
    '''
    Computes the cosine similarity between the embeddings of two words.
    Returns 0 if either word is missing from the vocabulary.
    '''
    try:
        word_one_index = word_to_index[word_one]
        word_two_index = word_to_index[word_two]
    except KeyError:
        return 0

    # make sure the embeddings are on the CPU before converting to NumPy
    word_vectors = word_vectors.to('cpu')

    # index with the plain indices so each embedding is 1-D, as scipy's cosine() expects
    embedding_one = word_vectors[word_one_index].detach().numpy()
    embedding_two = word_vectors[word_two_index].detach().numpy()

    # scipy's cosine() returns the cosine distance, so similarity = 1 - distance
    return 1 - float(cosine(embedding_one, embedding_two))

In [62]:
for i in range(len(word_pair_df)):
    word_pair_df.loc[i, "sim"] = compute_cosine_similarity(
        word_vectors, word_to_index,
        word_pair_df.loc[i, "word1"], word_pair_df.loc[i, "word2"])
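
The row-by-row loop above can also be written as a single vectorized computation. The following is a minimal sketch, assuming the embeddings are available as a NumPy array; the helper name `batch_cosine_similarity` and the toy `vocab` and `vectors` are hypothetical stand-ins, not part of the assignment. Like `compute_cosine_similarity`, it scores a pair as 0 when either word is out of vocabulary.

```python
import numpy as np

def batch_cosine_similarity(vectors, vocab, words_one, words_two):
    """Cosine similarity for each (word_one, word_two) pair; 0.0 for OOV pairs."""
    sims = np.zeros(len(words_one), dtype=float)
    # Keep only the positions where both words are in the vocabulary.
    known = [i for i, (a, b) in enumerate(zip(words_one, words_two))
             if a in vocab and b in vocab]
    if not known:
        return sims
    left = vectors[[vocab[words_one[i]] for i in known]]
    right = vectors[[vocab[words_two[i]] for i in known]]
    # Row-wise dot products divided by the product of the row norms.
    dots = np.sum(left * right, axis=1)
    norms = np.linalg.norm(left, axis=1) * np.linalg.norm(right, axis=1)
    sims[known] = dots / norms
    return sims

# Toy data standing in for the real embedding matrix and vocabulary (assumption).
vocab = {"hard": 0, "difficult": 1, "easy": 2}
vectors = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
sims = batch_cosine_similarity(vectors, vocab,
                               ["hard", "hard", "hard"],
                               ["difficult", "easy", "unknown"])
```

In the notebook, the inputs would presumably be `word_vectors.detach().numpy()`, `word_to_index`, `list(word_pair_df["word1"])`, and `list(word_pair_df["word2"])`. Because out-of-vocabulary pairs and genuinely orthogonal pairs both come out as 0, it may be worth counting the missing words separately.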

In [63]:
word_pair_df

Unnamed: 0,word1,word2,sim
0,old,new,0.791294
1,smart,intelligent,0.036121
2,hard,difficult,0.527887
3,happy,cheerful,0.039990
4,hard,easy,0.334446
...,...,...,...
1625,relatives,sister,0.341856
1626,relatives,she,0.244936
1627,relatives,her,0.357703
1628,relatives,hers,0.089580


In [64]:
word_pair_df.to_csv("word_pair_similarity_predictions.csv")

In [7]:
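# Optional: the remaining cells use Gensim's KeyedVectors API rather than the torch tensor loaded above.
# word_vectors = KeyedVectors.load_word2vec_format('tensor.pt', binary=False)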
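
`KeyedVectors.load_word2vec_format` with `binary=False` expects the word2vec text format: a header line with the vocabulary size and the vector dimensionality, followed by one line per word containing the word and its vector components separated by spaces. A tensor written by `torch.save` is not in that format. The following is a minimal sketch of exporting an embedding matrix and a word-to-index dictionary to such a file; the helper name `save_word2vec_text`, the toy data, and the output path are hypothetical, and the sketch assumes no vocabulary word contains whitespace.

```python
import os
import tempfile
import numpy as np

def save_word2vec_text(vectors, vocab, path):
    """Write vectors in word2vec text format: 'count dim' header, then 'word v1 v2 ...'."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(f"{len(vocab)} {vectors.shape[1]}\n")
        for word, index in vocab.items():
            values = " ".join(f"{x:.6f}" for x in vectors[index])
            f.write(f"{word} {values}\n")

# Toy data standing in for the embedding matrix and word_to_index (assumption).
vocab = {"the": 0, "books": 1}
vectors = np.array([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])
path = os.path.join(tempfile.mkdtemp(), "vectors.txt")
save_word2vec_text(vectors, vocab, path)

# Read the file back to inspect what was written.
with open(path, encoding="utf-8") as f:
    lines = f.read().splitlines()

# With Gensim installed, the file could then be loaded with:
# word_vectors = KeyedVectors.load_word2vec_format(path, binary=False)
```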

In [None]:
word_vectors['the']

In [None]:
word_vectors.similar_by_word("books")
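
`similar_by_word` is a `KeyedVectors` method, so it is not available on a raw tensor. A minimal sketch of the same idea, ranking every other word by cosine similarity to a query word, is shown here under the assumption that the embeddings are a NumPy array and the vocabulary is a word-to-index dictionary. The function name `nearest_neighbors` and the toy data are hypothetical.

```python
import numpy as np

def nearest_neighbors(vectors, vocab, word, topn=10):
    """Return up to topn (word, cosine similarity) pairs closest to `word`."""
    index_to_word = {i: w for w, i in vocab.items()}
    query = vocab[word]
    # Normalize rows so that a dot product equals cosine similarity.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit[query]
    # Sort by descending similarity and drop the query word itself.
    ranked = [int(i) for i in np.argsort(-sims) if int(i) != query]
    return [(index_to_word[i], float(sims[i])) for i in ranked[:topn]]

# Toy data standing in for the real embeddings (assumption).
vocab = {"books": 0, "novels": 1, "banana": 2}
vectors = np.array([[1.0, 1.0], [1.0, 0.8], [-1.0, 0.5]])
neighbors = nearest_neighbors(vectors, vocab, "books", topn=2)
```

On the full vocabulary this computes one matrix-vector product per query, which should be fast enough for interactive exploration.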

In [None]:
def get_analogy(a, b, c):
    return word_vectors.most_similar(positive=[b, c], negative=[a])[0][0]

In [None]:
get_analogy('man', 'woman', 'king')
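
The analogy query asks "a is to b as c is to ?" by looking for the word nearest to b - a + c, excluding the three input words. The following is a minimal sketch of that vector arithmetic on a raw matrix, assuming a NumPy array of embeddings and a word-to-index dictionary. The function name `get_analogy_from_matrix` and the hand-built toy vectors are hypothetical, and the details are not guaranteed to match Gensim's `most_similar` exactly.

```python
import numpy as np

def get_analogy_from_matrix(vectors, vocab, a, b, c):
    """Answer 'a is to b as c is to ?' via the nearest word to b - a + c."""
    index_to_word = {i: w for w, i in vocab.items()}
    # Work with unit-length vectors so dot products are cosine similarities.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    target = unit[vocab[b]] - unit[vocab[a]] + unit[vocab[c]]
    sims = unit @ (target / np.linalg.norm(target))
    # Skip the three input words so the query cannot return one of them.
    excluded = {vocab[a], vocab[b], vocab[c]}
    for i in np.argsort(-sims):
        if int(i) not in excluded:
            return index_to_word[int(i)]
    return None

# Hand-built 2-D toy vectors (assumption): axis 0 ~ "royalty", axis 1 ~ "gender".
vocab = {"man": 0, "woman": 1, "king": 2, "queen": 3, "apple": 4}
vectors = np.array([[0.0, 1.0], [0.0, -1.0], [1.0, 1.0], [1.0, -1.0], [-1.0, 0.0]])
answer = get_analogy_from_matrix(vectors, vocab, "man", "woman", "king")
```

The toy vectors are constructed so that the expected analogy holds; vectors learned from a real corpus will not always behave this cleanly.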