# SI630 Homework 2: Word2vec Vector Analysis

*Important Note:* Start this notebook only after you've gotten your word2vec model up and running!

Many NLP packages support working with word embeddings. In this notebook you can work through the problems assigned in Task 3. We've provided the basic functionality for loading and working with word vectors using [Gensim](https://radimrehurek.com/gensim/models/keyedvectors.html), a good library for learning and using word vectors.

One of the fun parts of word vectors is getting a sense of what they learned. Feel free to explore the vectors here! 

In [12]:
from gensim.models import KeyedVectors

In [13]:
word_vectors = KeyedVectors.load_word2vec_format('word2vec_weight_med.data', binary=False)
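
With `binary=False`, `load_word2vec_format` expects the plain-text word2vec layout: a header line with the vocabulary size and vector dimension, then one line per word holding the word followed by its space-separated components. The sketch below writes and parses that layout using only the standard library, which can help when checking the file your own model saved. The helper names `write_word2vec_text` and `read_word2vec_text` and the toy vectors are illustrative assumptions, not part of Gensim or the assignment code.

```python
import io

def write_word2vec_text(vectors, handle):
    """Write a {word: list of floats} dict in word2vec text format."""
    dim = len(next(iter(vectors.values())))
    handle.write(f"{len(vectors)} {dim}\n")  # header: vocab size, dimension
    for word, vec in vectors.items():
        handle.write(word + " " + " ".join(f"{x:.6f}" for x in vec) + "\n")

def read_word2vec_text(handle):
    """Parse word2vec text format back into a {word: list of floats} dict."""
    vocab_size, dim = map(int, handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        word, values = parts[0], [float(x) for x in parts[1:]]
        assert len(values) == dim, f"expected {dim} values for {word!r}"
        vectors[word] = values
    assert len(vectors) == vocab_size, "header count does not match rows"
    return vectors

# Toy data (hypothetical values, not taken from the trained model)
toy_vectors = {"the": [0.1, -0.2, 0.3], "books": [0.5, 0.0, -0.25]}

buffer = io.StringIO()  # stands in for a file on disk
write_word2vec_text(toy_vectors, buffer)
buffer.seek(0)
round_tripped = read_word2vec_text(buffer)
print(round_tripped)
```

If the header counts disagree with the rows that follow, loading is likely to fail, so a quick check like this can save debugging time.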

In [14]:
word_vectors['the']

array([-0.0041172 ,  0.13652432,  0.04166884, -0.19393025, -0.00663474,
        0.18001786, -0.0801428 ,  0.40245804,  0.13151415,  0.0624693 ,
        0.10634628,  0.28353432,  0.11042169, -0.01605607,  0.08256505,
        0.39219746, -0.11092702,  0.19388388,  0.11200747,  0.23258185,
       -0.10147015, -0.32964718,  0.45672575, -0.2604698 ,  0.0087484 ,
       -0.21609078, -0.07011503,  0.15418056, -0.07643814, -0.2428526 ,
       -0.2825639 ,  0.04538359,  0.02534867, -0.3569212 , -0.14855587,
        0.11648662, -0.27255863, -0.12053365, -0.04893584, -0.40937024,
        0.27876425, -0.16029464,  0.22211544,  0.13279651,  0.33708835,
       -0.14596154,  0.14458443, -0.06939787,  0.07756472, -0.37361556,
       -0.4815985 , -0.32931057,  0.3084443 , -0.11816235,  0.0482705 ,
        0.01396449,  0.12008078,  0.22734572,  0.16200359,  0.2454627 ,
        0.10973944, -0.3791624 ,  0.23891893, -0.3458776 , -0.0272899 ,
       -0.06424014,  0.02611809, -0.08035917, -0.12821646, -0.02

In [15]:
word_vectors.similar_by_word("books")

[('novels', 0.9809719920158386),
 ('articles', 0.9724070429801941),
 ('poems', 0.9717661142349243),
 ('paintings', 0.970707356929779),
 ('stories', 0.9705720543861389),
 ('compositions', 0.9633492231369019),
 ('papers', 0.963097095489502),
 ('others', 0.9611388444900513),
 ('pieces', 0.9607668519020081),
 ('videos', 0.960437536239624)]
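
The scores above are cosine similarities between "books" and every other word in the vocabulary, sorted from highest to lowest. As a rough illustration of that ranking, here is a minimal pure-Python sketch over a tiny hand-made vocabulary. The function names and the three-dimensional toy vectors are assumptions made for the example; Gensim's own implementation is vectorized and differs in detail.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_neighbors(query, vectors, topn=3):
    """Rank every other word by cosine similarity to the query word."""
    scores = [
        (word, cosine_similarity(vectors[query], vec))
        for word, vec in vectors.items()
        if word != query  # the query itself is excluded from the results
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Toy vectors (hypothetical, chosen so related words point the same way)
toy = {
    "books": [0.9, 0.1, 0.0],
    "novels": [0.8, 0.2, 0.1],
    "poems": [0.7, 0.3, 0.0],
    "river": [0.0, 0.1, 0.9],
}

neighbors = nearest_neighbors("books", toy, topn=2)
print(neighbors)
```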

In [16]:
def get_analogy(a, b, c):
    '''
    Returns candidate words d that complete the analogy "a is to b as c is to d"
    '''
    return word_vectors.most_similar(positive=[b, c], negative=[a])

In [17]:
get_analogy('man', 'woman', 'king')

[('jewish', 0.9412981271743774),
 ('queen', 0.9380686283111572),
 ('roman', 0.919913649559021),
 ('christian', 0.9194426536560059),
 ('church', 0.91877281665802),
 ('catholic', 0.90057772397995),
 ('lord', 0.8952214121818542),
 ('merchant', 0.885901927947998),
 ('christ', 0.8858672976493835),
 ('historian', 0.8825202584266663)]
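
The analogy query amounts to vector arithmetic: add the vectors for `b` and `c`, subtract the vector for `a`, and look for the words closest to the result, skipping the three input words. On a small model the expected answer does not always rank first, as in the output above where 'queen' comes second. The sketch below shows the idea on hand-made vectors. It assumes unit-normalized inputs and cosine ranking, which is roughly what `most_similar` does, but the function names and toy vectors are illustrative and this is not Gensim's implementation.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def analogy(a, b, c, vectors, topn=1):
    """Rank words d for 'a is to b as c is to d' using b - a + c."""
    ua, ub, uc = unit(vectors[a]), unit(vectors[b]), unit(vectors[c])
    target = unit([y - x + z for x, y, z in zip(ua, ub, uc)])
    scores = []
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue  # input words are excluded from the candidates
        score = sum(p * q for p, q in zip(target, unit(vec)))
        scores.append((word, score))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Toy vectors with hypothetical axes: (royal, female, person)
toy = {
    "man": [0.0, 0.0, 1.0],
    "woman": [0.0, 1.0, 1.0],
    "king": [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
    "prince": [1.0, 0.0, 0.8],
    "church": [1.0, 0.0, 0.0],
}

result = analogy("man", "woman", "king", toy, topn=3)
print(result)
```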

In [19]:
import pandas as pd

In [20]:
word_pair_df = pd.read_csv("word_pair_similarity_predictions.csv")
word_pair_df

Unnamed: 0,word1,word2,sim
0,old,new,
1,smart,intelligent,
2,hard,difficult,
3,happy,cheerful,
4,hard,easy,
...,...,...,...
1625,relatives,sister,
1626,relatives,she,
1627,relatives,her,
1628,relatives,hers,


In [21]:
from scipy.spatial.distance import cosine

In [23]:
def compute_cosine_similarity(embedding_one, embedding_two):
    '''
    Computes the cosine similarity between two embedding vectors
    '''

    # scipy's cosine() returns the cosine *distance*, i.e. 1 - similarity
    similarity = 1 - float(cosine(embedding_one, embedding_two))
    return similarity
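
For reference, the relationship between SciPy's cosine distance and the similarity computed here can be written out as follows. The symbols $u$ and $v$ are notation introduced for this note and stand for the two embedding vectors.

$$
\text{sim}(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert},
\qquad
\text{cosine}(u, v) = 1 - \text{sim}(u, v)
$$

The similarity lies in $[-1, 1]$, so the distance lies in $[0, 2]$. As a small worked example, for $u = (1, 0)$ and $v = (1, 1)$ we get $\text{sim}(u, v) = 1/\sqrt{2} \approx 0.7071$ and a cosine distance of about $0.2929$.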

In [24]:
# Fill in the similarity for every (word1, word2) pair
word_pair_df["sim"] = [
    compute_cosine_similarity(word_vectors[word1], word_vectors[word2])
    for word1, word2 in zip(word_pair_df["word1"], word_pair_df["word2"])
]

In [25]:
word_pair_df

Unnamed: 0,word1,word2,sim
0,old,new,0.273122
1,smart,intelligent,0.963854
2,hard,difficult,0.816973
3,happy,cheerful,0.781015
4,hard,easy,0.897915
...,...,...,...
1625,relatives,sister,0.781331
1626,relatives,she,0.549845
1627,relatives,her,0.772994
1628,relatives,hers,0.897313


In [26]:
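# Note: with the default index=True, to_csv writes the row index as an
# additional unnamed column next to the existing "Unnamed: 0" column
word_pair_df.to_csv("word_pair_similarity_predictions.csv")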