# SI630 Homework 2: Word2vec Vector Analysis

*Important Note:* Start this notebook only after you've gotten your word2vec model up and running!

Many NLP packages support working with word embeddings. In this notebook you can work through the various problems assigned in Task 3. We've provided the basic functionality for loading and working with word vectors using [Gensim](https://radimrehurek.com/gensim/models/keyedvectors.html), a good library for learning and using word vectors.

One of the fun parts of word vectors is getting a sense of what they learned. Feel free to explore the vectors here! 

* Kaggle User Name: chenyuntao 
* Name Displayed: Chenyun Tao 
* Uniqname: cyuntao

In [1]:
from gensim.models import KeyedVectors
from gensim.test.utils import datapath

## Problem 9

In [2]:
word_vectors = KeyedVectors.load_word2vec_format('my_word2vec.wv', binary=False)
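
The `binary=False` argument above tells Gensim to read the plain-text word2vec format: a header line with the vocabulary size and the vector dimension, followed by one line per word holding the word and its space-separated values. As a rough illustration of that layout, the sketch below parses a tiny in-memory example by hand. The helper `parse_word2vec_text` and the toy numbers are assumptions made for this example; they are not part of Gensim and are not taken from `my_word2vec.wv`.

```python
import io


def parse_word2vec_text(handle):
    """Parse the plain-text word2vec layout into a {word: [floats]} dict."""
    # The first line holds the vocabulary size and the vector dimension.
    vocab_size, dim = (int(x) for x in handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    assert len(vectors) == vocab_size, "header and body disagree"
    return vectors


# Toy file contents (made-up numbers, purely illustrative).
toy_file = io.StringIO("2 3\nbooks 0.1 0.2 0.3\nnovels 0.1 0.25 0.28\n")
toy_vectors = parse_word2vec_text(toy_file)
print(toy_vectors["books"])
```

In practice `KeyedVectors.load_word2vec_format` does this parsing itself, so the helper is only useful for checking that a saved file has the expected shape.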

## Problem 10

In [3]:
word_vectors.similar_by_word('5')

[('7', 0.8617218732833862),
 ('023', 0.8496308922767639),
 ('545', 0.8398237824440002),
 ('938', 0.83519446849823),
 ('037', 0.8346812725067139),
 ('791', 0.8328269720077515),
 ('097', 0.8319663405418396),
 ('909', 0.8264710903167725),
 ('1', 0.8264130353927612),
 ('852', 0.8248751163482666)]

In [4]:
word_vectors.similar_by_word('books')

[('novellas', 0.8293923139572144),
 ('anthologies', 0.8185344338417053),
 ('essays', 0.8104189038276672),
 ('illustrations', 0.7848616242408752),
 ('novels', 0.7826084494590759),
 ('collections', 0.7822340726852417),
 ('articles', 0.779187023639679),
 ('reprints', 0.7691074013710022),
 ('publications', 0.7676923274993896),
 ('taxa', 0.7632808685302734)]

In [5]:
word_vectors.similar_by_word('month')

[('year', 0.7540034055709839),
 ('hundredths', 0.7282611727714539),
 ('kinimaka', 0.696230411529541),
 ('clasps', 0.6880671977996826),
 ('inactivity', 0.6861849427223206),
 ('armband', 0.6746704578399658),
 ('truce', 0.6711289286613464),
 ('misadventure', 0.667199432849884),
 ('bogey', 0.6593464612960815),
 ('fluckey', 0.6590462327003479)]

In [6]:
word_vectors.similar_by_word('anna')

[('elisabeth', 0.7705016732215881),
 ('gudrun', 0.7614984512329102),
 ('tine', 0.7507264614105225),
 ('franziska', 0.7421220541000366),
 ('ivanovna', 0.7388003468513489),
 ('isabelle', 0.7375069260597229),
 ('kournikova', 0.7367249131202698),
 ('elise', 0.7362877130508423),
 ('inga', 0.7359388470649719),
 ('johanna', 0.734180212020874)]

In [7]:
word_vectors.similar_by_word('google')

[('microsoft', 0.762705385684967),
 ('nyse', 0.7428861856460571),
 ('lucasfilm', 0.7327085137367249),
 ('jpmorgan', 0.7212052345275879),
 ('ddb', 0.7090804576873779),
 ('msg', 0.7024335861206055),
 ('shiseido', 0.6980329155921936),
 ('aol', 0.6937880516052246),
 ('foxsports', 0.6900680661201477),
 ('mozilla', 0.6882199048995972)]

In [8]:
word_vectors.similar_by_word('physics')

[('chemistry', 0.8859132528305054),
 ('paleontology', 0.867053747177124),
 ('astrophysics', 0.8587157726287842),
 ('zoology', 0.8538668155670166),
 ('geophysics', 0.8518619537353516),
 ('biophysics', 0.8510938882827759),
 ('geology', 0.8474372029304504),
 ('biology', 0.8412570953369141),
 ('humanities', 0.8356377482414246),
 ('theoretical', 0.8329718708992004)]

In [9]:
word_vectors.similar_by_word('ridiculous')

[('surely', 0.8762216567993164),
 ('malicious', 0.8553721308708191),
 ('inexplicable', 0.8485351204872131),
 ('owe', 0.844571053981781),
 ('likable', 0.8443453311920166),
 ('delete', 0.8442380428314209),
 ('unreasonable', 0.8414052724838257),
 ('heinous', 0.841387152671814),
 ('prejudiced', 0.8342485427856445),
 ('sickening', 0.8329619765281677)]

In [10]:
word_vectors.similar_by_word('michigan')

[('montana', 0.8423749804496765),
 ('arkansas', 0.8313051462173462),
 ('kentucky', 0.8308364152908325),
 ('dakota', 0.8237232565879822),
 ('fayetteville', 0.8157213926315308),
 ('mississippi', 0.8141648769378662),
 ('duluth', 0.8041296005249023),
 ('alabama', 0.803572416305542),
 ('lewiston', 0.8018585443496704),
 ('virginia', 0.800946831703186)]

In [11]:
word_vectors.similar_by_word('regime')

[('dictatorship', 0.8642090559005737),
 ('communists', 0.8238113522529602),
 ('hitler', 0.7834300398826599),
 ('ceauşescu', 0.7785800099372864),
 ('persecution', 0.7748729586601257),
 ('separatist', 0.7742667198181152),
 ('pact', 0.7733154892921448),
 ('pinochet', 0.7670925855636597),
 ('ceaușescu', 0.7637039422988892),
 ('repressions', 0.7629903554916382)]

In general, the nearest neighbours are semantically similar to the target words, but the neighbours of frequent words are better than those of less frequent and rare words. The model also seems to do well on words with a specific meaning that tend to appear in specific contexts, such as 'michigan': although 'michigan' is a rare word, its neighbours are mostly names of US states and cities.
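
The scores in the lists above are cosine similarities between the query vector and every other vector in the vocabulary. The following is a minimal sketch of that ranking in plain Python, assuming a small hand-made vocabulary. The names `cosine`, `nearest_neighbours`, and the `toy` vectors are hypothetical and are not part of the trained model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbours(vectors, word, topn=3):
    """Rank every other word by cosine similarity to `word`."""
    query = vectors[word]
    scored = [
        (other, cosine(query, vec))
        for other, vec in vectors.items()
        if other != word  # the query word itself is never returned
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up 3-dimensional vectors, chosen only to illustrate the ranking.
toy = {
    "physics": [0.9, 0.1, 0.0],
    "chemistry": [0.8, 0.2, 0.1],
    "pizza": [0.0, 0.1, 0.9],
}
neighbours = nearest_neighbours(toy, "physics", topn=2)
print(neighbours)
```

Because cosine similarity ignores vector length, two words can score highly whenever their vectors point in a similar direction, which may be one reason poorly trained rare words end up with odd neighbours.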

## Problem 11

In [12]:
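def get_analogy(a, b, c):
    # Return the top-ranked word for the vector b - a + c,
    # i.e. the answer to "a is to b as c is to ?".
    return word_vectors.most_similar(positive=[b, c], negative=[a])[0][0]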

In [13]:
get_analogy('jump', 'walk', 'jumping')

'walking'

In [14]:
get_analogy('book', 'language', 'books')

'languages'

In [15]:
get_analogy('japan', 'korea', 'japanese')

'korean'

In [16]:
get_analogy('korea', 'japan', 'seoul')

'tokyo'

In [17]:
get_analogy('big', 'large', 'bigger')

'meager'

Some of the analogies worked and some did not. Of the 5 analogies I tried, the model correctly finds that 'walking' is to 'walk' as 'jumping' is to 'jump', 'languages' is to 'language' as 'books' is to 'book', 'korean' is to 'korea' as 'japanese' is to 'japan', and 'tokyo' is to 'japan' as 'seoul' is to 'korea'. However, when asked what is to 'large' as 'bigger' is to 'big', the model returns 'meager' instead of 'larger'. Although 'meager' ends in "-er", it is not a comparative form and means nearly the opposite of 'large', so this answer is wrong.
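
The analogy query can be read as vector arithmetic: take `b - a + c` and return the word whose vector is closest to the result by cosine similarity, skipping the three input words. The sketch below illustrates that idea on made-up vectors. It is an approximation of what `most_similar` computes, not Gensim's actual implementation, and the function `analogy` and the `toy` vocabulary are assumptions for this example.

```python
import math


def normalise(v):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]


def analogy(vectors, a, b, c):
    """Return the word closest to b - a + c, excluding a, b, and c."""
    va, vb, vc = (normalise(vectors[w]) for w in (a, b, c))
    target = normalise([y - x + z for x, y, z in zip(va, vb, vc)])
    best_word, best_score = None, -2.0  # cosine is never below -1
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue  # input words are not allowed as answers
        score = sum(p * q for p, q in zip(target, normalise(vec)))
        if score > best_score:
            best_word, best_score = word, score
    return best_word


# Made-up vectors: axis 0 ~ "jump", axis 1 ~ "walk", axis 2 ~ "-ing".
toy = {
    "jump": [1.0, 0.0, 0.0],
    "jumping": [1.0, 0.0, 1.0],
    "walk": [0.0, 1.0, 0.0],
    "walking": [0.0, 1.0, 1.0],
    "pizza": [0.5, 0.5, 0.1],
}
answer = analogy(toy, "jump", "walk", "jumping")
print(answer)
```

Excluding the input words matters: without that step the nearest vector to `b - a + c` is often `b` or `c` itself.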

## Problem 12

In [18]:
import pandas as pd

In [19]:
word_pairs = pd.read_csv('word_pairs_to_estimate_similarity.test.csv') 
word_pairs.head()

|  | pair_id | word1 | word2 |
|---|---|---|---|
| 0 | 0 | old | new |
| 1 | 1 | smart | intelligent |
| 2 | 2 | hard | difficult |
| 3 | 3 | happy | cheerful |
| 4 | 4 | hard | easy |


In [20]:
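sim_scores = []
for index, row in word_pairs.iterrows():
    # Access columns by name rather than by position (row[1], row[2]),
    # which is clearer and does not depend on the column order.
    word1 = row['word1']
    word2 = row['word2']
    # Fall back to the unknown-word token for out-of-vocabulary words
    # (this assumes the trained model contains a '<UNK>' entry).
    if word1 not in word_vectors:
        word1 = '<UNK>'
    if word2 not in word_vectors:
        word2 = '<UNK>'
    sim_scores.append(word_vectors.similarity(word1, word2))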

In [21]:
result = pd.DataFrame(word_pairs['pair_id'])
result['similarity'] = sim_scores
result.to_csv('result.csv', index=False)

## Problem 15

In [22]:
synonym_vectors = KeyedVectors.load_word2vec_format('my_word2vec_synonyms.wv', binary=False)

In [23]:
word_vectors.similar_by_word("adviser")

[('deputy', 0.8019633293151855),
 ('kpmg', 0.7767941951751709),
 ('advisor', 0.7703201770782471),
 ('auditor', 0.7609075903892517),
 ('administrative', 0.7436810731887817),
 ('appointee', 0.7381704449653625),
 ('consultant', 0.7373884916305542),
 ('chief', 0.7365993857383728),
 ('undersecretary', 0.7252454161643982),
 ('liaison', 0.7233219742774963)]

In [24]:
synonym_vectors.similar_by_word("adviser")

[('consultant', 0.8524190187454224),
 ('advisor', 0.8397467732429504),
 ('chief', 0.787283718585968),
 ('economist', 0.7719884514808655),
 ('strategist', 0.7714995741844177),
 ('kpmg', 0.7626684308052063),
 ('boss', 0.7369043231010437),
 ('statistician', 0.7354175448417664),
 ('samora', 0.73353511095047),
 ('directorate', 0.7304761409759521)]

In [25]:
word_vectors.similar_by_word("economical")

[('logistical', 0.8019434809684753),
 ('airway', 0.7648997902870178),
 ('sanitary', 0.7633872628211975),
 ('dewatering', 0.7564762830734253),
 ('epidemiologic', 0.7535840272903442),
 ('facilitating', 0.7523698806762695),
 ('competitiveness', 0.7465103268623352),
 ('radioisotopes', 0.7454443573951721),
 ('solving', 0.7447012662887573),
 ('organisational', 0.7441443800926208)]

In [26]:
synonym_vectors.similar_by_word("economical")

[('economic', 0.9121019840240479),
 ('ecological', 0.8331916332244873),
 ('economy', 0.8126458525657654),
 ('climate', 0.8002585768699646),
 ('globalization', 0.7916080355644226),
 ('policy', 0.7913529276847839),
 ('socioeconomic', 0.7870862483978271),
 ('cohesion', 0.7848232984542847),
 ('economies', 0.7820599675178528),
 ('disparities', 0.7818484902381897)]

In [27]:
word_vectors.similar_by_word("use")

[('conductivity', 0.8310834169387817),
 ('inhibitory', 0.8119741082191467),
 ('incremental', 0.8075635433197021),
 ('gases', 0.7931780219078064),
 ('porous', 0.7930848002433777),
 ('simulations', 0.7930310964584351),
 ('injection', 0.7905918955802917),
 ('diffusion', 0.7883409857749939),
 ('methods', 0.7853349447250366),
 ('reassessment', 0.782145619392395)]

In [28]:
synonym_vectors.similar_by_word("use")

[('invasive', 0.8119388818740845),
 ('enhance', 0.7985661625862122),
 ('stimulation', 0.7984791994094849),
 ('manipulation', 0.7977747917175293),
 ('integrate', 0.7950735688209534),
 ('discover', 0.7936634421348572),
 ('electrolytes', 0.7912244200706482),
 ('emphasize', 0.7909663319587708),
 ('antibiotic', 0.7858322858810425),
 ('immune', 0.7857041358947754)]

In [29]:
word_vectors.similar_by_word("4")

[('2', 0.9241558313369751),
 ('023', 0.8629616498947144),
 ('7', 0.8623941540718079),
 ('9', 0.8603189587593079),
 ('yel', 0.8593716025352478),
 ('8', 0.8487160205841064),
 ('wcq', 0.8475887179374695),
 ('6', 0.8417019248008728),
 ('inzell', 0.8363901972770691),
 ('492', 0.8281565308570862)]

In [30]:
synonym_vectors.similar_by_word("4")

[('8', 0.8649346232414246),
 ('trey', 0.853675365447998),
 ('septet', 0.852579653263092),
 ('eight', 0.8394075632095337),
 ('seven', 0.8384681940078735),
 ('four', 0.838271975517273),
 ('v', 0.836565375328064),
 ('triad', 0.8355855941772461),
 ('troika', 0.833070695400238),
 ('7', 0.8313018083572388)]

In [31]:
word_vectors.similar_by_word("afterwards")

[('immediately', 0.7232730984687805),
 ('whereupon', 0.7148445248603821),
 ('oai', 0.6885701417922974),
 ('torstensson', 0.6845402121543884),
 ('faller', 0.6814539432525635),
 ('bandmann', 0.6697770357131958),
 ('afterward', 0.6655863523483276),
 ('thereafter', 0.663615882396698),
 ('yorihito', 0.6610944271087646),
 ('houstoun', 0.6602509021759033)]

In [32]:
synonym_vectors.similar_by_word("afterwards")

[('after', 0.7626321315765381),
 ('afterward', 0.7478715181350708),
 ('subsequently', 0.7364693880081177),
 ('earlier', 0.7018415927886963),
 ('before', 0.6789159774780273),
 ('later', 0.646862268447876),
 ('curtly', 0.615157961845398),
 ('bons', 0.6011383533477783),
 ('parsemain', 0.6007453799247742),
 ('thenceforth', 0.5835322141647339)]

In [33]:
word_vectors.similar_by_word("pizza")

[('nestlé', 0.7566994428634644),
 ('candy', 0.7299879193305969),
 ('jeep', 0.6904927492141724),
 ('motorcity', 0.6874257326126099),
 ('chicken', 0.683999240398407),
 ('tuff', 0.6827149987220764),
 ('bucket', 0.6823759078979492),
 ('puppy', 0.6819471120834351),
 ('chrysler', 0.6813479065895081),
 ('apple', 0.6784874796867371)]

In [34]:
synonym_vectors.similar_by_word("pizza")

[('wine', 0.7509880065917969),
 ('apple', 0.7420293688774109),
 ('chicken', 0.7248697280883789),
 ('asterix', 0.7106896638870239),
 ('soya', 0.7082656025886536),
 ('chevron', 0.707973837852478),
 ('disney', 0.7070032358169556),
 ('tails', 0.7015137076377869),
 ('opulence', 0.6969329118728638),
 ('eyewear', 0.6911447048187256)]

In [35]:
word_vectors.similar_by_word("swim")

[('competitively', 0.7333688735961914),
 ('skied', 0.7225421667098999),
 ('trampoline', 0.7103570699691772),
 ('skiing', 0.7035570740699768),
 ('mtb', 0.6923143267631531),
 ('sculling', 0.6906573176383972),
 ('bbq', 0.6888306736946106),
 ('fives', 0.6866194605827332),
 ('softball', 0.6848371624946594),
 ('u25', 0.6801198124885559)]

In [36]:
synonym_vectors.similar_by_word("swim")

[('swimming', 0.828518271446228),
 ('archery', 0.8174597024917603),
 ('biking', 0.7810505032539368),
 ('skiing', 0.7792401909828186),
 ('fencing', 0.7785387635231018),
 ('gymnastics', 0.7664653658866882),
 ('wayte', 0.7610089182853699),
 ('mtb', 0.7526524066925049),
 ('u12', 0.7354594469070435),
 ('curling', 0.7335914373397827)]

Here I chose 5 words that appear in the `synonyms.txt` file ('adviser', 'economical', 'use', '4', and 'afterwards') and 2 words that do not ('pizza' and 'swim').

In my opinion, the new synonym-aware model produces slightly better vectors. For example, in the original model most of the nearest neighbours of '4' are digits and the rest are meaningless tokens such as 'yel' and 'wcq', whereas in the synonym-aware model more of the neighbours are words, and most of them relate to '4' or to other numbers (for example 'four', 'eight', 'trey', and 'triad'). In addition, the neighbours of 'swim' are more closely related to swimming and to sports in general, even though 'swim' is not in `synonyms.txt`. However, I think the improvement is not significant, since some of the neighbours picked by the new model are still not semantically related to the target words.
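
One rough way to put a number on how much the two models differ is the Jaccard overlap between their top-10 neighbour lists for the same word. The sketch below does this for 'adviser' and 'swim', using the neighbour lists printed earlier in this notebook. The helper `jaccard` is a hypothetical name introduced here, and a low overlap only shows that the lists changed, not that they improved.

```python
def jaccard(first, second):
    """Size of the intersection divided by size of the union."""
    a, b = set(first), set(second)
    return len(a & b) / len(a | b)


# Top-10 neighbours copied from the outputs above.
original_adviser = ["deputy", "kpmg", "advisor", "auditor", "administrative",
                    "appointee", "consultant", "chief", "undersecretary",
                    "liaison"]
synonym_adviser = ["consultant", "advisor", "chief", "economist", "strategist",
                   "kpmg", "boss", "statistician", "samora", "directorate"]

original_swim = ["competitively", "skied", "trampoline", "skiing", "mtb",
                 "sculling", "bbq", "fives", "softball", "u25"]
synonym_swim = ["swimming", "archery", "biking", "skiing", "fencing",
                "gymnastics", "wayte", "mtb", "u12", "curling"]

shared_adviser = sorted(set(original_adviser) & set(synonym_adviser))
print(shared_adviser, jaccard(original_adviser, synonym_adviser))
print(jaccard(original_swim, synonym_swim))
```

For 'adviser' the two models share 4 of 16 distinct neighbours (overlap 0.25), and for 'swim' only 2 of 18, so the synonym-aware training changed most of the neighbourhood in both cases.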