### Find Top N similar words for a given word from a corpus using GoogleNews word2vec representations

Before you begin, download the GoogleNews vector representation (GoogleNews-vectors-negative300.bin.gz) from the GitHub repo:
https://github.com/mmihaltz/word2vec-GoogleNews-vectors

In [1]:
import os,sys

import pandas as pd 
import numpy as np 

import re
from nltk.tokenize import word_tokenize

from gensim.models import Word2Vec
from gensim.models.keyedvectors import KeyedVectors

from scipy import spatial

In [2]:
df = pd.read_csv('data/brown.csv',header=0,usecols=['tokenized_text'])
docs = df.values 

In [3]:
df.head()

Unnamed: 0,tokenized_text
0,"Furthermore , as an encouragement to revisioni..."
1,The Unitarian clergy were an exclusive club of...
2,"Ezra Stiles Gannett , an honorable representat..."
3,"Even so , Gannett judiciously argued , the Ass..."
4,We today are not entitled to excoriate honest ...


In [4]:
my_corpus = []

In [5]:
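for doc in docs:
  # each row is a one-element array holding the sentence text
  doc = ' '.join(doc.tolist()).lower()
  # replace newlines with spaces (str.replace returns a new string, so reassign)
  doc = doc.replace('\n', ' ')
  # keep only lowercase letters and spaces
  doc = re.sub('[^a-z ]+', '', doc)
  my_corpus.append([w for w in doc.split() if w != ''])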
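
To make the cleaning step easier to check on its own, here is a minimal sketch that wraps the same three operations in a function and runs it on a short sample. The function name `clean_and_tokenize` and the sample string are illustrative assumptions, not part of the original notebook.

```python
import re

def clean_and_tokenize(text):
    """Lowercase, turn newlines into spaces, drop non-letters, split on whitespace."""
    text = text.lower()
    # Newlines must become spaces *before* the regex runs; otherwise the regex
    # deletes them and the words on either side are glued together.
    text = text.replace('\n', ' ')
    text = re.sub('[^a-z ]+', '', text)
    return text.split()

# Hypothetical sample containing punctuation and an embedded newline
sample = "An exclusive club\nof cultivated gentlemen , as the term"
tokens = clean_and_tokenize(sample)
print(tokens)
```

Without the reassignment of `text.replace(...)`, `club\nof` would collapse into the single token `clubof`.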

In [6]:
my_corpus[:5]

[['furthermore',
  'as',
  'an',
  'encouragement',
  'to',
  'revisionist',
  'thinking',
  'it',
  'manifestly',
  'is',
  'fair',
  'to',
  'admit',
  'that',
  'any',
  'fraternity',
  'has',
  'a',
  'constitutional',
  'right',
  'to',
  'refuse',
  'to',
  'accept',
  'persons',
  'it',
  'dislikes'],
 ['the',
  'unitarian',
  'clergy',
  'were',
  'an',
  'exclusive',
  'club',
  'of',
  'cultivated',
  'gentlemen',
  'as',
  'the',
  'term',
  'was',
  'then',
  'understood',
  'in',
  'the',
  'back',
  'bay',
  'and',
  'parker',
  'was',
  'definitely',
  'not',
  'a',
  'gentleman',
  'either',
  'in',
  'theology',
  'or',
  'in',
  'manners'],
 ['ezra',
  'stiles',
  'gannett',
  'an',
  'honorable',
  'representative',
  'of',
  'the',
  'sanhedrin',
  'addressed',
  'himself',
  'frankly',
  'to',
  'the',
  'issue',
  'in',
  'insisting',
  'that',
  'parker',
  'should',
  'not',
  'be',
  'persecuted',
  'or',
  'calumniated',
  'and',
  'that',
  'in',
  'this',
  

Next we load the pre-trained GoogleNews vector representations, which are used in the rest of the notebook.
Here we import only the KeyedVectors form of the pre-trained model, not the whole model. This means we cannot train or refine the model further, but it gives us a computationally inexpensive way to use the pre-trained vectors in our application.

I am limiting the load to the first 5000 word vectors in the file, which is a very small part of the full vocabulary. This is mainly so that the code runs within the limited RAM on my machine. Ideally there would be no limit argument in this function call.

In [7]:
google_wv = KeyedVectors.load_word2vec_format('data/GoogleNews-vectors-negative300.bin.gz', binary=True, encoding='utf8', limit=5000)
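
To give a feel for why the `limit` argument matters for RAM, the following back-of-the-envelope sketch estimates the size of the raw embedding matrix alone. It assumes roughly 3 million entries in the full GoogleNews file and 4-byte float32 storage; both figures are assumptions stated for illustration, and real memory use will be higher because of the vocabulary dictionary and other overhead.

```python
# Rough size of the embedding matrix: words x dimensions x bytes per value.
DIMENSIONS = 300          # vector size of the GoogleNews model
BYTES_PER_FLOAT32 = 4     # assumption: vectors are held as float32

def embedding_matrix_bytes(n_words, dims=DIMENSIONS, bytes_per_value=BYTES_PER_FLOAT32):
    """Return the number of bytes needed for an n_words x dims matrix."""
    return n_words * dims * bytes_per_value

full_vocab = 3_000_000    # assumption: approximate size of the full vocabulary
limited_vocab = 5000      # the limit used in this notebook

full_gb = embedding_matrix_bytes(full_vocab) / 1e9
limited_mb = embedding_matrix_bytes(limited_vocab) / 1e6
print(f"full: ~{full_gb:.1f} GB, limit=5000: ~{limited_mb:.1f} MB")
```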

In [8]:
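# Update the vocabulary with our own text corpus
# (written against the gensim 3.x API: size, iter, wv.vocab)
model = Word2Vec(size=300, min_count=1, iter=10)
model.build_vocab(my_corpus)
training_samples_count = model.corpus_count
# add the loaded GoogleNews words to the vocabulary
model.build_vocab([list(google_wv.vocab.keys())], update=True)
# copy in pre-trained vectors for words already in the vocabulary; lockf=1.0 keeps them trainable
model.intersect_word2vec_format('data/GoogleNews-vectors-negative300.bin.gz', binary=True, lockf=1.0)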

In [9]:
# train the model
model.train(my_corpus, total_examples = training_samples_count, epochs = model.iter)

  


(9156759, 10050880)

In [10]:
#save the model
model.save('w2v_model')
model.wv.save('w2v_model_vectors')

In [11]:
#Load the saved model to save training time
model = Word2Vec.load('w2v_model')

vocabs = list(model.wv.vocab.keys())
vectors = model.wv[vocabs]


In [12]:
# Find the top N similar words to a given focal word

#User inputs for word and N
word1 = "question" # for example "question"
n = 10 # top n

idx = vocabs.index(word1)
vec1 = list(vectors[idx,:])

top_n_words = ['']*n
top_n_sim = np.zeros(n)
for i in range(len(vocabs)):
    if i == idx:
        continue
    word2 = vocabs[i]
    vec2 = list(vectors[i,:])
    # cosine similarity = 1 - cosine distance
    sim_score = 1 - spatial.distance.cosine(vec1, vec2)
    
    # if the similarity score of the current word is greater than the min score,
    # replace that word with the current word
    min_idx = np.argmin(top_n_sim)
    min_score = top_n_sim[min_idx]
    if sim_score > min_score:
        top_n_sim[min_idx] = sim_score
        top_n_words[min_idx] = word2

In [13]:
#print out the results
for w,s in zip(top_n_words, list(top_n_sim)):
    print(w,',',s)

answer , 0.66181194782257
discussion , 0.517073333263397
truth , 0.524804532527924
questions , 0.581407010555267
idea , 0.533285021781921
problem , 0.627395510673523
case , 0.518799304962158
iodocompounds , 0.524674355983734
matter , 0.561821341514587
issue , 0.532952010631561
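
The loop above compares the focal word with every other vocabulary word one at a time. As a rough alternative sketch, the same top-N search can be written with NumPy matrix operations: normalise every row to unit length, take dot products against the focal word's row, and sort. The function name `top_n_similar` and the toy vocabulary and vectors are made up for illustration; in the notebook you would pass `vocabs` and `vectors` instead. The sketch assumes no vector has zero length.

```python
import numpy as np

def top_n_similar(word, vocab, vectors, n=10):
    """Return the n (word, cosine similarity) pairs most similar to `word`, best first."""
    idx = vocab.index(word)
    # Scale every row to unit length so a dot product equals cosine similarity.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit[idx]
    # Exclude the focal word itself from the ranking.
    sims[idx] = -np.inf
    best = np.argsort(-sims)[:n]
    return [(vocab[i], float(sims[i])) for i in best]

# Toy 2-dimensional data, purely illustrative
toy_vocab = ['question', 'answer', 'problem', 'banana']
toy_vectors = np.array([[1.0, 0.0],
                        [0.9, 0.1],
                        [0.5, 0.5],
                        [0.0, 1.0]])

result = top_n_similar('question', toy_vocab, toy_vectors, n=2)
print(result)
```

Unlike the loop, this returns the results sorted from most to least similar. gensim also offers a built-in for the same task, `model.wv.most_similar(word1, topn=n)`.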
