# Word2Vec Unigram Testing

This Python notebook evaluates the Word2Vec unigram model. It is organized as follows:

- Find the words most similar to a selected word
- Perform semantic analysis
- Perform syntactic analysis
- Find the word that does not belong in a list of words
- Find the cosine similarity between two words
- Find the frequency count of a word
- Check whether a word is in the model
- Preview a list of words in the model
- Others (vector space size)

In [1]:
from gensim.models import Word2Vec as w2v



In [2]:
# Load Unigram model
FILE = "C:/Users/MyPC/Desktop/FYP/W2V Models/w2v_reddit_unigram_300d.bin"
model = w2v.load_word2vec_format(FILE, binary=True)

In [3]:
# Cell to find the most similar words
# One word for unigram: dragon, bleach, tottenham
# Two words for bigram: dragon_ball, barack_obama (underscore required, and the bigram model must be loaded)
model.most_similar("neuropsychopharmacology", topn=20)

[('biopsychology', 0.740115225315094),
 ('astrochemistry', 0.7391058206558228),
 ('neuroendocrinologist', 0.7296165227890015),
 ('nanoscience', 0.7265405058860779),
 ('neuropharmacology', 0.7247588038444519),
 ('saltzberg', 0.7157706618309021),
 ('ethnomusicology', 0.7156946659088135),
 ('psychobiology', 0.7154250144958496),
 ('nueroscience', 0.7147186994552612),
 ('neuropsychiatry', 0.7140935659408569),
 ('ichthyology', 0.710540235042572),
 ('molbio', 0.7056220769882202),
 ('oenology', 0.7056138515472412),
 ('antropology', 0.7041956186294556),
 ('biopsych', 0.7037904858589172),
 ('neuroengineering', 0.7037561535835266),
 ('nanoengineering', 0.7024978995323181),
 ('psycholinguistics', 0.7002543210983276),
 ('bioanthropology', 0.6995773315429688),
 ('christmann', 0.698868989944458)]

In [4]:
# Cell for semantic evaluation (e.g. king - man + woman is approximately equal to queen)
# The query below computes tokyo - japan + malaysia
model.most_similar(positive=["tokyo","malaysia"], negative=["japan"])

[('lumpur', 0.6737101674079895),
 ('kuala', 0.6668090224266052),
 ('taipei', 0.6401477456092834),
 ('bangkok', 0.6113026142120361),
 ('penang', 0.5809809565544128),
 ('lampur', 0.5752942562103271),
 ('toyko', 0.5550657510757446),
 ('selangor', 0.5511509776115417),
 ('singapore', 0.5502724647521973),
 ('mumbai', 0.5481346249580383)]
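
The analogy query above can be read as vector arithmetic: add the positive vectors, subtract the negative ones, and rank the remaining words by cosine similarity to the result. Below is a minimal pure-Python sketch of that idea on made-up 3-d vectors. The `toy` dictionary and the `cosine` and `analogy` helpers are hypothetical and not part of gensim, which also normalizes the input vectors before combining them.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(vectors, positive, negative, topn=3):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    # The query words themselves are excluded from the ranking
    exclude = set(positive) | set(negative)
    scored = [(w, cosine(target, v)) for w, v in vectors.items() if w not in exclude]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up vectors: the axes are (japan-ness, malaysia-ness, capital-ness)
toy = {
    "japan":    [1.0, 0.0, 0.0],
    "tokyo":    [1.0, 0.0, 1.0],
    "malaysia": [0.0, 1.0, 0.0],
    "lumpur":   [0.0, 1.0, 1.0],
    "bangkok":  [0.2, 0.2, 1.0],
}

result = analogy(toy, positive=["tokyo", "malaysia"], negative=["japan"])
print(result[0][0])  # lumpur
```

With these toy vectors, tokyo - japan + malaysia lands exactly on the "lumpur" vector, which mirrors the top result of the real query.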

In [5]:
# Cell for syntactic evaluation (e.g. walking - walk + swim is approximately equal to swimming)
# The query below computes greenish - green + blue
model.most_similar(positive=["greenish","blue"], negative=["green"])

[('blueish', 0.7298511266708374),
 ('greyish', 0.7232707738876343),
 ('bluish', 0.7149738669395447),
 ('pinkish', 0.705883264541626),
 ('purplish', 0.7028074264526367),
 ('brownish', 0.6946163773536682),
 ('grayish', 0.6922476887702942),
 ('reddish', 0.6911346316337585),
 ('yellowish', 0.6770833134651184),
 ('whitish', 0.6669460535049438)]

In [6]:
# Cell to check which word doesn't match among a group of words
model.doesnt_match("blue green yellow apple".split())

'apple'
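
One way to picture the odd-one-out check: normalize each word vector to unit length, average them, and report the word whose vector is least similar to that average. The sketch below uses made-up 2-d vectors; `unit`, `odd_one_out`, and `toy` are hypothetical names, and this illustrates the idea rather than gensim's exact code.

```python
import math

def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def odd_one_out(vectors, words):
    """Return the word whose unit vector is least similar to the mean vector."""
    units = [unit(vectors[w]) for w in words]
    dim = len(units[0])
    mean = [sum(u[d] for u in units) / len(units) for d in range(dim)]
    # Dot product with the mean; the smallest score marks the outlier
    scores = [sum(a * b for a, b in zip(u, mean)) for u in units]
    return words[scores.index(min(scores))]

# Made-up vectors: the colours point one way, the fruit another
toy = {
    "blue":   [1.0, 0.1],
    "green":  [0.9, 0.2],
    "yellow": [0.8, 0.3],
    "apple":  [0.1, 1.0],
}

outlier = odd_one_out(toy, "blue green yellow apple".split())
print(outlier)  # apple
```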

In [7]:
# Cell to check the cosine similarity between two words
model.similarity("titanic","rose")

0.24046405589195533
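
For reference, the cosine similarity of two word vectors $u$ and $v$ is the cosine of the angle between them:

$$
\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}
$$

The value lies in $[-1, 1]$, so the score of roughly 0.24 above suggests only a weak association between "titanic" and "rose" in this model.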

In [8]:
# Look up how many times a specific word occurred in the 2015 dataset
word = model.vocab['difu']
type(word.count)

int

In [9]:
# Check whether a word (unigram) is in the model. The lookup is case-sensitive
'Dragon' in model

False

In [10]:
# A brief preview of words in the model
count = 70

for index, word in enumerate(model.vocab):
    print(index, word, model.vocab[word].count)
    if index == count:
        break

0 okkie 439508
1 blows 1142077
2 farokhmanesh 868069
3 harqan 152515
4 ecoupon 234460
5 kmail 927823
6 houtei 455986
7 lse 1015078
8 shedler 571945
9 coworkes 572107
10 lurpis 808069
11 hhuuggee 501753
12 perilaku 178095
13 fritzbox 879686
14 unhingedness 360368
15 maritza 944746
16 camelpacks 294028
17 fancams 1044724
18 sweatfests 240353
19 consultation 1127028
20 juzgados 907097
21 noonmark 113457
22 roasties 892514
23 reyjavik 588103
24 chattel 1105974
25 badf 769341
26 spyxe 894233
27 newgrad 436663
28 doneness 1085608
29 perpetualperplex 99818
30 crk 1060265
31 saintliness 915133
32 sorex 404134
33 honzon 23022
34 cruciamentum 741891
35 metalsmithing 934140
36 videogameobsession 242152
37 walamart 164211
38 bothies 336256
39 volantese 359981
40 celata 83881
41 padden 817455
42 unlikeable 1111388
43 embarssing 819288
44 metsler 150690
45 diebacked 150467
46 lokie 851823
47 ames 1107869
48 scotlanders 322660
49 heseltine 843305
50 asianing 137332
51 werkweek 440430
52 aldemedes 681

In [11]:
# Looking under the hood of Word2vec
# Use this cell as tips for K-Means Clustering
# Motivation: https://www.kaggle.com/c/word2vec-nlp-tutorial/details/part-3-more-fun-with-word-vectors

# Feature vector of a single word (row 0 of the embedding matrix)
#print(model.syn0[0])

# Shape of the vocabulary
#print(model.syn0.shape)

# Get list of all keys (words)
#model.index2word[2000:3000]

In [13]:
# Preliminary K-Means test on the first 1000, 2000, or 4000 words using 250 clusters
from sklearn.cluster import KMeans
import time

# Specify the number of words and clusters
WORDS = 1000
CLUSTERS = 250

# Get the word vectors and the word
word_vectors = model.syn0[:WORDS]
words = model.index2word[:WORDS]

# Initialize K-Means
k_means = KMeans( n_clusters = CLUSTERS )

# Fit the model, get the centroid number and calculate time
start = time.time()
idx = k_means.fit_predict(word_vectors)
end = time.time()

print("TIME TAKEN: ", end-start)

TIME TAKEN:  2.908646821975708


In [18]:
# Create a Word / Index dictionary
# Each vocabulary word is matched to a cluster center

word_centroid_map = dict(zip( words, idx ))

# dict views cannot be indexed in Python 3, so convert to a list first
# to inspect the cluster id assigned to the first word
list(word_centroid_map.values())[0]

In [16]:
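# Loop through the first N clusters

N = 10

for cluster in range(0, N):

    # Collect the words assigned to this cluster
    cluster_words = []

    for i in range(0, WORDS):
        if idx[i] == cluster:
            cluster_words.append(words[i])

    # Print the cluster number; print(cluster_words) would list its members
    print(cluster)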

0
1
2
3
4
5
6
7
8
9
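
Once every word has a cluster id, the cluster members can also be gathered in a single pass by inverting the word-to-cluster dictionary, instead of scanning all words once per cluster. Below is a minimal sketch on made-up data. `group_by_cluster`, `toy_words`, and `toy_idx` are hypothetical names; in the notebook the input would be `word_centroid_map`.

```python
from collections import defaultdict

def group_by_cluster(word_centroid_map):
    """Invert a {word: cluster_id} mapping into {cluster_id: [words]}."""
    clusters = defaultdict(list)
    for word, cluster_id in word_centroid_map.items():
        clusters[cluster_id].append(word)
    return dict(clusters)

# Made-up stand-ins for `words` and `idx` from the K-Means cell
toy_words = ["blue", "green", "apple", "pear", "yellow"]
toy_idx = [0, 0, 1, 1, 0]
toy_map = dict(zip(toy_words, toy_idx))

groups = group_by_cluster(toy_map)
for cluster_id in sorted(groups):
    print(cluster_id, groups[cluster_id])
```

This visits each word once, which matters more as `WORDS` and `CLUSTERS` grow.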
