# Word2Vec Unigram Testing

This Python notebook evaluates the Word2Vec unigram model. It covers the following:

- Find the words most similar to a selected word
- Perform semantic analysis
- Perform syntactic analysis
- Find the word that does not belong in a list of words
- Find the cosine similarity between two words
- Find the frequency count of a word
- Check if a word is in the model
- Preview a list of words in the model
- Evaluate a K-Means + Word2Vec model

In [1]:
from gensim.models import Word2Vec as w2v



In [2]:
# Load Unigram model
FILE = "C:/Users/MyPC/Desktop/FYP/W2V Models/w2v_reddit_unigram_300d.bin"
model = w2v.load_word2vec_format(FILE, binary=True)
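
The loading call above relies on an older gensim release. From gensim 1.0 onward, word2vec-format files are loaded through `KeyedVectors` instead. The helper below is a minimal sketch of that variant; the function name `load_unigram_vectors` is hypothetical, and the import is placed inside the function so the definition itself does not require gensim.

```python
def load_unigram_vectors(path):
    """Load binary word2vec-format vectors with the KeyedVectors API (gensim >= 1.0)."""
    # Imported lazily so this definition can be evaluated without gensim installed.
    from gensim.models import KeyedVectors

    return KeyedVectors.load_word2vec_format(path, binary=True)
```

Under that assumption, `model = load_unigram_vectors(FILE)` would take the place of the cell above. Calls such as `most_similar`, `doesnt_match` and `similarity` are also available on the returned object.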

In [3]:
# Cell to find most similar words 
# One word for unigram: dragon, bleach, tottenham
# Two words for bigram: dragon_ball, barack_obama (UNDERSCORE NEEDED + BIGRAM MODEL LOADED)
model.most_similar("neuropsychopharmacology", topn=20)

[('biopsychology', 0.740115225315094),
 ('astrochemistry', 0.7391058206558228),
 ('neuroendocrinologist', 0.7296165227890015),
 ('nanoscience', 0.7265405058860779),
 ('neuropharmacology', 0.7247588038444519),
 ('saltzberg', 0.7157706618309021),
 ('ethnomusicology', 0.7156946659088135),
 ('psychobiology', 0.7154250144958496),
 ('nueroscience', 0.7147186994552612),
 ('neuropsychiatry', 0.7140935659408569),
 ('ichthyology', 0.710540235042572),
 ('molbio', 0.7056220769882202),
 ('oenology', 0.7056138515472412),
 ('antropology', 0.7041956186294556),
 ('biopsych', 0.7037904858589172),
 ('neuroengineering', 0.7037561535835266),
 ('nanoengineering', 0.7024978995323181),
 ('psycholinguistics', 0.7002543210983276),
 ('bioanthropology', 0.6995773315429688),
 ('christmann', 0.698868989944458)]
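
To make the ranking above concrete, here is a minimal pure-Python sketch of nearest-neighbour search by cosine similarity. It is not gensim's implementation. The 3-dimensional vectors are made up for illustration and do not come from the Reddit model.

```python
import heapq
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(vectors, query, topn=3):
    """Rank every word except `query` by cosine similarity to it."""
    target = vectors[query]
    scored = (
        (cosine(target, vec), word)
        for word, vec in vectors.items()
        if word != query
    )
    return [(word, score) for score, word in heapq.nlargest(topn, scored)]


# Hypothetical toy vectors (assumption: not taken from the real model)
toy_vectors = {
    "dragon": [0.9, 0.1, 0.0],
    "wyvern": [0.8, 0.2, 0.1],
    "bleach": [0.1, 0.9, 0.2],
    "tottenham": [0.0, 0.2, 0.9],
}

neighbours = most_similar(toy_vectors, "dragon", topn=2)
print(neighbours)
```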

In [4]:
# Cell for semantic evaluation (e.g. king - man + woman is approximately equal to queen)
# The query below computes tokyo - japan + malaysia
model.most_similar(positive=["tokyo","malaysia"], negative=["japan"])

[('lumpur', 0.6737101674079895),
 ('kuala', 0.6668090224266052),
 ('taipei', 0.6401477456092834),
 ('bangkok', 0.6113026142120361),
 ('penang', 0.5809809565544128),
 ('lampur', 0.5752942562103271),
 ('toyko', 0.5550657510757446),
 ('selangor', 0.5511509776115417),
 ('singapore', 0.5502724647521973),
 ('mumbai', 0.5481346249580383)]
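
The analogy query can be read as vector arithmetic followed by a nearest-neighbour search. The sketch below illustrates the idea under simplifying assumptions: it adds the unit vectors of the positive words, subtracts those of the negative words, and ranks the remaining words by cosine similarity. The toy vectors and the token `kuala_lumpur` are hypothetical; in the unigram model the two halves appear as separate words, as the output above shows.

```python
import math


def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def analogy(vectors, positive, negative, topn=1):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dims = len(next(iter(vectors.values())))
    query = [0.0] * dims
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    query = unit(query)

    # The input words are excluded from the candidates
    exclude = set(positive) | set(negative)
    scored = [
        (sum(q * x for q, x in zip(query, unit(vec))), word)
        for word, vec in vectors.items()
        if word not in exclude
    ]
    scored.sort(reverse=True)
    return [(word, score) for score, word in scored[:topn]]


# Hypothetical axes: (japan-ness, malaysia-ness, capital-ness)
toy_vectors = {
    "japan": [1.0, 0.0, 0.0],
    "tokyo": [1.0, 0.0, 1.0],
    "malaysia": [0.0, 1.0, 0.0],
    "kuala_lumpur": [0.0, 1.0, 1.0],
    "bangkok": [0.0, 0.3, 1.0],
    "sushi": [1.0, 0.0, 0.2],
}

result = analogy(toy_vectors, positive=["tokyo", "malaysia"], negative=["japan"])
print(result)
```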

In [5]:
# Cell for syntactic evaluation (e.g. walking - walk + swim is approximately equal to swimming)
# The query below computes greenish - green + blue
model.most_similar(positive=["greenish","blue"], negative=["green"])

[('blueish', 0.7298511266708374),
 ('greyish', 0.7232707738876343),
 ('bluish', 0.7149738669395447),
 ('pinkish', 0.705883264541626),
 ('purplish', 0.7028074264526367),
 ('brownish', 0.6946163773536682),
 ('grayish', 0.6922476887702942),
 ('reddish', 0.6911346316337585),
 ('yellowish', 0.6770833134651184),
 ('whitish', 0.6669460535049438)]

In [6]:
# Cell to check which word doesn't match among a group of words
model.doesnt_match("blue green yellow apple".split())

'apple'
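
One way to pick the odd word out is to average the unit vectors of all the words and return the word least similar to that average. The sketch below implements this idea in pure Python. It is an illustration with made-up 2-dimensional vectors, not the library's own code.

```python
import math


def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def doesnt_match(vectors, words):
    """Return the word whose unit vector is least similar to the group mean."""
    units = {w: unit(vectors[w]) for w in words}
    dims = len(next(iter(units.values())))
    mean = unit([sum(u[i] for u in units.values()) / len(units) for i in range(dims)])
    return min(words, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))


# Hypothetical toy vectors: the three colours point one way, the fruit another
toy_vectors = {
    "blue": [0.9, 0.1],
    "green": [0.8, 0.2],
    "yellow": [0.85, 0.3],
    "apple": [0.1, 0.9],
}

odd_one_out = doesnt_match(toy_vectors, "blue green yellow apple".split())
print(odd_one_out)
```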

In [7]:
# Cell to check the cosine similarity between two words
model.similarity("titanic","rose")

0.24046405589195533

In [8]:
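# Look up the stored count for a word (intended as its number of occurrences in the 2015 dataset)
# Caveat: load_word2vec_format assigns placeholder counts unless a vocabulary file is
# passed via fvocab, so this value may not be a true frequency.
# This cell only checks the type of the stored value.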
word = model.vocab['difu']
type(word.count)

int
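
The `model.vocab[word].count` lookup is specific to older gensim releases. gensim 4.0 removed the `vocab` attribute and exposes the same value through `get_vecattr(word, "count")`. The helper below is a small sketch that supports both; the name `word_count` is hypothetical, and the stand-in objects with a count of 7 are invented so the example runs without a model.

```python
from types import SimpleNamespace


def word_count(kv, word):
    """Return the stored count for `word` on either gensim API generation."""
    if hasattr(kv, "get_vecattr"):
        # gensim >= 4.0
        return kv.get_vecattr(word, "count")
    # Older gensim, as used in this notebook
    return kv.vocab[word].count


# Hypothetical stand-ins for the two API styles (the count 7 is invented)
old_style = SimpleNamespace(vocab={"difu": SimpleNamespace(count=7)})
new_style = SimpleNamespace(get_vecattr=lambda word, attr: {"difu": {"count": 7}}[word][attr])

print(word_count(old_style, "difu"), word_count(new_style, "difu"))
```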

In [9]:
# Check whether a word (unigram) is in the model. The lookup is case-sensitive
'Dragon' in model

False
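
Because the lookup is case-sensitive, a query can be normalised before testing membership. The sketch below assumes the vocabulary is entirely lowercase, which the previews in this notebook suggest but do not prove. The helper name `in_vocab` is hypothetical, and a plain set stands in for the model.

```python
def in_vocab(model, word):
    """Membership test after lowercasing (assumes an all-lowercase vocabulary)."""
    return word.lower() in model


# A set stands in for the model; both support the `in` operator
toy_vocab = {"dragon", "bleach", "tottenham"}

raw_hit = "Dragon" in toy_vocab
normalised_hit = in_vocab(toy_vocab, "Dragon")
print(raw_hit, normalised_hit)
```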

In [10]:
# A brief preview of words in the model
count = 70

for index, word in enumerate(model.vocab):
    print(index, word, model.vocab[word].count)
    if index == count:
        break

0 teatime 1029889
1 ithin 792856
2 kucheras 721360
3 dramatising 939723
4 oeschger 164139
5 bdsmpussy 676306
6 lessor 1083616
7 buttgrabber 115414
8 kpopalypse 782100
9 miminum 857370
10 moraliser 259469
11 soomethings 68881
12 weatherizing 658902
13 shadwo 325533
14 throwig 402521
15 allowedd 185897
16 sosho 340588
17 corporates 1093727
18 nonlicensed 599837
19 canserbero 310245
20 synergystic 745359
21 namboku 514664
22 babyhunters 454243
23 complimented 1126718
24 scrollsawing 83296
25 setupact 364667
26 rossx 375822
27 dionysiandogma 386807
28 methods 1143496
29 doobiest 268726
30 trixy 932238
31 treasues 325322
32 pinesol 999967
33 pennslyn 313387
34 vibrators 1111846
35 sonderlich 641349
36 finipil 619031
37 multiweek 546256
38 superbosses 982962
39 khodet 273747
40 arents 803995
41 speedkills 734690
42 saracasm 836106
43 beheads 1081131
44 therealsix 331061
45 ecrin 279779
46 freeling 457772
47 terrawatt 701551
48 pocito 366969
49 pulsewave 960684
50 enchantor 56653
51 somatype 

In [11]:
# Load the K-Means model
import pickle

# Specify the file
FILE = "C:/Users/MyPC/Desktop/FYP/K-Means Models/dict_250.pk"

# Load using pickle; the context manager closes the file afterwards
with open(FILE, "rb") as f:
    word_centroid_map = pickle.load(f)

type(word_centroid_map)

dict

In [12]:
# Loop through the first N clusters (cluster numbers 0 to N-1)

N = 100

for cluster in range(0, N):
    
    # Collect the words assigned to this cluster
    words = []
    
    for word, cluster_num in word_centroid_map.items():
        if cluster_num == cluster:
            words.append(word)
    
    print("CLUSTER NUMBER: %i" % (cluster))
    print("WORDS: ", words)
    print("\n\n")
        

CLUSTER NUMBER: 0
WORDS:  ['gas', 'ice', 'salt', 'powder', 'liquid', 'chemical', 'water', 'sugar', 'blood', 'electricity', 'energy', 'acid', 'oil']



CLUSTER NUMBER: 1
WORDS:  ['computer', 'phones', 'cameras', 'headphones', 'computers', 'storage', 'technology', 'machine', 'equipment', 'laptop', 'boards', 'devices', 'machines', 'hardware', 'electronic']



CLUSTER NUMBER: 2
WORDS:  ['familiar', 'understood', 'grasp', 'discuss', 'imagine', 'express', 'explains', 'describe', 'identify', 'understand', 'explaining', 'disagree', 'represent', 'figure', 'tell', 'know', 'define', 'relate', 'learn', 'understanding', 'elaborate', 'explain', 'describing', 'clarify', 'justify', 'see', 'interact', 'knowing', 'agree', 'share', 'find', 'communicate']



CLUSTER NUMBER: 3
WORDS:  ['cities', 'center', 'office', 'square', 'stations', 'state', 'campus', 'county', 'urban', 'station', 'local', 'area', 'department', 'centre', 'district', 'city']



CLUSTER NUMBER: 4
WORDS:  ['blade', 'ult', 'combo', 'cast',
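
The loop above scans the whole dictionary once per cluster. An alternative is to invert the word-to-cluster mapping in a single pass and then read clusters directly. The sketch below shows this with a hypothetical helper, `group_by_cluster`, and a small toy mapping whose entries mirror a few assignments shown in this notebook's output.

```python
from collections import defaultdict


def group_by_cluster(word_centroid_map):
    """Invert a word -> cluster mapping into cluster -> list of words."""
    clusters = defaultdict(list)
    for word, cluster_num in word_centroid_map.items():
        clusters[cluster_num].append(word)
    return clusters


# Toy mapping for illustration only
toy_map = {"gas": 0, "ice": 0, "computer": 1, "laptop": 1, "apple": 103}

clusters = group_by_cluster(toy_map)
for cluster_num in sorted(clusters)[:2]:
    print("CLUSTER NUMBER: %i" % cluster_num)
    print("WORDS: ", clusters[cluster_num])
```

With the inverted mapping, listing the other words in a given word's cluster becomes `clusters[word_centroid_map[word]]`.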

In [14]:
# Find cluster number for a specific word
WORD = "apple"

word_centroid_map[WORD]

103