# Word2Vec Unigram Testing

This Python notebook evaluates the Word2Vec unigram model. It is organized as follows:

- Find the words most similar to a selected word
- Perform semantic analysis
- Perform syntactic analysis
- Find the word that does not belong in a list of words
- Find the cosine similarity between two words
- Find the frequency count of a word
- Check if a word is in the model
- Preview a list of words in the model
- Others (vector space size)

In [1]:
from gensim.models import Word2Vec as w2v



In [2]:
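# Load the unigram model (word2vec binary format)
# Note: this is the pre-1.0 gensim API; newer gensim releases provide the same
# loader as gensim.models.KeyedVectors.load_word2vec_format
FILE = "C:/Users/MyPC/Desktop/FYP/W2V Models/w2v_reddit_unigram_300d.bin"
model = w2v.load_word2vec_format(FILE, binary=True)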

In [3]:
# Find the words most similar to a query word
# Unigram model: use a single token, e.g. dragon, bleach, tottenham
# Bigram model: join the two words with an underscore, e.g. dragon_ball, barack_obama
# (requires the bigram model to be loaded instead)
model.most_similar("neuropsychopharmacology", topn=20)

[('biopsychology', 0.740115225315094),
 ('astrochemistry', 0.7391058206558228),
 ('neuroendocrinologist', 0.7296165227890015),
 ('nanoscience', 0.7265405058860779),
 ('neuropharmacology', 0.7247588038444519),
 ('saltzberg', 0.7157706618309021),
 ('ethnomusicology', 0.7156946659088135),
 ('psychobiology', 0.7154250144958496),
 ('nueroscience', 0.7147186994552612),
 ('neuropsychiatry', 0.7140935659408569),
 ('ichthyology', 0.710540235042572),
 ('molbio', 0.7056220769882202),
 ('oenology', 0.7056138515472412),
 ('antropology', 0.7041956186294556),
 ('biopsych', 0.7037904858589172),
 ('neuroengineering', 0.7037561535835266),
 ('nanoengineering', 0.7024978995323181),
 ('psycholinguistics', 0.7002543210983276),
 ('bioanthropology', 0.6995773315429688),
 ('christmann', 0.698868989944458)]

In [4]:
# Semantic evaluation (e.g. king - man + woman is approximately equal to queen)
# Here: tokyo - japan + malaysia, which should land near Malaysia's capital
model.most_similar(positive=["tokyo","malaysia"], negative=["japan"])

[('lumpur', 0.6737101674079895),
 ('kuala', 0.6668090224266052),
 ('taipei', 0.6401477456092834),
 ('bangkok', 0.6113026142120361),
 ('penang', 0.5809809565544128),
 ('lampur', 0.5752942562103271),
 ('toyko', 0.5550657510757446),
 ('selangor', 0.5511509776115417),
 ('singapore', 0.5502724647521973),
 ('mumbai', 0.5481346249580383)]
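
The analogy query above can be read as vector arithmetic followed by a nearest-neighbour search. The following is a minimal sketch of that idea in plain Python, assuming unit-normalised vectors and cosine ranking. The function names (`unit`, `analogy`) and the 3-dimensional `toy_vectors` are hypothetical and only illustrate the mechanism; they are not values from the model, and gensim's own implementation may differ in detail.

```python
import math

def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def analogy(vectors, positive, negative, topn=3):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    query = unit(query)

    # The query words themselves are excluded from the candidates
    inputs = set(positive) | set(negative)
    scores = [
        (word, sum(q * x for q, x in zip(query, unit(vec))))
        for word, vec in vectors.items()
        if word not in inputs
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:topn]

# Hypothetical toy vectors; axes loosely mean (japan-ness, capital-ness, malaysia-ness)
toy_vectors = {
    "japan":    [1.0, 0.0, 0.0],
    "tokyo":    [1.0, 1.0, 0.0],
    "malaysia": [0.0, 0.0, 1.0],
    "lumpur":   [0.0, 1.0, 1.0],
    "bangkok":  [0.0, 1.0, 0.3],
    "apple":    [-1.0, 0.0, 0.0],
}

for word, score in analogy(toy_vectors, ["tokyo", "malaysia"], ["japan"]):
    print(word, round(score, 3))
```

In this toy space the top-ranked word is `lumpur`, mirroring the real output above, where other Asian capitals such as `bangkok` also score highly because they share the "capital" direction.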

In [5]:
# Syntactic evaluation (e.g. walking - walk + swim is approximately equal to swimming)
# Here: greenish - green + blue, which should land near "bluish"
model.most_similar(positive=["greenish","blue"], negative=["green"])

[('blueish', 0.7298511266708374),
 ('greyish', 0.7232707738876343),
 ('bluish', 0.7149738669395447),
 ('pinkish', 0.705883264541626),
 ('purplish', 0.7028074264526367),
 ('brownish', 0.6946163773536682),
 ('grayish', 0.6922476887702942),
 ('reddish', 0.6911346316337585),
 ('yellowish', 0.6770833134651184),
 ('whitish', 0.6669460535049438)]

In [6]:
# Cell to check which word doesn't match among a group of words
model.doesnt_match("blue green yellow apple".split())

'apple'
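
One common way to pick the odd word out is to average the unit-normalised vectors of all the words and return the word least similar to that average. The sketch below assumes this approach; `odd_one_out` and the 2-dimensional `toy_vectors` are hypothetical names and values for illustration, and whether gensim computes it in exactly this way should be treated as an assumption.

```python
import math

def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def odd_one_out(words, vectors):
    """Return the word whose vector is least similar to the group mean."""
    normed = {word: unit(vectors[word]) for word in words}
    dim = len(next(iter(normed.values())))
    mean = unit([
        sum(vec[i] for vec in normed.values()) / len(normed)
        for i in range(dim)
    ])
    # Lowest dot product with the mean direction = worst fit
    return min(words, key=lambda w: sum(a * b for a, b in zip(normed[w], mean)))

# Hypothetical toy vectors: three colours point one way, "apple" another
toy_vectors = {
    "blue":   [1.0, 0.1],
    "green":  [0.9, 0.2],
    "yellow": [0.8, 0.3],
    "apple":  [0.1, 1.0],
}

print(odd_one_out("blue green yellow apple".split(), toy_vectors))
```

Because the outlier also contributes to the mean, this method works best when most of the words form a coherent group, as in the colour example above.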

In [7]:
# Cosine similarity between two words
model.similarity("titanic","rose")

0.24046405589195533
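
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths, so it ranges from -1 (opposite directions) to 1 (same direction). A minimal sketch in plain Python, using hypothetical 3-dimensional vectors rather than the model's actual 300-dimensional ones:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical vectors, not taken from the model
vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 4.0, 0.0]  # same direction as vec_a, different length
vec_c = [0.0, 0.0, 5.0]  # orthogonal to vec_a

print(cosine_similarity(vec_a, vec_b))
print(cosine_similarity(vec_a, vec_c))
```

A score of about 0.24 for "titanic" and "rose" therefore indicates only a weak association between the two words in this model.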

In [54]:
# Look up how many times a specific word occurred in the 2015 dataset.
# The count is stored on the word's vocabulary entry; this cell only checks its type.
word = model.vocab['difu']
type(word.count)

int

In [9]:
# Check whether a word (unigram) is in the model. The lookup is case-sensitive.
'Dragon' in model

False
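
Since the lookup is case-sensitive, a query such as 'Dragon' fails even when a lowercase form exists. The sketch below shows one way to fall back to the lowercase form before giving up. It assumes the vocabulary was built from lowercased text, which the word preview suggests but the notebook does not state. `find_in_vocab` and `toy_vocab` are hypothetical; with the real model, `vocab` could be the model object itself, since it supports the `in` operator as shown above.

```python
def find_in_vocab(word, vocab):
    """Return the form of `word` present in `vocab`, or None if absent."""
    if word in vocab:
        return word
    lowered = word.lower()
    return lowered if lowered in vocab else None

# Hypothetical stand-in for the model's vocabulary
toy_vocab = {"dragon", "bleach", "tottenham"}

print("Dragon" in toy_vocab)
print(find_in_vocab("Dragon", toy_vocab))
print(find_in_vocab("Pikachu", toy_vocab))
```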

In [69]:
# Preview the first few words in the model's vocabulary
# (the loop prints indices 0 through count, i.e. count + 1 words)
count = 70

for index, word in enumerate(model.vocab):
    print(index, word)
    if index == count:
        break

0 durantula
1 difu
2 mackinack
3 sview
4 illid
5 volatage
6 kangarooland
7 factless
8 respondan
9 wolfphram
10 siedrah
11 furbutt
12 svbtle
13 mouskeys
14 divines
15 flawlessfakeids
16 treeborn
17 conmercial
18 jagaloon
19 berusaha
20 godmen
21 errico
22 auncle
23 civitates
24 nintendan
25 limnologist
26 sterilizes
27 adamvideo
28 guaifenesin
29 qzcz
30 ixnay
31 laurinatis
32 allora
33 publisize
34 konspirazytheory
35 mol
36 unproportionately
37 audiobookbay
38 tarutaru
39 autocombos
40 crimebusters
41 fauhawks
42 franjo
43 makovsky
44 kirkle
45 bater
46 leavins
47 vilely
48 manina
49 jemmie
50 ruhkmar
51 musiczlife
52 preciptated
53 grillmaster
54 animate
55 saluzzo
56 afuckin
57 dimentions
58 recharger
59 ygfny
60 garnagle
61 brickoftr
62 knotting
63 weaned
64 zhills
65 metalls
66 omanomanoman
67 footbar
68 accentual
69 baesball
70 basetypes
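
The preview loop above stops with an explicit `break`. An alternative sketch uses `itertools.islice`, which takes exactly `n` items from any iterable without building the full list first. `preview` and `toy_vocab` are hypothetical; with the real model, the vocabulary mapping would be passed in place of `toy_vocab`.

```python
from itertools import islice

def preview(words, n):
    """Return the first n items of an iterable as a list."""
    return list(islice(words, n))

# Hypothetical stand-in for the model's vocabulary mapping
# (dicts preserve insertion order in Python 3.7+)
toy_vocab = {"durantula": 0, "difu": 1, "mackinack": 2, "sview": 3}

for index, word in enumerate(preview(toy_vocab, 3)):
    print(index, word)
```

Unlike the loop with `if index == count: break`, this returns exactly `n` words rather than `n + 1`.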


In [72]:
# Looking under the hood of Word2vec

# Feature vector of the word at index 0
#print(model.syn0[0])

# Shape of the embedding matrix: (vocabulary size, vector dimensions)
#print(model.syn0.shape)

# Words at index positions 2000 to 2999
#model.index2word[2000:3000]

1143362