# Word2Vec Unigram Testing

This Python notebook evaluates the Word2Vec unigram model. It covers the following tasks:

- Find the words most similar to a selected word
- Perform semantic analysis (word analogies)
- Perform syntactic analysis (word-form analogies)
- Find the word that does not belong in a list of words
- Compute the cosine similarity between two words
- Look up the frequency count of a word
- Check whether a word is in the model
- Preview a list of words in the model
- Visualise words in vector space using t-SNE

In [1]:
from gensim.models import Word2Vec as w2v



In [2]:
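# Load the unigram model (binary word2vec format)
# Note: this call works on older gensim releases; newer releases expose the
# same loader as gensim.models.KeyedVectors.load_word2vec_format
FILE = "C:/Users/MyPC/Desktop/FYP/W2V Models/w2v_reddit_unigram_300d.bin"
model = w2v.load_word2vec_format(FILE, binary=True)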

In [3]:
# Find the words most similar to a given word
# Unigram model: pass a single word, e.g. dragon, bleach, tottenham
# Bigram model: join the two words with an underscore, e.g. dragon_ball, barack_obama (requires the bigram model to be loaded)
model.most_similar("neuropsychopharmacology", topn=20)

[('biopsychology', 0.740115225315094),
 ('astrochemistry', 0.7391058206558228),
 ('neuroendocrinologist', 0.7296165227890015),
 ('nanoscience', 0.7265405058860779),
 ('neuropharmacology', 0.7247588038444519),
 ('saltzberg', 0.7157706618309021),
 ('ethnomusicology', 0.7156946659088135),
 ('psychobiology', 0.7154250144958496),
 ('nueroscience', 0.7147186994552612),
 ('neuropsychiatry', 0.7140935659408569),
 ('ichthyology', 0.710540235042572),
 ('molbio', 0.7056220769882202),
 ('oenology', 0.7056138515472412),
 ('antropology', 0.7041956186294556),
 ('biopsych', 0.7037904858589172),
 ('neuroengineering', 0.7037561535835266),
 ('nanoengineering', 0.7024978995323181),
 ('psycholinguistics', 0.7002543210983276),
 ('bioanthropology', 0.6995773315429688),
 ('christmann', 0.698868989944458)]

In [4]:
# Semantic evaluation (e.g. king - man + woman is approximately equal to queen)
# Query below: tokyo - japan + malaysia
model.most_similar(positive=["tokyo","malaysia"], negative=["japan"])

[('lumpur', 0.6737101674079895),
 ('kuala', 0.6668090224266052),
 ('taipei', 0.6401477456092834),
 ('bangkok', 0.6113026142120361),
 ('penang', 0.5809809565544128),
 ('lampur', 0.5752942562103271),
 ('toyko', 0.5550657510757446),
 ('selangor', 0.5511509776115417),
 ('singapore', 0.5502724647521973),
 ('mumbai', 0.5481346249580383)]
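
The analogy query above can be pictured as vector arithmetic followed by a cosine ranking. The following is a minimal sketch of that idea in plain Python, not gensim's actual code: the helper names `unit` and `analogy`, the 4-dimensional toy vectors, and the single token `kuala_lumpur` are all hypothetical and chosen only so the arithmetic is easy to follow.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def analogy(vectors, positive, negative, topn=1):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    query = unit(query)
    # The query words themselves are excluded from the ranking
    exclude = set(positive) | set(negative)
    scored = [
        (word, sum(q * x for q, x in zip(query, unit(vec))))
        for word, vec in vectors.items()
        if word not in exclude
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]

# Hypothetical toy embeddings (NOT taken from the model):
# dimensions 0-1 encode the region, dimensions 2-3 encode country vs. capital
toy_vectors = {
    "japan":        [1.0, 0.0, 1.0, 0.0],
    "tokyo":        [1.0, 0.0, 0.0, 1.0],
    "malaysia":     [0.0, 1.0, 1.0, 0.0],
    "kuala_lumpur": [0.0, 1.0, 0.0, 1.0],
    "apple":        [1.0, 1.0, 1.0, 1.0],
}

best = analogy(toy_vectors, positive=["tokyo", "malaysia"], negative=["japan"])
print(best)
```

In this toy space, subtracting `japan` from `tokyo` leaves a "capital of" direction, and adding `malaysia` moves it to the other region, so the top-ranked word is the hypothetical `kuala_lumpur` token.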

In [5]:
# Syntactic evaluation (e.g. walking - walk + swim is approximately equal to swimming)
# Query below: greenish - green + blue
model.most_similar(positive=["greenish","blue"], negative=["green"])

[('blueish', 0.7298511266708374),
 ('greyish', 0.7232707738876343),
 ('bluish', 0.7149738669395447),
 ('pinkish', 0.705883264541626),
 ('purplish', 0.7028074264526367),
 ('brownish', 0.6946163773536682),
 ('grayish', 0.6922476887702942),
 ('reddish', 0.6911346316337585),
 ('yellowish', 0.6770833134651184),
 ('whitish', 0.6669460535049438)]

In [6]:
# Cell to check which word doesn't match among a group of words
model.doesnt_match("blue green yellow apple".split())

'apple'
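
One common way to find the odd word out is to average the unit-length vectors of all the words and return the word least similar to that average. The sketch below illustrates this with hypothetical 2-dimensional toy vectors; the helper names `unit` and `odd_one_out` are made up for this example and it is not the library's implementation.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def odd_one_out(vectors, words):
    """Return the word whose vector is least similar to the group mean."""
    units = {word: unit(vectors[word]) for word in words}
    dim = len(units[words[0]])
    # Mean of the unit vectors, rescaled to length 1
    mean = unit([sum(units[w][i] for w in words) / len(words) for i in range(dim)])
    # Cosine similarity to the mean is a dot product of unit vectors
    return min(words, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))

# Hypothetical toy embeddings (NOT taken from the model):
# the first dimension is "colour-like", the second is "fruit-like"
toy_vectors = {
    "blue":   [1.0, 0.1],
    "green":  [0.9, 0.2],
    "yellow": [0.8, 0.3],
    "apple":  [0.1, 1.0],
}

print(odd_one_out(toy_vectors, "blue green yellow apple".split()))
```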

In [7]:
# Cell to compute the cosine similarity between two words
model.similarity("titanic","rose")

0.24046405589195533
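
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The snippet below is a small plain-Python sketch of that formula; `cosine_similarity` and the 3-dimensional toy vectors are hypothetical and are not taken from the model.

```python
import math

def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors (NOT taken from the model)
toy_a = [1.0, 2.0, 0.0]
toy_b = [2.0, 4.0, 0.0]  # points in the same direction as toy_a
toy_c = [0.0, 0.0, 5.0]  # orthogonal to toy_a

print(cosine_similarity(toy_a, toy_b))  # close to 1: same direction
print(cosine_similarity(toy_a, toy_c))  # 0: no shared direction
```

Values near 1 indicate vectors pointing in almost the same direction, values near 0 indicate unrelated directions, so the score of roughly 0.24 above suggests only a weak association between the two words.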

In [8]:
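# Look up how many times a specific word occurred in the 2015 dataset
# The value is stored in the `count` attribute of the vocabulary entry (only its type is shown here)
# Note: when vectors are loaded via load_word2vec_format without a vocabulary file,
# `count` may be a rank-based placeholder rather than the true frequency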
word = model.vocab['difu']
type(word.count)

int

In [9]:
# Check whether a word (unigram) is in the model; the lookup is case-sensitive
'Dragon' in model

False
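
Because the lookup is case-sensitive, a query can be normalised before it is checked. The sketch below assumes the vocabulary is stored in lowercase, which the outputs in this notebook suggest but do not prove; `in_vocabulary` and `toy_vocab` are hypothetical stand-ins for the real model.

```python
def in_vocabulary(vocab, word):
    """Return True if the lowercased word is present in vocab."""
    return word.lower() in vocab

# Hypothetical stand-in for the model's vocabulary (assumed to be all lowercase)
toy_vocab = {"dragon", "bleach", "tottenham"}

print("Dragon" in toy_vocab)               # direct, case-sensitive lookup
print(in_vocabulary(toy_vocab, "Dragon"))  # lookup after lowercasing
```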

In [10]:
# A brief preview of the words in the model (index, word, stored count)
count = 70

for index, word in enumerate(model.vocab):
    # Stop once `count` words have been printed
    if index >= count:
        break
    print(index, word, model.vocab[word].count)

0 cophine 874562
1 beschrijving 816928
2 weggenomen 190914
3 tableau 1103123
4 helliers 162212
5 aberrent 363674
6 sejuanis 952647
7 witan 348114
8 ersonally 871572
9 widefield 1008074
10 depicitions 399130
11 luffi 418292
12 cermic 50259
13 volcanologists 876439
14 gotgames 423052
15 community 1146004
16 troggz 606961
17 vrup 794162
18 sndwav 882798
19 gooddamnit 84074
20 piccoult 785911
21 canl 233695
22 heimatort 449817
23 muslimist 328616
24 ecough 796192
25 staphh 591401
26 bispebjerg 213167
27 imapala 793197
28 speechtotext 656819
29 sewiously 243058
30 neefew 508196
31 krasnogorsk 250809
32 climatic 1110428
33 localidad 478787
34 sobriquet 979303
35 misers 1011735
36 cpps 785523
37 arumat 368935
38 sheebs 703902
39 bedroomed 783759
40 snackrifice 726508
41 marpenter 377957
42 boolits 1007834
43 mennilli 76433
44 wenceslao 236664
45 burglurized 17581
46 garrotxa 688960
47 violurilor 370933
48 fweinds 118393
49 trustyworthy 406536
50 ventillators 94833
51 ardel 654602
52 dodgerh 7

In [11]:
# Visualisation (Normal) using TSNE
# Motivation: http://lvdmaaten.github.io/tsne/
# Video: https://www.youtube.com/watch?v=RJVL80Gg3lA

# First, import the libraries
from sklearn.manifold import TSNE

import seaborn as sns
import matplotlib.pyplot as plt
import mpld3

sns.set_style("whitegrid")

%matplotlib inline
mpld3.enable_notebook()

In [12]:
# Function that returns a random sample of word embeddings and the matching words
import random

def getEmbeddings(word_count):

    # Number of words in the model (one row of syn0 per word)
    model_vocab_length = len(model.syn0)

    # Sample word_count distinct row indices at random
    unique_words_index = random.sample(range(0, model_vocab_length), word_count)

    # Lists to store the feature vectors and their words
    vectors = []
    words = []

    # Collect each sampled vector together with its word
    for index in unique_words_index:

        vectors.append(model.syn0[index])
        words.append(model.index2word[index])

    return vectors, words

In [17]:
# Display the graph in this cell (RANDOM WORDS)

# Get the feature vectors and respective words
wv, vocabulary = getEmbeddings(300)

# Initialize TSNE model
tsne = TSNE(n_components=2, random_state=0)

# Fit on the 300 randomly sampled words
Y = tsne.fit_transform(wv)

# Scatter points
fig, ax = plt.subplots(figsize=(10, 10))

# Draw invisible markers so that only the word labels are visible
ax.scatter(Y[:, 0], Y[:, 1], facecolors='none', edgecolors='none')

# Annotate each point with its word
for label, x, y in zip(vocabulary, Y[:, 0], Y[:, 1]):
    ax.annotate(label, xy=(x, y), fontsize=15)

# Display
mpld3.display(fig)