##Word2Vec Demo##
From https://github.com/nltk/nltk/blob/develop/nltk/test/gensim.doctest


In [1]:
# To get gensim, go to https://radimrehurek.com/gensim/
# OR run this on your command line: pip install -U gensim

import nltk
import numpy as np
import gensim
from gensim.models import Word2Vec
from nltk.data import find


In [22]:
# To get the model file needed, do the following one time only:
# run the download command below; in the UI that pops up, switch to the Models tab and download the word2vec_sample model.
# nltk.download()

In [34]:

word2vec_sample = str(find('models/word2vec_sample/pruned.word2vec.txt'))
# Note: in newer gensim releases this loader is gensim.models.KeyedVectors.load_word2vec_format
model = gensim.models.Word2Vec.load_word2vec_format(word2vec_sample, binary=False)

We pruned the model to only include the most common words (~44k words).


In [35]:
len(model.vocab)

43981

Each word is represented as a 300-dimensional vector:


In [11]:
len(model['university'])

300

Finding the top n words most similar to a target word is simple. The result is a list of n (word, similarity score) pairs, sorted from most to least similar.


In [13]:
model.most_similar(positive=['university'], topn = 10)

[('universities', 0.7003918290138245),
 ('faculty', 0.6780907511711121),
 ('undergraduate', 0.6587095260620117),
 ('campus', 0.6434987783432007),
 ('college', 0.6385269165039062),
 ('academic', 0.6317198276519775),
 ('professors', 0.6298646926879883),
 ('undergraduates', 0.6149813532829285),
 ('University', 0.6139305233955383),
 ('student', 0.6005401611328125)]
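
To make the ranking above more concrete, here is a minimal sketch of how a top-n similarity query can be computed with cosine similarity. It is an illustration only, not gensim's actual implementation: the helper names `cosine` and `top_n_similar` and the 3-dimensional `toy_vectors` are hypothetical and do not come from the real model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_n_similar(vectors, target, topn=10):
    """Rank every word except `target` by cosine similarity to it."""
    scores = [(word, cosine(vectors[target], vec))
              for word, vec in vectors.items() if word != target]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:topn]

# Hypothetical toy vectors (assumption: made up for illustration).
toy_vectors = {
    'university': [0.9, 0.8, 0.1],
    'college':    [0.8, 0.9, 0.2],
    'campus':     [0.7, 0.5, 0.3],
    'banana':     [0.1, 0.0, 0.9],
}

ranked = top_n_similar(toy_vectors, 'university', topn=2)
print(ranked)
```

With these toy vectors, 'college' and 'campus' rank above 'banana' because their vectors point in nearly the same direction as 'university'.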

The API also supports finding the word that does not belong in a list.


In [14]:
model.doesnt_match('breakfast cereal dinner lunch'.split())

'cereal'
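
One plausible way to find the odd word out is to average the unit-length vectors of all the words and return the word least similar to that average. The sketch below follows that idea under stated assumptions: the helper names `unit` and `odd_one_out` and the 2-dimensional toy vectors are hypothetical, and gensim's own implementation may differ in its details.

```python
import math

def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def odd_one_out(vectors, words):
    """Return the word whose vector is least similar to the group mean."""
    unit_vecs = [unit(vectors[w]) for w in words]
    dim = len(unit_vecs[0])
    # Mean direction of the group, itself rescaled to unit length.
    mean = unit([sum(vec[i] for vec in unit_vecs) / len(unit_vecs)
                 for i in range(dim)])
    # For unit vectors, the dot product equals the cosine similarity.
    sims = [sum(a * b for a, b in zip(vec, mean)) for vec in unit_vecs]
    return words[sims.index(min(sims))]

# Hypothetical toy vectors (assumption: made up for illustration).
toy_vectors = {
    'breakfast': [0.90, 0.10],
    'lunch':     [0.80, 0.20],
    'dinner':    [0.85, 0.15],
    'cereal':    [0.20, 0.90],
}

outlier = odd_one_out(toy_vectors, 'breakfast cereal dinner lunch'.split())
print(outlier)
```

The three meal words dominate the mean direction here, so 'cereal' ends up furthest from it.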

Mikolov et al. (2013) showed that word embeddings capture many syntactic and semantic regularities. In their famous example,
the vector 'King - Man + Woman' is close to 'Queen', and 'Germany - Berlin + Paris' is close to 'France'.

In [15]:
model.most_similar(positive=['woman','king'], negative=['man'], topn = 1)

[('queen', 0.7118192911148071)]

In [26]:
model.most_similar(positive=['woman','king'], negative=['man', 'boy'], topn = 10)

[('queen', 0.42707839608192444),
 ('monarch', 0.3571931719779968),
 ('thrones', 0.32843494415283203),
 ('queens', 0.3282967805862427),
 ('kingdom', 0.3216303586959839),
 ('courtiers', 0.3147016763687134),
 ('throne', 0.3120730519294739),
 ('royal', 0.29855966567993164),
 ('kings', 0.29191964864730835),
 ('Camilla', 0.27846354246139526)]

In [50]:
model.most_similar(positive=['Paris','Germany'], negative=['Berlin'], topn = 1)

[('France', 0.7884091138839722)]
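
The analogy queries above can be read as vector arithmetic followed by a nearest-neighbour search. As a rough sketch of that idea (not gensim's actual code), the example below adds the unit vectors of the positive words, subtracts those of the negative words, and ranks the remaining words by cosine similarity to the result. The helper names `unit` and `analogy` and the 3-dimensional toy vectors are hypothetical and chosen only for illustration.

```python
import math

def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def analogy(vectors, positive, negative, topn=1):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word, sign in [(w, 1.0) for w in positive] + [(w, -1.0) for w in negative]:
        for i, value in enumerate(unit(vectors[word])):
            query[i] += sign * value
    query = unit(query)
    # Skip the query words themselves so they cannot be returned.
    excluded = set(positive) | set(negative)
    scores = [(word, sum(a * b for a, b in zip(unit(vec), query)))
              for word, vec in vectors.items() if word not in excluded]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:topn]

# Hypothetical toy vectors (assumption: made up for illustration).
# The axes are loosely (royalty, male, female).
toy_vectors = {
    'king':   [1.0, 1.0, 0.0],
    'queen':  [1.0, 0.0, 1.0],
    'man':    [0.0, 1.0, 0.0],
    'woman':  [0.0, 0.0, 1.0],
    'prince': [0.9, 0.9, 0.0],
    'apple':  [0.5, 0.5, 0.5],
}

result = analogy(toy_vectors, positive=['woman', 'king'], negative=['man'], topn=1)
print(result)
```

In this toy space, subtracting 'man' removes the male component of 'king' and adding 'woman' supplies the female one, so the nearest remaining word is 'queen'.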

In [49]:
model.most_similar(positive=['president', 'university'], topn=30)

[('chancellor', 0.6200418472290039),
 ('dean', 0.6120452880859375),
 ('President', 0.591903805732727),
 ('faculty', 0.5726973414421082),
 ('rector', 0.5606599450111389),
 ('presidents', 0.5546602606773376),
 ('Provost', 0.5418164730072021),
 ('regents', 0.5399488210678101),
 ('professors', 0.5367733240127563),
 ('universities', 0.5157524347305298),
 ('campus', 0.5094808340072632),
 ('student', 0.5033937692642212),
 ('academic', 0.5031865835189819),
 ('institute', 0.5005171895027161),
 ('undergraduate', 0.48198601603507996),
 ('Professors', 0.47340402007102966),
 ('professor', 0.47276201844215393),
 ('Faculty', 0.47209471464157104),
 ('chairman', 0.4699815511703491),
 ('professorship', 0.467648446559906),
 ('presidency', 0.46344324946403503),
 ('University', 0.45916348695755005),
 ('campuses', 0.45756882429122925),
 ('college', 0.45753854513168335),
 ('trustees', 0.45137834548950195),
 ('Chancellor', 0.4487611949443817),
 ('undergraduates', 0.4440937042236328),
 ('institution', 0.437450

You can also train your own models. Here is an example using NLTK corpora; it is an exercise in seeing how different corpora yield different results.

In [44]:
from nltk.corpus import brown
brown_model = gensim.models.Word2Vec(brown.sents())

# Training the model might take some time, so once it is trained, it can be saved and reloaded as follows:

brown_model.save('brown.embedding')
new_model = gensim.models.Word2Vec.load('brown.embedding')

In [45]:
brown_model.most_similar('president')

[('Hengesbach', 0.9018814563751221),
 ('Corp.', 0.89886873960495),
 ('superintendent', 0.8922324180603027),
 ('Cardinals', 0.8915742635726929),
 ('Larson', 0.8907597661018372),
 ('Monte', 0.8878529071807861),
 ('Railroad', 0.884084939956665),
 ('dean', 0.8820680975914001),
 ('Football', 0.8818255662918091),
 ('upheld', 0.8803490400314331)]