# Intro and setup

This notebook demonstrates the use of word2vec in Python using the [gensim library](https://github.com/RaRe-Technologies/gensim). Information is available on the [gensim website](https://radimrehurek.com/gensim/index.html), along with tutorials and the [API reference](https://radimrehurek.com/gensim/apiref.html).

You can install gensim on your local machine with:
```
pip install --upgrade gensim
```
If you are using a Jupyter notebook, a successful install does not guarantee that the notebook can find the library. Make sure the list in `sys.path` includes the directory where `pip install` placed the packages. Because `sys.path` includes the current working directory, I use pip's `-t` flag to install the libraries into a directory that I then make the current working directory. You could also append that directory to `sys.path`.
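As a minimal sketch of the check described above, the snippet below prints the module search path and appends a target directory if it is missing. The directory name `libs` is a hypothetical placeholder for wherever `pip install -t` put the packages; it is not a path used in this notebook.

```python
import os
import sys

# Hypothetical target directory, e.g. one filled by `pip install -t libs gensim`
target_dir = os.path.abspath("libs")

# Show where Python looks for packages, in search order
for location in sys.path:
    print(location)

# Add the target directory only if it is not already on the search path
if target_dir not in sys.path:
    sys.path.append(target_dir)
```

Appending to `sys.path` only affects the running kernel, so it has to be repeated each time the notebook is restarted.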

## Background

Word2Vec was first presented in the paper: Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). *Efficient estimation of word representations in vector space*. arXiv preprint arXiv:1301.3781.  Extensions were presented in the paper: Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). *Distributed representations of words and phrases and their compositionality*. In Advances in neural information processing systems (pp. 3111-3119).

The original C code was published on Google Code and was subsequently moved to https://github.com/tmikolov/word2vec.

In [1]:
# turn off pretty printing to get horizontal display - optional, but I'm saving space for display
%pprint

Pretty printing has been turned OFF


In [2]:
import os

Commands like `os.getcwd()` and `os.chdir()` can help you navigate to the proper location.  

In [3]:
# gensim library also requires numpy, scipy, requests, docutils
os.listdir()

['word2vec.ipynb', '__pycache__', 'python_dateutil-2.7.3.dist-info', 'numpy', 'requests-2.19.1.dist-info', 'dateutil', 'idna', 'scipy-1.1.0.dist-info', 'jmespath-0.9.3.dist-info', 'boto', 'botocore', 'gensim-3.6.0.dist-info', 'file', 'idna-2.7.dist-info', 'jmespath', 'certifi-2018.8.24.dist-info', 'bz2file-0.98-py3.6.egg-info', 'chardet-3.0.4.dist-info', 'docutils-0.14.dist-info', 'bin', 'savedModelText8', 'urllib3-1.23.dist-info', 'smart_open', 'chardet', 'Tutorials.html', 'six-1.11.0.dist-info', 'requests', 'gensim', 'docutils', 'botocore-1.12.18.dist-info', 'boto3-1.9.18.dist-info', '.ipynb_checkpoints', 's3transfer-0.1.13.dist-info', 'bz2file.py', 'numpy-1.15.2.dist-info', '__init__.py', 'six.py', 'certifi', 'boto-2.49.0.dist-info', 'smart_open-1.7.1-py3.6.egg-info', 'scipy', 'boto3', 's3transfer', 'word2vecTalk.ipynb', 'urllib3']

# Data

## Import data
There are many possible data sources. Gensim includes specific loaders for the [Brown corpus](https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.BrownCorpus) and Matt Mahoney's [text8 corpus](https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Text8Corpus). The Brown corpus is available in the widely used [Natural Language ToolKit](https://www.nltk.org/). Text8 is a [cleaned](http://mattmahoney.net/dc/textdata.html) 100 MB selection of Wikipedia text: HTML tags and tables are removed, numbers are spelled out, etc. It can be downloaded at http://mattmahoney.net/dc/text8.zip.

In [4]:
with open('/home/milton/data/wikidump/text8.txt', 'r') as f:
    raw_data = f.read()

In [5]:
type(raw_data)

<class 'str'>

In [6]:
# yah, it is 100 MB
len(raw_data)

100000000

In [7]:
# what does it look like?
raw_data[:1000]

' anarchism originated as a term of abuse first used against early working class radicals including the diggers of the english revolution and the sans culottes of the french revolution whilst the term is still used in a pejorative way to describe any act that used violent means to destroy the organization of society it has also been taken up as a positive label by self defined anarchists the word anarchism is derived from the greek without archons ruler chief king anarchism as a political philosophy is the belief that rulers are unnecessary and should be abolished although there are differing interpretations of what this means anarchism also refers to related social movements that advocate the elimination of authoritarian institutions particularly the state the word anarchy as most anarchists use it does not imply chaos nihilism or anomie but rather a harmonious anti authoritarian society in place of what are regarded as authoritarian political structures and coercive economic institut

In [8]:
words = raw_data.split()

In [9]:
len(words)

17005207

In [10]:
# how many unique words?
len(set(words))

253854

In [11]:
# the kinds of words you might expect from a Wikipedia corpus
list(set(words))[:100]

['tyrranion', 'overbending', 'padshah', 'seutter', 'xaverian', 'pharmacia', 'arplaninac', 'marque', 'homoplasy', 'paraneoplastic', 'ilocano', 'castagnoli', 'yamas', 'hexahedrites', 'aharonov', 'naiditsch', 'upn', 'choo', 'hyland', 'patna', 'harve', 'openable', 'elevate', 'anglophiliac', 'agraphia', 'bank', 'lawmeme', 'playback', 'prado', 'demodulator', 'fortean', 'makkedah', 'catv', 'engages', 'personam', 'diatessaron', 'modwenna', 'nowshak', 'hyperarid', 'khanjan', 'sram', 'erikson', 'khalav', 'infiltrated', 'hama', 'reliving', 'percepts', 'patriae', 'ectotherm', 'rafto', 'subtilatum', 'carpool', 'windle', 'salvific', 'dunk', 'biorobotics', 'dakha', 'rozemond', 'dephasing', 'unexposed', 'qus', 'bontade', 'roddy', 'nathuram', 'metacognition', 'qa', 'overcast', 'codimension', 'facesez', 'polyadenylyl', 'paracelsus', 'mpps', 'gordimer', 'rosie', 'greatly', 'sterilising', 'subbookkeeper', 'spels', 'blackwells', 'crist', 'caked', 'claridade', 'yamin', 'eschenburg', 'sherpa', 'hollyhock', '

# Gensim

In [12]:
import gensim

In [13]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

## Setup data
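`gensim.models.Word2Vec` takes a corpus broken into sentences, where each sentence is a list of word tokens. I'm using the `Text8Corpus` iterator from the `gensim.models.word2vec` module. Text8 has no sentence boundaries, so `Text8Corpus` yields fixed-length chunks of words as its "sentences" (1701 of them in the training log below). You can use any other data as long as you create an iterable that yields sentences.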
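As a rough sketch of such an iterable, the class below yields one list of tokens per line of a plain-text file. The class name `LineSentences`, the lowercase-and-split tokenization, and the demo file are assumptions made for illustration; gensim ships its own `LineSentence` helper in `gensim.models.word2vec` for the same purpose.

```python
import os
import tempfile


class LineSentences:
    """Yield one tokenized sentence (a list of str) per line of a text file."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Re-open the file on every pass so the corpus can be iterated repeatedly
        with open(self.path, "r", encoding="utf-8") as f:
            for line in f:
                tokens = line.lower().split()
                if tokens:  # skip blank lines
                    yield tokens


# Tiny demonstration corpus written to a temporary file
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("The quick brown fox\n\njumps over the lazy dog\n")
    demo_path = tmp.name

demo_sentences = LineSentences(demo_path)
first_pass = list(demo_sentences)
second_pass = list(demo_sentences)  # a second pass yields the same sentences
os.remove(demo_path)
```

A class with `__iter__` is used rather than a bare generator because training scans the corpus more than once (a vocabulary pass plus the training epochs), and a generator would be exhausted after the first pass.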

In [14]:
sentences = gensim.models.word2vec.Text8Corpus('/home/milton/data/wikidump/text8.txt')

## Run model

In [15]:
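# train the model: 100-dimensional vectors, context window of 6,
# drop words seen fewer than 5 times, use 3 worker threads
# (gensim 3.x argument names; gensim 4.x renamed `size` to `vector_size`)
model = gensim.models.Word2Vec(sentences, size=100, window=6, min_count=5, workers=3)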

2018-11-02 10:02:12,052 : INFO : collecting all words and their counts
2018-11-02 10:02:12,056 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2018-11-02 10:02:17,292 : INFO : collected 253854 word types from a corpus of 17005207 raw words and 1701 sentences
2018-11-02 10:02:17,292 : INFO : Loading a fresh vocabulary
2018-11-02 10:02:17,670 : INFO : effective_min_count=5 retains 71290 unique words (28% of original 253854, drops 182564)
2018-11-02 10:02:17,670 : INFO : effective_min_count=5 leaves 16718844 word corpus (98% of original 17005207, drops 286363)
2018-11-02 10:02:17,829 : INFO : deleting the raw counts dictionary of 253854 items
2018-11-02 10:02:17,837 : INFO : sample=0.001 downsamples 38 most-common words
2018-11-02 10:02:17,838 : INFO : downsampling leaves estimated 12506280 word corpus (74.8% of prior 16718844)
2018-11-02 10:02:18,046 : INFO : estimated required memory for 71290 words and 100 dimensions: 92677000 bytes
2018-11-02 10:02:18,046 : 
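
The vocabulary pruning reported in the log above (`effective_min_count=5 retains ...`) can be approximated with a simple frequency count. This is only a sketch on an invented toy corpus with a lower threshold; it is not gensim's implementation, and the variable names are made up for illustration.

```python
from collections import Counter

# Toy corpus standing in for the tokenized text8 words
toy_words = "a a a b b c a b d a".split()
min_count = 2  # the notebook uses min_count=5

counts = Counter(toy_words)
# Keep only word types that occur at least min_count times
kept = {word: n for word, n in counts.items() if n >= min_count}
# Fraction of running words that survive the pruning
retained_fraction = sum(kept.values()) / len(toy_words)
# Most frequent first, similar in spirit to a frequency-sorted vocabulary
vocab_by_freq = [word for word, _ in counts.most_common() if word in kept]
```

As in the log, pruning removes many rare word types while keeping most of the running text, because rare words account for few occurrences.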

Note that because network training is stochastic, the learned weights differ from run to run. Each trained model will therefore yield somewhat different results in the [Distance from mean](#distance-from-mean) and [Similarity](#similarity) sections below.

In [16]:
# save the full model to disk in gensim's native (pickled) format
model.save('savedModelText8')

2018-11-02 10:03:32,431 : INFO : saving Word2Vec object under savedModelText8, separately None
2018-11-02 10:03:32,434 : INFO : not storing attribute vectors_norm
2018-11-02 10:03:32,435 : INFO : not storing attribute cum_table
2018-11-02 10:03:33,194 : INFO : saved savedModelText8


In [17]:
print(model)

Word2Vec(vocab=71290, size=100, alpha=0.025)


## Vocabulary

In [18]:
# get the vocabulary words as a list (gensim 3.x; `wv.key_to_index` in gensim 4.x)
words = list(model.wv.vocab)

In [19]:
# get the vocabulary words sorted by descending frequency (`wv.index_to_key` in gensim 4.x)
words = list(model.wv.index2word)

In [20]:
len(words)

71290

In [21]:
words[:200]

['the', 'of', 'and', 'one', 'in', 'a', 'to', 'zero', 'nine', 'two', 'is', 'as', 'eight', 'for', 's', 'five', 'three', 'was', 'by', 'that', 'four', 'six', 'seven', 'with', 'on', 'are', 'it', 'from', 'or', 'his', 'an', 'be', 'this', 'which', 'at', 'he', 'also', 'not', 'have', 'were', 'has', 'but', 'other', 'their', 'its', 'first', 'they', 'some', 'had', 'all', 'more', 'most', 'can', 'been', 'such', 'many', 'who', 'new', 'used', 'there', 'after', 'when', 'into', 'american', 'time', 'these', 'only', 'see', 'may', 'than', 'world', 'i', 'b', 'would', 'd', 'no', 'however', 'between', 'about', 'over', 'years', 'states', 'people', 'war', 'during', 'united', 'known', 'if', 'called', 'use', 'th', 'system', 'often', 'state', 'so', 'history', 'will', 'up', 'while', 'where', 'city', 'being', 'english', 'then', 'any', 'both', 'under', 'out', 'made', 'well', 'her', 'e', 'number', 'government', 'them', 'm', 'later', 'since', 'him', 'part', 'name', 'c', 'century', 'through', 'because', 'x', 'university'

In [22]:
# check the index for a word
model.wv.vocab['one'].index

3

## Vectors

In [23]:
model.wv?

Type:        Word2VecKeyedVectors
String form: <gensim.models.keyedvectors.Word2VecKeyedVectors object at 0x7f6b52182ef0>
File:        ~/programming/python/gensim/gensim/models/keyedvectors.py
Docstring:
Mapping between words and vectors for the :class:`~gensim.models.Word2Vec` model.
Used to perform operations on the vectors such as vector lookup, distance, similarity etc.


In [24]:
model.wv.get_vector('one')

array([ 0.24125664,  1.9268551 , -2.2396307 , -0.9135645 ,  0.30719054,
       -0.3901619 , -0.72596925,  1.6858315 ,  1.9339089 , -1.8731937 ,
        0.19566078, -0.5991121 , -1.479587  , -1.5020843 ,  0.18137535,
       -0.23894407, -0.12432036, -0.05521582, -0.1348761 , -0.8019359 ,
       -1.9279583 , -1.3177425 , -0.33780292,  1.0493244 ,  0.31354004,
        1.8714385 , -0.8742712 , -0.17126894, -0.14679043, -4.7852097 ,
       -0.3815103 ,  1.8579111 , -1.7332219 ,  0.7389472 ,  2.1735961 ,
        1.7000883 ,  1.0617127 , -2.405899  , -1.6414772 , -0.63530964,
       -1.3333199 ,  1.3860669 ,  0.42986315, -0.53261244,  0.11340322,
       -1.9065561 , -1.080117  , -1.1631343 ,  0.42832732,  0.8277875 ,
       -0.8820234 ,  1.0434159 ,  1.4340398 ,  0.28658646,  0.54299736,
       -2.195669  , -1.9429504 ,  2.371765  ,  0.52610004,  0.31313244,
        0.95485383,  1.1148375 ,  0.38447228, -0.74583143, -0.2827666 ,
        0.9085859 , -0.1010727 ,  1.9584153 , -0.6831744 , -0.34

In [25]:
len(model.wv.get_vector('one'))

100

### Distance from mean
<a id="distance-from-mean"></a>

In [26]:
model.wv.doesnt_match?

Signature: model.wv.doesnt_match(words)
Docstring:
Which word from the given list doesn't go with the others?

Parameters
----------
words : list of str
    List of words.

Returns
-------
str
    The word further away from the mean of all words.
File:      ~/programming/python/gensim/gensim/models/keyedvectors.py
Type:      method


In [27]:
# find word in list that is farthest from the mean
model.wv.doesnt_match("breakfast cereal dinner lunch".split())

2018-11-02 10:03:33,670 : INFO : precomputing L2-norms of word weight vectors

'cereal'

In [28]:
model.wv.doesnt_match("cook janitor pilot sport teacher".split())

'sport'

In [29]:
model.wv.doesnt_match("joy time peace angst".split())

'angst'

In [30]:
model.wv.doesnt_match("joy timely peace angst".split())

'peace'

In [31]:
model.wv.doesnt_match("joy peace angst".split())

'peace'
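
The docstring above says the result is the word farthest from the mean of all the words. A minimal sketch of that idea follows, assuming unit-normalized vectors and cosine similarity to the normalized mean. The helper names and the 2-d vectors are invented for illustration and are not taken from the trained model.

```python
import math


def unit(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def odd_one_out(vectors):
    """Return the word whose unit vector is least aligned with the mean unit vector."""
    units = {word: unit(vec) for word, vec in vectors.items()}
    dims = len(next(iter(units.values())))
    mean = unit([sum(u[i] for u in units.values()) / len(units) for i in range(dims)])
    # Lowest cosine similarity to the mean = farthest from the group
    return min(units, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))


# Hypothetical 2-d vectors, invented for illustration
toy_vectors = {
    "breakfast": [0.9, 0.1],
    "lunch": [1.0, 0.2],
    "dinner": [0.8, 0.15],
    "cereal": [0.1, 1.0],
}
outlier = odd_one_out(toy_vectors)
```

Because the mean depends on every word in the list, swapping or removing a single word can change which word is reported as the outlier, as the `joy`/`peace`/`angst` examples show.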

## Similarity
<a id="similarity"></a>

### Cosine similarity

In [32]:
model.wv.similarity?

Signature: model.wv.similarity(w1, w2)
Docstring:
Compute cosine similarity between two words.

Parameters
----------
w1 : str
    Input word.
w2 : str
    Input word.

Returns
-------
float
    Cosine similarity between `w1` and `w2`.
File:      ~/programming/python/gensim/gensim/models/keyedvectors.py
Type:      method


In [33]:
model.wv.similarity('woman', 'man')

0.72925234

In [34]:
model.wv.similarity('woman', 'tree')

0.29012758

In [35]:
model.wv.similarity('tree', 'shrub')

0.40712953

In [36]:
model.wv.similarity('tree', 'bush')

-0.13019277

In [37]:
# cosine distance is 1 - cosine similarity
model.wv.distance('woman', 'tree')

0.7098724246025085

In [38]:
model.wv.distance('woman', 'tree') + model.wv.similarity('woman', 'tree')

1.0
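
The relationship shown above (distance plus similarity equals 1) follows directly from the definitions. Here is a small self-contained sketch of both quantities on invented 2-d vectors; the function names are illustrative and are not gensim's.

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_distance(u, v):
    """Cosine distance, defined as 1 minus the cosine similarity."""
    return 1.0 - cosine_similarity(u, v)


u = [1.0, 0.0]
v = [1.0, 1.0]
w = [-1.0, 0.0]

sim_uv = cosine_similarity(u, v)  # vectors 45 degrees apart
sim_uw = cosine_similarity(u, w)  # vectors pointing in opposite directions
```

Cosine similarity ranges from -1 to 1, which is why a pair such as `tree` and `bush` can come out negative, and why the corresponding distance can exceed 1.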

In [39]:
# closest by cosine similarity
model.wv.similar_by_word('woman', topn=10)

[('child', 0.8045989274978638), ('girl', 0.7382532358169556), ('man', 0.7292523384094238), ('herself', 0.7025970816612244), ('lady', 0.6832849979400635), ('lover', 0.6755771636962891), ('mother', 0.6658172607421875), ('wife', 0.6456671357154846), ('prostitute', 0.6433481574058533), ('person', 0.6396008729934692)]

In [40]:
# closest by cosine similarity
model.wv.similar_by_word('she', topn=10)

[('he', 0.8081147074699402), ('herself', 0.7315353155136108), ('her', 0.7108720541000366), ('leto', 0.6110916137695312), ('nobody', 0.6102216243743896), ('faramir', 0.5981771349906921), ('odrade', 0.5976393222808838), ('him', 0.5975172519683838), ('rachel', 0.5924820899963379), ('baldrick', 0.5899745225906372)]

In [41]:
model.wv.most_similar?

Signature: model.wv.most_similar(positive=None, negative=None, topn=10, restrict_vocab=None, indexer=None)
Docstring:
Find the top-N most similar words.
Positive words contribute positively towards the similarity, negative words negatively.

This method computes cosine similarity between a simple mean of the projection
weight vectors of the given words and the vectors for each word in the model.
The method corresponds to the `word-analogy` and `distance` scripts in the original
word2vec implementation.

Parameters
----------
positive : list of str, optional
    List of words that contribute positively.
negative : list of str, optional
    List of words that contribute negatively.
topn : int, 

In [42]:
model.wv.most_similar(positive=['woman'], topn=10)

[('child', 0.8045989274978638), ('girl', 0.7382532358169556), ('man', 0.7292523384094238), ('herself', 0.7025970816612244), ('lady', 0.6832849979400635), ('lover', 0.6755771636962891), ('mother', 0.6658172607421875), ('wife', 0.6456671357154846), ('prostitute', 0.6433481574058533), ('person', 0.6396008729934692)]

In [43]:
model.wv.most_similar(negative=['woman'], topn=10)

[('hippocampal', 0.43745601177215576), ('touman', 0.3993116021156311), ('automated', 0.3980580270290375), ('abm', 0.3962332010269165), ('districting', 0.38271263241767883), ('automation', 0.378354012966156), ('divestiture', 0.3775786757469177), ('operations', 0.37553471326828003), ('loran', 0.37270671129226685), ('samsung', 0.36578959226608276)]

In [44]:
model.wv.most_similar(positive=['woman', 'king'], topn=10)

[('queen', 0.73751300573349), ('princess', 0.7058136463165283), ('son', 0.691680908203125), ('man', 0.6852022409439087), ('daughter', 0.6846308708190918), ('lady', 0.6647908091545105), ('bride', 0.6599146723747253), ('prince', 0.657150387763977), ('lover', 0.6516727209091187), ('wife', 0.648215651512146)]

In [45]:
model.wv.most_similar(positive=['woman', 'king'], negative=['man'], topn=10)

[('queen', 0.6976450085639954), ('daughter', 0.6202482581138611), ('empress', 0.6175616979598999), ('princess', 0.6116492748260498), ('matilda', 0.6097214221954346), ('prince', 0.6079800128936768), ('son', 0.5972870588302612), ('isabella', 0.5928862690925598), ('aquitaine', 0.5920809507369995), ('jadwiga', 0.5864522457122803)]
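
The docstring above describes the query as a simple mean of the given word vectors, with negative words contributing negatively, followed by a cosine-similarity ranking. The sketch below follows that description on invented 3-d vectors. It assumes unit-normalized vectors and excludes the input words from the results; the function and the toy data are illustrative, not gensim's code.

```python
import math


def unit(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def most_similar(vectors, positive, negative=(), topn=3):
    """Rank words by cosine similarity to the mean of +positive and -negative unit vectors."""
    units = {w: unit(v) for w, v in vectors.items()}
    dims = len(next(iter(units.values())))
    signed = [(units[w], 1.0) for w in positive] + [(units[w], -1.0) for w in negative]
    query = unit([sum(s * u[i] for u, s in signed) / len(signed) for i in range(dims)])
    excluded = set(positive) | set(negative)  # input words are not returned
    scores = [
        (w, sum(a * b for a, b in zip(u, query)))
        for w, u in units.items()
        if w not in excluded
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Hypothetical vectors; dimensions loosely mean "royalty", "female", "male"
toy_vectors = {
    "king": [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 0.0],
    "man": [0.0, 0.0, 1.0],
    "woman": [0.0, 1.0, 0.0],
    "prince": [0.8, 0.0, 0.9],
    "apple": [0.1, 0.1, 0.1],
}
ranked = most_similar(toy_vectors, positive=["woman", "king"], negative=["man"])
```

With these toy vectors the top-ranked word is `queen`, mirroring the `woman + king - man` query above.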

### Multiplicative combination

In [46]:
model.wv.most_similar_cosmul?

Signature: model.wv.most_similar_cosmul(positive=None, negative=None, topn=10)
Docstring:
Find the top-N most similar words, using the multiplicative combination objective,
proposed by `Omer Levy and Yoav Goldberg "Linguistic Regularities in Sparse and Explicit Word Representations"
<http://www.aclweb.org/anthology/W14-1618>`_. Positive words still contribute positively towards the similarity,
negative words negatively, but with less susceptibility to one large distance dominating the calculation.
In the common analogy-solving case, of two positive and one negative examples,
this method is equivalent to the "3CosMul" objective (equation (4)) of Levy and Goldberg.

Additional positive or negative examples contribute to the numerator or denominator,
respectively - a potentia

In [47]:
model.wv.most_similar_cosmul(positive=['woman', 'king'], negative=['man'], topn=10)

[('queen', 0.9062806963920593), ('matilda', 0.8731217980384827), ('empress', 0.8670921921730042), ('daughter', 0.8613948822021484), ('princess', 0.8548787832260132), ('aquitaine', 0.8529645204544067), ('jadwiga', 0.8512430191040039), ('isabella', 0.8460915684700012), ('son', 0.8443405628204346), ('prince', 0.8387503623962402)]
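
For comparison with the additive query, here is a rough sketch of a multiplicative objective in the spirit of Levy and Goldberg's 3CosMul, as referenced in the docstring above. It assumes cosines are shifted from [-1, 1] to [0, 1] and that a small constant `eps` guards against division by zero; the value of `eps`, the function, and the toy vectors are assumptions for illustration. `math.prod` requires Python 3.8 or later.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def most_similar_cosmul(vectors, positive, negative=(), topn=3, eps=1e-6):
    """Rank words by a product of shifted cosines (positives over negatives)."""
    excluded = set(positive) | set(negative)
    scores = []
    for word, vec in vectors.items():
        if word in excluded:
            continue
        # Shift cosines to [0, 1] so products and ratios stay non-negative
        numerator = math.prod((1 + cosine(vec, vectors[p])) / 2 for p in positive)
        denominator = math.prod((1 + cosine(vec, vectors[n])) / 2 for n in negative)
        scores.append((word, numerator / (denominator + eps)))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Hypothetical vectors; dimensions loosely mean "royalty", "female", "male"
toy_vectors = {
    "king": [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 0.0],
    "man": [0.0, 0.0, 1.0],
    "woman": [0.0, 1.0, 0.0],
    "prince": [0.8, 0.0, 0.9],
    "apple": [0.1, 0.1, 0.1],
}
ranked_mul = most_similar_cosmul(toy_vectors, positive=["woman", "king"], negative=["man"])
```

Because the scores are ratios of products rather than cosines, they are not bounded by 1 and are not directly comparable with the scores from `most_similar`; only the ranking is.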

In [48]:
model.wv.most_similar(positive=['woman', 'king'], negative=['man'], topn=10)

[('queen', 0.6976450085639954), ('daughter', 0.6202482581138611), ('empress', 0.6175616979598999), ('princess', 0.6116492748260498), ('matilda', 0.6097214221954346), ('prince', 0.6079800128936768), ('son', 0.5972870588302612), ('isabella', 0.5928862690925598), ('aquitaine', 0.5920809507369995), ('jadwiga', 0.5864522457122803)]

In [49]:
model.wv.most_similar_cosmul(positive=['woman', 'king'], topn=10)

[('queen', 0.6143661141395569), ('princess', 0.600183367729187), ('son', 0.5903830528259277), ('daughter', 0.5863444805145264), ('man', 0.5782984495162964), ('bride', 0.5700618028640747), ('lady', 0.5688951015472412), ('wife', 0.5610396265983582), ('lover', 0.560962438583374), ('prince', 0.5588306188583374)]

In [50]:
model.wv.most_similar(positive=['woman', 'king'], topn=10)

[('queen', 0.73751300573349), ('princess', 0.7058136463165283), ('son', 0.691680908203125), ('man', 0.6852022409439087), ('daughter', 0.6846308708190918), ('lady', 0.6647908091545105), ('bride', 0.6599146723747253), ('prince', 0.657150387763977), ('lover', 0.6516727209091187), ('wife', 0.648215651512146)]


[<img style="float: left;" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png">](http://creativecommons.org/licenses/by-sa/4.0/)  

Licensed under a [Creative Commons Attribution-ShareAlike 4.0 International License](http://creativecommons.org/licenses/by-sa/4.0/).