In [2]:
%load_ext autoreload
%autoreload 2

# word2vec

This notebook is equivalent to `demo-word.sh`, `demo-analogy.sh`, `demo-phrases.sh` and `demo-classes.sh` from Google.

## Training

Download some data, for example: [http://mattmahoney.net/dc/text8.zip](http://mattmahoney.net/dc/text8.zip)

In [3]:
import word2vec

Run `word2phrase` to join words that frequently occur together into single tokens, for example "Los Angeles" becomes "Los_Angeles".

In [4]:
word2vec.word2phrase('/Users/gautamchheda/Downloads/text8', '/Users/gautamchheda/Downloads/text8-phrases', verbose=True)

Starting training using file /Users/gautamchheda/Downloads/text8
Words processed: 17000K     Vocab size: 4399K  
Vocab size (unigrams + bigrams): 2419827
Words in train file: 17005206
Words written: 17000K

This creates a `text8-phrases` file that we can use as a better input for `word2vec`.
Note that you could skip this step and use the original data as input for `word2vec`.
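
To give an intuition for the phrase step, here is a minimal sketch in plain Python. It is not the `word2phrase` implementation: the scoring formula, the `merge_phrases` name, and the `min_count` and `threshold` values are assumptions chosen for a toy corpus. The idea is that a pair of adjacent words is joined when it occurs together more often than its individual word counts would suggest.

```python
from collections import Counter

def merge_phrases(tokens, min_count=1, threshold=1.0):
    """Join adjacent word pairs that co-occur unusually often (toy sketch)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)

    def score(a, b):
        # Discounted pair count relative to the product of the word counts.
        return (bigrams[(a, b)] - min_count) / (unigrams[a] * unigrams[b]) * total

    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and score(tokens[i], tokens[i + 1]) > threshold:
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2  # skip the second word of the merged pair
        else:
            merged.append(tokens[i])
            i += 1
    return merged

text = "i flew to los angeles and then left los angeles for new york".split()
print(merge_phrases(text))
```

In this toy text "los angeles" appears twice and is merged, while "new york" appears only once and is left alone, because the `min_count` discount removes pairs seen too rarely.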

Train the model using the `word2phrase` output.

In [5]:
word2vec.word2vec('/Users/gautamchheda/Downloads/text8-phrases', '/Users/gautamchheda/Downloads/text8.bin', size=100, verbose=True)

Starting training using file /Users/gautamchheda/Downloads/text8-phrases
Vocab size: 98331
Words in train file: 15857306
Alpha: 0.000002  Progress: 100.04%  Words/thread/sec: 365.38k  

That generated a `text8.bin` file containing the word vectors in a binary format.
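
For readers curious about what "binary format" means here, the following is a sketch that assumes the layout commonly produced by the original C tool: a text header with the vocabulary size and vector dimension, followed by one record per word consisting of the word, a space, the raw 32-bit floats, and a newline. The `write_vectors` and `read_vectors` helpers are hypothetical and work on a tiny in-memory example, not on `text8.bin` itself.

```python
import io
import struct

def write_vectors(stream, vectors):
    """Write {word: [floats]} in the assumed word2vec binary layout."""
    dim = len(next(iter(vectors.values())))
    stream.write(f"{len(vectors)} {dim}\n".encode("utf-8"))
    for word, vec in vectors.items():
        stream.write(word.encode("utf-8") + b" ")
        stream.write(struct.pack(f"<{dim}f", *vec))  # little-endian float32
        stream.write(b"\n")

def read_vectors(stream):
    """Read the layout written by write_vectors back into a dict."""
    vocab_size, dim = (int(x) for x in stream.readline().split())
    vectors = {}
    for _ in range(vocab_size):
        word = bytearray()
        while True:
            ch = stream.read(1)
            if not ch:
                raise EOFError("unexpected end of file while reading a word")
            if ch == b" ":
                break
            if ch != b"\n":  # skip the newline that ends the previous record
                word.extend(ch)
        values = struct.unpack(f"<{dim}f", stream.read(4 * dim))
        vectors[word.decode("utf-8")] = list(values)
    return vectors

buf = io.BytesIO()
write_vectors(buf, {"dog": [0.5, -1.0], "cat": [0.25, 2.0]})
buf.seek(0)
print(read_vectors(buf))
```

In practice `word2vec.load` handles this for you; the sketch only illustrates why the file is not human-readable.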

Cluster the word vectors. Note that this call trains on the original `text8` file rather than on `text8-phrases`, as the log below shows.

In [6]:
word2vec.word2clusters('/Users/gautamchheda/Downloads/text8', '/Users/gautamchheda/Downloads/text8-clusters.txt', 100, verbose=True)

Starting training using file /Users/gautamchheda/Downloads/text8
Vocab size: 71291
Words in train file: 16718843
Alpha: 0.000002  Progress: 100.03%  Words/thread/sec: 364.16k  

That created a `text8-clusters.txt` file with the cluster for every word in the vocabulary.
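
As a rough illustration of what such a file holds, the sketch below assumes one `word cluster_id` pair per line, separated by a space. That layout is an assumption, and the three sample lines are made up from words and cluster numbers that appear later in this notebook. The sketch builds both a word-to-cluster lookup and the reverse mapping.

```python
import io
from collections import defaultdict

# Stand-in for the clusters file; the line layout is an assumption.
sample = io.StringIO("take 89\nplay 89\nrest 43\n")

word_to_cluster = {}
cluster_to_words = defaultdict(list)
for line in sample:
    word, cluster = line.split()
    word_to_cluster[word] = int(cluster)
    cluster_to_words[int(cluster)].append(word)

print(word_to_cluster["rest"])
print(cluster_to_words[89])
```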

## Predictions

In [7]:
import word2vec

Load the `word2vec` binary file created above:

In [8]:
model = word2vec.load('/Users/gautamchheda/Downloads/text8.bin')

We can take a look at the vocabulary as a numpy array

In [9]:
model.vocab

array([u'</s>', u'the', u'of', ..., u'dakotas', u'nias', u'burlesques'],
      dtype='<U78')

Or take a look at the whole matrix

In [10]:
model.vectors.shape

(98331, 100)

In [11]:
model.vectors

array([[ 0.14333282,  0.15825513, -0.13715845, ...,  0.05456942,
         0.10955409,  0.00693387],
       [ 0.05901973, -0.03838892,  0.10086869, ..., -0.10895406,
        -0.04325397,  0.00908804],
       [ 0.17550297, -0.02656363, -0.04145248, ..., -0.02540955,
         0.13294062,  0.11193433],
       ...,
       [-0.03154316,  0.07935622,  0.06886648, ...,  0.12321164,
        -0.01726373, -0.0321489 ],
       [ 0.08473612, -0.01125982,  0.15778488, ...,  0.09312072,
        -0.21965933, -0.08323528],
       [ 0.0689628 , -0.03754506,  0.0347125 , ...,  0.0051401 ,
        -0.10758115, -0.07847002]])

We can retrieve the vector of individual words

In [12]:
model['dog'].shape

(100,)

In [13]:
model['dog'][:10]

array([ 0.09261248, -0.03804511,  0.15976225,  0.01918418,  0.06726611,
       -0.07513122,  0.15003631,  0.09454908,  0.13023934,  0.08768496])

In [16]:
import scipy.spatial.distance

scipy.spatial.distance.cosine(model['electrolyte'],model['drinks'])

0.43260275098796075
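
Keep in mind that `scipy.spatial.distance.cosine` returns a cosine *distance*, that is, one minus the cosine similarity, so smaller values mean more similar vectors. The plain-Python sketch below spells out the relationship on two made-up vectors; the helper names are illustrative only.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_distance(u, v):
    """One minus the cosine similarity, as scipy's cosine() reports it."""
    return 1.0 - cosine_similarity(u, v)

u = [1.0, 0.0]
v = [1.0, 1.0]
print(cosine_similarity(u, v))
print(cosine_distance(u, v))
```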

We can do simple queries to retrieve words similar to "rest" based on cosine similarity:

In [30]:
indexes, metrics = model.cosine('rest')
indexes, metrics

(array([ 3561,   737,   789, 79524,   163,   540,   420,   529,  4015,
         2966]),
 array([0.71550285, 0.62992115, 0.62373991, 0.59537892, 0.56506883,
        0.54524602, 0.5427218 , 0.53518276, 0.51957802, 0.51318743]))

In [31]:
import scipy.spatial.distance

scipy.spatial.distance.cosine(model['adult'],model['patient'])

0.5054304978844895

The `model.cosine` call returned a tuple with 2 items:
1. numpy array with the indexes of the similar words in the vocabulary
2. numpy array with cosine similarity to each word
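
A similarity query of this kind can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the package's code: the `most_similar` helper and the tiny two-dimensional vocabulary are made up. Every vector is scaled to unit length, scored against the query by dot product, and the best `n` matches other than the query word itself are returned as indexes and metrics.

```python
import math

def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def most_similar(word, vocab, vectors, n=2):
    """Return (indexes, metrics) of the n words closest to `word`."""
    unit = [normalize(v) for v in vectors]
    query = unit[vocab.index(word)]
    # For unit vectors the dot product equals the cosine similarity.
    scores = [sum(a * b for a, b in zip(query, row)) for row in unit]
    ranked = sorted(range(len(vocab)), key=lambda i: scores[i], reverse=True)
    ranked = [i for i in ranked if vocab[i] != word][:n]
    return ranked, [scores[i] for i in ranked]

# Hypothetical toy data, not taken from the trained model.
vocab = ["rest", "remainder", "whole", "dog"]
vectors = [[1.0, 0.0], [0.9, 0.1], [0.6, 0.4], [-1.0, 0.2]]

indexes, metrics = most_similar("rest", vocab, vectors)
print([vocab[i] for i in indexes])
```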

It's possible to get the words at those indexes:

In [32]:
model.vocab[indexes]

array([u'remainder', u'whole', u'entire', u'wealthiest_men', u'end',
       u'outside', u'parts', u'throughout', u'bulk', u'possession'],
      dtype='<U78')

There is a helper function to create a combined response as a numpy [record array](http://docs.scipy.org/doc/numpy/user/basics.rec.html):

In [33]:
model.generate_response(indexes, metrics)

rec.array([(u'remainder', 0.71550285), (u'whole', 0.62992115),
           (u'entire', 0.62373991), (u'wealthiest_men', 0.59537892),
           (u'end', 0.56506883), (u'outside', 0.54524602),
           (u'parts', 0.5427218 ), (u'throughout', 0.53518276),
           (u'bulk', 0.51957802), (u'possession', 0.51318743)],
          dtype=[(u'word', '<U78'), (u'metric', '<f8')])

It is easy to convert that numpy array into a pure Python response:

In [34]:
model.generate_response(indexes, metrics).tolist()

[(u'remainder', 0.7155028507687243),
 (u'whole', 0.6299211480784054),
 (u'entire', 0.6237399134767975),
 (u'wealthiest_men', 0.5953789201127435),
 (u'end', 0.565068833388109),
 (u'outside', 0.5452460225015305),
 (u'parts', 0.5427218023488299),
 (u'throughout', 0.5351827648340569),
 (u'bulk', 0.5195780177930026),
 (u'possession', 0.5131874266729617)]

### Phrases

Since we trained the model with the output of `word2phrase`, we can ask for the similarity of "phrases":

In [12]:
indexes, metrics = model.cosine('los_angeles')
model.generate_response(indexes, metrics).tolist()

[(u'san_francisco', 0.886558000570455),
 (u'san_diego', 0.8731961018831669),
 (u'seattle', 0.8455603712285231),
 (u'las_vegas', 0.8407843553947962),
 (u'miami', 0.8341796009062884),
 (u'detroit', 0.8235412519780195),
 (u'cincinnati', 0.8199138493085706),
 (u'st_louis', 0.8160655356728751),
 (u'chicago', 0.8156786240847214),
 (u'california', 0.8154244925085712)]

### Analogies

It's possible to do more complex queries like analogies, such as `king - man + woman = queen`.
Like `cosine`, this method returns the indexes of the words in the vocab and the metric.

In [13]:
indexes, metrics = model.analogy(pos=['king', 'woman'], neg=['man'], n=10)
indexes, metrics

(array([1087, 1145, 7523, 3141, 6768, 1335, 8419, 1826,  648, 1426]),
 array([ 0.2917969 ,  0.27353295,  0.26877692,  0.26596514,  0.26487509,
         0.26428581,  0.26315492,  0.26261258,  0.26136635,  0.26099078]))

In [14]:
model.generate_response(indexes, metrics).tolist()

[(u'queen', 0.2917968955611075),
 (u'prince', 0.27353295205311695),
 (u'empress', 0.2687769174818083),
 (u'monarch', 0.2659651399832089),
 (u'regent', 0.26487508713026797),
 (u'wife', 0.2642858109968327),
 (u'aragon', 0.2631549214361766),
 (u'throne', 0.26261257728511833),
 (u'emperor', 0.2613663460665488),
 (u'bishop', 0.26099078142148696)]
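
The vector arithmetic behind such an analogy can be sketched as below. This is one common formulation and whether the package computes exactly this is an assumption: positive words are added, negative words subtracted, the result is averaged, and the remaining vocabulary is ranked by dot product with that target. The `analogy` helper and the two-dimensional vectors are hypothetical and hand-built so that one axis loosely stands for royalty and the other for gender.

```python
import math

def unit(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def analogy(pos, neg, vocab, vectors, n=1):
    """Rank words by similarity to mean(pos vectors and negated neg vectors)."""
    lookup = {w: unit(v) for w, v in zip(vocab, vectors)}
    terms = [lookup[w] for w in pos] + [[-x for x in lookup[w]] for w in neg]
    dim = len(vectors[0])
    target = [sum(t[d] for t in terms) / len(terms) for d in range(dim)]
    exclude = set(pos) | set(neg)  # never return the query words themselves
    scored = [
        (w, sum(a * b for a, b in zip(target, lookup[w])))
        for w in vocab
        if w not in exclude
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]

# Hypothetical toy vectors: (royalty, gender).
vocab = ["king", "man", "woman", "queen", "apple"]
vectors = [[1.0, 1.0], [0.0, 1.0], [0.0, -1.0], [1.0, -1.0], [-1.0, 0.0]]

print(analogy(pos=["king", "woman"], neg=["man"], vocab=vocab, vectors=vectors))
```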

### Clusters

In [17]:
clusters = word2vec.load_clusters('/Users/gautamchheda/Downloads/text8-clusters.txt')

We can get the cluster number for individual words

In [36]:
clusters['rest']

43

We can get all the words grouped in a specific cluster

In [17]:
clusters.get_words_on_cluster(90).shape

(221,)

In [23]:
clusters.get_words_on_cluster(89)[:1000]

array(['take', 'play', 'run', 'live', 'go', 'get', 'turn', 'move',
       'running', 'cut', 'moving', 'keep', 'pass', 'look', 'die', 'drive',
       'travel', 'begin', 'stop', 'break', 'reach', 'stand', 'carry',
       'target', 'setting', 'store', 'enter', 'looking', 'advance',
       'draw', 'fit', 'escape', 'moves', 'stay', 'operate', 'getting',
       'grow', 'fly', 'walk', 'drop', 'beat', 'starts', 'watch', 'wear',
       'switch', 'gets', 'returns', 'driving', 'handle', 'trip', 'check',
       'touch', 'clean', 'jump', 'catch', 'blow', 'pick', 'compete',
       'finish', 'feed', 'crowd', 'push', 'sending', 'connect', 'trigger',
       'sit', 'wait', 'pointing', 'delay', 'pull', 'arrive', 'bet',
       'sends', 'repeat', 'releasing', 'sail', 'exit', 'hide', 'listen',
       'emerge', 'fix', 'pushing', 'safely', 'transmit', 'spell',
       'listening', 'spare', 'chances', 'insert', 'concentrate', 'sink',
       'stopping', 'dial', 'steal', 'climb', 'hang', 'alert', 'sweep',
       

We can add the clusters to the word2vec model and generate a response that includes the cluster of each word:

In [19]:
model.clusters = clusters

In [20]:
indexes, metrics = model.analogy(pos=['paris', 'germany'], neg=['france'], n=10)

In [21]:
model.generate_response(indexes, metrics).tolist()

[(u'berlin', 0.32333651414395953, 20),
 (u'munich', 0.28851564633559, 20),
 (u'vienna', 0.2768927258877336, 12),
 (u'leipzig', 0.2690537010929304, 91),
 (u'moscow', 0.26531859560322785, 74),
 (u'st_petersburg', 0.259534503067277, 61),
 (u'prague', 0.25000637367753303, 72),
 (u'dresden', 0.2495974800117785, 71),
 (u'bonn', 0.24403155303236473, 8),
 (u'frankfurt', 0.24199720792200027, 31)]
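
Because each tuple now carries `(word, metric, cluster)`, the response is easy to post-process. As a small sketch, the snippet below groups the first three rows of the output above by cluster number; the `by_cluster` name is illustrative.

```python
from collections import defaultdict

# First three rows copied from the response above.
response = [
    ("berlin", 0.32333651414395953, 20),
    ("munich", 0.28851564633559, 20),
    ("vienna", 0.2768927258877336, 12),
]

by_cluster = defaultdict(list)
for word, metric, cluster in response:
    by_cluster[cluster].append(word)

print(dict(by_cluster))
```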