# Reference
https://github.com/danielfrg/word2vec

# Setup

Install word2vec into a Python environment installed from [Anaconda](https://www.continuum.io/downloads).

`pip install word2vec`

In [2]:
%load_ext autoreload
%autoreload 2



# word2vec

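## Downloading the data

Download http://mattmahoney.net/dc/text8.zip and extract it into a folder of your choice.

For now, create a folder for it inside this directory (the data is large, so the folder is listed in .gitignore).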
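
The download and extraction step can also be scripted. The following is a minimal sketch using only the standard library; the helper names `download` and `extract` are made up for illustration, and the `./text8` folder is chosen to match the paths used later in this notebook.

```python
import urllib.request
import zipfile
from pathlib import Path

TEXT8_URL = "http://mattmahoney.net/dc/text8.zip"


def download(url, dest):
    """Download url to dest unless the file already exists, then return dest."""
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    if not dest.exists():
        urllib.request.urlretrieve(url, str(dest))
    return dest


def extract(zip_path, out_dir):
    """Extract every member of zip_path into out_dir and return the member names."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(out_dir)
        return archive.namelist()


# Example usage (left commented out because the archive is a large download):
# data_dir = Path("./text8")
# extract(download(TEXT8_URL, data_dir / "text8.zip"), data_dir)
```

Skipping the download when the file already exists makes the cell safe to re-run.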

## Training

In [3]:
import word2vec

Run word2phrase to join words that frequently occur together into a single token, e.g. "Los Angeles" becomes "Los_Angeles".

This creates a text8-phrases file that we can use as a better input for word2vec. Note that you can skip this step and use the original data as input for word2vec.

Train the model using the word2phrase output.

That generated a text8.bin file containing the word vectors in a binary format.

Cluster the vectors based on the trained model. This produces the text8-clusters.txt file that is loaded in the Clusters section.
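
The code cells for these three steps are not shown here, so the following is only a sketch of what they might look like. It assumes the call signatures used in the example notebook of the referenced repository (`word2phrase`, `word2vec`, `word2clusters`), which may differ between package versions. The helpers `text8_paths` and `train` are hypothetical, `size=100` is inferred from the vector shape shown later, and `n_clusters=100` is an assumption.

```python
from pathlib import Path


def text8_paths(data_dir):
    """Return the file paths used by each training step, keyed by role."""
    data_dir = Path(data_dir)
    return {
        "corpus": data_dir / "text8",
        "phrases": data_dir / "text8-phrases",
        "vectors": data_dir / "text8.bin",
        "clusters": data_dir / "text8-clusters.txt",
    }


def train(data_dir="./text8", size=100, n_clusters=100):
    """Run phrase detection, vector training and clustering in sequence."""
    # Imported lazily so that text8_paths works without the package installed.
    import word2vec

    p = {name: str(path) for name, path in text8_paths(data_dir).items()}
    # 1. Join frequently co-occurring words ("los angeles" -> "los_angeles").
    word2vec.word2phrase(p["corpus"], p["phrases"], verbose=True)
    # 2. Train the word vectors on the phrase output (writes text8.bin).
    word2vec.word2vec(p["phrases"], p["vectors"], size=size, verbose=True)
    # 3. Cluster the words (assumed here to run on the raw corpus).
    word2vec.word2clusters(p["corpus"], p["clusters"], n_clusters, verbose=True)
    return p
```

Calling `train()` takes a while on the full corpus, so it is defined here but not invoked.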

# Predictions

Import the word2vec binary file created above

In [8]:
model = word2vec.load('./text8/text8.bin')

We can take a look at the vocabulary as a numpy array

In [9]:
model.vocab

array(['</s>', 'the', 'of', ..., 'denishawn', 'tamiris', 'dolophine'], 
      dtype='<U78')

Or take a look at the whole matrix

In [10]:
model.vectors.shape

(98331, 100)

In [11]:
model.vectors

array([[ 0.14333282,  0.15825513, -0.13715845, ...,  0.05456942,
         0.10955409,  0.00693387],
       [ 0.12324433,  0.08508522, -0.14245066, ..., -0.05163304,
        -0.04413069, -0.02636524],
       [ 0.09120493,  0.07657887,  0.09830743, ..., -0.07062475,
         0.00264537,  0.1461402 ],
       ..., 
       [-0.05293642, -0.05607885,  0.12834577, ...,  0.01149706,
        -0.23037991, -0.0393982 ],
       [-0.08716934, -0.0674235 ,  0.13691288, ...,  0.11314875,
        -0.28018489, -0.05091289],
       [ 0.19590881,  0.05077619,  0.0686731 , ...,  0.1524414 ,
        -0.09329432,  0.06795349]])

We can retrieve the vector of individual words.

In [12]:
model['dog'].shape

(100,)

In [13]:
model['dog'][:10]

array([ 0.16765068, -0.06553809,  0.02739922,  0.05225474,  0.15062518,
       -0.06210165,  0.05994908,  0.03833631,  0.12365534,  0.14278004])

We can do simple queries to retrieve words similar to "socks" based on cosine similarity:

In [14]:
indexes, metrics = model.cosine('socks')
indexes, metrics

(array([14558, 20175, 29181, 34047, 31618, 22883, 23666, 30402, 20336, 37766]),
 array([ 0.83670478,  0.83519267,  0.82714757,  0.82426758,  0.82154239,
         0.81419191,  0.81361981,  0.81271618,  0.80641576,  0.80382507]))

This returned a tuple with 2 items:

- a numpy array with the indexes of the similar words in the vocabulary
- a numpy array with the cosine similarity to each word

It's possible to get the words at those indexes:

In [15]:
model.vocab[indexes]

array(['winged', 'hairy', 'pumpkin', 'nosed', 'straps', 'skirt', 'striped',
       'gravy', 'crab', 'boa'], 
      dtype='<U78')
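
To make the cosine query concrete, here is a minimal NumPy sketch of a nearest-neighbour lookup by cosine similarity. It illustrates the idea and is not the library's implementation; the function `most_similar` and the toy vectors are made up for this example.

```python
import numpy as np


def most_similar(vectors, vocab, word, n=10):
    """Return (indexes, metrics) of the n words closest to word by cosine similarity."""
    # Normalise each row so that a dot product equals the cosine of the angle.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    target = int(np.where(vocab == word)[0][0])
    metrics = unit @ unit[target]
    # Sort by descending similarity and drop the query word itself.
    order = np.argsort(metrics)[::-1]
    order = order[order != target][:n]
    return order, metrics[order]


# Toy data, invented for illustration (not taken from the text8 model).
vocab = np.array(["socks", "skirt", "gravy", "engine"])
vectors = np.array([
    [1.0, 1.0, 0.0],
    [2.0, 2.0, 0.2],
    [1.0, 0.0, 1.0],
    [-1.0, 0.0, 0.0],
])

indexes, metrics = most_similar(vectors, vocab, "socks", n=2)
print(vocab[indexes], metrics)
```

As with `model.cosine`, the result is a pair of arrays, and indexing the vocabulary with the first one yields the words.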

There is a helper function to create a combined response: a numpy record array

In [16]:
model.generate_response(indexes, metrics)

rec.array([('winged', 0.8367047800859084), ('hairy', 0.8351926650699733),
 ('pumpkin', 0.8271475672165035), ('nosed', 0.8242675798607484),
 ('straps', 0.8215423863858035), ('skirt', 0.8141919128224772),
 ('striped', 0.8136198096480802), ('gravy', 0.8127161761603303),
 ('crab', 0.8064157558571379), ('boa', 0.8038250736592599)], 
          dtype=[('word', '<U78'), ('metric', '<f8')])

It is easy to convert that numpy array into a pure Python response:

In [17]:
model.generate_response(indexes, metrics).tolist()

[('winged', 0.8367047800859084),
 ('hairy', 0.8351926650699733),
 ('pumpkin', 0.8271475672165035),
 ('nosed', 0.8242675798607484),
 ('straps', 0.8215423863858035),
 ('skirt', 0.8141919128224772),
 ('striped', 0.8136198096480802),
 ('gravy', 0.8127161761603303),
 ('crab', 0.8064157558571379),
 ('boa', 0.8038250736592599)]
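
For readers unfamiliar with record arrays, the sketch below builds a comparable structure directly with NumPy. It shows the data structure only and makes no claim about how `generate_response` is implemented; the metric values are rounded from the output above.

```python
import numpy as np

words = np.array(["winged", "hairy", "pumpkin"])
metrics = np.array([0.8367, 0.8352, 0.8271])  # rounded from the output above

# A record array stores one named field per column.
response = np.rec.fromarrays((words, metrics), names=("word", "metric"))

print(response.word)      # fields are accessible as attributes
print(response.tolist())  # plain Python list of (word, metric) tuples
```

The named fields are what make the combined response convenient: each column keeps its own dtype while rows stay aligned.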

## Phrases

Since we trained the model with the output of word2phrase we can ask for similarity of "phrases"

In [18]:
indexes, metrics = model.cosine('los_angeles')
model.generate_response(indexes, metrics).tolist()

[('san_francisco', 0.8953364682036224),
 ('san_diego', 0.8761334153196536),
 ('las_vegas', 0.8524525856073968),
 ('miami', 0.8381276706082835),
 ('seattle', 0.8260420347119275),
 ('california', 0.8217981543657138),
 ('chicago', 0.8142463083479377),
 ('st_louis', 0.8121326055898652),
 ('chicago_illinois', 0.8101182172944901),
 ('dallas', 0.8080168522391078)]

## Analogies
It's possible to do more complex queries such as analogies, for example: king - man + woman = queen. Like cosine, this method returns the indexes of the words in the vocab and the metric.

In [19]:
indexes, metrics = model.analogy(pos=['king', 'woman'], neg=['man'], n=10)
indexes, metrics

(array([ 1088,  1145,  7540,   344,  3141,  4978,  1827,  1335,  1427, 13088]),
 array([ 0.28635165,  0.27244199,  0.26373954,  0.26257615,  0.26253012,
         0.26074491,  0.26011973,  0.25957689,  0.25923383,  0.25913231]))

In [20]:
model.generate_response(indexes, metrics).tolist()

[('queen', 0.2863516507498973),
 ('prince', 0.2724419886339948),
 ('empress', 0.263739540598119),
 ('son', 0.2625761524291093),
 ('monarch', 0.2625301180985256),
 ('heir', 0.26074491063702476),
 ('throne', 0.2601197344276091),
 ('wife', 0.25957688950021474),
 ('bishop', 0.25923383068212613),
 ('king_edward', 0.25913230551757593)]
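
The arithmetic behind an analogy query can be sketched with plain NumPy. This is one common formulation (add the positive vectors, subtract the negative ones, rank by cosine similarity) and not necessarily how `model.analogy` computes its metric, whose values above are on a different scale. The function and the two-dimensional toy vectors are invented for illustration.

```python
import numpy as np


def analogy(vectors, vocab, pos, neg, n=1):
    """Rank words by cosine similarity to sum(pos) - sum(neg), excluding the inputs."""
    index = {word: i for i, word in enumerate(vocab)}
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    # Build the query vector: add positive words, subtract negative words.
    query = sum(unit[index[w]] for w in pos) - sum(unit[index[w]] for w in neg)
    query = query / np.linalg.norm(query)
    metrics = unit @ query
    # The input words themselves are not valid answers.
    exclude = {index[w] for w in pos + neg}
    order = [i for i in np.argsort(metrics)[::-1] if i not in exclude][:n]
    return np.array(order), metrics[order]


# Toy vectors: first axis ~ "royalty", second axis ~ "gender" (made up).
vocab = ["king", "queen", "man", "woman", "prince", "apple"]
vectors = np.array([
    [1.0, 1.0],
    [1.0, -1.0],
    [0.0, 1.0],
    [0.0, -1.0],
    [0.8, 0.6],
    [-1.0, 0.0],
])

indexes, metrics = analogy(vectors, vocab, pos=["king", "woman"], neg=["man"], n=2)
print([vocab[i] for i in indexes], metrics)
```

Excluding the input words matters: without it, "woman" would rank near the top simply because it is part of the query.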

## Clusters

In [22]:
clusters = word2vec.load_clusters('./text8/text8-clusters.txt')

We can get the cluster number for individual words

In [26]:
type(clusters)

word2vec.wordclusters.WordClusters

In [28]:
clusters.vocab

array([b'</s>', b'the', b'of', ..., b'denishawn', b'tamiris', b'dolophine'], dtype=object)

In [29]:
clusters[b'dog']

98

We can get all the words grouped in a specific cluster

In [30]:
clusters.get_words_on_cluster(90).shape

(285,)

In [31]:
clusters.get_words_on_cluster(90)[:10]

array([b'or', b'common', b'important', b'making', b'complex', b'simple',
       b'direct', b'difficult', b'active', b'alternative'], dtype=object)
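
The two lookups above (word to cluster, cluster to words) can be pictured as a pair of mappings. The sketch below builds them from text, assuming the cluster file holds one `word cluster_id` pair per line; that layout is an assumption, and `read_clusters` is a hypothetical helper rather than part of the package. The sample pairs are taken from the outputs above.

```python
import io
from collections import defaultdict


def read_clusters(lines):
    """Build word -> cluster and cluster -> words mappings from 'word id' lines."""
    word_to_cluster = {}
    cluster_to_words = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        word, cluster = parts[0], int(parts[1])
        word_to_cluster[word] = cluster
        cluster_to_words[cluster].append(word)
    return word_to_cluster, dict(cluster_to_words)


# A file-like object stands in for the real cluster file (assumed layout).
sample = io.StringIO("dog 98\nor 90\ncommon 90\nimportant 90\n")
word_to_cluster, cluster_to_words = read_clusters(sample)

print(word_to_cluster["dog"])
print(cluster_to_words[90])
```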

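Attaching the clusters to the model makes the generated response include each word's cluster number as a third field:

In [34]: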
model.clusters = clusters

In [35]:
indexes, metrics = \
    model.analogy(pos=['paris', 'germany'], neg=['france'], n=10)

In [36]:
model.generate_response(indexes, metrics).tolist()

[('berlin', 0.32607859249215876, 20),
 ('vienna', 0.28917846478744846, 82),
 ('munich', 0.2843345925557563, 5),
 ('leipzig', 0.2792443656190464, 41),
 ('moscow', 0.27522173121249416, 59),
 ('st_petersburg', 0.26281150847112744, 63),
 ('prague', 0.2596628868838047, 19),
 ('dresden', 0.25564775446499577, 86),
 ('z_rich', 0.2543353152525521, 71),
 ('bonn', 0.2520885868719889, 10)]