word2vec example


The word2vec example implements word2vec, an algorithm for computing continuous distributed representations of words. According to the word2vec repository, it provides an efficient implementation of the continuous bag-of-words and skip-gram architectures for computing vector representations of words. These representations can subsequently be used in many natural language processing applications and for further research.

The code is based on the paper Distributed Representations of Words and Phrases and their Compositionality by Tomas Mikolov et al., and a detailed explanation is given in the Word2Vec TF tutorial.
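
As a quick illustration of what these vector representations enable, the toy sketch below shows the classic analogy arithmetic king - man + woman ≈ queen with cosine similarity. The tiny vectors are made up for the example and are not output of this tutorial:

```
# Toy illustration of analogy arithmetic on word vectors.
# The vectors below are made-up 3-dimensional values; real word2vec
# embeddings have hundreds of dimensions and are learned from text.
import numpy as np

embeddings = {
    "king":  np.array([0.8, 0.65, 0.1]),
    "queen": np.array([0.8, 0.70, 0.9]),
    "man":   np.array([0.6, 0.20, 0.1]),
    "woman": np.array([0.6, 0.25, 0.9]),
}

def most_similar(query_vec, exclude):
    """Return the word whose vector has the highest cosine similarity."""
    best_word, best_sim = None, -1.0
    for word, vec in embeddings.items():
        if word in exclude:
            continue
        sim = np.dot(query_vec, vec) / (np.linalg.norm(query_vec) * np.linalg.norm(vec))
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word, best_sim

# king - man + woman should land near queen
query = embeddings["king"] - embeddings["man"] + embeddings["woman"]
print(most_similar(query, exclude={"king", "man", "woman"}))
```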


The installation is best done in a Docker image or with a full Bazel installation. Inside the Docker image (or directly on your machine) execute the code listed below. The wget command downloads the text8 corpus (30 MByte compressed, 100 MByte extracted), which starts with the words "anarchism originated as a term of abuse". The file is 100,000,000 characters long and contains 17,005,207 words, of which 253,854 are unique and 71,290 are unique frequent words.
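
If you want to verify those corpus statistics yourself, a short Python snippet like the one below reproduces them. It assumes the extracted file is named text8 in the current directory; the frequent-word count of 71,290 should correspond to words that occur at least five times, the default minimum count of the example:

```
# Sanity check of the text8 corpus statistics quoted above.
# Assumes the extracted corpus sits in the current directory as "text8".
from collections import Counter

with open("text8") as f:
    text = f.read()

words = text.split()
counts = Counter(words)

print("characters:", len(text))                   # 100,000,000
print("words:", len(words))                       # 17,005,207
print("unique words:", len(counts))               # 253,854
print("unique frequent words (count >= 5):",
      sum(1 for c in counts.values() if c >= 5))  # 71,290
```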

The file questions-words.txt contains roughly 20,000 manually curated four-word analogy questions grouped into categories, including capital-common-countries (Athens Greece Baghdad Iraq), capital-world (Abuja Nigeria Accra Ghana), currency (Algeria dinar Argentina peso), city-in-state, family, gram1-adjective-to-adverb, gram2-opposite, gram3-comparative, gram4-superlative (bad worst big biggest), gram5-present-participle, gram6-nationality-adjective, gram7-past-tense, gram8-plural, and gram9-plural-verbs (decrease decreases describe describes).
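
Each category in questions-words.txt starts with a header line beginning with ":", followed by one four-word analogy per line. A small parser like the following (assuming the file sits in the current directory) lists the categories and how many questions each contains:

```
# Count the analogy questions per category in questions-words.txt.
# Category headers start with ':'; every other line holds one
# four-word analogy such as "Athens Greece Baghdad Iraq".
from collections import OrderedDict

categories = OrderedDict()
current = None
with open("questions-words.txt") as f:
    for line in f:
        if line.startswith(":"):
            current = line[1:].strip()
            categories[current] = 0
        elif line.strip():
            categories[current] += 1

for name, count in categories.items():
    print(name, count)
print("total questions:", sum(categories.values()))
```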

```
cd tensorflow
wget http://mattmahoney.net/dc/text8.zip -O text8.gz
gzip -d text8.gz -f
wget https://word2vec.googlecode.com/svn/trunk/questions-words.txt
bazel build -c opt tensorflow/models/embedding:all
```

which results in

```
root@fb729273837c:/tensorflow# bazel build -c opt tensorflow/models/embedding:all
INFO: Reading 'startup' options from /root/.bazelrc: --batch
INFO: Found 10 targets...
INFO: Elapsed time: 10.615s, Critical Path: 2.25s
```

After that we can start the example Python file using the manual command from the README. The tutorial code comes in two multi-threaded word2vec versions, a batched and an unbatched skip-gram model:

* word2vec.py - a version of word2vec implemented using TensorFlow ops and minibatching (a simplified sketch of skip-gram minibatching follows after this list).
* word2vec_optimized.py - a version of word2vec implemented using C ops that does no minibatching.
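
The difference between the two is mainly how (center, context) training examples are produced and fed to the model. The sketch below is a simplified, plain-Python illustration of skip-gram pair generation with minibatching; the tutorial's own scripts build their batches with dedicated ops instead:

```
# Simplified sketch of skip-gram training-pair generation with minibatching.
# Illustration only; word2vec.py builds its batches with dedicated ops
# rather than plain Python loops.
import numpy as np

def skipgram_pairs(word_ids, window=2):
    """Yield (center, context) id pairs within a +/- window."""
    for i, center in enumerate(word_ids):
        lo = max(0, i - window)
        hi = min(len(word_ids), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, word_ids[j]

def minibatches(pairs, batch_size=8):
    """Group pairs into fixed-size minibatches of numpy arrays."""
    batch = []
    for pair in pairs:
        batch.append(pair)
        if len(batch) == batch_size:
            arr = np.array(batch)
            yield arr[:, 0], arr[:, 1]  # center ids, context ids
            batch = []

# Toy "sentence" already mapped to vocabulary ids.
ids = [5, 12, 7, 3, 14, 2, 9, 18, 4, 11]
for centers, contexts in minibatches(skipgram_pairs(ids)):
    print(centers, contexts)
```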

```
bazel-bin/tensorflow/models/embedding/word2vec_optimized \
  --train_data=text8 \
  --eval_data=questions-words.txt \
  --save_path=/tmp/
```


which produces output like the following

```
root@fb729273837c:/tensorflow# time bazel-bin/tensorflow/models/embedding/word2vec_optimized   --train_data=text8   --eval_data=questions-words.txt   --save_path=/tmp/
I tensorflow/core/common_runtime/local_device.cc:25] Local device intra op parallelism threads: 8
I tensorflow/core/common_runtime/local_session.cc:45] Local session inter op parallelism threads: 8
I tensorflow/models/embedding/word2vec_kernels.cc:134] Data file: text8 contains 100000000 bytes, 17005207 words, 253854 unique words, 71290 unique frequent words.
Data file:  text8
Vocab size:  71290  + UNK
Words per epoch:  17005207
Eval analogy file:  questions-words.txt
Questions:  17827
Skipped:  1717
Epoch    1 Step   151322: lr = 0.023 words/sec =    34117
Eval 1554/17827 accuracy =  8.7%
Epoch    2 Step   302660: lr = 0.022 words/sec =     3900
Eval 2302/17827 accuracy = 12.9%
Epoch    3 Step   453986: lr = 0.020 words/sec =    32707
Eval 3049/17827 accuracy = 17.1%
Epoch    4 Step   605329: lr = 0.018 words/sec =    11805
Eval 3528/17827 accuracy = 19.8%
Epoch    5 Step   756656: lr = 0.017 words/sec =   126655
Eval 4055/17827 accuracy = 22.7%
Epoch    6 Step   907954: lr = 0.015 words/sec =    66275
Eval 4434/17827 accuracy = 24.9%
Epoch    7 Step  1059303: lr = 0.013 words/sec =   125780
Eval 4737/17827 accuracy = 26.6%
Epoch    8 Step  1210621: lr = 0.012 words/sec =   123938
Eval 5042/17827 accuracy = 28.3%
Epoch    9 Step  1361968: lr = 0.010 words/sec =    89538
Eval 5335/17827 accuracy = 29.9%
Epoch   10 Step  1513319: lr = 0.008 words/sec =    48258
Eval 5621/17827 accuracy = 31.5%
Epoch   11 Step  1664661: lr = 0.007 words/sec =   113623
Eval 5812/17827 accuracy = 32.6%
Epoch   12 Step  1815978: lr = 0.005 words/sec =    58567
Eval 6053/17827 accuracy = 34.0%
Epoch   13 Step  1967289: lr = 0.003 words/sec =    81122
Eval 6203/17827 accuracy = 34.8%
Epoch   14 Step  2118655: lr = 0.002 words/sec =    68519
Eval 6291/17827 accuracy = 35.3%
Epoch   15 Step  2269981: lr = 0.000 words/sec =    64780
Eval 6366/17827 accuracy = 35.7%

real	36m4.861s
user	240m20.464s
sys	24m18.860s
root@fb729273837c:/tensorflow#
```

The final accuracy for TensorFlow word2vec_optimized.py on the text8 corpus with questions-words.txt is 35.7%. The result is not deterministic and varies slightly from run to run.
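
The accuracy is the fraction of analogies "a is to b as c is to ?" that are answered exactly, where the prediction is the vocabulary word whose vector is nearest to vec(b) - vec(a) + vec(c), excluding a, b and c themselves. A simplified sketch of that evaluation is shown below; the names emb (a unit-normalized embedding matrix) and vocab (a word-to-row-index dictionary) are placeholders, not variables from the tutorial code:

```
# Simplified analogy evaluation: for "a is to b as c is to ?", predict the
# word whose unit-normalized vector is nearest to emb[b] - emb[a] + emb[c].
# `emb` (vocab_size x dim array) and `vocab` (word -> row index) are
# placeholder names, not objects taken from the tutorial code.
import numpy as np

def analogy_accuracy(questions, emb, vocab):
    """Return the fraction of analogy questions answered exactly right."""
    correct = total = 0
    for a, b, c, d in questions:
        if not all(w in vocab for w in (a, b, c, d)):
            continue  # corresponds to the "Skipped" count in the log above
        total += 1
        target = emb[vocab[b]] - emb[vocab[a]] + emb[vocab[c]]
        scores = emb @ target           # dot products; cosine for unit vectors
        for w in (a, b, c):             # the input words are never valid answers
            scores[vocab[w]] = -np.inf
        correct += int(scores.argmax() == vocab[d])
    return correct / max(total, 1)
```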


We can also see that the optimized version (word2vec_optimized.py) is highly efficient and uses around 90-100% of all CPU cores, whereas the plain version (word2vec.py) is slower and barely reaches 40% CPU utilization.

[Image: word2vec-optimized-tensorflow (CPU utilization)]

[Image: word2vec-tensorflow-not-optimized (CPU utilization)]


Links

  • text8 - text8 corpus by Matt Mahoney
  • word2vec - computing continuous distributed representations of words
  • word2vec@chalow - Trying out word2vec on a MacBook Air (OS X 10.9.2), in Japanese
  • word2vec@cnblogs - Playing with Google's open-source deep-learning project word2vec, in Chinese
  • Word2Vec&GloVe - Getting Started with Word2Vec and GloVe in Python
  • Books&ngrams - Google Books ngram viewer
  • word2vec&parallel - Interesting benchmark about parallelizing word2vec in Python
  • word2vec - explained with examples