The word2vec example is an algorithm for computing continuous distributed representations of words. According to the word2vec repository, it provides an efficient implementation of the continuous bag-of-words (CBOW) and skip-gram architectures for computing vector representations of words. These representations can subsequently be used in many natural language processing applications and for further research.
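To make the two architectures concrete, here is a minimal sketch (not part of the example's code) of how training pairs are generated from a tokenized sentence: skip-gram predicts each context word from the center word, while CBOW predicts the center word from the bag of its context words. The window size and sentence are illustrative choices.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram: emit (center, context) pairs within the window."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_pairs(tokens, window=2):
    """CBOW: emit (context-bag, center) pairs within the window."""
    pairs = []
    for i, center in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window), min(len(tokens), i + window + 1))
                   if j != i]
        pairs.append((context, center))
    return pairs

tokens = "anarchism originated as a term of abuse".split()
print(skipgram_pairs(tokens)[:2])  # [('anarchism', 'originated'), ('anarchism', 'as')]
```

Both model variants in the example train on pairs of this shape; they differ only in which side of the pair is predicted.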
The installation is best done in a Docker image or with a full Bazel installation. In the Docker image (or directly on the host), execute the code listed below. The wget command downloads the text8 corpus (about 30 MB compressed, 100 MB extracted), which begins with "anarchism originated as a term of abuse". The file is 100,000,000 characters long and contains 17,005,207 words, of which 253,854 are unique and 71,290 are unique frequent words.
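The distinction between "unique words" and "unique frequent words" comes from word2vec's minimum-count vocabulary cutoff (by default, words occurring at least 5 times). A small sketch of how such corpus statistics are computed, shown here on a toy string rather than the full 100 MB file:

```python
from collections import Counter

def corpus_stats(text, min_count=5):
    """Return (characters, words, unique words, unique frequent words),
    where 'frequent' means occurring at least min_count times."""
    words = text.split()
    counts = Counter(words)
    frequent = [w for w, c in counts.items() if c >= min_count]
    return len(text), len(words), len(counts), len(frequent)

# Toy corpus for illustration; on text8 with min_count=5 this kind of
# count yields the 253,854 unique / 71,290 frequent figures above.
sample = "the quick fox the lazy fox the fox"
print(corpus_stats(sample, min_count=3))
```

Running the same counting over text8 itself reproduces the numbers reported by the example's log output.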
The file questions-words.txt contains roughly 20,000 manually curated word-analogy questions, grouped into categories: capital-common-countries (Athens Greece Baghdad Iraq), capital-world (Abuja Nigeria Accra Ghana), currency (Algeria dinar Argentina peso), city-in-state, family, gram1-adjective-to-adverb, gram2-opposite, gram3-comparative, gram4-superlative (bad worst big biggest), gram5-present-participle, gram6-nationality-adjective, gram7-past-tense, gram8-plural, and gram9-plural-verbs (decrease decreases describe describes).
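Each line is an analogy a : b :: c : d, and the standard evaluation answers it by computing vector(b) - vector(a) + vector(c) and taking the nearest neighbor by cosine similarity, excluding a, b, and c. A minimal sketch with made-up toy vectors (the real model uses the learned embeddings):

```python
import numpy as np

# Toy embedding table; the vectors are invented purely for illustration.
emb = {
    "athens":  np.array([1.0, 0.0, 0.2]),
    "greece":  np.array([1.0, 1.0, 0.2]),
    "baghdad": np.array([0.0, 0.1, 1.0]),
    "iraq":    np.array([0.0, 1.1, 1.0]),
}

def answer_analogy(a, b, c):
    """Answer a : b :: c : ? via b - a + c, nearest neighbor by cosine."""
    target = emb[b] - emb[a] + emb[c]
    best, best_sim = None, -2.0
    for w, v in emb.items():
        if w in (a, b, c):
            continue  # the question words themselves are excluded
        sim = np.dot(target, v) / (np.linalg.norm(target) * np.linalg.norm(v))
        if sim > best_sim:
            best, best_sim = w, sim
    return best

print(answer_analogy("athens", "greece", "baghdad"))  # -> iraq
```

A question counts as correct only if the top-ranked word matches d exactly, which is the accuracy figure the example prints after each epoch.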
```
cd tensorflow
wget http://mattmahoney.net/dc/text8.zip -O text8.gz
gzip -d text8.gz -f
wget https://word2vec.googlecode.com/svn/trunk/questions-words.txt
bazel build -c opt tensorflow/models/embedding:all
```
which results in
```
root@fb729273837c:/tensorflow# bazel build -c opt tensorflow/models/embedding:all
INFO: Reading 'startup' options from /root/.bazelrc: --batch
INFO: Found 10 targets...
INFO: Elapsed time: 10.615s, Critical Path: 2.25s
```

After that we can start the example by using the manual command from the README. The tutorial code has two multi-threaded versions of word2vec, a batched and an unbatched skip-gram model:

* word2vec.py - a version of word2vec implemented using TensorFlow ops and minibatching.
* word2vec_optimized.py - a version of word2vec implemented using C ops that does no minibatching.
which then produces

```
root@fb729273837c:/tensorflow# time bazel-bin/tensorflow/models/embedding/word2vec_optimized --train_data=text8 --eval_data=questions-words.txt --save_path=/tmp/
I tensorflow/core/common_runtime/local_device.cc:25] Local device intra op parallelism threads: 8
I tensorflow/core/common_runtime/local_session.cc:45] Local session inter op parallelism threads: 8
I tensorflow/models/embedding/word2vec_kernels.cc:134] Data file: text8 contains 100000000 bytes, 17005207 words, 253854 unique words, 71290 unique frequent words.
Data file: text8
Vocab size: 71290 + UNK
Words per epoch: 17005207
Eval analogy file: questions-words.txt
Questions: 17827
Skipped: 1717
Epoch  1 Step  151322: lr = 0.023 words/sec =  34117
Eval  1554/17827 accuracy =  8.7%
Epoch  2 Step  302660: lr = 0.022 words/sec =   3900
Eval  2302/17827 accuracy = 12.9%
Epoch  3 Step  453986: lr = 0.020 words/sec =  32707
Eval  3049/17827 accuracy = 17.1%
Epoch  4 Step  605329: lr = 0.018 words/sec =  11805
Eval  3528/17827 accuracy = 19.8%
Epoch  5 Step  756656: lr = 0.017 words/sec = 126655
Eval  4055/17827 accuracy = 22.7%
Epoch  6 Step  907954: lr = 0.015 words/sec =  66275
Eval  4434/17827 accuracy = 24.9%
Epoch  7 Step 1059303: lr = 0.013 words/sec = 125780
Eval  4737/17827 accuracy = 26.6%
Epoch  8 Step 1210621: lr = 0.012 words/sec = 123938
Eval  5042/17827 accuracy = 28.3%
Epoch  9 Step 1361968: lr = 0.010 words/sec =  89538
Eval  5335/17827 accuracy = 29.9%
Epoch 10 Step 1513319: lr = 0.008 words/sec =  48258
Eval  5621/17827 accuracy = 31.5%
Epoch 11 Step 1664661: lr = 0.007 words/sec = 113623
Eval  5812/17827 accuracy = 32.6%
Epoch 12 Step 1815978: lr = 0.005 words/sec =  58567
Eval  6053/17827 accuracy = 34.0%
Epoch 13 Step 1967289: lr = 0.003 words/sec =  81122
Eval  6203/17827 accuracy = 34.8%
Epoch 14 Step 2118655: lr = 0.002 words/sec =  68519
Eval  6291/17827 accuracy = 35.3%
Epoch 15 Step 2269981: lr = 0.000 words/sec =  64780
Eval  6366/17827 accuracy = 35.7%

real    36m4.861s
user    240m20.464s
sys     24m18.860s
root@fb729273837c:/tensorflow#
```
The final accuracy of TensorFlow's word2vec_optimized.py on the text8 corpus with questions-words.txt is 35.7%. The result is not deterministic and varies slightly from run to run.
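The reported percentage is simply correct answers over attempted questions, where the 1,717 skipped questions are those containing words outside the 71,290-word vocabulary. A quick check of the final Eval line:

```python
# Accuracy = correct analogies / attempted questions, from the log above.
correct, questions, skipped = 6366, 17827, 1717
accuracy = 100.0 * correct / questions
print(f"{accuracy:.1f}%")          # 35.7%
print(questions + skipped)         # questions read from questions-words.txt
```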
We can also see that the optimized version (word2vec_optimized.py) is highly efficient and uses around 90-100% of all CPU cores, whereas the plain version (word2vec.py) is much slower and barely reaches 40% CPU utilization.
- text8 - text8 corpus by Matt Mahoney
- word2vec - computing continuous distributed representations of words
- word2vec@chalow - Running word2vec on a MacBook Air (OS X 10.9.2) (in Japanese)
- word2vec@cnblogs - Playing with Google's open-source deep-learning project word2vec (in Chinese)
- Word2Vec&GloVe - Getting Started with Word2Vec and GloVe in Python
- Books&ngrams - Google Books ngram viewer
- word2vec&parallel - Interesting benchmark on parallelizing word2vec in Python
- word2vec - explained with examples