# Word2Vec Tutorial

In case you missed the buzz, word2vec is widely featured as a member of the “new wave” of machine learning algorithms based on neural networks, commonly referred to as "deep learning" (though word2vec itself is rather shallow). Using large amounts of unannotated plain text, word2vec learns relationships between words automatically. The output is a set of vectors, one per word, with remarkable linear relationships that allow us to do things like vec(“king”) – vec(“man”) + vec(“woman”) =~ vec(“queen”), or vec(“Montreal Canadiens”) – vec(“Montreal”) + vec(“Toronto”) =~ vec(“Toronto Maple Leafs”).

Word2vec is very useful in [automatic text tagging](https://github.com/RaRe-Technologies/movie-plots-by-genre), recommender systems and machine translation.

Check out an [online word2vec demo](http://radimrehurek.com/2014/02/word2vec-tutorial/#app) where you can try this vector algebra for yourself. That demo runs `word2vec` on the Google News dataset, of **about 100 billion words**.

## This tutorial

In this tutorial you will learn how to train and evaluate word2vec models on your business data.


## Preparing the Input
Starting from the beginning, gensim’s `word2vec` expects a sequence of sentences as its input. Each sentence is a list of words (utf8 strings):

In [1]:
import warnings
warnings.filterwarnings('ignore')

# import modules & set up logging
import gensim
import os

In [2]:
sentences = [['first', 'sentence'], ['second', 'sentence']]

In [3]:
# generate the Word2Vec model
model = gensim.models.Word2Vec(sentences, min_count=1)

In [4]:
print(model)
print(model.wv.vocab)

Word2Vec(vocab=3, size=100, alpha=0.025)
{'first': <gensim.models.keyedvectors.Vocab object at 0x10f8bdba8>, 'sentence': <gensim.models.keyedvectors.Vocab object at 0x10f8bd390>, 'second': <gensim.models.keyedvectors.Vocab object at 0x10f94fd30>}


Say we want to further preprocess the words from the input files: convert to unicode, lowercase, remove numbers, extract named entities… All of this can be done inside a custom sentence iterator (here called `MySentences`), and `word2vec` doesn’t need to know. All that is required is that the input yields one sentence (a list of utf8 words) after another.
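A minimal sketch of such an iterator is shown here, assuming a directory of plain-text files with one sentence per line. The class body, the directory layout and the digit-stripping rule are illustrative assumptions, not part of gensim:

```python
import os
import re


class MySentences(object):
    """Stream preprocessed sentences from every file in a directory (assumed layout)."""

    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fname in sorted(os.listdir(self.dirname)):
            path = os.path.join(self.dirname, fname)
            if not os.path.isfile(path):
                continue  # skip sub-directories
            with open(path, encoding='utf8') as fin:
                for line in fin:
                    # lowercase, then drop every token that contains a digit
                    tokens = [tok for tok in line.lower().split()
                              if not re.search(r'\d', tok)]
                    if tokens:  # skip lines that end up empty
                        yield tokens


# Hypothetical usage; '/some/directory' is a placeholder path:
# sentences = MySentences('/some/directory')
# model = gensim.models.Word2Vec(sentences)
```

Because `__iter__` reopens the files on every call, the object can be iterated several times, and only one line is held in memory at a time.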

**Note to advanced users:** calling `Word2Vec(sentences, iter=1)` will run **two** passes over the sentences iterator (in general, `iter+1` passes; the default is `iter=5`, to match Google's C implementation of word2vec):
  1. The first pass collects words and their frequencies to build an internal dictionary tree structure.
  2. The second pass (and any further passes, when `iter` is greater than 1) trains the neural model.

These two passes can also be initiated manually, in case your input stream is non-repeatable (you can only afford one pass), and you’re able to initialize the vocabulary some other way:

In [5]:
# build the same model, making the 2 steps explicit
new_model = gensim.models.Word2Vec(min_count=1)  # an empty model, no training
new_model.build_vocab(sentences)                 # can be a non-repeatable, 1-pass generator
new_model.train(sentences, total_examples=new_model.corpus_count, epochs=new_model.iter)  # can be a non-repeatable, 1-pass generator

(1, 20)

In [6]:
print(new_model)
print(new_model.wv.vocab)

Word2Vec(vocab=3, size=100, alpha=0.025)
{'first': <gensim.models.keyedvectors.Vocab object at 0x10f8bdba8>, 'sentence': <gensim.models.keyedvectors.Vocab object at 0x10f8bd390>, 'second': <gensim.models.keyedvectors.Vocab object at 0x10f94fd30>}
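The difference between a repeatable input and a one-pass stream, which matters for the multi-pass behaviour described above, can be seen without gensim at all. The following is a small sketch; the names `one_pass_stream`, `RepeatableStream` and `count_passes` are made up for illustration:

```python
def one_pass_stream():
    """A generator: it can be consumed exactly once."""
    yield ['first', 'sentence']
    yield ['second', 'sentence']


class RepeatableStream(object):
    """An iterable: every call to __iter__ starts a fresh pass."""

    def __iter__(self):
        return one_pass_stream()


def count_passes(stream, passes):
    """Return how many sentences each successive pass over `stream` sees."""
    return [sum(1 for _ in stream) for _ in range(passes)]


generator_counts = count_passes(one_pass_stream(), passes=2)
iterable_counts = count_passes(RepeatableStream(), passes=2)
print(generator_counts)  # [2, 0]: the second pass finds nothing left
print(iterable_counts)   # [2, 2]: each pass sees both sentences
```

A vocabulary pass followed by one or more training passes therefore needs either an object like `RepeatableStream` or a separate stream for each step.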


### More data would be nice
For the following examples, we'll use the [Lee Corpus](https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/test/test_data/lee_background.cor) (which you already have if you've installed gensim):

In [7]:
# Set file names for train and test data
test_data_dir = os.path.join(gensim.__path__[0], 'test', 'test_data') + os.sep
lee_train_file = test_data_dir + 'lee_background.cor'

In [8]:
class MyText(object):
    def __iter__(self):
        # the `with` block closes the file once the pass is finished
        with open(lee_train_file) as fin:
            for line in fin:
                # assume there's one document per line, tokens separated by whitespace
                yield line.lower().split()

sentences = MyText()

print(sentences)

<__main__.MyText object at 0x11e99a2e8>


## Training
`Word2Vec` accepts several parameters that affect both training speed and quality.

### min_count
`min_count` is for pruning the internal dictionary. Words that appear only once or twice in a billion-word corpus are probably uninteresting typos and garbage. In addition, there’s not enough data to learn anything meaningful about those words, so it’s best to ignore them:

In [9]:
# default value of min_count=5
model = gensim.models.Word2Vec(sentences, min_count=10)
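The effect of this pruning can be sketched in plain Python. This is only an illustration of the idea (count the words, then drop those seen fewer than `min_count` times), not gensim's actual code; `prune_vocab` is a hypothetical helper:

```python
from collections import Counter


def prune_vocab(sentences, min_count):
    """Count word frequencies and keep words seen at least `min_count` times."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word: freq for word, freq in counts.items() if freq >= min_count}


toy_sentences = [['first', 'sentence'], ['second', 'sentence']]
print(prune_vocab(toy_sentences, min_count=1))  # all three words are kept
print(prune_vocab(toy_sentences, min_count=2))  # only 'sentence' survives
```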

### size
`size` is the number of dimensions (N) of the N-dimensional space that gensim Word2Vec maps the words onto.

Bigger size values require more training data, but can lead to better (more accurate) models. Reasonable values are in the tens to hundreds.

In [10]:
# default value of size=100
model = gensim.models.Word2Vec(sentences, size=200)

### workers
`workers`, the last of the major parameters (full list [here](http://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec)) is for training parallelization, to speed up training:

In [11]:
# default value of workers=3
model = gensim.models.Word2Vec(sentences, workers=4)

The `workers` parameter only has an effect if you have [Cython](http://cython.org/) installed. Without Cython, you’ll only be able to use one core because of the [GIL](https://wiki.python.org/moin/GlobalInterpreterLock) (and `word2vec` training will be [miserably slow](http://rare-technologies.com/word2vec-in-python-part-two-optimizing/)).

## Memory
At its core, `word2vec` model parameters are stored as matrices (NumPy arrays). Each array is **#vocabulary** (controlled by the `min_count` parameter) times **#size** (the `size` parameter) of floats (single precision, i.e. 4 bytes each).

Three such matrices are held in RAM (work is underway to reduce that number to two, or even one). So if your input contains 100,000 unique words, and you asked for layer `size=200`, the model will require approx. `100,000*200*4*3 bytes = 240,000,000 bytes`, or ~229MB when a megabyte is counted as 2^20 bytes.

There’s a little extra memory needed for storing the vocabulary tree (100,000 words would take a few megabytes), but unless your words are extremely loooong strings, memory footprint will be dominated by the three matrices above.
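The estimate above is simple enough to wrap in a helper for trying other vocabulary sizes. This sketch only reproduces the arithmetic from the text; `estimate_matrix_bytes` is a hypothetical name, and the small vocabulary-tree overhead is ignored:

```python
def estimate_matrix_bytes(vocab_size, vector_size, n_matrices=3, bytes_per_float=4):
    """Bytes taken by the model's float matrices, per the formula in the text."""
    return vocab_size * vector_size * bytes_per_float * n_matrices


total_bytes = estimate_matrix_bytes(vocab_size=100000, vector_size=200)
print(total_bytes)            # 240000000
print(total_bytes / 2 ** 20)  # roughly 228.9, i.e. the ~229MB quoted above
```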

## Evaluating
Because `Word2Vec` training is an unsupervised task, there’s no good way to objectively evaluate the result. Evaluation depends on your end application.

Google has released their testing set of about 20,000 syntactic and semantic test examples, following the “A is to B as C is to D” task. It is provided in the 'datasets' folder.

For example, a syntactic analogy of the comparative type is bad:worse;good:?. There are a total of 9 types of syntactic comparisons in the dataset, such as plural nouns and nouns of opposite meaning.

The semantic questions contain five types of semantic analogies, such as capital cities (Paris:France;Tokyo:?) or family members (brother:sister;dad:?). 

Gensim supports the same evaluation set, in exactly the same format:

In [12]:
model.accuracy('./datasets/questions-words.txt')

[{'section': 'capital-common-countries', 'correct': [], 'incorrect': []},
 {'section': 'capital-world', 'correct': [], 'incorrect': []},
 {'section': 'currency', 'correct': [], 'incorrect': []},
 {'section': 'city-in-state', 'correct': [], 'incorrect': []},
 {'section': 'family',
  'correct': [],
  'incorrect': [('HE', 'SHE', 'HIS', 'HER'), ('HIS', 'HER', 'HE', 'SHE')]},
 {'section': 'gram1-adjective-to-adverb', 'correct': [], 'incorrect': []},
 {'section': 'gram2-opposite', 'correct': [], 'incorrect': []},
 {'section': 'gram3-comparative',
  'correct': [],
  'incorrect': [('GOOD', 'BETTER', 'GREAT', 'GREATER'),
   ('GOOD', 'BETTER', 'LONG', 'LONGER'),
   ('GOOD', 'BETTER', 'LOW', 'LOWER'),
   ('GREAT', 'GREATER', 'LONG', 'LONGER'),
   ('GREAT', 'GREATER', 'LOW', 'LOWER'),
   ('GREAT', 'GREATER', 'GOOD', 'BETTER'),
   ('LONG', 'LONGER', 'LOW', 'LOWER'),
   ('LONG', 'LONGER', 'GOOD', 'BETTER'),
   ('LONG', 'LONGER', 'GREAT', 'GREATER'),
   ('LOW', 'LOWER', 'GOOD', 'BETTER'),
   ('LOW', 

This `accuracy` method takes an [optional parameter](http://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec.accuracy) `restrict_vocab`, which limits which test examples are to be considered.
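The returned structure is a list of per-section dictionaries, so it is easy to condense into one accuracy figure per section. The helper below is a hypothetical sketch that assumes only the `section`, `correct` and `incorrect` keys visible in the output above; the `example` list is a hand-written stand-in, not real results:

```python
def summarize_accuracy(sections):
    """Map each section name to (n_correct, n_total, fraction_correct or None)."""
    summary = {}
    for section in sections:
        n_correct = len(section['correct'])
        n_total = n_correct + len(section['incorrect'])
        # sections with no evaluated questions get None instead of a fraction
        fraction = n_correct / n_total if n_total else None
        summary[section['section']] = (n_correct, n_total, fraction)
    return summary


# Hand-written stand-in mirroring the structure printed above.
example = [
    {'section': 'currency', 'correct': [], 'incorrect': []},
    {'section': 'family', 'correct': [],
     'incorrect': [('HE', 'SHE', 'HIS', 'HER'), ('HIS', 'HER', 'HE', 'SHE')]},
]
print(summarize_accuracy(example))
```

Empty sections, as in most of the output above, mean that no question in that section could be evaluated, which is worth reporting separately from a score of zero.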



In the December 2016 release of Gensim we added a better way to evaluate semantic similarity.

By default it uses the academic dataset WS-353, but you can create a dataset specific to your business based on it. It contains word pairs together with human-assigned similarity judgments, and measures the relatedness or co-occurrence of two words. For example, 'coast' and 'shore' are very similar as they appear in the same context. At the same time, 'clothes' and 'closet' are less similar because they are related but not interchangeable.

In [13]:
model.evaluate_word_pairs(test_data_dir + 'wordsim353.tsv')

((0.11448905088605649, 0.42372165769397785),
 SpearmanrResult(correlation=0.16033487988065467, pvalue=0.26104545934340595),
 85.55240793201133)
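The result appears to hold a Pearson correlation with its p-value, a Spearman rank correlation, and the percentage of word pairs skipped because a word was missing from the vocabulary (this reading of the tuple is an assumption based on the printed output). To make the rank correlation concrete, here is a tie-free sketch in plain Python; the scores are made-up numbers, not WS-353 data:

```python
def ranks(values):
    """Rank values from 1 (smallest) upward; assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(xs, ys):
    """Spearman rank correlation for tie-free data."""
    n = len(xs)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Made-up numbers for illustration only: human scores vs. model similarities.
human_scores = [9.1, 7.5, 3.2, 1.0]
model_scores = [0.80, 0.85, 0.30, 0.10]
print(spearman(human_scores, model_scores))  # close to 0.8: one adjacent pair is swapped
```

A value near 1 means the model orders the word pairs much as the human annotators did; the low value in the output above reflects the tiny training corpus.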

Once again, **good performance on Google's or WS-353 test set doesn’t mean word2vec will work well in your application, or vice versa**. It’s always best to evaluate directly on your intended task. For an example of how to use word2vec in a classifier pipeline, see this [tutorial](https://github.com/RaRe-Technologies/movie-plots-by-genre).

## Storing and loading models
You can store/load models using the standard gensim methods:

In [14]:
from tempfile import mkstemp

fs, temp_path = mkstemp("gensim_temp")  # creates a temp file

model.save(temp_path)  # save the model

In [15]:
new_model = gensim.models.Word2Vec.load(temp_path)  # open the model

These methods use pickle internally, optionally `mmap`’ing the model’s internal large NumPy matrices into virtual memory directly from disk files, for inter-process memory sharing.

In addition, you can load models created by the original C tool, both using its text and binary formats:
```
  model = gensim.models.KeyedVectors.load_word2vec_format('/tmp/vectors.txt', binary=False)
  # using gzipped/bz2 input works too, no need to unzip:
  model = gensim.models.KeyedVectors.load_word2vec_format('/tmp/vectors.bin.gz', binary=True)
```

## Online training / Resuming training
Advanced users can load a model and continue training it with more sentences and [new vocabulary words](online_w2v_tutorial.ipynb):

In [16]:
model = gensim.models.Word2Vec.load(temp_path)
more_sentences = [['Advanced', 'users', 'can', 'load', 'a', 'model', 'and', 'continue', 'training', 'it', 'with', 'more', 'sentences']]
model.build_vocab(more_sentences, update=True)
model.train(more_sentences, total_examples=model.corpus_count, epochs=model.iter)

# cleaning up temp
os.close(fs)
os.remove(temp_path)

You may need to tweak the `total_words` parameter to `train()`, depending on what learning rate decay you want to simulate.
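To see why the declared corpus size affects the decay, consider a linear schedule that lowers the learning rate from a starting value to a floor as training progresses. The sketch below is a simplified model of that idea, not gensim's code; the starting value 0.025 matches the `alpha` printed earlier, while the floor of 0.0001 and the function name are assumptions:

```python
def decayed_alpha(words_seen, total_words, start_alpha=0.025, end_alpha=0.0001):
    """Linearly decay the learning rate as training progress goes from 0 to 1."""
    progress = min(1.0, words_seen / total_words)
    return max(end_alpha, start_alpha - (start_alpha - end_alpha) * progress)


# Declaring a larger total makes the rate fall more slowly for the same input.
print(decayed_alpha(words_seen=500, total_words=1000))
print(decayed_alpha(words_seen=500, total_words=2000))
```

Under this model, overstating the total keeps the learning rate high for the new sentences, and understating it drives the rate to the floor early.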

Note that it’s not possible to resume training with models generated by the C tool and loaded via `KeyedVectors.load_word2vec_format()`. You can still use them for querying/similarity, but information vital for training (the vocab tree) is missing there.

## Using the model
`Word2Vec` supports several word similarity tasks out of the box:

In [17]:
model.wv.most_similar(positive=['human', 'crime'], negative=['party'], topn=1)

[('2002', 0.9785429239273071)]
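The `positive`/`negative` query corresponds to the vector arithmetic from the introduction: add the positive word vectors, subtract the negative ones, and rank the remaining words by cosine similarity to the result. The following is a rough sketch of that idea on hand-made 3-dimensional vectors; `analogy_most_similar` and `toy_vectors` are illustrative names, and gensim's implementation may differ in detail:

```python
import math


def unit(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def analogy_most_similar(vectors, positive, negative, topn=1):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dims = len(next(iter(vectors.values())))
    query = [0.0] * dims
    signed = [(w, 1.0) for w in positive] + [(w, -1.0) for w in negative]
    for word, sign in signed:
        for i, x in enumerate(unit(vectors[word])):
            query[i] += sign * x
    query = unit(query)
    # the query words themselves are excluded from the candidates
    scored = [(word, sum(q * x for q, x in zip(query, unit(vec))))
              for word, vec in vectors.items()
              if word not in positive and word not in negative]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Hand-made toy vectors; the axes loosely mean (royalty, male, female).
toy_vectors = {
    'king': [1.0, 1.0, 0.0],
    'man': [0.0, 1.0, 0.0],
    'woman': [0.0, 0.0, 1.0],
    'queen': [1.0, 0.0, 1.0],
    'apple': [0.2, 0.2, 0.2],
}
result = analogy_most_similar(toy_vectors, positive=['king', 'woman'], negative=['man'])
print(result)  # 'queen' ranks first for these hand-made vectors
```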

In [18]:
model.wv.doesnt_match("input is lunch he sentence cat".split())

'sentence'

In [19]:
print(model.wv.similarity('human', 'party'))
print(model.wv.similarity('tree', 'murder'))

0.99885035
0.99659353


You can get the probability distribution for the center word given the context words as input:

In [20]:
print(model.predict_output_word(['emergency', 'beacon', 'received']))

[('more', 0.0008478443), ('can', 0.0008344142), ('training', 0.00077405263), ('first', 0.0007672186), ('there', 0.0007667362), ('government', 0.0007633608), ('australia', 0.00073514367), ('al', 0.0007349143), ('forces', 0.00073407707), ('continue', 0.0007311217)]


The results here don't look good because the training corpus is very small. To get meaningful results one needs to train on 500k+ words.

If you need the raw output vectors in your application, you can access these either on a word-by-word basis:

In [21]:
model.wv['tree']  # raw NumPy vector of a word

array([ 3.9231010e-02, -3.6319464e-02,  1.7520290e-02, -3.7792906e-02,
        3.0579556e-02,  3.0988628e-02, -4.9328292e-03, -8.9142382e-02,
       -9.5751740e-02,  6.5454260e-02, -8.1723956e-03,  5.6888893e-02,
       -2.8740758e-02, -4.6804454e-02, -2.5754213e-02,  3.6591962e-02,
       -3.8776644e-02,  2.5568964e-02, -7.2662503e-02, -1.6003391e-02,
        1.2712982e-01,  1.1695630e-03, -6.1111145e-02,  2.4065810e-04,
       -2.5093846e-02,  1.4823681e-02,  4.1186783e-02, -6.9799438e-02,
        4.2318660e-03, -5.6799226e-03, -7.5128943e-02, -2.4081195e-02,
        1.5295460e-02,  6.2460154e-02,  9.3654757e-03, -6.0983947e-03,
       -2.2662973e-02, -8.1043504e-03, -2.3061233e-02, -2.2140717e-02,
       -3.0015498e-02,  2.5920236e-02,  4.1723013e-02,  4.0615454e-02,
       -6.7355074e-02,  1.0661065e-02, -9.8679446e-02, -7.8338370e-02,
       -6.7834608e-02,  4.0874865e-02, -4.4954631e-02, -4.0322137e-03,
        5.1951963e-02, -1.1869692e-01, -7.0454210e-02,  5.7762589e-02,
      