# Word Vector learning with Word2Vec

This notebook will help you get started with the skip-gram and CBOW models of word vector learning. Here we'll be using the `gensim` toolkit for Python, which reimplements the code for training and evaluating the models. Feel free to play with the original C code at https://code.google.com/archive/p/word2vec/ as well.

The first step is to load the library into Python. Note that you will need `gensim` and its dependencies, as well as `nltk`, installed in your environment. If you see an error when importing these below, you'll need to `pip install` the missing packages.

In [1]:
import gensim 
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

ImportError: No module named gensim

For this exercise we will use the Brown corpus (or a subset of it). Note that for the word vectors to be meaningful you need much more text than this: in general you would use a much bigger corpus, e.g., all of Wikipedia, and train the model for hours or days. Training on that much text is slow, however, so we'll make do with a small corpus for the moment.

In [2]:
import nltk
corpus = nltk.corpus.brown.sents() # make the corpus smaller if you want the code to run more quickly

LookupError: 
**********************************************************************
  Resource u'corpora/brown' not found.  Please use the NLTK
  Downloader to obtain the resource:  >>> nltk.download()
  Searched in:
    - '/Users/tcohn/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
**********************************************************************
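
If training is too slow, one way to shrink the corpus is to keep only the first few thousand sentences. The sketch below is a minimal illustration, not part of the original notebook: the helper name `take_sentences` and the toy sentences are made up for the example, and it works on any iterable of tokenised sentences.

```python
from itertools import islice

def take_sentences(sentences, limit):
    """Return the first `limit` sentences as a list of token lists."""
    # Materialising a list lets the training code iterate over it several times.
    return [list(sentence) for sentence in islice(sentences, limit)]

# Hypothetical stand-in for nltk.corpus.brown.sents()
toy_sentences = [["the", "cat", "sat"], ["a", "dog", "ran"], ["birds", "fly"]]
small_corpus = take_sentences(toy_sentences, 2)
print(len(small_corpus))
```

With the real data you would call it as `take_sentences(nltk.corpus.brown.sents(), 10000)`, where 10000 is an arbitrary choice. If the corpus is missing, as in the error above, `nltk.download('brown')` fetches it.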

This next line trains the model on the corpus, modelling all words that occur in the corpus. You can prune the vocabulary using the 'min_count' option, such that only words seen several times are included. This helps to make things faster and avoid modelling what might be typographical errors or other noise in the dataset.

In [None]:
# workers = number of threads; epochs = passes over the corpus (named iter before gensim 4.0)
model = gensim.models.Word2Vec(corpus, min_count=1, workers=2, epochs=5)
# takes a minute or so
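
To make the effect of 'min_count' concrete, here is a small pure-Python sketch of the pruning idea: count every word, then keep only those seen at least 'min_count' times. It is an illustration under simple assumptions, not gensim's actual code; the function name `prune_vocabulary` and the toy corpus are invented for the example.

```python
from collections import Counter

def prune_vocabulary(sentences, min_count):
    """Return the set of words that occur at least `min_count` times."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, n in counts.items() if n >= min_count}

# Hypothetical toy corpus in the same shape as brown.sents(): a list of token lists
toy_corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["teh", "cat", "ran"],  # "teh" is a typo that appears only once
]

print(sorted(prune_vocabulary(toy_corpus, min_count=2)))
```

With `min_count=2` the one-off typo "teh" drops out along with the other words seen only once, which is the noise-filtering effect described above.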

To see the various options for training the model, please read the help (see below, this will pop up a help window). For example, the 'sg' option lets you switch between the CBOW model (sg=0, the default) and the skip-gram model (sg=1). You can compare the two by rerunning the training command above with this option. Consider 'size' also (named 'vector_size' from gensim 4.0 onwards), which sets the number of hidden dimensions, i.e., the length of each word vector. Many other parameters can also have a big effect on model quality and runtime.

In [None]:
gensim.models.Word2Vec?
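
The difference between the two models selected by 'sg' is easiest to see in the training examples they build from a sentence. The sketch below is a simplified illustration (the function names are made up, and real implementations add details such as subsampling and negative sampling that are left out here): skip-gram predicts each context word from the centre word, while CBOW predicts the centre word from all of its context words at once.

```python
def training_examples(sentence, window=2):
    """Yield (centre, context_words) for each position in a tokenised sentence."""
    for i, centre in enumerate(sentence):
        left = sentence[max(0, i - window):i]
        right = sentence[i + 1:i + 1 + window]
        yield centre, left + right

def skipgram_pairs(sentence, window=2):
    # skip-gram: one (input=centre word, target=context word) pair per context word
    return [(centre, c)
            for centre, context in training_examples(sentence, window)
            for c in context]

def cbow_pairs(sentence, window=2):
    # CBOW: one (input=all context words, target=centre word) example per position
    return [(tuple(context), centre)
            for centre, context in training_examples(sentence, window)]

sentence = ["the", "quick", "brown", "fox"]
for pair in skipgram_pairs(sentence, window=1):
    print("skip-gram:", pair)
for pair in cbow_pairs(sentence, window=1):
    print("CBOW:", pair)
```

Skip-gram produces more, smaller training examples from the same text, which is one reason the two settings can differ in both speed and quality.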

Before going any further, you probably want to save your model to avoid the need to re-run the slow training step.

In [None]:
model.save('brown_skipgram.model')

## Evaluation

Now let's interrogate the model to see how similar various words are. Please try out some words yourself, to see whether it has uncovered sensible or otherwise interesting relations.

In [None]:
words = "woman women man girl boy green blue did".split()
for w1 in words:
    for w2 in words:
        print(w1, w2, model.wv.similarity(w1, w2))
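
The similarity score reported here is the cosine of the angle between the two word vectors. As a rough sketch of what that means, here it is computed by hand on made-up two-dimensional vectors; the function name and the vectors are illustrative only.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors, not taken from any trained model
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

Values near 1 mean the vectors point the same way, values near 0 mean they are unrelated, and negative values mean they point in opposing directions. The length of the vectors does not matter, only their direction.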

It's a little easier to see if we plot this in a graph. Let's form a matrix with the words as rows and columns, and cells being the similarity values. We can then display this as if it were an image, and see clusters of items more clearly.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
M = np.zeros((len(words), len(words)))
for i, w1 in enumerate(words):
    for j, w2 in enumerate(words):
        M[i, j] = model.wv.similarity(w1, w2)

plt.imshow(M, interpolation='nearest')
plt.colorbar()

# place one tick per word so each label lines up with its row and column
ax = plt.gca()
ax.set_xticks(range(len(words)))
ax.set_xticklabels(words, rotation=45)
ax.set_yticks(range(len(words)))
ax.set_yticklabels(words)

You can also find the 'k' most similar words to a given word. Expect a lot of gibberish from a model trained on a small corpus. Try increasing the size of the training data to see the difference.

In [None]:
model.wv.most_similar(positive=['woman'], topn=10)

Have more of a play with the 'most_similar' method, using the 'negative' option to perform vector difference in order to evaluate king-man+woman = ??? and similar.

In [None]:
# your code goes here...
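
To see what the 'positive' and 'negative' options are doing, here is a hand-rolled sketch of the vector arithmetic on a tiny made-up vocabulary. Everything in it is hypothetical: the two dimensions loosely stand for "royalty" and "gender", the vectors are invented rather than learned, and gensim's own implementation differs in details such as normalising the input vectors.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(vectors, positive, negative):
    """Add the positive vectors, subtract the negative ones, return the nearest other word."""
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    # Exclude the query words themselves, otherwise they tend to win trivially
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Hypothetical toy vectors: (royalty, gender)
toy_vectors = {
    "king":  [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.0],
}

print(analogy(toy_vectors, positive=["king", "woman"], negative=["man"]))
```

Here king - man + woman lands exactly on the toy vector for "queen". With real learned vectors the result is only approximately right, which is why the nearest neighbour is used.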

## Quantitative evaluation

For more quantitative evaluation, download this file to the current working directory http://word2vec.googlecode.com/svn/trunk/questions-words.txt and see if your model gets many of these values correct. You will want to train on the full Brown corpus to give the model enough data to learn some of these relations.

In [None]:
# download the file (Python 3: urllib2 became urllib.request)
from urllib.request import urlopen
with urlopen('http://word2vec.googlecode.com/svn/trunk/questions-words.txt') as uin:
    with open('questions-words.txt', 'wb') as fout:  # the response body is bytes
        fout.write(uin.read())

In [None]:
# evaluate the model, warning - this can be slow
# (evaluate_word_analogies replaces the older accuracy method in recent gensim)
model.wv.evaluate_word_analogies('questions-words.txt')
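
It can help to look at the questions yourself before trusting a single accuracy number. The sketch below assumes the file's layout is a series of sections, each starting with a ': name' header line and followed by lines of four words "a b c d" (read as "a is to b as c is to d"). The parser name and the sample lines are illustrative and written in that assumed format rather than read from the file.

```python
def parse_analogy_questions(lines):
    """Group four-word analogy questions under their ': section' headers."""
    sections = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(":"):
            current = line[1:].strip()
            sections[current] = []
        else:
            words = line.split()
            # Skip malformed lines and any questions that precede the first header
            if len(words) == 4 and current is not None:
                sections[current].append(tuple(words))
    return sections

# Illustrative sample in the assumed format
sample = """: capital-common-countries
Athens Greece Baghdad Iraq
: family
boy girl brother sister
boy girl king queen
""".splitlines()

for name, questions in parse_analogy_questions(sample).items():
    print(name, len(questions))
```

For the real file you could pass `open('questions-words.txt')` in place of `sample`, then check how many of each section's words your model actually knows.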

## Using 'big data' 

Finally, let's use a model trained on masses of data. This is courtesy of Google Research, who trained a 300-dimensional model on terabytes of text and released the trained model files. NLTK includes a vocabulary-restricted sample of these vectors, which is fairly compact.

In [None]:
from nltk.data import find
word2vec_sample = str(find('models/word2vec_sample/pruned.word2vec.txt'))
# pre-trained vectors are loaded through KeyedVectors in current gensim
big_model = gensim.models.KeyedVectors.load_word2vec_format(word2vec_sample, binary=False)

How does it differ from the small model? Try some of the above tests to see whether the vectors appear to better capture the characteristics of words.

In [None]:
big_model.most_similar(positive=['woman'], topn=10)

Try the quantitative evaluation above for the big model. Does it perform better? Come up with some tests for different relation types, e.g., city-state, place-sports team, hypernym-hyponym, tense, number, etc., and see if the model can capture these well.

In [None]:
# your code goes here...

You can play with the full set of learned vectors, although be warned it's a rather big file https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit

Or you might want to train your own vectors on your own corpus. 