## Google’s Word2Vec Embedding

An alternative to creating your own embedding is to simply use an existing pre-trained word embedding. Along with the paper and code for Word2Vec, Google also published a pre-trained Word2Vec model on the Word2Vec Google Code Project. 

A pre-trained model is nothing more than a file containing tokens and their associated word vectors. The pre-trained Google Word2Vec model was trained on Google News data (about 100 billion words); it contains vectors for 3 million words and phrases, each with 300 dimensions. The file is about 1.53 GB.
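To make the "tokens plus vectors" idea concrete, here is a minimal sketch that parses the plain-text variant of the word2vec format: a header line with the vocabulary size and dimensionality, followed by one token per line with its vector components. The three-word sample and the `parse_word2vec_text` helper are made up for illustration and are not part of gensim. The Google file stores the same information in a binary layout, which is why it is loaded with `binary=True` further down.

```python
import io

# Hypothetical three-word sample in word2vec text format:
# the first line is "<vocab_size> <dimensions>", then one token per line.
SAMPLE = """3 4
king 0.5 0.7 0.1 0.9
queen 0.5 0.7 0.9 0.1
apple 0.1 0.0 0.3 0.2
"""


def parse_word2vec_text(stream):
    """Return a {token: [float, ...]} dict read from a word2vec text stream."""
    vocab_size, dims = map(int, stream.readline().split())
    vectors = {}
    for line in stream:
        parts = line.rstrip().split(" ")
        if len(parts) != dims + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    if len(vectors) != vocab_size:
        raise ValueError("header does not match the number of vectors read")
    return vectors


vectors = parse_word2vec_text(io.StringIO(SAMPLE))
print(len(vectors), vectors["queen"])
```

In practice you would let gensim do the parsing; the sketch only shows that there is no hidden model state beyond the lookup table.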

You can download it here: GoogleNews-vectors-negative300.bin.gz (https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing).

After downloading, extract the .gz archive to obtain GoogleNews-vectors-negative300.bin.

In [1]:
from gensim.models import KeyedVectors 

unable to import 'smart_open.gcs', disabling that module


In [8]:
# load the google word2vec model 
filename = 'GoogleNews-vectors-negative300.bin' 

In [9]:
model = KeyedVectors.load_word2vec_format(filename, binary=True) 

A popular demonstration of what word vectors capture is simple arithmetic on words, for example queen = (king - man) + woman. That is, queen is the word closest to the result of subtracting the notion of man from king and adding the notion of woman: the man-ness in king is replaced with woman-ness to give us queen. Gensim provides an interface for this kind of operation in the most_similar() function on a trained or loaded model.

In [30]:
# calculate: (king - man) + woman = ? 
result = model.most_similar(positive=['woman', 'king'],
                            negative=['man'],
                            topn=3)
print(result)

[('queen', 0.7118192911148071), ('monarch', 0.6189674139022827), ('princess', 0.5902431607246399)]
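The arithmetic behind this query can be sketched without gensim. The example below uses hypothetical 3-dimensional toy vectors (real embeddings have hundreds of dimensions and are learned, not hand-written) and a made-up `analogy` helper: it adds the positive vectors, subtracts the negative ones, and ranks the remaining words by cosine similarity to the result. Gensim's own most_similar() differs in details, for instance it normalises the vectors to unit length before combining them, so treat this as an illustration of the idea rather than a reimplementation.

```python
import math

# Hypothetical toy vectors: [royalty, maleness, femaleness]
toy = {
    "king":   [0.9, 0.9, 0.1],
    "queen":  [0.9, 0.1, 0.9],
    "man":    [0.1, 0.9, 0.1],
    "woman":  [0.1, 0.1, 0.9],
    "prince": [0.7, 0.9, 0.2],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(vectors, positive, negative, topn=1):
    """Rank words by similarity to sum(positive) - sum(negative)."""
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + v for t, v in zip(target, vectors[word])]
    for word in negative:
        target = [t - v for t, v in zip(target, vectors[word])]
    # Like gensim, leave the query words out of the candidates.
    exclude = set(positive) | set(negative)
    scored = [(w, cosine(target, v)) for w, v in vectors.items() if w not in exclude]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


best = analogy(toy, positive=["woman", "king"], negative=["man"])
print(best)
```

With these toy values the combined vector coincides with the one for queen, so queen ranks first; on real embeddings the match is only approximate, which is why the scores above are well below 1.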


## Stanford’s GloVe Embedding

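Researchers at Stanford developed their own word embedding algorithm, similar in purpose to Word2Vec, called Global Vectors for Word Representation, or GloVe for short. The cells below assume that the pre-trained glove.6B vector files (glove.6B.100d.txt and glove.6B.300d.txt) have already been downloaded and unpacked into the working directory.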

In [None]:
from gensim.models import KeyedVectors

In [35]:
# Working with the 100-dimensional version of the model,
# we can convert the file to Word2Vec format as follows:

from gensim.scripts.glove2word2vec import glove2word2vec
glove_input_file = 'glove.6B.100d.txt' 
word2vec_output_file = 'glove.6B.100d.txt.word2vec' 
glove2word2vec(glove_input_file, word2vec_output_file)

(400000, 100)
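The conversion is needed because a GloVe text file has no header, whereas the word2vec text format starts with a line giving the number of vectors and their dimensionality. The tuple printed above, (400000, 100), is exactly that pair. A minimal sketch of the idea, using a made-up two-word sample and a hypothetical `glove_to_word2vec` helper (gensim's own script may differ in details):

```python
import io

# Hypothetical GloVe-style sample: one token per line, no header.
GLOVE_SAMPLE = "the 0.1 0.2 0.3\ncat 0.4 0.5 0.6\n"


def glove_to_word2vec(src, dst):
    """Copy GloVe text from src to dst, prepending a word2vec header line."""
    lines = [line for line in src if line.strip()]
    num_vectors = len(lines)
    num_dims = len(lines[0].split()) - 1 if lines else 0
    dst.write(f"{num_vectors} {num_dims}\n")
    dst.writelines(lines)
    return num_vectors, num_dims


out = io.StringIO()
shape = glove_to_word2vec(io.StringIO(GLOVE_SAMPLE), out)
print(shape)
print(out.getvalue())
```

The sketch reads the whole input into memory, which is fine for a sample but wasteful for a 400,000-line file; a two-pass version would count lines first and then stream them. Newer gensim releases (4.0 and later) also accept `no_header=True` in `load_word2vec_format`, which reads the GloVe file directly; check the documentation for your installed version.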

In [37]:
filename = 'glove.6B.100d.txt.word2vec'
model = KeyedVectors.load_word2vec_format(filename, binary=False) 

In [38]:
# calculate: (king - man) + woman = ? 
result = model.most_similar(positive=['woman', 'king'],
                            negative=['man'],
                            topn=1)

print(result)

[('queen', 0.7698541283607483)]


### Using the 300-Dimensional Vectors

In [39]:
# Working with the 300-dimensional version of the model,
# we can convert the file to Word2Vec format as follows:

from gensim.scripts.glove2word2vec import glove2word2vec
glove_input_file = 'glove.6B.300d.txt' 
word2vec_output_file = 'glove.6B.300d.txt.word2vec' 
glove2word2vec(glove_input_file, word2vec_output_file)

filename = 'glove.6B.300d.txt.word2vec'
model_300 = KeyedVectors.load_word2vec_format(filename, binary=False) 

# calculate: (king - man) + woman = ? 
result = model_300.most_similar(positive=['woman', 'king'],
                                negative=['man'],
                                topn=1)

print(result)

[('queen', 0.6713277101516724)]
