[Pretrained Word2Vec Embeddings](#Pretrained-Word2Vec-Embeddings)   

1. [Google News Embedding](#Google-News-Embedding)    
2. [Similarity Scores](#Similarity-Scores)    
3. [Debunking Some Common Examples](#Debunking-Some-Common-Examples)
4. [Other Pretrained Models](#Other-Pretrained-Models)
5. [Building Your Own Model](#Building-Your-Own-Model)
6. [An Improvement on Word2Vec](#An-Improvement-on-Word2Vec)



# Pretrained Word2Vec Embeddings

Now that we have a better feel for the model behind Word2Vec, we'll see how to use word embeddings trained by teams with far more computational resources than most of us have at home.

We'll be using `gensim` and more or less working through the tutorial in their documentation: https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html. I'll also try my best to pull in some examples from Mikolov et al.'s original papers.

First things first, please try running the following code.

In [None]:
import numpy as np

In [None]:
import gensim.downloader as api

In [None]:
wv = api.load('word2vec-google-news-300')

The above code takes a while to run, and much longer the first time you run it because the model (approximately 2GB) has to be downloaded. So please hit shift + enter now while I'm talking so it will load in time for our first coding break.

## Google News Embedding
The model we just loaded using `gensim` was built by Google using their Google News dataset. Their network was trained with a vocabulary of 3 million words and phrases.

These 3-million-dimensional one-hot encoded vectors were then projected down to 300 features, meaning that each word vector in the embedding has 300 dimensions. Using the language from `Last week's class`, their $M = 3,000,000$ and their $N = 300$.
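To see why projecting a one-hot vector amounts to looking up a row, here is a minimal sketch with toy sizes ($M = 5$, $N = 3$). The weight matrix `W` and the helper names `one_hot` and `project` are made up for illustration; none of the values come from the Google News model.

```python
# Toy sizes standing in for M = 3,000,000 and N = 300
M, N = 5, 3

# Made-up M x N projection matrix; row i plays the role of word i's embedding
W = [[float(10 * i + j) for j in range(N)] for i in range(M)]

def one_hot(index, size):
    """Return a length-`size` vector with 1.0 at `index` and 0.0 elsewhere."""
    return [1.0 if k == index else 0.0 for k in range(size)]

def project(x, weights):
    """Multiply a length-M row vector by an M x N matrix, giving a length-N vector."""
    n_cols = len(weights[0])
    return [sum(x[i] * weights[i][j] for i in range(len(x))) for j in range(n_cols)]

word_index = 2
embedding = project(one_hot(word_index, M), W)

# The projection of a one-hot vector is just the matching row of W
print(embedding == W[word_index])
```

This is why a trained embedding can be stored and queried as a plain lookup table from words to rows.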

### Accessing the Word Vectors
Once the above has loaded, you can access the vector for any word or phrase in the vocabulary by indexing `wv[string]`. Let's see:

In [None]:
wv['man']

In [None]:
wv['woman']

In [None]:
wv['influenza']

### Retrieving the index
It would be nice to know whether or not a word/phrase we're interested in is in the vocabulary. We can check by looking at the word index for the vocabulary.



In [None]:
# index_to_key contains every word/phrase in the
# vocabulary, ordered by index

# let's look at the first 20
for i in range(20):
    print(wv.index_to_key[i])

In [None]:
# wv.index_to_key is a list of the words/phrases in the vocab
print(wv.index_to_key[:20])
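In gensim 4.x the list `index_to_key` has a companion dictionary, `key_to_index`, which maps each word back to its position. The sketch below mimics that pair with a plain list and dict over a made-up five-entry vocabulary, so it runs without the downloaded model. With the real model, the analogous membership check would be `"influenza" in wv.key_to_index`.

```python
# Stand-in for wv.index_to_key: a list ordered by index (made-up toy vocabulary)
index_to_key = ["in", "for", "that", "is", "apple"]

# Stand-in for wv.key_to_index: the inverse mapping from word to index
key_to_index = {key: idx for idx, key in enumerate(index_to_key)}

def in_vocab(word):
    """Check membership with a dictionary lookup instead of scanning the list."""
    return word in key_to_index

print(in_vocab("apple"))      # present in the toy vocabulary
print(in_vocab("influenza"))  # absent from the toy vocabulary
print(key_to_index["that"])   # position of "that" in index_to_key
```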

## Similarity Scores
Similar to LSA, Word2Vec can be used to find words that are similar to one another, which is useful for tasks like searching through a document for words related to a particular topic.

Work through the following to learn how to use the pretrained Word2Vec model to find the similarity between pairs of words, as well as the words most similar to a predefined set.

### Calculating Similarities between pairs of word embeddings
There are a few different ways you can calculate similarity scores between pairs of vectors. You'll work through them now.
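The score that `wv.similarity` returns is the cosine similarity between the two word vectors. As a reference point for the exercises, here is a minimal pure-Python sketch of that quantity. The three-dimensional vectors are made up for illustration and are not taken from the Google News embedding.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up toy vectors; real embeddings have 300 dimensions
apple = [1.0, 0.1, 0.0]
pear = [0.9, 0.2, 0.0]
jeans = [0.0, 0.1, 1.0]

print(cosine_similarity(apple, pear))   # close to 1: nearly the same direction
print(cosine_similarity(apple, jeans))  # close to 0: nearly orthogonal
```

Because only the direction matters, rescaling a vector does not change its cosine similarity to any other vector.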

In [None]:
### similarity(word1, word2)
## call wv.similarity for two words/phrases
## Try and find the similarity between "apple" and
## other fruits and vegetables you know

In [None]:
### cosine_similarities(vec1, array_of_vectors)
## call wv.cosine_similarities for a vector and a collection of vectors
## Create a numpy array where each row has the word embeddings 
## for the fruits and vegetables you were interested in
## compare to the word embedding for "apple"

### Finding the most similar words
Another problem you may be interested in is finding the words that are most similar to a given word or vector.
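Conceptually, a "most similar" query scores every other word in the vocabulary against the query vector and keeps the top few. The sketch below does this by brute force over a made-up four-word vocabulary. The function name `most_similar_to` and the toy vectors are assumptions for illustration, not gensim's implementation.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def most_similar_to(word, vectors, topn=10):
    """Return up to `topn` (word, score) pairs, best first, skipping the query word."""
    query = vectors[word]
    scores = [(other, cosine_similarity(query, vec))
              for other, vec in vectors.items() if other != word]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:topn]

# Made-up 2-dimensional embeddings
toy_vectors = {
    "apple": [1.0, 0.1],
    "pear": [0.9, 0.2],
    "shirt": [0.2, 0.9],
    "jeans": [0.1, 1.0],
}

print(most_similar_to("apple", toy_vectors, topn=2))
```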

In [None]:
## You can find the most similar words to a given word
## using wv.similar_by_word(word)
## try it out with "apple"

In [None]:
## You can control how many words are returned with topn
## Get the 25 most similar words to "apple"

In [None]:
## You can also do this for a collection of words
## with wv.most_similar(list_of_words)
## try running this on the list ["apple","pie"]
## and the list ["apple","computer"]

In [None]:
## Sometimes you may not have a word, but rather a vector
## You can find the words most similar to that vector with
## wv.similar_by_vector(vec)
## Find the words most similar to the test vector below
test = 2*np.random.random(300)-1

## Code here

### One of these things just doesn't belong
Another fun feature is that you can put in a list of words and Word2Vec can pick out the one that doesn't belong.

In [None]:
print(wv.doesnt_match(["apple","banana","grapes","pear","jeans"]))
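Roughly speaking, an odd-one-out query can be answered by averaging the unit-length vectors of the words and reporting the word least similar to that average. The sketch below implements this idea in pure Python. The helper names and the toy vectors are made up, and this is an approximation of the approach rather than gensim's exact code.

```python
import math

def unit(v):
    """Scale v to length 1."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def odd_one_out(words, vectors):
    """Return the word whose unit vector is least aligned with the group mean."""
    units = {w: unit(vectors[w]) for w in words}
    dim = len(next(iter(units.values())))
    mean = unit([sum(u[k] for u in units.values()) / len(units) for k in range(dim)])
    return min(words, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))

# Made-up 2-dimensional embeddings: three fruits and one piece of clothing
toy_vectors = {
    "apple": [1.0, 0.1],
    "banana": [0.9, 0.2],
    "pear": [0.8, 0.1],
    "jeans": [0.1, 1.0],
}

print(odd_one_out(["apple", "banana", "pear", "jeans"], toy_vectors))
```

Note that the method always returns something, even when every word in the list belongs together.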

## Debunking Some Common Examples
If you've ever heard of Word2Vec prior to this, you may have seen amazingly intuitive results like "King" - "Man" + "Woman" = "Queen".

While such examples are certainly eye-catching, let's see how they hold up in actuality. Note: the inspiration for this section comes from this blog post, https://blog.esciencecenter.nl/king-man-woman-king-9a7fd2935a85.

Use what you've learned above to test out the following "equations":

* king - man + woman = queen
* bigger - big + cold = colder
* Einstein - scientist + Picasso = painter
* Paris - France + Italy = Rome
* lebron - cavs + lakers = kobe
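One detail worth keeping in mind while testing these: gensim's `most_similar(positive=..., negative=...)` leaves the input words out of its results, whereas `similar_by_vector` applied to a vector you computed yourself does not. The sketch below shows how much that choice can matter, using made-up three-dimensional vectors in which "man" and "woman" are deliberately close together. It works on raw vectors, while gensim also normalizes the inputs, so treat it as an illustration of the idea only.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(a, b, c, vectors, exclude_inputs):
    """Rank words by similarity to vectors[b] - vectors[a] + vectors[c]."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    skip = {a, b, c} if exclude_inputs else set()
    scores = [(w, cosine_similarity(target, v))
              for w, v in vectors.items() if w not in skip]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Made-up toy embeddings, not values from any trained model
toy_vectors = {
    "man": [0.0, 0.0, 1.0],
    "woman": [0.0, 0.3, 1.0],
    "king": [1.0, 0.0, 0.5],
    "queen": [1.0, 1.0, 0.5],
}

# king - man + woman, with and without the input words as candidates
print(analogy("man", "king", "woman", toy_vectors, exclude_inputs=False)[0][0])
print(analogy("man", "king", "woman", toy_vectors, exclude_inputs=True)[0][0])
```

In this toy setup the nearest word to the computed vector is "king" itself, and "queen" only comes out on top once the three input words are removed from the candidates.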

In [None]:
## Code here

In [None]:
## Code here

In [None]:
## Code here

In [None]:
## Code here

## Other Pretrained Models
While we've looked at the Google News embedding there are a number of other pretrained Word2Vec embeddings that may be of interest.

This GitHub repository has a nice list of the pretrained models you can get online: https://github.com/3Top/word2vec-api#where-to-get-a-pretrained-models.

Here are the names from the gensim documentation.

In [None]:
# ['fasttext-wiki-news-subwords-300',
#  'conceptnet-numberbatch-17-06-300',
#  'word2vec-ruscorpora-300',
#  'word2vec-google-news-300',
#  'glove-wiki-gigaword-50',
#  'glove-wiki-gigaword-100',
#  'glove-wiki-gigaword-200',
#  'glove-wiki-gigaword-300',
#  'glove-twitter-25',
#  'glove-twitter-50',
#  'glove-twitter-100',
#  'glove-twitter-200',
#  '__testing_word2vec-matrix-synopsis']

## Building Your Own Model
`gensim` also lets you build your own Word2Vec model once you have cleaned up your text data. Let's demonstrate the process with a truly controversial example using Green Eggs and Ham by Dr. Seuss.

In [None]:
## Here is the text of Green Eggs and Ham
seuss = '''
"I AM SAM. I AM SAM. SAM I AM.

THAT SAM-I-AM! THAT SAM-I-AM! I DO NOT LIKE THAT SAM-I-AM!

DO YOU LIKE GREEN EGGS AND HAM?

I DO NOT LIKE THEM, SAM-I-AM.
I DO NOT LIKE GREEN EGGS AND HAM.

WOULD YOU LIKE THEM HERE OR THERE?

I WOULD NOT LIKE THEM HERE OR THERE.
I WOULD NOT LIKE THEM ANYWHERE.
I DO NOT LIKE GREEN EGGS AND HAM.
I DO NOT LIKE THEM, SAM-I-AM.

WOULD YOU LIKE THEM IN A HOUSE?
WOULD YOU LIKE THEM WITH A MOUSE?

I DO NOT LIKE THEM IN A HOUSE.
I DO NOT LIKE THEM WITH A MOUSE.
I DO NOT LIKE THEM HERE OR THERE.
I DO NOT LIKE THEM ANYWHERE.
I DO NOT LIKE GREEN EGGS AND HAM.
I DO NOT LIKE THEM, SAM-I-AM.

WOULD YOU EAT THEM IN A BOX?
WOULD YOU EAT THEM WITH A FOX?

NOT IN A BOX. NOT WITH A FOX.
NOT IN A HOUSE. NOT WITH A MOUSE.
I WOULD NOT EAT THEM HERE OR THERE.
I WOULD NOT EAT THEM ANYWHERE.
I WOULD NOT EAT GREEN EGGS AND HAM.
I DO NOT LIKE THEM, SAM-I-AM.

WOULD YOU? COULD YOU? IN A CAR?
EAT THEM! EAT THEM! HERE THEY ARE.

I WOULD NOT, COULD NOT, IN A CAR.

YOU MAY LIKE THEM. YOU WILL SEE.
YOU MAY LIKE THEM IN A TREE!

I WOULD NOT, COULD NOT IN A TREE.
NOT IN A CAR! YOU LET ME BE.
I DO NOT LIKE THEM IN A BOX.
I DO NOT LIKE THEM WITH A FOX.
I DO NOT LIKE THEM IN A HOUSE.
I DO NOT LIKE THEM WITH A MOUSE.
I DO NOT LIKE THEM HERE OR THERE.
I DO NOT LIKE THEM ANYWHERE.
I DO NOT LIKE GREEN EGGS AND HAM.
I DO NOT LIKE THEM, SAM-I-AM.

A TRAIN! A TRAIN! A TRAIN! A TRAIN!
COULD YOU, WOULD YOU ON A TRAIN?

NOT ON A TRAIN! NOT IN A TREE!
NOT IN A CAR! SAM! LET ME BE!
I WOULD NOT, COULD NOT, IN A BOX.
I WOULD NOT, COULD NOT, WITH A FOX.
I WILL NOT EAT THEM IN A HOUSE.
I WILL NOT EAT THEM HERE OR THERE.
I WILL NOT EAT THEM ANYWHERE.
I DO NOT EAT GREEN EGGS AND HAM.
I DO NOT LIKE THEM, SAM-I-AM.

SAY! IN THE DARK? HERE IN THE DARK!
WOULD YOU, COULD YOU, IN THE DARK?

I WOULD NOT, COULD NOT, IN THE DARK.

WOULD YOU COULD YOU IN THE RAIN?

I WOULD NOT, COULD NOT IN THE RAIN.
NOT IN THE DARK. NOT ON A TRAIN.
NOT IN A CAR. NOT IN A TREE.
I DO NOT LIKE THEM, SAM, YOU SEE.
NOT IN A HOUSE. NOT IN A BOX.
NOT WITH A MOUSE. NOT WITH A FOX.
I WILL NOT EAT THEM HERE OR THERE.
I DO NOT LIKE THEM ANYWHERE!

YOU DO NOT LIKE GREEN EGGS AND HAM?

I DO NOT LIKE THEM, SAM-I-AM.

COULD YOU, WOULD YOU, WITH A GOAT?

I WOULD NOT, COULD NOT WITH A GOAT!

WOULD YOU, COULD YOU, ON A BOAT?

I COULD NOT, WOULD NOT, ON A BOAT.
I WILL NOT, WILL NOT, WITH A GOAT.
I WILL NOT EAT THEM IN THE RAIN.
NOT IN THE DARK! NOT IN A TREE!
NOT IN A CAR! YOU LET ME BE!
I DO NOT LIKE THEM IN A BOX.
I DO NOT LIKE THEM WITH A FOX.
I WILL NOT EAT THEM IN A HOUSE.
I DO NOT LIKE THEM WITH A MOUSE.
I DO NOT LIKE THEM HERE OR THERE.
I DO NOT LIKE THEM ANYWHERE!
I DO NOT LIKE GREEN EGGS AND HAM!
I DO NOT LIKE THEM, SAM-I-AM.

YOU DO NOT LIKE THEM. SO YOU SAY.
TRY THEM! TRY THEM! AND YOU MAY.
TRY THEM AND YOU MAY, I SAY.

SAM! IF YOU LET ME BE,
I WILL TRY THEM. YOU WILL SEE.

(... and he tries them ...)

SAY! I LIKE GREEN EGGS AND HAM!
I DO! I LIKE THEM, SAM-I-AM!
AND I WOULD EAT THEM IN A BOAT.
AND I WOULD EAT THEM WITH A GOAT...
AND I WILL EAT THEM, IN THE RAIN.
AND IN THE DARK. AND ON A TRAIN.
AND IN A CAR. AND IN A TREE.
THEY ARE SO GOOD, SO GOOD, YOU SEE!
SO I WILL EAT THEM IN A BOX.
AND I WILL EAT THEM WITH A FOX.
AND I WILL EAT THEM IN A HOUSE.
AND I WILL EAT THEM WITH A MOUSE.
AND I WILL EAT THEM HERE AND THERE.
SAY! I WILL EAT THEM ANYWHERE!
I DO SO LIKE GREEN EGGS AND HAM!
THANK YOU! THANK YOU, SAM I AM.
'''

In [None]:
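# Here I clean it up a bit:
# turn newlines and commas into spaces, drop quotes and parentheses,
# and pad sentence-ending punctuation so the tokenizers can split on it
seuss = seuss.replace("\n"," ").replace(","," ").replace('"',"").replace("(","").replace(")","")
seuss = seuss.replace(".",". ").replace("!","! ").replace("?","? ")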

In [None]:
# The clean version
print(seuss)

In [None]:
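# gensim expects an iterable of tokenized sentences (lists of words)
# If sent_tokenize raises a LookupError, run nltk.download('punkt') once
# (newer NLTK releases may ask for 'punkt_tab' instead)
from nltk.tokenize import sent_tokenize
from nltk import word_tokenize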

In [None]:
# let's give them just that
sentences = [word_tokenize(sent) for sent in sent_tokenize(seuss)]
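If NLTK or its tokenizer data is not available, a rough stand-in can be built with the standard library. The sketch below is an assumption-laden substitute, not what `sent_tokenize` and `word_tokenize` do internally. It splits sentences after `.`, `!` or `?`, keeps hyphenated words like `SAM-I-AM` together, and drops punctuation (NLTK would keep punctuation marks as separate tokens). The helper names are made up.

```python
import re

def simple_sentences(text):
    """Split text after ., ! or ? when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

def simple_words(sentence):
    """Return alphabetic tokens, keeping internal hyphens (e.g. SAM-I-AM)."""
    return re.findall(r"[A-Za-z]+(?:-[A-Za-z]+)*", sentence)

sample = "I AM SAM. I DO NOT LIKE THAT SAM-I-AM! WOULD YOU?"
toy_sentences = [simple_words(s) for s in simple_sentences(sample)]

# A list of token lists, the same shape as `sentences` above
print(toy_sentences)
```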

In [None]:
# You train a Word2Vec model in gensim with
# Word2Vec
from gensim.models import Word2Vec

### Parameters

* `min_count` = int - Ignores all words with total absolute frequency lower than this - (2, 100)
* `window` = int - The maximum distance between the current and predicted word within a sentence, e.g. `window` words on the left and `window` words on the right of our target - (2, 10)
* `vector_size` = int - Dimensionality of the feature vectors (this parameter was called `size` in gensim versions before 4.0) - (50, 300)
* `sample` = float - The threshold for configuring which higher-frequency words are randomly downsampled. Highly influential. - (0, 1e-5)
* `alpha` = float - The initial learning rate - (0.01, 0.05)
* `min_alpha` = float - Learning rate will linearly drop to `min_alpha` as training progresses. To set it: alpha - (min_alpha * epochs) ~ 0.00
* `negative` = int - If > 0, negative sampling will be used; the int for negative specifies how many "noise words" should be drawn. If set to 0, no negative sampling is used. - (5, 20)
* `workers` = int - Use this many worker threads to train the model (=faster training with multicore machines)
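To make the `window` parameter concrete, here is a minimal sketch that lists the (target, context) pairs a fixed window would produce for one sentence. The function name `context_pairs` is made up, and the sketch ignores refinements that gensim's training loop may apply, such as randomly shrinking the window or downsampling frequent words.

```python
def context_pairs(sentence, window):
    """Pair each word with every word at most `window` positions away."""
    pairs = []
    for i, target in enumerate(sentence):
        lo = max(0, i - window)                  # leftmost context position
        hi = min(len(sentence), i + window + 1)  # one past the rightmost position
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, sentence[j]))
    return pairs

sentence = ["I", "DO", "NOT", "LIKE", "THEM"]

# window=1: each word sees only its immediate neighbours
print(context_pairs(sentence, window=1))

# window=2: "NOT" now sees two words on each side
print([ctx for tgt, ctx in context_pairs(sentence, window=2) if tgt == "NOT"])
```

A larger window yields more training pairs per sentence and tends to pull together words that share a topic rather than words that are directly interchangeable.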

In [None]:
# This line makes a model
# You first put in your sentences
# then you can OPTIONALLY specify a number of parameters
# vector_size is the size of the hidden layer
# window is the size of the skip-gram window
# min_count sets a minimum number of times that a word has to appear
# note: vector_size was called size in gensim versions before 4.0
model = Word2Vec(sentences=sentences,
                 vector_size=10,
                 window=1,
                 min_count=1)

In [None]:
model.wv['SAM']

In [None]:
model.wv['EGGS']

In [None]:
model.wv.most_similar(['HAM'])

Since we spent so much time and energy training this model we may want to save it.

In [None]:
model.save("green_eggs_and_ham.model")

We can load it later.

In [None]:
model = Word2Vec.load("green_eggs_and_ham.model")

Now, when training an actual model, you'll likely want to know all the bells and whistles for tuning it. You can find all that and more here: https://radimrehurek.com/gensim/models/word2vec.html?highlight=word2vec#module-gensim.models.word2vec

## An Improvement on Word2Vec
A nice improvement on Word2Vec, called fastText, was made by Facebook researchers (including Mikolov): https://arxiv.org/abs/1607.04606.

The key idea behind the improvement was to replace whole words and phrases as the base unit of the input layer with character $n$-grams of words. For example, instead of "apples" they used things like "app", "ppl", "ple", and "les".
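As a small illustration of how a word breaks into character $n$-grams, the sketch below follows the convention described in the fastText paper of wrapping the word in the boundary markers `<` and `>` before slicing, with $n$ ranging from 3 to 6 by default. The function name `char_ngrams` is made up, and the real model also keeps the whole word as its own unit, which this sketch leaves out.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """List all character n-grams of `<word>` for n_min <= n <= n_max."""
    marked = "<" + word + ">"  # boundary markers distinguish prefixes and suffixes
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(marked) - n + 1):
            grams.append(marked[start:start + n])
    return grams

# Only the 3-grams, to match the example in the text
print(char_ngrams("apples", n_min=3, n_max=3))

# The default range produces n-grams of length 3 through 6
print(len(char_ngrams("apples")))
```

Because a word's vector is built from its pieces, a model that stores the n-gram vectors can produce an embedding even for a word it never saw during training.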

This is also implementable with gensim. Here is the documentation, https://radimrehurek.com/gensim_3.8.3/models/fasttext.html.

Below I'll demonstrate Facebook's pretrained model trained on Wikipedia. Note: don't try to run this code right now, because it takes a long time to download the model. Just watch the code, and if you're interested in the model you can always play with it later.

In [None]:
ft = api.load('fasttext-wiki-news-subwords-300')

In [None]:
## Let's look at the first 20 words in the vocabulary
for i in range(20):
    print(ft.index_to_key[i])

In [None]:
## And just to note we can use this model in exactly
## the same way as Word2Vec
ft.similar_by_word("apple",topn=20)

In [None]:
ft.similar_by_word("ppl",topn=20)

In [None]:
test = ft['king'] - ft['man'] + ft['woman']
ft.similar_by_vector(test,topn=5)

That's it for this notebook. I hope you enjoyed playing around with `gensim`'s word embedding models!