[Word2Vec](#Word2Vec)

1. [Introduction](#Introduction)   
2. [References](#References)   
3. [The Idea Behind the Approach - Using Context Clues](#The-Idea-Behind-the-Approach---Using-Context-Clues)     
4. [Searching for Context Clues: Skip-Grams](#Searching-for-Context-Clues:-Skip-Grams)   
5. [The Neural Network](#The-Neural-Network)   
6. [Training Word2Vec is Costly](#Training-Word2Vec-is-Costly)   


# Word2Vec

## Introduction
In this section we'll dive into how neural networks can be used to represent NLP data in what are known as embeddings. They are so called because we take higher-dimensional data and find clever ways to embed it into a lower-dimensional space.

By clever I mean that we'll be finding ways to embed that preserve some inherent nature of the data, for instance in a way that makes sense given a word's meaning in a piece of text. For example, a good word embedding should probably group the words "apple" and "pears" together (depending on the data set of course).


The first feed forward neural network based word embedding we'll touch on is Word2Vec. Word2Vec is an approach developed by Mikolov et al. across two papers in 2013 (https://arxiv.org/abs/1301.3781 and https://arxiv.org/abs/1310.4546).

The key idea behind this approach is to look at the words that occur next to the word you're interested in across your corpus, an idea known as looking at the context surrounding the target word.

In this notebook we'll learn how to implement Word2Vec using an algorithm known as skip-gram with negative sampling (SGNS).

## References

Notably, Mikolov et al.'s papers are a bit hard to decipher, so we'll actually be working off of a collection of other sources that are much more reader-friendly (i.e. I can understand them).

This notebook will build heavily off of the following sources:

* https://web.stanford.edu/~jurafsky/slp3/6.pdf
* https://www.tensorflow.org/tutorials/text/word2vec
* https://arxiv.org/abs/1411.2738
* https://www.youtube.com/watch?v=D-ekE-Wlcds

Let's go!

In [None]:
## We'll import these, but I don't know how much 
## we'll use them

## For data handling
import pandas as pd
import numpy as np

## For plotting
import matplotlib.pyplot as plt
from seaborn import set_style

## This sets the plot style
## to have a grid on a white background
set_style("whitegrid")

## The Idea Behind the Approach - Using Context Clues
Before diving into the nitty-gritty, let's look at the idea behind the approach.



> You shall know a word by the company it keeps
>
> J.R. Firth


At the heart of Word2Vec is the desire to learn the "meaning" of a word using context clues. Let's look at a simple example. Imagine you've picked up the hottest young adult novel set in a far off dystopian future. As many such novels do, the author has created a "fun" future slang for their dystopian world and, being from the present, you find the words unfamiliar. You come upon the following sentence: "After work he needed to deposit his weekly pay at the kakoonahole."

I would be shocked if you have heard the word kakoonahole before, but you probably have a rough idea that the author is using this word to represent a banking establishment of some kind.

What gave that away? Neighboring words like "weekly pay" and "deposit".

This is precisely the idea behind Word2Vec.

With Word2Vec we will see how we can use a very simple feed forward neural network architecture (a single hidden layer) to produce representations of words as short dense vectors, as opposed to the standard long sparse vectors we've used up to this point (one-hot encodings, frequency vectors, tf-idf vectors).

Importantly, these short dense vectors have been demonstrated to provide intuitive results that outperform previous techniques (like LSA) at certain tasks.

## Searching for Context Clues: Skip-Grams
When we use the phrase "context clues", we mean the words surrounding the word in which we have an interest. We can quantify the context words surrounding our target word using skip-grams.

Consider this sample sentence.

My cat sits in the sun

For skip-grams we focus on a single word, and the window around that word. In this example let's choose a window size of 2. To create a collection of skip-grams you choose a target word, let's say it is sits, and look at all the words within the window size on either side. The skip-grams are then the collection of target word-window word pairings. For sits and size 2 this gives (sits, my), (sits, cat), (sits, in), (sits, the).

Before moving on, let's do a short practice problem to check understanding. Write down the skip-grams for the word cat.

ANSWER
(cat, My), (cat, sits), (cat, in).
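
To make the windowing rule concrete, here is a minimal pure-Python sketch that builds the skip-grams for every target word in the sample sentence. The helper name `make_skip_grams` and the `demo_` variables are made up for illustration and are not part of `keras`.

```python
def make_skip_grams(tokens, window_size):
    """Return (target, context) pairs for every token in the list."""
    pairs = []
    for i, target in enumerate(tokens):
        # The window reaches window_size positions to either side,
        # clipped at the start and end of the sentence
        start = max(0, i - window_size)
        stop = min(len(tokens), i + window_size + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


demo_tokens = "My cat sits in the sun".split()
demo_pairs = make_skip_grams(demo_tokens, window_size=2)

# Pull out the pairs for two target words to compare with the text above
sits_pairs = [pair for pair in demo_pairs if pair[0] == "sits"]
cat_pairs = [pair for pair in demo_pairs if pair[0] == "cat"]

print(sits_pairs)
print(cat_pairs)
```

Words near the edges of the sentence get fewer pairs because their windows are clipped, which is why cat has three skip-grams while sits has four.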

### Making skip-grams in keras
We can use `keras` to quickly make skip-grams for us. Let's see how!

### Preparing the data
First you'll need to turn your text into a list of indices, like the imdb data set from the neural networks folder.

In [None]:
## First we'll tokenize our data using simple string functions
sample_sentence = "My cat sits in the sun"

tokens = list(sample_sentence.lower().split())
print(tokens)

In [None]:
# now we create a word index dictionary
word_index = {}
i = 1

for word in tokens:
    if word not in word_index.keys():
        word_index[word] = i
        i = i + 1
        
print(word_index)

In [None]:
# now we create a reverse index as well
reverse_word_index = {i: word for word,i in word_index.items()}

print(reverse_word_index)

In [None]:
# we can now create a sequence for our sentence like so
sample_sequence = [word_index[word] for word in tokens]
print(sample_sequence)

### Generating the skip-grams
Now that we have a sequence for our sentence we can use `keras` to create a set of skip-grams for us.

In [None]:
from tensorflow import keras

from tensorflow.keras.preprocessing import sequence

In [None]:
# How large you want your windows to be
window_size = 2

# how many words are in your vocabulary?
# keras expects the maximum word index + 1, and our indices start at 1
vocabulary_size = len(word_index) + 1

# ignore the negative_samples argument for now
# more on that later
positive_skip_grams, _ = sequence.skipgrams(sample_sequence, 
                                  vocabulary_size=vocabulary_size,
                                  window_size=window_size,
                                  negative_samples=0)

In [None]:
positive_skip_grams

In [None]:
print("The sentence was:")
print(sample_sentence)

print("######################")

print("The skip-grams are:")
for item in positive_skip_grams:
    print(reverse_word_index[item[0]],reverse_word_index[item[1]])

### You Code
Use the next few code chunks to read in the imdb data set from `keras`. Then practice by calculating the skip-grams for a couple of reviews from the training set.

In [None]:
from tensorflow.keras.datasets import imdb

In [None]:
n = 10000
(imdb_train, y_train_imdb), (imdb_test, y_test_imdb) = imdb.load_data(num_words=n,
                                                                      seed=440,
                                                                      index_from=3)

# imdb_word_index is a dictionary that maps each word to its index
imdb_word_index = imdb.get_word_index()

# We now adjust the indices according to the coding presented above
imdb_word_index = {key:(value+3) for key,value in imdb_word_index.items()}

imdb_word_index["<PAD>"] = 0
imdb_word_index["<START>"] = 1
imdb_word_index["<UNKNOWN WORD>"] = 2
imdb_word_index["<UNUSED WORD>"] = 3

imdb_reverse_index = dict([(value,key) for (key,value) in imdb_word_index.items()])
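
As a quick check on the index shift, the following sketch decodes an encoded review back into words. It uses a tiny made-up word index (`toy_word_index`) in place of the real imdb one so that it runs without downloading anything; the helper name `decode_review` is hypothetical as well.

```python
# Hypothetical toy word index standing in for imdb.get_word_index()
toy_word_index = {"the": 1, "and": 2, "movie": 3}

# Shift every index by 3 to make room for the special tokens
toy_word_index = {word: idx + 3 for word, idx in toy_word_index.items()}
toy_word_index["<PAD>"] = 0
toy_word_index["<START>"] = 1
toy_word_index["<UNKNOWN WORD>"] = 2
toy_word_index["<UNUSED WORD>"] = 3

toy_reverse_index = {idx: word for word, idx in toy_word_index.items()}


def decode_review(encoded_review, reverse_index):
    """Turn a list of integer indices back into a space-separated string."""
    return " ".join(reverse_index.get(idx, "<UNKNOWN WORD>") for idx in encoded_review)


# A made-up encoded review: start token, two words, an unknown word, one more word
toy_review = [1, 4, 6, 2, 5]
decoded = decode_review(toy_review, toy_reverse_index)
print(decoded)
```

With the real data, the same helper should work by passing one of the reviews in `imdb_train` together with `imdb_reverse_index`.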

In [None]:
### You Code Here
### Choose a few reviews and find their skip-grams here

## The Neural Network
The neural network used for __Word2Vec__, as previously mentioned, has a very simple architecture and we will introduce it now.

Suppose you wish to represent a vocabulary of $M$ unique words/tokens from your corpus. You then build a network with $M$ input nodes, one corresponding to each word in the vocabulary, a single hidden layer with $N$ nodes where $N < M$, and an output layer with $M$ nodes, again corresponding to each word in the vocabulary.

Its architecture looks like this 

As you can see from the architecture, the activation from the input layer to the hidden layer is the identity function and the activation from the hidden layer to the output layer is the softmax function, which is defined for a vector $z$ with $K$ entries as:

$$
\sigma(z)_i = \frac{e^{z_i}}{\sum^{K}_{j=1}e^{z_j}},
$$
 
and thus transforms the output from nodes of real numbers to a probability distribution.
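
The softmax is easy to compute by hand. Below is a minimal `numpy` sketch; the function name `softmax` and the example vector are chosen for illustration. Subtracting the largest entry before exponentiating is a common numerical-stability trick and does not change the result, since the shift cancels between numerator and denominator.

```python
import numpy as np


def softmax(z):
    """Map a vector of real numbers to a probability distribution."""
    z = np.asarray(z, dtype=float)
    # Shifting by the maximum avoids overflow in np.exp
    exp_z = np.exp(z - z.max())
    return exp_z / exp_z.sum()


demo_z = [2.0, 1.0, 0.1]
demo_probs = softmax(demo_z)

print(demo_probs)
print(demo_probs.sum())
```

The largest input receives the largest probability, and the entries are all positive and sum to one.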

### Why The Softmax?
The genius of the __Word2Vec__ approach was that it turned the semantics problem, "what does word $w$ mean?", into a context problem, "what is the context of $w$?", which is then turned into a multiclass classification problem.

With the above neural net, the goal of the network is to model the probability that word $w_j$ is a contextual word (in the skip-gram window) of a given target word $w_t$, i.e. we're modeling the conditional probability:

$$
p(w_j|w_t)\sim \sigma(O)_j= \frac{e^{O_j}}{\sum^{M}_{i=1}e^{O_i}}
$$

where I'm lazily letting $O_i$ denote the output of the weighted sum of the hidden layer nodes at output node $i$.
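
The full forward pass can be sketched in a few lines of `numpy`. This is only an illustration with randomly initialized (untrained) weights and no bias terms; the sizes and the names `W_in`, `W_out`, and `p_context` are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

M_demo = 6  # vocabulary size (input and output nodes)
N_demo = 3  # hidden layer size

# Untrained weights, drawn at random purely for illustration
W_in = rng.normal(size=(N_demo, M_demo))   # input layer -> hidden layer
W_out = rng.normal(size=(M_demo, N_demo))  # hidden layer -> output layer

# One-hot encode a target word, here the word with index 2
t = 2
x_t = np.zeros(M_demo)
x_t[t] = 1.0

# Identity activation: the hidden layer is just the weighted sum
hidden = W_in @ x_t

# O holds the weighted sums arriving at the M output nodes
O = W_out @ hidden

# Softmax turns O into the modeled probabilities p(w_j | w_t)
exp_O = np.exp(O - O.max())
p_context = exp_O / exp_O.sum()

print(p_context)
```

Each entry of `p_context` is the modeled probability that the corresponding vocabulary word appears in the skip-gram window of the target word.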

### What is the Training Data?
The training data for this network is produced from the skip-grams. An $X,y$ pair in the training set would be as follows. $X$ would be a one-hot encoded vector where all $M$ entries are $0$ except for the entry corresponding to the target word of interest. $y$ is a one-hot encoded vector where all $M$ entries are $0$ except for the entry corresponding to the context word.
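
As a small illustration, the sketch below one-hot encodes the skip-grams for the target word sits from the sample sentence. The zero-based `demo_word_index` is a made-up index for this example, and the arrays are laid out with one training pair per row, which is an assumption of this sketch.

```python
import numpy as np

# Hypothetical zero-based index for the sample sentence
demo_word_index = {"my": 0, "cat": 1, "sits": 2, "in": 3, "the": 4, "sun": 5}
M_demo = len(demo_word_index)

# (target, context) skip-grams for the target word "sits" with window size 2
demo_pairs = [("sits", "my"), ("sits", "cat"), ("sits", "in"), ("sits", "the")]

# Row i of the identity matrix is the one-hot vector for word index i
identity = np.eye(M_demo)

# One row per training pair: X encodes the target, y encodes the context word
X_demo = np.array([identity[demo_word_index[target]] for target, _ in demo_pairs])
y_demo = np.array([identity[demo_word_index[context]] for _, context in demo_pairs])

print(X_demo)
print(y_demo)
```

Every row of `X_demo` is the same vector here because all four pairs share the target word, while each row of `y_demo` marks a different context word.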

### Weights are Where it's at
However, we don't care at all about using the network to make predictions, we just want the weight matrices that result.

Let $W$ be the $N\times M$ trained weight matrix for the input layer into the hidden layer, and let $x_i$ ($M\times 1$) be a one-hot encoded vector corresponding to word $w_i$. Then the word2vec embedding of $w_i$ is simply $Wx_i$, which is the $i^{\text{th}}$ column of $W$.
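
The claim that multiplying by a one-hot vector just selects a column can be checked numerically. The sketch below uses a random stand-in for the trained matrix; the names `W_demo` and `kernel_demo` are made up for this example. It also shows the transposed layout, since a `keras` `Dense` layer stores its weights with shape (inputs, units), so there the embedding of a word is a row rather than a column.

```python
import numpy as np

rng = np.random.default_rng(0)

M_demo = 6  # vocabulary size
N_demo = 3  # embedding dimension

# Random stand-in for a trained N x M weight matrix
W_demo = rng.normal(size=(N_demo, M_demo))

# One-hot vector for the word with index 4
i = 4
x_i = np.zeros(M_demo)
x_i[i] = 1.0

# The matrix product and the column lookup give the same vector
embedding_by_product = W_demo @ x_i
embedding_by_lookup = W_demo[:, i]

# In the transposed (M x N) layout the same embedding is row i
kernel_demo = W_demo.T
embedding_from_kernel = kernel_demo[i]

print(np.allclose(embedding_by_product, embedding_by_lookup))
print(np.allclose(embedding_by_product, embedding_from_kernel))
```

In practice this means no matrix multiplication is needed to read off an embedding: once the network is trained, it is a simple lookup.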

### You Code, A Very Simple Example
I've coded up the skip-grams for a very simple example below and created the $X$ and $y$ for you. Using what we learned last week, make a word2vec neural net with a $5$ node hidden layer.

Use `rmsprop` as your `optimizer`, `binary_crossentropy` as your `loss`, and `accuracy` as your `metrics`.

Also train for at least `1000` `epochs`, with a batch size of `12`.

_Hint: to make a layer with identity activation, just don't include the `activation` argument_.

In [None]:
skipgrams = [('king','kingdom'),('queen','kingdom'),('king','palace'),('queen','palace'),('king','royal'),
            ('queen','royal'),('king','George'),('queen','Mary'),('man','rice'),('woman','rice'),
            ('man','farmer'),('woman','farmer'),('man','house'),('woman','house'),('man','George'),
            ('woman','Mary')]

word_index = {'George':0, 'Mary':1, 'farmer':2, 'house':3, 'kingdom':4,
                 'king':5, 'man':6, 'palace':7, 'queen':8, 'rice':9, 'royal':10,
                 'woman':11}

skipgrams = [(word_index[gram[0]],word_index[gram[1]]) for gram in skipgrams]

reverse_index = {i:word for word,i in word_index.items()}

In [None]:
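# One column per skip-gram, one row per word in the vocabulary
X = np.zeros((len(word_index.keys()),len(skipgrams)))
y = np.zeros((len(word_index.keys()),len(skipgrams)))

In [None]:
# Column j of X one-hot encodes the second word of skip-gram j (the network input)
# Column j of y one-hot encodes the first word of skip-gram j (the word to predict)
# keras expects one sample per row, so pass X.T and y.T when fitting
for j in range(len(skipgrams)):
    gram = skipgrams[j]
    X[gram[1],j] = 1
    y[gram[0],j] = 1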

In [None]:
# You'll need these
#from tensorflow.keras import models
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense


In [None]:
import tensorflow as tf
print(tf.__version__)

If you are having problems with TensorFlow, run the following in your command prompt.

In [None]:
#!pip uninstall tensorflow
#!pip install tensorflow==2.2.0

In [None]:
### You code
### Call your neural net model, model

In [None]:
### You code

In [None]:
### You code

### Looking at the Word Embedding
Now we need to get the weight matrix. We didn't review how to do this in the `Neural Networks` folder so let's see how to now.

In [None]:
weights = []

for layer in model.layers:
    weights.append(layer.get_weights())

In [None]:
weights

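We want the 0th entry of the 0th entry in `weights`, which is the weight matrix from the input layer into the hidden layer. `keras` stores it with shape $M\times N$, so row $i$ holds the embedding of the word with index $i$.

In [None]:
np.shape(weights[0][0])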

### Projecting to 2-dimensions
Now we can look at the word embedding in two dimensions using a standard dimension reduction technique like PCA.

In [None]:
from sklearn.decomposition import PCA

In [None]:
pca = PCA(2)

In [None]:
fit=pca.fit_transform(weights[0][0])

In [None]:
plt.figure(figsize=(10,10))

plt.scatter(fit[:,0],fit[:,1])

for i in range(len(word_index.keys())):
    plt.text(fit[i,0],fit[i,1],reverse_index[i], fontsize=14)

plt.show()

Before continuing with this notebook, I want to pause and show you a nice interactive web app that gives a good intuition for Word2Vec, https://ronxin.github.io/wevi/. Go to that app and play around for a bit before finishing this notebook.

## Training Word2Vec is Costly
Before moving on to the next notebook let's end with this final demonstration.

Program a loop to count the number of skip-grams that would result from the imdb data set.

In [None]:
### You code here

In [None]:
### print how many skip-grams there were


That's a lot of data!

To make things worse (from your laptop's perspective), in the imdb example $M = 10,000$ and a standard $N$ is $300$ (based on the original paper). That means we'd need to fit $10,000\times 300 = 3,000,000$ weights twice, once for each of the two weight matrices. (Good thing we have a lot of data.)
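
The arithmetic is quick to verify. This sketch ignores any bias terms, which is an assumption of the example; the variable names are chosen for illustration.

```python
vocab_size = 10_000   # M, the number of words kept from the imdb data
embedding_dim = 300   # N, the hidden layer size

# One matrix maps input -> hidden, the other maps hidden -> output
weights_per_matrix = vocab_size * embedding_dim
total_weights = 2 * weights_per_matrix

print(f"{weights_per_matrix:,} weights per matrix, {total_weights:,} in total")
```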

So training your own Word2Vec embedding comes with a large start up cost compared to everything else we've done in the program.

That's why many projects don't start with the training of a custom Word2Vec embedding, but first either try some of the older techniques we've learned or use a pretrained Word2Vec embedding, the topic of Next Week's Class!! 