[View in Colaboratory](https://colab.research.google.com/github/schwaaweb/aimlds1_11-NLP/blob/master/W11_S0_AG_NLP_Objective_3.ipynb)

### Objective 3: Comparing documents or words

**Leverage Word Vectors (Word2vec) to compare words**

A key challenge in Natural Language Processing is understanding the language. Text or speech is our main communication channel, but a computer can only work with numbers, so the words need to be converted to numbers. The process of converting words to numbers (i.e. vectors) is called word embedding. Word2vec is a popular algorithm for building vector representations of words (i.e. word embeddings).

The vector for each word is a semantic depiction of how that word is used in context. Two words are closer in meaning if they share contexts, an idea known as the distributional hypothesis (reference: https://en.wikipedia.org/wiki/Distributional_semantics). The distributional hypothesis states the following:

Linguistic items with similar distributions or similar context have similar meanings.

Here is a simple example to clarify the underlying principle of the distributional hypothesis:

**Sentence 1**: Traffic was light today

**Sentence 2**: Traffic was heavy yesterday

**Sentence 3**: Prediction is that traffic will be smooth-flowing tomorrow since it is a national holiday

Per the distributional hypothesis, the words "light", "heavy", and "smooth-flowing" must be related in some way since they occur in a similar context (close to the word "traffic"). Similarly, the words "today", "yesterday", and "tomorrow" are related since they appear next to a word indicating the state of traffic. A word's meaning is therefore governed by the context it occurs in.
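The idea can be made concrete with a minimal sketch that collects, for each word, the set of words appearing near it. The helper name `context_words` and the window size of 3 are assumptions chosen for illustration; they are not part of Word2vec itself.

```python
from collections import defaultdict

sentences = [
    "Traffic was light today",
    "Traffic was heavy yesterday",
    "Prediction is that traffic will be smooth-flowing tomorrow since it is a national holiday",
]

def context_words(sentences, window):
    """Map each word to the set of words seen within `window` positions of it."""
    contexts = defaultdict(set)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            # Neighbors on the left and on the right, excluding the word itself.
            contexts[word].update(tokens[lo:i] + tokens[i + 1:hi])
    return contexts

contexts = context_words(sentences, window=3)

# Context words shared by all three traffic descriptions.
shared = contexts["light"] & contexts["heavy"] & contexts["smooth-flowing"]
print(shared)
```

With this window, the three descriptive words all have "traffic" among their neighbors, which is the shared context the example relies on.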

Another example that highlights the principle behind the distributional hypothesis is as follows:

Let’s take the word “mobile”. One of the many meanings of this word is the ability to move freely, and another is a handheld device. In the sentence “I have become increasingly mobile with the purchase of a car”, we can easily glean that the former meaning is intended. However, in the sentence “I just purchased a new mobile phone”, “mobile” takes on the latter meaning (i.e. handheld device).

**Constructing Vector of Words**

**Vector of Words:**

![alt text](https://www.dropbox.com/s/ysmrl91jkouh29c/Image%201%20-%20Vector%20of%20Words.JPG?raw=1)

The table above illustrates a vector of words, where we have a column for every context of every word in a given source text. So for the given text:

**"It was the sunniest of days, it was the rainiest of days"**, the vector of words is depicted in the above image.

The value in each cell records how many times the word occurs in the given context. The numbers in the columns constitute that word's vector. For example, the vector for the word "sunniest" is:

**[0, 0, 0, 1, 0, 0, 0, 0, 0, 0]**

This vector representation has 10 dimensions since there are 10 possible contexts (as depicted by the column headers within the table). Once we have a vector representation for the words, we can conduct vector arithmetic (for example, calculate the Euclidean distance between two words) to determine which vector representations are similar or closest to each other. In our example scenario, a quick look at the vector space shows that the words "sunniest" and "rainiest" are similar: they have identical vector representations, so the Euclidean distance between them is 0.
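The counting and distance steps described above can be sketched in a few lines of Python. This sketch assumes that a word's context is the pair (previous word, next word); the table in the image may define its contexts differently, so the column order and labels here are illustrative only.

```python
import math
from collections import Counter

text = "It was the sunniest of days, it was the rainiest of days"
tokens = text.lower().replace(",", "").split()

def context_of(i):
    """Assumed context: the (previous word, next word) pair around position i."""
    prev_word = tokens[i - 1] if i > 0 else "<start>"
    next_word = tokens[i + 1] if i < len(tokens) - 1 else "<end>"
    return (prev_word, next_word)

# One column per distinct context, in a fixed (sorted) order.
contexts = sorted({context_of(i) for i in range(len(tokens))})

# How many times each word occurs in each context.
counts = Counter((tokens[i], context_of(i)) for i in range(len(tokens)))

def vector(word):
    """Count vector of `word` over all contexts."""
    return [counts[(word, c)] for c in contexts]

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(len(contexts))                                      # number of dimensions
print(euclidean(vector("sunniest"), vector("rainiest")))  # 0.0
```

Under this assumed context definition, "sunniest" and "rainiest" both occur only between "the" and "of", so their vectors are identical and the distance is 0.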


### Word2vec from Google

A common question at this point is how to handle a large corpus of text, given that there will be many possible contexts to deal with; even the 10-dimensional vector space in the example scenario was cumbersome enough. Luckily, pre-trained word vectors are available that can be used to detect word similarities. A set of pre-trained word vectors from Google is available here:

https://code.google.com/archive/p/word2vec/
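As a rough sketch, pre-trained vectors in the word2vec binary format can be loaded with the gensim library. This assumes gensim is installed and that the archive from the page above has been downloaded and unzipped; the file name `GoogleNews-vectors-negative300.bin` and the helper name `load_google_vectors` are assumptions, so adjust them to your setup.

```python
def load_google_vectors(path="GoogleNews-vectors-negative300.bin"):
    """Load pre-trained vectors stored in the word2vec binary format.

    Assumes gensim is installed and `path` points to the unzipped file.
    """
    # Imported lazily so the function can be defined without gensim present.
    from gensim.models import KeyedVectors
    return KeyedVectors.load_word2vec_format(path, binary=True)

# Example usage (not run here because it needs the large downloaded file):
# vectors = load_google_vectors()
# print(vectors.similarity("light", "heavy"))      # similarity score of two words
# print(vectors.most_similar("traffic", topn=5))   # five nearest words
```

Note that gensim's `similarity` reports cosine similarity rather than the Euclidean distance used in the earlier example.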

### Variants of the Word2vec model


**1) Skip-gram:** This variant tries to predict the neighbors of a word. In the skip-gram model, we take a center word and a window of context (neighboring) words, and train the model to predict the context words within that window for each center word.

This notion of “context” or “neighboring” words is best described by considering a center word and a window of words around it. 

For example, if we consider the sentence **“The speedy Porsche drove past the elegant Rolls-Royce”** and a window size of 2, the first four center words give the following pairs for the skip-gram model:

**Text** (center word: *The*):
The speedy Porsche drove past the elegant Rolls-Royce

*Training samples with window of 2*: (the, speedy), (the, Porsche)

**Text** (center word: *speedy*):
The speedy Porsche drove past the elegant Rolls-Royce

*Training samples with window of 2*: (speedy, the), (speedy, Porsche), (speedy, drove)

**Text** (center word: *Porsche*):
The speedy Porsche drove past the elegant Rolls-Royce

*Training samples with window of 2*: (Porsche, the), (Porsche, speedy), (Porsche, drove), (Porsche, past)

**Text** (center word: *drove*):
The speedy Porsche drove past the elegant Rolls-Royce

*Training samples with window of 2*: (drove, speedy), (drove, Porsche), (drove, past), (drove, the)
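The pairs listed above can be generated programmatically. The following is a minimal sketch; the function name `skipgram_pairs` is hypothetical, the first word is lowercased to match the listed pairs, and the car name is spelled "Porsche".

```python
def skipgram_pairs(tokens, window):
    """Return (center, context) pairs for every word within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

tokens = "the speedy Porsche drove past the elegant Rolls-Royce".split()
pairs = skipgram_pairs(tokens, window=2)

# Pairs for the first four center words, matching the samples listed above.
for pair in pairs[:13]:
    print(pair)
```

Words near the start or end of the sentence produce fewer pairs because part of their window falls outside the text.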

The **skip-gram model** outputs a probability distribution: the probability of each word appearing in the context of a given center word. Training then selects the vector representations that maximize the probability of the context words actually observed.
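One common way to turn word vectors into such a probability distribution is a softmax over dot products between the center word's vector and each candidate context word's vector. The sketch below uses hypothetical 2-dimensional vectors chosen purely for illustration; they are not trained values, and the function name `context_probabilities` is an assumption.

```python
import math

# Hypothetical vectors, made up for illustration only.
center_vector = [1.0, 0.5]
context_vectors = {
    "speedy": [0.9, 0.4],
    "drove": [0.8, 0.1],
    "elegant": [-0.5, 0.2],
}

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def context_probabilities(center, candidates):
    """Softmax over the dot products of the center vector with each candidate."""
    scores = {word: math.exp(dot(center, vec)) for word, vec in candidates.items()}
    total = sum(scores.values())
    return {word: score / total for word, score in scores.items()}

probs = context_probabilities(center_vector, context_vectors)
for word, p in probs.items():
    print(word, round(p, 3))
```

The candidate whose vector points in the most similar direction to the center vector receives the highest probability, and training nudges the vectors so that observed context words score highly.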


![alt text](https://www.dropbox.com/s/c7mwy6dk9k99bgh/Image%202%20-%20SkipGrams.jpg?raw=1)

**2) Continuous Bag of Words (CBOW):** This model tries to predict a center word based on the neighboring words. In the CBOW model, we input the context words within the window (such as “the”, “Porsche”, “drove”) and aim to predict the target or center word “speedy”. The input to the prediction pipeline is reversed compared to the skip-gram model.
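The reversed direction can be sketched as follows: each training sample pairs the list of context words with the center word to be predicted. The function name `cbow_samples` is hypothetical, and the car name is spelled "Porsche".

```python
def cbow_samples(tokens, window):
    """Return (context_words, center) samples for a CBOW-style model."""
    samples = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        # All words in the window except the center word itself.
        context = tokens[lo:i] + tokens[i + 1:hi]
        samples.append((context, center))
    return samples

tokens = "the speedy Porsche drove past the elegant Rolls-Royce".split()
samples = cbow_samples(tokens, window=2)

context, center = samples[1]
print(context, "->", center)  # ['the', 'Porsche', 'drove'] -> speedy
```

Compared with skip-gram, which emits one (center, context) pair per neighbor, this sketch emits a single sample per center word with all its neighbors as input.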

The diagrams accompanying each variant depict the input-to-output prediction pipeline and help crystallize the difference between skip-gram and Continuous Bag of Words.


![alt text](https://www.dropbox.com/s/k3ddmbtd52wq2li/Image%203%20-%20CBOW%20Model.jpg?raw=1)