# Word Embeddings #

In class we learned about distributed representations of words, also known as word embeddings. Neural network word embedding models generate word representations in which the similarity between words can be interpreted directly. This type of vector representation can be used to predict the words that appear in the context of a given word. For a word $w_i$ we define a context $w_{c_i}$, which consists of the words that appear within a fixed-size window before and after it. Context words are modeled with a conditional distribution given the center word:

$$ \Large p(w_{c_i} | w_i) $$

Word embeddings are in fact feature vectors assigned to each word in the collection. The conditional probability of a context word $w_j$, $j \in c_i$, given the word $w_i$ is computed using the softmax function, where $V$ is the size of the vocabulary:

$$ \Large p(w_j | w_i)= \frac{ \exp ({v^{'}_j}^T v_i)}{ \sum_{k=1}^V {\exp ({v^{'}_k}^T v_i)}} $$

Note that each word is assigned two feature vectors, $v^{'}$ and $v$, which are referred to as the context and word embedding vectors, respectively.
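As a rough illustration of the softmax above, the sketch below computes $p(w_j | w_i)$ for a toy vocabulary. The vocabulary, the two-dimensional vectors, and the helper name `context_probabilities` are made up for the example and do not come from a trained model.

```python
import math

# Toy vocabulary with hypothetical 2-d vectors (illustrative values only).
vocab = ["phone", "case", "battery"]
word_vecs = {"phone": [0.5, 0.1], "case": [0.4, 0.3], "battery": [-0.2, 0.6]}     # v
context_vecs = {"phone": [0.3, 0.2], "case": [0.6, 0.0], "battery": [-0.1, 0.5]}  # v'

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def context_probabilities(center):
    """Return p(w_j | center) for every word j in the vocabulary."""
    scores = [dot(context_vecs[w], word_vecs[center]) for w in vocab]
    # Subtract the largest score before exponentiating for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return {w: e / total for w, e in zip(vocab, exps)}

probs = context_probabilities("phone")
for word, p in probs.items():
    print(f"p({word} | phone) = {p:.3f}")
```

The probabilities sum to one over the vocabulary, and the context word whose vector has the largest dot product with the center word's vector receives the highest probability.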

We use this conditional to define an objective, which is the product of the conditionals over all the words in the collection. In practice we work with the log of this product, which is a sum of the logs of the conditionals, averaged over the $|C|$ words in the collection:

$$ \Large \frac{1}{|C|} \sum_{i=1}^{|C|} \sum_{j \in c_i} \log p(w_j | w_i)$$

The word embeddings are obtained by maximizing this objective with gradient-based methods.
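To make the objective concrete, here is a minimal sketch that evaluates the average log-likelihood on a toy token sequence. The sequence, the window size of 1, the random initial vectors, and the names `log_softmax_prob` and `average_log_likelihood` are assumptions for illustration. A real implementation would also update the vectors with gradient steps, which is omitted here.

```python
import math
import random

random.seed(0)

tokens = ["great", "phone", "case", "great", "battery"]  # toy collection
vocab = sorted(set(tokens))
dim = 3

# Randomly initialised word (v) and context (v') vectors, as before training.
v = {w: [random.uniform(-0.5, 0.5) for _ in range(dim)] for w in vocab}
v_ctx = {w: [random.uniform(-0.5, 0.5) for _ in range(dim)] for w in vocab}

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def log_softmax_prob(context_word, center_word):
    """log p(context_word | center_word) under the softmax model."""
    scores = [dot(v_ctx[k], v[center_word]) for k in vocab]
    m = max(scores)
    log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
    return dot(v_ctx[context_word], v[center_word]) - log_norm

def average_log_likelihood(tokens, window=1):
    total = 0.0
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        # Context = words inside the window, excluding the center position.
        for j in range(lo, hi):
            if j != i:
                total += log_softmax_prob(tokens[j], center)
    return total / len(tokens)

objective = average_log_likelihood(tokens)
print(f"average log-likelihood: {objective:.4f}")
```

Training would adjust `v` and `v_ctx` so that this value increases, that is, so that observed context words become more probable.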

In class we learned about two word embedding models: continuous bag of words (CBOW) and skip-gram. The major difference between the two models is how the conditional distribution is defined. In the skip-gram model each context word is conditioned on the observed word; this is the conditional distribution we used at the beginning of this section. CBOW, on the other hand, models each observed word conditioned on its context:

$$ \Large p(w_i | w_{c_i}) $$
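The difference between the two models can be seen in the training examples each one derives from the same window. The sketch below is illustrative only: the sentence, the window size of 2, and the function names `window_context`, `skipgram_examples`, and `cbow_examples` are assumptions, not part of Gensim.

```python
def window_context(tokens, i, window):
    """Words within `window` positions of index i, excluding position i."""
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    return [tokens[j] for j in range(lo, hi) if j != i]

def skipgram_examples(tokens, window=2):
    """Skip-gram: predict each context word from the center word."""
    return [(center, ctx)
            for i, center in enumerate(tokens)
            for ctx in window_context(tokens, i, window)]

def cbow_examples(tokens, window=2):
    """CBOW: predict the center word from its whole context."""
    return [(window_context(tokens, i, window), center)
            for i, center in enumerate(tokens)]

sentence = ["best", "phone", "case", "ever"]
print(skipgram_examples(sentence)[:3])
print(cbow_examples(sentence)[:2])
```

Skip-gram produces one (center, context word) pair per context word, while CBOW produces one (context, center) example per position.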


In this lab session we are going to get hands-on experience in using word embeddings to represent words in a collection. More specifically, we'll be working with the Gensim implementation of the word2vec family of word embedding models, which consists of the skip-gram and CBOW models. We are going to use the Amazon product reviews collection. To get a richer word embedding model we'll use the version of the data that contains 10k reviews.

## Loading the Collection ##

Let's load the Amazon product reviews data. As a reminder, the reviews data is semi-structured and stored in JSON format. Below is a preview of this data, showing the entry for one review:  
`
{
  "reviewerID": "A3HVRXV0LVJN7",
  "asin": "0110400550",
  "reviewerName": "BiancaNicole",
  "helpful": [
    4,
    4
  ],
  "reviewText": "Best phone case ever . Everywhere I go I get a ton of compliments on it. It was in perfect condition as well.",
  "overall": 5.0,
  "summary": "A++++",
  "unixReviewTime": 1358035200,
  "reviewTime": "01 13, 2013"
}
`
This dataset comes with a set of Python functions that help us convert the reviews from JSON format to pandas DataFrames. 

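In [None]:
import ast
import gzip
import pandas as pd

def parse(path):
  # Each line of the gzipped file holds one review, parsed here as a Python dict literal.
  with gzip.open(path, 'rt', encoding='utf-8') as g:
    for line in g:
      # ast.literal_eval only accepts Python literals, unlike eval,
      # which would execute arbitrary code found in the file.
      yield ast.literal_eval(line)

def getDF(path):
  # Collect the reviews into a dict keyed by row number, then build the DataFrame.
  df = {i: d for i, d in enumerate(parse(path))}
  return pd.DataFrame.from_dict(df, orient='index')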

With these helper functions we'll extract the "reviewText" field from each review:

In [None]:
import pandas as pd
import gzip

review_file = "../../../data/amazon_reviews/cp/reviews_Cell_Phones_and_Accessories_h10k.json.gz"

df = getDF(review_file)
print(df['reviewText'])

## Text Processing ##
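Now that we've extracted the reviews we'll proceed by tokenizing them. In this next step we'll perform the following:  
* Extract sentences
* Tokenize words
* Lowercase the tokens
* Keep only alphabetic tokens, which removes punctuation marks and numbers

Note that the cell below does not filter out stopwords, so they remain in the tokenized sentences.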

In [None]:
import nltk

# English stopword list. It is loaded here but not applied in the loop below.
stopwords_list = nltk.corpus.stopwords.words('english')
# Tokenized sentences (one list of tokens per sentence), used later for training:
tok_sentences = list()
# Flat list of all tokens, used for the frequency counts:
all_words = list()

for r_count, review in enumerate(df['reviewText'], start=1):
    # Report progress every 1000 reviews
    if r_count % 1000 == 0:
        print(r_count)
    for sentence in nltk.sent_tokenize(review):
        sent_words = nltk.word_tokenize(sentence)
        # Lowercase and keep alphabetic tokens only, which drops punctuation and numbers
        sent_words_tok = [word.lower() for word in sent_words if word.isalpha()]
        tok_sentences.append(sent_words_tok)
        all_words.extend(sent_words_tok)


Let's also obtain the list of words sorted by their frequency count. This will give us a better sense of the words present in this collection:

In [None]:
import numpy as np
frequency_count = nltk.FreqDist(all_words)
# Sort the vocabulary by descending frequency
words = np.array(list(frequency_count.keys()))
word_freq = np.array(list(frequency_count.values()))
freq_sort = np.argsort(word_freq)[::-1]
word_freq_sort = word_freq[freq_sort]
words_sorted = words[freq_sort]
# Print the 1000 most frequent words with their counts
for word, count in zip(words_sorted[:1000], word_freq_sort[:1000]):
    print(f"{word}\t{count}")

## Word2Vec ##
With the Gensim package, training a word embedding model is straightforward. It requires a single call to the __Word2Vec__ class constructor. During the lab session we'll learn about some of its input parameters. With the default settings it uses the CBOW model. It expects as input a list of tokenized sentences, which in our case is the output of the tokenization step above. Let's now train a word embedding model on the Amazon reviews data. 

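In [None]:
from gensim.models import Word2Vec
# PYTHONHASHSEED is read when the interpreter starts, so setting it here does not
# change hashing in the already-running kernel. A fully reproducible run also
# needs a single worker thread (workers=1).
%env PYTHONHASHSEED=1
# Parameter names follow Gensim 4.x; Gensim 3.x used size= and iter= instead of
# vector_size= and epochs=.
model = Word2Vec(tok_sentences, vector_size=100, seed=1, window=5, min_count=1, epochs=5)
model.save(review_file+".w2v")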

Let's now use the trained model to explore the relationships between words in our collection. The __wv.most_similar__ method returns the words most similar to a given input word. Below is an example use:

In [None]:
model.wv.most_similar('quality')
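Conceptually, this kind of query ranks the vocabulary by cosine similarity to the query word's vector. The sketch below reproduces that idea on hypothetical 3-d vectors. The words, the numbers, and the helper `most_similar_toy` are invented for illustration, and this is not Gensim's actual implementation.

```python
import math

# Hypothetical embeddings (illustrative values, not taken from the trained model).
toy_vectors = {
    "quality": [0.9, 0.1, 0.2],
    "durable": [0.8, 0.2, 0.1],
    "sturdy":  [0.7, 0.3, 0.2],
    "charger": [0.1, 0.9, 0.3],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar_toy(word, topn=2):
    """Rank all other words by cosine similarity to `word`."""
    sims = [(other, cosine(toy_vectors[word], vec))
            for other, vec in toy_vectors.items() if other != word]
    return sorted(sims, key=lambda pair: pair[1], reverse=True)[:topn]

ranked = most_similar_toy("quality")
print(ranked)
```

With these toy values, "durable" and "sturdy" rank above "charger" because their vectors point in nearly the same direction as the vector for "quality".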

**[Assignment 1]**  
With the above method explore the word embedding representation of the reviews collection. 

**[Assignment 2]**  
Earlier in class we learned that the word2vec family of models captures semantic and syntactic relationships between words. In particular we learned that vector arithmetic can capture word analogies such as the following:  
“Man is to Woman as King is to Queen”  
which can be obtained by the following arithmetic operation:  
“King” - “Man” + “Woman” ~ “Queen”  
The __wv.most_similar()__ method in Gensim also accepts lists of __positive__ and __negative__ words and returns the words most similar to the result of adding the positive vectors and subtracting the negative ones. Let's look at an example:

In [None]:
model.wv.most_similar(positive=['sound', 'camera'], negative=['picture']) 

Use this method to explore the generated word embeddings.
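The arithmetic behind such a query can be sketched on toy vectors. In the example below the two-dimensional vectors are hand-picked so that one axis loosely encodes royalty and the other gender. They are hypothetical, and the helper `analogy_toy` only approximates what a library does (details such as vector normalization may differ).

```python
import math

# Hand-picked toy vectors: first axis ~ royalty, second axis ~ gender.
toy_vectors = {
    "king":  [0.9, 0.9],
    "queen": [0.9, -0.9],
    "man":   [0.1, 0.9],
    "woman": [0.1, -0.9],
    "phone": [-0.8, 0.1],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy_toy(positive, negative, topn=1):
    """Return the words closest to sum(positive) - sum(negative)."""
    dim = len(next(iter(toy_vectors.values())))
    target = [0.0] * dim
    for w in positive:
        target = [t + x for t, x in zip(target, toy_vectors[w])]
    for w in negative:
        target = [t - x for t, x in zip(target, toy_vectors[w])]
    exclude = set(positive) | set(negative)  # query words are not valid answers
    sims = [(w, cosine(target, vec))
            for w, vec in toy_vectors.items() if w not in exclude]
    return sorted(sims, key=lambda pair: pair[1], reverse=True)[:topn]

answer = analogy_toy(positive=["king", "woman"], negative=["man"])
print(answer)
```

Here "king" - "man" + "woman" lands on the vector for "queen", so "queen" is returned as the closest remaining word.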

**[Solution 2]**

**[Assignment 3]**  
The Gensim implementation of the word2vec family of models provides additional methods that let you further explore the generated embedding space. Below are a few methods along with their descriptions and example uses.

* Returns the word that is not related to the other words in the list:

In [None]:
model.wv.doesnt_match("samsung motorola lg iphone volume".split())

* Computes the similarity between two words:

In [None]:
model.wv.similarity('samsung', 'iphone')

Use these methods to further explore the generated embedding space.
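One plausible way to find the word that does not belong is to average the unit-length vectors of the group and pick the word least similar to that mean. The sketch below illustrates this idea on hypothetical 2-d vectors. The values and the helper `doesnt_match_toy` are assumptions for the example, not Gensim's code.

```python
import math

# Hypothetical embeddings: three brand names close together, one unrelated word.
toy_vectors = {
    "samsung":  [0.9, 0.2],
    "motorola": [0.8, 0.3],
    "lg":       [0.85, 0.1],
    "volume":   [0.1, 0.9],
}

def unit(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def doesnt_match_toy(words):
    """Return the word whose vector is least similar to the group mean."""
    units = {w: unit(toy_vectors[w]) for w in words}
    dim = len(next(iter(units.values())))
    mean = unit([sum(u[d] for u in units.values()) / len(units) for d in range(dim)])
    # The dot product of two unit vectors equals their cosine similarity.
    sims = {w: sum(a * b for a, b in zip(u, mean)) for w, u in units.items()}
    return min(sims, key=sims.get)

odd_one = doesnt_match_toy(["samsung", "motorola", "lg", "volume"])
print(odd_one)
```

The pairwise similarity between two words is the same cosine computed here between a word and the group mean, applied to two word vectors instead.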

**[Solution 3]**

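**[Assignment 4]**  
So far in our assignment we used the CBOW model. In this part of the lab session we are going to use the skip-gram model. You can train the skip-gram model by passing __sg=1__ to the __Word2Vec__ call above:  
`Word2Vec(tok_sentences, vector_size=100, window=4, min_count=5, workers=4, sg=1, hs=1)`  
(`vector_size` is the Gensim 4.x parameter name; Gensim 3.x calls it `size`.)  
Note that the default settings of Word2Vec use negative sampling, while in this case we switch on hierarchical softmax with __hs=1__. Since the __negative__ parameter keeps its default value here, add __negative=0__ if you want to train with hierarchical softmax only. This call also uses a different __window__ and __min_count__ than the CBOW model above, so keep that in mind when comparing results.  
Use some of the previous methods with the skip-gram model and see if you can observe a difference between the two models. 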

**[Solution 4]**