# <font color='orange'>Word2Vec Introduction </font>

The Natural Language Processing lecture in our data science class covered topics such as bag of words and n-grams, with a key note from the professor: "this lecture may be subject to change in the upcoming years, as massive improvements in 'off-the-shelf' language understanding are ongoing." This tutorial introduces one such deep learning method, Word2Vec, which aims to learn the meaning of words rather than relying on heuristic approaches.

To limit the scope of this tutorial and avoid the deeper mathematics, word2vec can be described as a model in which each word is represented as a vector in N-dimensional space. At initialization, every word in the corpus vocabulary is assigned a vector of random values. Adding a small fraction of one vector to another moves the two vectors closer together, and subtracting it moves them further apart. During training we therefore pull the vector of each word closer to the words it co-occurs with inside a specified window, and push it away from all the other words. By the end of training, words that are similar in meaning end up clustered near each other, and many other patterns emerge automatically.
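
The pull-and-push intuition can be checked with a few lines of plain Python. This is only a minimal sketch, not the actual word2vec update: the two 2-dimensional vectors, the step size of 0.1, and the helper names `cosine` and `nudge` are all illustrative assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nudge(a, b, step):
    """Return a + step * b; a negative step subtracts a bit of b."""
    return [x + step * y for x, y in zip(a, b)]

# Two made-up word vectors (hypothetical values, not from a trained model)
word_a = [1.0, 0.2]
word_b = [0.3, 1.0]

before = cosine(word_a, word_b)
pulled = cosine(nudge(word_a, word_b, 0.1), word_b)    # add a bit of word_b
pushed = cosine(nudge(word_a, word_b, -0.1), word_b)   # subtract a bit of word_b

print(before, pulled, pushed)
```

Adding a fraction of `word_b` raises the cosine similarity between the two vectors, and subtracting it lowers the similarity.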

## <font color='orange'>Technique</font>

Word2Vec uses a technique called <b>skip-gram with negative sampling</b> to map the semantic space of a text corpus. I explain it below using a passage taken from the dataset used in this tutorial.


“This system is sometimes known as a presidential system because the government is answerable<font color='green'> <i>solely and exclusively to a </i></font><font color='red'><i><b> 'presiding' </b></i></font><font color='green'><i>activist head of state, and</i></font> is selected by and on occasion dismissed by the head of state without reference to the legislature.”

<b>Step 1</b>  
Let’s take a word from the above passage as the target, and the words occurring close to it as the context (five words on either side of the target).  
<i>Target</i>  = presiding  
<i>Context</i> = solely and exclusively to a; activist head of state, and
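
Step 1 can be sketched in a few lines of Python. The helper name `context_window` is hypothetical, and the passage is assumed to be already lowercased and stripped of punctuation:

```python
def context_window(tokens, target_index, window=5):
    """Return the tokens up to `window` positions left and right of the target."""
    start = max(0, target_index - window)
    left = tokens[start:target_index]
    right = tokens[target_index + 1:target_index + 1 + window]
    return left, right

# Fragment of the example passage, lowercased and without punctuation
passage = ("the government is answerable solely and exclusively to a "
           "presiding activist head of state and is selected by").split()

target_index = passage.index("presiding")
left, right = context_window(passage, target_index, window=5)

print("Target :", passage[target_index])
print("Context:", left, right)
```

Near the start or end of a sentence the window is simply shorter, which is why the left slice is clamped at index 0.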


<b>Step 2</b>  
Each word in the above paragraph is represented as a vector in n-dimensional space. In the beginning, the values of these vectors can be randomly initialized.  
Note: The dimension is decided at model creation time and is given as one of the parameters of the model.

<b>Step 3</b>  
Our goal is to get the target and context closer to each other in vector space. This is done by taking the target and context vectors and pulling them together by a small amount, which maximizes a log transformation (the log-sigmoid) of the dot product of the target and context vectors.

<b>Step 4</b>  
Another goal is to move the target vector away from words that are not in its context. To achieve this, words are randomly sampled from the rest of the corpus and pushed away from the target vector, which minimizes the same log transformation of the dot product of the target and the sampled non-context words.

The above four steps are repeated for each target word. In the end, the vectors of words that are often used together will be pulled towards each other, and the vectors of words that are rarely used together will be pushed apart. 

In our example, if ‘presiding’ often occurs near ‘activist head’ in the corpus, then the vectors for these two concepts will be close to one another.
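
Steps 2 to 4 can be sketched as a single gradient step in plain Python. This is a simplified illustration under stated assumptions, not gensim's implementation: the function name `sgns_step`, the learning rate, and the hand-picked 2-dimensional starting vectors (used in place of random initialization so the run is reproducible) are all hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sgns_step(target, context, negatives, lr=0.1):
    """One in-place update: pull `context` towards `target`, push `negatives` away."""
    grad_target = [0.0] * len(target)
    # Label 1 for the true context word, 0 for each sampled negative word
    for vec, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
        g = lr * (label - sigmoid(dot(target, vec)))
        for i in range(len(target)):
            grad_target[i] += g * vec[i]   # accumulate the change for the target
            vec[i] += g * target[i]        # move the context / negative vector
    for i in range(len(target)):
        target[i] += grad_target[i]

# Hand-picked starting vectors (assumed values for illustration)
target = [0.1, 0.1]
context = [0.5, 0.0]
negative = [0.0, 0.5]

pos_before, neg_before = dot(target, context), dot(target, negative)
for _ in range(20):
    sgns_step(target, context, [negative])
pos_after, neg_after = dot(target, context), dot(target, negative)

print("target . context :", pos_before, "->", pos_after)
print("target . negative:", neg_before, "->", neg_after)
```

After a few steps the dot product with the context word grows while the dot product with the negative sample shrinks, which is the clustering behaviour described above.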


## <font color='blue'>Data Collection</font>


I used the [ClueWeb datasets](http://lemurproject.org/clueweb12/), which contain around 50 million Wikipedia documents, and indexed them using [Apache Lucene](https://lucene.apache.org/core/). After indexing, I used the Lucene search engine to extract the top 1000 documents for the query "president united states". These documents are stored in a local file named "wikidocs".  
I am using these 1000 documents to build the word2vec model and find interesting relationships between words.
Please note that the process of indexing and retrieving the documents is beyond the scope of this tutorial.

The file wikidocs can be downloaded from https://www.dropbox.com/s/rnu33c4j6ywnu6z/wikidocs?dl=0. Make sure to save the file in the same directory as the notebook.


In [1]:
#Load the documents extracted from wikipedia corpus
with open("wikidocs") as f:
    html = f.read()
    html = unicode(html, errors='replace')
    
#Printing the length of the corpus
print("The length of corpus is "+str(len(html))+".")


The length of corpus is 80120356.


## <font color='blue'> Install Libraries</font>
The collected data is in XML form and needs to be cleaned into plain text. I am using the BeautifulSoup library to parse the Wikipedia documents and the NLTK tokenizer to convert them into tokens.  
Please note that these libraries have already been introduced in class and used in the homework. Hence, running the command below should work for you.

In [2]:
import urllib
import re
import pandas as pd
import nltk.data
from nltk.corpus import stopwords
from bs4 import BeautifulSoup
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/vallarimehta/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## <font color='blue'> Parsing </font>  
 
<b>Convert the docs into list of sentences</b>  
The Word2Vec toolkit in Python expects its input as a list of sentences, each of which is a list of words.   


<b>Remove punctuation and lowercase all words</b>  
The words are converted to lowercase and punctuation is removed using regular expressions.

<b>Stopwords</b>  
Removal of stopwords is optional. For the first training run it is better not to remove them, as the word2vec algorithm relies on the broader context of the sentence to produce high-quality word vectors. However, if the results are not satisfactory, the model can be trained again with the stop words removed. 

In [3]:
# Function to convert a document to a sequence of words,
# optionally removing stop words.  
# Returns a list of words.
def review_to_wordlist(review, remove_stopwords=False):

    # 1. Remove HTML
    review_text = BeautifulSoup(review,"lxml").get_text()
    review_text = review_text.encode('utf-8')

    # 2. Remove non-letters
    review_text = re.sub("[^a-zA-Z]"," ", review_text)

    # 3. Convert words to lower case and split them
    words = review_text.lower()
    words = words.split()

    # 4. Optionally remove stop words (false by default)
    if remove_stopwords:
        stops = set(stopwords.words("english"))
        words = [w for w in words if w not in stops]

    # 5. Return a list of words
    return words

The next function splits a document into parsed sentences using the NLTK tokenizer. 

In [4]:
# Function to split a review into parsed sentences. Returns a
# list of sentences, where each sentence is a list of words
def review_to_sentences( review, tokenizer, remove_stopwords=False):

    # 1. Use the NLTK tokenizer to split the paragraph into sentences
    raw_sentences = tokenizer.tokenize(review.strip())

    # 2. Loop over each sentence
    sentences = []
    for raw_sentence in raw_sentences:
        # If a sentence is empty, skip it
        if len(raw_sentence) > 0:
            # Otherwise, call review_to_wordlist to get a list of words
            sentences.append(review_to_wordlist(raw_sentence,
                                                remove_stopwords))

    # Return the list of sentences (each sentence is a list of words,
    # so this returns a list of lists)
    return sentences

Now we apply these functions to our Wikipedia corpus.

In [28]:
#Parse documents to create list of sentences
tokenizer = nltk.data.load('nltk:tokenizers/punkt/english.pickle')
sentences = review_to_sentences(html, tokenizer)

print("There are " + str(len(sentences)) + " sentences in our corpus.")


There are 97832 sentences in our corpus.


Let's observe what a sentence looks like after passing it through the above functions.

In [57]:
sentences[100]

['in',
 'presidential',
 'systems',
 'the',
 'head',
 'of',
 'state',
 'often',
 'has',
 'power',
 'to',
 'veto',
 'a',
 'bill']

# <font color='green'>Generating Word2Vec Model</font>

## <font color='red'> Installing the libraries</font>

You can install gensim using pip:

       $ pip install --upgrade gensim
    

If this fails, make sure you’re installing into a writeable location (or use sudo), and that you have the following dependencies installed.  

       Python >= 2.6
       NumPy >= 1.3  
       SciPy >= 0.7  




Alternatively, you can use the conda package manager to install gensim, which takes care of all the dependencies.   
        
       $ conda install -c anaconda gensim=0.12.4


## <font color='red'>Load libraries for word2vec</font>
After you run all the installs, make sure the following commands work for you:

In [30]:
import gensim
from gensim.models import word2vec
from gensim.models import Phrases
from gensim.models import Word2Vec
import logging

## <font color='red'>Training the Model </font>

## Logging
Import the built-in logging module and configure it so that Word2Vec produces readable progress messages.

In [31]:
#Using built in logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',\
    level=logging.INFO)

## Parameters for model

The Word2Vec model requires some parameters for initialization. 

<b>size</b>  
Size is the number of dimensions you want for the word vectors. If you have an idea of how many topics the corpus covers, you can use that number as the size. For Wikipedia documents I typically use around 50-100, although the cell below sets it to 200. Usually you will need to experiment with this value and pick the one that gives the best results.

<b>min_count</b>  
Terms that occur fewer than min_count times are ignored in the calculations. This reduces noise in the vector space. I have used 10 for my experiment. For a bigger corpus, you can experiment with higher values.

<b>window</b>    
The maximum distance between the current and predicted word within a sentence. This is explained in the technique section of the tutorial.

<b>sample</b> (downsampling)    
Threshold for configuring which higher-frequency words are randomly downsampled. The useful range is (0, 1e-5); the cell below uses 1e-3.
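
To make min_count and downsampling more concrete, here is a small sketch on a toy corpus. It assumes the discard probability from the "Distributed Representations of Words and Phrases" paper, P(discard) = 1 - sqrt(t / f), where t is the threshold and f is the relative frequency of the word; gensim's internal computation may differ in detail. The helper names `prune_vocab` and `discard_probability` are hypothetical.

```python
import math
from collections import Counter

def prune_vocab(sentences, min_count):
    """Count every token and keep those seen at least `min_count` times."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word: n for word, n in counts.items() if n >= min_count}

def discard_probability(word_count, total_count, sample):
    """Paper formula: 1 - sqrt(sample / frequency), clipped at 0."""
    frequency = word_count / float(total_count)
    return max(0.0, 1.0 - math.sqrt(sample / frequency))

toy_sentences = [["the", "president", "of", "the", "senate"],
                 ["the", "senate", "and", "the", "house"]]

vocab = prune_vocab(toy_sentences, min_count=2)
total = sum(vocab.values())

# The toy corpus is tiny, so an unrealistically large threshold (0.1) is used
for word in sorted(vocab):
    print(word, vocab[word], discard_probability(vocab[word], total, sample=0.1))
```

Frequent words such as "the" get a higher discard probability than rarer ones, and a word whose relative frequency is below the threshold is never discarded.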

In [32]:
# Set values for various parameters
num_features = 200    # Word vector dimensionality                      
min_word_count = 10   # Minimum word count                        
num_workers = 4       # Number of threads to run in parallel
context = 10          # Context window size                                                                                    
downsampling = 1e-3   # Downsample setting for frequent words

## Initialize and train the model
Train the model using the above parameters. This might take some time.

In [33]:
print "Training model..."
model = word2vec.Word2Vec(sentences, workers=num_workers, \
            size=num_features, min_count = min_word_count, \
            window = context, sample = downsampling)

Training model...


If you don’t plan to train the model any further, calling init_sims(replace=True) makes the model more memory-efficient.

In [34]:
model.init_sims(replace=True)

## <font color='red'>Storing & Loading Models</font>  
It can be helpful to give the model a meaningful name and save it for later use. 
You can load it later using Word2Vec.load().

In [35]:
#Save the model under a meaningful name (200 features, min word count 10)
model_name = "wiki_200features_10word_count"
model.save(model_name)
#Loading the saved model
word2vec_model = gensim.models.Word2Vec.load(model_name)

## <font color='red'>Investigate the vocabulary</font>
You can use either model.index2word, which gives a list of all the terms in the vocabulary, or model.vocab.keys(), which gives the keys of all the terms used in the model.

In [58]:
#List of all the vocabulary terms
vocab = model.index2word
print "Length of vocabulary =",len(vocab)
print vocab[20:69]

Length of vocabulary = 11674
['congress', 'from', 'vice', 'that', 'var', 'was', 'this', 'have', 'with', 'office', 'which', 'any', 'senate', 'u', 'house', 'all', 'not', 'at', 'presidential', 'constitution', 'no', 'may', 'other', 'such', 'are', 'one', 'new', 'government', 'law', 'if', 'it', 'he', 'american', 'two', 'an', 'article', 'but', 'section', 'http', 'their', 'federal', 'd', 'each', 'us', 'who', 'amendment', 'his', 'representatives', 'executive']


Check if the word ‘obama’ exists in the vocabulary:



In [37]:
'obama' in model.vocab

True

Check if the word ‘beyonce’ exists in the vocabulary:

In [38]:
'beyonce' in model.vocab

False

The vector representation of word ‘obama’ looks like this:

In [39]:
model['obama']

array([  8.86289869e-03,   1.27600618e-02,  -3.78985070e-02,
         5.57042398e-02,  -8.71062372e-03,  -1.24862291e-01,
        -8.76149163e-02,   1.56393647e-02,  -1.55269457e-02,
         2.65169926e-02,  -8.65512937e-02,  -4.16206196e-02,
        -8.79949480e-02,   7.72259906e-02,  -3.61485332e-02,
         2.33706519e-01,  -2.25558355e-01,  -1.25489861e-01,
        -2.32529882e-02,  -1.51585296e-01,  -1.04060665e-01,
         9.67445523e-02,   4.27967943e-02,  -8.07592198e-02,
         6.51278393e-03,   1.35209421e-02,  -8.75448212e-02,
         8.17318112e-02,  -2.99313646e-02,  -2.53677946e-02,
         3.37144290e-03,   7.65991956e-02,   2.20242664e-02,
         6.49797618e-02,  -3.52603421e-02,  -6.62654489e-02,
        -5.77722378e-02,  -6.54017627e-02,   2.97236126e-02,
         5.49717434e-02,   3.95357832e-02,   7.72382468e-02,
        -6.72882199e-02,  -4.48467880e-02,  -1.31184086e-01,
        -8.92265216e-02,  -4.73204330e-02,  -3.11650950e-02,
         1.50380395e-02,

Let's look at the words most similar to "obama":

In [40]:
model.most_similar('obama',  topn=10)

[('barack', 0.8767746686935425),
 ('michelle', 0.6473720669746399),
 ('biden', 0.5824570059776306),
 ('joe', 0.4867965579032898),
 ('palin', 0.46410897374153137),
 ('speaks', 0.45847803354263306),
 ('barak', 0.45301488041877747),
 ('gaza', 0.4382231831550598),
 ('statesbarack', 0.43616873025894165),
 ('mccain', 0.4352244436740875)]
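
The scores above are cosine similarities between word vectors. As a rough sketch of the ranking idea, using made-up 3-dimensional vectors rather than the trained model (the function `rank_by_cosine` and all values in `toy_vectors` are hypothetical):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_cosine(word, vectors, topn=3):
    """Rank every other word by cosine similarity to `word`."""
    scores = [(other, cosine(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: -pair[1])[:topn]

# Made-up vectors for illustration only
toy_vectors = {
    "obama":  [0.9, 0.1, 0.0],
    "barack": [0.8, 0.2, 0.1],
    "senate": [0.1, 0.9, 0.2],
    "gaza":   [0.0, 0.2, 0.9],
}

top = rank_by_cosine("obama", toy_vectors, topn=2)
print(top)
```

A score close to 1 means the two vectors point in nearly the same direction, while a score near 0 means the words are unrelated in this space.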

## <font color='red'>Phrases</font>

We can use gensim's models.phrases module to detect common phrases in the sentences. For example, the two single words "new" and "york" can be combined into the single token "new_york".

In [41]:
bigram = gensim.models.Phrases(sentences)
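
To give an idea of how phrases can be detected, the sketch below scores adjacent word pairs with the formula from the "Distributed Representations of Words and Phrases" paper, score(a, b) = (count(a b) - delta) / (count(a) * count(b)). This is an illustration on a toy corpus; gensim's Phrases applies its own scoring and threshold, which may differ. The function name `bigram_scores` and the value of `delta` are assumptions.

```python
from collections import Counter

def bigram_scores(sentences, delta=1.0):
    """Score adjacent pairs as (count(a b) - delta) / (count(a) * count(b))."""
    unigrams = Counter(w for s in sentences for w in s)
    bigrams = Counter(pair for s in sentences for pair in zip(s, s[1:]))
    return {pair: (n - delta) / (unigrams[pair[0]] * unigrams[pair[1]])
            for pair, n in bigrams.items()}

toy = [["new", "york", "is", "a", "city"],
       ["new", "york", "has", "a", "mayor"],
       ["a", "new", "mayor", "is", "here"]]

scores = bigram_scores(toy)
best = max(scores, key=scores.get)
print(best, scores[best])
```

Pairs that occur together more often than their individual counts would suggest get the highest score; `delta` discounts pairs that were seen only once.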

Generate the model using the above bigram.

In [42]:
new_model = Word2Vec(bigram[sentences], workers=num_workers, \
            size=num_features, min_count = min_word_count, \
            window = context, sample = downsampling)


Let's access the new vocabulary.

In [52]:
vocab = new_model.vocab.keys()
vocab = list(vocab)
print "Length of new vocabulary =",len(vocab)
print vocab[15:55]

Length of new vocabulary = 16802
[u'eligible', u'electricity', u'senator_lee', u'performance_it', u'gun_violence', u'bedford_jun', u'sprawling_retreat', u'buck', u'lord', u'thomasville', u'have_originated', u'banzhaf_power', u'congress_assembled', u'deliberation', u'i_haven', u'american_liberalism', u'discuss_economic', u'regional', u'fourteenth_amendment', u'dell', u'sloven_ina', u'appropriation', u'domestic_vs', u'scrolltop', u'additional_citations', u'colfax', u'www_masterliness', u'bringing', u'his_career', u'america_telephone', u'four', u'popular_music', u'prize', u'wooden', u'including_debts', u'wednesday', u'jihad', u'cultural_influence', u'succession', u'excluding_indians']


Check if the phrase ‘dominican_republic’ exists in the vocabulary:

In [53]:
'dominican_republic' in new_model.vocab

True

Let’s assess the relationship of words in our semantic vector space. For example, which words are most similar to the word ‘republic’?



In [45]:
new_model.most_similar('republic',  topn=10)

[(u'ireland', 0.7163577079772949),
 (u'spain', 0.6763077974319458),
 (u'france', 0.6528872847557068),
 (u'central', 0.6357733011245728),
 (u'china', 0.6244645118713379),
 (u'parliament', 0.6186147332191467),
 (u'officially', 0.594301164150238),
 (u'quebec', 0.5794820189476013),
 (u'china_taiwan', 0.5794057846069336),
 (u'korea', 0.5775099992752075)]

What about the phrase "dominican republic"?



In [23]:
new_model.most_similar('dominican_republic',  topn=10)

[(u'ghana', 0.9546933174133301),
 (u'guinea', 0.9521167278289795),
 (u'morocco', 0.9403805732727051),
 (u'papua_new', 0.9389972686767578),
 (u'namibia', 0.9385524392127991),
 (u'paraguay_peru', 0.9382070302963257),
 (u'suriname', 0.9381791353225708),
 (u'luxembourg', 0.9376348257064819),
 (u'guyana_haiti', 0.9367575645446777),
 (u'costa_rica', 0.9362567067146301)]

Do the results differ if we exclude the relationship between republic and dominican_republic by passing the latter as a negative example?




In [46]:
new_model.most_similar(positive=['republic'], negative=['dominican_republic'], topn=10)

[(u'the', 0.607554018497467),
 (u'delegated', 0.556876540184021),
 (u'government', 0.5254863500595093),
 (u'parliament', 0.5150182247161865),
 (u'parliamentary', 0.5106503367424011),
 (u'factions', 0.5051050782203674),
 (u'of', 0.49954289197921753),
 (u'collective_head', 0.48483002185821533),
 (u'is', 0.47944697737693787),
 (u'concurring', 0.4781777858734131)]
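
One way to think about positive and negative terms is as vector arithmetic: add the unit vectors of the positive words, subtract those of the negative words, and rank the remaining words by cosine similarity to the result. The sketch below illustrates this on made-up 2-dimensional vectors; it is an assumption-laden simplification rather than gensim's exact code, and `analogy_query` and `toy_vectors` are hypothetical names.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def analogy_query(vectors, positive, negative, topn=2):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    signed = [(w, 1.0) for w in positive] + [(w, -1.0) for w in negative]
    for word, sign in signed:
        for i, x in enumerate(unit(vectors[word])):
            query[i] += sign * x
    query = unit(query)
    exclude = set(positive) | set(negative)
    scores = [(w, sum(a * b for a, b in zip(query, unit(v))))
              for w, v in vectors.items() if w not in exclude]
    return sorted(scores, key=lambda pair: -pair[1])[:topn]

# Made-up vectors: first axis ~ "government", second axis ~ "country"
toy_vectors = {
    "republic":           [0.7, 0.7],
    "dominican_republic": [0.1, 1.0],
    "parliament":         [1.0, 0.1],
    "ghana":              [0.0, 1.0],
}

ranked = analogy_query(toy_vectors, ["republic"], ["dominican_republic"])
print(ranked)
```

Subtracting the "country" direction leaves mostly the "government" direction, so the government-related word ranks first. This mirrors how terms such as parliament and government rise to the top in the output above.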

## <font color='red'>Query Expansion</font>

One application of word embeddings is query expansion in search engines: related terms are added to the query in order to produce better results.

Recall that the Wikipedia documents were extracted from the Lucene search engine using the query "president united states". Now, let's use these three query terms to obtain the expanded terms closest to the query.

Note: This idea is taken from the paper: https://www.microsoft.com/en-us/research/wp-content/uploads/2016/06/acl2016.pdf


The function below compares each term in the vocabulary with each term of the query using the similarity score. The similarity scores are summed over all query terms, and the top k terms are returned.

In [47]:
#Function to expand a query with its k most similar vocabulary terms

def expand_term(query, k):

    #Get the vocab of model and the English stop words
    vocab = set(new_model.index2word)
    stops = set(stopwords.words("english"))

    #Lower-case and split the query, then drop stop words
    #and terms the model has no vector for
    query_terms = [word.lower() for word in query.split()]
    query_terms = [word for word in query_terms
                   if word not in stops and word in vocab]

    #Filter the vocab to remove stopwords
    filter_vocab = [word for word in vocab if word not in stops]

    #Sum the similarity of each vocab term to every query term
    term_score = {}
    for term in filter_vocab:
        term_score[term] = sum(new_model.similarity(q, term)
                               for q in query_terms)

    #Keep the k terms with the highest total score
    sorted_k_terms = sorted(term_score.items(), key=lambda x: -x[1])[:k]

    #Return the expanded terms; going through a dict means the
    #returned order does not follow the scores
    return dict(sorted_k_terms).keys()


Now, let's test our function by checking the result for the query "president united states":

In [55]:
query = "president united states"
k = 15  #k defines number of expanded terms
result = expand_term(query,k)
print result

[u'kingdom', u'united', u'army', u'hospital', u'methodist', u'airlines', u'nations', u'states', u'methodist_church', u'health_care', u'way', u'president', u'highest', u'association', u'bank']


Now let's try "republic constitution":

In [56]:
query = "republic constitution"
k = 15  #k defines number of expanded terms
result = expand_term(query,k)
print result

[u'commonwealth', u'parliament', u'assembly', u'constitution', u'great_britain', u'known_as', u'france_spain', u'republic', u'federal_government', u'ireland', u'constitution_tempered', u'central', u'america', u'officially', u'china']


## <font color='orange'>Applications</font>

The Word2Vec model described in this tutorial captures semantic and syntactic relationships among the words in a corpus. Hence, it can be used in search engines for synonyms and query expansion, as well as in recommendations (for example, recommending similar movies). 

In our experiments, word embeddings do not seem to provide enough discriminative power between related but distinct concepts. This could be due to the small corpus size, and also because word embeddings are still at an early stage of development. Hence, there is huge scope for improvement before the above technique can be fully utilized in commercial applications. 

That being said, word2vec models are extremely interesting, and it's a lot of fun to explore the relationships among different words.



## <font color='orange'>References</font>
[Google's code, writeup, and the accompanying papers](https://code.google.com/archive/p/word2vec/)   
[Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/pdf/1301.3781.pdf)   
[Distributed Representations of Words and Phrases and their Compositionality](https://arxiv.org/pdf/1310.4546.pdf)   
[Presentation on word2vec by Tomas Mikolov from Google](https://docs.google.com/file/d/0B7XkCwpI5KDYRWRnd1RzWXQ2TWc/edit)   

