## Introduction

Word2Vec is a widely used technique that maps words into a high-dimensional semantic space. Given a training text corpus in the form of a list of sentences, it builds a vocabulary and maps each word to a vector in this semantic space. It does this using a set of models, specifically two-layer neural networks.

Because it is relatively easy to use, it is a powerful tool for drawing insights from complicated text data sets. The projection onto the semantic space can be used to achieve very interesting results, such as solving logical analogies and classifying types of blog articles.

This tutorial will teach you how to use Word2Vec and will show some practical applications of it, including some interesting insights we can derive from a few datasets. It applies Word2Vec to a less common task: determining the similarity between sentences. It will also compare this approach to a simpler one, namely TF-IDF.

### Tutorial content

This tutorial will show how to use Word2Vec in Python, specifically using [gensim](https://radimrehurek.com/gensim/models/word2vec.html).


We will cover the following topics in this tutorial:
- [Installing the libraries](#Installing-the-libraries)
- [Getting the datasets](#Getting-the-datasets)
- [Parsing data](#Parsing-data)
- [Model training](#Model-training)
- [Compute sentence similarities](#Compute-sentence-similarities)
- [Comparison with TF-IDF Method](#Comparison-with-TF-IDF-Method)
- [Further Resources](#Further-Resources)


## Installing the libraries

Before getting started, you'll need to install and import the various libraries that we will use. Assuming you have Anaconda fully installed, you can install gensim and nltk using `conda`:

    $ conda install -c anaconda gensim

    $ conda install -c anaconda nltk
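
Before running the import cell, it can help to confirm that the packages are visible to your interpreter. The snippet below is a small sketch that uses only the standard library (written for Python 3); the helper name `check_libraries` is made up for this illustration.

```python
import importlib.util

def check_libraries(names):
    """Return a dict mapping each module name to True if it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

# Top-level module names of the libraries imported in this tutorial
status = check_libraries(["gensim", "nltk", "sklearn", "scipy", "pandas", "numpy"])
for name, found in sorted(status.items()):
    print(name, "ok" if found else "MISSING")
```

Any library reported as `MISSING` needs to be installed before continuing.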
    

In [104]:
from gensim.models import Word2Vec, phrases
from gensim.models.word2vec import LineSentence
from gensim.corpora import WikiCorpus
from nltk.tokenize import RegexpTokenizer

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer, TfidfVectorizer
from nltk.corpus import stopwords

from scipy.spatial.distance import cosine
import multiprocessing

import gzip
import numpy as np
import random
import sqlite3
import pandas as pd
import re

## Getting the datasets

So, where do we even start?

The first step to using Word2Vec is having a dataset to train on. The choice of text corpus will depend on your intended use of the trained model. In this tutorial, we will train on two data sets and compare the vocabularies generated by each. Though both data sets are from Amazon Reviews, one contains only sarcastic reviews while the other contains only reviews about Video Games.

The data sets used in this tutorial come from the following places:
- [Amazon Review Data (Video Game Reviews)](http://jmcauley.ucsd.edu/data/amazon/)
- [Amazon Review Data (Reviews marked as sarcastic)](http://storm.cis.fordham.edu/~filatova/SarcasmCorpus.html)

## Parsing data

After downloading the dataset (which may take a while), we need to convert the text into a standard format that the Word2Vec model can train on. This may involve decompressing files, removing punctuation, etc., and is dependent on the input data format. We will look at how to do so for the Amazon Review data below.

In [2]:
# Amazon Review Files:
raw_sarcasm_file_name = 'sarcasm/sarcasm_lines.txt'
sarcasm_file_name = 'corpuses/sarcasm.txt'

raw_amazon_video_file_name = 'amazon_review/reviews_Video_Games_5.json.gz'
amazon_video_file_name = 'corpuses/amazon_vg.txt'

The sarcasm data came in a different format than the rest, so we need to write code to parse that file individually.

In [3]:
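# create sarcasm corpus: one cleaned, lower-cased utterance per line
with open(sarcasm_file_name, 'w') as sarcasm:
    with open(raw_sarcasm_file_name, 'r') as raw_sarcasm:
        for line in raw_sarcasm:
            # keep only lines with exactly two tab-separated columns; the text is the second
            columns = line.split('\t')
            if len(columns) == 2:
                utt = columns[1]
                # replace everything except letters and apostrophes with a space
                utt = re.sub("[^a-zA-Z']", " ", utt)
                utt = utt.lower()
                sarcasm.write(unicode(utt + '\n'))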
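
To see what this cleaning step does to a single line, here is a minimal standalone sketch (written for Python 3). The helper name `clean_line` and the sample review are made up for illustration; the regular expression is the same one used in the cell above.

```python
import re

def clean_line(text):
    """Replace anything that is not a letter or apostrophe with a space, then lowercase."""
    return re.sub(r"[^a-zA-Z']", " ", text).lower()

sample = "Don't buy it... 2 stars!"
cleaned = clean_line(sample)

# Splitting on whitespace collapses the runs of spaces left behind by the regex
print(cleaned.split())
```

Note that digits are dropped along with punctuation, while the apostrophe in "don't" is kept.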

The rest of the Amazon Review data came in a .json.gz format, and we will obtain the review texts in the following way.

In [18]:
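# parse gzip of Amazon Review file
def parse_gzip(path, max_lines=10000):
    g = gzip.open(path, 'rb')
    try:
        # limit the number of input lines to keep training time manageable
        for linecount, l in enumerate(g):
            if linecount >= max_lines:
                break
            # each line holds one review record; eval is only safe on trusted input
            yield eval(l)
    finally:
        g.close()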

# parse reviews from gzip file and write to corpus
def parse_amazon(in_gz, out_txt):
    with open(out_txt, 'w') as amazon:
        for d in parse_gzip(in_gz):
            utt = d['reviewText']
            utt = re.sub("[^a-zA-Z']"," ", utt)
            utt = utt.lower()
            amazon.write(unicode(utt + '\n'))
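
Since `eval` executes whatever it is given, you may prefer a parser that only accepts data. The following is a sketch under stated assumptions (Python 3, and records that are either strict JSON or plain Python dict literals); the function name `parse_records` and the two sample records are hypothetical and built in memory so the example runs without the real download.

```python
import ast
import gzip
import io
import json
from itertools import islice

def parse_records(fileobj, limit=None):
    """Yield one dict per line of a gzip stream, reading at most `limit` lines."""
    with gzip.open(fileobj, "rt", encoding="utf-8") as g:
        for line in islice(g, limit):
            line = line.strip()
            try:
                yield json.loads(line)          # strict JSON record
            except ValueError:
                yield ast.literal_eval(line)    # Python-literal record, no code execution

# Hypothetical two-line sample standing in for the real review file
raw = b'{"reviewText": "Great game!"}\n' + b"{'reviewText': 'Too short.'}\n"
sample_file = io.BytesIO(gzip.compress(raw))

texts = [record["reviewText"] for record in parse_records(sample_file, limit=10)]
print(texts)
```

The `limit` argument plays the same role as the line counter in `parse_gzip`.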


In [19]:
parse_amazon(raw_amazon_video_file_name, amazon_video_file_name)
print "Completed Amazon Video Game Review parse"

Completed Amazon Video Game Review parse


Now, the raw Amazon Review data has been parsed into a text corpus that matches the format that the Word2Vec model requires: one sentence per line, with its words separated by spaces.

## Model training

So now that we have the data in the format we want it in, how do we actually train our model so we can start predicting the similarities?

The first step is to separate the text into sentences, each represented by a list of words. We do that here using regular expressions, as it is faster than using the Python split function.

In [22]:
# get a list of sentences (each a list of word tokens) from an input DataFrame
def tokenize_data(df):
    # build the tokenizer once; note that \w+ also splits words at apostrophes
    tokenizer = RegexpTokenizer(r'\w+')
    str2lst = []
    for i, row in df.iterrows():
        s = str(row[0])
        str2lst.append(tokenizer.tokenize(s))
    return str2lst
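
One thing to be aware of is that the pattern `\w+` does not match apostrophes, so a contraction is split into two tokens. The sketch below (Python 3, standard library `re` rather than NLTK) compares that pattern with an alternative that keeps apostrophes; the function name `tokenize_keep_apostrophes` is made up for this example and is not part of the tutorial's pipeline.

```python
import re

def tokenize_keep_apostrophes(sentence):
    """Split a cleaned, lower-cased sentence into tokens of letters and apostrophes."""
    return re.findall(r"[a-z']+", sentence.lower())

sentence = "don't buy this"

# The \w+ pattern breaks the contraction at the apostrophe
split_tokens = re.findall(r"\w+", sentence)

# Allowing apostrophes inside a token keeps the contraction whole
kept_tokens = tokenize_keep_apostrophes(sentence)

print(split_tokens)
print(kept_tokens)
```

Which behaviour is preferable depends on whether you want contractions to appear in the vocabulary as single words.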

Next, we can finally pass it into the Word2Vec model for training! However, before doing that we refine the data a little more. Contractions, like "don't" or "can't", often appear in the sentences and should be considered one word. Also, two- and three-word phrases, such as "in the", "of course", and "I am sorry", occur, and since they are thought of as single units they should be treated as such. So, we detect these two- and three-word phrases and combine them into bigrams and trigrams respectively, using the `Phrases` class.
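
To give a feel for how phrase detection can work, here is a toy, pure-Python sketch that scores adjacent word pairs by how much more often they occur together than their individual counts would suggest. The function `find_bigrams`, its parameters and the sample sentences are made up for illustration; this is a simplified scoring rule in the spirit of the word2vec phrases paper, not gensim's actual implementation.

```python
from collections import Counter

def find_bigrams(sentences, min_count=1, threshold=1.0):
    """Return {(word_a, word_b): score} for adjacent pairs scoring above threshold."""
    unigrams = Counter(word for s in sentences for word in s)
    bigrams = Counter(pair for s in sentences for pair in zip(s, s[1:]))
    vocab_size = len(unigrams)

    scores = {}
    for (a, b), n_ab in bigrams.items():
        # Pairs seen only min_count times get a score of zero
        score = (n_ab - min_count) * vocab_size / float(unigrams[a] * unigrams[b])
        if score > threshold:
            scores[(a, b)] = score
    return scores

toy_sentences = [
    ["of", "course", "it", "works"],
    ["of", "course", "not"],
    ["it", "works", "of", "course"],
]
detected = find_bigrams(toy_sentences)
print(sorted(detected))
```

On this toy input the pairs ("of", "course") and ("it", "works") pass the threshold, while pairs that occur only once do not.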

In [27]:
# Return a model from a given text file
def build_model(data_file):
    freq = 10
    size_NN = 80
    n_threads = 4

    train_df = pd.read_csv(data_file,header=None)
    sentences = tokenize_data(train_df)

    l = []
    for lst in sentences:
        lst_u = [s.decode('utf-8','ignore') for s in lst]
        l.append(lst_u)

    bigram = phrases.Phrases(l)
    trigram = phrases.Phrases(bigram[l])

    model = Word2Vec(min_count=freq,size=size_NN, workers=n_threads, alpha=0.025, min_alpha=0.025,
                     max_vocab_size=50000000)
    model.build_vocab(trigram[bigram[l]])
    print "created initial model"

    for epoch in range(5):
        random.shuffle(l)
        model.train(trigram[bigram[l]])
        print "epoch #" + str(epoch) + " completed"

    return model

Then, we create the models and indices with the commands below. Careful, this may take a long time, depending on how big your text corpus is and how long you chose your semantic vectors to be (in our case, size_NN=80, which is small enough). Unless you are running with a lot of memory, it is advisable not to use a large number of sentences for each corpus. In this case, I limited the number of sentences extracted to 10000.
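
If memory is the bottleneck, one option is to stream sentences from disk instead of holding them all in a list, since gensim accepts any restartable iterable of token lists. The class below is a hedged sketch of that idea (Python 3); `LineCorpus` and its `limit` parameter are names invented here, and the temporary file only stands in for a corpus file such as `corpuses/sarcasm.txt`.

```python
import os
import tempfile
from itertools import islice

class LineCorpus(object):
    """Restartable iterable yielding one list of tokens per non-empty line."""

    def __init__(self, path, limit=None):
        self.path = path
        self.limit = limit  # maximum number of lines to read, or None for all

    def __iter__(self):
        # Re-open the file on every pass so the corpus can be iterated many times
        with open(self.path, "r") as f:
            for line in islice(f, self.limit):
                tokens = line.split()
                if tokens:
                    yield tokens

# Small temporary file standing in for a real corpus file
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("do not buy\n\nbest product ever\n")

corpus = LineCorpus(tmp.name, limit=10)
first_pass = list(corpus)
second_pass = list(corpus)
os.remove(tmp.name)
print(first_pass)
```

Because `__iter__` starts from the beginning each time, the same object can serve both vocabulary building and the training passes. Note that the shuffling done in `build_model` would not apply to a streamed corpus.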

In [159]:
sarcasm = build_model(sarcasm_file_name)
sarcasm_index = set(sarcasm.index2word)
print "completed training the sarcasm model"
amazon_vg = build_model(amazon_video_file_name)
amazon_vg_index = set(amazon_vg.index2word)
print "completed training the video game reviews model"

created initial model
epoch #0 completed
epoch #1 completed
epoch #2 completed
epoch #3 completed
epoch #4 completed
completed training the sarcasm model
created initial model
epoch #0 completed
epoch #1 completed
epoch #2 completed
epoch #3 completed
epoch #4 completed
completed training the video game reviews model


## Compute sentence similarities

Now, we will determine a way to find the similarities between sentences. We do this by mapping each word in each sentence to the semantic space, and then averaging all of these vectors to create a final sentence vector. Then, we compute the cosine similarity between these sentence vectors, and that becomes the similarity score.
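
The idea can be illustrated without a trained model. The sketch below (Python 3, standard library only) uses hand-made two-dimensional vectors in place of real word embeddings; `toy_vectors`, `sentence_vector` and `cosine_similarity` are names and values invented purely for this example.

```python
import math

# Hypothetical 2-D "embeddings"; a real model would supply 80-dimensional vectors
toy_vectors = {
    "love": [0.9, 0.1],
    "like": [0.8, 0.2],
    "hate": [-0.7, 0.3],
    "product": [0.1, 0.9],
}

def sentence_vector(words, vectors):
    """Average the vectors of all known words; return None if no word is known."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (nan if either vector is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else float("nan")

love_vec = sentence_vector("i love product".split(), toy_vectors)
like_vec = sentence_vector("i like product".split(), toy_vectors)
hate_vec = sentence_vector("i hate product".split(), toy_vectors)

print(cosine_similarity(love_vec, like_vec) > cosine_similarity(love_vec, hate_vec))
```

Words missing from the vocabulary (here "i") are simply skipped, which mirrors the vocabulary check in the notebook's averaging function.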

In [160]:
# Averages all word vectors in a given paragraph
def avg_vectors(words, model, num_features, index2word_set):
    featureVec = np.zeros(num_features, dtype="float32")
    nwords = 0

    for word in words:
        if word in index2word_set:
            nwords = nwords + 1
            featureVec = np.add(featureVec, model[word])
    if nwords > 0:
        featureVec = np.divide(featureVec, nwords)

    return featureVec

In [161]:
# Computes sentence similarities
def compute_similarities(model, sentences, index, best=2, worst=0):
    vectors = []
    for sentence in sentences:
        vectors.append(avg_vectors(sentence.lower().split(" "), model, 80, index))
        
    similar = []
    for i in range(len(vectors)):
        for j in range(len(vectors)):
            if i>j:
                sentence = [sentences[i], sentences[j]]
                similarity = 1 - cosine(vectors[i], vectors[j])
                if not np.isnan(similarity):
                    similar.append((similarity, sentence))
                
    similar.sort(reverse=True)
    if len(similar)<max(best, worst):
        print "not enough data"
    else:
        if best:
            print "best " + str(best)
            for x in similar[:best]:
                print x
        if worst:
            print "worst " + str(worst)
            for x in similar[-1*worst:]:
                print x

    print ""

In [162]:
sentence1 = "do not buy"
sentence2 = "terrible product"
sentence3 = "best product ever"
sentence4 = "i love this product"
sentence5 = "i use this product every day"
sentences = [sentence1, sentence2, sentence3, sentence4, sentence5]

In [163]:
print "Sarcasm similarities"
compute_similarities(sarcasm, sentences, sarcasm_index)

print "Video Game data similarities"
compute_similarities(amazon_vg, sentences, amazon_vg_index)

Sarcasm similarities
best 2
(0.92048052816809933, ['i use this product every day', 'i love this product'])
(0.80848912011328022, ['best product ever', 'terrible product'])

Video Game data similarities
best 2
(0.67825017781110841, ['i use this product every day', 'i love this product'])
(0.50801964822782431, ['i love this product', 'best product ever'])



The results above show some interesting differences between the Video Game review data and the Sarcasm data. Both models rank 'i use this product every day' and 'i love this product' as the most similar pair. The sarcasm model then ranks 'best product ever' and 'terrible product', which seem to be opposites, as the next most similar pair, while the video game model ranks 'i love this product' and 'best product ever' next. A likely explanation is that sarcastic reviews often use positive wording to express a negative opinion, while the video game reviews are more literal in tone.

## Comparison with TF-IDF Method

We will now compare how well word2vec performs against the TF-IDF algorithm. TF-IDF represents each sentence as a vector of word counts, weighted so that words appearing in fewer sentences count for more, and then computes the cosine similarity between the two sentence vectors. Because it only rewards sentences that share exact words, it is a much more naive way of doing things, and it performs worse on our example sentences, as shown below.
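
To make the weighting concrete, here is a pure-Python sketch of the computation for a pair of sentences (Python 3). It assumes the smoothed IDF, L2 normalization and default tokenization (tokens of two or more characters, so "i" is dropped) that scikit-learn's `TfidfVectorizer` uses by default; the function names are made up for this example.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lower-case and keep tokens of at least two word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

def tfidf_vectors(docs):
    """Return one L2-normalized {term: weight} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    # document frequency: in how many documents each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {t: c * (math.log((1.0 + n_docs) / (1.0 + df[t])) + 1.0)
                   for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors

def tfidf_similarity(a, b):
    """Cosine similarity of the two sentences' TF-IDF vectors."""
    va, vb = tfidf_vectors([a, b])
    return sum(w * vb.get(t, 0.0) for t, w in va.items())

score = tfidf_similarity("i use this product every day", "i love this product")
print(round(score, 4))
```

Only the shared words "this" and "product" contribute to the score; every other word simply lengthens the vectors and pulls the similarity down.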

In [150]:
def compute_tfidf_similarities(sentences, best=2, worst=0):
    similar = []
    for i in range(len(sentences)):
        for j in range(len(sentences)):
            if i>j:
                sentence = [sentences[i], sentences[j]]
                vect = TfidfVectorizer(min_df=1)
                tfidf = vect.fit_transform(sentence)
                similarity = (tfidf * tfidf.T).A[0,1]
                
                if not np.isnan(similarity):
                    similar.append((similarity, sentence))
    similar.sort(reverse=True)
    if len(similar)<max(best, worst):
        print "not enough data"
    else:
        if best:
            print "best " + str(best)
            for x in similar[:best]:
                print x
        if worst:
            print "worst " + str(worst)
            for x in similar[-1*worst:]:
                print x

    print ""

In [155]:
print "TF-IDF similarities:"
compute_tfidf_similarities(sentences,3)

print "Video Game data similarities"
compute_similarities(amazon_vg, sentences, amazon_vg_index)

TF-IDF similarities:
best 3
(0.3563004293331381, ['i use this product every day', 'i love this product'])
(0.26055567105626237, ['i love this product', 'terrible product'])
(0.26055567105626237, ['best product ever', 'terrible product'])

Video Game data similarities
best 2
(0.67825017781110841, ['i use this product every day', 'i love this product'])
(0.50801964822782431, ['i love this product', 'best product ever'])



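The model trained on the Video Game corpus ranked the semantically related sentences highest, while the TF-IDF algorithm did not, because it does not take the semantics of each word into account. For example, it scores 'i love this product' and 'terrible product' as similar simply because both contain the word 'product'.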

## Further Resources

If you would like to learn more about word2vec, and the technologies discussed in this tutorial, you may view the following online resources:

- [A word2vec paper by the original authors: Distributed Representations of Words and Phrases and their Compositionality](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf)
- [The TF-IDF word relevance paper](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.121.1424&rep=rep1&type=pdf)
- [Other interesting uses of word2vec, including using it for solving analogies](https://quomodocumque.wordpress.com/2016/01/15/messing-around-with-word2vec/)
- [Fun and interactive website to solve analogies created using word2vec](http://deeplearner.fz-qqq.net/)