# Word2Vec Text Representation using Gutenberg Corpus

**Prerequisites:** Skills in tokenization with nltk, knowledge of Word2Vec Text Representation model.

## Outline

**Main Goal:** To practice how to create Word2Vec models with Gensim and NLTK. Then introduce how to extract information from both text representation, and finally how to measure word similarity.

- Gensim Corpus Inizialization
- Word2Vec model example

## What is Word2Vec?

Word2vec is a group of related models that are used to produce word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words. Word2vec takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space. Word vectors are positioned in the vector space such that words that share common contexts in the corpus are located in close proximity to one another in the space [(Mikolov2013)](#Mikolov2013).

**Note**: About Gensim and NLTK software please read the introductions notes about them in [2.1-TfIdf Notebook](02.1-TfIdf.ipynb)

In [1]:
import gensim
import nltk
import os
import re
import time

## Wrangling Data

From txt collection to a list of strings, and from string-list to a list of word-list by sentence-list.

This first method to load the whole text collection is based on "os" module, this is only a code snippet to practice a different ways to do it. NLTK, numpy, and other libraries have it's own methods to do the same process.

In this case a text structure "sentences" with a list of words per sentence per line is generated.

In [2]:
doc_collection = []
file_path = 'gutenberg/'
file_list = list(os.popen('ls '+ file_path).read().split('\n'))
for file in file_list:
    if file:
        with open(os.path.join(file_path,file)) as doc:
            doc_collection.append(doc.read())
            
#Wrangling the data from list of doc-strings -> list of word-list by sentences
sentences = []
for doc in range(len(doc_collection)):
    for sent in nltk.sent_tokenize(doc_collection[doc]):
        sent_words = []
        for word in nltk.word_tokenize(sent):
            sent_words.append(word)
        sentences.append(sent_words)

## Generating the Word2Vec Model

**WARNING**: gensim.models.word2vec: Each 'sentences' item should be a list of words (usually unicode strings).

In [9]:
from gensim.models import Word2Vec

try:
    w2v = Word2Vec.load('models/gutenberg_w2v.model')
    print('Word2Vec Model Generated in 59 seconds')
except:
    print('No puedo entrar aquí')
    init = time.time()
    #first build vocabulary
    w2v = Word2Vec(iter=1)
    w2v.build_vocab(sentences)

    #second train the model / save it / and then load it
    w2v = Word2Vec(sentences, min_count=1, size=300)
    w2v.save('models/gutenberg_w2v.model')
    w2v = gensim.models.Word2Vec.load('gensim_data/w2v_model')

    #third train the model with more sentences
    w2v.train(sentences,total_words=20000000,epochs=w2v.iter)
    end = time.time()-init
    print('Total time:', end)

Word2Vec Model Generated in 59 seconds


In [14]:
w2v.wv.most_similar(positive=['Alice'],negative=['man'])

[('Tenderly', 0.5983538627624512),
 ('`Uncle', 0.581244945526123),
 ('1788', 0.5553467273712158),
 ("'Leonora", 0.5489853620529175),
 ('unsays', 0.5169966220855713),
 ('wilted', 0.5126117467880249),
 ('promiscuously', 0.5119398236274719),
 ('politely', 0.5090441703796387),
 ('submissively', 0.5081309676170349),
 ('ago.', 0.5049775838851929)]

In [15]:
w2v.wv['Alice'][:10]

array([-0.4553154 ,  0.3860206 , -0.0982976 ,  0.45534748, -0.3113477 ,
       -0.16704908,  0.48424447, -0.3274756 ,  0.08231983, -0.1183285 ],
      dtype=float32)

## Sklearn Word2Vec-Cosine sentence similarity

### Wrangling Data

From string-sentences to "Continue Bag of Word" numerical vectors.

In [76]:
sentence1 = 'the girl run into the hole'
sentence2 = 'Here Alice run to the hole'

sent1 = sentence1.split()
sent2 = sentence2.split()

sent1s = 'girl run hole'
sent2s = 'Alice run hole'

sent1sl = sent1s.split()
sent2sl = sent2s.split()

#If we change the sent1 by a very different meaning sent3
sent3 = ['the','boy','eat','a','red','apple']
sent3s = ['boy','eat','red','apple']

In [87]:
import numpy as np

def preproc_data(sent, model):
    
    vec_sent = []

    for i,word in enumerate(sent):
        try:
            vec_sent.append(model.wv[word])
        except:
            pass

    vec_sent = sum(np.asarray(vec_sent))
    result = vec_sent.reshape(1,-1)
    
    return result

In [88]:
w2v_sent1 = preproc_data(sent1,w2v)
w2v_sent2 = preproc_data(sent2,w2v)
print(len(w2v_sent1[0]))
w2v_sent2[0][:10]

300


array([-0.8443149 , -0.04467377, -0.11465675,  1.6633564 ,  1.9606867 ,
       -0.5129897 , -1.0194454 , -0.8266785 ,  1.5195833 , -0.21492743],
      dtype=float32)

### Applying Similarity

In [19]:
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity(w2v_sent1,w2v_sent2)[0][0]

0.8332392

In [20]:
#Filtering stopwords
w2v_sent1s, w2v_sent2s = preproc_data(sent1s,sent2s,w2v)
cosine_similarity(w2v_sent1s,w2v_sent2s)[0][0]

0.8332392

## Scipy Cosine Similarity

In [21]:
from scipy.spatial.distance import cosine as cosine_scipy

print(cosine_scipy(w2v_sent1,w2v_sent2))
print(cosine_scipy(w2v_sent1s,w2v_sent2s)) #Filtering stopwords

0.1667608618736267
0.1667608618736267


# Gensim Particular Measures

Gensim jaccard and cosine are impossible to measure because a p2v_bow vector is needed, but in this model this kind of BOW corpus don't exist.

## Cosine using Gensim w2v of a sentence

In [72]:
vec_sent1 = w2v.wv[sent1]
vec_sent2 = w2v.wv[sent2]

vec_sent1_ = vec_sent1.sum(axis=0).reshape(1,-1)
vec_sent2_ = vec_sent2.sum(axis=0).reshape(1,-1)

print('w2v sentence vector similarity without transformation',
      cosine_similarity(vec_sent1,vec_sent2)[0][0])
print('w2v sentence vector similarity with transformation',
      cosine_similarity(vec_sent1_,vec_sent2_)[0][0])

w2v sentence vector similarity without transformation 0.17152277
w2v sentence vector similarity with transformation 0.82981557


## Gensim w2v.n_similarity

In [83]:
print(w2v.wv.n_similarity(sent3,sent2))
print(w2v.wv.n_similarity(sent1,sent2))
print(w2v.wv.n_similarity(sent1sl,sent2sl))

0.737802905397651
0.829815479780645
0.8406460034906343


## Gensim w2v.similarity

A score constructed with this method based on an international article.[John2016](#John2016)

In [73]:
w2v.wv.similarity('woman','man')

0.7787217808243306

In [36]:
def sent_sim_jonh2016(sent1, sent2, model):
    """type sent1,sent2: list of strings"""
    
    sim_vector = []
    ALPHA = 0.25

    for wordA in sent1:
        for wordB in sent2:
            try:
                sim = w2v.wv.similarity(wordA,wordB)
                if sim > ALPHA:
                    sim_vector.append(sim)
            except:
                pass

    return sum(sim_vector)/(len(sim_vector) or 1)

In [84]:
print('Similar sentences w2v.similarity', sent_sim_jonh2016(sent1,sent2, w2v))
print('Similar sentences w2v.similarity without stopwords', sent_sim_jonh2016(sent1sl,sent2sl, w2v))

Similar sentences w2v.similarity 0.5067870774528688
Similar sentences w2v.similarity without stopwords 0.627630266069921


## Gensim TfIdf-Hellinger sentence similarity

In [50]:
from gensim.matutils import kullback_leibler, jaccard, hellinger, cossim

print(hellinger(w2v_sent1,w2v_sent2))
print(kullback_leibler(w2v_sent1, w2v_sent2))

nan
inf


  sim = np.sqrt(0.5 * ((np.sqrt(vec1) - np.sqrt(vec2))**2).sum())


## Best Pair Word Overlap

Lets try a different way to compound a sentence similarity, based on WordNet-Augmented-Word-Overlap similarity idea.

$p = {\sum_{w\in\ sent_1}max(df[w][w']) \over len(sent_1)} \ \ \ \forall\ w' \in\ sent_2$

$q = {\sum_{w'\in\ sent_2}max(df[w][w']) \over len(sent_2)} \ \ \ \forall\ w \in\ sent_1$

$sim = \left\{ \begin{array}{rcl} 
0  & if\ p+q = 0\\
{2 p*q \over (p+q)}  & others\\
\end{array}
\right.$

In [78]:
def harmonic_best_pair_word_sim(sent1,sent2,model):
    p=0
    for wordA in sent1:
        m = 0
        for wordB in sent2:
            try:
                m = max(m, model.wv.similarity(wordA,wordB))
            except:
                pass
        p += m
    p = p/len(sent1)

    q=0
    for wordA in sent2:
        m = 0
        for wordB in sent1:
            try:
                m = max(m, model.wv.similarity(wordA,wordB))
            except:
                pass
        q += m
    q = q/len(sent2)

    sim = 2*p*q/(p+q or 1)
    return sim

In [81]:
print('Dissimilar sentences w2v_harmonic_best_pair_word similarity', 
      harmonic_best_pair_word_sim(sent3,sent2,w2v))
print('Dissimilar sentences without stopwords w2v_harmonic_best_pair_word similarity',
      harmonic_best_pair_word_sim(sent3s,sent2s,w2v))
print('Similar sentences w2v_harmonic_best_pair_word', 
      harmonic_best_pair_word_sim(sent1,sent2,w2v))
print('Similar sentences w2v_harmonic_best_pair_word without stopwords',
      harmonic_best_pair_word_sim(sent1sl,sent2sl,w2v))

Dissimilar sentences w2v_harmonic_best_pair_word similarity 0.5880608452662112
Dissimilar sentences without stopwords w2v_harmonic_best_pair_word similarity 0.34912268984175787
Similar sentences w2v_harmonic_best_pair_word 0.7593893198382271
Similar sentences w2v_harmonic_best_pair_word without stopwords 0.8217402173991791


# Conclusions

- The best similarities using this text representation models must be implemented with innovatives ideas. For example: ``sent_sim_jonh2016`` and ``harmonic_best_pair_word_sim``.
- In almost all cases the stopword filtering increment the similarity between similar sentences and diminished similarity between different sentences.
- Gensim Hellinger and Kullback Leibler still been useless.

# Recomendations

- See the notebooks training Gensim models with Wikipedia dump, review gensim distances and distances trated here.
- Try to test other text representation models like Weigthed Matrix Factorization.
- Try to train w2v model with more documents and test the Best-Pair word overlap similarity.

<a id='references'></a>
# References

<a id='Perkins2014'></a>
[1] *[Perkins2014]* Jacov Perkins. 
Book **Python 3 Text Processing with NLTK 3 Cookbook**. 2014. 
p. 7 **ISBN**: 978-1-78216-785-3

<a id='Mikolov2013'></a>
[2] *[Mikolov2013]* Tomas Mikolov et al. **Efficient Estimation of Word Representations in Vector Space**. Publisher [arXiv](https://arxiv.org/abs/1301.3781), 2013.

<a id='John2016'></a>
[3] *[John2016]* John, Adebayo Kolawole and Caro, Luigi Di and Boella, Guido. **NORMAS at SemEval-2016 Task 1: SEMSIM: A Multi-Feature Approach to Semantic Text Similarity**. Publisher ACM, 2016.