# Word2Vec Text Representation using Wikipedia Corpus

**Prerequisites:** Skills in tokenization with nltk, knowledge of Word2Vec Text Representation model.

## Outline

**Main Goal:** To practice how to create Word2Vec models with Gensim and NLTK and Wikipedia corpus.

- Gensim Corpus Inizialization
- Word2Vec model example

**Note**: About Gensim and NLTK software please read the introductions notes about them in [2.01-TfIdf Notebook](2.01-TfIdf.ipynb)

In [1]:
import gensim
import nltk
import os
import re
from gensim.models import Word2Vec
from gensim.corpora import TextCorpus, MmCorpus, Dictionary
from gensim.models.word2vec import Text8Corpus
import time

#Wiki corpus path
corpus_path = '/media/DATA/wiki_es/'
wiki_corpus = corpus_path+'dump/eswiki-20161201-pages-articles-multistream.xml.bz2'

## Wrangling Data

From txt collection to a list of strings, and from string-list to a list of word-list by sentence-list.

Corpus.get_texts is implemented using iterators as input and not work with generators.

In [2]:
def read_wikipedia_corpus(filename, dictionary, article_min_tokens, article_max_tokens):

    # We don't want to do a dictionary construction step.
    corpus = gensim.corpora.WikiCorpus(filename, 
                                       dictionary=dictionary,
                                       article_min_tokens=article_min_tokens,
                                       article_max_tokens=article_max_tokens,
                                       lemmatize=None)

    for text in corpus.get_texts():
        yield text

## Generating the Word2Vec Model

**WARNING**: gensim.models.word2vec: Each 'sentences' item should be a list of words (usually unicode strings).

In [3]:
#Building vocabulary, this step is obligated before trained the model, could take several minutes.
dictionary = Dictionary.load_from_text(corpus_path+'_wordids.txt.bz2')
init = time.time()
print(init)

w2v = Word2Vec(iter=1,                 #Number of iterations (epochs) over the corpus.
               min_count=20,           #Ignores all words with total frequency lower than this.
               size=300,               #Dimensionality of the feature vectors.
               max_vocab_size=2000000, #Limits the RAM during vocabulary building.
               sg=0,                   #Defines the training algorithm. If 0 skip-gram is employed.
              )
w2v.build_vocab(read_wikipedia_corpus(wiki_corpus, 
                                       dictionary=dictionary, 
                                       article_min_tokens=50,     #Minimum tokens in article.
                                       article_max_tokens=5000),  #Maximum tokens in article.
                )
end = time.time()-init
print('Corpus of %d articles, model vocabulary size = %d generated in %d seconds' % (w2v.corpus_count ,len(w2v.wv.vocab),end))



1521730513.281354
----------- 100000
1521731006.004047
----------- 200000
1521731308.615893
----------- 300000
1521731567.400948
----------- 400000
1521731863.391605
----------- 500000
1521732169.979917
----------- 600000
1521732419.9087088
----------- 700000
1521732613.5474608
----------- 800000
1521732847.45133
----------- 900000
1521733104.1776474
----------- 1000000
1521733350.9373498
----------- 1100000
1521733585.289727
Corpus of 1103059 articles, model vocabulary size = 360330 generated in 3095 seconds


In [None]:
# Once you init the word2vector model with its parameters and build the vocabulary, 
# then train the model with more sentences
init = time.time()
print(init)
w2v.train(Text8Corpus(wiki_corpus),total_words=len(w2v.wv.vocab),epochs=w2v.iter)
end = time.time()-init
print('Word2Vec Model Generated in %d seconds' % end)

In [4]:
w2v.save('/media/abelma/SSD2/wiki-w2v.model')

In [None]:
#Load de word2vec model
w2v = Word2Vec.load('/media/abelma/SSD2/wiki-w2v.model')

In [8]:
w2v.wv.most_similar(positive=['niño'])#,negative=['hombre'])

[('pareció', 0.27089589834213257),
 ('áñez', 0.26660650968551636),
 ('pteridophytoa', 0.24674996733665466),
 ('nidificante', 0.2368468940258026),
 ('carpes', 0.23490853607654572),
 ('kitty', 0.23462679982185364),
 ('cahir', 0.23371116816997528),
 ('phyllanthaceae', 0.2335861325263977),
 ('enriquecían', 0.2326796054840088),
 ('plagaron', 0.2292286455631256)]

In [10]:
w2v.wv['rey'][:10]

array([-1.6270967e-03, -1.2212507e-03, -1.0092995e-03,  1.5547395e-03,
        1.5413483e-03,  8.1122958e-04, -9.3063823e-04, -3.8523821e-04,
        4.9518021e-05,  8.3216251e-04], dtype=float32)

## Pretrained Models

To generate some of this models are very computing expensive, so it is better to load pre-trained models to calculate the correspondent word vector.

[Wikipedia Word2Vec](https://github.com/idio/wiki2vec/raw/master/torrents/enwiki-gensim-word2vec-1000-nostem-10cbow.torrent)

[1.5 Gb Google News Word2Vec Corpus](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing)

[Homepage for Google's Word2Vec code and pre-trained models](https://code.google.com/archive/p/word2vec/)

## Sklearn Word2Vec-Cosine sentence similarity

### Wrangling Data

From string-sentences to "Continue Bag of Word" numerical vectors.

In [134]:
# to get similarity betwee 2 sentences with word2vec create it like John2016
sentence1 = 'la niña corrió hacia el hueco'
sentence2 = 'Alicia corrió hacia el hueco'
sent1 = sentence1.split() #sentence in list of words format
sent2 = sentence2.split()
#Filtering stopwords by hand
sent1s = 'niña corrió hueco'
sent2s = 'Alicia corrió hueco'
sent1sl = sent1s.split()
sent2sl = sent2s.split()

In [86]:
import numpy as np

def preproc_data(sentence1, sentence2, model):
    
    w2v_sent1 = []
    w2v_sent2 = []

    for i,word in enumerate(sent1):
        try:
            w2v_sent1.append(w2v.wv[word])
        except:
            pass

    for i,word in enumerate(sent2):
        try:
            w2v_sent2.append(w2v.wv[word])
        except:
            pass

    w2v_sent1 = sum(np.asarray(w2v_sent1))
    w2v_sent2 = sum(np.asarray(w2v_sent2))
    A = w2v_sent1.reshape(1,-1)
    B = w2v_sent2.reshape(1,-1)
    
    return A,B

In [87]:
nvec_sent1_w2v, nvec_sent2_w2v = preproc_data(sent1,sent2,w2v)
nvec_sent1s_w2v, nvec_sent2s_w2v = preproc_data(sent1s,sent2s,w2v)
print(len(nvec_sent1_w2v[0]))
nvec_sent2s_w2v[0][:10]

300


array([-0.00054531,  0.00163083,  0.00311696, -0.00242069,  0.00249035,
        0.00299105, -0.00119631,  0.00505188,  0.00258618,  0.00072918],
      dtype=float32)

### Applying Similarity

In [88]:
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity(nvec_sent1_w2v,nvec_sent2_w2v)[0][0]

0.83529264

In [90]:
cosine_similarity(nvec_sent1s_w2v,nvec_sent2s_w2v)[0][0]

0.83529264

## Scipy Cosine Similarity

In [91]:
from scipy.spatial.distance import cosine as cosine_scipy

print(cosine_scipy(nvec_sent1_w2v,nvec_sent2_w2v))
print(cosine_scipy(nvec_sent1s_w2v,nvec_sent2s_w2v)) #Filtering stopwords

0.16470736265182495
0.16470736265182495


## Cosine using Gensim w2v of a sentence

In [125]:
vec_sent1 = w2v.wv[sent1]
vec_sent2 = w2v.wv[['corrió','al','hueco']]

#cosine(vec_sent1,vec_sent2)
vec_sent1_ = vec_sent1.sum(axis=0)
vec_sent2_ = vec_sent2.sum(axis=0)

1-cosine_scipy(vec_sent1_,vec_sent2_)

0.5004435181617737

## Gensim w2v.n_similarity

This method fails when the word is not on the model, try to apply normalization and not capital letters.
Test this line:

    w2v.n_similarity(sent1,sent2)

In [143]:
w2v.n_similarity(['la', 'niña', 'corrió', 'hacia', 'el', 'hueco'],['alicia', 'corrió', 'hacia', 'el', 'hueco'])

  """Entry point for launching an IPython kernel.


0.7712078136193977

In [139]:
w2v.n_similarity(['niña','corrió','hueco'],['corrió','hueco'])

  """Entry point for launching an IPython kernel.


0.8432498313143995

In [94]:
#Testing diferent meaning sentences
w2v.n_similarity(['el','niño','come','una','manzana','roja'],
                   ['ella','corrió','al','hueco'])

  


0.01038645543748053

## Gensim w2v.similarity

A score constructed with this method based on an international article.[John2016](#John2016)

In [117]:
# to get similarity betwee 2 sentences with word2vec create it like John2016
def sent_sim_jonh2016(sent1, sent2, model, ALPHA):
    """type sent1,sent2: list of strings"""
    
    sim_vector = []

    for wordA in sent1:
        for wordB in sent2:
            try:
                sim = w2v.similarity(wordA,wordB)
                if sim > ALPHA:
                    sim_vector.append(sim)
            except:
                pass

    return sum(sim_vector)/(len(sim_vector) or 1.0)


In [118]:
ALPHA = 0.1
print('Sentence w2v.similarity with stopwords', sent_sim_jonh2016(sent1, sent2, w2v, ALPHA))
print('Sentence w2v.similarity without stopwords',sent_sim_jonh2016(sent1sl, sent2sl, w2v, ALPHA))
print(sent_sim_jonh2016(['el','niño','come','una','manzana','roja'],['ella','corrió','al','hueco'],w2v, ALPHA))

Sentence w2v.similarity with stopwords 0.7047605146066211
Sentence w2v.similarity without stopwords 1.0
0.107300718000851


  # Remove the CWD from sys.path while we load stuff.


## Best Pair Word Overlap

Lets try a different way to compound a sentence similarity, based on WordNet-Augmented-Word-Overlap similarity idea.

$p = {\sum_{w\in\ sent_1}max(df[w][w']) \over len(sent_1)} \ \ \ \forall\ w' \in\ sent_2$

$q = {\sum_{w'\in\ sent_2}max(df[w][w']) \over len(sent_2)} \ \ \ \forall\ w \in\ sent_1$

$sim = \left\{ \begin{array}{rcl} 
0  & if\ p+q = 0\\
{2 p*q \over (p+q)}  & others\\
\end{array}
\right.$

In [132]:
# get similarity between 2 words with word2vec
print('Similarity between woman and girl:', w2v.wv.similarity('woman','girl'))

sent1 = ['the','girl','run','into','the','hall']
sent2 = ['Here','Alice','run','to','the','hall']

def harmonic_best_pair_word_sim(sent1,sent2, w2v):
    p=0
    for wi in sent1:
        m = 0
        for wc in sent2:
            try:
                m = max(m, w2v.wv.similarity(wi,wc))
            except:
                pass
        p += m
    p = p/len(sent1)

    q=0
    for wc in sent2:
        m = 0
        for wi in sent1:
            try:
                m = max(m, w2v.wv.similarity(wi,wc))
            except:
                pass
        q += m
    q = q/len(sent2)

    sim = 2*p*q/(p+q or 1)
    return sim

print('Harmonic mean word overlap w2v.similarity',harmonic_best_pair_word_sim(sent1,sent2, w2v))

Similarity between woman and girl: 0.1297296040355166
Harmonic mean word overlap w2v.similarity 0.5921575099918465


In [136]:
#If we change the sent1 by a very different meaning sent3
sent3 = ['the','boy','eat','a','red','apple']
print(harmonic_best_pair_word_sim(sent3,sent2,w2v))

#With stopword filtering
print(harmonic_best_pair_word_sim(sent1sl,sent2sl,w2v))

0.05429410777519685
0.681364062418572


# Conclusions

- The best similarities using this text representation models must be implemented with innovatives ideas.
- The original gensim accuracy test output is different to this one.

# Recomendations

- Try to test other text representation models like Weigthed Matrix Factorization to study if the problem of sparcity.
- Try to train w2v model with more documents and test the Best-Pair word overlap similarity.

<a id='referencias'></a>
# Referencias

<a id='Perkins2014'></a>
[1] *[Perkins2014]* Jacov Perkins. 
Book **Python 3 Text Processing with NLTK 3 Cookbook**. 2014. 
p. 7 **ISBN**: 978-1-78216-785-3

<a id='Mikolov2013'></a>
[2] *[Mikolov2013]* Tomas Mikolov et al. **Efficient Estimation of Word Representations in Vector Space**. Publisher [arXiv](https://arxiv.org/abs/1301.3781), 2013.

<a id='John2016'></a>
[3] *[John2016]* John, Adebayo Kolawole and Caro, Luigi Di and Boella, Guido. 
**NORMAS at SemEval-2016 Task 1: SEMSIM: A Multi-Feature Approach to Semantic Text Similarity**. 
Publisher ACM, 2016.