# Similarity in Paragraph2Vec Text Representation

*Gensim software examples.*

**Prerequisites:** Skills in tokenization with NLTK and knowledge of the Word2Vec text representation model.

## Outline

**Main Goal:** Practice creating paragraph2vec models with Gensim and NLTK, then show how to extract information from the resulting text representations, and finally how to measure similarity between words and between sentences.

- Acquiring and wrangling data for model initialization
- Gensim paragraph2vec model generation/loading example
- Sklearn, Scipy text similarity measures examples
- Gensim original measures examples

**About Paragraph2Vec**: Read the [2.9-Paragraph2Vec](2.9-Paragraph2Vec.ipynb) notebook.

**Note**: For the Gensim and NLTK software, please read the introductory notes in [2.1-TfIdf Notebook](02.1-TfIdf.ipynb).

In [1]:
import os
import smart_open
import gensim
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.corpora import TextCorpus, MmCorpus, Dictionary
from gensim.corpora.wikicorpus import WikiCorpus
import time

In [175]:
#Change these settings to your own paths to the Wiki corpus
data_path = '/opt/wiki_es/'
output_path = '/media/abelma/SSD2/wiki_es/'

# 1 Acquiring & Wrangling Data

The text collection is first loaded into a `WikiCorpus` object and then wrapped in a `TaggedWikiDocument`, which is the input type for the document collection accepted by the `p2v.build_vocab` and `p2v.train` methods.

In [2]:
init = time.time()
print(init)
wiki_corpus = WikiCorpus(data_path+'dump/eswiki-20161201-pages-articles-multistream.xml.bz2',
                         lemmatize=False,
                         article_min_tokens=50,     #Minimum tokens in article.
                         article_max_tokens=5000,
                         dictionary={})
end = time.time()-init

1523542334.6557834
----------- 100000
1523542834.5501635
----------- 200000
1523543149.6039762
----------- 300000
1523543429.2024624
----------- 400000
1523543773.0984907
----------- 500000
1523544246.1291332
----------- 600000
1523544657.153675
----------- 700000
1523544998.0905125
----------- 800000
1523545415.5849423
----------- 900000
1523545860.8929534
----------- 1000000
1523546300.8007684
----------- 1100000
1523546702.1207345


After 69.5 min, the WikiCorpus object for the 2016-12 Spanish Wikipedia dump was loaded.

In [3]:
class TaggedWikiDocument(object):
    def __init__(self, wiki):
        self.wiki = wiki
        self.wiki.metadata = True
    def __iter__(self):
        for content, (page_id, title) in self.wiki.get_texts():
            yield TaggedDocument([c for c in content], [title])
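
One practical reason to wrap the corpus in a class with `__iter__`, rather than in a plain generator, is that the vocabulary scan and the training pass each iterate over the documents from the start. The sketch below illustrates the difference under stated assumptions: `TOY_ARTICLES` is a hypothetical two-article stand-in for `wiki.get_texts()`, and `TaggedDoc` is a local namedtuple used in place of Gensim's `TaggedDocument`.

```python
from collections import namedtuple

# Stand-in for gensim's TaggedDocument (assumption: same two fields, words and tags).
TaggedDoc = namedtuple("TaggedDoc", ["words", "tags"])

# Hypothetical toy corpus in the (tokens, (page_id, title)) shape used above.
TOY_ARTICLES = [
    (["buenos", "aires", "es", "la", "capital"], (1, "Argentina")),
    (["ciudad", "de", "méxico", "es", "la", "capital"], (2, "México")),
]

class TaggedToyCorpus:
    """Re-iterable: every call to __iter__ starts a fresh pass."""
    def __init__(self, articles):
        self.articles = articles

    def __iter__(self):
        for tokens, (page_id, title) in self.articles:
            yield TaggedDoc(list(tokens), [title])

def tagged_generator(articles):
    """One-shot: exhausted after the first full pass."""
    for tokens, (page_id, title) in articles:
        yield TaggedDoc(list(tokens), [title])

corpus = TaggedToyCorpus(TOY_ARTICLES)
first_pass = [doc.tags[0] for doc in corpus]
second_pass = [doc.tags[0] for doc in corpus]

gen = tagged_generator(TOY_ARTICLES)
gen_first = [doc.tags[0] for doc in gen]
gen_second = [doc.tags[0] for doc in gen]  # generator already consumed

print(first_pass, second_pass)
print(gen_first, gen_second)
```

The class yields the same tags on both passes, while the generator yields nothing the second time.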

In [4]:
init = time.time()
documents = TaggedWikiDocument(wiki_corpus)
end = time.time()-init
print('Tagging docs to generate input-corpus for paragraph2vec model in %f seconds' % end)
#Assumption: the corpus object is saved under output_path
documents.wiki.save(output_path + 'TaggedWiki.documents')

Tagging docs to generate input-corpus for paragraph2vec model in 0.000130 seconds


  "corpus.save() stores only the (tiny) iteration object; "


# 2 Generating the Paragraph2Vec Model

_Note:_ The tokens used to train this model are lowercased automatically by the `WikiCorpus` preprocessing, so the query sentences are lowercased as well.

### Preprocessing

In [179]:
import multiprocessing

#Worker threads; assumes one per CPU core
cores = multiprocessing.cpu_count()

try:
    init = time.time()
    p2v = Doc2Vec.load(output_path+'wiki-p2v.model')
    end = time.time()-init
    print('Paragraph2Vec Model vocabulary loaded in %f seconds' % end)

except Exception:
    #No saved model could be loaded: scan the vocabulary from scratch
    init = time.time()
    print('should not get here')
    p2v = Doc2Vec(epochs=1,                #Number of iterations (epochs) over the corpus.
                   min_count=20,           #Ignores all words with total frequency lower than this.
                   vector_size=300,        #Dimensionality of the feature vectors.
                   max_vocab_size=2000000, #Limits the RAM during vocabulary building.
                   dm=0,                   #Defines the training algorithm (0 = distributed bag of words).
                   workers=cores,
                  )
    p2v.build_vocab(documents)
    end = time.time()-init
    p2v.save(output_path+'wiki-p2v.model')
    print('Paragraph2Vec Model vocabulary scanned in %f seconds' % end)

/media/abelma/SSD2/wiki_es/wiki-p2v.model
Paragraph2Vec Model vocabulary loaded in 139.031553 seconds


First round: paragraph2vec vocabulary built in 52.45 min.

Second round: paragraph2vec vocabulary loaded in 1.76 min.

### Generating the model

In [11]:
try:
    init = time.time()
    p2v = Doc2Vec.load(output_path+'wiki-p2v.model')
    end = time.time()-init
    print('Paragraph2Vec Model loaded in %f seconds' % end)
    
except:
    init = time.time()
    print(init)
    p2v.train(documents,
              total_examples=p2v.corpus_count,
              epochs=1,)
    end = time.time()-init
    p2v.save(output_path+'wiki-p2v.model')
    print('Paragraph2Vec Model Generated in %f seconds' % end)

1523548085.7003894
----------- 100000
1523548720.1758692
----------- 200000
1523549139.410583
----------- 300000
1523549470.3297775
----------- 400000
1523549827.492253
----------- 500000
1523550194.3657181
----------- 600000
1523550491.8750749
----------- 700000
1523550731.400874
----------- 800000
1523551038.9913855
----------- 900000
1523551356.006913
----------- 1000000
1523551666.4491296
----------- 1100000
1523551954.6985068
Paragraph2Vec Model Generated in 3877.760819 seconds


First round: paragraph2vec model trained in 65 min.

In [13]:
p2v.wv['niño'][:10]

array([ 0.00084364,  0.00100843, -0.0013037 ,  0.00095026,  0.00102746,
       -0.00099448,  0.00164336,  0.00095534,  0.00066178, -0.00122448],
      dtype=float32)

In [53]:
print(p2v.docvecs.most_similar(positive='Argentina'))
print(p2v.wv.most_similar(positive='argentina'))

[('República del Congo', 0.9746325016021729), ('Nunavut', 0.963964581489563), ('Chechenia', 0.9632564783096313), ('Burundi', 0.9623643755912781), ('Libertador San Martín', 0.9616067409515381), ('Cuetzala del Progreso', 0.9613605737686157), ('Islas Baleares', 0.9603628516197205), ('Puerto Maldonado', 0.9602200388908386), ('Caquetá', 0.9595517516136169), ('Isla Grande de Chiloé', 0.9593847393989563)]
[('estadística', 0.26488637924194336), ('magari', 0.25142616033554077), ('swordfish', 0.24946871399879456), ('kiotenses', 0.24445898830890656), ('afroecuatoriano', 0.2428331971168518), ('peraleja', 0.23987072706222534), ('revin', 0.23189276456832886), ('illapa', 0.22987493872642517), ('horatio', 0.22923138737678528), ('aculturación', 0.22797220945358276)]


As the example above shows, `docvecs` is useful for expressions that appear in `p2v.docvecs.doctags`; for any other expression the model raises an error. With word vector similarity (`wv.most_similar`) the results are worse than with `docvecs.most_similar`.

In [97]:
p2v.docvecs.n_similarity(['Argentina'],['Congo'])

0.8469448991224805

# 3 Measuring Similarity between Pair of Sentences

This section shows the usefulness of the _paragraph2vec_ model in an applied example, along with some native similarity methods of the `gensim.Doc2Vec` class.

## Data

In [54]:
sentence1 = 'la niña corrió hacia el hueco'
sentence2 = 'Alicia corrió hacia el hueco'

sent1 = sentence1.lower().split()
sent2 = sentence2.lower().split()

#Filtering stopwords by hand
sent1s = 'niña corrió hueco'
sent2s = 'Alicia corrió hueco'

sent1sl = sent1s.lower().split()
sent2sl = sent2s.lower().split()

#If we change the sent1 by a very different meaning sent3
sent3 = ['el','niño','comió','una','manzana','roja']
sent3s = ['niño','comió','manzana','roja']

## 3.1 Wrangling Data

From string sentences to a numpy array built word by word from the paragraph2vec model's word vectors.

In [57]:
import numpy as np

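def preproc_data(sent, model):
    """Sum the word vectors of `sent`, skipping out-of-vocabulary words."""
    vec_sent = []

    for word in sent:
        try:
            vec_sent.append(model.wv[word])
        except KeyError:
            #Word not in the model vocabulary
            pass

    #Sum over words; the result has shape (1, vector_size), as Sklearn expects
    vec_sent = np.sum(np.asarray(vec_sent), axis=0)
    result = vec_sent.reshape(1,-1)

    return result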

In [58]:
p2v_sent1 = preproc_data(sent1,p2v)
p2v_sent2 = preproc_data(sent2,p2v)
print(len(p2v_sent1[0]))
p2v_sent2[0][:10]

300


array([ 0.00198484, -0.00161264, -0.00392045, -0.00552214,  0.00131036,
        0.0005118 ,  0.00136406,  0.0017021 , -0.00151024, -0.00087941],
      dtype=float32)

## 3.2 Sklearn Paragraph2Vec-Cosine Sentence Similarity

In [59]:
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity(p2v_sent1,p2v_sent2)[0][0]

0.7499604

In [60]:
#Filtering stopwords
p2v_sent1s = preproc_data(sent1sl,p2v)
p2v_sent2s = preproc_data(sent2sl,p2v)
cosine_similarity(p2v_sent1s,p2v_sent2s)[0][0]

0.69232756

## 3.3 Scipy Cosine Similarity

**Note:** $cosine_{Scipy\ distance} = 1 - cosine_{Sklearn\ similarity}$

In [61]:
from scipy.spatial.distance import cosine as cosine_scipy

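#Scipy expects 1-D vectors, so the (1, vector_size) arrays are flattened first
print(cosine_scipy(p2v_sent1.ravel(),p2v_sent2.ravel()))
print(cosine_scipy(p2v_sent1s.ravel(),p2v_sent2s.ravel())) #Filtering stopwords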

0.2500395178794861
0.30767250061035156
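
The relation in the note can be checked on the outputs above: 0.7499604 + 0.2500395 ≈ 1. A minimal sketch in plain Python (toy vectors, not the model's; the helper names are illustrative) that makes the relation explicit:

```python
import math

def cosine_similarity_1d(u, v):
    """Cosine of the angle between two 1-D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_distance_1d(u, v):
    """Cosine distance, defined as 1 minus the cosine similarity."""
    return 1.0 - cosine_similarity_1d(u, v)

# Toy vectors chosen only for illustration.
u = [1.0, 2.0, 0.0]
v = [2.0, 1.0, 1.0]

similarity = cosine_similarity_1d(u, v)
distance = cosine_distance_1d(u, v)
print(similarity, distance, similarity + distance)
```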


## 3.4 Best Pair Word Overlap Similarity

Let's try a different way to compose a sentence similarity, based on the idea of WordNet-augmented word overlap similarity. Here $df[w][w']$ is the similarity between the words $w$ and $w'$.

$p = {1 \over len(sent_1)} \sum_{w \in sent_1} \max_{w' \in sent_2} df[w][w']$

$q = {1 \over len(sent_2)} \sum_{w' \in sent_2} \max_{w \in sent_1} df[w][w']$

$sim = \left\{ \begin{array}{ll}
0  & if\ p+q = 0\\
{2 p q \over p+q}  & otherwise\\
\end{array}
\right.$
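
As a worked example of the formulas above, the sketch below computes $p$, $q$ and $sim$ from a small hand-written similarity table. The values in `df` are hypothetical and chosen only for illustration; they do not come from the trained model.

```python
# Hypothetical word-to-word similarities df[w][w'] (not model output).
df = {
    "niña":   {"alicia": 0.9, "corrió": 0.1, "hueco": 0.0},
    "corrió": {"alicia": 0.2, "corrió": 0.5, "hueco": 0.3},
}
sent_1 = ["niña", "corrió"]
sent_2 = ["alicia", "corrió", "hueco"]

# p: for each word of sent_1, take its best match in sent_2, then average.
p = sum(max(df[w][w2] for w2 in sent_2) for w in sent_1) / len(sent_1)

# q: for each word of sent_2, take its best match in sent_1, then average.
q = sum(max(df[w][w2] for w in sent_1) for w2 in sent_2) / len(sent_2)

# Harmonic mean of p and q, defined as 0 when both are 0.
sim = 0.0 if p + q == 0 else 2 * p * q / (p + q)

print(p, q, sim)
```

Here $p = (0.9 + 0.5)/2$ and $q = (0.9 + 0.5 + 0.3)/3$, so the harmonic mean lies between the two averages.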

In [77]:
def harmonic_best_pair_word_sim(sent1,sent2,model):
    def best_pair_mean(sent_a, sent_b):
        #Mean over sent_a of the best similarity against any word in sent_b
        total = 0
        for wordA in sent_a:
            m = 0
            for wordB in sent_b:
                try:
                    m = max(m, model.wv.similarity(wordA,wordB))
                except KeyError:
                    #Skip word pairs missing from the vocabulary
                    pass
            total += m
        return total/len(sent_a)

    p = best_pair_mean(sent1, sent2)
    q = best_pair_mean(sent2, sent1)

    #Harmonic mean of p and q; 0 when p+q == 0
    sim = 2*p*q/(p+q or 1)
    return sim

In [78]:
print('Dissimilar sentences w2v_harmonic_best_pair_word similarity', 
      harmonic_best_pair_word_sim(sent3, sent2, p2v))
print('Dissimilar sentences without stopwords w2v_harmonic_best_pair_word similarity',
      harmonic_best_pair_word_sim(sent3s, sent2s, p2v))
print('Similar sentences w2v_harmonic_best_pair_word', 
      harmonic_best_pair_word_sim(sent1, sent2, p2v))
print('Similar sentences w2v_harmonic_best_pair_word without stopwords',
      harmonic_best_pair_word_sim(sent1sl, sent2sl, p2v))


Dissimilar sentences w2v_harmonic_best_pair_word similarity 0.039977160321282866
Dissimilar sentences without stopwords w2v_harmonic_best_pair_word similarity 0.0
Similar sentences w2v_harmonic_best_pair_word 0.7476427998305525
Similar sentences w2v_harmonic_best_pair_word without stopwords 0.6856355639043101


# 4 Gensim Original Measures

The Gensim jaccard and cosine measures cannot be applied here because they need a bag-of-words vector (`p2v_bow`), which does not exist for this model.
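
For reference, set-based measures such as Jaccard operate on the tokens themselves rather than on dense vectors. The sketch below is a plain Python version (not Gensim's own implementation; `jaccard_similarity` is an illustrative helper) applied to the two example sentences from section 3:

```python
def jaccard_similarity(tokens_a, tokens_b):
    """Size of the token intersection divided by the size of the token union."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)

sent1 = 'la niña corrió hacia el hueco'.lower().split()
sent2 = 'Alicia corrió hacia el hueco'.lower().split()

# The sentences share 4 tokens out of 7 distinct tokens overall.
score = jaccard_similarity(sent1, sent2)
print(score)
```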

## 4.1 Cosine using Gensim p2v of a sentence

In [62]:
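#One row per word: shape (n_words, vector_size)
vec_sent1 = p2v.wv[sent1]
vec_sent2 = p2v.wv[sent2]

#Sum the word vectors into one sentence vector of shape (1, vector_size)
vec_sent1_ = sum(vec_sent1).reshape(1,-1)
vec_sent2_ = sum(vec_sent2).reshape(1,-1)

#On the per-word matrices, [0][0] only compares the first word of each sentence
print('p2v sentence vector similarity without transformation',
      cosine_similarity(vec_sent1,vec_sent2)[0][0])
#On the summed vectors, the score compares the whole sentences
print('p2v sentence vector similarity without transformation',
      cosine_similarity(vec_sent1_,vec_sent2_)[0][0])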

p2v sentence vector similarity without transformation 0.0068915365
p2v sentence vector similarity without transformation 0.7499604


## 4.2 Gensim p2v.n_similarity

In [63]:
print(p2v.wv.n_similarity(sent3s,sent1))
print(p2v.wv.n_similarity(sent1,sent2))
print(p2v.wv.n_similarity(sent1sl,sent2sl))

-0.040381502259109046
0.749960442719121
0.6923275354116258
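
The last two values match the Sklearn results from section 3.2 even though `preproc_data` sums the word vectors while `n_similarity` compares the mean of each set of word vectors. Cosine similarity ignores vector length, so summing and averaging give the same score. A minimal sketch with toy vectors (hypothetical values, plain Python):

```python
import math

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def vector_sum(vectors):
    """Component-wise sum of a list of equal-length vectors."""
    return [sum(column) for column in zip(*vectors)]

def vector_mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

# Toy word vectors for two sentences of different lengths.
sent_a = [[0.2, -0.1, 0.4], [0.0, 0.3, 0.1]]
sent_b = [[0.1, 0.2, 0.3], [-0.2, 0.1, 0.5], [0.3, 0.0, 0.1]]

cos_sum = cosine(vector_sum(sent_a), vector_sum(sent_b))
cos_mean = cosine(vector_mean(sent_a), vector_mean(sent_b))
print(cos_sum, cos_mean)
```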


## 4.3 Gensim p2v.similarity

A sentence score built on the `similarity` method, following the approach described in [John2016](#John2016).

In [75]:
p2v.wv.similarity('hombre','mujer')

-0.07725515726057419

In [164]:
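def sent_sim_jonh2016(sent1, sent2, model):
    """:type sent1,sent2: list of strings"""
    
    sim_vector = []
    ALPHA = 0.019   #Minimum similarity for a word pair to count

    for wordA in sent1:
        for wordB in sent2:
            try:
                #Use the model passed as argument, not the global p2v
                sim = model.wv.similarity(wordA,wordB)
                if sim > ALPHA:
                    sim_vector.append(sim)
            except KeyError:
                #Skip word pairs missing from the vocabulary
                pass

    #Mean of the retained similarities; 0 when no pair passes the threshold
    return sum(sim_vector)/(len(sim_vector) or 1)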
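
The rule implemented above keeps only the pairwise similarities greater than `ALPHA` and averages them. A minimal sketch of that rule on a hand-written list of similarities (hypothetical values, and `thresholded_mean` is an illustrative helper, not part of Gensim):

```python
ALPHA = 0.019  # same threshold as in the function above

def thresholded_mean(similarities, alpha=ALPHA):
    """Average the similarities strictly greater than alpha; 0 if none qualify."""
    kept = [s for s in similarities if s > alpha]
    return sum(kept) / (len(kept) or 1)

# Hypothetical pairwise similarities between the words of two sentences.
pairwise = [0.9, 0.01, -0.2, 0.3]

score = thresholded_mean(pairwise)    # only 0.9 and 0.3 pass the threshold
empty_score = thresholded_mean([-0.5, 0.0])
print(score, empty_score)
```

Because weak and negative pairs are discarded, the score reflects only the word pairs the model considers related.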

In [165]:
print('Similar sentences w2v.similarity', sent_sim_jonh2016(sent1,sent2, p2v))
print('Similar sentences w2v.similarity without stopwords', sent_sim_jonh2016(sent1sl,sent2sl, p2v))

Similar sentences w2v.similarity 0.2740338124244419
Similar sentences w2v.similarity without stopwords 0.32096874638428086


## 4.4 Gensim Hellinger sentence similarity

In [76]:
from gensim.matutils import kullback_leibler, hellinger
print(hellinger(p2v_sent1,p2v_sent2))
print(kullback_leibler(p2v_sent1,p2v_sent2))

nan
inf


  sim = np.sqrt(0.5 * ((np.sqrt(vec1) - np.sqrt(vec2))**2).sum())
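
The `nan` comes from the square roots in the formula shown in the warning. The Hellinger distance is defined for probability distributions, whose entries are non-negative, while the sentence vectors contain negative values. A minimal sketch in plain Python of the same formula on two valid toy distributions (`hellinger_distance` is an illustrative helper, not Gensim's function):

```python
import math

def hellinger_distance(p, q):
    """Hellinger distance between two discrete probability distributions."""
    if any(x < 0 for x in p) or any(x < 0 for x in q):
        raise ValueError("Hellinger distance needs non-negative entries")
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(0.5 * squared)

# Toy distributions: non-negative entries that sum to 1.
p = [0.5, 0.5, 0.0]
q = [0.0, 0.5, 0.5]

d_same = hellinger_distance(p, p)
d_diff = hellinger_distance(p, q)
print(d_same, d_diff)

# An embedding-like vector with a negative entry is rejected explicitly.
try:
    hellinger_distance([0.2, -0.1, 0.4], p)
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```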


## 4.5 The Case of the Inferred Vector of a Sentence

Gensim's paragraph2vec model provides the `infer_vector` method, which is not available in the other models.

In [55]:
vec_sent1_infer_p2v = p2v.infer_vector(sent1)
vec_sent2_infer_p2v = p2v.infer_vector(sent2)

#Stopword filtering
vec_sent1s_infer_p2v = p2v.infer_vector(sent1sl)
vec_sent2s_infer_p2v = p2v.infer_vector(sent2sl)

#print the Paragraph2vector of the sentence 1
print(len(vec_sent1_infer_p2v))
vec_sent1_infer_p2v[:10]

300


array([ 0.04579691, -0.09673504, -0.03551013, -0.05200712, -0.01592638,
        0.02746712, -0.08295926, -0.03992163,  0.11818437, -0.16485237],
      dtype=float32)

### Applying Similarity on Inferred Vectors

In [56]:
from sklearn.metrics.pairwise import cosine_similarity
print('p2v infered vector',
      cosine_similarity(vec_sent1_infer_p2v.reshape(1,-1),vec_sent2_infer_p2v.reshape(1,-1))[0][0])
print('p2v infered vector without stopwords',
      cosine_similarity(vec_sent1s_infer_p2v.reshape(1,-1),vec_sent2s_infer_p2v.reshape(1,-1))[0][0])

p2v infered vector 0.74639404
p2v infered vector without stopwords 0.56761754


# Conclusions

Like Word2Vec, this model does not work with a bow structure; it represents a word as a vector whose length is the value of the *size parameter*. This model can also infer a vector for a sentence. The experiments show that, with the same corpus and the same sentences, paragraph2vec needs all tokens to be lowercased, e.g. 'Alice' or 'Here'; this behavior differs from the Word2Vec model.

- Here, vectors that represent sentences work better than single word vectors. See, for example, the difference between `p2v.wv.n_similarity(sent1,sent2)` and `p2v.wv.similarity('hombre','mujer')`.
- The best similarities with these text representation models have to be implemented with innovative ideas, for example ``sent_sim_jonh2016`` and ``harmonic_best_pair_word_sim``.
- In almost all cases, stopword filtering __decreases__ the similarity between similar sentences and also reduces the similarity between different sentences.
- Note that `p2v.docvecs.n_similarity` raises an error when it has to process expressions with stopwords such as "de, el, yo,...", so stopword filtering must be applied before using this metric. Interestingly, `p2v.wv.n_similarity` does not have this problem.
- The _inferred vectors_ yield lower similarity than the other sentence methods based on word vectors, such as `harmonic_best_pair_word_sim`.
- Gensim Hellinger and Kullback-Leibler remain unusable here: they returned `nan` and `inf`.


# Recommendations

* When comparing expressions (text constructions of more than one word), you can check which expressions are very similar in the paragraph2vec model, substitute them, and then compare the result with the other transformed query. (_see Playful Example_)
* Train the paragraph2vec model on the Wikipedia articles longer than 5000 words as well. In this notebook the `article_max_tokens` parameter was added by the author inside the Gensim package because of limited RAM. Then try to train the model with all the articles.
* Test the difference between loading wiki_corpus from .xml.bz2 and from a serialized .mm corpus.

### Playful Example

Try this little example to see the difference that substitution makes.

In [105]:
p2v.docvecs.similarity('Argentina','México')

KeyError: "tag 'México' not seen in training corpus/invalid"

Look at the tags that are available for the substitution:

In [120]:
for tag in p2v.docvecs.doctags:
    if 'México' in tag and 'República' in tag:
        print(tag)

Primera República Federal (México)
Movimiento federalista del noreste de México durante la República Centralista
Oficina de la Presidencia de la República (México)
Relaciones República Dominicana-México
República Centralista (México)
República Restaurada (México)
Misión Permanente de México en República Checa
Plaza de la República (Ciudad de México)
Procuraduría General de la República (México)
República Centroafricana en los Juegos Olímpicos de México 1968


In [148]:
exp1 = 'Gobierno de la República Argentina'
exp2 = 'Primera República Federal (México)'
exp3 = 'Oficina de la Presidencia de la República (México)'
print(p2v.docvecs.similarity(exp1,exp2))
print(p2v.docvecs.similarity(exp1,exp3))

0.4049288198423618
0.6489121810340391


In [145]:
def n_similar(exp1, exp2, model):
    similars = []
    list1, list2 = [], []
    #Word sets of both expressions, computed once outside the loop
    exp1s = set(exp1.split())
    exp2s = set(exp2.split())
    for tag in model.docvecs.doctags:
        tags = set(tag.split())
        #Keep the tags that contain every word of the expression
        if exp1s.issubset(tags):
            list1.append(tag)
        if exp2s.issubset(tags):
            list2.append(tag)
    print(len(list1),len(list2))

    #Score every pair of candidate tags
    for s1 in list1:
        for s2 in list2:
            similars.append((model.docvecs.similarity(s1,s2),s1,s2))

    #Top 3 pairs by similarity
    similars = sorted(similars, reverse=True)
    return similars[:3]
        
print(n_similar('República México','República Argentina',p2v))

3 36
[(0.8696115540769269, 'Misión Permanente de México en República Checa', 'Embajada de Argentina en la República Popular China'), (0.8126812468532063, 'Misión Permanente de México en República Checa', 'Estado Mayor Conjunto de las Fuerzas Armadas de la República Argentina'), (0.7662970010674647, 'República Centroafricana en los Juegos Olímpicos de México 1968', 'Asociación de Reporteros Gráficos de la República Argentina')]


This example shows how we can go further, but there is still a lot of work to do.

<a id='references'></a>
# References

<a id='John2016'></a>
*[John2016]* John, Adebayo Kolawole and Caro, Luigi Di and Boella, Guido. 
**NORMAS at SemEval-2016 Task 1: SEMSIM: A Multi-Feature Approach to Semantic Text Similarity**. 
Publisher ACM, 2016.