# Similarity in Paragraph2Vec Text Representation

*Gensim software examples.*

**Prerequisites:** Tokenization skills with NLTK and knowledge of the Word2Vec text representation model.

## Outline

**Main Goal:** Practice creating paragraph2vec models with Gensim and NLTK, then see how to extract information from the resulting text representations, and finally how to measure word similarity.

- Gensim Corpus Initialization
- paragraph2vec model example

## About Gensim

Gensim is a Python library for *topic modelling*, *document indexing*
and *similarity retrieval* with large corpora. Target audience is the
*natural language processing* (NLP) and *information retrieval* (IR)
community. [Gensim Documentation](Gensim Doc)

## About NLTK

Natural Language ToolKit (NLTK) is a comprehensive Python library for natural language
processing and text analytics. Originally designed for teaching, it has been adopted in the
industry for research and development due to its usefulness and breadth of coverage. NLTK
is often used for rapid prototyping of text processing programs and can even be used in
production applications. [(Perkins2014)](#Perkins2014)

## What is Paragraph2Vec?

Paragraph2Vec ...[(xxx)](#xxx)

In [1]:
import os
import smart_open
import gensim
from gensim.models import Doc2Vec
import time

## Wrangling Data

From a txt collection to a list of strings, and from that list of strings to a list of word lists, one per sentence.

This first method of loading the whole text collection is based on the "os" module; it is only a code snippet for practicing a different way to do it. NLTK, numpy, and other libraries have their own methods for the same process.

In this case a new corpus file is generated; `read_corpus` below treats each of its lines as one document.

In [2]:
doc_collection = ''
file_path = 'gutenberg/'
# List the files with os.listdir instead of shelling out to `ls`
file_list = sorted(os.listdir(file_path))
for file in file_list:
    with open(os.path.join(file_path,file)) as doc:
        doc_collection += doc.read()+'\n'

# Write the merged collection to a single corpus file
with open('data/all_gutenberg', 'w') as f:
    f.write(doc_collection)
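
The cell above preserves the line breaks of the source files, so when the merged file is read line by line, each line of a book becomes its own document. If one document per source file is preferred, a minimal sketch could collapse each file onto a single line before writing it. The function name `merge_one_doc_per_line` and the demo file contents are hypothetical, chosen only for illustration.

```python
import tempfile
from pathlib import Path

def merge_one_doc_per_line(src_dir, out_path, encoding="iso-8859-1"):
    """Write every file in src_dir to out_path as a single line."""
    count = 0
    with open(out_path, "w", encoding=encoding) as out:
        for path in sorted(Path(src_dir).iterdir()):
            if not path.is_file():
                continue
            text = path.read_text(encoding=encoding)
            # Collapse all whitespace (including newlines) so the
            # document occupies exactly one line of the output file.
            out.write(" ".join(text.split()) + "\n")
            count += 1
    return count

# Tiny demo on temporary files (hypothetical content)
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "texts"
    src.mkdir()
    (src / "a.txt").write_text("Alice was\nbeginning to get tired", encoding="iso-8859-1")
    (src / "b.txt").write_text("Call me\n\nIshmael", encoding="iso-8859-1")
    merged = Path(tmp) / "all_texts"
    n_docs = merge_one_doc_per_line(src, merged)
    merged_lines = merged.read_text(encoding="iso-8859-1").splitlines()

print(n_docs, merged_lines)
```

With this layout the integer tag assigned to each line identifies a whole book rather than a single line of text.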

In [3]:
def read_corpus(fname, tokens_only=False):
    # smart_open.open replaces the deprecated smart_open.smart_open
    with smart_open.open(fname, encoding="iso-8859-1") as f:
        for i, line in enumerate(f):
            if tokens_only:
                yield gensim.utils.simple_preprocess(line)
            else:
                # For training data, add tags
                yield gensim.models.doc2vec.TaggedDocument(gensim.utils.simple_preprocess(line), [i])
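
`read_corpus` relies on `gensim.utils.simple_preprocess`, which lowercases the text and, with its default settings, drops tokens shorter than 2 or longer than 15 characters. The dependency-free sketch below only approximates that behaviour to show what ends up in the vocabulary; `simple_tokenize` is a hypothetical helper, not gensim's implementation.

```python
import re

def simple_tokenize(text, min_len=2, max_len=15):
    """Lowercase text and keep alphabetic tokens of min_len..max_len characters."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]

tokens = simple_tokenize("Here Alice run to the hall")
short = simple_tokenize("The boy eat a red apple")

print(tokens)
print(short)
```

Capitalized forms such as 'Alice' never reach the model, and one-letter words such as 'a' are discarded, so lookups should use lowercase tokens.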

## Generating the Paragraph2Vec Model

_Note:_ `gensim.utils.simple_preprocess` lowercases all tokens, so the vocabulary of this model is lowercase.

In [7]:
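model_path = 'models/gutenberg_p2v.model'
try:
    # Reuse a previously trained model if one is saved on disk
    paraph2vec = Doc2Vec.load(model_path)
    print('Doc2Vec Model Generated in 134 seconds')
except OSError:
    # No saved model found: train a new one and save it under the same path
    init = time.time()
    corpus = list(read_corpus('data/all_gutenberg'))
    paraph2vec = Doc2Vec(corpus, vector_size=300, window=8, min_count=5, workers=4)
    end = time.time()-init
    paraph2vec.save(model_path)
    print('Total time:', end)

# The following cells use both names for the same model
p2v = paraph2vec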

Doc2Vec Model Generated in 134 seconds


In [9]:
p2v.wv['alice'][:10]

array([ 0.6150048 , -0.13362435, -0.17901617, -0.13512719, -0.18723272,
       -0.7113043 ,  0.3514476 , -0.50977117, -0.5001958 , -0.39531764],
      dtype=float32)

## Sklearn Paragraph2Vec-Cosine Sentence Similarity

### Wrangling Data

From string-sentences to numpy paragraph vectors.

In [10]:
sentence1 = 'the girl run into the hall'
sentence2 = 'Here Alice run to the hall'

sent1 = sentence1.split()
sent2 = sentence2.split()

sent1s = 'girl run hall'
sent2s = 'Alice run hall'

sent1sl = sent1s.split()
sent2sl = sent2s.split()

#If we replace sent1 with sent3, a sentence with a very different meaning
sent3 = ['the','boy','eat','a','red','apple']

In [14]:
vec_sent1_p2v = p2v.infer_vector(sent1)
vec_sent2_p2v = p2v.infer_vector(sent2)

#Print the length and the first components of the paragraph vector of sentence 1
print(len(vec_sent1_p2v))
vec_sent1_p2v[:10]

300


array([ 0.00238911, -0.01297163, -0.03777327,  0.0410721 , -0.00120385,
       -0.03433726,  0.01768394, -0.04179351, -0.01550062,  0.02970592],
      dtype=float32)

### Applying Similarity

In [8]:
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity(vec_sent1_p2v.reshape(1,-1),vec_sent2_p2v.reshape(1,-1))[0][0]

0.51463765
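
For reference, cosine similarity is the dot product of two vectors divided by the product of their Euclidean norms. The sketch below spells that out in plain Python; `cosine_sim` is a hypothetical helper for illustration, and returning 0.0 for a zero vector is a convention chosen here, not something taken from sklearn.

```python
import math

def cosine_sim(u, v):
    """Cosine similarity of two equal-length sequences of numbers."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)

same_direction = cosine_sim([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_sim([1.0, 0.0], [0.0, 1.0])
opposite = cosine_sim([1.0, 0.0], [-1.0, 0.0])

print(same_direction, orthogonal, opposite)
```

Vectors pointing the same way score close to 1, orthogonal vectors score 0, and opposite vectors score -1, regardless of their lengths.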

### Wrangling Data

From string sentences to sentence vectors built word by word: the paragraph2vec word vectors of each sentence are summed into one numpy array.

In [9]:
import numpy as np

def preproc_data(sent1, sent2, model):

    sentence1 = sent1.split()
    sentence2 = sent2.split()

    p2v_sent1 = []
    p2v_sent2 = []

    # Collect the vector of every word the model knows; skip unknown words
    for word in sentence1:
        try:
            p2v_sent1.append(model.wv[word])
        except KeyError:
            pass

    for word in sentence2:
        try:
            p2v_sent2.append(model.wv[word])
        except KeyError:
            pass

    # Sum the word vectors into a single sentence vector
    p2v_sent1 = np.sum(np.asarray(p2v_sent1), axis=0)
    p2v_sent2 = np.sum(np.asarray(p2v_sent2), axis=0)

    # Shape (1, n), as expected by sklearn's cosine_similarity
    A = p2v_sent1.reshape(1,-1)
    B = p2v_sent2.reshape(1,-1)

    return A,B

In [10]:
p2v_sent1, p2v_sent2 = preproc_data(sentence1,sentence2,paraph2vec)
print(len(p2v_sent1[0]))
p2v_sent2[0][:10]

300


array([ 0.10360318, -1.6545863 , -0.962557  ,  0.65428835,  0.96878296,
       -2.354933  , -0.3036998 ,  0.33115143,  0.29521412,  0.64372826],
      dtype=float32)

### Applying Similarity

In [11]:
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity(p2v_sent1,p2v_sent2)[0][0]

0.7995827
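
`preproc_data` sums the word vectors of each sentence and skips words the model does not know. Because cosine similarity ignores vector length, averaging the vectors instead of summing them gives the same score. The sketch below checks this on hypothetical 3-dimensional toy vectors; `toy_vectors`, `lookup`, `add_vectors` and `mean_vector` are illustrative names, not part of gensim.

```python
import math

def cosine_sim(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def add_vectors(vectors):
    """Component-wise sum of a list of equal-length vectors."""
    return [sum(components) for components in zip(*vectors)]

def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [s / len(vectors) for s in add_vectors(vectors)]

# Hypothetical toy word vectors
toy_vectors = {
    "girl": [0.9, 0.1, 0.3],
    "alice": [0.8, 0.2, 0.4],
    "run": [0.1, 0.9, 0.2],
    "hall": [0.2, 0.3, 0.9],
}

def lookup(tokens):
    # Skip out-of-vocabulary tokens, as preproc_data does
    return [toy_vectors[t] for t in tokens if t in toy_vectors]

s1 = lookup(["the", "girl", "run"])           # "the" is skipped
s2 = lookup(["alice", "run", "hall", "zzz"])  # "zzz" is skipped

sim_sum = cosine_sim(add_vectors(s1), add_vectors(s2))
sim_mean = cosine_sim(mean_vector(s1), mean_vector(s2))
print(sim_sum, sim_mean)
```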

In [12]:
#Filtering stopwords
p2v_sent1s, p2v_sent2s = preproc_data(sent1s,sent2s,paraph2vec)
cosine_similarity(p2v_sent1s,p2v_sent2s)[0][0]

0.8869797
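
The filtered sentences `sent1s` and `sent2s` were written by hand. A minimal sketch of doing the filtering in code, assuming a small hypothetical stopword set (`TOY_STOPWORDS`) rather than a full list such as the one shipped with NLTK:

```python
# Hypothetical stopword set, just large enough for the two example sentences
TOY_STOPWORDS = {"the", "into", "here", "to", "a"}

def remove_stopwords(sentence, stopwords=TOY_STOPWORDS):
    """Lowercase, split on whitespace and drop stopwords."""
    return [t for t in sentence.lower().split() if t not in stopwords]

filtered1 = remove_stopwords("the girl run into the hall")
filtered2 = remove_stopwords("Here Alice run to the hall")

print(filtered1, filtered2)
```

Lowercasing inside the helper also keeps the tokens consistent with the lowercase vocabulary of the model.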

## Scipy Cosine Similarity

In [13]:
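from scipy.spatial.distance import cosine as cosine_scipy

# scipy returns the cosine distance (1 - cosine similarity) and expects 1-D vectors
print(cosine_scipy(p2v_sent1.ravel(),p2v_sent2.ravel()))
print(cosine_scipy(p2v_sent1s.ravel(),p2v_sent2s.ravel())) #Filtering stopwords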

0.20041728019714355
0.11302042007446289
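
The scipy numbers are distances, so they should equal one minus the sklearn similarities obtained earlier. A quick arithmetic check with the values copied from the outputs above:

```python
# Values copied from the sklearn and scipy outputs above
sklearn_sim = 0.7995827
sklearn_sim_filtered = 0.8869797
scipy_dist = 0.20041728019714355
scipy_dist_filtered = 0.11302042007446289

# distance = 1 - similarity, up to float32 rounding in the printed values
gap = abs((1 - sklearn_sim) - scipy_dist)
gap_filtered = abs((1 - sklearn_sim_filtered) - scipy_dist_filtered)

print(gap < 1e-6, gap_filtered < 1e-6)
```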


## Gensim paragraph2vec sentence similarity

In [14]:
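from gensim.matutils import kullback_leibler, jaccard, hellinger, cossim
# cossim expects sparse bag-of-words vectors, i.e. lists of (id, weight) pairs,
# so passing dense numpy arrays raises the TypeError shown below
cossim(vec_sent1_p2v,vec_sent2_p2v)

#For the next lines to work, the model must contain all appearing words
#paraph2vec.n_similarity(sentence1,sentence2)

#paraph2vec.wv.n_similarity(sentence1,sentence2)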

TypeError: cannot convert dictionary update sequence element #0 to a sequence
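
The error message suggests that the function tries to build a dictionary from its inputs, which works for lists of `(id, weight)` pairs but not for dense arrays of floats. The sketch below reproduces the failure in plain Python and shows one way to convert a dense vector into that sparse format. `dense_to_sparse` and `sparse_cosine` are hypothetical helpers written for illustration, not gensim's code; gensim ships its own conversion helper, `gensim.matutils.any2sparse`.

```python
import math

def dense_to_sparse(vec, eps=1e-9):
    """Convert a dense sequence into a list of (index, value) pairs."""
    return [(i, float(x)) for i, x in enumerate(vec) if abs(x) > eps]

def sparse_cosine(vec1, vec2):
    """Cosine similarity of two sparse vectors given as (index, value) pairs."""
    d1, d2 = dict(vec1), dict(vec2)
    if not d1 or not d2:
        return 0.0
    norm1 = math.sqrt(sum(v * v for v in d1.values()))
    norm2 = math.sqrt(sum(v * v for v in d2.values()))
    dot = sum(v * d2.get(i, 0.0) for i, v in d1.items())
    return dot / (norm1 * norm2)

dense1 = [0.5, 0.0, -1.0]
dense2 = [1.0, 2.0, 0.0]

# Building a dict directly from a dense vector fails, as in the cell above
try:
    dict(dense1)
    dense_error = None
except TypeError as err:
    dense_error = str(err)

sim = sparse_cosine(dense_to_sparse(dense1), dense_to_sparse(dense2))
print(dense_error)
print(sim)
```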

## Gensim p2v.n_similarity

In [None]:
paraph2vec.wv.n_similarity(['the','girl','run','into','the','hall'],['here','alice','run','to','the','hall'])

In [None]:
paraph2vec.wv.n_similarity(['girl','run','hall'],['alice','run','hall'])

In [None]:
paraph2vec.wv.n_similarity(['the','boy','eat','red','apple'],
                   ['here','alice','run','to','the','hall'])

## Gensim p2v infer_vector

In [None]:
#Testing initial infer_vector similarity
vec_sent1_p2v = vec_sent1_p2v.reshape(1,-1)
vec_sent2_p2v = vec_sent2_p2v.reshape(1,-1)
cosine_similarity(vec_sent1_p2v,vec_sent2_p2v)[0][0]

In [None]:
#infer_vector similarity filtering stopwords
vec_sent1s_p2v = paraph2vec.infer_vector(sent1s.split()).reshape(1,-1)
vec_sent2s_p2v = paraph2vec.infer_vector(sent2s.split()).reshape(1,-1)
cosine_similarity(vec_sent1s_p2v,vec_sent2s_p2v)[0][0]

## Gensim p2v.similarity

**Warning:** all the words must be converted to lowercase.

In [None]:
sent1 = sentence1.lower().split()
sent2 = sentence2.lower().split()

vec_sent1 = paraph2vec.wv[sent1]
vec_sent2 = paraph2vec.wv[sent2]

#cosine(vec_sent1,vec_sent2)
vec_sent1_ = sum(vec_sent1).reshape(1,-1)
vec_sent2_ = sum(vec_sent2).reshape(1,-1)

cosine_similarity(vec_sent1_,vec_sent2_)[0][0]

In [None]:
def word_vector_cosine_sim(sent1, sent2,p2v):
    # Sum the word vectors into new arrays; an in-place += on p2v.wv[word]
    # could modify the vectors stored in the model
    sent1_p2v = np.sum([p2v.wv[word] for word in sent1], axis=0)
    sent2_p2v = np.sum([p2v.wv[word] for word in sent2], axis=0)

    # get the sentence vector similarity
    return 1-cosine_scipy(sent1_p2v,sent2_p2v)

In [None]:
print(word_vector_cosine_sim(sent1,sent2,paraph2vec))

It seems as if paragraph2vec has a sparsity problem, because the word-vector values are too low.
Note also that the vectors are summed into a fresh array: accumulating with `+=` directly on the array returned by *wv* can overwrite the vectors stored in the model, and after many runs the similarity approaches 0.999.

## Best Pair Word Overlap Similarity

Let's try a different way to compose a sentence similarity, based on the WordNet-Augmented Word Overlap similarity idea. Here $df[w][w']$ denotes the similarity between word $w$ of the first sentence and word $w'$ of the second.

$p = {\sum_{w \in sent_1} \max_{w' \in sent_2} df[w][w'] \over len(sent_1)}$

$q = {\sum_{w' \in sent_2} \max_{w \in sent_1} df[w][w'] \over len(sent_2)}$

$sim = \left\{ \begin{array}{ll}
0  & \text{if } p+q = 0\\
{2 p q \over p+q}  & \text{otherwise}\\
\end{array}
\right.$

In [None]:
paraph2vec.wv.similarity('girl','woman')

In [None]:
def harmonic_best_pair_word_sim(string1,string2,model):
    # p: average, over the words of string1, of the best similarity
    # with any word of string2 (unknown words contribute 0)
    p=0
    for wi in string1:
        m = 0
        for wc in string2:
            try:
                m = max(m, model.wv.similarity(wi,wc))
            except KeyError:
                pass
        p += m
    p = p/len(string1)

    # q: the same average, taken over the words of string2
    q=0
    for wc in string2:
        m = 0
        for wi in string1:
            try:
                m = max(m, model.wv.similarity(wi,wc))
            except KeyError:
                pass
        q += m
    q = q/len(string2)

    # Harmonic mean of p and q; 0 when both are 0
    sim = 2*p*q/(p+q or 1)
    return sim

In [None]:
print('Sentence p2v_harmonic_best_pair_word with stopwords', harmonic_best_pair_word_sim(sent1,sent2,paraph2vec))
print('Sentence p2v_harmonic_best_pair_word without stopwords',harmonic_best_pair_word_sim(sent1sl,sent2sl,paraph2vec))
print('Different sentence p2v_harmonic_best_pair_word similarity', harmonic_best_pair_word_sim(sent3,sent2,paraph2vec))
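
To make the formula concrete, here is a small worked example that does not need a trained model. It assumes a hypothetical toy word similarity (`toy_word_sim`): identical words score 1.0, the pair girl/alice scores 0.5, and everything else scores 0.

```python
def best_pair_harmonic_sim(tokens1, tokens2, word_sim):
    """Harmonic mean of the two directed best-pair averages p and q."""
    if not tokens1 or not tokens2:
        return 0.0
    p = sum(max(word_sim(w, w2) for w2 in tokens2) for w in tokens1) / len(tokens1)
    q = sum(max(word_sim(w, w2) for w in tokens1) for w2 in tokens2) / len(tokens2)
    return 0.0 if p + q == 0 else 2 * p * q / (p + q)

# Hypothetical similarity scores for non-identical word pairs
toy_scores = {frozenset(("girl", "alice")): 0.5}

def toy_word_sim(w1, w2):
    if w1 == w2:
        return 1.0
    return toy_scores.get(frozenset((w1, w2)), 0.0)

score = best_pair_harmonic_sim(["girl", "run"], ["alice", "run"], toy_word_sim)
disjoint = best_pair_harmonic_sim(["boy"], ["hall"], toy_word_sim)
print(score, disjoint)
```

For the first pair, p = (0.5 + 1.0) / 2 = 0.75 and q = (0.5 + 1.0) / 2 = 0.75, so the harmonic mean is 0.75; the unrelated pair scores 0.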

## Textsim Jaccard

In [None]:
import sys
sys.path.append('/home/abelm/')
import textsim
from textsim.tokendists import jaccard_distance
print('Textsim Jaccard', jaccard_distance(sent1,sent2))
print('Textsim Jaccard, stopwords_filter=yes', jaccard_distance('girl run hall','Alice eat hall'))
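
`textsim` is a local package whose code is not shown here, so the sketch below is not its implementation. It only illustrates the standard set-based Jaccard measure on tokens, with hypothetical helper names: the similarity is the size of the intersection divided by the size of the union, and the distance is one minus that value.

```python
def token_jaccard_similarity(tokens1, tokens2):
    """|A & B| / |A | B| over the sets of tokens."""
    set1, set2 = set(tokens1), set(tokens2)
    union = set1 | set2
    if not union:
        return 1.0  # convention chosen here: two empty inputs are identical
    return len(set1 & set2) / len(union)

def token_jaccard_distance(tokens1, tokens2):
    return 1.0 - token_jaccard_similarity(tokens1, tokens2)

a = "girl run hall".split()
b = "alice run hall".split()

sim_ab = token_jaccard_similarity(a, b)
dist_ab = token_jaccard_distance(a, b)
print(sim_ab, dist_ab)
```

The two sentences share 2 of 4 distinct tokens, so both the similarity and the distance are 0.5.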

# Conclusions

Like Word2Vec, this model does not work with a bag-of-words (BoW) structure; it represents each word as a vector whose length is given by the *size* parameter. The model can also infer a vector for a whole sentence. The experiments show that, with the same corpus and the same sentences, paragraph2vec tends to fail on some words, e.g. 'Alice' or 'Here', which differs from the behavior of the Word2Vec model. Note that the vocabulary built here is lowercase, so capitalized tokens are out of vocabulary.

* Gensim Hellinger, Cosine, Jaccard, Kullback-Leibler and the other measures based on BoW vectors do not work.
* 0.777 input = str, Jaccard, Textsim, stopwords_filter=no
* 0.800 input = str, Jaccard, Textsim, stopwords_filter=yes
* 0.433 input = str, Cosine, Textsim-sklearn, stopwords_filter=no
* 0.333 input = str, Cosine, Textsim-sklearn, stopwords_filter=yes
* 0.773 input = self vec, Cosine, Sklearn
* 0.361 input = Doc2Vec infer_vec, Cosine, Sklearn
* 0.635 input = str list, Harmonic mean, Best word sim of words in both sentences, stopwords_filter=no
* 0.543 input = str list, Harmonic mean, Best word sim of words in both sentences, stopwords_filter=yes


# Recommendations

* Repeat the same example with Wikipedia dump data, to test how the similarity changes with the training data.

<a id='referencias'></a>
# References

<a id='Perkins2014'></a>
[1] *[Perkins2014]* Jacob Perkins. 
Book **Python 3 Text Processing with NLTK 3 Cookbook**. 2014. 
p. 7 **ISBN**: 978-1-78216-785-3

<a id='John2016'></a>
[3] *[John2016]* John, Adebayo Kolawole and Caro, Luigi Di and Boella, Guido. **NORMAS at SemEval-2016 Task 1: SEMSIM: A Multi-Feature Approach to Semantic Text Similarity**. Publisher ACM, 2016.