# Paragraph2Vec Text Representation Model

*Gensim software examples.*

**Prerequisites:** Tokenization with NLTK and familiarity with the Word2Vec text representation model.

**Data:** Gutenberg Corpus

## Outline

**Main Goal:** To practice creating a paragraph2vec model with Gensim and NLTK, then to extract information from this text representation model, and finally to measure the similarity between words and sentences using the result.

- Acquiring and wrangling data for model initialization
- Gensim paragraph2vec model generation/loading example
- Sklearn and Scipy text similarity measure examples
- Gensim native measure examples

## What is Paragraph2Vec?

Paragraph2Vec is an unsupervised framework that learns continuous distributed vector representations for pieces of text. The texts can be of variable length, ranging from sentences to documents. The name Paragraph Vector emphasizes that the method can be applied to variable-length pieces of text, anything from a phrase or sentence to a large document [(Le and Mikolov, 2014)](#Lee2014).

**Note**: For an introduction to Gensim and NLTK, see the notes in the [2.1-TfIdf Notebook](02.1-TfIdf.ipynb).

In [2]:
import os
import smart_open
import gensim
from gensim.models import Doc2Vec
import time

# 1 Acquiring & Wrangling Data

The goal is to go from a collection of txt files to a list of strings, and from that list of strings to a list of word lists, one per sentence.

This first method of loading the whole text collection relies on the `os` module; it is only a code snippet to practice a different way of doing it. NLTK, numpy, and other libraries have their own methods for the same process.

In this case a new corpus file is generated by concatenating all documents; `read_corpus` below then treats each line of that file as one document.

In [3]:
doc_collection = ''
file_path = 'data/gutenberg/'
# List the corpus files with os.listdir (portable, no shell call needed)
file_list = sorted(os.listdir(file_path))
for file in file_list:
    with open(os.path.join(file_path, file)) as doc:
        doc_collection += doc.read() + '\n'

# Write all documents into a single corpus file
with open('data/all_gutenberg', 'w') as f:
    f.write(doc_collection)

In [4]:
def read_corpus(fname, tokens_only=False):
    with smart_open.open(fname, encoding="iso-8859-1") as f:
        for i, line in enumerate(f):
            if tokens_only:
                yield gensim.utils.simple_preprocess(line)
            else:
                # For training data, add tags
                yield gensim.models.doc2vec.TaggedDocument(gensim.utils.simple_preprocess(line), [i])
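
To make the structure produced by `read_corpus` concrete, here is a minimal, dependency-free sketch. `simple_tokens` is a hypothetical stand-in that only approximates `gensim.utils.simple_preprocess` (lowercasing and keeping alphabetic tokens of 2 to 15 characters, which are assumed to be its defaults), and `TaggedLine` is a hypothetical stand-in for `TaggedDocument`.

```python
import re
from collections import namedtuple

# Hypothetical stand-in for gensim's TaggedDocument (words + tags)
TaggedLine = namedtuple("TaggedLine", ["words", "tags"])

def simple_tokens(line, min_len=2, max_len=15):
    """Rough approximation of simple_preprocess: lowercase alphabetic tokens."""
    tokens = re.findall(r"[^\W\d_]+", line.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]

def tag_lines(lines):
    """Yield one tagged document per input line, tagged with its line index."""
    for i, line in enumerate(lines):
        yield TaggedLine(simple_tokens(line), [i])

sample = ["Here Alice run to the hall", "A red apple, 2 boys"]
docs = list(tag_lines(sample))
print(docs[0])
print(docs[1])
```

Under these assumptions, single-letter tokens such as "A" and digits are dropped, and "Alice" is stored as "alice", which is why later lookups use lowercased tokens.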

# 2 Generating the Paragraph2Vec Model

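_Note:_ The preprocessing used here (`gensim.utils.simple_preprocess`) lowercases all tokens automatically, so the model vocabulary is lowercase; for that reason the sentences below are lowercased before lookup.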

In [5]:
MODEL_PATH = 'models/gutenberg-p2v.model'
try:
    p2v = Doc2Vec.load(MODEL_PATH)
    print('Doc2Vec Model Generated in 134 seconds')
except FileNotFoundError:
    # No saved model yet: train one and store it under the same path
    init = time.time()
    corpus = list(read_corpus('data/all_gutenberg'))
    p2v = Doc2Vec(corpus, vector_size=300, window=8, min_count=5, workers=4)
    end = time.time() - init
    p2v.save(MODEL_PATH)
    print('Total time:', end)

Doc2Vec Model Generated in 134 seconds


In [6]:
p2v.wv['alice'][:10]

array([ 0.6150048 , -0.13362435, -0.17901617, -0.13512719, -0.18723272,
       -0.7113043 ,  0.3514476 , -0.50977117, -0.5001958 , -0.39531764],
      dtype=float32)

# 3 Measuring Similarity between Pair of Sentences

This section shows the _paragraph2vec_ model in an applied example, along with some native similarity methods of the `gensim.Doc2Vec` class.

## Data

In [134]:
sentence1 = 'the girl run into the hall'
sentence2 = 'Here Alice run to the hall'

sent1 = sentence1.lower().split()
sent2 = sentence2.lower().split()

sent1s = 'girl run hall'
sent2s = 'Alice run hall'

sent1sl = sent1s.lower().split()
sent2sl = sent2s.lower().split()

# sent3 has a very different meaning and is used in place of sent1 for dissimilar comparisons
sent3 = ['the','boy','eat','a','red','apple']
sent3s = ['boy','eat','red','apple']

## 3.1 Wrangling Data

Each tokenized sentence is turned into a single numpy array by summing the paragraph2vec vectors of its words.

In [147]:
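import numpy as np

def preproc_data(sent, model):
    """Sum the word vectors of the tokens in `sent` into a (1, vector_size) array."""
    vec_sent = []

    for word in sent:
        try:
            vec_sent.append(model.wv[word])
        except KeyError:
            # Skip out-of-vocabulary tokens
            pass

    vec_sent = np.sum(np.asarray(vec_sent), axis=0)
    result = vec_sent.reshape(1,-1)

    return result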

In [148]:
p2v_sent1 = preproc_data(sent1,p2v)
p2v_sent2 = preproc_data(sent2,p2v)
print(len(p2v_sent1[0]))
p2v_sent2[0][:10]

300


array([ 0.47775584, -2.3387423 , -1.6507049 ,  0.6713754 ,  1.2924995 ,
       -3.5926514 ,  0.50806916,  0.16648477, -0.53688073, -0.36386508],
      dtype=float32)
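
The idea behind `preproc_data` can be checked by hand on toy data. The sketch below uses a plain dictionary, `toy_vectors`, with made-up 3-dimensional vectors in place of `model.wv`, and `sum_word_vectors` is a hypothetical name; both are assumptions for illustration only.

```python
import numpy as np

# Made-up 3-dimensional "word vectors"; the real model uses 300 dimensions
toy_vectors = {
    "girl": np.array([1.0, 0.0, 2.0], dtype=np.float32),
    "run": np.array([0.5, 1.0, 0.0], dtype=np.float32),
    "hall": np.array([0.0, 2.0, 1.0], dtype=np.float32),
}

def sum_word_vectors(tokens, vectors):
    """Sum the vectors of known tokens; unknown tokens are skipped."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        raise ValueError("no token of the sentence is in the vocabulary")
    return np.sum(known, axis=0).reshape(1, -1)

# "the" is not in the toy vocabulary, so only three vectors are summed
sent_vec = sum_word_vectors(["the", "girl", "run", "hall"], toy_vectors)
print(sent_vec.shape)
print(sent_vec)
```

Unlike the function above, this sketch raises an explicit error when no token is in the vocabulary, which may be easier to debug than a failure at the reshape step.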

## 3.2 Sklearn Paragraph2Vec-Cosine Sentence Similarity

In [42]:
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity(p2v_sent1,p2v_sent2)[0][0]

0.7995827

In [44]:
#Filtering stopwords
p2v_sent1s = preproc_data(sent1sl,p2v)
p2v_sent2s = preproc_data(sent2sl,p2v)
cosine_similarity(p2v_sent1s,p2v_sent2s)[0][0]

0.8869797

## 3.3 Scipy Cosine Similarity

**Note:** $cosine_{Scipy\ distance} = 1 - cosine_{Sklearn\ similarity}$

In [45]:
from scipy.spatial.distance import cosine as cosine_scipy

# SciPy expects 1-D vectors, so flatten the (1, n) arrays first
print(cosine_scipy(p2v_sent1.ravel(),p2v_sent2.ravel()))
print(cosine_scipy(p2v_sent1s.ravel(),p2v_sent2s.ravel())) #Filtering stopwords

0.20041728019714355
0.11302042007446289
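
The relation between the two measures can be verified with a small, dependency-free sketch. The helper names `cosine_similarity_1d` and `cosine_distance_1d` and the vectors `u` and `v` are made up for illustration.

```python
import math

def cosine_similarity_1d(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_distance_1d(u, v):
    """Cosine distance, defined as 1 - cosine similarity."""
    return 1.0 - cosine_similarity_1d(u, v)

u = [1.0, 2.0, 2.0]
v = [2.0, 0.0, 1.0]
sim = cosine_similarity_1d(u, v)
dist = cosine_distance_1d(u, v)
print(sim, dist)
```

The recorded outputs agree with this relation: 1 - 0.7995827 ≈ 0.2004173 and 1 - 0.8869797 ≈ 0.1130203.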


## 3.4 Best Pair Word Overlap Similarity

Let's try a different way to compose a sentence similarity, based on the idea of WordNet-augmented word overlap similarity. Here $s(w, w')$ denotes the similarity between words $w$ and $w'$ given by the model.

$p = {\sum_{w\in\ sent_1}\max_{w' \in\ sent_2} s(w, w') \over len(sent_1)}$

$q = {\sum_{w'\in\ sent_2}\max_{w \in\ sent_1} s(w, w') \over len(sent_2)}$

$sim = \left\{ \begin{array}{rcl} 
0  & if\ p+q = 0\\
{2\, p\, q \over p+q}  & otherwise\\
\end{array}
\right.$

In [130]:
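def harmonic_best_pair_word_sim(sent1,sent2,model):
    """Harmonic mean of the two directional best-pair word similarities."""

    def best_pair_mean(sent_a, sent_b):
        # Mean, over the words of sent_a, of the best similarity to any word of sent_b
        total = 0
        for wordA in sent_a:
            m = 0
            for wordB in sent_b:
                try:
                    m = max(m, model.wv.similarity(wordA,wordB))
                except KeyError:
                    # Out-of-vocabulary words contribute 0
                    pass
            total += m
        return total/len(sent_a)

    p = best_pair_mean(sent1, sent2)
    q = best_pair_mean(sent2, sent1)

    # The `or 1` guard makes the result 0 when p + q == 0
    sim = 2*p*q/(p+q or 1)
    return sim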
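
The function above can be followed by hand on a toy example. The sketch below replaces the model with a made-up similarity table, `toy_sim`; all values and helper names are assumptions for illustration, not model output.

```python
# Made-up symmetric word similarities (illustrative values only)
toy_sim = {
    frozenset(["girl", "alice"]): 0.8,
    frozenset(["girl", "hall"]): 0.1,
    frozenset(["hall", "alice"]): 0.2,
    frozenset(["run", "alice"]): 0.3,
    frozenset(["run", "hall"]): 0.4,
}

def word_sim(a, b):
    """Similarity lookup: 1.0 for identical words, 0.0 for unknown pairs."""
    if a == b:
        return 1.0
    return toy_sim.get(frozenset([a, b]), 0.0)

def best_pair_mean(sent_a, sent_b):
    """Mean over sent_a of the best similarity to any word of sent_b."""
    return sum(max(word_sim(a, b) for b in sent_b) for a in sent_a) / len(sent_a)

s1 = ["girl", "run", "hall"]
s2 = ["alice", "hall"]
p = best_pair_mean(s1, s2)  # (0.8 + 0.4 + 1.0) / 3
q = best_pair_mean(s2, s1)  # (0.8 + 1.0) / 2
sim = 2 * p * q / (p + q) if p + q else 0.0
print(p, q, sim)
```

Because the harmonic mean is dominated by the smaller of `p` and `q`, a sentence pair only scores high when each sentence is well covered by the other.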

In [135]:
print('Dissimilar sentences p2v_harmonic_best_pair_word similarity', 
      harmonic_best_pair_word_sim(sent3, sent2, p2v))
# Note: sent2s is the raw string 'Alice run hall', not the token list sent2sl,
# so this call iterates over it character by character
print('Dissimilar sentences without stopwords p2v_harmonic_best_pair_word similarity',
      harmonic_best_pair_word_sim(sent3s, sent2s, p2v))
print('Similar sentences p2v_harmonic_best_pair_word', 
      harmonic_best_pair_word_sim(sent1, sent2, p2v))
print('Similar sentences p2v_harmonic_best_pair_word without stopwords',
      harmonic_best_pair_word_sim(sent1sl, sent2sl, p2v))


Dissimilar sentences p2v_harmonic_best_pair_word similarity 0.5256416051928539
Dissimilar sentences without stopwords p2v_harmonic_best_pair_word similarity 0.0
Similar sentences p2v_harmonic_best_pair_word 0.7539022594752681
Similar sentences p2v_harmonic_best_pair_word without stopwords 0.8636539612550971


# 4 Gensim Particular Measures

Gensim's Jaccard and cosine measures cannot be computed here because they need a bag-of-words (p2v_bow) vector, which does not exist for this model.

## 4.1 Cosine using Gensim p2v of a sentence

In [102]:
vec_sent1 = p2v.wv[sent1]
vec_sent2 = p2v.wv[sent2]

# Sum the word vectors of each sentence into a single row vector
vec_sent1_ = sum(vec_sent1).reshape(1,-1)
vec_sent2_ = sum(vec_sent2).reshape(1,-1)

# Without the sum, [0][0] only compares the first word of each sentence
print('p2v sentence vector similarity without transformation',
      cosine_similarity(vec_sent1,vec_sent2)[0][0])
print('p2v sentence vector similarity with summed word vectors',
      cosine_similarity(vec_sent1_,vec_sent2_)[0][0])

p2v sentence vector similarity without transformation 0.18863933
p2v sentence vector similarity with summed word vectors 0.7152189


## 4.2 Gensim p2v.n_similarity

In [145]:
print(p2v.wv.n_similarity(sent3s,sent1))
print(p2v.wv.n_similarity(sent1,sent2))
print(p2v.wv.n_similarity(sent1sl,sent2sl))

0.6423358595531631
0.7152189931006554
0.8741961944624282
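
The second value matches the summed-vector cosine from section 4.1 (0.7152189). A likely explanation, stated here as an assumption about Gensim's implementation rather than a confirmed fact, is that `n_similarity` takes the cosine between the mean vectors of the two word lists. Cosine similarity ignores vector length, so means and sums give the same result, as the toy sketch below shows with made-up vectors.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Made-up word vectors for two "sentences" of different lengths
sent_a = np.array([[1.0, 0.0, 2.0], [0.5, 1.0, 0.0], [0.0, 2.0, 1.0]])
sent_b = np.array([[2.0, 1.0, 0.0], [1.0, 1.0, 1.0]])

cos_of_sums = cosine(sent_a.sum(axis=0), sent_b.sum(axis=0))
cos_of_means = cosine(sent_a.mean(axis=0), sent_b.mean(axis=0))
print(cos_of_sums, cos_of_means)
```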


## 4.3 Gensim p2v.similarity

A sentence score built on this method, following the approach of an international article [John2016](#John2016).

In [109]:
p2v.wv.similarity('woman','man')

0.6345771081022482

In [115]:
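def sent_sim_jonh2016(sent1, sent2, model):
    """Mean of the pairwise word similarities above ALPHA.

    :type sent1,sent2: list of strings
    """
    sim_vector = []
    ALPHA = 0.25  # pairs at or below this similarity are ignored

    for wordA in sent1:
        for wordB in sent2:
            try:
                sim = model.wv.similarity(wordA,wordB)
                if sim > ALPHA:
                    sim_vector.append(sim)
            except KeyError:
                # Skip pairs with an out-of-vocabulary word
                pass

    return sum(sim_vector)/(len(sim_vector) or 1)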
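
The thresholding step can be isolated in a small sketch. `thresholded_mean` is a hypothetical helper and `pair_sims` holds made-up similarity values; only the threshold of 0.25 comes from the function above.

```python
ALPHA = 0.25  # same threshold as in the function above

def thresholded_mean(similarities, alpha=ALPHA):
    """Average only the similarities strictly greater than alpha."""
    kept = [s for s in similarities if s > alpha]
    return sum(kept) / len(kept) if kept else 0.0

# Made-up pairwise word similarities for two short sentences
pair_sims = [0.9, 0.1, 0.25, 0.5, 0.3, -0.2]
score = thresholded_mean(pair_sims)  # averages 0.9, 0.5 and 0.3
print(score)
```

Dropping weak pairs keeps unrelated word pairs from diluting the score, but the result then depends only on the pairs that pass the threshold, not on how many words were compared.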

In [146]:
print('Similar sentences p2v.similarity', sent_sim_jonh2016(sent1,sent2, p2v))
print('Similar sentences p2v.similarity without stopwords', sent_sim_jonh2016(sent1sl,sent2sl, p2v))

Similar sentences p2v.similarity 0.5202314680642451
Similar sentences p2v.similarity without stopwords 0.6844719901559259


## 4.4 Gensim Hellinger sentence similarity

In [117]:
from gensim.matutils import kullback_leibler, hellinger
print(hellinger(p2v_sent1,p2v_sent2))
print(kullback_leibler(p2v_sent1,p2v_sent2))

nan
inf
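
A plausible reason for the `nan` is that the Hellinger distance is defined for probability distributions: it takes the square root of every component, and the summed word vectors contain negative values. The sketch below uses the standard dense-vector formula, assumed here to match what Gensim computes for dense input, on made-up vectors.

```python
import numpy as np

def hellinger_dense(vec1, vec2):
    """Hellinger distance formula for dense vectors."""
    return np.sqrt(0.5 * ((np.sqrt(vec1) - np.sqrt(vec2)) ** 2).sum())

# Embedding-like vectors with negative components (made-up values)
emb1 = np.array([0.5, -0.3, 0.2])
emb2 = np.array([0.1, 0.4, -0.6])
with np.errstate(invalid="ignore"):  # silence the sqrt-of-negative warning
    nan_result = hellinger_dense(emb1, emb2)
print(nan_result)

# Probability distributions: non-negative components that sum to 1
dist1 = np.array([0.7, 0.2, 0.1])
dist2 = np.array([0.1, 0.2, 0.7])
valid_result = hellinger_dense(dist1, dist2)
print(valid_result)
```

The first call returns `nan` because of the negative components, while the second returns a finite distance between 0 and 1.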




## 4.5 The Case of the Inferred Vector of a Sentence

Gensim's Doc2Vec (paragraph-to-vector) class provides the `infer_vector` method, which is not present in other models such as Word2Vec.

In [69]:
vec_sent1_infer_p2v = p2v.infer_vector(sent1)
vec_sent2_infer_p2v = p2v.infer_vector(sent2)

#Stopword filtering
vec_sent1s_infer_p2v = p2v.infer_vector(sent1sl)
vec_sent2s_infer_p2v = p2v.infer_vector(sent2sl)

# Show the size and first values of the inferred paragraph vector of sentence 1
print(len(vec_sent1_infer_p2v))
vec_sent1_infer_p2v[:10]

300


array([ 0.00455156, -0.0286791 , -0.01700589,  0.04639047, -0.00288303,
       -0.04456371,  0.01654539, -0.02167564, -0.00419786,  0.0179023 ],
      dtype=float32)

### Applying Similarity on Inferred Vectors

In [34]:
from sklearn.metrics.pairwise import cosine_similarity
print('p2v inferred vector',
      cosine_similarity(vec_sent1_infer_p2v.reshape(1,-1),vec_sent2_infer_p2v.reshape(1,-1))[0][0])
print('p2v inferred vector without stopwords',
      cosine_similarity(vec_sent1s_infer_p2v.reshape(1,-1),vec_sent2s_infer_p2v.reshape(1,-1))[0][0])

p2v inferred vector 0.19902407
p2v inferred vector without stopwords 0.60062534


# Conclusions

Like Word2Vec, this model does not work with a bag-of-words structure; it represents each word as a vector whose length is set by the *size parameter*. In addition, this model can infer a vector for a whole sentence. The experiments show that, with the same corpus and the same sentences, paragraph2vec needs all tokens to be lowercased (e.g. 'Alice' or 'Here'), which differs from the behavior of the Word2Vec model.

- The best similarity scores with this text representation model come from custom measures built on top of it, for example ``sent_sim_jonh2016`` and ``harmonic_best_pair_word_sim``.
- In almost all cases in these experiments, stopword filtering increased the similarity between similar sentences and decreased it between dissimilar sentences.
- Gensim's Hellinger and Kullback-Leibler measures remain unusable with these vectors: they returned `nan` and `inf`.


# Recommendations

* Repeat the same example with Wikipedia dump data to test how the similarity changes with the training data.

<a id='references'></a>
# References

<a id='Lee2014'></a>
*[Le and Mikolov, 2014]* Quoc Le and Tomas Mikolov.
**Distributed Representations of Sentences and Documents**. 2014. 
Proceedings of the 31st International Conference on Machine Learning, Beijing, China.

<a id='John2016'></a>
*[John2016]* John, Adebayo Kolawole and Caro, Luigi Di and Boella, Guido. 
**NORMAS at SemEval-2016 Task 1: SEMSIM: A Multi-Feature Approach to Semantic Text Similarity**. 
Publisher ACM, 2016.