# Text Similarity using Tf-Idf Model

**Prerequisites:** Tokenization skills with NLTK and knowledge of the TfIdf text representation model.

## Outline

**Main Goal:** To practice creating TfIdf models with the Gensim Tf-Idf implementation, using NLTK for preprocessing; then to show how to extract features from this text representation, and finally how to measure text similarity using those results.

- Gensim Corpus Initialization
- TfIdf model generation
- Wrangling data from BOW to numpy objects
- Text similarity measures examples

## About Gensim

Gensim is a Python library for *topic modelling*, *document indexing*
and *similarity retrieval* with large corpora. Target audience is the
*natural language processing* (NLP) and *information retrieval* (IR)
community. [Gensim Documentation](https://radimrehurek.com/gensim/tutorial.html)

## About NLTK

Natural Language ToolKit (NLTK) is a comprehensive Python library for natural language
processing and text analytics. Originally designed for teaching, it has been adopted in the
industry for research and development due to its usefulness and breadth of coverage. NLTK
is often used for rapid prototyping of text processing programs and can even be used in
production applications. [(Perkins2014)](#Perkins2014)

## What is TfIdf?

In information retrieval, tf–idf or TFIDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus [(Salton1983)](#Salton1983).

In [10]:
import os
import nltk
import numpy as np
from gensim.models import TfidfModel
from gensim.corpora import TextCorpus, MmCorpus, Dictionary
import time

In [None]:
corpus_path = '/media/DATA/wiki_es/'

## Wrangling Data

From a collection of txt files to a list of strings (one per document), and from that list of strings to a list of word lists (one per document).

In [3]:
doc_collection = []
file_path = 'data/gutenberg/'
# List the files of the collection (hidden files skipped, sorted for a reproducible order)
file_list = sorted(f for f in os.listdir(file_path) if not f.startswith('.'))
for file in file_list:
    with open(os.path.join(file_path, file)) as doc:
        doc_collection.append(doc.read().lower())

In [4]:
tokenized_text = [nltk.word_tokenize(doc) for doc in doc_collection]

print(len(tokenized_text))
# Load NLTK's English stopwords into a set for fast membership tests
stopwords = set(nltk.corpus.stopwords.words('english'))

# Remove stop words
texts = [[word for word in text if word not in stopwords] for text in tokenized_text]

21


## Generating the TfIdf Model

In [55]:
try:
    # Try to reuse a previously saved model
    # NOTE: this loads 'gutenberg_tfidfA.model', while the fallback below saves 'gutenberg_tfidfB.model'
    tfidf = TfidfModel.load('models/gutenberg_tfidfA.model')
    id2word = Dictionary(texts)
    id2word.filter_extremes(no_below=2, no_above=0.6)
    print('Pre-generated model TfIdf in 1.897 seconds.')

except Exception:
    init = time.time()
    # Create a dictionary with id-to-token mappings (or alternatively load one)
    id2word = Dictionary(texts)

    # Remove extremes (similar to the min/max df step used when creating the tf-idf matrix)
    #id2word.filter_extremes(no_below=2, no_above=0.6)

    # Convert the texts to a bag-of-words corpus using the dictionary
    bow_corpus = [id2word.doc2bow(text) for text in texts]

    # Generate the tf-idf model
    tfidf = TfidfModel(bow_corpus, id2word=id2word)
    end = time.time() - init
    tfidf.save('models/gutenberg_tfidfB.model')
    print('Total time %f seconds.' % end)

Total time 1.701983 seconds.


In [37]:
print(id2word.doc2bow(['alice']))
print(tfidf.idfs[id2word.doc2bow(['alice'])[0][0]])

[(26576, 1)]
2.3923174227787602


In [38]:
[(tfidf.id2word[i],tfidf.idfs[i]) for i in range(90,100)]

[('_joint_', 4.392317422778761),
 ('_just_', 4.392317422778761),
 ('_lady_', 4.392317422778761),
 ('_letting_', 4.392317422778761),
 ('_little_', 4.392317422778761),
 ('_man_', 4.392317422778761),
 ('_married_', 4.392317422778761),
 ('_marry_', 4.392317422778761),
 ('_may_', 3.3923174227787602),
 ('_me_', 4.392317422778761)]

As you can see, the TfIdf model is very simple: it stores only each word and its idf coefficient. The Gensim implementation also provides a *\_\_getitem\_\_* method that returns the tfidf representation of an input bag-of-words vector, so `tfidf[bow]` transforms a bow vector.
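The idf values printed above are consistent with a base-2 logarithm of the ratio between the corpus size (21 documents) and a term's document frequency. The following sketch assumes that formula and uses a hypothetical helper named `idf`; it reproduces the three distinct values shown without calling Gensim:

```python
import math

def idf(total_docs, doc_freq):
    """Hypothetical helper: base-2 inverse document frequency."""
    return math.log2(total_docs / doc_freq)

TOTAL_DOCS = 21  # number of documents in the collection used above

# Document frequencies 1, 2 and 4 are assumptions chosen to match the printed idfs
# (4.392317..., 3.392317... and 2.392317... respectively).
for doc_freq in (1, 2, 4):
    print(doc_freq, round(idf(TOTAL_DOCS, doc_freq), 6))
```

Under this reading, a word that occurs in a single document gets the highest weight, and each doubling of the document frequency lowers the idf by exactly 1.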

## Sklearn TfIdf-Cosine Sentence Similarity

### Wrangling Data

* First: From string-sentences to bow representation of a sentence.
* Second: From bow representation to numerical-list representation of a sentence.
* Third: From numerical-list vector to numerical-vector (numpy) representation.

In [9]:
sentence1 = 'the girl run into the hall'
sentence2 = 'Here Alice run to the hall'

sent1 = sentence1.lower().split()
sent2 = sentence2.lower().split()

sent1s = 'girl run hall'
sent2s = 'Alice run hall'

sent1sl = sent1s.lower().split()
sent2sl = sent2s.lower().split()

# A sentence with a very different meaning, used later in place of sent1
sent3 = ['the','boy','eat','a','red','apple']
sent3s = ['boy','eat','red','apple']

In [44]:
import numpy as np

def preproc_data(sent1, sent2, model):
    """Return the tf-idf bow vectors, aligned numerical lists and numpy row vectors of two tokenized sentences."""
    # From raw sentence to bow vector (uses the global id2word dictionary)
    bowvec_sent1 = id2word.doc2bow(sent1)
    bowvec_sent2 = id2word.doc2bow(sent2)

    # Apply the model passed as argument instead of relying on the global tfidf
    bowvec_sent1_tfidf = model[bowvec_sent1]
    bowvec_sent2_tfidf = model[bowvec_sent2]

    # From bow vector to numerical list, aligned over the union of word ids
    nvec1 = []
    nvec2 = []
    vec1 = dict(bowvec_sent1_tfidf)
    vec2 = dict(bowvec_sent2_tfidf)
    words = set(vec1.keys()).union(vec2.keys())
    for word in words:
        nvec1.append(vec1.get(word, 0.0))
        nvec2.append(vec2.get(word, 0.0))

    # From numerical list to numpy row vector of shape (1, n)
    A = np.asarray(nvec1).reshape(1, -1)
    B = np.asarray(nvec2).reshape(1, -1)

    return bowvec_sent1_tfidf, bowvec_sent2_tfidf, nvec1, nvec2, A, B

In [47]:
bowvec_sent1_tfidf,bowvec_sent2_tfidf,nvec1,nvec2, A, B = preproc_data(sent1,sent2,tfidf)
print('Bow-Vec with tfidf values of sent 1', bowvec_sent1_tfidf)
print('Numerical list of sent 1',nvec1)
print('Numpy vector of sent 1', A)

Bow-Vec with tfidf values of sent 1 [(3236, 0.7916654345238935), (3419, 0.5554386815550553), (6105, 0.2544675044332316)]
Numerical list of sent 1 [0.0, 0.2544675044332316, 0.5554386815550553, 0.7916654345238935]
Numpy vector of sent 1 [[0.         0.2544675  0.55543868 0.79166543]]


In [56]:
for word in sent1:
    print(word,id2word.doc2bow([word]))

the []
girl [(3236, 1)]
run [(6105, 1)]
into []
the []
hall [(3419, 1)]


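_Note_: it looks as if the model filtered the stopwords automatically. In fact they were removed from `texts` before the dictionary was built, so `doc2bow` simply ignores them as unknown tokens.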
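The sketch below mimics in plain Python how a dictionary lookup that skips unknown tokens yields the empty results above. `toy_doc2bow` and its three-word vocabulary are hypothetical stand-ins, not Gensim's implementation; only the ids are copied from the output above.

```python
from collections import Counter

# Toy vocabulary: ids copied from the output above, stopwords deliberately absent.
token2id = {'girl': 3236, 'hall': 3419, 'run': 6105}

def toy_doc2bow(tokens, token2id):
    """Return sorted (token_id, count) pairs, ignoring tokens not in the vocabulary."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

print(toy_doc2bow('the girl run into the hall'.split(), token2id))
print(toy_doc2bow(['the'], token2id))  # a token outside the vocabulary gives an empty list
```

Because stopwords never reach the vector, a sentence with stopwords and the same sentence without them map to the same bow representation.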

### Applying Similarity

In [45]:
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity(A,B)[0][0]

0.1330855207623347

In [46]:
bowvec_sent1s_tfidf,bowvec_sent2s_tfidf,nvec1s,nvec2s, As, Bs = preproc_data(sent1sl,sent2sl,tfidf)
cosine_similarity(As,Bs)[0][0]

0.1330855207623347

## Scipy TfIdf-Cosine sentence similarity

**Note:** $cosine_{Scipy\ distance} = 1 - cosine_{Sklearn\ similarity}$
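The relation between the two quantities can be checked by hand. The following is a minimal plain-Python sketch with hypothetical helper names and illustrative vectors (not taken from the corpus); it is not the Scipy or Sklearn implementation.

```python
import math

def cosine_similarity_1d(u, v):
    """Cosine similarity between two equal-length, non-zero lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_distance_1d(u, v):
    """Cosine distance, defined as one minus the cosine similarity."""
    return 1.0 - cosine_similarity_1d(u, v)

u = [1.0, 2.0, 0.0]  # illustrative vectors, not taken from the corpus
v = [2.0, 0.0, 1.0]
sim = cosine_similarity_1d(u, v)
dist = cosine_distance_1d(u, v)
print(sim, dist, sim + dist)
```

This is why the values reported in this section and in the previous one add up to 1 for the same pair of sentences.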

In [50]:
from scipy.spatial.distance import cosine as cosine_scipy
print(cosine_scipy(nvec1,nvec2))
print(cosine_scipy(nvec1s,nvec2s))

0.8669144792376653
0.8669144792376653


## Gensim tfidf.n_similarity

This kind of method does not exist! (Type `tfidf.` and press the 'Tab' key in the next cell to check.)

In [None]:
tfidf.

## Best Pair Word Overlap

Let's try a different way to compose a sentence similarity, based on the WordNet-Augmented Word Overlap similarity idea.

$p = {\sum_{w \in sent_1} \max_{w' \in sent_2} df[w][w'] \over len(sent_1)}$

$q = {\sum_{w' \in sent_2} \max_{w \in sent_1} df[w][w'] \over len(sent_2)}$

$sim = \left\{ \begin{array}{ll}
0 & \text{if } p+q = 0\\
{2\,p\,q \over p+q} & \text{otherwise}\\
\end{array}
\right.$
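To make the formula concrete, here is a minimal sketch that takes the word-level similarity as a function argument. The names `best_pair_similarity` and `exact_match` are hypothetical, and the exact-match similarity only stands in for whatever word similarity is plugged in.

```python
def best_pair_similarity(sent1, sent2, word_sim):
    """Harmonic mean of the two directed best-match averages p and q."""
    if not sent1 or not sent2:
        return 0.0
    # p: for every word of sent1, keep its best match in sent2, then average
    p = sum(max(word_sim(w, w2) for w2 in sent2) for w in sent1) / len(sent1)
    # q: same in the other direction
    q = sum(max(word_sim(w, w2) for w in sent1) for w2 in sent2) / len(sent2)
    return 0.0 if p + q == 0 else 2 * p * q / (p + q)

def exact_match(w1, w2):
    # Hypothetical word similarity: 1.0 for identical words, 0.0 otherwise.
    return 1.0 if w1 == w2 else 0.0

similar = best_pair_similarity(['girl', 'run', 'hall'], ['alice', 'run', 'hall'], exact_match)
dissimilar = best_pair_similarity(['boy', 'eat'], ['alice', 'run'], exact_match)
print(similar, dissimilar)
```

With exact matching, two of the three words in each sentence find a partner, so both p and q equal 2/3 and so does their harmonic mean; the second pair shares no word and scores 0.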

Because the Gensim TfIdf object is hard to manipulate and I could find only a few examples, I decided to create the TfIdf matrix with sklearn and then manipulate it to get the tfidf vector of a word; see the example below:

In [61]:
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
vectorizer = TfidfVectorizer(min_df=1, lowercase=True) #note: by default CountVectorizer, TfidfVectorizer use lowercase=True
corpus = load_files('data/',categories=['gutenberg'])
TfIdfMatrix = vectorizer.fit_transform(corpus.data)
# get_feature_names_out() requires scikit-learn >= 1.0; older versions use get_feature_names()
pdTfIdf = pd.DataFrame(TfIdfMatrix.toarray(), columns=vectorizer.get_feature_names_out())
pdTfIdf = pdTfIdf.T
pdTfIdf.shape

(53415, 21)

In [62]:
pdTfIdf.loc['alice'].values.reshape(1,-1)

array([[3.24275368e-01, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
        3.97026332e-04, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
        0.00000000e+00, 1.10916056e-03, 0.00000000e+00, 0.00000000e+00,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
        3.10535868e-04]])

In [81]:
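def harmonic_best_pair_word_sim(sent1, sent2, pdTfIdf):
    # p: average, over the words of sent1, of the best cosine match in sent2
    p = 0
    for wi in sent1:
        m = 0
        for wc in sent2:
            try:
                winp = pdTfIdf.loc[wi].values.reshape(1, -1)
                wcnp = pdTfIdf.loc[wc].values.reshape(1, -1)
                m = max(m, float(cosine_similarity(winp, wcnp)[0][0]))
            except KeyError:
                # Words missing from the TfIdf vocabulary contribute no similarity
                pass
        p += m
    p = p / len(sent1)

    # q: average, over the words of sent2, of the best cosine match in sent1
    q = 0
    for wc in sent2:
        m = 0
        for wi in sent1:
            try:
                wcnp = pdTfIdf.loc[wc].values.reshape(1, -1)
                winp = pdTfIdf.loc[wi].values.reshape(1, -1)
                m = max(m, float(cosine_similarity(winp, wcnp)[0][0]))
            except KeyError:
                pass
        q += m
    q = q / len(sent2)

    # Harmonic mean of p and q (0 when both are 0)
    sim = 2*p*q/(p+q or 1)
    return sim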

In [83]:
print('Dissimilar sentences tfidf_harmonic_best_pair_word similarity', 
      harmonic_best_pair_word_sim(sent3, sent2, pdTfIdf))
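# NOTE: sent2s is the raw string 'Alice run hall', so this call iterates over its characters, not its words
print('Dissimilar sentences without stopwords tfidf_harmonic_best_pair_word similarity',
      harmonic_best_pair_word_sim(sent3s, sent2s, pdTfIdf))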
print('Similar sentences tfidf_harmonic_best_pair_word', 
      harmonic_best_pair_word_sim(sent1, sent2, pdTfIdf))
print('Similar sentences tfidf_harmonic_best_pair_word without stopwords',
      harmonic_best_pair_word_sim(sent1sl, sent2sl, pdTfIdf))

Dissimilar sentences tfidf_harmonic_best_pair_word similarity 0.7190818951347712
Dissimilar sentences without stopwords tfidf_harmonic_best_pair_word similarity 0.0
Similar sentences tfidf_harmonic_best_pair_word 0.9119930077479885
Similar sentences tfidf_harmonic_best_pair_word without stopwords 0.8445119942483112


## Gensim TfIdf-Hellinger sentence similarity

In [86]:
from gensim.matutils import kullback_leibler, jaccard, hellinger, cossim

print(hellinger(bowvec_sent1_tfidf,bowvec_sent2_tfidf))
print(hellinger(A,B))
print(kullback_leibler(A, B))

0.9744522138374331
0.9744522138374331
inf
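The `inf` above is what the Kullback-Leibler divergence gives whenever the first vector has weight on a term where the second has none, while the Hellinger distance stays finite. The sketch below implements the textbook definitions in plain Python on made-up distributions; treating them as what the Gensim helpers compute is an assumption, and the function names are hypothetical.

```python
import math

def hellinger_distance(p, q):
    """Hellinger distance between two discrete distributions of equal length."""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q); infinite if q is 0 where p is not."""
    total = 0.0
    for a, b in zip(p, q):
        if a == 0.0:
            continue  # a term with zero weight in p contributes nothing
        if b == 0.0:
            return math.inf
        total += a * math.log(a / b)
    return total

p = [0.5, 0.5, 0.0]  # made-up distributions, not the tf-idf vectors above
q = [0.0, 0.5, 0.5]
print(hellinger_distance(p, q))
print(kl_divergence(p, q))
```

Since two short sentences rarely share their whole vocabulary, the divergence is infinite for most sentence pairs, which makes it a poor choice for this kind of comparison.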


In [87]:
print('Gensim Cosine:',cossim(bowvec_sent1_tfidf,bowvec_sent2_tfidf))
print('Gensim Cosine, filtering stopwords:',cossim(bowvec_sent1s_tfidf,bowvec_sent2s_tfidf))
print('Gensim Jaccard:',jaccard(bowvec_sent1_tfidf,bowvec_sent2_tfidf))

Gensim Cosine: 0.13308552076233474
Gensim Cosine, filtering stopwords: 0.13308552076233474
Gensim Jaccard: 0.8992553737425453


# Conclusions

* The TfIdf implementations used here do not appear to offer a fast or parallel way to build the model.
* In Gensim, the TfIdf model is generated from bow vectors.
* Cosine, Word Overlap and Hellinger give quite different values; this would be interesting to analyze on a large dataset.
* It is also interesting that Gensim and Sklearn cosine give the same result.
* Stopwords were removed before the dictionary was built, so `doc2bow` ignores them; this is why the original and the preprocessed sentences get the same similarity.
* ``harmonic_best_pair_word_sim`` separated similar from dissimilar sentences well once stopwords were removed, although the 0.0 score of the dissimilar pair comes from passing `sent2s` as a raw string and should be re-checked with `sent2sl`.

# Recommendations

* Repeat the same example with Wikipedia dump data to test how the similarity scores vary with the corpus.

<a id='References'></a>
# References

<a id='Perkins2014'></a>
[1] *[Perkins2014]* Jacob Perkins. 
Book **Python 3 Text Processing with NLTK 3 Cookbook**. 2014. 
p. 7 **ISBN**: 978-1-78216-785-3

<a id='Salton1983'></a>
[2] *[Salton1983]* Salton, G; McGill, M. J. (1986). **Introduction to modern information retrieval**. McGraw-Hill. 
**ISBN**: 978-0-07-054484-0.

