# Text Similarity using Tf-Idf Model

**Prerequisites:** Tokenization with NLTK and familiarity with the TfIdf text representation model.

## Outline

**Main Goal:** To practice building TfIdf models with the Gensim implementation on text preprocessed with NLTK, then to extract features from this text representation, and finally to measure text similarity using those features.

- Gensim Corpus Initialization
- TfIdf model generation
- Wrangling data from BOW to numpy objects
- Text similarity measures examples

## What is TfIdf?

In information retrieval, tf–idf or TFIDF, short for term frequency–inverse document frequency, is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus [(Salton1983)](#Salton1983).
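
As a minimal sketch of the idea, the snippet below computes one common tf-idf variant (raw term count times log2(N/df)) on a toy corpus. The corpus, the function name `tf_idf`, and the choice of weighting variant are assumptions made for illustration; libraries differ in smoothing and normalization.

```python
import math
from collections import Counter

# Hypothetical toy corpus: each document is already tokenized.
docs = [
    ["alice", "run", "hall"],
    ["girl", "run", "hall", "hall"],
    ["alice", "rabbit"],
]

# Document frequency: in how many documents each term appears.
doc_freq = Counter(term for doc in docs for term in set(doc))

def tf_idf(term, doc):
    """Raw term count in `doc` times log2(N / df) over the toy corpus."""
    tf = doc.count(term)
    idf = math.log2(len(docs) / doc_freq[term])
    return tf * idf

print(tf_idf("hall", docs[1]))    # repeated in the document, but common in the corpus
print(tf_idf("rabbit", docs[2]))  # appears once, but in a single document only
```

A term that is rare across the collection can outweigh a term that is merely repeated inside one document, which is the behavior the statistic is designed for.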

In [1]:
import os
import nltk
from gensim.models import TfidfModel
from gensim.corpora import TextCorpus, MmCorpus, Dictionary
import time

## Wrangling Data

First read the txt collection into a list of strings (one per document), then turn each string into a list of word tokens.

In [55]:
doc_collection = []
file_path = 'gutenberg/'
# os.listdir avoids shelling out to `ls`; sorting keeps the document order stable
file_list = sorted(os.listdir(file_path))
for file in file_list:
    with open(os.path.join(file_path,file)) as doc:
        doc_collection.append(doc.read().lower())

In [56]:
# tokenize each document into a list of word tokens
tokenized_text = [nltk.word_tokenize(doc) for doc in doc_collection]

print(len(tokenized_text))
# load NLTK's English stopwords into a set for fast membership tests
stopwords = set(nltk.corpus.stopwords.words('english'))

# remove stop words
texts = [[word for word in text if word not in stopwords] for text in tokenized_text]

21


## Generating the TfIdf Model

In [57]:
init = time.time()
# Create dictionary with tid to token mappings (or alternatively load one)
id2word = Dictionary(texts)

#remove extremes (similar to the min/max df step used when creating the tf-idf matrix)
id2word.filter_extremes(no_below=2, no_above=0.6)

#convert each tokenized text to a bag-of-words vector using the dictionary
bow_corpus = [id2word.doc2bow(text) for text in texts]

#generating the tf-idf model
tfidf = TfidfModel(bow_corpus,id2word=id2word)
end = time.time()-init
print('Total time %f seconds.' % end)

Total time 1.897382 seconds.
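
To make the `filter_extremes(no_below=2, no_above=0.6)` step more concrete, here is a rough pure-Python sketch of a document-frequency filter with the same two thresholds. The helper name `filter_by_doc_freq` and the toy texts are hypothetical, and the sketch ignores other things Gensim does in that call, such as capping the vocabulary size and reassigning ids.

```python
from collections import Counter

def filter_by_doc_freq(texts, no_below=2, no_above=0.6):
    """Keep tokens found in at least `no_below` documents and in at most
    a fraction `no_above` of all documents."""
    doc_freq = Counter(token for text in texts for token in set(text))
    max_docs = no_above * len(texts)
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}

# Hypothetical tokenized documents.
toy_texts = [
    ["alice", "run", "hall"],
    ["girl", "run", "hall"],
    ["alice", "run", "rabbit"],
    ["queen", "run", "garden"],
    ["king", "run", "garden"],
]
kept = filter_by_doc_freq(toy_texts)
print(sorted(kept))  # ['alice', 'garden', 'hall']
```

Here `run` is dropped because it appears in every document, and the tokens seen in a single document are dropped as too rare.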


In [66]:
print(id2word.doc2bow(['accounted']))
print(tfidf.idfs[id2word.doc2bow(['accounted'])[0][0]])

[(90, 1)]
1.070389327891398


In [58]:
[(tfidf.id2word[i],tfidf.idfs[i]) for i in range(90,100)]

[('accounted', 1.070389327891398),
 ('accounting', 2.3923174227787602),
 ('accounts', 1.3923174227787602),
 ('accrue', 2.807354922057604),
 ('accumulations', 2.807354922057604),
 ('accusation', 2.070389327891398),
 ('accuse', 1.3923174227787602),
 ('accused', 1.3923174227787602),
 ('accustomed', 1.3923174227787602),
 ('achieved', 2.070389327891398)]

As you can see, the TfIdf model is very simple: for each word it stores only the related idf weight. The Gensim implementation provides a method, *\_\_getitem\_\_*, that returns the tfidf representation of an input bag-of-words vector.
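
The weights listed above are consistent with an inverse document frequency of log2(N/df) over the 21 documents, which, as far as I know, is Gensim's default global weighting. The sketch below reproduces three of them; the document frequencies 10, 4 and 3 are inferred from the printed values, not read from the model.

```python
import math

NUM_DOCS = 21  # size of the collection, as printed earlier

def idf(doc_freq, num_docs=NUM_DOCS):
    """Inverse document frequency with a base-2 logarithm."""
    return math.log2(num_docs / doc_freq)

# Document frequencies below are inferred from the printed weights (assumption).
print(idf(10))  # compare with the weight shown for 'accounted'
print(idf(4))   # compare with the weight shown for 'accounting'
print(idf(3))   # compare with the weight shown for 'accrue'
```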

### Wrangling Data

* First: From string-sentences to bow representation of a sentence.
* Second: From bow representation to numerical-list representation of a sentence.
* Third: From numerical-list vector to numerical-vector (numpy) representation.

In [43]:
sentence1 = 'the girl run into the hall'
sentence2 = 'Here Alice run to the hall'

In [45]:
import numpy as np

def preproc_data(sentence1, sentence2, model):
    
    #from raw sent to bowvec sent
    #note: split() keeps the original casing, while the dictionary was built from lowercased text
    sent1 = sentence1.split()
    sent2 = sentence2.split()

    #use the model passed as argument (and its dictionary) instead of the globals
    bowvec_sent1 = model.id2word.doc2bow(sent1)
    bowvec_sent2 = model.id2word.doc2bow(sent2)

    bowvec_sent1_tfidf = model[bowvec_sent1]
    bowvec_sent2_tfidf = model[bowvec_sent2]
    
    #from bowvec to numerical list sent
    
    nvec1 = []
    nvec2 = []
    vec1 = dict(bowvec_sent1_tfidf)
    vec2 = dict(bowvec_sent2_tfidf)
    #union of the term ids present in either sentence
    term_ids = set(vec1.keys()).union(vec2.keys())
    for term_id in term_ids:
        nvec1.append(vec1.get(term_id,0.0))
        nvec2.append(vec2.get(term_id,0.0))
        
    #from numerical list sent to numpy vec
    nvec_sent1_tfidf = np.asarray(nvec1)
    nvec_sent2_tfidf = np.asarray(nvec2)
    A = nvec_sent1_tfidf.reshape(1,-1)
    B = nvec_sent2_tfidf.reshape(1,-1)
    
    return bowvec_sent1_tfidf,bowvec_sent2_tfidf,nvec1,nvec2, A, B

In [46]:
bowvec_sent1_tfidf,bowvec_sent2_tfidf,nvec1,nvec2, A, B = preproc_data(sentence1,sentence2,tfidf)
print('Bow-Vec with tfidf values of sent 1', bowvec_sent1_tfidf)
print('Numerical list of sent 1',nvec1)
print('Numpy vector of sent 1', A)

Bow-Vec with tfidf values of sent 1 [(3766, 0.6507150320816605), (3931, 0.7593220311718628)]
Numerical list of sent 1 [0.7593220311718628, 0.0, 0.6507150320816605]
Numpy vector of sent 1 [[0.75932203 0.         0.65071503]]
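
The middle step, aligning two sparse vectors on a shared set of term ids, can be sketched without Gensim. The helper name `align_sparse` is hypothetical, the first vector reuses the ids printed above with rounded weights, and the second vector is made up for illustration. Sorting the ids gives a reproducible column order, which a plain set does not guarantee.

```python
def align_sparse(vec1, vec2):
    """Turn two sparse [(term_id, weight), ...] lists into equal-length dense lists."""
    d1, d2 = dict(vec1), dict(vec2)
    term_ids = sorted(set(d1) | set(d2))  # fixed, reproducible column order
    dense1 = [d1.get(t, 0.0) for t in term_ids]
    dense2 = [d2.get(t, 0.0) for t in term_ids]
    return term_ids, dense1, dense2

sparse1 = [(3766, 0.65), (3931, 0.76)]  # ids from the output above, weights rounded
sparse2 = [(120, 0.90), (3931, 0.44)]   # made-up vector for illustration
term_ids, dense1, dense2 = align_sparse(sparse1, sparse2)
print(term_ids)  # [120, 3766, 3931]
print(dense1)    # [0.0, 0.65, 0.76]
print(dense2)    # [0.9, 0.0, 0.44]
```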


## Sklearn TfIdf-Cosine sentence similarity

In [16]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity(A,B)[0][0]

0.2428008742309789

In [21]:
sentence1s = 'girl run hall'
sentence2s = 'Alice run hall'
bowvec_sent1s_tfidf,bowvec_sent2s_tfidf,nvec1s,nvec2s, As, Bs = preproc_data(sentence1s,sentence2s,tfidf)
cosine_similarity(As,Bs)[0][0]

0.2428008742309789

## Scipy TfIdf-Cosine sentence similarity

In [22]:
from scipy.spatial.distance import cosine as cosine_scipy
print(cosine_scipy(nvec1,nvec2))
print(cosine_scipy(nvec1s,nvec2s))

0.757199125769
0.757199125769
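
Note that the SciPy values are not similarities: `scipy.spatial.distance.cosine` returns the cosine distance, defined as 1 minus the cosine similarity, so 0.757 here corresponds to the similarity 0.243 obtained with Sklearn. A minimal pure-Python sketch of that relationship, with made-up vectors and hypothetical helper names:

```python
import math

def cosine_similarity_py(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_distance_py(u, v):
    """Cosine distance, i.e. 1 minus the cosine similarity."""
    return 1.0 - cosine_similarity_py(u, v)

# Made-up vectors sharing one non-zero component.
u = [1.0, 0.0, 1.0]
v = [0.0, 1.0, 1.0]
print(cosine_similarity_py(u, v))  # close to 0.5
print(cosine_distance_py(u, v))    # close to 0.5 as well, since 1 - 0.5 = 0.5
```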


## Gensim tfidf.n_similarity

`TfidfModel` does not provide this kind of method.

In [None]:
# check whether the model exposes such a method
hasattr(tfidf, 'n_similarity')

## Best Pair Word Overlap

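Let's try a different way to compose a sentence similarity, based on the WordNet-Augmented Word Overlap idea. In the formulas, $df[w][w']$ stands for the similarity between the words $w$ and $w'$; in the code further down it is the cosine similarity between the TfIdf vectors of the two words.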

$p = \frac{1}{len(sent_1)} \sum_{w \in sent_1} \max_{w' \in sent_2} df[w][w']$

$q = \frac{1}{len(sent_2)} \sum_{w' \in sent_2} \max_{w \in sent_1} df[w][w']$

$sim = \left\{ \begin{array}{ll} 
0  & \text{if } p+q = 0\\
\frac{2pq}{p+q}  & \text{otherwise}\\
\end{array}
\right.$
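
The formulas above can be sketched independently of any TfIdf model by passing in a word-level similarity function. The names `best_pair_overlap`, `toy_word_sim` and `TOY_SIM`, as well as the similarity value 0.5, are hypothetical and only serve to make the arithmetic easy to follow.

```python
def best_pair_overlap(sent1, sent2, word_sim):
    """Harmonic mean of the two directed best-match averages p and q."""
    if not sent1 or not sent2:
        return 0.0
    # p: for each word of sent1, its best match in sent2, averaged over sent1
    p = sum(max(word_sim(w, w2) for w2 in sent2) for w in sent1) / len(sent1)
    # q: for each word of sent2, its best match in sent1, averaged over sent2
    q = sum(max(word_sim(w, w2) for w in sent1) for w2 in sent2) / len(sent2)
    return 0.0 if p + q == 0 else 2 * p * q / (p + q)

# Hypothetical similarity table; unlisted pairs score 0, identical words score 1.
TOY_SIM = {("girl", "alice"): 0.5}

def toy_word_sim(w1, w2):
    if w1 == w2:
        return 1.0
    return TOY_SIM.get((w1, w2), TOY_SIM.get((w2, w1), 0.0))

print(best_pair_overlap(["girl", "run", "hall"], ["alice", "run", "hall"], toy_word_sim))
```

With these toy values both p and q equal (0.5 + 1 + 1) / 3, so their harmonic mean is the same number, roughly 0.83.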

Because the Gensim TfIdf object is hard to manipulate, and I could find only a few examples, I decided to create the TfIdf matrix with sklearn and then manipulate it to get the tfidf vector of a word; see the example below:

In [23]:
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
vectorizer = TfidfVectorizer(min_df=1, lowercase=False) #note: by default CountVectorizer, TfidfVectorizer use lowercase=True
corpus = load_files('./',categories=['gutenberg'])
TfIdfMatrix = vectorizer.fit_transform(corpus.data)
# recent scikit-learn releases replaced get_feature_names() with get_feature_names_out()
pdTfIdf = pd.DataFrame(TfIdfMatrix.toarray(), columns=vectorizer.get_feature_names_out())
pdTfIdf = pdTfIdf.T
pdTfIdf.shape

(63341, 21)

In [24]:
pdTfIdf.loc['Alice'].values.reshape(1,-1)

array([[  3.39911051e-01,   0.00000000e+00,   0.00000000e+00,
          0.00000000e+00,   4.24280334e-04,   0.00000000e+00,
          0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
          1.15835370e-03,   0.00000000e+00,   0.00000000e+00,
          0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
          0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
          0.00000000e+00,   0.00000000e+00,   3.26450029e-04]])

In [28]:
sent1 = ['the','girl','run','into','the','hall']
sent2 = ['Here','Alice','run','to','the','hall']

def harmonic_best_pair_word_sim(sent1,sent2, pdTfIdf):
    # Mean, over the words of src, of the best cosine similarity against any word of dst.
    # Every word must be a row label of pdTfIdf, otherwise .loc raises a KeyError.
    def mean_best_match(src, dst):
        total = 0
        for ws in src:
            m = 0
            wsnp = pdTfIdf.loc[ws].values.reshape(1,-1)
            for wd in dst:
                wdnp = pdTfIdf.loc[wd].values.reshape(1,-1)
                m = max(m, cosine_similarity(wsnp,wdnp))
            total += m
        return total/len(src)

    p = mean_best_match(sent1, sent2)
    q = mean_best_match(sent2, sent1)

    # harmonic mean of p and q; the `or 1` guards the p+q == 0 case
    sim = 2*p*q/(p+q or 1)
    return sim

harmonic_best_pair_word_sim(sent1,sent2, pdTfIdf)[0][0]

0.9157856019931182

In [29]:
#With stopword filtering
sent1 = ['girl','run','hall']
sent2 = ['Alice','run','hall']
print(harmonic_best_pair_word_sim(sent1,sent2,pdTfIdf)[0][0])

0.889675805486


## Gensim TfIdf-Hellinger sentence similarity

In [30]:
from gensim.matutils import kullback_leibler, jaccard, hellinger, cossim

print(hellinger(bowvec_sent1_tfidf,bowvec_sent2_tfidf))
print(hellinger(A,B))
print(kullback_leibler(A, B))

0.610041216037
0.919727979992
inf
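
Hellinger distance and Kullback-Leibler divergence are defined for probability distributions, whereas the TfIdf vectors used here are not normalized to sum to 1, so the values above should be interpreted with care. The sketch below uses the textbook definitions (it is not Gensim's code) on made-up distributions, and shows one plausible reason for the `inf`: the KL divergence is infinite whenever the second vector has a zero where the first does not.

```python
import math

def hellinger_py(p, q):
    """Textbook Hellinger distance between two discrete distributions."""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))

def kl_divergence_py(p, q):
    """Textbook KL divergence D(p || q) with natural logarithm."""
    total = 0.0
    for a, b in zip(p, q):
        if a == 0:
            continue         # terms with p_i = 0 contribute nothing
        if b == 0:
            return math.inf  # p_i > 0 but q_i = 0: divergence is infinite
        total += a * math.log(a / b)
    return total

# Made-up distributions for illustration.
print(hellinger_py([0.5, 0.5], [0.5, 0.5]))      # identical distributions
print(hellinger_py([1.0, 0.0], [0.0, 1.0]))      # disjoint support
print(kl_divergence_py([0.5, 0.5], [1.0, 0.0]))  # zero in q where p is positive
```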


In [31]:
print('Gensim Cosine:',cossim(bowvec_sent1_tfidf,bowvec_sent2_tfidf))
print('Gensim Cosine, filtering stopwords:',cossim(bowvec_sent1s_tfidf,bowvec_sent2s_tfidf))
print('Gensim Jaccard:',jaccard(bowvec_sent1_tfidf,bowvec_sent2_tfidf))

Gensim Cosine: 0.2428008742309789
Gensim Cosine, filtering stopwords: 0.2428008742309789
Gensim Jaccard: 0.880566019484764



# Conclusions

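* As you can verify from the timing above, the TfIdf generation used here is neither fast nor parallel.
* In Gensim, the TfIdf model is generated from bag-of-words vectors (bowvecs).
* Cosine, Word Overlap and Hellinger give quite different values; this could be interesting to analyze on a large dataset.
* The Gensim and Sklearn cosine similarities agree (0.2428); the SciPy value (0.7572) differs because `scipy.spatial.distance.cosine` returns a distance, that is, 1 minus the similarity.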

# Recommendations

* Repeat the same example with Wikipedia dump data, to test how the similarity varies with the data.

<a id='References'></a>
# References


[1] *[Salton1983]* Salton, G; McGill, M. J. (1986). Introduction to modern information retrieval. McGraw-Hill. 
ISBN 978-0-07-054484-0.
<a id='Salton1983'></a>