# Testing TfIdf Model with MSRPC

**Prerequisites:** Work through the previous notebooks and generate the TfIdf model.

## Outline

**Main Goal:** To practice applying Word Embedding models to a real corpus, the MSRPC, which was built for the Paraphrase Recognition task.

**Index:**
- Load the TfIdf model generated from the Wikipedia dump.
- Load the MSRPC.
- Calculate the similarity distances with tfidf.model for every pair of sentences in the MSRPC.
- Train a simple machine learning model to recognize paraphrases using the distance vectors.
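
The last two steps of the index can be sketched as follows. This is only a minimal illustration under stated assumptions: it swaps in a plain token-level Jaccard distance instead of the TfIdf distances, uses a single distance threshold as the "simple model", and runs on hypothetical toy pairs rather than MSRPC rows. The names `token_jaccard_distance` and `best_threshold` are made up for this example.

```python
def token_jaccard_distance(s1, s2):
    """Jaccard distance between the lowercase token sets of two sentences."""
    a, b = set(s1.lower().split()), set(s2.lower().split())
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def best_threshold(distances, labels):
    """Pick the cut-off that maximises training accuracy.

    A pair is predicted to be a paraphrase (1) when its distance <= threshold.
    """
    best_t, best_acc = 0.0, -1.0
    for t in sorted(set(distances)):
        preds = [1 if d <= t else 0 for d in distances]
        acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc


# Hypothetical toy pairs (sentence 1, sentence 2, is_paraphrase); NOT MSRPC data.
pairs = [
    ("the cat sat on the mat", "the cat sat on a mat", 1),
    ("he bought a red car", "he bought a red car yesterday", 1),
    ("stocks fell sharply today", "the weather is nice", 0),
    ("she plays the piano", "they went to the market", 0),
]

distances = [token_jaccard_distance(a, b) for a, b, _ in pairs]
labels = [y for _, _, y in pairs]
threshold, accuracy = best_threshold(distances, labels)
print(threshold, accuracy)
```

With the real corpus, the `#1 String` and `#2 String` columns would supply the sentence pairs and `Quality` the labels, and any of the distances explored below could replace the token-level Jaccard.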

In [25]:
from gensim.models import TfidfModel
from gensim.corpora import TextCorpus, MmCorpus, Dictionary
import numpy as np
import time
from pandas import DataFrame, Series, read_table, read_csv

In [2]:
#Loading tfidf Wiki model
corpus_path = '/media/DATA/wiki_es/'
tfidf = TfidfModel.load(corpus_path+'wiki-tfidf.model')

In [348]:
data = read_csv('data/msrpc.csv',sep='\t',header=0)
data.head()

| Unnamed: 0 | Quality | #1 ID | #2 ID | #1 String | #2 String |
|---|---|---|---|---|---|
| 0 | 1 | 702876 | 702977 | Amrozi accused his brother, whom he called "th... | Referring to him as only "the witness", Amrozi... |
| 1 | 0 | 2108705 | 2108831 | Yucaipa owned Dominick's before selling the ch... | Yucaipa bought Dominick's in 1995 for $693 mil... |
| 2 | 1 | 1330381 | 1330521 | They had published an advertisement on the Int... | On June 10, the ship's owners had published an... |
| 3 | 0 | 3344667 | 3344648 | Around 0335 GMT, Tab shares were up 19 cents, ... | Tab shares jumped 20 cents, or 4.6%, to set a ... |
| 4 | 1 | 1236820 | 1236712 | The stock rose $2.11, or about 11 percent, to ... | PG&E Corp. shares jumped $1.63 or 8 percent to... |


In [345]:
len(cs)

5801

In [None]:
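# Print the bag-of-words (BoW) vector of a sample sentence ("Yo como pescado" = "I eat fish").
# NOTE: `dictionary` is assumed to be the gensim Dictionary built alongside the TfIdf model
# in the previous notebooks; it is not loaded in this notebook.
sent = "Yo como pescado"
vec_sent = dictionary.doc2bow(sent.lower().split())
print(vec_sent)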

In [None]:
# Print the TfIdf vector of the same sentence
vec_sent_tfidf = tfidf[vec_sent]
print(vec_sent_tfidf)

## Gensim TfIdf-Hellinger sentence similarity

In [None]:
sent1 = 'the girl run into the hall'
sent2 = 'Here Alice run to the hall'

sentence1 = sent1.split()
sentence2 = sent2.split()

vec_sent1 = dictionary.doc2bow(sentence1)
vec_sent2 = dictionary.doc2bow(sentence2)

vec_sent1_tfidf = tfidf[vec_sent1]
vec_sent2_tfidf = tfidf[vec_sent2]
print(vec_sent1_tfidf)
print(vec_sent2_tfidf)

Testing similarity with the Gensim equations.

In [None]:
from gensim.matutils import kullback_leibler, jaccard, hellinger, cossim

In [None]:
hellinger(vec_sent1_tfidf,vec_sent2_tfidf)

In [None]:
print('Gensim Cosine:',cossim(vec_sent1_tfidf,vec_sent2_tfidf))
print('Gensim Jaccard:',jaccard(vec_sent1_tfidf,vec_sent2_tfidf))

One problem with the Hellinger equation in Gensim (at least in the version used here) is that it iterates over only the longer of the two vectors, so in the example above the word 74333 (eat) never affects the result.
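
One way to avoid this is to iterate over the union of the word ids of both vectors, so that words present in only one sentence also contribute. The sketch below is a hypothetical helper (`hellinger_union` is not part of Gensim); it assumes non-negative weights in the `(word_id, weight)` format used above and does not renormalize them into probability distributions. The toy vectors are illustrative only.

```python
import math


def hellinger_union(bow1, bow2):
    """Hellinger-style distance over the union of word ids of two sparse vectors."""
    d1, d2 = dict(bow1), dict(bow2)
    ids = set(d1) | set(d2)
    # Every id contributes, with a weight of 0.0 where it is missing.
    total = sum(
        (math.sqrt(d1.get(i, 0.0)) - math.sqrt(d2.get(i, 0.0))) ** 2
        for i in ids
    )
    return math.sqrt(0.5 * total)


# Toy sparse vectors: id 0 appears only in `a`, id 2 only in `b`.
a = [(0, 0.5), (1, 0.5)]
b = [(1, 0.5), (2, 0.5)]
print(hellinger_union(a, b))
```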

## Scipy TfIdf-Cosine sentence similarity

Testing similarity with the Scipy equations. A normalized vector is shown that corrects the problem described above.

In [None]:
from scipy.spatial.distance import cosine
from scipy.spatial.distance import jaccard as jaccard_scipy

In [None]:
#cosine(vec_sent1_tfidf,vec_sent2_tfidf)

The above line results in an error because the vectors used are BoW vectors in the following format: list((word id, word tfidf)). They must first be transformed into 1D numerical vectors.

In [None]:
# Convert the sparse (word_id, tfidf) lists into dense vectors over the same word ids
vec1 = dict(vec_sent1_tfidf)
vec2 = dict(vec_sent2_tfidf)
nvec1, nvec2 = [], []
# Sort the union of word ids so both vectors share the same, reproducible order
words = sorted(set(vec1.keys()).union(vec2.keys()))
for word in words:
    nvec1.append(vec1.get(word, 0.0))
    nvec2.append(vec2.get(word, 0.0))
print(nvec1, '\n', nvec2)

In [None]:
print('Scipy Cosine:',cosine(nvec1,nvec2))
print('Scipy Jaccard:',jaccard_scipy(nvec1,nvec2))
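
Note that `scipy.spatial.distance.cosine` returns a cosine distance (1 minus the cosine similarity), whereas Gensim's `cossim` and Sklearn's `cosine_similarity` return the similarity itself. A minimal pure-Python sketch of that relationship, using hypothetical helper names and toy vectors rather than the notebook's data:

```python
import math


def cosine_similarity_dense(u, v):
    """Cosine similarity of two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_distance_dense(u, v):
    """Cosine distance, defined as 1 - cosine similarity."""
    return 1.0 - cosine_similarity_dense(u, v)


# Toy dense vectors (illustrative only).
u = [1.0, 0.0]
v = [1.0, 1.0]
print('similarity:', cosine_similarity_dense(u, v))
print('distance:  ', cosine_distance_dense(u, v))
```

The two values always add up to 1, so a similarity and a distance computed on the same pair should be compared only after converting one into the other.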

## Textsim TfIdf-Jaccard sentence similarity

Computing similarity with the textsim package.

In [None]:
import sys
sys.path.append('/home/abelm')

from textsim.tokendists import jaccard_distance
from textsim.tokendists import cosine_similarity_sklearn

In [None]:
print('Textsim Jaccard', jaccard_distance(sent1,sent2))
print('TfIdf Textsim Jaccard', jaccard_distance(nvec1,nvec2))
# Preprocessed sentences
print('Textsim Cosine Sklearn',cosine_similarity_sklearn('girl run hall','Alice eat hall'))

In [None]:
A = np.asarray(nvec1).reshape((1,-1))
B = np.asarray(nvec2).reshape((1,-1))

print('TfIdf Textsim Cosine Sklearn',cosine_similarity_sklearn(A,B))

## Sklearn TfIdf-Cosine sentence similarity

The last experiment uses the TfIdf vectors from Gensim.
Unfortunately, loading the Wikipedia dump to build a tf-idf index with Sklearn is too much for this computer.

In [None]:
import numpy as np
from textsim.tokendists import cosine_similarity_sklearn
from sklearn.metrics.pairwise import cosine_similarity

#Sklearn cosine for raw sentences implemented in textsim
cosine_similarity_sklearn(sent1,sent2)

In [None]:
A = np.asarray(nvec1).reshape((1,-1))
B = np.asarray(nvec2).reshape((1,-1))
cosine_similarity(A,B)[0][0]

## Conclusions

* 0.659 input = bowvec, Hellinger, Gensim
* 0.267 input = bowvec, Cosine, Gensim
* 0.839 input = bowvec, Jaccard, Gensim
* 0.732 input = tfidf vec, Cosine, Scipy
* 1.000 input = tfidf vec, Jaccard, Scipy
* 0.777 input = str, Jaccard, Textsim, stopwords_filter=no
* 0.800 input = str, Jaccard, Textsim, stopwords_filter=yes
* 0.333 input = str, Cosine, Textsim-sklearn, stopwords_filter=yes
* 0.433 input = str, Cosine, Textsim-sklearn, stopwords_filter=no
* 0.267 input = tfidf vec, Cosine, Sklearn

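As you can see, the Gensim and Sklearn cosine give the same result (0.267). The Scipy cosine value (0.732) is consistent with them, because `scipy.spatial.distance.cosine` returns a distance, that is, 1 minus the cosine similarity. The two sentences have words in common and, in the context of "Alice's Adventures in Wonderland" by Lewis Carroll, have the same meaning. That book is part of the Gutenberg collection, but in the Wikipedia dump it appears only in articles of little importance.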