# Testing TfIdf Model with MSRPC

**Prerequisites:** Work through the previous notebooks and generate the TfIdf model.

## Outline

**Main Goal:** To practice applying word embedding models to a real corpus, the MSRPC (Microsoft Research Paraphrase Corpus), which was built for the Paraphrase Recognition task.

**Index:**
- Data Wrangling:
    - Load the TfIdf model generated from the Wikipedia dump.
    - Load the MSRPC.
    - Preprocessing (stopword removal, punctuation removal, etc.).
    - Convert sentence strings to tfidf numerical vectors.
- Feature Extraction:
    - Compute every similarity and distance measure on the tfidf vectors for each pair of sentences in the MSRPC.
- Classification:
    - Train a simple machine learning model to recognize paraphrases from the resulting feature vectors.

In [1]:
from gensim.models import TfidfModel
from gensim.corpora import TextCorpus, MmCorpus, Dictionary
import numpy as np
import time
from pandas import DataFrame, Series, read_table, read_csv
from six import iteritems

## Data Wrangling

Load the Wikipedia tfidf model and the MSRPC, then transform the MSRPC sentence strings into tfidf numerical vectors.

In [2]:
#Loading tfidf Wiki model and dictionary
corpus_path = '/media/DATA/wiki_es/'
tfidf = TfidfModel.load(corpus_path+'wiki-tfidf.model')
dictionary = Dictionary.load_from_text(corpus_path+'_wordids.txt.bz2')



In [3]:
#Loading the Paraphrase Recognition corpus of Microsoft
data = read_csv('data/msrpc.csv',sep='\t',header=0)
print('Data corpus length:', len(data))
data.head()

Data corpus length: 5801


Unnamed: 0,class,ID1,ID2,sentence1,sentence2
0,1,702876,702977,"Amrozi accused his brother, whom he called ""th...","Referring to him as only ""the witness"", Amrozi..."
1,0,2108705,2108831,Yucaipa owned Dominick's before selling the ch...,Yucaipa bought Dominick's in 1995 for $693 mil...
2,1,1330381,1330521,They had published an advertisement on the Int...,"On June 10, the ship's owners had published an..."
3,0,3344667,3344648,"Around 0335 GMT, Tab shares were up 19 cents, ...","Tab shares jumped 20 cents, or 4.6%, to set a ..."
4,1,1236820,1236712,"The stock rose $2.11, or about 11 percent, to ...",PG&E Corp. shares jumped $1.63 or 8 percent to...


In [4]:
#Preprocessing
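
The preprocessing cell above is empty, so the sentences are only lowercased and split on whitespace, which leaves punctuation attached to tokens (for example `brother,`). A minimal sketch of the preprocessing named in the outline could look like the following. The names `STOPWORDS` and `preprocess` are hypothetical, and the stopword list is a tiny illustrative sample, not a real stopword list.

```python
import string

# Hypothetical, deliberately tiny stopword list (illustration only).
STOPWORDS = {"a", "an", "and", "as", "he", "his", "of", "the", "to", "whom"}

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def preprocess(sentence, stopwords=STOPWORDS):
    """Lowercase, strip punctuation, split on whitespace, drop stopwords."""
    cleaned = sentence.lower().translate(_PUNCT_TABLE)
    return [tok for tok in cleaned.split() if tok not in stopwords]

tokens = preprocess('Amrozi accused his brother, whom he called "the witness".')
print(tokens)
```

The resulting token list could be passed to `dictionary.doc2bow` in place of `sentence.lower().split()`.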

In [5]:
#From str to bow vector
def str_to_bow_tfidfVector(sentence,dictionary=dictionary,tfidf=tfidf):
    vec_sent = dictionary.doc2bow(sentence.lower().split()) #sent bow vector
    bow_sent_tfidf = tfidf[vec_sent]                        #sent tfidf vector
    return bow_sent_tfidf

In [6]:
print(data.sentence1[0])
print(str_to_bow_tfidfVector(data.sentence1[0]))

Amrozi accused his brother, whom he called "the witness", of deliberately distorting his evidence.
[(8690, 0.3793547108014584), (19848, 0.49603841465983006), (23059, 0.2501920888751192), (23212, 0.46686708212150263), (41338, 0.5739990774122925)]
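
Each pair in the output above is `(word id, weight)`, and the squared weights sum to approximately 1. As a rough illustration of where such weights come from, the sketch below computes tf-idf by hand on a toy corpus. It assumes the weighting usually described as the Gensim default (raw term frequency times log2(N / df), followed by L2 normalization); the toy documents and the `tfidf_vector` helper are hypothetical and are not part of the notebook's pipeline.

```python
import math
from collections import Counter

# Hypothetical toy corpus: each document is a list of tokens.
docs = [
    ["amrozi", "accused", "his", "brother"],
    ["his", "brother", "was", "the", "witness"],
    ["the", "witness", "spoke"],
]

# Document frequency: number of documents that contain each token.
doc_freq = Counter(tok for doc in docs for tok in set(doc))
n_docs = len(docs)

def tfidf_vector(tokens):
    """Return {token: weight} using tf * log2(N / df), scaled to unit L2 length."""
    # Tokens unknown to the corpus are ignored, as with a fixed dictionary.
    tf = Counter(tok for tok in tokens if tok in doc_freq)
    raw = {tok: count * math.log2(n_docs / doc_freq[tok]) for tok, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    if norm == 0.0:
        return {}
    # Tokens present in every document have weight 0 and are dropped.
    return {tok: w / norm for tok, w in raw.items() if w > 0}

vec = tfidf_vector(["amrozi", "accused", "his", "brother"])
print(vec)
```

Rare tokens such as `amrozi` receive a larger weight than tokens that occur in several documents, which is the effect tf-idf is meant to produce.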


Bow tfidf vectors are generated because the Gensim package implements some similarity functions that work with this kind of structure.

However, we need numpy arrays to work with the sklearn or scipy similarity functions. These functions also require both vectors to have the same length, so every word that appears in only one sentence is assigned a weight of 0.0 in the other sentence's vector.

In [7]:
#From bow vector to numpy vector
def bowtfidf_to_numpy(bow_tfidfVector1, bow_tfidfVector2):
    #vec1 holds the first sentence and vec2 the second (they were swapped before)
    vec1 = dict(bow_tfidfVector1)
    vec2 = dict(bow_tfidfVector2)
    nvec1,nvec2 = [],[]
    #Union of the word ids found in either sentence
    words = set(vec2.keys()).union(vec1.keys())
    for word in words:
        #Words missing from a sentence get weight 0.0
        nvec1.append(vec1.get(word,0.0))
        nvec2.append(vec2.get(word,0.0))
    
    A = np.asarray(nvec1).reshape((1,-1))
    B = np.asarray(nvec2).reshape((1,-1))
    
    return A,B

In [8]:
bow_tfidfVector1 = str_to_bow_tfidfVector(data.sentence1[0])
bow_tfidfVector2 = str_to_bow_tfidfVector(data.sentence2[0])
np_tfidfVec1, np_tfidfVec2 = bowtfidf_to_numpy(bow_tfidfVector1,bow_tfidfVector2)
print(np_tfidfVec1,'\n' ,np_tfidfVec2)

[[0.37935471 0.25019209 0.         0.         0.         0.49603841
  0.57399908 0.46686708 0.         0.        ]] 
 [[0.         0.         0.37133383 0.21991717 0.30226133 0.4985811
  0.57694139 0.         0.34537114 0.1471734 ]]


## Feature Extraction

In [9]:
from gensim.matutils import hellinger
from gensim.matutils import cossim as gcosine
from gensim.matutils import jaccard as gjaccard
#from scipy.spatial.distance import cosine as scosine
from textsim.tokendists import cosine_distance_scipy as scosine
from textsim.tokendists import cosine_similarity_sklearn as kcosine

In [10]:
## Similarity and distance measures for the first sentence pair
print(data.sentence1[0],'\n',data.sentence2[0],'\n')
print('Hellinger distance: ',hellinger(bow_tfidfVector1,bow_tfidfVector2))
print('Gensim cosine similarity:',gcosine(bow_tfidfVector1,bow_tfidfVector2))
print('Gensim Jaccard distance:',gjaccard(bow_tfidfVector1,bow_tfidfVector2))
print('Scipy cosine distance: ', scosine(np_tfidfVec1,np_tfidfVec2))
print('Sklearn cosine similarity: ', kcosine(np_tfidfVec1,np_tfidfVec2))

Amrozi accused his brother, whom he called "the witness", of deliberately distorting his evidence. 
 Referring to him as only "the witness", Amrozi accused his brother of deliberately distorting his evidence. 

Hellinger distance:  1.1141090076668747
Gensim cosine similarity: 0.5784792046014192
Gensim Jaccard distance: 0.7687920514085964
Scipy cosine distance:  0.5784792046014192
Sklearn cosine similarity:  0.5784792046014192
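
To make these numbers easier to interpret, the sketch below recomputes three of the measures in plain Python from the two dense arrays printed by cell 8. The formulas are assumptions chosen because they reproduce the printed values to the displayed precision; they are not copied from the Gensim source. Two points follow from them: the Gensim `cossim` value is a similarity (higher means more alike), and the Hellinger value can exceed 1 here because the tfidf vectors are scaled to unit L2 length rather than summing to 1 like probability distributions.

```python
import math

# The two dense arrays printed by cell 8 (all measures below are symmetric).
u = [0.0, 0.0, 0.37133383, 0.21991717, 0.30226133,
     0.4985811, 0.57694139, 0.0, 0.34537114, 0.1471734]
v = [0.37935471, 0.25019209, 0.0, 0.0, 0.0,
     0.49603841, 0.57399908, 0.46686708, 0.0, 0.0]

def cosine_similarity(a, b):
    """Dot product divided by the product of the L2 norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def hellinger_distance(a, b):
    """sqrt(0.5 * sum((sqrt(a_i) - sqrt(b_i))^2)) over all positions."""
    return math.sqrt(0.5 * sum((math.sqrt(x) - math.sqrt(y)) ** 2
                               for x, y in zip(a, b)))

def jaccard_distance(a, b):
    """1 - sum(min(a_i, b_i)) / (sum(a) + sum(b)); assumed weighted variant."""
    shared = sum(min(x, y) for x, y in zip(a, b))
    return 1.0 - shared / (sum(a) + sum(b))

print(cosine_similarity(u, v))
print(hellinger_distance(u, v))
print(jaccard_distance(u, v))
```

Only two words are shared by the two vectors, so the cosine similarity is determined entirely by those two positions.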


### Extracting all features for every pair of sentences

The features are added to a new pandas DataFrame for the machine learning task.

In [64]:
features = {
    'hellinger':hellinger,
    'gcosine':gcosine,
    'gjaccard':gjaccard,
    'scosine':scosine,
    'kcosine':kcosine,
}
columns=list(features.keys())
columns.append('class')

df = DataFrame(columns=columns)
for i in range(len(data)):
    row = []
    sent1 = str_to_bow_tfidfVector(data.sentence1[i],dictionary=dictionary,tfidf=tfidf)
    sent2 = str_to_bow_tfidfVector(data.sentence2[i],dictionary=dictionary,tfidf=tfidf)
    for feature in features:
        if feature in ['scosine','kcosine']:
            A,B = bowtfidf_to_numpy(sent1,sent2)
            row.append(features[feature](A,B))
        else:
            row.append(features[feature](sent1,sent2))
    
    row.append(data.iloc[i]['class'])
    df.loc[i] = row
        


In [65]:
df.loc[0]

scosine      0.578479
hellinger    1.114109
kcosine      0.578479
gcosine      0.578479
gjaccard     0.768792
class        1.000000
Name: 0, dtype: float64

## Paraphrase Recognition Example

In [59]:
from sklearn.preprocessing import scale
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

In [66]:
# DataFrame.as_matrix was removed in pandas 1.0; .values works across versions
x = df[list(features.keys())].values
y = df['class'].values

In [67]:
#Scaling
Xs = scale(x,with_mean=True,with_std=True,axis=0)

#Partitioning data into test and train subsets
X_train, X_test, y_train, y_test = train_test_split(Xs, y, test_size=0.3, random_state=4)
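
One detail of the cell above is worth noting: `scale` is applied to the whole matrix before the split, so the test rows contribute to the mean and standard deviation. A minimal sketch of the alternative, in plain Python with hypothetical toy rows, computes the statistics on the training rows only and reuses them for the test rows. The helper names `fit_standardizer` and `standardize` are illustrative; in scikit-learn the same idea is usually expressed by fitting a `StandardScaler` on `X_train` and calling `transform` on both subsets.

```python
import math

def fit_standardizer(rows):
    """Return per-column means and population standard deviations."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    # A constant column would give std 0; fall back to 1.0 to avoid dividing by zero.
    stds = [math.sqrt(sum((x - m) ** 2 for x in col) / n) or 1.0
            for col, m in zip(columns, means)]
    return means, stds

def standardize(rows, means, stds):
    """Apply (x - mean) / std column by column."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

# Hypothetical toy feature rows (two features per sentence pair).
train = [[0.2, 1.0], [0.4, 1.2], [0.6, 0.8]]
test = [[0.5, 1.1]]

means, stds = fit_standardizer(train)   # statistics come from the training rows only
train_s = standardize(train, means, stds)
test_s = standardize(test, means, stds)
print(test_s)
```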

In [68]:
# instantiate the model (linear kernel, other parameters left at their defaults)
clf = SVC(kernel='linear')

# fit the model with data
clf.fit(X_train, y_train)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [69]:
from sklearn import metrics

# make predictions on the testing set
y_pred = clf.predict(X_test)

# compare actual response values (y_test) with predicted response values (y_pred)
metrics.accuracy_score(y_test, y_pred)

0.6576680068925904
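
An accuracy value is easier to judge next to the majority-class baseline, that is, the accuracy obtained by always predicting the most frequent label. The sketch below shows the computation on hypothetical toy labels; `majority_baseline_accuracy` is an illustrative helper, and the labels are not the MSRPC class column.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)

# Hypothetical labels, for illustration only.
toy_labels = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
baseline = majority_baseline_accuracy(toy_labels)
print(baseline)
```

Applying the same function to `y_test` would give the reference value for the accuracy reported above.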

To see how tfidf affects the result, compare it with the accuracy obtained without a tfidf model. Without the TfIdf model, the Gensim cosine, Hellinger and Jaccard measures cannot be used. In the previous experiment the sklearn and scipy measures received numeric vectors, but without a vector model they have to work on the original sentence strings. We use the same measures, loaded from the textsim library.

In [41]:
from textsim.tokendists import cosine_distance_scipy as tscosine
from textsim.tokendists import cosine_distance_sklearn as tkcosine

tfeatures = {
    'scosine':tscosine,
    'kcosine':tkcosine,
}
columns=list(tfeatures.keys())
columns.append('class')

df2 = DataFrame(columns=columns)
for i in range(len(data)):
    row = []
    for feature in tfeatures:
        row.append(tfeatures[feature](data.sentence1[i],data.sentence2[i]))

    row.append(data.iloc[i]['class'])
    df2.loc[i] = row

In [42]:
df2.shape

(5801, 3)

In [44]:
# Select the string-based feature columns (tfeatures), not the tfidf ones
x2 = df2[list(tfeatures.keys())].values
y2 = df2['class'].values
Xs2 = scale(x2,with_mean=True,with_std=True,axis=0)
X_train2, X_test2, y_train2, y_test2 = train_test_split(Xs2, y2, test_size=0.3, random_state=4)
clf = SVC(kernel='linear')
clf.fit(X_train2, y_train2)
y_pred2 = clf.predict(X_test2)
metrics.accuracy_score(y_test2, y_pred2)

0.7168294083859851

## Conclusions

In this notebook the Tf-Idf model was applied to a real problem, *Paraphrase Recognition*. The results are poor: the classifier built on tfidf features reached an accuracy of about 0.66, below the 0.72 obtained without the tfidf model. However, most of the measures built on this model in the other notebooks have not been included. More similarity measures and different word embedding models should be tested before drawing a sound, scientific conclusion about the influence of word embedding methods on similarity problems.