<small><i>This notebook was put together by [Abel Meneses-Abad](http://www.menesesabad.com) for Paper *Paraphrase Beyond Sentence*. Source and license info is on [GitHub](https://github.com/sorice/2017paraphrasebsent/).</i></small>

# Basic Text Similarity Example with Sklearn

The goal of this notebook is to show, step by step, how to calculate the similarity between two texts from their vector representations. The similarity measures used in the examples below are based on tokens.

In [9]:
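import warnings
from sklearn.exceptions import DataConversionWarning

# Silence the warnings raised when sklearn converts input data to another dtype
warnings.filterwarnings(action='ignore', category=DataConversionWarning)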

## Step 1 From String to Vectors

Define the two strings and transform them into token-count vectors with sklearn's `CountVectorizer`.

In [1]:
import numpy as np

sA = "PCCW's chief operating officer, Mike Butcher, and Alex Arena, the chief financial officer, will report directly to Mr So."
sB = "Current Chief Operating Officer Mike Butcher and Group Chief Financial Officer Alex Arena will report to So."

from sklearn.feature_extraction.text import CountVectorizer

count_v = CountVectorizer()
tdm = count_v.fit_transform([sA, sB])
A = tdm[0]
B = tdm[1]
print(type(A))
print(type(B))

print(type(B), B.shape, A.shape)
# Convert the sparse rows to dense arrays
Bp = B.toarray()
Ap = A.toarray()
print(type(Bp))
print(Bp)
print(Ap)

<class 'scipy.sparse.csr.csr_matrix'>
<class 'scipy.sparse.csr.csr_matrix'>
<class 'scipy.sparse.csr.csr_matrix'> (1, 19) (1, 19)
<class 'numpy.ndarray'>
[[1 1 1 1 2 1 0 1 1 1 0 2 1 0 1 1 0 1 1]]
[[1 1 1 1 2 0 1 1 0 1 1 2 1 1 1 1 1 1 1]]
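
To read the count vectors above, it helps to know which token each column stands for. The sketch below is an illustration, not part of the original notebook: it uses the fitted vectorizer's `vocabulary_` attribute (a token-to-column mapping) to list the tokens in column order and to split them into shared and sentence-specific tokens. The names `counts`, `tokens`, `shared`, `only_A` and `only_B` are introduced here for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

sA = "PCCW's chief operating officer, Mike Butcher, and Alex Arena, the chief financial officer, will report directly to Mr So."
sB = "Current Chief Operating Officer Mike Butcher and Group Chief Financial Officer Alex Arena will report to So."

count_v = CountVectorizer()
counts = count_v.fit_transform([sA, sB]).toarray()

# vocabulary_ maps each token to its column index; sort the tokens by that index
tokens = sorted(count_v.vocabulary_, key=count_v.vocabulary_.get)

# Illustrative split of the vocabulary by the sentence(s) each token occurs in
rows = list(zip(tokens, counts[0], counts[1]))
shared = [t for t, a, b in rows if a > 0 and b > 0]
only_A = [t for t, a, b in rows if a > 0 and b == 0]
only_B = [t for t, a, b in rows if a == 0 and b > 0]

print(len(tokens), tokens)
print("shared:", shared)
print("only in sA:", only_A)
print("only in sB:", only_B)
```

The default tokenizer lowercases the text and drops one-character tokens, which is why the `s` of `PCCW's` does not appear in the vocabulary.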


## Step 2 Padding Vectors When Lengths are Not Equal

If the vectors come from a different source than the one shown above, they may differ in length. The similarity metrics in sklearn require both vectors to have the same number of features, and different sentences usually produce vectors of different sizes, so the shorter one must be padded.

In [2]:
A = np.array([0,1,0,0,1,1,1,0,0,1,0,0,0,0],dtype=np.int32)
B = np.array([0,0,1,0,1,0,1,0,1],dtype=np.int32)

Bp = np.pad(B, (0, len(A) - len(B)), 'constant')  # pad B on the right with zeros up to the length of A
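
The padding line above assumes that `B` is the shorter vector. As a minimal sketch, the hypothetical helper `pad_to_same_length` below (not part of the original notebook) pads whichever vector is shorter, so the order of the arguments does not matter. It only assumes 1-D NumPy arrays.

```python
import numpy as np

def pad_to_same_length(u, v):
    """Right-pad the shorter 1-D vector with zeros (hypothetical helper)."""
    n = max(len(u), len(v))
    u_pad = np.pad(u, (0, n - len(u)), 'constant')
    v_pad = np.pad(v, (0, n - len(v)), 'constant')
    return u_pad, v_pad

A = np.array([0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0], dtype=np.int32)
B = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1], dtype=np.int32)

Ap, Bp = pad_to_same_length(A, B)
print(Ap.shape, Bp.shape)
print(Bp)
```

Zero padding is only meaningful when position `i` refers to the same feature in both vectors; otherwise the distances computed afterwards compare unrelated positions.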

## Step 3 Calculating Similarities

In [6]:
from sklearn.metrics import pairwise_distances_argmin_min

# Reshape the vectors to (n_samples, n_features)
Bp = Bp.reshape(1, -1)
A = A.reshape(1, -1)

# Manhattan distance: returns the index of the closest row of Bp and its distance
pairwise_distances_argmin_min(A, Bp, axis=1, metric='manhattan')

(array([0]), array([5.]))
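
The value 5 can be checked by hand: the Manhattan distance is the sum of the absolute differences, which for binary vectors is the number of positions where they differ. The sketch below is an illustration added here, using `pairwise_distances` rather than the notebook's `pairwise_distances_argmin_min`, and compares the manual result with sklearn's.

```python
import numpy as np
from sklearn.metrics import pairwise_distances

# The vectors of Step 2, already padded and reshaped to (1, n_features)
A = np.array([[0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0]])
Bp = np.array([[0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0]])

by_hand = np.abs(A - Bp).sum()                               # sum of |a_i - b_i|
by_sklearn = pairwise_distances(A, Bp, metric='manhattan')[0, 0]

print(by_hand, by_sklearn)
```

Because `Bp` holds a single row, the index returned by `pairwise_distances_argmin_min` is necessarily 0; only the distance carries information here.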

## Step 4 Constructing a Vector of Features

The next step builds a feature vector out of several distance measures, each obtained with a call such as:

    pairwise_distances_argmin_min(A,Bp,axis=1,metric='dice')
    pairwise_distances_argmin_min(A,Bp,axis=1,metric='hamming')
    pairwise_distances_argmin_min(A,Bp,axis=1,metric='cityblock')
    pairwise_distances_argmin_min(A,Bp,axis=1,metric='cosine')

In [10]:
from sklearn.metrics.pairwise import * # supports sparse matrix inputs
from scipy.spatial.distance import * # does not support sparse matrix inputs

# If your data has more than 14 samples, the full list can be imported from sklearn instead:
#     from sklearn.metrics.pairwise import _VALID_METRICS
# With fewer samples the 'mahalanobis' distance fails, so the list below leaves it out.

_VALID_METRICS = ['euclidean', 'l2', 'l1', 'manhattan', 'cityblock',
                  'braycurtis', 'canberra', 'chebyshev', 'correlation',
                  'cosine', 'dice', 'hamming', 'jaccard', 'kulsinski',
                  'matching', 'minkowski', 'rogerstanimoto',
                  'russellrao', 'seuclidean', 'sokalmichener',
                  'sokalsneath', 'sqeuclidean', 'yule',]
vector = []

for distance in _VALID_METRICS:
    index, dist =  pairwise_distances_argmin_min(A,Bp, axis=1, metric=distance)
    vector.append(dist[0])

_DISTANCE_VECTOR = np.array(vector, dtype = np.float16)
_DISTANCE_VECTOR = _DISTANCE_VECTOR.reshape(1,-1)

In [11]:
print (_DISTANCE_VECTOR.shape)
print (_DISTANCE_VECTOR)

(1, 23)
[[2.236  2.236  5.     5.     5.     0.5557 5.     1.     0.8115 0.5527
  0.5557 0.3572 0.7144 0.8945 0.3572 2.236  0.5264 0.857     nan 0.5264
  0.8335 5.     0.6   ]]
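
A positional vector like the one above is easier to inspect when each value keeps its metric name. The following is a minimal sketch, not the notebook's method: it computes a handful of the same distances with functions from `scipy.spatial.distance` on the 1-D vectors of Step 2 and stores them in a dictionary. The subset of metrics and the names `features` and `all_finite` are choices made here for illustration, and the boolean metrics receive explicit boolean copies of the vectors.

```python
import numpy as np
from scipy.spatial import distance

A = np.array([0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0])
Bp = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0])
A_bool, Bp_bool = A.astype(bool), Bp.astype(bool)

# Illustrative subset of the metrics listed above, keyed by name
features = {
    'cityblock': distance.cityblock(A, Bp),
    'euclidean': distance.euclidean(A, Bp),
    'cosine': distance.cosine(A, Bp),
    'braycurtis': distance.braycurtis(A, Bp),
    'hamming': distance.hamming(A, Bp),
    'jaccard': distance.jaccard(A_bool, Bp_bool),
    'dice': distance.dice(A_bool, Bp_bool),
}

for name, value in features.items():
    print(f"{name:>10}: {value:.4f}")

# Check that every feature is a finite number before using the vector
all_finite = bool(np.all(np.isfinite(list(features.values()))))
print(all_finite)
```

In the output above, the `nan` is the 19th value, which lines up with 'seuclidean' in `_VALID_METRICS`; a finiteness check like the last lines makes such entries easy to catch before scaling.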


## Step 5 Scaling all Features for Normalization

All features must be scaled to a common range. The example below uses *MaxAbsScaler*, which divides each feature by its maximum absolute value, so the largest value becomes 1 and non-negative features fall in the interval [0,1]. Other techniques are available in the module sklearn.preprocessing.

    from sklearn.preprocessing import MaxAbsScaler

    maxabs = MaxAbsScaler()
    _DISTANCE_VECTOR_NORM = maxabs.fit_transform(_DISTANCE_VECTOR)
    
A drawback of this method is that it needs several samples to find a meaningful maximum for each feature. That is why the samples used below are taken from an ARFF file.
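
The need for several samples can be seen with a small experiment. In the sketch below, which uses made-up values rather than the notebook's data, a single sample scales to all ones because each value is divided by itself, while three samples are scaled column by column against the largest absolute value. The array names are introduced here for illustration.

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

# Illustrative values only (not taken from the notebook's data)
one_sample = np.array([[2.0, 5.0, 0.5]])
three_samples = np.array([[2.0, 5.0, 0.5],
                          [4.0, 2.5, 1.0],
                          [1.0, 10.0, 0.25]])

scaled_one = MaxAbsScaler().fit_transform(one_sample)
scaled_three = MaxAbsScaler().fit_transform(three_samples)

print(scaled_one)    # every nonzero feature becomes 1, so the information is lost
print(scaled_three)  # each column is divided by its own maximum absolute value
```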

In [12]:
from scipy.io import arff
import pandas as pd

data, meta = arff.loadarff('data/MSRPC-2004/msrpc_test_simmetrics-18f.arff')

print(data.shape) # (1725,) samples; reduce them to keep this notebook small

# Reduce the samples
_DISTANCE_MATRIX_a = data[:10] # this matrix contains a non-numeric column ('class') that must be dropped

# Use pandas to work with the distance matrix
_DISTANCE_MATRIX_b = pd.DataFrame(_DISTANCE_MATRIX_a)
_DISTANCE_MATRIX_c = _DISTANCE_MATRIX_b.drop(['class'], axis=1)

from sklearn.preprocessing import MaxAbsScaler

maxabs = MaxAbsScaler()
_DISTANCE_MATRIX_NORM = maxabs.fit_transform(_DISTANCE_MATRIX_c)
print('original MATRIX[0]\n', _DISTANCE_MATRIX_a[0],'\ntype',_DISTANCE_MATRIX_a.dtype, 'shape',_DISTANCE_MATRIX_a.shape,'\n')
print('transformed MATRIX[0]\n', _DISTANCE_MATRIX_NORM[0],'\ntype',_DISTANCE_MATRIX_NORM.dtype, 'shape',_DISTANCE_MATRIX_NORM.shape)

# Convert the result into a DataFrame
_NormMatrix_DataFrame = pd.DataFrame(_DISTANCE_MATRIX_NORM,columns=_DISTANCE_MATRIX_c.columns)
print(_NormMatrix_DataFrame.columns)
print(_NormMatrix_DataFrame.shape)

(1725,)
original MATRIX[0]
 (0.72500002, 0.48598132, 0.61682242, 0.61682242, 0.73934317, 0.73934317, 0.91112685, 0.89166665, 0.58008659, 0.3888889, 0.438357, 0.4375, 0.78516549, 0.28, 0.36842105, 0.85772765, 0.46666667, 0.58333337, b'yes') 
type [('NeedlemanWunch', '<f8'), ('SmithWaterman', '<f8'), ('SmithWatermanGotoh', '<f8'), ('SmithWatermanGotohWindowedAffine', '<f8'), ('Jaro', '<f8'), ('JaroWinkler', '<f8'), ('ChapmanMeanLength', '<f8'), ('ChapmanLengthDeviation', '<f8'), ('QGramsDistance', '<f8'), ('BlockDistance', '<f8'), ('CosineSimilarity', '<f8'), ('DiceSimilarity', '<f8'), ('EuclideanDistance', '<f8'), ('JaccardSimilarity', '<f8'), ('MatchingCoefficient', '<f8'), ('MongeElkan', '<f8'), ('OverlapCoefficien', '<f8'), ('Levenshtein', '<f8'), ('class', 'S3')] shape (10,) 

transformed MATRIX[0]
 [0.88359377 0.55277496 0.67167594 0.67167594 0.88752249 0.79225245
 0.9220618  0.91024303 0.72510823 0.49122807 0.54455337 0.54687499
 0.86514849 0.41999999 0.49736843 0.913556   0.51851
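
`loadarff` returns a NumPy structured array, and the nominal `class` attribute arrives as byte strings (`b'yes'` in the output above). The sketch below rebuilds a tiny array with the same kind of layout, using made-up values and only two of the feature columns, to show one way of keeping the label as text while scaling the numeric columns. The values and the names `toy`, `labels`, `numeric` and `scaled` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MaxAbsScaler

# Made-up records that imitate the structured array returned by loadarff
toy = np.array(
    [(0.50, 0.20, b'yes'), (1.00, 0.40, b'no'), (0.25, 0.80, b'yes')],
    dtype=[('Jaro', '<f8'), ('Levenshtein', '<f8'), ('class', 'S3')],
)

df = pd.DataFrame(toy)
labels = df['class'].str.decode('utf-8')   # bytes -> str
numeric = df.drop(['class'], axis=1)       # keep only the numeric features

scaled = pd.DataFrame(MaxAbsScaler().fit_transform(numeric), columns=numeric.columns)
scaled['class'] = labels                   # reattach the decoded label
print(scaled)
```

Keeping the label next to the scaled features is convenient when the matrix is later passed to a classifier, since the rows stay aligned with their classes.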

# Conclusions

- The *sklearn.metrics.pairwise* module gives access to the 23 token-based measures used in Step 4.

The next notebook, *[3.2-Similarity-Features-Elaboration.ipynb](03.2-Text-Similarity-Features-Elaboration-with-textsim-pack.ipynb)*, is a complete guide to using the internal documentation notebooks of the **textsim** library to create a complete set of features based on different string, token, or knowledge distances.