# Applying Preprocess for Real

This tutorial intends to show ``preprocess`` in a real context. After a 
quickstart in the library, and the bases of text normalization with 
python, the next obvious step is to apply preprocessing techniques in a 
real NLP problem

The selected problem is *Semantic Text Similarity*.

## Semantic Text Similarity

SEMEVAl is an International Workshop on Semantic Evaluation, currently
part of Lexical and Computational Semantic and Semantic Evaluation
scientific conference. The objective of this workshop is to measure
the degree of semantic equivalence between two texts. The data is
composed by sentence pairs, coming from previously existing paraphrase
datasets [Agirre2012]_. This event is divided in tasks, the task of 
interest here is [Semantic Text Similarity](http://alt.qcri.org/semeval2012/task17/)

Usually in the gold standard the semantic equivalence is measured with
a float number between [0-5].

## Dataset

The data used for this example is a small part of SemEval 2012 Shared
[Task 6 Dataset](https://www.cs.york.ac.uk/semeval-2012/task6/index.php%3Fid=data.html), the en-en subset.

The subset is from MSR-Paraphrase, [Microsoft Research Paraphrase Corpus](http://research.microsoft.com/en-us/downloads/607d14d9-20cd-47e3-85bc-a2f65cd28042/).
750 pairs of sentences.

### Legal Note

STS 2012 Dataset is under this licenses:
* http://research.microsoft.com/en-us/downloads/607d14d9-20cd-47e3-85bc-a2f65cd28042/
* http://research.microsoft.com/en-us/downloads/38cf15fd-b8df-477e-a4e4-a4680caa75af/

In [1]:
#import the dataset
import pandas as pd
data = pd.read_csv('../../preprocess/data/2012SMTeuroparl.train.tsv', sep='\t')

In [2]:
data.columns = ['score','s1','s2']
data.head()

Unnamed: 0,score,s1,s2
0,4.25,I know that in France they have had whole herd...,"I know that in France, the principle of slaugh..."
1,4.8,"Unfortunately, the ultimate objective of a Eur...",Unfortunately the final objective of a Europea...
2,4.8,The right of a government arbitrarily to set a...,The right for a government to draw aside its c...
3,4.0,"The House had also fought, however, for the re...",This Parliament has also fought for this reduc...
4,4.8,The right of a government arbitrarily to set a...,The right for a government to dismiss arbitrar...


### Requirements

Thise example use the open source library [textsim](https://github.com/sorice/textsim), 
a personal proyect of the author. Is a library for text similarity 
which integrates some very known text similarity distances, and some 
implementation of those distances on scipy, sklearn and other python libraries.

In [3]:
import preprocess
import textsim
from copy import deepcopy
import warnings
warnings.filterwarnings('ignore')

# Preprocessing

In [4]:
preprocess.basic.__all__

['lowercase',
 'replace_urls',
 'replace_symbols',
 'replace_dot_sequence',
 'multipart_words',
 'expand_abbrevs',
 'normalize_abbrevs',
 'expand_contractions',
 'replace_punctuation',
 'extraspace_for_endingpoints',
 'add_doc_ending_point',
 'del_tokens_len_one',
 'hyphenation',
 'del_digits']

In [5]:
#You can play with the atomic steps preproc-text library allows
flow = ['lowercase', 
        'expand_contractions', 
        'replace_dot_sequence', 
        'multipart_words', 
        'replace_punctuation', 
        'del_digits']

pdata = deepcopy(data)

#Preprocess all the sentences and keep the new value in pdata
for i in range(len(pdata)):
    pdata.iloc[i].s1 = preprocess.pipeline(pdata.iloc[i].s1, flow=flow)
    pdata.iloc[i].s2 = preprocess.pipeline(pdata.iloc[i].s2, flow=flow)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[name] = value


### Feature Engineering

Converting Sentences to Vectors of similarity distances.

Every pair of sentences will be convereted to one vector of float values, and the original score will be taken as the final result to get. The same process will be done with preprocessed data and original data, to calculate de impact of preprocess in the machine learning process.

The next process must take some time, because the cell must perform 733*2 text to vector conversions, and then obtain 733*43 calculations

In [48]:
mlpdata = pd.DataFrame()
mlpdata['score'] = pdata['score']

textsimData =pd.DataFrame()
#make textsim matrix
for metric in textsim.__all_distances__:
    observations = []
    for i in range(len(pdata)):
        observations.append(textsim.__all_distances__[metric](pdata.iloc[i].s1, pdata.iloc[i].s2))
    textsimData[metric] = observations

  dist = 1.0 - uv / np.sqrt(uu * vv)
  return (ntf + nft - ntt + n) / (ntf + nft + n)
  return np.sqrt(((XA - XB) ** 2 / V).sum())
  return float(2.0 * ntf * nft / np.array(ntt * nff + ntf * nft))


In [65]:
print(textsimData.shape)
textsimData.head()

(733, 43)


Unnamed: 0,binary_distance,levenshtein_distance,edit_similarity,damerau_levenshtein_distance,jaro_distance,jaro_winkler_distance,hamming_distance,match_rating_comparison,dice_coefficient,lcs_distance,...,matching_distance,minkowski_distance,rogerstanimoto_distance,russellrao_distance,seuclidean_distance,sokalmichener_distance,sokalsneath_distance,sqeuclidean_distance,yule_distance,qgram_distance
0,0.0,78,0.462069,78,0.764647,0.858788,119,True,0.622222,77,...,0.62069,19.0,0.619048,0.413793,6.0,0.619048,0.604651,21.0,3.789474,0.622222
1,0.0,32,0.769784,32,0.787642,0.872585,121,True,0.631579,110,...,0.478261,11.0,0.357143,0.26087,4.690416,0.357143,0.37037,11.0,0.380952,0.631579
2,0.0,38,0.672414,38,0.85715,0.91429,70,True,0.823529,95,...,0.388889,7.0,0.105263,0.111111,3.741657,0.105263,0.111111,7.0,0.0,0.823529
3,0.0,148,0.467626,148,0.766001,0.812801,257,True,0.444444,158,...,0.674419,31.0,-2.777778,-0.162791,7.615773,-2.777778,0.0,35.0,0.275862,0.444444
4,0.0,38,0.672414,38,0.82183,0.893098,100,True,0.787879,92,...,0.444444,8.0,0.2,0.166667,4.0,0.2,0.210526,8.0,0.0,0.787879


In [63]:
pmatrix = mlpdata.merge(textsimData, left_index=True, right_index=True)

In [64]:
pmatrix.shape

(733, 44)

### Repeating the same process with data

The original data, without preprocess must be transformed into float numer matrices, or feature matrix.

In [69]:
mldata = pd.DataFrame()
mldata['score'] = data['score']

textsimData =pd.DataFrame()
#make textsim matrix
for metric in textsim.__all_distances__:
    observations = []
    for i in range(len(pdata)):
        observations.append(textsim.__all_distances__[metric](data.iloc[i].s1, data.iloc[i].s2))
    textsimData[metric] = observations

  dist = 1.0 - uv / np.sqrt(uu * vv)
  return (ntf + nft - ntt + n) / (ntf + nft + n)
  return np.sqrt(((XA - XB) ** 2 / V).sum())
  return float(2.0 * ntf * nft / np.array(ntt * nff + ntf * nft))


In [70]:
print(textsimData.shape)
textsimData.head()

(733, 43)


Unnamed: 0,binary_distance,levenshtein_distance,edit_similarity,damerau_levenshtein_distance,jaro_distance,jaro_winkler_distance,hamming_distance,match_rating_comparison,dice_coefficient,lcs_distance,...,matching_distance,minkowski_distance,rogerstanimoto_distance,russellrao_distance,seuclidean_distance,sokalmichener_distance,sokalsneath_distance,sqeuclidean_distance,yule_distance,qgram_distance
0,0.0,78,0.462069,78,0.764647,0.858788,119,True,0.622222,77,...,0.62069,19.0,0.619048,0.413793,6.0,0.619048,0.604651,21.0,3.789474,0.622222
1,0.0,32,0.769784,32,0.787642,0.872585,121,True,0.631579,110,...,0.478261,11.0,0.357143,0.26087,4.690416,0.357143,0.37037,11.0,0.380952,0.631579
2,0.0,38,0.672414,38,0.85715,0.91429,70,True,0.823529,95,...,0.388889,7.0,0.105263,0.111111,3.741657,0.105263,0.111111,7.0,0.0,0.823529
3,0.0,148,0.467626,148,0.766001,0.812801,257,True,0.444444,158,...,0.674419,31.0,-2.777778,-0.162791,7.615773,-2.777778,0.0,35.0,0.275862,0.444444
4,0.0,38,0.672414,38,0.82183,0.893098,100,True,0.787879,92,...,0.444444,8.0,0.2,0.166667,4.0,0.2,0.210526,8.0,0.0,0.787879


In [82]:
#Replace n
print(textsimData.isnull().sum())

binary_distance                  0
levenshtein_distance             0
edit_similarity                  0
damerau_levenshtein_distance     0
jaro_distance                    0
jaro_winkler_distance            0
hamming_distance                 0
match_rating_comparison          0
dice_coefficient                 0
lcs_distance                     0
lcs_similarity                   0
smith_waterman_distance          0
needleman_wunsch_distance        0
needleman_wunsch_similarity      0
containment_distance             0
jaccard_distance                 0
overlap_distance                 0
matching_coefficient             0
matching_coefficient_pablo       0
token_containment_distance       0
masi_distance                    0
interval_distance                0
manhattan_distance               0
cosine_distance                  0
euclidean_distance               0
braycurtis_distance              0
canberra_distance                0
chebyshev_distance               0
correlation_distance

In [90]:
mask = textsimData['yule_distance'] != np.inf
print(mask.count())

[0;31mType:[0m        float
[0;31mString form:[0m inf
[0;31mDocstring:[0m   Convert a string or number to a floating point number, if possible.


733


In [92]:
#Replacing inf by max and -inf by min
is_inf = textsimData.yule_distance == np.inf 
is_ninf = textsimData.yule_distance == -np.inf
yule_mean = textsimData.yule_distance[~is_inf & ~is_ninf].mean()
textsimData.replace([np.inf, -np.inf], textsimData.yule_distance[~is_inf & ~is_ninf].mean(), inplace=True)
col_mask=textsimData.isnull().any(axis=0) 
row_mask=textsimData.isnull().any(axis=1)
textsimData.loc[row_mask,col_mask]
textsimData.loc[row_mask,col_mask] = yule_mean

mask = textsimData['correlation_distance'] != np.inf
textsimData.loc[~mask, 'correlation_distance'] = textsimData.loc[mask, 'correlation_distance'].max()
bmask = textsimData['correlation_distance'] != -np.inf
textsimData.loc[~bmask, 'correlation_distance'] = textsimData.loc[bmask, 'correlation_distance'].min()
mask = textsimData['seuclidean_distance'] != np.inf
textsimData.loc[~mask, 'seuclidean_distance'] = textsimData.loc[mask, 'seuclidean_distance'].max()
bmask = textsimData['seuclidean_distance'] != -np.inf
textsimData.loc[~bmask, 'seuclidean_distance'] = textsimData.loc[bmask, 'seuclidean_distance'].min()
print(textsimData.isnull().sum())

binary_distance                 0
levenshtein_distance            0
edit_similarity                 0
damerau_levenshtein_distance    0
jaro_distance                   0
jaro_winkler_distance           0
hamming_distance                0
match_rating_comparison         0
dice_coefficient                0
lcs_distance                    0
lcs_similarity                  0
smith_waterman_distance         0
needleman_wunsch_distance       0
needleman_wunsch_similarity     0
containment_distance            0
jaccard_distance                0
overlap_distance                0
matching_coefficient            0
matching_coefficient_pablo      0
token_containment_distance      0
masi_distance                   0
interval_distance               0
manhattan_distance              0
cosine_distance                 0
euclidean_distance              0
braycurtis_distance             0
canberra_distance               0
chebyshev_distance              0
correlation_distance            0
dice_distance 

The simple inspection of this columns series makes us to evaluate that the 'binary_distance', 'match_rating_comparison', 'damerau_levenshtein_distance' have 0.0 values, boolean values and same value than levenstein_distance respectively. So for the final calculation this columns are useless.

In [94]:
exp1 = textsimData.drop(['binary_distance', 'match_rating_comparison', 'damerau_levenshtein_distance'], axis=1)

In [95]:
exp1.shape

(733, 40)

## Machine Learning model

[Some kind of Logistic Regression for classification.]

[Features, use textsim.calc_all](make a brief description here, and link with github.com/sorice/textsim)

In [96]:
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
y = data['score']
X = exp1
reg = make_pipeline(StandardScaler(),SGDRegressor(max_iter=100, tol=1e-3))
reg.fit(X,y)

Pipeline(steps=[('standardscaler', StandardScaler()),
                ('sgdregressor', SGDRegressor(max_iter=100))])

In [119]:
test_ext.shape

(40, 1)

In [120]:
scaler = StandardScaler()
test = exp1.iloc[0].to_numpy()
test_ext = test.reshape(-1,1).reshape(1,-1)
print(test_ext_esc.shape)
reg.predict(test_ext)

(40, 1)


array([4.5164438])

In [122]:
data['score'][0]

4.25

### Trainin Process

Train without preprocess

Train with preprocess

## Cross Validation

[Show differences between scores obtained with/without preprocess]

## Recommendations

* usually we must reduce dimensionality, for better interpretabillity
  of the model, less complexity, reduce the training time, avoid 
  overfitting and gain capacity of generalization 

* Feature selection process is not objective of this tutorial, but it
  is recommended that comparing the list of must important features,
  could show how preprocess is relevant for improving results, due to
  the straight relation between preprocess and selected features. 

## Other Applications

``preprocess`` library has been used successfully as part of the
following projects:

- [Text Preprocessing Chapter of MyNLP Course Py3 version](file:///media/abelm/Almacen/Doctorado/Notas_de_la_Investigacion/03_Mi_Curso_Postgrado_Natural_Language_Process/02_Pre-Procesamiento_py3/)
- [Text Preprocessing Chapter of MyNLP Course Py2 version](file:///media/abelm/Almacen/Doctorado/Notas_de_la_Investigacion/03_Mi_Curso_Postgrado_Natural_Language_Process/02_Pre-Procesamiento)
- [Llanes-corpus similarity experiment active](file:///media/abelm/Almacen/Doctorado/01_Codigos/2016-02_Llanes_simCalcFlow/)
- [Next text-reuse experiment active](file:///media/abelm/Almacen/Doctorado/01_Codigos/2015-11-30_Llanes_similarity_Example_8_test_15/)
- [repository of my Text-Reuse algorithm](file:///media/abelm/Almacen/Doctorado/00_plag_algh/)

### Older uses

Older versions of this module. Be careful! Many of this URLs are the ancient versions with different software architectures.

- [QtNLP-Linguist module](https://github.com/sorice/QtNLP-Linguist)