<small><i>This notebook was put together by [Abel Meneses-Abad](http://www.menesesabad.com) for Paper *Paraphrase Beyond Sentence*. Source and license info is on [GitHub](https://github.com/sorice/2017paraphrasebsent/).</i></small>

# Constructing a Corpus of Similarity Objects based on MSRPC

With a good set of text similarity functions in hand, the next resource we need is a corpus of similarity objects that can be used with the sklearn machine learning library for Python. The most frequent format among the sklearn datasets is CSV.

The ARFF format used here is the one read by __Weka__, a machine learning tool written in Java. The original intention was to compare the performance of both platforms, and also to compare the results of the "feature extraction" stage obtained with Weka and sklearn methods, but this objective was never accomplished.
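
For readers who have not seen ARFF before, the following is a minimal sketch of how such a file is laid out: a `@relation` line, one `@attribute` line per feature, and a `@data` section with one comma-separated row per instance. The helper name `to_arff` and the two feature names are hypothetical and only serve the illustration; missing values are written as `?`, which is the ARFF convention.

```python
import math

def to_arff(relation, feature_names, rows, classes=("yes", "no")):
    """Return ARFF text for numeric features plus a nominal class.

    Hypothetical helper for illustration; each row is
    (feature_values, class_label).
    """
    lines = ["@relation " + relation, ""]
    for name in feature_names:
        lines.append("@attribute " + name + " numeric")
    lines.append("@attribute class {" + ",".join(classes) + "}")
    lines += ["", "@data"]
    for values, label in rows:
        # ARFF marks a missing value with '?'
        cells = ["?" if math.isnan(v) else str(v) for v in values]
        lines.append(",".join(cells + [label]))
    return "\n".join(lines) + "\n"

arff_text = to_arff(
    "paraphrase",
    ["jaccard", "cosine"],  # made-up feature names
    [([0.5, 0.8], "yes"), ([float("nan"), 0.1], "no")],
)
print(arff_text)
```

Building the text in memory first keeps the sketch easy to inspect; for a large corpus, writing each row straight to the file is the more economical choice.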

The next set of cells implements the generation of a corpus of similarity vectors. These cells have been converted into script module functions for use in the following notebooks, but since the main objective of this collection is to teach, we keep the first successful implementation here for student questions and quiz exercises.

**Note:** The functions mentioned above are *scripts.datasets.msrpc_to_csv* and *scripts.datasets.msrpc_to_arff*.

**Note:** The textsim package can be downloaded from [GitHub](https://github.com/sorice/textsim).

__2020 Note:__ The MSRPC ARFF files used in the next cells were obtained with Java software from Las Tunas University in Cuba; this software is no longer available. So the last cell is dedicated to obtaining the same similarity vectors, or some extended ones, based only on Python packages.

In [1]:
import sklearn
import csv
import numpy as np
from sklearn.utils import Bunch
import sys
import pandas as pd
from pandas import DataFrame, Series, read_csv, read_table
import textsim
from textsim.utils import calc_all
import arff
import os
from tqdm import tqdm
import time
import warnings
warnings.filterwarnings('ignore')

### Loading original MSRPC

The __objective__ of this section is to teach how to program a corpus loader similar to the ones in sklearn.

The original Microsoft Research Paraphrase Corpus is organized in this schema:

__class__, __ID of String A__, __ID of String B__, __String A__, __String B__

This corpus can be downloaded from [http://research.microsoft.com/en-us/downloads](http://research.microsoft.com/en-us/downloads) and can also be found in many GitHub repositories, like [this one](https://github.com/wasiahmad/paraphrase_identification/tree/master/datasets/msr-paraphrase-corpus).

Originally MSRPC was divided into two subsets: test and train. The author of this collection merged them into a single file (_data/msrpc.txt_) for future experiments with different configurations. This single file can be split into test/train subsets for __model evaluation__ using different strategies, such as _stratified K-fold_.
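
As a rough illustration of what a stratified split does, the sketch below deals the indices of each class round-robin into `k` folds, so that every fold keeps roughly the class proportions of the full file. It uses only the standard library and made-up labels; `stratified_folds` is a hypothetical helper, and in practice `sklearn.model_selection.StratifiedKFold` offers the same idea with shuffling and more options.

```python
from collections import defaultdict

def stratified_folds(labels, k):
    """Split range(len(labels)) into k folds with similar class ratios."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    position = 0
    for indices in by_class.values():
        # Deal this class's indices round-robin, continuing where the
        # previous class stopped so fold sizes stay balanced.
        for index in indices:
            folds[position % k].append(index)
            position += 1
    return folds

# Made-up labels: 6 paraphrase pairs (1) and 3 non-paraphrase pairs (0)
labels = [1, 1, 0, 1, 0, 1, 1, 0, 1]
for test_fold in stratified_folds(labels, k=3):
    train = [i for i in range(len(labels)) if i not in test_fold]
    print("test:", sorted(test_fold), "train size:", len(train))
```

Each fold serves once as the test set while the remaining indices form the training set.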

In [1]:
#Read Paraphrase Corpus
df = read_table('data/MSRPC-2004/msr_paraphrase_test.txt',sep='\t')
data = []
distances = []
exceptions = []

ti = time.time()
#Open vector ARFF similarity feature corpus
with open('data/MSRPC-2004/msrpc_test_textsim-42fb.arff','w') as corpus: 
    corpus.write('@relation paraphrase\n\n')
    for distance in sorted(textsim.__all_distances__.keys()):
        corpus.write('@attribute '+distance+' numeric\n')
        distances.append(distance)
    corpus.write('@attribute '+'id'+' integer\n')
    distances.append('id')
    corpus.write('@attribute class {yes,no}\n\n')
    distances.append('class')
    corpus.write('@data\n')
    
    for row in tqdm(range(len(df))):
        clase, ide1, ide2, sent1, sent2 = df.xs(row)
        try:
            obj = calc_all(sent1,sent2)[2:]
            sec = ''
            for item in obj:
                if str(item) == 'nan':
                    sec += '?,'
                else:
                    sec += str(item)+','
            sec += str(row) #append id for future analysis after classification
            obj.append(row)
            if clase:
                corpus.write(sec+',yes\n')
                obj.append('yes')
            else:
                corpus.write(sec+',no\n')
                obj.append('no')
            data.append(obj)
            
        except Exception: #record the failing row and continue with the next one
            exceptions.append(row)

tf = time.time()-ti
print('Total time:',tf)
print('Exceptions:',len(exceptions),'\nValues:',exceptions)

100%|██████████| 1725/1725 [06:41<00:00,  3.40it/s]

Total time: 401.63975954055786
Exceptions: 54 
Values: [16, 35, 48, 143, 176, 217, 228, 251, 279, 281, 322, 378, 408, 429, 430, 446, 475, 525, 572, 576, 588, 662, 697, 701, 731, 745, 766, 773, 824, 843, 897, 924, 953, 1003, 1014, 1019, 1039, 1059, 1077, 1151, 1186, 1209, 1216, 1236, 1413, 1425, 1467, 1470, 1527, 1607, 1663, 1666, 1677, 1717]





*Runs*

1. time = 379.49773502349854, valid_data = 1671
2. time = 381, valid_data = 1671
3. time = 415.97, valid_data = 1671 (last version that generates a correct MSRP ARFF file)

## CSV Data Generation

Only run the next block after running the block above: the __data__ object must exist for it to work.

In [7]:
def corpus_to_csv(file_path='data/MSRPC-2004/msrpc_test_textsim-42f.csv'):
    with open(file_path,'w') as corpus:
        corpus.write(str(len(data))+',')
        corpus.write(str(len(distances)-1)+',')
        corpus.write('Paraph,Non,')
        for distance in distances:
            corpus.write(distance+',')
        corpus.write('\n')
        for instance in data:
            corpus.write(str(instance)[1:-1]+'\n')
    return
            
corpus_to_csv()

### Partial conclusions

* The standard CSV corpus that is loaded into a Bunch must carry some metadata in its header row, such as the number of cases and the number of features.
* It is very important to have the distances implemented in Cython, because the process could be much faster.
* Do not attempt this on a laptop without the Cython implementation. It is too slow!
* A GPU version could be more suitable to get results in minutes.
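
To make the point about header metadata concrete, here is a minimal sketch of a loader for a file shaped like the one `corpus_to_csv` writes: a header row with the number of cases, the number of features, the two class names and the column names, followed by one row per instance. The function name `load_similarity_csv` and the sample values are hypothetical, and the sketch assumes the feature values were written as plain Python floats. The returned dictionary has the fields that sklearn loaders usually expose and could be passed to `Bunch(**result)`.

```python
import csv
import io

def load_similarity_csv(handle):
    """Parse header metadata and rows into a dict shaped like a Bunch."""
    # Rows come from str(list)[1:-1], so values are separated by ', '
    # and the class label is wrapped in single quotes.
    reader = csv.reader(handle, skipinitialspace=True, quotechar="'")
    header = [cell for cell in next(reader) if cell]  # drop trailing ''
    n_samples, n_features = int(header[0]), int(header[1])
    target_names = header[2:4]
    feature_names = header[4:4 + n_features]
    data, target = [], []
    for row in reader:
        if not row:
            continue
        data.append([float(value) for value in row[:n_features]])
        target.append(row[n_features])
    if len(data) != n_samples:
        raise ValueError("header and body disagree on the number of cases")
    return {"data": data, "target": target,
            "feature_names": feature_names, "target_names": target_names}

# Made-up two-instance file in the same layout
sample = io.StringIO(
    "2,3,Paraph,Non,cosine,jaccard,id,class,\n"
    "0.8, 0.5, 0, 'yes'\n"
    "nan, 0.1, 1, 'no'\n"
)
result = load_similarity_csv(sample)
print(result["feature_names"], result["target"])
```

Reading the counts from the header lets the loader check that the file is complete before any model is trained on it.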

## Generating complete MSRPC from txt to csv

This step is only needed once. It is a slow process, so the recommendation is to run it on a powerful machine.

In [8]:
from scripts.datasets import msrpc_to_csv

In [3]:
ti = time.time()
msrpc_to_csv('data/MSRPC-2004/msrpc.txt')
tf = time.time()-ti
print('Total time:',tf)

Total time: 826.1204106807709


## Playful Programming

### Example of Parallel version of Corpus construction code

The code of the parallel_process function was taken from [Dan Shiebler's blog](http://danshiebler.com/2016-09-14-parallel-progress-bar/).
This version only generates the CSV format. For the ARFF format see __scripts.parallel_msrpc_arff.py__.
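
The idea behind `parallel_process` can be sketched with the standard library alone: submit one job per corpus row to a pool, collect the results as they finish, and record which rows failed. This is an illustrative sketch, not the code in `scripts.parallel`; `parallel_map`, `toy_similarity` and the `executor_cls` parameter are hypothetical names, and the toy word-overlap function stands in for `calc_all`.

```python
from concurrent.futures import ProcessPoolExecutor, as_completed

def toy_similarity(row, sent1, sent2):
    """Stand-in for calc_all: word-overlap (Jaccard) of two sentences."""
    words1, words2 = set(sent1.split()), set(sent2.split())
    return row, len(words1 & words2) / len(words1 | words2)

def parallel_map(function, jobs, n_jobs=4, executor_cls=ProcessPoolExecutor):
    """Run function(**job) for every job; return (results, failed_rows)."""
    results, failed = [], []
    with executor_cls(max_workers=n_jobs) as pool:
        futures = {pool.submit(function, **job): job["row"] for job in jobs}
        for future in as_completed(futures):
            try:
                results.append(future.result())
            except Exception:
                # Keep the row id so failures can be inspected later
                failed.append(futures[future])
    return sorted(results), sorted(failed)

if __name__ == "__main__":
    jobs = [
        {"row": 0, "sent1": "the cat sat", "sent2": "the cat slept"},
        {"row": 1, "sent1": "", "sent2": ""},  # empty pair: division by zero
    ]
    print(parallel_map(toy_similarity, jobs, n_jobs=2))
```

On platforms that start worker processes by spawning a fresh interpreter, the worker function has to live in an importable module, which is one reason to keep such functions in a file like `scripts/parallel.py` rather than in a notebook cell.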

In [2]:
from concurrent.futures import ProcessPoolExecutor, as_completed
import textsim
from textsim.utils import calc_all
from tqdm import tqdm

### Loading MSRPC.txt

Like in __scripts.datasets.msrpc_to_csv()__

In [3]:
#Read Paraphrase Corpus in TXT format
file_path = 'data/MSRPC-2004/msrpc.txt'
columns = ['class','id1','id2','sent1', 'sent2']

ti = time.time()
rows = []
loading_except = []

with open(file_path) as corp:
    for count, row in enumerate(corp):
        if count == 0: #skip the header line
            continue
        obj = row.split('\t')
        if len(obj) == len(columns):
            rows.append(obj)
        else: #malformed line: remember its line number
            loading_except.append(count)

#Build the DataFrame once; DataFrame.append was removed in pandas 2.0
df = DataFrame(rows, columns=columns)

tm = time.time()-ti
print('MSRPC loaded in ', tm, ' seconds')

MSRPC loaded in  13.015718698501587  seconds


In [4]:
from tqdm import tqdm
from scripts.parallel import parallel_process, return_obj
import time

distances = []
exceptions = []
ti = time.time()
#Open target file for new corpus & write
with open('data/MSRPC-2004/parallel_msrpc.csv','w') as corpus:    
    for distance in sorted(textsim.__all_distances__.keys()):
        distances.append(distance)
    distances.append('id')
    distances.append('class')
    corpus.write(','.join(str(elem) for elem in distances)+'\n')

#Parallel trick for this problem
#Use positional .iloc access; integer keys on a labelled Series are deprecated in recent pandas
arr = [{'row':i, 'sent1':df.iloc[i,3], 'sent2':df.iloc[i,4], 'clase':df.iloc[i,0]} for i in range(len(df))]
exceptions = parallel_process(arr, return_obj, use_kwargs=True)
    
tf = time.time()-ti
print('Total time:',tf)

100%|██████████| 5.80k/5.80k [05:21<00:00, 18.0it/s]
5798it [00:00, 634502.43it/s]

Total time: 324.66868019104004





In [5]:
df_excepts = DataFrame(exceptions)
df_excepts = df_excepts.dropna()
print('Exceptions:',len(df_excepts),'\nValues:')
df_excepts.T #Transpose only to get a better visualization of exceptions in one row

Exceptions: 15 
Values:


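|   | 48 | 143 | 697 | 1019 | 1470 | 2695 | 2826 | 3087 | 3394 | 3658 | 3920 | 3990 | 4308 | 4403 | 4451 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 48.0 | 143.0 | 697.0 | 1019.0 | 1470.0 | 2695.0 | 2826.0 | 3087.0 | 3394.0 | 3658.0 | 3920.0 | 3990.0 | 4308.0 | 4403.0 | 4451.0 |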


Original exceptions in msrpc_test subcorpus (len = 1725 cases)

    exceptions = [16, 35, 48, 143, 176, 217, 228, 251, 279, 281, 322, 378, 408, 429, 430, 446, 475, 525, 572, 576, 588, 662, 697, 701, 731, 745, 766, 773, 824, 843, 897, 924, 953, 1003, 1014, 1019, 1039, 1059, 1077, 1151, 1186, 1209, 1216, 1236, 1413, 1425, 1467, 1470, 1527, 1607, 1663, 1666, 1677, 1717]

In [6]:
print(list(df_excepts.index)) #to get the same values but in int type

[48, 143, 697, 1019, 1470, 2695, 2826, 3087, 3394, 3658, 3920, 3990, 4308, 4403, 4451]


# Conclusions

* The tqdm library is excellent for showing progress bars in both single-core
  and parallel code, and lets you know exactly how many
  instances are still pending in the cell that is currently running.
* The original parallel solution on a DELL i7 first-generation laptop gave a speed-up of 1.7x (6.7 minutes became 4 minutes), tested only on the msrpc_test subcorpus (1725 cases).
* The average speed-up of the second parallel solution on an HP i7 8th-generation laptop was 2.2x (e.g. 826 seconds for the non-parallel run versus 376 seconds for this solution).
* Changing the method used to read msrpc.txt greatly reduced the number of exceptions:
    - 54 in msrpc_test using the pandas __read_table__ method.
    - 15 in the whole corpus when reading the file line by line with __open()__ and __split('\t')__.
* A Cython implementation should improve the distance calculation time.
* The numbers show that the first-generation DELL i7 is 0.5x slower than the 8th-generation HP, with the same number of jobs (4).
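
The speed-up figures above are simply the serial time divided by the parallel time. A small sketch of that arithmetic, using the timings quoted in the list (`speedup` is a hypothetical helper name):

```python
def speedup(serial_seconds, parallel_seconds):
    """Speed-up factor: how many times faster the parallel run is."""
    return serial_seconds / parallel_seconds

# DELL laptop, msrpc_test only: 6.7 minutes versus 4 minutes
dell = speedup(6.7 * 60, 4 * 60)
# HP laptop, whole corpus: 826 seconds versus 376 seconds
hp = speedup(826, 376)
print(round(dell, 1), round(hp, 1))
```

Rounded to one decimal, these give the 1.7x and 2.2x reported for the two laptops.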

# Recommendations

* Read the pandas.io module documentation to get a solid idea of how to work with input/output operations.
* Read the scipy.io module documentation to understand the ARFF reader implemented inside that library.
* Implement the measures in Cython, because knowledge-based and corpus-based measures are very slow. The first kind, based on WordNet, is very complicated to use on a corpus of 10K cases. The second needs a big corpus and neural network training to get results.
* It is important to review the distances and algorithms implemented in spaCy, to assess whether some of them are more suitable and would avoid depending on distances from other packages.

# Questions

1. Make a parallel version of msrpc_to_csv function.
2. Make a GPU compatible version of msrpc_to_csv function.
