<small><i>This notebook was put together by [Abel Meneses-Abad](http://www.menesesabad.com) for Paper *Paraphrase Beyond Sentence*. Source and license info is on [GitHub](https://github.com/sorice/2017paraphrasebsent/).</i></small>

# Constructing a Corpus of Similarity Objects

With a good set of text similarity functions in place, the next resource needed is a corpus of similarity objects that can be fed to the scikit-learn machine learning library. The formats most frequently used with sklearn are CSV and ARFF. ARFF files are also useful in Weka, for example to compare the 10 most important features extracted on the Python platform against those from the Java platform, and to compare running times.

The next cells generate a corpus of similarity vectors. These cells have been converted into script module functions for use in the following notebooks, but since the main objective of this collection is to teach, we keep the first successful implementation here for student questions and quizzes.

**Note:** The functions mentioned above are *script.datasets.msrpc_to_csv* and *script.datasets.msrpc_to_arff*.

**Note:** The textsim package can be downloaded from [GitHub](https://github.com/sorice/textsim).

In [1]:
import sys
sys.path.append('/home/abelm')
from tqdm import tqdm

import pandas as pd
from pandas import DataFrame, Series, read_table
import textsim
from textsim.utils import calc_all
import arff
import time

#Read Paraphrase Corpus
df = read_table('data/MSRPC-2004/msr_paraphrase_test.txt',sep='\t')
data = []
distances = []
exceptions = []
ti = time.time()
#Open vector similarity feature corpus
with open('data/MSRPC-2004/msrpc_test_textsim-42fb.arff','w') as corpus: 
    corpus.write('@relation paraphrase\n\n')
    for distance in sorted(textsim.__all_distances__.keys()):
        corpus.write('@attribute '+distance+' numeric\n')
        distances.append(distance)
    corpus.write('@attribute '+'id'+' integer\n')
    distances.append('id')
    corpus.write('@attribute class {yes,no}\n\n')
    distances.append('class')
    corpus.write('@data\n')
    
    for row in tqdm(range(len(df))):
        clase, ide1, ide2, sent1, sent2 = df.xs(row)
        try:
            obj = calc_all(sent1,sent2)[2:]
            sec = ''
            for item in obj:
                if str(item) == 'nan':
                    sec += '?,'
                else:
                    sec += str(item)+','
            sec += str(row) #append id for future analysis after classification
            obj.append(row)
            if clase:
                corpus.write(sec+',yes\n')
                obj.append('yes')
            else:
                corpus.write(sec+',no\n')
                obj.append('no')
            data.append(obj)
            
        except Exception:  #record the row of any pair that fails, without swallowing KeyboardInterrupt
            exceptions.append(row)

tf = time.time()-ti
print('Total time:',tf)
print('Exceptions:',len(exceptions),'\nValues:',exceptions)

100%|██████████| 1725/1725 [06:41<00:00,  3.40it/s]

Total time: 401.63975954055786
Exceptions: 54 
Values: [16, 35, 48, 143, 176, 217, 228, 251, 279, 281, 322, 378, 408, 429, 430, 446, 475, 525, 572, 576, 588, 662, 697, 701, 731, 745, 766, 773, 824, 843, 897, 924, 953, 1003, 1014, 1019, 1039, 1059, 1077, 1151, 1186, 1209, 1216, 1236, 1413, 1425, 1467, 1470, 1527, 1607, 1663, 1666, 1677, 1717]





*Runs*

1. time = 379.49773502349854, valid_data = 1671
2. time = 381, valid_data = 1671
3. time = 415.97, valid_data = 1671 (last version that generates a correct MSRP ARFF file)
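
The loop above detects missing values with `str(item) == 'nan'` and writes them as `?`, the ARFF marker for a missing value. A minimal sketch of the same row formatting using `math.isnan` follows. The helper name `format_arff_row` and the sample values are hypothetical, and the sketch assumes the scores are plain floats.

```python
import math

def format_arff_row(scores, row_id, is_paraphrase):
    """Build one ARFF data line: the scores, then the row id, then the class label."""
    fields = []
    for value in scores:
        # ARFF marks a missing value with '?'
        if isinstance(value, float) and math.isnan(value):
            fields.append('?')
        else:
            fields.append(str(value))
    fields.append(str(row_id))  # id kept for analysis after classification
    fields.append('yes' if is_paraphrase else 'no')
    return ','.join(fields)

# Hypothetical scores for one sentence pair, with one missing value
line = format_arff_row([0.5, float('nan'), 1.0], row_id=7, is_paraphrase=1)
print(line)
```

Building a list of fields and joining it once avoids the trailing-comma bookkeeping of repeated string concatenation.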

## CSV Data Generation

Sklearn frequently works with the CSV format.

In [7]:
def corpus_to_csv(file_path='data/MSRPC-2004/msrpc_test_textsim-42f.csv'):
    with open(file_path,'w') as corpus: #Open vector similarity feature corpus
        corpus.write(str(len(data))+',')
        corpus.write(str(len(distances)-1)+',')
        corpus.write('Paraph,Non,')
        for distance in distances:
            corpus.write(distance+',')
        corpus.write('\n')
        for instance in data:
            corpus.write(str(instance)[1:-1]+'\n')
    return
            
corpus_to_csv()
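
Note that `str(instance)[1:-1]` keeps the quotes around the class label and a space after every comma. As an alternative sketch, the standard `csv` module can write the same kind of rows. The function name `write_similarity_csv`, the column names and the toy rows below are hypothetical, and the header here is a plain column-name row rather than the count-prefixed header written above.

```python
import csv
import io

def write_similarity_csv(handle, columns, rows):
    """Write a header row followed by one line per similarity vector."""
    writer = csv.writer(handle, lineterminator='\n')
    writer.writerow(columns)
    writer.writerows(rows)

# Hypothetical feature names and values, for illustration only
columns = ['jaccard', 'cosine', 'id', 'class']
rows = [[0.5, 0.75, 0, 'yes'], [0.1, 0.2, 1, 'no']]

buffer = io.StringIO()  # stands in for an open file
write_similarity_csv(buffer, columns, rows)
print(buffer.getvalue())
```

Passing an open file handle instead of the `StringIO` buffer writes the same content to disk.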

## Parallel Version of the Corpus Construction Code

The `parallel_process` function is taken from [Dan Shiebler's blog](http://danshiebler.com/2016-09-14-parallel-progress-bar/).
This version of the code only generates the ARFF format.

In [12]:
import sys
sys.path.append('/home/abelm')

from tqdm import tqdm
from concurrent.futures import ProcessPoolExecutor, as_completed
import pandas as pd
from pandas import DataFrame, Series, read_table
import time
import textsim
from textsim.utils import calc_all

def parallel_process(array, function, n_jobs=4, use_kwargs=False, front_num=3):
    """
        A parallel version of the map function with a progress bar. 

        Args:
            array (array-like): An array to iterate over.
            function (function): A python function to apply to the elements of array
            n_jobs (int, default=4): The number of worker processes to use
            use_kwargs (boolean, default=False): Whether to consider the elements of array as dictionaries of 
                keyword arguments to function 
            front_num (int, default=3): The number of iterations to run serially before kicking off the parallel job. 
                Useful for catching bugs
        Returns:
            [function(array[0]), function(array[1]), ...]
    """
    #We run the first few iterations serially to catch bugs
    front = []  #stays empty when front_num is 0, so `front + ...` below still works
    if front_num > 0:
        front = [function(**a) if use_kwargs else function(a) for a in array[:front_num]]
    #If we set n_jobs to 1, just run a list comprehension. This is useful for benchmarking and debugging.
    if n_jobs==1:
        return front + [function(**a) if use_kwargs else function(a) for a in tqdm(array[front_num:])]
    #Assemble the workers
    with ProcessPoolExecutor(max_workers=n_jobs) as pool:
        #Pass the elements of array into function
        if use_kwargs:
            futures = [pool.submit(function, **a) for a in array[front_num:]]
        else:
            futures = [pool.submit(function, a) for a in array[front_num:]]
        kwargs = {
            'total': len(futures),
            'unit': 'it',
            'unit_scale': True,
            'leave': True
        }
        #Print out the progress as tasks complete
        for f in tqdm(as_completed(futures), **kwargs):
            pass
    out = []
    #Get the results from the futures. 
    for i, future in tqdm(enumerate(futures)):
        try:
            out.append(future.result())
        except Exception as e:
            out.append(e)
    return front + out

In [13]:
def return_obj(row, sent1, sent2, clase):
    try:
        obj = calc_all(sent1,sent2)[2:]
        sec = ''
        for item in obj:
            if str(item) == 'nan':
                sec += '?,'
            else:
                sec += str(item)+','
        sec += str(row) #append id for future analysis after classification
        if clase:
            with open('data/MSRPC-2004/msrpc_test_textsim-42fb.arff','a') as corpus: 
                corpus.write(sec+',yes\n')
        else:
            with open('data/MSRPC-2004/msrpc_test_textsim-42fb.arff','a') as corpus: 
                corpus.write(sec+',no\n')

    except Exception:  #return the row id so the caller can list failed pairs
        return row

#Read Paraphrase Corpus
df = read_table('data/MSRPC-2004/msr_paraphrase_test.txt',sep='\t')

distances = []
exceptions = []
ti = time.time()
#Open vector similarity feature corpus
with open('data/MSRPC-2004/msrpc_test_textsim-42fb.arff','w') as corpus: 
    corpus.write('@relation paraphrase\n\n')
    for distance in sorted(textsim.__all_distances__.keys()):
        corpus.write('@attribute '+distance+' numeric\n')
        distances.append(distance)
    corpus.write('@attribute '+'id'+' integer\n')
    distances.append('id')
    corpus.write('@attribute class {yes,no}\n\n')
    distances.append('class')
    corpus.write('@data\n')

#Parallel trick for this problem
arr = [{'row':i, 'sent1':df.iloc[i,3], 'sent2':df.iloc[i,4], 'clase':df.iloc[i,0]} for i in range(len(df))]
exceptions = parallel_process(arr, return_obj, use_kwargs=True)
    
tf = time.time()-ti
print('Total time:',tf)

100%|██████████| 1.72K/1.72K [03:58<00:00, 6.37it/s]
1722it [00:00, 237898.27it/s]

Total time: 240.59578776359558





In [14]:
df_excepts = DataFrame(exceptions)
df_excepts = df_excepts.dropna()
print('Exceptions:',len(df_excepts),'\nValues:')
df_excepts.T #Transpose only to get a better visualization of exceptions in one row

Exceptions: 54 
Values:


Unnamed: 0,16,35,48,143,176,217,228,251,279,281,...,1413,1425,1467,1470,1527,1607,1663,1666,1677,1717
0,16,35,48,143,176,217,228,251,279,281,...,1413,1425,1467,1470,1527,1607,1663,1666,1677,1717


In [15]:
print(list(df_excepts.index)) #to get the same values but in int type

[16, 35, 48, 143, 176, 217, 228, 251, 279, 281, 322, 378, 408, 429, 430, 446, 475, 525, 572, 576, 588, 662, 697, 701, 731, 745, 766, 773, 824, 843, 897, 924, 953, 1003, 1014, 1019, 1039, 1059, 1077, 1151, 1186, 1209, 1216, 1236, 1413, 1425, 1467, 1470, 1527, 1607, 1663, 1666, 1677, 1717]
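
In `return_obj` above, every worker process appends to the same ARFF file, so the order of the rows in the file depends on which task finishes first. A possible alternative, sketched below, is to let the workers return their results and write the file once in the parent process. Everything here is an assumption for illustration: `score_pair` is a hypothetical stand-in for `calc_all` (a word-overlap ratio), and `collect_rows` and its `executor_cls` parameter are not part of textsim or of the original code.

```python
from concurrent.futures import ProcessPoolExecutor

def score_pair(task):
    """Hypothetical stand-in for calc_all: returns (row, scores), or (row, None) on failure."""
    row, sent1, sent2 = task
    try:
        a, b = set(sent1.split()), set(sent2.split())
        return row, [len(a & b) / len(a | b)]  # fails on two empty sentences
    except Exception:
        return row, None

def collect_rows(tasks, n_jobs=4, executor_cls=ProcessPoolExecutor):
    """Score all tasks in parallel; return (scored rows in input order, failed row ids)."""
    with executor_cls(max_workers=n_jobs) as pool:
        results = list(pool.map(score_pair, tasks))  # map preserves input order
    done = [(row, scores) for row, scores in results if scores is not None]
    failed = [row for row, scores in results if scores is None]
    return done, failed
```

The parent can then loop over `done` and write each line to the ARFF file from a single place, and `failed` plays the role of the `exceptions` list. On platforms that start worker processes with spawn, the call to `collect_rows` should sit under `if __name__ == '__main__':`.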


# Cython Version of Corpus construction code

In this case, the Cython implementation of the distances must live inside the textsim library itself, as is done in the **nlpnet** package.

# Conclusions

* The tqdm library is excellent for showing a progress bar in both single-core
  and parallel solutions, and tells you exactly how many
  instances are still pending in the running cell.
* The parallel solution gives a speedup of about 1.7x (6.7 minutes down to 4 minutes).
* A Cython implementation should improve on this further.
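
The speedup figure can be checked from the two "Total time" values printed above. The efficiency line assumes the parallel run used the default `n_jobs=4` of `parallel_process`, since the call does not override it.

```python
serial_seconds = 401.63975954055786    # "Total time" of the single-core run
parallel_seconds = 240.59578776359558  # "Total time" of the parallel run

speedup = serial_seconds / parallel_seconds
efficiency = speedup / 4  # assumed n_jobs=4 workers
print(round(speedup, 2), round(efficiency, 2))
```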

# Recommendations

* Read the pandas.io module documentation to get a solid idea of how to handle input/output operations.
* Read the scipy.io module documentation to understand the ARFF reading implemented inside that library.

# Questions

1. Make a parallel version of CSV corpus generation.


# References and Resources

<a id='Scipy2012'></a>
* [Scipy2012] Scipy Community, Manual "SciPy Reference Guide", 2012.