<small><i>This notebook was put together by [Abel Meneses-Abad](http://www.menesesabad.com) for SciPy LA Habana 2017. Source and license info is on [github repository](http://github.com/sorice/simtext_scipyla2017).</i></small>

# Paragraph Semantic Text Similarity Corpus (PSTS Corpus)

## Transforming PlagDet into a Paraphrase Identification Corpus

The objective of this notebook is to describe how to convert a classic plagiarism detection corpus (sometimes called a *text-reuse corpus*) into a paraphrase identification corpus based on fragment pairs.

## Plagiarism Detection Corpus

The original Plagiarism Detection Corpus of PAN-13 has two parts, the training set and the test set.
Both have the following structure:

    PAN-13-text-alignment-corpus
        pairs                           -> list of (susp, src) file-name pairs to compare
        susp/                           -> susp directory containing all suspicious texts
        src/                            -> src directory containing all text reuse source files
        01-no-plagiarism/               -> a directory containing an XML file per no-plag case in the pairs file
        02-no-obfuscation/              -> a directory containing an XML file per copy-paste case in the pairs file
        03-random-obfuscation/          -> a directory containing an XML file per random paraphrase case in the pairs file
        04-translation-obfuscation/     -> a directory containing an XML file per cross-lingual text-reuse case in the pairs file
        05-summary-obfuscation/         -> a directory containing an XML file per paraphrase case of summary type in the pairs file
        
Here is an example of the XML structure of a case, [suspicious-document00007-source-document00382.xml](files/data/PAN-PC-2013/orig/03-random-obfuscation/suspicious-document00007-source-document00382.xml):

<body>
<pre style="color:#1f1c1b;background-color:#ffffff;">
<b>&lt;document</b><span style="color:#006e28;"> reference=</span><span style="color:#aa0000;">&quot;suspicious-document00007.txt&quot;</span><b>&gt;</b>
<b>&lt;feature</b><span style="color:#006e28;"> name=</span><span style="color:#aa0000;">&quot;plagiarism&quot;</span><span style="color:#006e28;"> obfuscation=</span><span style="color:#aa0000;">&quot;random&quot;</span><span style="color:#006e28;"> obfuscation_degree=</span><span style="color:#aa0000;">&quot;0.4694788492120119&quot;</span><span style="color:#006e28;"> source_length=</span><span style="color:#aa0000;">&quot;453&quot;</span><span style="color:#006e28;"> source_offset=</span><span style="color:#aa0000;">&quot;0&quot;</span><span style="color:#006e28;"> source_reference=</span><span style="color:#aa0000;">&quot;source-document00382.txt&quot;</span><span style="color:#006e28;"> this_length=</span><span style="color:#aa0000;">&quot;453&quot;</span><span style="color:#006e28;"> this_offset=</span><span style="color:#aa0000;">&quot;9449&quot;</span><span style="color:#006e28;"> type=</span><span style="color:#aa0000;">&quot;artificial&quot;</span> <b>/&gt;</b>
<b>&lt;/document</b><b>&gt;</b>
</pre>
</body>

As you can see, this case refers to two documents (suspicious-document00007.txt and source-document00382.txt) and, inside each one, to a fragment. The XML records the *paraphrase* type (called *obfuscation* in this corpus), the fragment boundaries (*offset*, *length*) in both documents, the degree of obfuscation, and the way in which the case was generated. After some processing, both text fragments can be displayed.

**Note:** some XMLs of this corpus may contain more than one pair of fragments.
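
As a minimal sketch of how such a case could be read, the snippet below parses the example above with the standard library. The helper name `read_fragment_pairs` and the dictionary keys are hypothetical; the notebook itself relies on `PANXml_Reader` from its own `scripts` package, whose internals are not shown here. Looping over every `feature` element also covers XMLs with more than one pair of fragments.

```python
import xml.etree.ElementTree as ET

# The same case as above, inlined so the sketch runs without the corpus files.
CASE_XML = """<document reference="suspicious-document00007.txt">
<feature name="plagiarism" obfuscation="random" obfuscation_degree="0.4694788492120119" source_length="453" source_offset="0" source_reference="source-document00382.txt" this_length="453" this_offset="9449" type="artificial" />
</document>"""

def read_fragment_pairs(xml_text):
    """Return one dict per plagiarism <feature> element (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    pairs = []
    for feature in root.iter('feature'):
        if feature.get('name') != 'plagiarism':
            continue
        pairs.append({
            'susp': root.get('reference'),
            'src': feature.get('source_reference'),
            'susp_offset': int(feature.get('this_offset')),
            'susp_length': int(feature.get('this_length')),
            'src_offset': int(feature.get('source_offset')),
            'src_length': int(feature.get('source_length')),
            'obfuscation': feature.get('obfuscation'),
        })
    return pairs

pairs = read_fragment_pairs(CASE_XML)
print(pairs[0]['susp_offset'], pairs[0]['susp_length'])  # prints: 9449 453
```

An XML without `feature` elements, such as the non-plagiarism cases, simply yields an empty list.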

In [3]:
%run scripts/corpusReader.py

0,1
Susp,Src
"Special@ tamu. edu DJ of Regulation storage tissue and the use from Cadet of crop improvement. Hannapel Plant, Miller Marchetti& Park wd (1985): 700-703 plant of Potato Acid Physiol78 Accumulation by wdpark Tuber. Manipulation Protein Publications list of gibberellic Patents submitted, am, MA JC, wd Park (1998) release in biotechnology and Jacinto, two long rice grain varieties having pubmed processing quality. Mcclung for Plant Variety Protection","wdpark@tamu.edu Manipulation of plant storage tissue and the use of biotechnology in crop improvement. Hannapel DJ, Miller JC & Park WD (1985) : 700-703 Regulation of Potato Tuber Protein Accumulation by Gibberellic Acid. Plant Physiol78  Publications list from Pubmed Patents McClung, AM, MA Marchetti, WD Park (1998) Release of Cadet and Jacinto, two long grain rice varieties having special processing quality. Submitted for Plant Variety Protection"


## Paraphrase Identification Corpus

A classic paraphrase identification corpus may structure its cases in different ways. A common structure is:

    id class sentence-1 sentence-2
    
The class is either *0* (*non-paraphrase*) or *1* (*paraphrase*).
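
A minimal sketch of reading one case in this layout, assuming tab-separated fields (the separator, the class name `ParaphraseCase` and the sample sentences are illustrative assumptions, not part of any specific corpus):

```python
from typing import NamedTuple

class ParaphraseCase(NamedTuple):
    case_id: str
    label: int        # 1 = paraphrase, 0 = non-paraphrase
    sentence_1: str
    sentence_2: str

def parse_case(line):
    """Split one tab-separated line into its four fields (assumed layout)."""
    case_id, label, s1, s2 = line.rstrip('\n').split('\t')
    return ParaphraseCase(case_id, int(label), s1, s2)

case = parse_case('42\t1\tThe cat sat on the mat.\tA cat was sitting on the mat.\n')
print(case.label, case.sentence_2)
```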

## Overview of the Problem of Plagiarism to Paraphrase Corpus Transformation

Broadly speaking, in plagiarism detection you must detect (or extract) the two fragments (suspicious and source) using some approach (citation measures, word fingerprints, n-grams, etc.); the problem is to find the boundaries of both fragments, which usually lie inside large documents.

In paraphrase detection, by contrast, you must decide whether two sentences are paraphrases of each other. A very common technique here is to convert the original structure of a case into a machine learning object: a vector of features based on the *$original_{sentence}$* and the *$paraphrased_{sentence}$*, plus the class _paraphrased/non-paraphrased_.

International linguistic research on paraphrase offers a wide range of definitions, so we fix the concept used here:

**Paraphrase Definition:** *$class = 1$ (paraphrased) if there is some kind of transformation that maintains a high degree of semantic similarity [<a href="#Vila2014" title="Is This a Paraphrase ? What Kind ? Paraphrase Boundaries and Typology"> (Vila2014, p. 6)</a>](#Vila2014), and $class = 0$ (non-paraphrased) if both texts are dissimilar, even when they belong to the same semantic field but differ in meaning to some degree.*

After the normalization evaluation (see the resulting structure in the [Normalization-Alignment-Quality Notebook](02.3-Eval-Normalization-Alignment-Quality.ipynb)), the purpose of this step of the pipeline is to obtain a corpus with the following structure:
<p><font color='#F84825'>
 $(case_{id}, text_{fragment_{1}}, text_{fragment_{2}}, binary\,class)$
</font>
<p>Then in the next notebooks <font color='#F84825'>this structure</font> will be used to build the feature vector representation that is fed to machine learning.

As a reminder, these are the structures generated in the previous steps:

* Output structure after alignment subprocess:

$(id_K,normalized-sentence_K,original\,offset_{sentence\,K},original\,offset+length_{sentence\,K})$

* Output structure after quality norm subprocess:

$(id_{sentence_P\,susp},offset_{sentence_P},offset+length_{sentence_P},\%\,sentence_{P}\, \in\,susp_{fragment\,X},id_{fragment\,X})$

## A New Paraphrase Identification Corpus at Fragment Level

### Generating TRUE Cases

In [1]:
from scripts import PANXml_Reader
import pandas as pd
import time

xmlColecctionPath = 'data/orig/xml/'
alignedCollectionPath = 'data/aligned/'
origCollectionPath = 'data/orig/'

timei = time.time()
with open('data/aligned/aligned_pairs') as casePairs:
    for docs in casePairs:
        #print(docs)
        susp, src = docs.split()
        #print(susp)
        xmlDoc = PANXml_Reader(xmlColecctionPath+susp[:-4]+'-'+src[:-4]+'.xml')
        fragmentList = xmlDoc.parser()
        newCase = {}
        
        #Analyse the fragment pairs list in the xml case
        if fragmentList != []: #skip XMLs without fragment pairs (non-paraphrased cases)
            
            #Load Quality Matrix per case
            QM = pd.read_csv(alignedCollectionPath+'quality/'+susp+' '+src,
                           names=['sentID','offset','length','percent','FragID'],
                           delimiter = '\t')
            
            for id, frag in enumerate(fragmentList):
                text = {'susp/':'','src/':''}

                #For every doc in the pair
                for doc,file_type in zip([susp,src],['susp/','src/']):
                    targetID = int(str(id+1)+doc[-9:-4])
                    docText = open(origCollectionPath+file_type+doc)
                    offsetf = len(docText.read())
                    docText.close()
                    lenf = 0

                    #Load aligned matrix for doc
                    AM = pd.read_csv(alignedCollectionPath+file_type+doc,
                                     names=['id','sent','offset','length'], 
                                     sep='\t')
                    
                    #Join correspondent aligned sentences in a single fragment
                    for idx in QM.index:
                        if QM.FragID[idx] == targetID:
                            offsetf = min(offsetf,QM.offset[idx])
                            lenf = max(lenf,QM.length[idx])
                            #print(QM.sentID[idx])
                            #print(AM.sent[QM.sentID[idx]])
                            text[file_type] +=  AM.sent[QM.sentID[idx]]+' '

                #Take both created fragment per doc and create a pair fragment case
                newCaseID = str(id+1)+susp[-9:-4]+src[-9:-4]
                
                caseClass = 1
                newCase[newCaseID] = ''.join([str(newCaseID),'\t',text['susp/']+'\t',
                                        text['src/']+'\t',str(caseClass)+'\t',
                                        frag.suspOffset+'\t',frag.suspLength+'\t',
                                        frag.srcOffset+'\t',frag.srcLength+'\n'])
                                        
            
            #Write the positive cases corpus
            with open('data/true_pairs','a') as paraphCorpus:
                for value in newCase.values():
                    paraphCorpus.write(value)
print('Total time:', time.time() - timei)

KeyboardInterrupt: 

## Generation of Non-Paraphrased Cases Problem

The last difficult problem to describe is the emptiness of the *non-plagiarism* XML cases (contained in the *data/PAN-PC-2013/orig/01-no-plagiarism/* folder). These XMLs have an empty structure; only the XML file name tells you which two texts have no similarities (see the [suspicious-document00017-source-document00534.xml](files/data/PAN-PC-2013/orig/01-no-plagiarism/suspicious-document00017-source-document00534.xml) example below). How can this be solved?

<body>
<pre style='color:#1f1c1b;background-color:#ffffff;'>
<b>&lt;document</b><span style='color:#006e28;'> reference=</span><span style='color:#aa0000;'>&quot;suspicious-document00017.txt&quot;</span><b>&gt;</b>
<b>&lt;/document</b><b>&gt;</b>
</pre>
</body>

Once we have both dissimilar texts, we must select two fragments that share some shallow properties (e.g. close vocabulary) with the positive cases. Why must they share some properties? Because negative cases that superficially resemble the positive ones help to identify, in the following phases, the features that are able to capture deep semantic similarity. Regarding the machine learning modeling, an unbalanced corpus is proposed, with 66% non-paraphrased cases.
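
To make the proposed proportion concrete, here is a small arithmetic sketch. If negatives should make up a share $s$ of the corpus and there are $n_{pos}$ positive cases, then $n_{neg} = n_{pos} \cdot s / (1 - s)$. The function name `negatives_needed` and the figure of 1000 positive cases are illustrative assumptions, not numbers taken from the corpus.

```python
import math

def negatives_needed(n_positive, negative_share):
    """Negatives required so that they make up `negative_share` of the corpus."""
    return math.ceil(n_positive * negative_share / (1.0 - negative_share))

n_pos = 1000                       # illustrative number of true cases
n_neg = negatives_needed(n_pos, 0.66)
print(n_neg, round(n_neg / (n_neg + n_pos), 3))
```

In other words, a 66% share means roughly two negative cases for every positive one.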

__Note__: Another approach to *non-paraphrased cases* could be to use a set of copy-paste cases (similar pairs of text that are not paraphrased). For this alternative analysis, and related ones, the author proposes a set of experiments described in a separate notebook that is not part of this tutorial.

### Details of Non-Paraphrased Cases Generator Algorithm

The list of non-paraphrased document pairs is stored in:

    data/aligned/false_pairs

## Slow Generation of Non-Paraphrase Cases Collection

This solution consumes a lot of RAM and computing time. It is based on pre-calculating 'all' similarity scores between 'all' possible fragments in every document (joining all consecutive sentences). In the end, it rests on a misconception about what is right or wrong when trying to avoid influences from the experiment design.

In [1]:
%run scripts/02.4_nonParaphrasedCasesGeneration.py 
                data/PAN-PC-2013/aligned/FALSE_paraph_aligned_pairs 
                data/PAN-FPC-2017/PAN-True-Paraphrase-Corpus 
                data/PAN-PC-2013/aligned/susp/ 
                data/PAN-PC-2013/aligned/src 
                data/PAN-FPC-2017/

ERROR: File `'scripts/02.5_nonParaphrasedCasesGenerationj.py'` not found.


## Random Fast-Generation of Non-Paraphrase Cases Collection

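This variant is less complex:
- Take all false pairs of documents (the ones whose XML files are empty)
- Select a random true pair
- Get two fragments of similar length (the code below draws each length between 70% and 130% of the true fragment length)
- Write the texts to the non-paraphrase corpus (*data/false_pairs* in the code below)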
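
The sampling step can be isolated in a few lines. This is only a sketch under stated assumptions: the function name `sample_fragment`, the `tolerance` parameter and the final clamp to the end of the document are not part of the notebook code, which performs the equivalent draws inline with `choice(range(...))`.

```python
import random

def sample_fragment(text_len, target_len, tolerance=0.3, rng=random):
    """Pick a random (offset, length) whose length is close to target_len."""
    if text_len <= target_len:
        return None  # document too short to host a fragment of this size
    offset = rng.randrange(text_len - target_len)
    low = int(target_len * (1 - tolerance))
    high = int(target_len * (1 + tolerance))
    length = rng.randint(low, high)
    # Assumed extra safety step: never run past the end of the document.
    length = min(length, text_len - offset)
    return offset, length

rng = random.Random(0)             # seeded only to make the sketch repeatable
samples = [sample_fragment(1000, 100, rng=rng) for _ in range(100)]
print(samples[:3])
```

The sampled character span still has to be expanded to whole sentences, which is what `get_aligned_frag` does in the cell below.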

In [81]:
def read_aligned_text(csv_file):
    return pd.read_csv(csv_file,
                       names=['id','sent','offset','length'], 
                       sep='\t')

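def get_aligned_frag(csv_file,offset,length):
    """Expand the character span (offset, length) to whole aligned sentences.

    Returns the joined sentence text plus the offset and length of the
    expanded span.
    """
    aligned = read_aligned_text(csv_file)
    condition = False   #True while the loop is inside the requested span
    text_result = ''
    rlength = 0; roffset = 0
    for idx in aligned.index:
        #first sentence that contains the requested offset
        if offset >= aligned.offset[idx] \
        and offset < aligned.offset[idx] + aligned.length[idx]:
            roffset = aligned.offset[idx]
            condition = True
        #first sentence that starts after the end of the requested span
        if offset+length < aligned.offset[idx] and condition:
            rlength = aligned.offset[idx]-1-roffset
            condition = False
        if condition == True:
            text_result += ''.join(aligned.sent[idx])
    #note: rlength stays 0 when the span runs up to the last sentence
    return text_result, roffset, rlength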
            

In [85]:
from random import choice
from os.path import isfile

dataPath = 'data/aligned/' 
falseCases = []
classValue = '0'

timei = time.time()
trueCases = pd.read_csv('data/true_pairs',
                       names=['id','susp','src','clase','suspOffset','suspLen','srcOffset','srcLen'], 
                       sep='\t')

pairs = range(len(trueCases))  
count=0
        
with open('data/orig/false_pairs') as falsePairs:
    for line in falsePairs:
        falseSusp,falseSrc = line.split()
        falseSuspText = open('data/norm/susp/'+falseSusp).read()
        falseSrcText = open('data/norm/src/'+falseSrc).read()
        false_frags = 1
        
        while(false_frags < 4):#get 3 different fragment pairs per false document pair
            i = choice(pairs)#get one random true pair
            suspLen = int(trueCases.suspLen[i])
            srcLen = int(trueCases.srcLen[i])
            
            #check if false text lengths are greater than trueCase len
            if len(falseSuspText) > suspLen and len(falseSrcText)> srcLen:

                #get random fragment inside the false susp text
                falseSuspOffset = choice(range(len(falseSuspText)-suspLen))
                falseSuspLen = choice(range(falseSuspOffset+int(suspLen*0.7),
                                            falseSuspOffset+int(suspLen*1.3)))-falseSuspOffset
                
                #get random fragment inside the false src text
                falseSrcOffset = choice(range(len(falseSrcText)-srcLen))
                falseSrcLen = choice(range(falseSrcOffset+int(srcLen*0.7),
                                           falseSrcOffset+int(srcLen*1.3)))-falseSrcOffset
                
                #get the current false pairs texts
                suspFragText,falseSuspOffset,falseSuspLen = get_aligned_frag('data/aligned/susp/'+
                                                                             falseSusp,
                                                                             falseSuspOffset,
                                                                             falseSuspLen)
                srcFragText, falseSrcOffset,falseSrcLen = get_aligned_frag('data/aligned/src/'+
                                                                           falseSrc,
                                                                           falseSrcOffset,
                                                                           falseSrcLen)
            
                #Make the tuple:
                #Take both created fragment per doc and create a pair fragment case
                caseID = str(false_frags)+falseSusp[-9:-4]+falseSrc[-9:-4]
                falseCases.append(tuple((caseID,
                                        suspFragText,srcFragText,
                                        classValue,
                                        str(falseSuspOffset),str(falseSuspLen),
                                        str(falseSrcOffset),str(falseSrcLen))))
                false_frags += 1
                count+=1
                if count%1000 == 0:
                    print('Preprocessed cases: ',count)
                
            else:
                pass
            
    with open('data/false_pairs','w') as falsePairs:
        for C in falseCases:
            falsePairs.write('\t'.join(C)+'\n')
    print('Finish-----added: ', len(falseCases), 'false cases')
print('Total time:', time.time() - timei)


Preprocessed cases:  1000
Preprocessed cases:  2000
Finish-----added:  2991 false cases
Total time: 17.078004837036133


## Integrating both parts of the Corpus

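**Note**: First inspect the corpus file (*PAN-Paraphrase-Corpus*). If it is empty, run the code below; otherwise use the existing file as is, or generate the corpus into a new file. The cell opens the file in append mode, so running it on a non-empty file would duplicate the header and the cases.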

In [86]:

with open('data/PSTSCorpus', 'a') as Corpus:
    Corpus.write('id\tsent1\tsent2\tclass\toffsetSusp\tlenSusp\toffsetSrc\tlenSrc\n') #inserting first row for further uses
    with open('data/false_pairs') as noneCorpus:
        for case in noneCorpus:
            Corpus.write(case)
    with open('data/true_pairs') as trueCorpus:
        for case in trueCorpus:
            Corpus.write(case)
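
As a quick sanity check, the integrated file can be loaded back with pandas to inspect the class balance. The sketch below uses two made-up rows in place of *data/PSTSCorpus* so that it runs on its own; disabling quote handling with `csv.QUOTE_NONE` is an assumption that fits free text containing stray quotation marks.

```python
import csv
import io
import pandas as pd

# Two toy rows standing in for data/PSTSCorpus (the contents are made up).
toy_corpus = io.StringIO(
    'id\tsent1\tsent2\tclass\toffsetSusp\tlenSusp\toffsetSrc\tlenSrc\n'
    '10000700382\tsome reused text\tthe source text\t1\t9449\t453\t0\t453\n'
    '10001700534\tunrelated text\tanother text\t0\t120\t300\t75\t310\n'
)

corpus = pd.read_csv(toy_corpus, sep='\t', dtype={'id': str},
                     quoting=csv.QUOTE_NONE)

# Share of each class (0 = non-paraphrased, 1 = paraphrased).
share = corpus['class'].value_counts(normalize=True)
print(share)
```

With the real file, replace `toy_corpus` with the path `'data/PSTSCorpus'`.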

## Background on Text Similarity Problems

Text similarity is a popular field of research with many problems that are very close in meaning but very different in practice. For a better understanding of this notebook, here is a short background on the main problems of this area, each with a short definition.

- __Semantic Text Similarity__: Given two sentences you must calculate the degree of similarity and classify them. Usually this is a multi-class problem with 6 classes.
- __Textual Entailment__: Identify if two texts are related in one direction (A entails B).
- __Text Similarity__: Given two text fragments you must identify if they are semantically related in both directions.
- __Text Alignment__: Given two different texts you must match every sentence in text A with its corresponding sentence in text B.
- __Paraphrase Identification__: Given two sentences you must classify if they are paraphrased or not (binary classification).
- __Text Reuse__: Detect reused fragments in a single text having a text collection as source.
- __Plagiarism Detection__ (_Text Reuse + Citation Analysis_): Detect in a text collection pairs of non-quoted fragments with the same meaning.
- __Machine Translation__: Align text pairs with same meaning but in a different language.
        
So the approach presented in this tutorial is a *Text Similarity* problem seen from the perspective of a *Paraphrase Identification* problem.

### Corpus of Text Reuse

PAN-PC / TNLP / Plagiarism Corpus / 

### Corpus of Paraphrase Identification

MSRPC / STS /

# Conclusions

The main objective of this notebook was accomplished:

    "After having the aligned normalized-texts, a new paraphrase corpus (binary category) was generated, based on chunks extracted from the xmls of PAN-PC corpus."
    
The true cases are generated first. This part of the process is simple and fast, because the chunk information is fully contained in the XMLs of the PAN-PC corpus.

However, the first version of the non-paraphrased cases (or false cases) must be constructed mathematically, due to the lack of information in the non-paraphrased XMLs of the PAN-PC corpus. The second version constructs almost 3 thousand cases based on random selections guided by the offsets and lengths of true pairs. This second version is faster and generates more credible cases.

# Recommendations

For future experiments, the best way to accomplish this task is to generate non-paraphrased pairs of texts manually, that is, designed by humans.

The ultimate purpose of this corpus should be to clarify whether selecting fragments larger than a sentence is more suitable for, or adds anything to, the process of plagiarism detection. The possible conclusions after all the machine learning experimentation are:

- Paraphrase Detection phase algorithms get almost perfect accuracy results when they have long texts to compare. This would lead us to conclude that the _Search Space Reduction_ stage is more important, because it is responsible for defining the offset and length of reused fragments.
- Long reused fragments help with paraphrase detection because the behavior of the detection changes when the paraphrase type changes. Observing this is only possible with 3 or more classes of paraphrase inside _PSTSCorpus_. The recommendation for future experiments is to follow the strategy of the _Plagiarised_Short_Answers_ corpus (based on 4 degrees of rewriting), or to derive a classification corpus similar to the P4P corpus (which offers more categories based on the linguistic phenomenon behind the change: lexical, same polarity, addition-deletion, etc.).

# Questions

* Analyze the _function_ __getFeatureVector__ and test with other mathematical equations. Make only 100 new cases and analyze the result against the previous one.
* Make a parallel version of the algorithm for the generation of non-paraphrased cases.
* Analyze the possibility of having a multi-class corpus based on Verbatim/Paraphrased/Non-paraphrased cases, taking into account that every kind of similarity measure will have a high score in both Verbatim & Paraphrased cases.

# References and Resources

* Vila, Marta & Martí, M Antònia & Rodríguez, Horacio "Is This a Paraphrase ? What Kind ? Paraphrase Boundaries and Typology". Open Journal of Modern Linguistics, 2014.
<a id='Vila2014'></a>