<small><i>This notebook was put together by [Abel Meneses-Abad](http://www.menesesabad.com) for SciPy LA Habana 2017. Source and license info are in the [GitHub repository](http://github.com/sorice/simtext_scipyla2017).</i></small>

# Evaluating the Best Normalization after Alignment

The objective of this notebook is to evaluate the quality of the normalization-alignment process using the fragment boundaries defined by the tag values of the Plagiarism Detection Corpus.

## The Problem of Controlling Fixed Sentence Boundaries

For plagiarism detection or text reuse analysis, you must select the similar fragments from two documents. However, text reuse techniques do not always need sentence limits to compute similarity: they can use tokens or even characters to delimit the boundaries of the compared fragments. In addition, in artificially generated cases the fragment may contain only a chunk of its first or last sentence.

But in real scenarios, humans take fragments composed of complete ideas (sentences) and then reuse them, modified or not, in a new document.

As explained [before](02.1-Normalizing-Text-Corpus.ipynb), the normalization process aims to normalize tokens and either remove end-of-sentence ambiguities or mark end-of-sentence periods unambiguously. **Alert:** Sentence tokenization is an open problem in NLP; this evaluation is an indirect way to show the quality of the normalization made by the author.

For future text reuse experiments, the quality of the sentence division will therefore be measured by how much of each sentence falls inside the fragment annotated in the plagiarism case XML document.

## Math Formalization of Normalized Sentences % in a Case
(based on Case XML Information)

After the [alignment process](02.2-Jaccard-Align-Preproc-to-Original-Sent.ipynb) a <font color='#F84825'>new text structure</font> was obtained. Open the file [suspicious-document00007.txt](files/data/aligned/susp/suspicious-document00007.txt) to see the resulting text structure, which has the form <font color='#F84825'>
   $(id_K,normalized-sentence_K,original\,offset_{sentence\,K},original\,offset+length_{sentence\,K})$
</font>        

The fragment attributes for the same document are shown below, taken from the XML file [suspicious-document00007-source-document00382.xml](files/data/PAN-PC-2013/orig/03-random-obfuscation/suspicious-document00007-source-document00382.xml):
        
<body>
<pre style='color:#1f1c1b;background-color:#ffffff;'>
<b>&lt;document</b><span style='color:#006e28;'> reference=</span><span style='color:#aa0000;'>&quot;suspicious-document00007.txt&quot;</span><b>&gt;</b>
<b>&lt;feature</b><span style='color:#006e28;'> name=</span><span style='color:#aa0000;'>&quot;plagiarism&quot;</span><span style='color:#006e28;'> obfuscation=</span><span style='color:#aa0000;'>&quot;random&quot;</span><span style='color:#006e28;'> obfuscation_degree=</span><span style='color:#aa0000;'>&quot;0.4694788492120119&quot;</span><span style='color:#006e28;'> source_length=</span><span style='color:#aa0000;'>&quot;453&quot;</span><span style='color:#006e28;'> source_offset=</span><span style='color:#aa0000;'>&quot;0&quot;</span><span style='color:#006e28;'> source_reference=</span><span style='color:#aa0000;'>&quot;source-document00382.txt&quot;</span><span style='color:#006e28;'> this_length=</span><span style='color:#aa0000;'>&quot;453&quot;</span><span style='color:#006e28;'> this_offset=</span><span style='color:#aa0000;'>&quot;9449&quot;</span><span style='color:#006e28;'> type=</span><span style='color:#aa0000;'>&quot;artificial&quot;</span> <b>/&gt;</b>
<b>&lt;/document</b><b>&gt;</b>
</pre>
</body>
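
The attributes above can be read with Python's standard library. The following is only a minimal sketch: it is not the `PANXml_Reader` class used later in this notebook, and the function name `read_fragments` and the dictionary keys are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# The XML shown above, copied as a string so the sketch is self-contained
SAMPLE_XML = """<document reference="suspicious-document00007.txt">
<feature name="plagiarism" obfuscation="random" obfuscation_degree="0.4694788492120119" source_length="453" source_offset="0" source_reference="source-document00382.txt" this_length="453" this_offset="9449" type="artificial" />
</document>"""


def read_fragments(xml_text):
    """Return one dict of integer offsets and lengths per plagiarism feature.

    Hypothetical helper; the key names are chosen for illustration only.
    """
    root = ET.fromstring(xml_text)
    fragments = []
    for feature in root.iter("feature"):
        # Skip any feature that does not annotate a plagiarism fragment
        if feature.get("name") != "plagiarism":
            continue
        fragments.append({
            "susp_offset": int(feature.get("this_offset")),
            "susp_length": int(feature.get("this_length")),
            "src_offset": int(feature.get("source_offset")),
            "src_length": int(feature.get("source_length")),
        })
    return fragments


fragments = read_fragments(SAMPLE_XML)
print(fragments)
```

A document without `feature` elements yields an empty list, which corresponds to a non-plagiarism case.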
        
Given these attribute values in advance, how do we calculate the percentage of a sentence that lies inside the fragment?

Working through the three cases shown in Figure 2.4.1, the same general formula solves the problem:

$$\%\,of\,sentence_{K}\,that\,belongs\,to\,the\,fragment_X = \frac{min(L_K,L_X) - max(Offset_K,Offset_X)}{L_K-Offset_K}$$

Where $L_K = Offset_K + Length_K$, that is, the position of the last character that belongs to $sentence_K$. $L_X$ is defined in the same way for $fragment_X$.
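As a quick check of the formula, the sketch below evaluates it for three sentences against one fragment. The function name `sentence_fraction_in_fragment` is an illustrative assumption, and so is the clamping to zero for sentences that do not overlap the fragment (the formula itself would give a negative value there). The numbers are the offsets reported by the single-fragment test later in this notebook.

```python
def sentence_fraction_in_fragment(sent_offset, sent_length, frag_offset, frag_length):
    """Fraction of a sentence's characters that fall inside a fragment.

    Assumes sent_length > 0. Hypothetical helper written for illustration.
    """
    sent_end = sent_offset + sent_length   # L_K
    frag_end = frag_offset + frag_length   # L_X
    overlap = min(sent_end, frag_end) - max(sent_offset, frag_offset)
    # Assumption: a sentence with no overlap counts as 0 instead of a negative value
    return max(overlap, 0) / sent_length


# Fragment with offset 1021 and length 502, so it ends at 1523
FRAG_OFFSET, FRAG_LENGTH = 1021, 502

# Sentence starts before the fragment: (1064 - 1021) / 92
print(sentence_fraction_in_fragment(972, 92, FRAG_OFFSET, FRAG_LENGTH))
# Sentence lies fully inside the fragment: 124 / 124
print(sentence_fraction_in_fragment(1064, 124, FRAG_OFFSET, FRAG_LENGTH))
# Sentence ends after the fragment: (1523 - 1503) / 23
print(sentence_fraction_in_fragment(1503, 23, FRAG_OFFSET, FRAG_LENGTH))
```

The three calls correspond to the three situations in the diagram: a sentence cut at its beginning, a sentence fully contained, and a sentence cut at its end.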

<body> 
    <br>
<center><strong>Diagram that shows possible situations after Sentence Normalization</strong></center></br>
<table border=0 cellspacing=10> 
    <caption align="bottom"> </br><em>Figure 2.4.1: Percentage of a sentence belonging to a fragment. Three different possibilities.</em>
    </caption> 
<tr align="center">
    <th> <img src="imgs/PercentSentBelongFragment.jpg" height=200px width=300px alt="*" 
        align="center"> </th>
</tr>
</table>
</body>

To avoid repeating these calculations for a document that appears in more than one pair, we store the $\%\,of\,sentence_{K}\,that\,belongs\,to\,the\,fragment_X$ alongside each sentence, creating one new text file per case with the following structure (all values are numerical, which makes them ideal to load into a NumPy array):

    /norm/quality/suspicious-document00007-source-document00382.xml
   $(id_{sentence_P\,susp},offset_{sentence_P},offset+length_{sentence_P},\%\,sentence_{P}\, \in\,susp_{fragment\,X},id_{fragment\,X})$
   
   $\hspace{2cm}\vdots$ 
   
   $(id_{sentence_Q\,src},offset_{sentence_Q},offset+length_{sentence_Q},\%\,sentence_{Q}\, \in\,src_{fragment\,X},id_{fragment\,X})$
   
   $\hspace{2cm}\vdots$ 
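
Because every column is numeric, rows in this layout can be loaded directly with `numpy.loadtxt`. The sketch below assumes tab-separated columns, as written by the implementation in this notebook, and uses two sample rows held in memory instead of a real file.

```python
import io

import numpy as np

# Two sample rows: sentence id, start, end, fraction inside fragment, fragment id
sample = (
    "12\t972\t1064\t0.4673913043478261\t100427\n"
    "13\t1064\t1188\t1.0\t100427\n"
)

# ndmin=2 keeps a 2-D array even when the input has a single row
rows = np.loadtxt(io.StringIO(sample), delimiter="\t", ndmin=2)

fractions = rows[:, 3]      # fourth column: fraction inside the fragment
print(rows.shape)
print(fractions.mean())
```

With a real report file, the `io.StringIO` object would be replaced by the file path.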

## Implementation

In [1]:
import numpy as np
from scripts import PANXml_Reader

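def calcPercent(file, fragmentOffset, fragmentEnd, fragmentID, thresh):
    """Compute the fraction of each aligned sentence that lies inside a fragment.

    fragmentEnd is the fragment's end position (offset + length).
    NOTE: thresh is accepted but not applied; every overlapping sentence is kept.
    """
    percent = []       # report lines: id, start, end, fraction, fragment id
    case_percent = []  # bare fractions, used for the case average
    with open(file) as doc:
        for line in doc:
            ID, sent, sentOffset, sentLen = line.split('\t')
            sentOffset = int(sentOffset)
            sentEnd = int(sentLen) + sentOffset
            # Keep only sentences that overlap the fragment
            if sentOffset < fragmentEnd and sentEnd > fragmentOffset:
                overlap = min(sentEnd, fragmentEnd) - max(sentOffset, fragmentOffset)
                perc = overlap / float(sentEnd - sentOffset)
                percent.append(ID + '\t'
                               + str(sentOffset) + '\t' + str(sentEnd) + '\t'
                               + str(perc) + '\t'
                               + str(fragmentID + 1) + file[-9:-4])
                case_percent.append(perc)

    return percent, case_percent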

def calc_sentPercentCase(inputfileName,alignedCollectionPath, 
                         xmlColecctionPath,
                         threshold=0.3,
                         printout=False):
    
    susp, src = inputfileName.split()
    xmlDoc = PANXml_Reader(xmlColecctionPath+susp[:-4]+'-'+src[:-4]+'.xml')
    fragmentList = xmlDoc.parser()
    result = []; case_perc = []
    
    #Non-plagiarism cases have an empty XML; the next line skips them
    if len(fragmentList) > 0:
        files = [alignedCollectionPath+'susp/'+susp,alignedCollectionPath+'src/'+src]
        for i,file in enumerate(files):
            for fragmentID,fragment in enumerate(fragmentList):
                if i == 0: #if the case is the susp case
                    Ox = int(fragment.suspOffset)
                    Lx = Ox+int(fragment.suspLength)
                    a,b = calcPercent(file,Ox,Lx,fragmentID,threshold)
                    result.extend(a);case_perc.extend(b)
                    
                else: #if the case is the src case
                    Ox = int(fragment.srcOffset)
                    Lx = Ox+int(fragment.srcLength)
                    a,b = calcPercent(file,Ox,Lx,fragmentID,threshold)
                    result.extend(a);case_perc.extend(b)
                if printout:
                    print('Case %d, Offset=%d, Length=%d' % (fragmentID, Ox, Lx-Ox))

        #Write in a single doc the percent result line by line for both docs of the case
        if printout:
            print('Sents\tOffset\tLength\t%InFrag\tFragID')
        with open(alignedCollectionPath+'quality/'+inputfileName, 'w') as report:
            for line in result:
                if printout:
                    print(line)
                report.write(line+'\n')

        data = np.array(case_perc,dtype=np.float64)

        return float(data.mean()), case_perc
    
    else:
        return 0.0, []

## Testing for a case with only one pair of fragments

Create a "quality" folder inside the "aligned" folder before starting.
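
The folder can also be created from Python. This is a minimal sketch; the helper name `ensure_quality_dir` is an assumption, and the demonstration runs in a temporary directory so that it does not touch the real data.

```python
import os
import tempfile


def ensure_quality_dir(aligned_path):
    """Create the 'quality' subfolder if it is missing and return its path."""
    quality_dir = os.path.join(aligned_path, "quality")
    # exist_ok=True makes the call safe to repeat
    os.makedirs(quality_dir, exist_ok=True)
    return quality_dir


# Demonstration in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    created = ensure_quality_dir(tmp)
    print(os.path.isdir(created))
```

For this notebook the call would be `ensure_quality_dir('data/aligned/')`.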

In [4]:
inputFile = 'suspicious-document00427.txt source-document01618.txt'
alignedCollectionPath = 'data/aligned/'
xmlColecctionPath = 'data/orig/xml/'
A, B = calc_sentPercentCase(inputFile, alignedCollectionPath, xmlColecctionPath,printout=True)
print("Case Percent Quality =",A)

Case 0, Offset=1021, Length=502
Case 0, Offset=0, Length=687
Sents	Offset	Length	%InFrag	FragID
12	972	1064	0.4673913043478261	100427
13	1064	1188	1.0	100427
14	1188	1413	1.0	100427
15	1413	1503	1.0	100427
16	1503	1526	0.8695652173913043	100427
0	0	65	1.0	101618
1	65	106	1.0	101618
2	106	137	1.0	101618
3	137	267	1.0	101618
4	267	420	1.0	101618
5	420	584	1.0	101618
6	584	663	1.0	101618
7	663	687	1.0	101618
Case Percent Quality = 0.9489966555183946


## Testing for a case with more than one pair of fragments

In [5]:
inputFile = 'suspicious-document01064.txt source-document00671.txt'
alignedCollectionPath = 'data/aligned/'
xmlColecctionPath = 'data/orig/xml/'
A, B = calc_sentPercentCase(inputFile, alignedCollectionPath, xmlColecctionPath,printout=True)
print("Case Percent Quality =",A)

Case 0, Offset=5824, Length=272
Case 1, Offset=21870, Length=279
Case 0, Offset=0, Length=801
Case 1, Offset=879, Length=358
Sents	Offset	Length	%InFrag	FragID
51	5824	5896	1.0	101064
52	5896	5935	1.0	101064
53	5935	5998	1.0	101064
54	5998	6066	1.0	101064
55	6066	6097	0.967741935483871	101064
240	21747	21966	0.4383561643835616	201064
241	21966	22118	1.0	201064
242	22118	22150	0.96875	201064
0	0	49	1.0	100671
1	49	210	1.0	100671
2	210	470	1.0	100671
3	470	602	1.0	100671
4	602	710	1.0	100671
5	710	802	0.9891304347826086	100671
7	879	1127	1.0	200671
8	1127	1141	1.0	200671
9	1141	1190	1.0	200671
10	1190	1239	0.9591836734693877	200671
Case Percent Quality = 0.9623979004510793


## Implementation for the whole Collection

In [6]:
#Generate the percent for all sentences in every preprocessed case
#Load all cases in pairs
alignedCollectionPath = 'data/aligned/'
xmlColecctionPath = 'data/orig/xml/'

import time

norm_percent = []
sent_perc = []
threshold=0.3

init = time.time()
with open('data/aligned/aligned_pairs') as pairs:
    for line in pairs:
        perct, sent = calc_sentPercentCase(line.rstrip('\n'), alignedCollectionPath,
                                           xmlColecctionPath, threshold)
        norm_percent.append(perct)
        sent_perc.extend(sent)
#Measure the elapsed time once, after all pairs are processed
timef = time.time() - init
print(timef)

1.5250451564788818


In [7]:
len(norm_percent)
quality_norm_vector = np.array(norm_percent,dtype=np.float64)
print(quality_norm_vector[990:1011])
print(quality_norm_vector)
print('Total useful sentences:',len(sent_perc))
quality_sent_norm = np.array(sent_perc,dtype=np.float64)

[0.         0.         0.         0.         0.         0.
 0.         0.93982923 0.99764652 0.7652065  0.96741266 0.89646236
 0.89776727 0.93668272 0.94802847 0.99578731 0.99778318 0.99827709
 0.76641118 0.96892604 0.99805388]
[0.         0.         0.         ... 0.98975591 0.88768371 0.99013629]
Total useful sentences: 13376


In [8]:
quality_norm_vector[1000:].mean()

0.9673992915065034

## Sentence Normalization Quality Measure

$Quality = \frac{\sum_{k=0}^n (\%\,of\,sentence_k\,\,inside\,fragment>\mu)}{total\,sentences\,with\,\% > \mu}$ 

Where $n+1$ is the total number of sentences in the analysed fragment, and $\mu$ is the minimum percentage of a sentence that must lie inside the fragment for that sentence to be classified as belonging to the case fragment.

In [9]:
from pandas import DataFrame
miu = 0.9
miu_vector = []
miu2 = 0.5
miu_vector2 = []

Quality = quality_norm_vector.mean()
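#ndarray.mean() propagates NaN values, whereas DataFrame.mean() skips them
#by default, so the vector is also converted to a DataFrame.
#Here non-plagiarism cases are stored as 0.0, so both means coincide.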
print ('Quality Type:',type(Quality),'\nQuality mean:',Quality,
       '.\nNew data type is needed: DataFrame (DF).') 

P = DataFrame(quality_norm_vector, columns=['percent'])
Qn = P.mean()
print('DF Quality based on cases-quality-average: %.4f' % (Qn['percent']))

print('Total sentences:',len(sent_perc))
Q = DataFrame(quality_sent_norm, columns=['percent'])
Qs = Q.mean()
print('DF Quality based on total sentences with percent > %.2f inside de case: %.4f' % (threshold,Qs['percent']))

for sent in quality_sent_norm:
    if sent > miu:
        miu_vector.append(sent)

for sent in quality_sent_norm:
    if sent < miu2:
        miu_vector2.append(sent)
        
print('# of normalized sentences with +%.2f%% of its size' % (miu*100),
      'inside its respectively case-frag: %d' % (len(miu_vector)))
print('# of norm sentences with +%.2f%% of its size outside of the case: %d' % 
      (100-miu2*100,len(miu_vector2)))
print('Quality based on miu: %.4f' % (len(miu_vector)/len(Q)))
print('Quality based on miu2: %.4f' % (1.0-len(miu_vector2)/len(Q)))

Quality Type: <class 'numpy.float64'> 
Quality mean: 0.4840843570093619 .
New data type is needed: DataFrame (DF).
DF Quality based on cases-quality-average: 0.4841
Total sentences: 13376
DF Quality based on total sentences with percent > 0.30 inside de case: 0.9734
# of normalized sentences with +90.00% of its size inside its respectively case-frag: 12794
# of norm sentences with +50.00% of its size outside of the case: 318
Quality based on miu: 0.9565
Quality based on miu2: 0.9762
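
The two threshold-based figures printed above can also be computed without explicit loops. This is a minimal sketch with NumPy boolean masks; the function names `fraction_above` and `fraction_not_below` and the five sample values are illustrative assumptions, not part of the notebook's code.

```python
import numpy as np


def fraction_above(percents, mu):
    """Share of sentences whose in-fragment fraction is strictly above mu."""
    percents = np.asarray(percents, dtype=np.float64)
    # The mean of a boolean mask is the proportion of True entries
    return float((percents > mu).mean())


def fraction_not_below(percents, mu2):
    """One minus the share of sentences whose in-fragment fraction is below mu2."""
    percents = np.asarray(percents, dtype=np.float64)
    return float(1.0 - (percents < mu2).mean())


# Illustrative values only; the notebook applies the same idea to quality_sent_norm
sample = [1.0, 1.0, 1.0, 1.0, 0.243421]
print(fraction_above(sample, 0.9))
print(fraction_not_below(sample, 0.5))
```

Both functions assume a non-empty input; an empty array would give `nan`.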


In [10]:
Q.head()

| | percent |
|---|---|
| 0 | 1.0 |
| 1 | 1.0 |
| 2 | 1.0 |
| 3 | 1.0 |
| 4 | 0.243421 |


# Conclusions

The data obtained with aligner1.0 is good. The _quality factor_ means that after normalization 95% of the sentences have more than half of their length inside the fragment, and 84% have more than 90% of their length inside the fragment defined by the XML document. Only 7% have less than 30% outside the fragment defined by the corpus's XML files.

With aligner2.0 all the numbers are better: 95% of the sentences have more than 90% of their length inside the fragment. Only 2.6% have less than 30% outside the fragment. The second version of the Jaccard measure and the aligner, based on character and word comparisons, performed well.

# Questions

1. Develop new quality measures based on what was learned here.

2. Describe their formulas and algorithms.

3. Change the distance measure, use a different version of the _preprocess_ library, and interpret the results. Test the replacement of contractions and abbreviations and analyse the result.

4. Implement the normalization with NLTK, Pattern or spaCy, and analyse the results. Comment on the differences for future experiments.