<small><i>This notebook was put together by [Abel Meneses-Abad](http://www.menesesabad.com) for SciPy LA Habana 2017. Source and license info is on [github repository](http://github.com/sorice/simtext_scipyla2017).</i></small>

# Evaluating the Best Normalization after Alignment

The objective of this notebook is to show, by means of a measure, whether the *Normalization* process was good, based on the tag values of the Plagiarism Detection Corpus.

## The Problem of Controlling Fixed Sentence Boundaries

For plagiarism detection or text reuse analysis you must select the similar fragments from two whole text documents. However, text reuse techniques sometimes do not need a sentence limit to compute a similarity; they can use tokens or even characters to delimit the boundaries of the compared fragments. Another reality is that in artificially generated cases the algorithm can capture inside the fragment just a chunk of the first or last sentence.

But in real scenarios humans take fragments composed of complete ideas (sentences) and then, modified or not, use them in a new document.

As explained [before](02.1-Normalizing-Text-Corpus.ipynb), the normalization process intends to normalize tokens and remove end-of-sentence ambiguities, or to define the end-of-sentence dots very well. **Alert:** Sentence tokenization is an open problem in NLP; this is an indirect way to show the quality of the normalization made by the author.

Then, for future text reuse experiments, the quality of the sentence division will be measured based on how much each sentence belongs to the annotated fragment in the plagiarism case XML document.

## Math Formalization of Normalized Sentences % in a Case
(based on Case XML Information)

After the [Alignment Process](02.2c-Jaccard-Align-Preproc-to-Original-Sent.ipynb) a <font color='#F84825'>new text structure</font> was obtained. Open the file [suspicious-document00007.txt](files/data/aligned/susp/suspicious-document00007.txt) to see the structure produced by that previous process, in the form <font color='#F84825'>
   $(id_K,normalized-sentence_K,original\,offset_{sentence\,K},original\,offset+length_{sentence\,K})$
</font>        

The fragment attributes are shown below for the same document, in the XML file [suspicious-document00007-source-document00382.xml](files/data/PAN-PC-2013/orig/03-random-obfuscation/suspicious-document00007-source-document00382.xml):
        
<body>
<pre style='color:#1f1c1b;background-color:#ffffff;'>
<b>&lt;document</b><span style='color:#006e28;'> reference=</span><span style='color:#aa0000;'>&quot;suspicious-document00007.txt&quot;</span><b>&gt;</b>
<b>&lt;feature</b><span style='color:#006e28;'> name=</span><span style='color:#aa0000;'>&quot;plagiarism&quot;</span><span style='color:#006e28;'> obfuscation=</span><span style='color:#aa0000;'>&quot;random&quot;</span><span style='color:#006e28;'> obfuscation_degree=</span><span style='color:#aa0000;'>&quot;0.4694788492120119&quot;</span><span style='color:#006e28;'> source_length=</span><span style='color:#aa0000;'>&quot;453&quot;</span><span style='color:#006e28;'> source_offset=</span><span style='color:#aa0000;'>&quot;0&quot;</span><span style='color:#006e28;'> source_reference=</span><span style='color:#aa0000;'>&quot;source-document00382.txt&quot;</span><span style='color:#006e28;'> this_length=</span><span style='color:#aa0000;'>&quot;453&quot;</span><span style='color:#006e28;'> this_offset=</span><span style='color:#aa0000;'>&quot;9449&quot;</span><span style='color:#006e28;'> type=</span><span style='color:#aa0000;'>&quot;artificial&quot;</span> <b>/&gt;</b>
<b>&lt;/document</b><b>&gt;</b>
</pre>
</body>
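
As a minimal sketch, the offsets and lengths in such a file can be read with Python's standard `xml.etree.ElementTree`. This is not the `PANXml` helper imported later in this notebook (its internals are not shown here); the dictionary layout and variable names are assumptions made only for illustration.

```python
import xml.etree.ElementTree as ET

# The XML content shown above, pasted as a string for a self-contained example
xml_text = '''<document reference="suspicious-document00007.txt">
<feature name="plagiarism" obfuscation="random" obfuscation_degree="0.4694788492120119" source_length="453" source_offset="0" source_reference="source-document00382.txt" this_length="453" this_offset="9449" type="artificial" />
</document>'''

root = ET.fromstring(xml_text)
fragments = []
for feature in root.iter("feature"):
    # Only plagiarism annotations describe a fragment pair
    if feature.get("name") != "plagiarism":
        continue
    susp_offset = int(feature.get("this_offset"))
    src_offset = int(feature.get("source_offset"))
    # Store each side as (start, end), where end = offset + length
    fragments.append({
        "susp": (susp_offset, susp_offset + int(feature.get("this_length"))),
        "src": (src_offset, src_offset + int(feature.get("source_length"))),
    })

print(fragments)  # [{'susp': (9449, 9902), 'src': (0, 453)}]
```

Storing the end position instead of the length matches how the formula in the next section uses $L_X$.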
        
With these attribute values known in advance, how do we calculate the percent of a sentence that lies inside the fragment?

Reasoning mathematically, it can be found that in the three cases indicated in Figure 2.4.1 the general formula that solves this problem is:

$$\%\,of\,sentence_{K}\,that\,belong\,to\,the\,fragment_X = \frac{min(L_K,L_X) - max(Offset_K,Offset_X)}{L_K-Offset_K}$$

Where $L_K = Offset_K + Length_K$, that is, the position of the last character that belongs to $sentence_K$, and likewise $L_X = Offset_X + Length_X$ for $fragment_X$.
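
The formula can be checked with a small sketch. The function name is hypothetical, and clamping negative overlaps to zero is an added assumption for sentences that do not touch the fragment (the notebook's implementation filters those out instead). The numbers are taken from the test output for `suspicious-document00070.txt` further down in this notebook.

```python
def sentence_percent_in_fragment(offset_k, end_k, offset_x, end_x):
    """Fraction of sentence K, spanning [offset_k, end_k), inside fragment X."""
    overlap = min(end_k, end_x) - max(offset_k, offset_x)
    # Assumption: sentences that do not overlap the fragment get 0
    return max(overlap, 0) / (end_k - offset_k)

# First fragment of suspicious-document00070: offset 3505, length 184
frag_start, frag_end = 3505, 3505 + 184

# Sentence starting before the fragment
print(sentence_percent_in_fragment(3473, 3562, frag_start, frag_end))  # 0.6404494382022472
# Sentence completely inside the fragment
print(sentence_percent_in_fragment(3562, 3593, frag_start, frag_end))  # 1.0
# Sentence ending after the fragment
print(sentence_percent_in_fragment(3655, 3755, frag_start, frag_end))  # 0.34
```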

<center><strong>Elaborated diagram to show possible situations after sentence normalization</strong></center></br>
<table border=0 cellspacing=10> 
    <caption align="bottom"> </br><em>Figure 2.4.1: The three possible ways a sentence can belong to the fragment</em>
    </caption> 
<tr align="center">
    <th> <img src="imgs/PercentSentBelongFragment.jpg" height=200px width=300px alt="*" 
        align="center"> </th>
</tr>
</table>

A single document can appear in more than one pair. To avoid that problem, we need to keep the $\%\,of\,sentence_{K}\,that\,belong\,to\,the\,fragment_X$ tied to the sentence, creating one new text file per case based on a new structure (all values are numerical, ideal for loading into a numpy array):

    /norm/quality/suspicious-document00007-source-document00382.xml
   $(id_{sentence_P\,susp},offset_{sentence_P},offset+length_{sentence_P},\%\,sentence_{P}\, \in\,susp_{fragment\,X},id_{fragment\,X})$
   
   $\hspace{2cm}\vdots$ 
   
   $(id_{sentence_Q\,src},offset_{sentence_Q},offset+length_{sentence_Q},\%\,sentence_{Q}\, \in\,src_{fragment\,X},id_{fragment\,X})$
   
   $\hspace{2cm}\vdots$ 
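
Since every field is numerical, a file with this structure can be loaded directly into a numpy array. The sketch below assumes tab-separated columns (as written by the implementation in the next section) and uses an in-memory sample copied from the test output later in this notebook instead of a real file.

```python
import io
import numpy as np

# Three sample lines: sentence id, offset, offset+length, percent, fragment id
sample = ("30\t3473\t3562\t0.6404494382022472\t100070\n"
          "31\t3562\t3593\t1.0\t100070\n"
          "4\t211\t346\t1.0\t101504\n")

# np.loadtxt accepts a path or any file-like object
data = np.loadtxt(io.StringIO(sample), delimiter="\t")
print(data.shape)  # (3, 5)

# Column 3 holds the percent of each sentence inside its fragment
percent = data[:, 3]
print(percent.mean())
```

All columns are read as floats, so the ids have to be cast back to integers if they are needed as keys.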

## Implementation

TODO: the algorithm diagram and its explanation go here

In [18]:
import numpy as np
from scripts import PANXml

def calcPercent(file, fragmentOffset, fragmentLen, fragmentID, thresh):  
    percent = [];case_percent = []
    with open(file) as doc:
        for num,line in enumerate(doc):
            ID,sent,offsetk,lenk = line.split('\t')
            Offsetk = int(offsetk)
            Lenk = int(lenk)
            if Offsetk < fragmentLen and Lenk > fragmentOffset:
                
                perc = (min(Lenk,fragmentLen)-max(Offsetk,fragmentOffset))/float(Lenk-Offsetk)
                
                if perc > thresh:
                    percent.append(ID+'\t'
                                   +str(Offsetk)+'\t'+lenk[:-1]+'\t'
                                   +str(perc)+'\t'
                                   +str(fragmentID+1)
                                   +file[-9:-4])
                    
                    case_percent.append(perc)
    
    return percent, case_percent

def calc_sentPercentCase(inputfileName,alignedCollectionPath, xmlColecctionPath,threshold=0.3,printout=False):
    susp, src = inputfileName.split()
    case = PANXml(xmlColecctionPath+susp[:-4]+'-'+src[:-4]+'.xml')
    result = []; case_perc = []
    
    for i,file in enumerate([alignedCollectionPath+'susp/'+susp,alignedCollectionPath+'src/'+src]):
        for fragmentID,fragment in enumerate(case.fragmentList):
            #Non-plagiarism cases have an empty XML; the next line skips them
            if len(case.fragmentList) > 0:
                if i == 0:
                    Ox = int(fragment.suspOffset)
                    Lx = Ox+int(fragment.suspLength)
                    a,b = calcPercent(file,Ox,Lx,fragmentID,threshold)
                    result.extend(a);case_perc.extend(b)
                    
                else:
                    Ox = int(fragment.srcOffset)
                    Lx = Ox+int(fragment.srcLength)
                    a,b = calcPercent(file,Ox,Lx,fragmentID,threshold)
                    result.extend(a);case_perc.extend(b)
                if printout:
                    print('Case %d, Offset=%d, Length=%d' % (fragmentID, Ox, Lx-Ox))

    #Write in a single doc the percent result line by line for both docs of the case
    if printout:
        print('Sents\tOffset\tLength\t%InFrag\tFragID')
    with open(alignedCollectionPath+'quality/'+inputfileName, 'w') as report:
        for line in result:
            if printout:
                print(line)
            report.write(line+'\n')
    
    data = np.array(case_perc,dtype=np.float64)
    #print('data:',case_perc)
    return float(data.mean()), case_perc

## Testing for a case with only one pair of fragments

Please create a "quality" folder inside the aligned folder before starting.

In [2]:
inputFile = 'suspicious-document00122.txt source-document01169.txt'
alignedCollectionPath = 'data/aligned/'
xmlColecctionPath = 'data/orig/xml/'
A, B = calc_sentPercentCase(inputFile, alignedCollectionPath, xmlColecctionPath,printout=True)
print("Case Percent Quality =",A)

Case 0, Offset=0, Length=556
Case 0, Offset=0, Length=999
Sents	Offset	Length	%InFrag	FragID
0	0	514	1.0	100122
1	514	556	1.0	100122
0	0	40	1.0	101169
1	40	70	1.0	101169
2	70	1002	0.9967811158798283	101169
Case Percent Quality = 0.9993562231759657


## Testing for a case with more than one pair of fragments

In [22]:
inputFile = 'suspicious-document00070.txt source-document01504.txt'
alignedCollectionPath = 'data/aligned/'
xmlColecctionPath = 'data/orig/xml/'
A, B = calc_sentPercentCase(inputFile, alignedCollectionPath, xmlColecctionPath,printout=True)
print("Case Percent Quality =",A)

Case 0, Offset=3505, Length=184
Case 1, Offset=4727, Length=311
Case 0, Offset=211, Length=387
Case 1, Offset=1021, Length=422
Sents	Offset	Length	%InFrag	FragID
30	3473	3562	0.6404494382022472	100070
31	3562	3593	1.0	100070
32	3593	3620	1.0	100070
33	3620	3655	1.0	100070
34	3655	3755	0.34	100070
42	4684	4777	0.5376344086021505	200070
43	4777	4904	1.0	200070
44	4904	4992	1.0	200070
4	211	346	1.0	101504
5	346	418	1.0	101504
6	418	497	1.0	101504
7	497	596	1.0	101504
13	1019	1096	0.974025974025974	201504
14	1096	1168	1.0	201504
15	1168	1182	1.0	201504
16	1182	1275	1.0	201504
17	1275	1443	1.0	201504
Case Percent Quality = 0.9113005776959042
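
As a quick sanity check, the case quality printed above is simply the arithmetic mean of the `%InFrag` column. The sketch below copies the 17 values from that output; the variable names are chosen here for illustration.

```python
# %InFrag values copied from the output above, in the same order
percents = [0.6404494382022472, 1.0, 1.0, 1.0, 0.34,   # susp, fragment 1
            0.5376344086021505, 1.0, 1.0,              # susp, fragment 2
            1.0, 1.0, 1.0, 1.0,                        # src, fragment 1
            0.974025974025974, 1.0, 1.0, 1.0, 1.0]     # src, fragment 2

case_quality = sum(percents) / len(percents)
print(round(case_quality, 4))  # 0.9113
```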


## Implementation for the whole Collection

In [27]:
#Generate the percent for all sentences in every preprocessed case
#Load all cases in pairs
alignedCollectionPath = 'data/aligned/'
xmlColecctionPath = 'data/orig/xml/'

import time

norm_percent = []
sent_perc = []
threshold=0.3

init = time.time()
for line in open('data/aligned/align_pairs'):
    perct, sent = calc_sentPercentCase(line, alignedCollectionPath, 
                                       xmlColecctionPath,threshold)
    norm_percent.append(perct)
    sent_perc.extend(sent)
    timef = time.time() - init
print(timef)



0.6947212219238281


In [40]:
len(norm_percent)
quality_norm_vector = np.array(norm_percent,dtype=np.float64)
print(quality_norm_vector.shape)
print(quality_norm_vector[990:1011])
print('Total useful sentences:',len(sent_perc))
quality_sent_norm = np.array(sent_perc,dtype=np.float64)

(2000,)
[       nan        nan        nan        nan        nan        nan
        nan        nan        nan        nan 0.9596732  0.86314415
 0.89054233 0.94825782 0.83321233 0.95089359 0.95363974 0.83758948
 0.81468753 0.8639324  0.89908607]
Total useful sentences: 13354


In [41]:
quality_norm_vector[1000:].mean()

0.9252346674340438
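
The `nan` entries in the vector above are why the mean was taken over a slice. A hedged alternative, assuming the goal is simply to ignore cases without a value, is to let numpy skip them with `np.nanmean` or a boolean mask. The sample values below are a few entries copied from the printed slice.

```python
import numpy as np

# Two cases without a value followed by three valid case qualities
case_quality = np.array([np.nan, np.nan, 0.9596732, 0.86314415, 0.89054233])

# A plain mean propagates NaN
print(case_quality.mean())  # nan

# np.nanmean ignores NaN entries
print(np.nanmean(case_quality))

# Equivalent explicit mask, handy when the valid entries are needed later
valid = case_quality[~np.isnan(case_quality)]
print(valid.size, valid.mean())
```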

## Sentence Normalization Quality Measure

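$Quality_{\mu} = \frac{\sum_{k=0}^n [\%\,of\,sentence_k\,\,inside\,fragment>\mu]}{n+1}$ 

Where $[\cdot]$ is 1 when the condition holds and 0 otherwise, $n+1$ is the total number of sentences kept for the analysed fragments (those whose percent exceeds the `threshold` passed to `calcPercent`, 0.3 in this notebook), and $\mu$ is the minimum percent of a sentence inside the fragment for it to be considered as belonging to the fragment case.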
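
Read as the share of kept sentences whose percent exceeds $\mu$, the measure can be sketched with a boolean mask. The function name is hypothetical, and the sample holds only the five sentences of the first suspicious fragment from the multi-fragment test above.

```python
import numpy as np

def quality(percents, mu):
    """Share of sentences with more than `mu` of their length inside the fragment."""
    percents = np.asarray(percents, dtype=np.float64)
    # The mean of a boolean mask is the fraction of True entries
    return float((percents > mu).mean())

percents = [0.6404494382022472, 1.0, 1.0, 1.0, 0.34]

# 3 of the 5 sentences have more than 90% of their length inside the fragment
print(quality(percents, 0.9))

# Complementary view: 1 minus the share with less than 50% inside
below_half = float((np.asarray(percents) < 0.5).mean())
print(1.0 - below_half)
```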

In [16]:
from pandas import DataFrame
miu = 0.9
miu_vector = []
miu2 = 0.5
miu_vector2 = []

Quality = quality_norm_vector.mean()
#This shows that ndarray.mean() returns NaN when the array contains NaN values,
#so the data is converted to a DataFrame, whose mean() skips NaN by default
print ('Quality Type:',type(Quality),'\nQuality mean:',Quality,
       '.\nNew data type is needed: DataFrame (DF).') 

P = DataFrame(quality_norm_vector, columns=['percent'])
Qn = P.mean()
print('DF Quality based on cases-quality-average: %.4f' % (Qn['percent']))

print('Total sentences:',len(sent_perc))
Q = DataFrame(quality_sent_norm, columns=['percent'])
Qs = Q.mean()
print('DF Quality based on total sentences with percent > %.2f inside the case: %.4f' % (threshold,Qs['percent']))

for sent in quality_sent_norm:
    if sent > miu:
        miu_vector.append(sent)

for sent in quality_sent_norm:
    if sent < miu2:
        miu_vector2.append(sent)
        
print('# of normalized sentences with +%.2f%% of their size' % (miu*100),
      'inside their respective case-frag: %d' % (len(miu_vector)))
print('# of normalized sentences with +%.2f%% of their size outside the case: %d' % 
      (100-miu2*100,len(miu_vector2)))
print('Quality based on miu: %.4f' % (len(miu_vector)/len(Q)))
print('Quality based on miu2: %.4f' % (1.0-len(miu_vector2)/len(Q)))

Quality Type: <class 'numpy.float64'> 
Quality mean: nan .
New data type is needed: DataFrame (DF).
DF Quality based on cases-quality-average: 0.9252
Total sentences: 13354
DF Quality based on total sentences with percent > 0.30 inside the case: 0.9388
# of normalized sentences with +90.00% of their size inside their respective case-frag: 11245
# of normalized sentences with +50.00% of their size outside the case: 584
Quality based on miu: 0.8421
Quality based on miu2: 0.9563


In [14]:
Q.head()

Unnamed: 0,percent
0,0.529412
1,1.0
2,1.0
3,0.986667
4,1.0


# Conclusions

The quality factor of this new data is good. About 95% of the sentences redefined by the normalization process have more than half of their length inside the fragment defined by the XML, and 84% have more than 90% of their length inside it. Only 7% have less than 30% outside the fragment defined by the corpus XML files.

# Questions

1. Develop new quality measures based on what was learned here.

2. Describe the formulas and algorithms in Excersises.notebook.

3. Change the distance measure, use a different version of the _preprocess_ lib, and interpret the result. Test the contraction and abbreviation replacement and analyse the result.

4. Implement the normalization with nltk, pattern or spacy, and analyse the results. Comment on the differences for future experiments.

# References and Resources