<small><i>This notebook was put together by [Abel Meneses-Abad](http://www.menesesabad.com) for SciPy LA Habana 2017. Source and license info are in the [GitHub repository](http://github.com/sorice/simtext_scipyla2017).</i></small>

### Importing cell

In [2]:
import sys
import os
import re
from preprocess.demo import preProcessFlow

# Normalizing the Text Corpus

The objectives of this notebook are:

* To explain the standard normalization process defined in Natural Language Processing [(2.1.1)](#Text_Normalization).
* To show some functions written by the author to carry out data normalization (2.1.2).
* To compare, at a shallow level, the normalization methods implemented in some classic NLP libraries written in Python and C++ (2.1.3).
* To convert the whole initial corpus collection into a new, normalized one (2.1.4).

The first corpus used in this collection for phase 1, _Data Preparation_, is the corpus for the __Text Alignment__ task of the PAN scientific event. PAN is a series of scientific events and shared tasks on digital text forensics and stylometry. The corpus used here is the __2013 test corpus__, available [HERE](https://www.uni-weimar.de/medien/webis/corpora/corpus-pan-labs-09-today/pan-13/pan13-data/pan13-text-alignment-test-and-training.zip). More corpora related to the same task can be found in [pan13-data](https://www.uni-weimar.de/medien/webis/corpora/corpus-pan-labs-09-today/).

Because computing all the metrics (50+) on a single laptop is not feasible, the resulting Paragraph Semantic Text Similarity Corpus (PSTS Corpus) will be published later with the help of the Wittylytics team. Instead, phase 2, _Text Similarity Calculation_, and the last experimental phase, _Classifying Paraphrased Cases_, will be shown with the [Microsoft Research Paraphrase Corpus](http://research.microsoft.com/en-us/downloads), a sentence-pair corpus with two classes that is smaller than PAN-PC-2013.
 
    
## Brief Summary

Keep in mind that the text collection used throughout this tutorial contains spelling errors and other kinds of errors introduced by pdf-to-text libraries or by automatic text generation algorithms (e.g. multiple line breaks inside a single sentence). So, although humans do not make these mistakes in real text, the automatic conversion or generation process can create them.

In this phase we aim to obtain texts in which end-of-sentence dots are unambiguous and multi-word expressions are joined by the underscore symbol `_`. Any other punctuation or non-letter symbol will be deleted.
The result of this step will be used in the [next](02.2-Jaccard-Align-Preproc-to-Original-Sent) step to match the normalized sentences with the originals with 100% precision.



## 2.1.1 Text Normalization
<a id='Text_Normalization'></a>

*"Text normalization is a related step that involves merging different written forms of a
token into a canonical normalized form; for example, a document may contain the equivalent tokens
“Mr.”, “Mr”, “mister”, and “Mister” that would all be normalized to a single form."* [<a href="#Indurkhya2010" title="Handbook of Natural Language Processing"> (Indurkhya2010) </a>](#Indurkhya2010)
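
The definition above can be illustrated with a minimal sketch. The lookup table `CANONICAL_FORMS` and the function `normalize_token` are hypothetical names introduced only for this example; they are not part of the tutorial's *preprocess* module.

```python
# Hypothetical lookup table: every written variant maps to one canonical token.
CANONICAL_FORMS = {
    "mr.": "mister",
    "mr": "mister",
    "mister": "mister",
}


def normalize_token(token):
    """Return the canonical form of a token, or the token itself if unknown."""
    return CANONICAL_FORMS.get(token.lower(), token)


tokens = ["Mr.", "Mr", "mister", "Mister", "Smith"]
print([normalize_token(t) for t in tokens])
# ['mister', 'mister', 'mister', 'mister', 'Smith']
```

A real system would need a much larger table, or rules, to cover abbreviations, acronyms and spelling variants.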

## 2.1.2 Non Standard Normalization Protocols

As the definition above shows, text normalization in the strict sense only covers token transformation. To simplify the source code of the next notebooks, the text obtained at the end of the normalization phase must be canonical at every level: morphological, lexical and syntactic. Trying to simulate the rich-text pattern recognition that human brains perform (bullets, section or chapter divisions, etc.) is not a good approach: fast algorithms need very plain text.

Here is an example. A common rule says that a dot followed by a capital letter marks the end of a sentence. But what happens with a text like *"H. Albot was a B.A. of Psychology."*? The first dot is not the end of a sentence; in fact, there is only one sentence. The second dot is also legitimate and does not end the sentence either. Probably the most useful string to get at the end of a normalization process would be *"Harry Albot was a Bachelor of Psychology."*. This implies analysing proper names, abbreviations and acronyms, math symbols, etc.
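
The ambiguity can be reproduced with a short sketch. The splitting rule and the `EXPANSIONS` table below are assumptions made for illustration only, and the plain string replacement is deliberately naive. This is not how the *preprocess* module resolves abbreviations.

```python
import re

sentence = "H. Albot was a B.A. of Psychology."

# Naive rule: a dot followed by whitespace and a capital letter ends a sentence.
SENTENCE_END = re.compile(r"(?<=\.)\s+(?=[A-Z])")

print(SENTENCE_END.split(sentence))
# ['H.', 'Albot was a B.A. of Psychology.']  -> wrongly split in two

# Hypothetical expansion table, applied before splitting.
EXPANSIONS = {"H.": "Harry", "B.A.": "Bachelor"}


def expand(text, table):
    """Replace each abbreviation with its full form (naive string replace)."""
    for short, full in table.items():
        text = text.replace(short, full)
    return text


expanded = expand(sentence, EXPANSIONS)
print(SENTENCE_END.split(expanded))
# ['Harry Albot was a Bachelor of Psychology.']
```

Once the abbreviations are expanded, the only remaining dot is the real end of the sentence, so even the naive rule gives the right answer.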

The protocols created by the author available in the *preprocess.norm* module inside the tutorial are:

- URL analysis                                 -> treated as a contiguous string (http___google_com__)
- Rare symbols (including math symbols)        -> converted to a canonical form (u'\u03c0' = 'Pi')
- Ellipsis (...) detection                     -> eliminated
- Contiguous string                            -> multi-words are treated as a single token (text-reuse = text_reuse)
- Separation of end of sentence dots           -> to avoid ambiguities ("Hola. Hoy..." = "Hola . Hoy...")
- Abbreviation, Acronym and proper names       -> canonical form substitution
- Addition of last sentence end dot            -> a dot is added at the end of the last sentence if it is missing
- Punctuation chars analysis                   -> a set of regular expressions to fix pdf-to-text extraction problems
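
Four of the protocols above can be sketched with plain regular expressions. All function names below are hypothetical and the patterns are simplified assumptions; the real rules live in *preprocess.norm* and are more elaborate.

```python
import re


def url_to_token(text):
    """Turn each URL into one contiguous token by replacing symbols with '_'."""
    return re.sub(
        r"(?:https?://|www\.)\S+",
        lambda m: re.sub(r"\W", "_", m.group(0)),
        text,
    )


def join_multiwords(text):
    """Join hyphenated multi-words with an underscore (text-reuse -> text_reuse)."""
    return re.sub(r"(?<=\w)-(?=\w)", "_", text)


def separate_sentence_dots(text):
    """Detach a dot that is followed by whitespace and a capital letter."""
    return re.sub(r"(?<=\w)\.(?=\s+[A-Z])", " .", text)


def ensure_final_dot(text):
    """Add a detached dot at the end of the text if the last sentence lacks one."""
    text = text.rstrip()
    return text if text.endswith(".") else text + " ."


def toy_flow(text):
    """Apply the four steps in a fixed order (URLs first, final dot last)."""
    for step in (url_to_token, join_multiwords, separate_sentence_dots, ensure_final_dot):
        text = step(text)
    return text


print(toy_flow("Visit http://google.com for text-reuse tools. They are free"))
# Visit http___google_com for text_reuse tools . They are free .
```

The order matters: URLs are collapsed first so that their dots are not mistaken for sentence ends. One known limitation of this sketch is that a URL immediately followed by a sentence-ending dot swallows that dot.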

### A real example

In [3]:
# Read the raw test text and run the author's normalization flow on it.
with open('test/test_text.txt') as f:
    text_orig = f.read()
preproc_text = preProcessFlow(text_orig)

print('*****************************PREPROCESSED TEXT******************************')
print(preproc_text)

# Read the manually normalized version and join its lines into one string.
with open('test/test_text_human_analysis.txt') as f:
    text_human = f.read()
text_human2 = re.sub(r'\n', ' ', text_human)
print('*****************************HANDMADE PREPROCESSED TEXT******************************')
print(text_human2)

# Counting dots approximates the number of sentences in each version.
print('\nAutomatic end of sentence count of preprocessed text:', preproc_text.count('.'))
print('Human end of sentence count of original test text:', text_human2.count('.'))

*****************************PREPROCESSED TEXT******************************
For other optional flags of opencv_createsamples see the official documentation at http___Docs_opencv_org_doc_user_guide_ug_traincascade_html . 99 www_it_ebooks_info . . Generating Haar Cascades for Custom 8_4 Targets . Creating cascade by running . opencv_traincascade 3_ anoche . 4 Después . . Over 110 recipes to master this full_stack Python web framework 1 . Take your web2py skills to the next level by dipping into delicious usable recipes in this cookbook 2 . Learn advanced web2py usage from building advanced forms to creating PDF reports 3 . Written by developers of the web2py project with plenty of code examples for interesting and comprehensive learning . . Mi correo es abelm_uclv_cu . A ver si lo coge . Please check www_PacktPub_com for information on our titles . www_it_ebooks_info . . Learning SciPy for Numerical and . Scientific Computing . ISBN 978_1_78216_162_2 . Ahora probaremos la división al fi

## 2.1.3 Other Normalization Processes

Excerpts of the original source code of each library are included to show the limitations of its normalization method.

### NLTK

Python's Natural Language Toolkit (NLTK) is a suite of libraries that has become one of the best tools for prototyping and building natural language processing systems. It was originally created by Steven Bird and Edward Loper at the University of Pennsylvania (see http://www.nltk.org/).

~/nltk/tag/perceptron.py

    def normalize(self, word):
        '''
        Normalization used in pre-processing.
        - All words are lower cased
        - Digits in the range 1800-2100 are represented as !YEAR;
        - Other digits are represented as !DIGITS
        '''
        # ... (body omitted in this excerpt)

This function maps years and other digit strings to placeholder tags and lower-cases everything else; it does not rewrite a token into a canonical form.
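
The three rules listed in the docstring can be sketched as a stand-alone function. This is an illustration written from the docstring alone, not a copy of the NLTK source, and the name `normalize_like_docstring` is hypothetical.

```python
def normalize_like_docstring(word):
    """Apply the three rules listed in the docstring excerpt above."""
    if word.isdecimal():
        # Digits in the range 1800-2100 are treated as years.
        if 1800 <= int(word) <= 2100:
            return "!YEAR"
        return "!DIGITS"
    # Every other word is simply lower-cased.
    return word.lower()


print([normalize_like_docstring(w) for w in ["Havana", "2017", "42"]])
# ['havana', '!YEAR', '!DIGITS']
```

Note that the original number is lost: the output is useful as a feature for a tagger, but not as a normalized text.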

The other references to normalization that can be found inside NLTK deal with mathematical operations that make two different data sets comparable and, in a few cases, with converting strings to lower case.

### Freeling

The FreeLing package consists of a library providing language analysis services (such as morphological analysis, date recognition, PoS tagging, etc.). It was developed by the TALP Research Center at the Universitat Politècnica de Catalunya (see http://nlp.lsi.upc.edu/freeling).

The FreeLing API divides the normalization task into modules that implement the detection of some patterns: Numbers, Punctuation, Dates, Multiword, Name Entity and Quantity.

As you can see in _~/freeling-3.1/src/include/freeling/morfo/tokenizer.h_ and _~/freeling-3.1/src/libfreeling/tokenizer.cc_, there are some basic rules that reveal the logic of the normalization process inside _FreeLing_. Some of those rules are reproduced here (taken from _~/freeling-3.1/data/en/tokenizer.dat_).

    <RegExps>
    INDEX_SEQUENCE   0  (\.{4,}|-{2,}|\*{2,}|_{2,}|/{2,})
    INITIALS1 	 1  ([A-Z](\.[A-Z])+)(\.\.\.)
    INITIALS2 	 0  ([A-Z]\.)+
    TIMES            0  (([01]?[0-9]|2[0-4]):[0-5][0-9])
    NAMES_CODES	 0  ({ALPHA}|{SYMNUM})*[0-9]({ALPHA}|[0-9]|{SYMNUM}+{ALPHANUM})*
    THREE_DOTS 	 0  (\.\.\.)
    QUOTES	         0  (``|<<|>>|'')
    MAILS 	         0  {ALPHANUM}+([\._]{ALPHANUM}+)*@{ALPHANUM}+([\._]{ALPHANUM}+)*
    URLS1 	         0  ((mailto:|(news|http|https|ftp|ftps)://)[\w\.\-]+|^(www(\.[\w\-]+)+))
    URLS2            1  ([\w\.\-]+\.(com|org|net))[\s]
    CONTRACT_0a      1  (i'(m|d|ll|ve))({NOALPHANUM}|$) CI
    CONTRACT_0b      1  ((you|we|they|who)'(d|ll|ve|re))({NOALPHANUM}|$) CI
    CONTRACT_0c      1  ((he|she|it|that|there)'(d|ll|s))({NOALPHANUM}|$) CI
    CONTRACT_0d      1  ((let|what|where|how|who)'s)({NOALPHANUM}|$) CI
    CONTRACT1        1  ({ALPHA}+)('([sdm]|ll|ve|re)({NOALPHANUM}|$)) CI
    CONTRACT2        1  ('([sdm]|ll|ve|re|tween))({NOALPHANUM}|$) CI
    KEEP_COMPOUNDS   0  {ALPHA}+(['_\-\+]{ALPHA}+)+
    *ABREVIATIONS1   0  (({ALPHA}+\.)+)(?!\.\.)
    WORD             0  {ALPHANUM}+[\+]*
    OTHERS_C         0  {OTHERS}
    </RegExps>

These rules address the same kinds of patterns handled by the **preprocess** module referenced before. The disadvantage of FreeLing is the large amount of code (300+ MB) needed for a complete installation. On the other hand, it performs very well, is implemented in C++ and supports many languages, Spanish being the best supported by this platform.
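
To get a feeling for how such an ordered rule list behaves, here is a minimal first-match tokenizer in Python. Three patterns (`INITIALS2`, `TIMES`, `THREE_DOTS`) are copied from the excerpt above; `WORD` and `OTHER` are simplified stand-ins for the `{ALPHANUM}` and `{OTHERS}` macros, whose definitions are not shown. The sketch assumes rules are tried in order and ignores the numeric column, so it does not reproduce FreeLing's actual behaviour.

```python
import re

# (rule name, compiled pattern); rules are tried in this order at each position.
RULES = [
    ("INITIALS2", re.compile(r"([A-Z]\.)+")),
    ("TIMES", re.compile(r"(([01]?[0-9]|2[0-4]):[0-5][0-9])")),
    ("THREE_DOTS", re.compile(r"(\.\.\.)")),
    ("WORD", re.compile(r"\w+")),   # simplified stand-in for {ALPHANUM}+
    ("OTHER", re.compile(r"\S")),   # simplified stand-in for {OTHERS}
]


def tokenize(text):
    """Return (rule_name, token) pairs using the first rule that matches."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        for name, pattern in RULES:
            match = pattern.match(text, pos)
            if match:
                tokens.append((name, match.group(0)))
                pos = match.end()
                break
    return tokens


print(tokenize("H. Albot arrived at 10:30..."))
# [('INITIALS2', 'H.'), ('WORD', 'Albot'), ('WORD', 'arrived'),
#  ('WORD', 'at'), ('TIMES', '10:30'), ('THREE_DOTS', '...')]
```

Because `INITIALS2` comes before `WORD`, the dot in *H.* stays attached to the initial instead of being read as a sentence end.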

### Pattern

Pattern is a Belgian toolkit for data science and some natural language processing tasks. It was made by the Computational Linguistics and Psycholinguistics Research Center (CLiPS) and supports several languages, such as English and Spanish (see http://www.clips.uantwerpen.be/pattern or http://github.com/clips/pattern). At the time of writing, its main problem is that there is no version for Python 3.

~/pattern/text/en/wordnet/__init__.py:

    def normalize(word):
        """ Normalizes the word for synsets() or Sentiwordnet[] by removing accents,
            since WordNet does not take unicode.
        """
       
As you can see this function only helps the English Wordnet implementation to deal with its accent incompatibilities.
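
Accent removal of this kind can be sketched with the standard library alone. The function `strip_accents` is a hypothetical name, and this is not Pattern's implementation, only one common way to get the same effect.

```python
import unicodedata


def strip_accents(word):
    """Decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("división"))  # division
print(strip_accents("Después"))   # Despues
```

For Spanish text this step also loses information (for instance, *ñ* becomes *n*), which is one reason it is not enough as a general normalization process.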

### Other Python libraries in the PyPI repository

* __Normalization__
* __Normalizr__

### A curiosity in Sklearn Library

You can find a normalize method inside this library.

    from sklearn.preprocessing import normalize
    
But, basically, this is to _"Normalize samples individually to unit norm."_ (taken from ~/sklearn/preprocessing/data.py)
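
To make the difference with text normalization concrete, the following dependency-free sketch reproduces what scaling each sample to unit L2 norm means (L2 over rows is the default of `sklearn.preprocessing.normalize`). The function `l2_normalize_rows` is a hypothetical name; it is not the scikit-learn implementation.

```python
import math


def l2_normalize_rows(rows):
    """Scale every row so that its Euclidean (L2) norm equals 1."""
    normalized = []
    for row in rows:
        norm = math.sqrt(sum(v * v for v in row))
        # A row of zeros has no direction; leave it as zeros.
        normalized.append([v / norm if norm else 0.0 for v in row])
    return normalized


print(l2_normalize_rows([[3.0, 4.0], [0.0, 0.0]]))
# [[0.6, 0.8], [0.0, 0.0]]
```

This operates on numeric feature vectors, so it only becomes relevant after the text has already been converted into numbers.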

## 2.1.4 Text Normalization Collection

Use the _02.1_preprocessDocList.py_ script to complete this task.

Details of the above script:

- You may pass the pairs file path, src-path, susp-path and out-path as parameters of the script.
- It starts by reading the *preProcessedDocDict* file to obtain the list of previously preprocessed documents.

    * In depth: the algorithm keeps a dict named "preProcessedDocDict" to optimize the preprocessing flow. Since the same susp or src document can appear in many cases, each document is preprocessed only the first time it is seen.
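
The caching idea can be sketched as follows. Apart from the role played by the dict, everything here is an assumption made for illustration: the function `preprocess_pairs`, the toy documents and the use of `str.lower` as a stand-in for the real normalization flow are not taken from the script.

```python
def preprocess_pairs(pairs, load_text, preprocess):
    """Preprocess each (susp, src) pair, normalizing every document only once."""
    cache = {}  # plays the role of preProcessedDocDict

    def get(doc_id):
        if doc_id not in cache:
            cache[doc_id] = preprocess(load_text(doc_id))
        return cache[doc_id]

    return [(get(susp), get(src)) for susp, src in pairs], cache


# Toy stand-ins for file reading and for the normalization flow.
documents = {"susp-1": "Hello World", "src-1": "Some Source", "src-2": "Other Source"}
calls = []


def load_text(doc_id):
    calls.append(doc_id)  # record every real load
    return documents[doc_id]


pairs = [("susp-1", "src-1"), ("susp-1", "src-2")]
results, cache = preprocess_pairs(pairs, load_text, str.lower)
print(calls)
# ['susp-1', 'src-1', 'src-2']  -> susp-1 is loaded and preprocessed only once
```

With thousands of cases sharing a few hundred documents, skipping repeated work in this way is what keeps the whole run short.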

The cell below shows how to execute the script for the whole corpus.

In [4]:
%run scripts/02.1_preprocessCaseList orig/pairs orig/src/ orig/susp/ norm/

The script will use as default the data folder as working directory
/home/abelma/01b_Paraph/data
Preprocessed cases:  1000 Valid cases:  1000
Preprocessed cases:  2000 Valid cases:  2000
tiempo total:  24.462116241455078


# Conclusions

* The normalization process was conceptualized. 
* The rules defined by the author were then specified as steps of the implemented normalization process.
* Different real-world natural language situations must be analysed in order to implement rules that can process them.
* The classic Python libraries do not implement a normalization process like the one defined in this notebook, mainly because they follow the paradigm of converting text into numerical vectors.
* The selected process has many similarities with the normalization process of the FreeLing platform.

# Bibliography

* Indurkhya, N. & Damerau, F. J. (Eds.). Handbook of Natural Language Processing (series eds. Herbrich, R. & Graepel, T.), CRC Press, 2010.
<a id='Indurkhya2010'></a>

* FreeLing User Manual, 2013.

* Perkins, Jacob. Python 3 Text Processing with NLTK 3 Cookbook, Packt Publishing, 2014.

* Bird, Steven; Klein, Ewan & Loper, Edward. Natural Language Processing with Python, O’Reilly Media, 2009.

# Exercises

1. Many of the grammatical concepts listed in this tutorial's scripts are not completely implemented; for example, Abbreviation is only superficially implemented. Pick a grammatical concept, analyse its implementation in the preprocess module and try to improve it.

2. Date recognition is not implemented in this tutorial's scripts. Write a quick RegExp-based implementation and compare it with FreeLing's date recognition rules.

3. The preprocess punctuation_filter method is the first step to correct the effects of some punctuation chars on the normalization process. Take a mathematical PDF book and convert it to text (we suggest using python-pdfminer). Preprocess it with the original implementation and keep the result. Then review all the regular expressions, try new ones and repeat the preprocessing. Finally, compare the results. Save these files for future analysis in the alignment process.

4. Try to implement a normalization flow different from the one shown here, and compare the results by counting the number of sentences. These results will be useful in the next section.