<small><i>This notebook is an English version (2020) of "*Normalización de Textos con Python.ipynb*" on the collection [nlp_pydata2018 on GitHub](https://github.com/sorice/nlp_pydata2018/).</i></small>

# Text Normalization with Python

**Text Normalization** is the subprocess that merges different ways of
writing the same term into a single, appropriate and accepted form.
For example, a single document can contain the words "Señor", "señor",
"Sr." and "Sr"; all of them must be normalized to one unique form. [[1](#Indurkhya2008)]

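As a minimal sketch of that idea, the variants from the example can be collapsed with one case-insensitive regular expression. The canonical form `señor` and the helper name `normalize_title` are assumptions made only for illustration; they are not part of the notebook's pipeline.

```python
import re

# Hypothetical helper: collapse the spelling variants of one word into a single form.
# Matches "Señor", "señor", "Sr." and "Sr" as whole tokens.
TITLE_VARIANTS = re.compile(r'(?<!\w)(?:señor|sr\.?)(?!\w)', re.IGNORECASE)

def normalize_title(text, canonical='señor'):
    return TITLE_VARIANTS.sub(canonical, text)

print(normalize_title('El Sr. Pérez saludó al Señor López y al Sr Díaz'))
```
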
**Tips**:

* The most important sign here is the **sentence end dot**. (Abel2015)
* The second most important sign is the **underscore** ("_"). It allows marking
  [collocations](#collocations) for later text preprocessing.
* A whitespace before and after every *sentence end dot* simplifies the
  regular expressions used to tokenize. (Abel2015)
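
The last two tips can be sketched as follows. The collocation list and both function names are hypothetical and were chosen only for this illustration; with the underscores and the spaced dots in place, a plain `split()` is already enough to tokenize.

```python
import re

# Hypothetical collocation list, chosen only for illustration.
COLLOCATIONS = ['United States of America', 'white wine']

def mark_collocations(text, collocations=COLLOCATIONS):
    # Join the words of each known collocation with underscores.
    for phrase in collocations:
        text = text.replace(phrase, phrase.replace(' ', '_'))
    return text

def space_sentence_end_dot(text):
    # Put a whitespace before and after every dot that is followed
    # by whitespace or by the end of the text.
    return re.sub(r'\s*[.](?=\s|$)\s*', ' . ', text)

sample = 'He moved to the United States of America. He likes white wine.'
tokens = space_sentence_end_dot(mark_collocations(sample)).split()
print(tokens)
```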

**Preparing the scope for preprocessing**

In [1]:
import re
import string

LETTERS = ''.join([string.ascii_letters, string.digits])

## Punctuation Signs

### Signs out of ASCII & Latin1 range

This is an example of unusual non-ASCII quotation marks that often appear in rich texts. The function below filters them out to avoid rare characters in the rest of the pipeline.

In [2]:
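def punctuation_filter(text):
   # Replace stray characters and every kind of double quote with a space
   text = re.sub(
                 u'(?:\xc2|\xa0)|'                             # Stray Â and non-breaking space.
                 u'(?:\\xe2\\x80\\x9d|\\xe2\\x80\\x9c)|'       # Del “” read as mis-decoded UTF-8 bytes.
                 u'(?:\u201c|\u201d)|'                         # Del “” as Unicode characters.
                 u'(?:["]|[\'])'                               # Delete double & single straight quotes.
                 ,' ',text)
   # Map curly single quotes to a plain apostrophe
   text = re.sub(u'(?:\\xe2\\x80\\x99|\\xe2\\x80\\x98)|'       # ‘’ read as mis-decoded UTF-8 bytes.
                 u'(?:\u2018|\u2019)'                          # ‘’ as Unicode characters.
                 ,'\'',text) 
   # Map the en dash to a spaced hyphen
   text = re.sub(u'(?:\\xe2\\x80\\x93)|'                       # – read as mis-decoded UTF-8 bytes.
                 u'(?:\u2013)'                                 # – as a Unicode character.
                 ,' - ',text)
   return text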

**Note**: This function is only a small example; a more developed version is included in ``preprocess.punctuation``. The most important detail is that texts in which these errors
are not cleaned will raise errors in the rest of the normalization pipeline.
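
For the plain Unicode cases, part of the same mapping can be sketched with `str.translate`. This is not the implementation found in ``preprocess.punctuation``; the table `QUOTE_TABLE` and the function `translate_punctuation` are hypothetical names used only here, and the mis-decoded byte sequences are deliberately left out.

```python
# Hypothetical translation table mirroring part of the replacements made above.
QUOTE_TABLE = str.maketrans({
    '\u201c': ' ',    # left double quotation mark
    '\u201d': ' ',    # right double quotation mark
    '"': ' ',         # straight double quote
    '\u2018': "'",    # left single quotation mark
    '\u2019': "'",    # right single quotation mark
    '\u2013': ' - ',  # en dash
    '\xa0': ' ',      # non-breaking space
})

def translate_punctuation(text):
    # One pass over the text, one lookup per character.
    return text.translate(QUOTE_TABLE)

print(translate_punctuation('\u201cHola\u201d dijo \u2018Ana\u2019 \u2013 y se fue'))
```

A translation table only handles single characters, so multi-character sequences still need regular expressions.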

### The 3 dots sign ...

The location of sentence **end points** is important for semantic analysis, because sentence tokenization depends on it. For regular expressions, however, the three dots sign is very hard to handle.
It is not yet clear which pattern should replace it, so the following code
simply removes it.

**Note**: this code used to be problematic because of whitespace between the dots.

In [3]:
def del_contiguous_point_support(text):
   # Match two or more dots, optionally separated or followed by whitespace,
   # and blank out every dot inside the match (text length is preserved)
   return re.sub(r'[.]\s*[.]+[\s.]*',
                 lambda m: m.group().replace('.', ' '), text)
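
Since the replacement pattern is still an open question, one option is to keep the information instead of removing it. The following is only a sketch under that assumption: the placeholder `_ELLIPSIS_` and the function `mark_ellipsis` are hypothetical and do not belong to the notebook's pipeline.

```python
import re

# Assumption: '_ELLIPSIS_' is a made-up placeholder, not a token used by the notebook.
ELLIPSIS = re.compile(r'[.](?:\s*[.])+|\u2026')

def mark_ellipsis(text, placeholder=' _ELLIPSIS_ '):
    # Replace runs of two or more dots (optionally separated by whitespace)
    # and the single-character ellipsis with a placeholder token.
    return ELLIPSIS.sub(placeholder, text)

print(mark_ellipsis('Bueno... no sé. Quizás . . . mañana.'))
```

Single sentence end dots are left untouched, so sentence tokenization can still rely on them.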

## Special Tokens

*Changes at the morphological and lexical level.*

### Emails and Multi-Word Expressions

Some tokens, such as the email address **pedro@gmail.com** or the expressions **teaching - learning** and
**Firefox-v0.8**, must be kept intact because of their semantic value, either as nouns or as nominal phrases.

In [5]:
def contiguos_string_recognition_support(text):
   # Separate a hyphen that opens a line from the word that follows it
   text = re.sub('\n-','\n- ',text)
   # Support for email addresses is inside the regexp
   for i in re.finditer(r'[.]\w*|-\w*|@\w*',text): 
      for j in range(i.start(),i.end()):
         # A punctuation sign glued to the next char becomes '_'
         if (j < len(text)-1 and text[j] in string.punctuation
               and text[j+1] not in string.whitespace):
            text = text[:j]+'_'+text[j+1:]
   return text
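
The email case alone can be sketched with a dedicated pattern. The expression below is a simplified assumption (real addresses allow more characters), and `protect_emails` is a hypothetical name, not a function of the notebook.

```python
import re
import string

# Simplified, hypothetical email pattern; real addresses allow more characters.
EMAIL = re.compile(r'[\w.+-]+@[\w-]+(?:[.][\w-]+)+')

def protect_emails(text):
    # Replace every punctuation sign inside an email address with an underscore,
    # so that later punctuation cleaning keeps the address as one token.
    def to_token(match):
        return ''.join('_' if ch in string.punctuation else ch
                       for ch in match.group())
    return EMAIL.sub(to_token, text)

print(protect_emails('Write to pedro@gmail.com today.'))
```

The sentence end dot is not part of the match, because the pattern requires at least one character after each dot of the domain.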

### URLs

URLs are another kind of special token.

In [6]:
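def url_string_recognition_support(text):
   # A URL starts with 'www' or 'http' and ends before whitespace,
   # optionally preceded by sentence-final dots
   for i in re.finditer(r'www\S*(?=[.]+?\s+?)|www\S*(?=\s+?)|http\S*(?=[.]+?\s+?)'
                        r'|http\S*(?=\s+?)',text):
      for j in range(i.start(),i.end()):
         # Replace every punctuation sign inside the URL with '_'
         if text[j] in string.punctuation:
            text = text[:j]+'_'+text[j+1:]
   return text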

This function handles two situations: a URL followed by whitespace (**Expr.** ``www\S*(?=\s+?)``),
and a URL as the final token of a sentence (**Expr.** ``www\S*(?=[.]+?\s+?)``), **Eg.**: *... www.google.com.*

**Note**: The parsed string (*text*) must end with at least one whitespace.
Otherwise, in a case such as
*text = 'www.google.com'*, the regular expressions would also have to identify the *'m'* as
the end of the string.
This would make the recognition function more complex, when the problem can be
solved by adding a whitespace to the end of the string before parsing it.
This is very simple to implement in the flow (see, as **Eg.**, section
[add_text_end_dot](#add_text_end_dot)).
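
The effect of the missing whitespace can be checked with a small sketch. The pattern `URL` below is a simplified assumption that keeps the same lookahead idea; it is not the expression used in the function above.

```python
import re

# Same lookahead idea as above: a URL must be followed by whitespace,
# optionally preceded by sentence-final dots.
URL = re.compile(r'(?:www|http)\S*?(?=[.]*\s)')

text = 'www.google.com'
print(URL.findall(text))        # no trailing whitespace, so nothing is found
print(URL.findall(text + ' '))  # padded text, the URL is found
```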

### Acronyms and Abbreviations

**Acronyms, abbreviations and similar forms** are a special type of token. Handling them requires
a well-polished dictionary, or perhaps a good algorithm to recognize some of them (current solutions are based on Machine Learning). There are, however, several dictionaries, such as the LibreOffice ones, that could be used and improved.

In [7]:
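def abbrev_recognition_support(text):
   # The inner dots of 'Ms.C' and 'Ph.D' are escaped so they only match a literal dot
   for i in re.finditer(r'Dr(?=[.]+?)|Ms[.]C(?=[.]+?)|Ph[.]D(?=[.]+?)|Ing(?=[.]+?)|Lic(?=[.]+?)',
                        text):
      # Replace the dot that follows the abbreviation with '_'
      text = text[:i.end()]+'_'+text[i.end()+1:]
   return text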

**Hypothesis**: Algorithms that search for a string in a list or dictionary may be somewhat slower
than regular expressions. A lookup in a data structure is needed
once for each token, whereas a regular expression scans and replaces over the complete
text once for each pattern.

In [8]:
# Pending: version 2 with a dictionary from LibreOffice or Google Translate.
with open('data/abbr') as abbr_file:
    abbr = abbr_file.read()
abbrDict = {}
pattern = ':'
for word in abbr.split('\n'):
    abbrDict[word] = word
print (len(abbrDict))

def abbr_filter(text, dic):
    ntext = ''
    for word in text.split(' '):
        if word in dic:
            word = dic[word]
        ntext = ntext + word + '_'    
    return ntext

481


### Profiling of Abbreviation Detection

In [9]:
# time.clock was removed in Python 3.8; perf_counter is its replacement
from time import perf_counter as clock
text = '' # Building a test text.
for word in abbrDict:
    text += word+' '
for n in range(2):
    text += text

print (len(text))

11484


In [10]:
print ('Expr')
start_time1=clock()
%timeit abbrev_recognition_support(text)
end_time1=clock()-start_time1
print ('Time based on Regular Expressions %.4f' %end_time1)

Expr
10000 loops, best of 3: 105 µs per loop
Time based on Regular Expressions 4.5335


In [11]:
print ('Dict')
start_time2=clock()
%timeit abbr_filter(text,abbrDict)
end_time2=clock()-start_time2
print ('Time based on dictionaries %.4f' %end_time2)

Dict
1000 loops, best of 3: 1.07 ms per loop
Time based on dictionaries 4.5355


### Profiling Result

Indeed, the dictionary-based abbreviation search is about 10 times slower than the one based on
regular expressions (1.07 ms versus 105 µs per loop), evaluated on a test text of more
than 11,000 characters.
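
Outside IPython, where the `%timeit` magic is not available, a comparison of this kind can be sketched with the standard `timeit` module. The miniature abbreviation set, the sample sentence and both function names are assumptions made for this example; the timings depend on the machine and on the data, so no result is claimed here.

```python
import re
import timeit

# Hypothetical miniature data; the notebook uses the entries of 'data/abbr'.
ABBREVIATIONS = {'Dr.', 'Ing.', 'Lic.'}
PATTERN = re.compile(r'\b(?:Dr|Ing|Lic)(?=[.])')
sample = 'El Dr. Ruiz y la Ing. Mora llegaron tarde. ' * 200

def by_regex(text):
    # One pass over the whole text for the combined pattern.
    return len(PATTERN.findall(text))

def by_set(text):
    # One membership test per whitespace-separated token.
    return sum(1 for word in text.split() if word in ABBREVIATIONS)

for func in (by_regex, by_set):
    seconds = timeit.timeit(lambda: func(sample), number=100)
    print(func.__name__, round(seconds, 4))
```

Both functions count the same abbreviations, so only the search strategy differs between the two measurements.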

### Stopwords

Although stopwords are essentially meaningless tokens within the sentence and
generally act as connectors, we treat them separately because of their importance in NLP, mainly
for computational efficiency and for the quality of similarity results.

In [13]:
def del_char_len_one(text):
   # Remove single-character tokens surrounded by whitespace
   text = re.sub(r'\s\w\s',' ',text)
   return text
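
Beyond single characters, stopwords are usually filtered against a list. The sketch below assumes a deliberately tiny, hypothetical Spanish list; real lists, such as the ones shipped with NLTK, are much longer.

```python
# Hypothetical, deliberately tiny stopword list used only for illustration.
STOPWORDS = {'a', 'de', 'el', 'en', 'la', 'que', 'se', 'un', 'una', 'y'}

def remove_stopwords(tokens, stopwords=STOPWORDS):
    # Keep only the tokens whose lowercase form is not a stopword.
    return [tok for tok in tokens if tok.lower() not in stopwords]

tokens = 'En bases de datos se denomina ACID a un conjunto'.split()
print(remove_stopwords(tokens))
```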

## Structural Normalization

The next function only adds a dot at the end of the document if there isn't one. This avoids difficulties when tokenizing the last sentence.

### add_text_end_dot

In [13]:
def add_text_end_dot(text):
   # Walk back over the trailing characters that are not letters or digits
   end = len(text)
   while end > 0 and text[end-1] not in LETTERS:
      end -= 1
   # If that trailing run already contains a dot, leave the text unchanged
   if '.' in text[end:]:
      return text
   # Otherwise add one '.'
   return text + '.'

## Normalization Pipeline

This process can differ depending on your final goal, that is, on the task the resulting data is designed for.
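
One way to keep the order of the steps easy to change is to treat the pipeline as a list of functions. This is only a sketch: `run_pipeline` is a hypothetical helper, and the toy steps stand in for the normalization functions defined in this notebook.

```python
from functools import reduce

def run_pipeline(text, steps):
    # Apply each normalization step in order, feeding the output of one into the next.
    return reduce(lambda acc, step: step(acc), steps, text)

# Hypothetical toy steps standing in for the functions defined in this notebook.
steps = [str.strip, str.lower, lambda t: ' '.join(t.split())]
print(run_pipeline('  Text   Normalization WITH Python ', steps))
```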

In [15]:
import time
from nltk.tokenize import RegexpTokenizer, WordPunctTokenizer
from preprocess.punctuation import Replacer
from preprocess.data import tnlp1_path

inita = time.time()
doc_name = tnlp1_path()[:-4]
with open(doc_name+'.txt','r') as f:
    text = f.read()    # the tokenizer expects a string, not a file object
    print('---------')
    #Count unique terms
    tokenizer = RegexpTokenizer(r"\s+", gaps=True)
    tokensa = tokenizer.tokenize(text)
    tokens_uniqueA = set(tokensa)

#-------------------Special tokens recognition and normalization
initg = time.time()

with open(doc_name+'.txt','r') as f:
    text = f.read()    # the functions below expect a string, not a file object
    print ('processing urls')
    text = url_string_recognition_support(text)
    print ('processing some special punctuation signs')
    text = punctuation_filter(text)
    print ('clean contiguous dots')
    text = del_contiguous_point_support(text)
    print ('abbrev recognition and normalization')
    #~ text = abbrev_recognition_support(text)
    print ('contiguous string recognition')
    # This step is slow; the reason still needs to be investigated
    text = contiguos_string_recognition_support(text) 

# Same path that is read back in the next step
with open('test/2.3/out_'+doc_name+'1_normalized_tokens.txt', 'w') as txt:
    txt.write(text)

#-------------------Clean all punctuation sign
print ('- Cleaning the punctuation signs.')
text = open('test/2.3/out_'+doc_name+'1_normalized_tokens.txt','r').read()
replacer = Replacer()
chunk = replacer.replace(text)

texto = open('test/2.3/out_'+doc_name+'2_tokens_including_points.txt','w')
texto.write(chunk)
texto.close()

text = open('test/2.3/out_'+doc_name+'2_tokens_including_points.txt','r').read()
tokenizer = RegexpTokenizer(r"\s+", gaps=True)
tokens = tokenizer.tokenize(text)

#Counting unique terms
tokens_uniqueD = set(tokens)

timeg = time.time() - initg

print ('-----CLEANING-------------: ', timeg)
print ('tokens data type is:', type(tokens))
print ("Number of tokens after cleaning is: ", len(tokens),
"\nDeleted "+str(len(tokens)-len(tokensa))+" tokens during cleaning.",
"\n Deleted uniques: ", len(tokens_uniqueD)-len(tokens_uniqueA))

text = open('test/2.3/out_'+doc_name+'2_tokens_including_points.txt', 'r').read()
text = add_text_end_dot(text)

texto = open('test/2.3/out_'+doc_name+'6_clean_punctuation.txt', 'w')
texto.write(text)
texto.close()

timefa = time.time() - inita
print ('Number of unique terms when filtering: ', len(tokens_uniqueD))

print ('Made in ', timefa)
print (time.ctime())

---------
processing urls
processing some special punctuation signs
clean contiguous dots
abbrev recognition and normalization
contiguous string recognition
- Cleaning the punctuation signs.
-----CLEANING-------------:  0.01045370101928711
tokens data type is: <class 'list'>
Number of tokens after cleaning is:  886 
Deleted 42 tokens during cleaning. 
 Deleted uniques:  -28
Number of unique terms when filtering:  346
Made in  0.01274251937866211
Fri Sep  2 14:47:59 2016


## Result Analysis

Comparing the algorithm's result with a human analysis.

In [16]:
textout = open('test/2.3/out_'+doc_name+'6_clean_punctuation.txt').read()
texthuman = open('test/2.3/'+doc_name+'_human_analysis.txt').read()
lineout = []
linehuman=[]

for line in textout.split('.'):
   lineout.append(line)
for line in texthuman.split('.'):
   linehuman.append(line)
    
for i in range(15):#max(len(lineout),len(linehuman))):
   if i < len(lineout):
        print (lineout[i])
   if i < len(linehuman):
        print (linehuman[i])
   print  ('-----')

ACID 
ACID 
-----
 En bases de datos se denomina ACID a un conjunto de características necesarias para que una serie de instrucciones puedan ser consideradas como una transacción 
 
En bases de datos se denomina ACID a un conjunto de características necesarias para que una serie de instrucciones puedan ser consideradas como una transacción 
-----
 Así pues si un sistema de gestión de bases de datos es ACID compliant quiere decir que el mismo cuenta con las funcionalidades necesarias para que sus transacciones tengan las características ACID 
 Así pues, si un sistema de gestión de bases de datos es ACID compliant quiere decir que el mismo cuenta con las funcionalidades necesarias para que sus transacciones tengan las características ACID
-----
 En concreto ACID es un acrónimo de Atomicity Consistency Isolation and Durability 


En concreto ACID es un acrónimo de Atomicity, Consistency, Isolation and Durability
-----
 Atomicidad Consistencia Aislamiento y Durabilidad en español 
 Atomici
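
The visual comparison above can be complemented with a number. The sketch below uses `difflib.SequenceMatcher` from the standard library on one pair of sentences taken from the output; the helper `sentence_similarity` is a hypothetical name, and the choice of this measure is an assumption, not the notebook's evaluation method.

```python
import difflib

def sentence_similarity(machine, human):
    # Ratio in [0, 1]; whitespace is collapsed first so that layout
    # differences are not counted.
    a = ' '.join(machine.split())
    b = ' '.join(human.split())
    return difflib.SequenceMatcher(None, a, b).ratio()

machine = ' En concreto ACID es un acrónimo de Atomicity Consistency Isolation and Durability '
human = 'En concreto ACID es un acrónimo de Atomicity, Consistency, Isolation and Durability'
print(round(sentence_similarity(machine, human), 3))
```

A ratio of 1.0 means both versions are identical after whitespace is collapsed; here the two commas kept by the human version lower it slightly.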

## References

<a id='Indurkhya2008'></a>
[1] *[Indurkhya2008]* Nitin Indurkhya. **Handbook of Natural Language Processing**. 2008.
p. 10. **ISBN**: 978-1-4200-8593-8

## Alphabetic Index

<a id='collocations'></a>
**Collocations**: sequences of words that appear together very frequently and, because of that, become linguistic units of their own. Eg. “black night”, “white wine”, 
“United States of America”, etc.