# utils2 example: text normalization

A short example of using the [aura-cognitive-utils2](https://github.com/Telefonica/aura-cognitive-utils2) library for text normalization.


## String-based normalization

Calling the object as a function returns a normalized string (equivalent to calling the `to_str()` method).

In [1]:
from auracog_utils.text import TextNormalizer, NormLevel

In [2]:
t1 = '¡Dábale arroz  a la	zorra el abad!'
t2 = 'Pregunta: ¿El vigésimo segundo participante tenía cuarenta y dos años?'
t3 = 'Spider-man haciendo un auténtico Spain Pick & Roll'

### Standard

In [3]:
# Standard normalization is the default
norm = TextNormalizer('es_ES')

In [4]:
norm(t1)

'dabale arroz a la zorra el abad'

In [5]:
norm(t2)

'pregunta el 22 participante tenia 42 años'

In [6]:
norm(t3)

'spider-man haciendo un autentico spain pick & roll'
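
To make the standard-level behaviour above more concrete, here is a minimal standard-library sketch that reproduces the outputs for `t1` and `t3`. It is not the library's implementation: `sketch_normalize`, the accent table and the punctuation set are hypothetical names and assumptions chosen only to match the results shown, and the conversion of number words to digits seen for `t2` is not attempted.

```python
import re

# Hypothetical helpers for illustration; not part of aura-cognitive-utils2.
# Assumption: only acute accents on vowels are stripped, so 'ñ' is preserved.
_ACCENTS = str.maketrans('áéíóú', 'aeiou')
# Assumption: this small punctuation set is replaced by spaces;
# '-' and '&' are left untouched, as in the standard-level output for t3.
_PUNCT = re.compile(r'[¡!¿?:;,.]')


def sketch_normalize(text: str) -> str:
    """Lowercase, strip acute accents, drop punctuation, collapse whitespace."""
    text = text.lower().translate(_ACCENTS)
    text = _PUNCT.sub(' ', text)
    # str.split() with no arguments also collapses tabs and repeated spaces
    return ' '.join(text.split())


t1 = '¡Dábale arroz  a la\tzorra el abad!'
t3 = 'Spider-man haciendo un auténtico Spain Pick & Roll'
print(sketch_normalize(t1))
print(sketch_normalize(t3))
```

The sketch only covers character-level cleanup; anything that needs linguistic knowledge, such as turning 'cuarenta y dos' into '42', is left to the library.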

### Minimal

In [7]:
norm2 = TextNormalizer('es_ES', level=NormLevel.MINIMAL)

In [8]:
norm2(t1)

'Dábale arroz a la zorra el abad'

In [9]:
norm2(t2)

'Pregunta El vigésimo segundo participante tenía cuarenta y dos años'

In [10]:
norm2(t3)

'Spider-man haciendo un auténtico Spain Pick & Roll'

### Full

In [11]:
norm3 = TextNormalizer('es_ES', level=NormLevel.FULL)

In [12]:
norm3(t1)

'dabale arroz a la zorra el abad'

In [13]:
norm3(t2)

'pregunta el 22 participante tenia 42 años'

In [14]:
norm3(t3)

'spider man haciendo un autentico spain pick roll'

## Token normalization

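Calling the `to_tkn()` method returns two paired lists of tokens: the original tokens (excluding punctuation) and their normalized counterparts.

Note that a token is usually a word, but not always: a multi-word number such as 'vigésimo segundo' is kept as a single token.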

In [15]:
result = norm3.to_tkn(t1)

print(result.orig, result.norm, sep='\n')

['Dábale', 'arroz', 'a', 'la', 'zorra', 'el', 'abad']
['dabale', 'arroz', 'a', 'la', 'zorra', 'el', 'abad']


In [16]:
result = norm3.to_tkn(t2)

print(result.orig, result.norm, sep='\n')

['Pregunta', 'El', 'vigésimo segundo', 'participante', 'tenía', 'cuarenta y dos', 'años']
['pregunta', 'el', '22', 'participante', 'tenia', '42', 'años']


In [17]:
result = norm3.to_tkn(t3)

print(result.orig, result.norm, sep='\n')

['Spider-man', 'haciendo', 'un', 'auténtico', 'Spain', 'Pick', '&', 'Roll']
['spider man', 'haciendo', 'un', 'autentico', 'spain', 'pick', '', 'roll']
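
Because the two lists are aligned position by position, they can be walked together with `zip`. The sketch below uses the literal lists printed for `t3` rather than calling the library, and the helper names `changed_pairs` and `dropped_tokens` are hypothetical. It assumes, as the output above suggests, that a token removed by normalization appears as an empty string in the normalized list.

```python
# Literal copies of the two lists printed for t3 above
orig = ['Spider-man', 'haciendo', 'un', 'auténtico', 'Spain', 'Pick', '&', 'Roll']
norm = ['spider man', 'haciendo', 'un', 'autentico', 'spain', 'pick', '', 'roll']


def changed_pairs(orig_tokens, norm_tokens):
    """Return (original, normalized) pairs where normalization altered the token."""
    return [(o, n) for o, n in zip(orig_tokens, norm_tokens) if o != n]


def dropped_tokens(orig_tokens, norm_tokens):
    """Return original tokens whose normalized form is empty."""
    return [o for o, n in zip(orig_tokens, norm_tokens) if not n]


for o, n in changed_pairs(orig, norm):
    print(f'{o!r} -> {n!r}')

print(dropped_tokens(orig, norm))
```

Keeping the lists the same length is what makes this kind of alignment possible, for example to map a match found in the normalized text back to the original wording.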


## Cache

If we expect to normalize the same strings repeatedly, we can enable an LRU cache that holds the last N results and so avoids recomputing them.

Note that the cache only works for the object-as-a-function call, not for the `to_str()` and `to_tkn()` methods, which are never cached.

In [18]:
norm = TextNormalizer('es_ES', cache_size=3)

In [19]:
norm(t1)

'dabale arroz a la zorra el abad'

In [20]:
norm(t1)

'dabale arroz a la zorra el abad'

In [21]:
norm.cache_info()

CacheInfo(hits=1, misses=1, maxsize=3, currsize=1)
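
The `CacheInfo` tuple above has the same shape as the one reported by Python's `functools.lru_cache`. Whether the library uses that decorator internally is an assumption, but the following standard-library sketch shows the same pattern: a cached entry point next to an uncached one. `slow_normalize` and `cached_normalize` are hypothetical stand-ins, not library functions.

```python
from functools import lru_cache


def slow_normalize(text: str) -> str:
    # Hypothetical stand-in for an expensive normalization step
    return ' '.join(text.lower().split())


# Cached entry point, analogous to calling the normalizer object as a function
cached_normalize = lru_cache(maxsize=3)(slow_normalize)

cached_normalize('Hola  Mundo')  # first call: computed and stored (miss)
cached_normalize('Hola  Mundo')  # second call: served from the cache (hit)

info = cached_normalize.cache_info()
print(info)

# Calling the undecorated function bypasses the cache entirely,
# in the same way the text says to_str() and to_tkn() are never cached
slow_normalize('Hola  Mundo')
print(cached_normalize.cache_info() == info)
```

With `maxsize=3`, a fourth distinct string evicts the least recently used entry, so the cache size should be chosen with the expected number of frequently repeated inputs in mind.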