# Using Atalaia

Atalaia is a collection of methods for simple NLP tasks, mainly text preprocessing for machine learning. It works mostly with strings and lists of strings. To use it, import the module and create one instance for each language you need.

You can use Atalaia with the following languages (but not every feature is available for all of them):

- pt-br
- en
- fr

In [1]:
# import Atalaia
from atalaia.atalaia import Atalaia
import pprint
from tqdm import tqdm

# starting Atalaia instances
atalaia_pt_br = Atalaia('pt-br')
atalaia_en = Atalaia('en')
atalaia_fr = Atalaia('fr')

Let's start with preprocessing steps that remove or replace parts of the text.

In [2]:
# removing html tags
removed_html = atalaia_en.remove_html_tags('<h1><strong>Hello world!</strong></h1>')
print(removed_html)

Hello world!


In [3]:
# if you want, you can use a placeholder to mark the locations where the tags were found
replaced_html = atalaia_en.replace_html_tags('<h1><strong>Hello world!</strong></h1>', 'HTML')
print(replaced_html)

HTML HTML Hello world! HTML HTML


In [4]:
# remove urls from text...
removed_urls = atalaia_en.remove_urls('You can go to the page http://homepage.com to see the content.')
print(removed_urls)

You can go to the page to see the content.


In [5]:
# ...or simply replace them
replaced_urls = atalaia_en.replace_urls('You can go to the page http://homepage.com to see the content.', 'URL')
print(replaced_urls)

You can go to the page URL to see the content.


In [6]:
# remove hashtags
removed_hashtags = atalaia_en.remove_hashtags('I wish you all #love and #peace!')
print(removed_hashtags)

I wish you all and !


In [7]:
# replace the hashtags
replaced_hashtags = atalaia_en.replace_hashtags('I wish you all #love and #peace', 'HASHTAG')
print(replaced_hashtags)

I wish you all HASHTAG and HASHTAG


In [8]:
# remove ips
removed_ips = atalaia_en.remove_ips('This can be accessed on the address 198.162.0.1.')
print(removed_ips)

This can be accessed on the address .


In [9]:
# replace ips
replaced_ips = atalaia_en.replace_ips('This can be accessed on the address 198.162.0.1.', 'IP')
print(replaced_ips)

This can be accessed on the address IP.


In [10]:
# remove @handles
removed_handles = atalaia_en.remove_handles('Can you come tonight, @alice?')
print(removed_handles)

Can you come tonight, ?


In [11]:
# or replace them
replaced_handles = atalaia_en.replace_handles('Can you come tonight, @alice?', 'USERNAME')
print(replaced_handles)

Can you come tonight, USERNAME?
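
The remove/replace pairs above can be approximated with plain regular expressions. The sketch below is not Atalaia's implementation: the patterns and the `replace_pattern` helper are assumptions made for illustration, and real-world URLs, IPs and handles need stricter expressions.

```python
import re

# Illustrative patterns only; Atalaia's own expressions may differ.
PATTERNS = {
    "url": re.compile(r"https?://\S+"),
    "hashtag": re.compile(r"#\w+"),
    "ip": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "handle": re.compile(r"@\w+"),
}


def replace_pattern(text, kind, placeholder=""):
    """Replace every match of the named pattern with the placeholder.

    An empty placeholder removes the match and leaves the surrounding
    spaces untouched, so removals can produce doubled spaces.
    """
    return PATTERNS[kind].sub(placeholder, text)


print(replace_pattern("You can go to the page http://homepage.com to see the content.", "url", "URL"))
# You can go to the page URL to see the content.
print(replace_pattern("This can be accessed on the address 198.162.0.1.", "ip", "IP"))
# This can be accessed on the address IP.
print(replace_pattern("Can you come tonight, @alice?", "handle", "USERNAME"))
# Can you come tonight, USERNAME?
```

Note that the URL pattern stops at the first whitespace, so punctuation glued to the end of a URL would be swallowed by the match.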


In [12]:
# replace quotes
replaced_quotes = atalaia_en.replace_quotes('She told me: "you are a confident and strong woman".', 'QUOTED')
print(replaced_quotes)

She told me: QUOTED you are a confident and strong woman QUOTED .


In [13]:
# remove numbers
removed_numbers = atalaia_en.remove_numbers('I told him 1,2,3,4 times that he could not do that!')
print(removed_numbers)

I told him ,,, times that he could not do that!


To replace numbers with words, use the method replace_numbers(). Note that it works digit by digit, so it WON'T turn multi-digit numbers into a readable form. E.g., 20 will become "two zero" and not "twenty".

In [14]:
# or replace numbers by words
replaced_numbers = atalaia_en.replace_numbers('1 is the first number.')
print(replaced_numbers)

one is the first number.
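
To make the digit-by-digit behaviour concrete, here is a minimal standalone sketch. It does not use Atalaia; `spell_digits` and `DIGIT_WORDS` are hypothetical names, and the code only mimics the behaviour described above for English.

```python
import re

DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]


def spell_digits(text):
    """Spell out every digit on its own, with no notion of place value."""
    def spell(match):
        return " ".join(DIGIT_WORDS[int(digit)] for digit in match.group())
    return re.sub(r"[0-9]+", spell, text)


print(spell_digits("1 is the first number."))  # one is the first number.
print(spell_digits("I told him 20 times."))    # I told him two zero times.
```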


Now, we can start to deal with special characters and punctuation.

In [15]:
# strip accents
stripped_accents = atalaia_pt_br.strip_accents('Mamãe me disse: você é o menino mais ágil do time.')
print(stripped_accents)

Mamae me disse: voce e o menino mais agil do time.
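
A common way to strip accents with the standard library alone is Unicode decomposition. The sketch below shows that technique; whether Atalaia does it this way internally is an assumption, and `strip_accents_sketch` is a hypothetical name.

```python
import unicodedata


def strip_accents_sketch(text):
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


print(strip_accents_sketch("Mamãe me disse: você é o menino mais ágil do time."))
# Mamae me disse: voce e o menino mais agil do time.
```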


In [16]:
# remove punctuation
removed_punctuation = atalaia_en.remove_punctuation('Hey, are you here??? I really need your help!')
print(removed_punctuation)

Hey are you here I really need your help


In [17]:
# replace punctuation
replaced_punctuation = atalaia_en.replace_punctuation('Hey, are you here??? I really need your help!', placeholder='PUNCTUATION')
print(replaced_punctuation)

Hey PUNCTUATION are you here PUNCTUATION PUNCTUATION PUNCTUATION I really need your help PUNCTUATION


You can also target a specific character. Let's say you want to replace question marks only. You could do this:

In [18]:
# or replace only a specific char
replaced_punctuation = atalaia_en.replace_punctuation('Hey, are you here??? I really need your help!', sign='?', placeholder='QUESTION')
print(replaced_punctuation)

Hey, are you here QUESTION QUESTION QUESTION I really need your help!


You can expand common contractions.

In [19]:
atalaia_fr.expand_contractions("Je veux qu'on reste ici. Je t'embrasse fort.")

'Je veux que on reste ici. Je te embrasse fort.'
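
Contraction expansion of this kind can be sketched as a lookup table plus a regular expression. The table below is a tiny, hypothetical subset of French elisions chosen for illustration; it is not the list Atalaia ships with, and it only handles lowercase forms.

```python
import re

# Hypothetical, deliberately incomplete table of French elisions.
CONTRACTIONS = {"qu": "que", "t": "te", "j": "je", "m": "me", "d": "de", "n": "ne"}

# Match an elided form at the start of a word, followed by an apostrophe.
CONTRACTION_PATTERN = re.compile(r"\b(" + "|".join(CONTRACTIONS) + r")'")


def expand_contractions_sketch(text):
    """Replace each known elision with its full form and a space."""
    return CONTRACTION_PATTERN.sub(lambda m: CONTRACTIONS[m.group(1)] + " ", text)


print(expand_contractions_sketch("Je veux qu'on reste ici. Je t'embrasse fort."))
# Je veux que on reste ici. Je te embrasse fort.
```

Ambiguous forms such as l' (le or la) are left out on purpose, since a lookup table alone cannot resolve them.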

You can stem the sentences using the NLTK stemmer.

In [20]:
stemmed_pt_br = atalaia_pt_br.stem_sentence('Eu adoro fazer compras com minha mãe e com minhas amigas')
print(stemmed_pt_br)
stemmed_en = atalaia_en.stem_sentence('I love to go shopping with my mother and friends')
print(stemmed_en)
stemmed_fr = atalaia_fr.stem_sentence('J\'adore faire du shopping avec ma mère et mes amies')
print(stemmed_fr)

eu ador faz compr com minh mã e com minh amig
i love to go shop with my mother and friend
j'ador fair du shopping avec ma mer et me ami


Finally, you can remove stop words.

In [21]:
removed_stop_words_pt_br = atalaia_pt_br.remove_stopwords('Eu adoro fazer compras com minha mãe e com minhas amigas')
print(removed_stop_words_pt_br)
removed_stop_words_en = atalaia_en.remove_stopwords('I love to go shopping with my mother and friends')
print(removed_stop_words_en)
removed_stop_words_fr = atalaia_fr.remove_stopwords('J\'adore faire du shopping avec ma mère et mes amies.')
print(removed_stop_words_fr)

adoro fazer compras mãe amigas
I love go shopping my mother friends
J'adore faire du shopping avec ma mère et mes amies.


If you want, you can provide your own stop words list. To use only your custom list, load Atalaia in custom mode. 

In [22]:
# use a custom list only, with no language loaded
atalaia = Atalaia('custom')
custom_stopwords = ['pizza']
removed_stop_words_custom = atalaia.remove_stopwords('Every friday, we eat pizza at my house.', custom_list=custom_stopwords)
print(removed_stop_words_custom)

Every friday, we eat pizza at my house.


It's also possible to extend Atalaia's built-in stop words: set extend_set to True and provide a custom stop word list.

In [23]:
# use a custom list together with a loaded language (extend the set)
custom_stopwords = ['pizza']
removed_stop_words_en = atalaia_en.remove_stopwords('Every friday, we eat pizza at my house.', custom_list=custom_stopwords, extend_set=True)
print(removed_stop_words_en)

Every friday, we eat pizza at my house.


While preprocessing social media texts, you may find words with repeated chars, like veryyyyyy looooonnnng words. You can use the method reduce_words_with_repeated_chars to normalize them. Be careful: abbreviations like 'AAA' can be mistaken for elongated words. Another limitation is that it only catches words with chars repeated more than 3 times.

In [24]:
reduce_long_words_en = atalaia_en.reduce_words_with_repeated_chars('I loooooooooooooove pizza so muuuchhhh')
print(reduce_long_words_en)

I love pizza so much
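
The idea can be illustrated with a backreference regular expression. This is a sketch, not Atalaia's code: `reduce_repeated_chars` and its `min_run` threshold are assumptions, and the default of 3 was picked so that the example above is reproduced.

```python
import re


def reduce_repeated_chars(text, min_run=3):
    """Collapse any run of `min_run` or more identical word characters to one."""
    pattern = re.compile(r"(\w)\1{%d,}" % (min_run - 1))
    return pattern.sub(r"\1", text)


print(reduce_repeated_chars("I loooooooooooooove pizza so muuuchhhh"))
# I love pizza so much
print(reduce_repeated_chars("AAA"))
# A
```

The second call shows the abbreviation problem mentioned above: 'AAA' is collapsed just like an elongated word, while the double 'z' in 'pizza' stays below the threshold and is kept.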


Some of the methods above remove parts of the text but leave empty spaces behind. Sometimes the text also arrives with extra spaces already in it. Both cases can be fixed with the method remove_excessive_spaces.

In [25]:
excessive_spaces_removed = atalaia_en.remove_excessive_spaces('I  can\'t     stop looking at  you')
print(excessive_spaces_removed)

I can't stop looking at you
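
For reference, the same normalization can be done with the standard library alone. This is a sketch under the assumption that any run of whitespace should become a single space; `collapse_spaces` is a hypothetical helper.

```python
def collapse_spaces(text):
    """Split on any whitespace run and rejoin with single spaces."""
    return " ".join(text.split())


print(collapse_spaces("I  can't     stop looking at  you"))
# I can't stop looking at you
```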


Use replace_newline to replace newline chars with another string. It takes into account lines that already end with one of ?.!,:;=

In [40]:
atalaia_en.replace_newline('I want to break free.\nI want to break free\nDo you wanna break free?\nI do!')

'I want to break free. I want to break free. Do you wanna break free? I do!'

Notice that the default replacement is ". ".

You can change it, but if you use a replacement other than punctuation + space, you have to set consider_punctuation to False. In that case, sentences that already end with punctuation are left untouched; only the newline char is replaced.

In [39]:
atalaia_en.replace_newline('I want to break free.\nI want to break free\nDo you wanna break free?\nI do!', 
                           replacement="--NEWLINE--", 
                           consider_punctuation = False)

'I want to break free.--NEWLINE--I want to break free--NEWLINE--Do you wanna break free?--NEWLINE--I do!'
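
The two behaviours shown above can be reproduced with a short standalone function. This is a sketch of the logic as described in this section, not Atalaia's source: `join_lines` is a hypothetical name, and the choice to insert a single space after a line that already ends with punctuation is an assumption.

```python
SENTENCE_ENDINGS = tuple("?.!,:;=")


def join_lines(text, replacement=". ", consider_punctuation=True):
    """Join lines, skipping the punctuation part when a line already has one."""
    lines = text.split("\n")
    pieces = []
    for line in lines[:-1]:
        if consider_punctuation and line.rstrip().endswith(SENTENCE_ENDINGS):
            pieces.append(line + " ")          # already punctuated: only add a space
        else:
            pieces.append(line + replacement)  # otherwise insert the full replacement
    pieces.append(lines[-1])                   # nothing is appended after the last line
    return "".join(pieces)


lyrics = "I want to break free.\nI want to break free\nDo you wanna break free?\nI do!"
print(join_lines(lyrics))
# I want to break free. I want to break free. Do you wanna break free? I do!
print(join_lines(lyrics, replacement="--NEWLINE--", consider_punctuation=False))
# I want to break free.--NEWLINE--I want to break free--NEWLINE--Do you wanna break free?--NEWLINE--I do!
```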

## Quick preprocessing

The preprocess method offers a quick way of preprocessing text. If you call it, it performs the following actions in this order:

   - lower text
   - strip trailing whitespaces
   - convert emojis to text
   - replace urls
   - remove tags
   - replace hashtags
   - replace ips
   - remove social media handles
   - replace numbers
   - remove punctuation
   - remove excessive spaces     
   - tag text if True
   - remove stopwords if True
   - remove accents
   - remove excessive spaces
   - stem text if True
   - tokenize text (if True, will return a list of tokens)

By default, preprocess returns a list of tokens for a given sentence.

In [28]:
preprocessed_text = atalaia_en.preprocess("At the end of the dayyyyyyyy,      you're solely responsible for your success and your failure. And the sooner you realize that, you accept that, and integrate that into your work ethic, you will start being successful. As long as you blame others for the reason you aren't where you want to be, you will always be a failure.")
pprint.pprint(preprocessed_text)

['at',
 'the',
 'end',
 'of',
 'the',
 'day',
 'you',
 're',
 'sole',
 'respons',
 'for',
 'your',
 'success',
 'and',
 'your',
 'failur',
 'and',
 'the',
 'sooner',
 'you',
 'realiz',
 'that',
 'you',
 'accept',
 'that',
 'and',
 'integr',
 'that',
 'into',
 'your',
 'work',
 'ethic',
 'you',
 'will',
 'start',
 'be',
 'success',
 'as',
 'long',
 'as',
 'you',
 'blame',
 'other',
 'for',
 'the',
 'reason',
 'you',
 'aren',
 't',
 'where',
 'you',
 'want',
 'to',
 'be',
 'you',
 'will',
 'alway',
 'be',
 'a',
 'failur']


If you need a string instead of a list of tokens, set 'tokenize' to False.

In [29]:
preprocessed_text = atalaia_en.preprocess("At the end of the dayyyyyyyy, you are solely responsible for your success and your failure. And the sooner you realize that, you accept that, and integrate that into your work ethic, you will start being successful. As long as you blame others for the reason you aren't where you want to be, you will always be a failure.", tokenize=False)
pprint.pprint(preprocessed_text)

('at the end of the day you are sole respons for your success and your failur '
 'and the sooner you realiz that you accept that and integr that into your '
 'work ethic you will start be success as long as you blame other for the '
 'reason you aren t where you want to be you will alway be a failur')


You can also skip stemming by setting 'stem' to False. The example below also sets 'remove_stopwords' to True.

In [30]:
preprocessed_text = atalaia_en.preprocess("At the end of the dayyyyyyyy, you are solely responsible for your success and your failure. And the sooner you realize that, you accept that, and integrate that into your work ethic, you will start being successful. As long as you blame others for the reason you aren't where you want to be, you will always be a failure.", stem=False, remove_stopwords=True)
pprint.pprint(preprocessed_text)

['at',
 'end',
 'of',
 'day',
 'you',
 'are',
 'solely',
 'responsible',
 'for',
 'your',
 'success',
 'your',
 'failure',
 'sooner',
 'you',
 'realize',
 'that',
 'you',
 'accept',
 'that',
 'integrate',
 'that',
 'into',
 'your',
 'work',
 'ethic',
 'you',
 'will',
 'start',
 'being',
 'successful',
 'as',
 'long',
 'as',
 'you',
 'blame',
 'others',
 'for',
 'reason',
 'you',
 'aren',
 't',
 'where',
 'you',
 'want',
 'be',
 'you',
 'will',
 'always',
 'be',
 'failure']


Divide corpus into smaller sentences

## List processing

If you have a list of strings, like a pandas series, you can pass it directly to Atalaia for preprocessing. You can also choose if you want the sentences to be tokenized and stemmed.

In [31]:
list_of_strings = [
    'I love you',
    'Please, never leave me alone',
    'If you go, I will die',
    'I am watching a lot of romantic comedy lately',
    'I have to eat icecream'
]

list_processed = atalaia_en.preprocess_list(list_of_strings, stem=False, remove_stopwords=True)
pprint.pprint(list_processed)

[['i', 'love', 'you'],
 ['please', 'never', 'leave', 'me', 'alone'],
 ['if', 'you', 'go', 'i', 'will', 'die'],
 ['i', 'am', 'watching', 'lot', 'of', 'romantic', 'comedy', 'lately'],
 ['i', 'have', 'eat', 'icecream']]


## Vocabulary

Sometimes you need to turn a list of strings into a single string, which we call a corpus here.

In [32]:
# creating corpus from list of strings
corpus = atalaia_en.create_corpus(list_of_strings)
pprint.pprint(corpus)

('  I love you Please, never leave me alone If you go, I will die I am '
 'watching a lot of romantic comedy lately I have to eat icecream')


You can also calculate the lexical diversity of a string relative to a corpus.

In [33]:
diversity = atalaia_en.lexical_diversity(list_of_strings[0], corpus)
print('Diversity of this sentence is {}% compared to corpus.'.format(diversity*100))
print(diversity)

Diversity of this sentence is 12.5% compared to corpus.
0.125
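
One formula that is consistent with the value above is the number of unique words in the sentence divided by the number of unique words in the corpus (3 / 24 = 0.125 here). The sketch below implements that ratio with whitespace tokenization; that Atalaia computes it exactly this way is an assumption, and `lexical_diversity_sketch` is a hypothetical name.

```python
def lexical_diversity_sketch(sentence, corpus):
    """Ratio of unique words in the sentence to unique words in the corpus."""
    sentence_vocab = set(sentence.split())
    corpus_vocab = set(corpus.split())
    if not corpus_vocab:
        return 0.0
    return len(sentence_vocab) / len(corpus_vocab)


corpus_text = ("  I love you Please, never leave me alone If you go, I will die I am "
               "watching a lot of romantic comedy lately I have to eat icecream")
print(lexical_diversity_sketch("I love you", corpus_text))  # 0.125
```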


## Custom Pipelines

If you don't want to use the preprocess function, you can build a pipeline just by chaining the methods above. Pay attention to the order you choose. If you want to expand contractions, for instance, don't call a method that strips accents and special chars before it.

In [34]:
text = "At the end of the day, @john you're solely responsible for your #success and your #failure. And the sooner you realize that, you accept that, and integrate that into your work ethic, you will start being #successful."

text = atalaia_en.lower_remove_white(text)
text = atalaia_en.replace_handles(text, 'HANDLE')
text = atalaia_en.replace_hashtags(text, 'HASHTAG')
text = atalaia_en.remove_stopwords(text)
text = atalaia_en.replace_punctuation(text, placeholder='PUNCTUATION')
text = atalaia_en.tokenize(text)
pprint.pprint(text)

['at',
 'end',
 'of',
 'day',
 'PUNCTUATION',
 'HANDLE',
 'you',
 'PUNCTUATION',
 're',
 'solely',
 'responsible',
 'for',
 'your',
 'HASHTAG',
 'your',
 'HASHTAG',
 'PUNCTUATION',
 'sooner',
 'you',
 'realize',
 'that',
 'PUNCTUATION',
 'you',
 'accept',
 'that',
 'PUNCTUATION',
 'integrate',
 'that',
 'into',
 'your',
 'work',
 'ethic',
 'PUNCTUATION',
 'you',
 'will',
 'start',
 'being',
 'HASHTAG',
 'PUNCTUATION']
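
If you build several pipelines, it can help to keep the steps in a list and apply them in order. The sketch below uses plain string functions so that it runs without Atalaia; `run_pipeline` and `collapse_spaces` are hypothetical helpers, not part of the library.

```python
from functools import reduce


def run_pipeline(text, steps):
    """Apply each step to the result of the previous one, left to right."""
    return reduce(lambda value, step: step(value), steps, text)


def collapse_spaces(text):
    return " ".join(text.split())


# Each step takes a string and returns a string. With Atalaia you could list
# methods from this document instead, for example:
#   lambda t: atalaia_en.replace_handles(t, 'HANDLE')
steps = [str.strip, str.lower, collapse_spaces]

print(run_pipeline("  Hello   WORLD ", steps))  # hello world
```

Keeping the steps in a list makes the order explicit, which matters for the reason given above: a step that destroys information (such as stripping special chars) has to come after any step that depends on it.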


## Randomness tools

Sometimes you need to tell whether a model performs well or badly because of the data quality or because of the model itself. In these cases, you can test against randomness. Atalaia offers the random_classification and replace_with_blob tools. The first creates random labels for each example in your dataset, while the second builds blob text from the vocabulary of your dataset to generate random examples.

In [35]:
texts = ['Oi, tudo bom?', 'Queria ser seu namorado.', 'Me liga, ok?', 'Quando voce chega mesmo?', 'Adoro o seu cachorro.']
labels = atalaia_pt_br.random_classification(texts, ['Bad','Good'], balanced=True)
pprint.pprint(labels)

Class Bad: 1 values
Class Good: 4 values
['Good', 'Good', 'Good', 'Good', 'Bad']


In [36]:
sentences = atalaia_pt_br.replace_with_blob(texts)
pprint.pprint(sentences)

['mesmo? liga, liga,',
 'Quando tudo Quando tudo',
 'o mesmo? o',
 'Me  Oi, seu',
 'cachorro. chega chega Oi,']
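
Both ideas can be sketched with the standard `random` module. The functions below are hypothetical stand-ins, not Atalaia's implementation: labels are drawn uniformly with no balancing, and each blob keeps the word count of the text it replaces, which is an assumption based on the output above.

```python
import random


def random_labels(texts, classes, seed=None):
    """Draw one label per text, uniformly at random."""
    rng = random.Random(seed)
    return [rng.choice(classes) for _ in texts]


def random_blob(texts, seed=None):
    """Replace each text with random words sampled from the whole vocabulary."""
    rng = random.Random(seed)
    vocabulary = [word for text in texts for word in text.split()]
    return [" ".join(rng.choice(vocabulary) for _ in text.split()) for text in texts]


texts = ['Oi, tudo bom?', 'Queria ser seu namorado.', 'Me liga, ok?']
print(random_labels(texts, ['Bad', 'Good'], seed=0))
print(random_blob(texts, seed=0))
```

If a model trained on such labels or blobs scores about as well as on the real data, the real data is probably not carrying much signal.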


## Tokenizer

Atalaia comes with an internal tokenizer. To use it, simply access the tokenize method.

In [37]:
tokenized = atalaia_en.tokenize("At the end of the day, you're solely responsible for your success and your failure. And the sooner you realize that, you accept that, and integrate that into your work ethic, you will start being successful. As long as you blame others for the reason you aren't where you want to be, you will always be a failure.")
pprint.pprint(tokenized)

['At',
 'the',
 'end',
 'of',
 'the',
 'day',
 ',',
 "you're",
 'solely',
 'responsible',
 'for',
 'your',
 'success',
 'and',
 'your',
 'failure',
 '.',
 'And',
 'the',
 'sooner',
 'you',
 'realize',
 'that',
 ',',
 'you',
 'accept',
 'that',
 ',',
 'and',
 'integrate',
 'that',
 'into',
 'your',
 'work',
 'ethic',
 ',',
 'you',
 'will',
 'start',
 'being',
 'successful',
 '.',
 'As',
 'long',
 'as',
 'you',
 'blame',
 'others',
 'for',
 'the',
 'reason',
 'you',
 "aren't",
 'where',
 'you',
 'want',
 'to',
 'be',
 ',',
 'you',
 'will',
 'always',
 'be',
 'a',
 'failure',
 '.']
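
For comparison, a similar split can be obtained with a single regular expression that keeps apostrophes inside words and emits punctuation as separate tokens. This is a sketch only; the pattern is an assumption and Atalaia's tokenizer may follow different rules.

```python
import re

# A word (optionally with inner apostrophes), or a single punctuation char.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)*|[^\w\s]")


def tokenize_sketch(text):
    """Return words and punctuation marks as separate tokens."""
    return TOKEN_PATTERN.findall(text)


print(tokenize_sketch("At the end of the day, you're solely responsible."))
# ['At', 'the', 'end', 'of', 'the', 'day', ',', "you're", 'solely', 'responsible', '.']
```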
