In [4]:
import re, string, unicodedata
import nltk
import contractions
import inflect
from bs4 import BeautifulSoup
from nltk import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import LancasterStemmer, WordNetLemmatizer

In [5]:
sample = """<h1>Title Goes Here</h1>
<b>Bolded Text</b>
<i>Italicized Text</i>
<img src="this should all be gone"/>
<a href="this will be gone, too">But this will still be here!</a>
I run. He ran. She is running. Will they stop running?
I talked. She was talking. They talked to them about running. Who ran to the talking runner?
[Some text we don't want to keep is in here]
¡Sebastián, Nicolás, Alejandro and Jéronimo are going to the store tomorrow morning!
something... is! wrong() with.,; this :: sentence.
I can't do this anymore. I didn't know them. Why couldn't you have dinner at the restaurant?
My favorite movie franchises, in order: Indiana Jones; Marvel Cinematic Universe; Star Wars; Back to the Future; Harry Potter.
Don't do it.... Just don't. Billy! I know what you're doing. This is a great little house you've got here.
[This is some other unwanted text]
John: "Well, well, well."
James: "There, there. There, there."
&nbsp;&nbsp;
There are a lot of reasons not to do this. There are 101 reasons not to do it. 1000000 reasons, actually.
I have to go get 2 tutus from 2 different stores, too.
22    45   1067   445
{{Here is some stuff inside of double curly braces.}}
{Here is more stuff in single curly braces.}
[DELETE]
</body>
</html>"""

## Noise Removal

Noise removal typically involves:
 - Removing headers and footers
 - Removing HTML and XML markup
 - Extracting the relevant data from other formats, such as JSON
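
The notebook's sample contains no JSON, so here is a minimal sketch of the JSON extraction step using only the standard library. The payload and its field names (`id`, `meta`, `body`) are hypothetical and chosen purely for illustration.

```python
import json

# Hypothetical payload: the field names are assumptions made for this example.
raw = '{"id": 7, "meta": {"lang": "en"}, "body": "I run. He ran."}'

def extract_text(payload, field="body"):
    """Parse a JSON string and return only the text field of interest."""
    record = json.loads(payload)
    # Fall back to an empty string when the field is missing.
    return record.get(field, "")

text = extract_text(raw)
print(text)
```

Everything other than the chosen field is treated as noise and discarded before any further text processing.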

The next cell uses the Beautiful Soup library to strip the HTML tags, parsing the text with Python's built-in `html.parser`.

The second function uses a regular expression to remove square brackets ("[...]") together with all the text inside them. This removal is specific to this example.

In [6]:
def strip_html(text):
    soup = BeautifulSoup(text, "html.parser")
    return soup.get_text()

def remove_between_square_brackets(text):
    # Raw string so the backslashes reach the regex engine unchanged.
    return re.sub(r'\[[^]]*\]', '', text)

def denoise_text(text):
    text = strip_html(text)
    text = remove_between_square_brackets(text)
    return text
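
To make the behaviour of the bracket pattern concrete, the sketch below applies the same regular expression to two short strings. The input strings are made up for illustration.

```python
import re

def remove_between_square_brackets(text):
    # "[", then any run of characters other than "]", then the closing "]".
    return re.sub(r'\[[^]]*\]', '', text)

cleaned = remove_between_square_brackets("keep [drop this] text {and braces}")
nested = remove_between_square_brackets("[a [b] c]")

print(repr(cleaned))
print(repr(nested))
```

Curly braces are left untouched, and the surrounding spaces remain, so the first result contains a double space. The pattern stops at the first closing bracket, so nested brackets are only partially removed, as the second call shows.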

In [7]:
sample = denoise_text(sample)

In [9]:
display(sample)

'Title Goes Here\nBolded Text\nItalicized Text\n\nBut this will still be here!\nI run. He ran. She is running. Will they stop running?\nI talked. She was talking. They talked to them about running. Who ran to the talking runner?\n\n¡Sebastián, Nicolás, Alejandro and Jéronimo are going to the store tomorrow morning!\nsomething... is! wrong() with.,; this :: sentence.\nI can\'t do this anymore. I didn\'t know them. Why couldn\'t you have dinner at the restaurant?\nMy favorite movie franchises, in order: Indiana Jones; Marvel Cinematic Universe; Star Wars; Back to the Future; Harry Potter.\nDon\'t do it.... Just don\'t. Billy! I know what you\'re doing. This is a great little house you\'ve got here.\n\nJohn: "Well, well, well."\nJames: "There, there. There, there."\n\xa0\xa0\nThere are a lot of reasons not to do this. There are 101 reasons not to do it. 1000000 reasons, actually.\nI have to go get 2 tutus from 2 different stores, too.\n22    45   1067   445\n{{Here is some stuff inside of

Next we expand contractions. English texts, formal or informal, often contain contractions such as _didn't_ or _don't_. When we tokenize the text in later steps, these contractions are split in ways that add noise to the data: a contraction such as _didn't_ becomes two tokens ("did" and "n't"). To avoid this, we expand each contraction into its full words, here _did_ and _not_, using the contractions library.

In [10]:
def replace_contractions(text):
    return contractions.fix(text)
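
To illustrate the idea behind contraction expansion without any third-party dependency, here is a minimal dictionary-based sketch. It is not how the contractions library is implemented; `CONTRACTION_MAP` and `expand_contractions` are hypothetical names, and the mapping is deliberately tiny.

```python
import re

# Hypothetical, deliberately small mapping used only for this illustration.
CONTRACTION_MAP = {
    "didn't": "did not",
    "don't": "do not",
    "couldn't": "could not",
    "you're": "you are",
}

# One alternation over all known contractions, matched case-insensitively.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTION_MAP) + r")\b",
    flags=re.IGNORECASE,
)

def expand_contractions(text):
    """Replace each known contraction with its expanded form."""
    def _expand(match):
        # Look up the lowercase form; the original capitalisation is not preserved.
        return CONTRACTION_MAP[match.group(0).lower()]
    return _PATTERN.sub(_expand, text)

expanded = expand_contractions("I didn't know them. Why couldn't you stay?")
print(expanded)
```

Because the lookup lowercases the match, a sentence-initial "Don't" comes back as "do not". Contractions missing from the map pass through unchanged.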

In [11]:
sample_contractions_fixed = replace_contractions(sample) 

In [12]:
sample_contractions_fixed

'Title Goes Here\nBolded Text\nItalicized Text\n\nBut this will still be here!\nI run. He ran. She is running. Will they stop running?\nI talked. She was talking. They talked to them about running. Who ran to the talking runner?\n\n¡Sebastián, Nicolás, Alejandro and Jéronimo are going to the store tomorrow morning!\nsomething... is! wrong() with.,; this :: sentence.\nI can not do this anymore. I did not know them. Why could not you have dinner at the restaurant?\nMy favorite movie franchises, in order: Indiana Jones; Marvel Cinematic Universe; Star Wars; Back to the Future; Harry Potter.\ndo not do it.... Just do not. Billy! I know what you are doing. This is a great little house you have got here.\n\nJohn: "Well, well, well."\nJames: "There, there. There, there."\n\xa0\xa0\nThere are a lot of reasons not to do this. There are 101 reasons not to do it. 1000000 reasons, actually.\nI have to go get 2 tutus from 2 different stores, too.\n22    45   1067   445\n{{Here is some stuff inside 

## Tokenization

With noise removal done, we move on to tokenization. Tokenization is the process of splitting a large body of text into smaller units, such as paragraphs, sentences, or words. The term usually refers to splitting text into words, while segmentation refers to splitting it into units larger than a word. Deeper processing of the text normally happens only after the text has been segmented.
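
The difference between segmenting into sentences and tokenizing into words can be sketched with two naive regular-expression helpers. This is a simplification for illustration, not the algorithm behind NLTK's `sent_tokenize` and `word_tokenize`; the function names `split_sentences` and `split_words` are hypothetical.

```python
import re

def split_sentences(text):
    """Naive segmentation: break after '.', '!' or '?' when whitespace follows."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def split_words(sentence):
    """Naive tokenization: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)

text = "I run. He ran. Will they stop running?"
sentences = split_sentences(text)
tokens = [split_words(s) for s in sentences]

print(sentences)
print(tokens)
```

A rule this simple breaks on abbreviations such as "Dr." and on contractions, which is one reason to rely on a trained tokenizer for real text.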