<a href="https://colab.research.google.com/github/viniciusrpb/cic0269_natural_language_processing/blob/main/cap04_text_preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Chapter 4 - Text Preprocessing

In this chapter, we study approaches for preprocessing text: analyzing it and modifying it so that only the words that carry semantic and discourse content are kept.

For example, words such as articles and prepositions can be removed.

## 4.1. Google Colab Tips

First, let's install the nltk library, which provides stemmers and lemmatizers:

In [1]:
!pip install nltk

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


Next, we import the libraries needed for text preprocessing with NLTK:

In [2]:
import pandas as pd
import re
import nltk
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

Download the stopword list and the WordNet data used by the lemmatizer:

In [32]:
nltk.download('stopwords')
nltk.download('omw-1.4')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Unzipping corpora/omw-1.4.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

Files can be loaded directly from Google Drive by mounting it with the commands below:

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Copy the dataset file from Drive into the Colab working directory:

In [5]:
!cp -r '/content/drive/MyDrive/sentiment_analysis/polarity_classification_2013/twitter-2013train-A.txt' 'twitter-2013train-A.txt'

Load the tab-separated file copied from Drive into a DataFrame:

In [6]:
df = pd.read_csv('twitter-2013train-A.txt',sep='\t',encoding='utf-8',names=['id','polarity','text'])

In [7]:
df

Unnamed: 0,id,polarity,text
0,264183816548130816,positive,Gas by my house hit $3.39!!!! I\u2019m going t...
1,263405084770172928,negative,Theo Walcott is still shit\u002c watch Rafa an...
2,262163168678248449,negative,its not that I\u2019m a GSP fan\u002c i just h...
3,264249301910310912,negative,Iranian general says Israel\u2019s Iron Dome c...
4,262682041215234048,neutral,Tehran\u002c Mon Amour: Obama Tried to Establi...
...,...,...,...
9679,103158179306807296,positive,RT @MNFootNg It's monday and Monday Night Foot...
9680,103157324096618497,positive,All I know is the road for that Lomardi start ...
9681,100259220338905089,neutral,"All Blue and White fam, we r meeting at Golden..."
9682,104230318525001729,positive,@DariusButler28 Have a great game agaist Tam...


Another approach is to read the file directly from a GitHub repository. Remember to use the URL of the dataset's raw file:

In [8]:
df = pd.read_csv('https://raw.githubusercontent.com/viniciusrpb/cic0269_natural_language_processing/main/corpus_tweets/twitter-2013train-A.txt',sep='\t',encoding='utf-8',names=['id','polarity','text'])

In [9]:
df

Unnamed: 0,id,polarity,text
0,264183816548130816,positive,Gas by my house hit $3.39!!!! I\u2019m going t...
1,263405084770172928,negative,Theo Walcott is still shit\u002c watch Rafa an...
2,262163168678248449,negative,its not that I\u2019m a GSP fan\u002c i just h...
3,264249301910310912,negative,Iranian general says Israel\u2019s Iron Dome c...
4,262682041215234048,neutral,Tehran\u002c Mon Amour: Obama Tried to Establi...
...,...,...,...
9679,103158179306807296,positive,RT @MNFootNg It's monday and Monday Night Foot...
9680,103157324096618497,positive,All I know is the road for that Lomardi start ...
9681,100259220338905089,neutral,"All Blue and White fam, we r meeting at Golden..."
9682,104230318525001729,positive,@DariusButler28 Have a great game agaist Tam...
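
As a minimal sketch of what `sep='\t'` and `names` do in the calls above, the snippet below parses a small in-memory sample instead of the real file. The two rows and the name `df_sample` are made up for illustration; only the column names come from the cells above.

```python
import io

import pandas as pd

# Two made-up rows in the same tab-separated layout: id, polarity, text.
sample = (
    "1\tpositive\tGreat game tonight!\n"
    "2\tnegative\tStuck in traffic again.\n"
)

# The data has no header row, so the column names are supplied explicitly.
df_sample = pd.read_csv(
    io.StringIO(sample),
    sep='\t',
    names=['id', 'polarity', 'text'],
)
print(df_sample)
```

Without `names`, pandas would treat the first tweet as the header row, so that tweet would be lost from the data.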


Since we are only interested in the text itself, we select just the 'text' column:

In [10]:
text_df = df['text'] 

In [11]:
text_df

0       Gas by my house hit $3.39!!!! I\u2019m going t...
1       Theo Walcott is still shit\u002c watch Rafa an...
2       its not that I\u2019m a GSP fan\u002c i just h...
3       Iranian general says Israel\u2019s Iron Dome c...
4       Tehran\u002c Mon Amour: Obama Tried to Establi...
                              ...                        
9679    RT @MNFootNg It's monday and Monday Night Foot...
9680    All I know is the road for that Lomardi start ...
9681    All Blue and White fam, we r meeting at Golden...
9682    @DariusButler28   Have a great game agaist Tam...
9683    I'm pisseeedddd that I missed Kid Cudi's show ...
Name: text, Length: 9684, dtype: object

To retrieve specific rows or columns of a DataFrame by position, you can use iloc: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html. The cells below use label indexing instead and concatenate the first two tweets into a single sentence:

In [12]:
sentence = text_df[0]
sentence = sentence+" "+text_df[1]

In [13]:
sentence

'Gas by my house hit $3.39!!!! I\\u2019m going to Chapel Hill on Sat. :) Theo Walcott is still shit\\u002c watch Rafa and Johnny deal with him on Saturday.'
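
Label indexing such as `text_df[0]` matches positional order here only because the index is the default 0, 1, 2, ... range. A small sketch of positional access with `iloc`, using a hypothetical three-row DataFrame (`toy_df` and its contents are not part of the notebook):

```python
import pandas as pd

# Hypothetical data, used only to illustrate positional indexing.
toy_df = pd.DataFrame({
    'polarity': ['positive', 'negative', 'neutral'],
    'text': ['first tweet', 'second tweet', 'third tweet'],
})

first_row = toy_df.iloc[0]                      # whole first row, as a Series
first_text = toy_df.iloc[0, 1]                  # row 0, column 1 ('text')
first_two_texts = toy_df.iloc[:2, 1].tolist()   # rows 0-1 of the 'text' column

# Same concatenation as in the cells above, done positionally.
joined = " ".join(first_two_texts)
print(joined)
```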

Tokenization: splitting a sentence into words. We first convert the sentence to lowercase:

In [14]:
sentence = sentence.lower()

In [15]:
sentence

'gas by my house hit $3.39!!!! i\\u2019m going to chapel hill on sat. :) theo walcott is still shit\\u002c watch rafa and johnny deal with him on saturday.'
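
The sentence still contains literal escape sequences such as `\u2019` and `\u002c` (a backslash followed by `u` and four hex digits), which later end up inside tokens. The notebook does not decode them; one possible approach, shown as a sketch with a hypothetical helper `decode_unicode_escapes`, is to replace each sequence with the character it encodes. It assumes plain four-digit escapes and does not handle surrogate pairs.

```python
import re

def decode_unicode_escapes(text):
    """Replace literal \\uXXXX sequences with the characters they encode."""
    return re.sub(
        r'\\u([0-9a-fA-F]{4})',
        lambda match: chr(int(match.group(1), 16)),
        text,
    )

# Fragment taken from the sentence above; '\\u2019' is a literal backslash-u sequence.
raw = 'I\\u2019m going to Chapel Hill on Sat.'
decoded = decode_unicode_escapes(raw)
print(decoded)
```

Running this before tokenization would turn `\u002c` into a real comma, which the splitting regex used later already treats as a separator.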

We can also use Python regular expressions to split the sentence wherever certain patterns occur.

In the rule below, the regex treats whitespace and symbols that are common in tweets, such as hashtag and mention markers (# and @), exclamation marks, commas and periods, as separators: they determine where the sentence is split into words but are not kept as tokens themselves:

In [16]:
tokens = re.split(r'[\s!$@#,.;]+',sentence)

In [17]:
tokens

['gas',
 'by',
 'my',
 'house',
 'hit',
 '3',
 '39',
 'i\\u2019m',
 'going',
 'to',
 'chapel',
 'hill',
 'on',
 'sat',
 ':)',
 'theo',
 'walcott',
 'is',
 'still',
 'shit\\u002c',
 'watch',
 'rafa',
 'and',
 'johnny',
 'deal',
 'with',
 'him',
 'on',
 'saturday',
 '']
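
`re.split` returns an empty string when the text ends (or starts) with a separator, which is where the trailing `''` above comes from. One possible way to drop such tokens, shown on a shortened version of the sentence (the name `clean_tokens` is introduced here for illustration):

```python
import re

# Shortened version of the lowercased sentence; it ends with a separator ('.').
sentence = 'gas by my house hit $3.39!!!! watch rafa and johnny deal with him on saturday.'

tokens = re.split(r'[\s!$@#,.;]+', sentence)

# Empty strings are falsy, so this keeps only the non-empty tokens.
clean_tokens = [t for t in tokens if t]
print(clean_tokens)
```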

Stopword removal: we start with a small hand-written list and then load the English list that ships with NLTK.

In [18]:
stop_list = ['and','or','the','is','on','in']

In [19]:
novel_tokens = []
for t in tokens:
    if t not in stop_list:
        novel_tokens.append(t)

In [20]:
novel_tokens

['gas',
 'by',
 'my',
 'house',
 'hit',
 '3',
 '39',
 'i\\u2019m',
 'going',
 'to',
 'chapel',
 'hill',
 'sat',
 ':)',
 'theo',
 'walcott',
 'still',
 'shit\\u002c',
 'watch',
 'rafa',
 'johnny',
 'deal',
 'with',
 'him',
 'saturday',
 '']

In [21]:
stop_list_nltk = nltk.corpus.stopwords.words('english')

In [22]:
stop_list_nltk

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each
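
The NLTK list can be used in place of the hand-written `stop_list`. A minimal sketch, assuming the list is converted to a `set` for fast membership tests. The helper name `remove_stopwords` is hypothetical, and a small hand-picked subset stands in for `stop_list_nltk` so that the snippet runs without downloading anything.

```python
def remove_stopwords(tokens, stopwords):
    """Return the tokens that are not in the given stopword collection."""
    stopword_set = set(stopwords)  # set lookup avoids scanning a list per token
    return [t for t in tokens if t not in stopword_set]

# Stand-in subset; every word here appears in the NLTK list shown above.
# In the notebook, stop_list_nltk would be passed instead.
stopword_subset = ['by', 'my', 'to', 'is', 'and', 'with', 'him', 'on']

tokens = ['gas', 'by', 'my', 'house', 'hit', 'going', 'to', 'chapel', 'hill', 'on', 'sat']
filtered = remove_stopwords(tokens, stopword_subset)
print(filtered)
```

Because the NLTK list is much longer than the six-word `stop_list`, it also removes tokens such as 'by', 'my', 'to', 'with' and 'him', which survive in `novel_tokens`.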

### Stemming

In [23]:
stemmer = PorterStemmer()

In [24]:
tokens_final = [stemmer.stem(t) for t in novel_tokens]

In [25]:
tokens_final

['ga',
 'by',
 'my',
 'hous',
 'hit',
 '3',
 '39',
 'i\\u2019m',
 'go',
 'to',
 'chapel',
 'hill',
 'sat',
 ':)',
 'theo',
 'walcott',
 'still',
 'shit\\u002c',
 'watch',
 'rafa',
 'johnni',
 'deal',
 'with',
 'him',
 'saturday',
 '']

In [26]:
novel_tokens

['gas',
 'by',
 'my',
 'house',
 'hit',
 '3',
 '39',
 'i\\u2019m',
 'going',
 'to',
 'chapel',
 'hill',
 'sat',
 ':)',
 'theo',
 'walcott',
 'still',
 'shit\\u002c',
 'watch',
 'rafa',
 'johnny',
 'deal',
 'with',
 'him',
 'saturday',
 '']
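
Comparing the two lists shows that stems are not necessarily dictionary words. The sketch below pairs a few tokens with the stems printed above (copied from the outputs rather than recomputed, so it runs without NLTK) and keeps only the ones that changed:

```python
# A few tokens from novel_tokens and their stems, copied from the outputs above.
original = ['gas', 'house', 'going', 'chapel', 'johnny', 'saturday']
stemmed = ['ga', 'hous', 'go', 'chapel', 'johnni', 'saturday']

# Keep only the pairs where stemming altered the token.
changed = {o: s for o, s in zip(original, stemmed) if o != s}
print(changed)
```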

In [27]:
html_text = "Ola<p> eu sou </p>"

# Remove only the opening <p> tag; the closing </p> does not match this pattern
token = re.sub(r'<p>','',html_text)
print(token)

Ola eu sou </p>
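
The pattern `<p>` matches only the opening tag, so `</p>` is left behind. A possible extension, assuming the goal is to strip both tags, is to make the slash optional with `</?p>`. The broader pattern `<[^>]+>` is a rough heuristic for any tag, not a real HTML parser, and the second input string is made up for illustration.

```python
import re

text = "Ola<p> eu sou </p>"

# '/?' makes the slash optional, so both <p> and </p> are removed.
no_p_tags = re.sub(r'</?p>', '', text).strip()
print(no_p_tags)

# Heuristic for arbitrary tags: '<', then anything that is not '>', then '>'.
no_tags = re.sub(r'<[^>]+>', '', "<b>Ola</b> <i>mundo</i>")
print(no_tags)
```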


In [28]:
mapping = {}
mapping[':)'] = 'happy'
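
The `mapping` dictionary is not applied to the tokens in the cells above. One way it could be used, as a sketch: replace each token that has an entry and leave all the others untouched. The short token list is a slice of the tokens shown earlier.

```python
mapping = {':)': 'happy'}

tokens = ['going', 'to', 'chapel', 'hill', 'sat', ':)', 'theo']

# dict.get(t, t) returns the mapped word if present, otherwise the token itself.
mapped_tokens = [mapping.get(t, t) for t in tokens]
print(mapped_tokens)
```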

### Lemmatization



In [36]:
lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("cheers"))

cheer


In [38]:
tokens_final = [lemmatizer.lemmatize(w) for w in novel_tokens]
print(tokens_final)

['gas', 'by', 'my', 'house', 'hit', '3', '39', 'i\\u2019m', 'going', 'to', 'chapel', 'hill', 'sat', ':)', 'theo', 'walcott', 'still', 'shit\\u002c', 'watch', 'rafa', 'johnny', 'deal', 'with', 'him', 'saturday', '']

Most tokens are unchanged because `lemmatize` treats every word as a noun by default (`pos='n'`); a verb form such as 'going' is only reduced to 'go' when the call passes `pos='v'`.


In [40]:
tweet_final = ' '.join([lemmatizer.lemmatize(w) for w in novel_tokens])
print(tweet_final)

gas by my house hit 3 39 i\u2019m going to chapel hill sat :) theo walcott still shit\u002c watch rafa johnny deal with him saturday 
