<h1 align="center"> Applications in Natural Language Processing </h1>
<h2 align="center"> Lecture 02 - Text Preprocessing Techniques</h2>
<h3 align="center"> Prof. Fernando Vieira da Silva MSc.</h3>

<h2> Preprocessing Techniques </h2>

<p>Let's go through the most common techniques for preparing text for the machine learning algorithms we will use later on.</p>
<p>As a case study, we'll use the text of <i>Hamlet</i>, found in the <i>Gutenberg</i> corpus of the <b>NLTK</b> package.</p>

<b>1. Downloading the Gutenberg corpus</b>

In [None]:
import nltk

nltk.download("gutenberg")

<b>2. Displaying the text of "Hamlet"</b>

In [None]:
hamlet_raw = nltk.corpus.gutenberg.raw('shakespeare-hamlet.txt')
print(hamlet_raw[:1000])

<b>3. Sentence segmentation and word tokenization</b>

In [None]:
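from nltk.tokenize import sent_tokenize

# sent_tokenize relies on the Punkt models: run nltk.download('punkt') once
# (recent NLTK releases name this resource 'punkt_tab').
sentences = sent_tokenize(hamlet_raw)

print(sentences[:10])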


In [None]:
from nltk.tokenize import word_tokenize

words = word_tokenize(sentences[0])

print(words)

<b>4. Removing stopwords and punctuation</b>

In [None]:
from nltk.corpus import stopwords

nltk.download('stopwords')  # the stopword lists are a separate NLTK resource

stopwords_list = stopwords.words('english')

print(stopwords_list)

In [None]:
# Compare in lowercase, because the stopword list is all lowercase
non_stopwords = [w for w in words if w.lower() not in stopwords_list]
print(non_stopwords)

In [None]:
import string
punctuation = string.punctuation
print(punctuation)

In [None]:
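# `punctuation` is a single string, so `w in punctuation` would be a substring test.
# Checking every character also drops multi-character tokens such as '--' or "''".
non_punctuation = [w for w in non_stopwords if not all(ch in punctuation for ch in w)]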

print(non_punctuation)
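
The two filters above can be folded into a single helper. The sketch below is a minimal illustration that does not need NLTK: the names `TOY_STOPWORDS` and `clean_tokens` are made up here, and the tiny stopword set is an assumption standing in for `stopwords.words('english')`.

```python
import string

# Hypothetical, deliberately tiny stopword set; in the notebook this role is
# played by stopwords.words('english').
TOY_STOPWORDS = {"the", "of", "by", "a", "is"}

def clean_tokens(tokens, stopword_set=TOY_STOPWORDS):
    """Drop stopwords (case-insensitively) and punctuation-only tokens."""
    punct = set(string.punctuation)
    kept = []
    for tok in tokens:
        if tok.lower() in stopword_set:
            continue  # stopword
        if all(ch in punct for ch in tok):
            continue  # token made up only of punctuation characters
        kept.append(tok)
    return kept

# Hand-written token list, in the shape word_tokenize would return
sample = ["[", "The", "Tragedie", "of", "Hamlet", "by", "William", "Shakespeare", "1599", "]"]
print(clean_tokens(sample))
```

Using a `set` for both lookups keeps each membership test constant-time, which starts to matter once the filter runs over a whole corpus.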

<b>5. Part-of-Speech (POS) Tags</b>

In [None]:
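from nltk import pos_tag

# pos_tag needs the 'averaged_perceptron_tagger' resource
# ('averaged_perceptron_tagger_eng' in recent NLTK releases); fetch it with nltk.download.
pos_tags = pos_tag(words)

print(pos_tags)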

The tags, drawn from the Penn Treebank tag set, give the part of speech (grammatical category) of each word in the text. See <a href="https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html" target="blank">https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html</a> for the complete list.
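
To make tagger output easier to read, the tags can be mapped to short descriptions. The sketch below covers only a handful of Penn Treebank tags. The dictionary `TAG_DESCRIPTIONS`, the helper `describe`, and the tagged sample are illustrative assumptions written by hand, not actual tagger output.

```python
# A few Penn Treebank tags with short descriptions (partial list, for illustration).
TAG_DESCRIPTIONS = {
    "DT": "determiner",
    "JJ": "adjective",
    "NN": "noun, singular or mass",
    "NNS": "noun, plural",
    "NNP": "proper noun, singular",
    "PRP": "personal pronoun",
    "RB": "adverb",
    "VBD": "verb, past tense",
    "VBZ": "verb, 3rd person singular present",
    "VBN": "verb, past participle",
}

def describe(tagged_tokens):
    """Pair each (word, tag) tuple with a readable description of the tag."""
    return [(word, tag, TAG_DESCRIPTIONS.get(tag, "unknown tag"))
            for word, tag in tagged_tokens]

# Hand-written sample in the (word, tag) format returned by pos_tag
tagged = [("He", "PRP"), ("has", "VBZ"), ("already", "RB"), ("gone", "VBN")]
for word, tag, description in describe(tagged):
    print(f"{word:10}{tag:5}{description}")
```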

<b>6. Stemming and Lemmatization</b>

Stemming reduces a word to its "stem" by stripping affixes such as suffixes; the result is not necessarily a dictionary word.

In [None]:
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer('english')

sample_sentence = "He has already gone"
sample_words = word_tokenize(sample_sentence)

stems = [stemmer.stem(w) for w in sample_words]

print(stems)
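
To see what "stripping suffixes" means mechanically, here is a toy suffix stripper. It is only a rough sketch and is not the Snowball algorithm: the rule list `SUFFIXES`, the function `toy_stem`, and the `min_stem_len` guard are all assumptions made for illustration.

```python
# Toy suffix stripper: a rough illustration only, not the Snowball algorithm.
SUFFIXES = ("ing", "ed", "ly", "es", "s")  # hypothetical rule list, checked in order

def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    lowered = word.lower()
    for suffix in SUFFIXES:
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= min_stem_len:
            return lowered[:-len(suffix)]
    return lowered

for w in ["walking", "walked", "quickly", "cats", "gone", "has"]:
    print(w, "->", toy_stem(w))
```

The toy leaves "gone" untouched, since no suffix rule applies; mapping it to "go" is what lemmatization is for.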

Lemmatization goes beyond stripping suffixes: it returns the word's dictionary form (its lemma). We'll pass POS tags, like the ones obtained earlier, to the lemmatizer so that it can pick the right lemma.

In [None]:
nltk.download('wordnet')

In [None]:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

lemmatizer = WordNetLemmatizer()

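# Tag the sample words (nltk.pos_tag is the same function imported earlier)
pos_tags = nltk.pos_tag(sample_words)

lemmas = []
for word, tag in pos_tags:
    # Map the Penn Treebank tag to a WordNet POS constant. The variable is
    # named wn_pos so that it does not shadow the imported pos_tag function.
    if tag.startswith('J'):
        wn_pos = wordnet.ADJ
    elif tag.startswith('V'):
        wn_pos = wordnet.VERB
    elif tag.startswith('R'):
        wn_pos = wordnet.ADV
    else:
        # Nouns and every other tag fall back to NOUN, the lemmatizer's default
        wn_pos = wordnet.NOUN

    lemmas.append(lemmatizer.lemmatize(word, wn_pos))

print(lemmas)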
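
The tag mapping in the loop above can be factored into a small helper. This is a sketch under one assumption: it uses the plain strings `"a"`, `"v"`, `"n"` and `"r"`, which are the values of `wordnet.ADJ`, `wordnet.VERB`, `wordnet.NOUN` and `wordnet.ADV`, so that it runs without WordNet installed. The name `penn_to_wordnet` is made up here.

```python
# WordNet POS constants as plain strings; these match wordnet.ADJ, wordnet.VERB,
# wordnet.NOUN and wordnet.ADV in nltk.corpus.wordnet.
WN_ADJ, WN_VERB, WN_NOUN, WN_ADV = "a", "v", "n", "r"

# First letter of a Penn Treebank tag -> WordNet POS
_PENN_PREFIX_TO_WN = {"J": WN_ADJ, "V": WN_VERB, "N": WN_NOUN, "R": WN_ADV}

def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS, defaulting to noun."""
    return _PENN_PREFIX_TO_WN.get(tag[:1], WN_NOUN)

print([penn_to_wordnet(t) for t in ["JJ", "VBN", "NNS", "RB", "DT"]])
```

With such a helper, the loop body reduces to `lemmatizer.lemmatize(word, penn_to_wordnet(tag))`.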

<b>7. N-grams</b>

Besides the <i>Bag-of-Words</i> technique, another option is to use n-grams: sequences of n consecutive tokens, where "n" can vary.

In [None]:
from nltk import word_tokenize

frase = 'o cachorro correu atrás do gato'  # "the dog ran after the cat"

# Tokenize once, then join each window of three consecutive tokens
tokens = word_tokenize(frase)

ngrams = ["%s %s %s" % (tokens[i], tokens[i+1], tokens[i+2])
          for i in range(len(tokens) - 2)]

print(ngrams)


In [None]:
# Drop tokens made up only of punctuation characters
non_punctuation = [w for w in words if not all(ch in punctuation for ch in w)]

n_grams_3 = ["%s %s %s" % (non_punctuation[i], non_punctuation[i+1], non_punctuation[i+2])
             for i in range(len(non_punctuation) - 2)]

print(n_grams_3)
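
The trigram comprehensions above hard-code n = 3. A minimal sketch of the same idea for any n follows; the function name `ngrams_from_tokens` is made up here, and whitespace splitting stands in for `word_tokenize` so the example runs on its own.

```python
def ngrams_from_tokens(tokens, n):
    """Return every run of n consecutive tokens, joined by spaces."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A list with fewer than n tokens yields an empty range, hence no n-grams
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "o cachorro correu atrás do gato".split()
print(ngrams_from_tokens(tokens, 2))
print(ngrams_from_tokens(tokens, 3))
```

NLTK also ships `nltk.util.ngrams`, which yields the n-grams as tuples rather than joined strings.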

We can also use the <b>CountVectorizer</b> class from scikit-learn:

In [None]:
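import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# ngram_range=(3, 3) keeps trigrams only. By default CountVectorizer also
# lowercases the text and ignores punctuation and single-character tokens.
count_vect = CountVectorizer(ngram_range=(3,3))

arr = np.array([sentences[0]])

print(arr)

n_gram_counts = count_vect.fit_transform(arr)

print(n_gram_counts.toarray())

print(count_vect.vocabulary_)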

Now let's count the n-grams (trigrams, in our case) across all sentences of the text:

In [None]:
arr = np.array(sentences)

n_gram_counts = count_vect.fit_transform(arr)

# Slice the sparse matrix before densifying, so only the first 20 rows are expanded
print(n_gram_counts[:20].toarray())

print(list(count_vect.vocabulary_)[:20])
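
A count matrix is hard to read directly; often the question is simply which trigrams occur most. The sketch below answers that with `collections.Counter` on a few made-up sentences. The function `trigram_counts` and the naive whitespace tokenization are assumptions for illustration, so the counts will not match `CountVectorizer` token for token.

```python
from collections import Counter

def trigram_counts(sentences):
    """Count word trigrams over a list of sentences, never crossing sentence boundaries."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()  # naive whitespace tokenization
        for i in range(len(tokens) - 2):
            counts[" ".join(tokens[i:i + 3])] += 1
    return counts

# Made-up sentences, only to exercise the function
toy_sentences = [
    "to be or not to be",
    "not to be outdone",
    "to be or not",
]
counts = trigram_counts(toy_sentences)
print(counts.most_common(3))
```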

<p><b>Exercise 2:</b> Display the 10 most frequent lemmas in the Reuters corpus, ignoring punctuation and stopwords.</p>



In [1]:
from nltk.corpus import reuters

# First, collect the raw text of every document in the corpus
# (requires nltk.download('reuters'))
reuters_raw_content = ''
for fid in reuters.fileids():
    reuters_raw_content += reuters.raw(fid)

print(len(reuters_raw_content))

from nltk.tokenize import sent_tokenize, word_tokenize

sentences = sent_tokenize(reuters_raw_content)

from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from nltk import pos_tag
from nltk.corpus import stopwords

stopwords_list = stopwords.words('english')

lemmatizer = WordNetLemmatizer()

lemmas = {}

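import string
punctuation = string.punctuation
# word_tokenize turns quotation marks into `` and '' tokens; appending these
# characters lets the substring test below catch them as well
punctuation += "```''''"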

for sent in sentences:
    words = word_tokenize(sent)
    pos_tags = pos_tag(words)
    
    non_punctuation = [w for w in pos_tags if not w[0].lower() in punctuation]
    non_stopwords = [w for w in non_punctuation if not w[0].lower() in stopwords_list]
    
    for w in non_stopwords:
        if w[1].startswith('J'):
            pos = wordnet.ADJ
        elif w[1].startswith('V'):
            pos = wordnet.VERB
        elif w[1].startswith('N'):
            pos = wordnet.NOUN
        elif w[1].startswith('R'):
            pos = wordnet.ADV
        else:
            pos = wordnet.NOUN

        lemma = lemmatizer.lemmatize(w[0], pos)
        # Count every occurrence, including the first one
        lemmas[lemma] = lemmas.get(lemma, 0) + 1

import pprint
import operator
pprint.pprint(sorted(lemmas.items(), key=operator.itemgetter(1))[-10:])



8846853
[('vs', 5934),
 ('year', 6380),
 ('ct', 8097),
 ('v', 8177),
 ("'s", 8341),
 ('lt', 8694),
 ('pct', 9054),
 ('dlrs', 11698),
 ('mln', 18012),
 ('say', 26081)]
