<h1 align="center"> Applications in Natural Language Processing </h1>
<h2 align="center"> Lesson 04 - Text Classification</h2>
<h3 align="center"> Prof. Fernando Vieira da Silva MSc.</h3>

<h2>Classification Problem</h2>

<p>In this tutorial we work through a practical text classification problem. The goal is to label a sentence as "formal" or "informal" writing.</p>

<b>1. Obtaining the corpus</b>

<p>To keep the problem simple, we keep using the Gutenberg corpus as the source of formal text and use chat messages from the <b>nps_chat</b> corpus as informal text.</p>
<p>First of all, let's download the nps_chat corpus:</p>

In [1]:
import nltk

nltk.download('nps_chat')

[nltk_data] Downloading package nps_chat to /usr/share/nltk_data...
[nltk_data]   Package nps_chat is already up-to-date!


True

In [2]:
from nltk.corpus import nps_chat

print(nps_chat.fileids())

['10-19-20s_706posts.xml', '10-19-30s_705posts.xml', '10-19-40s_686posts.xml', '10-19-adults_706posts.xml', '10-24-40s_706posts.xml', '10-26-teens_706posts.xml', '11-06-adults_706posts.xml', '11-08-20s_705posts.xml', '11-08-40s_706posts.xml', '11-08-adults_705posts.xml', '11-08-teens_706posts.xml', '11-09-20s_706posts.xml', '11-09-40s_706posts.xml', '11-09-adults_706posts.xml', '11-09-teens_706posts.xml']


<p>Now we read both corpora and store their sentences in a single ndarray. Note that we also keep a second ndarray that indicates whether each text is formal or not. We start by storing the corpora in lists. For teaching purposes, we use only 500 elements from each one.</p>

In [3]:
import nltk

# Informal class (label 0): every post from every nps_chat file.
x_data_nps = []

for fileid in nltk.corpus.nps_chat.fileids():
    x_data_nps.extend([post.text for post in nps_chat.xml_posts(fileid)])

y_data_nps = [0] * len(x_data_nps)

# Formal class (label 1): Gutenberg sentences, with their tokens re-joined by spaces.
x_data_gut = []
for fileid in nltk.corpus.gutenberg.fileids():
    x_data_gut.extend([' '.join(sent) for sent in nltk.corpus.gutenberg.sents(fileid)])

y_data_gut = [1] * len(x_data_gut)

# Keep only the first 500 items of each class.
x_data_full = x_data_nps[:500] + x_data_gut[:500]
print(len(x_data_full))
y_data_full = y_data_nps[:500] + y_data_gut[:500]
print(len(y_data_full))



1000
1000


In [4]:
print(x_data_full[:10])

['now im left with this gay name', ':P', 'PART', 'hey everyone  ', 'ah well', 'NICK :10-19-20sUser7', '10-19-20sUser7 is a gay name.', '.ACTION gives 10-19-20sUser121 a golf clap.', ':)', 'JOIN']


In [5]:
print(y_data_full[:10])

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


<p>Next, we convert these lists into ndarrays so we can use them in the preprocessing steps we already know.</p>

In [6]:
import numpy as np

x_data = np.array(x_data_full, dtype=object)
#x_data = np.array(x_data_full)
print(x_data.shape)
y_data = np.array(y_data_full)
print(y_data.shape)

(1000,)
(1000,)
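<p>The cell above passes <b>dtype=object</b> when building the array of sentences. The short sketch below, which uses made-up strings rather than the corpus, illustrates the difference: without it NumPy infers a fixed-width Unicode dtype sized to the longest string, while with it the array simply holds references to Python str objects. The variable names are illustrative only.</p>

```python
import numpy as np

# Made-up sentences, used only to illustrate the dtype difference.
sentences = ["hi", "a much longer sentence"]

fixed_width = np.array(sentences)               # NumPy infers a fixed-width Unicode dtype
as_objects = np.array(sentences, dtype=object)  # array of references to Python str objects

print(fixed_width.dtype.kind)  # 'U' (fixed-width Unicode)
print(as_objects.dtype)        # object

# Boolean-mask indexing, used later for the train/test split, works on both arrays.
mask = np.array([True, False])
print(as_objects[mask])
```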


<b>2. Splitting into training and test datasets</b>

<p>For the study to be reliable, we need to evaluate the results on a test dataset. We therefore split the data at random, keeping about 80% for training and the rest for testing the results later on.</p>

In [7]:
train_indexes = np.random.rand(len(x_data)) < 0.80

print(len(train_indexes))
print(train_indexes[:10])

1000
[ True  True  True  True  True  True  True  True False False]


In [8]:
x_data_train = x_data[train_indexes]
y_data_train = y_data[train_indexes]

print(len(x_data_train))
print(len(y_data_train))

786
786


In [9]:
x_data_test = x_data[~train_indexes]
y_data_test = y_data[~train_indexes]

print(len(x_data_test))
print(len(y_data_test))

214
214
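<p>Because the mask is drawn with an unseeded random generator, the split above changes on every run, and the sizes are only approximately 80/20 since each item is assigned independently. A minimal sketch of a repeatable version follows, assuming NumPy's seeded <b>default_rng</b> generator; the helper name <b>random_split_mask</b>, the seed value and the placeholder data are hypothetical and not part of the tutorial.</p>

```python
import numpy as np

def random_split_mask(n_items, train_fraction=0.80, seed=0):
    """Return a boolean mask that is True for items assigned to the training set."""
    rng = np.random.default_rng(seed)  # seeded generator, so the mask is repeatable
    return rng.random(n_items) < train_fraction

# Stand-in data with the same size as the tutorial's arrays (contents are placeholders).
x_demo = np.array(["sentence %d" % i for i in range(1000)], dtype=object)
y_demo = np.array([0] * 500 + [1] * 500)

mask = random_split_mask(len(x_demo), seed=42)
x_train, x_test = x_demo[mask], x_demo[~mask]
y_train, y_test = y_demo[mask], y_demo[~mask]

# The two sizes depend on the seed, but they always add up to 1000.
print(len(x_train), len(x_test))
```

<p>Calling the helper again with the same seed returns the same mask, so the accuracy figures can be reproduced between runs.</p>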


<b>3. Training the classifier</b>

<p>For tokenization, we use the same function as in the previous tutorial:</p>

In [10]:
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
import string
from nltk.corpus import wordnet

stopwords_list = stopwords.words('english')

lemmatizer = WordNetLemmatizer()

def my_tokenizer(doc):
    words = word_tokenize(doc)
    
    pos_tags = pos_tag(words)
    
    non_stopwords = [w for w in pos_tags if not w[0].lower() in stopwords_list]
    
    non_punctuation = [w for w in non_stopwords if not w[0] in string.punctuation]
    
    lemmas = []
    for w in non_punctuation:
        if w[1].startswith('J'):
            pos = wordnet.ADJ
        elif w[1].startswith('V'):
            pos = wordnet.VERB
        elif w[1].startswith('N'):
            pos = wordnet.NOUN
        elif w[1].startswith('R'):
            pos = wordnet.ADV
        else:
            pos = wordnet.NOUN
        
        lemmas.append(lemmatizer.lemmatize(w[0], pos))

    return lemmas
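<p>The if/elif chain in the tokenizer maps the first letter of each Penn Treebank tag to a WordNet part of speech, falling back to noun. As a sketch, the same mapping can be written as a lookup table. The function name <b>penn_to_wordnet</b> is hypothetical, and the one-letter codes are written out by hand (they are the values of the wordnet.ADJ, VERB, NOUN and ADV constants) so the example runs without NLTK.</p>

```python
# WordNet part-of-speech codes, i.e. the values of nltk.corpus.wordnet's
# ADJ, VERB, NOUN and ADV constants, written out so the sketch runs without NLTK.
WORDNET_POS = {'J': 'a', 'V': 'v', 'N': 'n', 'R': 'r'}

def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (e.g. 'VBD') to a WordNet POS code, defaulting to noun."""
    return WORDNET_POS.get(tag[:1], 'n')

for tag in ['JJ', 'VBD', 'NNS', 'RB', 'DT']:
    print(tag, '->', penn_to_wordnet(tag))
```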
    
    

<p>This time we build a <b>pipeline</b> containing the TF-IDF vectorizer, SVD for feature reduction, and a classification algorithm. Before that, we wrap our algorithm for choosing the number of SVD dimensions in a class that can be used as a pipeline step:</p>

In [11]:
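from sklearn.decomposition import TruncatedSVD

class SVDDimSelect(object):
    """Pipeline step that chooses the number of SVD components automatically."""

    def fit(self, X, y=None):
        # First pass: fit an SVD with half as many components as input features.
        probe = TruncatedSVD(n_components=round(X.shape[1]/2))
        probe.fit(X)

        # Walk the components from largest to smallest variance ratio and
        # count how many come before the cumulative ratio reaches 50%.
        cumulative_variance = 0.0
        k = 0
        for var in sorted(probe.explained_variance_ratio_)[::-1]:
            cumulative_variance += var
            if cumulative_variance >= 0.5:
                break
            else:
                k += 1

        # Second pass: refit keeping k components (at least one, since 0 is invalid).
        self.svd_transformer = TruncatedSVD(n_components=max(k, 1))
        self.svd_transformer.fit(X)
        return self  # scikit-learn expects fit() to return the estimator itself

    def transform(self, X, Y=None):
        return self.svd_transformer.transform(X)

    def get_params(self, deep=True):
        # No tunable parameters; needed so the step can be cloned during the search.
        return {}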
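<p>The selection rule inside the class can be tried on its own with plain numbers. The sketch below is a vectorized variant, not the class's exact logic: it returns the smallest number of components whose cumulative variance ratio reaches the threshold, which includes the component that crosses it, whereas the loop above counts only the components before that point. The function name <b>components_for_variance</b> and the sample ratios are made up for illustration.</p>

```python
import numpy as np

def components_for_variance(variance_ratios, threshold=0.5):
    """Smallest number of components whose summed variance ratio reaches the threshold."""
    ordered = np.sort(np.asarray(variance_ratios, dtype=float))[::-1]  # largest first
    cumulative = np.cumsum(ordered)
    # Index of the first cumulative value >= threshold, converted to a count.
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(ordered))  # cap the count when the threshold is never reached

ratios = [0.1, 0.4, 0.2, 0.3]  # made-up explained variance ratios
print(components_for_variance(ratios))  # 2, since 0.4 + 0.3 is the first sum >= 0.5
```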

<p>Finally, we can create our pipeline:</p>

In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn import neighbors

clf = neighbors.KNeighborsClassifier(n_neighbors=10, weights='uniform')

my_pipeline = Pipeline([('tfidf', TfidfVectorizer(tokenizer=my_tokenizer)),\
                       ('svd', SVDDimSelect()), \
                       ('clf', clf)])

<p>We are almost there... Now we create a <b>RandomizedSearchCV</b> object that selects the hyperparameters of our classifier (that is, parameters that are not learned during training). This step is important for finding the best configuration of the classification algorithm. To save training time, we use a simple algorithm, <i>K nearest neighbors (KNN)</i>.</p>

In [13]:
from sklearn.model_selection import RandomizedSearchCV

par = {'clf__n_neighbors': range(1, 60), 'clf__weights': ['uniform', 'distance']}


hyperpar_selector = RandomizedSearchCV(my_pipeline, par, cv=3, scoring='accuracy', n_jobs=1, n_iter=20)
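<p>The keys in the parameter dictionary follow scikit-learn's <b>step__parameter</b> convention: <b>clf__n_neighbors</b> sets n_neighbors on the pipeline step named 'clf'. The sketch below shows the same mechanism end to end on a tiny, made-up dataset, so it runs in well under a second without any corpus download. It uses the default TF-IDF tokenizer and omits the SVD step, so it is an illustration of the search setup rather than the tutorial's exact pipeline; the texts, labels and random_state value are assumptions.</p>

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Made-up toy data: 0 = informal, 1 = formal.
texts = [
    "hey whats up", "lol that was funny", "brb gotta go", "omg no way",
    "hi everyone lol", "gotta go now bye",
    "She recalled her past kindness with gratitude.",
    "He was a nervous man, easily depressed.",
    "The family had long been settled in the county.",
    "It was a happy circumstance for the household.",
    "The gentleman had been absent for several years.",
    "Her father was tolerably composed that evening.",
]
labels = [0] * 6 + [1] * 6

pipe = Pipeline([('tfidf', TfidfVectorizer()),
                 ('clf', KNeighborsClassifier())])

# 'clf__' routes each parameter to the step named 'clf'.
params = {'clf__n_neighbors': range(1, 4), 'clf__weights': ['uniform', 'distance']}

search = RandomizedSearchCV(pipe, params, cv=2, n_iter=4,
                            scoring='accuracy', random_state=0)
search.fit(texts, labels)

print(search.best_params_)
print("cross-validated accuracy: %0.3f" % search.best_score_)
```

<p>Because <b>refit=True</b> is the default, the fitted search object can be used directly for prediction with the best configuration it found.</p>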


<p>And now we train our algorithm, using the pipeline with feature selection:</p>

In [14]:
print(x_data_train)

['now im left with this gay name' ':P' 'PART' 'hey everyone  ' 'ah well'
 'NICK :10-19-20sUser7' '10-19-20sUser7 is a gay name.'
 '.ACTION gives 10-19-20sUser121 a golf clap.' 'hi 10-19-20sUser59' 'PART'
 'there ya go 10-19-20sUser7' "don't golf clap me."
 'whats everyone up to?' 'PART' 'PART' "i'll thunder clap your ass."
 'PART' 'and i dont even know what that means.'
 'any ladis wanna chat? 29 m' 'my cousin drew a messed up pic on my cast'
 'PART' '24/m' 'boo.' 'lol 10-19-20sUser115' 'boo.' 'JOIN' 'PART'
 'he drew a girl with legs spread' 'boo.' 'hope he didnt draw a penis'
 'PART' 'ewwwww lol' 'JOIN' 'JOIN' 'r u serious' 'JOIN' 'PART'
 '& i have to go to the docs tomorrow' 'ya man'
 'I am too.. Connected to... Slip away... Fade away... Days away I... Still feel you... Touching me... Changing me... Considerably killing me... '
 'heeeey!' "don't you have a sharpie?" '26/m'
 "you're back 10-19-20sUser115" '10-19-20sUser129' 'yep'
 '10-19-20sUser115' 'PART' 'JOIN' 'not fast enough 10-1

In [15]:
hyperpar_selector.fit(X=x_data_train, y=y_data_train)

RandomizedSearchCV(cv=3, error_score='raise-deprecating',
          estimator=Pipeline(memory=None,
     steps=[('tfidf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,...i',
           metric_params=None, n_jobs=None, n_neighbors=10, p=2,
           weights='uniform'))]),
          fit_params=None, iid='warn', n_iter=20, n_jobs=1,
          param_distributions={'clf__n_neighbors': range(1, 60), 'clf__weights': ['uniform', 'distance']},
          pre_dispatch='2*n_jobs', random_state=None, refit=True,
          return_train_score='warn', scoring='accuracy', verbose=0)

In [16]:
print("Best score: %0.3f" % hyperpar_selector.best_score_)
print("Best parameters set:")
best_parameters = hyperpar_selector.best_estimator_.get_params()
for param_name in sorted(par.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))

Best score: 0.687
Best parameters set:
	clf__n_neighbors: 2
	clf__weights: 'distance'


<b>4. Testing the classifier</b>

<p>Now we apply the classifier to our test dataset and look at the results:</p>

In [17]:
from sklearn.metrics import accuracy_score

y_pred = hyperpar_selector.predict(x_data_test)

print(accuracy_score(y_data_test, y_pred))

0.705607476635514
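<p>Accuracy is a single number, so it does not show which class the errors fall on. As a minimal sketch, the counts of each (true label, predicted label) pair can be tallied with the standard library alone. The labels below are made up for illustration and are not the tutorial's predictions; the function name <b>confusion_counts</b> is hypothetical.</p>

```python
from collections import Counter

def confusion_counts(y_true, y_pred):
    """Count how often each (true label, predicted label) pair occurs."""
    return Counter(zip(y_true, y_pred))

# Made-up labels: 0 = informal, 1 = formal.
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0]

counts = confusion_counts(y_true, y_pred)
correct = counts[(0, 0)] + counts[(1, 1)]
accuracy = correct / len(y_true)

print("informal kept as informal:", counts[(0, 0)])
print("informal labeled formal:  ", counts[(0, 1)])
print("formal labeled informal:  ", counts[(1, 0)])
print("formal kept as formal:    ", counts[(1, 1)])
print("accuracy: %0.3f" % accuracy)  # 0.667
```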


<b>5. Serializing the model</b><br>

In [18]:
import pickle

string_obj = pickle.dumps(hyperpar_selector)

In [19]:
# The "with" block closes the file automatically, even if the write fails.
with open('model.pkl', 'wb') as model_file:
    model_file.write(string_obj)

<b>6. Loading and using a saved model </b><br>

In [20]:

with open('model.pkl', 'rb') as model_file:
    model_content = model_file.read()

obj_classifier = pickle.loads(model_content)

res = obj_classifier.predict(["Where is the main"])

print(res)

[0]
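<p>Two points are worth keeping in mind when loading a pickled model. First, pickle stores functions and classes by reference, so <b>my_tokenizer</b> and <b>SVDDimSelect</b> must already be defined (or importable) in the session that loads the file. Second, unpickling can execute arbitrary code, so only load files you trust. The sketch below shows the save and load round trip with <b>pickle.dump</b> and <b>pickle.load</b>, using a plain dictionary as a stand-in for the trained model and a temporary file whose name is hypothetical.</p>

```python
import os
import pickle
import tempfile

# Stand-in for a trained model: any picklable Python object works for this sketch.
model = {'name': 'demo-model', 'n_neighbors': 2, 'weights': 'distance'}

path = os.path.join(tempfile.mkdtemp(), 'demo_model.pkl')  # hypothetical file name

# pickle.dump writes straight to the file; "with" closes it even if an error occurs.
with open(path, 'wb') as model_file:
    pickle.dump(model, model_file)

# pickle.load rebuilds an equal, independent copy of the saved object.
with open(path, 'rb') as model_file:
    restored = pickle.load(model_file)

print(restored == model)  # True
```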


In [21]:
res = obj_classifier.predict(x_data_test)
print(accuracy_score(y_data_test, res))

0.705607476635514


In [22]:
res = obj_classifier.predict(x_data_test)

print(res)

[0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 1 0 0 1 0 1 0 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1 1 1 1 0 0 0 0 0
 1 0 0 0 0 0 0 1 0 1 1 0 1 0 1 0 0 0 0 1 1 1 0 0 1 1 1 1 0 1 0 1 0 0 1 1 1
 0 0 1 0 1 1 0 1 1 1 1 0 1 0 0 1 1 1 0 0 0 0 1 0 0 1 1 0 1]


In [23]:
formal = [x_data_test[i] for i in range(len(res)) if res[i] == 1]

for txt in formal:
    print("%s\n" % txt)


sounds good to me.

hurry ladies

yes 10-19-20sUser30

yes 10-19-20sUser121??

i already wrote what i wanted you to read.

what'd I miss?

bye  10-19-20sUser148

dr phil said so

yes 10-19-20sUser92???

She recalled her past kindness -- the kindness , the affection of sixteen years -- how she had taught and how she had played with her from five years old -- how she had devoted all her powers to attach and amuse her in health -- and how nursed her through the various illnesses of childhood .

All looked up to them .

He was a nervous man , easily depressed ; fond of every body that he was used to , and hating to part with them ; hating change of every kind .

Whenever I see her , she always curtseys and asks me how I do , in a very pretty manner ; and when you have had her here to do needlework , I observe she always turns the lock of the door the right way and never bangs it .

He will be able to tell her how we all are ."

It was a happy circumstance , and animated Mr . Woodhouse for 

In [24]:
informal = [x_data_test[i] for i in range(len(res)) if res[i] == 0]

for txt in informal:
    print("%s\n" % txt)

:)

JOIN

26/ m/ ky women that are nice please pm me

JOIN

fuck you 10-19-20sUser121:@

that sounds painful

26/m

JOIN

26/m and sexy

& a head between her legs

I'll take one, please.

Any ladies wanna chat with 24/m

JOIN

my chair is too hard.

hey 

JOIN

yo, 10-19-20sUser133

:)

sho*

hey any guys with cams wanna play?

PART

sure 10-19-20sUser126

what did you but on e-bay

im considering changing my nickname to "ihavehotnips"

u should 10-19-20sUser44:)

JOIN

you should make it 'iamahotnip', 10-19-20sUser44

alright

hi 10-19-20sUser126, its so late

PART

answers for 10-19-20sUser139 ... hi 10-19-20sUser101 ;)

I like it when you do it, 10-19-20sUser83

iamahotnipwithpics

JOIN

uh huh 

A gold jeep charm for my necklace 10-19-20sUser30

OOooOO:)

how you doin 10-19-20sUser139

how many kts

lmao!!!

10-19-20sUser6,

.ACTION watches 10-19-20sUser6 hug the stuffin' outta 10-19-20sUser115....

please behave baby boy.. I gottsa go now

aww 10-19-20sUser6 have fun

forwads away

In [25]:
res2 = obj_classifier.predict(["Emma spared no exertions to maintain this happier flow of ideas , and hoped , by the help of backgammon , to get her father tolerably through the evening , and be attacked by no regrets but her own"])

print(res2)

[1]


<p><b>Exercise 4:</b>  Train a model to classify sentences by sentiment (positive or negative), using the sentence_polarity corpus from nltk.
    Hint: the positive and negative sentences can be obtained with:
</p>

In [26]:
import nltk
nltk.download('sentence_polarity')
from nltk.corpus import sentence_polarity

sentencasPos = sentence_polarity.sents(categories=['pos'])
sentencasNeg = sentence_polarity.sents(categories=['neg'])

y_data_neg = [0] * len(sentencasNeg)
y_data_pos = [1] * len(sentencasPos)
y_data_full_pol =  y_data_pos[:500] + y_data_neg[:500] 




[nltk_data] Downloading package sentence_polarity to
[nltk_data]     /usr/share/nltk_data...
[nltk_data]   Package sentence_polarity is already up-to-date!


In [27]:
x_data_pol_pos = []
x_data_pol_pos.extend([' '.join(sent) for sent in sentence_polarity.sents(categories=['pos'])])


In [28]:
x_data_pol_neg = []
x_data_pol_neg.extend([' '.join(sent) for sent in sentence_polarity.sents(categories=['neg'])])

In [29]:
x_data_full_pol = x_data_pol_pos[:500]+x_data_pol_neg[:500]

In [30]:
#nparray
import numpy as np

x_data_pol = np.array(x_data_full_pol, dtype=object)
#x_data = np.array(x_data_full)
print(x_data_pol.shape)
y_data_pol = np.array(y_data_full_pol)
print(y_data_pol.shape)

(1000,)
(1000,)


In [31]:
train_indexes_pol = np.random.rand(len(x_data_pol)) < 0.80

print(len(train_indexes_pol))
print(train_indexes_pol[:10])

x_data_train_pol = x_data_pol[train_indexes_pol]
y_data_train_pol = y_data_pol[train_indexes_pol]

print(len(x_data_train_pol))
print(len(y_data_train_pol))

x_data_test_pol = x_data_pol[~train_indexes_pol]
y_data_test_pol = y_data_pol[~train_indexes_pol]

print(len(x_data_test_pol))
print(len(y_data_test_pol))

1000
[ True  True  True  True  True False  True  True  True  True]
798
798
202
202


In [32]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn import neighbors

clf = neighbors.KNeighborsClassifier(n_neighbors=10, weights='uniform')

my_pipeline = Pipeline([('tfidf', TfidfVectorizer(tokenizer=my_tokenizer)),\
                       ('svd', SVDDimSelect()), \
                       ('clf', clf)])

In [33]:
from sklearn.model_selection import RandomizedSearchCV

par = {'clf__n_neighbors': range(1, 60), 'clf__weights': ['uniform', 'distance']}


hyperpar_selector = RandomizedSearchCV(my_pipeline, par, cv=3, scoring='accuracy', n_jobs=1, n_iter=20)

In [34]:
hyperpar_selector.fit(X=x_data_train_pol, y=y_data_train_pol)

RandomizedSearchCV(cv=3, error_score='raise-deprecating',
          estimator=Pipeline(memory=None,
     steps=[('tfidf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,...i',
           metric_params=None, n_jobs=None, n_neighbors=10, p=2,
           weights='uniform'))]),
          fit_params=None, iid='warn', n_iter=20, n_jobs=1,
          param_distributions={'clf__n_neighbors': range(1, 60), 'clf__weights': ['uniform', 'distance']},
          pre_dispatch='2*n_jobs', random_state=None, refit=True,
          return_train_score='warn', scoring='accuracy', verbose=0)

In [35]:
print("Best score: %0.3f" % hyperpar_selector.best_score_)
print("Best parameters set:")
best_parameters = hyperpar_selector.best_estimator_.get_params()
for param_name in sorted(par.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))

Best score: 0.548
Best parameters set:
	clf__n_neighbors: 13
	clf__weights: 'distance'


In [36]:
from sklearn.metrics import accuracy_score

y_pred_pol = hyperpar_selector.predict(x_data_test_pol)

print(accuracy_score(y_data_test_pol, y_pred_pol))

0.5792079207920792


In [37]:
import pickle

string_obj = pickle.dumps(hyperpar_selector)

In [38]:
# Note: this overwrites the 'model.pkl' file saved earlier for the formal/informal classifier.
with open('model.pkl', 'wb') as model_file:
    model_file.write(string_obj)

In [39]:
with open('model.pkl', 'rb') as model_file:
    model_content = model_file.read()

obj_classifier = pickle.loads(model_content)
res = obj_classifier.predict(["wonderful"])
print(res)

[0]
