# Natural Language Processing - Practice 01

## Bag of Words


The bag-of-words model is a simplified representation used in natural language processing and information retrieval. In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words: grammar and even word order are discarded, but multiplicity is kept.

In document classification, a bag of words is a sparse vector of word occurrence counts, that is, a sparse histogram over the vocabulary.
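
The multiset idea can be sketched with the standard library alone. The snippet below is only an illustration, assuming naive lowercase whitespace tokenization with ASCII punctuation stripped (the `bag_of_words` helper is hypothetical, not part of any library). It shows that two texts with the same words in a different order yield the same bag.

```python
from collections import Counter
import string

def bag_of_words(text):
    """Return the multiset of lowercase words in `text` (hypothetical helper)."""
    # Assumption: strip ASCII punctuation, then split on whitespace.
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return Counter(cleaned.split())

bag_a = bag_of_words("Pedro gosta de cinema. Juliana gosta de cinema.")
bag_b = bag_of_words("Juliana gosta de cinema. Pedro gosta de cinema.")

print(bag_a["gosta"])   # multiplicity is kept
print(bag_a == bag_b)   # word order is ignored
```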

In [1]:
# bag of words
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
data_corpus = ["Juliana gosta de cinema. Pedro também gosta de cinema.", 
               "Pedro também gosta de futebol."]
X = vectorizer.fit_transform(data_corpus) 
print(X.toarray())
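# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print(vectorizer.get_feature_names_out().tolist())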

[[2 2 0 2 1 1 1]
 [0 1 1 1 0 1 1]]
['cinema', 'de', 'futebol', 'gosta', 'juliana', 'pedro', 'também']


In [2]:
# Bag of Words - II
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import euclidean_distances
 
corpus = [
'All my cats in a row',
'When my cat sits down, she looks like a Furby toy!',
'The cat from outer space',
'Sunshine loves to sit like this for some reason.'
]
 
vectorizer = CountVectorizer()
features = vectorizer.fit_transform(corpus).toarray()  # dense ndarray instead of np.matrix
print( vectorizer.vocabulary_ )


{'all': 0, 'my': 11, 'cats': 2, 'in': 7, 'row': 14, 'when': 25, 'cat': 1, 'sits': 17, 'down': 3, 'she': 15, 'looks': 9, 'like': 8, 'furby': 6, 'toy': 24, 'the': 21, 'from': 5, 'outer': 12, 'space': 19, 'sunshine': 20, 'loves': 10, 'to': 23, 'sit': 16, 'this': 22, 'for': 4, 'some': 18, 'reason': 13}


## Loading a dataset (20 newsgroups dataset)

Roughly 20,000 documents, 20 categories <br>
Categories: <br>
alt.atheism <br>
comp.graphics <br>
comp.os.ms-windows.misc <br>
comp.sys.ibm.pc.hardware <br>
comp.sys.mac.hardware <br>
comp.windows.x <br>
misc.forsale <br>
rec.autos <br>
rec.motorcycles <br>
rec.sport.baseball <br>
rec.sport.hockey <br>
sci.crypt <br>
sci.electronics <br>
sci.med <br>
sci.space <br>
soc.religion.christian <br>
talk.politics.guns <br>
talk.politics.mideast <br>
talk.politics.misc <br>
talk.religion.misc <br>

In [3]:
# load 5 categories
categories = ['alt.atheism', 'soc.religion.christian',
              'comp.graphics', 'sci.med', 'talk.politics.misc' ]

In [4]:
from sklearn.datasets import fetch_20newsgroups
twenty_train = fetch_20newsgroups(subset='train',
    categories=categories, shuffle=True, random_state=42)

In [5]:
len(twenty_train.data)

2722

In [8]:
print("\n".join(twenty_train.data[3].split("\n")[:3]))

From: euclid@mrcnext.cso.uiuc.edu (Euclid K.)
Subject: Re: Anti-Viral Herbs
Article-I.D.: news.C51o24.8A4


In [9]:
print(twenty_train.target_names[twenty_train.target[3]])

sci.med


In [17]:
print("\n".join(twenty_train.data[106].split("\n")[:3]))

From: pwhite@empros.com (Peter White)
Subject: Some questions from a new Christian
Lines: 50


In [18]:
print(twenty_train.target_names[twenty_train.target[106]])

soc.religion.christian


In [20]:
# output labels, encoded as category indices (integers)
twenty_train.target[:10]

array([4, 1, 3, 2, 4, 3, 1, 3, 1, 2])

In [21]:
# category names of the first 10 documents
for t in twenty_train.target[:10]:
    print(twenty_train.target_names[t])

talk.politics.misc
comp.graphics
soc.religion.christian
sci.med
talk.politics.misc
soc.religion.christian
comp.graphics
soc.religion.christian
comp.graphics
sci.med


### Converting the documents to a BOW (bag of words)

In [22]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)
X_train_counts.shape

(2722, 40604)

In [24]:
# BOW - stored (nonzero) word occurrence counts
X_train_counts.data

array([1, 1, 1, ..., 1, 3, 1], dtype=int64)
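
`X_train_counts` is a SciPy sparse matrix, so `.data` holds only the nonzero counts. As a small illustration on the two-sentence corpus from the first cell (not on the newsgroups data), the sketch below shows how the stored counts relate to the matrix shape and to the fitted vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

data_corpus = ["Juliana gosta de cinema. Pedro também gosta de cinema.",
               "Pedro também gosta de futebol."]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(data_corpus)

print(X.shape)   # (number of documents, vocabulary size)
print(X.nnz)     # number of stored (nonzero) entries
print(X.data)    # the nonzero counts themselves

# Column index of a given word, looked up in the fitted vocabulary.
col = vectorizer.vocabulary_['cinema']
print(X[0, col])  # how many times 'cinema' occurs in the first document
```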

### Converting the BOW into a TF-IDF matrix (Term Frequency times Inverse Document Frequency)
#### turns raw occurrence counts into weighted frequencies (in other words: each word receives a weight within each document)

In [30]:
# TF-IDF matrix (importance of each term in each document)
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(2722, 40604)
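
With its default settings (`smooth_idf=True`, `norm='l2'`), `TfidfTransformer` weights each count by idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t, and then scales each row to unit Euclidean length. The sketch below reproduces that computation with NumPy on a toy count matrix (the numbers are made up for illustration) and compares the result with scikit-learn.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfTransformer

# Toy count matrix: 3 documents x 4 terms (illustrative values only).
counts = np.array([[2, 1, 0, 0],
                   [0, 1, 3, 0],
                   [1, 1, 0, 4]])

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)                # document frequency of each term
idf = np.log((1 + n_docs) / (1 + df)) + 1    # smoothed IDF
weighted = counts * idf                      # TF times IDF
# Scale every row to unit L2 norm.
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

sk = TfidfTransformer().fit_transform(csr_matrix(counts)).toarray()
print(np.allclose(manual, sk))
```

A term that appears in every document (the second column here) gets the minimum IDF of 1, so it is never zeroed out, only down-weighted relative to rarer terms.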

In [31]:
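# TF only, no IDF: X_train_tf was never defined above, and the values below look like plain normalized counts
X_train_tf = TfidfTransformer(use_idf=False).fit_transform(X_train_counts)
X_train_tf.data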

array([ 0.02212953,  0.02212953,  0.02212953, ...,  0.06984303,
        0.20952909,  0.06984303])

In [32]:
X_train_tf[0]

<1x40604 sparse matrix of type '<class 'numpy.float64'>'
	with 289 stored elements in Compressed Sparse Row format>

### Applying the Naive Bayes algorithm

In [33]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)

### Prediction

In [34]:
docs_new = ['God is love', 'OpenGL on the GPU is fast']
X_new_counts = count_vect.transform(docs_new) # convert to BOW
X_new_tfidf = tfidf_transformer.transform(X_new_counts) # convert to TF-IDF

predicted = clf.predict(X_new_tfidf)

for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, twenty_train.target_names[category]))

'God is love' => soc.religion.christian
'OpenGL on the GPU is fast' => comp.graphics


In [36]:
docs_new = ['planets', 'I was sick', 'my mac is good', 'I go to the church', 'white house', 'Ebola' ]
X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)

predicted = clf.predict(X_new_tfidf)

for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, twenty_train.target_names[category]))

'planets' => alt.atheism
'I was sick' => sci.med
'my mac is good' => comp.graphics
'I go to the church' => soc.religion.christian
'white house' => talk.politics.misc
'Ebola' => soc.religion.christian
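
Some of the predictions above look odd, 'Ebola' for example. One mechanism that can produce such a result: when none of the words in a new text occur in the training vocabulary, its feature vector is all zeros and `MultinomialNB` can only fall back on the class priors, so it predicts the most frequent training class. Whether that is what happened for this specific input is not verified here; the sketch below only demonstrates the mechanism on a made-up toy corpus.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up toy corpus: class 0 is more frequent than class 1.
train_docs = ['the doctor treats patients', 'a vaccine for the disease',
              'medicine and health', 'the rocket reached orbit']
train_labels = [0, 0, 0, 1]

vect = CountVectorizer()
X_train = vect.fit_transform(train_docs)
clf = MultinomialNB().fit(X_train, train_labels)

# A word never seen during training produces an all-zero feature vector.
X_new = vect.transform(['xylophone'])
print(X_new.nnz)

# With no evidence from the features, the posterior equals the class prior.
print(clf.predict_proba(X_new))
print(np.exp(clf.class_log_prior_))
print(clf.predict(X_new))
```

Checking `X_new_counts.nnz` (or the row sums) before trusting a prediction is a cheap way to spot inputs the model knows nothing about.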


### Using a Pipeline (a sequence of steps)

In [45]:
from sklearn.pipeline import Pipeline
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB()),
])
text_clf.fit(twenty_train.data, twenty_train.target) 

Pipeline(steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...inear_tf=False, use_idf=True)), ('clf', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])
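
The first two pipeline steps can also be merged: scikit-learn documents `TfidfVectorizer` as equivalent to `CountVectorizer` followed by `TfidfTransformer`. A minimal sketch on a toy corpus (the first two sentences come from the prediction cell, the third is made up), checking that both routes give the same matrix:

```python
import numpy as np
from sklearn.feature_extraction.text import (CountVectorizer,
                                             TfidfTransformer,
                                             TfidfVectorizer)
from sklearn.pipeline import Pipeline

docs = ['God is love', 'OpenGL on the GPU is fast', 'the GPU renders graphics']

# Route 1: counts first, then TF-IDF weighting, chained in a pipeline.
two_steps = Pipeline([('vect', CountVectorizer()),
                      ('tfidf', TfidfTransformer())])
X_two = two_steps.fit_transform(docs)

# Route 2: a single vectorizer that does both.
X_one = TfidfVectorizer().fit_transform(docs)
print(np.allclose(X_two.toarray(), X_one.toarray()))

# Individual steps of a fitted pipeline remain accessible by name.
print(len(two_steps.named_steps['vect'].vocabulary_))
```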

### Evaluating performance

In [47]:
import numpy as np
twenty_test = fetch_20newsgroups(subset='test',
    categories=categories, shuffle=True, random_state=42)
docs_test = twenty_test.data
predicted = text_clf.predict(docs_test)
np.mean(predicted == twenty_test.target)

0.79525386313465785

In [51]:
# using a linear SVM (SGDClassifier with hinge loss)
from sklearn.linear_model import SGDClassifier
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', SGDClassifier(loss='hinge', penalty='l2',
                                           alpha=1e-3, random_state=42
                                           )),
])
text_clf.fit(twenty_train.data, twenty_train.target)  

predicted = text_clf.predict(docs_test)
np.mean(predicted == twenty_test.target)

0.90121412803532008

In [52]:
# detailed metrics
from sklearn import metrics
print(metrics.classification_report(twenty_test.target, predicted,
    target_names=twenty_test.target_names))

                        precision    recall  f1-score   support

           alt.atheism       0.94      0.78      0.85       319
         comp.graphics       0.86      0.98      0.91       389
               sci.med       0.93      0.87      0.90       396
soc.religion.christian       0.87      0.95      0.91       398
    talk.politics.misc       0.95      0.90      0.92       310

           avg / total       0.91      0.90      0.90      1812



In [53]:
# confusion matrix
metrics.confusion_matrix(twenty_test.target, predicted)

array([[250,   9,  14,  42,   4],
       [  1, 380,   1,   3,   4],
       [  4,  36, 346,   5,   5],
       [  5,  10,   3, 379,   1],
       [  7,   8,   9,   8, 278]])
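
The classification report and the confusion matrix carry the same information. As a worked check using the matrix printed above (rows are true classes, columns are predicted classes, in `target_names` order), recall is the diagonal divided by the row sums, precision is the diagonal divided by the column sums, and accuracy is the trace divided by the total:

```python
import numpy as np

# Confusion matrix copied from the output above.
cm = np.array([[250,   9,  14,  42,   4],
               [  1, 380,   1,   3,   4],
               [  4,  36, 346,   5,   5],
               [  5,  10,   3, 379,   1],
               [  7,   8,   9,   8, 278]])

recall = np.diag(cm) / cm.sum(axis=1)      # correct / all documents truly in the class
precision = np.diag(cm) / cm.sum(axis=0)   # correct / all documents predicted as the class
accuracy = np.trace(cm) / cm.sum()

print(precision.round(2))
print(recall.round(2))
print(round(accuracy, 4))
```

The first row shows where the weakest recall comes from: 42 of the 319 alt.atheism posts were classified as soc.religion.christian.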