<br/>
<h1 align="center"> <span style = 'color:red'> Sentiment Analysis with the TF-IDF Method - Multinomial NB
</span> </h1>
<br/>

<br/>

<br/>
<h3 align="left"> <span style = 'color:green'> The notebook "6  - Utilizando TF-IDF para classificação de texto" explains this method
</span> </h3>
<br/>
<br/>

 




<br/>
<h3 align="left"> <span style = 'color:green'>   Importing the libraries  </span> </h3>

In [1]:
import pandas as pd
import numpy as np
import nltk
from sklearn import naive_bayes
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score


<br/>
<h3 align="left"> <span style = 'color:green'> Dataset in a tab-separated text (.txt) file    </span> </h3>


In [2]:
df = pd.read_csv("movie_reviews.txt",sep = '\t', names = ['polarity','review'])

<br/>
<h3 align="left"> <span style = 'color:green'> Showing 5 random records   </span> </h3>

In [3]:
df.sample(5)

Unnamed: 0,polarity,review
19152,0,Celia Johnson is good as the Nurse. Michael Ho...
3852,1,Reign Over Me (titled after the who song) is a...
10946,1,*WARNING* Spoilers ahead... The writers of thi...
15321,0,The worst movie I have seen in quite a while. ...
10477,1,If you have never viewed this film and like ol...


<br/>
<h3 align="left"> <span style = 'color:green'> Checking whether the dataset is balanced    </span> </h3>

In [7]:
# Checking whether the dataset is balanced
n_pos = df[df["polarity"]==1].count()["polarity"]
n_neg = df[df["polarity"]==0].count()["polarity"]
print(40 * '*')
print("Number of positive reviews: ", n_pos)
print(40 * '*')
print("Number of negative reviews: ", n_neg)
print(40 * '*')
print(40 * '*')

# Derive the conclusion from the counts instead of hardcoding it
print('Balanced dataset' if n_pos == n_neg else 'Imbalanced dataset')

****************************************
Number of positive reviews:  12500
****************************************
Number of negative reviews:  12500
****************************************
****************************************
Balanced dataset
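
The same check can be written more compactly with `value_counts`. The sketch below uses a small made-up DataFrame (`toy`) in place of the real reviews, and the "equal counts" rule is only an illustrative assumption about what counts as balanced.

```python
import pandas as pd

# Toy stand-in for the reviews table; the real data comes from movie_reviews.txt.
toy = pd.DataFrame({
    "polarity": [1, 0, 1, 0],
    "review": ["great film", "dull plot", "loved it", "terrible acting"],
})

# Absolute counts and relative shares per class, ordered by label.
counts = toy["polarity"].value_counts().sort_index()
shares = toy["polarity"].value_counts(normalize=True).sort_index()

print(counts.to_dict())
print(shares.to_dict())

# Assumed rule for this sketch: balanced means every class has the same count.
is_balanced = counts.nunique() == 1
print("Balanced dataset" if is_balanced else "Imbalanced dataset")
```

On the notebook's data, `df["polarity"].value_counts()` would report the two class counts in a single call.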


<br/>
<br/>
<h3 align="left"> <span style = 'color:green'> Creating a list of English stopwords </span> </h3>
<br/>


In [8]:
# variable that will hold the stopwords
stopset = set(stopwords.words('english'))

<br/>
<br/>
<h3 align="left"> <span style = 'color:green'>  Creating the TF-IDF matrix of word weights per document   </span> </h3>
<br/>


**Simple preprocessing of the text:**


* Convert everything to lowercase


* Strip accents


* Remove stopwords

In [9]:

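# TF-IDF vectorizer that will build the matrix of word weights
# (stop_words is passed as a list, which recent scikit-learn versions require)

vectorizer = TfidfVectorizer(use_idf = True, lowercase = True, strip_accents= 'ascii', stop_words = list(stopset))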
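
To see what the three preprocessing steps do, here is a minimal sketch on two made-up sentences. The short stopword list `toy_stopwords` is a hypothetical stand-in for the NLTK English list, used only so the example runs without downloading anything.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini stopword list, standing in for the NLTK English list.
toy_stopwords = ["the", "was", "a", "is"]

# Two made-up documents with mixed case and accented characters.
docs = ["The café was GREAT", "The plot is a cliché"]

toy_vectorizer = TfidfVectorizer(
    use_idf=True, lowercase=True, strip_accents="ascii", stop_words=toy_stopwords
)
toy_matrix = toy_vectorizer.fit_transform(docs)

# The vocabulary is lowercased, accent-free, and has no stopwords.
vocab = sorted(toy_vectorizer.vocabulary_)
print(vocab)             # ['cafe', 'cliche', 'great', 'plot']
print(toy_matrix.shape)  # (2, 4): one row per document, one column per word
```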

<br/>
<br/>
<h3 align="left"> <span style = 'color:green'> Naming the input and output and splitting the dataset    </span> </h3>
<br/>





In [22]:
# output variable y: the polarity
y = df.polarity 

# input variable X: TF-IDF matrix fitted on the reviews
X = vectorizer.fit_transform(df.review) 


# split the dataset into training and test sets
X_train, X_test,y_train, y_test = train_test_split(X, y,test_size=0.2, random_state=1)
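
Note that the cell above fits the vectorizer on all reviews before splitting, so the vocabulary and IDF weights also see the test reviews. One possible alternative, sketched here on made-up data, is to split the raw text first and let a scikit-learn `Pipeline` fit the vectorizer on the training portion only. The names `texts`, `labels`, and `pipe` are assumptions for illustration and this is not the notebook's own procedure.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy reviews standing in for df.review / df.polarity.
texts = [
    "a wonderful and moving film", "boring and far too long",
    "great acting and a great story", "awful script and weak acting",
    "loved every minute", "hated the ending",
    "beautiful and funny", "dull and predictable",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# Split the raw text first, keeping the class proportions in both parts.
texts_train, texts_test, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, random_state=1, stratify=labels
)

# The pipeline fits TF-IDF on the training texts only, then trains the model.
pipe = make_pipeline(TfidfVectorizer(lowercase=True), MultinomialNB())
pipe.fit(texts_train, y_tr)

train_vocab = set(pipe.named_steps["tfidfvectorizer"].vocabulary_)
predictions = pipe.predict(texts_test)
print(len(texts_train), len(texts_test), len(train_vocab))
```

With this layout, words that occur only in the test reviews never enter the vocabulary, so the measured accuracy is closer to what the model would achieve on unseen text.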

<br/>
<br/>

<h3 align="left"> <span style = 'color:green'>  Creating the MultinomialNB model  </span> </h3>
<br/>


In [23]:
from sklearn.naive_bayes import MultinomialNB

# creating the model
MNB = MultinomialNB()


# training the model
MNB.fit(X_train,y_train)



MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)


<br/>
<br/>

<h3 align="left"> <span style = 'color:green'> Evaluating the performance of the MultinomialNB classifier   </span> </h3>
<br/>


In [24]:
from sklearn.metrics import confusion_matrix,accuracy_score
from sklearn.metrics import classification_report

# predicting the labels of the test set
y_pred = MNB.predict(X_test)


print(20 * '*')
print("Accuracy: {0:.2%}".format(accuracy_score(y_test,y_pred)))
print(20 * '*')
print('\t\t\tClassification Report')
print(60 * '*')
print(classification_report(y_test, y_pred))


********************
Accuracy: 86.48%
********************
			Classification Report
************************************************************
              precision    recall  f1-score   support

           0       0.85      0.88      0.87      2483
           1       0.88      0.85      0.86      2517

   micro avg       0.86      0.86      0.86      5000
   macro avg       0.87      0.86      0.86      5000
weighted avg       0.87      0.86      0.86      5000
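
The notebook imports `confusion_matrix` and `roc_auc_score` but never calls them. The sketch below shows how both could be used, on small made-up label and score lists (`y_true`, `y_hat`, `scores` are illustrative, not the notebook's results).

```python
from sklearn.metrics import confusion_matrix, roc_auc_score

# Made-up true labels, hard predictions, and positive-class scores.
y_true = [0, 0, 0, 1, 1, 1]
y_hat = [0, 0, 1, 1, 1, 0]
scores = [0.1, 0.4, 0.6, 0.9, 0.8, 0.3]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_hat)
tn, fp, fn, tp = cm.ravel()
print(cm)
print("TN={} FP={} FN={} TP={}".format(tn, fp, fn, tp))

# ROC AUC needs scores or probabilities, not hard 0/1 predictions.
auc = roc_auc_score(y_true, scores)
print("ROC AUC: {:.3f}".format(auc))
```

With the notebook's objects, the corresponding calls would be `confusion_matrix(y_test, y_pred)` and `roc_auc_score(y_test, MNB.predict_proba(X_test)[:, 1])`.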



<br/>
<br/>

<h3 align="left"> <span style = 'color:green'> Showing the words considered negative and positive    </span> </h3>
<br/>





In [13]:
words_polarity = []

# feature_log_prob_[1] holds log P(word | positive class); it is the value the
# deprecated (and later removed) coef_ attribute exposed for a binary
# MultinomialNB. The ranking therefore reflects how frequent a word is in
# positive reviews.
for word, index in vectorizer.vocabulary_.items():
    words_polarity.append((word, MNB.feature_log_prob_[1][index]))

words_polarity = sorted(words_polarity, key=lambda x: x[1])


print('------------------------------------------------------------------------------------------------------------------')
print("Negative: " + ", ".join([str(x[0]) for x in words_polarity[:10]]))
print('------------------------------------------------------------------------------------------------------------------')
print("Positive: " + ", ".join([str(x[0]) for x in words_polarity[-10:]]))

------------------------------------------------------------------------------------------------------------------
Negative: jawaharlal, nehru, stealling, ryoga, vacuous, skateboard, skaters, skate, tricktris, premade
------------------------------------------------------------------------------------------------------------------
Positive: time, see, well, story, like, good, great, one, film, movie
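
The lists above rank words by a single class's log-probability, which is why generic words such as "movie" and "film" appear as positive and very rare words appear as negative. A possible alternative, sketched here on a made-up corpus, ranks words by the difference between the two class log-probabilities. This is an illustrative variation, not the notebook's method, and all names below are assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up corpus: two positive and two negative reviews.
texts = [
    "great movie great story",
    "wonderful film",
    "boring movie awful story",
    "awful film",
]
labels = [1, 1, 0, 0]

vec = TfidfVectorizer()
clf = MultinomialNB().fit(vec.fit_transform(texts), labels)

# log P(word | positive) - log P(word | negative): > 0 leans positive.
log_ratio = clf.feature_log_prob_[1] - clf.feature_log_prob_[0]

# Words ordered by column index, so they line up with log_ratio.
words = np.array(sorted(vec.vocabulary_, key=vec.vocabulary_.get))
order = np.argsort(log_ratio)

most_negative = words[order[:2]].tolist()
most_positive = words[order[-2:]].tolist()
print("Most negative:", most_negative)
print("Most positive:", most_positive)
```

Words shared by both classes ("movie", "film", "story") end up near zero under this score, so they no longer dominate either list.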



<br/>
<br/>
<h3 align="left"> <span style = 'color:green'>Testing the model with new documents (sentences) </span> </h3>
<br/>




In [14]:
for word in [["I liked the movie"],["I hate this character"],
             ["The plot of the movie was interesting"],
             ["It has been and amazing depiction of charaters"],
             ["Movie was boring"]]:
    review_word = np.array(word)
    review_vector = vectorizer.transform(review_word)
    print(str(word[0]) + " " + str(MNB.predict(review_vector)[0]))

I liked the movie 1
I hate this character 0
The plot of the movie was interesting 0
It has been and amazing depiction of charaters 1
Movie was boring 0
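
A hard 0/1 label hides how confident the model is, which matters for borderline cases such as "The plot of the movie was interesting" being labeled 0. The sketch below uses `predict_proba` on a made-up toy model (`toy_vec`, `toy_clf`) to show the class probabilities next to each sentence. With the notebook's objects, the call would be `MNB.predict_proba(review_vector)`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up training set, standing in for the movie reviews.
toy_texts = [
    "great movie great story",
    "wonderful film",
    "boring movie awful story",
    "awful film",
]
toy_labels = [1, 1, 0, 0]

toy_vec = TfidfVectorizer()
toy_clf = MultinomialNB().fit(toy_vec.fit_transform(toy_texts), toy_labels)

sentences = ["great story", "awful film"]
# Columns follow toy_clf.classes_, here [0, 1] = [negative, positive].
proba = toy_clf.predict_proba(toy_vec.transform(sentences))

for sentence, (p_neg, p_pos) in zip(sentences, proba):
    print("{}: P(negative)={:.2f}, P(positive)={:.2f}".format(sentence, p_neg, p_pos))
```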


<h3 align="left"> <span style = 'color:green'>Try it yourself: write a few sentences   </span> </h3>
<br/>


In [15]:
for word in [["XXXXXXXXXXXXXXX"],["XXXXXXXXXXXXXXXXX"]]:
    review_word = np.array(word)
    review_vector = vectorizer.transform(review_word)
    print(str(word[0]) + " " + str(MNB.predict(review_vector)[0]))

XXXXXXXXXXXXXXX 0
XXXXXXXXXXXXXXXXX 0
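
The placeholder strings contain no word from the training vocabulary, so their TF-IDF vectors are all zeros and MultinomialNB falls back on the class priors. In the run above the test set holds 2483 negative and 2517 positive reviews out of 12500 each, which leaves 10017 negative and 9983 positive reviews for training. That is consistent with both placeholders being labeled 0. A minimal sketch of this behaviour on made-up data with a 3-to-2 class split:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training set with more negative (3) than positive (2) reviews.
texts = ["great movie", "wonderful film", "boring movie", "awful film", "dull story"]
labels = [1, 1, 0, 0, 0]

vec = TfidfVectorizer()
clf = MultinomialNB().fit(vec.fit_transform(texts), labels)

# A string with no known word becomes an all-zero row.
unseen = vec.transform(["XXXXXXXXXXXXXXX"])
print(unseen.nnz)  # 0 stored values: nothing matched the vocabulary

# With no evidence from words, the prediction follows the class priors.
pred = clf.predict(unseen)[0]
proba = clf.predict_proba(unseen)[0]
print(pred, proba)
```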
