# MLN - Analise de Sentimentos

Este modelo de Machine Learning foi desenvolvido com a finalidade de capturar e interpretar emoções e opiniões expressas em textos, mais especificamente opiniões de filmes, permitindo uma análise de sentimentos positivos ou negativos.

## Import das bibliotecas:

In [8]:
# Instalação de mais bibliotecas
!pip install kaggle
!pip install unidecode



In [9]:
# Manipulação dos dados:
import pandas as pd
import numpy as np

# Machine Learning:
import tensorflow as tf
from tensorflow import keras
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from keras.layers import Dense, Conv1D, MaxPooling1D, GlobalMaxPooling1D, Embedding, Flatten

# PLN (Processamento de Linguagem Natural) e Pré-Processamento de dados:
import nltk
from nltk import tokenize
from string import punctuation
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import unidecode

## MLN


In [10]:
!pip install unidecode



In [11]:
!pip install kaggle



In [12]:
#De upload no kaggle.json antes de rodar esse codigo
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

In [13]:
!kaggle datasets download -d luisfredgs/imdb-ptbr

Downloading imdb-ptbr.zip to /content
 85% 41.0M/48.4M [00:00<00:00, 67.5MB/s]
100% 48.4M/48.4M [00:00<00:00, 64.5MB/s]


In [14]:
from zipfile import ZipFile
file_name = 'imdb-ptbr.zip' #the file is your dataset exact name
with ZipFile(file_name, 'r') as zip:
  zip.extractall()
  print('Done')

Done


In [15]:
df_imdb = pd.read_csv("imdb-reviews-pt-br.csv")

In [16]:
df_imdb.head()

Unnamed: 0,id,text_en,text_pt,sentiment
0,1,Once again Mr. Costner has dragged out a movie...,"Mais uma vez, o Sr. Costner arrumou um filme p...",neg
1,2,This is an example of why the majority of acti...,Este é um exemplo do motivo pelo qual a maiori...,neg
2,3,"First of all I hate those moronic rappers, who...","Primeiro de tudo eu odeio esses raps imbecis, ...",neg
3,4,Not even the Beatles could write songs everyon...,Nem mesmo os Beatles puderam escrever músicas ...,neg
4,5,Brass pictures movies is not a fitting word fo...,Filmes de fotos de latão não é uma palavra apr...,neg


In [17]:
classificacao = df_imdb["sentiment"].replace(["neg", "pos"],[0, 1])
df_imdb["Classificacao"] = classificacao

In [18]:
tokenizer = tokenize.WordPunctTokenizer()

In [19]:
pontuacao = list()
for acento in punctuation:
    pontuacao.append(acento)

In [20]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [21]:
stop_words = nltk.corpus.stopwords.words("portuguese")
pontuacao_stopwords2 = pontuacao + stop_words
stopwords_sem_acento2 =  [unidecode.unidecode(texto) for texto in pontuacao_stopwords2]

In [22]:
sem_acentos2 = [unidecode.unidecode(texto) for texto in df_imdb["text_pt"]]

In [23]:
df_imdb["Processamento"] = sem_acentos2

In [24]:
nltk.download('rslp')

[nltk_data] Downloading package rslp to /root/nltk_data...
[nltk_data]   Unzipping stemmers/rslp.zip.


True

In [25]:
stemmer = nltk.RSLPStemmer()

In [26]:
frase_processada = list()
for opiniao in df_imdb["text_pt"]:
    nova_frase = list()
    opiniao = opiniao.lower()
    palavras_texto = tokenizer.tokenize(opiniao)
    for palavra in palavras_texto:
        if palavra not in stopwords_sem_acento2:
            nova_frase.append(stemmer.stem(palavra))
    frase_processada.append(' '.join(nova_frase))

df_imdb["Resultado"] = frase_processada

In [27]:
texto_tokenizado = [text.split() for text in df_imdb["Resultado"]]

In [28]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texto_tokenizado)
vocab_size = len(tokenizer.word_index) + 1

In [29]:
sequences = tokenizer.texts_to_sequences(texto_tokenizado)

In [30]:
padded_sequences = pad_sequences(sequences, maxlen=100)

In [31]:
numpy_array = np.array(padded_sequences)

In [32]:
numpy_array

array([[   0,    0,    0, ...,    3, 1320,  239],
       [3574, 2368,   34, ...,   94,  450,  135],
       [ 332,    2, 1078, ...,   58,   10,  163],
       ...,
       [ 104,  243,   92, ...,  925,   69, 1855],
       [  35,  213,  300, ...,  151, 6193,   24],
       [  33,  123, 2892, ...,   13,  121,   46]], dtype=int32)

In [33]:
len(np.unique(numpy_array))

56361

In [34]:
x_train, x_test, y_train, y_test = train_test_split(numpy_array, df_imdb["Classificacao"], test_size = 0.2, random_state = 42)

In [35]:
model = Sequential()
model.add(Embedding(56361, 32, input_length=100))
model.add(Conv1D(32, 7, activation='relu'))
model.add(MaxPooling1D(5))
model.add(Conv1D(32, 7, activation='relu'))
model.add(GlobalMaxPooling1D())
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

In [36]:
model.fit(x_train, y_train, epochs=10, batch_size=32)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f01d5761720>