# <font color='blue'>Data Science Academy</font>
# <font color='blue'>Deep Learning Frameworks</font>

In [1]:
# Python language version
from platform import python_version
print('Python version used in this Jupyter Notebook:', python_version())

Python version used in this Jupyter Notebook: 3.7.6


## Building a Sentiment Classifier with PyTorch

Everything we express, verbally or in writing, carries a large amount of information. The topic we choose, our tone, and our choice of words all add information that can be interpreted and from which value can be extracted. In theory, we can use this information to understand and even predict human behavior.

But there is a problem: a single person can produce hundreds or thousands of words in one statement, and each sentence has its own complexity. If you want to scale this up and analyze hundreds, thousands, or millions of people or statements in a given region, the task quickly becomes unmanageable.


In [2]:
# To update a package, run the command below in a terminal or command prompt:
# pip install -U package_name

# To install an exact version of a package, run the command below in a terminal or command prompt:
# pip install package_name==desired_version

# After installing or updating a package, restart the Jupyter notebook.

# Install the watermark package.
# It records the versions of the other packages used in this Jupyter notebook.
!pip install -q -U watermark

In [3]:
# Imports
import torch
import pandas as pd
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import sklearn
from torch.utils.data import DataLoader, Dataset
from sklearn.feature_extraction.text import CountVectorizer
from tqdm.notebook import tqdm, tqdm_notebook

In [4]:
# Versions of the packages used in this Jupyter notebook
%reload_ext watermark
%watermark -a "Data Science Academy" --iversions

torch   1.4.0
pandas  1.0.3
sklearn 0.22.2
Data Science Academy


In [5]:
# Set the device (GPU if available, otherwise CPU)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cpu')

## Loading and Exploring the Data

We will use a publicly available dataset from https://www.imdb.com/interfaces/.

The sentiment labels were taken from: https://ai.stanford.edu/~amaas/data/sentiment/

In [6]:
# Load the dataset
nomes_colunas = ['Review', 'Sentimento']
dados_filmes = pd.read_csv('dados/imdb_reviews.csv', sep = '\t', names = nomes_colunas)

In [7]:
# Preview the first rows
dados_filmes.head()

|   | Review | Sentimento |
|---|---|---|
| 0 | A very, very, very slow-moving, aimless movie ... | 0 |
| 1 | Not sure who was more lost - the flat characte... | 0 |
| 2 | Attempting artiness with black & white and cle... | 0 |
| 3 | Very little music or anything to speak of. | 0 |
| 4 | The best scene in the movie was when Gerardo i... | 1 |


In [8]:
# Shape
dados_filmes.shape

(748, 2)

In [9]:
# Check the class balance of the sentiment labels
dados_filmes['Sentimento'].value_counts()

1    386
0    362
Name: Sentimento, dtype: int64

## Bag-of-Words Representation

![](imagens/bag2.png)

![](imagens/bag1.jpeg)

![](imagens/bag3.png)

![](imagens/bag4.png)

## Text Processing

We start by creating a "vectorizer".

It converts a collection of text documents into a matrix of token counts.

This implementation produces a sparse representation of the counts using scipy.sparse.csr_matrix.

https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
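To make the idea of a token-count matrix concrete, here is a minimal pure-Python sketch. It is not the scikit-learn implementation: it assumes a simplified tokenizer (lowercasing, words of two or more characters) and an alphabetically sorted vocabulary, and the two toy documents are invented for illustration.

```python
import re
from collections import Counter

def bag_of_words(docs):
    """Build a token -> column index mapping and a dense count matrix."""
    # Simplified tokenizer: lowercase, keep words with at least two characters
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in docs]
    # One column per distinct token, in alphabetical order
    vocab = sorted({tok for toks in tokenized for tok in toks})
    token2idx = {tok: i for i, tok in enumerate(vocab)}
    # One row per document, holding the count of each vocabulary token
    matrix = []
    for toks in tokenized:
        counts = Counter(toks)
        matrix.append([counts.get(tok, 0) for tok in vocab])
    return token2idx, matrix

# Hypothetical toy documents
docs = ["good movie, good plot", "slow movie"]
token2idx, matrix = bag_of_words(docs)
print(token2idx)
print(matrix)
```

Each row describes one document and each column one vocabulary token. Word order is discarded, which is why the representation is called a "bag" of words.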

In [10]:
# Create a vectorizer
vectorizer = CountVectorizer(stop_words = 'english', max_df = 0.99, min_df = 0.005)
vectorizer

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=0.99, max_features=None, min_df=0.005,
                ngram_range=(1, 1), preprocessor=None, stop_words='english',
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)
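When `min_df` and `max_df` are given as floats, they are interpreted as fractions of the number of documents: tokens that appear in too few or too many documents are dropped from the vocabulary. The sketch below illustrates that filtering in pure Python; the helper name `filter_by_document_frequency` and the toy token lists are hypothetical, and the real `CountVectorizer` additionally removes English stop words.

```python
def filter_by_document_frequency(doc_tokens, min_df, max_df):
    """Keep tokens whose document frequency lies within [min_df, max_df] (as fractions)."""
    n_docs = len(doc_tokens)
    # Document frequency: number of documents that contain each token
    df = {}
    for toks in doc_tokens:
        for tok in set(toks):
            df[tok] = df.get(tok, 0) + 1
    low, high = min_df * n_docs, max_df * n_docs
    return sorted(tok for tok, count in df.items() if low <= count <= high)

# Hypothetical toy corpus, already tokenized
doc_tokens = [['movie', 'good'], ['movie', 'bad'], ['movie', 'slow'], ['movie', 'good']]
kept = filter_by_document_frequency(doc_tokens, min_df=0.5, max_df=0.99)
print(kept)

# Thresholds implied by this notebook's settings on 748 reviews
min_docs = 0.005 * 748   # about 3.74, so a token needs at least 4 documents
max_docs = 0.99 * 748    # about 740.5
```

In the toy corpus, `movie` appears in every document and is dropped by `max_df`, while `bad` and `slow` appear only once and are dropped by `min_df`.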

In [11]:
# Apply the vectorizer to the reviews to build the document-term count matrix
sequences = vectorizer.fit_transform(dados_filmes.Review.tolist())
sequences

<748x320 sparse matrix of type '<class 'numpy.int64'>'
	with 2931 stored elements in Compressed Sparse Row format>

In [12]:
# View as a DataFrame (each cell holds one sparse row; use sequences.toarray() to see dense counts)
print(pd.DataFrame(sequences).head(3))

                                                   0
0    (0, 248)\t1\n  (0, 185)\t1\n  (0, 183)\t1\n ...
1    (0, 162)\t1\n  (0, 41)\t1\n  (0, 14)\t1\n  (...
2    (0, 183)\t1\n  (0, 28)\t1\n  (0, 305)\t1\n  ...


In [13]:
# Collect the labels (sentiments)
labels = dados_filmes.Sentimento.tolist()

In [14]:
# Preview a sample
labels[:5]

[0, 0, 0, 0, 1]

In [15]:
# Get the vocabulary: a mapping from each token to its column index
token2idx = vectorizer.vocabulary_

In [16]:
# Type
type(token2idx)

dict

In [17]:
# Size
print(len(token2idx))

320


In [18]:
# Inspect
token2idx

{'slow': 248,
 'moving': 185,
 'movie': 183,
 'young': 319,
 'man': 171,
 'lost': 162,
 'characters': 41,
 'audience': 14,
 'half': 119,
 'black': 28,
 'white': 305,
 'clever': 48,
 'camera': 35,
 'disappointed': 73,
 'ridiculous': 227,
 'acting': 3,
 'poor': 211,
 'plot': 209,
 'lines': 156,
 'non': 190,
 'little': 157,
 'music': 186,
 'best': 24,
 'scene': 234,
 'trying': 285,
 'rest': 226,
 'lacks': 147,
 'art': 12,
 'works': 310,
 'guess': 118,
 'wasted': 300,
 'saw': 232,
 'today': 278,
 'thought': 275,
 'good': 115,
 'kids': 144,
 'bit': 27,
 'predictable': 213,
 'loved': 165,
 'casting': 38,
 'adorable': 8,
 'lot': 163,
 'look': 160,
 'songs': 251,
 'hilarious': 122,
 'cool': 55,
 'right': 228,
 'face': 92,
 'low': 167,
 'budget': 33,
 'long': 159,
 'consider': 53,
 'tale': 266,
 'single': 247,
 'film': 101,
 'll': 158,
 'cinematography': 46,
 'production': 218,
 'editing': 78,
 'directing': 70,
 'making': 170,
 'perfect': 200,
 'true': 283,
 'history': 123,
 'cinema': 45,
 'thi

In [19]:
# Which column of the count matrix corresponds to the word "movie"?
# (The vocabulary stores column indices, not frequencies.)
token2idx['movie']

183

In [20]:
# And the word "good"?
token2idx['good']

115
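Note that `token2idx` maps a token to its column index in the count matrix, not to the number of times it occurs. A frequency can be obtained by summing that column over all documents. The sketch below shows this with a small dense matrix; the helper `token_frequency` and the toy data are hypothetical. In this notebook, something like `sequences[:, token2idx['movie']].sum()` should give the corresponding total.

```python
def token_frequency(matrix, token2idx, token):
    """Total number of occurrences of `token` across all documents."""
    col = token2idx[token]
    return sum(row[col] for row in matrix)

# Hypothetical vocabulary and count matrix (rows = documents, columns = tokens)
token2idx = {'good': 0, 'movie': 1, 'slow': 2}
matrix = [
    [2, 1, 0],
    [0, 1, 1],
    [1, 0, 0],
]

movie_index = token2idx['movie']                            # column index
movie_count = token_frequency(matrix, token2idx, 'movie')   # occurrences
good_count = token_frequency(matrix, token2idx, 'good')
print(movie_index, movie_count, good_count)
```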

In [21]:
# For convenience, invert the dictionary so that column indices map back to tokens
idx2token = {idx: token for token, idx in token2idx.items()}

In [22]:
# Type
type(idx2token)

dict

In [23]:
# Size
print(len(idx2token))

320


In [24]:
# Inspect
idx2token

{248: 'slow',
 185: 'moving',
 183: 'movie',
 319: 'young',
 171: 'man',
 162: 'lost',
 41: 'characters',
 14: 'audience',
 119: 'half',
 28: 'black',
 305: 'white',
 48: 'clever',
 35: 'camera',
 73: 'disappointed',
 227: 'ridiculous',
 3: 'acting',
 211: 'poor',
 209: 'plot',
 156: 'lines',
 190: 'non',
 157: 'little',
 186: 'music',
 24: 'best',
 234: 'scene',
 285: 'trying',
 226: 'rest',
 147: 'lacks',
 12: 'art',
 310: 'works',
 118: 'guess',
 300: 'wasted',
 232: 'saw',
 278: 'today',
 275: 'thought',
 115: 'good',
 144: 'kids',
 27: 'bit',
 213: 'predictable',
 165: 'loved',
 38: 'casting',
 8: 'adorable',
 163: 'lot',
 160: 'look',
 251: 'songs',
 122: 'hilarious',
 55: 'cool',
 228: 'right',
 92: 'face',
 167: 'low',
 33: 'budget',
 159: 'long',
 53: 'consider',
 266: 'tale',
 247: 'single',
 101: 'film',
 158: 'll',
 46: 'cinematography',
 218: 'production',
 78: 'editing',
 70: 'directing',
 170: 'making',
 200: 'perfect',
 283: 'true',
 123: 'history',
 45: 'cinema',
 274:
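One use of the inverted mapping is to turn a row of counts back into readable tokens. A minimal sketch, assuming a small hypothetical vocabulary and count row (the helper `decode_counts` is not part of the notebook):

```python
def decode_counts(row, idx2token):
    """Return (token, count) pairs for the non-zero columns of a count row."""
    return [(idx2token[i], count) for i, count in enumerate(row) if count > 0]

# Hypothetical vocabulary and its inverse
token2idx = {'good': 0, 'movie': 1, 'slow': 2}
idx2token = {idx: token for token, idx in token2idx.items()}

row = [2, 1, 0]   # counts for one document
decoded = decode_counts(row, idx2token)
print(decoded)
```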

Let's wrap everything we have done so far in a class.

In [25]:
# Define a Dataset class that vectorizes the reviews and serves (counts, label) pairs
class Sequences(Dataset):
    def __init__(self):
        self.vectorizer = CountVectorizer(stop_words = 'english', max_df = 0.99, min_df = 0.005)
        self.sequences = self.vectorizer.fit_transform(dados_filmes.Review.tolist())
        self.labels = dados_filmes.Sentimento.tolist()
        self.token2idx = self.vectorizer.vocabulary_
        self.idx2token = {idx: token for token, idx in self.token2idx.items()}
        
    def __getitem__(self, i):
        return self.sequences[i, :].toarray(), self.labels[i]
    
    def __len__(self):
        return self.sequences.shape[0]

In [26]:
# Build the dataset: count matrix, labels, and token mappings
dados_frases = Sequences()

In [27]:
# Check the shape of one sample
print(dados_frases[5][0].shape)

(1, 320)


In [28]:
# Wrap the data in a PyTorch DataLoader for training
# Note: batch_size (4096) exceeds the 748 samples, so each epoch consists of a single batch
train_loader = DataLoader(dados_frases, batch_size = 4096)
train_loader

<torch.utils.data.dataloader.DataLoader at 0x1a2fccbe90>
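The batch size determines how many parameter updates happen per epoch. The arithmetic below is a small sketch of that relationship, assuming the loader keeps the final partial batch (the `DataLoader` default); the alternative batch sizes 64 and 32 are illustrative, not values used in this notebook.

```python
import math

def batches_per_epoch(n_samples, batch_size):
    """Number of batches when the last, smaller batch is kept."""
    return math.ceil(n_samples / batch_size)

n_samples = 748   # rows in the dataset, per dados_filmes.shape
counts = {bs: batches_per_epoch(n_samples, bs) for bs in (4096, 64, 32)}
print(counts)
```

With `batch_size = 4096` there is exactly one batch per epoch, so 12 epochs amount to only 12 optimizer steps. A smaller batch size would give more updates per epoch.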

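## Model Definition and Construction

Layer 1: $$x_1 = W_1 X + b_1$$
Activation function: $$h_1 = \textrm{ReLU}(x_1)$$
Layer 2: $$x_2 = W_2 h_1 + b_2$$
Output: $$p = \sigma(x_2)$$
Loss: $$L = -\left(y \log(p) + (1 - y) \log(1 - p)\right)$$
Gradient: 
$$\frac{\partial }{\partial W_1}L(W_1, b_1, W_2, b_2) = \frac{\partial L}{\partial p}\frac{\partial p}{\partial x_2}\frac{\partial x_2}{\partial h_1}\frac{\partial h_1}{\partial x_1}\frac{\partial x_1}{\partial W_1}$$

Parameter update:
$$W_1 = W_1 - \alpha \frac{\partial L}{\partial W_1}$$

These equations describe a network with a single hidden layer. The classifier implemented in this notebook adds a second hidden layer (`fc2`) that follows the same pattern, and during training the sigmoid is applied inside the loss function rather than in the model.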
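The chain rule and the update rule can be checked by hand on the smallest possible case. The sketch below assumes a single logistic unit with scalar weight `w` and bias `b` (no hidden layer), for which the product of the first two factors of the chain rule simplifies to `p - y`. The numbers are arbitrary illustration values.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce(y, p):
    """Binary cross-entropy for one example."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def loss(w, b, x, y):
    return bce(y, sigmoid(w * x + b))

def grad_w(w, b, x, y):
    # dL/dp * dp/dz simplifies to (p - y), and dz/dw = x
    p = sigmoid(w * x + b)
    return (p - y) * x

# Arbitrary illustration values: one input, one positive label, learning rate alpha
w, b, x, y, alpha = 0.5, 0.0, 2.0, 1.0, 0.1

loss_before = loss(w, b, x, y)
w_new = w - alpha * grad_w(w, b, x, y)   # gradient descent update
loss_after = loss(w_new, b, x, y)
print(loss_before, loss_after)
```

After one update the loss on this example decreases, which is the behavior the parameter update rule is designed to produce. In the full model, autograd computes the same kind of chain-rule product through every layer.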

In [29]:
# Classifier
class BagOfWordsClassifier(nn.Module):
    
    # Constructor: defines the layers
    def __init__(self, vocab_size, hidden1, hidden2):
        super(BagOfWordsClassifier, self).__init__()
        self.fc1 = nn.Linear(vocab_size, hidden1)
        self.fc2 = nn.Linear(hidden1, hidden2)
        self.fc3 = nn.Linear(hidden2, 1)
    
    # Forward pass: returns raw logits (no sigmoid applied here)
    def forward(self, inputs):
        x = F.relu(self.fc1(inputs.squeeze(1).float()))
        x = F.relu(self.fc2(x))
        return self.fc3(x)

In [30]:
# Create the model
modelo = BagOfWordsClassifier(len(dados_frases.token2idx), 128, 64)

In [31]:
# Inspect
modelo

BagOfWordsClassifier(
  (fc1): Linear(in_features=320, out_features=128, bias=True)
  (fc2): Linear(in_features=128, out_features=64, bias=True)
  (fc3): Linear(in_features=64, out_features=1, bias=True)
)

For the loss function we will use binary cross-entropy.

In [32]:
# Define the loss function (it applies the sigmoid to the logits internally)
criterion = nn.BCEWithLogitsLoss()
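`nn.BCEWithLogitsLoss` combines the sigmoid and the binary cross-entropy in one step, which is why the model returns raw logits. The pure-Python sketch below compares a naive two-step computation with one common numerically stable formulation; it illustrates the idea and is not PyTorch's actual code.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_naive(logit, y):
    """Sigmoid followed by binary cross-entropy; breaks down for large |logit|."""
    p = sigmoid(logit)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def bce_with_logits(logit, y):
    """Algebraically the same loss, rearranged to avoid log(0) and overflow."""
    return max(logit, 0.0) - logit * y + math.log1p(math.exp(-abs(logit)))

print(round(bce_with_logits(0.0, 1), 3))   # 0.693, i.e. ln(2)
print(bce_naive(2.0, 1), bce_with_logits(2.0, 1))
print(bce_with_logits(1000.0, 0))          # stays finite
```

A logit of 0 corresponds to a probability of 0.5 and a loss of ln(2), roughly 0.693. That is consistent with the loss reported for the first training epoch, when the untrained model has learned nothing yet.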

For the optimizer we will use the Adam algorithm.

In [33]:
# Adam adapts the effective step size of each parameter during training
optimizer = optim.Adam([p for p in modelo.parameters() if p.requires_grad], lr = 0.001)

Now we train the model.

In [34]:
# Training

# Put the model in training mode
modelo.train()

# List to store the average loss of each epoch
train_losses = []

# Number of epochs
epochs = 12

# Training loop
for epoch in range(epochs): 
    
    # Progress bar
    progress_bar = tqdm_notebook(train_loader, leave = False)
    
    # Per-epoch bookkeeping
    losses = []
    total = 0
    
    # Loop over the batches
    for inputs, target in progress_bar:
        
        # Reset the gradients accumulated in the previous step
        modelo.zero_grad()

        # Output (model prediction, as logits)
        output = modelo(inputs)
        
        # Compute the loss
        loss = criterion(output.squeeze(), target.float())
        
        # Backpropagation: compute the gradients
        loss.backward()
        
        # Clip the gradient norm to at most 3 before the parameter update
        nn.utils.clip_grad_norm_(modelo.parameters(), 3)

        # Optimizer step: update the parameters
        optimizer.step()
        
        # Update the progress bar
        progress_bar.set_description(f'\nModel loss: {loss.item():.3f}')
        
        # Record the batch loss and count the batch
        losses.append(loss.item())
        total += 1
    
    # Average loss of the epoch
    epoch_loss = sum(losses) / total
    
    # Store the training loss
    train_losses.append(epoch_loss)
        
    tqdm.write(f'Epoch #{epoch + 1}\tTraining loss: {epoch_loss:.3f}')

Epoch #1	Training loss: 0.693
Epoch #2	Training loss: 0.691
Epoch #3	Training loss: 0.690
Epoch #4	Training loss: 0.689
Epoch #5	Training loss: 0.687
Epoch #6	Training loss: 0.685
Epoch #7	Training loss: 0.684
Epoch #8	Training loss: 0.681
Epoch #9	Training loss: 0.679
Epoch #10	Training loss: 0.676
Epoch #11	Training loss: 0.673
Epoch #12	Training loss: 0.670

## Sentiment Predictions

In [35]:
# Function to predict the sentiment of a text
def predict_sentiment(text):
    
    # Put the model in evaluation mode
    modelo.eval()
    
    # Run inference without tracking gradients
    with torch.no_grad():
        
        # Convert the input text into a count vector using the fitted vectorizer
        test_vector = torch.LongTensor(dados_frases.vectorizer.transform([text]).toarray())

        # Model output (logit)
        output = modelo(test_vector)
        
        # Convert the logit into a probability
        prediction = torch.sigmoid(output).item()

        # Compare the probability against the 0.5 threshold
        if prediction >= 0.5:
            print(f'{prediction:0.3}: Sentimento Positivo')
        else:
            print(f'{prediction:0.3}: Sentimento Negativo')

First we apply the sigmoid to the model output, which maps it to a value between 0 and 1.

If the value is greater than or equal to 0.5 we classify the sentiment as positive (`Sentimento Positivo`); if it is below 0.5 we classify it as negative (`Sentimento Negativo`).
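The decision rule can be sketched without the model. The example below assumes a hypothetical helper `classify` that takes a raw logit; the logit values and the English labels are illustrative only.

```python
import math

def classify(logit, threshold=0.5):
    """Map a logit to a probability and a label using a fixed threshold."""
    prob = 1.0 / (1.0 + math.exp(-logit))
    label = 'positive' if prob >= threshold else 'negative'
    return prob, label

# Hypothetical logits: below, at, and above the decision boundary
for logit in (-0.5, 0.0, 0.8):
    prob, label = classify(logit)
    print(f'{prob:0.3}: {label}')
```

Because the sigmoid of 0 is exactly 0.5, thresholding the probability at 0.5 is equivalent to checking whether the logit is non-negative.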

Let's test the sentiment classifier.

In [36]:
# Movie review text
test_text = """
This poor excuse for a movie is terrible. It has been 'so good it's bad' for a
while, and the high ratings are a good form of sarcasm, I have to admit. But
now it has to stop. Technically inept, spoon-feeding mundane messages with the
artistic weight of an eighties' commercial, hypocritical to say the least, it
deserves to fall into oblivion. Mr. Derek, I hope you realize you are like that
weird friend that everybody know is lame, but out of kindness and Christian
duty is treated like he's cool or something. That works if you are a good
decent human being, not if you are a horrible arrogant bully like you are. Yes,
Mr. 'Daddy' Derek will end on the history books of the internet for being a
delusional sour old man who thinks to be a good example for kids, but actually
has a poster of Kim Jong-Un in his closet. Destroy this movie if you all have a
conscience, as I hope IHE and all other youtube channel force-closed by Derek
out of SPITE would destroy him in the courts.This poor excuse for a movie is
terrible. It has been 'so good it's bad' for a while, and the high ratings are
a good form of sarcasm, I have to admit. But now it has to stop. Technically
inept, spoon-feeding mundane messages with the artistic weight of an eighties'
commercial, hypocritical to say the least, it deserves to fall into oblivion.
Mr. Derek, I hope you realize you are like that weird friend that everybody
know is lame, but out of kindness and Christian duty is treated like he's cool
or something. That works if you are a good decent human being, not if you are a
horrible arrogant bully like you are. Yes, Mr. 'Daddy' Derek will end on the
history books of the internet for being a delusional sour old man who thinks to
be a good example for kids, but actually has a poster of Kim Jong-Un in his
closet. Destroy this movie if you all have a conscience, as I hope IHE and all
other youtube channel force-closed by Derek out of SPITE would destroy him in
the courts.
"""

# Prediction
predict_sentiment(test_text)

0.554: Sentimento Positivo


In [37]:
# Movie review text
test_text = """
Cool Cat Saves The Kids is a symbolic masterpiece directed by Derek Savage that
is not only satirical in the way it makes fun of the media and politics, but in
the way in questions as how we humans live life and how society tells us to
live life.

Before I get into those details, I wanna talk about the special effects in this
film. They are ASTONISHING, and it shocks me that Cool Cat Saves The Kids got
snubbed by the Oscars for Best Special Effects. This film makes 2001 look like
garbage, and the directing in this film makes Stanley Kubrick look like the
worst director ever. You know what other film did that? Birdemic: Shock and
Terror. Both of these films are masterpieces, but if I had to choose my
favorite out of the 2, I would have to go with Cool Cat Saves The Kids. It is
now my 10th favorite film of all time.

Now, lets get into the symbolism: So you might be asking yourself, Why is Cool
Cat Orange? Well, I can easily explain. Orange is a color. Orange is also a
fruit, and its a very good fruit. You know what else is good? Good behavior.
What behavior does Cool Cat have? He has good behavior. This cannot be a
coincidence, since cool cat has good behavior in the film.

Now, why is Butch The Bully fat? Well, fat means your wide. You wanna know who
was wide? Hitler. Nuff said this cannot be a coincidence.

Why does Erik Estrada suspect Butch The Bully to be a bully? Well look at it
this way. What color of a shirt was Butchy wearing when he walks into the area?
I don't know, its looks like dark purple/dark blue. Why rhymes with dark? Mark.
Mark is that guy from the Room. The Room is the best movie of all time. What is
the opposite of best? Worst. This is how Erik knew Butch was a bully.

and finally, how come Vivica A. Fox isn't having a successful career after
making Kill Bill.

I actually can't answer that question.

Well thanks for reading my review.
"""

# Prediction
predict_sentiment(test_text)

0.703: Sentimento Positivo


In [38]:
# Movie review text
test_text = """
What the heck is this ? There is not one redeeming quality about this terrible
and very poorly done "movie". I can't even say that it's a "so bad it's good
movie".It is undeniably pointless to address all the things wrong here but
unfortunately even the "life lessons" about bullies and stuff like this are so
wrong and terrible that no kid should hear them.The costume is also horrible
and the acting...just unbelievable.No effort whatsoever was put into this thing
and it clearly shows,I have no idea what were they thinking or who was it even
meant for. I feel violated after watching this trash and I deeply recommend you
stay as far away as possible.This is certainly one of the worst pieces of c***
I have ever seen.
"""

# Prediction
predict_sentiment(test_text)

0.447: Sentimento Negativo


In [39]:
# Movie review text
test_text = """
Don't let any bullies out there try and shape your judgment on this gem of a
title.

Some people really don't have anything better to do, except trash a great movie
with annoying 1-star votes and spread lies on the Internet about how "dumb"
Cool Cat is.

I wouldn't be surprised to learn if much of the unwarranted negativity hurled
at this movie is coming from people who haven't even watched this movie for
themselves in the first place. Those people are no worse than the Butch the
Bully, the film's repulsive antagonist.

As it just so happens, one of the main points of "Cool Cat Saves the Kids" is
in addressing the attitudes of mean naysayers who try to demean others who
strive to bring good attitudes and fun vibes into people's lives. The message
to be learned here is that if one is friendly and good to others, the world is
friendly and good to one in return, and that is cool. Conversely, if one is
miserable and leaving 1-star votes on IMDb, one is alone and doesn't have any
friends at all. Ain't that the truth?

The world has uncovered a great, new, young filmmaking talent in "Cool Cat"
creator Derek Savage, and I sure hope that this is only the first of many
amazing films and stories that the world has yet to appreciate.

If you are a cool person who likes to have lots of fun, I guarantee that this
is a movie with charm that will uplift your spirits and reaffirm your positive
attitudes towards life.
"""

# Prediction
predict_sentiment(test_text)

0.621: Sentimento Positivo


# End