# Análise de sentimento usando word embeddings - IMDB

Anteriormente vimos uma primeira solução de análise de sentimento utilizando *bag of words*.
Agora iremos ilustrar o uso de *word embeddings* como vetor de atributos latentes de cada palavra.

Duas soluções são propostas neste exercícios:

1. Utilizando rede neural com camadas densas
2. Utilizando camadas convolucionais 1D

Diferentemente da solução apresentada com *bag of words*, nestas duas soluções, é necessário que o
número de palavras seja o mesmo para cada amostra. Para isso, limita-se o número de palavras e caso
o número de palavras for menor, completa-se com um código especial e palavras além do limite são
descartadas.

## Importação dos pacotes

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import os, sys
import json
import numpy as np
import pandas as pd
import numpy.random as nr

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.optim.lr_scheduler import MultiStepLR, StepLR
from torch.utils.data import DataLoader, TensorDataset
from torch.autograd import Variable

from torchvision import datasets, transforms, models

import lib.pytorch_trainer as ptt

use_gpu = torch.cuda.is_available()
print('GPU available:', use_gpu)

GPU available: True


## Dataset IMDB

### Lendo do disco

O dataset é composto de 25 mil amostras de treinamento e 25 mil amostras de teste.
Cada amostra possui um texto de tamanho que varia entre 11 e 2494 palavras. 
Cada amostra tem um rótulo igual a 1 para denominar sentimento positivo e 0 para sentimento negativo.

In [2]:
word_index = json.load(open('/data/datasets/IMDB/imdb_word_index.json'))
data = np.load('/data/datasets/IMDB/imdb.npz')
x_test, x_train, y_train, y_test = data['x_test'], data['x_train'], data['y_train'], data['y_test']

n_words = len(word_index)
n_train = x_train.shape[0]
n_test  = x_test.shape[0]

word_list = [None for i in range(n_words+1)]
for k, v in word_index.items():
    word_list[v] = k

n_words, n_train, n_test

(88584, 25000, 25000)

In [3]:
def print_stats(x_train, x_test, word_list=None):
    print('Train word index limits:', min([min(s) for s in x_train]), max([max(s) for s in x_train]))
    print('Test word index limits:', min([min(s) for s in x_test]), max([max(s) for s in x_test]))
    print('\nTrain sequence length limits:', min([len(x) for x in x_train]), max([len(x) for x in x_train]))
    print('Test sequence length limits:', min([len(x) for x in x_test]), max([len(x) for x in x_test]))
    if word_list:
        print('\nMost frequent words:', word_list[1:11])
    
print_stats(x_train, x_test, word_list)

Train word index limits: 1 88584
Test word index limits: 1 88581

Train sequence length limits: 10 2493
Test sequence length limits: 6 2314

Most frequent words: ['the', 'and', 'a', 'of', 'to', 'is', 'br', 'in', 'it', 'i']


### Limitando o vocabulário

Retiramos das sequências as palavras com índice maior que o valor especificado em `voc_size`.

In [4]:
# voc_size = 5000
voc_size = 10000

xtra = [[w for w in x if (w < voc_size)] for x in x_train]
xval = [[w for w in x if (w < voc_size)] for x in x_test]
print_stats(xtra, xval)

Train word index limits: 1 9999
Test word index limits: 1 9999

Train sequence length limits: 9 2194
Test sequence length limits: 6 2198


### Obtendo sequências de mesmo comprimento

Fazemos com que todas as sequências tenham o mesmo comprimento, especificado em `seq_len`. As sequências mais longas que `seq_len` são truncadas e as menores, completadas com zeros.

In [5]:
def pad_sequences(sequences, seq_len, post_pad=True, fill_value=0):
    new_seq = []
    for seq in sequences:
        n = len(seq)
        if n > seq_len:
            if post_pad:
                new_seq.append(seq[-seq_len:])
            else:
                new_seq.append(seq[:seq_len])
        else:
            zseq = [fill_value for i in range(seq_len)]
            if post_pad:
                zseq[-n:] = seq
            else:
                zseq[:n] = seq
            new_seq.append(zseq)
    return new_seq
    

In [6]:
seq_len = 500
xtra = pad_sequences(xtra, seq_len, post_pad=True)
xval = pad_sequences(xval, seq_len, post_pad=True)
print_stats(xtra, xval)

Train word index limits: 0 9999
Test word index limits: 0 9999

Train sequence length limits: 500 500
Test sequence length limits: 500 500


### Convertendo para tensores

In [7]:
Xtrain = torch.from_numpy(np.array(xtra, np.int))
Xvalid = torch.from_numpy(np.array(xval, np.int))
ytrain = torch.from_numpy(np.array(y_train, np.int))
yvalid = torch.from_numpy(np.array(y_test, np.int))

Xtrain.size(), Xtrain.max(), Xvalid.size(), Xvalid.max()

(torch.Size([25000, 500]), 9999, torch.Size([25000, 500]), 9999)

## Rede Densa


In [8]:
class MySimpleNet(nn.Module):
    def __init__(self, seq_len=seq_len, voc_size=voc_size, embed_dim=None):
        super().__init__()
        self.flat_size = seq_len * embedding_dim        
        self.emb = nn.Embedding(voc_size, embed_dim)
        nn.init.xavier_uniform(self.emb.weight)
        self.fc1 = nn.Linear(self.flat_size, 128)
        self.fc2 = nn.Linear(128, 2)
        
    def forward(self, x):
        x = self.emb(x)
        x = x.view(-1, self.flat_size)
        x = F.relu(self.fc1(x))
        x = F.dropout(x, 0.5)
        x = self.fc2(x)
        return x

### Treinamento

In [9]:
trainIt = False
resetIt = False

embedding_dim = 50
batch_size = 100
n_epochs = 10

# Callbacks
# ---------
state_fn = '../../models/sentimento_1'
accuracy_cb = ptt.AccuracyMetric()
chkpt_cb = ptt.ModelCheckpoint(state_fn, reset=resetIt, verbose=1)
print_cb = ptt.PrintCallback()
plot_cb = ptt.PlotCallback()

# Model, optimizer and learning rate scheduler
# --------------------------------------------
model = MySimpleNet(seq_len, voc_size, embedding_dim)
optimizer = torch.optim.Adam(model.parameters(), lr=1.e-4, weight_decay=0.0005)
scheduler = StepLR(optimizer, step_size=5, gamma=0.75)

# Network trainer
# ---------------
training_parameters = {
    'model':         model, 
    'criterion':     nn.CrossEntropyLoss(),
    'optimizer':     optimizer, 
    'lr_scheduler':  scheduler, 
    'callbacks':     [accuracy_cb, chkpt_cb, print_cb],
}
trainer = ptt.DeepNetTrainer(**training_parameters)

In [10]:
if trainIt:
    trainer.fit(n_epochs, Xtrain, ytrain, valid_data=(Xvalid, yvalid), batch_size=batch_size)
else:
    print('Training disable, loading trained model')
    trainer.load_state('/data/models/sentimento_1')

Training disable, loading trained model


#### Treinamento em CPU (AWS c4x.2large, _compute optimized 8 cores_ ):
    Start training for 10 epochs
      1:  16.4s   T: 0.67021 0.58424   V: 0.54485 0.77676 best
      2:  14.3s   T: 0.37235 0.85344   V: 0.31478 0.87156 best
      3:  17.2s   T: 0.25475 0.90344   V: 0.28751 0.87880 best
      4:  14.2s   T: 0.21120 0.92364   V: 0.27403 0.88536 best
      5:  16.4s   T: 0.17996 0.93792   V: 0.27677 0.88396 
      6:  29.8s   T: 0.16073 0.94736   V: 0.27144 0.88504 best
      7: 125.4s   T: 0.14337 0.95564   V: 0.27229 0.88516 
      8: 167.4s   T: 0.12665 0.96464   V: 0.28843 0.87948 
      9: 181.3s   T: 0.11348 0.97008   V: 0.28020 0.88552 
     10: 185.8s   T: 0.09831 0.97824   V: 0.28082 0.88344 
    Best model was saved at epoch 6 with loss 0.27144: ../../models/sentimento_1
    Stop training at epoch: 10/10

#### Treinamento em GPU (GTX1080, 8GB):
    Start training for 10 epochs
      1:   5.6s   T: 0.65424 0.62108   V: 0.50405 0.79320 best
      2:   5.5s   T: 0.35170 0.86240   V: 0.30834 0.87400 best
      3:   5.5s   T: 0.24396 0.90708   V: 0.28117 0.88336 best
      4:   5.5s   T: 0.19977 0.92880   V: 0.27382 0.88544 best
      5:   5.5s   T: 0.16876 0.94296   V: 0.27290 0.88604 best
      6:   5.5s   T: 0.14886 0.95212   V: 0.27487 0.88500 
      7:   5.5s   T: 0.13193 0.96012   V: 0.27411 0.88532 
      8:   5.5s   T: 0.11622 0.96904   V: 0.27760 0.88424 
      9:   5.6s   T: 0.10027 0.97724   V: 0.28121 0.88376 
     10:   5.6s   T: 0.08648 0.98296   V: 0.29508 0.88008 
    Best model was saved at epoch 5 with loss 0.27290: ../../models/sentimento_1
    Stop training at epoch: 10/10


### Avaliação

In [11]:
rmetrics = trainer.evaluate(Xtrain, ytrain, metrics=[accuracy_cb],batch_size=2000)
print('Model training set accuracy after training: {:.5f}'.format(rmetrics['acc']))
print()
rmetrics = trainer.evaluate(Xvalid, yvalid, metrics=[accuracy_cb], batch_size=2000)
print('Model validation set accuracy after training: {:.5f}'.format(rmetrics['acc']))

evaluate: 12/12 ok
Model training set accuracy after training: 0.95252

evaluate: 12/12 ok
Model validation set accuracy after training: 0.88604


## Rede convolucional

In [12]:
class MyNet(nn.Module):
    
    def __init__(self, seq_len=seq_len, voc_size=voc_size, embed_dim=embedding_dim, 
                 n_conv_filters=128, conv_kernel_size=5):
        super().__init__()
        
        k = conv_kernel_size - 1
        n = (((seq_len - k) // 2 - k) // 2 - k) // 2
        self.flat_size = n * n_conv_filters
        
        self.embedding = nn.Embedding(voc_size, embed_dim)
        nn.init.xavier_uniform(self.embedding.weight)

        self.conv_net = nn.Sequential(
            nn.Conv1d(embed_dim, n_conv_filters, conv_kernel_size),
            nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Dropout(0.5),
            
            nn.Conv1d(n_conv_filters, n_conv_filters, conv_kernel_size),
            nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Dropout(0.5),
            
            nn.Conv1d(n_conv_filters, n_conv_filters, conv_kernel_size),
            nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Dropout(0.5),
        )
        
        self.fc_net = nn.Sequential(
            nn.Linear(self.flat_size, 128),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(128, 2),
        )

    def forward(self, x):
        x = self.embedding(x)
        x = x.transpose(1, 2)
        x = self.conv_net(x)
        x = x.view(-1, self.flat_size)
        x = self.fc_net(x)
        return x

### Treinamento

In [13]:
trainIt = False
resetIt = True

# Callbacks
# ---------
state_fn = '../../models/sentimento_3'
accuracy_cb = ptt.AccuracyMetric()
chkpt_cb = ptt.ModelCheckpoint(state_fn, reset=resetIt, verbose=1)
print_cb = ptt.PrintCallback()
plot_cb = ptt.PlotCallback()

# Model, optimizer and learning rate scheduler
# --------------------------------------------
model = MyNet(seq_len, voc_size, embedding_dim)
optimizer = torch.optim.Adam(model.parameters(), lr=1.e-3, weight_decay=0)
scheduler = StepLR(optimizer, step_size=5, gamma=0.75)

# Network trainer
# ---------------
training_parameters = {
    'model':         model, 
    'criterion':     nn.CrossEntropyLoss(),
    'optimizer':     optimizer, 
    'lr_scheduler':  scheduler, 
    'callbacks':     [accuracy_cb, chkpt_cb, print_cb],
}
trainer = ptt.DeepNetTrainer(**training_parameters)

In [14]:
if trainIt:
    trainer.fit(4, Xtrain, ytrain, valid_data=(Xvalid, yvalid), batch_size=batch_size)
else:
    print('Training disable, loading trained model')
    trainer.load_state('/data/models/sentimento_3')

Training disable, loading trained model


#### Treinamento em CPU (AWS c4x.2large, _compute optimized 8 cores_ ):
    Start training for 10 epochs
      1: 177.9s   T: 0.54422 0.66508   V: 0.31881 0.86320 best
      2: 176.4s   T: 0.24304 0.90532   V: 0.30452 0.86872 best
      3: 180.2s   T: 0.17245 0.93924   V: 0.33442 0.87108 
      4: 176.7s   T: 0.11281 0.95988   V: 0.37404 0.86836 
      5: 188.7s   T: 0.06431 0.97768   V: 0.45143 0.86640 
      6: 179.7s   T: 0.04120 0.98564   V: 0.58594 0.85720 
      7: 175.4s   T: 0.03657 0.98700   V: 0.73639 0.85680 
      8: 189.7s   T: 0.02824 0.99052   V: 0.73352 0.85636 
      9: 192.3s   T: 0.01951 0.99304   V: 0.77173 0.85816 
     10: 188.2s   T: 0.01272 0.99564   V: 1.04543 0.86184 
    Best model was saved at epoch 2 with loss 0.30452: ../../models/sentimento_3
    Stop training at epoch: 10/10

#### Treinamento em GPU (GTX1080, 8GB):
    Start training for 4 epochs
      1:   9.6s   T: 0.61965 0.59356   V: 0.36818 0.84188 best
      2:   9.5s   T: 0.27770 0.88764   V: 0.27181 0.88704 best
      3:   9.5s   T: 0.17676 0.93412   V: 0.30976 0.86932 
      4:   9.5s   T: 0.11982 0.95664   V: 0.35523 0.87364 
    Best model was saved at epoch 2 with loss 0.27181: ../../models/sentimento_3
    Stop training at epoch: 4/4

### Avaliação

In [15]:
rmetrics = trainer.evaluate(Xtrain, ytrain, metrics=[ptt.AccuracyMetric()], batch_size=2000)
print('Model training set accuracy after training: {:.5f}'.format(rmetrics['acc']))
print()
rmetrics = trainer.evaluate(Xvalid, yvalid, metrics=[ptt.AccuracyMetric()], batch_size=2000)
print('Model validation set accuracy after training: {:.5f}'.format(rmetrics['acc']))

evaluate: 12/12 ok
Model training set accuracy after training: 0.94968

evaluate: 12/12 ok
Model validation set accuracy after training: 0.88704


## Resumo dos resultados

1. Experimento *bag of words*: 87% de acurácia
2. Experimento *word embeddings*, rede densa: 88.7%
3. Experimento *word embeddings*, rede convolucional: 88.8%