# Sentiment analysis using word embeddings - IMDB

Previously we saw a first sentiment analysis solution using *bag of words*.
We now illustrate the use of *word embeddings*, which represent each word as a vector of latent attributes.
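
To make the idea concrete, the sketch below treats an embedding as a lookup table with one row of latent attributes per word index. It is only an illustration in plain Python: the table values are random, and `toy_voc_size`, `toy_embed_dim` and `embed` are names invented here, not part of the notebook.

```python
import random

random.seed(0)

toy_voc_size = 10   # hypothetical vocabulary size
toy_embed_dim = 4   # hypothetical number of latent attributes per word

# One row per word index; in a real model these values are learned.
embedding_table = [[random.uniform(-1, 1) for _ in range(toy_embed_dim)]
                   for _ in range(toy_voc_size)]

def embed(sequence):
    """Replace every word index by its row in the embedding table."""
    return [embedding_table[w] for w in sequence]

sample = [3, 7, 3]                      # a sequence of word indices
vectors = embed(sample)
print(len(vectors), len(vectors[0]))    # sequence length, embedding dimension
```

The same word index always maps to the same vector, which is what `nn.Embedding` does with a trainable weight matrix.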

Two solutions are proposed in this exercise:

1. Using a neural network with dense layers
2. Using 1D convolutional layers

Unlike the *bag of words* solution, both of these solutions require every sample to have the same
number of words. To achieve this, the number of words is capped: samples with fewer words are
completed with a special code, and words beyond the limit are discarded.

## Importing the packages

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import os, sys
import json
import numpy as np
import pandas as pd
import numpy.random as nr

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.optim.lr_scheduler import MultiStepLR, StepLR
from torch.utils.data import DataLoader, TensorDataset
from torch.autograd import Variable

from torchvision import datasets, transforms, models

import lib.pytorch_trainer as ptt

use_gpu = torch.cuda.is_available()
print('GPU available:', use_gpu)

GPU available: True


## IMDB dataset

### Reading from disk

The dataset consists of 25,000 training samples and 25,000 test samples.
Each sample is a text whose length ranges from 6 to 2493 words.
Each sample has a label of 1 for positive sentiment and 0 for negative sentiment.

In [2]:
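with open('/data/datasets/IMDB/imdb_word_index.json') as f:
    word_index = json.load(f)
# The sequences have different lengths, so they are stored as object arrays,
# which recent NumPy versions only load with allow_pickle=True.
data = np.load('/data/datasets/IMDB/imdb.npz', allow_pickle=True)
x_test, x_train, y_train, y_test = data['x_test'], data['x_train'], data['y_train'], data['y_test']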

n_words = len(word_index)
n_train = x_train.shape[0]
n_test  = x_test.shape[0]

word_list = [None for i in range(n_words+1)]
for k, v in word_index.items():
    word_list[v] = k

n_words, n_train, n_test

(88584, 25000, 25000)

In [3]:
' '.join([ word_list[x] for x in x_train[0]])

"bromwell high is a cartoon comedy it ran at the same time as some other programs about school life such as teachers my 35 years in the teaching profession lead me to believe that bromwell high's satire is much closer to reality than is teachers the scramble to survive financially the insightful students who can see right through their pathetic teachers' pomp the pettiness of the whole situation all remind me of the schools i knew and their students when i saw the episode in which a student repeatedly tried to burn down the school i immediately recalled at high a classic line inspector i'm here to sack one of your teachers student welcome to bromwell high i expect that many adults of my age think that bromwell high is far fetched what a pity that it isn't"

In [4]:
y_train[0]

1

In [5]:
def print_stats(x_train, x_test, word_list=None):
    print('Train word index limits:', min([min(s) for s in x_train]), max([max(s) for s in x_train]))
    print('Test word index limits:', min([min(s) for s in x_test]), max([max(s) for s in x_test]))
    print('\nTrain sequence length limits:', min([len(x) for x in x_train]), max([len(x) for x in x_train]))
    print('Test sequence length limits:', min([len(x) for x in x_test]), max([len(x) for x in x_test]))
    if word_list:
        print('\nMost frequent words:', word_list[1:11])
    
print_stats(x_train, x_test, word_list)

Train word index limits: 1 88584
Test word index limits: 1 88581

Train sequence length limits: 10 2493
Test sequence length limits: 6 2314

Most frequent words: ['the', 'and', 'a', 'of', 'to', 'is', 'br', 'in', 'it', 'i']


### Limiting the vocabulary

We remove from the sequences the words whose index is greater than or equal to the value specified in `voc_size`.

In [6]:
voc_size = 5000

xtra = [[w for w in x if (w < voc_size)] for x in x_train]
xval = [[w for w in x if (w < voc_size)] for x in x_test]
print_stats(xtra, xval)

Train word index limits: 1 4999
Test word index limits: 1 4999

Train sequence length limits: 9 1973
Test sequence length limits: 6 2113
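
Deleting rare words shortens the sequences, as the length limits above show. A common alternative, not used in this notebook, keeps every position by mapping rare words to a reserved out-of-vocabulary index. The sketch below contrasts the two on toy data; `replace_rare_words` and `oov_index` are hypothetical names, and an embedding layer would then need one extra row for that index.

```python
def replace_rare_words(sequences, voc_size, oov_index):
    """Map every index >= voc_size to oov_index instead of dropping it."""
    return [[w if w < voc_size else oov_index for w in seq] for seq in sequences]

toy = [[1, 12, 3, 40], [7, 2]]
filtered = [[w for w in seq if w < 10] for seq in toy]   # what the notebook does
replaced = replace_rare_words(toy, voc_size=10, oov_index=10)

print(filtered)
print(replaced)
```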


### Sequence length vector

In [7]:
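# Lengths are measured on the original x_train, i.e. before the vocabulary filter;
# the filtered sequences in xtra can be shorter (max 1973 instead of 2493).
seq_len_tra = [len(x) for x in x_train]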
seq_len_tra = torch.LongTensor(seq_len_tra)
seq_len_tra.max()

2493

In [8]:
seq_len_val = [len(x_test[i]) for i in range(len(x_test))]
seq_len_val = torch.LongTensor(seq_len_val)
seq_len_val.max()

2314
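
If the lengths are later used to index padded sequences, they need to describe the sequences that were actually padded. The sketch below, on toy data, measures each length after the vocabulary filter and caps it at the padded length; `filtered_lengths` is a hypothetical helper, not part of the notebook.

```python
def filtered_lengths(sequences, voc_size, seq_len):
    """Length of each sequence after the vocabulary filter, capped at seq_len."""
    return [min(sum(1 for w in seq if w < voc_size), seq_len) for seq in sequences]

toy_sequences = [[1, 12, 3, 40, 5], [7, 2], [9, 8, 7, 6, 5, 4]]
lengths = filtered_lengths(toy_sequences, voc_size=10, seq_len=4)
print(lengths)
```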

### Making all sequences the same length

We make all sequences the same length, specified in `seq_len`.
Here `seq_len` is 2500, just above the length of the longest sequence (2493 words).
The intent is for the recurrent cell to process each sequence only until it reaches the `fill_value` code,
so that each sequence is processed according to its number of words.

In [9]:
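def pad_sequences(sequences, seq_len, post_pad=True, fill_value=0):
    # Returns a list in which every sequence has exactly seq_len items.
    # Note: with post_pad=True the original tokens are aligned to the END of
    # the output (fill_value comes first) and long sequences keep their last
    # seq_len tokens; with post_pad=False the tokens are aligned to the START.
    new_seq = []
    for seq in sequences:
        n = len(seq)
        if n > seq_len:
            if post_pad:
                new_seq.append(seq[-seq_len:])
            else:
                new_seq.append(seq[:seq_len])
        else:
            zseq = [fill_value] * seq_len
            if post_pad:
                # seq_len - n (rather than -n) also handles empty sequences
                zseq[seq_len - n:] = seq
            else:
                zseq[:n] = seq
            new_seq.append(zseq)
    return new_seq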
    

In [10]:
seq_len = 2500
xtra = pad_sequences(xtra, seq_len)
xval = pad_sequences(xval, seq_len)
print_stats(xtra, xval)

Train word index limits: 0 4999
Test word index limits: 0 4999

Train sequence length limits: 2500 2500
Test sequence length limits: 2500 2500
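
Selecting the state at position `length - 1` only finds the last word if the words come first and the fill values come after them. A hypothetical helper with that layout, shown as a sketch rather than as the notebook's method:

```python
def pad_right(sequences, seq_len, fill_value=0):
    """Keep the first seq_len tokens and append fill_value up to seq_len."""
    padded = []
    for seq in sequences:
        kept = list(seq[:seq_len])
        padded.append(kept + [fill_value] * (seq_len - len(kept)))
    return padded

toy = [[5, 6, 7], [1, 2, 3, 4, 5, 6]]
padded_toy = pad_right(toy, seq_len=4)
print(padded_toy)
```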


### Converting to tensors

In [11]:
# np.int was removed from recent NumPy releases; np.int64 yields LongTensors
Xtrain = torch.from_numpy(np.array(xtra, np.int64))
Xvalid = torch.from_numpy(np.array(xval, np.int64))
ytrain = torch.from_numpy(np.array(y_train, np.int64))
yvalid = torch.from_numpy(np.array(y_test,  np.int64))

Xtrain.size(), Xtrain.max(), Xvalid.size(), Xvalid.max()

(torch.Size([25000, 2500]), 4999, torch.Size([25000, 2500]), 4999)

In [29]:
# (samples, sequence, attr) = (2, 3, 2)
a = np.array([[[ 1, 2],
               [ 4, 5],
               [-1,-1]],
              [[ 7, 9],
               [-1,-1],
               [-1,-1]]])
av = torch.from_numpy(a)
av = av.transpose(1,0)
print('av:',av)
seq_len = torch.from_numpy(np.array([3,2,1]))
h = np.arange(12*2).reshape(3,4,2)
#print(h)

last_output = torch.index_select(av, 0, seq_len - 1)
print(last_output)
tmp_indices = torch.LongTensor(range(seq_len.size(0)))
print(tmp_indices)
tmp_indices = tmp_indices.view(-1, 1, 1)
print(tmp_indices)
tmp_indices = tmp_indices.expand(last_output.size(0), 1, last_output.size(2))
print(tmp_indices)
last_output = torch.gather(last_output, 1, tmp_indices)
print(last_output)

av: 
(0 ,.,.) = 
  1  2
  7  9

(1 ,.,.) = 
  4  5
 -1 -1

(2 ,.,.) = 
 -1 -1
 -1 -1
[torch.LongTensor of size 3x2x2]


(0 ,.,.) = 
 -1 -1
 -1 -1

(1 ,.,.) = 
  4  5
 -1 -1

(2 ,.,.) = 
  1  2
  7  9
[torch.LongTensor of size 3x2x2]


 0
 1
 2
[torch.LongTensor of size 3]


(0 ,.,.) = 
  0

(1 ,.,.) = 
  1

(2 ,.,.) = 
  2
[torch.LongTensor of size 3x1x1]


(0 ,.,.) = 
  0  0

(1 ,.,.) = 
  1  1

(2 ,.,.) = 
  2  2
[torch.LongTensor of size 3x1x2]



RuntimeError: Invalid index in gather at /pytorch/torch/lib/TH/generic/THTensorMath.c:445
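
The error comes from `torch.gather`: every value in the index tensor must be smaller than the size of the input along the gathered dimension, and here the index reaches 2 while dimension 1 has size 2. The sketch below shows one way to pick, for each sample, the vector at its last valid step. It assumes a batch-first tensor, and the names `batch` and `lengths` are made up for the example.

```python
import torch

# (samples, sequence, attr) = (2, 3, 2); rows of -1 are padding
batch = torch.tensor([[[1, 2], [4, 5], [-1, -1]],
                      [[7, 9], [-1, -1], [-1, -1]]])
lengths = torch.tensor([2, 1])          # valid steps per sample

# Index of the last valid step, expanded to (samples, 1, attr)
idx = (lengths - 1).view(-1, 1, 1).expand(-1, 1, batch.size(2))
last = batch.gather(1, idx).squeeze(1)  # (samples, attr)
print(last)
```

Each sample contributes its own index here, so the index values stay below the sequence dimension as long as every length is between 1 and the padded length.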

## Recurrent network


In [22]:
class ModelRNN(nn.Module):
    
    def __init__(self, hidden_size, voc_size=voc_size, embed_dim = None):
        super().__init__()
        self.hidden_size = hidden_size
        self.emb = nn.Embedding(voc_size, embed_dim)
        nn.init.xavier_uniform(self.emb.weight)
        self.rnn = nn.RNN(input_size=embed_dim,
                          hidden_size=hidden_size,
                          batch_first=True,dropout=0.05)
        self.fc1 = nn.Linear(hidden_size,2)
        
    def forward(self, x,seq_len):
        x = self.emb(x)
        x,_ = self.rnn(x)
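        # Intended to select the state at the last word of each sequence. Note that
        # index_select applies the same indices to every sample, so the result is
        # (batch, len(seq_len), hidden) rather than one state per sample.
        y = torch.index_select(x,1,seq_len-1)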
        x = self.fc1(y)
        x = torch.squeeze(x,0)
        return x
    
model_rnn = ModelRNN(100,5000,50)
if use_gpu:
    model_rnn = model_rnn.cuda()
model_rnn

ModelRNN (
  (emb): Embedding(5000, 50)
  (rnn): RNN(50, 100, batch_first=True, dropout=0.05)
  (fc1): Linear (100 -> 2)
)

## Testing prediction on two samples

In [24]:
print(Xtrain[0:2].size())
print(seq_len_tra[:2])
xin = Variable(Xtrain[0:2])
slen = Variable(seq_len_tra[:2])
if use_gpu:
    xin = xin.cuda()
    slen = slen.cuda()
y = model_rnn(xin,slen)
y

torch.Size([2, 2500])

 138
 433
[torch.cuda.LongTensor of size 2 (GPU 0)]



Variable containing:
(0 ,.,.) = 
1.00000e-02 *
  -1.0478  5.8668
  -1.0478  5.8668

(1 ,.,.) = 
1.00000e-02 *
  -1.0478  5.8668
  -1.0478  5.8668
[torch.cuda.FloatTensor of size 2x2x2 (GPU 0)]
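
The output has size 2x2x2 instead of one pair of scores per sample. One way to obtain a single final state per sample is to let PyTorch skip the padding with `pack_padded_sequence`. The sketch below assumes that the real tokens come first and the padding last, that every length is at least 1, and it uses a hypothetical class name `PackedRNN` with toy sizes.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

class PackedRNN(nn.Module):
    """Hypothetical variant: embedding -> RNN over packed input -> linear."""
    def __init__(self, hidden_size, voc_size, embed_dim):
        super().__init__()
        self.emb = nn.Embedding(voc_size, embed_dim, padding_idx=0)
        self.rnn = nn.RNN(embed_dim, hidden_size, batch_first=True)
        self.fc1 = nn.Linear(hidden_size, 2)

    def forward(self, x, lengths):
        # pack_padded_sequence expects the lengths on the CPU
        packed = pack_padded_sequence(self.emb(x), lengths.cpu(),
                                      batch_first=True, enforce_sorted=False)
        _, h_n = self.rnn(packed)       # h_n: (1, batch, hidden), state at last valid step
        return self.fc1(h_n[-1])        # (batch, 2)

torch.manual_seed(0)
model = PackedRNN(hidden_size=8, voc_size=20, embed_dim=5)
x = torch.tensor([[3, 4, 5, 0, 0], [6, 7, 0, 0, 0]])   # 0 is the fill value
logits = model(x, torch.tensor([3, 2]))
print(logits.shape)
```

With packing, the padded positions never enter the recurrence, so the scores of a sample do not depend on how much padding follows it.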

### Training

In [25]:
trainIt = True
resetIt = True

embedding_dim = 50
batch_size = 100
n_epochs = 10

# Callbacks
# ---------
state_fn = '../../models/sentimento_rnn'
accuracy_cb = ptt.AccuracyMetric()
chkpt_cb = ptt.ModelCheckpoint(state_fn, reset=resetIt, verbose=1)
print_cb = ptt.PrintCallback()
plot_cb = ptt.PlotCallback()

# Model, optimizer and learning rate scheduler
# --------------------------------------------
optimizer = torch.optim.Adam(model_rnn.parameters(), lr=1e-4, weight_decay=0.0005)
scheduler = StepLR(optimizer, step_size=5, gamma=0.75)

# Network trainer
# ---------------
training_parameters = {
    'model':         model_rnn, 
    'criterion':     nn.CrossEntropyLoss(),
    'optimizer':     optimizer, 
    'lr_scheduler':  scheduler, 
    'callbacks':     [accuracy_cb, chkpt_cb, print_cb],
}
trainer = ptt.DeepNetTrainer(**training_parameters)
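
With `StepLR(optimizer, step_size=5, gamma=0.75)` the learning rate is multiplied by 0.75 every 5 epochs. The arithmetic for the 10 epochs configured above can be checked without PyTorch, assuming the scheduler is stepped once per epoch (how `ptt.DeepNetTrainer` steps it is not shown in this notebook):

```python
base_lr, step_size, gamma, n_epochs = 1e-4, 5, 0.75, 10

# Learning rate used during epoch e (0-based) when the scheduler steps once per epoch
schedule = [base_lr * gamma ** (epoch // step_size) for epoch in range(n_epochs)]
for epoch, lr in enumerate(schedule):
    print(epoch, lr)
```

Under that assumption the first five epochs run at 1e-4 and the last five at 7.5e-5.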


In [26]:
if trainIt:
    trainer.fit(n_epochs, Xtrain, ytrain, valid_data=(Xvalid, yvalid), 
                batch_size=batch_size)
else:
    print('\nTraining disabled.\nThis model was trained for {} epochs.'.format(trainer.last_epoch))

Start training for 10 epochs


TypeError: forward() missing 1 required positional argument: 'seq_len'
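
The traceback suggests that the trainer calls the model with the input batch only, so `forward` never receives `seq_len`. One possible workaround is to derive the lengths inside `forward` from the fill value. The sketch assumes that 0 is used only for padding (the word indices above start at 1), that padding follows the real tokens, and that each sample has at least one word. `ModelRNNAuto` is a hypothetical name.

```python
import torch
import torch.nn as nn

class ModelRNNAuto(nn.Module):
    """Hypothetical variant of ModelRNN whose forward takes only the batch."""
    def __init__(self, hidden_size, voc_size, embed_dim):
        super().__init__()
        self.emb = nn.Embedding(voc_size, embed_dim)
        self.rnn = nn.RNN(embed_dim, hidden_size, batch_first=True)
        self.fc1 = nn.Linear(hidden_size, 2)

    def forward(self, x):
        lengths = (x != 0).sum(dim=1)               # words per sample
        out, _ = self.rnn(self.emb(x))              # (batch, seq, hidden)
        idx = (lengths - 1).view(-1, 1, 1).expand(-1, 1, out.size(2))
        last = out.gather(1, idx).squeeze(1)        # state at the last word
        return self.fc1(last)                       # (batch, 2)

torch.manual_seed(0)
model_auto = ModelRNNAuto(hidden_size=8, voc_size=20, embed_dim=5)
x_toy = torch.tensor([[3, 4, 5, 0, 0], [6, 7, 0, 0, 0]])
scores = model_auto(x_toy)
print(scores.shape)
```

A model with this signature returns one pair of scores per sample, which is the shape `nn.CrossEntropyLoss` expects alongside a vector of class labels.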

- Result with sequences padded to the maximum sequence length (epoch, time, training loss and accuracy, validation loss and accuracy):
      1:  91.5s   T: 0.69234 0.51776   V: 0.69031 0.53744 best
      2:  92.4s   T: 0.61800 0.65836   V: 0.42015 0.81724 best
      3:  92.1s   T: 0.36631 0.84516   V: 0.34066 0.85732 best
      4:  93.2s   T: 0.30048 0.87840   V: 0.32623 0.86652 best

### Evaluation

In [12]:
if 'ModelCheckpoint' in [cb.__class__.__name__ for cb in trainer.callbacks]:
    trainer.load_state(state_fn)

rmetrics = trainer.evaluate(Xtrain, ytrain, metrics=[accuracy_cb])
print('Model training set accuracy after training: {:.5f}'.format(rmetrics['acc']))
print()
rmetrics = trainer.evaluate(Xvalid, yvalid, metrics=[accuracy_cb])
print('Model validation set accuracy after training: {:.5f}'.format(rmetrics['acc']))

evaluate: 2499/2499 ok
Model training set accuracy after training: 0.89896

evaluate: 2499/2499 ok
Model validation set accuracy after training: 0.86048
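
Accuracy is normally the fraction of samples whose predicted class matches the label. A plain-Python sketch of that computation from two-class scores, with made-up values (the internals of `ptt.AccuracyMetric` are not shown in this notebook, so treating it as equivalent is an assumption):

```python
def accuracy(scores, labels):
    """Fraction of samples whose highest-scoring class equals the label."""
    predictions = [max(range(len(row)), key=row.__getitem__) for row in scores]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)

toy_scores = [[0.2, 1.3], [2.0, -0.5], [0.1, 0.4], [1.5, 0.3]]
toy_labels = [1, 0, 0, 0]
toy_acc = accuracy(toy_scores, toy_labels)
print(toy_acc)
```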


## Summary of results

1. *Bag of words* experiment: 87% accuracy
2. *Word embeddings* experiment, dense network: 88%
3. *Word embeddings* experiment, convolutional network: 89%