# Description

**Classifying IMDb movie reviews by rating**, using a **hierarchical attention** network as described in [Hierarchical Attention Networks for Document Classification](https://www.cs.cmu.edu/~hovy/papers/16HLT-hierarchical-attention-networks.pdf).
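
For reference, the word-level attention in the paper can be summarised as below. The notation follows the paper and does not appear elsewhere in this notebook: $h_{it}$ is the bidirectional GRU state for word $t$ of sentence $i$, $u_w$ is a learned context vector, and $s_i$ is the resulting sentence vector. Sentence-level attention has the same form, applied to the sentence vectors.

```latex
\begin{aligned}
u_{it} &= \tanh(W_w h_{it} + b_w) \\
\alpha_{it} &= \frac{\exp(u_{it}^\top u_w)}{\sum_{t'} \exp(u_{it'}^\top u_w)} \\
s_i &= \sum_t \alpha_{it} h_{it}
\end{aligned}
```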

In [1]:
%env CUDA_DEVICE_ORDER=PCI_BUS_ID
%env CUDA_VISIBLE_DEVICES=0,1

env: CUDA_DEVICE_ORDER=PCI_BUS_ID
env: CUDA_VISIBLE_DEVICES=0,1


In [2]:
%matplotlib inline
%reload_ext autoreload
%autoreload 2


from tqdm import tqdm_notebook
import warnings; warnings.filterwarnings('ignore')
from pathlib import Path
import pandas as pd
import numpy as np
import re
import spacy
from collections import Counter
from matplotlib import pyplot as plt

import torch
from torch.utils.data import DataLoader, Dataset
from torch.utils.data.dataloader import default_collate
from torch.nn.utils.rnn import pad_sequence
from torch import nn
from torch.nn import functional as F
from torch import optim
from sklearn.metrics import accuracy_score

BATCH_SIZE = 64
EMBEDDING_DIM = 200
HIDDEN_SIZE = 50
OUT_SIZE = 10  # 1-10  (0's excluded)


In [3]:
PATH = Path('./aclImdb/')
TRAIN_PATH = Path('./aclImdb/train/')
TEST_PATH = Path('./aclImdb/test/')

pos_dir = 'pos'
neg_dir = 'neg' 

#  Create DataFrame | x | y | 

In [4]:
def construct_df(data_path:Path):
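    '''
    Build a DataFrame of review file paths and their ratings.
    The rating is parsed from the file name, e.g. `5826_10.txt` has rating 10.

    :param data_path: path to either the training folder or the testing folder.
    '''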
    file_paths = []
    ratings = []
    for sent_type in [pos_dir,neg_dir]:
        for file in (data_path/sent_type).iterdir():
            file_paths.append(file)
            ratings.append(int(file.parts[-1].split('_')[1].split('.')[0]))

    return pd.DataFrame(list(zip(file_paths,ratings)), columns=['file_paths', 'ratings'])
        

80%-20% for train-validation split
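
The split below draws a random boolean mask, so the proportions are only approximately 80/20 and change between runs. A minimal sketch of the same idea with a fixed seed, using only the standard library; the name `random_split_mask` and the seed value are illustrative and not part of this notebook:

```python
import random

def random_split_mask(n_items, train_fraction=0.8, seed=42):
    """Return a list of booleans; True marks a training item."""
    rng = random.Random(seed)  # fixed seed makes the split repeatable
    return [rng.random() < train_fraction for _ in range(n_items)]

mask = random_split_mask(25000)
n_train = sum(mask)
n_val = len(mask) - n_train
print(n_train, n_val)
```

With NumPy, calling `np.random.seed(...)` before drawing the mask should have the same effect.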

In [5]:
data_df = construct_df(TRAIN_PATH)

train_idx = np.random.rand(len(data_df)) < .8

train_df =  data_df[train_idx].reset_index(drop=True)
val_df = data_df[~train_idx].reset_index(drop=True)
test_df = construct_df(TEST_PATH)

In [6]:
data_df.ratings.value_counts()

1     5100
10    4732
8     3009
4     2696
7     2496
3     2420
2     2284
9     2263
Name: ratings, dtype: int64

In [7]:
train_df.head()

| | file_paths | ratings |
|---|---|---|
| 0 | aclImdb/train/pos/5826_10.txt | 10 |
| 1 | aclImdb/train/pos/2621_8.txt | 8 |
| 2 | aclImdb/train/pos/1633_9.txt | 9 |
| 3 | aclImdb/train/pos/1890_10.txt | 10 |
| 4 | aclImdb/train/pos/6960_7.txt | 7 |


# Tokenization

In [8]:
# example document
open(train_df.iloc[15,0], 'r').read()

"I have read the book and I must say that this movie stays true to form. I think this is the beginning of the psychological thrillers in the same genre of Psycho. Cristina Raines gives an excellent performance as the lead, and Burgess Meredith gives an excellent supporting actor as the next-door neighbor. I have seen this movie at least twice and I think that I am going to buy both the book and the movie for my collection. The suspense just keeps building up to the climatic end, the twist you will never see coming. If you like movies like Signs and The Village, the Sentinel will be a classic prelude. Also, what is interesting is the actors in the movie-you would not recognize them if you did not read the credits. The late Jerry Orbach is great as the commercial director and Jeff Goldblum is excellent as the photographer. Also there is Beverly D'Angelo, who is underrated but great."

In [9]:
re_br = re.compile(r'<\s*br\s*/?>', re.IGNORECASE)
def sub_br(x): return re_br.sub("\n", x.lower())

def split_sentece(x, splt_char=('.','!','?')): 
    ''' Splits text into sentences at each of the given characters'''
    A = x if isinstance(x, list) else [x]
    B = A
    for char in splt_char:
        A = B
        B = []
        for sentence in A:
            if len(sentence)>0:
                B.extend(sentence.split(char))
    return B

my_tok = spacy.load('en')
def spacy_tok(x): return [tok.text for tok in my_tok.tokenizer(sub_br(x))]

In [10]:
file = train_df.iloc[3,0]
file_sentences_level = split_sentece(sub_br(file.read_text()))
file_word_level = [spacy_tok(sentence) for sentence in file_sentences_level]
print('\n'.join([' ● '.join(sentence) for sentence in file_word_level]))

this ● is ● one ● of ● those ● movies ● that ● i ● 've ● seen ● so ● many ● times ● that ● i ● can ● quote ● most ● of ● it
  ● some ● of ● the ● lines ● in ● this ● movie ● are ● just ● unbeatable
  ● i ● particularly ● enjoy ● watching ● him ● stumble ● and ● fall ● while ● drunk ● , ● go ● out ● to ● the ● fancy ● restaurant ● drunk ● and ● the ● part ● with ● the ● moose


 ● i ● do ● n't ● know ● how ● many ● times ● i ● have ● seen ● this ● sequence ● but ● it ● 's ● funny ● every ● time
  ● from ● the ● moment ● arthur ● gets ● to ● susan ● 's ● dad ● 's ● place ● to ● the ● bit ● with ● the ● moose ● , ● you ● pretty ● much ● laugh ● the ● whole ● time
  ● i ● remember ● watching ● the ● out ● - ● takes ● regarding ● the ● bit ● with ● the ● moose
  ● it ● went ● down ● just ● like ● i ● 'd ● imagined ● it ● 'd ● be ● like
  ● they ● were ● all ● laughing ● so ● hard ● it ● was ● difficult ● for ● them ● to ● film ● it


 ● the ● late ● sir ● john ● gielgud ● was ● a ● wonderfu

# Vocabulary 

I chose to use all the vocabulary in training and validation, to avoid recomputing it at test time. (Cheating, I know.)

In [11]:
counts = Counter()
for path in train_df.file_paths:
    counts.update(spacy_tok(path.read_text(encoding='utf-8')))

In [12]:
counts.values()

dict_values([69405, 1350, 11377, 7283, 2538, 60878, 1516, 221311, 40399, 9428, 419, 16477, 2113, 58858, 16264, 35477, 130322, 3166, 117355, 7302, 5518, 22914, 5360, 10036, 1050, 1348, 13685, 19093, 60, 88927, 7180, 18715, 52259, 414, 296, 75336, 21490, 2217, 666, 24162, 8500, 25508, 426, 810, 37442, 277, 210, 7209, 12165, 40365, 13389, 27282, 215, 600, 643, 269718, 4501, 49773, 23, 7540, 1354, 9278, 391, 1226, 124, 131279, 33, 23575, 933, 5, 1338, 1286, 14, 5, 3627, 4489, 285, 76673, 4840, 801, 64, 895, 85, 221144, 199, 5361, 319, 14, 3500, 2796, 453, 21596, 349, 108961, 2373, 751, 14196, 26664, 1567, 2853, 11169, 225, 2598, 3395, 9239, 291, 123, 4892, 155, 2721, 223, 580, 2466, 3158, 960, 6401, 9, 53, 24307, 573, 3976, 5269, 2206, 3068, 7121, 18296, 527, 1497, 31, 547, 6348, 5208, 32047, 5548, 5043, 1835, 306, 2544, 293, 5, 507, 3331, 524, 1493, 403, 6560, 550, 2951, 743, 1832, 239, 35682, 83, 1040, 106, 233, 9598, 3395, 2559, 11470, 519, 984, 10684, 8715, 9079, 284, 7965, 492, 1864, 

Filter out words that appear 10 times or fewer, because they are unlikely to appear in the test set. We map them to the unknown token `<UNK>`.
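
A toy illustration of this filtering step, as a sketch: the helper names `build_vocab` and `encode` and the sample documents are made up for the example, and a lower threshold is used so the effect is visible on three tiny documents.

```python
from collections import Counter

def build_vocab(token_lists, min_count=10):
    """Map tokens seen more than `min_count` times to integer ids."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(tokens)
    vocab = {"<PAD>": 0, "<UNK>": 1}  # reserved ids for padding and unknown words
    for word, n in counts.items():
        if n > min_count:
            vocab[word] = len(vocab)
    return vocab

def encode(tokens, vocab):
    """Replace each token by its id, falling back to <UNK>."""
    return [vocab.get(tok, vocab["<UNK>"]) for tok in tokens]

docs = [["good", "movie"], ["good", "plot"], ["good", "acting"]]
vocab = build_vocab(docs, min_count=2)
print(vocab)
print(encode(["good", "film"], vocab))
```

Only `good` clears the threshold here, so every other word is encoded as `<UNK>`.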

In [13]:
vocab2index = {"<PAD>":0, "<UNK>":1}
words = ["<PAD>", "<UNK>"]
for word in counts:
    if counts[word]>10:
        vocab2index[word] = len(words)
        words.append(word)

In [14]:
# words
list(vocab2index.items())[:5]

[('<PAD>', 0), ('<UNK>', 1), ('i', 2), ('remember', 3), ('when', 4)]

In [15]:
print(f'We contemplate {len(words)} words')

We contemplate 17051 words


# Order docs by length (in terms of sentences)

In [16]:
def doc_length(file_path):
    return len(split_sentece(sub_br(file_path.read_text('utf-8'))))

def order_docs(df):
    order = []
    for i, path in enumerate(df.file_paths):
        order.append((i, doc_length(path)))
    
    idxs = [x[0] for x in sorted(order, key=lambda x: x[1])]
    
    return df.iloc[idxs,:].reset_index(drop=True)
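
Ordering by sentence count matters because each batch is padded to its longest document. A rough sketch of the effect, with made-up sentence counts and a hypothetical helper `padded_size`:

```python
def padded_size(lengths, batch_size):
    """Total slots used when each batch is padded to its longest item."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += max(batch) * len(batch)
    return total

lengths = [3, 40, 5, 38, 4, 41, 6, 39]  # made-up sentence counts per document
unsorted_cost = padded_size(lengths, batch_size=4)
sorted_cost = padded_size(sorted(lengths), batch_size=4)
print(unsorted_cost, sorted_cost)
```

The saving only holds when the loader reads documents in order; a `DataLoader` created with `shuffle=True` mixes lengths within a batch again.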

In [17]:
# train_df = order_docs(train_df)
# val_df = order_docs(val_df)
# test_df = order_docs(test_df)

In [18]:
train_df.ratings.unique()

array([10,  8,  9,  7,  1,  2,  3,  4])

# Create DataSet

In [19]:
bs = 64
MAX_SENT = 148
MAX_WORDS = 2802

In [20]:
class doc_s_w(Dataset):
    
    def __init__(self, df, def_idx=1):
        
        df = order_docs(df)
        self.paths = df.file_paths
        self.y = df.ratings
        self.def_idx = def_idx
        
    def __len__(self,):
        return len(self.y)
        
    def __getitem__(self, idx):
        file = self.paths[idx]
        doc_sentences_level = split_sentece(sub_br(file.read_text('utf-8')))
        doc_word_level = [spacy_tok(sentence) for sentence in doc_sentences_level]
        x = [[vocab2index.get(w, self.def_idx) for w in s] for s in doc_word_level]
        
        max_n_words = max([len(s) for s in x])
        
        return x, self.y[idx]-1, len(x), max_n_words

In [21]:
def dynamic_word_sentece_padding(batch, MAX_SENT=MAX_SENT, MAX_WORDS=MAX_WORDS):
    '''
    We have set a maximum number of words and sentences. 
    If a doc goes over we just ignore the tail.
    ''' 
    
    # Compute the dimensions of the batch
    dim_sent = min(MAX_SENT, max([b[2] for b in batch]))
    dim_words = min(MAX_WORDS, max([b[3] for b in batch]))
    
    # Create the padded sentence input
    X = []
    for sentences,*_ in batch:
        A = np.zeros([dim_sent, dim_words])
        for i in range(min([len(sentences),dim_sent])):
            fill_up_to = min(len(sentences[i]), dim_words)
            A[i,:fill_up_to] = sentences[i][:fill_up_to]
        X.append(A)
            
    y = [b[1] for b in batch]

    new_batches = list(zip(X,y))
    return default_collate(new_batches)
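
The pad-or-truncate step for a single document can be sketched in plain Python. The helper `pad_document` and the token ids below are hypothetical, and `0` stands for the `<PAD>` index:

```python
def pad_document(sentences, n_sent, n_words, pad_id=0):
    """Pad or truncate a list of token-id lists to an n_sent x n_words grid."""
    grid = [[pad_id] * n_words for _ in range(n_sent)]
    for i, sent in enumerate(sentences[:n_sent]):  # drop extra sentences
        clipped = sent[:n_words]                   # drop extra words
        grid[i][:len(clipped)] = clipped
    return grid

doc = [[5, 6, 7, 8], [9]]
padded = pad_document(doc, n_sent=3, n_words=3)
print(padded)
```

The first sentence loses its last token, the second is padded on the right, and a third all-padding sentence is added.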

In [22]:
# testing

# train_ds = doc_s_w(train_df)
# val_ds = doc_s_w(val_df)
# test_ds = doc_s_w(test_df)

# train_dl = DataLoader(train_ds, shuffle=False, batch_size=64, collate_fn=dynamic_word_sentece_padding, drop_last=True)
# val_dl = DataLoader(val_ds, shuffle=False, batch_size=64, collate_fn=dynamic_word_sentece_padding, drop_last=True)
# test_dl = DataLoader(test_ds, shuffle=False, batch_size=64, collate_fn=dynamic_word_sentece_padding, drop_last=True)


# x, y = next(iter(train_dl))
# x, y  = x.long(), y.long()
# x.shape, y.shape

# Architecture

In [23]:
vocab_size = len(words)

In [24]:
class Attention(nn.Module):
    
    def __init__(self,  input_size, dropout=0):
        super().__init__()
        
        self.linear1 = nn.Linear(input_size, input_size)
                
        # To be tested:
        self.linear2 = nn.Linear(input_size, 1, bias=False)
        
        self.dropout = nn.Dropout(dropout)

        
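    def forward(self, x):
        # x: (batch, seq_len, input_size)
        out = torch.tanh(self.linear1(x))

        # one unnormalised score per time step: (batch, seq_len)
        # squeeze only the last dim so batches or sequences of size 1 keep their shape
        out = self.linear2(out).squeeze(-1)

        out = self.dropout(out)

        # attention weights over the sequence
        w = F.softmax(out, dim=-1)

#         w = self.dropout(w)#/torch.sum(w.unsqueeze(-1),-1)

        # weighted sum of the inputs: (batch, 1, seq_len) @ (batch, seq_len, input_size)
        out = torch.bmm(w.unsqueeze(1), x).squeeze(1)

        return out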
    
class HiererchicalAttention(nn.Module):
    
    def __init__(self, embedding_dim, hidden_size, vocab_size, bs, out_size,
                dropout=0, max_words=MAX_WORDS, max_sentences=MAX_SENT):
        
        super().__init__()
        
        self.hidden_size = hidden_size
        self.bs = bs

        # word embedding
        self.emb = nn.Embedding(num_embeddings=vocab_size, 
                                embedding_dim=embedding_dim)
        
        # word level
        self.word_GRU = nn.GRU(input_size=embedding_dim, hidden_size=hidden_size,
                               bidirectional=True, batch_first=True, dropout=dropout)
    
        self.word_Attention = Attention(2*hidden_size, dropout=dropout)
        
        # sentence level
        self.sentence_GRU = nn.GRU(input_size=2*hidden_size, hidden_size=hidden_size,
                               bidirectional=True, batch_first=True, dropout=dropout)
        
        self.sentence_Attention = Attention(2*hidden_size, dropout=dropout)
        
        # final linear
        self.linear = nn.Linear(2*hidden_size, out_size)
        
    def forward(self, x, h_0):
        
        sent_encoding  = []
        
        # GRU + Attention (word - level) - :output: vector encoding the sentence 
        for sent_idx in range(x.shape[1]):
            
            x_i = x[:,sent_idx,:]
            
            out_i = self.emb(x_i)
            
            out_i = self.word_GRU(out_i, h_0)[0]
            
            out_i = self.word_Attention(out_i)

            sent_encoding.append(out_i.unsqueeze(-2))
        
        out = torch.cat(sent_encoding, dim=-2)
        
        # GRU + Attention (sentence - level) - :output: vector encoding the document 
        # We don't want sentence level attention if there is only one sentence (this just adds noise)
        if x.shape[1]>1:
            
            out = self.sentence_GRU(out, h_0)[0]

            out = self.sentence_Attention(out)
        
        out = self.linear(out)
        
        return out.squeeze()
    
    def initHidden(self, cuda=True):
        hidden=torch.zeros(2, self.bs, self.hidden_size, requires_grad=False)
        return hidden.cuda() if cuda else hidden
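
The pooling done by `Attention` is a softmax over one score per time step, followed by a weighted sum of the input states. A dependency-free sketch of that step alone; the function `attention_pool` and the toy values are illustrative, and in the module the scores come from `linear2(tanh(linear1(x)))` rather than being passed in:

```python
import math

def attention_pool(states, scores):
    """Softmax the scores, then return the weighted sum of the state vectors."""
    m = max(scores)                              # subtract the max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(states[0])
    pooled = [sum(w * h[d] for w, h in zip(weights, states)) for d in range(dim)]
    return weights, pooled

states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three time steps, two features
scores = [0.0, 0.0, 0.0]                       # equal scores give uniform weights
weights, pooled = attention_pool(states, scores)
print(weights, pooled)
```

With equal scores the result is the plain average of the states; a larger score on one step pulls the pooled vector towards that step.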

In [25]:
## Testing
# net = HiererchicalAttention(embedding_dim=EMBEDDING_DIM,
#                              hidden_size=HIDDEN_SIZE, 
#                              vocab_size=vocab_size, 
#                              bs=BATCH_SIZE, out_size=OUT_SIZE).cuda()
# h_0 = net.initHidden()
# x.shape, net(x.long().cuda(), h_0).shape, y.shape

# Training definition

In [26]:
def save_model(m, p): torch.save(m.state_dict(), p)


def load_model(m, p): m.load_state_dict(torch.load(p))


In [27]:

def cos_cycle(start_lr, end_lr, n_iterations):
    '''cosine annealing'''
    i = np.arange(n_iterations)
    c_i = 1 + np.cos(i*np.pi/n_iterations)
    return end_lr + (start_lr - end_lr)/2 *c_i
    
    
class step_policy:
    '''
    One-cycle learning rate and momentum policy with cosine annealing.
    '''
    
    def __init__(self, n_epochs, dl, max_lr, div_factor:float=25., pctg:float=.3, moms:tuple=(.95,.85), delta=1/1e4):
        
        total_iterations = n_epochs*len(dl)
        
        max_lr, min_start, min_end = (max_lr, 
                                      max_lr/div_factor, 
                                      max_lr/div_factor*delta)
        
        self.stages = (int(total_iterations*pctg), total_iterations - int(total_iterations*pctg))
        
        lr_diffs = ((min_start, max_lr),(max_lr, min_end))
        mom_diffs = (moms, (moms[1],moms[0]))

        self.lr_schedule = self._create_schedule(lr_diffs)
        self.mom_schedule = self._create_schedule(mom_diffs)
        
        self.iter = -1
        
    def _create_schedule(self, diffs):
        individual_stages = [cos_cycle(start, end, n) for ((start, end),n) in zip(diffs, self.stages)]
        return np.concatenate(individual_stages)
    
    def step(self):
        self.iter += 1
        return [sch[self.iter] for sch in [self.lr_schedule, self.mom_schedule]]
    
    
    
class OptimizerWrapper:
    '''
    Wrapper to use weight decay in optim.Adam without influencing its algorithm.
    Takes care of the change in learning rate / momentum at every iteration.
    
    '''
    
    def __init__(self, model, n_epochs, dl, max_lr, div_factor=25., wd=0):
        
        self.policy =  step_policy(n_epochs=n_epochs, dl=dl, 
                                   max_lr=max_lr, div_factor=div_factor)
        
        self.model = model
        self._wd = wd
        
        p = filter(lambda x: x.requires_grad, model.parameters())
        
        self.optimizer = optim.Adam(params=p, lr=0)
    
    def _update_optimizer(self):
        lr_i, mom_i = self.policy.step()
        for group in self.optimizer.param_groups:
            group['lr'] = lr_i
            group['betas'] = (mom_i, .999)

    def step(self):
        self._update_optimizer()
        if self._wd!=0:
            # decoupled weight decay: shrink each weight by a factor (1 - lr*wd)
            for group in self.optimizer.param_groups:
                for p in group['params']: p.data.mul_(1 - group['lr']*self._wd)
        self.optimizer.step()
        
    def zero_grad(self): self.optimizer.zero_grad()
        
    def reset(self, n_epochs, dl, max_lr):
        self.iter = -1
        self.policy =  step_policy(n_epochs=n_epochs, dl=dl, max_lr=max_lr)
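
To see the shape of the learning-rate schedule that `step_policy` builds, here is a standalone sketch using only the `math` module. The names `cosine_segment` and `one_cycle` are illustrative; the defaults mirror `div_factor=25`, `pctg=.3` and `delta=1/1e4` from the class above.

```python
import math

def cosine_segment(start, end, n_steps):
    """Cosine interpolation from `start` towards `end` over n_steps values."""
    return [end + (start - end) / 2 * (1 + math.cos(i * math.pi / n_steps))
            for i in range(n_steps)]

def one_cycle(max_lr, total_steps, div_factor=25.0, pct_warmup=0.3, delta=1e-4):
    """Warm up from max_lr/div_factor to max_lr, then anneal to a tiny value."""
    warmup = int(total_steps * pct_warmup)
    low = max_lr / div_factor
    return (cosine_segment(low, max_lr, warmup)
            + cosine_segment(max_lr, low * delta, total_steps - warmup))

schedule = one_cycle(max_lr=5e-3, total_steps=100)
peak_step = schedule.index(max(schedule))
print(len(schedule), schedule[0], peak_step, schedule[-1])
```

The rate rises during the first 30% of the steps, peaks where the second segment starts, and then decays towards `max_lr/div_factor*delta`. The momentum schedule is built the same way with the endpoints reversed.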
    

In [28]:
def softmax(x):
    m = x.max(1)
    num = np.exp(x-np.expand_dims(m,1))
    den = np.exp(x-np.expand_dims(m,1)).sum(1)
    return num/np.expand_dims(den,1)

def accuracy(y, pred):
    """
    Accuracy score
    """
    pred = pred.argmax(-1)
    return accuracy_score(y, pred)


def validate(model, valid_dl, h_0):
    """
    Validation/Testing loop
    """
    model.eval()
    div = 0
    agg_loss = 0
    ys = np.empty((0), int)
    preds = np.empty((0, 10), float)
    for it, (x,y) in enumerate(valid_dl):
        
        x = x.long().cuda()
        y = y.long().cuda()
        
        out = model(x, h_0)
        loss = F.cross_entropy(input=out,target=y)

        agg_loss += loss.item()
        div += 1
        
        preds = np.append(preds, out.cpu().detach().numpy(), axis=0)
        ys = np.append(ys, y.cpu().numpy(), axis=0)
    
    preds = softmax(preds)
    val_loss = agg_loss/div
    measures = accuracy(ys, preds)
    model.train()
    return val_loss, measures
      

def train(n_epochs, train_dl, model, h_0, valid_dl=None, max_lr=.01, div_factor=25):
    """Training loop
    """
            
    optimizer = OptimizerWrapper(model, n_epochs, train_dl,
                                 max_lr=max_lr, div_factor=div_factor)
    
    min_val_loss = np.inf
    
    for epoch in tqdm_notebook(range(n_epochs)):
        model.train()
        div = 0
        agg_loss = 0
        for it, (x,y) in enumerate(train_dl):
            
            x = x.long().cuda()
            y = y.long().cuda()
            
            out = model(x, h_0)
            loss = F.cross_entropy(input=out,target=y)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

            agg_loss += loss.item()
            div += 1
            
        if valid_dl is None: print(f'Ep. {epoch+1} - train loss {agg_loss/div:.4f}')
        else:
            val_loss, measure = validate(model, valid_dl, h_0)
            print(f'Ep. {epoch+1} - train loss {agg_loss/div:.4f} -  val loss {val_loss:.4f} avg accuracy {measure:.4f}')
            if val_loss < min_val_loss:
                min_val_loss = val_loss
                save_model(model, './best_model.pth')
                torch.save(h_0, 'best_model_hs.pth')

# Running the model

In [None]:
n_epochs = 10

In [None]:
# data
train_ds = doc_s_w(train_df)
val_ds = doc_s_w(val_df)
test_ds = doc_s_w(test_df)

train_dl = DataLoader(train_ds, shuffle=True, batch_size=BATCH_SIZE, collate_fn=dynamic_word_sentece_padding, drop_last=True)
val_dl = DataLoader(val_ds, shuffle=False, batch_size=BATCH_SIZE, collate_fn=dynamic_word_sentece_padding, drop_last=True)
test_dl = DataLoader(test_ds, shuffle=False, batch_size=BATCH_SIZE, collate_fn=dynamic_word_sentece_padding, drop_last=True)

# model
model = HiererchicalAttention(embedding_dim=EMBEDDING_DIM, hidden_size=HIDDEN_SIZE, vocab_size=vocab_size, 
                              bs=BATCH_SIZE, out_size=OUT_SIZE, dropout=.3).cuda()
h_0 = model.initHidden(cuda=True)

# train
train(n_epochs, train_dl, model, h_0, val_dl, max_lr=5e-3)


In [None]:
load_model(model, './best_model.pth')


In [None]:
val_loss, measure = validate(model, test_dl, h_0)
print(f'Test loss {val_loss:.4f} Accuracy {measure:.4f}')