<a href="https://colab.research.google.com/github/sanjeevr5/NLP_Excercises/blob/main/DL_NLP_With_Torch_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Multi-class classification problem using RNNs

This exercise is performed on the AG News dataset, which has four news categories:

- 1 : World
- 2 : Sports
- 3 : Business
- 4 : Sci/Tech

<b> Architecture details </b>

1. We use BPEmb sub-word embeddings as word vectors. BPEmb splits text into sub-word tokens and provides a pre-trained vector for every token. Sub-word tokenization makes sequences longer, so it should be used with care.
2. Only the titles are used to predict the news category.
3. Sequences are not padded. Each batch is packed, the embedding layer is applied to the packed data, and the resulting packed sequence is fed to the LSTM.
4. Data loaders feed batches to the model lazily, generator-style.
5. Either the LSTM outputs or the final hidden state can be passed to the dense layer; the model here uses the final hidden state.
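
To make point 3 more concrete, here is a minimal sketch of what a packed batch looks like. The three toy sequences are made up for illustration and are not taken from the dataset.

```python
import torch
from torch.nn.utils.rnn import pack_sequence

# Three toy "sentences" of token ids, already sorted longest first.
seqs = [torch.tensor([1, 2, 3]), torch.tensor([4, 5]), torch.tensor([6])]
packed_batch = pack_sequence(seqs, enforce_sorted=True)

# data interleaves the sequences time step by time step;
# batch_sizes says how many sequences are still active at each step.
print(packed_batch.data.tolist())         # [1, 4, 6, 2, 5, 3]
print(packed_batch.batch_sizes.tolist())  # [3, 2, 1]
```

No padding token appears anywhere in `data`, which is why the recurrent layer never spends time steps on padding.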

## Downloading the data

In [None]:
%%capture
!pip install bpemb
!wget https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/train.csv
!wget https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/test.csv

## Importing the essentials

In [None]:
import pandas as pd
import torch
import torch.nn as nn
from torch.utils import data
from torch.nn.utils.rnn import pack_sequence
import time
import torch.optim as optim
from bpemb import BPEmb


bpemb_en = BPEmb(lang="en", dim=300, vs=25000)  # vs is the vocabulary size (matches the 25000 x 300 matrix printed below)
print(bpemb_en.vectors.shape)
embed_matrix = torch.tensor(bpemb_en.vectors)
SEED = 10
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

downloading https://nlp.h-its.org/bpemb/en/en.wiki.bpe.vs25000.model


100%|██████████| 661443/661443 [00:00<00:00, 15653630.99B/s]


downloading https://nlp.h-its.org/bpemb/en/en.wiki.bpe.vs25000.d300.w2v.bin.tar.gz


100%|██████████| 28009055/28009055 [00:00<00:00, 56909932.83B/s]


(25000, 300)


## Tokenizer and pre-processing

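Pre-processing is minimal: each title is cut at the first ` (` to drop the trailing source tag such as `(Reuters)`, then lowercased. Labels are shifted from 1-4 to 0-3 so they can be used directly as class indices.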
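
The two transformations can be tried on a single row in isolation. The helper names `clean_title` and `shift_label` are hypothetical and only wrap the lambdas used in this notebook.

```python
def clean_title(title: str) -> str:
    """Cut the title at the first ' (' and lowercase what is left."""
    return title.split(' (')[0].lower()

def shift_label(label: int) -> int:
    """Map the dataset's 1-4 class ids to 0-3."""
    return int(label - 1)

print(clean_title("Wall St. Bears Claw Back Into the Black (Reuters)"))
# wall st. bears claw back into the black
print(shift_label(3))  # 2
```

Because the split happens at the first ` (`, a title with parentheses in the middle would also lose everything after them.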

In [None]:
train_data = pd.read_csv('./train.csv', sep = ',', header = None)
test_data = pd.read_csv('./test.csv', sep = ',', header = None)

print(f'Train shape : {train_data.shape} test shape : {test_data.shape}')

Train shape : (120000, 3) test shape : (7600, 3)


In [None]:
train_data.head()

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 3 | Wall St. Bears Claw Back Into the Black (Reuters) | Reuters - Short-sellers, Wall Street's dwindli... |
| 1 | 3 | Carlyle Looks Toward Commercial Aerospace (Reu... | Reuters - Private investment firm Carlyle Grou... |
| 2 | 3 | Oil and Economy Cloud Stocks' Outlook (Reuters) | Reuters - Soaring crude prices plus worries\ab... |
| 3 | 3 | Iraq Halts Oil Exports from Main Southern Pipe... | Reuters - Authorities have halted oil export\f... |
| 4 | 3 | Oil prices soar to all-time record, posing new... | AFP - Tearaway world oil prices, toppling reco... |


In [None]:
X_train, y_train = train_data.iloc[:,1].map(lambda x : x.split(' (')[0].lower()), train_data.iloc[:,0].map(lambda x : int(x - 1))
X_test, y_test = test_data.iloc[:,1].map(lambda x : x.split(' (')[0].lower()), test_data.iloc[:,0].map(lambda x : int(x - 1))

In [None]:
train_encoded = [torch.tensor(bpemb_en.encode_ids(item)) for item in X_train.values]
test_encoded = [torch.tensor(bpemb_en.encode_ids(item)) for item in X_test.values]

print('train sample:', train_encoded[0])
print('test sample:', test_encoded[0])

train sample: tensor([ 2029,    66, 24935,  7820, 24697,   810,   423,     7,  1149])
test sample: tensor([15923,    72,     3,    47, 14396,   297,  9981])


## Data loader with packed sequences

In [None]:
from torch.utils.data import DataLoader, Dataset

class Data_Iterator(data.Dataset):

  def __init__(self, text, label):
    super(Data_Iterator, self).__init__()
    assert len(text) == len(label)
    self.text = text
    self.label = label
  
  def __len__(self):
    return len(self.label)

  def __getitem__(self, index):
    return self.text[index], self.label[index]

train_data = Data_Iterator(train_encoded, y_train)
test_data = Data_Iterator(test_encoded, y_test)

In [None]:
from torch.nn.utils.rnn import pack_sequence

def packed(batch):
  # Sort by length (longest first) so the labels stay aligned with the packed order.
  sorted_batch = sorted(batch, key=lambda x: x[0].shape[0], reverse=True)
  text_seq = [i[0] for i in sorted_batch]
  # The batch is already sorted, so enforce_sorted=True keeps exactly this order
  # instead of letting pack_sequence re-sort it (which could reorder ties).
  text_seq = pack_sequence(text_seq, enforce_sorted=True).to(device)
  target = torch.LongTensor([i[1] for i in sorted_batch]).to(device)
  return text_seq, target

trainloader = DataLoader(train_data, batch_size=64, shuffle=True, collate_fn=packed)
testloader = DataLoader(test_data, batch_size=128, shuffle=True, collate_fn=packed)
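
The collate step can be checked on a toy dataset without a GPU. This is only a sketch: the three `(tensor, label)` pairs and the name `collate_packed` are hypothetical, and `enforce_sorted=True` is a choice made for this sketch because the batch is sorted by hand first.

```python
import torch
from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pack_sequence, pad_packed_sequence

# Hypothetical toy data: (token-id tensor, label) pairs of different lengths.
toy = [(torch.tensor([5, 6]), 0),
       (torch.tensor([7, 8, 9, 10]), 1),
       (torch.tensor([11]), 2)]

def collate_packed(batch):
    # Sort longest first so labels stay aligned with the packed order.
    batch = sorted(batch, key=lambda x: x[0].shape[0], reverse=True)
    seqs = pack_sequence([s for s, _ in batch], enforce_sorted=True)
    labels = torch.tensor([y for _, y in batch], dtype=torch.long)
    return seqs, labels

loader = DataLoader(toy, batch_size=3, shuffle=False, collate_fn=collate_packed)
seqs, labels = next(iter(loader))

# Unpack only to inspect the lengths; the model itself never needs this step.
_, lengths = pad_packed_sequence(seqs, batch_first=True)
print(lengths.tolist())  # [4, 2, 1]
print(labels.tolist())   # [1, 0, 2]
```

The labels come out in the same longest-first order as the sequences, which is the property the loss computation relies on.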

In [None]:
class Seq_Model(nn.Module):

  def __init__(self, embedding_size, lstm_hidden_units, n_layers, n_classes, bi_d = True, drop_rate = 0.3):
    super(Seq_Model, self).__init__()

    self.embedding = nn.Embedding.from_pretrained(torch.as_tensor(embed_matrix))
    self.seq = nn.LSTM(embedding_size, lstm_hidden_units, num_layers = n_layers, bidirectional = bi_d, dropout = drop_rate)
    self.fc = nn.Linear(2 * lstm_hidden_units if bi_d else lstm_hidden_units, n_classes)
    self.dropout = nn.Dropout(drop_rate)
    self.bi_d = bi_d

  def forward(self, input_batch):
    """
      ref : "https://discuss.pytorch.org/t/how-to-use-pack-sequence-if-we-are-going-to-use-word-embedding-and-bilstm/28184/4"
      the problem is packed sequences can be directly fed to LSTM/RNN but not to the embedding layer!
    """
    # input_batch is a PackedSequence of token ids, not a [sent len, batch size] tensor.
    # The embedding is applied to its flat data, giving a PackedSequence of embeddings.
    embed = simple_elementwise_apply(self.embedding, input_batch)
    packed_output, (hidden, cell) = self.seq(embed)
    # hidden = [n_layers * n_directions, batch size, hidden dim];
    # hidden[-2] and hidden[-1] are the last layer's forward and backward states.
    hidden = self.dropout(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1) if self.bi_d else hidden[-1])
    return self.fc(hidden)


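def simple_elementwise_apply(fn, packed_sequence):
    """Applies a pointwise function fn to each element in packed_sequence, keeping its index metadata."""
    return torch.nn.utils.rnn.PackedSequence(
        fn(packed_sequence.data),
        packed_sequence.batch_sizes,
        packed_sequence.sorted_indices,    # None when packed with enforce_sorted=True
        packed_sequence.unsorted_indices,  # lets the LSTM return states in the original batch order
    )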

model = Seq_Model(300, 512, 3, 4)
print(f'The number of trainable parameters is : {sum(p.numel() for p in model.parameters() if p.requires_grad):,}')

The number of trainable parameters is : 15,937,540
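
As a sanity check on embedding the flat data of a packed sequence, the sketch below compares the final hidden states from a packed batch with those from running each sequence on its own. The tiny vocabulary, layer sizes and the helper name `embed_packed` are assumptions made for this example. The batch is deliberately not sorted by length, so the check only passes because `sorted_indices` and `unsorted_indices` are carried over to the new `PackedSequence`.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_sequence, PackedSequence

torch.manual_seed(0)
emb = nn.Embedding(20, 8)                  # hypothetical tiny vocabulary
lstm = nn.LSTM(8, 16, bidirectional=True)  # hypothetical small LSTM

def embed_packed(embedding, packed_seq):
    # Apply the embedding to the flat data and keep all index metadata.
    return PackedSequence(embedding(packed_seq.data), packed_seq.batch_sizes,
                          packed_seq.sorted_indices, packed_seq.unsorted_indices)

# Deliberately NOT sorted by length.
seqs = [torch.tensor([1, 2]), torch.tensor([3, 4, 5, 6]), torch.tensor([7])]
packed_ids = pack_sequence(seqs, enforce_sorted=False)

with torch.no_grad():
    _, (h_batch, _) = lstm(embed_packed(emb, packed_ids))
    # Reference: run every sequence on its own as a batch of one.
    h_single = [lstm(emb(s).unsqueeze(1))[1][0] for s in seqs]

# Row i of the batched hidden state should match sequence i run alone.
for i, h in enumerate(h_single):
    assert torch.allclose(h_batch[:, i], h[:, 0], atol=1e-5)
print(h_batch.shape)  # torch.Size([2, 3, 16])
```

If the index metadata were dropped for an unsorted batch, the hidden states would come back in length-sorted order and would no longer line up with the labels.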


In [None]:
optimizer = optim.Adam(model.parameters())
criterion = nn.CrossEntropyLoss()
model = model.to(device)
criterion = criterion.to(device)

In [None]:
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

def accuracy(preds, true):
  _, index = torch.max(preds, dim = 1)
  return (index == true).sum().float() / len(preds)
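
A quick check of the `accuracy` helper on hand-made logits (the numbers below are invented for illustration):

```python
import torch

def accuracy(preds, true):
    # preds: [batch size, n_classes] logits; true: [batch size] class ids
    _, index = torch.max(preds, dim=1)
    return (index == true).sum().float() / len(preds)

logits = torch.tensor([[2.0, 0.1, 0.3, 0.0],   # argmax 0
                       [0.2, 0.1, 1.5, 0.0],   # argmax 2
                       [0.0, 3.0, 0.1, 0.2],   # argmax 1
                       [1.0, 0.0, 0.0, 0.5]])  # argmax 0
labels = torch.tensor([0, 2, 3, 0])
print(accuracy(logits, labels).item())  # 0.75
```

Three of the four predictions match their labels, hence 0.75. Note that the training and evaluation functions average this value over batches, so a smaller final batch carries the same weight as a full one.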

In [None]:
def train_m(model, iterator, optimizer, criterion):
  """Runs one training epoch; returns the mean batch loss and mean batch accuracy."""
  e_loss = 0
  e_acc = 0
  model.train()

  for inputs, labels in iterator:
    optimizer.zero_grad()
    preds = model(inputs)  # [batch size, n_classes] logits
    acc = accuracy(preds, labels)
    loss = criterion(preds, labels)
    loss.backward()
    optimizer.step()
    e_loss += loss.item()
    e_acc += acc.item()
  return e_loss/len(iterator), e_acc/len(iterator)

def evaluate_m(model, iterator, criterion):
  """Evaluates without gradient tracking; returns the mean batch loss and mean batch accuracy."""
  e_loss = 0
  e_acc = 0
  model.eval()
  with torch.no_grad():
    for inputs, labels in iterator:
      preds = model(inputs)  # [batch size, n_classes] logits
      loss = criterion(preds, labels)
      acc = accuracy(preds, labels)
      e_loss += loss.item()
      e_acc += acc.item()
  return e_loss/len(iterator), e_acc/len(iterator)
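
The training loop that follows initialises `best_valid_loss` but, as shown, never updates it. One possible use is to remember the epoch with the lowest validation loss. The sketch below replays the validation losses from the first nine epochs of this notebook's log rather than training anything; the checkpoint file name in the comment is hypothetical.

```python
# Validation losses copied from epochs 1-9 of the training log in this notebook.
valid_losses = [1.382, 1.379, 1.400, 1.390, 1.408, 1.421, 1.415, 1.406, 1.439]

best_valid_loss = float('inf')
best_epoch = None
for epoch, valid_loss in enumerate(valid_losses, start=1):
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        best_epoch = epoch
        # In a real run, this is where one might checkpoint the model, e.g.
        # torch.save(model.state_dict(), 'best_model.pt')  # hypothetical file name

print(best_epoch, best_valid_loss)  # 2 1.379
```

On these numbers the validation loss is lowest at epoch 2 and drifts upward afterwards while the training loss keeps falling.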

In [None]:
N_EPOCHS = 20

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):

    start_time = time.time()
    
    train_loss, train_acc = train_m(model, trainloader, optimizer, criterion)
    valid_loss, valid_acc = evaluate_m(model, testloader, criterion)
    
    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)

    print(f'Epoch: {epoch+1:02} / {N_EPOCHS} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

Epoch: 01 / 20 | Epoch Time: 1m 10s
	Train Loss: 1.216 | Train Acc: 48.54%
	 Val. Loss: 1.382 |  Val. Acc: 35.53%
Epoch: 02 / 20 | Epoch Time: 1m 9s
	Train Loss: 1.155 | Train Acc: 50.79%
	 Val. Loss: 1.379 |  Val. Acc: 36.04%
Epoch: 03 / 20 | Epoch Time: 1m 9s
	Train Loss: 1.134 | Train Acc: 51.66%
	 Val. Loss: 1.400 |  Val. Acc: 36.61%
Epoch: 04 / 20 | Epoch Time: 1m 9s
	Train Loss: 1.119 | Train Acc: 52.69%
	 Val. Loss: 1.390 |  Val. Acc: 35.11%
Epoch: 05 / 20 | Epoch Time: 1m 9s
	Train Loss: 1.109 | Train Acc: 53.20%
	 Val. Loss: 1.408 |  Val. Acc: 35.79%
Epoch: 06 / 20 | Epoch Time: 1m 9s
	Train Loss: 1.091 | Train Acc: 54.16%
	 Val. Loss: 1.421 |  Val. Acc: 36.68%
Epoch: 07 / 20 | Epoch Time: 1m 9s
	Train Loss: 1.084 | Train Acc: 54.46%
	 Val. Loss: 1.415 |  Val. Acc: 34.89%
Epoch: 08 / 20 | Epoch Time: 1m 9s
	Train Loss: 1.079 | Train Acc: 54.79%
	 Val. Loss: 1.406 |  Val. Acc: 35.84%
Epoch: 09 / 20 | Epoch Time: 1m 9s
	Train Loss: 1.068 | Train Acc: 55.43%
	 Val. Loss: 1.439 | 