<a href="https://colab.research.google.com/github/sanjeevr5/NLP/blob/main/Torch_NLP_SERIES_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 2. Multi-Class Classification (Predicting News Category) Without TorchText


- Architecture Used : Stacked Bi-Directional LSTM with BPEmb embeddings (https://github.com/bheinzerling/bpemb)
- Reference : https://github.com/bentrevett/pytorch-sentiment-analysis
- Field : https://github.com/pytorch/text/blob/master/torchtext/data/field.py
- Dataset : https://github.com/mhjabreel/CharCnn_Keras/blob/master/data/ag_news_csv
- Label Reference : {0: 'WORLD', 1: 'SPORTS', 2: 'BIZ', 3: 'TECH'}
- **Concepts covered : packed_sequences, custom data iterator and data loader for variable length data**

In [None]:
%%capture
!pip install bpemb
!wget https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/train.csv
!wget https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/test.csv

In [None]:
import pandas as pd
import torch
import torch.nn as nn
from torch.utils import data
from torch.nn.utils.rnn import pack_sequence
import time
import torch.optim as optim

from bpemb import BPEmb
bpemb_en = BPEmb(lang="en", dim=300, vs=10000)  # vs is the vocabulary size
print(bpemb_en.vectors.shape)
embed_matrix = bpemb_en.vectors
SEED = 10
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

downloading https://nlp.h-its.org/bpemb/en/en.wiki.bpe.vs10000.model


100%|██████████| 400869/400869 [00:00<00:00, 570003.42B/s]


downloading https://nlp.h-its.org/bpemb/en/en.wiki.bpe.vs10000.d300.w2v.bin.tar.gz


100%|██████████| 11189884/11189884 [00:01<00:00, 6605870.04B/s]


(10000, 300)


In [None]:
!head -5 train.csv

"3","Wall St. Bears Claw Back Into the Black (Reuters)","Reuters - Short-sellers, Wall Street's dwindling\band of ultra-cynics, are seeing green again."
"3","Carlyle Looks Toward Commercial Aerospace (Reuters)","Reuters - Private investment firm Carlyle Group,\which has a reputation for making well-timed and occasionally\controversial plays in the defense industry, has quietly placed\its bets on another part of the market."
"3","Oil and Economy Cloud Stocks' Outlook (Reuters)","Reuters - Soaring crude prices plus worries\about the economy and the outlook for earnings are expected to\hang over the stock market next week during the depth of the\summer doldrums."
"3","Iraq Halts Oil Exports from Main Southern Pipeline (Reuters)","Reuters - Authorities have halted oil export\flows from the main pipeline in southern Iraq after\intelligence showed a rebel militia could strike\infrastructure, an oil official said on Saturday."
"3","Oil prices soar to all-time record, posing new menace to US e

In [None]:
!sed -i '1s/^/"cat","title","news"\n/' ./train.csv
!sed -i '1s/^/"cat","title","news"\n/' ./test.csv

In [None]:
!head -5 test.csv

"cat","title","news"
"3","Fears for T N pension after talks","Unions representing workers at Turner   Newall say they are 'disappointed' after talks with stricken parent firm Federal Mogul."
"4","The Race is On: Second Private Team Sets Launch Date for Human Spaceflight (SPACE.com)","SPACE.com - TORONTO, Canada -- A second\team of rocketeers competing for the  #36;10 million Ansari X Prize, a contest for\privately funded suborbital space flight, has officially announced the first\launch date for its manned rocket."
"4","Ky. Company Wins Grant to Study Peptides (AP)","AP - A company founded by a chemistry researcher at the University of Louisville won a grant to develop a method of producing better peptides, which are short chains of amino acids, the building blocks of proteins."
"4","Prediction Unit Helps Forecast Wildfires (AP)","AP - It's barely dawn when Mike Fitzpatrick starts his shift with a blur of colorful maps, figures and endless charts, but already he knows what the day will

In [None]:
train_data = pd.read_csv('./train.csv', sep = ',')
test_data = pd.read_csv('./test.csv', sep = ',')

In [None]:
print(f'Train data shape : {train_data.shape}')
print(f'Test data shape : {test_data.shape}')
train_data.head()

Train data shape : (120000, 3)
Test data shape : (7600, 3)


| | cat | title | news |
|---|---|---|---|
| 0 | 3 | Wall St. Bears Claw Back Into the Black (Reuters) | Reuters - Short-sellers, Wall Street's dwindli... |
| 1 | 3 | Carlyle Looks Toward Commercial Aerospace (Reu... | Reuters - Private investment firm Carlyle Grou... |
| 2 | 3 | Oil and Economy Cloud Stocks' Outlook (Reuters) | Reuters - Soaring crude prices plus worries\ab... |
| 3 | 3 | Iraq Halts Oil Exports from Main Southern Pipe... | Reuters - Authorities have halted oil export\f... |
| 4 | 3 | Oil prices soar to all-time record, posing new... | AFP - Tearaway world oil prices, toppling reco... |


In [None]:
train_text = train_data.title.map(lambda x : x.split(' (')[0].lower() )
test_text = test_data.title.map(lambda x : x.split(' (')[0].lower() )
train_label = train_data.cat.values - 1 # labels should start from 0
test_label = test_data.cat.values - 1
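
The preprocessing above can be checked on a couple of titles in plain Python. This is a minimal sketch: the two rows are copied from the `head` output shown earlier, and `clean_title` is a hypothetical helper name that mirrors the lambda used in the cell. Note that a title containing ` (` anywhere is cut at the first occurrence, not only at a trailing source tag.

```python
# Two (category, title) pairs copied from the head of train.csv
rows = [
    ("3", "Wall St. Bears Claw Back Into the Black (Reuters)"),
    ("3", "Oil and Economy Cloud Stocks' Outlook (Reuters)"),
]

def clean_title(title):
    """Drop everything from the first ' (' onward, then lowercase (hypothetical helper)."""
    return title.split(' (')[0].lower()

cleaned = [clean_title(title) for _, title in rows]
# The CSV categories run from 1 to 4, while the loss expects class ids 0 to 3
labels = [int(cat) - 1 for cat, _ in rows]

print(cleaned)
print(labels)  # category 3 becomes class 2, i.e. 'BIZ' in the label reference
```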

In [None]:
print('The train text shape : ', train_text.shape)
print('The test text shape : ', test_text.shape)
print('The train label shape : ', train_label.shape)
print('The test label shape : ', test_label.shape)

The train text shape :  (120000,)
The test text shape :  (7600,)
The train label shape :  (120000,)
The test label shape :  (7600,)


In [None]:
train_encoded = [torch.tensor(bpemb_en.encode_ids(item)) for item in train_text]
test_encoded = [torch.tensor(bpemb_en.encode_ids(item)) for item in test_text]

In [None]:
class Data_Iterator(data.Dataset):

  def __init__(self, text, label):
    super(Data_Iterator, self).__init__()
    assert len(text) == len(label)
    self.text = text
    self.label = label
  
  def __len__(self):
    return len(self.label)

  def __getitem__(self, index):
    return self.text[index], self.label[index]

train_data = Data_Iterator(train_encoded, train_label)
test_data = Data_Iterator(test_encoded, test_label)

In [None]:
def packed(batch):
  """
  Collate function that turns a list of (token_ids, label) pairs into a PackedSequence batch.
  Reference : https://www.codefull.net/2018/11/use-pytorchs-dataloader-with-variable-length-sequences-for-lstm-gru/
  The main reason for using a DataLoader is parallel loading, although it is not used here.
  """
  # Sort longest-first so the packed rows, the lengths and the targets all share one order
  sorted_batch = sorted(batch, key=lambda x: x[0].shape[0], reverse=True)
  text_seq = [i[0] for i in sorted_batch]
  # The batch is already sorted, so enforce_sorted=True packs it without any internal reordering
  text_seq = pack_sequence(text_seq, enforce_sorted=True).to(device)
  leng = torch.LongTensor([len(i[0]) for i in sorted_batch]).to(device)
  target = torch.LongTensor([i[1] for i in sorted_batch]).to(device)
  return text_seq, leng, target

# Reshuffle the training data every epoch; keep the test data in a fixed order
trainloader = torch.utils.data.DataLoader(train_data, batch_size=64, shuffle=True, collate_fn=packed)
testloader = torch.utils.data.DataLoader(test_data, batch_size=64, shuffle=False, collate_fn=packed)
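
To see what the collate function hands to the model, the sketch below packs three toy sequences and looks inside the resulting `PackedSequence`. The token ids and the tiny embedding table are made up for illustration. The sequences are already sorted longest-first, as the collate function arranges them. The last part shows why an embedding layer can be applied to `packed.data` directly: it is a flat tensor of token ids.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_sequence, pad_packed_sequence, PackedSequence

# Three toy "sentences" of token ids, sorted longest-first
seqs = [torch.tensor([4, 5, 6]), torch.tensor([7, 8]), torch.tensor([9])]
packed = pack_sequence(seqs, enforce_sorted=True)

# data is time-major: all first tokens, then all second tokens, and so on
print(packed.data)         # tensor([4, 7, 9, 5, 8, 6])
# batch_sizes counts how many sequences are still running at each timestep
print(packed.batch_sizes)  # tensor([3, 2, 1])

# An embedding layer accepts the flat data tensor; re-wrap the result so the
# LSTM still knows where each sequence ends (toy table: 10 ids, 2 dimensions)
embedding = nn.Embedding(10, 2)
embedded = PackedSequence(embedding(packed.data), packed.batch_sizes,
                          packed.sorted_indices, packed.unsorted_indices)

# Padding back is only done here to inspect the shapes
padded, lengths = pad_packed_sequence(embedded, batch_first=True)
print(padded.shape)  # torch.Size([3, 3, 2]) = [batch, max_len, emb_dim]
print(lengths)       # tensor([3, 2, 1])
```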

<b> Implementation Details </b>

- We stack bi-directional LSTM layers on top of one another.
- An RNN returns output, hidden, whereas an LSTM returns output, (hidden, cell). The additional cell state is what helps the LSTM mitigate the vanishing gradient problem.
- As with an RNN, the hidden and cell states consist of one vector of size hidden units per sequence, taken at the last step each direction processes.
- Because the LSTM is bi-directional, every layer contributes two such vectors, one for the forward pass and one for the backward pass, stacked along the first dimension as [forward_layer_0, backward_layer_0, forward_layer_1, backward_layer_1, ..., forward_layer_n, backward_layer_n].
- The final hidden or cell state therefore has the shape [2 * n_layers, batch_size, hidden_dims].
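
The shapes described in this list can be checked on a small randomly initialised LSTM. This is a sketch with toy sizes chosen only for illustration. It uses a plain padded tensor in place of a packed batch so that the hidden states can be compared against the per-timestep output.

```python
import torch
import torch.nn as nn

# Toy sizes, chosen only to keep the example small
n_layers, batch_size, seq_len, emb_dim, hidden_dim = 2, 3, 5, 8, 16
lstm = nn.LSTM(emb_dim, hidden_dim, num_layers=n_layers, bidirectional=True)

x = torch.randn(seq_len, batch_size, emb_dim)  # [seq_len, batch_size, emb_dim]
output, (hidden, cell) = lstm(x)

print(output.shape)  # [seq_len, batch_size, 2 * hidden_dim]
print(hidden.shape)  # [2 * n_layers, batch_size, hidden_dim]

# Separate the layer axis from the direction axis
h = hidden.reshape(n_layers, 2, batch_size, hidden_dim)
last_fwd, last_bwd = h[-1, 0], h[-1, 1]

# hidden[-2] and hidden[-1] are the forward and backward states of the top layer
assert torch.equal(last_fwd, hidden[-2])
assert torch.equal(last_bwd, hidden[-1])

# The forward state matches the output at the last timestep, the backward
# state matches the output at the first timestep
assert torch.allclose(output[-1, :, :hidden_dim], last_fwd)
assert torch.allclose(output[0, :, hidden_dim:], last_bwd)

# Concatenating both directions gives one vector per sequence for the classifier
sentence_vec = torch.cat((hidden[-2], hidden[-1]), dim=1)
print(sentence_vec.shape)  # [batch_size, 2 * hidden_dim]
```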

In [None]:
class Seq_Model(nn.Module):

  def __init__(self, embedding_size, lstm_hidden_units, n_layers, n_classes, bi_d = True, drop_rate = 0.3):
    super(Seq_Model, self).__init__()

    self.embedding = nn.Embedding.from_pretrained(torch.as_tensor(embed_matrix))
    self.seq = nn.LSTM(embedding_size, lstm_hidden_units, num_layers = n_layers, bidirectional = bi_d, dropout = drop_rate)
    self.fc = nn.Linear(2 * lstm_hidden_units, n_classes)
    self.dropout = nn.Dropout(drop_rate)

  def forward(self, input_batch):
    """
      ref : "https://discuss.pytorch.org/t/how-to-use-pack-sequence-if-we-are-going-to-use-word-embedding-and-bilstm/28184/4"
      A PackedSequence can be fed directly to an LSTM/RNN but not to the embedding layer,
      so the embedding is applied to the packed data tensor instead.
    """
    # input_batch is a PackedSequence of token ids
    embed = simple_elementwise_apply(self.embedding, input_batch)
    # embed is a PackedSequence of embedding vectors
    packed_output, (hidden, cell) = self.seq(embed)
    # hidden = [2 * n_layers, batch_size, hidden_dims]
    # Concatenate the top layer's forward (hidden[-2]) and backward (hidden[-1]) hidden states
    hidden = self.dropout(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1))
    return self.fc(hidden)

def simple_elementwise_apply(fn, packed_sequence):
    """applies a pointwise function fn to each element in packed_sequence"""
    # Carry over sorted_indices/unsorted_indices so that a batch packed with
    # enforce_sorted=False is still returned by the LSTM in its original order
    return torch.nn.utils.rnn.PackedSequence(fn(packed_sequence.data), packed_sequence.batch_sizes,
                                             packed_sequence.sorted_indices, packed_sequence.unsorted_indices)

model = Seq_Model(300, 512, 2, 4)
print('The number of trainable parameters are :', sum(p.numel() for p in model.parameters() if p.requires_grad))

The number of trainable parameters are : 9637892
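
The reported count can be reproduced by hand. The sketch below assumes PyTorch's LSTM layout (four gates per direction, each with input weights, recurrent weights and two bias vectors) and leaves out the embedding matrix, because `nn.Embedding.from_pretrained` freezes its weights by default. `lstm_direction_params` is a hypothetical helper name.

```python
def lstm_direction_params(input_size, hidden_size):
    """Parameters of one direction of one LSTM layer (hypothetical helper)."""
    # 4 gates, each with input weights, recurrent weights and two bias vectors
    return 4 * hidden_size * (input_size + hidden_size) + 2 * 4 * hidden_size

# Same sizes as Seq_Model(300, 512, 2, 4)
embedding_size, hidden, n_layers, n_classes = 300, 512, 2, 4

total = 0
for layer in range(n_layers):
    # Layer 0 reads the embeddings; deeper layers read the concatenated
    # forward and backward outputs of the layer below
    in_size = embedding_size if layer == 0 else 2 * hidden
    total += 2 * lstm_direction_params(in_size, hidden)  # two directions

total += 2 * hidden * n_classes + n_classes  # final linear layer: weights + bias
print(total)  # 9637892
```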


In [None]:
optimizer = optim.Adam(model.parameters())
criterion = nn.CrossEntropyLoss()
model = model.to(device)
criterion = criterion.to(device)

In [None]:
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

In [None]:
def accuracy(preds, true):
  _, index = torch.max(preds, dim = 1)
  return (index == true).sum().float() / len(preds)

def train_m(model, iterator, optimizer, l):
  e_loss = 0
  e_acc = 0
  model.train()

  # The collate function yields (packed_text, lengths, labels); the lengths are not needed here
  for inputs, _, labels in iterator:
    optimizer.zero_grad()
    preds = model(inputs)  # preds = [batch_size, n_classes]
    acc = accuracy(preds, labels)
    loss = l(preds, labels)
    loss.backward()
    optimizer.step()
    e_loss += loss.item()
    e_acc += acc.item()
  return e_loss/len(iterator), e_acc/len(iterator)

def evaluate_m(model, iterator, l):
  e_loss = 0
  e_acc = 0
  model.eval()
  with torch.no_grad():
    for inputs, _, labels in iterator:
      preds = model(inputs)
      loss = l(preds, labels)
      acc = accuracy(preds, labels)
      e_loss += loss.item()
      e_acc += acc.item()
  return e_loss/len(iterator), e_acc/len(iterator)
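
Both functions report the mean of the per-batch accuracies. That equals the overall accuracy only when every batch has the same size. The 7600 test rows split into 118 batches of 64 plus one final batch of 48, so the reported test figure can differ slightly from the exact one. The sketch below uses made-up batch accuracies to show the difference; `weighted_mean` is a hypothetical helper.

```python
def weighted_mean(values, sizes):
    """Average per-batch values, weighting each batch by its number of samples."""
    return sum(v * n for v, n in zip(values, sizes)) / sum(sizes)

# Made-up per-batch accuracies; the last batch is smaller than the others
batch_acc = [0.50, 0.50, 1.00]
batch_size = [64, 64, 48]

naive = sum(batch_acc) / len(batch_acc)       # what e_acc / len(iterator) computes
exact = weighted_mean(batch_acc, batch_size)  # accuracy over all 176 samples

print(round(naive, 4), round(exact, 4))
```

Accumulating the number of correct predictions and dividing by the dataset size at the end of the epoch would give the exact figure directly.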

In [None]:
N_EPOCHS = 5

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):

    start_time = time.time()
    
    train_loss, train_acc = train_m(model, trainloader, optimizer, criterion)
    valid_loss, valid_acc = evaluate_m(model, testloader, criterion)
    
    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut1-model.pt')
    
    print(f'Epoch: {epoch+1:02} / {N_EPOCHS} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

Epoch: 01 / 5 | Epoch Time: 0m 58s
	Train Loss: 1.133 | Train Acc: 54.76%
	 Val. Loss: 1.119 |  Val. Acc: 54.29%
Epoch: 02 / 5 | Epoch Time: 0m 58s
	Train Loss: 1.083 | Train Acc: 56.85%
	 Val. Loss: 1.112 |  Val. Acc: 55.00%
Epoch: 03 / 5 | Epoch Time: 0m 59s
	Train Loss: 1.036 | Train Acc: 58.96%
	 Val. Loss: 1.128 |  Val. Acc: 54.88%
Epoch: 04 / 5 | Epoch Time: 0m 59s
	Train Loss: 0.972 | Train Acc: 61.52%
	 Val. Loss: 1.187 |  Val. Acc: 54.56%
Epoch: 05 / 5 | Epoch Time: 0m 59s
	Train Loss: 0.891 | Train Acc: 64.43%
	 Val. Loss: 1.272 |  Val. Acc: 53.32%


## Inference

In [None]:
LABEL_DICT =  {0: 'WORLD', 1: 'SPORTS', 2: 'BIZ', 3: 'TECH'}

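def predict(sentence):
  model.eval()  # make sure dropout is disabled during inference
  input_tensor = pack_sequence([torch.LongTensor(bpemb_en.encode_ids(sentence))], enforce_sorted=False).to(device)
  with torch.no_grad():  # no gradients are needed for a prediction
    pred = model(input_tensor)
  _, index = torch.max(pred, dim = 1)
  return LABEL_DICT[index.item()]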

In [None]:
predict('Oil prices are up!')

'BIZ'