<a href="https://colab.research.google.com/github/sanjeevr5/NLP/blob/main/Torch_NLP_SERIES_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1. Multiclass Classification With TorchText & TabularDataset


- Architecture Used : Simple RNN
- Reference : https://github.com/bentrevett/pytorch-sentiment-analysis
- Field : https://github.com/pytorch/text/blob/master/torchtext/data/field.py
- Dataset : https://github.com/mhjabreel/CharCnn_Keras/blob/master/data/ag_news_csv
- Label : {1: 'WORLD', 2: 'SPORTS', 3: 'BIZ', 4: 'TECH'}
- Embeddings : learned from scratch with nn.Embedding (no pre-trained vectors)


In [None]:
import torch
from torchtext import data
from torchtext.data import TabularDataset
import torch.nn as nn
import torch.optim as optim
SEED = 34
torch.manual_seed(SEED)

NEWS = data.Field(tokenize = 'spacy', lower = True)
CLASS = data.LabelField(dtype = torch.long)

In [None]:
%%capture
!wget https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/train.csv
!wget https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/test.csv


In [None]:
!head -5 train.csv

"3","Wall St. Bears Claw Back Into the Black (Reuters)","Reuters - Short-sellers, Wall Street's dwindling\band of ultra-cynics, are seeing green again."
"3","Carlyle Looks Toward Commercial Aerospace (Reuters)","Reuters - Private investment firm Carlyle Group,\which has a reputation for making well-timed and occasionally\controversial plays in the defense industry, has quietly placed\its bets on another part of the market."
"3","Oil and Economy Cloud Stocks' Outlook (Reuters)","Reuters - Soaring crude prices plus worries\about the economy and the outlook for earnings are expected to\hang over the stock market next week during the depth of the\summer doldrums."
"3","Iraq Halts Oil Exports from Main Southern Pipeline (Reuters)","Reuters - Authorities have halted oil export\flows from the main pipeline in southern Iraq after\intelligence showed a rebel militia could strike\infrastructure, an oil official said on Saturday."
"3","Oil prices soar to all-time record, posing new menace to US e

In [None]:
!sed -i '1s/^/"cat","title","news"\n/' ./train.csv
!sed -i '1s/^/"cat","title","news"\n/' ./test.csv
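The `sed` calls above prepend a header row so the columns can be referred to by name. A rough Python equivalent is sketched below; `prepend_header` is a hypothetical helper that is not part of the notebook, and the demo writes to a temporary file rather than the real `train.csv`.

```python
import os
import tempfile

HEADER = '"cat","title","news"\n'

def prepend_header(path, header=HEADER):
    """Insert `header` as the first line of `path` unless it is already there."""
    with open(path, "r", encoding="utf-8") as f:
        content = f.read()
    if content.startswith(header):
        return  # header already present; do not add it twice
    with open(path, "w", encoding="utf-8") as f:
        f.write(header + content)

# Demo on a throwaway file (the row below is made-up data)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write('"3","Some title","Some text"\n')
    prepend_header(demo_path)
    prepend_header(demo_path)  # second call is a no-op
    with open(demo_path, encoding="utf-8") as f:
        demo_lines = f.read().splitlines()
```

Unlike the `sed` one-liner, this version can be re-run safely: running the `sed` cell twice adds a second header line, which `skip_header = True` would not remove.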

In [None]:
!head -5 train.csv

"cat","title","news"
"3","Wall St. Bears Claw Back Into the Black (Reuters)","Reuters - Short-sellers, Wall Street's dwindling\band of ultra-cynics, are seeing green again."
"3","Carlyle Looks Toward Commercial Aerospace (Reuters)","Reuters - Private investment firm Carlyle Group,\which has a reputation for making well-timed and occasionally\controversial plays in the defense industry, has quietly placed\its bets on another part of the market."
"3","Oil and Economy Cloud Stocks' Outlook (Reuters)","Reuters - Soaring crude prices plus worries\about the economy and the outlook for earnings are expected to\hang over the stock market next week during the depth of the\summer doldrums."
"3","Iraq Halts Oil Exports from Main Southern Pipeline (Reuters)","Reuters - Authorities have halted oil export\flows from the main pipeline in southern Iraq after\intelligence showed a rebel militia could strike\infrastructure, an oil official said on Saturday."


In [None]:
field_names = [('cat', CLASS), ('title', NEWS), (2, None)]
train, test = TabularDataset.splits('/content/', train = 'train.csv', test = 'test.csv', format = 'CSV', fields = field_names, skip_header = True)
#train, val = train.split(stratified = True, strata_field = 'cat', random_state = SEED) # WORKS ON TENSOR BASIS ONLY (NOT USING SPLITS)

In [None]:
NEWS.build_vocab(train, max_size = 15000) #<UNK> and <PAD> tokens are attached hence 15k+2
CLASS.build_vocab(train)
print(f"Unique tokens in TEXT vocabulary: {len(NEWS.vocab)}")
print(f"Unique tokens in LABEL vocabulary: {len(CLASS.vocab)}")

Unique tokens in TEXT vocabulary: 15002
Unique tokens in LABEL vocabulary: 4
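The count of 15002 is the 15000 most frequent tokens plus the two special tokens. The idea can be sketched in pure Python as follows; this is not torchtext's implementation, `build_vocab` is a hypothetical helper, and the corpus is toy data.

```python
from collections import Counter

def build_vocab(tokenized_texts, max_size, specials=("<unk>", "<pad>")):
    """Return (itos, stoi): specials first, then the max_size most frequent tokens."""
    freqs = Counter(tok for text in tokenized_texts for tok in text)
    itos = list(specials) + [tok for tok, _ in freqs.most_common(max_size)]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi

corpus = [["oil", "prices", "soar"], ["oil", "exports", "halted"], ["stocks", "soar"]]
itos, stoi = build_vocab(corpus, max_size=2)

# Tokens outside the vocabulary fall back to the <unk> index (0).
encoded = [stoi.get(tok, stoi["<unk>"]) for tok in ["oil", "prices", "soar"]]
```

With `max_size=2` the vocabulary has four entries, and "prices" is encoded as the unknown token because it did not make the cut.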


In [None]:
print(NEWS.vocab.freqs.most_common(20)) #most common items
print(NEWS.vocab.itos[:10])#integer to string
print(CLASS.vocab.stoi) #string to integer

[('to', 23896), ('in', 17660), ('(', 17132), (')', 17130), (',', 16321), ('-', 13503), ('#', 12950), ('for', 12435), (':', 9629), ('on', 9584), ('of', 9078), (';', 7780), ('ap', 7777), ('the', 6415), ('39;s', 6160), ('a', 4915), ("'", 4328), ('reuters', 4261), ('at', 4231), ('with', 4088)]
['<unk>', '<pad>', 'to', 'in', '(', ')', ',', '-', '#', 'for']
defaultdict(<function _default_unk_index at 0x7f51ff9a21e0>, {'1': 0, '2': 1, '3': 2, '4': 3})
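The printed `CLASS.vocab.stoi` shows that the raw labels '1' to '4' are re-indexed to 0 to 3, so model outputs have to be decoded with the 0-based index. A small sketch of composing the two mappings follows; the plain dict is copied from the output above and stands in for the real vocab object.

```python
# Raw label -> name, from the dataset description at the top of the notebook
LABEL_NAMES = {'1': 'WORLD', '2': 'SPORTS', '3': 'BIZ', '4': 'TECH'}

# Raw label -> model index, copied from the printed CLASS.vocab.stoi
stoi = {'1': 0, '2': 1, '3': 2, '4': 3}

# Model index -> name, derived rather than hard-coded
index_to_name = {idx: LABEL_NAMES[raw] for raw, idx in stoi.items()}
```

Deriving the lookup this way keeps it correct even if the vocabulary assigns indices in a different order on another dataset.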


In [None]:
print(f'Number of training examples: {len(train)}')
print(f'Number of testing examples: {len(test)}')

Number of training examples: 120000
Number of testing examples: 7600


- We will predict the category using only the title; the news body (third column) is ignored

In [None]:
BATCH_SIZE = 64

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iterator, test_iterator = data.BucketIterator.splits(
    (train, test), sort = True, sort_key = lambda x : len(x.title), sort_within_batch = True, 
    batch_size = BATCH_SIZE,
    device = device)
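The `sort_key` groups titles of similar length so that each batch needs little padding. A pure-Python sketch of that idea is given below. It is not the `BucketIterator` implementation; `make_batches` and `padding_needed` are hypothetical helpers and the lengths are made up.

```python
def make_batches(lengths, batch_size, sort_by_length):
    """Split example lengths into consecutive batches, optionally sorting first."""
    items = sorted(lengths) if sort_by_length else list(lengths)
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

def padding_needed(batches):
    """Count the pad tokens required to bring every example up to its batch maximum."""
    return sum(max(batch) - n for batch in batches for n in batch)

title_lengths = [3, 12, 4, 11, 5, 10, 3, 12]

unsorted_pad = padding_needed(make_batches(title_lengths, 4, sort_by_length=False))
sorted_pad = padding_needed(make_batches(title_lengths, 4, sort_by_length=True))
```

In this toy case sorting cuts the padding from 36 tokens to 8. The trade-off is that a fully sorted dataset presents the examples in the same length order every epoch.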

## 1 Simple RNN Training

- The hidden state is a vector whose size equals the number of hidden units
- The input at each time step is the embedding vector of the current token
- The initial hidden state consists only of zeros

> In PyTorch the batch is the second dimension by default, so the input to the model is a tensor of token indices of shape [sentence_len, batch_size] <br>
> The output of the embedding layer is [sentence_len, batch_size, embedding_dim]<br>
> At each time step the RNN computes h_t = tanh(W_ih x_t + b_ih + W_hh h_(t-1) + b_hh), where x_t is the embedding vector and h_(t-1) is the previous hidden state<br>
> The RNN returns two tensors: the output and the hidden state <br>
> The hidden tensor has shape [1, batch_size, hidden_size]: one final hidden vector per title in the batch <br>
> The output tensor has shape [sentence_len, batch_size, hidden_size]: one hidden vector per time step for each title in the batch <br>
> The difference is that the output holds the hidden state for every time step, while the hidden tensor holds only the hidden state of the final time step
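The relationship between the per-step output and the final hidden state can be sketched with a hand-written recurrence. This is a pure-Python illustration for a single sequence, not `nn.RNN` itself: `rnn_forward` is a hypothetical helper, the weights are arbitrary, and the two bias vectors of `nn.RNN` are folded into one `b`.

```python
import math

def rnn_forward(inputs, w_ih, w_hh, b):
    """Run a single-layer tanh RNN over one sequence.

    inputs: list of input vectors, one per time step
    w_ih:   hidden_size x input_size weight matrix
    w_hh:   hidden_size x hidden_size weight matrix
    b:      bias vector of length hidden_size
    Returns (outputs, hidden): every hidden state, and the final one.
    """
    hidden_size = len(w_hh)
    h = [0.0] * hidden_size  # the initial hidden state is all zeros
    outputs = []
    for x in inputs:
        # The comprehension reads the previous h; h is rebound only afterwards.
        h = [
            math.tanh(
                sum(w * xi for w, xi in zip(w_ih[j], x))
                + sum(w * hi for w, hi in zip(w_hh[j], h))
                + b[j]
            )
            for j in range(hidden_size)
        ]
        outputs.append(h)
    return outputs, h

# Toy sizes: 3 time steps, input size 2, hidden size 2
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_ih = [[0.5, -0.3], [0.1, 0.8]]
w_hh = [[0.2, 0.0], [-0.4, 0.6]]
b = [0.0, 0.1]

outputs, hidden = rnn_forward(sequence, w_ih, w_hh, b)
```

`outputs` has one entry per time step and `outputs[-1]` is the same vector as `hidden`, which is the property the `assert` in the model's `forward` checks.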

In [None]:
class RNN_Single(nn.Module):

  def __init__(self, input_dim, embed_size, hidden_state_size, classes):
    super(RNN_Single, self).__init__()
    self.embed = nn.Embedding(input_dim, embed_size)
    self.rnn = nn.RNN(embed_size, hidden_state_size)
    self.fc = nn.Linear(hidden_state_size, classes)

  def forward(self, input_batch):

    embedding_batch = self.embed(input_batch)
    # embedding_batch is [sentence_len, batch_size, embed_size]
    # Note: BucketIterator already pads each batch to a common length, so no manual padding is done here.
    # Sequence lengths are not passed to the RNN, so pad tokens are processed like any other token.
    output, hidden = self.rnn(embedding_batch)
    hidden = hidden.squeeze(0)
    assert torch.equal(output[-1,:,:], hidden) # Comparing the last time step output vector to the hidden vector and this should be equal
    return self.fc(hidden)

INPUT_DIM = len(NEWS.vocab)
EMBED_DIM = 128
HIDDEN_UNITS = 512
CLASSES = 4

model = RNN_Single(INPUT_DIM, EMBED_DIM, HIDDEN_UNITS, CLASSES)

print('The number of trainable parameters are :', sum(p.numel() for p in model.parameters() if p.requires_grad))

The number of trainable parameters are : 2251012
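The reported parameter count can be reproduced by hand from the layer sizes. The sketch below assumes the layout used in this notebook (embedding, single-layer unidirectional RNN with two bias vectors, linear head); `count_parameters` is a hypothetical helper.

```python
def count_parameters(vocab_size, embed_dim, hidden_units, classes):
    """Trainable parameters of an Embedding -> single-layer RNN -> Linear stack."""
    embedding = vocab_size * embed_dim
    rnn = (embed_dim * hidden_units        # input-to-hidden weights
           + hidden_units * hidden_units   # hidden-to-hidden weights
           + 2 * hidden_units)             # two bias vectors (b_ih and b_hh)
    linear = hidden_units * classes + classes
    return {"embedding": embedding, "rnn": rnn, "linear": linear,
            "total": embedding + rnn + linear}

counts = count_parameters(vocab_size=15002, embed_dim=128, hidden_units=512, classes=4)
```

The total matches the 2251012 printed above, and roughly 85% of it is the embedding table.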


In [None]:
optimizer = optim.SGD(model.parameters(), lr = 1e-3)
criterion = nn.CrossEntropyLoss()
model = model.to(device)
criterion = criterion.to(device)

Note: the referenced tutorial declares its LABEL field with dtype = torch.float because its binary loss, BCEWithLogitsLoss, expects float targets. Here the criterion is nn.CrossEntropyLoss, which expects the targets to be class indices in a LongTensor, which is why CLASS was declared with dtype = torch.long. The alternative is to do the conversion inside the training function by passing batch.cat.long() to the criterion.

In [None]:
def accuracy(preds, true):
  _, index = torch.max(preds, dim = 1)
  return (index == true).sum().float() / len(preds)

def train_m(model, iterator, optimizer, l):
  e_loss = 0
  e_acc = 0
  model.train()

  for batch in iterator:
    optimizer.zero_grad()
    preds = model(batch.title)  # batch attributes are named after the columns
    loss = l(preds.squeeze(1), batch.cat.long())  # squeeze(1) is a no-op for [batch_size, 4] logits
    acc = accuracy(preds, batch.cat)
    loss.backward()
    optimizer.step()
    e_loss += loss.item()
    e_acc += acc.item()
  return e_loss/len(iterator), e_acc/len(iterator)

def evaluate_m(model, iterator, l):
  e_loss = 0
  e_acc = 0
  model.eval()
  with torch.no_grad():
    for batch in iterator:
      preds = model(batch.title)
      loss = l(preds.squeeze(1), batch.cat.long())
      acc = accuracy(preds,  batch.cat)
      e_loss += loss.item()
      e_acc += acc.item()
  return e_loss/len(iterator), e_acc/len(iterator)

In [None]:
import time

def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

In [None]:
N_EPOCHS = 5

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):

    start_time = time.time()
    
    train_loss, train_acc = train_m(model, train_iterator, optimizer, criterion)
    # No separate validation split exists, so the test set doubles as the validation set here
    valid_loss, valid_acc = evaluate_m(model, test_iterator, criterion)
    
    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut1-model.pt')
    
    print(f'Epoch: {epoch+1:02} / {N_EPOCHS} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

Epoch: 01 / 5 | Epoch Time: 0m 5s
	Train Loss: 1.366 | Train Acc: 30.60%
	 Val. Loss: 1.352 |  Val. Acc: 32.08%
Epoch: 02 / 5 | Epoch Time: 0m 5s
	Train Loss: 1.336 | Train Acc: 34.98%
	 Val. Loss: 1.325 |  Val. Acc: 35.63%
Epoch: 03 / 5 | Epoch Time: 0m 5s
	Train Loss: 1.315 | Train Acc: 37.55%
	 Val. Loss: 1.306 |  Val. Acc: 37.89%
Epoch: 04 / 5 | Epoch Time: 0m 5s
	Train Loss: 1.297 | Train Acc: 39.30%
	 Val. Loss: 1.289 |  Val. Acc: 39.33%
Epoch: 05 / 5 | Epoch Time: 0m 5s
	Train Loss: 1.279 | Train Acc: 40.80%
	 Val. Loss: 1.271 |  Val. Acc: 40.80%
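The loss values above are easier to read against the chance level for four classes. The per-example quantity behind cross-entropy is sketched in pure Python below; `cross_entropy` is a hypothetical helper written for illustration, not the PyTorch implementation.

```python
import math

def cross_entropy(logits, target_index):
    """Cross-entropy for one example: -log(softmax(logits)[target_index])."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_index]

# A model that scores all four classes equally
uniform_loss = cross_entropy([0.0, 0.0, 0.0, 0.0], target_index=2)

# A confident, correct prediction gives a loss near zero
confident_loss = cross_entropy([0.0, 0.0, 10.0, 0.0], target_index=2)
```

`uniform_loss` equals ln 4, about 1.386, so the losses in the log above are only slightly better than guessing uniformly.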


## Prediction

In [None]:
import spacy
nlp = spacy.load('en')
CLASS_DICT = {0: 'WORLD', 1: 'SPORTS', 2: 'BIZ', 3: 'TECH'}
def predict(sentence):
  model.eval()
  # Lowercase to match NEWS (lower = True); otherwise capitalised words map to <unk>
  tokens = [word.text.lower() for word in nlp.tokenizer(sentence)]
  indices = [NEWS.vocab.stoi[word] for word in tokens]
  input_tensor = torch.LongTensor(indices).to(device)
  input_tensor = input_tensor.unsqueeze(1)  # [sentence_len, 1]: a batch of one
  with torch.no_grad():
    # softmax rather than sigmoid: the four classes are mutually exclusive
    prediction = torch.softmax(model(input_tensor), dim = 1)
  _, index = torch.max(prediction, dim = 1)
  return CLASS_DICT[index.item()]

In [None]:
predict('Oil prices are up!')

'TECH'
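For picking the class, only the position of the largest logit matters, so any monotonic squashing yields the same label. Softmax is the one that gives a proper probability distribution over the four classes. A pure-Python sketch with made-up logits:

```python
import math

def softmax(logits):
    """Convert logits to probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    """Squash a single logit into (0, 1), independently of the others."""
    return 1.0 / (1.0 + math.exp(-z))

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)

logits = [0.2, -1.0, 1.5, 0.3]  # made-up scores for the 4 classes
probs = softmax(logits)
squashed = [sigmoid(z) for z in logits]
```

Both `probs` and `squashed` peak at index 2, but only `probs` sums to 1; the sigmoid values are independent per-class scores.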

<b> Points learnt </b>

- Labels must be of long type for CrossEntropyLoss and of float type for BCEWithLogitsLoss
- When using data.BucketIterator.splits, make sure sort is set to False (https://github.com/pytorch/text/issues/474); note that the iterator above still uses sort = True