# ICE - 7.
# Text Classification with TorchText
==================================

This tutorial shows how to use the text classification datasets
in ``torchtext``, including

::

   - AG_NEWS,
   - SogouNews,
   - DBpedia,
   - YelpReviewPolarity,
   - YelpReviewFull,
   - YahooAnswers,
   - AmazonReviewPolarity,
   - AmazonReviewFull

You can find these datasets with a Google search; they are available for free.

This example shows how to train a supervised learning algorithm for
classification using one of these ``TextClassification`` datasets.

Load data with ngrams
---------------------

Generally speaking, any NLP task starts with preprocessing.

Here are some questions to keep in mind:

Build and preprocess dataset:

- Segment sentences. Segment words into subwords or characters?
- Convert words to lower case?
- Delete stop words?
- Create special tokens (e.g. [UNK] [BOS] [EOS] [PAD])?

Build vocabulary:

- Discard words whose frequencies are under a threshold?
- Build a map from word string to index in the embedding table (str -> int)
- Build the label vocabulary
- Numericalize words (transform a list of words into a list of numbers)
- Choose whether to pad (using [PAD])
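
The vocabulary steps above can be sketched in plain Python. This is only an illustration under stated assumptions: the helper names `build_vocab` and `numericalize`, the special tokens `<unk>` and `<pad>`, and the whitespace tokenizer are hypothetical and are not part of ``torchtext``.

```python
from collections import Counter

def build_vocab(tokenized_texts, min_freq=1, specials=("<unk>", "<pad>")):
    """Map each sufficiently frequent token to an integer index (hypothetical helper)."""
    counts = Counter(tok for toks in tokenized_texts for tok in toks)
    itos = list(specials)  # special tokens take the first indices
    # Sort by descending frequency, then alphabetically, for a stable order.
    for tok, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_freq:  # discard words under the frequency threshold
            itos.append(tok)
    return {tok: idx for idx, tok in enumerate(itos)}

def numericalize(tokens, stoi, unk="<unk>"):
    """Transform a list of words into a list of indices; unknown words map to unk."""
    return [stoi.get(tok, stoi[unk]) for tok in tokens]

corpus = ["I love python", "I love machine learning"]
tokenized = [s.lower().split() for s in corpus]  # lower-case, split on whitespace
stoi = build_vocab(tokenized, min_freq=2)
ids = numericalize("i love rust".split(), stoi)
print(stoi)  # {'<unk>': 0, '<pad>': 1, 'i': 2, 'love': 3}
print(ids)   # [2, 3, 0]
```

With `min_freq=2`, only "i" and "love" survive the threshold, so the unseen word "rust" falls back to the `<unk>` index.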


For this specific task, a bag-of-ngrams feature is used to capture some partial information
about the local word order. In practice, bi-grams or tri-grams are often added
because word groups carry more information than single words alone. An example:

::

   For text: **"load data with ngrams"**  
   1-gram results: "load", "data", "with", "ngrams"  
   Bi-grams results: "load data", "data with", "with ngrams"  
   Tri-grams results: "load data with", "data with ngrams"

The ``TextClassification`` dataset supports ngrams. With ngrams set to 2,
each example text in the dataset becomes a list of single words plus
bi-gram strings. Note that the code below sets ``NGRAMS = 1``, so only
single words are used.
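
To make the ngram example concrete, here is a minimal plain-Python sketch. The function name `bag_of_ngrams` is hypothetical; it is meant to mirror the idea of listing single words first and then longer word groups, not to reproduce ``torchtext`` internals.

```python
def bag_of_ngrams(tokens, n):
    """Return the single words followed by all 2..n-grams as space-joined strings."""
    out = list(tokens)  # 1-grams come first
    for size in range(2, n + 1):
        for start in range(len(tokens) - size + 1):
            out.append(" ".join(tokens[start:start + size]))
    return out

tokens = "load data with ngrams".split()
bigrams = bag_of_ngrams(tokens, 2)
trigrams = bag_of_ngrams(tokens, 3)
print(bigrams)
# ['load', 'data', 'with', 'ngrams', 'load data', 'data with', 'with ngrams']
```

Calling it with `n=3` adds "load data with" and "data with ngrams" on top of the bi-gram list.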




In [None]:
%matplotlib inline
# Quote the requirement so the shell does not treat ">" as a redirect
!pip install "torch>=1.3.1"
!pip install torchtext==0.4

Collecting torchtext==0.4
  Downloading torchtext-0.4.0-py3-none-any.whl (53 kB)
Installing collected packages: torchtext
  Attempting uninstall: torchtext
    Found existing installation: torchtext 0.10.0
    Uninstalling torchtext-0.10.0:
      Successfully uninstalled torchtext-0.10.0
Successfully installed torchtext-0.4.0


In [None]:
import os

import torch
import torchtext
from torchtext.datasets import text_classification

NGRAMS = 1  # 1 = single words only; 2 would add bi-grams
# Create the download directory if it does not exist yet
os.makedirs('./.data', exist_ok=True)
train_dataset, test_dataset = text_classification.DATASETS['AG_NEWS'](
    root='./.data', ngrams=NGRAMS, vocab=None)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

ag_news_csv.tar.gz: 11.8MB [00:00, 37.6MB/s]
120000lines [00:05, 23518.88lines/s]
120000lines [00:10, 11718.74lines/s]
7600lines [00:00, 12229.67lines/s]


In [None]:
print('One item of training data:', train_dataset[0] ) #(label id, token id tensor)
print('Vocabulary size (including ngrams):', len(train_dataset.get_vocab()))
print('Class size:', len(train_dataset.get_labels()))

One item of training data: (2, tensor([  432,   426,     2,  1606, 14839,   114,    67,     3,   849,    14,
           28,    15,    28,    16, 50726,     4,   432,   375,    17,    10,
        67508,     7, 52259,     4,    43,  4010,   784,   326,     2]))
Vocabulary size (including ngrams): 95812
Class size: 4


Define the model
----------------

The model is composed of the
`EmbeddingBag <https://pytorch.org/docs/stable/nn.html?highlight=embeddingbag#torch.nn.EmbeddingBag>`__
layer and the linear layer (see the figure below). ``nn.EmbeddingBag``
computes the mean value of a “bag” of embeddings. The text entries here
have different lengths. ``nn.EmbeddingBag`` requires no padding here
since the text lengths are saved in offsets.

Additionally, since ``nn.EmbeddingBag`` accumulates the average across
the embeddings on the fly, it avoids materializing the intermediate
per-token embeddings, which improves speed and memory efficiency when
processing a batch of variable-length sequences.

![](../_static/img/text_sentiment_ngrams_model.png)

To make ``offsets`` easier to understand, here is a brief explanation.

The values returned by the ``generate_batch`` function (``text`` and ``offsets``) are used directly as the input of the model. In the forward pass, they are passed as arguments to ``self.embedding``.

``text`` is a tensor of shape (N,) that is treated as a concatenation of multiple bags (sequences). ``offsets`` must be a 1D tensor containing the starting index of each bag in ``text``. Therefore, for ``offsets`` of shape (B,), ``text`` is viewed as having B bags, where B is the batch size.

For example, assume the batch size is 2 and one batch contains the 2 sentences "I love python" and "I love machine learning".

We first create a bag of words for each sentence: ["I", "love", "python"] and ["I", "love", "machine", "learning"]. Each word is then mapped to its index in the vocabulary, giving tensor([2,1,5]) and tensor([2,1,4,7]).
Concatenating tensor([2,1,5]) and tensor([2,1,4,7]) gives the text tensor([2,1,5,2,1,4,7]) of shape (N,) with N=7. We also create the offsets tensor([0,3]) of shape (B,) with B=2. In tensor([0,3]), 0 is the starting index of the first sentence in tensor([2,1,5,2,1,4,7]), and 3 is the starting index of the second sentence.
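
The following plain-Python sketch imitates what mean pooling over bags does with the `text` and `offsets` from the example above. It is not the ``nn.EmbeddingBag`` implementation: the function name `embedding_bag_mean` and the tiny 2-dimensional embedding table are assumptions made only for illustration, and every bag is assumed to be non-empty.

```python
def embedding_bag_mean(table, text, offsets):
    """Average the embedding rows of each bag.

    Bag i covers text[offsets[i]:offsets[i + 1]]; the last bag runs to the end of text.
    """
    bounds = list(offsets) + [len(text)]
    pooled = []
    for start, end in zip(bounds[:-1], bounds[1:]):
        rows = [table[idx] for idx in text[start:end]]  # look up each word index
        dim = len(rows[0])
        pooled.append([sum(r[d] for r in rows) / len(rows) for d in range(dim)])
    return pooled

# Hypothetical embedding table: row i is [i, 10 * i], so means are easy to check by hand.
table = [[float(i), 10.0 * i] for i in range(8)]
text = [2, 1, 5, 2, 1, 4, 7]   # "I love python" + "I love machine learning"
offsets = [0, 3]               # starting index of each sentence in text
pooled = embedding_bag_mean(table, text, offsets)
print(len(pooled))   # 2 bags, one row per sentence
print(pooled[1])     # [3.5, 35.0], the mean of rows 2, 1, 4, 7
```

The result has one row per bag, which corresponds to the (B, embed_dim) shape that the linear layer receives.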





In [None]:
import torch.nn as nn
import torch.nn.functional as F
class TextSentiment(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_class):  # Initialize modules
        super().__init__()
        # In the default "mean" mode this is equivalent to nn.Embedding followed by an average over each bag
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, sparse=True)
        self.fc = nn.Linear(embed_dim, num_class)  # Linear classification layer
        self.init_weights()

    def init_weights(self):  # Randomly initialize parameters
        initrange = 0.5
        self.embedding.weight.data.uniform_(-initrange, initrange)  # Uniform distribution
        self.fc.weight.data.uniform_(-initrange, initrange)
        self.fc.bias.data.zero_()  # Set the bias to zeros

    def forward(self, text, offsets):  # Forward pass
        # text has shape (N,) and is treated as a concatenation of multiple bags (sequences).
        # offsets is a 1D tensor containing the starting index of each bag in text,
        # so for offsets of shape (B,), text is viewed as having B bags.
        embedded = self.embedding(text, offsets)  # output: (B, embed_dim)
        return self.fc(embedded)  # output: (B, num_class)

Initiate an instance
--------------------

The AG_NEWS dataset has four labels and therefore the number of classes
is four.

::

   1 : World
   2 : Sports
   3 : Business
   4 : Sci/Tec

The vocab size is equal to the length of vocab (including single word
and ngrams). The number of classes is equal to the number of labels,
which is four in AG_NEWS case.




In [None]:
VOCAB_SIZE = len(train_dataset.get_vocab())
EMBED_DIM = 32
NUM_CLASS = len(train_dataset.get_labels())
model = TextSentiment(VOCAB_SIZE, EMBED_DIM, NUM_CLASS).to(device)

Functions used to generate batch
--------------------------------




Since the text entries have different lengths, a custom function
``generate_batch()`` is used to generate data batches and offsets. The
function is passed to ``collate_fn`` in ``torch.utils.data.DataLoader``.
The input to ``collate_fn`` is a list of ``batch_size`` dataset items,
each a (label, token tensor) pair, and the ``collate_fn`` function packs
them into a mini-batch. Make sure that ``collate_fn`` is declared as a
top-level def so that the function is available in each worker.

The text entries in the original data batch are packed into a list and
concatenated into a single tensor that serves as the input of
``nn.EmbeddingBag``. ``offsets`` is a tensor of delimiters that marks the
beginning index of each individual sequence in the text tensor. ``label``
is a tensor holding the labels of the individual text entries.
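
As a rough illustration of the packing described above, the sketch below works on plain Python lists instead of tensors. The name `collate` is hypothetical, and the real function in this tutorial operates on ``torch`` tensors; only the offset arithmetic is meant to carry over.

```python
from itertools import accumulate

def collate(batch):
    """batch: list of (label, token_id_list) pairs -> (flat_text, offsets, labels)."""
    labels = [label for label, _ in batch]
    lengths = [len(ids) for _, ids in batch]
    # Each bag starts where the previous ones end: [0, len0, len0 + len1, ...].
    offsets = [0] + list(accumulate(lengths[:-1]))
    flat_text = [tok for _, ids in batch for tok in ids]  # concatenate all entries
    return flat_text, offsets, labels

flat_text, offsets, labels = collate([(1, [2, 4, 3]), (0, [6, 5])])
print(flat_text)  # [2, 4, 3, 6, 5]
print(offsets)    # [0, 3]
print(labels)     # [1, 0]
```

The last length is dropped before accumulating because the final bag simply extends to the end of the flat text.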




In [None]:
def generate_batch(batch):
    # Input: a list of batch_size items, e.g. [(1, tensor([2,4,3])), (0, tensor([6,5]))]
    # Output: one mini-batch (text, offsets, label) used in SGD
    label = torch.tensor([entry[0] for entry in batch])  # tensor of shape (batch_size,)
    text = [entry[1] for entry in batch]
    offsets = [0] + [len(entry) for entry in text]
    # torch.Tensor.cumsum returns the cumulative sum of elements along dim,
    # e.g. torch.Tensor([1.0, 2.0, 3.0]).cumsum(dim=0) gives tensor([1., 3., 6.]).
    # Dropping the last length and accumulating yields each bag's starting index.
    offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)  # e.g. tensor([0, 3])
    # list of tensors -> one tensor of shape (total number of tokens,), e.g. tensor([2,4,3,6,5])
    text = torch.cat(text)
    return text, offsets, label

Define functions to train the model and evaluate results.
---------------------------------------------------------




`torch.utils.data.DataLoader <https://pytorch.org/docs/stable/data.html?highlight=dataloader#torch.utils.data.DataLoader>`__
is recommended for PyTorch users, and it makes loading data in parallel
easy (a tutorial is
`here <https://pytorch.org/tutorials/beginner/data_loading_tutorial.html>`__).
We use ``DataLoader`` here to load the AG_NEWS dataset and send it to the
model for training/validation.




In [None]:
from torch.utils.data import DataLoader
BATCH_SIZE = 16

def train_func(sub_train_):

    # Train the model
    train_loss = 0
    train_acc = 0
    data = DataLoader(sub_train_, batch_size=BATCH_SIZE, shuffle=True,
                      collate_fn=generate_batch)  # Iterable batches
    for i, (text, offsets, cls) in enumerate(data):
        optimizer.zero_grad()  # Reset the gradients left over from the previous step
        text, offsets, cls = text.to(device), offsets.to(device), cls.to(device)
        output = model(text, offsets)  # Forward pass
        loss = criterion(output, cls)  # Compute the loss
        # .item() extracts the Python number from a one-element tensor; it is used for printing later
        train_loss += loss.item()
        loss.backward()  # Backpropagation: compute the gradient of each parameter
        optimizer.step()  # Update parameters according to the gradients
        # Choose the class with the highest score as the prediction and compare with the gold label (cls)
        train_acc += (output.argmax(1) == cls).sum().item()

    # Adjust the learning rate: decay it once per epoch (optional)
    scheduler.step()

    # Summed batch losses and correct counts, each divided by the number of samples
    return train_loss / len(sub_train_), train_acc / len(sub_train_)

def test(data_):
    # Similar to train_func, but without backpropagation or parameter updates
    total_loss = 0
    acc = 0
    data = DataLoader(data_, batch_size=BATCH_SIZE, collate_fn=generate_batch)
    for text, offsets, cls in data:
        text, offsets, cls = text.to(device), offsets.to(device), cls.to(device)
        with torch.no_grad():  # Disable gradient tracking; backward() is not needed here
            output = model(text, offsets)
            # Accumulate into a separate variable so the running total is not overwritten
            total_loss += criterion(output, cls).item()
            acc += (output.argmax(1) == cls).sum().item()

    return total_loss / len(data_), acc / len(data_)

Split the dataset and run the model
-----------------------------------

Since the original AG_NEWS has no validation dataset, we split the training
dataset into train/valid sets with a split ratio of 0.95 (train) and
0.05 (valid). Here we use the
`torch.utils.data.dataset.random_split <https://pytorch.org/docs/stable/data.html?highlight=random_split#torch.utils.data.random_split>`__
function from the PyTorch core library.
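
The idea behind a random 0.95/0.05 split can be sketched with the standard library alone. The function `split_indices` and the fixed seed are assumptions for illustration; the tutorial itself relies on ``random_split``, and the dataset size of 120000 is taken from the loader output shown earlier.

```python
import random

def split_indices(n_items, train_fraction, seed=0):
    """Shuffle the indices 0..n_items-1 and cut them into two disjoint parts."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)       # seeded for reproducibility
    train_len = int(n_items * train_fraction)  # same length rule as the tutorial code
    return indices[:train_len], indices[train_len:]

train_idx, valid_idx = split_indices(120000, 0.95)
print(len(train_idx), len(valid_idx))
```

Every index lands in exactly one of the two parts, so the validation examples are never seen during training.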

The `CrossEntropyLoss <https://pytorch.org/docs/stable/nn.html?highlight=crossentropyloss#torch.nn.CrossEntropyLoss>`__
criterion combines nn.LogSoftmax() and nn.NLLLoss() in a single class.
It is useful when training a classification problem with C classes.
`SGD <https://pytorch.org/docs/stable/_modules/torch/optim/sgd.html>`__
implements the stochastic gradient descent method as the optimizer. The
initial learning rate is set to 4.0.
`StepLR <https://pytorch.org/docs/master/_modules/torch/optim/lr_scheduler.html#StepLR>`__
is used here to adjust the learning rate through epochs.
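
To show the arithmetic behind these two components, here is a small stdlib-only sketch. The functions `cross_entropy` and `step_lr` are hypothetical stand-ins written for a single example and a single learning rate; they are not the PyTorch implementations, although the values 4.0, 0.9 and a step size of 1 match the settings used in this tutorial.

```python
import math

def cross_entropy(logits, target):
    """Negative log-softmax of the target class, computed in a numerically stable way."""
    peak = max(logits)  # subtract the maximum before exponentiating
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[target]

def step_lr(initial_lr, gamma, step_size, epoch):
    """Learning rate after `epoch` scheduler steps under a StepLR-style rule."""
    return initial_lr * gamma ** (epoch // step_size)

# Four equal scores mean a uniform prediction over 4 classes, so the loss is log(4).
uniform_loss = cross_entropy([0.0, 0.0, 0.0, 0.0], target=2)
# A confident, correct prediction gives a loss close to zero.
confident_loss = cross_entropy([0.0, 0.0, 10.0, 0.0], target=2)
# Learning rate used in each of the first five epochs.
schedule = [step_lr(4.0, 0.9, 1, epoch) for epoch in range(5)]
print(uniform_loss, confident_loss)
print(schedule)
```

With a step size of 1, the schedule shrinks the learning rate by 10% after every epoch.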




In [None]:
import time
from torch.utils.data.dataset import random_split
N_EPOCHS = 5
min_valid_loss = float('inf')

# Use CrossEntropyLoss() as the criterion.
# Its input is the raw model output: it applies log-softmax, then computes the negative log-likelihood loss.
criterion = torch.nn.CrossEntropyLoss().to(device)
# Use SGD as the optimizer.
optimizer = torch.optim.SGD(model.parameters(), lr=4.0)
# Multiply the learning rate by gamma=0.9 after every epoch (step_size=1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1, gamma=0.9)

train_len = int(len(train_dataset) * 0.95)
# Split the whole training dataset to create a validation (hold-out) dataset
sub_train_, sub_valid_ = \
    random_split(train_dataset, [train_len, len(train_dataset) - train_len])
for epoch in range(N_EPOCHS):

    start_time = time.time()
    train_loss, train_acc = train_func(sub_train_)
    valid_loss, valid_acc = test(sub_valid_)

    secs = int(time.time() - start_time)
    mins = secs // 60  # Whole minutes
    secs = secs % 60

    # Print information to monitor the training process
    print('Epoch: %d' %(epoch + 1), " | time in %d minutes, %d seconds" %(mins, secs))
    print(f'\tLoss: {train_loss:.4f}(train)\t|\tAcc: {train_acc * 100:.1f}%(train)')
    print(f'\tLoss: {valid_loss:.4f}(valid)\t|\tAcc: {valid_acc * 100:.1f}%(valid)')

Epoch: 1  | time in 0 minutes, 9 seconds
	Loss: 0.0257(train)	|	Acc: 85.5%(train)
	Loss: 0.0000(valid)	|	Acc: 88.6%(valid)
Epoch: 2  | time in 0 minutes, 9 seconds
	Loss: 0.0166(train)	|	Acc: 91.1%(train)
	Loss: 0.0000(valid)	|	Acc: 90.5%(valid)
Epoch: 3  | time in 0 minutes, 9 seconds
	Loss: 0.0141(train)	|	Acc: 92.4%(train)
	Loss: 0.0000(valid)	|	Acc: 90.9%(valid)
Epoch: 4  | time in 0 minutes, 9 seconds
	Loss: 0.0124(train)	|	Acc: 93.3%(train)
	Loss: 0.0000(valid)	|	Acc: 91.2%(valid)
Epoch: 5  | time in 0 minutes, 9 seconds
	Loss: 0.0112(train)	|	Acc: 93.9%(train)
	Loss: 0.0000(valid)	|	Acc: 91.5%(valid)


Running the model on GPU with the following information:

Epoch: 1 \| time in 0 minutes, 11 seconds

::

       Loss: 0.0263(train)     |       Acc: 84.5%(train)
       Loss: 0.0001(valid)     |       Acc: 89.0%(valid)


Epoch: 2 \| time in 0 minutes, 10 seconds

::

       Loss: 0.0119(train)     |       Acc: 93.6%(train)
       Loss: 0.0000(valid)     |       Acc: 89.6%(valid)


Epoch: 3 \| time in 0 minutes, 9 seconds

::

       Loss: 0.0069(train)     |       Acc: 96.4%(train)
       Loss: 0.0000(valid)     |       Acc: 90.5%(valid)


Epoch: 4 \| time in 0 minutes, 11 seconds

::

       Loss: 0.0038(train)     |       Acc: 98.2%(train)
       Loss: 0.0000(valid)     |       Acc: 90.4%(valid)


Epoch: 5 \| time in 0 minutes, 11 seconds

::

       Loss: 0.0022(train)     |       Acc: 99.0%(train)
       Loss: 0.0000(valid)     |       Acc: 91.0%(valid)




Evaluate the model with test dataset
------------------------------------




In [None]:
print('Checking the results of test dataset...')
test_loss, test_acc = test(test_dataset)
print(f'\tLoss: {test_loss:.4f}(test)\t|\tAcc: {test_acc * 100:.1f}%(test)')

Checking the results of test dataset...
	Loss: 0.0001(test)	|	Acc: 90.4%(test)


Checking the results of test dataset…

::

       Loss: 0.0237(test)      |       Acc: 90.5%(test)




Test on a random news
---------------------

Use the best model so far and test it on a golf news article. The label
information is available
`here <https://pytorch.org/text/datasets.html?highlight=ag_news#torchtext.datasets.AG_NEWS>`__.




In [None]:
import re
from torchtext.data.utils import ngrams_iterator
from torchtext.data.utils import get_tokenizer

ag_news_label = {1 : "World",
                 2 : "Sports",
                 3 : "Business",
                 4 : "Sci/Tec"}

def predict(text, model, vocab, ngrams):
    tokenizer = get_tokenizer("basic_english")
    with torch.no_grad():
        #tokenize, generate ngram list and use vocabulary to numericalize
        text = torch.tensor([vocab[token]
                            for token in ngrams_iterator(tokenizer(text), ngrams)]) 
        output = model(text, torch.tensor([0]))
        # choose the class with the highest score as prediction 
        return output.argmax(1).item() + 1

ex_text_str = "MEMPHIS, Tenn. – Four days ago, Jon Rahm was \
    enduring the season’s worst weather conditions on Sunday at The \
    Open on his way to a closing 75 at Royal Portrush, which \
    considering the wind and the rain was a respectable showing. \
    Thursday’s first round at the WGC-FedEx St. Jude Invitational \
    was another story. With temperatures in the mid-80s and hardly any \
    wind, the Spaniard was 13 strokes better in a flawless round. \
    Thanks to his best putting performance on the PGA Tour, Rahm \
    finished with an 8-under 62 for a three-stroke lead, which \
    was even more impressive considering he’d never played the \
    front nine at TPC Southwind."

vocab = train_dataset.get_vocab()
model = model.to("cpu")

print("This is a %s news" %ag_news_label[predict(ex_text_str, model, vocab, 1)])

This is a Sports news






# **Tasks for Today's ICE**
## *All implementation should be performed in PyTorch/torchtext*

In [None]:
# The code for the models is provided as an attached Zip file. Please use it as a reference to perform classification, report the accuracy, and then test any random piece of text
# as test data to identify what the article is talking about, e.g. "This is Political News". You can use any of the datasets introduced at the start.
# All datasets are available for free through a Google search.
# The final step is to compare the accuracies and discuss why one model performs better than the others.

In [None]:
# Use CNN for the above scenario

In [None]:
# Use RNN for the above scenario

In [None]:
# Use LSTM for the above scenario

In [None]:
# Use LSTM Attention for the above scenario

In [None]:
# Use Self Attention for the above scenario

In [None]:
# Is it possible to use a transformer for the above scenario, i.e. text sentiment analysis / classification? Answer yes/no. If yes, implement it using the resource below.
# If no, why do you think a transformer cannot perform text classification in PyTorch?
# You can refer to https://github.com/Renovamen/Text-Classification/tree/master/models/Transformer to implement transformers.