# Text Classification Example

This example shows how to use the text classification datasets, including:

    - AG_NEWS,
    - SogouNews, 
    - DBpedia, 
    - YelpReviewPolarity,
    - YelpReviewFull, 
    - YahooAnswers, 
    - AmazonReviewPolarity,
    - AmazonReviewFull

These datasets are included in TorchText, and each can be loaded with a single command.
 
This example shows an application of the TextClassification datasets. It reproduces the sentiment analysis with the fastText classifier described in [Bag of Tricks for Efficient Text Classification](https://arxiv.org/abs/1607.01759). In the paper, sentences are represented as a bag of words (BoW), and a linear classifier is trained on that representation.


## Load data with ngrams
A bag of ngram features is used in the paper to capture partial information about the local word order. In practice, bi-grams or tri-grams are applied because groups of words carry more information than single words. An example:

    "load data with ngrams"
    Bi-grams results: "load data", "data with", "with ngrams"
    Tri-grams results: "load data with", "data with ngrams"

TextClassificationDataset supports the ngrams option. With ngrams set to 2, each example text in the dataset is a list of single words plus bi-gram strings.

In [None]:
import torch
import torchtext
from torchtext.datasets import text_classification
NGRAMS = 2
train_dataset, test_dataset = text_classification.DATASETS['AG_NEWS'](
    root='./.data', ngrams=NGRAMS, vocab=None)
BATCH_SIZE = 128
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
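
The bi-gram and tri-gram expansion described above can be sketched in plain Python. The helper `with_ngrams` is a hypothetical name used only for illustration; it is not the TorchText implementation, just a small stand-in that produces single words followed by the longer word groups.

```python
def with_ngrams(tokens, n):
    """Return the tokens followed by every 2..n-gram, each joined by a space."""
    out = list(tokens)
    for size in range(2, n + 1):
        # Slide a window of `size` words over the token list.
        for start in range(len(tokens) - size + 1):
            out.append(" ".join(tokens[start:start + size]))
    return out


tokens = "load data with ngrams".split()
features = with_ngrams(tokens, 2)
print(features)
# ['load', 'data', 'with', 'ngrams', 'load data', 'data with', 'with ngrams']
```

Every string in the resulting list, single word or word group, becomes one entry in the vocabulary, which is why the vocabulary grows quickly as the ngram size increases.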

## Define the model

The model is composed of an nn.EmbeddingBag layer and a linear layer (see the figure below).

<img src="./pictures/text_sentiment_ngrams_model.png" width="600" height="360">

nn.EmbeddingBag computes the mean of 'bags' of embeddings. Because it does not instantiate the intermediate embeddings, nn.EmbeddingBag improves performance and memory efficiency when processing a sequence of tensors. In addition, the text entries here have different lengths. nn.EmbeddingBag requires no padding, so this method is much faster than the original one with TorchText Iterator and Batch.

In [None]:
import torch.nn as nn
import torch.nn.functional as F
class TextSentiment(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_class):
        super().__init__()
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, sparse=True)
        self.fc = nn.Linear(embed_dim, num_class)
        self.init_weights()

    def init_weights(self):
        initrange = 0.5
        self.embedding.weight.data.uniform_(-initrange, initrange)
        self.fc.weight.data.uniform_(-initrange, initrange)
        self.fc.bias.data.zero_()
        
    def forward(self, text, offsets):
        embedded = self.embedding(text, offsets)
        return self.fc(embedded)
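
To make the "mean of bags" behaviour concrete, the sketch below compares nn.EmbeddingBag with an explicit lookup followed by a mean. The vocabulary size, embedding width, and token ids are arbitrary values chosen for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A small EmbeddingBag and a plain Embedding that shares the same weights.
bag = nn.EmbeddingBag(10, 4, mode="mean")
emb = nn.Embedding.from_pretrained(bag.weight.detach().clone())

# Two text entries of different lengths, concatenated into one 1-D tensor.
text = torch.tensor([1, 2, 3, 4, 5])
offsets = torch.tensor([0, 3])  # entry 0 is text[0:3], entry 1 is text[3:5]

bag_out = bag(text, offsets)

# The same result computed the long way: look up each entry, then average.
manual = torch.stack([emb(text[0:3]).mean(dim=0),
                      emb(text[3:5]).mean(dim=0)])

print(bag_out.shape)
print(torch.allclose(bag_out, manual, atol=1e-6))
```

The output has one row per entry regardless of entry length, which is why no padding is needed.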

## Initiate an instance

The AG_NEWS dataset has four labels and therefore the number of classes is four.

    1 : World
    2 : Sports
    3 : Business
    4 : Sci/Tec

The vocab size is equal to the length of the vocab (including single words and ngrams). The number of classes is equal to the number of labels, which is four in the AG_NEWS case.

In [None]:
VOCAB_SIZE = len(train_dataset.get_vocab())
EMBED_DIM = 256
NUM_CLASS = len(train_dataset.get_labels())
model = TextSentiment(VOCAB_SIZE, EMBED_DIM, NUM_CLASS).to(device)

## Add optimizer and loss function

The [SGD](https://pytorch.org/docs/stable/optim.html?highlight=sgd#torch.optim.SGD) algorithm is used to optimize the model.
The [CrossEntropyLoss](https://pytorch.org/docs/stable/nn.html?highlight=crossentropyloss#torch.nn.CrossEntropyLoss) criterion combines nn.LogSoftmax() and nn.NLLLoss() in a single class. It is useful when training a classification problem with C classes.

In [None]:
import torch.optim as optim
optimizer = optim.SGD(model.parameters(), lr=4.0)
loss_func = torch.nn.CrossEntropyLoss().to(device)
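
The statement that CrossEntropyLoss combines nn.LogSoftmax() and nn.NLLLoss() can be checked on a small batch. The logits and targets below are made-up values for a 4-class problem, chosen only for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

logits = torch.randn(3, 4)          # 3 examples, 4 classes (raw model outputs)
target = torch.tensor([0, 3, 1])    # class indices in the range 0..3

# One step: CrossEntropyLoss applied directly to the raw logits.
combined = nn.CrossEntropyLoss()(logits, target)

# Two steps: log-softmax over the class dimension, then negative log-likelihood.
two_step = nn.NLLLoss()(nn.LogSoftmax(dim=1)(logits), target)

print(torch.allclose(combined, two_step))
```

This is also why the model's forward pass returns the raw output of the linear layer without applying a softmax.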

## Functions used to generate batch

Since the text entries have different lengths, a custom function generate_batch() is used to generate data batches and offsets that are compatible with EmbeddingBag. The function is passed to 'collate_fn' in torch.utils.data.DataLoader. The input to 'collate_fn' is a list of batch_size dataset entries, each a (label, text tensor) pair, and the 'collate_fn' function packs them into a mini-batch. Make sure that 'collate_fn' is declared as a top-level def, so that the function is available in each worker.

In [None]:
def generate_batch(batch):

    def generate_offsets(data_batch):
        offsets = [0]
        for entry in data_batch:
            offsets.append(offsets[-1] + len(entry))
        offsets = torch.tensor(offsets[:-1])
        return offsets

    label = torch.tensor([entry[0] for entry in batch])
    text = [entry[1] for entry in batch]
    offsets = generate_offsets(text)
    text = torch.cat(text)
    return text, offsets, label
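
The offsets are simply the starting position of each entry in the concatenated text. A plain-Python sketch of the same bookkeeping follows, using made-up token ids and a hypothetical helper name `bag_offsets`.

```python
from itertools import accumulate


def bag_offsets(lengths):
    """Start index of each entry once all entries are concatenated."""
    # Running sum of lengths, shifted right by one position.
    return list(accumulate([0] + list(lengths)))[:-1]


entries = [[11, 12, 13], [21, 22], [31, 32, 33, 34]]
flat = [tok for entry in entries for tok in entry]
offsets = bag_offsets([len(entry) for entry in entries])
print(offsets)  # [0, 3, 5]

# The offsets are enough to recover the original entries from the flat list.
bounds = offsets + [len(flat)]
recovered = [flat[a:b] for a, b in zip(bounds, bounds[1:])]
print(recovered == entries)  # True
```

The final cumulative length is dropped because the last entry is understood to run to the end of the text tensor.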

## Define a function to train the model and evaluate results

torch.utils.data.DataLoader is recommended for PyTorch domain libraries. We use DataLoader here to load the AG_NEWS dataset and send it to the model for training and validation.

In [None]:
from torch.utils.data import DataLoader

def train_evaluate_func(model, dataset, loss_func, device, batch_size=64, status='train', optimizer=None):

    _loss = 0
    _acc = 0

    if status == 'train':
        model.train()
    else:
        model.eval()

    data = DataLoader(dataset, batch_size=batch_size, shuffle=True,
                      collate_fn=generate_batch, num_workers=1)

    for i, (text, offsets, label) in enumerate(data):
        if status == 'train':
            optimizer.zero_grad()

        text, offsets, label = text.to(device), offsets.to(device), label.to(device)
        output = model(text, offsets)
        loss = loss_func(output, label)
        _loss += loss.item()

        if status == 'train':
            loss.backward()
            optimizer.step()

        _acc += (output.argmax(1) == label).sum().item()

    return _loss / len(dataset), _acc / len(dataset)

## Split the dataset and run the model
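Since the original AG_NEWS has no validation dataset, we split the training dataset into train/valid sets with split ratios of 0.7 (train) and 0.3 (valid). Note that the loop below redraws the split at every epoch, so over several epochs the model also trains on examples that were earlier used for validation; the validation figures are therefore optimistic compared with the test set.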

In [None]:
import time
import random
from torch.utils.data.dataset import random_split
N_EPOCHS = 12
min_valid_loss = float('inf')

for epoch in range(N_EPOCHS):

    start_time = time.time()

    train_len = int(len(train_dataset) * 0.7)
    sub_train_dataset, sub_valid_dataset = random_split(train_dataset, [train_len, len(train_dataset) - train_len])
    train_loss, train_acc = train_evaluate_func(model, sub_train_dataset, loss_func,
                                                device=device, status='train', optimizer=optimizer)

    with torch.no_grad():
        valid_loss, valid_acc = train_evaluate_func(model, sub_valid_dataset, loss_func,
                                                    device=device, status='valid')
    
    # Split the elapsed time into whole minutes and remaining seconds.
    _mins, _secs = divmod(int(time.time() - start_time), 60)

    if valid_loss < min_valid_loss:
        min_valid_loss = valid_loss
        torch.save(model, 'text_classification.pt')
    
    print('Epoch: %d' %(epoch + 1), " | time in %d minutes, %d seconds" %(_mins, _secs))
    print(f'\tLoss: {train_loss:.4f}(train)\t|\t{valid_loss:.4f}(valid)')
    print(f'\tAcc: {train_acc * 100:.1f}%(train)\t|\t{valid_acc * 100:.1f}%(valid)')

Running the model on a GPU produces the following output:

Epoch: 1  | time in 0 minutes, 5 seconds

        Loss: 0.0088(train)     |       0.0064(valid)
        Acc: 79.9%(train)       |       85.7%(valid)
        
Epoch: 2  | time in 0 minutes, 5 seconds

        Loss: 0.0048(train)     |       0.0046(valid)
        Acc: 89.8%(train)       |       90.2%(valid)
        
Epoch: 3  | time in 0 minutes, 5 seconds

        Loss: 0.0038(train)     |       0.0039(valid)
        Acc: 92.1%(train)       |       91.6%(valid)
        
Epoch: 4  | time in 0 minutes, 5 seconds

        Loss: 0.0031(train)     |       0.0032(valid)
        Acc: 93.5%(train)       |       93.1%(valid)
        
Epoch: 5  | time in 0 minutes, 5 seconds

        Loss: 0.0026(train)     |       0.0028(valid)
        Acc: 94.7%(train)       |       94.2%(valid)
        
Epoch: 6  | time in 0 minutes, 5 seconds

        Loss: 0.0022(train)     |       0.0021(valid)
        Acc: 95.8%(train)       |       95.8%(valid)
        
Epoch: 7  | time in 0 minutes, 5 seconds

        Loss: 0.0018(train)     |       0.0018(valid)
        Acc: 96.6%(train)       |       96.3%(valid)
        
Epoch: 8  | time in 0 minutes, 5 seconds

        Loss: 0.0014(train)     |       0.0014(valid)
        Acc: 97.4%(train)       |       97.3%(valid)
        
Epoch: 9  | time in 0 minutes, 5 seconds

        Loss: 0.0011(train)     |       0.0013(valid)
        Acc: 98.0%(train)       |       97.7%(valid)
        
Epoch: 10  | time in 0 minutes, 5 seconds

        Loss: 0.0009(train)     |       0.0010(valid)
        Acc: 98.5%(train)       |       98.3%(valid)
        
Epoch: 11  | time in 0 minutes, 5 seconds

        Loss: 0.0008(train)     |       0.0009(valid)
        Acc: 98.8%(train)       |       98.3%(valid)
        
Epoch: 12  | time in 0 minutes, 5 seconds

        Loss: 0.0006(train)     |       0.0009(valid)
        Acc: 99.1%(train)       |       98.3%(valid)
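
One way to keep the validation set fixed is to split once, before the epoch loop. A minimal sketch, assuming a plain list as a stand-in for the dataset and a seeded generator for reproducibility (both are illustrative choices, not part of the original example):

```python
import torch
from torch.utils.data import random_split

data = list(range(10))                 # stand-in for train_dataset
train_len = int(len(data) * 0.7)       # 70% train, remainder validation

# A seeded generator makes the split reproducible across runs.
generator = torch.Generator().manual_seed(0)
sub_train, sub_valid = random_split(
    data, [train_len, len(data) - train_len], generator=generator)

print(len(sub_train), len(sub_valid))  # 7 3

# The two subsets share no indices and together cover the whole dataset.
print(set(sub_train.indices).isdisjoint(sub_valid.indices))
```

With the split drawn once, the same `sub_train` and `sub_valid` objects would be passed to the training function in every epoch.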

## Evaluate the model with test dataset

In [None]:
print('Checking the results of test dataset...')
with torch.no_grad():
    test_loss, test_acc = train_evaluate_func(model, test_dataset, loss_func,
                                              device=device, status='test')
print(f'\tLoss: {test_loss:.4f}(test)\t|\tAcc: {test_acc * 100:.1f}%(test)')

Checking the results of test dataset...

        Loss: 0.005(test)       |       Acc: 90.4%(test)
        
The results are consistent with the reference paper [Bag of Tricks for Efficient Text Classification](https://arxiv.org/abs/1607.01759).

## Test on a random news article

Use the best model so far to classify a golf news article.

In [None]:
import re
from torchtext.data.utils import ngrams_iterator

ag_news_label = {1 : "World",
                 2 : "Sports",
                 3 : "Business",
                 4 : "Sci/Tec"}

vocab = train_dataset._vocab

def predict_label(text_string, ngrams):
    model.eval()
    text_string = re.sub(r'[^a-zA-Z0-9\s]', ' ', text_string)
    tokens = text_string.split(" ")
    ngrams_tokens = ngrams_iterator(tokens, ngrams)
    indexed_tensor = torch.tensor([vocab[token] for token in ngrams_tokens]).to(device)
    offsets = torch.tensor([0]).to(device)  # a single bag starting at index 0
    result = model(indexed_tensor, offsets)
    label_index = result.argmax(1).item() + 1
    return ag_news_label[label_index]

example_text_string = "MEMPHIS, Tenn. – Four days ago, Jon Rahm was \
    enduring the season’s worst weather conditions on Sunday at The \
    Open on his way to a closing 75 at Royal Portrush, which \
    considering the wind and the rain was a respectable showing. \
    Thursday’s first round at the WGC-FedEx St. Jude Invitational \
    was another story. With temperatures in the mid-80s and hardly any \
    wind, the Spaniard was 13 strokes better in a flawless round. \
    Thanks to his best putting performance on the PGA Tour, Rahm \
    finished with an 8-under 62 for a three-stroke lead, which \
    was even more impressive considering he’d never played the \
    front nine at TPC Southwind."

# Reload the checkpoint with the lowest validation loss, then classify the text.
with open("text_classification.pt", 'rb') as f:
    model = torch.load(f)
news_label = predict_label(example_text_string, 1)

print("This is a %s news" %news_label)

This is a Sports news
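
The example above pickles the whole model object. An alternative is to save only the parameters with `state_dict()` and load them into a freshly constructed model. The sketch below uses a tiny nn.Linear and an in-memory buffer as stand-ins for the classifier and the checkpoint file, both assumptions made for illustration. Recent PyTorch releases are also stricter by default about unpickling full objects with `torch.load`, which is a further reason to consider this form.

```python
import io

import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 2)                # stand-in for the trained classifier

# Save only the parameters; an in-memory buffer replaces the .pt file here.
buffer = io.BytesIO()
torch.save(model.state_dict(), buffer)
buffer.seek(0)

# Rebuild the architecture, then load the saved parameters into it.
restored = nn.Linear(4, 2)
restored.load_state_dict(torch.load(buffer))
restored.eval()

x = torch.randn(1, 4)
print(torch.equal(model(x), restored(x)))
```

The trade-off is that the loading code has to construct the model with the same sizes that were used for training.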