# Text Classification Example

This example reproduces the sentiment analysis results of the fast text classifier described in [Bag of Tricks for Efficient Text Classification](https://arxiv.org/abs/1607.01759). In the paper, sentences are represented as a bag of words (BoW) and used to train a linear classifier. Recently, several text classification datasets were added to PyTorch/torchtext, and each can be loaded with a single command:

- AG_NEWS
- SogouNews
- DBpedia
- YelpReviewPolarity
- YelpReviewFull
- YahooAnswers
- AmazonReviewPolarity
- AmazonReviewFull

This example shows how to apply TextClassificationDataset and reproduces the results of the paper.

## Load data with ngrams
A bag of n-gram features is used in the paper to capture partial information about the local word order. In practice, bi-grams or tri-grams are used because word groups carry more information than single words. An example:

    "load data with ngrams"
    Bi-grams results: "load data", "data with", "with ngrams"
    Tri-grams results: "load data with", "data with ngrams"

TextClassificationDataset supports n-grams. With ngrams set to 2, each example text in the dataset becomes a list of single words plus bi-gram strings.
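
As a rough illustration of the resulting token list, the sketch below builds single words plus n-grams in plain Python. The helper name `words_plus_ngrams` is hypothetical and not part of torchtext; the library's own implementation may differ in detail.

```python
def words_plus_ngrams(tokens, n):
    """Return the single words followed by all 2..n-grams, joined by spaces."""
    result = list(tokens)
    for size in range(2, n + 1):
        # Slide a window of `size` words over the token list
        for start in range(len(tokens) - size + 1):
            result.append(" ".join(tokens[start:start + size]))
    return result


tokens = "load data with ngrams".split()
features = words_plus_ngrams(tokens, 2)
print(features)
```

With `n=2`, the list holds the four single words followed by the three bi-grams from the example above.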

Data iterators are loaded via the instance's iters() function with a batch size of 128 and a computation device. At the same time, the word strings are numericalized (i.e. converted from a list of tokens to a list of indices).
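
Numericalization can be pictured as a dictionary lookup. The sketch below uses a made-up toy vocabulary; `stoi`, `UNK_INDEX` and `numericalize` are illustrative names here, not the torchtext Field API. Tokens missing from the vocabulary are assumed to map to a shared unknown index.

```python
# Hypothetical toy vocabulary: token string -> integer index
UNK_INDEX = 0
stoi = {"<unk>": UNK_INDEX, "load": 1, "data": 2, "load data": 3}


def numericalize(tokens, stoi, unk_index=UNK_INDEX):
    """Convert a list of token strings to a list of integer indices."""
    return [stoi.get(token, unk_index) for token in tokens]


indices = numericalize(["load", "data", "load data", "ngrams"], stoi)
print(indices)
```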

In [None]:
import torch
import torchtext
from torchtext.datasets import AG_NEWS
NGRAMS = 2
txt_cls = AG_NEWS(ngrams=NGRAMS)
BATCH_SIZE = 512
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

## Define the model

The model is composed of an embedding layer and a linear layer. The embedding layer is an nn.EmbeddingBag, which by default averages the embeddings of all tokens in each text, so the pooling between the two layers needs no separate step. The linear layer outputs one raw score per class; the log-softmax that turns these scores into a probability distribution over the classes is applied inside the loss function (CrossEntropyLoss).
<img src="./pictures/text_sentiment_ngrams_model.png" width="600" height="360">
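
The averaging step can be checked on a toy input. The sketch below assumes the default `mode="mean"` of `nn.EmbeddingBag` and compares its output with an explicit per-text mean of the embedding vectors. The vocabulary size, embedding size and token indices are made up for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy sizes (assumptions): 10 vocabulary entries, 4-dimensional embeddings
bag = nn.EmbeddingBag(10, 4)  # mode defaults to "mean"

# Two texts packed into one 1-D tensor: [1, 2] and [3, 4, 5]
text = torch.tensor([1, 2, 3, 4, 5])
offsets = torch.tensor([0, 2])

with torch.no_grad():
    pooled = bag(text, offsets)
    # Explicit mean over the embedding rows of each text
    manual = torch.stack([
        bag.weight[text[0:2]].mean(dim=0),
        bag.weight[text[2:5]].mean(dim=0),
    ])

print(pooled.shape)
print(torch.allclose(pooled, manual))
```

Each text is reduced to a single vector of the embedding size, which the linear layer then maps to class scores.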

In [None]:
import torch.nn as nn
import torch.nn.functional as F
class TextSentiment(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_class):
        super().__init__()
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, sparse=True)
        self.fc = nn.Linear(embed_dim, num_class)
        self.init_weights()

    def init_weights(self):
        initrange = 0.5
        self.embedding.weight.data.uniform_(-initrange, initrange)
        self.fc.weight.data.uniform_(-initrange, initrange)
        self.fc.bias.data.zero_()
        
    def forward(self, text, offsets):
        embedded = self.embedding(text, offsets)
        return self.fc(embedded)

## Initiate an instance

The AG_NEWS dataset has four labels and therefore the number of classes is four.

    1 : World
    2 : Sports
    3 : Business
    4 : Sci/Tec

The vocabulary size is the length of the vocab, which includes both single words and n-grams.

In [None]:
VOCAB_SIZE = len(txt_cls.fields['text'].vocab)
EMBED_DIM = 128
NUM_CLASS = 4
UNK_IDX = txt_cls.fields['text'].vocab.stoi[txt_cls.fields['text'].unk_token]
model = TextSentiment(VOCAB_SIZE, EMBED_DIM, NUM_CLASS).to(device)
model.embedding.weight.data[UNK_IDX] = torch.zeros(EMBED_DIM)

## Add optimizer and loss function

Stochastic gradient descent ([SGD](https://pytorch.org/docs/stable/optim.html)) is used to optimize the model.
[CrossEntropyLoss](https://pytorch.org/docs/stable/nn.html) combines log-softmax and the negative log likelihood loss ([NLLLoss](https://pytorch.org/docs/stable/nn.html?highlight=nll_loss#torch.nn.NLLLoss)) in a single criterion, which is used to train a classification problem with C classes.
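
For a single example, taking the negative log likelihood of a log-softmax output gives the cross-entropy loss. The plain-Python sketch below illustrates that relationship for one vector of four class scores. The function names are illustrative and the scores are made up.

```python
import math


def log_softmax(scores):
    """Log of the softmax probabilities for a list of raw class scores."""
    # Subtract the maximum first for numerical stability
    peak = max(scores)
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return [s - log_norm for s in scores]


def cross_entropy(scores, target):
    """Negative log likelihood of the target class under log-softmax."""
    return -log_softmax(scores)[target]


# Equal scores: every class has probability 1/4
loss_uniform = cross_entropy([0.0, 0.0, 0.0, 0.0], target=2)
# A high score for the correct class gives a much smaller loss
loss_confident = cross_entropy([0.0, 0.0, 10.0, 0.0], target=2)
print(loss_uniform, loss_confident)
```

With equal scores the loss equals log(4), the value expected from a classifier that guesses uniformly among four classes.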

In [None]:
import torch.optim as optim
optimizer = optim.SGD(model.parameters(), lr=4.0)
loss_func = torch.nn.CrossEntropyLoss().to(device)

## Functions used to generate batch

The functions generate_batch() and generate_offsets() produce data batches and offsets that are compatible with EmbeddingBag. EmbeddingBag efficiently processes a sequence of tensors with different lengths, so no padding is required here. This is faster than the original method, which depended on the TorchText Iterator and Batch.
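
To see why no padding is needed, consider how texts of different lengths can be packed into one flat sequence, with offsets marking where each text starts. The plain-Python sketch below is only an illustration of that layout; the helper names `flatten_with_offsets` and `split_bags` are hypothetical.

```python
def flatten_with_offsets(texts):
    """Pack variable-length index lists into one flat list plus start offsets."""
    flat, offsets = [], []
    for indices in texts:
        offsets.append(len(flat))  # this text starts at the current end
        flat.extend(indices)
    return flat, offsets


def split_bags(flat, offsets):
    """Recover the individual texts from the flat list and its offsets."""
    ends = offsets[1:] + [len(flat)]
    return [flat[start:end] for start, end in zip(offsets, ends)]


texts = [[4, 7], [1, 2, 3], [9]]
flat, offsets = flatten_with_offsets(texts)
print(flat, offsets)
```

The offsets alone are enough to recover every text, so shorter texts never have to be padded to the length of the longest one.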

In [None]:
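def generate_offsets(data_batch):
    # offsets[k] is the start position of the k-th text in the concatenated batch
    offsets = [0]
    for entry in data_batch:
        offsets.append(offsets[-1] + len(entry))
    # Drop the last value: it is the total length, not the start of a text
    offsets = torch.tensor(offsets[:-1])
    return offsets

def generate_batch(examples, i, batch_size):
    text_batch, label_batch = [], []
    for idx in range(i, min(i + batch_size, len(examples))):
        # Flatten each text to 1-D so that len() is its token count
        text_batch.append(examples[idx][0].reshape(-1))
        label_batch.append(examples[idx][1])
    offsets = generate_offsets(text_batch)
    # EmbeddingBag takes one 1-D tensor of indices plus the offsets
    text = torch.cat(text_batch)
    label = torch.cat(label_batch)
    # Uses the module-level `device` defined when loading the data
    return text.to(device), offsets.to(device), label.to(device)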

## Define a function to train the model and evaluate results.

In [None]:
def train_evaluate_func(model, examples, loss_func, device, batch_size=64, status='train', optimizer=None):

    _loss = 0
    _acc = 0

    if status == 'train':
        model.train()
    else:
        model.eval()

    for i in range(0, len(examples), batch_size):
        text, offsets, label = generate_batch(examples, i, batch_size)
        if status == 'train':
            optimizer.zero_grad()
        output = model(text, offsets)
        loss = loss_func(output, label)
        _loss += loss.item()

        if status == 'train':
            loss.backward()
            optimizer.step()

        _acc += (output.argmax(1) == label).sum().item()

    return _loss / len(examples), _acc / len(examples)

## Numericalize text and label

In [None]:
train_examples_tensors = []
for idx in range(len(txt_cls.train_examples)):
    text = txt_cls.fields['text'].numericalize([txt_cls.train_examples[idx].text])
    label = torch.Tensor([txt_cls.train_examples[idx].label - 1]).long()
    train_examples_tensors.append([text, label])

test_examples_tensors = []
for idx in range(len(txt_cls.test_examples)):
    text = txt_cls.fields['text'].numericalize([txt_cls.test_examples[idx].text])
    label = torch.Tensor([txt_cls.test_examples[idx].label - 1]).long()
    test_examples_tensors.append([text, label])

## Run the model

In [None]:
import random
import time

N_EPOCHS = 6
min_valid_loss = float('inf')

for epoch in range(N_EPOCHS):

    start_time = time.time()

    # Reshuffle and re-split on every epoch, so the validation set changes between epochs
    random.shuffle(train_examples_tensors)
    split_idx = int(len(train_examples_tensors) * 0.7)
    train_examples, valid_examples = train_examples_tensors[:split_idx], train_examples_tensors[split_idx:]
    
    train_loss, train_acc = train_evaluate_func(model, train_examples, loss_func, device=device, status='train', optimizer=optimizer)
    with torch.no_grad():
        valid_loss, valid_acc = train_evaluate_func(model, valid_examples, loss_func, device=device, status='valid')
    
    # Split the elapsed time into whole minutes and remaining seconds
    _mins, _secs = divmod(int(time.time() - start_time), 60)

    if valid_loss < min_valid_loss:
        min_valid_loss = valid_loss
        torch.save(model, 'text_classification.pt')
    
    print('Epoch: %d' %(epoch + 1), " | time in %d minutes, %d seconds" %(_mins, _secs))
    print(f'\tLoss: {train_loss:.3f}(train)\t|\t{valid_loss:.3f}(valid)')
    print(f'\tAcc: {train_acc * 100:.1f}%(train)\t|\t{valid_acc * 100:.1f}%(valid)')

Running the model on a GPU produces the following output:

Epoch: 1  | time in 12 minutes, 33 seconds

        Loss: 0.009(train)      |       0.006(valid)
        Acc: 77.8%(train)       |       87.4%(valid)
        
Epoch: 2  | time in 18 minutes, 31 seconds

        Loss: 0.005(train)      |       0.005(valid)
        Acc: 89.7%(train)       |       90.2%(valid)
        
Epoch: 3  | time in 10 minutes, 11 seconds

        Loss: 0.004(train)      |       0.004(valid)
        Acc: 91.9%(train)       |       91.8%(valid)
        
Epoch: 4  | time in 10 minutes, 55 seconds          

        Loss: 0.003(train)      |       0.003(valid)
        Acc: 93.4%(train)       |       92.8%(valid)
        
Epoch: 5  | time in 10 minutes, 57 seconds         

        Loss: 0.003(train)      |       0.003(valid)
        Acc: 94.5%(train)       |       94.7%(valid)
        
Epoch: 6  | time in 18 minutes, 19 seconds          

        Loss: 0.002(train)      |       0.002(valid)
        Acc: 95.6%(train)       |       95.3%(valid)

## Evaluate the model with test dataset

In [None]:
print('Checking the results of test dataset...')
with torch.no_grad():
    random.shuffle(test_examples_tensors)
    test_loss, test_acc = train_evaluate_func(model, test_examples_tensors, loss_func, device=device, status='test')
print(f'\tLoss: {test_loss:.3f}(test)\t|\tAcc: {test_acc * 100:.1f}%(test)')

Checking the results of test dataset...

        Loss: 0.005(test)       |       Acc: 90.4%(test)
        
The results are consistent with the reference paper [Bag of Tricks for Efficient Text Classification](https://arxiv.org/abs/1607.01759).

## Test on a random news

Use the best model so far to classify a news article about golf.

In [None]:
import re
from torchtext.data.utils import generate_ngrams

ag_news_label = {1 : "World",
                 2 : "Sports",
                 3 : "Business",
                 4 : "Sci/Tec"}

def predict_label(text_string, ngrams):
    model.eval()
    # Replace punctuation with spaces, then split on any run of whitespace
    text_string = re.sub(r'[^a-zA-Z0-9\s]', ' ', text_string)
    tokens = text_string.split()
    ngrams_tokens = generate_ngrams(tokens, ngrams)
    indexed = [txt_cls.fields['text'].vocab.stoi[item] for item in ngrams_tokens]
    indexed_tensor = torch.tensor(indexed, dtype=torch.long).to(device)
    # A single text forms one bag starting at position 0
    offsets = torch.tensor([0], dtype=torch.long).to(device)
    with torch.no_grad():
        result = model(indexed_tensor, offsets)
    # Class indices are 0-based; AG_NEWS labels are 1-based
    label_index = result.argmax(1).item() + 1
    return ag_news_label[label_index]

example_text_string = "Defending champion Bryson DeChambeau and 2017 \
    champion Dustin Johnson have committed to play in THE NORTHERN \
    TRUST this August for the first event of the FedExCup Playoffs. \
    DeChambeau and Johnson will be joined by the biggest names in the \
    game at Liberty National Golf Club, which gets ready to host THE \
    NORTHERN TRUST for the third time (previously 2009 and 2013). Other \
    notable past champions of THE NORTHERN TRUST include Adam \
    Scott (who won in 2013), Sergio Garcia, Jason Day, Matt \
    Kuchar and Patrick Reed."

# Reload the checkpoint saved at the lowest validation loss
with open("text_classification.pt", 'rb') as f:
    model = torch.load(f)
news_label = predict_label(example_text_string, 2)

print("This is a %s news" %news_label)

This is a Sports news