# Applied Deep Learning Project - Text Classification
**For Milestone 1, I followed the PyTorch tutorial `TEXT CLASSIFICATION WITH THE TORCHTEXT LIBRARY` and added inline comments based on my understanding.**

The original tutorial can be found at https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html

In [1]:
!pip uninstall -y torchdata torchtext portalocker
!pip install torchdata torchtext portalocker

Found existing installation: torchtext 0.14.1
Uninstalling torchtext-0.14.1:
  Successfully uninstalled torchtext-0.14.1
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting torchdata
  Downloading torchdata-0.6.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.6 MB)
Collecting torchtext
  Downloading torchtext-0.15.1-cp39-cp39-manylinux1_x86_64.whl (2.0 MB)
Collecting portalocker
  Downloading portalocker-2.7.0-py2.py3-none-any.whl (15 kB)
Collecting torch==2.0.0
  Downloading torch-2.0.0-cp39-cp39-manylinux1_x86_64.whl (619.9 MB)
Collecti

## Access to the raw dataset iterators
The torchtext library provides a few raw dataset iterators, which yield the raw text strings. For example, the AG_NEWS dataset iterators yield the raw data as a tuple of label and text.

In [2]:
import torch
from torchtext.datasets import AG_NEWS

In [3]:
# Create an iterator over the training subset of the AG_NEWS dataset
train_iter = iter(AG_NEWS(split='train'))

In [4]:
# Get the next item from the iterator. The result is a tuple (c, t) where c is the category this news is in, and t is the news text
next(train_iter)

(3,
 "Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\\band of ultra-cynics, are seeing green again.")

In [5]:
next(train_iter)

(3,
 'Carlyle Looks Toward Commercial Aerospace (Reuters) Reuters - Private investment firm Carlyle Group,\\which has a reputation for making well-timed and occasionally\\controversial plays in the defense industry, has quietly placed\\its bets on another part of the market.')

In [6]:
next(train_iter)

(3,
 "Oil and Economy Cloud Stocks' Outlook (Reuters) Reuters - Soaring crude prices plus worries\\about the economy and the outlook for earnings are expected to\\hang over the stock market next week during the depth of the\\summer doldrums.")

## Prepare data processing pipelines

In [7]:
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

In [8]:
# Create a function that splits a given text into individual tokens.
# Use the 'basic_english' tokenizer, a simple tokenizer that lowercases the text, separates common punctuation into its own tokens, and splits on whitespace.
tokenizer = get_tokenizer('basic_english')
train_iter = AG_NEWS(split='train')

# For each (label, text) item in 'data_iter', skip the label, tokenize the news text, and yield the resulting list of tokens one item at a time.
def yield_tokens(data_iter):
    for _, text in data_iter:
        yield tokenizer(text)

In [9]:
# Use 'build_vocab_from_iterator' function to create a vocabulary from a collection of tokens, with the <unk> token as a special token.
# Use 'yield_tokens' function to generate the tokens from the AG_NEWS dataset.
vocab = build_vocab_from_iterator(yield_tokens(train_iter), specials=["<unk>"])
# Set the default index of the vocabulary to the index of the <unk> token
vocab.set_default_index(vocab["<unk>"])

In [10]:
# The vocabulary block converts a list of tokens into integers.
vocab(['here', 'is', 'an', 'example'])

[475, 21, 30, 5297]
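
To make the lookup concrete, here is a plain-Python sketch of a vocabulary with an `<unk>` fallback. It illustrates the idea only and is not torchtext's implementation: `build_toy_vocab` and `lookup` are hypothetical names, and the ordering rule (frequency first, then alphabetical) is an assumption.

```python
from collections import Counter

def build_toy_vocab(token_lists, specials=("<unk>",)):
    """Map each token to an integer index: specials first, then by frequency."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    # Sort by descending frequency, breaking ties alphabetically (an assumption).
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    return {tok: idx for idx, tok in enumerate(list(specials) + ordered)}

def lookup(vocab, tokens, default="<unk>"):
    """Convert tokens to indices, falling back to the default token's index."""
    return [vocab.get(tok, vocab[default]) for tok in tokens]

corpus = [["the", "cat", "sat"], ["the", "dog", "sat", "down"]]
toy_vocab = build_toy_vocab(corpus)
ids = lookup(toy_vocab, ["the", "bird", "sat"])
print(ids)  # [2, 0, 1]
```

The unseen token "bird" maps to index 0, the index of `<unk>`, which is the role `set_default_index` plays in the cell above.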

In [11]:
# Define a lambda function called text_pipeline that takes in a string x, tokenizes it using the tokenizer, and converts the tokens to indices using the vocab object
text_pipeline = lambda x: vocab(tokenizer(x))
# Define a lambda function called label_pipeline that takes in a label x, converts it to an integer, and subtracts 1 (to map the class labels from 1-4 to 0-3)
label_pipeline = lambda x: int(x) - 1

In [12]:
# The text pipeline converts a text string into a list of integers based on the lookup table defined in the vocabulary. 
# The label pipeline converts the label into integers. For example:
text_pipeline('here is the an example')

[475, 21, 2, 30, 5297]

In [13]:
label_pipeline('10')

9

## Generate data batch and iterator
[torch.utils.data.DataLoader](https://pytorch.org/docs/stable/data.html?highlight=dataloader#torch.utils.data.DataLoader) is recommended for PyTorch users (a tutorial is at https://pytorch.org/tutorials/beginner/data_loading_tutorial.html). It works with a map-style dataset that implements the `__getitem__()` and `__len__()` protocols and represents a map from indices/keys to data samples. It also works with an iterable dataset when the shuffle argument is False.
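
The difference between the two dataset styles can be sketched without PyTorch. The names `ToyMapDataset` and `toy_iterable_dataset` below are hypothetical and only mimic the protocols; they are not the library's classes.

```python
class ToyMapDataset:
    """Map-style: supports len() and integer indexing."""

    def __init__(self, samples):
        self._samples = list(samples)

    def __len__(self):
        return len(self._samples)

    def __getitem__(self, index):
        return self._samples[index]

def toy_iterable_dataset():
    """Iterable-style: yields samples in a fixed order, with no random access."""
    yield (3, "stocks rally")
    yield (2, "team wins final")

# Materializing the iterable gives a map-style view of the same samples.
map_ds = ToyMapDataset(toy_iterable_dataset())
print(len(map_ds), map_ds[1])
```

Shuffling by index needs the random access that only the map-style form offers, which is one way to read the shuffle restriction on iterable datasets.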

Before the data is sent to the model, the collate_fn function processes each batch of samples generated by DataLoader. The input to collate_fn is a batch of data whose size is the batch size set in DataLoader, and collate_fn processes it according to the data processing pipelines declared previously. Make sure that collate_fn is declared as a top-level def so that the function is available in each worker.

In this example, the text entries in the original data batch are converted to tensors and concatenated into a single tensor for the input of nn.EmbeddingBag. The offsets tensor holds the starting index of each individual sequence within the text tensor. The label tensor stores the labels of the individual text entries.
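
As a rough illustration of the flattening and offsets described here, the following plain-Python sketch uses lists instead of tensors. `toy_collate` is a hypothetical helper, and the token ids are made up for the example.

```python
from itertools import accumulate

def toy_collate(batch_token_ids):
    """Flatten a batch of variable-length id lists and record where each starts."""
    flat = [tok for ids in batch_token_ids for tok in ids]
    lengths = [len(ids) for ids in batch_token_ids]
    # Start index of each sequence = running sum of the lengths before it.
    offsets = list(accumulate(([0] + lengths)[:-1]))
    return flat, offsets

flat, offsets = toy_collate([[475, 21], [2, 30, 5297], [7]])
print(flat)     # [475, 21, 2, 30, 5297, 7]
print(offsets)  # [0, 2, 5]
```

Dropping the last length before the running sum mirrors the `offsets[:-1]` followed by `cumsum` pattern: the final length is not needed because the last sequence runs to the end of the flattened data.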

In [14]:
# Import the DataLoader class.
# The DataLoader class is a PyTorch utility that allows for efficient loading of large datasets during training or testing of machine learning models. 
#   It allows for batching, shuffling, and other data loading and processing operations that can help speed up training and improve model performance.
from torch.utils.data import DataLoader
# Create a device object that specifies whether to use a CUDA-enabled GPU device (if available) or the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [16]:
# This function takes in a batch of examples from a dataset, preprocesses the labels and text using the label_pipeline and text_pipeline functions, 
#   concatenates the processed text tensors into a single tensor, and returns a tuple of tensors that can be directly used as input to a PyTorch model.
def collate_batch(batch):
    label_list, text_list, offsets = [], [], [0]
    for (_label, _text) in batch:
        label_list.append(label_pipeline(_label))
        processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int64)
        text_list.append(processed_text)
        offsets.append(processed_text.size(0))
    label_list = torch.tensor(label_list, dtype=torch.int64)
    offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)
    text_list = torch.cat(text_list)
    return label_list.to(device), text_list.to(device), offsets.to(device)

In [17]:
# Load the training subset of the AG_NEWS dataset as an iterable dataset
train_iter = AG_NEWS(split='train')
# Set up a DataLoader object that can be used to efficiently load batches of data from the AG_NEWS training subset
dataloader = DataLoader(train_iter, batch_size=8, shuffle=False, collate_fn=collate_batch)

## Define the model
The model is composed of the [nn.EmbeddingBag](https://pytorch.org/docs/stable/nn.html?highlight=embeddingbag#torch.nn.EmbeddingBag) layer plus a linear layer for classification. nn.EmbeddingBag with the default mode of “mean” computes the mean value of a “bag” of embeddings. Although the text entries here have different lengths, the nn.EmbeddingBag module requires no padding because the boundaries of each entry are recorded in offsets.

Additionally, since nn.EmbeddingBag accumulates the average across the embeddings on the fly, it can improve the performance and memory efficiency of processing a sequence of tensors.
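
The "mean" pooling can be sketched in plain Python to show what the layer computes for each bag. This is an illustration under stated assumptions, not the library's code: `toy_embedding_bag_mean` is a hypothetical name, the weight table is made up, and every bag is assumed to be non-empty.

```python
def toy_embedding_bag_mean(weight, flat_ids, offsets):
    """Average the embedding rows of each bag delimited by offsets."""
    # Each bag runs from its offset to the next offset (or to the end).
    bounds = list(offsets) + [len(flat_ids)]
    dim = len(weight[0])
    pooled = []
    for start, end in zip(bounds[:-1], bounds[1:]):
        rows = [weight[i] for i in flat_ids[start:end]]
        # Assumes a non-empty bag; an empty one would divide by zero here.
        pooled.append([sum(r[d] for r in rows) / len(rows) for d in range(dim)])
    return pooled

# A made-up table of 4 embeddings with 2 dimensions each.
weight = [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out = toy_embedding_bag_mean(weight, [1, 2, 3, 1], [0, 2])
print(out)  # [[2.0, 3.0], [3.0, 4.0]]
```

Two bags of two tokens each produce two pooled vectors, so the output has one row per text entry regardless of entry length.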

In [18]:
from torch import nn

# Define a custom module for text classification, subclassed from nn.Module
class TextClassificationModel(nn.Module):

    # Define the constructor for the module, which takes in the vocabulary size, embedding dimension, and number of classes
    def __init__(self, vocab_size, embed_dim, num_class):
        super(TextClassificationModel, self).__init__()

        # Create an EmbeddingBag layer with the specified vocabulary size and embedding dimension
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, sparse=False)

        # Create a linear layer to map the embedded text to the specified number of classes
        self.fc = nn.Linear(embed_dim, num_class)

        # Initialize the weights for the embedding and linear layers using uniform random initialization
        self.init_weights()

    # Define a function to initialize the weights for the embedding and linear layers
    def init_weights(self):
        initrange = 0.5    # Set the range for the uniform random initialization
        self.embedding.weight.data.uniform_(-initrange, initrange)    # Initialize the embedding weights
        self.fc.weight.data.uniform_(-initrange, initrange)    # Initialize the linear weights
        self.fc.bias.data.zero_()    # Set the bias for the linear layer to zero

    # Define the forward function for the module, which takes in text and offsets as input and returns the output of the linear layer
    def forward(self, text, offsets):
        embedded = self.embedding(text, offsets)    # Embed the input text using the EmbeddingBag layer
        return self.fc(embedded)    # Apply the linear layer to the embedded text and return the result


## Initiate an instance
The AG_NEWS dataset has four labels and therefore the number of classes is four.

We build a model with an embedding dimension of 64. The vocab size is equal to the length of the vocabulary instance, and the number of classes is equal to the number of labels.

In [20]:
train_iter = AG_NEWS(split='train')
num_class = len(set([label for (label, text) in train_iter]))
vocab_size = len(vocab)
emsize = 64
model = TextClassificationModel(vocab_size, emsize, num_class).to(device)

## Define functions to train the model and evaluate results

In [21]:
import time

# Define a function to train the model using a dataloader
def train(dataloader):
    model.train()                   # Set the model to training mode
    total_acc, total_count = 0, 0   # Initialize variables to keep track of accuracy and count
    log_interval = 500              # Set the number of batches between each logging statement
    start_time = time.time()        # Record the start time

    # Loop over the batches in the dataloader
    for idx, (label, text, offsets) in enumerate(dataloader):
        optimizer.zero_grad()                       # Zero out the gradients for the optimizer
        predicted_label = model(text, offsets)      # Make predictions for the current batch
        loss = criterion(predicted_label, label)    # Compute the loss between the predictions and labels
        loss.backward()                             # Backpropagate the gradients through the model
        torch.nn.utils.clip_grad_norm_(model.parameters(), 0.1) # Clip the gradients to avoid exploding gradients
        optimizer.step()                            # Update the model parameters using the optimizer
        total_acc += (predicted_label.argmax(1) == label).sum().item() # Update the accuracy and count variables
        total_count += label.size(0)

        # Print a logging statement every log_interval batches
        if idx % log_interval == 0 and idx > 0:
            elapsed = time.time() - start_time
            print('| epoch {:3d} | {:5d}/{:5d} batches '
                  '| accuracy {:8.3f}'.format(epoch, idx, len(dataloader),
                                              total_acc/total_count))
            total_acc, total_count = 0, 0           # Reset the accuracy and count variables
            start_time = time.time()                # Record the start time for the next log interval

# Define a function to evaluate the model using a dataloader
def evaluate(dataloader):
    model.eval()                    # Set the model to evaluation mode
    total_acc, total_count = 0, 0

    # Disable gradient tracking since we are not training
    with torch.no_grad():
        # Loop over the batches in the dataloader
        for idx, (label, text, offsets) in enumerate(dataloader):
            predicted_label = model(text, offsets)
            loss = criterion(predicted_label, label) 
            total_acc += (predicted_label.argmax(1) == label).sum().item() 
            total_count += label.size(0)
    return total_acc/total_count
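
The gradient clipping step in `train` can be pictured with a small plain-Python sketch of norm-based clipping. `toy_clip_grad_norm` is a hypothetical helper operating on a flat list of numbers, and the small `eps` term is an assumption modeled on the usual formulation rather than taken from this notebook.

```python
import math

def toy_clip_grad_norm(grads, max_norm, eps=1e-6):
    """Rescale a flat list of gradient values so their L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # The scale factor is capped at 1 so small gradients are left unchanged.
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm

clipped, norm_before = toy_clip_grad_norm([3.0, 4.0], max_norm=0.1)
norm_after = math.sqrt(sum(g * g for g in clipped))
print(norm_before, norm_after)
```

The direction of the gradient is preserved and only its length shrinks, which is why clipping limits the step size without changing where the update points.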

## Split the dataset and run the model
Since the original AG_NEWS has no validation split, we split the training dataset into train/valid sets with a split ratio of 0.95 (train) and 0.05 (valid). Here we use the torch.utils.data.dataset.random_split function from the PyTorch core library.
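
The split sizes follow from simple arithmetic, sketched below. The training-set size of 120,000 is an assumption not stated in this notebook, and `split_sizes` is a hypothetical helper.

```python
import math

def split_sizes(num_samples, train_ratio, batch_size):
    """Return train/valid sizes and the number of training batches."""
    num_train = int(num_samples * train_ratio)
    num_valid = num_samples - num_train
    # A loader that keeps the last partial batch yields ceil(n / batch_size) batches.
    num_batches = math.ceil(num_train / batch_size)
    return num_train, num_valid, num_batches

# 120,000 is the assumed size of the AG_NEWS training split.
num_train, num_valid, num_batches = split_sizes(120_000, 0.95, 64)
print(num_train, num_valid, num_batches)  # 114000 6000 1782
```

Under that assumption, a batch size of 64 gives 1782 training batches per epoch, which is consistent with the `1782 batches` reported in the training log.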

The CrossEntropyLoss criterion combines nn.LogSoftmax() and nn.NLLLoss() in a single class. It is useful when training a classification model with C classes. SGD implements the stochastic gradient descent method as the optimizer. The initial learning rate is set to 5.0. StepLR is used here to adjust the learning rate across epochs.
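
The "log-softmax followed by negative log-likelihood" combination can be worked through for a single sample in plain Python. The function names below are hypothetical, and the sketch covers one sample only, without the averaging over a batch that a loss class typically applies.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of raw scores."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_sum for x in logits]

def nll_loss(log_probs, target):
    """Negative log-likelihood of the target class."""
    return -log_probs[target]

def cross_entropy(logits, target):
    """Cross-entropy as log-softmax followed by negative log-likelihood."""
    return nll_loss(log_softmax(logits), target)

# Four equal scores: each class gets probability 1/4, so the loss is ln(4).
loss = cross_entropy([0.0, 0.0, 0.0, 0.0], target=2)
print(loss)
```

An untrained four-class model that scores every class equally therefore starts with a loss of about 1.386, and the loss falls as the score of the correct class rises relative to the others.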

In [22]:
from torch.utils.data.dataset import random_split
from torchtext.data.functional import to_map_style_dataset
# Define hyperparameters for the training process
EPOCHS = 10       # number of training epochs
LR = 5            # learning rate
BATCH_SIZE = 64   # batch size for training

# Define the loss function, optimizer, and learning rate scheduler for the model
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=LR)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1.0, gamma=0.1)

# Track the best validation accuracy seen so far (None until the first epoch finishes)
total_accu = None

# Load the training and testing dataset and convert to the map-style format
train_iter, test_iter = AG_NEWS()
train_dataset = to_map_style_dataset(train_iter)
test_dataset = to_map_style_dataset(test_iter)

# Split the training dataset into training and validation sets
num_train = int(len(train_dataset) * 0.95)
split_train_, split_valid_ = \
    random_split(train_dataset, [num_train, len(train_dataset) - num_train])

# Create dataloaders for the training, validation, and testing sets using the collate_batch function
train_dataloader = DataLoader(split_train_, batch_size=BATCH_SIZE,
                              shuffle=True, collate_fn=collate_batch)
valid_dataloader = DataLoader(split_valid_, batch_size=BATCH_SIZE,
                              shuffle=True, collate_fn=collate_batch)
test_dataloader = DataLoader(test_dataset, batch_size=BATCH_SIZE,
                             shuffle=True, collate_fn=collate_batch)

# Loop over the specified number of epochs and train the model on the training set, evaluate on the validation set, and adjust the learning rate as needed
for epoch in range(1, EPOCHS + 1):
    epoch_start_time = time.time()
    train(train_dataloader)
    accu_val = evaluate(valid_dataloader)

    # Decay the learning rate when the validation accuracy falls below the best value so far;
    # otherwise record the new best validation accuracy
    if total_accu is not None and total_accu > accu_val:
        scheduler.step()
    else:
        total_accu = accu_val

    # Print a logging statement for the current epoch, including the time elapsed, validation accuracy, and epoch number
    print('-' * 59)
    print('| end of epoch {:3d} | time: {:5.2f}s | '
          'valid accuracy {:8.3f} '.format(epoch,
                                           time.time() - epoch_start_time,
                                           accu_val))
    print('-' * 59)

| epoch   1 |   500/ 1782 batches | accuracy    0.685
| epoch   1 |  1000/ 1782 batches | accuracy    0.856
| epoch   1 |  1500/ 1782 batches | accuracy    0.876
-----------------------------------------------------------
| end of epoch   1 | time: 29.49s | valid accuracy    0.876 
-----------------------------------------------------------
| epoch   2 |   500/ 1782 batches | accuracy    0.897
| epoch   2 |  1000/ 1782 batches | accuracy    0.904
| epoch   2 |  1500/ 1782 batches | accuracy    0.903
-----------------------------------------------------------
| end of epoch   2 | time: 30.05s | valid accuracy    0.893 
-----------------------------------------------------------
| epoch   3 |   500/ 1782 batches | accuracy    0.915
| epoch   3 |  1000/ 1782 batches | accuracy    0.914
| epoch   3 |  1500/ 1782 batches | accuracy    0.915
-----------------------------------------------------------
| end of epoch   3 | time: 27.69s | valid accuracy    0.903 
-------------------------------
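
The way the training loop above drives the learning rate can be replayed in plain Python. `simulate_lr` is a hypothetical helper, the validation accuracies are invented for illustration and are not results from this notebook, and each scheduler step is modeled as a single multiplication by gamma, which is what a step size of 1 amounts to.

```python
def simulate_lr(valid_accuracies, lr=5.0, gamma=0.1):
    """Replay the loop's rule: decay lr when accuracy falls below the tracked value."""
    tracked = None
    history = []
    for accu_val in valid_accuracies:
        if tracked is not None and tracked > accu_val:
            lr *= gamma          # one scheduler step with step_size=1
        else:
            tracked = accu_val   # accuracy held or improved; remember it
        history.append(lr)
    return history

# Hypothetical validation accuracies, not results from this notebook.
history = simulate_lr([0.80, 0.85, 0.84, 0.86, 0.86])
print(history)
```

In this made-up run the rate drops once, at the third epoch, and stays there: an epoch that ties the tracked accuracy does not trigger a decay.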

## Evaluate the model with test dataset
Check the results on the test dataset.

In [23]:
print('Checking the results of test dataset.')
accu_test = evaluate(test_dataloader)
print('test accuracy {:8.3f}'.format(accu_test))

Checking the results of test dataset.
test accuracy    0.908


## Test on a random news article
Use the best model so far to classify a golf news story.

In [24]:
ag_news_label = {1: "World",
                 2: "Sports",
                 3: "Business",
                 4: "Sci/Tec"}

def predict(text, text_pipeline):
    with torch.no_grad():
        text = torch.tensor(text_pipeline(text))
        output = model(text, torch.tensor([0]))
        return output.argmax(1).item() + 1

ex_text_str = "MEMPHIS, Tenn. – Four days ago, Jon Rahm was \
    enduring the season’s worst weather conditions on Sunday at The \
    Open on his way to a closing 75 at Royal Portrush, which \
    considering the wind and the rain was a respectable showing. \
    Thursday’s first round at the WGC-FedEx St. Jude Invitational \
    was another story. With temperatures in the mid-80s and hardly any \
    wind, the Spaniard was 13 strokes better in a flawless round. \
    Thanks to his best putting performance on the PGA Tour, Rahm \
    finished with an 8-under 62 for a three-stroke lead, which \
    was even more impressive considering he’d never played the \
    front nine at TPC Southwind."

model = model.to("cpu")

print("This is a %s news" %ag_news_label[predict(ex_text_str, text_pipeline)])

This is a Sports news
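
The last step of `predict`, turning model scores into a label name, can be sketched without a model. The scores below are hypothetical and `toy_predict` is an illustrative stand-in for the `argmax(1).item() + 1` expression.

```python
ag_news_label = {1: "World", 2: "Sports", 3: "Business", 4: "Sci/Tec"}

def toy_predict(logits):
    """Pick the highest-scoring class index and shift it back to the 1-based label."""
    best_index = max(range(len(logits)), key=lambda i: logits[i])
    return best_index + 1

# Hypothetical scores for one article; index 1 (Sports) is the largest.
label = toy_predict([-1.2, 3.4, 0.1, -0.5])
print(ag_news_label[label])  # Sports
```

The `+ 1` undoes the shift applied by `label_pipeline`, so the predicted class index lines up with the 1-based keys of `ag_news_label`.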
