
# Topic Classification using RNN Based Models and AG News Dataset

## Objective
Classify news articles into predefined categories (World, Sports, Business, Sci/Tech) using an RNN-based model.

## Basic Implementation
This process involves training a simple RNN model on the AG News dataset to classify news articles into four categories.

## Future Improvement (see section 12)
Replacing the simple RNN with a bidirectional LSTM should handle long-term dependencies better, and adding regularization such as dropout helps prevent overfitting. Together, these changes are expected to give more accurate and robust topic classification.

# 1. Set Up Environment
* Import necessary libraries (torch, torchtext, spacy).
* Set a random seed for reproducibility.

In [1]:
!pip install torch torchtext spacy
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [2]:
import torch
import torch.nn as nn
import torch.optim as optim
from torchtext.datasets import AG_NEWS
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from torch.utils.data import DataLoader, Dataset
import random



In [3]:
# Set random seeds for reproducibility (Python's RNG and PyTorch's)
SEED = 1234
random.seed(SEED)
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

# 2. Load and Prepare Data
* Use torchtext.datasets.AG_NEWS to load the AG News dataset.
* Tokenize text data using spacy.
* Build a vocabulary from the training data.

In [4]:
# Load the AG News dataset
train_iter = AG_NEWS(split='train')
test_iter = AG_NEWS(split='test')

In [5]:
# Tokenization
tokenizer = get_tokenizer('spacy', language='en_core_web_sm')

def yield_tokens(data_iter):
    for _, text in data_iter:
        yield tokenizer(text)

In [6]:
# Build vocabulary
vocab = build_vocab_from_iterator(yield_tokens(train_iter), specials=["<unk>"])
vocab.set_default_index(vocab["<unk>"])
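
To make the vocabulary step concrete, here is a minimal pure-Python sketch of what a token-to-index mapping with an `<unk>` fallback does. It is not torchtext's implementation: `build_toy_vocab` and `encode` are hypothetical helpers, and whitespace splitting stands in for the spaCy tokenizer.

```python
from collections import Counter

def build_toy_vocab(token_lists, specials=("<unk>",)):
    """Map each token to an integer index; special tokens come first."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    # Sort by descending frequency, then alphabetically for a stable order.
    ordered = sorted(counts, key=lambda t: (-counts[t], t))
    itos = list(specials) + ordered
    return {tok: idx for idx, tok in enumerate(itos)}

def encode(tokens, stoi, unk_token="<unk>"):
    """Convert tokens to indices, sending unseen tokens to the <unk> index."""
    unk = stoi[unk_token]
    return [stoi.get(tok, unk) for tok in tokens]

# Toy corpus (an assumption for illustration, not AG News data)
corpus = ["stocks fall as oil rises", "oil prices fall"]
toy_vocab = build_toy_vocab(s.split() for s in corpus)
ids = encode("oil prices soar".split(), toy_vocab)
print(ids)  # "soar" is not in the toy corpus, so it maps to index 0
```

The fallback index plays the same role as `vocab.set_default_index(vocab["<unk>"])` above: any word not seen during vocabulary building still gets a valid index at prediction time.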

# 3. Define Hyperparameters
* Set the batch size, embedding dimension, hidden dimension, output dimension (one unit per class), and number of training epochs. The number of LSTM layers and the dropout rate only become relevant for the improved model discussed in section 12.

In [7]:
# Define hyperparameters
BATCH_SIZE = 64
EMBEDDING_DIM = 100
HIDDEN_DIM = 256
OUTPUT_DIM = 4
EPOCHS = 5

# 4. Create Data Pipelines and Loaders
* Create pipelines to process text and label data.
* Implement a custom Dataset class for AG News data.
* Create DataLoaders to handle batching and shuffling of data.

In [8]:
# Text pipeline
text_pipeline = lambda x: vocab(tokenizer(x))

# Label pipeline
label_pipeline = lambda x: int(x) - 1

# Define custom dataset
class NewsDataset(Dataset):
    def __init__(self, data_iter):
        self.data = list(data_iter)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        label, text = self.data[idx]
        return label_pipeline(label), torch.tensor(text_pipeline(text), dtype=torch.long)

In [9]:
# Create DataLoaders
train_dataset = NewsDataset(AG_NEWS(split='train'))
test_dataset = NewsDataset(AG_NEWS(split='test'))

train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True, collate_fn=lambda x: x)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False, collate_fn=lambda x: x)
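
The loaders above pass batches through unchanged (`collate_fn=lambda x: x`), and padding happens later inside the training and evaluation loops. As an alternative, the padding could live in a collate function. The sketch below is one way to write it; `collate_batch` is a hypothetical name, and returning the sequence lengths is an addition the notebook does not use.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

def collate_batch(batch):
    """Turn a list of (label, token_tensor) pairs into batched tensors."""
    labels = torch.tensor([label for label, _ in batch], dtype=torch.long)
    lengths = torch.tensor([len(tokens) for _, tokens in batch], dtype=torch.long)
    # Zero-pad every sequence to the length of the longest one in the batch.
    texts = pad_sequence([tokens for _, tokens in batch], batch_first=True, padding_value=0)
    return labels, texts, lengths

# Toy batch with made-up token indices
toy_batch = [
    (0, torch.tensor([5, 2, 9])),
    (3, torch.tensor([7])),
]
labels, texts, lengths = collate_batch(toy_batch)
print(texts)
```

Note that with `specials=["<unk>"]` the index 0 is already taken by `<unk>`, so zero-padding reuses the unknown-word index. Using this function would also require the loops to unpack `(labels, texts, lengths)` instead of building the tensors themselves.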

# 5. Define the RNN Model
* Implement a simple RNN-based model using torch.nn.Module.
* Include an embedding layer, a single-layer unidirectional nn.RNN, and a fully connected output layer. The bidirectional LSTM and dropout are left for section 12.

In [10]:
# Define the RNN model
class RNN(nn.Module):
    def __init__(self, input_dim, embedding_dim, hidden_dim, output_dim):
        super().__init__()

        self.embedding = nn.Embedding(input_dim, embedding_dim)
        self.rnn = nn.RNN(embedding_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, output_dim)

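    def forward(self, text):
        # text: [batch, seq_len] -> embedded: [batch, seq_len, embedding_dim]
        embedded = self.embedding(text)
        # hidden: [1, batch, hidden_dim], the state after the last time step
        output, hidden = self.rnn(embedded)
        # Drop the layer dimension and project to class logits: [batch, output_dim]
        return self.fc(hidden.squeeze(0))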
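
One property of this forward pass is worth checking: the model classifies from the hidden state after the last time step, and in a padded batch the last steps of shorter sequences are padding. The sketch below, with toy dimensions and token indices chosen purely for illustration, compares the final hidden state with and without `pack_padded_sequence`. It is a demonstration of the effect, not a change to the model above.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence

torch.manual_seed(0)
embedding = nn.Embedding(20, 8, padding_idx=0)
rnn = nn.RNN(8, 16, batch_first=True)

seqs = [torch.tensor([4, 7, 2, 9]), torch.tensor([3, 5])]
lengths = torch.tensor([len(s) for s in seqs])
padded = pad_sequence(seqs, batch_first=True)  # shape [2, 4], zeros at the end

with torch.no_grad():
    # Unpacked: the short sequence's hidden state is also updated on two pad steps.
    _, h_padded = rnn(embedding(padded))
    # Packed: the RNN stops at each sequence's true length.
    packed = pack_padded_sequence(embedding(padded), lengths,
                                  batch_first=True, enforce_sorted=False)
    _, h_packed = rnn(packed)
    # Reference: run the short sequence on its own, with no padding at all.
    _, h_alone = rnn(embedding(seqs[1].unsqueeze(0)))

matches_packed = torch.allclose(h_packed[0, 1], h_alone[0, 0], atol=1e-6)
matches_padded = torch.allclose(h_padded[0, 1], h_alone[0, 0], atol=1e-6)
print(matches_packed, matches_padded)
```

With packing, the final state of the short sequence agrees with running it alone; without packing it generally does not. Since news articles vary widely in length, reading the state after many padding steps may contribute to the weak results reported in section 9.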

# 6. Initialize Model, Optimizer, and Loss Function
* Instantiate the model.
* Use the Adam optimizer and CrossEntropyLoss function.
* Move the model and loss function to GPU if available.

In [11]:
# Create the model
INPUT_DIM = len(vocab)

model = RNN(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM)

In [12]:
# Define optimizer and loss function
optimizer = optim.Adam(model.parameters())
criterion = nn.CrossEntropyLoss()

In [13]:
# Move model and criterion to the GPU if available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)
criterion = criterion.to(device)

# 7. Training Loop
* Implement a training loop to iterate through the dataset, performing forward and backward passes, computing loss, and updating model parameters.
* Track and return the average training loss and accuracy for the epoch.

In [14]:
# Training loop
def train(model, iterator, optimizer, criterion):
    epoch_loss = 0
    epoch_acc = 0

    model.train()

    for batch in iterator:
        labels = torch.tensor([item[0] for item in batch]).to(device)
        texts = torch.nn.utils.rnn.pad_sequence([item[1] for item in batch], batch_first=True).to(device)

        optimizer.zero_grad()
        predictions = model(texts)
        loss = criterion(predictions, labels)
        acc = (predictions.argmax(1) == labels).sum().item() / len(labels)
        loss.backward()
        optimizer.step()

        epoch_loss += loss.item()
        epoch_acc += acc

    return epoch_loss / len(iterator), epoch_acc / len(iterator)
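
The returned accuracy is the mean of the per-batch accuracies, which weights a small final batch the same as a full one. The sketch below contrasts that with a sample-weighted accuracy. The function names and the `(correct, batch_size)` counts are hypothetical and exaggerated to make the difference visible.

```python
def mean_of_batch_accuracies(batches):
    """Average the per-batch accuracies, as the loop above does."""
    return sum(correct / size for correct, size in batches) / len(batches)

def sample_weighted_accuracy(batches):
    """Divide total correct predictions by total samples."""
    total_correct = sum(correct for correct, _ in batches)
    total_samples = sum(size for _, size in batches)
    return total_correct / total_samples

# Made-up (correct, batch_size) counts; the last batch is smaller.
toy_batches = [(48, 64), (48, 64), (2, 8)]
print(f"{mean_of_batch_accuracies(toy_batches):.4f}")
print(f"{sample_weighted_accuracy(toy_batches):.4f}")
```

In this notebook the effect is small, because only the last batch of an epoch can be short, but accumulating correct counts and sample counts gives the exact figure.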

# 8. Evaluation Loop
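* Implement an evaluation loop to assess the model’s performance on the test dataset without updating parameters.
* Track and return the average loss and accuracy. The test split doubles as the validation set here, so these values are reported as validation metrics.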

In [15]:
# Evaluation loop
def evaluate(model, iterator, criterion):
    epoch_loss = 0
    epoch_acc = 0

    model.eval()

    with torch.no_grad():
        for batch in iterator:
            labels = torch.tensor([item[0] for item in batch]).to(device)
            texts = torch.nn.utils.rnn.pad_sequence([item[1] for item in batch], batch_first=True).to(device)

            predictions = model(texts)
            loss = criterion(predictions, labels)
            acc = (predictions.argmax(1) == labels).sum().item() / len(labels)

            epoch_loss += loss.item()
            epoch_acc += acc

    return epoch_loss / len(iterator), epoch_acc / len(iterator)

# 9. Train and Evaluate Model
Call the training and evaluation functions once per epoch and print the resulting metrics.

In [16]:
# Training the model
for epoch in range(EPOCHS):
    train_loss, train_acc = train(model, train_loader, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, test_loader, criterion)

    print(f'Epoch: {epoch+1:02}')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

Epoch: 01
	Train Loss: 1.396 | Train Acc: 25.19%
	 Val. Loss: 1.387 |  Val. Acc: 25.66%
Epoch: 02
	Train Loss: 1.395 | Train Acc: 25.12%
	 Val. Loss: 1.399 |  Val. Acc: 25.79%
Epoch: 03
	Train Loss: 1.395 | Train Acc: 25.10%
	 Val. Loss: 1.413 |  Val. Acc: 26.12%
Epoch: 04
	Train Loss: 1.395 | Train Acc: 25.34%
	 Val. Loss: 1.389 |  Val. Acc: 25.45%
Epoch: 05
	Train Loss: 1.396 | Train Acc: 25.50%
	 Val. Loss: 1.398 |  Val. Acc: 25.41%
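
A quick sanity check helps interpret these numbers. Assuming balanced classes (AG News has the same number of articles per category), a model that assigns equal probability to all four classes would score the following:

```python
import math

num_classes = 4
# Cross-entropy of a uniform prediction: -log(1 / num_classes) = log(num_classes).
chance_loss = math.log(num_classes)
chance_accuracy = 1 / num_classes

print(f"chance loss: {chance_loss:.3f}, chance accuracy: {chance_accuracy:.0%}")
# chance loss: 1.386, chance accuracy: 25%
```

The logged losses (about 1.39 to 1.41) and accuracies (about 25%) sit at this baseline, which suggests the model has learned essentially nothing rather than merely underfitting.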


# 10. Make Predictions
* Create a function to make predictions on new text inputs.
* Preprocess input text, run the model, and map predictions to topic labels.

In [17]:
# Function for making predictions on new data
def predict_topic(model, sentence):
    model.eval()
    tokenized = text_pipeline(sentence)
    # Add a batch dimension: [seq_len] -> [1, seq_len]
    tensor = torch.tensor(tokenized, dtype=torch.long).unsqueeze(0).to(device)
    # Inference only, so skip gradient tracking
    with torch.no_grad():
        prediction = model(tensor)
    return prediction.argmax(1).item()

# 11. Example Inputs and Outputs
* Provide multiple example sentences representing different topics.
* Use the prediction function to classify each example and print the predicted topics.

In [18]:
# Example inputs and predicted topics
example_sentences = [
    "The President met with leaders from several countries to discuss the trade agreement.",
    "Scientists have discovered a new species of fish in the deep ocean.",
    "The football player signed a record-breaking contract with his new team.",
    "Researchers are developing new technologies to improve solar panel efficiency.",
    "A major earthquake struck the coastal region, causing widespread damage.",
    "The tennis player won her first Grand Slam title in an impressive performance."
]

label_mapping = {0: "World", 1: "Sports", 2: "Business", 3: "Sci/Tech"}

for sentence in example_sentences:
    predicted_topic = predict_topic(model, sentence)
    print(f"Sentence: {sentence}")
    print(f"Predicted Topic: {label_mapping[predicted_topic]}")
    print()

Sentence: The President met with leaders from several countries to discuss the trade agreement.
Predicted Topic: Sci/Tech

Sentence: Scientists have discovered a new species of fish in the deep ocean.
Predicted Topic: Business

Sentence: The football player signed a record-breaking contract with his new team.
Predicted Topic: Sci/Tech

Sentence: Researchers are developing new technologies to improve solar panel efficiency.
Predicted Topic: Business

Sentence: A major earthquake struck the coastal region, causing widespread damage.
Predicted Topic: Sci/Tech

Sentence: The tennis player won her first Grand Slam title in an impressive performance.
Predicted Topic: Sci/Tech



# 12. Future Improvements
As you can see, the model's accuracy is low (about 25%, which is chance level for four balanced classes). There are several ways to improve the code:
1. Switch to an LSTM.
2. Use a bidirectional LSTM.
3. Add dropout layers.
4. Increase the embedding dimension.
5. Train for more epochs.

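An LSTM's gating helps it retain information across long articles, a bidirectional layer reads context from both directions, and dropout reduces overfitting, so these changes should improve the model’s performance.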
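
The first three items in the list can be sketched as a single model. This is one possible design, not a tested replacement for the notebook's model: the class name `BiLSTMClassifier`, the defaults `n_layers=2` and `dropout=0.5`, and the use of packed sequences are all assumptions. It also assumes a dedicated padding token at index 0, whereas the vocabulary built in section 2 has `<unk>` at that index, so a `<pad>` special would need to be added first.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

class BiLSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim,
                 n_layers=2, dropout=0.5, pad_idx=0):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_idx)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=n_layers,
                            bidirectional=True, dropout=dropout, batch_first=True)
        self.dropout = nn.Dropout(dropout)
        # Forward and backward final states are concatenated, hence 2 * hidden_dim.
        self.fc = nn.Linear(hidden_dim * 2, output_dim)

    def forward(self, text, lengths):
        embedded = self.dropout(self.embedding(text))
        # Packing makes the LSTM stop at each sequence's true length.
        packed = pack_padded_sequence(embedded, lengths.cpu(), batch_first=True,
                                      enforce_sorted=False)
        _, (hidden, _) = self.lstm(packed)
        # hidden: [n_layers * 2, batch, hidden_dim]; the last two entries are the
        # top layer's forward and backward final states.
        final = torch.cat((hidden[-2], hidden[-1]), dim=1)
        return self.fc(self.dropout(final))

# Toy usage with made-up sizes and token indices
model_sketch = BiLSTMClassifier(vocab_size=50, embedding_dim=16, hidden_dim=32, output_dim=4)
text = torch.tensor([[4, 7, 2, 9], [3, 5, 0, 0]])
lengths = torch.tensor([4, 2])
logits = model_sketch(text, lengths)
print(logits.shape)
```

Because `forward` takes the sequence lengths as a second argument, the training, evaluation, and prediction functions would need to compute and pass them along with the padded batch.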