# **Multi-Class Text Classification on the Reuters Dataset**
### _A Deep Learning Approach Using TensorFlow & Keras_

---

In this report, we develop a deep learning model to classify newswires into different topics using the Reuters dataset. This process follows the workflow outlined in _Deep Learning with Python_ by François Chollet.

Reference url: "https://www.manning.com/books/deep-learning-with-python"

The report will follow the **Deep Learning Workflow**:
1. **Introduction**
2. **Choosing a measure of success**
3. **Deciding an evaluation protocol**
4. **Preparing the data**
5. **Developing a model better than the baseline**
6. **Scaling up the model until it overfits**
7. **Regularization and hyperparameter tuning**
8. **Discussion & Interpretation**
9. **Conclusion & Future Work**
10. **References**

## **1. Introduction**

### **1.1 Problem Definition**

Text classification is a fundamental task in Natural Language Processing (NLP) where textual data is automatically assigned to predefined categories based on content. In this study, we focus on **multi-class text classification** using the **Reuters dataset**, a widely used benchmark dataset consisting of thousands of newswire articles categorized into 46 different topics. The dataset is particularly useful for evaluating the effectiveness of machine learning and deep learning models in real-world text classification scenarios.

The challenge of text classification in the Reuters dataset arises from the diverse nature of topics, varying document lengths, and the imbalance in class distribution. Some categories have significantly more examples than others, making it essential to develop robust deep learning models capable of handling imbalanced datasets and learning meaningful representations from textual data.

### **1.2 Importance of Text Classification**

By leveraging neural networks, we can build models that capture complex relationships within text data, leading to higher classification performance and better generalization to unseen articles. For newswire data such as the Reuters dataset, **text classification** is valuable for:

- **Automating news categorization**: By classifying news articles into predefined topics, media organizations and analysts can efficiently organize and retrieve relevant content.
- **Enhancing financial and economic analysis**: The Reuters dataset contains articles related to business, trade, and markets, making automated classification valuable for financial institutions monitoring economic trends.
- **Filtering and information retrieval**: Categorization helps in filtering irrelevant information and improves the performance of search and recommendation systems for news platforms.
- **Improving decision-making processes**: Automated classification can assist businesses, journalists, and analysts in identifying key topics of interest without manually sifting through large volumes of articles.
- **Advancing deep learning methodologies**: The dataset serves as a benchmark for testing and improving deep learning techniques for text processing, showcasing the power of neural networks in handling complex language patterns.

### **1.3 Aims and Objectives**

The main aim of this report is to develop a deep learning model capable of accurately classifying real-world news articles into predefined categories, using the Reuters dataset. This report follows the deep learning workflow provided in Coursera Week 20 of CM3015, ensuring best practices in data preprocessing, model training, and evaluation. Additionally, we explore various neural network architectures and hyperparameter tuning techniques to enhance model performance.

The key objectives include preparing the data, implementing a baseline model for benchmarking, developing and training a deep learning model using TensorFlow and Keras, evaluating its performance using accuracy and loss metrics, implementing hyperparameter tuning and regularization, and providing insights for future improvements and current limitations.

---

## **2. Choosing a Measure of Success**

In this report, multiple metrics will be used, including accuracy, precision, recall, and F1 score. However, the primary measure of success will be **validation accuracy**, as it reflects the model's ability to generalize to unseen data. It is selected as the primary metric because:

- Higher validation accuracy indicates that the model classifies previously unseen articles correctly more often. This ties back to the main aim of this report, which is to develop a model that can accurately classify news articles into predefined categories in the real world.
- If validation accuracy is significantly lower than training accuracy, the model is overfitting, and regularization techniques are needed.

There will be four models in this report: two baseline models, one simple model, and one complex model. Validation accuracy will be compared across these models to assess improvements in classification performance.
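
Because the class distribution is imbalanced, accuracy and the macro-averaged metrics can tell different stories. The following is a minimal sketch, in plain Python, of how accuracy and macro-averaged precision, recall, and F1 could be computed from integer labels. The function name `macro_scores` and the toy labels are illustrative assumptions, not part of the report's pipeline; undefined ratios are treated as 0, which is one common convention.

```python
from collections import Counter

def macro_scores(y_true, y_pred):
    """Return (accuracy, macro precision, macro recall, macro F1) for integer labels."""
    classes = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1      # correct prediction for class t
        else:
            fp[p] += 1      # class p was predicted but wrong
            fn[t] += 1      # class t was missed

    def safe_div(a, b):
        # Treat undefined ratios (0 / 0) as 0.0
        return a / b if b else 0.0

    precisions, recalls, f1s = [], [], []
    for c in classes:
        prec = safe_div(tp[c], tp[c] + fp[c])
        rec = safe_div(tp[c], tp[c] + fn[c])
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(safe_div(2 * prec * rec, prec + rec))

    n = len(classes)
    accuracy = safe_div(sum(tp.values()), len(y_true))
    return accuracy, sum(precisions) / n, sum(recalls) / n, sum(f1s) / n

# Toy example: a model that always predicts the majority class
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10
accuracy, macro_p, macro_r, macro_f1 = macro_scores(y_true, y_pred)
```

In this toy case the accuracy is 0.8 while the macro F1 is only about 0.44, which illustrates why the secondary metrics are worth reporting alongside validation accuracy.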

Furthermore, because accuracy itself is not differentiable and cannot be optimized directly by gradient descent, categorical cross-entropy is chosen as the loss function. Minimizing categorical cross-entropy pushes the predicted probability of the correct class toward 1, which in practice tends to increase classification accuracy. It also penalizes confident wrong predictions more heavily than uncertain ones, which helps the model learn meaningful patterns in the data.
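
To make the behaviour of the loss concrete, here is a small worked sketch of categorical cross-entropy for a single sample, written in plain Python rather than Keras. The helper name `categorical_cross_entropy`, the clipping constant `eps`, and the example probabilities are assumptions chosen for illustration.

```python
import math

def categorical_cross_entropy(one_hot_target, predicted_probs, eps=1e-7):
    """Loss for one sample: -sum_k y_k * log(p_k), with p_k clipped away from 0."""
    return -sum(
        y * math.log(max(p, eps))
        for y, p in zip(one_hot_target, predicted_probs)
    )

target = [0.0, 1.0, 0.0]                 # true class is index 1
confident_right = [0.05, 0.90, 0.05]     # high probability on the true class
confident_wrong = [0.90, 0.05, 0.05]     # high probability on a wrong class

loss_right = categorical_cross_entropy(target, confident_right)  # -log(0.90)
loss_wrong = categorical_cross_entropy(target, confident_wrong)  # -log(0.05)

# Reference point: a uniform guess over 46 classes has loss log(46)
uniform_target = [1.0] + [0.0] * 45
uniform_probs = [1.0 / 46] * 46
uniform_loss = categorical_cross_entropy(uniform_target, uniform_probs)
```

The confident correct prediction gives a loss of about 0.105, the confident wrong one about 3.0, and a uniform guess over 46 classes about 3.83. A trained model should therefore reach a validation loss well below 3.83.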

---

## **3. Deciding an Evaluation Protocol**

To ensure the model's performance is properly assessed, **holdout validation** is used as the primary validation method. This involves splitting the dataset into distinct training and validation subsets at the beginning of the process. Since the dataset is large enough, this approach is effective in ensuring the model generalizes well without unnecessary complexity. The test set will remain untouched until final evaluation. 

The dataset will be divided into three distinct subsets:

- **Training Set**: This subset comprises the majority of the data and is used for training the model. The model learns patterns and relationships in this phase.

- **Validation Set**: A portion of the dataset will be set aside as validation data. This set is crucial for tuning hyperparameters and monitoring for overfitting. Performance on the validation set helps determine when the model starts memorizing the training data instead of generalizing to new inputs.

- **Test Set**: Once the model is finalized, its generalization ability is evaluated using a completely separate test set. This ensures an unbiased estimate of real-world performance.

The train-validation split is performed before training begins. A typical approach is to allocate 80% of the available data for training and 20% for validation. However, in this case, we will use the first 1,000 training samples as the validation set, while the remaining 7,982 samples will be used for training. The test set provided by the Reuters dataset remains untouched until the final evaluation.
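
The holdout protocol described here can be sketched as a small helper. This is an illustrative sketch only: `holdout_split` is a hypothetical name, and the placeholder lists merely mimic the size of the Reuters training set. Note that slicing without shuffling assumes the sample order is not correlated with the labels.

```python
def holdout_split(samples, labels, n_val):
    """Reserve the first n_val items for validation; the rest is used for training."""
    if not 0 < n_val < len(samples):
        raise ValueError("n_val must be between 1 and len(samples) - 1")
    train = (samples[n_val:], labels[n_val:])
    val = (samples[:n_val], labels[:n_val])
    return train, val

# Placeholder data with the same number of items as the Reuters training set
samples = list(range(8982))
labels = [i % 46 for i in samples]

(train_x, train_y), (val_x, val_y) = holdout_split(samples, labels, n_val=1000)
val_fraction = len(val_x) / len(samples)  # share of the data held out for validation
```

With 1,000 of 8,982 samples held out, the validation share is roughly 11%, somewhat smaller than the typical 20%.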

---

## **4. Preparing the Data**

### **4.1 Import Necessary Libraries**

First, we import all the necessary libraries in one place for easier management.

In [1]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.datasets import reuters
from tensorflow.keras.utils import to_categorical
from tensorflow.keras import models, regularizers
from tensorflow.keras import layers
from tensorflow.keras.callbacks import EarlyStopping
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from sklearn.metrics import classification_report, accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import ParameterGrid
from collections import Counter
from tabulate import tabulate

print(tf.__version__, '')

2.14.0 


### **4.2 Load the Dataset**

In this section, we load the Reuters dataset from Keras, which returns it already split into training and test sets. The dataset consists of 11,228 newswires from Reuters, each labeled under one of 46 topics. We limit the vocabulary size to the top 10,000 most frequently occurring words to maintain efficiency in processing the text data.

**Dataset Overview**

- **Total Samples**: 11,228
    - **Training Samples**: 8,982
    - **Test Samples**: 2,246
- **Number of Classes**: 46 (Multi-Class Classification)
- **Vocabulary Size**: Limited to the top 10,000 most frequently occurring words to maintain efficiency
- **Dataset Format**: Each sample is represented as a sequence of word indices corresponding to a predefined word dictionary

Reference url: "https://keras.io/api/datasets/reuters/"

In [2]:
# Initialised num_words to improve consistency in the code. 
# This also helps in changing the number of words in the future.
num_words = 10000

num_classes = 46

In [3]:
# Load the data; reuters.load_data uses an 80-20 train-test split by default (test_split=0.2)
(train_data, train_labels), (test_data, test_labels) = reuters.load_data(num_words=num_words)

# {Original Code}
print(f"Training samples: {len(train_data)}")
print(f"Testing samples: {len(test_data)}")

Training samples: 8982
Testing samples: 2246
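
Section 1.1 notes that the class distribution is imbalanced. A quick way to check this on the loaded labels could look like the sketch below. The helper `class_distribution` and the toy labels are illustrative assumptions; no real Reuters counts are shown here.

```python
from collections import Counter

def class_distribution(labels, top_n=5):
    """Return the top_n most common classes as (class_id, count, share) tuples."""
    counts = Counter(int(label) for label in labels)
    total = sum(counts.values())
    return [(cls, n, n / total) for cls, n in counts.most_common(top_n)]

# Toy labels standing in for train_labels
toy_labels = [3, 3, 3, 3, 4, 4, 1]
summary = class_distribution(toy_labels, top_n=2)

for cls, n, share in summary:
    print(f"class {cls}: {n} samples ({share:.1%})")
```

On the real data, the same call would be `class_distribution(train_labels)`.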


### **4.3 View the Samples**

In [4]:
# Load word index
# {Original Code}
word_index = reuters.get_word_index()
reverse_word_index = {value: key for key, value in word_index.items()} 

def decode_news(sequence):
    return " ".join([reverse_word_index.get(i - 3, "?") for i in sequence])

In [5]:
# Pick a sample news article (index fixed so the output is reproducible)
sample_index = 1
original_text = decode_news(train_data[sample_index])

print("Original News Article (Before Processing):\n")
print(original_text)

Original News Article (Before Processing):

? generale de banque sa lt ? br and lt heller overseas corp of chicago have each taken 50 pct stakes in ? company sa ? factors generale de banque said in a statement it gave no financial details of the transaction sa ? ? turnover in 1986 was 17 5 billion belgian francs reuter 3
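
The `i - 3` offset in `decode_news` exists because the Keras loader reserves indices 0, 1, and 2 for padding, start of sequence, and out-of-vocabulary words, so real words start at index 4 (rank 1 plus the offset). The sketch below reproduces this with a toy vocabulary; `toy_word_index`, `RESERVED_OFFSET`, and `decode_toy` are hypothetical names used only for illustration.

```python
# Toy vocabulary in the same shape as reuters.get_word_index(): word -> rank (1-based)
toy_word_index = {"the": 1, "of": 2, "said": 3}
toy_reverse_index = {rank: word for word, rank in toy_word_index.items()}

# Indices 0 (padding), 1 (start of sequence) and 2 (out-of-vocabulary) are reserved
RESERVED_OFFSET = 3

def decode_toy(sequence):
    """Map encoded indices back to words; reserved or unknown indices become '?'."""
    return " ".join(
        toy_reverse_index.get(i - RESERVED_OFFSET, "?") for i in sequence
    )

# 1 marks the start, 2 is an out-of-vocabulary word; 4, 5, 6 map to ranks 1, 2, 3
decoded = decode_toy([1, 4, 2, 6, 5])
print(decoded)
```

This explains the `?` characters in the decoded article above: the leading one is the start marker, and the others are words outside the 10,000-word vocabulary.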


### **4.4 Prepare the Data**

The data from the Reuters dataset consists of lists of integers, where each integer represents a word index in a predefined vocabulary dictionary. Although these indices are already numerical, the lists vary in length, whereas neural networks require fixed-size numerical tensors as input, so the sequences must be converted into vectorized representations. Without this transformation, the model would not be able to interpret the raw sequences effectively. Vectorization ensures that the textual data is structured in a format suitable for neural network input, allowing the model to learn patterns in the text efficiently.

Additionally, the labels in the dataset, which indicate the category of each news article, must be encoded into a numerical format. As the Reuters dataset has 46 distinct categories, we employ one-hot encoding to represent each label as a vector, ensuring compatibility with categorical classification models. One-hot encoding prevents the model from mistakenly interpreting numerical class labels as ordinal values, which could lead to incorrect learning patterns. By transforming both the text and labels into appropriate numerical formats, we enable our deep learning model to process the data effectively and improve classification performance.

#### **4.4.1 Tokenization & Vectorization**

Since the dataset consists of lists of integers (word indices), it is necessary to transform them into a format suitable for neural network training. Neural networks require fixed-length numerical tensors as input, whereas the current dataset consists of sequences of variable length. To address this, we apply one-hot encoding at the document level (often called multi-hot encoding), which converts each newswire into a binary vector representation.

Each vector has a length equal to the number of words in the dictionary (10,000 in this case). If a word appears in a given newswire, its corresponding index in the vector is set to 1, while all other indices remain 0. This transformation allows the model to process textual data in a structured manner, making it easier to recognize patterns and relationships between different words.

**{The code for vectorizing is taken from CM3015 Machine Learning and Neural Networks course}**

In [6]:
def vectorize_sequences(sequences, dimension=num_words):
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.0  
    return results

x_train = vectorize_sequences(train_data)
x_test = vectorize_sequences(test_data)

In [7]:
# Check the same sample from above as a vectorized format
# {Original Code}
vectorized_sample = x_train[sample_index]

print("\nVectorized Representation (After Processing):\n")
print(vectorized_sample)  


Vectorized Representation (After Processing):

[0. 1. 1. ... 0. 0. 0.]


As mentioned above, an entry of the 10,000-dimensional vector is 1 if the word with that index appears in the newswire and 0 if it does not. This is a binary representation of the words in the newswire.
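
One consequence of this binary (multi-hot) representation is that word order and word counts are discarded. The toy sketch below uses the same logic as `vectorize_sequences` with a vocabulary of only 8 indices; the function name `multi_hot` and the example sequences are assumptions for illustration.

```python
import numpy as np

def multi_hot(sequences, dimension):
    """Encode each sequence of word indices as a binary vector of length `dimension`."""
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.0  # set every index that occurs in the sequence
    return results

# Two toy documents with the same words in a different order and frequency
toy = multi_hot([[1, 3, 3, 5], [5, 3, 1]], dimension=8)
print(toy[0])
print(toy[1])
```

Both rows come out identical, with ones at indices 1, 3, and 5, so the model sees only which words occur in a newswire, not how often or in what order.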

#### **4.4.2 Encoding the Labels**

Since we have 46 distinct categories, we need to encode the target labels into a format suitable for multi-class classification. The raw labels in the dataset are represented as integers corresponding to their respective categories. Treating these integers as numerical targets would imply an ordering and a distance between categories (for example, that class 4 is closer to class 5 than to class 40), which is not the case for categorical labels.

To ensure proper classification, we convert the labels into a one-hot encoded format. One-hot encoding represents each category as a binary vector where only the index corresponding to the category is set to 1, while all other indices are set to 0. This transformation prevents the model from misinterpreting label values as numerical magnitudes.

In [8]:
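# Integer labels as float32 arrays (the split in 4.4.3 uses the one-hot labels instead)
y_train = np.asarray(train_labels).astype('float32')
y_test = np.asarray(test_labels).astype('float32')

# One-hot encode the labels; num_classes is passed explicitly so both arrays
# have 46 columns even if a class were missing from one split
one_hot_train_labels = to_categorical(train_labels, num_classes=num_classes)
one_hot_test_labels = to_categorical(test_labels, num_classes=num_classes)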

In [9]:
# Shows the one-hot encoded label for the same sample
# {Original Code}
print("\nOne-Hot Encoded Label:\n")
print(one_hot_train_labels[sample_index])


One-Hot Encoded Label:

[0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]


From this we can see that the labels are encoded as one-hot vectors, where each vector has a length equal to the number of classes (46 in this case). The index corresponding to the true class is set to 1, while all other indices are set to 0.
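
For readers who want to see what the one-hot transformation does internally, the following is a minimal NumPy sketch of the same idea. It is not the Keras implementation; `one_hot` is a hypothetical helper, and the three labels are made up for the example.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Turn integer class labels into rows with a single 1 at the label's index."""
    labels = np.asarray(labels, dtype=int)
    encoded = np.zeros((len(labels), num_classes), dtype="float32")
    encoded[np.arange(len(labels)), labels] = 1.0
    return encoded

encoded = one_hot([4, 0, 45], num_classes=46)

# argmax recovers the original integer labels from the one-hot rows
recovered = encoded.argmax(axis=1)
print(recovered)
```

The first row has its single 1 at index 4, matching the pattern of the sample label printed above. Applying `argmax` in this way is also how predicted probability vectors are usually converted back to class indices.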

In [10]:
# Check the shapes to ensure everything is consistent
print(f"train_data shape: {len(train_data)}, train_labels shape: {len(train_labels)}")
print(f"test_data shape: {len(test_data)}, test_labels shape: {len(test_labels)}")
print(f"x_train shape: {x_train.shape}, x_test shape: {x_test.shape}")
print(f"one_hot_train_labels shape: {one_hot_train_labels.shape}, one_hot_test_labels shape: {one_hot_test_labels.shape}")

train_data shape: 8982, train_labels shape: 8982
test_data shape: 2246, test_labels shape: 2246
x_train shape: (8982, 10000), x_test shape: (2246, 10000)
one_hot_train_labels shape: (8982, 46), one_hot_test_labels shape: (2246, 46)
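
The shapes above also give a rough idea of the memory cost of the dense representation. Since `np.zeros` creates `float64` arrays by default, each value takes 8 bytes. The sketch below is back-of-the-envelope arithmetic only; `dense_matrix_mib` is a hypothetical helper, and the `float32` figure is shown purely for comparison.

```python
def dense_matrix_mib(n_rows, n_cols, bytes_per_value):
    """Approximate size of a dense matrix in MiB (1 MiB = 2**20 bytes)."""
    return n_rows * n_cols * bytes_per_value / 2**20

train_float64 = dense_matrix_mib(8982, 10000, 8)  # default dtype of np.zeros
train_float32 = dense_matrix_mib(8982, 10000, 4)  # half the size, for comparison

print(f"x_train as float64: {train_float64:.0f} MiB")
print(f"x_train as float32: {train_float32:.0f} MiB")
```

At roughly 685 MiB for `x_train` as `float64`, the dense encoding is workable at this vocabulary size, but it would grow linearly if `num_words` were increased.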


#### **4.4.3 Data Splitting for Validation**

To evaluate model performance during training, the dataset is split into training and validation sets as mentioned above. 

- The first 1,000 samples from x_train and one_hot_train_labels are set aside as validation data (x_val, y_val).
- The remaining data (partial_x_train, partial_y_train) is used for training.

In [11]:
x_val = x_train[:1000]
partial_x_train = x_train[1000:]
y_val = one_hot_train_labels[:1000]
partial_y_train = one_hot_train_labels[1000:]

---

## **5. Developing a Model Better Than the Baseline**

---

## **6. Scaling Up the Model Until it Overfits**

---

## **7. Regularization and Hyperparameter Tuning**

---

## **8. Discussion & Interpretation**

---

## **9. Conclusion & Future Work**

---

## **10. References**