Step 1: Importing necessary libraries

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments
import torch



Here the libraries needed for the transformer implementation are imported. NumPy and pandas handle the data, scikit-learn splits the data and computes the accuracy, Transformers loads the pretrained BERT-family models and provides the Trainer, and PyTorch is the underlying deep learning framework.

Step 2: Checking the availability of the GPU

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")



Using device: cpu


A GPU is highly recommended because it makes training much faster. In this project the CPU is used because my laptop has no dedicated GPU. Training on a CPU is much slower, but it still gets the job done.

Step 3: Loading and preprocessing the AG News Dataset

In [None]:

df = pd.read_csv("https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/train.csv", header=None)
df.columns = ["label", "title", "description"]
df["text"] = df["title"] + " " + df["description"]



Step 4: Using a smaller dataset for faster training

In [None]:
df_sampled = df.sample(5000)  # Reduce the dataset to 5,000 rows for quick training (unseeded, so the sample differs between runs)
X_train, X_test, y_train, y_test = train_test_split(df_sampled["text"], df_sampled["label"] - 1, test_size=0.2, random_state=42)
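
The `- 1` in the split above shifts the 1-based labels in the CSV to the 0-based class ids that a classification head expects. The sketch below illustrates that shift on a few hand-written labels. The class names follow the usual AG News ordering and are an assumption, since the notebook never prints them; `to_zero_based` and `raw_labels` are hypothetical names used only for illustration.

```python
# Class names in the usual AG News order (assumption: not shown in the notebook).
CLASS_NAMES = ["World", "Sports", "Business", "Sci/Tech"]


def to_zero_based(labels):
    """Shift 1-based CSV labels to the 0-based ids used for training."""
    return [label - 1 for label in labels]


raw_labels = [3, 1, 4, 2]  # hypothetical sample of raw CSV labels
model_labels = to_zero_based(raw_labels)
names = [CLASS_NAMES[i] for i in model_labels]

print(model_labels)
print(names)
```

With four classes, the shifted labels fall in the range 0 to 3, which matches `num_labels=4` when the model is loaded later.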



Step 5: Tokenization function (converting text to numbers)

In [5]:
# Tokenization function
def tokenize_function(examples, tokenizer, max_length=128):
    return tokenizer(examples, padding="max_length", truncation=True, max_length=max_length)



Transformer models cannot process raw text, so the function above converts it into numeric token IDs and pads or truncates every example to max_length tokens.
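
To make the effect of `padding="max_length"` and `truncation=True` concrete, here is a toy sketch that does not use the real tokenizer. It splits on whitespace and looks words up in a made-up vocabulary, whereas the actual BERT tokenizers use WordPiece subwords and add special tokens. `TOY_VOCAB`, `PAD_ID`, `UNK_ID`, and `toy_encode` are all hypothetical.

```python
PAD_ID = 0  # hypothetical id for the padding token
UNK_ID = 1  # hypothetical id for words missing from the vocabulary
TOY_VOCAB = {"stocks": 2, "rise": 3, "after": 4, "earnings": 5}


def toy_encode(text, max_length=6):
    """Map words to ids, then truncate or pad to exactly max_length."""
    ids = [TOY_VOCAB.get(word, UNK_ID) for word in text.lower().split()]
    ids = ids[:max_length]                   # truncation
    attention_mask = [1] * len(ids)          # 1 marks a real token
    pad = max_length - len(ids)
    return {
        "input_ids": ids + [PAD_ID] * pad,   # padding
        "attention_mask": attention_mask + [0] * pad,
    }


short = toy_encode("Stocks rise")
long = toy_encode("Stocks rise after earnings after stocks rise again")

print(short)
print(long)
```

Both outputs have the same length, which is what lets the examples be stacked into a batch. The attention mask tells the model which positions are padding.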

Step 6: Automating Model Training and Evaluation

In [None]:
# Model Training and Evaluation
def train_and_evaluate(model_name, batch_size, learning_rate):
    print(f"\n🔹 Training Model: {model_name} | Batch Size: {batch_size} | Learning Rate: {learning_rate}\n")

    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # Tokenization
    train_encodings = tokenize_function(X_train.tolist(), tokenizer)
    test_encodings = tokenize_function(X_test.tolist(), tokenizer)

    # Converting the Data to PyTorch Dataset
    class CustomDataset(torch.utils.data.Dataset):
        def __init__(self, encodings, labels):
            self.encodings = encodings
            self.labels = labels
        
        def __getitem__(self, idx):
            item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
            item["labels"] = torch.tensor(self.labels[idx])
            return item
        
        def __len__(self):
            return len(self.labels)

    train_dataset = CustomDataset(train_encodings, y_train.tolist())
    test_dataset = CustomDataset(test_encodings, y_test.tolist())

    # Loading the Model
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=len(set(y_train))).to(device)

    # Defining the Training Arguments
    training_args = TrainingArguments(
        output_dir=f"./results_{model_name.replace('/', '_')}",
        evaluation_strategy="epoch",
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        num_train_epochs=1,  # Faster training
        learning_rate=learning_rate,
        weight_decay=0.01,
        logging_dir="./logs",
        save_strategy="epoch",
        save_total_limit=2,
        load_best_model_at_end=True,
    )

    # Trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=test_dataset,
    )

    # Training the Model
    trainer.train()

    # Evaluating the Model
    predictions = trainer.predict(test_dataset)
    preds = np.argmax(predictions.predictions, axis=1)
    accuracy = accuracy_score(y_test, preds)

    print(f"✅ Model: {model_name} | Batch Size: {batch_size} | Learning Rate: {learning_rate} | Accuracy: {accuracy:.4f}")
    return accuracy



Here, a function is defined that trains and evaluates a given transformer model with a given batch size and learning rate. The function loads the model's tokenizer, converts the text to numeric form, and pads or truncates every example to a uniform length. It wraps the encodings and labels in a PyTorch-compatible dataset, then loads the pre-trained model and moves it to the GPU if one is available. The training arguments set the batch size, learning rate, and weight decay, and limit training to one epoch to keep runs short. The Hugging Face Trainer manages the training process, and the model is evaluated on the test set once trained. Finally, the function computes, prints, and returns the accuracy, so that different models and hyperparameter settings can be compared to find the best-performing configuration.
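
The evaluation step at the end of the function turns the model's raw scores (logits) into class predictions by taking the highest-scoring class per example, then compares them with the true labels. The sketch below reproduces that logic in plain Python on made-up logits. It is an illustration of what `np.argmax` and `accuracy_score` compute, not a replacement for them, and `toy_logits` and `toy_labels` are invented values.

```python
def argmax(row):
    """Return the index of the largest value in a list of scores."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy(logits, labels):
    """Fraction of examples whose highest-scoring class equals the label."""
    preds = [argmax(row) for row in logits]
    correct = sum(pred == label for pred, label in zip(preds, labels))
    return correct / len(labels)


# Invented logits for four examples and four classes.
toy_logits = [
    [2.1, 0.3, -1.0, 0.2],   # highest score at class 0
    [0.1, 0.2, 3.4, -0.5],   # highest score at class 2
    [-0.3, 1.9, 0.0, 0.4],   # highest score at class 1
    [0.5, 0.1, 0.2, 0.3],    # highest score at class 0
]
toy_labels = [0, 2, 1, 3]    # the last example is misclassified

acc = accuracy(toy_logits, toy_labels)
print(f"Accuracy: {acc:.4f}")
```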

Step 7: Running Experiments on the models and comparing them

In [None]:

results = {}

# Model 1: BERT-Tiny (Fastest)
results["bert-tiny_8_2e-5"] = train_and_evaluate("prajjwal1/bert-tiny", batch_size=8, learning_rate=2e-5)
results["bert-tiny_16_5e-5"] = train_and_evaluate("prajjwal1/bert-tiny", batch_size=16, learning_rate=5e-5)

# Model 2: DistilBERT (More Accurate)
results["distilbert_8_2e-5"] = train_and_evaluate("distilbert-base-uncased", batch_size=8, learning_rate=2e-5)
results["distilbert_16_5e-5"] = train_and_evaluate("distilbert-base-uncased", batch_size=16, learning_rate=5e-5)




🔹 Training Model: prajjwal1/bert-tiny | Batch Size: 8 | Learning Rate: 2e-05



Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss
1,1.241,1.112422


✅ Model: prajjwal1/bert-tiny | Batch Size: 8 | Learning Rate: 2e-05 | Accuracy: 0.7470

🔹 Training Model: prajjwal1/bert-tiny | Batch Size: 16 | Learning Rate: 5e-05



Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss
1,No log,0.957559


✅ Model: prajjwal1/bert-tiny | Batch Size: 16 | Learning Rate: 5e-05 | Accuracy: 0.8370

🔹 Training Model: distilbert-base-uncased | Batch Size: 8 | Learning Rate: 2e-05



Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss
1,0.4689,0.303354


✅ Model: distilbert-base-uncased | Batch Size: 8 | Learning Rate: 2e-05 | Accuracy: 0.9070

🔹 Training Model: distilbert-base-uncased | Batch Size: 16 | Learning Rate: 5e-05



Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss
1,No log,0.276372


✅ Model: distilbert-base-uncased | Batch Size: 16 | Learning Rate: 5e-05 | Accuracy: 0.9080


This compares BERT-Tiny and DistilBERT-Base-Uncased on the AG News dataset under two hyperparameter settings. BERT-Tiny achieved 74.7% accuracy with batch size 8 and learning rate 2e-5, and improved to 83.7% with batch size 16 and learning rate 5e-5. DistilBERT, a considerably larger model, outperformed BERT-Tiny, achieving 90.7% with (8, 2e-5) and 90.8% with (16, 5e-5). The minimal difference between the two DistilBERT runs suggests that it is not very sensitive to these settings. Because batch size and learning rate were changed together, the runs do not show which of the two drove the improvement. The warning indicates that the classification head weights are newly initialized rather than loaded from the checkpoint, which is why the model must be fine-tuned before it can be used for prediction. These observations highlight the impact of model choice and hyperparameter tuning on performance.
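
The "No log" entries in the training-loss column for the batch-size-16 runs can be explained with a little arithmetic. The sketch below assumes the Trainer's default of logging the training loss every 500 steps (the notebook does not set `logging_steps`), a single device, and no gradient accumulation. `steps_per_epoch` is a hypothetical helper.

```python
import math

TRAIN_SIZE = 4000            # 80% of the 5,000 sampled rows
DEFAULT_LOGGING_STEPS = 500  # assumed Trainer default; not set in the notebook


def steps_per_epoch(num_examples, batch_size):
    """Number of optimizer steps in one pass over the training data."""
    return math.ceil(num_examples / batch_size)


for batch_size in (8, 16):
    steps = steps_per_epoch(TRAIN_SIZE, batch_size)
    logged = steps >= DEFAULT_LOGGING_STEPS
    print(f"batch_size={batch_size}: {steps} steps, training loss logged: {logged}")
```

Under these assumptions, batch size 8 reaches step 500 exactly at the end of the epoch, while batch size 16 finishes after 250 steps and never reaches a logging step. Passing a smaller `logging_steps` to `TrainingArguments` would make the training loss appear for both.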

Step 8: Printing the Results

In [8]:
# Print final comparison
print("\n🔹 FINAL RESULTS COMPARISON")
for config, acc in results.items():
    print(f"{config}: Accuracy = {acc:.4f}")




🔹 FINAL RESULTS COMPARISON
bert-tiny_8_2e-5: Accuracy = 0.7470
bert-tiny_16_5e-5: Accuracy = 0.8370
distilbert_8_2e-5: Accuracy = 0.9070
distilbert_16_5e-5: Accuracy = 0.9080
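
As a small follow-up, the results can be ranked and the gap between the two DistilBERT runs expressed in test examples. The sketch below copies the four accuracies printed above into a dictionary so that it runs on its own. `TEST_SIZE` is 20% of the 5,000 sampled rows, and `ranked`, `gap`, and `extra_correct` are hypothetical names.

```python
# Accuracies copied from the output above.
results = {
    "bert-tiny_8_2e-5": 0.7470,
    "bert-tiny_16_5e-5": 0.8370,
    "distilbert_8_2e-5": 0.9070,
    "distilbert_16_5e-5": 0.9080,
}
TEST_SIZE = 1000  # 20% of the 5,000 sampled rows

# Sort configurations from best to worst accuracy.
ranked = sorted(results.items(), key=lambda item: item[1], reverse=True)
best_config, best_acc = ranked[0]
print(f"Best: {best_config} ({best_acc:.4f})")

# Express the DistilBERT gap as a number of test examples.
gap = results["distilbert_16_5e-5"] - results["distilbert_8_2e-5"]
extra_correct = round(gap * TEST_SIZE)
print(f"DistilBERT gap: {extra_correct} test example(s) out of {TEST_SIZE}")
```

Seen this way, the two DistilBERT configurations differ by a single test example, so the ranking between them should be read with caution.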


Key Takeaways from Model Performance

1) DistilBERT outperforms BERT-Tiny

DistilBERT consistently reaches a higher accuracy (90.7-90.8%) than BERT-Tiny (74.7-83.7%).

This is to be expected: DistilBERT is a much larger model that was distilled to retain most of BERT's knowledge, whereas BERT-Tiny trades accuracy for speed.

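2) Increased Batch Size May Enhance Stability

Moving from batch size 8 to 16 (together with the higher learning rate) helped both models, with BERT-Tiny improving considerably (74.7% → 83.7%).

DistilBERT saw only a minor gain (90.7% → 90.8%), which is one test example out of 1,000. Larger batches give less noisy gradient estimates, which is consistent with smoother updates, but because the learning rate changed at the same time, these runs cannot attribute the gain to batch size alone.

However, very large batch sizes can hurt generalization (though this was not encountered here).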

3) Higher Learning Rate Serves Small Models

BERT-Tiny benefited from the higher learning rate (the 5e-5 run performed better than the 2e-5 run).

A likely reason is that with only one epoch of training, the compact model makes more progress per step at the higher rate.

DistilBERT, a more capable model, already performed well at the lower learning rate.

4) Best Overall Model: DistilBERT (Batch 16, LR 5e-5)

It yielded the best accuracy (90.8%), and the larger batch halves the number of optimizer steps per epoch.

Even with batch size 8 and LR 2e-5, DistilBERT still reached 90.7%, which shows that it is robust to these hyperparameter choices.

Final Thought: With accuracy and training efficiency as the goals, the best model is DistilBERT with batch size 16 and LR 5e-5.
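
The experiments above change batch size and learning rate together, so a possible next step is to run every combination. The sketch below only builds the list of configurations and does not train anything. `MODELS`, `BATCH_SIZES`, `LEARNING_RATES`, and `build_grid` are hypothetical names, and the key format simply mirrors the one used in `results`.

```python
from itertools import product

# Short key -> Hugging Face model name, as used in the experiments above.
MODELS = {
    "bert-tiny": "prajjwal1/bert-tiny",
    "distilbert": "distilbert-base-uncased",
}
BATCH_SIZES = [8, 16]
LEARNING_RATES = {"2e-5": 2e-5, "5e-5": 5e-5}  # label -> value


def build_grid():
    """Return one configuration per (model, batch size, learning rate)."""
    grid = []
    for (short, name), batch_size, (lr_label, lr) in product(
        MODELS.items(), BATCH_SIZES, LEARNING_RATES.items()
    ):
        grid.append({
            "key": f"{short}_{batch_size}_{lr_label}",
            "model_name": name,
            "batch_size": batch_size,
            "learning_rate": lr,
        })
    return grid


grid = build_grid()
for config in grid:
    print(config["key"])
```

Each entry could then be passed to `train_and_evaluate(config["model_name"], config["batch_size"], config["learning_rate"])`. The grid has eight runs instead of four, which would take noticeably longer on a CPU.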