# 📘 Fine-Tune DistilBERT on AG News
This tutorial shows how to fine-tune a pretrained DistilBERT model on the AG News dataset with HuggingFace's Transformers library. The notebook installs the dependencies, downloads the dataset from Kaggle, tokenizes the text, fine-tunes the model, and evaluates the results with a classification report and a confusion matrix.

## 📦 Install Required Libraries
Install HuggingFace Transformers, Datasets, Evaluate, Scikit-learn, and Kaggle.

In [None]:
# Install required packages
!pip install -U transformers datasets evaluate scikit-learn kaggle --quiet


## 🔐 Upload Kaggle API Key
Upload your `kaggle.json` file downloaded from your Kaggle account page.

In [None]:
# Upload kaggle.json to authenticate
from google.colab import files
files.upload()  # Choose kaggle.json


## 🔑 Configure Kaggle Credentials
Move the uploaded `kaggle.json` into `~/.kaggle/` and restrict its permissions so the Kaggle CLI can authenticate.

In [None]:
# Set up Kaggle API
!mkdir -p ~/.kaggle
!mv kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
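
Before calling the Kaggle CLI, it can help to confirm that the credentials file is where the CLI expects it. The sketch below is one way to do that. `check_kaggle_credentials` is a hypothetical helper, not part of the Kaggle package, and it assumes the usual `kaggle.json` layout with `username` and `key` fields.

```python
import json
import stat
from pathlib import Path


def check_kaggle_credentials(path):
    """Return a list of problems found with a kaggle.json file (empty if none).

    Hypothetical helper for illustration; assumes the file is a JSON object
    with "username" and "key" fields.
    """
    path = Path(path)
    if not path.is_file():
        return [f"{path} does not exist"]

    try:
        creds = json.loads(path.read_text())
    except json.JSONDecodeError:
        return [f"{path} is not valid JSON"]
    if not isinstance(creds, dict):
        return [f"{path} does not contain a JSON object"]

    problems = []
    for field in ("username", "key"):
        if not creds.get(field):
            problems.append(f"missing field: {field}")

    # chmod 600 clears every group/other permission bit; flag the file if any remain.
    mode = stat.S_IMODE(path.stat().st_mode)
    if mode & (stat.S_IRWXG | stat.S_IRWXO):
        problems.append(f"permissions too open: {oct(mode)}")
    return problems
```

In the notebook, calling `check_kaggle_credentials(Path.home() / ".kaggle" / "kaggle.json")` after the setup cell should return an empty list if the move and `chmod` succeeded.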


## 📥 Download AG News from Kaggle
Use the Kaggle CLI to download and unzip the dataset into the Colab environment.

In [None]:
# Download and unzip AG News dataset
!kaggle datasets download -d amananandrai/ag-news-classification-dataset
!unzip ag-news-classification-dataset.zip


## 📚 Import Libraries
Import the dataset, model, metric, and plotting utilities used in the rest of the notebook.

In [None]:
# Imports
from datasets import Dataset, DatasetDict
from transformers import (
    DistilBertTokenizerFast,
    DistilBertForSequenceClassification,
    TrainingArguments,
    Trainer
)
import evaluate
import pandas as pd
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt


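## 🧼 Load and Preprocess the Dataset
We read the CSVs, merge the `title` and `description` fields into a single `text` column, convert labels from 1–4 to 0–3, keep only the `label` and `text` columns, and wrap both splits in a `DatasetDict`.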

In [None]:
# Load CSVs and skip the header row
train_df = pd.read_csv("train.csv", skiprows=1, header=None, names=["label", "title", "description"])
test_df = pd.read_csv("test.csv", skiprows=1, header=None, names=["label", "title", "description"])


# Merge title and description into a single text column
train_df["text"] = train_df["title"] + " " + train_df["description"]
test_df["text"] = test_df["title"] + " " + test_df["description"]

# 🔧 Convert 1–4 labels to 0–3
train_df["label"] = train_df["label"].astype(int) - 1
test_df["label"] = test_df["label"].astype(int) - 1

# Drop unnecessary columns
train_df = train_df[["label", "text"]]
test_df = test_df[["label", "text"]]

# Convert to HuggingFace Datasets
train_dataset = Dataset.from_pandas(train_df)
test_dataset = Dataset.from_pandas(test_df)
dataset = DatasetDict({"train": train_dataset, "test": test_dataset})
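
The label shift matters because a classifier with 4 outputs expects class ids in the range 0–3. As a minimal, dependency-free sketch of the same per-row transformation, the snippet below uses a hypothetical helper `to_example` and a made-up row. The class names follow the order used in the evaluation cell of this notebook.

```python
# Class names in the order used by the evaluation cell of this notebook.
CLASS_NAMES = ["World", "Sports", "Business", "Sci/Tech"]


def to_example(raw_label, title, description):
    """Mirror the DataFrame preprocessing for one row (hypothetical helper)."""
    label = int(raw_label) - 1  # shift 1-4 down to 0-3
    if not 0 <= label < len(CLASS_NAMES):
        raise ValueError(f"unexpected raw label: {raw_label!r}")
    return {"label": label, "text": f"{title} {description}"}


# Made-up row for illustration only.
example = to_example(3, "Oil prices rise", "Crude futures climbed on Monday.")
print(example["label"], CLASS_NAMES[example["label"]])
print(example["text"])
```

Raising on an out-of-range label is a design choice for this sketch: it surfaces a wrong column order or an unskipped header row early, before training starts.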


## 🔠 Tokenize the Text
Use `DistilBertTokenizerFast` to tokenize the `text` column, truncating long inputs and padding every example to the tokenizer's maximum length.

In [None]:
# Tokenization
tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], padding="max_length", truncation=True)

encoded_dataset = dataset.map(tokenize, batched=True)
encoded_dataset.set_format("torch", columns=["input_ids", "attention_mask", "label"])
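
To make the effect of `padding="max_length"` and `truncation=True` concrete, here is a toy sketch of how a list of token ids becomes a fixed-length `input_ids` and `attention_mask` pair. `pad_or_truncate` is a hypothetical function, not the tokenizer's implementation: it ignores special tokens, and the ids are made up.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), each exactly max_length long.

    Toy illustration only: a real tokenizer also handles special tokens
    when it truncates.
    """
    ids = list(token_ids[:max_length])   # truncation: drop ids past max_length
    mask = [1] * len(ids)                # 1 marks a real token
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad  # 0 marks padding


short_ids, short_mask = pad_or_truncate([5, 6, 7], max_length=5)
long_ids, long_mask = pad_or_truncate([1, 2, 3, 4, 5, 6], max_length=4)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Because every example is padded to the same length, short news snippets carry many padding positions. That keeps the tensors uniform at the cost of extra memory and compute.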


## 🤖 Load the Pretrained Model
Load a `DistilBertForSequenceClassification` model with 4 output labels for our classification task.

In [None]:
# Load model
model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=4)


## ⚙️ Define Training Arguments
Specify batch sizes, number of epochs, and the logging directory, and disable external logging.

In [None]:
# Keep TrainingArguments minimal for compatibility across Transformers versions
training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=3,
    logging_dir="./logs",
    report_to="none"  # disables W&B and others
)
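
A quick back-of-the-envelope calculation shows how many optimizer steps these settings imply. The sketch below assumes 120,000 training rows (the size of the standard AG News training split; confirm with `len(train_df)`), a single device, and no gradient accumulation. `total_training_steps` is a hypothetical helper, not a Transformers function.

```python
import math


def total_training_steps(num_examples, per_device_batch_size, num_epochs, num_devices=1):
    """Estimate optimizer steps, assuming no gradient accumulation and
    that the last partial batch of each epoch is kept."""
    steps_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_devices))
    return steps_per_epoch * num_epochs


# Assumed dataset size: 120,000 training rows.
steps = total_training_steps(num_examples=120_000, per_device_batch_size=16, num_epochs=3)
print(steps)
```

Under those assumptions the result is 7,500 steps per epoch and 22,500 steps in total, which is useful when judging how long a Colab session needs to stay alive.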


## 📏 Define Evaluation Metric
We use HuggingFace's `evaluate` to calculate classification accuracy.

In [None]:
# Define metrics
accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": accuracy.compute(predictions=preds, references=labels)["accuracy"]}
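
To see what `compute_metrics` does without loading a model, the following dependency-free sketch applies the same two steps (argmax over logits, then the fraction of matches) to made-up values. `accuracy_from_logits` is a hypothetical stand-in, not the `evaluate` implementation.

```python
def argmax(row):
    """Index of the largest value in a list (first one wins on ties)."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class equals the label."""
    preds = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(preds, labels))
    return correct / len(labels)


# Made-up logits for four examples and four classes.
toy_logits = [
    [2.0, 0.1, 0.3, 0.0],  # predicts class 0
    [0.2, 1.5, 0.1, 0.0],  # predicts class 1
    [0.1, 0.2, 0.3, 0.9],  # predicts class 3
    [1.0, 0.2, 0.1, 0.0],  # predicts class 0
]
toy_labels = [0, 1, 2, 0]

print(accuracy_from_logits(toy_logits, toy_labels))  # 3 of 4 correct -> 0.75
```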


## 🚀 Train the Model
Initialize the `Trainer` and run the training process.

In [None]:
# Train
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset["test"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

trainer.train()


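## 💾 Save and Download the Model
Save the fine-tuned model to a local directory, zip it, and download the archive from Colab.

In [None]:
# Save the fine-tuned model to the directory that the next cell archives
trainer.save_model("distilbert-agnews-checkpoint")

In [None]:
from google.colab import files
import shutil

# Zip the saved model directory and download the archive
shutil.make_archive("distilbert-agnews", 'zip', "distilbert-agnews-checkpoint")
files.download("distilbert-agnews.zip")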


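## 📊 Evaluate the Model
Display the classification report and confusion matrix to understand model performance.

In [None]: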
# Evaluate and visualize
predictions = trainer.predict(encoded_dataset["test"])
y_true = predictions.label_ids
y_pred = np.argmax(predictions.predictions, axis=1)

print(classification_report(y_true, y_pred, target_names=["World", "Sports", "Business", "Sci/Tech"]))

cm = confusion_matrix(y_true, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=["World", "Sports", "Business", "Sci/Tech"])
disp.plot(cmap="Blues", xticks_rotation=45)
plt.title("Confusion Matrix - DistilBERT on AG News")
plt.tight_layout()
plt.show()
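
The numbers in the classification report can be read straight off the confusion matrix. As a minimal sketch, the snippet below derives per-class precision and recall from a made-up 3×3 matrix, using scikit-learn's convention that rows are true classes and columns are predicted classes. `per_class_precision_recall` is a hypothetical helper.

```python
def per_class_precision_recall(cm):
    """Return [(precision, recall), ...] for a square confusion matrix
    whose rows are true classes and whose columns are predicted classes."""
    n = len(cm)
    results = []
    for k in range(n):
        true_positives = cm[k][k]
        predicted_as_k = sum(cm[i][k] for i in range(n))  # column sum
        actually_k = sum(cm[k])                           # row sum
        precision = true_positives / predicted_as_k if predicted_as_k else 0.0
        recall = true_positives / actually_k if actually_k else 0.0
        results.append((precision, recall))
    return results


# Made-up counts for illustration only; not results from this notebook.
toy_cm = [
    [8, 2, 0],
    [1, 9, 0],
    [1, 0, 9],
]
for k, (precision, recall) in enumerate(per_class_precision_recall(toy_cm)):
    print(f"class {k}: precision={precision:.2f} recall={recall:.2f}")
```

Off-diagonal cells show which pairs of classes the model confuses, something a single accuracy figure cannot reveal.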
