# Fine-Tuning a Small Language Model on the AG News Dataset

This notebook demonstrates fine-tuning a small language model (DistilBERT) on the AG News dataset from Hugging Face. The model is trained to classify news articles into four categories.


In [None]:
!pip install transformers datasets evaluate accelerate


In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, DataCollatorWithPadding
from transformers import TrainingArguments, Trainer
import evaluate
import numpy as np

## Dataset Selection

The AG News dataset is used for multi-class news classification. Each news article is labeled with one of four categories: World, Sports, Business, or Science/Technology.
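As a small illustration of the four categories, the sketch below maps integer class ids to names. The id order (0 to 3) follows the `labels` list used in the inference cell of this notebook; the helper name `label_name` is hypothetical and only for illustration.

```python
# Class ids in the order used by this notebook's inference cell.
id2label = {0: "World", 1: "Sports", 2: "Business", 3: "Sci/Tech"}

# Reverse mapping from category name back to integer id.
label2id = {name: idx for idx, name in id2label.items()}


def label_name(label_id: int) -> str:
    """Return the category name for an integer class id."""
    return id2label[label_id]


print(label_name(3))
print(label2id["Business"])
```

Dictionaries like these can also be passed as `id2label` and `label2id` when loading the model, so that its config carries readable class names; that step is optional and is not part of the original notebook.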



In [None]:
dataset = load_dataset("ag_news")

dataset


In [None]:
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")


In [None]:
def tokenize_function(example):
    return tokenizer(example["text"], truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

## Model Selection

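DistilBERT is chosen because it is a lightweight model distilled from BERT. Its authors report that it is about 40% smaller and 60% faster than BERT-base while retaining roughly 97% of its language-understanding performance, which keeps fine-tuning time short.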



In [None]:
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=4,
)



In [None]:
metric = evaluate.load("accuracy")


In [None]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

In [None]:
training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=2,
    weight_decay=0.01,
)

In [None]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"].shuffle(seed=42).select(range(3000)),
    eval_dataset=tokenized_datasets["test"].shuffle(seed=42).select(range(1000)),
    data_collator=data_collator,
    compute_metrics=compute_metrics
)
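By default, `DataCollatorWithPadding` pads each batch to the length of its longest sequence instead of to one fixed global length. The sketch below imitates that idea in plain Python; it is not the library's implementation. The function `pad_batch` is hypothetical and the token ids are illustrative values.

```python
from typing import Dict, List


def pad_batch(sequences: List[List[int]], pad_id: int = 0) -> Dict[str, List[List[int]]]:
    """Pad every sequence to the longest length in this batch (dynamic padding)."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two illustrative token-id sequences of different lengths.
batch = pad_batch([[101, 7592, 102], [101, 7592, 2088, 999, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding per batch keeps short batches short, which is why the tokenization step only needs `truncation=True` and leaves padding to the collator.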

The pre-trained DistilBERT model is fine-tuned on a 3,000-example subset of the AG News training split to adapt it for news classification.

In [None]:
trainer.train()
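As a rough check on how long this run is, the number of optimizer steps follows from the subset size, batch size, and epoch count used above. The sketch assumes a single device and no gradient accumulation; the helper `total_training_steps` is hypothetical.

```python
import math


def total_training_steps(num_examples: int, batch_size: int, epochs: int) -> int:
    """Estimate optimizer steps, assuming one device and no gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs


# 3,000 training examples, batch size 8, 2 epochs (values from this notebook).
print(total_training_steps(3000, 8, 2))  # 750
```

With more than one device the per-step batch is larger, so the step count would be lower than this estimate.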

## Model Evaluation

Accuracy is used to evaluate classification performance on a 1,000-example subset of the test split.


In [None]:
trainer.evaluate()
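To make the metric concrete, the following sketch computes accuracy by hand from a few made-up logits. It mirrors the argmax-then-compare logic of `compute_metrics` but uses plain Python lists. The function names and the example values are illustrative assumptions.

```python
from typing import List


def argmax(row: List[float]) -> int:
    """Index of the largest value in a row of logits."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy(logits: List[List[float]], labels: List[int]) -> float:
    """Fraction of rows whose highest-scoring class equals the true label."""
    correct = sum(argmax(row) == label for row, label in zip(logits, labels))
    return correct / len(labels)


# Four made-up examples with four class scores each.
example_logits = [
    [2.1, 0.3, -1.0, 0.5],   # predicts class 0
    [0.1, 3.2, 0.0, -0.4],   # predicts class 1
    [0.2, 0.1, 0.4, 1.9],    # predicts class 3
    [1.5, 0.2, 0.9, 0.1],    # predicts class 0
]
example_labels = [0, 1, 3, 2]  # the last example is misclassified

print(accuracy(example_logits, example_labels))  # 0.75
```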


## Results

The model achieved approximately 89% accuracy on the evaluation subset. The training loss decreased steadily, indicating that the model was learning the task.
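Because the evaluation set has only 1,000 examples, the reported accuracy carries some sampling uncertainty. The sketch below estimates a 95% interval with the normal approximation to a binomial proportion. This is a rough back-of-the-envelope check that was not part of the original run, and `accuracy_interval` is a hypothetical helper.

```python
import math
from typing import Tuple


def accuracy_interval(acc: float, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Normal-approximation interval for an accuracy measured on n examples."""
    standard_error = math.sqrt(acc * (1.0 - acc) / n)
    return acc - z * standard_error, acc + z * standard_error


low, high = accuracy_interval(0.89, 1000)
print(f"{low:.3f} to {high:.3f}")  # 0.871 to 0.909
```

Under these assumptions, an accuracy near 89% on 1,000 examples is uncertain by roughly two percentage points in either direction.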

## Observations

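- Fine-tuning adapts the pre-trained model to the four AG News categories; the classification head is newly initialized, so its predictions are not meaningful before training.
- Starting from a pre-trained model reduces training time compared with training from scratch.
- Increasing the number of training epochs may further improve results.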


In [None]:
import torch

text = "NASA launches a new satellite for space research"

inputs = tokenizer(text, return_tensors="pt")
# Move inputs to the same device as the model
inputs = {k: v.to(model.device) for k, v in inputs.items()}

# Inference only: disable dropout and gradient tracking
model.eval()
with torch.no_grad():
    outputs = model(**inputs)

# Index of the highest-scoring class
prediction = int(np.argmax(outputs.logits.cpu().numpy()))

labels = ["World", "Sports", "Business", "Sci/Tech"]
print("Predicted Category:", labels[prediction])
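The argmax above gives only the winning class. Applying a softmax to the logits shows how confident the model is in each category. The sketch uses plain Python and made-up logits, since the real values depend on the trained weights; `softmax` and `example_logits` are illustrative names.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Convert raw logits to probabilities that sum to 1."""
    largest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


category_names = ["World", "Sports", "Business", "Sci/Tech"]
example_logits = [0.4, -1.2, 0.1, 3.0]  # made-up values for illustration

probabilities = softmax(example_logits)
best = max(range(len(probabilities)), key=probabilities.__getitem__)
for name, p in zip(category_names, probabilities):
    print(f"{name}: {p:.3f}")
print("Predicted Category:", category_names[best])
```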


In [None]:
from google.colab import files
files.upload()


In [None]:
import nbformat

path = "/content/Fine_tuning_Lab_task1 (1).ipynb"

# Read notebook
with open(path, "r", encoding="utf-8") as f:
    nb = nbformat.read(f, as_version=4)

# Remove widgets metadata
if "widgets" in nb.get("metadata", {}):
    del nb["metadata"]["widgets"]

# Write cleaned notebook (written even if no widgets metadata was present)
with open("/content/Fine_tuning_Lab_task1_clean.ipynb", "w", encoding="utf-8") as f:
    nbformat.write(nb, f)

print("Notebook cleaned successfully!")
