## **Fine-Tuning a Small Language Model (SLM) Using Transformer Models**
**Objective:**  
To fine-tune a Small Language Model (SLM) with fewer than 3B parameters on a text dataset using Hugging Face and evaluate its performance.

**Model Used:** DistilGPT-2  
**Dataset Used:** AG News  
**Platform:** Google Colab


## Install Required Libraries
These libraries are required for loading datasets, fine-tuning transformer models, and evaluating performance.



In [None]:
!pip install transformers datasets accelerate evaluate torch




## Import Libraries

In [None]:
import torch
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    Trainer,
    TrainingArguments,
    DataCollatorForLanguageModeling
)
import evaluate

## Load Dataset (AG News)

The AG News dataset consists of news article texts across four categories:
- World
- Sports
- Business
- Science/Technology

For this experiment:
- Only the **text** field is used to train the language model
- Labels are ignored since this is a **causal language modeling** task

In [None]:
# load_dataset was imported in the previous cell.
# AG News provides a "train" split and a "test" split.
dataset = load_dataset("ag_news")

## Small Language Model (distilgpt2)

### Reason for Choosing This Model
- At roughly 82M parameters, the model is well under the 3B-parameter limit set by the task.
- It fine-tunes efficiently on limited compute (Google Colab).
- It offers a good balance between performance and training speed.
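
As a rough check on the size claim, the sketch below counts parameters from architecture dimensions alone. The dimensions (768 hidden units, a 50,257-token vocabulary, 1,024 positions) are assumed from the published GPT-2 small configuration and are not printed anywhere in this notebook; only the six transformer layers can be read off the load report further down. The helper name `gpt2_param_count` is hypothetical.

```python
def gpt2_param_count(n_layer, d_model, vocab_size, n_positions):
    """Count parameters of a GPT-2 style decoder with tied input/output embeddings."""
    # Token embeddings plus learned position embeddings.
    embeddings = vocab_size * d_model + n_positions * d_model
    layer_norm = 2 * d_model                           # scale + bias
    attention = d_model * 3 * d_model + 3 * d_model    # fused Q/K/V projection
    attention += d_model * d_model + d_model           # attention output projection
    mlp = d_model * 4 * d_model + 4 * d_model          # expand to 4x width
    mlp += 4 * d_model * d_model + d_model             # project back to d_model
    per_layer = 2 * layer_norm + attention + mlp       # two layer norms per block
    final_layer_norm = 2 * d_model
    return embeddings + n_layer * per_layer + final_layer_norm

# Assumed DistilGPT-2 dimensions (not taken from this notebook's output).
total = gpt2_param_count(n_layer=6, d_model=768, vocab_size=50257, n_positions=1024)
print(f"{total:,} parameters ({total / 1e6:.1f}M)")
print("Under 3B:", total < 3_000_000_000)
```

Once the model is loaded, `sum(p.numel() for p in model.parameters())` gives the exact figure directly and is the better source of truth.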

In [None]:
model_name = "distilgpt2"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)



GPT2LMHeadModel LOAD REPORT from: distilgpt2
Key                                        | Status     |  | 
-------------------------------------------+------------+--+-
transformer.h.{0, 1, 2, 3, 4, 5}.attn.bias | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


## Tokenize the dataset


In [None]:
# The tokenizer and model were already loaded in the previous section,
# so they are not reloaded here.
# GPT-2 tokenizers ship without a padding token; reuse the end-of-sequence token.
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.eos_token_id


In [None]:
def tokenize_function(examples):
    return tokenizer(
        examples["text"],
        truncation=True,
        padding="max_length",
        max_length=128,
        return_attention_mask=True
    )

tokenized_dataset = dataset.map(tokenize_function, batched=True, remove_columns=dataset["train"].column_names)
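
The effect of `truncation=True` with `padding="max_length"` can be illustrated without the tokenizer. The following is a minimal pure-Python sketch, not the library's implementation: the helper `pad_or_truncate`, the token ids, and the pad id of 0 are all made up for illustration, and right-side padding is assumed.

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Return (input_ids, attention_mask), each exactly max_length long."""
    ids = token_ids[:max_length]                  # truncation: drop tokens past max_length
    n_real = len(ids)
    n_pad = max_length - n_real
    input_ids = ids + [pad_id] * n_pad            # padding: fill up on the right
    attention_mask = [1] * n_real + [0] * n_pad   # 1 = real token, 0 = padding
    return input_ids, attention_mask

# A short sequence is padded, a long one is cut; both end up with length 5.
short_ids, short_mask = pad_or_truncate([11, 22, 33], max_length=5, pad_id=0)
long_ids, long_mask = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], max_length=5, pad_id=0)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Padding every example to 128 tokens keeps batch shapes fixed at the cost of some wasted computation on short articles.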



## Prepare Training and Evaluation Data

In [None]:
train_dataset = tokenized_dataset["train"].shuffle(seed=42).select(range(5000))
eval_dataset = tokenized_dataset["test"].shuffle(seed=42).select(range(1000))


## Define Training Arguments

In [None]:
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=2,
    weight_decay=0.01,
    report_to="tensorboard",
    save_total_limit=1
)
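
To see what these settings imply for training length, the sketch below estimates the number of optimizer steps. It is a simplified calculation that assumes a single GPU and no gradient accumulation, neither of which is stated in the notebook.

```python
import math

num_train_examples = 5000    # size of train_dataset selected above
per_device_batch_size = 4    # per_device_train_batch_size
num_epochs = 2               # num_train_epochs
num_devices = 1              # assumption: one GPU
grad_accum_steps = 1         # assumption: no gradient accumulation

effective_batch = per_device_batch_size * num_devices * grad_accum_steps
steps_per_epoch = math.ceil(num_train_examples / effective_batch)
total_steps = steps_per_epoch * num_epochs
print("steps per epoch:", steps_per_epoch)
print("total steps:", total_steps)
```

Under these assumptions the total agrees with the `global_step` value reported in the training output.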


## Create Trainer

In [None]:
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=data_collator,
)
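
With `mlm=False`, the collator builds labels for causal language modeling by copying the input ids and masking padding positions so they do not contribute to the loss. The sketch below mirrors that behavior in simplified pure Python; the helper name `make_causal_lm_labels` and the example token ids are hypothetical.

```python
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips

def make_causal_lm_labels(input_ids, pad_token_id):
    """Copy input_ids to labels, replacing padding positions with IGNORE_INDEX."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in input_ids]

EOS_ID = 50256  # GPT-2 end-of-text id, reused as the pad token in this notebook
input_ids = [464, 2068, 7586, EOS_ID, EOS_ID]  # made-up ids, right-padded
labels = make_causal_lm_labels(input_ids, pad_token_id=EOS_ID)
print(labels)
```

Because the pad token is the end-of-sequence token here, any genuine end-of-sequence token in the text would be masked as well. The shift between inputs and next-token targets is not shown; it is assumed to happen inside the model's loss computation.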

## Fine-Tune the Model
The model was fine-tuned using the Hugging Face Trainer API for efficiency and reproducibility.


In [None]:
trainer.train()


| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 3.889557      | 3.669334        |
| 2     | 3.645436      | 3.637822        |


TrainOutput(global_step=2500, training_loss=3.800827490234375, metrics={'train_runtime': 325.6386, 'train_samples_per_second': 30.709, 'train_steps_per_second': 7.677, 'total_flos': 326620938240000.0, 'train_loss': 3.800827490234375, 'epoch': 2.0})

## Model Evaluation

### Metrics Used
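- **Loss**: average cross-entropy on the evaluation set
- **Perplexity**: the exponential of the evaluation loss (for language modeling)
- **Accuracy / F1-score**: applicable only to classification-based tasks, so not computed in this experiment

Loss and perplexity measure how well the fine-tuned model predicts unseen text.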


In [None]:
eval_results = trainer.evaluate()
eval_results


{'eval_loss': 3.637822151184082,
 'eval_runtime': 6.7938,
 'eval_samples_per_second': 147.193,
 'eval_steps_per_second': 36.798,
 'epoch': 2.0}

Lower perplexity indicates better language modeling performance.
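
For reference, the relationship between the evaluation loss and perplexity can be written as follows. This assumes the reported `eval_loss` is the mean per-token negative log-likelihood in nats over the $N$ evaluated tokens; the symbols $\theta$, $x_i$, and $N$ are notation introduced here, not taken from the notebook.

```latex
\mathrm{PPL}
  = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\log p_\theta\!\left(x_i \mid x_{<i}\right)\right)
  = \exp\!\left(\mathcal{L}_{\text{eval}}\right),
\qquad
\exp(3.637822) \approx 38.01
```

A perplexity of about 38 can be read loosely as the model being as uncertain as if it were choosing uniformly among roughly 38 tokens at each position.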

In [None]:
import math
perplexity = math.exp(eval_results["eval_loss"])
print("Perplexity:", perplexity)


Perplexity: 38.00896873362309
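
The same conversion can be applied to each epoch's validation loss to see how perplexity moved during training. This is a small sketch using only the two validation losses from the training log above; the variable names are illustrative.

```python
import math

# Validation losses copied from the training log above.
val_loss_by_epoch = {1: 3.669334, 2: 3.637822}

# Perplexity is the exponential of the mean cross-entropy loss.
ppl_by_epoch = {epoch: math.exp(loss) for epoch, loss in val_loss_by_epoch.items()}
for epoch, ppl in ppl_by_epoch.items():
    print(f"epoch {epoch}: perplexity {ppl:.2f}")

# Fraction by which perplexity dropped between the first and second epoch.
relative_drop = 1 - ppl_by_epoch[2] / ppl_by_epoch[1]
print(f"relative perplexity reduction: {relative_drop:.1%}")
```

The second epoch yields only a small further reduction, which suggests that most of the adaptation had already happened by the end of the first epoch.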


## Observations

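- Model performance improved during fine-tuning: validation loss fell from 3.669 after epoch 1 to 3.638 after epoch 2, for a final perplexity of about 38.0.
- Training loss decreased from 3.890 in epoch 1 to 3.645 in epoch 2.
- Smaller models can still perform well with proper data: DistilGPT-2 adapted to news text using only 5,000 training examples and about 326 seconds of training.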

### Challenges Faced
- GPU memory limitations
- Training time constraints

### Learnings
- Hands-on experience with Hugging Face ecosystem
- Understanding of SLM fine-tuning workflow
