### Fine-Tuning a Language Model on Multiple Datasets

In this notebook, I fine-tune a **single model on multiple datasets** to improve its generalization across tasks. The workflow shows how to combine datasets, cast them into a shared text-to-text format with task prefixes, and fine-tune `t5-small` with Hugging Face `transformers`. The steps are:

- Installing the libraries needed for dataset loading, evaluation, and model training.
- Loading two NLP datasets (CNN/DailyMail and GLUE SST-2) and merging subsets of them into one training set.
- Formatting the combined data into prefixed input-target pairs.
- Loading the base model and its tokenizer.
- Training the model with `Seq2SeqTrainer`.
- Saving the model and running a quick check on both tasks.

This process shows how a single model can be trained on diverse tasks through a unified text-to-text format.
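
As a rough illustration of the unified format, the sketch below turns examples from both tasks into plain input-target string pairs. The prefixes are the ones used in the preprocessing cells of this notebook; the helper name `to_text_pair` and the `TASK_PREFIXES` table are hypothetical and exist only for this example.

```python
# Hypothetical helper; only the two prefix strings come from this notebook.
TASK_PREFIXES = {
    "summarization": "summarize: ",
    "classification": "classify sentiment: ",
}

def to_text_pair(task, source, target):
    """Return a (model_input, model_target) pair of plain strings."""
    return TASK_PREFIXES[task] + source.strip(), target.strip()

pairs = [
    to_text_pair("summarization", "A long news article ...", "A short summary."),
    to_text_pair("classification", "a charming film", "positive"),
]
for model_input, model_target in pairs:
    print(repr(model_input), "->", repr(model_target))
```

Because both tasks end up as string-to-string pairs, one tokenizer and one loss can serve both.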


In [None]:
!pip install datasets rouge evaluate transformers wandb nltk rouge_score

Collecting datasets
  Downloading datasets-3.5.0-py3-none-any.whl.metadata (19 kB)
Collecting rouge
  Downloading rouge-1.0.1-py3-none-any.whl.metadata (4.1 kB)
Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Collecting rouge_score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.12.0,>=2023.1.0 (from fsspec[http]<=2024.12.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.12.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.5.0-py3-none-any.whl (491 kB)

In [None]:
import os
import numpy as np
import pandas as pd
import torch
import evaluate
import nltk
import wandb

from datasets import load_dataset, concatenate_datasets
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer
)
from sklearn.metrics import accuracy_score, f1_score
from nltk.tokenize import sent_tokenize
from torch.utils.data import DataLoader

In [None]:
# Download necessary NLTK data
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
# Set up Weights & Biases for tracking experiments (optional but recommended)
wandb.init(project="multiple-dataset-fine-tuning")

[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.


<IPython.core.display.Javascript object>

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter:

 ··········


[34m[1mwandb[0m: No netrc file found, creating one.
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[34m[1mwandb[0m: Currently logged in as: [33mrishikeshavlal-patel[0m ([33mrishikeshavlal-patel-student[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


In [None]:
# Set seeds for reproducibility
def set_seed(seed=42):
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

set_seed()
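
`set_seed` above seeds NumPy and PyTorch but not Python's built-in `random` module. The sketch below uses a hypothetical companion helper, `seed_python_rng`, to show what seeding buys: the same seed reproduces the same sequence of draws.

```python
import random

def seed_python_rng(seed=42):
    # Hypothetical companion to set_seed(): seeds only the standard-library RNG.
    random.seed(seed)

seed_python_rng(42)
first = [random.random() for _ in range(3)]

seed_python_rng(42)
second = [random.random() for _ in range(3)]

print(first == second)  # True
```

`transformers` also ships its own `set_seed` function, which may be a simpler option when the whole pipeline runs through that library.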

### Load Base Model
- Load the `t5-small` checkpoint and its tokenizer with Hugging Face `transformers`.
- Move the model to the target device before training.


In [None]:
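# Use the GPU when one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model_name = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)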

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`



In [None]:
# Load datasets
# 1. CNN/DailyMail for summarization
cnn_dataset = load_dataset("cnn_dailymail", "3.0.0")


In [None]:
# 2. GLUE SST-2 for sentiment classification
sst2_dataset = load_dataset("glue", "sst2")

print("Datasets loaded successfully!")
print(f"CNN/DailyMail - Train: {len(cnn_dataset['train'])}, Validation: {len(cnn_dataset['validation'])}, Test: {len(cnn_dataset['test'])}")
print(f"SST-2 - Train: {len(sst2_dataset['train'])}, Validation: {len(sst2_dataset['validation'])}, Test: {len(sst2_dataset['test'])}")

Datasets loaded successfully!
CNN/DailyMail - Train: 287113, Validation: 13368, Test: 11490
SST-2 - Train: 67349, Validation: 872, Test: 1821


In [None]:
# Preprocess the datasets.
# Note: preprocess_cnn_dailymail and preprocess_sst2 are defined in later cells; run those cells first.
cnn_processed = cnn_dataset.map(preprocess_cnn_dailymail, batched=True)
sst2_processed = sst2_dataset.map(preprocess_sst2, batched=True)

cnn_sample_size = min(len(cnn_processed["train"]), 2000)
sst2_sample_size = min(len(sst2_processed["train"]), 2000)

cnn_train_subset = cnn_processed["train"].shuffle(seed=42).select(range(cnn_sample_size))
sst2_train_subset = sst2_processed["train"].shuffle(seed=42).select(range(sst2_sample_size))

# Combine datasets for training
combined_train = concatenate_datasets([cnn_train_subset, sst2_train_subset])
combined_val = concatenate_datasets([
    cnn_processed["validation"].shuffle(seed=42).select(range(min(len(cnn_processed["validation"]), 500))),
    sst2_processed["validation"].shuffle(seed=42).select(range(min(len(sst2_processed["validation"]), 500)))
])

print(f"Combined training set size: {len(combined_train)}")
print(f"Combined validation set size: {len(combined_val)}")

# Create a custom data collator
data_collator = DataCollatorForSeq2Seq(
    tokenizer=tokenizer,
    model=model,
    padding="max_length",
    max_length=512
)

Combined training set size: 4000
Combined validation set size: 1000
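
The subsampling and merging logic can be sketched without the `datasets` library. The version below works on plain Python lists; the function name `sample_and_combine` and the toy data are assumptions made for this example, not part of the notebook.

```python
import random

def sample_and_combine(task_examples, per_task, seed=42):
    """Shuffle each task's examples, keep at most `per_task`, and concatenate."""
    rng = random.Random(seed)
    combined = []
    for task, examples in task_examples.items():
        shuffled = list(examples)
        rng.shuffle(shuffled)
        combined.extend((task, ex) for ex in shuffled[:per_task])
    return combined

# Toy stand-ins for the two datasets
toy = {
    "summarization": [f"article {i}" for i in range(10)],
    "classification": [f"sentence {i}" for i in range(3)],
}
combined = sample_and_combine(toy, per_task=4)
counts = {task: sum(1 for t, _ in combined if t == task) for task in toy}
print(counts)  # {'summarization': 4, 'classification': 3}
```

Capping each task at the same size keeps the larger dataset from dominating the mix. The concatenated result is still grouped by task, so it relies on the trainer shuffling the training data, which the Hugging Face `Trainer` does by default.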


In [None]:
# Function to preprocess CNN/DailyMail for summarization
def preprocess_cnn_dailymail(examples):
    # Add task prefix to distinguish this as a summarization task
    inputs = ["summarize: " + doc for doc in examples["article"]]

    # Tokenize inputs
    model_inputs = tokenizer(inputs, max_length=256, truncation=True)

    # Tokenize targets (summaries); text_target replaces the deprecated as_target_tokenizer()
    labels = tokenizer(text_target=examples["highlights"], max_length=64, truncation=True)

    model_inputs["labels"] = labels["input_ids"]

    # Add task type identifier
    model_inputs["task_type"] = ["summarization"] * len(inputs)

    return model_inputs

In [None]:
# Function to preprocess SST-2 for sentiment classification
def preprocess_sst2(examples):
    batch_size = len(examples["sentence"])

    # Add task prefix to distinguish this as a classification task
    inputs = [f"classify sentiment: {sentence}" for sentence in examples["sentence"]]

    # Tokenize inputs without padding; the data collator pads each batch later
    model_inputs = tokenizer(
        inputs,
        max_length=128,
        truncation=True,
        padding=False
    )

    # Map integer labels to text targets (0 -> "negative", anything else -> "positive").
    # The GLUE test split is unlabeled (-1), so its targets are not meaningful.
    text_labels = ["negative" if label == 0 else "positive" for label in examples["label"]]

    # Tokenize targets, also without padding; text_target replaces the deprecated as_target_tokenizer()
    labels = tokenizer(
        text_target=text_labels,
        max_length=8,
        truncation=True,
        padding=False
    )

    model_inputs["labels"] = labels["input_ids"]

    # Add task type identifier
    model_inputs["task_type"] = ["classification"] * batch_size

    return model_inputs
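
Both preprocessing functions leave sequences unpadded, so padding happens when a batch is assembled. The sketch below shows the idea on plain lists. `pad_batch` is a hypothetical stand-in for the collator rather than its actual implementation; it assumes pad token id 0 (the value for `t5-small`) and -100 as the label value the loss ignores.

```python
def pad_batch(features, pad_token_id=0, label_pad_id=-100):
    """Pad a list of {'input_ids', 'labels'} dicts to the longest item in the batch."""
    max_in = max(len(f["input_ids"]) for f in features)
    max_lab = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_in, n_lab = len(f["input_ids"]), len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * (max_in - n_in))
        # 1 marks real tokens, 0 marks padding
        batch["attention_mask"].append([1] * n_in + [0] * (max_in - n_in))
        batch["labels"].append(f["labels"] + [label_pad_id] * (max_lab - n_lab))
    return batch

# Made-up token ids, for illustration only
features = [
    {"input_ids": [11, 12, 13, 1], "labels": [7, 1]},
    {"input_ids": [21, 1], "labels": [8, 9, 1]},
]
batch = pad_batch(features)
print(batch["input_ids"])  # [[11, 12, 13, 1], [21, 1, 0, 0]]
print(batch["labels"])     # [[7, 1, -100], [8, 9, 1]]
```

Padding to the longest item in each batch usually costs less compute than padding every example to a fixed maximum length, which matters here because SST-2 sentences are much shorter than news articles.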

In [None]:
# Set up metrics
rouge = evaluate.load("rouge")
accuracy = evaluate.load("accuracy")
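
For intuition about what ROUGE-1 measures, here is a simplified sketch based on unigram overlap. It is not the `evaluate` implementation: it splits on whitespace, lowercases, and skips stemming, so its scores can differ from the library's. The function name `rouge1_f1` is hypothetical.

```python
from collections import Counter

def rouge1_f1(prediction, reference):
    """Simplified ROUGE-1 F1: harmonic mean of unigram precision and recall."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Clipped overlap: each token counts at most as often as it appears in both texts
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1("the cat sat on the mat", "the cat lay on the mat")
print(round(score, 3))  # 0.833
```

Five of the six tokens overlap in each direction, so precision and recall are both 5/6.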


In [None]:
def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [[label.strip()] for label in labels]
    return preds, labels

In [None]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred

    # Decode predictions and labels with better error handling
    decoded_preds = []
    try:
        # Convert to int32 and clip values to valid token range
        max_id = tokenizer.vocab_size - 1
        clipped_preds = np.clip(predictions, 0, max_id).astype(np.int32)
        decoded_preds = tokenizer.batch_decode(clipped_preds, skip_special_tokens=True)
    except Exception as e:
        # If batch decoding fails, fall back to individual decoding with safeguards
        for pred in predictions:
            try:
                # Clip values to valid token range
                clipped_pred = np.clip(pred, 0, tokenizer.vocab_size - 1).astype(np.int32)
                decoded_pred = tokenizer.decode(clipped_pred, skip_special_tokens=True)
                decoded_preds.append(decoded_pred)
            except Exception as inner_e:
                # If a prediction can't be decoded, use an empty string
                print(f"Warning: Failed to decode prediction: {inner_e}")
                decoded_preds.append("")

    # Process labels (which are usually more stable)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Clean up predictions and labels
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    # Split examples by task, inferred from the reference text
    classification_preds = []
    classification_labels = []
    summarization_preds = []
    summarization_labels = []

    for pred, label in zip(decoded_preds, decoded_labels):
        # Exact match, so a summary that merely mentions "positive" is not misrouted
        if label[0] in ("positive", "negative"):
            # This is a classification task
            classification_preds.append(pred)
            classification_labels.append(label[0])
        else:
            # This is a summarization task
            summarization_preds.append(pred)
            summarization_labels.append(label[0])

    # Results dictionary
    results = {}

    # Compute ROUGE for summarization if we have any summarization examples
    if summarization_preds:
        rouge_output = rouge.compute(
            predictions=summarization_preds,
            references=[[label] for label in summarization_labels],
            use_stemmer=True
        )
        results.update(rouge_output)

    # Compute classification metrics if we have any classification examples
    if classification_preds:
        # Convert text predictions to binary labels for accuracy
        binary_preds = ["positive" in pred for pred in classification_preds]
        binary_labels = ["positive" in label for label in classification_labels]

        results["classification_accuracy"] = accuracy_score(binary_labels, binary_preds)
        results["classification_f1"] = f1_score(binary_labels, binary_preds, average='binary')

    return results
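
`compute_metrics` has no task column to work with at evaluation time, so it infers the task from the decoded reference text. A minimal sketch of that routing follows. It matches the reference exactly against the two class names, which is a choice made for this example so that a summary containing the word "positive" still counts as summarization. The helper names `split_by_task` and `accuracy` are hypothetical.

```python
def split_by_task(preds, labels, class_names=("positive", "negative")):
    """Route (prediction, reference) pairs to a task based on the reference text."""
    routed = {"classification": [], "summarization": []}
    for pred, label in zip(preds, labels):
        task = "classification" if label in class_names else "summarization"
        routed[task].append((pred, label))
    return routed

def accuracy(pairs):
    """Fraction of pairs whose prediction equals the reference exactly."""
    return sum(pred == label for pred, label in pairs) / len(pairs)

preds = ["positive", "negative", "tests came back positive"]
labels = ["positive", "positive", "patient tested positive for flu"]
routed = split_by_task(preds, labels)
print(len(routed["classification"]), len(routed["summarization"]))  # 2 1
print(accuracy(routed["classification"]))  # 0.5
```

Carrying an explicit task identifier through evaluation would be more robust than inferring it from the text, but it requires keeping that column available to the metric function.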

In [None]:
# Define training arguments (evaluate and save every 200 steps so the best checkpoint can be reloaded)
training_args = Seq2SeqTrainingArguments(
    fp16=True,
    output_dir="./results",
    eval_strategy="steps",
    eval_steps=200,
    logging_dir="./logs",
    logging_steps=50,
    save_steps=200,
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=1,
    weight_decay=0.01,
    save_total_limit=2,
    num_train_epochs=2,
    predict_with_generate=True,
    generation_max_length=64,
    report_to="wandb",
    load_best_model_at_end=True,
    metric_for_best_model="rouge1" if len(cnn_processed["validation"]) > 0 else "classification_accuracy",
    push_to_hub=False,
    dataloader_num_workers=4,
    optim="adamw_torch",
    gradient_checkpointing=True,
)
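
To see how these settings translate into a schedule, the arithmetic below works out the number of optimizer steps and the evaluation points for the 4,000-example training set. It assumes a single device, which this notebook does not state explicitly.

```python
import math

train_examples = 4000      # combined training set size reported above
per_device_batch = 16      # per_device_train_batch_size
grad_accum = 1             # gradient_accumulation_steps
epochs = 2                 # num_train_epochs
eval_steps = 200           # eval_steps and save_steps
num_devices = 1            # assumption: one GPU

steps_per_epoch = math.ceil(train_examples / (per_device_batch * num_devices * grad_accum))
total_steps = steps_per_epoch * epochs
eval_points = list(range(eval_steps, total_steps + 1, eval_steps))

print(steps_per_epoch, total_steps, eval_points)  # 250 500 [200, 400]
```

With only two evaluation points, `load_best_model_at_end` chooses between the checkpoints at steps 200 and 400.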

In [None]:
# Initialize the Trainer
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=combined_train,
    eval_dataset=combined_val,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

In [None]:
# Fine-tune the model; without this call, the saved model is just the base checkpoint
trainer.train()


In [None]:
# Save the fine-tuned model
model_path = "./fine_tuned_multi_task_model"
trainer.save_model(model_path)
tokenizer.save_pretrained(model_path)
print(f"Model saved to {model_path}")

Model saved to ./fine_tuned_multi_task_model


In [None]:
# Test the model on both tasks
def test_model_on_both_tasks(model, tokenizer):
    model.eval()

    # Test summarization
    article = """
    The COVID-19 pandemic has dramatically changed the way we live and work.
    Many companies have shifted to remote work, and schools have adopted
    online learning models. Public health measures including social distancing
    and mask-wearing have become commonplace in many regions. Vaccines were
    developed in record time, but distribution challenges and vaccine hesitancy
    remain obstacles to achieving herd immunity.
    """

    summarization_input = tokenizer("summarize: " + article, return_tensors="pt").to(model.device)
    summary_ids = model.generate(
        **summarization_input,  # pass input_ids together with the attention mask
        max_length=75,
        min_length=30,
        no_repeat_ngram_size=3
    )
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

    # Test classification
    review = "The movie was absolutely fantastic with great performances and an engaging storyline."
    classification_input = tokenizer("classify sentiment: " + review, return_tensors="pt").to(model.device)
    sentiment_ids = model.generate(
        **classification_input,
        max_length=10
    )
    sentiment = tokenizer.decode(sentiment_ids[0], skip_special_tokens=True)

    return {
        "summarization_example": article,
        "generated_summary": summary,
        "classification_example": review,
        "predicted_sentiment": sentiment
    }

# Test the model
test_results = test_model_on_both_tasks(model, tokenizer)
print("\nTest Results:")
print(f"Summarization Example: \n{test_results['summarization_example'][:100]}...")
print(f"Generated Summary: \n{test_results['generated_summary']}")
print(f"\nClassification Example: \n{test_results['classification_example']}")
print(f"Predicted Sentiment: {test_results['predicted_sentiment']}")




Test Results:
Summarization Example: 

    The COVID-19 pandemic has dramatically changed the way we live and work.
    Many companies hav...
Generated Summary: 
the COVID-19 pandemic has dramatically changed the way we live and work. many companies have shifted to remote work - and schools have adopted online learning models.

Classification Example: 
The movie was absolutely fantastic with great performances and an engaging storyline.
Predicted Sentiment: Der Film war absolut fantastig mit tollen
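
The predicted sentiment above is German text rather than one of the two labels, which suggests the model is not yet producing the label vocabulary for this prefix. When generated text feeds a downstream metric, it helps to map it to a label explicitly and to flag anything that does not fit. The sketch below shows one way to do that; the helper `parse_sentiment` is hypothetical.

```python
def parse_sentiment(generated):
    """Map generated text to 'positive' or 'negative', or None if it is ambiguous or off-target."""
    text = generated.strip().lower()
    has_pos = "positive" in text
    has_neg = "negative" in text
    if has_pos == has_neg:
        # Either neither label appears or both do; treat as unparseable
        return None
    return "positive" if has_pos else "negative"

print(parse_sentiment("positive"))  # positive
print(parse_sentiment("Der Film war absolut fantastig mit tollen"))  # None
```

Counting the `None` cases separately makes off-target generations visible instead of silently scoring them as one of the classes.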
