# Fine-Tuning SmolLM: Adapting Compact Language Models for Custom Tasks

## Introduction

SmolLM and SmolLM2, developed by Hugging Face, are compact transformer-based language models available in 135M, 360M, and 1.7B parameter sizes. They are designed for efficiency and on-device deployment and handle tasks such as text generation, coding, and instruction following. Fine-tuning adapts a SmolLM model to a specific task or domain, such as customer support chatbots, educational content generation, or specialized text rewriting, by training it on a custom dataset. This guide walks through fine-tuning SmolLM step by step: prerequisites, data preparation, implementation, and best practices, with code examples and tables along the way.


## Prerequisites

Before fine-tuning SmolLM, ensure you have the necessary tools and resources. Below is a table summarizing the requirements.

| **Requirement**   | **Details**                                                                                 |
|-------------------|---------------------------------------------------------------------------------------------|
| **Hardware**       | GPU (e.g., NVIDIA A100, V100, or consumer-grade RTX 3060) for faster training; CPU possible but slower. |
| **Memory**         | 8-16 GB GPU VRAM for 135M/360M models; 24-40 GB for 1.7B model.                            |
| **Software**       | Python 3.8+, PyTorch, Hugging Face `transformers`, `datasets`, `peft`, `trl`.             |
| **Model Checkpoint** | SmolLM or SmolLM2 from Hugging Face (e.g., `HuggingFaceTB/SmolLM2-1.7B`).                 |
| **Dataset**        | Task-specific dataset in text or JSON format (e.g., instruction-response pairs).           |
| **Storage**        | 1-10 GB for model weights and dataset, depending on model size and data volume.            |


## Installation

In [None]:
!pip install torch transformers datasets peft trl accelerate --quiet


## Selecting a Model

Choose a SmolLM model based on your task and hardware constraints:

| **Model**         | **Parameters** | **Memory (8-bit)** | **Use Case**                                              |
|-------------------|----------------|---------------------|------------------------------------------------------------|
| **SmolLM-135M**    | 135M           | 162.87 MB           | Lightweight tasks, low-resource devices                    |
| **SmolLM-360M**    | 360M           | 409.07 MB           | General-purpose tasks, moderate resources                  |
| **SmolLM-1.7B**    | 1.7B           | 1812.14 MB          | Complex tasks, instruction following                       |
| **SmolLM2-1.7B**   | 1.7B           | 1812.14 MB          | Enhanced instruction following, function calling           |
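
As a rough cross-check of the table, weight memory can be approximated as parameter count times bytes per parameter. The sketch below is a back-of-the-envelope estimate, not a measurement: the helper name `estimate_weight_memory_mb` and the bytes-per-parameter mapping are illustrative assumptions. It ignores activations, gradients, and optimizer state, and it also ignores any layers that a quantization backend keeps in higher precision, which is one likely reason the 8-bit figures above are larger than parameters × 1 byte.

```python
# Approximate bytes needed to store one parameter at each precision (assumed values).
BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def estimate_weight_memory_mb(num_params: float, dtype: str) -> float:
    """Return the approximate size of the weights in MB (1 MB = 10**6 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e6


if __name__ == "__main__":
    # Nominal parameter counts from the table above.
    for name, params in [("135M", 135e6), ("360M", 360e6), ("1.7B", 1.7e9)]:
        row = ", ".join(
            f"{dtype}: {estimate_weight_memory_mb(params, dtype):,.0f} MB"
            for dtype in BYTES_PER_PARAM
        )
        print(f"{name:>5} -> {row}")
```

Training needs noticeably more memory than these weight-only numbers, which is why the hardware recommendations in the prerequisites are higher.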

> For this guide, we’ll use **SmolLM2-1.7B-Instruct** as an example, but the process applies to all variants.


## Step-by-Step Fine-Tuning Process

Fine-tuning involves adapting a pre-trained SmolLM model to a specific task using a custom dataset. We’ll use Supervised Fine-Tuning (SFT) with LoRA (Low-Rank Adaptation) to make the process memory-efficient. Direct Preference Optimization (DPO) can optionally follow SFT for alignment; the steps below cover the SFT stage.

### Step 1: Prepare the Dataset

A high-quality dataset is critical for effective fine-tuning. The dataset should align with your target task, such as instruction-response pairs for chatbots or text-completion pairs for domain-specific generation.

**Dataset Format**

**Instruction-Response Pairs**: Common for instruction tuning.

```json
[
  {"instruction": "Write a Python function to calculate factorial.", "response": "def factorial(n): return 1 if n == 0 else n * factorial(n-1)"},
  {"instruction": "Summarize this article.", "response": "The article discusses..."}
]
```

**Text Completion**: For generative tasks.
```json
[
  {"text": "The capital of France is ### Paris."},
  {"text": "Python is a ### programming language."}
]
```
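
Before training, it can help to check that every record has the fields the chosen format expects. The following is a minimal sketch for the instruction-response layout shown above; the function name `validate_records`, the decision to reject empty strings, and the sample records are illustrative assumptions, not part of any library.

```python
import json
from typing import Any, Dict, List

REQUIRED_KEYS = ("instruction", "response")  # assumed schema, taken from the example format


def validate_records(records: List[Dict[str, Any]]) -> List[str]:
    """Return human-readable problems; an empty list means every record passed."""
    problems = []
    for i, record in enumerate(records):
        for key in REQUIRED_KEYS:
            value = record.get(key)
            # Each required field must be a non-empty string.
            if not isinstance(value, str) or not value.strip():
                problems.append(f"record {i}: missing or empty '{key}'")
    return problems


# Made-up sample data: the second record has an empty response.
sample = json.loads("""
[
  {"instruction": "Summarize this article.", "response": "The article discusses..."},
  {"instruction": "Translate to French.", "response": ""}
]
""")
print(validate_records(sample))
```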

### Data Sources

- Public Datasets: Use datasets like open-instruct or alpaca from Hugging Face Datasets.

- Custom Data: Curate domain-specific data (e.g., customer support logs, educational content).

- Synthetic Data: Generate data using larger models like LLaMA or GPT-4.

In [None]:
"""
Example: Loading a Dataset

Load a dataset using Hugging Face datasets:
"""
from datasets import load_dataset

# Load a public dataset (e.g., Alpaca)
dataset = load_dataset("tatsu-lab/alpaca", split="train")

# Or load a custom JSON dataset
# dataset = load_dataset("json", data_files="custom_dataset.json", split="train")


In [None]:
dataset[0]

{'instruction': 'Give three tips for staying healthy.',
 'input': '',
 'output': '1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule.',
 'text': 'Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nGive three tips for staying healthy.\n\n### Response:\n1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule.'}
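
The `text` field above is the instruction and output rendered into the Alpaca prompt template. For custom data that only has raw fields, a similar field can be built by hand. The sketch below reproduces the no-input template visible in `dataset[0]`; the with-input variant follows the wording used by the original Alpaca release as far as I know and should be treated as an assumption, as should the helper name `format_alpaca`.

```python
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)
# Assumed wording for records that carry extra context in 'input'.
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)


def format_alpaca(example: dict) -> str:
    """Render one record (instruction, optional input, output) into a training string."""
    template = PROMPT_WITH_INPUT if example.get("input") else PROMPT_NO_INPUT
    prompt = template.format(
        instruction=example["instruction"],
        input=example.get("input", ""),  # ignored by the no-input template
    )
    return prompt + example["output"]


print(format_alpaca({"instruction": "Name a primary color.", "input": "", "output": "Red."}))
```

With a Hugging Face dataset, a function like this could be applied through `dataset.map(lambda ex: {"text": format_alpaca(ex)})` to add the `text` column that the tokenization step expects.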

In [None]:
"""
Preprocessing Script for Instruction-Response Datasets

This script tokenizes a dataset using the specified HuggingFace tokenizer.
It expects each example to already contain a formatted 'text' field (prompt
plus response) and applies consistent truncation for training/fine-tuning.
"""

from transformers import AutoTokenizer, PreTrainedTokenizer
from datasets import Dataset
from typing import Dict, Any
import logging

# ------------------ Config ------------------ #
CHECKPOINT = "HuggingFaceTB/SmolLM2-1.7B-Instruct"
MAX_LENGTH = 512
TRUNCATION = True

# ------------------ Logging ------------------ #
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# ------------------ Load Tokenizer ------------------ #
try:
    tokenizer: PreTrainedTokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
    logger.info(f"Tokenizer loaded from checkpoint: {CHECKPOINT}")
except Exception as e:
    logger.error(f"Failed to load tokenizer: {e}")
    raise

# ------------------ Tokenize Function ------------------ #
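def tokenize_function(batch: Dict[str, Any]) -> Dict[str, Any]:
    """
    Tokenizes the 'text' field for a batch of examples.

    Called through `Dataset.map(..., batched=True)`, so each value in
    `batch` is a list with one entry per example.

    Args:
        batch: Mapping of column name to a list of values; must contain 'text'.

    Returns:
        Dictionary with tokenized inputs.
    """
    try:
        return tokenizer(
            batch["text"],
            truncation=TRUNCATION,
            max_length=MAX_LENGTH
        )
    except KeyError:
        # Fail loudly rather than continue without the 'text' column
        logger.error("Dataset has no 'text' column; format examples before tokenizing.")
        raise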

# ------------------ Tokenize Dataset ------------------ #
def tokenize_dataset(dataset: Dataset) -> Dataset:
    """
    Applies tokenization to a dataset with a 'text' field.

    Args:
        dataset: HuggingFace Dataset with 'text' (already formatted).

    Returns:
        Tokenized dataset.
    """
    logger.info("Starting dataset tokenization...")
    tokenized = dataset.map(
        tokenize_function,
        batched=True,
        remove_columns=dataset.column_names  # Drop the raw columns; keep only tokenizer outputs
    )
    logger.info("Tokenization complete.")
    return tokenized

tokenized_dataset = tokenize_dataset(dataset)



### Step 2: Configure LoRA for Efficient Fine-Tuning

LoRA reduces memory usage by training low-rank adapters instead of the full model. This is ideal for fine-tuning SmolLM on consumer-grade hardware.

In [None]:
"""
LoRA Configuration for Causal Language Modeling (e.g., SmolLM, LLaMA, GPT)

This script sets up a Low-Rank Adaptation (LoRA) configuration to apply
efficient fine-tuning on selected attention modules (e.g., q_proj, v_proj).
"""

from peft import LoraConfig, get_peft_model

# ------------------ LoRA Configuration ------------------ #
lora_config = LoraConfig(
    r=16,                      # Low-rank dimension. Higher values = more capacity.
    lora_alpha=32,             # Scaling factor to adjust the impact of LoRA layers.
    target_modules=["q_proj", "v_proj"],  # Apply LoRA to attention projection layers.
    lora_dropout=0.05,         # Dropout for regularization; 0.0 = no dropout.
    bias="none",               # Bias adaptation strategy. Options: "none", "all", "lora_only"
    task_type="CAUSAL_LM"      # Task type. Options: "CAUSAL_LM", "SEQ_2_SEQ_LM", "TOKEN_CLS", etc.
)
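
To make the roles of `r` and `lora_alpha` concrete: LoRA leaves the pre-trained weight `W` frozen and learns two small matrices, `A` (`r × d_in`) and `B` (`d_out × r`), whose product is scaled by `lora_alpha / r` and added to `W`. The toy example below is a plain-Python sketch of that formula with made-up 2 × 2 numbers. It is not how `peft` implements the layer internally, and the helper names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply two small matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w: Matrix, a: Matrix, b: Matrix, r: int, lora_alpha: float) -> Matrix:
    """Return W + (lora_alpha / r) * (B @ A), the weight the adapted layer effectively applies."""
    scale = lora_alpha / r
    delta = matmul(b, a)  # shape: d_out x d_in, same as W
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Toy layer: 2 x 2 frozen weight with a rank-1 adapter (values are made up).
w = [[1.0, 0.0], [0.0, 1.0]]
a = [[1.0, 2.0]]      # A: r x d_in
b = [[0.5], [0.0]]    # B: d_out x r
merged = lora_effective_weight(w, a, b, r=1, lora_alpha=2)
print(merged)
```

With the configuration above (`r=16`, `lora_alpha=32`) the scale factor is likewise 2. Only `A` and `B` receive gradients during training, which is where the memory savings come from.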


### Step 3: Load the Model

Load the pre-trained SmolLM model and apply LoRA:

In [None]:
# ------------------ Load Model and Apply LoRA ------------------ #
import torch
from transformers import AutoModelForCausalLM
try:
    logger.info(f"Loading base model from checkpoint: {CHECKPOINT}")
    model = AutoModelForCausalLM.from_pretrained(
        CHECKPOINT,
        torch_dtype=torch.bfloat16,  # Efficient training/inference on newer GPUs
        device_map="auto"            # Auto-distribution across available GPUs
    )

    logger.info("Applying LoRA configuration...")
    model = get_peft_model(model, lora_config)

    logger.info("LoRA applied. Showing trainable parameters:")
    model.print_trainable_parameters()

except Exception as e:
    logger.exception("Failed to load model or apply LoRA.")
    raise e


trainable params: 3,145,728 || all params: 1,714,522,112 || trainable%: 0.1835
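
The trainable-parameter count printed above can be reproduced by hand. Each adapted linear layer of shape `d_out × d_in` gains two matrices, `A` (`r × d_in`) and `B` (`d_out × r`), so it adds `r * (d_in + d_out)` parameters. The sketch below assumes that SmolLM2-1.7B has 24 decoder layers and square 2048 × 2048 `q_proj` and `v_proj` matrices; these architecture values are assumptions and should be checked against the `config.json` of the checkpoint you use.

```python
def lora_param_count(r: int, d_in: int, d_out: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


# Assumed architecture values for SmolLM2-1.7B (verify in config.json).
HIDDEN_SIZE = 2048
NUM_LAYERS = 24
TARGETS_PER_LAYER = 2   # q_proj and v_proj
RANK = 16               # r from the LoraConfig

trainable = NUM_LAYERS * TARGETS_PER_LAYER * lora_param_count(RANK, HIDDEN_SIZE, HIDDEN_SIZE)
total = 1_714_522_112   # "all params" from the printed summary

print(f"trainable params: {trainable:,}")
print(f"trainable%: {100 * trainable / total:.4f}")
```

Doubling `r` or adding more entries to `target_modules` scales the trainable count proportionally, which is a quick way to budget adapter size before loading the model.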


### Step 4: Set Up Training Arguments

Configure training hyperparameters using Hugging Face’s `transformers` library.

In [None]:
"""
TrainingArguments Configuration for Fine-Tuning SmolLM with LoRA

This configuration is optimized for:
- Mixed precision (bf16, matching the bfloat16 weights loaded in Step 3)
- Gradient accumulation to simulate larger batch sizes
- Regular checkpointing and logging
- AdamW optimizer (Torch native)
"""

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./smollm2_finetuned",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # Effective batch size = 4 x 4 = 16 per device
    num_train_epochs=3,
    learning_rate=2e-4,
    warmup_steps=100,
    logging_steps=100,
    save_steps=500,
    save_total_limit=2,
    fp16=False,
    bf16=True,  # Needs a GPU with bfloat16 support; otherwise load the model in float16 and set fp16=True
    optim="adamw_torch",
    lr_scheduler_type="linear",
    report_to="wandb",  # Log to Weights & Biases (requires a W&B login); use "none" to disable
    run_name="smollm2_lora_finetune",  # Will be used as the W&B run name
    logging_dir="./logs",
    save_strategy="steps",
    disable_tqdm=False,
    ddp_find_unused_parameters=False
)
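
Gradient accumulation multiplies the per-device batch size to give the effective batch size per optimizer step, which in turn sets how many steps an epoch takes. The arithmetic below is a sketch for a single GPU, using the batch settings from the configuration above and the 52,002 examples in the Alpaca train split. The exact step count reported by the `Trainer` may differ slightly depending on how the installed version handles the final partial batch.

```python
import math

per_device_batch_size = 4
gradient_accumulation_steps = 4
num_gpus = 1              # assumption: single-GPU training
num_examples = 52_002     # size of the Alpaca train split
num_epochs = 3
warmup_steps = 100

# Examples consumed per optimizer update.
effective_batch = per_device_batch_size * gradient_accumulation_steps * num_gpus
steps_per_epoch = math.ceil(num_examples / effective_batch)
total_steps = steps_per_epoch * num_epochs

print(f"effective batch size: {effective_batch}")
print(f"approx. optimizer steps per epoch: {steps_per_epoch}")
print(f"approx. total optimizer steps: {total_steps}")
print(f"warmup share of training: {warmup_steps / total_steps:.1%}")
```

Numbers like these help when choosing `save_steps` and `logging_steps`, since both are expressed in optimizer steps rather than examples.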


### Step 5: Train the Model

Use `SFTTrainer` from the `trl` library to perform supervised fine-tuning.

In [None]:
"""
SFTTrainer Setup for Fine-Tuning a LoRA-Enabled Causal Language Model

This configuration uses HuggingFace's TRL (Transformer Reinforcement Learning) library
for supervised fine-tuning (SFT) with LoRA applied to a base model.

Make sure:
- Your dataset is tokenized
- Model and tokenizer are aligned
"""

from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,                          # PEFT model with LoRA applied
    args=training_args,                   # HuggingFace TrainingArguments
    train_dataset=tokenized_dataset,      # Tokenized dataset
    processing_class=tokenizer,          # Same tokenizer used for preprocessing
)

# ------------------ Start Training ------------------ #
trainer.train()




## Save the Fine-Tuned Model

Save the LoRA adapters and merge them with the base model for inference:

In [None]:
from peft import AutoPeftModelForCausalLM

# ------------------ Step 1: Save LoRA adapters ------------------ #
logger.info("Saving LoRA adapters...")
model.save_pretrained("./smollm2_finetuned_lora")
tokenizer.save_pretrained("./smollm2_finetuned_lora")  # Optional, for consistency

# ------------------ Step 2: Load & Merge with base ------------------ #
logger.info("Merging LoRA weights into the base model...")
merged_model = AutoPeftModelForCausalLM.from_pretrained(
    "./smollm2_finetuned_lora",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Merge the adapter weights into the base model and remove the LoRA wrappers
merged_model = merged_model.merge_and_unload()

# ------------------ Step 3: Save merged full model ------------------ #
logger.info("Saving merged model (base + LoRA weights)...")
merged_model.save_pretrained("./smollm2_finetuned_merged")
tokenizer.save_pretrained("./smollm2_finetuned_merged")

logger.info("✅ Model merging and saving complete.")

## Evaluate and Test

Test the fine-tuned model to ensure it performs well on your task.

In [None]:
"""
Inference using the fine-tuned and merged SmolLM model
"""

from transformers import pipeline

# ------------------ Load Model & Tokenizer ------------------ #
MODEL_PATH = "./smollm2_finetuned_merged"

try:
    logger.info("Loading model and tokenizer from: %s", MODEL_PATH)

    generator = pipeline(
        "text-generation",
        model=AutoModelForCausalLM.from_pretrained(MODEL_PATH, torch_dtype=torch.bfloat16, device_map="auto"),
        tokenizer=AutoTokenizer.from_pretrained(MODEL_PATH),
        device_map="auto"
    )

except Exception as e:
    logger.exception("Failed to load model or tokenizer.")
    raise e

# ------------------ Run Inference ------------------ #
prompt = "Write a Python function to calculate factorial."

logger.info("Generating output for prompt: %s", prompt)

output = generator(
    prompt,
    max_new_tokens=100,
    do_sample=True,         # Enable sampling for creativity
    temperature=0.7,        # Lower = more deterministic
    top_k=50,               # Consider top 50 options per step
    top_p=0.95              # Nucleus sampling
)

print("\n--- Generated Output ---\n")
print(output[0]["generated_text"])
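
Because the training text in this guide follows the Alpaca template, the fine-tuned model has mostly seen instructions wrapped in that template, so a bare prompt may not match what it learned. The sketch below wraps an instruction the same way and strips the prompt from the generated text. `build_prompt` and `extract_response` are hypothetical helpers, and the template string is copied from the `text` field of `dataset[0]`.

```python
RESPONSE_MARKER = "### Response:\n"
TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n" + RESPONSE_MARKER
)


def build_prompt(instruction: str) -> str:
    """Wrap a bare instruction in the template used for the training text."""
    return TEMPLATE.format(instruction=instruction)


def extract_response(generated_text: str) -> str:
    """Keep only the text after the response marker; fall back to the full text."""
    _, marker, completion = generated_text.partition(RESPONSE_MARKER)
    return completion.strip() if marker else generated_text.strip()


prompt_text = build_prompt("Write a Python function to calculate factorial.")
print(prompt_text)
```

With the pipeline from the cell above, this would be used as `output = generator(build_prompt(prompt), max_new_tokens=100)` followed by `extract_response(output[0]["generated_text"])`.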


## Best Practices

To maximize fine-tuning success, follow these guidelines:

| **Practice**              | **Description**                                                                 |
|---------------------------|---------------------------------------------------------------------------------|
| **High-Quality Data**      | Use clean, relevant, and diverse data to avoid overfitting or bias.             |
| **Small Learning Rate**    | Use 1e-5 to 2e-4 to prevent catastrophic forgetting of pre-trained knowledge.   |
| **Gradient Accumulation**  | Increase effective batch size on low-memory GPUs.                               |
| **Regular Checkpoints**    | Save frequently to recover from training interruptions.                         |
| **LoRA for Efficiency**    | Prefer LoRA over full fine-tuning to save memory and speed up training.         |
| **Quantization**           | Use 8-bit quantization (e.g., bitsandbytes) for further memory savings.         |
| **Evaluation Metrics**     | Use task-specific metrics (e.g., BLEU for text generation, accuracy for classification). |
| **Monitor Overfitting**    | Split data into train/validation sets and monitor validation loss.              |
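
To monitor overfitting as the table suggests, part of the data has to be held out before training. The sketch below splits a list of records deterministically in plain Python; the function name `train_validation_split`, the 10% validation share, and the seed are illustrative choices. For a Hugging Face `Dataset`, the built-in `Dataset.train_test_split(test_size=0.1, seed=42)` serves the same purpose.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def train_validation_split(
    records: Sequence[T], validation_fraction: float = 0.1, seed: int = 42
) -> Tuple[List[T], List[T]]:
    """Shuffle a copy of the records and hold out a fraction for validation."""
    if not 0.0 < validation_fraction < 1.0:
        raise ValueError("validation_fraction must be between 0 and 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is reproducible
    n_val = max(1, int(len(shuffled) * validation_fraction))
    return shuffled[n_val:], shuffled[:n_val]


train, val = train_validation_split(list(range(100)))
print(len(train), len(val))
```

The held-out split would then be passed to the trainer as `eval_dataset`, with evaluation enabled in the training arguments so that validation loss is logged alongside training loss.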

---

## Common Challenges and Solutions

| **Challenge**            | **Solution**                                                                  |
|--------------------------|--------------------------------------------------------------------------------|
| **Out-of-Memory Errors**  | Reduce batch size, use gradient accumulation, or enable 8-bit quantization.    |
| **Poor Performance**      | Increase dataset size/quality, adjust learning rate, or extend training epochs.|
| **Slow Training**         | Use faster GPUs, mixed-precision training, or reduce model/dataset size.       |


## Load Quantized Model with bitsandbytes

In [None]:
!pip install transformers accelerate bitsandbytes

### Load the Quantized Model (4-bit or 8-bit)
Pass a `BitsAndBytesConfig` with `load_in_4bit=True` or `load_in_8bit=True` to `AutoModelForCausalLM.from_pretrained(...)` through the `quantization_config` argument.

In [None]:
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# Path to your merged model
MODEL_PATH = "./smollm2_finetuned_merged"

# Optional: Configuration for quantization (4-bit example)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # or use `load_in_8bit=True`
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # Could also use torch.float16
    bnb_4bit_quant_type="nf4",              # Recommended: 'nf4' or 'fp4'
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)

# Load quantized model
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    quantization_config=bnb_config,
    device_map="auto"
)

# Create text generation pipeline
generator = pipeline("text-generation", model=model, tokenizer=tokenizer, device_map="auto")

# Prompt
prompt = "Write a Python function to calculate factorial."

# Run inference (do_sample=True so that temperature/top_k/top_p take effect)
output = generator(prompt, max_new_tokens=100, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)

# Output
print("\n--- Generated Output ---\n")
print(output[0]["generated_text"])


## Export Hugging Face Model to ONNX

In [None]:
!pip install transformers onnx onnxruntime optimum

### Use optimum.exporters.onnx to Export

Here’s how to export your **merged LoRA model** (`AutoModelForCausalLM`) to ONNX:


In [None]:
from optimum.exporters.onnx import main_export
from pathlib import Path

# Paths
MODEL_DIR = "./smollm2_finetuned_merged"
ONNX_EXPORT_DIR = Path("./onnx_export")

# Export using the `main_export` utility.
# It takes the model path (not a loaded model object), loads the model itself,
# and is expected to copy the tokenizer files into the output directory.
main_export(
    model_name_or_path=MODEL_DIR,
    output=ONNX_EXPORT_DIR,
    task="text-generation",  # Use "text-generation-with-past" to export with KV-cache support
    opset=17,  # Choose opset based on compatibility (>=13 usually OK)
    # optimize="O2",  # Optional graph optimization level, given as a string ("O1" to "O4")
)

print(f"✅ Model exported to {ONNX_EXPORT_DIR.resolve()}")


## Run ONNX Inference

In [None]:
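import numpy as np
import onnxruntime
from transformers import AutoTokenizer

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("./onnx_export")

# Prepare input
inputs = tokenizer("Write a Python function to calculate factorial.", return_tensors="np")

# Run ONNX inference
session = onnxruntime.InferenceSession("./onnx_export/model.onnx")

# Feed only the inputs that the exported graph declares
expected = {i.name for i in session.get_inputs()}
onnx_inputs = {k: v.astype(np.int64) for k, v in inputs.items() if k in expected}

# Some exports also declare `position_ids`; build them if the graph asks for them
if "position_ids" in expected:
    seq_len = onnx_inputs["input_ids"].shape[1]
    onnx_inputs["position_ids"] = np.arange(seq_len, dtype=np.int64)[None, :]

outputs = session.run(None, onnx_inputs)
print("✅ ONNX model ran inference successfully. First output shape:", outputs[0].shape)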
