# Hands-On Exercises: Fine-Tuning SmolLM3

Welcome to the practical section! Here you'll apply everything you've learned about chat templates and supervised fine-tuning using SmolLM3. These exercises progress from basic concepts to advanced techniques, giving you real-world experience with instruction tuning.


## Learning Objectives

By completing these exercises, you will:
- Master SmolLM3's chat template system
- Fine-tune SmolLM3 on real datasets using both Python APIs and CLI tools
- Work with the SmolTalk2 dataset that was used to train the original model
- Compare base model vs fine-tuned model performance
- Deploy your models to Hugging Face Hub
- Understand production workflows for scaling fine-tuning

---

## Exercise 1: Exploring SmolLM3's Chat Templates

**Objective**: Understand how SmolLM3 handles different conversation formats and reasoning modes.

SmolLM3 is a hybrid reasoning model: it can either answer an instruction directly or first generate tokens that 'reason' through a complex problem. When post-trained effectively, the model reasons on hard problems and responds directly to easy ones.
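In the conversations used throughout this notebook, the mode is selected by placing a `/think` or `/no_think` flag in the system message. The helper below is a minimal sketch of that convention; `build_messages` is a hypothetical name introduced here for illustration, and appending the flag to the end of the system prompt is an assumption based on those conversations.

```python
def build_messages(user_prompt, think=False, system_prompt=""):
    """Build a chat message list with a reasoning-mode flag in the system turn.

    Hypothetical helper for illustration only; not part of any library.
    """
    flag = "/think" if think else "/no_think"
    # Append the flag to any existing system prompt, separated by a space.
    system_content = f"{system_prompt} {flag}".strip()
    return [
        {"role": "system", "content": system_content},
        {"role": "user", "content": user_prompt},
    ]


direct = build_messages("What is machine learning?")
reasoning = build_messages(
    "What is 15 x 24?", think=True, system_prompt="You are a math tutor."
)
print(direct[0]["content"])
print(reasoning[0]["content"])
```

The resulting list can be passed to `apply_chat_template` like any other conversation.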

### Environment Setup

Let's start by setting up our environment.


In [1]:
# Install required packages (run in Colab or your environment)
#!pip install -qqq "transformers>=4.55.0" "trl>=0.22.1" "datasets" "torch"
#!pip install -qqq "accelerate" "peft" "trackio" "huggingface_hub"

In [3]:
# Import necessary libraries
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset

if torch.cuda.is_available():
    device = "cuda"
    print(f"Using CUDA GPU: {torch.cuda.get_device_name()}")
    print(f"GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f}GB")
elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
    device = "mps"
    print("Using Apple MPS")
else:
    device = "cpu"
    print("Using CPU - you will need to use a GPU to train models")

# Authenticate with Hugging Face (optional, for private models)
from huggingface_hub import login
# login()  # Uncomment if you need to access private models


Using Apple MPS


### Load SmolLM3 Models

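Now let's load the base and instruct models for comparison. To keep the notebook runnable on modest hardware, the cell below loads the SmolLM2-135M checkpoints rather than SmolLM3.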


In [4]:
# Load both base and instruct models for comparison
base_model_name = "HuggingFaceTB/SmolLM2-135M"
instruct_model_name = "HuggingFaceTB/SmolLM2-135M-Instruct"

# Load tokenizers to edit the chat templates
base_tokenizer = AutoTokenizer.from_pretrained(base_model_name)
instruct_tokenizer = AutoTokenizer.from_pretrained(instruct_model_name)

# Load models in half precision (float16) to reduce memory use
# Open question: would 8-bit loading also work here?

base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name, dtype=torch.float16, device_map="auto"
)

instruct_model = AutoModelForCausalLM.from_pretrained(
    instruct_model_name, dtype=torch.float16, device_map="auto"
)

print("Models loaded successfully!")


Models loaded successfully!


### Explore Chat Template Formatting

Now let's explore the chat template formatting. We will create different types of conversations to test.


In [4]:
# Create different types of conversations to test
conversations = {
    "simple_qa": [
        {"role": "system", "content": "/no_think"},
        {"role": "user", "content": "What is machine learning?"},
    ],
    "with_system": [
        {
            "role": "system",
            "content": "You are a helpful AI assistant specialized in explaining technical concepts clearly. /no_think",
        },
        {"role": "user", "content": "What is machine learning?"},
    ],
    "multi_turn": [
        {"role": "system", "content": "You are a math tutor. /no_think"},
        {"role": "user", "content": "What is calculus?"},
        {
            "role": "assistant",
            "content": "Calculus is a branch of mathematics that deals with rates of change and accumulation of quantities.",
        },
        {"role": "user", "content": "Can you give me a simple example?"},
    ],
    "reasoning_task": [
        {"role": "system", "content": "/think"},
        {
            "role": "user",
            "content": "Solve step by step: If a train travels 120 miles in 2 hours, what is its average speed?",
        },
    ],
}

for conv_type, messages in conversations.items():
    print(f"--- {conv_type.upper()} ---")

    # Format without generation prompt (for completed conversations)
    formatted_complete = instruct_tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=False
    )

    # Format with generation prompt (for inference)
    formatted_prompt = instruct_tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )

    print("Complete conversation format:")
    print(formatted_complete)
    print("\nWith generation prompt:")
    print(formatted_prompt)
    print("\n" + "=" * 50 + "\n")


--- SIMPLE_QA ---
Complete conversation format:
<|im_start|>system
/no_think<|im_end|>
<|im_start|>user
What is machine learning?<|im_end|>


With generation prompt:
<|im_start|>system
/no_think<|im_end|>
<|im_start|>user
What is machine learning?<|im_end|>
<|im_start|>assistant



--- WITH_SYSTEM ---
Complete conversation format:
<|im_start|>system
You are a helpful AI assistant specialized in explaining technical concepts clearly. /no_think<|im_end|>
<|im_start|>user
What is machine learning?<|im_end|>


With generation prompt:
<|im_start|>system
You are a helpful AI assistant specialized in explaining technical concepts clearly. /no_think<|im_end|>
<|im_start|>user
What is machine learning?<|im_end|>
<|im_start|>assistant



--- MULTI_TURN ---
Complete conversation format:
<|im_start|>system
You are a math tutor. /no_think<|im_end|>
<|im_start|>user
What is calculus?<|im_end|>
<|im_start|>assistant
Calculus is a branch of mathematics that deals with rates of change and accumulation 
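The layout printed above can be reproduced with a few lines of plain Python, which makes the effect of `add_generation_prompt` easier to see. This is a simplified sketch only: `render_chatml` is a hypothetical function, and the real template is a Jinja template stored on the tokenizer that may do more (for example, insert a default system message when none is given).

```python
def render_chatml(messages, add_generation_prompt=False):
    """Render messages in a ChatML-style layout (simplified sketch)."""
    parts = []
    for message in messages:
        # Each turn is wrapped in <|im_start|>role ... <|im_end|> markers.
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "/no_think"},
    {"role": "user", "content": "What is machine learning?"},
]
complete = render_chatml(messages)
prompt = render_chatml(messages, add_generation_prompt=True)
print(prompt)
```

The only difference between the two strings is the trailing open assistant turn, which is what tells the model that it is its turn to respond.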

**Step 4: Compare Base vs Instruct Model Responses**


In [5]:
# Test the same prompt on both models
test_prompt = "Explain quantum computing in simple terms."

# Prepare the prompt for base model (no chat template)
base_inputs = base_tokenizer(test_prompt, return_tensors="pt").to(device)

# Prepare the prompt for instruct model (with chat template)
instruct_messages = [
    {"role": "system", "content": "/no_think"},
    {"role": "user", "content": test_prompt}
]
instruct_formatted = instruct_tokenizer.apply_chat_template(
    instruct_messages, tokenize=False, add_generation_prompt=True
)
instruct_inputs = instruct_tokenizer(instruct_formatted, return_tensors="pt").to(device)

# Generate responses
print("=== Model comparison ===\n")

print("🤖 BASE MODEL RESPONSE:")
with torch.no_grad():
    base_outputs = base_model.generate(
        **base_inputs,
        max_new_tokens=150,
        temperature=0.7,
        do_sample=True,
        pad_token_id=base_tokenizer.eos_token_id,
    )
    base_response = base_tokenizer.decode(base_outputs[0], skip_special_tokens=True)
    print(base_response[len(test_prompt) :])  # Show only the generated part

print("\n" + "=" * 50)
print("Instruct model response:")
with torch.no_grad():
    instruct_outputs = instruct_model.generate(
        **instruct_inputs,
        max_new_tokens=150,
        temperature=0.7,
        do_sample=True,
        pad_token_id=instruct_tokenizer.eos_token_id,
    )
    # Decode only the newly generated tokens. With skip_special_tokens=True the
    # "<|im_start|>assistant" marker is stripped from the decoded text, so
    # searching for it with str.find() cannot locate the reply.
    prompt_length = instruct_inputs["input_ids"].shape[1]
    assistant_response = instruct_tokenizer.decode(
        instruct_outputs[0][prompt_length:], skip_special_tokens=True
    )
    print(assistant_response)


=== Model comparison ===

🤖 BASE MODEL RESPONSE:
 At the moment, quantum computers are still in the lab, and, aside from quantum-mechanical simulations of particle interactions, there are no practical applications for them. The best we can do at present is to simulate the properties of quantum particles, and that's not a good enough job to actually make a quantum computer. You can't just design quantum systems and see what happens, since you're going to have to do all the calculations in terms of probability and not in terms of things like atoms.

When you start making actual computers, you need to do the calculations in terms of particles, because that's the best way to actually get something out. What you can do is build a computer in such a way that it's easier to simulate

Instruct model response:
Quantum computing is a field of computer science that leverages the principles of quantum mechanics to simulate and analyze quantu
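When a decoded string still contains the ChatML markers (that is, it was decoded with `skip_special_tokens=False`), the final assistant turn can be pulled out with plain string operations. The sketch below is one way to do that; `extract_last_assistant_reply` is a hypothetical helper and the sample string is made up for illustration. It checks that the marker was actually found, because `str.find` returns -1 for a missing marker and slicing from that offset silently returns the wrong text.

```python
ASSISTANT_TAG = "<|im_start|>assistant\n"
END_TAG = "<|im_end|>"


def extract_last_assistant_reply(decoded):
    """Return the text of the final assistant turn, or None if there is none."""
    # rpartition splits on the LAST occurrence of the tag.
    _, tag, tail = decoded.rpartition(ASSISTANT_TAG)
    if not tag:
        return None
    # Drop the end-of-turn marker and anything after it.
    return tail.split(END_TAG, 1)[0].strip()


decoded = (
    "<|im_start|>user\nExplain quantum computing in simple terms.<|im_end|>\n"
    "<|im_start|>assistant\nQuantum computing uses qubits.<|im_end|>"
)
print(extract_last_assistant_reply(decoded))
```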

**Step 5: Test Dual-Mode Reasoning**

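Note: dual-mode reasoning is a SmolLM3 feature. The SmolLM2-135M-Instruct model loaded above is not expected to respond to the `/think` and `/no_think` flags, so the outputs below mainly illustrate the prompt format.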


In [9]:
# Test SmolLM3's reasoning capabilities
reasoning_prompts = [
    "What is 15 × 24? Show your work.",
    "A recipe calls for 2 cups of flour for 12 cookies. How much flour is needed for 30 cookies?",
    "If I have $50 and spend $18.75 on lunch and $12.30 on a book, how much money do I have left?",
]

thinking_prompts = [
    "/no_think",
    "/think"
]

print("=== TESTING REASONING CAPABILITIES ===\n")

for thinking_prompt in thinking_prompts:
    print(f"Thinking prompt: {thinking_prompt}")
    for i, prompt in enumerate(reasoning_prompts, 1):
        print(f"Problem {i}: {prompt}")

        messages = [
            {"role":"system", "content": thinking_prompt},
            {"role": "user", "content": prompt}
        ]
        formatted_prompt = instruct_tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )

        print(f"Formatted prompt: {formatted_prompt}")
        
        inputs = instruct_tokenizer(formatted_prompt, return_tensors="pt").to(device)

        with torch.no_grad():
            outputs = instruct_model.generate(
                **inputs,
                max_new_tokens=200,
                temperature=0.3,  # Lower temperature for more consistent reasoning
                do_sample=True,
                pad_token_id=instruct_tokenizer.eos_token_id,
            )
            # Decode only the newly generated tokens; the chat markers are
            # removed by skip_special_tokens=True, so they cannot be searched for.
            prompt_length = inputs["input_ids"].shape[1]
            assistant_response = instruct_tokenizer.decode(
                outputs[0][prompt_length:], skip_special_tokens=True
            )
            print(f"Answer: {assistant_response}")

        print("\n" + "-" * 50 + "\n")


=== TESTING REASONING CAPABILITIES ===

Thinking prompt: /no_think
Problem 1: What is 15 × 24? Show your work.
Formatted prompt: <|im_start|>system
/no_think<|im_end|>
<|im_start|>user
What is 15 × 24? Show your work.<|im_end|>
<|im_start|>assistant

Answer: To solve this problem, we can use the distributive property of multiplication over addition.

First, we multiply the numbers: 15 × 24 = 312. Then, we multiply the result by the number 24: 312 × 24 = 7680.

Now, we add the results: 312 + 7680 = 10800.

Therefore, 15 × 24 is equal to 10800.

--------------------------------------------------

Problem 2: A recipe calls for 2 cups of flour for 12 cookies. How much flour is needed for 30 cookies?
Formatted prompt: <|im_start|>system
/no_think<|im_end|>
<|im_start|>user
A recipe calls for 2 cups of flour for 12 cookies. How much flour is needed for 30 cookies?<|im_end|>
<|im_start|>assistant
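The sampled answer above for 15 × 24 is incorrect, so it helps to have reference values at hand when reading the model's responses. The snippet below computes the expected results for the three prompts using only Python's standard library; the variable names are chosen here for illustration.

```python
from decimal import Decimal
from fractions import Fraction

# Problem 1: 15 x 24
product = 15 * 24

# Problem 2: 2 cups of flour per 12 cookies, scaled to 30 cookies
flour_cups = Fraction(2, 12) * 30

# Problem 3: $50 minus $18.75 and $12.30 (Decimal avoids float rounding)
money_left = Decimal("50") - Decimal("18.75") - Decimal("12.30")

print(product)      # 360
print(flour_cups)   # 5
print(money_left)   # 18.95
```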

### Validation

Run the code above and verify that you can see:
1. Different chat template formats for various conversation types
2. Clear differences between base model and instruct model responses
3. SmolLM3's reasoning capabilities in action

### Extension challenges

1. **Multilingual Testing**: Test SmolLM3's multilingual capabilities by asking questions in French, Spanish, or German
2. **Long Context**: Create a very long conversation and test the extended context capabilities
3. **Custom System Prompts**: Experiment with different system messages to change the model's behavior

---

## Exercise 2: Dataset Processing for SFT

**Objective**: Learn to process and prepare datasets for supervised fine-tuning using SmolTalk2 and other datasets.

**Prerequisites**: Completed Exercise 1, understanding of Python data processing.

In this exercise, I use a smaller model and only a few training samples so that everything runs locally. Afterwards, we'll create a Hugging Face job to train the actual model.

### Implementation

**Step 1: Explore the SmolTalk2 Dataset**


In [11]:
# Load and explore the SmolTalk2 dataset
print("=== EXPLORING SMOLTALK2 DATASET ===\n")

# Load the SFT subset with streaming to avoid loading everything
dataset_dict = load_dataset("HuggingFaceTB/smoltalk2", "SFT", streaming=True)
print("Dataset loaded in streaming mode for efficient exploration")
print(f"Available splits: {list(dataset_dict.keys())}")

# Convert to regular dataset with only a small sample for exploration
print("\nLoading small samples from each split for exploration...")
dataset_dict_sample = {}
for split_name in list(dataset_dict.keys())[:5]:  # Only load first 5 splits as example
    dataset_dict_sample[split_name] = list(dataset_dict[split_name].take(10))  # Only 10 examples per split
    print(f"  - {split_name}: loaded 10 examples")
print("\nSample dataset ready for local exploration!")


=== EXPLORING SMOLTALK2 DATASET ===



Dataset loaded in streaming mode for efficient exploration
Available splits: ['LongAlign_64k_Qwen3_32B_yarn_131k_think', 'OpenThoughts3_1.2M_think', 'aya_dataset_Qwen3_32B_think', 'multi_turn_reasoning_if_think', 's1k_1.1_think', 'smolagents_toolcalling_traces_think', 'smoltalk_everyday_convs_reasoning_Qwen3_32B_think', 'smoltalk_multilingual8_Qwen3_32B_think', 'smoltalk_systemchats_Qwen3_32B_think', 'table_gpt_Qwen3_32B_think', 'LongAlign_64k_context_lang_annotated_lang_6_no_think', 'Mixture_of_Thoughts_science_no_think', 'OpenHermes_2.5_no_think', 'OpenThoughts3_1.2M_no_think_no_think', 'hermes_function_calling_v1_no_think', 'smoltalk_multilingual_8languages_lang_5_no_think', 'smoltalk_smollm3_everyday_conversations_no_think', 'smoltalk_smollm3_explore_instruct_rewriting_no_think', 'smoltalk_smollm3_smol_magpie_ultra_no_think', 'smoltalk_smollm3_smol_rewrite_no_think', 'smoltalk_smollm3_smol_summarize_no_think', 'smoltalk_smollm3_systemchats_30k_no_think', 'table_gpt_no_think', 'tulu
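The split names printed above end in either `_think` or `_no_think`, which suggests a simple way to group them by reasoning mode. The sketch below uses a handful of names copied from that list; treating the suffix as a reliable indicator is an assumption based on the names alone, and `reasoning_mode` is a hypothetical helper.

```python
def reasoning_mode(split_name):
    """Classify a split name by its suffix."""
    # Check "_no_think" first: a name ending in "_no_think" also ends in "_think".
    if split_name.endswith("_no_think"):
        return "no_think"
    if split_name.endswith("_think"):
        return "think"
    return "unknown"


split_names = [
    "OpenThoughts3_1.2M_think",
    "s1k_1.1_think",
    "OpenHermes_2.5_no_think",
    "table_gpt_no_think",
]

by_mode = {}
for name in split_names:
    by_mode.setdefault(reasoning_mode(name), []).append(name)

print(by_mode)
```

With the streamed dataset, the same function could be applied to `dataset_dict.keys()` instead of the hard-coded list.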

In [12]:
dataset_dict_sample['LongAlign_64k_Qwen3_32B_yarn_131k_think'][0]

   'role': 'user'},
  {'content': '<think>\nOkay, let me try to figure out the answer to this question. The user is asking: "Since what year has SAS been giving customers around the world THE POWER TO KNOW?" They provided a document that\'s part of the SAS Deployment Wizard and Manager 9.4 User\'s Guide.\n\nFirst, I\'ll need to scan through the document to find any mention of the year when SAS started providing this service. The question is about the history of SAS, specifically the year they began offering their analytics solutions globally. \n\nLooking at the document, the main content is about deployment processes, command line options, and administrative tasks for SAS 9.4. However, the user is asking about the founding year or when they started their global service. \n\nI remember that the end of the document has a section about SAS being the leader in business analytics. Let me check the last page. Here\'s the text: "SAS is the leader in business analytics software and services, a

In [13]:
# Function to process different dataset formats
examples = dataset_dict_sample['LongAlign_64k_Qwen3_32B_yarn_131k_think']

def process_qa_dataset(examples, question_col, answer_col):
    """Process Q&A datasets into chat format"""
    processed = []

    for question, answer in zip(examples[question_col], examples[answer_col]):
        messages = [
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer},
        ]
        processed.append(messages)

    return {"messages": processed}


def process_instruction_dataset(examples):
    """Process instruction-following datasets"""
    processed = []

    for instruction, response in zip(examples["instruction"], examples["response"]):
        messages = [
            {"role": "user", "content": instruction},
            {"role": "assistant", "content": response},
        ]
        processed.append(messages)

    return {"messages": processed}


# Example: Process GSM8K math dataset
print("=== PROCESSING GSM8K DATASET ===\n")

gsm8k = load_dataset(
    "openai/gsm8k", "main", split="train[:100]"
)  # Small subset for demo
print(f"Original GSM8K example: {gsm8k[0]}")


# Convert to chat format
def process_gsm8k(examples):
    processed = []
    for question, answer in zip(examples["question"], examples["answer"]):
        messages = [
            {
                "role": "system",
                "content": "You are a math tutor. Solve problems step by step.",
            },
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer},
        ]
        processed.append(messages)
    return {"messages": processed}


gsm8k_processed = gsm8k.map(
    process_gsm8k, batched=True, remove_columns=gsm8k.column_names
)
print(f"Processed example: {gsm8k_processed[0]}")


=== PROCESSING GSM8K DATASET ===

Original GSM8K example: {'question': 'Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?', 'answer': 'Natalia sold 48/2 = <<48/2=24>>24 clips in May.\nNatalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n#### 72'}
Processed example: {'messages': [{'content': 'You are a math tutor. Solve problems step by step.', 'role': 'system'}, {'content': 'Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?', 'role': 'user'}, {'content': 'Natalia sold 48/2 = <<48/2=24>>24 clips in May.\nNatalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n#### 72', 'role': 'assistant'}]}
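The GSM8K answer shown above mixes three things: the worked rationale, calculator annotations in `<<...>>`, and the final answer after `####`. Depending on the training goal, it can be useful to separate them before building the assistant message. The following is a minimal sketch under that reading of the format; `split_gsm8k_answer` is a hypothetical helper, not something the dataset or any library provides.

```python
import re

# Matches calculator annotations such as <<48/2=24>>.
CALC_ANNOTATION = re.compile(r"<<[^<>]*>>")


def split_gsm8k_answer(answer):
    """Split a GSM8K answer into (rationale, final_answer)."""
    rationale, marker, final = answer.rpartition("####")
    if not marker:
        # No final-answer marker: treat the whole text as rationale.
        rationale, final = answer, ""
    rationale = CALC_ANNOTATION.sub("", rationale).strip()
    return rationale, final.strip()


example = (
    "Natalia sold 48/2 = <<48/2=24>>24 clips in May.\n"
    "Natalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n"
    "#### 72"
)
rationale, final_answer = split_gsm8k_answer(example)
print(rationale)
print("Final answer:", final_answer)
```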


In [14]:
# Function to apply chat templates to processed datasets
def apply_chat_template_to_dataset(dataset, tokenizer):
    """Apply chat template to dataset for training"""

    def format_messages(examples):
        formatted_texts = []

        for messages in examples["messages"]:
            # Apply chat template
            formatted_text = tokenizer.apply_chat_template(
                messages,
                tokenize=False,
                add_generation_prompt=False,  # We want the complete conversation
            )
            formatted_texts.append(formatted_text)

        return {"text": formatted_texts}

    return dataset.map(format_messages, batched=True)


# Apply to our processed GSM8K dataset
gsm8k_formatted = apply_chat_template_to_dataset(gsm8k_processed, instruct_tokenizer)
print("=== FORMATTED TRAINING DATA ===")
print(gsm8k_formatted[0]["text"])


=== FORMATTED TRAINING DATA ===
<|im_start|>system
You are a math tutor. Solve problems step by step.<|im_end|>
<|im_start|>user
Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?<|im_end|>
<|im_start|>assistant
Natalia sold 48/2 = <<48/2=24>>24 clips in May.
Natalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.
#### 72<|im_end|>
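Before training, it can be worth checking that each formatted example is structurally sound: every opened turn is closed, and the text ends with a completed assistant turn rather than an open generation prompt. The sketch below does this with a regular expression over the ChatML-style markers seen above; `parse_chatml` and `check_training_text` are hypothetical helpers, and the sample text is made up for illustration.

```python
import re

# One complete turn: <|im_start|>role\ncontent<|im_end|>
TURN_PATTERN = re.compile(r"<\|im_start\|>(\w+)\n(.*?)<\|im_end\|>", re.DOTALL)


def parse_chatml(text):
    """Return (role, content) pairs for every complete turn in the text."""
    return [(m.group(1), m.group(2)) for m in TURN_PATTERN.finditer(text)]


def check_training_text(text):
    """True if all turns are closed and the last turn is from the assistant."""
    turns = parse_chatml(text)
    roles = [role for role, _ in turns]
    # Every opened turn should be closed and matched by the pattern.
    balanced = text.count("<|im_start|>") == text.count("<|im_end|>") == len(turns)
    ends_with_assistant = bool(roles) and roles[-1] == "assistant"
    return balanced and ends_with_assistant


sample = (
    "<|im_start|>system\nYou are a math tutor.<|im_end|>\n"
    "<|im_start|>user\nWhat is 2 + 2?<|im_end|>\n"
    "<|im_start|>assistant\n4<|im_end|>\n"
)
print(check_training_text(sample))
print(check_training_text(sample + "<|im_start|>assistant\n"))
```

The second call fails the check because of the trailing open assistant turn, which is what `add_generation_prompt=True` would add.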



---

## Exercise 3: Fine-Tuning SmolLM3 with SFTTrainer

**Objective**: Perform supervised fine-tuning on SmolLM3 using TRL's SFTTrainer with real datasets.

**Prerequisites**: Completed Exercise 2, GPU with at least 8GB VRAM (or Google Colab Pro).

### Implementation

**Step 1: Setup and Model Loading**


In [5]:
# Import required libraries for fine-tuning
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset
import torch

# SmolLM3 itself would be loaded with: model_name = "HuggingFaceTB/SmolLM3-3B"
# To train locally, we reuse the smaller SmolLM2-135M models from the previous exercises
new_model_name = "SmolLM2-135M-Custom-SFT"
# Base model to fine-tune; the instruct model provides the tokenizer and chat template
base_model_name = "HuggingFaceTB/SmolLM2-135M"
instruct_model_name = "HuggingFaceTB/SmolLM2-135M-Instruct"

tokenizer = AutoTokenizer.from_pretrained(instruct_model_name)

print(f"Loading {base_model_name}...")
model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    dtype=torch.float16,  # Use float16 for memory efficiency
    device_map="auto",
    trust_remote_code=True,
)

tokenizer.pad_token = tokenizer.eos_token  # Reuse EOS as the padding token
tokenizer.padding_side = "right"  # Pad on the right for training
tokenizer.truncation_side = "left"  # Truncate from the left, keeping the end of long examples
model.config.pad_token_id = tokenizer.eos_token_id

print(f"Model loaded! Parameters: {model.num_parameters():,}")

Loading HuggingFaceTB/SmolLM2-135M...
Model loaded! Parameters: 134,515,008
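The parameter count printed above gives a quick estimate of how much memory the weights alone occupy in a given precision. The sketch below covers weights only; training also needs memory for gradients, optimizer state, and activations, which it does not attempt to estimate. `weight_memory_gb` is a hypothetical helper.

```python
# Bytes used to store one parameter in each precision.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}


def weight_memory_gb(num_parameters, dtype="float16"):
    """Approximate memory for the model weights alone, in gigabytes (1e9 bytes)."""
    return num_parameters * BYTES_PER_PARAM[dtype] / 1e9


num_parameters = 134_515_008  # value printed by model.num_parameters() above
print(f"float16: {weight_memory_gb(num_parameters, 'float16'):.2f} GB")
print(f"float32: {weight_memory_gb(num_parameters, 'float32'):.2f} GB")
```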


**Step 2: Dataset Preparation**


In [6]:
# Option 1: Use SmolTalk2 with streaming (recommended for beginners)
dataset = load_dataset("HuggingFaceTB/smoltalk2", "SFT", streaming=True)
train_dataset = list(dataset["smoltalk_smollm3_smol_summarize_no_think"].take(1000))


In [7]:
from datasets import Dataset

# Load and prepare training dataset
print("=== PREPARING DATASET ===\n")


print(f"Training examples: {len(train_dataset)}")
print(f"Example: {train_dataset[0]}")

# Prepare the dataset for SFT
def format_chat_template(example):
    """Format the messages using the chat template with custom instructions as system message"""
    if "messages" in example:
        # SmolTalk2 format - extract custom instructions
        messages = example["messages"].copy()
        
        # Get custom instructions from chat_template_kwargs (always present)
        custom_instructions = example["chat_template_kwargs"]["custom_instructions"]
        
        # Add system message with custom instructions at the beginning
        system_message = {
            "role": "system", 
            "content": custom_instructions
        }
        # Insert system message at the beginning
        messages.insert(0, system_message)
    
    else:
        # Custom format - adapt as needed
        messages = [
            {"role": "system", "content": "You are a helpful AI assistant."},
            {"role": "user", "content": example["instruction"]},
            {"role": "assistant", "content": example["response"]}
        ]
    
    # Apply chat template
    text = tokenizer.apply_chat_template(
        messages, 
        tokenize=False,
        add_generation_prompt=False
    )
    return {"text": text}

# Apply formatting to the list
formatted_dataset = []
for example in train_dataset:
    formatted_example = format_chat_template(example)
    formatted_dataset.append(formatted_example)

# Convert back to dataset format for training
formatted_dataset = Dataset.from_list(formatted_dataset)

print(f"Formatted example: {formatted_dataset[0]['text']}")

=== PREPARING DATASET ===

Training examples: 1000
Example: {'messages': [{'content': 'Hi Michael,\n\nI hope you\'re doing well! I wanted to follow up on our conversation from the Math Educators Forum about collaborating on decimal operation worksheets. I\'m excited to work together and combine our strengths to create something great for our students.\n\nI was thinking we could meet in person to discuss our plans and goals for the project. I live in Oakville and you\'re in Pinecrest, right? There\'s a great coffee shop called "The Bean Counter" that\'s about halfway between us. Would you be available to meet there next Saturday, March 14th, at 10 AM? Let me know if that works for you or if you have any other suggestions.\n\nLooking forward to working together!\n\nBest,\nSarah', 'role': 'user'}, {'content': 'Sarah is proposing a meeting at "The Bean Counter" in Oakville on March 14th at 10 AM to collaborate on decimal operation worksheets.', 'role': 'assistant'}], 'chat_template_kwargs'

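Debugging note: the next two cells are sanity checks on the tokenizer and chat template, added while investigating a training loss of 0.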

In [8]:
print("chat_template set:", tokenizer.chat_template is not None)
print("pad_token_id:", tokenizer.pad_token_id, "eos_token_id:", tokenizer.eos_token_id)
print("truncation_side:", getattr(tokenizer, "truncation_side", "unset"))
print("padding_side:", tokenizer.padding_side)
print("assistant tag present in example:",
      "<|im_start|>assistant\n" in formatted_dataset[0]["text"])

chat_template set: True
pad_token_id: 2 eos_token_id: 2
truncation_side: left
padding_side: right
assistant tag present in example: True


In [9]:
def debug_chat_template(messages, tokenizer):
    """Debug chat template application"""
    
    # Apply template
    formatted = tokenizer.apply_chat_template(
        messages, 
        tokenize=False, 
        add_generation_prompt=True
    )
    
    # Tokenize and decode to see actual tokens
    tokens = tokenizer(formatted, return_tensors="pt")
    
    print("=== TEMPLATE DEBUG ===")
    print(f"Input messages: {len(messages)}")
    print(f"Formatted length: {len(formatted)} chars")
    print(f"Token count: {tokens['input_ids'].shape[1]}")
    print("\nFormatted text:")
    print(repr(formatted))  # Shows escape characters
    print("\nTokens:")
    print(tokens['input_ids'][0].tolist()[:20], "...")  # First 20 tokens
    print("\nDecoded tokens:")
    for i, token_id in enumerate(tokens['input_ids'][0][:20]):
        token = tokenizer.decode([token_id])
        print(f"{i:2d}: {token_id:5d} -> {repr(token)}")

# Example usage
debug_messages = [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi there!"}
]

debug_chat_template(debug_messages, tokenizer)

=== TEMPLATE DEBUG ===
Input messages: 2
Formatted length: 196 chars
Token count: 41

Formatted text:
'<|im_start|>system\nYou are a helpful AI assistant named SmolLM, trained by Hugging Face<|im_end|>\n<|im_start|>user\nHello!<|im_end|>\n<|im_start|>assistant\nHi there!<|im_end|>\n<|im_start|>assistant\n'

Tokens:
[1, 9690, 198, 2683, 359, 253, 5356, 5646, 11173, 3365, 3511, 308, 34519, 28, 7018, 411, 407, 19712, 8182, 2] ...

Decoded tokens:
 0:     1 -> '<|im_start|>'
 1:  9690 -> 'system'
 2:   198 -> '\n'
 3:  2683 -> 'You'
 4:   359 -> ' are'
 5:   253 -> ' a'
 6:  5356 -> ' helpful'
 7:  5646 -> ' AI'
 8: 11173 -> ' assistant'
 9:  3365 -> ' named'
10:  3511 -> ' Sm'
11:   308 -> 'ol'
12: 34519 -> 'LM'
13:    28 -> ','
14:  7018 -> ' trained'
15:   411 -> ' by'
16:   407 -> ' H'
17: 19712 -> 'ugging'
18:  8182 -> ' Face'
19:     2 -> '<|im_end|>'


In [10]:
# Configure training parameters
training_config = SFTConfig(
    # ==================== MODEL AND DATA ====================
    output_dir=f"./{new_model_name}",
    # Where to save the fine-tuned model checkpoints and logs
    # Creates a folder with your model name in the current directory
    
    dataset_text_field="text",
    # The column name in your dataset that contains the actual text/conversations
    # Must match the field name after applying chat templates
    
    max_length=2048,
    # Maximum sequence length in tokens (not characters!)
    # Longer = more context but more memory usage
    # Common values: 512 (small), 1024 (medium), 2048 (large), 4096+ (very large)
    # Sequences longer than this will be truncated

    # ==================== TRAINING HYPERPARAMETERS ====================
    per_device_train_batch_size=2,
    # How many examples to process at once per GPU/device
    # Smaller = less memory, slower training
    # Larger = more memory, faster training, more stable gradients
    # Typical values: 1-8 for small GPUs, 8-32 for larger GPUs
    # For 135M model on 8GB GPU: try 4-8, for 1.7B: try 1-4

    max_grad_norm=1.0, 
    
    gradient_accumulation_steps=2,
    # Accumulate gradients over N batches before updating weights
    # Simulates larger batch size without using more memory
    # Effective batch size = per_device_train_batch_size × gradient_accumulation_steps
    # Use this if you can't fit larger batches in memory
    # Typical values: 1 (no accumulation), 2, 4, 8, 16
    
    learning_rate=5e-5,
    # How much to adjust weights in each update (step size)
    # Too high = unstable training, might diverge
    # Too low = very slow learning
    # Common ranges: 1e-5 to 1e-4 for fine-tuning (5e-5 is a good starting point)
    # Smaller models can often handle higher learning rates (1e-4)
    
    num_train_epochs=2,
    # How many times to go through the entire dataset
    # 1 epoch = see each example once
    # More epochs = more learning but risk of overfitting
    # Typical values: 1-5 for fine-tuning (3 is common)
    
    max_steps=500,
    # Maximum number of training steps (updates) to perform
    # If set, this overrides num_train_epochs
    # Useful for quick experiments or when you want precise control
    # Set to -1 to disable and use num_train_epochs instead

    # ==================== OPTIMIZATION ====================
    warmup_steps=50,
    # Gradually increase learning rate from 0 to learning_rate over N steps
    # Prevents large updates early in training that could destabilize the model
    # Typical values: 5-10% of total steps, or 100-500 steps
    # Formula: ~0.1 × total_steps is common
    
    weight_decay=0.01,
    # L2 regularization to prevent overfitting
    # Adds penalty for large weights, encouraging simpler solutions
    # Range: 0.0 (no regularization) to 0.1 (strong regularization)
    # 0.01 is a standard default, 0.1 for more regularization
    
    optim="adamw_torch",
    # The optimization algorithm to use
    # "adamw_torch" = AdamW optimizer (most common, good default)
    # Other options: "sgd", "adafactor", "adamw_8bit" (for memory savings)
    # AdamW is generally the best choice for transformers

    # ==================== LOGGING AND SAVING ====================
    logging_steps=10,
    # How often to log training metrics (loss, learning rate, etc.)
    # Every 10 steps = you'll see updates every 10 training iterations
    # Smaller = more frequent updates (more verbose)
    # Typical values: 1 (very verbose), 10 (balanced), 50-100 (less verbose)
    
    save_steps=100,
    # How often to save a checkpoint of the model
    # Every 100 steps = create a backup you can resume from
    # Smaller = more checkpoints (more disk space)
    # Typical values: 100-1000 depending on total steps
    
    eval_steps=100,
    # How often to run evaluation on validation set (if provided)
    # Not used here since no eval dataset is provided
    # Same considerations as save_steps
    
    save_total_limit=2,
    # Maximum number of checkpoints to keep
    # Only keeps the N most recent checkpoints, deletes older ones
    # Saves disk space! For 1.7B model, each checkpoint is ~3-7GB
    # Typical values: 1-3 for space saving, None to keep all

    # ==================== MEMORY OPTIMIZATION ====================
    dataloader_num_workers=0,
    # Number of CPU processes for data loading
    # 0 = load data in main process (simpler, good for small datasets)
    # >0 = parallel data loading (faster for large datasets)
    # Typical values: 0 (simple), 2-4 (faster), but can cause issues on some systems
    
    group_by_length=True,
    # Group sequences of similar length into the same batch
    # Reduces padding → less wasted computation → faster training
    # True = more efficient, False = simpler but slower
    # Almost always want this True for fine-tuning

    # ==================== HUGGING FACE HUB INTEGRATION ====================
    push_to_hub=False,
    # Automatically upload model to Hugging Face Hub during training
    # False = keep local only
    # True = share publicly or privately (requires authentication)
    
    hub_model_id=f"tomascufaro/{new_model_name}",
    # Where to push the model on Hugging Face Hub
    # Format: "your-hf-username/model-name"
    # Only used if push_to_hub=True

    # ==================== EXPERIMENT TRACKING ====================
    report_to=["trackio"],
    # Where to log metrics for visualization
    # Options: "tensorboard", "wandb", "trackio", "none"
    # trackio = Hugging Face's built-in tracking
    # wandb = Weights & Biases (popular choice)
    # Can use multiple: ["tensorboard", "wandb"]
    
    run_name=f"{new_model_name}-training_v2",
    # Name for this training run in your tracking dashboard
    # Helps identify experiments when you run multiple trainings

    dataloader_pin_memory=False,
    # Disable pinned memory: it is not supported on Apple MPS and only triggers a warning there
)

print("Training configuration set!")
print(f"Effective batch size: {training_config.per_device_train_batch_size * training_config.gradient_accumulation_steps}")
# Effective batch size = how many examples are used to compute each gradient update
# This example: 2 × 2 = 4 examples per update
# Larger effective batch size = more stable gradients but slower per-step updates

Training configuration set!
Effective batch size: 4
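
To see how these settings interact, the arithmetic can be sketched in plain Python. This is a rough planning aid, not part of the training code. It assumes the 1,000-example dataset used in this notebook, a single device, and one epoch (the epoch count is an assumption; substitute your own). The helper names `training_schedule` and `weights_size_gb` are made up for this example. The size estimate covers model weights only, while a Trainer checkpoint normally also stores optimizer state, so real checkpoints can be larger.

```python
import math

def training_schedule(num_examples, per_device_batch_size, grad_accum_steps,
                      logging_steps, save_steps, num_devices=1, num_epochs=1):
    """Estimate optimizer steps, log events, and checkpoints for a run."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    # Approximation: round up so a final partial batch still counts as a step.
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * num_epochs
    return {
        "effective_batch": effective_batch,
        "total_steps": total_steps,
        "log_events": total_steps // logging_steps,
        "checkpoints_triggered": total_steps // save_steps,
    }

def weights_size_gb(num_parameters, bytes_per_parameter):
    """Approximate size of the model weights alone, in gigabytes."""
    return num_parameters * bytes_per_parameter / 1e9

plan = training_schedule(
    num_examples=1000,
    per_device_batch_size=2,
    grad_accum_steps=2,
    logging_steps=10,
    save_steps=100,
)
print(plan)

# The "~3-7GB" figure in the config comments: 1.7B parameters in 16-bit vs 32-bit precision.
weights_bf16_gb = weights_size_gb(1_700_000_000, 2)
weights_fp32_gb = weights_size_gb(1_700_000_000, 4)
print(f"16-bit weights: {weights_bf16_gb:.1f} GB, 32-bit weights: {weights_fp32_gb:.1f} GB")
```

With `save_total_limit=2`, only the two most recent of the triggered checkpoints stay on disk, so the disk footprint is bounded regardless of how long the run is.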


In [11]:
formatted_dataset[0]

{'text': '<|im_start|>system\nExtract and present the main key point of the input text in one very short sentence, including essential details like dates or locations if necessary.<|im_end|>\n<|im_start|>user\nHi Michael,\n\nI hope you\'re doing well! I wanted to follow up on our conversation from the Math Educators Forum about collaborating on decimal operation worksheets. I\'m excited to work together and combine our strengths to create something great for our students.\n\nI was thinking we could meet in person to discuss our plans and goals for the project. I live in Oakville and you\'re in Pinecrest, right? There\'s a great coffee shop called "The Bean Counter" that\'s about halfway between us. Would you be available to meet there next Saturday, March 14th, at 10 AM? Let me know if that works for you or if you have any other suggestions.\n\nLooking forward to working together!\n\nBest,\nSarah<|im_end|>\n<|im_start|>assistant\nSarah is proposing a meeting at "The Bean Counter" in Oa
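
The record above follows a ChatML-style layout: each turn is wrapped as `<|im_start|>role`, a newline, the content, and `<|im_end|>`. The sketch below mimics only that visible layout in plain Python so the structure is easy to inspect. The function `to_chatml` and the shortened messages are invented for illustration; in the notebook itself, the tokenizer's `apply_chat_template` produces the text, and the real template may add details this sketch omits.

```python
def to_chatml(messages):
    """Render {"role", "content"} dicts in the ChatML-style layout shown above."""
    parts = []
    for message in messages:
        # One turn: start marker + role, newline, content, end marker, newline.
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n")
    return "".join(parts)

# Shortened, made-up conversation (not taken from the dataset).
example = [
    {"role": "system", "content": "Summarize the input in one short sentence."},
    {"role": "user", "content": "Hi Michael, can we meet on Saturday at 10 AM? Best, Sarah"},
    {"role": "assistant", "content": "Sarah proposes meeting on Saturday at 10 AM."},
]

text = to_chatml(example)
print(text)
```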

In [10]:
# Initialize the SFTTrainer
trainer = SFTTrainer(
    model=model,
    args=training_config,
    train_dataset=formatted_dataset,
)


Adding EOS to train dataset:   0%|          | 0/1000 [00:00<?, ? examples/s]

Tokenizing train dataset:   0%|          | 0/1000 [00:00<?, ? examples/s]

Truncating train dataset:   0%|          | 0/1000 [00:00<?, ? examples/s]

In [12]:
# Start training!
print("\n=== STARTING TRAINING ===")
trainer.train(resume_from_checkpoint=False)

# Save the model
trainer.save_model()
print(f"Model saved to {training_config.output_dir}")


=== STARTING TRAINING ===


Step,Training Loss
10,0.0
20,0.0
30,0.0
40,0.0


KeyboardInterrupt: 
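
A training loss that logs as exactly 0.0 from the first reported step, as in the table above, is worth investigating before training further. It more often points to a numerical or data problem (for example, a precision issue on the current device, or labels that are entirely masked) than to a model that has already converged. That diagnosis is an assumption about this run, not something the output confirms. Below is a minimal sketch of a check over logged losses. The helper name `find_suspicious_losses` is hypothetical and the sample values are copied from the table above; in a live session, dictionaries of the same shape are available in `trainer.state.log_history`.

```python
import math

def find_suspicious_losses(log_history):
    """Return (step, loss) pairs whose loss is exactly 0.0, NaN, or infinite."""
    flagged = []
    for entry in log_history:
        loss = entry.get("loss")
        if loss is None:
            # Entries without a "loss" key (such as a final summary) are skipped.
            continue
        if loss == 0.0 or math.isnan(loss) or math.isinf(loss):
            flagged.append((entry.get("step"), loss))
    return flagged

# Values copied from the training log above.
logged = [
    {"step": 10, "loss": 0.0},
    {"step": 20, "loss": 0.0},
    {"step": 30, "loss": 0.0},
    {"step": 40, "loss": 0.0},
]

flagged = find_suspicious_losses(logged)
if flagged:
    print(f"{len(flagged)} of {len(logged)} logged steps look suspicious: {flagged}")
```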

# LoRA SFT with TRL + SmolLM2

This short notebook shows how to fine-tune a small model with LoRA adapters using TRL's SFTTrainer. It uses a tiny model (SmolLM2-135M) and a small public chat dataset for a quick demonstration.
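
LoRA keeps each adapted weight matrix frozen and learns a low-rank update instead: a layer with weight W of shape d_out × d_in gains two small matrices, B (d_out × r) and A (r × d_in), and the product B·A is scaled by `lora_alpha / r`. The sketch below works through that arithmetic for the `r=8`, `lora_alpha=16` values used in the configuration in this notebook. The 576 × 576 layer shape is an illustrative assumption, not read from the model, and `lora_param_count` is a helper made up for this example.

```python
def lora_param_count(d_in, d_out, r):
    """Trainable parameters LoRA adds to one d_out x d_in linear layer."""
    # A has shape (r, d_in) and B has shape (d_out, r).
    return r * d_in + d_out * r

r, lora_alpha = 8, 16
d_in = d_out = 576  # assumed layer size, for illustration only

full_params = d_in * d_out
adapter_params = lora_param_count(d_in, d_out, r)
scaling = lora_alpha / r  # factor applied to the low-rank update B @ A

print(f"Frozen weights in the layer: {full_params:,}")
print(f"Trainable LoRA weights:      {adapter_params:,}")
print(f"Trainable fraction:          {adapter_params / full_params:.2%}")
print(f"Update scaling (alpha / r):  {scaling}")
```

Raising `r` increases adapter capacity and trainable parameters linearly, while `lora_alpha` only changes how strongly the learned update is applied.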



In [12]:
from peft import LoraConfig

In [15]:
# LoRA configuration with PEFT

peft_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

# Create SFTTrainer with LoRA enabled

lora_trainer = SFTTrainer(
    model=model,
    train_dataset=formatted_dataset,  # dataset with a "text" field or messages + dataset_text_field in config
    args=training_config,
    peft_config=peft_config,  # << enable LoRA
)

print("Starting LoRA training…")
lora_trainer.train()

Adding EOS to train dataset:   0%|          | 0/1000 [00:00<?, ? examples/s]

Tokenizing train dataset:   0%|          | 0/1000 [00:00<?, ? examples/s]

Truncating train dataset:   0%|          | 0/1000 [00:00<?, ? examples/s]

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'pad_token_id': None}.


Starting LoRA training…
* Trackio project initialized: huggingface
* Trackio metrics logged to: /Users/t.cufarofernandez/.cache/huggingface/trackio


* Resumed existing run: SmolLM2-135M-Custom-SFT-training_v2


Step,Training Loss
10,2.8657
20,2.8374
30,3.0802
40,3.2588
50,3.4809


KeyboardInterrupt: 