# Supervised Fine-Tuning with SFTTrainer

This notebook demonstrates how to fine-tune the `HuggingFaceTB/SmolLM2-135M` model using the `SFTTrainer` from the `trl` library. The cells run as written and fine-tune the model. You can choose a difficulty level by trying out different datasets.

<div style='background-color: lightblue; padding: 10px; border-radius: 5px; margin-bottom: 20px; color:black'>
    <h2 style='margin: 0;color:blue'>Exercise: Fine-Tuning SmolLM2 with SFTTrainer</h2>
    <p>Take a dataset from the Hugging Face Hub and fine-tune a model on it.</p>
    <p><b>Difficulty Levels</b></p>
    <p>🐢 Use the `HuggingFaceTB/smoltalk` dataset</p>
    <p>🐕 Try out the `bigcode/the-stack-smol` dataset and fine-tune a code generation model on a specific subset `data/python`.</p>
    <p>🦁 Select a dataset that relates to a real-world use case you're interested in</p>
</div>

In [15]:
# Install the requirements in Google Colab
!pip install -r https://raw.githubusercontent.com/huggingface/smol-course/refs/heads/main/requirements.txt

Ignoring colorama: markers 'platform_system == "Windows"' don't match your environment
Ignoring setuptools: markers 'python_full_version >= "3.12"' don't match your environment
Collecting numpy==2.1.3 (from -r https://raw.githubusercontent.com/huggingface/smol-course/refs/heads/main/requirements.txt (line 26))
  Using cached numpy-2.1.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (62 kB)
Collecting trl==0.12.1 (from -r https://raw.githubusercontent.com/huggingface/smol-course/refs/heads/main/requirements.txt (line 60))
  Using cached trl-0.12.1-py3-none-any.whl.metadata (10 kB)
Using cached numpy-2.1.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (16.3 MB)
Using cached trl-0.12.1-py3-none-any.whl (310 kB)
Installing collected packages: numpy, trl
  Attempting uninstall: numpy
    Found existing installation: numpy 2.0.2
    Uninstalling numpy-2.0.2:
      Successfully uninstalled numpy-2.0.2
  Attempting uninstall: trl
    Found existing installati

In [1]:
!pip install --upgrade tensorflow transformers trl # only trl is affected from 0.12.1 to 0.12.2

Collecting transformers
  Using cached transformers-4.47.0-py3-none-any.whl.metadata (43 kB)
Collecting tokenizers<0.22,>=0.21 (from transformers)
  Using cached tokenizers-0.21.0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)


In [2]:
# Import necessary libraries
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer, setup_chat_format
import torch

device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps"
    if torch.backends.mps.is_available()
    else "cpu"
)

# Load the model and tokenizer
model_name = "HuggingFaceTB/SmolLM2-135M"
model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=model_name
).to(device)
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path=model_name)

# Set up the chat format
model, tokenizer = setup_chat_format(model=model, tokenizer=tokenizer)

# Name under which the fine-tuned model is saved and/or uploaded
finetune_name = "SmolLM2-FT-wikihow_es_v2"
finetune_tags = ["smol-course", "module_1"]

In [None]:
!pip freeze > requirements.txt

## Generate with the base model

Here we try out the base model, which does not have a chat template of its own; the ChatML template used below was added by `setup_chat_format`.

In [3]:
# Let's test the base model before training
prompt = "Write a haiku about programming"

# Format with ChatML template
messages = [{"role": "user", "content": prompt}]
formatted_prompt = tokenizer.apply_chat_template(messages, tokenize=False)

# Generate response
inputs = tokenizer(formatted_prompt, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=100)
print("Before training:")
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Before training:
user
Write a haiku about programming
Write a haiku about programming
Write a haiku about programming
Write a haiku about programming
Write a haiku about programming
Write a haiku about programming
Write a haiku about programming
Write a haiku about programming
Write a haiku about programming
Write a haiku about programming
Write a haiku about programming
Write a haiku about programming
Write a haiku about programming
Write a haiku about programming
Write a haiku about programming
Write a
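
To make the ChatML formatting step more concrete, here is a minimal sketch of the string layout that the template is assumed to produce. `render_chatml` is a hypothetical helper written for illustration, not a function from `trl` or `transformers`; the `<|im_start|>` / `<|im_end|>` markers are the usual ChatML special tokens.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Render a list of {"role", "content"} dicts as a ChatML string (illustrative only)."""
    text = ""
    for message in messages:
        # Each turn is wrapped in start/end markers, with the role on the first line
        text += f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its reply
        text += "<|im_start|>assistant\n"
    return text


messages = [{"role": "user", "content": "Write a haiku about programming"}]
print(render_chatml(messages))
print(render_chatml(messages, add_generation_prompt=True))
```

With `skip_special_tokens=True`, the markers are dropped when decoding, which is why the output above shows only `user` followed by the prompt text.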


## Dataset Preparation

We will load a sample dataset and format it for training. The dataset should be structured with input-output pairs, where each input is a prompt and the output is the expected response from the model.

In [4]:
from datasets import load_dataset

# Load the first 1000 records of the dataset
ds = load_dataset("daqc/wikihow_es_v2", split="train[:1000]")

In [5]:
# Split the dataset into 800 for training and 200 for testing
split_ds = ds.train_test_split(test_size=0.2, seed=42)  # 80% for training, 20% for testing
train_ds = split_ds["train"]  # 800 records
test_ds = split_ds["test"]  # 200 records

# Define a function to process the dataset for training
def prepare_dataset(sample):
    """
    Prepare the dataset for training by converting it into ChatML format.
    - title: Represents the user input or question.
    - summary: Represents the assistant's response.
    """
    messages = [
        {"role": "user", "content": sample["title"]},
        {"role": "assistant", "content": sample["summary"]}
    ]

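    # Apply the chat template with tokenization (ready for training).
    # Note: add_generation_prompt=True appends an empty assistant header after the final turn
    # (visible at the end of the token lists printed below); it is normally meant for inference prompts.
    input_text_tokenized = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True)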

    return {"input_ids": input_text_tokenized}

# Process the training and test datasets
processed_train_ds = train_ds.map(prepare_dataset)
processed_test_ds = test_ds.map(prepare_dataset)

# Verify that the "input_ids" column is present and valid
print("First row of the training dataset:")
print(processed_train_ds[0])  # Show the first example of the processed dataset

print("\nFirst row of the test dataset:")
print(processed_test_ds[0])  # Show the first example of the processed dataset

# Verify the types of the datasets
print(type(processed_train_ds))  # Should be <class 'datasets.arrow_dataset.Dataset'>
print(type(processed_test_ds))  # Should be <class 'datasets.arrow_dataset.Dataset'>



First row of the training dataset:
{'title': '¿Cómo realizar Una distribución aleatoria?', 'section_name': 'Emplear un generador aleatorio en línea', 'summary': 'Asígnale un número a tus participantes. Elige el generador aleatorio en línea que emplearás. Coloca tus números en el generador aleatorio en línea que elijas. Haz clic en el botón para generar tu número aleatorio.', 'document': 'Por lo general, los generadores aleatorios en línea emplean un sistema de numeración, por lo que será de utilidad si se les asigna un número a tus participantes. Tan solo escribe un número (empezando desde el 1) junto a cada nombre de tu lista.  Este instrumento es muy útil en las aulas de clase, ya que los profesores podrán asignarle un número a cada estudiante y luego emplearán este instrumento para solicitarles que respondan preguntas o lean en voz alta. Asimismo, será de utilidad para escoger ganadores de obsequios, ya que eliminará las preferencias o errores humanos que se cometen al escoger a un 

In [6]:
# Optionally save the processed datasets for future use
# processed_train_ds.save_to_disk("wikihow_train_800")
# processed_test_ds.save_to_disk("wikihow_test_200")

# Display each dataset and its first tokenized record for verification
print("\nTrain dataset:")
print(processed_train_ds)
print("\nFirst record in training dataset:")
print(processed_train_ds["input_ids"][0])

print("\nTest dataset:")
print(processed_test_ds)
print("\nFirst record in test dataset:")
print(processed_test_ds["input_ids"][0])


Train dataset:
Dataset({
    features: ['title', 'section_name', 'summary', 'document', 'english_section_name', 'english_url', 'url', 'input_ids'],
    num_rows: 800
})

First record in training dataset:
[1, 4093, 198, 47416, 51, 7342, 15771, 1345, 463, 280, 810, 81, 1006, 762, 19086, 12814, 34789, 1508, 542, 47, 2, 198, 1, 520, 9531, 198, 1653, 6160, 3662, 1121, 551, 304, 16148, 763, 95, 253, 252, 376, 15634, 266, 30, 3906, 14921, 1102, 991, 11015, 34789, 1508, 1119, 430, 303, 6160, 12504, 10168, 649, 695, 280, 31798, 30, 1712, 10213, 252, 376, 304, 16148, 763, 395, 430, 1102, 991, 11015, 34789, 1508, 1119, 430, 303, 6160, 12504, 10168, 1102, 5271, 292, 30, 16646, 539, 286, 430, 1102, 6543, 12814, 20819, 991, 280, 7269, 304, 16148, 763, 95, 34789, 1508, 1119, 30, 2, 198, 1, 520, 9531, 198]

Test dataset:
Dataset({
    features: ['title', 'section_name', 'summary', 'document', 'english_section_name', 'english_url', 'url', 'input_ids'],
    num_rows: 200
})

First record in test datase

In [7]:
# Map the dataset to ensure only the 'input_ids' column is passed to the model during training
def prepare_for_training(sample):
    return {"input_ids": sample["input_ids"]}

# Apply this function to both train and test datasets to only keep 'input_ids'
train_dataset = processed_train_ds.map(prepare_for_training, remove_columns=processed_train_ds.column_names)
eval_dataset = processed_test_ds.map(prepare_for_training, remove_columns=processed_test_ds.column_names)

# Check that only 'input_ids' is present in the processed datasets
print("Train dataset columns:", train_dataset)
print("Eval dataset columns:", eval_dataset)


Train dataset columns: Dataset({
    features: ['input_ids'],
    num_rows: 800
})
Eval dataset columns: Dataset({
    features: ['input_ids'],
    num_rows: 200
})
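
Since only `input_ids` of varying length remain, the examples still have to be padded into batches and given labels before the loss can be computed. The sketch below shows roughly what a language-modeling collator does at that step. It is an illustration under assumptions, not the collator `SFTTrainer` actually uses: `collate` is a hypothetical helper, right padding is assumed, and `pad_token_id=0` is a made-up value.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def collate(batch, pad_token_id):
    """Pad variable-length token lists to a common length and build labels (illustrative)."""
    max_len = max(len(ids) for ids in batch)
    input_ids, attention_mask, labels = [], [], []
    for ids in batch:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_token_id] * n_pad)        # pad on the right
        attention_mask.append([1] * len(ids) + [0] * n_pad)   # 0 marks padding
        labels.append(ids + [IGNORE_INDEX] * n_pad)           # padding is excluded from the loss
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


# pad_token_id=0 is a hypothetical value chosen for the example
batch = collate([[1, 4093, 198, 2], [1, 520]], pad_token_id=0)
print(batch["input_ids"])
print(batch["labels"])
```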


## Wandb

In [8]:
! pip install -U wandb



In [9]:
import wandb
import os

wandb.login()

wandb_project = "SmolLM2-FT-wikihow_es_v2"
if len(wandb_project) > 0:
    os.environ["WANDB_PROJECT"] = wandb_project

wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


## Configuring the SFTTrainer

The `SFTTrainer` is configured with various parameters that control the training process. These include the number of training steps, batch size, learning rate, and evaluation strategy. Adjust these parameters based on your specific requirements and computational resources.

In [10]:
# Configure the SFTTrainer
sft_config = SFTConfig(
    output_dir="./sft_output",
    max_steps=300,  # Number of steps adjusted for a quick training (shorter training time)
    per_device_train_batch_size=4,  # Keep batch size at 4 due to memory limitations
    learning_rate=3e-5,  # Moderate learning rate to avoid too fast parameter changes
    logging_steps=10,  # Log training metrics frequently
    save_steps=50,  # Save checkpoints more frequently
    eval_strategy="steps",  # Evaluate the model at regular steps intervals
    eval_steps=25,  # Evaluate every 25 steps to get frequent metrics
    use_mps_device=(device == "mps"),  # Train on Apple's MPS backend when that is the selected device
    hub_model_id=finetune_name,  # Set a unique model name for the Hugging Face Hub
    report_to="wandb",  # Enable Wandb tracking for monitoring training
)

# Initialize the SFTTrainer
trainer = SFTTrainer(
    model=model,
    args=sft_config,
    train_dataset=train_dataset,  # Training dataset
    tokenizer=tokenizer,  # Tokenizer to process the data
    eval_dataset=eval_dataset,  # Evaluation dataset
    max_seq_length=128,  # Limit sequence length to reduce memory usage; passing it here rather than in SFTConfig is deprecated (see the warning below)
)



Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.
max_steps is given, it will override any value given in num_train_epochs
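
Because `max_steps` overrides `num_train_epochs`, it can help to work out how much of the data 300 steps actually cover. The arithmetic below is a small sketch using the values from the configuration above. It assumes a single device and no gradient accumulation, neither of which is stated in the notebook.

```python
import math

# Values taken from the notebook's configuration and dataset split
num_train_examples = 800
per_device_train_batch_size = 4
max_steps = 300
eval_steps = 25
save_steps = 50

# Assumptions (not stated in the notebook): one device, gradient_accumulation_steps=1
examples_per_step = per_device_train_batch_size
steps_per_epoch = math.ceil(num_train_examples / examples_per_step)
epochs = max_steps / steps_per_epoch

num_evaluations = max_steps // eval_steps
num_checkpoints = max_steps // save_steps

print(f"steps per epoch: {steps_per_epoch}")   # 200
print(f"epochs covered:  {epochs}")            # 1.5
print(f"evaluations:     {num_evaluations}")   # 12
print(f"checkpoints:     {num_checkpoints}")   # 6
```

Under these assumptions the run passes over the 800 training examples one and a half times.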


## Training the Model

With the trainer configured, we can now proceed to train the model. The training process will involve iterating over the dataset, computing the loss, and updating the model's parameters to minimize this loss.
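
As an illustration of the loss being minimized, the sketch below computes the standard next-token cross-entropy for a causal language model on a toy example in plain Python. It is not the trainer's implementation: `causal_lm_loss` is a hypothetical helper, and the three-token vocabulary and logits are made up.

```python
import math


def causal_lm_loss(logits, input_ids):
    """Mean next-token cross-entropy: the logits at position t are scored against token t + 1."""
    total = 0.0
    count = 0
    for t in range(len(input_ids) - 1):
        scores = logits[t]
        target = input_ids[t + 1]
        # Log of the softmax normalizer, computed stably by subtracting the max score
        m = max(scores)
        log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_norm - scores[target]  # negative log-probability of the target
        count += 1
    return total / count


# Toy example: vocabulary of 3 tokens, sequence of 3 tokens
input_ids = [0, 2, 1]
uniform_logits = [[0.0, 0.0, 0.0]] * 3
confident_logits = [[0.0, 0.0, 5.0], [0.0, 5.0, 0.0], [0.0, 0.0, 0.0]]

uniform_loss = causal_lm_loss(uniform_logits, input_ids)      # equals ln(3) for a uniform prediction
confident_loss = causal_lm_loss(confident_logits, input_ids)  # lower: the correct next tokens score highest
print(uniform_loss, confident_loss)
```

Training nudges the parameters so that the correct next token receives a higher score, which moves the loss from the uniform case toward the confident one.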

In [None]:
# Train the model
trainer.train()

# Save the model
trainer.save_model(f"./{finetune_name}")


In [14]:
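import os

# Read the Hugging Face access token from the environment
HF_TOKEN = os.getenv("HF_TOKEN")

# Set to True if you want to save to the Hugging Face Hub
if True:
    # Pass the token by keyword: the first positional argument of push_to_hub is the commit message
    trainer.push_to_hub(token=HF_TOKEN, tags=finetune_tags)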

<div style='background-color: lightblue; padding: 10px; border-radius: 5px; margin-bottom: 20px; color:black'>
    <h2 style='margin: 0;color:blue'>Bonus Exercise: Generate with fine-tuned model</h2>
    <p>🐕 Use the fine-tuned model to generate a response, just like with the base example.</p>
</div>

In [15]:
# Reload the fine-tuned model so it can be tested
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load my fine-tuned model and tokenizer from Hugging Face Hub
mymodel_name = "daqc/SmolLM2-FT-wikihow_es_v2"
mytokenizer = AutoTokenizer.from_pretrained(mymodel_name)
mymodel = AutoModelForCausalLM.from_pretrained(mymodel_name).to(device)

In [30]:
# Let's test the fine-tuned model on a prompt from the training domain
prompt = "¿Cómo jugar minecraft?"

# Format with ChatML template
messages = [{"role": "user", "content": prompt}]
formatted_prompt = mytokenizer.apply_chat_template(messages, tokenize=False)

# Generate a response with the fine-tuned model, just like with the base example
inputs = mytokenizer(formatted_prompt, return_tensors="pt").to(device)
outputs = mymodel.generate(**inputs, max_new_tokens=50)

# Print the fine-tuned model's response
print("After fine-tuning:")
print(mytokenizer.decode(outputs[0], skip_special_tokens=True))

After fine-tuning:
user
¿Cómo jugar minecraft?
assistant
Abre Minecraft. Usa la tecla "Minecraft" en la pantalla de la computadora. Usa la tecla "Minecraft" en la pantalla de la computadora.
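
The decoded text above includes the prompt because, for decoder-only models, the generated sequence starts with the prompt tokens. One common way to show only the reply is to slice off the prompt before decoding. The sketch below demonstrates the idea on plain lists; `new_tokens_only` is a hypothetical helper and the token ids are made up.

```python
def new_tokens_only(output_ids, prompt_length):
    """Return the part of a generated sequence that follows the prompt (illustrative)."""
    return output_ids[prompt_length:]


prompt_ids = [1, 4093, 198, 17, 2, 198]      # hypothetical prompt token ids
continuation = [1, 520, 9531, 198, 42, 43]   # hypothetical generated token ids
output_ids = prompt_ids + continuation       # generation output: prompt followed by new tokens

reply_ids = new_tokens_only(output_ids, len(prompt_ids))
print(reply_ids)
```

With the tensors used in this notebook, the analogous slice would be `outputs[0][inputs["input_ids"].shape[1]:]`, passed to `mytokenizer.decode`.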


## 💐 You're done!

This notebook provided a step-by-step guide to fine-tuning the `HuggingFaceTB/SmolLM2-135M` model using the `SFTTrainer`. By following these steps, you can adapt the model to perform specific tasks more effectively. If you want to carry on working on this course, here are steps you could try out:

- Try this notebook on a harder difficulty
- Review a colleague's PR
- Improve the course material via an Issue or PR.