<a href="https://colab.research.google.com/github/thibaud-perrin/instruction-tuning/blob/main/notebooks/supervised_fine_tuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Fine-Tuning SmolLM2 with SFTTrainer

This notebook walks through fine-tuning a model with the `SFTTrainer` from the `trl` library. Two examples of increasing complexity show how to select and use datasets from the Hugging Face Hub:

- **🐢 Basic Example:** Fine-tune the model using the `HuggingFaceTB/smoltalk` dataset.  
- **🐕 Intermediate Example:** Fine-tune a code generation model with the `bigcode/the-stack-smol` dataset, focusing on the `data/python` subset.  

These examples will illustrate different approaches and levels of customization for supervised fine-tuning.


## Libraries

In [1]:
# Install the requirements in Google Colab
!pip install transformers datasets trl huggingface_hub

Collecting datasets
  Downloading datasets-3.2.0-py3-none-any.whl.metadata (20 kB)
Collecting trl
  Downloading trl-0.13.0-py3-none-any.whl.metadata (11 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.2.0-py3-none-any.whl (480 kB)
Downloading trl-0.13.0-py3-none-any.whl (293 kB)

In [2]:
# Authenticate to Hugging Face

from huggingface_hub import login
login()

# For convenience, you can store your Hub token in the HF_TOKEN environment variable


In [3]:
# Import necessary libraries
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer, setup_chat_format
import torch

device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available() else "cpu"
)

# Load the model and tokenizer
model_name = "HuggingFaceTB/SmolLM2-135M"
model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=model_name
).to(device)
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path=model_name)

# Set up the chat format
model, tokenizer = setup_chat_format(model=model, tokenizer=tokenizer)



## Generate with the base model
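The base model ships without a chat template. `setup_chat_format` above added a ChatML template, but the model has not yet been trained on it, so we expect it to handle a chat-formatted prompt poorly.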

In [4]:
# Let's test the base model before training
prompt = "Write a haiku about programming"

# Format with template
messages = [{"role": "user", "content": prompt}]
formatted_prompt = tokenizer.apply_chat_template(messages, tokenize=False)

# Generate response
inputs = tokenizer(formatted_prompt, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=100)
print("Before training:")
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Before training:
user
Write a haiku about programming
Write a haiku about programming
Write a haiku about programming
Write a haiku about programming
Write a haiku about programming
Write a haiku about programming
Write a haiku about programming
Write a haiku about programming
Write a haiku about programming
Write a haiku about programming
Write a haiku about programming
Write a haiku about programming
Write a haiku about programming
Write a haiku about programming
Write a haiku about programming
Write a


## HuggingFaceTB/smoltalk

In [None]:
# Name under which the fine-tuned model is saved and/or uploaded
finetune_name = "SmolLM2-FT-MyDataset"
finetune_tags = ["smol-course", "module_1"]

### Dataset Preparation
We will load a sample dataset and format it for training. The dataset should be structured with input-output pairs, where each input is a prompt and the output is the expected response from the model.

**TRL will format input messages based on the model's chat templates.** They need to be represented as a list of dictionaries with the keys: `role` and `content`.
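As a rough illustration of what the chat template produces, the sketch below renders a `messages` list into ChatML-style text by hand. It assumes the ChatML layout (`<|im_start|>role`, a newline, the content, then `<|im_end|>`) that `setup_chat_format` applies by default. `render_chatml` is a hypothetical helper written for this illustration, not part of TRL; in practice the rendering is done by `tokenizer.apply_chat_template`.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Hypothetical helper: render role/content dicts as ChatML-style text."""
    parts = []
    for message in messages:
        # Each turn is wrapped in start/end markers and followed by a newline
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its reply
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "user", "content": "Hi there"},
    {"role": "assistant", "content": "Hello! How can I help you today?"},
]
print(render_chatml(messages))
```

The `add_generation_prompt` flag matters mostly at inference time: it leaves an open assistant turn for the model to complete, which complete training conversations do not need.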

In [5]:
# Load a sample dataset
from datasets import load_dataset

ds = load_dataset(path="HuggingFaceTB/smoltalk", name="everyday-conversations")


In [6]:
def process_dataset(sample):
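    # Render the full conversation as text. No generation prompt is added:
    # these are complete conversations for training, not prompts awaiting a reply.
    sample['chat_format_messages'] = tokenizer.apply_chat_template(
        sample['messages'], tokenize=False, add_generation_prompt=False
    )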
    return sample

In [7]:
ds = ds.map(process_dataset)


### Configuring the SFTTrainer
The `SFTTrainer` is configured with various parameters that control the training process. These include the number of training steps, batch size, learning rate, and evaluation strategy. Adjust these parameters based on your specific requirements and computational resources.
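The number of training steps follows from the dataset size, the batch size, and the number of epochs. The sketch below makes that arithmetic explicit. `compute_max_steps` is a hypothetical helper, not a TRL function, and it rounds up so that a final partial batch still counts as a step, whereas the notebook cell uses floor division.

```python
import math


def compute_max_steps(num_examples, per_device_batch_size, num_epochs,
                      gradient_accumulation_steps=1, num_devices=1):
    """Hypothetical helper: optimizer steps needed for `num_epochs` passes."""
    # Examples consumed per optimizer step across all devices
    effective_batch = per_device_batch_size * gradient_accumulation_steps * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return steps_per_epoch * num_epochs


# 2,260 training examples, batch size 4, 3 epochs
print(compute_max_steps(num_examples=2260, per_device_batch_size=4, num_epochs=3))
```

For the `everyday-conversations` train split this gives 1695 steps, since 2,260 divides evenly by the batch size of 4.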

In [8]:
# Inspect the dataset structure and metadata
print(ds)

# Display dataset features
print(ds["train"].features)

# Check the number of examples in the train and test splits
print(f"Train split size: {len(ds['train'])}")
print(f"Test split size: {len(ds['test'])}")

# Peek at the first example to understand the data format
print(ds["train"][0])

DatasetDict({
    train: Dataset({
        features: ['full_topic', 'messages', 'chat_format_messages'],
        num_rows: 2260
    })
    test: Dataset({
        features: ['full_topic', 'messages', 'chat_format_messages'],
        num_rows: 119
    })
})
{'full_topic': Value(dtype='string', id=None), 'messages': [{'content': Value(dtype='string', id=None), 'role': Value(dtype='string', id=None)}], 'chat_format_messages': Value(dtype='string', id=None)}
Train split size: 2260
Test split size: 119
{'full_topic': 'Travel/Vacation destinations/Beach resorts', 'messages': [{'content': 'Hi there', 'role': 'user'}, {'content': 'Hello! How can I help you today?', 'role': 'assistant'}, {'content': "I'm looking for a beach resort for my next vacation. Can you recommend some popular ones?", 'role': 'user'}, {'content': "Some popular beach resorts include Maui in Hawaii, the Maldives, and the Bahamas. They're known for their beautiful beaches and crystal-clear waters.", 'role': 'assistant'}, {'c

In [9]:
num_epochs = 3
max_steps = len(ds["train"]) // 4 * num_epochs  # 4 = per_device_train_batch_size used below
print(f"Calculated max_steps: {max_steps}")

Calculated max_steps: 1695


In [10]:
# Configure the SFTTrainer
sft_config = SFTConfig(
    output_dir="./sft_output",
    max_steps=max_steps,  # Adjust based on dataset size and desired training duration
    per_device_train_batch_size=4,  # Set according to your GPU memory capacity
    learning_rate=5e-5,  # Common starting point for fine-tuning
    logging_steps=10,  # Frequency of logging training metrics
    save_steps=100,  # Frequency of saving model checkpoints
    evaluation_strategy="steps",  # Evaluate the model at regular intervals
    eval_steps=50,  # Frequency of evaluation
    use_mps_device=(device == "mps"),  # Train on the Apple Silicon (MPS) backend when it was selected above
    hub_model_id=finetune_name,  # Set a unique name for your model

)

# Pre-process: Extract the `chat_format_messages` column
train_dataset = ds["train"].map(lambda x: {"text": x["chat_format_messages"]}, remove_columns=ds["train"].column_names)
eval_dataset = ds["test"].map(lambda x: {"text": x["chat_format_messages"]}, remove_columns=ds["test"].column_names)

# Initialize the SFTTrainer
trainer = SFTTrainer(
    model=model,
    args=sft_config,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
    eval_dataset=eval_dataset,
)




### Training the Model
With the trainer configured, we can now proceed to train the model. The training process will involve iterating over the dataset, computing the loss, and updating the model's parameters to minimize this loss.

In [11]:
import os

os.environ["WANDB_MODE"] = "disabled"

In [12]:
# Train the model
trainer.train()

# Save the model
trainer.save_model(f"./{finetune_name}")



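| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 50   | 1.0462        | 1.139049        |
| 100  | 1.0905        | 1.103094        |
| 150  | 1.0428        | 1.075026        |
| 200  | 1.0288        | 1.06033         |
| 250  | 1.0216        | 1.051065        |
| 300  | 1.0117        | 1.041757        |
| 350  | 0.9835        | 1.035299        |
| 400  | 0.9877        | 1.032088        |
| 450  | 1.0016        | 1.023203        |
| 500  | 1.0562        | 1.013396        |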
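For a causal language model, the reported loss is normally a mean per-token cross-entropy in nats, so exponentiating it gives perplexity, which can be easier to interpret. A small sketch under that assumption, using the first and last validation losses from the log above (`perplexity` is a hypothetical helper):

```python
import math


def perplexity(mean_cross_entropy):
    """Hypothetical helper: convert a mean per-token loss (in nats) to perplexity."""
    return math.exp(mean_cross_entropy)


# step -> validation loss, copied from the training log above
validation_loss = {50: 1.139049, 500: 1.013396}

for step, loss in validation_loss.items():
    print(f"step {step}: perplexity ~ {perplexity(loss):.2f}")
```

A lower perplexity at step 500 than at step 50 reflects the same improvement as the falling validation loss, expressed on a different scale.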


### Generating with the Fine-Tuned Model

Next we load the fine-tuned model and generate a response to the same prompt used with the base model, so the two outputs can be compared directly.
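Because `generate` returns the prompt tokens followed by the new tokens, the decoded string contains the prompt as well as the reply. One way to isolate the reply, sketched here on plain strings, is to keep the text after the last `assistant` role line. `extract_reply` is a hypothetical helper and assumes role names appear on their own lines once special tokens are skipped; it would misfire if the reply itself contained such a line. Slicing the generated token IDs by the prompt length before decoding is a common alternative.

```python
def extract_reply(decoded, role_marker="assistant"):
    """Hypothetical helper: return the text after the last role marker line."""
    marker = f"\n{role_marker}\n"
    _, found, reply = decoded.rpartition(marker)
    # If the marker is missing, fall back to the full text
    return reply.strip() if found else decoded.strip()


decoded = "user\nWrite a haiku about programming\nassistant\nHello! How can I help you today?"
print(extract_reply(decoded))
```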


In [13]:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the fine-tuned model
fine_tuned_model_path = f"./{finetune_name}"
fine_tuned_model = AutoModelForCausalLM.from_pretrained(fine_tuned_model_path).to(device)
tokenizer = AutoTokenizer.from_pretrained(fine_tuned_model_path)

In [18]:
# Test the fine-tuned model on the same prompt used for the base model
prompt = "Write a haiku about programming"

# Format with template
messages = [{"role": "user", "content": prompt}]
formatted_prompt = tokenizer.apply_chat_template(messages, tokenize=False)

# Generate response
inputs = tokenizer(formatted_prompt, return_tensors="pt").to(device)
outputs = fine_tuned_model.generate(**inputs, max_new_tokens=100)

In [24]:
# Decode and print the response
print("After fine-tuning:")
tokenizer.decode(outputs[0], skip_special_tokens=True)

"user\nWrite a haiku about programming\nassistant\nHello! How can I help you today? I'm a language model and I'm looking for help with programming. What programming language are you comfortable with? I'd be happy to help you learn. What programming language are you comfortable with? Python, Java, or something else? I can help you learn it. What programming language are you comfortable with? Python, Java, or something else? I can help you learn it. What programming language are you comfortable with? Python,"

## bigcode/the-stack-smol

In [4]:
# Name under which the fine-tuned model is saved and/or uploaded
finetune_name = "SmolLM2-FT-stack-smol"
finetune_tags = ["stack-smol", "module_1"]

### Dataset Preparation

In [15]:
# Load a sample dataset
from datasets import load_dataset

ds = load_dataset(path="bigcode/the-stack-smol")

# Filter for Python files
ds = ds.filter(lambda example: example['lang'] == 'Python')


In [16]:
len(ds['train'])

10000

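The `process_dataset` function below turns each file into a completion example: the user message contains the first few lines of the file, and the assistant message contains the full file wrapped in a fenced code block.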

In [17]:
import random

def process_dataset(sample):

    # Get the content and create a random starting portion for the user
    content = sample['content']
    content_lines = content.splitlines()

    # Randomly select a number of lines for the user message
    num_lines = random.randint(1, max(1, len(content_lines) - 1))  # At least 1 line, but not the full content
    user_message = "\n".join(content_lines[:num_lines])  # User gets the first `num_lines` lines

    assistant_message = f"```{sample['lang']}\n{content}\n```"

    # Build chat format
    messages = [
        {"role": "user", "content": user_message},
        {"role": "assistant", "content": assistant_message}
    ]

    # Apply template
    sample['chat_format_messages'] = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=False  # Avoid adding duplicate prompts
    )
    return sample
# Shuffle with a fixed seed and keep 5,000 examples
ds["train"] = ds["train"].shuffle(seed=42).select(range(5000))
ds = ds.map(process_dataset)
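The `random.randint` call in `process_dataset` draws from the global generator, so the prompts differ from run to run. A minimal sketch of the same prefix split with a seeded `random.Random` instance, which makes the result repeatable; `split_prefix` is a hypothetical helper written for this illustration:

```python
import random


def split_prefix(content, rng):
    """Hypothetical helper: return the first few lines of `content` as a prompt."""
    lines = content.splitlines()
    # Pick between 1 line and all but the last line (a single-line file is kept whole)
    num_lines = rng.randint(1, max(1, len(lines) - 1))
    return "\n".join(lines[:num_lines])


rng = random.Random(42)  # Local generator, so other code using `random` is unaffected
source = "import os\n\ndef main():\n    print(os.getcwd())\n\nmain()"
prefix = split_prefix(source, rng)
print(prefix)
```

Creating a fresh `random.Random(42)` before processing yields the same sequence of prefixes each time, which helps when comparing training runs.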


### Configuring the SFTTrainer

In [18]:
# Split the train dataset into train and test sets (e.g., 80% train, 20% test)
if "test" not in ds:
    ds = ds["train"].train_test_split(test_size=0.2, seed=42)

# Inspect the dataset structure and metadata
print(ds)

# Display dataset features
print(ds["train"].features)

# Check the number of examples in the train and test splits
print(f"Train split size: {len(ds['train'])}")
print(f"Test split size: {len(ds['test'])}")

# Peek at a few examples to understand the data format
print(ds["train"][0])
print(ds["test"][0])

DatasetDict({
    train: Dataset({
        features: ['content', 'avg_line_length', 'max_line_length', 'alphanum_fraction', 'licenses', 'repository_name', 'path', 'size', 'lang', 'chat_format_messages'],
        num_rows: 4000
    })
    test: Dataset({
        features: ['content', 'avg_line_length', 'max_line_length', 'alphanum_fraction', 'licenses', 'repository_name', 'path', 'size', 'lang', 'chat_format_messages'],
        num_rows: 1000
    })
})
{'content': Value(dtype='string', id=None), 'avg_line_length': Value(dtype='float64', id=None), 'max_line_length': Value(dtype='int64', id=None), 'alphanum_fraction': Value(dtype='float64', id=None), 'licenses': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), 'repository_name': Value(dtype='string', id=None), 'path': Value(dtype='string', id=None), 'size': Value(dtype='int64', id=None), 'lang': Value(dtype='string', id=None), 'chat_format_messages': Value(dtype='string', id=None)}
Train split size: 4000
Test split si

In [19]:
num_epochs = 3
train_size = len(ds["train"])

# Calculate max_steps as before
max_steps = train_size // 4 * num_epochs

# Determine eval_steps as a fraction of max_steps (e.g., every 10% of max_steps)
eval_steps = max_steps // 10  # Adjust the divisor for more/less frequent evaluations

# Determine save_steps as a fraction of max_steps (e.g., every 5% of max_steps)
save_steps = max_steps // 20  # Adjust the divisor for more/less frequent saves

# Determine logging_steps as a fraction of max_steps (e.g., every 2% of max_steps)
logging_steps = max_steps // 50  # Adjust the divisor for more/less frequent logs

print(f"Calculated max_steps: {max_steps}")
print(f"Calculated eval_steps: {eval_steps}")
print(f"Calculated save_steps: {save_steps}")
print(f"Calculated logging_steps: {logging_steps}")

Calculated max_steps: 3000
Calculated eval_steps: 300
Calculated save_steps: 150
Calculated logging_steps: 60
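Integer division can round these intervals down to zero on a small dataset; for example, `max_steps // 50` is 0 whenever `max_steps` is below 50, and an interval of zero steps is not meaningful. A hedged sketch that clamps each interval to at least one step (`step_intervals` is a hypothetical helper, not part of TRL):

```python
def step_intervals(max_steps, eval_divisor=10, save_divisor=20, logging_divisor=50):
    """Hypothetical helper: derive step intervals, never smaller than 1."""
    return {
        "eval_steps": max(1, max_steps // eval_divisor),
        "save_steps": max(1, max_steps // save_divisor),
        "logging_steps": max(1, max_steps // logging_divisor),
    }


print(step_intervals(3000))  # Same values as the cell above
print(step_intervals(30))    # Small run: intervals are clamped to at least 1
```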


In [20]:
# Configure the SFTTrainer
sft_config = SFTConfig(
    output_dir="./sft_output",
    max_steps=max_steps,  # Adjust based on dataset size and desired training duration
    per_device_train_batch_size=4,  # Set according to your GPU memory capacity
    learning_rate=5e-5,  # Common starting point for fine-tuning
    logging_steps=logging_steps,  # Frequency of logging training metrics
    save_steps=save_steps,  # Frequency of saving model checkpoints
    evaluation_strategy="steps",  # Evaluate the model at regular intervals
    eval_steps=eval_steps,  # Frequency of evaluation
    use_mps_device=(device == "mps"),  # Train on the Apple Silicon (MPS) backend when it was selected above
    hub_model_id=finetune_name,  # Set a unique name for your model

)

# Pre-process: Extract the `chat_format_messages` column
train_dataset = ds["train"].map(lambda x: {"text": x["chat_format_messages"]}, remove_columns=ds["train"].column_names)
eval_dataset = ds["test"].map(lambda x: {"text": x["chat_format_messages"]}, remove_columns=ds["test"].column_names)

# Initialize the SFTTrainer
trainer = SFTTrainer(
    model=model,
    args=sft_config,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
    eval_dataset=eval_dataset,
)




### Training the Model

In [21]:
import os

os.environ["WANDB_MODE"] = "disabled"

In [22]:
# Train the model
trainer.train()

# Save the model
trainer.save_model(f"./{finetune_name}")

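| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 300  | 1.1528        | 1.178477        |
| 600  | 1.1979        | 1.166381        |
| 900  | 1.0831        | 1.156178        |
| 1200 | 1.0779        | 1.153828        |
| 1500 | 1.0893        | 1.149063        |
| 1800 | 1.0032        | 1.146037        |
| 2100 | 1.0114        | 1.152456        |
| 2400 | 0.9731        | 1.152455        |
| 2700 | 0.9899        | 1.150881        |
| 3000 | 0.9583        | 1.150217        |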
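In the log above, the training loss keeps falling while the validation loss stops improving after step 1800, which often indicates that further steps mostly fit the training set. A small sketch that picks the evaluation step with the lowest validation loss from the logged values:

```python
# (step, validation loss) pairs copied from the training log above
validation_log = [
    (300, 1.178477),
    (600, 1.166381),
    (900, 1.156178),
    (1200, 1.153828),
    (1500, 1.149063),
    (1800, 1.146037),
    (2100, 1.152456),
    (2400, 1.152455),
    (2700, 1.150881),
    (3000, 1.150217),
]

# The row with the smallest validation loss marks the best evaluated step
best_step, best_loss = min(validation_log, key=lambda row: row[1])
print(f"Lowest validation loss {best_loss:.6f} at step {best_step}")
```

With `save_steps` set to 150, step 1800 coincides with a saved checkpoint, so that checkpoint is a reasonable candidate to evaluate alongside the final model.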


### Generating with the Fine-Tuned Model

In [23]:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the fine-tuned model
fine_tuned_model_path = f"./{finetune_name}"
fine_tuned_model = AutoModelForCausalLM.from_pretrained(fine_tuned_model_path).to(device)
tokenizer = AutoTokenizer.from_pretrained(fine_tuned_model_path)

In [24]:
# Test the fine-tuned model on a code-completion prompt
prompt = "# write a snake movement function in python"

# Format with template
messages = [{"role": "user", "content": prompt}]
formatted_prompt = tokenizer.apply_chat_template(messages, tokenize=False)

# Generate response
inputs = tokenizer(formatted_prompt, return_tensors="pt").to(device)
outputs = fine_tuned_model.generate(**inputs, max_new_tokens=100)

In [25]:
# Decode and print the response
print("After fine-tuning:")
tokenizer.decode(outputs[0], skip_special_tokens=True)

After fine-tuning:


'user\n# write a snake movement function in python\nassistant\n```Python\n# write a snake movement function in python\n\nimport random\nimport time\n\ndef snake_move(snake_x, snake_y, snake_length):\n    global speed\n    global direction\n    global direction_list\n    global direction_list_list\n    global snake_x, snake_y, snake_length\n    global direction_list, direction_list_list\n    global direction_list_list_list\n    global snake_x, snake_'