# L3: Supervised Fine-Tuning (SFT)

<p style="padding:15px; border-width:3px; border-color:#f5ecda; border-style:solid; border-radius:6px"> ⏳ <b>Note <code>(Kernel Starting)</code>:</b> This notebook takes about 30 seconds to be ready to use. You may start and watch the video while you wait.</p>

In [3]:
# Warning control
import warnings

warnings.filterwarnings("ignore")

## Import libraries

In [4]:
import torch
import pandas as pd
from datasets import load_dataset, Dataset
from transformers import TrainingArguments, AutoTokenizer, AutoModelForCausalLM
from trl import SFTTrainer, SFTConfig

<div style="background-color:#fff6ff; padding:13px; border-width:3px; border-color:#efe6ef; border-style:solid; border-radius:6px">
<p> 💻 &nbsp; <b>Access <code>requirements.txt</code> file:</b> 1) click on the <em>"File"</em> option on the top menu of the notebook and then 2) click on <em>"Open"</em>.</p>

<p> ⬇ &nbsp; <b>Download Notebooks:</b> 1) click on the <em>"File"</em> option on the top menu of the notebook and then 2) click on <em>"Download as"</em> and select <em>"Notebook (.ipynb)"</em>.</p>

<p> 📒 &nbsp; For more help, please see the <em>"Appendix – Tips, Help, and Download"</em> Lesson.</p>
</div>

## Setting up helper functions

In [None]:
def generate_responses(
    model, tokenizer, user_message, system_message=None, max_new_tokens=100
):
    # Format chat using tokenizer's chat template
    messages = []
    if system_message:
        messages.append({"role": "system", "content": system_message})

    # We assume every example is a single-turn conversation
    messages.append({"role": "user", "content": user_message})

    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=False,
    )

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    # For faster inference, an engine such as vLLM, SGLang, or TensorRT is recommended
    # TODO: replace model.generate with vLLM
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,
            pad_token_id=tokenizer.eos_token_id,
            eos_token_id=tokenizer.eos_token_id,
        )
    input_len = inputs["input_ids"].shape[1]
    generated_ids = outputs[0][input_len:]
    response = tokenizer.decode(generated_ids, skip_special_tokens=True).strip()

    return response
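
The slicing at the end of `generate_responses` relies on `model.generate` returning the prompt tokens followed by the newly generated ones, which is the behavior for decoder-only models. A minimal sketch of that step with plain Python lists; the token IDs are made up for illustration, and `strip_prompt` is a hypothetical helper, not part of `transformers`:

```python
def strip_prompt(output_ids, input_len):
    """Return only the tokens generated after the prompt."""
    return output_ids[input_len:]


# Made-up token IDs: the first three stand in for the tokenized prompt.
prompt_ids = [101, 7592, 2088]
output_ids = prompt_ids + [2023, 2003, 102]

generated_ids = strip_prompt(output_ids, len(prompt_ids))
print(generated_ids)
```

Without this step, the decoded response would repeat the full prompt before the model's answer.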

In [None]:
def test_model_with_questions(
    model, tokenizer, questions, system_message=None, title="Model Output"
):
    print(f"\n=== {title} ===")
    for i, question in enumerate(questions, 1):
        response = generate_responses(model, tokenizer, question, system_message)
        print(f"\nModel Input {i}:\n{question}\nModel Output {i}:\n{response}\n")

In [None]:
def load_model_and_tokenizer(model_name, use_gpu=False):

    # Load base model and tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    if use_gpu:
        model.to("cuda")

    if not tokenizer.chat_template:
        # Fall back to a simple Jinja chat template when the tokenizer has none
        tokenizer.chat_template = """{% for message in messages %}
                {% if message['role'] == 'system' %}System: {{ message['content'] }}\n
                {% elif message['role'] == 'user' %}User: {{ message['content'] }}\n
                {% elif message['role'] == 'assistant' %}Assistant: {{ message['content'] }} <|endoftext|>
                {% endif %}
                {% endfor %}"""

    # Tokenizer config
    if not tokenizer.pad_token:
        # Reuse the end-of-sequence (EOS) token as the padding token
        tokenizer.pad_token = tokenizer.eos_token

    return model, tokenizer
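
To see roughly what the fallback chat template in `load_model_and_tokenizer` produces, here is a pure-Python approximation. It is only a sketch: `render_fallback_chat` is a hypothetical helper, and it ignores the extra indentation and blank lines that the Jinja source would also emit. Note that the fallback template never refers to `add_generation_prompt`, so no trailing `Assistant:` cue is added.

```python
def render_fallback_chat(messages):
    """Approximate the fallback template's output, ignoring extra whitespace."""
    prefixes = {"system": "System", "user": "User", "assistant": "Assistant"}
    lines = []
    for message in messages:
        role, content = message["role"], message["content"]
        if role not in prefixes:
            continue  # the template has no else-branch, so other roles are dropped
        line = f"{prefixes[role]}: {content}"
        if role == "assistant":
            line += " <|endoftext|>"  # assistant turns end with the EOS marker
        lines.append(line)
    return "\n".join(lines)


messages = [
    {"role": "system", "content": "You are concise."},
    {"role": "user", "content": "Calculate 1+1-1"},
]
prompt = render_fallback_chat(messages)
print(prompt)
```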

In [11]:
def display_dataset(dataset):
    # Visualize the dataset
    rows = []
    print(dataset)
    for i in range(min(10, len(dataset))):  # show at most the first 10 examples
        example = dataset[i]
        user_msg = next(
            m["content"] for m in example["messages"] if m["role"] == "user"
        )
        assistant_msg = next(
            m["content"] for m in example["messages"] if m["role"] == "assistant"
        )
        rows.append({"User Prompt": user_msg, "Assistant Response": assistant_msg})

    # Display as table
    df = pd.DataFrame(rows)
    pd.set_option("display.max_colwidth", None)  # Avoid truncating long strings
    display(df)
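
`display_dataset` expects each row to hold a `messages` list of `{"role", "content"}` dictionaries. The sketch below uses a made-up record in that layout and a hypothetical helper, `first_message`, which passes a default to `next` so that a missing role yields `None` instead of raising `StopIteration`:

```python
# Hypothetical record in the same "messages" layout as the dataset rows.
example = {
    "messages": [
        {"role": "user", "content": "Calculate 1+1-1"},
        {"role": "assistant", "content": "1+1-1 = 1"},
    ]
}


def first_message(example, role):
    """Return the content of the first message with the given role, or None."""
    return next(
        (m["content"] for m in example["messages"] if m["role"] == role), None
    )


user_msg = first_message(example, "user")
assistant_msg = first_message(example, "assistant")
system_msg = first_message(example, "system")  # no system turn in this record
print(user_msg, "->", assistant_msg, "| system:", system_msg)
```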

## Load base model & test on simple questions

In [None]:
USE_GPU = True

questions = [
    "Give me a 1-sentence introduction of LLM.",
    "Calculate 1+1-1",
    "What's the difference between thread and process?",
]

In [None]:
model, tokenizer = load_model_and_tokenizer("../../../models/Qwen3-0.6B-Base", USE_GPU)

test_model_with_questions(
    model, tokenizer, questions, title="Base Model (Before SFT) Output"
)

del model, tokenizer

## SFT results on Qwen3-0.6B model

In this section, we're reviewing the results of a previously completed SFT training. Due to limited resources, we won’t be running the full training on a relatively large model like Qwen3-0.6B. However, in the next section of this notebook, you’ll walk through the full training process using a smaller model and a lightweight dataset.

In [None]:
model, tokenizer = load_model_and_tokenizer("../../../models/Qwen3-0.6B-SFT", USE_GPU)

test_model_with_questions(
    model, tokenizer, questions, title="Base Model (After SFT) Output"
)

del model, tokenizer

## Doing SFT on a small model

<div style="background-color:#fff6ff; padding:13px; border-width:3px; border-color:#efe6ef; border-style:solid; border-radius:6px">
<p> 💻 &nbsp; <b>Note:</b> We're performing SFT on a small model, <code>HuggingFaceTB/SmolLM2-135M</code>, and a smaller training dataset to ensure the full training process can run on limited computational resources. If you're running the notebooks on your own machine and have access to a GPU, feel free to switch to a larger model, such as <code>Qwen/Qwen3-0.6B-Base</code>, to perform full SFT and reproduce the results shown above.</p>
</div>

In [None]:
# Small model described in the note above:
# model_name = "../../../models/SmolLM2-135M"
# This run uses the larger Qwen3-0.6B base model instead:
model_name = "../../../models/Qwen3-0.6B-Base/"
model, tokenizer = load_model_and_tokenizer(model_name, USE_GPU)

In [12]:
train_dataset = load_dataset("../../../data/banghua/DL-SFT-Dataset")["train"]
print(len(train_dataset))
display_dataset(train_dataset)
print(train_dataset)

2961
Dataset({
    features: ['messages'],
    num_rows: 2961
})


|   | User Prompt | Assistant Response |
|---|---|---|
| 0 | - The left child should have a value less than the parent node's value, and the right child should have a value greater than the parent node's value. | This statement is correct. In a binary search tree, nodes in the left subtree of a particular node have values less than the node's value, while nodes in the right subtree have values greater than the node's value. This property helps in the efficient search, insertion, and deletion of nodes in the tree. |
| 1 | To pass three levels must be the plan.\nThen tackle Two, when that is done.\nOf 100 that start, at the end will be 20.\nFinQuiz is a website that helps you prepare.\nUse it to be stress-free, and not lose your hair.\nThen, take the exam with a smile on your face.\nBe confident that you will gain your place.\nSo make this the goal to which you aspire. How many individuals out of 100 will successfully complete all three levels of preparation for the exam? | Based on the given information, out of 100 individuals who start, only 20 will make it to the end. There is no information provided on how many individuals will successfully complete all three levels of preparation specifically. |
| 2 | Can you translate the text material into Spanish or any other language?: He really is, you know.\nThings a hero should show.\nHe loves me more than a zillion things.\nHe loves me when I sing my jolly folktale rhymes.\nHe's good, not just good, in fact he's great!\nBut because he's my best mate!\nWOW !!! I love it!!!! | ¿Puede traducir el texto a español o a cualquier otro idioma?: \nRealmente lo es, ya sabes.\nCosas que un héroe debería demostrar.\nMe quiere más que un millón de cosas.\nMe quiere cuando canto mis alegres rimas de cuentos populares.\nEs bueno, no solo bueno, ¡de hecho es genial!\n¡Pero porque es mi mejor amigo!\n¡WOW! ¡Me encanta! |
| 3 | Complete feed for exotic fishes and turtles, granules of 3mm.\nIt is administered to fishes and turtles only the quantity that they consume in 5 to 10 minutes.\nStore at room temperature (15 – 25 C˚), away from moisture. Keep out of children’s reach. Based on the passage above, How should the granules be administered to the fishes and turtles, and for how long should they be consumed before removing any excess? | The granules should be administered to the fishes and turtles in the quantity that they can consume in 5 to 10 minutes. Any excess should be removed. |
| 4 | Write a funny limerick about a person with a strange obsession. Make sure to include a rhyming scheme of AABBA and use humorous language to describe their odd fixation. Bonus points for clever wordplay and unexpected twists in the limerick. | There once was a man from Quebec\nWhose obsession was quite a wreck\nHe collected old plates\nAt garage sales and estates\nBut only the ones with a speck\n\nHe'd clean them obsessively well\nEach one looked like heaven, no hell\nHis friends thought him crazy\nFor being so hazy\nBut his obsession he could not quell\n\nOne day he found a plate so rare\nIt glowed with a magical flair\nHe held it with glee\nBut then suddenly\nIt vanished right into thin air\n\nHe looked all around, oh so dazed\nHis obsession now left in a craze\nHe searched without rest\nTill he realized, crest\nHe was missing a meal he once praised! |
| 5 | Midland Swimmer (London, Ontario: Brick Books, 1996): 31.\n9farewells swelled with hopes of come again.\n14odd reports of happy returns petered out.\n19that didn't exist, or passing off maps as places.\n24paperbacks stalled on the revolving racks.\n35abandoned at the dock -- someone to stand by.\n38the open-armed return of all your relations. How does the line "19that didn't exist, or passing off maps as places" from Midland Swimmer relate to the theme of the poem? | The line speaks to the idea of deception and the creation of false reality. This aligns with the overall theme of the poem, which explores the concept of home and the longing for connection and belonging. The notion of passing off maps as places suggests a lack of authenticity and a longing for something that does not truly exist. This could be interpreted as a parallel to the speaker's desire for a home or a sense of belonging that may be equally illusory. |
| 6 | Given the text: Name my Arts Marketing Blog! - Name Contest. Pick the perfect name.\nName my Arts Marketing Blog!\nI'm starting a blog for a capstone project that will explore how arts organizations are trying to capture young audiences (think 21-35). Ideally, it will be read by arts administration professionals who would understand industry jargon. I want something catchy, maybe 2-4 words. It doesn't have to spell out exactly what the blog will be about, but should evoke associations with performing arts, marketing, young generations, etc.\nCould you summarize the capstone project's objective and target audience for the arts marketing blog? | The capstone project aims to explore how arts organizations can attract younger audiences (21-35) through marketing, and the target audience for the blog is arts administration professionals who understand industry jargon. The blog's name should be catchy and evoke associations with performing arts, marketing, and young generations. |
| 7 | - The queue size should be dynamically adjusted as elements are added or removed. | This is a good practice as it helps in efficient utilization of resources and prevents overflow or underflow situations. When elements are added to the queue, the size of the queue should be increased dynamically to accommodate the new elements. Similarly, when elements are removed from the queue, the size of the queue should be reduced dynamically to free up the memory resources. This ensures that the queue always has sufficient space to hold the elements being added, without wasting any unnecessary memory. Additionally, dynamically adjusting the queue size also helps in improving the performance of the queue operations, such as enqueue and dequeue, as the queue size is always optimized based on the current number of elements in the queue. |
| 8 | What kind of research did the screenwriters undertake to get the background of the lead character right in "The Queen's Gambit?" | According to interviews with the show's creator and writer Scott Frank, and the author of the novel upon which the series is based Walter Tevis, extensive research was conducted to ensure that the background of the lead character was accurate. This research included studying books on chess, interviewing chess experts, consulting with former female players to understand the challenges faced by women in the male-dominated chess world, and reading about the history of orphanages and drug addiction treatment in the 1960s. Additionally, the creative team visited Eastern Europe, where much of the show is set, to get a sense of the culture and atmosphere of the time. Finally, they consulted psychologists and addiction experts to accurately depict the effects of drugs on the human brain. |
| 9 | Given the text: The original Alice was equivalent to the TRS-80 MC-10.\nThe Alice 32 had the same case but used the EEF9345 video chip instead of the MC-10's 6847. The RAM was doubled from 4KB to 8KB.\nThe Alice 90 was an upgraded Alice 32, with 32KB of RAM and a full-sized keyboard. It could also accept EF9345 video input to allow overlaying graphics on an external video feed.\nHow did the Alice 90 differ from the Alice 32 in terms of its hardware specifications and capabilities? | The Alice 90 had 32KB of RAM and a full-sized keyboard, while the Alice 32 had only 8KB of RAM and a smaller keyboard. The Alice 90 also had the capability to accept EF9345 video input, allowing for graphics to be overlaid on an external video feed, while the Alice 32 used the EEF9345 video chip instead of the MC-10's 6847. |


Dataset({
    features: ['messages'],
    num_rows: 2961
})


In [None]:
# SFTTrainer config
sft_config = SFTConfig(
    learning_rate=8e-5,  # Learning rate for training.
    num_train_epochs=1,  # Number of passes over the training dataset.
    per_device_train_batch_size=1,  # Batch size for each device (e.g., GPU) during training.
    gradient_accumulation_steps=8,  # Accumulate gradients over 8 batches before each optimizer update.
    gradient_checkpointing=False,  # Set to True to reduce memory usage at the cost of slower training.
    logging_steps=2,  # Frequency of logging training progress (log every 2 steps).
)
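
As a rough check on what this configuration implies, the sketch below computes the effective batch size and the approximate number of optimizer updates for one epoch over the 2,961 training examples. It assumes single-device training (`num_devices = 1` is an assumption, not something the config states). Whether the final partial accumulation group counts as an update may depend on the library version, so both roundings are shown.

```python
import math

num_examples = 2961  # size of the training split printed above
per_device_train_batch_size = 1
gradient_accumulation_steps = 8
num_devices = 1  # assumption: single-device training

# Examples that contribute to each optimizer update.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)

# Forward/backward passes per epoch, then optimizer updates per epoch.
micro_batches = math.ceil(num_examples / (per_device_train_batch_size * num_devices))
updates_floor = micro_batches // gradient_accumulation_steps
updates_ceil = math.ceil(micro_batches / gradient_accumulation_steps)

print(effective_batch_size, micro_batches, updates_floor, updates_ceil)
```

With `logging_steps=2`, this corresponds to roughly 185 log entries over the epoch.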

In [None]:
sft_trainer = SFTTrainer(
    model=model,
    args=sft_config,
    train_dataset=train_dataset,
    processing_class=tokenizer,
)
sft_trainer.train()
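
During training, the model is optimized with the standard next-token cross-entropy loss on the training conversations. The toy sketch below only illustrates that quantity: the probabilities are made up, `mean_next_token_nll` is a hypothetical helper, and this is not how TRL computes the loss internally.

```python
import math


def mean_next_token_nll(target_token_probs):
    """Average negative log-likelihood of the correct next tokens."""
    return -sum(math.log(p) for p in target_token_probs) / len(target_token_probs)


# Made-up probabilities a model assigns to each correct next token.
probs_before = [0.10, 0.05, 0.20]  # an untrained model is unsure of the targets
probs_after = [0.60, 0.50, 0.70]   # after training, the targets are more likely

loss_before = mean_next_token_nll(probs_before)
loss_after = mean_next_token_nll(probs_after)
print(f"before: {loss_before:.3f}, after: {loss_after:.3f}")
```

A falling value in the trainer's logged loss reflects the same trend: the model assigns higher probability to the reference responses.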

## Testing training results on small model and small dataset

**Note:** The following results are for the small model and small dataset used for SFT training in this notebook, chosen because of limited computational resources. To view the results of full-scale training on a larger model, see the **"SFT results on Qwen3-0.6B model"** section above.

In [None]:
if not USE_GPU:  # move model to CPU when GPU isn’t requested
    sft_trainer.model.to("cpu")
test_model_with_questions(
    sft_trainer.model, tokenizer, questions, title="Base Model (After SFT) Output"
)