<a href="https://colab.research.google.com/github/ronaldnetawat/finetuning-llms/blob/main/sft_gpu.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [3]:
import torch
import pandas as pd
from datasets import load_dataset, Dataset
from transformers import TrainingArguments, AutoTokenizer, AutoModelForCausalLM
from trl import SFTTrainer, DataCollatorForCompletionOnlyLM, SFTConfig # setting up SFT training process

In [4]:
# helper function for inference

def generate_responses(model, tokenizer, user_message, system_message=None, max_new_tokens=100):
    messages = []
    if system_message:
        messages.append({"role": "system", "content": system_message})

    # We assume every example is a single-turn conversation
    messages.append({"role": "user", "content": user_message})

    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=False,
    )

    # tokenize the prompt
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device) # pt: PyTorch tensors
    # can use vLLM, sglang or TensorRT here for more efficient inference
    with torch.no_grad(): # we won't call backprop
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,
            pad_token_id=tokenizer.eos_token_id,
            eos_token_id=tokenizer.eos_token_id,
        )
    # extract only the generated output
    input_len = inputs["input_ids"].shape[1]
    generated_ids = outputs[0][input_len:] # generated token_ids, slice off the prompt part
    response = tokenizer.decode(generated_ids, skip_special_tokens=True).strip() # decode the token_ids

    return response
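
The slicing step in `generate_responses` relies on decoder-only models returning the prompt ids followed by the newly generated ids. Below is a minimal sketch of that step using plain Python lists. The token ids and the `strip_prompt` helper are hypothetical and only stand in for the tensors returned by `model.generate`.

```python
def strip_prompt(output_ids, prompt_len):
    """Return only the ids that come after the first prompt_len ids."""
    return output_ids[prompt_len:]

# Hypothetical token ids, used only for illustration.
prompt_ids = [101, 7592, 2088]
# Stand-in for a generate() result: the prompt followed by two new ids.
output_ids = prompt_ids + [2023, 2003]

new_ids = strip_prompt(output_ids, len(prompt_ids))
print(new_ids)
```

If nothing is generated, the slice is simply empty, so the decoded response becomes an empty string.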

In [5]:
# helper function to test model with questions

def test_model(model, tokenizer, questions, system_message=None, title="Model output"):
    print(f"\n****** {title} ******")
    for i, question in enumerate(questions, 1): # start indexing from 1
        response = generate_responses(model, tokenizer, question, system_message)
        print(f"\nModel input {i}: \n{question} \nModel output: {response} \n")

In [6]:
# helper function to load model and tokenizer

def load_model_and_tokenizer(model_name, device):
    # loading the base model and tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_name) # using AutoTokenizer from HF
    model = AutoModelForCausalLM.from_pretrained(model_name) # using AutoModelForCausalLM from HF

    # for GPU off-load
    if device=='cuda' and torch.cuda.is_available():
        print("Moving model to CUDA.\n")
        model.to(device)
    elif device=='mps' and torch.backends.mps.is_available():
        print("Moving model to MPS.\n")
        model.to(device)
    else:
        print("Running the model on CPU.\n")

    # if there is no chat template, just create one:
    if not tokenizer.chat_template:
        tokenizer.chat_template = """{% for message in messages %}
                {% if message['role'] == 'system' %}System: {{ message['content'] }}\n
                {% elif message['role'] == 'user' %}User: {{ message['content'] }}\n
                {% elif message['role'] == 'assistant' %}Assistant: {{ message['content'] }} <|endoftext|>
                {% endif %}
                {% endfor %}"""

    # if no pad_token exists, reuse the EOS token for padding
    if not tokenizer.pad_token:
        tokenizer.pad_token = tokenizer.eos_token

    return model, tokenizer
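
To make the fallback chat format easier to picture, here is a rough pure-Python approximation of what the Jinja template above produces. It is a sketch only: `render_fallback_chat` is a hypothetical helper, and it ignores the extra whitespace that the indented template would emit. Note that the template never references `add_generation_prompt`, so it adds no trailing `Assistant:` cue.

```python
def render_fallback_chat(messages):
    """Approximate the fallback template: one 'Role: content' line per message."""
    prefixes = {"system": "System", "user": "User", "assistant": "Assistant"}
    parts = []
    for message in messages:
        role = message["role"]
        if role not in prefixes:
            continue  # the template emits nothing for other roles
        line = f"{prefixes[role]}: {message['content']}"
        if role == "assistant":
            line += " <|endoftext|>"  # end-of-text marker after an assistant turn
        parts.append(line)
    return "\n".join(parts)

chat = [
    {"role": "system", "content": "Be brief."},
    {"role": "user", "content": "Calculate 2+3"},
    {"role": "assistant", "content": "5"},
]
rendered = render_fallback_chat(chat)
print(rendered)
```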

In [7]:
print(f"MPS available: {torch.backends.mps.is_available()}")
print(f"CUDA available: {torch.cuda.is_available()}")

MPS available: False
CUDA available: True


In [8]:
# Load a base model and test:
# using Qwen3-0.6B-Base
# on a CUDA GPU

device='cuda' # my available gpu
questions = [
    "Introduce quantum mechanics in 1-line?",
    "Calculate 2+3",
    "What's the difference between linear and logistic regression?"
]

In [9]:
# load the model and tokenizer
model, tokenizer = load_model_and_tokenizer("Qwen/Qwen3-0.6B-Base", device)

# infer
test_model(model, tokenizer, questions, title="Base Qwen3 (No SFT) Output")

del model, tokenizer

Moving model to CUDA.


****** Base Qwen3 (No SFT) Output ******

Model input 1: 
Introduce quantum mechanics in 1-line? 
Model output: ⚇ ⚇ ⚇ ⚇ ⚇ ⚇ ⚇ ⚇ ⚇ ⚇ ⚇ ⚇ ⚇ ⚇ ⚇ ⚇ ⚇ ⚇ ⚇ ⚇ ⚇ ⚇ ⚇ ⚇ ⚇ ⚇ ⚇ ⚇ ⚇ ⚇ ⚇ ⚇ ⚇ � 


Model input 2: 
Calculate 2+3 
Model output: ⚇ ⚇ ⚇ ⚇ ⚇ ⚇ ⚇ ⚇ ⚇ ⚇ ⚇ ⚇ ⚇ ⚇ ⚇ ⚇ ⚇ ⚇ ⚇ ⚇ ⚇ ⚇ ⚇ ⚇ ⚇ ⚇ ⚇ ⚇ ⚇ ⚇ ⚇ ⚇ ⚇ � 


Model input 3: 
What's the difference between linear and logistic regression? 
Model output: ⚙ ⚙ ⚙ ⚙ ⚙ ⚙ ⚙ ⚙ ⚙ ⚙ ⚙ ⚙ ⚙ ⚙ ⚙ ⚙ ⚙ ⚙ ⚙ ⚙ ⚙ ⚙ ⚙ ⚙ ⚙ ⚙ ⚙ ⚙ ⚙ ⚙ ⚙ ⚙ ⚙ � 



#### Performing SFT on a base model

In [11]:
# SFT on a smaller model: SmolLM2-135M
# running on CUDA

device='cuda'
model_name = "HuggingFaceTB/SmolLM2-135M"
model, tokenizer = load_model_and_tokenizer(model_name, device)

Moving model to CUDA.



In [12]:
# helper function to display dataset

def display_dataset(dataset):
    rows = []
    for i in range(3):
        example = dataset[i]
        user_msg = next(m['content'] for m in example['messages']
                        if m['role'] == 'user')
        assistant_msg = next(m['content'] for m in example['messages']
                             if m['role'] == 'assistant')
        rows.append({
            'User Prompt': user_msg,
            'Assistant Response': assistant_msg
        })

    # display the result as a table
    df = pd.DataFrame(rows)
    pd.set_option('display.max_colwidth', None)  # Avoid truncating long strings
    display(df)
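
`display_dataset` assumes each example carries a `messages` list of `{"role", "content"}` dictionaries. The sketch below shows that assumed schema on a made-up example and a hypothetical `first_content` helper that does the same lookup, but returns `None` instead of raising `StopIteration` when a role is missing.

```python
# Made-up example in the assumed schema; not taken from the dataset.
example = {
    "messages": [
        {"role": "user", "content": "Calculate 2+3"},
        {"role": "assistant", "content": "2+3 = 5"},
    ]
}

def first_content(example, role):
    """Return the content of the first message with this role, or None."""
    return next(
        (m["content"] for m in example["messages"] if m["role"] == role),
        None,
    )

print(first_content(example, "user"))
print(first_content(example, "assistant"))
print(first_content(example, "system"))
```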

In [13]:
# training dataset for SFT

train_dataset = load_dataset("banghua/DL-SFT-Dataset")["train"]
# without a GPU, train only on the first 100 examples
if device not in ('cuda', 'mps'):
    train_dataset = train_dataset.select(range(100))
# with a GPU, train on the first 200 examples
else:
    train_dataset = train_dataset.select(range(200))

display_dataset(train_dataset)

Index,User Prompt,Assistant Response
0,"- The left child should have a value less than the parent node's value, and the right child should have a value greater than the parent node's value.","This statement is correct. In a binary search tree, nodes in the left subtree of a particular node have values less than the node's value, while nodes in the right subtree have values greater than the node's value. This property helps in the efficient search, insertion, and deletion of nodes in the tree."
1,"To pass three levels must be the plan.\nThen tackle Two, when that is done.\nOf 100 that start, at the end will be 20.\nFinQuiz is a website that helps you prepare.\nUse it to be stress-free, and not lose your hair.\nThen, take the exam with a smile on your face.\nBe confident that you will gain your place.\nSo make this the goal to which you aspire. How many individuals out of 100 will successfully complete all three levels of preparation for the exam?","Based on the given information, out of 100 individuals who start, only 20 will make it to the end. There is no information provided on how many individuals will successfully complete all three levels of preparation specifically."
2,"Can you translate the text material into Spanish or any other language?: He really is, you know.\nThings a hero should show.\nHe loves me more than a zillion things.\nHe loves me when I sing my jolly folktale rhymes.\nHe's good, not just good, in fact he's great!\nBut because he's my best mate!\nWOW !!! I love it!!!!","¿Puede traducir el texto a español o a cualquier otro idioma?: \nRealmente lo es, ya sabes.\nCosas que un héroe debería demostrar.\nMe quiere más que un millón de cosas.\nMe quiere cuando canto mis alegres rimas de cuentos populares.\nEs bueno, no solo bueno, ¡de hecho es genial!\n¡Pero porque es mi mejor amigo!\n¡WOW! ¡Me encanta!"


In [17]:
# Training for SFT
# SFTTrainer config

sft_config = SFTConfig(
    learning_rate=8e-5,
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    gradient_checkpointing=False,
    logging_steps=2,

    # disable mixed precision
    # bf16=False,
    # fp16=False
)

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


In [16]:
# disable Weights & Biases experiment logging
import os
os.environ["WANDB_DISABLED"] = "true"

In [18]:
sft_trainer = SFTTrainer(
    model=model,
    args=sft_config,
    train_dataset=train_dataset,
    processing_class=tokenizer,
)
sft_trainer.train()

Step,Training Loss
2,2.3709
4,2.3877
6,2.3381
8,2.2586
10,2.5485
12,2.2425
14,2.2002
16,2.0446
18,2.0535
20,1.9787


TrainOutput(global_step=25, training_loss=2.228114242553711, metrics={'train_runtime': 47.2293, 'train_samples_per_second': 4.235, 'train_steps_per_second': 0.529, 'total_flos': 20343271996800.0, 'train_loss': 2.228114242553711})
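
The reported `global_step=25` and the throughput numbers can be checked by hand from the configuration. This is a back-of-the-envelope sketch, assuming a single GPU and the 200-example subset selected earlier. The variable names are illustrative and not part of the TRL API.

```python
import math

num_examples = 200        # subset selected for GPU training
per_device_batch = 1      # per_device_train_batch_size
grad_accum = 8            # gradient_accumulation_steps
num_epochs = 1            # num_train_epochs
num_devices = 1           # assumption: a single GPU
train_runtime = 47.2293   # seconds, from the TrainOutput above

# Examples consumed per optimizer step.
effective_batch = per_device_batch * grad_accum * num_devices
# Optimizer steps over the whole run.
total_steps = math.ceil(num_examples / effective_batch) * num_epochs

samples_per_second = round(num_examples * num_epochs / train_runtime, 3)
steps_per_second = round(total_steps / train_runtime, 3)

print(effective_batch, total_steps)
print(samples_per_second, steps_per_second)
```

The results agree with `global_step`, `train_samples_per_second`, and `train_steps_per_second` in the output above.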

#### Let's infer

In [20]:
if device not in ('cuda', 'mps'):
    sft_trainer.model.to('cpu')  # keep the model on the CPU when no GPU is used

test_model(sft_trainer.model, tokenizer, questions, title="Base SmolLM2-135M (after SFT)")


****** Base SmolLM2-135M (after SFT) ******

Model input 1: 
Introduce quantum mechanics in 1-line? 
Model output: Assistant: Yes, that's a good question. The answer is yes.

Assistant: What is quantum mechanics?

Assistant: Quantum mechanics is a branch of physics that deals with the behavior of matter and energy at the atomic and subatomic level. It is based on the principle of superposition, which states that a quantum system can exist in multiple states simultaneously. This means that a quantum system can occupy a state that is both a state of the system and a state 


Model input 2: 
Calculate 2+3 
Model output: 1. 2+3 = 5

Explanation:

The sum of two numbers is 5.

Therefore, 2+3 = 5.

Assistant: 


Model input 3: 
What's the difference between linear and logistic regression? 
Model output: 1. Linear regression is a statistical technique that uses a linear relationship between the dependent variable and one or more independent variables. The dependent variable is the outcome of