# Supervised Fine-Tuning (SFT)

This tutorial provides a comprehensive, step-by-step guide to fine-tuning a model using **Supervised Fine-Tuning (SFT)**. The explanations and examples are based on the following scripts: 

- [ds_train.py](https://github.com/microsoft/phyagi-sdk/blob/main/scripts/train/ds_train.py)
- [hf_sft_tune.py](https://github.com/microsoft/phyagi-sdk/blob/main/scripts/tune/hf_sft_tune.py)

Although these scripts are designed for different purposes (the former for training, the latter for fine-tuning), both can fine-tune a model with SFT as long as the dataset is prepared correctly.

## Structure of the tutorial

1. **Dataset preparation**: Learn how datasets are pre-defined and processed for fine-tuning.
2. **Fine-tuning workflow**: Understand the key steps and scripts used to perform fine-tuning.
   
**Note**: The code snippets provided in this guide highlight essential sections of the scripts for better understanding. For the complete implementation, refer to the linked scripts above.

## Dataset preparation

Fine-tuning requires carefully curated datasets, often stored as **JSONL** files (one JSON object per line). For preference-based tasks, each record includes:

- `chosen`: The preferred response(s) for a given prompt.
- `rejected`: The less preferred response(s) for the same prompt.

Below is an example record, pretty-printed for readability; in an actual JSONL file it would occupy a single line:

```json
{
    "chosen": [
        {"role": "user", "content": "Hi"},
        {"role": "assistant", "content": "Hello! How can I assist you today?"},
        {"role": "user", "content": "Please write the steps for founding a student-run venture capital fund"},
        {"role": "assistant", "content": "Starting a student-run venture capital (VC) fund can be an exciting and educational experience..."}
    ],
    "rejected": [
        {"role": "user", "content": "Hi"},
        {"role": "assistant", "content": "Hello! How can I assist you today?"},
        {"role": "user", "content": "Please write the steps for founding a student-run venture capital fund"},
        {"role": "assistant", "content": "1. Feasibility Study: Start with a comprehensive feasibility study that includes studying similar organizations, successful models, potential funders, legal requirements, and the operational scope of your venture capital fund..."}
    ]
}
```
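
Before any processing, it can help to check that each line of the file really follows this layout. The following is a minimal sketch using only the standard library; `parse_preference_jsonl` and the sample data are hypothetical and not part of the linked scripts.

```python
import json

REQUIRED_KEYS = ("chosen", "rejected")


def parse_preference_jsonl(lines):
    """Parse JSONL lines into preference records, validating each one."""
    records = []
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # Skip blank lines
        record = json.loads(line)
        for key in REQUIRED_KEYS:
            if key not in record:
                raise ValueError(f"Line {line_number}: missing `{key}` field.")
            # Each field is a conversation: a list of {"role", "content"} messages
            for message in record[key]:
                if set(message) != {"role", "content"}:
                    raise ValueError(f"Line {line_number}: malformed message in `{key}`.")
        records.append(record)
    return records


# Hypothetical sample: one record serialized on a single line, as JSONL requires
sample_lines = [
    json.dumps({
        "chosen": [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello!"}],
        "rejected": [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Go away."}],
    }),
]
records = parse_preference_jsonl(sample_lines)
print(len(records), records[0]["chosen"][-1]["content"])
```

With a real file, the same function could be called as `parse_preference_jsonl(open(path, encoding="utf-8"))`, since a file object iterates line by line.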

### Dataset preparation for SFT

The goal is to train a model using explicitly labeled datasets to align its outputs with human expectations for specific inputs. The dataset processing workflow for SFT involves the following steps:

1. **Extract prompts and completions**  
   - Identify the input context (`prompt`) and the desired output (`completion`).  
   - For preference-based datasets, the `chosen` responses are treated as the `completion`.  
   - Any `rejected` responses are discarded, ensuring that only positive examples are used for training.

2. **Apply chat template**  
   - A conversational formatting style (e.g., `chatml`) is applied to both the `prompt` and `completion` fields.  
   - This step standardizes the dataset into a consistent chat-style format, essential for fine-tuning conversational models.

3. **Merge fields**  
   - The `prompt` and `completion` fields are concatenated into a single `text` field.  
   - This merged structure simplifies tokenization and ensures the dataset is ready for efficient training.
   
This approach ensures that the dataset is optimized for supervised fine-tuning, enabling the model to learn from high-quality, labeled examples. 

In [1]:
from datasets import Dataset
from trl import extract_prompt

from phyagi.datasets.rl.formatting_utils import apply_chat_template

def prepare_dataset_sft(dataset: Dataset) -> Dataset:
    assert isinstance(dataset, Dataset), "`dataset` must be an instance of datasets.Dataset."

    # If the dataset has already been prepared, return it
    if "text" in dataset.column_names:
        return dataset

    # If the dataset has an implicit prompt, extract it
    if "prompt" not in dataset.column_names:
        dataset = dataset.map(extract_prompt)

    # If the dataset is a preference-based dataset, keep only the "chosen" examples
    # Rename "chosen" to "completion"
    if "chosen" in dataset.column_names:
        dataset = dataset.remove_columns("rejected").rename_column("chosen", "completion")

    # Ensure necessary columns exist
    assert "prompt" in dataset.column_names, "`prompt` must be available in the dataset."
    assert "completion" in dataset.column_names, "`completion` must be available in the dataset."

    # Apply chat template and combine prompt/completion into a single field
    dataset = dataset.map(
        apply_chat_template, fn_kwargs={"special_token_format": "chatml", "shuffle": False, "add_mask_tokens": False}
    )
    dataset = dataset.map(lambda x: {"text": x["prompt"] + x["completion"]}).remove_columns(["prompt", "completion"])

    return dataset
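
To make the three steps concrete without depending on `trl` or `phyagi`, the sketch below applies them to a single record in plain Python. It assumes that the prompt is the run of leading messages shared by `chosen` and `rejected`, and that the template follows the common ChatML layout (`<|im_start|>role ... <|im_end|>`); the actual output of `extract_prompt` and `apply_chat_template` may differ in detail. All helper names are hypothetical.

```python
def split_prompt(chosen, rejected):
    """Return (prompt, completion), treating the shared leading messages as the prompt."""
    shared = 0
    for chosen_msg, rejected_msg in zip(chosen, rejected):
        if chosen_msg != rejected_msg:
            break
        shared += 1
    return chosen[:shared], chosen[shared:]


def to_chatml(messages):
    """Render messages in a ChatML-style layout (assumed; the real template may differ)."""
    return "".join(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages)


def prepare_record(record):
    # Step 1: extract prompt and completion; the `rejected` response is discarded
    prompt, completion = split_prompt(record["chosen"], record["rejected"])
    # Steps 2 and 3: apply the template to both parts and merge them into `text`
    return {"text": to_chatml(prompt) + to_chatml(completion)}


record = {
    "chosen": [
        {"role": "user", "content": "Hi"},
        {"role": "assistant", "content": "Hello!"},
    ],
    "rejected": [
        {"role": "user", "content": "Hi"},
        {"role": "assistant", "content": "Go away."},
    ],
}
print(prepare_record(record)["text"])
```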

## Script overview

This section provides a high-level overview of the script structure and the key components involved in fine-tuning a model.

### DeepSpeed

The script starts by parsing the configuration file and any command-line arguments. This step ensures that all necessary configuration fields are properly set, including:

- `output_dir`: The directory where the model outputs will be saved.
- `dataset`: The dataset configuration, including its location and structure.
- `model`: The model configuration, including details on the pre-trained model to be fine-tuned.
- `training_args`: The parameters governing the training/fine-tuning process, such as learning rate, batch size, etc.

By validating the configuration file, the script ensures that all the required components are available for successful fine-tuning.

In [2]:
import argparse
from phyagi.utils.config import load_config

def parse_args() -> tuple[argparse.Namespace, list[str]]:
    parser = argparse.ArgumentParser()

    parser.add_argument(
        "config_file_path",
        type=str,
        nargs="*",
        help="Path to the YAML configuration file.",
    )

    args, extra_args = parser.parse_known_args()

    return args, extra_args


args, extra_args = parse_args()
args.config_file_path = "ds_sft.yaml"

config = load_config(args.config_file_path, extra_args)

assert "output_dir" in config, "`output_dir` must be available in configuration."
assert "dataset" in config, "`dataset` must be available in configuration."
assert "model" in config, "`model` must be available in configuration."
assert "training_args" in config, "`training_args` must be available in configuration."
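
The individual assertions stop at the first missing field. As a possible alternative, the sketch below collects every missing field so they can be reported together. The helper `find_missing_fields` and all values in `example_config` are hypothetical; the actual keys expected inside `dataset` and `model` are defined by the configuration file and the linked scripts.

```python
REQUIRED_FIELDS = ("output_dir", "dataset", "model", "training_args")


def find_missing_fields(config, required_fields=REQUIRED_FIELDS):
    """Return the required top-level fields that are absent from `config`."""
    return [field for field in required_fields if field not in config]


# Hypothetical configuration, with `training_args` left out on purpose
example_config = {
    "output_dir": "outputs/sft",
    "dataset": {"path": "json", "data_files": "train.jsonl"},
    "model": {"pretrained_model_name_or_path": "microsoft/phi-1"},
}

missing = find_missing_fields(example_config)
if missing:
    print(f"Missing configuration fields: {', '.join(missing)}")
```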

The model and tokenizer are loaded according to the `model` and `tokenizer` sections of the configuration. If no `pretrained_tokenizer_name_or_path` is provided, the script falls back to the tokenizer associated with the pre-trained model (`model.config.name_or_path`), which keeps the model and tokenizer compatible.

In [3]:
from phyagi.models.registry import get_model
from phyagi.models.registry import get_tokenizer

model = get_model(**config["model"])
tokenizer_config = config.get("tokenizer", {})
if tokenizer_config.get("pretrained_tokenizer_name_or_path", None) is None:
    tokenizer_config["pretrained_tokenizer_name_or_path"] = model.config.name_or_path
tokenizer = get_tokenizer(**tokenizer_config)

[phyagi] [2025-05-26 13:18:43,703] [INFO] [model.py:93:get_model] Loading pre-trained model: microsoft/phi-1
[phyagi] [2025-05-26 13:18:43,705] [INFO] [model.py:94:get_model] Model configuration: {'torch_dtype': torch.float16, 'trust_remote_code': True}
[phyagi] [2025-05-26 13:18:44,119] [INFO] [tokenizer.py:60:get_tokenizer] Loading pre-trained tokenizer: microsoft/phi-1
[phyagi] [2025-05-26 13:18:44,120] [INFO] [tokenizer.py:61:get_tokenizer] Tokenizer configuration: {'pad_token': '<|endoftext|>'}
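
The fallback rule can be isolated into a small pure function, which makes it easy to check without loading a model. This is only a sketch: `resolve_tokenizer_config` and the `custom/tokenizer` name are hypothetical, and the function copies the dictionary rather than modifying it in place as the cell above does.

```python
def resolve_tokenizer_config(tokenizer_config, model_name_or_path):
    """Fill in the tokenizer name from the model when none is configured."""
    resolved = dict(tokenizer_config)  # Copy so the caller's dictionary is not mutated
    if resolved.get("pretrained_tokenizer_name_or_path") is None:
        resolved["pretrained_tokenizer_name_or_path"] = model_name_or_path
    return resolved


# No tokenizer configured: the model name is used
print(resolve_tokenizer_config({"pad_token": "<|endoftext|>"}, "microsoft/phi-1"))

# Tokenizer configured explicitly: the configured value is kept
print(resolve_tokenizer_config({"pretrained_tokenizer_name_or_path": "custom/tokenizer"}, "microsoft/phi-1"))
```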


The training arguments are extracted from the configuration file to define key settings such as output directories and training frameworks.

In [4]:
from phyagi.trainers.registry import get_training_args

training_args = get_training_args(config["output_dir"], framework="ds", **config["training_args"])
args.checkpoint_dir = training_args.output_dir

The `get_trainer` function is responsible for initializing the training environment. It sets up the model, tokenizer, and any optional configurations, ensuring the model is ready for the fine-tuning process with the specified arguments and dataset.

**Note:** `train_dataset` is loaded with the `load_dataset` function using the dataset configuration and then processed with `prepare_dataset_sft`, as described in the section on dataset preparation. No evaluation dataset is used in this example (`eval_dataset=None`).

In [None]:
from datasets import load_dataset
from phyagi.trainers.registry import get_trainer

train_dataset = load_dataset(**config["dataset"], split="train")
train_dataset = prepare_dataset_sft(train_dataset)

trainer = get_trainer(
    model,
    framework="ds",
    training_args=training_args,
    train_dataset=train_dataset,
    eval_dataset=None,
    processing_class=tokenizer,
)

The fine-tuning process is executed by calling the `trainer.train` method. If `args.checkpoint_dir` begins with `checkpoint-` followed by a step number, it is passed as `resume_from_checkpoint` and fine-tuning continues from that checkpoint; otherwise, fine-tuning starts from the beginning.

In [None]:
import os
import re
os.environ["WANDB_MODE"] = "disabled"

trainer.train(
    resume_from_checkpoint=args.checkpoint_dir if re.match(r"checkpoint-\d+", args.checkpoint_dir) else False
)
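
Because `re.match` only matches at the start of the string, the check above succeeds when the directory is given as a bare name such as `checkpoint-500`, but not when the checkpoint sits at the end of a longer path. The sketch below reproduces that behavior and, as a hypothetical variant that is not taken from the linked scripts, tests only the last path component instead.

```python
import os
import re

CHECKPOINT_PATTERN = re.compile(r"checkpoint-\d+")


def resume_target(checkpoint_dir):
    """Mirror the snippet above: return the directory if it matches, else False."""
    return checkpoint_dir if CHECKPOINT_PATTERN.match(checkpoint_dir) else False


def resume_target_by_basename(checkpoint_dir):
    """Hypothetical variant: test only the last path component."""
    name = os.path.basename(os.path.normpath(checkpoint_dir))
    return checkpoint_dir if CHECKPOINT_PATTERN.fullmatch(name) else False


for path in ("checkpoint-500", "outputs/sft/checkpoint-500", "outputs/sft"):
    print(path, resume_target(path), resume_target_by_basename(path))
```

Whether the stricter or the path-aware check is appropriate depends on how `output_dir` is set in the configuration.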

### Hugging Face

The script starts by parsing the configuration file and any command-line arguments. This step ensures that all necessary configuration fields are properly set, including:

- `output_dir`: The directory where the model outputs will be saved.
- `dataset`: The dataset configuration, including its location and structure.
- `model`: The model configuration, including details on the pre-trained model to be fine-tuned.
- `tuning_args`: The parameters governing the fine-tuning process, such as learning rate, batch size, etc.

By validating the configuration file, the script ensures that all the required components are available for successful fine-tuning.

In [5]:
import argparse
from phyagi.utils.config import load_config

def parse_args() -> tuple[argparse.Namespace, list[str]]:
    parser = argparse.ArgumentParser()

    parser.add_argument(
        "config_file_path",
        type=str,
        nargs="*",
        help="Path to the YAML configuration file.",
    )

    args, extra_args = parser.parse_known_args()

    return args, extra_args


args, extra_args = parse_args()
args.config_file_path = "hf_sft.yaml"

config = load_config(args.config_file_path, extra_args)

assert "output_dir" in config, "`output_dir` must be available in configuration."
assert "dataset" in config, "`dataset` must be available in configuration."
assert "model" in config, "`model` must be available in configuration."
assert "tuning_args" in config, "`tuning_args` must be available in configuration."

The model and tokenizer are loaded according to the `model` and `tokenizer` sections of the configuration. If no `pretrained_tokenizer_name_or_path` is provided, the script falls back to the tokenizer associated with the pre-trained model (`model.config.name_or_path`), which keeps the model and tokenizer compatible.

In [6]:
from phyagi.models.registry import get_model
from phyagi.models.registry import get_tokenizer

model = get_model(**config["model"])
tokenizer_config = config.get("tokenizer", {})
if tokenizer_config.get("pretrained_tokenizer_name_or_path", None) is None:
    tokenizer_config["pretrained_tokenizer_name_or_path"] = model.config.name_or_path
tokenizer = get_tokenizer(**tokenizer_config)

[phyagi] [2025-05-26 13:18:52,489] [INFO] [model.py:93:get_model] Loading pre-trained model: microsoft/phi-1
[phyagi] [2025-05-26 13:18:52,492] [INFO] [model.py:94:get_model] Model configuration: {'torch_dtype': torch.float16, 'trust_remote_code': True}
[phyagi] [2025-05-26 13:18:52,884] [INFO] [tokenizer.py:60:get_tokenizer] Loading pre-trained tokenizer: microsoft/phi-1
[phyagi] [2025-05-26 13:18:52,885] [INFO] [tokenizer.py:61:get_tokenizer] Tokenizer configuration: {'pad_token': '<|endoftext|>'}


The tuning arguments are extracted from the configuration file to define key settings such as output directories, training frameworks, and task-specific parameters. The task type (e.g., SFT or DPO) is specified when retrieving the tuning arguments, which allows for task-specific configuration during fine-tuning.

In [7]:
from phyagi.rl.tuners.registry import get_tuning_args

tuning_args = get_tuning_args(config["output_dir"], framework="hf", task="sft", **config["tuning_args"])
args.checkpoint_dir = tuning_args.output_dir

INFO 05-26 13:18:55 [importing.py:53] Triton module has been replaced with a placeholder.
INFO 05-26 13:18:55 [__init__.py:239] Automatically detected platform cuda.


The `get_tuner` function is responsible for initializing the tuning environment. It sets up the model, tokenizer, and any optional Parameter-Efficient Fine-Tuning (PEFT) configurations, ensuring the model is ready for the fine-tuning process with the specified arguments and dataset.

**Note 1:** The `get_tuner` function is task-agnostic and can be used for both SFT and DPO fine-tuning, providing a unified interface for initializing the fine-tuning environment.

**Note 2:** `train_dataset` is loaded with the `load_dataset` function using the dataset configuration and then processed with `prepare_dataset_sft`, as described in the section on dataset preparation. No evaluation dataset is used in this example (`eval_dataset=None`).

**Note 3:** `peft_config` is optional and can be used to enable PEFT, a family of techniques that fine-tune only a small number of (often newly added) parameters while keeping most of the pre-trained weights frozen, which reduces memory and compute requirements.

In [None]:
from datasets import load_dataset
from phyagi.rl.tuners.registry import get_tuner

train_dataset = load_dataset(**config["dataset"], split="train")
train_dataset = prepare_dataset_sft(train_dataset)

tuner = get_tuner(
    model,
    framework="hf",
    task="sft",
    tuning_args=tuning_args,
    train_dataset=train_dataset,
    eval_dataset=None,
    processing_class=tokenizer,
    peft_config=None,
)
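
To give a sense of why PEFT reduces the cost of fine-tuning, the sketch below counts trainable parameters for LoRA, one widely used PEFT method, on a single weight matrix. LoRA is chosen here purely as an illustration; the source does not state which PEFT methods `get_tuner` supports, and the layer size and rank are hypothetical.

```python
def lora_trainable_params(d_in, d_out, rank):
    """Parameters in the two low-rank factors A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


d_in = d_out = 2048  # Hypothetical layer size
rank = 8             # Hypothetical LoRA rank

full = d_in * d_out  # Parameters updated when the full matrix is fine-tuned
lora = lora_trainable_params(d_in, d_out, rank)

print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.4%}")
```

With the `peft` library, such a setup is typically described by a `LoraConfig` object; passing one as `peft_config` is an assumption about how this tuner is used, not something confirmed here.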

The fine-tuning process is executed by calling the `tuner.train` method. If `args.checkpoint_dir` begins with `checkpoint-` followed by a step number, it is passed as `resume_from_checkpoint` and fine-tuning continues from that checkpoint; otherwise, fine-tuning starts from the beginning.

In [None]:
import os
import re
os.environ["WANDB_MODE"] = "disabled"

tuner.train(
    resume_from_checkpoint=args.checkpoint_dir if re.match(r"checkpoint-\d+", args.checkpoint_dir) else False
)