# PCAI Use Case Demo - Supervised Fine-tuning
## What is Supervised Fine-tuning?
Supervised fine-tuning (SFT) trains a pre-trained large language model (LLM) on a labeled dataset of input-output pairs. The model learns to map each input to its expected output by minimizing prediction errors, which tailors the LLM to a specific task or domain.
SFT is typically used to achieve the following objectives:
- Precise control over the model's output format and style, so that responses stay consistent.
- Adaptation to specialized domains, so that the model meets domain-specific requirements and adheres to the relevant standards.
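
To make "minimizing prediction errors" concrete, the sketch below computes the quantity that SFT typically minimizes: the average negative log-likelihood (cross-entropy) of the target tokens. The probabilities and the helper name `sft_loss` are illustrative assumptions, not values taken from a real model.

```python
import math

def sft_loss(target_token_probs):
    """Average negative log-likelihood of the target tokens.

    target_token_probs: probability the model assigned to each correct
    next token (hypothetical values, for illustration only).
    """
    return -sum(math.log(p) for p in target_token_probs) / len(target_token_probs)

# A model that is confident about the correct tokens gets a low loss ...
confident = sft_loss([0.9, 0.8, 0.95])
# ... while a model that puts little probability on them gets a high loss.
uncertain = sft_loss([0.2, 0.1, 0.3])
print(f"confident: {confident:.3f}, uncertain: {uncertain:.3f}")
```

Training pushes the loss towards the first case by raising the probability of each expected token.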

### 0. Prerequisites
**1. Set Up Jupyter Notebook Instance with GPU**<br>
Fine-tuning Large Language Models requires significant computational resources.
Please create a Jupyter Notebook instance in PCAI with the following specifications:

- 1 GPU (e.g., NVIDIA Tesla T4 or higher)
- Sufficient CPU and RAM (e.g., at least 4 vCPUs and 16 GB of RAM)
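
As a rough sizing aid, the sketch below estimates the memory needed just to hold the model weights. The parameter count (about 360 million, read off the name of the model used later in this demo) and the helper name are assumptions. Gradients, optimizer state, and activations come on top, so treat the result as a lower bound.

```python
def weight_memory_gib(num_params: int, bytes_per_param: int) -> float:
    """Memory occupied by the weights alone, in GiB."""
    return num_params * bytes_per_param / 1024**3

approx_params = 360_000_000  # assumption: approximate size of a 360M-parameter model

for label, nbytes in [("float32", 4), ("float16", 2)]:
    print(f"{label}: {weight_memory_gib(approx_params, nbytes):.2f} GiB")
```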



**2. Install Required Libraries**<br>
Before running the demo, please install the necessary libraries in your environment:

In [1]:
!pip install transformers==4.56.2 mlflow==2.20.2 boto3==1.35.40 datasets torch==2.8.0 torchvision==0.23.0 trl==0.23.0 peft==0.17.1 bitsandbytes==0.48.1 accelerate==1.10.1 aioli-sdk==1.10.0
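
After installation, it can be useful to confirm which versions ended up in the environment. The sketch below uses only the standard library. The helper name `installed_versions` is hypothetical, and the package list is a subset of the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_versions(packages):
    """Return {package: version string, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

# Package names taken from the pip command above
required = ["transformers", "mlflow", "boto3", "datasets", "torch", "trl", "peft", "accelerate"]
for name, ver in installed_versions(required).items():
    print(f"{name:15s} {ver or 'NOT INSTALLED'}")
```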



***Library Overview:***
- **Transformers:** Hugging Face's open-source library for state-of-the-art language models and NLP tools.
- **TRL (Transformer Reinforcement Learning):** Hugging Face library for post-training methods, including supervised fine-tuning and reinforcement learning.

### Step 1. Prepare Data
The supervised fine-tuning process requires a task-specific dataset structured with input-output pairs. Each pair should consist of:

- An input prompt
- The expected model response
- Any additional context or metadata

[Supported data formats](https://huggingface.co/docs/trl/dataset_formats)
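
The TRL documentation linked above describes several record layouts. As a minimal sketch, the hypothetical helper below tells three common ones apart by their keys: `messages` (conversational), `prompt` plus `completion`, and plain `text`. It is an illustration for inspecting your own data, not part of TRL.

```python
def detect_sft_format(example: dict) -> str:
    """Classify a dataset record by the keys it contains."""
    if "messages" in example:
        return "conversational"
    if "prompt" in example and "completion" in example:
        return "prompt-completion"
    if "text" in example:
        return "language-modeling"
    return "unknown"

samples = [
    {"messages": [{"role": "user", "content": "What is 2 + 2?"},
                  {"role": "assistant", "content": "4"}]},
    {"prompt": "What is 2 + 2?", "completion": "4"},
    {"text": "What is 2 + 2? 4"},
    {"message_1": "What is 2 + 2?", "message_2": "4"},  # raw layout, needs conversion
]
for sample in samples:
    print(detect_sft_format(sample))
```

This demo converts the raw `message_1`/`message_2` columns into the conversational layout.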

In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
import torch

#### Load dataset

In [3]:
dataset = load_dataset("rhgt1996/camel_math_split")
dataset

DatasetDict({
    train: Dataset({
        features: ['role_1', 'topic;', 'sub_topic', 'message_1', 'message_2'],
        num_rows: 40000
    })
    validation: Dataset({
        features: ['role_1', 'topic;', 'sub_topic', 'message_1', 'message_2'],
        num_rows: 5000
    })
    test: Dataset({
        features: ['role_1', 'topic;', 'sub_topic', 'message_1', 'message_2'],
        num_rows: 5000
    })
})

#### Convert to SFT compatible data format

In [4]:
def convert_to_message_function(example):
    prompt = {
        'messages': [
            {
                'role': 'system',
                'content': "You are a math tutor. Solve problems step by step."
            },
            {
                'role': 'user',
                'content': example['message_1']
            },
            {
                'role': 'assistant',
                'content': example['message_2']
            }
        ]
    }
    return prompt

organized_dataset = dataset.map(convert_to_message_function, remove_columns=dataset.column_names['train'])
print(organized_dataset["train"][0])

{'messages': [{'content': 'You are a math tutor. Solve problems step by step.', 'role': 'system'}, {'content': "Assuming that the student has basic knowledge of group theory, here's a question:\n\nProve that the group G of order 4 defined by the following Cayley table is isomorphic to either the cyclic group C4 or the Klein four-group V4:\n\n|  𝑒  |  𝑎  |  𝑏  |  𝑐  |\n|:---:|:---:|:---:|:---:|\n|  𝑒  |  𝑎  |  𝑏  |  𝑐  |\n|  𝑎  |  𝑒  |  𝑐  |  𝑏  |\n|  𝑏  |  𝑐  |  𝑒  |  𝑎  |\n|  𝑐  |  𝑏  |  𝑎  |  𝑒  |\n\nFind the isomorphism between G and either C4 or V4, and state which one it is.", 'role': 'user'}, {'content': "First, let's recall the definitions of the cyclic group C4 and the Klein four-group V4.\n\nC4 = {1, x, x^2, x^3} with the following Cayley table:\n\n|  1  |  x  | x^2 | x^3 |\n|:---:|:---:|:---:|:---:|\n|  1  |  x  | x^2 | x^3 |\n|  x  | x^2 | x^3 |  1  |\n| x^2 | x^3 |  1  |  x  |\n| x^3 |  1  |  x  | x^2 |\n\nV4 = {1,

### Step 2. Investigate the LLM
To fine-tune a model effectively, it's important to understand its structure and configuration. For instance, knowing how the chat template works and applying it correctly is essential for consistent, reliable results and for avoiding unexpected behavior.

#### Load the model and check Chat Template

In [5]:
device = "cuda" if torch.cuda.is_available() else "cpu"

model_name = "HuggingFaceTB/SmolLM2-360M-Instruct"
model = AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path=model_name).to(device)
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path=model_name)

In [6]:
tokenizer.chat_template

"{% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system\nYou are a helpful AI assistant named SmolLM, trained by Hugging Face<|im_end|>\n' }}{% endif %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}"

#### Inference with loaded model

In [7]:
# Test the pre-trained model before fine-tuning
prompt = "In a 90-minute soccer game, Mark played 20 minutes, then rested after. He then played for another 35 minutes. How long was he on the sideline?"

# Format with template
messages = [{'role': 'system','content': "You are a math tutor. Solve problems step by step."},{"role": "user", "content": prompt}]
formatted_prompt = tokenizer.apply_chat_template(messages, tokenize=False)
print(formatted_prompt)

<|im_start|>system
You are a math tutor. Solve problems step by step.<|im_end|>
<|im_start|>user
In a 90-minute soccer game, Mark played 20 minutes, then rested after. He then played for another 35 minutes. How long was he on the sideline?<|im_end|>
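
The formatted prompt above follows directly from the model's chat template. As a rough illustration, the sketch below re-implements that template in plain Python. The function name `render_chatml` is hypothetical, and in practice `tokenizer.apply_chat_template` should be used. The sketch also shows the effect of `add_generation_prompt`, which appends the opening `<|im_start|>assistant` header so that generation starts inside the assistant turn.

```python
DEFAULT_SYSTEM = "You are a helpful AI assistant named SmolLM, trained by Hugging Face"

def render_chatml(messages, add_generation_prompt=False):
    """Plain-Python approximation of the chat template printed earlier."""
    parts = []
    # The template injects a default system prompt when none is supplied
    if messages and messages[0]["role"] != "system":
        parts.append(f"<|im_start|>system\n{DEFAULT_SYSTEM}<|im_end|>\n")
    for message in messages:
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n")
    if add_generation_prompt:
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

demo_messages = [
    {"role": "system", "content": "You are a math tutor. Solve problems step by step."},
    {"role": "user", "content": "What is 2 + 2?"},
]
print(render_chatml(demo_messages, add_generation_prompt=True))
```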



In [8]:
# Generate response
inputs = tokenizer(formatted_prompt, return_tensors="pt").to(device)
outputs = model.generate(**inputs,max_new_tokens=500)
print("*** Before training ***")
print(tokenizer.decode(outputs[0], skip_special_tokens=False))

*** Before training ***
<|im_start|>system
You are a math tutor. Solve problems step by step.<|im_end|>
<|im_start|>user
In a 90-minute soccer game, Mark played 20 minutes, then rested after. He then played for another 35 minutes. How long was he on the sideline?<|im_end|>
<|im_start|>assistant
To solve this problem, we need to calculate the total time Mark spent on the sideline.

First, we know that Mark played for 20 minutes.

Next, we know that he rested after playing for 35 minutes.

Now, we add the time he played and the time he rested to find the total time he spent on the sideline:

20 minutes (played) + 35 minutes (rested) = 55 minutes

So, Mark spent 55 minutes on the sideline.<|im_end|>


### Step 3. Fine-tuning with LoRA

**LoRA (Low-Rank Adaptation)**<br>
LoRA is a technique for efficiently fine-tuning large language models by injecting small trainable adapters (low-rank matrices) into certain layers of a pre-trained model. Instead of updating all the model parameters, LoRA only updates these lightweight adapters, drastically reducing the number of trainable parameters and computational resources required. This makes fine-tuning much faster and more cost-effective.
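
To see why this reduces the number of trainable parameters, consider a single weight matrix W of shape d_out x d_in. LoRA keeps W frozen and learns an update (alpha / r) * B * A, where B is d_out x r and A is r x d_in. The sketch below counts the parameters in both cases. The layer shape is a hypothetical value chosen for illustration, and the helper name is not part of any library.

```python
def lora_trainable_params(d_out: int, d_in: int, r: int) -> int:
    """Parameters in the two low-rank factors B (d_out x r) and A (r x d_in)."""
    return r * (d_out + d_in)

d_out, d_in = 960, 960  # hypothetical layer shape, for illustration only
r, alpha = 4, 8         # rank and scaling factor (illustrative values)

full = d_out * d_in
lora = lora_trainable_params(d_out, d_in, r)
print(f"full update: {full:,} parameters")
print(f"LoRA update: {lora:,} parameters ({lora / full:.2%} of full)")
print(f"update scaling alpha / r = {alpha / r}")
```

A smaller rank means fewer trainable parameters, at the cost of a less expressive update.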

**PEFT (Parameter-Efficient Fine-Tuning)**<br>
PEFT is a broader category of techniques, including LoRA, that aim to fine-tune large models by updating only a small subset of parameters. Methods under PEFT (such as LoRA, adapters, prompt tuning, and others) allow users to adapt large models to new tasks with minimal computational expense and storage, making them practical for real-world applications.


In this step, we use the Hugging Face PEFT library to fine-tune a large language model with the LoRA technique.
The primary focus of this tutorial is to demonstrate the fine-tuning process for LLMs within PCAI.
For more in-depth information and advanced usage, please refer to the PEFT documentation: https://huggingface.co/docs/peft/index

In [9]:
from transformers.utils import logging
import os
logger = logging.get_logger(__name__)

#### Leverage MLflow
In PCAI, MLflow is available as part of the AI Essentials suite. We will utilize MLflow for logging training metrics and storing model artifacts.

Hugging Face supports integration with MLflow through a callback mechanism, which we will take advantage of in this demo.

To enable secure communication with MLflow, we will periodically refresh our JWT token for authentication and customize the Hugging Face callback function as needed.

In [10]:
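def renew_token(step: str = None):
    """Read the current JWT from the mounted secret and export it for MLflow and boto3."""
    with open('/etc/secrets/ezua/.auth_token', 'r') as file:
        AUTH_TOKEN = file.read().strip()  # drop any trailing newline
    os.environ['MLFLOW_TRACKING_TOKEN'] = AUTH_TOKEN
    # boto3 reads these credentials from the environment when it creates a session
    os.environ["AWS_ACCESS_KEY_ID"] = AUTH_TOKEN
    os.environ["AWS_SECRET_ACCESS_KEY"] = "s3"
    # Log only the last 20 characters so the full token never reaches the logs
    if step is not None:
        logger.info(f"AUTH_TOKEN - {step} : [{AUTH_TOKEN[-20:]}]")
    else:
        logger.info(f"AUTH_TOKEN : [{AUTH_TOKEN[-20:]}]")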

renew_token()

AUTH_TOKEN : [YqIzUgfRywDaI5Q2z9jA]


**NOTE**<br>
The MLflow Python SDK relies on the boto3 library to log artifacts. Once initialized, boto3 caches authentication details in its DEFAULT_SESSION.
If the JWT token is refreshed, this cached session can cause authentication failures. Therefore, whenever the JWT token is updated, we need to reset boto3's DEFAULT_SESSION so that it picks up the new credentials.
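
The caching problem can be reproduced without boto3. The sketch below uses a stand-in class and a module-level cache that only mimic the pattern described in the note. `FakeSession`, `get_session`, and the `DEMO_ACCESS_KEY` variable are hypothetical names, not boto3 APIs.

```python
import os

class FakeSession:
    """Stand-in for a client session that captures credentials once, at creation."""
    def __init__(self):
        self.access_key = os.environ["DEMO_ACCESS_KEY"]

DEFAULT_SESSION = None  # module-level cache, similar in spirit to boto3.DEFAULT_SESSION

def get_session():
    """Create the session on first use, then keep returning the cached one."""
    global DEFAULT_SESSION
    if DEFAULT_SESSION is None:
        DEFAULT_SESSION = FakeSession()
    return DEFAULT_SESSION

os.environ["DEMO_ACCESS_KEY"] = "token-v1"
first = get_session().access_key   # session created with token-v1

os.environ["DEMO_ACCESS_KEY"] = "token-v2"  # token refreshed in the environment
stale = get_session().access_key   # cached session still holds token-v1

DEFAULT_SESSION = None             # reset the cache
fresh = get_session().access_key   # new session picks up token-v2
print(first, stale, fresh)
```

Resetting the cached session is what forces the refreshed token to be read again.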

In [11]:
from transformers.integrations import MLflowCallback

class CustomizedMLflowCallback(MLflowCallback):
    """MLflowCallback that refreshes the JWT token before each logging and checkpoint event."""

    def on_log(self, args, state, control, logs, model=None, **kwargs):
        renew_token('on_log')
        # Forward the model that was passed in instead of discarding it
        super().on_log(args, state, control, logs, model=model, **kwargs)

    def on_save(self, args, state, control, **kwargs):
        renew_token('on_save')
        import boto3
        if boto3.DEFAULT_SESSION is not None:
            logger.info(f"boto3 : {boto3.DEFAULT_SESSION.get_credentials().access_key[-20:]}, Env : {os.environ['AWS_ACCESS_KEY_ID'][-20:]}")
            # Drop the cached session when its credentials no longer match the refreshed token
            if boto3.DEFAULT_SESSION.get_credentials().access_key != os.environ['AWS_ACCESS_KEY_ID']:
                boto3.DEFAULT_SESSION = None
                logger.info("Reset boto3 DEFAULT_SESSION so that credentials are reloaded from environment variables")

        super().on_save(args, state, control, **kwargs)

#### Setup the parameters related to Training

In [12]:
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# LoRA hyperparameters
rank_dimension = 4 # rank of the LoRA update matrices (smaller = fewer trainable parameters)
lora_alpha = 8 # scaling factor for the LoRA update (higher = stronger adaptation)
lora_dropout = 0.05 # dropout probability for LoRA layers (helps prevent overfitting)

peft_config = LoraConfig(
    r=rank_dimension,  # Rank dimension - typically between 4-32
    lora_alpha=lora_alpha,  # LoRA scaling factor - typically 2x rank
    lora_dropout=lora_dropout,  # Dropout probability for LoRA layers
    bias="none",  # Bias handling: "none" keeps all bias terms frozen during training
    target_modules="all-linear",  # Which modules to apply LoRA to
    task_type="CAUSAL_LM",  # Task type for model architecture
)

In [13]:
max_seq_length = 1024
model_dir = 'math-' + model_name.split('/')[1]
# Configure trainer
training_args = SFTConfig(
    output_dir=model_dir,
    overwrite_output_dir=True,
    max_steps=100, # Short run for the demo
    save_total_limit=5,
    # num_train_epochs=3,
    per_device_train_batch_size=4,
    learning_rate=5e-4,
    logging_steps=10,
    save_steps=100,
    eval_strategy="steps",
    eval_steps=100,
    report_to=[],
    max_length=max_seq_length,  # Maximum sequence length
)

In [14]:
# Initialize trainer
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=organized_dataset["train"],
    eval_dataset=organized_dataset["test"],
    processing_class=tokenizer,
    peft_config=peft_config,  # LoRA configuration
)
trainer.add_callback(CustomizedMLflowCallback)

Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


#### Check how the data is processed by the Trainer API

In [15]:
for batch in trainer.get_train_dataloader():
    print(batch.keys())
    print("\n")
    print(tokenizer.decode(batch['input_ids'][0],skip_special_tokens=False))
    break

dict_keys(['input_ids', 'attention_mask', 'labels'])


<|im_start|>system
You are a math tutor. Solve problems step by step.<|im_end|>
<|im_start|>user
Consider the metric space (X, d), where X is a set of real numbers and d is the usual distance function. Let f:X->X be defined by f(x) = x^2. Determine if the sequence {1, 1/2, 1/4, 1/8, ... } converges to a fixed point of f.<|im_end|>
<|im_start|>assistant
Let's first find the limit of the sequence {1, 1/2, 1/4, 1/8, ...}. This is a geometric sequence with the first term a = 1 and the common ratio r = 1/2. The formula for the nth term of a geometric sequence is:

a_n = a * r^(n-1)

As n approaches infinity, the limit of the sequence is:

lim (n -> ∞) a * r^(n-1) = lim (n -> ∞) 1 * (1/2)^(n-1)

Since the common ratio r is between -1 and 1, the limit of the sequence is 0:

lim (n -> ∞) 1 * (1/2)^(n-1) = 0

Now, let's check if this limit is a fixed point of the function f(x) = x^2. A fixed point is a point x* such that f(x*) = x*. I
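
The three keys printed above are what the model consumes during training. The sketch below shows, in plain Python, how a padding collator typically builds them: sequences are padded to the longest one in the batch, `attention_mask` marks real tokens with 1, and padded positions in `labels` are set to -100 so that the loss ignores them. This is a simplified illustration, not TRL's actual collator, and the token IDs and pad ID are made up.

```python
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips

def collate(sequences, pad_token_id=0):
    """Pad token-ID lists to equal length and build masks and labels."""
    max_len = max(len(seq) for seq in sequences)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for seq in sequences:
        n_pad = max_len - len(seq)
        batch["input_ids"].append(seq + [pad_token_id] * n_pad)
        batch["attention_mask"].append([1] * len(seq) + [0] * n_pad)
        batch["labels"].append(seq + [IGNORE_INDEX] * n_pad)
    return batch

batch = collate([[11, 12, 13, 14], [21, 22]])
for key, rows in batch.items():
    print(key, rows)
```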

#### Launch the Fine-tuning

In [16]:
trainer.train()

| Step | Training Loss | Validation Loss | Entropy | Num Tokens | Mean Token Accuracy |
|:---:|:---:|:---:|:---:|:---:|:---:|
| 100 | 0.5749 | 0.596728 | 0.592113 | 205281.0 | 0.826802 |


AUTH_TOKEN - on_log : [YqIzUgfRywDaI5Q2z9jA]
AUTH_TOKEN - on_log : [YqIzUgfRywDaI5Q2z9jA]
AUTH_TOKEN - on_log : [YqIzUgfRywDaI5Q2z9jA]
AUTH_TOKEN - on_log : [YqIzUgfRywDaI5Q2z9jA]
AUTH_TOKEN - on_log : [YqIzUgfRywDaI5Q2z9jA]
AUTH_TOKEN - on_log : [YqIzUgfRywDaI5Q2z9jA]
AUTH_TOKEN - on_log : [YqIzUgfRywDaI5Q2z9jA]
AUTH_TOKEN - on_log : [YqIzUgfRywDaI5Q2z9jA]
AUTH_TOKEN - on_log : [YqIzUgfRywDaI5Q2z9jA]
AUTH_TOKEN - on_log : [YqIzUgfRywDaI5Q2z9jA]
AUTH_TOKEN - on_log : [YqIzUgfRywDaI5Q2z9jA]
AUTH_TOKEN - on_save : [YqIzUgfRywDaI5Q2z9jA]
AUTH_TOKEN - on_log : [YqIzUgfRywDaI5Q2z9jA]


🏃 View run amusing-duck-908 at: http://mlflow.mlflow.svc.cluster.local:5000/#/experiments/0/runs/474dfdce2c9340388d1fa30a2b5bad6c
🧪 View experiment at: http://mlflow.mlflow.svc.cluster.local:5000/#/experiments/0


TrainOutput(global_step=100, training_loss=0.6047717046737671, metrics={'train_runtime': 102.0324, 'train_samples_per_second': 3.92, 'train_steps_per_second': 0.98, 'total_flos': 614190680071680.0, 'train_loss': 0.6047717046737671, 'epoch': 0.01})
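
The reported numbers can be sanity-checked with a little arithmetic. The sketch below assumes a single GPU and no gradient accumulation, so each step consumes exactly one batch of 4 samples. All input values are copied from the configuration and output above.

```python
import math

max_steps = 100
per_device_train_batch_size = 4
train_rows = 40_000
train_runtime_s = 102.0324
eval_loss = 0.596728

samples_seen = max_steps * per_device_train_batch_size
epoch_fraction = samples_seen / train_rows
samples_per_second = samples_seen / train_runtime_s
eval_perplexity = math.exp(eval_loss)  # perplexity = exp(cross-entropy loss)

print(f"samples seen:       {samples_seen}")
print(f"fraction of epoch:  {epoch_fraction:.2f}")
print(f"samples per second: {samples_per_second:.2f}")
print(f"eval perplexity:    {eval_perplexity:.3f}")
```

The epoch fraction and throughput agree with the `epoch` and `train_samples_per_second` values in `TrainOutput`, which confirms that this demo run covers only about 1% of the training split.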

<img src="../AIE1.10/assets/mlflow_metrics.png" alt="metrics in mlflow" width="800">

In [17]:
# Save the model
trainer.save_model(model_dir + 'savedir')

In [18]:
## Log artifacts to MLFLOW
import mlflow

## Get the ID of the MLflow Run that was automatically created above
last_run_id = mlflow.last_active_run().info.run_id
renew_token('Final Model Logging')
    
with mlflow.start_run(run_id=last_run_id):
    mlflow.log_params(peft_config.to_dict())
    mlflow.transformers.log_model(
        transformers_model={"model": trainer.model, "tokenizer": tokenizer},
        artifact_path=model_dir + '-savedir',  # This is a relative path to save model files within MLflow run
    )

AUTH_TOKEN - Final Model Logging : [YqIzUgfRywDaI5Q2z9jA]
Device set to use cuda:0
2025/11/14 14:54:13 INFO mlflow.transformers: Overriding save_pretrained to False for PEFT models, following the Transformers behavior. The PEFT adaptor and config will be saved, but the base model weights will not and reference to the HuggingFace Hub repository will be logged instead.
2025/11/14 14:54:14 INFO mlflow.transformers: Skipping saving pretrained model weights to disk as the save_pretrained argumentis set to False. The reference to the HuggingFace Hub repository HuggingFaceTB/SmolLM2-360M-Instruct will be logged instead.


README.md: 0.00B [00:00, ?B/s]

2025/11/14 14:54:14 INFO mlflow.transformers: A local checkpoint path or PEFT model is given as the `transformers_model`. To avoid loading the full model into memory, we don't infer the pip requirement for the model. Instead, we will use the default requirements, but it may not capture all required pip libraries for the model. Consider providing the pip requirements explicitly.
Found credentials in environment variables.


🏃 View run amusing-duck-908 at: http://mlflow.mlflow.svc.cluster.local:5000/#/experiments/0/runs/474dfdce2c9340388d1fa30a2b5bad6c
🧪 View experiment at: http://mlflow.mlflow.svc.cluster.local:5000/#/experiments/0


<img src="../AIE1.10/assets/mlflow_artifacts_1.png" alt="artifacts in mlflow" width="800">

#### Inference with LoRA Adapter

In [20]:
from peft import AutoPeftModelForCausalLM

instruct_finetuned_360 = AutoPeftModelForCausalLM.from_pretrained('./math-SmolLM2-360M-Instruct' + 'savedir', device_map="auto", dtype=torch.float16)

We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set `max_memory` in to a higher value to use more memory (at your own risk).


In [21]:
outputs = instruct_finetuned_360.generate(**inputs,max_new_tokens=500)
print("*** With LoRa Adapter ***")
print(tokenizer.decode(outputs[0], skip_special_tokens=False))

*** With LoRa Adapter ***
<|im_start|>system
You are a math tutor. Solve problems step by step.<|im_end|>
<|im_start|>user
In a 90-minute soccer game, Mark played 20 minutes, then rested after. He then played for another 35 minutes. How long was he on the sideline?<|im_end|>
<|im_start|>assistant
Mark played for 20 minutes + 35 minutes = 55 minutes on the sideline.<|im_end|>
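
Comparing generations by eye is error-prone, so a small automatic check can help. The sketch below uses a deliberately crude heuristic (it treats the last integer in a response as the final answer), and the helper name is hypothetical. The two response strings are the final sentences of the outputs shown in this notebook.

```python
import re

def last_number(text: str):
    """Return the last integer that appears in text, or None."""
    numbers = re.findall(r"\d+", text)
    return int(numbers[-1]) if numbers else None

# Time on the sideline = game length minus total playing time
expected_minutes = 90 - (20 + 35)

responses = {
    "before training": "So, Mark spent 55 minutes on the sideline.",
    "with LoRA adapter": "Mark played for 20 minutes + 35 minutes = 55 minutes on the sideline.",
}
for label, text in responses.items():
    predicted = last_number(text)
    print(f"{label}: predicted {predicted}, expected {expected_minutes}, "
          f"correct={predicted == expected_minutes}")
```

Both responses give the playing time rather than the time on the sideline, so a 100-step demo run should not be expected to improve correctness on its own. A longer run and a proper evaluation set would be needed to measure that.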
