# Fine-Tuning Qwen3 Models on Amazon SageMaker

This notebook demonstrates the complete fine-tuning process for Qwen3 models using Amazon SageMaker's distributed training capabilities. We'll configure training parameters, execute the fine-tuning job with LoRA (Low-Rank Adaptation), and prepare the resulting model for inference.

## What This Notebook Covers

- **Training Configuration**: Set up optimized hyperparameters for memory-efficient fine-tuning
- **LoRA Implementation**: Configure Parameter-Efficient Fine-Tuning with LoRA
- **SageMaker Training Jobs**: Execute distributed training with PyTorch framework
- **Model Inference**: Test the fine-tuned model and validate Chain-of-Thought performance
- **Model Compression**: Prepare models for deployment with optimized storage

## Prerequisites

- Completed environment preparation notebook
- Model and training data uploaded to S3
- SageMaker execution role with appropriate permissions

In [None]:
%store -r

In [None]:
print(f"test_model_id : {test_model_id}")
print(f"bucket : {bucket}")
print(f"model_weight_path : {model_weight_path}")
print(f"training_input_path : {training_input_path}")
# print(f"test_input_path : {test_input_path}")
print(f"local_training_input_path : {local_training_input_path}")
# print(f"local_test_input_path : {local_test_input_path}")
print(f"registered_model : {registered_model}")

In [None]:
import sagemaker
from pathlib import Path
from time import strftime

sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

In [None]:
sagemaker.__version__

## Model Fine-Tuning Parameter Configuration

Now we're ready to configure model fine-tuning parameters. We'll use TRL's [SFTTrainer](https://huggingface.co/docs/trl/sft_trainer) to fine-tune our model. SFTTrainer simplifies supervised fine-tuning of open LLMs and is a subclass of the `transformers` `Trainer` class. 

We have prepared a script [sm_qlora_trainer.py](./src/sm_qlora_trainer.py) that loads the dataset from disk, prepares the model and tokenizer, and starts training with TRL's [SFTTrainer](https://huggingface.co/docs/trl/sft_trainer). The script reads its settings (quantization, LoRA, data, and training options) from a YAML configuration file.

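We'll save this configuration as `src/configs/qwen3-4b.yaml`. Because the `src` directory is passed to the Estimator as `source_dir`, SageMaker uploads the file to S3 together with the training script and makes it available in the container at `/opt/ml/code/configs/qwen3-4b.yaml`.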
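
To get a feel for the LoRA settings used in the configuration below (`lora_r: 64`, `lora_alpha: 16`), the following sketch computes how many trainable parameters an adapter adds to a single linear layer and the scaling factor applied to the low-rank update. The helper names and the 4096 x 4096 layer shape are hypothetical and chosen only for illustration; the real projection shapes are listed in the model's `config.json`.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one (d_out x d_in) linear layer.

    LoRA learns two matrices: A with shape (r, d_in) and B with shape (d_out, r).
    """
    return r * (d_in + d_out)


def lora_scaling(lora_alpha: float, r: int) -> float:
    """Factor applied to the low-rank update B @ A (standard LoRA scaling)."""
    return lora_alpha / r


# Values taken from the `lora` section of the YAML configuration.
LORA_R = 64
LORA_ALPHA = 16

# Hypothetical layer shape, for illustration only.
D_IN, D_OUT = 4096, 4096

added = lora_param_count(D_IN, D_OUT, LORA_R)
full = D_IN * D_OUT
print(f"adapter parameters for this layer: {added:,}")
print(f"share of the full weight matrix:   {added / full:.4f}")
print(f"update scaling (alpha / r):        {lora_scaling(LORA_ALPHA, LORA_R)}")
```

Lowering `lora_r` shrinks the adapter linearly, while `lora_alpha / r` controls how strongly the learned update contributes relative to the frozen weights.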

In [None]:
!mkdir -p src/configs

In [None]:
%%writefile src/configs/qwen3-4b.yaml
# Script basic parameters
model_name_or_path: "/opt/ml/input/data/model_weight"
train_dataset_path: "/opt/ml/input/data/training"
output_dir: "/opt/ml/checkpoints"
tokenizers_parallelism: "false"

# Model configuration - Memory optimization
model:
  load_in_4bit: true
  bnb_4bit_use_double_quant: true  # Enable double quantization
  bnb_4bit_quant_type: "nf4"
  use_bf16: false  # Use fp16 instead of bf16
  trust_remote_code: true
  low_cpu_mem_usage: true
  use_cache: false  # Disable cache to save memory
  offload_folder: "offload"  # Disk offloading configuration
  offload_state_dict: true  # State dictionary offloading

# Tokenizer configuration
tokenizer:
  trust_remote_code: true
  use_fast: true
  padding_side: "right"

# LoRA configuration - Memory optimization
lora:
  lora_alpha: 16
  lora_dropout: 0.05
  lora_r: 64  # LoRA rank; a lower value reduces memory usage
  bias: "none"
  target_modules:
    - "q_proj"
    - "k_proj"
    - "v_proj"
    - "o_proj"
    - "gate_proj"
    - "up_proj"
    - "down_proj"

# Data configuration - Memory optimization
data:
  train_path: "train_dataset.json"
  text_column: "text"
  max_seq_length: 2048
  padding: false  # Use dynamic padding
  truncation: true

# Dataset processing configuration - Memory optimization
dataset:
  preprocessing_batch_size: 50  # Small batch size for processing
  num_proc: 1
  streaming: false  # Set to true if needed to enable streaming

# Data collator configuration
data_collator:
  mlm: false
  pad_to_multiple_of: 8

# Training configuration - Memory optimization
training:
  per_device_train_batch_size: 1  # Reduced batch size
  gradient_accumulation_steps: 8  # Increased to maintain effective batch size
  learning_rate: 2.0e-3
  num_train_epochs: 5
  logging_steps: 10
  warmup_steps: 10
  optim: "adamw_torch_fused"  # Optimized optimizer
  group_by_length: true  # Group by length to minimize padding
  save_strategy: "steps"
  save_steps: 500
  save_total_limit: 1  # Reduced number of saved models
  seed: 42
  dataloader_num_workers: 0  # Reduced number of workers
  report_to: "none"  # Disable reporting
  ddp_find_unused_parameters: false
  gradient_checkpointing: true  # Enable gradient checkpointing
  max_grad_norm: 1.0

In [None]:
# from sagemaker.huggingface import HuggingFace
# import torch

training_hyperparameters={}

## Create SageMaker Training Job

To create a SageMaker training job, we use the `PyTorch` Estimator from the SageMaker Python SDK. The Estimator handles the end-to-end training workflow and manages the infrastructure: Amazon SageMaker launches and manages the required EC2 instances, provides the appropriate PyTorch container, uploads the provided scripts, and downloads the data from S3 to `/opt/ml/input/data` in the container before starting the training job.
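
Each key of the `inputs` dictionary passed to `estimator.fit()` becomes a channel, and SageMaker places that channel's data in a subdirectory of `/opt/ml/input/data`. The sketch below shows the mapping for the two channels used in this notebook (`training` and `model_weight`), which is why the YAML configuration points at `/opt/ml/input/data/training` and `/opt/ml/input/data/model_weight`. The helper functions are hypothetical and exist only to illustrate the convention.

```python
from pathlib import PurePosixPath

# Root directory under which SageMaker places each input channel in the container.
INPUT_DATA_ROOT = PurePosixPath("/opt/ml/input/data")


def channel_dir(channel_name: str) -> PurePosixPath:
    """Container directory for a channel passed to estimator.fit(inputs=...)."""
    return INPUT_DATA_ROOT / channel_name


def channel_env_var(channel_name: str) -> str:
    """Environment variable the SageMaker training toolkit sets for a channel."""
    return f"SM_CHANNEL_{channel_name.upper()}"


for name in ("training", "model_weight"):
    print(f"{name:>12} -> {channel_dir(name)}  ({channel_env_var(name)})")
```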

> Note: When using custom training scripts, you must include `requirements.txt` in the `source_dir`. It's recommended to clone the entire repository.

To use `torchrun` for script execution, define the `distribution` parameter in the Estimator and set it to `{"torch_distributed": {"enabled": True}}`. SageMaker then launches the training script on every node with a command like the following (shown here for the second of two 8-GPU nodes):

```bash
torchrun --nnodes 2 --nproc_per_node 8 --master_addr algo-1 --master_port 7777 --node_rank 1 sm_qlora_trainer.py --config /opt/ml/input/data/config/config.yaml
```
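
To see how the pieces of that command relate to the job settings, here is a small sketch that assembles the same invocation from the number of nodes, the GPUs per node, and the node rank. It is an illustration only, not SageMaker's actual launcher code; `torchrun_command` is a hypothetical helper, and the default master address and port simply mirror the example above.

```python
import shlex


def torchrun_command(num_nodes, gpus_per_node, node_rank, script, script_args,
                     master_addr="algo-1", master_port=7777):
    """Assemble a torchrun invocation for one node of a distributed job."""
    args = [
        "torchrun",
        "--nnodes", str(num_nodes),
        "--nproc_per_node", str(gpus_per_node),
        "--master_addr", master_addr,
        "--master_port", str(master_port),
        "--node_rank", str(node_rank),
        script,
        *script_args,
    ]
    return shlex.join(args)


# Single ml.g5.2xlarge instance (one GPU), with the config path used in this notebook.
print(torchrun_command(
    num_nodes=1,
    gpus_per_node=1,
    node_rank=0,
    script="sm_qlora_trainer.py",
    script_args=["--config", "/opt/ml/code/configs/qwen3-4b.yaml"],
))
```

Entries in the Estimator's `hyperparameters` dictionary are passed to the script as `--key value` arguments, which is how `--config` reaches the trainer.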

The Estimator configuration below starts a training job on one ml.g5.2xlarge instance, which has a single NVIDIA A10G GPU. To scale up, change `instance_type` to ml.p4d.24xlarge, and set `instance_count` to 2 to train on two ml.p4d.24xlarge instances. SageMaker handles the rest.
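
Scaling out also changes the effective batch size, because every GPU processes its own micro-batches under data-parallel training. The sketch below works this out for the instance types mentioned in this notebook, using `per_device_train_batch_size: 1` and `gradient_accumulation_steps: 8` from the YAML configuration. The function name is hypothetical; the GPU counts are the published values for these instance types.

```python
# GPUs per instance for the instance types referenced in this notebook.
GPUS_PER_INSTANCE = {
    "ml.g5.2xlarge": 1,
    "ml.p4d.24xlarge": 8,
    "ml.p5.48xlarge": 8,
}


def effective_batch_size(instance_type, instance_count,
                         per_device_batch_size=1, grad_accum_steps=8):
    """Samples consumed per optimizer step under data-parallel training."""
    world_size = GPUS_PER_INSTANCE[instance_type] * instance_count
    return per_device_batch_size * grad_accum_steps * world_size


for itype, count in [("ml.g5.2xlarge", 1), ("ml.p4d.24xlarge", 1), ("ml.p4d.24xlarge", 2)]:
    print(f"{count} x {itype}: effective batch size = {effective_batch_size(itype, count)}")
```

When the effective batch size grows this much, settings such as `gradient_accumulation_steps`, `learning_rate`, and `warmup_steps` may need to be revisited.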

In [None]:
instance_type = 'ml.g5.2xlarge'
# instance_type = 'ml.p4d.24xlarge'
# instance_type = 'ml.p5.48xlarge'
# instance_type = 'local_gpu'
instance_count = 1
max_run = 72*60*60

In [None]:
local_model_weight_path = f"{Path.cwd()}/{registered_model}"
local_model_weight_path

In [None]:
if instance_type =='local_gpu':
    import os
    from sagemaker.local import LocalSession

    sagemaker_session = LocalSession()
    sagemaker_session.config = {'local': {'local_code': True}}
    training = f"file://{local_training_input_path}"
    # test = f"file://{local_test_input_path}"
    model_weight = f"file://{local_model_weight_path}"
else:
    sagemaker_session = sagemaker.Session()
    training = training_input_path
    # test = test_input_path
    model_weight = model_weight_path

training, model_weight

In [None]:
from sagemaker.pytorch import PyTorch
import time
# Provisional training job name (redefined below, right before calling fit)
job_name = f'huggingface-{registered_model}-{time.strftime("%Y-%m-%d-%H-%M-%S", time.localtime())}'

# distribution={ "pytorchddp": { "enabled": True } }  # mpirun, activates SMDDP AllReduce OR AllGather
# distribution={"mpi": {"enabled": True}}
distribution={
    "torch_distributed": {
        "enabled": True,
        # "NCCL_DEBUG":"INFO"
        # "mpi": "-verbose -x NCCL_DEBUG=INFO"
    }
}  # torchrun, activates SMDDP AllGather
# distribution={ "smdistributed": { "dataparallel": { "enabled": True } } }  # mpirun, activates SMDDP AllReduce OR AllGather

environment={
    "NCCL_DEBUG" : "INFO", 
    "SM_LOG_LEVEL": "10",
}

training_hyperparameters["config"] = "/opt/ml/code/configs/qwen3-4b.yaml"
    
estimator = PyTorch(
                    entry_point='sm_lora_trainer.py',  # must match the trainer script's file name under src/
                    source_dir=f'{Path.cwd()}/src',
                    role=role,
                    # image_uri=image_uri,
                    framework_version='2.3.0',
                    py_version='py311',
                    instance_count=instance_count,
                    instance_type=instance_type,
                    distribution=distribution,
                    disable_profiler=True,
                    debugger_hook_config=False,
                    max_run=max_run,
                    hyperparameters=training_hyperparameters,
                    sagemaker_session=sagemaker_session,
                    # enable_remote_debug=True,
                    # keep_alive_period_in_seconds=1200,
                    # input_mode='FastFile'
                    # max_wait=max_run,
                    # use_spot_instances=True,
                    # subnets=['subnet-090e278f3622051c4'],
                    # security_group_ids=['sg-05baa06337a188842'],
                    max_retry_attempts=30,
                    environment=environment,
                   )

In [None]:
!sudo rm -rf src/core.*

In [None]:
current_time = strftime("%m%d-%H%M%S")  # %S = seconds; %s would be a non-portable epoch timestamp
i_type = instance_type.replace('.','-')
job_name = f'{registered_model}-{i_type}-{instance_count}-{current_time}'


if instance_type =='local_gpu':
    estimator.checkpoint_s3_uri = None
else:
    estimator.checkpoint_s3_uri = f's3://{bucket}/checkpoint/{test_model_id}/{job_name}'
    
    
estimator.fit(
    inputs={
        'training': training,
        'model_weight' : model_weight
    }, 
    job_name=job_name,
    wait=False
)

In [None]:
sagemaker_session = sagemaker.Session()
sagemaker_session.logs_for_job(job_name=job_name, wait=True)

## PEFT Model Inference

After training completion, we'll test our fine-tuned model to evaluate its Chain-of-Thought reasoning performance. This section covers loading the trained LoRA weights, merging them with the base model, and conducting inference tests.
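
Conceptually, merging folds the low-rank update into each adapted weight matrix: `W_merged = W + (lora_alpha / r) * (B @ A)`. The toy example below spells this out in plain Python for a 2 x 2 layer with a rank-1 adapter. It is only an illustration of the arithmetic, not PEFT's implementation, and all matrix values are made up.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, lora_alpha, r):
    """Return W + (lora_alpha / r) * (B @ A), the merged dense weight."""
    scale = lora_alpha / r
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy 2 x 2 layer with a rank-1 adapter (made-up values).
W = [[1.0, 0.0],
     [0.0, 1.0]]
A = [[1.0, 2.0]]      # shape (r, d_in)  = (1, 2)
B = [[0.5], [1.0]]    # shape (d_out, r) = (2, 1)

merged = merge_lora(W, A, B, lora_alpha=2, r=1)
print(merged)
```

After merging, the adapter matrices are no longer needed at inference time, so the merged model can be saved and served like an ordinary checkpoint.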

In [None]:
import sagemaker
sagemaker_session = sagemaker.Session()
train_result = sagemaker_session.describe_training_job(job_name=job_name)

In [None]:
checkpoint_s3uri = train_result['CheckpointConfig']['S3Uri']
checkpoint_s3uri

In [None]:
!aws s3 ls $checkpoint_s3uri/

In [None]:
output_dir = './checkpoints'

In [None]:
!rm -rf $output_dir
!aws s3 sync $checkpoint_s3uri $output_dir

In [None]:
!rm -rf $output_dir/checkpoint-*
!rm -rf $output_dir/compressed_model
!rm -rf $output_dir/runs

In [None]:
local_model_weight_path=f'{Path.cwd()}/{registered_model}'

In [None]:
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch
peft_model_id = output_dir

# Reload model in FP16 and merge it with LoRA weights
base_model = AutoModelForCausalLM.from_pretrained(
    local_model_weight_path,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map="auto"
)
peft_model = PeftModel.from_pretrained(base_model, peft_model_id)
peft_model = peft_model.merge_and_unload()

In [None]:
merged_save_dir = "merged_model"
peft_model.save_pretrained(merged_save_dir, safe_serialization=True, max_shard_size="2GB")

In [None]:
# Reload tokenizer to save it
tokenizer = AutoTokenizer.from_pretrained(local_model_weight_path, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
tokenizer.save_pretrained(merged_save_dir)

In [None]:
import torch
torch.cuda.empty_cache()
device = torch.cuda.current_device()

In [None]:
# Chain-of-Thought inference prompt template
inference_prompt_style = """You are an AI Assistant with advanced knowledge in reasoning, analysis, and problem-solving.
Provide the most appropriate answer to the <question>. Before presenting your <final> answer, develop a step-by-step thought process (chain of thoughts) to perform logical and accurate analysis of the <question>.

<question>
{}
</question>
### Guidelines:
- Skip unnecessary greetings or preambles, and start directly with <response>
- Do not repeat the question and answer
- Write the step-by-step thought process in sufficient detail, but keep the final answer concise

### Response Format:
<think>
    ### THINKING
    [Provide detailed step-by-step reasoning process here. Analyze the problem, consider possible approaches, and use logical reasoning to reach a conclusion.]
</think>
<final>
    ### FINAL-ANSWER
    [Present the conclusion derived from THINKING as a concise and clear final answer.]
</final>

Answer below:
<think>
"""

In [None]:
%%time
max_new_tokens = 1024

# Do not append EOS to the prompt: the model should continue after the open <think> tag
input_ids = tokenizer(
    [inference_prompt_style.format("Can you create a famous tourist route in Seoul?")], return_tensors="pt"
).input_ids

input_ids = input_ids.to(device)

outputs = peft_model.generate(input_ids, max_new_tokens=max_new_tokens)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
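
To inspect the reasoning and the answer separately, the completion can be split on the `<think>` and `<final>` tags requested in the prompt. The following is a minimal sketch, assuming the model follows that format; `split_reasoning` is a hypothetical helper, and the sample string is made up rather than real model output.

```python
import re


def split_reasoning(completion):
    """Split a completion that starts inside an open <think> block.

    Returns (thinking, final_answer); final_answer is None if no <final> tag is found.
    """
    thinking, _, rest = completion.partition("</think>")
    match = re.search(r"<final>(.*?)(?:</final>|\Z)", rest, flags=re.DOTALL)
    final_answer = match.group(1).strip() if match else None
    return thinking.strip(), final_answer


# In the notebook, the completion could be obtained by decoding only the new tokens:
#   completion = tokenizer.decode(outputs[0][input_ids.shape[1]:], skip_special_tokens=True)
sample_completion = """
    ### THINKING
    Group sights by district to keep travel time short.
</think>
<final>
    ### FINAL-ANSWER
    Start at a palace in the morning and finish at a night market.
</final>
"""

thinking, final_answer = split_reasoning(sample_completion)
print("THINKING:", thinking)
print("FINAL:", final_answer)
```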

In [None]:
import os
os.makedirs('shell', exist_ok=True)
compressed_model_path='/'.join(checkpoint_s3uri.split("/")[:-1]) + "/compressed_model"
compressed_model_path

## Fine-Tuned Model Compression (model.tar.gz)

We'll compress our fine-tuned model into a `model.tar.gz` archive for efficient storage and deployment. The archive can then be deployed to a SageMaker endpoint or used for batch inference.
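
The notebook uses a shell script with `tar` and `pigz` for this step. As an alternative for environments without `pigz`, the same archive layout (model files at the root of the tarball) can be produced with Python's standard `tarfile` module. This is a sketch under that assumption; `make_model_archive` and `list_archive` are hypothetical helpers, and the demo runs against a temporary directory rather than the real `merged_model` folder.

```python
import tarfile
import tempfile
from pathlib import Path


def make_model_archive(model_dir, archive_path):
    """Pack the contents of model_dir at the archive root as a gzip tarball."""
    model_dir = Path(model_dir)
    archive_path = Path(archive_path).resolve()
    with tarfile.open(archive_path, "w:gz") as tar:
        for item in sorted(model_dir.iterdir()):
            # Skip the archive itself if it lives inside model_dir.
            if item.resolve() == archive_path:
                continue
            tar.add(item, arcname=item.name)
    return archive_path


def list_archive(archive_path):
    """Return the sorted member names of a gzip tarball."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return sorted(tar.getnames())


# Demo on a throwaway directory; in the notebook this would be
# make_model_archive("merged_model", "model.tar.gz").
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp) / "merged_model"
    demo_dir.mkdir()
    (demo_dir / "config.json").write_text("{}")
    archive = make_model_archive(demo_dir, Path(tmp) / "model.tar.gz")
    print(list_archive(archive))
```

Python's gzip compression is single-threaded, so for multi-gigabyte checkpoints the `pigz` approach will usually be faster.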

In [None]:
%%writefile shell/finetuned_model_compression_upload.sh

# Archive the merged model; write the tarball outside the directory being archived
cd merged_model
cp ../src/requirements.txt ./
sudo rm -rf code
tar cvf - * | pigz > ../model.tar.gz
cd ..

In [None]:
%%time
!sh ./shell/finetuned_model_compression_upload.sh

In [None]:
!aws s3 cp ./model.tar.gz $compressed_model_path/finetuned/model.tar.gz

## Pre-trained Model Compression

For comparison and backup purposes, we'll also compress the original pre-trained model. This allows easy comparison between the base model and fine-tuned model performance.

In [None]:
!rm -rf $registered_model/original

In [None]:
%%writefile shell/pretrained_model_compression_upload.sh

# Assumes the pre-trained weights are in ./qwen3-4b; adjust if registered_model differs
cd qwen3-4b
tar cvf - * | pigz > ../pretrained_model.tar.gz
cd ..

In [None]:
%%time
!sh ./shell/pretrained_model_compression_upload.sh
!aws s3 cp ./pretrained_model.tar.gz $compressed_model_path/pretrained/model.tar.gz

In [None]:
%store merged_save_dir
%store checkpoint_s3uri
%store compressed_model_path