# Supervised Fine-Tuning (SFT) at Scale with DeepSpeed

This guide provides a step-by-step workflow for supervised fine-tuning of the `Qwen/Qwen2.5-32B-Instruct` model on a multi-GPU Anyscale cluster. We use LLaMA-Factory as the training framework and DeepSpeed to manage memory efficiently and scale the training process.

**What is Supervised Fine-Tuning (SFT)?** SFT is a technique to adapt a pre-trained model to specific tasks. By showing the model high-quality examples of instructions and their desired outputs, we teach it to follow new instructions more accurately.


## Step 1: Set Up Your Environment
### Dependencies
First, we need to ensure our environment has the right libraries. We'll start with a pre-built container image and install LLaMA-Factory and DeepSpeed on top of it.

Recommended Container Image:
```bash
anyscale/ray-llm:2.48.0-py311-cu128
```

Execute the following commands to install the required packages and optional tools for experiment tracking and faster downloads.

```bash
# Install the specific version of LLaMA-Factory
pip install -q llamafactory@git+https://github.com/hiyouga/LLaMA-Factory.git@v0.9.3

# Install DeepSpeed for large-scale training
pip install -q deepspeed==0.16.9

# (Optional) For experiment tracking with Weights & Biases
pip install -q wandb==0.21.3

# (Optional) For accelerated model downloads from Hugging Face
pip install -q hf_transfer==0.1.9
```

```text
Successfully registered `llamafactory` package to be installed on all cluster nodes.
Successfully registered `deepspeed` package to be installed on all cluster nodes.
Successfully registered `wandb` package to be installed on all cluster nodes.
Successfully registered `hf_transfer` package to be installed on all cluster nodes.
View and update dependencies here: https://console.anyscale.com/cld_kvedZWag2qA8i5BjxUevf5i7/prj_cz951f43jjdybtzkx1s5sjgz99/workspaces/expwrk_v9rjackrrlnihvmvv7fqkpc3mz?workspace-tab=dependencies
```

### Model and Compute Resources

DeepSpeed ZeRO-3 partitions parameters, gradients, and optimizer states across multiple GPUs, enabling supervised fine-tuning (SFT) of 30B+ LLMs on just 4 GPUs. This setup works with GPUs such as L40S (48 GB) or A100-40G. Before proceeding, check the availability of GPUs on your cloud.

| Item | Value |
|------|-------|
| **Base model** | [`Qwen/Qwen2.5-32B-Instruct`](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct) |
| **Worker Nodes** | 4 × L40S or 4 × A100-40G |
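
As a rough sanity check on the sizing above, the sketch below estimates how much memory the bf16 weights alone occupy per GPU when ZeRO-3 partitions them evenly. The parameter count of 32e9 is a round-number assumption for a "32B" model, and the estimate ignores activations, LoRA adapter weights, optimizer state, and framework overhead, so it is a lower bound rather than a sizing guarantee.

```python
# Back-of-envelope estimate of weight memory under ZeRO-3 sharding.
# Assumptions (not from the guide): ~32e9 parameters, 2 bytes per bf16 parameter.

def sharded_weight_gib(num_params: float, bytes_per_param: int, num_gpus: int) -> float:
    """Return the GiB of model weights each GPU holds when parameters are evenly partitioned."""
    total_bytes = num_params * bytes_per_param
    return total_bytes / num_gpus / 2**30


full = sharded_weight_gib(32e9, 2, 1)      # all weights on one device
per_gpu = sharded_weight_gib(32e9, 2, 4)   # weights split across 4 GPUs
print(f"unsharded: {full:.1f} GiB, per GPU on 4 GPUs: {per_gpu:.1f} GiB")
```

Under these assumptions the unsharded weights (about 60 GiB) do not fit on a single 48 GB L40S, while a four-way shard leaves roughly 15 GiB of weights per GPU.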





## Step 2: Prepare the Dataset

### Understand the Dataset
For this tutorial, we will use [`glaive_toolcall_en_demo`](https://huggingface.co/datasets/zuol/glaive_toolcall_en_demo/tree/main), a dataset designed to teach models how to use tools (also known as function calling).

This dataset contains conversational examples where the model needs to interact with external tools. Each entry includes:
* `conversations`: A turn-by-turn log of the exchange between the user (`human`) and the assistant (`gpt`).
* `tools`: A JSON schema describing the functions the model can call.

Note: The `conversations` list may include special turns such as `function_call` (the model deciding to call a tool) and `observation` (the result returned from the tool). This structure is well suited to teaching the model multi-step tool-use behavior. To maintain role alignment in ShareGPT format, a strict turn order must be followed: `human` and `observation` turns must appear in odd-numbered positions (1, 3, 5, ...), while `gpt` and `function_call` turns must appear in even-numbered positions (2, 4, 6, ...). The model learns to generate the content of the `gpt` and `function_call` turns.

<details>
  <summary>Dataset Example</summary>

  ```json
  {
    "conversations": [
      {
        "from": "human",
        "value": "Hi, I have some ingredients and I want to cook something. Can you help me find a recipe?"
      },
      {
        "from": "gpt",
        "value": "Of course! I can help you with that. Please tell me what ingredients you have."
      },
      {
        "from": "human",
        "value": "I have chicken, bell peppers, and rice."
      },
      {
        "from": "function_call",
        "value": "{\"name\": \"search_recipes\", \"arguments\": {\"ingredients\": [\"chicken\", \"bell peppers\", \"rice\"]}}"
      },
      {
        "from": "observation",
        "value": "{\"recipes\": [{\"name\": \"Chicken and Bell Pepper Stir Fry\", \"ingredients\": [\"chicken\", \"bell peppers\", \"rice\"], \"instructions\": \"Cut the chicken into small pieces. Slice the bell peppers. Cook the rice. Stir fry the chicken and bell peppers. Serve over rice.\"}, {\"name\": \"Chicken and Rice Casserole\", \"ingredients\": [\"chicken\", \"bell peppers\", \"rice\"], \"instructions\": \"Cook the chicken and rice separately. Mix them together with the bell peppers in a casserole dish. Bake until golden brown.\"}]}"
      },
      {
        "from": "gpt",
        "value": "I found two recipes for you. The first one is \"Chicken and Bell Pepper Stir Fry\". The instructions are: Cut the chicken into small pieces. Slice the bell peppers. Cook the rice. Stir fry the chicken and bell peppers. Serve over rice. The second one is \"Chicken and Rice Casserole\". The instructions are: Cook the chicken and rice separately. Mix them together with the bell peppers in a casserole dish. Bake until golden brown. Which one would you like to try?"
      },
      {
        "from": "human",
        "value": "They both sound delicious, but I think I'll try the stir fry. Can you order the ingredients for me?"
      },
      {
        "from": "gpt",
        "value": "I'm sorry, but as an AI, I don't have the capability to perform external tasks such as ordering ingredients. However, I can help you find more recipes or provide cooking tips if you need."
      }
    ],
    "tools": "[{\"name\": \"search_recipes\", \"description\": \"Search for recipes based on ingredients\", \"parameters\": {\"type\": \"object\", \"properties\": {\"ingredients\": {\"type\": \"array\", \"items\": {\"type\": \"string\"}, \"description\": \"The ingredients to search for\"}}, \"required\": [\"ingredients\"]}}]"
  }
  ```

</details>
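
The turn-order rule described above is easy to check before training. The following sketch is a hypothetical helper, not part of LLaMA-Factory: `check_turn_order` reports turns whose role does not match their position, and `trainable_positions` lists the positions the model learns to generate. The shortened conversation is made up for illustration.

```python
# Roles allowed at odd (1, 3, 5, ...) and even (2, 4, 6, ...) positions.
ODD_ROLES = {"human", "observation"}
EVEN_ROLES = {"gpt", "function_call"}


def check_turn_order(conversations: list[dict]) -> list[str]:
    """Return a list of problems; an empty list means the turn order is valid."""
    problems = []
    for position, turn in enumerate(conversations, start=1):
        allowed = ODD_ROLES if position % 2 == 1 else EVEN_ROLES
        if turn["from"] not in allowed:
            problems.append(
                f"turn {position}: role {turn['from']!r} not in {sorted(allowed)}"
            )
    return problems


def trainable_positions(conversations: list[dict]) -> list[int]:
    """Return the 1-based positions of turns the model is trained to produce."""
    return [
        position
        for position, turn in enumerate(conversations, start=1)
        if turn["from"] in EVEN_ROLES
    ]


# Illustrative conversation (values shortened, not taken from the dataset).
example = [
    {"from": "human", "value": "I have chicken, bell peppers, and rice."},
    {"from": "function_call", "value": "{\"name\": \"search_recipes\"}"},
    {"from": "observation", "value": "{\"recipes\": []}"},
    {"from": "gpt", "value": "I could not find a matching recipe."},
]
print(check_turn_order(example))
print(trainable_positions(example))
```

Running a check like this over every record before copying the file to shared storage surfaces ordering mistakes early, instead of during preprocessing on the cluster.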

### Register the local dataset

To specify new datasets that are accessible across Ray worker nodes, you must first add all dataset files and a `dataset_info.json` to **[storage shared across nodes](https://docs.anyscale.com/configuration/storage#shared)** such as `/mnt/cluster_storage`. 

For example, to run SFT on a local copy of the `glaive_toolcall_en_demo` dataset, register it with a `dataset_info.json` entry like the following, then copy both files to shared storage:

`dataset_info.json`
```json
{
  "my_glaive_toolcall_en_demo": {
      "file_name": "glaive_toolcall_en_demo.json",
      "formatting": "sharegpt",
      "columns": {
          "messages": "conversations",
          "tools": "tools"
      }
  }
}
```
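
If you register several datasets, you may prefer to generate the registry rather than edit it by hand. The sketch below is a hypothetical helper (neither function is a LLaMA-Factory API): it writes an entry with the same fields as the JSON above and reports registered files that are missing from the dataset directory. It assumes `file_name` is resolved relative to the directory that holds `dataset_info.json`, and it uses a temporary directory in place of `/mnt/cluster_storage` so it can run anywhere.

```python
import json
import tempfile
from pathlib import Path


def register_dataset(dataset_dir: str, name: str, file_name: str) -> Path:
    """Add or update one ShareGPT-style entry in dataset_info.json; return the registry path."""
    registry_path = Path(dataset_dir) / "dataset_info.json"
    registry = json.loads(registry_path.read_text()) if registry_path.exists() else {}
    registry[name] = {
        "file_name": file_name,
        "formatting": "sharegpt",
        "columns": {"messages": "conversations", "tools": "tools"},
    }
    registry_path.write_text(json.dumps(registry, indent=2))
    return registry_path


def missing_files(dataset_dir: str) -> list[str]:
    """List registered file_name values that do not exist next to the registry."""
    root = Path(dataset_dir)
    registry = json.loads((root / "dataset_info.json").read_text())
    return [
        entry["file_name"]
        for entry in registry.values()
        if "file_name" in entry and not (root / entry["file_name"]).exists()
    ]


# Demo in a temporary directory; on the cluster this would be /mnt/cluster_storage.
with tempfile.TemporaryDirectory() as shared_dir:
    register_dataset(shared_dir, "my_glaive_toolcall_en_demo", "glaive_toolcall_en_demo.json")
    print("missing before copy:", missing_files(shared_dir))
    (Path(shared_dir) / "glaive_toolcall_en_demo.json").write_text("[]")
    print("missing after copy:", missing_files(shared_dir))
```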

For a more detailed guide to dataset preparation and formatting, see the [data preparation guide](3.1.3-data-prep-fine-tune.md).



```bash
# Make sure all files are accessible to worker nodes
# Create a copy of the data in /mnt/cluster_storage
wget https://anyscale-public-materials.s3.us-west-2.amazonaws.com/llm-finetuning/llama-factory/datasets/sharegpt/glaive_toolcall_en_demo.json -O /mnt/cluster_storage/glaive_toolcall_en_demo.json
# Create a copy of the dataset registry in /mnt/cluster_storage
cp ../dataset-configs/dataset_info.json /mnt/cluster_storage/
```

```text
--2025-09-15 08:52:01--  https://anyscale-public-materials.s3.us-west-2.amazonaws.com/llm-finetuning/llama-factory/datasets/sharegpt/glaive_toolcall_en_demo.json
Resolving anyscale-public-materials.s3.us-west-2.amazonaws.com (anyscale-public-materials.s3.us-west-2.amazonaws.com)... 3.5.77.39, 3.5.82.180, 3.5.78.195, ...
Connecting to anyscale-public-materials.s3.us-west-2.amazonaws.com (anyscale-public-materials.s3.us-west-2.amazonaws.com)|3.5.77.39|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 738925 (722K) [application/json]
Saving to: ‘/mnt/cluster_storage/glaive_toolcall_en_demo.json’
[...]
```

## Step 3: Create the Fine-Tuning Config (SFT + DeepSpeed)

Next, create the main YAML configuration file—the master recipe for our fine-tuning job. It specifies the base model, the fine-tuning method (LoRA), the dataset, training hyperparameters, cluster resources, and more.

The workspace includes this file as `qwen2.5_deepspeed_lora_sft.yaml`; its full contents appear under "LLaMA-Factory + Ray Configuration" below.

**Important notes:**
- Include the `WANDB_API_KEY` in the config to enable Weights & Biases (W&B) tracking; otherwise, you’ll encounter an error. If you don’t want to use W&B, disable it by setting `report_to: none` in the config.
- This YAML file only needs to be accessible to the head node; there’s no need to place it on storage accessible to worker nodes (e.g., `/mnt/cluster_storage/`).
- However, any paths referenced by the config (e.g., `dataset_dir`, DeepSpeed JSON, `output_dir`) must be on storage accessible to all workers (e.g., `/mnt/cluster_storage/`).
- Qwen models are generally not gated, but if you need to fine-tune a gated model such as Llama, include your `HF_TOKEN` in the config.
- We set `accelerator_type` to `L40S` in the config; depending on your cloud availability, you can switch to `A100-40G` or any other GPU with comparable or more VRAM.

### LLaMA-Factory + Ray Configuration

```yaml
# qwen2.5_deepspeed_lora_sft.yaml

### model
model_name_or_path: Qwen/Qwen2.5-32B-Instruct
trust_remote_code: true

### method
stage: sft
do_train: true
finetuning_type: lora
lora_rank: 8
lora_target: all

### deepspeed
deepspeed: /mnt/cluster_storage/ds_z3_config.json # path to the DeepSpeed config

### dataset
# local dataset
dataset: my_glaive_toolcall_en_demo
dataset_dir: /mnt/cluster_storage

template: qwen
cutoff_len: 1024
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: qwen2.5_32b_lora_sft
logging_steps: 5
save_steps: 50
plot_loss: true
report_to: wandb # or none

### train
per_device_train_batch_size: 1 # adjust this depending on your GPU memory and sequence length
gradient_accumulation_steps: 4
num_train_epochs: 3.0
learning_rate: 1.0e-4
bf16: true
lr_scheduler_type: cosine
warmup_ratio: 0.1
ddp_timeout: 180000000

### ray
ray_run_name: qwen2.5_32b_lora_sft
ray_storage_path: /mnt/cluster_storage/
ray_num_workers: 4  # Number of GPUs to use.
resources_per_worker:
  GPU: 1
  accelerator_type:L40S: 0.001            # Use this to simply specify a GPU type (not guaranteed on the same node). You can use A100-40G if L40S is not available. 
  # anyscale/accelerator_shape:4xL40S: 0.001  # Use this to specify a specific node shape.
  # See https://docs.ray.io/en/master/ray-core/accelerator-types.html#accelerator-types for a full list of accelerator types.
ray_init_kwargs:
  runtime_env:
    env_vars:
      # if using wandb for experiments tracking
      WANDB_API_KEY: <your_wandb_token>
      # if using gated models like meta-llama/Llama-3.1-8B-Instruct
      # HF_TOKEN: <your_huggingface_token>
      # if hf_transfer is installed
      HF_HUB_ENABLE_HF_TRANSFER: '1'
```

**Note:**
This configuration assumes `4xL40S` GPUs are available in your cloud environment. If not, you can substitute with `4xA100-40G` (or another supported accelerator with similar VRAM).
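
Two of the training settings above are easier to reason about with a little arithmetic: the effective batch size per optimizer step, and the learning-rate curve produced by `lr_scheduler_type: cosine` with `warmup_ratio: 0.1`. The sketch below is a plain-Python approximation of linear warmup followed by half-cosine decay, not the trainer's own code; the 60-step run length is a made-up example, and rounding the warmup step count up is an assumption.

```python
import math


def effective_batch_size(per_device: int, grad_accum: int, num_workers: int) -> int:
    """Sequences consumed per optimizer step across all data-parallel workers."""
    return per_device * grad_accum * num_workers


def cosine_lr(step: int, total_steps: int, warmup_steps: int, peak_lr: float) -> float:
    """Linear warmup to peak_lr, then half-cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Values from the YAML above: batch size 1, accumulation 4, 4 Ray workers.
batch = effective_batch_size(per_device=1, grad_accum=4, num_workers=4)
print("effective batch size:", batch)

# Hypothetical run length of 60 optimizer steps (assumption for illustration).
total = 60
warmup = math.ceil(0.1 * total)  # warmup_ratio: 0.1, rounded up (assumption)
for step in (0, warmup, total // 2, total):
    print(step, f"{cosine_lr(step, total, warmup, 1.0e-4):.2e}")
```

With this configuration each optimizer step sees 16 sequences. As a cross-check, `cosine_lr(4, 25, 3, 1.0e-4)` evaluates to roughly 9.949e-05, which appears to line up with the first learning rate reported in the 25-step sample log in Step 4.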

### DeepSpeed Configuration
DeepSpeed is an open-source deep-learning optimization library developed by Microsoft, aimed at enabling large-model training. Higher ZeRO stages (1→3) and CPU offload reduce GPU memory usage but may slow training.

To enable DeepSpeed, create a separate JSON config and reference it from your main training YAML config with `deepspeed: /mnt/cluster_storage/ds_z3_config.json`.

**Note**: Place the DeepSpeed config on **[storage shared across nodes](https://docs.anyscale.com/configuration/storage#shared)**.

Below is a sample ZeRO-3 config:

<details>
  <summary>ds_z3_config.json</summary>

  ```json
  {
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "zero_allow_untested_optimizer": true,
    "fp16": {
      "enabled": "auto",
      "loss_scale": 0,
      "loss_scale_window": 1000,
      "initial_scale_power": 16,
      "hysteresis": 2,
      "min_loss_scale": 1
    },
    "bf16": {
      "enabled": "auto"
    },
    "zero_optimization": {
      "stage": 3,
      "overlap_comm": false,
      "contiguous_gradients": true,
      "sub_group_size": 1e9,
      "reduce_bucket_size": "auto",
      "stage3_prefetch_bucket_size": "auto",
      "stage3_param_persistence_threshold": "auto",
      "stage3_max_live_parameters": 1e9,
      "stage3_max_reuse_distance": 1e9,
      "stage3_gather_16bit_weights_on_model_save": true
    }
  }
  ```

</details>
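
The sample config keeps everything on the GPUs. If memory is still tight, the CPU offload mentioned above is configured through the `offload_optimizer` and `offload_param` sections of `zero_optimization`. The sketch below shows one way to derive such a variant from an existing config; `with_cpu_offload` is a hypothetical helper, the abbreviated `base` dict stands in for the full file, and whether offload helps on your hardware is something to measure rather than assume.

```python
import copy
import json


def with_cpu_offload(ds_config: dict, offload_params: bool = True) -> dict:
    """Return a copy of a ZeRO config with optimizer (and optionally parameter) offload to CPU."""
    cfg = copy.deepcopy(ds_config)
    zero = cfg.setdefault("zero_optimization", {})
    # DeepSpeed only supports parameter offload with ZeRO stage 3.
    if offload_params and zero.get("stage") != 3:
        raise ValueError("parameter offload requires ZeRO stage 3")
    zero["offload_optimizer"] = {"device": "cpu", "pin_memory": True}
    if offload_params:
        zero["offload_param"] = {"device": "cpu", "pin_memory": True}
    return cfg


# Abbreviated stand-in for ds_z3_config.json (assumption: only the keys needed here).
base = {
    "bf16": {"enabled": "auto"},
    "zero_optimization": {"stage": 3, "overlap_comm": False},
}
offloaded = with_cpu_offload(base)
print(json.dumps(offloaded["zero_optimization"], indent=2))
```

Writing the result to a new file on shared storage and pointing the `deepspeed:` key at it leaves the original ZeRO-3 config untouched for comparison runs.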


For a more detailed guide on acceleration and optimization methods, including DeepSpeed on Ray, see [Speed and memory optimizations](3.9-speed-and-memory-optimizations.md).

```bash
# Create a copy of the DeepSpeed configuration file in /mnt/cluster_storage
cp ../deepspeed-configs/ds_z3_config.json /mnt/cluster_storage/
```

## Step 4: Train and Monitor

With all configuration in place, you can launch fine-tuning/post-training in one of two ways.

### Option A — Run from a Workspace (quick start)

The `USE_RAY=1` prefix tells LLaMA-Factory to run in distributed mode on the Ray cluster attached to your workspace.

```bash
USE_RAY=1 llamafactory-cli train ../train-configs/qwen2.5_deepspeed_lora_sft.yaml
```

The abridged output below comes from a smaller run than the 32B configuration above: the log shows `Qwen2.5-0.5B-Instruct`, a single worker on an L4 node, 100 examples, and 1 epoch.

```text
[2025-09-09 15:20:26,962] [INFO] [real_accelerator.py:254:get_accelerator] Setting ds_accelerator to cpu (auto detect)
2025-09-09 15:20:32,996 INFO worker.py:1747 -- Connecting to existing Ray cluster at address: 10.0.51.64:6379...
2025-09-09 15:20:33,007 INFO worker.py:1918 -- Connected to Ray cluster. View the dashboard at https://session-llcdv67sqfh5atv2kbdl8prt5n.i.anyscaleuserdata.com
[...]
View detailed results here: /mnt/cluster_storage/qwen2.5_32b_lora_sft
To visualize your results with TensorBoard, run: `tensorboard --logdir /tmp/ray/session_2025-09-09_10-09-51_355612_2399/artifacts/2025-09-09_15-20-33/qwen2.5_32b_lora_sft/driver_artifacts`
(autoscaler +11s) [autoscaler] [1xL4:4CPU-16GB] Attempting to add 1 node to the cluster (increasing from 0 to 1).
(autoscaler +11s) [autoscaler] [1xL4:4CPU-16GB|g6.xlarge] [us-west-2a] [on-demand] Launched 1 instance.
[...]
(RayTrainWorker pid=3150, ip=10.0.33.92) Setting up process group for: env:// [rank=0, world_size=1]
(TorchTrainer pid=3073, ip=10.0.33.92) Started distributed worker processes:
(TorchTrainer pid=3073, ip=10.0.33.92) - (node_id=42cfab8d697667e67bd8e5415464841428c1d9eae0a9dcafa4f85fc9, ip=10.0.33.92, pid=3150) world_rank=0, local_rank=0, node_rank=0
(RayTrainWorker pid=3150, ip=10.0.33.92) [INFO|2025-09-09 15:23:48] llamafactory.hparams.parser:143 >> Set `ddp_find_unused_parameters` to False in DDP training since LoRA is enabled.
(RayTrainWorker pid=3150, ip=10.0.33.92) [INFO|2025-09-09 15:23:48] llamafactory.hparams.parser:406 >> Process rank: 0, world size: 1, device: cuda:0, distributed training: True, compute dtype: torch.bfloat16
(RayTrainWorker pid=3150, ip=10.0.33.92) [INFO|2025-09-09 15:23:51] llamafactory.data.loader:143 >> Loading dataset glaive_toolcall_en_demo.json...
[...]
(RayTrainWorker pid=3150, ip=10.0.33.92) [INFO|configuration_utils.py:698] 2025-09-09 15:23:56,533 >> loading configuration file config.json from cache at /home/ray/.cache/huggingface/hub/models--Qwen--Qwen2.5-0.5B-Instruct/snapshots/7ae557604adf67be50417f59c2c2f167def9a775/config.json
[...]
(RayTrainWorker pid=3150, ip=10.0.33.92) [INFO|2025-09-09 15:24:01] llamafactory.model.model_utils.checkpointing:143 >> Gradient checkpointing enabled.
(RayTrainWorker pid=3150, ip=10.0.33.92) [INFO|2025-09-09 15:24:01] llamafactory.model.model_utils.attention:143 >> Using torch SDPA for faster training and inference.
(RayTrainWorker pid=3150, ip=10.0.33.92) [INFO|2025-09-09 15:24:01] llamafactory.model.adapter:143 >> Upcasting trainable params to float32.
(RayTrainWorker pid=3150, ip=10.0.33.92) [INFO|2025-09-09 15:24:01] llamafactory.model.adapter:143 >> Fine-tuning method: LoRA
(RayTrainWorker pid=3150, ip=10.0.33.92) [INFO|2025-09-09 15:24:01] llamafactory.model.model_utils.misc:143 >> Found linear modules: up_proj,v_proj,gate_proj,o_proj,q_proj,down_proj,k_proj
[...]
(RayTrainWorker pid=3150, ip=10.0.33.92) [INFO|trainer.py:2409] 2025-09-09 15:24:02,150 >> ***** Running training *****
(RayTrainWorker pid=3150, ip=10.0.33.92) [INFO|trainer.py:2410] 2025-09-09 15:24:02,150 >>   Num examples = 100
(RayTrainWorker pid=3150, ip=10.0.33.92) [INFO|trainer.py:2411] 2025-09-09 15:24:02,150 >>   Num Epochs = 1
(RayTrainWorker pid=3150, ip=10.0.33.92) [INFO|trainer.py:2412] 2025-09-09 15:24:02,150 >>   Instantaneous batch size per device = 1
(RayTrainWorker pid=3150, ip=10.0.33.92) [INFO|trainer.py:2415] 2025-09-09 15:24:02,150 >>   Total train batch size (w. parallel, distributed & accumulation) = 4
(RayTrainWorker pid=3150, ip=10.0.33.92) [INFO|trainer.py:2416] 2025-09-09 15:24:02,150 >>   Gradient Accumulation steps = 4
[...]
(RayTrainWorker pid=3150, ip=10.0.33.92) {'loss': 0.7165, 'grad_norm': 0.753259003162384, 'learning_rate': 9.949107209404665e-05, 'epoch': 0.2}
(RayTrainWorker pid=3150, ip=10.0.33.92) {'loss': 0.8346, 'grad_norm': 0.7239006161689758, 'learning_rate': 8.274303669726426e-05, 'epoch': 0.4}
(RayTrainWorker pid=3150, ip=10.0.33.92) {'loss': 0.6278, 'grad_norm': 0.8122984766960144, 'learning_rate': 5e-05, 'epoch': 0.6}
(RayTrainWorker pid=3150, ip=10.0.33.92) {'loss': 0.3742, 'grad_norm': 0.7035364508628845, 'learning_rate': 1.725696330273575e-05, 'epoch': 0.8}
(RayTrainWorker pid=3150, ip=10.0.33.92) {'loss': 0.5312, 'grad_norm': 0.5648745894432068, 'learning_rate': 5.089279059533658e-07, 'epoch': 1.0}
100%|██████████| 25/25 [00:21<00:00,  1.21it/s][INFO|trainer.py:3993] 2025-09-09 15:24:24,016 >> Saving model checkpoint to qwen2.5_32b_lora_sft/checkpoint-25
[...]
Training finished iteration 1 at 2025-09-09 15:24:24. Total running time: 3min 51s
╭─────────────────────────────────────────╮
│ Training result                         │
├─────────────────────────────────────────┤
│ checkpoint_dir_name   checkpoint_000000 │
│ time_this_iter_s               44.46439 │
│ time_total_s                   44.46439 │
│ training_iteration                    1 │
│ epoch                                1. │
│ grad_norm                       0.56487 │
│ learning_rate                        0. │
│ loss                             0.5312 │
│ step                                 25 │
╰─────────────────────────────────────────╯
Training saved a checkpoint for iteration 1 at: (local)/mnt/cluster_storage/qwen2.5_32b_lora_sft/TorchTrainer_338d5_00000_0_2025-09-09_15-20-33/checkpoint_000000
(RayTrainWorker pid=3150, ip=10.0.33.92) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/mnt/cluster_storage/qwen2.5_32b_lora_sft/TorchTrainer_338d5_00000_0_2025-09-09_15-20-33/checkpoint_000000)
[...]
(RayTrainWorker pid=3150, ip=10.0.33.92) ***** train metrics *****
(RayTrainWorker pid=3150, ip=10.0.33.92)   epoch                    =        1.0
(RayTrainWorker pid=3150, ip=10.0.33.92)   total_flos               =   113387GF
(RayTrainWorker pid=3150, ip=10.0.33.92)   train_loss               =     0.6169
(RayTrainWorker pid=3150, ip=10.0.33.92)   train_runtime            = 0:00:22.75
(RayTrainWorker pid=3150, ip=10.0.33.92)   train_samples_per_second =      4.395
(RayTrainWorker pid=3150, ip=10.0.33.92)   train_steps_per_second   =      1.099
(RayTrainWorker pid=3150, ip=10.0.33.92) Figure saved at: qwen2.5_32b_lora_sft/training_loss.png
[...]
Training completed after 1 iterations at 2025-09-09 15:24:26. Total running time: 3min 53s
2025-09-09 15:24:26,598 INFO tune.py:1009 -- Wrote the latest version of all result files and experiment state to '/mnt/cluster_storage/qwen2.5_32b_lora_sft' in 0.0326s.
```





### Option B — Run as an Anyscale Job (production)

For longer or production runs, submit the training as an **Anyscale Job**. Jobs run outside your interactive session for better stability, retries, and durable logs. You package LLaMA-Factory and other libraries in a container image and launch with a short job config. See **[Launching Fine-Tuning with Anyscale Jobs](3.10-launch-fine-tuning-with-anyscale-jobs.md)** for the step-by-step guide.

### Monitoring with Weights & Biases (WandB)
If you enabled Weights & Biases (with `report_to: wandb` in the training config YAML file), you can monitor your training job in real time. Look for the training loss to decrease steadily, which indicates the model is learning. For a more detailed guide on tracking experiments with other tools such as TensorBoard and MLflow, see [Observability and tracking](3.5-observability-and-tracking.md).

**WandB**
![WandB](https://anyscale-public-materials.s3.us-west-2.amazonaws.com/llm-finetuning/llama-factory/3.2.1/3.2.1-wandb.png)
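
If W&B is disabled, the same trend can be read from the console log, where the trainer prints one metrics dictionary every `logging_steps` steps. The sketch below is a hypothetical parser, not a LLaMA-Factory utility: it pulls those dictionaries out of log lines and compares the average loss of the earlier and later records. The sample lines are the five metric lines from the training output in Option A, with ANSI color codes removed.

```python
import ast


def parse_metrics(log_lines):
    """Extract the metrics dict from each log line that contains a 'loss' entry."""
    records = []
    for line in log_lines:
        start = line.find("{'loss'")
        end = line.rfind("}")
        if start == -1 or end < start:
            continue  # not a metrics line, or the line was truncated
        records.append(ast.literal_eval(line[start:end + 1]))
    return records


sample = [
    "(RayTrainWorker pid=3150, ip=10.0.33.92) {'loss': 0.7165, 'grad_norm': 0.753259003162384, 'learning_rate': 9.949107209404665e-05, 'epoch': 0.2}",
    "(RayTrainWorker pid=3150, ip=10.0.33.92) {'loss': 0.8346, 'grad_norm': 0.7239006161689758, 'learning_rate': 8.274303669726426e-05, 'epoch': 0.4}",
    "(RayTrainWorker pid=3150, ip=10.0.33.92) {'loss': 0.6278, 'grad_norm': 0.8122984766960144, 'learning_rate': 5e-05, 'epoch': 0.6}",
    "(RayTrainWorker pid=3150, ip=10.0.33.92) {'loss': 0.3742, 'grad_norm': 0.7035364508628845, 'learning_rate': 1.725696330273575e-05, 'epoch': 0.8}",
    "(RayTrainWorker pid=3150, ip=10.0.33.92) {'loss': 0.5312, 'grad_norm': 0.5648745894432068, 'learning_rate': 5.089279059533658e-07, 'epoch': 1.0}",
]

records = parse_metrics(sample)
losses = [record["loss"] for record in records]
half = len(losses) // 2
first_avg = sum(losses[:half]) / half
last_avg = sum(losses[half:]) / (len(losses) - half)
print(f"early average loss: {first_avg:.4f}, late average loss: {last_avg:.4f}")
```

Individual readings are noisy (the second value here is higher than the first), so comparing averages over a window is usually more informative than comparing consecutive steps.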

## Step 5: Locate Checkpoints

Checkpoints are written under `ray_storage_path/ray_run_name`. In this example run, the path is: `/mnt/cluster_storage/qwen2.5_32b_lora_sft`. 

Inside, you’ll see a **trainer session** directory named like:
`TorchTrainer_8c6a5_00000_0_2025-09-09_09-53-45/`.

- `TorchTrainer_*` is created **when the trainer starts**; the suffix encodes a short run id and the **start timestamp**.
- Within that directory, checkpoints are named `checkpoint_000xxx/`, where the number is the zero-padded index of the checkpoint in the order it was saved.

The save cadence is controlled by `save_strategy` and `save_steps`. For instructions on how to resume interrupted training via `resume_from_checkpoint` and more, see the [checkpointing guide](3.4-checkpointing.md#understanding-your-training-output-directory).
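
Given the directory layout described above, locating the newest checkpoint can be scripted. The sketch below is a hypothetical helper that relies only on the `TorchTrainer_*/checkpoint_*` naming pattern; it picks the most recently modified trainer session and, within it, the highest checkpoint index. The demo builds a throwaway directory tree so that it runs without a cluster; on a real run you would pass `/mnt/cluster_storage/qwen2.5_32b_lora_sft`.

```python
import tempfile
from pathlib import Path


def checkpoint_index(path: Path) -> int:
    """Return the numeric suffix of a checkpoint_NNNNNN directory name."""
    return int(path.name.rsplit("_", 1)[-1])


def latest_checkpoint(run_dir):
    """Return the newest checkpoint directory under run_dir, or None if there is none."""
    candidates = [
        path
        for path in Path(run_dir).glob("TorchTrainer_*/checkpoint_*")
        if path.is_dir() and path.name.rsplit("_", 1)[-1].isdigit()
    ]
    if not candidates:
        return None
    # Newest trainer session first (by modification time), then highest index.
    return max(candidates, key=lambda p: (p.parent.stat().st_mtime, checkpoint_index(p)))


# Demo with a temporary directory that mimics the layout described above.
with tempfile.TemporaryDirectory() as run_dir:
    session = Path(run_dir) / "TorchTrainer_8c6a5_00000_0_2025-09-09_09-53-45"
    for index in (0, 1):
        (session / f"checkpoint_{index:06d}").mkdir(parents=True)
    print(latest_checkpoint(run_dir).name)
```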

## Step 6: Export the Model

If you use LoRA, you can keep the base model and adapter separate ([for multi-LoRA adapter use](https://docs.anyscale.com/llm/serving/multi-lora)) or merge the adapter into the base model for low-latency inference. 
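
To make the merge option concrete: in the standard LoRA formulation, an adapted layer computes `W x + (alpha / r) * B (A x)`, and merging folds the adapter into a single matrix `W' = W + (alpha / r) * B A`, so inference needs only one matrix multiply per layer. The sketch below demonstrates that identity on a toy 2×2 layer in plain Python. The numbers and the `alpha` value are made up for illustration, and the actual export tooling may implement the merge differently.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_b, lora_a, alpha, rank):
    """Return W + (alpha / rank) * B @ A without modifying the inputs."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy layer with a rank-1 adapter (illustrative values only).
W = [[1.0, 0.0], [0.0, 1.0]]   # base weight, shape (out_features, in_features)
B = [[1.0], [2.0]]             # shape (out_features, rank)
A = [[0.5, -0.5]]              # shape (rank, in_features)
alpha, rank = 2, 1

merged = merge_lora(W, B, A, alpha, rank)

# The merged layer and the base-plus-adapter path give the same output.
x = [[1.0], [3.0]]
merged_out = matmul(merged, x)
adapter_out = matmul(B, matmul(A, x))
separate_out = [
    [base[0] + (alpha / rank) * extra[0]]
    for base, extra in zip(matmul(W, x), adapter_out)
]
print(merged_out == separate_out)
```

Keeping the adapter separate preserves the ability to swap adapters on a shared base model, at the cost of the extra low-rank multiply on every forward pass.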

For full fine-tuning or freeze-tuning, export the fine-tuned model directly.

You may optionally apply post-training quantization to merged or full models before serving. See the export documentation for the exact export commands and options.