# Direct Preference Optimization (DPO) at Scale with QLoRA

This guide provides a step-by-step workflow for preference fine-tuning the `Qwen/Qwen2.5-7B-Instruct` model on a multi-GPU Anyscale cluster. We will use LLaMA-Factory as the training framework and `QLoRA` to reduce memory requirements and enable efficient multi-GPU training.

**What is Direct Preference Optimization (DPO)?** DPO aligns a model with human preferences using pairs of “chosen” and “rejected” responses. Rather than training a separate reward model, DPO directly optimizes the policy to increase the likelihood of preferred outputs and decrease the likelihood of rejected ones.
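
To make the objective concrete, here is a minimal sketch of the sigmoid DPO loss for a single preference pair in plain Python. The function name `dpo_sigmoid_loss` and the example log-probabilities are illustrative assumptions, not LLaMA-Factory code. `beta` corresponds to the `pref_beta` setting used later in this guide.

```python
import math

def dpo_sigmoid_loss(policy_chosen_logp, policy_rejected_logp,
                     ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Sigmoid DPO loss for one pair, given summed sequence log-probabilities."""
    # Implicit rewards: how much more likely the policy makes each response
    # compared with the frozen reference model, scaled by beta.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # loss = -log(sigmoid(margin)) = log(1 + exp(-margin)), in a stable form.
    loss = max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return loss, chosen_reward, rejected_reward

# Hypothetical log-probabilities: the policy still equals the reference.
start_loss, _, _ = dpo_sigmoid_loss(-280.0, -292.0, -280.0, -292.0)

# Hypothetical log-probabilities: the policy now favors the chosen response.
later_loss, chosen_r, rejected_r = dpo_sigmoid_loss(-270.0, -300.0, -280.0, -292.0)

print(f"loss at initialization: {start_loss:.4f}")
print(f"loss after improvement: {later_loss:.4f}")
```

When the policy and the reference agree, the margin is zero and the loss equals ln 2 (about 0.693), so a training loss that starts near that value and drifts downward is the expected pattern.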

## Step 1: Set Up Your Environment
### Dependencies
First, we need to make sure the environment has the right libraries. We'll start with a pre-built container image and install LLaMA-Factory plus a few supporting packages on top of it.

Recommended Container Image:
```bash
anyscale/ray-llm:2.48.0-py311-cu128
```

Execute the following commands to install the required packages and optional tools for experiment tracking and faster downloads.

```bash
# Install the specific version of LLaMA-Factory
pip install -q llamafactory@git+https://github.com/hiyouga/LLaMA-Factory.git@v0.9.3

# (Optional) For visualizing training metrics and logs
pip install -q tensorboard==2.20.0

# Needed for the QLoRA config in this guide (quantization_method: bnb);
# also provides lightweight 8-bit and 4-bit optimizers and inference
pip install -q bitsandbytes==0.47.0

# (Optional) For AWQ quantization support
pip install -q autoawq==0.2.9

# (Optional) For accelerated model downloads from Hugging Face
pip install -q hf_transfer==0.1.9
```

## Model and Resources

| Item | Value |
|------|-------|
| **Base model** | [`Qwen/Qwen2.5-7B-Instruct`](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) |
| **Workers** | 4 × L4 / A10G |

> Compared to SFT, DPO holds two copies of the model (policy + reference), and alignment datasets often use long contexts, so it benefits more than most workflows from memory optimization techniques like **QLoRA**. On 24 GB NVIDIA L4 GPUs, DPO on a 7B model in FP16 generally runs out of memory (OOM) without QLoRA.
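
A rough back-of-the-envelope calculation shows why 4-bit weights matter on a 24 GB GPU. This sketch counts model weights only and assumes roughly 7.6 billion parameters for the 7B model. It ignores activations, LoRA adapters, optimizer state, quantization constants, and CUDA overhead, so real usage is higher.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Memory needed for the model weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3

# Assumption: ~7.6e9 parameters; adjust for the checkpoint you actually use.
NUM_PARAMS = 7.6e9

fp16_gib = weight_memory_gib(NUM_PARAMS, 16)  # FP16/BF16 weights
int4_gib = weight_memory_gib(NUM_PARAMS, 4)   # 4-bit quantized weights

print(f"16-bit weights: {fp16_gib:.1f} GiB")
print(f" 4-bit weights: {int4_gib:.1f} GiB")
```

Under these assumptions, 16-bit weights alone take more than half of a 24 GB card before any activations or gradients are allocated, while 4-bit weights leave most of the memory free for long preference pairs.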

### Understand the Dataset
For this tutorial, we will use [`ultrafeedback.jsonl`](https://huggingface.co/datasets/kaitchup/UltraFeedback-prompt-chosen-rejected), a JSONL preference dataset tailored for Direct Preference Optimization (DPO). Each sample contains one instruction **prompt** and two candidate completions: a **preferred** (`chosen`) response and a **less preferred** (`rejected`) response.

This dataset includes:
- `prompt`: An instruction/question to answer (often multi-sentence, with constraints).
- `chosen`: The response that best follows the instruction.
- `rejected`: A weaker alternative for the same prompt (may span multiple lines).

**Note:** Files are in JSON Lines format—one JSON object per line. Each record is independent (no multi-turn conversation thread).

<details>
  <summary>Dataset Example</summary>

  ```json
  {
    "prompt": "Paraphrase the given questions to have different wording. Your paraphrased questions should have the same answer as the original question. Try to change the sentence as much as possible using synonyms and/or rearranging the structure of the sentence. The questions are in three domains: presidents, national parks, and dogs. Each question has a keyword indicating its domain. Keywords are \"this national park\", \"this dog breed\", and \"this president\", which will be replaced with the name of an actual president, a national park, or a breed of dog. Hence, in paraphrasing, this keyword should also be used the same way. Do not write questions that compare or involve multiple domains. Do not write open-ended or subjective questions (e.g., questions that can be answered differently by different people.) Make your questions specific and concrete. Your question should have the same type of answer as the original question(e.g., if the question is extractive, the paraphrased question should be extractive as well.)\n\nWhat lakes are in this national park?",

    "rejected": "What bodies of water are located in this national park? \n\nWhich president is commonly known for his efforts to protect natural resources?\n\nWhich president is recognized for their dedication to preserving the environment? \n\nWhat type of dog breed is known for its loyalty and affectionate nature?\n\nWhat breed of dog is renowned for its faithfulness and loving personality?",
    
    "chosen": "Which bodies of water can be found within the borders of this particular national park?"
  }
  ```
</details>
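
Before training, it can help to confirm that every record has the three expected fields. The following is a minimal validation sketch. The function name `validate_preference_lines` and the two sample records are hypothetical, and the checks shown are only one reasonable choice.

```python
import json

REQUIRED_KEYS = ("prompt", "chosen", "rejected")

def validate_preference_lines(lines):
    """Return (valid_records, errors) for an iterable of JSONL lines."""
    records, errors = [], []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append(f"line {line_no}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(record, dict):
            errors.append(f"line {line_no}: expected a JSON object")
            continue
        # Each required field must be a non-empty string.
        missing = [key for key in REQUIRED_KEYS
                   if not isinstance(record.get(key), str) or not record[key].strip()]
        if missing:
            errors.append(f"line {line_no}: missing or empty {missing}")
            continue
        records.append(record)
    return records, errors

# Hypothetical sample: one well-formed record and one incomplete record.
sample_lines = [
    json.dumps({"prompt": "What lakes are in this national park?",
                "chosen": "Which bodies of water can be found in this national park?",
                "rejected": "Which president protected natural resources?"}),
    json.dumps({"prompt": "A record without completions"}),
]
records, errors = validate_preference_lines(sample_lines)
print(len(records), "valid;", errors)
```

To check the real file, pass an open file handle instead of the sample list, for example `validate_preference_lines(open("/mnt/cluster_storage/ultrafeedback.jsonl"))`.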

### Register the local dataset

To specify new datasets that are accessible across Ray worker nodes, you must first add all dataset files and a `dataset_info.json` to **[storage shared across nodes](https://docs.anyscale.com/configuration/storage#shared)** such as `/mnt/cluster_storage`. 

For example, to run DPO post-training on a local copy of the `ultrafeedback` dataset, first register it in the dataset registry:

`dataset_info.json`
```json
{
  "my_ultrafeedback": {
    "file_name": "ultrafeedback.jsonl",
    "ranking": true,
    "columns": {
      "prompt": "prompt",
      "chosen": "chosen",
      "rejected": "rejected"
    }
  }
}
```
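
If you prefer to generate the registry rather than edit it by hand, the entry above can be written programmatically. This is a sketch only: `register_preference_dataset` is a hypothetical helper, and a temporary directory stands in for `/mnt/cluster_storage` so the example can run anywhere.

```python
import json
import tempfile
from pathlib import Path

def register_preference_dataset(storage_dir, name, file_name):
    """Add or update a ranking-dataset entry in <storage_dir>/dataset_info.json."""
    info_path = Path(storage_dir) / "dataset_info.json"
    # Preserve any datasets that are already registered.
    registry = json.loads(info_path.read_text()) if info_path.exists() else {}
    registry[name] = {
        "file_name": file_name,
        "ranking": True,  # marks the dataset as chosen/rejected pairs
        "columns": {"prompt": "prompt", "chosen": "chosen", "rejected": "rejected"},
    }
    info_path.write_text(json.dumps(registry, indent=2) + "\n")
    return info_path

# A temporary directory stands in for the shared storage path.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = register_preference_dataset(tmp_dir, "my_ultrafeedback", "ultrafeedback.jsonl")
    registry = json.loads(path.read_text())

print(json.dumps(registry, indent=2))
```

Because the helper merges into an existing file, calling it again with a different name adds a second dataset without removing the first.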

For a more detailed dataset preparation and formatting guide, see the [data preparation guide](3.1.3-data-prep-fine-tune.md).

```bash
# Make sure all files are accessible to worker nodes
# Download the dataset into /mnt/cluster_storage
wget https://anyscale-public-materials.s3.us-west-2.amazonaws.com/llm-finetuning/llama-factory/datasets/alpaca/ultrafeedback.jsonl -O /mnt/cluster_storage/ultrafeedback.jsonl
# Copy the dataset registry into /mnt/cluster_storage
cp ../dataset-configs/dataset_info.json /mnt/cluster_storage/
```




## Step 3: Create the Preference-Tuning Config (DPO + QLoRA)

Next, create the YAML configuration file that defines your DPO (Direct Preference Optimization) run. It specifies the base model, quantization (QLoRA), dataset, DPO hyperparameters, logging, and Ray cluster resources.

The workspace includes this configuration as `qwen2.5_7b_qlora_dpo_ray.yaml`. Keep the following notes in mind when adapting it:

**Important notes:**
- **QLoRA quantization:** `quantization_bit: 4` with `quantization_method: bnb` quantizes the base weights to 4 bits at load time, which reduces memory while typically preserving quality. If you use a *pre-quantized* model such as an AWQ checkpoint, **omit** these keys.
- **LoRA setup:** If you prefer standard LoRA, **disable quantization** by removing both `quantization_bit` and `quantization_method` from the config.
- **Access & paths:** The YAML only needs to be on the **head node**, but any referenced paths (`dataset_dir`, `output_dir`) must live on storage **reachable by all workers** (e.g., `/mnt/cluster_storage/`).
- **Gated models:** Qwen is generally ungated. For gated base models (e.g., Llama), set `HF_TOKEN` under `ray_init_kwargs.runtime_env.env_vars` so that every worker can authenticate.
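
The difference between the QLoRA and standard LoRA setups described in these notes comes down to two keys. The toy sketch below illustrates that on an in-memory dictionary. The helper `to_plain_lora` is hypothetical, and reading or writing the actual YAML file is left out.

```python
QUANT_KEYS = ("quantization_bit", "quantization_method")

def to_plain_lora(config):
    """Return a copy of a training config with on-the-fly quantization removed."""
    return {key: value for key, value in config.items() if key not in QUANT_KEYS}

# A small subset of the settings used in this guide.
qlora_config = {
    "model_name_or_path": "Qwen/Qwen2.5-7B-Instruct",
    "stage": "dpo",
    "finetuning_type": "lora",
    "quantization_bit": 4,
    "quantization_method": "bnb",
}

lora_config = to_plain_lora(qlora_config)
print(sorted(lora_config))
```

Note that `finetuning_type` stays `lora` in both cases; only the quantization of the frozen base weights changes.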

### LLaMA-Factory + Ray Configuration

```yaml
# qwen2.5_7b_qlora_dpo_ray.yaml

### model
trust_remote_code: true
model_name_or_path: Qwen/Qwen2.5-7B-Instruct

### method
# To use plain LoRA, or a pre-quantized model such as Qwen/Qwen2.5-7B-Instruct-AWQ, omit the two quantization keys below
quantization_bit: 4  # 4-bit base weights (QLoRA); use 8 for 8-bit, omit for FP16/BF16
quantization_method: bnb  # bnb = BitsAndBytes; hqq and eetq are alternatives

stage: dpo
do_train: true
finetuning_type: lora
lora_rank: 8
lora_target: all
pref_beta: 0.1
pref_loss: sigmoid  # choices: [sigmoid (dpo), orpo, simpo]

# local dataset
dataset: my_ultrafeedback
dataset_dir: /mnt/cluster_storage

template: qwen
cutoff_len: 1024
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: qwen2.5_7b_qlora_dpo
logging_steps: 5
save_steps: 5              # for tensorboard logging purpose too, can increase if not using tensorboard
plot_loss: true
report_to: tensorboard  # or none

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 2
num_train_epochs: 3.0  # low for demo purpose; adjust as needed
learning_rate: 5.0e-6
bf16: true
lr_scheduler_type: cosine
warmup_ratio: 0.1
ddp_timeout: 180000000

### ray
ray_run_name: qwen2.5_7b_qlora_dpo
ray_storage_path: /mnt/cluster_storage/
ray_num_workers: 4  # Number of GPUs to use.
resources_per_worker:
  GPU: 1
  anyscale/accelerator_shape:4xL4: 0.001  # Use this to specify a specific node shape.
  # accelerator_type:L4: 0.001            # Or use this to simply specify a GPU type.
  # See https://docs.ray.io/en/master/ray-core/accelerator-types.html#accelerator-types for a full list of accelerator types.

ray_init_kwargs:
  runtime_env:
    env_vars:
      # if using gated models like meta-llama/Llama-3.1-8B-Instruct
      # HF_TOKEN: <your_huggingface_token>
      # Enable faster downloads if hf_transfer is installed:
      HF_HUB_ENABLE_HF_TRANSFER: '1'
```
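
As a sanity check on the batch-size and scheduler settings in this config, the sketch below estimates the effective batch size, the number of optimizer steps, and the learning rate at a given step. The helpers `training_plan` and `cosine_lr` are hypothetical. The formulas assume even sharding across workers and a linear-warmup-plus-cosine schedule, and the dataset size of 100 examples is an assumption taken from the demo file rather than from the config.

```python
import math

def training_plan(num_examples, num_workers, per_device_batch,
                  grad_accum, epochs, warmup_ratio):
    """Estimate (effective_batch, total_steps, warmup_steps) for data-parallel training."""
    effective_batch = per_device_batch * grad_accum * num_workers
    # Each worker sees roughly an equal shard of the dataset.
    batches_per_worker = math.ceil(math.ceil(num_examples / num_workers) / per_device_batch)
    steps_per_epoch = max(math.ceil(batches_per_worker / grad_accum), 1)
    total_steps = math.ceil(steps_per_epoch * epochs)
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return effective_batch, total_steps, warmup_steps

def cosine_lr(step, base_lr, total_steps, warmup_steps):
    """Linear warmup, then cosine decay to zero; `step` counts completed optimizer steps."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Values from the YAML in this guide, plus the assumed dataset size of 100.
effective_batch, total_steps, warmup_steps = training_plan(
    num_examples=100, num_workers=4, per_device_batch=1,
    grad_accum=2, epochs=3.0, warmup_ratio=0.1,
)
lr_after_nine_steps = cosine_lr(9, 5.0e-6, total_steps, warmup_steps)

print(effective_batch, total_steps, warmup_steps)
print(f"{lr_after_nine_steps:.4e}")
```

With these inputs the estimate is an effective batch size of 8 and 39 optimizer steps, which appears to agree with the trainer output in the next step. Raising `gradient_accumulation_steps` or `ray_num_workers` increases the effective batch size and reduces the step count proportionally.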

## Step 4: Train and Monitor

With all configuration in place, you can launch fine-tuning/post-training in one of two ways.

### Option A — Run from a Workspace (quick start)

The `USE_RAY=1` prefix tells LLaMA-Factory to run in distributed mode on the Ray cluster attached to your workspace.

```bash
USE_RAY=1 llamafactory-cli train ../train-configs/qwen2.5_7b_qlora_dpo_ray.yaml
```

INFO 09-19 16:12:06 [__init__.py:248] No platform detected, vLLM is running on UnspecifiedPlatform


2025-09-19 16:12:10,738	INFO worker.py:1747 -- Connecting to existing Ray cluster at address: 10.0.168.141:6379...
2025-09-19 16:12:10,748	INFO worker.py:1918 -- Connected to Ray cluster. View the dashboard at https://session-c3rc1dvuypysehcmb91gu17t54.i.anyscaleuserdata.com
2025-09-19 16:12:10,750	INFO packaging.py:380 -- Pushing file package 'gcs://_ray_pkg_4743ba917c5b0b6789a61fbd792b5972f2c8ed63.zip' (0.10MiB) to Ray cluster...
2025-09-19 16:12:10,751	INFO packaging.py:393 -- Successfully pushed file package 'gcs://_ray_pkg_4743ba917c5b0b6789a61fbd792b5972f2c8ed63.zip'.



View detailed results here: /mnt/cluster_storage/qwen2.5_7b_qlora_dpo
To visualize your results with TensorBoard, run: `tensorboard --logdir /tmp/ray/session_2025-09-19_15-49-42_124608_2430/artifacts/2025-09-19_16-12-10/qwen2.5_7b_qlora_dpo/driver_artifacts`

Training started with configuration:
╭──────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Training config                                                                                              │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ train_loop_config/args/bf16                                                                             True │
│ train_loop_config/args/cutoff_len                                                                       1024 │
│ train_loop_config/args/dataset                                                              my_ultrafeedback │
│ train_loop_config/args

(RayTrainWorker pid=6857, ip=10.0.183.178) Setting up process group for: env:// [rank=0, world_size=4]
(TorchTrainer pid=6752, ip=10.0.183.178) Started distributed worker processes:
(TorchTrainer pid=6752, ip=10.0.183.178) - (node_id=c87e4ad0b35e4f468ebff6d2970822063a67672f55ac6702a530230f, ip=10.0.183.178, pid=6857) world_rank=0, local_rank=0, node_rank=0
(TorchTrainer pid=6752, ip=10.0.183.178) - (node_id=c87e4ad0b35e4f468ebff6d2970822063a67672f55ac6702a530230f, ip=10.0.183.178, pid=6858) world_rank=1, local_rank=1, node_rank=0
(TorchTrainer pid=6752, ip=10.0.183.178) - (node_id=c87e4ad0b35e4f468ebff6d2970822063a67672f55ac6702a530230f, ip=10.0.183.178, pid=6859) world_rank=2, local_rank=2, node_rank=0
(TorchTrainer pid=6752, ip=10.0.183.178) - (node_id=c87e4ad0b35e4f468ebff6d2970822063a67672f55ac6702a530230f, ip=10.0.183.178, pid=6856) world_rank=3, local_rank=3, node_rank=0

(RayTrainWorker pid=6857, ip=10.0.183.178) [INFO|2025-09-19 16:12:24] llamafactory.hparams.parser:143 >> Set `ddp_find_unused_parameters` to False in DDP training since LoRA is enabled.
(RayTrainWorker pid=6857, ip=10.0.183.178) [INFO|2025-09-19 16:12:24] llamafactory.hparams.parser:406 >> Process rank: 0, world size: 4, device: cuda:0, distributed training: True, compute dtype: torch.bfloat16

(RayTrainWorker pid=6857, ip=10.0.183.178) [INFO|tokenization_utils_base.py:2023] 2025-09-19 16:12:24,292 >> loading file vocab.json from cache at /home/ray/.cache/huggingface/hub/models--Qwen--Qwen2.5-7B-Instruct/snapshots/a09a35458c702b33eeacc393d103063234e8bc28/vocab.json
(RayTrainWorker pid=6857, ip=10.0.183.178) [INFO|tokenization_utils_base.py:2023] 2025-09-19 16:12:24,292 >> loading file merges.txt from cache at /home/ray/.cache/huggingface/hub/models--Qwen--Qwen2.5-7B-Instruct/snapshots/a09a35458c702b33eeacc393d103063234e8bc28/merges.txt
(RayTrainWorker pid=6857, ip=10.0.183.178) [INFO|tokenization_utils_base.py:2023] 2025-09-19 16:12:24,292 >> loading file tokenizer.json from cache at /home/ray/.cache/huggingface/hub/models--Qwen--Qwen2.5-7B-Instruct/snapshots/a09a35458c702b33eeacc393d103063234e8bc28/tokenizer.json

(RayTrainWorker pid=6857, ip=10.0.183.178) [INFO|2025-09-19 16:12:25] llamafactory.data.loader:143 >> Loading dataset ultrafeedback.jsonl...

(RayTrainWorker pid=6857, ip=10.0.183.178) [INFO|tokenization_utils_base.py:2299] 2025-09-19 16:12:25,610 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
(RayTrainWorker pid=6856, ip=10.0.183.178) [rank3]:[W919 16:12:25.082181490 ProcessGroupNCCL.cpp:4715] [PG ID 0 PG GUID 0 Rank 3]  using GPU 3 as device used by this process is currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. You can pecify device_id in init_process_group() to force use of a particular device.
Converting format of dataset (num_proc=16):   0%|          | 0/100 [00:00<?, ? examples/s]
Converting format of dataset (num_proc=16): 100%|██████████| 100/100 [00:00<00:00, 575.25 examples/s]
Running tokenizer on dataset (num_proc=16):   0%|          | 0/100 [00:00<?, ? examples/s]
Running tokenizer on dataset (num_proc=16):   7%|▋         | 7/100 [00:00<00:06, 14.29 examples/s]

(RayTrainWorker pid=6857, ip=10.0.183.178) [INFO|configuration_utils.py:698] 2025-09-19 16:12:28,998 >> loading configuration file config.json from cache at /home/ray/.cache/huggingface/hub/models--Qwen--Qwen2.5-7B-Instruct/snapshots/a09a35458c702b33eeacc393d103063234e8bc28/config.json
(RayTrainWorker pid=6857, ip=10.0.183.178) [INFO|configuration_utils.py:770] 2025-09-19 16:12:28,999 >> Model config Qwen2Config {
(RayTrainWorker pid=6857, ip=10.0.183.178)   "architectures": [
(RayTrainWorker pid=6857, ip=10.0.183.178)     "Qwen2ForCausalLM"
(RayTrainWorker pid=6857, ip=10.0.183.178)   ],
(RayTrainWorker pid=6857, ip=10.0.183.178)   "attention_dropout": 0.0,
(RayTrainWorker pid=6857, ip=10.0.183.178)   "bos_token_id": 151643,
(RayTrainWorker pid=6857, ip=10.0.183.178)   "eos_token_id": 151645,
(RayTrainWorker pid=6857, ip=10.0.183.178)   "hidden_act": "silu",

(RayTrainWorker pid=6857, ip=10.0.183.178) [INFO|2025-09-19 16:12:41] llamafactory.model.model_utils.checkpointing:143 >> Gradient checkpointing enabled.
(RayTrainWorker pid=6857, ip=10.0.183.178) [INFO|2025-09-19 16:12:41] llamafactory.model.model_utils.attention:143 >> Using torch SDPA for faster training and inference.
(RayTrainWorker pid=6857, ip=10.0.183.178) [INFO|2025-09-19 16:12:41] llamafactory.model.adapter:143 >> Upcasting trainable params to float32.
(RayTrainWorker pid=6857, ip=10.0.183.178) [INFO|2025-09-19 16:12:41] llamafactory.model.adapter:143 >> Fine-tuning method: LoRA
(RayTrainWorker pid=6857, ip=10.0.183.178) [INFO|2025-09-19 16:12:41] llamafactory.model.model_utils.misc:143 >> Found linear modules: q_proj,o_proj,up_proj,down_proj,v_proj,k_proj,gate_proj

(RayTrainWorker pid=6857, ip=10.0.183.178) [INFO|trainer.py:756] 2025-09-19 16:12:42,300 >> Using auto half precision backend
(RayTrainWorker pid=6857, ip=10.0.183.178) [INFO|trainer.py:2409] 2025-09-19 16:12:43,045 >> ***** Running training *****
(RayTrainWorker pid=6857, ip=10.0.183.178) [INFO|trainer.py:2410] 2025-09-19 16:12:43,045 >>   Num examples = 100
(RayTrainWorker pid=6857, ip=10.0.183.178) [INFO|trainer.py:2411] 2025-09-19 16:12:43,045 >>   Num Epochs = 3
(RayTrainWorker pid=6857, ip=10.0.183.178) [INFO|trainer.py:2412] 2025-09-19 16:12:43,045 >>   Instantaneous batch size per device = 1
(RayTrainWorker pid=6857, ip=10.0.183.178) [INFO|trainer.py:2415] 2025-09-19 16:12:43,045 >>   Total train batch size (w. parallel, distributed & accumulation) = 8
(RayTrainWorker pid=6857, ip=10.0.183.178) [INFO|trainer.py:2416] 2025-09-19 16:12:43,045 >>   Gradient Accumulation steps = 2

(RayTrainWorker pid=6857, ip=10.0.183.178) {'loss': 0.6977, 'grad_norm': 7.635711669921875, 'learning_rate': 5e-06, 'rewards/chosen': 0.009051240980625153, 'rewards/rejected': 0.014680067077279091, 'rewards/accuracies': 0.25, 'rewards/margins': -0.005628826562315226, 'logps/chosen': -280.187255859375, 'logps/rejected': -292.1568908691406, 'logits/chosen': -0.8685919046401978, 'logits/rejected': -0.8516700267791748, 'epoch': 0.4}


 13%|█▎        | 5/39 [00:29<03:30,  6.20s/it]
[INFO|trainer.py:3993] 2025-09-19 16:13:13,581 >> Saving model checkpoint to qwen2.5_7b_qlora_dpo/checkpoint-5


Training finished iteration 1 at 2025-09-19 16:13:16. Total running time: 1min 5s
╭─────────────────────────────────────────╮
│ Training result                         │
├─────────────────────────────────────────┤
│ checkpoint_dir_name   checkpoint_000000 │
│ time_this_iter_s               58.76934 │
│ time_total_s                   58.76934 │
│ training_iteration                    1 │
│ epoch                               0.4 │
│ grad_norm                       7.63571 │
│ learning_rate                   0.00001 │
│ logits/chosen                  -0.86859 │
│ logits/rejected                -0.85167 │
│ logps/chosen                 -280.18726 │
│ logps/rejected               -292.15689 │
│ loss                             0.6977 │
│ rewards/accuracies                 0.25 │
│ rewards/chosen                  0.00905 │
│ rewards/margins                -0.00563 │
│ rewards/rejected                0.01468 │
│ step                                  5 │
╰─────────────────────────────────────────╯
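
When monitoring a run, the per-step metrics dictionaries are the most useful lines to watch. The sketch below parses one such line and checks how the reward fields relate to each other. `parse_metrics` is a hypothetical helper, and the sample line is an abbreviated copy of the step-5 output with several keys dropped for brevity.

```python
import ast

def parse_metrics(line):
    """Extract the metrics dict from a trainer log line, or None if there is none."""
    start, end = line.find("{"), line.rfind("}")
    if start == -1 or end <= start:
        return None
    # The trainer prints a Python dict literal, so literal_eval can read it safely.
    return ast.literal_eval(line[start:end + 1])

# Abbreviated copy of the step-5 metrics line (some keys omitted).
sample_line = (
    "(RayTrainWorker pid=6857, ip=10.0.183.178) "
    "{'loss': 0.6977, 'rewards/chosen': 0.009051240980625153, "
    "'rewards/rejected': 0.014680067077279091, 'rewards/accuracies': 0.25, "
    "'rewards/margins': -0.005628826562315226, 'epoch': 0.4}"
)

metrics = parse_metrics(sample_line)
margin = metrics["rewards/chosen"] - metrics["rewards/rejected"]

print(f"loss={metrics['loss']}, accuracy={metrics['rewards/accuracies']}")
print(f"recomputed margin={margin:.6f}")
```

`rewards/margins` is the chosen reward minus the rejected reward, and `rewards/accuracies` is the fraction of pairs in which the chosen response received the higher reward. A healthy run generally shows margins and accuracies rising while the loss falls below its starting value near 0.69.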

 15%|█▌        | 6/39 [00:36<03:36,  6.55s/it]
(RayTrainWorker pid=6857, ip=10.0.183.178) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/mnt/cluster_storage/qwen2.5_7b_qlora_dpo/TorchTrainer_12134_00000_0_2025-09-19_16-12-10/checkpoint_000000) [repeated 3x across cluster]
 18%|█▊        | 7/39 [00:42<03:19,  6.24s/it]
 21%|██        | 8/39 [00:47<03:04,  5.94s/it]
 23%|██▎       | 9/39 [00:54<03:12,  6.40s/it]

(RayTrainWorker pid=6857, ip=10.0.183.178) {'loss': 0.7005, 'grad_norm': 10.753315925598145, 'learning_rate': 4.752422169756048e-06, 'rewards/chosen': -0.011030399240553379, 'rewards/rejected': -0.0002653626725077629, 'rewards/accuracies': 0.4749999940395355, 'rewards/margins': -0.010765035636723042, 'logps/chosen': -278.4250183105469, 'logps/rejected': -295.75921630859375, 'logits/chosen': -0.8270877599716187, 'logits/rejected': -0.9758028984069824, 'epoch': 0.8}

 26%|██▌       | 10/39 [01:01<03:06,  6.44s/it]
[INFO|trainer.py:3993] 2025-09-19 16:13:45,664 >> Saving model checkpoint to qwen2.5_7b_qlora_dpo/checkpoint-10
(RayTrainWorker pid=6858, ip=10.0.183.178) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/mnt/cluster_storage/qwen2.5_7b_qlora_dpo/TorchTrainer_12134_00000_0_2025-09-19_16-12-10/checkpoint_000001)


Training finished iteration 2 at 2025-09-19 16:13:48. Total running time: 1min 37s
╭─────────────────────────────────────────╮
│ Training result                         │
├─────────────────────────────────────────┤
│ checkpoint_dir_name   checkpoint_000001 │
│ time_this_iter_s                32.0465 │
│ time_total_s                   90.81584 │
│ training_iteration                    2 │
│ epoch                               0.8 │
│ grad_norm                      10.75332 │
│ learning_rate                        0. │
│ logits/chosen                  -0.82709 │
│ logits/rejected                 -0.9758 │
│ logps/chosen                 -278.42502 │
│ logps/rejected               -295.75922 │
│ loss                             0.7005 │
│ rewards/accuracies                0.475 │
│ rewards/chosen                 -0.01103 │
│ rewards/margins                -0.01077 │
│ rewards/rejected               -0.00027 │
│ step                                 10 │
╰─────────────────────────────────────────╯

 28%|██▊       | 11/39 [01:10<03:19,  7.12s/it]
(RayTrainWorker pid=6857, ip=10.0.183.178) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/mnt/cluster_storage/qwen2.5_7b_qlora_dpo/TorchTrainer_12134_00000_0_2025-09-19_16-12-10/checkpoint_000001) [repeated 3x across cluster]
 31%|███       | 12/39 [01:15<02:56,  6.53s/it]
 33%|███▎      | 13/39 [01:16<02:09,  4.97s/it]
 36%|███▌      | 14/39 [01:24<02:23,  5.74s/it]
(RayTrainWorker pid=6858, ip=10.0.183.178) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/mnt/cluster_storage/qwen2.5_7b_qlora_dpo/TorchTrainer_12134_00000_0_2025-09-19_16-12-10/checkpoint_000002)

(RayTrainWorker pid=6857, ip=10.0.183.178) {'loss': 0.6083, 'grad_norm': 6.456727027893066, 'learning_rate': 4.058724504646834e-06, 'rewards/chosen': -0.014000813476741314, 'rewards/rejected': -0.05412696301937103, 'rewards/accuracies': 0.5555555820465088, 'rewards/margins': 0.04012615233659744, 'logps/chosen': -221.62258911132812, 'logps/rejected': -242.07455444335938, 'logits/chosen': -0.801766037940979, 'logits/rejected': -0.932135820388794, 'epoch': 1.16}

 38%|███▊      | 15/39 [01:30<02:23,  5.98s/it]
[INFO|trainer.py:3993] 2025-09-19 16:14:14,942 >> Saving model checkpoint to qwen2.5_7b_qlora_dpo/checkpoint-15


Training finished iteration 3 at 2025-09-19 16:14:17. Total running time: 2min 6s
╭─────────────────────────────────────────╮
│ Training result                         │
├─────────────────────────────────────────┤
│ checkpoint_dir_name   checkpoint_000002 │
│ time_this_iter_s               29.18197 │
│ time_total_s                   119.9978 │
│ training_iteration                    3 │
│ epoch                              1.16 │
│ grad_norm                       6.45673 │
│ learning_rate                        0. │
│ logits/chosen                  -0.80177 │
│ logits/rejected                -0.93214 │
│ logps/chosen                 -221.62259 │
│ logps/rejected               -242.07455 │
│ loss                             0.6083 │
│ rewards/accuracies              0.55556 │
│ rewards/chosen                   -0.014 │
│ rewards/margins                 0.04013 │
│ rewards/rejected               -0.05413 │
│ step                                 15 │
╰─────────────────────────────────────────╯

 41%|████      | 16/39 [01:37<02:24,  6.27s/it]
(RayTrainWorker pid=6857, ip=10.0.183.178) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/mnt/cluster_storage/qwen2.5_7b_qlora_dpo/TorchTrainer_12134_00000_0_2025-09-19_16-12-10/checkpoint_000002) [repeated 3x across cluster]
 44%|████▎     | 17/39 [01:44<02:24,  6.59s/it]
 46%|████▌     | 18/39 [01:52<02:23,  6.85s/it]
 49%|████▊     | 19/39 [01:57<02:08,  6.44s/it]
(RayTrainWorker pid=6858, ip=10.0.183.178) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/mnt/cluster_storage/qwen2.5_7b_qlora_dpo/TorchTrainer_12134_00000_0_2025-09-19_16-12-10/checkpoint_000003)

(RayTrainWorker pid=6857, ip=10.0.183.178) {'loss': 0.6945, 'grad_norm': 6.419449806213379, 'learning_rate': 3.056302334890786e-06, 'rewards/chosen': 0.0024192254059016705, 'rewards/rejected': 0.0013211145997047424, 'rewards/accuracies': 0.5750000476837158, 'rewards/margins': 0.0010981112718582153, 'logps/chosen': -276.6562194824219, 'logps/rejected': -287.2279357910156, 'logits/chosen': -0.8158325552940369, 'logits/rejected': -0.8227788209915161, 'epoch': 1.56}

 51%|█████▏    | 20/39 [02:02<01:52,  5.93s/it]
[INFO|trainer.py:3993] 2025-09-19 16:14:46,893 >> Saving model checkpoint to qwen2.5_7b_qlora_dpo/checkpoint-20


Training finished iteration 4 at 2025-09-19 16:14:49. Total running time: 2min 38s
╭─────────────────────────────────────────╮
│ Training result                         │
├─────────────────────────────────────────┤
│ checkpoint_dir_name   checkpoint_000003 │
│ time_this_iter_s                31.9378 │
│ time_total_s                   151.9356 │
│ training_iteration                    4 │
│ epoch                              1.56 │
│ grad_norm                       6.41945 │
│ learning_rate                        0. │
│ logits/chosen                  -0.81583 │
│ logits/rejected                -0.82278 │
│ logps/chosen                 -276.65622 │
│ logps/rejected               -287.22794 │
│ loss                             0.6945 │
│ rewards/accuracies                0.575 │
│ rewards/chosen                  0.00242 │
│ rewards/margins                  0.0011 │
│ rewards/rejected                0.00132 │
│ step                                 20 │
╰───────────────────────────────────

 54%|█████▍    | 21/39 [02:08<01:48,  6.05s/it][0m 
[36m(RayTrainWorker pid=6857, ip=10.0.183.178)[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/mnt/cluster_storage/qwen2.5_7b_qlora_dpo/TorchTrainer_12134_00000_0_2025-09-19_16-12-10/checkpoint_000003)[32m [repeated 3x across cluster][0m
 56%|█████▋    | 22/39 [02:12<01:32,  5.44s/it][0m 
 59%|█████▉    | 23/39 [02:17<01:23,  5.23s/it][0m 
 62%|██████▏   | 24/39 [02:23<01:19,  5.29s/it][0m 
[36m(RayTrainWorker pid=6856, ip=10.0.183.178)[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/mnt/cluster_storage/qwen2.5_7b_qlora_dpo/TorchTrainer_12134_00000_0_2025-09-19_16-12-10/checkpoint_000004)


[36m(RayTrainWorker pid=6857, ip=10.0.183.178)[0m {'loss': 0.6834, 'grad_norm': 8.747222900390625, 'learning_rate': 1.9436976651092143e-06, 'rewards/chosen': -0.008300685323774815, 'rewards/rejected': -0.032277125865221024, 'rewards/accuracies': 0.44999998807907104, 'rewards/margins': 0.023976439610123634, 'logps/chosen': -220.33480834960938, 'logps/rejected': -298.58892822265625, 'logits/chosen': -0.8311011791229248, 'logits/rejected': -0.881970226764679, 'epoch': 1.96}


 64%|██████▍   | 25/39 [02:27<01:11,  5.09s/it][INFO|trainer.py:3993] 2025-09-19 16:15:12,018 >> Saving model checkpoint to qwen2.5_7b_qlora_dpo/checkpoint-25
[36m(RayTrainWorker pid=6857, ip=10.0.183.178)[0m [INFO|configuration_utils.py:698] 2025-09-19 16:15:12,253 >> loading configuration file config.json from cache at /home/ray/.cache/huggingface/hub/models--Qwen--Qwen2.5-7B-Instruct/snapshots/a09a35458c702b33eeacc393d103063234e8bc28/config.json
[36m(RayTrainWorker pid=6857, ip=10.0.183.178)[0m [INFO|configuration_utils.py:770] 2025-09-19 16:15:12,254 >> Model config Qwen2Config {
[36m(RayTrainWorker pid=6857, ip=10.0.183.178)[0m   "architectures": [
[36m(RayTrainWorker pid=6857, ip=10.0.183.178)[0m     "Qwen2ForCausalLM"
[36m(RayTrainWorker pid=6857, ip=10.0.183.178)[0m   ],
[36m(RayTrainWorker pid=6857, ip=10.0.183.178)[0m   "attention_dropout": 0.0,
[36m(RayTrainWorker pid=6857, ip=10.0.183.178)[0m   "bos_token_id": 151643,
[36m(RayTrainWorker pid=6857, ip=10.0.183


Training finished iteration 5 at 2025-09-19 16:15:14. Total running time: 3min 3s
╭─────────────────────────────────────────╮
│ Training result                         │
├─────────────────────────────────────────┤
│ checkpoint_dir_name   checkpoint_000004 │
│ time_this_iter_s               24.95763 │
│ time_total_s                  176.89323 │
│ training_iteration                    5 │
│ epoch                              1.96 │
│ grad_norm                       8.74722 │
│ learning_rate                        0. │
│ logits/chosen                   -0.8311 │
│ logits/rejected                -0.88197 │
│ logps/chosen                 -220.33481 │
│ logps/rejected               -298.58893 │
│ loss                             0.6834 │
│ rewards/accuracies                 0.45 │
│ rewards/chosen                  -0.0083 │
│ rewards/margins                 0.02398 │
│ rewards/rejected               -0.03228 │
│ step                                 25 │
╰────────────────────────────────────

 67%|██████▋   | 26/39 [02:33<01:09,  5.37s/it][0m 
[36m(RayTrainWorker pid=6857, ip=10.0.183.178)[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/mnt/cluster_storage/qwen2.5_7b_qlora_dpo/TorchTrainer_12134_00000_0_2025-09-19_16-12-10/checkpoint_000004)[32m [repeated 3x across cluster][0m
 69%|██████▉   | 27/39 [02:40<01:10,  5.86s/it][0m 
 72%|███████▏  | 28/39 [02:46<01:04,  5.88s/it][0m 
 74%|███████▍  | 29/39 [02:51<00:54,  5.45s/it][0m 


[36m(RayTrainWorker pid=6857, ip=10.0.183.178)[0m {'loss': 0.6069, 'grad_norm': 9.205820083618164, 'learning_rate': 9.412754953531664e-07, 'rewards/chosen': 0.010408895090222359, 'rewards/rejected': -0.0308152474462986, 'rewards/accuracies': 0.6111111640930176, 'rewards/margins': 0.04122414067387581, 'logps/chosen': -319.7648010253906, 'logps/rejected': -264.7530822753906, 'logits/chosen': -0.8459330797195435, 'logits/rejected': -0.958310604095459, 'epoch': 2.32}


 77%|███████▋  | 30/39 [02:58<00:54,  6.05s/it][INFO|trainer.py:3993] 2025-09-19 16:15:42,895 >> Saving model checkpoint to qwen2.5_7b_qlora_dpo/checkpoint-30
[36m(RayTrainWorker pid=6858, ip=10.0.183.178)[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/mnt/cluster_storage/qwen2.5_7b_qlora_dpo/TorchTrainer_12134_00000_0_2025-09-19_16-12-10/checkpoint_000005)
[36m(RayTrainWorker pid=6857, ip=10.0.183.178)[0m [INFO|configuration_utils.py:698] 2025-09-19 16:15:43,141 >> loading configuration file config.json from cache at /home/ray/.cache/huggingface/hub/models--Qwen--Qwen2.5-7B-Instruct/snapshots/a09a35458c702b33eeacc393d103063234e8bc28/config.json
[36m(RayTrainWorker pid=6857, ip=10.0.183.178)[0m [INFO|configuration_utils.py:770] 2025-09-19 16:15:43,141 >> Model config Qwen2Config {
[36m(RayTrainWorker pid=6857, ip=10.0.183.178)[0m   "architectures": [
[36m(RayTrainWorker pid=6857, ip=10.0.183.178)[0m     "Qwen2ForCausalLM"
[36m(RayTrainWorker pid=68


Training finished iteration 6 at 2025-09-19 16:15:45. Total running time: 3min 34s
╭─────────────────────────────────────────╮
│ Training result                         │
├─────────────────────────────────────────┤
│ checkpoint_dir_name   checkpoint_000005 │
│ time_this_iter_s               30.96431 │
│ time_total_s                  207.85754 │
│ training_iteration                    6 │
│ epoch                              2.32 │
│ grad_norm                       9.20582 │
│ learning_rate                        0. │
│ logits/chosen                  -0.84593 │
│ logits/rejected                -0.95831 │
│ logps/chosen                  -319.7648 │
│ logps/rejected               -264.75308 │
│ loss                             0.6069 │
│ rewards/accuracies              0.61111 │
│ rewards/chosen                  0.01041 │
│ rewards/margins                 0.04122 │
│ rewards/rejected               -0.03082 │
│ step                                 30 │
╰───────────────────────────────────

 79%|███████▉  | 31/39 [03:05<00:49,  6.24s/it][0m 
[36m(RayTrainWorker pid=6857, ip=10.0.183.178)[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/mnt/cluster_storage/qwen2.5_7b_qlora_dpo/TorchTrainer_12134_00000_0_2025-09-19_16-12-10/checkpoint_000005)[32m [repeated 3x across cluster][0m
 82%|████████▏ | 32/39 [03:10<00:41,  5.97s/it][0m 
 85%|████████▍ | 33/39 [03:14<00:32,  5.42s/it][0m 
 87%|████████▋ | 34/39 [03:20<00:27,  5.53s/it][0m 


[36m(RayTrainWorker pid=6857, ip=10.0.183.178)[0m {'loss': 0.683, 'grad_norm': 9.392861366271973, 'learning_rate': 2.4757783024395244e-07, 'rewards/chosen': -0.004318982362747192, 'rewards/rejected': -0.02901686355471611, 'rewards/accuracies': 0.5499999523162842, 'rewards/margins': 0.024697883054614067, 'logps/chosen': -224.56350708007812, 'logps/rejected': -296.472412109375, 'logits/chosen': -0.7405776977539062, 'logits/rejected': -0.8611633777618408, 'epoch': 2.72}


 90%|████████▉ | 35/39 [03:27<00:23,  5.99s/it][INFO|trainer.py:3993] 2025-09-19 16:16:11,899 >> Saving model checkpoint to qwen2.5_7b_qlora_dpo/checkpoint-35
[36m(RayTrainWorker pid=6858, ip=10.0.183.178)[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/mnt/cluster_storage/qwen2.5_7b_qlora_dpo/TorchTrainer_12134_00000_0_2025-09-19_16-12-10/checkpoint_000006)
[36m(RayTrainWorker pid=6857, ip=10.0.183.178)[0m [INFO|configuration_utils.py:698] 2025-09-19 16:16:12,136 >> loading configuration file config.json from cache at /home/ray/.cache/huggingface/hub/models--Qwen--Qwen2.5-7B-Instruct/snapshots/a09a35458c702b33eeacc393d103063234e8bc28/config.json
[36m(RayTrainWorker pid=6857, ip=10.0.183.178)[0m [INFO|configuration_utils.py:770] 2025-09-19 16:16:12,137 >> Model config Qwen2Config {
[36m(RayTrainWorker pid=6857, ip=10.0.183.178)[0m   "architectures": [
[36m(RayTrainWorker pid=6857, ip=10.0.183.178)[0m     "Qwen2ForCausalLM"
[36m(RayTrainWorker pid=68


Training finished iteration 7 at 2025-09-19 16:16:14. Total running time: 4min 3s
╭─────────────────────────────────────────╮
│ Training result                         │
├─────────────────────────────────────────┤
│ checkpoint_dir_name   checkpoint_000006 │
│ time_this_iter_s               29.29587 │
│ time_total_s                  237.15341 │
│ training_iteration                    7 │
│ epoch                              2.72 │
│ grad_norm                       9.39286 │
│ learning_rate                        0. │
│ logits/chosen                  -0.74058 │
│ logits/rejected                -0.86116 │
│ logps/chosen                 -224.56351 │
│ logps/rejected               -296.47241 │
│ loss                              0.683 │
│ rewards/accuracies                 0.55 │
│ rewards/chosen                 -0.00432 │
│ rewards/margins                  0.0247 │
│ rewards/rejected               -0.02902 │
│ step                                 35 │
╰────────────────────────────────────

 92%|█████████▏| 36/39 [03:35<00:19,  6.52s/it][0m 
[36m(RayTrainWorker pid=6857, ip=10.0.183.178)[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/mnt/cluster_storage/qwen2.5_7b_qlora_dpo/TorchTrainer_12134_00000_0_2025-09-19_16-12-10/checkpoint_000006)[32m [repeated 3x across cluster][0m
 95%|█████████▍| 37/39 [03:40<00:12,  6.02s/it][0m 
 97%|█████████▋| 38/39 [03:46<00:06,  6.02s/it][0m 
100%|██████████| 39/39 [03:47<00:00,  4.66s/it][INFO|trainer.py:3993] 2025-09-19 16:16:32,021 >> Saving model checkpoint to qwen2.5_7b_qlora_dpo/checkpoint-39
[36m(RayTrainWorker pid=6858, ip=10.0.183.178)[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/mnt/cluster_storage/qwen2.5_7b_qlora_dpo/TorchTrainer_12134_00000_0_2025-09-19_16-12-10/checkpoint_000007)
[36m(RayTrainWorker pid=6857, ip=10.0.183.178)[0m [INFO|configuration_utils.py:698] 2025-09-19 16:16:32,260 >> loading configuration file config.json from cache at /home/ray/.cache/h


Training finished iteration 8 at 2025-09-19 16:16:34. Total running time: 4min 23s
╭─────────────────────────────────────────╮
│ Training result                         │
├─────────────────────────────────────────┤
│ checkpoint_dir_name   checkpoint_000007 │
│ time_this_iter_s               19.72356 │
│ time_total_s                  256.87697 │
│ training_iteration                    8 │
│ epoch                              2.72 │
│ grad_norm                       9.39286 │
│ learning_rate                        0. │
│ logits/chosen                  -0.74058 │
│ logits/rejected                -0.86116 │
│ logps/chosen                 -224.56351 │
│ logps/rejected               -296.47241 │
│ loss                              0.683 │
│ rewards/accuracies                 0.55 │
│ rewards/chosen                 -0.00432 │
│ rewards/margins                  0.0247 │
│ rewards/rejected               -0.02902 │
│ step                                 35 │
╰───────────────────────────────────

[36m(RayTrainWorker pid=6857, ip=10.0.183.178)[0m [INFO|trainer.py:2676] 2025-09-19 16:16:34,433 >> 
[36m(RayTrainWorker pid=6857, ip=10.0.183.178)[0m 
[36m(RayTrainWorker pid=6857, ip=10.0.183.178)[0m Training completed. Do not forget to share your model on huggingface.co/models =)
[36m(RayTrainWorker pid=6857, ip=10.0.183.178)[0m 
[36m(RayTrainWorker pid=6857, ip=10.0.183.178)[0m 
100%|██████████| 39/39 [03:50<00:00,  5.90s/it][0m 
[36m(RayTrainWorker pid=6857, ip=10.0.183.178)[0m [INFO|trainer.py:3993] 2025-09-19 16:16:34,436 >> Saving model checkpoint to qwen2.5_7b_qlora_dpo
[36m(RayTrainWorker pid=6857, ip=10.0.183.178)[0m [INFO|configuration_utils.py:698] 2025-09-19 16:16:34,684 >> loading configuration file config.json from cache at /home/ray/.cache/huggingface/hub/models--Qwen--Qwen2.5-7B-Instruct/snapshots/a09a35458c702b33eeacc393d103063234e8bc28/config.json
[36m(RayTrainWorker pid=6857, ip=10.0.183.178)[0m [INFO|configuration_utils.py:770] 2025-09-19 16:16:34

[36m(RayTrainWorker pid=6857, ip=10.0.183.178)[0m ***** train metrics *****
[36m(RayTrainWorker pid=6857, ip=10.0.183.178)[0m   epoch                    =        3.0
[36m(RayTrainWorker pid=6857, ip=10.0.183.178)[0m   total_flos               = 12512612GF
[36m(RayTrainWorker pid=6857, ip=10.0.183.178)[0m   train_loss               =      0.661
[36m(RayTrainWorker pid=6857, ip=10.0.183.178)[0m   train_runtime            = 0:03:51.38
[36m(RayTrainWorker pid=6857, ip=10.0.183.178)[0m   train_samples_per_second =      1.297
[36m(RayTrainWorker pid=6857, ip=10.0.183.178)[0m   train_steps_per_second   =      0.169
[36m(RayTrainWorker pid=6857, ip=10.0.183.178)[0m Figure saved at: qwen2.5_7b_qlora_dpo/training_loss.png
[36m(RayTrainWorker pid=6857, ip=10.0.183.178)[0m Figure saved at: qwen2.5_7b_qlora_dpo/training_rewards_accuracies.png


[36m(RayTrainWorker pid=6857, ip=10.0.183.178)[0m [INFO|modelcard.py:450] 2025-09-19 16:16:35,095 >> Dropping the following result as it does not have all the necessary fields:
[36m(RayTrainWorker pid=6857, ip=10.0.183.178)[0m {'task': {'name': 'Causal Language Modeling', 'type': 'text-generation'}}



Training completed after 8 iterations at 2025-09-19 16:16:36. Total running time: 4min 25s


2025-09-19 16:16:36,549	INFO tune.py:1009 -- Wrote the latest version of all result files and experiment state to '/mnt/cluster_storage/qwen2.5_7b_qlora_dpo' in 0.0220s.





[36m(RayTrainWorker pid=6857, ip=10.0.183.178)[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/mnt/cluster_storage/qwen2.5_7b_qlora_dpo/TorchTrainer_12134_00000_0_2025-09-19_16-12-10/checkpoint_000007)[32m [repeated 3x across cluster][0m


### Option B — Run as an Anyscale Job (production)

For longer or production runs, submit the training as an **Anyscale Job**. Jobs run outside your interactive session, which gives you better stability, automatic retries, and durable logs. You’ll package LLaMA-Factory and other libraries in a container image and launch with a short job config. See **[Launching Fine-Tuning with Anyscale Jobs](3.10-launch-fine-tuning-with-anyscale-jobs.md)** for the step-by-step guide.

### Monitoring with TensorBoard
If you enabled TensorBoard logging (`report_to: tensorboard` in your YAML), you can watch metrics such as training loss update live and compare runs that share a run name side by side.

- **While the job is running:** LLaMA-Factory prints a ready-to-run command that starts with `tensorboard --logdir`. Open a new terminal and run it. Example:
  ```bash
  tensorboard --logdir /tmp/ray/session_*/artifacts/*/qwen2.5_7b_qlora_dpo/driver_artifacts
  ```

- **After the job (shared storage):** Point TensorBoard at `{ray_storage_path}/{ray_run_name}/`. Each `TorchTrainer_*` subfolder holds event files for a single run. Using the parent folder aggregates all runs for easy comparison.
  ```bash
  tensorboard --logdir /mnt/cluster_storage/qwen2.5_7b_qlora_dpo
  ```

In your Anyscale workspace, look for the open **port 6006** labeled **TensorBoard** to view the dashboards.
![Anyscale workspace showing open ports with TensorBoard on port 6006](https://anyscale-public-materials.s3.us-west-2.amazonaws.com/llm-finetuning/llama-factory/open-ports.png)

**TensorBoard**
![TensorBoard](https://anyscale-public-materials.s3.us-west-2.amazonaws.com/llm-finetuning/llama-factory/3.2.2/3.2.2-tensorboard.png)
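
When reading the `loss` and `rewards/*` curves, it helps to know how they relate. The sketch below assumes the standard sigmoid DPO objective with no extra loss terms (an assumption about your YAML, so check your config). Under that assumption the per-pair loss is `-log(sigmoid(margin))`, where the margin is the chosen reward minus the rejected reward. A margin of zero gives a loss of ln 2 ≈ 0.6931, which is why the early log lines sit near 0.69. The function name `dpo_sigmoid_loss` is hypothetical and used only for illustration.

```python
import math


def dpo_sigmoid_loss(margin: float) -> float:
    """Return -log(sigmoid(margin)), computed in a numerically stable way."""
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# A margin of zero means the policy does not yet prefer "chosen" over "rejected".
# Larger positive margins drive the loss toward zero.
for margin in (0.0, 0.05, 0.5, 2.0):
    print(f"margin={margin:<4} loss={dpo_sigmoid_loss(margin):.4f}")
```

The logged `loss` and `rewards/margins` are averaged over batches and logging steps, so applying this formula to a logged average margin will not reproduce the logged loss exactly. Use it as a rough guide: a falling loss together with rising `rewards/margins` and `rewards/accuracies` suggests the policy is learning to separate chosen from rejected responses.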

For a more detailed guide on tracking experiments with other tools such as WandB or MLflow, see the [observability and tracking guide](3.5-observability-and-tracking.md).

## Step 5: Locate Checkpoints

Checkpoints are written under `{ray_storage_path}/{ray_run_name}`. In this example run, the path is `/mnt/cluster_storage/qwen2.5_7b_qlora_dpo`.

Inside, you’ll see a **trainer session** directory named like:
`TorchTrainer_12134_00000_0_2025-09-19_16-12-10/`.

- `TorchTrainer_*` is created **when the trainer starts**; the suffix encodes a short run id and the **start timestamp**.
- Within that directory, checkpoints are named `checkpoint_000xxx/`, where the number is a zero-based index that increases with each saved checkpoint.

The save cadence is controlled by `save_strategy` and `save_steps`. To resume interrupted training with `resume_from_checkpoint`, and for more on the output layout, see the [checkpointing guide](3.4-checkpointing.md#understanding-your-training-output-directory).
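
If you want to locate the most recent checkpoint programmatically, the directory layout described above can be walked with a few lines of Python. This is a minimal sketch, not part of LLaMA-Factory or Ray: the helper name `latest_checkpoint` is hypothetical, and it assumes session directory names end with the `YYYY-MM-DD_HH-MM-SS` start timestamp as shown in this run.

```python
from pathlib import Path
from typing import Optional


def _checkpoint_index(path: Path) -> int:
    """Parse the numeric suffix of a checkpoint_000xxx directory name."""
    return int(path.name.rsplit("_", 1)[1])


def latest_checkpoint(run_dir: str) -> Optional[Path]:
    """Return the highest-numbered checkpoint of the newest TorchTrainer_* session."""
    sessions = [p for p in Path(run_dir).glob("TorchTrainer_*") if p.is_dir()]
    if not sessions:
        return None
    # The last 19 characters are the start timestamp, which sorts
    # chronologically when compared as plain text.
    newest_session = max(sessions, key=lambda p: p.name[-19:])
    checkpoints = [
        p
        for p in newest_session.glob("checkpoint_*")
        if p.is_dir() and p.name.rsplit("_", 1)[1].isdigit()
    ]
    if not checkpoints:
        return None
    return max(checkpoints, key=_checkpoint_index)


if __name__ == "__main__":
    # Path taken from the example run above; prints None if it does not exist.
    print(latest_checkpoint("/mnt/cluster_storage/qwen2.5_7b_qlora_dpo"))
```

Sorting by the timestamp in the directory name rather than by file modification time keeps the result stable if files are copied or synced between storage locations.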

## Step 6: Export the Model

If you use LoRA or QLoRA, you can keep the base model and adapter separate (for example, for [multi-LoRA serving](https://docs.anyscale.com/llm/serving/multi-lora)) or merge the adapter into the base model for low-latency inference.

For full fine-tuning or freeze-tuning, export the fine-tuned model directly.

You may optionally apply post-training quantization on merged or full models before serving. See [todo: add doc link]() for the exact export commands and options.
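
As a rough illustration of the merge path, the sketch below writes a small export config and prints the command to run. Treat it as an assumption-laden starting point: the key names follow LLaMA-Factory's LoRA merge examples as we understand them, the adapter path and output directory are placeholders, and `template: qwen` is assumed to match the template used for training. Verify the keys against the LLaMA-Factory version you installed. Note that the adapter is typically merged into the full-precision base model, so the config deliberately omits `quantization_bit`.

```python
import json
from pathlib import Path

# All values below are placeholders for illustration; adjust them to your run.
export_config = {
    "model_name_or_path": "Qwen/Qwen2.5-7B-Instruct",  # full-precision base model
    "adapter_name_or_path": "/path/to/your/adapter_checkpoint",  # hypothetical path
    "template": "qwen",  # assumption: same template as in the training YAML
    "finetuning_type": "lora",
    "export_dir": "qwen2.5_7b_dpo_merged",  # hypothetical output directory
}

config_path = Path("merge_adapter.json")
config_path.write_text(json.dumps(export_config, indent=2))
print(f"Wrote {config_path}; then run: llamafactory-cli export {config_path}")
```

The same keys can be written by hand in a YAML file if you prefer to keep all of your configs in one format.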