```{orphan}
```

# Kahneman–Tversky Optimization (KTO) at Scale with LoRA

This guide provides a step-by-step workflow for preference fine-tuning the `meta-llama/Meta-Llama-3-8B-Instruct` model on a multi-GPU Anyscale cluster. We’ll use **LLaMA-Factory** as the training framework and **LoRA** to reduce memory footprint and enable efficient multi-GPU training.

**What is KTO?** *Kahneman–Tversky Optimization* aligns a model to human preferences using **single binary labels (accept/reject)** instead of pairwise “chosen vs. rejected” comparisons. KTO directly optimizes the policy on these unary signals, simplifying data preparation while still encouraging preferred behavior and discouraging undesired outputs.
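
As a rough illustration of how a single binary label drives the update, the sketch below computes a per-example KTO loss in the form given in the KTO paper. It is not LLaMA-Factory's implementation: the function names, the scalar log-probability inputs, and the `lambda_d`/`lambda_u` weights are assumptions made for illustration, and `beta=0.1` mirrors the `pref_beta` value used in the config later in this guide.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def kto_example_loss(policy_logp, ref_logp, desirable, kl_ref,
                     beta=0.1, lambda_d=1.0, lambda_u=1.0):
    """Per-example KTO loss (illustrative sketch, hypothetical helper).

    policy_logp / ref_logp: log-probability of the completion under the
    policy and the frozen reference model.
    kl_ref: batch-level estimate of KL(policy || reference), used as the
    reference point against which gains and losses are measured.
    """
    implied_reward = policy_logp - ref_logp
    if desirable:
        # Accepted examples are pushed above the reference point.
        return lambda_d * (1.0 - sigmoid(beta * (implied_reward - kl_ref)))
    # Rejected examples are pushed below the reference point.
    return lambda_u * (1.0 - sigmoid(beta * (kl_ref - implied_reward)))


# Before any update the policy equals the reference, so the loss is 0.5.
print(kto_example_loss(-10.0, -10.0, desirable=True, kl_ref=0.0))  # 0.5
```

A freshly initialized LoRA adapter leaves the policy equal to the reference, which is consistent with the loss values near 0.5 at the start of the training output in Step 4.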

## Step 1: Set Up Your Environment
### Dependencies
First, make sure the environment has the right libraries. We'll start from a pre-built container image and install LLaMA-Factory, plus a few optional tools, on top of it.

Recommended Container Image:
```bash
anyscale/ray-llm:2.48.0-py311-cu128
```

Execute the following commands to install the required packages and optional tools for experiment tracking and faster downloads.

```bash
# Install the specific version of LLaMA-Factory
pip install -q llamafactory@git+https://github.com/hiyouga/LLaMA-Factory.git@v0.9.3

# (Optional) For accelerated model downloads from Hugging Face
pip install -q hf_transfer==0.1.9

# (Optional) Acceleration methods (ensure CUDA/Torch compatibility)
pip install -q flash-attn==2.8.3 liger-kernel==0.6.2

# (Optional) Experiment tracking library
pip install -q mlflow==3.4.0
```

## Model and Resources

| Item | Value |
|------|-------|
| **Base model** | [`meta-llama/Meta-Llama-3-8B-Instruct`](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) |
| **Workers** | 4 × L40S / A100 (1 GPU each) |

> Compared to SFT, KTO typically holds two copies of the model (policy + reference), and alignment datasets often use long contexts, so GPUs with larger VRAM are recommended. Techniques like **LoRA** and memory-efficient attention can further reduce memory pressure.
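
To make the LoRA saving concrete, the sketch below estimates the number of trainable adapter parameters for rank 8 applied to every linear projection, which is the setting used in the config later in this guide. The layer dimensions are assumptions taken from the public Llama-3-8B model configuration rather than from this guide, and the helper name is hypothetical.

```python
# Assumed Llama-3-8B dimensions (public model config, not stated in this guide).
HIDDEN = 4096
INTERMEDIATE = 14336
KV_DIM = 1024  # 8 key/value heads x 128 head dimension
NUM_LAYERS = 32

# (input features, output features) of each linear module in one decoder layer.
LINEAR_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(rank, shapes=LINEAR_SHAPES, num_layers=NUM_LAYERS):
    """Each adapted module adds A (rank x d_in) and B (d_out x rank)."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers


print(lora_param_count(8))  # 20971520
```

Under these assumptions the estimate equals the `trainable params: 20,971,520` line that LLaMA-Factory prints in the training output in Step 4, a small fraction of the roughly 8 billion base parameters.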

## Step 2: Prepare the Dataset

### Understand the Dataset
For this tutorial, we will use `kto_en_demo`, a unary-preference dataset for **KTO (Kahneman–Tversky Optimization)**.  
Each record contains a multi-turn ShareGPT-style dialogue plus a **binary label** indicating whether the modeled behavior is preferred.

This dataset contains:
- `messages`: Turn-by-turn chat between a user and the assistant.
- `label`: A boolean (`true`/`false`) indicating whether the example is preferred.

**Note:** To maintain role alignment in ShareGPT format, turns must follow a strict order: `human` and `observation` (tool output) turns appear in odd-numbered positions (1, 3, 5, ...), while `gpt` and `function_call` turns appear in even-numbered positions (2, 4, 6, ...). This dataset uses `user` and `assistant` in place of `human` and `gpt`; the `tags` block in `dataset_info.json` declares that mapping. The model learns to generate the content of the `gpt` and `function_call` turns.

<details>
  <summary>Dataset Example</summary>

  ```json
  {
    "messages": [
      { "role": "user", "content": "Compare and contrast the roles of the hippocampus and the prefrontal cortex..." },
      { "role": "assistant", "content": "The human brain is a highly complex organ, responsible for a myriad of cognitive functions..." },
      { "role": "user", "content": "Discuss the mechanisms through which the prefrontal cortex ..." },
      { "role": "assistant", "content": "The prefrontal cortex (PFC)..." },
      { "role": "user", "content": "Can you elaborate on the role of the amygdala..." },
      { "role": "assistant", "content": "The amygdala plays a crucial role in the emotional processing of stored memories..." }
    ],
    "label": true
  }
  ```

</details>
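
The turn-order rule can be checked before training with a few lines of Python. This is a minimal sketch, not part of LLaMA-Factory: the function name is hypothetical, and the role names assume the `user`/`assistant` tags of this dataset plus `observation` and `function_call` for tool turns.

```python
# Role names are assumptions based on this dataset's tags.
USER_SIDE = {"user", "observation"}
ASSISTANT_SIDE = {"assistant", "function_call"}


def turn_order_errors(messages, role_key="role"):
    """Return a list of human-readable problems; an empty list means OK."""
    errors = []
    for position, message in enumerate(messages, start=1):
        # Odd positions belong to the user side, even positions to the model.
        expected = USER_SIDE if position % 2 == 1 else ASSISTANT_SIDE
        role = message.get(role_key)
        if role not in expected:
            errors.append(
                f"turn {position}: got {role!r}, expected one of {sorted(expected)}"
            )
    return errors


good = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello!"},
]
bad = [
    {"role": "user", "content": "Hi"},
    {"role": "user", "content": "Are you there?"},
]
print(turn_order_errors(good))  # []
print(turn_order_errors(bad))
```

Running this over every record's `messages` list surfaces misordered dialogues early, instead of during tokenization on the cluster.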

### Register the local dataset

To specify new datasets that are accessible across Ray worker nodes, add all dataset files and a `dataset_info.json` to **[storage shared across nodes](https://docs.anyscale.com/configuration/storage#shared)** such as `/mnt/cluster_storage`.

For example, to register a local copy of `kto_en_demo` for KTO fine-tuning, add an entry like the following to `dataset_info.json`:

- `kto_tag` names the column that holds the binary preference label used by KTO.
- `tags` tells the loader which keys and values identify roles and content in ShareGPT-style records.

```json
{
  "my_kto_en_demo": {
    "file_name": "/mnt/cluster_storage/kto_en_demo.json",
    "formatting": "sharegpt",
    "columns": {
      "messages": "messages",
      "kto_tag": "label"
    },
    "tags": {
      "role_tag": "role",
      "content_tag": "content",
      "user_tag": "user",
      "assistant_tag": "assistant"
    }
  }
}
```
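
Before launching a run, it can help to confirm that the records actually match the `columns` mapping and to see how the labels are split. The sketch below is a hypothetical helper, not a LLaMA-Factory API; it assumes the records are already loaded as a list of dictionaries and that `columns` is the mapping from the registry entry above.

```python
def summarize_kto_labels(records, columns):
    """Validate the mapped columns and count desirable/undesirable examples."""
    messages_key = columns["messages"]
    label_key = columns["kto_tag"]
    counts = {"desirable": 0, "undesirable": 0}
    for index, record in enumerate(records):
        if messages_key not in record:
            raise KeyError(f"record {index} has no {messages_key!r} field")
        label = record.get(label_key)
        if not isinstance(label, bool):
            raise TypeError(f"record {index}: {label_key!r} must be true/false")
        counts["desirable" if label else "undesirable"] += 1
    return counts


columns = {"messages": "messages", "kto_tag": "label"}
records = [
    {"messages": [{"role": "user", "content": "Hi"}], "label": True},
    {"messages": [{"role": "user", "content": "Hi"}], "label": False},
    {"messages": [{"role": "user", "content": "Hi"}], "label": True},
]
print(summarize_kto_labels(records, columns))
```

In practice, `records` would come from `json.load` on `/mnt/cluster_storage/kto_en_demo.json` and `columns` from the `my_kto_en_demo` entry in `dataset_info.json`. A heavily skewed split is worth knowing about before tuning KTO hyperparameters.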

> For a more detailed guide to dataset preparation and formatting, see the [data preparation documentation](https://docs.anyscale.com/llm/fine-tuning/data-preparation).


```bash
# Make sure all files are accessible to worker nodes
# Create a copy of the data in /mnt/cluster_storage
wget https://anyscale-public-materials.s3.us-west-2.amazonaws.com/llm-finetuning/llama-factory/datasets/sharegpt/kto_en_demo.json -O /mnt/cluster_storage/kto_en_demo.json
# Create a copy of the dataset registry in /mnt/cluster_storage
cp ../dataset-configs/dataset_info.json /mnt/cluster_storage/
```

## Step 3: Create the Preference-Tuning Config (KTO + LoRA)

Create a YAML file that defines your **KTO** run. It specifies the base model, dataset, **LoRA** settings, KTO hyperparameters, optional acceleration methods, logging, and Ray cluster resources.

**Important notes:**
- **Acceleration libs:** `flash_attn` and `liger-kernel` can be used together, but actual speed and memory gains vary with GPU architecture, sequence length, batch size, precision, and kernel availability. Benchmark your own training workloads to confirm the improvement. Note that `fa2` is not supported on Turing GPUs (e.g., T4); use `flash_attn: auto` there.
- **Access & paths:** The YAML only needs to be on the **head node**, but any referenced paths (e.g., `dataset_dir`, `ray_storage_path`, `output_dir`) must be on **shared storage** (such as `/mnt/cluster_storage/`) visible to all workers.
- **Gated models:** If your base model is gated, set `HF_TOKEN` in the Ray runtime environment (`ray_init_kwargs.runtime_env.env_vars`).
- **Memory tips:** If VRAM is tight, consider enabling gradient checkpointing or switching to QLoRA (4-bit or 8-bit) and adding the corresponding quantization keys to the YAML.
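
The shared-storage requirement is easy to overlook, so a small pre-flight check can be useful. The sketch below is a hypothetical helper that flags configured paths that are relative or fall outside an assumed shared root of `/mnt/cluster_storage`. It compares paths lexically only and does not resolve symlinks or `..` segments.

```python
from pathlib import PurePosixPath


def paths_outside_shared_storage(
    config,
    keys=("dataset_dir", "ray_storage_path", "output_dir"),
    shared_root="/mnt/cluster_storage",
):
    """Return {key: value} for paths that are not under the shared root."""
    root = PurePosixPath(shared_root)
    offenders = {}
    for key in keys:
        value = config.get(key)
        if value is None:
            continue
        path = PurePosixPath(value)
        under_root = path == root or root in path.parents
        if not path.is_absolute() or not under_root:
            offenders[key] = value
    return offenders


# Hypothetical values for illustration only.
example_config = {
    "dataset_dir": "/mnt/cluster_storage",
    "ray_storage_path": "/mnt/cluster_storage/",
    "output_dir": "/tmp/kto_run",
}
print(paths_outside_shared_storage(example_config))
```

With the hypothetical values above, only `output_dir` is reported, because `/tmp` is local to each node.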

### LLaMA-Factory + Ray Configuration

```yaml
# train-configs/kto_lora.yaml (the file passed to `llamafactory-cli train` in Step 4)

### model
model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct
trust_remote_code: true

### method
stage: kto
do_train: true
finetuning_type: lora
lora_rank: 8
lora_target: all
pref_beta: 0.1

### acceleration methods
# both methods can be enabled at the same time
flash_attn: fa2            # speed up attention and cut activation memory at long context, use auto on Turing GPUs (e.g., T4)
enable_liger_kernel: true  # reduce VRAM and improve throughput across multiple transformer ops

### dataset
dataset: my_kto_en_demo
dataset_dir: /mnt/cluster_storage

template: llama3
cutoff_len: 1024
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: llama3_8b_lora_kto
logging_steps: 5
save_steps: 50
plot_loss: true
overwrite_output_dir: true
report_to: mlflow   # or none

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 2
num_train_epochs: 3.0  # low for demo purpose; adjust as needed
learning_rate: 5.0e-6
bf16: true
lr_scheduler_type: cosine
warmup_ratio: 0.1
ddp_timeout: 180000000

### ray
ray_run_name: llama3_8b_kto_lora
ray_storage_path: /mnt/cluster_storage/
ray_num_workers: 4
resources_per_worker:
  GPU: 1
  anyscale/accelerator_shape:4xL40S: 0.001  # pin a specific node shape
  # accelerator_type:L40S: 0.001            # or just request a GPU type

ray_init_kwargs:
  runtime_env:
    env_vars:
      # if using gated models like meta-llama/Meta-Llama-3-8B-Instruct
      HF_TOKEN: <your_huggingface_token>
      # Enable faster downloads if hf_transfer is installed:
      HF_HUB_ENABLE_HF_TRANSFER: '1'
      # if using mlflow for experiment tracking
      MLFLOW_TRACKING_URI: "https://<your_cloud_id>.cloud.databricks.com"
      MLFLOW_TRACKING_TOKEN: "<mlflow_tracking_token>"
      MLFLOW_EXPERIMENT_NAME: "/Users/<your_user_id>/experiment_name"
```
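
The batch-related keys interact with the number of Ray workers, so it is worth working out the effective batch size and step count before launching. The arithmetic below is a sketch with a hypothetical helper name. It assumes the 300 examples reported in the training output and rounds partial batches up; the exact rounding may differ between Transformers versions.

```python
import math


def training_schedule(num_examples, num_workers, per_device_batch,
                      grad_accum, epochs):
    """Estimate global batch size and optimizer steps for data-parallel training."""
    global_batch = per_device_batch * num_workers * grad_accum
    # Each worker sees its own shard of the dataset every epoch.
    batches_per_worker = math.ceil(num_examples / (per_device_batch * num_workers))
    # One optimizer step is taken every `grad_accum` micro-batches.
    steps_per_epoch = math.ceil(batches_per_worker / grad_accum)
    return {
        "global_batch": global_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": math.ceil(epochs * steps_per_epoch),
    }


print(training_schedule(num_examples=300, num_workers=4,
                        per_device_batch=1, grad_accum=2, epochs=3.0))
# {'global_batch': 8, 'steps_per_epoch': 38, 'total_steps': 114}
```

These numbers line up with the `Total train batch size ... = 8` line and the 114-step progress bar in the training output in Step 4. With `save_steps: 50`, that run writes checkpoints at steps 50 and 100.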

## Step 4: Train and Monitor

With all configuration in place, you can launch fine-tuning/post-training in one of two ways.

### Option A — Run from a Workspace (quick start)

The `USE_RAY=1` prefix tells LLaMA-Factory to run in distributed mode on the Ray cluster attached to your workspace.

```bash
USE_RAY=1 llamafactory-cli train ../train-configs/kto_lora.yaml
```

INFO 09-22 17:58:43 [__init__.py:248] No platform detected, vLLM is running on UnspecifiedPlatform


2025-09-22 17:58:47,161	INFO worker.py:1747 -- Connecting to existing Ray cluster at address: 10.0.84.195:6379...
2025-09-22 17:58:47,173	INFO worker.py:1918 -- Connected to Ray cluster. View the dashboard at [1m[32mhttps://session-zwu6m5dfjub67c3wfgsrmyfc7f.i.anyscaleuserdata.com [39m[22m
2025-09-22 17:58:47,175	INFO packaging.py:380 -- Pushing file package 'gcs://_ray_pkg_bfae09dd86bdebe0b1c521da35d6b3a9fcf596f8.zip' (0.22MiB) to Ray cluster...
2025-09-22 17:58:47,176	INFO packaging.py:393 -- Successfully pushed file package 'gcs://_ray_pkg_bfae09dd86bdebe0b1c521da35d6b3a9fcf596f8.zip'.



View detailed results here: /mnt/cluster_storage/llama3_8b_kto_lora
To visualize your results with TensorBoard, run: `tensorboard --logdir /tmp/ray/session_2025-09-22_17-27-35_971622_2432/artifacts/2025-09-22_17-58-47/llama3_8b_kto_lora/driver_artifacts`

Training started with configuration:
╭──────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Training config                                                                                              │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ train_loop_config/args/bf16                                                                             True │
│ train_loop_config/args/cutoff_len                                                                       1024 │
│ train_loop_config/args/dataset                                                                my_kto_en_demo │
│ train_loop_config/args/dat

[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m Setting up process group for: env:// [rank=0, world_size=4]
[36m(TorchTrainer pid=9165, ip=10.0.121.215)[0m Started distributed worker processes: 
[36m(TorchTrainer pid=9165, ip=10.0.121.215)[0m - (node_id=a82307d5edc33319835b52f5f6a1b9248a325214f61311ca3dee4da5, ip=10.0.121.215, pid=9270) world_rank=0, local_rank=0, node_rank=0
[36m(TorchTrainer pid=9165, ip=10.0.121.215)[0m - (node_id=a82307d5edc33319835b52f5f6a1b9248a325214f61311ca3dee4da5, ip=10.0.121.215, pid=9272) world_rank=1, local_rank=1, node_rank=0
[36m(TorchTrainer pid=9165, ip=10.0.121.215)[0m - (node_id=a82307d5edc33319835b52f5f6a1b9248a325214f61311ca3dee4da5, ip=10.0.121.215, pid=9271) world_rank=2, local_rank=2, node_rank=0
[36m(TorchTrainer pid=9165, ip=10.0.121.215)[0m - (node_id=a82307d5edc33319835b52f5f6a1b9248a325214f61311ca3dee4da5, ip=10.0.121.215, pid=9269) world_rank=3, local_rank=3, node_rank=0


[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m [INFO|2025-09-22 17:58:59] llamafactory.hparams.parser:143 >> Set `ddp_find_unused_parameters` to False in DDP training since LoRA is enabled.
[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m [INFO|2025-09-22 17:58:59] llamafactory.hparams.parser:406 >> Process rank: 0, world size: 4, device: cuda:0, distributed training: True, compute dtype: torch.bfloat16


[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m [INFO|tokenization_utils_base.py:2023] 2025-09-22 17:59:00,054 >> loading file tokenizer.json from cache at /home/ray/.cache/huggingface/hub/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/8afb486c1db24fe5011ec46dfbe5b5dccdb575c2/tokenizer.json
[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m [INFO|tokenization_utils_base.py:2023] 2025-09-22 17:59:00,054 >> loading file tokenizer.model from cache at None
[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m [INFO|tokenization_utils_base.py:2023] 2025-09-22 17:59:00,054 >> loading file added_tokens.json from cache at None
[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m [INFO|tokenization_utils_base.py:2023] 2025-09-22 17:59:00,055 >> loading file special_tokens_map.json from cache at /home/ray/.cache/huggingface/hub/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/8afb486c1db24fe5011ec46dfbe5b5dccdb575c2/special_tokens_map.json
[36m(RayTrainWorker pid=9270, ip=10.0.1

[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m [INFO|2025-09-22 17:59:01] llamafactory.data.template:143 >> Add pad token: <|eot_id|>
[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m [INFO|2025-09-22 17:59:01] llamafactory.data.template:143 >> Add <|eom_id|> to stop words.
[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m [INFO|2025-09-22 17:59:01] llamafactory.data.loader:143 >> Loading dataset /mnt/cluster_storage/kto_en_demo.json...


[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m [INFO|tokenization_utils_base.py:2299] 2025-09-22 17:59:01,529 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Converting format of dataset (num_proc=16):   0%|          | 0/300 [00:00<?, ? examples/s]
Converting format of dataset (num_proc=16): 100%|██████████| 300/300 [00:00<00:00, 1906.76 examples/s]
Running tokenizer on dataset (num_proc=16):   0%|          | 0/300 [00:00<?, ? examples/s]
Running tokenizer on dataset (num_proc=16):   6%|▋         | 19/300 [00:00<00:12, 23.07 examples/s]
Running tokenizer on dataset (num_proc=16):  13%|█▎        | 38/300 [00:00<00:05, 45.14 examples/s]
Running tokenizer on dataset (num_proc=16):  19%|█▉        | 57/300 [00:01<00:03, 63.07 examples/s]
Running tokenizer on dataset (num_proc=16):  25%|██▌       | 76/300 [00:01<00:02, 78.45 examples/s]
Running tokenizer on dataset (num_proc=16):  32%|███▏      | 95/300 [00:01<00:0

[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m training example:
[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m input_ids:
[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m [128000, 128006, 882, 128007, 271, 791, 12411, 17657, 9849, 374, 2133, 1306, 21930, 1698, 11, 9660, 315, 6500, 2082, 430, 4685, 5694, 389, 279, 19002, 315, 8191, 3932, 311, 3839, 477, 24927, 872, 2930, 7640, 627, 48, 25, 16299, 374, 279, 1888, 12399, 315, 420, 4652, 5380, 38053, 701, 4320, 505, 512, 4444, 570, 4435, 198, 5462, 570, 13482, 198, 3100, 570, 8184, 198, 5549, 570, 10170, 17146, 4842, 198, 40, 1781, 279, 4320, 374, 128009, 128006, 78191, 128007, 271, 46, 2319, 297, 2319, 23128, 23128, 0, 353, 70, 343, 3491, 9, 6914, 757, 1781, 1131, 507, 2319, 297, 2319, 23128, 23128, 0, 353, 70, 343, 3491, 9, 578, 1888, 12399, 315, 420, 4652, 374, 1131, 353, 3696, 372, 1119, 9, 1131, 423, 0, 10170, 17146, 4842, 0, 816, 352, 0, 353, 6263, 29037, 9, 578, 12411, 17657, 9849, 374, 7556, 922, 21930, 1698, 11, 902,

[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m [INFO|configuration_utils.py:698] 2025-09-22 17:59:06,081 >> loading configuration file config.json from cache at /home/ray/.cache/huggingface/hub/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/8afb486c1db24fe5011ec46dfbe5b5dccdb575c2/config.json
[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m [INFO|configuration_utils.py:770] 2025-09-22 17:59:06,082 >> Model config LlamaConfig {
[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m   "architectures": [
[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m     "LlamaForCausalLM"
[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m   ],
[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m   "attention_bias": false,
[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m   "attention_dropout": 0.0,
[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m   "bos_token_id": 128000,
[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m   "eos_token_id": 128009,
[36m(RayTrainWorker pid=9270, ip

[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m [INFO|2025-09-22 17:59:06] llamafactory.model.model_utils.liger_kernel:143 >> Current training stage does not support chunked cross entropy.
[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m [INFO|2025-09-22 17:59:06] llamafactory.model.model_utils.liger_kernel:143 >> Liger kernel has been applied to the model.


[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m [INFO|modeling_utils.py:1151] 2025-09-22 17:59:06,380 >> loading weights file model.safetensors from cache at /home/ray/.cache/huggingface/hub/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/8afb486c1db24fe5011ec46dfbe5b5dccdb575c2/model.safetensors.index.json
[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m [INFO|modeling_utils.py:2241] 2025-09-22 17:59:06,381 >> Instantiating LlamaForCausalLM model under default dtype torch.bfloat16.
[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m [INFO|configuration_utils.py:1135] 2025-09-22 17:59:06,383 >> Generate config GenerationConfig {
[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m   "bos_token_id": 128000,
[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m   "eos_token_id": 128009,
[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m   "use_cache": false
[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m }
[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m 
Loading check

[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m [INFO|2025-09-22 17:59:09] llamafactory.model.model_utils.checkpointing:143 >> Gradient checkpointing enabled.
[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m [INFO|2025-09-22 17:59:09] llamafactory.model.model_utils.attention:143 >> Using FlashAttention-2 for faster training and inference.
[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m [INFO|2025-09-22 17:59:09] llamafactory.model.adapter:143 >> Upcasting trainable params to float32.
[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m [INFO|2025-09-22 17:59:09] llamafactory.model.adapter:143 >> Fine-tuning method: LoRA
[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m [INFO|2025-09-22 17:59:09] llamafactory.model.model_utils.misc:143 >> Found linear modules: o_proj,v_proj,up_proj,q_proj,gate_proj,k_proj,down_proj
[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m [INFO|2025-09-22 17:59:09] llamafactory.model.loader:143 >> trainable params: 20,971,520 || all params: 8,051,

[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m [INFO|trainer.py:756] 2025-09-22 17:59:10,484 >> Using auto half precision backend
[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m [INFO|trainer.py:2409] 2025-09-22 17:59:11,110 >> ***** Running training *****
[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m [INFO|trainer.py:2410] 2025-09-22 17:59:11,110 >>   Num examples = 300
[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m [INFO|trainer.py:2411] 2025-09-22 17:59:11,110 >>   Num Epochs = 3
[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m [INFO|trainer.py:2412] 2025-09-22 17:59:11,110 >>   Instantaneous batch size per device = 1
[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m [INFO|trainer.py:2415] 2025-09-22 17:59:11,110 >>   Total train batch size (w. parallel, distributed & accumulation) = 8
[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m [INFO|trainer.py:2416] 2025-09-22 17:59:11,110 >>   Gradient Accumulation steps = 2
[36m(RayTrainWorker pid=9270, ip=10.0.

[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m {'loss': 0.5056, 'grad_norm': 3.1052091121673584, 'learning_rate': 1.6666666666666667e-06, 'rewards/chosen': -0.009211322026593345, 'logps/chosen': -413.89833286830356, 'logits/chosen': -24049542.85714286, 'rewards/rejected': 0.019812012712160747, 'logps/rejected': -970.287109375, 'logits/rejected': -30141077.333333332, 'rewards/margins': -0.02902333473875409, 'kl': 0.7326087951660156, 'epoch': 0.13}


  4%|▍         | 5/114 [00:10<03:12,  1.77s/it][0m 
  5%|▌         | 6/114 [00:11<02:53,  1.61s/it][0m 
  6%|▌         | 7/114 [00:13<02:45,  1.54s/it][0m 
  7%|▋         | 8/114 [00:14<02:36,  1.47s/it][0m 
  8%|▊         | 9/114 [00:15<02:31,  1.44s/it][0m 
  9%|▉         | 10/114 [00:17<02:25,  1.40s/it][0m 


[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m {'loss': 0.5037, 'grad_norm': 3.206493854522705, 'learning_rate': 3.7500000000000005e-06, 'rewards/chosen': 0.000185394287109375, 'logps/chosen': -376.9236328125, 'logits/chosen': -38986489.6, 'rewards/rejected': 0.034631043672561646, 'logps/rejected': -347.0869140625, 'logits/rejected': -28356636.8, 'rewards/margins': -0.03444564938545227, 'kl': 1.127706527709961, 'epoch': 0.27}


  9%|▉         | 10/114 [00:17<02:25,  1.40s/it][0m 
 10%|▉         | 11/114 [00:18<02:24,  1.41s/it][0m 
 11%|█         | 12/114 [00:19<02:19,  1.36s/it][0m 
 11%|█▏        | 13/114 [00:21<02:20,  1.39s/it][0m 
 12%|█▏        | 14/114 [00:22<02:11,  1.32s/it][0m 
 13%|█▎        | 15/114 [00:23<02:08,  1.30s/it][0m 


[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m {'loss': 0.4956, 'grad_norm': 2.4285411834716797, 'learning_rate': 4.995258321842611e-06, 'rewards/chosen': -0.0027526840567588806, 'logps/chosen': -185.4678751627604, 'logits/chosen': -10551043.333333334, 'rewards/rejected': 0.0250568687915802, 'logps/rejected': -137.75800432477678, 'logits/rejected': -8254264.0, 'rewards/margins': -0.02780955284833908, 'kl': 1.2991180419921875, 'epoch': 0.4}


 13%|█▎        | 15/114 [00:23<02:08,  1.30s/it][0m 
 14%|█▍        | 16/114 [00:24<02:05,  1.28s/it][0m 
 15%|█▍        | 17/114 [00:26<02:00,  1.24s/it][0m 
 16%|█▌        | 18/114 [00:27<01:59,  1.24s/it][0m 
 17%|█▋        | 19/114 [00:28<02:02,  1.29s/it][0m 
 18%|█▊        | 20/114 [00:30<02:08,  1.36s/it][0m 


[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m {'loss': 0.5003, 'grad_norm': 2.7762460708618164, 'learning_rate': 4.942120794399002e-06, 'rewards/chosen': 0.014214071134726206, 'logps/chosen': -361.7993570963542, 'logits/chosen': -33656781.333333336, 'rewards/rejected': -0.0008510590996593237, 'logps/rejected': -414.75225830078125, 'logits/rejected': -11235316.0, 'rewards/margins': 0.01506513023438553, 'kl': 1.6234521865844727, 'epoch': 0.53}


 18%|█▊        | 20/114 [00:30<02:08,  1.36s/it][0m 
 18%|█▊        | 21/114 [00:31<02:07,  1.37s/it][0m 
 19%|█▉        | 22/114 [00:33<02:07,  1.38s/it][0m 
 20%|██        | 23/114 [00:34<02:03,  1.35s/it][0m 
 21%|██        | 24/114 [00:35<02:02,  1.36s/it][0m 


[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m {'loss': 0.4938, 'grad_norm': 3.4795355796813965, 'learning_rate': 4.83118057351089e-06, 'rewards/chosen': 0.033310700207948685, 'logps/chosen': -415.112060546875, 'logits/chosen': -16302114.0, 'rewards/rejected': 0.004298783838748932, 'logps/rejected': -133.60540771484375, 'logits/rejected': -7499702.0, 'rewards/margins': 0.029011916369199753, 'kl': 0.2708768844604492, 'epoch': 0.67}


 22%|██▏       | 25/114 [00:37<02:02,  1.38s/it][0m 
 23%|██▎       | 26/114 [00:38<01:58,  1.35s/it][0m 
 24%|██▎       | 27/114 [00:39<01:56,  1.34s/it][0m 
 25%|██▍       | 28/114 [00:41<01:56,  1.36s/it][0m 
 25%|██▌       | 29/114 [00:42<01:54,  1.35s/it][0m 
 26%|██▋       | 30/114 [00:43<01:53,  1.35s/it][0m 


[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m {'loss': 0.4989, 'grad_norm': 3.157608985900879, 'learning_rate': 4.665063509461098e-06, 'rewards/chosen': -0.02777557571729024, 'logps/chosen': -320.25791422526044, 'logits/chosen': -44957109.333333336, 'rewards/rejected': 0.0035932548344135284, 'logps/rejected': -399.44537353515625, 'logits/rejected': -22130400.0, 'rewards/margins': -0.031368830551703766, 'kl': 0.9080495834350586, 'epoch': 0.8}


 26%|██▋       | 30/114 [00:43<01:53,  1.35s/it][0m 
 27%|██▋       | 31/114 [00:45<01:51,  1.34s/it][0m 
 28%|██▊       | 32/114 [00:46<01:48,  1.32s/it][0m 
 29%|██▉       | 33/114 [00:47<01:49,  1.35s/it][0m 
 30%|██▉       | 34/114 [00:49<01:49,  1.37s/it][0m 
 31%|███       | 35/114 [00:50<01:49,  1.39s/it][0m 


[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m {'loss': 0.4975, 'grad_norm': 3.352367639541626, 'learning_rate': 4.447701436314176e-06, 'rewards/chosen': 0.028008118271827698, 'logps/chosen': -319.4502685546875, 'logits/chosen': -4734688.4, 'rewards/rejected': -0.023479002714157104, 'logps/rejected': -321.9469970703125, 'logits/rejected': 1054939.2, 'rewards/margins': 0.051487120985984805, 'kl': 0.6697483062744141, 'epoch': 0.93}


 31%|███       | 35/114 [00:50<01:49,  1.39s/it][0m 
 32%|███▏      | 36/114 [00:52<01:49,  1.40s/it][0m 
 32%|███▏      | 37/114 [00:53<01:47,  1.40s/it][0m 
 33%|███▎      | 38/114 [00:54<01:30,  1.19s/it][0m 
 34%|███▍      | 39/114 [00:55<01:33,  1.24s/it][0m 
 35%|███▌      | 40/114 [00:56<01:32,  1.26s/it][0m 


[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m {'loss': 0.446, 'grad_norm': 2.855987310409546, 'learning_rate': 4.184239109116393e-06, 'rewards/chosen': 0.00960337370634079, 'logps/chosen': -239.4619344075521, 'logits/chosen': -24379098.666666668, 'rewards/rejected': 0.10370179017384847, 'logps/rejected': -927.6328125, 'logits/rejected': -30101224.0, 'rewards/margins': -0.09409841646750768, 'kl': 0.8544096946716309, 'epoch': 1.05}


 35%|███▌      | 40/114 [00:56<01:32,  1.26s/it][0m 
 36%|███▌      | 41/114 [00:58<01:34,  1.29s/it][0m 
 37%|███▋      | 42/114 [00:59<01:32,  1.28s/it][0m 
 38%|███▊      | 43/114 [01:00<01:34,  1.33s/it][0m 
 39%|███▊      | 44/114 [01:02<01:33,  1.34s/it][0m 
 39%|███▉      | 45/114 [01:03<01:32,  1.35s/it][0m 


[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m {'loss': 0.4926, 'grad_norm': 3.918887138366699, 'learning_rate': 3.880912432401265e-06, 'rewards/chosen': -0.004432678843537967, 'logps/chosen': -117.6781005859375, 'logits/chosen': -6835382.666666667, 'rewards/rejected': -0.02580215036869049, 'logps/rejected': -311.5234375, 'logits/rejected': -5398134.857142857, 'rewards/margins': 0.021369471525152523, 'kl': 0.4147300720214844, 'epoch': 1.19}


 39%|███▉      | 45/114 [01:03<01:32,  1.35s/it][0m 
 40%|████      | 46/114 [01:05<01:33,  1.38s/it][0m 
 41%|████      | 47/114 [01:06<01:28,  1.32s/it][0m 
 42%|████▏     | 48/114 [01:07<01:28,  1.35s/it][0m 
 43%|████▎     | 49/114 [01:09<01:28,  1.36s/it][0m 
 44%|████▍     | 50/114 [01:10<01:26,  1.35s/it][0m 
[36m(RayTrainWorker pid=9269, ip=10.0.121.215)[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/mnt/cluster_storage/llama3_8b_kto_lora/TorchTrainer_75e12_00000_0_2025-09-22_17-58-47/checkpoint_000000)


[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m {'loss': 0.4941, 'grad_norm': 2.7803127765655518, 'learning_rate': 3.544900862216959e-06, 'rewards/chosen': -0.02284517458506993, 'logps/chosen': -233.09730747767858, 'logits/chosen': -6707277.142857143, 'rewards/rejected': -0.056767781575520836, 'logps/rejected': -227.58184814453125, 'logits/rejected': -18187356.0, 'rewards/margins': 0.03392260699045091, 'kl': 0.66485595703125, 'epoch': 1.32}


 44%|████▍     | 50/114 [01:10<01:26,  1.35s/it][INFO|trainer.py:3993] 2025-09-22 18:00:26,474 >> Saving model checkpoint to llama3_8b_lora_kto/checkpoint-50
[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m [INFO|configuration_utils.py:698] 2025-09-22 18:00:26,672 >> loading configuration file config.json from cache at /home/ray/.cache/huggingface/hub/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/8afb486c1db24fe5011ec46dfbe5b5dccdb575c2/config.json
[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m [INFO|configuration_utils.py:770] 2025-09-22 18:00:26,673 >> Model config LlamaConfig {
[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m   "architectures": [
[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m     "LlamaForCausalLM"
[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m   ],
[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m   "attention_bias": false,
[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m   "attention_dropout": 0.0,
[36m(RayTrainWorker pid=9270, 


Training finished iteration 1 at 2025-09-22 18:00:28. Total running time: 1min 40s
╭─────────────────────────────────────────╮
│ Training result                         │
├─────────────────────────────────────────┤
│ checkpoint_dir_name   checkpoint_000000 │
│ time_this_iter_s                 94.279 │
│ time_total_s                     94.279 │
│ training_iteration                    1 │
│ epoch                              1.32 │
│ grad_norm                       2.78031 │
│ kl                              0.66486 │
│ learning_rate                        0. │
│ logits/chosen            -6707277.14286 │
│ logits/rejected              -18187356. │
│ logps/chosen                 -233.09731 │
│ logps/rejected               -227.58185 │
│ loss                             0.4941 │
│ rewards/chosen                 -0.02285 │
│ rewards/margins                 0.03392 │
│ rewards/rejected               -0.05677 │
│ step                                 50 │
╰─────────────────────────────────────────╯

 45%|████▍     | 51/114 [01:13<02:03,  1.95s/it][0m 
 46%|████▌     | 52/114 [01:15<01:50,  1.79s/it][0m 
 46%|████▋     | 53/114 [01:16<01:41,  1.66s/it][0m 
[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/mnt/cluster_storage/llama3_8b_kto_lora/TorchTrainer_75e12_00000_0_2025-09-22_17-58-47/checkpoint_000000)[32m [repeated 3x across cluster][0m
 47%|████▋     | 54/114 [01:17<01:35,  1.59s/it][0m 
 48%|████▊     | 55/114 [01:19<01:29,  1.52s/it][0m 


[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m {'loss': 0.4901, 'grad_norm': 3.4102723598480225, 'learning_rate': 3.184157475180208e-06, 'rewards/chosen': 0.003834752632038934, 'logps/chosen': -424.96132114955356, 'logits/chosen': -27426998.85714286, 'rewards/rejected': -0.013437906901041666, 'logps/rejected': -427.0186360677083, 'logits/rejected': 7196722.666666667, 'rewards/margins': 0.0172726595330806, 'kl': 0.15345001220703125, 'epoch': 1.45}


 49%|████▉     | 56/114 [01:20<01:25,  1.48s/it][0m 
 50%|█████     | 57/114 [01:22<01:23,  1.47s/it][0m 
 51%|█████     | 58/114 [01:23<01:21,  1.46s/it][0m 
 52%|█████▏    | 59/114 [01:24<01:14,  1.35s/it][0m 
 53%|█████▎    | 60/114 [01:26<01:13,  1.35s/it][0m 


[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m {'loss': 0.4861, 'grad_norm': 3.9761669635772705, 'learning_rate': 2.8072207266617856e-06, 'rewards/chosen': 0.01904270201921463, 'logps/chosen': -160.99930419921876, 'logits/chosen': -7771552.0, 'rewards/rejected': -0.0751983642578125, 'logps/rejected': -299.243310546875, 'logits/rejected': -17014041.6, 'rewards/margins': 0.09424106627702714, 'kl': 0.07212066650390625, 'epoch': 1.59}


 54%|█████▎    | 61/114 [01:27<01:13,  1.39s/it][0m 
 54%|█████▍    | 62/114 [01:28<01:12,  1.40s/it][0m 
 55%|█████▌    | 63/114 [01:30<01:11,  1.40s/it][0m 
 56%|█████▌    | 64/114 [01:31<01:07,  1.36s/it][0m 
 57%|█████▋    | 65/114 [01:33<01:07,  1.37s/it][0m 


[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m {'loss': 0.4819, 'grad_norm': 3.127728223800659, 'learning_rate': 2.4230123536095746e-06, 'rewards/chosen': 0.04307342767715454, 'logps/chosen': -263.8954833984375, 'logits/chosen': -40611148.8, 'rewards/rejected': -0.0642248511314392, 'logps/rejected': -439.3173828125, 'logits/rejected': -39223113.6, 'rewards/margins': 0.10729827880859374, 'kl': 0.37912988662719727, 'epoch': 1.72}


 58%|█████▊    | 66/114 [01:34<01:04,  1.34s/it][0m 
 59%|█████▉    | 67/114 [01:35<01:03,  1.36s/it][0m 
 60%|█████▉    | 68/114 [01:37<01:01,  1.34s/it][0m 
 61%|██████    | 69/114 [01:38<01:02,  1.38s/it][0m 


[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m {'loss': 0.4883, 'grad_norm': 3.268752336502075, 'learning_rate': 2.040626205458574e-06, 'rewards/chosen': 0.06611897051334381, 'logps/chosen': -339.74298095703125, 'logits/chosen': -48837516.0, 'rewards/rejected': -0.02267284318804741, 'logps/rejected': -67.47608184814453, 'logits/rejected': 2432824.0, 'rewards/margins': 0.08879181370139122, 'kl': 0.36167335510253906, 'epoch': 1.85}


 61%|██████▏   | 70/114 [01:39<01:00,  1.37s/it][0m 
 62%|██████▏   | 71/114 [01:41<00:58,  1.36s/it][0m 
 63%|██████▎   | 72/114 [01:42<00:57,  1.36s/it][0m 
 64%|██████▍   | 73/114 [01:43<00:55,  1.35s/it][0m 
 65%|██████▍   | 74/114 [01:45<00:53,  1.34s/it][0m 
 66%|██████▌   | 75/114 [01:46<00:50,  1.30s/it][0m 


[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m {'loss': 0.4746, 'grad_norm': 3.637986898422241, 'learning_rate': 1.6691130013008514e-06, 'rewards/chosen': -0.009941291809082032, 'logps/chosen': -153.185595703125, 'logits/chosen': -28408435.2, 'rewards/rejected': -0.14257004261016845, 'logps/rejected': -497.9251953125, 'logits/rejected': -25942432.0, 'rewards/margins': 0.1326287508010864, 'kl': 0.01880645751953125, 'epoch': 1.99}


 67%|██████▋   | 76/114 [01:47<00:43,  1.15s/it][0m 
 68%|██████▊   | 77/114 [01:48<00:44,  1.21s/it][0m 
 68%|██████▊   | 78/114 [01:49<00:43,  1.22s/it][0m 
 69%|██████▉   | 79/114 [01:51<00:44,  1.26s/it][0m 
 70%|███████   | 80/114 [01:52<00:43,  1.27s/it][0m 


[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m {'loss': 0.4264, 'grad_norm': 4.671276092529297, 'learning_rate': 1.3172661079099752e-06, 'rewards/chosen': 0.11861257553100586, 'logps/chosen': -552.185986328125, 'logits/chosen': -41346288.0, 'rewards/rejected': -0.08438415825366974, 'logps/rejected': -545.228271484375, 'logits/rejected': -31541286.0, 'rewards/margins': 0.2029967337846756, 'kl': 0.0881195068359375, 'epoch': 2.11}


 71%|███████   | 81/114 [01:53<00:42,  1.30s/it][0m 
 72%|███████▏  | 82/114 [01:55<00:40,  1.28s/it][0m 
 73%|███████▎  | 83/114 [01:56<00:40,  1.31s/it][0m 
 74%|███████▎  | 84/114 [01:57<00:39,  1.32s/it][0m 
 75%|███████▍  | 85/114 [01:59<00:38,  1.34s/it][0m 


[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m {'loss': 0.4847, 'grad_norm': 3.5751049518585205, 'learning_rate': 9.934134090518593e-07, 'rewards/chosen': 0.042299906412760414, 'logps/chosen': -342.3179524739583, 'logits/chosen': -35587850.666666664, 'rewards/rejected': -0.08786888420581818, 'logps/rejected': -302.6426086425781, 'logits/rejected': 2848422.5, 'rewards/margins': 0.13016879061857858, 'kl': 0.0, 'epoch': 2.24}


 75%|███████▌  | 86/114 [02:00<00:38,  1.39s/it][0m 
 76%|███████▋  | 87/114 [02:01<00:36,  1.37s/it][0m 
 77%|███████▋  | 88/114 [02:03<00:35,  1.35s/it][0m 
 78%|███████▊  | 89/114 [02:04<00:32,  1.30s/it][0m 
 79%|███████▉  | 90/114 [02:05<00:30,  1.26s/it][0m 


[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m {'loss': 0.4827, 'grad_norm': 2.979959011077881, 'learning_rate': 7.052201923388955e-07, 'rewards/chosen': -0.030370076497395832, 'logps/chosen': -193.817138671875, 'logits/chosen': -6196848.0, 'rewards/rejected': -0.08789678982325963, 'logps/rejected': -304.4921177455357, 'logits/rejected': -8776000.57142857, 'rewards/margins': 0.0575267133258638, 'kl': 0.3118095397949219, 'epoch': 2.37}


 80%|███████▉  | 91/114 [02:07<00:30,  1.32s/it][0m 
 81%|████████  | 92/114 [02:08<00:29,  1.36s/it][0m 
 82%|████████▏ | 93/114 [02:09<00:28,  1.37s/it][0m 
 82%|████████▏ | 94/114 [02:11<00:27,  1.35s/it][0m 
 83%|████████▎ | 95/114 [02:12<00:24,  1.29s/it][0m 


[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m {'loss': 0.4787, 'grad_norm': 4.546638488769531, 'learning_rate': 4.5950771910944603e-07, 'rewards/chosen': 0.0801504800717036, 'logps/chosen': -452.5955810546875, 'logits/chosen': -22209477.333333332, 'rewards/rejected': -0.13689537346363068, 'logps/rejected': -526.0299072265625, 'logits/rejected': -39445840.0, 'rewards/margins': 0.2170458535353343, 'kl': 0.29901599884033203, 'epoch': 2.51}


 84%|████████▍ | 96/114 [02:13<00:24,  1.35s/it][0m 
 85%|████████▌ | 97/114 [02:15<00:22,  1.32s/it][0m 
 86%|████████▌ | 98/114 [02:16<00:21,  1.36s/it][0m 
 87%|████████▋ | 99/114 [02:17<00:20,  1.34s/it][0m 
 88%|████████▊ | 100/114 [02:19<00:18,  1.30s/it]


[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m {'loss': 0.4765, 'grad_norm': 2.9704947471618652, 'learning_rate': 2.620917716123444e-07, 'rewards/chosen': 0.04602128267288208, 'logps/chosen': -195.94288853236608, 'logits/chosen': -27954144.0, 'rewards/rejected': -0.1297607421875, 'logps/rejected': -254.4271240234375, 'logits/rejected': -6951386.666666667, 'rewards/margins': 0.17578202486038208, 'kl': 0.0, 'epoch': 2.64}


 88%|████████▊ | 100/114 [02:19<00:18,  1.30s/it][INFO|trainer.py:3993] 2025-09-22 18:01:34,872 >> Saving model checkpoint to llama3_8b_lora_kto/checkpoint-100
[36m(RayTrainWorker pid=9269, ip=10.0.121.215)[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/mnt/cluster_storage/llama3_8b_kto_lora/TorchTrainer_75e12_00000_0_2025-09-22_17-58-47/checkpoint_000001)
[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m [INFO|configuration_utils.py:698] 2025-09-22 18:01:35,070 >> loading configuration file config.json from cache at /home/ray/.cache/huggingface/hub/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/8afb486c1db24fe5011ec46dfbe5b5dccdb575c2/config.json
[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m [INFO|configuration_utils.py:770] 2025-09-22 18:01:35,070 >> Model config LlamaConfig {
[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m   "architectures": [
[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m     "LlamaForCausalLM"
[36m(RayTrainWor


Training finished iteration 2 at 2025-09-22 18:01:37. Total running time: 2min 49s
╭─────────────────────────────────────────╮
│ Training result                         │
├─────────────────────────────────────────┤
│ checkpoint_dir_name   checkpoint_000001 │
│ time_this_iter_s               69.12024 │
│ time_total_s                  163.39924 │
│ training_iteration                    2 │
│ epoch                              2.64 │
│ grad_norm                       2.97049 │
│ kl                                   0. │
│ learning_rate                        0. │
│ logits/chosen                -27954144. │
│ logits/rejected          -6951386.66667 │
│ logps/chosen                 -195.94289 │
│ logps/rejected               -254.42712 │
│ loss                             0.4765 │
│ rewards/chosen                  0.04602 │
│ rewards/margins                 0.17578 │
│ rewards/rejected               -0.12976 │
│ step                                100 │
╰─────────────────────────────────────────╯

 89%|████████▊ | 101/114 [02:22<00:26,  2.03s/it]
 89%|████████▉ | 102/114 [02:24<00:21,  1.80s/it]
 90%|█████████ | 103/114 [02:25<00:18,  1.69s/it]
[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/mnt/cluster_storage/llama3_8b_kto_lora/TorchTrainer_75e12_00000_0_2025-09-22_17-58-47/checkpoint_000001)[32m [repeated 3x across cluster][0m
 91%|█████████ | 104/114 [02:26<00:15,  1.60s/it]
 92%|█████████▏| 105/114 [02:28<00:13,  1.55s/it]


[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m {'loss': 0.4828, 'grad_norm': 3.197455644607544, 'learning_rate': 1.1764499893210879e-07, 'rewards/chosen': 0.07992073893547058, 'logps/chosen': -335.86749267578125, 'logits/chosen': -28101516.0, 'rewards/rejected': -0.18496094644069672, 'logps/rejected': -350.64691162109375, 'logits/rejected': -8913080.0, 'rewards/margins': 0.2648816853761673, 'kl': 0.0, 'epoch': 2.77}


 93%|█████████▎| 106/114 [02:29<00:12,  1.56s/it]
 94%|█████████▍| 107/114 [02:31<00:10,  1.52s/it]
 95%|█████████▍| 108/114 [02:32<00:08,  1.48s/it]
 96%|█████████▌| 109/114 [02:34<00:07,  1.45s/it]
 96%|█████████▋| 110/114 [02:35<00:05,  1.44s/it]


[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m {'loss': 0.4867, 'grad_norm': 3.448063850402832, 'learning_rate': 2.958631979685156e-08, 'rewards/chosen': -0.046194459001223244, 'logps/chosen': -351.6595865885417, 'logits/chosen': -25176178.666666668, 'rewards/rejected': -0.09574539320809501, 'logps/rejected': -479.48256138392856, 'logits/rejected': -39495369.14285714, 'rewards/margins': 0.04955093420687177, 'kl': 0.3386707305908203, 'epoch': 2.91}


 97%|█████████▋| 111/114 [02:36<00:04,  1.44s/it]
 98%|█████████▊| 112/114 [02:38<00:02,  1.45s/it]
 99%|█████████▉| 113/114 [02:39<00:01,  1.43s/it]
100%|██████████| 114/114 [02:40<00:00,  1.24s/it][INFO|trainer.py:3993] 2025-09-22 18:01:56,261 >> Saving model checkpoint to llama3_8b_lora_kto/checkpoint-114
[36m(RayTrainWorker pid=9269, ip=10.0.121.215)[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/mnt/cluster_storage/llama3_8b_kto_lora/TorchTrainer_75e12_00000_0_2025-09-22_17-58-47/checkpoint_000002)
[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m [INFO|configuration_utils.py:698] 2025-09-22 18:01:56,450 >> loading configuration file config.json from cache at /home/ray/.cache/huggingface/hub/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/8afb486c1db24fe5011ec46dfbe5b5dccdb575c2/config.json
[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m [INFO|configuration_utils.py:770] 2025-09


Training finished iteration 3 at 2025-09-22 18:01:58. Total running time: 3min 11s
╭─────────────────────────────────────────╮
│ Training result                         │
├─────────────────────────────────────────┤
│ checkpoint_dir_name   checkpoint_000002 │
│ time_this_iter_s               21.37755 │
│ time_total_s                  184.77679 │
│ training_iteration                    3 │
│ epoch                           2.90667 │
│ grad_norm                       3.44806 │
│ kl                              0.33867 │
│ learning_rate                        0. │
│ logits/chosen           -25176178.66667 │
│ logits/rejected         -39495369.14286 │
│ logps/chosen                 -351.65959 │
│ logps/rejected               -479.48256 │
│ loss                             0.4867 │
│ rewards/chosen                 -0.04619 │
│ rewards/margins                 0.04955 │
│ rewards/rejected               -0.09575 │
│ step                                110 │
╰─────────────────────────────────────────╯

[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m [INFO|trainer.py:2676] 2025-09-22 18:01:58,707 >> 
[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m 
[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m Training completed. Do not forget to share your model on huggingface.co/models =)
[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m 
[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m 


[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m {'train_runtime': 167.5942, 'train_samples_per_second': 5.37, 'train_steps_per_second': 0.68, 'train_loss': 0.4829057821056299, 'epoch': 3.0}


100%|██████████| 114/114 [02:43<00:00,  1.24s/it]


[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m 🏃 View run llama3_8b_lora_kto at: https://dbc-20ea386a-27d3.cloud.databricks.com/#/experiments/3019228085008994/runs/39ef543e880046f7b60c935ca24de172
[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m 🧪 View experiment at: https://dbc-20ea386a-27d3.cloud.databricks.com/#/experiments/3019228085008994


100%|██████████| 114/114 [02:44<00:00,  1.44s/it]
[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m [INFO|trainer.py:3993] 2025-09-22 18:01:59,904 >> Saving model checkpoint to llama3_8b_lora_kto
[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m [INFO|configuration_utils.py:698] 2025-09-22 18:02:00,094 >> loading configuration file config.json from cache at /home/ray/.cache/huggingface/hub/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/8afb486c1db24fe5011ec46dfbe5b5dccdb575c2/config.json
[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m [INFO|configuration_utils.py:770] 2025-09-22 18:02:00,095 >> Model config LlamaConfig {
[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m   "architectures": [
[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m     "LlamaForCausalLM"
[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m   ],
[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m   "attention_bias": false,
[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m   "attention_dro

[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m ***** train metrics *****
[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m   epoch                    =        3.0
[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m   total_flos               = 19610724GF
[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m   train_loss               =     0.4829
[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m   train_runtime            = 0:02:47.59
[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m   train_samples_per_second =       5.37
[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m   train_steps_per_second   =       0.68
[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m Figure saved at: llama3_8b_lora_kto/training_loss.png
[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m Figure saved at: llama3_8b_lora_kto/training_rewards_chosen.png


[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m [INFO|modelcard.py:450] 2025-09-22 18:02:00,536 >> Dropping the following result as it does not have all the necessary fields:
[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m {'task': {'name': 'Causal Language Modeling', 'type': 'text-generation'}}



Training completed after 3 iterations at 2025-09-22 18:02:02. Total running time: 3min 14s


2025-09-22 18:02:02,105	INFO tune.py:1009 -- Wrote the latest version of all result files and experiment state to '/mnt/cluster_storage/llama3_8b_kto_lora' in 0.0222s.





[36m(RayTrainWorker pid=9270, ip=10.0.121.215)[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/mnt/cluster_storage/llama3_8b_kto_lora/TorchTrainer_75e12_00000_0_2025-09-22_17-58-47/checkpoint_000002)[32m [repeated 3x across cluster][0m
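
The per-step metric dictionaries in the log above are printed as plain Python literals, so you can parse them for a quick sanity check without waiting for a dashboard. The following is a minimal sketch, not part of LLaMA-Factory: `parse_metrics` is a hypothetical helper, and `sample_line` is an abbreviated copy of the step-50 entry with some keys dropped for brevity. In this log, `rewards/margins` appears to equal `rewards/chosen` minus `rewards/rejected`, which the sketch checks.

```python
import ast
import re

# Matches the metrics dict that follows the Ray worker prefix on a log line.
METRIC_DICT = re.compile(r"\{'loss':.*\}")


def parse_metrics(line: str):
    """Return the metrics dict found on a log line, or None if there is none."""
    match = METRIC_DICT.search(line)
    return ast.literal_eval(match.group(0)) if match else None


# Abbreviated copy of the step-50 log entry (assumption: some keys removed).
sample_line = (
    "(RayTrainWorker pid=9270, ip=10.0.121.215) "
    "{'loss': 0.4941, 'rewards/chosen': -0.02284517458506993, "
    "'rewards/rejected': -0.056767781575520836, "
    "'rewards/margins': 0.03392260699045091, "
    "'kl': 0.66485595703125, 'epoch': 1.32}"
)

metrics = parse_metrics(sample_line)
gap = metrics["rewards/chosen"] - metrics["rewards/rejected"]
print(f"loss={metrics['loss']}, margin={metrics['rewards/margins']:.5f}, gap={gap:.5f}")
```

A margin that trends upward over training suggests the policy is separating accepted from rejected completions; over a run this short, the values are noisy.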


### Option B — Run as an Anyscale Job (production)

For longer or production runs, submit the training as an **Anyscale Job**. Jobs run outside your interactive session, which gives you better stability, retries, and durable logs. You package LLaMA-Factory and the other libraries in a container image and launch the run with a short job config. See **[Launching Fine-Tuning with Anyscale Jobs](3.10-launch-fine-tuning-with-anyscale-jobs.md)** for the step-by-step guide.

### Tracking with MLflow

If you set `report_to: mlflow` in your YAML, LLaMA-Factory will log metrics (loss, learning rate, etc.), parameters, and artifacts to your configured MLflow tracking server.

**Install MLflow** (skip this if you already installed it in Step 1):

```bash
pip install mlflow
```

**Example YAML snippet:**

```yaml
report_to: mlflow

ray_init_kwargs:
  runtime_env:
    env_vars:
      MLFLOW_TRACKING_URI: "https://<your_cloud_id>.cloud.databricks.com"
      MLFLOW_TRACKING_TOKEN: "<mlflow_tracking_token>"
      MLFLOW_EXPERIMENT_NAME: "/Users/<your_user_id>/experiment_name"
```

**MLflow**
![MLflow](https://anyscale-public-materials.s3.us-west-2.amazonaws.com/llm-finetuning/llama-factory/3.2.3/3.2.3-mlflow.png)

For a more detailed guide on tracking experiments with other tools such as WandB or TensorBoard, see [Observability and Tracking](3.5-observability-and-tracking.md).


## Step 5: Locate Checkpoints

Checkpoints are written under `<ray_storage_path>/<ray_run_name>`. In this example run, that path is `/mnt/cluster_storage/llama3_8b_kto_lora`.

Inside, you’ll see a **trainer session** directory named like:
`TorchTrainer_75e12_00000_0_2025-09-22_17-58-47`.

- `TorchTrainer_*` is created **when the trainer starts**; the suffix encodes a short run ID and the **start timestamp**.
- Within that directory, checkpoints are named `checkpoint_000xxx/`, where the zero-padded number is the order in which the checkpoint was saved, starting at `checkpoint_000000`.

The save cadence is controlled by `save_strategy` and `save_steps`. For instructions on resuming interrupted training with `resume_from_checkpoint`, and for more detail on the directory layout, see [Understanding Your Training Output Directory](3.4-checkpointing.md#understanding-your-training-output-directory).
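
To pick up the most recent checkpoint programmatically, you can walk the layout described above. The following is a minimal sketch using only the standard library. `latest_checkpoint` is a hypothetical helper, and it assumes that every session directory name ends with a `YYYY-MM-DD_HH-MM-SS` timestamp, as in the example run.

```python
from pathlib import Path
from typing import Optional

# Length of the "YYYY-MM-DD_HH-MM-SS" suffix on a TorchTrainer_* directory name.
TIMESTAMP_LEN = 19


def latest_checkpoint(run_dir: str) -> Optional[Path]:
    """Return the newest checkpoint_* directory of the newest trainer session."""
    root = Path(run_dir)
    if not root.is_dir():
        return None
    sessions = [p for p in root.glob("TorchTrainer_*") if p.is_dir()]
    if not sessions:
        return None
    # Timestamps in this format sort chronologically as plain strings.
    newest_session = max(sessions, key=lambda p: p.name[-TIMESTAMP_LEN:])
    # Checkpoint numbers are zero-padded, so a lexicographic sort is enough.
    checkpoints = sorted(p for p in newest_session.glob("checkpoint_*") if p.is_dir())
    return checkpoints[-1] if checkpoints else None


if __name__ == "__main__":
    # Path taken from the example run in this guide.
    print(latest_checkpoint("/mnt/cluster_storage/llama3_8b_kto_lora"))
```

Selecting the session by its timestamp suffix rather than by its full name avoids picking an older session whose run ID happens to sort later alphabetically.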

## Step 6: Export the Model

If you use LoRA, you can either keep the base model and the adapter separate ([for multi-LoRA adapter use](https://docs.anyscale.com/llm/serving/multi-lora)) or merge the adapter into the base model for low-latency inference.

For full fine-tuning or freeze-tuning, export the fine-tuned model directly.

You may optionally apply post-training quantization on merged or full models before serving. See [todo: add doc link]() for the exact export commands and options.
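
As a rough illustration of the merge path only, the sketch below uses the Hugging Face `peft` and `transformers` libraries directly rather than LLaMA-Factory's own export command, which this guide doesn't show. `merge_lora_adapter`, `adapter_dir`, and `output_dir` are hypothetical names, and `adapter_dir` is assumed to point at a directory that contains the saved LoRA adapter files (for example, `adapter_config.json`).

```python
def merge_lora_adapter(base_model_id: str, adapter_dir: str, output_dir: str) -> None:
    """Merge a LoRA adapter into its base model and save the merged weights."""
    # Imported lazily so that defining this function doesn't require the
    # heavy dependencies to be installed.
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Load the base model in the dtype stored in its checkpoint.
    base = AutoModelForCausalLM.from_pretrained(base_model_id, torch_dtype="auto")
    # Attach the trained adapter, then fold its weights into the base layers.
    model = PeftModel.from_pretrained(base, adapter_dir)
    merged = model.merge_and_unload()
    merged.save_pretrained(output_dir)
    # Save the tokenizer next to the weights so the output is self-contained.
    AutoTokenizer.from_pretrained(base_model_id).save_pretrained(output_dir)


# Example call (paths are placeholders, not outputs of this guide):
# merge_lora_adapter(
#     "meta-llama/Meta-Llama-3-8B-Instruct",
#     "<path_to_adapter_dir>",
#     "<path_to_merged_model_dir>",
# )
```

Merging removes the adapter indirection at inference time, but you give up the ability to swap adapters on a shared base model, so keep the adapter files if you may serve it in a multi-LoRA setup later.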