# Preference Fine-Tuning Using Direct Preference Optimization (DPO)
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/togethercomputer/together-cookbook/blob/main/Finetuning/DPO_Finetuning.ipynb)


**This notebook demonstrates how to perform preference fine-tuning using the Together AI platform. We'll work with the HelpSteer2 dataset to train a model that produces more helpful responses according to human preferences.**

## What is Preference-Tuning?

Preference fine-tuning improves a model by training it on pairs of responses to the same prompt, where one response is preferred over the other.

We use [Direct Preference Optimization (DPO)](https://arxiv.org/abs/2305.18290) for this type of fine-tuning instead of reinforcement learning from human feedback (RLHF). DPO lets the model learn directly from human preferences without training a separate reward model.

<img src="../images/RLHF_DPO.png" width="950">

#### Setup and Installation
---
First, install the necessary Python libraries. We need:
- `together`: The official Together AI Python client for interacting with the API (fine-tuning, inference, files, etc.).
- `datasets`: A library from Hugging Face for easily downloading and manipulating datasets.

In [1]:
!pip install -qU together datasets

In [2]:
## Setup Together AI client
import os
import sys
from together import Together

TOGETHER_API_KEY = os.getenv("TOGETHER_API_KEY")
WANDB_API_KEY = os.getenv("WANDB_API_KEY") # needed for logging fine-tuning to wandb

client = Together(api_key=TOGETHER_API_KEY)
MODEL_NAME = "meta-llama/Meta-Llama-3.1-8B-Instruct-Reference"

## Dataset: HelpSteer2

We will use the HelpSteer2 dataset from NVIDIA, which contains human feedback for AI models across multiple dimensions:

- Helpfulness: How well the response addresses the user's needs
- Correctness: Factual accuracy of information
- Coherence: Logical flow and consistency
- Complexity: Appropriate level of detail
- Verbosity: Concise vs. overly detailed

The dataset is available at [huggingface.co/datasets/nvidia/HelpSteer2](https://huggingface.co/datasets/nvidia/HelpSteer2).

For our preference-tuning experiment, we use [HelpSteer2-DPO](https://huggingface.co/datasets/Atsunori/HelpSteer2-DPO), a version of the dataset adapted for preference optimization: each prompt is paired with a chosen response and a rejected response.

## Data Preparation

Let's prepare the dataset so that it is compatible with the Together AI [preference-tuning dataset format](https://docs.together.ai/docs/fine-tuning-data-preparation#preference-data).

In [3]:
from datasets import load_dataset

# Download the HelpSteer2-DPO dataset from Hugging Face
dataset = load_dataset("Atsunori/HelpSteer2-DPO")

# Display basic information about the dataset
print(f"Dataset structure: {dataset}")
print(f"Available splits: {dataset.keys()}")


Dataset structure: DatasetDict({
    train: Dataset({
        features: ['prompt', 'chosen_response', 'rejected_response'],
        num_rows: 7221
    })
    validation: Dataset({
        features: ['prompt', 'chosen_response', 'rejected_response'],
        num_rows: 373
    })
})
Available splits: dict_keys(['train', 'validation'])


### Let's see an example from the dataset:

In [4]:
_id=210
print("\nSample from training split:")
print(f"Prompt:\n{dataset['train'][_id]['prompt']}\n")
print("Chosen Response vs Rejected Response:")
print("-" * 80)
chosen = dataset["train"][_id]['chosen_response']
rejected = dataset["train"][_id]['rejected_response']

# Print chosen response in full
print("CHOSEN RESPONSE:")
print(chosen)
print("\n" + "-" * 40 + "\n")

# Print rejected response in full
print("REJECTED RESPONSE:")
print(rejected)
print("-" * 80)



Sample from training split:
Prompt:
I want you to act like Crush the sea turtle from Finding Nemo. I want you to respond and answer like Crush using the tone, manner and vocabulary Crusy would use. Do not write any explanations. Only answer like Crush. You must know all of the knowledge of Crush. My first sentence is “Hi Crush."

Chosen Response vs Rejected Response:
--------------------------------------------------------------------------------
CHOSEN RESPONSE:
G'day mate! It's me, Crush the sea turtle.

So, you want to know all about me, huh? Well, let me tell ya, I'm a bit of an enigma. I'm a loggerhead turtle, which means I'm a bit of a rare breed. I was born in the Great Barrier Reef, which is one of the most beautiful places in the world.

I've been around for a while, mate. I've seen a lot of things and done a lot of things. I've traveled all around the world, from the warm waters of the Caribbean to the chilly waters of the North Atlantic. I've seen some amazing things, and I

## Understanding Preference Pairs

Each example in our dataset contains:
- A user prompt/query
- A preferred (chosen) response
- A non-preferred (rejected) response

During preference tuning, the model learns to generate outputs that resemble the preferred examples while avoiding the characteristics of the rejected ones.

## Convert Data to Preference Format

Our data needs to be in preference format:

```json
{
    "input": {
        "messages": [
            {"role": "user", "content": "..."}
        ]
    },
    "preferred_output": [
        {"role": "assistant", "content": "..."}
    ],
    "non_preferred_output": [
        {"role": "assistant", "content": "..."}
    ]
}
```
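Before writing thousands of records, it can help to sanity-check a few of them locally against the template above. The sketch below is a minimal, hypothetical helper (`find_format_problems` is not part of the `together` client, and its rules are assumptions read off the template, such as one assistant message per output); it is not a substitute for the platform's own format check.

```python
REQUIRED_KEYS = ("input", "preferred_output", "non_preferred_output")


def find_format_problems(record):
    """Return a list of problems found in one record; an empty list means it passed."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in record]
    if problems:
        return problems

    # The prompt side must carry a non-empty list of chat messages.
    prompt = record["input"]
    messages = prompt.get("messages") if isinstance(prompt, dict) else None
    if not isinstance(messages, list) or not messages:
        problems.append("input.messages must be a non-empty list")

    # Assumption: each output holds exactly one assistant message, as in the template.
    for key in ("preferred_output", "non_preferred_output"):
        output = record[key]
        if not isinstance(output, list) or len(output) != 1:
            problems.append(f"{key} must be a list with exactly one message")
            continue
        message = output[0]
        if not isinstance(message, dict) or message.get("role") != "assistant":
            problems.append(f"{key}[0] must be a message with role 'assistant'")
        elif not isinstance(message.get("content"), str):
            problems.append(f"{key}[0].content must be a string")
    return problems


# Made-up records for illustration only.
good_record = {
    "input": {"messages": [{"role": "user", "content": "Hi"}]},
    "preferred_output": [{"role": "assistant", "content": "Hello! How can I help?"}],
    "non_preferred_output": [{"role": "assistant", "content": "What?"}],
}
bad_record = {
    "input": {"messages": []},
    "preferred_output": [{"role": "user", "content": "Hello"}],
}

print(find_format_problems(good_record))
print(find_format_problems(bad_record))
```

Running a check like this on the first few converted examples catches key typos early, before the upload step.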

In [5]:
def convert_to_preference_dataset(dataset):
    """
    Converts the HelpSteer2-DPO dataset to a format suitable for together.ai preference fine-tuning.
    
    Returns a dataset with the preference format.
    """
    converted_dataset = {
        "train": [],
        "validation": []
    }
    
    for split in ["train", "validation"]:
        for example in dataset[split]:
            # Create the input messages
            messages = [
                {"role": "user", "content": example["prompt"]}
            ]
            
            # Create the preferred and non-preferred outputs
            preferred_output = [
                {"role": "assistant", "content": example["chosen_response"]}
            ]
            
            non_preferred_output = [
                {"role": "assistant", "content": example["rejected_response"]}
            ]
            
            # Add the converted example to the dataset
            converted_dataset[split].append({
                "input": {
                    "messages": messages
                },
                "preferred_output": preferred_output,
                "non_preferred_output": non_preferred_output
            })
    
    return converted_dataset

In [6]:
import json
import os

# Convert the dataset to the required format
converted_dataset = convert_to_preference_dataset(dataset)

# Create output directory if it doesn't exist
os.makedirs("data", exist_ok=True)

# Write training data
dpo_train_file_path = "data/helpsteer2_preference_train.jsonl"
with open(dpo_train_file_path, "w") as f:
    for example in converted_dataset["train"]:
        f.write(json.dumps(example) + "\n")

# Write validation data
dpo_validation_file_path = "data/helpsteer2_preference_validation.jsonl"
with open(dpo_validation_file_path, "w") as f:
    for example in converted_dataset["validation"]:
        f.write(json.dumps(example) + "\n")

print(f"Saved {len(converted_dataset['train'])} training examples to {dpo_train_file_path}")
print(f"Validation set contains {len(converted_dataset['validation'])} examples")

# Display a sample example
print("\nSample example:")
print(json.dumps(converted_dataset["train"][0], indent=2))


Saved 7221 training examples to data/helpsteer2_preference_train.jsonl
Validation set contains 373 examples

Sample example:
{
  "input": {
    "messages": [
      {
        "role": "user",
        "content": "c#"
      }
    ]
  },
  "preferred_output": [
    {
      "role": "assistant",
      "content": "C# (pronounced \"C sharp\") is a modern, object-oriented programming language developed by Microsoft. It is widely used for building various types of applications, including web applications, desktop applications, mobile applications, and games. C# is similar to other programming languages such as Java and C++, and it is known for its simplicity and ease of use. C# is a powerful language that provides a rich set of libraries and frameworks that make it easy to build robust and scalable applications.\n\nHere is a brief overview of some key features of C#:\n\n1. Object-oriented: C# is an object-oriented language, which means it uses the concept of objects to represent real-world entities and 

**Now we'll upload the datasets to the Together AI cloud to use them for fine-tuning. Note that we set `check=True`, which triggers a format check to ensure that our data is in the correct format for DPO.**

In [7]:
dpo_train_file = client.files.upload(dpo_train_file_path, check=True)
dpo_validation_file = client.files.upload(dpo_validation_file_path, check=True)

print(f"Uploaded DPO training files: {dpo_train_file}")
print(f"Uploaded DPO validation files: {dpo_validation_file}")

Uploading file helpsteer2_preference_train.jsonl: 100%|██████████| 28.3M/28.3M [00:01<00:00, 21.4MB/s]
Uploading file helpsteer2_preference_validation.jsonl: 100%|██████████| 1.45M/1.45M [00:00<00:00, 3.39MB/s]


Uploaded DPO training files: id='file-ad54386b-81a5-4f7f-ae32-3174efa66dc1' object=<ObjectType.File: 'file'> created_at=1744664620 type=None purpose=<FilePurpose.FineTune: 'fine-tune'> filename='helpsteer2_preference_train.jsonl' bytes=28329513 line_count=0 processed=True FileType='jsonl'
Uploaded DPO validation files: id='file-f781bc64-f36b-43bd-8f91-f14875d32440' object=<ObjectType.File: 'file'> created_at=1744664622 type=None purpose=<FilePurpose.FineTune: 'fine-tune'> filename='helpsteer2_preference_validation.jsonl' bytes=1454367 line_count=0 processed=True FileType='jsonl'


## Starting a Preference Fine-Tuning Job

We use [Direct Preference Optimization (DPO)](https://arxiv.org/abs/2305.18290) as the preference fine-tuning method.

- To start a preference fine-tuning job, set `training_method="dpo"`.

### Key Parameters: DPO Beta (β)

The `dpo_beta` parameter controls how far the model may deviate from the reference model during training:
- Lower values (e.g., 0.1): More aggressive optimization toward preferred responses, with potentially more creative outputs but a higher risk of instability
- Higher values (e.g., 0.7): More conservative updates that stay closer to the reference model, with more stability but potentially less improvement

The default value is 0.1, but you should experiment with different values for your specific use case.
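To make the role of β concrete, the sketch below evaluates the objective from the [DPO paper](https://arxiv.org/abs/2305.18290) for a single preference pair in plain Python. The log-probabilities are made-up numbers, and `dpo_loss` is a hypothetical helper for illustration; it is not part of the `together` client and does not describe how the platform implements training.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair: -log(sigmoid(beta * log-ratio gap))."""
    # How much more (or less) likely each response is under the policy
    # than under the frozen reference model, in log space.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) equals softplus(-margin); this form avoids overflow.
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# Hypothetical sequence log-probabilities: the policy already favors the
# chosen response by a log-ratio gap of 2.0 relative to the reference.
pair = dict(policy_chosen_logp=-10.0, policy_rejected_logp=-14.0,
            ref_chosen_logp=-11.0, ref_rejected_logp=-13.0)

for beta in (0.1, 0.7):
    print(f"beta={beta}: loss={dpo_loss(**pair, beta=beta):.4f}")
```

When the policy equals the reference, the gap is zero and the loss is log 2 regardless of β. For the same positive gap, a larger β yields a smaller loss, so the objective is largely satisfied with less movement away from the reference; with a small β the same gap leaves more loss to reduce, which pushes the policy further.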

In [None]:
# Directly perform DPO training

dpo_training = client.fine_tuning.create(
    training_method='dpo',
    dpo_beta=0.1,
    training_file=dpo_train_file.id,
    validation_file=dpo_validation_file.id,
    n_evals=10,
    model=MODEL_NAME,
    wandb_api_key=WANDB_API_KEY,
    wandb_project_name="helpsteer2",
    suffix="helpsteer2_dpo_training",
    n_epochs=1,
    n_checkpoints=1,
    learning_rate=1e-5,
    lora=True,
)
print(dpo_training.id)

You can also try a different `dpo_beta` value to see how it affects preference tuning:

```python
dpo_training = client.fine_tuning.create(
    training_file=dpo_train_file.id,
    validation_file=dpo_validation_file.id,
    n_evals=5,
    model=MODEL_NAME,
    wandb_api_key=WANDB_API_KEY,
    wandb_project_name="helpsteer2",
    suffix="helpsteer2_dpo_training",
    n_epochs=1,
    n_checkpoints=1,
    learning_rate=1e-5,
    lora=True,
    training_method='dpo',
    dpo_beta=0.7,  # HIGHER DPO_BETA
)
print(dpo_training.id)
```

## Monitoring Training Progress

During training, pay attention to these metrics specific to preference optimization:
- **Reward Accuracy** (`eval/reward/accuracy`): The percentage of evaluation pairs for which your model assigns a higher reward to the preferred response than to the rejected one. Higher is better.

<img src="../images/dpo_loss.png" width="700">

- **KL Divergence** (`eval/approx. KL/rejected`, `eval/approx. KL/chosen`): Measures how much your model diverges from the reference model. Indicates the magnitude of behavioral change.

<img src="../images/dpo_train.png" width="700">
<img src="../images/dpo_eval.png" width="700">

As training progresses, you typically want to see accuracy increase while KL divergence rises gradually, showing your model is learning preferences without deviating too far from its original behavior.
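As a rough illustration of what reward accuracy measures, the sketch below uses the implicit reward from the DPO paper, β times the log-ratio between the policy and the reference model, and counts how often the chosen response scores higher than the rejected one. The helper names and the log-probabilities are hypothetical, and this is an assumption about the common definition, not a description of how the platform computes its logged metrics.

```python
def implicit_reward(policy_logp, ref_logp, beta=0.1):
    """Implicit DPO reward: beta * (log pi(y|x) - log pi_ref(y|x))."""
    return beta * (policy_logp - ref_logp)


def reward_accuracy(pairs, beta=0.1):
    """Fraction of pairs where the chosen response gets the strictly higher reward.

    Each pair is (policy_chosen, policy_rejected, ref_chosen, ref_rejected),
    all given as sequence log-probabilities.
    """
    pairs = list(pairs)
    correct = sum(
        implicit_reward(pc, rc, beta) > implicit_reward(pr, rr, beta)
        for pc, pr, rc, rr in pairs
    )
    return correct / len(pairs)


# Made-up evaluation pairs for illustration only.
eval_pairs = [
    (-10.0, -14.0, -11.0, -13.0),  # chosen gained, rejected lost: correct
    (-9.0, -8.0, -9.5, -9.0),      # rejected gained more than chosen: incorrect
    (-20.0, -25.0, -20.0, -24.0),  # chosen unchanged, rejected lost: correct
    (-7.0, -7.0, -7.0, -7.0),      # policy equals reference, a tie: incorrect
]

print(f"reward accuracy: {reward_accuracy(eval_pairs):.2f}")
```

Under this definition, an untrained policy that equals the reference ties on every pair, which is why accuracy is expected to climb from a low starting point as training progresses.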

## Inference with Preference Tuned Model

Now, let's use our fine-tuned model! We can call it just like any other model on the Together AI platform by passing the fine-tuned model's unique `output_name`, which comes from the fine-tuning job we created earlier.
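The model can only be called once the fine-tuning job has finished. A minimal polling sketch is shown below; `wait_for_job` is a hypothetical helper, the set of terminal status strings is an assumption, and it assumes the client exposes `client.fine_tuning.retrieve(job_id)` returning an object with `status` and `output_name` attributes. Check the client documentation before relying on any of these.

```python
import time

# Assumption: status strings that mean the job will not change any further.
TERMINAL_STATUSES = {"completed", "error", "cancelled"}


def wait_for_job(retrieve, job_id, poll_seconds=60, max_polls=240, sleep=time.sleep):
    """Call `retrieve(job_id)` repeatedly until the job reaches a terminal status."""
    for _ in range(max_polls):
        job = retrieve(job_id)
        # Enum-valued statuses expose the raw string via `.value`;
        # plain strings pass through unchanged.
        status = getattr(job.status, "value", job.status)
        if status in TERMINAL_STATUSES:
            return job
        sleep(poll_seconds)
    raise TimeoutError(f"Job {job_id} did not finish after {max_polls} polls")


# Hypothetical usage with the objects defined earlier in this notebook:
# job = wait_for_job(client.fine_tuning.retrieve, dpo_training.id)
# finetuned_model = job.output_name
```

Passing the retrieval function as an argument keeps the helper independent of the client, so it can be exercised with a stub before any job is launched.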

In [8]:
# Name of the fine-tuned model. To use your own model, replace the hard-coded
# value with your job's output name, e.g. dpo_training.output_name
finetuned_model = "zainhas/Meta-Llama-3.1-8B-Instruct-Reference-helpsteer2_dpo_training_continuing_sft-cf1147c8"

user_prompt = """I want you to act like Crush the sea turtle from Finding Nemo. 
I want you to respond and answer like Crush using the tone, manner and vocabulary Crush would use. Do not write any explanations. 
Only answer like Crush. 
You must know all of the knowledge of Crush.
"""

response = client.chat.completions.create(
    model = finetuned_model,
    messages=[
        {
            "role": "user",
            "content": user_prompt,
        }
    ]
)

print(response.choices[0].message.content)

DUDE! What's up?


## Two-Stage Approach: SFT + DPO (Optional)
---
Everything below this point is optional; DPO-only fine-tuning needs only the code above.

### Data Prep for SFT + DPO

For better results, we recommend a two-stage approach:
1. Supervised Fine-Tuning (SFT): First train the model to generate responses similar to the preferred examples
2. Preference Optimization (DPO): Further refine the model to better distinguish between preferred and non-preferred outputs

This combined approach typically yields better results than applying DPO directly to a base model.

Below we prepare the data for a round of supervised fine-tuning that runs before the preference tuning. This step is optional: if you only want to perform DPO, you can skip it.

In [None]:
# Convert the preference dataset to SFT format
def convert_preference_to_sft_format(data):
    """
    Convert Preference data format to SFT format.
    
    Takes input messages and preferred output and formats them into a chat format
    with appropriate role assignments.
    """
    messages = []
    for msg in data["input"]["messages"]:
        messages.append(msg)
    messages.extend(data["preferred_output"])
    
    return {"messages": messages}

def process_preference_to_sft(input_data, output_path, split="train"):
    """
    Process the preference dataset and convert it to SFT format.
    
    Args:
        input_data: Dictionary containing train and validation preference data
        output_path: Path to save the output SFT jsonl file
        split: Dataset split to process ("train" or "validation")
    """
    os.makedirs(os.path.dirname(output_path), exist_ok=True)
    
    line_count = 0
    with open(output_path, 'w') as outfile:
        for example in input_data[split]:
            try:
                sft_format = convert_preference_to_sft_format(example)
                outfile.write(json.dumps(sft_format) + '\n')
                line_count += 1
                if line_count % 2000 == 0 and split == "train":
                    print(f"Processed {line_count} examples")
            except Exception as e:
                print(f"Error processing example {line_count + 1}: {str(e)}")
    
    print(f"Processed {line_count} examples for {split}. Output saved to {output_path}")
    return line_count

# Process the training dataset
sft_train_output_path = "data/helpsteer2_sft_training.jsonl"
sft_train_count = process_preference_to_sft(converted_dataset, sft_train_output_path, "train")

# Process the validation dataset
sft_validation_output_path = "data/helpsteer2_sft_validation.jsonl"
sft_validation_count = process_preference_to_sft(converted_dataset, sft_validation_output_path, "validation")

# Display a sample SFT example from training set
with open(sft_train_output_path, 'r') as f:
    sample_sft = json.loads(f.readline().strip())
    
print("\nSample SFT example:")
print(json.dumps(sample_sft, indent=2))

# Compare dataset sizes
print(f"\nPreference dataset sizes:")
print(f"  Training: {len(converted_dataset['train'])} examples")
print(f"  Validation: {len(converted_dataset['validation'])} examples")

print(f"SFT dataset sizes:")
print(f"  Training: {sft_train_count} examples")
print(f"  Validation: {sft_validation_count} examples")

Processed 2000 examples
Processed 4000 examples
Processed 6000 examples
Processed 7221 examples for train. Output saved to data/helpsteer2_sft_training.jsonl
Processed 373 examples for validation. Output saved to data/helpsteer2_sft_validation.jsonl

Sample SFT example:
{
  "messages": [
    {
      "role": "user",
      "content": "c#"
    },
    {
      "role": "assistant",
      "content": "C# (pronounced \"C sharp\") is a modern, object-oriented programming language developed by Microsoft. It is widely used for building various types of applications, including web applications, desktop applications, mobile applications, and games. C# is similar to other programming languages such as Java and C++, and it is known for its simplicity and ease of use. C# is a powerful language that provides a rich set of libraries and frameworks that make it easy to build robust and scalable applications.\n\nHere is a brief overview of some key features of C#:\n\n1. Object-oriented: C# is an object-ori

In [None]:
!ls data

helpsteer2_preference_train.jsonl      helpsteer2_sft_training.jsonl
helpsteer2_preference_validation.jsonl helpsteer2_sft_validation.jsonl


**Now we'll upload the SFT datasets to the Together AI cloud to use them for fine-tuning. Note that we set `check=True`, which triggers a format check to ensure that our data is in the correct format for SFT.**

In [None]:
sft_train_file = client.files.upload(sft_train_output_path, check=True)
sft_validation_file = client.files.upload(sft_validation_output_path, check=True)

print(f"Uploaded SFT training files: {sft_train_file}")
print(f"Uploaded SFT validation files: {sft_validation_file}")



Uploading file helpsteer2_sft_training.jsonl: 100%|██████████| 17.4M/17.4M [00:01<00:00, 17.4MB/s]
Uploading file helpsteer2_sft_validation.jsonl: 100%|██████████| 900k/900k [00:00<00:00, 1.13MB/s]


Uploaded SFT training files: id='file-95a0cfa5-499c-42bf-ae58-817a2efb8fbe' object=<ObjectType.File: 'file'> created_at=1744643856 type=None purpose=<FilePurpose.FineTune: 'fine-tune'> filename='helpsteer2_sft_training.jsonl' bytes=17393749 line_count=0 processed=True FileType='jsonl'
Uploaded SFT validation files: id='file-ef7c69a0-0ba5-44c2-9495-206646683d8a' object=<ObjectType.File: 'file'> created_at=1744643857 type=None purpose=<FilePurpose.FineTune: 'fine-tune'> filename='helpsteer2_sft_validation.jsonl' bytes=899789 line_count=0 processed=True FileType='jsonl'


### Two-Stage Approach: SFT + DPO (Optional) - Fine-tuning

**Optionally, we can first create an SFT (standard fine-tuning) job and then use preference tuning with DPO to continue training from the resulting SFT checkpoint.**

In [None]:
# If you want to combine the SFT + DPO training, you can do so by creating a SFT job first
# and then using the DPO training to continue the training of the resulting SFT checkpoint.

sft_training = client.fine_tuning.create(
    training_file=sft_train_file.id,
    validation_file=sft_validation_file.id,
    n_evals=3,
    model=MODEL_NAME,
    wandb_api_key=WANDB_API_KEY,
    wandb_project_name="helpsteer2",
    suffix="helpsteer2_sft_training",
    n_epochs=1,
    n_checkpoints=1,
    learning_rate=1e-5,
    lora=True,
)
print(sft_training.id)

ft-0d95918c-8d57


This gives us an SFT checkpoint:

<img src="../images/SFT_job.png" width="1000">

Training log:

<img src="../images/SFT_loss.png" width="700">

Use continual fine-tuning (CFT) to further preference-tune the checkpoint from the SFT run above:

In [None]:
dpo_training_from_sft = client.fine_tuning.create(
    training_file=dpo_train_file.id,
    validation_file=dpo_validation_file.id,
    n_evals=10,
    # `model` is omitted here: the base model is derived from the checkpoint.
    wandb_api_key=WANDB_API_KEY,
    wandb_project_name="helpsteer2",
    suffix="helpsteer2_dpo_training_continuing_sft",
    n_epochs=1,
    n_checkpoints=1,
    learning_rate=1e-5,
    lora=True,
    training_method='dpo', # Now we use DPO training
    from_checkpoint=sft_training.id # Continuing from SFT checkpoint!
)
print(dpo_training_from_sft.id)

ft-d054a797-6e52


We can now see the resulting DPO tuning graphs:

<img src="../images/DPO_SFT_KL.png" width="700">

<img src="../images/DPO_SFT_acc.png" width="700">

## Further Reading
- [Together AI Full Docs](https://docs.together.ai/docs/preference-fine-tuning)
- [A dataset to make more human-like responses](https://arxiv.org/abs/2501.05032)
- [A Comprehensive Survey of Direct Preference Optimization: Datasets, Theories, Variants, and Applications](https://arxiv.org/abs/2410.15595)
- [Iterative Reasoning Preference Optimization](https://arxiv.org/abs/2404.19733v1)