# 💡 What is DPO?
- Direct Preference Optimization (DPO) is a method for aligning large language models (LLMs) with human preferences, offering a simpler and more stable alternative to traditional Reinforcement Learning from Human Feedback (RLHF).
- Pre-trained LLMs are excellent at predicting the next token based on vast amounts of text. However, they don't inherently know what humans prefer in terms of helpfulness, harmlessness, style, or specific content. This is where "alignment" comes in – teaching the model to generate outputs that are more desirable to humans.
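- DPO's key innovation is that it eliminates the need for a separate reward model and a reinforcement learning loop. Instead, it optimizes the language model's policy directly on preference pairs, treating each (chosen, rejected) pair as a binary classification example: the model is trained to assign a higher likelihood, relative to a frozen reference model, to the chosen response than to the rejected one.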
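
To make the "binary classification" view concrete, here is a minimal sketch of the per-example DPO loss in its sigmoid form, written in plain Python. The function name and the log-probability values are illustrative assumptions; in practice the log-probabilities are sums over the response tokens, computed by the policy and the reference model.

```python
import math

def dpo_sigmoid_loss(policy_chosen_logp, policy_rejected_logp,
                     ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log(sigmoid(beta * (chosen log-ratio - rejected log-ratio)))."""
    # How much more (or less) likely each response is under the policy than under the reference
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(x)) == log(1 + exp(-x)), written in a numerically stable form
    return math.log1p(math.exp(-abs(logits))) + max(-logits, 0.0)

# Policy identical to the reference: the margin is zero and the loss is ln(2)
print(round(dpo_sigmoid_loss(-10.0, -12.0, -10.0, -12.0), 4))

# Policy has shifted probability toward the chosen response: the loss drops below ln(2)
print(round(dpo_sigmoid_loss(-8.0, -14.0, -10.0, -12.0), 4))
```

A larger `beta` scales the margin up, so the same shift away from the reference changes the loss more sharply.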

## Training

In [None]:
# Import required libraries
import os
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model
from trl import DPOTrainer, DPOConfig
from datasets import load_dataset

# Log in to the Hugging Face Hub (only needed to access gated models).
# The token is read from the HF_TOKEN environment variable; if it is unset, login() prompts for it.
from huggingface_hub import login
login(token=os.environ.get("HF_TOKEN"))


Found Intel OpenMP ('libiomp') and LLVM OpenMP ('libomp') loaded at
the same time. Both libraries are known to be incompatible and this
can cause random crashes or deadlocks on Linux when loaded in the
same Python program.
Using threadpoolctl may cause crashes or deadlocks. For more
information and possible workarounds, please see
    https://github.com/joblib/threadpoolctl/blob/master/multiple_openmp.md



--- 
**💡 Phi-3 comes in different sizes and context window variants. Make sure you select the correct one from Hugging Face:**

- microsoft/Phi-3-mini-4k-instruct (3.8B parameters, 4K context)

- microsoft/Phi-3-mini-128k-instruct (3.8B parameters, 128K context)

- microsoft/Phi-3-medium-4k-instruct (14B parameters, 4K context)

- microsoft/Phi-3-medium-128k-instruct (14B parameters, 128K context)

**For this demo, we are using "microsoft/Phi-3-mini-4k-instruct"**

---
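
As a rough guide when choosing between the sizes listed above, the memory needed just to hold the weights scales with parameter count times bits per parameter. The sketch below is a back-of-the-envelope estimate only: the helper name is made up for illustration, and the figures ignore activations, optimizer state, the KV cache, and the small overhead of quantization constants.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for model weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

# Parameter counts as listed above
sizes = {"Phi-3-mini": 3.8e9, "Phi-3-medium": 14e9}

for name, num_params in sizes.items():
    bf16_gb = weight_memory_gb(num_params, 16)  # 16-bit weights
    int4_gb = weight_memory_gb(num_params, 4)   # 4-bit quantized weights
    print(f"{name}: ~{bf16_gb:.1f} GB in 16-bit, ~{int4_gb:.1f} GB in 4-bit")
```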

In [2]:
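# Enable hf_transfer to speed up model, dataset, and tokenizer downloads from the Hugging Face Hub.
# This requires the hf_transfer package (pip install hf_transfer). The flag is read when
# huggingface_hub is imported, so set it before the imports for it to take effect.
os.environ['HF_HUB_ENABLE_HF_TRANSFER'] = '1'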

# Model to be used
model_name = "microsoft/Phi-3-mini-4k-instruct"

# Data to be used 
data = "Intel/orca_dpo_pairs"

# Save directory, adjust if needed
save_directory = "./cache"

In [3]:
# Quantization configuration (4-bit NF4 with double quantization, bfloat16 compute)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type= "nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Load base model 
model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=model_name,
    quantization_config=bnb_config,
    device_map="balanced",
    trust_remote_code=True,
    cache_dir=save_directory,
    torch_dtype=torch.bfloat16,
)

# Load tokenizer 
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    cache_dir=save_directory,
    trust_remote_code=True,
)

# LoRA configuration
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
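    # The Hugging Face Phi-3 implementation fuses the attention and MLP input projections,
    # so its linear layers are named qkv_proj and gate_up_proj rather than
    # q_proj/k_proj/v_proj and gate_proj/up_proj. Names that match no module are skipped.
    target_modules=[
        "qkv_proj", "o_proj",
        "gate_up_proj", "down_proj"
    ],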
)

# Load the dataset from the Hugging Face Hub
dataset = load_dataset(
    path=data,
    cache_dir=save_directory
)

`flash-attention` package not found, consider installing for better performance: No module named 'flash_attn'.
Current `flash-attention` does not support `window_size`. Either upgrade or use `attn_implementation='eager'`.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [4]:
# Set tokenizer configurations

# Use the EOS token as the padding token
tokenizer.pad_token = tokenizer.eos_token

# Pad on the right, the usual choice when training causal models
tokenizer.padding_side = "right"

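# The model is not wrapped with get_peft_model() here: peft_config is passed to DPOTrainer,
# which applies the LoRA adapters itself. Wrapping the model here as well would be redundant.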
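
To give a sense of what the LoRA settings above imply, the sketch below counts the trainable parameters LoRA adds to a single linear layer and computes the scaling applied to the update. The helper names are hypothetical, and the 3072 x 3072 layer shape is an assumption chosen for illustration rather than a value read from the model config.

```python
def lora_trainable_params(d_in, d_out, r):
    """LoRA adds two matrices per adapted layer: A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out)

def lora_scaling(lora_alpha, r):
    """The low-rank update B @ A is multiplied by lora_alpha / r before being added."""
    return lora_alpha / r

d_in, d_out = 3072, 3072  # assumed layer shape, for illustration only
r, lora_alpha = 64, 16    # values from the LoraConfig above

added = lora_trainable_params(d_in, d_out, r)
full = d_in * d_out
print(f"LoRA parameters for this layer: {added:,} ({added / full:.1%} of the {full:,} frozen weights)")
print(f"Update scaling: {lora_scaling(lora_alpha, r)}")
```

With `r=64` and `lora_alpha=16` the scaling factor is 0.25, so the adapter's contribution is damped; raising `lora_alpha` at a fixed rank makes the update stronger.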

In [5]:
# Check whether the tokenizer defines a chat template
if tokenizer.chat_template:
    print(tokenizer.chat_template)
else:
    print(f"No chat template present for {tokenizer.name_or_path}")

{% for message in messages %}{% if message['role'] == 'system' %}{{'<|system|>
' + message['content'] + '<|end|>
'}}{% elif message['role'] == 'user' %}{{'<|user|>
' + message['content'] + '<|end|>
'}}{% elif message['role'] == 'assistant' %}{{'<|assistant|>
' + message['content'] + '<|end|>
'}}{% endif %}{% endfor %}{% if add_generation_prompt %}{{ '<|assistant|>
' }}{% else %}{{ eos_token }}{% endif %}
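
To see what the template printed above produces, here is a plain-Python sketch that mirrors its Jinja logic. The function name is made up for illustration, and the default `eos_token` value is an assumption; check `tokenizer.eos_token` for the real one.

```python
def render_phi3_chat(messages, add_generation_prompt=False, eos_token="<|endoftext|>"):
    """Mirror of the Jinja template above: tag each turn, then close the sequence."""
    parts = []
    for message in messages:
        if message["role"] in ("system", "user", "assistant"):
            parts.append(f"<|{message['role']}|>\n{message['content']}<|end|>\n")
    # Either open an assistant turn for generation or terminate with the EOS token
    parts.append("<|assistant|>\n" if add_generation_prompt else eos_token)
    return "".join(parts)

prompt_only = render_phi3_chat(
    [{"role": "user", "content": "What is DPO?"}],
    add_generation_prompt=True,
)
print(repr(prompt_only))

full_turn = render_phi3_chat([
    {"role": "user", "content": "What is DPO?"},
    {"role": "assistant", "content": "A preference-tuning method."},
])
print(repr(full_turn))
```

The rendered prompt is a prefix of the rendered full conversation, which is what allows a prompt and its completion to be split cleanly.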


---
### ℹ️ Data Preparation for DPO Training
- As a rule of thumb, a dataset for Hugging Face's DPOTrainer needs three columns:
    - prompt
    - chosen
    - rejected

- The DPOTrainer expects a Dataset object from the Hugging Face datasets library, typically loaded from a JSONL file, a CSV file, or the Hugging Face Hub.

- Each column can hold either plain strings (standard format) or lists of {"role", "content"} messages (conversational format). In both cases, prompt should contain only the user's input, and chosen and rejected only the model's response.

- In recent TRL releases, the DPOTrainer applies tokenizer.chat_template itself when the dataset is in conversational format, constructing the full sequences (prompt + chosen and prompt + rejected) before tokenization and the forward passes. This keeps the formatting consistent with the model's instruction tuning. Plain strings are tokenized as given, with no template applied.

- So, while calling apply_chat_template manually on chosen and rejected in the mapping function might seem like the right thing to do, it is usually unnecessary, and templating text that the trainer also templates leads to errors such as duplicated special tokens.
---
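
As an illustration of the three-column layout, the sketch below shows one record as plain strings and one as conversational messages, plus a small sanity check. The record contents and the `check_preference_record` helper are made up for this example and are not part of TRL.

```python
REQUIRED_COLUMNS = ("prompt", "chosen", "rejected")

def check_preference_record(record):
    """Hypothetical helper: raise ValueError if a record is unusable as a preference pair."""
    missing = [name for name in REQUIRED_COLUMNS if name not in record]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    if record["chosen"] == record["rejected"]:
        raise ValueError("chosen and rejected are identical, so the pair carries no preference")
    return True

# Standard format: plain strings
standard_record = {
    "prompt": "What is the capital of France?",
    "chosen": "The capital of France is Paris.",
    "rejected": "France is a country in Europe.",
}

# Conversational format: lists of role/content messages
conversational_record = {
    "prompt": [{"role": "user", "content": "What is the capital of France?"}],
    "chosen": [{"role": "assistant", "content": "The capital of France is Paris."}],
    "rejected": [{"role": "assistant", "content": "France is a country in Europe."}],
}

print(check_preference_record(standard_record), check_preference_record(conversational_record))
```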

In [6]:
# Alternative: apply the chat template manually and pass plain strings
# (for example, with a TRL version that does not template the data itself)

# def format_dpo_dataset(sample):
#     prompt_messages = [{"role": "user", "content": sample["question"]}]
#     # The prompt ends with the assistant tag so that the completion follows it directly
#     prompt_text = tokenizer.apply_chat_template(prompt_messages, tokenize=False, add_generation_prompt=True)

#     def completion_text(answer):
#         # Render prompt + answer, then keep only the part that comes after the prompt
#         messages = prompt_messages + [{"role": "assistant", "content": answer}]
#         full_text = tokenizer.apply_chat_template(messages, tokenize=False)
#         return full_text[len(prompt_text):]

#     return {
#         "prompt": prompt_text,
#         "chosen": completion_text(sample["chosen"]),
#         "rejected": completion_text(sample["rejected"])
#     }

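# Conversational format: each field is a list of role/content messages,
# which lets DPOTrainer apply tokenizer.chat_template itself
def format_dpo_dataset(sample):
    return {
        "prompt": [{"role": "user", "content": sample["question"]}],
        "chosen": [{"role": "assistant", "content": sample["chosen"]}],
        "rejected": [{"role": "assistant", "content": sample["rejected"]}]
    }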


In [7]:
# The final datasets.Dataset should contain only three fields: "prompt", "chosen", "rejected"
# All other columns are removed
columns_to_remove = ["system", "question"]

# Map the dataset 
processed_dataset = dataset.map(format_dpo_dataset,                  # formatting function
                                remove_columns= columns_to_remove,   # Columns to remove 
                                batched= False,                      # Process example by example; set to True if your map function handles batches
                                num_proc= os.cpu_count())            # Use multiple processes for faster mapping

In [8]:
# Train Dataset 
train_dataset  = processed_dataset['train']

In [None]:
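# DPOConfig holds the usual training arguments plus the DPO-specific hyperparameters
dpo_config = DPOConfig(
    output_dir="./phi3_dpo_results", # Directory to save results in
    num_train_epochs=5,              # Number of passes over the training set
    beta=0.1,                        # How strongly the policy is held close to the reference model
    loss_type="sigmoid",             # Standard DPO loss (logistic loss on the preference margin)
    optim="paged_adamw_8bit",        # Paged 8-bit AdamW from bitsandbytes, reduces optimizer memory
    max_prompt_length=512,           # Maximum number of prompt tokens
    max_completion_length=1024,      # Maximum number of tokens per response
    max_length=2048,                 # Maximum number of tokens for prompt + response
    per_device_train_batch_size=2,   # Preference pairs per device per forward pass
    gradient_accumulation_steps=4,   # Accumulate 4 batches per update: effective batch size of 8 per device
    gradient_checkpointing=True,     # Some intermediate activations are not stored, instead, they are recomputed during backpropagation
    learning_rate=2e-5,              # Peak learning rate, reached at the end of warmup
    lr_scheduler_type="cosine",      # Controls how the learning rate changes during training
    max_steps=-1,                    # -1 trains for num_train_epochs; a positive value overrides it (1 step = 1 optimizer update)
    logging_steps=1,                 # Log the loss at every optimizer step
    save_steps=500,                  # Save a checkpoint every 500 steps
    warmup_ratio=0.1,                # Warm up the learning rate over the first 10% of steps
    bf16=True,                       # Matches the bfloat16 compute dtype set at load time; use fp16=True on GPUs without bfloat16 support
    report_to=["tensorboard"],
    logging_dir="./logs",
    remove_unused_columns=False,
    push_to_hub=False
)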

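# Initialize and train the DPOTrainer
print("Initializing DPOTrainer...")
dpo_trainer = DPOTrainer(
    model=model,
    ref_model=None,              # With a peft_config, the base model with adapters disabled serves as the reference
    args=dpo_config,             # Pass the DPOConfig directly as arguments
    train_dataset=train_dataset,
    processing_class=tokenizer,  # Tokenizer configured earlier (this argument is named `tokenizer` in older TRL releases)
    peft_config=peft_config,     # Crucial for LoRA/QLoRA fine-tuning
)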

# Start training
print("Starting DPO training...")
dpo_trainer.train()
print("DPO training complete!")

Initializing DPOTrainer...
[2025-07-27 14:46:01,124] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)


W0727 14:46:02.174000 21608 torch\distributed\elastic\multiprocessing\redirects.py:28] NOTE: Redirects are currently not supported in Windows or MacOs.
No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


Starting DPO training...


You are not running the flash-attention implementation, expect numerical differences.


Step,Training Loss
1,0.6931
2,0.6931
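
A loss of 0.6931 is ln(2), the value the sigmoid DPO loss takes when the policy and the reference model assign the same log-probabilities, which is the case at the start of LoRA training. One plausible reading of the flat value at step 2 is that the learning rate is still tiny during warmup. The sketch below illustrates this with a linear-warmup cosine schedule and the effective batch size implied by the config. The helper names and the total of 1000 steps are assumptions for illustration, not values taken from this run.

```python
import math

def effective_batch_size(per_device_batch_size, grad_accum_steps, num_devices=1):
    """Number of preference pairs that contribute to one optimizer update."""
    return per_device_batch_size * grad_accum_steps * num_devices

def cosine_lr_with_warmup(step, total_steps, base_lr=2e-5, warmup_ratio=0.1):
    """Linear warmup to base_lr, then cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

print("Effective batch size:", effective_batch_size(2, 4))

total_steps = 1000  # assumed run length, for illustration only
for step in (1, 2, 100, 550, 1000):
    print(f"step {step:>4}: lr = {cosine_lr_with_warmup(step, total_steps):.2e}")
```

Under these assumptions the learning rate at steps 1 and 2 is about a hundredth of its peak, so the logged loss should begin to move away from ln(2) only as warmup progresses.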
