# 💡 What is DPO?
- Direct Preference Optimization (DPO) is a method for aligning large language models (LLMs) with human preferences, offering a simpler and more stable alternative to traditional Reinforcement Learning from Human Feedback (RLHF).
- Pre-trained LLMs are excellent at predicting the next token based on vast amounts of text. However, they don't inherently know what humans prefer in terms of helpfulness, harmlessness, style, or specific content. This is where "alignment" comes in – teaching the model to generate outputs that are more desirable to humans.
- DPO's key innovation is that it eliminates the need for a separate reward model and complex reinforcement learning algorithms. Instead, it optimizes the language model's policy directly on preference pairs, treating the choice between a preferred and a rejected response as a binary classification problem.
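
To make the binary classification framing concrete, here is a minimal sketch of the per-example DPO loss in plain Python. The function name `dpo_loss` and the default `beta=0.1` are illustrative assumptions, not values taken from this notebook. The inputs are the summed log-probabilities of the chosen and rejected responses under the policy being trained and under a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss (sketch; the name and beta default are assumptions)."""
    # Implicit rewards: how much more likely the policy makes each response
    # compared with the reference model, scaled by beta.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward

    # loss = -log(sigmoid(margin)), i.e. binary cross-entropy with the label
    # "chosen beats rejected", written in a numerically stable form.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# When the policy equals the reference, the margin is 0 and the loss is log(2).
baseline = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Raising the chosen log-probability and lowering the rejected one reduces the loss.
improved = dpo_loss(-5.0, -20.0, -10.0, -12.0)
```

The loss falls as the policy widens the gap between chosen and rejected responses relative to the reference, which is why no explicit reward model is needed.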

## Training

In [None]:
# Import required libraries 
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_config, prepare_model_for_kbit_training, get_peft_model
from trl import DPOTrainer, DPOConfig
from datasets import load_dataset
import os 

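# Log in with a Hugging Face token (only needed for gated models).
# Read it from the HF_TOKEN environment variable instead of hardcoding it.
from huggingface_hub import login
login(token=os.environ.get("HF_TOKEN"))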


--- 
**💡Phi-3 comes in different sizes and context window variants. Make sure you select the correct one from Hugging Face:**

- microsoft/Phi-3-mini-4k-instruct (3.8B parameters, 4K context)

- microsoft/Phi-3-mini-128k-instruct (3.8B parameters, 128K context)

- microsoft/Phi-3-medium-4k-instruct (14B parameters, 4K context)

- microsoft/Phi-3-medium-128k-instruct (14B parameters, 128K context)

**For this demo, we are using "microsoft/Phi-3-mini-4k-instruct"**

---

In [2]:
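# Enable hf_transfer to speed up model, dataset, and tokenizer downloads from the Hugging Face Hub.
# Requires the `hf_transfer` package; the variable is typically read when huggingface_hub
# is imported, so set it before the imports if it appears to have no effect.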
os.environ['HF_HUB_ENABLE_HF_TRANSFER'] = '1'

# Model to be used
model_name = "microsoft/Phi-3-mini-4k-instruct"

# Data to be used 
data = "Intel/orca_dpo_pairs"

# Save directory, adjust if needed
save_directory = "./cache"

In [3]:
# Quantization configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type= "nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Load base model 
model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=model_name,
    quantization_config=bnb_config,
    device_map="balanced",
    trust_remote_code=True,
    cache_dir=save_directory,
    torch_dtype=torch.bfloat16,
)

# Load tokenizer 
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    cache_dir=save_directory,
    trust_remote_code=True,
)

# LoRA configuration
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
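    # Phi-3 fuses the attention and MLP input projections, so its linear layers are
    # named qkv_proj / gate_up_proj rather than q_proj, k_proj, v_proj, gate_proj, up_proj.
    target_modules=[
        "qkv_proj", "o_proj",
        "gate_up_proj", "down_proj"
    ],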
)

# Load the preference dataset from the Hugging Face Hub
dataset = load_dataset(
    path=data,
    cache_dir=save_directory
)

`flash-attention` package not found, consider installing for better performance: No module named 'flash_attn'.
Current `flash-attention` does not support `window_size`. Either upgrade or use `attn_implementation='eager'`.



In [4]:
# Set tokenizer configurations

# Use the EOS token as the pad token
tokenizer.pad_token = tokenizer.eos_token

# Pad on the right, the usual choice when training causal models
tokenizer.padding_side = "right"

# Prepare the 4-bit model for training, then wrap it with the LoRA adapters
model = prepare_model_for_kbit_training(model)
peft_model = get_peft_model(model=model, peft_config=peft_config)

In [21]:
# Check whether the tokenizer defines a chat template
if tokenizer.chat_template:
    print(tokenizer.chat_template)
else:
    print(f"No chat template present for {tokenizer.name_or_path}")

{% for message in messages %}{% if message['role'] == 'system' %}{{'<|system|>
' + message['content'] + '<|end|>
'}}{% elif message['role'] == 'user' %}{{'<|user|>
' + message['content'] + '<|end|>
'}}{% elif message['role'] == 'assistant' %}{{'<|assistant|>
' + message['content'] + '<|end|>
'}}{% endif %}{% endfor %}{% if add_generation_prompt %}{{ '<|assistant|>
' }}{% else %}{{ eos_token }}{% endif %}
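
To see what the template printed above produces, the sketch below mirrors it with plain string operations. The function `render_phi3_chat` is a hypothetical helper written for illustration, and the default `eos_token="<|endoftext|>"` is an assumption; in practice `tokenizer.apply_chat_template` does this work.

```python
def render_phi3_chat(messages, add_generation_prompt=False, eos_token="<|endoftext|>"):
    """Mirror the Jinja chat template above (illustrative helper, not a library API)."""
    parts = []
    for message in messages:
        role = message["role"]
        # The template only handles these three roles; anything else is skipped.
        if role in ("system", "user", "assistant"):
            parts.append(f"<|{role}|>\n{message['content']}<|end|>\n")
    # Either open an assistant turn for generation or close the text with EOS.
    parts.append("<|assistant|>\n" if add_generation_prompt else eos_token)
    return "".join(parts)


prompt = render_phi3_chat(
    [{"role": "user", "content": "What is DPO?"}],
    add_generation_prompt=True,
)
print(repr(prompt))
# '<|user|>\nWhat is DPO?<|end|>\n<|assistant|>\n'
```

With `add_generation_prompt=False`, the rendered text ends with the EOS token instead of an open assistant turn, which is the form a completed conversation takes.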


---
### ℹ️ Data Preparation for DPO Training 
- As a rule of thumb, a dataset for Hugging Face's `DPOTrainer` needs three columns:
    - `prompt`
    - `chosen`
    - `rejected`

- `DPOTrainer` expects a `datasets.Dataset` (from the Hugging Face `datasets` library), typically loaded from a JSONL file, a CSV file, or the Hugging Face Hub.

- If `tokenizer.chat_template` is not `None`, put only the user's input in `prompt` and only the model's response in `chosen` and `rejected`. The trainer then builds the full sequences (prompt + chosen and prompt + rejected) before tokenization and the forward passes.

- Whether the chat template is applied for you depends on the TRL version and the data format: recent TRL releases document automatic templating for conversational records (lists of `role`/`content` messages), while plain strings are tokenized as given. Check this for your installed version, because a prompt format that matches the one used during the model's instruction tuning matters for correctness.
---
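
As an illustration of the column layout described above, the sketch below builds one record in the plain-string format and converts it to the conversational format (lists of role/content messages). The helper name `to_conversational` and the sample text are hypothetical, and whether the trainer applies the chat template to such records should be confirmed against the installed TRL version.

```python
REQUIRED_COLUMNS = ("prompt", "chosen", "rejected")


def to_conversational(record):
    """Convert a plain-string preference record into role/content message lists."""
    missing = [col for col in REQUIRED_COLUMNS if col not in record]
    if missing:
        raise KeyError(f"record is missing columns: {missing}")
    return {
        "prompt": [{"role": "user", "content": record["prompt"]}],
        "chosen": [{"role": "assistant", "content": record["chosen"]}],
        "rejected": [{"role": "assistant", "content": record["rejected"]}],
    }


# Made-up example record in the plain-string format
record = {
    "prompt": "Name the capital of France.",
    "chosen": "The capital of France is Paris.",
    "rejected": "France is a country in Europe.",
}
conversational = to_conversational(record)
```

Both layouts carry the same information; the conversational one simply makes the speaker of each piece of text explicit.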

In [14]:
# Alternative strategy: apply the chat template manually and pass fully formatted strings

# def format_dpo_dataset(sample):
#     # apply_chat_template expects a list of messages, not a single dict
#     prompt_messages = [{"role": "user", "content": sample["question"]}]
#     prompt_text = tokenizer.apply_chat_template(prompt_messages, tokenize=False, add_generation_prompt=True)
#
#     # Render the completed conversations without a trailing generation prompt
#     chosen_messages = prompt_messages + [{"role": "assistant", "content": sample["chosen"]}]
#     chosen_full = tokenizer.apply_chat_template(chosen_messages, tokenize=False, add_generation_prompt=False)
#
#     rejected_messages = prompt_messages + [{"role": "assistant", "content": sample["rejected"]}]
#     rejected_full = tokenizer.apply_chat_template(rejected_messages, tokenize=False, add_generation_prompt=False)
#
#     # Keep only the completion so the prompt is not duplicated
#     # (assumes the rendered conversation starts with the rendered prompt, as with the template above)
#     return {
#         "prompt": prompt_text,
#         "chosen": chosen_full[len(prompt_text):],
#         "rejected": rejected_full[len(prompt_text):],
#     }

# Data formatting strategy if tokenizer.chat_template is not None
def format_dpo_dataset(sample):
    return {
        "prompt": sample["question"],
        "chosen": sample["chosen"],
        "rejected": sample["rejected"]
    }


In [15]:
# The final datasets.Dataset should contain only three fields: "prompt", "chosen", "rejected".
# Drop the remaining source columns.
columns_to_remove = ["system", "question"]

# Map the dataset
processed_dataset = dataset.map(
    format_dpo_dataset,                 # formatting function
    remove_columns=columns_to_remove,   # columns to remove
    batched=False,                      # process example by example; set to True if the function handles batches
    num_proc=os.cpu_count(),            # use multiple processes for faster mapping
)

In [16]:
# Inspect the processed dataset
processed_dataset

- Start with a Strong SFT Base: DPO works best when the model already understands basic instruction following. If your base model isn't instruction-tuned, consider an initial SFT phase.

- High-Quality Preference Data: DPO learns directly from the contrast between each chosen and rejected response, so noisy or mislabeled pairs teach the model the wrong preference.
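
One inexpensive quality check is to drop pairs that carry no preference signal, such as empty responses or identical chosen and rejected texts. The sketch below assumes each record is a dict with the three string columns used in this notebook. The function name `is_usable_pair` and the sample records are hypothetical, and this is only a first-pass filter, not a substitute for reviewing the data.

```python
def is_usable_pair(record):
    """Return True if the record has a non-empty prompt and two distinct, non-empty responses."""
    prompt = record["prompt"].strip()
    chosen = record["chosen"].strip()
    rejected = record["rejected"].strip()
    return bool(prompt) and bool(chosen) and bool(rejected) and chosen != rejected


# Made-up records for illustration
records = [
    {"prompt": "Define DPO.", "chosen": "A preference-tuning method.", "rejected": "I don't know."},
    {"prompt": "Define LoRA.", "chosen": "Same answer.", "rejected": "Same answer."},
    {"prompt": "Define RLHF.", "chosen": "", "rejected": "Some text."},
]
usable = [r for r in records if is_usable_pair(r)]
```

With a Hugging Face dataset, the same predicate could be passed to `Dataset.filter`.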