# What is DPO?
- Direct Preference Optimization (DPO) is a method for aligning large language models (LLMs) with human preferences, offering a simpler and more stable alternative to traditional Reinforcement Learning from Human Feedback (RLHF).
- Pre-trained LLMs are excellent at predicting the next token based on vast amounts of text. However, they don't inherently know what humans prefer in terms of helpfulness, harmlessness, style, or specific content. This is where "alignment" comes in – teaching the model to generate outputs that are more desirable to humans.
- DPO's key innovation is that it eliminates the need for a separate reward model and complex reinforcement learning algorithms. Instead, it optimizes the language model's policy directly on human preference pairs, framing alignment as a binary classification problem over a chosen and a rejected response.
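
To make the binary-classification framing concrete, the sketch below computes the per-pair DPO loss in plain Python. It assumes you already have the summed log-probabilities of the chosen and rejected responses under both the policy and a frozen reference model. The function name `dpo_loss` and the default `beta=0.1` are illustrative choices, not values taken from this notebook.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen log-ratio - rejected log-ratio)))."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(m)) == log(1 + exp(-m)); written in a numerically stable form
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Policy identical to the reference: the margin is 0, so the loss equals log(2)
print(dpo_loss(-12.0, -15.0, -12.0, -15.0))
# Policy has shifted probability toward the chosen response: the loss drops below log(2)
print(dpo_loss(-10.0, -18.0, -12.0, -15.0))
```

The loss is a logistic loss on the margin between the two log-ratios, which is why DPO is often described as classification over preference pairs.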

## Training

In [None]:
# Import required libraries 
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_config, prepare_model_for_kbit_training, get_peft_model
from trl import DPOTrainer, DPOConfig
from datasets import load_dataset
import os 

# Log in with your Hugging Face token if you need access to gated models
from huggingface_hub import login
login(token="")  # paste your token here


Found Intel OpenMP ('libiomp') and LLVM OpenMP ('libomp') loaded at
the same time. Both libraries are known to be incompatible and this
can cause random crashes or deadlocks on Linux when loaded in the
same Python program.
Using threadpoolctl may cause crashes or deadlocks. For more
information and possible workarounds, please see
    https://github.com/joblib/threadpoolctl/blob/master/multiple_openmp.md



In [2]:
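# Enable hf_transfer to speed up model, dataset, and tokenizer downloads from the Hugging Face Hub.
# Note: this needs the `hf_transfer` package, and huggingface_hub generally reads this variable
# when it is first imported, so set it before the imports above if it seems to have no effect.
os.environ['HF_HUB_ENABLE_HF_TRANSFER'] = '1'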

# Model to be used
model_name = "microsoft/Phi-3-mini-4k-instruct"

# Data to be used 
data = "Intel/orca_dpo_pairs"

# Save directory, adjust if needed
save_directory = "./cache"

In [3]:
# Quantization configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type= "nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Load base model 
model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=model_name,
    quantization_config=bnb_config,
    device_map="balanced",
    trust_remote_code=True,
    cache_dir=save_directory,
    torch_dtype=torch.bfloat16,
)

# Load tokenizer 
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    cache_dir=save_directory,
    trust_remote_code=True,
)

# LoRA configuration
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
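    # NOTE: check these names against model.named_modules(); Phi-3 checkpoints
    # commonly fuse the projections as "qkv_proj" and "gate_up_proj".
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"
    ],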
)

# Load the preference dataset from the Hugging Face Hub
dataset = load_dataset(
    path=data,
    cache_dir=save_directory
)

`flash-attention` package not found, consider installing for better performance: No module named 'flash_attn'.
Current `flash-attention` does not support `window_size`. Either upgrade or use `attn_implementation='eager'`.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
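
LoRA only attaches adapters to modules whose names match `target_modules`, so it can be worth checking which names actually match before training. The helper below is a hypothetical, pure-Python sketch: `match_target_modules` is not a PEFT function, and it assumes suffix matching on dotted module names, which only approximates how PEFT resolves a list of target names. The module names in the example are made up for illustration.

```python
def match_target_modules(module_names, target_modules):
    """Split target names into those matching at least one module and those matching none."""
    matched, unmatched = [], []
    for target in target_modules:
        hit = any(name == target or name.endswith("." + target) for name in module_names)
        (matched if hit else unmatched).append(target)
    return matched, unmatched

# Made-up module names for illustration; in the notebook you would pass
# [name for name, _ in model.named_modules()] instead.
example_names = [
    "model.layers.0.self_attn.qkv_proj",
    "model.layers.0.self_attn.o_proj",
    "model.layers.0.mlp.gate_up_proj",
    "model.layers.0.mlp.down_proj",
]
targets = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]

matched, unmatched = match_target_modules(example_names, targets)
print("matched:", matched)
print("unmatched:", unmatched)
```

Any name that lands in `unmatched` receives no adapter, so the effective number of trainable parameters may be smaller than the configuration suggests.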

In [4]:
# Set tokenizer configurations

# Use the EOS token as the pad token
tokenizer.pad_token = tokenizer.eos_token

# Pad on the right: important for DPO, and typical for causal models
tokenizer.padding_side = "right"

# Wrap the base model with the LoRA adapters
peft_model = get_peft_model(model=model, peft_config=peft_config)

In [4]:
print(tokenizer.chat_template)

{% for message in messages %}{% if message['role'] == 'system' %}{{'<|system|>
' + message['content'] + '<|end|>
'}}{% elif message['role'] == 'user' %}{{'<|user|>
' + message['content'] + '<|end|>
'}}{% elif message['role'] == 'assistant' %}{{'<|assistant|>
' + message['content'] + '<|end|>
'}}{% endif %}{% endfor %}{% if add_generation_prompt %}{{ '<|assistant|>
' }}{% else %}{{ eos_token }}{% endif %}
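
To see what this template produces for a given conversation, the sketch below mirrors it by hand in plain Python. It is not the tokenizer's own implementation: `render_phi3_chat` is a hypothetical helper, and the default `eos_token="<|endoftext|>"` is an assumption about the tokenizer's EOS string.

```python
def render_phi3_chat(messages, add_generation_prompt=False, eos_token="<|endoftext|>"):
    """Hand-written mirror of the chat template printed above (not the tokenizer's own code)."""
    parts = []
    for message in messages:
        # The template only renders system, user, and assistant messages
        if message["role"] in ("system", "user", "assistant"):
            parts.append("<|" + message["role"] + "|>\n" + message["content"] + "<|end|>\n")
    # The template ends with an assistant header when generating, otherwise with the EOS token
    parts.append("<|assistant|>\n" if add_generation_prompt else eos_token)
    return "".join(parts)

prompt = render_phi3_chat(
    [{"role": "user", "content": "What is DPO?"}],
    add_generation_prompt=True,
)
print(repr(prompt))
```

With `add_generation_prompt=True` the text ends in an open assistant turn, which is the form a prompt should take; a finished conversation ends with the EOS token instead.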


In [6]:
# A common issue: datasets might not be perfectly formatted.
# You might need a formatting function if your data columns are different.
# For example, if your data is structured as {"instruction": "...", "response_good": "...", "response_bad": "..."}
# def format_dpo_dataset(sample):
#     # apply_chat_template expects a list of messages, not a single dict
#     prompt_message = [{"role": "user", "content": sample["instruction"]}]
#     prompt_text = tokenizer.apply_chat_template(prompt_message, tokenize=False, add_generation_prompt=True)
#
#     # Finished conversations must not end with a new generation prompt
#     chosen_message = prompt_message + [{"role": "assistant", "content": sample["response_good"]}]
#     chosen_text = tokenizer.apply_chat_template(chosen_message, tokenize=False, add_generation_prompt=False)
#
#     rejected_message = prompt_message + [{"role": "assistant", "content": sample["response_bad"]}]
#     rejected_text = tokenizer.apply_chat_template(rejected_message, tokenize=False, add_generation_prompt=False)
#
#     # Keep only the completions, since the prompt is passed separately
#     return {
#         "prompt": prompt_text,
#         "chosen": chosen_text[len(prompt_text):],
#         "rejected": rejected_text[len(prompt_text):]
#     }

def format_dpo_dataset(sample):
    return {
        "prompt": sample["question"],
        "chosen": sample["chosen"],
        "rejected": sample["rejected"]
    }


In [7]:
dataset = dataset.map(format_dpo_dataset)  # map returns a new dataset, so keep the result
dataset

Map:   0%|          | 0/12859 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['system', 'question', 'chosen', 'rejected', 'prompt'],
        num_rows: 12859
    })
})
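
Because preference data is not always clean, a quick check before training can help. The sketch below is a hypothetical helper (`find_bad_pairs` is not part of `datasets` or TRL) that flags rows with an empty field or with identical chosen and rejected responses. The example rows are invented for illustration.

```python
def find_bad_pairs(rows, required=("prompt", "chosen", "rejected")):
    """Return (index, reason) tuples for rows that are unusable as preference pairs."""
    problems = []
    for i, row in enumerate(rows):
        # A field counts as missing if it is absent, None, or only whitespace
        missing = [key for key in required if not str(row.get(key) or "").strip()]
        if missing:
            problems.append((i, "missing or empty: " + ", ".join(missing)))
        elif str(row["chosen"]).strip() == str(row["rejected"]).strip():
            problems.append((i, "chosen and rejected are identical"))
    return problems

rows = [
    {"prompt": "2 + 2 = ?", "chosen": "4", "rejected": "5"},
    {"prompt": "Capital of France?", "chosen": "Paris", "rejected": "Paris"},
    {"prompt": "", "chosen": "Hello", "rejected": "Hi"},
]
print(find_bad_pairs(rows))
```

Iterating over a split such as `dataset["train"]` yields one dict per row, so the same helper could be applied to the mapped dataset.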

- Start with a Strong SFT Base: DPO works best when the model already understands basic instruction following. If your base model isn't instruction-tuned, consider an initial SFT phase.