*More details in this article: [DPO Full Training vs. DPO with LoRA: How Good is LoRA for DPO Training?](https://newsletter.kaitchup.com/p/dpo-full-training-vs-dpo-with-lora)*


This notebook shows how to run DPO training with a single base model and two LoRA adapters: one for the reference and one for the policy. It uses Qwen2.5 1.5B as an example.
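To make the "one base model, two adapters" idea concrete, here is a toy sketch in plain Python. It is not how PEFT implements LoRA internally; the 2x2 matrices, the rank-1 adapters, and the helper names `matmul` and `effective_weight` are all made up for illustration. It only shows that both adapters share the same frozen base weights and that updating the policy adapter leaves the reference untouched.

```python
# Toy illustration only (not the PEFT implementation): one frozen base weight
# matrix shared by two named low-rank adapters.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def effective_weight(base, adapter, scale):
    """Return base + scale * (B @ A) for a LoRA-style adapter {'A': ..., 'B': ...}."""
    delta = matmul(adapter["B"], adapter["A"])
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(base, delta)]

base = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weights, stored once
adapters = {
    "reference": {"A": [[1.0, 1.0]], "B": [[0.5], [0.0]]},  # frozen copy
    "DPO":       {"A": [[1.0, 1.0]], "B": [[0.5], [0.0]]},  # trainable copy
}
scale = 1.0  # hypothetical value of lora_alpha / lora_r

# Simulate one training update that touches only the policy adapter.
adapters["DPO"]["B"][1][0] = 0.25

ref_w = effective_weight(base, adapters["reference"], scale)
pol_w = effective_weight(base, adapters["DPO"], scale)
print("reference:", ref_w)
print("policy:   ", pol_w)
```

Because only the small `A` and `B` matrices are duplicated, the memory cost of keeping a reference model is far lower than loading a second full copy of the base model.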

The notebook requires a 24 GB GPU (Ampere or more recent).

In [1]:
!pip install --upgrade transformers bitsandbytes peft accelerate datasets trl flash_attn



In [2]:
import torch, os, multiprocessing
from datasets import load_dataset
from peft import LoraConfig, PeftModel
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    set_seed
)
from trl import DPOTrainer, DPOConfig

set_seed(1234)

model_name = "Qwen/Qwen2.5-1.5B"

#SFT Adapter
sft_adapter = "kaitchup/Qwen2.5-1.5B-SFT-UltraChat" #location of your SFT adapter

compute_dtype = torch.bfloat16

# 'sdpa' is used here. To use FlashAttention instead, swap the two lines below.
# attn_implementation = 'flash_attention_2'
attn_implementation = 'sdpa'



bs = 4 #Batch size per device (training and validation)
gas = 8 #Gradient accumulation steps
mseqlen = 1024 #Maximum sequence length
lr = 1e-6 #Learning rate

lora_alpha = 16
lora_dropout = 0.0
lora_r = 16

output_dir = "/workspace/DPO_LoRA_RUN/"

In [3]:
#Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = "<|image_pad|>"
tokenizer.pad_token_id = 151655
tokenizer.padding_side = 'left'



In [4]:
import torch
print("CUDA available:", torch.cuda.is_available())
print("Device count:", torch.cuda.device_count())
print("Current device:", torch.cuda.current_device())
print("Device name:", torch.cuda.get_device_name(0))


CUDA available: True
Device count: 1
Current device: 0
Device name: NVIDIA A100-SXM4-40GB


DPO expects a dataset with:
* a prompt
* a chosen answer to this prompt
* a rejected answer to this prompt
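As a minimal sketch of that layout, the record below uses made-up text (it is not taken from the dataset used in this notebook), and `is_valid_preference_row` is a hypothetical helper that only checks for the three string fields listed above.

```python
# Hypothetical toy record in the prompt/chosen/rejected layout.
example = {
    "prompt": "What is 2 + 2?",
    "chosen": "2 + 2 = 4.",
    "rejected": "2 + 2 = 5.",
}

REQUIRED_KEYS = ("prompt", "chosen", "rejected")

def is_valid_preference_row(row):
    """Return True if the row has a non-empty string for every required key."""
    return all(isinstance(row.get(k), str) and bool(row[k]) for k in REQUIRED_KEYS)

print(is_valid_preference_row(example))
```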

In [5]:
ds = load_dataset("mlabonne/orpo-dpo-mix-40k", split="train").train_test_split(test_size=0.01)
ds_train = ds['train']
ds_test = ds['test']

# Build the "prompt", "chosen", and "rejected" strings expected by DPO
def process(row):
    # The first message of the conversation (the user turn) becomes the prompt
    prompt_messages = tokenizer.apply_chat_template([row["chosen"][0]], tokenize=False)
    # The remaining turns are the chosen/rejected responses; append the EOS token to each.
    # Note: the chat template also prepends its default system message to these strings.
    chosen_messages = tokenizer.apply_chat_template(row["chosen"][1:], tokenize=False) + tokenizer.eos_token
    rejected_messages = tokenizer.apply_chat_template(row["rejected"][1:], tokenize=False) + tokenizer.eos_token
    row["prompt"] = prompt_messages
    row["chosen"] = chosen_messages
    row["rejected"] = rejected_messages
    return row



In [6]:
ds['train'][0]

{'source': 'prm_dpo_pairs',
 'chosen': [{'content': "If $10^{51} - 9$ is written as an integer in standard form, what is the sum of the integer's digits?",
   'role': 'user'},
  {'content': 'I need to find the sum of the digits of a very large number, so I want to see if there is a pattern or a shortcut to do that.\nI notice that $10^{51} - 9$ is one less than a power of 10, so it has a lot of 9s in its decimal representation.\nIn fact, it has 50 9s, followed by a 1 at the end.\nFor example, $10^5 - 9 = 99991$.\nSo, the sum of the digits of $10^{51} - 9$ is just the sum of 50 9s and a 1.\nThat is, $50 \\times 9 + 1 = 451$.\n# Answer\n\n451',
   'role': 'assistant'}],
 'rejected': [{'content': "If $10^{51} - 9$ is written as an integer in standard form, what is the sum of the integer's digits?",
   'role': 'user'},
  {'content': 'Since $10^{51}$ is the least integer with $52$ digits, $10^{51}-9$ has 51 digits.  The ones digit is 1 and all the other digits are 9.  The sum of the digits i

In [7]:
ds_train = ds_train.map(
    process,
    num_proc=multiprocessing.cpu_count(),
    load_from_cache_file=False,
)

ds_test = ds_test.map(
    process,
    num_proc=multiprocessing.cpu_count(),
    load_from_cache_file=False,
)

Map (num_proc=12):   0%|          | 0/43802 [00:00<?, ? examples/s]

Map (num_proc=12):   0%|          | 0/443 [00:00<?, ? examples/s]

In [8]:
ds_train[0]

{'source': 'prm_dpo_pairs',
 'chosen': '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>assistant\nI need to find the sum of the digits of a very large number, so I want to see if there is a pattern or a shortcut to do that.\nI notice that $10^{51} - 9$ is one less than a power of 10, so it has a lot of 9s in its decimal representation.\nIn fact, it has 50 9s, followed by a 1 at the end.\nFor example, $10^5 - 9 = 99991$.\nSo, the sum of the digits of $10^{51} - 9$ is just the sum of 50 9s and a 1.\nThat is, $50 \\times 9 + 1 = 451$.\n# Answer\n\n451<|im_end|>\n<|endoftext|>',
 'rejected': '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>assistant\nSince $10^{51}$ is the least integer with $52$ digits, $10^{51}-9$ has 51 digits.  The ones digit is 1 and all the other digits are 9.  The sum of the digits is $9\\cdot 50 + 1=\\boxed{451}$.<|im_end|>\n<|endoftext|>',
 'prompt': "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_

In [9]:
len(ds_train)

43802

In [None]:
# Load the base model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map={"": 0},
    torch_dtype=compute_dtype,
    attn_implementation=attn_implementation,
)

model.gradient_checkpointing_enable(gradient_checkpointing_kwargs={"use_reentrant": True})

The following cell displays some warnings that you can safely ignore.

In [None]:
# Load the policy adapter (this one is trainable)
model = PeftModel.from_pretrained(model, sft_adapter, is_trainable=True, adapter_name="DPO")
# Load a second, frozen copy of the SFT adapter as the reference
# (its name must match ref_adapter_name in DPOConfig)
model.load_adapter(sft_adapter, adapter_name="reference")

<All keys matched successfully>

In [12]:
training_arguments = DPOConfig(
        output_dir=output_dir,
        eval_strategy="steps",
        do_eval=True,
        optim="paged_adamw_8bit",
        per_device_train_batch_size=bs,
        gradient_accumulation_steps=gas,
        per_device_eval_batch_size=bs,
        log_level="debug",
        save_strategy="steps",
        save_steps=200,
        logging_steps=25,
        learning_rate=lr,
        bf16=True,
        beta=0.1,
        eval_steps=25,
        num_train_epochs=0.1,
        warmup_ratio=0.1,
        lr_scheduler_type="linear",
        max_length=mseqlen,
        max_prompt_length=mseqlen,
        model_adapter_name="DPO",
        ref_adapter_name="reference",
        dataset_num_proc=multiprocessing.cpu_count(),
)
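
With these settings, the number of optimizer steps can be estimated by hand. The sketch below is an approximation of how the Trainer derives its step counts, not a copy of its code; exact rounding may differ between transformers versions. With the 43,802 training examples of this notebook, it reproduces the `Total optimization steps = 137` line from the training log.

```python
import math

num_examples = 43802      # len(ds_train)
bs = 4                    # per-device batch size
gas = 8                   # gradient accumulation steps
num_train_epochs = 0.1
warmup_ratio = 0.1

effective_batch_size = bs * gas                                 # single GPU
batches_per_epoch = math.ceil(num_examples / bs)                # dataloader length
updates_per_epoch = max(batches_per_epoch // gas, 1)            # optimizer steps per epoch
total_steps = math.ceil(num_train_epochs * updates_per_epoch)   # steps for 0.1 epoch
warmup_steps = math.ceil(total_steps * warmup_ratio)            # linear warmup length

print(effective_batch_size, updates_per_epoch, total_steps, warmup_steps)
```

With `eval_steps=25`, such a run is evaluated only five times, so a larger `num_train_epochs` is worth considering for a real training run.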

In [13]:
trainer = DPOTrainer(
    model,
    args=training_arguments,
    train_dataset=ds_train,
    eval_dataset=ds_test,
    processing_class=tokenizer,
)

Extracting prompt in train dataset (num_proc=12):   0%|          | 0/43802 [00:00<?, ? examples/s]

Applying chat template to train dataset (num_proc=12):   0%|          | 0/43802 [00:00<?, ? examples/s]

Tokenizing train dataset (num_proc=12):   0%|          | 0/43802 [00:00<?, ? examples/s]

Extracting prompt in eval dataset (num_proc=12):   0%|          | 0/443 [00:00<?, ? examples/s]

Applying chat template to eval dataset (num_proc=12):   0%|          | 0/443 [00:00<?, ? examples/s]

Tokenizing eval dataset (num_proc=12):   0%|          | 0/443 [00:00<?, ? examples/s]

Using auto half precision backend
No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


The most important metrics in the training logs are Rewards/accuracies and Rewards/margins, both of which should increase over training. Note that DPO is very sensitive to the learning rate. I recommend trying different values from 5e-7 to 5e-5.
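
To see where these metrics come from, here is a minimal plain-Python sketch. It assumes the standard sigmoid DPO loss, which is what applies when `loss_type` is left unset in `DPOConfig`. The function `dpo_metrics` and the log-probability values are made up for illustration and are not TRL code. Each reward is `beta` times the gap between the policy and reference log-probabilities of a response. The margin is the chosen reward minus the rejected reward. The accuracy is the fraction of pairs with a positive margin.

```python
import math

def dpo_metrics(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Toy per-batch DPO statistics from sequence log-probabilities (lists of floats)."""
    rewards_chosen = [beta * (p - r) for p, r in zip(policy_chosen, ref_chosen)]
    rewards_rejected = [beta * (p - r) for p, r in zip(policy_rejected, ref_rejected)]
    margins = [c - j for c, j in zip(rewards_chosen, rewards_rejected)]
    # Sigmoid DPO loss: -log(sigmoid(margin)) == log(1 + exp(-margin))
    losses = [math.log1p(math.exp(-m)) for m in margins]
    n = len(margins)
    return {
        "loss": sum(losses) / n,
        "rewards/margins": sum(margins) / n,
        "rewards/accuracies": sum(m > 0 for m in margins) / n,
    }

# Hypothetical log-probabilities for a batch of two pairs.
ref_c, ref_r = [-10.0, -12.0], [-11.0, -9.0]

# At step 0 the policy equals the reference: zero margins, loss = ln(2).
start = dpo_metrics(ref_c, ref_r, ref_c, ref_r)

# Later the policy raises chosen log-probs by 1 and lowers rejected ones by 1.
later = dpo_metrics([-9.0, -11.0], [-12.0, -10.0], ref_c, ref_r)

print(start)
print(later)
```

This is why the loss starts near ln 2 ≈ 0.693 when the policy and reference adapters are initialized from the same SFT adapter; the training loss of 0.6841 logged at step 25 is just below that value.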


In [None]:
trainer_ = trainer.train()

Currently training with a batch size of: 4
The following columns in the Training set don't have a corresponding argument in `PeftModelForCausalLM.forward` and have been ignored: source, question. If source, question are not expected by `PeftModelForCausalLM.forward`,  you can safely ignore this message.
skipped Embedding(151936, 1536): 222.5625M params
bitsandbytes: will optimize Embedding(151936, 1536) in fp32
skipped Embedding(151936, 1536): 445.125M params
bitsandbytes: will optimize Embedding(151936, 1536) in fp32
skipped Embedding(151936, 1536): 667.6875M params
bitsandbytes: will optimize Embedding(151936, 1536) in fp32
skipped: 667.6875M params
***** Running training *****
  Num examples = 43,802
  Num Epochs = 1
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 8
  Total optimization steps = 137
  Number of trainable parameters = 485,212,160
Automatic Weights & Biases logging enabled

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.


| Step | Training Loss | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/chosen | Logps/rejected | Logits/chosen | Logits/rejected |
|---|---|---|---|---|---|---|---|---|---|---|
| 25 | 0.6841 | 0.666503 | 0.079556 | 0.020827 | 0.612613 | 0.058729 | -460.918457 | -371.939484 | -0.634721 | -0.631942 |


The following columns in the Evaluation set don't have a corresponding argument in `PeftModelForCausalLM.forward` and have been ignored: source, question. If source, question are not expected by `PeftModelForCausalLM.forward`,  you can safely ignore this message.

***** Running Evaluation *****
  Num examples = 443
  Batch size = 4
