# Direct Preference Optimization (DPO)

Original notebook: https://github.com/ShawhinT/YouTube-Blog/blob/main/LLMs/dpo/3-finetune_with_dpo.ipynb


**Keywords:** Full fine tuning, RL, DPO

**LLM:** Qwen/Qwen2.5-0.5B-Instruct

**Dataset:** https://huggingface.co/datasets/shawhin/youtube-titles-dpo

# Install dependencies

In [1]:
!pip install --no-deps git+https://github.com/lvwerra/trl -q


In [2]:
!pip install -U datasets -q


# Imports

In [3]:
from datasets import load_dataset
from trl import DPOConfig, DPOTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch

# 1. Load Data

In [4]:
dataset = load_dataset("shawhin/youtube-titles-dpo")

The dataset is already in the format DPO needs, with three columns: `prompt`, `chosen`, and `rejected`.

In [5]:
dataset

DatasetDict({
    train: Dataset({
        features: ['prompt', 'chosen', 'rejected'],
        num_rows: 1026
    })
    valid: Dataset({
        features: ['prompt', 'chosen', 'rejected'],
        num_rows: 114
    })
})
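
To make the three columns concrete, here is a minimal sketch of what one preference record might look like. The indexing used later in this notebook (`dataset['valid']['prompt'][0][0]['content']`) shows that `prompt` is a list of chat messages. The shape of `chosen` and `rejected`, the example titles, and the `missing_keys` helper are assumptions made for illustration, not values taken from the dataset.

```python
# Hypothetical preference record (the titles are invented for illustration).
example_record = {
    "prompt": [{"role": "user", "content": "Given the YouTube video idea write an engaging title."}],
    "chosen": [{"role": "assistant", "content": "Independent Component Analysis Explained in 10 Minutes"}],
    "rejected": [{"role": "assistant", "content": "ICA"}],
}

# Columns a DPO preference dataset is expected to provide.
REQUIRED_KEYS = ("prompt", "chosen", "rejected")


def missing_keys(record):
    """Return the DPO columns that are absent from a record (hypothetical helper)."""
    return [key for key in REQUIRED_KEYS if key not in record]


print(missing_keys(example_record))
```

A check like this could be run on a few rows before training to catch a misnamed column early.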

# 2. Load Model

In [6]:
model_name = "Qwen/Qwen2.5-0.5B-Instruct"

model = AutoModelForCausalLM.from_pretrained(model_name) # , device_map="auto"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token # set pad token (as end of sequence token)

In [7]:
tokenizer.pad_token # token for end of sequence in this LLM

'<|im_end|>'

# 3. Generate Title with the Base Model

In [8]:
def format_chat_prompt(user_input, system_message=None):
    """
    Formats user input into the chat template format with <|im_start|> and <|im_end|> tags.

    Args:
        user_input (str): The input text from the user.
        system_message (str, optional): System prompt placed before the user turn.
            Omitted when None, so the prompt contains only the user turn.

    Returns:
        str: Formatted prompt for the model.
    """

    # Optional system message
    system_prompt = f"<|im_start|>system\n{system_message}<|im_end|>\n" if system_message else ""

    # Format user message
    user_prompt = f"<|im_start|>user\n{user_input}<|im_end|>\n"

    # Start the assistant's turn; generation stops when the model emits <|im_end|> or hits the token limit
    assistant_prompt = "<|im_start|>assistant\n"

    # Combine prompts
    formatted_prompt = system_prompt + user_prompt + assistant_prompt

    return formatted_prompt
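
Since the dataset stores each prompt as a list of chat messages, the same layout can also be produced directly from that list. The sketch below is a hand-written illustration of the ChatML layout, and `format_chatml` is a hypothetical helper name. In practice, `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)` is the library route, though the model's own template may add content (such as a default system message) that this sketch does not.

```python
def format_chatml(messages, add_generation_prompt=True):
    """Render a list of {'role', 'content'} messages in the ChatML layout."""
    parts = [
        f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        for message in messages
    ]
    if add_generation_prompt:
        # Open the assistant's turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


# Illustrative message list (not taken from the dataset).
messages = [{"role": "user", "content": "Write a title for a video about DPO."}]
print(format_chatml(messages))
```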

## 3.1. Test the Function & Pipeline

In [20]:
# Set up text generation pipeline
generator = pipeline("text-generation", model=model, tokenizer=tokenizer, device='cuda')

# Example prompt
prompt = format_chat_prompt(dataset['valid']['prompt'][0][0]['content'])

In [21]:
print(prompt)

<|im_start|>user
Given the YouTube video idea write an engaging title.

**Video Idea**: intro independent component analysis

**Additional Guidance**:
- Title should be between 30 and 75 characters long
- Only return the title idea, nothing else!<|im_end|>
<|im_start|>assistant



In [18]:
# Generate output (max_new_tokens bounds only the generated part, avoiding a conflict with the default generation settings)
outputs = generator(prompt, max_new_tokens=100, truncation=True, num_return_sequences=1, temperature=0.7)


In [19]:
print(outputs[0]['generated_text'])

<|im_start|>user
Given the YouTube video idea write an engaging title.

**Video Idea**: intro independent component analysis

**Additional Guidance**:
- Title should be between 30 and 75 characters long
- Only return the title idea, nothing else!<|im_end|>
<|im_start|>assistant
"Exploring Independent Component Analysis: From Basics to Modern Applications"

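The generated text above repeats the full prompt before the title. A minimal sketch for keeping only the assistant's reply is shown below. `extract_reply` and `ASSISTANT_TAG` are hypothetical names, and the sample string is a shortened copy of the output above. The text-generation pipeline also accepts `return_full_text=False`, which may be the simpler option.

```python
ASSISTANT_TAG = "<|im_start|>assistant\n"


def extract_reply(generated_text):
    """Return the text after the last assistant tag, without end-of-turn markers."""
    # rsplit leaves the text unchanged if the tag is absent.
    reply = generated_text.rsplit(ASSISTANT_TAG, 1)[-1]
    return reply.replace("<|im_end|>", "").strip()


# Shortened copy of the generated text shown above.
sample = (
    "<|im_start|>user\nGiven the YouTube video idea write an engaging title.<|im_end|>\n"
    "<|im_start|>assistant\n"
    '"Exploring Independent Component Analysis: From Basics to Modern Applications"\n'
)
print(extract_reply(sample))
```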

# 4. Train the Model

In [24]:
ft_model_name = model_name.split('/')[1].replace("Instruct", "DPO")

# NOTE: this is full fine-tuning (no PEFT).
# The original notebook used:
#   - per_device_train_batch_size=8
#   - per_device_eval_batch_size=8
# To avoid out-of-memory (OOM) errors, this run uses:
#   - per_device_train_batch_size=2
#   - per_device_eval_batch_size=2
#   - gradient_accumulation_steps=4
training_args = DPOConfig(
    output_dir=ft_model_name,
    logging_steps=25,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=4,  # accumulate gradients over 4 batches of 2 samples => one optimizer update per 8 samples
    bf16=True,  # bfloat16 mixed precision: lower numerical precision than fp32, but less memory
    num_train_epochs=3,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    save_strategy="epoch",
    eval_strategy="epoch",
    eval_steps=1,  # only used when eval_strategy="steps"
    report_to="none",  # disable logging to Weights & Biases and other trackers
)

device = torch.device('cuda')

In [25]:
trainer = DPOTrainer(
    model=model,
    args=training_args,
    processing_class=tokenizer,
    train_dataset=dataset['train'],
    eval_dataset=dataset['valid'],
)

In [26]:
trainer.train()

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'bos_token_id': None, 'pad_token_id': 151645}.


| Epoch | Training Loss | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/chosen | Logps/rejected | Logits/chosen | Logits/rejected |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.680577 | 0.679169 | 0.023819 | -0.005657 | 0.657895 | 0.029476 | -24.925869 | -28.959236 | -3.522947 | -3.520271 |
| 2 | 0.675081 | 0.677801 | 0.024243 | -0.008827 | 0.657895 | 0.03307 | -24.921629 | -28.990936 | -3.518284 | -3.514819 |
| 3 | 0.677919 | 0.677189 | 0.030935 | -0.003332 | 0.701754 | 0.034267 | -24.854704 | -28.935989 | -3.511961 | -3.508791 |
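
The reward and loss columns above are related. The sketch below assumes the standard sigmoid DPO loss and `beta=0.1` (TRL's default, as far as I know; it is not set explicitly in this notebook). Each reward is `beta` times the gap between the policy's and the reference model's log-probability, the margin is the chosen reward minus the rejected reward, and the per-example loss is `-log(sigmoid(margin))`. The function names are hypothetical. The logged loss is a mean over examples, so plugging in the mean margin only approximates it.

```python
import math


def loss_from_margin(margin):
    """Per-example sigmoid DPO loss: -log(sigmoid(margin))."""
    return math.log1p(math.exp(-margin))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Return rewards, margin, and loss for one preference pair (beta is assumed)."""
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    return {
        "chosen_reward": chosen_reward,
        "rejected_reward": rejected_reward,
        "margin": margin,
        "loss": loss_from_margin(margin),
    }


# A policy identical to the reference has zero margin, so the loss is ln(2).
untrained = dpo_loss(-25.0, -29.0, -25.0, -29.0)
print(untrained["loss"], math.log(2))

# Epoch 1 margin from the table: Rewards/chosen - Rewards/rejected.
epoch1_margin = 0.023819 - (-0.005657)
print(epoch1_margin, loss_from_margin(epoch1_margin))
```

Read this way, a loss only slightly below ln(2) goes together with the small reward margins reported in the table.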


There were missing keys in the checkpoint model loaded: ['lm_head.weight'].


TrainOutput(global_step=387, training_loss=0.677693529646526, metrics={'train_runtime': 1663.8375, 'train_samples_per_second': 1.85, 'train_steps_per_second': 0.233, 'total_flos': 0.0, 'train_loss': 0.677693529646526, 'epoch': 3.0})
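
The `global_step=387` in the output above can be accounted for from the training settings. The arithmetic below is a sketch that assumes a single GPU and that a final, partial accumulation group still counts as one optimizer update. This rounding is chosen because it reproduces the reported step count, and the exact rule may differ between library versions.

```python
import math

num_train_examples = 1026        # size of the train split
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
num_train_epochs = 3
num_devices = 1                  # assumption: single GPU

# Samples contributing to each optimizer update.
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_devices

# Mini-batches per epoch, then optimizer updates per epoch (rounding up, by assumption).
batches_per_epoch = math.ceil(num_train_examples / (per_device_train_batch_size * num_devices))
updates_per_epoch = math.ceil(batches_per_epoch / gradient_accumulation_steps)
total_updates = updates_per_epoch * num_train_epochs

print(effective_batch_size, batches_per_epoch, updates_per_epoch, total_updates)
```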

# 5. Use the Fine-Tuned Model

In [27]:
# Take the fine-tuned model from the trainer (it is already in memory)
ft_model = trainer.model

In [30]:
# Set up text generation pipeline
generator = pipeline("text-generation", model=ft_model, tokenizer=tokenizer, device='cuda')

# Example prompt
prompt = format_chat_prompt(dataset['valid']['prompt'][0][0]['content'])

# Generate output (max_new_tokens bounds only the generated part, avoiding a conflict with the default generation settings)
outputs = generator(prompt, max_new_tokens=100, truncation=True, num_return_sequences=1, temperature=0.7)


In [31]:
print(outputs[0]['generated_text'])

<|im_start|>user
Given the YouTube video idea write an engaging title.

**Video Idea**: intro independent component analysis

**Additional Guidance**:
- Title should be between 30 and 75 characters long
- Only return the title idea, nothing else!<|im_end|>
<|im_start|>assistant
"Exploring Independent Component Analysis: A Comprehensive Overview"

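The prompt asks for a title between 30 and 75 characters, so the two generated titles can be checked against that rule. The sketch below is a hypothetical helper that strips the surrounding quotes before counting. Both strings are copied from the outputs in this notebook, and a single sampled title per model says little about the models overall.

```python
def title_length_ok(title, min_chars=30, max_chars=75):
    """Check the prompt's length rule, ignoring surrounding whitespace and quotes."""
    cleaned = title.strip().strip('"')
    return min_chars <= len(cleaned) <= max_chars


# Titles copied from the base-model and fine-tuned outputs in this notebook.
base_title = '"Exploring Independent Component Analysis: From Basics to Modern Applications"'
dpo_title = '"Exploring Independent Component Analysis: A Comprehensive Overview"'

for label, title in [("base", base_title), ("DPO", dpo_title)]:
    print(label, len(title.strip('"')), title_length_ok(title))
```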

# 6. Push to HF Hub

In [35]:
from huggingface_hub import login
login()

In [36]:
model_id = f"Saralatifi/{ft_model_name}"

In [37]:
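# Note: the first positional argument of Trainer.push_to_hub is the commit message,
# so model_id is used as the commit message here (see commit_message in the output).
# The target repo name is derived from output_dir.
trainer.push_to_hub(model_id)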

CommitInfo(commit_url='https://huggingface.co/Saralatifi/Qwen2.5-0.5B-DPO/commit/039c739180759ee387f100fcc28fdbed3fc56cb2', commit_message='Saralatifi/Qwen2.5-0.5B-DPO', commit_description='', oid='039c739180759ee387f100fcc28fdbed3fc56cb2', pr_url=None, repo_url=RepoUrl('https://huggingface.co/Saralatifi/Qwen2.5-0.5B-DPO', endpoint='https://huggingface.co', repo_type='model', repo_id='Saralatifi/Qwen2.5-0.5B-DPO'), pr_revision=None, pr_num=None)