# Direct Preference Optimization (DPO)

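This notebook walks you through fine-tuning a language model with Direct Preference Optimization (DPO). We use the SmolLM2-135M-Instruct model, which has already been through supervised fine-tuning (SFT) and is therefore a suitable starting point for DPO. You can also use the model you trained in [1_instruction_tuning](../../1_instruction_tuning/notebooks/sft_finetuning_example.ipynb).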

<div style='background-color: lightblue; padding: 10px; border-radius: 5px; margin-bottom: 20px; color:black'>
     <h2 style='margin: 0;color:blue'>Exercise: Align SmolLM2 with DPOTrainer</h2>
     <p>Take a dataset from the Hugging Face Hub and align a model on it.</p> 
     <p><b>Difficulty levels</b></p>
     <p>🐢 Use the trl-lib/ultrafeedback_binarized dataset</p>
     <p>🐕 Try the argilla/ultrafeedback-binarized-preferences dataset</p>
     <p>🦁 Pick a dataset that interests you, or use the model you trained in 
        <a href="../../1_instruction_tuning/notebooks/sft_finetuning_example.ipynb">1_instruction_tuning</a></p>
</div>

In [None]:
# Install the requirements in Google Colab
# !pip install transformers datasets trl huggingface_hub

# Authenticate to Hugging Face

from huggingface_hub import login

login()

# for convenience you can create an environment variable containing your hub token as HF_TOKEN

In [1]:
# Import libraries
import os
import torch
import warnings
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from trl import DPOTrainer, DPOConfig
warnings.filterwarnings('ignore')

os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"



## Load the dataset
Here we load the data from local Parquet files.

In [2]:
# Load dataset
#dataset = load_dataset(path="trl-lib/ultrafeedback_binarized", split="train")
dataset = load_dataset("parquet", data_files={'train': '/dataset/ultrafeeback_binarized/train-00000-of-00001.parquet',
                                              'test': '/dataset/ultrafeeback_binarized/test-00000-of-00001.parquet'})

Dataset structure:

In [3]:
dataset

DatasetDict({
    train: Dataset({
        features: ['chosen', 'rejected', 'score_chosen', 'score_rejected'],
        num_rows: 62135
    })
    test: Dataset({
        features: ['chosen', 'rejected', 'score_chosen', 'score_rejected'],
        num_rows: 1000
    })
})

In [4]:
dataset['test']['chosen'][11]

[{'content': "Teacher: A text is given in Oriya. Translate it from the Oriya language to the Panjabi language. The translation must not omit or add information to the original sentence.\nTeacher: Now, understand the problem? If you are still confused, see the following example:\nଥାୱର ଚାନ୍ଦ ଗେହଲଟ୍ ଏବଂ ତାଙ୍କ ଟିମ୍ ର ଉଚ୍ଚପ୍ରଶଂସା କରିବା ସହିତ ଦିବ୍ୟାଙ୍ଗ ଭଉଣୀ ଓ ଭାଇଙ୍କ କଲ୍ୟାଣ ପାଇଁ ସେମାନ କରୁଥିବା କାର୍ଯ୍ୟ ଐତିହାସିକ ଏବଂ ପ୍ରଶଂସାଯୋଗ୍ୟ ବୋଲି ମତବ୍ୟକ୍ତ କରିଥିଲେ ।\nSolution: ਉਹ ਕੰਮ, ਜਿਸ ਨੂੰ ਉਹ ਟੌਮਬੀ ਦੇ ਚਿਹਰੇ ਲਈ ਲੋਕਾਂ ਨੇ ਦੋਸਤੀ ਅਤੇ ਸਾਵਧਾਨੀ ਨਾਲ ਆਪਣੀ ਟੀਮ ਦੀ ਮਜ਼ਬੂਤੀ ਨਾਲ ਸੇਵਾ ਕੀਤੀ ਹੈ.\nReason: Correct translation for given sentence. Input sentence means 'The work, which he, he people for the face of Tomzi has served in the friendship and praiselessly for the sake of his team.' which is the same as the output sentence.\n\nNow, solve this instance: ଆମେରିକା ପରରାଷ୍ଟ୍ର ସଚିବ ମାଇକେଲ ପୋମ୍ପିଓ ଆଜି ସକାଳେ ପ୍ରଧାନମନ୍ତ୍ରୀ ଶ୍ରୀ ନରେନ୍ଦ୍ର ମୋଦୀଙ୍କୁ ଭେଟିଛନ୍ତି ।\nStudent:",
  'role': 'user'},
 {'content': "ਅਮਰੀਕੀ ਵਿਦੇਸ਼ ਮੰਤਰੀ ਮਾਇਕਲ ਪੋ

In [5]:
dataset['test']['rejected'][11]

[{'content': "Teacher: A text is given in Oriya. Translate it from the Oriya language to the Panjabi language. The translation must not omit or add information to the original sentence.\nTeacher: Now, understand the problem? If you are still confused, see the following example:\nଥାୱର ଚାନ୍ଦ ଗେହଲଟ୍ ଏବଂ ତାଙ୍କ ଟିମ୍ ର ଉଚ୍ଚପ୍ରଶଂସା କରିବା ସହିତ ଦିବ୍ୟାଙ୍ଗ ଭଉଣୀ ଓ ଭାଇଙ୍କ କଲ୍ୟାଣ ପାଇଁ ସେମାନ କରୁଥିବା କାର୍ଯ୍ୟ ଐତିହାସିକ ଏବଂ ପ୍ରଶଂସାଯୋଗ୍ୟ ବୋଲି ମତବ୍ୟକ୍ତ କରିଥିଲେ ।\nSolution: ਉਹ ਕੰਮ, ਜਿਸ ਨੂੰ ਉਹ ਟੌਮਬੀ ਦੇ ਚਿਹਰੇ ਲਈ ਲੋਕਾਂ ਨੇ ਦੋਸਤੀ ਅਤੇ ਸਾਵਧਾਨੀ ਨਾਲ ਆਪਣੀ ਟੀਮ ਦੀ ਮਜ਼ਬੂਤੀ ਨਾਲ ਸੇਵਾ ਕੀਤੀ ਹੈ.\nReason: Correct translation for given sentence. Input sentence means 'The work, which he, he people for the face of Tomzi has served in the friendship and praiselessly for the sake of his team.' which is the same as the output sentence.\n\nNow, solve this instance: ଆମେରିକା ପରରାଷ୍ଟ୍ର ସଚିବ ମାଇକେଲ ପୋମ୍ପିଓ ଆଜି ସକାଳେ ପ୍ରଧାନମନ୍ତ୍ରୀ ଶ୍ରୀ ନରେନ୍ଦ୍ର ମୋଦୀଙ୍କୁ ଭେଟିଛନ୍ତି ।\nStudent:",
  'role': 'user'},
 {'content': "Solutions: \n\nଆରମିକା ପରାମଣ୍

## Prepare the DPO data
DPO data format:
```
{'chosen': 'Hello! I\'d be happy to help you find resources for learning a new language! 😊 ',
 'rejected': "Sure, I can recommend several resources for learning a new language.",
 'prompt': 'Can you suggest resources for learning a new language?'}
 ```
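As a rough illustration of how a conversational record maps onto this flat format, the sketch below works on a plain Python dict rather than a `datasets` object. The helper name `to_dpo_record` and the toy sample are made up for this example. It assumes each conversation is exactly one user turn followed by one assistant turn, which holds for the samples shown above but is not something the notebook verifies.

```python
def to_dpo_record(sample):
    """Flatten one conversational preference sample into prompt/chosen/rejected strings.

    Assumption: 'chosen' and 'rejected' are two-message conversations
    (user turn, then assistant turn) that share the same user turn.
    """
    chosen, rejected = sample["chosen"], sample["rejected"]
    for name, conv in (("chosen", chosen), ("rejected", rejected)):
        roles = [m["role"] for m in conv]
        if roles != ["user", "assistant"]:
            raise ValueError(f"{name}: expected ['user', 'assistant'], got {roles}")
    if chosen[0]["content"] != rejected[0]["content"]:
        raise ValueError("chosen and rejected do not share the same prompt")
    return {
        "prompt": chosen[0]["content"],
        "chosen": chosen[1]["content"],
        "rejected": rejected[1]["content"],
    }


# Toy sample (hypothetical, not taken from the dataset)
toy = {
    "chosen": [
        {"role": "user", "content": "Can you suggest resources for learning a new language?"},
        {"role": "assistant", "content": "Hello! I'd be happy to help."},
    ],
    "rejected": [
        {"role": "user", "content": "Can you suggest resources for learning a new language?"},
        {"role": "assistant", "content": "Sure."},
    ],
}
record = to_dpo_record(toy)
print(record["prompt"])
```

Raising on unexpected roles is a design choice for this sketch: a multi-turn or system-prefixed conversation would otherwise be flattened silently with the wrong message picked as the prompt.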

In [6]:
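def prepare_dataset(sample):
    # Each conversation is [user message, assistant message].
    # The user turn is shared by both conversations, so take the prompt from 'chosen'.
    sample['prompt'] = sample['chosen'][0]['content']
    # Keep only the assistant reply text as the preferred / rejected completion
    sample['chosen'] = sample['chosen'][1]['content']
    sample['rejected'] = sample['rejected'][1]['content']
    return sample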

In [7]:
train_ds = dataset['train'].map(prepare_dataset).remove_columns(['score_chosen', 'score_rejected'])

test_ds = dataset['test'].map(prepare_dataset).remove_columns(['score_chosen', 'score_rejected']).train_test_split(test_size=0.01)['test']

In [9]:
test_ds[1]

{'chosen': 'Hello! I\'d be happy to help you find resources for learning a new language! 😊 There are many effective ways to learn a new language, and there are numerous resources available online that can help you achieve your goals. Here are some suggestions:\n\n1. Duolingo: Duolingo is a popular language learning app that offers courses in over 30 languages, including Spanish, French, German, Italian, Chinese, Japanese, and many more. Duolingo is free, fun, and gamifies the learning process to keep you motivated. 💪\n2. Babbel: Babbel is another well-known language learning platform that offers courses in 14 languages. Babbel\'s courses are designed by expert linguists and focus on practical, conversational language skills. Babbel offers a free trial, and then it\'s $16/month for unlimited access to all courses. 📚\n3. Rosetta Stone: Rosetta Stone is a comprehensive language learning software that offers courses in 24 languages. Rosetta Stone uses immersive, interactive lessons to help

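## Choose the model

We use the SmolLM2-135M-Instruct model, which has already been through supervised fine-tuning (SFT) and is therefore a suitable starting point for DPO. You can also use the model you trained in [1_instruction_tuning](../../1_instruction_tuning/notebooks/sft_finetuning_example.ipynb).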

In [10]:
model_path = "/models/SmolLM2-135M-Instruct/"
model_name = "HuggingFaceTB/SmolLM2-135M-Instruct"

device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available() else "cpu"
)

# Model to fine-tune
model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=model_path,
    torch_dtype=torch.float32,
).to(device)
model.config.use_cache = False
tokenizer = AutoTokenizer.from_pretrained(model_path)
tokenizer.pad_token = tokenizer.eos_token

## Train the model with DPO

In [11]:
# Name under which the fine-tuned model is saved and/or uploaded
finetune_name = "SmolLM2-FT-DPO"
finetune_tags = ["smol-course", "module_1"]

In [12]:
# Training arguments
training_args = DPOConfig(
    # Training batch size per GPU
    per_device_train_batch_size=2,
    # Number of micro-batches whose gradients are accumulated before each optimizer step
    # Effective batch size = per_device_train_batch_size * gradient_accumulation_steps
    gradient_accumulation_steps=2,
    # Saves memory by not storing activations during forward pass
    # Instead recomputes them during backward pass
    gradient_checkpointing=True,
    # Base learning rate for training
    learning_rate=5e-5,
    # Learning rate schedule - 'cosine' gradually decreases LR following cosine curve
    lr_scheduler_type="cosine",
    # Total number of training steps
    max_steps=200,
    # Evaluate the model at regular intervals
    #eval_strategy="steps",          
    # Frequency of evaluation
    #eval_steps=10,                 
    # Disables model checkpointing during training
    save_strategy="no",
    # How often to log training metrics
    logging_steps=1,
    # Directory to save model outputs
    output_dir="smol_dpo_output",
    # Number of steps for learning rate warmup
    warmup_steps=100,
    # Use bfloat16 precision for faster training
    bf16=True,
    # Disable wandb/tensorboard logging
    report_to="none",
    # Keep all columns in dataset even if not used
    remove_unused_columns=False,
    # Enable MPS (Metal Performance Shaders) for Mac devices
    use_mps_device=device == "mps",
    # Model ID for HuggingFace Hub uploads
    hub_model_id=finetune_name,
    # DPO-specific parameter that controls how far the policy may drift from the reference model
    # Higher values keep the model closer to the reference; lower values (like 0.1) allow more deviation
    beta=0.1,
    # Maximum length of the input prompt in tokens
    max_prompt_length=1024,
    # Maximum combined length of prompt + response in tokens
    max_length=1536,
)
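To get a feel for what `lr_scheduler_type="cosine"` and `warmup_steps=100` do over a 200-step run, here is a minimal pure-Python sketch of a linear-warmup, cosine-decay schedule. It is meant to show the general shape of the schedule, not to reproduce the library's implementation exactly; the function name `lr_at` is made up for this example.

```python
import math


def lr_at(step, base_lr=5e-5, warmup_steps=100, max_steps=200):
    """Learning rate at a given optimizer step: linear warmup, then cosine decay to 0."""
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the decay phase that has elapsed, in [0, 1]
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


for s in (0, 10, 50, 100, 150, 200):
    print(s, f"{lr_at(s):.2e}")
```

With these settings the warmup takes up half of the run, so the peak learning rate is only reached at step 100, and during the first ten steps the rate stays at or below a tenth of its peak.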

In [13]:
trainer = DPOTrainer(
    # The model to be trained
    model=model,
    # Training configuration from above
    args=training_args,
    # Dataset containing preferred/rejected response pairs
    train_dataset=train_ds,
    #eval_dataset=test_ds,   # Need More Memory
    # Tokenizer for processing inputs
    processing_class=tokenizer,
    # beta, max_prompt_length and max_length are set in DPOConfig above
)

Detected kernel version 4.15.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
max_steps is given, it will override any value given in num_train_epochs
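The message about `max_steps` is expected here: the run is capped at 200 optimizer steps rather than a number of epochs. A quick back-of-the-envelope calculation shows how much of the training split that covers, assuming a single device (as set via `CUDA_VISIBLE_DEVICES` above) and the 62,135 training rows reported earlier.

```python
per_device_train_batch_size = 2
gradient_accumulation_steps = 2
num_devices = 1          # assumption: one visible GPU
max_steps = 200
train_rows = 62135       # from the DatasetDict printed earlier

# Preference pairs contributing to each optimizer step
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_devices
# Total preference pairs consumed over the whole run
pairs_seen = effective_batch_size * max_steps
fraction_of_epoch = pairs_seen / train_rows

print(effective_batch_size)            # 4
print(pairs_seen)                      # 800
print(f"{fraction_of_epoch:.2%}")      # 1.29%
```

In other words, this configuration is a short demonstration run that sees only a small slice of the data; raising `max_steps` or switching to `num_train_epochs` would be the way to train on more of it.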


In [14]:
# Train the model
trainer.train()

# Save the model
trainer.save_model(f"./{finetune_name}")

# Push to the Hugging Face Hub if logged in (HF_TOKEN is set)
if os.getenv("HF_TOKEN"):
    trainer.push_to_hub(tags=finetune_tags)

Could not estimate the number of tokens of the input, floating-point operations will not be computed


| Step | Training Loss |
|------|---------------|
| 1 | 0.6931 |
| 2 | 0.6931 |
| 3 | 0.6857 |
| 4 | 0.7012 |
| 5 | 0.6698 |
| 6 | 0.6887 |
| 7 | 0.6871 |
| 8 | 0.7233 |
| 9 | 0.7095 |
| 10 | 0.6851 |
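The first logged loss of 0.6931 is ln 2, which is what the standard DPO objective gives when the policy and the reference model assign identical log-probabilities, as they do before any update. The sketch below writes the per-pair sigmoid DPO loss in plain Python to make that visible. The function name and the log-probability values are illustrative, and this is not a copy of the `DPOTrainer` internals.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Sigmoid DPO loss for one preference pair, given sequence log-probabilities."""
    # How much more (or less) likely each response is under the policy than under the reference
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(x)) == log(1 + exp(-x))
    return math.log1p(math.exp(-logits))


# Policy identical to the reference: the loss is ln 2
print(round(dpo_loss(-50.0, -60.0, -50.0, -60.0), 4))  # 0.6931

# Policy has shifted probability toward the chosen response: the loss drops below ln 2
print(dpo_loss(-48.0, -63.0, -50.0, -60.0) < math.log(2))  # True
```

A loss that hovers around 0.69 over the first few steps therefore means the policy has barely moved away from the reference yet, not that something is broken.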


In [15]:
# Test the fine-tuned model on the same prompt
prompt = "Write a haiku about programming"

# Format with template
messages = [{"role": "user", "content": prompt}]
formatted_prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Generate response
inputs = tokenizer(formatted_prompt, return_tensors="pt").to(device)

# Use the fine-tuned model to generate a response, just like with the base example.
outputs = model.generate(**inputs, max_new_tokens=100)
print("After training:")
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.


After training:
system
You are a helpful AI assistant named SmolLM, trained by Hugging Face
user
Write a haiku about programming
assistant
In the depths of programming lies a hidden world, where words weave a tapestry of wonder,
A language that whispers secrets of the mind, a code that whispers secrets of the heart,
A language that shapes the world, a code that shapes the world, a language that whispers secrets of the heart,
A language that whispers secrets of the mind, a code that whispers secrets of the heart, a language that whispers secrets of the heart,

Hail the programming language
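The `system` / `user` / `assistant` lines in the output come from the chat template; the special tokens around them were dropped by `skip_special_tokens=True`. Assuming the tokenizer uses a ChatML-style template (which the decoded output suggests, though the notebook does not show the template itself), the formatted prompt looks roughly like what this hand-written sketch produces. The function `render_chatml` is hypothetical and only approximates `tokenizer.apply_chat_template`.

```python
DEFAULT_SYSTEM = "You are a helpful AI assistant named SmolLM, trained by Hugging Face"


def render_chatml(messages, add_generation_prompt=True):
    """Approximate a ChatML-style chat template (assumed, not read from the tokenizer)."""
    # Prepend the default system message when the conversation does not start with one
    if not messages or messages[0]["role"] != "system":
        messages = [{"role": "system", "content": DEFAULT_SYSTEM}] + list(messages)
    text = "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )
    if add_generation_prompt:
        # Open the assistant turn so the model continues with its answer
        text += "<|im_start|>assistant\n"
    return text


formatted = render_chatml([{"role": "user", "content": "Write a haiku about programming"}])
print(formatted)
```

Passing `add_generation_prompt=True` to `apply_chat_template` opens the assistant turn explicitly instead of leaving the model to produce it on its own.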


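## 💐 You're done!

This notebook provided a step-by-step guide to fine-tuning the HuggingFaceTB/SmolLM2-135M-Instruct model with DPOTrainer. By following these steps, you can align the model with preference data so that it handles specific tasks more effectively. If you want to keep working through this course, here are some things you can try:

- Attempt the exercise at a harder difficulty level
- Improve the course material by opening an issue or a pull request (PR)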