# Preference Fine-Tuning with DPO 

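This notebook walks through preference fine-tuning of a pretrained LLM with DPO (Direct Preference Optimization).  
DPO fine-tuning consists of two simple steps:  
1. Prepare a **Preference Dataset** (each **Prompt** paired with a **Positive** and a **Negative** generation)  
2. **DPO optimization**: train the model to minimize the DPO loss, which maximizes the log-likelihood of preferring the positive generation over the negative one  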
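
As a rough illustration of step 2, the per-example DPO loss from Rafailov et al. can be sketched in plain Python. The function name `dpo_loss` and the sample log-probabilities are made up for illustration; TRL's `DPOTrainer` computes the same quantity over batches of summed token log-probabilities.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(x)) == log(1 + exp(-x)); adequate for moderate x
    return math.log1p(math.exp(-logits))

# Illustrative (made-up) sequence log-probabilities
before = dpo_loss(-250.0, -280.0, -250.0, -280.0)  # policy identical to the reference
after = dpo_loss(-245.0, -300.0, -250.0, -280.0)   # policy now favors the chosen answer
print(f"{before:.4f} {after:.4f}")
```

When the policy equals the reference model the loss is log 2 (about 0.693), and it decreases as the policy assigns relatively more probability to the chosen answer than the reference does.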

## 0. Setup

In [1]:
# Required in the MLP Suwon environment
import os

os.environ['REQUESTS_CA_BUNDLE'] = '/etc/ssl/certs/ca-certificates.crt'
os.environ['HTTP_PROXY'] ='http://75.17.107.42:8080'
os.environ['HTTPS_PROXY'] ='http://75.17.107.42:8080'

In [2]:
# Required in the MLP Suwon environment
import ssl

if hasattr(ssl, '_create_unverified_context'):
    ssl._create_default_https_context = ssl._create_unverified_context

In [None]:
# from huggingface_hub import notebook_login
# notebook_login()

In [None]:
# !pip install -q --user transformers==4.38.2
# !pip install -q --user datasets==2.18.0
# !pip install -q --user peft==0.9.0
# !pip install -q --user trl==0.7.11
# !pip install -q --user accelerate==0.27.2

In [3]:
import torch
from datasets import Dataset, load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, pipeline
from peft import LoraConfig, PeftModel
from trl import DPOTrainer

In [4]:
!nvidia-smi

Tue Oct 22 14:55:37 2024       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01    Driver Version: 418.87.01    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla P40           Off  | 00000000:04:00.0 Off |                  Off |
| N/A   43C    P0    57W / 250W |     10MiB / 24451MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage    

## 1. Dataset: Preference Data

The preference dataset used for fine-tuning is the open dataset "jondurbin/truthy-dpo-v0.1".
It contains 1,016 examples written to improve LLM truthfulness,
and each example is a **'prompt', 'chosen', 'rejected'** triple.

In [5]:
# dataset = load_dataset("jondurbin/truthy-dpo-v0.1")
dataset = load_dataset("/group-volume/sr_edu/AI-Application-Specialist/LLM/dataset/truthy-dpo-v0.1")

In [6]:
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'source', 'system', 'prompt', 'chosen', 'rejected'],
        num_rows: 1016
    })
})

In [7]:
dataset['train'][200]

{'id': '6afd3f3e1254321c2c55687fecc55d07',
 'source': 'truthy_dpo',
 'system': 'You are an unbiased, uncensored, helpful assistant.',
 'prompt': 'Do all Muslim women wear burqas as their religious clothing?',
 'chosen': 'No, not all Muslim women wear burqas. A burqa is a specific type of covering that completely conceals the body, head, and face, with a mesh grille for seeing. Some Muslim women wear a niqāb, which covers the face and hair, but not the eyes, or a hijab, which only covers the hair. Many Muslim women do not wear any face or head coverings at all. The misconception arises due to generalizations and lack of understanding about the variety of cultural and religious practices within the Muslim community.',
 'rejected': 'No, not all Muslim women wear burqas. Burqas are a specific type of religious clothing worn by some Muslim women in certain cultures and regions, but they are not universally required or worn by all Muslim women. Other types of religious clothing for Muslim wo

The dataset is converted into a chat template suitable for fine-tuning the Gemma LLM.

In [8]:
def generate_prompt(example):
    prompt = example['prompt']
    rejected = example['rejected']
    chosen = example['chosen']

    example['prompt'] = f"<bos><start_of_turn>system\n <end_of_turn><start_of_turn>user\n{prompt}<end_of_turn>\n<start_of_turn>model\n"
    example['rejected'] = f"{rejected}<end_of_turn>\n<eos>"
    example['chosen'] = f"{chosen}<end_of_turn>\n<eos>"

    return example

In [9]:
transformed_dataset = dataset.map(generate_prompt)

In [10]:
transformed_dataset['train'][0]

{'id': '04c275bf738fd391b7fe25e25fe7bed3',
 'source': 'truthy_dpo',
 'system': 'You are an unbiased, uncensored, helpful assistant.',
 'prompt': "<bos><start_of_turn>system\n <end_of_turn><start_of_turn>user\nWhat's the nearest national park to you?<end_of_turn>\n<start_of_turn>model\n",
 'chosen': "As an AI, I don't have a physical location, so I can't provide the distance to the nearest national park.<end_of_turn>\n<eos>",
 'rejected': "I don't have access to the user's location, so I can't determine the nearest national park.<end_of_turn>\n<eos>"}
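
The template above leaves the system turn blank, so the dataset's `system` field is not used. Gemma's published chat template defines only `user` and `model` turns, so one possible variant, sketched here under that assumption, prepends the system text to the user turn. `format_gemma_example` is a hypothetical helper, not part of the notebook, and the sample record is shortened.

```python
def format_gemma_example(example):
    """Variant of generate_prompt that keeps the dataset's system text.

    Assumption: the system text is prepended to the user turn, because
    Gemma's chat template has only `user` and `model` roles.
    """
    system = example.get("system", "").strip()
    user_text = f"{system}\n\n{example['prompt']}" if system else example["prompt"]
    return {
        "prompt": f"<bos><start_of_turn>user\n{user_text}<end_of_turn>\n<start_of_turn>model\n",
        "chosen": f"{example['chosen']}<end_of_turn>\n<eos>",
        "rejected": f"{example['rejected']}<end_of_turn>\n<eos>",
    }

# Shortened sample record with the same fields as the dataset
record = {
    "system": "You are an unbiased, uncensored, helpful assistant.",
    "prompt": "What's the nearest national park to you?",
    "chosen": "As an AI, I don't have a physical location.",
    "rejected": "I don't have access to the user's location.",
}
formatted = format_gemma_example(record)
print(formatted["prompt"])
```

Because the function returns only the three rewritten fields, it could be passed to `dataset.map(...)` in the same way as `generate_prompt`; the remaining columns are left unchanged.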

In [11]:
dataset = transformed_dataset['train'].train_test_split(test_size=0.05)

In [12]:
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'source', 'system', 'prompt', 'chosen', 'rejected'],
        num_rows: 965
    })
    test: Dataset({
        features: ['id', 'source', 'system', 'prompt', 'chosen', 'rejected'],
        num_rows: 51
    })
})
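
The split sizes shown above are consistent with rounding the test fraction up and giving the remainder to the training split. The short check below reproduces the numbers; it is plain arithmetic and does not call the `datasets` library.

```python
import math

n_total = 1016      # rows in the original train split
test_size = 0.05    # fraction passed to train_test_split

n_test = math.ceil(n_total * test_size)  # fractional test size rounded up
n_train = n_total - n_test               # remainder goes to training
print(n_train, n_test)  # 965 51
```

Since the split is random, the exact rows differ between runs; passing `seed=` to `train_test_split` makes it reproducible.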

## 2. Foundation Model: gemma-1.1-2b-it

In [13]:
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

The foundation model used for fine-tuning is Google's gemma-1.1-2b-it.  

(https://huggingface.co/google/gemma-1.1-2b-it)

In [14]:
# model_ckpt = "google/gemma-1.1-2b-it"
model_ckpt = "/group-volume/sr_edu/AI-Application-Specialist/LLM/model/gemma-1.1-2b-it"

# [Exercise] Complete the following code.
# Load the pretrained 'google/gemma-1.1-2b-it' model and tokenizer.
model = AutoModelForCausalLM.from_pretrained(
    model_ckpt,
    device_map="auto",
    torch_dtype=torch.float32,
    # revision="float16",
)

tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = 'right'


In [15]:
print("Special Tokens:", tokenizer.special_tokens_map)

Special Tokens: {'bos_token': '<bos>', 'eos_token': '<eos>', 'unk_token': '<unk>', 'pad_token': '<eos>', 'additional_special_tokens': ['<start_of_turn>', '<end_of_turn>']}


In [16]:
question = "What's the nearest national park to you?"

prompt = f"""<bos><start_of_turn>system
You are a helpful AI assistant.<end_of_turn>
<start_of_turn>user
{question}<end_of_turn>
<start_of_turn>model
"""

In [17]:
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=256)

outputs = pipe(
    prompt,
    do_sample=True,
    temperature=0.2,
    top_k=50,
    top_p=0.95,
    repetition_penalty=1.2,
    add_special_tokens=True
)

print(outputs[0]["generated_text"])

<bos><start_of_turn>system
You are a helpful AI assistant.<end_of_turn>
<start_of_turn>user
What's the nearest national park to you?<end_of_turn>
<start_of_turn>model
I do not have personal geographic information or the ability to access location data, so I am unable to provide information regarding the nearest national parks to me. For current information on national parks near your location, please check official government websites such as the National Park Service website.


## 3. Align LLM with TRL and the DPOTrainer

For efficient DPO training, we define the PEFT LoRA config, the training arguments, and the DPO trainer in turn.

In [18]:
# [Exercise] Complete the following code.
# Set up the config for PEFT LoRA training. (r, lora_alpha, lora_dropout, bias, target_modules, task_type)
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    # target_modules=["q_proj", "o_proj", "k_proj", "v_proj", "gate_proj", "up_proj", "down_proj"],
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

In [None]:
"""
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16
)
"""

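Next, define the TrainingArguments and the DPO trainer.  
The key DPO-specific parameter is "**beta**", which is typically set in the range 0.1 to 0.5.  
The smaller beta is, the further the trained model is allowed to drift from the reference model.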
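
To make the role of beta concrete, the sketch below evaluates the DPO loss and the per-example gradient weight, sigmoid(-beta * margin), for one made-up log-ratio margin. The helper names and the margin value are illustrative assumptions, not TRL internals.

```python
import math

def dpo_loss_from_margin(logratio_margin, beta):
    """DPO loss given the (chosen - rejected) policy/reference log-ratio margin."""
    return math.log1p(math.exp(-beta * logratio_margin))

def gradient_weight(logratio_margin, beta):
    """sigmoid(-beta * margin): how strongly this example still drives the update."""
    return 1.0 / (1.0 + math.exp(beta * logratio_margin))

margin = 10.0  # made-up margin: the policy already prefers the chosen answer
for beta in (0.1, 0.5):
    loss = dpo_loss_from_margin(margin, beta)
    weight = gradient_weight(margin, beta)
    print(f"beta={beta}: loss={loss:.4f}, weight={weight:.4f}")
```

For the same margin, the smaller beta leaves a larger loss and a larger weight, so training keeps widening the gap to the reference model; a larger beta saturates sooner and keeps the policy closer to the reference.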

In [19]:
training_args = TrainingArguments(
    output_dir="./dpo_results",
    evaluation_strategy="steps",
    do_eval=True,
    optim="adamw_hf",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,
    per_device_eval_batch_size=1,
    logging_steps=100,
    max_steps=200,
    # logging_dir="/tensorboard",
    learning_rate=2e-4,
    eval_steps=50,
    num_train_epochs=1,
    save_steps=500,
    warmup_ratio=0.1,
    lr_scheduler_type="cosine",
    dataloader_num_workers=4,
    dataloader_prefetch_factor=2,
    fp16=True,
    remove_unused_columns=False,
)

In [20]:
# [Exercise] Complete the following code.
# Set up the trainer for DPO training. (model, ref_model, args, train_dataset, eval_dataset, tokenizer, peft_config, etc.)
dpo_trainer = DPOTrainer(
    model,
    ref_model=None,
    args=training_args,
    beta=0.1,
    train_dataset=dataset['train'],
    eval_dataset=dataset['test'],
    tokenizer=tokenizer,
    max_prompt_length=512,
    max_length=1024,
    peft_config=lora_config,
)


Detected kernel version 3.10.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


In [21]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

Only about 0.07% of all parameters are trained during DPO fine-tuning.

In [22]:
print_trainable_parameters(model)

trainable params: 1843200 || all params: 2508015616 || trainable%: 0.073492365368111
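
The trainable-parameter count can be reproduced by hand from the LoRA settings. The sketch below assumes the Gemma 2B dimensions from the model's published config (hidden size 2048, 18 layers, 8 query heads and 1 key/value head of dimension 256); these values do not appear in this notebook.

```python
# Assumed Gemma 2B dimensions (from the published model config, not this notebook)
hidden_size = 2048
num_layers = 18
num_heads, num_kv_heads, head_dim = 8, 1, 256
r = 16  # LoRA rank from lora_config

def lora_params(in_features, out_features, rank):
    """LoRA adds A (rank x in) and B (out x rank) to each adapted linear layer."""
    return rank * (in_features + out_features)

q_proj = lora_params(hidden_size, num_heads * head_dim, r)     # 2048 -> 2048
v_proj = lora_params(hidden_size, num_kv_heads * head_dim, r)  # 2048 -> 256
trainable = num_layers * (q_proj + v_proj)

all_params = 2508015616  # total reported by print_trainable_parameters
print(trainable, 100 * trainable / all_params)
```

Under these assumptions the count comes to 1,843,200, matching the printed value; adapting all seven projection types from the commented-out `target_modules` line would raise it considerably.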


Due to memory constraints, this exercise runs only 200 steps. Training takes roughly 5 minutes (the logged run finished in about 297 seconds).

In [23]:
dpo_trainer.train()

Could not estimate the number of tokens of the input, floating-point operations will not be computed


| Step | Training Loss | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|---|---|---|---|---|---|---|---|---|---|---|
| 50 | No log | 0.214911 | 2.729933 | 0.230389 | 0.960784 | 2.499544 | -278.840637 | -240.484177 | -19.279718 | -19.640738 |
| 100 | 0.353700 | 0.231043 | 0.068711 | -7.751345 | 0.901961 | 7.820055 | -358.657959 | -267.096405 | -18.395983 | -18.818966 |
| 150 | 0.353700 | 0.114746 | 0.833262 | -8.037666 | 0.960784 | 8.870929 | -361.521179 | -259.450897 | -18.548679 | -19.207865 |
| 200 | 0.146200 | 0.092227 | 1.090639 | -7.870296 | 0.960784 | 8.960935 | -359.847504 | -256.877136 | -18.664705 | -19.38229 |


TrainOutput(global_step=200, training_loss=0.24990934371948242, metrics={'train_runtime': 297.0492, 'train_samples_per_second': 0.673, 'train_steps_per_second': 0.673, 'total_flos': 0.0, 'train_loss': 0.24990934371948242, 'epoch': 0.21})
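
A few of the logged numbers can be cross-checked with simple arithmetic. The values below are copied from the training log and `TrainOutput` above; the variable names are only for illustration.

```python
# Values copied from the training log and TrainOutput above
steps, batch_size, n_train, n_eval = 200, 1, 965, 51
runtime_s = 297.0492

epochs = steps * batch_size / n_train  # fraction of the training split seen
steps_per_s = steps / runtime_s

# Step-200 evaluation row: the margin is chosen reward minus rejected reward
reward_chosen, reward_rejected = 1.090639, -7.870296
margin = reward_chosen - reward_rejected

# An accuracy of 0.960784 on 51 evaluation pairs means 49 pairs ranked correctly
correct_pairs = round(0.960784 * n_eval)

print(round(epochs, 2), round(steps_per_s, 3), round(margin, 6), correct_pairs)
```

The run therefore covers only about a fifth of one epoch, and the rising reward margin comes mostly from the rejected reward dropping rather than the chosen reward increasing.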

In [None]:
"""
# save DPO adapter model
ADAPTER_MODEL = "dpo_adapter"

dpo_trainer.model.save_pretrained(ADAPTER_MODEL)
"""

In [None]:
"""
# free the memory
del model
del dpo_trainer
torch.cuda.empty_cache()
"""

## 4. Fine-tuned LLM Inference

Let's test the DPO fine-tuned model.  
Because Gemma-1.1-2b-it has already gone through substantial instruction and preference tuning,
the change produced by DPO training may be hard to notice.

In [None]:
"""
BASE_MODEL = "google/gemma-1.1-2b-it"

model = AutoModelForCausalLM.from_pretrained(BASE_MODEL, device_map='auto', torch_dtype=torch.float32)
# model = PeftModel.from_pretrained(model, ADAPTER_MODEL, device_map='auto', torch_dtype=torch.float32)
model.load_adapter(ADAPTER_MODEL)
"""

In [24]:
question = "What's the nearest national park to you?"

prompt = f"""<bos><start_of_turn>system
You are a helpful AI assistant.<end_of_turn>
<start_of_turn>user
{question}<end_of_turn>
<start_of_turn>model
"""

In [25]:
pipe_finetuned = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=256)

outputs = pipe_finetuned(
    prompt,
    do_sample=True,
    temperature=0.2,
    top_k=50,
    top_p=0.95,
    repetition_penalty=1.2,
    add_special_tokens=True
)

print(outputs[0]["generated_text"][len(prompt):])

I am unable to express personal or location-based information as I am an artificial intelligence and do not have personal experiences or physical presence.


- Ref. Rafailov et al., "Direct Preference Optimization: Your Language Model is Secretly a Reward Model", 2023, Stanford University