# [Direct Preference Optimization: Your Language Model is Secretly a Reward Model (DPO)](https://arxiv.org/pdf/2305.18290.pdf)

### Reference Code 
- https://huggingface.co/docs/trl/main/en/dpo_trainer
- https://github.com/huggingface/trl/blob/main/examples/scripts/dpo.py

DPO trains directly on preference pairs, so the trainer expects a specific dataset format. With the default DPODataCollatorWithPadding data collator, the final dataset object should contain these 3 entries.

The entries should be named:
- prompt
- chosen
- rejected
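
As a minimal sketch of this format, the helper below checks that a plain dictionary has the three expected columns with equal lengths before it is turned into a dataset. The name `check_preference_columns` and the sample values are illustrative assumptions, not part of TRL.

```python
from typing import Dict, List

REQUIRED_COLUMNS = ("prompt", "chosen", "rejected")


def check_preference_columns(data: Dict[str, List[str]]) -> int:
    """Return the number of preference pairs, or raise ValueError if the layout is wrong.

    Hypothetical helper for illustration; it is not part of TRL.
    """
    missing = [name for name in REQUIRED_COLUMNS if name not in data]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    # All three columns must describe the same set of examples.
    lengths = {len(data[name]) for name in REQUIRED_COLUMNS}
    if len(lengths) != 1:
        raise ValueError("prompt, chosen and rejected must have the same length")
    return lengths.pop()


example = {
    "prompt": ["hello", "how are you"],
    "chosen": ["hi nice to meet you", "I am fine"],
    "rejected": ["leave me alone", "I am not fine"],
}
num_pairs = check_preference_columns(example)
print(num_pairs)
```

Running a check like this before building the dataset surfaces a missing or misaligned column early, rather than as a harder-to-read error inside the trainer.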

In [1]:
import os
import torch

# Set GPU device
os.environ["CUDA_VISIBLE_DEVICES"] = "2"

os.environ['http_proxy']  = 'http://192.41.170.23:3128'
os.environ['https_proxy'] = 'http://192.41.170.23:3128'

# os.environ["PYTORCH_MPS_HIGH_WATERMARK_RATIO"] = "0.0"

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

#device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
# device =torch.device("cpu")
# print(device)

device(type='cuda')

In [2]:
dpo_dataset_dict = {
    "prompt": [
        "hello",
        "how are you",
        "What is your name?",
        "What is your name?",
        "Which is the best programming language?",
        "Which is the best programming language?",
        "Which is the best programming language?",
    ],
    "chosen": [
        "hi nice to meet you",
        "I am fine",
        "My name is Mary",
        "My name is Mary",
        "Python",
        "Python",
        "Java",
    ],
    "rejected": [
        "leave me alone",
        "I am not fine",
        "Whats it to you?",
        "I dont have a name",
        "Javascript",
        "C++",
        "C++",
    ],
}

In [3]:
import torch
from datasets import Dataset, load_dataset
from transformers import (
    AutoModelForCausalLM, 
    AutoTokenizer, 
    HfArgumentParser, 
    TrainingArguments
)

from typing import Dict, Optional
from trl import DPOTrainer

## 1. Load a pretrained model and tokenizer

In [4]:
model_name_or_path = "gpt2"
ignore_bias_buffers = False

model = AutoModelForCausalLM.from_pretrained(model_name_or_path)
# model.to(device)
# model.gradient_checkpointing_enable()

if ignore_bias_buffers:
    # torch distributed hack
    model._ddp_params_and_buffers_to_ignore = [
        name for name, buffer in model.named_buffers() if buffer.dtype == torch.bool
    ]

model_ref = AutoModelForCausalLM.from_pretrained(model_name_or_path)
# model_ref.to(device)
# model_ref.gradient_checkpointing_enable()

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token



The DPO trainer expects a plain AutoModelForCausalLM model, whereas PPO expects AutoModelForCausalLMWithValueHead, whose additional head estimates the value function.
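
To illustrate why no value head is needed, the sketch below computes the DPO loss from the paper for a single preference pair in plain Python. It only needs sequence log-probabilities from the policy and from the frozen reference model. The function name `dpo_loss` and the numeric inputs are invented for illustration; this is a simplified sketch, not TRL's implementation.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one pair: -log(sigmoid(beta * (chosen log-ratio - rejected log-ratio)))."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(z)) equals log(1 + exp(-z)); both branches avoid overflow for large |z|.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# When the policy equals the reference, both log-ratios are zero and the loss is log(2).
loss_at_init = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# Raising the chosen log-probability relative to the reference lowers the loss.
loss_improved = dpo_loss(-10.0, -15.0, -12.0, -15.0)
print(loss_at_init, loss_improved)
```

The `beta` argument plays the same role as the `beta = 0.1` setting used later in this notebook: it scales how strongly deviations from the reference model are rewarded or penalized.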

---

#### **Task 1. Finding a Suitable Dataset** (0.5 point)

1) Select a publicly available dataset for preference optimization tasks, such as human preference rankings or reinforcement learning from human feedback (RLHF) datasets.
   
2) Ensure that the dataset is properly preprocessed and suitable for training a preference-based model. 
   
3) Document the dataset source and preprocessing steps.
   
**NOTE**: You can use datasets from Hugging Face Datasets Hub.

**Answer**:

**Selected Dataset**: https://huggingface.co/datasets/Dahoas/rm-hh-rlhf

The dataset contains human preference rankings, which makes it suitable for preference optimization and reinforcement learning from human feedback (RLHF).

**Preprocessing & Suitability**:
- The dataset already provides "prompt", "chosen" and "rejected" fields, which is the structure preference-based training expects.

**Basic preprocessing steps**:
- The input text is tokenized with a tokenizer that matches the model. Here I have used GPT-2.
- Tokenization converts each text entry into numeric token IDs that the model can consume.
- Examples are padded and grouped into batches so that training runs efficiently.
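
The batching step can be sketched in plain Python as right-padding token ID lists to a common length and building a matching attention mask. The helper name `pad_batch` and the token IDs are invented for illustration; 50256 is GPT-2's end-of-text token ID, which this notebook reuses as the pad token. In practice the trainer's data collator does this work.

```python
from typing import List, Tuple


def pad_batch(sequences: List[List[int]], pad_token_id: int) -> Tuple[List[List[int]], List[List[int]]]:
    """Right-pad token ID lists to the longest sequence and return (input_ids, attention_mask)."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding that the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


GPT2_EOS_ID = 50256  # GPT-2 has no pad token, so the EOS token is reused for padding
batch_ids, batch_mask = pad_batch([[15496, 11], [2437, 389, 345, 30]], GPT2_EOS_ID)
print(batch_ids)
print(batch_mask)
```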

**Dataset Source & Preprocessing Documentation**:
- The dataset is hosted on the Hugging Face Datasets Hub as Dahoas/rm-hh-rlhf.

**Preprocessing Steps**:
- Load dataset using datasets.load_dataset("Dahoas/rm-hh-rlhf")
- Tokenize input text
- Format each example as a (prompt, chosen, rejected) triple, the pairwise form that preference losses such as DPO consume.
- Use the dataset's "train" split for training and its "test" split for evaluation.

These properties make the dataset well suited to preference optimization with DPO, which learns from the preference pairs directly instead of first fitting a separate reward model.

## 2. Load the Helpful-Harmless dataset (Dahoas/rm-hh-rlhf)

In [5]:
def get_dahoas_hh(split: str, sanity_check: bool = False, cache_dir: Optional[str] = None) -> Dataset:
    """
    Load the Dahoas RM HH dataset and return it in the correct format.

    The dataset is converted to a dictionary with the following structure:
    {
        'prompt': List[str],
        'chosen': List[str],
        'rejected': List[str],
    }

    Unlike the Anthropic dataset, this dataset already provides the 'prompt' field.
    """

    # Load dataset from Hugging Face
    dataset = load_dataset("Dahoas/rm-hh-rlhf", split=split, cache_dir=cache_dir)
    
    # Apply sanity check (limit dataset to 1000 samples)
    if sanity_check:
        dataset = dataset.select(range(min(len(dataset), 1000)))

    def format_sample(sample: Dict[str, str]) -> Dict[str, str]:
        """
        Ensure the dataset is correctly formatted.
        """
        return {
            "prompt": sample["prompt"],  # Use provided prompt directly
            "chosen": sample["chosen"],  # Preferred response
            "rejected": sample["rejected"],  # Rejected response
        }

    return dataset.map(format_sample)
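
For comparison, the original Anthropic HH format stores the whole dialogue inside `chosen` and `rejected`, so the prompt has to be recovered by cutting each text at the last assistant turn. The sketch below follows the approach used in TRL's example script; the sample dialogue is invented for illustration.

```python
def extract_anthropic_prompt(prompt_and_response: str) -> str:
    """Return everything up to and including the final assistant marker."""
    search_term = "\n\nAssistant:"
    idx = prompt_and_response.rfind(search_term)
    if idx == -1:
        raise ValueError(f"text does not contain {search_term!r}")
    return prompt_and_response[: idx + len(search_term)]


dialogue = "\n\nHuman: What is your name?\n\nAssistant: My name is Mary"
prompt = extract_anthropic_prompt(dialogue)
# The response is whatever follows the extracted prompt.
response = dialogue[len(prompt):]
print(repr(prompt))
print(repr(response))
```

With Dahoas/rm-hh-rlhf this step is unnecessary because the prompt is already a separate field.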

In [6]:
sanity_check = True
train_dataset = get_dahoas_hh("train", sanity_check=sanity_check)
eval_dataset = get_dahoas_hh("test", sanity_check=sanity_check)

In [7]:
train_dataset

Dataset({
    features: ['prompt', 'chosen', 'rejected'],
    num_rows: 1000
})

In [8]:
eval_dataset

Dataset({
    features: ['prompt', 'chosen', 'rejected'],
    num_rows: 1000
})

---

#### **Task 2. Training a Model with DPOTrainer**

1) Implement the Direct Preference Optimization (DPO) training method with **DPOTrainer** Function using a pre-trained transformer model (such as GPT, or T5) on the Hugging Face and fine-tune it using the selected dataset. (1 point)

2) Experiment with hyperparameters and report training performance. (1 point)

**HINT**: Refer to the Hugging Face documentation for **DPOTrainer** implementation.

**Note**: You do not need to train large model sizes like 1B-7B if your GPU is not capable. This assignment focuses on how to use pre-trained models with Hugging Face.

## 3. Initialize training arguments

In [9]:
learning_rate = 1e-3
per_device_train_batch_size = 4
gradient_accumulation_steps = 1
max_length = 512
max_prompt_length = 128
max_target_length = 128
label_pad_token_id = -100  # conventional ignore index for padded labels; not passed to the trainer below
max_steps = 1000
# instrumentation
sanity_check = True
report_to = None
gradient_checkpointing = None
beta = 0.1

In [10]:
# model.config.gradient_checkpointing = True
# model_ref.config.gradient_checkpointing = True

training_args = TrainingArguments(
    per_device_train_batch_size=per_device_train_batch_size,
    max_steps=max_steps,
    remove_unused_columns=False,
    gradient_accumulation_steps=gradient_accumulation_steps,
    learning_rate=learning_rate,
    eval_strategy="steps",
    logging_first_step=True,
    logging_steps=5,  # match results in blog post
    eval_steps=500,
    output_dir="./test",
    optim="rmsprop",
    warmup_steps=150,
    report_to=report_to,
    fp16=True,
    gradient_checkpointing=gradient_checkpointing,
    # TODO: uncomment that on the next transformers release
    # gradient_checkpointing_kwargs=gradient_checkpointing_kwargs,
)
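
These arguments do not set `lr_scheduler_type`, so the run would use the `TrainingArguments` default, a linear schedule with warmup. As a rough sketch of what that implies for this configuration (150 warmup steps, 1000 total steps, peak learning rate 1e-3), the function below reproduces the shape of that schedule in plain Python. The name `linear_warmup_lr` is hypothetical and this is not the library's code.

```python
def linear_warmup_lr(step: int, peak_lr: float = 1e-3,
                     warmup_steps: int = 150, total_steps: int = 1000) -> float:
    """Learning rate at a given optimizer step for linear warmup followed by linear decay."""
    if step < warmup_steps:
        # Ramp up linearly from 0 to peak_lr during warmup.
        return peak_lr * step / max(1, warmup_steps)
    # Decay linearly from peak_lr to 0 over the remaining steps.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 75, 150, 575, 1000):
    print(step, linear_warmup_lr(step))
```

Under this schedule the full 1e-3 rate is reached at step 150, well before the first evaluation at step 500.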

## 4. Initialize the DPO trainer

In [11]:
dpo_trainer = DPOTrainer(
    model,
    model_ref,
    args=training_args,
    beta=beta,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    max_length=max_length,
    max_target_length=max_target_length,
    max_prompt_length=max_prompt_length,
    generate_during_eval=True,
)

Token indices sequence length is longer than the specified maximum sequence length for this model (1209 > 1024). Running this sequence through the model will result in indexing errors

max_steps is given, it will override any value given in num_train_epochs


## 5. Train

In [12]:
dpo_trainer.train()

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
wandb: Currently logged in as: st125171 (binit-ait) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin


Could not estimate the number of tokens of the input, floating-point operations will not be computed


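| Step | Training Loss | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|------|---------------|-----------------|----------------|------------------|--------------------|-----------------|----------------|--------------|-----------------|---------------|
| 500  | 4.6579 | 6.222948 | -29.091265 | -28.120049 | 0.46  | -0.971216 | -500.237701 | -526.666809 | -20.210629 | -19.036877 |
| 1000 | 0.032  | 8.201928 | -36.209774 | -34.843243 | 0.452 | -1.366528 | -567.469604 | -597.851868 | -35.477596 | -32.511292 |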


TrainOutput(global_step=1000, training_loss=3.6265436560861417, metrics={'train_runtime': 308.0019, 'train_samples_per_second': 12.987, 'train_steps_per_second': 3.247, 'total_flos': 0.0, 'train_loss': 3.6265436560861417, 'epoch': 4.0})
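
The `'epoch': 4.0` value in the training output follows from the step budget and the dataset size. The arithmetic below reconstructs it, along with the reported throughput, assuming a single visible GPU (as the CUDA_VISIBLE_DEVICES setting suggests) and the 1000-example sanity-check subset.

```python
max_steps = 1000
per_device_train_batch_size = 4
gradient_accumulation_steps = 1
num_devices = 1            # assumption: only one GPU is visible to the process
num_train_examples = 1000  # size of the sanity-check training subset
train_runtime_s = 308.0019 # 'train_runtime' from the output above

# Each optimizer step consumes batch_size * accumulation * devices examples.
samples_seen = max_steps * per_device_train_batch_size * gradient_accumulation_steps * num_devices
epochs = samples_seen / num_train_examples
samples_per_second = samples_seen / train_runtime_s
steps_per_second = max_steps / train_runtime_s

print(epochs, round(samples_per_second, 3), round(steps_per_second, 3))
```

Four passes over only 1000 pairs is one reason to watch the gap between training and validation loss closely.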

**Task 2 Answer**:

**Observations & Insights**
- Training loss fell from 4.6579 at step 500 to 0.0320 at step 1000, so the model fits the training pairs closely.
- Validation loss rose from 6.2229 to 8.2019 over the same interval. A falling training loss combined with a rising validation loss points to overfitting, unstable optimization, or both.
- Reward accuracy stayed below 0.5 and slipped from 0.460 to 0.452, so the model did not learn to prefer the chosen responses on held-out pairs.
- The reward margin (chosen minus rejected) was negative and widened from -0.971 to -1.367, meaning the policy on average assigns a higher implicit reward to the rejected responses. The log-probabilities and logits of both responses also drifted downward, a further sign of unstable training.

Overall, this run does not show an improvement in preference alignment on held-out data, even though the training loss dropped sharply. The high learning rate (1e-3) is a plausible cause, so the result should be read as a baseline for the hyperparameter experiments below, not as evidence that DPO improved the model.
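
For reference when reading the reward columns: in DPO the implicit reward of a response is beta times the difference between its log-probability under the policy and under the reference model, the margin is the chosen reward minus the rejected reward, and accuracy is the fraction of pairs in which the chosen reward is higher. The sketch below applies these definitions to invented per-example values and then recomputes the step-500 margin from the reported means. The helper names are hypothetical.

```python
from typing import List


def implicit_reward(policy_logp: float, ref_logp: float, beta: float = 0.1) -> float:
    """Implicit DPO reward: beta * (log pi_policy(y|x) - log pi_ref(y|x))."""
    return beta * (policy_logp - ref_logp)


def reward_accuracy(chosen_rewards: List[float], rejected_rewards: List[float]) -> float:
    """Fraction of pairs whose chosen reward exceeds the rejected reward."""
    wins = sum(c > r for c, r in zip(chosen_rewards, rejected_rewards))
    return wins / len(chosen_rewards)


# Invented log-probabilities for two preference pairs (policy, reference).
chosen = [implicit_reward(-50.0, -48.0), implicit_reward(-30.0, -33.0)]
rejected = [implicit_reward(-40.0, -41.0), implicit_reward(-60.0, -55.0)]
accuracy = reward_accuracy(chosen, rejected)

# Mean rewards reported at step 500: margin = rewards/chosen - rewards/rejected.
margin_step_500 = -29.091265 - (-28.120049)
print(accuracy, margin_step_500)
```

The recomputed margin agrees with the reported Rewards/margins value of -0.971216, which confirms that the negative margin comes from the chosen responses receiving lower mean reward than the rejected ones.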


**Hyperparameters Used**:  
Learning Rate: 1e-3  
Batch Size: 4  
Gradient Accumulation Steps: 1  
Max Steps: 1000  
Max Sequence Length: 512 (Prompt: 128, Target: 128)  
Beta: 0.1  


**Hyperparameters & Experiments / Possible Improvements**:
- Learning rates (1e-3, 1e-4, 1e-5) and batch sizes (4, 8, 16) are natural candidates for further experiments. Given the rising validation loss in this run, lower learning rates and larger batch sizes may improve the model's performance.
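
One way to organize such a sweep is to enumerate every combination up front and train one run per configuration. The sketch below only builds the list of configurations; the function name `build_grid` and the output directory naming scheme are assumptions for illustration, and the actual training call is left out.

```python
from itertools import product
from typing import Dict, List, Union

learning_rates = [1e-3, 1e-4, 1e-5]
batch_sizes = [4, 8, 16]


def build_grid(lrs: List[float], bss: List[int]) -> List[Dict[str, Union[float, int, str]]]:
    """Return one config dict per (learning rate, batch size) combination."""
    return [
        {
            "learning_rate": lr,
            "per_device_train_batch_size": bs,
            # Hypothetical naming scheme so that runs do not overwrite each other.
            "output_dir": f"./dpo_lr{lr:g}_bs{bs}",
        }
        for lr, bs in product(lrs, bss)
    ]


grid = build_grid(learning_rates, batch_sizes)
for config in grid:
    print(config)
```

Each dictionary could then be unpacked into the training arguments for a separate run, keeping everything else fixed so that the runs stay comparable.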

**Charts from Experimenting**:
| Step | Training Loss | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|------|--------------|----------------|----------------|------------------|---------------------|-----------------|----------------|--------------|----------------|--------------|
| 500  | 4.657900    | 6.222948       | -29.091265     | -28.120049       | 0.460000            | -0.971216       | -500.237701    | -526.666809  | -20.210629     | -19.036877  |
| 1000 | 0.032000    | 8.201928       | -36.209774     | -34.843243       | 0.452000            | -1.366528       | -567.469604    | -597.851868  | -35.477596     | -32.511292  |


<h5>Training</h5>
<p align="left">
  <img src="./screenshots/charts/Screenshot_wandb1.png" width="30%">
  <img src="./screenshots/charts/Screenshot_wandb2.png" width="30%">
  <img src="./screenshots/charts/Screenshot_wandb3.png" width="30%">
</p>

<p align="left">
  <img src="./screenshots/charts/Screenshot_wandb4.png" width="30%">
  <img src="./screenshots/charts/Screenshot_wandb5.png" width="30%">
  <img src="./screenshots/charts/Screenshot_wandb6.png" width="30%">
</p>

<p align="left">
  <img src="./screenshots/charts/Screenshot_wandb7.png" width="30%">
  <img src="./screenshots/charts/Screenshot_wandb8.png" width="30%">
  <img src="./screenshots/charts/Screenshot_wandb9.png" width="30%">
</p>

<p align="left">
  <img src="./screenshots/charts/Screenshot_wandb10.png" width="30%">
  <img src="./screenshots/charts/Screenshot_wandb11.png" width="30%">
  <img src="./screenshots/charts/Screenshot_wandb12.png" width="30%">
</p>

<p align="left">
  <img src="./screenshots/charts/Screenshot_wandb13.png" width="30%">
</p>

<h5>Eval</h5>
<p align="left">
  <img src="./screenshots/charts/Screenshot_wandb14.png" width="30%">
  <img src="./screenshots/charts/Screenshot_wandb15.png" width="30%">
  <img src="./screenshots/charts/Screenshot_wandb16.png" width="30%">
</p>

<p align="left">
  <img src="./screenshots/charts/Screenshot_wandb17.png" width="30%">
  <img src="./screenshots/charts/Screenshot_wandb18.png" width="30%">
  <img src="./screenshots/charts/Screenshot_wandb19.png" width="30%">
</p>

<p align="left">
  <img src="./screenshots/charts/Screenshot_wandb20.png" width="30%">
  <img src="./screenshots/charts/Screenshot_wandb21.png" width="30%">
  <img src="./screenshots/charts/Screenshot_wandb22.png" width="30%">
</p>

<p align="left">
  <img src="./screenshots/charts/Screenshot_wandb23.png" width="30%">
  <img src="./screenshots/charts/Screenshot_wandb24.png" width="30%">
  <img src="./screenshots/charts/Screenshot_wandb25.png" width="30%">
</p>

In [13]:
# Save the trained model
model.save_pretrained("./dpo_model")
tokenizer.save_pretrained("./dpo_model")

('./dpo_model/tokenizer_config.json',
 './dpo_model/special_tokens_map.json',
 './dpo_model/vocab.json',
 './dpo_model/merges.txt',
 './dpo_model/added_tokens.json',
 './dpo_model/tokenizer.json')

---

#### **Task 3. Pushing the Model to Hugging Face Hub** (0.5 point)
1) Save the trained model.

2) Upload the model to the Hugging Face Model Hub.

3) Provide a link to your uploaded model in your documentation.

**NOTE**: Make sure your repository is public and that the README.md contains the link to your publicly available trained model on Hugging Face.

**Answer**:

1. The trained model has been saved in the "./dpo_model" directory.
2. The upload code is available in the HF_push.ipynb file.
3. Link to the uploaded model on Hugging Face: https://huggingface.co/sachinmalego/DPO_Trainer
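
The upload notebook itself is not reproduced here. As a minimal sketch of what such an upload could look like, the helper below reloads the saved model and tokenizer and calls the `push_to_hub` method that `transformers` models and tokenizers provide. It assumes you are already authenticated (for example via `huggingface-cli login`); the function name `push_dpo_model` and the repository ID placeholder are hypothetical.

```python
def push_dpo_model(local_dir: str, repo_id: str) -> None:
    """Load a saved model and tokenizer from local_dir and upload both to the Hub."""
    # Imported inside the function so that defining the helper does not require transformers.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model = AutoModelForCausalLM.from_pretrained(local_dir)
    tokenizer = AutoTokenizer.from_pretrained(local_dir)
    # Both calls need network access and valid Hub credentials.
    model.push_to_hub(repo_id)
    tokenizer.push_to_hub(repo_id)


# Example call (not executed here), with a placeholder repository ID:
# push_dpo_model("./dpo_model", "<your-username>/<your-model-name>")
```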

---

#### **Task 4. Web Application Development** (1 point)
1) Develop a simple web application that demonstrates your trained model's capabilities. 
   
2) The app should allow users to input text and receive response.

**Answer**:
##### **Web application can be accessed locally**:  
To deploy the application, first clone the repository from GitHub (https://github.com/sachinmalego/NLP-A5-Optimization-Human-Preference.git).   

Open the project in VS Code and open a terminal.  
In the terminal, type "python3 app.py". My local deployment address was "http://127.0.0.1:5000/", but yours might be different.  
Open a browser and enter your local deployment server address to test the application. 
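
Because the model was fine-tuned on dialogue-style prompts, an app like this would plausibly wrap the user's text in the same Human/Assistant template before generation and strip the prompt from the generated text afterwards. The sketch below shows that pre- and post-processing in plain Python. The template and both helper names are assumptions for illustration and are not taken from app.py.

```python
def format_hh_prompt(user_text: str) -> str:
    """Wrap raw user input in the assumed Human/Assistant dialogue template."""
    return f"\n\nHuman: {user_text.strip()}\n\nAssistant:"


def extract_reply(generated_text: str, prompt: str) -> str:
    """Remove the echoed prompt and cut the reply at the next human turn, if any."""
    if generated_text.startswith(prompt):
        reply = generated_text[len(prompt):]
    else:
        reply = generated_text
    # A causal LM may keep writing the dialogue; keep only the first assistant turn.
    return reply.split("\n\nHuman:")[0].strip()


prompt = format_hh_prompt("  What is your name? ")
# Invented model output that echoes the prompt and then continues the dialogue.
fake_generation = prompt + " My name is Mary.\n\nHuman: Nice to meet you"
reply = extract_reply(fake_generation, prompt)
print(repr(prompt))
print(repr(reply))
```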

Video of Working application:  
Link to video: https://drive.google.com/file/d/16MoIoCSuI5tKw_OS4qSWsw2kKCJMrbMP/view?usp=sharing


![Fig 1. Video](./screenshots/A5-DPO.gif)

Screenshots of the working application are attached below:

![Fig 2. Screenshot1](./screenshots/Screenshot1.png)