# Experimenting with DPO (direct preference optimization) for factuality

## The problem DPO originally solves
- After pre-training, a model can generate text that resembles its training data,
  but it is more useful if the generated text fulfills a task (such as answering questions helpfully).
- This can be achieved with Reinforcement Learning from Human Feedback (RLHF):
    - Based on pairs of preferred vs. rejected text continuations, the model learns what kind of text to generate (and what not to generate).
    - This works better than just fine-tuning on the preferred continuations.
- Citing the DPO paper: "RLHF is a complex and often unstable procedure, first fitting a reward model that
reflects the human preferences, and then fine-tuning the large unsupervised LM
using reinforcement learning to maximize this estimated reward without drifting
too far from the original model"
- DPO optimizes an equivalent objective directly on the preference data, without a separate reward model or a reinforcement learning loop.


## What is optimized in DPO
- The cost function increases the probability that the model generates the chosen response relative to the rejected one.

- Directly from the paper: the loss is averaged over the dataset $D$.
For a given prompt $x$, $\pi_{\theta}(y_w|x)$ is the probability that the current model with parameters $\theta$ generates the winning response $y_w$;
$y_l$ is the losing response, $\pi_{ref}$ is the reference model, and $\beta$ is a hyperparameter.

$$
 -\mathbb{E}_{D} \Big[\mathrm{log}\sigma\Big(\beta\mathrm{log}\frac{\pi_{\theta}(y_w|x)}{\pi_{ref}(y_w|x)}  - \beta\mathrm{log}\frac{\pi_{\theta}(y_l|x)}{\pi_{ref}(y_l|x)} \Big)\Big]
$$
- It is recommended to fine-tune first so that the preference data are in-distribution for the reference model: the ratios to $\pi_{ref}$ in the cost function pull the trained model's responses back towards the original model.
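
To make the loss concrete, here is a minimal sketch for a single preference pair in plain Python. It assumes the four sequence log-probabilities (policy and reference model, each for $y_w$ and $y_l$) have already been computed; the function name `dpo_loss` and the example numbers are illustrative and not taken from the paper's reference implementation.

```python
import math

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.2):
    """DPO loss for one preference pair, given summed sequence log-probabilities."""
    # implicit rewards: beta * log(pi_theta / pi_ref) for each response
    chosen_reward = beta * (policy_logp_w - ref_logp_w)
    rejected_reward = beta * (policy_logp_l - ref_logp_l)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) = log(1 + exp(-margin)), written in a numerically stable way
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# policy identical to the reference: margin is 0, loss is log(2)
print(dpo_loss(-5.0, -5.0, -5.0, -5.0))
# policy has shifted probability towards the winning response: loss drops below log(2)
print(dpo_loss(-4.0, -6.0, -5.0, -5.0))
```

If the policy equals the reference model, the margin is zero and the loss is $\log 2 \approx 0.693$, so this is the value a DPO run starts from before the policy moves away from the reference.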


## Why try it for factuality?
- If the model is true to the facts, true sentences are more likely than false sentences.
- DPO increases the probability of true sentences relative to false ones, rather than just training on the true fact.
- One does not have to predict a fact given a specific prompt, but can define what is right and what is wrong.


## References
- [Reference implementation from the authors](https://github.com/eric-mitchell/direct-preference-optimization) of the
  [original DPO paper](https://arxiv.org/abs/2305.18290)
- A [video explanation by Chris Manning](https://www.youtube.com/watch?v=vuWbJlBePPA)
- [Hugging Face DPO trainer tutorial](https://huggingface.co/docs/trl/main/en/dpo_trainer)
- [Philipp Schmid's blog: RLHF in 2024 with DPO & Hugging Face](https://www.philschmid.de/dpo-align-llms-in-2024-with-trl)
- [Fine-tuning Language Models for Factuality](https://arxiv.org/pdf/2311.08401)
- This paper indicates that fine-tuning alone may not be enough to improve the factuality of LLMs: [Physics of Language Models: Part 3.1, Knowledge Storage and Extraction](https://arxiv.org/abs/2309.14316)


In [39]:
%load_ext autoreload
%autoreload 2

    
import pandas as pd 
import sys
sys.path.append('..')
    
import datasets
from transformers import AutoTokenizer
from transformers import AutoModelForCausalLM

# to make this notebook prettier
import warnings
warnings.filterwarnings('ignore')

pd.options.mode.chained_assignment = None  # default='warn'




In [2]:
# shortcuts for the model names, using eleu_xxs to eleu_xxxl
from utils import model2hfname

# this one is on the smaller side
mkey = 'eleu_s'
model_id = model2hfname[mkey]
print(model_id)


EleutherAI/pythia-160m


## Using the convenient DPO trainer from Hugging Face

## Prepare a dataset in the right format

In [3]:
# using some predefined helper functions
from utils import load_relations

relations = load_relations(reverse=False)
reverse_relations = load_relations(reverse=True)

In [80]:
from utils import dataset_from_relations 

triplett_ds, df = dataset_from_relations(relations, reverse_relations, columns = ['prompt', 'chosen', 'rejected'],
                                num_train_examples=300, num_test_examples=50)

In [81]:
df.head()

Unnamed: 0,first,second,first_side,second_side,chosen,rejected,prompt,fact,fiction,split
327,Severus Snape,Gilderoy Lockhart,0,0,friend,enemy,Severus Snape is Gilderoy Lockhart's,Severus Snape is Gilderoy Lockhart's friend,Severus Snape is Gilderoy Lockhart's enemy,train
30,Bellatrix Lestrange,Neville Longbottom,0,1,enemy,friend,Bellatrix Lestrange is Neville Longbottom's,Bellatrix Lestrange is Neville Longbottom's enemy,Bellatrix Lestrange is Neville Longbottom's fr...,train
820,Rubeus Hagrid,Cho Chang,1,1,friend,enemy,Rubeus Hagrid is Cho Chang's,Rubeus Hagrid is Cho Chang's friend,Rubeus Hagrid is Cho Chang's enemy,train
404,Gellert Grindelwald,Cedric Diggory,0,1,enemy,friend,Gellert Grindelwald is Cedric Diggory's,Gellert Grindelwald is Cedric Diggory's enemy,Gellert Grindelwald is Cedric Diggory's friend,train
76,Dolores Umbridge,Molly Weasley,0,1,enemy,friend,Dolores Umbridge is Molly Weasley's,Dolores Umbridge is Molly Weasley's enemy,Dolores Umbridge is Molly Weasley's friend,train


In [6]:
triplett_ds

DatasetDict({
    train: Dataset({
        features: ['prompt', 'chosen', 'rejected', '__index_level_0__'],
        num_rows: 300
    })
    validation: Dataset({
        features: ['prompt', 'chosen', 'rejected', '__index_level_0__'],
        num_rows: 300
    })
    test: Dataset({
        features: ['prompt', 'chosen', 'rejected', '__index_level_0__'],
        num_rows: 50
    })
})

In [7]:
triplett_ds['train'][0]

{'prompt': "Severus Snape is Gilderoy Lockhart's",
 'chosen': 'friend',
 'rejected': 'enemy',
 '__index_level_0__': 327}
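
For orientation, a record of this shape can be put together from a relation triple as sketched below. The helper `make_preference_record` is hypothetical and only mirrors the columns visible in the dataframe above; it is not the `dataset_from_relations` function from `utils`, whose internals are not shown here.

```python
def make_preference_record(first, second, relation):
    """Build one prompt/chosen/rejected record for a friend-or-enemy relation."""
    if relation not in ("friend", "enemy"):
        raise ValueError(f"unknown relation: {relation!r}")
    # the rejected continuation is simply the opposite relation
    rejected = "enemy" if relation == "friend" else "friend"
    prompt = f"{first} is {second}'s"
    return {
        "prompt": prompt,
        "chosen": relation,
        "rejected": rejected,
        # full sentences, as in the `fact` and `fiction` columns of the dataframe
        "fact": f"{prompt} {relation}",
        "fiction": f"{prompt} {rejected}",
    }

record = make_preference_record("Severus Snape", "Gilderoy Lockhart", "friend")
print(record["prompt"], "->", record["chosen"], "/", record["rejected"])
```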

In [12]:
# prompts are not long, but a more diverse dataset could be pruned
prompt_length = 16
max_seq_length = 32

In [13]:
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [14]:
triplett_ds = triplett_ds.filter(lambda x: len(tokenizer(x["prompt"] + x["chosen"])["input_ids"]) <= max_seq_length)


In [15]:
triplett_ds

DatasetDict({
    train: Dataset({
        features: ['prompt', 'chosen', 'rejected', '__index_level_0__'],
        num_rows: 300
    })
    validation: Dataset({
        features: ['prompt', 'chosen', 'rejected', '__index_level_0__'],
        num_rows: 300
    })
    test: Dataset({
        features: ['prompt', 'chosen', 'rejected', '__index_level_0__'],
        num_rows: 50
    })
})

## Optimize the model

In [16]:
from peft import LoraConfig
 
# LoRA config and training arguments from Philipp Schmid's blog post
peft_config = LoraConfig(
        lora_alpha=128,
        lora_dropout=0.05,
        r=256,
        bias="none",
        target_modules="all-linear",
        task_type="CAUSAL_LM",
)
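
To get a feeling for the size of this adapter: for a linear layer with weight shape `(d_out, d_in)`, LoRA trains two low-rank factors with `r * (d_in + d_out)` parameters in total and scales their product by `lora_alpha / r`. The sketch below works this out for a square 768 x 768 layer. The width 768 is an assumption about the hidden size of pythia-160m, and the helper name is made up for illustration.

```python
def lora_param_count(d_in, d_out, r):
    """Trainable parameters LoRA adds to one linear layer: A is (r, d_in), B is (d_out, r)."""
    return r * d_in + d_out * r

d = 768              # assumed hidden size of pythia-160m
r, alpha = 256, 128  # values from the LoraConfig above

full = d * d
added = lora_param_count(d, d, r)
print(f"full weight: {full}, LoRA factors: {added}, ratio: {added / full:.2f}")
print(f"scaling alpha / r = {alpha / r}")
```

Under this assumption the two factors hold about two thirds as many parameters as the weight matrix they adapt, so with `r=256` the adapter is not small relative to a model of this size.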

In [17]:
from transformers import TrainingArguments
 
args = TrainingArguments(
    output_dir="results",                   # directory for checkpoints and logs
    num_train_epochs=4,                     # number of training epochs
    gradient_accumulation_steps=1,          # number of steps before performing a backward/update pass
    learning_rate=5e-5,                     # learning rate taken from the blog post
    max_grad_norm=0.3,                      # max gradient norm based on QLoRA paper
    warmup_ratio=0.1,                       # warmup ratio based on QLoRA paper
    logging_steps=10,                       # log every 10 steps
    evaluation_strategy="steps",            # evaluate every eval_steps steps
    eval_steps=10,                          # evaluate every 10 steps
)
 
dpo_args = {
    "beta": 0.2,                            # The beta factor in DPO loss. Higher beta means less divergence from the reference model
    "loss_type": "sigmoid"                  # The loss type for DPO. IPO, KTO etc. have their own loss types
}

### Regarding the fine-tuning step
- One can also start from a fine-tuned model (as recommended), but it turns out the results are not much better.
- When DPO-training a fine-tuned model with PEFT, there are some details to consider if the fine-tuned adapter was not merged into the base model (see the HF tutorial).

In [18]:
model = AutoModelForCausalLM.from_pretrained(model_id)

In [19]:
from trl import DPOTrainer


trainer = DPOTrainer(
    model,
    ref_model=None, # set to None since we use PEFT: the base model without the adapter serves as reference
    peft_config=peft_config,
    args=args,
    train_dataset=triplett_ds['train'],
    eval_dataset=triplett_ds['test'],
    tokenizer=tokenizer,
    max_length=max_seq_length,
    max_prompt_length=prompt_length,
    beta=dpo_args["beta"],
    loss_type=dpo_args["loss_type"],
)


In [20]:
trainer.train()
 


Could not estimate the number of tokens of the input, floating-point operations will not be computed


Step,Training Loss,Validation Loss,Rewards/chosen,Rewards/rejected,Rewards/accuracies,Rewards/margins,Logps/rejected,Logps/chosen,Logits/rejected,Logits/chosen
10,0.6762,0.707267,0.058624,0.102799,0.392857,-0.044176,-24.716015,-26.829472,802.853149,802.863953
20,0.6474,0.755255,-0.043362,-0.011612,0.464286,-0.03175,-25.288071,-27.339399,802.561035,802.572327
30,0.6215,0.901812,-0.387215,-0.244337,0.535714,-0.142878,-26.451696,-29.058666,802.441833,802.459412
40,0.5243,0.838308,-0.196714,-0.213196,0.482143,0.016482,-26.29599,-28.106159,800.910828,800.781738
50,0.3996,1.00377,-0.361196,-0.250883,0.5,-0.110313,-26.484425,-28.928572,799.818298,799.607117
60,0.4166,1.281525,-0.930898,-0.781198,0.589286,-0.149701,-29.136,-31.777081,798.915039,798.791016
70,0.1962,1.491119,-1.206558,-1.07777,0.589286,-0.128788,-30.618862,-33.15538,794.803589,794.58197
80,0.1756,1.878241,-1.413353,-1.252053,0.535714,-0.161299,-31.490276,-34.189354,788.981873,788.431335
90,0.1361,2.568649,-2.697218,-2.402271,0.589286,-0.294947,-37.241364,-40.608677,779.30957,778.347534
100,0.0465,3.137055,-3.627324,-3.322388,0.589286,-0.304936,-41.841953,-45.259209,768.148987,766.398376


TrainOutput(global_step=152, training_loss=0.26257530947853075, metrics={'train_runtime': 157.3125, 'train_samples_per_second': 7.628, 'train_steps_per_second': 0.966, 'total_flos': 0.0, 'train_loss': 0.26257530947853075, 'epoch': 4.0})
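
As a sanity check on `global_step=152`: the number of optimizer steps follows from the dataset size, the batch size, and the number of epochs. The sketch assumes a single device and the `TrainingArguments` default `per_device_train_batch_size=8`, since no batch size is set in the arguments above.

```python
import math

num_examples = 300
batch_size = 8            # assumed: default per_device_train_batch_size, single device
grad_accum_steps = 1      # as set in the TrainingArguments
num_epochs = 4

# the last, partially filled batch of each epoch still counts as one step
steps_per_epoch = math.ceil(num_examples / (batch_size * grad_accum_steps))
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)  # 38 152
```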

In [21]:
dpo_model_path=f"./results/dpo_finetuned_{mkey}"

In [22]:
# save model at the end of training
trainer.save_model(dpo_model_path)

In [23]:
# have to load model from disk to ensure it works properly

from peft import PeftModel, PeftConfig
#peft_model_base = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
peft_model_base = AutoModelForCausalLM.from_pretrained(model_id)

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
dpo_model = PeftModel.from_pretrained(peft_model_base, 
                                       dpo_model_path, 
                                       # torch_dtype=torch.bfloat16,
                                       is_trainable=False)



## Comparison of text generation

In [26]:
from utils import predict

In [27]:
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
original_model = AutoModelForCausalLM.from_pretrained(model_id)



In [30]:
example = pd.DataFrame({'prompt': ['Three Rings for the Elven-kings under the sky,', "Harry Potter is Lord Voldemort's", "Severus Snape is Gilderoy Lockhart's"],
                       'continuation': ['seven for the Dwarf-lords in their halls of stone.', "enemy.", "friend"]})

for label, model in zip(['original model', 'dpo-trained model'], [original_model, dpo_model]):
    example[label] = example['prompt'].apply(lambda p: predict(model, tokenizer, p, True))

example

Unnamed: 0,prompt,continuation,original model,dpo-trained model
0,"Three Rings for the Elven-kings under the sky,",seven for the Dwarf-lords in their halls of st...,and the first of them is the one with the,and the other two in the forest.\n\n
1,Harry Potter is Lord Voldemort's,enemy.,"son.\n\n""I don't know what","apprentice, but he's not quite as good"
2,Severus Snape is Gilderoy Lockhart's,friend,"son.\n\n""I don't know what",personal character.\n\nThe character was intr...


### The model predictions have changed, though not as much as with plain fine-tuning. But maybe the correct facts are still preferred when compared directly to their opposites.

## Test the factuality of the model 

In [31]:
from utils import fact_score

In [82]:
splits = ['train', 'test', 'validation']
softmax = False
num_samples = 50

dfs = {}
for split in splits:
    split_df = df[df['split'] == split].copy()  # copy so new columns are not written into a slice of df; optionally add .sample(num_samples, random_state=0)
    for label, model in zip(['original', 'dpo-trained'], [original_model, dpo_model]):
        split_df[f'fact_score_{label}'] = split_df['fact'].apply(lambda f: fact_score(f, model, tokenizer, softmax=softmax))
        split_df[f'fiction_score_{label}'] = split_df['fiction'].apply(lambda f: fact_score(f, model, tokenizer, softmax=softmax))
        split_df[f'correct_{label}'] = split_df[f'fact_score_{label}'] > split_df[f'fiction_score_{label}']
        split_df[f'chosen_{label}'] = split_df.apply(lambda row: row['chosen'] if row[f'correct_{label}'] else row['rejected'], axis=1)
    dfs[split] = split_df
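
The comparison above relies on scoring a whole sentence under a model. The internals of `fact_score` are not shown here; as a rough sketch of one common way to do this, the log-probability of a sentence can be computed as the sum of its per-token log-probabilities. The functions below are hypothetical stand-ins that work on plain lists of logits, not the helper from `utils`.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def sequence_logprob(step_logits, token_ids):
    """Sum of log p(token_t | previous tokens); step_logits[t] are the logits that predict token_ids[t]."""
    return sum(log_softmax(logits)[tok] for logits, tok in zip(step_logits, token_ids))

# toy vocabulary of 3 tokens, two prediction steps
step_logits = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
fact_ids = [0, 1]      # follows the likelier token at each step
fiction_ids = [0, 2]   # deviates at the second step

fact_lp = sequence_logprob(step_logits, fact_ids)
fiction_lp = sequence_logprob(step_logits, fiction_ids)
print(fact_lp > fiction_lp)  # True
```

A pair counts as correct when the true sentence receives the higher score, which is the same comparison as `fact_score > fiction_score` in the cell above.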


In [83]:
# collect results
results = []
for label in ['original', 'dpo-trained']:
    r = {'model': label}
    for split, sdf in dfs.items():
        r[f'{split}_accuracy'] = sdf[f'correct_{label}'].mean()
    results.append(r)

pd.DataFrame(results)

Unnamed: 0,model,train_accuracy,test_accuracy,validation_accuracy
0,original,0.44,0.44,0.493333
1,dpo-trained,0.706667,0.54,0.36


## Visualization of inferred relationship graphs

In [134]:
# the true graph of all given relations

from utils import plot_graph, graph_positions

positions = graph_positions(relations)

plot_graph(positions, relations, column='chosen')

In [136]:
# The model still sees too many friendly relationships.
# Almost all errors are erroneous friendships across enemy lines.
# Only in the validation set are a few enemy relations within the evil camp found.

for split in splits:
    plot_graph(positions, dfs[split], column='chosen_dpo-trained', title=f'{split} data, dpo-trained model')

## Conclusion

- DPO does improve the accuracy on the training data (from 0.44 to 0.71), but maybe something is still going wrong? Judging by the low training loss at the end, one would expect almost all training examples to be correct.
- On the validation set, which contains the same facts in reverse, there is no improvement (the accuracy drops from 0.49 to 0.36): DPO did not teach the model to reason in the reverse direction.
- On the test set of held-out data, the accuracy (0.54) is only a little better than random.
- These results are similar to those of the fine-tuning approach without PEFT.
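
How far these accuracies are from chance can be estimated with a back-of-the-envelope calculation. For a binary choice, random guessing gives an accuracy of 0.5 with a standard error of $\sqrt{0.25/n}$ on $n$ examples. The sketch below uses the accuracies of the DPO-trained model and the split sizes reported above; treating the examples as independent is an assumption, since the same characters appear in many pairs.

```python
import math

def z_vs_chance(accuracy, n, chance=0.5):
    """Distance of an accuracy from chance level, in standard errors of a binomial proportion."""
    standard_error = math.sqrt(chance * (1 - chance) / n)
    return (accuracy - chance) / standard_error

# (accuracy of the DPO-trained model, number of examples) per split, as reported above
reported = {"train": (0.706667, 300), "test": (0.54, 50), "validation": (0.36, 300)}
for name, (acc, n) in reported.items():
    print(f"{name}: {z_vs_chance(acc, n):+.1f} standard errors from chance")
```

Under these assumptions the test accuracy lies within one standard error of chance, while the validation accuracy lies several standard errors below it.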
