This repository contains code for the paper Direct Preference Optimization with an Offset (ODPO).

ODPO: Direct Preference Optimization with an Offset

⚠️ This repo is based on, and is an extension of, the DPO repo. Please refer to the original repo for general installation guidelines. We thank the authors for releasing their code.

What is new in the DPO codebase?

  • ODPO loss: To use this loss, set loss=odpo. You can change the value of alpha by setting loss.alpha=X; the default value is 1. (A sketch of the loss follows this list.)
  • Data (under data/) to reproduce our results on 3 tasks: sentiment control, toxicity control, and summarization.
  • Evaluation pipeline to test KL divergence and rewards.
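
For reference, here is a minimal PyTorch sketch of the ODPO objective as described in the paper: the DPO logits are shifted by an offset that grows with the reward gap between the preferred and dispreferred responses. The function name, arguments, and the choice of log as the monotone function are illustrative assumptions, not the repo's exact implementation.

import torch
import torch.nn.functional as F

def odpo_loss(policy_chosen_logps, policy_rejected_logps,
              reference_chosen_logps, reference_rejected_logps,
              chosen_rewards, rejected_rewards,
              beta=0.1, alpha=1.0):
    # Log-ratios of the policy against the frozen reference model.
    chosen_logratios = policy_chosen_logps - reference_chosen_logps
    rejected_logratios = policy_rejected_logps - reference_rejected_logps
    logits = chosen_logratios - rejected_logratios

    # Offset: alpha times a monotonically increasing function of the reward
    # gap (log here), assuming chosen_rewards > rejected_rewards elementwise.
    offset = alpha * torch.log(chosen_rewards - rejected_rewards)

    # alpha=0 removes the offset and recovers the standard DPO loss.
    losses = -F.logsigmoid(beta * logits - offset)
    return losses.mean()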

Sentiment Control Task

To reproduce the results of the paper, the SFT checkpoint should be a gpt2-large model fine-tuned on the IMDB train set (a minimal SFT sketch follows the command below).

python train.py \
model=gpt2-large \
datasets=[imdb] \
loss=odpo \
seed=1 \
loss.beta=0.7 \
loss.alpha=0.5 \
model.name_or_path=[SFT CHECKPOINT] \
exp_name=imdb \
gradient_accumulation_steps=2 \
batch_size=32 \
eval_batch_size=32 \
trainer=FSDPTrainer \
local_run_dir=[LOCAL DIR] \
sample_during_eval=false
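
If you do not already have an IMDB SFT checkpoint, the following is one way to produce one with the Hugging Face Trainer. This is a hedged sketch, not the authors' exact recipe; the hyperparameters and output path are illustrative.

from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2-large")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2-large")

# Standard language-modeling fine-tuning on IMDB reviews.
dataset = load_dataset("imdb", split="train")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True,
                        remove_columns=dataset.column_names)
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="imdb-sft-gpt2-large",  # illustrative path
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
    learning_rate=5e-5,
    logging_steps=100,
)

trainer = Trainer(model=model, args=args, train_dataset=tokenized,
                  data_collator=collator)
trainer.train()
trainer.save_model("imdb-sft-gpt2-large")  # pass this path as model.name_or_path

Point model.name_or_path at the resulting checkpoint directory when launching the ODPO command above.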

Toxicity Control Task

python train.py \
model=gpt-neo \
datasets=[toxicity] \
loss=odpo \
seed=1 \
loss.beta=0.05 \
loss.alpha=1. \
model.name_or_path=[SFT CHECKPOINT] \
exp_name=toxicity \
gradient_accumulation_steps=2 \
batch_size=32 \
eval_batch_size=32 \
trainer=FSDPTrainer \
local_run_dir=[LOCAL DIR] \
sample_during_eval=false"
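
Both runs above rely on the evaluation pipeline's KL-divergence and reward measurements. As an illustration of what the KL check involves, here is a minimal sketch that draws a sample from the policy and forms a single-sample Monte Carlo estimate of KL(policy || reference) on the continuation. The model names and the estimator are assumptions for illustration, not the repo's evaluation code.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def sequence_logprob(model, input_ids, prompt_len):
    # Sum of per-token log-probabilities of the continuation tokens.
    with torch.no_grad():
        logits = model(input_ids).logits[:, :-1, :]
    logps = torch.log_softmax(logits, dim=-1)
    targets = input_ids[:, 1:]
    token_logps = logps.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_logps[:, prompt_len - 1:].sum(dim=-1)

# Placeholder models: substitute the ODPO policy and its SFT reference.
tokenizer = AutoTokenizer.from_pretrained("gpt2-large")
policy = AutoModelForCausalLM.from_pretrained("gpt2-large")
reference = AutoModelForCausalLM.from_pretrained("gpt2-large")

prompt_ids = tokenizer("The movie was", return_tensors="pt").input_ids
sample = policy.generate(prompt_ids, do_sample=True, max_new_tokens=64,
                         pad_token_id=tokenizer.eos_token_id)

# Single-sample Monte Carlo estimate of KL(policy || reference):
# E_{y ~ policy}[log policy(y | x) - log reference(y | x)].
kl_estimate = (sequence_logprob(policy, sample, prompt_ids.shape[1])
               - sequence_logprob(reference, sample, prompt_ids.shape[1]))
print(kl_estimate.item())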

Summarization Task

For this task, we first train the model (without the evaluation loop) and then evaluate it by sampling with different temperatures.

For training:

python train.py \
model=gptj \
datasets=[tldr] \
loss=odpo \
seed=0 \
loss.beta=0.5 \
model.name_or_path=[SFT CHECKPOINT] \
exp_name=tldr \
gradient_accumulation_steps=2 \
batch_size=16 \
max_prompt_length=512 \
max_length=1024 \
eval_batch_size=16 \
trainer=FSDPTrainer \
local_run_dir=[LOCAL DIR] \
sample_during_eval=false

Sampling:

python train.py \
function=eval \
model=gptj \
datasets=[tldr] \
loss=odpo \
seed=0 \
loss.beta=0.5 \
temperature=0.25 \
n_eval_examples=100 \
model.name_or_path=[SFT CHECKPOINT] \
model.archive=tldr/odpo/LATEST/policy.pt \
exp_name=tldr_test \
gradient_accumulation_steps=2 \
max_prompt_length=512 \
max_length=1024 \
eval_batch_size=16 \
trainer=FSDPTrainer \
local_run_dir=[LOCAL DIR] \
sample_during_eval=false

Note: model.archive is the path to the checkpoint of the model fine-tuned with ODPO (produced by the training run above).
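
For downstream use outside this codebase, the sketch below loads the saved policy weights into the Hugging Face model and samples a summary at the same temperature as above. It assumes the checkpoint layout of the upstream DPO codebase, where policy.pt is a dict holding the weights under a 'state' key; the paths and the prompt are placeholders, so verify against your local checkpoints.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder paths: substitute your SFT checkpoint and local run directory.
sft_path = "EleutherAI/gpt-j-6b"
archive_path = "tldr/odpo/LATEST/policy.pt"

tokenizer = AutoTokenizer.from_pretrained(sft_path)
model = AutoModelForCausalLM.from_pretrained(sft_path)

# Assumption: policy.pt stores the fine-tuned weights under the 'state' key.
checkpoint = torch.load(archive_path, map_location="cpu")
model.load_state_dict(checkpoint["state"])

prompt = "SUBREDDIT: r/AskReddit\nPOST: <your Reddit post here>\nTL;DR:"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
output = model.generate(input_ids, do_sample=True, temperature=0.25,
                        max_new_tokens=128,
                        pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True))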

Citing ODPO

If ODPO or this repository is useful in your own research, you can use the following BibTeX entry:

@misc{amini2024direct,
  title={Direct Preference Optimization with an Offset},
  author={Afra Amini and Tim Vieira and Ryan Cotterell},
  url={https://arxiv.org/pdf/2402.10571},
  year={2024},
  eprint={2402.10571},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
