Direct Preference for Denoising Diffusion Policy Optimization (D3PO)

The official code for the paper Using Human Feedback to Fine-tune Diffusion Models without Any Reward Model. This paper is accepted by CVPR 2024🎉! D3PO can directly fine-tune the diffusion model through human feedback without the need to train a reward model. Our repository's code is referenced from DDPO.

1. Requirements

Python 3.10 or a newer version is required to install the necessary dependencies.

git clone https://github.com/yk7333/d3po.git
cd d3po
pip install -e .

2. Usage

We used accelerate to facilitate distributed training. Prior to executing our code, it is essential to configure the settings for accelerate:

accelerate config

Depending on your computer's capabilities, you can choose either single or multi-GPU training.

2.1 Training with Reward Model (Quantifiable Objectives)

To conduct experiments involving a reward model, you can execute the following command:

accelerate launch scripts/rm/train_d3po.py

You can modify the prompt function and reward function in config/base.py to suit various tasks. For instance, you can employ ImageReward as the reward model and apply our method to enhance human preferences for images. If you want to reproduce our experiments, you can run train_ddpo.py and train_dpok.py to get the results.

2.2 Training without Reward Model

The training process without a reward model consists of two steps: sampling and training. First, run this command to generate image samples:

accelerate launch scripts/sample.py

The above command will generate a large number of image samples and save information such as each image's latent representation and prompt. The generated data will be stored in the /data directory. Subsequently, based on human feedback, annotations can be applied to the generated images. For this purpose, we deployed a website using sd-webui-infinite-image-browsing, where images can be annotated on a website.

After organizing human feedback results into a JSON file, you'll need to modify sample_path in config/base.py to the directory containing the image samples and adjust json_path to the directory of the JSON file. Then, execute the following command to proceed with the training:

accelerate launch scripts/train.py

The model will be fine-tuned during training based on human feedback, aiming to achieve the desired results. We conducted experiments to reduce image distortions, enhance image security, and perform prompt-image alignment. You can customize additional fine-tuning tasks based on your specific needs. The dataset for the image distortion experiments can be downloaded here. Additionally, images generated from various fine-tuning methods in the prompt-image alignment experiment, along with human evaluations, can also be found in this repository.

Citation

@article{yang2023using,
  title={Using Human Feedback to Fine-tune Diffusion Models without Any Reward Model},
  author={Yang, Kai and Tao, Jian and Lyu, Jiafei and Ge, Chunjiang and Chen, Jiaxin and Li, Qimai and Shen, Weihan and Zhu, Xiaolong and Li, Xiu},
  journal={arXiv preprint arXiv:2311.13231},
  year={2023}
}

Name		Name	Last commit message	Last commit date
Latest commit History 37 Commits
config		config
d3po_pytorch		d3po_pytorch
figures		figures
scripts		scripts
LICENSE		LICENSE
README.md		README.md
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

config

config

d3po_pytorch

d3po_pytorch

figures

figures

scripts

scripts

LICENSE

LICENSE

README.md

README.md

setup.py

setup.py

Repository files navigation

Direct Preference for Denoising Diffusion Policy Optimization (D3PO)

1. Requirements

2. Usage

2.1 Training with Reward Model (Quantifiable Objectives)

2.2 Training without Reward Model

Citation

About

Releases

Packages

Contributors 2

Languages

License

yk7333/d3po

Folders and files

Latest commit

History

Repository files navigation

Direct Preference for Denoising Diffusion Policy Optimization (D3PO)

1. Requirements

2. Usage

2.1 Training with Reward Model (Quantifiable Objectives)

2.2 Training without Reward Model

Citation

About

Topics

Resources

License

Stars

Watchers

Forks

Languages