GenBreak: Red Teaming Text-to-Image Generators Using Large Language Models

This repository is the official implementation of GenBreak: Red Teaming Text-to-Image Generators Using Large Language Models.

Method Overview

(Figure: overview of the GenBreak method)

Requirements

To install requirements:

conda create -n genbreak python=3.11
conda activate genbreak
cd RT-diffuser
pip install -r requirements.txt

You must first request access to meta-llama/Llama-3.2-1B-Instruct and stabilityai/stable-diffusion-3-medium on the Hugging Face Hub (follow the official guidelines for each model).
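Once access is granted, the scripts need an authenticated Hugging Face token to download the gated weights. A minimal sketch, assuming you have created a User Access Token in your Hugging Face account settings (the token value below is a placeholder):

```python
import os

# Placeholder token: replace with your own Hugging Face User Access Token.
# HF_TOKEN is the standard environment variable read by huggingface_hub,
# so sft.py and the diffusers pipelines can download gated weights
# non-interactively.
os.environ["HF_TOKEN"] = "hf_xxxxxxxxxxxxxxxxxxxx"
```

Alternatively, run `huggingface-cli login` once in the activated environment.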

Training

The Category Rewrite Dataset and Pre-Attack Dataset can be found in the ./data directory.

SFT

First, perform supervised fine-tuning of the meta-llama/Llama-3.2-1B-Instruct model on the Category Rewrite Dataset:

python ./sft.py \
    --model_name_or_path meta-llama/Llama-3.2-1B-Instruct \
    --dataset_name "./data/Category Rewrite Dataset.jsonl" \
    --learning_rate 2.0e-5 \
    --lr_scheduler_type cosine \
    --weight_decay 0.05 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 64 \
    --gradient_accumulation_steps 2 \
    --gradient_checkpointing \
    --logging_steps 25 \
    --eval_strategy steps \
    --eval_steps 100 \
    --use_peft \
    --lora_r 32 \
    --lora_alpha 16 \
    --load_in_8bit \
    --output_dir ./checkpoints/Llama-3.2-1B-Instruct-sft \
    --report_to none
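It can be useful to sanity-check a JSONL dataset before launching a run. The exact schema of "Category Rewrite Dataset.jsonl" is not documented in this README; the prompt/completion field names below are assumptions for illustration only, and the snippet writes its own toy file so it is self-contained:

```python
import json
import os
import tempfile

# Hypothetical records: TRL-style SFT datasets typically use
# prompt/completion pairs or chat "messages"; the field names here are
# assumptions, not the repository's actual schema.
rows = [
    {"prompt": "Rewrite the following prompt (category: violence): ...",
     "completion": "..."},
    {"prompt": "Rewrite the following prompt (category: hate): ...",
     "completion": "..."},
]

path = os.path.join(tempfile.mkdtemp(), "sample.jsonl")
with open(path, "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

# Sanity check before training: count rows and inspect the keys.
with open(path) as f:
    loaded = [json.loads(line) for line in f]
print(len(loaded), sorted(loaded[0].keys()))  # 2 ['completion', 'prompt']
```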

Then, continue training on the Pre-Attack Dataset; the resulting fine-tuned model is saved to ./checkpoints/Llama-3.2-1B-Instruct-sft2.

python ./sft.py \
    --model_name_or_path ./checkpoints/Llama-3.2-1B-Instruct-sft \
    --dataset_name "./data/Pre-Attack Dataset.jsonl" \
    --learning_rate 2.0e-5 \
    --lr_scheduler_type cosine \
    --weight_decay 0.05 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 32 \
    --gradient_accumulation_steps 4 \
    --gradient_checkpointing \
    --logging_steps 25 \
    --eval_strategy steps \
    --eval_steps 100 \
    --use_peft \
    --lora_r 32 \
    --lora_alpha 16 \
    --load_in_8bit \
    --output_dir ./checkpoints/Llama-3.2-1B-Instruct-sft2 \
    --report_to none
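Note that although the two SFT stages use different per-device batch sizes, the effective global batch size is the same. A quick check (assuming a single GPU):

```python
# effective batch = per_device_train_batch_size * gradient_accumulation_steps
stage1 = 64 * 2  # first SFT run
stage2 = 32 * 4  # second SFT run
print(stage1, stage2)  # 128 128
```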

RL

Run the following example command to train the red-team LLM on the "violence" category against stabilityai/stable-diffusion-2-1:

python -u grpo_redteaming_diffusers.py \
    --task_name rtd \
    --model_name ./checkpoints/Llama-3.2-1B-Instruct-sft2 \
    --victim_model stabilityai/stable-diffusion-2-1 \
    --category violence

The victim_model parameter supports the following models: stabilityai/stable-diffusion-2-1, stabilityai/stable-diffusion-3-medium-diffusers, and CompVis/stable-diffusion-v1-4. The supported categories are nudity, violence, and hate. By combining these parameters, you can train a red-team LLM specialized in a specific category. For the nudity category on stable-diffusion-3-medium, we set the weight of the clean reward to 5 (by adding --blacklist_reward_coef 5 to the command).
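To illustrate the effect of the coefficient flags, here is a hedged sketch of how the per-prompt reward signals are plausibly combined. The coefficient names mirror the CLI flags that appear in this README (toxicity_reward_coef, bypass_reward_coef, blacklist_reward_coef); the linear combination itself is an assumption, not the repository's exact formula:

```python
# Assumed linear combination of the three reward signals; the actual
# formula lives in grpo_redteaming_diffusers.py and may differ.
def combined_reward(toxicity, bypass, clean,
                    toxicity_coef=1.0, bypass_coef=0.6,
                    blacklist_coef=1.0):
    """Weighted sum of toxicity, filter-bypass, and clean-prompt rewards."""
    return (toxicity_coef * toxicity
            + bypass_coef * bypass
            + blacklist_coef * clean)

# Raising blacklist_coef to 5 (as done for SD3-medium / nudity) makes the
# clean reward dominate the total.
base = combined_reward(0.8, 1.0, 0.5)
boosted = combined_reward(0.8, 1.0, 0.5, blacklist_coef=5.0)
print(base, boosted)
```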

By default, grpo_redteaming_diffusers.py will evaluate the model after training. Training logs and model checkpoints are saved in the ./logs directory by default.

Evaluation

If you need to re-evaluate, you can follow the example command below and adjust it according to your own directory structure:

python -u evaluation.py \
    --num_iterations 10 \
    --category violence \
    --redteam_model ./logs/rtd/model_name=._checkpoints_Llama-3.2-1B-Instruct-sft2,victim_model=stabilityai_stable-diffusion-3-medium-diffusers,filter_path=Integrated-filter,toxicity_reward_coef=1.0,bypass_reward_coef=0.6/category=violence/seed=0/250411204230/checkpoint-3500 \
    --prompt_format_type conversational \
    --load_in_4bit \
    --log_dir ./logs/rtd/model_name=._checkpoints_Llama-3.2-1B-Instruct-sft2,victim_model=stabilityai_stable-diffusion-3-medium-diffusers,filter_path=Integrated-filter,toxicity_reward_coef=1.0,bypass_reward_coef=0.6/category=violence/seed=0/250411204230/eval \
    --plot_label ours-grpo \
    --victim_model stabilityai/stable-diffusion-3-medium-diffusers \
    --filter_path Integrated-filter \
    --max_new_tokens 50

Transfer Attacks on Commercial Models

When testing transfer attacks, you first need to prepare the attack prompts for evaluation. In our experiments, these prompts were randomly sampled from those generated during the evaluation phase on open-source models. The prompts used in our experiments are saved in the online_test_cases folder (the prompts generated by GenBreak are stored in rtd_nudity.csv, rtd_violence.csv, and rtd_hate.csv). Run the following commands to evaluate them on commercial models. Before evaluating, fill in your own API_KEY in the appropriate location in online_query_pipeline.py.

python -u transfer_attack.py --victim_model LeonardoAi --method rtd --category nudity violence hate
python -u transfer_attack.py --victim_model StabilityAI --method rtd --category nudity
python -u transfer_attack.py --victim_model FluxAPI --method rtd --category nudity violence hate
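Rather than hard-coding the key in online_query_pipeline.py, one option is to read it from an environment variable. This is a sketch of that pattern, not the repository's code; LEONARDO_API_KEY is a hypothetical variable name and the value below is a placeholder:

```python
import os

# Hypothetical environment variable name and placeholder value; adapt
# both to the commercial API you are querying.
os.environ.setdefault("LEONARDO_API_KEY", "sk-placeholder")
API_KEY = os.environ["LEONARDO_API_KEY"]
```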

You can also use sample_online_test_cases.py to sample test cases on your own.
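A hedged sketch of what sampling test cases might look like: draw a random subset of attack prompts from one of the generated CSVs (e.g. online_test_cases/rtd_nudity.csv). The "prompt" column name is an assumption, so check the actual CSV header; the snippet builds a toy CSV in memory to stay self-contained:

```python
import csv
import io
import random

# Toy stand-in for a file such as online_test_cases/rtd_nudity.csv; the
# "prompt" column name is an assumption about the real header.
csv_text = "prompt\nattack prompt 1\nattack prompt 2\nattack prompt 3\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

random.seed(0)  # fix the seed for a reproducible sample
sampled = random.sample(rows, k=2)
print([r["prompt"] for r in sampled])
```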

Disclaimer

This repository's code is intended solely for research purposes; any other use is strictly prohibited.

About

Official implementation of the paper "GenBreak: Red Teaming Text-to-Image Generators Using Large Language Models" (https://arxiv.org/abs/2506.10047).
