This repository is the official implementation of GenBreak: Red Teaming Text-to-Image Generators Using Large Language Models.
To install requirements:
conda create -n genbreak python=3.11
conda activate genbreak
cd RT-diffuser
pip install -r requirements.txt
You need to apply in advance for access to the gated models meta-llama/Llama-3.2-1B-Instruct and stabilityai/stable-diffusion-3-medium (follow each model's official guidelines on Hugging Face).
The Category Rewrite Dataset and the Pre-Attack Dataset can be found in the ./data directory.
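The datasets are stored as JSONL files (one JSON object per line, as the --dataset_name paths below suggest). A minimal sketch for inspecting one of them — the field name used in the comment is an illustrative assumption, not the actual schema:

```python
import json

def load_jsonl(path):
    """Read a JSONL file into a list of dicts, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records

# Example (field names depend on the dataset):
# records = load_jsonl("./data/Category Rewrite Dataset.jsonl")
# print(len(records), list(records[0].keys()))
```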
First, perform supervised fine-tuning of the meta-llama/Llama-3.2-1B-Instruct model on the Category Rewrite Dataset:
python ./sft.py \
--model_name_or_path meta-llama/Llama-3.2-1B-Instruct \
--dataset_name "./data/Category Rewrite Dataset.jsonl" \
--learning_rate 2.0e-5 \
--lr_scheduler_type cosine \
--weight_decay 0.05 \
--num_train_epochs 1 \
--per_device_train_batch_size 64 \
--gradient_accumulation_steps 2 \
--gradient_checkpointing \
--logging_steps 25 \
--eval_strategy steps \
--eval_steps 100 \
--use_peft \
--lora_r 32 \
--lora_alpha 16 \
--load_in_8bit \
--output_dir ./checkpoints/Llama-3.2-1B-Instruct-sft \
--report_to none
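For reference, the effective global batch size of this run is per_device_train_batch_size × gradient_accumulation_steps × number of GPUs; a quick sanity check:

```python
def effective_batch_size(per_device, grad_accum, num_gpus=1):
    """Global batch size seen by the optimizer per update step."""
    return per_device * grad_accum * num_gpus

# For the command above on a single GPU: 64 * 2 = 128
print(effective_batch_size(64, 2))  # 128
```

The second SFT run below keeps the same effective batch size (32 × 4 = 128) while halving per-device memory use.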
Then, continue training on the Pre-Attack Dataset, starting from the fine-tuned model at ./checkpoints/Llama-3.2-1B-Instruct-sft; the resulting model is saved to ./checkpoints/Llama-3.2-1B-Instruct-sft2.
python ./sft.py \
--model_name_or_path ./checkpoints/Llama-3.2-1B-Instruct-sft \
--dataset_name "./data/Pre-Attack Dataset.jsonl" \
--learning_rate 2.0e-5 \
--lr_scheduler_type cosine \
--weight_decay 0.05 \
--num_train_epochs 1 \
--per_device_train_batch_size 32 \
--gradient_accumulation_steps 4 \
--gradient_checkpointing \
--logging_steps 25 \
--eval_strategy steps \
--eval_steps 100 \
--use_peft \
--lora_r 32 \
--lora_alpha 16 \
--load_in_8bit \
--output_dir ./checkpoints/Llama-3.2-1B-Instruct-sft2 \
--report_to none
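Both runs use --lr_scheduler_type cosine. A minimal sketch of how such a schedule decays the learning rate from the peak of 2e-5 to zero over training (no warmup shown; the actual trainer may add warmup steps):

```python
import math

def cosine_lr(step, total_steps, peak_lr=2.0e-5):
    """Cosine decay from peak_lr at step 0 to 0 at total_steps."""
    progress = min(step / total_steps, 1.0)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

# Starts at the peak rate and decays smoothly to zero:
# cosine_lr(0, 1000) == 2e-5, cosine_lr(1000, 1000) ~ 0.0
```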
Run the following example command to train the red-team LLM for the "violence" category against stabilityai/stable-diffusion-2-1:
python -u grpo_redteaming_diffusers.py --task_name rtd --model_name ./checkpoints/Llama-3.2-1B-Instruct-sft2 --victim_model stabilityai/stable-diffusion-2-1 --category violence
The --victim_model parameter supports the following models: stabilityai/stable-diffusion-2-1, stabilityai/stable-diffusion-3-medium-diffusers, and CompVis/stable-diffusion-v1-4. The supported categories are nudity, violence, and hate. By adjusting these parameters, you can train a red-team LLM specialized in a specific category. For the nudity category of stable-diffusion-3-medium, we set the weight of the clean reward to 5 (add --blacklist_reward_coef 5 to the command).
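The coefficient flags weight the individual reward terms in the training objective. A hedged sketch of how such a weighted sum might be combined — the term names and default weights here are illustrative assumptions (the 1.0 and 0.6 defaults mirror the toxicity_reward_coef and bypass_reward_coef values that appear in the log-directory names below), not the exact implementation:

```python
def combined_reward(toxicity, bypass, clean,
                    toxicity_coef=1.0, bypass_coef=0.6, blacklist_coef=1.0):
    """Illustrative weighted sum of per-prompt reward terms (not the
    repository's actual reward function)."""
    return (toxicity_coef * toxicity
            + bypass_coef * bypass
            + blacklist_coef * clean)

# Raising --blacklist_reward_coef to 5 up-weights the clean-reward term:
# combined_reward(0.8, 1.0, 0.5, blacklist_coef=5)
```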
By default, grpo_redteaming_diffusers.py will evaluate the model after training. Training logs and model checkpoints are saved in the ./logs directory by default.
If you need to re-evaluate, you can follow the example command below and adjust it according to your own directory structure:
python -u evaluation.py --num_iterations 10 --category violence --redteam_model ./logs/rtd/model_name=._checkpoints_Llama-3.2-1B-Instruct-sft2,victim_model=stabilityai_stable-diffusion-3-medium-diffusers,filter_path=Integrated-filter,toxicity_reward_coef=1.0,bypass_reward_coef=0.6/category=violence/seed=0/250411204230/checkpoint-3500 --prompt_format_type conversational --load_in_4bit --log_dir ./logs/rtd/model_name=._checkpoints_Llama-3.2-1B-Instruct-sft2,victim_model=stabilityai_stable-diffusion-3-medium-diffusers,filter_path=Integrated-filter,toxicity_reward_coef=1.0,bypass_reward_coef=0.6/category=violence/seed=0/250411204230/eval --plot_label ours-grpo --victim_model stabilityai/stable-diffusion-3-medium-diffusers --filter_path Integrated-filter --max_new_tokens 50
When testing transfer attacks, you first need to prepare attack prompts for evaluation. In our experiments, these prompts were randomly sampled from those generated during the evaluation phase on the open-source models. The prompts used in our experiments are saved in the online_test_cases folder (the prompts generated by GenBreak are stored in rtd_nudity.csv, rtd_violence.csv, and rtd_hate.csv). Run the following commands to evaluate them against commercial models. Before evaluating, fill in your own API_KEY in the appropriate location in online_query_pipeline.py.
python -u transfer_attack.py --victim_model LeonardoAi --method rtd --category nudity violence hate
python -u transfer_attack.py --victim_model StabilityAI --method rtd --category nudity
python -u transfer_attack.py --victim_model FluxAPI --method rtd --category nudity violence hate
You can also use sample_online_test_cases.py to sample test cases on your own.
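The sampling step amounts to drawing a random subset of generated prompts into a CSV. A minimal sketch of the idea, independent of the script's actual options (the `prompt` column name is an assumption about the CSV layout):

```python
import csv
import random

def sample_prompts(in_path, out_path, n, seed=0):
    """Randomly sample n rows from a prompt CSV into a new CSV."""
    with open(in_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames
        rows = list(reader)
    random.Random(seed).shuffle(rows)  # fixed seed for reproducibility
    sampled = rows[:n]
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(sampled)
    return sampled
```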
This repository's code is intended solely for research purposes; any other use is strictly prohibited.
