GenBreak: Red Teaming Text-to-Image Generators Using Large Language Models

This repository is the official implementation of GenBreak: Red Teaming Text-to-Image Generators Using Large Language Models.

Method Overview

(Figure: overview of the GenBreak method)

Requirements

To install requirements:

conda create -n genbreak python=3.11
conda activate genbreak
cd RT-diffuser
pip install -r requirements.txt

You must first request access to meta-llama/Llama-3.2-1B-Instruct and stabilityai/stable-diffusion-3-medium on the Hugging Face Hub (follow the official guidelines for each model).
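Once access is granted, the scripts need an authenticated Hugging Face token to download the gated weights. A minimal sketch, assuming you have created a User Access Token in your Hugging Face account settings (the token value below is a placeholder):

```python
import os

# Placeholder token: replace with your own Hugging Face User Access Token.
# HF_TOKEN is the standard environment variable read by huggingface_hub,
# so sft.py and the diffusers pipelines can download gated weights
# non-interactively.
os.environ["HF_TOKEN"] = "hf_xxxxxxxxxxxxxxxxxxxx"
```

Alternatively, run `huggingface-cli login` once in the activated environment.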

Training

The Category Rewrite Dataset and Pre-Attack Dataset can be found in the ./data directory.

SFT

First, perform supervised fine-tuning of the meta-llama/Llama-3.2-1B-Instruct model on the Category Rewrite Dataset:

python ./sft.py \
    --model_name_or_path meta-llama/Llama-3.2-1B-Instruct \
    --dataset_name "./data/Category Rewrite Dataset.jsonl" \
    --learning_rate 2.0e-5 \
    --lr_scheduler_type cosine \
    --weight_decay 0.05 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 64 \
    --gradient_accumulation_steps 2 \
    --gradient_checkpointing \
    --logging_steps 25 \
    --eval_strategy steps \
    --eval_steps 100 \
    --use_peft \
    --lora_r 32 \
    --lora_alpha 16 \
    --load_in_8bit \
    --output_dir ./checkpoints/Llama-3.2-1B-Instruct-sft \
    --report_to none
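It can be useful to sanity-check a JSONL dataset before launching a run. The exact schema of "Category Rewrite Dataset.jsonl" is not documented in this README; the prompt/completion field names below are assumptions for illustration only, and the snippet writes its own toy file so it is self-contained:

```python
import json
import os
import tempfile

# Hypothetical records: TRL-style SFT datasets typically use
# prompt/completion pairs or chat "messages"; the field names here are
# assumptions, not the repository's actual schema.
rows = [
    {"prompt": "Rewrite the following prompt (category: violence): ...",
     "completion": "..."},
    {"prompt": "Rewrite the following prompt (category: hate): ...",
     "completion": "..."},
]

path = os.path.join(tempfile.mkdtemp(), "sample.jsonl")
with open(path, "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

# Sanity check before training: count rows and inspect the keys.
with open(path) as f:
    loaded = [json.loads(line) for line in f]
print(len(loaded), sorted(loaded[0].keys()))  # 2 ['completion', 'prompt']
```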

Then, continue training on the Pre-Attack Dataset; the resulting fine-tuned model is saved to ./checkpoints/Llama-3.2-1B-Instruct-sft2.

python ./sft.py \
    --model_name_or_path ./checkpoints/Llama-3.2-1B-Instruct-sft \
    --dataset_name "./data/Pre-Attack Dataset.jsonl" \
    --learning_rate 2.0e-5 \
    --lr_scheduler_type cosine \
    --weight_decay 0.05 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 32 \
    --gradient_accumulation_steps 4 \
    --gradient_checkpointing \
    --logging_steps 25 \
    --eval_strategy steps \
    --eval_steps 100 \
    --use_peft \
    --lora_r 32 \
    --lora_alpha 16 \
    --load_in_8bit \
    --output_dir ./checkpoints/Llama-3.2-1B-Instruct-sft2 \
    --report_to none
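Note that although the two SFT stages use different per-device batch sizes, the effective global batch size is the same. A quick check (assuming a single GPU):

```python
# effective batch = per_device_train_batch_size * gradient_accumulation_steps
stage1 = 64 * 2  # first SFT run
stage2 = 32 * 4  # second SFT run
print(stage1, stage2)  # 128 128
```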

RL

Run the following example command to train the red-team LLM on the "violence" category against stabilityai/stable-diffusion-2-1:

python -u grpo_redteaming_diffusers.py \
    --task_name rtd \
    --model_name ./checkpoints/Llama-3.2-1B-Instruct-sft2 \
    --victim_model stabilityai/stable-diffusion-2-1 \
    --category violence

The victim_model parameter supports the following models: stabilityai/stable-diffusion-2-1, stabilityai/stable-diffusion-3-medium-diffusers, and CompVis/stable-diffusion-v1-4. The supported categories are nudity, violence, and hate. By combining these parameters, you can train a red-team LLM specialized in a specific category. For the nudity category on stable-diffusion-3-medium, we set the weight of the clean reward to 5 (by adding --blacklist_reward_coef 5 to the command).
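To illustrate the effect of the coefficient flags, here is a hedged sketch of how the per-prompt reward signals are plausibly combined. The coefficient names mirror the CLI flags that appear in this README (toxicity_reward_coef, bypass_reward_coef, blacklist_reward_coef); the linear combination itself is an assumption, not the repository's exact formula:

```python
# Assumed linear combination of the three reward signals; the actual
# formula lives in grpo_redteaming_diffusers.py and may differ.
def combined_reward(toxicity, bypass, clean,
                    toxicity_coef=1.0, bypass_coef=0.6,
                    blacklist_coef=1.0):
    """Weighted sum of toxicity, filter-bypass, and clean-prompt rewards."""
    return (toxicity_coef * toxicity
            + bypass_coef * bypass
            + blacklist_coef * clean)

# Raising blacklist_coef to 5 (as done for SD3-medium / nudity) makes the
# clean reward dominate the total.
base = combined_reward(0.8, 1.0, 0.5)
boosted = combined_reward(0.8, 1.0, 0.5, blacklist_coef=5.0)
print(base, boosted)
```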

By default, grpo_redteaming_diffusers.py will evaluate the model after training. Training logs and model checkpoints are saved in the ./logs directory by default.

Evaluation

If you need to re-evaluate, you can follow the example command below and adjust it according to your own directory structure:

python -u evaluation.py \
    --num_iterations 10 \
    --category violence \
    --redteam_model ./logs/rtd/model_name=._checkpoints_Llama-3.2-1B-Instruct-sft2,victim_model=stabilityai_stable-diffusion-3-medium-diffusers,filter_path=Integrated-filter,toxicity_reward_coef=1.0,bypass_reward_coef=0.6/category=violence/seed=0/250411204230/checkpoint-3500 \
    --prompt_format_type conversational \
    --load_in_4bit \
    --log_dir ./logs/rtd/model_name=._checkpoints_Llama-3.2-1B-Instruct-sft2,victim_model=stabilityai_stable-diffusion-3-medium-diffusers,filter_path=Integrated-filter,toxicity_reward_coef=1.0,bypass_reward_coef=0.6/category=violence/seed=0/250411204230/eval \
    --plot_label ours-grpo \
    --victim_model stabilityai/stable-diffusion-3-medium-diffusers \
    --filter_path Integrated-filter \
    --max_new_tokens 50

Transfer Attacks on Commercial Models

When testing transfer attacks, you first need to prepare the attack prompts for evaluation. In our experiments, these prompts were randomly sampled from those generated during the evaluation phase on open-source models. The prompts used in our experiments are saved in the online_test_cases folder (the prompts generated by GenBreak are stored in rtd_nudity.csv, rtd_violence.csv, and rtd_hate.csv). Run the following commands to evaluate them on commercial models. Before evaluating, fill in your own API_KEY in the appropriate location in online_query_pipeline.py.

python -u transfer_attack.py --victim_model LeonardoAi --method rtd --category nudity violence hate
python -u transfer_attack.py --victim_model StabilityAI --method rtd --category nudity
python -u transfer_attack.py --victim_model FluxAPI --method rtd --category nudity violence hate
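Rather than hard-coding the key in online_query_pipeline.py, one option is to read it from an environment variable. This is a sketch of that pattern, not the repository's code; LEONARDO_API_KEY is a hypothetical variable name and the value below is a placeholder:

```python
import os

# Hypothetical environment variable name and placeholder value; adapt
# both to the commercial API you are querying.
os.environ.setdefault("LEONARDO_API_KEY", "sk-placeholder")
API_KEY = os.environ["LEONARDO_API_KEY"]
```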

You can also use sample_online_test_cases.py to sample test cases on your own.
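A hedged sketch of what sampling test cases might look like: draw a random subset of attack prompts from one of the generated CSVs (e.g. online_test_cases/rtd_nudity.csv). The "prompt" column name is an assumption, so check the actual CSV header; the snippet builds a toy CSV in memory to stay self-contained:

```python
import csv
import io
import random

# Toy stand-in for a file such as online_test_cases/rtd_nudity.csv; the
# "prompt" column name is an assumption about the real header.
csv_text = "prompt\nattack prompt 1\nattack prompt 2\nattack prompt 3\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

random.seed(0)  # fix the seed for a reproducible sample
sampled = random.sample(rows, k=2)
print([r["prompt"] for r in sampled])
```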

Disclaimer

This repository's code is intended solely for research purposes; any other use is strictly prohibited.

About

Official implementation of the paper "GenBreak: Red Teaming Text-to-Image Generators Using Large Language Models" (https://arxiv.org/abs/2506.10047).
