Official Implementations "Get What You Want, Not What You Don't: Image Content Suppression for Text-to-Image Diffusion Models" (ICLR'24)

Abstract: The success of recent text-to-image diffusion models is largely due to their capacity to be guided by a complex text prompt, which enables users to precisely describe the desired content. However, these models struggle to effectively suppress the generation of undesired content, which is explicitly requested to be omitted from the generated image in the prompt. In this paper, we analyze how to manipulate the text embeddings and remove unwanted content from them. We introduce two approaches, which we refer to as soft-weighted regularization and inference-time text embedding optimization. The first regularizes the text embedding matrix and effectively suppresses the undesired content. The second method aims to further suppress the unwanted content generation of the prompt, and encourages the generation of desired content. We evaluate our method quantitatively and qualitatively on extensive experiments, validating its effectiveness. Furthermore, our method is generalizability to both the pixel-space diffusion models (i.e. DeepFloyd-IF) and the latent-space diffusion models (i.e. Stable Diffusion).

👀 Observation

Existing text-to-image models can encounter challenges in effectively suppressing the generation of the negative target. For example, when requesting an image using the prompt "a face without glasses", the diffusion models (i.e., SD) synthesize the subject without "glasses", as shown in Figure (the first column). However, when using the prompt "a man without glasses", both SD and DeepFloyd-IF models still generate the subject with "glasses" (It also happens in both Ideogram and Mijourney models, see Appendix F), as shown in Figure (the second and fifth columns). Figure (the last column) quantitatively show that SD has 0.819 DetScore for "glasses" using 1000 randomly generated images, indicating a very common failure cases in diffusion models. Also, when giving the prompt "a man", often the glasses are included, see Figure (the third and sixth columns).

💻 Requirements

The codebase is tested on

Python 3.8
PyTorch 1.12.1
Quadro RTX 3090 GPUs (24 GB VRAM) with CUDA version 11.7

environment or python libraries:

pip install -r requirements.txt

🎊 Suppression for real image

python suppress_eot_w_nulltext.py  --type Real-Image \
                                   --prompt "A man with a beard wearing glasses and a hat in blue shirt" \
                                   --image_path "./example_images/A man with a beard wearing glasses and a hat in blue shirt.jpg" \
                                   --token_indices "[[4,5],[7],[9,10],]" \
                                   --alpha "[1.,]" --cross_retain_steps "[.2,]"

You can use NPI (Negative-prompt Inversion) for faster inversion of the real image, but it may lead to a certain degree of degradation in editing quality:

python suppress_eot_w_nulltext.py  --type Real-Image --inversion NPI\
                                   --prompt "A man with a beard wearing glasses and a hat in blue shirt" \
                                   --image_path "./example_images/A man with a beard wearing glasses and a hat in blue shirt.jpg" \
                                   --token_indices "[[4,5],[7],[9,10],]" \
                                   --alpha "[1.,]" --cross_retain_steps "[.2,]"

🎊 Suppression for generated image

python suppress_eot_w_nulltext.py --type Generated-Image \
                                  --prompt "A dog in Van Gogh Style" --seed 2023 \
                                  --token_indices "[[4,5,6],]" \
                                  --alpha "[1.,]" --cross_retain_steps "[.2,]" --iter_each_step 0

Our method can suppress copyrighted materials and memorized images from pretrained text-to-image diffusion models, such as ESD, Concept-ablation and Forget-Me-Not, and it is training-free.

🪄 Additional application

1️⃣ Generating subjects for generated image (Attend-and-Excite similar results)

python suppress_eot_w_nulltext.py  --type Generated-Image \
                                   --prompt "A painting of an elephant with glasses" --seed 16 \
                                   --token_indices "[[7],]" \
                                   --alpha "[-0.001,]" --cross_retain_steps "[.2,]" --iter_each_step 0

2️⃣ Adding subjects for real image (GLIGEN similar results)

python suppress_eot_w_nulltext.py  --type Real-Image \
                                   --prompt "A car near trees with raining" \
                                   --image_path "./example_images/A car near trees.jpg" \
                                   --token_indices "[[6],]" \
                                   --alpha "[-0.001,]" --cross_retain_steps "[.2,]" --iter_each_step 0

3️⃣ Replacing subject in the real image with another

python suppress_eot_w_nulltext.py  --type Real-Image \
                                   --prompt "a robot is jumping out of a jeep" \
                                   --image_path "./example_images/a man is jumping out of a jeep.jpg" \
                                   --token_indices "[[2],]" \
                                   --alpha "[-0.001,]" --cross_retain_steps "[.5,]" --iter_each_step 0

4️⃣ (Top) Cracks removal results. (Middle) Rain removal for synthetic rainy image. (Bottom) Rain removal for real-world rainy image

python suppress_eot_w_nulltext.py  --type Real-Image \
                                   --prompt "A photo with cracks" \
                                   --image_path "./example_images/A photo with cracks.jpg" \
                                   --token_indices "[[4],]" \
                                   --alpha "[1,]" --cross_retain_steps "[.5,]"

⚗️Work in progress

📐 Quantitative comparison

Following Inst-Inpaint, we use FID and CLIP Accuracy to evaluate the accuracy of the removal operation on the GQA-Inpaint dataset. We achieve superior suppression results and higher CLIP Accuracy scores on the GQA-Inpaint dataset.

There are 18883 pairs of test data in the GQA-Inpaint dataset, including source image, target image, and prompt. Inst-Inpaint attempts to remove objects from the source image based on the provided prompt as an instruction (e.g., "Remove the airplane at the center"). We suppress the noun immediately following "remove" in the instruction (e.g., "airplane") and use the remaining part, deleting the word "remove" at the beginning of the instruction to form our input prompt (e.g., "The airplane at the center").

Outputs for GQA-Inpaint in Inst-Inpaint and Ours.

🤝🏻 Citation

@article{li2024get,
  title={Get What You Want, Not What You Don't: Image Content Suppression for Text-to-Image Diffusion Models},
  author={Li, Senmao and van de Weijer, Joost and Hu, Taihang and Khan, Fahad Shahbaz and Hou, Qibin and Wang, Yaxing and Yang, Jian},
  journal={arXiv preprint arXiv:2402.05375},
  year={2024}
}

Contact

Should you have any questions, please contact senmaonk@gmail.com

License

The code in this repository is licensed under the MIT License for academic and other non-commercial uses.

Acknowledgment: This code is based on the P2P, Null-text repository.

Name		Name	Last commit message	Last commit date
Latest commit History 64 Commits
docs		docs
example_images		example_images
losses		losses
utils		utils
README.md		README.md
requirements.txt		requirements.txt
suppress_eot_w_nulltext.py		suppress_eot_w_nulltext.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

docs

docs

example_images

example_images

losses

losses

utils

utils

README.md

README.md

requirements.txt

requirements.txt

suppress_eot_w_nulltext.py

suppress_eot_w_nulltext.py

Repository files navigation

Official Implementations "Get What You Want, Not What You Don't: Image Content Suppression for Text-to-Image Diffusion Models" (ICLR'24)

👀 Observation

💻 Requirements

🎊 Suppression for real image

🎊 Suppression for generated image

🪄 Additional application

1️⃣ Generating subjects for generated image (Attend-and-Excite similar results)

2️⃣ Adding subjects for real image (GLIGEN similar results)

3️⃣ Replacing subject in the real image with another

4️⃣ (Top) Cracks removal results. (Middle) Rain removal for synthetic rainy image. (Bottom) Rain removal for real-world rainy image

⚗️Work in progress

📐 Quantitative comparison

🤝🏻 Citation

Contact

License

About

Releases

Packages

Languages

sen-mao/SuppressEOT

Folders and files

Latest commit

History

Repository files navigation

Official Implementations "Get What You Want, Not What You Don't: Image Content Suppression for Text-to-Image Diffusion Models" (ICLR'24)

👀 Observation

💻 Requirements

🎊 Suppression for real image

🎊 Suppression for generated image

🪄 Additional application

1️⃣ Generating subjects for generated image (Attend-and-Excite similar results)

2️⃣ Adding subjects for real image (GLIGEN similar results)

3️⃣ Replacing subject in the real image with another

4️⃣ (Top) Cracks removal results. (Middle) Rain removal for synthetic rainy image. (Bottom) Rain removal for real-world rainy image

⚗️Work in progress

📐 Quantitative comparison

🤝🏻 Citation

Contact

License

About

Resources

Stars

Watchers

Forks

Languages