This repo contains code for the paper "Commonsense-T2I Challenge: Can Text-to-Image Generation Models Understand Commonsense?" [COLM 2024]
🌐 Homepage | 🤗 Dataset | 📑 Paper | 💻 Code | 📖 arXiv | 𝕏 Twitter
🔥[2024-08-12]: Code and output visualizations are released!
🔥[2024-08-05]: Added Stable Diffusion 3 and Flux models for comparison.
🔥[2024-07-10]: Commonsense-T2I is accepted to COLM 2024 with review scores of 8/8/7/7 🎉
🔥[2024-06-13]: Released the website.
We present a novel task and benchmark, Commonsense-T2I, for evaluating the ability of text-to-image (T2I) generation models to produce images that fit real-life commonsense.
Commonsense-T2I presents an adversarial challenge, providing pairwise text prompts along with expected outputs.
- Given two adversarial text prompts that contain an identical set of action words with minor differences, such as "a lightbulb without electricity" vs. "a lightbulb with electricity", we evaluate whether T2I models can conduct visual commonsense reasoning, e.g., produce images that fit "The lightbulb is unlit" vs. "The lightbulb is lit", respectively.
- The dataset is carefully hand-curated by experts and annotated with fine-grained labels, such as commonsense type and likelihood of the expected outputs, to assist in analyzing model behavior. We benchmark a variety of state-of-the-art (SOTA) T2I models and surprisingly find that there is still a large gap between image synthesis and real-life photos: even DALL-E 3 achieves only 48.92% accuracy on Commonsense-T2I, and Stable Diffusion XL only 24.92%.
- Our experiments show that GPT-enriched prompts cannot solve this challenge, and we include a detailed analysis of possible reasons for this deficiency.
from datasets import load_dataset

dataset_name = 'CommonsenseT2I/CommonsensenT2I'
data = load_dataset(dataset_name)['train']
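Each entry pairs two adversarial prompts with the descriptions the generated images are expected to match. Below is a minimal sketch for inspecting one pair; the field names used here (`prompt1`, `prompt2`, `description1`, `description2`) are assumptions, so please check the dataset card for the exact schema.

```python
# Inspect one adversarial pair (field names are assumptions; see the dataset card).
example = data[0]
print("Prompt 1:  ", example["prompt1"])
print("Expected 1:", example["description1"])
print("Prompt 2:  ", example["prompt2"])
print("Expected 2:", example["description2"])
```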
# We include image generation code that uses Hugging Face checkpoints
python generate_images.py
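The checkpoints and output layout are configured inside `generate_images.py`. As a rough illustration only, a diffusers-based loop over the prompt pairs might look like the sketch below; the model id, output paths, and field names are assumptions, not the script's actual settings.

```python
# Illustrative sketch: generate one image per prompt with a Hugging Face checkpoint.
# Model id, output layout, and field names are assumptions, not generate_images.py settings.
import os
import torch
from datasets import load_dataset
from diffusers import DiffusionPipeline

data = load_dataset("CommonsenseT2I/CommonsensenT2I")["train"]
pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

os.makedirs("outputs", exist_ok=True)
for idx, example in enumerate(data):
    for side in ("prompt1", "prompt2"):           # assumed field names
        image = pipe(example[side]).images[0]
        image.save(f"outputs/{idx}_{side}.png")
```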
# Evaluate the generated images and calculate an overall score
python evaluate.py
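`evaluate.py` implements the evaluation described in the paper. The sketch below only illustrates the pairwise scoring rule (a sample counts as correct only when both images in a pair match their expected descriptions); the `judge` function is a hypothetical placeholder for the actual image-text evaluator, and the dictionary keys are assumptions.

```python
# Pairwise scoring sketch: a sample scores 1 only if BOTH images in the pair
# match their expected descriptions. `judge` is a hypothetical placeholder.
def judge(image_path: str, expected_description: str) -> bool:
    raise NotImplementedError("replace with the actual image-text judge")

def pairwise_score(samples) -> float:
    correct = 0
    for s in samples:
        ok1 = judge(s["image1"], s["description1"])  # assumed keys
        ok2 = judge(s["image2"], s["description2"])
        correct += int(ok1 and ok2)
    return correct / len(samples)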
# To better see the generated images, visualize the outputs
python visualize.py
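`visualize.py` writes an HTML page similar to the example linked below. A bare-bones version that tiles each prompt pair next to its two generated images might look like this sketch; the image paths, field names, and output file name are assumptions.

```python
# Bare-bones HTML visualization: one table row per adversarial pair.
# Image paths, field names, and output file name are illustrative assumptions.
from datasets import load_dataset

data = load_dataset("CommonsenseT2I/CommonsensenT2I")["train"]
rows = []
for idx, example in enumerate(data):
    rows.append(
        f"<tr><td>{example['prompt1']}</td>"
        f"<td><img src='outputs/{idx}_prompt1.png' width='256'></td>"
        f"<td>{example['prompt2']}</td>"
        f"<td><img src='outputs/{idx}_prompt2.png' width='256'></td></tr>"
    )
with open("my_visualization.html", "w") as f:
    f.write("<table>" + "".join(rows) + "</table>")
```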
An example output is provided in example_visualization_dalle.html; check it out using a web browser.
Output visualizations are provided for the text-to-image models tested in our paper, e.g., DALL-E 3 outputs, Stable Diffusion 3 outputs, and Flux model outputs. For more details, check out our paper!
- Xingyu Fu: xingyuf2@seas.upenn.edu
BibTeX:
@article{fu2024commonsenseT2I,
  title   = {Commonsense-T2I Challenge: Can Text-to-Image Generation Models Understand Commonsense?},
  author  = {Xingyu Fu and Muyu He and Yujie Lu and William Yang Wang and Dan Roth},
  journal = {arXiv preprint arXiv:2406.07546},
  year    = {2024},
}