
Commonsense-T2I Challenge: Can Text-to-Image Generation Models Understand Commonsense?

This repo contains code for the paper "Commonsense-T2I Challenge: Can Text-to-Image Generation Models Understand Commonsense?" [COLM 2024]

🌐 Homepage | 🤗 Dataset | 📑 Paper | 💻 Code | 📖 arXiv | 𝕏 Twitter

🔔News

🔥[2024-08-12]: Code and output visualizations are released!

🔥[2024-08-05]: Added Stable Diffusion 3 and Flux models for comparison.

🔥[2024-07-10]: Commonsense-T2I is accepted to COLM 2024 with review scores of 8/8/7/7 🎉

🔥[2024-06-13]: Released the website.

Introduction

We present a novel task and benchmark for evaluating the ability of text-to-image (T2I) generation models to produce images that fit commonsense in real life, which we call Commonsense-T2I. Commonsense-T2I presents an adversarial challenge, providing pairwise text prompts along with expected outputs.

  • Given two adversarial text prompts containing an identical set of action words with minor differences, such as "a lightbulb without electricity" vs. "a lightbulb with electricity", we evaluate whether T2I models can conduct visual commonsense reasoning, e.g., produce images that fit "The lightbulb is unlit" vs. "The lightbulb is lit" correspondingly (see the sketch after this list).

  • The dataset is carefully hand-curated by experts and annotated with fine-grained labels, such as commonsense type and likelihood of the expected outputs, to assist in analyzing model behavior. We benchmark a variety of state-of-the-art (SOTA) T2I models and surprisingly find that there is still a large gap between image synthesis and real-life photos: even the DALL-E 3 model achieves only 48.92% accuracy on Commonsense-T2I, and the Stable Diffusion XL model achieves only 24.92%.

  • Our experiments show that GPT-enriched prompts cannot solve this challenge, and we include a detailed analysis of possible reasons for this deficiency.
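To make the task format concrete, here is a rough sketch of one pairwise test case. The field names below are illustrative only, not the dataset's exact schema:

# Illustrative only: the field names here are hypothetical,
# not necessarily the dataset's exact column names.
pair = {
    "prompt1": "a lightbulb without electricity",
    "prompt2": "a lightbulb with electricity",
    "description1": "The lightbulb is unlit",  # expected output for prompt1
    "description2": "The lightbulb is lit",    # expected output for prompt2
    "category": "physical laws",               # commonsense type (illustrative label)
}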

Load Dataset

from datasets import load_dataset

# All examples live in the 'train' split
dataset_name = 'CommonsenseT2I/CommonsensenT2I'
data = load_dataset(dataset_name)['train']
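A quick way to sanity-check the loaded split, without assuming any particular column names:

print(len(data))          # number of test cases
print(data.column_names)  # field names of each record
print(data[0])            # inspect the first record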

Usage

# We include image generation code that uses Hugging Face checkpoints
python generate_images.py

# Evaluate the generated images and calculate an overall score
python evaluate.py

# To better see the generated images, visualize the outputs
python visualize.py
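As a rough illustration of what a generation loop over the benchmark can look like, here is a minimal sketch using the diffusers library with Stable Diffusion XL as an example checkpoint; it is not the repo's exact generate_images.py:

import os
import torch
from datasets import load_dataset
from diffusers import DiffusionPipeline

# Minimal sketch, assuming a CUDA GPU is available
data = load_dataset('CommonsenseT2I/CommonsensenT2I')['train']
pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
)
pipe.to("cuda")

os.makedirs("outputs", exist_ok=True)
for i, row in enumerate(data):
    # "prompt1" is an illustrative field name; check data.column_names for the real ones
    image = pipe(row["prompt1"]).images[0]
    image.save(f"outputs/{i}_prompt1.png")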

An example output is provided in example_visualization_dalle.html; open it in a web browser to view it.
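For the overall score computed by evaluate.py, the paper's pairwise setup suggests that a sample counts as correct only when both images of a pair match their expected outputs. A minimal sketch of such a metric, assuming you already have one boolean judgment per generated image:

def pairwise_accuracy(judgments):
    """judgments: list of (image1_correct, image2_correct) booleans, one tuple per pair."""
    return sum(a and b for a, b in judgments) / len(judgments)

# Example: one fully correct pair out of two gives 50%
print(pairwise_accuracy([(True, True), (True, False)]))  # 0.5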

Saved outputs

Output visualizations are provided for the text-to-image models tested in our paper, e.g., DALL-E 3 outputs, Stable Diffusion 3 outputs, and Flux model outputs. For more details, check out our paper!

Contact

Citation

BibTeX:

@article{fu2024commonsenseT2I,
  title   = {Commonsense-T2I Challenge: Can Text-to-Image Generation Models Understand Commonsense?},
  author  = {Xingyu Fu and Muyu He and Yujie Lu and William Yang Wang and Dan Roth},
  journal = {arXiv preprint arXiv:2406.07546},
  year    = {2024},
}