This repo contains code for the paper "Commonsense-T2I Challenge: Can Text-to-Image Generation Models Understand Commonsense?" [COLM 2024]
🌐 Homepage | 🤗 Dataset | 📑 Paper | 💻 Code | 📖 arXiv | 𝕏 Twitter
🔥[2024-08-12]: Code and output visualizations are released!
🔥[2024-08-05]: Added Stable Diffusion 3 and Flux models for comparison.
🔥[2024-07-10]: Commonsense-T2I is accepted to COLM 2024 with review scores of 8/8/7/7 🎉
🔥[2024-06-13]: Released the website.
We present a novel task and benchmark, Commonsense-T2I, for evaluating the ability of text-to-image (T2I) generation models to produce images that fit real-life commonsense.
Commonsense-T2I presents an adversarial challenge, providing pairwise text prompts along with expected outputs.
- Given two adversarial text prompts that contain an identical set of action words with minor differences, such as "a lightbulb without electricity" vs. "a lightbulb with electricity", we evaluate whether T2I models can conduct visual commonsense reasoning, e.g., produce images that fit "The lightbulb is unlit" vs. "The lightbulb is lit", respectively.
- The dataset is carefully hand-curated by experts and annotated with fine-grained labels, such as commonsense type and likelihood of the expected outputs, to assist in analyzing model behavior. We benchmark a variety of state-of-the-art (SOTA) T2I models and surprisingly find that there is still a large gap between image synthesis and real-life photos: even DALL-E 3 achieves only 48.92% accuracy on Commonsense-T2I, and Stable Diffusion XL only 24.92%.
- Our experiments show that GPT-enriched prompts cannot solve this challenge, and we include a detailed analysis of possible reasons for this deficiency.
from datasets import load_dataset

dataset_name = 'CommonsenseT2I/CommonsensenT2I'
data = load_dataset(dataset_name)['train']
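Each entry pairs two adversarial prompts with the descriptions the generated images are expected to match. Below is a minimal sketch for inspecting one pair; the field names used here (`prompt1`, `prompt2`, `description1`, `description2`) are assumptions, so please check the dataset card for the exact schema.

```python
# Inspect one adversarial pair (field names are assumptions; see the dataset card).
example = data[0]
print("Prompt 1:  ", example["prompt1"])
print("Expected 1:", example["description1"])
print("Prompt 2:  ", example["prompt2"])
print("Expected 2:", example["description2"])
```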
# We include image generation code that uses Hugging Face checkpoints
python generate_images.py
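The checkpoints and output layout are configured inside `generate_images.py`. As a rough illustration only, a diffusers-based loop over the prompt pairs might look like the sketch below; the model id, output paths, and field names are assumptions, not the script's actual settings.

```python
# Illustrative sketch: generate one image per prompt with a Hugging Face checkpoint.
# Model id, output layout, and field names are assumptions, not generate_images.py settings.
import os
import torch
from datasets import load_dataset
from diffusers import DiffusionPipeline

data = load_dataset("CommonsenseT2I/CommonsensenT2I")["train"]
pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

os.makedirs("outputs", exist_ok=True)
for idx, example in enumerate(data):
    for side in ("prompt1", "prompt2"):           # assumed field names
        image = pipe(example[side]).images[0]
        image.save(f"outputs/{idx}_{side}.png")
```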
# Evaluate the generated images and calculate an overall score
python evaluate.py
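`evaluate.py` implements the evaluation described in the paper. The sketch below only illustrates the pairwise scoring rule (a sample counts as correct only when both images in a pair match their expected descriptions); the `judge` function is a hypothetical placeholder for the actual image-text evaluator, and the dictionary keys are assumptions.

```python
# Pairwise scoring sketch: a sample scores 1 only if BOTH images in the pair
# match their expected descriptions. `judge` is a hypothetical placeholder.
def judge(image_path: str, expected_description: str) -> bool:
    raise NotImplementedError("replace with the actual image-text judge")

def pairwise_score(samples) -> float:
    correct = 0
    for s in samples:
        ok1 = judge(s["image1"], s["description1"])  # assumed keys
        ok2 = judge(s["image2"], s["description2"])
        correct += int(ok1 and ok2)
    return correct / len(samples)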
# To better see the generated images, visualize the outputs
python visualize.py
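`visualize.py` writes an HTML page similar to the example linked below. A bare-bones version that tiles each prompt pair next to its two generated images might look like this sketch; the image paths, field names, and output file name are assumptions.

```python
# Bare-bones HTML visualization: one table row per adversarial pair.
# Image paths, field names, and output file name are illustrative assumptions.
from datasets import load_dataset

data = load_dataset("CommonsenseT2I/CommonsensenT2I")["train"]
rows = []
for idx, example in enumerate(data):
    rows.append(
        f"<tr><td>{example['prompt1']}</td>"
        f"<td><img src='outputs/{idx}_prompt1.png' width='256'></td>"
        f"<td>{example['prompt2']}</td>"
        f"<td><img src='outputs/{idx}_prompt2.png' width='256'></td></tr>"
    )
with open("my_visualization.html", "w") as f:
    f.write("<table>" + "".join(rows) + "</table>")
```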
An example output is provided in example_visualization_dalle.html; check it out using a web browser.
Output visualizations are provided for the text-to-image models tested in our paper, e.g., DALL-E 3 outputs, Stable Diffusion 3 outputs, and Flux model outputs. For more details, check out our paper!
- Xingyu Fu: xingyuf2@seas.upenn.edu
BibTeX:
@article{fu2024commonsenseT2I,
  title   = {Commonsense-T2I Challenge: Can Text-to-Image Generation Models Understand Commonsense?},
  author  = {Xingyu Fu and Muyu He and Yujie Lu and William Yang Wang and Dan Roth},
  journal = {arXiv preprint arXiv:2406.07546},
  year    = {2024},
}