VGDiffZero: Text-to-image Diffusion Models Can Be Zero-shot Visual Grounders

Official PyTorch implementation of our ICASSP 2024 paper.

Overview

In this work, we explore directly applying a pre-trained generative diffusion model to the challenging discriminative task of visual grounding, without any fine-tuning or additional training data. Specifically, we propose VGDiffZero, a simple yet effective zero-shot visual grounding framework based on text-to-image diffusion models. We also design a comprehensive region-scoring method that considers both the global and local contexts of each isolated proposal. Extensive experiments on RefCOCO, RefCOCO+, and RefCOCOg show that VGDiffZero achieves strong performance on zero-shot visual grounding.

Installation

Create a conda environment and activate it with the following commands:

# create conda env
conda env create -f environment.yml

# activate the environment
conda activate VGDiffZero

Dataset

1. Download COCO 2014 train images

Download images via this link, and move them to ./data/.

2. Download RefCOCO, RefCOCO+, and RefCOCOg annotations

Download the annotations from this Google Drive link, and move them to ./data/.
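After both steps, ./data/ should contain the COCO 2014 train images and the annotation files. The layout below is only a hypothetical example (exact directory and file names depend on the downloaded archives); the corresponding command-line arguments are described in the Evaluation section:

data/                              # top-level data directory
├── train2014/                     # COCO 2014 train images (directory name is an assumption; --image_root)
├── refcoco_dets_dict.json         # optional detection files (--detector_file)
├── refcoco+_dets_dict.json
├── refcocog_dets_dict.json
└── *.jsonl                        # processed annotations (--input_file)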

Evaluation

python main.py --input_file INPUT_FILE --image_root IMAGE_ROOT --diffusion_model {1-4/2-1} --method {full_exp/core_exp/random} --box_representation_method {crop/mask/crop,mask} --box_method_aggregator {sum/max} {--output_file PATH_TO_OUTPUT_FILE} {--detector_file PATH_TO_DETECTION_FILE}

(/ is used above to denote different options for a given argument.)

--input_file: the processed annotations in .jsonl format

--image_root: the top-level directory containing COCO 2014 train images

--detector_file: if not specified, ground-truth proposals are used. For RefCOCO, RefCOCO+, and RefCOCOg, the detection files should be named {refcoco/refcocog/refcoco+}_dets_dict.json

Choices for diffusion_model: selects the Stable Diffusion version to use, either "1-4" or "2-1". (default: "2-1")

Choices for method: "full_exp" uses the full expression as the text input, "core_exp" uses the core expression extracted by spaCy as the text input, and "random" randomly selects a proposal as the prediction. (default: "full_exp")

Choices for box_representation_method: "crop" isolates proposals by cropping only, "mask" isolates proposals by masking only, and "crop,mask" uses both cropping and masking for comprehensive region-scoring. (default: "crop")

Choices for box_method_aggregator: given the two sets of predicted errors, "sum" selects the proposal with the minimum total error $e_\text{total} = e_\text{mask} + e_\text{crop}$, while "max" selects the proposal with the minimum error in either set. (default: "sum") See the sketch after this list for both aggregators.
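To make the region-scoring and the two aggregators concrete, below is a minimal Python sketch, not the repository's implementation: isolate, select_proposal, and the score_fn callable (assumed to return the Stable Diffusion denoising error for an image conditioned on the expression) are illustrative names, and blacking out everything outside the proposal is one possible masking strategy.

# Minimal sketch of proposal scoring with "crop,mask" representations and
# "sum" / "max" aggregation. score_fn(image, text) is a user-supplied callable
# that returns the diffusion denoising error for the (image, text) pair.
from PIL import Image, ImageDraw

def isolate(image, box):
    """Return the cropped view (local context) and the masked view (global context)."""
    x1, y1, x2, y2 = box
    cropped = image.crop((x1, y1, x2, y2))
    masked = image.copy()
    draw = ImageDraw.Draw(masked)
    # Black out everything outside the proposal box, keeping its position in the full frame.
    draw.rectangle((0, 0, image.width, y1), fill="black")
    draw.rectangle((0, y2, image.width, image.height), fill="black")
    draw.rectangle((0, y1, x1, y2), fill="black")
    draw.rectangle((x2, y1, image.width, y2), fill="black")
    return cropped, masked

def select_proposal(image, boxes, expression, score_fn, aggregator="sum"):
    """Score each proposal by its crop and mask errors and return the lowest-scoring box."""
    scores = []
    for box in boxes:
        cropped, masked = isolate(image, box)
        e_crop = score_fn(cropped, expression)
        e_mask = score_fn(masked, expression)
        if aggregator == "sum":
            scores.append(e_crop + e_mask)       # e_total = e_mask + e_crop
        else:                                    # "max": minimum error in either set
            scores.append(min(e_crop, e_mask))
    return boxes[scores.index(min(scores))]

With aggregator="sum", proposals are ranked by $e_\text{total} = e_\text{mask} + e_\text{crop}$; with "max", each proposal is ranked by the smaller of its two errors. In both cases the lowest-scoring proposal is returned as the prediction.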

Acknowledgements

Our implementation of VGDiffZero is partly based on the following codebases: Stable-Diffusion, Diffusion Classifier, and ReCLIP. We sincerely thank the authors for their excellent work.

Citation

If our findings help your research, please consider citing our paper:

@inproceedings{liu2024vgdiffzero,
  title={VGDiffZero: Text-to-image diffusion models can be zero-shot visual grounders},
  author={Liu, Xuyang and Huang, Siteng and Kang, Yachen and Chen, Honggang and Wang, Donglin},
  booktitle={ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={2765--2769},
  year={2024},
  organization={IEEE}
}

Contact

For any questions about our paper or code, please contact Xuyang Liu.
