
[NeurIPS2023] Exploring Diverse In-Context Configurations for Image Captioning

This repository contains the PyTorch implementation for the NeurIPS 2023 Paper "Exploring Diverse In-Context Configurations for Image Captioning" by Xu Yang, Yongliang Wu, Mingzhuo Yang, Haokun Chen and Xin Geng.

If you have any questions about this repository or the related paper, feel free to open an issue.

Introduction

After the discovery that Language Models (LMs) can be good in-context few-shot learners, numerous strategies have been proposed to optimize in-context sequence configurations. Recently, researchers in the Vision-Language (VL) domain have also developed their own few-shot learners, but they configure in-context image-text pairs in the simplest way, e.g., by random sampling. To explore the effects of varying configurations on VL in-context learning, we devised four strategies for image selection and four for caption assignment to configure in-context image-text pairs for image captioning. Image captioning is used as the case study since it can be seen as a visually-conditioned LM. Our comprehensive experiments yield two counter-intuitive but valuable insights, highlighting the distinct characteristics of VL in-context learning, due to multi-modal synergy, as compared to the NLP case.

Figure: The distinction between LMs and VLMs as few-shot learners. LMs generally excel with examples akin to the test case (blue blocks in (a)). In contrast, for VLMs, performance is not strictly correlated with image similarity but relies heavily on caption quality. For instance, when low-quality captions are used, similar images (d) lead to worse performance than dissimilar ones (f), since VLMs may build a shortcut by reusing in-context captions without looking at the given images.
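
The minimal sketch below (Python, not the repository's exact code) illustrates what an in-context configuration amounts to in practice: the selected image-caption pairs are interleaved into a single prompt, followed by the query image. The <image> and <|endofchunk|> markers follow the OpenFlamingo convention; build_incontext_prompt and the demonstration data are illustrative assumptions.

# Minimal sketch, not the repository's API: a k-shot in-context prompt for
# image captioning concatenates the selected demonstrations (however they
# were chosen) and ends with the query image.
def build_incontext_prompt(demos):
    """demos: list of (image, caption) pairs chosen by some selection strategy."""
    prompt = ""
    for _, caption in demos:
        # each demonstration is one interleaved image-text chunk
        prompt += f"<image>Output:{caption}<|endofchunk|>"
    # the query image is appended with an empty caption slot for the VLM to fill
    prompt += "<image>Output:"
    return prompt

# Example: a 2-shot prompt built from two hypothetical demonstrations.
demos = [(None, "A dog runs on the beach."), (None, "Two people ride bicycles.")]
print(build_incontext_prompt(demos))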

Getting Started

To create a conda environment for running the scripts, run:

conda create -n of python=3.9
conda activate of
pip install -r requirements.txt
pip install -e .

Download the OpenFlamingo v1 9B model from link and then download the LLaMA model from link.

You can run the following command to evaluate the model on MSCOCO captioning. See run_eval.sh for more details.

python open_flamingo/eval/evaluate.py \
    --lm_path $LM_PATH \
    --lm_tokenizer_path $LM_TOKENIZER_PATH \
    --checkpoint_path $CKPT_PATH \
    --device $DEVICE \
    --coco_image_dir_path $COCO_IMG_PATH \
    --coco_annotations_json_path $COCO_ANNO_PATH \
    --mgc_path "MGC/wc_vis_135.json" \
    --mgca_path "MGCA-idx/best_gt_WC(135).json" \
    --clip_ids_path "train_set_clip.json" \
    --results_file $RESULTS_FILE \
    --num_samples 5000 --shots 4 8 16 32 --num_trials 1 --seed 5 --batch_size 8 \
    --cross_attn_every_n_layers 4 \
    --eval_coco
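
The --mgc_path, --mgca_path, and --clip_ids_path arguments point to precomputed files whose exact formats are not documented in this README. As a rough, assumption-laden sketch, a similarity-based image selection strategy needs CLIP embeddings of the training images; something along the following lines (using open_clip, which OpenFlamingo depends on) could produce such an index. The model variant is an assumption, not necessarily the one used in the paper.

import torch, open_clip
from PIL import Image

# Assumption: ViT-B-32 with OpenAI weights; the repository may use a different CLIP variant.
model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
model.eval()

def embed_images(image_paths, batch_size=32):
    """Return L2-normalised CLIP image embeddings for a list of image paths."""
    feats = []
    for i in range(0, len(image_paths), batch_size):
        batch = torch.stack([preprocess(Image.open(p).convert("RGB"))
                             for p in image_paths[i:i + batch_size]])
        with torch.no_grad():
            f = model.encode_image(batch)
        feats.append(f / f.norm(dim=-1, keepdim=True))  # normalise for cosine similarity
    return torch.cat(feats)

# e.g. rank training images by similarity to one test image:
# sims = embed_images([test_image_path]) @ embed_images(train_image_paths).T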

Datasets

MSCOCO

COCO is a large-scale object detection, segmentation, and captioning dataset. For the image captioning task, it provides 5 reference captions per image. You can download the dataset from link.
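
For reference, the standard COCO caption annotation format stores captions in an "annotations" list keyed by image_id; a minimal loader (plain json, independent of this repository's code) looks like this:

import json
from collections import defaultdict

def load_captions(annotations_json_path):
    """Group the (usually 5) reference captions of each COCO image by image_id."""
    with open(annotations_json_path) as f:
        coco = json.load(f)
    captions_per_image = defaultdict(list)
    for ann in coco["annotations"]:
        captions_per_image[ann["image_id"]].append(ann["caption"])
    return captions_per_image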

Citation

Please cite our paper if it is helpful to your work:

@article{yang2023exploring,
  title={Exploring Diverse In-Context Configurations for Image Captioning},
  author={Yang, Xu and Wu, Yongliang and Yang, Mingzhuo and Chen, Haokun and Geng, Xin},
  journal={arXiv preprint arXiv:2305.14800},
  year={2023}
}

TODO

  1. Add the implementation of Model-Generated Captions
  2. Add the implementation of Model-Generated Captions as Anchors
  3. Add the implementation of Similarity-based Image-Caption Retrieval and Diversity-based Image-Image Retrieval (a rough sketch of these two ideas follows below)
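
The sketch below conveys the general idea behind similarity-based and diversity-based retrieval over L2-normalised CLIP image embeddings (e.g., as produced in the Getting Started section). It is an illustration under those assumptions, not the paper's exact algorithms.

import torch

def similarity_based(query_feat, train_feats, k):
    """Pick the k training images most similar to the query image (cosine similarity)."""
    sims = train_feats @ query_feat
    return sims.topk(k).indices.tolist()

def diversity_based(train_feats, k):
    """Greedy max-min (farthest-point) selection of k mutually dissimilar images."""
    chosen = [0]
    for _ in range(k - 1):
        sims_to_chosen = train_feats @ train_feats[chosen].T  # (N, |chosen|)
        # pick the image whose closest already-chosen neighbour is farthest away
        chosen.append(sims_to_chosen.max(dim=1).values.argmin().item())
    return chosen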

Acknowledgements

Our implementation uses source code from the OpenFlamingo repository.
