Repository for "Controllable Contextualized Image Captioning: Directing the Visual Narrative through User-Defined Highlights" in ECCV 2024.
- Download the WikiWeb2M dataset from WikiWeb2M.
- Update your local path variables in `env_definer.sh` and set them up according to the ReadME file; a sketch is given below.
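  A minimal sketch of what `env_definer.sh` might contain (the variable names here are hypothetical, not necessarily the repo's actual ones):

  ```bash
  #!/bin/bash
  # Hypothetical path variables -- substitute your own locations.
  export DATA_DIR=/path/to/wikiweb2m           # raw WikiWeb2M data and splits
  export IMAGE_DIR=/path/to/downloaded_images  # ~3-4 TB of downloaded images
  export CKPT_DIR=/path/to/checkpoints         # model checkpoints
  ```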
- Run `preprocess/launch_preprocess_tasks.py` with the task flags `download_img` and `extract_txt`, respectively, to download the images required for the dataset (around 3~4 TB) and to extract the text data locally; see the commands below.
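  For example, assuming the task is passed via a `--task` flag (an assumption about the exact CLI syntax; check the script's argument parser):

  ```bash
  # Download images (large: ~3-4 TB), then extract the text data locally.
  python preprocess/launch_preprocess_tasks.py --task download_img
  python preprocess/launch_preprocess_tasks.py --task extract_txt
  ```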
- Some preprocessed data are provided, including:
  - `.csv` files specifying the training, validation, and testing splits (the training split is large and provided here).
  - Extracted highlights for inference during evaluation.
  - GRIT image captions facilitating text-based GPT-4 Ctrl-CIC caption generation.

  Remember to move these files to the corresponding local paths you specified, e.g. as sketched below.
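  A hypothetical example (all source file names and destination directories below are placeholders; use the paths you set in `env_definer.sh`):

  ```bash
  # Hypothetical: move the provided files into your configured data root.
  mv splits/*.csv "${DATA_DIR}/splits/"
  mv highlights/* "${DATA_DIR}/highlights/"
  mv grit_captions/* "${DATA_DIR}/grit_captions/"
  ```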
- Extract CLIP image features with `scripts/extract_image_features.py` for efficient local training and evaluation.
- Generate relevance scores for pseudo training highlights with `scripts/mask_generation.py`.
- Run the training program with the corresponding config file, for example:

  ```bash
  python cli/train.py --config experiments/finetune/longt5.yaml
  ```
- For traditional CIC tasks, refer to `eval_configs`. Update the `run_id` according to your local checkpoints and run the inference script:

  ```bash
  python cli/eval.py --config experiments/eval_configs/eval_full.yaml
  ```

  CIC performance is recorded during inference.
- For Ctrl-CIC tasks, first generate Ctrl-CIC captions with `ccic_eval_configs`, e.g. via the sketch below.
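  A hypothetical invocation, assuming the same `cli/eval.py` entry point as above (the config filename is a placeholder; pick an actual file from `experiments/ccic_eval_configs/`):

  ```bash
  # Hypothetical: generate Ctrl-CIC captions with a Ctrl-CIC eval config.
  python cli/eval.py --config experiments/ccic_eval_configs/your_config.yaml
  ```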
The Ctrl-CIC captions can be evaluated as follows:
- CLIPScore and CLIPScore-Sentence, by setting `use_clip_score`, `use_sent_score`, and `load_predictions` to `True` and running the evaluation script again (a config-editing sketch follows below).
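  For instance, assuming these flags are top-level keys in your Ctrl-CIC eval YAML (an assumption about the config layout; the config filename is a placeholder):

  ```bash
  # Hypothetical sketch: flip the three flags in place, then re-run evaluation.
  # Key names follow the bullet above; the YAML layout is an assumption.
  CFG="experiments/ccic_eval_configs/your_config.yaml"
  for key in use_clip_score use_sent_score load_predictions; do
    sed -i "s/^${key}:.*/${key}: True/" "$CFG"
  done
  python cli/eval.py --config "$CFG"
  ```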
- Recall, with

  ```bash
  python scripts/calculate_recall.py
  ```

- Diversity, with

  ```bash
  python scripts/diversity_eval.py
  ```
- GPT-4(V)-empowered metrics:
  - First generate the JSON files to be uploaded:

    ```bash
    python scripts/generate_prompt.py --task eval
    ```

  - Update your OpenAI key here.
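    A sketch assuming the scripts read the key from the `OPENAI_API_KEY` environment variable (an assumption; the repo may load the key differently):

    ```bash
    # Hypothetical: export the key before issuing API calls.
    export OPENAI_API_KEY="sk-..."
    ```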
  - Run

    ```bash
    python scripts/query_response.py
    ```

    to issue the GPT-4(V) API calls.
  - Run

    ```bash
    python scripts/get_gpt_scores.py
    ```

    to compute the GPT-4(V) evaluation metric scores.
The pretrained weights are available at huggingface.
For an interactive Ctrl-CIC demo, you can run

```bash
python scripts/rctrl_inference.py
```

which allows flexible selection of the highlights and image. A similar program is provided for p-ctrl, but its output is shown on the command line.
The dataset and data loading implementation are based on the code provided in WikiWeb2M.
```bibtex
@InProceedings{Mao_2024_ECCV,
    author    = {Mao, Shunqi and Zhang, Chaoyi and Su, Hang and Song, Hwanjun and Shalyminov, Igor and Cai, Weidong},
    title     = {Controllable Contextualized Image Captioning: Directing the Visual Narrative through User-Defined Highlights},
    booktitle = {Proceedings of the 18th European Conference on Computer Vision (ECCV)},
    year      = {2024}
}
```