Most image captioning models are complicated and hard to test. A traditional captioning model first encodes the image with a bottom-up-top-down (BUTD) detector, a Faster R-CNN trained on the Visual Genome dataset, to extract so-called bottom-up features, and then generates a caption with an attention or transformer decoder. Self-critical sequence training (SCST) is often applied on top to improve the scores.

In 2020, Microsoft released a new model called Oscar, which set a new record in image captioning. It is pre-trained on a large amount of data and fine-tuned on downstream tasks, which is also a complicated and time-consuming process. A follow-up work, VinVL, replaces the usual bottom-up features used in Oscar with features from its own object-attribute detection model.

After OpenAI released the zero-shot model CLIP, many papers appeared on vision-language tasks, such as CLIP-ViL, X-modaler and, most recently, ClipCap. Among them, ClipCap is the simplest network that anyone can easily test.
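For context, the core of ClipCap is small enough to sketch in a few lines: a CLIP image embedding is mapped to a sequence of prefix embeddings, which a GPT-2 decoder then continues into a caption. The sketch below is a simplification of the paper's MLP mapping network; the dimensions (512 for CLIP ViT-B/32, 768 for GPT-2) and the prefix length are illustrative defaults, not necessarily this repo's exact settings.

```python
# A minimal sketch of the ClipCap idea: a small MLP maps a CLIP image embedding
# to a sequence of prefix embeddings that condition a GPT-2 decoder.
import torch
import torch.nn as nn

class MLPMapper(nn.Module):
    def __init__(self, clip_dim=512, gpt_dim=768, prefix_len=10):
        super().__init__()
        self.prefix_len = prefix_len
        self.gpt_dim = gpt_dim
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, (gpt_dim * prefix_len) // 2),
            nn.Tanh(),
            nn.Linear((gpt_dim * prefix_len) // 2, gpt_dim * prefix_len),
        )

    def forward(self, clip_embed):            # (B, clip_dim)
        prefix = self.mlp(clip_embed)         # (B, prefix_len * gpt_dim)
        return prefix.view(-1, self.prefix_len, self.gpt_dim)
```

The prefix embeddings are fed to the language model as if they were token embeddings, so no change to the decoder architecture is needed.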
COCO
Model | BLEU-4↑ | METEOR↑ | ROUGE-L↑ | CIDEr↑ | SPICE↑ | Params (M) | Pretrained
---|---|---|---|---|---|---|---
ClipCap | 32.2 | 27.1 | - | 108.4 | 20.1 | 156 | download
LATGeO | 36.4 | 27.8 | 56.7 | 115.8 | - | - | -
Conceptual Captions
Model | ROUGE-L↑ | CIDEr↑ | SPICE↑ | Params (M) | Pretrained
---|---|---|---|---|---
ClipCap | 26.7 | 87.3 | 18.5 | 156 | download
nocaps
Model | in-domain (CIDEr↑ / SPICE↑) | near-domain (CIDEr↑ / SPICE↑) | out-of-domain (CIDEr↑ / SPICE↑) | overall (CIDEr↑ / SPICE↑) | Params (M)
---|---|---|---|---|---
ClipCap | 79.7/12.2 | 67.7/11.3 | 49.4/9.7 | 65.7/11.1 | 156
Note: All results are reported without CIDEr (SCST) optimization.
- torch >= 1.8.1
- torchvision >= 0.8.1
- Python >= 3.8
Clone the repo recursively:
$ git clone --recursive https://github.com/sithu31296/image-captioning.git
If you want to evaluate, follow the installation steps in coco-caption; otherwise this is not needed.
Other requirements can be installed with:
$ pip install -r requirements.txt
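As a quick, optional sanity check (not a script in this repo), you can confirm from Python that the installed versions match the requirements above:

```python
# Optional environment check: verify the versions listed in the requirements.
import sys
import torch
import torchvision

assert sys.version_info >= (3, 8), "Python >= 3.8 required"
print("torch:", torch.__version__)               # should be >= 1.8.1
print("torchvision:", torchvision.__version__)   # should be >= 0.8.1
print("CUDA available:", torch.cuda.is_available())
```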
$ python tools/infer.py \
--model-path MODEL_WEIGHTS \
--img-path TEST_IMAGE_PATH \
--beam-search False
Sample inference result:
A couple of people standing next to an elephant.
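With `--beam-search False`, decoding falls back to greedy search. The following is a minimal sketch of what greedy decoding looks like, assuming a Hugging Face GPT2LMHeadModel-style decoder; `model`, `tokenizer` and `prefix_embed` are placeholders, not identifiers from this repo:

```python
# Greedy decoding sketch: at each step, feed the prefix plus the tokens
# generated so far, and pick the single most likely next token.
import torch

@torch.no_grad()
def greedy_caption(model, tokenizer, prefix_embed, max_len=20):
    tokens = None
    generated = prefix_embed                      # (1, prefix_len, dim)
    for _ in range(max_len):
        logits = model(inputs_embeds=generated).logits[:, -1, :]
        next_token = logits.argmax(dim=-1, keepdim=True)   # top-1 token
        if next_token.item() == tokenizer.eos_token_id:
            break
        tokens = next_token if tokens is None else torch.cat((tokens, next_token), dim=1)
        next_embed = model.transformer.wte(next_token)     # embed and append
        generated = torch.cat((generated, next_embed), dim=1)
    return tokenizer.decode(tokens.squeeze(0).tolist()) if tokens is not None else ""
```

Beam search instead keeps the top-k partial captions at each step, which is slower but usually scores slightly higher.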
Dataset | Train | Val | Test | Captions / Image | Vocab Size | Avg. Tokens / Caption |
---|---|---|---|---|---|---|
COCO | 83k | 5k | 5k | 5 | - | - |
Conceptual Captions | 3M | 15k | - | 1 | 51k | 10.3 |
Flickr8k | 6k | 1k | 1k | 5 | - | - |
Flickr30k | 29k | 1k | 1k | 5 | - | - |
nocaps | - | - | - | - | - | - |
- Download dataset images.
- For COCO, download COCO2014 images from COCO.
- For Flickr8k, download images from the Official Website; if that doesn't work, try downloading from Kaggle.
- For Flickr30k, download images from the Official Website; if that doesn't work, try downloading from Kaggle.
- Download Karpathy splits for COCO, Flickr8k and Flickr30k from here.
- Run the following command to extract image features and tokens:
$ python tools/prepare_coco_flickr.py \
--annot-path KARPATHY_ANNOT_JSON_PATH \
--dataset-path DATASET_ROOT_PATH \
--save-path SAVE_PATH
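For intuition, the preparation step roughly does the following. This is a hedged sketch, assuming the standard Karpathy split JSON layout ("images" entries with "filepath", "filename", "split" and "sentences") and OpenAI's `clip` package; the paths and saving logic are illustrative, not the repo's actual code:

```python
# Sketch: extract a CLIP image feature and reference captions per image.
import json
from pathlib import Path

import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

annots = json.load(open("dataset_coco.json"))["images"]
for img in annots:
    if img["split"] != "train":
        continue
    path = Path("datasets/COCO") / img.get("filepath", "") / img["filename"]
    image = preprocess(Image.open(path)).unsqueeze(0).to(device)
    with torch.no_grad():
        feat = model.encode_image(image)             # (1, 512) CLIP feature
    captions = [s["raw"] for s in img["sentences"]]  # 5 reference captions
    # ...tokenize the captions and save (feat, tokens) to SAVE_PATH
```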
- To evaluate with coco-caption, you need to convert the Karpathy split JSON format to COCO JSON format:
$ python scripts/convert_coco_format.py \
--input-json KARPATHY_ANNOT_JSON_PATH \
--output-json COCO_JSON_SAVE_PATH \
--split 'test'   # or 'val'
To evaluate on COCO-val, you can also use the annotation file in coco_caption/annotations/captions_val2014.json.
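If you want to see what this conversion amounts to, the sketch below builds the "images"/"annotations" structure that coco-caption expects from a Karpathy split file. Field names follow the standard formats ("cocoid" is the COCO image id in the Karpathy JSON), but the script is illustrative rather than the repo's implementation:

```python
# Sketch: Karpathy split JSON -> COCO caption-annotation JSON.
import json

karpathy = json.load(open("dataset_coco.json"))
coco = {"images": [], "annotations": [], "type": "captions", "info": {}, "licenses": []}
ann_id = 0
for img in karpathy["images"]:
    if img["split"] != "test":                     # or 'val'
        continue
    coco["images"].append({"id": img["cocoid"], "file_name": img["filename"]})
    for sent in img["sentences"]:
        coco["annotations"].append(
            {"id": ann_id, "image_id": img["cocoid"], "caption": sent["raw"]}
        )
        ann_id += 1
json.dump(coco, open("captions_test_coco_format.json", "w"))
```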
- Download Training split, Validation split and Image Labels from here.
- Run the following command to download the actual images:
$ python scripts/download_conceptual.py --root datasets/ConceptualCaptions
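Under the hood, downloading Conceptual Captions boils down to fetching every URL in the split TSVs, which (in the official release) contain one caption and image URL per line, tab-separated. A minimal sketch, with the TSV file name assumed from the official release and error handling kept to a bare minimum:

```python
# Sketch: download Conceptual Captions images from the official TSV.
import requests

with open("datasets/ConceptualCaptions/Train_GCC-training.tsv") as f:
    for i, line in enumerate(f):
        caption, url = line.rstrip("\n").split("\t")
        try:
            r = requests.get(url, timeout=5)
            r.raise_for_status()
            with open(f"datasets/ConceptualCaptions/images/{i}.jpg", "wb") as out:
                out.write(r.content)
        except requests.RequestException:
            continue    # many Conceptual Captions URLs are dead; skip failures
```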
- Run the following command to extract image features and tokens:
$ python tools/prepare_conceptual.py \
--dataset-path datasets/ConceptualCaptions \
--save-path data/ConceptualCaptions
- To evaluate with coco-caption, you need to convert the annotations to COCO JSON format:
$ python scripts/convert_conceptual_to_coco.py \
--input-txt VAL_TXT_PATH \
--output-json COCO_JSON_SAVE_PATH
Create a YAML configuration file; the default configuration can be found in configs/defaults.yaml.
This file is needed for both training and evaluation.
$ python tools/train.py --cfg CONFIG_FILE.yaml
$ python tools/val.py --cfg CONFIG_FILE.yaml
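Evaluation goes through coco-caption. If you want to score a predictions file yourself, the standard pycocotools/pycocoevalcap API looks like this; the `predictions.json` path is a placeholder for a file of `[{"image_id": ..., "caption": ...}, ...]` entries:

```python
# Sketch: score generated captions with the coco-caption metrics.
from pycocotools.coco import COCO
from pycocoevalcap.eval import COCOEvalCap

coco = COCO("coco_caption/annotations/captions_val2014.json")  # ground truth
res = coco.loadRes("predictions.json")                         # model outputs
evaluator = COCOEvalCap(coco, res)
evaluator.params["image_id"] = res.getImgIds()  # score only captioned images
evaluator.evaluate()
for metric, score in evaluator.eval.items():    # BLEU, METEOR, ROUGE-L, CIDEr, SPICE
    print(f"{metric}: {score:.3f}")
```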
Most of the code is adapted from:
@article{mokady2021clipcap,
title={ClipCap: CLIP Prefix for Image Captioning},
author={Mokady, Ron and Hertz, Amir and Bermano, Amit H},
journal={arXiv preprint arXiv:2111.09734},
year={2021}
}