Most image captioning models are complicated and hard to test. A traditional captioning model first encodes the image with a bottom-up-top-down (BUTD) detector, a Faster R-CNN trained on the Visual Genome dataset, to extract so-called bottom-up features, and then generates a caption with an attention or transformer decoder. Self-critical sequence training (SCST) is often applied on top to improve the scores.

In 2020, Microsoft released a new model called Oscar, which set a new record in image captioning. It is pre-trained on a large amount of data and fine-tuned on downstream tasks, which is also a complicated and time-consuming process. A follow-up work, VinVL, replaces the usual bottom-up features used in Oscar with features from its own object-attribute detection model.

After OpenAI released the zero-shot model CLIP, many papers appeared on vision-language tasks, such as CLIP-ViL, X-modaler and, most recently, ClipCap. Among them, ClipCap is the simplest network that anyone can easily test.
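For context, the core of ClipCap is small enough to sketch in a few lines: a CLIP image embedding is mapped to a sequence of prefix embeddings, which a GPT-2 decoder then continues into a caption. The sketch below is a simplification of the paper's MLP mapping network; the dimensions (512 for CLIP ViT-B/32, 768 for GPT-2) and the prefix length are illustrative defaults, not necessarily this repo's exact settings.

```python
# A minimal sketch of the ClipCap idea: a small MLP maps a CLIP image embedding
# to a sequence of prefix embeddings that condition a GPT-2 decoder.
import torch
import torch.nn as nn

class MLPMapper(nn.Module):
    def __init__(self, clip_dim=512, gpt_dim=768, prefix_len=10):
        super().__init__()
        self.prefix_len = prefix_len
        self.gpt_dim = gpt_dim
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, (gpt_dim * prefix_len) // 2),
            nn.Tanh(),
            nn.Linear((gpt_dim * prefix_len) // 2, gpt_dim * prefix_len),
        )

    def forward(self, clip_embed):            # (B, clip_dim)
        prefix = self.mlp(clip_embed)         # (B, prefix_len * gpt_dim)
        return prefix.view(-1, self.prefix_len, self.gpt_dim)
```

The prefix embeddings are fed to the language model as if they were token embeddings, so no change to the decoder architecture is needed.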
COCO
Model | BLEU-4↑ | METEOR↑ | ROUGE-L↑ | CIDEr↑ | SPICE↑ | Params (M) | Pretrained
---|---|---|---|---|---|---|---
ClipCap | 32.2 | 27.1 | - | 108.4 | 20.1 | 156 | download
LATGeO | 36.4 | 27.8 | 56.7 | 115.8 | - | - | -
Conceptual Captions
Model | ROUGE-L↑ | CIDEr↑ | SPICE↑ | Params (M) | Pretrained
---|---|---|---|---|---
ClipCap | 26.7 | 87.3 | 18.5 | 156 | download
nocaps
Model | in-domain (CIDEr↑ / SPICE↑) | near-domain (CIDEr↑ / SPICE↑) | out-of-domain (CIDEr↑ / SPICE↑) | overall (CIDEr↑ / SPICE↑) | Params (M)
---|---|---|---|---|---
ClipCap | 79.7/12.2 | 67.7/11.3 | 49.4/9.7 | 65.7/11.1 | 156
Note: All results are reported without CIDEr (SCST) optimization.
- torch >= 1.8.1
- torchvision >= 0.8.1
- Python >= 3.8
Clone the repo recursively:
$ git clone --recursive https://github.com/sithu31296/image-captioning.git
If you want to evaluate, follow the installation steps in coco-caption; otherwise this is not needed.
Other requirements can be installed with:
$ pip install -r requirements.txt
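As a quick, optional sanity check (not a script in this repo), you can confirm from Python that the installed versions match the requirements above:

```python
# Optional environment check: verify the versions listed in the requirements.
import sys
import torch
import torchvision

assert sys.version_info >= (3, 8), "Python >= 3.8 required"
print("torch:", torch.__version__)               # should be >= 1.8.1
print("torchvision:", torchvision.__version__)   # should be >= 0.8.1
print("CUDA available:", torch.cuda.is_available())
```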
$ python tools/infer.py \
--model-path MODEL_WEIGHTS \
--img-path TEST_IMAGE_PATH \
--beam-search False
Sample inference result:
A couple of people standing next to an elephant.
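With `--beam-search False`, decoding falls back to greedy search. The following is a minimal sketch of what greedy decoding looks like, assuming a Hugging Face GPT2LMHeadModel-style decoder; `model`, `tokenizer` and `prefix_embed` are placeholders, not identifiers from this repo:

```python
# Greedy decoding sketch: at each step, feed the prefix plus the tokens
# generated so far, and pick the single most likely next token.
import torch

@torch.no_grad()
def greedy_caption(model, tokenizer, prefix_embed, max_len=20):
    tokens = None
    generated = prefix_embed                      # (1, prefix_len, dim)
    for _ in range(max_len):
        logits = model(inputs_embeds=generated).logits[:, -1, :]
        next_token = logits.argmax(dim=-1, keepdim=True)   # top-1 token
        if next_token.item() == tokenizer.eos_token_id:
            break
        tokens = next_token if tokens is None else torch.cat((tokens, next_token), dim=1)
        next_embed = model.transformer.wte(next_token)     # embed and append
        generated = torch.cat((generated, next_embed), dim=1)
    return tokenizer.decode(tokens.squeeze(0).tolist()) if tokens is not None else ""
```

Beam search instead keeps the top-k partial captions at each step, which is slower but usually scores slightly higher.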
Dataset | Train | Val | Test | Captions / Image | Vocab Size | Avg. Tokens / Caption |
---|---|---|---|---|---|---|
COCO | 83k | 5k | 5k | 5 | - | - |
Conceptual Captions | 3M | 15k | - | 1 | 51k | 10.3 |
Flickr8k | 6k | 1k | 1k | 5 | - | - |
Flickr30k | 29k | 1k | 1k | 5 | - | - |
nocaps | - | - | - | - | - | - |
- Download dataset images.
- For COCO, download COCO2014 images from COCO.
- For Flickr8k, download images from the Official Website; if that doesn't work, try downloading from Kaggle.
- For Flickr30k, download images from the Official Website; if that doesn't work, try downloading from Kaggle.
- Download Karpathy splits for COCO, Flickr8k and Flickr30k from here.
- Run the following command to extract image features and tokens:
$ python tools/prepare_coco_flickr.py \
--annot-path KARPATHY_ANNOT_JSON_PATH \
--dataset-path DATASET_ROOT_PATH \
--save-path SAVE_PATH
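For intuition, the preparation step roughly does the following. This is a hedged sketch, assuming the standard Karpathy split JSON layout ("images" entries with "filepath", "filename", "split" and "sentences") and OpenAI's `clip` package; the paths and saving logic are illustrative, not the repo's actual code:

```python
# Sketch: extract a CLIP image feature and reference captions per image.
import json
from pathlib import Path

import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

annots = json.load(open("dataset_coco.json"))["images"]
for img in annots:
    if img["split"] != "train":
        continue
    path = Path("datasets/COCO") / img.get("filepath", "") / img["filename"]
    image = preprocess(Image.open(path)).unsqueeze(0).to(device)
    with torch.no_grad():
        feat = model.encode_image(image)             # (1, 512) CLIP feature
    captions = [s["raw"] for s in img["sentences"]]  # 5 reference captions
    # ...tokenize the captions and save (feat, tokens) to SAVE_PATH
```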
- To evaluate with coco-caption, you need to convert the Karpathy split JSON format to COCO JSON format:
$ python scripts/convert_coco_format.py \
--input-json KARPATHY_ANNOT_JSON_PATH \
--output-json COCO_JSON_SAVE_PATH \
--split 'test'   # or 'val'
To evaluate on COCO-val, you can also use the annotation file in coco_caption/annotations/captions_val2014.json.
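If you want to see what this conversion amounts to, the sketch below builds the "images"/"annotations" structure that coco-caption expects from a Karpathy split file. Field names follow the standard formats ("cocoid" is the COCO image id in the Karpathy JSON), but the script is illustrative rather than the repo's implementation:

```python
# Sketch: Karpathy split JSON -> COCO caption-annotation JSON.
import json

karpathy = json.load(open("dataset_coco.json"))
coco = {"images": [], "annotations": [], "type": "captions", "info": {}, "licenses": []}
ann_id = 0
for img in karpathy["images"]:
    if img["split"] != "test":                     # or 'val'
        continue
    coco["images"].append({"id": img["cocoid"], "file_name": img["filename"]})
    for sent in img["sentences"]:
        coco["annotations"].append(
            {"id": ann_id, "image_id": img["cocoid"], "caption": sent["raw"]}
        )
        ann_id += 1
json.dump(coco, open("captions_test_coco_format.json", "w"))
```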
- Download Training split, Validation split and Image Labels from here.
- Run the following command to download the actual images:
$ python scripts/download_conceptual.py --root datasets/ConceptualCaptions
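Under the hood, downloading Conceptual Captions boils down to fetching every URL in the split TSVs, which (in the official release) contain one caption and image URL per line, tab-separated. A minimal sketch, with the TSV file name assumed from the official release and error handling kept to a bare minimum:

```python
# Sketch: download Conceptual Captions images from the official TSV.
import requests

with open("datasets/ConceptualCaptions/Train_GCC-training.tsv") as f:
    for i, line in enumerate(f):
        caption, url = line.rstrip("\n").split("\t")
        try:
            r = requests.get(url, timeout=5)
            r.raise_for_status()
            with open(f"datasets/ConceptualCaptions/images/{i}.jpg", "wb") as out:
                out.write(r.content)
        except requests.RequestException:
            continue    # many Conceptual Captions URLs are dead; skip failures
```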
- Run the following command to extract image features and tokens:
$ python tools/prepare_conceptual.py \
--dataset-path datasets/ConceptualCaptions \
--save-path data/ConceptualCaptions
- To evaluate with coco-caption, you need to convert the annotations to COCO JSON format:
$ python scripts/convert_conceptual_to_coco.py \
--input-txt VAL_TXT_PATH \
--output-json COCO_JSON_SAVE_PATH
Create a YAML configuration file; the default configuration can be found in configs/defaults.yaml.
This file is needed for both training and evaluation.
$ python tools/train.py --cfg CONFIG_FILE.yaml
$ python tools/val.py --cfg CONFIG_FILE.yaml
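Evaluation goes through coco-caption. If you want to score a predictions file yourself, the standard pycocotools/pycocoevalcap API looks like this; the `predictions.json` path is a placeholder for a file of `[{"image_id": ..., "caption": ...}, ...]` entries:

```python
# Sketch: score generated captions with the coco-caption metrics.
from pycocotools.coco import COCO
from pycocoevalcap.eval import COCOEvalCap

coco = COCO("coco_caption/annotations/captions_val2014.json")  # ground truth
res = coco.loadRes("predictions.json")                         # model outputs
evaluator = COCOEvalCap(coco, res)
evaluator.params["image_id"] = res.getImgIds()  # score only captioned images
evaluator.evaluate()
for metric, score in evaluator.eval.items():    # BLEU, METEOR, ROUGE-L, CIDEr, SPICE
    print(f"{metric}: {score:.3f}")
```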
Most of the code is adapted from:
@article{mokady2021clipcap,
title={ClipCap: CLIP Prefix for Image Captioning},
author={Mokady, Ron and Hertz, Amir and Bermano, Amit H},
journal={arXiv preprint arXiv:2111.09734},
year={2021}
}