
Image Captioning

Most image captioning models are complicated and hard to test. A traditional image captioning pipeline first encodes the image with a Bottom-Up Top-Down (BUTD) detector, a Faster R-CNN trained on the Visual Genome dataset, to extract the so-called bottom-up features, and then feeds those features to an attention or transformer model that generates the caption. Self-Critical Sequence Training (SCST) is often applied on top to further improve the results.

In 2020, Microsoft released a new model called Oscar, which set a new record in image captioning. It is pre-trained on a large collection of datasets and then fine-tuned on downstream tasks, which is likewise a complicated and time-consuming process. A follow-up work called VinVL replaces the usual bottom-up features used in Oscar with features from its own object-attribute detection model.

After OpenAI released the zero-shot model CLIP, many papers appeared on vision-language tasks, such as CLIP-ViL, X-modaler and, most recently, ClipCap. Among them, ClipCap is the simplest network and the easiest to test.
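
To give a rough idea of the approach, below is a minimal sketch of the ClipCap recipe (illustrative only, not this repository's actual code): a frozen CLIP encoder embeds the image, a small mapping network turns that embedding into a "prefix" of pseudo-token embeddings, and GPT-2 generates the caption conditioned on that prefix. The dimensions and layer sizes here are assumptions chosen for clarity.

import torch
import torch.nn as nn

class PrefixMapper(nn.Module):
    """Maps one CLIP image embedding to `prefix_len` GPT-2-sized token embeddings."""
    def __init__(self, clip_dim: int = 512, gpt_dim: int = 768, prefix_len: int = 10):
        super().__init__()
        self.gpt_dim, self.prefix_len = gpt_dim, prefix_len
        # Hypothetical two-layer MLP; ClipCap also describes a transformer mapper.
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, gpt_dim * prefix_len // 2),
            nn.Tanh(),
            nn.Linear(gpt_dim * prefix_len // 2, gpt_dim * prefix_len),
        )

    def forward(self, clip_embed: torch.Tensor) -> torch.Tensor:
        # (B, clip_dim) -> (B, prefix_len, gpt_dim): a learned "visual prefix"
        # that the language model attends to in place of ordinary word embeddings.
        return self.mlp(clip_embed).view(-1, self.prefix_len, self.gpt_dim)

# Dummy usage with a random stand-in for a CLIP ViT-B/32 image embedding:
prefix = PrefixMapper()(torch.randn(1, 512))
print(prefix.shape)  # torch.Size([1, 10, 768])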

Benchmarks

COCO

Model   | BLEU-4↑ | METEOR↑ | ROUGE-L↑ | CIDEr↑ | SPICE↑ | Params (M) | Pretrained
--------|---------|---------|----------|--------|--------|------------|-----------
ClipCap | 32.2    | 27.1    | -        | 108.4  | 20.1   | 156        | download
LATGeO  | 36.4    | 27.8    | 56.7     | 115.8  | -      | -          | -

Conceptual Captions

Model   | ROUGE-L↑ | CIDEr↑ | SPICE↑ | Params (M) | Pretrained
--------|----------|--------|--------|------------|-----------
ClipCap | 26.7     | 87.3   | 18.5   | 156        | download

nocaps

Model   | in-domain (CIDEr↑ / SPICE↑) | near-domain (CIDEr↑ / SPICE↑) | out-of-domain (CIDEr↑ / SPICE↑) | overall (CIDEr↑ / SPICE↑) | Params (M)
--------|-----------------------------|-------------------------------|---------------------------------|---------------------------|-----------
ClipCap | 79.7 / 12.2                 | 67.7 / 11.3                   | 49.4 / 9.7                      | 65.7 / 11.1               | 156

Note: All results above are reported without CIDEr optimization.

Requirements

  • torch >= 1.8.1
  • torchvision >= 0.8.1
  • Python >= 3.8

Clone the repo recursively:

$ git clone --recursive https://github.com/sithu31296/image-captioning.git

If you want to evaluate, follow the installation steps in coco-caption; otherwise this step is not needed.

Other requirements can be installed with:

$ pip install -r requirements.txt

Inference

$ python tools/infer.py \
  --model-path MODEL_WEIGHTS \
  --img-path TEST_IMAGE_PATH \
  --beam-search False

Sample inference result:

A couple of people standing next to an elephant.
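
The --beam-search flag presumably toggles between beam search and plain greedy decoding. As a hedged sketch of what greedy decoding from a ClipCap-style prefix looks like (assuming a HuggingFace GPT-2; this is not the repository's tools/infer.py):

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

@torch.no_grad()
def greedy_caption(prefix_embeds: torch.Tensor, max_len: int = 20) -> str:
    """prefix_embeds: (1, prefix_len, 768) tensor produced by the mapping network."""
    tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
    gpt2 = GPT2LMHeadModel.from_pretrained('gpt2').eval()
    embeds, tokens = prefix_embeds, []
    for _ in range(max_len):
        logits = gpt2(inputs_embeds=embeds).logits[:, -1, :]  # next-token distribution
        next_token = logits.argmax(dim=-1)                    # greedy choice
        if next_token.item() == tokenizer.eos_token_id:
            break
        tokens.append(next_token.item())
        next_embed = gpt2.transformer.wte(next_token).unsqueeze(1)
        embeds = torch.cat([embeds, next_embed], dim=1)       # append and re-feed
    return tokenizer.decode(tokens)

Beam search instead keeps the k most probable partial captions at each step rather than only the single best one, which usually yields slightly better captions at the cost of slower inference.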

Datasets

Dataset             | Train | Val | Test | Captions / Image | Vocab Size | Avg. Tokens / Caption
--------------------|-------|-----|------|------------------|------------|----------------------
COCO                | 83k   | 5k  | 5k   | 5                | -          | -
Conceptual Captions | 3M    | 15k | -    | 1                | 51k        | 10.3
Flickr8k            | 6k    | 1k  | 1k   | 5                | -          | -
Flickr30k           | 29k   | 1k  | 1k   | 5                | -          | -
nocaps              | -     | -   | -    | -                | -          | -

Datasets Preparation

COCO / Flickr8k / Flickr30k

  • Download the dataset images.
    • For COCO, download the COCO2014 images from COCO.
    • For Flickr8k, download the images from the Official Website, or from Kaggle if the official download does not work.
    • For Flickr30k, download the images from the Official Website, or from Kaggle if the official download does not work.
  • Download the Karpathy splits for COCO, Flickr8k and Flickr30k from here.
  • Run the following command to extract image features and tokens:
$ python tools/prepare_coco_flickr.py \
  --annot-path KARPATHY_ANNOT_JSON_PATH \
  --dataset-path DATASET_ROOT_PATH \
  --save-path SAVE_PATH
  • To evaluate with coco-caption, you need to convert the Karpathy split JSON format to the COCO JSON format.
$ python scripts/convert_coco_format.py \
  --input-json KARPATHY_ANNOT_JSON_PATH \
  --output-json COCO_JSON_SAVE_PATH \
  --split 'test' or 'val'

To evaluate on the COCO val split, you can also use the annotation file coco_caption/annotations/captions_val2014.json.
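
For reference, the conversion mentioned above roughly amounts to the following (a sketch only, assuming the standard Karpathy JSON layout; use scripts/convert_coco_format.py in practice):

import json

def karpathy_to_coco(karpathy_json: str, out_json: str, split: str = 'test') -> None:
    with open(karpathy_json) as f:
        data = json.load(f)
    images, annotations, ann_id = [], [], 0
    for img in data['images']:
        if img['split'] != split:
            continue
        img_id = img.get('cocoid', img['imgid'])  # Flickr splits have no 'cocoid'
        images.append({'id': img_id, 'file_name': img['filename']})
        for sent in img['sentences']:
            annotations.append({'id': ann_id, 'image_id': img_id, 'caption': sent['raw']})
            ann_id += 1
    with open(out_json, 'w') as f:
        json.dump({'images': images, 'annotations': annotations,
                   'type': 'captions', 'info': {}, 'licenses': []}, f)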

Conceptual Captions

  • Download the Training split, Validation split and Image Labels from here.
  • Run the following command to download the actual images:
$ python scripts/download_conceptual.py --root datasets/ConceptualCaptions
  • Run the following command to extract image features and tokens:
$ python tools/prepare_conceptual.py \
  --dataset-path datasets/ConceptualCaptions \
  --save-path data/ConceptualCaptions
  • To evaluate with coco-caption, you need to convert the validation annotations to the COCO JSON format.
$ python scripts/convert_conceptual_to_coco.py \
  --input-txt VAL_TXT_PATH \
  --output-json COCO_JSON_SAVE_PATH
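
As a rough, hedged illustration of what the image-download step involves (assuming the official TSV layout of one caption and URL per tab-separated line; the real logic lives in scripts/download_conceptual.py):

import os
import requests

def download_split(tsv_path: str, out_dir: str, limit: int = 100) -> None:
    """Fetch the first `limit` images listed in a Conceptual Captions TSV file."""
    os.makedirs(out_dir, exist_ok=True)
    with open(tsv_path, encoding='utf-8') as f:
        for i, line in enumerate(f):
            if i >= limit:
                break
            parts = line.rstrip('\n').split('\t')
            if len(parts) < 2:
                continue
            caption, url = parts[0], parts[1]  # only the URL is needed to fetch the image
            try:
                resp = requests.get(url, timeout=10)
                resp.raise_for_status()
            except requests.RequestException:
                continue  # many Conceptual Captions URLs are dead; skip failures
            with open(os.path.join(out_dir, f'{i:08d}.jpg'), 'wb') as img:
                img.write(resp.content)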

Configuration File

Create a YAML configuration file. A default configuration file can be found in configs/defaults.yaml.

This file is needed for both training and evaluation.
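
A typical workflow is to copy the default file and edit it; the exact keys depend on configs/defaults.yaml, so the override below is purely hypothetical:

import yaml

# Load the shipped defaults and save a tweaked copy for your own experiment.
with open('configs/defaults.yaml') as f:
    cfg = yaml.safe_load(f)

# Hypothetical override; the real key names are defined in defaults.yaml.
# cfg['DATASET']['ROOT'] = 'data/COCO'

with open('configs/my_experiment.yaml', 'w') as f:
    yaml.safe_dump(cfg, f)

Then pass configs/my_experiment.yaml to tools/train.py or tools/val.py via --cfg.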

Training

$ python tools/train.py --cfg CONFIG_FILE.yaml

Evaluation

$ python tools/val.py --cfg CONFIG_FILE.yaml
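
If you prefer to score a finished set of predictions yourself, coco-caption can also be driven directly; a minimal sketch, assuming your predictions are saved in the standard COCO results format (a JSON list of {"image_id": ..., "caption": ...} entries) at a hypothetical path:

from pycocotools.coco import COCO
from pycocoevalcap.eval import COCOEvalCap

coco = COCO('coco_caption/annotations/captions_val2014.json')      # ground-truth annotations
coco_res = coco.loadRes('results/captions_val2014_results.json')   # hypothetical predictions file

evaluator = COCOEvalCap(coco, coco_res)
evaluator.params['image_id'] = coco_res.getImgIds()  # score only the images you predicted
evaluator.evaluate()

for metric, score in evaluator.eval.items():          # BLEU, METEOR, ROUGE_L, CIDEr, SPICE
    print(f'{metric}: {score:.3f}')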

References

Most of the code is from:

Citations

@article{mokady2021clipcap,
  title={ClipCap: CLIP Prefix for Image Captioning},
  author={Mokady, Ron and Hertz, Amir and Bermano, Amit H},
  journal={arXiv preprint arXiv:2111.09734},
  year={2021}
}
