CLIP prefix captioning.

Official implementation for the paper "ClipCap: CLIP Prefix for Image Captioning"


Image captioning is a complicated task in which a pretrained detection network is usually used, requiring additional supervision in the form of object annotations. We present a new approach that does not require additional information (i.e., it requires only images and captions) and can therefore be applied to any dataset. In addition, our model trains much faster than similar methods while achieving results comparable to the state of the art, even on the Conceptual Captions dataset, which contains over 3M images.

In our work, we use the CLIP model, which was already trained on an extremely large number of images and is therefore capable of generating semantic encodings for arbitrary images without additional supervision. To produce meaningful sentences, we fine-tune a pretrained language model, an approach that has proven successful for other natural-language tasks. The key idea is to use the CLIP encoding as a prefix to the textual captions: we employ a simple mapping network over the raw encoding and then fine-tune our language model to generate a valid caption. In addition, we present another variant in which we use a transformer architecture for the mapping network and avoid fine-tuning GPT-2. Even so, our lightweight model achieves results comparable to the state of the art on the nocaps dataset.
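The mapping step described above can be sketched as follows. This is a minimal illustration, not the repository's code: it assumes an MLP that expands a single CLIP image encoding into a fixed number of "prefix" embeddings living in the language model's embedding space (real dimensions would be e.g. 512 for ViT-B/32 and 768 for GPT-2; toy sizes are used here).

```python
import numpy as np

def mlp_prefix(clip_embed, w1, b1, w2, b2, prefix_length, gpt_dim):
    """Map one CLIP embedding to `prefix_length` pseudo-token embeddings.

    clip_embed: (d_clip,) raw CLIP image encoding.
    Returns an array of shape (prefix_length, gpt_dim) that is concatenated
    in front of the caption token embeddings during training/inference.
    """
    hidden = np.tanh(clip_embed @ w1 + b1)   # (d_hidden,)
    flat = hidden @ w2 + b2                  # (prefix_length * gpt_dim,)
    return flat.reshape(prefix_length, gpt_dim)

# toy dimensions for illustration only
rng = np.random.default_rng(0)
d_clip, d_hidden, k, d_gpt = 8, 16, 4, 6
prefix = mlp_prefix(
    rng.normal(size=d_clip),
    rng.normal(size=(d_clip, d_hidden)), np.zeros(d_hidden),
    rng.normal(size=(d_hidden, k * d_gpt)), np.zeros(k * d_gpt),
    prefix_length=k, gpt_dim=d_gpt,
)
print(prefix.shape)  # (4, 6)
```

During training, these prefix embeddings are prepended to the caption's token embeddings, and the language-model loss is computed only over the caption tokens.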

COCO Examples

(Example images omitted; generated captions below.)

A couple of people standing next to an elephant.
A wooden table sitting in front of a window.
A bunch of bananas sitting on top of a table.
A woman holding a plate with a piece of cake in front of her face.
A wooden table topped with lots of wooden utensils.
A red motorcycle parked on top of a dirt field.

Conceptual Captions Examples

(Example images omitted; generated captions below.)

3D render of a man holding a globe.
Students enjoying the cherry blossoms.
Green leaf of lettuce on a white plate.
The hotel and casino on the waterfront.
The triangle is a symbol of the soul.
Cartoon boy in the bath.

Inference Notebooks

To help visualize the results, we provide a Colab notebook, found in notebooks/clip_prefix_captioning_inference.ipynb.
The notebook downloads the pretrained models and runs inference on sample images or on images of your choosing. It is recommended to run it in Google Colab. An inference notebook for the transformer mapping network (without fine-tuning GPT-2) is available for the COCO model (also in notebooks/transformer_inference.ipynb).
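At inference time, the caption is generated autoregressively from the prefix. A minimal greedy-decoding sketch is shown below (the notebooks also support beam search); `next_logits_fn` is a stand-in for the real CLIP + mapping network + GPT-2 pipeline:

```python
def greedy_decode(next_logits_fn, prefix, max_len=20, eos_id=0):
    """Generate token ids one at a time, always taking the argmax.

    next_logits_fn(prefix, tokens) -> list of vocabulary logits for the next
    position; in ClipCap the prefix embeddings are fed to GPT-2 in front of
    the embeddings of the already-generated caption tokens.
    """
    tokens = []
    for _ in range(max_len):
        logits = next_logits_fn(prefix, tokens)
        nxt = max(range(len(logits)), key=logits.__getitem__)
        if nxt == eos_id:
            break
        tokens.append(nxt)
    return tokens

# toy "model": prefers token 3 for two steps, then emits the end-of-sequence id
def toy_model(prefix, tokens):
    return [1.0, 0.0, 0.0, 2.0] if len(tokens) < 2 else [9.0, 0.0, 0.0, 0.0]

print(greedy_decode(toy_model, prefix=None))  # [3, 3]
```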

Pretrained models for both COCO and Conceptual Captions are available for the MLP mapping network. For the transformer variant (without fine-tuning GPT-2), we provide a COCO pretrained model.

Inference GUI

  1. Run it in the browser using the UI.
  2. Integrated into Hugging Face Spaces with Gradio. See the demo on Hugging Face Spaces (beam search is currently not supported).

Training prerequisites

Clone the repository, create the environment, and install the dependencies:

git clone https://github.com/rmokady/CLIP_prefix_caption && cd CLIP_prefix_caption
conda env create -f environment.yml
conda activate clip_prefix_caption

COCO training

Download train_captions to data/coco/annotations.

Download the training images and validation images and unzip them (we use the Karpathy et al. split).

Extract CLIP features using (output is data/coco/oscar_split_ViT-B_32_train.pkl):

python parse_coco.py --clip_model_type ViT-B/32
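The parse step caches each image's CLIP embedding together with its captions in a single pickle, so training never touches raw images. A minimal sketch of such a cache is shown below; `embed_image` is a stub standing in for the real CLIP encoder, and the exact keys in the repository's pickle may differ:

```python
import os
import pickle
import tempfile

def embed_image(path):
    # stub: the real script runs CLIP's image encoder here
    return [0.0] * 512

def build_cache(samples, out_path):
    """samples: iterable of (image_path, caption) pairs."""
    embeddings, captions = [], []
    for i, (img, cap) in enumerate(samples):
        embeddings.append(embed_image(img))
        # each caption records the index of its image embedding,
        # so several captions can share one embedding
        captions.append({"caption": cap, "clip_embedding": i})
    with open(out_path, "wb") as f:
        pickle.dump({"clip_embedding": embeddings, "captions": captions}, f)

out = os.path.join(tempfile.gettempdir(), "demo_split.pkl")
build_cache([("img1.jpg", "a cat"), ("img2.jpg", "a dog")], out)
with open(out, "rb") as f:
    data = pickle.load(f)
print(len(data["captions"]))  # 2
```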

Train with fine-tuning of GPT2:

python train.py --data ./data/coco/oscar_split_ViT-B_32_train.pkl --out_dir ./coco_train/

Train only the transformer mapping network:

python train.py --only_prefix --data ./data/coco/oscar_split_ViT-B_32_train.pkl --out_dir ./coco_train/ --mapping_type transformer --num_layers 8 --prefix_length 40 --prefix_length_clip 40
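In this variant GPT-2 stays frozen, so all the capacity lives in the mapper: the CLIP encoding is projected into prefix_length_clip tokens, concatenated with a learned constant of prefix_length tokens, and passed through the transformer; the positions of the learned constant form the prefix fed to GPT-2. A shape-only sketch (not the repository's code; `layers_fn` stands in for the 8-layer self-attention stack, here an identity stub):

```python
import numpy as np

def transformer_mapper(clip_embed, proj, const_prefix, layers_fn):
    """clip_embed: (d_clip,) -> prefix of shape (prefix_length, gpt_dim)."""
    gpt_dim = const_prefix.shape[1]
    clip_length = proj.shape[1] // gpt_dim
    clip_tokens = (clip_embed @ proj).reshape(clip_length, gpt_dim)
    seq = np.concatenate([clip_tokens, const_prefix], axis=0)
    out = layers_fn(seq)  # attention lets CLIP tokens refine the constant
    return out[clip_length:]  # only learned-prefix positions condition GPT-2

# toy dimensions (the command above uses 40 and 40 for both lengths)
rng = np.random.default_rng(1)
d_clip, gpt_dim, clip_len, prefix_len = 8, 6, 4, 4
prefix = transformer_mapper(
    rng.normal(size=d_clip),
    rng.normal(size=(d_clip, clip_len * gpt_dim)),
    rng.normal(size=(prefix_len, gpt_dim)),
    layers_fn=lambda s: s,  # identity stub for the transformer layers
)
print(prefix.shape)  # (4, 6)
```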

If you wish to use ResNet-based CLIP:

python parse_coco.py --clip_model_type RN50x4
python train.py --only_prefix --data ./data/coco/oscar_split_RN50x4_train.pkl --out_dir ./coco_train/ --mapping_type transformer --num_layers 8 --prefix_length 40 --prefix_length_clip 40 --is_rn

Conceptual training

Download the .TSV train/val files from Conceptual Captions and place them under the <data_root> directory.

Download the images and extract CLIP features using (outputs are <data_root>/conceptual_clip_ViT-B_32_train.pkl and <data_root>/conceptual_clip_ViT-B_32_val.pkl):

python parse_conceptual.py --clip_model_type ViT-B/32 --data_root <data_root> --num_threads 16

Note that downloading the images may take a few days.

Train with fine-tuning of GPT2:

python train.py --data <data_root>/conceptual_clip_ViT-B_32_train.pkl --out_dir ./conceptual_train/

As with the COCO training, you can train a transformer mapping network and/or parse the images using a ResNet-based CLIP.


If you use this code for your research, please cite:

@article{mokady2021clipcap,
  title={ClipCap: CLIP Prefix for Image Captioning},
  author={Mokady, Ron and Hertz, Amir and Bermano, Amit H},
  journal={arXiv preprint arXiv:2111.09734},
  year={2021}
}

This repository is heavily based on the CLIP and Hugging Face repositories. For training we used data from the COCO dataset and Conceptual Captions.


For any inquiry, please contact us at our email addresses.

