This is the source code for the paper "CPT: Colorful Prompt Tuning for Pre-trained Vision-Language Models".

Recent Updates

  • 2022.05.06 Initialized CPT code for grounding, VRD, GQA, and VCR.
  • 2022.05.15 Tested CPT code for grounding, GQA, and VCR tasks.
  • 2022.05.19 Tested CPT code for VRD.

The code is based on two sub-repos. prompt_feat extracts visual features with the help of a pre-trained object detector. Oscar is the pre-trained vision-language model used for inference.


We wrap all the commands in a script that you can run directly with bash. Or run the steps manually:

# you can directly run by
# bash

# create a new environment
conda create --name cpt python=3.7
conda activate cpt

# install pytorch1.6
conda install pytorch==1.6.0 torchvision==0.7.0 cudatoolkit=10.2 -c pytorch


# install apex
git clone
cd apex
python install --cuda_ext --cpp_ext
cd ..

# install requirements
pip install -r requirements.txt

# install prompt_feat
cd prompt_feat
python build develop
cd ..

# install oscar
cd Oscar
# install transformers
git clone
cd transformers
git reset --hard 067923d3267325f525f4e46f357360c191ba562e
cd ..
# install coco_caption
git clone
cd coco-caption
git reset --hard de6f385503ac9a4305a1dcdc39c02312f9fa13fc
# ./
cd ..
python build develop



Before running the code, please first download the pre-trained feature extractor and Oscar models.

bash cmds/prepare_data/

After downloading, there should be:



1. Visual Grounding

The visual grounding task is to find the image region corresponding to a query sentence, e.g., "the black horse".


Note: all data is downloaded to the data directory. To store it somewhere else, create a soft link:

ln -s your_data_path data
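To show what the soft link does, here is a minimal sketch that uses throwaway directories rather than a real checkout. WORK_DIR and STORAGE_DIR are names made up for this example; STORAGE_DIR stands in for your_data_path.

```shell
# Throwaway directories so the sketch does not touch a real checkout.
WORK_DIR=$(mktemp -d)         # stands in for the repository root
STORAGE_DIR=$(mktemp -d)      # stands in for your_data_path

# Same form as: ln -s your_data_path data
ln -s "$STORAGE_DIR" "$WORK_DIR/data"

# Anything written under data/ now lands in the storage directory.
touch "$WORK_DIR/data/example.txt"
ls "$STORAGE_DIR"
```

The download scripts can keep writing to data while the files are stored wherever the link points.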

Please download the data first.

bash cmds/prepare_data/

Feature Extraction

To extract features:

cd prompt_feat
bash cmds/refcoco/ # make sure you have at least 4 GPUs
# actually 1 GPU is also OK, don't panic.

To run on a single GPU or a different number of GPUs, go to prompt_feat/cmds/refcoco/cpt and modify CUDA_VISIBLE_DEVICES, --nproc_per_node and TEST.IMS_PER_BATCH accordingly.
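As a rough illustration of how these three settings relate, the sketch below counts the GPU ids in a CUDA_VISIBLE_DEVICES-style list and derives a matching --nproc_per_node value. GPU_LIST, NUM_GPUS, and IMS_PER_BATCH are names chosen for this example and are not taken from the repository scripts. The divisibility check assumes the test batch is split evenly across processes, which may not hold for this code base.

```shell
# Hypothetical values; the real scripts may name things differently.
GPU_LIST="0"            # what you would put in CUDA_VISIBLE_DEVICES
IMS_PER_BATCH=1         # what you would pass as TEST.IMS_PER_BATCH

# Count the comma-separated GPU ids to get a value for --nproc_per_node.
NUM_GPUS=$(printf '%s\n' "$GPU_LIST" | tr ',' '\n' | grep -c .)

# Assumption: the test batch should divide evenly across the processes.
if [ $((IMS_PER_BATCH % NUM_GPUS)) -ne 0 ]; then
    echo "IMS_PER_BATCH=$IMS_PER_BATCH is not divisible by $NUM_GPUS GPUs" >&2
fi

echo "CUDA_VISIBLE_DEVICES=$GPU_LIST --nproc_per_node=$NUM_GPUS TEST.IMS_PER_BATCH $IMS_PER_BATCH"
```

Setting GPU_LIST="0,1,2,3" gives NUM_GPUS=4, which corresponds to the 4-GPU default mentioned in the command comment.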

CPT Inference

To run inference:

cd Oscar
bash cmds/refcoco/

GPU 0 is used by default. To use a different GPU, go to cmds/refcoco/ and change GPU=0 to the GPU id you want.


To evaluate, please run:

cd Oscar
python eval/refcoco/ results/refcoco/fsl/

2. GQA

GQA is a question answering (QA) dataset that requires reasoning ability.


Please download the data first.

bash cmds/prepare_data/

Feature Extraction

To extract features:

cd prompt_feat
bash cmds/gqa/ # make sure you have at least 4 GPUs
# actually 1 GPU is also OK, don't panic.

To run on a single GPU or a different number of GPUs, go to prompt_feat/cmds/gqa/*.sh and modify CUDA_VISIBLE_DEVICES, --nproc_per_node and TEST.IMS_PER_BATCH accordingly.

CPT Inference

To run inference:

cd Oscar
bash cmds/gqa/
bash cmds/gqa/

GPUs 0,1,2,3 are used by default. To change the GPU ids, go to cmds/gqa/. You can also run the program on a single GPU without modifying the batch size. The result should be similar because the gradient accumulation step is set to the dataset size.
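One way to see why the batch size matters little here: if gradients are accumulated over the whole dataset before a single update, the accumulated value does not depend on how the examples are grouped into batches. The toy below uses made-up integer "gradients" and assumes they are summed rather than averaged per batch. It is not taken from the training code.

```shell
# Toy per-example "gradients" (made-up integers, not real model values).
GRADS="3 -1 4 1 -5 9 2 -6"

# Sum the values in batches of size $1 and accumulate across batches.
accumulate() {
    batch_size="$1"
    total=0
    batch_sum=0
    count=0
    for g in $GRADS; do
        batch_sum=$((batch_sum + g))
        count=$((count + 1))
        if [ "$count" -eq "$batch_size" ]; then
            total=$((total + batch_sum))   # end of batch: add to accumulator
            batch_sum=0
            count=0
        fi
    done
    total=$((total + batch_sum))           # leftover partial batch, if any
    echo "$total"
}

accumulate 4   # four examples per batch
accumulate 1   # one example per batch
```

Both calls print the same total, which is the sense in which the single-GPU and multi-GPU runs are expected to behave similarly.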


To evaluate, please run:

cd Oscar
bash eval/gqa/

3. VCR (Visual Commonsense Reasoning)

VCR is a multiple-choice QA dataset with three tasks: question->answer, question+answer->rationale, and question->answer+rationale.


Please download the data first.

bash cmds/prepare_data/

Feature Extraction

To extract features:

cd prompt_feat
bash cmds/vcr/ # make sure you have at least 4 GPUs
# actually 1 GPU is also OK, don't panic.

To run on a single GPU or a different number of GPUs, go to prompt_feat/cmds/vcr/pt_vcr_val_seg and cpt_vcr_val_seg, and modify CUDA_VISIBLE_DEVICES, --nproc_per_node and TEST.IMS_PER_BATCH accordingly.

CPT Inference

To run inference:

export GPUID=0

# vcr_q_a
bash cmds/vcr/ $GPUID vcr_q_a cpt
bash cmds/vcr/ $GPUID vcr_q_a pt

# vcr_qa_r
bash cmds/vcr/ $GPUID vcr_qa_r cpt
bash cmds/vcr/ $GPUID vcr_qa_r pt

# vcr_qar
bash cmds/vcr/ $GPUID vcr_qar cpt
bash cmds/vcr/ $GPUID vcr_qar pt

GPU 0 is used by default. To use a different GPU, change GPUID=0 to the GPU id you want.

Our implementation also supports running all tasks simultaneously by assigning a different GPUID to each task.
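A minimal dry-run sketch of that idea: it prints one launch command per task, each with its own GPU id, followed by a wait. RUN_SCRIPT is a placeholder that must point at the actual script under cmds/vcr/, and print_vcr_commands is a helper invented for this example. The argument order (GPU id, task, mode) follows the commands above.

```shell
# Placeholder: replace with the actual script under cmds/vcr/.
RUN_SCRIPT="cmds/vcr/<script>"

# Print one background launch command per VCR task, one GPU per task.
print_vcr_commands() {
    mode="$1"                      # cpt or pt, as in the commands above
    gpu=0
    for task in vcr_q_a vcr_qa_r vcr_qar; do
        # The trailing "&" would run the task in the background.
        echo "bash $RUN_SCRIPT $gpu $task $mode &"
        gpu=$((gpu + 1))
    done
    echo "wait"                    # block until all background tasks finish
}

print_vcr_commands cpt
```

The function only prints the commands, so nothing is launched until you copy them into a shell with RUN_SCRIPT filled in.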


To evaluate, please run:

cd Oscar
bash eval/vcr/

4. VG (Visual Genome)

VG is a visual relation detection dataset. The model should detect relational triplets in images.


Please download the data first.

bash cmds/prepare_data/

Feature Extraction

To extract features:

cd prompt_feat
bash cmds/vg/ # make sure you have at least 4 GPUs
# actually 1 GPU is also OK, don't panic.

To run on a single GPU or a different number of GPUs, go to prompt_feat/cmds/vg/ and modify CUDA_VISIBLE_DEVICES, --nproc_per_node and TEST.IMS_PER_BATCH accordingly.

CPT Inference

To run inference:

cd Oscar
bash cmds/vg/

GPUs 0,1,2,3 are used by default. To change the GPU ids, go to Oscar/cmds/vg/ and modify CUDA_VISIBLE_DEVICES and --nproc_per_node. Note that --per_gpu_train_batch_size multiplied by the number of GPUs should be 40; otherwise the results will differ.
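The batch-size constraint can be checked with a few lines of arithmetic. In the sketch below, TOTAL_BATCH=40 comes from the note above, while NUM_GPUS and PER_GPU_BATCH are names chosen for this example. The flag spelling in the printed line is an assumption about how the script passes these values.

```shell
# Illustrative only: 40 is the required product from the note above.
TOTAL_BATCH=40
NUM_GPUS=2                     # number of ids in CUDA_VISIBLE_DEVICES

if [ $((TOTAL_BATCH % NUM_GPUS)) -ne 0 ]; then
    # No integer per-GPU batch size gives a product of exactly 40.
    echo "$TOTAL_BATCH is not divisible by $NUM_GPUS GPUs" >&2
    PER_GPU_BATCH=""
else
    PER_GPU_BATCH=$((TOTAL_BATCH / NUM_GPUS))
    echo "--nproc_per_node=$NUM_GPUS --per_gpu_train_batch_size $PER_GPU_BATCH"
fi
```

With the default 4 GPUs this gives a per-GPU batch size of 10. A GPU count that does not divide 40, such as 3, has no exact setting.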


To evaluate, please run:

cd Oscar
bash eval/vg/ results/vg/cpt/

Bugs or questions?

If you have any questions related to the code or the paper, feel free to email Ao Zhang. If you encounter any problems when using the code, or want to report a bug, you can open an issue. Please describe the problem in detail so we can help you better and faster!


The code is built on scene_graph_benchmark and Oscar. Thanks to their authors for the excellent code.

