# Evaluate Popular Vision Language Models on SPEC
In [our paper](https://arxiv.org/abs/2312.00081), we evaluated four popular VLMs on our SPEC dataset: [CLIP](https://arxiv.org/abs/2103.00020), [BLIP](https://arxiv.org/abs/2201.12086), [FLAVA](https://arxiv.org/abs/2112.04482), and [CoCa](https://arxiv.org/abs/2205.01917). \
This notebook walks through reproducing these results step by step. Let's go!

## 1. How to use this notebook?
You can run this notebook locally. Before running it, make sure that you have prepared the environment. \
You can also run the online version directly: [![online notebook](https://img.shields.io/badge/colab-notebook-yellow)](https://colab.research.google.com/github/wjpoom/SPEC/blob/main/notebooks/evaluate_example_colab.ipynb).
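
One quick way to check a local environment is to confirm that the packages imported in the next section can be located. The snippet below is a minimal sketch of such a check; the `find_missing` helper is hypothetical and not part of the `spec` package, and the module list is simply taken from the import cell of this notebook.

```python
import importlib.util

# Top-level modules that the cells in this notebook import.
REQUIRED_MODULES = ["torch", "huggingface_hub", "spec"]


def find_missing(modules):
    """Return the names in `modules` that cannot be found in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


missing = find_missing(REQUIRED_MODULES)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages were found.")
```

The check only confirms that the modules can be found; it does not verify their versions.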

## 2. Import Packages

In [1]:
import zipfile
import os
import torch
import warnings
warnings.filterwarnings('ignore')
from spec import get_data, get_model
from huggingface_hub import hf_hub_download

## 3. Prepare the testing dataset
We store the data on Hugging Face. Before starting, download and decompress the data as follows:

In [2]:
# specify the directory where the data will be downloaded and extracted
data_root = '/path/to/save/data'
# download *.zip files
hf_hub_download(repo_id='wjpoom/SPEC', repo_type='dataset', filename='data.zip', local_dir=data_root)
# extract *.zip files
with zipfile.ZipFile(os.path.join(data_root, 'data.zip'), 'r') as zip_ref:
    zip_ref.extractall(data_root)
# remove the *.zip files
os.remove(os.path.join(data_root, 'data.zip'))
print(f'The SPEC dataset is prepared at: {data_root}.')
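
Because the cell above deletes `data.zip` right after extraction, it can be useful to inspect the archive first. The following is a minimal sketch using only the standard library: it lists the top-level entries of a zip file and runs the built-in CRC check. The `inspect_archive` helper and the member names in the demo are made up for illustration; they do not describe the actual layout of the SPEC archive.

```python
import os
import tempfile
import zipfile


def inspect_archive(zip_path):
    """Return (top_level_entries, first_bad_member) for a zip archive.

    `first_bad_member` is None when every member passes its CRC check.
    """
    with zipfile.ZipFile(zip_path, "r") as zf:
        top_level = sorted({name.split("/", 1)[0] for name in zf.namelist()})
        first_bad_member = zf.testzip()
    return top_level, first_bad_member


# Demo on a throwaway archive; the member names below are made up.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_zip = os.path.join(tmp_dir, "demo.zip")
    with zipfile.ZipFile(demo_zip, "w") as zf:
        zf.writestr("subset_a/example.txt", "a")
        zf.writestr("subset_b/example.txt", "b")
    print(inspect_archive(demo_zip))
```

To use it here, call `inspect_archive(os.path.join(data_root, 'data.zip'))` before the `os.remove` line.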

## 4. Let's Evaluate VLMs on the SPEC Dataset!

### 4.1 Evaluate CLIP
We use the `ViT-B/32` variant of [CLIP](https://arxiv.org/abs/2103.00020) with weights loaded from the checkpoint released by OpenAI.

In [3]:
# load model
model_cache_dir = '/path/to/cache/models' # specify the path to save the downloaded model checkpoint
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model, image_preprocess = get_model(model_name='clip', cache_dir=model_cache_dir, device=device)
# load datasets
subset_names = ['existence', 'relative_spatial', 'absolute_size', 'relative_size', 'count', 'absolute_spatial']
subsets = get_data(data_root=data_root, subset_names=subset_names, image_preprocess=image_preprocess, batch_size=64, num_workers=8)
# evaluate
result = {}
i2t_acc = 0.
t2i_acc = 0.
subset_num = 0
for subset_name, dataloaders in subsets.items():
    subset_result = model.evaluate(subset_name=subset_name, dataloaders=dataloaders)
    result[subset_name] = subset_result
    i2t_acc += subset_result['accuracy']['i2t_accuracy']
    t2i_acc += subset_result['accuracy']['t2i_accuracy']
    subset_num += 1
# print and save results
print(f'\n############# finished the evaluation on all selected subsets ###############')
print(f'average of all subset: Image2Text Accuracy: {i2t_acc/subset_num:.2f} %')
print(f'average of all subset: Text2Image Accuracy: {t2i_acc/subset_num:.2f} %')
out_path = '/path/to/save/results'  # specify the path to save the evaluation results
os.makedirs(out_path, exist_ok=True)
out_fn = f"clip_openai_evaluate_result.pth"   # specify the filename according to the model you used
torch.save(result, os.path.join(out_path, out_fn))
print(f'result saved to {out_fn}.')

Image to Text retrieval on <existence>: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 16/16 [00:07<00:00,  2.27it/s]


existence subset: Image2Text Accuracy: 57.00 %


Text to Image retrieval on <existence>: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 16/16 [00:10<00:00,  1.46it/s]


existence subset: Text2Image Accuracy: 52.00 %


Image to Text retrieval on <relative_spatial>: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 32/32 [00:12<00:00,  2.66it/s]


relative_spatial subset: Image2Text Accuracy: 27.10 %


Text to Image retrieval on <relative_spatial>: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 32/32 [00:36<00:00,  1.15s/it]


relative_spatial subset: Text2Image Accuracy: 26.75 %


Image to Text retrieval on <absolute_size>: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 24/24 [00:08<00:00,  2.69it/s]


absolute_size subset: Image2Text Accuracy: 44.27 %


Text to Image retrieval on <absolute_size>: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 24/24 [00:22<00:00,  1.08it/s]


absolute_size subset: Text2Image Accuracy: 36.27 %


Image to Text retrieval on <relative_size>: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 24/24 [00:08<00:00,  2.67it/s]


relative_size subset: Image2Text Accuracy: 34.07 %


Text to Image retrieval on <relative_size>: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 24/24 [00:22<00:00,  1.08it/s]


relative_size subset: Text2Image Accuracy: 32.47 %


Image to Text retrieval on <count>: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 71/71 [00:43<00:00,  1.62it/s]


count subset: Image2Text Accuracy: 25.27 %


Text to Image retrieval on <count>: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 71/71 [02:59<00:00,  2.52s/it]


count subset: Text2Image Accuracy: 23.62 %


Image to Text retrieval on <absolute_spatial>: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 71/71 [00:45<00:00,  1.57it/s]


absolute_spatial subset: Image2Text Accuracy: 12.64 %


Text to Image retrieval on <absolute_spatial>: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 71/71 [02:55<00:00,  2.48s/it]

absolute_spatial subset: Text2Image Accuracy: 12.20 %

############# finished the evaluation on all selected subsets ###############
average of all subset: Image2Text Accuracy: 33.39 %
average of all subset: Text2Image Accuracy: 30.55 %
result saved to clip_openai_evaluate_result.pth.
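
The two averages at the end of the log are unweighted means over the six subsets, as computed by the loop in the cell above. As a quick check, the sketch below recomputes them from the per-subset accuracies printed in the log. It assumes only the `result[subset_name]['accuracy']` layout used in that cell; the `summarize` helper is hypothetical and not part of the `spec` package.

```python
def summarize(result):
    """Print per-subset accuracies and return the (i2t, t2i) means over subsets."""
    i2t_values, t2i_values = [], []
    for subset_name, subset_result in result.items():
        accuracy = subset_result["accuracy"]
        i2t_values.append(accuracy["i2t_accuracy"])
        t2i_values.append(accuracy["t2i_accuracy"])
        print(f"{subset_name:<18} I2T {accuracy['i2t_accuracy']:6.2f} %   "
              f"T2I {accuracy['t2i_accuracy']:6.2f} %")
    return sum(i2t_values) / len(i2t_values), sum(t2i_values) / len(t2i_values)


# Accuracies copied from the CLIP log above.
clip_result = {
    "existence": {"accuracy": {"i2t_accuracy": 57.00, "t2i_accuracy": 52.00}},
    "relative_spatial": {"accuracy": {"i2t_accuracy": 27.10, "t2i_accuracy": 26.75}},
    "absolute_size": {"accuracy": {"i2t_accuracy": 44.27, "t2i_accuracy": 36.27}},
    "relative_size": {"accuracy": {"i2t_accuracy": 34.07, "t2i_accuracy": 32.47}},
    "count": {"accuracy": {"i2t_accuracy": 25.27, "t2i_accuracy": 23.62}},
    "absolute_spatial": {"accuracy": {"i2t_accuracy": 12.64, "t2i_accuracy": 12.20}},
}

avg_i2t, avg_t2i = summarize(clip_result)
print(f"average: I2T {avg_i2t:.2f} %   T2I {avg_t2i:.2f} %")
```

The recomputed means are 33.39 % and 30.55 %, matching the log. To apply the helper to the saved file, load the dictionary back first, for example with `torch.load(os.path.join(out_path, out_fn))`; depending on what else the file stores, recent PyTorch versions may need `weights_only=False`.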





### 4.2 Evaluate BLIP
We use the `ViT-B` variant of [BLIP](https://arxiv.org/abs/2201.12086) with weights loaded from the checkpoint released in this [repository](https://github.com/salesforce/BLIP), which is fine-tuned on COCO for image-text retrieval.

In [4]:
# load model
model_cache_dir = '/path/to/cache/models' # specify the path to save the downloaded model checkpoint
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model, image_preprocess = get_model(model_name='blip', cache_dir=model_cache_dir, device=device)
# load datasets
subset_names = ['existence', 'relative_spatial', 'absolute_size', 'relative_size', 'count', 'absolute_spatial']
subsets = get_data(data_root=data_root, subset_names=subset_names, image_preprocess=image_preprocess, batch_size=64, num_workers=8)
# evaluate
result = {}
i2t_acc = 0.
t2i_acc = 0.
subset_num = 0
for subset_name, dataloaders in subsets.items():
    subset_result = model.evaluate(subset_name=subset_name, dataloaders=dataloaders)
    result[subset_name] = subset_result
    i2t_acc += subset_result['accuracy']['i2t_accuracy']
    t2i_acc += subset_result['accuracy']['t2i_accuracy']
    subset_num += 1
# print and save results
print(f'\n############# finished the evaluation on all selected subsets ###############')
print(f'average of all subset: Image2Text Accuracy: {i2t_acc/subset_num:.2f} %')
print(f'average of all subset: Text2Image Accuracy: {t2i_acc/subset_num:.2f} %')
out_path = '/path/to/save/results'  # specify the path to save the evaluation results
os.makedirs(out_path, exist_ok=True)
out_fn = f"blip_evaluate_result.pth"   # specify the filename according to the model you used
torch.save(result, os.path.join(out_path, out_fn))
print(f'result saved to {out_fn}.')

load checkpoint from ~/.cache/blip/blip-coco-base.pth
missing keys:
[]


Image to Text retrieval on <existence>: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 16/16 [00:36<00:00,  2.29s/it]


existence subset: Image2Text Accuracy: 55.50 %


Text to Image retrieval on <existence>: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 16/16 [00:38<00:00,  2.39s/it]


existence subset: Text2Image Accuracy: 50.10 %


Image to Text retrieval on <relative_spatial>: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 32/32 [01:11<00:00,  2.23s/it]


relative_spatial subset: Image2Text Accuracy: 30.65 %


Text to Image retrieval on <relative_spatial>: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 32/32 [02:17<00:00,  4.31s/it]


relative_spatial subset: Text2Image Accuracy: 29.60 %


Image to Text retrieval on <absolute_size>: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 24/24 [00:55<00:00,  2.30s/it]


absolute_size subset: Image2Text Accuracy: 43.20 %


Text to Image retrieval on <absolute_size>: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 24/24 [01:20<00:00,  3.36s/it]


absolute_size subset: Text2Image Accuracy: 43.07 %


Image to Text retrieval on <relative_size>: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 24/24 [00:54<00:00,  2.26s/it]


relative_size subset: Image2Text Accuracy: 34.33 %


Text to Image retrieval on <relative_size>: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 24/24 [01:20<00:00,  3.37s/it]


relative_size subset: Text2Image Accuracy: 33.27 %


Image to Text retrieval on <count>: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 71/71 [02:44<00:00,  2.32s/it]


count subset: Image2Text Accuracy: 36.87 %


Text to Image retrieval on <count>: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 71/71 [10:56<00:00,  9.25s/it]


count subset: Text2Image Accuracy: 37.40 %


Image to Text retrieval on <absolute_spatial>: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 71/71 [02:53<00:00,  2.44s/it]


absolute_spatial subset: Image2Text Accuracy: 12.07 %


Text to Image retrieval on <absolute_spatial>: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 71/71 [10:56<00:00,  9.25s/it]


absolute_spatial subset: Text2Image Accuracy: 11.58 %

############# finished the evaluation on all selected subsets ###############
average of all subset: Image2Text Accuracy: 35.44 %
average of all subset: Text2Image Accuracy: 34.17 %
result saved to blip_evaluate_result.pth.
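
With two models evaluated, a per-subset comparison is more informative than the averages alone. The sketch below computes the BLIP minus CLIP accuracy gap for each subset, using the numbers printed in the two logs above. The `accuracy_gaps` helper and the tuple layout are illustrative choices, not part of the `spec` package.

```python
# Per-subset (Image2Text, Text2Image) accuracies copied from the logs above.
clip = {
    "existence": (57.00, 52.00),
    "relative_spatial": (27.10, 26.75),
    "absolute_size": (44.27, 36.27),
    "relative_size": (34.07, 32.47),
    "count": (25.27, 23.62),
    "absolute_spatial": (12.64, 12.20),
}
blip = {
    "existence": (55.50, 50.10),
    "relative_spatial": (30.65, 29.60),
    "absolute_size": (43.20, 43.07),
    "relative_size": (34.33, 33.27),
    "count": (36.87, 37.40),
    "absolute_spatial": (12.07, 11.58),
}


def accuracy_gaps(model_a, model_b):
    """Return {subset: (i2t_gap, t2i_gap)}, computed as model_b minus model_a."""
    return {
        name: (model_b[name][0] - model_a[name][0],
               model_b[name][1] - model_a[name][1])
        for name in model_a
    }


for name, (i2t_gap, t2i_gap) in accuracy_gaps(clip, blip).items():
    print(f"{name:<18} I2T {i2t_gap:+6.2f}   T2I {t2i_gap:+6.2f}")
```

In these two runs, the largest gap is on the `count` subset (+11.60 Image2Text, +13.78 Text2Image), while the two models differ by less than one point in both directions on `relative_size` and `absolute_spatial`.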


### 4.3 Evaluate FLAVA
We use the `full` version of [FLAVA](https://arxiv.org/abs/2112.04482) with weights loaded from this [checkpoint](https://huggingface.co/facebook/flava-full).

In [None]:
# load model
model_cache_dir = '/path/to/cache/models' # specify the path to save the downloaded model checkpoint
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model, image_preprocess = get_model(model_name='flava', cache_dir=model_cache_dir, device=device)
# load datasets
subset_names = ['existence', 'relative_spatial', 'absolute_size', 'relative_size', 'count', 'absolute_spatial']
subsets = get_data(data_root=data_root, subset_names=subset_names, image_preprocess=image_preprocess, batch_size=64, num_workers=8)
# evaluate
result = {}
i2t_acc = 0.
t2i_acc = 0.
subset_num = 0
for subset_name, dataloaders in subsets.items():
    subset_result = model.evaluate(subset_name=subset_name, dataloaders=dataloaders)
    result[subset_name] = subset_result
    i2t_acc += subset_result['accuracy']['i2t_accuracy']
    t2i_acc += subset_result['accuracy']['t2i_accuracy']
    subset_num += 1
# print and save results
print(f'\n############# finished the evaluation on all selected subsets ###############')
print(f'average of all subset: Image2Text Accuracy: {i2t_acc/subset_num:.2f} %')
print(f'average of all subset: Text2Image Accuracy: {t2i_acc/subset_num:.2f} %')
out_path = '/path/to/save/results'  # specify the path to save the evaluation results
os.makedirs(out_path, exist_ok=True)
out_fn = f"flava_evaluate_result.pth"   # specify the filename according to the model you used
torch.save(result, os.path.join(out_path, out_fn))
print(f'result saved to {out_fn}.')

Image to Text retrieval on <existence>: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 16/16 [01:14<00:00,  4.65s/it]


existence subset: Image2Text Accuracy: 57.90 %


Text to Image retrieval on <existence>: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 16/16 [02:07<00:00,  7.99s/it]


existence subset: Text2Image Accuracy: 51.80 %


Image to Text retrieval on <relative_spatial>: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 32/32 [02:33<00:00,  4.78s/it]


relative_spatial subset: Image2Text Accuracy: 25.80 %


Text to Image retrieval on <relative_spatial>: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 32/32 [07:53<00:00, 14.80s/it]


relative_spatial subset: Text2Image Accuracy: 25.85 %


Image to Text retrieval on <absolute_size>: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 24/24 [01:46<00:00,  4.42s/it]


absolute_size subset: Image2Text Accuracy: 37.07 %


Text to Image retrieval on <absolute_size>: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 24/24 [04:31<00:00, 11.29s/it]


absolute_size subset: Text2Image Accuracy: 36.67 %


Image to Text retrieval on <relative_size>: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 24/24 [01:46<00:00,  4.45s/it]


relative_size subset: Image2Text Accuracy: 33.53 %


Text to Image retrieval on <relative_size>: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 24/24 [04:44<00:00, 11.86s/it]


relative_size subset: Text2Image Accuracy: 33.07 %


Image to Text retrieval on <count>: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 71/71 [05:46<00:00,  4.88s/it]


count subset: Image2Text Accuracy: 14.00 %


Text to Image retrieval on <count>:   1%|█▍                                                                                                      | 1/71 [01:43<2:00:24, 103.21s/it]
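
The log above stops partway through the Text2Image pass on the `count` subset, and the cell only saves results after every subset has finished. For long runs, one option is to write partial results to disk after each subset and skip finished subsets on the next run. The sketch below illustrates this pattern under stated assumptions: `evaluate_with_checkpoint` is a hypothetical helper, and it uses JSON, which only works if the per-subset result is JSON-serializable.

```python
import json
import os
import tempfile


def evaluate_with_checkpoint(evaluate_fn, subset_names, checkpoint_path):
    """Evaluate subsets one by one, saving all finished results after each subset.

    Subsets already present in the checkpoint file are skipped.
    """
    results = {}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            results = json.load(f)
    for name in subset_names:
        if name in results:
            continue  # finished in an earlier run
        results[name] = evaluate_fn(name)
        with open(checkpoint_path, "w") as f:
            json.dump(results, f)
    return results


# Stand-in evaluation function that returns made-up numbers.
def fake_evaluate(subset_name):
    return {"accuracy": {"i2t_accuracy": 0.0, "t2i_accuracy": 0.0}}


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "partial_result.json")
    evaluate_with_checkpoint(fake_evaluate, ["existence"], path)
    resumed = evaluate_with_checkpoint(fake_evaluate, ["existence", "count"], path)
    print(sorted(resumed))
```

In this notebook, `evaluate_fn` could be `lambda name: model.evaluate(subset_name=name, dataloaders=subsets[name])`. If the returned dictionaries hold tensors or other non-JSON values, the `json` calls would need to be replaced with `torch.save` and `torch.load`.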

### 4.4 Evaluate CoCa
We use the `ViT-B/32` variant of [CoCa](https://arxiv.org/abs/2205.01917) with weights loaded from the [checkpoint](https://github.com/mlfoundations/open_clip) pretrained on the LAION-2B dataset.

In [None]:
# load model
model_cache_dir = '/path/to/cache/models' # specify the path to save the downloaded model checkpoint
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model, image_preprocess = get_model(model_name='coca', cache_dir=model_cache_dir, device=device)
# load datasets
subset_names = ['existence', 'relative_spatial', 'absolute_size', 'relative_size', 'count', 'absolute_spatial']
subsets = get_data(data_root=data_root, subset_names=subset_names, image_preprocess=image_preprocess, batch_size=64, num_workers=8)
# evaluate
result = {}
i2t_acc = 0.
t2i_acc = 0.
subset_num = 0
for subset_name, dataloaders in subsets.items():
    subset_result = model.evaluate(subset_name=subset_name, dataloaders=dataloaders)
    result[subset_name] = subset_result
    i2t_acc += subset_result['accuracy']['i2t_accuracy']
    t2i_acc += subset_result['accuracy']['t2i_accuracy']
    subset_num += 1
# print and save results
print(f'\n############# finished the evaluation on all selected subsets ###############')
print(f'average of all subset: Image2Text Accuracy: {i2t_acc/subset_num:.2f} %')
print(f'average of all subset: Text2Image Accuracy: {t2i_acc/subset_num:.2f} %')
out_path = '/path/to/save/results'  # specify the path to save the evaluation results
os.makedirs(out_path, exist_ok=True)
out_fn = f"coca_evaluate_result.pth"   # specify the filename according to the model you used
torch.save(result, os.path.join(out_path, out_fn))
print(f'result saved to {out_fn}.')
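
The four evaluation cells differ only in `model_name` and the output filename, so the shared loop can be factored into one function. The sketch below shows one way to do that. `run_evaluation` is a hypothetical helper, and `StubModel` is a stand-in with made-up accuracies so the example runs without downloading any checkpoint; the only interface assumed is the `model.evaluate(subset_name=..., dataloaders=...)` call used in the cells above.

```python
def run_evaluation(model, subsets):
    """Evaluate `model` on every subset; return (result, mean_i2t, mean_t2i)."""
    result = {}
    for subset_name, dataloaders in subsets.items():
        result[subset_name] = model.evaluate(subset_name=subset_name,
                                             dataloaders=dataloaders)
    i2t = [r["accuracy"]["i2t_accuracy"] for r in result.values()]
    t2i = [r["accuracy"]["t2i_accuracy"] for r in result.values()]
    return result, sum(i2t) / len(i2t), sum(t2i) / len(t2i)


class StubModel:
    """Stand-in for a SPEC model wrapper; returns fixed, made-up accuracies."""

    def evaluate(self, subset_name, dataloaders):
        return {"accuracy": {"i2t_accuracy": 50.0, "t2i_accuracy": 40.0}}


_, mean_i2t, mean_t2i = run_evaluation(StubModel(), {"existence": None, "count": None})
print(f"Image2Text: {mean_i2t:.2f} %  Text2Image: {mean_t2i:.2f} %")
```

With the objects created in the cells above, the call would be `result, mean_i2t, mean_t2i = run_evaluation(model, subsets)`, followed by the same `torch.save` step.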

## 5. What's Next?
Want to test your own vision language model on SPEC? We provide a [tutorial](https://github.com/wjpoom/SPEC/blob/main/docs/evaluate_custom_model.md) on evaluating custom models. Feel free to give it a try.