
CLIP-KD

This repository contains the source code of our CVPR 2024 paper *CLIP-KD: An Empirical Study of CLIP Model Distillation*.

Install

pip install -r requirements-training.txt
pip install -r requirements-test.txt

Dataset preparation

Conceptual Captions 3M

OpenCLIP reads a CSV file with two columns: a path to an image, and a text caption. The names of the columns are passed as an argument to main.py.

The script src/data/gather_cc.py will collect the Conceptual Captions 3M images. First, download the Conceptual Captions 3M URLs, then run the script from our repository. For brevity, we rename Train_GCC-training to cc3m_train and Validation_GCC-1.1.0-Validation to cc3m_val.

python src/data/gather_cc.py [path/to/cc3m/images/] [path/to/cc3m_train.tsv] [path/to/cc3m_val.tsv]

Our downloaded CC3M training set contains 2.89M images, and our CC3M validation set contains 13K images.

The generated cc3m_train.csv is:

title   filepath
XXXXXX  train/X/X.jpg
...     ...

The generated cc3m_val.csv is:

title   filepath
XXXXXX  val/X/X.jpg
...     ...
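Before launching training, it can help to sanity-check that a generated caption file has the columns you will pass to main.py. The sketch below is not part of the repository; it assumes the file is tab-separated with `title` and `filepath` columns, as in the previews above:

```python
import csv
import io

# Hypothetical sample mimicking a generated cc3m_train.csv
# (tab-separated, with "title" and "filepath" columns).
sample = "title\tfilepath\na cat on a mat\ttrain/0/0.jpg\n"

def load_caption_csv(text, caption_key="title", img_key="filepath", sep="\t"):
    """Parse a caption CSV and return (caption, image path) pairs,
    raising if the expected columns are missing."""
    reader = csv.DictReader(io.StringIO(text), delimiter=sep)
    missing = {caption_key, img_key} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"CSV is missing columns: {missing}")
    return [(row[caption_key], row[img_key]) for row in reader]

pairs = load_caption_csv(sample)
print(pairs)  # [('a cat on a mat', 'train/0/0.jpg')]
```

The column names and separator here mirror what you would pass on the command line; check your local main.py for the exact argument names.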

Conceptual 12M

The script src/data/gather_cc12m.py will collect the Conceptual 12M images. First, download the Conceptual 12M URLs and then run the script from our repository:

python src/data/gather_cc12m.py [path/to/cc12m/images/] [path/to/cc12m.tsv]

The generated cc12m.csv is:

title   filepath
XXXXXX  train/X/X.jpg
...     ...

Our downloaded CC12M training set contains 9.97M images.

Distill CLIP models

Distillation with different strategies

The teacher is pretrained on CC3M+12M. Students are distilled on CC3M+12M.

| Role | Network | Method | ImageNet Acc (%) | Train script |
| --- | --- | --- | --- | --- |
| Teacher | ViT-B/16 | - | 36.99 | sh |
| Student | ViT-T/16 | Baseline | 30.55 | sh |
| Student | ViT-T/16 | +CRD | 31.94 | sh |
| Student | ViT-T/16 | +FD | 34.23 | sh |
| Student | ViT-T/16 | +MFD | 34.09 | sh |
| Student | ViT-T/16 | +GD | 31.54 | sh |
| Student | ViT-T/16 | +ICL | 33.11 | sh |
| Student | ViT-T/16 | +AFD | 31.42 | sh |

Supervised by ViT-B/16 as the teacher

The teacher is pretrained on CC3M+12M. Students are distilled on CC3M+12M.

| Role | Network | Method | ImageNet Acc (%) | Train script | Download |
| --- | --- | --- | --- | --- | --- |
| Teacher | ViT-B/16 | - | 36.99 | sh | model \| log |
| Student | ViT-T/16 | Baseline | 30.55 | sh | model \| log |
| Student | ViT-T/16 | CLIP-KD | 34.90 | sh | model \| log |
| Student | MobileViT-S | Baseline | 32.60 | sh | model \| log |
| Student | MobileViT-S | CLIP-KD | 35.96 | sh | model \| log |
| Student | Swin-T | Baseline | 36.38 | sh | model \| log |
| Student | Swin-T | CLIP-KD | 40.18 | sh | model \| log |
| Student | MobileNetV3 | Baseline | 25.11 | sh | model \| log |
| Student | MobileNetV3 | CLIP-KD | 26.95 | sh | model \| log |
| Student | EfficientNet-B0 | Baseline | 32.55 | sh | model \| log |
| Student | EfficientNet-B0 | CLIP-KD | 35.44 | sh | model \| log |
| Student | ResNet-18 | Baseline | 28.55 | sh | model \| log |
| Student | ResNet-18 | CLIP-KD | 31.36 | sh | model \| log |

Supervised by ResNet-101 as the teacher

The teacher is pretrained on CC3M+12M. Students are distilled on CC3M+12M.

| Role | Network | Method | ImageNet Acc (%) | Train script | Download |
| --- | --- | --- | --- | --- | --- |
| Teacher | ResNet-101 | - | 36.76 | sh | model \| log |
| Student | MobileViT-S | Baseline | 32.60 | sh | model \| log |
| Student | MobileViT-S | CLIP-KD | 34.97 | sh | model \| log |
| Student | Swin-T | Baseline | 36.38 | sh | model \| log |
| Student | Swin-T | CLIP-KD | 39.51 | sh | model \| log |
| Student | MobileNetV3 | Baseline | 25.11 | sh | model \| log |
| Student | MobileNetV3 | CLIP-KD | 26.15 | sh | model \| log |
| Student | EfficientNet-B0 | Baseline | 32.55 | sh | model \| log |
| Student | EfficientNet-B0 | CLIP-KD | 34.64 | sh | model \| log |
| Student | ResNet-18 | Baseline | 28.55 | sh | model \| log |
| Student | ResNet-18 | CLIP-KD | 30.88 | sh | model \| log |

Transferred from Laion-400M

The teacher is pretrained on Laion-400M. Students are distilled on CC3M+12M.

| Role | Network | Method | ImageNet Acc (%) | Train script | Download |
| --- | --- | --- | --- | --- | --- |
| Teacher | ViT-L/14 | - | 72.8 | - | model |
| Student | ViT-B/16 | Baseline | 37.0 | sh | model \| log |
| Student | ViT-B/16 | CLIP-KD | 57.5 | sh | model \| log |
| Student | ViT-T/16 | Baseline | 30.6 | sh | model \| log |
| Student | ViT-T/16 | CLIP-KD | 40.9 | sh | model \| log |

| Role | Network | Method | ImageNet Acc (%) | Train script | Download |
| --- | --- | --- | --- | --- | --- |
| Teacher | ViT-B/16 | - | 67.1 | - | model |
| Student | ViT-T/16 | Baseline | 30.6 | sh | model \| log |
| Student | ViT-T/16 | CLIP-KD | 42.6 | sh | model \| log |
| Student | ResNet-50 | Baseline | 35.3 | sh | model \| log |
| Student | ResNet-50 | CLIP-KD | 55.4 | sh | model \| log |

Evaluate pretrained models on more downstream tasks

We evaluate pretrained models on MSCOCO and Flickr cross-modal retrieval, and on classification over ImageNet variants (ImageNet-V2, ImageNet-Rendition, and ImageNet-Sketch). Please refer to eval_coco.sh and eval_flickr.sh.

Acknowledgement

Our codebase is built on open_clip, an open-source codebase for running CLIP models.

If our paper and repo are helpful to you, please consider citing:

@inproceedings{yang2024clip,
  title={CLIP-KD: An Empirical Study of CLIP Model Distillation},
  author={Yang, Chuanguang and An, Zhulin and Huang, Libo and Bi, Junyu and Yu, Xinqiang and Yang, Han and Diao, Boyu and Xu, Yongjun},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2024}
}

About

[CVPR-2024] Official implementations of CLIP-KD: An Empirical Study of CLIP Model Distillation
