TF2 implementation of knowledge distillation using the "function matching" hypothesis from the paper Knowledge distillation: A good teacher is patient and consistent by Beyer et al. Here is an accompanying blog post from keras.io. For its potential impact, this project also received the #TFCommunitySpotlight award.
The techniques have been demonstrated using three datasets:

- Flowers102
- Pet37
- Food101
This repository provides Kaggle Kernel notebooks so that we can leverage the free TPU v3-8 to run the long training schedules. Please refer to this section.
The importance of knowledge distillation lies in its practical usefulness. With the recipes from "function matching", we can now perform knowledge distillation using a principled approach that yields student models able to actually match the performance of their teacher models. This essentially allows us to compress bigger models into (much) smaller ones, thereby reducing storage costs and improving inference speed.
- No use of ground-truth labels during distillation.
- Teacher and student should see the same images during distillation, as opposed to differently augmented views of the same images.
- Aggressive form of MixUp as the key augmentation recipe. MixUp is paired with "Inception-style" cropping (implemented in this script).
- A LONG training schedule for distillation. At least 1000 epochs are needed to get good results without overfitting. The importance of a long training schedule is paramount, as studied in the paper (a minimal sketch of the training step follows this list).
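To make the recipe concrete, here is a minimal sketch of what a single distillation step could look like in TF2, assuming `teacher` and `student` are Keras models that output logits (the actual implementation lives in `funmatch_distillation.ipynb` and may differ in detail):

```python
import tensorflow as tf

# Assumed to exist: `teacher` and `student` are Keras models emitting logits.
temperature = 1.0  # plain softmax here; adjust if you experiment with temperatures

kld = tf.keras.losses.KLDivergence()
optimizer = tf.keras.optimizers.Adam()

@tf.function
def distillation_step(mixed_images):
    # Teacher and student see exactly the same mixed-up / cropped batch.
    teacher_probs = tf.nn.softmax(teacher(mixed_images, training=False) / temperature)
    with tf.GradientTape() as tape:
        student_probs = tf.nn.softmax(student(mixed_images, training=True) / temperature)
        # No ground-truth labels: the loss only matches the student to the teacher.
        loss = kld(teacher_probs, student_probs)
    grads = tape.gradient(loss, student.trainable_variables)
    optimizer.apply_gradients(zip(grads, student.trainable_variables))
    return loss
```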
The table below summarizes the results of my experiments. In all cases, the teacher is a BiT-ResNet101x3 model and the student is a BiT-ResNet50x1. For fun, you can also try to distill into other model families. BiT stands for "Big Transfer" and it was proposed in this paper.
| Dataset    | Teacher/Student       | Top-1 Acc on Test | Location |
|------------|-----------------------|-------------------|----------|
| Flowers102 | Teacher               | 98.18%            | Link     |
| Flowers102 | Student (1000 epochs) | 81.02%            | Link     |
| Pet37      | Teacher               | 90.92%            | Link     |
| Pet37      | Student (300 epochs)  | 81.3%             | Link     |
| Pet37      | Student (1000 epochs) | 86%               | Link     |
| Food101    | Teacher               | 85.52%            | Link     |
| Food101    | Student (100 epochs)  | 76.06%            | Link     |

(*Location* denotes the trained model location.)
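For reference, teacher and student backbones like these can be built from pre-trained BiT modules on TensorFlow Hub and given a fresh classification head. The sketch below is illustrative; the Hub handles, input size, and head initialization are assumptions, so check TF Hub and the notebooks for the exact setup:

```python
import tensorflow as tf
import tensorflow_hub as hub

def make_bit_classifier(hub_handle, num_classes, image_size=224, trainable=True):
    """Wraps a BiT feature extractor from TF Hub with a zero-initialized head."""
    backbone = hub.KerasLayer(hub_handle, trainable=trainable)
    return tf.keras.Sequential([
        tf.keras.layers.InputLayer(input_shape=(image_size, image_size, 3)),
        backbone,
        tf.keras.layers.Dense(num_classes, kernel_initializer="zeros"),
    ])

# Indicative handles for BiT-M feature extractors (teacher: R101x3, student: R50x1).
teacher = make_bit_classifier("https://tfhub.dev/google/bit/m-r101x3/1", num_classes=37)
student = make_bit_classifier("https://tfhub.dev/google/bit/m-r50x1/1", num_classes=37)
```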
These results are consistent with Table 4 of the original paper.
It should be noted that none of the above student training regimes showed signs of overfitting, so further improvements may be possible by training for longer. The authors also showed that Shampoo can reach similar performance much quicker than Adam during distillation, so it may well be possible to reach this performance in fewer epochs with Shampoo.
A few differences from the original implementation:
- The authors use BiT-ResNet152x2 as a teacher. While I didn't use this model for this project, you can find these models on TensorFlow Hub. More details are available here.
- The `mixup()` variant I used will produce a pair of duplicate images if the number of images is even. For 8 workers, this becomes 8 duplicate pairs, which may have led to the reduced performance. We can overcome this by using `tf.roll(images, 1, axis=0)` instead of `tf.reverse` in the `mixup()` function (see the sketch after this list). Thanks to Lucas Beyer for pointing this out.
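Here is a minimal sketch of that fix, assuming a `mixup()` that mixes only the images (the teacher then provides the targets); the sampling details are simplified compared to the notebook:

```python
import tensorflow as tf

def mixup(images, alpha=0.1):
    """Mixes each image with the next image in the batch.

    tf.roll shifts the batch by one position, so each image is paired with a
    distinct neighbour, avoiding the duplicated pairs that tf.reverse can produce.
    """
    batch_size = tf.shape(images)[0]
    # Sample per-image mixing coefficients from Beta(alpha, alpha),
    # expressed here as a ratio of two Gamma samples.
    g1 = tf.random.gamma([batch_size], alpha)
    g2 = tf.random.gamma([batch_size], alpha)
    lam = tf.reshape(g1 / (g1 + g2), [batch_size, 1, 1, 1])

    rolled = tf.roll(images, shift=1, axis=0)
    return lam * images + (1.0 - lam) * rolled
```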
All the notebooks are fully runnable on Kaggle Kernels. The only requirement is a billing-enabled GCP account so you can use GCS buckets to store data.
| Notebook | Description | Kaggle Kernel |
|---|---|---|
| `train_bit.ipynb` | Shows how to train the teacher model. | Link |
| `train_bit_keras_tuner.ipynb` | Shows how to run hyperparameter tuning using Keras Tuner for the teacher model. | Link |
| `funmatch_distillation.ipynb` | Shows an implementation of the recipes from "function matching". | Link |
These are only demonstrated on the Pet37 dataset but will work out-of-the-box for the other datasets too.
For convenience, TFRecords of different datasets are provided:
| Dataset    | TFRecords |
|------------|-----------|
| Flowers102 | Link      |
| Pet37      | Link      |
| Food101    | Link      |
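As a quick-start, here is a hedged sketch of how these TFRecords could be consumed with `tf.data`; the feature keys and the GCS path are hypothetical and should be matched to the serialization actually used in the notebooks:

```python
import tensorflow as tf

# Hypothetical feature spec; the exact keys in the provided TFRecords may differ.
FEATURE_SPEC = {
    "image": tf.io.FixedLenFeature([], tf.string),
    "label": tf.io.FixedLenFeature([], tf.int64),
}

def parse_example(serialized):
    example = tf.io.parse_single_example(serialized, FEATURE_SPEC)
    image = tf.io.decode_jpeg(example["image"], channels=3)
    return image, example["label"]

# Hypothetical GCS path; point this at the bucket holding the TFRecords.
filenames = tf.io.gfile.glob("gs://your-bucket/pet37/train-*.tfrec")
dataset = (
    tf.data.TFRecordDataset(filenames)
    .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(64)
    .prefetch(tf.data.AUTOTUNE)
)
```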
@misc{beyer2021knowledge,
title={Knowledge distillation: A good teacher is patient and consistent},
author={Lucas Beyer and Xiaohua Zhai and Amélie Royer and Larisa Markeeva and Rohan Anil and Alexander Kolesnikov},
year={2021},
eprint={2106.05237},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
Huge thanks to Lucas Beyer (first author of the paper) for providing suggestions on the initial version of the implementation.
Thanks to the ML-GDE program for providing GCP credits.
Thanks to TRC for providing Cloud TPU access.