
Revisiting Label Smoothing and Knowledge Distillation Compatibility: What was Missing?

Keshigeyan Chandrasegaran / Ngoc-Trung Tran / Yunqing Zhao / Ngai-Man Cheung
Singapore University of Technology and Design (SUTD)
ICML 2022 
Project | ICML Paper | Pre-trained Models

Abstract

This work investigates the compatibility between label smoothing (LS) and knowledge distillation (KD). Contemporary findings addressing this thesis statement take dichotomous standpoints: Muller et al. (2019); Shen et al. (2021). Critically, there is no effort to understand and resolve these contradictory findings, leaving the primal question — to smooth or not to smooth a teacher network? — unanswered. The main contributions of our work are the discovery, analysis and validation of systematic diffusion as the missing concept which is instrumental in understanding and resolving these contradictory findings. This systematic diffusion essentially curtails the benefits of distilling from an LS-trained teacher, thereby rendering KD at increased temperatures ineffective. Our discovery is comprehensively supported by large-scale experiments, analyses and case studies including image classification, neural machine translation and compact student distillation tasks spanning across multiple datasets and teacher-student architectures. Based on our analysis, we suggest that practitioners use an LS-trained teacher with a low-temperature transfer to achieve high-performance students.

A rule of thumb for practitioners. We suggest using an LS-trained teacher with a low-temperature transfer (i.e., T = 1) to obtain high-performance students.
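To make the recommendation concrete, the snippet below is a minimal PyTorch sketch of a standard Hinton-style KD objective with a temperature argument, defaulting to the suggested T = 1. The function name, the `kd_weight` balance term and its default value are illustrative assumptions and are not taken from this repository's training scripts (see src/image_classification/README.md for the actual code).

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, targets, T=1.0, kd_weight=0.9):
    # Soften teacher and student distributions with temperature T and
    # compare them with KL divergence (scaled by T^2, as in Hinton-style KD).
    soft_teacher = F.softmax(teacher_logits / T, dim=1)
    log_soft_student = F.log_softmax(student_logits / T, dim=1)
    kd = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * (T * T)

    # Standard cross-entropy against the hard labels.
    ce = F.cross_entropy(student_logits, targets)

    # `kd_weight` balances the two terms; the value here is illustrative.
    return kd_weight * kd + (1.0 - kd_weight) * ce
```

With an LS-trained teacher, keeping T = 1 avoids the high-temperature regime in which systematic diffusion erodes the benefit of distillation.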

About the code

This codebase is written in PyTorch. It is clearly documented, with bash entry points that expose all required arguments and hyper-parameters, and we provide Docker container details for running the code. A minimal mixed-precision training sketch follows the feature list below.

✅ PyTorch

✅ NVIDIA DALI

✅ Multi-GPU / mixed-precision training

✅ Dockerfile
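
As a reference for the multi-GPU / mixed-precision support listed above, here is a minimal sketch of a mixed-precision training step using torch.cuda.amp (the API available in PyTorch LTS 1.8.x). The model, criterion, optimizer and loader names are placeholders; the actual training loops are launched through the bash entry points under src/.

```python
import torch

scaler = torch.cuda.amp.GradScaler()

def train_one_epoch(model, criterion, optimizer, loader, device="cuda"):
    model.train()
    for images, targets in loader:
        images, targets = images.to(device), targets.to(device)
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():      # forward pass in mixed precision
            loss = criterion(model(images), targets)
        scaler.scale(loss).backward()        # scale loss to avoid fp16 underflow
        scaler.step(optimizer)               # unscale gradients and update weights
        scaler.update()
```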

Running the code

ImageNet-1K LS / KD experiments: Steps to run and reproduce our ImageNet-1K LS and KD results (Table 2, B.3) are provided in src/image_classification/README.md. We support multi-GPU and mixed-precision training, and student networks are trained with the NVIDIA DALI library.

Machine Translation experiments: Steps to run and reproduce our machine translation LS and KD results (Table 5, B.2) are provided in src/neural_machine_translation/README.md. We use [1], following the exact procedure of [2].

CUB200-2011 experiments: Steps to run and reproduce our fine-grained image classification (CUB200) LS and KD results (Table 2, B.1) are provided in src/image_classification/README.md. We support multi-GPU and mixed-precision training.

Compact Student Distillation: Steps to run and reproduce our compact student distillation LS and KD results (Table 4, B.3) are provided in src/image_classification/README.md. We support multi-GPU and mixed-precision training.

Penultimate Layer Visualization: Pseudocode for the penultimate-layer visualization algorithm is provided in src/visualization/visualization_algorithm.png. Refer to src/visualization/alpha-LS-KD_imagenet_centroids.py for the visualization code that reproduces all visualizations in the main paper and Supplementary (Figures 1, A.1, A.2). The code is clearly documented; a rough sketch of the underlying idea follows below.
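For orientation, the sketch below illustrates the general idea behind this kind of penultimate-layer visualization: pick three classes, compute their penultimate-activation centroids, and project the activations onto the plane spanned by those centroids. This is a simplified, assumption-based sketch; the authoritative version is the pseudocode image and script referenced above.

```python
import numpy as np

def project_to_class_plane(feats, labels, classes):
    """Project penultimate activations of three chosen classes onto the plane
    spanned by their class centroids. Simplified sketch; see
    src/visualization/alpha-LS-KD_imagenet_centroids.py for the actual code."""
    assert len(classes) == 3
    # Centroid (mean penultimate activation) of each selected class.
    centroids = np.stack([feats[labels == c].mean(axis=0) for c in classes])
    # Orthonormal basis of the plane through the three centroids (Gram-Schmidt).
    v1, v2 = centroids[1] - centroids[0], centroids[2] - centroids[0]
    b1 = v1 / np.linalg.norm(v1)
    v2 = v2 - (v2 @ b1) * b1
    b2 = v2 / np.linalg.norm(v2)
    # 2-D coordinates of every example from the selected classes on that plane.
    mask = np.isin(labels, classes)
    coords = (feats[mask] - centroids[0]) @ np.stack([b1, b2]).T
    return coords, labels[mask]
```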

ImageNet-1K KD Results

Here $\alpha$ is the teacher's label-smoothing factor and T is the KD temperature. Each entry reports Top-1 / Top-5 accuracy (%).

ResNet-50 → ResNet-18 KD

| Model | T | $\alpha = 0.0$ | $\alpha = 0.1$ |
| --- | --- | --- | --- |
| Teacher: ResNet-50 | - | 76.132 / 92.862 | 76.200 / 93.082 |
| Student: ResNet-18 | 1 | 71.488 / 90.272 | 71.666 / 90.364 |
| Student: ResNet-18 | 2 | 71.360 / 90.362 | 68.860 / 89.352 |
| Student: ResNet-18 | 3 | 69.674 / 89.698 | 67.752 / 88.932 |
| Student: ResNet-18 | 64 | 66.194 / 88.706 | 64.362 / 87.698 |

ResNet-50 → ResNet-50 KD

| Model | T | $\alpha = 0.0$ | $\alpha = 0.1$ |
| --- | --- | --- | --- |
| Teacher: ResNet-50 | - | 76.132 / 92.862 | 76.200 / 93.082 |
| Student: ResNet-50 | 1 | 76.328 / 92.996 | 76.896 / 93.236 |
| Student: ResNet-50 | 2 | 76.180 / 93.072 | 76.110 / 93.138 |
| Student: ResNet-50 | 3 | 75.488 / 92.670 | 75.790 / 93.006 |
| Student: ResNet-50 | 64 | 74.278 / 92.410 | 74.566 / 92.596 |

Results were produced with the NVIDIA PyTorch Docker container 20.12-py3, PyTorch LTS 1.8.2 and CUDA 11.1.
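
For clarity on the $\alpha$ column above: $\alpha$ is the teacher's label-smoothing factor ($\alpha = 0.0$ is a hard-label teacher, $\alpha = 0.1$ an LS-trained teacher). The snippet below is a common way to write cross-entropy with label smoothing; it is an illustrative sketch rather than this repository's exact implementation.

```python
import torch.nn.functional as F

def ls_cross_entropy(logits, targets, alpha=0.1):
    # Smoothed target distribution: (1 - alpha) on the true class,
    # with alpha spread uniformly over all K classes.
    log_probs = F.log_softmax(logits, dim=1)
    nll = -log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)   # true-class term
    uniform = -log_probs.mean(dim=1)                              # uniform term (alpha / K per class)
    return ((1.0 - alpha) * nll + alpha * uniform).mean()
```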

Pretrained Models

All pretrained image classification, fine-grained image classification, neural machine translation and compact student distillation models are available here.

Citation

@InProceedings{pmlr-v162-chandrasegaran22a,
    author    = {Chandrasegaran, Keshigeyan and Tran, Ngoc-Trung and Zhao, Yunqing and Cheung, Ngai-Man},
    title     = {Revisiting Label Smoothing and Knowledge Distillation Compatibility: What was Missing?},
    booktitle = {Proceedings of the 39th International Conference on Machine Learning},
    pages     = {2890-2916},
    year      = {2022},
    editor    = {Chaudhuri, Kamalika and Jegelka, Stefanie and Song, Le and Szepesvari, Csaba and Niu, Gang and Sabato, Sivan},
    volume    = {162},
    series    = {Proceedings of Machine Learning Research},
    month     = {17-23 Jul},
    publisher = {PMLR},
}

Acknowledgements

We gratefully acknowledge the following works and libraries:

Special thanks to Lingeng Foo and Timothy Liu for valuable discussion.

References

[1] Tan, Xu, et al. "Multilingual Neural Machine Translation with Knowledge Distillation." International Conference on Learning Representations. 2019.

[2] Shen, Zhiqiang, et al. "Is Label Smoothing Truly Incompatible with Knowledge Distillation: An Empirical Study." International Conference on Learning Representations. 2021.
