PyTorch implementation for the NeurIPS 2021 paper: Are Transformers More Robust Than CNNs?
Our implementation is based on DeiT.
Transformer emerges as a powerful tool for visual recognition. In addition to demonstrating competitive performance on a broad range of visual benchmarks, recent works also argue that Transformers are much more robust than Convolutional Neural Networks (CNNs). Nonetheless, surprisingly, we find these conclusions are drawn from unfair experimental settings, where Transformers and CNNs are compared at different scales and are applied with distinct training frameworks. In this paper, we aim to provide the first fair & in-depth comparisons between Transformers and CNNs, focusing on robustness evaluations.
With our unified training setup, we first challenge the previous belief that Transformers outshine CNNs when measuring adversarial robustness. More surprisingly, we find CNNs can easily be as robust as Transformers on defending against adversarial attacks, if they properly adopt Transformers' training recipes. While regarding generalization on out-of-distribution samples, we show pre-training on (external) large-scale datasets is not a fundamental request for enabling Transformers to achieve better performance than CNNs. Moreover, our ablations suggest such stronger generalization is largely benefited by the Transformer's self-attention-like architectures per se, rather than by other training setups. We hope this work can help the community better understand and benchmark the robustness of Transformers and CNNs.
We provide both pretrained vanilla models and adversarially trained models.
| Model | Pretrained Model | ImageNet | ImageNet-A | ImageNet-C | Stylized-ImageNet |
|---|---|---|---|---|---|
| Res50-Ori | download link | 76.9 | 3.2 | 57.9 | 8.3 |
| Res50-Align | download link | 76.3 | 4.5 | 55.6 | 8.2 |
| Res50-Best | download link | 75.7 | 6.3 | 52.3 | 10.8 |
| DeiT-Small | download link | 76.8 | 12.2 | 48.0 | 13.0 |
ResNets:
- ResNets fully aligned with DeiT's training recipe, denoted as res*:
| Model | Model Size | Pretrained Model | ImageNet | ImageNet-A | ImageNet-C | Stylized-ImageNet |
|---|---|---|---|---|---|---|
| Res18* | 11.69M | download link | 67.83 | 1.92 | 64.14 | 7.92 |
| Res50* | 25.56M | download link | 76.28 | 4.53 | 55.62 | 8.17 |
| Res101* | 44.55M | download link | 77.97 | 8.84 | 49.19 | 11.60 |
- ResNets best model for out-of-distribution (OOD) generalization, denoted as res-best:
| Model | Model Size | Pretrained Model | ImageNet | ImageNet-A | ImageNet-C | Stylized-ImageNet |
|---|---|---|---|---|---|---|
| Res18-best | 11.69M | download link | 66.81 | 2.03 | 62.65 | 9.45 |
| Res50-best | 25.56M | download link | 75.74 | 6.32 | 52.25 | 10.77 |
| Res101-best | 44.55M | download link | 77.83 | 11.49 | 47.35 | 13.28 |
DeiTs:
| Model | Model Size | Pretrained Model | ImageNet | ImageNet-A | ImageNet-C | Stylized-ImageNet |
|---|---|---|---|---|---|---|
| DeiT-Mini | 9.98M | download link | 72.89 | 8.19 | 54.68 | 9.88 |
| DeiT-Small | 22.05M | download link | 76.82 | 12.21 | 47.99 | 12.98 |
| Role | Architecture | Pretrained Model | ImageNet | ImageNet-A | ImageNet-C | Stylized-ImageNet |
|---|---|---|---|---|---|---|
| Teacher | DeiT-Small | download link | 76.8 | 12.2 | 48.0 | 13.0 |
| Student | Res50*-Distill | download link | 76.7 | 5.2 | 54.2 | 9.8 |
| Teacher | Res50* | download link | 76.3 | 4.5 | 55.6 | 8.2 |
| Student | DeiT-S-Distill | download link | 76.2 | 10.9 | 49.3 | 11.9 |
| Model | Pretrained Model | Clean Acc | PGD-100 | Auto Attack |
|---|---|---|---|---|
| Res50-ReLU | download link | 66.77 | 32.26 | 26.41 |
| Res50-GELU | download link | 67.38 | 40.27 | 35.51 |
| DeiT-Small | download link | 66.50 | 40.32 | 35.50 |
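The PGD-100 column refers to a projected gradient descent attack run for 100 steps. For reference, the standard L-infinity form of that iteration is sketched below. The norm, the perturbation budget ε, the step size α, and the use of a random start δ₀ are assumptions here, since this README does not state them; consult the evaluation scripts for the values actually used.

```latex
x^{(0)} = x + \delta_0, \qquad \|\delta_0\|_\infty \le \epsilon

x^{(t+1)} = \Pi_{\mathcal{B}_\infty(x,\,\epsilon)}
  \left( x^{(t)} + \alpha \,\operatorname{sign}\!\left(
    \nabla_{x^{(t)}} \mathcal{L}\big(f_\theta(x^{(t)}),\, y\big)
  \right) \right),
\qquad t = 0, \dots, 99
```

Here Π projects back onto the ε-ball around the clean input x, f_θ is the model, and L is the classification loss.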
Download and extract ImageNet train and val images from http://image-net.org/.
The directory structure follows the standard torchvision layout; the training and validation data are expected in the train and val folders, respectively:

    /path/to/imagenet/
      train/
        class1/
          img1.jpeg
        class2/
          img2.jpeg
      val/
        class1/
          img3.jpeg
        class2/
          img4.jpeg
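Before launching a long training run, it can help to confirm that the extracted data matches the layout above. The following is a minimal sketch using only the Python standard library; it is not part of this repository, and the helper names `summarize_split` and `check_layout` as well as the accepted file suffixes are assumptions made for illustration.

```python
import tempfile
from pathlib import Path

# Assumption: images use one of these suffixes (case-insensitive).
IMAGE_SUFFIXES = {".jpeg", ".jpg", ".png"}


def summarize_split(root, split):
    """Return {class_name: image_count} for root/<split>/<class>/<image>."""
    split_dir = Path(root) / split
    if not split_dir.is_dir():
        raise FileNotFoundError(f"missing split folder: {split_dir}")
    counts = {}
    for class_dir in sorted(p for p in split_dir.iterdir() if p.is_dir()):
        counts[class_dir.name] = sum(
            1
            for f in class_dir.iterdir()
            if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
        )
    return counts


def check_layout(root):
    """Return a list of human-readable problems; an empty list means OK."""
    train = summarize_split(root, "train")
    val = summarize_split(root, "val")
    problems = []
    for split_name, counts in (("train", train), ("val", val)):
        if not counts:
            problems.append(f"{split_name}/ contains no class folders")
        for class_name, n_images in counts.items():
            if n_images == 0:
                problems.append(f"{split_name}/{class_name}/ contains no images")
    # Every validation class should also exist in the training split.
    unknown = sorted(set(val) - set(train))
    if unknown:
        problems.append(f"val/ classes not present in train/: {unknown}")
    return problems
```

Calling `check_layout("/path/to/imagenet")` returns an empty list when both splits are populated and consistent. For datasets that ship only a val folder, `summarize_split(root, "val")` can be used on its own.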
Install dependencies:

    pip3 install -r requirements.txt

To train a ResNet model on ImageNet, run:

    bash script/res.sh

To train a DeiT model on ImageNet, run:

    bash script/deit.sh

Download and extract the ImageNet-A, ImageNet-C, and Stylized-ImageNet val images:

    /path/to/datasets/
      val/
        class1/
          img1.jpeg
        class2/
          img2.jpeg
To evaluate pre-trained models, run:

    bash script/generation_to_ood.sh

It is worth noting that for ImageNet-C evaluation, the error rate is calculated based on the Noise, Blur, Weather and Digital categories.
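To make the ImageNet-C aggregation concrete, here is a rough sketch of averaging error over the four corruption categories. The corruption names are those of the public ImageNet-C benchmark; the function name `mean_corruption_error`, the input format, and the optional baseline normalization are assumptions for illustration, and this README does not state whether the reported numbers are normalized, so the evaluation script remains the authoritative reference.

```python
from statistics import mean

# The 15 standard ImageNet-C corruptions, grouped by category.
CORRUPTIONS = {
    "noise": ["gaussian_noise", "shot_noise", "impulse_noise"],
    "blur": ["defocus_blur", "glass_blur", "motion_blur", "zoom_blur"],
    "weather": ["snow", "frost", "fog", "brightness"],
    "digital": ["contrast", "elastic_transform", "pixelate", "jpeg_compression"],
}


def mean_corruption_error(top1_acc, baseline_err=None):
    """Average error (%) over all corruptions in the four categories.

    top1_acc:     {corruption: [top-1 accuracy in % at each severity level]}
    baseline_err: optional {corruption: mean error in % of a reference model};
                  when given, each corruption error is divided by it and
                  rescaled to percent (hypothetical normalization hook).
    """
    per_corruption = []
    for names in CORRUPTIONS.values():
        for name in names:
            # Error at each severity, then averaged over severities.
            err = mean(100.0 - acc for acc in top1_acc[name])
            if baseline_err is not None:
                err = 100.0 * err / baseline_err[name]
            per_corruption.append(err)
    return mean(per_corruption)
```

A missing corruption raises `KeyError`, which is intentional: silently skipping one would change the average.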
To perform adversarial training on ResNet, run:

    bash script/advres.sh

To perform adversarial training on DeiT, run:

    bash scripts/advdeit.sh

To evaluate the pre-trained models, run:

    bash script/eval_advtraining.sh

./autoattack contains the public AutoAttack package, with minor modifications to better support ImageNet evaluation.

    cd autoattack/
    bash autoattack.sh

Please refer to PatchAttack.
If you use our code or models, or wish to refer to our results, please use the following BibTeX entry:
@inproceedings{bai2021transformers,
title = {Are Transformers More Robust Than CNNs?},
author = {Bai, Yutong and Mei, Jieru and Yuille, Alan and Xie, Cihang},
booktitle = {Thirty-Fifth Conference on Neural Information Processing Systems},
year = {2021},
}