
how is robustness calculated? #5

Closed · psteinb opened this issue Mar 3, 2022 · 4 comments

psteinb commented Mar 3, 2022

Hi,

thank you for this wonderful work on vision transformers and on how to understand them. I have a few simple questions, for which I apologize in advance.
I tried to reproduce Figure 12 independently of your code base, but I struggle a bit to understand the code. Is it correct that you define robustness as robustness = mean(accuracy(y_val_true, y_val_pred))?
Related to this, do I understand correctly that you compute this accuracy on batches of the validation dataset? These batches are of size 256, right?
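
For concreteness, my independent attempt boils down to roughly the following sketch (the model and data loader are placeholders from my own setup, not from your code base):

```python
import torch

# Placeholder sketch of my reading of the metric: accuracy per validation batch, then the mean.
# `model` and `val_loader` are stand-ins from my own environment, not from this repository.
def batchwise_robustness(model, val_loader, device="cuda"):
    model.eval()
    batch_accuracies = []
    with torch.no_grad():
        for images, labels in val_loader:  # batches of size 256?
            predictions = model(images.to(device)).argmax(dim=1)
            batch_accuracies.append((predictions.cpu() == labels).float().mean().item())
    return sum(batch_accuracies) / len(batch_accuracies)  # mean over per-batch accuracies
```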

Thanks.

xxxnell (Owner) commented Mar 4, 2022

Hi, thank you for your support!

CIFAR-{10, 100}-C and ImageNet-C each consist of 75 datasets (data corrupted by 15 different corruption types at 5 intensity levels). The robustness reported in this paper is the average of the accuracies on these 75 corrupted datasets.
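
In other words, the definition boils down to the following sketch (the evaluation callable is only a placeholder for illustration, not an actual function in this repository):

```python
# Minimal sketch of the definition: robustness is the plain mean of the accuracies
# measured on every (corruption type, intensity) combination, i.e. 15 x 5 = 75 datasets.
# `eval_acc` is a placeholder callable: (corruption_type, intensity) -> accuracy.
def robustness(eval_acc, corruption_types, intensities=range(1, 6)):
    accuracies = [eval_acc(ctype, level)
                  for ctype in corruption_types
                  for level in intensities]
    return sum(accuracies) / len(accuracies)
```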

In particular, I recommend that you measure the robustness as follows:

  1. Run all cells in robustness.ipynb to get the predictive performance of a pretrained model on the 75 datasets. CIFAR-{10, 100}-C will be downloaded automatically. You will then get a performance sheet like the sample robustness sheet.
  2. Average all accuracies over the 75 datasets. In the robustness sheet, the columns stand for "Intensity", "Type", "NLL", "Cutoff1", "Cutoff2", "Acc", "Acc-90", "Unc", "Unc-90", "IoU", "IoU-90", "Freq", "Freq-90", "Top-5", "Brier", "ECE", "ECSE", respectively. We only use the accuracy column ("Acc"); see the short pandas sketch below.

To avoid confusion: strictly speaking, we do not use the following corruption types for evaluation: "speckle_noise", "gaussian_blur", "spatter", and "saturate". Another metric, mCE, is also commonly used to measure robustness, but it is not used in this paper.
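
As a rough sketch, assuming you export the robustness sheet as a CSV with the column names above (the file name below is just an example), the final number can be computed with pandas:

```python
import pandas as pd

# Hypothetical file name; use the sheet produced by robustness.ipynb.
sheet = pd.read_csv("robustness_sheet.csv")

# Drop the extra corruption types that are not used for evaluation.
excluded = {"speckle_noise", "gaussian_blur", "spatter", "saturate"}
sheet = sheet[~sheet["Type"].isin(excluded)]

# Robustness is the plain mean of the accuracy column.
print(sheet["Acc"].mean())
```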

The batch size is 256 by default, but I believe the robustness is independent of the batch size.

xxxnell (Owner) commented Mar 12, 2022

Closing this issue based on the comment above. Please feel free to reopen this issue if the problem still exists.

xxxnell closed this as completed Mar 12, 2022

psteinb (Author) commented Mar 14, 2022

Sure thing, please close the issue.
I think it would be great to have access to the intermediate results to (re-)produce the robustness numbers.
From the robustness notebook, I got the impression that I would have to retrain all of the cited models (since I cannot run models.load(name, ...) in my environment), and, to be honest, I didn't want to invest the CO2 for that.
But maybe the .pth checkpoints are available for download and I misread the docs. Please accept my apologies if that is the case.

xxxnell (Owner) commented Mar 15, 2022

Thank you for your constructive feedback. I agree that releasing intermediate results would be helpful, because evaluating pretrained models on the 75 datasets can be resource-intensive. I will release robustness sheets as intermediate results for some models and make the pretrained models easily accessible.
