Blacklisted or non-blacklisted validation set: #36
Oh, that's awkward, since we did blacklist those files - that's what has been documented since 2014 as the recommended approach in the ImageNet devkit (although in a somewhat confusing manner, it must be said!). Perhaps each entry on the leaderboard should be labeled with a column saying whether it was evaluated with or without the blacklisted images? It seems too late to create a new rule at this stage saying which set is required, since the choice of validation set affects which hyperparameters are needed to hit 93% accuracy. (Our group has used up our AWS credits, so we can't run any more models - and I can't imagine other teams would be thrilled at the idea of having to train new models against a new validation set...)
Adding an extra field to ImageNet submissions indicating whether or not the blacklisted images were included would be useful. @bignamehyp, @jph00, @congxu1987, @daisyden, @ppwwyyxx, can each of you update your respective submissions with an additional field? @jph00, how much would it cost to rerun your experiments with the same hyperparameters you used for your submission? Assuming the performance is similar and everyone else used the blacklisted images, this might be a simple and cheap solution. Also, just so everyone in this discussion is on the same page, the devkit instructions are in this readme.txt. The section relevant to this discussion says:
"used the blacklisted files" sounds a little bit ambiguous to me. |
Yes, "used the blacklisted files" is equivalent to using all 50,000 images in the validation set. |
IntelCaffe used the whole validation set with 50000 images for inference.
Hi, let me double-check with you: since we used the whole ImageNet training set (1,281,167 images) and validation set (50,000 images) for both training and inference, what value should we set for the "usedBlacklist" field? Thanks!
Thanks @daisyden. You should say "usedBlacklist": true
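For concreteness, here is a minimal sketch of what annotating a submission with the extra field could look like. The surrounding keys and values are illustrative assumptions, not the actual DAWNBench submission schema:

```python
import json

# Hypothetical submission record; the real DAWNBench entries use their own schema.
submission = {
    "model": "ResNet50",
    "top5Accuracy": 93.132,
}

# Record explicitly whether all 50,000 validation images were used.
# "usedBlacklist": true means the 1,762 blacklisted images were NOT excluded.
submission["usedBlacklist"] = True

print(json.dumps(submission, indent=2))
```

With a field like this, the leaderboard could surface the evaluation set alongside each entry instead of leaving readers to guess.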
@codyaustun thanks!
@codyaustun, "used the blacklisted files" sounds like using the blacklist file to exclude images. Agreed with @ppwwyyxx and @daisyden, it's a bit ambiguous. Maybe rename the field to excludeBlacklistedImages? It's also hard for readers to understand the difference between the blacklisted and non-blacklisted validation sets.
@bignamehyp I agree. Once everyone in this thread has confirmed whether or not they used all 50,000 images, we can easily update the field name to make it clearer.
@codyaustun thank you very much for your effort. AmoebaNet submissions used all 50,000 images. I will create a PR adding the new field shortly.
We used the entire 50,000 images in our single Cloud TPU tests for both TF 1.7 and TF 1.8 (cc @sb2nov). As an experiment, I ran the validation dataset with the blacklisted images excluded on some of our training run checkpoints, and on average it improved Top-1 accuracy by 0.25-0.35% and Top-5 accuracy by 0.08-0.12%.
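The comparison described above can be sketched as follows. This is illustrative only, assuming per-image correctness flags are already available; the data structures and names are made up, not from anyone's actual evaluation code:

```python
# `top5_correct` maps validation image IDs to whether the model's top-5
# prediction was correct; `blacklist` stands in for the 1,762 IDs listed
# in the 2014 devkit.
def top5_accuracy(top5_correct, blacklist=frozenset()):
    kept = {k: v for k, v in top5_correct.items() if k not in blacklist}
    return 100.0 * sum(kept.values()) / len(kept)

# Toy data: 10 images, the model misses only the two "blacklisted" ones.
top5_correct = {i: (i >= 2) for i in range(10)}
blacklist = {0, 1}

full = top5_accuracy(top5_correct)                 # all 10 images -> 80.0
filtered = top5_accuracy(top5_correct, blacklist)  # 8 images kept -> 100.0
```

The toy numbers exaggerate the effect, but they show the mechanism: if the blacklisted images are disproportionately hard, excluding them inflates accuracy slightly, which matches the 0.08-0.12% Top-5 shift reported above.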
Thanks @frankchn! That is useful to know. Based on those numbers, it looks like fast.ai's original ResNet50 submission would likely be unchanged, but the current one might not make the threshold.
@codyaustun that's not how training to hit a threshold is done - at least not by us. We find the parameters necessary to hit the threshold in the minimal number of epochs, but no more. If we had to hit the equivalent of 94.1% accuracy for the current (2014 onwards) ImageNet validation set, we would use slightly different hyperparameters. It wouldn't change the time much with suitable hyperparameters (we can get 94.1% with one extra epoch).
@jph00 I understand. My observation was more that the original submission hit 93.132% when it first crossed the 93% threshold. If @frankchn's results generalize to your code, you would still be at the 93% threshold at the same epoch even after including the blacklisted images, so the result of that submission wouldn't change in terms of either time or cost. However, the current submission would change because it only reaches 93.003%, so it seems the same hyperparameters won't work and you can't simply rerun your submission or revalidate from checkpoints. Is that correct?

My goal is to find a solution that makes for a fair comparison. We are willing to let you update your submission to correct the validation set; I want to get a sense of whether or not that is feasible. Do you know how much it would cost to tune the parameters to hit the threshold on the full validation set? If cost is the only obstacle, and it isn't unreasonable, we could simply rerun your experiments or give you credits to resolve this issue without everyone else spending time or money updating their submissions.
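Applying @frankchn's worst-case Top-5 delta (0.12%) to the two fast.ai scores makes the arithmetic behind this concrete (the delta is his measurement on other checkpoints, so treating it as transferable is an assumption):

```python
# Excluding the blacklisted images raised Top-5 accuracy by 0.08-0.12%,
# so including them should cost a blacklist-excluded score about that much.
WORST_CASE_DELTA = 0.12

scores = {"original submission": 93.132, "current submission": 93.003}
for name, top5 in scores.items():
    adjusted = top5 - WORST_CASE_DELTA
    verdict = "still clears 93%" if adjusted >= 93.0 else "falls below 93%"
    print(f"{name}: {adjusted:.3f}% ({verdict})")
```

The original submission survives the adjustment with a small margin, while the current one drops under the threshold, which is why it can't simply be revalidated from checkpoints.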
Yes, I could try to do this next week. I'll discuss with AWS.
Thanks! We can also help reproduce the experiments if that would be easier or faster. |
Looks like this is resolved with #42. Thanks @bignamehyp for raising the issue, @jph00 for the timely update, and everyone else for your quick responses. |
The ImageNet validation set consists of 50,000 images. In the 2014 devkit, there is a list of 1,762 "blacklisted" files. When we report top-5 accuracy, should we use the blacklisted or non-blacklisted version? In Google's submission, results are obtained using the full 50,000 images, including the blacklisted ones, but some submissions used the blacklisted version. Let's just make sure we're comparing the same thing.
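As a sketch of the difference between the two evaluation sets: filtering the devkit blacklist out of the validation set leaves 48,238 images. The stand-in indices below are placeholders, since the real blacklist is a specific list of 1,762 validation image indices shipped with the devkit:

```python
NUM_VAL_IMAGES = 50_000
NUM_BLACKLISTED = 1_762

# Stand-in for the real devkit list of blacklisted 1-based validation indices.
blacklisted = set(range(1, NUM_BLACKLISTED + 1))

kept = [i for i in range(1, NUM_VAL_IMAGES + 1) if i not in blacklisted]
print(len(kept))  # 48238
```

A 1,762-image difference in the denominator is more than enough to move accuracy by the tenths of a percent being discussed in this thread.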