ViT Training Details #252
Training cmd line was this: quite a bit of augmentation, but no dropout/drop_path. I don't think it's optimal, and this was with an earlier version of the model, before I refactored. The biggest change is that this was trained with the MLP head instead of the single Linear for the final layer (~4M more params); that can be re-enabled in the model def. I'm currently trying other optimizers and regularization settings. EDIT: I also didn't exclude the embedding weights from the weight decay in the initial training session.
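For context, here's a minimal sketch of how parameters like the embeddings can be excluded from weight decay in PyTorch. The keyword list (`pos_embed`, `cls_token`, biases, norm layers) is an assumption for illustration, not the exact filter used in the run above:

```python
import torch

def param_groups(model, weight_decay=0.05,
                 skip_keywords=('pos_embed', 'cls_token', 'bias', 'norm')):
    """Split parameters into decay / no-decay groups for the optimizer."""
    decay, no_decay = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        # Embeddings, biases, and norm weights are commonly left undecayed.
        if any(k in name for k in skip_keywords):
            no_decay.append(p)
        else:
            decay.append(p)
    return [
        {'params': decay, 'weight_decay': weight_decay},
        {'params': no_decay, 'weight_decay': 0.0},
    ]

# optimizer = torch.optim.AdamW(param_groups(model), lr=1e-3)
```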
That was run on 2x Titan RTX.

Thanks.
A new run that's doing better: 76.1% so far at epoch 162, with dropout and stochastic depth (drop_path) enabled.
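For anyone unfamiliar with drop_path, here's a minimal sketch of stochastic depth written from scratch (it follows the usual residual-branch formulation, not necessarily this repo's exact implementation):

```python
import torch
import torch.nn as nn

class DropPath(nn.Module):
    """Stochastic depth: randomly zero the residual branch per sample."""
    def __init__(self, drop_prob=0.1):
        super().__init__()
        self.drop_prob = drop_prob

    def forward(self, x):
        if self.drop_prob == 0.0 or not self.training:
            return x
        keep_prob = 1.0 - self.drop_prob
        # One Bernoulli draw per sample, broadcast over the remaining dims.
        shape = (x.shape[0],) + (1,) * (x.ndim - 1)
        mask = torch.bernoulli(torch.full(shape, keep_prob,
                                          device=x.device, dtype=x.dtype))
        return x / keep_prob * mask  # inverted scaling keeps the expectation fixed

# Typical use inside a transformer block: x = x + drop_path(attn(norm(x)))
```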
Following the other issue linked here: I'm wondering what dropout just before the softmax (in the MLP head) for class prediction is getting you? Is it just more regularization? I understand it being present after the first linear layer.
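To make the question concrete, here's a sketch of the kind of MLP head being discussed; the sizes and the exact dropout placement are illustrative assumptions:

```python
import torch.nn as nn

def mlp_head(embed_dim=384, hidden_dim=3072, num_classes=1000, drop=0.1):
    return nn.Sequential(
        nn.Linear(embed_dim, hidden_dim),
        nn.Tanh(),
        nn.Dropout(drop),  # after the first linear: the uncontroversial placement
        nn.Linear(hidden_dim, num_classes),
        nn.Dropout(drop),  # on the logits, just before softmax: the placement in question
    )
```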
Hi Ross, do you plan to PR the pretrained weights soon? I trained ViT_small_patch16_224 on my data and the results are promising. I believe they could be improved further with transfer learning.
Best regards
Linh
@linhduongtuan that's the plan. The training session mentioned above is at epoch 185 for small p16 now; it's at 76.65, and I'm pretty sure it'll hit the low 77s at worst. How did yours end up? Someone ran a base p16 model for me on bigger compute; apparently it finished at 78.8, but I don't have my hands on the weights yet. I think that result could be pushed into the low to mid 79s with more epochs and a bit more dropout. If you have any good results I'm certainly open to adding them with a mention.
I have ImageNet-21k (full) and OpenImages that I could do some transfer learning with, but it would take a really long time with 2 GPUs.
Hi Ross,
I am very curious about these models' performance. Your current results seem promising. However, it is impossible for me to train on such large datasets, so I am looking forward to your PR (or someone's) ASAP.
Thanks for your support.
Linh
@ross
Do you have a plan to reimplement the code following the paper https://openreview.net/pdf?id=xTJEN-ggl1b#page4 and PR it to the repo?
Linh
@rwightman Tal
@mrT23 yes, they're gigantic, but surprisingly fast with AMP enabled despite the size... just big fat matrix multiplies. Overall it's more of a curiosity at this stage. fall11_whole.tar (1.31TB) can be found on Academic Torrents; it has 21841 classes and matches the md5 on the official site (which can no longer be downloaded from). From there you can either use it as is, though it's quite unbalanced as 21k, or filter it down to the various definitions of ImageNet 10k, 7k, 5k, usually improving the balance or selecting only leaf nodes, etc., as the classes get pruned.
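As a rough illustration of that kind of filtering, here's a hedged sketch that keeps only synsets with enough images; the directory layout (one folder per synset) and the threshold are assumptions:

```python
from pathlib import Path

def filter_synsets(root, min_images=500):
    """Keep synsets with at least `min_images` images (assumes root/<synset>/*.JPEG)."""
    keep = []
    for d in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        n = sum(1 for _ in d.glob('*.JPEG'))
        if n >= min_images:
            keep.append((d.name, n))
    return keep

# e.g. synsets = filter_synsets('/data/fall11_whole', min_images=500)
```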
Thanks for the tip. I looked on Academic Torrents in the past, but never made the connection between the name and the dataset. Big datasets are often not very user-friendly, and no one really tries to make them more accessible. One simple trick is to provide a variant with images resized to 224x224 (squish pre-processing). It might limit the augmentation regime a bit, but it's worth it: Open Images multi-label (6M images, 9000 labels) is only 62GB with this resizing.
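A minimal sketch of that squish pre-processing with Pillow (the paths, file pattern, and JPEG quality are assumptions):

```python
from pathlib import Path
from PIL import Image

def squish_resize(src_dir, dst_dir, size=(224, 224), quality=90):
    """Resize every image under src_dir to a fixed size, ignoring aspect ratio."""
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    for src in src_dir.rglob('*.jpg'):
        dst = dst_dir / src.relative_to(src_dir)
        dst.parent.mkdir(parents=True, exist_ok=True)
        with Image.open(src) as im:
            im.convert('RGB').resize(size, Image.BILINEAR).save(dst, quality=quality)
```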
Weights for this are up now, 77.42. EDIT: correction, 77.86 after tweaking the test image crop to something closer to the typical 0.875. Base model weights at 79.35 top-1 are also up, from a training session generously run for me by someone with more GPUs :)
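For reference, the usual eval transform with a crop percentage: resize the shorter side to img_size / crop_pct, then center-crop. The normalization values below are an assumption (the 0.5 mean/std often used with ViT), not a statement of what these particular weights expect:

```python
import torchvision.transforms as T

def eval_transform(img_size=224, crop_pct=0.875):
    scale_size = int(round(img_size / crop_pct))  # 224 / 0.875 = 256
    return T.Compose([
        T.Resize(scale_size),       # shorter side -> scale_size
        T.CenterCrop(img_size),
        T.ToTensor(),
        T.Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)),
    ])
```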
The official version is out in JAX with some training code and pretrained weights (from ImageNet-21k). Looks like this thread can be closed.
When using Adam instead of AdamW, I only get 6.3% at epoch 100. Should I make other changes when using the Adam optimizer?
@JYBWOB Unfortunately, I don't know of any convention for how to change the weight_decay value when switching from AdamW to Adam.
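One thing worth noting (an aside, not a fix confirmed to explain the 6.3% result): Adam's `weight_decay` is coupled L2 regularization that gets rescaled by the adaptive moments, while AdamW's is decoupled and applied directly to the weights, so the same numeric value behaves very differently:

```python
import torch

model = torch.nn.Linear(10, 10)  # stand-in model

# AdamW: decoupled decay, applied directly to the weights each step.
opt_w = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.05)

# Adam: weight_decay is L2 added to the gradient, then rescaled by the
# adaptive moments; a decay that works for AdamW is usually far too strong.
# Scaling it down by roughly the learning rate is a heuristic, not a rule.
opt = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=0.05 * 1e-3)
```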
Hi,
In your code comments you mention being able to train a small version of the model to 75% top-1 accuracy. Could you give more details about the hyper-parameters used (batch size, learning rate, etc.)?
Thanks.