
ViT Training Details #252

Closed
gupta-abhay opened this issue Oct 14, 2020 · 17 comments

Comments

@gupta-abhay

Hi,

In your code comments you mention being able to train a small version of the model to 75% top-1 accuracy. Could you give more details about the hyper-parameters used (batch size, learning rate, etc.)?

Thanks.

@rwightman
Collaborator

rwightman commented Oct 14, 2020

The training cmd line was the one below: quite a bit of augmentation, but no dropout/drop_path. I don't think it's optimal, and it was run with an earlier version of the model, before I refactored. The biggest change is that it was trained with the MLP head instead of the single Linear for the final layer (~4M more params); that can be re-enabled in the model def. I'm currently trying other optimizers and regularization settings. EDIT: I also didn't exclude the embedding weights from the weight decay in this initial training session.

./distributed_train.sh 2 /data/imagenet/ --model vit_small_patch16_224 --sched cosine --epochs 300 --opt adamp -j 8 --warmup-lr 1e-6 --mixup .2 --model-ema --model-ema-decay 0.99996 --aa rand-m9-mstd0.5-inc1 --remode pixel --reprob 0.25 --amp --lr .001 --weight-decay .01 -b 256
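
A rough sketch of what excluding the embedding weights (and other no-decay parameters) from weight decay can look like in PyTorch; the parameter names pos_embed and cls_token follow timm's ViT implementation, but the grouping below is illustrative rather than the exact training-script code:

import timm
import torch

model = timm.create_model('vit_small_patch16_224')

# Split parameters into decay / no-decay groups: 1-D tensors (biases, norm
# weights) and the embedding tensors are excluded from weight decay.
decay, no_decay = [], []
for name, param in model.named_parameters():
    if not param.requires_grad:
        continue
    if param.ndim <= 1 or name in ('pos_embed', 'cls_token'):
        no_decay.append(param)
    else:
        decay.append(param)

# AdamW stands in here; the run above used AdamP with the same lr/weight decay.
optimizer = torch.optim.AdamW(
    [{'params': decay, 'weight_decay': 0.01},
     {'params': no_decay, 'weight_decay': 0.0}],
    lr=1e-3,
)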

@rwightman
Collaborator

That was run on 2x Titan RTX.

@gupta-abhay
Author

Thanks.

@rwightman
Collaborator

A new run that's doing better: 76.1% so far at epoch 162, with dropout and stochastic depth (drop_path) enabled.

./distributed_train.sh 2 /data/imagenet/ --model vit_small_patch16_224 --sched cosine --epochs 300 --opt adamw -j 8 --warmup-lr 1e-6 --mixup .2 --model-ema --model-ema-decay 0.99996 --aa rand-m9-mstd0.5-inc1 --remode pixel --reprob 0.25 --amp --lr 5e-4 --weight-decay .05 --drop 0.1 --drop-path .1 -b 288
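
For reference, a rough sketch of how the --drop and --drop-path flags map onto model creation in timm (the keyword names follow timm's create_model interface; the values just mirror the command above):

import timm

# dropout and stochastic depth rates matching --drop 0.1 --drop-path .1
model = timm.create_model(
    'vit_small_patch16_224',
    drop_rate=0.1,       # standard dropout
    drop_path_rate=0.1,  # stochastic depth (drop_path)
)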

@gupta-abhay
Author

gupta-abhay commented Oct 18, 2020

Following the other issue linked here -- wondering what the dropout just before the softmax (in the MLP head) for class prediction is getting you? Is it just more regularization? I understand it being present after the first linear layer.
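
(For context, an illustrative sketch rather than the exact timm module: a dropout placed right before the final Linear is the last stochastic op ahead of the logits/softmax, and its effect is just extra regularization of the classifier.)

import torch.nn as nn

embed_dim, num_classes, p = 768, 1000, 0.1  # placeholder dimensions

# Dropout directly before the classifier: randomly zeroes features so the
# final Linear (whose logits feed the softmax) cannot rely on any single
# dimension of the representation.
head = nn.Sequential(
    nn.Dropout(p),
    nn.Linear(embed_dim, num_classes),
)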

@linhduongtuan

linhduongtuan commented Oct 19, 2020 via email

@rwightman
Collaborator

rwightman commented Oct 19, 2020

@linhduongtuan that's the plan. The training session mentioned above is at epoch 185 for small p16 now, sitting at 76.65; pretty sure it'll hit the low 77s at worst. How did yours end up?

Someone ran a base p16 model for me on bigger compute; apparently it finished at 78.8, but I don't have my hands on the weights yet. I think that result could be pushed into the low to mid 79s with more epochs and a bit more dropout.

If you have any good results I'm certainly open to adding them, with a mention.

@rwightman
Collaborator

I have ImageNet-21k (Full) and OpenImages that I could do some transfer learning with, but it would take a reallllly long time with 2 GPUs

@linhduongtuan

linhduongtuan commented Oct 19, 2020 via email

@linhduongtuan

linhduongtuan commented Oct 19, 2020 via email

@mrT23
Contributor

mrT23 commented Oct 19, 2020

I have ImageNet-21k (Full) and OpenImages that I could do some transfer learning with, but it would take a reallllly long time with 2 GPUs

@rwightman
With or without ViT (personally I am skeptical; it is a gigantic network that won't give a good speed-accuracy tradeoff, even with large pre-training), ImageNet-21k is an interesting dataset.
Do you know a "normal" location on the internet to download ImageNet-21k from?
I searched for it once and couldn't find anything reliable.

Tal

@rwightman
Collaborator

@mrT23 yes, they're gigantic, but surprisingly fast w/ AMP enabled despite the size... just big fat MMs (matrix multiplies). Overall more of a curiosity at this stage.

fall11_whole.tar (1.31TB) can be found on Academic Torrents; it has 21841 classes and matches the md5 on the official site, which no longer offers the download. You can use it as is, though it's quite unbalanced as a 21k-class set, or filter it down to various definitions of ImageNet 10k, 7k, or 5k, usually improving the balance or selecting only leaf nodes, etc., as the classes get pruned.
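
One crude way to do the balance-based pruning described above, assuming the archive has already been unpacked into one directory per synset; the path and the keep-top-N rule are illustrative, not the exact recipe used:

import os

data_root = '/data/imagenet21k'   # hypothetical layout: data_root/nXXXXXXXX/*.JPEG
keep_top_n = 10000                # e.g. an "ImageNet-10k" style subset

# Count images per synset and keep the most populated classes to improve balance.
counts = []
for synset in os.listdir(data_root):
    synset_dir = os.path.join(data_root, synset)
    if os.path.isdir(synset_dir):
        counts.append((len(os.listdir(synset_dir)), synset))

counts.sort(reverse=True)
kept = {synset for _, synset in counts[:keep_top_n]}
print(f'keeping {len(kept)} of {len(counts)} synsets')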

@mrT23
Contributor

mrT23 commented Oct 20, 2020

@mrT23 yes, they're gigantic, but surprisingly fast w/ AMP enabled despite the size... just big fat MMs (matrix multiplies). Overall more of a curiosity at this stage.

fall11_whole.tar (1.31TB) can be found on Academic Torrents; it has 21841 classes and matches the md5 on the official site, which no longer offers the download. You can use it as is, though it's quite unbalanced as a 21k-class set, or filter it down to various definitions of ImageNet 10k, 7k, or 5k, usually improving the balance or selecting only leaf nodes, etc., as the classes get pruned.

Thanks for the tip. I looked on Academic Torrents in the past, but never made the connection between the name and the dataset.

Big datasets are often not very user-friendly, and no one really tries to make them more accessible. One simple trick is to provide a variant with the images resized to 224x224 (squish pre-processing). It might limit the augmentation regime a bit, but it's worth it - Open Images multi-label (6M images, 9000 labels) is only 62GB with this resizing.
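
A minimal sketch of that squish pre-processing, assuming a simple directory tree of images; the paths, extensions, and JPEG quality are illustrative:

import os
from PIL import Image

src_root, dst_root, size = '/data/openimages', '/data/openimages_224', (224, 224)

for dirpath, _, filenames in os.walk(src_root):
    out_dir = os.path.join(dst_root, os.path.relpath(dirpath, src_root))
    os.makedirs(out_dir, exist_ok=True)
    for fname in filenames:
        if not fname.lower().endswith(('.jpg', '.jpeg', '.png')):
            continue
        img = Image.open(os.path.join(dirpath, fname)).convert('RGB')
        img = img.resize(size, Image.BILINEAR)  # squish: no aspect-preserving crop
        img.save(os.path.join(out_dir, os.path.splitext(fname)[0] + '.jpg'), quality=90)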

@rwightman
Collaborator

rwightman commented Oct 21, 2020

A new run that's doing better: 76.1% so far at epoch 162, with dropout and stochastic depth (drop_path) enabled.

./distributed_train.sh 2 /data/imagenet/ --model vit_small_patch16_224 --sched cosine --epochs 300 --opt adamw -j 8 --warmup-lr 1e-6 --mixup .2 --model-ema --model-ema-decay 0.99996 --aa rand-m9-mstd0.5-inc1 --remode pixel --reprob 0.25 --amp --lr 5e-4 --weight-decay .05 --drop 0.1 --drop-path .1 -b 288

Weights for this are up now, at 77.42.

EDIT: correction, 77.86 after tweaking the test img crop to something closer to the typical 0.875. Base model weights at 79.35 top-1 are also up, from a training session generously run for me by someone with more GPUs :)
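
For evaluation, the test crop fraction can be overridden when building the transforms; a rough sketch using timm's data helpers (the crop_pct keyword follows timm's create_transform API, and 0.875 just mirrors the value mentioned above):

import timm
from timm.data import resolve_data_config, create_transform

model = timm.create_model('vit_small_patch16_224', pretrained=True)
config = resolve_data_config({}, model=model)

eval_transform = create_transform(
    input_size=config['input_size'],
    interpolation=config['interpolation'],
    mean=config['mean'],
    std=config['std'],
    crop_pct=0.875,     # closer to the conventional test-time crop
    is_training=False,
)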

@rwightman
Collaborator

The official version is out in JAX, with some training code and pretrained weights (from ImageNet-21k). Looks like this thread can be closed.

@JYBWOB

JYBWOB commented Sep 25, 2021

A new run that's doing better: 76.1% so far at epoch 162, with dropout and stochastic depth (drop_path) enabled.

./distributed_train.sh 2 /data/imagenet/ --model vit_small_patch16_224 --sched cosine --epochs 300 --opt adamw -j 8 --warmup-lr 1e-6 --mixup .2 --model-ema --model-ema-decay 0.99996 --aa rand-m9-mstd0.5-inc1 --remode pixel --reprob 0.25 --amp --lr 5e-4 --weight-decay .05 --drop 0.1 --drop-path .1 -b 288

When using adam instead of adamw, I only get 6.3% at epoch 100. Should I make other changes when using the adam optimizer?

@hankyul2
Contributor

@JYBWOB
Yes, you should. If you use adam instead of adamw, you should adjust the weight_decay and learning rate. The difference between the two optimizers is how weight_decay is applied during the update: adam folds it into the gradient as an L2 penalty, while adamw applies it directly to the weights (decoupled weight decay).

Unfortunately, I don't know of any convention for how to translate a weight_decay value from adamw to adam.
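
To make that distinction concrete, a small illustration; the update rules in the comments are simplified (bias correction omitted), and the hyperparameter values are only placeholders:

import torch

# Adam  + weight_decay: g = grad + wd * w                 (L2 term flows through
#                       w = w - lr * adam(g)               the adaptive moments)
# AdamW + weight_decay: g = grad                           (decay is decoupled and
#                       w = w - lr * (adam(g) + wd * w)    applied to w directly)
params = [torch.nn.Parameter(torch.randn(10, 10))]
opt_adam  = torch.optim.Adam(params,  lr=5e-4, weight_decay=0.05)  # L2-style decay
opt_adamw = torch.optim.AdamW(params, lr=5e-4, weight_decay=0.05)  # decoupled decay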
