
ViT Training Details #252

Closed
gupta-abhay opened this issue Oct 14, 2020 · 17 comments

Comments

@gupta-abhay

Hi,

In your code comments you mention being able to train a small version of the model to 75% top-1 accuracy. Could you give more details about the hyper-parameters used (batch size, learning rate, etc.)?

Thanks.

@rwightman
Collaborator

rwightman commented Oct 14, 2020

The training cmd line was the one below: quite a bit of augmentation, but no dropout/drop_path. I don't think it's optimal, and it was run with an earlier version of the model, before I refactored. The biggest change is that it was trained with the MLP head instead of the single Linear for the final layer (~4M more params); that can be re-enabled in the model def. I'm currently trying other optimizers and regularization settings. EDIT: I also didn't exclude the embedding weights from the weight decay in this initial training session.

./distributed_train.sh 2 /data/imagenet/ --model vit_small_patch16_224 --sched cosine --epochs 300 --opt adamp -j 8 --warmup-lr 1e-6 --mixup .2 --model-ema --model-ema-decay 0.99996 --aa rand-m9-mstd0.5-inc1 --remode pixel --reprob 0.25 --amp --lr .001 --weight-decay .01 -b 256
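
A rough sketch of what excluding the embedding weights (and other no-decay parameters) from weight decay can look like in PyTorch; the parameter names pos_embed and cls_token follow timm's ViT implementation, but the grouping below is illustrative rather than the exact training-script code:

import timm
import torch

model = timm.create_model('vit_small_patch16_224')

# Split parameters into decay / no-decay groups: 1-D tensors (biases, norm
# weights) and the embedding tensors are excluded from weight decay.
decay, no_decay = [], []
for name, param in model.named_parameters():
    if not param.requires_grad:
        continue
    if param.ndim <= 1 or name in ('pos_embed', 'cls_token'):
        no_decay.append(param)
    else:
        decay.append(param)

# AdamW stands in here; the run above used AdamP with the same lr/weight decay.
optimizer = torch.optim.AdamW(
    [{'params': decay, 'weight_decay': 0.01},
     {'params': no_decay, 'weight_decay': 0.0}],
    lr=1e-3,
)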

@rwightman
Collaborator

That was run on 2x Titan RTX.

@gupta-abhay
Author

Thanks.

@rwightman
Collaborator

A new run that's doing better: 76.1% so far at epoch 162, with dropout and stochastic depth (drop_path) enabled.

./distributed_train.sh 2 /data/imagenet/ --model vit_small_patch16_224 --sched cosine --epochs 300 --opt adamw -j 8 --warmup-lr 1e-6 --mixup .2 --model-ema --model-ema-decay 0.99996 --aa rand-m9-mstd0.5-inc1 --remode pixel --reprob 0.25 --amp --lr 5e-4 --weight-decay .05 --drop 0.1 --drop-path .1 -b 288
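
For reference, a rough sketch of how the --drop and --drop-path flags map onto model creation in timm (the keyword names follow timm's create_model interface; the values just mirror the command above):

import timm

# dropout and stochastic depth rates matching --drop 0.1 --drop-path .1
model = timm.create_model(
    'vit_small_patch16_224',
    drop_rate=0.1,       # standard dropout
    drop_path_rate=0.1,  # stochastic depth (drop_path)
)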

@gupta-abhay
Author

gupta-abhay commented Oct 18, 2020

Following the other issue linked here -- wondering what the dropout just before the softmax (in the MLP head) for class prediction is getting you? Is it just more regularization? I understand it being present after the first linear layer.
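
(For context, an illustrative sketch rather than the exact timm module: a dropout placed right before the final Linear is the last stochastic op ahead of the logits/softmax, and its effect is just extra regularization of the classifier.)

import torch.nn as nn

embed_dim, num_classes, p = 768, 1000, 0.1  # placeholder dimensions

# Dropout directly before the classifier: randomly zeroes features so the
# final Linear (whose logits feed the softmax) cannot rely on any single
# dimension of the representation.
head = nn.Sequential(
    nn.Dropout(p),
    nn.Linear(embed_dim, num_classes),
)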

@linhduongtuan

linhduongtuan commented Oct 19, 2020 via email

@rwightman
Collaborator

rwightman commented Oct 19, 2020

@linhduongtuan that's the plan. The training session mentioned above is at epoch 185 for small p16 now, sitting at 76.65; pretty sure it'll hit the low 77s at worst. How did yours end up?

Someone ran a base p16 model for me on bigger compute; apparently it finished at 78.8, but I don't have my hands on the weights yet. I think that result could be pushed into the low to mid 79s with more epochs and a bit more dropout.

If you have any good results I'm certainly open to adding them, with a mention.

@rwightman
Collaborator

I have ImageNet-21k (Full) and OpenImages that I could do some transfer learning with, but it would take a reallllly long time with 2 GPUs

@linhduongtuan

linhduongtuan commented Oct 19, 2020 via email

@linhduongtuan

linhduongtuan commented Oct 19, 2020 via email

@mrT23
Contributor

mrT23 commented Oct 19, 2020

I have ImageNet-21k (Full) and OpenImages that I could do some transfer learning with, but it would take a reallllly long time with 2 GPUs

@rwightman
With or without ViT (personally I am skeptical; it is a gigantic network that won't give a good speed-accuracy tradeoff, even with large pre-training), ImageNet-21k is an interesting dataset.
Do you know a "normal" location on the internet to download ImageNet-21k from?
I searched for it once and couldn't find anything reliable.

Tal

@rwightman
Collaborator

@mrT23 yes, they're gigantic, but surprisingly fast w/ AMP enabled despite the size... just big fat MMs (matrix multiplies). Overall more of a curiosity at this stage.

fall11_whole.tar (1.31TB) can be found on Academic Torrents; it has 21841 classes and matches the md5 on the official site, which no longer offers the download. You can use it as is, though it's quite unbalanced as a 21k-class set, or filter it down to various definitions of ImageNet 10k, 7k, or 5k, usually improving the balance or selecting only leaf nodes, etc., as the classes get pruned.
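
One crude way to do the balance-based pruning described above, assuming the archive has already been unpacked into one directory per synset; the path and the keep-top-N rule are illustrative, not the exact recipe used:

import os

data_root = '/data/imagenet21k'   # hypothetical layout: data_root/nXXXXXXXX/*.JPEG
keep_top_n = 10000                # e.g. an "ImageNet-10k" style subset

# Count images per synset and keep the most populated classes to improve balance.
counts = []
for synset in os.listdir(data_root):
    synset_dir = os.path.join(data_root, synset)
    if os.path.isdir(synset_dir):
        counts.append((len(os.listdir(synset_dir)), synset))

counts.sort(reverse=True)
kept = {synset for _, synset in counts[:keep_top_n]}
print(f'keeping {len(kept)} of {len(counts)} synsets')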

@mrT23
Contributor

mrT23 commented Oct 20, 2020

@mrT23 yes, they're gigantic, but surprisingly fast w/ AMP enabled despite the size... just big fat MMs (matrix multiplies). Overall more of a curiosity at this stage.

fall11_whole.tar (1.31TB) can be found on Academic Torrents; it has 21841 classes and matches the md5 on the official site, which no longer offers the download. You can use it as is, though it's quite unbalanced as a 21k-class set, or filter it down to various definitions of ImageNet 10k, 7k, or 5k, usually improving the balance or selecting only leaf nodes, etc., as the classes get pruned.

Thanks for the tip. I looked on Academic Torrents in the past, but never made the connection between the name and the dataset.

Big datasets are often not very user-friendly, and no one really tries to make them more accessible. One simple trick is to provide a variant with the images resized to 224x224 (squish pre-processing). It might limit the augmentation regime a bit, but it's worth it - Open Images multi-label (6M images, 9000 labels) is only 62GB with this resizing.
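
A minimal sketch of that squish pre-processing, assuming a simple directory tree of images; the paths, extensions, and JPEG quality are illustrative:

import os
from PIL import Image

src_root, dst_root, size = '/data/openimages', '/data/openimages_224', (224, 224)

for dirpath, _, filenames in os.walk(src_root):
    out_dir = os.path.join(dst_root, os.path.relpath(dirpath, src_root))
    os.makedirs(out_dir, exist_ok=True)
    for fname in filenames:
        if not fname.lower().endswith(('.jpg', '.jpeg', '.png')):
            continue
        img = Image.open(os.path.join(dirpath, fname)).convert('RGB')
        img = img.resize(size, Image.BILINEAR)  # squish: no aspect-preserving crop
        img.save(os.path.join(out_dir, os.path.splitext(fname)[0] + '.jpg'), quality=90)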

@rwightman
Collaborator

rwightman commented Oct 21, 2020

A new run that's doing better: 76.1% so far at epoch 162, with dropout and stochastic depth (drop_path) enabled.

./distributed_train.sh 2 /data/imagenet/ --model vit_small_patch16_224 --sched cosine --epochs 300 --opt adamw -j 8 --warmup-lr 1e-6 --mixup .2 --model-ema --model-ema-decay 0.99996 --aa rand-m9-mstd0.5-inc1 --remode pixel --reprob 0.25 --amp --lr 5e-4 --weight-decay .05 --drop 0.1 --drop-path .1 -b 288

Weights for this are up now, at 77.42.

EDIT: correction, 77.86 after tweaking the test img crop to something closer to the typical 0.875. Base model weights at 79.35 top-1 are also up, from a training session generously run for me by someone with more GPUs :)
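
For evaluation, the test crop fraction can be overridden when building the transforms; a rough sketch using timm's data helpers (the crop_pct keyword follows timm's create_transform API, and 0.875 just mirrors the value mentioned above):

import timm
from timm.data import resolve_data_config, create_transform

model = timm.create_model('vit_small_patch16_224', pretrained=True)
config = resolve_data_config({}, model=model)

eval_transform = create_transform(
    input_size=config['input_size'],
    interpolation=config['interpolation'],
    mean=config['mean'],
    std=config['std'],
    crop_pct=0.875,     # closer to the conventional test-time crop
    is_training=False,
)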

@rwightman
Collaborator

The official version is out in JAX, with some training code and pretrained weights (from ImageNet-21k). Looks like this thread can be closed.

@JYBWOB

JYBWOB commented Sep 25, 2021

A new run that's doing better: 76.1% so far at epoch 162, with dropout and stochastic depth (drop_path) enabled.

./distributed_train.sh 2 /data/imagenet/ --model vit_small_patch16_224 --sched cosine --epochs 300 --opt adamw -j 8 --warmup-lr 1e-6 --mixup .2 --model-ema --model-ema-decay 0.99996 --aa rand-m9-mstd0.5-inc1 --remode pixel --reprob 0.25 --amp --lr 5e-4 --weight-decay .05 --drop 0.1 --drop-path .1 -b 288

When using adam instead of adamw, I only get 6.3% at epoch 100. Should I make other changes when using the adam optimizer?

@hankyul2
Contributor

@JYBWOB
Yes, you should. If you use adam instead of adamw, you should adjust the weight_decay and learning rate. The difference between the two optimizers is how weight_decay is applied during the update: adam folds it into the gradient as an L2 penalty, while adamw applies it directly to the weights (decoupled weight decay).

Unfortunately, I don't know of any convention for how to translate a weight_decay value from adamw to adam.
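
To make that distinction concrete, a small illustration; the update rules in the comments are simplified (bias correction omitted), and the hyperparameter values are only placeholders:

import torch

# Adam  + weight_decay: g = grad + wd * w                 (L2 term flows through
#                       w = w - lr * adam(g)               the adaptive moments)
# AdamW + weight_decay: g = grad                           (decay is decoupled and
#                       w = w - lr * (adam(g) + wd * w)    applied to w directly)
params = [torch.nn.Parameter(torch.randn(10, 10))]
opt_adam  = torch.optim.Adam(params,  lr=5e-4, weight_decay=0.05)  # L2-style decay
opt_adamw = torch.optim.AdamW(params, lr=5e-4, weight_decay=0.05)  # decoupled decay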
