On the use of Apex AMP and hybrid stages #22

Open
DonkeyShot21 opened this issue Jan 8, 2022 · 6 comments

@DonkeyShot21

Is there a specific reason why you used Apex AMP instead of the native AMP provided by PyTorch? Have you tried native AMP?

I tried to train poolformer_s12 and poolformer_s24 with solo-learn; with native fp16 the loss goes to nan after a few epochs, while with fp32 it works fine. Did you experience similar behavior?

On a side note, can you provide the implementation and the hyperparameters for the hybrid stage [Pool, Pool, Attention, Attention]? It seems very interesting!

@yuweihao
Collaborator

Hi @DonkeyShot21, thanks for your attention.

We have only trained poolformer_s12 with Apex AMP, and it works well, so we use Apex AMP in the example of how to train on GPUs. We have not tested native AMP and thus have no experience with it.

We plan to release the implementation and more trained models with hybrid stages around March. As for the [Pool, Pool, Attention, Attention]-S12 (81.0% accuracy) shown in the paper, we trained it with LayerNorm, a batch size of 1024, and a learning rate of 1e-3. The remaining hyper-parameters are the same as for poolformer_s12. The implementation of the pooling token mixer is the same as in PoolFormer. After the first two stages, the position embedding is added. The attention token mixer is similar to the one in timm; the difference is that, since the default data format of our implementation is [B, C, H, W], the input of the attention token mixer is transformed into [B, N, C] and the output is transformed back into [B, C, H, W].
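
Roughly, such an attention token mixer wrapper could be sketched like this (an illustrative sketch that uses `nn.MultiheadAttention` as a stand-in for the timm-style attention; the class name and defaults are assumptions, not our released code):

```python
import torch.nn as nn


class AttentionTokenMixer(nn.Module):
    """Sketch of an attention token mixer for a channels-first MetaFormer block:
    the input arrives as [B, C, H, W], is flattened to [B, N, C] for multi-head
    self-attention, and reshaped back to [B, C, H, W]."""

    def __init__(self, dim, num_heads=8, qkv_bias=False):
        super().__init__()
        # dim must be divisible by num_heads for multi-head attention
        self.attn = nn.MultiheadAttention(dim, num_heads, bias=qkv_bias,
                                          batch_first=True)

    def forward(self, x):
        B, C, H, W = x.shape
        # [B, C, H, W] -> [B, N, C] with N = H * W
        tokens = x.flatten(2).transpose(1, 2)
        tokens, _ = self.attn(tokens, tokens, tokens, need_weights=False)
        # [B, N, C] -> [B, C, H, W]
        return tokens.transpose(1, 2).reshape(B, C, H, W)
```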

@DonkeyShot21
Author

Hi @yuweihao, thanks for the nice reply.

Apex can be hard to install without sudo, which is why I prefer native AMP. Actually, I have tried both (Apex and native) with solo-learn, and both lead to NaNs in the loss quite quickly. This also happens with Swin and ViT. I am trying your implementation now with native AMP and it seems to work nicely; the logs are similar to the ones you posted on Google Drive. So I guess my problem is related to the SSL methods or to the fact that solo-learn does not support mixup and cutmix. The only way I could stabilize training was with SGD + LARS and gradient accumulation (to simulate a large batch size), but the results are very bad, much worse than ResNet18. I guess SGD is not a good match for metaformers in general.
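
Concretely, the kind of native-AMP + gradient-accumulation loop I mean looks roughly like this (a minimal sketch; the function and argument names are placeholders, not solo-learn's actual API):

```python
import torch


def train_one_epoch(model, loader, optimizer, criterion, scaler, accum_steps=8):
    """Placeholder training loop: native AMP with gradient accumulation to
    simulate a batch size accum_steps times larger. The GradScaler is created
    once outside and reused across epochs."""
    optimizer.zero_grad()
    for step, (images, targets) in enumerate(loader):
        images, targets = images.cuda(), targets.cuda()
        with torch.cuda.amp.autocast():
            loss = criterion(model(images), targets) / accum_steps
        scaler.scale(loss).backward()        # accumulate scaled gradients
        if (step + 1) % accum_steps == 0:
            scaler.step(optimizer)           # unscales; skips step on inf/NaN grads
            scaler.update()
            optimizer.zero_grad()


# scaler = torch.cuda.amp.GradScaler()  # created once for the whole run
```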

Thanks for the details on the hybrid stage. I have also seen in other issues that you mentioned depthwise convs can be used instead of pooling with a slight increase in performance. Do you think this can be paired with the hybrid stages as well (e.g. depthwise conv in the first two stages and then attention in the last two)?

@yuweihao
Collaborator

Hi @DonkeyShot21, thanks for your wonderful work for the research community :)

Yes, [DWConv, DWConv, Attention, Attention] also works very well, and it is in our release plan for models with hybrid stages.
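
For illustration, a depthwise-convolution token mixer in the same [B, C, H, W] format could be sketched as follows (class name and kernel size are assumptions, not the released implementation):

```python
import torch.nn as nn


class DWConvTokenMixer(nn.Module):
    """Sketch of a depthwise-convolution token mixer operating on
    [B, C, H, W] feature maps."""

    def __init__(self, dim, kernel_size=3):
        super().__init__()
        # groups=dim makes the convolution depthwise: one filter per channel
        self.dwconv = nn.Conv2d(dim, dim, kernel_size,
                                padding=kernel_size // 2, groups=dim)

    def forward(self, x):
        return self.dwconv(x)
```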

@DonkeyShot21
Author

Thank you again! Looking forward to the release!

@DonkeyShot21
Author

Hey @yuweihao, sorry to bother you again. For the hybrid stage [Pool, Pool, Attention, Attention], did you use layer norm just for the attention blocks or for the pooling blocks as well? I am trying to reproduce it on ImageNet-100, but I didn't get better performance than vanilla PoolFormer. The params and FLOPs are the same as you reported, so I guess the implementation should be correct.

DonkeyShot21 reopened this Jan 13, 2022
@yuweihao
Collaborator

Hi @DonkeyShot21, I use layer norm for all blocks of [Pool, Pool, Attention, Attention]-S12. I guess the attention blocks may be prone to overfitting on small datasets, which results in worse performance than vanilla PoolFormer on ImageNet-100.
