
why the speed slower than pvtv2-b1? #18

Closed

lucasjinreal opened this issue Dec 20, 2021 · 5 comments

Comments

@lucasjinreal

lucasjinreal commented Dec 20, 2021

Recently I trained a transformer based instance seg model, tested with different backbone, here is the result and speed test:

[Image: table comparing backbones by precision, speed, and training batch size]

The batch size shown is the training batch size. Why is PoolFormer the slowest one? Is that normal?

It is slower than PVTv2-b1 and its precision is lower too...

@yuweihao
Collaborator

yuweihao commented Dec 20, 2021

Hi @jinfagang, PoolFormer here is just a tool to demonstrate MetaFormer, and the implementation may not be efficient enough for industrial use. For example, `nn.AvgPool2d` may not be well optimized in CUDA. It can be replaced with a depthwise conv, `self.token_mixer = nn.Conv2d(in_channels=dim, out_channels=dim, kernel_size=3, stride=1, padding=1, groups=dim)`, to speed things up. For GroupNorm, I don't yet know how to speed it up.
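As a minimal sketch of the swap described above (the `dim`, input shape, and kernel size are illustrative assumptions, not values from the PoolFormer source), the depthwise conv is a shape-compatible drop-in for the average-pooling token mixer:

```python
import torch
import torch.nn as nn

dim = 64  # illustrative channel count

# PoolFormer-style token mixer: 3x3 average pooling, stride 1
pool_mixer = nn.AvgPool2d(kernel_size=3, stride=1, padding=1,
                          count_include_pad=False)

# The suggested drop-in: a learnable 3x3 depthwise conv
dw_mixer = nn.Conv2d(in_channels=dim, out_channels=dim,
                     kernel_size=3, stride=1, padding=1, groups=dim)

x = torch.randn(1, dim, 56, 56)
# Both mixers preserve the (N, C, H, W) shape, so one can replace the other
assert pool_mixer(x).shape == dw_mixer(x).shape
```

Because the depthwise conv has learnable weights (unlike fixed pooling), the model would normally need retraining after this swap, which is what the next question is about.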

@lucasjinreal
Author

@yuweihao Hi, does that mean I would have to retrain the whole model if I change `nn.AvgPool2d` to a depthwise conv?

@chuong98

chuong98 commented Dec 21, 2021

@jinfagang You don't have to. In our experiments, replacing GN with BN and then reimplementing the pooling layer as a fixed, predefined depthwise conv gave us about a 30% speed-up, with about a 1% accuracy drop on ImageNet.
If you use BN, you can fuse the Conv-BN pair to speed it up further.
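One way to build such a fixed depthwise conv without retraining is to freeze its weights to a uniform 1/9 kernel, which reproduces 3x3 average pooling exactly (when padded zeros are included in the average; note PoolFormer's own pooling uses `count_include_pad=False`, so values differ at the borders). This is a sketch of the idea, not chuong98's actual implementation:

```python
import torch
import torch.nn as nn

dim = 64  # illustrative channel count

# Fixed depthwise conv meant to reproduce 3x3 average pooling,
# so pretrained PoolFormer weights still apply
fixed_dw = nn.Conv2d(dim, dim, kernel_size=3, stride=1, padding=1,
                     groups=dim, bias=False)
with torch.no_grad():
    fixed_dw.weight.fill_(1.0 / 9.0)  # uniform kernel == averaging
fixed_dw.weight.requires_grad_(False)  # keep it frozen

x = torch.randn(2, dim, 14, 14)
out_conv = fixed_dw(x)
# count_include_pad=True makes the border math match the zero-padded conv
out_pool = nn.AvgPool2d(3, stride=1, padding=1, count_include_pad=True)(x)
assert torch.allclose(out_conv, out_pool, atol=1e-6)
```

Unlike `nn.AvgPool2d`, this conv can then use cuDNN's depthwise-conv kernels, and if followed by BN it can be fused with it at inference time.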

@lucasjinreal
Author

@chuong98 Can you share your pretrained fixed depthwise conv? How do you set its weights?

@yuweihao
Collaborator

yuweihao commented Dec 21, 2021

Hi @jinfagang, @chuong98, I just found that CUDA much prefers the NHWC layout over NCHW [1]. However, PyTorch uses NCHW by default, and PoolFormer also uses this layout. Switching layouts may be another way to speed it up [2].

[Figure from [1]: NHWC vs. NCHW tensor layout performance on CUDA]

[1] https://docs.nvidia.com/deeplearning/performance/dl-performance-convolutional/index.html#tensor-layout
[2] https://pytorch.org/tutorials/intermediate/memory_format_tutorial.html
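Following the tutorial in [2], switching a model to NHWC only changes the underlying memory layout, not the reported tensor shapes; this sketch (with an arbitrary toy model, not PoolFormer itself) shows the two conversion calls involved:

```python
import torch
import torch.nn as nn

# Toy stand-in model; the same calls would apply to PoolFormer
model = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU()).eval()
x = torch.randn(1, 3, 224, 224)

# Convert weights and input to channels_last (NHWC in memory);
# tensor shapes remain (N, C, H, W)
model = model.to(memory_format=torch.channels_last)
x = x.to(memory_format=torch.channels_last)

with torch.no_grad():
    y = model(x)

assert y.shape == (1, 64, 224, 224)       # logical shape unchanged
# Conv2d propagates the channels_last layout to its output
assert y.is_contiguous(memory_format=torch.channels_last)
```

The speed-up from this is hardware- and kernel-dependent; per [1], it mainly helps convolutions running on Tensor Cores.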

@yuweihao yuweihao closed this as completed Mar 1, 2022