Fix performance issue of GroupNorm on CUDA when feature map is small. #46170
Conversation
This pull request was exported from Phabricator. Differential Revision: D24242738
That's a very nice performance improvement. It would be good to have the comments in the kernels and describe how you choose which path to use. Please make sure that you are testing all the kernel variants that you add (I'm not sure added tests cover everything). Also, can you enable and test bfloat16 dispatch on cuda? Hopefully it should just work, and we are enabling most bfloat16 operations now.
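A minimal smoke test for the requested bfloat16 dispatch might look like the sketch below. The `check_group_norm_dtype` helper is hypothetical (not part of the PR's test plan); it simply runs a GroupNorm forward and backward pass in a given dtype, with the bfloat16 CUDA path guarded by device availability:

```python
import torch

def check_group_norm_dtype(dtype, device):
    """Hypothetical smoke test: GroupNorm forward + backward in `dtype`."""
    norm = torch.nn.GroupNorm(8, 64).to(device=device, dtype=dtype)
    x = torch.rand(4, 64, 7, 7, device=device, dtype=dtype, requires_grad=True)
    y = norm(x)
    y.backward(torch.randn_like(y))
    return y

# float32 on CPU always works; bfloat16 only where the CUDA dispatch is enabled.
y = check_group_norm_dtype(torch.float32, "cpu")
print(tuple(y.shape))
if torch.cuda.is_available():
    check_group_norm_dtype(torch.bfloat16, "cuda")
```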
Thanks again for the fix!
Codecov Report
```
@@            Coverage Diff            @@
##           master   #46170   +/-   ##
=======================================
  Coverage   68.32%   68.33%
=======================================
  Files         410      410
  Lines       53793    53793
=======================================
+ Hits        36756    36757       +1
+ Misses      17037    17036       -1
```
Continue to review the full report at Codecov.
da53ae3 to 4b6c486 (Compare)
4b6c486 to ac11a3a (Compare)
BFloat16 has been enabled.
💊 CI failures summary and remediations

As of commit 1ccf848 (more details on the Dr. CI page):

💚 💚 Looks good so far! There are no failures yet. 💚 💚

This comment was automatically generated by Dr. CI.
ac11a3a to 62313f6 (Compare)
62313f6 to b5c7bae (Compare)
b5c7bae to e1b4e36 (Compare)
e1b4e36 to 5a2876f (Compare)
5a2876f to 30b2b6a (Compare)
…pytorch#46170)

Summary:
Pull Request resolved: pytorch#46170

Fix performance issue of GroupNorm on CUDA when feature map is small.

Benchmark script:

```
import torch
import torch.nn.functional as F
from timeit import Timer

norm = torch.nn.GroupNorm(8, 512).cuda()
num = 5000
sizes = [(1024, 512, 14, 14), (1024, 512, 7, 7), (1024, 512)]

def forward(x):
    _ = norm(x)
    torch.cuda.synchronize()

def backward(y, grad):
    y.backward(grad, retain_graph=True)
    torch.cuda.synchronize()

if __name__ == "__main__":
    # warm up
    x = torch.rand(*(sizes[0]), dtype=torch.float, device="cuda", requires_grad=True)
    for _ in range(100):
        forward(x)
    for size in sizes:
        x = torch.rand(*size, dtype=torch.float, device="cuda", requires_grad=True)
        t = Timer("forward(x)", "from __main__ import forward, x")
        print(f"size = {size}:")
        t1 = t.timeit(num) / num * 1e6
        print(f"avg_forward_time = {t1}us")
        y = norm(x)
        grad = torch.randn_like(y)
        t = Timer("backward(y, grad)", "from __main__ import backward, y, grad")
        t2 = t.timeit(num) / num * 1e6
        print(f"avg_backward_time = {t2}us")
```

Benchmark result before this Diff:

```
size = (1024, 512, 14, 14):
avg_forward_time = 1636.729855206795us
avg_backward_time = 5488.682465581223us
size = (1024, 512, 7, 7):
avg_forward_time = 465.88476160541177us
avg_backward_time = 3129.9425506033003us
size = (1024, 512):
avg_forward_time = 96.90486900508404us
avg_backward_time = 2319.4099438143894us
```

Benchmark result after this Diff:

```
size = (1024, 512, 14, 14):
avg_forward_time = 1635.6191572034732us
avg_backward_time = 4140.7730475999415us
size = (1024, 512, 7, 7):
avg_forward_time = 463.6513736099005us
avg_backward_time = 1641.7451039887965us
size = (1024, 512):
avg_forward_time = 66.59087920561433us
avg_backward_time = 128.6882139975205us
```

Test Plan: buck test mode/dev-nosan //caffe2/test:nn -- "GroupNorm"

Differential Revision: D24242738

fbshipit-source-id: 56c0b5f381ac96cb539e9f01b8c504337a57cd9c
30b2b6a to 1ccf848 (Compare)
Looks good, thank you!
This pull request has been merged in a87a1c1.
Summary: Fix performance issue of GroupNorm on CUDA when feature map is small.
Test Plan: buck test mode/dev-nosan //caffe2/test:nn -- "GroupNorm"
Differential Revision: D24242738
As mentioned in #46086, the current GroupNorm implementation performs badly on CUDA when the feature map is small, even compared to the BatchNorm-based implementation used before PyTorch 1.5.1. This PR fixes the performance issue for small feature maps.
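For context on what the affected operator computes, `GroupNorm(8, 512)` splits each sample's 512 channels into 8 groups of 64 and normalizes each group over its channels and spatial positions. A small sketch (the input shape here is arbitrary, chosen only for illustration):

```python
import torch

# GroupNorm(num_groups=8, num_channels=512): per sample, channels are split
# into 8 groups of 64 and normalized over the (64, H, W) elements of each group.
norm = torch.nn.GroupNorm(8, 512)
x = torch.rand(2, 512, 7, 7)
y = norm(x)

# With the default affine init (weight=1, bias=0), every group comes out
# with approximately zero mean and unit variance.
groups = y.reshape(2, 8, -1)
print(groups.mean(dim=-1).abs().max().item())  # close to 0
```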
Benchmark script:
Benchmark result after this PR on a V100 devgpu:
Benchmark result before this PR on a V100 devgpu:
Results from running the same benchmark script on PyTorch 1.5.1 (build options may differ; for reference only):