Fix performance issue of GroupNorm on CUDA when feature map is small. #46170
Conversation
This pull request was exported from Phabricator. Differential Revision: D24242738
That's a very nice performance improvement. It would be good to have the comments in the kernels and describe how you choose which path to use. Please make sure that you are testing all the kernel variants that you add (I'm not sure added tests cover everything). Also, can you enable and test bfloat16 dispatch on cuda? Hopefully it should just work, and we are enabling most bfloat16 operations now.
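A minimal smoke test for the requested bfloat16 dispatch might look like the sketch below. The `check_group_norm_dtype` helper is hypothetical (not part of the PR's test plan); it simply runs a GroupNorm forward and backward pass in a given dtype, with the bfloat16 CUDA path guarded by device availability:

```python
import torch

def check_group_norm_dtype(dtype, device):
    """Hypothetical smoke test: GroupNorm forward + backward in `dtype`."""
    norm = torch.nn.GroupNorm(8, 64).to(device=device, dtype=dtype)
    x = torch.rand(4, 64, 7, 7, device=device, dtype=dtype, requires_grad=True)
    y = norm(x)
    y.backward(torch.randn_like(y))
    return y

# float32 on CPU always works; bfloat16 only where the CUDA dispatch is enabled.
y = check_group_norm_dtype(torch.float32, "cpu")
print(tuple(y.shape))
if torch.cuda.is_available():
    check_group_norm_dtype(torch.bfloat16, "cuda")
```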
Thanks again for the fix!
Codecov Report
```
@@            Coverage Diff            @@
##           master   #46170   +/-   ##
=======================================
  Coverage   68.32%   68.33%
=======================================
  Files         410      410
  Lines       53793    53793
=======================================
+ Hits        36756    36757       +1
+ Misses      17037    17036       -1
```
Continue to review the full report at Codecov.
da53ae3 to 4b6c486 (Compare)
4b6c486 to ac11a3a (Compare)
BFloat16 has been enabled.
💊 CI failures summary and remediations

As of commit 1ccf848 (more details on the Dr. CI page):

💚 💚 Looks good so far! There are no failures yet. 💚 💚

This comment was automatically generated by Dr. CI.
ac11a3a to 62313f6 (Compare)
62313f6 to b5c7bae (Compare)
b5c7bae to e1b4e36 (Compare)
e1b4e36 to 5a2876f (Compare)
5a2876f to 30b2b6a (Compare)
…pytorch#46170)

Summary:
Pull Request resolved: pytorch#46170

Fix performance issue of GroupNorm on CUDA when feature map is small.

Benchmark script:

```
import torch
import torch.nn.functional as F
from timeit import Timer

norm = torch.nn.GroupNorm(8, 512).cuda()
num = 5000
sizes = [(1024, 512, 14, 14), (1024, 512, 7, 7), (1024, 512)]

def forward(x):
    _ = norm(x)
    torch.cuda.synchronize()

def backward(y, grad):
    y.backward(grad, retain_graph=True)
    torch.cuda.synchronize()

if __name__ == "__main__":
    # warm up
    x = torch.rand(*(sizes[0]), dtype=torch.float, device="cuda", requires_grad=True)
    for _ in range(100):
        forward(x)
    for size in sizes:
        x = torch.rand(*size, dtype=torch.float, device="cuda", requires_grad=True)
        t = Timer("forward(x)", "from __main__ import forward, x")
        print(f"size = {size}:")
        t1 = t.timeit(num) / num * 1e6
        print(f"avg_forward_time = {t1}us")
        y = norm(x)
        grad = torch.randn_like(y)
        t = Timer("backward(y, grad)", "from __main__ import backward, y, grad")
        t2 = t.timeit(num) / num * 1e6
        print(f"avg_backward_time = {t2}us")
```

Benchmark result before this Diff:

```
size = (1024, 512, 14, 14):
avg_forward_time = 1636.729855206795us
avg_backward_time = 5488.682465581223us
size = (1024, 512, 7, 7):
avg_forward_time = 465.88476160541177us
avg_backward_time = 3129.9425506033003us
size = (1024, 512):
avg_forward_time = 96.90486900508404us
avg_backward_time = 2319.4099438143894us
```

Benchmark result after this Diff:

```
size = (1024, 512, 14, 14):
avg_forward_time = 1635.6191572034732us
avg_backward_time = 4140.7730475999415us
size = (1024, 512, 7, 7):
avg_forward_time = 463.6513736099005us
avg_backward_time = 1641.7451039887965us
size = (1024, 512):
avg_forward_time = 66.59087920561433us
avg_backward_time = 128.6882139975205us
```

Test Plan: buck test mode/dev-nosan //caffe2/test:nn -- "GroupNorm"

Differential Revision: D24242738

fbshipit-source-id: 56c0b5f381ac96cb539e9f01b8c504337a57cd9c
30b2b6a to 1ccf848 (Compare)
Looks good, thank you!
This pull request has been merged in a87a1c1.
Summary: Fix performance issue of GroupNorm on CUDA when feature map is small.
Test Plan: buck test mode/dev-nosan //caffe2/test:nn -- "GroupNorm"
Differential Revision: D24242738
As mentioned in #46086, the current GroupNorm implementation performs badly on CUDA when the feature map is small, even compared to the BatchNorm-based implementation used before PyTorch 1.5.1. This PR fixes the performance issue for small feature maps.
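For context on what the affected operator computes, `GroupNorm(8, 512)` splits each sample's 512 channels into 8 groups of 64 and normalizes each group over its channels and spatial positions. A small sketch (the input shape here is arbitrary, chosen only for illustration):

```python
import torch

# GroupNorm(num_groups=8, num_channels=512): per sample, channels are split
# into 8 groups of 64 and normalized over the (64, H, W) elements of each group.
norm = torch.nn.GroupNorm(8, 512)
x = torch.rand(2, 512, 7, 7)
y = norm(x)

# With the default affine init (weight=1, bias=0), every group comes out
# with approximately zero mean and unit variance.
groups = y.reshape(2, 8, -1)
print(groups.mean(dim=-1).abs().max().item())  # close to 0
```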
Benchmark script:
Benchmark result after this PR on a V100 devgpu:
Benchmark result before this PR on a V100 devgpu:
Results from running the same benchmark script on PyTorch 1.5.1 (build options may differ; for reference only):