[cuda] vectorized gamma and beta loading in vectorized_layer_norm #107287

Closed
wants to merge 11 commits

Conversation

@valentinandrei (Contributor) commented Aug 16, 2023

Improves the performance of vectorized_layer_norm by vectorizing access to the gamma and beta buffers. This uses 128-bit load instructions, which improves memory bandwidth utilization. The speedup is ~3% on average, and there are no obvious regressions for any problem size.
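For reference, the optimization boils down to reading gamma and beta through a 128-bit vector type, so each thread issues one wide load instead of several scalar loads. The kernel below is only a minimal sketch of that access pattern, not the actual vectorized_layer_norm change; the kernel name, the float4 element width, and the affine-only computation are illustrative assumptions.

// Illustrative sketch of 128-bit (float4) vectorized loads for gamma/beta.
// Not the actual PR diff; names and shapes are hypothetical.
#include <cuda_runtime.h>

__global__ void scale_shift_vectorized(const float* __restrict__ x,
                                       const float* __restrict__ gamma,
                                       const float* __restrict__ beta,
                                       float* __restrict__ y,
                                       int n) {
  // Each thread handles 4 consecutive floats, so every load/store below is a
  // single 16-byte (128-bit) transaction, assuming 16-byte aligned pointers.
  int i = (blockIdx.x * blockDim.x + threadIdx.x) * 4;
  if (i + 3 < n) {
    float4 xv = *reinterpret_cast<const float4*>(x + i);
    float4 gv = *reinterpret_cast<const float4*>(gamma + i);
    float4 bv = *reinterpret_cast<const float4*>(beta + i);
    float4 yv;
    yv.x = xv.x * gv.x + bv.x;
    yv.y = xv.y * gv.y + bv.y;
    yv.z = xv.z * gv.z + bv.z;
    yv.w = xv.w * gv.w + bv.w;
    *reinterpret_cast<float4*>(y + i) = yv;
  }
  // A tail path (not shown) would handle n not divisible by 4; the real kernel
  // presumably reuses PyTorch's aligned vector types and fuses this with the
  // normalization math.
}

The same idea applies to half precision, where a single 128-bit load covers 8 elements instead of 4.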

I used the following script for the correctness and performance tests:

import torch
from torch.utils.benchmark import Compare, Timer

# (batch_size, feature_size) problem sizes used for both correctness and perf tests
l_inputs = [
    (32, 32),
    (64, 32),
    (256, 128),
    (512, 1024),
    (1024, 2048),
    (2048, 2048),
    (4096, 16384),
    (70000, 64),
    (131072, 512),
    (1000, 520),
    (4005, 4005),
    (10000, 1000),
    (1024, 10000),
    (8192, 4096),
    (10000, 10000),
    (3072, 10000),
    (6144, 10000),
    (1024, 20000),
    (1024, 20000),
    (512, 1536),
    (512, 6144),
    (512, 10240),
    (1000, 1000),
    (2000, 2000),
    (10240, 10240),
    (384, 128),
    (2048, 1024),
    (267, 513),
    (67, 123479),
    (1024, 123479),
    (2048, 66679),
    (200, 256),
    (1000, 256),
    (6000, 256),
    (6272, 256),
    (200, 512),
    (1000, 512),
    (6000, 512),
    (6272, 512),
    (200, 1024),
    (1000, 1024),
    (6000, 1024),
    (6272, 1024),
    (200, 2048),
    (1000, 2048),
    (6000, 2048),
    (6272, 2048),
    (200, 3072),
    (1000, 3072),
    (6000, 3072),
    (6272, 3072),
]


def run_model_on_device(fs, X, gO, device_string, numeric_type):
    # Run LayerNorm forward and backward on the given device and return the
    # gradients of the affine parameters (weight, bias).
    ln = torch.nn.LayerNorm((fs,), device=device_string, dtype=numeric_type)
    ln.reset_parameters()
    X.grad = None
    ln.zero_grad(set_to_none=True)
    out = ln(X)
    out.backward(gO)
    return (ln.weight.grad, ln.bias.grad)


def run_correctness_test(eps_weight, eps_bias):
    # Compare CPU and GPU weight/bias gradients element-wise and report the
    # percentage of elements whose absolute difference exceeds the thresholds.
    dtype = torch.float
    for val in l_inputs:
        bs = val[0]
        fs = val[1]
        mean_adjustment = torch.randn(fs, device="cpu", dtype=torch.float)
        X = mean_adjustment * torch.randn(
            bs, fs, device="cpu", dtype=torch.float, requires_grad=True
        )

        X = X.detach().requires_grad_()
        gO = torch.rand_like(X)
        X_gpu = X.to("cuda")
        X_gpu = X_gpu.detach().requires_grad_()
        gO_gpu = gO.to("cuda")
        gO_gpu = gO_gpu.detach().requires_grad_()

        grad_cpu_ref = run_model_on_device(fs, X, gO, "cpu", dtype)
        grad_gpu = run_model_on_device(fs, X_gpu, gO_gpu, "cuda", dtype)
        weight_grad_gpu_target = grad_gpu[0].detach().to("cpu")
        bias_grad_gpu_target = grad_gpu[1].detach().to("cpu")

        weight_delta = torch.abs(grad_cpu_ref[0] - weight_grad_gpu_target)
        weight_mismatches = (weight_delta >= eps_weight).nonzero()
        weight_mismatch_pct = len(weight_mismatches) / len(weight_delta) * 100

        bias_delta = torch.abs(grad_cpu_ref[1] - bias_grad_gpu_target)
        bias_mismatches = (bias_delta >= eps_bias).nonzero()
        bias_mismatch_pct = len(bias_mismatches) / len(bias_delta) * 100

        print(
            "Size ({} x {}) mismatch percentage: weight {:3.2f} bias {:3.2f}".format(
                fs, bs, weight_mismatch_pct, bias_mismatch_pct
            )
        )


# Run the correctness tests
run_correctness_test(0.01, 0.01)

# Run the performance tests. This has to run at module (global) scope because
# the Timer statements look up `ln`, `X`, and `gO` through globals().
results = []
for dtype in (torch.float, torch.half):
    for val in l_inputs:
        bs = val[0]
        fs = val[1]
        ln = torch.nn.LayerNorm((fs,), device="cuda", dtype=dtype)
        X = torch.randn(bs, fs, device="cuda", dtype=dtype, requires_grad=True)
        gO = torch.rand_like(X)
        stmtfwd = "ln(X)"
        stmtfwdbwd = (
            "X.grad=None; ln.zero_grad(set_to_none=True); out = ln(X); out.backward(gO)"
        )
        tfwd = Timer(
            stmt=stmtfwd,
            label="ln",
            sub_label=f"{bs:5}, {fs:5}",
            description=f"fwd, {dtype}",
            globals=globals(),
        )
        tfwdbwd = Timer(
            stmt=stmtfwdbwd,
            label="ln",
            sub_label=f"{bs:5}, {fs:5}",
            description=f"fwdbwd, {dtype}",
            globals=globals(),
        )
        for t in (tfwd, tfwdbwd):
            results.append(t.blocked_autorange())
    print(fs, end="\r")
c = Compare(results)
c.print()

@pytorch-bot added the "release notes: cuda" (release notes category) label on Aug 16, 2023
@pytorch-bot commented Aug 16, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/107287

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 19d91ad with merge base 8298720:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@valentinandrei (Contributor, Author)

cc: @malfet

@malfet (Contributor) commented Aug 16, 2023

@pytorchbot merge

@pytorch-bot added the "ciflow/trunk" (Trigger trunk jobs on your pull request) label on Aug 16, 2023
@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot (Collaborator)

Merge failed

Reason: 1 job failed: trunk / linux-focal-rocm5.6-py3.8 / test (default, 3, 3, linux.rocm.gpu)

Details for Dev Infra team: raised by workflow job.

@vadimkantorov (Contributor) commented Aug 16, 2023

Also found this blog post on optimizing LayerNorm perf: https://oneflow2020.medium.com/how-to-implement-an-efficient-layernorm-cuda-kernel-oneflow-performance-optimization-731e91a285b8 (from the end of 2021)

Not sure if the results still stand, but it showed that FusedLayerNorm from apex was still faster than PyTorch, and that their impl was even faster than apex (so OneFlow's kernel was ~2x faster than PyTorch's). If that is still the case, it might be worth borrowing their impl?

And yep, they are also using vectorized loads for CUDA. Their code is licensed under Apache 2.0 and is available at https://github.com/Oneflow-Inc/oneflow/blob/2d24fe08be1b1bedcc22fb409c5d688924ce89fc/oneflow/user/kernels/layer_norm_gpu_kernel.cu

Related:

@valentinandrei (Contributor, Author)

(Quoting @vadimkantorov's comment above.)

Thanks for the pointer. I'll take a look to see if it's applicable and follow up.

@valentinandrei (Contributor, Author)

@pytorchbot rebase -b main

@pytorchmergebot (Collaborator)

@pytorchbot started a rebase job onto refs/remotes/origin/main. Check the current status here

@pytorchmergebot (Collaborator)

Successfully rebased main onto refs/remotes/origin/main. Please pull locally before adding more changes (for example, via git checkout main && git pull --rebase).

@valentinandrei (Contributor, Author)

@pytorchbot merge

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

@pytorchmergebot (Collaborator)

Merge failed

Reason: 2 jobs failed: trunk / linux-focal-rocm5.6-py3.8 / test (default, 2, 3, linux.rocm.gpu), trunk / linux-focal-rocm5.6-py3.8 / test (default, 3, 3, linux.rocm.gpu)

Details for Dev Infra team: raised by workflow job.

@valentinandrei (Contributor, Author)

@pytorchbot merge

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Labels: ciflow/trunk (Trigger trunk jobs on your pull request), Merged, release notes: cuda (release notes category)

4 participants