Enable addmm + GELU epilogue fusion via cuBLASLt #103811
Conversation
Summary: Previously, addmm + GELU epilogue fusion was unconditionally disabled in `ATen/native/cuda/Blas.cpp` due to compilation and numerical issues in CUDA <= 11.4. This PR:

1. Enables addmm + GELU epilogue fusion for CUDA >= 11.8.
2. Restricts the usage of fused addmm epilogue to contiguous output (bugfix).
3. Extends unit tests with addmm epilogue fusion and GELU activation paths.

Test Plan:

```
$ python test/test_linalg.py -k test_addmm_relu -v
test_addmm_relu_cpu_bfloat16 (__main__.TestLinalgCPU.test_addmm_relu_cpu_bfloat16) ... ok
test_addmm_relu_cpu_float32 (__main__.TestLinalgCPU.test_addmm_relu_cpu_float32) ... ok
test_addmm_relu_cpu_float64 (__main__.TestLinalgCPU.test_addmm_relu_cpu_float64) ... ok
test_addmm_relu_cuda_bfloat16 (__main__.TestLinalgCUDA.test_addmm_relu_cuda_bfloat16) ... ok
test_addmm_relu_cuda_float32 (__main__.TestLinalgCUDA.test_addmm_relu_cuda_float32) ... ok
test_addmm_relu_cuda_float64 (__main__.TestLinalgCUDA.test_addmm_relu_cuda_float64) ... ok

$ python test/test_linalg.py -k test_addmm_gelu -v
test_addmm_gelu_cpu_bfloat16 (__main__.TestLinalgCPU.test_addmm_gelu_cpu_bfloat16) ... ok
test_addmm_gelu_cpu_float32 (__main__.TestLinalgCPU.test_addmm_gelu_cpu_float32) ... ok
test_addmm_gelu_cpu_float64 (__main__.TestLinalgCPU.test_addmm_gelu_cpu_float64) ... ok
test_addmm_gelu_cuda_bfloat16 (__main__.TestLinalgCUDA.test_addmm_gelu_cuda_bfloat16) ... ok
test_addmm_gelu_cuda_float32 (__main__.TestLinalgCUDA.test_addmm_gelu_cuda_float32) ... ok
test_addmm_gelu_cuda_float64 (__main__.TestLinalgCUDA.test_addmm_gelu_cuda_float64) ... ok
```

Reviewers: @eellison

[ghstack-poisoned]
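For context, a minimal sketch of what the op under test computes; the shapes, device, and tolerances below are illustrative assumptions, not taken from the PR:

```python
import torch
import torch.nn.functional as F

# Minimal sketch (shapes, dtype, and tolerances are assumptions). On a
# CUDA >= 11.8 build with a contiguous output, torch._addmm_activation can
# take the fused cuBLASLt epilogue path that this PR enables.
bias = torch.randn(64, device="cuda")
mat1 = torch.randn(32, 128, device="cuda")
mat2 = torch.randn(128, 64, device="cuda")

fused = torch._addmm_activation(bias, mat1, mat2, use_gelu=True)

# Unfused reference: per the discussion below, the fused GELU epilogue uses
# the tanh approximation, so compare against that variant.
ref = F.gelu(torch.addmm(bias, mat1, mat2), approximate="tanh")
torch.testing.assert_close(fused, ref, rtol=1e-4, atol=1e-4)
```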
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/103811
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit f190760.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
/easycla
@aakhundov has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
My approval works only for the `test/test_linalg.py` file if merged with the GH bot.
In the future, we should move these tests to the `test_matmul_cuda.py` file. Other than that, the changes look good!
nice !!
```diff
@@ -287,13 +287,13 @@ Tensor& addmm_out_cuda_impl(Tensor& result, const Tensor& self, const Tensor& ma
       self.const_data_ptr<scalar_t>(),
       result_->data_ptr<scalar_t>(),
       result_ld,
-#if 0
+#if defined(CUDA_VERSION) && CUDA_VERSION >= 11080
```
for anyone looking at this in the future: we're using 11.8 here because that's what our CI has test coverage for
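The same gate can be sketched from the Python side; the helper below is hypothetical, not a PyTorch API:

```python
import torch

# Hypothetical helper mirroring the C++ guard above: the fused GELU epilogue
# is compiled in only when the CUDA toolkit is >= 11.8 (CUDA_VERSION 11080).
def build_allows_gelu_epilogue() -> bool:
    if torch.version.cuda is None:  # CPU-only or ROCm build
        return False
    major, minor = (int(v) for v in torch.version.cuda.split(".")[:2])
    return (major, minor) >= (11, 8)

print(build_allows_gelu_epilogue())
```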
```python
self._test_addmm_impl(torch._addmm_activation, "relu", device, dtype)

@precisionOverride({torch.double: 1e-8, torch.float: 1e-4, torch.bfloat16: 0.6,
                    torch.half: 1e-1, torch.cfloat: 1e-4, torch.cdouble: 1e-8})
```
i'm surprised the torch.half bounds are so high here, but this is a prior issue, no need to fix on your pr
My understanding is that these tests are not run on `torch.half` (only on `float32`, `float64`, and `bfloat16` on CUDA). So maybe it's a copy/paste artifact.
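A sketch of the dtype declaration that would give that coverage; the decorators and dtype lists below are assumptions, not copied from test_linalg.py:

```python
import torch
from torch.testing._internal.common_device_type import dtypes, dtypesIfCUDA

# Pattern sketch only (in practice this sits inside a device-type test class);
# the dtype lists are assumed to match the coverage described above, which
# would leave the torch.half precisionOverride entry unused.
@dtypes(torch.bfloat16, torch.float32, torch.float64)
@dtypesIfCUDA(torch.bfloat16, torch.float32, torch.float64)
def test_addmm_gelu(self, device, dtype):
    self._test_addmm_impl(torch._addmm_activation, "gelu", device, dtype)
```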
@pytorchbot merge
Merge failed. Reason: This PR needs a `release notes:` label. If your changes are user facing and intended to be a part of release notes, please use a label starting with `release notes:`. If not, please add the `topic: not user facing` label. To add a label, you can comment to pytorchbot, for example `@pytorchbot label "topic: not user facing"`. For more information, see the label guidance in the wiki.

Details for Dev Infra team: raised by workflow job.
@pytorchbot label "topic: not user facing"
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
cc @aakhundov I suspect that this change is causing failures on the periodic Windows CUDA job https://hud.pytorch.org/pytorch/pytorch/commit/1c79003b3c13c7bc47e5796e4451d6565121f3a0. Could you help take a look into the issue? Maybe relax the tolerance on Windows if the failure is expected?
@huydhn Thanks for the heads-up! From the error, it seems that the GELU epilogue fusion is not available on Windows (or it doesn't use the tanh approximation). In any case, we should be able to fix this by adding … Btw, the failing Windows CUDA job didn't show up in the CI of the PR. Is this expected? Is there a way to trigger the Windows CUDA job manually in the PR, before merging? Thanks!
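To make the suspected mismatch concrete, a small self-contained comparison of exact vs. tanh-approximate GELU:

```python
import torch
import torch.nn.functional as F

# Exact (erf-based) GELU and the tanh approximation differ slightly, so a
# test asserting the tanh variant fails wherever the fused (tanh) path
# isn't taken, as apparently happens on Windows.
x = torch.randn(1 << 16)
exact = F.gelu(x, approximate="none")
approx = F.gelu(x, approximate="tanh")
print((exact - approx).abs().max())  # small but nonzero
```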
Yes, you can add the `ciflow/periodic` label to trigger it. I have added it to this PR.
Summary: This PR fixes the wrong assertion in `test_addmm_gelu` that failed in the Windows CUDA CI job after #103811. The addmm + GELU fusion is likely not happening (or not using the tanh approximation) on Windows. See [this comment](#103811 (comment)) in #103811 for the details of the error.

Test Plan:

```
$ python test/test_linalg.py -k test_addmm_relu -v
test_addmm_relu_cpu_bfloat16 (__main__.TestLinalgCPU.test_addmm_relu_cpu_bfloat16) ... ok
test_addmm_relu_cpu_float32 (__main__.TestLinalgCPU.test_addmm_relu_cpu_float32) ... ok
test_addmm_relu_cpu_float64 (__main__.TestLinalgCPU.test_addmm_relu_cpu_float64) ... ok
test_addmm_relu_cuda_bfloat16 (__main__.TestLinalgCUDA.test_addmm_relu_cuda_bfloat16) ... ok
test_addmm_relu_cuda_float32 (__main__.TestLinalgCUDA.test_addmm_relu_cuda_float32) ... ok
test_addmm_relu_cuda_float64 (__main__.TestLinalgCUDA.test_addmm_relu_cuda_float64) ... ok

----------------------------------------------------------------------
Ran 6 tests in 2.131s

OK

$ python test/test_linalg.py -k test_addmm_gelu -v
test_addmm_gelu_cpu_bfloat16 (__main__.TestLinalgCPU.test_addmm_gelu_cpu_bfloat16) ... ok
test_addmm_gelu_cpu_float32 (__main__.TestLinalgCPU.test_addmm_gelu_cpu_float32) ... ok
test_addmm_gelu_cpu_float64 (__main__.TestLinalgCPU.test_addmm_gelu_cpu_float64) ... ok
test_addmm_gelu_cuda_bfloat16 (__main__.TestLinalgCUDA.test_addmm_gelu_cuda_bfloat16) ... ok
test_addmm_gelu_cuda_float32 (__main__.TestLinalgCUDA.test_addmm_gelu_cuda_float32) ... ok
test_addmm_gelu_cuda_float64 (__main__.TestLinalgCUDA.test_addmm_gelu_cuda_float64) ... ok

----------------------------------------------------------------------
Ran 6 tests in 2.194s

OK
```

Reviewers: @eellison @huydhn

Differential Revision: [D46931688](https://our.internmc.facebook.com/intern/diff/D46931688)

Pull Request resolved: #104031
Approved by: https://github.com/huydhn, https://github.com/malfet
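A hedged sketch of the shape of such a fix (not the actual test code): accept whichever GELU variant the platform's path produced.

```python
import torch
import torch.nn.functional as F

# Sketch under assumptions (helper name and tolerances are mine): treat both
# the exact and the tanh-approximate GELU as acceptable references, since the
# fused (tanh) epilogue may or may not be taken depending on the platform.
def check_addmm_gelu(out, bias, mat1, mat2, rtol=1e-4, atol=1e-4):
    base = torch.addmm(bias, mat1, mat2)
    refs = (F.gelu(base, approximate="tanh"), F.gelu(base, approximate="none"))
    assert any(torch.allclose(out, r, rtol=rtol, atol=atol) for r in refs)
```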
Stack from ghstack (oldest at bottom):
Reviewers: @eellison
Differential Revision: D46829884
cc @ptrblck @csarofeen @xwang233