Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable addmm + GELU epilogue fusion via cuBLASLt #103811

Closed
wants to merge 4 commits into from

Conversation

aakhundov
Copy link
Contributor

@aakhundov aakhundov commented Jun 17, 2023

Stack from ghstack (oldest at bottom):

Summary:

Previously, addmm + GELU epilogue fusion was unconditionally disabled in ATen/native/cuda/Blas.cpp due to compilation and numerical issues in CUDA <= 11.4. This PR:

  1. Enables addmm + GELU epilogue fusion for CUDA >= 11.8.

  2. Restricts the usage of fused addmm epilogue to contiguous output (bugfix).

  3. Extends unit tests with addmm epilogue fusion and GELU activation paths.

Test Plan:

$ python test/test_linalg.py -k test_addmm_relu -v

test_addmm_relu_cpu_bfloat16 (main.TestLinalgCPU.test_addmm_relu_cpu_bfloat16) ... ok
test_addmm_relu_cpu_float32 (main.TestLinalgCPU.test_addmm_relu_cpu_float32) ... ok
test_addmm_relu_cpu_float64 (main.TestLinalgCPU.test_addmm_relu_cpu_float64) ... ok
test_addmm_relu_cuda_bfloat16 (main.TestLinalgCUDA.test_addmm_relu_cuda_bfloat16) ... ok
test_addmm_relu_cuda_float32 (main.TestLinalgCUDA.test_addmm_relu_cuda_float32) ... ok
test_addmm_relu_cuda_float64 (main.TestLinalgCUDA.test_addmm_relu_cuda_float64) ... ok

$ python test/test_linalg.py -k test_addmm_gelu -v

test_addmm_gelu_cpu_bfloat16 (main.TestLinalgCPU.test_addmm_gelu_cpu_bfloat16) ... ok
test_addmm_gelu_cpu_float32 (main.TestLinalgCPU.test_addmm_gelu_cpu_float32) ... ok
test_addmm_gelu_cpu_float64 (main.TestLinalgCPU.test_addmm_gelu_cpu_float64) ... ok
test_addmm_gelu_cuda_bfloat16 (main.TestLinalgCUDA.test_addmm_gelu_cuda_bfloat16) ... ok
test_addmm_gelu_cuda_float32 (main.TestLinalgCUDA.test_addmm_gelu_cuda_float32) ... ok
test_addmm_gelu_cuda_float64 (main.TestLinalgCUDA.test_addmm_gelu_cuda_float64) ... ok

Reviewers: @eellison

Differential Revision: D46829884

cc @ptrblck @csarofeen @xwang233

Summary:

Previously, addmm + GELU epilogue fusion was unconditionally disabled in `ATen/native/cuda/Blas.cpp` due to compilation and numerical issues in CUDA <= 11.4. This PR:

1. Enables addmm + GELU epilogue fusion for CUDA >= 11.8.

2. Restricts the usage of fused addmm epilogue to contiguous output (bugfix).

3. Extends unit tests with addmm epilogue fusion and GELU activation paths.

Test Plan:

$ python test/test_linalg.py -k test_addmm_relu -v

test_addmm_relu_cpu_bfloat16 (__main__.TestLinalgCPU.test_addmm_relu_cpu_bfloat16) ... ok
test_addmm_relu_cpu_float32 (__main__.TestLinalgCPU.test_addmm_relu_cpu_float32) ... ok
test_addmm_relu_cpu_float64 (__main__.TestLinalgCPU.test_addmm_relu_cpu_float64) ... ok
test_addmm_relu_cuda_bfloat16 (__main__.TestLinalgCUDA.test_addmm_relu_cuda_bfloat16) ... ok
test_addmm_relu_cuda_float32 (__main__.TestLinalgCUDA.test_addmm_relu_cuda_float32) ... ok
test_addmm_relu_cuda_float64 (__main__.TestLinalgCUDA.test_addmm_relu_cuda_float64) ... ok

$ python test/test_linalg.py -k test_addmm_gelu -v

test_addmm_gelu_cpu_bfloat16 (__main__.TestLinalgCPU.test_addmm_gelu_cpu_bfloat16) ... ok
test_addmm_gelu_cpu_float32 (__main__.TestLinalgCPU.test_addmm_gelu_cpu_float32) ... ok
test_addmm_gelu_cpu_float64 (__main__.TestLinalgCPU.test_addmm_gelu_cpu_float64) ... ok
test_addmm_gelu_cuda_bfloat16 (__main__.TestLinalgCUDA.test_addmm_gelu_cuda_bfloat16) ... ok
test_addmm_gelu_cuda_float32 (__main__.TestLinalgCUDA.test_addmm_gelu_cuda_float32) ... ok
test_addmm_gelu_cuda_float64 (__main__.TestLinalgCUDA.test_addmm_gelu_cuda_float64) ... ok

Reviewers: @eellison

[ghstack-poisoned]
@pytorch-bot
Copy link

pytorch-bot bot commented Jun 17, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/103811

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit f190760:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@linux-foundation-easycla
Copy link

CLA Missing ID CLA Not Signed

@pytorch-bot pytorch-bot bot added release notes: linalg_frontend release notes category labels Jun 17, 2023
aakhundov added a commit that referenced this pull request Jun 17, 2023
Summary:

Previously, addmm + GELU epilogue fusion was unconditionally disabled in `ATen/native/cuda/Blas.cpp` due to compilation and numerical issues in CUDA <= 11.4. This PR:

1. Enables addmm + GELU epilogue fusion for CUDA >= 11.8.

2. Restricts the usage of fused addmm epilogue to contiguous output (bugfix).

3. Extends unit tests with addmm epilogue fusion and GELU activation paths.

Test Plan:

$ python test/test_linalg.py -k test_addmm_relu -v

test_addmm_relu_cpu_bfloat16 (__main__.TestLinalgCPU.test_addmm_relu_cpu_bfloat16) ... ok
test_addmm_relu_cpu_float32 (__main__.TestLinalgCPU.test_addmm_relu_cpu_float32) ... ok
test_addmm_relu_cpu_float64 (__main__.TestLinalgCPU.test_addmm_relu_cpu_float64) ... ok
test_addmm_relu_cuda_bfloat16 (__main__.TestLinalgCUDA.test_addmm_relu_cuda_bfloat16) ... ok
test_addmm_relu_cuda_float32 (__main__.TestLinalgCUDA.test_addmm_relu_cuda_float32) ... ok
test_addmm_relu_cuda_float64 (__main__.TestLinalgCUDA.test_addmm_relu_cuda_float64) ... ok

$ python test/test_linalg.py -k test_addmm_gelu -v

test_addmm_gelu_cpu_bfloat16 (__main__.TestLinalgCPU.test_addmm_gelu_cpu_bfloat16) ... ok
test_addmm_gelu_cpu_float32 (__main__.TestLinalgCPU.test_addmm_gelu_cpu_float32) ... ok
test_addmm_gelu_cpu_float64 (__main__.TestLinalgCPU.test_addmm_gelu_cpu_float64) ... ok
test_addmm_gelu_cuda_bfloat16 (__main__.TestLinalgCUDA.test_addmm_gelu_cuda_bfloat16) ... ok
test_addmm_gelu_cuda_float32 (__main__.TestLinalgCUDA.test_addmm_gelu_cuda_float32) ... ok
test_addmm_gelu_cuda_float64 (__main__.TestLinalgCUDA.test_addmm_gelu_cuda_float64) ... ok

Reviewers: eellison

ghstack-source-id: 99790435e54d75d47b53f2d0807c040769c06580
Pull Request resolved: #103811
@linux-foundation-easycla
Copy link

CLA Missing ID CLA Not Signed

@aakhundov aakhundov requested a review from eellison June 17, 2023 19:26
@aakhundov
Copy link
Contributor Author

/easycla

@lezcano lezcano removed their request for review June 18, 2023 08:37
@aakhundov aakhundov self-assigned this Jun 18, 2023
@aakhundov
Copy link
Contributor Author

@aakhundov has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@IvanYashchuk IvanYashchuk removed their request for review June 19, 2023 09:10
@IvanYashchuk IvanYashchuk added module: cuda Related to torch.cuda, and CUDA support in general module: cublas Problem related to cublas support ciflow/trunk Trigger trunk jobs on your pull request and removed release notes: linalg_frontend release notes category labels Jun 19, 2023
Copy link
Collaborator

@IvanYashchuk IvanYashchuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My approval works only for the test/test_linalg.py file if merged with the GH bot.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the future, we should move these tests to the test_matmul_cuda.py file. Other than that changes look good!

Summary:

Previously, addmm + GELU epilogue fusion was unconditionally disabled in `ATen/native/cuda/Blas.cpp` due to compilation and numerical issues in CUDA <= 11.4. This PR:

1. Enables addmm + GELU epilogue fusion for CUDA >= 11.8.

2. Restricts the usage of fused addmm epilogue to contiguous output (bugfix).

3. Extends unit tests with addmm epilogue fusion and GELU activation paths.

Test Plan:

$ python test/test_linalg.py -k test_addmm_relu -v

test_addmm_relu_cpu_bfloat16 (__main__.TestLinalgCPU.test_addmm_relu_cpu_bfloat16) ... ok
test_addmm_relu_cpu_float32 (__main__.TestLinalgCPU.test_addmm_relu_cpu_float32) ... ok
test_addmm_relu_cpu_float64 (__main__.TestLinalgCPU.test_addmm_relu_cpu_float64) ... ok
test_addmm_relu_cuda_bfloat16 (__main__.TestLinalgCUDA.test_addmm_relu_cuda_bfloat16) ... ok
test_addmm_relu_cuda_float32 (__main__.TestLinalgCUDA.test_addmm_relu_cuda_float32) ... ok
test_addmm_relu_cuda_float64 (__main__.TestLinalgCUDA.test_addmm_relu_cuda_float64) ... ok

$ python test/test_linalg.py -k test_addmm_gelu -v

test_addmm_gelu_cpu_bfloat16 (__main__.TestLinalgCPU.test_addmm_gelu_cpu_bfloat16) ... ok
test_addmm_gelu_cpu_float32 (__main__.TestLinalgCPU.test_addmm_gelu_cpu_float32) ... ok
test_addmm_gelu_cpu_float64 (__main__.TestLinalgCPU.test_addmm_gelu_cpu_float64) ... ok
test_addmm_gelu_cuda_bfloat16 (__main__.TestLinalgCUDA.test_addmm_gelu_cuda_bfloat16) ... ok
test_addmm_gelu_cuda_float32 (__main__.TestLinalgCUDA.test_addmm_gelu_cuda_float32) ... ok
test_addmm_gelu_cuda_float64 (__main__.TestLinalgCUDA.test_addmm_gelu_cuda_float64) ... ok

Reviewers: eellison

Differential Revision: [D46829884](https://our.internmc.facebook.com/intern/diff/D46829884)

cc ptrblck csarofeen xwang233

[ghstack-poisoned]
aakhundov added a commit that referenced this pull request Jun 19, 2023
Summary:

Previously, addmm + GELU epilogue fusion was unconditionally disabled in `ATen/native/cuda/Blas.cpp` due to compilation and numerical issues in CUDA <= 11.4. This PR:

1. Enables addmm + GELU epilogue fusion for CUDA >= 11.8.

2. Restricts the usage of fused addmm epilogue to contiguous output (bugfix).

3. Extends unit tests with addmm epilogue fusion and GELU activation paths.

Test Plan:

$ python test/test_linalg.py -k test_addmm_relu -v

test_addmm_relu_cpu_bfloat16 (__main__.TestLinalgCPU.test_addmm_relu_cpu_bfloat16) ... ok
test_addmm_relu_cpu_float32 (__main__.TestLinalgCPU.test_addmm_relu_cpu_float32) ... ok
test_addmm_relu_cpu_float64 (__main__.TestLinalgCPU.test_addmm_relu_cpu_float64) ... ok
test_addmm_relu_cuda_bfloat16 (__main__.TestLinalgCUDA.test_addmm_relu_cuda_bfloat16) ... ok
test_addmm_relu_cuda_float32 (__main__.TestLinalgCUDA.test_addmm_relu_cuda_float32) ... ok
test_addmm_relu_cuda_float64 (__main__.TestLinalgCUDA.test_addmm_relu_cuda_float64) ... ok

$ python test/test_linalg.py -k test_addmm_gelu -v

test_addmm_gelu_cpu_bfloat16 (__main__.TestLinalgCPU.test_addmm_gelu_cpu_bfloat16) ... ok
test_addmm_gelu_cpu_float32 (__main__.TestLinalgCPU.test_addmm_gelu_cpu_float32) ... ok
test_addmm_gelu_cpu_float64 (__main__.TestLinalgCPU.test_addmm_gelu_cpu_float64) ... ok
test_addmm_gelu_cuda_bfloat16 (__main__.TestLinalgCUDA.test_addmm_gelu_cuda_bfloat16) ... ok
test_addmm_gelu_cuda_float32 (__main__.TestLinalgCUDA.test_addmm_gelu_cuda_float32) ... ok
test_addmm_gelu_cuda_float64 (__main__.TestLinalgCUDA.test_addmm_gelu_cuda_float64) ... ok

Reviewers: eellison

ghstack-source-id: fde142e9168e2d10135dfb988842a35a821147ef
Pull Request resolved: #103811
@aakhundov
Copy link
Contributor Author

@aakhundov has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Summary:

Previously, addmm + GELU epilogue fusion was unconditionally disabled in `ATen/native/cuda/Blas.cpp` due to compilation and numerical issues in CUDA <= 11.4. This PR:

1. Enables addmm + GELU epilogue fusion for CUDA >= 11.8.

2. Restricts the usage of fused addmm epilogue to contiguous output (bugfix).

3. Extends unit tests with addmm epilogue fusion and GELU activation paths.

Test Plan:

$ python test/test_linalg.py -k test_addmm_relu -v

test_addmm_relu_cpu_bfloat16 (__main__.TestLinalgCPU.test_addmm_relu_cpu_bfloat16) ... ok
test_addmm_relu_cpu_float32 (__main__.TestLinalgCPU.test_addmm_relu_cpu_float32) ... ok
test_addmm_relu_cpu_float64 (__main__.TestLinalgCPU.test_addmm_relu_cpu_float64) ... ok
test_addmm_relu_cuda_bfloat16 (__main__.TestLinalgCUDA.test_addmm_relu_cuda_bfloat16) ... ok
test_addmm_relu_cuda_float32 (__main__.TestLinalgCUDA.test_addmm_relu_cuda_float32) ... ok
test_addmm_relu_cuda_float64 (__main__.TestLinalgCUDA.test_addmm_relu_cuda_float64) ... ok

$ python test/test_linalg.py -k test_addmm_gelu -v

test_addmm_gelu_cpu_bfloat16 (__main__.TestLinalgCPU.test_addmm_gelu_cpu_bfloat16) ... ok
test_addmm_gelu_cpu_float32 (__main__.TestLinalgCPU.test_addmm_gelu_cpu_float32) ... ok
test_addmm_gelu_cpu_float64 (__main__.TestLinalgCPU.test_addmm_gelu_cpu_float64) ... ok
test_addmm_gelu_cuda_bfloat16 (__main__.TestLinalgCUDA.test_addmm_gelu_cuda_bfloat16) ... ok
test_addmm_gelu_cuda_float32 (__main__.TestLinalgCUDA.test_addmm_gelu_cuda_float32) ... ok
test_addmm_gelu_cuda_float64 (__main__.TestLinalgCUDA.test_addmm_gelu_cuda_float64) ... ok

Reviewers: eellison

Differential Revision: [D46829884](https://our.internmc.facebook.com/intern/diff/D46829884)

cc ptrblck csarofeen xwang233

[ghstack-poisoned]
aakhundov added a commit that referenced this pull request Jun 19, 2023
Summary:

Previously, addmm + GELU epilogue fusion was unconditionally disabled in `ATen/native/cuda/Blas.cpp` due to compilation and numerical issues in CUDA <= 11.4. This PR:

1. Enables addmm + GELU epilogue fusion for CUDA >= 11.8.

2. Restricts the usage of fused addmm epilogue to contiguous output (bugfix).

3. Extends unit tests with addmm epilogue fusion and GELU activation paths.

Test Plan:

$ python test/test_linalg.py -k test_addmm_relu -v

test_addmm_relu_cpu_bfloat16 (__main__.TestLinalgCPU.test_addmm_relu_cpu_bfloat16) ... ok
test_addmm_relu_cpu_float32 (__main__.TestLinalgCPU.test_addmm_relu_cpu_float32) ... ok
test_addmm_relu_cpu_float64 (__main__.TestLinalgCPU.test_addmm_relu_cpu_float64) ... ok
test_addmm_relu_cuda_bfloat16 (__main__.TestLinalgCUDA.test_addmm_relu_cuda_bfloat16) ... ok
test_addmm_relu_cuda_float32 (__main__.TestLinalgCUDA.test_addmm_relu_cuda_float32) ... ok
test_addmm_relu_cuda_float64 (__main__.TestLinalgCUDA.test_addmm_relu_cuda_float64) ... ok

$ python test/test_linalg.py -k test_addmm_gelu -v

test_addmm_gelu_cpu_bfloat16 (__main__.TestLinalgCPU.test_addmm_gelu_cpu_bfloat16) ... ok
test_addmm_gelu_cpu_float32 (__main__.TestLinalgCPU.test_addmm_gelu_cpu_float32) ... ok
test_addmm_gelu_cpu_float64 (__main__.TestLinalgCPU.test_addmm_gelu_cpu_float64) ... ok
test_addmm_gelu_cuda_bfloat16 (__main__.TestLinalgCUDA.test_addmm_gelu_cuda_bfloat16) ... ok
test_addmm_gelu_cuda_float32 (__main__.TestLinalgCUDA.test_addmm_gelu_cuda_float32) ... ok
test_addmm_gelu_cuda_float64 (__main__.TestLinalgCUDA.test_addmm_gelu_cuda_float64) ... ok

Reviewers: eellison

ghstack-source-id: 7318435ee0190cc35bd9d25a2acbf7f471f575d0
Pull Request resolved: #103811
@aakhundov
Copy link
Contributor Author

@aakhundov has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Summary:

Previously, addmm + GELU epilogue fusion was unconditionally disabled in `ATen/native/cuda/Blas.cpp` due to compilation and numerical issues in CUDA <= 11.4. This PR:

1. Enables addmm + GELU epilogue fusion for CUDA >= 11.8.

2. Restricts the usage of fused addmm epilogue to contiguous output (bugfix).

3. Extends unit tests with addmm epilogue fusion and GELU activation paths.

Test Plan:

$ python test/test_linalg.py -k test_addmm_relu -v

test_addmm_relu_cpu_bfloat16 (__main__.TestLinalgCPU.test_addmm_relu_cpu_bfloat16) ... ok
test_addmm_relu_cpu_float32 (__main__.TestLinalgCPU.test_addmm_relu_cpu_float32) ... ok
test_addmm_relu_cpu_float64 (__main__.TestLinalgCPU.test_addmm_relu_cpu_float64) ... ok
test_addmm_relu_cuda_bfloat16 (__main__.TestLinalgCUDA.test_addmm_relu_cuda_bfloat16) ... ok
test_addmm_relu_cuda_float32 (__main__.TestLinalgCUDA.test_addmm_relu_cuda_float32) ... ok
test_addmm_relu_cuda_float64 (__main__.TestLinalgCUDA.test_addmm_relu_cuda_float64) ... ok

$ python test/test_linalg.py -k test_addmm_gelu -v

test_addmm_gelu_cpu_bfloat16 (__main__.TestLinalgCPU.test_addmm_gelu_cpu_bfloat16) ... ok
test_addmm_gelu_cpu_float32 (__main__.TestLinalgCPU.test_addmm_gelu_cpu_float32) ... ok
test_addmm_gelu_cpu_float64 (__main__.TestLinalgCPU.test_addmm_gelu_cpu_float64) ... ok
test_addmm_gelu_cuda_bfloat16 (__main__.TestLinalgCUDA.test_addmm_gelu_cuda_bfloat16) ... ok
test_addmm_gelu_cuda_float32 (__main__.TestLinalgCUDA.test_addmm_gelu_cuda_float32) ... ok
test_addmm_gelu_cuda_float64 (__main__.TestLinalgCUDA.test_addmm_gelu_cuda_float64) ... ok

Reviewers: eellison

Differential Revision: [D46829884](https://our.internmc.facebook.com/intern/diff/D46829884)

cc ptrblck csarofeen xwang233

[ghstack-poisoned]
aakhundov added a commit that referenced this pull request Jun 19, 2023
Summary:

Previously, addmm + GELU epilogue fusion was unconditionally disabled in `ATen/native/cuda/Blas.cpp` due to compilation and numerical issues in CUDA <= 11.4. This PR:

1. Enables addmm + GELU epilogue fusion for CUDA >= 11.8.

2. Restricts the usage of fused addmm epilogue to contiguous output (bugfix).

3. Extends unit tests with addmm epilogue fusion and GELU activation paths.

Test Plan:

$ python test/test_linalg.py -k test_addmm_relu -v

test_addmm_relu_cpu_bfloat16 (__main__.TestLinalgCPU.test_addmm_relu_cpu_bfloat16) ... ok
test_addmm_relu_cpu_float32 (__main__.TestLinalgCPU.test_addmm_relu_cpu_float32) ... ok
test_addmm_relu_cpu_float64 (__main__.TestLinalgCPU.test_addmm_relu_cpu_float64) ... ok
test_addmm_relu_cuda_bfloat16 (__main__.TestLinalgCUDA.test_addmm_relu_cuda_bfloat16) ... ok
test_addmm_relu_cuda_float32 (__main__.TestLinalgCUDA.test_addmm_relu_cuda_float32) ... ok
test_addmm_relu_cuda_float64 (__main__.TestLinalgCUDA.test_addmm_relu_cuda_float64) ... ok

$ python test/test_linalg.py -k test_addmm_gelu -v

test_addmm_gelu_cpu_bfloat16 (__main__.TestLinalgCPU.test_addmm_gelu_cpu_bfloat16) ... ok
test_addmm_gelu_cpu_float32 (__main__.TestLinalgCPU.test_addmm_gelu_cpu_float32) ... ok
test_addmm_gelu_cpu_float64 (__main__.TestLinalgCPU.test_addmm_gelu_cpu_float64) ... ok
test_addmm_gelu_cuda_bfloat16 (__main__.TestLinalgCUDA.test_addmm_gelu_cuda_bfloat16) ... ok
test_addmm_gelu_cuda_float32 (__main__.TestLinalgCUDA.test_addmm_gelu_cuda_float32) ... ok
test_addmm_gelu_cuda_float64 (__main__.TestLinalgCUDA.test_addmm_gelu_cuda_float64) ... ok

Reviewers: eellison

ghstack-source-id: 1701375073854035fc7ebed58d4f4066b8c4db86
Pull Request resolved: #103811
@aakhundov
Copy link
Contributor Author

@aakhundov has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Copy link
Contributor

@eellison eellison left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice !!

@@ -287,13 +287,13 @@ Tensor& addmm_out_cuda_impl(Tensor& result, const Tensor& self, const Tensor& ma
self.const_data_ptr<scalar_t>(),
result_->data_ptr<scalar_t>(),
result_ld,
#if 0
#if defined(CUDA_VERSION) && CUDA_VERSION >= 11080
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for anyone looking at this in the future: we're using 11.8 here because that's what our CI has test coverage for

self._test_addmm_impl(torch._addmm_activation, "relu", device, dtype)

@precisionOverride({torch.double: 1e-8, torch.float: 1e-4, torch.bfloat16: 0.6,
torch.half: 1e-1, torch.cfloat: 1e-4, torch.cdouble: 1e-8})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm surprised the torch.half bounds are so high here, but this is a prior issue, no need to fix on your pr

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is that these tests are not run on torch.half (only on float32, float64, and bfloat16 on CUDA). So maybe it's a copy / paste artifact.

@aakhundov
Copy link
Contributor Author

@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

Merge failed

Reason: This PR needs a release notes: label
If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Details for Dev Infra team Raised by workflow job

@aakhundov
Copy link
Contributor Author

@pytorchbot label "topic: not user facing"

@pytorch-bot pytorch-bot bot added the topic: not user facing topic category label Jun 21, 2023
@aakhundov
Copy link
Contributor Author

@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@huydhn
Copy link
Contributor

huydhn commented Jun 22, 2023

cc @aakhundov I suspect that this change is causing failures on periodic Windows CUDA job https://hud.pytorch.org/pytorch/pytorch/commit/1c79003b3c13c7bc47e5796e4451d6565121f3a0. Could you help take a look into the issue? May be relax the tolerance on Windows if the failure is expected?

2023-06-22T01:54:28.9232875Z =================================== RERUNS ====================================
2023-06-22T01:54:28.9233147Z _________________ TestLinalgCUDA.test_addmm_gelu_cuda_float32 _________________
2023-06-22T01:54:28.9233410Z Traceback (most recent call last):
2023-06-22T01:54:28.9233730Z   File "C:\actions-runner\_work\pytorch\pytorch\test\test_linalg.py", line 5582, in test_addmm_gelu
2023-06-22T01:54:28.9234092Z     self._test_addmm_impl(torch._addmm_activation, "gelu", device, dtype)
2023-06-22T01:54:28.9234464Z   File "C:\actions-runner\_work\pytorch\pytorch\test\test_linalg.py", line 5526, in _test_addmm_impl
2023-06-22T01:54:28.9234830Z     self._test_addmm_addmv(func, V, m1, m2, beta=1, activation=activation)
2023-06-22T01:54:28.9235268Z   File "C:\actions-runner\_work\pytorch\pytorch\test\test_linalg.py", line 5447, in _test_addmm_addmv
2023-06-22T01:54:28.9235818Z     self.assertEqual(res1, res3)
2023-06-22T01:54:28.9236206Z   File "C:\actions-runner\_work\pytorch\pytorch\build\win_tmp\build\torch\testing\_internal\common_utils.py", line 3100, in assertEqual
2023-06-22T01:54:28.9236552Z     raise error_metas[0].to_error(
2023-06-22T01:54:28.9236787Z AssertionError: Tensor-likes are not close!
2023-06-22T01:54:28.9236938Z 
2023-06-22T01:54:28.9237027Z Mismatched elements: 67 / 250 (26.8%)
2023-06-22T01:54:28.9237294Z Greatest absolute difference: 0.00047320593148469925 at index (2, 5) (up to 0.0001 allowed)
2023-06-22T01:54:28.9237609Z Greatest relative difference: 0.3691028356552124 at index (6, 22) (up to 1.3e-06 allowed)

@aakhundov
Copy link
Contributor Author

@huydhn Thanks for the heads-up! From the error, it seems that the GELU epilogue fusion is not available on Windows (or it doesn't use tanh approximation). In any case, we should be able to fix this by adding not IS_LINUX at this line: https://github.com/pytorch/pytorch/blob/main/test/test_linalg.py#L5419 . I'll send a PR shortly.

Btw, the failing Windows CUDA job didn't show up in the CI of the PR. Is this expected? Is there a way to trigger the Windows CUDA job manually in the PR, before merging? Thanks!

aakhundov added a commit that referenced this pull request Jun 22, 2023
Summary:

This PR fixes the wrong assertion in the `test_addmm_gelu` happening in the Windows CUDA CI job caused by #103811. The addmm + GELU fusion is likely not happening (or not using the tanh approximation) on Widnows. See [this comment](#103811 (comment)) in the #103811 for the details of the error.

Test Plan:

```
$ python test/test_linalg.py -k test_addmm_relu -v
test_addmm_relu_cpu_bfloat16 (__main__.TestLinalgCPU.test_addmm_relu_cpu_bfloat16) ... ok
test_addmm_relu_cpu_float32 (__main__.TestLinalgCPU.test_addmm_relu_cpu_float32) ... ok
test_addmm_relu_cpu_float64 (__main__.TestLinalgCPU.test_addmm_relu_cpu_float64) ... ok
test_addmm_relu_cuda_bfloat16 (__main__.TestLinalgCUDA.test_addmm_relu_cuda_bfloat16) ... ok
test_addmm_relu_cuda_float32 (__main__.TestLinalgCUDA.test_addmm_relu_cuda_float32) ... ok
test_addmm_relu_cuda_float64 (__main__.TestLinalgCUDA.test_addmm_relu_cuda_float64) ... ok

----------------------------------------------------------------------
Ran 6 tests in 2.131s

OK

$ python test/test_linalg.py -k test_addmm_gelu -v
test_addmm_gelu_cpu_bfloat16 (__main__.TestLinalgCPU.test_addmm_gelu_cpu_bfloat16) ... ok
test_addmm_gelu_cpu_float32 (__main__.TestLinalgCPU.test_addmm_gelu_cpu_float32) ... ok
test_addmm_gelu_cpu_float64 (__main__.TestLinalgCPU.test_addmm_gelu_cpu_float64) ... ok
test_addmm_gelu_cuda_bfloat16 (__main__.TestLinalgCUDA.test_addmm_gelu_cuda_bfloat16) ... ok
test_addmm_gelu_cuda_float32 (__main__.TestLinalgCUDA.test_addmm_gelu_cuda_float32) ... ok
test_addmm_gelu_cuda_float64 (__main__.TestLinalgCUDA.test_addmm_gelu_cuda_float64) ... ok

----------------------------------------------------------------------
Ran 6 tests in 2.194s

OK
```

Reviewers: @eellison @huydhn

Subscribers:

Tasks:

Tags:

[ghstack-poisoned]
aakhundov added a commit that referenced this pull request Jun 22, 2023
Summary:

This PR fixes the wrong assertion in the `test_addmm_gelu` happening in the Windows CUDA CI job caused by #103811. The addmm + GELU fusion is likely not happening (or not using the tanh approximation) on Widnows. See [this comment](#103811 (comment)) in the #103811 for the details of the error.

Test Plan:

```
$ python test/test_linalg.py -k test_addmm_relu -v
test_addmm_relu_cpu_bfloat16 (__main__.TestLinalgCPU.test_addmm_relu_cpu_bfloat16) ... ok
test_addmm_relu_cpu_float32 (__main__.TestLinalgCPU.test_addmm_relu_cpu_float32) ... ok
test_addmm_relu_cpu_float64 (__main__.TestLinalgCPU.test_addmm_relu_cpu_float64) ... ok
test_addmm_relu_cuda_bfloat16 (__main__.TestLinalgCUDA.test_addmm_relu_cuda_bfloat16) ... ok
test_addmm_relu_cuda_float32 (__main__.TestLinalgCUDA.test_addmm_relu_cuda_float32) ... ok
test_addmm_relu_cuda_float64 (__main__.TestLinalgCUDA.test_addmm_relu_cuda_float64) ... ok

----------------------------------------------------------------------
Ran 6 tests in 2.131s

OK

$ python test/test_linalg.py -k test_addmm_gelu -v
test_addmm_gelu_cpu_bfloat16 (__main__.TestLinalgCPU.test_addmm_gelu_cpu_bfloat16) ... ok
test_addmm_gelu_cpu_float32 (__main__.TestLinalgCPU.test_addmm_gelu_cpu_float32) ... ok
test_addmm_gelu_cpu_float64 (__main__.TestLinalgCPU.test_addmm_gelu_cpu_float64) ... ok
test_addmm_gelu_cuda_bfloat16 (__main__.TestLinalgCUDA.test_addmm_gelu_cuda_bfloat16) ... ok
test_addmm_gelu_cuda_float32 (__main__.TestLinalgCUDA.test_addmm_gelu_cuda_float32) ... ok
test_addmm_gelu_cuda_float64 (__main__.TestLinalgCUDA.test_addmm_gelu_cuda_float64) ... ok

----------------------------------------------------------------------
Ran 6 tests in 2.194s

OK
```

Reviewers: eellison huydhn

Subscribers:

Tasks:

Tags:

ghstack-source-id: 992faa201a4269bde3859df462cf00761cf8666f
Pull Request resolved: #104031
@aakhundov
Copy link
Contributor Author

@huydhn Submitted a fix in #104031. Would be great if we could run the Windows CUDA job on this PR to make sure it remedies the issue. Thanks!

@aakhundov
Copy link
Contributor Author

@aakhundov has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@huydhn
Copy link
Contributor

huydhn commented Jun 22, 2023

Btw, the failing Windows CUDA job didn't show up in the CI of the PR. Is this expected? Is there a way to trigger the Windows CUDA job manually in the PR, before merging? Thanks!

Yes, you can add ciflow/periodic to your PR to trigger Windows CUDA jobs. As the name implies, the job only runs periodically and not on every PR by default because we only have a limited number of Windows CUDA runners. When failure like this happens, it's ok to follow up with a fix.

I have add ciflow/periodic onto #104031, so it should be running there now.

pytorchmergebot pushed a commit that referenced this pull request Jun 22, 2023
Summary:

This PR fixes the wrong assertion in the `test_addmm_gelu` happening in the Windows CUDA CI job caused by #103811. The addmm + GELU fusion is likely not happening (or not using the tanh approximation) on Widnows. See [this comment](#103811 (comment)) in the #103811 for the details of the error.

Test Plan:

```
$ python test/test_linalg.py -k test_addmm_relu -v
test_addmm_relu_cpu_bfloat16 (__main__.TestLinalgCPU.test_addmm_relu_cpu_bfloat16) ... ok
test_addmm_relu_cpu_float32 (__main__.TestLinalgCPU.test_addmm_relu_cpu_float32) ... ok
test_addmm_relu_cpu_float64 (__main__.TestLinalgCPU.test_addmm_relu_cpu_float64) ... ok
test_addmm_relu_cuda_bfloat16 (__main__.TestLinalgCUDA.test_addmm_relu_cuda_bfloat16) ... ok
test_addmm_relu_cuda_float32 (__main__.TestLinalgCUDA.test_addmm_relu_cuda_float32) ... ok
test_addmm_relu_cuda_float64 (__main__.TestLinalgCUDA.test_addmm_relu_cuda_float64) ... ok

----------------------------------------------------------------------
Ran 6 tests in 2.131s

OK

$ python test/test_linalg.py -k test_addmm_gelu -v
test_addmm_gelu_cpu_bfloat16 (__main__.TestLinalgCPU.test_addmm_gelu_cpu_bfloat16) ... ok
test_addmm_gelu_cpu_float32 (__main__.TestLinalgCPU.test_addmm_gelu_cpu_float32) ... ok
test_addmm_gelu_cpu_float64 (__main__.TestLinalgCPU.test_addmm_gelu_cpu_float64) ... ok
test_addmm_gelu_cuda_bfloat16 (__main__.TestLinalgCUDA.test_addmm_gelu_cuda_bfloat16) ... ok
test_addmm_gelu_cuda_float32 (__main__.TestLinalgCUDA.test_addmm_gelu_cuda_float32) ... ok
test_addmm_gelu_cuda_float64 (__main__.TestLinalgCUDA.test_addmm_gelu_cuda_float64) ... ok

----------------------------------------------------------------------
Ran 6 tests in 2.194s

OK
```

Reviewers: eellison huydhn

Subscribers:

Tasks:

Tags:

ghstack-source-id: 6b3a992609dd70f67a803b497e068d4f46583d4d
Pull Request resolved: #104031
pytorchmergebot pushed a commit that referenced this pull request Jun 22, 2023
…UDA"

Summary:

This PR fixes the wrong assertion in the `test_addmm_gelu` happening in the Windows CUDA CI job caused by #103811. The addmm + GELU fusion is likely not happening (or not using the tanh approximation) on Widnows. See [this comment](#103811 (comment)) in the #103811 for the details of the error.

Test Plan:

```
$ python test/test_linalg.py -k test_addmm_relu -v
test_addmm_relu_cpu_bfloat16 (__main__.TestLinalgCPU.test_addmm_relu_cpu_bfloat16) ... ok
test_addmm_relu_cpu_float32 (__main__.TestLinalgCPU.test_addmm_relu_cpu_float32) ... ok
test_addmm_relu_cpu_float64 (__main__.TestLinalgCPU.test_addmm_relu_cpu_float64) ... ok
test_addmm_relu_cuda_bfloat16 (__main__.TestLinalgCUDA.test_addmm_relu_cuda_bfloat16) ... ok
test_addmm_relu_cuda_float32 (__main__.TestLinalgCUDA.test_addmm_relu_cuda_float32) ... ok
test_addmm_relu_cuda_float64 (__main__.TestLinalgCUDA.test_addmm_relu_cuda_float64) ... ok

----------------------------------------------------------------------
Ran 6 tests in 2.131s

OK

$ python test/test_linalg.py -k test_addmm_gelu -v
test_addmm_gelu_cpu_bfloat16 (__main__.TestLinalgCPU.test_addmm_gelu_cpu_bfloat16) ... ok
test_addmm_gelu_cpu_float32 (__main__.TestLinalgCPU.test_addmm_gelu_cpu_float32) ... ok
test_addmm_gelu_cpu_float64 (__main__.TestLinalgCPU.test_addmm_gelu_cpu_float64) ... ok
test_addmm_gelu_cuda_bfloat16 (__main__.TestLinalgCUDA.test_addmm_gelu_cuda_bfloat16) ... ok
test_addmm_gelu_cuda_float32 (__main__.TestLinalgCUDA.test_addmm_gelu_cuda_float32) ... ok
test_addmm_gelu_cuda_float64 (__main__.TestLinalgCUDA.test_addmm_gelu_cuda_float64) ... ok

----------------------------------------------------------------------
Ran 6 tests in 2.194s

OK
```

Reviewers: eellison huydhn

Subscribers:

Tasks:

Tags:

Differential Revision: [D46931688](https://our.internmc.facebook.com/intern/diff/D46931688)

[ghstack-poisoned]
pytorchmergebot pushed a commit that referenced this pull request Jun 22, 2023
Summary:

This PR fixes the wrong assertion in the `test_addmm_gelu` happening in the Windows CUDA CI job caused by #103811. The addmm + GELU fusion is likely not happening (or not using the tanh approximation) on Widnows. See [this comment](#103811 (comment)) in the #103811 for the details of the error.

Test Plan:

```
$ python test/test_linalg.py -k test_addmm_relu -v
test_addmm_relu_cpu_bfloat16 (__main__.TestLinalgCPU.test_addmm_relu_cpu_bfloat16) ... ok
test_addmm_relu_cpu_float32 (__main__.TestLinalgCPU.test_addmm_relu_cpu_float32) ... ok
test_addmm_relu_cpu_float64 (__main__.TestLinalgCPU.test_addmm_relu_cpu_float64) ... ok
test_addmm_relu_cuda_bfloat16 (__main__.TestLinalgCUDA.test_addmm_relu_cuda_bfloat16) ... ok
test_addmm_relu_cuda_float32 (__main__.TestLinalgCUDA.test_addmm_relu_cuda_float32) ... ok
test_addmm_relu_cuda_float64 (__main__.TestLinalgCUDA.test_addmm_relu_cuda_float64) ... ok

----------------------------------------------------------------------
Ran 6 tests in 2.131s

OK

$ python test/test_linalg.py -k test_addmm_gelu -v
test_addmm_gelu_cpu_bfloat16 (__main__.TestLinalgCPU.test_addmm_gelu_cpu_bfloat16) ... ok
test_addmm_gelu_cpu_float32 (__main__.TestLinalgCPU.test_addmm_gelu_cpu_float32) ... ok
test_addmm_gelu_cpu_float64 (__main__.TestLinalgCPU.test_addmm_gelu_cpu_float64) ... ok
test_addmm_gelu_cuda_bfloat16 (__main__.TestLinalgCUDA.test_addmm_gelu_cuda_bfloat16) ... ok
test_addmm_gelu_cuda_float32 (__main__.TestLinalgCUDA.test_addmm_gelu_cuda_float32) ... ok
test_addmm_gelu_cuda_float64 (__main__.TestLinalgCUDA.test_addmm_gelu_cuda_float64) ... ok

----------------------------------------------------------------------
Ran 6 tests in 2.194s

OK
```

Reviewers: eellison huydhn

Subscribers:

Tasks:

Tags:

Differential Revision: [D46931688](https://our.internmc.facebook.com/intern/diff/D46931688)

[ghstack-poisoned]
pytorchmergebot pushed a commit that referenced this pull request Jun 22, 2023
Summary:

This PR fixes the wrong assertion in the `test_addmm_gelu` happening in the Windows CUDA CI job caused by #103811. The addmm + GELU fusion is likely not happening (or not using the tanh approximation) on Windows. See [this comment](#103811 (comment)) in the #103811 for the details of the error.

Test Plan:

```
$ python test/test_linalg.py -k test_addmm_relu -v
test_addmm_relu_cpu_bfloat16 (__main__.TestLinalgCPU.test_addmm_relu_cpu_bfloat16) ... ok
test_addmm_relu_cpu_float32 (__main__.TestLinalgCPU.test_addmm_relu_cpu_float32) ... ok
test_addmm_relu_cpu_float64 (__main__.TestLinalgCPU.test_addmm_relu_cpu_float64) ... ok
test_addmm_relu_cuda_bfloat16 (__main__.TestLinalgCUDA.test_addmm_relu_cuda_bfloat16) ... ok
test_addmm_relu_cuda_float32 (__main__.TestLinalgCUDA.test_addmm_relu_cuda_float32) ... ok
test_addmm_relu_cuda_float64 (__main__.TestLinalgCUDA.test_addmm_relu_cuda_float64) ... ok

----------------------------------------------------------------------
Ran 6 tests in 2.131s

OK

$ python test/test_linalg.py -k test_addmm_gelu -v
test_addmm_gelu_cpu_bfloat16 (__main__.TestLinalgCPU.test_addmm_gelu_cpu_bfloat16) ... ok
test_addmm_gelu_cpu_float32 (__main__.TestLinalgCPU.test_addmm_gelu_cpu_float32) ... ok
test_addmm_gelu_cpu_float64 (__main__.TestLinalgCPU.test_addmm_gelu_cpu_float64) ... ok
test_addmm_gelu_cuda_bfloat16 (__main__.TestLinalgCUDA.test_addmm_gelu_cuda_bfloat16) ... ok
test_addmm_gelu_cuda_float32 (__main__.TestLinalgCUDA.test_addmm_gelu_cuda_float32) ... ok
test_addmm_gelu_cuda_float64 (__main__.TestLinalgCUDA.test_addmm_gelu_cuda_float64) ... ok

----------------------------------------------------------------------
Ran 6 tests in 2.194s

OK
```

Reviewers: @eellison @huydhn

Subscribers:

Tasks:

Tags:

Differential Revision: [D46931688](https://our.internmc.facebook.com/intern/diff/D46931688)
Pull Request resolved: #104031
Approved by: https://github.com/huydhn, https://github.com/malfet
@facebook-github-bot facebook-github-bot deleted the gh/aakhundov/3/head branch June 25, 2023 14:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ciflow/trunk Trigger trunk jobs on your pull request Merged module: cublas Problem related to cublas support module: cuda Related to torch.cuda, and CUDA support in general topic: not user facing topic category
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants