
preferred blas library; cublaslt gemm implementation #122106

Closed · 15 commits

Conversation

jeffdaily
Collaborator

@jeffdaily jeffdaily commented Mar 18, 2024

Following the example of PyTorch supporting a preferred linalg library (cusolver or magma), this PR introduces a preferred BLAS library selector: either cublas or cublaslt for CUDA, and hipblas or hipblaslt for ROCm via the normal hipification of sources.

The default BLAS implementation remains cublas (or hipblas). cublaslt (or hipblaslt) can be enabled with the environment variable `TORCH_BLAS_PREFER_CUBLASLT=1` (or its alias `TORCH_BLAS_PREFER_HIPBLASLT=1`), or by calling `torch.backends.cuda.preferred_blas_library(backend="cublaslt")` (or its alias `backend="hipblaslt"`).
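To illustrate the selection behavior described above, here is a minimal standalone sketch of how the environment-variable preference could be resolved. This is not PyTorch's actual implementation; the helper name `preferred_blas_backend` and the accepted truthy values are assumptions for illustration only.

```python
import os

def preferred_blas_backend(env=os.environ):
    """Hypothetical sketch: return "cublaslt" if either preference
    variable (TORCH_BLAS_PREFER_CUBLASLT or its HIPBLASLT alias) is set
    to a truthy value, otherwise keep the default "cublas" backend."""
    for var in ("TORCH_BLAS_PREFER_CUBLASLT", "TORCH_BLAS_PREFER_HIPBLASLT"):
        if env.get(var, "0").strip() in ("1", "true", "True", "TRUE"):
            return "cublaslt"
    return "cublas"
```

In the actual PR, the same preference can also be set at runtime via `torch.backends.cuda.preferred_blas_library(backend="cublaslt")`.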

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @chenyang78 @kadeng @chauhang


pytorch-bot bot commented Mar 18, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/122106

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (5 Unrelated Failures)

As of commit b7ebb6e with merge base cf5ca58:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@bdhirsh bdhirsh added the triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) label Mar 18, 2024
@pytorch-bot pytorch-bot bot added the release notes: linalg_frontend release notes category label Mar 19, 2024
Collaborator

@lezcano lezcano left a comment


Looks good. I left a few comments.

aten/src/ATen/Context.cpp (outdated)
{
if (at::globalContext().blasPreferredBackend() == BlasBackend::Cublaslt) {
#ifdef USE_ROCM
// hipblaslt does not support complex gemm yet
Collaborator

Does HIPblasLT have any other limitations for the versions we support?

Collaborator Author

hipBLASLt in rocm 6.0 does not support complex or double types. It also only supports MI200 and MI300. I will add a TORCH_CHECK for that in aten/src/ATen/Context.cpp setter.
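The limitations mentioned in that reply (no complex or double types; MI200/MI300 only) can be sketched as a validation check like the TORCH_CHECK the author proposes to add. This is an illustrative Python sketch, not the actual C++ setter; the function name, the dtype strings, and the `gfx` architecture prefixes (gfx90a for MI200, gfx940/941/942 for MI300) are assumptions for illustration.

```python
# Assumed architecture names: MI200 is gfx90a; MI300 is gfx940/941/942.
SUPPORTED_ARCHES = ("gfx90a", "gfx940", "gfx941", "gfx942")

def check_hipblaslt_supported(arch, dtype):
    """Hypothetical sketch of the guard described above: reject dtypes and
    GPU architectures that hipBLASLt (as of ROCm 6.0) does not support."""
    if dtype in ("float64", "complex64", "complex128"):
        raise RuntimeError(f"hipBLASLt does not support dtype {dtype}")
    if not any(arch.startswith(a) for a in SUPPORTED_ARCHES):
        raise RuntimeError(f"hipBLASLt requires MI200/MI300, got {arch}")
```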

aten/src/ATen/cuda/CUDABlas.cpp (outdated)
aten/src/ATen/cuda/CUDABlas.cpp
test/test_linalg.py
torch/backends/cuda/__init__.py (outdated)
torch/backends/cuda/__init__.py
Collaborator

@lezcano lezcano left a comment


Changes look good.

Could you extend the current matmul testing in test_matmul_small_brute_force_{1,2,3}d_Nd to exercise these paths, to make sure the new paths output correct results?

Also see the small point at #122106 (comment)

@jeffdaily jeffdaily added the rocm This tag is for PRs from ROCm team label Apr 8, 2024
@jeffdaily
Collaborator Author

Changes look good.

Could you extend the current matmul testing in test_matmul_small_brute_force_{1,2,3}d_Nd to exercise these paths, to make sure the new paths output correct results?

Also see the small point at #122106 (comment)

Done.

@jeffdaily jeffdaily requested a review from lezcano April 8, 2024 23:09
@pytorchmergebot
Collaborator

Rebase failed due to Command git -C /home/runner/work/pytorch/pytorch rebase refs/remotes/origin/viable/strict pull/122106/head returned non-zero exit code 1

Rebasing (1/14)
Auto-merging aten/src/ATen/Context.h
Auto-merging torch/_C/__init__.pyi.in
Auto-merging torch/_dynamo/trace_rules.py
Auto-merging torch/csrc/Module.cpp
CONFLICT (content): Merge conflict in torch/csrc/Module.cpp
error: could not apply b303c18747e... add preferred blas backend selector
hint: Resolve all conflicts manually, mark them as resolved with
hint: "git add/rm <conflicted_files>", then run "git rebase --continue".
hint: You can instead skip this commit: run "git rebase --skip".
hint: To abort and get back to the state before "git rebase", run "git rebase --abort".
Could not apply b303c18747e... add preferred blas backend selector

Raised by https://github.com/pytorch/pytorch/actions/runs/8760666291

@jeffdaily
Collaborator Author

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Apr 19, 2024
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status.

@pytorchmergebot
Collaborator

The merge job was canceled. If you believe this is a mistake, you can re-trigger it through pytorch-bot.

@jeffdaily
Collaborator Author

@pytorchbot merge -f "all failures are unrelated and show up on HUD too"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f only as a last resort, and instead consider -i/--ignore-current to continue the merge while ignoring current failures. That allows currently pending tests to finish and report signal before the merge.


pytorchmergebot pushed a commit that referenced this pull request Apr 24, 2024
Fixes broken ROCm 5.7 build caused by #122106.

Pull Request resolved: #124797
Approved by: https://github.com/atalman
alat-rights pushed a commit to alat-rights/pytorch that referenced this pull request Apr 26, 2024
Fixes broken ROCm 5.7 build caused by pytorch#122106.

Pull Request resolved: pytorch#124797
Approved by: https://github.com/atalman
@clee2000
Contributor

I'm pretty sure this broke the Windows tests:
https://hud.pytorch.org/pytorch/pytorch/commit/6ede882c0b1d5ccc95b0c82ca5e206eb2dfb2911
https://github.com/pytorch/pytorch/actions/runs/8850792172/job/24308901905
Can you forward-fix?

2024-04-26T17:27:26.5528977Z _______ TestLinalgCUDA.test_matmul_small_brute_force_1d_Nd_cuda_float32 _______
2024-04-26T17:27:26.5529578Z Traceback (most recent call last):
2024-04-26T17:27:26.5530400Z   File "C:\actions-runner\_work\pytorch\pytorch\test\test_linalg.py", line 4441, in test_matmul_small_brute_force_1d_Nd
2024-04-26T17:27:26.5531227Z     self.check_single_matmul(x, y)
2024-04-26T17:27:26.5532103Z   File "C:\actions-runner\_work\pytorch\pytorch\test\test_linalg.py", line 4392, in check_single_matmul
2024-04-26T17:27:26.5532920Z     ans = torch.matmul(x, y)
2024-04-26T17:27:26.5533479Z RuntimeError: at::cuda::blas::bgemm_internal_cublaslt: not implemented for float
2024-04-26T17:27:26.5533953Z 
2024-04-26T17:27:26.5534208Z To execute this test, run the following from the base repo dir:
2024-04-26T17:27:26.5535104Z      python test\test_linalg.py -k test_matmul_small_brute_force_1d_Nd_cuda_float32
2024-04-26T17:27:26.5535603Z 
2024-04-26T17:27:26.5535927Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

@jeffdaily
Collaborator Author

jeffdaily commented Apr 26, 2024

I think the best I can do is disable this feature on Windows. It should have been disabled there, but I didn't do it correctly.
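The failure above and the proposed fix amount to two guards: ignore the cublaslt preference on Windows, and fall back to cublas for dtypes the Lt path does not implement (the test failed with "not implemented for float"). Here is a hedged sketch of that dispatch logic; the function name, the platform check, and the set of Lt-supported dtypes are illustrative assumptions, not PyTorch's actual code.

```python
import sys

def effective_backend(preferred, dtype, platform=sys.platform,
                      lt_dtypes=("float16", "bfloat16")):
    """Hypothetical sketch: resolve which backend a gemm call would use.
    On Windows the cublaslt preference is ignored entirely; elsewhere it
    applies only to dtypes the Lt implementation covers (assumed set)."""
    if platform == "win32":
        return "cublas"
    if preferred == "cublaslt" and dtype not in lt_dtypes:
        return "cublas"  # fall back rather than raise "not implemented"
    return preferred
```

A fallback of this shape avoids the RuntimeError seen in the Windows log by never routing unsupported cases to the Lt path.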

jeffdaily added a commit to ROCm/pytorch that referenced this pull request Apr 26, 2024
PR pytorch#122106 broke windows tests. The feature should have been disabled
for Windows but was not disabled correctly.
carmocca pushed a commit to carmocca/pytorch that referenced this pull request Apr 29, 2024
Fixes broken ROCm 5.7 build caused by pytorch#122106.

Pull Request resolved: pytorch#124797
Approved by: https://github.com/atalman
@izaitsevfb izaitsevfb added the ciflow/periodic Trigger jobs ran periodically on master (periodic.yml) on the PR label Apr 29, 2024
pytorchmergebot pushed a commit that referenced this pull request Apr 30, 2024
PR #122106 broke windows tests. The feature should have been disabled for Windows but was not disabled correctly.
Pull Request resolved: #125080
Approved by: https://github.com/clee2000
andoorve pushed a commit to andoorve/pytorch that referenced this pull request May 1, 2024
Following the example of PyTorch supporting a preferred Linalg library (cusolver or magma), this PR introduces a preferred blas library selector of either cublas or cublaslt for CUDA and hipblas or hipblaslt for ROCm via normal hipification of sources.

The default blas implementation remains cublas or hipblas.  cublaslt or hipblaslt can be enabled using environment variable TORCH_BLAS_PREFER_CUBLASLT=1 (or TORCH_BLAS_PREFER_HIPBLASLT=1 as an alias) or by calling `torch.backends.cuda.preferred_blas_library(backend="cublaslt")` or as an alias `backend="hipblaslt"`.

Pull Request resolved: pytorch#122106
Approved by: https://github.com/lezcano
andoorve pushed a commit to andoorve/pytorch that referenced this pull request May 1, 2024
Fixes broken ROCm 5.7 build caused by pytorch#122106.

Pull Request resolved: pytorch#124797
Approved by: https://github.com/atalman
andoorve pushed a commit to andoorve/pytorch that referenced this pull request May 1, 2024
PR pytorch#122106 broke windows tests. The feature should have been disabled for Windows but was not disabled correctly.
Pull Request resolved: pytorch#125080
Approved by: https://github.com/clee2000
petrex pushed a commit to petrex/pytorch that referenced this pull request May 3, 2024
Following the example of PyTorch supporting a preferred Linalg library (cusolver or magma), this PR introduces a preferred blas library selector of either cublas or cublaslt for CUDA and hipblas or hipblaslt for ROCm via normal hipification of sources.

The default blas implementation remains cublas or hipblas.  cublaslt or hipblaslt can be enabled using environment variable TORCH_BLAS_PREFER_CUBLASLT=1 (or TORCH_BLAS_PREFER_HIPBLASLT=1 as an alias) or by calling `torch.backends.cuda.preferred_blas_library(backend="cublaslt")` or as an alias `backend="hipblaslt"`.

Pull Request resolved: pytorch#122106
Approved by: https://github.com/lezcano
petrex pushed a commit to petrex/pytorch that referenced this pull request May 3, 2024
Fixes broken ROCm 5.7 build caused by pytorch#122106.

Pull Request resolved: pytorch#124797
Approved by: https://github.com/atalman
pytorch-bot bot pushed a commit that referenced this pull request May 3, 2024
PR #122106 broke windows tests. The feature should have been disabled for Windows but was not disabled correctly.
Pull Request resolved: #125080
Approved by: https://github.com/clee2000
Labels: ciflow/inductor, ciflow/periodic (Trigger jobs ran periodically on master), ciflow/rocm, ciflow/trunk (Trigger trunk jobs on your pull request), Merged, module: dynamo, open source, release notes: linalg_frontend, rocm (PRs from the ROCm team), triaged

9 participants