[ROCm] use correct workspace for hipblaslt, silence warning #150227
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/150227
Note: Links to docs will display an error until the docs builds have been completed.
❌ 2 New Failures, 2 Unrelated Failures as of commit 03e8aa0 with merge base cbc0964.
NEW FAILURES - The following jobs have failed:
BROKEN TRUNK - The following jobs failed but were present on the merge base: 👉 Rebase onto the `viable/strict` branch to avoid these failures
UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchmergebot merge -r
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
Follow-up to pytorch#145130. That PR caused a warning on ROCm the first time hipblaslt was called, regardless of workload.
Successfully rebased; the branch was updated from 7a5da19 to 03e8aa0.
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 3 mandatory check(s) failed. The first few are: Dig deeper by viewing the failures on hud.
@pytorchmergebot merge -i
Merge started. Your change will be merged while ignoring the following 1 check: pull / cuda12.4-py3.10-gcc9-sm75 / test (pr_time_benchmarks, 1, 1, linux.g4dn.metal.nvidia.gpu). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
The default workspace for hipblaslt is larger than for cublas/cublaslt, which requires a slight increase in the buffer size needed. Forward fix for #150227, which broke ROCm distributed tests that were not part of the initial CI signal. Pull Request resolved: #150348. Approved by: https://github.com/jeffdaily
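To make the size mismatch concrete, here is a minimal sketch, not PyTorch's actual implementation, of how a BLASLt workspace size could be chosen with a larger default under ROCm. The byte values, the `USE_ROCM` guard, the function name, and the treatment of `CUBLASLT_WORKSPACE_SIZE` as KiB are illustrative assumptions.

```cpp
// Hedged sketch only: shows why a buffer sized for the CUDA default can be too
// small on ROCm. The defaults below are assumptions, not PyTorch's real values.
#include <cstddef>
#include <cstdlib>
#include <string>

std::size_t chosenLtWorkspaceBytes() {
  // Respect an explicit user override; treated here as a size in KiB.
  if (const char* env = std::getenv("CUBLASLT_WORKSPACE_SIZE")) {
    return static_cast<std::size_t>(std::stoull(env)) * 1024;
  }
#if defined(USE_ROCM)
  return 32ull * 1024 * 1024;  // hypothetical larger hipblaslt default (32 MiB)
#else
  return 1ull * 1024 * 1024;   // hypothetical smaller cublasLt default (1 MiB)
#endif
}
```

Any caller that pre-sizes a buffer against the smaller branch would come up short when the ROCm branch is taken, which is the gist of the forward fix above.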
@pytorchbot revert -m="Diff reverted internally" -c="ghfirst"
This Pull Request has been reverted by a revert inside Meta. To re-land this change, please open another pull request, assign the same reviewers, fix the CI failures that caused the revert, and make sure that the failing CI runs on the PR by applying the proper ciflow label (e.g., ciflow/trunk).
@pytorchbot successfully started a revert job. Check the current status here.
@ethanwee1 your PR has been successfully reverted.
…150227)" This reverts commit c158eac. Reverted #150227 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](#150227 (comment)))
This PR was reopened (likely due to being reverted), so your approval was removed. Please request another review.
lgtm
@pytorchmergebot merge -i
Merge started. Your change will be merged while ignoring the following 4 checks: pull / linux-jammy-py3-clang12-executorch / build, pull / cuda12.4-py3.10-gcc9-sm75 / test (pr_time_benchmarks, 1, 1, linux.g4dn.metal.nvidia.gpu), linux-binary-manywheel / manywheel-py3_9-cuda12_6-test / test, linux-binary-manywheel / manywheel-py3_9-cuda12_8-test / test. Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: Command
Details for Dev Infra team: Raised by workflow job.
// See Note [hipblaslt handles].
// ROCm's hipblas and hipblaslt do not share handles, unlike with CUDA.
// Using getCurrentCUDABlasLtHandle is on purpose. For CUDA it's the same as
// getCurrentCUDABlasHandle, but for ROCm it's a unique handle.
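As an illustration of that distinction, here is a hedged sketch with placeholder types standing in for the real cublasHandle_t / cublasLtHandle_t / hipblasLtHandle_t, and accessor names that are not the actual PyTorch helpers:

```cpp
// Illustrative only: on CUDA the Lt handle can be the same underlying object
// as the BLAS handle, while on ROCm hipblas and hipblaslt handles are distinct.
#include <cassert>

struct BlasHandle { int id; };
struct BlasLtHandle { int id; };

static BlasHandle g_blas{0};
#if defined(USE_ROCM)
static BlasLtHandle g_blaslt{1};  // ROCm: a separately created hipblaslt handle
#endif

BlasHandle* currentBlasHandle() { return &g_blas; }

BlasLtHandle* currentBlasLtHandle() {
#if defined(USE_ROCM)
  return &g_blaslt;  // unique handle, not shared with hipblas
#else
  // CUDA: reuse the BLAS handle as the Lt handle.
  return reinterpret_cast<BlasLtHandle*>(currentBlasHandle());
#endif
}

int main() {
#if !defined(USE_ROCM)
  // On CUDA builds the two accessors refer to the same object.
  assert(static_cast<void*>(currentBlasLtHandle()) ==
         static_cast<void*>(currentBlasHandle()));
#endif
  return 0;
}
```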
That handles can be shared between cuBLAS and cuBLASLt is factually true, but I think it is not really relevant here. The cuBLAS handle here is really only used as a key to get a corresponding workspace, and as we do not expect to run a cuBLAS-API backed and cuBLASLt-API backed matmul (on the same stream) at the same time, it's safe to use the workspace that is already allocated for one for the other.
My guess is the real reason the warning shows up on ROCm but not on CUDA is that at present the default CUBLAS_WORKSPACE_CONFIG effective size is always >= the default CUBLASLT_WORKSPACE_SIZE setting. On the CUDA side the intent is to only allocate the cuBLAS workspace and reuse it for Lt, but if Lt requests a larger workspace it precludes this unification.
If you agree with this I think a clearer explanation would be along the lines of "CUDA attempts to share workspaces with the assumption that cuBLAS workspace size >= cuBLASLt workspace size, but as this assumption may not hold on ROCm, we also add a mapping for Lt handle -> workspace in addition to BLAS handle -> workspace."
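If that framing is right, the change amounts to keeping an Lt-handle → workspace mapping alongside the BLAS-handle → workspace one, so an Lt workspace larger than the BLAS default never has to be squeezed into the existing allocation. A minimal sketch of that idea, with illustrative key types, cache layout, and sizes rather than PyTorch's actual data structures:

```cpp
// Hedged sketch of "BLAS handle -> workspace" plus "Lt handle -> workspace".
// Key types, sizes, and the cache layout are assumptions for illustration.
#include <cstddef>
#include <map>
#include <memory>
#include <utility>
#include <vector>

using Handle = const void*;  // stand-in for cublasHandle_t / cublasLtHandle_t
using Stream = int;          // stand-in for cudaStream_t / hipStream_t
using Key = std::pair<Handle, Stream>;

struct Workspace {
  explicit Workspace(std::size_t bytes) : buffer(bytes) {}
  std::vector<char> buffer;  // stand-in for a device allocation
};

// Two caches: one keyed by the BLAS handle, one keyed by the Lt handle.
static std::map<Key, std::unique_ptr<Workspace>> blas_workspaces;
static std::map<Key, std::unique_ptr<Workspace>> blaslt_workspaces;

static Workspace& getOrCreate(std::map<Key, std::unique_ptr<Workspace>>& cache,
                              Handle handle, Stream stream, std::size_t bytes) {
  auto it = cache.find({handle, stream});
  if (it == cache.end()) {
    it = cache.emplace(Key{handle, stream},
                       std::make_unique<Workspace>(bytes)).first;
  }
  return *it->second;
}

// Hypothetical sizes: the Lt default may exceed the BLAS default (as on ROCm).
Workspace& blasWorkspace(Handle h, Stream s) {
  return getOrCreate(blas_workspaces, h, s, /*bytes=*/1 << 20);
}
Workspace& blasLtWorkspace(Handle lt, Stream s) {
  return getOrCreate(blaslt_workspaces, lt, s, /*bytes=*/32 << 20);
}
```

The exact sizes and keying here are placeholders; the design point is only that the Lt path looks up (and, on a miss, allocates) a workspace sized for Lt instead of warning that the shared one is too small.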
…150227) Follow-up to pytorch#145130. That PR caused a warning on ROCm the first time hipblaslt was called, regardless of workload. Fixes #ISSUE_NUMBER. Pull Request resolved: pytorch#150227. Approved by: https://github.com/jeffdaily. Co-authored-by: Jeff Daily <jeff.daily@amd.com>
The default workspace for hipblaslt is larger than for cublas/cublaslt, which requires a slight increase in the buffer size needed. Forward fix for pytorch#150227, which broke ROCm distributed tests that were not part of the initial CI signal. Pull Request resolved: pytorch#150348. Approved by: https://github.com/jeffdaily
…ytorch#150227)" This reverts commit c158eac. Reverted pytorch#150227 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](pytorch#150227 (comment)))
Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as
This PR is no longer needed.
Follow-up to #145130. That PR caused a warning on ROCm the first time hipblaslt was called, regardless of workload.
Fixes #ISSUE_NUMBER
cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd