Conversation

@BLOrange-AMD (Contributor) commented May 20, 2025:

Fixes test_cuda.py::test_cublas_workspace_explicit_allocation on gfx95

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd
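
For context, test_cuda.py::test_cublas_workspace_explicit_allocation checks that a GEMM causes the expected default BLAS workspace to be allocated, and that expected size differs by architecture. The sketch below is illustrative only and is not this PR's diff: the byte counts, the gfx95 check via gcnArchName, and the helper name are assumptions; it only shows the shape of an architecture-dependent expected-size computation that such a test would compare against.

```python
# Illustrative sketch, not the actual PyTorch code: how an expected default
# workspace size might be computed per architecture. All concrete sizes and
# the gfx95 handling below are assumptions for illustration.
import torch

def expected_default_workspace_bytes(device: int = 0) -> int:
    # Generic CUDA-style default, in the spirit of CUBLAS_WORKSPACE_CONFIG=:4096:2:16:8
    size = 4096 * 1024 * 2 + 16 * 1024 * 8
    if torch.version.hip is not None:
        # ROCm build: key the default off the gfx target reported by the device.
        arch = torch.cuda.get_device_properties(device).gcnArchName
        if arch.startswith("gfx95"):
            # Hypothetical larger default for gfx950-class GPUs (placeholder value).
            size = 128 * 1024 * 1024
    return size
```

On a gfx95 ROCm machine the old generic expectation would no longer match what the library actually allocates, which is the kind of mismatch this PR's one-line default update addresses.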

pytorch-bot (bot) commented May 20, 2025:

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/153988

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 1 Cancelled Job

As of commit a30caff with merge base 2b43d63:

NEW FAILURE - The following job has failed:

CANCELLED JOB - The following job was cancelled. Please retry:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot added the topic: not user facing label on May 20, 2025
@albanD requested a review from jeffdaily on May 22, 2025 17:40
@albanD added the triaged label (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module) on May 22, 2025
@jeffdaily changed the title from "Updated default workspace for gfx95" to "[ROCm] Updated default workspace for gfx95" on May 27, 2025
@pytorch-bot added the ciflow/rocm (Trigger "default" config CI on ROCm) and module: rocm (AMD GPU support for PyTorch) labels on May 27, 2025
@jeffdaily (Collaborator) commented:

Only basic CI is needed here since gfx950 is not in our public CI.

@jeffdaily (Collaborator) commented:

@pytorchbot merge

@pytorch-bot added the ciflow/trunk (Trigger trunk jobs on your pull request) label on May 27, 2025
@pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot (Collaborator) commented:

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team: raised by workflow job.

Failing merge rule: Core Maintainers

@jeffdaily (Collaborator) commented:

@pytorchbot merge -f "it is not possible for this PR to affect any current CI flows"

@pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as a last resort and instead consider -i/--ignore-current to continue the merge while ignoring current failures. This allows currently pending tests to finish and report signal before the merge.
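
For instance, had the maintainer wanted pending jobs to keep reporting signal while ignoring the known unrelated failure, the alternative would have been to comment @pytorchbot merge -i (the --ignore-current form mentioned above) instead of forcing with -f.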

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@eellison (Contributor) commented:

Inductor HUD is broken since this PR: https://hud.pytorch.org/hud/pytorch/pytorch/main/2?per_page=50&mergeEphemeralLF=true.

Should we revert?

@jeffdaily (Collaborator) commented:

> Inductor HUD is broken since this PR: https://hud.pytorch.org/hud/pytorch/pytorch/main/2?per_page=50&mergeEphemeralLF=true.
>
> Should we revert?

@eellison I don't see how the Inductor HUD for CUDA could break with this change. The one-line change in this PR is already inside an if torch.version.hip guard, so there is no way it could apply to CUDA. The error in the log says:

      /tmp/pip-req-build-0czsr0a7/torchao/csrc/cuda/mx_kernels/mx_fp_cutlass_kernels.cu(110): error: class "cutlass::gemm::collective::CollectiveMma<cutlass::gemm::MainloopSm100TmaUmmaWarpSpecializedBlockScaled<11, 3, 2, cute::tuple<cute::_2, cute::_1, cute::_1>>, cute::tuple<cute::_128, cute::_128, cute::_128>, cute::tuple<cutlass::float_e2m1_t, cutlass::float_ue8m0_t>, cute::tuple<cute::tuple<int64_t, cute::C<1>, int64_t>, cute::Layout<cute::tuple<cute::tuple<cute::tuple<cute::_32, cute::_4>, int32_t>, cute::tuple<cute::tuple<cute::_32, cute::_4>, int32_t>, cute::tuple<cute::_1, int32_t>>, cute::tuple<cute::tuple<cute::tuple<cute::_16, cute::_4>, int32_t>, cute::tuple<cute::tuple<cute::C<0>, cute::C<1>>, cute::_512>, cute::tuple<cute::C<0>, int32_t>>>>, cute::tuple<cutlass::float_e2m1_t, cutlass::float_ue8m0_t>, cute::tuple<cute::tuple<int64_t, cute::C<1>, int64_t>, cute::Layout<cute::tuple<cute::tuple<cute::tuple<cute::_32, cute::_4>, int32_t>, cute::tuple<cute::tuple<cute::_32, cute::_4>, int32_t>, cute::tuple<cute::_1, int32_t>>, cute::tuple<cute::tuple<cute::tuple<cute::_16, cute::_4>, int32_t>, cute::tuple<cute::tuple<cute::C<0>, cute::C<1>>, cute::_512>, cute::tuple<cute::C<0>, int32_t>>>>, cute::TiledMMA<cute::MMA_Atom<cute::SM100_MMA_MXF4_SS<cutlass::float_e2m1_t, cutlass::float_e2m1_t, float, cutlass::float_ue8m0_t, 128, 128, 32, cute::UMMA::Major::K, cute::UMMA::Major::K, cute::UMMA::ScaleIn::One, cute::UMMA::ScaleIn::One>>, cute::Layout<cute::tuple<cute::_1, cute::_1, cute::_1>, cute::tuple<cute::C<0>, cute::C<0>, cute::C<0>>>, cute::tuple<cute::Underscore, cute::Underscore, cute::Underscore>>, cute::tuple<cute::SM90_TMA_LOAD, cute::SM90_TMA_LOAD>, cute::tuple<cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<4>, cute::Layout<cute::tuple<cute::_8, cute::_128>, cute::tuple<cute::_128, cute::_1>>>, cute::Layout<cute::tuple<cute::tuple<cute::tuple<cute::tuple<cute::_32, cute::_4>, cute::C<1>>, cute::tuple<cute::_32, cute::_2>>, cute::_1, cute::tuple<cute::_2, cute::_1>>, cute::tuple<cute::tuple<cute::tuple<cute::tuple<cute::_16, cute::_4>, cute::C<512>>, cute::tuple<cute::C<0>, cute::C<1>>>, cute::_0, cute::tuple<cute::C<2>, cute::C<512>>>>>, void, cute::identity, cute::tuple<cute::SM90_TMA_LOAD_MULTICAST, cute::SM90_TMA_LOAD_MULTICAST>, cute::tuple<cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<4>, cute::Layout<cute::tuple<cute::_8, cute::_128>, cute::tuple<cute::_128, cute::_1>>>, cute::Layout<cute::tuple<cute::tuple<cute::tuple<cute::tuple<cute::_32, cute::_4>, cute::C<1>>, cute::tuple<cute::_32, cute::_2>>, cute::_1, cute::tuple<cute::_2, cute::_1>>, cute::tuple<cute::tuple<cute::tuple<cute::tuple<cute::_16, cute::_4>, cute::C<512>>, cute::tuple<cute::C<0>, cute::C<1>>>, cute::_0, cute::tuple<cute::C<2>, cute::C<512>>>>>, void, cute::identity>" has no member "Sm1xxBlkScaledConfig"
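
On the torch.version.hip point above, here is a minimal sketch of such a guard (illustrative only; the arch check and the placeholder body are assumptions, not the actual diff) showing why a change nested under it cannot run on a CUDA build:

```python
# Minimal sketch of a ROCm-only guard. On CUDA builds torch.version.hip is
# None, so nothing inside this branch can execute on a CUDA worker, which is
# why the CUDA inductor jobs could not be affected by a change placed here.
import torch

if torch.version.hip is not None:  # ROCm builds only
    arch = torch.cuda.get_device_properties(0).gcnArchName
    if arch.startswith("gfx95"):
        # A gfx950-specific workspace default would be applied here
        # (placeholder for illustration).
        pass
```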

@jeffdaily (Collaborator) commented:

@eellison looks like torchao is failing to build suddenly, but not due to this PR.

iupaikov-amd pushed a commit to ROCm/pytorch that referenced this pull request Jun 4, 2025
Fixes test_cuda.py::test_cublas_workspace_explicit_allocation on gfx95

Pull Request resolved: pytorch#153988
Approved by: https://github.com/jeffdaily

Labels

ciflow/rocm (Trigger "default" config CI on ROCm), ciflow/trunk (Trigger trunk jobs on your pull request), Merged, module: rocm (AMD GPU support for PyTorch), open source, topic: not user facing, triaged (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
