@jithunnair-amd jithunnair-amd commented Sep 10, 2025

Fixes build timeouts >4h on libtorch build jobs: https://hud.pytorch.org/hud/pytorch/pytorch/75e7f49f9c70116d7c4f8f86c3d0688ade306284/1?per_page=50&name_filter=inux-binary-libtorch%20%2F%20libtorch-rocm&mergeEphemeralLF=true

Brings back the code that narrows down the CK compilation targets, originally from 69a25f6#diff-ce80f3115ab2f6be5142f0678a1fc92c6b2d7727766ce44f48726c99e720f777

gfx942 supports fp8

Don't enable gfx950 for now, until more optimizations are in place as per https://github.com/pytorch/pytorch/pull/162648/files#r2369588738
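
As a rough illustration of the narrowing described above (not the actual CMake change in this PR), the idea is to intersect the requested ROCm architecture list with the set of architectures FBGEMM_GENAI should be built for. The names `FBGEMM_GENAI_ARCHS`, `narrow_ck_targets`, and `should_build_fbgemm_genai`, as well as the sample architecture list, are hypothetical and chosen only for this sketch.

```python
# Hypothetical sketch of the arch-narrowing logic; the real change lives in
# PyTorch's CMake build files and may differ in detail.

# Architectures for which the FBGEMM_GENAI CK kernels are built.
# gfx950 is deliberately left out for now, as discussed in this PR.
FBGEMM_GENAI_ARCHS = {"gfx942"}


def narrow_ck_targets(requested_archs):
    """Keep only the requested archs that FBGEMM_GENAI should compile for."""
    return [arch for arch in requested_archs if arch in FBGEMM_GENAI_ARCHS]


def should_build_fbgemm_genai(requested_archs):
    """Skip the FBGEMM_GENAI build entirely when no supported arch is requested."""
    return bool(narrow_ck_targets(requested_archs))


# Example arch list (an assumption, not taken from the PR).
requested = "gfx900;gfx906;gfx942;gfx950;gfx1100".split(";")
print(narrow_ck_targets(requested))          # ['gfx942']
print(should_build_fbgemm_genai(requested))  # True
```

Compiling the CK kernels for one architecture instead of every requested one is what is expected to bring the build time back down.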

Validation:
rocm6.4 and rocm6.3 libtorch builds finished within 3.9h.

cc @jeffdaily @sunway513 @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd @danielvegamyhre (since their change had removed this snippet, causing ROCm build times to exceed 4h)


pytorch-bot bot commented Sep 10, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/162648

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 5 Cancelled Jobs, 1 Unrelated Failure

As of commit 0aac9b5 with merge base 3a7db34 (image):

NEW FAILURE - The following job has failed:

CANCELLED JOBS - The following jobs were cancelled. Please retry:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the ciflow/rocm (Trigger "default" config CI on ROCm) and module: rocm (AMD GPU support for Pytorch) labels Sep 10, 2025
@danielvegamyhre
Contributor

danielvegamyhre commented Sep 10, 2025

> @danielvegamyhre (since their change had removed this snippet, causing ROCm builds to increase >4h)

Can you clarify exactly what code was removed in my PR that caused the build time to increase for ROCm? It's not clear to me and I'd like to understand, thanks.

@jeffdaily
Collaborator

> > @danielvegamyhre (since their change had removed this snippet, causing ROCm builds to increase >4h)
>
> Can you clarify exactly what code was removed in my PR that caused the build time to increase for ROCm? It's not clear to me and I'd like to understand, thanks.

b6d0a9e#diff-ce80f3115ab2f6be5142f0678a1fc92c6b2d7727766ce44f48726c99e720f777L277-L282

@jeffdaily jeffdaily added the ciflow/trunk label (Trigger trunk jobs on your pull request) Sep 11, 2025
@jithunnair-amd jithunnair-amd added the ciflow/binaries_libtorch label (Trigger binary build and upload jobs for libtorch on the PR) Sep 11, 2025
@jeffdaily jeffdaily marked this pull request as ready for review September 11, 2025 21:48
@jithunnair-amd jithunnair-amd added the ciflow/binaries_wheel label (Trigger binary build and upload jobs for wheel on the PR) Sep 12, 2025
@jithunnair-amd
Collaborator Author

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased build_fbgemm_ck_only_for_gfx942 onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout build_fbgemm_ck_only_for_gfx942 && git pull --rebase)

@cthi
Contributor

cthi commented Sep 22, 2025

@jithunnair-amd this change looks good to me overall. I just clarified what I meant before: we should only build gfx942 for now with fbgemm + ROCm.

Do you plan to merge it soon?

@jithunnair-amd
Collaborator Author

> @jithunnair-amd this change looks good to me overall. I just clarified what I meant before: we should only build gfx942 for now with fbgemm + ROCm.
>
> Do you plan to merge it soon?

@cthi Yes. However, the CUDA build failures were a bit baffling, so I'm going to try rebasing again.

@jithunnair-amd
Collaborator Author

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased build_fbgemm_ck_only_for_gfx942 onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout build_fbgemm_ck_only_for_gfx942 && git pull --rebase)

@pytorchmergebot pytorchmergebot force-pushed the build_fbgemm_ck_only_for_gfx942 branch from a55349f to 5b12488 Compare September 23, 2025 02:31
@jithunnair-amd jithunnair-amd changed the title from "[ROCm] Build FBGEMM_GENAI for gfx942 and gfx950 only" to "[ROCm] Build FBGEMM_GENAI for gfx942 only" Sep 23, 2025
@jithunnair-amd
Collaborator Author

@pytorchbot merge -f "CI failures unrelated. Merging to restore nightly libtorch builds"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

pytorchmergebot pushed a commit that referenced this pull request Sep 24, 2025
Despite narrowing down the [FBGEMM_GENAI build to gfx942](#162648), the nightly builds still timed out because they [didn't get enough time to finish the post-PyTorch-build steps](https://github.com/pytorch/pytorch/actions/runs/17969771026/job/51109432897).

This PR increases the timeout for ROCm builds for both [libtorch](https://github.com/pytorch/pytorch/actions/runs/17969771026) and [manywheel](https://github.com/pytorch/pytorch/actions/runs/17969771041), because both are currently close to the 4hr mark.

This PR is a more ROCm-targeted version of #162880 (which is for the release/2.9 branch).

Pull Request resolved: #163776
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
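
The headroom argument behind this timeout bump can be worked through with simple arithmetic. The sketch below is only an illustration: it pairs the 3.9h libtorch build time reported in the validation note of #162648 with the old 4-hour limit and the new 5-hour (300-minute) limit named in the follow-up commit title. The function name `headroom_minutes` is hypothetical, and the actual duration of the post-PyTorch-build steps is not stated in this thread.

```python
# Rough headroom arithmetic, assuming the 3.9 h build time from the validation
# note is representative of the nightly ROCm builds.

OLD_TIMEOUT_MIN = 240  # previous 4-hour limit
NEW_TIMEOUT_MIN = 300  # new 5-hour limit
BUILD_MIN = 234        # 3.9 h expressed in minutes


def headroom_minutes(timeout_min, build_min):
    """Minutes left for post-build steps before the job is cancelled."""
    return timeout_min - build_min


print(headroom_minutes(OLD_TIMEOUT_MIN, BUILD_MIN))  # 6
print(headroom_minutes(NEW_TIMEOUT_MIN, BUILD_MIN))  # 66
```

Under these assumptions, the old limit left only a few minutes for everything that runs after the PyTorch build itself, which is consistent with the nightly jobs still timing out.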
dsashidh pushed a commit to dsashidh/pytorch that referenced this pull request Sep 26, 2025
jainapurva pushed a commit that referenced this pull request Sep 29, 2025
jainapurva pushed a commit that referenced this pull request Sep 29, 2025
pytorchbot pushed a commit that referenced this pull request Oct 6, 2025
(cherry picked from commit 0ec946a)
atalman pushed a commit that referenced this pull request Oct 6, 2025
[ROCm] Increase binary build timeout to 5 hours (300 minutes) (#163776)

(cherry picked from commit 0ec946a)

Co-authored-by: Jithun Nair <jithun.nair@amd.com>
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Labels

ciflow/binaries_libtorch (Trigger binary build and upload jobs for libtorch on the PR); ciflow/binaries_wheel (Trigger binary build and upload jobs for wheel on the PR); ciflow/rocm (Trigger "default" config CI on ROCm); ciflow/trunk (Trigger trunk jobs on your pull request); Merged; module: rocm (AMD GPU support for Pytorch); open source; topic: not user facing (topic category)

6 participants