
Add libtorch nightly build for CUDA 12.8 #146265


Closed
wants to merge 2 commits

Conversation

tinglvv
Collaborator

@tinglvv tinglvv commented Feb 2, 2025

Try removing sm_50 and sm_60 to shrink the binary size and resolve the ld --relink error.

"Architecture support for Maxwell, Pascal, and Volta is considered feature-complete and will be frozen in an upcoming release." (from the CUDA 12.8 release notes)

Also updating the runner for the CUDA 12.8 test to g4dn (T4, sm_75) due to the dropped sm_50/sm_60 support.

#145570
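
For reference, a quick way to confirm which compute capabilities a given wheel was compiled for (a minimal sketch; the exact list printed depends on the installed build and requires a CUDA-capable environment):

```python
# Sketch: inspect the CUDA version and compiled arch list of the installed wheel.
# For the cu128 nightly built by this PR, sm_50/sm_60 should no longer appear.
import torch

print(torch.version.cuda)          # e.g. "12.8" (None on a CPU-only build)
print(torch.cuda.get_arch_list())  # e.g. ['sm_70', 'sm_75', 'sm_80', 'sm_86', 'sm_90', ...]
```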

cc @atalman @malfet @ptrblck @msaroufim @eqy @nWEIdia


pytorch-bot bot commented Feb 2, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/146265

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit 62ef609 with merge base 16e202a:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the release notes: releng release notes category label Feb 2, 2025
@tinglvv tinglvv added ciflow/binaries Trigger all binary build and upload jobs on the PR topic: not user facing topic category and removed release notes: releng release notes category labels Feb 2, 2025
@tinglvv
Collaborator Author

tinglvv commented Feb 7, 2025

Test failures in libtorch and manywheel with CUDA Error:
Tesla M60 with CUDA capability sm_52 is not compatible with the current PyTorch installation. The current PyTorch install supports CUDA capabilities sm_70 sm_75 sm_80 sm_86 sm_90 sm_100 sm_120 compute_120.

The reason is that this PR removes sm_50 and sm_60 from the 12.8 binaries, to resolve the ld --relink error in #145792 (comment).

The current upstream CI test runs on a Tesla M60, which is sm_52 (Maxwell).
See runs_on: linux.4xlarge.nvidia.gpu --> instance_type: g3.4xlarge --> Tesla M60 (https://aws.amazon.com/blogs/aws/new-next-generation-gpu-powered-ec2-instances-g3/).
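
For illustration, a simplified version of that compatibility check (a sketch only; real dispatch also honors PTX forward compatibility via the compute_120 target, which helps newer GPUs but not the older sm_52 M60):

```python
# Sketch: why the g3 (Tesla M60) runner fails against the cu128 build.
# The device's compute capability predates everything in the compiled arch list.
import torch

major, minor = torch.cuda.get_device_capability(0)  # Tesla M60 -> (5, 2)
device_arch = f"sm_{major}{minor}"
built = torch.cuda.get_arch_list()                   # cu128: ['sm_70', ..., 'sm_120', 'compute_120']

if device_arch not in built:
    print(f"{torch.cuda.get_device_name(0)} ({device_arch}) is not covered by this build: {built}")
```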

Proposed solution:
Use g4dn or g5 AWS runners to run the test for 12.8; g4dn provides a T4 (Turing, sm_75) and g5 an A10G (Ampere, sm_86). If we were to deprecate Volta (sm_70) in 12.8 as well, then we would need to get a g5 runner.
Runner choices: https://github.com/tinglvv/test-infra/blob/main/.github/scale-config.yml#L101

cc @atalman @ptrblck

@tinglvv tinglvv marked this pull request as ready for review February 7, 2025 00:45
@tinglvv tinglvv requested a review from a team as a code owner February 7, 2025 00:45
@Skylion007
Collaborator

Can we not use the linker script with --relink to keep the old arch support?

@tinglvv
Collaborator Author

tinglvv commented Feb 7, 2025

Can we not use the linker script with --relink to keep the old arch support?

Hi @Skylion007, right, --relink would work. However, we are also deprecating sm_50/60/70 for CUDA 12.8 (they will be dropped officially in future CUDA releases), and removing them resolves the build error.

@cpuhrsch cpuhrsch requested a review from ngimel February 8, 2025 01:38
@cpuhrsch cpuhrsch added module: cuda Related to torch.cuda, and CUDA support in general triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module labels Feb 8, 2025
@Skylion007
Collaborator

This will drop support for the 1080 and similar consumer chips, right? We are finally starting to drop GPU arches that are commonly used and can still run modern model architectures for inference. These are very common in university clusters.

SM70 only covers GV100s, right? Why not keep SM60 so torch supports more devices?

Is 12.9 dropping all these CUDA arches completely, or is this just to unblock the binary size issues? It seems like there might be a longer-term alternative to fixing the 1GB libtorch limit, such as getting the linker script to work with --relink; LTO might reduce the binary size enough to save one of the arches, or the binaries could just be split.

@ptrblck
Collaborator

ptrblck commented Feb 10, 2025

@Skylion007 Future CUDA versions will drop sm_50-sm_70 completely, as @tinglvv explained, and CUDA 12.8 deprecates them.

We are finally starting to drop GPU arches that are commonly used and can still run modern model architectures for inference. These are very common in university clusters.

Universities and other users stuck on older GPUs or drivers are still able to use PyTorch binaries built with an older CUDA toolkit (e.g. 12.6.3 or 11.8). We are keeping PyTorch binaries with CUDA 11.8 alive for 2+ years for this reason.
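
For users in that situation, the practical path is to install from one of the older CUDA indexes (for example, pip3 install torch --index-url https://download.pytorch.org/whl/cu118) and confirm the build still covers the device. A rough sanity-check sketch:

```python
# Sketch: after installing a cu118/cu126 wheel on a Maxwell/Pascal machine,
# confirm the wheel's CUDA version and that its arch list still covers the device.
import torch

print(torch.version.cuda)                           # e.g. "11.8"
major, minor = torch.cuda.get_device_capability(0)  # e.g. (6, 1) for a GTX 1080
print(f"sm_{major}{minor}", torch.cuda.get_arch_list())  # older builds are expected to still list sm_50/sm_60
```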

@tinglvv
Collaborator Author

tinglvv commented Feb 10, 2025

Fix for the build failures: use g4dn runners (sm_75) for 12.8 binary testing.
Hi @atalman, could you help update the runners? Thanks!

@tinglvv tinglvv changed the title Add libtorch CUDA 12.8 Add libtorch nightly build for CUDA 12.8 Feb 12, 2025
@tinglvv
Collaborator Author

tinglvv commented Feb 18, 2025

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased cu128-libtorch-build onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout cu128-libtorch-build && git pull --rebase)

Contributor

@atalman atalman left a comment


lgtm

@tinglvv tinglvv force-pushed the cu128-libtorch-build branch from 4192809 to 62ef609 on February 20, 2025 00:19
@tinglvv
Collaborator Author

tinglvv commented Feb 20, 2025

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Feb 20, 2025
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

pytorch-bot bot pushed a commit that referenced this pull request Feb 24, 2025

Pull Request resolved: #146265
Approved by: https://github.com/atalman
majing921201 pushed a commit to majing921201/pytorch that referenced this pull request Mar 4, 2025
Labels
ciflow/binaries - Trigger all binary build and upload jobs on the PR
ciflow/trunk - Trigger trunk jobs on your pull request
Merged
module: cuda - Related to torch.cuda, and CUDA support in general
open source
topic: not user facing - topic category
triaged - This issue has been looked at by a team member, and triaged and prioritized into an appropriate module
7 participants