Move prioritized text linker optimization code from setup.py to cmake #160078

robert-hardwick · 2025-08-07T10:53:36Z

Note: This is a replica of PR #155901, which will be closed. I had to create a new PR to add it to my ghstack because some later commits depend on it.

Summary

🚀 This PR moves the prioritized text linker optimization from setup.py to cmake (and enables it by default on Linux aarch64 systems).

This change consolidates what was previously manual CI logic into a single location (cmake), ensuring consistent behavior across local builds, CI pipelines, and developer environments.

Motivation

Prioritized text layout has measurable performance benefits on Arm systems by reducing code padding and improving cache utilization. This optimization was previously triggered manually via CI scripts (.ci/aarch64_linux/aarch64_ci_build.sh) or user-set environment variables. By detecting the target architecture within cmake, this change enables the optimization automatically where applicable, improving maintainability and usability.
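For illustration only, the default described above can be sketched in Python. This is not the PR's code: the helper name `should_use_prioritized_text` and the set of accepted truthy strings are assumptions made for the example, while `USE_PRIORITIZED_TEXT_FOR_LD` is the variable named in the title of #155901.

```python
import os
import platform


def should_use_prioritized_text(env=None, system=None, machine=None):
    """Decide whether the prioritized text layout should be enabled.

    Hypothetical helper: an explicit USE_PRIORITIZED_TEXT_FOR_LD setting
    wins; otherwise the default is on only for Linux aarch64.
    """
    env = os.environ if env is None else env
    system = platform.system() if system is None else system
    machine = platform.machine() if machine is None else machine

    explicit = env.get("USE_PRIORITIZED_TEXT_FOR_LD")
    if explicit is not None:
        # Assumed truthy spellings; the real build may accept a different set.
        return explicit.strip().lower() in ("1", "on", "true", "yes")

    # No explicit setting: enable by default only on Linux aarch64.
    return system == "Linux" and machine == "aarch64"
```

The point of the sketch is the precedence: a user-set value always overrides the architecture-based default, so the optimization can still be switched off on aarch64 or switched on elsewhere.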

Note:

Due to ninja/cmake graph generation issues, we cannot apply the linker file globally to all targets, so the targets must be defined manually. See CMakeLists.txt: the main libraries torch_python, torch, torch_cpu, torch_cuda, and torch_xpu have been targeted, which should be enough to maintain the performance benefits outlined above.

Stack from ghstack (oldest at bottom):

Co-authored-by: Usamah Zaheer usamah.zaheer@arm.com

cc @malfet @snadampal @milpuz01 @aditew01 @nikhil-arm @fadara01 @usamahz

pytorch-bot · 2025-08-07T10:53:40Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/160078

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit ef97dc3 with merge base 1330440:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

robert-hardwick · 2025-08-07T11:02:37Z

Going to close #155901 for this PR. It needed to be rebased and #160079 depends on it, so thought it would be easier to bring into the ghstack.

robert-hardwick · 2025-08-07T11:07:25Z

@pytorchbot label "ciflow/linux-aarch64"

robert-hardwick · 2025-08-07T11:12:12Z

@pytorchbot label "topic: not user facing"

robert-hardwick · 2025-08-07T11:12:33Z

@pytorchbot label "ciflow/linux-aarch64"

robert-hardwick · 2025-08-21T10:17:51Z

@Aidyn-A wondering if you can take a look at this, since you originally authored #121975. I've manually defined the targets (see the CMakeLists.txt change). Will this be enough to maintain the performance improvement, given that we previously set global linker flags?

robert-hardwick · 2025-09-12T09:14:08Z

@atalman those failures are fixed in #162643; the branch is just out of date. This one didn't really need to be reverted. I think we can merge this one back in.

…pytorch#160078)

Note. This is a replica PR of pytorch#155901 which will be closed. I had to create a new PR in order to add it into my ghstack as there are some later commits which depend on it.

### Summary

🚀 This PR moves the prioritized text linker optimization from setup.py to cmake (and enables it by default on Linux aarch64 systems).

This change consolidates what was previously manual CI logic into a single location (cmake), ensuring consistent behavior across local builds, CI pipelines, and developer environments.

### Motivation

Prioritized text layout has measurable performance benefits on Arm systems by reducing code padding and improving cache utilization. This optimization was previously triggered manually via CI scripts (.ci/aarch64_linux/aarch64_ci_build.sh) or user-set environment variables. By detecting the target architecture within cmake, this change enables the optimization automatically where applicable, improving maintainability and usability.

Note:

Due to ninja/cmake graph generation issues, we cannot apply the linker file globally to all targets, so the targets must be defined manually. See CMakeLists.txt: the main libraries torch_python, torch, torch_cpu, torch_cuda, and torch_xpu have been targeted, which should be enough to maintain the performance benefits outlined above.

Co-authored-by: Usamah Zaheer <usamah.zaheer@arm.com>
Pull Request resolved: pytorch#160078
Approved by: https://github.com/seemethere

…rch#159737)

----

This PR will be part of a series of PRs that aims to remove the `.ci/aarch64_linux` folder entirely, such that the Aarch64 manylinux build happens as part of `.ci/manywheel/build.sh`, the same as other platforms.

In this PR:

- We prebuild and install Arm Compute Library in the manylinux docker image (at /acl), instead of at build time for every pytorch build. Also updated the jammy install path to be /acl too.
- We can therefore remove the build_ArmComputeLibrary functions from the ci build scripts.
- There is also some refactoring of install_openblas.sh and install_acl.sh to align them (similar formatting, similar variable names, same place for version number update).
- We had 2 places to define the openblas version; this has been reduced to 1 (install_openblas.sh).
- ACL_VERSION and OPENBLAS_VERSION can now be overridden at build.sh level by developers, but there is only 1 version of each hardcoded for ci.

Pull Request resolved: pytorch#159737
Approved by: https://github.com/seemethere
ghstack dependencies: pytorch#160078

…to cmake (pytorch#160078)" This reverts commit 26b3ae5. Reverted pytorch#160078 on behalf of https://github.com/atalman due to Sorry reverting this broke linux aarch64 CUDA nightlies [pytorch/pytorch/actions/runs/17637486681/job/50146967503](https://github.com/pytorch/pytorch/actions/runs/17637486681/job/50146967503) ([comment](pytorch#160078 (comment)))

robert-hardwick · 2025-09-18T17:01:45Z

@pytorchbot merge

pytorchmergebot · 2025-09-18T17:03:59Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

Update

62e5e8c

[ghstack-poisoned]

robert-hardwick requested a review from a team as a code owner August 7, 2025 10:53

This was referenced Aug 7, 2025

Build and Install Arm Compute Library in manylinux docker image #159737

Closed

Make manylinux build.sh work for AArch64 and AArch64+CUDA builds #160079

Draft

pytorchbot added the open source label Aug 7, 2025

pytorch-bot bot added the ciflow/linux-aarch64 linux aarch64 CI workflow label Aug 7, 2025

robert-hardwick added module: arm Related to ARM architectures builds of PyTorch. Includes Apple M1 and removed ciflow/linux-aarch64 linux aarch64 CI workflow labels Aug 7, 2025

robert-hardwick mentioned this pull request Aug 7, 2025

Default USE_PRIORITIZED_TEXT_FOR_LD=1 on Linux aarch64 via setup.py #155901

Closed

robert-hardwick requested a review from tinglvv August 7, 2025 11:10

pytorch-bot bot added the topic: not user facing topic category label Aug 7, 2025

pytorch-bot bot added the ciflow/linux-aarch64 linux aarch64 CI workflow label Aug 7, 2025

robert-hardwick changed the title ~~Enable prioritized linker optimization for AArch64 in setup.py and clean up CI script - cherry-pick from 155901~~ Enable prioritized linker optimization for AArch64 in setup.py and clean up CI script Aug 7, 2025

Skylion007 reviewed Aug 8, 2025

setup.py (outdated, resolved)

robert-hardwick commented Aug 8, 2025

setup.py (outdated, resolved)

Update

dd04aee

[ghstack-poisoned]

robert-hardwick changed the title ~~Enable prioritized linker optimization for AArch64 in setup.py and clean up CI script~~ Move prioritized text linker optimization code from setup.py to cmake Aug 21, 2025

robert-hardwick requested review from Skylion007 and Aidyn-A August 21, 2025 10:15

robert-hardwick added the arm priority label Aug 21, 2025

robert-hardwick commented Aug 21, 2025

.gitignore (resolved)

robert-hardwick added 2 commits September 4, 2025 12:04

Update

9927fb1

[ghstack-poisoned]

Update

b65e0a8

[ghstack-poisoned]

seemethere approved these changes Sep 4, 2025

pytorchmergebot added Reverted ci-no-td Do not run TD on this PR labels Sep 11, 2025

pytorchmergebot reopened this Sep 11, 2025

atalman added the ciflow/binaries_wheel Trigger binary build and upload jobs for wheel on the PR label Sep 11, 2025

Update

17ca78d

[ghstack-poisoned]

robert-hardwick mentioned this pull request Sep 12, 2025

Remove .ci/aarch64_linux folder #162810

Draft

robert-hardwick added 2 commits September 17, 2025 14:15

Update

45f9cf4

[ghstack-poisoned]

Update

ef97dc3

[ghstack-poisoned]

pytorchmergebot added the merging label Sep 18, 2025

pytorchmergebot closed this in 1aeac30 Sep 18, 2025

pytorchmergebot removed the merging label Sep 18, 2025
