
[aarch64] Add CUDA 12.4 build script for ARM wheel #1775

Merged: 5 commits into pytorch:main, Apr 19, 2024

Conversation

@tinglvv (Collaborator) commented Apr 7, 2024

Add cuda_aarch64 ARM wheel build script with CUDA 12.4.
Reference #1302.

cc @ptrblck

@tinglvv tinglvv changed the title [WIP] [aarch64] Add CUDA 12.4 build ARM wheel [WIP] [aarch64] Add CUDA 12.4 build script for ARM wheel Apr 7, 2024
@atalman (Contributor) commented Apr 9, 2024

cc @snadampal

@johnnynunez

@tinglvv is this for grace or jetson or arm sbsa?

@tinglvv (Collaborator, Author) commented Apr 10, 2024

@tinglvv is this for grace or jetson or arm sbsa?

Hi, this is built for ARM SBSA, which is also used by Grace. The main purpose is to run this ARM+CUDA wheel on Grace CPUs. I have not tested on Jetson yet.

@tinglvv tinglvv changed the title [WIP] [aarch64] Add CUDA 12.4 build script for ARM wheel [aarch64] Add CUDA 12.4 build script for ARM wheel Apr 10, 2024
@johnnynunez

@tinglvv is this for grace or jetson or arm sbsa?

Hi, this is built for ARM SBSA, which is also used by Grace. The main purpose is to run this ARM+CUDA wheel on Grace CPUs. I have not tested on Jetson yet.

thanks @tinglvv very nice

@snadampal (Contributor) commented Apr 10, 2024

Hi @tinglvv, are you testing the wheel for CPU inference as well? Please let me know what you have covered so far.
By the way, I'm upgrading the ARM wheel builder to manylinux 2_28 with gcc-12 and no conda; here is my PR: #1781.
Would you be able to try these changes in that environment as well?

@tinglvv (Collaborator, Author) commented Apr 10, 2024

Hi @tinglvv, are you testing the wheel for CPU inference as well? Please let me know what you have covered so far. By the way, I'm upgrading the ARM wheel builder to manylinux 2_28 with gcc-12 and no conda; here is my PR: #1781. Would you be able to try these changes in that environment as well?

Hi, sorry, do you mean testing with CPU aarch64? I simply added CUDA support on top of the existing workflow for CPU aarch64. I have tested on Grace with the CUDA-related operations from test_ops.py.

Sure I can try with your PR and let you know.

ldconfig
}

function prune_124 {
Review comment (Collaborator):

What does NVPrune do?

Review comment (Collaborator):

@nWEIdia it removes GPU architectures that are never used from the libraries, to lower the binary size. This workflow can be dangerous if libraries depend on heuristics and can select kernels from the same GPU family (we've seen issues before where sm_61 was dropped, causing all kinds of issues on GTX cards).
I don't think the pruning is useful anymore, as we are using the CUDA dependencies from PyPI now.
However, we might want to keep it here and follow up with a cleanup in a separate PR.
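For context, a prune step like the `prune_124` function in the diff typically drives nvprune over the static CUDA libraries, keeping only device code for a whitelist of architectures. A hedged sketch of the pattern — the architecture list (80, 90) and the library path below are illustrative assumptions, not the exact values in this PR:

```shell
# Build the -gencode flag list that nvprune expects from a list of SM versions.
build_gencode_flags() {
  flags=""
  for arch in "$@"; do
    flags="$flags -gencode arch=compute_${arch},code=sm_${arch}"
  done
  printf '%s' "$flags"
}

# Hypothetical arch whitelist; the real script's list may differ.
GENCODE=$(build_gencode_flags 80 90)

# Prune in place, but only if nvprune and the target library actually exist.
if command -v nvprune >/dev/null 2>&1; then
  for lib in /usr/local/cuda/lib64/libcublas_static.a; do
    [ -f "$lib" ] || continue
    nvprune $GENCODE "$lib" -o "$lib"
  done
fi
```

This is exactly the behavior ptrblck warns about: any SM version missing from the whitelist is stripped from the binary for good.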

@ptrblck (Collaborator) commented Apr 11, 2024

Are you testing the wheel for CPU inference as well? Please let me know what you have covered so far.
By the way, I'm upgrading the ARM wheel builder to manylinux 2_28 with gcc-12 and no conda; here is my PR: #1781.
Would you be able to try these changes in that environment as well?

@snadampal I'm unsure if this request is related to your PR or this change. If your PR needs additional testing, could you please tag us there?

Tests pass, so can we merge it, @atalman @malfet, or is your review pending?

@snadampal (Contributor):

Hi @ptrblck, my PR is to migrate the whole aarch64 CI to manylinux 2_28 and remove conda packages. My request was to make sure these CUDA addition changes work there as well, so that there are no surprises when we make the switch.

@tinglvv (Collaborator, Author) commented Apr 11, 2024

Hi @ptrblck, my PR is to migrate the whole aarch64 CI to manylinux 2_28 and remove conda packages. My request was to make sure these CUDA addition changes work there as well, so that there are no surprises when we make the switch.

Hi @snadampal, I am using a few packages from the conda environment, https://github.com/pytorch/builder/pull/1775/files#diff-3f5d59f85dd25cceac14ee2e14f8eb83547a68fbf087fc6dcc9471c368ba0ac2R88-R90. If conda is removed, is there a place to get these libs?
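For context, on a conda-free manylinux 2_28 image (AlmaLinux 8-based) equivalents of those libraries would typically come from the system package manager instead. A rough setup sketch — the package names are my assumption, not taken from #1781, and OpenBLAS in particular may need the EPEL or CRB repository enabled:

```shell
# Hypothetical conda-free replacements for the libs referenced above;
# verify package names against the AlmaLinux 8 repositories.
dnf install -y epel-release
dnf install -y openblas-devel   # provides libopenblas.so
dnf install -y libgomp          # GNU OpenMP runtime (libgomp.so.1)
dnf install -y libgfortran      # Fortran runtime needed by OpenBLAS (libgfortran.so.5)
```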

@tinglvv (Collaborator, Author) commented Apr 11, 2024

Hi @snadampal, there are major changes in the files, which may delay the merge of this PR. The workflow for the cuda-aarch64 wheel does not exist yet, so the goal here is not to have a perfect workflow but to establish it first.

We will test your PR as the next step, as we iterate and resolve issues after the merge. Hope this works for you. Thanks.

@snadampal (Contributor):

Hi @tinglvv, the libraries you pointed out above are taken care of in the new CI scripts (without conda), so I think CUDA should work fine there as well. I'm fine with rebasing mine if this PR goes first, but I'd ask you to test CUDA again in my PR.

@bryantbiggs (Contributor):

Every time a package is moved out of conda, an angel gets its wings 😅

"/usr/local/cuda/lib64/libcudnn_ops_train.so.8",
"/opt/conda/envs/aarch64_env/lib/libopenblas.so.0",
"/opt/conda/envs/aarch64_env/lib/libgfortran.so.5",
"/opt/conda/envs/aarch64_env/lib/libgomp.so.1",
Review comment (Contributor):

Currently the scripts are packaging libomp.so.
Did you check the inference performance for any models? Is there any performance difference observed with libgomp vs libomp in the current wheels?

Review comment (Contributor):

I'm observing around a 10% performance drop for eager-mode inference with libgomp compared to libomp.
If you don't have a strong preference, I suggest keeping libomp till we know better.
For more details, check my comment here: #1774 (comment)

@Aidyn-A commented Apr 12, 2024:

Can you please do a follow-up on the libgomp-to-libomp migration in your PR #1781? It is not a trivial change and is certainly out of scope for this PR. I would like to underline that this PR is for enabling CUDA. libgomp has been used with PyTorch for a long time: it is reliable and nothing is wrong with it functionality-wise. Moreover, I do not want Ting to waste her time on debugging dependencies due to libomp in this PR.

@snadampal (Contributor) commented Apr 12, 2024:

Hi @Aidyn-A, currently the aarch64 wheels are linked to libomp, not libgomp:
https://pypi.org/project/torch/#files

My point was: why change it now, without a strong reason?

I have another PR to switch the wheels from libomp to libgomp, but it is currently blocked due to the 10% regression.

Review comment:

From what I am seeing, it comes with:

/usr/local/lib/python3.8/dist-packages/torch.libs/libgomp-0f9e2209.so.1.0.0

Review comment (Contributor):

I hope you are checking either the torch 2.2 or the nightly aarch64-linux wheel, because I am seeing:
libomp-b8e5bcfb.so => /home/ubuntu/.local/lib/python3.10/site-packages/torch/lib/./../../torch.libs/libomp-b8e5bcfb.so (0x0000ffffa8c30000)

complete list:

	linux-vdso.so.1 (0x0000ffffb298d000)
	libtorch_cpu.so => /home/ubuntu/.local/lib/python3.10/site-packages/torch/lib/./libtorch_cpu.so (0x0000ffffaa960000)
	libgcc_s.so.1 => /lib/aarch64-linux-gnu/libgcc_s.so.1 (0x0000ffffaa930000)
	libc.so.6 => /lib/aarch64-linux-gnu/libc.so.6 (0x0000ffffaa780000)
	/lib/ld-linux-aarch64.so.1 (0x0000ffffb2954000)
	libc10.so => /home/ubuntu/.local/lib/python3.10/site-packages/torch/lib/./libc10.so (0x0000ffffaa680000)
	librt.so.1 => /lib/aarch64-linux-gnu/librt.so.1 (0x0000ffffaa660000)
	libdl.so.2 => /lib/aarch64-linux-gnu/libdl.so.2 (0x0000ffffaa640000)
	libopenblasp-r0-f658af2e.3.25.so => /home/ubuntu/.local/lib/python3.10/site-packages/torch/lib/./../../torch.libs/libopenblasp-r0-f658af2e.3.25.so (0x0000ffffa8e10000)
	libm.so.6 => /lib/aarch64-linux-gnu/libm.so.6 (0x0000ffffa8d70000)
	libomp-b8e5bcfb.so => /home/ubuntu/.local/lib/python3.10/site-packages/torch/lib/./../../torch.libs/libomp-b8e5bcfb.so (0x0000ffffa8c30000)
	libpthread.so.0 => /lib/aarch64-linux-gnu/libpthread.so.0 (0x0000ffffa8c10000)
	libarm_compute-7362313d.so => /home/ubuntu/.local/lib/python3.10/site-packages/torch/lib/./../../torch.libs/libarm_compute-7362313d.so (0x0000ffffa8170000)
	libarm_compute_graph-15f701fb.so => /home/ubuntu/.local/lib/python3.10/site-packages/torch/lib/./../../torch.libs/libarm_compute_graph-15f701fb.so (0x0000ffffa8030000)
	libarm_compute_core-0793f69d.so => /home/ubuntu/.local/lib/python3.10/site-packages/torch/lib/./../../torch.libs/libarm_compute_core-0793f69d.so (0x0000ffffa7fe0000)
	libstdc++.so.6 => /lib/aarch64-linux-gnu/libstdc++.so.6 (0x0000ffffa7db0000)
	libgfortran-105e6576.so.5.0.0 => /home/ubuntu/.local/lib/python3.10/site-packages/torch/lib/./../../torch.libs/libgfortran-105e6576.so.5.0.0 (0x0000ffffa7c50000)
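Anyone can reproduce this check on their own install by classifying the ldd output of libtorch_cpu.so. A minimal sketch — the path layout in the usage comment follows the listing above and may differ per install:

```shell
# Classify which OpenMP runtime appears in a piece of ldd output.
# Pass the ldd output as the first argument.
detect_omp_runtime() {
  case "$1" in
    *libgomp*) echo "libgomp (GNU OpenMP)" ;;
    *libomp*)  echo "libomp (LLVM OpenMP)" ;;
    *)         echo "no OpenMP runtime found" ;;
  esac
}

# Usage against an installed torch wheel, e.g.:
#   lib="$(python -c 'import os, torch; print(os.path.join(os.path.dirname(torch.__file__), "lib", "libtorch_cpu.so"))')"
#   detect_omp_runtime "$(ldd "$lib")"
```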

Review comment (Contributor):

By the way, libomp wasn't intentionally chosen for the aarch64-linux wheels; I think it was picked up from the conda environment. If we all agree that libgomp is what is recommended for PyTorch, then I'm fine with switching to it now. In fact I've already suggested moving to GNU OpenMP (#1774) but am waiting mainly because of the 10% regression with it in eager mode.

Review comment (Contributor):

Now we have more data on libgomp vs libomp (please check #1774 (comment)), and I'm fine with switching to libgomp for aarch64 Linux.

@snadampal (Contributor):

Hi @tinglvv, the current CI doesn't cover the aarch64-linux workflow; all the current scripts are only for CD.
So passing the CI doesn't mean this PR was tested on aarch64 Linux.
I believe you tested locally; can you please share details on which Dockerfile was used for testing?
I mean, did you use the new Dockerfile added in this PR?
If yes, in its current form it's using the default gcc-12, and PyTorch and ACL 23.08 will not compile with gcc-12. I have added a few comments around that.

@tinglvv (Collaborator, Author) commented Apr 11, 2024

Hi @tinglvv, the libraries you pointed out above are taken care of in the new CI scripts (without conda), so I think CUDA should work fine there as well. I'm fine with rebasing mine if this PR goes first, but I'd ask you to test CUDA again in my PR.

I see, that sounds great, thanks for the clarification. Will test it again in your PR.

Hi @tinglvv, the current CI doesn't cover the aarch64-linux workflow; all the current scripts are only for CD. So passing the CI doesn't mean this PR was tested on aarch64 Linux. I believe you tested locally; can you please share details on which Dockerfile was used for testing? I mean, did you use the new Dockerfile added in this PR? If yes, in its current form it's using the default gcc-12, and PyTorch and ACL 23.08 will not compile with gcc-12. I have added a few comments around that.

Hi, I tested with the new Dockerfile in this PR. I downgraded to gcc-11 here: https://github.com/pytorch/builder/pull/1775/files#diff-83169a88982e6f1445d5ac8f76ef9e2dde63bfd0f502f3e536484328ff51d851R4. My workflow was:

  • Build the docker image:
    GPU_ARCH_TYPE=cuda-aarch64 GPU_ARCH_VERSION=12.4 manywheel/build_docker.sh
  • Run the image:
    docker run -it pytorch/manylinuxaarch64-builder:cuda-aarch64-main
  • Build the wheel:
    cd /builder/aarch64_linux && BASE_CUDA_VERSION=12.4 DESIRED_PYTHON=3.8 ./aarch64_ci_build.sh

Thanks for the review, let me address the comments.

@snadampal (Contributor) left a review comment (repeated above).

@malfet (Contributor) left a comment:

Please create a pull request against pytorch to validate that it works; see pytorch/pytorch#123747 for an example.

@snadampal (Contributor):

Hi @tinglvv, is this targeted for PyTorch 2.3.x?
Otherwise I will make the libgomp switch alone in a separate PR, because libgomp can go into PyTorch 2.3.x.

@tinglvv (Collaborator, Author) commented Apr 17, 2024

Hi @tinglvv, is this targeted for PyTorch 2.3.x? Otherwise I will make the libgomp switch alone in a separate PR, because libgomp can go into PyTorch 2.3.x.

Hi, we aim to merge this soon, so I believe it is targeting 2.3.x.

manywheel/build_docker.sh: review thread (outdated, resolved)
@malfet malfet merged commit a79e1ce into pytorch:main Apr 19, 2024
20 of 22 checks passed
@malfet (Contributor) commented Apr 19, 2024

Hi @tinglvv, is this targeted for PyTorch 2.3.x? Otherwise I will make the libgomp switch alone in a separate PR, because libgomp can go into PyTorch 2.3.x.

Hi, we aim to merge this soon, so I believe it is targeting 2.3.x.

Sorry, but the 2.3 feature-development cut-off date was mid-March, so no, this should not affect/target 2.3; 2.4 sounds like a good goal.

pytorchmergebot pushed a commit to pytorch/pytorch that referenced this pull request May 27, 2024
rebasing #124112.
too many conflict files, so starting a new PR.

Test pytorch/builder#1775 (merged) for ARM wheel addition
Test pytorch/builder#1828 (merged) for setting MAX_JOBS

Current issue to follow up:
#126980

Co-authored-by: Aidyn-A <aidyn.b.aitzhan@gmail.com>
Pull Request resolved: #126174
Approved by: https://github.com/nWEIdia, https://github.com/atalman
titaiwangms pushed a commit to titaiwangms/pytorch that referenced this pull request May 28, 2024 (same commit message as above).
10 participants