
[aarch64] Add CUDA 12.4 build script for ARM wheel #1775

Merged: 5 commits into pytorch:main, Apr 19, 2024

Conversation

@tinglvv (Collaborator) commented Apr 7, 2024

Add cuda_aarch64 ARM wheel build script with CUDA 12.4.
Reference #1302.

cc @ptrblck

@tinglvv tinglvv changed the title [WIP] [aarch64] Add CUDA 12.4 build ARM wheel [WIP] [aarch64] Add CUDA 12.4 build script for ARM wheel Apr 7, 2024
@atalman (Contributor) commented Apr 9, 2024

cc @snadampal

@johnnynunez

@tinglvv is this for grace or jetson or arm sbsa?

@tinglvv (Collaborator, Author) commented Apr 10, 2024

@tinglvv is this for grace or jetson or arm sbsa?

Hi, this is built for ARM SBSA, which is also used by Grace. The main purpose is to run this ARM+CUDA wheel on Grace CPUs. I have not tested on Jetson yet.

@tinglvv tinglvv changed the title [WIP] [aarch64] Add CUDA 12.4 build script for ARM wheel [aarch64] Add CUDA 12.4 build script for ARM wheel Apr 10, 2024
@johnnynunez

@tinglvv is this for grace or jetson or arm sbsa?

Hi, this is built for ARM SBSA, which is also used by Grace. The main purpose is to run this ARM+CUDA wheel on Grace CPUs. I have not tested on Jetson yet.

thanks @tinglvv very nice

@snadampal (Contributor) commented Apr 10, 2024

Hi @tinglvv, are you testing the wheel for CPU inference as well? Please let me know what you have covered so far.
By the way, I'm upgrading the ARM wheel builder to manylinux 2_28 with gcc-12 and no conda; here is my PR: #1781.
Would you be able to try these changes in that environment as well?

@tinglvv (Collaborator, Author) commented Apr 10, 2024

Hi @tinglvv, are you testing the wheel for CPU inference as well? Please let me know what you have covered so far. By the way, I'm upgrading the ARM wheel builder to manylinux 2_28 with gcc-12 and no conda; here is my PR: #1781. Would you be able to try these changes in that environment as well?

Hi, sorry, do you mean testing with CPU aarch64? I simply added CUDA support on top of the existing workflow for CPU aarch64. I have tested on Grace with the CUDA-related operations from test_ops.py.

Sure I can try with your PR and let you know.

ldconfig
}

function prune_124 {
Review comment (Collaborator):

What does NVPrune do?

Review comment (Collaborator):

@nWEIdia it removes GPU architectures that are never used from the libraries, to lower the binary size. This workflow can be dangerous if libraries depend on heuristics and can select kernels from the same GPU family (we've seen issues before where sm_61 was dropped, causing all kinds of issues on GTX cards).
I don't think the pruning is useful anymore, as we are using the CUDA dependencies from PyPI now.
However, we might want to keep it here and follow up with a cleanup in a separate PR.
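For context, a prune step like the `prune_124` function in the diff typically drives nvprune over the static CUDA libraries, keeping only device code for a whitelist of architectures. A hedged sketch of the pattern — the architecture list (80, 90) and the library path below are illustrative assumptions, not the exact values in this PR:

```shell
# Build the -gencode flag list that nvprune expects from a list of SM versions.
build_gencode_flags() {
  flags=""
  for arch in "$@"; do
    flags="$flags -gencode arch=compute_${arch},code=sm_${arch}"
  done
  printf '%s' "$flags"
}

# Hypothetical arch whitelist; the real script's list may differ.
GENCODE=$(build_gencode_flags 80 90)

# Prune in place, but only if nvprune and the target library actually exist.
if command -v nvprune >/dev/null 2>&1; then
  for lib in /usr/local/cuda/lib64/libcublas_static.a; do
    [ -f "$lib" ] || continue
    nvprune $GENCODE "$lib" -o "$lib"
  done
fi
```

This is exactly the behavior ptrblck warns about: any SM version missing from the whitelist is stripped from the binary for good.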

@ptrblck (Collaborator) commented Apr 11, 2024

Are you testing the wheel for CPU inference as well? Please let me know what you have covered so far.
By the way, I'm upgrading the ARM wheel builder to manylinux 2_28 with gcc-12 and no conda; here is my PR: #1781.
Would you be able to try these changes in that environment as well?

@snadampal I'm unsure if this request is related to your PR or this change. If your PR needs additional testing, could you please tag us there?

Tests pass, so can we merge it, @atalman @malfet, or is your review pending?

@snadampal (Contributor):

Hi @ptrblck, my PR is to migrate the whole aarch64 CI to manylinux 2_28 and remove conda packages. My request was to make sure these CUDA addition changes work there as well, so that there are no surprises when we make the switch.

@tinglvv (Collaborator, Author) commented Apr 11, 2024

Hi @ptrblck, my PR is to migrate the whole aarch64 CI to manylinux 2_28 and remove conda packages. My request was to make sure these CUDA addition changes work there as well, so that there are no surprises when we make the switch.

Hi @snadampal, I am using a few packages from the conda environment, https://github.com/pytorch/builder/pull/1775/files#diff-3f5d59f85dd25cceac14ee2e14f8eb83547a68fbf087fc6dcc9471c368ba0ac2R88-R90. If conda is removed, is there a place to get these libs?
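For context, on a conda-free manylinux 2_28 image (AlmaLinux 8-based) equivalents of those libraries would typically come from the system package manager instead. A rough setup sketch — the package names are my assumption, not taken from #1781, and OpenBLAS in particular may need the EPEL or CRB repository enabled:

```shell
# Hypothetical conda-free replacements for the libs referenced above;
# verify package names against the AlmaLinux 8 repositories.
dnf install -y epel-release
dnf install -y openblas-devel   # provides libopenblas.so
dnf install -y libgomp          # GNU OpenMP runtime (libgomp.so.1)
dnf install -y libgfortran      # Fortran runtime needed by OpenBLAS (libgfortran.so.5)
```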

@tinglvv (Collaborator, Author) commented Apr 11, 2024

Hi @snadampal, there are major changes in the files, which may delay the merge of this PR. The workflow for the cuda-aarch64 wheel does not exist yet, so the goal here is not to have a perfect workflow but to establish it first.

We will test your PR as the next step, as we iterate and resolve issues after the merge. Hope this works for you. Thanks.

@snadampal (Contributor):

Hi @tinglvv, the libraries you pointed out above are taken care of in the new CI scripts (without conda), so I think CUDA should work fine there as well. I'm fine with rebasing mine if this PR goes first, but I'd ask you to test CUDA again in my PR.

@bryantbiggs (Contributor):

Every time a package is moved out of conda, an angel gets its wings 😅

"/usr/local/cuda/lib64/libcudnn_ops_train.so.8",
"/opt/conda/envs/aarch64_env/lib/libopenblas.so.0",
"/opt/conda/envs/aarch64_env/lib/libgfortran.so.5",
"/opt/conda/envs/aarch64_env/lib/libgomp.so.1",
Review comment (Contributor):

Currently the scripts are packaging libomp.so.
Did you check the inference performance for any models? Is there any performance difference observed with libgomp vs libomp in the current wheels?

Review comment (Contributor):

I'm observing around a 10% performance drop for eager-mode inference with libgomp compared to libomp.
If you don't have a strong preference, I suggest keeping libomp till we know better.
For more details, check my comment here: #1774 (comment)

@Aidyn-A commented Apr 12, 2024:

Can you please do a follow-up on the libgomp-to-libomp migration in your PR #1781? It is not a trivial change and is certainly out of scope for this PR. I would like to underline that this PR is for enabling CUDA. libgomp has been used with PyTorch for a long time: it is reliable and nothing is wrong with it functionality-wise. Moreover, I do not want Ting to waste her time on debugging dependencies due to libomp in this PR.

@snadampal (Contributor) commented Apr 12, 2024:

Hi @Aidyn-A, currently the aarch64 wheels are linked to libomp, not libgomp:
https://pypi.org/project/torch/#files

My point was: why change it now, without a strong reason?

I have another PR to switch the wheels from libomp to libgomp, but it is currently blocked due to the 10% regression.

Review comment:

From what I am seeing, it comes with:

/usr/local/lib/python3.8/dist-packages/torch.libs/libgomp-0f9e2209.so.1.0.0

Review comment (Contributor):

I hope you are checking either the torch 2.2 or the nightly aarch64-linux wheel, because I am seeing:
libomp-b8e5bcfb.so => /home/ubuntu/.local/lib/python3.10/site-packages/torch/lib/./../../torch.libs/libomp-b8e5bcfb.so (0x0000ffffa8c30000)

complete list:

	linux-vdso.so.1 (0x0000ffffb298d000)
	libtorch_cpu.so => /home/ubuntu/.local/lib/python3.10/site-packages/torch/lib/./libtorch_cpu.so (0x0000ffffaa960000)
	libgcc_s.so.1 => /lib/aarch64-linux-gnu/libgcc_s.so.1 (0x0000ffffaa930000)
	libc.so.6 => /lib/aarch64-linux-gnu/libc.so.6 (0x0000ffffaa780000)
	/lib/ld-linux-aarch64.so.1 (0x0000ffffb2954000)
	libc10.so => /home/ubuntu/.local/lib/python3.10/site-packages/torch/lib/./libc10.so (0x0000ffffaa680000)
	librt.so.1 => /lib/aarch64-linux-gnu/librt.so.1 (0x0000ffffaa660000)
	libdl.so.2 => /lib/aarch64-linux-gnu/libdl.so.2 (0x0000ffffaa640000)
	libopenblasp-r0-f658af2e.3.25.so => /home/ubuntu/.local/lib/python3.10/site-packages/torch/lib/./../../torch.libs/libopenblasp-r0-f658af2e.3.25.so (0x0000ffffa8e10000)
	libm.so.6 => /lib/aarch64-linux-gnu/libm.so.6 (0x0000ffffa8d70000)
	libomp-b8e5bcfb.so => /home/ubuntu/.local/lib/python3.10/site-packages/torch/lib/./../../torch.libs/libomp-b8e5bcfb.so (0x0000ffffa8c30000)
	libpthread.so.0 => /lib/aarch64-linux-gnu/libpthread.so.0 (0x0000ffffa8c10000)
	libarm_compute-7362313d.so => /home/ubuntu/.local/lib/python3.10/site-packages/torch/lib/./../../torch.libs/libarm_compute-7362313d.so (0x0000ffffa8170000)
	libarm_compute_graph-15f701fb.so => /home/ubuntu/.local/lib/python3.10/site-packages/torch/lib/./../../torch.libs/libarm_compute_graph-15f701fb.so (0x0000ffffa8030000)
	libarm_compute_core-0793f69d.so => /home/ubuntu/.local/lib/python3.10/site-packages/torch/lib/./../../torch.libs/libarm_compute_core-0793f69d.so (0x0000ffffa7fe0000)
	libstdc++.so.6 => /lib/aarch64-linux-gnu/libstdc++.so.6 (0x0000ffffa7db0000)
	libgfortran-105e6576.so.5.0.0 => /home/ubuntu/.local/lib/python3.10/site-packages/torch/lib/./../../torch.libs/libgfortran-105e6576.so.5.0.0 (0x0000ffffa7c50000)
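Anyone can reproduce this check on their own install by classifying the ldd output of libtorch_cpu.so. A minimal sketch — the path layout in the usage comment follows the listing above and may differ per install:

```shell
# Classify which OpenMP runtime appears in a piece of ldd output.
# Pass the ldd output as the first argument.
detect_omp_runtime() {
  case "$1" in
    *libgomp*) echo "libgomp (GNU OpenMP)" ;;
    *libomp*)  echo "libomp (LLVM OpenMP)" ;;
    *)         echo "no OpenMP runtime found" ;;
  esac
}

# Usage against an installed torch wheel, e.g.:
#   lib="$(python -c 'import os, torch; print(os.path.join(os.path.dirname(torch.__file__), "lib", "libtorch_cpu.so"))')"
#   detect_omp_runtime "$(ldd "$lib")"
```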

Review comment (Contributor):

By the way, libomp wasn't intentionally chosen for the aarch64-linux wheels; I think it was picked up from the conda environment. If we all agree that libgomp is what is recommended for PyTorch, then I'm fine with switching to it now. In fact I've already suggested moving to GNU OpenMP (#1774) but am waiting mainly because of the 10% regression with it in eager mode.

Review comment (Contributor):

Now we have more data on libgomp vs libomp (please check #1774 (comment)), and I'm fine with switching to libgomp for aarch64 Linux.

@snadampal (Contributor):

Hi @tinglvv, the current CI doesn't cover the aarch64-linux workflow; all the current scripts are only for CD.
So passing the CI doesn't mean this PR was tested on aarch64 Linux.
I believe you tested locally; can you please share details on which Dockerfile was used for testing?
I mean, did you use the new Dockerfile added in this PR?
If yes, in its current form it's using the default gcc-12, and PyTorch and ACL 23.08 will not compile with gcc-12. I have added a few comments around that.

@tinglvv (Collaborator, Author) commented Apr 11, 2024

Hi @tinglvv, the libraries you pointed out above are taken care of in the new CI scripts (without conda), so I think CUDA should work fine there as well. I'm fine with rebasing mine if this PR goes first, but I'd ask you to test CUDA again in my PR.

I see, that sounds great, thanks for the clarification. Will test it again in your PR.

Hi @tinglvv, the current CI doesn't cover the aarch64-linux workflow; all the current scripts are only for CD. So passing the CI doesn't mean this PR was tested on aarch64 Linux. I believe you tested locally; can you please share details on which Dockerfile was used for testing? I mean, did you use the new Dockerfile added in this PR? If yes, in its current form it's using the default gcc-12, and PyTorch and ACL 23.08 will not compile with gcc-12. I have added a few comments around that.

Hi, I tested with the new Dockerfile in this PR. I downgraded to gcc-11 here: https://github.com/pytorch/builder/pull/1775/files#diff-83169a88982e6f1445d5ac8f76ef9e2dde63bfd0f502f3e536484328ff51d851R4. My workflow was:

  • Build the docker image:
    GPU_ARCH_TYPE=cuda-aarch64 GPU_ARCH_VERSION=12.4 manywheel/build_docker.sh
  • Run the image:
    docker run -it pytorch/manylinuxaarch64-builder:cuda-aarch64-main
  • Build the wheel:
    cd /builder/aarch64_linux && BASE_CUDA_VERSION=12.4 DESIRED_PYTHON=3.8 ./aarch64_ci_build.sh

Thanks for the review, let me address the comments.

@snadampal (Contributor) left a review comment (repeated above).

@malfet (Contributor) left a comment:

Please create a pull request against pytorch to validate that it works; see pytorch/pytorch#123747 for an example.

@snadampal (Contributor):

Hi @tinglvv, is this targeted for PyTorch 2.3.x?
Otherwise I will make the libgomp switch alone in a separate PR, because libgomp can go into PyTorch 2.3.x.

@tinglvv (Collaborator, Author) commented Apr 17, 2024

Hi @tinglvv, is this targeted for PyTorch 2.3.x? Otherwise I will make the libgomp switch alone in a separate PR, because libgomp can go into PyTorch 2.3.x.

Hi, we aim to merge this soon, so I believe it is targeting 2.3.x.

manywheel/build_docker.sh: review thread (outdated, resolved)
@malfet malfet merged commit a79e1ce into pytorch:main Apr 19, 2024
20 of 22 checks passed
@malfet (Contributor) commented Apr 19, 2024

Hi @tinglvv, is this targeted for PyTorch 2.3.x? Otherwise I will make the libgomp switch alone in a separate PR, because libgomp can go into PyTorch 2.3.x.

Hi, we aim to merge this soon, so I believe it is targeting 2.3.x.

Sorry, but the 2.3 feature-development cut-off date was mid-March, so no, this should not affect/target 2.3; 2.4 sounds like a good goal.

pytorchmergebot pushed a commit to pytorch/pytorch that referenced this pull request May 27, 2024
rebasing #124112.
too many conflict files, so starting a new PR.

Test pytorch/builder#1775 (merged) for ARM wheel addition
Test pytorch/builder#1828 (merged) for setting MAX_JOBS

Current issue to follow up:
#126980

Co-authored-by: Aidyn-A <aidyn.b.aitzhan@gmail.com>
Pull Request resolved: #126174
Approved by: https://github.com/nWEIdia, https://github.com/atalman
titaiwangms pushed a commit to titaiwangms/pytorch that referenced this pull request May 28, 2024 (same commit message as above).
10 participants