Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support dynamic linking between RAPIDS wheels #33

Open
1 of 11 tasks
vyasr opened this issue Apr 1, 2024 · 22 comments
Open
1 of 11 tasks

Support dynamic linking between RAPIDS wheels #33

vyasr opened this issue Apr 1, 2024 · 22 comments

Comments

@vyasr
Copy link
Contributor

vyasr commented Apr 1, 2024

Currently RAPIDS wheels adhere strictly to the manylinux policy. While the glibc/kernel ABI restrictions are not particularly onerous, the requirement that binary wheels be essentially self-contained and only depend on a small set of external shared libraries is problematic. To adhere to this restriction, RAPIDS wheels statically link (or in rare cases, bundle) all of their external library dependencies, leading to severe binary bloat. The biggest problem with this behavior is that the current sizes prohibit us from publishing our wheels on PyPI. Beyond that come the usual more infrastructural problems: longer CI times due to extra compilation, larger binaries making wheel download and installation slower, etc. The focus of this issue is to define a better solution than static linking for this problem that still adheres to the manylinux spec in spirit while reducing binary sizes. This issue will not address the usage of CUDA math library dynamic library wheels; that will be discussed separately.

Proposed Solution

RAPIDS should start publishing its C++ libraries as standalone wheels that can be pip installed independently from the Python(/Cython) wheels.These wheels should

  • Be py3 wheels (independent of Python version, except in rare cases like ucxx where we actually use the Python C API in the C++ library) that are built once per arch/CUDA major version
  • Continue to statically link to the CUDA runtime and math libraries
  • Contain a complete C++ dev library including CMake files, headers, and transitive dependencies. IOW these wheels should be suitable both for use both during compilation and at runtime.
  • Leverage scikit-build-core's entry point support to automate exposing their CMake to other packages building against them.

A key question to address is how to encode binary dependencies between wheels. One option is for each wheel to embed RPATHs pointing to the expected relative path to library dependencies in other wheels. This could be accomplished with some CMake to extract library locations from targets and then construct relative paths during the build based on the assumption that the packages are installed into a standard site-packages layout. However, since this approach is fragile and has generally been frowned upon by the Python community in the past, I suggest that we instead exploit dynamic loading to load the library on import of a package. This choice would make packages sensitive to import order (C++ wheels would need to be imported before any other extension module that links to them) but I think that's a reasonable price to pay since it only matters when depending on a C++ wheel. This solution also lets us handle the logic in Python, making it far easier to configure and control. Moreover, it will make the solution fairly composable when an extension module depends on a C++ wheel that depends on yet another C++ wheel.

Once these wheels exist, we should rewrite the existing Python packages to require the corresponding C++ wheels. The current approach of "find C++ if exists, build otherwise" can be scrapped in favor of always requiring that the C++ CMake package be found. Consumers will have the choice of installing the C++ library (e.g. from conda), building it from source, or installing the C++ wheel. The C++ wheel will become a hard dependency in pyproject.toml, so it will automatically be installed when building. In conda environments the pyproject dependencies are ignored, so the new wheels will not be installed, and similarly in devcontainer builds where requirements are generated dynamically from dependencies.yaml. Ultimately a pylibraft->libraft dependency will behave nearly identically to a raft-dask->pylibraft dependency from the perspective of dependency management.

Notes

  • Since the Python wheels will be dynamically linking to the C++ libraries, these wheels should be a lot closer to what we need in devcontainer/DLFW/PB2/etc builds. As a result we may be able to actually start using them there.

Repositories

@msarahan
Copy link

msarahan commented Apr 1, 2024

Contain a complete C++ dev library including CMake files, headers, and transitive dependencies. IOW these wheels should be suitable both for use both during compilation and at runtime.

How much space does this cost? I understand the simplicity benefits of doing it this way, but if our mission is to save space, why are we making this compromise?

@vyasr
Copy link
Contributor Author

vyasr commented Apr 1, 2024

I'd guess it'll be on the order of 25MB. IMO if after the other changes we're still that close to the 1GB limit on any package then I don't think removing these files would be a real solution since all it would take is adding one new arch etc to the compilation for us to be over the limit again.

What alternative would you suggest? That we build against RAPIDS dependencies installed in some other way and then specify a runtime dependency that contains only the libraries and nothing else?

Also I'd add that I would expect the bulk of those 25 MB to come from bundling CCCL, which we could fix by creating a wheel for rapids-core-dependencies as well.

@bdice
Copy link

bdice commented Apr 1, 2024

Another significant benefit of this approach would be that the marginal cost of building for more Python versions (e.g. 3.12) would be much smaller. The most significant build cost would be paid exactly once for the C++ wheel (rather than for each Python minor version) and then we could build for many Python minor versions at a significantly reduced resource cost.

@vyasr
Copy link
Contributor Author

vyasr commented Apr 1, 2024

Yes, that's definitely something else I considered. I was originally thinking of exposing the C++ library from the Python wheel directly, but one of the (multiple) reasons that tipped me towards a separate wheel was making the Python wheels relatively cheap to build.

@msarahan
Copy link

msarahan commented Apr 1, 2024

25 MB probably isn't enough to warrant extra complexity given the hundreds of MB we already are dealing with. We should definitely measure this stuff, though.

We can have build-requires and requires dependencies both be specified. The former being the one with dev stuff, the latter without. It's not as nice as Conda's run_exports, but same idea. Doable, with room for improvement in tooling.

@jakirkham
Copy link
Member

If we were able to have a thin Cython layer around each dependency that we used exclusively, we could use that in other packages and have the benefits of reduced library duplication/static linking

@vyasr
Copy link
Contributor Author

vyasr commented Apr 1, 2024

We can have build-requires and requires dependencies both be specified. The former being the one with dev stuff, the latter without. It's not as nice as Conda's run_exports, but same idea. Doable, with room for improvement in tooling.

Yes, I definitely think that's worth doing. I considered that but didn't want to include that as part of this proposal because that's a change that we should try to make concurrently with conda packaging (we don't split this in conda either). There's a writeup about this somewhere, I'll find it and share it.

@vyasr
Copy link
Contributor Author

vyasr commented Apr 1, 2024

If we were able to have a thin Cython layer around each dependency that we used exclusively, we could use that in other packages and have the benefits of reduced library duplication/static linking

I'm not sure I follow what you mean. How is this different from what's being proposed here, aside from adding a Cython wrapper? What would that Cython wrapper do?

@vyasr
Copy link
Contributor Author

vyasr commented Apr 2, 2024

On the subject of measurements, here's what I currently see locally:

[root@dt08 cpp_wheels]# ls -lh wheelhouse/
total 1.6G
-rw-r--r-- 1 root root 821M Apr  1 20:42 libcugraph-24.6.0-cp311-cp311-manylinux_2_17_x86_64.whl
-rw-r--r-- 1 root root 788M Apr  1 03:59 libraft-24.6.0-cp311-cp311-manylinux_2_17_x86_64.whl
-rw-r--r-- 1 root root 3.7M Apr  1 03:14 librmm-24.6.0-cp311-cp311-manylinux_2_17_x86_64.whl
-rw-r--r-- 1 root root 1.5M Apr  1 20:49 pylibcugraph-24.6.0-cp311-cp311-manylinux_2_17_x86_64.whl
-rw-r--r-- 1 root root 3.9M Apr  1 04:01 pylibraft-24.6.0-cp311-cp311-manylinux_2_17_x86_64.whl
-rw-r--r-- 1 root root 1.7M Apr  1 03:15 rmm-24.6.0-cp311-cp311-manylinux_2_17_x86_64.whl

We're under 1 GB with this! For context, the pylibcugraph and cugraph wheels I see from recent PRs is 1.47 GB. One major missing piece here is NCCL, which I expect will add ~100MB back to the size.

If I open up the wheels and look at their contents:

[root@dt08 wheelhouse]# du -sh unpacked_libcugraph/libcugraph/*
512     unpacked_libcugraph/libcugraph/VERSION
9.5K    unpacked_libcugraph/libcugraph/__init__.py
9.5K    unpacked_libcugraph/libcugraph/_version.py
19M     unpacked_libcugraph/libcugraph/include
1.1G    unpacked_libcugraph/libcugraph/lib64
9.5K    unpacked_libcugraph/libcugraph/load.py
[root@dt08 wheelhouse]# du -sh unpacked_libraft/libraft/*
512     unpacked_libraft/libraft/VERSION
9.5K    unpacked_libraft/libraft/__init__.py
9.5K    unpacked_libraft/libraft/_version.py
37M     unpacked_libraft/libraft/include
1.1G    unpacked_libraft/libraft/lib64
9.5K    unpacked_libraft/libraft/load.py
1.5K    unpacked_libraft/libraft/test
[root@dt08 wheelhouse]# du -sh unpacked_libcugraph/libcugraph/lib64/*
67K     unpacked_libcugraph/libcugraph/lib64/cmake
1.1G    unpacked_libcugraph/libcugraph/lib64/libcugraph.so
4.7M    unpacked_libcugraph/libcugraph/lib64/libcugraph_c.so
192K    unpacked_libcugraph/libcugraph/lib64/rapids
[root@dt08 wheelhouse]# du -sh unpacked_libraft/libraft/lib64/*
238K    unpacked_libraft/libraft/lib64/cmake
1.1G    unpacked_libraft/libraft/lib64/libraft.so
192K    unpacked_libraft/libraft/lib64/rapids

Definitely suggests that as I expected we wouldn't be benefiting much from trying to optimize the include directory, at least not unless we dramatically reduce library sizes somehow.

@jameslamb
Copy link
Member

Nice! I want to say this somewhere, here seems as good a place as any... since 1GB is a special value (a PyPI limit), I think as part of this work we should be enforcing that limit on wheels in CI across all the repos.

That could be done with that pydistcheck thing I made or with a shell script using du or similar. But either way, I think it'd be useful to catch "hey this artifact is gonna be too big" in CI instead of during publishing to PyPI.

@jameslamb
Copy link
Member

Adding a link to this highly-relevant conversation happening on the Python discourse over the last 2 weeks.

https://discuss.python.org/t/enforcing-consistent-metadata-for-packages/50008/28

Some quotes that really stood out to me

Multiple times when we’ve discussed size limit requests and questions like “why are packages using CUDA so large?”, the suggestion given to package authors to reduce binary size consumption is to split out the non-Python parts into a separate wheel, and depend on that in the main package (which uses the CPython C API and has to be built 5 times if one supports 5 Python 3.x versions)

and

Similarly, for native dependencies, NumPy and SciPy both vendor the libopenblas shared library (see pypackaging-native’s page on this for more details ). It takes up about 67% of the numpy wheel sizes, and ~40% of scipy wheel sizes. With four minor Python versions supported, that’s 8x the same thing being vendored. We’d actually really like to unvendor that, and already have a separate wheel: scipy-openblas32 · PyPI . However, depending on it is forbidden without marking everything as dynamic, which isn’t great. So we’ve done all the hard work, dealing with packaging, symbol mangling and supporting functionality to safely load a shared library from another wheel. But the blocker is that we cannot express the dependency (important, we don’t want to ship an sdist for scipy-openblas32, it’s really only about unvendoring a binary).

This conversation is closely related to PEP 725 - "Specifying external dependencies in pyproject.toml" (link)

@vyasr
Copy link
Contributor Author

vyasr commented Apr 17, 2024

Thanks James! We should probably chime in there at some point, but perhaps once we're a bit further along with our implementation.

@vyasr
Copy link
Contributor Author

vyasr commented Apr 20, 2024

One thing that we should keep in mind while implementing this feature is that it may cause problems for our usage of sccache in CI. After this change, C++ library dependencies will now be found in other wheels instead of being downloaded via CPM. While CPM's downloads will always go to the same path, wheels will instead be downloaded into a different ephemeral virtual environment during builds every time. If sccache sees the different path as a different dependency (i.e. if the path change results in a cache miss) then we will end up recompiling artifacts far more frequently than we should. I'm not sure if this is the case, so it's something we'll have to experiment with if nobody else knows for sure either (@trxcllnt, @ajschmidt8, or @robertmaynard might know this already). If it is an issue, there are two ways out of this:

  1. The sure path: turn off build isolation and install dependencies manually into the root environment or into a manually created venv in a specified directory. Either option will produce consistent paths.
  2. The easier path, if it's viable: sccache may allow configuration of the key to force it to use hashes of files (included headers and linked libraries) exclusively instead of paths, in which case we could just do that and not worry about the ephemeral path changes. @robertmaynard mentioned that this might exist.

@trxcllnt
Copy link

trxcllnt commented Apr 22, 2024

sccache doesn't allow overriding the computed hash (aside from respecting an additional envvar to hash with everything else), so you'll have to do --no-build-isolation and install dependencies into a consistent location. This is why this is why the devcontainers and DLFW builds do this.

@robertmaynard
Copy link

I was thinking about pre-processor mode (https://github.com/mozilla/sccache/blob/main/docs/Local.md#preprocessor-cache-mode ) but that only allows you to ignore the working directory in the hash, and not other directories.

Plus it doesn't work with non local backed caches...

@vyasr
Copy link
Contributor Author

vyasr commented Apr 23, 2024

OK yeah so be it, I figured no build isolation was where we'd end up but wanted to check. It would have been nice if sccache had added some feature that made this possible!

@jakirkham
Copy link
Member

cc @raydouglass (for awareness)

rapids-bot bot pushed a commit to rapidsai/rmm that referenced this issue May 3, 2024
This PR changes wheel building in RMM to create a separate C++ wheel that is then found from the Python wheel. The C++ wheel is now a hard dependency of the Python wheel. This allows Python packaging to more closely mirror the structure of our conda packaging, and the way we would normally wish to package these in any other manager. It also allows us to reduce package sizes by allowing better sharing of artifacts between different Python packages that rely on the same C++ components from other packages.

Contributes to rapidsai/build-planning#33

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Lawrence Mitchell (https://github.com/wence-)
  - Ray Douglass (https://github.com/raydouglass)
  - Bradley Dice (https://github.com/bdice)

URL: #1529
@bdice bdice mentioned this issue May 3, 2024
8 tasks
@vyasr
Copy link
Contributor Author

vyasr commented May 6, 2024

As part of this effort, we will probably want to start producing nvcomp wheels similarly to how we are now producing ucx wheels.

rapids-bot bot pushed a commit to rapidsai/raft that referenced this issue May 8, 2024
This PR changes wheel building in raft to create a separate C++ wheel that is then found from the Python wheel. The C++ wheel is now a hard dependency of the Python wheel. This allows Python packaging to more closely mirror the structure of our conda packaging, and the way we would normally wish to package these in any other manager. It also allows us to reduce package sizes by allowing better sharing of artifacts between different Python packages that rely on the same C++ components from other packages.

Contributes to rapidsai/build-planning#33

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Ray Douglass (https://github.com/raydouglass)
  - Divye Gala (https://github.com/divyegala)

URL: #2264
@vyasr
Copy link
Contributor Author

vyasr commented May 8, 2024

I've updated the list in the original message to include all the repositories for which we need to make this change. That should make it easier to identify outstanding work to pull off the stack.

@vyasr
Copy link
Contributor Author

vyasr commented May 10, 2024

After the merge of rapidsai/raft#2264, we discovered downstream issues in cuvs runs when using newer versions of pylibraft that leveraged the libraft wheel. These issues traced back to similar symbol visibility issues to those that have plagued us in a few different contexts due to raft supporting usage as both a precompiled and a header-only library. These problems have also manifested in the cugraph PR to build wheels.

The upshot here is that we will not be able to progress on this effort with raft or any of raft's dependents (cugraph/cuml) until the raft-cuvs split is completed and raft is made a header-only library. In addition, raft will have to ensure that all of its symbols are properly hidden, see rapidsai/raft#1722; the latter cannot really be completed until after the raft-cuvs split because we cannot manage the symbol hiding cleanly when a given function could be compiled into libraft.so (the precompiled raft lib) or into libX.so where X is some other library that is using raft as a header-only lib.

In the meantime, we can still make progress on other wheels that do not include raft in their dependency tree such as cudf, cuspatial, and ucxx.

@mmccarty
Copy link

See rapidsai/cuvs#113 for tracking the raft-cuvs split.

abc99lr pushed a commit to abc99lr/raft that referenced this issue May 10, 2024
This PR changes wheel building in raft to create a separate C++ wheel that is then found from the Python wheel. The C++ wheel is now a hard dependency of the Python wheel. This allows Python packaging to more closely mirror the structure of our conda packaging, and the way we would normally wish to package these in any other manager. It also allows us to reduce package sizes by allowing better sharing of artifacts between different Python packages that rely on the same C++ components from other packages.

Contributes to rapidsai/build-planning#33

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Ray Douglass (https://github.com/raydouglass)
  - Divye Gala (https://github.com/divyegala)

URL: rapidsai#2264
@robertmaynard
Copy link

After the merge of rapidsai/raft#2264, we discovered downstream issues in cuvs runs when using newer versions of pylibraft that leveraged the libraft wheel. These issues traced back to similar symbol visibility issues to those that have plagued us in a few different contexts due to raft supporting usage as both a precompiled and a header-only library. These problems have also manifested in the cugraph PR to build wheels.

We have replicated these issues outside of RAPIDS and have written a reproducer of the issue here: https://github.com/robertmaynard/cross_dso_example

One of the proposd solutions that @vyasr and I originally discussed was using -whole-archive to resolve this issue, but that exposed other issues in cublas. Which I have reported upstream (http://nvbugs/4649059/)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

8 participants