
use libucx wheels #1041

Merged: 36 commits into rapidsai:branch-0.38, May 10, 2024

Conversation

jameslamb (Member) commented May 6, 2024

Contributes to rapidsai/build-planning#57.

Similar to rapidsai/ucxx#226, proposes using the new UCX wheels from https://github.com/rapidsai/ucx-wheels, instead of vendoring system versions of libuc{m,p,s,t}.so.

Benefits of these changes

Allows users of ucx-py to avoid needing system installations of the UCX libraries.

Shrinks the ucx-py wheels by 6.7 MB compressed (77%) and 19.1 MB uncompressed (73%).

how I calculated that (click me)

Mounting in a directory with a wheel built from this branch...

docker run \
    --rm \
    -v $(pwd)/final_dist:/opt/work \
    -it python:3.10 \
    bash

pip install pydistcheck
pydistcheck --inspect /opt/work/*.whl
----- package inspection summary -----
file size
  * compressed size: 2.0M
  * uncompressed size: 7.0M
  * compression space saving: 71.3%
contents
  * directories: 10
  * files: 38 (2 compiled)
size by extension
  * .so - 6.9M (97.7%)
  * .py - 0.1M (2.0%)
  * .pyx - 9.3K (0.1%)
  * no-extension - 7.1K (0.1%)
  * .pyi - 3.9K (0.1%)
  * .c - 1.7K (0.0%)
  * .txt - 39.0B (0.0%)
largest files
  * (5.3M) ucp/_libs/ucx_api.cpython-310-x86_64-linux-gnu.so
  * (1.6M) ucp/_libs/arr.cpython-310-x86_64-linux-gnu.so
  * (36.3K) ucp/core.py
  * (20.3K) ucp/benchmarks/cudf_merge.py
  * (12.1K) ucp/benchmarks/send_recv.py

Compared to a recent nightly release.

pip download \
    -d /tmp/delete-me \
    --prefer-binary \
    --extra-index-url=https://pypi.anaconda.org/rapidsai-wheels-nightly/simple \
    'ucx-py-cu12>=0.38.0a'

pydistcheck --inspect /tmp/delete-me/*.whl
----- package inspection summary -----
file size
  * compressed size: 8.7M
  * uncompressed size: 26.1M
  * compression space saving: 66.8%
contents
  * directories: 11
  * files: 65 (21 compiled)
size by extension
  * .0 - 14.4M (55.4%)
  * .so - 8.4M (32.2%)
  * .a - 1.8M (6.7%)
  * .140 - 0.7M (2.5%)
  * .12 - 0.7M (2.5%)
  * .py - 0.1M (0.5%)
  * .pyx - 9.3K (0.0%)
  * no-extension - 7.3K (0.0%)
  * .la - 4.2K (0.0%)
  * .pyi - 3.9K (0.0%)
  * .c - 1.7K (0.0%)
  * .txt - 39.0B (0.0%)
largest files
  * (8.7M) ucx_py_cu12.libs/libucp-5720f0c9.so.0.0.0
  * (5.3M) ucp/_libs/ucx_api.cpython-310-x86_64-linux-gnu.so
  * (2.0M) ucx_py_cu12.libs/libucs-3c3009f0.so.0.0.0
  * (1.6M) ucp/_libs/arr.cpython-310-x86_64-linux-gnu.so
  * (1.5M) ucx_py_cu12.libs/libuct-2a15b69b.so.0.0.0
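For reference, the headline reduction figures come straight from those two summaries; a quick sanity check (plain Python, numbers copied from the pydistcheck output above):

```python
# sanity-check the size reductions quoted above, using the pydistcheck numbers (MB)
old_compressed, new_compressed = 8.7, 2.0
old_uncompressed, new_uncompressed = 26.1, 7.0

print(f"compressed:   -{old_compressed - new_compressed:.1f} MB "
      f"({1 - new_compressed / old_compressed:.0%} smaller)")
print(f"uncompressed: -{old_uncompressed - new_uncompressed:.1f} MB "
      f"({1 - new_uncompressed / old_uncompressed:.0%} smaller)")
```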

Notes for Reviewers

Left some comments on the diff describing specific design choices.

The libraries from the libucx wheel are only used if a system installation isn't available
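In other words, the intended precedence is: use a system installation of UCX if one exists, otherwise fall back to the copies shipped in the libucx wheel. A minimal sketch of that idea (illustrative only, not the exact code in this diff; it assumes the libucx wheel's load_library() helper):

```python
import ctypes


def _load_ucx_libraries():
    """Prefer a system installation of UCX; otherwise use the libucx wheel."""
    try:
        # if libuc{m,p,s,t} are installed on the system, the dynamic loader
        # finds them on its normal search path
        ctypes.CDLL("libucs.so", mode=ctypes.RTLD_GLOBAL)
    except OSError:
        # no system installation: load the copies bundled in the libucx wheel
        import libucx  # provided by the libucx-cu11 / libucx-cu12 wheels
        libucx.load_library()
```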

Built a wheel in a container using the same image used here in CI.

docker run \
    --rm \
    --gpus 1 \
    --env-file "${HOME}/.aws/creds.env" \
    --env CI=true \
    -v $(pwd):/opt/work \
    -w /opt/work \
    -it rapidsai/ci-wheel:cuda12.2.2-rockylinux8-py3.10 \
    bash

ci/build_wheel.sh

Found that the libraries from the libucx wheel are correctly found at build time, and are later found at import time.

using 'rapidsai/citestwheel' image and LD_DEBUG (click me)
# run a RAPIDS wheel-testing container, mount in the directory with the built wheel
docker run \
    --rm \
    --gpus 1 \
    -v $(pwd)/final_dist:/opt/work \
    -w /opt/work \
    -it rapidsai/citestwheel:cuda12.2.2-ubuntu22.04-py3.10 \
    bash

rapidsai/citestwheel does NOT have the UCX libraries installed at /usr/lib*.

find /usr -name 'libucm.so*'
# (empty)

Installed the ucx-py wheel.

# install the wheel
pip install ./*.whl

# now libuc{m,p,s,t} are found in site-packages
find /usr -name 'libucm.so*'
# (empty)

find /pyenv -name 'libucm.so*'
# /pyenv/versions/3.10.14/lib/python3.10/site-packages/libucx/lib/libucm.so.0.0.0
# /pyenv/versions/3.10.14/lib/python3.10/site-packages/libucx/lib/libucm.so.0
# /pyenv/versions/3.10.14/lib/python3.10/site-packages/libucx/lib/libucm.so

# try importing ucx-py and track where 'ld' finds the ucx libraries
LD_DEBUG="files,libs" LD_DEBUG_OUTPUT=out.txt \
python -c "from ucp._libs import arr"

# 'ld' creates multiple files... combine them to 1 for easier searching
cat out.txt.* > out-full.txt

In that output, I saw that ld looked up libucs.so first.

It searched all the system paths before finally finding it in the libucx wheel.

1037:	file=libucs.so [0];  dynamically loaded by /pyenv/versions/3.10.14/lib/python3.10/lib-dynload/_ctypes.cpython-310-x86_64-linux-gnu.so [0]
      1037:	find library=libucs.so [0]; searching
      1037:	 search path=		(LD_LIBRARY_PATH)
      1037:	 search path=/pyenv/versions/3.10.14/lib		(RUNPATH from file /pyenv/versions/3.10.14/bin/python)
      1037:	  trying file=/pyenv/versions/3.10.14/lib/libucs.so
      1037:	 search cache=/etc/ld.so.cache
      1037:	 search path=/lib/x86_64-linux-gnu/glibc-hwcaps/x86-64-v3:/lib/x86_64-linux-gnu/glibc-hwcaps/x86-64-v2:/lib/x86_64-linux-gnu/tls/haswell/x86_64:/lib/x86_64-linux-gnu/tls/haswell:/lib/x86_64-linux-gnu/tls/x86_64:/lib/x86_64-linux-gnu/tls:/lib/x86_64-linux-gnu/haswell/x86_64:/lib/x86_64-linux-gnu/haswell:/lib/x86_64-linux-gnu/x86_64:/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu/glibc-hwcaps/x86-64-v3:/usr/lib/x86_64-linux-gnu/glibc-hwcaps/x86-64-v2:/usr/lib/x86_64-linux-gnu/tls/haswell/x86_64:/usr/lib/x86_64-linux-gnu/tls/haswell:/usr/lib/x86_64-linux-gnu/tls/x86_64:/usr/lib/x86_64-linux-gnu/tls:/usr/lib/x86_64-linux-gnu/haswell/x86_64:/usr/lib/x86_64-linux-gnu/haswell:/usr/lib/x86_64-linux-gnu/x86_64:/usr/lib/x86_64-linux-gnu:/lib/glibc-hwcaps/x86-64-v3:/lib/glibc-hwcaps/x86-64-v2:/lib/tls/haswell/x86_64:/lib/tls/haswell:/lib/tls/x86_64:/lib/tls:/lib/haswell/x86_64:/lib/haswell:/lib/x86_64:/lib:/usr/lib/glibc-hwcaps/x86-64-v3:/usr/lib/glibc-hwcaps/x86-64-v2:/usr/lib/tls/haswell/x86_64:/usr/lib/tls/haswell:/usr/lib/tls/x86_64:/usr/lib/tls:/usr/lib/haswell/x86_64:/usr/lib/haswell:/usr/lib/x86_64:/usr/lib		(system search path)
      1037:	  trying file=/lib/x86_64-linux-gnu/glibc-hwcaps/x86-64-v3/libucs.so
      1037:	  trying file=/lib/x86_64-linux-gnu/glibc-hwcaps/x86-64-v2/libucs.so
      1037:	  trying file=/lib/x86_64-linux-gnu/tls/haswell/x86_64/libucs.so
      1037:	  trying file=/lib/x86_64-linux-gnu/tls/haswell/libucs.so
      1037:	  trying file=/lib/x86_64-linux-gnu/tls/x86_64/libucs.so
      1037:	  trying file=/lib/x86_64-linux-gnu/tls/libucs.so
      1037:	  trying file=/lib/x86_64-linux-gnu/haswell/x86_64/libucs.so
      1037:	  trying file=/lib/x86_64-linux-gnu/haswell/libucs.so
      1037:	  trying file=/lib/x86_64-linux-gnu/x86_64/libucs.so
      1037:	  trying file=/lib/x86_64-linux-gnu/libucs.so
      1037:	  trying file=/usr/lib/x86_64-linux-gnu/glibc-hwcaps/x86-64-v3/libucs.so
      1037:	  trying file=/usr/lib/x86_64-linux-gnu/glibc-hwcaps/x86-64-v2/libucs.so
      1037:	  trying file=/usr/lib/x86_64-linux-gnu/tls/haswell/x86_64/libucs.so
      1037:	  trying file=/usr/lib/x86_64-linux-gnu/tls/haswell/libucs.so
      1037:	  trying file=/usr/lib/x86_64-linux-gnu/tls/x86_64/libucs.so
      1037:	  trying file=/usr/lib/x86_64-linux-gnu/tls/libucs.so
      1037:	  trying file=/usr/lib/x86_64-linux-gnu/haswell/x86_64/libucs.so
      1037:	  trying file=/usr/lib/x86_64-linux-gnu/haswell/libucs.so
      1037:	  trying file=/usr/lib/x86_64-linux-gnu/x86_64/libucs.so
      1037:	  trying file=/usr/lib/x86_64-linux-gnu/libucs.so
      1037:	  trying file=/lib/glibc-hwcaps/x86-64-v3/libucs.so
      1037:	  trying file=/lib/glibc-hwcaps/x86-64-v2/libucs.so
      1037:	  trying file=/lib/tls/haswell/x86_64/libucs.so
      1037:	  trying file=/lib/tls/haswell/libucs.so
      1037:	  trying file=/lib/tls/x86_64/libucs.so
      1037:	  trying file=/lib/tls/libucs.so
      1037:	  trying file=/lib/haswell/x86_64/libucs.so
      1037:	  trying file=/lib/haswell/libucs.so
      1037:	  trying file=/lib/x86_64/libucs.so
      1037:	  trying file=/lib/libucs.so
      1037:	  trying file=/usr/lib/glibc-hwcaps/x86-64-v3/libucs.so
      1037:	  trying file=/usr/lib/glibc-hwcaps/x86-64-v2/libucs.so
      1037:	  trying file=/usr/lib/tls/haswell/x86_64/libucs.so
      1037:	  trying file=/usr/lib/tls/haswell/libucs.so
      1037:	  trying file=/usr/lib/tls/x86_64/libucs.so
      1037:	  trying file=/usr/lib/tls/libucs.so
      1037:	  trying file=/usr/lib/haswell/x86_64/libucs.so
      1037:	  trying file=/usr/lib/haswell/libucs.so
      1037:	  trying file=/usr/lib/x86_64/libucs.so
      1037:	  trying file=/usr/lib/libucs.so
      1037:	
      1037:	file=/pyenv/versions/3.10.14/lib/python3.10/site-packages/libucx/lib/libucs.so [0];  dynamically loaded by /pyenv/versions/3.10.14/lib/python3.10/lib-dynload/_ctypes.cpython-310-x86_64-linux-gnu.so [0]
      1037:	file=/pyenv/versions/3.10.14/lib/python3.10/site-packages/libucx/lib/libucs.so [0];  generating link map
      1037:	  dynamic: 0x00007f4ce42d7c80  base: 0x00007f4ce427e000   size: 0x000000000006fda0
      1037:	    entry: 0x00007f4ce4290ce0  phdr: 0x00007f4ce427e040  phnum:                 1

Then the others were found via the RPATH entries on libucs.so.

libucm.so.0:

       196:	file=libucm.so.0 [0];  needed by /pyenv/versions/3.10.14/lib/python3.10/site-packages/libucx/lib/libucs.so [0]
       196:	find library=libucm.so.0 [0]; searching
       196:	 search path=...redacted...:/pyenv/versions/3.10.14/lib/python3.10/site-packages/libucx/lib		(RPATH from file /pyenv/versions/3.10.14/lib/python3.10/site-packages/libucx/lib/libucs.so)
      ...

However, the libraries from the libucx wheel appear to be the last place ld searches. That means that on a system that already has libuc{m,p,s,t} installed, those system libraries will be loaded instead of the ones from the wheel.

using 'rapidsai/ci-wheel' image and LD_DEBUG (click me)
docker run \
    --rm \
    --gpus 1 \
    -v $(pwd)/final_dist:/opt/work \
    -w /opt/work \
    -it rapidsai/ci-wheel:cuda12.2.2-rockylinux8-py3.10 \
    bash

rapidsai/ci-wheel has the UCX libraries installed at /usr/lib64.

find /usr/ -name 'libucm.so*'
# /usr/lib64/libucm.so.0.0.0
# /usr/lib64/libucm.so.0
# /usr/lib64/libucm.so

Installed a wheel and tried to import from it.

pip install ./*.whl

LD_DEBUG="files,libs" LD_DEBUG_OUTPUT=out.txt \
python -c "from ucp._libs import arr"

cat out.txt.* > out-full.txt

In that situation, I saw the system libraries found before the ones from the wheel.

       226:	file=libucs.so [0];  dynamically loaded by /pyenv/versions/3.10.14/lib/python3.10/lib-dynload/_ctypes.cpython-310-x86_64-linux-gnu.so [0]
       226:	find library=libucs.so [0]; searching
       226:	 search path=/pyenv/versions/3.10.14/lib		(RPATH from file /pyenv/versions/3.10.14/bin/python)
       226:	  trying file=/pyenv/versions/3.10.14/lib/libucs.so
       226:	 search path=/pyenv/versions/3.10.14/lib		(RPATH from file /pyenv/versions/3.10.14/bin/python)
       226:	  trying file=/pyenv/versions/3.10.14/lib/libucs.so
       226:	 search path=/opt/rh/gcc-toolset-11/root/usr/lib64/tls:/opt/rh/gcc-toolset-11/root/usr/lib64:/opt/rh/gcc-toolset-11/root/usr/lib		(LD_LIBRARY_PATH)
       226:	  trying file=/opt/rh/gcc-toolset-11/root/usr/lib64/tls/libucs.so
       226:	  trying file=/opt/rh/gcc-toolset-11/root/usr/lib64/libucs.so
       226:	  trying file=/opt/rh/gcc-toolset-11/root/usr/lib/libucs.so
       226:	 search cache=/etc/ld.so.cache
       226:	  trying file=/usr/lib64/libucs.so

In this case, when the system libraries are available, site-packages/libucx/lib isn't even searched.
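Besides LD_DEBUG, a lower-tech way to double-check which copies actually end up in the process is to look at /proc/self/maps after the import (a sketch, Linux-only):

```python
# import ucx-py (which triggers the UCX library loads), then list every
# libuc* shared object currently mapped into this process
from ucp._libs import arr  # noqa: F401

with open("/proc/self/maps") as maps:
    paths = {line.split()[-1] for line in maps if "libuc" in line}

for path in sorted(paths):
    print(path)
```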

To rule out any RAPIDS-specific setup tricking me, I also tried in a generic python:3.10 image. Found that the library could be loaded and all the libuc{m,p,s,t} libraries from the libucx wheel were found 🎉.

using 'python:3.10' image and LD_DEBUG (click me)
docker run \
    --rm \
    --gpus 1 \
    -v $(pwd)/final_dist:/opt/work \
    -w /opt/work \
    -it python:3.10 \
    bash

pip install \
    --extra-index-url=https://pypi.anaconda.org/rapidsai-wheels-nightly/simple \
    ./*.whl

LD_DEBUG="files,libs" LD_DEBUG_OUTPUT=out.txt \
python -c "from ucp._libs import arr"

💥

        16:	opening file=/usr/local/lib/python3.10/site-packages/libucx/lib/libucm.so.0 [0]; direct_opencount=1
        16:	
        16:	opening file=/usr/local/lib/python3.10/site-packages/libucx/lib/libucs.so [0]; direct_opencount=1

@@ -4,6 +4,9 @@ build:
  os: "ubuntu-22.04"
  tools:
    python: "mambaforge-22.9"
  jobs:
    pre_install:
      - bash ci/build_docs_pre_install.sh
jameslamb (Member Author):

There is no pip-installable library libucx... it needs to be either libucx-cu11 or libucx-cu12.

Readthedocs builds are installing directly from the pyproject.toml checked into source control, which doesn't have those suffixes added.

ucx-py/.readthedocs.yml (Lines 8 to 11 in 03c864b):

python:
  install:
    - method: pip
      path: .

Resulting in this:

INFO: pip is looking at multiple versions of ucx-py to determine which version is compatible with other requirements. This could take a while.
ERROR: Could not find a version that satisfies the requirement libucx<1.16,>=1.15.0 (from ucx-py) (from versions: none)
ERROR: No matching distribution found for libucx<1.16,>=1.15.0

(example build link)

readthedocs doesn't allow customizing the pip install with arbitrary flags, e.g. by adding --no-deps (docs).

So I think this pre-install script to fix the suffixes is the best option to keep those builds working. For more details on how it works, see https://docs.readthedocs.io/en/stable/build-customization.html.

This comment was marked as resolved.

jameslamb (Member Author):

Thanks for the suggestion, I hadn't considered trying to get this into the conda environment file that way!

But I don't think that will work. conda converts the pip: section in an env.yaml into a requirements.txt, and you can only supply arguments that'd be valid in a requirements file passed to pip.

various forms of that I tried (click me)

There are some issues tracking the request "forward custom flags to pip in conda environments" but none have resulted in changes to conda:

That means it's not possible to include arguments like --no-deps.

# env.yaml
name: delete-me
channels:
  - conda-forge
dependencies:
  - pip
  - pip:
      - --no-deps -e ./

conda env create --name delete-me --file ./env.yaml

yields the following

Installing pip dependencies: - Ran pip subprocess with arguments:
['/raid/jlamb/miniforge/envs/delete-me/bin/python', '-m', 'pip', 'install', '-U', '-r', '/raid/jlamb/repos/ucx-py/condaenv.p9aa8j_k.requirements.txt', '--exists-action=b']
Pip subprocess output:

Pip subprocess error:
Usage: __main__.py [options]

ERROR: Invalid requirement: --no-deps -e ./
__main__.py: error: no such option: --no-deps

And without that --no-deps, conda tries to install all the dependencies of the package, resulting in

Pip subprocess error:
  error: subprocess-exited-with-error
  
  × pip subprocess to install build dependencies did not run successfully.
  │ exit code: 1
  ╰─> [4 lines of output]
      Collecting cython>=3.0.0
        Using cached Cython-3.0.10-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.2 kB)
      ERROR: Could not find a version that satisfies the requirement libucx==1.15.0 (from versions: none)
      ERROR: No matching distribution found for libucx==1.15.0
      [end of output]

The only other thing I can think of is setting the environment variable PIP_NO_DEPS=true in the build environment. I tried that using conda's support for environment variables (docs), but I don't think those affect the build... just the environment's activation.

name: delete-me
channels:
  - conda-forge
dependencies:
  - pip
  - pip:
      - -e ./
variables:
  PIP_NO_DEPS: "true"

I see 2 options for this:

I think that environment variable option would be preferable, actually. This project has a small number of top-level dependencies, so keeping the necessary ones in the conda environment file isn't a big lift. @vyasr what do you think?

Member:

I don't have a preference, but I can set PIP_NO_DEPS=true if you two agree that's the preferred way.

Contributor:

I agree, that seems a bit cleaner to me.

Did you try PIP_NO_DEPENDENCIES? I don't remember which is the right name (or maybe both work), and the behavior of true/false for pip's *_NO_* variables can be confusing, so it's worth double-checking all the options with the setting you proposed in the env.yaml before we set this on RTD. If all else fails, though, I'm fine with setting it on RTD.

jameslamb (Member Author):

Alright that failed (build link), I think because the environment variable you set up was marked "private".

From https://docs.readthedocs.io/en/stable/environment-variables.html#custom-env-var-exceptions

Custom environment variables that are not marked as Public will not be available in pull request builds

I just created a new public one and kicked off a rebuild... unfortunately that failed too 😭 (build link)

Thanks for adding me as a maintainer, I'll use those permissions to keep trying stuff.

jameslamb (Member Author):

Ahhhhh I forgot that PIP_NO_DEPENDENCIES does not stop pip install from trying to download build dependencies.

Processing /home/docs/checkouts/readthedocs.org/user_builds/ucx-py/checkouts/1041
  Installing build dependencies: started
  Installing build dependencies: finished with status 'error'
  error: subprocess-exited-with-error
  
  × pip subprocess to install build dependencies did not run successfully.
  │ exit code: 1
  ╰─> [4 lines of output]
      Collecting cython>=3.0.0
        Downloading Cython-3.0.10-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.2 kB)
      ERROR: Could not find a version that satisfies the requirement libucx==1.15.0 (from versions: none)
      ERROR: No matching distribution found for libucx==1.15.0
      [end of output]

(build link)

I just added PIP_NO_BUILD_ISOLATION="True" to the environment as well... still didn't work 😞

(build link)

I have one other idea, will post in a second.

jameslamb (Member Author):

To summarize... these things did not work:

  • set env variable PIP_NO_DEPS="True" in readthedocs builds
  • set env variable PIP_NO_BUILD_ISOLATION="True" in readthedocs builds
  • set both of the above in readthedocs builds
  • set any combination of those environment variables in variables: block in builddocs.yml conda env
  • pass pip: - --no-deps ../../ via builddocs.yml conda env
  • use a real, suffixed dependency name like libucx-cu11>=1.15.0 in pyproject.toml + add --extra-index-url=https://pypi.anaconda.org/rapidsai-wheels-nightly/simple in builddocs.yml conda env

That last one failing revealed a more significant issue, not limited to just docs building...

calling libucx.load_library() unconditionally means ucx-py will not be importable unless the process has access to a GPU 😱

RuntimeError: The CUDA driver library libcuda.so.1 was not found on your system. This library cannot be provided by the libucx wheel and must be installed separately.

(build link)

The docs DO BUILD successfully on the current state of this branch 🎉 (build link), but only because I added behavior to work around not having a GPU available.
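One way such a workaround can look (a simplified sketch of the idea only; the exception type and message come from the error above, and the actual change on the branch is not shown here):

```python
import warnings


def _try_load_bundled_ucx():
    try:
        import libucx
        libucx.load_library()
    except RuntimeError as err:
        # e.g. "The CUDA driver library libcuda.so.1 was not found on your system."
        warnings.warn(
            f"Could not pre-load the bundled UCX libraries ({err}); continuing anyway.",
            RuntimeWarning,
        )
```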

So @pentschev @vyasr that's a design decision for you.

When someone tries to import ucp without access to a GPU, what should happen?

  • the import should fail
  • the import should warn but succeed
  • something else

jameslamb (Member Author) commented May 10, 2024:

Summarizing our offline conversations:

  • yes ucx-py is expected to be usable without a GPU
  • the libucx wheels should stop raising exceptions when libucx.load_library() is called in an environment without a GPU

In response, we published a new libucx==1.15.0.post1 removing those exceptions: rapidsai/ucx-wheels#5

Tested it with ucxx on systems with and without a GPU, and it seemed to work well: rapidsai/ucx-wheels#5 (comment)

Those new wheels helped the docs builds here get further, but those builds exposed another issue... as of the changes in this PR, ucx-py was being compiled against the system-installed headers (e.g. /usr/include/ucm) but linked against the wheel-provided shared libraries (e.g. site-packages/libucx/lib/libucm.so.0).

That showed up in "symbol not found" issues like this:

OSError: ${HOME}/conda/1041/lib/python3.12/site-packages/libucx/lib/libucs_signal.so: undefined symbol: ucs_debug_is_error_signal

(build link)
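One quick way to see that kind of mismatch is to check whether the libucs.so that gets loaded actually exports the symbol named in the error (a sketch; the path below is illustrative):

```python
import ctypes

# illustrative path: the libucs.so shipped inside the libucx wheel
libucs = ctypes.CDLL("/path/to/site-packages/libucx/lib/libucs.so")

# headers and libraries from different UCX builds do not necessarily agree on
# internal symbols such as ucs_debug_is_error_signal; a missing symbol here
# shows up at import time as the "undefined symbol" error above
print(hasattr(libucs, "ucs_debug_is_error_signal"))
```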

I just pushed a change that fixes that: eba110f

And looks like the docs builds are happy and docs are rendering correctly (RTD build link) 🎉

Contributor:

Good catch on the headers! I forgot to check on those. Glad you found this issue.

rapids-bot bot pushed a commit that referenced this pull request May 7, 2024
Proposes removing the build-time dependency on `tomli` for wheels and conda packages.

Noticed that working on #1041. It doesn't appear to be used anywhere here.

```shell
git grep tomli
```

## Notes for Reviewers

That dependency was added back in #895. I'm not sure why, but I suspect it was related to the use of `versioneer` in this project at the time. Reference: python-versioneer/python-versioneer#338 (comment)

This project doesn't use `versioneer` any more (#931). I strongly suspect that the dependency on `tomli` can be removed.

Authors:
  - James Lamb (https://github.com/jameslamb)

Approvers:
  - https://github.com/jakirkham
  - Ray Douglass (https://github.com/raydouglass)

URL: #1042
jameslamb changed the title from "WIP: use libucx wheels" to "use libucx wheels" on May 7, 2024
pentschev (Member) left a comment:

Overall this looks fine to me. I'll still say @vyasr should also give this a pass, as he's definitely going to be able to comment on details that would pass unnoticed by me.

@jameslamb jameslamb marked this pull request as ready for review May 7, 2024 17:13
@jameslamb jameslamb requested review from a team as code owners May 7, 2024 17:13
vyasr pushed a commit to rapidsai/ucx-wheels that referenced this pull request May 9, 2024
Contributes to rapidsai/build-planning#57.

`libucx.load_library()` defined here tries to pre-load `libcuda.so` and
`libnvidia-ml.so`, to raise an informative error (instead of a cryptic
one from a linker) if someone attempts to use the libraries from this
wheel on a system without a GPU.

Some of the projects using these wheels, like `ucxx` and `ucx-py`, are
expected to be usable on systems without a GPU. See
rapidsai/ucx-py#1041 (comment).

To avoid those libraries needing to try-catch these errors, this
proposes the following:

* removing those checks and deferring to downstream libraries to handle
the non-GPU case
* modifying the build logic so we can publish patched versions of these
wheels like `v1.15.0.post1`

### Notes for Reviewers

Proposing starting with `1.15.0.post1` right away, since that's the
version that `ucx-py` will use. I'm proposing the following sequence of
PRs here (assuming downstream testing goes well):

1. this one
2. another changing the version to `1.14.0.post1`
3. another changing the version to `1.16.0.post1`

set -euo pipefail

sed -r -i "s/libucx==(.*)\"/libucx-cu12==\1\"/g" ./pyproject.toml
jameslamb (Member Author):

I'm proposing bringing back this script for docs builds, but having it substitute in the real, suffixed name libucx-cu12 so that docs builds will actually install it.

I think that's necessary to meet these 3 constraints:

  • docs builds run pip install . to install ucx-py
  • fallback matrices in dependencies.yaml need to be the unsuffixed names (e.g. libucx not libucx-cu12)
  • pre-commit hook running rapids-dependency-file-generator is going to put into pyproject.toml whatever is in the fallback matrices in dependencies.yaml
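For illustration, the rewrite that the sed command above performs, applied to an unsuffixed pin as it might appear in pyproject.toml (hypothetical input line, shown here in Python rather than sed):

```python
import re

# an unsuffixed pin as rapids-dependency-file-generator might write it (hypothetical)
line = '"libucx==1.15.0",'

# same substitution as: sed -r "s/libucx==(.*)\"/libucx-cu12==\1\"/g"
print(re.sub(r'libucx==(.*)"', r'libucx-cu12==\1"', line))
# -> "libucx-cu12==1.15.0",
```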

vyasr pushed a commit to rapidsai/ucx-wheels that referenced this pull request May 10, 2024
Follow-up to #5.

Proposes publishing a `1.14.1.post1` version, identical to version
`1.14.1` except that `load_library()` will no longer raise exceptions in
non-GPU environments.

## Notes for Reviewers

Just putting this up to get in the CI run. Should probably wait to merge
it until testing on rapidsai/ucx-py#1041 is
done.
vyasr (Contributor) commented May 10, 2024:

/merge

@rapids-bot rapids-bot bot merged commit 3d7be74 into rapidsai:branch-0.38 May 10, 2024
39 checks passed
rapids-bot bot pushed a commit to rapidsai/ucxx that referenced this pull request May 10, 2024
Contributes to rapidsai/build-planning#57.

Follow-up to #226.

Proposes the following changes for wheel builds:

* removing system-installed UCX *headers*
* making the code to remove system-installed UCX libraries a bit more specific
   - *(to minimize the risk of accidentally deleting some non-UCX thing whose name matches the pattern `libuc*`)*

## Notes for Reviewers

Before applying similar changes to `ucx-py`, I noticed it being compiled with the system-installed headers but then linking against the libraries provided by the `libucx` wheels: rapidsai/ucx-py#1041 (comment)

This change should reduce the risk of that happening.

### How I tested this

Poked around the filesystem that `build_wheel.sh` runs in by pulling one of our standard wheel-building container images used in CI.

```shell
docker run \
    --rm \
    -v $(pwd):/opt/work \
    -w /opt/work \
    -it rapidsai/ci-wheel:cuda12.2.2-rockylinux8-py3.10 \
    bash

find /usr -type f -name 'libucm*'
# /usr/lib64/libucm.la
# /usr/lib64/libucm.a
# /usr/lib64/libucm.so.0.0.0
# /usr/lib64/ucx/libucm_cuda.a
# /usr/lib64/ucx/libucm_cuda.la
# /usr/lib64/ucx/libucm_cuda.so.0.0.0

find /usr -type d -name 'uct'
# /usr/include/uct
```

Authors:
  - James Lamb (https://github.com/jameslamb)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Ray Douglass (https://github.com/raydouglass)

URL: #230
@jameslamb jameslamb deleted the ucx-wheels branch May 10, 2024 20:03
rapids-bot bot pushed a commit to rapidsai/raft that referenced this pull request May 13, 2024
With rapidsai/ucx-py#1041 merged, UCX wheels are now fixed, and thus re-enabling the raft-dask wheel tests that require UCX-Py should be safe.

Authors:
  - Peter Andreas Entschev (https://github.com/pentschev)

Approvers:
  - James Lamb (https://github.com/jameslamb)
  - Jake Awe (https://github.com/AyodeAwe)

URL: #2307