[deps] split ci_docgpu CPU/GPU depsets by ans9868 · Pull Request #62596 · ray-project/ray

ans9868 · 2026-04-14T04:45:16Z

Summary

Fixes the torch-spline-conv conflict in the docgpu depset by splitting the CPU and GPU variants into separate depsets, each with its own PyTorch wheel index.

What Changed

ci/raydepsets/configs/ci_docgpu.depsets.yaml: Split single ci_docgpu_depset into two:
- ci_docgpu_cpu_depset_${PYTHON_SHORT}: CPU-only with --index https://download.pytorch.org/whl/cpu
- ci_docgpu_gpu_depset_${PYTHON_SHORT}: GPU-only with --index https://download.pytorch.org/whl/cu128
ci/docker/docgpu.build.wanda.yaml (line 5): Updated lock reference to GPU variant (docgpu_gpu_depset_py$PYTHON.lock)
ci/docker/docgpu.build.Dockerfile (line 7): Updated lock reference to GPU variant (docgpu_gpu_depset_py$PYTHON.lock)

Why

PR #62485 introduced a single depset combining both CPU and GPU requirements, creating an unsolvable torch-spline-conv conflict:

Because you require torch-spline-conv==1.2.2+pt27cu128 and
torch-spline-conv==1.2.2+pt27cpu, your requirements are unsatisfiable.

Splitting into separate depsets with explicit indices (following the ci_ml pattern) resolves this.
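As a rough illustration of why the combined depset cannot resolve, the sketch below groups exact pins by package name and flags any package pinned to more than one version string. The helper name `find_conflicting_pins` is hypothetical, and this is not how the resolver or raydepsets detects conflicts; it only mirrors the idea that `1.2.2+pt27cpu` and `1.2.2+pt27cu128` count as two different versions of the same package.

```python
from collections import defaultdict


def find_conflicting_pins(requirements):
    """Return {package: sorted versions} for packages pinned to more than one version.

    Hypothetical helper for illustration only; a real resolver does far more.
    """
    pins = defaultdict(set)
    for line in requirements:
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" not in line:
            continue  # only exact pins matter for this sketch
        name, version = line.split("==", 1)
        pins[name.strip().lower()].add(version.strip())
    return {name: sorted(versions) for name, versions in pins.items() if len(versions) > 1}


# Pins taken from the resolver error quoted above.
combined = [
    "torch-spline-conv==1.2.2+pt27cu128",
    "torch-spline-conv==1.2.2+pt27cpu",
]
print(find_conflicting_pins(combined))

# With only one variant present, as in a split depset, nothing is flagged.
print(find_conflicting_pins(["torch-spline-conv==1.2.2+pt27cpu"]))
```

Running the check on each variant separately reports no conflict, which is the effect the split is meant to have.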

Note

This PR includes the configuration and Docker file changes. Lock files will be regenerated and committed in a separate follow-up PR because:

- Lock file generation requires running bazel run //ci/raydepsets:raydepsets -- build, which compiles all dependencies and exposes a pre-existing etils version conflict (etils==1.5.2 in dl-cpu-requirements.txt vs etils==1.14.0 in the constraint file).
- The architectural fix (config split) is complete and correct regardless of lock file state. It can merge immediately while the etils conflict is resolved separately.
- This keeps the PR focused: architectural changes now, lock regeneration later once etils is fixed.

Fixes the torch-spline-conv conflict introduced in PR ray-project#62485 by splitting the single ci_docgpu_depset into separate CPU and GPU variants:
- ci_docgpu_cpu_depset: CPU-only with --index https://download.pytorch.org/whl/cpu
- ci_docgpu_gpu_depset: GPU-only with --index https://download.pytorch.org/whl/cu128

Update Docker build files to reference the GPU lock only (docgpu_gpu_depset_py.lock).

This follows the proven raydepsets pattern used by ci_ml_build_depset (CPU) and ci_ml_gpubuild_depset (GPU).

Note: Lock file regeneration is blocked by a pre-existing etils version conflict (separate issue). Lock files will be committed once that is resolved.

Closes ray-project#62595

Signed-off-by: Adel Nour <ans9868@nyu.edu>

aslonnie

@elliot-barn could you help review this?

Signed-off-by: Nour999 <130527901+ans9868@users.noreply.github.com>

cursor · 2026-04-14T16:27:47Z

+      - py310
+      - py312
+    pre_hooks:
+      - ci/raydepsets/pre_hooks/remove-compiled-headers.sh 3.13


CPU depset generates lock files nothing consumes

Low Severity

The ci_docgpu_cpu_depset_${PYTHON_SHORT} depset generates docgpu_cpu_depset_py${PYTHON_VERSION}.lock files, but no Dockerfile or wanda config references them — only the GPU variant is consumed by docgpu.build.Dockerfile and docgpu.build.wanda.yaml. This differs from the ci_ml pattern being followed, where both CPU and GPU lock files are consumed via BUILD_VARIANT. The CPU depset will cost CI time to compile in the follow-up lock generation without being used.

Reviewed by Cursor Bugbot for commit ca1881f.
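The finding above amounts to a reference check: does any consumer mention each depset's lock file? A minimal sketch of such a check follows. The helper `unreferenced_depsets` is hypothetical, and the consumer strings are just the lock references quoted in the PR description, not the full contents of the Dockerfile or wanda config.

```python
def unreferenced_depsets(depset_stems, consumer_texts):
    """Return the depset stems that no consumer text mentions.

    Hypothetical helper; a plain substring match is enough for this sketch.
    """
    return [
        stem
        for stem in depset_stems
        if not any(stem in text for text in consumer_texts)
    ]


# Lock file stems produced by the two depsets in this PR.
stems = ["docgpu_cpu_depset", "docgpu_gpu_depset"]

# Lock references as quoted in the PR description (illustrative stand-ins
# for docgpu.build.wanda.yaml and docgpu.build.Dockerfile).
consumers = [
    "docgpu_gpu_depset_py$PYTHON.lock",
    "docgpu_gpu_depset_py$PYTHON.lock",
]

print(unreferenced_depsets(stems, consumers))
```

Under these assumptions only the CPU stem is reported, which matches the reviewer's observation.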

Adds missing CPU-only packages (jax, torchmetrics, torchtext, etils, etc.) to GPU depset via new docgpu_gpu_additions.txt file. This avoids merging full dl-cpu with dl-gpu in one compile, preventing torch-spline-conv conflict while ensuring GPU image has all needed packages.

Updates ci_docgpu.depsets.yaml GPU variant to reference both:
- python/requirements/ml/py313/dl-gpu-requirements.txt (GPU PyTorch/PyG)
- python/requirements/ml/py313/docgpu_gpu_additions.txt (missing CPU-only packages)

INCOMPLETE: GPU lock files (docgpu_gpu_depset_py3.10.lock, py3.12.lock) not yet generated. Docker build will fail until locks are committed in follow-up PR. Lock generation blocked by pre-existing etils version conflict (separate issue).

Signed-off-by: Adel Nour <ans9868@nyu.edu>

ans9868 · 2026-04-14T17:31:53Z

docgpu Split Fix — Work in Progress

Problem

PR #62485 merged CPU and GPU requirements into one depset, causing a torch-spline-conv conflict: the CPU build and the GPU cu128 build of the package cannot both be resolved in a single compile run.

Initial Approach

Split into two depsets with separate indices. However, a GPU depset built only from dl-gpu was missing jax, torchmetrics, and torchtext, which come from dl-cpu.

Current Solution

Added docgpu_gpu_additions.txt to pull non-conflicting CPU packages into the GPU depset. This avoids recreating the conflict that merging both full files causes, while ensuring the GPU image has all needed packages.

Status

The jax versions are now correct (jax==0.4.28 matches the jaxlib==0.4.28+cuda backend), but several unknowns remain:

- etils version conflict (pre-existing): py3.10 builds may fail on constraint mismatch
- JAX/jaxlib pairing untested until lock generation
- docgpu_gpu_additions compilation with cu128 index unvalidated
- GPU lock files not yet generated (blocked by etils)

P.S. Lock file generation is a separate issue. I will handle it once the etils conflict is resolved.

Depending on my schedule, I think I could fix this in 3-7 days. Feedback on this approach is welcome. This is trickier than I initially expected. More information about the full bug is in the issue: #62595

Edit: Completed the fix yesterday evening. Would love a review. More information is in the comment below.

Generate four lock files completing the ci_docgpu depset split:
- docgpu_cpu_depset_py3.{10,12}.lock: CPU PyTorch wheels
- docgpu_gpu_depset_py3.{10,12}.lock: GPU cu128 wheels

Remove old undivided docgpu_depset_py3.{10,12}.lock.

Also improve docgpu_gpu_additions.txt: add python_version < '3.13' guard to torchtext (no cp313 wheel exists for 0.18.0) and clarify comments.

Validated:
- torch-spline-conv: +pt27cpu in CPU locks, +pt27cu128 in GPU locks
- etils: 1.5.2 on py3.10, 1.14.0 on py3.12
- jaxlib: 0.4.28+cuda12.cudnn89 in both GPU locks

Fixes ray-project#62595

Signed-off-by: Adel Nour <ans9868@nyu.edu>

ans9868 · 2026-04-15T03:23:08Z

Split docgpu CPU/GPU depsets, regenerate lock files

Fixes the torch-spline-conv conflict introduced by #62485. That PR combined dl-cpu-requirements.txt and dl-gpu-requirements.txt in a single depset, but CPU and GPU wheels use different local version suffixes (+pt27cpu vs +pt27cu128), so the resolver sees two incompatible versions of the same package and fails.

Approach: split into two depsets (following the ci_ml pattern), each with its own resolver run and explicit index — CPU on .../whl/cpu, GPU on .../whl/cu128. Added docgpu_gpu_additions.txt to carry non-conflicting CPU-side packages (jax frontend, torchmetrics, torchtext, TF ecosystem) into the GPU depset without re-merging the conflicting indexes.
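One way to picture what goes into an additions file is a set difference: CPU-side requirement lines whose package does not already appear on the GPU side. The sketch below shows that idea only. The helpers `package_name` and `cpu_only_additions` are hypothetical, the input lists are small illustrative stand-ins rather than the real contents of dl-cpu-requirements.txt or dl-gpu-requirements.txt, and the actual docgpu_gpu_additions.txt may well have been curated by hand.

```python
import re


def package_name(requirement):
    """Extract a normalized project name from a requirement line, or None."""
    match = re.match(r"\s*([A-Za-z0-9][A-Za-z0-9._-]*)", requirement)
    return match.group(1).lower().replace("_", "-") if match else None


def cpu_only_additions(cpu_reqs, gpu_reqs):
    """Return CPU requirement lines whose package is absent from the GPU list."""
    gpu_names = {package_name(req) for req in gpu_reqs}
    additions = []
    for req in cpu_reqs:
        name = package_name(req)
        if name is not None and name not in gpu_names:
            additions.append(req)
    return additions


# Illustrative stand-ins, not the real requirement files.
cpu_reqs = [
    "torch-spline-conv==1.2.2+pt27cpu",
    "jax==0.4.28",
    "torchmetrics",
    "torchtext==0.18.0",
]
gpu_reqs = ["torch-spline-conv==1.2.2+pt27cu128"]

print(cpu_only_additions(cpu_reqs, gpu_reqs))
```

Because torch-spline-conv already has a GPU pin, its CPU pin is left out, so the conflicting suffix never enters the GPU resolver run.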

Lock files regenerated locally with:

bazel run //ci/raydepsets:raydepsets -- build ci/raydepsets/configs/ci_docgpu.depsets.yaml

Four new files committed:

docgpu_cpu_depset_py3.10.lock
docgpu_cpu_depset_py3.12.lock
docgpu_gpu_depset_py3.10.lock
docgpu_gpu_depset_py3.12.lock

Old single-depset locks removed:

docgpu_depset_py3.10.lock
docgpu_depset_py3.12.lock

CI failures (unrelated)

Two failures in buildkite/"core: python tests [g6_s5]", neither touching anything in this diff:

- test_placement_group_status[False]: IndexError in debug_status(...).split("Demands:")[1]. From my investigation, it fails on all 3 parallel shards with the same IndexError. The test calls ray status, parses the output by splitting on "Demands:", and takes the second half. With autoscaler v1 (enable_v2=False), that section never appears in the output within the 5-second timeout, so the split returns only one element and [1] crashes. The v2 variant (enable_v2=True) passes cleanly. Nothing in this PR touches the autoscaler or status output.
- test_async_shutdown (compiled graphs): thread timeout in work_queue.get(block=True), also pre-existing.
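The first failure mode is easy to reproduce in isolation. The sketch below uses made-up strings, not real `ray status` output, to show why `split("Demands:")[1]` raises when the marker is missing, and how `str.partition` can report the absence instead. The helper `demands_section` is hypothetical and is not a proposed change to the test.

```python
def demands_section(status_output):
    """Return the text after 'Demands:', or None when the marker is absent."""
    _, marker, tail = status_output.partition("Demands:")
    return tail if marker else None


# Made-up output that lacks the "Demands:" section.
without_marker = "Resources\n 0.0/8.0 CPU\n"

try:
    without_marker.split("Demands:")[1]
except IndexError:
    # split() returned a single-element list, so index 1 does not exist.
    print("split(...)[1] raised IndexError")

print(demands_section(without_marker))
print(demands_section("Usage:\nDemands:\n (no resource demands)"))
```

A caller that gets None back can keep polling until the section appears or a timeout expires, rather than crashing on the first incomplete snapshot.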

Would love a review on the approach when you get a chance. On the two CI failures: should I rerun, or are these known flaky tests I should just wait on? Happy to do whatever is most useful.

Merge conflict

This branch deletes the old lock files, and a later upstream commit modified them (jax bump), which produces a merge conflict. The conflict resolves trivially in favor of our deletion. I am happy to rebase if preferred.

ans9868 · 2026-04-15T17:04:48Z

One possible path forward: both failing tests (test_placement_group_status[False] and test_async_shutdown) could be added to the flaky_tests list in ci/ray_ci/none.tests.yml. This would keep them from blocking normal CI runs until they are properly fixed. Happy to send that as a separate PR if that is the preferred approach. I just wanted to propose a possible fix.

Signed-off-by: Adel Nour <ans9868@nyu.edu>

ans9868 · 2026-04-16T13:15:21Z

Merged with master. I think the flaky test failures may have been addressed by recent upstream commits, especially:
a22de9c — [rllib] disabling flaky rllib tests
ba63f26 — [Data] Fix autoscaler bug that blocks timely release of leased resources

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

Reviewed by Cursor Bugbot for commit e9bd76a.

ans9868 · 2026-04-17T03:57:13Z

Issue closed here #62595 (comment)

ans9868 requested review from a team, matthewdeng and richardliaw as code owners April 14, 2026 04:45

ray-gardener Bot added the devprod and community-contribution labels Apr 14, 2026

ans9868 force-pushed the fix/docgpu-split-cpu-gpu-locks branch from 2e76459 to 8eff66b Compare April 14, 2026 15:30

ans9868 closed this Apr 14, 2026

cursor Bot reviewed Apr 14, 2026

ans9868 reopened this Apr 14, 2026

ans9868 force-pushed the fix/docgpu-split-cpu-gpu-locks branch from 4d69759 to 57dbd9a Compare April 14, 2026 15:57

aslonnie reviewed Apr 14, 2026

aslonnie requested a review from elliot-barn April 14, 2026 16:00

cursor Bot reviewed Apr 14, 2026

Merge branch 'master' into fix/docgpu-split-cpu-gpu-locks

ca1881f

Signed-off-by: Nour999 <130527901+ans9868@users.noreply.github.com>

ans9868 force-pushed the fix/docgpu-split-cpu-gpu-locks branch from b2db55a to ca1881f Compare April 14, 2026 16:07

cursor Bot reviewed Apr 14, 2026

updated with main; Some commits seem to fix flaky tests

e9bd76a

Signed-off-by: Adel Nour <ans9868@nyu.edu>

cursor Bot reviewed Apr 16, 2026

[deps] remove jax from docgpu_gpu_additions; now covered by dl-gpu af…

74052b2

…ter upstream bump to 0.4.33 Signed-off-by: Adel Nour <ans9868@nyu.edu>

[deps] fix end-of-file newline in docgpu_gpu_additions.txt

1dd7a4b

Signed-off-by: Adel Nour <ans9868@nyu.edu>

ans9868 closed this Apr 17, 2026
