[deps] split ci_docgpu CPU/GPU depsets#62596
Conversation
|
Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again! |
2e76459 to
8eff66b
Compare
|
Someone fixed this it seems like in master. Going to close this here. |
|
Ah I got confused; there isn't a solution to this on the master branch. |
Fixes the torch-spline-conv conflict introduced in PR ray-project#62485 by splitting the single ci_docgpu_depset into separate CPU and GPU variants: - ci_docgpu_cpu_depset: CPU-only with --index https://download.pytorch.org/whl/cpu - ci_docgpu_gpu_depset: GPU-only with --index https://download.pytorch.org/whl/cu128 Update Docker build files to reference the GPU lock only (docgpu_gpu_depset_py.lock). This follows the proven raydepsets pattern used by ci_ml_build_depset (CPU) and ci_ml_gpubuild_depset (GPU). Note: Lock file regeneration is blocked by a pre-existing etils version conflict (separate issue). Lock files will be committed once that is resolved. Closes ray-project#62595 Signed-off-by: Adel Nour <ans9868@nyu.edu>
4d69759 to
57dbd9a
Compare
aslonnie
left a comment
There was a problem hiding this comment.
@elliot-barn could you help review this?
Signed-off-by: Nour999 <130527901+ans9868@users.noreply.github.com>
b2db55a to
ca1881f
Compare
| - py310 | ||
| - py312 | ||
| pre_hooks: | ||
| - ci/raydepsets/pre_hooks/remove-compiled-headers.sh 3.13 |
There was a problem hiding this comment.
CPU depset generates lock files nothing consumes
Low Severity
The ci_docgpu_cpu_depset_${PYTHON_SHORT} depset generates docgpu_cpu_depset_py${PYTHON_VERSION}.lock files, but no Dockerfile or wanda config references them — only the GPU variant is consumed by docgpu.build.Dockerfile and docgpu.build.wanda.yaml. This differs from the ci_ml pattern being followed, where both CPU and GPU lock files are consumed via BUILD_VARIANT. The CPU depset will cost CI time to compile in the follow-up lock generation without being used.
Reviewed by Cursor Bugbot for commit ca1881f. Configure here.
Adds missing CPU-only packages (jax, torchmetrics, torchtext, etils, etc.) to GPU depset via new docgpu_gpu_additions.txt file. This avoids merging full dl-cpu with dl-gpu in one compile, preventing torch-spline-conv conflict while ensuring GPU image has all needed packages. Updates ci_docgpu.depsets.yaml GPU variant to reference both: - python/requirements/ml/py313/dl-gpu-requirements.txt (GPU PyTorch/PyG) - python/requirements/ml/py313/docgpu_gpu_additions.txt (missing CPU-only packages) INCOMPLETE: GPU lock files (docgpu_gpu_depset_py3.10.lock, py3.12.lock) not yet generated. Docker build will fail until locks are committed in follow-up PR. Lock generation blocked by pre-existing etils version conflict (separate issue). Signed-off-by: Adel Nour <ans9868@nyu.edu>
docgpu Split Fix — Work in ProgressProblemPR #62485 merged CPU and GPU requirements into one depset, causing torch-spline-conv conflict (CPU plain version vs GPU cu128 variant cannot resolve in single compiler run). Initial ApproachSplit into two depsets with separate indices. But GPU depset using only Current SolutionAdded Statusjax versions now correct (jax==0.4.28 + jaxlib==0.4.28+cuda backend match). But several unknowns remain:
P.S. Lock file generation is separate issue. Will handle once etils resolved. Depending on my schedule I think I could fix this in 3-7 days. Feedback on this approach is welcome. This is trickier than I initially expected. more information about the full bug in the issue here: #62595 Edit: Completed the fix yesterday evening. Would love a review. More information in the comment below |
Generate four lock files completing the ci_docgpu depset split:
- docgpu_cpu_depset_py3.{10,12}.lock: CPU PyTorch wheels
- docgpu_gpu_depset_py3.{10,12}.lock: GPU cu128 wheels
Remove old undivided docgpu_depset_py3.{10,12}.lock.
Also improve docgpu_gpu_additions.txt: add python_version < '3.13'
guard to torchtext (no cp313 wheel exists for 0.18.0) and clarify
comments.
Validated:
- torch-spline-conv: +pt27cpu in CPU locks, +pt27cu128 in GPU locks
- etils: 1.5.2 on py3.10, 1.14.0 on py3.12
- jaxlib: 0.4.28+cuda12.cudnn89 in both GPU locks
Fixes ray-project#62595
Signed-off-by: Adel Nour <ans9868@nyu.edu>
Split docgpu CPU/GPU depsets, regenerate lock filesFixes the torch-spline-conv conflict introduced by #62485. That PR combined Approach: split into two depsets (following the Lock files regenerated locally with: Four new files committed:
Old single-depset locks removed:
CI failures (unrelated)Two failures in buildkite/"core: python tests [g6_s5]", neither touching anything in this diff:
Would love a review on the approach when you get a chance. On the two CI failures should I rerun, or are these known flaky tests I should just wait on? Happy to do whatever is most useful. Merge conflictThis branch deletes the old lock files because a later upstream commit modified them (jax bump). Conflict resolves trivially in favor of our deletion. I am happy to rebase if preferred. |
|
One possible path forward is both failing tests (test_placement_group_status[False] and test_async_shutdown) could be added to the |
Signed-off-by: Adel Nour <ans9868@nyu.edu>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 2 total unresolved issues (including 1 from previous review).
Reviewed by Cursor Bugbot for commit e9bd76a. Configure here.
…ter upstream bump to 0.4.33 Signed-off-by: Adel Nour <ans9868@nyu.edu>
Signed-off-by: Adel Nour <ans9868@nyu.edu>
|
Issue closed here #62595 (comment) |


Summary
Fixes the torch-spline-conv conflict in docgpu depset by splitting CPU and GPU variants into separate depsets with their respective PyTorch wheel indices.
What Changed
ci/raydepsets/configs/ci_docgpu.depsets.yaml: Split singleci_docgpu_depsetinto two:ci_docgpu_cpu_depset_${PYTHON_SHORT}: CPU-only with--index https://download.pytorch.org/whl/cpuci_docgpu_gpu_depset_${PYTHON_SHORT}: GPU-only with--index https://download.pytorch.org/whl/cu128ci/docker/docgpu.build.wanda.yaml(line 5): Updated lock reference to GPU variant (docgpu_gpu_depset_py$PYTHON.lock)ci/docker/docgpu.build.Dockerfile(line 7): Updated lock reference to GPU variant (docgpu_gpu_depset_py$PYTHON.lock)Why
PR #62485 introduced a single depset combining both CPU and GPU requirements, creating an unsolvable torch-spline-conv conflict:
Splitting into separate depsets with explicit indices (following the
ci_mlpattern) resolves this.Note
This PR includes the configuration and Docker file changes. Lock files will
be regenerated and committed in a separate follow-up PR because:
Lock file generation requires running
bazel run //ci/raydepsets:raydepsets -- build,which compiles all dependencies and exposes a pre-existing etils version conflict
(
etils==1.5.2in dl-cpu-requirements.txt vsetils==1.14.0in the constraint file).The architectural fix (config split) is complete and correct regardless of
lock file state. It can merge immediately while the etils conflict is resolved
separately.
This keeps the PR focused: architectural changes now, lock regeneration later
once etils is fixed.
Related
Closes #62595
Related to #62485
Found via: buildkite/precheck/dependencies/build: raydepsets: compile all dependencies [g8_s1] #60522