conda-file builds cannot install packages requiring __cuda virtual package #1026

@pinin4fjords


Summary

Wave's conda-file builds cannot install any conda-forge package that depends on the __cuda virtual package, because Wave's build servers do not have GPUs and the templates provide no mechanism to set CONDA_OVERRIDE_CUDA during the conda solve step. This affects all GPU-enabled packages on conda-forge from mid-2022 onwards, including PyTorch >= 1.12.1, JAX, TensorFlow, and others.

The standard conda/mamba workaround for building on CPU-only systems is setting CONDA_OVERRIDE_CUDA=<version> as an environment variable during the solve. Wave does not support this.
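For illustration, this is what the override looks like on a local CPU-only machine (the environment name and CUDA version are examples; micromamba is assumed to be installed):

```shell
# Prefix the solve command so the solver assumes CUDA 12.6 is present.
# The override affects only dependency resolution; no GPU driver is needed.
CONDA_OVERRIDE_CUDA=12.6 micromamba create -y -n gpu-env \
    -c conda-forge 'pytorch-gpu>=2.0'
```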

This was previously reported in the community forum (Problems with GPU-enabled wave images, Jan 2025) by Nico_Trummer, who hit the same __cuda resolution failure building JAX GPU containers. That thread remains unresolved: Paolo responded about the build timeout aspect (#597), but the core CONDA_OVERRIDE_CUDA gap was not addressed. This issue documents the problem fully, with reproduction steps, tested workarounds, and concrete fix proposals.

Reproduction

Given this environment.gpu.yml:

```yaml
channels:
  - conda-forge
  - bioconda
dependencies:
  - "bioconda::ribodetector=0.3.3"
  - "conda-forge::pytorch-gpu>=2.0"
```

the build command:

```console
wave --conda-file environment.gpu.yml --platform linux/amd64 --freeze --await
```

Fails with:

```
Could not solve for environment specs
pytorch-gpu >=2.0 is not installable because it requires
  pytorch [...], which requires
    __cuda =* *, which is missing on the system.
```

Build log: https://wave.seqera.io/view/builds/bd-bbf66b1b68ac0df5_1

Why this matters

The __cuda virtual package requirement was introduced in conda-forge's PyTorch builds starting at version 1.12.1. We verified this by testing every major version on a Linux x86_64 system without CONDA_OVERRIDE_CUDA set:

| pytorch-gpu version | Resolves without `__cuda`? | CUDA packaging |
|---------------------|----------------------------|----------------|
| 1.11.0 | Yes | Uses `cudatoolkit` (regular package) |
| 1.12.1 | No  | Requires `__cuda` virtual package |
| 1.13.1 | No  | Requires `__cuda` virtual package |
| 2.0.0  | No  | Requires `__cuda` virtual package |
| 2.5.1  | No  | Requires `__cuda` virtual package |

This means Wave can only install pytorch-gpu=1.11.0 (March 2022, CUDA 11.1) via --conda-file. All newer versions fail.

This is a real problem in production: the nf-core/rnaseq pipeline's GPU ribodetector test stalls for 2 hours under Singularity because the old PyTorch 1.11.0 container (CUDA 11.1 runtime) deadlocks on CUDA 12.x hosts. We need a container with modern PyTorch (CUDA 12.x), but Wave's --conda-file path cannot build one.

What we tried

- `--config-env 'CONDA_OVERRIDE_CUDA=12.6'`: sets the env var in the final container image, not during the build/solve step. The conda solver never sees it. (Build log)

- `--conda-run-command`: per the Wave source (`TemplateUtils.java`, `addCommands` method), this appends Dockerfile `RUN` commands after the `micromamba install` step. It cannot influence the conda solve.

- `--conda-base-image nvidia/cuda:12.6.3-base-ubuntu24.04`: fails because the NVIDIA base image does not have micromamba installed. (Build log)

- Adding `__cuda>=12` as a dependency in the YAML: the solver correctly identifies it as a virtual package that must be provided by the system and refuses to install it. (Build log)

Workaround: custom Dockerfile (works, but no community freeze)

A custom Dockerfile with CONDA_OVERRIDE_CUDA set inline works:

```Dockerfile
FROM mambaorg/micromamba:1.5.10-noble
COPY conda.yml /tmp/conda.yml
RUN CONDA_OVERRIDE_CUDA="12.6" micromamba install -y -n base -f /tmp/conda.yml && micromamba clean -a -y
USER root
ENV PATH="$MAMBA_ROOT_PREFIX/bin:$PATH"
```

```console
wave -f Dockerfile --context . --platform linux/amd64 --await --tower-token $TOKEN
# Returns: wave.seqera.io/wt/51f41f22dbb8/wave/build:96b4265a2148d918
```

We pulled this container and verified that it contains PyTorch 2.10.0 with CUDA 12.9 and that ribodetector 0.3.3 works correctly:

```console
$ docker run --rm <image> python -c "import torch; print(torch.__version__, torch.version.cuda)"
2.10.0 12.9

$ docker run --rm <image> ribodetector --version
ribodetector 0.3.3
```

However, the custom Dockerfile path requires --build-repo for --freeze, meaning it cannot be frozen to the community Wave registry (community.wave.seqera.io/library/...). This makes it unusable for nf-core modules, which rely on community-frozen container URLs.

Proposed solutions

There are several ways to address this, ranging from minimal to comprehensive:

Option A: Add {{conda_env_prefix}} template placeholder (minimal, targeted)

Add a new placeholder to the Dockerfile templates that injects environment variables before micromamba install:

```diff
 FROM {{mamba_image}} AS build
 COPY --chown=$MAMBA_USER:$MAMBA_USER conda.yml /tmp/conda.yml
-RUN micromamba install -y -n base -f /tmp/conda.yml \
+RUN {{conda_env_prefix}}micromamba install -y -n base -f /tmp/conda.yml \
```

Expose this via a new CLI flag (e.g., --conda-solve-env KEY=VALUE) and Nextflow config option. The placeholder would render as CONDA_OVERRIDE_CUDA="12.6" when set, or empty string when not.
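For example, with a hypothetical `--conda-solve-env CONDA_OVERRIDE_CUDA=12.6` set, the template could render to something like the following (template shape based on the diff above; exact layout is illustrative):

```Dockerfile
# Hypothetical rendering of the conda-file template with the new placeholder:
FROM mambaorg/micromamba:1.5.10-noble AS build
COPY --chown=$MAMBA_USER:$MAMBA_USER conda.yml /tmp/conda.yml
RUN CONDA_OVERRIDE_CUDA="12.6" micromamba install -y -n base -f /tmp/conda.yml \
    && micromamba clean -a -y
```

With the flag unset, the placeholder renders as an empty string and the `RUN` line is byte-identical to today's templates, so existing builds are unaffected.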

This requires changes to:

  • All 8 conda templates (v1/v2, Docker/Singularity, file/packages)
  • CondaOpts.java (new field)
  • TemplateUtils.java (new binding)
  • Wave CLI (new flag)
  • Nextflow Wave config (new option)

Option B: Set CONDA_OVERRIDE_CUDA unconditionally in templates

Ruled out: testing shows that for packages with both CPU and GPU variants (e.g. bare pytorch without the -gpu/-cpu suffix), the override causes the solver to prefer CUDA builds, silently pulling in hundreds of MB of CUDA toolkit into containers that never intended to use a GPU.

Option C: Detect GPU packages and set override automatically

Wave could inspect the conda environment spec for known GPU metapackages (pytorch-gpu, jaxlib, tensorflow-gpu, etc.) or __cuda dependencies and automatically set CONDA_OVERRIDE_CUDA during the solve. This would be the most user-friendly option but requires more implementation effort.
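A minimal sketch of such detection, assuming a simple string match against a curated list (the package set and function name here are illustrative, not Wave code):

```python
# Sketch: decide from a conda env spec whether the solve needs CONDA_OVERRIDE_CUDA.
# The set of GPU hint packages is illustrative and would need curation.
GPU_HINTS = {"pytorch-gpu", "tensorflow-gpu", "jaxlib", "cuda-version", "__cuda"}

def needs_cuda_override(dependencies):
    """Return True if any dependency names a known GPU metapackage.

    Entries look like 'conda-forge::pytorch-gpu>=2.0' or '__cuda>=12'.
    """
    for dep in dependencies:
        name = dep.split("::")[-1]          # drop optional 'channel::' prefix
        for sep in ("=", ">", "<", "!", " "):
            name = name.split(sep)[0]       # drop version constraint
        if name in GPU_HINTS:
            return True
    return False

print(needs_cuda_override(["bioconda::ribodetector=0.3.3",
                           "conda-forge::pytorch-gpu>=2.0"]))  # True
print(needs_cuda_override(["bioconda::samtools=1.20"]))        # False
```

The main drawback is that any curated list goes stale as new GPU metapackages appear on conda-forge.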

Option D: Two-pass solve with automatic retry

Run the solve normally. If it fails and the error output contains __cuda, retry with CONDA_OVERRIDE_CUDA set. The template's RUN command would become something like:

```shell
micromamba install -y -n base -f /tmp/conda.yml || \
  (micromamba install --dry-run -y -n base -f /tmp/conda.yml 2>&1 | grep -q __cuda \
   && CONDA_OVERRIDE_CUDA="12" micromamba install -y -n base -f /tmp/conda.yml)
```

This requires no new CLI flags, no package list, and no repodata inspection. The solver itself discovers whether __cuda is needed. The cost is one extra failed solve (~4s) for GPU environments only; non-GPU environments succeed on the first pass with zero overhead. This could be implemented entirely within the existing templates.
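The control flow can be exercised with a stub standing in for micromamba (the `solve` function below is a hypothetical stand-in, not real solver behavior):

```shell
# Stub demonstration of Option D's fallback: the first solve fails, its
# output mentions __cuda, so the retry runs with the override set.
solve() {
    if [ -n "$CONDA_OVERRIDE_CUDA" ]; then
        echo "solved with CUDA $CONDA_OVERRIDE_CUDA"
    else
        echo "__cuda is missing on the system" >&2
        return 1
    fi
}

solve || \
  (solve 2>&1 | grep -q __cuda \
   && CONDA_OVERRIDE_CUDA="12" solve)
# stdout: solved with CUDA 12
```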
