Add CUDA support with new -cuda tag #749

Open
ToshY wants to merge 8 commits into wader:master from ToshY:feature/cuda

Conversation

ToshY (Contributor) commented May 3, 2026

Fixes #480


This PR was largely produced by Claude (Opus 4.6/4.7). I tested images for both the standard and the CUDA build on WSL2, and was also able to embed the binaries in an existing Python Alpine-based image (this needs additional libraries / ENV vars to work, which is noted in the README).

I am, however, unable to fully verify the changes in the Dockerfile (apart from the addition of nv-codec-headers), as I simply lack the knowledge for it.

The PR includes a file, docs/ffmpeg-with-cuda.md, that serves as a summary of the changes. It is kept temporarily in case the reviewer wants to check the decisions that were made (by Claude). If the PR ends up being merged, it can be removed beforehand (to avoid clutter).


Generated summary by Claude 🤖

Summary

Adds a second image variant mwader/static-ffmpeg:<tag>-cuda (amd64 only) that supports NVIDIA GPU acceleration (h264_nvenc, hevc_nvenc, av1_nvenc, NVDEC, CUVID, scale_cuda, …) via the host driver and the NVIDIA Container Toolkit.

The default :<tag> image is unchanged — still fully static-pie musl, zero NEEDED entries, drops into FROM scratch. CUDA users explicitly opt in via the -cuda tag.

Why a separate variant?

  • The default tag's value proposition is "drop into any base image including FROM scratch". CUDA requires dlopen() of host driver libraries → fundamentally incompatible with static-pie on musl (no dynamic loader). Making the default dynamic would silently break existing users.
  • CUDA users need a GPU host + the NVIDIA Container Toolkit — different deployment model.
  • A separate tag therefore gives explicit opt-in and a clear support boundary.
|                    | Default :<tag>  | CUDA :<tag>-cuda                       |
| ------------------ | --------------- | -------------------------------------- |
| Linkage            | static-pie musl | musl dynamic-PIE (libc only)           |
| readelf -d NEEDED  | (none)          | exactly one: libc.musl-x86_64.so.1     |
| GPU                | no              | ✅ NVENC / NVDEC / CUVID               |
| Arch               | amd64 + arm64   | amd64 only                             |
| ffmpeg exit codes  | upstream        | identical to upstream                  |

Architecture (six-layer stack)

The CUDA variant works on Alpine + musl by combining six independently-essential layers. Each was added to fix one specific failure mode discovered during development. Full problem → cause → fix write-ups in docs/ffmpeg-with-cuda.md.

| # | Layer                                                        | Stage   | Fixes                                                                                                      |
| - | ------------------------------------------------------------ | ------- | ---------------------------------------------------------------------------------------------------------- |
| 1 | Absolute-path link of /lib/ld-musl-x86_64.so.1                | builder | musl static libc.a dlopen stub silently returning NULL                                                      |
| 2 | Dynamic-PIE link mode (-fPIE -pie, not -static-pie)           | builder | static-pie has no dynamic loader, dlopen impossible                                                         |
| 3 | /etc/ld-musl-x86_64.path listing toolkit injection dirs       | runtime | musl can't find /usr/lib64, /usr/lib/wsl/lib, …                                                             |
| 4 | gcompat package + libdl.so.2 → libgcompat.so.0 symlink        | runtime | NVIDIA driver libs need libc.so.6 / libdl.so.2 (glibc names)                                                |
| 5 | libnvshim.so LD_PRELOAD (ABI-shim symbols only)               | runtime | glibc-internal symbols missing from gcompat (gnu_get_libc_version, __register_atfork, dlmopen, dlvsym, …)   |
| 6 | Bash entrypoint wrapper (139 → 0 only, error-keyword gated)   | runtime | benign teardown SIGSEGV from libcuda __cxa_finalize on musl                                                 |
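
To make layer 6 concrete, here is a minimal sketch of an error-keyword-gated wrapper of this kind. It is not the wrapper shipped in the image (that one streams stderr and reads the exit status via ${PIPESTATUS[0]}, see the design notes below); the keyword list and the stderr buffering here are illustrative only.

```sh
#!/bin/bash
# Illustrative sketch only, not the wrapper shipped in the image.
# Run ffmpeg, buffer stderr, and downgrade a bare teardown SIGSEGV (exit 139)
# to 0 only when stderr contains no error keyword.
err="$(mktemp)"
/ffmpeg "$@" 2>"$err"
status=$?
cat "$err" >&2                  # replay stderr for the caller
if [ "$status" -eq 139 ] && ! grep -qiE 'error|invalid|failed|no such' "$err"; then
  status=0                      # benign libcuda __cxa_finalize crash after main()
fi
rm -f "$err"
exit "$status"
```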

Files changed

  • Dockerfile — ARG ENABLE_CUDA=; gated nv-codec-headers install; ffmpeg configure gains --enable-ffnvcodec --enable-cuvid --enable-nvenc --enable-nvdec and the dynamic-PIE/absolute-path-libc link flags; new final-cuda stage with gcompat, libnvshim.so, ld-musl path, env, and entrypoint wrapper.
  • checkelf — new --cuda mode that allows the musl libc/loader as the only NEEDED entry; all other hardening checks (RELRO, BIND_NOW, PIE, NX stack) preserved. A rough sketch of the linkage assertion follows this list.
  • README.md — new "CUDA / NVENC / NVDEC" section; tag listing updated.
  • docs/ffmpeg-with-cuda.md — full problem → root cause → fix write-up of every issue encountered, plus diagnostic playbook and regression-guard recipes.
  • .github/workflows/multiarch.yml — split the matrix into three jobs: build-default-arm64, build-default-amd64 (parallel), then build-cuda-amd64 (needs: build-default-amd64, reuses the same buildx cache scope so only the final stage materializes).
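
As an illustration of what the --cuda mode asserts, the linkage checks can be approximated with plain readelf; the actual checkelf script is the source of truth and may structure its checks differently.

```sh
# Rough approximation of the --cuda linkage assertion (illustrative only):
# the binary must be PIE, BIND_NOW, with musl libc as the single NEEDED entry.
needed=$(readelf -d /ffmpeg | awk '/\(NEEDED\)/ {print $NF}' | tr -d '[]')
[ "$needed" = "libc.musl-x86_64.so.1" ] || { echo "unexpected NEEDED: $needed"; exit 1; }
readelf -d /ffmpeg | grep -Eq 'BIND_NOW|FLAGS_1.*NOW' || { echo "missing BIND_NOW"; exit 1; }
readelf -h /ffmpeg | grep -q 'Type:[[:space:]]*DYN' || { echo "not PIE"; exit 1; }
```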

Tag layout

| Tag              | Pushed by                                  |
| ---------------- | ------------------------------------------ |
| <tag>            | manifest list of <tag>-amd64 + <tag>-arm64 |
| <tag>-amd64      | build-default-amd64                        |
| <tag>-arm64      | build-default-arm64                        |
| <tag>-cuda       | build-cuda-amd64 (single-arch)             |
| <tag>-cuda-amd64 | build-cuda-amd64 (explicit arch alias)     |

latest and latest-cuda follow the latest stable release.

Explicitly NOT supported

| Feature                                  | Reason                                                                            |
| ---------------------------------------- | --------------------------------------------------------------------------------- |
| --enable-cuda-nvcc                       | Requires the full ~3 GB glibc-based CUDA toolkit at build time                     |
| --enable-libnpp / scale_npp              | Same — glibc-only; use scale_cuda instead                                          |
| arm64                                    | NVIDIA Container Toolkit on arm64 is server-class only (Jetson uses a different stack) |
| FROM scratch / distroless target images  | No musl loader available                                                           |

CI impact

Walltime estimate (the longest job dictates total run time; jobs run in parallel where possible):

| Job                 | Runner        | Walltime                                  |
| ------------------- | ------------- | ----------------------------------------- |
| build-default-arm64 | ubicloud arm  | ~40 min                                   |
| build-default-amd64 | ubuntu-latest | ~60 min                                   |
| build-cuda-amd64    | ubuntu-latest | ~10–15 min (cache hit on builder layers)  |

Total wall time: ~70–75 min (vs ~60 min before). The arm64 job still runs fully in parallel with the amd64 chain; the CUDA job blocks on the amd64 default build to reuse its buildx cache scope (avoids ~60 min of duplicate codec compilation).

Verification

End-to-end recipe in docs/ffmpeg-with-cuda.md §4:

IMG=mwader/static-ffmpeg:<tag>-cuda

# 1. Linkage: exactly one NEEDED entry (musl libc)
docker create --name sf "$IMG" && docker cp sf:/ffmpeg /tmp/ff && docker rm sf
readelf -d /tmp/ff | grep -E 'NEEDED|BIND_NOW'

# 2. NVENC encode
docker run --rm --gpus all "$IMG" \
    -hide_banner -loglevel error \
    -f lavfi -i testsrc=duration=2:size=1280x720:rate=30 \
    -c:v h264_nvenc -f null -
# expect: exit=0, no SEGV line

# 3. Exit-code parity vs non-CUDA :8.1 (regression guard for in-process exit-interposer bug)
docker run --rm --gpus all "$IMG" -hide_banner -loglevel error \
    -f lavfi -i testsrc=duration=1:size=320x240:rate=30 \
    -c:v this_codec_does_not_exist -f null -          # must exit 8
docker run --rm --gpus all "$IMG" -hide_banner -loglevel error \
    -i /no/such/file.mp4 -f null -                     # must exit 254

All five steps of the §4 recipe pass on the test build (RTX 3060 Ti, driver 596.21, CUDA 13.2, WSL2).

Runtime requirements

  • Host with NVIDIA driver + NVIDIA Container Toolkit.
  • Run with --gpus all (or --runtime=nvidia + NVIDIA_VISIBLE_DEVICES).
  • NVIDIA_DRIVER_CAPABILITIES=compute,utility,video is baked into the image — compute mounts libcuda.so.1, video mounts libnvcuvid.so / libnvidia-encode.so. Default toolkit caps (utility only) would break NVENC.
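
For illustration, a typical GPU-decoded and GPU-encoded transcode against the published image would look something like this (the input/output file names and the preset are placeholders):

```sh
docker run --rm --gpus all -v "$PWD:/work" -w /work \
    mwader/static-ffmpeg:<tag>-cuda \
    -hwaccel cuda -hwaccel_output_format cuda -i input.mp4 \
    -vf scale_cuda=1280:720 \
    -c:v h264_nvenc -preset p5 \
    output.mp4
```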

Known design notes (in docs)

  • libnvshim.so MUST NOT export exit / _exit / _Exit. The earlier in-process attempt to suppress the teardown SIGSEGV via an _exit interposer silently swallowed every ffmpeg error exit code (always returned 0). The shim is now strictly the minimum glibc→musl ABI symbol set; lifecycle policy lives in the bash entrypoint wrapper where it can read the real exit status via ${PIPESTATUS[0]} and pattern-match on actual stderr keywords. See docs/ffmpeg-with-cuda.md §2 P6.
  • The teardown SIGSEGV is a libcuda __cxa_finalize crash inside main() — there is no in-process hook (atexit, signal handler, etc.) that can suppress it without risk of papering over real bugs. The out-of-process wrapper downgrades exit 139 → 0 only when stderr contains no recognised error keyword.
  • Image-wide ENV LD_PRELOAD=libnvshim.so is only safe in ffmpeg-only images. The published :*-cuda image runs only /ffmpeg (ENTRYPOINT ["/ffmpeg"]), which was built and tested with the shim preloaded. Downstream users who COPY --from the binaries into a multi-process image (Python/Node app + ffmpeg, etc.) and blindly replicate ENV LD_PRELOAD will see other musl interpreters (pip, python, …) crash with SIGSEGV (exit 139) at startup — libnvshim exports glibc-only symbols and transitively pulls in gcompat (via DT_NEEDED libdl.so.2), which is not safe to inject into arbitrary musl processes. The README "Use in another image with COPY --from" → "Multi-process images" subsection documents the scoped-wrapper alternative (/usr/local/bin/ffmpeg shell stub that sets LD_PRELOAD only for the ffmpeg invocation).
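
A minimal sketch of such a scoped stub, assuming the downstream image copies the real binary to a path like /opt/static-ffmpeg/ffmpeg (both the path and the stub itself are illustrative; the README section is the source of truth):

```sh
#!/bin/sh
# /usr/local/bin/ffmpeg stub: scope the shim preload to the ffmpeg invocation
# only, so other musl processes in the image (pip, python, ...) never run
# with LD_PRELOAD=libnvshim.so.
# /opt/static-ffmpeg/ffmpeg is a placeholder for wherever COPY --from put the binary.
exec env LD_PRELOAD=libnvshim.so /opt/static-ffmpeg/ffmpeg "$@"
```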
