Add CUDA support with new -cuda tag #749

Open
ToshY wants to merge 8 commits into wader:master from ToshY:feature/cuda

Conversation

ToshY (Contributor) commented May 3, 2026

Fixes #480


This PR was largely produced by Claude (Opus 4.6/4.7). I tested images for both the standard and the CUDA build on WSL2, and was also able to embed the binaries in an existing Python Alpine-based image (this needs additional libraries / ENV vars to work, which is noted in the README).

I am, however, unable to fully verify the changes in the Dockerfile (apart from the addition of nv-codec-headers), as I simply lack the knowledge for it.

The PR includes a file, docs/ffmpeg-with-cuda.md, that serves as a summary of the changes. It is kept temporarily in case the reviewer wants to check the decisions that were made (by Claude). If the PR ends up being merged, it can be removed beforehand (to avoid clutter).


Generated summary by Claude 🤖

Summary

Adds a second image variant mwader/static-ffmpeg:<tag>-cuda (amd64 only) that supports NVIDIA GPU acceleration (h264_nvenc, hevc_nvenc, av1_nvenc, NVDEC, CUVID, scale_cuda, …) via the host driver and the NVIDIA Container Toolkit.

The default :<tag> image is unchanged — still fully static-pie musl, zero NEEDED entries, drops into FROM scratch. CUDA users explicitly opt in via the -cuda tag.

Why a separate variant?

  • The default tag's value proposition is "drop into any base image including FROM scratch". CUDA requires dlopen() of host driver libraries → fundamentally incompatible with static-pie on musl (no dynamic loader). Making the default dynamic would silently break existing users.
  • CUDA users need a GPU host + the NVIDIA Container Toolkit — different deployment model.
  • A separate tag therefore gives explicit opt-in and a clear support boundary.
|                    | Default :<tag>  | CUDA :<tag>-cuda                       |
| ------------------ | --------------- | -------------------------------------- |
| Linkage            | static-pie musl | musl dynamic-PIE (libc only)           |
| readelf -d NEEDED  | (none)          | exactly one: libc.musl-x86_64.so.1     |
| GPU                | no              | ✅ NVENC / NVDEC / CUVID               |
| Arch               | amd64 + arm64   | amd64 only                             |
| ffmpeg exit codes  | upstream        | identical to upstream                  |

Architecture (six-layer stack)

The CUDA variant works on Alpine + musl by combining six independently-essential layers. Each was added to fix one specific failure mode discovered during development. Full problem → cause → fix write-ups in docs/ffmpeg-with-cuda.md.

| # | Layer                                                        | Stage   | Fixes                                                                                                      |
| - | ------------------------------------------------------------ | ------- | ---------------------------------------------------------------------------------------------------------- |
| 1 | Absolute-path link of /lib/ld-musl-x86_64.so.1                | builder | musl static libc.a dlopen stub silently returning NULL                                                      |
| 2 | Dynamic-PIE link mode (-fPIE -pie, not -static-pie)           | builder | static-pie has no dynamic loader, dlopen impossible                                                         |
| 3 | /etc/ld-musl-x86_64.path listing toolkit injection dirs       | runtime | musl can't find /usr/lib64, /usr/lib/wsl/lib, …                                                             |
| 4 | gcompat package + libdl.so.2 → libgcompat.so.0 symlink        | runtime | NVIDIA driver libs need libc.so.6 / libdl.so.2 (glibc names)                                                |
| 5 | libnvshim.so LD_PRELOAD (ABI-shim symbols only)               | runtime | glibc-internal symbols missing from gcompat (gnu_get_libc_version, __register_atfork, dlmopen, dlvsym, …)   |
| 6 | Bash entrypoint wrapper (139 → 0 only, error-keyword gated)   | runtime | benign teardown SIGSEGV from libcuda __cxa_finalize on musl                                                 |
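
To make layer 6 concrete, here is a minimal sketch of an error-keyword-gated wrapper of this kind. It is not the wrapper shipped in the image (that one streams stderr and reads the exit status via ${PIPESTATUS[0]}, see the design notes below); the keyword list and the stderr buffering here are illustrative only.

```sh
#!/bin/bash
# Illustrative sketch only, not the wrapper shipped in the image.
# Run ffmpeg, buffer stderr, and downgrade a bare teardown SIGSEGV (exit 139)
# to 0 only when stderr contains no error keyword.
err="$(mktemp)"
/ffmpeg "$@" 2>"$err"
status=$?
cat "$err" >&2                  # replay stderr for the caller
if [ "$status" -eq 139 ] && ! grep -qiE 'error|invalid|failed|no such' "$err"; then
  status=0                      # benign libcuda __cxa_finalize crash after main()
fi
rm -f "$err"
exit "$status"
```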

Files changed

  • Dockerfile — ARG ENABLE_CUDA=; gated nv-codec-headers install; ffmpeg configure gains --enable-ffnvcodec --enable-cuvid --enable-nvenc --enable-nvdec and the dynamic-PIE/absolute-path-libc link flags; new final-cuda stage with gcompat, libnvshim.so, ld-musl path, env, and entrypoint wrapper.
  • checkelf — new --cuda mode that allows the musl libc/loader as the only NEEDED entry; all other hardening checks (RELRO, BIND_NOW, PIE, NX stack) preserved. A rough sketch of the linkage assertion follows this list.
  • README.md — new "CUDA / NVENC / NVDEC" section; tag listing updated.
  • docs/ffmpeg-with-cuda.md — full problem → root cause → fix write-up of every issue encountered, plus diagnostic playbook and regression-guard recipes.
  • .github/workflows/multiarch.yml — split the matrix into three jobs: build-default-arm64, build-default-amd64 (parallel), then build-cuda-amd64 (needs: build-default-amd64, reuses the same buildx cache scope so only the final stage materializes).
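
As an illustration of what the --cuda mode asserts, the linkage checks can be approximated with plain readelf; the actual checkelf script is the source of truth and may structure its checks differently.

```sh
# Rough approximation of the --cuda linkage assertion (illustrative only):
# the binary must be PIE, BIND_NOW, with musl libc as the single NEEDED entry.
needed=$(readelf -d /ffmpeg | awk '/\(NEEDED\)/ {print $NF}' | tr -d '[]')
[ "$needed" = "libc.musl-x86_64.so.1" ] || { echo "unexpected NEEDED: $needed"; exit 1; }
readelf -d /ffmpeg | grep -Eq 'BIND_NOW|FLAGS_1.*NOW' || { echo "missing BIND_NOW"; exit 1; }
readelf -h /ffmpeg | grep -q 'Type:[[:space:]]*DYN' || { echo "not PIE"; exit 1; }
```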

Tag layout

| Tag              | Pushed by                                  |
| ---------------- | ------------------------------------------ |
| <tag>            | manifest list of <tag>-amd64 + <tag>-arm64 |
| <tag>-amd64      | build-default-amd64                        |
| <tag>-arm64      | build-default-arm64                        |
| <tag>-cuda       | build-cuda-amd64 (single-arch)             |
| <tag>-cuda-amd64 | build-cuda-amd64 (explicit arch alias)     |

latest and latest-cuda follow the latest stable release.

Explicitly NOT supported

| Feature                                  | Reason                                                                            |
| ---------------------------------------- | --------------------------------------------------------------------------------- |
| --enable-cuda-nvcc                       | Requires the full ~3 GB glibc-based CUDA toolkit at build time                     |
| --enable-libnpp / scale_npp              | Same — glibc-only; use scale_cuda instead                                          |
| arm64                                    | NVIDIA Container Toolkit on arm64 is server-class only (Jetson uses a different stack) |
| FROM scratch / distroless target images  | No musl loader available                                                           |

CI impact

Walltime estimate (the longest job dictates total run time; jobs run in parallel where possible):

| Job                 | Runner        | Walltime                                  |
| ------------------- | ------------- | ----------------------------------------- |
| build-default-arm64 | ubicloud arm  | ~40 min                                   |
| build-default-amd64 | ubuntu-latest | ~60 min                                   |
| build-cuda-amd64    | ubuntu-latest | ~10–15 min (cache hit on builder layers)  |

Total wall time: ~70–75 min (vs ~60 min before). The arm64 job still runs fully in parallel with the amd64 chain; the CUDA job blocks on the amd64 default build to reuse its buildx cache scope (avoids ~60 min of duplicate codec compilation).

Verification

End-to-end recipe in docs/ffmpeg-with-cuda.md §4:

IMG=mwader/static-ffmpeg:<tag>-cuda

# 1. Linkage: exactly one NEEDED entry (musl libc)
docker create --name sf "$IMG" && docker cp sf:/ffmpeg /tmp/ff && docker rm sf
readelf -d /tmp/ff | grep -E 'NEEDED|BIND_NOW'

# 2. NVENC encode
docker run --rm --gpus all "$IMG" \
    -hide_banner -loglevel error \
    -f lavfi -i testsrc=duration=2:size=1280x720:rate=30 \
    -c:v h264_nvenc -f null -
# expect: exit=0, no SEGV line

# 3. Exit-code parity vs non-CUDA :8.1 (regression guard for in-process exit-interposer bug)
docker run --rm --gpus all "$IMG" -hide_banner -loglevel error \
    -f lavfi -i testsrc=duration=1:size=320x240:rate=30 \
    -c:v this_codec_does_not_exist -f null -          # must exit 8
docker run --rm --gpus all "$IMG" -hide_banner -loglevel error \
    -i /no/such/file.mp4 -f null -                     # must exit 254

All five steps of the §4 recipe pass on the test build (RTX 3060 Ti, driver 596.21, CUDA 13.2, WSL2).

Runtime requirements

  • Host with NVIDIA driver + NVIDIA Container Toolkit.
  • Run with --gpus all (or --runtime=nvidia + NVIDIA_VISIBLE_DEVICES).
  • NVIDIA_DRIVER_CAPABILITIES=compute,utility,video is baked into the image — compute mounts libcuda.so.1, video mounts libnvcuvid.so / libnvidia-encode.so. Default toolkit caps (utility only) would break NVENC.
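
For illustration, a typical GPU-decoded and GPU-encoded transcode against the published image would look something like this (the input/output file names and the preset are placeholders):

```sh
docker run --rm --gpus all -v "$PWD:/work" -w /work \
    mwader/static-ffmpeg:<tag>-cuda \
    -hwaccel cuda -hwaccel_output_format cuda -i input.mp4 \
    -vf scale_cuda=1280:720 \
    -c:v h264_nvenc -preset p5 \
    output.mp4
```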

Known design notes (in docs)

  • libnvshim.so MUST NOT export exit / _exit / _Exit. The earlier in-process attempt to suppress the teardown SIGSEGV via an _exit interposer silently swallowed every ffmpeg error exit code (always returned 0). The shim is now strictly the minimum glibc→musl ABI symbol set; lifecycle policy lives in the bash entrypoint wrapper where it can read the real exit status via ${PIPESTATUS[0]} and pattern-match on actual stderr keywords. See docs/ffmpeg-with-cuda.md §2 P6.
  • The teardown SIGSEGV is a libcuda __cxa_finalize crash inside main() — there is no in-process hook (atexit, signal handler, etc.) that can suppress it without risk of papering over real bugs. The out-of-process wrapper downgrades exit 139 → 0 only when stderr contains no recognised error keyword.
  • Image-wide ENV LD_PRELOAD=libnvshim.so is only safe in ffmpeg-only images. The published :*-cuda image runs only /ffmpeg (ENTRYPOINT ["/ffmpeg"]), which was built and tested with the shim preloaded. Downstream users who COPY --from the binaries into a multi-process image (Python/Node app + ffmpeg, etc.) and blindly replicate ENV LD_PRELOAD will see other musl interpreters (pip, python, …) crash with SIGSEGV (exit 139) at startup — libnvshim exports glibc-only symbols and transitively pulls in gcompat (via DT_NEEDED libdl.so.2), which is not safe to inject into arbitrary musl processes. The README "Use in another image with COPY --from" → "Multi-process images" subsection documents the scoped-wrapper alternative (/usr/local/bin/ffmpeg shell stub that sets LD_PRELOAD only for the ffmpeg invocation).
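
A minimal sketch of such a scoped stub, assuming the downstream image copies the real binary to a path like /opt/static-ffmpeg/ffmpeg (both the path and the stub itself are illustrative; the README section is the source of truth):

```sh
#!/bin/sh
# /usr/local/bin/ffmpeg stub: scope the shim preload to the ffmpeg invocation
# only, so other musl processes in the image (pip, python, ...) never run
# with LD_PRELOAD=libnvshim.so.
# /opt/static-ffmpeg/ffmpeg is a placeholder for wherever COPY --from put the binary.
exec env LD_PRELOAD=libnvshim.so /opt/static-ffmpeg/ffmpeg "$@"
```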
