Summary
`GPUEngine[float32]` silently hangs on NVIDIA GB10 (arm64 Grace Hopper, DGX Spark)
when CUDA graph capture is active and the workload uploads a non-trivial weight
set via `WeightUploader.UploadWeights` followed by graph construction.
Impact
Blocks GPU training for downstream callers that rely on graph capture. Wolf's
CrossAsset training (12 Fibonacci scales, 193 features per scale, 50 weight
tensors including 256x1024 matrices) reliably hangs at the `Using GPU engine`
log line with 0% GPU utilization across 5 independent attempts.
Reproduction
Wolf CrossAsset training without env override:
- Hangs silently past the `Using GPU engine` log line for 20-70 minutes
- 0% GPU utilization throughout
- No cudaError emitted, no panic, no deadlock from the Go side
With `ZERFOO_DISABLE_CUDA_GRAPH=1` in the Spark manifest env vars the same
training completes. First successful GPU training on DGX: epochs 0-3 losses
0.864, 0.693, 0.651, 0.627 (still running).
A minimal 4x4 MatMul smoke test (Wolf `cmd/gpu-smoke/main.go`, image
`localhost:5000/wolf-gpu-smoke:wave16-2687f664`) PASSES with capture
enabled: 3.5s with `ZERFOO_DISABLE_CUDA_GRAPH=1`, 4.4s without. So the
hang is specific to the combination of multi-tensor uploads and larger
graphs, not to engine init or the basic MatMul path.
Environment
- NVIDIA DGX Spark GB10 (arm64 Grace Hopper), 128 GiB unified memory
- Ubuntu 24.04 in Podman pod (via feza-ai/spark)
- CUDA 13.0.96
- ztensor v1.5.1-0.20260415020900-fd646fb10680
- zerfoo v1.48.1-0.20260415044400-d3ef8b617b34
- Go 1.26, `CGO_ENABLED=1`
Existing evidence
`compute/gpu_engine.go:416-424` (TODO above line 421): `MmapStorage` +
`cudaMemcpy` misalignment on ARM64 Grace Hopper breaks CUDA graph capture.
`compute/engine.go:137`: allocations during capture (`cudaMalloc`) fail with
error 901.
`compute/gpu_engine.go:617-630` `BeginCapture`: partial mitigation via
`CaptureAwareAllocator`. This path is NOT invoked by
`graph/cuda_graph.go:299`, which calls `cuda.StreamBeginCapture` directly
without switching the engine allocator. Any allocation inside the captured
region still routes through the default `allocWeight`, which on GB10 with
managed memory calls `cuda.MallocManaged` (illegal during capture).
Tracking
Full investigation and fix plan: `docs/plan.md` in this repo, added at
commit `b4868ab` (2026-04-15). Downstream workaround: feza-ai/wolf PR #108
(merged 2026-04-15) pins `ZERFOO_DISABLE_CUDA_GRAPH=1` in
`deploy/spark/train-crossasset-gpu.yaml`.
Wolf-side devlog (investigation history and first successful GPU training
run on DGX):
- feza-ai/wolf `docs/devlog.md`, 2026-04-15 entry "Wave 16 — GPU silent-stall
  root cause identified (CUDA graph capture on arm64 GB10)".