GB10 CUDA graph capture silently hangs during multi-tensor weight upload (CrossAsset training) #93

@dndungu

Description

Summary

GPUEngine[float32] silently hangs on NVIDIA GB10 (arm64 Grace Hopper, DGX Spark)
when CUDA graph capture is active and the workload uploads a non-trivial weight
set via WeightUploader.UploadWeights followed by graph construction.

Impact

Blocks GPU training for downstream callers that rely on graph capture. Wolf's
CrossAsset training (12 Fibonacci scales, 193 features per scale, 50 weight
tensors including 256x1024 matrices) reliably hangs at the "Using GPU engine"
log line with 0% GPU utilization, across 5 independent attempts.

Reproduction

Wolf CrossAsset training without the env override:

  • Hangs silently past "Using GPU engine" for 20-70 minutes
  • 0% GPU utilization throughout
  • No cudaError emitted, no panic, no deadlock visible from the Go side

With ZERFOO_DISABLE_CUDA_GRAPH=1 set in the Spark manifest env vars, the same
training completes. First successful GPU training run on DGX: epoch 0-3 losses
of 0.864, 0.693, 0.651, 0.627 (still running).

A minimal 4x4 MatMul smoke test (Wolf cmd/gpu-smoke/main.go, image
localhost:5000/wolf-gpu-smoke:wave16-2687f664) PASSES with capture
enabled — 3.5s with ZERFOO_DISABLE_CUDA_GRAPH=1, 4.4s without. So the
hang is specific to multi-tensor uploads + larger graphs, not to the
engine init or basic MatMul path.

Environment

  • NVIDIA DGX Spark GB10 (arm64 Grace Hopper), 128 GiB unified memory
  • Ubuntu 24.04 in Podman pod (via feza-ai/spark)
  • CUDA 13.0.96
  • ztensor v1.5.1-0.20260415020900-fd646fb10680
  • zerfoo v1.48.1-0.20260415044400-d3ef8b617b34
  • Go 1.26, CGO_ENABLED=1

Existing evidence

  • compute/gpu_engine.go:416-424 (TODO above line 421): MmapStorage +
    cudaMemcpy misalignment on ARM64 Grace Hopper breaks CUDA graph capture.
  • compute/engine.go:137: allocations during capture (cudaMalloc) fail with
    error 901.
  • compute/gpu_engine.go:617-630 BeginCapture: partial mitigation via
    CaptureAwareAllocator. This path is NOT invoked by
    graph/cuda_graph.go:299, which calls cuda.StreamBeginCapture directly
    without switching the engine allocator. Any allocation inside the captured
    region therefore still routes through the default allocWeight, which on
    GB10 with managed memory calls cuda.MallocManaged (illegal during capture).
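The allocator mismatch described above can be modeled without CUDA. The sketch
below mocks the two paths: allocWeight stands in for the default
cuda.MallocManaged route that fails mid-capture, and allocCaptureAware serves
from a pool reserved before capture begins. All names, the pool design, and the
error text are assumptions for illustration, not zerfoo's implementation.

```go
package main

import (
	"errors"
	"fmt"
)

// errCaptureAlloc mocks the allocation failure observed during stream
// capture (reported as error 901 in compute/engine.go:137).
var errCaptureAlloc = errors.New("allocation attempted during stream capture")

type engine struct {
	capturing bool     // mocks "a stream capture is in progress"
	pool      [][]byte // buffers reserved before BeginCapture
}

// allocWeight models the default path: fine eagerly, fatal mid-capture.
// make([]byte, n) stands in for cuda.MallocManaged.
func (e *engine) allocWeight(n int) ([]byte, error) {
	if e.capturing {
		return nil, errCaptureAlloc // the path GB10 hits today
	}
	return make([]byte, n), nil
}

// allocCaptureAware serves pre-reserved buffers while capturing, and
// falls through to the default path otherwise.
func (e *engine) allocCaptureAware(n int) ([]byte, error) {
	if !e.capturing {
		return e.allocWeight(n)
	}
	for i, b := range e.pool {
		if len(b) >= n {
			e.pool = append(e.pool[:i], e.pool[i+1:]...)
			return b[:n], nil
		}
	}
	return nil, errors.New("capture pool exhausted; grow it before BeginCapture")
}

func main() {
	e := &engine{pool: [][]byte{make([]byte, 1<<20)}} // reserve before capture
	e.capturing = true                                // mocks StreamBeginCapture
	if _, err := e.allocWeight(256 * 1024); err != nil {
		fmt.Println("default allocator:", err)
	}
	if buf, err := e.allocCaptureAware(256 * 1024); err == nil {
		fmt.Println("capture-aware allocator served", len(buf), "bytes")
	}
}
```

The fix direction implied by the evidence is for the graph/cuda_graph.go capture
entry point to swap the engine onto its capture-aware allocator (and pre-size the
pool from the known weight set) before calling cuda.StreamBeginCapture.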

Tracking

Full investigation and fix plan: docs/plan.md in this repo, added at
commit b4868ab (2026-04-15). Downstream workaround: feza-ai/wolf PR #108
(merged 2026-04-15) pins ZERFOO_DISABLE_CUDA_GRAPH=1 in
deploy/spark/train-crossasset-gpu.yaml.

Wolf-side devlog (investigation history and first successful GPU training
run on DGX):

  • feza-ai/wolf docs/devlog.md 2026-04-15 entry "Wave 16 — GPU silent-stall
    root cause identified (CUDA graph capture on arm64 GB10)".
