GB10 CUDA graph capture silently hangs during multi-tensor weight upload (CrossAsset training) #93

@dndungu

Description

Summary

GPUEngine[float32] silently hangs on NVIDIA GB10 (arm64 Grace Hopper, DGX Spark)
when CUDA graph capture is active and the workload uploads a non-trivial weight
set via WeightUploader.UploadWeights followed by graph construction.

Impact

Blocks GPU training for downstream callers that rely on graph capture. Wolf's
CrossAsset training (12 Fibonacci scales, 193 features per scale, 50 weight
tensors including 256x1024 matrices) reliably hangs at the "Using GPU engine"
log line with 0% GPU utilization, across 5 independent attempts.

Reproduction

Wolf CrossAsset training without the env override:

  • Hangs silently past "Using GPU engine" for 20-70 minutes
  • 0% GPU utilization throughout
  • No cudaError emitted, no panic, no deadlock visible from the Go side

With ZERFOO_DISABLE_CUDA_GRAPH=1 set in the Spark manifest env vars, the same
training completes. First successful GPU training run on DGX: epoch 0-3 losses
of 0.864, 0.693, 0.651, 0.627 (still running).

A minimal 4x4 MatMul smoke test (Wolf cmd/gpu-smoke/main.go, image
localhost:5000/wolf-gpu-smoke:wave16-2687f664) PASSES with capture
enabled — 3.5s with ZERFOO_DISABLE_CUDA_GRAPH=1, 4.4s without. So the
hang is specific to multi-tensor uploads + larger graphs, not to the
engine init or basic MatMul path.

Environment

  • NVIDIA DGX Spark GB10 (arm64 Grace Hopper), 128 GiB unified memory
  • Ubuntu 24.04 in Podman pod (via feza-ai/spark)
  • CUDA 13.0.96
  • ztensor v1.5.1-0.20260415020900-fd646fb10680
  • zerfoo v1.48.1-0.20260415044400-d3ef8b617b34
  • Go 1.26, CGO_ENABLED=1

Existing evidence

  • compute/gpu_engine.go:416-424 (TODO above line 421): MmapStorage +
    cudaMemcpy misalignment on ARM64 Grace Hopper breaks CUDA graph capture.
  • compute/engine.go:137: allocations during capture (cudaMalloc) fail with
    error 901.
  • compute/gpu_engine.go:617-630 BeginCapture: partial mitigation via
    CaptureAwareAllocator. This path is NOT invoked by
    graph/cuda_graph.go:299, which calls cuda.StreamBeginCapture directly
    without switching the engine allocator. Any allocation inside the captured
    region therefore still routes through the default allocWeight, which on
    GB10 with managed memory calls cuda.MallocManaged (illegal during capture).
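The allocator mismatch described above can be modeled without CUDA. The sketch
below mocks the two paths: allocWeight stands in for the default
cuda.MallocManaged route that fails mid-capture, and allocCaptureAware serves
from a pool reserved before capture begins. All names, the pool design, and the
error text are assumptions for illustration, not zerfoo's implementation.

```go
package main

import (
	"errors"
	"fmt"
)

// errCaptureAlloc mocks the allocation failure observed during stream
// capture (reported as error 901 in compute/engine.go:137).
var errCaptureAlloc = errors.New("allocation attempted during stream capture")

type engine struct {
	capturing bool     // mocks "a stream capture is in progress"
	pool      [][]byte // buffers reserved before BeginCapture
}

// allocWeight models the default path: fine eagerly, fatal mid-capture.
// make([]byte, n) stands in for cuda.MallocManaged.
func (e *engine) allocWeight(n int) ([]byte, error) {
	if e.capturing {
		return nil, errCaptureAlloc // the path GB10 hits today
	}
	return make([]byte, n), nil
}

// allocCaptureAware serves pre-reserved buffers while capturing, and
// falls through to the default path otherwise.
func (e *engine) allocCaptureAware(n int) ([]byte, error) {
	if !e.capturing {
		return e.allocWeight(n)
	}
	for i, b := range e.pool {
		if len(b) >= n {
			e.pool = append(e.pool[:i], e.pool[i+1:]...)
			return b[:n], nil
		}
	}
	return nil, errors.New("capture pool exhausted; grow it before BeginCapture")
}

func main() {
	e := &engine{pool: [][]byte{make([]byte, 1<<20)}} // reserve before capture
	e.capturing = true                                // mocks StreamBeginCapture
	if _, err := e.allocWeight(256 * 1024); err != nil {
		fmt.Println("default allocator:", err)
	}
	if buf, err := e.allocCaptureAware(256 * 1024); err == nil {
		fmt.Println("capture-aware allocator served", len(buf), "bytes")
	}
}
```

The fix direction implied by the evidence is for the graph/cuda_graph.go capture
entry point to swap the engine onto its capture-aware allocator (and pre-size the
pool from the known weight set) before calling cuda.StreamBeginCapture.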

Tracking

Full investigation and fix plan: docs/plan.md in this repo, added at
commit b4868ab (2026-04-15). Downstream workaround: feza-ai/wolf PR #108
(merged 2026-04-15) pins ZERFOO_DISABLE_CUDA_GRAPH=1 in
deploy/spark/train-crossasset-gpu.yaml.

Wolf-side devlog (investigation history and first successful GPU training
run on DGX):

  • feza-ai/wolf docs/devlog.md 2026-04-15 entry "Wave 16 — GPU silent-stall
    root cause identified (CUDA graph capture on arm64 GB10)".
