From b4868ab4074bda29e240cb922baa275e4f86b408 Mon Sep 17 00:00:00 2001 From: David Ndungu Date: Wed, 15 Apr 2026 21:00:36 -0700 Subject: [PATCH 1/4] docs(plan): add GB10 CUDA graph capture hang resolution plan Replaces closed Issue-79 investigation plan with comprehensive 6-epic execution plan for resolving silent hang on NVIDIA DGX Spark GB10 (arm64 Grace Hopper) during multi-tensor weight uploads with CUDA graph capture active. Waves 1-8 define parallel agent counts, deliverables (E1 reproduction, E2 capture-aware alloc, E3 conditional Mmap investigation, E4 fail-fast fallback, E5 downstream rollout, E6 release), milestones M1-M5, risk register, and Spark operational notes. --- docs/plan.md | 542 ++++++++++++++++++++++++++++++++++++++++++++++----- 1 file changed, 495 insertions(+), 47 deletions(-) diff --git a/docs/plan.md b/docs/plan.md index b8cc761..fb14eac 100644 --- a/docs/plan.md +++ b/docs/plan.md @@ -1,23 +1,495 @@ -# ztensor Open GitHub Issues Resolution +# ztensor Work Plan + +## Title + +Resolve GB10 CUDA graph capture hang in GPUEngine[float32] on multi-tensor +training workloads. ## Context -Resolve all open GitHub issues in `github.com/zerfoo/ztensor`. +### Problem statement + +`GPUEngine[float32]` silently hangs on NVIDIA GB10 (arm64 Grace Hopper, DGX +Spark) when CUDA graph capture is active and the workload uploads a +non-trivial weight set via `WeightUploader.UploadWeights` followed by graph +construction. A minimal 4x4 MatMul smoke test passes with capture enabled, +so the failure is specific to larger multi-tensor workloads. + +Reproduction downstream: Wolf CrossAsset training (12 Fibonacci scales, +193 features per scale, approximately 50 weight tensors including +256x1024 matrices) reliably hangs at the log line `Using GPU engine` with +0 percent GPU utilization across 5 independent attempts. Setting the +environment variable `ZERFOO_DISABLE_CUDA_GRAPH=1` fully bypasses the +hang and lets training complete (epochs 0 to 3 produced losses 0.864, +0.693, 0.651, 0.627). + +Environment: NVIDIA DGX Spark GB10 (arm64 Grace Hopper), Ubuntu 24.04 in +Podman container, CUDA 13.0.96, ztensor +`v1.5.1-0.20260415020900-fd646fb10680`, zerfoo +`v1.48.1-0.20260415044400-d3ef8b617b34`, Go 1.26, CGO_ENABLED=1. + +Existing evidence in source: + +- `compute/gpu_engine.go:416-424` (the TODO above line 421) documents that + `MmapStorage` plus `cudaMemcpy` misalignment on ARM64 Grace Hopper breaks + CUDA graph capture. The current workaround skips `MmapStorage` tensors in + `UploadWeights`. +- `compute/engine.go:137` documents that allocations during capture + (`cudaMalloc`) fail with error 901. +- Partial mitigation exists at `compute/gpu_engine.go:617-630` + (`BeginCapture`) which switches `pool` to `CaptureAwareAllocator` so + allocations during capture use `cudaMallocAsync` on the capture stream + and are recorded as graph nodes. This path is not exercised when + training through zerfoo's `graph/cuda_graph.go` capture wrapper, which + calls `cuda.StreamBeginCapture` directly at `graph/cuda_graph.go:299` + without switching the engine's allocator. +- Upstream tracker: feza-ai/wolf PR #108 (merged, pins + `ZERFOO_DISABLE_CUDA_GRAPH=1`) and zerfoo `docs/adr/088-gemma4-ple-cuda-graph-capture.md` + which fixed a related capture breakage in gemma4e inference. + +### Objectives + +- Identify the exact allocation or H2D path that triggers a silent hang + during graph capture on GB10 with a multi-tensor upload followed by + forward pass. +- Deliver either working CUDA graph capture on GB10 under production + training workloads, or a fail-fast error with an actionable message when + capture cannot safely proceed. +- Remove the need for downstream callers (Wolf, zerfoo inference + manifests) to set `ZERFOO_DISABLE_CUDA_GRAPH=1` for the affected + workloads. +- Preserve the existing gemma4e inference capture path documented in + zerfoo ADR-088 (no regression on passing workloads). + +### Non goals + +- Rewriting the `MmapStorage` quantized-weight path to use a different + upload strategy. Scope is constrained to making capture safe (or fail + loudly) with the existing upload paths for CrossAsset-style dense + float32 workloads. +- Adding new CUDA kernel code. The fix is expected to live in the capture + lifecycle, allocator routing, and error handling layers. +- Supporting CUDA graph capture on non-managed-memory GPUs where it is + currently off by default. + +### Constraints and assumptions + +- DGX Spark (GB10) is the only target hardware where this bug manifests. + Local dev on Apple Silicon and x86 CPU tests cannot reproduce, so fixes + must be validated via Spark pod submissions (`scripts/bench-spark.sh` + equivalents, or ad-hoc manifests in `docs/bench/manifests/`). Never + `ssh` to the DGX to run benches; follow the repo convention in + `/Users/dndungu/Code/zerfoo/zerfoo/CLAUDE.md`. +- ztensor must remain CGO-free by default. CUDA access is via + `purego/dlopen` through `internal/cuda`. Any new runtime probe must go + through `internal/cuda/runtime_purego.go`. +- Managed memory path (`e.managedMem`) is the default on GB10 (unified + memory). The hang happens on that path. Do not assume a non-managed + baseline. +- The main branch must stay green for CPU and non-capture GPU tests on + every commit. Capture-specific tests gate on a DGX runner. + +### Success metrics + +- CrossAsset GPU training completes at least 3 epochs on DGX GB10 with + CUDA graph capture enabled (no env-var override) and produces + decreasing loss across epochs. +- A reproduction test in `compute/` (or a new `graph/` test) triggers the + same code path the hang followed, and now either passes with capture + on or returns a typed error that names the capture-incompatible + operation within 5 seconds. +- `ZERFOO_DISABLE_CUDA_GRAPH=1` is removed from Wolf + `deploy/spark/train-crossasset-gpu.yaml` and from zerfoo + `docs/bench/manifests/gemma4-e2e.yaml` and `gpu-parity.yaml` (the + latter only if the capture fix covers their workloads). +- No regression on the 184/185 instruction capture rate measured on + GGUF inference (see zerfoo `docs/adr/033-how-we-beat-ollama.md`). + +## Discovery Summary + +ENGINEERING discovery against the knowledge graph was not rerun for this +plan because the symptom, reproduction path, and suspect code sites are +already identified in the user-supplied report and in-source TODOs at +`compute/gpu_engine.go:421` and `compute/engine.go:137`. The discovery +artifact lives inline below. + +### Relevant code paths + +- `compute/gpu_engine.go:293-525` -- `UploadWeights` entry point, covers + Q4_K, Q5_0, Q8_0, FP8 E4M3, FP16, BF16, float32 branches. Each branch + calls `allocWeight` then `uploadBytes`. `MmapStorage` is explicitly + skipped. +- `compute/gpu_engine.go:576-596` -- `allocWeight` and `uploadBytes`. + With `managedMem`, allocation routes through `cuda.MallocManaged` and + upload is a direct host memcpy. Without managed memory, allocation + routes through `e.runtime.Malloc` (the GRAL default) and upload issues + `cudaMemcpyHostToDevice`. +- `compute/gpu_engine.go:611-655` -- `BeginCapture`/`EndCapture` on the + engine. Switches the pool to `CaptureAwareAllocator`. +- `graph/cuda_graph.go:270-345` -- The zerfoo-facing capture driver + that actually calls `cuda.StreamBeginCapture`. This path does NOT + invoke `GPUEngine.BeginCapture`, so the capture-aware allocator + switch is missed. Any allocation inside the captured region still goes + through the default `allocWeight`, which on GB10 with managed memory + calls `cuda.MallocManaged` (illegal during capture). +- `internal/cuda/runtime_purego.go:368-385` -- `StreamBeginCapture` + uses `cudaStreamCaptureModeRelaxed`. Relaxed mode does not forbid + host work but it does forbid `cudaMalloc` family calls on the capture + stream. + +### Likely root-cause candidates (in priority order) + +1. `graph/cuda_graph.go` begins capture without routing the engine's + allocator through the capture-aware path. A mid-capture + `cuda.MallocManaged` or arena resize returns error 901 synchronously, + but the return is swallowed because the arena path logs at a level + that is suppressed, or the stream goes into an unrecoverable captured + state and the next `Sync` deadlocks. +2. `MmapStorage` quantized weights are lazy: `matMulMmap` dequantizes + per op and uploads via `cudaMemcpy` on the capture stream. On ARM64 + with an unaligned mmap base, this H2D either fails silently or + corrupts the stream capture graph, causing the next CUDA call to + block forever. +3. The first forward pass crosses the kv-cache-like workspace setup + that allocates a scratch buffer lazily. The allocation is not + registered with the pre-capture `EnsureCaptureInputsGPU` code at + `graph/cuda_graph.go:283-287`, so it races with capture. + +### Use case catalog + +| ID | Domain | Name | Actor | Interfaces | Priority | Wiring status | +|----|--------|------|-------|-----------|----------|---------------| +| UC-001 | compute | Upload a multi-tensor float32 weight set to GB10 managed memory before capture | zerfoo training driver | `GPUEngine.UploadWeights` | P0 | WIRED | +| UC-002 | compute | Run a captured forward+backward pass on CrossAsset-shape float32 tensors | zerfoo training driver | `GPUEngine.BeginCapture` / `graph.BuildAndRun` / `EndCapture` | P0 | BROKEN on GB10 | +| UC-003 | compute | Detect a non-capturable allocation attempt and return a typed error instead of hanging | zerfoo training/inference driver | `GPUEngine.BeginCapture`, `allocWeight` | P0 | MISSING | +| UC-004 | compute | Reset the GPU arena between training batches without disturbing an active capture | zerfoo trainer | `compute.PoolResetter.ResetPool` | P1 | WIRED (verify) | +| UC-005 | compute | Fall back to non-captured execution when capture setup fails, without requiring process restart | zerfoo runtime | `graph/cuda_graph.go:RunInstructions` fallback path | P1 | PARTIAL (existing rollback only covers `StreamBeginCapture` failures, not post-capture hangs) | +| UC-006 | compute | Re-enable CUDA graph capture for gemma4e inference on GB10 via manifest edits | zerfoo serve / bench | `docs/bench/manifests/gemma4-e2e.yaml` | P1 | BLOCKED on this plan | +| UC-007 | compute | Re-enable CUDA graph capture for CrossAsset training on GB10 via Wolf manifest | Wolf trainer | `deploy/spark/train-crossasset-gpu.yaml` | P0 | BLOCKED on this plan | +| UC-008 | compute | Regression coverage for the minimal hang repro in CI (DGX-only job) | ztensor developer | `go test ./graph/... -run TestCUDAGraph_MultiTensorUpload` | P1 | MISSING | + +Gaps: UC-002, UC-003, UC-008 need implementation. UC-005 is partially +wired (only the StreamBeginCapture-failure rollback path at +`graph/cuda_graph.go:299-303` covers this; a post-capture timeout is +missing). + +Reference (for this plan's purposes): manifest derived inline above, no +separate JSON artifact committed. If the fix evolves further, write +`.claude/scratch/usecases-manifest.json` on the next iteration. + +## Scope and Deliverables + +### In scope + +- Reproduction harness that runs on DGX GB10 via Spark and reliably + triggers the hang within 60 seconds when capture is active. +- Instrumentation that turns the silent hang into an observable error + (stream capture status probe + explicit log on allocator calls during + capture). +- Root-cause fix (one of: allocator routing, MmapStorage alignment, + pre-capture workspace allocation) that allows CrossAsset training to + run with capture on. +- Fail-fast mode that detects unavoidable capture-incompatible + conditions and returns a typed error so the caller can retry without + capture. +- Regression test gated on a build tag or environment variable so it + only runs on DGX. +- Manifest updates in downstream consumers once the fix lands. +- ADR documenting the decision (new ztensor ADR-003, taking the next + number in that repo's `docs/adr/`). + +### Out of scope + +- Porting the fix to ROCm or OpenCL backends. Those paths do not have + capture support today. +- Changing the default `managedMem` detection logic. +- Rewriting the quantized-weight upload logic. If `MmapStorage` turns + out to be a root cause, the fix is to guard capture entry, not to + redesign weight upload. + +### Deliverables + +| ID | Description | Owner | Acceptance criteria | +|----|-------------|-------|---------------------| +| D1 | Reproduction test `TestCUDAGraph_MultiTensorUpload_GB10` in `graph/cuda_graph_test.go` | TBD | Hangs or fails consistently on GB10 without the fix, passes after the fix, runs under 60s | +| D2 | Diagnostic probe `cuda.StreamCaptureStatus` exposed via `internal/cuda/runtime_purego.go` | TBD | Returns one of `None`, `Active`, `Invalidated` with unit tests on CPU-mock path | +| D3 | Capture-aware allocator wiring in `graph/cuda_graph.go` | TBD | All allocations inside capture region go through `CaptureAwareAllocator`; verified by logging on debug build | +| D4 | Typed error `compute.ErrCaptureIncompatibleAllocation` returned from `allocWeight` and `uploadBytes` when called on a capturing stream | TBD | Callers get the error synchronously; no hang possible | +| D5 | Root-cause fix passing CrossAsset training on GB10 with capture enabled | TBD | 3 epochs complete, losses decrease, runtime within 10 percent of the disable-graph baseline | +| D6 | ADR documenting decision in ztensor `docs/adr/003-cuda-graph-capture-on-gb10.md` | TBD | Covers context, options considered, decision, consequences | +| D7 | Downstream manifest cleanups (Wolf + zerfoo) that drop `ZERFOO_DISABLE_CUDA_GRAPH=1` for workloads the fix covers | TBD | Manifests merged; CI green on affected jobs | + +## Checkable Work Breakdown + +All estimates are rough; refine when a task starts. + +### E1 Reproduce and instrument the hang + +- [ ] T1.1 Add `StreamCaptureStatus` purego binding in `internal/cuda/runtime_purego.go` (wraps `cudaStreamGetCaptureInfo`). Owner: TBD. Est: 90m. verifies: [UC-003] + - Acceptance: Returns the three-valued enum, exported via `cuda.StreamCaptureStatus(stream *Stream) (Status, error)`. Unit test on a non-capturing stream returns `None`. + - Dependencies: none. +- [ ] T1.2 Add `ensureNotCapturing()` guard to `allocWeight` and `uploadBytes` in `compute/gpu_engine.go`. If status is `Active`, return a typed error `ErrCaptureIncompatibleAllocation`. Owner: TBD. Est: 60m. verifies: [UC-003] + - Acceptance: Existing non-capture tests unaffected. New unit test with a mock stream in `Active` state triggers the error. + - Dependencies: T1.1. +- [ ] T1.3 Write `TestCUDAGraph_MultiTensorUpload_GB10` in `compute/gpu_engine_test.go` gated behind `//go:build dgxgb10` build tag. The test uploads 50 tensors (including a 256x1024 float32 matrix), then invokes `BeginCapture`, runs a MatMul, `EndCapture`. Owner: TBD. Est: 2h. verifies: [UC-001, UC-002] + - Acceptance: Without the fix the test fails with either a hang (caught by a 30s `context.WithTimeout`) or the new typed error. + - Dependencies: T1.2. +- [ ] T1.4 Package the test into a Spark manifest `docs/bench/manifests/cuda-graph-gb10-repro.yaml` and submit. Collect logs for evidence. Owner: TBD. Est: 90m. verifies: [UC-002] + - Acceptance: Manifest submitted via `curl -X POST $SPARK/api/v1/pods ...`; log output includes the hang signature or the new typed error. File one zerfoo-side GitHub issue if a new failure mode surfaces. + - Dependencies: T1.3. +- [ ] T1.5 Add unit and integration tests covering T1.1 to T1.3 code paths. Owner: TBD. Est: 60m. verifies: [infrastructure] + - Acceptance: CPU-mock unit tests pass in `go test ./compute/... ./internal/cuda/...`. + - Dependencies: T1.1, T1.2. +- [ ] T1.6 Run `gofmt -s -w`, `goimports`, and `golangci-lint run ./...` after the E1 changes. Owner: TBD. Est: 15m. verifies: [infrastructure] + - Dependencies: T1.5. + +### E2 Fix the silent hang path (capture-aware allocation) + +- [ ] T2.1 Route `zerfoo/graph/cuda_graph.go` capture entry through `GPUEngine.BeginCapture`/`EndCapture` instead of calling `cuda.StreamBeginCapture` directly. Owner: TBD. Est: 2h. verifies: [UC-002, UC-005] + - Acceptance: Log line shows `CaptureAwareAllocator` is engaged before the capture region; existing gemma4e inference tests still pass. + - Risk: zerfoo `graph/cuda_graph.go` is across a repo boundary. This task splits into ztensor-side (T2.1a) and zerfoo-side (T2.1b) commits in separate PRs, wired through a ztensor minor bump. + - Dependencies: T1.4. +- [ ] T2.1a ztensor: expose a stable `compute.GPUEngine.WithCapture(fn func() error) error` helper so callers do not need to unwrap pool types. Owner: TBD. Est: 60m. verifies: [UC-002] + - Acceptance: Helper unit-tested on CPU-mock engine; returns errors from either begin/end path. + - Dependencies: T1.2. +- [ ] T2.1b zerfoo: switch `graph/cuda_graph.go:beginCapture` to use `WithCapture`. Owner: TBD. Est: 45m. verifies: [UC-002] + - Acceptance: Existing zerfoo GGUF inference tests still pass; gemma4e and gemma3 parity suites unchanged. + - Dependencies: T2.1a, ztensor version bump merged. +- [ ] T2.2 Introduce a `managedMem` guard in `allocWeight` that routes to `cudaMallocAsync` on the capture stream when `CaptureAwareAllocator` is active. Otherwise fall back to `MallocManaged`. Owner: TBD. Est: 90m. verifies: [UC-002] + - Acceptance: Unit test with a mocked capture stream records an async-alloc node instead of a sync call. + - Dependencies: T2.1a. +- [ ] T2.3 Pre-allocate workspace buffers used by `MatMul`, `Add`, and `RMSNorm` variants at `UploadWeights` time so no lazy alloc occurs inside capture for dense float32 workloads. Owner: TBD. Est: 3h. verifies: [UC-001, UC-002] + - Acceptance: Instrument with a counter; capture region records zero `allocWeight` calls for the CrossAsset workload. + - Dependencies: T1.3, T2.1a. +- [ ] T2.4 Add unit and integration tests for T2.1 to T2.3. Owner: TBD. Est: 90m. verifies: [infrastructure] + - Dependencies: T2.3. +- [ ] T2.5 Run linters and formatters (`gofmt`, `goimports`, `golangci-lint`). Owner: TBD. Est: 15m. verifies: [infrastructure] + - Dependencies: T2.4. +- [ ] T2.6 Submit the repro manifest from T1.4 on the fixed branch. Confirm CrossAsset-shape upload + capture run completes in under 5 seconds. Owner: TBD. Est: 60m. verifies: [UC-002, UC-007] + - Acceptance: Pod `Succeeded`; log excerpt saved in devlog. + - Dependencies: T2.5. + +### E3 Investigate MmapStorage alignment on GB10 (conditional on E2 not being sufficient) + +- [ ] T3.1 Add a targeted test `TestMmapStorage_GB10_Align` that allocates an `MmapStorage` tensor whose base address is intentionally 4-byte aligned (not 16) and calls `cudaMemcpy` onto the capture stream. Owner: TBD. Est: 2h. verifies: [UC-001] + - Acceptance: Reproduces the corruption on GB10 OR cleanly confirms that managed-memory path sidesteps the issue. + - Dependencies: T2.6. +- [ ] T3.2 If T3.1 reproduces, pad `MmapStorage.Bytes()` to a 128-byte aligned staging buffer before `cudaMemcpy`. Otherwise document in the ADR that `MmapStorage` skip in `UploadWeights` remains the intended behavior. Owner: TBD. Est: 3h. verifies: [UC-001] + - Dependencies: T3.1. +- [ ] T3.3 Update the TODO at `compute/gpu_engine.go:421` so the comment reflects the resolved state (either fixed with T3.2 or reaffirmed as intended design). Owner: TBD. Est: 15m. verifies: [infrastructure] + - Dependencies: T3.2. +- [ ] T3.4 Tests, linters, formatters. Owner: TBD. Est: 30m. verifies: [infrastructure] + - Dependencies: T3.3. + +### E4 Fail-fast path for residual capture-incompatible workloads + +- [ ] T4.1 Wrap `graph/cuda_graph.go` capture run with a 30-second watchdog that samples `StreamCaptureStatus` every second. If capture is `Invalidated` or a heartbeat ping stalls, call `StreamEndCapture`, mark failed, and fall back. Owner: TBD. Est: 2h. verifies: [UC-005] + - Dependencies: T1.1. +- [ ] T4.2 Expose a helper `compute.CaptureSafe(engine, fn)` that tries capture, catches `ErrCaptureIncompatibleAllocation`, and runs the instructions uncaptured on the same stream. Owner: TBD. Est: 90m. verifies: [UC-005] + - Dependencies: T1.2, T4.1. +- [ ] T4.3 Tests, linters, formatters. Owner: TBD. Est: 30m. verifies: [infrastructure] + - Dependencies: T4.2. + +### E5 Downstream rollout + +- [ ] T5.1 Remove `ZERFOO_DISABLE_CUDA_GRAPH=1` from Wolf `deploy/spark/train-crossasset-gpu.yaml`. Submit the bench once with capture enabled and attach logs. Owner: TBD. Est: 60m. verifies: [UC-007] + - Dependencies: T2.6 (ztensor fix released), T2.1b (zerfoo pickup). +- [ ] T5.2 Remove `ZERFOO_DISABLE_CUDA_GRAPH=1` from zerfoo `docs/bench/manifests/gemma4-e2e.yaml` once capture passes the parity suite without it. Owner: TBD. Est: 60m. verifies: [UC-006] + - Dependencies: T2.6. +- [ ] T5.3 Keep `ZERFOO_DISABLE_CUDA_GRAPH=1` in `docs/bench/manifests/gpu-parity.yaml` only if a specific parity workload still requires it; otherwise remove. Owner: TBD. Est: 30m. verifies: [UC-006] + - Dependencies: T5.2. +- [ ] T5.4 Update docs: remove the "known issue" note from zerfoo ADR-088's Consequences section once the gemma4e manifest drops the override. Owner: TBD. Est: 30m. verifies: [infrastructure] + - Dependencies: T5.2. + +### E6 Release and documentation + +- [ ] T6.1 Write ztensor `docs/adr/003-cuda-graph-capture-on-gb10.md` capturing context, options considered, decision, and consequences. Owner: TBD. Est: 90m. verifies: [infrastructure] + - Dependencies: T2.6. +- [ ] T6.2 Append a devlog entry dated 2026-04-15 describing the hang repro, the root cause, and the fix. Include the Spark pod name(s) and log excerpts. Owner: TBD. Est: 45m. verifies: [infrastructure] + - Dependencies: T6.1. +- [ ] T6.3 Cut a ztensor minor release via release-please (`v1.6.0`). Bump zerfoo dependency once tag publishes. Owner: TBD. Est: 60m. verifies: [infrastructure] + - Acceptance: `github.com/zerfoo/ztensor v1.6.0` on `main`; zerfoo `go.mod` updated in the same cycle as T2.1b. + - Dependencies: T6.2. + +## Parallel Work -Status as of 2026-04-09: -- **#78** NCCL purego migration -- CLOSED via PR #80 (merged `af8af73`). -- **#79** GPU engine dst-output routing -- INVESTIGATION CLOSED ztensor-side. - ztensor primitives cleared; follow-up must happen in zerfoo. Branch - `fix/issue-79-matmul-accumulate-repro` retained as evidence. +### Parallel tracks -No open ztensor issues remain assigned to this plan. If new issues are -filed, re-open this plan or start a fresh one. +| Track | Tasks | Notes | +|-------|-------|-------| +| A: Reproduction and probe | T1.1, T1.2, T1.3 | Must finish first to unblock everything else | +| B: Fix path | T2.1a, T2.2, T2.3 | Can start once T1.2 lands the probe | +| C: Mmap investigation | T3.1, T3.2 | Starts only after T2 confirms the fix is or is not sufficient | +| D: Fallback path | T4.1, T4.2 | Runs in parallel with Track B once T1.1 is in | +| E: zerfoo pickup | T2.1b | Sequential after T2.1a is released | +| F: Rollout | T5.1, T5.2, T5.3, T5.4 | After the fix is released | -## Evidence for #79 closure +Sync points: the ztensor release (T6.3) is the hard sync for any +zerfoo-side change. Track E cannot start until Track B tags a version. -Test file `compute/gpu_dst_roundtrip_test.go` on branch -`fix/issue-79-matmul-accumulate-repro` ports the exact backward-pass op -sequence from `zerfoo/timeseries/patchtst_gpu_train.go:1022-1031`: +### Waves + +Each wave lists the exact number of parallel agents to spin up. Agent +count equals the number of task IDs listed on that wave. + +#### Wave 1: Repro and probe (2 agents) + +- [ ] T1.1 Add `StreamCaptureStatus` purego binding verifies: [UC-003] +- [ ] T1.2 Add `ensureNotCapturing` guard and typed error verifies: [UC-003] + +#### Wave 2: Reproduction harness (3 agents) + +- [ ] T1.3 Write `TestCUDAGraph_MultiTensorUpload_GB10` verifies: [UC-001, UC-002] +- [ ] T1.5 Unit and integration tests for E1 verifies: [infrastructure] +- [ ] T1.6 Lint and format E1 verifies: [infrastructure] + +#### Wave 3: Repro on hardware (1 agent) + +- [ ] T1.4 Spark manifest and hardware run verifies: [UC-002] + +#### Wave 4: Fix + fallback in parallel (4 agents) + +- [ ] T2.1a ztensor `WithCapture` helper verifies: [UC-002] +- [ ] T2.2 Capture-aware `allocWeight` routing verifies: [UC-002] +- [ ] T2.3 Pre-allocate forward-pass workspace verifies: [UC-001, UC-002] +- [ ] T4.1 Capture watchdog verifies: [UC-005] + +#### Wave 5: Tests, linters, zerfoo pickup (4 agents) + +- [ ] T2.4 Unit and integration tests for E2 verifies: [infrastructure] +- [ ] T2.5 Lint and format E2 verifies: [infrastructure] +- [ ] T4.2 `CaptureSafe` helper verifies: [UC-005] +- [ ] T4.3 Lint and format E4 verifies: [infrastructure] + +#### Wave 6: Hardware validation (1 agent) + +- [ ] T2.6 CrossAsset-shape capture run on DGX verifies: [UC-002, UC-007] + +#### Wave 7: Release + downstream cleanup (3 agents) + +- [ ] T6.1 ADR-003 for ztensor verifies: [infrastructure] +- [ ] T6.2 Devlog entry verifies: [infrastructure] +- [ ] T6.3 Cut ztensor v1.6.0 verifies: [infrastructure] + +#### Wave 8: Mmap follow-up (conditional, 4 agents) + +- [ ] T3.1 Mmap alignment repro verifies: [UC-001] +- [ ] T3.2 Mmap alignment fix or confirmation verifies: [UC-001] +- [ ] T3.3 Update gpu_engine.go:421 TODO verifies: [infrastructure] +- [ ] T3.4 Tests, linters verifies: [infrastructure] + +#### Wave 9: Rollout (3 agents) + +- [ ] T5.1 Drop env var from Wolf manifest verifies: [UC-007] +- [ ] T5.2 Drop env var from gemma4-e2e manifest verifies: [UC-006] +- [ ] T5.4 Update zerfoo ADR-088 Consequences verifies: [infrastructure] + +Wave 5.3 handles the `gpu-parity.yaml` manifest only if T5.2 verification +succeeds; it sits as a stretch alongside Wave 9. + +## Timeline and Milestones + +| ID | Description | Depends on | Target date | +|----|-------------|------------|-------------| +| M1 | Reproduction test reliably triggers the hang on DGX and returns a typed error (no silent hang) | T1.4 | 2026-04-17 | +| M2 | Fix merged to ztensor `main`, CrossAsset-shape capture passes on DGX | T2.6 | 2026-04-21 | +| M3 | ztensor v1.6.0 released and picked up by zerfoo `main` | T6.3 | 2026-04-23 | +| M4 | `ZERFOO_DISABLE_CUDA_GRAPH=1` removed from Wolf CrossAsset deploy manifest, 3 training epochs pass with capture on | T5.1 | 2026-04-25 | +| M5 | Gemma4e inference manifest cleaned up; ADR-088 consequences updated | T5.2, T5.4 | 2026-04-28 | + +## Risk Register + +| ID | Risk | Impact | Likelihood | Mitigation | +|----|------|--------|------------|------------| +| R1 | Root cause is neither allocator routing nor Mmap alignment but an intrinsic CUDA 13 + GB10 driver bug | Forces permanent `ZERFOO_DISABLE_CUDA_GRAPH=1` on training workloads | Medium | Wave 4 includes the fail-fast path (T4.1/T4.2); even if the fix fails, we ship a clean typed error and stop the silent hang | +| R2 | Capture-aware allocator forces `cudaMallocAsync`, which GB10 driver stack may not honor in managed-memory mode | Partial capture broken across all GGUF inference paths | Medium | Gate the new routing behind a runtime probe that confirms `cudaStreamGetCaptureInfo` reports `Active` before switching allocators | +| R3 | Watchdog false-positive abandons valid captures on slow first-pass warmup | Performance regression for inference | Low | Use 30-second default, only trigger when `StreamCaptureStatus` is `Invalidated` not merely slow | +| R4 | zerfoo-side pickup of `WithCapture` lags the release, leaving the bug live | Continued production pain in Wolf | Medium | Land T2.1a and T2.1b in the same 48-hour window, pair with a zerfoo patch release | +| R5 | Pre-allocated workspace buffers bloat GPU memory for small models | Memory regression on edge models | Low | Keep allocation lazy but move it out of the captured region; only allocate on first warmup pass | +| R6 | Tests gated by `//go:build dgxgb10` never run in CI | Regression regressed silently | Medium | Add a DGX runner selector that submits the gated test via `scripts/bench-spark.sh`-style wrapper at least weekly | + +## Operating Procedure + +- Definition of done for each task: PR merged, CI green, DGX Spark run + attached (for GPU tasks), ADR updated where applicable, release cut + where the task is blocked by a version bump. +- Every implementation task has a paired testing subtask. Add tests + under `compute/` for engine-level fixes and under `graph/` for + capture-lifecycle fixes. +- After each commit run `gofmt -s -w`, `goimports -w`, and + `golangci-lint run ./...` on the affected packages. +- Small focused commits; never mix changes across `compute/`, + `graph/`, `internal/cuda/` in one commit because the pre-commit hook + rejects cross-directory staging. +- DGX benches go via Spark only. Never `ssh` to run `go test -tags + cuda` or `go test -bench` on DGX (see zerfoo CLAUDE.md line on the + 2026-04-07 outage). +- Use rebase and merge on GitHub, not squash, not merge commits. +- After merging to `main`, let release-please open a release PR and + merge it to tag the ztensor release. + +## Progress Log + +### 2026-04-15 Change summary + +- Replaced the closed-Issue-79 plan body with a new plan targeting the + GB10 CUDA graph capture hang reported via Wolf PR #108. Preserved + Issue-79 investigation notes in the `Archive` section below because + they document DGX Spark procedural gotchas that remain relevant. +- No tasks completed yet; seeded Epics E1 through E6 and Milestones M1 + through M5. +- No ADRs created yet. The plan commits to ztensor + `docs/adr/003-cuda-graph-capture-on-gb10.md` being written under T6.1. +- Cross-references: zerfoo `docs/adr/088-gemma4-ple-cuda-graph-capture.md`, + zerfoo `docs/plan.md` epic E99, zerfoo `docs/devlog.md` entries dated + 2026-04-14 and 2026-04-15 on `ZERFOO_DISABLE_CUDA_GRAPH=1`. + +## Hand off Notes + +A new engineer picking this up needs: + +- DGX Spark access via the Spark HTTP API on + `http://192.168.86.250:8080`. No interactive `ssh` for benches (see + `/Users/dndungu/Code/zerfoo/zerfoo/CLAUDE.md`). +- Familiarity with `compute/gpu_engine.go` (UploadWeights and capture + entry points) and `graph/cuda_graph.go` (capture driver). Read + zerfoo ADR-088 first for the gemma4e precedent. +- `docs/bench/manifests/` examples to copy when writing + `cuda-graph-gb10-repro.yaml`. +- Access to the Wolf repo at `github.com/feza-ai/wolf` for the + downstream manifest cleanup (T5.1). +- Permission to cut a ztensor release (release-please PR merge rights). +- Do not commit secrets or API tokens; `SPARK_API_TOKEN` lives in the + DGX host and is referenced via `Authorization: Bearer $(cat token)` + only. + +## Appendix + +### Referenced files + +- `compute/gpu_engine.go:293` UploadWeights entry +- `compute/gpu_engine.go:416-424` MmapStorage skip TODO +- `compute/gpu_engine.go:576-596` allocWeight and uploadBytes +- `compute/gpu_engine.go:611-655` BeginCapture and EndCapture +- `compute/engine.go:137` documented cudaMalloc 901 constraint +- `graph/cuda_graph.go:270-345` capture driver (no allocator switch) +- `internal/cuda/runtime_purego.go:368-385` StreamBeginCapture purego +- zerfoo `docs/adr/088-gemma4-ple-cuda-graph-capture.md` precedent + +### Archive -- Issue 79 investigation (closed 2026-04-09) + +Retained for two reasons: the Spark operational notes still apply to +this plan, and the closure evidence demonstrates that ztensor +primitives are not at fault for the PatchTST frozen-loss signature, +which informs where NOT to look when debugging the GB10 hang. + +- #78 NCCL purego migration -- CLOSED via PR #80 (merged `af8af73`). +- #79 GPU engine dst-output routing -- INVESTIGATION CLOSED ztensor-side. + Branch `fix/issue-79-matmul-accumulate-repro` retained as evidence. + +Test file `compute/gpu_dst_roundtrip_test.go` on that branch ports the +exact backward-pass op sequence from +`zerfoo/timeseries/patchtst_gpu_train.go:1022-1031`: ``` Transpose(patches -> patchesT) @@ -27,38 +499,25 @@ Add(gradW, dPEW, gradW) # in-place accumulate gradW.Data() ``` -Ran 7 variants on DGX GB10 via Spark pod `ztensor-issue79-repro-1775761950`: +Ran 7 variants on DGX GB10 via Spark pod +`ztensor-issue79-repro-1775761950`: ``` TestGPUEngine_Add_DstRoundTrip_OutOfPlace PASS TestGPUEngine_Add_DstRoundTrip_InPlace PASS TestGPUEngine_Add_DstRoundTrip_RepeatedInPlace PASS TestGPUEngine_Add_DstRoundTrip_NoExplicitSync PASS -TestGPUEngine_PatchTSTBackward_DstRoundTrip PASS (4x3 / 4x2 tiny) -TestGPUEngine_PatchTSTBackward_RealisticShapes PASS (1600x8 / 1600x64, 20 iters) -TestGPUEngine_PatchTSTBackward_LargerBatch PASS (3200x8 / 3200x64, 20 iters) +TestGPUEngine_PatchTSTBackward_DstRoundTrip PASS +TestGPUEngine_PatchTSTBackward_RealisticShapes PASS +TestGPUEngine_PatchTSTBackward_LargerBatch PASS ``` -None of the four hypotheses (alpha/beta/gamma/delta) from issue #79 is -triggered by the patch-embedding backward sequence at production shapes -and over many accumulation iterations. The `makeGPUResult` / -`SetStorage` / `GPUStorage.Slice()` path correctly routes dst tensors. - -Findings posted to issue #79 as two comments; devlog entry dated -2026-04-09. - -## Remaining suspects (zerfoo-side) +None of the four hypotheses from the issue body was triggered. The +`makeGPUResult` / `SetStorage` / `GPUStorage.Slice()` path correctly +routes dst tensors. -Future work, not tracked in this plan: -1. `encoderBackward` full op chain (dozens of ops per batch) -- not covered here. -2. CPU-loop `dPosData += dXData` interleave at `patchtst_gpu_train.go:1012-1019` - forcing mid-backward D2H on the same stream. -3. zerfoo-side `gradTs` wrapper/arena state diverging from fresh `tensor.New` wrappers. - -Recommended next action: instrument `trainWindowedGPU` directly -(log device pointers before/after each op on the first batch). - -## Infra notes captured during investigation +Spark operational gotchas captured during that investigation, still +valid: - Spark silently drops `pod.spec.containers[0].command` when multi-element. Use `args: ["bash", "-c", ...]` with no `command` field. @@ -74,14 +533,3 @@ Recommended next action: instrument `trainWindowedGPU` directly Reference manifest: `docs/bench/manifests/issue-79-repro.yaml`. Reference script: `/var/lib/zerfoo/bench-out/issue79-run.sh` on DGX host. - -## Progress Log - -### 2026-04-09 -- Investigation closed -- Merged PR #80, closed #78. -- Wrote, expanded, and ran 7-variant dst-routing test suite on DGX GB10. -- All variants PASS at production shapes with 20-iteration accumulation. -- Posted two update comments to issue #79 with findings. -- Trimmed plan: removed E3/E4/E5/E6 (diagnose/fix/release tracks) since - the underlying premise -- reproducing #79 at ztensor level -- is - disproven. Investigation continues in zerfoo if at all. From 033528a2229f891adea973a3c3e96d41747f3116 Mon Sep 17 00:00:00 2001 From: David Ndungu Date: Wed, 15 Apr 2026 21:04:15 -0700 Subject: [PATCH 2/4] feat(cuda): T1.1 add StreamCaptureStatus purego binding Wraps cudaStreamGetCaptureInfo so callers can detect stream capture state before recording incompatible operations. Returns None without error when the runtime is unavailable (CPU-only builds). Used by T1.2's ensureNotCapturing guard in compute/gpu_engine.go to block sync allocations during graph capture. --- internal/cuda/purego.go | 14 ++++--- internal/cuda/runtime_purego.go | 35 +++++++++++++++++ internal/cuda/runtime_purego_test.go | 58 ++++++++++++++++++++++++++++ 3 files changed, 101 insertions(+), 6 deletions(-) create mode 100644 internal/cuda/runtime_purego_test.go diff --git a/internal/cuda/purego.go b/internal/cuda/purego.go index 5c49e4a..b22313d 100644 --- a/internal/cuda/purego.go +++ b/internal/cuda/purego.go @@ -34,12 +34,13 @@ type CUDALib struct { cudaMemsetAsync uintptr // CUDA graph API (optional, resolved separately -- may not exist on older runtimes) - cudaStreamBeginCapture uintptr - cudaStreamEndCapture uintptr - cudaGraphInstantiate uintptr - cudaGraphLaunch uintptr - cudaGraphDestroy uintptr - cudaGraphExecDestroy uintptr + cudaStreamBeginCapture uintptr + cudaStreamEndCapture uintptr + cudaStreamGetCaptureInfo uintptr + cudaGraphInstantiate uintptr + cudaGraphLaunch uintptr + cudaGraphDestroy uintptr + cudaGraphExecDestroy uintptr } var ( @@ -116,6 +117,7 @@ func Open() (*CUDALib, error) { // CUDA graph API (CUDA 10.0+) {"cudaStreamBeginCapture", &lib.cudaStreamBeginCapture}, {"cudaStreamEndCapture", &lib.cudaStreamEndCapture}, + {"cudaStreamGetCaptureInfo", &lib.cudaStreamGetCaptureInfo}, {"cudaGraphInstantiate", &lib.cudaGraphInstantiate}, {"cudaGraphLaunch", &lib.cudaGraphLaunch}, {"cudaGraphDestroy", &lib.cudaGraphDestroy}, diff --git a/internal/cuda/runtime_purego.go b/internal/cuda/runtime_purego.go index c23418c..aee1d0c 100644 --- a/internal/cuda/runtime_purego.go +++ b/internal/cuda/runtime_purego.go @@ -394,6 +394,41 @@ func StreamEndCapture(s *Stream) (*Graph, error) { return &Graph{handle: graphHandle}, nil } +// CaptureStatus reports whether a stream is currently capturing. +// Values mirror CUDA's cudaStreamCaptureStatus enum. +type CaptureStatus int + +const ( + // CaptureStatusNone indicates the stream is not capturing (cudaStreamCaptureStatusNone). + CaptureStatusNone CaptureStatus = 0 + // CaptureStatusActive indicates the stream is actively capturing (cudaStreamCaptureStatusActive). + CaptureStatusActive CaptureStatus = 1 + // CaptureStatusInvalidated indicates the stream was capturing but an error + // invalidated the capture (cudaStreamCaptureStatusInvalidated). + CaptureStatusInvalidated CaptureStatus = 2 +) + +// StreamCaptureStatus queries the capture status of a stream. +// Wraps cudaStreamGetCaptureInfo. Returns CaptureStatusNone without error +// when the CUDA runtime is unavailable (CPU-only builds) — a non-CUDA +// runtime is never capturing. +func StreamCaptureStatus(s *Stream) (CaptureStatus, error) { + l := lib() + if l == nil || l.cudaStreamGetCaptureInfo == 0 { + return CaptureStatusNone, nil + } + var status uint32 + var id uint64 + ret := ccall(l.cudaStreamGetCaptureInfo, + s.handle, + uintptr(unsafe.Pointer(&status)), + uintptr(unsafe.Pointer(&id))) + if ret != cudaSuccess { + return CaptureStatusNone, fmt.Errorf("cudaStreamGetCaptureInfo failed: %s", cudaErrorString(ret)) + } + return CaptureStatus(status), nil +} + // GraphInstantiate creates an executable graph from a captured graph. // The executable graph can be launched repeatedly without re-capturing. func GraphInstantiate(g *Graph) (*GraphExec, error) { diff --git a/internal/cuda/runtime_purego_test.go b/internal/cuda/runtime_purego_test.go new file mode 100644 index 0000000..b1f4a61 --- /dev/null +++ b/internal/cuda/runtime_purego_test.go @@ -0,0 +1,58 @@ +package cuda + +import "testing" + +func TestStreamCaptureStatus_NoRuntime(t *testing.T) { + // With or without CUDA, a stream that is not capturing must report None. + // When CUDA is unavailable, the binding returns None without error. + // When CUDA is available, a freshly created stream is not capturing. + var s *Stream + if Available() { + var err error + s, err = CreateStream() + if err != nil { + t.Fatalf("CreateStream failed: %v", err) + } + defer func() { + if destroyErr := s.Destroy(); destroyErr != nil { + t.Errorf("Stream.Destroy failed: %v", destroyErr) + } + }() + } else { + s = &Stream{} + } + + status, err := StreamCaptureStatus(s) + if err != nil { + t.Fatalf("StreamCaptureStatus returned error: %v", err) + } + if status != CaptureStatusNone { + t.Fatalf("expected CaptureStatusNone on a non-capturing stream, got %v", status) + } +} + +func TestCaptureStatus_EnumValues(t *testing.T) { + // Compile-time exhaustive switch — ensures enum values stay stable and + // every variant remains addressable from client code. + cases := []CaptureStatus{ + CaptureStatusNone, + CaptureStatusActive, + CaptureStatusInvalidated, + } + for _, c := range cases { + switch c { + case CaptureStatusNone: + if int(c) != 0 { + t.Errorf("CaptureStatusNone = %d, want 0", int(c)) + } + case CaptureStatusActive: + if int(c) != 1 { + t.Errorf("CaptureStatusActive = %d, want 1", int(c)) + } + case CaptureStatusInvalidated: + if int(c) != 2 { + t.Errorf("CaptureStatusInvalidated = %d, want 2", int(c)) + } + } + } +} From 3f8779c1221387cfde2a04228185d02f9ffb422e Mon Sep 17 00:00:00 2001 From: David Ndungu Date: Wed, 15 Apr 2026 21:23:56 -0700 Subject: [PATCH 3/4] feat(compute): T1.2 add ensureNotCapturing guard and ErrCaptureIncompatibleAllocation Weight allocation and host-to-device uploads now fail fast with a typed error when invoked while a CUDA graph capture is active on the engine's stream. On GB10 the legacy path silently hangs because cudaMalloc / MallocManaged are not capturable; this guard surfaces that condition to callers via errors.Is(err, ErrCaptureIncompatibleAllocation). The guard queries cuda.StreamCaptureStatus (T1.1). On CPU-only runtimes or nil streams the probe returns nil and the code path is unchanged. Probe failures are propagated rather than swallowed. Refs E1 T1.2. --- compute/capture_guard_test.go | 40 +++++++++++++++++++++++++++++++++++ compute/errors.go | 11 ++++++++++ compute/gpu_engine.go | 37 ++++++++++++++++++++++++++++++-- 3 files changed, 86 insertions(+), 2 deletions(-) create mode 100644 compute/capture_guard_test.go create mode 100644 compute/errors.go diff --git a/compute/capture_guard_test.go b/compute/capture_guard_test.go new file mode 100644 index 0000000..91eb5a1 --- /dev/null +++ b/compute/capture_guard_test.go @@ -0,0 +1,40 @@ +package compute + +import ( + "errors" + "testing" +) + +// TestEnsureNotCapturing_NilStream verifies that ensureNotCapturing returns +// nil on an engine whose stream is nil (CPU-only runtime). This is the +// common path on machines without CUDA. +func TestEnsureNotCapturing_NilStream(t *testing.T) { + e := &GPUEngine[float32]{} + if err := e.ensureNotCapturing(); err != nil { + t.Fatalf("ensureNotCapturing on nil-stream engine: got %v, want nil", err) + } +} + +// TestErrCaptureIncompatibleAllocation_Is verifies that +// ErrCaptureIncompatibleAllocation is a sentinel error usable with +// errors.Is, both directly and when wrapped. +func TestErrCaptureIncompatibleAllocation_Is(t *testing.T) { + if !errors.Is(ErrCaptureIncompatibleAllocation, ErrCaptureIncompatibleAllocation) { + t.Fatal("errors.Is should match sentinel against itself") + } + wrapped := wrapErr(ErrCaptureIncompatibleAllocation) + if !errors.Is(wrapped, ErrCaptureIncompatibleAllocation) { + t.Fatal("errors.Is should see sentinel through a wrapper") + } +} + +// wrapErr emulates a caller that wraps the sentinel error with %w. +// Kept local to the test to avoid leaking helpers into the package API. +func wrapErr(err error) error { + return &wrappedErr{inner: err} +} + +type wrappedErr struct{ inner error } + +func (w *wrappedErr) Error() string { return "wrapped: " + w.inner.Error() } +func (w *wrappedErr) Unwrap() error { return w.inner } diff --git a/compute/errors.go b/compute/errors.go new file mode 100644 index 0000000..c7d37d3 --- /dev/null +++ b/compute/errors.go @@ -0,0 +1,11 @@ +package compute + +import "errors" + +// ErrCaptureIncompatibleAllocation is returned when a weight allocation +// or upload is attempted while a CUDA graph capture is active on the +// engine's stream. Allocations during capture are not supported and +// would silently hang on GB10. Callers should either upload weights +// before BeginCapture, or catch this error and fall back to an +// uncaptured run. +var ErrCaptureIncompatibleAllocation = errors.New("compute: allocation attempted during active CUDA graph capture") diff --git a/compute/gpu_engine.go b/compute/gpu_engine.go index 8482fb5..b11ff91 100644 --- a/compute/gpu_engine.go +++ b/compute/gpu_engine.go @@ -573,10 +573,38 @@ func (e *GPUEngine[T]) UploadWeights(tensors []*tensor.TensorNumeric[float32]) e return nil } +// ensureNotCapturing returns ErrCaptureIncompatibleAllocation if the +// engine's stream is currently capturing a CUDA graph. On CPU-only +// runtimes or when the stream handle is nil, returns nil (no capture +// is possible). If querying capture status itself fails, returns +// that error (do not assume safety on probe failure). +func (e *GPUEngine[T]) ensureNotCapturing() error { + if e.stream == nil { + return nil + } + ptr := e.stream.Ptr() + if ptr == nil { + return nil + } + s := cuda.StreamFromPtr(ptr) + status, err := cuda.StreamCaptureStatus(s) + if err != nil { + return fmt.Errorf("ensureNotCapturing: %w", err) + } + if status == cuda.CaptureStatusActive { + return ErrCaptureIncompatibleAllocation + } + return nil +} + // allocWeight allocates permanent memory for a weight tensor. // Uses cudaMallocManaged on devices with managed memory support, -// otherwise uses cudaMalloc. +// otherwise uses cudaMalloc. Returns ErrCaptureIncompatibleAllocation +// if invoked while a CUDA graph capture is active on the engine's stream. func (e *GPUEngine[T]) allocWeight(byteSize int) (unsafe.Pointer, error) { + if err := e.ensureNotCapturing(); err != nil { + return nil, err + } if e.managedMem { return cuda.MallocManaged(byteSize) } @@ -585,8 +613,13 @@ func (e *GPUEngine[T]) allocWeight(byteSize int) (unsafe.Pointer, error) { // uploadBytes copies src bytes into a device (or managed) pointer. // With managed memory, this is a direct CPU memcpy (no H2D needed). -// Without managed memory, this uses cudaMemcpy H2D. +// Without managed memory, this uses cudaMemcpy H2D. Returns +// ErrCaptureIncompatibleAllocation if invoked while a CUDA graph capture +// is active on the engine's stream. func (e *GPUEngine[T]) uploadBytes(devPtr unsafe.Pointer, src []byte) error { + if err := e.ensureNotCapturing(); err != nil { + return err + } if e.managedMem { dst := unsafe.Slice((*byte)(devPtr), len(src)) copy(dst, src) From 8db22cdafbc5cdfd13c69c527b87b887072d231d Mon Sep 17 00:00:00 2001 From: David Ndungu Date: Wed, 15 Apr 2026 21:26:16 -0700 Subject: [PATCH 4/4] docs(plan): T1.1 + T1.2 complete (Wave 1 E1 probes) --- docs/plan.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/plan.md b/docs/plan.md index fb14eac..054e4d5 100644 --- a/docs/plan.md +++ b/docs/plan.md @@ -225,10 +225,10 @@ All estimates are rough; refine when a task starts. ### E1 Reproduce and instrument the hang -- [ ] T1.1 Add `StreamCaptureStatus` purego binding in `internal/cuda/runtime_purego.go` (wraps `cudaStreamGetCaptureInfo`). Owner: TBD. Est: 90m. verifies: [UC-003] +- [x] T1.1 Add `StreamCaptureStatus` purego binding in `internal/cuda/runtime_purego.go` (wraps `cudaStreamGetCaptureInfo`). Owner: task-T1.1. Est: 90m. verifies: [UC-003] Completed: 2026-04-15 - Acceptance: Returns the three-valued enum, exported via `cuda.StreamCaptureStatus(stream *Stream) (Status, error)`. Unit test on a non-capturing stream returns `None`. - Dependencies: none. -- [ ] T1.2 Add `ensureNotCapturing()` guard to `allocWeight` and `uploadBytes` in `compute/gpu_engine.go`. If status is `Active`, return a typed error `ErrCaptureIncompatibleAllocation`. Owner: TBD. Est: 60m. verifies: [UC-003] +- [x] T1.2 Add `ensureNotCapturing()` guard to `allocWeight` and `uploadBytes` in `compute/gpu_engine.go`. If status is `Active`, return a typed error `ErrCaptureIncompatibleAllocation`. Owner: task-T1.2. Est: 60m. verifies: [UC-003] Completed: 2026-04-15 - Acceptance: Existing non-capture tests unaffected. New unit test with a mock stream in `Active` state triggers the error. - Dependencies: T1.1. - [ ] T1.3 Write `TestCUDAGraph_MultiTensorUpload_GB10` in `compute/gpu_engine_test.go` gated behind `//go:build dgxgb10` build tag. The test uploads 50 tensors (including a 256x1024 float32 matrix), then invokes `BeginCapture`, runs a MatMul, `EndCapture`. Owner: TBD. Est: 2h. verifies: [UC-001, UC-002] @@ -334,8 +334,8 @@ count equals the number of task IDs listed on that wave. #### Wave 1: Repro and probe (2 agents) -- [ ] T1.1 Add `StreamCaptureStatus` purego binding verifies: [UC-003] -- [ ] T1.2 Add `ensureNotCapturing` guard and typed error verifies: [UC-003] +- [x] T1.1 Add `StreamCaptureStatus` purego binding verifies: [UC-003] 2026-04-15 +- [x] T1.2 Add `ensureNotCapturing` guard and typed error verifies: [UC-003] 2026-04-15 #### Wave 2: Reproduction harness (3 agents)