feat: GB10 CUDA graph capture probes (T1.1, T1.2) #94
Merged
Replaces the closed Issue-79 investigation plan with a comprehensive 6-epic execution plan for resolving the silent hang on NVIDIA DGX Spark GB10 (arm64 Grace Blackwell) during multi-tensor weight uploads while CUDA graph capture is active. Waves 1-8 define parallel agent counts, deliverables (E1 reproduction, E2 capture-aware alloc, E3 conditional Mmap investigation, E4 fail-fast fallback, E5 downstream rollout, E6 release), milestones M1-M5, a risk register, and Spark operational notes.
Wraps cudaStreamGetCaptureInfo so callers can detect stream capture state before recording incompatible operations. Returns None without error when the runtime is unavailable (CPU-only builds). Used by T1.2's ensureNotCapturing guard in compute/gpu_engine.go to block sync allocations during graph capture.
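The behaviour described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code in `internal/cuda/runtime_purego.go`: the `getCaptureInfo` function variable is a hypothetical stand-in for the dynamically resolved `cudaStreamGetCaptureInfo` symbol, and the constant names are assumed from the `CaptureStatus*` prefix mentioned in the PR.

```go
package main

import "fmt"

// CaptureStatus mirrors the three capture states the CUDA runtime reports.
type CaptureStatus int

const (
	CaptureStatusNone        CaptureStatus = iota // stream is not capturing
	CaptureStatusActive                           // stream is capturing a graph
	CaptureStatusInvalidated                      // capture was invalidated
)

// getCaptureInfo is a hypothetical stand-in for the resolved
// cudaStreamGetCaptureInfo symbol. It stays nil when no CUDA runtime
// could be loaded (for example on a CPU-only build).
var getCaptureInfo func(stream uintptr) (status int32, code int32)

// StreamCaptureStatus reports the capture state of stream. When the runtime
// is unavailable it returns CaptureStatusNone and a nil error.
func StreamCaptureStatus(stream uintptr) (CaptureStatus, error) {
	if getCaptureInfo == nil {
		return CaptureStatusNone, nil // CPU-only: safe no-op
	}
	status, code := getCaptureInfo(stream)
	if code != 0 {
		return CaptureStatusNone, fmt.Errorf("cudaStreamGetCaptureInfo failed with code %d", code)
	}
	switch s := CaptureStatus(status); s {
	case CaptureStatusNone, CaptureStatusActive, CaptureStatusInvalidated:
		return s, nil
	default:
		return CaptureStatusNone, fmt.Errorf("unexpected capture status %d", status)
	}
}

func main() {
	// No runtime registered: the probe degrades to None without an error.
	st, err := StreamCaptureStatus(1)
	fmt.Println(st == CaptureStatusNone, err == nil)

	// Simulate a runtime that reports an active capture.
	getCaptureInfo = func(uintptr) (int32, int32) { return 1, 0 }
	st, err = StreamCaptureStatus(1)
	fmt.Println(st == CaptureStatusActive, err == nil)
}
```

Keeping the "runtime unavailable" case as a nil error lets callers invoke the probe unconditionally, without branching on build type.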
…atibleAllocation Weight allocation and host-to-device uploads now fail fast with a typed error when invoked while a CUDA graph capture is active on the engine's stream. On GB10 the legacy path silently hangs because cudaMalloc / MallocManaged are not capturable; this guard surfaces that condition to callers via errors.Is(err, ErrCaptureIncompatibleAllocation). The guard queries cuda.StreamCaptureStatus (T1.1). On CPU-only runtimes or nil streams the probe returns nil and the code path is unchanged. Probe failures are propagated rather than swallowed. Refs E1 T1.2.
Summary
Wave 1 of the GB10 CUDA graph capture hang fix (docs/plan.md E1).
- `cuda.StreamCaptureStatus`: purego binding wrapping `cudaStreamGetCaptureInfo`. Three-valued enum (None/Active/Invalidated). Safe no-op on CPU-only runtimes.
- `ensureNotCapturing()`: guard on `*GPUEngine[T]` that trips when a weight alloc/upload is attempted during active capture. Returns the new sentinel `compute.ErrCaptureIncompatibleAllocation`. Wired into `allocWeight` and `uploadBytes`. This makes the silent GB10 hang observable from the call site so we can fail fast (or fall back in future waves).

This lands the probes only. Follow-on work: T1.3 (reproduction test under `//go:build dgxgb10`), T1.4 (Spark manifest + hardware run), then E2 fixes (T2.1a `WithCapture` helper, T2.2 capture-aware `allocWeight` routing).

Verification report
- `--no-ff` onto `wave-1-integration`. Silent-revert check: every non-context line from each branch's M1 patch is reflected in `git diff main...HEAD`.
- `go build ./...` PASS.
- `go test ./... -race -timeout 180s` PASS (all packages, including `compute` at 2.8s with `-race`).
- `go vet ./...`: delta vs `origin/main` is zero (28 pre-existing `possible misuse of unsafe.Pointer` warnings in internal GPU shim packages, none in files this PR touches).
- `golangci-lint run`: zero new issues. Pre-existing findings in `compute/gpu_fp8.go`, `compute/cpu_engine_quant_test.go`, `compute/gpu_paged_gqa_test.go`, `compute/ternary_gemv.go` (none touched by this PR).
- No `TODO`/`FIXME`/`Stub`/`Mock`/`Fake`/`Placeholder`/`NotImplemented` in production diff.
- `compute/capture_guard_test.go` (nil-stream safe path + `errors.Is` sanity).

Files touched
- `internal/cuda/purego.go` (+8 −6): register `cudaStreamGetCaptureInfo` symbol
- `internal/cuda/runtime_purego.go` (+35): `StreamCaptureStatus` + `CaptureStatus*` constants
- `internal/cuda/runtime_purego_test.go` (+58, new): binding smoke tests
- `compute/errors.go` (+11, new): `ErrCaptureIncompatibleAllocation` sentinel
- `compute/gpu_engine.go` (+35 −2): `ensureNotCapturing` method; wire into `allocWeight` + `uploadBytes`
- `compute/capture_guard_test.go` (+40, new): sentinel + nil-stream tests
- `docs/plan.md`: mark T1.1/T1.2 complete

Test plan
- `go build ./...`
- `go test ./... -race -timeout 180s`
- `go vet ./...` (delta zero vs `origin/main`)
- `golangci-lint run` on touched packages (zero new findings)