Summary
Any `*TensorNumeric[float32]` passed as the destination to a GPU engine op (MatMul, Add, Sub, Mul, etc.) reads as all-zero via `.Data()` immediately after the op returns, on NVIDIA GB10 (DGX Spark, aarch64, unified memory). This breaks PatchTST training in github.com/zerfoo/zerfoo — loss freezes at a byte-identical deterministic value (0.268357 on a specific test config) because backward-pass gradients never reach the CPU-visible side for the optimizer to read.
The exact same training loop converges correctly on the CPU engine (86% loss reduction in 10 epochs on 15 samples via `TestPatchTST_TrainWindowed_EngineConvergence` in `timeseries/patchtst_test.go`), isolating the fault to ztensor's GPU engine implementation.
Minimal reproducer (zerfoo side)
```
cd github.com/zerfoo/zerfoo
git checkout main
go test ./timeseries/ -run TestPatchTST_TrainWindowed_EngineConvergence -count=1 -v
# PASSES: loss[0]=5.169868 -> loss[9]=0.702752 (86% reduction)
```
The same test, pointed at a GPU engine constructed via `compute.NewGPUEngine[float32](ops)`, would fail with a static loss. The test currently uses `newTestEngine()`, which returns a CPU engine. Running the same binary on DGX via `scripts/bench-spark.sh -samples 1000 -channels 5 -epochs 3 -cleanup` shows:
```
epoch 1: loss=0.268357 ok
epoch 2: loss=0.268357 ok
epoch 3: loss=0.268357 ok
convergence: FAILED
```
Empirical diagnostic (seven probes on DGX, branch `debug/diag-grad-zero-source-v2`)
First-batch gradient/parameter state logged at key points in `trainWindowedGPU`:
```
P0 post-zero grads.patchEmbW = [0 0 0 0] (expected)
P7 post-encoderBackward dX = [0 0 0 0] BROKEN
P8 post-posEmb CPU accumulation = [0 0 0 0] (consumes dX via .Data())
P1 pre-MatMul fc.dPEW = [0 0 0 0] (already zero before use)
P2 post-Add grads.patchEmbW = [0 0 0 0] (Add of zeros = zero)
P3 pre-gradclip = [0 0 0 0]
P4 adamw-read gradTs[0] = [0 0 0 0]
P5 adamw-post-update paramTs[0] = [0.01112, ...] near-init (AdamW ~= 0 on zero grads)
P6 batch 1 forward entry = same as P5 (AdamW writeback + forward read work fine)
```
Every tensor that was written by a GPU engine op returns zero via `.Data()`. The CPU/GPU disconnect is at the engine-op boundary.
What's NOT the bug (ruled out by multi-day investigation in zerfoo)
- Wrapper aliasing: `grads.X[i]` and `gradTs[i]` ARE the same `*TensorNumeric` wrapper (sentinel verified in zerfoo PR #369).
- Stream synchronization: adding `engine.Sync()` (via anonymous interface type-assert on `*GPUEngine[T]`) before `.Data()` reads did NOT change the frozen loss. Not a D2H race.
- Once-per-training caches: rebuilding `paramTs`/`gradTs` per batch (zerfoo PR #365) was a no-op — the wrappers were already aliased with the live struct fields.
- Storage-identity sentinel: the check passes because wrappers are genuinely the same instance — the disconnect is deeper, inside the engine op's output buffer routing.
- Kernel timing: ztensor stream Sync barriers before the CPU-side posEmb accumulation AND before grad-clip had zero effect.
Storage kind flip (context)
Storage-kind probe across all 37 grad tensors on first-batch backward (zerfoo branch `debug/v3-storage-kind-probe`, scratch `.claude/scratch/v3-storage-kind-result.md`):
- Setup: all 37 tensors are `*tensor.CPUStorage[float32]` (allocated via `tensor.New[float32]` which hardcodes `NewCPUStorage`).
- Post-first-backward: 36 of 37 flip to `*tensor.GPUStorage[float32]`. The flip happens via `makeGPUResult` calling `dst[0].SetStorage(gs)` at `ztensor/compute/gpu_kernels.go:121-132`.
- Lone holdout: `posEmb` grad (idx=2) stays CPUStorage because its update is a pure CPU loop at `patchtst_gpu_train.go:1012-1019` (`dPosData[j] += dXData[i]`) that never hits `makeGPUResult`.
After the flip, `GPUStorage.Slice()` at `ztensor/tensor/gpu_storage.go:215-250` performs `make([]T, s.length)` + D2H memcpy on every `.Data()` call. The D2H returns all zeros.
Hypotheses for the real bug (need ztensor expertise to pick)
- (α) Kernel writes to a different buffer than the one installed by SetStorage. `makeGPUResult` allocates a fresh device buffer, installs it on `dst` via `SetStorage`, but the underlying CUDA kernel launch uses a different pointer (the old storage's pointer, a scratch buffer, or a pre-allocated op output buffer).
- (β) Kernel writes correctly, but SetStorage happens AFTER the kernel, installing a fresh zero-init buffer that clobbers the kernel's output.
- (γ) D2H memcpy reads from a buffer that was never written. `GPUStorage.Slice()`'s memcpy source pointer diverges from the kernel's write target. Possibly a double-allocation where `makeGPUResult` returns one buffer but the CUDA call went to another.
- (δ) In-place aliasing (dst == src0) broken. `engine.Add(ctx, a, b, a)` handled incorrectly — kernel reads src0 after dst has been swapped to a fresh buffer, so reads empty.
Suggested trace points inside ztensor:
- At entry to `makeGPUResult`: log the caller-supplied `dst` pointer and its existing Storage device pointer.
- Just before the CUDA kernel launch: log the device pointer the kernel is writing to.
- Just after the kernel launch but before `SetStorage`: log the same.
- After `SetStorage`: log `dst[0].GetStorage()` device pointer.
- Compare: kernel-write-target vs post-SetStorage device pointer.
Environment
- Host: DGX Spark (GB10, NVIDIA), aarch64, unified memory
- CUDA: current ztensor CUDA runtime binding
- Go: 1.25 (via ztensor module)
- zerfoo branch: main @ current HEAD (all docs in `docs/devlog.md` entry dated 2026-04-08 titled "FINAL")
Cross-references
- zerfoo issue #368 (PR #365 non-functional bisect marker)
- zerfoo issue #364 (paramTs staleness — false concern, resolved)
- zerfoo issue #367 (posEmb idx=2 storage flip anomaly — not an anomaly, it's the clean CPU control case)
- zerfoo PR #369 (Storage-identity sentinel — fixes false-positive panic but doesn't unblock training)
- Full investigation history: https://github.com/zerfoo/zerfoo/blob/main/docs/devlog.md (entries dated 2026-04-08)
What we need from ztensor
- Trace/fix the `makeGPUResult` output-routing bug per the hypotheses above.
- Consider making `.Data()` on a GPU tensor implicitly sync the stream before the D2H memcpy (systemic safety net; relevant even if the primary bug is buffer routing).
- Add a minimal test at the ztensor level: run `engine.Add(a, b, c)` on a GPU engine with known inputs, read `c.Data()`, assert equals expected. If this test FAILS, the bug is reproducible without zerfoo at all.
Priority
Blocks PatchTST training on DGX. CPU training still works on DGX as a workaround (unverified at the full 5K samples x 10 channels x 3 epochs scale, but expected to converge given that the local CPU test passes).