GPU engine dst-output routing: tensors written by kernels return zero via .Data() — freezes PatchTST training on GB10 #79

@dndungu

Description

Summary

Any *TensorNumeric[float32] passed as the destination to a GPU engine op (MatMul, Add, Sub, Mul, etc.) reads as all-zero via .Data() immediately after the op returns, on NVIDIA GB10 (DGX Spark, aarch64, unified memory). This breaks PatchTST training in github.com/zerfoo/zerfoo — loss freezes at a byte-identical deterministic value (0.268357 on a specific test config) because backward-pass gradients never reach the CPU-visible side for the optimizer to read.

The exact same training loop converges correctly on the CPU engine (86% loss reduction in 10 epochs on 15 samples via TestPatchTST_TrainWindowed_EngineConvergence in timeseries/patchtst_test.go), isolating the fault to ztensor's GPU engine implementation.

Minimal reproducer (zerfoo side)

cd github.com/zerfoo/zerfoo
git checkout main
go test ./timeseries/ -run TestPatchTST_TrainWindowed_EngineConvergence -count=1 -v
# PASSES: loss[0]=5.169868 -> loss[9]=0.702752  (86% reduction)

The same test, pointed at a GPU engine constructed via compute.NewGPUEngine[float32](ops), fails with a static loss. (The test currently uses newTestEngine(), which returns a CPU engine.) Running the same binary on DGX via scripts/bench-spark.sh -samples 1000 -channels 5 -epochs 3 -cleanup shows:
```
epoch 1: loss=0.268357 ok
epoch 2: loss=0.268357 ok
epoch 3: loss=0.268357 ok
convergence: FAILED
```

Empirical diagnostic (nine probes on DGX, branch debug/diag-grad-zero-source-v2)

First-batch gradient/parameter state logged at key points in trainWindowedGPU:

```
P0 post-zero grads.patchEmbW = [0 0 0 0] (expected)
P7 post-encoderBackward dX = [0 0 0 0] BROKEN
P8 post-posEmb CPU accumulation = [0 0 0 0] (consumes dX via .Data())
P1 pre-MatMul fc.dPEW = [0 0 0 0] (already zero before use)
P2 post-Add grads.patchEmbW = [0 0 0 0] (Add of zeros = zero)
P3 pre-gradclip = [0 0 0 0]
P4 adamw-read gradTs[0] = [0 0 0 0]
P5 adamw-post-update paramTs[0] = [0.01112, ...] near-init (AdamW ~= 0 on zero grads)
P6 batch 1 forward entry = same as P5 (AdamW writeback + forward read work fine)
```

Every tensor that was written by a GPU engine op returns zero via `.Data()`. The CPU/GPU disconnect is at the engine-op boundary.

What's NOT the bug (ruled out by multi-day investigation in zerfoo)

  1. Wrapper aliasing: `grads.X[i]` and `gradTs[i]` ARE the same `*TensorNumeric` wrapper (sentinel verified in zerfoo PR #369).
  2. Stream synchronization: adding `engine.Sync()` (via anonymous interface type-assert on `*GPUEngine[T]`) before `.Data()` reads did NOT change the frozen loss. Not a D2H race.
  3. Once-per-training caches: rebuilding `paramTs`/`gradTs` per batch (zerfoo PR #365) was a no-op — the wrappers were already aliased with the live struct fields.
  4. Storage-identity sentinel: the check passes because wrappers are genuinely the same instance — the disconnect is deeper, inside the engine op's output buffer routing.
  5. Kernel timing: ztensor stream Sync barriers before the CPU-side posEmb accumulation AND before grad-clip had zero effect.

Storage kind flip (context)

Storage-kind probe across all 37 grad tensors on first-batch backward (zerfoo branch `debug/v3-storage-kind-probe`, scratch `.claude/scratch/v3-storage-kind-result.md`):

  • Setup: all 37 tensors are `*tensor.CPUStorage[float32]` (allocated via `tensor.New[float32]` which hardcodes `NewCPUStorage`).
  • Post-first-backward: 36 of 37 flip to `*tensor.GPUStorage[float32]`. The flip happens via `makeGPUResult` calling `dst[0].SetStorage(gs)` at `ztensor/compute/gpu_kernels.go:121-132`.
  • Lone holdout: `posEmb` grad (idx=2) stays CPUStorage because its update is a pure CPU loop at `patchtst_gpu_train.go:1012-1019` (`dPosData[j] += dXData[i]`) that never hits `makeGPUResult`.

After the flip, `GPUStorage.Slice()` at `ztensor/tensor/gpu_storage.go:215-250` performs `make([]T, s.length)` + D2H memcpy on every `.Data()` call. The D2H returns all zeros.

Hypotheses for the real bug (need ztensor expertise to pick)

  1. (α) Kernel writes to a different buffer than the one installed by SetStorage. `makeGPUResult` allocates a fresh device buffer, installs it on `dst` via `SetStorage`, but the underlying CUDA kernel launch uses a different pointer (the old storage's pointer, a scratch buffer, or a pre-allocated op output buffer).
  2. (β) Kernel writes correctly, but SetStorage happens AFTER the kernel, installing a fresh zero-init buffer that clobbers the kernel's output.
  3. (γ) D2H memcpy reads from a buffer that was never written. `GPUStorage.Slice()`'s memcpy source pointer diverges from the kernel's write target. Possibly a double-allocation where `makeGPUResult` returns one buffer but the CUDA call went to another.
  4. (δ) In-place aliasing (dst == src0) broken. `engine.Add(ctx, a, b, a)` handled incorrectly — kernel reads src0 after dst has been swapped to a fresh buffer, so reads empty.

Suggested trace points inside ztensor:

  • At entry to `makeGPUResult`: log the caller-supplied `dst` pointer and its existing Storage device pointer.
  • Just before the CUDA kernel launch: log the device pointer the kernel is writing to.
  • Just after the kernel launch but before `SetStorage`: log the same.
  • After `SetStorage`: log `dst[0].GetStorage()` device pointer.
  • Compare: kernel-write-target vs post-SetStorage device pointer.

Environment

  • Host: DGX Spark (GB10, NVIDIA), aarch64, unified memory
  • CUDA: current ztensor CUDA runtime binding
  • Go: 1.25 (via ztensor module)
  • zerfoo branch: main @ current HEAD (all docs in `docs/devlog.md` entry dated 2026-04-08 titled "FINAL")

Cross-references

  • zerfoo issue #368 (PR #365 non-functional bisect marker)
  • zerfoo issue #364 (paramTs staleness — false concern, resolved)
  • zerfoo issue #367 (posEmb idx=2 storage flip anomaly — not an anomaly, it's the clean CPU control case)
  • zerfoo PR #369 (Storage-identity sentinel — fixes false-positive panic but doesn't unblock training)
  • Full investigation history: https://github.com/zerfoo/zerfoo/blob/main/docs/devlog.md (entries dated 2026-04-08)

What we need from ztensor

  1. Trace/fix the `makeGPUResult` output-routing bug per the hypotheses above.
  2. Consider making `.Data()` on a GPU tensor implicitly sync the stream before the D2H memcpy (systemic safety net; relevant even if the primary bug is buffer routing).
  3. Add a minimal test at the ztensor level: run `engine.Add(a, b, c)` on a GPU engine with known inputs, read `c.Data()`, assert equals expected. If this test FAILS, the bug is reproducible without zerfoo at all.

Priority

Blocks PatchTST training on DGX. CPU training still works on DGX as a workaround (unverified at full 5K x 10ch x 3ep scale but expected to converge given local CPU test passes).
