Summary
Any `*TensorNumeric[float32]` passed as the destination to a GPU engine op (MatMul, Add, Sub, Mul, etc.) reads as all-zero via `.Data()` immediately after the op returns, on NVIDIA GB10 (DGX Spark, aarch64, unified memory). This breaks PatchTST training in github.com/zerfoo/zerfoo — loss freezes at a byte-identical deterministic value (0.268357 on a specific test config) because backward-pass gradients never reach the CPU-visible side for the optimizer to read.
The exact same training loop converges correctly on the CPU engine (86% loss reduction in 10 epochs on 15 samples via `TestPatchTST_TrainWindowed_EngineConvergence` in `timeseries/patchtst_test.go`), isolating the fault to ztensor's GPU engine implementation.
Minimal reproducer (zerfoo side)
```
cd github.com/zerfoo/zerfoo
git checkout main
go test ./timeseries/ -run TestPatchTST_TrainWindowed_EngineConvergence -count=1 -v
# PASSES: loss[0]=5.169868 -> loss[9]=0.702752 (86% reduction)
```
The same test, pointed at a GPU engine constructed via `compute.NewGPUEngine[float32](ops)`, would fail with a static loss. The test currently uses `newTestEngine()`, which returns a CPU engine. Running the same binary on DGX via `scripts/bench-spark.sh -samples 1000 -channels 5 -epochs 3 -cleanup` shows:
```
epoch 1: loss=0.268357 ok
epoch 2: loss=0.268357 ok
epoch 3: loss=0.268357 ok
convergence: FAILED
```
Empirical diagnostic (seven probes on DGX, branch `debug/diag-grad-zero-source-v2`)
First-batch gradient/parameter state logged at key points in `trainWindowedGPU`:
```
P0 post-zero grads.patchEmbW = [0 0 0 0] (expected)
P7 post-encoderBackward dX = [0 0 0 0] BROKEN
P8 post-posEmb CPU accumulation = [0 0 0 0] (consumes dX via .Data())
P1 pre-MatMul fc.dPEW = [0 0 0 0] (already zero before use)
P2 post-Add grads.patchEmbW = [0 0 0 0] (Add of zeros = zero)
P3 pre-gradclip = [0 0 0 0]
P4 adamw-read gradTs[0] = [0 0 0 0]
P5 adamw-post-update paramTs[0] = [0.01112, ...] near-init (AdamW ~= 0 on zero grads)
P6 batch 1 forward entry = same as P5 (AdamW writeback + forward read work fine)
```
Every tensor that was written by a GPU engine op returns zero via `.Data()`. The CPU/GPU disconnect is at the engine-op boundary.
What's NOT the bug (ruled out by multi-day investigation in zerfoo)
- Wrapper aliasing: `grads.X[i]` and `gradTs[i]` ARE the same `*TensorNumeric` wrapper (sentinel verified in zerfoo PR #369).
- Stream synchronization: adding `engine.Sync()` (via anonymous interface type-assert on `*GPUEngine[T]`) before `.Data()` reads did NOT change the frozen loss. Not a D2H race.
- Once-per-training caches: rebuilding `paramTs`/`gradTs` per batch (zerfoo PR #365) was a no-op — the wrappers were already aliased with the live struct fields.
- Storage-identity sentinel: the check passes because wrappers are genuinely the same instance — the disconnect is deeper, inside the engine op's output buffer routing.
- Kernel timing: ztensor stream Sync barriers before the CPU-side posEmb accumulation AND before grad-clip had zero effect.
Storage kind flip (context)
Storage-kind probe across all 37 grad tensors on first-batch backward (zerfoo branch `debug/v3-storage-kind-probe`, scratch `.claude/scratch/v3-storage-kind-result.md`):
- Setup: all 37 tensors are `*tensor.CPUStorage[float32]` (allocated via `tensor.New[float32]` which hardcodes `NewCPUStorage`).
- Post-first-backward: 36 of 37 flip to `*tensor.GPUStorage[float32]`. The flip happens via `makeGPUResult` calling `dst[0].SetStorage(gs)` at `ztensor/compute/gpu_kernels.go:121-132`.
- Lone holdout: `posEmb` grad (idx=2) stays CPUStorage because its update is a pure CPU loop at `patchtst_gpu_train.go:1012-1019` (`dPosData[j] += dXData[i]`) that never hits `makeGPUResult`.
After the flip, `GPUStorage.Slice()` at `ztensor/tensor/gpu_storage.go:215-250` performs `make([]T, s.length)` + D2H memcpy on every `.Data()` call. The D2H returns all zeros.
Hypotheses for the real bug (need ztensor expertise to pick)
- (α) Kernel writes to a different buffer than the one installed by SetStorage. `makeGPUResult` allocates a fresh device buffer, installs it on `dst` via `SetStorage`, but the underlying CUDA kernel launch uses a different pointer (the old storage's pointer, a scratch buffer, or a pre-allocated op output buffer).
- (β) Kernel writes correctly, but SetStorage happens AFTER the kernel, installing a fresh zero-init buffer that clobbers the kernel's output.
- (γ) D2H memcpy reads from a buffer that was never written. `GPUStorage.Slice()`'s memcpy source pointer diverges from the kernel's write target. Possibly a double-allocation where `makeGPUResult` returns one buffer but the CUDA call went to another.
- (δ) In-place aliasing (dst == src0) broken. `engine.Add(ctx, a, b, a)` handled incorrectly — kernel reads src0 after dst has been swapped to a fresh buffer, so reads empty.
Suggested trace points inside ztensor:
- At entry to `makeGPUResult`: log the caller-supplied `dst` pointer and its existing Storage device pointer.
- Just before the CUDA kernel launch: log the device pointer the kernel is writing to.
- Just after the kernel launch but before `SetStorage`: log the same.
- After `SetStorage`: log `dst[0].GetStorage()` device pointer.
- Compare: kernel-write-target vs post-SetStorage device pointer.
Environment
- Host: DGX Spark (GB10, NVIDIA), aarch64, unified memory
- CUDA: current ztensor CUDA runtime binding
- Go: 1.25 (via ztensor module)
- zerfoo branch: main @ current HEAD (all docs in `docs/devlog.md` entry dated 2026-04-08 titled "FINAL")
Cross-references
- zerfoo issue #368 (PR #365 non-functional bisect marker)
- zerfoo issue #364 (paramTs staleness — false concern, resolved)
- zerfoo issue #367 (posEmb idx=2 storage flip anomaly — not an anomaly, it's the clean CPU control case)
- zerfoo PR #369 (Storage-identity sentinel — fixes false-positive panic but doesn't unblock training)
- Full investigation history: https://github.com/zerfoo/zerfoo/blob/main/docs/devlog.md (entries dated 2026-04-08)
What we need from ztensor
- Trace/fix the `makeGPUResult` output-routing bug per the hypotheses above.
- Consider making `.Data()` on a GPU tensor implicitly sync the stream before the D2H memcpy (systemic safety net; relevant even if the primary bug is buffer routing).
- Add a minimal test at the ztensor level: run `engine.Add(a, b, c)` on a GPU engine with known inputs, read `c.Data()`, assert equals expected. If this test FAILS, the bug is reproducible without zerfoo at all.
Priority
Blocks PatchTST training on DGX. CPU training still works on DGX as a workaround (unverified at the full 5K samples x 10 channels x 3 epochs scale, but expected to converge given that the local CPU test passes).