fix(compute): reuse dst GPU memory instead of allocating per call (#84) #85

Merged
dndungu merged 2 commits into main from fix/issue-84-dst-memory-reuse on Apr 10, 2026

Conversation

Contributor

@dndungu dndungu commented Apr 10, 2026

Summary

Fixes #84. GPU ops were allocating fresh device memory via pool.Alloc on every call even when a pre-sized dst tensor was provided, orphaning old GPUStorage objects. At large training shapes, the GC couldn't keep pace with finalization, causing unbounded GPU memory growth → OOM.

Fix: tryReuseDstPtr checks if dst already has a GPUStorage with sufficient capacity. If so, the kernel writes directly into the existing device pointer — zero allocation, zero GC pressure. Applied to the six hot-path op families: gpuBinaryOp, gpuUnaryOp, gpuScalarOp, Transpose, MatMul, Sum.
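The reuse check described above can be sketched as follows. This is a minimal, self-contained illustration, not the actual ztensor code: the `Tensor`/`GPUStorage` shapes and field names here are assumptions, and the real helper operates on the op's `dst[0]` and a device pointer type rather than these toy stand-ins.

```go
package main

import "fmt"

// Illustrative stand-ins for the real ztensor types; fields are assumptions.
type GPUStorage struct {
	ptr      uintptr // device pointer (simulated here as an integer)
	capacity int     // capacity in elements
}

type Tensor struct {
	storage *GPUStorage
}

// tryReuseDstPtr mirrors the helper described in the PR: if dst already owns
// a GPUStorage large enough for the result, return its device pointer so the
// kernel can write in place; otherwise report false so the caller falls back
// to the existing pool.Alloc path.
func tryReuseDstPtr(dst *Tensor, needed int) (uintptr, bool) {
	if dst == nil || dst.storage == nil {
		return 0, false
	}
	if dst.storage.capacity < needed {
		return 0, false
	}
	return dst.storage.ptr, true
}

func main() {
	dst := &Tensor{storage: &GPUStorage{ptr: 0xdead, capacity: 1024}}
	if ptr, ok := tryReuseDstPtr(dst, 512); ok {
		fmt.Printf("reuse ptr=%#x\n", ptr) // sufficient capacity: write in place
	}
	if _, ok := tryReuseDstPtr(dst, 4096); !ok {
		fmt.Println("undersized dst: fall back to pool.Alloc")
	}
}
```

The key property is that the fast path performs no allocation at all, so steady-state training with stable shapes creates no new GPUStorage objects for the GC to finalize.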

Validation on DGX GB10

Before fix (post-E85 commit 09a318c6):

  • 20K×20×5: >300s (still running after timeout)
  • 28K×20×10: OOM (cudaMalloc failed)

After fix:

  • 20K×20×5: 15s (3.0s/epoch), convergence 99.2%
  • 28K×20×10: 40.3s (4.0s/epoch), convergence 99.9%
  • vs v1.38.4 baseline: 128.5s → 40.3s (3.2x faster!)

Test plan

  • go build ./... clean
  • go vet ./... clean (pre-existing unsafe.Pointer warnings only)
  • Full go test ./... passes (all packages)
  • DGX 20K×20×5: 15s, convergence OK
  • DGX 28K×20×10: 40.3s, convergence OK, no OOM
  • CI green

Refs

dndungu added 2 commits April 9, 2026 22:01
GPU ops (gpuBinaryOp, gpuUnaryOp, gpuScalarOp, Transpose, MatMul, Sum)
were allocating fresh device memory via pool.Alloc on every call even
when a pre-sized dst tensor was provided, then swapping dst's storage
to the new allocation. The old GPUStorage was orphaned and depended on
Go's GC finalizer to call pool.Free. At large training shapes with
hundreds of batches and ~20 ops per batch, orphaned allocations piled
up faster than the GC could reclaim, causing unbounded GPU memory
growth and OOM.

Fix: add tryReuseDstPtr helper that checks if dst[0] already has a
GPUStorage with sufficient capacity. If so, the kernel writes directly
into the existing device pointer — no pool.Alloc, no orphaned storage,
no GC pressure. When dst is nil or undersized, the existing alloc path
is preserved unchanged.

Applied to the six hot-path op families that cover PatchTST GPU training:
- gpuBinaryOp (Add, Sub, Mul same-shape)
- gpuUnaryOp (Exp, Log, Sin, Cos, Tanh, Sqrt)
- gpuScalarOp (MulScalar, AddScalar, DivScalar)
- Transpose (gpu_engine_memory.go)
- MatMul standard float32 path (gpu_engine.go)
- Sum/ReduceSum (gpu_kernels.go)

Other ops (broadcast, Q4/Q8/BF16 matmul, fused kernels) continue
using the existing alloc path and can be converted incrementally.
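The leak dynamic described above (storage swapped per call, old storage orphaned until GC finalization) can be demonstrated with a toy pool that counts live device allocations. All names here are illustrative, not the real ztensor API:

```go
package main

import "fmt"

// Toy device-memory pool that counts allocations; purely illustrative.
type Pool struct{ live int }

type Storage struct{ cap int }

func (p *Pool) Alloc(n int) *Storage { p.live++; return &Storage{cap: n} }

type Tensor struct{ storage *Storage }

// runOp models one GPU op call. With reuse=false it always allocates and
// swaps dst's storage (the old behavior, orphaning the previous Storage
// until a GC finalizer frees it); with reuse=true it writes into dst's
// existing storage whenever it has sufficient capacity.
func runOp(p *Pool, dst *Tensor, n int, reuse bool) {
	if reuse && dst.storage != nil && dst.storage.cap >= n {
		return // kernel writes in place: no new allocation
	}
	dst.storage = p.Alloc(n) // previous storage (if any) is orphaned
}

func main() {
	for _, reuse := range []bool{false, true} {
		p := &Pool{}
		dst := &Tensor{}
		for i := 0; i < 100; i++ { // e.g. 100 batches at a fixed shape
			runOp(p, dst, 1024, reuse)
		}
		fmt.Printf("reuse=%v allocations=%d\n", reuse, p.live)
	}
}
```

Without reuse the allocation count grows linearly with the number of op calls (the OOM mechanism at large shapes); with reuse it stays at one per shape, allocated on first use.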

Full ztensor test suite passes on CPU host.

Closes #84
Refs zerfoo/zerfoo#373
@dndungu dndungu merged commit 34bfe35 into main Apr 10, 2026
1 check passed
@dndungu dndungu deleted the fix/issue-84-dst-memory-reuse branch April 10, 2026 05:28
