fix(compute): reuse dst GPU memory instead of allocating per call (#84) #85

Merged
dndungu merged 2 commits into main from fix/issue-84-dst-memory-reuse on Apr 10, 2026

Conversation

Contributor

@dndungu dndungu commented Apr 10, 2026

Summary

Fixes #84. GPU ops were allocating fresh device memory via pool.Alloc on every call even when a pre-sized dst tensor was provided, orphaning old GPUStorage objects. At large training shapes, the GC couldn't keep pace with finalization, causing unbounded GPU memory growth → OOM.

Fix: tryReuseDstPtr checks if dst already has a GPUStorage with sufficient capacity. If so, the kernel writes directly into the existing device pointer — zero allocation, zero GC pressure. Applied to the six hot-path op families: gpuBinaryOp, gpuUnaryOp, gpuScalarOp, Transpose, MatMul, Sum.
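The reuse check described above can be sketched as follows. This is a minimal, self-contained illustration, not the actual ztensor code: the `Tensor`/`GPUStorage` shapes and field names here are assumptions, and the real helper operates on the op's `dst[0]` and a device pointer type rather than these toy stand-ins.

```go
package main

import "fmt"

// Illustrative stand-ins for the real ztensor types; fields are assumptions.
type GPUStorage struct {
	ptr      uintptr // device pointer (simulated here as an integer)
	capacity int     // capacity in elements
}

type Tensor struct {
	storage *GPUStorage
}

// tryReuseDstPtr mirrors the helper described in the PR: if dst already owns
// a GPUStorage large enough for the result, return its device pointer so the
// kernel can write in place; otherwise report false so the caller falls back
// to the existing pool.Alloc path.
func tryReuseDstPtr(dst *Tensor, needed int) (uintptr, bool) {
	if dst == nil || dst.storage == nil {
		return 0, false
	}
	if dst.storage.capacity < needed {
		return 0, false
	}
	return dst.storage.ptr, true
}

func main() {
	dst := &Tensor{storage: &GPUStorage{ptr: 0xdead, capacity: 1024}}
	if ptr, ok := tryReuseDstPtr(dst, 512); ok {
		fmt.Printf("reuse ptr=%#x\n", ptr) // sufficient capacity: write in place
	}
	if _, ok := tryReuseDstPtr(dst, 4096); !ok {
		fmt.Println("undersized dst: fall back to pool.Alloc")
	}
}
```

The key property is that the fast path performs no allocation at all, so steady-state training with stable shapes creates no new GPUStorage objects for the GC to finalize.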

Validation on DGX GB10

Before fix (post-E85 commit 09a318c6):

  • 20K×20×5: >300s (still running after timeout)
  • 28K×20×10: OOM (cudaMalloc failed)

After fix:

  • 20K×20×5: 15s (3.0s/epoch), convergence 99.2%
  • 28K×20×10: 40.3s (4.0s/epoch), convergence 99.9%
  • vs v1.38.4 baseline: 128.5s → 40.3s (3.2x faster!)

Test plan

  • go build ./... clean
  • go vet ./... clean (pre-existing unsafe.Pointer warnings only)
  • Full go test ./... passes (all packages)
  • DGX 20K×20×5: 15s, convergence OK
  • DGX 28K×20×10: 40.3s, convergence OK, no OOM
  • CI green

Refs

dndungu added 2 commits April 9, 2026 22:01
GPU ops (gpuBinaryOp, gpuUnaryOp, gpuScalarOp, Transpose, MatMul, Sum)
were allocating fresh device memory via pool.Alloc on every call even
when a pre-sized dst tensor was provided, then swapping dst's storage
to the new allocation. The old GPUStorage was orphaned and depended on
Go's GC finalizer to call pool.Free. At large training shapes with
hundreds of batches and ~20 ops per batch, orphaned allocations piled
up faster than the GC could reclaim, causing unbounded GPU memory
growth and OOM.

Fix: add tryReuseDstPtr helper that checks if dst[0] already has a
GPUStorage with sufficient capacity. If so, the kernel writes directly
into the existing device pointer — no pool.Alloc, no orphaned storage,
no GC pressure. When dst is nil or undersized, the existing alloc path
is preserved unchanged.

Applied to the six hot-path op families that cover PatchTST GPU training:
- gpuBinaryOp (Add, Sub, Mul same-shape)
- gpuUnaryOp (Exp, Log, Sin, Cos, Tanh, Sqrt)
- gpuScalarOp (MulScalar, AddScalar, DivScalar)
- Transpose (gpu_engine_memory.go)
- MatMul standard float32 path (gpu_engine.go)
- Sum/ReduceSum (gpu_kernels.go)

Other ops (broadcast, Q4/Q8/BF16 matmul, fused kernels) continue
using the existing alloc path and can be converted incrementally.
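The leak dynamic described above (storage swapped per call, old storage orphaned until GC finalization) can be demonstrated with a toy pool that counts live device allocations. All names here are illustrative, not the real ztensor API:

```go
package main

import "fmt"

// Toy device-memory pool that counts allocations; purely illustrative.
type Pool struct{ live int }

type Storage struct{ cap int }

func (p *Pool) Alloc(n int) *Storage { p.live++; return &Storage{cap: n} }

type Tensor struct{ storage *Storage }

// runOp models one GPU op call. With reuse=false it always allocates and
// swaps dst's storage (the old behavior, orphaning the previous Storage
// until a GC finalizer frees it); with reuse=true it writes into dst's
// existing storage whenever it has sufficient capacity.
func runOp(p *Pool, dst *Tensor, n int, reuse bool) {
	if reuse && dst.storage != nil && dst.storage.cap >= n {
		return // kernel writes in place: no new allocation
	}
	dst.storage = p.Alloc(n) // previous storage (if any) is orphaned
}

func main() {
	for _, reuse := range []bool{false, true} {
		p := &Pool{}
		dst := &Tensor{}
		for i := 0; i < 100; i++ { // e.g. 100 batches at a fixed shape
			runOp(p, dst, 1024, reuse)
		}
		fmt.Printf("reuse=%v allocations=%d\n", reuse, p.live)
	}
}
```

Without reuse the allocation count grows linearly with the number of op calls (the OOM mechanism at large shapes); with reuse it stays at one per shape, allocated on first use.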

Full ztensor test suite passes on CPU host.

Closes #84
Refs zerfoo/zerfoo#373
@dndungu dndungu merged commit 34bfe35 into main Apr 10, 2026
1 check passed
@dndungu dndungu deleted the fix/issue-84-dst-memory-reuse branch April 10, 2026 05:28
