Skip to content

feat(compute): bulk-upload F32 weights to one device buffer (#103)#104

Merged
dndungu merged 1 commit intomainfrom
fix/issue103-bulk-upload
Apr 29, 2026
Merged

feat(compute): bulk-upload F32 weights to one device buffer (#103)#104
dndungu merged 1 commit intomainfrom
fix/issue103-bulk-upload

Conversation

@dndungu
Copy link
Copy Markdown
Contributor

@dndungu dndungu commented Apr 29, 2026

Summary

  • Adds a bulk path in GPUEngine.UploadWeights that collapses many small per-tensor cudaMalloc + cudaMemcpy round-trips into a single allocation and a single H2D copy when the input slice exceeds 64 eligible F32 tensors.
  • The engine retains the bulk pointer in a new bulkUploadBuffers slice and frees it in Close. Per-tensor views are non-owning (NewGPUStorageViewFromPtr).
  • Capture-time uploads bypass the bulk path; the per-tensor route already records async ops as graph nodes, and capture-time inputs are small.

Fixes the GB10 driver wedge reported in #103 where Wolf's training pre-upload of ~106K sample tensors hangs the CUDA context (no progress, no error, nvidia-smi hangs, DELETE times out).

Test plan

  • go build ./compute/ clean
  • go vet ./compute/ clean
  • go test -count=1 ./... -- all packages green on the dev box (CUDA tests skip)
  • Run on DGX GB10 once Spark scheduler issue (High-priority GPU pod stays pending forever with zero events when another non-GPU pod is running feza-ai/spark#32) is unblocked: confirm Wolf training advances past trainWithResult: T4.2 pre-uploading 106652 sample tensors + 50 graph params to GPU
  • New tests: TestGPUEngine_UploadWeights_BulkPath (N=128, asserts single bulk buffer + GPU storage + data preservation) and TestGPUEngine_UploadWeights_BelowBulkThreshold (N=8, asserts per-tensor path retained)

Closes #103.

UploadWeights previously issued one cudaMalloc + cudaMemcpy per input
tensor. Wolf's CrossAsset training pre-uploads ~106K sample tensors plus
~50 graph parameters in a single call, producing ~213K back-to-back
synchronous driver round-trips. On GB10 Blackwell with unified memory,
the driver's allocation-table lock contends with the default-stream
queue and wedges the CUDA context: nvidia-smi hangs, podman cannot tear
down the container, no error surfaces because cudaMalloc never returns.

Add a bulk path that activates when >= 64 eligible F32 tensors are
queued. It computes the total byte size, issues one allocWeight, copies
all source data into the resulting buffer (one cudaMemcpy H2D, or one
host-side memcpy on managed-memory hosts), and creates non-owning
GPUStorage views per tensor. The engine retains the bulk pointer in
bulkUploadBuffers and frees it in Close.

Capture-time uploads bypass the bulk path: the per-tensor route already
emits async ops as graph nodes, and capture-time tensor counts are
small.

Tests cover both the bulk activation (N=128) and the below-threshold
fallback (N=8) on real CUDA. Both build and vet clean on aarch64 and
amd64; the existing compute test suite is unchanged.

Closes #103.
@dndungu dndungu merged commit 9ca83f6 into main Apr 29, 2026
1 check passed
@dndungu dndungu deleted the fix/issue103-bulk-upload branch April 29, 2026 03:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

GB10/Blackwell: training hangs at sample-tensor pre-upload (sample 0/N stalled, no progress)

1 participant