feat(compute): bulk-upload F32 weights to one device buffer (#103)#104
Merged
feat(compute): bulk-upload F32 weights to one device buffer (#103)#104
Conversation
UploadWeights previously issued one cudaMalloc + cudaMemcpy per input tensor. Wolf's CrossAsset training pre-uploads ~106K sample tensors plus ~50 graph parameters in a single call, producing ~213K back-to-back synchronous driver round-trips. On GB10 Blackwell with unified memory, the driver's allocation-table lock contends with the default-stream queue and wedges the CUDA context: nvidia-smi hangs, podman cannot tear down the container, no error surfaces because cudaMalloc never returns. Add a bulk path that activates when >= 64 eligible F32 tensors are queued. It computes the total byte size, issues one allocWeight, copies all source data into the resulting buffer (one cudaMemcpy H2D, or one host-side memcpy on managed-memory hosts), and creates non-owning GPUStorage views per tensor. The engine retains the bulk pointer in bulkUploadBuffers and frees it in Close. Capture-time uploads bypass the bulk path: the per-tensor route already emits async ops as graph nodes, and capture-time tensor counts are small. Tests cover both the bulk activation (N=128) and the below-threshold fallback (N=8) on real CUDA. Both build and vet clean on aarch64 and amd64; the existing compute test suite is unchanged. Closes #103.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
GPUEngine.UploadWeightsthat collapses many small per-tensorcudaMalloc+cudaMemcpyround-trips into a single allocation and a single H2D copy when the input slice exceeds 64 eligible F32 tensors.bulkUploadBuffersslice and frees it inClose. Per-tensor views are non-owning (NewGPUStorageViewFromPtr).Fixes the GB10 driver wedge reported in #103 where Wolf's training pre-upload of ~106K sample tensors hangs the CUDA context (no progress, no error,
nvidia-smihangs,DELETEtimes out).Test plan
go build ./compute/cleango vet ./compute/cleango test -count=1 ./...-- all packages green on the dev box (CUDA tests skip)trainWithResult: T4.2 pre-uploading 106652 sample tensors + 50 graph params to GPUTestGPUEngine_UploadWeights_BulkPath(N=128, asserts single bulk buffer + GPU storage + data preservation) andTestGPUEngine_UploadWeights_BelowBulkThreshold(N=8, asserts per-tensor path retained)Closes #103.