Merged
7 changes: 6 additions & 1 deletion docs/adr/003-bulk-upload-chunking-cap.md
@@ -1,7 +1,12 @@
# ADR 003: Bound bulkUploadF32 by a byte-sized chunk cap

## Status
Accepted
Accepted as a defensive bound -- but NOT a fix for #106. On 2026-06-06 the
issue was reopened: Wolf train-crossasset rebuilt against this chunking still
wedged the GB10 driver identically at the 213,304-tensor scale. Capping each
alloc/copy at 64 MiB / 4096 tensors does not correlate with the wedge. Keep the
chunking (harmless, bounds driver-call size) but the root cause is still
unpinned. See docs/devlog.md 2026-06-06.
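
For illustration only, the dual cap described above can be sketched as a greedy planner that closes a chunk as soon as adding the next tensor would exceed either limit. This is not the ztensor implementation: `planChunks`, the `chunk` type, and the 1 KiB per-tensor size used for the 213,304-tensor case are assumptions made for the sketch; only the 64 MiB and 4096-tensor caps and the tensor counts come from this ADR.

```go
package main

import "fmt"

const (
	maxChunkBytes   = 64 << 20 // 64 MiB byte cap per chunk (from the ADR)
	maxChunkTensors = 4096     // tensor-count cap per chunk (from the ADR)
)

// chunk is a half-open range [Start, End) of tensor indices uploaded together.
// Hypothetical type, not part of ztensor.
type chunk struct {
	Start, End int
	Bytes      int64
}

// planChunks greedily packs tensors in order. A chunk is closed before a
// tensor is added if that tensor would push the chunk past the byte cap, or
// if the chunk already holds maxChunkTensors tensors. A single tensor larger
// than the byte cap still gets a chunk of its own.
func planChunks(sizes []int64) []chunk {
	var out []chunk
	cur := chunk{}
	for i, sz := range sizes {
		n := cur.End - cur.Start
		if n > 0 && (cur.Bytes+sz > maxChunkBytes || n >= maxChunkTensors) {
			out = append(out, cur)
			cur = chunk{Start: i, End: i}
		}
		cur.End = i + 1
		cur.Bytes += sz
	}
	if cur.End > cur.Start {
		out = append(out, cur)
	}
	return out
}

// uniform returns n tensor sizes, each of the given byte size.
func uniform(n int, size int64) []int64 {
	s := make([]int64, n)
	for i := range s {
		s[i] = size
	}
	return s
}

func main() {
	// Shape of the small-scale test: 256 tensors totalling 256 MiB.
	small := planChunks(uniform(256, 1<<20))
	// Count of the failing workload; the 1 KiB size is an assumption.
	large := planChunks(uniform(213304, 1<<10))
	fmt.Printf("256 tensors -> %d chunks\n", len(small))
	fmt.Printf("213304 tensors -> %d chunks\n", len(large))
}
```

Under these assumptions the 256-tensor case is bound by the byte cap (4 chunks) while the 213,304-tensor case is bound by the tensor cap (53 chunks), so the two runs exercise different limits and very different numbers of driver calls.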

## Date
2026-06-05
34 changes: 33 additions & 1 deletion docs/devlog.md
@@ -1,6 +1,38 @@
# ztensor Development Log

## 2026-06-05: bulkUploadF32 chunking validated on GB10 (#106)
## 2026-06-06: #106 REOPENED -- chunking is NOT the fix

**Type:** finding
**Tags:** cuda, bulk-upload, gb10, sm_121, #106, correction

**Problem:** Correct the record. The 2026-06-05 entry below claimed the chunked
`bulkUploadF32` (v1.8.1) "validated end-to-end" and "unblocked" the CrossAsset
213k-tensor upload. That is WRONG.

**Root cause of the wrong claim:** `TestGPUEngine_UploadWeights_MultiChunk` used
only 256 tensors / 256 MiB. It proved a 64 MiB chunk does not wedge at *small
scale* -- it never reproduced the 213,304-tensor regime the issue is actually
about. Passing that test said nothing about the real wedge.

**What actually happened (per issue #106 reopen, user dndungu):** Wolf
train-crossasset was rebuilt against the merged chunking code (verified in the
binary, no vendoring) and the matched repro (213,304-tensor pre-upload on GB10)
**wedged identically** -- exec, logs, ssh, and logind all hang while the control
plane stays responsive: the same uninterruptible D-state CUDA-driver wedge. So capping each
alloc/copy at 64 MiB / 4096 tensors is a defensive bound, NOT the fix. The
wedge does not correlate with single-alloc size.

**Fix:** None yet. The exact wedging CUDA ioctl was never pinned because the
D-state blocks every in-container capture path. Next step (user-proposed):
out-of-band watchdog that samples the D-state thread's
`/proc/<tid>/{wchan,syscall,stack,status}` to a persisted hostPath that survives
the data-plane wedge, to name the exact ioctl, before proposing a real fix.

**Impact:** v1.8.1 chunking stays as a defensive bound (no regression), but
#106 is OPEN. The "fixes #106" framing on PR #107 was premature; treat the
chunking as a partial mitigation only.

## 2026-06-05: bulkUploadF32 chunking validated on GB10 (#106) [SUPERSEDED -- see 2026-06-06 above; chunking did NOT fix the wedge]

**Type:** benchmark
**Tags:** cuda, bulk-upload, gb10, sm_121, #106, verification
26 changes: 24 additions & 2 deletions docs/plan.md
@@ -231,8 +231,30 @@ churn. A second agent can author the Wave 3 Spark manifest in parallel.)
- [x] T3.2 Release v1.8.1 cut verifies: [infrastructure] (2026 06 05)
- [x] T3.3 Close #106 verifies: [#106] (2026 06 05)

ALL TASKS COMPLETE. Issue #106 resolved: chunked bulkUploadF32 shipped in
v1.8.1, validated on GB10 hardware. No open ztensor issues remain.
CHUNKING SHIPPED in v1.8.1 (E0-E3 done) but #106 REOPENED 2026-06-06: the
chunked path still wedges the GB10 driver at the 213,304-tensor scale (Wolf
train-crossasset, verified against merged code). Chunking is a defensive bound,
not the fix. The wedge does not correlate with single-alloc size. New work is
diagnostic, not a code fix -- see E4 below. (The earlier "validated on GB10"
claim was from a 256-tensor test that never reproduced the 213k-scale wedge.)

### E4 -- Pin the wedging CUDA ioctl (diagnostic)
**Component:** compute
Acceptance: the exact CUDA driver call (kernel stack / syscall / wchan) that
enters uninterruptible D-state during a 213k-tensor GB10 upload is named, so a
real fix can be proposed.

- [ ] T4.1 Out-of-band watchdog: a sidecar/host process that, on a persisted
hostPath (survives the data-plane wedge), samples the target process's
D-state thread `/proc/<tid>/{wchan,syscall,stack,status,comm}` in a loop and
writes frames to disk. Must NOT live inside the wedged container's data plane.
Owner: TBD Est: 2h verifies: [#106]
- [ ] T4.2 Reproduce the wedge under the watchdog and capture the pinned frame
(the exact ioctl). Risk: deliberately wedging the shared GB10 can leave an
unkillable pod / need a host restart -- coordinate before running.
Owner: TBD Est: 2h verifies: [#106] blocked: needs user go-ahead (shared-host wedge risk)
- [ ] T4.3 From the pinned ioctl, propose the real fix and post findings to #106
Owner: TBD Est: TBD verifies: [#106] blocked-by: [T4.2]

## Timeline and Milestones

19 changes: 11 additions & 8 deletions docs/updates.md
@@ -1,13 +1,16 @@
# ztensor session updates

## 2026-06-05 -- Resolve open GitHub issues (#106) -- COMPLETE
## 2026-06-06 -- #106 REOPENED: chunking is not the fix

Sole open issue #106 (bulkUploadF32 wedges GB10 driver) is RESOLVED and SHIPPED.
The v1.8.1 chunked bulkUploadF32 (64 MiB + 4096-tensor cap) is a defensive
bound but does NOT resolve #106. Wolf train-crossasset rebuilt against the
merged code still wedged the GB10 driver identically at the 213,304-tensor
scale (uninterruptible D-state). The wedge does not correlate with single-alloc
size; the exact CUDA ioctl is still unpinned.

- E1 fix: chunked bulkUploadF32 (64 MiB + 4096-tensor dual cap). ADR 003.
- E2 validation: TestGPUEngine_UploadWeights_MultiChunk PASSED on GB10 (Spark
pod ...guard-3c04539). 256 MiB -> 4x 64 MiB chunks, no wedge, views round-trip.
- E3 ship: PR #107 rebase-merged; release-please cut v1.8.1 (PR #108);
issue #106 auto-closed.
Correction: the earlier "validated on GB10 / unblocked" claim came from a
256-tensor test that never reproduced the 213k-scale wedge.

No open ztensor GitHub issues remain.
Next: out-of-band watchdog to pin the wedging ioctl (plan E4). Blocked on
go-ahead -- deliberately wedging the shared GB10 risks an unkillable pod / host
restart.