From 61193165368f5614364a1302ccb34b9a87d45918 Mon Sep 17 00:00:00 2001
From: David Ndungu
Date: Fri, 5 Jun 2026 23:20:28 -0700
Subject: [PATCH] docs: correct #106 -- chunking is a defensive bound, not the fix

Issue reopened 2026-06-06: Wolf train-crossasset rebuilt against the merged
chunking still wedges the GB10 driver at 213k-tensor scale. The prior
'validated on GB10' claim was from a 256-tensor test that never reproduced the
213k regime. Record the correction; add diagnostic epic E4 (pin the ioctl).

Refs #106.
---
 docs/adr/003-bulk-upload-chunking-cap.md |  7 ++++-
 docs/devlog.md                           | 34 +++++++++++++++++++++++-
 docs/plan.md                             | 26 ++++++++++++++++--
 docs/updates.md                          | 19 +++++++------
 4 files changed, 74 insertions(+), 12 deletions(-)

diff --git a/docs/adr/003-bulk-upload-chunking-cap.md b/docs/adr/003-bulk-upload-chunking-cap.md
index 781b7b1..765be75 100644
--- a/docs/adr/003-bulk-upload-chunking-cap.md
+++ b/docs/adr/003-bulk-upload-chunking-cap.md
@@ -1,7 +1,12 @@
 # ADR 003: Bound bulkUploadF32 by a byte-sized chunk cap
 
 ## Status
-Accepted
+Accepted as a defensive bound -- but NOT a fix for #106. On 2026-06-06 the
+issue was reopened: Wolf train-crossasset rebuilt against this chunking still
+wedged the GB10 driver identically at the 213,304-tensor scale. Capping each
+alloc/copy at 64 MiB / 4096 tensors does not correlate with the wedge. Keep the
+chunking (harmless, bounds driver-call size) but the root cause is still
+unpinned. See docs/devlog.md 2026-06-06.
 
 ## Date
 2026-06-05
diff --git a/docs/devlog.md b/docs/devlog.md
index 6fd772e..523970e 100644
--- a/docs/devlog.md
+++ b/docs/devlog.md
@@ -1,6 +1,38 @@
 # ztensor Development Log
 
-## 2026-06-05: bulkUploadF32 chunking validated on GB10 (#106)
+## 2026-06-06: #106 REOPENED -- chunking is NOT the fix
+
+**Type:** finding
+**Tags:** cuda, bulk-upload, gb10, sm_121, #106, correction
+
+**Problem:** Correct the record. The 2026-06-05 entry below claimed the chunked
+`bulkUploadF32` (v1.8.1) "validated end-to-end" and "unblocked" the CrossAsset
+213k-tensor upload. That is WRONG.
+
+**Root cause of the wrong claim:** `TestGPUEngine_UploadWeights_MultiChunk` used
+only 256 tensors / 256 MiB. It proved a 64 MiB chunk does not wedge at *small
+scale* -- it never reproduced the 213,304-tensor regime the issue is actually
+about. Passing that test said nothing about the real wedge.
+
+**What actually happened (per issue #106 reopen, user dndungu):** Wolf
+train-crossasset was rebuilt against the merged chunking code (verified in the
+binary, no vendoring) and the matched repro (213,304-tensor pre-upload on GB10)
+**wedged identically** -- exec/logs/ssh+logind all hang, control plane
+responsive: the same uninterruptible D-state CUDA-driver wedge. So capping each
+alloc/copy at 64 MiB / 4096 tensors is a defensive bound, NOT the fix. The
+wedge does not correlate with single-alloc size.
+
+**Fix:** None yet. The exact wedging CUDA ioctl was never pinned because the
+D-state blocks every in-container capture path. Next step (user-proposed):
+out-of-band watchdog that samples the D-state thread's
+`/proc//{wchan,syscall,stack,status}` to a persisted hostPath that survives
+the data-plane wedge, to name the exact ioctl, before proposing a real fix.
+
+**Impact:** v1.8.1 chunking stays as a defensive bound (no regression), but
+#106 is OPEN. The "fixes #106" framing on PR #107 was premature; treat the
+chunking as a partial mitigation only.
+
+## 2026-06-05: bulkUploadF32 chunking validated on GB10 (#106) [SUPERSEDED -- see 2026-06-06 above; chunking did NOT fix the wedge]
 
 **Type:** benchmark
 **Tags:** cuda, bulk-upload, gb10, sm_121, #106, verification
diff --git a/docs/plan.md b/docs/plan.md
index de46c2b..9015149 100644
--- a/docs/plan.md
+++ b/docs/plan.md
@@ -231,8 +231,30 @@ churn. A second agent can author the Wave 3 Spark manifest in parallel.)
 - [x] T3.2 Release v1.8.1 cut verifies: [infrastructure] (2026 06 05)
 - [x] T3.3 Close #106 verifies: [#106] (2026 06 05)
 
-ALL TASKS COMPLETE. Issue #106 resolved: chunked bulkUploadF32 shipped in
-v1.8.1, validated on GB10 hardware. No open ztensor issues remain.
+CHUNKING SHIPPED in v1.8.1 (E0-E3 done) but #106 REOPENED 2026-06-06: the
+chunked path still wedges the GB10 driver at the 213,304-tensor scale (Wolf
+train-crossasset, verified against merged code). Chunking is a defensive bound,
+not the fix. The wedge does not correlate with single-alloc size. New work is
+diagnostic, not a code fix -- see E4 below. (The earlier "validated on GB10"
+claim was from a 256-tensor test that never reproduced the 213k-scale wedge.)
+
+### E4 -- Pin the wedging CUDA ioctl (diagnostic)
+**Component:** compute
+Acceptance: the exact CUDA driver call (kernel stack / syscall / wchan) that
+enters uninterruptible D-state during a 213k-tensor GB10 upload is named, so a
+real fix can be proposed.
+
+- [ ] T4.1 Out-of-band watchdog: a sidecar/host process that, on a persisted
+ hostPath (survives the data-plane wedge), samples the target process's
+ D-state thread `/proc//{wchan,syscall,stack,status,comm}` in a loop and
+ writes frames to disk. Must NOT live inside the wedged container's data plane.
+ Owner: TBD Est: 2h verifies: [#106]
+- [ ] T4.2 Reproduce the wedge under the watchdog and capture the pinned frame
+ (the exact ioctl). Risk: deliberately wedging the shared GB10 can leave an
+ unkillable pod / need a host restart -- coordinate before running.
+ Owner: TBD Est: 2h verifies: [#106] blocked: needs user go-ahead (shared-host wedge risk)
+- [ ] T4.3 From the pinned ioctl, propose the real fix and post findings to #106
+ Owner: TBD Est: TBD verifies: [#106] blocked-by: [T4.2]
 
 ## Timeline and Milestones
 
diff --git a/docs/updates.md b/docs/updates.md
index d6dec05..5a8cec0 100644
--- a/docs/updates.md
+++ b/docs/updates.md
@@ -1,13 +1,16 @@
 # ztensor session updates
 
-## 2026-06-05 -- Resolve open GitHub issues (#106) -- COMPLETE
+## 2026-06-06 -- #106 REOPENED: chunking is not the fix
 
-Sole open issue #106 (bulkUploadF32 wedges GB10 driver) is RESOLVED and SHIPPED.
+The v1.8.1 chunked bulkUploadF32 (64 MiB + 4096-tensor cap) is a defensive
+bound but does NOT resolve #106. Wolf train-crossasset rebuilt against the
+merged code still wedged the GB10 driver identically at the 213,304-tensor
+scale (uninterruptible D-state). The wedge does not correlate with single-alloc
+size; the exact CUDA ioctl is still unpinned.
 
-- E1 fix: chunked bulkUploadF32 (64 MiB + 4096-tensor dual cap). ADR 003.
-- E2 validation: TestGPUEngine_UploadWeights_MultiChunk PASSED on GB10 (Spark
- pod ...guard-3c04539). 256 MiB -> 4x 64 MiB chunks, no wedge, views round-trip.
-- E3 ship: PR #107 rebase-merged; release-please cut v1.8.1 (PR #108);
- issue #106 auto-closed.
+Correction: the earlier "validated on GB10 / unblocked" claim came from a
+256-tensor test that never reproduced the 213k-scale wedge.
 
-No open ztensor GitHub issues remain.
+Next: out-of-band watchdog to pin the wedging ioctl (plan E4). Blocked on
+go-ahead -- deliberately wedging the shared GB10 risks an unkillable pod / host
+restart.
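
Reviewer note on plan task T4.1: the out-of-band watchdog could look roughly like the sketch below. This is a minimal Python illustration under stated assumptions, not the project's implementation. The function names, the JSON-lines frame format, and the output path are all hypothetical, and the sketch assumes the per-thread files live under `/proc/<pid>/task/<tid>/`. Reading `stack` usually needs root on the host, so unreadable files are recorded as a marker instead of aborting the sample.

```python
import json
import os
import time

# Per-thread procfs files named in the plan (status is read separately below).
EXTRA_FIELDS = ("wchan", "syscall", "stack", "comm")


def read_text(path):
    """Return the file's text, or a short marker if it cannot be read."""
    try:
        with open(path, "r", errors="replace") as fh:
            return fh.read().strip()
    except OSError as exc:
        return f"<unreadable: {exc.__class__.__name__}>"


def thread_state(status_text):
    """Extract the one-letter state (R, S, D, ...) from a procfs status file."""
    for line in status_text.splitlines():
        if line.startswith("State:"):
            parts = line.split()
            return parts[1] if len(parts) > 1 else ""
    return ""


def sample_d_state_threads(pid, proc_root="/proc"):
    """Return one frame (dict) per thread of `pid` that is currently in D state."""
    task_dir = os.path.join(proc_root, str(pid), "task")
    try:
        tids = sorted((t for t in os.listdir(task_dir) if t.isdigit()), key=int)
    except OSError:
        return []  # process gone, or not visible from this PID namespace
    frames = []
    for tid in tids:
        base = os.path.join(task_dir, tid)
        status = read_text(os.path.join(base, "status"))
        if thread_state(status) != "D":
            continue
        frame = {"ts": time.time(), "pid": int(pid), "tid": int(tid), "status": status}
        for name in EXTRA_FIELDS:
            frame[name] = read_text(os.path.join(base, name))
        frames.append(frame)
    return frames


def watch(pid, out_path, interval_s=1.0, max_samples=None, proc_root="/proc"):
    """Append D-state frames to `out_path` as JSON lines, fsyncing each batch."""
    taken = 0
    while max_samples is None or taken < max_samples:
        frames = sample_d_state_threads(pid, proc_root)
        if frames:
            with open(out_path, "a") as out:
                for frame in frames:
                    out.write(json.dumps(frame) + "\n")
                out.flush()
                os.fsync(out.fileno())  # push frames to disk before the next sample
        taken += 1
        time.sleep(interval_s)
```

A hypothetical invocation would be `watch(target_pid, "/hostpath/wedge-frames.jsonl")`, run from a host process or a sidecar that shares the host PID namespace, where the path stands in for the persisted hostPath. Appending and syncing one batch per sample is a deliberate choice: if the host later needs a restart, the last frames written before the wedge should still be on disk.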