zerfoo · dndungu · Jun 6, 2026 · Jun 6, 2026
diff --git a/docs/adr/003-bulk-upload-chunking-cap.md b/docs/adr/003-bulk-upload-chunking-cap.md
@@ -1,7 +1,12 @@
 # ADR 003: Bound bulkUploadF32 by a byte-sized chunk cap
 
 ## Status
-Accepted
+Accepted as a defensive bound -- but NOT a fix for #106. On 2026-06-06 the
+issue was reopened: Wolf train-crossasset rebuilt against this chunking still
+wedged the GB10 driver identically at the 213,304-tensor scale. The wedge does
+not correlate with single-alloc size, so the 64 MiB / 4096-tensor cap does not
+prevent it. Keep the chunking (harmless, bounds driver-call size), but the root
+cause is still unpinned. See docs/devlog.md 2026-06-06.
 
 ## Date
 2026-06-05

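To make the dual cap in ADR 003 concrete, here is a minimal sketch of how a byte cap and a tensor-count cap could jointly bound each chunk. It is an illustration only, not the `bulkUploadF32` implementation: `plan_chunks`, the greedy packing order, and the handling of a tensor larger than the byte cap are all assumptions.

```python
from typing import List

MAX_CHUNK_BYTES = 64 * 1024 * 1024  # 64 MiB byte cap named in the ADR
MAX_CHUNK_TENSORS = 4096            # tensor-count cap named in the ADR


def plan_chunks(sizes: List[int],
                max_bytes: int = MAX_CHUNK_BYTES,
                max_tensors: int = MAX_CHUNK_TENSORS) -> List[List[int]]:
    """Greedily group tensor indices so no chunk exceeds either cap.

    Assumption: a single tensor larger than max_bytes is placed in a
    chunk of its own rather than being split.
    """
    chunks: List[List[int]] = []
    current: List[int] = []
    current_bytes = 0
    for index, size in enumerate(sizes):
        over_bytes = current_bytes + size > max_bytes
        over_count = len(current) >= max_tensors
        # Close the running chunk before it would break either cap.
        if current and (over_bytes or over_count):
            chunks.append(current)
            current, current_bytes = [], 0
        current.append(index)
        current_bytes += size
    if current:
        chunks.append(current)
    return chunks
```

Under these assumptions, 256 tensors of 1 MiB each split into 4 chunks of 64 MiB, while 213,304 small tensors hit the count cap first and split into 53 chunks (52 full chunks of 4096 plus one of 312). The caps bound the size of each chunk but not the number of driver calls, which grows with tensor count.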
diff --git a/docs/devlog.md b/docs/devlog.md
@@ -1,6 +1,38 @@
 # ztensor Development Log
 
-## 2026-06-05: bulkUploadF32 chunking validated on GB10 (#106)
+## 2026-06-06: #106 REOPENED -- chunking is NOT the fix
+
+**Type:** finding
+**Tags:** cuda, bulk-upload, gb10, sm_121, #106, correction
+
+**Problem:** Correct the record. The 2026-06-05 entry below claimed the chunked
+`bulkUploadF32` (v1.8.1) "validated end-to-end" and "unblocked" the CrossAsset
+213k-tensor upload. That is WRONG.
+
+**Root cause of the wrong claim:** `TestGPUEngine_UploadWeights_MultiChunk` used
+only 256 tensors / 256 MiB. It proved a 64 MiB chunk does not wedge at *small
+scale* -- it never reproduced the 213,304-tensor regime the issue is actually
+about. Passing that test said nothing about the real wedge.
+
+**What actually happened (per issue #106 reopen, user dndungu):** Wolf
+train-crossasset was rebuilt against the merged chunking code (verified in the
+binary, no vendoring) and the matched repro (213,304-tensor pre-upload on GB10)
+**wedged identically**: exec, logs, ssh, and logind all hang while the control
+plane stays responsive, the same uninterruptible D-state CUDA-driver wedge. So
+capping each alloc/copy at 64 MiB / 4096 tensors is a defensive bound, NOT the
+fix. The wedge does not correlate with single-alloc size.
+
+**Fix:** None yet. The exact wedging CUDA ioctl was never pinned because the
+D-state blocks every in-container capture path. Next step (user-proposed): an
+out-of-band watchdog that samples the D-state thread's
+`/proc/<tid>/{wchan,syscall,stack,status}` and writes it to a persisted hostPath
+that survives the data-plane wedge, naming the exact ioctl before any real fix.
+
+**Impact:** v1.8.1 chunking stays as a defensive bound (no regression), but
+#106 is OPEN. The "fixes #106" framing on PR #107 was premature; treat the
+chunking as a bound on driver-call size only, not a mitigation of the wedge.
+
+## 2026-06-05: bulkUploadF32 chunking validated on GB10 (#106) [SUPERSEDED -- see 2026-06-06 above; chunking did NOT fix the wedge]
 
 **Type:** benchmark
 **Tags:** cuda, bulk-upload, gb10, sm_121, #106, verification

diff --git a/docs/plan.md b/docs/plan.md
@@ -231,8 +231,30 @@ churn. A second agent can author the Wave 3 Spark manifest in parallel.)
 - [x] T3.2 Release v1.8.1 cut  verifies: [infrastructure]  (2026 06 05)
 - [x] T3.3 Close #106  verifies: [#106]  (2026 06 05)
 
-ALL TASKS COMPLETE. Issue #106 resolved: chunked bulkUploadF32 shipped in
-v1.8.1, validated on GB10 hardware. No open ztensor issues remain.
+CHUNKING SHIPPED in v1.8.1 (E0-E3 done) but #106 REOPENED 2026-06-06: the
+chunked path still wedges the GB10 driver at the 213,304-tensor scale (Wolf
+train-crossasset, verified against merged code). Chunking is a defensive bound,
+not the fix. The wedge does not correlate with single-alloc size. New work is
+diagnostic, not a code fix -- see E4 below. (The earlier "validated on GB10"
+claim was from a 256-tensor test that never reproduced the 213k-scale wedge.)
+
+### E4 -- Pin the wedging CUDA ioctl (diagnostic)
+**Component:** compute
+Acceptance: the exact CUDA driver call (kernel stack / syscall / wchan) that
+enters uninterruptible D-state during a 213k-tensor GB10 upload is named, so a
+real fix can be proposed.
+
+- [ ] T4.1 Out-of-band watchdog: a sidecar/host process that samples the target
+  process's D-state thread `/proc/<tid>/{wchan,syscall,stack,status,comm}` in a
+  loop and writes frames to a persisted hostPath (one that survives the
+  data-plane wedge). Must NOT live inside the wedged container's data plane.
+  Owner: TBD  Est: 2h  verifies: [#106]
+- [ ] T4.2 Reproduce the wedge under the watchdog and capture the pinned frame
+  (the exact ioctl). Risk: deliberately wedging the shared GB10 can leave an
+  unkillable pod or require a host restart -- coordinate before running.
+  Owner: TBD  Est: 2h  verifies: [#106]  blocked: needs user go-ahead (shared-host wedge risk)
+- [ ] T4.3 From the pinned ioctl, propose the real fix and post findings to #106
+  Owner: TBD  Est: TBD  verifies: [#106]  blocked-by: [T4.2]
 
 ## Timeline and Milestones
 

diff --git a/docs/updates.md b/docs/updates.md
@@ -1,13 +1,16 @@
 # ztensor session updates
 
-## 2026-06-05 -- Resolve open GitHub issues (#106) -- COMPLETE
+## 2026-06-06 -- #106 REOPENED: chunking is not the fix
 
-Sole open issue #106 (bulkUploadF32 wedges GB10 driver) is RESOLVED and SHIPPED.
+The v1.8.1 chunked bulkUploadF32 (64 MiB + 4096-tensor cap) is a defensive
+bound but does NOT resolve #106. Wolf train-crossasset rebuilt against the
+merged code still wedged the GB10 driver identically at the 213,304-tensor
+scale (uninterruptible D-state). The wedge does not correlate with single-alloc
+size; the exact CUDA ioctl is still unpinned.
 
-- E1 fix: chunked bulkUploadF32 (64 MiB + 4096-tensor dual cap). ADR 003.
-- E2 validation: TestGPUEngine_UploadWeights_MultiChunk PASSED on GB10 (Spark
-  pod ...guard-3c04539). 256 MiB -> 4x 64 MiB chunks, no wedge, views round-trip.
-- E3 ship: PR #107 rebase-merged; release-please cut v1.8.1 (PR #108);
-  issue #106 auto-closed.
+Correction: the earlier "validated on GB10 / unblocked" claim came from a
+256-tensor test that never reproduced the 213k-scale wedge.
 
-No open ztensor GitHub issues remain.
+Next: out-of-band watchdog to pin the wedging ioctl (plan E4). Blocked on
+go-ahead -- deliberately wedging the shared GB10 risks an unkillable pod / host
+restart.
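
The out-of-band watchdog named as the next step (plan E4, T4.1) could be sketched roughly as below. This is an assumption-laden illustration, not the planned tool: the function names, the JSON-lines frame format, and the `proc_root` parameter are hypothetical. It also assumes the watchdog runs where the target PID is visible (for example in the host PID namespace) and with enough privilege to read `/proc/<tid>/stack`, which typically requires root.

```python
import json
import os
import time
from pathlib import Path
from typing import Dict, List, Optional

# Per-thread files listed in plan task T4.1.
PROC_FILES = ("wchan", "syscall", "stack", "status", "comm")


def thread_state(status_text: str) -> str:
    """Return the one-letter state from /proc status text, e.g. 'D'."""
    for line in status_text.splitlines():
        if line.startswith("State:"):
            fields = line.split()
            return fields[1] if len(fields) > 1 else ""
    return ""


def read_text(path: Path) -> str:
    """Read a /proc file, recording the failure instead of raising."""
    try:
        return path.read_text(errors="replace").strip()
    except OSError as exc:
        return f"<unreadable: {exc.strerror}>"


def sample_dstate_threads(pid: int,
                          proc_root: Path = Path("/proc")) -> List[Dict[str, str]]:
    """Collect one frame per thread of `pid` currently in state D."""
    task_dir = proc_root / str(pid) / "task"
    try:
        tids = sorted(os.listdir(task_dir))
    except OSError:
        return []  # process gone or not visible from this namespace
    frames = []
    for tid in tids:
        status = read_text(task_dir / tid / "status")
        if thread_state(status) != "D":
            continue
        frame = {"tid": tid}
        for name in PROC_FILES:
            frame[name] = status if name == "status" else read_text(task_dir / tid / name)
        frames.append(frame)
    return frames


def watch(pid: int, out_path: str, interval_s: float = 1.0,
          max_samples: Optional[int] = None,
          proc_root: Path = Path("/proc")) -> None:
    """Append D-state frames as JSON lines; fsync so they survive a wedge."""
    taken = 0
    while max_samples is None or taken < max_samples:
        frames = sample_dstate_threads(pid, proc_root)
        with open(out_path, "a") as out:
            for frame in frames:
                out.write(json.dumps({"ts": time.time(), **frame}) + "\n")
            out.flush()
            os.fsync(out.fileno())
        taken += 1
        time.sleep(interval_s)
```

A call such as `watch(pid, "/path/on/hostPath/frames.jsonl")` would run until interrupted; the output path here is a placeholder for wherever the persisted hostPath is mounted. Syncing after every sample trades throughput for the guarantee that the last frame before a wedge is on disk.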