## Root Cause Found
Qwen3.5-4B's DeltaNet layers produce garbage on short prompts (< ~50 tokens) but work correctly on long prompts (150+ tokens with document context).
## Evidence

```
# SHORT prompt — FAILS (even first call, fresh state)
"What is 2+2?" → "The answer to **"   # wrong

# LONG prompt — WORKS (4/4 correct)
"Document: Acme reported 847M...\nQuestion: What was revenue?" → "847 million" ✅
```
## Hypothesis

DeltaNet's `conv1d` layer (`conv_width=4`) needs sufficient input tokens to initialize its recurrent state (the `conv_state` buffer). With short prompts, the conv buffer contains uninitialized or zero values that produce unstable attention patterns.
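To make the mechanics concrete, here is a minimal sketch of a width-4 causal conv ring buffer (hypothetical, not quant.h's actual DeltaNet code; the function name and layout are illustrative). It shows how a zero-initialized state bleeds into the first `conv_width - 1` outputs: with all-ones weights and a constant input of 1, the output ramps 1, 2, 3 before reaching its steady value of 4.

```c
#include <string.h>

/* Hypothetical width-4 causal depthwise conv step. The state buffer
 * holds the last CONV_WIDTH input values; a fresh (zeroed) state means
 * the first CONV_WIDTH-1 outputs mix zero padding into the sum. */
#define CONV_WIDTH 4

static float conv_step(float state[CONV_WIDTH],
                       const float w[CONV_WIDTH], float x) {
    /* shift out the oldest input and append the new token's value */
    memmove(state, state + 1, (CONV_WIDTH - 1) * sizeof(float));
    state[CONV_WIDTH - 1] = x;

    float y = 0.0f;
    for (int i = 0; i < CONV_WIDTH; i++)
        y += w[i] * state[i];
    return y;
}
```

Note that this only contaminates the first three positions, so if short prompts fail well beyond that window, the issue may lie in how downstream recurrent state consumes these early outputs rather than in the conv padding alone.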
## Impact on RLV
The coherence check prompt ("Is this answered? YES/NO") is short (~40 tokens) and triggers this bug, causing:
- False UNSURE verdicts
- Unnecessary retries (3x slower)
- "empty response from server" when the model generates only template tokens
## Proposed Fix

**Option A:** Pad short prompts with a warm-up prefix:

```c
if (n_prompt < 50) {
    // Prepend neutral tokens to warm up conv_state
    prepend("The following is a question and answer.\n");
}
```
**Option B:** Initialize `conv_state` with a warm-up pass:

```c
// Run a dummy forward pass with padding tokens before real generation
for (int i = 0; i < conv_width; i++)
    tq_forward(model, state, PAD_TOKEN, i);
```
**Option C:** Zero-initialize `conv_state` explicitly before each generate call (already done by `calloc`, but verify the state isn't being reused across calls).
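To test Option C's premise, a small instrumentation helper can confirm whether the buffer is actually all zeros at the start of each generate call (the helper name and flat-float layout are assumptions, not quant.h's real API):

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical check: returns true if the conv_state buffer is all
 * zeros, i.e., has not been carried over from a previous generate
 * call. Call this at the top of the generate path and log failures. */
static bool conv_state_is_fresh(const float *conv_state, size_t n) {
    for (size_t i = 0; i < n; i++)
        if (conv_state[i] != 0.0f)
            return false;
    return true;
}
```

If this ever returns false on a "fresh" call, the bug is state reuse rather than short-prompt padding, which would rule Options A and B out as root-cause fixes.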
## Environment
- Model: unsloth/Qwen3.5-4B-GGUF (Q4_K_M, 2.6GB)
- quant.h: latest main (e12fcbd DeltaNet fix)
- OS: macOS 15 (Apple M3, 16GB)