feat(wasm): Llama 3.2 1B Instruct + skip Q4 reconversion #36
Merged
Conversation
Root cause of the ~530-token text collapse: small model + T=0 greedy enters a repetition loop → KV quant error compounds through softmax → collapse. NOT a code bug: FP32 also degenerates, just slower.

Fix:
- Add 4-gram loop detection (stop when the same 4-gram repeats 3+ times; sketched below)
- Increase rep_window 32→64, recent_tokens buffer 64→128
- Add bench/generation_regression_test.sh (4 tests):
  1. T=0 500-token coherence check (no garbage output)
  2. Loop detection fires on repetitive generation
  3. No false positives at T=0.7
  4. PPL within 15% of FP32

Why this wasn't caught before:
- PPL tests are teacher-forced (no error accumulation)
- Generation tests were ≤100 tokens (collapse starts at ~530)
- No T=0 stress test existed

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
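For reference, a minimal sketch of what a 4-gram loop check along these lines can look like. The function name and the `recent_tokens` buffer are illustrative, not the actual identifiers in the sampler:

```cpp
// Sketch of 4-gram loop detection over the recent-token window
// (illustrative; the real sampler integration and names may differ).
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns true when the most recently generated 4-gram already occurs
// `threshold` or more times in the window, i.e. greedy decoding has
// locked into a repetition loop and generation should stop.
bool repeats_4gram(const std::vector<std::int32_t>& recent_tokens,
                   int threshold = 3) {
    const std::size_t n = recent_tokens.size();
    if (n < 4) return false;
    const std::int32_t* last = recent_tokens.data() + (n - 4);  // newest 4-gram
    int hits = 0;
    for (std::size_t i = 0; i + 4 <= n; ++i) {  // count every occurrence, incl. itself
        if (recent_tokens[i] == last[0] && recent_tokens[i + 1] == last[1] &&
            recent_tokens[i + 2] == last[2] && recent_tokens[i + 3] == last[3]) {
            if (++hits >= threshold) return true;
        }
    }
    return false;
}
```

With the wider 64-token rep_window and 128-token recent_tokens buffer, a check of this shape has enough history to catch loops whose period is a few dozen tokens, while short legitimate repeats stay below the threshold.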
Two changes for WASM demo reliability and speed:

1. Model: switch from Qwen3.5-0.8B (base, gated, Qwen arch issues) to Llama 3.2 1B Instruct (verified working, good quality, public HuggingFace URL, proper Instruct tuning for chat).
2. Speed: add -DTQ_NO_Q4=1 to the WASM build. This skips the load-time Q4 reconversion (GGUF Q4_K_M → FP32 → internal Q4), which was expensive and redundant for already-quantized models, and uses GGUF on-the-fly dequant instead. Saves several seconds of model init and reduces peak memory usage.

Added a compile-time #ifdef TQ_NO_Q4 guard in quant.h so it works in WASM (no getenv); see the sketch below. Native builds are unaffected.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
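A rough sketch of what the compile-time guard can look like, assuming the native path keeps an optional getenv()-based override (that part is an assumption; the actual quant.h code may differ):

```cpp
// Sketch of the TQ_NO_Q4 guard (illustrative; real quant.h may differ).
// WASM builds pass -DTQ_NO_Q4=1, so the Q4 reconversion path is compiled
// out entirely and nothing ever calls getenv(), which WASM cannot use.
#include <cstdlib>

static inline bool tq_skip_q4_reconversion(void) {
#ifdef TQ_NO_Q4
    return true;   // decided at compile time for WASM builds
#else
    // Assumed native behavior: optional runtime override via an env var;
    // by default the existing reconversion path runs unchanged.
    const char* v = std::getenv("TQ_NO_Q4");
    return v != nullptr && v[0] == '1';
#endif
}
```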
Summary
Switch WASM demo to the only verified-working model and optimize load speed.
Model: Llama 3.2 1B Instruct (was Qwen3.5 0.8B)
Speed: -DTQ_NO_Q4=1
Skips load-time Q4 reconversion (GGUF Q4_K_M → FP32 → internal Q4). The model is already quantized — reconverting wastes time and memory. Uses GGUF on-the-fly dequant instead.
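For intuition about why this saves time and memory, here is a simplified sketch of on-the-fly dequant during a dot product. It uses a made-up 32-element 4-bit block with a single scale, not GGUF's actual Q4_K_M super-block layout; the point is only that weights stay quantized and are expanded a block at a time, so neither a full FP32 copy nor a second internal Q4 copy is ever materialized:

```cpp
// Simplified illustration of on-the-fly dequant (NOT the real Q4_K_M layout).
// Each block stores 32 weights as 4-bit codes plus one float scale; values
// are unpacked only while a row is being consumed.
#include <cstddef>
#include <cstdint>

struct Q4Block {
    float scale;            // per-block scale
    std::uint8_t qs[16];    // 32 x 4-bit codes, two per byte
};

// Dot product of one quantized weight row with an FP32 activation vector.
float dot_q4_row(const Q4Block* row, std::size_t nblocks, const float* x) {
    float acc = 0.0f;
    for (std::size_t b = 0; b < nblocks; ++b) {
        const Q4Block& blk = row[b];
        for (int i = 0; i < 16; ++i) {
            // unpack two 4-bit codes, centered at 8 (a common 4-bit convention)
            int lo = (blk.qs[i] & 0x0F) - 8;
            int hi = (blk.qs[i] >> 4) - 8;
            acc += blk.scale * static_cast<float>(lo) * x[b * 32 + 2 * i];
            acc += blk.scale * static_cast<float>(hi) * x[b * 32 + 2 * i + 1];
        }
    }
    return acc;
}
```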
Added a compile-time #ifdef TQ_NO_Q4 guard so WASM can use it without getenv().

🤖 Generated with Claude Code