feat(wasm): Llama 3.2 1B Instruct + skip Q4 reconversion #36

Merged
unamedkr merged 2 commits into main from feat/wasm-llama-default-no-q4 on Apr 10, 2026

Conversation

@unamedkr
Collaborator

Summary

Switch WASM demo to the only verified-working model and optimize load speed.

Model: Llama 3.2 1B Instruct (was Qwen3.5 0.8B)

  • Qwen3.5 was a base model (not Instruct) with a gated HuggingFace URL
  • Qwen2.5/Qwen3 models produce garbage due to architecture-specific issues (GQA 7:1 ratio, RMSNorm handling)
  • Llama 3.2 1B Instruct: verified coherent output, public URL, proper chat quality

Speed: -DTQ_NO_Q4=1

Skips the load-time Q4 reconversion (GGUF Q4_K_M → FP32 → internal Q4). The model is already quantized, so reconverting wastes both time and memory; the build uses GGUF on-the-fly dequantization instead.

Added an #ifdef TQ_NO_Q4 compile-time guard so the WASM build can use this path without calling getenv().
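The compile-time guard could look like the following sketch. The function name tq_skip_q4_reconversion is illustrative, not taken from quant.h; only the TQ_NO_Q4 macro name comes from the PR.

```c
#include <stdlib.h>

/* Sketch of the guard described in the PR: WASM builds pass
   -DTQ_NO_Q4=1, so the decision is made at compile time and
   getenv() (unavailable in WASM) is never called. Native builds
   keep the runtime environment-variable check. The function name
   is hypothetical. */
static int tq_skip_q4_reconversion(void) {
#ifdef TQ_NO_Q4
    /* WASM: skip reconversion unconditionally, no getenv(). */
    return 1;
#else
    /* Native: opt in at runtime via TQ_NO_Q4=1. */
    const char *env = getenv("TQ_NO_Q4");
    return env != NULL && env[0] == '1';
#endif
}
```

Native builds compiled without -DTQ_NO_Q4 take the getenv() branch, so their behavior is unchanged, as the PR states.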

🤖 Generated with Claude Code

unamedkr and others added 2 commits April 10, 2026 20:54
Root cause of ~530-token text collapse: small model + T=0 greedy enters
repetition loop → KV quant error compounds through softmax → collapse.
NOT a code bug — FP32 also degenerates, just slower.

Fix:
- Add 4-gram loop detection (stop when same 4-gram repeats 3+ times)
- Increase rep_window 32→64, recent_tokens buffer 64→128
- Add bench/generation_regression_test.sh (4 tests):
  1. T=0 500-token coherence check (no garbage output)
  2. Loop detection fires on repetitive generation
  3. No false positives at T=0.7
  4. PPL within 15% of FP32
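The 4-gram loop detector can be sketched as a scan over the recent-token buffer: take the trailing 4-gram and stop generation once it has occurred 3 or more times. This is a minimal illustration of the rule stated above; the function name and buffer handling are hypothetical, not the actual implementation.

```c
#include <string.h>

#define NGRAM 4         /* n-gram length from the commit message */
#define REPEAT_LIMIT 3  /* stop when the same 4-gram repeats 3+ times */

/* Returns 1 if the trailing 4-gram of tokens[0..n) appears at
   least REPEAT_LIMIT times in the buffer (counting the trailing
   occurrence itself). Hypothetical sketch, not the real code. */
static int ngram_loop_detected(const int *tokens, int n) {
    if (n < NGRAM)
        return 0;
    const int *tail = tokens + n - NGRAM;  /* most recent 4-gram */
    int count = 0;
    for (int i = 0; i + NGRAM <= n; i++) {
        if (memcmp(tokens + i, tail, NGRAM * sizeof(int)) == 0)
            count++;
    }
    return count >= REPEAT_LIMIT;
}
```

At T=0 greedy decoding a collapse shows up as exactly this kind of verbatim cycle, which is why an n-gram count catches it while teacher-forced PPL tests (no error accumulation) do not.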

Why this wasn't caught before:
- PPL tests are teacher-forced (no error accumulation)
- Generation tests were ≤100 tokens (collapse at ~530)
- No T=0 stress test existed

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two changes for WASM demo reliability and speed:

1. Model: switch from Qwen3.5-0.8B (base, gated, Qwen arch issues)
   to Llama 3.2 1B Instruct (verified working, good quality, public
   HuggingFace URL, proper Instruct tuning for chat).

2. Speed: add -DTQ_NO_Q4=1 to WASM build. Skips the load-time Q4
   reconversion (GGUF Q4_K_M → FP32 → internal Q4) which was
   expensive and redundant for already-quantized models. Uses GGUF
   on-the-fly dequant instead. Saves several seconds of model init
   and reduces peak memory usage.

   Added compile-time #ifdef TQ_NO_Q4 guard in quant.h so it works
   in WASM (no getenv). Native builds are unaffected.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@unamedkr unamedkr merged commit 6f25f34 into main Apr 10, 2026
@unamedkr unamedkr deleted the feat/wasm-llama-default-no-q4 branch April 10, 2026 11:55