feat(wasm): Llama 3.2 1B Instruct + skip Q4 reconversion #36
Merged
Conversation
Root cause of the ~530-token text collapse: small model + T=0 greedy enters a repetition loop → KV quant error compounds through softmax → collapse. NOT a code bug: FP32 also degenerates, just slower.

Fix:
- Add 4-gram loop detection (stop when the same 4-gram repeats 3+ times; sketched below)
- Increase rep_window 32→64, recent_tokens buffer 64→128
- Add bench/generation_regression_test.sh (4 tests):
  1. T=0 500-token coherence check (no garbage output)
  2. Loop detection fires on repetitive generation
  3. No false positives at T=0.7
  4. PPL within 15% of FP32

Why this wasn't caught before:
- PPL tests are teacher-forced (no error accumulation)
- Generation tests were ≤100 tokens (collapse starts at ~530)
- No T=0 stress test existed

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
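For reference, a minimal sketch of what a 4-gram loop check along these lines can look like. The function name and the `recent_tokens` buffer are illustrative, not the actual identifiers in the sampler:

```cpp
// Sketch of 4-gram loop detection over the recent-token window
// (illustrative; the real sampler integration and names may differ).
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns true when the most recently generated 4-gram already occurs
// `threshold` or more times in the window, i.e. greedy decoding has
// locked into a repetition loop and generation should stop.
bool repeats_4gram(const std::vector<std::int32_t>& recent_tokens,
                   int threshold = 3) {
    const std::size_t n = recent_tokens.size();
    if (n < 4) return false;
    const std::int32_t* last = recent_tokens.data() + (n - 4);  // newest 4-gram
    int hits = 0;
    for (std::size_t i = 0; i + 4 <= n; ++i) {  // count every occurrence, incl. itself
        if (recent_tokens[i] == last[0] && recent_tokens[i + 1] == last[1] &&
            recent_tokens[i + 2] == last[2] && recent_tokens[i + 3] == last[3]) {
            if (++hits >= threshold) return true;
        }
    }
    return false;
}
```

With the wider 64-token rep_window and 128-token recent_tokens buffer, a check of this shape has enough history to catch loops whose period is a few dozen tokens, while short legitimate repeats stay below the threshold.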
Two changes for WASM demo reliability and speed:

1. Model: switch from Qwen3.5-0.8B (base, gated, Qwen arch issues) to Llama 3.2 1B Instruct (verified working, good quality, public HuggingFace URL, proper Instruct tuning for chat).
2. Speed: add -DTQ_NO_Q4=1 to the WASM build. This skips the load-time Q4 reconversion (GGUF Q4_K_M → FP32 → internal Q4), which was expensive and redundant for already-quantized models, and uses GGUF on-the-fly dequant instead. Saves several seconds of model init and reduces peak memory usage.

Added a compile-time #ifdef TQ_NO_Q4 guard in quant.h so it works in WASM (no getenv); see the sketch below. Native builds are unaffected.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
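A rough sketch of what the compile-time guard can look like, assuming the native path keeps an optional getenv()-based override (that part is an assumption; the actual quant.h code may differ):

```cpp
// Sketch of the TQ_NO_Q4 guard (illustrative; real quant.h may differ).
// WASM builds pass -DTQ_NO_Q4=1, so the Q4 reconversion path is compiled
// out entirely and nothing ever calls getenv(), which WASM cannot use.
#include <cstdlib>

static inline bool tq_skip_q4_reconversion(void) {
#ifdef TQ_NO_Q4
    return true;   // decided at compile time for WASM builds
#else
    // Assumed native behavior: optional runtime override via an env var;
    // by default the existing reconversion path runs unchanged.
    const char* v = std::getenv("TQ_NO_Q4");
    return v != nullptr && v[0] == '1';
#endif
}
```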
Summary
Switch WASM demo to the only verified-working model and optimize load speed.
Model: Llama 3.2 1B Instruct (was Qwen3.5 0.8B)
Speed: -DTQ_NO_Q4=1
Skips load-time Q4 reconversion (GGUF Q4_K_M → FP32 → internal Q4). The model is already quantized — reconverting wastes time and memory. Uses GGUF on-the-fly dequant instead.
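For intuition about why this saves time and memory, here is a simplified sketch of on-the-fly dequant during a dot product. It uses a made-up 32-element 4-bit block with a single scale, not GGUF's actual Q4_K_M super-block layout; the point is only that weights stay quantized and are expanded a block at a time, so neither a full FP32 copy nor a second internal Q4 copy is ever materialized:

```cpp
// Simplified illustration of on-the-fly dequant (NOT the real Q4_K_M layout).
// Each block stores 32 weights as 4-bit codes plus one float scale; values
// are unpacked only while a row is being consumed.
#include <cstddef>
#include <cstdint>

struct Q4Block {
    float scale;            // per-block scale
    std::uint8_t qs[16];    // 32 x 4-bit codes, two per byte
};

// Dot product of one quantized weight row with an FP32 activation vector.
float dot_q4_row(const Q4Block* row, std::size_t nblocks, const float* x) {
    float acc = 0.0f;
    for (std::size_t b = 0; b < nblocks; ++b) {
        const Q4Block& blk = row[b];
        for (int i = 0; i < 16; ++i) {
            // unpack two 4-bit codes, centered at 8 (a common 4-bit convention)
            int lo = (blk.qs[i] & 0x0F) - 8;
            int hi = (blk.qs[i] >> 4) - 8;
            acc += blk.scale * static_cast<float>(lo) * x[b * 32 + 2 * i];
            acc += blk.scale * static_cast<float>(hi) * x[b * 32 + 2 * i + 1];
        }
    }
    return acc;
}
```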
Added a compile-time #ifdef TQ_NO_Q4 guard so WASM can use it without getenv().

🤖 Generated with Claude Code