From 661314d9fa78844077491a915f3d96a2468651f9 Mon Sep 17 00:00:00 2001 From: quantumaikr Date: Fri, 10 Apr 2026 23:15:00 +0900 Subject: [PATCH] docs: address "why not just use llama.cpp?" with concrete scenarios Reddit feedback (Eyelbee, 5 upvotes): feature tables don't convince users who know llama.cpp. Added scenario-based comparison showing where the single-header approach matters in practice: - WASM: 192 KB vs GGML tensor graph too large - Microcontroller: #include only option (no FS, no linker) - Game engines: one .h vs 250K LOC build integration - Teaching: readable in an afternoon Includes side-by-side build commands (cc app.c -lm vs cmake + link). Explicitly recommends llama.cpp for GPU speed and model coverage. Applied to both EN and KO READMEs. Co-Authored-By: Claude Opus 4.6 (1M context) --- README.ko.md | 43 ++++++++++++++++++++++++++++++------------- README.md | 39 ++++++++++++++++++++++++++++++++++----- 2 files changed, 64 insertions(+), 18 deletions(-) diff --git a/README.ko.md b/README.ko.md index 3615a20..b9b1fa9 100644 --- a/README.ko.md +++ b/README.ko.md @@ -156,19 +156,36 @@ Transformer의 attention 메커니즘은 최근 토큰에 자연스럽게 집중
-다른 엔진과의 비교 - -| | quant.cpp | llama.cpp | vLLM | MLX | -|---|:---:|:---:|:---:|:---:| -| KV 압축 | **7가지 방식** | Q8_0/Q5_0 (2x) | — | — | -| 코드 크기 | **72K LOC** | 250K+ | 100K+ | 50K+ | -| 임베딩 가능 | **단일 헤더** | 라이브러리 | 라이브러리 | 프레임워크 | -| GPU 속도 | 기본 | 최적화됨 | **최고** | Metal | -| 하루만에 코드 읽기 | **✅** | ❌ | ❌ | ❌ | - -**llama.cpp**를 쓰세요 — 워크스테이션에서 최고 속도가 필요할 때. -**vLLM**을 쓰세요 — 배치 서빙이 필요할 때. -**quant.cpp**를 쓰세요 — AI를 앱/게임/웹/디바이스에 **넣어야** 할 때. +"llama.cpp로도 임베딩되는데, 왜 quant.cpp?" + +맞습니다. llama.cpp는 훌륭하고 임베딩도 가능합니다. 차이는 **통합 방식**입니다: + +**llama.cpp = 컴파일된 라이브러리** (250K+ LOC). `libllama`를 링크하면 GGML 텐서 그래프, Metal/CUDA 백엔드, 샘플러, 토크나이저가 따라옵니다. 빌드 시스템이 이를 감당할 수 있다면 훌륭합니다 — 하지만 빌드 단계가 필요한 _라이브러리_입니다. + +**quant.cpp = 파일 하나** (16K LOC). `#include "quant.h"`, `cc app.c -lm -lpthread`로 컴파일. CMake 없음, 링커 플래그는 libm과 pthread뿐. 하나의 번역 단위. + +``` +# quant.cpp — C 프로젝트에 AI 추가: 2줄 +cc -O2 my_app.c -lm -lpthread -o my_app # 끝 + +# llama.cpp — 먼저 라이브러리를 빌드해야 합니다 +cmake -B build && cmake --build build +cc my_app.c -Ibuild/include -Lbuild -lllama -lm -lstdc++ -o my_app +``` + +| 시나리오 | quant.cpp | llama.cpp | +|:---------|:---------:|:---------:| +| **WASM 브라우저** | 192 KB 바이너리 | GGML 텐서 그래프가 너무 큼 | +| **마이크로컨트롤러 / RTOS** | `#include`만 가능 (FS/링커 없음) | 빌드 시스템 필요 | +| **게임 엔진** (Unity/Unreal/Godot) | `.h` 파일 하나 드롭 | 250K LOC 빌드 통합 | +| **교육 / 연구** | 하루만에 전체 코드 읽기 가능 | 훌륭하지만 코드가 방대 | +| **GPU 속도** | 기본 | **Metal/CUDA 최적화** | +| **모델 지원** | 7개 아키텍처 | **100+** | + +> **llama.cpp** — 워크스테이션에서 최고 속도가 필요할 때. +> **vLLM** — 배치 서빙이 필요할 때. +> **quant.cpp** — AI를 앱/게임/브라우저/디바이스 _안에_ 넣어야 할 때, 통합 단순성이 GPU 처리량보다 중요할 때. +
diff --git a/README.md b/README.md index 7eae1de..0f99c76 100644 --- a/README.md +++ b/README.md @@ -310,19 +310,48 @@ Both are per-block methods. The quality gap comes from block size (128 vs 32), m | GGUF model loading | **✅ 7 architectures** | ❌ | ❌ | research only | | End-to-end inference | **✅** | kernel only | kernel only | kernel only | +### "Why not just use llama.cpp?" + +You absolutely can. llama.cpp is excellent. The difference is **integration scope**, not capability: + +**llama.cpp = compiled library** (250K+ LOC). You link `libllama`, which pulls in GGML tensor graphs, Metal/CUDA backends, sampling, and the tokenizer. Great if your build system handles it — but it's a _library_ with a build step. + +**quant.cpp = one file** (16K LOC). `#include "quant.h"`, compile with `cc app.c -lm -lpthread`. No CMake, no linker flags beyond libm and pthreads. One translation unit. + +Where this difference matters in practice: + +``` +# quant.cpp — add LLM to any C project in 2 lines +cc -O2 my_app.c -lm -lpthread -o my_app # that's it + +# llama.cpp — requires building the library first +cmake -B build && cmake --build build # build libllama +cc my_app.c -Ibuild/include -Lbuild -lllama -lm -lstdc++ -o my_app +``` + +| Scenario | quant.cpp | llama.cpp | +|:---------|:---------:|:---------:| +| **WASM browser demo** | 192 KB binary | GGML tensor graph too large | +| **Microcontroller / RTOS** | `#include` is the only option (no FS, no linker) | Needs build system | +| **Game engine plugin** (Unity/Unreal/Godot) | Drop one `.h` | Integrate 250K LOC build | +| **Teaching / research** | Read in an afternoon | Excellent but large codebase | +| **Quick prototype** | `pip install quantcpp` or 2-line C | More setup needed | +| **GPU speed** | Basic | **Full Metal/CUDA** | +| **Model coverage** | 7 architectures | **100+** | +| **Production hardening** | Early stage | **Battle-tested** | + +> **Use llama.cpp** for speed on a workstation. **Use vLLM** for batch serving. 
+> **Use quant.cpp** when you need to ship LLM inference _inside_ something — an app, a game, a browser tab, an embedded device — and integration simplicity matters more than GPU throughput. + ### vs production inference engines | | quant.cpp | llama.cpp | vLLM | MLX | |:--|:---------:|:---------:|:----:|:---:| -| KV quantization | **TurboQuant + 6 schemes** | Q8_0/Q5_0 (2x) | -- | -- | +| KV quantization | **7 schemes (3-7x)** | Q8_0/Q5_0 (2x) | -- | -- | | Code size | **72K LOC** | 250K+ | 100K+ | 50K+ | | Embeddable | **single header** | library | library | framework | -| Read in an afternoon | **✅** | ❌ | ❌ | ❌ | | GPU throughput | basic | full | **best** | Metal | -> **Use llama.cpp** for speed on a workstation. **Use vLLM** for batch serving. -> **Use quant.cpp** when you need to ship LLM inference inside something — an app, a game, a website, a device. - --- ## Supported Models