Merged
94 changes: 70 additions & 24 deletions site/index.html
@@ -522,7 +522,7 @@ <h3 class="reveal" data-i18n="ch5.context.title">Context Length on 8GB Mac</h3>
<div class="section-label" data-i18n="rag.label">Movement</div>
<h2 class="reveal" data-i18n="rag.title">Beyond RAG</h2>

<blockquote class="reveal" style="border-left:3px solid var(--accent);padding:1rem 1.5rem;margin:1.5rem 0;background:rgba(108,92,231,.05);font-size:1.1rem;line-height:1.6;color:var(--text)" data-i18n-html="rag.quote">
<strong>Chunking RAG was a workaround for small context windows.</strong><br>
The workaround became dogma.<br>
Now context windows are big enough that we don't need the workaround.<br>
Expand All @@ -531,7 +531,7 @@ <h2 class="reveal" data-i18n="rag.title">Beyond RAG</h2>

<p class="reveal" data-i18n-html="rag.intro">Traditional RAG splits documents into 512-token chunks, embeds them in a vector database, and retrieves fragments. This was a reasonable engineering compromise when LLMs had 2K context windows. <strong>Now they have 128K. The compromise should have started disappearing.</strong></p>
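The chunk-split step that paragraph describes can be sketched in a few lines. This is a hypothetical illustration, not code from quant.cpp: real pipelines count model tokens with a tokenizer, while whitespace-separated words stand in for tokens here.

```javascript
// Naive fixed-size chunker, the pattern traditional chunk-RAG pipelines use.
// Real systems split on ~512 tokenizer tokens; words stand in for tokens.
function chunkByWords(text, wordsPerChunk) {
  const words = text.split(/\s+/).filter(Boolean);
  const chunks = [];
  for (let i = 0; i < words.length; i += wordsPerChunk) {
    chunks.push(words.slice(i, i + wordsPerChunk).join(" "));
  }
  return chunks;
}

// Each chunk is embedded and retrieved independently, so a fact that spans
// two chunks, or sits in a chunk the retriever misses, never reaches the
// model's context at all.
```

That per-chunk independence is what makes retrieval failures invisible to the model: it sees only the fragments the retriever happened to return.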

<p class="reveal" data-i18n="rag.para2">It didn't. The infrastructure became dogma. Vector DBs became billion-dollar companies. "RAG pipeline" became something every AI engineer was expected to build, regardless of whether their use case actually needed one.</p>

<div class="viz reveal">
<div class="viz-title" data-i18n="rag.viz.title">Chunk-Level RAG vs Document-Level RAG</div>
@@ -600,34 +600,34 @@ <h4 data-i18n="rag.card3.t">Read Once, Query Forever</h4>
<!-- ===== Verification Box ===== -->
<section id="verification">
<div class="container">
<div class="section-label" data-i18n="verify.label">Measured Result</div>
<h2 class="reveal" data-i18n="verify.title">7/7 vs 0/7 β€” Verified</h2>
<p class="reveal" data-i18n-html="verify.intro">We compared three approaches on a synthetic 5-section document with 7 questions (4 single-hop, 3 multi-hop). Tested with <strong>Llama 3.2 3B Q8_0</strong>:</p>

<div class="viz reveal">
<div class="viz-title" data-i18n="verify.viz.title">Fact Extraction Accuracy</div>

<div class="mem-bar-container">
<div class="mem-bar-label"><span data-i18n="verify.bar1.label">Chunk-RAG (wrong section retrieved)</span><span style="color:var(--red)" data-i18n="verify.bar1.val">0/7 β€” all hallucinated</span></div>
<div class="mem-bar"><div class="mem-bar-fill bar-fp32" style="--w:0%">0%</div></div>
</div>

<div class="mem-bar-container">
<div class="mem-bar-label"><span data-i18n="verify.bar2.label">Full Document (FP32 KV)</span><span style="color:var(--green)">7/7</span></div>
<div class="mem-bar"><div class="mem-bar-fill bar-aggr" style="--w:100%">100%</div></div>
</div>

<div class="mem-bar-container">
<div class="mem-bar-label"><span data-i18n-html="verify.bar3.label"><strong>Full Document (6.4x KV compression)</strong></span><span style="color:var(--green)"><strong>7/7</strong></span></div>
<div class="mem-bar"><div class="mem-bar-fill bar-aggr" style="--w:100%" data-i18n="verify.bar3.inner">100% β€” same as FP32</div></div>
</div>
</div>

<h3 class="reveal" data-i18n="verify.halluc.title">The Hallucination Problem</h3>
<p class="reveal" data-i18n-html="verify.halluc.desc">When chunk-RAG retrieved the wrong section, the model didn't say "I don't know" β€” it generated <strong>plausible-sounding lies</strong>:</p>

<div class="viz reveal">
<div style="font-family:monospace;font-size:.85rem;line-height:2;color:var(--text2)" data-i18n-html="verify.halluc.examples">
<div><span style="color:var(--accent2)">Q:</span> Who is the CTO?</div>
<div><span style="color:var(--red)">Chunk-RAG:</span> "John Smith" &emsp; <span style="color:var(--text3)">β†’ truth: Maria Santos</span></div>
<br>
@@ -639,28 +639,28 @@ <h3 class="reveal">The Hallucination Problem</h3>
</div>
</div>

<p class="reveal" style="color:var(--text);font-weight:500;font-size:1.1rem" data-i18n-html="verify.halluc.summary">This is the fundamental danger of chunk-RAG: <strong>retrieval failure becomes silent hallucination</strong>. KV compression makes it possible to load the entire document into context, eliminating this failure mode on consumer hardware.</p>
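Why 6.4x compression is the difference between "fits" and "doesn't" comes down to KV-cache arithmetic. A back-of-envelope sketch, with architecture numbers assumed from the public Llama 3.2 3B config (28 layers, 8 KV heads, head dim 128) rather than taken from this repo — verify against your checkpoint:

```javascript
// KV cache stores 2 tensors (K and V) per layer, one headDim vector per
// KV head per token. Dims below are assumptions from the public
// Llama 3.2 3B config, not measured from quant.cpp.
function kvBytesPerToken(layers, kvHeads, headDim, bytesPerElem) {
  return 2 * layers * kvHeads * headDim * bytesPerElem;
}

const GiB = 1024 ** 3;
const fp32PerToken = kvBytesPerToken(28, 8, 128, 4); // 229,376 B/token
const tokens = 128_000;                               // full-document context
const fp32Total = (fp32PerToken * tokens) / GiB;      // ~27.3 GiB: too big
const compressed = fp32Total / 6.4;                   // ~4.3 GiB: fits 16GB
```

Under these assumptions, an uncompressed FP32 KV cache for a 128K-token document would dwarf a 16GB machine, while the 6.4x-compressed cache leaves room for the Q8_0 weights beside it.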

<div class="card-grid stagger" style="margin-top:2rem">
<div class="info-card">
<div class="card-icon">&#x2705;</div>
<h4 data-i18n="verify.card1.t">KV Compression = Zero Quality Loss</h4>
<p data-i18n="verify.card1.d">FP32 7/7 = 6.4x compressed 7/7. The 6.4x memory savings cost nothing in fact extraction quality.</p>
</div>
<div class="info-card">
<div class="card-icon">&#x1F517;</div>
<h4 data-i18n="verify.card2.t">Multi-Hop Reasoning Works</h4>
<p data-i18n="verify.card2.d">"What risk affects the growth region?" requires linking Section 3 (Asia growth) with Section 5 (Asia currency risk). Full-doc: βœ“. Chunk-RAG: impossible.</p>
</div>
<div class="info-card">
<div class="card-icon">&#x1F4BB;</div>
<h4 data-i18n="verify.card3.t">Runs on 16GB Mac</h4>
<p data-i18n="verify.card3.d">Llama 3.2 3B Q8_0, no GPU. 6.4x KV compression makes this practical on consumer hardware.</p>
</div>
</div>

<div style="text-align:center;margin-top:3rem">
<a href="https://github.com/quantumaikr/quant.cpp/blob/main/docs/beyond-rag-manifesto.md" class="cta-btn cta-primary" style="font-size:.95rem" data-i18n-html="verify.cta">Read the Beyond RAG Manifesto &rarr;</a>
</div>
</div>
</section>
@@ -743,7 +743,7 @@ <h2 style="margin-bottom:1rem" data-i18n="cta.title">Try It Yourself</h2>
<!-- ===== Footer ===== -->
<footer>
<div class="container">
<p data-i18n-html="footer.text">quant.cpp &middot; Apache 2.0 &middot; <a href="https://github.com/quantumaikr/quant.cpp">GitHub</a> &middot; Made by <a href="https://github.com/quantumaikr">quantumaikr</a></p>
</div>
</footer>

@@ -913,7 +913,30 @@ <h2 style="margin-bottom:1rem" data-i18n="cta.title">Try It Yourself</h2>
"rag.card2.d": "Can't fit 100K documents in context. Prefill is slow. RAG narrows the search to 2-3 relevant documents that DO fit.",
"rag.card3.t": "Read Once, Query Forever",
"rag.card3.d": "Pre-process documents into .kv files (GPU, once). Load instantly on any laptop (0.5s). Query offline, unlimited, private.",
"rag.pipeline.title": "Pre-computed KV Library Pattern",
"rag.quote": "<strong>Chunking RAG was a workaround for small context windows.</strong><br>The workaround became dogma.<br>Now context windows are big enough that we don't need the workaround.<br><em style=\"color:var(--accent2)\">β€” Welcome to Beyond RAG.</em>",
"rag.para2": "It didn't. The infrastructure became dogma. Vector DBs became billion-dollar companies. \"RAG pipeline\" became something every AI engineer was expected to build, regardless of whether their use case actually needed one.",
"verify.label": "Measured Result",
"verify.title": "7/7 vs 0/7 β€” Verified",
"verify.intro": "We compared three approaches on a synthetic 5-section document with 7 questions (4 single-hop, 3 multi-hop). Tested with <strong>Llama 3.2 3B Q8_0</strong>:",
"verify.viz.title": "Fact Extraction Accuracy",
"verify.bar1.label": "Chunk-RAG (wrong section retrieved)",
"verify.bar1.val": "0/7 β€” all hallucinated",
"verify.bar2.label": "Full Document (FP32 KV)",
"verify.bar3.label": "<strong>Full Document (6.4x KV compression)</strong>",
"verify.bar3.inner": "100% β€” same as FP32",
"verify.halluc.title": "The Hallucination Problem",
"verify.halluc.desc": "When chunk-RAG retrieved the wrong section, the model didn't say \"I don't know\" β€” it generated <strong>plausible-sounding lies</strong>:",
"verify.halluc.examples": "<div><span style=\"color:var(--accent2)\">Q:</span> Who is the CTO?</div><div><span style=\"color:var(--red)\">Chunk-RAG:</span> \"John Smith\" &emsp; <span style=\"color:var(--text3)\">β†’ truth: Maria Santos</span></div><br><div><span style=\"color:var(--accent2)\">Q:</span> What is the revenue?</div><div><span style=\"color:var(--red)\">Chunk-RAG:</span> \"$1,000,000\" &emsp; <span style=\"color:var(--text3)\">β†’ truth: 847 million</span></div><br><div><span style=\"color:var(--accent2)\">Q:</span> What percent is R&D?</div><div><span style=\"color:var(--red)\">Chunk-RAG:</span> \"15% of net income\" &emsp; <span style=\"color:var(--text3)\">β†’ truth: 14% of revenue</span></div>",
"verify.halluc.summary": "This is the fundamental danger of chunk-RAG: <strong>retrieval failure becomes silent hallucination</strong>. KV compression makes it possible to load the entire document into context, eliminating this failure mode on consumer hardware.",
"verify.card1.t": "KV Compression = Zero Quality Loss",
"verify.card1.d": "FP32 7/7 = 6.4x compressed 7/7. The 6.4x memory savings cost nothing in fact extraction quality.",
"verify.card2.t": "Multi-Hop Reasoning Works",
"verify.card2.d": "\"What risk affects the growth region?\" requires linking Section 3 (Asia growth) with Section 5 (Asia currency risk). Full-doc: βœ“. Chunk-RAG: impossible.",
"verify.card3.t": "Runs on 16GB Mac",
"verify.card3.d": "Llama 3.2 3B Q8_0, no GPU. 6.4x KV compression makes this practical on consumer hardware.",
"verify.cta": "Read the Beyond RAG Manifesto &rarr;",
"footer.text": "quant.cpp &middot; Apache 2.0 &middot; <a href=\"https://github.com/quantumaikr/quant.cpp\">GitHub</a> &middot; Made by <a href=\"https://github.com/quantumaikr\">quantumaikr</a>"
},
ko: {
"nav.problem": "\uBB38\uC81C\uC810",
@@ -1077,7 +1100,30 @@ <h2 style="margin-bottom:1rem" data-i18n="cta.title">Try It Yourself</h2>
"rag.card2.d": "100K λ¬Έμ„œλ₯Ό ν•œ λ²ˆμ— μ»¨ν…μŠ€νŠΈμ— 넣을 수 μ—†μŠ΅λ‹ˆλ‹€. Prefill이 λŠλ¦½λ‹ˆλ‹€. RAGλŠ” 검색을 2-3개 κ΄€λ ¨ λ¬Έμ„œλ‘œ μ’ν˜€μ€λ‹ˆλ‹€.",
"rag.card3.t": "ν•œ 번 읽고, μ˜μ›νžˆ 질문",
"rag.card3.d": "λ¬Έμ„œλ₯Ό .kv 파일둜 사전 처리 (GPU, 1회). μ–΄λ–€ λ…ΈνŠΈλΆμ—μ„œλ“  μ¦‰μ‹œ λ‘œλ“œ (0.5초). μ˜€ν”„λΌμΈ, λ¬΄μ œν•œ, 프라이빗 질문.",
"rag.pipeline.title": "사전 κ³„μ‚°λœ KV 라이브러리 νŒ¨ν„΄",
"rag.quote": "<strong>μ²­ν‚Ή RAGλŠ” μž‘μ€ μ»¨ν…μŠ€νŠΈ μœˆλ„μš°μ— λŒ€ν•œ μž„μ‹œλ°©νŽΈμ΄μ—ˆμŠ΅λ‹ˆλ‹€.</strong><br>κ·Έ μž„μ‹œλ°©νŽΈμ΄ 정섀이 λμŠ΅λ‹ˆλ‹€.<br>이제 μ»¨ν…μŠ€νŠΈ μœˆλ„μš°κ°€ μΆ©λΆ„νžˆ μ»€μ Έμ„œ μž„μ‹œλ°©νŽΈμ΄ ν•„μš” μ—†μŠ΅λ‹ˆλ‹€.<br><em style=\"color:var(--accent2)\">β€” Beyond RAG에 μ˜€μ‹  것을 ν™˜μ˜ν•©λ‹ˆλ‹€.</em>",
"rag.para2": "사라지지 μ•Šμ•˜μŠ΅λ‹ˆλ‹€. 인프라가 정섀이 λμŠ΅λ‹ˆλ‹€. 벑터 DBλŠ” μˆ˜μ‹­μ–΅ λ‹¬λŸ¬ 기업이 λμŠ΅λ‹ˆλ‹€. \"RAG νŒŒμ΄ν”„λΌμΈ\"은 μ‹€μ œ μš©λ„κ°€ ν•„μš”ν•˜λ“  μ•„λ‹ˆλ“  λͺ¨λ“  AI μ—”μ§€λ‹ˆμ–΄κ°€ ꡬ좕해야 ν•  무언가가 λμŠ΅λ‹ˆλ‹€.",
"verify.label": "μΈ‘μ • κ²°κ³Ό",
"verify.title": "7/7 vs 0/7 β€” 검증됨",
"verify.intro": "5개 μ„Ήμ…˜μ˜ ν•©μ„± λ¬Έμ„œμ™€ 7개 질문(4개 단일-hop, 3개 multi-hop)으둜 μ„Έ κ°€μ§€ 접근법을 λΉ„κ΅ν–ˆμŠ΅λ‹ˆλ‹€. <strong>Llama 3.2 3B Q8_0</strong>으둜 ν…ŒμŠ€νŠΈ:",
"verify.viz.title": "사싀 μΆ”μΆœ 정확도",
"verify.bar1.label": "Chunk-RAG (잘λͺ»λœ μ„Ήμ…˜ 검색)",
"verify.bar1.val": "0/7 β€” μ „λΆ€ ν™˜κ°",
"verify.bar2.label": "전체 λ¬Έμ„œ (FP32 KV)",
"verify.bar3.label": "<strong>전체 λ¬Έμ„œ (6.4λ°° KV μ••μΆ•)</strong>",
"verify.bar3.inner": "100% β€” FP32와 동일",
"verify.halluc.title": "ν™˜κ° 문제",
"verify.halluc.desc": "Chunk-RAGκ°€ 잘λͺ»λœ μ„Ήμ…˜μ„ κ²€μƒ‰ν–ˆμ„ λ•Œ, λͺ¨λΈμ€ \"λͺ¨λ₯΄κ² μŠ΅λ‹ˆλ‹€\"라고 λ§ν•˜μ§€ μ•Šκ³  <strong>κ·ΈλŸ΄λ“―ν•œ 거짓말</strong>을 μƒμ„±ν–ˆμŠ΅λ‹ˆλ‹€:",
"verify.halluc.examples": "<div><span style=\"color:var(--accent2)\">Q:</span> CTOλŠ” λˆ„κ΅¬μΈκ°€μš”?</div><div><span style=\"color:var(--red)\">Chunk-RAG:</span> \"John Smith\" &emsp; <span style=\"color:var(--text3)\">β†’ μ •λ‹΅: Maria Santos</span></div><br><div><span style=\"color:var(--accent2)\">Q:</span> λ§€μΆœμ€ μ–Όλ§ˆμΈκ°€μš”?</div><div><span style=\"color:var(--red)\">Chunk-RAG:</span> \"$1,000,000\" &emsp; <span style=\"color:var(--text3)\">β†’ μ •λ‹΅: 8μ–΅ 4,700만</span></div><br><div><span style=\"color:var(--accent2)\">Q:</span> R&DλŠ” λͺ‡ νΌμ„ΌνŠΈμΈκ°€μš”?</div><div><span style=\"color:var(--red)\">Chunk-RAG:</span> \"순이읡의 15%\" &emsp; <span style=\"color:var(--text3)\">β†’ μ •λ‹΅: 맀좜의 14%</span></div>",
"verify.halluc.summary": "이것이 chunk-RAG의 근본적 μœ„ν—˜μž…λ‹ˆλ‹€: <strong>검색 μ‹€νŒ¨κ°€ μ‘°μš©ν•œ ν™˜κ°μ΄ λ©λ‹ˆλ‹€</strong>. KV 압좕은 전체 λ¬Έμ„œλ₯Ό μ»¨ν…μŠ€νŠΈμ— λ‘œλ“œν•  수 있게 ν•˜μ—¬, μ†ŒλΉ„μž ν•˜λ“œμ›¨μ–΄μ—μ„œ 이 μ‹€νŒ¨ λͺ¨λ“œλ₯Ό μ œκ±°ν•©λ‹ˆλ‹€.",
"verify.card1.t": "KV μ••μΆ• = ν’ˆμ§ˆ 손싀 0",
"verify.card1.d": "FP32 7/7 = 6.4λ°° μ••μΆ• 7/7. 6.4λ°° λ©”λͺ¨λ¦¬ 절감이 사싀 μΆ”μΆœ ν’ˆμ§ˆμ— μ•„λ¬΄λŸ° λΉ„μš©λ„ 듀이지 μ•ŠμŠ΅λ‹ˆλ‹€.",
"verify.card2.t": "Multi-Hop μΆ”λ‘  μž‘λ™",
"verify.card2.d": "\"μ„±μž₯ 지역에 영ν–₯을 λ―ΈμΉ˜λŠ” μœ„ν—˜μ€?\"은 μ„Ήμ…˜ 3(μ•„μ‹œμ•„ μ„±μž₯)κ³Ό μ„Ήμ…˜ 5(μ•„μ‹œμ•„ 톡화 μœ„ν—˜)λ₯Ό μ—°κ²°ν•΄μ•Ό ν•©λ‹ˆλ‹€. 전체 λ¬Έμ„œ: βœ“. Chunk-RAG: λΆˆκ°€λŠ₯.",
"verify.card3.t": "16GB Macμ—μ„œ μ‹€ν–‰",
"verify.card3.d": "Llama 3.2 3B Q8_0, GPU μ—†μŒ. 6.4λ°° KV μ••μΆ•μœΌλ‘œ μ†ŒλΉ„μž ν•˜λ“œμ›¨μ–΄μ—μ„œ μ‹€μš©μ μ΄ λ©λ‹ˆλ‹€.",
"verify.cta": "Beyond RAG μ„ μ–Έλ¬Έ 읽기 &rarr;",
"footer.text": "quant.cpp &middot; Apache 2.0 &middot; <a href=\"https://github.com/quantumaikr/quant.cpp\">GitHub</a> &middot; μ œμž‘ <a href=\"https://github.com/quantumaikr\">quantumaikr</a>"
}
};
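The diff adds two attribute flavors to the markup: plain `data-i18n` for text-only strings and `data-i18n-html` for strings carrying markup (`<strong>`, `<br>`, links). The page's actual applier is not shown in this diff; a minimal sketch of what consuming these attributes looks like:

```javascript
// Hypothetical applier for the data-i18n / data-i18n-html attributes this
// diff adds; the page's real implementation is not shown in the diff.
function lookup(dict, key) {
  return Object.prototype.hasOwnProperty.call(dict, key) ? dict[key] : null;
}

function applyI18n(dict, root = typeof document !== "undefined" ? document : null) {
  if (!root) return; // lets lookup() be used outside a browser
  for (const el of root.querySelectorAll("[data-i18n]")) {
    const v = lookup(dict, el.getAttribute("data-i18n"));
    if (v !== null) el.textContent = v; // plain text: no markup injection
  }
  for (const el of root.querySelectorAll("[data-i18n-html]")) {
    const v = lookup(dict, el.getAttribute("data-i18n-html"));
    if (v !== null) el.innerHTML = v;   // trusted, self-authored strings only
  }
}
```

On a language switch this would be called as e.g. `applyI18n(translations.ko)`. Keeping `textContent` as the default and reserving `innerHTML` for the explicitly `-html` keys confines markup injection to strings the site authors wrote themselves.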
