Architecture
Single-process Next.js application. Two API routes, one store that holds many documents, one UI page. The retrieval pipeline is hybrid search followed by reranking, and the chat response streams citations and tokens as NDJSON.
graph TD
U[User browser]
U -->|upload PDFs| UP["/api/upload"]
U -->|ask question| CH["/api/chat"]
UP -->|buffer| PP[pdf-parse<br/>page by page]
PP -->|pages| CK[Chunker<br/>page-aware]
CK -->|chunks| EM[OpenAI embeddings]
EM -->|vectors| VS[(Vector store<br/>cosine + BM25)]
CH -->|question| EM2[OpenAI embeddings]
EM2 -->|query vector| HS[Hybrid search<br/>RRF fusion]
VS --> HS
HS -->|candidate pool| RR[Reranker]
RR -->|top-k| LLM[OpenAI gpt-4o-mini]
LLM -->|NDJSON citations + tokens| U
classDef ext fill:#a78bfa,stroke:#a78bfa,color:#fff
class EM,EM2,LLM ext
| Component | File | Responsibility |
|---|---|---|
| Upload route | app/api/upload/route.ts | Accept PDF, parse per page, chunk, embed, add as a document; list and delete documents |
| Chat route | app/api/chat/route.ts | Embed question, hybrid retrieve, rerank, stream citations and tokens |
| PDF parser | lib/pdf.ts | Page-by-page text extraction via the pagerender hook |
| Chunker | lib/chunker.ts | Fixed-size chunking with overlap and page tracking |
| BM25 index | lib/bm25.ts | Dependency-free sparse lexical index |
| Vector store | lib/vector-store.ts | In-memory cosine, BM25, hybrid search, multi-document |
| Reranker | lib/reranker.ts | LLM cross-encoder reranker with a lexical fallback |
| Retrieval | lib/retrieval.ts | Orchestrates hybrid search then rerank |
| Citations | lib/citations.ts | NDJSON streaming protocol shared by server and client |
| OpenAI client | lib/openai.ts | Lazy client, embed helper |
| UI | app/page.tsx | Upload, document list, chat, citation rendering |
sequenceDiagram
participant U as User
participant API as /api/upload
participant PDF as pdf-parse
participant CK as Chunker
participant OA as OpenAI Embeddings
participant VS as Vector Store
U->>API: POST FormData(file)
API->>PDF: parse buffer per page
PDF-->>API: pages string array
API->>CK: chunkPages(pages, 1000, 200)
CK-->>API: chunks with page numbers
API->>OA: embed(chunk contents)
OA-->>API: vectors
API->>VS: add(chunks) and reindex BM25
API-->>U: 200 with docId, pages, document list
sequenceDiagram
participant U as User
participant API as /api/chat
participant OA as OpenAI Embeddings
participant VS as Vector Store
participant RR as Reranker
participant LLM as OpenAI gpt-4o-mini
U->>API: POST { question, docIds? }
API->>OA: embed([question])
OA-->>API: query vector
API->>VS: hybridSearch(vec, text, pool, docIds)
VS-->>API: fused candidate pool
API->>RR: rerank(question, pool, topK)
RR-->>API: top-k reranked chunks
API->>API: build citations
API-->>U: NDJSON: citations event
API->>LLM: stream chat(system, numbered context)
LLM-->>API: token
API-->>U: NDJSON: token events, then done
pdf-parse with the pagerender hook so we get text per page rather than one flat blob. The page each chunk starts on is what makes page-level citations possible.
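As a rough illustration of how page tracking can survive chunking, here is a minimal sketch. The name `chunkPages` and its `(pages, size, overlap)` arguments mirror the call shown in the upload sequence; the body, the `PageChunk` type, and the choice to slide a character window over the concatenated pages are assumptions, not the contents of lib/chunker.ts.

```typescript
// Hypothetical shape; the real lib/chunker.ts may differ.
interface PageChunk {
  page: number; // 1-based page the chunk starts on
  content: string;
}

function chunkPages(pages: string[], size = 1000, overlap = 200): PageChunk[] {
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new Error("require size > 0 and 0 <= overlap < size");
  }
  // Concatenate pages, remembering the offset where each page begins.
  const starts: number[] = [];
  let text = "";
  for (const p of pages) {
    starts.push(text.length);
    text += p + "\n";
  }
  const chunks: PageChunk[] = [];
  const step = size - overlap;
  let page = 0; // index into starts; only ever moves forward
  for (let pos = 0; pos < text.length; pos += step) {
    // Advance to the last page whose start offset is <= pos.
    while (page + 1 < starts.length && starts[page + 1] <= pos) page++;
    const content = text.slice(pos, pos + size).trim();
    if (content.length > 0) chunks.push({ page: page + 1, content });
    if (pos + size >= text.length) break; // final window reached the end
  }
  return chunks;
}
```

A chunk that straddles a page boundary is attributed to the page it starts on, which matches the `page` field described for stored chunks.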
Hybrid search because dense embeddings are strong on meaning but weak on rare exact terms (error codes, identifiers, surnames), and BM25 is the opposite. Reciprocal Rank Fusion combines the two rankings without needing to normalise their very different score scales.
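Reciprocal Rank Fusion can be sketched in a few lines. The function name `rrfFuse`, the use of chunk ids as the ranking elements, and the constant `k = 60` (a commonly used default) are assumptions for illustration; the store's actual fusion code may differ.

```typescript
// Reciprocal Rank Fusion: score(d) = sum over rankings of 1 / (k + rank),
// where rank is 1-based. Only ranks are used, never the raw scores.
function rrfFuse(rankings: string[][], k = 60): { id: string; score: number }[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1;
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return Array.from(scores.entries())
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}

const dense = ["c1", "c2", "c3"]; // ordered by cosine similarity
const sparse = ["c3", "c1", "c4"]; // ordered by BM25 score
const fused = rrfFuse([dense, sparse]);
console.log(fused.map((f) => f.id)); // c1 and c3 lead: both lists contain them
```

Because only rank positions enter the sum, a cosine score near 0.8 and a BM25 score near 12 never have to be put on a common scale.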
A reranking stage because first-stage retrieval is tuned for recall (pull a wide pool) while the reranker is tuned for precision (reorder that pool against the question directly). This retrieve-then-rerank shape is a common pattern in production retrieval systems.
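The rerank step with its lexical fallback might look roughly like the sketch below. Everything here is illustrative: `scoreWithModel` is a hypothetical callback standing in for the LLM reranker, and the term-overlap score is one simple choice of lexical scorer, not necessarily the one in lib/reranker.ts.

```typescript
interface Candidate {
  id: string;
  content: string;
}

const tokenize = (s: string): string[] => s.toLowerCase().match(/[a-z0-9]+/g) ?? [];

// Lexical score: fraction of distinct question terms that appear in the chunk.
function lexicalScore(question: string, content: string): number {
  const terms = new Set<string>(tokenize(question));
  if (terms.size === 0) return 0;
  const words = new Set<string>(tokenize(content));
  let hits = 0;
  terms.forEach((t) => {
    if (words.has(t)) hits++;
  });
  return hits / terms.size;
}

function lexicalRerank(question: string, pool: Candidate[], topK: number): Candidate[] {
  return pool
    .map((c, i) => ({ c, i, s: lexicalScore(question, c.content) }))
    .sort((a, b) => b.s - a.s || a.i - b.i) // ties keep first-stage order
    .slice(0, topK)
    .map((x) => x.c);
}

// Tries the model-based scorer first; any failure falls back to lexical.
async function rerank(
  question: string,
  pool: Candidate[],
  topK: number,
  scoreWithModel: (q: string, pool: Candidate[]) => Promise<number[]>,
): Promise<Candidate[]> {
  try {
    const scores = await scoreWithModel(question, pool);
    if (scores.length !== pool.length) throw new Error("score count mismatch");
    return pool
      .map((c, i) => ({ c, i, s: scores[i] }))
      .sort((a, b) => b.s - a.s || a.i - b.i)
      .slice(0, topK)
      .map((x) => x.c);
  } catch {
    return lexicalRerank(question, pool, topK);
  }
}
```

Passing the scorer in as a callback keeps the fallback path testable without a network call.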
NDJSON streaming so one stream carries both the structured citation list (rendered immediately) and the free-form answer tokens, with a trivial client parser.
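To show why the client parser stays small, here is a sketch of one. The event shapes (`citations`, `token`, `done`) follow the names in the chat sequence, but their exact fields are assumptions; the real protocol is defined in lib/citations.ts.

```typescript
// Hypothetical event shapes for illustration only.
type StreamEvent =
  | { type: "citations"; citations: { n: number; source: string; page: number }[] }
  | { type: "token"; value: string }
  | { type: "done" };

// Returns a feed function: pass in each decoded text chunk and get back the
// events completed by it. A partial trailing line stays buffered.
function createNdjsonParser(): (chunk: string) => StreamEvent[] {
  let buffer = "";
  return (chunk: string): StreamEvent[] => {
    buffer += chunk;
    const lines = buffer.split("\n");
    buffer = lines.pop() ?? ""; // last piece is incomplete (or empty)
    return lines
      .filter((line) => line.trim().length > 0)
      .map((line) => JSON.parse(line) as StreamEvent);
  };
}

const feed = createNdjsonParser();
// Network chunks can split a JSON line anywhere; the buffer absorbs that.
const first = feed('{"type":"citations","citations":[{"n":1,"source":"a.pdf","page":3}]}\n{"type":"tok');
const second = feed('en","value":"Hi"}\n{"type":"done"}\n');
```

In a browser the chunks would typically come from reading `response.body` and decoding with `TextDecoder`; that wiring is omitted here.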
In-memory cosine plus BM25 because zero infrastructure means you can clone and run in seconds. The migration to a real DB is contained behind the store interface.
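A dependency-free BM25 index fits in well under a hundred lines. The sketch below is an assumption about what such an index could look like, not the code in lib/bm25.ts: the class name `Bm25Index`, the regex tokenizer, the defaults `k1 = 1.2` and `b = 0.75`, and the non-negative IDF variant are all illustrative choices.

```typescript
class Bm25Index {
  private docs: { id: string; tf: Map<string, number>; len: number }[] = [];
  private df = new Map<string, number>(); // term -> number of docs containing it
  private totalLen = 0;
  private readonly k1: number;
  private readonly b: number;

  constructor(k1 = 1.2, b = 0.75) {
    this.k1 = k1;
    this.b = b;
  }

  static tokenize(text: string): string[] {
    return text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
  }

  add(id: string, content: string): void {
    const tokens = Bm25Index.tokenize(content);
    const tf = new Map<string, number>();
    for (const t of tokens) tf.set(t, (tf.get(t) ?? 0) + 1);
    tf.forEach((_count, term) => this.df.set(term, (this.df.get(term) ?? 0) + 1));
    this.docs.push({ id, tf, len: tokens.length });
    this.totalLen += tokens.length;
  }

  search(query: string, limit = 10): { id: string; score: number }[] {
    const n = this.docs.length;
    if (n === 0) return [];
    const avgLen = this.totalLen / n;
    const terms = Array.from(new Set<string>(Bm25Index.tokenize(query)));
    return this.docs
      .map((doc) => {
        let score = 0;
        for (const term of terms) {
          const f = doc.tf.get(term) ?? 0;
          if (f === 0) continue;
          const df = this.df.get(term) ?? 0;
          // IDF with +1 inside the log so it never goes negative.
          const idf = Math.log(1 + (n - df + 0.5) / (df + 0.5));
          const norm = f + this.k1 * (1 - this.b + this.b * (doc.len / avgLen));
          score += idf * ((f * (this.k1 + 1)) / norm);
        }
        return { id: doc.id, score };
      })
      .filter((r) => r.score > 0)
      .sort((a, b) => b.score - a.score)
      .slice(0, limit);
  }
}
```

Document frequencies are updated incrementally in `add`, so no separate rebuild pass is needed when a document is appended; deleting a document would require decrementing them or rebuilding.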
interface Chunk {
id: string // stable, unique within the store
docId: string // which document this chunk belongs to
source: string // original filename, for citation display
page: number // 1-based page the chunk starts on
content: string
embedding: number[] // 1536 dims for text-embedding-3-small
}

Chunks are held in a flat array for cosine search, with a parallel BM25 index over the same content. A cosine query is a brute-force scan, O(N × 1536), which keeps search under roughly 100 ms up to about 10k chunks. Beyond that, switch to pgvector with an HNSW index.
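The flat-array scan can be sketched as follows. The function names and the optional `docIds` filter are assumptions modelled on the `hybridSearch(vec, text, pool, docIds)` call in the chat sequence; the real lib/vector-store.ts may be organised differently.

```typescript
function cosine(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let na = 0;
  let nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  const denom = Math.sqrt(na) * Math.sqrt(nb);
  return denom === 0 ? 0 : dot / denom; // zero vectors score 0 rather than NaN
}

// Brute-force scan over every stored chunk: O(N * d) per query.
function cosineSearch<T extends { docId: string; embedding: number[] }>(
  chunks: T[],
  query: number[],
  limit: number,
  docIds?: string[],
): { chunk: T; score: number }[] {
  const allowed = docIds && docIds.length > 0 ? new Set<string>(docIds) : null;
  return chunks
    .filter((c) => allowed === null || allowed.has(c.docId))
    .map((chunk) => ({ chunk, score: cosine(chunk.embedding, query) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, limit);
}
```

Filtering by `docIds` before scoring keeps the scan proportional to the selected documents rather than the whole store.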
- PDF has no extractable text (scanned image PDFs): pages come back empty, the route returns 400. Add OCR if you need scanned-PDF support.
- OpenAI rate limit: embeddings or the reranker can return 429. The reranker falls back to the lexical reranker on any failure; embeddings are not retried in the starter.
- Token limit exceeded (very long PDFs with a large top-k): the prompt may overflow context. Reduce TOP_K or chunk size.
- In-memory store lost on restart: by design. Swap the store for persistence.
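One possible guard against the token-limit case is to cap the numbered context before it reaches the model. The sketch below is not part of the starter as described: `buildContext`, `RankedChunk`, and the `maxChars` parameter are hypothetical, and a character budget is only a crude stand-in for a real token count.

```typescript
interface RankedChunk {
  source: string;
  page: number;
  content: string;
}

// Builds the numbered context block and stops before exceeding maxChars.
// Chunks are assumed to arrive best-first, so the ones dropped are the weakest.
function buildContext(
  chunks: RankedChunk[],
  maxChars: number,
): { context: string; used: number } {
  const parts: string[] = [];
  let total = 0;
  for (const c of chunks) {
    const entry = `[${parts.length + 1}] (${c.source}, p. ${c.page})\n${c.content}`;
    const cost = entry.length + (parts.length > 0 ? 2 : 0); // 2 for the "\n\n" joiner
    if (total + cost > maxChars) break;
    parts.push(entry);
    total += cost;
  }
  return { context: parts.join("\n\n"), used: parts.length };
}
```

The returned `used` count lets the caller trim the citation list to the chunks that actually made it into the prompt.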
graph LR
CDN[Vercel Edge CDN] -->|static assets| Browser
Browser -->|API calls| Func[Vercel Serverless Function]
Func -->|Embeddings, rerank, chat| OAI[OpenAI API]
classDef vercel fill:#000,stroke:#fff,color:#fff
class CDN,Func vercel
Serverless function. No persistent state. Each question is one embedding call, one reranking call, and one streaming chat call.