Architecture

sarmakska edited this page May 31, 2026 · 2 revisions

Single-process Next.js application. Two API routes, one store that holds many documents, one UI page. The retrieval pipeline is hybrid search followed by reranking, and the chat response streams citations and tokens as NDJSON.

High-level diagram

graph TD
  U[User browser]
  U -->|upload PDFs| UP["/api/upload"]
  U -->|ask question| CH["/api/chat"]

  UP -->|buffer| PP[pdf-parse<br/>page by page]
  PP -->|pages| CK[Chunker<br/>page-aware]
  CK -->|chunks| EM[OpenAI embeddings]
  EM -->|vectors| VS[(Vector store<br/>cosine + BM25)]

  CH -->|question| EM2[OpenAI embeddings]
  EM2 -->|query vector| HS[Hybrid search<br/>RRF fusion]
  VS --> HS
  HS -->|candidate pool| RR[Reranker]
  RR -->|top-k| LLM[OpenAI gpt-4o-mini]
  LLM -->|NDJSON citations + tokens| U

  classDef ext fill:#a78bfa,stroke:#a78bfa,color:#fff
  class EM,EM2,LLM ext

Components

| Component | File | Responsibility |
| --- | --- | --- |
| Upload route | app/api/upload/route.ts | Accept PDF, parse per page, chunk, embed, add as a document; list and delete documents |
| Chat route | app/api/chat/route.ts | Embed question, hybrid retrieve, rerank, stream citations and tokens |
| PDF parser | lib/pdf.ts | Page-by-page text extraction via the pagerender hook |
| Chunker | lib/chunker.ts | Fixed-size chunking with overlap and page tracking |
| BM25 index | lib/bm25.ts | Dependency-free sparse lexical index |
| Vector store | lib/vector-store.ts | In-memory cosine, BM25, hybrid search, multi-document |
| Reranker | lib/reranker.ts | LLM cross-encoder reranker with a lexical fallback |
| Retrieval | lib/retrieval.ts | Orchestrates hybrid search then rerank |
| Citations | lib/citations.ts | NDJSON streaming protocol shared by server and client |
| OpenAI client | lib/openai.ts | Lazy client, embed helper |
| UI | app/page.tsx | Upload, document list, chat, citation rendering |

Indexing flow

sequenceDiagram
  participant U as User
  participant API as /api/upload
  participant PDF as pdf-parse
  participant CK as Chunker
  participant OA as OpenAI Embeddings
  participant VS as Vector Store

  U->>API: POST FormData(file)
  API->>PDF: parse buffer per page
  PDF-->>API: pages string array
  API->>CK: chunkPages(pages, 1000, 200)
  CK-->>API: chunks with page numbers
  API->>OA: embed(chunk contents)
  OA-->>API: vectors
  API->>VS: add(chunks) and reindex BM25
  API-->>U: 200 with docId, pages, document list

Question-answer flow

sequenceDiagram
  participant U as User
  participant API as /api/chat
  participant OA as OpenAI Embeddings
  participant VS as Vector Store
  participant RR as Reranker
  participant LLM as OpenAI gpt-4o-mini

  U->>API: POST { question, docIds? }
  API->>OA: embed([question])
  OA-->>API: query vector
  API->>VS: hybridSearch(vec, text, pool, docIds)
  VS-->>API: fused candidate pool
  API->>RR: rerank(question, pool, topK)
  RR-->>API: top-k reranked chunks
  API->>API: build citations
  API-->>U: NDJSON: citations event
  API->>LLM: stream chat(system, numbered context)
  LLM-->>API: token
  API-->>U: NDJSON: token events, then done

Why each piece

pdf-parse with the pagerender hook so we get text per page rather than one flat blob. The page each chunk starts on is what makes page-level citations possible.
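
As a rough illustration of how page tracking can survive fixed-size chunking, here is a minimal sketch. It assumes the chunker concatenates pages and records the character offset at which each page starts; the actual lib/chunker.ts may work differently, and the PageChunk type is hypothetical. The size and overlap defaults mirror the chunkPages(pages, 1000, 200) call in the indexing flow.

```typescript
// Hypothetical shape: the real chunk type also carries ids and embeddings.
interface PageChunk {
  page: number; // 1-based page the chunk starts on
  content: string;
}

// Sketch of page-aware fixed-size chunking with overlap (assumed behaviour).
function chunkPages(pages: string[], size = 1000, overlap = 200): PageChunk[] {
  if (overlap >= size) throw new Error("overlap must be smaller than size");

  // Concatenate pages, remembering the offset at which each page starts.
  const starts: number[] = [];
  let text = "";
  for (const p of pages) {
    starts.push(text.length);
    text += p + "\n";
  }

  const chunks: PageChunk[] = [];
  const step = size - overlap;
  for (let pos = 0; pos < text.length; pos += step) {
    const content = text.slice(pos, pos + size).trim();
    if (content.length === 0) continue;

    // The chunk's page is the last page whose start offset is <= pos.
    let page = 1;
    for (let i = 0; i < starts.length; i++) {
      if (starts[i] <= pos) page = i + 1;
    }
    chunks.push({ page, content });

    // Stop once a window has reached the end of the text.
    if (pos + size >= text.length) break;
  }
  return chunks;
}
```

A chunk that straddles a page boundary is attributed to the page it starts on, which matches the 1-based page field described under Data shapes.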

Hybrid search because dense embeddings are strong on meaning but weak on rare exact terms (error codes, identifiers, surnames), and BM25 is the opposite. Reciprocal Rank Fusion combines the two rankings without needing to normalise their very different score scales.
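
Reciprocal Rank Fusion is small enough to show in full. The sketch below fuses any number of ranked id lists using only positions, never raw scores. The function name rrfFuse is hypothetical, and k = 60 is the conventional constant for RRF; whether the store uses that value is an assumption.

```typescript
// Reciprocal Rank Fusion: score(d) = sum over rankings of 1 / (k + rank(d)),
// where rank is the 1-based position of d in each ranking.
// k = 60 is the conventional default, assumed here rather than taken from the repo.
function rrfFuse(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + index + 1));
    });
  }
  // Highest fused score first.
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1])
    .map((entry) => entry[0]);
}

// Example: "a" and "c" appear in both lists, so they outrank "b" and "d".
const denseRanking = ["a", "b", "c"];
const sparseRanking = ["c", "a", "d"];
const fusedIds = rrfFuse([denseRanking, sparseRanking]);
```

Because only ranks enter the formula, a cosine similarity near 0.8 and a BM25 score near 12 never have to be put on a common scale.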

A reranking stage because first-stage retrieval is tuned for recall (pull a wide pool) while the reranker is tuned for precision (reorder that pool against the question directly). This retrieve-then-rerank shape is common in production retrieval systems.
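
To make the second stage concrete, here is a hedged sketch of a lexical reranker and the fallback wrapper around an LLM reranker. The scoring rule (fraction of distinct question terms found in the chunk) and the names Candidate, lexicalRerank, and rerankWithFallback are illustrative assumptions, not the contents of lib/reranker.ts.

```typescript
interface Candidate {
  id: string;
  content: string;
}

// Lowercased alphanumeric terms; a deliberately simple tokenizer.
const tokenize = (s: string): string[] => s.toLowerCase().match(/[a-z0-9]+/g) ?? [];

// Assumed fallback: score each candidate by the share of question terms it contains.
function lexicalRerank(question: string, pool: Candidate[], topK: number): Candidate[] {
  const queryTerms = Array.from(new Set(tokenize(question)));
  const scored = pool.map((candidate, index) => {
    const terms = new Set(tokenize(candidate.content));
    const hits = queryTerms.filter((t) => terms.has(t)).length;
    const score = queryTerms.length > 0 ? hits / queryTerms.length : 0;
    return { candidate, index, score };
  });
  // Ties keep the first-stage order.
  scored.sort((a, b) => b.score - a.score || a.index - b.index);
  return scored.slice(0, topK).map((s) => s.candidate);
}

type Reranker = (question: string, pool: Candidate[], topK: number) => Promise<Candidate[]>;

// Try the LLM reranker first; on any failure, fall back to the lexical one.
async function rerankWithFallback(
  question: string,
  pool: Candidate[],
  topK: number,
  llmRerank: Reranker,
): Promise<Candidate[]> {
  try {
    return await llmRerank(question, pool, topK);
  } catch {
    return lexicalRerank(question, pool, topK);
  }
}
```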

NDJSON streaming so one stream carries both the structured citation list (rendered immediately) and the free-form answer tokens, with a trivial client parser.
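
The client parser can be sketched as a small line buffer. The event names follow the citations, token, and done events in the question-answer flow; the field names (citations, value, index, source, page) are assumptions, since lib/citations.ts defines the real protocol.

```typescript
// Assumed event shapes; the real protocol lives in lib/citations.ts.
type StreamEvent =
  | { type: "citations"; citations: { index: number; source: string; page: number }[] }
  | { type: "token"; value: string }
  | { type: "done" };

// Incremental NDJSON parser: feed it decoded text, get back the complete events.
function createNdjsonParser(): (chunk: string) => StreamEvent[] {
  let buffer = "";
  return (chunk: string): StreamEvent[] => {
    buffer += chunk;
    const lines = buffer.split("\n");
    // The last piece may be an incomplete line; keep it for the next call.
    buffer = lines.pop() ?? "";
    return lines
      .filter((line) => line.trim() !== "")
      .map((line) => JSON.parse(line) as StreamEvent);
  };
}
```

In the browser, the text chunks would typically come from reading response.body with a reader and decoding each Uint8Array with a TextDecoder in streaming mode, so that a network chunk boundary in the middle of a line or a multi-byte character does no harm.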

In-memory cosine plus BM25 because zero infrastructure means you can clone and run in seconds. The migration to a real DB is contained behind the store interface.
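
To show how little the sparse side needs, here is a compact BM25 index with no dependencies. It is a sketch under assumptions: the class name Bm25Index, the tokenizer, and the parameters k1 = 1.2 and b = 0.75 (common defaults) are illustrative and may not match lib/bm25.ts.

```typescript
// Lowercased alphanumeric terms (assumed tokenizer).
const bm25Terms = (s: string): string[] => s.toLowerCase().match(/[a-z0-9]+/g) ?? [];

class Bm25Index {
  private docs: string[][] = [];
  private docFreq = new Map<string, number>();
  private totalLength = 0;
  private k1 = 1.2; // term-frequency saturation (common default)
  private b = 0.75; // length normalisation strength (common default)

  // Adds a document and returns its index.
  add(text: string): number {
    const tokens = bm25Terms(text);
    this.docs.push(tokens);
    this.totalLength += tokens.length;
    new Set(tokens).forEach((t) => this.docFreq.set(t, (this.docFreq.get(t) ?? 0) + 1));
    return this.docs.length - 1;
  }

  // Returns [docIndex, score] pairs with score > 0, best first.
  search(query: string, limit = 10): [number, number][] {
    const n = this.docs.length;
    if (n === 0) return [];
    const avgLen = this.totalLength / n;
    const queryTerms = Array.from(new Set(bm25Terms(query)));
    const results: [number, number][] = [];
    this.docs.forEach((tokens, i) => {
      let score = 0;
      for (const q of queryTerms) {
        const tf = tokens.filter((t) => t === q).length;
        if (tf === 0) continue;
        const df = this.docFreq.get(q) ?? 0;
        const idf = Math.log(1 + (n - df + 0.5) / (df + 0.5));
        const norm = tf + this.k1 * (1 - this.b + (this.b * tokens.length) / avgLen);
        score += (idf * tf * (this.k1 + 1)) / norm;
      }
      if (score > 0) results.push([i, score]);
    });
    return results.sort((x, y) => y[1] - x[1]).slice(0, limit);
  }
}
```

A rare exact term such as an error code appears in few documents, so its idf is high and the documents containing it rise to the top, which is the behaviour dense embeddings tend to miss.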

Data shapes

interface Chunk {
  id: string         // stable, unique within the store
  docId: string      // which document this chunk belongs to
  source: string     // original filename, for citation display
  page: number       // 1-based page the chunk starts on
  content: string
  embedding: number[] // 1536 dims for text-embedding-3-small
}

Chunks are held in a flat array for cosine search, with a parallel BM25 index over the same content. Brute-force cosine search costs O(N × 1536) per query, which is acceptable for sub-100 ms search up to roughly 10k chunks. Beyond that, switch to pgvector with an HNSW index.
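
The dense half of the search is a linear scan. A minimal sketch, assuming embeddings are plain number arrays of equal length; the helper names cosine and topKByCosine are hypothetical:

```typescript
// Cosine similarity of two equal-length vectors; returns 0 if either is all zeros.
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Brute-force top-k: score every item, sort descending, keep the first k.
function topKByCosine<T extends { embedding: number[] }>(
  query: number[],
  items: T[],
  k: number,
): { item: T; score: number }[] {
  return items
    .map((item) => ({ item, score: cosine(query, item.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

Every query touches every chunk once, which is where the linear cost in N comes from.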

Failure modes

  • PDF has no extractable text (scanned image PDFs): pages come back empty, the route returns 400. Add OCR if you need scanned-PDF support.
  • OpenAI rate limit: embeddings or the reranker can return 429. The reranker falls back to the lexical reranker on any failure; embeddings are not retried in the starter.
  • Token limit exceeded (very long PDFs with a large top-k): the prompt may overflow context. Reduce TOP_K or chunk size.
  • In-memory store lost on restart: by design. Swap the store for persistence.
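
If you want embeddings to survive a brief rate limit, one option is a small retry wrapper with exponential backoff. This is a sketch, not part of the starter: withRetry is a hypothetical helper, and it assumes the thrown error exposes a numeric status field, as the OpenAI Node SDK's API errors do.

```typescript
// Retry fn on HTTP 429 with exponential backoff; rethrow anything else immediately.
async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 4,
  baseDelayMs = 500,
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      const status = (err as { status?: number }).status;
      // Give up on non-rate-limit errors or when attempts are exhausted.
      if (status !== 429 || attempt >= maxAttempts) throw err;
      // Wait baseDelayMs, then 2x, then 4x, and so on.
      const delayMs = baseDelayMs * 2 ** (attempt - 1);
      await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```

A call site would wrap the embed helper, for example withRetry(() => embed(texts)), where embed stands in for whatever lib/openai.ts exports.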

Deployment topology

graph LR
  CDN[Vercel Edge CDN] -->|static assets| Browser
  Browser -->|API calls| Func[Vercel Serverless Function]
  Func -->|Embeddings, rerank, chat| OAI[OpenAI API]

  classDef vercel fill:#000,stroke:#fff,color:#fff
  class CDN,Func vercel

The API routes run as a serverless function with no persistent state. Each question costs one embedding call, one reranking call, and one streaming chat call.
