Cost and Performance

sarmakska edited this page May 31, 2026 · 3 revisions

Real numbers, not vibes.

Per-question cost

For a typical 50-page PDF (around 100k characters, 100 chunks):

| Step | What | Cost |
| --- | --- | --- |
| Embed question | text-embedding-3-small, ~10 tokens | £0.0000002 |
| Hybrid retrieve | in-memory cosine plus BM25, no API call | £0 |
| Rerank candidate pool | gpt-4o-mini, ~20 short passages | £0.0004 |
| Generate answer | gpt-4o-mini, ~6k input + 300 output tokens | £0.0009 |
| Total | | ~£0.0015 |

A thousand questions on a 50-page PDF: roughly £1 to £2. If you are cost-sensitive, dropping the reranker removes one API call and about £0.0004 per question.
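The arithmetic behind the per-question figure can be sketched as below. This is a minimal illustration, not code from the project: the `StepCost` type and `costPerQuestion` helper are hypothetical names, and the step costs are copied from the table above. The listed steps sum to about £0.0013, just under the rounded ~£0.0015 total.

```typescript
// Per-step costs in GBP, copied from the table above.
// "Hybrid retrieve" runs in memory, so it adds no API cost.
interface StepCost {
  step: string;
  gbp: number;
}

const stepCosts: StepCost[] = [
  { step: "Embed question", gbp: 0.0000002 },
  { step: "Hybrid retrieve", gbp: 0 },
  { step: "Rerank candidate pool", gbp: 0.0004 },
  { step: "Generate answer", gbp: 0.0009 },
];

// Sum the cost of one question, optionally skipping named steps
// (hypothetical helper, for illustration only).
function costPerQuestion(steps: StepCost[], skip: string[] = []): number {
  return steps
    .filter((s) => skip.indexOf(s.step) === -1)
    .reduce((total, s) => total + s.gbp, 0);
}

const withReranker = costPerQuestion(stepCosts);
const withoutReranker = costPerQuestion(stepCosts, ["Rerank candidate pool"]);
const perThousand = withReranker * 1000; // lands inside the £1 to £2 range
```

Skipping the rerank step leaves the generation call as almost the entire cost, which is why the reranker is the first thing to drop on a tight budget.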

Indexing cost (one-time per PDF)

| PDF size | Chunks | Embedding cost |
| --- | --- | --- |
| 5 pages | 10 | £0.0002 |
| 50 pages | 100 | £0.002 |
| 500 pages | 1000 | £0.02 |

text-embedding-3-small is £0.000016 per 1k tokens. Indexing a 500-page PDF costs about 2p.
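As a rough sketch of how the indexing figures scale, the cost is just total tokens times the per-token price. The function name and the tokens-per-chunk value below are assumptions: at the quoted price, the table's numbers work out to about 1,250 tokens per chunk, a figure inferred from the table and not stated in the source.

```typescript
// Price quoted above for text-embedding-3-small, in GBP per 1,000 tokens.
const GBP_PER_1K_TOKENS = 0.000016;

// Assumption inferred from the table, not stated in the source:
// about 1,250 tokens per chunk reproduces the listed costs.
const ASSUMED_TOKENS_PER_CHUNK = 1250;

// Hypothetical helper: one-time embedding cost for a PDF split into `chunks` chunks.
function indexingCostGbp(
  chunks: number,
  tokensPerChunk: number = ASSUMED_TOKENS_PER_CHUNK,
  gbpPer1kTokens: number = GBP_PER_1K_TOKENS,
): number {
  const totalTokens = chunks * tokensPerChunk;
  return (totalTokens / 1000) * gbpPer1kTokens;
}

const cost5Pages = indexingCostGbp(10);     // about £0.0002
const cost50Pages = indexingCostGbp(100);   // about £0.002
const cost500Pages = indexingCostGbp(1000); // about £0.02
```

Passing a smaller `tokensPerChunk` scales the cost down proportionally, so shorter chunks index for less than the table shows.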

Latency breakdown

End-to-end question latency (warm Vercel function, UK):

| Step | Time |
| --- | --- |
| Network round trip to Vercel | 30-80ms |
| Embed question (OpenAI) | 100-200ms |
| Hybrid search (in-memory cosine plus BM25, 100 chunks) | 2-5ms |
| Rerank candidate pool (gpt-4o-mini) | 300-600ms |
| First token from gpt-4o-mini | 300-700ms |
| Stream rest of ~300 token answer | 800-1500ms |
| Total time-to-first-token | ~900ms to 1.4s |
| Total time-to-completion | ~2.5 to 4 seconds |

The reranking call is the added latency cost compared with a dense-only pipeline. It buys precision; drop it if time-to-first-token matters more than answer quality for your use case.

Time-to-first-token is what the user perceives. Streaming makes the wait feel responsive even when total time is 3 seconds.

Latency at scale (in-memory store)

Brute-force cosine search is O(N · D) per query, where N is the number of chunks and D is the number of embedding dimensions (1536).

| Chunks | Search time |
| --- | --- |
| 100 | 2ms |
| 1,000 | 12ms |
| 10,000 | 60ms |
| 100,000 | 600ms |

Above ~10k chunks, the in-memory store starts to become the bottleneck. Switch to pgvector (HNSW index), where search stays under 25ms even at 1M chunks.

See Swap to pgvector.
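For reference, the O(N · D) scan described in this section looks roughly like the sketch below. It is a minimal illustration under assumptions, not the project's actual store: `StoredChunk`, `cosine` and `topK` are hypothetical names. Each query touches every dimension of every stored embedding, which is why search time grows linearly with the number of chunks.

```typescript
interface StoredChunk {
  id: number;
  embedding: Float32Array; // length D, e.g. 1536
}

// Cosine similarity between two equal-length vectors: O(D).
function cosine(a: Float32Array, b: Float32Array): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Score every chunk against the query (N cosine calls, so O(N · D)),
// then sort by score and keep the best k.
function topK(
  query: Float32Array,
  chunks: StoredChunk[],
  k: number,
): { id: number; score: number }[] {
  return chunks
    .map((c) => ({ id: c.id, score: cosine(query, c.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

An approximate index such as HNSW avoids visiting every chunk, which is what keeps search time nearly flat as the store grows.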

Memory footprint

In-memory store:

  • Each chunk: ~1,000 bytes of content + 1536 × 4 bytes (Float32 array) = ~7 KB total
  • 1,000 chunks: ~7MB
  • 10,000 chunks: ~70MB

Vercel serverless functions get 1024MB by default. You can fit ~150k chunks before memory pressure, but cosine search over 150k chunks takes around a second, so you'd hit the latency wall first.
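The footprint estimate above is simple enough to check in a few lines. This sketch uses hypothetical helper names and ignores runtime overhead (Node itself, object headers, the BM25 index), so treat the result as an upper bound.

```typescript
const CONTENT_BYTES = 1000;  // approximate text size per chunk, from the estimate above
const EMBEDDING_DIMS = 1536; // text-embedding-3-small dimensions
const BYTES_PER_FLOAT32 = 4;

// Bytes needed to hold one chunk: its text plus its Float32 embedding.
function bytesPerChunk(): number {
  return CONTENT_BYTES + EMBEDDING_DIMS * BYTES_PER_FLOAT32;
}

// Upper bound on chunks that fit in a given memory budget (in MB),
// ignoring everything else the process needs.
function maxChunks(memoryMb: number): number {
  return Math.floor((memoryMb * 1024 * 1024) / bytesPerChunk());
}

const perChunk = bytesPerChunk();     // 7144 bytes, i.e. ~7 KB
const vercelDefault = maxChunks(1024); // 150299, i.e. ~150k chunks
```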

Cost at scale

100 daily users, 10 questions each, 50-page reference PDF:

  • 1,000 questions/day × £0.0015 = £1.50/day
  • ≈ £45/month in OpenAI costs

If you embed 10 new PDFs per day, that's about £0.02/day in indexing. Trivial.

Throughput

Vercel free tier: 100k function invocations/month. That's 3,300/day, or roughly 1 question every 25 seconds on average.

Each question is one chat invocation and each upload is one more, so 100k invocations covers at most 100k questions per month.

Above this: Vercel Pro at $20/month for 10x the limits, or self-host on a VPS.
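The free-tier figures above follow from dividing the monthly quota evenly across the month. A small sketch, assuming a 30-day month and one invocation per question (both helper names are hypothetical):

```typescript
const SECONDS_PER_DAY = 24 * 60 * 60;

// Average daily budget when a monthly invocation quota is spread evenly.
function questionsPerDay(invocationsPerMonth: number, daysPerMonth: number = 30): number {
  return invocationsPerMonth / daysPerMonth;
}

// Average gap between questions, in seconds, at a given daily rate.
function secondsBetweenQuestions(perDay: number): number {
  return SECONDS_PER_DAY / perDay;
}

const dailyBudget = questionsPerDay(100000);            // about 3,333 per day
const averageGap = secondsBetweenQuestions(dailyBudget); // about 26 seconds
```

These are averages; real traffic is bursty, so the quota matters over the month, not second by second.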

Self-host benchmark

Reference setup: single Hetzner CCX13 (2 vCPU, 8GB) at €13/month.

  • Cold start to first request: ~800ms
  • Memory headroom: 7GB free after Node, enough for ~1M chunks at ~7 KB each with little to spare
  • Concurrent queries: 50-80 before tail latency degrades
  • Monthly cost: €13 + OpenAI usage

Cheaper than Vercel Pro at any scale, but you handle the ops.

What slows things down

In rough order of impact:

  1. OpenAI latency variability: the same call sometimes takes 100ms, sometimes 700ms. Out of your control.
  2. PDF parsing on huge files: pdf-parse is single-threaded. A 500-page PDF takes 2-3 seconds to parse before chunking even starts.
  3. In-memory cosine at 10k+ chunks: fix by swapping to pgvector.
  4. Sequential embedding: the upload route embeds chunks one batch at a time. It could parallelise across multiple OpenAI requests for big PDFs. Not implemented, to keep the code simple.
  5. Stream parsing in the browser: minor; the SSE reader has a small per-chunk overhead.

What's NOT slow

  • Cosine math itself: it's fast, even in JavaScript
  • Vercel cold starts: serverless region in London makes this irrelevant for UK users
  • React rendering: streaming text is cheap
