Cost and Performance

Real numbers, not vibes.

Per-question cost

For a typical 50-page PDF (around 100k characters, 100 chunks):

Step	Details	Cost
Embed question	`text-embedding-3-small`, ~10 tokens	£0.0000002
Hybrid retrieve	in-memory cosine plus BM25, no API call	£0
Rerank candidate pool	`gpt-4o-mini`, ~20 short passages	£0.0004
Generate answer	`gpt-4o-mini`, ~6k input + 300 output tokens	£0.0009
Total		~£0.0015

A thousand questions on a 50-page PDF: roughly £1 to £2. If you are cost-sensitive, dropping the reranker removes one call and about £0.0004 per question.
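The arithmetic behind the table can be reproduced with a few lines. This is only a sketch: the step costs are copied from the table above, and the names `CostStep` and `perQuestionCost` are illustrative, not part of the project. The listed rows sum to about £0.0013, which sits just under the table's rounded total.

```typescript
// Per-question cost in GBP, using the figures from the table above.
interface CostStep {
  step: string;
  costGbp: number;
}

const steps: CostStep[] = [
  { step: "Embed question", costGbp: 0.0000002 },
  { step: "Hybrid retrieve", costGbp: 0 },
  { step: "Rerank candidate pool", costGbp: 0.0004 },
  { step: "Generate answer", costGbp: 0.0009 },
];

// Sum the steps, optionally skipping the reranker call.
function perQuestionCost(all: CostStep[], useReranker = true): number {
  return all
    .filter((s) => useReranker || s.step !== "Rerank candidate pool")
    .reduce((sum, s) => sum + s.costGbp, 0);
}

const withRerank = perQuestionCost(steps); // about £0.0013
const withoutRerank = perQuestionCost(steps, false); // about £0.0009

console.log(`per 1,000 questions with reranker: £${(withRerank * 1000).toFixed(2)}`);
console.log(`per 1,000 questions without reranker: £${(withoutRerank * 1000).toFixed(2)}`);
```

Swap in your own token counts and prices if your prompts are longer or you use a different model.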

Indexing cost (one-time per PDF)

PDF size	Chunks	Embedding cost
5 pages	10	£0.0002
50 pages	100	£0.002
500 pages	1000	£0.02

`text-embedding-3-small` is £0.000016 per 1k tokens. Indexing a 500-page PDF costs about 2p.

Latency breakdown

End-to-end question latency (warm Vercel function, UK):

Step	Time
Network round trip to Vercel	30-80ms
Embed question (OpenAI)	100-200ms
Hybrid search (in-memory cosine plus BM25, 100 chunks)	2-5ms
Rerank candidate pool (`gpt-4o-mini`)	300-600ms
First token from `gpt-4o-mini`	300-700ms
Stream rest of ~300 token answer	800-1500ms
Total time-to-first-token	~900ms to 1.4s
Total time-to-completion	~2.5 to 4 seconds

The reranking call is the new latency cost over a dense-only pipeline. It buys precision; drop it if time-to-first-token matters more than answer quality for your use case.

Time-to-first-token is what the user perceives. Streaming makes the wait feel responsive even when total time is 3 seconds.
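To check time-to-first-token in your own deployment, you can time any stream of text chunks. The sketch below assumes the answer arrives as an `AsyncIterable<string>`; `timeStream` and `fakeAnswer` are hypothetical names, and the fake generator stands in for a real model response.

```typescript
// Measure time-to-first-token and total time for any async stream of text chunks.
interface StreamTiming {
  firstTokenMs: number | null; // null if the stream produced nothing
  totalMs: number;
  text: string;
}

async function timeStream(stream: AsyncIterable<string>): Promise<StreamTiming> {
  const start = Date.now();
  let firstTokenMs: number | null = null;
  let text = "";
  for await (const chunk of stream) {
    // Record the elapsed time only once, when the first chunk arrives.
    if (firstTokenMs === null) firstTokenMs = Date.now() - start;
    text += chunk;
  }
  return { firstTokenMs, totalMs: Date.now() - start, text };
}

// Stand-in for a streamed model response (hypothetical, makes no API call).
async function* fakeAnswer(): AsyncGenerator<string> {
  for (const word of ["Streaming ", "feels ", "fast."]) {
    await new Promise<void>((resolve) => setTimeout(resolve, 20));
    yield word;
  }
}

timeStream(fakeAnswer()).then((t) => {
  console.log(`first token after ${t.firstTokenMs}ms, done after ${t.totalMs}ms`);
});
```

Measured this way on the server, the number excludes the network round trip to the browser, so add that separately when comparing against the table.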

Latency at scale (in-memory store)

A brute-force cosine search is O(N · D), where N is the number of chunks and D is the embedding dimension (1536).

Chunks	Search time
100	2ms
1,000	12ms
10,000	60ms
100,000	600ms
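The linear growth comes from scoring every chunk against the query. A minimal sketch of such a scan is below; the `Chunk` shape and the `topK` helper are assumptions for illustration and may not match the project's actual store.

```typescript
interface Chunk {
  id: string;
  embedding: Float32Array; // length D (1536 for text-embedding-3-small)
}

// Cosine similarity between two equal-length vectors: O(D).
function cosine(a: Float32Array, b: Float32Array): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Score every chunk against the query, then keep the best k: O(N · D) plus a sort.
function topK(query: Float32Array, chunks: Chunk[], k: number): { id: string; score: number }[] {
  return chunks
    .map((c) => ({ id: c.id, score: cosine(query, c.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Tiny 3-dimensional example; real embeddings have 1536 dimensions.
const demoChunks: Chunk[] = [
  { id: "a", embedding: new Float32Array([1, 0, 0]) },
  { id: "b", embedding: new Float32Array([0, 1, 0]) },
  { id: "c", embedding: new Float32Array([1, 1, 0]) },
];
const hits = topK(new Float32Array([1, 0, 0]), demoChunks, 2);
console.log(hits.map((h) => h.id)); // ["a", "c"]
```

Every query touches all N vectors, which is why the search time in the table scales with the chunk count.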

Above ~10k chunks, the in-memory store starts to become the bottleneck. Switch to pgvector with an HNSW index, where search stays under 25ms even at 1M chunks.

See Swap to pgvector.

BM25 query latency

The BM25 index keeps an inverted postings list and precomputes per-document term frequencies and lengths when the corpus is set, rather than rescanning every document on every query. A query touches only the documents that actually contain a query term.

Measured on a synthetic 5,000-chunk corpus (80-200 tokens per chunk), 2,000 queries, warm:

BM25 implementation	Average query time
Per-query rescan (rebuild term frequencies for every doc)	~26ms
Inverted postings (current)	~1.4ms

That is roughly an 18x reduction in lexical search latency, and the gap widens with corpus size because the rescan cost grows with the whole corpus while the postings cost grows only with the number of matching documents. The dense cosine pass is unchanged and remains the dominant in-memory cost above ~10k chunks.
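The inverted-postings idea described in this section can be sketched as follows. This is not the project's implementation: the class name `Bm25Index`, the simple alphanumeric tokenizer, the non-negative IDF variant, and the defaults `k1 = 1.2` and `b = 0.75` are all assumptions chosen for illustration.

```typescript
// A minimal BM25 index backed by an inverted postings list.
class Bm25Index {
  // term -> (docId -> term frequency in that doc)
  private postings = new Map<string, Map<number, number>>();
  private docLengths: number[] = [];
  private avgLength = 0;
  private k1: number;
  private b: number;

  constructor(docs: string[], k1 = 1.2, b = 0.75) {
    this.k1 = k1;
    this.b = b;
    // Precompute postings and document lengths once, when the corpus is set.
    docs.forEach((doc, docId) => {
      const terms = Bm25Index.tokenize(doc);
      this.docLengths.push(terms.length);
      for (const term of terms) {
        let docsForTerm = this.postings.get(term);
        if (!docsForTerm) {
          docsForTerm = new Map<number, number>();
          this.postings.set(term, docsForTerm);
        }
        docsForTerm.set(docId, (docsForTerm.get(docId) ?? 0) + 1);
      }
    });
    const total = this.docLengths.reduce((sum, n) => sum + n, 0);
    this.avgLength = docs.length > 0 ? total / docs.length : 0;
  }

  static tokenize(text: string): string[] {
    return text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
  }

  // Score only the documents that appear in a query term's postings.
  search(query: string, k: number): { docId: number; score: number }[] {
    const scores = new Map<number, number>();
    const n = this.docLengths.length;
    new Set(Bm25Index.tokenize(query)).forEach((term) => {
      const docsForTerm = this.postings.get(term);
      if (!docsForTerm) return; // term absent from corpus: nothing to touch
      const df = docsForTerm.size;
      const idf = Math.log(1 + (n - df + 0.5) / (df + 0.5));
      docsForTerm.forEach((tf, docId) => {
        const norm = 1 - this.b + this.b * (this.docLengths[docId] / this.avgLength);
        const termScore = (idf * tf * (this.k1 + 1)) / (tf + this.k1 * norm);
        scores.set(docId, (scores.get(docId) ?? 0) + termScore);
      });
    });
    return Array.from(scores, ([docId, score]) => ({ docId, score }))
      .sort((x, y) => y.score - x.score)
      .slice(0, k);
  }
}

const index = new Bm25Index([
  "vector search with cosine similarity",
  "bm25 ranks documents by term frequency",
  "postgres stores rows in tables",
]);
const results = index.search("bm25 term frequency", 2);
console.log(results); // only document 1 contains any query term
```

The query loop never visits documents 0 and 2, which is the property that keeps query cost tied to matching documents and not to corpus size.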

Memory footprint

In-memory store:

Each chunk: 1,000 bytes of content + 1536 × 4 bytes (Float32 array) ≈ 7 KB total
1,000 chunks: ~7 MB
10,000 chunks: ~70 MB

Vercel serverless functions get 1024MB by default, which fits ~150k chunks before memory pressure. But a cosine search over 150k chunks takes around a second, so you'd hit the latency wall first.
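The capacity estimate is plain arithmetic and easy to redo for other memory limits. The sketch below assumes 1,000 bytes of text per chunk, as above, and ignores JavaScript object overhead and the memory the runtime itself uses, so treat the result as an upper bound. `maxChunks` is an illustrative helper name.

```typescript
const EMBEDDING_DIMS = 1536;
const BYTES_PER_FLOAT32 = 4;
const CONTENT_BYTES = 1000; // assumed average chunk text size

// 1000 + 1536 * 4 = 7,144 bytes per chunk
const bytesPerChunk = CONTENT_BYTES + EMBEDDING_DIMS * BYTES_PER_FLOAT32;

// Upper bound on chunks that fit in a given memory budget (in MB).
function maxChunks(memoryMb: number): number {
  return Math.floor((memoryMb * 1024 * 1024) / bytesPerChunk);
}

console.log(bytesPerChunk); // 7144
console.log(maxChunks(1024)); // roughly 150,000
```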

Cost at scale

100 daily users, 10 questions each, 50-page reference PDF:

1,000 questions/day × £0.0015 = £1.50/day
≈ £45/month in OpenAI costs

If you embed 10 new PDFs per day, that's about £0.02/day in indexing. Trivial.

Throughput

Vercel free tier: 100k function invocations/month. That's about 3,300/day, or roughly one question every 26 seconds on average.

Each upload or chat request is one invocation, so 100k invocations is at most 100k questions per month.
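The same budget can be worked out for any invocation limit. This sketch assumes a 30-day month and an even spread of traffic, which real usage will not have.

```typescript
const INVOCATIONS_PER_MONTH = 100_000;
const DAYS_PER_MONTH = 30; // assumption for the average
const SECONDS_PER_DAY = 24 * 60 * 60;

const perDay = INVOCATIONS_PER_MONTH / DAYS_PER_MONTH; // about 3,333
const secondsBetween = SECONDS_PER_DAY / perDay; // about 26

console.log(`${Math.floor(perDay)} invocations/day, one every ${secondsBetween.toFixed(1)}s`);
```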

Above this: Vercel Pro at $20/month for 10x the limits, or self-host on a VPS.

Self-host benchmark

Reference setup: single Hetzner CCX13 (2 vCPU, 8GB) at €13/month.

Cold start to first request: ~800ms
Memory headroom: 7GB free after Node, can hold ~1M chunks comfortably
Concurrent queries: 50-80 before tail latency degrades
Monthly cost: €13 + OpenAI usage

Cheaper than Vercel Pro at any scale, but you handle the ops.

What slows things down

In rough order of impact:

OpenAI cold-start latency variability — sometimes 100ms, sometimes 700ms for the same call. Out of your control.
PDF parsing on huge files — pdf-parse is single-threaded. A 500-page PDF takes 2-3 seconds to parse before chunking even starts.
In-memory cosine at 10k+ chunks — fix by swapping to pgvector.
Sequential embedding — the upload route embeds chunks one batch at a time. Could parallelise across multiple OpenAI requests for big PDFs. Not implemented to keep the code simple.
Stream parsing in browser — minor, the SSE reader has a small per-chunk overhead.
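For the sequential embedding item above, one way to parallelise is to run a small, fixed number of batch requests at once. The sketch below is not in the codebase: `embedAll` and the `EmbedBatch` signature are hypothetical, and the batch size and concurrency defaults are arbitrary. A real version would also need to respect OpenAI rate limits and retry failed batches.

```typescript
// Hypothetical signature: takes a batch of chunk texts, returns one vector per text.
type EmbedBatch = (texts: string[]) => Promise<number[][]>;

async function embedAll(
  chunks: string[],
  embedBatch: EmbedBatch,
  batchSize = 100,
  concurrency = 4
): Promise<number[][]> {
  // Split the chunks into batches; the array index preserves the original order.
  const batches: string[][] = [];
  for (let i = 0; i < chunks.length; i += batchSize) {
    batches.push(chunks.slice(i, i + batchSize));
  }

  const results: number[][][] = new Array(batches.length);
  let next = 0;

  // Each worker claims the next unprocessed batch until none remain.
  async function worker(): Promise<void> {
    while (next < batches.length) {
      const index = next++;
      results[index] = await embedBatch(batches[index]);
    }
  }

  const workerCount = Math.min(concurrency, batches.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));

  // Flatten the per-batch results back into one vector per chunk, in input order.
  return ([] as number[][]).concat(...results);
}

// Fake embedder for demonstration: one 1-dimensional "vector" per text.
const fakeEmbed: EmbedBatch = async (texts) => texts.map((t) => [t.length]);

embedAll(["a", "bb", "ccc", "dddd", "eeeee"], fakeEmbed, 2, 2).then((vectors) => {
  console.log(vectors); // [[1], [2], [3], [4], [5]]
});
```

Results are written by batch index, so the output order matches the input order even when batches finish out of order.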

What's NOT slow

Cosine math itself: it's fast, even in JavaScript
Vercel cold starts: serverless region in London makes this irrelevant for UK users
React rendering: streaming text is cheap
