Cost and Performance
Real numbers, not vibes.
For a typical 50-page PDF (around 100k characters, 100 chunks):
| Step | What | Cost |
|---|---|---|
| Embed question | `text-embedding-3-small`, ~10 tokens | £0.0000002 |
| Hybrid retrieve | in-memory cosine plus BM25, no API call | £0 |
| Rerank candidate pool | `gpt-4o-mini`, ~20 short passages | £0.0004 |
| Generate answer | `gpt-4o-mini`, ~6k input + 300 output tokens | £0.0009 |
| Total | | ~£0.0015 |
A thousand questions on a 50-page PDF: roughly £1 to £2. Drop the reranker to remove one call if you are cost-sensitive.
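As a quick sanity check on the table, the per-step figures can be summed in a few lines. This is only a sketch: the step names are made up for illustration, and the prices are the ones quoted above rather than live OpenAI pricing. The exact sum of the rows comes to about £0.0013, which the table rounds up to ~£0.0015.

```typescript
// Per-question cost in GBP, copied from the table above.
// The keys are illustrative names, not identifiers from the codebase.
const stepCostsGbp: Record<string, number> = {
  embedQuestion: 0.0000002,
  hybridRetrieve: 0,
  rerank: 0.0004,
  generate: 0.0009,
};

// Sum every step except the ones listed in `skip`.
function costPerQuestion(costs: Record<string, number>, skip: string[] = []): number {
  return Object.keys(costs)
    .filter((step) => skip.indexOf(step) === -1)
    .reduce((sum, step) => sum + costs[step], 0);
}

const withRerank = costPerQuestion(stepCostsGbp);
const withoutRerank = costPerQuestion(stepCostsGbp, ["rerank"]);

console.log(`1,000 questions with rerank:    £${(withRerank * 1000).toFixed(2)}`);
console.log(`1,000 questions without rerank: £${(withoutRerank * 1000).toFixed(2)}`);
```

On these figures, dropping the reranker removes a little under a third of the per-question cost.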
| PDF size | Chunks | Embedding cost |
|---|---|---|
| 5 pages | 10 | £0.0002 |
| 50 pages | 100 | £0.002 |
| 500 pages | 1000 | £0.02 |
text-embedding-3-small is £0.000016 per 1k tokens. Indexing a 500-page PDF costs about 2p.
End-to-end question latency (warm Vercel function, UK):
| Step | Time |
|---|---|
| Network round trip to Vercel | 30-80ms |
| Embed question (OpenAI) | 100-200ms |
| Hybrid search (in-memory cosine plus BM25, 100 chunks) | 2-5ms |
| Rerank candidate pool (`gpt-4o-mini`) | 300-600ms |
| First token from `gpt-4o-mini` | 300-700ms |
| Stream rest of ~300 token answer | 800-1500ms |
| Total time-to-first-token | ~900ms to 1.4s |
| Total time-to-completion | ~2.5 to 4 seconds |
The reranking call is the new latency cost over a dense-only pipeline. It buys precision; drop it if time-to-first-token matters more than answer quality for your use case.
Time-to-first-token is what the user perceives. Streaming makes the wait feel responsive even when total time is 3 seconds.
Cosine similarity is O(N · D) where N is chunks and D is embedding dimensions (1536).
| Chunks | Search time |
|---|---|
| 100 | 2ms |
| 1,000 | 12ms |
| 10,000 | 60ms |
| 100,000 | 600ms |
Above ~10k chunks, the in-memory store starts to become the bottleneck. Switch to pgvector with an HNSW index, where search stays under 25ms even at 1M chunks.
See Swap to pgvector.
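The O(N · D) scan described above can be sketched as a brute-force top-k search. This is a minimal illustration, not the project's actual store: the `Chunk` shape and the function names are assumptions, and the example uses 3-dimensional vectors instead of 1536 to stay readable.

```typescript
// Hypothetical chunk shape; the real store may carry more fields.
interface Chunk {
  id: string;
  embedding: Float32Array;
}

// One pass over D dimensions: dot product and both norms together.
function cosine(a: Float32Array, b: Float32Array): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Scores all N chunks, so the work is O(N · D) plus a sort of N scores.
function topK(query: Float32Array, chunks: Chunk[], k: number): { id: string; score: number }[] {
  return chunks
    .map((c) => ({ id: c.id, score: cosine(query, c.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

const demoChunks: Chunk[] = [
  { id: "a", embedding: new Float32Array([1, 0, 0]) },
  { id: "b", embedding: new Float32Array([0, 1, 0]) },
  { id: "c", embedding: new Float32Array([0.9, 0.1, 0]) },
];
const hits = topK(new Float32Array([1, 0, 0]), demoChunks, 2);
console.log(hits.map((h) => h.id)); // ids of the two closest chunks
```

Every query touches every chunk, which is why search time in the table grows roughly in line with the chunk count.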
The BM25 index keeps an inverted postings list and precomputes per-document term frequencies and lengths when the corpus is set, rather than rescanning every document on every query. A query touches only the documents that actually contain a query term.
Measured on a synthetic 5,000-chunk corpus (80-200 tokens per chunk), 2,000 queries, warm:
| BM25 implementation | Average query time |
|---|---|
| Per-query rescan (rebuild term frequencies for every doc) | ~26ms |
| Inverted postings (current) | ~1.4ms |
That is roughly an 18x reduction in lexical search latency, and the gap widens with corpus size: the rescan cost grows with the whole corpus, while the postings cost grows only with the number of matching documents. The dense cosine pass is unchanged and remains the dominant in-memory cost above ~10k chunks.
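To make the postings idea concrete, here is a minimal sketch of a BM25 index built once and queried through an inverted postings list. It is not the project's implementation: the type and function names are hypothetical, tokenisation is assumed to happen elsewhere, and `k1 = 1.2`, `b = 0.75` are common defaults rather than values taken from the codebase.

```typescript
interface Posting { doc: number; tf: number }

interface Bm25Index {
  postings: Map<string, Posting[]>; // term -> documents containing it
  docLengths: number[];             // token count per document
  avgLength: number;
}

// Runs once when the corpus is set: term frequencies and lengths are precomputed here.
function buildIndex(docs: string[][]): Bm25Index {
  const postings = new Map<string, Posting[]>();
  const docLengths: number[] = [];
  docs.forEach((tokens, doc) => {
    docLengths.push(tokens.length);
    const counts = new Map<string, number>();
    for (const t of tokens) counts.set(t, (counts.get(t) ?? 0) + 1);
    counts.forEach((tf, term) => {
      const list = postings.get(term) ?? [];
      list.push({ doc, tf });
      postings.set(term, list);
    });
  });
  const total = docLengths.reduce((sum, n) => sum + n, 0);
  return { postings, docLengths, avgLength: docs.length > 0 ? total / docs.length : 0 };
}

// Per query: only documents that appear in a query term's postings are scored.
function bm25Search(index: Bm25Index, queryTokens: string[], k: number, k1 = 1.2, b = 0.75) {
  const n = index.docLengths.length;
  const scores = new Map<number, number>();
  new Set(queryTokens).forEach((term) => {
    const list = index.postings.get(term);
    if (!list) return;
    const idf = Math.log(1 + (n - list.length + 0.5) / (list.length + 0.5));
    for (const { doc, tf } of list) {
      const lengthNorm = 1 - b + b * (index.docLengths[doc] / index.avgLength);
      const termScore = (idf * tf * (k1 + 1)) / (tf + k1 * lengthNorm);
      scores.set(doc, (scores.get(doc) ?? 0) + termScore);
    }
  });
  return Array.from(scores, ([doc, score]) => ({ doc, score }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

const bm25Index = buildIndex([
  ["vector", "search", "index"],
  ["bm25", "lexical", "search"],
  ["cooking", "pasta"],
]);
const bm25Hits = bm25Search(bm25Index, ["bm25", "search"], 5);
console.log(bm25Hits); // the third document never enters the scoring loop
```

The query loop never visits a document that shares no term with the query, which is where the saving over a per-query rescan comes from.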
In-memory store:
- Each chunk: 1000 bytes content + 1536 × 4 bytes (Float32 array) = ~7KB total
- 1,000 chunks: ~7MB
- 10,000 chunks: ~70MB
Vercel serverless functions get 1024MB by default, so roughly 150k chunks fit before memory pressure. Going by the search-time table, though, cosine search over 150k chunks takes around a second, so you'd hit the latency wall first.
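The memory figures above follow from simple arithmetic, sketched here. The constants come from the text; the function names are illustrative. The estimate counts only chunk text and the Float32 vector, so it ignores the Node runtime, JavaScript string overhead, and the BM25 index, all of which lower the real ceiling.

```typescript
const CONTENT_BYTES = 1000;   // chunk text, per the estimate above
const DIMENSIONS = 1536;      // embedding dimensions
const BYTES_PER_FLOAT32 = 4;

const bytesPerChunk = CONTENT_BYTES + DIMENSIONS * BYTES_PER_FLOAT32; // 7144 bytes

// Approximate store size in MB for a given number of chunks.
function storeSizeMb(chunks: number): number {
  return (chunks * bytesPerChunk) / (1024 * 1024);
}

// Upper bound on chunks that fit in a given memory budget.
function maxChunks(memoryMb: number): number {
  return Math.floor((memoryMb * 1024 * 1024) / bytesPerChunk);
}

console.log(storeSizeMb(1_000).toFixed(1));  // close to the ~7MB quoted above
console.log(storeSizeMb(10_000).toFixed(1)); // close to the ~70MB quoted above
console.log(maxChunks(1024));                // in the region of 150k
```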
100 daily users, 10 questions each, 50-page reference PDF:
- 1,000 questions/day × ~£0.0015 = ~£1.50/day
- ≈ £45/month in OpenAI costs
If you embed 10 new PDFs per day, that's about £0.02/day in indexing. Trivial.
Vercel free tier: 100k function invocations/month. That's 3,300/day, or roughly 1 question every 25 seconds on average.
Each question is one upload-or-chat invocation, so 100k = 100k questions per month.
Above this: Vercel Pro at $20/month for 10x the limits, or self-host on a VPS.
Reference setup: single Hetzner CCX13 (2 vCPU, 8GB) at €13/month.
- Cold start to first request: ~800ms
- Memory headroom: 7GB free after Node, can hold ~1M chunks comfortably
- Concurrent queries: 50-80 before tail latency degrades
- Monthly cost: €13 + OpenAI usage
Cheaper than Vercel Pro at any scale, but you handle the ops.
In rough order of impact:
- OpenAI cold-start latency variability: sometimes 100ms, sometimes 700ms for the same call. Out of your control.
- PDF parsing on huge files: `pdf-parse` is single-threaded. A 500-page PDF takes 2-3 seconds to parse before chunking even starts.
- In-memory cosine at 10k+ chunks: fix by swapping to pgvector.
- Sequential embedding: the upload route embeds chunks one batch at a time. Could parallelise across multiple OpenAI requests for big PDFs. Not implemented to keep the code simple.
- Stream parsing in browser: minor, the SSE reader has a small per-chunk overhead.
Not bottlenecks:
- Cosine math itself: it's fast, even in JavaScript
- Vercel cold starts: serverless region in London makes this irrelevant for UK users
- React rendering: streaming text is cheap
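The sequential-embedding item in the bottleneck list could be tackled with a small concurrency-limited mapper. The sketch below is one possible shape, not code from the upload route: `toBatches`, `mapWithConcurrency`, and `embedAll` are hypothetical names, `embedBatch` stands in for whatever function actually calls OpenAI, and the batch size and concurrency are placeholder values to tune against your rate limits.

```typescript
// Split items into fixed-size batches.
function toBatches<T>(items: T[], size: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Run `fn` over items with at most `limit` calls in flight, preserving input order.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++; // safe: JavaScript runs this synchronously between awaits
      results[i] = await fn(items[i], i);
    }
  }
  const workers = Array.from({ length: Math.min(limit, items.length) }, () => worker());
  await Promise.all(workers);
  return results;
}

// `embedBatch` is an assumed callback that embeds one batch of texts.
async function embedAll(
  chunks: string[],
  embedBatch: (texts: string[]) => Promise<number[][]>,
  batchSize = 100,
  concurrency = 4,
): Promise<number[][]> {
  const perBatch = await mapWithConcurrency(toBatches(chunks, batchSize), concurrency, embedBatch);
  return ([] as number[][]).concat(...perBatch);
}
```

A fixed worker pool keeps the number of in-flight requests bounded, which matters more than raw parallelism once provider rate limits come into play.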