Cost and Performance
Real numbers, not vibes.
For a typical 50-page PDF (around 100k characters, 100 chunks):
| Step | What | Cost |
|---|---|---|
| Embed question | text-embedding-3-small, ~10 tokens | £0.0000002 |
| Hybrid retrieve | in-memory cosine plus BM25, no API call | £0 |
| Rerank candidate pool | gpt-4o-mini, ~20 short passages | £0.0004 |
| Generate answer | gpt-4o-mini, ~6k input + 300 output tokens | £0.0009 |
| Total | | ~£0.0015 |
A thousand questions on a 50-page PDF: roughly £1 to £2. Drop the reranker to remove one call if you are cost-sensitive.
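As a quick check on the per-question arithmetic, the sketch below sums the step costs from the table. The step names and the `costPerQuestionGbp` helper are illustrative and not part of the project. By these figures the listed steps add up to about £0.0013, a little under the rounded ~£0.0015 total.

```typescript
interface CostStep {
  step: string;
  costGbp: number;
}

// Per-step costs in GBP, copied from the table above.
const questionSteps: CostStep[] = [
  { step: "embedQuestion", costGbp: 0.0000002 },
  { step: "hybridRetrieve", costGbp: 0 },
  { step: "rerank", costGbp: 0.0004 },
  { step: "generateAnswer", costGbp: 0.0009 },
];

// Sum the steps, optionally skipping some (for example the reranker).
function costPerQuestionGbp(skip: string[] = []): number {
  let total = 0;
  for (const s of questionSteps) {
    if (skip.indexOf(s.step) === -1) total += s.costGbp;
  }
  return total;
}

const withRerank = costPerQuestionGbp();              // about 0.0013
const withoutRerank = costPerQuestionGbp(["rerank"]); // about 0.0009
const perThousandQuestions = withRerank * 1000;       // about 1.30
```

Skipping the reranker removes roughly 30% of the per-question cost under these numbers.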
| PDF size | Chunks | Embedding cost |
|---|---|---|
| 5 pages | 10 | £0.0002 |
| 50 pages | 100 | £0.002 |
| 500 pages | 1000 | £0.02 |
text-embedding-3-small is £0.000016 per 1k tokens. Indexing a 500-page PDF costs about 2p.
End-to-end question latency (warm Vercel function, UK):
| Step | Time |
|---|---|
| Network round trip to Vercel | 30-80ms |
| Embed question (OpenAI) | 100-200ms |
| Hybrid search (in-memory cosine plus BM25, 100 chunks) | 2-5ms |
| Rerank candidate pool (gpt-4o-mini) | 300-600ms |
| First token from gpt-4o-mini | 300-700ms |
| Stream rest of ~300 token answer | 800-1500ms |
| Total time-to-first-token | ~900ms to 1.4s |
| Total time-to-completion | ~2.5 to 4 seconds |
The reranking call is the new latency cost over a dense-only pipeline. It buys precision; drop it if time-to-first-token matters more than answer quality for your use case.
Time-to-first-token is what the user perceives. Streaming makes the wait feel responsive even when total time is 3 seconds.
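To see how the time-to-first-token figure is built up, this sketch adds the best-case and worst-case values of every step that runs before the first token. The labels and the `sumLatency` helper are illustrative names, not project code. The resulting band brackets the typical ~900ms to 1.4s quoted in the table.

```typescript
interface LatencyRange {
  label: string;
  minMs: number;
  maxMs: number;
}

// Steps that run before the first token arrives, from the latency table.
const preTokenSteps: LatencyRange[] = [
  { label: "network", minMs: 30, maxMs: 80 },
  { label: "embed", minMs: 100, maxMs: 200 },
  { label: "search", minMs: 2, maxMs: 5 },
  { label: "rerank", minMs: 300, maxMs: 600 },
  { label: "firstToken", minMs: 300, maxMs: 700 },
];

// Add up the lower and upper bounds independently.
function sumLatency(ranges: LatencyRange[]): { minMs: number; maxMs: number } {
  let minMs = 0;
  let maxMs = 0;
  for (const r of ranges) {
    minMs += r.minMs;
    maxMs += r.maxMs;
  }
  return { minMs, maxMs };
}

const withRerankMs = sumLatency(preTokenSteps);
// → { minMs: 732, maxMs: 1585 }
const withoutRerankMs = sumLatency(preTokenSteps.filter((s) => s.label !== "rerank"));
// → { minMs: 432, maxMs: 985 }
```

Dropping the reranker takes 300-600ms off time-to-first-token, which is the trade-off described above.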
Cosine similarity is O(N · D) where N is chunks and D is embedding dimensions (1536).
| Chunks | Search time |
|---|---|
| 100 | 2ms |
| 1,000 | 12ms |
| 10,000 | 60ms |
| 100,000 | 600ms |
Above ~10k chunks, the in-memory store starts becoming the bottleneck. Switch to pgvector (HNSW index) where search stays under 25ms even at 1M chunks.
See Swap to pgvector.
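For reference, a brute-force cosine scan of the kind whose cost is O(N · D) can be sketched as follows. The `Chunk` shape and the `topK` helper are assumptions for illustration; the project's actual store may be organised differently.

```typescript
// Hypothetical chunk shape; the real store may differ.
interface Chunk {
  content: string;
  embedding: Float32Array; // D = 1536 for text-embedding-3-small
}

// Cosine similarity between two equal-length vectors: O(D).
function cosine(a: Float32Array, b: Float32Array): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Score every chunk against the query, then keep the k best: O(N · D) plus a sort.
function topK(
  query: Float32Array,
  chunks: Chunk[],
  k: number,
): { chunk: Chunk; score: number }[] {
  return chunks
    .map((chunk) => ({ chunk, score: cosine(query, chunk.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

Every query touches every vector, which is why search time in the table grows roughly linearly with chunk count. An HNSW index avoids the full scan by visiting only a small neighbourhood of the graph.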
In-memory store:
- Each chunk: 1000 bytes content + 1536 × 4 bytes (Float32 array) = ~7KB total
- 1,000 chunks: ~7MB
- 10,000 chunks: ~70MB
Vercel serverless functions get 1024MB by default. You can fit ~150k chunks before memory pressure. But cosine search on 150k will take 1+ seconds, so you'd hit the latency wall first.
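The memory figures above follow from simple arithmetic, sketched here. The constants mirror the numbers in the list; the helper names are illustrative, and the estimate ignores JavaScript object overhead and assumes roughly one byte per character of content.

```typescript
const EMBEDDING_DIMS = 1536;
const BYTES_PER_FLOAT32 = 4;
const CONTENT_BYTES = 1000; // assumes ~1000 single-byte characters per chunk

// Raw payload per chunk: text plus one Float32 embedding.
function bytesPerChunk(): number {
  return CONTENT_BYTES + EMBEDDING_DIMS * BYTES_PER_FLOAT32; // 7144
}

// Approximate store size in MB for a given chunk count.
function storeSizeMb(chunkCount: number): number {
  return (chunkCount * bytesPerChunk()) / (1024 * 1024);
}

// Upper bound on chunks that fit in a memory budget, ignoring runtime overhead.
function maxChunks(memoryMb: number): number {
  return Math.floor((memoryMb * 1024 * 1024) / bytesPerChunk());
}

const tenThousandChunksMb = storeSizeMb(10000); // about 68 MB
const chunksIn1024Mb = maxChunks(1024);         // about 150k
```

In practice the Node runtime and the rest of the function also need memory, so the usable ceiling is lower than this upper bound.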
100 daily users, 10 questions each, 50-page reference PDF:
- 1,000 questions/day × £0.001 = £1/day
- ≈ £30/month in OpenAI costs
If you embed 10 new PDFs per day, that's about £0.02/day in indexing. Trivial.
Vercel free tier: 100k function invocations/month. That's 3,300/day, or roughly 1 question every 25 seconds on average.
Each question is one upload-or-chat invocation, so 100k = 100k questions per month.
Above this: Vercel Pro at $20/month for 10x the limits, or self-host on a VPS.
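Both estimates above, the monthly OpenAI bill and the free-tier request rate, can be reproduced with a few lines. This is a sketch using the document's own inputs; the function names and the 30-day month are simplifying assumptions.

```typescript
const DAYS_PER_MONTH = 30; // simplifying assumption

// Monthly OpenAI spend for a steady daily question volume.
function monthlyOpenAiCostGbp(
  users: number,
  questionsPerUser: number,
  costPerQuestionGbp: number,
): number {
  return users * questionsPerUser * costPerQuestionGbp * DAYS_PER_MONTH;
}

// Average spacing between requests if a monthly invocation quota is spread evenly.
function secondsBetweenInvocations(invocationsPerMonth: number): number {
  const perDay = invocationsPerMonth / DAYS_PER_MONTH;
  return (24 * 60 * 60) / perDay;
}

const openAiPerMonthGbp = monthlyOpenAiCostGbp(100, 10, 0.001); // about 30
const freeTierSpacingSec = secondsBetweenInvocations(100000);   // about 26
```

At 1,000 questions a day the scenario uses about 30k invocations a month, well inside the 100k free-tier allowance.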
Reference setup: single Hetzner CCX13 (2 vCPU, 8GB) at €13/month.
- Cold start to first request: ~800ms
- Memory headroom: 7GB free after Node, can hold ~1M chunks comfortably
- Concurrent queries: 50-80 before tail latency degrades
- Monthly cost: €13 + OpenAI usage
Cheaper than Vercel Pro at any scale, but you handle the ops.
In rough order of impact:
- OpenAI cold-start latency variability: sometimes 100ms, sometimes 700ms for the same call. Out of your control.
- PDF parsing on huge files: pdf-parse is single-threaded. A 500-page PDF takes 2-3 seconds to parse before chunking even starts.
- In-memory cosine at 10k+ chunks: fix by swapping to pgvector.
- Sequential embedding: the upload route embeds chunks one batch at a time. Could parallelise across multiple OpenAI requests for big PDFs. Not implemented to keep the code simple.
- Stream parsing in browser: minor, the SSE reader has a small per-chunk overhead.
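The sequential-embedding item could be addressed with a small worker pool, sketched below. This is not what the upload route does today. `EmbedBatch` is a hypothetical wrapper around one embeddings request, injected so the sketch stays independent of any particular client library, and the default concurrency of 4 is an arbitrary assumption.

```typescript
// Hypothetical: wraps a single embeddings request for one batch of texts.
type EmbedBatch = (texts: string[]) => Promise<number[][]>;

async function embedAllParallel(
  batches: string[][],
  embedBatch: EmbedBatch,
  concurrency: number = 4,
): Promise<number[][]> {
  const results: number[][][] = new Array(batches.length);
  let next = 0;

  // Each worker claims the next unprocessed batch until none remain.
  // Claiming is safe because JavaScript runs one task at a time between awaits.
  async function worker(): Promise<void> {
    while (next < batches.length) {
      const index = next++;
      results[index] = await embedBatch(batches[index]);
    }
  }

  const workers: Promise<void>[] = [];
  const workerCount = Math.min(concurrency, batches.length);
  for (let i = 0; i < workerCount; i++) {
    workers.push(worker());
  }
  await Promise.all(workers);

  // Flatten in original batch order so embeddings line up with their chunks.
  const flat: number[][] = [];
  for (const batch of results) {
    for (const vector of batch) flat.push(vector);
  }
  return flat;
}
```

Capping concurrency keeps the request rate bounded, which matters if the account has tight rate limits. A production version would also want retries on failed batches.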
- Cosine math itself: it's fast, even in JavaScript
- Vercel cold starts: serverless region in London makes this irrelevant for UK users
- React rendering: streaming text is cheap