sarmakska edited this page May 31, 2026 · 5 revisions

receipt-scanner

Working vision OCR starter. Drop a photo of a receipt, get structured JSON.

Built by Sarma Linux. MIT licence. Source at github.com/sarmakska/receipt-scanner.


What this is

Upload a photo of a receipt. The app sends it to a vision-capable language model and extracts structured fields: vendor name, address, transaction date and time, itemised line items with quantity and unit price, subtotal, tax, tip, total, currency, and payment method when visible.

The app returns clean JSON, validated against a Zod schema, and renders the result as a table. Wire it to Supabase, Xero, QuickBooks, n8n, or whatever your finance stack needs. The hard part, turning a photo into validated structured data, is already solved.
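To make the output concrete, here is a minimal sketch of what one scan result could look like as a TypeScript type. The field names and the sample values are assumptions for illustration; the real contract is whatever lib/schema.ts defines.

```typescript
// Hypothetical shape of one scan result. Field names are illustrative,
// not copied from lib/schema.ts.
interface LineItem {
  description: string;
  quantity: number;
  unitPrice: number;
}

interface Receipt {
  vendor: string;
  address?: string;
  date: string; // transaction date and time as an ISO 8601 string (assumed format)
  lineItems: LineItem[];
  subtotal: number;
  tax: number;
  tip?: number;
  total: number;
  currency: string; // e.g. "GBP"
  paymentMethod?: string; // present only when visible on the receipt
}

// Made-up sample data, for illustration only.
const example: Receipt = {
  vendor: "Corner Cafe",
  address: "1 High Street, London",
  date: "2026-05-31T12:45:00",
  lineItems: [
    { description: "Flat white", quantity: 2, unitPrice: 3.2 },
    { description: "Croissant", quantity: 1, unitPrice: 2.6 },
  ],
  subtotal: 9.0,
  tax: 1.8,
  total: 10.8,
  currency: "GBP",
  paymentMethod: "card",
};

// Sum of quantity * unitPrice, rounded to two decimals to avoid float drift.
function lineItemsTotal(items: LineItem[]): number {
  const sum = items.reduce((acc, item) => acc + item.quantity * item.unitPrice, 0);
  return Math.round(sum * 100) / 100;
}
```

A helper like lineItemsTotal is handy as a sanity check after a scan: if the line items do not add up to the subtotal, the receipt probably deserves a human look.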

Who this is for

  • Small business teams replacing manual receipt entry.
  • Builders prototyping an AI expense or bookkeeping product.
  • Engineers who want to understand how vision models work end to end.

Architecture

This is a single-process Next.js 14 application. There is no separate worker, queue, or database in the default build. The whole pipeline runs server-side inside one API route, which keeps the API key off the client and makes the cost surface easy to reason about.

flowchart TD
    A[Browser: app/page.tsx] -->|multipart upload| B[app/api/scan/route.ts]
    B --> C[sharp: resize, re-encode, auto-rotate]
    C --> D[lib/vision.ts: single vision API call]
    D --> E[lib/schema.ts: Zod parse + validate]
    E -->|valid| F[lib/persist.ts: save stub]
    E -->|valid| G[JSON response back to UI table]
    E -->|invalid| H[400 with validation error]

The request lifecycle, step by step:

  1. Upload. app/page.tsx posts the file as multipart form data to /api/scan. No client-side processing, so the browser never holds an API key.
  2. Pre-process. sharp reads the bytes, corrects EXIF orientation, downscales the longest edge to MAX_IMAGE_PX (default 1568), and re-encodes to JPEG. This is the single biggest cost lever: a 12MP phone photo becomes a fraction of the input tokens with no measurable accuracy loss on receipts.
  3. Vision call. lib/vision.ts sends the base64 image plus a JSON-only system prompt to the model. This is one function and one network call. Everything provider-specific lives here.
  4. Validate. The raw model output is parsed by the Zod schema in lib/schema.ts. Malformed output is rejected at this boundary, so a hallucinated or truncated response never reaches your database or your UI.
  5. Persist and respond. Valid receipts pass through lib/persist.ts (a no-op stub by default) and are returned to the UI, which renders them as a table.
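The five steps above can be sketched as one orchestrating function. This is a sketch under assumptions, not the code in app/api/scan/route.ts: the names runScan, ScanSteps, and ParseOutcome are hypothetical, and each step is injected so the flow can be read (and tested) without sharp, Zod, or a real model.

```typescript
// Result of the schema boundary: either a validated value or an error message.
type ParseOutcome<T> =
  | { kind: "valid"; data: T }
  | { kind: "invalid"; error: string };

// Hypothetical step signatures mirroring the lifecycle described above.
interface ScanSteps<T> {
  preprocess: (bytes: Uint8Array) => Promise<Uint8Array>; // step 2: rotate, resize, re-encode
  callVision: (jpeg: Uint8Array) => Promise<string>; // step 3: raw model text
  validate: (raw: string) => ParseOutcome<T>; // step 4: schema boundary
  persist: (receipt: T) => Promise<void>; // step 5: save stub
}

interface ScanResponse<T> {
  status: 200 | 400;
  body: T | { error: string };
}

async function runScan<T>(
  upload: Uint8Array,
  steps: ScanSteps<T>
): Promise<ScanResponse<T>> {
  const jpeg = await steps.preprocess(upload);
  const raw = await steps.callVision(jpeg);
  const parsed = steps.validate(raw);
  if (parsed.kind === "invalid") {
    // Malformed model output stops here and never reaches persist().
    return { status: 400, body: { error: parsed.error } };
  }
  await steps.persist(parsed.data);
  return { status: 200, body: parsed.data };
}
```

Injecting the steps is a design choice for the sketch: it keeps the ordering (validate before persist) visible in one place and lets you swap any single step without touching the others.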

Why a strict schema boundary

The schema is the contract. The model is asked for JSON, but models occasionally return prose, partial objects, or wrong types. Rather than trust the output, every scan must pass the schema's parse() cleanly before anything downstream sees it. This is why swapping the model (from Claude to gpt-4o, or to a local Llama) requires no changes outside lib/vision.ts: the rest of the app only ever sees a validated Receipt.
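The boundary idea can be shown without any library. The sketch below is a hand-written stand-in for the Zod schema, not the project's actual validation: it checks only three assumed fields (vendor, total, currency) and returns a tagged result rather than throwing, similar in spirit to a safe-parse style API.

```typescript
// A deliberately small subset of a receipt; field names are assumptions.
interface MinimalReceipt {
  vendor: string;
  total: number;
  currency: string;
}

type ParseOutcome =
  | { kind: "valid"; data: MinimalReceipt }
  | { kind: "invalid"; error: string };

function invalid(error: string): ParseOutcome {
  return { kind: "invalid", error };
}

// Accepts raw model text and only lets a well-typed object through.
function parseModelOutput(raw: string): ParseOutcome {
  let value: unknown;
  try {
    value = JSON.parse(raw);
  } catch {
    return invalid("model output is not valid JSON");
  }
  if (typeof value !== "object" || value === null || Array.isArray(value)) {
    return invalid("expected a JSON object");
  }
  const { vendor, total, currency } = value as Record<string, unknown>;
  if (typeof vendor !== "string" || vendor.length === 0) {
    return invalid("vendor must be a non-empty string");
  }
  if (typeof total !== "number" || !Number.isFinite(total)) {
    return invalid("total must be a finite number");
  }
  if (typeof currency !== "string") {
    return invalid("currency must be a string");
  }
  return { kind: "valid", data: { vendor, total, currency } };
}
```

Prose replies, truncated JSON, and a total returned as the string "12.50" all come back as invalid, which is the behaviour the 400 path in the route relies on.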

Component map

| File | Responsibility |
| --- | --- |
| app/page.tsx | Upload UI, renders the parsed receipt table |
| app/api/scan/route.ts | Orchestrates the pipeline, returns validated JSON or a 400 |
| lib/vision.ts | The single vision API call. The only provider-specific code |
| lib/schema.ts | The Zod contract. Receipt and lineItem types |
| lib/persist.ts | save() stub. Replace with a Supabase insert or webhook |
| docs/schema.sql | Postgres / Supabase tables that mirror the Zod contract |
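For a sense of what the provider-specific part in lib/vision.ts has to produce, here is a sketch of a request body in the shape of the Anthropic Messages API (a base64 image block plus a text block, with a system prompt). The function name buildVisionRequest, the prompt wording, and the max_tokens value are assumptions; only the model name comes from the Stack section.

```typescript
// Request body shape for a single-image vision call (Anthropic Messages API style).
interface VisionRequestBody {
  model: string;
  max_tokens: number;
  system: string;
  messages: Array<{
    role: "user";
    content: Array<
      | { type: "image"; source: { type: "base64"; media_type: "image/jpeg"; data: string } }
      | { type: "text"; text: string }
    >;
  }>;
}

// Hypothetical JSON-only system prompt; the real wording lives in lib/vision.ts.
const SYSTEM_PROMPT =
  "Extract the receipt as a single JSON object. Reply with JSON only, no prose.";

function buildVisionRequest(
  jpegBase64: string,
  model: string = "claude-3-5-sonnet-latest"
): VisionRequestBody {
  return {
    model,
    max_tokens: 1024, // assumed ceiling; a long itemised receipt may need more
    system: SYSTEM_PROMPT,
    messages: [
      {
        role: "user",
        content: [
          {
            type: "image",
            source: { type: "base64", media_type: "image/jpeg", data: jpegBase64 },
          },
          { type: "text", text: "Extract the fields from this receipt." },
        ],
      },
    ],
  };
}
```

Because step 2 always re-encodes to JPEG, the media type can be fixed here. Swapping providers means replacing this builder and the call that sends it, while the rest of the app keeps consuming the same validated shape.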

Real-world examples

Expense capture for a small team. Staff snap a photo on their phone, the scan returns structured fields, and you insert straight into Supabase. Wire lib/persist.ts to a single insert against the tables in docs/schema.sql. See Wire-to-Database.
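A sketch of the mapping that a Supabase-backed save() might perform before inserting. The table and column names (purchased_at, unit_price, and so on) are guesses, since docs/schema.sql is not reproduced here; adjust them to the real tables.

```typescript
interface LineItem {
  description: string;
  quantity: number;
  unitPrice: number;
}

interface Receipt {
  vendor: string;
  date: string;
  total: number;
  currency: string;
  lineItems: LineItem[];
}

// Hypothetical row shapes; align them with docs/schema.sql.
interface ReceiptRow {
  id: string;
  vendor: string;
  purchased_at: string;
  total: number;
  currency: string;
}

interface LineItemRow {
  receipt_id: string;
  position: number;
  description: string;
  quantity: number;
  unit_price: number;
}

// Converts one validated receipt into a parent row plus child rows.
function toRows(
  receipt: Receipt,
  receiptId: string
): { receipt: ReceiptRow; lineItems: LineItemRow[] } {
  return {
    receipt: {
      id: receiptId,
      vendor: receipt.vendor,
      purchased_at: receipt.date,
      total: receipt.total,
      currency: receipt.currency,
    },
    lineItems: receipt.lineItems.map((item, index) => ({
      receipt_id: receiptId,
      position: index + 1, // keeps the printed order of the receipt
      description: item.description,
      quantity: item.quantity,
      unit_price: item.unitPrice,
    })),
  };
}
```

With supabase-js, the rows would then go through something like supabase.from("receipts").insert(rows.receipt), followed by an insert of rows.lineItems into the child table. Keeping the mapping as a pure function makes it easy to test without a database.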

Feeding an accounting tool. After a valid scan, POST the JSON to the Xero or QuickBooks expense API. The validated Receipt shape maps cleanly onto their expense models. The mapping lives next to your save() call.

Automation fan-out with n8n. Add a webhook target in app/api/scan/route.ts and POST every validated receipt to an n8n workflow. From there you can branch on vendor, route for approval, or push to a spreadsheet without touching this codebase again.
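A sketch of the webhook fan-out, assuming the n8n workflow starts with a Webhook node that accepts a JSON POST. The helper name buildWebhookRequest and the payload envelope (source, receipt) are hypothetical.

```typescript
interface WebhookRequest {
  url: string;
  init: {
    method: "POST";
    headers: Record<string, string>;
    body: string;
  };
}

// Builds the request for one validated receipt without sending it.
function buildWebhookRequest(webhookUrl: string, receipt: unknown): WebhookRequest {
  return {
    url: webhookUrl,
    init: {
      method: "POST",
      headers: { "content-type": "application/json" },
      // The envelope is an assumption; n8n receives whatever JSON is posted.
      body: JSON.stringify({ source: "receipt-scanner", receipt }),
    },
  };
}
```

In the route, after validation succeeds, this could be sent with `const req = buildWebhookRequest(url, receipt); await fetch(req.url, req.init);`, where url comes from an environment variable of your choosing. Building the request separately from sending it keeps the payload easy to inspect and test.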

Provider benchmarking. Want to compare Claude against gpt-4o on your own receipts? Replace the body of lib/vision.ts, keep the same JSON contract, and the UI and validation stay identical. See Vision-Models.
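When comparing two providers on the same receipt, a field-by-field diff of the validated outputs is often enough to start with. The sketch below is one hypothetical way to do it for top-level scalar fields; it is not part of the repository and it ignores line items.

```typescript
// Top-level scalar fields only; nested line items would need their own comparison.
type FlatReceipt = Record<string, string | number | undefined>;

// Returns the sorted names of fields whose values differ between two outputs.
function diffFields(a: FlatReceipt, b: FlatReceipt): string[] {
  const keys = Array.from(new Set(Object.keys(a).concat(Object.keys(b))));
  return keys.filter((key) => a[key] !== b[key]).sort();
}
```

Running both providers over a folder of your own receipts and counting how often each field differs gives a rough, receipt-specific picture of where the models disagree.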

Token cost reference (typical UK till receipt, 1568px max)

| Cost element | Approx cost (Claude 3.5 Sonnet) |
| --- | --- |
| Image input + system prompt | ~£0.006 |
| Output JSON | ~£0.008 |
| Per scan | ~£0.013 |

Resizing in step 2 is what keeps this number small. Disabling the downscale roughly quadruples the input token cost on a full-resolution phone photo.
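For budgeting, the per-scan figure scales linearly with volume. A small sketch of that arithmetic, using the approximate per-scan cost from the table as the default; the function name and the volumes in the usage note are illustrative.

```typescript
// Approximate per-scan cost in GBP, taken from the table above.
const PER_SCAN_GBP = 0.013;

// Projected monthly spend, rounded to whole pence.
function monthlyCostGbp(scansPerMonth: number, perScanGbp: number = PER_SCAN_GBP): number {
  return Math.round(scansPerMonth * perScanGbp * 100) / 100;
}
```

At this rate, monthlyCostGbp(2000) comes to 26, so a team scanning 2,000 receipts a month would spend roughly £26 on model calls. Pass a different second argument to model another provider or an unresized pipeline.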

Troubleshooting

Missing ANTHROPIC_API_KEY or a 401 from the model. Copy .env.example to .env.local and set a key with vision access. The key is read server-side only; it is never exposed to the browser. Restart the dev server after changing env files.
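Failing fast with a readable message is friendlier than a 401 from the provider. A minimal sketch of such a guard; requireApiKey is a hypothetical helper, and the environment is passed in as a plain object so the function stays testable.

```typescript
type Env = Record<string, string | undefined>;

// Returns the trimmed key, or throws with setup instructions if it is missing.
function requireApiKey(env: Env): string {
  const key = env["ANTHROPIC_API_KEY"];
  if (key === undefined || key.trim() === "") {
    throw new Error(
      "ANTHROPIC_API_KEY is not set. Copy .env.example to .env.local, " +
        "add a key with vision access, and restart the dev server."
    );
  }
  return key.trim();
}
```

On the server this would be called as requireApiKey(process.env) before the vision request is built.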

The build fails with a sharp native binary error. sharp ships platform-specific binaries. If your package manager skipped its build script, rebuild it (for example, pnpm rebuild sharp). On serverless platforms, confirm the platform provides the native libraries; Vercel does. See Deployment.

A scan returns a 400 validation error. The model returned output that did not satisfy the Zod schema. This is the boundary doing its job. Inspect the raw response, and if the failure is systematic, tighten the system prompt in lib/vision.ts or relax the affected field in lib/schema.ts. See Edge-Cases.

HEIC photos from iPhones fail to decode. HEIC support depends on the sharp build on your platform. Locally this usually works; on serverless, verify HEIC is supported or convert to JPEG upstream.
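If you want to route HEIC uploads to an upstream converter, the file can be recognised from its first bytes. HEIC uses the ISO base media file format, where bytes 4 to 7 spell "ftyp" and bytes 8 to 11 hold the major brand. The sketch below checks a short list of common HEIC brands; the helper name and the brand list are assumptions, and some HEIF files use other brands (for example mif1), so extend the list if needed.

```typescript
// Major brands commonly seen on HEIC files; not an exhaustive list.
const HEIC_BRANDS = ["heic", "heix", "hevc", "hevx"];

// Reads bytes [start, end) as an ASCII string.
function asciiSlice(bytes: Uint8Array, start: number, end: number): string {
  let out = "";
  for (let i = start; i < end; i++) {
    out += String.fromCharCode(bytes[i]);
  }
  return out;
}

// True when the header looks like an ISO BMFF file with a HEIC major brand.
function looksLikeHeic(bytes: Uint8Array): boolean {
  if (bytes.length < 12) return false;
  return (
    asciiSlice(bytes, 4, 8) === "ftyp" &&
    HEIC_BRANDS.indexOf(asciiSlice(bytes, 8, 12)) !== -1
  );
}
```

A check like this only inspects the header; it does not confirm that the rest of the file decodes.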

Multi-page PDF receipts only read the first page. By design this handles one image per scan. Rasterise each page upstream and scan them individually. See Edge-Cases.

Blurry or low-light photos return sparse fields. The model returns what it can read. Improve capture conditions, or raise MAX_IMAGE_PX to retain more detail at the cost of more tokens. See Configuration.
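To see what raising MAX_IMAGE_PX does, here is a sketch of the dimension arithmetic behind a longest-edge downscale. The helper names are hypothetical, and the rule that images already within the limit are left alone (never upscaled) is an assumption about the resize step.

```typescript
const DEFAULT_MAX_IMAGE_PX = 1568;

// Reads MAX_IMAGE_PX from an env-like object, falling back to the default.
function readMaxImagePx(env: Record<string, string | undefined>): number {
  const parsed = Number.parseInt(env["MAX_IMAGE_PX"] ?? "", 10);
  return Number.isFinite(parsed) && parsed > 0 ? parsed : DEFAULT_MAX_IMAGE_PX;
}

// Scales so the longest edge equals maxPx, preserving aspect ratio.
function targetSize(
  width: number,
  height: number,
  maxPx: number
): { width: number; height: number } {
  const longest = Math.max(width, height);
  if (longest <= maxPx) return { width, height }; // assumed: never upscale
  const scale = maxPx / longest;
  return {
    width: Math.round(width * scale),
    height: Math.round(height * scale),
  };
}
```

A 12MP phone photo at 4032 x 3024 becomes 1568 x 1176 at the default, and 2048 x 1536 if MAX_IMAGE_PX is raised to 2048. With sharp, this behaviour roughly corresponds to a resize with fit set to "inside" and withoutEnlargement enabled.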

Stack

Next.js 14 App Router, TypeScript, Anthropic Claude vision (claude-3-5-sonnet-latest), sharp, Zod, Tailwind CSS.


Wiki pages

  • Architecture: scan flow diagram, component table, failure modes, token cost table
  • Quick-Start: clone, install, env vars, first scan
  • Vision-Models: swapping to OpenAI, local Llama, model comparison
  • Configuration: all env vars, tuning image size
  • Wire-to-Database: Supabase, Xero, QuickBooks, n8n integration paths
  • Edge-Cases: blurry images, multi-page PDFs, hand-written receipts
  • Deployment: Vercel one-click, Node runtime requirement
  • Roadmap: what is shipped and what is next
