# Interpres

AI-powered transcription and translation of medieval Latin manuscripts.
Interpres is a full-stack web application that combines optical character recognition (OCR), post-OCR error correction, retrieval-augmented generation (RAG), and large language models (LLMs) into a configurable pipeline for analyzing historical manuscripts. Users upload manuscript images, select their desired models, and receive transcribed Latin text alongside an English translation enriched with historical context.
This project was developed as part of a BSc thesis at Eötvös Loránd University (ELTE).
## Table of Contents

- Architecture Overview
- Technology Stack
- Prerequisites
- External Services Setup
- Local Development
- Production Deployment
- Testing
- Project Structure
- Environment Variables Reference
- License
## Architecture Overview

Interpres follows a three-tier architecture with a clear separation between the frontend, backend API, and GPU-accelerated ML services.
Request flow: A user uploads a manuscript image through the frontend. The backend orchestrates a multi-step pipeline: (1) OCR extracts Latin text from the image, (2) an optional correction model fixes OCR errors, (3) RAG retrieves relevant context from a Latin dictionary and parallel corpora, (4) a reranking model scores retrieval results, and (5) an LLM generates the final translation using all gathered context.
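The five steps above can be sketched as a chain of pluggable stages. This is a minimal illustration with stub callables, not the backend's actual classes (those live under `src/backend/services/`); every name below is hypothetical:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class PipelineConfig:
    ocr: Callable[[bytes], str]                    # image -> Latin text
    correct: Optional[Callable[[str], str]]        # optional post-OCR correction
    retrieve: Callable[[str], list[str]]           # RAG retrieval
    rerank: Callable[[str, list[str]], list[str]]  # rerank retrieved context
    translate: Callable[[str, list[str]], str]     # LLM translation

def run_pipeline(image: bytes, cfg: PipelineConfig) -> str:
    latin = cfg.ocr(image)                         # (1) OCR
    if cfg.correct:                                # (2) optional correction
        latin = cfg.correct(latin)
    docs = cfg.retrieve(latin)                     # (3) RAG retrieval
    docs = cfg.rerank(latin, docs)[:5]             # (4) keep top-ranked context
    return cfg.translate(latin, docs)              # (5) LLM translation

# Stub usage with dummy stages:
cfg = PipelineConfig(
    ocr=lambda img: "in principio erat verbum",
    correct=None,
    retrieve=lambda text: ["verbum: word, speech"],
    rerank=lambda text, docs: docs,
    translate=lambda text, docs: f"In the beginning was the Word. [context: {len(docs)}]",
)
print(run_pipeline(b"...", cfg))  # -> In the beginning was the Word. [context: 1]
```

Because every stage is injected through the config, individual steps (e.g., correction) can be swapped or skipped without touching the orchestration itself.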
## Technology Stack

| Layer | Technology | Purpose |
|---|---|---|
| Frontend | Next.js 16, React 19, TypeScript, Tailwind CSS v4 | User interface and server-side rendering |
| Backend | FastAPI, SQLAlchemy, Pydantic, Uvicorn | REST API, authentication, pipeline orchestration |
| Database | PostgreSQL 15 | User accounts, translations, API keys, prompts |
| Vector Store | LanceDB | RAG embeddings for Latin dictionary and parallel corpora |
| ML Services | Modal.com (serverless GPU) | OCR, correction, embedding, reranking |
| Auth | Google OAuth 2.0, JWT | User authentication |
| File Storage | Local filesystem (dev) / Vercel Blob (prod) | Manuscript image storage |
| CI/CD | GitHub Actions | Automated linting and testing |
| Hosting | Railway (backend), Vercel (frontend) | Production deployment |
## Prerequisites

Ensure the following tools are installed on your development machine:
- Docker and Docker Compose (v2+)
- Python 3.12+ with uv package manager
- Node.js 20+ with npm
- Git
You will also need accounts for the following external services (detailed setup in the next section):
- Google Cloud Platform (OAuth credentials and optionally GCS for production LanceDB)
- Modal.com (serverless GPU for ML models)
- Vercel (frontend hosting, production only)
- Railway (backend hosting, production only)
## External Services Setup

This section walks through configuring each external service required by Interpres. For local development, only Google OAuth and Modal.com are mandatory.
### 1. Google OAuth 2.0

Interpres uses Google OAuth for user login. You need to create OAuth 2.0 credentials in the Google Cloud Console.
1. Go to the [Google Cloud Console](https://console.cloud.google.com/).
2. Create a new project (or select an existing one).
3. Navigate to APIs & Services > Credentials.
4. Click Create Credentials > OAuth client ID.
5. Select Web application as the application type.
6. Set the Authorized redirect URIs to:
   - `http://localhost:8000/auth/google/callback` (for local development)
   - `https://<your-backend-domain>/auth/google/callback` (for production)
7. Click Create and note the Client ID and Client Secret.
8. Set these in your `.env` file:

   ```env
   GOOGLE_CLIENT_ID=<your-client-id>
   GOOGLE_CLIENT_SECRET=<your-client-secret>
   ```
### 2. Modal.com

The ML pipeline (OCR, correction, embedding, reranking) runs on Modal.com as serverless GPU functions.
1. Create an account at [modal.com](https://modal.com).
2. Install the Modal CLI and authenticate:

   ```bash
   uv add modal
   modal token new
   ```

3. Note the token ID and secret printed by the CLI. Add them to your `.env`:

   ```env
   MODAL_TOKEN_ID=ak-...
   MODAL_TOKEN_SECRET=as-...
   MODAL_APP_NAME=interpres-backend
   ```

4. Deploy the ML services to Modal:

   ```bash
   ./deploy_modal.sh
   ```

   This runs `uv run modal deploy src/modal.com/app.py` and deploys the following services:

   - PaddleService -- PaddleOCR for text detection
   - TrOCRTridisService -- TrOCR fine-tuned on the Tridis dataset
   - TrOCRMedievalService -- TrOCR fine-tuned on medieval manuscripts
   - QwenService -- Qwen2-VL vision-language model for OCR
   - EmbeddingService -- Sentence-transformer for RAG embeddings
   - RerankingService -- Cross-encoder for reranking retrieval results
   - CorrectionService -- ByT5-based post-OCR error correction
### 3. Hugging Face Token

Some ML models require a Hugging Face token for downloading weights.
1. Create an account at [huggingface.co](https://huggingface.co).
2. Go to Settings > Access Tokens and create a new token with `read` permissions.
3. Add it to your `.env`:

   ```env
   HF_TOKEN=hf_...
   ```
### 4. Google Cloud Storage (Production)

In production, LanceDB vectors are stored on Google Cloud Storage instead of the local filesystem. This step is only required for production deployment.
1. Create a GCS bucket (e.g., `interpres-lancedb`).
2. Create a Service Account with Storage Object Admin permissions on the bucket.
3. Download the service account JSON key file.
4. For local development with GCS, place the file at the project root as `google_credentials.json`.
5. For production (Railway), set the entire JSON as an environment variable:

   ```env
   GOOGLE_APPLICATION_CREDENTIALS_JSON={"type": "service_account", ...}
   ```

   The backend automatically parses this JSON string into a temporary file at startup because the LanceDB Rust core requires a physical file path for GCS authentication.
### 5. Vercel Blob (Production)

In production, uploaded manuscript images are stored in Vercel Blob instead of the local filesystem. This is only required for production deployment; local development uses a Docker volume.
1. In your Vercel project dashboard, go to Storage > Blob.
2. Create a new Blob store.
3. Copy the read-write token and add it to your production environment:

   ```env
   BLOB_READ_WRITE_TOKEN=vercel_blob_rw_...
   ```
## Local Development

### 1. Clone the Repository

```bash
git clone https://github.com/whynotkimhari/Interpres.git
cd Interpres
```

### 2. Configure Environment Variables

Copy the local development environment template and fill in your credentials:

```bash
cp .env.local.example .env
```

Edit `.env` and set the following values:
```env
# --- Database Configuration ---
DATABASE_URL=postgresql://interpres:interpres_secret@postgres:5432/interpres
LANCEDB_DOWNLOAD_URL=https://drive.google.com/file/d/1vCbxUqTRUVlo0gYJjgZDfWx-P9YF2JZ1/view?usp=share_link
LANCEDB_PATH=/lancedb
UPLOADS_PATH=/uploads

# --- Frontend Configuration ---
NEXT_PUBLIC_API_URL=http://localhost:8000

# --- Backend Configuration ---
FRONTEND_URL=http://localhost:3000
JWT_SECRET_KEY=<generate-a-random-string>
SESSION_SECRET=<generate-a-random-string>
ENCRYPTION_KEY=<generate-a-fernet-key>

# --- Third Party Services ---
MODAL_TOKEN_ID=ak-...
MODAL_TOKEN_SECRET=as-...
MODAL_APP_NAME=interpres-backend
HF_TOKEN=hf_...
GOOGLE_CLIENT_ID=<your-google-client-id>
GOOGLE_CLIENT_SECRET=<your-google-client-secret>
```

To generate the required keys:
```bash
# Generate a random secret (for SESSION_SECRET and JWT_SECRET_KEY)
openssl rand -hex 64

# Generate a Fernet encryption key (for ENCRYPTION_KEY)
python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())"
```

### 3. Prepare LanceDB Vector Data

The RAG system requires pre-built vector embeddings stored in LanceDB. You have two options:
**Option A: Download Pre-built Vectors (Recommended)**

Set the `LANCEDB_DOWNLOAD_URL` in your `.env` to point to the pre-built archive. The backend Docker container will automatically download and extract it on first startup via the entrypoint script.

```env
LANCEDB_DOWNLOAD_URL=https://drive.google.com/file/d/1vCbxUqTRUVlo0gYJjgZDfWx-P9YF2JZ1/view?usp=share_link
```

**Option B: Build Vectors from Scratch**
If you prefer to generate the embeddings yourself (or need to update the vector data), use the ingestion script. This requires Modal services to be deployed first (see Section 2 of External Services Setup):

```bash
uv run python src/scripts/ingest.py
```

This script:

- Loads the `wnkh/IRC` dataset from Hugging Face (Latin dictionary).
- Loads the `grosenthal/latin_english_parallel` and `magistermilitum/tridis_translate_la_en` parallel corpora.
- Generates embeddings via the Modal EmbeddingService.
- Stores everything in a local LanceDB directory.
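Conceptually, the ingestion loop normalizes each row and groups rows into batches before handing them to the embedding service. The sketch below is illustrative only; the field names, batch size, and helper function are assumptions, not the script's actual schema:

```python
from typing import Iterable, Iterator

def build_batches(entries: Iterable[dict], batch_size: int = 32) -> Iterator[list[dict]]:
    """Group dataset rows into batches ready for an embedding service.

    `entries` is assumed to yield dicts with `latin` and `english` fields
    (hypothetical names chosen for this illustration).
    """
    batch: list[dict] = []
    for entry in entries:
        text = " ".join(entry["latin"].split())  # collapse stray whitespace
        batch.append({"text": text, "translation": entry["english"]})
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch

# Each yielded batch would then be embedded (via the Modal EmbeddingService
# in the real pipeline) and written to a LanceDB table alongside the text.
sample = [{"latin": "in  principio", "english": "in the beginning"}]
for batch in build_batches(sample):
    print(batch[0]["text"])  # -> in principio
```

Batching matters here because each call to a serverless GPU endpoint has fixed overhead; embedding 32 rows per request amortizes that cost.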
### 4. Deploy Modal Services

If you have not already done so, deploy the ML models to Modal:

```bash
./deploy_modal.sh
```

### 5. Start the Stack

Launch all services with Docker Compose:

```bash
docker compose up --build
```

This starts three containers:
| Service | URL | Description |
|---|---|---|
| Frontend | http://localhost:3000 | Next.js development server |
| Backend | http://localhost:8000 | FastAPI server with auto-reload |
| PostgreSQL | localhost:5432 | Database (data persisted in postgres_data volume) |
The backend entrypoint script handles:

- Downloading LanceDB data if `LANCEDB_DOWNLOAD_URL` is set and the directory is empty.
- Starting the Uvicorn server on the port specified by `$PORT` (defaults to 8000).
On startup, the backend also:

- Initializes the PostgreSQL schema via SQLAlchemy (`create_all`).
- Seeds global LanceDB collections (Latin dictionary and parallel corpora) if they are not already present.
### 6. Verify the Setup

1. Open http://localhost:3000 in your browser.
2. Click "Launch App" or navigate to the Scriptorium page.
3. Sign in with your Google account.
4. Upload a manuscript image and run the translation pipeline.
5. Verify the backend health endpoint at http://localhost:8000/health.
For detailed differences between local and production environments (storage backends, URLs, environment variables), see HOSTING.md.
## Production Deployment

### Backend (Railway)

1. Connect your GitHub repository to Railway.
2. Railway will detect the `railway.json` configuration, which points to `src/backend/Dockerfile`.
3. Add a PostgreSQL service in the Railway project dashboard.
4. Set the following environment variables in the Railway service settings:

   | Variable | Value |
   |---|---|
   | `DATABASE_URL` | Provided by Railway PostgreSQL service |
   | `LANCEDB_DOWNLOAD_URL` | Google Drive link to the pre-built vectors |
   | `LANCEDB_PATH` | `gs://<your-gcs-bucket>` |
   | `GOOGLE_APPLICATION_CREDENTIALS_JSON` | Full JSON of the GCS service account |
   | `GOOGLE_CLIENT_ID` | Your OAuth Client ID |
   | `GOOGLE_CLIENT_SECRET` | Your OAuth Client Secret |
   | `JWT_SECRET_KEY` | A securely generated random string |
   | `SESSION_SECRET` | A securely generated random string |
   | `ENCRYPTION_KEY` | A Fernet key (must be consistent across deploys) |
   | `FRONTEND_URL` | Your Vercel frontend URL |
   | `BLOB_READ_WRITE_TOKEN` | Your Vercel Blob token |
   | `MODAL_TOKEN_ID` | Your Modal token ID |
   | `MODAL_TOKEN_SECRET` | Your Modal token secret |
   | `MODAL_APP_NAME` | `interpres-backend` |

5. Railway will build and deploy automatically on every push to `main`.
6. Confirm the deployment by visiting `https://<your-railway-domain>/health`.
### Frontend (Vercel)

1. Connect your GitHub repository to Vercel.
2. Set the Root Directory to `src/frontend`.
3. Set the following environment variables in Vercel project settings:

   | Variable | Value |
   |---|---|
   | `NEXT_PUBLIC_API_URL` | Your Railway backend URL (e.g., `https://interpres-production.up.railway.app`) |

4. Vercel builds the Next.js app using the `standalone` output mode defined in `next.config.ts`.
5. Ensure `remotePatterns` in `next.config.ts` includes your Vercel Blob hostname for image rendering.
### ML Services (Modal)

Modal services are deployed independently of Railway and Vercel. Whenever you update the ML code in `src/modal.com/app.py`, redeploy with:

```bash
./deploy_modal.sh
```

Modal handles auto-scaling, cold starts, and GPU provisioning.
## Testing

### Backend Tests

The backend uses pytest with coverage reporting. Tests require a running PostgreSQL instance.
```bash
# Run all backend tests
uv run pytest

# Run with verbose output
uv run pytest -v

# Run a specific test file
uv run pytest src/backend/tests/test_translations.py
```

Tests are configured in `pyproject.toml` with the following settings:
- Test directory: `src/backend/tests`
- Async mode: `auto` (via `pytest-asyncio`)
- Coverage: reports on `src/backend` with term-missing output
### Frontend Tests

The frontend uses Vitest with React Testing Library.
```bash
cd src/frontend

# Run all tests
npm run test

# Run in watch mode
npm run test:watch

# Run with coverage
npm run test:coverage
```

### Linting

```bash
# Backend (Ruff)
uv run ruff check src/backend
uv run ruff format --check src/backend

# Frontend (ESLint)
cd src/frontend && npm run lint
```

### Continuous Integration

GitHub Actions runs on every push and pull request to `main`:
- Backend CI (`backend.yml`): Ruff linting, then pytest with a PostgreSQL service container.
- Frontend CI (`frontend.yml`): ESLint, TypeScript type checking, then Vitest with coverage.
- React Doctor (`react-doctor.yml`): agentic code review for React components.
All pipelines are path-filtered to only run when relevant files change.
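A path filter of this kind looks roughly like the fragment below (illustrative paths; the real filters live in the workflow files themselves):

```yaml
# Hypothetical excerpt from a backend workflow: the job only triggers
# when backend code or Python dependency files change.
on:
  push:
    branches: [main]
    paths:
      - "src/backend/**"
      - "pyproject.toml"
  pull_request:
    branches: [main]
    paths:
      - "src/backend/**"
      - "pyproject.toml"
```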
## Project Structure

```
Interpres/
|-- docker-compose.yml            # Local development orchestration
|-- pyproject.toml                # Python dependencies and tool config (uv)
|-- uv.lock                       # Locked Python dependencies
|-- railway.json                  # Railway deployment config
|-- deploy_modal.sh               # Script to deploy ML services to Modal
|-- .env.local.example            # Local development env template
|-- .env.production.example       # Production env template
|-- HOSTING.md                    # Local vs production environment guide
|
|-- docs/
|   |-- diagrams/
|   |   |-- drawio/               # Editable draw.io source files
|   |   |-- puml/                 # PlantUML diagram sources
|   |   +-- render/               # Exported PDFs and SVGs
|   +-- use_cases.md              # Detailed use case specification
|
|-- scripts/
|   +-- cleanup.sh                # Remove caches and build artifacts
|
|-- src/
|   |-- backend/
|   |   |-- Dockerfile            # Multi-stage Python 3.12 build
|   |   |-- entrypoint.sh         # LanceDB download + server startup
|   |   |-- main.py               # FastAPI application entry point
|   |   |-- db/                   # Database layer (SQLAlchemy models, CRUD)
|   |   |   |-- base.py           # Engine, sessions, schema init
|   |   |   |-- models.py         # SQLAlchemy ORM models
|   |   |   |-- api_keys.py       # Encrypted API key storage
|   |   |   |-- collections.py    # Document collections + LanceDB integration
|   |   |   |-- translations.py   # Translation history CRUD
|   |   |   |-- users.py          # User account CRUD
|   |   |   |-- jobs.py           # Background job tracking
|   |   |   +-- prompts.py        # Custom prompt templates
|   |   |-- routes/               # FastAPI route handlers
|   |   |   |-- dependencies.py   # Shared auth + pipeline config deps
|   |   |   |-- schemas.py        # Pydantic request/response models
|   |   |   |-- auth.py           # Google OAuth login/callback
|   |   |   |-- pipeline.py       # Main transcription pipeline endpoint
|   |   |   |-- translations.py   # Translation CRUD endpoints
|   |   |   |-- collections.py    # RAG collection management
|   |   |   |-- jobs.py           # Batch job endpoints
|   |   |   |-- prompts.py        # Prompt template endpoints
|   |   |   |-- user.py           # API key management endpoints
|   |   |   +-- utils.py          # Health check + image serving
|   |   |-- services/             # Business logic and ML interfaces
|   |   |   |-- interfaces.py     # Abstract base classes for Modal services
|   |   |   |-- pipeline.py       # Pipeline orchestration logic
|   |   |   |-- event_bus.py      # SSE event bus for streaming
|   |   |   |-- ocr_services.py   # OCR model wrappers
|   |   |   |-- correction_services.py # Post-OCR correction wrapper
|   |   |   |-- rag_services.py   # RAG retrieval logic
|   |   |   |-- llm_services.py   # LLM integration (Gemini, OpenAI)
|   |   |   |-- skip_service.py   # No-op service for skipped stages
|   |   |   +-- models.py         # Pipeline configuration models
|   |   |-- utils/
|   |   |   +-- text_processing.py # Text chunking and normalization
|   |   +-- tests/                # Pytest test suite (216 tests, 97% coverage)
|   |
|   |-- frontend/
|   |   |-- Dockerfile            # Multi-stage Node.js 20 build
|   |   |-- package.json          # Node.js dependencies
|   |   |-- next.config.ts        # Next.js configuration
|   |   |-- app/                  # Next.js App Router pages
|   |   |   |-- page.tsx          # Landing page
|   |   |   |-- scriptorium/      # Main transcription workspace
|   |   |   |-- library/          # RAG collection management
|   |   |   |-- history/          # Translation history browser
|   |   |   |-- jobs/             # Background job monitoring
|   |   |   |-- prompts/          # Prompt template editor
|   |   |   |-- settings/         # User settings (API keys)
|   |   |   +-- share/            # Public translation sharing
|   |   |-- components/           # Reusable React components
|   |   |-- lib/                  # Utility functions and API client
|   |   +-- tests/                # Vitest test suite
|   |
|   |-- modal.com/
|   |   +-- app.py                # Modal serverless GPU service definitions
|   |
|   +-- scripts/
|       |-- ingest.py             # LanceDB vector ingestion script
|       |-- push_to_hf.py         # Push datasets to Hugging Face
|       +-- standardize_data.py   # Data normalization utilities
|
|-- .gemini/
|   +-- config.yaml               # Agent PR reviewer config
|
+-- .github/workflows/
    |-- backend.yml               # Backend CI (lint + test)
    |-- frontend.yml              # Frontend CI (lint + type-check + test)
    +-- react-doctor.yml          # React Doctor (agentic lint + test)
```
## Environment Variables Reference

| Variable | Required | Default | Description |
|---|---|---|---|
| `DATABASE_URL` | Yes | -- | PostgreSQL connection string |
| `LANCEDB_PATH` | Yes | `/lancedb` | Path to LanceDB storage (local path or `gs://` URI) |
| `LANCEDB_DOWNLOAD_URL` | No | -- | URL to download pre-built LanceDB vectors |
| `UPLOADS_PATH` | Yes | `/uploads` | Local directory for uploaded images |
| `NEXT_PUBLIC_API_URL` | Yes | -- | Backend URL visible to the browser |
| `FRONTEND_URL` | Yes | `http://localhost:3000` | Frontend URL for CORS and OAuth redirects |
| `GOOGLE_CLIENT_ID` | Yes | -- | Google OAuth 2.0 Client ID |
| `GOOGLE_CLIENT_SECRET` | Yes | -- | Google OAuth 2.0 Client Secret |
| `JWT_SECRET_KEY` | Yes | -- | Secret for signing JWT tokens |
| `SESSION_SECRET` | Yes | -- | Secret for session middleware |
| `ENCRYPTION_KEY` | Yes | -- | Fernet key for encrypting stored API keys |
| `MODAL_TOKEN_ID` | Yes | -- | Modal.com authentication token ID |
| `MODAL_TOKEN_SECRET` | Yes | -- | Modal.com authentication token secret |
| `MODAL_APP_NAME` | Yes | `interpres-backend` | Name of the Modal app |
| `HF_TOKEN` | Yes | -- | Hugging Face token for model downloads |
| `GOOGLE_APPLICATION_CREDENTIALS` | Dev only | -- | Path to GCS service account JSON file |
| `GOOGLE_APPLICATION_CREDENTIALS_JSON` | Prod only | -- | GCS service account JSON as a string |
| `BLOB_READ_WRITE_TOKEN` | Prod only | -- | Vercel Blob read-write token |
## License

This project is licensed under the MIT License.
Copyright (c) 2026
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.