
Interpres

AI-powered transcription and translation of medieval Latin manuscripts.

Interpres is a full-stack web application that combines optical character recognition (OCR), post-OCR error correction, retrieval-augmented generation (RAG), and large language models (LLMs) into a configurable pipeline for analyzing historical manuscripts. Users upload manuscript images, select their desired models, and receive transcribed Latin text alongside an English translation enriched with historical context.

This project was developed as part of a BSc thesis at Eötvös Loránd University (ELTE).

Table of Contents

  1. Architecture Overview
  2. Technology Stack
  3. Prerequisites
  4. External Services Setup
  5. Local Development
  6. Production Deployment
  7. Testing
  8. Project Structure
  9. Environment Variables Reference
  10. License

Architecture Overview

Interpres follows a three-tier architecture with a clear separation between the frontend, backend API, and GPU-accelerated ML services:

C4 Context Diagram

Request flow: A user uploads a manuscript image through the frontend. The backend orchestrates a multi-step pipeline: (1) OCR extracts Latin text from the image, (2) an optional correction model fixes OCR errors, (3) RAG retrieves relevant context from a Latin dictionary and parallel corpora, (4) a reranking model scores retrieval results, and (5) an LLM generates the final translation using all gathered context.
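
The five steps above can be sketched as a small orchestration function. This is an illustrative sketch with hypothetical names and stubbed stage signatures, not the repository's actual code (the real logic lives in src/backend/services/pipeline.py):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class PipelineConfig:
    use_correction: bool = True  # stage (2) is optional
    top_k: int = 5               # number of context passages kept after reranking

def run_pipeline(
    image: bytes,
    config: PipelineConfig,
    ocr: Callable[[bytes], str],
    correct: Callable[[str], str],
    retrieve: Callable[[str, int], list[str]],
    rerank: Callable[[str, list[str]], list[str]],
    translate: Callable[[str, list[str]], str],
) -> str:
    latin = ocr(image)                                    # (1) OCR
    if config.use_correction:
        latin = correct(latin)                            # (2) post-OCR correction
    candidates = retrieve(latin, config.top_k * 4)        # (3) RAG retrieval (over-fetch)
    context = rerank(latin, candidates)[: config.top_k]   # (4) reranking
    return translate(latin, context)                      # (5) LLM translation
```

Each stage is passed in as a callable, so Modal-backed services and local stubs are interchangeable in tests.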

Technology Stack

| Layer | Technology | Purpose |
| --- | --- | --- |
| Frontend | Next.js 16, React 19, TypeScript, Tailwind CSS v4 | User interface and server-side rendering |
| Backend | FastAPI, SQLAlchemy, Pydantic, Uvicorn | REST API, authentication, pipeline orchestration |
| Database | PostgreSQL 15 | User accounts, translations, API keys, prompts |
| Vector Store | LanceDB | RAG embeddings for Latin dictionary and parallel corpora |
| ML Services | Modal.com (serverless GPU) | OCR, correction, embedding, reranking |
| Auth | Google OAuth 2.0, JWT | User authentication |
| File Storage | Local filesystem (dev) / Vercel Blob (prod) | Manuscript image storage |
| CI/CD | GitHub Actions | Automated linting and testing |
| Hosting | Railway (backend), Vercel (frontend) | Production deployment |

Prerequisites

Ensure the following tools are installed on your development machine:

  • Docker and Docker Compose (v2+)
  • Python 3.12+ with uv package manager
  • Node.js 20+ with npm
  • Git

You will also need accounts for the following external services (detailed setup in the next section):

  • Google Cloud Platform (OAuth credentials and optionally GCS for production LanceDB)
  • Modal.com (serverless GPU for ML models)
  • Vercel (frontend hosting, production only)
  • Railway (backend hosting, production only)

External Services Setup

This section walks through configuring each external service required by Interpres. For local development, only Google OAuth and Modal.com are mandatory.

1. Google OAuth 2.0 (Authentication)

Interpres uses Google OAuth for user login. You need to create OAuth 2.0 credentials in the Google Cloud Console.

  1. Go to the Google Cloud Console.

  2. Create a new project (or select an existing one).

  3. Navigate to APIs & Services > Credentials.

  4. Click Create Credentials > OAuth client ID.

  5. Select Web application as the application type.

  6. Set the Authorized redirect URIs to:

    • http://localhost:8000/auth/google/callback (for local development)
    • https://<your-backend-domain>/auth/google/callback (for production)
  7. Click Create and note the Client ID and Client Secret.

  8. Set these in your .env file:

    GOOGLE_CLIENT_ID=<your-client-id>
    GOOGLE_CLIENT_SECRET=<your-client-secret>
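
For reference, the login redirect the backend issues is a standard Google OAuth 2.0 authorization URL built from these credentials. The sketch below is illustrative only (the function name is hypothetical; the actual flow is implemented in src/backend/routes/auth.py):

```python
from urllib.parse import urlencode

def google_auth_url(client_id: str, redirect_uri: str, state: str) -> str:
    """Build a standard Google OAuth 2.0 authorization URL."""
    params = {
        "client_id": client_id,
        "redirect_uri": redirect_uri,      # must match an Authorized redirect URI
        "response_type": "code",
        "scope": "openid email profile",
        "state": state,                    # CSRF protection token
    }
    return "https://accounts.google.com/o/oauth2/v2/auth?" + urlencode(params)
```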
    

2. Modal.com (GPU ML Services)

The ML pipeline (OCR, correction, embedding, reranking) runs on Modal.com as serverless GPU functions.

  1. Create an account at modal.com.

  2. Install the Modal CLI and authenticate:

    uv add modal
    uv run modal token new
  3. Note the token ID and secret printed by the CLI. Add them to your .env:

    MODAL_TOKEN_ID=ak-...
    MODAL_TOKEN_SECRET=as-...
    MODAL_APP_NAME=interpres-backend
    
  4. Deploy the ML services to Modal:

    ./deploy_modal.sh

    This runs uv run modal deploy src/modal.com/app.py and deploys the following services:

    • PaddleService -- PaddleOCR for text detection
    • TrOCRTridisService -- TrOCR fine-tuned on the Tridis dataset
    • TrOCRMedievalService -- TrOCR fine-tuned on medieval manuscripts
    • QwenService -- Qwen2-VL vision-language model for OCR
    • EmbeddingService -- Sentence-transformer for RAG embeddings
    • RerankingService -- Cross-encoder for reranking retrieval results
    • CorrectionService -- ByT5-based post-OCR error correction
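
Because all of these services expose a similar call shape, the backend can treat them uniformly behind a small interface. The following is a hypothetical sketch of that pattern (the repository's real abstractions live in src/backend/services/interfaces.py):

```python
from abc import ABC, abstractmethod

class OCRService(ABC):
    """Hypothetical common interface so pipeline code can swap OCR backends."""

    @abstractmethod
    def transcribe(self, image: bytes) -> str:
        """Return the Latin text extracted from a manuscript image."""

class StubOCRService(OCRService):
    """Local stand-in used when Modal is unavailable (e.g., in tests)."""

    def transcribe(self, image: bytes) -> str:
        return "lorem ipsum"
```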

3. Hugging Face Token (Model Access)

Some ML models require a Hugging Face token for downloading weights.

  1. Create an account at huggingface.co.

  2. Go to Settings > Access Tokens and create a new token with read permissions.

  3. Add it to your .env:

    HF_TOKEN=hf_...
    

4. Google Cloud Storage (Production LanceDB -- Optional)

In production, LanceDB vectors are stored on Google Cloud Storage instead of the local filesystem. This step is only required for production deployment.

  1. Create a GCS bucket (e.g., interpres-lancedb).

  2. Create a Service Account with Storage Object Admin permissions on the bucket.

  3. Download the service account JSON key file.

  4. For local development with GCS, place the file at the project root as google_credentials.json.

  5. For production (Railway), set the entire JSON as an environment variable:

    GOOGLE_APPLICATION_CREDENTIALS_JSON={"type": "service_account", ...}
    

    The backend automatically parses this JSON string into a temporary file at startup because the LanceDB Rust core requires a physical file path for GCS authentication.

5. Vercel Blob (Production Image Storage -- Optional)

In production, uploaded manuscript images are stored in Vercel Blob instead of the local filesystem. This is only required for production deployment; local development uses a Docker volume.

  1. In your Vercel project dashboard, go to Storage > Blob.

  2. Create a new Blob store.

  3. Copy the read-write token and add it to your production environment:

    BLOB_READ_WRITE_TOKEN=vercel_blob_rw_...
    

Local Development

Step 1: Clone the Repository

git clone https://github.com/whynotkimhari/Interpres.git
cd Interpres

Step 2: Configure Environment Variables

Copy the local development environment template and fill in your credentials:

cp .env.local.example .env

Edit .env and set the following values:

# --- Database Configuration ---
DATABASE_URL=postgresql://interpres:interpres_secret@postgres:5432/interpres
LANCEDB_DOWNLOAD_URL=https://drive.google.com/file/d/1vCbxUqTRUVlo0gYJjgZDfWx-P9YF2JZ1/view?usp=share_link
LANCEDB_PATH=/lancedb
UPLOADS_PATH=/uploads

# --- Frontend Configuration ---
NEXT_PUBLIC_API_URL=http://localhost:8000

# --- Backend Configuration ---
FRONTEND_URL=http://localhost:3000
JWT_SECRET_KEY=<generate-a-random-string>
SESSION_SECRET=<generate-a-random-string>
ENCRYPTION_KEY=<generate-a-fernet-key>

# --- Third Party Services ---
MODAL_TOKEN_ID=ak-...
MODAL_TOKEN_SECRET=as-...
MODAL_APP_NAME=interpres-backend
HF_TOKEN=hf_...
GOOGLE_CLIENT_ID=<your-google-client-id>
GOOGLE_CLIENT_SECRET=<your-google-client-secret>

To generate the required keys:

# Generate a random secret (for SESSION_SECRET and JWT_SECRET_KEY)
openssl rand -hex 64

# Generate a Fernet encryption key (for ENCRYPTION_KEY)
python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())"
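
If you prefer to stay in Python, the standard library can generate both kinds of secret. A Fernet key is just 32 random bytes, URL-safe base64 encoded, so the second line below produces a valid key without the cryptography package:

```python
import base64
import secrets

# Random hex secret, equivalent to `openssl rand -hex 64` (128 hex characters)
jwt_secret = secrets.token_hex(64)

# A Fernet key is 32 random bytes, URL-safe base64 encoded (44 characters)
fernet_key = base64.urlsafe_b64encode(secrets.token_bytes(32)).decode()

print(jwt_secret)
print(fernet_key)
```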

Step 3: Set Up the LanceDB Vector Database

The RAG system requires pre-built vector embeddings stored in LanceDB. You have two options:

Option A: Download Pre-built Vectors (Recommended)

Set the LANCEDB_DOWNLOAD_URL in your .env to point to the pre-built archive. The backend Docker container will automatically download and extract it on first startup via the entrypoint script.

LANCEDB_DOWNLOAD_URL=https://drive.google.com/file/d/1vCbxUqTRUVlo0gYJjgZDfWx-P9YF2JZ1/view?usp=share_link

Option B: Build Vectors from Scratch

If you prefer to generate the embeddings yourself (or need to update the vector data), use the ingestion script. This requires Modal services to be deployed first (see Section 2 of External Services):

uv run python src/scripts/ingest.py

This script:

  1. Loads the wnkh/IRC dataset from Hugging Face (Latin dictionary).
  2. Loads the grosenthal/latin_english_parallel and magistermilitum/tridis_translate_la_en parallel corpora.
  3. Generates embeddings via the Modal EmbeddingService.
  4. Stores everything in a local LanceDB directory.
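
Conceptually, the ingestion flow is chunk, embed, store. The sketch below uses a stub embedder and an in-memory list standing in for the Modal EmbeddingService and LanceDB; names and chunk size are illustrative, not taken from ingest.py:

```python
def chunk(text: str, size: int = 200) -> list[str]:
    """Split text into fixed-size pieces (illustrative chunking strategy)."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def ingest(records: list[str], embed, store: list) -> int:
    """Embed each chunk of each record and append it to the vector store."""
    for text in records:
        for piece in chunk(text):
            store.append({"text": piece, "vector": embed(piece)})
    return len(store)
```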

Step 4: Deploy ML Services to Modal

If you have not already done so, deploy the ML models to Modal:

./deploy_modal.sh

Step 5: Start the Application

Launch all services with Docker Compose:

docker compose up --build

This starts three containers:

| Service | URL | Description |
| --- | --- | --- |
| Frontend | http://localhost:3000 | Next.js development server |
| Backend | http://localhost:8000 | FastAPI server with auto-reload |
| PostgreSQL | localhost:5432 | Database (data persisted in postgres_data volume) |

The backend entrypoint script handles:

  • Downloading LanceDB data if LANCEDB_DOWNLOAD_URL is set and the directory is empty.
  • Starting the Uvicorn server on the port specified by $PORT (defaults to 8000).

On startup, the backend also:

  • Initializes the PostgreSQL schema via SQLAlchemy (create_all).
  • Seeds global LanceDB collections (Latin dictionary and parallel corpora) if they are not already present.

Step 6: Verify the Setup

  1. Open http://localhost:3000 in your browser.
  2. Click "Launch App" or navigate to the Scriptorium page.
  3. Sign in with your Google account.
  4. Upload a manuscript image and run the translation pipeline.
  5. Verify the backend health endpoint at http://localhost:8000/health.
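
If you want to script the health check (for example while containers are still starting), a small polling helper works well. This is a generic sketch; the fetch callable is injected so the logic is testable without a running server:

```python
import time

def wait_for_health(fetch, retries: int = 10, delay: float = 1.0) -> bool:
    """Poll a health endpoint until it returns HTTP 200.

    `fetch` is any callable returning a status code, e.g. one wrapping
    urllib.request against http://localhost:8000/health.
    """
    for _ in range(retries):
        try:
            if fetch() == 200:
                return True
        except OSError:
            pass  # server not accepting connections yet
        time.sleep(delay)
    return False
```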

Production Deployment

For detailed differences between local and production environments (storage backends, URLs, environment variables), see HOSTING.md.

Backend on Railway

  1. Connect your GitHub repository to Railway.

  2. Railway will detect the railway.json configuration, which points to src/backend/Dockerfile.

  3. Add a PostgreSQL service in the Railway project dashboard.

  4. Set the following environment variables in the Railway service settings:

    | Variable | Value |
    | --- | --- |
    | DATABASE_URL | Provided by Railway PostgreSQL service |
    | LANCEDB_DOWNLOAD_URL | Google Drive link to the pre-built vectors |
    | LANCEDB_PATH | gs://<your-gcs-bucket> |
    | GOOGLE_APPLICATION_CREDENTIALS_JSON | Full JSON of the GCS service account |
    | GOOGLE_CLIENT_ID | Your OAuth Client ID |
    | GOOGLE_CLIENT_SECRET | Your OAuth Client Secret |
    | JWT_SECRET_KEY | A securely generated random string |
    | SESSION_SECRET | A securely generated random string |
    | ENCRYPTION_KEY | A Fernet key (must be consistent across deploys) |
    | FRONTEND_URL | Your Vercel frontend URL |
    | BLOB_READ_WRITE_TOKEN | Your Vercel Blob token |
    | MODAL_TOKEN_ID | Your Modal token ID |
    | MODAL_TOKEN_SECRET | Your Modal token secret |
    | MODAL_APP_NAME | interpres-backend |
  5. Railway will build and deploy automatically on every push to main.

  6. Confirm the deployment by visiting https://<your-railway-domain>/health.

Frontend on Vercel

  1. Connect your GitHub repository to Vercel.

  2. Set the Root Directory to src/frontend.

  3. Set the following environment variables in Vercel project settings:

    | Variable | Value |
    | --- | --- |
    | NEXT_PUBLIC_API_URL | Your Railway backend URL (e.g., https://interpres-production.up.railway.app) |
  4. Vercel builds the Next.js app using the standalone output mode defined in next.config.ts.

  5. Ensure remotePatterns in next.config.ts includes your Vercel Blob hostname for image rendering.

Modal Services

Modal services are deployed independently from Railway and Vercel. Whenever you update the ML code in src/modal.com/app.py, redeploy with:

./deploy_modal.sh

Modal handles auto-scaling, cold starts, and GPU provisioning.

Testing

Backend Tests

The backend uses pytest with coverage reporting. Tests require a running PostgreSQL instance.

# Run all backend tests
uv run pytest

# Run with verbose output
uv run pytest -v

# Run a specific test file
uv run pytest src/backend/tests/test_translations.py

Tests are configured in pyproject.toml with the following settings:

  • Test directory: src/backend/tests
  • Async mode: auto (via pytest-asyncio)
  • Coverage: reports on src/backend with term-missing output
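
Put together, the settings described above would look roughly like this in pyproject.toml (a sketch for orientation; consult the repository's actual file for the authoritative values):

```toml
[tool.pytest.ini_options]
testpaths = ["src/backend/tests"]
asyncio_mode = "auto"
addopts = "--cov=src/backend --cov-report=term-missing"
```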

Frontend Tests

The frontend uses Vitest with React Testing Library.

cd src/frontend

# Run all tests
npm run test

# Run in watch mode
npm run test:watch

# Run with coverage
npm run test:coverage

Linting

# Backend (Ruff)
uv run ruff check src/backend
uv run ruff format --check src/backend

# Frontend (ESLint)
cd src/frontend && npm run lint

CI Pipeline

GitHub Actions runs on every push and pull request to main:

  • Backend CI (backend.yml): Ruff linting, then pytest with a PostgreSQL service container.
  • Frontend CI (frontend.yml): ESLint, TypeScript type checking, then Vitest with coverage.
  • React Doctor (react-doctor.yml): Agentic code review for React components.

All pipelines are path-filtered to only run when relevant files change.
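
A path filter in a GitHub Actions workflow looks like the excerpt below. This trigger block is illustrative only; the actual backend.yml may list different paths:

```yaml
# Illustrative trigger: run backend CI only when backend-related files change.
on:
  push:
    branches: [main]
    paths:
      - "src/backend/**"
      - "pyproject.toml"
  pull_request:
    branches: [main]
    paths:
      - "src/backend/**"
      - "pyproject.toml"
```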

Project Structure

Interpres/
|-- docker-compose.yml                       # Local development orchestration
|-- pyproject.toml                           # Python dependencies and tool config (uv)
|-- uv.lock                                  # Locked Python dependencies
|-- railway.json                             # Railway deployment config
|-- deploy_modal.sh                          # Script to deploy ML services to Modal
|-- .env.local.example                       # Local development env template
|-- .env.production.example                  # Production env template
|-- HOSTING.md                               # Local vs production environment guide
|
|-- docs/
|   +-- diagrams/
|       |-- drawio/                          # Editable draw.io source files
|       |-- puml/                            # PlantUML diagram sources
|       |-- render/                          # Exported PDFs and SVGs
|       +-- use_cases.md                     # Detailed use case specification
|
|-- scripts/
|   +-- cleanup.sh                           # Remove caches and build artifacts
|
|-- src/
|   |-- backend/
|   |   |-- Dockerfile                       # Multi-stage Python 3.12 build
|   |   |-- entrypoint.sh                    # LanceDB download + server startup
|   |   |-- main.py                          # FastAPI application entry point
|   |   |-- db/                              # Database layer (SQLAlchemy models, CRUD)
|   |   |   |-- base.py                      # Engine, sessions, schema init
|   |   |   |-- models.py                    # SQLAlchemy ORM models
|   |   |   |-- api_keys.py                  # Encrypted API key storage
|   |   |   |-- collections.py               # Document collections + LanceDB integration
|   |   |   |-- translations.py              # Translation history CRUD
|   |   |   |-- users.py                     # User account CRUD
|   |   |   |-- jobs.py                      # Background job tracking
|   |   |   +-- prompts.py                   # Custom prompt templates
|   |   |-- routes/                          # FastAPI route handlers
|   |   |   |-- dependencies.py              # Shared auth + pipeline config deps
|   |   |   |-- schemas.py                   # Pydantic request/response models
|   |   |   |-- auth.py                      # Google OAuth login/callback
|   |   |   |-- pipeline.py                  # Main transcription pipeline endpoint
|   |   |   |-- translations.py              # Translation CRUD endpoints
|   |   |   |-- collections.py               # RAG collection management
|   |   |   |-- jobs.py                      # Batch job endpoints
|   |   |   |-- prompts.py                   # Prompt template endpoints
|   |   |   |-- user.py                      # API key management endpoints
|   |   |   +-- utils.py                     # Health check + image serving
|   |   |-- services/                        # Business logic and ML interfaces
|   |   |   |-- interfaces.py                # Abstract base classes for Modal services
|   |   |   |-- pipeline.py                  # Pipeline orchestration logic
|   |   |   |-- event_bus.py                 # SSE event bus for streaming
|   |   |   |-- ocr_services.py              # OCR model wrappers
|   |   |   |-- correction_services.py       # Post-OCR correction wrapper
|   |   |   |-- rag_services.py              # RAG retrieval logic
|   |   |   |-- llm_services.py              # LLM integration (Gemini, OpenAI)
|   |   |   |-- skip_service.py              # No-op service for skipped stages
|   |   |   +-- models.py                    # Pipeline configuration models
|   |   |-- utils/
|   |   |   +-- text_processing.py           # Text chunking and normalization
|   |   +-- tests/                           # Pytest test suite (216 tests, 97% coverage)
|   |
|   |-- frontend/
|   |   |-- Dockerfile                       # Multi-stage Node.js 20 build
|   |   |-- package.json                     # Node.js dependencies
|   |   |-- next.config.ts                   # Next.js configuration
|   |   |-- app/                             # Next.js App Router pages
|   |   |   |-- page.tsx                     # Landing page
|   |   |   |-- scriptorium/                 # Main transcription workspace
|   |   |   |-- library/                     # RAG collection management
|   |   |   |-- history/                     # Translation history browser
|   |   |   |-- jobs/                        # Background job monitoring
|   |   |   |-- prompts/                     # Prompt template editor
|   |   |   |-- settings/                    # User settings (API keys)
|   |   |   +-- share/                       # Public translation sharing
|   |   |-- components/                      # Reusable React components
|   |   |-- lib/                             # Utility functions and API client
|   |   +-- tests/                           # Vitest test suite
|   |
|   |-- modal.com/
|   |   +-- app.py                           # Modal serverless GPU service definitions
|   |
|   +-- scripts/
|       |-- ingest.py                        # LanceDB vector ingestion script
|       |-- push_to_hf.py                    # Push datasets to Hugging Face
|       +-- standardize_data.py              # Data normalization utilities
|
|-- .gemini/
|   +-- config.yaml                          # Agent PR Reviewer
|
+-- .github/workflows/
    |-- backend.yml                          # Backend CI (lint + test)
    |-- frontend.yml                         # Frontend CI (lint + type-check + test)
    +-- react-doctor.yml                     # React Doctor (agentic lint + test)

Environment Variables Reference

| Variable | Required | Default | Description |
| --- | --- | --- | --- |
| DATABASE_URL | Yes | -- | PostgreSQL connection string |
| LANCEDB_PATH | Yes | /lancedb | Path to LanceDB storage (local path or gs:// URI) |
| LANCEDB_DOWNLOAD_URL | No | -- | URL to download pre-built LanceDB vectors |
| UPLOADS_PATH | Yes | /uploads | Local directory for uploaded images |
| NEXT_PUBLIC_API_URL | Yes | -- | Backend URL visible to the browser |
| FRONTEND_URL | Yes | http://localhost:3000 | Frontend URL for CORS and OAuth redirects |
| GOOGLE_CLIENT_ID | Yes | -- | Google OAuth 2.0 Client ID |
| GOOGLE_CLIENT_SECRET | Yes | -- | Google OAuth 2.0 Client Secret |
| JWT_SECRET_KEY | Yes | -- | Secret for signing JWT tokens |
| SESSION_SECRET | Yes | -- | Secret for session middleware |
| ENCRYPTION_KEY | Yes | -- | Fernet key for encrypting stored API keys |
| MODAL_TOKEN_ID | Yes | -- | Modal.com authentication token ID |
| MODAL_TOKEN_SECRET | Yes | -- | Modal.com authentication token secret |
| MODAL_APP_NAME | Yes | interpres-backend | Name of the Modal app |
| HF_TOKEN | Yes | -- | Hugging Face token for model downloads |
| GOOGLE_APPLICATION_CREDENTIALS | Dev only | -- | Path to GCS service account JSON file |
| GOOGLE_APPLICATION_CREDENTIALS_JSON | Prod only | -- | GCS service account JSON as a string |
| BLOB_READ_WRITE_TOKEN | Prod only | -- | Vercel Blob read-write token |

License

This project is licensed under the MIT License.

Copyright (c) 2026

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
