
Interpres

AI-powered transcription and translation of medieval Latin manuscripts.

Interpres is a full-stack web application that combines optical character recognition (OCR), post-OCR error correction, retrieval-augmented generation (RAG), and large language models (LLMs) into a configurable pipeline for analyzing historical manuscripts. Users upload manuscript images, select their desired models, and receive transcribed Latin text alongside an English translation enriched with historical context.

This project was developed as part of a BSc thesis at Eötvös Loránd University (ELTE).

Table of Contents

  1. Architecture Overview
  2. Technology Stack
  3. Prerequisites
  4. External Services Setup
  5. Local Development
  6. Production Deployment
  7. Testing
  8. Project Structure
  9. Environment Variables Reference
  10. License

Architecture Overview

Interpres follows a three-tier architecture with a clear separation between the frontend, backend API, and GPU-accelerated ML services:

C4 Context Diagram

Request flow: A user uploads a manuscript image through the frontend. The backend orchestrates a multi-step pipeline: (1) OCR extracts Latin text from the image, (2) an optional correction model fixes OCR errors, (3) RAG retrieves relevant context from a Latin dictionary and parallel corpora, (4) a reranking model scores retrieval results, and (5) an LLM generates the final translation using all gathered context.
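
The five steps above can be sketched as a small orchestration function. This is an illustrative sketch with hypothetical names and stubbed stage signatures, not the repository's actual code (the real logic lives in src/backend/services/pipeline.py):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class PipelineConfig:
    use_correction: bool = True  # stage (2) is optional
    top_k: int = 5               # number of context passages kept after reranking

def run_pipeline(
    image: bytes,
    config: PipelineConfig,
    ocr: Callable[[bytes], str],
    correct: Callable[[str], str],
    retrieve: Callable[[str, int], list[str]],
    rerank: Callable[[str, list[str]], list[str]],
    translate: Callable[[str, list[str]], str],
) -> str:
    latin = ocr(image)                                    # (1) OCR
    if config.use_correction:
        latin = correct(latin)                            # (2) post-OCR correction
    candidates = retrieve(latin, config.top_k * 4)        # (3) RAG retrieval (over-fetch)
    context = rerank(latin, candidates)[: config.top_k]   # (4) reranking
    return translate(latin, context)                      # (5) LLM translation
```

Each stage is passed in as a callable, so Modal-backed services and local stubs are interchangeable in tests.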

Technology Stack

| Layer | Technology | Purpose |
| --- | --- | --- |
| Frontend | Next.js 16, React 19, TypeScript, Tailwind CSS v4 | User interface and server-side rendering |
| Backend | FastAPI, SQLAlchemy, Pydantic, Uvicorn | REST API, authentication, pipeline orchestration |
| Database | PostgreSQL 15 | User accounts, translations, API keys, prompts |
| Vector Store | LanceDB | RAG embeddings for Latin dictionary and parallel corpora |
| ML Services | Modal.com (serverless GPU) | OCR, correction, embedding, reranking |
| Auth | Google OAuth 2.0, JWT | User authentication |
| File Storage | Local filesystem (dev) / Vercel Blob (prod) | Manuscript image storage |
| CI/CD | GitHub Actions | Automated linting and testing |
| Hosting | Railway (backend), Vercel (frontend) | Production deployment |

Prerequisites

Ensure the following tools are installed on your development machine:

  • Docker and Docker Compose (v2+)
  • Python 3.12+ with uv package manager
  • Node.js 20+ with npm
  • Git

You will also need accounts for the following external services (detailed setup in the next section):

  • Google Cloud Platform (OAuth credentials and optionally GCS for production LanceDB)
  • Modal.com (serverless GPU for ML models)
  • Vercel (frontend hosting, production only)
  • Railway (backend hosting, production only)

External Services Setup

This section walks through configuring each external service required by Interpres. For local development, only Google OAuth and Modal.com are mandatory.

1. Google OAuth 2.0 (Authentication)

Interpres uses Google OAuth for user login. You need to create OAuth 2.0 credentials in the Google Cloud Console.

  1. Go to the Google Cloud Console.

  2. Create a new project (or select an existing one).

  3. Navigate to APIs & Services > Credentials.

  4. Click Create Credentials > OAuth client ID.

  5. Select Web application as the application type.

  6. Set the Authorized redirect URIs to:

    • http://localhost:8000/auth/google/callback (for local development)
    • https://<your-backend-domain>/auth/google/callback (for production)
  7. Click Create and note the Client ID and Client Secret.

  8. Set these in your .env file:

    GOOGLE_CLIENT_ID=<your-client-id>
    GOOGLE_CLIENT_SECRET=<your-client-secret>
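
For reference, the login redirect the backend issues is a standard Google OAuth 2.0 authorization URL built from these credentials. The sketch below is illustrative only (the function name is hypothetical; the actual flow is implemented in src/backend/routes/auth.py):

```python
from urllib.parse import urlencode

def google_auth_url(client_id: str, redirect_uri: str, state: str) -> str:
    """Build a standard Google OAuth 2.0 authorization URL."""
    params = {
        "client_id": client_id,
        "redirect_uri": redirect_uri,      # must match an Authorized redirect URI
        "response_type": "code",
        "scope": "openid email profile",
        "state": state,                    # CSRF protection token
    }
    return "https://accounts.google.com/o/oauth2/v2/auth?" + urlencode(params)
```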
    

2. Modal.com (GPU ML Services)

The ML pipeline (OCR, correction, embedding, reranking) runs on Modal.com as serverless GPU functions.

  1. Create an account at modal.com.

  2. Install the Modal CLI and authenticate:

    uv add modal
    uv run modal token new
  3. Note the token ID and secret printed by the CLI. Add them to your .env:

    MODAL_TOKEN_ID=ak-...
    MODAL_TOKEN_SECRET=as-...
    MODAL_APP_NAME=interpres-backend
    
  4. Deploy the ML services to Modal:

    ./deploy_modal.sh

    This runs uv run modal deploy src/modal.com/app.py and deploys the following services:

    • PaddleService -- PaddleOCR for text detection
    • TrOCRTridisService -- TrOCR fine-tuned on the Tridis dataset
    • TrOCRMedievalService -- TrOCR fine-tuned on medieval manuscripts
    • QwenService -- Qwen2-VL vision-language model for OCR
    • EmbeddingService -- Sentence-transformer for RAG embeddings
    • RerankingService -- Cross-encoder for reranking retrieval results
    • CorrectionService -- ByT5-based post-OCR error correction
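
Because all of these services expose a similar call shape, the backend can treat them uniformly behind a small interface. The following is a hypothetical sketch of that pattern (the repository's real abstractions live in src/backend/services/interfaces.py):

```python
from abc import ABC, abstractmethod

class OCRService(ABC):
    """Hypothetical common interface so pipeline code can swap OCR backends."""

    @abstractmethod
    def transcribe(self, image: bytes) -> str:
        """Return the Latin text extracted from a manuscript image."""

class StubOCRService(OCRService):
    """Local stand-in used when Modal is unavailable (e.g., in tests)."""

    def transcribe(self, image: bytes) -> str:
        return "lorem ipsum"
```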

3. Hugging Face Token (Model Access)

Some ML models require a Hugging Face token for downloading weights.

  1. Create an account at huggingface.co.

  2. Go to Settings > Access Tokens and create a new token with read permissions.

  3. Add it to your .env:

    HF_TOKEN=hf_...
    

4. Google Cloud Storage (Production LanceDB -- Optional)

In production, LanceDB vectors are stored on Google Cloud Storage instead of the local filesystem. This step is only required for production deployment.

  1. Create a GCS bucket (e.g., interpres-lancedb).

  2. Create a Service Account with Storage Object Admin permissions on the bucket.

  3. Download the service account JSON key file.

  4. For local development with GCS, place the file at the project root as google_credentials.json.

  5. For production (Railway), set the entire JSON as an environment variable:

    GOOGLE_APPLICATION_CREDENTIALS_JSON={"type": "service_account", ...}
    

    The backend automatically parses this JSON string into a temporary file at startup because the LanceDB Rust core requires a physical file path for GCS authentication.

5. Vercel Blob (Production Image Storage -- Optional)

In production, uploaded manuscript images are stored in Vercel Blob instead of the local filesystem. This is only required for production deployment; local development uses a Docker volume.

  1. In your Vercel project dashboard, go to Storage > Blob.

  2. Create a new Blob store.

  3. Copy the read-write token and add it to your production environment:

    BLOB_READ_WRITE_TOKEN=vercel_blob_rw_...
    

Local Development

Step 1: Clone the Repository

git clone https://github.com/whynotkimhari/Interpres.git
cd Interpres

Step 2: Configure Environment Variables

Copy the local development environment template and fill in your credentials:

cp .env.local.example .env

Edit .env and set the following values:

# --- Database Configuration ---
DATABASE_URL=postgresql://interpres:interpres_secret@postgres:5432/interpres
LANCEDB_DOWNLOAD_URL=https://drive.google.com/file/d/1vCbxUqTRUVlo0gYJjgZDfWx-P9YF2JZ1/view?usp=share_link
LANCEDB_PATH=/lancedb
UPLOADS_PATH=/uploads

# --- Frontend Configuration ---
NEXT_PUBLIC_API_URL=http://localhost:8000

# --- Backend Configuration ---
FRONTEND_URL=http://localhost:3000
JWT_SECRET_KEY=<generate-a-random-string>
SESSION_SECRET=<generate-a-random-string>
ENCRYPTION_KEY=<generate-a-fernet-key>

# --- Third Party Services ---
MODAL_TOKEN_ID=ak-...
MODAL_TOKEN_SECRET=as-...
MODAL_APP_NAME=interpres-backend
HF_TOKEN=hf_...
GOOGLE_CLIENT_ID=<your-google-client-id>
GOOGLE_CLIENT_SECRET=<your-google-client-secret>

To generate the required keys:

# Generate a random secret (for SESSION_SECRET and JWT_SECRET_KEY)
openssl rand -hex 64

# Generate a Fernet encryption key (for ENCRYPTION_KEY)
python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())"
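
If you prefer to stay in Python, the standard library can generate both kinds of secret. A Fernet key is just 32 random bytes, URL-safe base64 encoded, so the second line below produces a valid key without the cryptography package:

```python
import base64
import secrets

# Random hex secret, equivalent to `openssl rand -hex 64` (128 hex characters)
jwt_secret = secrets.token_hex(64)

# A Fernet key is 32 random bytes, URL-safe base64 encoded (44 characters)
fernet_key = base64.urlsafe_b64encode(secrets.token_bytes(32)).decode()

print(jwt_secret)
print(fernet_key)
```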

Step 3: Set Up the LanceDB Vector Database

The RAG system requires pre-built vector embeddings stored in LanceDB. You have two options:

Option A: Download Pre-built Vectors (Recommended)

Set the LANCEDB_DOWNLOAD_URL in your .env to point to the pre-built archive. The backend Docker container will automatically download and extract it on first startup via the entrypoint script.

LANCEDB_DOWNLOAD_URL=https://drive.google.com/file/d/1vCbxUqTRUVlo0gYJjgZDfWx-P9YF2JZ1/view?usp=share_link

Option B: Build Vectors from Scratch

If you prefer to generate the embeddings yourself (or need to update the vector data), use the ingestion script. This requires Modal services to be deployed first (see Section 2 of External Services):

uv run python src/scripts/ingest.py

This script:

  1. Loads the wnkh/IRC dataset from Hugging Face (Latin dictionary).
  2. Loads the grosenthal/latin_english_parallel and magistermilitum/tridis_translate_la_en parallel corpora.
  3. Generates embeddings via the Modal EmbeddingService.
  4. Stores everything in a local LanceDB directory.
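
Conceptually, the ingestion flow is chunk, embed, store. The sketch below uses a stub embedder and an in-memory list standing in for the Modal EmbeddingService and LanceDB; names and chunk size are illustrative, not taken from ingest.py:

```python
def chunk(text: str, size: int = 200) -> list[str]:
    """Split text into fixed-size pieces (illustrative chunking strategy)."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def ingest(records: list[str], embed, store: list) -> int:
    """Embed each chunk of each record and append it to the vector store."""
    for text in records:
        for piece in chunk(text):
            store.append({"text": piece, "vector": embed(piece)})
    return len(store)
```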

Step 4: Deploy ML Services to Modal

If you have not already done so, deploy the ML models to Modal:

./deploy_modal.sh

Step 5: Start the Application

Launch all services with Docker Compose:

docker compose up --build

This starts three containers:

| Service | URL | Description |
| --- | --- | --- |
| Frontend | http://localhost:3000 | Next.js development server |
| Backend | http://localhost:8000 | FastAPI server with auto-reload |
| PostgreSQL | localhost:5432 | Database (data persisted in postgres_data volume) |

The backend entrypoint script handles:

  • Downloading LanceDB data if LANCEDB_DOWNLOAD_URL is set and the directory is empty.
  • Starting the Uvicorn server on the port specified by $PORT (defaults to 8000).

On startup, the backend also:

  • Initializes the PostgreSQL schema via SQLAlchemy (create_all).
  • Seeds global LanceDB collections (Latin dictionary and parallel corpora) if they are not already present.

Step 6: Verify the Setup

  1. Open http://localhost:3000 in your browser.
  2. Click "Launch App" or navigate to the Scriptorium page.
  3. Sign in with your Google account.
  4. Upload a manuscript image and run the translation pipeline.
  5. Verify the backend health endpoint at http://localhost:8000/health.
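
If you want to script the health check (for example while containers are still starting), a small polling helper works well. This is a generic sketch; the fetch callable is injected so the logic is testable without a running server:

```python
import time

def wait_for_health(fetch, retries: int = 10, delay: float = 1.0) -> bool:
    """Poll a health endpoint until it returns HTTP 200.

    `fetch` is any callable returning a status code, e.g. one wrapping
    urllib.request against http://localhost:8000/health.
    """
    for _ in range(retries):
        try:
            if fetch() == 200:
                return True
        except OSError:
            pass  # server not accepting connections yet
        time.sleep(delay)
    return False
```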

Production Deployment

For detailed differences between local and production environments (storage backends, URLs, environment variables), see HOSTING.md.

Backend on Railway

  1. Connect your GitHub repository to Railway.

  2. Railway will detect the railway.json configuration, which points to src/backend/Dockerfile.

  3. Add a PostgreSQL service in the Railway project dashboard.

  4. Set the following environment variables in the Railway service settings:

    | Variable | Value |
    | --- | --- |
    | DATABASE_URL | Provided by Railway PostgreSQL service |
    | LANCEDB_DOWNLOAD_URL | Google Drive link to the pre-built vectors |
    | LANCEDB_PATH | gs://<your-gcs-bucket> |
    | GOOGLE_APPLICATION_CREDENTIALS_JSON | Full JSON of the GCS service account |
    | GOOGLE_CLIENT_ID | Your OAuth Client ID |
    | GOOGLE_CLIENT_SECRET | Your OAuth Client Secret |
    | JWT_SECRET_KEY | A securely generated random string |
    | SESSION_SECRET | A securely generated random string |
    | ENCRYPTION_KEY | A Fernet key (must be consistent across deploys) |
    | FRONTEND_URL | Your Vercel frontend URL |
    | BLOB_READ_WRITE_TOKEN | Your Vercel Blob token |
    | MODAL_TOKEN_ID | Your Modal token ID |
    | MODAL_TOKEN_SECRET | Your Modal token secret |
    | MODAL_APP_NAME | interpres-backend |
  5. Railway will build and deploy automatically on every push to main.

  6. Confirm the deployment by visiting https://<your-railway-domain>/health.

Frontend on Vercel

  1. Connect your GitHub repository to Vercel.

  2. Set the Root Directory to src/frontend.

  3. Set the following environment variables in Vercel project settings:

    | Variable | Value |
    | --- | --- |
    | NEXT_PUBLIC_API_URL | Your Railway backend URL (e.g., https://interpres-production.up.railway.app) |
  4. Vercel builds the Next.js app using the standalone output mode defined in next.config.ts.

  5. Ensure remotePatterns in next.config.ts includes your Vercel Blob hostname for image rendering.

Modal Services

Modal services are deployed independently from Railway and Vercel. Whenever you update the ML code in src/modal.com/app.py, redeploy with:

./deploy_modal.sh

Modal handles auto-scaling, cold starts, and GPU provisioning.

Testing

Backend Tests

The backend uses pytest with coverage reporting. Tests require a running PostgreSQL instance.

# Run all backend tests
uv run pytest

# Run with verbose output
uv run pytest -v

# Run a specific test file
uv run pytest src/backend/tests/test_translations.py

Tests are configured in pyproject.toml with the following settings:

  • Test directory: src/backend/tests
  • Async mode: auto (via pytest-asyncio)
  • Coverage: reports on src/backend with term-missing output
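
Put together, the settings described above would look roughly like this in pyproject.toml (a sketch for orientation; consult the repository's actual file for the authoritative values):

```toml
[tool.pytest.ini_options]
testpaths = ["src/backend/tests"]
asyncio_mode = "auto"
addopts = "--cov=src/backend --cov-report=term-missing"
```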

Frontend Tests

The frontend uses Vitest with React Testing Library.

cd src/frontend

# Run all tests
npm run test

# Run in watch mode
npm run test:watch

# Run with coverage
npm run test:coverage

Linting

# Backend (Ruff)
uv run ruff check src/backend
uv run ruff format --check src/backend

# Frontend (ESLint)
cd src/frontend && npm run lint

CI Pipeline

GitHub Actions runs on every push and pull request to main:

  • Backend CI (backend.yml): Ruff linting, then pytest with a PostgreSQL service container.
  • Frontend CI (frontend.yml): ESLint, TypeScript type checking, then Vitest with coverage.
  • React Doctor (react-doctor.yml): Agentic code review for React components.

All pipelines are path-filtered to only run when relevant files change.
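
A path filter in a GitHub Actions workflow looks like the excerpt below. This trigger block is illustrative only; the actual backend.yml may list different paths:

```yaml
# Illustrative trigger: run backend CI only when backend-related files change.
on:
  push:
    branches: [main]
    paths:
      - "src/backend/**"
      - "pyproject.toml"
  pull_request:
    branches: [main]
    paths:
      - "src/backend/**"
      - "pyproject.toml"
```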

Project Structure

Interpres/
|-- docker-compose.yml                       # Local development orchestration
|-- pyproject.toml                           # Python dependencies and tool config (uv)
|-- uv.lock                                  # Locked Python dependencies
|-- railway.json                             # Railway deployment config
|-- deploy_modal.sh                          # Script to deploy ML services to Modal
|-- .env.local.example                       # Local development env template
|-- .env.production.example                  # Production env template
|-- HOSTING.md                               # Local vs production environment guide
|
|-- docs/
|   +-- diagrams/
|       |-- drawio/                          # Editable draw.io source files
|       |-- puml/                            # PlantUML diagram sources
|       |-- render/                          # Exported PDFs and SVGs
|       +-- use_cases.md                     # Detailed use case specification
|
|-- scripts/
|   +-- cleanup.sh                           # Remove caches and build artifacts
|
|-- src/
|   |-- backend/
|   |   |-- Dockerfile                       # Multi-stage Python 3.12 build
|   |   |-- entrypoint.sh                    # LanceDB download + server startup
|   |   |-- main.py                          # FastAPI application entry point
|   |   |-- db/                              # Database layer (SQLAlchemy models, CRUD)
|   |   |   |-- base.py                      # Engine, sessions, schema init
|   |   |   |-- models.py                    # SQLAlchemy ORM models
|   |   |   |-- api_keys.py                  # Encrypted API key storage
|   |   |   |-- collections.py               # Document collections + LanceDB integration
|   |   |   |-- translations.py              # Translation history CRUD
|   |   |   |-- users.py                     # User account CRUD
|   |   |   |-- jobs.py                      # Background job tracking
|   |   |   +-- prompts.py                   # Custom prompt templates
|   |   |-- routes/                          # FastAPI route handlers
|   |   |   |-- dependencies.py              # Shared auth + pipeline config deps
|   |   |   |-- schemas.py                   # Pydantic request/response models
|   |   |   |-- auth.py                      # Google OAuth login/callback
|   |   |   |-- pipeline.py                  # Main transcription pipeline endpoint
|   |   |   |-- translations.py              # Translation CRUD endpoints
|   |   |   |-- collections.py               # RAG collection management
|   |   |   |-- jobs.py                      # Batch job endpoints
|   |   |   |-- prompts.py                   # Prompt template endpoints
|   |   |   |-- user.py                      # API key management endpoints
|   |   |   +-- utils.py                     # Health check + image serving
|   |   |-- services/                        # Business logic and ML interfaces
|   |   |   |-- interfaces.py                # Abstract base classes for Modal services
|   |   |   |-- pipeline.py                  # Pipeline orchestration logic
|   |   |   |-- event_bus.py                 # SSE event bus for streaming
|   |   |   |-- ocr_services.py              # OCR model wrappers
|   |   |   |-- correction_services.py       # Post-OCR correction wrapper
|   |   |   |-- rag_services.py              # RAG retrieval logic
|   |   |   |-- llm_services.py              # LLM integration (Gemini, OpenAI)
|   |   |   |-- skip_service.py              # No-op service for skipped stages
|   |   |   +-- models.py                    # Pipeline configuration models
|   |   |-- utils/
|   |   |   +-- text_processing.py           # Text chunking and normalization
|   |   +-- tests/                           # Pytest test suite (216 tests, 97% coverage)
|   |
|   |-- frontend/
|   |   |-- Dockerfile                       # Multi-stage Node.js 20 build
|   |   |-- package.json                     # Node.js dependencies
|   |   |-- next.config.ts                   # Next.js configuration
|   |   |-- app/                             # Next.js App Router pages
|   |   |   |-- page.tsx                     # Landing page
|   |   |   |-- scriptorium/                 # Main transcription workspace
|   |   |   |-- library/                     # RAG collection management
|   |   |   |-- history/                     # Translation history browser
|   |   |   |-- jobs/                        # Background job monitoring
|   |   |   |-- prompts/                     # Prompt template editor
|   |   |   |-- settings/                    # User settings (API keys)
|   |   |   +-- share/                       # Public translation sharing
|   |   |-- components/                      # Reusable React components
|   |   |-- lib/                             # Utility functions and API client
|   |   +-- tests/                           # Vitest test suite
|   |
|   |-- modal.com/
|   |   +-- app.py                           # Modal serverless GPU service definitions
|   |
|   +-- scripts/
|       |-- ingest.py                        # LanceDB vector ingestion script
|       |-- push_to_hf.py                    # Push datasets to Hugging Face
|       +-- standardize_data.py              # Data normalization utilities
|
|-- .gemini/
|   +-- config.yaml                          # Agent PR Reviewer
|
+-- .github/workflows/
    |-- backend.yml                          # Backend CI (lint + test)
    |-- frontend.yml                         # Frontend CI (lint + type-check + test)
    +-- react-doctor.yml                     # React Doctor (agentic lint + test)

Environment Variables Reference

| Variable | Required | Default | Description |
| --- | --- | --- | --- |
| DATABASE_URL | Yes | -- | PostgreSQL connection string |
| LANCEDB_PATH | Yes | /lancedb | Path to LanceDB storage (local path or gs:// URI) |
| LANCEDB_DOWNLOAD_URL | No | -- | URL to download pre-built LanceDB vectors |
| UPLOADS_PATH | Yes | /uploads | Local directory for uploaded images |
| NEXT_PUBLIC_API_URL | Yes | -- | Backend URL visible to the browser |
| FRONTEND_URL | Yes | http://localhost:3000 | Frontend URL for CORS and OAuth redirects |
| GOOGLE_CLIENT_ID | Yes | -- | Google OAuth 2.0 Client ID |
| GOOGLE_CLIENT_SECRET | Yes | -- | Google OAuth 2.0 Client Secret |
| JWT_SECRET_KEY | Yes | -- | Secret for signing JWT tokens |
| SESSION_SECRET | Yes | -- | Secret for session middleware |
| ENCRYPTION_KEY | Yes | -- | Fernet key for encrypting stored API keys |
| MODAL_TOKEN_ID | Yes | -- | Modal.com authentication token ID |
| MODAL_TOKEN_SECRET | Yes | -- | Modal.com authentication token secret |
| MODAL_APP_NAME | Yes | interpres-backend | Name of the Modal app |
| HF_TOKEN | Yes | -- | Hugging Face token for model downloads |
| GOOGLE_APPLICATION_CREDENTIALS | Dev only | -- | Path to GCS service account JSON file |
| GOOGLE_APPLICATION_CREDENTIALS_JSON | Prod only | -- | GCS service account JSON as a string |
| BLOB_READ_WRITE_TOKEN | Prod only | -- | Vercel Blob read-write token |

License

This project is licensed under the MIT License.

Copyright (c) 2026

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
