From 77ec394c84cbc037125bcc95dfa40136b0948c8c Mon Sep 17 00:00:00 2001
From: Riccardo Busetti
Date: Tue, 18 Nov 2025 17:06:13 +0100
Subject: [PATCH 01/12] feat(docs): Improve docs and README

---
 README.md                                | 16 +++++++++---
 docs/explanation/architecture.md         |  2 +-
 docs/how-to/index.md                     |  6 +++++
 docs/how-to/postgres-state-store.md      | 32 ++++++++++++++++++++++++
 docs/index.md                            | 15 +++++++----
 docs/tutorials/custom-implementations.md |  2 +-
 docs/tutorials/first-pipeline.md         |  2 +-
 mkdocs.yaml                              |  1 +
 8 files changed, 64 insertions(+), 12 deletions(-)
 create mode 100644 docs/how-to/postgres-state-store.md

diff --git a/README.md b/README.md
index e598b2e3..ef8a63bc 100644
--- a/README.md
+++ b/README.md
@@ -114,18 +114,26 @@ For tutorials and deeper guidance, see the [Documentation](https://supabase.gith
 
 ## Destinations
 
-ETL is designed to be extensible. You can implement your own destinations to send data to any destination you like, however it comes with a few built in destinations:
+ETL is designed to be extensible. You can implement your own destinations, and the project currently ships with the following maintained options:
 
-- BigQuery
+- **BigQuery** – full CRUD-capable replication for analytics workloads.
+- **Apache Iceberg** – append-only log of operations today (no in-place updates yet).
 
-Out-of-the-box destinations are available in the `etl-destinations` crate:
+Enable the destinations you need through the `etl-destinations` crate:
 
 ```toml
 [dependencies]
 etl = { git = "https://github.com/supabase/etl" }
-etl-destinations = { git = "https://github.com/supabase/etl", features = ["bigquery"] }
+etl-destinations = { git = "https://github.com/supabase/etl", features = ["bigquery", "iceberg"] }
 ```
 
+## Contributing
+
+We welcome pull requests and GitHub issues. That said, we currently cannot accept new custom destinations unless there
+is significant community demand. Each destination carries a high long-term maintenance cost, and we are prioritizing core stability,
+observability, and ergonomics. If you need a destination that is not yet supported, please start a discussion or issue so we can gauge demand
+before proposing an implementation.
+
 ## License
 
 Apache‑2.0. See `LICENSE` for details.
diff --git a/docs/explanation/architecture.md b/docs/explanation/architecture.md
index 22d092cd..dce61e04 100644
--- a/docs/explanation/architecture.md
+++ b/docs/explanation/architecture.md
@@ -25,7 +25,7 @@ flowchart LR
     end
 
     subgraph Destination[Destination]
-        Dest["BigQuery<br/>Custom API<br/>Memory"]
+        Dest["BigQuery<br/>Apache Iceberg<br/>Custom API"]
     end
 
     subgraph Store[Store]
diff --git a/docs/how-to/index.md b/docs/how-to/index.md
index 67eac066..6d1bf049 100644
--- a/docs/how-to/index.md
+++ b/docs/how-to/index.md
@@ -12,6 +12,12 @@ Set up Postgres with the correct settings, and publications for ETL pipelines.
 
 **When to use:** Setting up a new Postgres source for replication.
 
+### [Apply Postgres State Store Migrations](postgres-state-store.md)
+
+Create the `etl` schema, replication state tables, and related objects required by `PostgresStore`.
+
+**When to use:** Before running a pipeline that uses the Postgres-backed state or schema stores.
+
 ## Next Steps
 
 After solving your immediate problem:
diff --git a/docs/how-to/postgres-state-store.md b/docs/how-to/postgres-state-store.md
new file mode 100644
index 00000000..aafd76eb
--- /dev/null
+++ b/docs/how-to/postgres-state-store.md
@@ -0,0 +1,32 @@
+# Apply Postgres State Store Migrations
+
+**Prepare the Postgres-backed state store before running pipelines**
+
+`PostgresStore` (and the matching schema store) keep replication metadata inside your own Postgres database. The tables live in the `etl` schema and must be created before a pipeline starts, otherwise you will see errors such as `relation "etl.table_mappings" does not exist`.
+
+Follow these steps whenever you configure a Postgres-backed store.
+
+## 1. Pick the database and user
+
+- Choose the Postgres database that should store ETL metadata (often separate from the source database).
+- Ensure the user credentials configured in `PgConnectionConfig` have privileges to create schemas, tables, and indexes in that database.
+
+## 2. Apply the migrations
+
+All SQL migrations for the Postgres store reside in `etl-replicator/migrations/`. Apply them in order (they are timestamp-prefixed) using your preferred tooling. With `psql`:
+
+```bash
+cd /path/to/etl
+psql "postgres://user:password@host:port/database" -f etl-replicator/migrations/20250827000000_base.sql
+```
+
+If additional migration files appear in that directory, run them sequentially (for example with `ls etl-replicator/migrations/*.sql | sort | xargs -I{} psql -f {}`) before restarting your pipeline.
+
+## 3. Verify the schema
+
+After applying the migrations:
+
+- Confirm the `etl` schema exists.
+- Check that tables like `replication_state`, `table_mappings`, and `schema_definitions` are present.
+
+You can now safely configure `PostgresStore`/`PostgresSchemaStore` in your pipeline. Future migrations can be applied on top.
diff --git a/docs/index.md b/docs/index.md
index b7fbee02..9f6e483c 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -33,11 +33,11 @@
 
 **Postgres Logical Replication** streams data changes from Postgres databases in real-time using the Write-Ahead Log (WAL). ETL builds on this foundation to provide:
 
-- 🚀 **Real-time replication** - Stream changes as they happen
-- 🔄 **Multiple destinations** - BigQuery and more coming soon
-- 🛡️ **Fault tolerance** - Built-in error handling and recovery
-- ⚡ **High performance** - Efficient batching and parallel processing
-- 🔧 **Extensible** - Plugin architecture for custom destinations
+- **Real-time replication** - Stream changes as they happen
+- **Multiple destinations** - BigQuery and Apache Iceberg officially supported
+- **Fault tolerance** - Built-in error handling and recovery
+- **High performance** - Efficient batching and parallel processing
+- **Extensible** - Plugin architecture for custom destinations
 
 ## Quick Example
@@ -91,5 +91,10 @@ async fn main() -> Result<(), Box<dyn std::error::Error>> {
 
 - **First time using ETL?** → Start with [Build your first pipeline](tutorials/first-pipeline.md)
 - **Need Postgres setup help?** → Check [Configure Postgres for Replication](how-to/configure-postgres.md)
+- **Using Postgres for state storage?** → Follow [Apply Postgres state store migrations](how-to/postgres-state-store.md)
 - **Need technical details?** → Check the [Reference](reference/index.md)
 - **Want to understand the architecture?** → Read [ETL Architecture](explanation/architecture.md)
+
+## Contributing
+
+Contributions and bug reports are welcome in the GitHub repository. At the moment we cannot accept new custom destination implementations unless a large portion of the community requests them, because every destination adds a long-lived maintenance burden and we are focusing engineering time on stability, observability, and ergonomics. Please open an issue or discussion first if you believe a new destination should be prioritized.
diff --git a/docs/tutorials/custom-implementations.md b/docs/tutorials/custom-implementations.md
index 25de68d2..1e7dfd9b 100644
--- a/docs/tutorials/custom-implementations.md
+++ b/docs/tutorials/custom-implementations.md
@@ -704,7 +704,7 @@ You now have working custom ETL components:
 
 - **Connect to real Postgres** → [Configure Postgres for Replication](../how-to/configure-postgres.md)
 - **Understand the architecture** → [ETL Architecture](../explanation/architecture.md)
-- **Contribute to ETL** → [Open an issue](https://github.com/supabase/etl/issues) with your custom implementations
+- **Contribute thoughtfully** → [Open an issue](https://github.com/supabase/etl/issues) before proposing a new destination; we currently accept new destinations only when there is clear, broad demand due to the maintenance cost.
 
 ## See Also
diff --git a/docs/tutorials/first-pipeline.md b/docs/tutorials/first-pipeline.md
index 20ee676b..9ad543fb 100644
--- a/docs/tutorials/first-pipeline.md
+++ b/docs/tutorials/first-pipeline.md
@@ -213,7 +213,7 @@ DELETE FROM users WHERE email = 'bob@example.com';
 
 ## Step 6: Verify Data Replication
 
-The data is now replicated in your memory destination. While this tutorial uses memory (perfect for testing), the same pattern works with BigQuery, DuckDB, or custom destinations.
+The data is now replicated in your memory destination. While this tutorial uses memory (perfect for testing), the same pattern works with any destination.
 
 **Checkpoint:** You've successfully built and tested a complete ETL pipeline!
diff --git a/mkdocs.yaml b/mkdocs.yaml
index 047c82e4..2966dade 100644
--- a/mkdocs.yaml
+++ b/mkdocs.yaml
@@ -15,6 +15,7 @@ nav:
   - How-to Guides:
       - Overview: how-to/index.md
       - Configure Postgres: how-to/configure-postgres.md
+      - Apply Postgres State Store Migrations: how-to/postgres-state-store.md
   - Reference:
       - Overview: reference/index.md
   - Explanation:
From fb3968266398fc4f460b2603f979c5fecda27559 Mon Sep 17 00:00:00 2001
From: Riccardo Busetti
Date: Wed, 19 Nov 2025 11:06:46 +0100
Subject: [PATCH 02/12] feat(docs): Improve docs

---
 AGENTS.md                                |   2 +-
 DEVELOPMENT.md                           | 357 +++++++++++++++++++++++
 README.md                                |   4 +
 etl-replicator/scripts/run_migrations.sh |  48 +++
 4 files changed, 410 insertions(+), 1 deletion(-)
 create mode 100644 DEVELOPMENT.md
 create mode 100755 etl-replicator/scripts/run_migrations.sh

diff --git a/AGENTS.md b/AGENTS.md
index 5ff8ff8d..4a8e9e59 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -2,7 +2,7 @@
 
 ## Project Structure & Modules
 
 - Rust workspace (`Cargo.toml`) with crates: `etl/` (core), `etl-api/` (HTTP API), `etl-postgres/`, `etl-destinations/`, `etl-replicator/`, `etl-config/`, `etl-telemetry/`, `etl-examples/`, `etl-benchmarks/`.
-- Docs in `docs/`; ops tooling in `scripts/` (Docker Compose, DB init, migrations).
+- Docs in `docs/`; development setup in `DEVELOPMENT.md`; ops tooling in `scripts/` (Docker Compose, DB init, migrations).
 - Tests live per crate (`src` unit tests, `tests` integration); benches in `etl-benchmarks/benches/`.
 
 ## Build and Test
diff --git a/DEVELOPMENT.md b/DEVELOPMENT.md
new file mode 100644
index 00000000..243f85f5
--- /dev/null
+++ b/DEVELOPMENT.md
@@ -0,0 +1,357 @@
+# Development Guide
+
+This guide covers setting up your development environment, running migrations, and common development workflows for the ETL project.
+
+## Table of Contents
+
+- [Prerequisites](#prerequisites)
+- [Quick Start](#quick-start)
+- [Database Setup](#database-setup)
+  - [Using the Setup Script](#using-the-setup-script)
+  - [Manual Setup](#manual-setup)
+- [Database Migrations](#database-migrations)
+  - [ETL API Migrations](#etl-api-migrations)
+  - [ETL Replicator Migrations](#etl-replicator-migrations)
+- [Running the Services](#running-the-services)
+- [Kubernetes Setup](#kubernetes-setup)
+- [Common Development Tasks](#common-development-tasks)
+
+## Prerequisites
+
+Before starting, ensure you have the following installed:
+
+### Required Tools
+
+- **Rust** (latest stable): [Install Rust](https://rustup.rs/)
+- **PostgreSQL client** (`psql`): Required for database operations
+- **Docker Compose**: For running PostgreSQL and other services
+- **kubectl**: For Kubernetes operations
+- **SQLx CLI**: For database migrations
+
+Install SQLx CLI:
+
+```bash
+cargo install --version='~0.7' sqlx-cli --no-default-features --features rustls,postgres
+```
+
+### Optional Tools
+
+- **OrbStack**: Recommended for local Kubernetes development (alternative to Docker Desktop)
+  - [Install OrbStack](https://orbstack.dev)
+  - Enable Kubernetes in OrbStack settings
+
+## Quick Start
+
+The fastest way to get started is using the setup script:
+
+```bash
+# From the project root
+./scripts/init.sh
+```
+
+This script will:
+
+1. Start PostgreSQL via Docker Compose
+2. Run etl-api migrations
+3. Seed the default replicator image
+4. Configure the Kubernetes environment (OrbStack)
+
+## Database Setup
+
+### Using the Setup Script
+
+The `scripts/init.sh` script provides a complete development environment setup:
+
+```bash
+# Use default settings (Postgres on port 5430)
+./scripts/init.sh
+
+# Customize database settings
+POSTGRES_PORT=5432 POSTGRES_DB=mydb ./scripts/init.sh
+
+# Skip Docker if you already have Postgres running
+SKIP_DOCKER=1 ./scripts/init.sh
+
+# Use persistent storage
+POSTGRES_DATA_VOLUME=/path/to/data ./scripts/init.sh
+```
+
+**Environment Variables:**
+
+| Variable | Default | Description |
+|----------|---------|-------------|
+| `POSTGRES_USER` | `postgres` | Database user |
+| `POSTGRES_PASSWORD` | `postgres` | Database password |
+| `POSTGRES_DB` | `postgres` | Database name |
+| `POSTGRES_PORT` | `5430` | Database port |
+| `POSTGRES_HOST` | `localhost` | Database host |
+| `SKIP_DOCKER` | (empty) | Skip Docker Compose if set |
+| `POSTGRES_DATA_VOLUME` | (empty) | Path for persistent storage |
+| `REPLICATOR_IMAGE` | `ramsup/etl-replicator:latest` | Default replicator image |
+
+### Manual Setup
+
+If you prefer manual setup or have an existing PostgreSQL instance:
+
+1. **Set the database URL:**
+
+```bash
+export DATABASE_URL=postgres://USER:PASSWORD@HOST:PORT/DB
+```
+
+2. **Run etl-api migrations:**
+
+```bash
+./etl-api/scripts/run_migrations.sh
+```
+
+3. **Run etl-replicator migrations:**
+
+```bash
+./etl-replicator/scripts/run_migrations.sh
+```
+
+## Database Migrations
+
+The project uses SQLx for database migrations. There are two sets of migrations:
+
+### ETL API Migrations
+
+Located in `etl-api/migrations/`, these create the control plane schema (`app` schema) for managing tenants, sources, destinations, and pipelines.
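+The migration workflows in this guide all read the same `POSTGRES_*` variables described in the table above. As a rough sketch (mirroring what the repository's migration scripts do; the specific values are the documented defaults, not requirements), the `DATABASE_URL` they operate on can be composed like this:

```shell
# Compose DATABASE_URL from the POSTGRES_* variables, falling back to the
# documented defaults when a variable is unset.
DB_USER="${POSTGRES_USER:-postgres}"
DB_PASSWORD="${POSTGRES_PASSWORD:-postgres}"
DB_NAME="${POSTGRES_DB:-postgres}"
DB_PORT="${POSTGRES_PORT:-5430}"
DB_HOST="${POSTGRES_HOST:-localhost}"
export DATABASE_URL="postgres://${DB_USER}:${DB_PASSWORD}@${DB_HOST}:${DB_PORT}/${DB_NAME}"
echo "${DATABASE_URL}"
```

With none of the variables set, this prints `postgres://postgres:postgres@localhost:5430/postgres`.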
+
+**Running API migrations:**
+
+```bash
+# From project root
+./etl-api/scripts/run_migrations.sh
+
+# Or manually with SQLx CLI
+sqlx migrate run --source etl-api/migrations
+```
+
+**Creating a new API migration:**
+
+```bash
+cd etl-api
+sqlx migrate add <migration_name>
+```
+
+**Resetting the API database:**
+
+```bash
+cd etl-api
+sqlx migrate revert
+```
+
+**Updating SQLx metadata after schema changes:**
+
+```bash
+cd etl-api
+cargo sqlx prepare
+```
+
+### ETL Replicator Migrations
+
+Located in `etl-replicator/migrations/`, these create the replicator's state store schema (`etl` schema) for tracking replication state, table schemas, and mappings.
+
+**Running replicator migrations:**
+
+```bash
+# From project root
+./etl-replicator/scripts/run_migrations.sh
+
+# Or manually with SQLx CLI (requires setting search_path)
+psql $DATABASE_URL -c "create schema if not exists etl;"
+sqlx migrate run --source etl-replicator/migrations --database-url "${DATABASE_URL}?options=-csearch_path%3Detl"
+```
+
+**Note:** The replicator migrations are also run automatically when the replicator starts (see `etl-replicator/src/migrations.rs:16`). The script is useful for:
+
+- Pre-creating the state store schema
+- Testing migrations independently
+- CI/CD pipelines
+- Setting up replicator state on new databases
+
+**Creating a new replicator migration:**
+
+```bash
+cd etl-replicator
+sqlx migrate add <migration_name>
+```
+
+## Running the Services
+
+### ETL API
+
+```bash
+cd etl-api
+cargo run
+```
+
+The API requires the `DATABASE_URL` environment variable and a valid configuration file. See `etl-api/README.md` for configuration details.
+
+### ETL Replicator
+
+The replicator is typically deployed as a Kubernetes pod, but can be run locally for testing:
+
+```bash
+cd etl-replicator
+cargo run
+```
+
+## Kubernetes Setup
+
+The project uses Kubernetes for deploying replicators. The setup script configures the necessary resources.
+
+**Prerequisites:**
+
+- OrbStack with Kubernetes enabled (or another local Kubernetes cluster)
+- `kubectl` configured with the `orbstack` context
+
+**Manual Kubernetes setup:**
+
+```bash
+kubectl --context orbstack apply -f scripts/etl-data-plane.yaml
+kubectl --context orbstack apply -f scripts/trusted-root-certs-config.yaml
+```
+
+**Checking deployed resources:**
+
+```bash
+# List replicator pods
+kubectl get pods -n etl-control-plane -l app=etl-api
+
+# View logs
+kubectl logs -n etl-control-plane -l app=etl-api --tail=100
+
+# Describe a specific pod
+kubectl describe pod <pod-name> -n etl-control-plane
+```
+
+## Common Development Tasks
+
+### Running Tests
+
+```bash
+# Run all tests
+cargo test --workspace
+
+# Run tests for a specific crate
+cargo test -p etl-api
+cargo test -p etl-replicator
+
+# Run with all features enabled
+cargo test --workspace --all-features
+```
+
+### Building the Project
+
+```bash
+# Build all crates
+cargo build --workspace
+
+# Build in release mode
+cargo build --workspace --release
+
+# Build a specific crate
+cargo build -p etl-api
+```
+
+### Checking Code
+
+```bash
+# Run clippy for linting
+cargo clippy --workspace --all-targets
+
+# Format code
+cargo fmt --all
+
+# Check without building
+cargo check --workspace
+```
+
+### Docker Images
+
+```bash
+# Build the API image
+docker build -f etl-api/Dockerfile -t etl-api:dev .
+
+# Build the replicator image
+docker build -f etl-replicator/Dockerfile -t etl-replicator:dev .
+```
+
+### Viewing Logs
+
+```bash
+# Docker Compose logs
+docker-compose -f scripts/docker-compose.yaml logs -f
+
+# PostgreSQL logs specifically
+docker-compose -f scripts/docker-compose.yaml logs -f postgres
+```
+
+### Cleaning Up
+
+```bash
+# Stop Docker Compose services
+docker-compose -f scripts/docker-compose.yaml down
+
+# Remove volumes (WARNING: deletes data)
+docker-compose -f scripts/docker-compose.yaml down -v
+
+# Clean Rust build artifacts
+cargo clean
+```
+
+## Troubleshooting
+
+### Database Connection Issues
+
+If you encounter connection issues:
+
+1. Verify PostgreSQL is running:
+   ```bash
+   docker-compose -f scripts/docker-compose.yaml ps
+   ```
+
+2. Check the connection:
+   ```bash
+   psql $DATABASE_URL -c "SELECT 1;"
+   ```
+
+3. Ensure the correct port is used (default: 5430)
+
+### Migration Issues
+
+If migrations fail:
+
+1. Check if the database exists:
+   ```bash
+   psql $DATABASE_URL -c "\l"
+   ```
+
+2. Verify SQLx CLI is installed:
+   ```bash
+   sqlx --version
+   ```
+
+3. Check migration history:
+   ```bash
+   psql $DATABASE_URL -c "SELECT * FROM _sqlx_migrations;"
+   ```
+
+### Kubernetes Issues
+
+If Kubernetes resources aren't deploying:
+
+1. Verify context:
+   ```bash
+   kubectl config current-context
+   ```
+
+2. Check cluster status:
+   ```bash
+   kubectl cluster-info
+   ```
+
+3. View events:
+   ```bash
+   kubectl get events -n etl-control-plane --sort-by='.lastTimestamp'
+   ```
diff --git a/README.md b/README.md
index ef8a63bc..9acf7bfe 100644
--- a/README.md
+++ b/README.md
@@ -112,6 +112,10 @@ async fn main() -> Result<(), Box<dyn std::error::Error>> {
 
 For tutorials and deeper guidance, see the [Documentation](https://supabase.github.io/etl) or jump into the [examples](etl-examples/README.md).
 
+## Development
+
+See [DEVELOPMENT.md](DEVELOPMENT.md) for setup instructions, migration workflows, and development guidelines.
+
 ## Destinations
 
 ETL is designed to be extensible. You can implement your own destinations, and the project currently ships with the following maintained options:
diff --git a/etl-replicator/scripts/run_migrations.sh b/etl-replicator/scripts/run_migrations.sh
new file mode 100755
index 00000000..6786f6b5
--- /dev/null
+++ b/etl-replicator/scripts/run_migrations.sh
@@ -0,0 +1,48 @@
+#!/usr/bin/env bash
+set -eo pipefail
+
+if [ ! -d "etl-replicator/migrations" ]; then
+  echo >&2 "❌ Error: 'etl-replicator/migrations' folder not found."
+  echo >&2 "Please run this script from the 'etl' directory."
+  exit 1
+fi
+
+if ! [ -x "$(command -v sqlx)" ]; then
+  echo >&2 "❌ Error: SQLx CLI is not installed."
+  echo >&2 "To install it, run:"
+  echo >&2 "  cargo install --version='~0.7' sqlx-cli --no-default-features --features rustls,postgres"
+  exit 1
+fi
+
+if ! [ -x "$(command -v psql)" ]; then
+  echo >&2 "❌ Error: Postgres client (psql) is not installed."
+  echo >&2 "Please install it using your system's package manager."
+  exit 1
+fi
+
+# Database configuration
+DB_USER="${POSTGRES_USER:=postgres}"
+DB_PASSWORD="${POSTGRES_PASSWORD:=postgres}"
+DB_NAME="${POSTGRES_DB:=postgres}"
+DB_PORT="${POSTGRES_PORT:=5430}"
+DB_HOST="${POSTGRES_HOST:=localhost}"
+
+# Set up the database URL
+export DATABASE_URL=postgres://${DB_USER}:${DB_PASSWORD}@${DB_HOST}:${DB_PORT}/${DB_NAME}
+
+echo "🔄 Running replicator state store migrations..."
+
+# Create the etl schema if it doesn't exist
+# This matches the behavior in etl-replicator/src/migrations.rs
+psql "${DATABASE_URL}" -v ON_ERROR_STOP=1 -c "create schema if not exists etl;" > /dev/null
+
+# Create a temporary sqlx-cli compatible database URL that sets the search_path
+# This ensures the _sqlx_migrations table is created in the etl schema
+SQLX_MIGRATIONS_OPTS="options=-csearch_path%3Detl"
+MIGRATION_URL="${DATABASE_URL}?${SQLX_MIGRATIONS_OPTS}"
+
+# Run migrations with the modified URL
+sqlx database create --database-url "${DATABASE_URL}"
+sqlx migrate run --source etl-replicator/migrations --database-url "${MIGRATION_URL}"
+
+echo "✨ Replicator state store migrations complete! Ready to go!"
From f6e0fd18e2c18062176f89945012b71c718ebe6a Mon Sep 17 00:00:00 2001
From: Riccardo Busetti
Date: Wed, 19 Nov 2025 11:13:13 +0100
Subject: [PATCH 03/12] Improve

---
 DEVELOPMENT.md                      | 75 -----------------------------
 docs/how-to/index.md                |  6 ---
 docs/how-to/postgres-state-store.md | 32 ------------
 docs/index.md                       |  1 -
 4 files changed, 114 deletions(-)
 delete mode 100644 docs/how-to/postgres-state-store.md

diff --git a/DEVELOPMENT.md b/DEVELOPMENT.md
index 243f85f5..7cc4349c 100644
--- a/DEVELOPMENT.md
+++ b/DEVELOPMENT.md
@@ -225,81 +225,6 @@ kubectl logs -n etl-control-plane -l app=etl-api --tail=100
 kubectl describe pod <pod-name> -n etl-control-plane
 ```
 
-## Common Development Tasks
-
-### Running Tests
-
-```bash
-# Run all tests
-cargo test --workspace
-
-# Run tests for a specific crate
-cargo test -p etl-api
-cargo test -p etl-replicator
-
-# Run with all features enabled
-cargo test --workspace --all-features
-```
-
-### Building the Project
-
-```bash
-# Build all crates
-cargo build --workspace
-
-# Build in release mode
-cargo build --workspace --release
-
-# Build a specific crate
-cargo build -p etl-api
-```
-
-### Checking Code
-
-```bash
-# Run clippy for linting
-cargo clippy --workspace --all-targets
-
-# Format code
-cargo fmt --all
-
-# Check without building
-cargo check --workspace
-```
-
-### Docker Images
-
-```bash
-# Build the API image
-docker build -f etl-api/Dockerfile -t etl-api:dev .
-
-# Build the replicator image
-docker build -f etl-replicator/Dockerfile -t etl-replicator:dev .
-```
-
-### Viewing Logs
-
-```bash
-# Docker Compose logs
-docker-compose -f scripts/docker-compose.yaml logs -f
-
-# PostgreSQL logs specifically
-docker-compose -f scripts/docker-compose.yaml logs -f postgres
-```
-
-### Cleaning Up
-
-```bash
-# Stop Docker Compose services
-docker-compose -f scripts/docker-compose.yaml down
-
-# Remove volumes (WARNING: deletes data)
-docker-compose -f scripts/docker-compose.yaml down -v
-
-# Clean Rust build artifacts
-cargo clean
-```
-
 ## Troubleshooting
 
 ### Database Connection Issues
diff --git a/docs/how-to/index.md b/docs/how-to/index.md
index 6d1bf049..67eac066 100644
--- a/docs/how-to/index.md
+++ b/docs/how-to/index.md
@@ -12,12 +12,6 @@ Set up Postgres with the correct settings, and publications for ETL pipelines.
 
 **When to use:** Setting up a new Postgres source for replication.
 
-### [Apply Postgres State Store Migrations](postgres-state-store.md)
-
-Create the `etl` schema, replication state tables, and related objects required by `PostgresStore`.
-
-**When to use:** Before running a pipeline that uses the Postgres-backed state or schema stores.
-
 ## Next Steps
 
 After solving your immediate problem:
diff --git a/docs/how-to/postgres-state-store.md b/docs/how-to/postgres-state-store.md
deleted file mode 100644
index aafd76eb..00000000
--- a/docs/how-to/postgres-state-store.md
+++ /dev/null
@@ -1,32 +0,0 @@
-# Apply Postgres State Store Migrations
-
-**Prepare the Postgres-backed state store before running pipelines**
-
-`PostgresStore` (and the matching schema store) keep replication metadata inside your own Postgres database. The tables live in the `etl` schema and must be created before a pipeline starts, otherwise you will see errors such as `relation "etl.table_mappings" does not exist`.
-
-Follow these steps whenever you configure a Postgres-backed store.
-
-## 1. Pick the database and user
-
-- Choose the Postgres database that should store ETL metadata (often separate from the source database).
-- Ensure the user credentials configured in `PgConnectionConfig` have privileges to create schemas, tables, and indexes in that database.
-
-## 2. Apply the migrations
-
-All SQL migrations for the Postgres store reside in `etl-replicator/migrations/`. Apply them in order (they are timestamp-prefixed) using your preferred tooling. With `psql`:
-
-```bash
-cd /path/to/etl
-psql "postgres://user:password@host:port/database" -f etl-replicator/migrations/20250827000000_base.sql
-```
-
-If additional migration files appear in that directory, run them sequentially (for example with `ls etl-replicator/migrations/*.sql | sort | xargs -I{} psql -f {}`) before restarting your pipeline.
-
-## 3. Verify the schema
-
-After applying the migrations:
-
-- Confirm the `etl` schema exists.
-- Check that tables like `replication_state`, `table_mappings`, and `schema_definitions` are present.
-
-You can now safely configure `PostgresStore`/`PostgresSchemaStore` in your pipeline. Future migrations can be applied on top.
diff --git a/docs/index.md b/docs/index.md
index 9f6e483c..c33df729 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -91,7 +91,6 @@ async fn main() -> Result<(), Box<dyn std::error::Error>> {
 
 - **First time using ETL?** → Start with [Build your first pipeline](tutorials/first-pipeline.md)
 - **Need Postgres setup help?** → Check [Configure Postgres for Replication](how-to/configure-postgres.md)
-- **Using Postgres for state storage?** → Follow [Apply Postgres state store migrations](how-to/postgres-state-store.md)
 - **Need technical details?** → Check the [Reference](reference/index.md)
 - **Want to understand the architecture?** → Read [ETL Architecture](explanation/architecture.md)
From 415f143d91ae528de20db8d6776dfb25796eee24 Mon Sep 17 00:00:00 2001
From: Riccardo Busetti
Date: Wed, 19 Nov 2025 11:18:19 +0100
Subject: [PATCH 04/12] Improve

---
 README.md                        | 37 ++++++++++++++++----------------
 docs/explanation/architecture.md |  2 +-
 2 files changed, 20 insertions(+), 19 deletions(-)

diff --git a/README.md b/README.md
index 9acf7bfe..703ef7ae 100644
--- a/README.md
+++ b/README.md
@@ -40,13 +40,14 @@
 
 ETL is a Rust framework by [Supabase](https://supabase.com) for building high‑performance, real‑time data replication apps on Postgres. It sits on top of Postgres [logical replication](https://www.postgresql.org/docs/current/protocol-logical-replication.html) and gives you a clean, Rust‑native API for streaming changes to your own destinations.
 
-## Highlights
+## Features
 
-- **Real‑time replication**: stream changes in real time to your own destinations.
-- **High performance**: configurable batching and parallelism to maximize throughput.
-- **Fault-tolerant**: robust error handling and retry logic built-in.
-- **Extensible**: implement your own custom destinations and state/schema stores.
-- **Rust native**: typed and ergonomic Rust API.
+- **Real‑time replication**: stream changes in real time to your own destinations
+- **High performance**: configurable batching and parallelism to maximize throughput
+- **Fault-tolerant**: robust error handling and retry logic built-in
+- **Extensible**: implement your own custom destinations and state/schema stores
+- **Production destinations**: BigQuery and Apache Iceberg officially supported
+- **Type-safe**: fully typed Rust API with compile-time guarantees
 
 ## Requirements
@@ -102,9 +103,12 @@ async fn main() -> Result<(), Box<dyn std::error::Error>> {
         max_table_sync_workers: 4,
     };
 
+    // Start the pipeline.
     let mut pipeline = Pipeline::new(config, store, destination);
     pipeline.start().await?;
-    // pipeline.wait().await?; // Optional: block until completion
+
+    // Wait for the pipeline indefinitely.
+    pipeline.wait().await?;
 
     Ok(())
 }
@@ -112,31 +116,28 @@ async fn main() -> Result<(), Box<dyn std::error::Error>> {
 
 For tutorials and deeper guidance, see the [Documentation](https://supabase.github.io/etl) or jump into the [examples](etl-examples/README.md).
 
-## Development
-
-See [DEVELOPMENT.md](DEVELOPMENT.md) for setup instructions, migration workflows, and development guidelines.
-
 ## Destinations
 
 ETL is designed to be extensible. You can implement your own destinations, and the project currently ships with the following maintained options:
 
-- **BigQuery** – full CRUD-capable replication for analytics workloads.
-- **Apache Iceberg** – append-only log of operations today (no in-place updates yet).
+- **BigQuery** – full CRUD-capable replication for analytics workloads +- **Apache Iceberg** – append-only log of operations (updates coming soon) Enable the destinations you need through the `etl-destinations` crate: ```toml [dependencies] etl = { git = "https://github.com/supabase/etl" } -etl-destinations = { git = "https://github.com/supabase/etl", features = ["bigquery", "iceberg"] } +etl-destinations = { git = "https://github.com/supabase/etl", features = ["bigquery"] } ``` +## Development + +See [DEVELOPMENT.md](DEVELOPMENT.md) for setup instructions, migration workflows, and development guidelines. + ## Contributing -We welcome pull requests and GitHub issues. That said, we currently cannot accept new custom destinations unless there -is significant community demand. Each destination carries a high long-term maintenance cost, and we are prioritizing core stability, -observability, and ergonomics. If you need a destination that is not yet supported, please start a discussion or issue so we can gauge demand -before proposing an implementation. +We welcome pull requests and GitHub issues. We currently cannot accept new custom destinations unless there is significant community demand, as each destination carries a long-term maintenance cost. We are prioritizing core stability, observability, and ergonomics. If you need a destination that is not yet supported, please start a discussion or issue so we can gauge demand before proposing an implementation. ## License diff --git a/docs/explanation/architecture.md b/docs/explanation/architecture.md index dce61e04..6d090960 100644 --- a/docs/explanation/architecture.md +++ b/docs/explanation/architecture.md @@ -25,7 +25,7 @@ flowchart LR end subgraph Destination[Destination] - Dest["BigQuery
Apache Iceberg
Custom API"] + Dest["BigQuery
Apache Iceberg
Custom"] end subgraph Store[Store] From 8d43c1df09b02991ff5e67025506f64f8377fd49 Mon Sep 17 00:00:00 2001 From: Riccardo Busetti Date: Wed, 19 Nov 2025 11:20:09 +0100 Subject: [PATCH 05/12] Improve --- docs/index.md | 42 +++++++++++++++++++----------------------- 1 file changed, 19 insertions(+), 23 deletions(-) diff --git a/docs/index.md b/docs/index.md index c33df729..54fd6304 100644 --- a/docs/index.md +++ b/docs/index.md @@ -2,7 +2,7 @@ **Build real-time Postgres replication applications in Rust** -ETL is a Rust framework by [Supabase](https://supabase.com) that enables you to build high-performance, real-time data replication applications for Postgres. Whether you're creating ETL pipelines, implementing CDC (Change Data Capture), or building custom data synchronization solutions, ETL provides the building blocks you need. +ETL is a Rust framework by [Supabase](https://supabase.com) for building high‑performance, real‑time data replication apps on Postgres. It sits on top of Postgres logical replication and gives you a clean, Rust‑native API for streaming changes to your own destinations. ## Getting Started @@ -29,15 +29,14 @@ Read our **[Explanations](explanation/index.md)** for deeper insights: - [ETL architecture overview](explanation/architecture.md) - More explanations coming soon -## Core Concepts +## Features -**Postgres Logical Replication** streams data changes from Postgres databases in real-time using the Write-Ahead Log (WAL). 
ETL builds on this foundation to provide:
-
-- **Real-time replication** - Stream changes as they happen
-- **Multiple destinations** - BigQuery and Apache Iceberg officially supported
-- **Fault tolerance** - Built-in error handling and recovery
-- **High performance** - Efficient batching and parallel processing
-- **Extensible** - Plugin architecture for custom destinations
+- **Real‑time replication**: stream changes in real time to your own destinations
+- **High performance**: configurable batching and parallelism to maximize throughput
+- **Fault-tolerant**: robust error handling and retry logic built-in
+- **Extensible**: implement your own custom destinations and state/schema stores
+- **Production destinations**: BigQuery and Apache Iceberg officially supported
+- **Type-safe**: fully typed Rust API with compile-time guarantees
 
 ## Quick Example
 
@@ -51,36 +50,33 @@ use etl::{
 
 #[tokio::main]
 async fn main() -> Result<(), Box<dyn std::error::Error>> {
-    // Configure Postgres connection
-    let pg_config = PgConnectionConfig {
-        host: "localhost".to_string(),
+    let pg = PgConnectionConfig {
+        host: "localhost".into(),
         port: 5432,
-        name: "mydb".to_string(),
-        username: "postgres".to_string(),
-        password: Some("password".to_string().into()),
+        name: "mydb".into(),
+        username: "postgres".into(),
+        password: Some("password".into()),
         tls: TlsConfig { enabled: false, trusted_root_certs: String::new() },
     };
 
-    // Create memory-based store and destination for testing
     let store = MemoryStore::new();
     let destination = MemoryDestination::new();
 
-    // Configure the pipeline
     let config = PipelineConfig {
         id: 1,
-        publication_name: "my_publication".to_string(),
-        pg_connection: pg_config,
+        publication_name: "my_publication".into(),
+        pg_connection: pg,
         batch: BatchConfig { max_size: 1000, max_fill_ms: 5000 },
-        table_error_retry_delay_ms: 10000,
+        table_error_retry_delay_ms: 10_000,
         table_error_retry_max_attempts: 5,
         max_table_sync_workers: 4,
     };
 
-    // Create and start the pipeline
+    // Start the 
pipeline.
     let mut pipeline = Pipeline::new(config, store, destination);
     pipeline.start().await?;
 
-    // Pipeline will run until stopped
+    // Wait for the pipeline indefinitely.
     pipeline.wait().await?;
 
     Ok(())
@@ -96,4 +92,4 @@ async fn main() -> Result<(), Box<dyn std::error::Error>> {
 
 ## Contributing
 
-Contributions and bug reports are welcome in the GitHub repository. At the moment we cannot accept new custom destination implementations unless a large portion of the community requests them, because every destination adds a long-lived maintenance burden and we are focusing engineering time on stability, observability, and ergonomics. Please open an issue or discussion first if you believe a new destination should be prioritized.
+We welcome pull requests and GitHub issues. We currently cannot accept new custom destinations unless there is significant community demand, as each destination carries a long-term maintenance cost. We are prioritizing core stability, observability, and ergonomics. If you need a destination that is not yet supported, please start a discussion or issue so we can gauge demand before proposing an implementation.

From 5c4669b1ba378e728abfdb212d33b2f49c989d74 Mon Sep 17 00:00:00 2001
From: Riccardo Busetti
Date: Wed, 19 Nov 2025 11:49:29 +0100
Subject: [PATCH 06/12] Improve

---
 docs/explanation/index.md | 2 +-
 docs/reference/index.md | 38 ++++++++++++++++++++++++++++++------
 docs/tutorials/index.md | 4 ++--
 3 files changed, 35 insertions(+), 9 deletions(-)

diff --git a/docs/explanation/index.md b/docs/explanation/index.md
index bec4c325..3ed9aa95 100644
--- a/docs/explanation/index.md
+++ b/docs/explanation/index.md
@@ -33,4 +33,4 @@ After building a conceptual understanding:
 ## Contributing to Explanations
 
 Found gaps in these explanations? See something that could be clearer?
-[Open an issue](https://github.com/supabase/etl/issues) or contribute improvements to help other users build better mental models of ETL.
+[Open an issue](https://github.com/supabase/etl/issues/new) or contribute improvements to help other users build better mental models of ETL. diff --git a/docs/reference/index.md b/docs/reference/index.md index df2c3c1b..7683b89c 100644 --- a/docs/reference/index.md +++ b/docs/reference/index.md @@ -1,14 +1,40 @@ - # Reference -Complete API documentation is available through Rust's built-in documentation system. We will publish comprehensive rustdoc documentation that covers all public APIs, traits, and configuration structures. -Right now the docs are accessible via the code or by running: -```shell +**Complete API documentation for ETL** + +API documentation is available through Rust's built-in documentation system. Generate and browse the complete API reference locally: + +```bash cargo doc --workspace --all-features --no-deps --open ``` +This opens comprehensive rustdoc documentation covering: +- All public APIs, traits, and structs +- Configuration types and options +- Store and destination trait definitions +- Code examples and method signatures + +## Key Traits + +The core extension points in ETL: + +- **`Destination`** - Implement to send data to custom destinations +- **`StateStore`** - Manage replication state and table mappings +- **`SchemaStore`** - Handle table schema information +- **`CleanupStore`** - Atomic cleanup operations for removed tables + +## Configuration Types + +Main configuration structures: + +- **`PipelineConfig`** - Complete pipeline configuration +- **`PgConnectionConfig`** - Postgres connection settings +- **`BatchConfig`** - Batching and performance tuning +- **`TlsConfig`** - TLS/SSL configuration + ## See Also - [How-to guides](../how-to/index.md) - Task-oriented instructions -- [Tutorials](../tutorials/index.md) - Learning-oriented lessons -- [Explanations](../explanation/index.md) - Understanding-oriented discussions \ No newline at end of file +- [Tutorials](../tutorials/index.md) - Learning-oriented lessons +- 
[Explanations](../explanation/index.md) - Understanding-oriented discussions +- [GitHub Repository](https://github.com/supabase/etl) - Source code and issues \ No newline at end of file diff --git a/docs/tutorials/index.md b/docs/tutorials/index.md index 8790cbb5..5b0e5aef 100644 --- a/docs/tutorials/index.md +++ b/docs/tutorials/index.md @@ -18,7 +18,7 @@ _What you'll build:_ A working pipeline that streams changes from a sample Postg ### [Build Custom Stores and Destinations](custom-implementations.md) -**45 minutes** • **Advanced** +**30 minutes** • **Advanced** Implement production-ready custom stores and destinations. Learn ETL's design patterns, build persistent storage, implement cleanup primitives for safe table removal, and create HTTP-based destinations with retry logic. @@ -52,4 +52,4 @@ If you get stuck: 1. Double-check the prerequisites 2. Ensure your Postgres setup matches the requirements 3. Check the [Postgres configuration guide](../how-to/configure-postgres.md) -4. [Open an issue](https://github.com/supabase/etl/issues) with your specific problem +4. [Open an issue](https://github.com/supabase/etl/issues/new) with your specific problem From 8ea8af29a21d58417157a41dd7f98198d26701ec Mon Sep 17 00:00:00 2001 From: Riccardo Busetti Date: Wed, 19 Nov 2025 11:59:31 +0100 Subject: [PATCH 07/12] Improve --- DEVELOPMENT.md | 13 +++++++++---- mkdocs.yaml | 1 - 2 files changed, 9 insertions(+), 5 deletions(-) diff --git a/DEVELOPMENT.md b/DEVELOPMENT.md index 7cc4349c..346c1639 100644 --- a/DEVELOPMENT.md +++ b/DEVELOPMENT.md @@ -164,11 +164,16 @@ psql $DATABASE_URL -c "create schema if not exists etl;" sqlx migrate run --source etl-replicator/migrations --database-url "${DATABASE_URL}?options=-csearch_path%3Detl" ``` -**Note:** The replicator migrations are also run automatically when the replicator starts (see `etl-replicator/src/migrations.rs:16`). 
The script is useful for: -- Pre-creating the state store schema +**Important:** Migrations are run automatically when using the `etl-replicator` binary (see `etl-replicator/src/migrations.rs:16`). However, if you integrate the `etl` crate directly into your own application as a library, you should run these migrations manually before starting your pipeline. This design decision ensures: +- The standalone replicator binary works out-of-the-box +- Library users have explicit control over when migrations run +- CI/CD pipelines can pre-apply migrations independently + +**When to run migrations manually:** +- Integrating `etl` as a library in your own application +- Pre-creating the state store schema before deployment - Testing migrations independently -- CI/CD pipelines -- Setting up replicator state on new databases +- CI/CD pipelines that separate migration and deployment steps **Creating a new replicator migration:** diff --git a/mkdocs.yaml b/mkdocs.yaml index 2966dade..047c82e4 100644 --- a/mkdocs.yaml +++ b/mkdocs.yaml @@ -15,7 +15,6 @@ nav: - How-to Guides: - Overview: how-to/index.md - Configure Postgres: how-to/configure-postgres.md - - Apply Postgres State Store Migrations: how-to/postgres-state-store.md - Reference: - Overview: reference/index.md - Explanation: From e83785e851e88faef4b4cb278ef7646cdcd8bba7 Mon Sep 17 00:00:00 2001 From: Riccardo Busetti Date: Wed, 19 Nov 2025 12:01:16 +0100 Subject: [PATCH 08/12] Improve --- DEVELOPMENT.md | 31 ++++++++++++++++++++++++++++--- 1 file changed, 28 insertions(+), 3 deletions(-) diff --git a/DEVELOPMENT.md b/DEVELOPMENT.md index 346c1639..c3185d1f 100644 --- a/DEVELOPMENT.md +++ b/DEVELOPMENT.md @@ -184,14 +184,37 @@ sqlx migrate add ## Running the Services +Both `etl-api` and `etl-replicator` binaries use hierarchical configuration loading from the `configuration/` directory within each crate. Configuration is loaded in this order: + +1. **Base configuration**: `configuration/base.yaml` (always loaded) +2. 
**Environment-specific**: `configuration/{environment}.yaml` (e.g., `dev.yaml`, `prod.yaml`) +3. **Environment variable overrides**: Prefixed with `APP_` (e.g., `APP_DATABASE__URL`) + +**Environment Selection:** + +The environment is determined by the `APP_ENVIRONMENT` variable: +- **Default**: `prod` (if `APP_ENVIRONMENT` is not set) +- **Available**: `dev`, `staging`, `prod` + +```bash +# Run with dev environment +APP_ENVIRONMENT=dev cargo run + +# Run with production environment (default) +cargo run + +# Override specific config values +APP_ENVIRONMENT=dev APP_DATABASE__URL=postgres://localhost/mydb cargo run +``` + ### ETL API ```bash cd etl-api -cargo run +APP_ENVIRONMENT=dev cargo run ``` -The API requires the `DATABASE_URL` environment variable and a valid configuration file. See `etl-api/README.md` for configuration details. +The API loads configuration from `etl-api/configuration/{environment}.yaml`. See `etl-api/README.md` for available configuration options. ### ETL Replicator @@ -199,9 +222,11 @@ The replicator is typically deployed as a Kubernetes pod, but can be run locally ```bash cd etl-replicator -cargo run +APP_ENVIRONMENT=dev cargo run ``` +The replicator loads configuration from `etl-replicator/configuration/{environment}.yaml`. + ## Kubernetes Setup The project uses Kubernetes for deploying replicators. The setup script configures the necessary resources. 
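The `APP_` override convention documented in the patch above maps a double underscore to one level of configuration nesting (e.g., `APP_DATABASE__URL` overrides `database.url`). A std-only Rust sketch of that naming convention; the real mapping is performed by the `config` crate inside `etl-config`, so this function is illustrative only:

```rust
/// Illustrative, std-only sketch of the `APP_` override convention:
/// strip the `APP_` prefix, split on `__` to get nesting levels, and
/// lowercase each segment to form a dotted configuration key.
fn env_var_to_config_key(var: &str) -> Option<String> {
    // Only variables with the APP_ prefix participate in overrides.
    let rest = var.strip_prefix("APP_")?;
    // A double underscore separates nesting levels; single underscores
    // stay part of the key segment.
    let key = rest
        .split("__")
        .map(str::to_lowercase)
        .collect::<Vec<_>>()
        .join(".");
    Some(key)
}

fn main() {
    // APP_DATABASE__URL overrides the nested `database.url` setting.
    assert_eq!(
        env_var_to_config_key("APP_DATABASE__URL").as_deref(),
        Some("database.url")
    );
    // Variables without the prefix are ignored entirely.
    assert_eq!(env_var_to_config_key("PATH"), None);
    println!("ok");
}
```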
From 547519fe55e86d297e03a8337a38b894c5284156 Mon Sep 17 00:00:00 2001 From: Riccardo Busetti Date: Wed, 19 Nov 2025 12:34:22 +0100 Subject: [PATCH 09/12] Improve --- DEVELOPMENT.md | 75 +++++++++++++++++++++++++++++++------------------- 1 file changed, 46 insertions(+), 29 deletions(-) diff --git a/DEVELOPMENT.md b/DEVELOPMENT.md index c3185d1f..6358b3cf 100644 --- a/DEVELOPMENT.md +++ b/DEVELOPMENT.md @@ -31,7 +31,7 @@ Before starting, ensure you have the following installed: Install SQLx CLI: ```bash -cargo install --version='~0.7' sqlx-cli --no-default-features --features rustls,postgres +cargo install --version='~0.8.6' sqlx-cli --no-default-features --features rustls,postgres ``` ### Optional Tools @@ -92,24 +92,42 @@ POSTGRES_DATA_VOLUME=/path/to/data ./scripts/init.sh If you prefer manual setup or have an existing PostgreSQL instance: -1. **Set the database URL:** +**Important:** The etl-api and etl-replicator migrations can run on **separate databases**. You might have: +- The etl-api using its own dedicated Postgres instance for the control plane +- The etl-replicator state store on the same database you're replicating from (source database) +- Or both on the same database (for simpler local development setups) -```bash -export DATABASE_URL=postgres://USER:PASSWORD@HOST:PORT/DB -``` +#### Single Database Setup -2. **Run etl-api migrations:** +If using one database for both the API and replicator state: ```bash +export DATABASE_URL=postgres://USER:PASSWORD@HOST:PORT/DB + +# Run both migrations on the same database ./etl-api/scripts/run_migrations.sh +./etl-replicator/scripts/run_migrations.sh ``` -3. 
**Run etl-replicator migrations:** +#### Separate Database Setup + +If using separate databases (recommended for production): ```bash +# API migrations on the control plane database +export DATABASE_URL=postgres://USER:PASSWORD@API_HOST:PORT/API_DB +./etl-api/scripts/run_migrations.sh + +# Replicator migrations on the source database +export DATABASE_URL=postgres://USER:PASSWORD@SOURCE_HOST:PORT/SOURCE_DB ./etl-replicator/scripts/run_migrations.sh ``` +This separation allows you to: +- Scale the control plane independently from replication workloads +- Keep the replicator state close to the source data +- Isolate concerns between infrastructure management and data replication + ## Database Migrations The project uses SQLx for database migrations. There are two sets of migrations: @@ -216,45 +234,44 @@ APP_ENVIRONMENT=dev cargo run The API loads configuration from `etl-api/configuration/{environment}.yaml`. See `etl-api/README.md` for available configuration options. -### ETL Replicator - -The replicator is typically deployed as a Kubernetes pod, but can be run locally for testing: +#### Kubernetes Setup (ETL API Only) -```bash -cd etl-replicator -APP_ENVIRONMENT=dev cargo run -``` - -The replicator loads configuration from `etl-replicator/configuration/{environment}.yaml`. - -## Kubernetes Setup - -The project uses Kubernetes for deploying replicators. The setup script configures the necessary resources. +The etl-api manages replicator deployments on Kubernetes by dynamically creating StatefulSets, Secrets, and ConfigMaps. The etl-api requires Kubernetes, but the **etl-replicator binary can run independently without any Kubernetes setup**. 
**Prerequisites:** - OrbStack with Kubernetes enabled (or another local Kubernetes cluster) - `kubectl` configured with the `orbstack` context +- Pre-defined Kubernetes resources (see below) -**Manual Kubernetes setup:** +**Required Pre-Defined Resources:** + +The etl-api expects these resources to exist before it can deploy replicators: + +1. **Namespace**: `etl-data-plane` - Where all replicator pods and related resources are created +2. **ConfigMap**: `trusted-root-certs-config` - Provides trusted root certificates for TLS connections + +These are defined in `scripts/` and should be applied before running the API: ```bash kubectl --context orbstack apply -f scripts/etl-data-plane.yaml kubectl --context orbstack apply -f scripts/trusted-root-certs-config.yaml ``` -**Checking deployed resources:** +**Note:** For the complete list of expected Kubernetes resources and their specifications, refer to the constants and resource creation logic in `etl-api/src/k8s/http.rs`. -```bash -# List replicator pods -kubectl get pods -n etl-control-plane -l app=etl-api +### ETL Replicator -# View logs -kubectl logs -n etl-control-plane -l app=etl-api --tail=100 +The replicator can run as a standalone binary without Kubernetes: -# Describe a specific pod -kubectl describe pod -n etl-control-plane +```bash +cd etl-replicator +APP_ENVIRONMENT=dev cargo run ``` +The replicator loads configuration from `etl-replicator/configuration/{environment}.yaml`. + +**Note:** While the replicator is typically deployed as a Kubernetes pod managed by the etl-api, it does not require Kubernetes to function. You can run it as a standalone process on any machine with the appropriate configuration. 
+ ## Troubleshooting ### Database Connection Issues From e105441be1bbc98467a2dbc3b17ef61199993d49 Mon Sep 17 00:00:00 2001 From: Riccardo Busetti Date: Wed, 19 Nov 2025 12:57:10 +0100 Subject: [PATCH 10/12] Improve --- DEVELOPMENT.md | 44 +++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 43 insertions(+), 1 deletion(-) diff --git a/DEVELOPMENT.md b/DEVELOPMENT.md index 6358b3cf..e3f8d88f 100644 --- a/DEVELOPMENT.md +++ b/DEVELOPMENT.md @@ -227,6 +227,8 @@ APP_ENVIRONMENT=dev APP_DATABASE__URL=postgres://localhost/mydb cargo run ### ETL API +#### Running from Source + ```bash cd etl-api APP_ENVIRONMENT=dev cargo run @@ -234,6 +236,25 @@ APP_ENVIRONMENT=dev cargo run The API loads configuration from `etl-api/configuration/{environment}.yaml`. See `etl-api/README.md` for available configuration options. +#### Running with Docker + +Docker images are available for the etl-api. You must mount the configuration files and can override settings via environment variables: + +```bash +docker run \ + -v $(pwd)/etl-api/configuration/base.yaml:/app/configuration/base.yaml \ + -v $(pwd)/etl-api/configuration/dev.yaml:/app/configuration/dev.yaml \ + -e APP_ENVIRONMENT=dev \ + -e APP_DATABASE__URL=postgres://host.docker.internal:5432/mydb \ + -p 8080:8080 \ + ramsup/etl-api:latest +``` + +**Configuration requirements:** +- Mount both `base.yaml` and your environment-specific config file (e.g., `dev.yaml`) +- Set `APP_ENVIRONMENT` to match your mounted environment file +- Override specific values using `APP_` prefixed environment variables + #### Kubernetes Setup (ETL API Only) The etl-api manages replicator deployments on Kubernetes by dynamically creating StatefulSets, Secrets, and ConfigMaps. The etl-api requires Kubernetes, but the **etl-replicator binary can run independently without any Kubernetes setup**. 
@@ -261,7 +282,9 @@ kubectl --context orbstack apply -f scripts/trusted-root-certs-config.yaml ### ETL Replicator -The replicator can run as a standalone binary without Kubernetes: +The replicator can run as a standalone binary without Kubernetes. + +#### Running from Source ```bash cd etl-replicator @@ -270,6 +293,25 @@ APP_ENVIRONMENT=dev cargo run The replicator loads configuration from `etl-replicator/configuration/{environment}.yaml`. +#### Running with Docker + +Docker images are available for the etl-replicator. You must mount the configuration files and can override settings via environment variables: + +```bash +docker run \ + -v $(pwd)/etl-replicator/configuration/base.yaml:/app/configuration/base.yaml \ + -v $(pwd)/etl-replicator/configuration/dev.yaml:/app/configuration/dev.yaml \ + -e APP_ENVIRONMENT=dev \ + -e APP_SOURCE__HOST=host.docker.internal \ + -e APP_SOURCE__PASSWORD=mysecret \ + etl-replicator:latest +``` + +**Configuration requirements:** +- Mount both `base.yaml` and your environment-specific config file (e.g., `dev.yaml`) +- Set `APP_ENVIRONMENT` to match your mounted environment file +- Override specific values using `APP_` prefixed environment variables + **Note:** While the replicator is typically deployed as a Kubernetes pod managed by the etl-api, it does not require Kubernetes to function. You can run it as a standalone process on any machine with the appropriate configuration. 
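The `APP_ENVIRONMENT` selection rule used by both binaries (and referenced in the Docker instructions above) can be sketched in std-only Rust. The `prod` default and the `{environment}.yaml` naming follow the docs in this patch series; the function name itself is illustrative, not part of the crate API:

```rust
use std::env;

/// Std-only sketch of environment selection: `APP_ENVIRONMENT` chooses
/// the `{environment}.yaml` file to load, defaulting to `prod` when the
/// variable is unset. The real logic lives in `etl-config`.
fn environment_config_file() -> String {
    let environment = env::var("APP_ENVIRONMENT").unwrap_or_else(|_| "prod".to_string());
    format!("{environment}.yaml")
}

fn main() {
    // With APP_ENVIRONMENT unset, the production default applies.
    env::remove_var("APP_ENVIRONMENT");
    assert_eq!(environment_config_file(), "prod.yaml");

    // APP_ENVIRONMENT=dev selects the dev configuration file.
    env::set_var("APP_ENVIRONMENT", "dev");
    assert_eq!(environment_config_file(), "dev.yaml");
    println!("ok");
}
```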
## Troubleshooting From 80f9caca96a4045a6eb673cf50b39e41ec92f746 Mon Sep 17 00:00:00 2001 From: Riccardo Busetti Date: Wed, 19 Nov 2025 13:05:00 +0100 Subject: [PATCH 11/12] Improve --- DEVELOPMENT.md | 3 --- etl-config/src/load.rs | 2 +- 2 files changed, 1 insertion(+), 4 deletions(-) diff --git a/DEVELOPMENT.md b/DEVELOPMENT.md index e3f8d88f..53818284 100644 --- a/DEVELOPMENT.md +++ b/DEVELOPMENT.md @@ -245,7 +245,6 @@ docker run \ -v $(pwd)/etl-api/configuration/base.yaml:/app/configuration/base.yaml \ -v $(pwd)/etl-api/configuration/dev.yaml:/app/configuration/dev.yaml \ -e APP_ENVIRONMENT=dev \ - -e APP_DATABASE__URL=postgres://host.docker.internal:5432/mydb \ -p 8080:8080 \ ramsup/etl-api:latest ``` @@ -302,8 +301,6 @@ docker run \ -v $(pwd)/etl-replicator/configuration/base.yaml:/app/configuration/base.yaml \ -v $(pwd)/etl-replicator/configuration/dev.yaml:/app/configuration/dev.yaml \ -e APP_ENVIRONMENT=dev \ - -e APP_SOURCE__HOST=host.docker.internal \ - -e APP_SOURCE__PASSWORD=mysecret \ etl-replicator:latest ``` diff --git a/etl-config/src/load.rs b/etl-config/src/load.rs index bf358a2c..44f48471 100644 --- a/etl-config/src/load.rs +++ b/etl-config/src/load.rs @@ -78,7 +78,7 @@ where // Add in settings from the base configuration file. .add_source(config::File::from( configuration_directory.join(BASE_CONFIG_FILE), - )) + ).format(config::FileFormat::Yaml)) // Add in settings from the environment-specific file. 
.add_source(config::File::from( configuration_directory.join(environment_filename), From 8fed2562d83d8b62933c7930cb8d641be8e2624c Mon Sep 17 00:00:00 2001 From: Riccardo Busetti Date: Wed, 19 Nov 2025 13:07:36 +0100 Subject: [PATCH 12/12] Improve --- etl-config/src/load.rs | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/etl-config/src/load.rs b/etl-config/src/load.rs index 44f48471..bf358a2c 100644 --- a/etl-config/src/load.rs +++ b/etl-config/src/load.rs @@ -78,7 +78,7 @@ where // Add in settings from the base configuration file. .add_source(config::File::from( configuration_directory.join(BASE_CONFIG_FILE), - ).format(config::FileFormat::Yaml)) + )) // Add in settings from the environment-specific file. .add_source(config::File::from( configuration_directory.join(environment_filename),
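The source ordering in the `load.rs` hunks above implies a precedence: the base file is added first, then the environment-specific file, then `APP_`-prefixed environment variables, and later sources override earlier ones. A std-only sketch of that merge order, using flat string maps with hypothetical keys (the actual layering is handled by the `config` crate's builder):

```rust
use std::collections::HashMap;

/// Std-only sketch of layered precedence:
/// base.yaml < {environment}.yaml < APP_* variables.
/// Later layers overwrite earlier ones, mirroring the order in which
/// sources are added to the configuration builder.
fn merge_layers(layers: &[HashMap<&str, &str>]) -> HashMap<String, String> {
    let mut merged = HashMap::new();
    for layer in layers {
        // Insertion order follows precedence: the last layer wins.
        for (key, value) in layer {
            merged.insert(key.to_string(), value.to_string());
        }
    }
    merged
}

fn main() {
    // Hypothetical keys, chosen only to demonstrate the override order.
    let base = HashMap::from([("batch.max_size", "1000"), ("batch.max_fill_ms", "5000")]);
    let environment = HashMap::from([("batch.max_size", "100")]);
    let env_vars = HashMap::from([("batch.max_fill_ms", "500")]);

    let merged = merge_layers(&[base, environment, env_vars]);
    // The environment file overrode the base value...
    assert_eq!(merged["batch.max_size"], "100");
    // ...and the environment variable overrode both.
    assert_eq!(merged["batch.max_fill_ms"], "500");
    println!("ok");
}
```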