feat(experimental): Rework schema handling with replication masks #476
| pg_escape = { version = "0.1.1", default-features = false } | ||
| pin-project-lite = { version = "0.2.16", default-features = false } | ||
| postgres-replication = { git = "https://github.com/MaterializeInc/rust-postgres", default-features = false, rev = "c4b473b478b3adfbf8667d2fbe895d8423f1290b" } | ||
| postgres-replication = { git = "https://github.com/iambriccardo/rust-postgres", default-features = false, rev = "31acf55c7e5c2244e5bb3a36e7afa2a01bf52c38" } |
Used my fork, which supports `Message` logical replication messages.
| // Run replicator migrations to create the state store tables. | ||
| sqlx::migrate!("../etl-replicator/migrations") | ||
| // Run migrations to create the etl tables. | ||
| sqlx::migrate!("../etl/migrations") |
Decided to move the migrations into etl itself, since they are now required for ETL to work regardless of which store implementation is used.
| /// The 1-based ordinal position of the column in the table. | ||
| pub ordinal_position: i32, | ||
| /// The 1-based ordinal position of this column in the primary key, or None if not a primary key. | ||
| pub primary_key_ordinal_position: Option<i32>, |
This is used to properly create a composite primary key definition on the destination.
Summary
This PR introduces replication masks, a new mechanism for handling table schemas in ETL that decouples column-level filtering from schema loading.
Motivation
The key insight is that we can load the entire table schema independently of column-level filtering in replication, then rely on `Relation` messages to determine which columns to actually replicate.
Changes
Replication Masks
A replication mask is a bitmask that determines which columns of a `TableSchema` are actively replicated at any given time. Creating a mask requires:
- The columns being replicated (from the `Relation` message)
- The `TableSchema` of the table (we assume that the last stored table schema is in sync with the incoming `Relation` message, so matching by column name is sufficient)

These are combined in `ReplicatedTableSchema`, a wrapper type that exposes only the actively replicated columns on top of a stable `TableSchema`. This allows columns to be added to or removed from a publication without breaking the pipeline (assuming the destination supports missing column data; BigQuery and Iceberg will currently fail).
Destination Schema Handling
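As a rough sketch of what a destination receives with each event, the mask and wrapper described under Replication Masks could look like the following. The struct layouts and the names `ReplicationMask`, `from_relation_columns`, and `replicated_columns` are assumptions made for illustration, not the PR's actual definitions. A `Vec<bool>` stands in for the bitmask to keep the example short.

```rust
use std::collections::HashSet;

/// Hypothetical, simplified stand-in for a column definition.
#[derive(Debug, Clone, PartialEq)]
struct ColumnSchema {
    name: String,
}

/// Hypothetical, simplified stand-in for the full table schema.
#[derive(Debug, Clone)]
struct TableSchema {
    columns: Vec<ColumnSchema>,
}

/// One flag per column of the full schema; `true` means the column is replicated.
/// (Assumption: a `Vec<bool>` is used here instead of a packed bitmask.)
#[derive(Debug, Clone, PartialEq)]
struct ReplicationMask(Vec<bool>);

impl ReplicationMask {
    /// Builds the mask by matching the column names taken from a `Relation`
    /// message against the columns of the stored schema.
    fn from_relation_columns(schema: &TableSchema, relation_columns: &[&str]) -> Self {
        let replicated: HashSet<&str> = relation_columns.iter().copied().collect();
        Self(
            schema
                .columns
                .iter()
                .map(|column| replicated.contains(column.name.as_str()))
                .collect(),
        )
    }
}

/// Wrapper that keeps the full schema but exposes only the replicated columns.
#[derive(Debug, Clone)]
struct ReplicatedTableSchema {
    schema: TableSchema,
    mask: ReplicationMask,
}

impl ReplicatedTableSchema {
    fn new(schema: TableSchema, mask: ReplicationMask) -> Self {
        Self { schema, mask }
    }

    /// Iterates over the columns whose mask flag is set, in schema order.
    fn replicated_columns(&self) -> impl Iterator<Item = &ColumnSchema> + '_ {
        self.schema
            .columns
            .iter()
            .zip(self.mask.0.iter())
            .filter(|(_, replicated)| **replicated)
            .map(|(column, _)| column)
    }
}
```

With a schema of `id`, `name`, `email` and a `Relation` message listing only `id` and `email`, the mask flags the first and third columns, and `replicated_columns` yields `id` and `email` while the stored schema stays untouched.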
Previously, schemas were loaded by passing the `SchemaStore` to the destination. This caused semantic issues; for example, `truncate_table` relied on assumptions about whether the schema was present or not.
The new design supplies a `ReplicatedTableSchema` with each event, eliminating schema loading in the destination and enforcing invariants at compile time via the type system. This also enables future support for multiple schema versions within a single batch of events, which will be critical for schema change support.
Consistent Schema Loading
To ensure schema consistency between the initial table copy and DDL event triggers, we now define a Postgres function `describe_table_schema` that returns schema data in a consistent structure. Schema change messages are emitted in the replication stream within the same transaction that modifies the schema.
More Schema Information
With the new shared schema query, we also load the ordinal positions of primary key columns, which enables us to create composite primary keys in downstream destinations.
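A minimal sketch of how a destination might use these positions to render a composite primary key follows. The trimmed-down `ColumnSchema` mirrors the two fields shown in the diff; the helper names `primary_key_columns` and `primary_key_clause` and the double-quote identifier quoting are assumptions for illustration, not part of the PR.

```rust
/// Hypothetical, trimmed-down column schema with the fields shown in the diff.
#[derive(Debug, Clone)]
struct ColumnSchema {
    name: String,
    /// The 1-based ordinal position of the column in the table.
    ordinal_position: i32,
    /// The 1-based position of this column in the primary key, or `None`.
    primary_key_ordinal_position: Option<i32>,
}

/// Returns the primary key column names, ordered by their position in the key
/// rather than by their position in the table.
fn primary_key_columns(columns: &[ColumnSchema]) -> Vec<&str> {
    let mut key_columns: Vec<(i32, &str)> = columns
        .iter()
        .filter_map(|column| {
            column
                .primary_key_ordinal_position
                .map(|position| (position, column.name.as_str()))
        })
        .collect();
    key_columns.sort_by_key(|(position, _)| *position);
    key_columns.into_iter().map(|(_, name)| name).collect()
}

/// Renders a `PRIMARY KEY (...)` clause, or `None` if the table has no key.
/// (Assumption: identifiers are quoted with double quotes.)
fn primary_key_clause(columns: &[ColumnSchema]) -> Option<String> {
    let key_columns = primary_key_columns(columns);
    if key_columns.is_empty() {
        return None;
    }
    let quoted: Vec<String> = key_columns
        .iter()
        .map(|name| format!("\"{}\"", name.replace('"', "\"\"")))
        .collect();
    Some(format!("PRIMARY KEY ({})", quoted.join(", ")))
}
```

Sorting by `primary_key_ordinal_position` rather than `ordinal_position` matters when the key's column order differs from the table's column order.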
DDL Event Trigger
We also have a new DDL event trigger, which will be used to dispatch schema change events (`ALTER TABLE` statements) in a transactionally consistent way. This is possible because Postgres runs event triggers within the transaction that fired them, and they are blocking. When an `ALTER TABLE` is executed, the SQL function runs and produces the logical replication message in the same transaction that modifies the table. No statements after the `ALTER TABLE` are run until the event trigger has executed successfully.
This will be the foundational element needed for supporting schema changes.
Future Work
Follow-up PRs will leverage the DDL message for full schema change support. For now, it's included here to validate consistency.