Implement WithStateSync interface with schema derivation fallback #10
Merged
nicosuave merged 5 commits into sidequery:main on Feb 1, 2026
Conversation
Test verifies that fresh pipeline instances can restore schemas from the destination when loading data with missing columns. Currently fails because IcebergRestClient does not implement WithStateSync.
Add schema and state restoration from Iceberg catalog, enabling pipelines in different execution contexts to share state.

- Add get_stored_schema() to retrieve latest schema by name
- Add get_stored_schema_by_hash() for exact version lookup
- Add get_stored_state() to retrieve pipeline state
- Use PyIceberg row_filter for predicate pushdown optimization
- Add unit tests for WithStateSync methods
This test demonstrates the scenario where:

1. Pipeline creates table with columns [a, b, c, d]
2. _dlt_version is deleted (simulating corrupted/empty state)
3. New pipeline run with data missing column 'd' fails

The test currently fails with SchemaEvolutionError because the destination doesn't derive the schema from existing Iceberg tables when _dlt_version has no stored schema.
When get_stored_schema() finds no stored schema in _dlt_version, it now falls back to deriving the schema from existing Iceberg table metadata. This handles scenarios where:

- _dlt_version is deleted or corrupted
- Pipeline runs in different execution contexts with empty state
- Historical columns exist in Iceberg but not in the current data batch

The derived schema includes all columns from existing Iceberg tables, ensuring they are not treated as "dropped" during schema evolution.

Implementation:

- Add _derive_schema_from_iceberg_tables() to scan the catalog for tables
- Add _iceberg_type_to_dlt_type() for type conversion
- Modify get_stored_schema() to use derivation as fallback
- Add unit tests for derivation and precedence behavior
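To make the type-conversion step concrete, here is a minimal sketch of what a helper like `_iceberg_type_to_dlt_type()` could look like. The mapping table and the fallback-to-`text` behavior are assumptions for illustration, not the PR's actual implementation.

```python
# Hypothetical mapping from Iceberg primitive type names to dlt data types.
# The real helper in this PR may cover more types or map them differently.
_ICEBERG_TO_DLT = {
    "string": "text",
    "long": "bigint",
    "int": "bigint",
    "double": "double",
    "float": "double",
    "boolean": "bool",
    "date": "date",
    "timestamp": "timestamp",
    "timestamptz": "timestamp",
    "binary": "binary",
}


def iceberg_type_to_dlt_type(iceberg_type: str) -> str:
    """Map an Iceberg primitive type name to a dlt column data type.

    Unknown types fall back to "text" so that schema derivation never
    fails on an unrecognized column (a lossy but safe default).
    """
    # decimal(p, s) carries parameters, so match on the prefix
    if iceberg_type.startswith("decimal"):
        return "decimal"
    return _ICEBERG_TO_DLT.get(iceberg_type, "text")
```

A lossy fallback keeps derivation total: it is better to restore a column with a widened type than to abort the run on a type the mapping does not know.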
Member: Looks great, thanks for this

Member: Just released a new version, v0.3.0, with these changes as well
Summary
While using dlt-iceberg, I encountered issues when running pipelines in ephemeral/stateless runtimes (ECS Fargate, Kubernetes Jobs). These environments don't preserve local dlt state between runs, which exposed several gaps in how dlt-iceberg handles schema management.

Sorry for the large PR, lots of test code 😬
Issues encountered
No schema restoration from destination

dlt-iceberg didn't implement the WithStateSync interface, so pipelines running in fresh containers couldn't restore schemas from the _dlt_version table. Each run only knew about the columns in the current data batch.

"Dropped columns" errors with varying data

When source data varies between runs (e.g., some API responses have optional fields), columns present in the Iceberg table but missing from the current batch were treated as "dropped", causing SchemaEvolutionError failures.

Features
WithStateSync Interface
- get_stored_schema() – Retrieves the newest schema from the _dlt_version table
- get_stored_schema_by_hash() – Retrieves a schema by exact version hash
- get_stored_state() – Retrieves pipeline state from the _dlt_pipeline_state table

Schema Derivation Fallback
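The three lookups above can be sketched with an in-memory model. This is purely illustrative: the real client queries Iceberg tables (with PyIceberg row_filter predicates for pushdown) rather than Python lists, and the names `StorageSchemaInfo`, `VersionTable`, and the row shapes below are assumptions, not the PR's actual classes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StorageSchemaInfo:
    # Assumed shape of one _dlt_version row: name, hash, load order, payload
    schema_name: str
    version_hash: str
    inserted_at: int  # monotonically increasing load sequence
    schema: str       # serialized schema JSON


class VersionTable:
    """In-memory stand-in for the _dlt_version table."""

    def __init__(self, rows: list[StorageSchemaInfo]):
        self.rows = rows

    def get_stored_schema(self, schema_name: str) -> Optional[StorageSchemaInfo]:
        # Newest row for this schema name; the real client would push the
        # name predicate down into the Iceberg scan instead of filtering here.
        matches = [r for r in self.rows if r.schema_name == schema_name]
        return max(matches, key=lambda r: r.inserted_at) if matches else None

    def get_stored_schema_by_hash(self, version_hash: str) -> Optional[StorageSchemaInfo]:
        # Exact version lookup: at most one row per hash
        return next((r for r in self.rows if r.version_hash == version_hash), None)


def get_stored_state(state_rows: list[tuple[str, int, str]],
                     pipeline_name: str) -> Optional[str]:
    """Latest serialized state for a pipeline, given _dlt_pipeline_state-like
    rows of (pipeline_name, version, state_json)."""
    matches = [r for r in state_rows if r[0] == pipeline_name]
    return max(matches, key=lambda r: r[1])[2] if matches else None
```

The common pattern is "filter by key, take the row with the highest version"; only the by-hash lookup is an exact match with no ordering involved.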
When _dlt_version has no stored schema but Iceberg tables exist, the destination now derives the schema from existing table metadata. This handles:

- _dlt_version deleted or corrupted

Test plan

- Unit tests for WithStateSync methods
- Test for schema derivation when _dlt_version is empty
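The precedence described above — a stored schema always wins, and derivation only kicks in when _dlt_version yields nothing — can be sketched as follows. The dict shapes for schemas and catalog metadata are simplified placeholders, not dlt's real schema format.

```python
from typing import Optional


def get_stored_schema(stored: Optional[dict],
                      iceberg_tables: dict[str, list[str]]) -> Optional[dict]:
    """Return the stored schema if present, else derive one from catalog metadata.

    `iceberg_tables` maps table name -> column names found in the catalog
    (a stand-in for scanning real Iceberg table metadata).
    """
    if stored is not None:
        return stored  # precedence: an explicitly stored schema is authoritative
    if not iceberg_tables:
        return None    # nothing in the catalog to derive from
    # Derivation: include every column that exists in the catalog, so columns
    # missing from the current data batch are not treated as dropped.
    return {
        "tables": {
            name: {"columns": {c: {"name": c} for c in cols}}
            for name, cols in iceberg_tables.items()
        }
    }
```

With this ordering, a corrupted or deleted _dlt_version degrades to a derived schema instead of a SchemaEvolutionError, while normal runs are unaffected.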