Implement WithStateSync interface with schema derivation fallback #10

Merged
nicosuave merged 5 commits into sidequery:main from hentzthename:feature/implement-with-state-sync
Feb 1, 2026

@hentzthename
Contributor

Summary

While using dlt-iceberg, I encountered issues when running pipelines in ephemeral/stateless runtimes (ECS Fargate, Kubernetes Jobs). These environments don't preserve local dlt state between runs, which exposed several gaps in how dlt-iceberg handles schema management.

Sorry for the large PR, lots of test code 😬

Issues encountered

  1. No schema restoration from destination
    dlt-iceberg didn't implement the WithStateSync interface, so pipelines running in fresh containers couldn't restore schemas from the _dlt_version table. Each run only knew about columns in the current data batch.

  2. "Dropped columns" errors with varying data
    When source data varies between runs (e.g., some API responses have optional fields), columns present in the Iceberg table but missing from the current batch were treated as "dropped", causing SchemaEvolutionError failures.
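
A minimal sketch of the second failure mode, for illustration only. It does not reproduce dlt-iceberg's actual check: `find_dropped_columns` is a hypothetical helper, and the comparison is modeled as a plain difference between the table's columns and the columns the pipeline currently knows about.

```python
def find_dropped_columns(table_columns, schema_columns):
    """Return table columns missing from the pipeline's schema, in table order.

    Hypothetical helper: models the "dropped column" check as a set difference.
    """
    known = set(schema_columns)
    return [col for col in table_columns if col not in known]


# Columns already present in the Iceberg table from earlier runs.
table_columns = ["a", "b", "c", "d"]

# The current batch happens to lack the optional field "d".
batch_columns = ["a", "b", "c"]

# Fresh container, no restored state: the schema only reflects this batch,
# so "d" looks as if it had been dropped.
dropped_without_restore = find_dropped_columns(table_columns, batch_columns)

# With a schema restored from the destination, the known columns are the
# union of the restored schema and the batch, so nothing looks dropped.
restored_columns = ["a", "b", "c", "d"]
merged_columns = list(dict.fromkeys(restored_columns + batch_columns))
dropped_with_restore = find_dropped_columns(table_columns, merged_columns)

print(dropped_without_restore)
print(dropped_with_restore)
```

In this model the stateless run reports `d` as dropped, while the run with a restored schema reports nothing, which is the behavior the PR aims for.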

Features

WithStateSync Interface

  • get_stored_schema() – Retrieves newest schema from _dlt_version table
  • get_stored_schema_by_hash() – Retrieves schema by exact version hash
  • get_stored_state() – Retrieves pipeline state from _dlt_pipeline_state table
  • Uses predicate pushdown (PyIceberg row_filter) so that scans read only the matching rows
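
The lookup logic behind the first two methods can be sketched in plain Python. This is an illustration under stated assumptions, not the PR's code: the rows are an in-memory stand-in for the `_dlt_version` table, the column names (`schema_name`, `version_hash`, `inserted_at`, `schema`) are assumed to follow dlt's usual layout, and `newest_schema` / `schema_by_hash` are hypothetical names. In the PR the filtering is pushed down to the Iceberg scan rather than done in Python.

```python
from datetime import datetime, timezone

# Hypothetical in-memory stand-in for rows of the _dlt_version table.
VERSION_ROWS = [
    {
        "schema_name": "events",
        "version_hash": "h1",
        "inserted_at": datetime(2026, 1, 30, tzinfo=timezone.utc),
        "schema": '{"version": 1}',
    },
    {
        "schema_name": "events",
        "version_hash": "h2",
        "inserted_at": datetime(2026, 1, 31, tzinfo=timezone.utc),
        "schema": '{"version": 2}',
    },
    {
        "schema_name": "other",
        "version_hash": "h3",
        "inserted_at": datetime(2026, 2, 1, tzinfo=timezone.utc),
        "schema": '{"version": 1}',
    },
]


def newest_schema(rows, schema_name):
    """Return the most recently inserted row for schema_name, or None."""
    matches = [row for row in rows if row["schema_name"] == schema_name]
    if not matches:
        return None
    return max(matches, key=lambda row: row["inserted_at"])


def schema_by_hash(rows, version_hash):
    """Return the row whose version_hash matches exactly, or None."""
    for row in rows:
        if row["version_hash"] == version_hash:
            return row
    return None


latest = newest_schema(VERSION_ROWS, "events")
print(latest["version_hash"])
```

Returning `None` when nothing matches is the case that the schema derivation fallback described below is meant to cover.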

Schema Derivation Fallback

When _dlt_version has no stored schema but Iceberg tables exist, the destination now derives the schema from existing table metadata. This handles:

  • _dlt_version deleted or corrupted
  • Pipeline runs in different execution contexts with empty state
  • Historical columns exist in Iceberg but not in current data batch
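
A rough sketch of how such a fallback could work, assuming table metadata is available as a mapping of table name to column name to Iceberg type string. All names here (`iceberg_type_to_dlt_type`, `derive_schema`, `schema_with_fallback`) and the type mapping itself are illustrative assumptions; the PR's own `_derive_schema_from_iceberg_tables()` and `_iceberg_type_to_dlt_type()` may differ in both shape and mapping.

```python
import re

# Assumed mapping from Iceberg type names to dlt data types (illustrative).
ICEBERG_TO_DLT = {
    "boolean": "bool",
    "int": "bigint",
    "long": "bigint",
    "float": "double",
    "double": "double",
    "decimal": "decimal",
    "date": "date",
    "time": "time",
    "timestamp": "timestamp",
    "timestamptz": "timestamp",
    "string": "text",
    "binary": "binary",
    "fixed": "binary",
    "struct": "json",
    "list": "json",
    "map": "json",
}


def iceberg_type_to_dlt_type(iceberg_type):
    """Map an Iceberg type string such as 'decimal(10, 2)' to a dlt type name."""
    match = re.match(r"[a-z]+", iceberg_type.strip().lower())
    base = match.group(0) if match else ""
    # Assumption: unknown types fall back to "text" instead of raising.
    return ICEBERG_TO_DLT.get(base, "text")


def derive_schema(tables):
    """Build {table: {column: dlt_type}} from {table: {column: iceberg_type}}."""
    return {
        table: {col: iceberg_type_to_dlt_type(t) for col, t in columns.items()}
        for table, columns in tables.items()
    }


def schema_with_fallback(stored_schema, tables):
    """Prefer the stored schema; derive from table metadata only if it is missing."""
    if stored_schema is not None:
        return stored_schema
    return derive_schema(tables) if tables else None


tables = {
    "events": {
        "id": "long",
        "amount": "decimal(10, 2)",
        "created_at": "timestamptz",
        "payload": "struct<a: int>",
    }
}
derived = schema_with_fallback(None, tables)
print(derived)
```

The stored schema takes precedence whenever it exists, so derivation only fills the gap when `_dlt_version` yields nothing.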

Test plan

  • Unit tests for WithStateSync methods
  • E2E test for schema restoration from destination
  • E2E test for schema derivation when _dlt_version is empty
  • Full test suite passes

Test verifies that fresh pipeline instances can restore schemas from
the destination when loading data with missing columns.

Currently fails because IcebergRestClient does not implement WithStateSync.

Add schema and state restoration from Iceberg catalog, enabling
pipelines in different execution contexts to share state.

- Add get_stored_schema() to retrieve latest schema by name
- Add get_stored_schema_by_hash() for exact version lookup
- Add get_stored_state() to retrieve pipeline state
- Use PyIceberg row_filter for predicate pushdown optimization
- Add unit tests for WithStateSync methods

This test demonstrates the scenario where:
1. Pipeline creates table with columns [a, b, c, d]
2. _dlt_version is deleted (simulating corrupted/empty state)
3. New pipeline run with data missing column 'd' fails

The test currently fails with SchemaEvolutionError because the
destination doesn't derive schema from existing Iceberg tables
when _dlt_version has no stored schema.
When get_stored_schema() finds no stored schema in _dlt_version,
it now falls back to deriving the schema from existing Iceberg
table metadata. This handles scenarios where:

- _dlt_version is deleted or corrupted
- Pipeline runs in different execution contexts with empty state
- Historical columns exist in Iceberg but not in current data batch

The derived schema includes all columns from existing Iceberg tables,
ensuring they are not treated as "dropped" during schema evolution.

Implementation:
- Add _derive_schema_from_iceberg_tables() to scan catalog for tables
- Add _iceberg_type_to_dlt_type() for type conversion
- Modify get_stored_schema() to use derivation as fallback
- Add unit tests for derivation and precedence behavior

@nicosuave
Member

Looks great, thanks for this

nicosuave merged commit 27f0e8e into sidequery:main on Feb 1, 2026
1 check passed
@nicosuave
Member

Just released a new version v0.3.0 with these changes as well

2 participants