Add support for incremental loads to FDW plugins #647

mildbyte · 2022-03-10T09:20:32Z

Add a `cursor_columns` field to the table parameters (used as a list of columns that form an increasing-only replication bookmark). Store the ingestion state in a table inside of the image, similarly to Airbyte.
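To make the bookmark idea concrete, here is a minimal sketch of how stored cursor values could be turned into a filter that only picks up rows past the last sync. The helper names (`quote_ident`, `build_cursor_predicate`) and the row-wise comparison are assumptions for illustration, not the code added in this PR.

```python
from typing import Dict, List, Optional, Tuple


def quote_ident(name: str) -> str:
    """Quote a PostgreSQL identifier by doubling embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_cursor_predicate(
    cursor_values: Optional[Dict[str, str]],
) -> Tuple[str, List[str]]:
    """Return a WHERE fragment and its parameters for an incremental load.

    Hypothetical helper: with no stored cursor values, no filter is
    produced and the load is a full one.
    """
    if not cursor_values:
        return "", []
    columns = list(cursor_values)
    column_list = ", ".join(quote_ident(c) for c in columns)
    placeholders = ", ".join(["%s"] * len(columns))
    # Row-wise comparison orders by the first column, then the second, etc.
    where = f"WHERE ({column_list}) > ({placeholders})"
    return where, [cursor_values[c] for c in columns]
```

For example, `build_cursor_predicate({"updated_at": "2022-03-10", "id": "42"})` yields the fragment `WHERE ("updated_at", "id") > (%s, %s)` with the two values as query parameters, while passing `None` yields an empty fragment.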

mildbyte · 2022-03-10T09:23:01Z

splitgraph/hooks/data_source/fdw.py

+        with delete_schema_at_end(repository.object_engine, staging_schema):
+            repository.object_engine.delete_schema(staging_schema)
+            repository.object_engine.create_schema(staging_schema)
+            repository.commit_engines()
+
+            self._mount_and_copy(
+                staging_schema,
+                tables,
+                cursor_values=None if not state else state.get("cursor_values"),
+            )
+
+            logging.info("Storing tables as Splitgraph images")
+            for table_name in repository.object_engine.get_all_tables(staging_schema):
+                logging.info("Storing %s", table_name)
+                new_schema = repository.object_engine.get_full_table_schema(
+                    staging_schema, table_name
+                )
+
+                if base_image:
+                    try:
+                        current_schema = base_image.get_table(table_name).table_schema
+                        if current_schema != new_schema:
+                            raise AssertionError(
+                                "Schema for %s changed! Old: %s, new: %s"
+                                % (
+                                    table_name,
+                                    current_schema,
+                                    new_schema,
+                                )
+                            )
+                    except TableNotFoundError:
+                        pass
+
+                repository.objects.record_table_as_base(
+                    repository,
+                    table_name,
+                    new_image_hash,
+                    chunk_size=DEFAULT_CHUNK_SIZE,
+                    source_schema=staging_schema,
+                    source_table=table_name,
+                    table_schema=new_schema,
+                )


There's some overlap between this code and the code in https://github.com/splitgraph/splitgraph/blob/master/splitgraph/ingestion/airbyte/data_source.py#L290-L329, but I don't know if there's a nice way to factor it out. Also, when writeable LQ lands, we should be able to simplify this code anyway by writing into the LQ checkout.

gruuya

Looks good!

gruuya · 2022-03-10T10:28:01Z

splitgraph/hooks/data_source/fdw.py


Yeah, makes sense.

Gelio · 2022-03-11T14:44:17Z

splitgraph/engine/base.py

@@ -134,22 +135,42 @@ def copy_table(
        target_schema: str,
        target_table: str,
        with_pk_constraints: bool = True,
+        cursor_fields: Optional[Dict[str, str]] = None,


Array fields are not supported by dynamic forms. I suppose I should relax the error message.

Add support for incremental loads to FDW plugins

0bd3a89

mildbyte requested a review from gruuya March 10, 2022 09:20

mildbyte commented Mar 10, 2022

gruuya approved these changes Mar 10, 2022

mildbyte added 2 commits March 10, 2022 13:57

Fix JSONSchema for cursor_fields

9fad5b4

Also handle the case where it's not specified.

Don't store _sg_ingestion_state when doing a load.

757f02a

It adds this table in cases like a CSV upload, polluting the repository. Instead, when loading the state during a sync, fall back to getting the cursor values by just querying the max fields in the current image if it doesn't exist.
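The fallback described in this commit message could look roughly like the sketch below: prefer the stored state, and otherwise derive cursor values from the maximum of each cursor column in the current image. The function names and the shape of the query result are hypothetical; only the `cursor_values` key is taken from the diff in this PR.

```python
from typing import Dict, List, Optional, Sequence


def quote_ident(name: str) -> str:
    """Quote a PostgreSQL identifier by doubling embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_max_cursor_query(schema: str, table: str, cursor_columns: List[str]) -> str:
    """Build a query returning the largest value of each cursor column."""
    if not cursor_columns:
        raise ValueError("at least one cursor column is required")
    select_list = ", ".join(f"MAX({quote_ident(c)})" for c in cursor_columns)
    return f"SELECT {select_list} FROM {quote_ident(schema)}.{quote_ident(table)}"


def resolve_cursor_values(
    state: Optional[dict],
    max_row: Optional[Sequence[object]],
    cursor_columns: List[str],
) -> Optional[Dict[str, str]]:
    """Prefer stored ingestion state; otherwise use the MAX() query result.

    `max_row` is assumed to be the single row returned by the query built
    above, or None if the table does not exist in the current image.
    """
    if state and state.get("cursor_values"):
        return state["cursor_values"]
    if max_row is None:
        return None
    # Skip columns whose MAX() is NULL (for example, an empty table).
    values = {c: str(v) for c, v in zip(cursor_columns, max_row) if v is not None}
    return values or None
```

Note that this sketch takes the maximum of each column independently, which for a multi-column cursor is not necessarily the row-wise maximum; whether that matters depends on how the cursor values are compared during the sync.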

mildbyte merged commit 244b2fd into master Mar 10, 2022

Gelio reviewed Mar 11, 2022
