Handle Postgres TOAST values and missing columns/values from wal2json #146

mvgijssel · 2022-01-11T16:57:19Z

Describe the bug

We ran into the problem that particular columns ended up NULL in the data warehouse after logical replication updates. After diving into the problem we discovered that this happens because of TOAST values in Postgres. A good explanation can be found here https://debezium.io/blog/2019/10/08/handling-unchanged-postgres-toast-values/, but TL;DR Postgres can sometimes decide to not include a column in the logical replication message from wal2json if that particular column was not part of the UPDATE statement.

To Reproduce

Used the steps described here https://github.com/eulerto/wal2json/pull/99/files#diff-dc97f558087716ec3429b35c430504cbd91e93aa6e477188c289927632d01b6f to reproduce the problem.

Expected behavior
The expectation is that if the message from wal2json contains a toast value and misses a column it does not set that column to NULL for the downstream singer target.

Your environment

Version of tap: forked version of 1.8.0 (HackerOne specific changes Hacker0x01/pipelinewise-tap-postgres#1)
Version of python 3.9.7

Additional context
Because HackerOne is using this tap in production we have already pushed a fix to our fork of tap-postgres in Hacker0x01#1. The commit with the working fix for HackerOne is Hacker0x01@1584b8d. This fix requires altering tables with REPLICA IDENTITY FULL.

The text was updated successfully, but these errors were encountered:

mvgijssel · 2022-01-13T13:45:34Z

An alternative (and better) implementation of dealing with missing columns due to TOAST values is Hacker0x01@acd0b0f. Here we are outputting a new (partial) schema from the extractor to the loader for the single record that has missing columns.

If used together with an extractor like https://github.com/transferwise/pipelinewise-target-snowflake then the loader will try an upload a partial record. Partial records work fine in Snowflake with the Parquet file format (or with CSV file format ignoring the difference in columns).

mvgijssel added the bug Something isn't working label Jan 11, 2022

mvgijssel mentioned this issue Feb 3, 2022

Consider flush_all_streams for schema updates transferwise/pipelinewise-target-snowflake#247

Merged

13 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Handle Postgres TOAST values and missing columns/values from wal2json #146

Handle Postgres TOAST values and missing columns/values from wal2json #146

mvgijssel commented Jan 11, 2022

mvgijssel commented Jan 13, 2022

Handle Postgres TOAST values and missing columns/values from wal2json #146

Handle Postgres TOAST values and missing columns/values from wal2json #146

Comments

mvgijssel commented Jan 11, 2022

mvgijssel commented Jan 13, 2022