
Replication slot sync feature is unsafe, high data corruption risk #1749

Closed
ringerc opened this issue Nov 9, 2020 · 8 comments

Comments


ringerc commented Nov 9, 2020

The slots configuration parameter is unsafe when used with logical replication slots and is highly likely to lead to silent data loss. This feature should be disabled in a point release then re-enabled once fixed to be safe.

On a side note, I tried to make this stuff simple to get right with my failover slots patch, but I was not successful in getting it into core, so we're stuck with fragile tooling-driven approaches instead.

It is not safe to create replication slots on a streaming replica only after promoting it. If you're going to support failover of logical replication downstreams (subscribers) to a promoted physical replica of a failed upstream (provider), you must do so extremely carefully. Patroni does not do so correctly, and because of the way logical slots work in PostgreSQL there is no warning or error when changes from the old provider are skipped over on failover. You will just get a silent gap where some changes did not replicate.

Repro steps

You should be able to repro the data loss scenario as follows. Set up your choice of logical decoding/replication schemes; I'm going to assume you'll make a postgres subscriber "S" and use built-in logical replication, but the same issue can be seen with test_decoding and pg_recvlogical, or anything else. Steps:

  • Create publisher P and streaming physical replica of publisher "Q"
  • CREATE TABLE test_table(id integer primary key) on P
  • Configure Patroni to manage Q as a failover candidate for P
  • Configure Patroni to create logical slot subscriber_s on P
  • Create your subscriber S and subscribe to changes on P using logical slot subscriber_s
  • INSERT INTO test_table(id) VALUES (1) on P
  • Wait for Q and S to replay the insert
  • Pause subscription on S, network partition S, or otherwise stop P->S replication (to simulate lag/outage)
  • Optional steps, to see what goes wrong more clearly:
    • Once you have confirmed S is no longer replaying from P, make a note of S's replay position in P's WAL by running SELECT remote_lsn FROM pg_catalog.pg_replication_origin_status on S. For the purposes of this discussion this position will be 0/1000.
    • Make a note of S's current slot position on P: run SELECT restart_lsn, confirmed_flush_lsn FROM pg_replication_slots WHERE slot_name = 'subscriber_s' on P. This will also be 0/1000 since they're caught up.
    • Run the same query on Q and verify that no replication slot named subscriber_s exists
  • INSERT INTO test_table(id) VALUES (2) on P
  • Wait for Q to replay the insert
  • Optional steps, to see what goes wrong more clearly:
    • Re-run the queries for confirmed_flush_lsn on P and remote_lsn on S. Note that P's position has advanced, we'll say to 0/1200, while S's is still at 0/1000.
  • Terminate P and promote Q to replace it
  • Wait for Patroni to create slot subscriber_s on Q
  • Make a note of the confirmed_flush_lsn of the newly created slot on Q: run SELECT restart_lsn, confirmed_flush_lsn FROM pg_replication_slots WHERE slot_name = 'subscriber_s' on Q. For the purpose of this discussion we'll say it's at 0/2000.
  • Re-enable subscriber S.
  • If you didn't use a virtual IP or failover hostname, change the subscriber connection string on S to point to Q instead of the now-dead P
  • INSERT INTO test_table(id) VALUES (3) on Q
  • Wait for S to replay the insert
  • Compare output of SELECT id FROM test_table ORDER BY id on Q and S. You will see that Q has 1, 2, 3, but S only has 1, 3.

Oops! We misplaced the second insert.
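The optional verification steps above are easy to script. Here's a minimal sketch (a hypothetical helper, not part of Patroni) that converts the text-form LSNs those queries return into integers and measures the gap between what S has replayed and what the slot on P has confirmed past:

```python
def parse_lsn(lsn: str) -> int:
    """Convert a PostgreSQL text-form LSN like '0/1200' to a 64-bit integer."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def replay_gap(subscriber_remote_lsn: str, provider_confirmed_flush_lsn: str) -> int:
    """Bytes of WAL between the subscriber's replayed position (remote_lsn on S)
    and the slot's confirmed_flush_lsn on the provider. Nonzero means S is
    behind and will lose that range if the slot is recreated after promotion."""
    return max(0, parse_lsn(provider_confirmed_flush_lsn)
                  - parse_lsn(subscriber_remote_lsn))
```

In the repro, after the second insert, `replay_gap("0/1000", "0/1200")` is 0x200 bytes: exactly the WAL range S never receives once Q is promoted.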

What happened?

When Patroni made the new slot subscriber_s on Q, the slot was created with an automatically determined confirmed_flush_lsn: just past the end of the last commit, or, if there was no transaction activity during slot creation, just past the last WAL standby status update message. This position is on Q's new timeline, after Q was promoted to replace P, say 0/2000.

But S last saw 0/1000 in the example above. So for consistency it needs to replay the txn that inserted value 2, which happened sometime after 0/1000 and before the after-promotion lsn 0/2000.

When S connects to Q it sends its remote_lsn from pg_replication_origin_status as the start-position, so it says "Q, please send me changes from all txns from 0/1000 and above".

But Q sees that the slot's confirmed_flush_lsn is 0/2000. A logical slot cannot go backwards - if we tried to decode WAL at 0/1000 we wouldn't know where to restart decoding from, the catalogs could have been vacuumed away, and the WAL segments might have been removed anyway.

So Q silently starts streaming changes to S starting at 0/2000 instead.

S doesn't get told that this happened. S has no way to know whether anything interesting happened between 0/1000 and 0/2000, or whether the first insert sometime after 0/2000 is the first txn of interest. So S has no way to detect the problem.
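The silent fast-forward can be modeled in a few lines. This is a sketch of the behaviour described above, not PostgreSQL's actual code: the effective start position is whichever is later, the client's requested LSN or the slot's confirmed_flush_lsn, and no error is raised when they differ:

```python
def effective_start_lsn(requested_lsn: int, confirmed_flush_lsn: int) -> int:
    """A logical slot never streams from before its confirmed_flush_lsn,
    so a requested start position behind it is silently fast-forwarded."""
    return max(requested_lsn, confirmed_flush_lsn)

# S asks to start at 0/1000, but the freshly created slot on Q is at 0/2000:
start = effective_start_lsn(0x1000, 0x2000)
# start is 0x2000; the txn committed at 0/1200 is never decoded, and S sees no error.
```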

Why is it this way?

Pg does this silent-fast-forward of logical slots in order to support efficient crash-safe apply without lots of slow round-trips. It also helps reduce the amount of fsync()ing on provider and subscriber that happens when slots are advancing over WAL that doesn't generate any replicated traffic. So it's done for good reasons.

But it was not designed with primary failover in mind, and it's left a rather interesting hazard for replication systems that try to tack failover support on top using tooling.

Correct way

pglogical 3 demonstrates the correct way to do this on PostgreSQL 10 and above. There's been discussion on -hackers for it too. The gist is that you must use hot standby feedback and a physical slot, then use your own tooling to sync slots from primary to standby while the standby is still a replica.
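As a sketch of what such sync tooling has to decide, assuming you can read pg_replication_slots on both nodes while the standby is still a replica (the function name here is hypothetical, and actually creating or advancing slots on a standby needs the extension support discussed later in this thread):

```python
def slots_needing_sync(primary_slots: dict, standby_slots: dict) -> dict:
    """Compare {slot_name: confirmed_flush_lsn} snapshots (LSNs as ints)
    taken from pg_replication_slots on primary and standby. Returns the
    slots that are missing or behind on the standby, mapped to the LSN
    they must be advanced to before the standby can be promoted safely."""
    return {
        name: lsn
        for name, lsn in primary_slots.items()
        if standby_slots.get(name, -1) < lsn
    }
```

Note that the sync must run continuously while the standby is a replica; syncing only at promotion time is exactly the unsafe pattern this issue describes.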

A related provider/subscriber de-sync issue

BTW, there's another scenario where you can get upstream/downstream inconsistency too, kind of the reverse of this one. In the same setup:

  • Txn 1 commits a write on P
  • Txn 1 replicated to S over logical replication and committed
  • P fails before Q receives WAL for txn 1
  • Q promoted to replace P

In this case S has the changes from txn 1, but Q does not. To handle this, you must ensure that changes are not sent out over logical replication until they're first confirmed by any physical streaming replicas. That's not simple to do in core postgres right now. We handle it in the pglogical output plugin by having the user configure which slots are for failover candidates.
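A sketch of that rule, under the assumption that you can read each failover-candidate standby's flush position (e.g. flush_lsn from pg_stat_replication): logical decoding may only release commits up to the oldest position flushed by every candidate:

```python
def safe_send_horizon(candidate_flush_lsns):
    """Commits at or below this LSN (ints) are flushed on every
    failover-candidate physical standby, so sending them to logical
    subscribers cannot produce the de-sync described above. With no
    failover candidates there is nothing to hold back (returns None)."""
    return min(candidate_flush_lsns) if candidate_flush_lsns else None

# Q has only flushed up to 0/1000, so a txn committed at 0/1200 on P
# must be held back from logical subscribers until Q confirms it:
horizon = safe_send_horizon([0x1000])
```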

Docs for feature

The docs for the unsafe feature read:

slots: define permanent replication slots. These slots will be preserved during switchover/failover. Patroni will try to create slots before opening connections to the cluster.
...
type: slot type. Could be physical or logical. If the slot is logical, you have to additionally define database and plugin.
database: the database name where logical slots should be created.
plugin: the plugin name for the logical slot.


ringerc commented Nov 9, 2020

Short-version how to repro is to commit a change on a publisher, and ensure that the physical replica(s) receive it but the logical replica(s) don't yet, then promote a physical replica to replace the publisher. Your logical replica(s) will be missing the change.


ringerc commented Nov 9, 2020

AFAICS it's not simple to just replace https://github.com/zalando/patroni/blob/252a1b78ed715b9696fa94a0135d1f45caf089de/patroni/postgresql/slots.py#L81 with an if logical_slots: raise Exception(...). There are legitimate reasons you might want Patroni to create logical slots for management and automation.

Really you want to prevent this feature from being used on failover. I suggest refusing to allow any logical_slots entries to be configured for any primary that has a configured failover-candidate standby.
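A hypothetical guard along those lines (the function and config shape here are illustrative only, not actual Patroni code):

```python
def reject_unsafe_logical_slots(slots, has_failover_candidates):
    """Refuse `slots` entries of type 'logical' when the cluster has any
    configured failover-candidate standby, since failover would silently
    lose changes as described in this issue. `slots` mirrors the shape of
    the `slots` section of the Patroni config: {name: {"type": ..., ...}}."""
    if not has_failover_candidates:
        return
    logical = sorted(name for name, cfg in (slots or {}).items()
                     if cfg.get("type") == "logical")
    if logical:
        raise ValueError(
            "logical slots {} cannot be used safely when failover "
            "candidates are configured".format(logical))
```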


jconway commented Nov 9, 2020

pglogical 3 demonstrates the correct way to do this on PostgreSQL 10 and above.

@ringerc where can we find the source code for pglogical 3?


ringerc commented Nov 10, 2020

I am not able to share pglogical3 sources, as I'm sure you're aware. That's outside my control. As it happens, it's also a complex load of interacting background workers and IPC that would take a while to unravel and understand anyway, so it wouldn't be as useful as you might think.

I've tried to constructively raise an issue that puts users at risk in a way that not only shows the problem, but provides detailed steps to verify it and an explanation of why it occurs. I also pointed you at a way to do it safely. It took a long time to write this up and explain all this. I hope you're able to review the matter and find a constructive path forward on it.

It's a shame the failover slots feature didn't land, though I understand some of why. I've done a lot of work to try to make this stuff safe and easy. I've just gone and updated https://wiki.postgresql.org/wiki/Failover_slots to reflect the outcome of that work, the parts that did get in, and point people at how to handle it.

I'll be happy to answer questions and help out in so far as time permits.


ringerc commented Nov 10, 2020

I think the main issue with doing this safely in Patroni will be the lack of SQL-visible user interface for direct replication slot management. You can't use the SQL functions or the walsender based interface to sync slots from primary to standby; it's basically an unsupported hack. Andres objected to the whole idea and AFAIK continues to do so, but in the absence of any better built-in way to handle failover it's what we're stuck with.

You will need a C extension to expose a SQL interface for slot state sync, and to install it in any replica you wish to manage failover-capable logical slots on. There isn't really any way around that right now: pg_replication_slot_advance() works on a standby, but I don't think there's a good way to create the slots in the first place. It might also be possible to instead bodge it with manual copying of the actual pg_replslot/*/state contents, but you'd have to do that by checkpointing the primary, stopping the replica, copying the slot state over, and starting the replica again, so it's unlikely to be viable.

A minimal example of an extension that exposes SQL calls for low level slot management can be found in my original failover slots patch. The patch attached to this mail contains it - see src/test/modules/test_slot_timelines.


ringerc commented Nov 10, 2020

Here I wrote some detailed guidance on how to do it safely.

If you'd be interested in driving forward some core patches to smooth the rough edges I would be happy to assist you with review and implementation guidance.


ringerc commented Nov 10, 2020

Here's a patch against master showing that pg_replication_slot_advance() works on a standby. You can use that to be much safer. Unfortunately I don't think there's a good way to create the slot in the first place without an extension.

0001-Extend-TAP-test-for-pg_replication_slot_advance-to-c_patch.txt

@konst583

In 2018 we shared the results of our research at https://medium.com/avitotech/recovery-use-cases-for-logical-replication-in-postgresql-10-a1e6bab03072, proposing some workarounds and a wishlist based on our experience with logical replication. It is good to see activity on fixing logical replication failover; I hope our research can help.
