Replication slot sync feature is unsafe, high data corruption risk #1749
Comments
Short-version how to repro is to commit a change on a publisher, and ensure that the physical replica(s) receive it but the logical replica(s) don't yet, then promote a physical replica to replace the publisher. Your logical replica(s) will be missing the change.
AFAICS it's not simple to just replace the slot on the promoted node. Really you want to prevent this feature from being used on failover. I suggest refusing to allow any logical slots in the `slots` configuration.
@ringerc where can we find the source code for pglogical 3?
I am not able to share pglogical3 sources, as I'm sure you're aware. That's outside my control. As it happens, it's also a complex load of interacting background workers and IPC that would take a while to unravel and understand anyway, so it wouldn't be as useful as you might think.

I've tried to constructively raise an issue about a feature that puts users at risk, in a way that not only shows the problem, but provides detailed steps to verify it and an explanation of why it occurs. I also pointed you at a way to do it safely. It took a long time to write this up and explain all this. I hope you're able to review the matter and find a constructive path forward on it.

It's a shame the failover slots feature didn't land, though I understand some of why. I've done a lot of work to try to make this stuff safe and easy. I've just gone and updated https://wiki.postgresql.org/wiki/Failover_slots to reflect the outcome of that work, the parts that did get in, and point people at how to handle it. I'll be happy to answer questions and help out insofar as time permits.
I think the main issue with doing this safely in Patroni will be the lack of a SQL-visible user interface for direct replication slot management. You can't use the SQL functions or the walsender-based interface to sync slots from primary to standby; it's basically an unsupported hack. Andres objected to the whole idea and AFAIK continues to do so, but in the absence of any better built-in way to handle failover it's what we're stuck with. You will need a C extension to expose a SQL interface for slot state sync, and to install that in any replica you wish to manage failover-capable logical slots on. There isn't really any way around that right now. A minimal example of an extension that exposes SQL calls for low-level slot management can be found in my original failover slots patch; the patch attached to that mail contains it.
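For context, a minimal sketch of why plain SQL isn't enough here: on the PostgreSQL versions at issue, ordinary logical slot creation simply fails while a node is in recovery (exact error text varies by version; `pgoutput` is just the built-in plugin, used for illustration):

```sql
-- Attempted on the standby Q while it is still a replica (in recovery).
-- This fails outright, which is why external tooling needs a C extension
-- to write slot state on a standby directly:
SELECT pg_create_logical_replication_slot('subscriber_s', 'pgoutput');
-- ERROR:  logical decoding cannot be used while in recovery
```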
Here I wrote some detailed guidance on how to do it safely. If you'd be interested in driving forward some core patches to smooth the rough edges, I would be happy to assist you with review and implementation guidance.
Here's a patch against master showing that behaviour: 0001-Extend-TAP-test-for-pg_replication_slot_advance-to-c_patch.txt
Here (https://medium.com/avitotech/recovery-use-cases-for-logical-replication-in-postgresql-10-a1e6bab03072), back in 2018, we shared the results of our research and proposed some workarounds and a wishlist based on our experience of using logical replication. It is cool that there is activity to fix issues connected with logical replication failover. I suppose our research can help.
The `slots` configuration parameter is unsafe when used with logical replication slots and is highly likely to lead to silent data loss. This feature should be disabled in a point release, then re-enabled once fixed to be safe.

On a side note, I tried to make this stuff simple to get right with my failover slots patch, but I was not successful in getting it into core, so we're stuck with fragile tooling-driven approaches instead.
It is not safe to create replication slots on streaming replicas only after promoting the replica. If you're going to try to support failover of logical replication downstreams (subscribers) to a promoted physical replica of a failed upstream (provider), you must do so extremely carefully. Patroni does not do so correctly, and because of the way logical slots work in PostgreSQL there is no warning or error when changes are skipped over on failover. You will just get a silent gap where some changes did not replicate.
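The positions involved can be inspected directly; a minimal sketch, assuming built-in logical replication with a slot named `subscriber_s` (these are the same queries the repro below relies on):

```sql
-- On the subscriber S: how far S has replayed from its provider
SELECT external_id, remote_lsn
FROM pg_catalog.pg_replication_origin_status;

-- On the provider P: how far the provider believes S has confirmed
SELECT slot_name, restart_lsn, confirmed_flush_lsn
FROM pg_replication_slots
WHERE slot_name = 'subscriber_s';
```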
Repro steps
You should be able to repro the data loss scenario as follows. Set up your choice of logical decoding/replication schemes; I'm going to assume you'll make a postgres subscriber "S" and use built-in logical replication, but the same issue can be seen with `test_decoding` and `pg_recvlogical`, or anything else. In the steps below, P is the primary (logical provider) and Q is a physical streaming replica of P managed by Patroni. (A minimal sketch of the setup follows the steps.) Steps:

- `CREATE TABLE test_table(id integer primary key)` on P
- Create a subscription on S that uses slot `subscriber_s` on P
- Wait for the subscription to sync, so S is caught up on slot `subscriber_s`
- `INSERT INTO test_table(id) VALUES (1)` on P
- Check how far S has replayed: `SELECT remote_lsn FROM pg_catalog.pg_replication_origin_status` on S. For the purposes of this discussion this position will be `0/1000`.
- Check the slot: `SELECT restart_lsn, confirmed_flush_lsn FROM pg_replication_slots WHERE slot_name = 'subscriber_s'` on P. This will also be `0/1000` since they're caught up.
- Stop S (or break its connection to P) so it stops receiving changes, but make sure its slot `subscriber_s` exists
- `INSERT INTO test_table(id) VALUES (2)` on P
- Compare `confirmed_flush_lsn` on P and `remote_lsn` on S. Note that P's position has advanced, we'll say to `0/1200`, and S's is still at `0/1000`.
- Promote Q to replace P; Patroni creates the replacement slot `subscriber_s` on Q, and S reconnects to it
- Check the `confirmed_flush_lsn` of the newly created slot on Q: run `SELECT restart_lsn, confirmed_flush_lsn FROM pg_replication_slots WHERE slot_name = 'subscriber_s'` on Q. For the purpose of this discussion we'll say it's at `0/2000`.
- `INSERT INTO test_table(id) VALUES (3)` on Q
- `SELECT id FROM test_table ORDER BY id` on Q and S. You will see that Q has `1, 2, 3`, but S only has `1, 3`.

Oops! We misplaced the second insert.
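For concreteness, here is a minimal sketch of the initial P↔S setup the steps assume; the publication name `pub_test`, subscription name `sub_s`, and connection string are illustrative only:

```sql
-- On P: the test table and a publication covering it
CREATE TABLE test_table(id integer PRIMARY KEY);
CREATE PUBLICATION pub_test FOR TABLE test_table;

-- On S: the same table definition, plus a subscription that creates
-- slot 'subscriber_s' on P and starts streaming changes from it
CREATE TABLE test_table(id integer PRIMARY KEY);
CREATE SUBSCRIPTION sub_s
  CONNECTION 'host=p-host dbname=postgres'
  PUBLICATION pub_test
  WITH (slot_name = 'subscriber_s');
```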
What happened?
When Patroni made the new slot `subscriber_s` on Q, the slot was created with an automatically determined start position: its `confirmed_flush_lsn` is set after the end of the last commit or, if there was no transaction activity during slot creation, after the end of the last WAL standby status update message. This position is on Q's new timeline, after Q was promoted to replace P, say `0/2000`.

But S last saw `0/1000` in the example above. So for consistency it needs to replay the txn that inserted value 2, which happened sometime after `0/1000` and before the after-promotion LSN `0/2000`.

When S connects to Q it sends its `remote_lsn` from `pg_replication_origin_status` as the start-position, so it says "Q, please send me changes from all txns from `0/1000` and above".

But Q sees that the slot's `confirmed_flush_lsn` is `0/2000`. A logical slot cannot go backwards - if we tried to decode WAL at `0/1000` we wouldn't know where to restart decoding from, the catalogs could have been vacuumed away, and the WAL segments might have been removed anyway.

So Q silently starts streaming changes to S starting at `0/2000` instead.
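The fast-forwarded position is directly visible on Q; a minimal sketch (the LSN shown is the example value from above, not real output):

```sql
-- On the freshly promoted Q: the recreated slot reports Q's current WAL
-- position, not the position S actually confirmed against P
SELECT slot_name, confirmed_flush_lsn
FROM pg_replication_slots
WHERE slot_name = 'subscriber_s';
--   slot_name   | confirmed_flush_lsn
-- --------------+---------------------
--  subscriber_s | 0/2000     <- S expected to resume from 0/1000
```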
S doesn't get told that this happened. S has no way to know if anything interesting happened between `0/1000` and `0/2000`, or if the first insert sometime after `0/2000` is the first txn of interest. So S has no way to detect the problem.

Why is it this way?
Pg does this silent fast-forward of logical slots in order to support efficient crash-safe apply without lots of slow round-trips. It also helps reduce the amount of fsync()ing on provider and subscriber that happens when slots are advancing over WAL that doesn't generate any replicated traffic. So it's done for good reasons.
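The forward-only behaviour is visible in the SQL interface too; a minimal sketch using `pg_replication_slot_advance()` (PostgreSQL 11 and later):

```sql
-- Moving a slot forward is supported:
SELECT pg_replication_slot_advance('subscriber_s', pg_current_wal_lsn());

-- But there is no way to move it backwards; a consumer that requests a
-- start position below confirmed_flush_lsn is silently fast-forwarded.
```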
But it was not designed with primary failover in mind, and it's left a rather interesting hazard for replication systems that try to tack failover support on top using tooling.
Correct way
pglogical 3 demonstrates the correct way to do this on PostgreSQL 10 and above. There's been discussion on -hackers for it too. The gist is that you must use hot standby feedback and a physical slot, then use your own tooling to sync slots from primary to standby while the standby is still a replica.
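As a rough sketch of the moving parts (the physical slot name `q_physical` is illustrative, not from pglogical): the standby streams from a physical slot with hot standby feedback enabled, and external tooling copies logical slot state from primary to standby while the standby is still in recovery:

```sql
-- On P: a physical slot for the standby Q to stream from. Together with
-- hot_standby_feedback this holds back catalog_xmin, so logical slots
-- synced to Q remain usable after promotion.
SELECT pg_create_physical_replication_slot('q_physical');

-- In Q's configuration:
--   primary_slot_name    = 'q_physical'
--   hot_standby_feedback = on
--
-- The logical slots themselves must then be copied P -> Q by tooling
-- (e.g. a C extension, as discussed above) while Q is still a replica.
```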
A related provider/subscriber de-sync issue
BTW, there's another scenario where you can get upstream/downstream inconsistency too, kind of the reverse of this one. In the same setup: commit a txn (call it txn 1) on P, let S receive and apply it over logical replication, then have P fail before the physical standby Q has received that WAL, and promote Q.

In this case S has the changes from txn 1, but Q does not. To handle this, you must ensure that changes are not sent out over logical replication until they're first confirmed by any physical streaming replicas. That's not simple to do in core postgres right now. We handle it in the pglogical output plugin by having the user configure which slots are for failover candidates.
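A sketch of the kind of check such tooling needs: compare each physical standby's confirmed position with the provider's current WAL position, and hold logical consumers back to the minimum of the standby positions:

```sql
-- On P: how far each physical standby has flushed/replayed. Logical
-- change streaming would have to be held back to the minimum flush_lsn
-- so a promoted standby can never be behind a logical subscriber.
SELECT application_name, flush_lsn, replay_lsn,
       pg_current_wal_lsn() AS primary_lsn
FROM pg_stat_replication;
```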
Docs for feature
The docs for the unsafe feature read: