spec.CheckpointBeforePgrewind #644
base: master
Conversation
This actually addresses #601, though not via any of the remediations recommended in that issue.
I've also experienced this issue.
How can we change wal_keep_segments?
Hey @rearden-steel @frank111, I'm not sure how this is related to the wal keep segments GUC? This change is about recovering from forked timelines rather than having insufficient wal to perform a recovery.
We've done so in our stolon fork as we also needed a higher value than 8. It's safe to do so, but I'd recommend pushing #591 so we can get this solved for everyone.
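For anyone landing here with the same question: `wal_keep_segments` is a plain `pgParameters` entry in the cluster specification, so it can be raised with a spec patch. A sketch, assuming the standard `stolonctl update --patch` syntax (the cluster name and store backend are placeholders):

```sh
stolonctl --cluster-name mycluster --store-backend etcdv3 update --patch \
  '{ "pgParameters": { "wal_keep_segments": "32" } }'
```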
Force-pushed from 487d25b to 54f0eb3
tests/integration/ha_test.go (outdated)
// take hours to recover when pg_rewind is <1m. Here we verify that the keeper
// successfully recovered the forked timeline via a rewind rather than basebackup.
if usePgrewind {
	output := standbys[1].ReadOutput()
This should already be covered by using Expect (see config_test.go)
tests/integration/utils.go (outdated)
@@ -253,7 +255,9 @@ func (p *Process) start() error {
	go func() {
		scanner := bufio.NewScanner(pr)
		for scanner.Scan() {
			p.t.Logf("[%s %s]: %s", p.name, p.uid, scanner.Text())
This should already be covered by using Expect (see config_test.go)
Hey @sgotti I'm struggling to figure out how to use the expect interface to try matching against what I need here.
I need to verify that this string occurs after a particular place in the code (so flush the buffer, then run the assertion) and I need to perform multiple assertions for several strings that may appear out of order.
Can you help me figure out how to do this with Expect?
Figured this out actually, thanks!
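For readers following along, the pattern being described — drain what the process has already logged, then assert that several strings all appear, in any order — might look something like this hypothetical helper. This is not the suite's actual Expect API, just an illustration of the idea:

```go
package integration

import (
	"fmt"
	"strings"
	"time"
)

// expectAll waits until every expected substring has been seen on the output
// channel, in any order, or fails once the timeout elapses.
func expectAll(output <-chan string, expected []string, timeout time.Duration) error {
	remaining := map[string]bool{}
	for _, s := range expected {
		remaining[s] = true
	}
	deadline := time.After(timeout)
	for len(remaining) > 0 {
		select {
		case line := <-output:
			for s := range remaining {
				if strings.Contains(line, s) {
					delete(remaining, s)
				}
			}
		case <-deadline:
			return fmt.Errorf("timed out, still waiting for: %v", remaining)
		}
	}
	return nil
}
```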
Hey @lawrencejones!
Hey @rearden-steel, I'm midway through making this work. It needs integration tests, which are what I've been working on, and a configurable variable that enables/disables checkpointing. I'm out at a conference today/tomorrow but this is right on my to-do list come Thursday. I'm going to push the WIP changes now so you can see them, and caveat this update by saying I'm explicitly ignoring Simone's advice until I've got the tests working (at which point I'll action them).
Force-pushed from 3963d9b to 5577b6a
From where I got to last time, it's difficult to simulate a Postgres that takes a while to checkpoint after a promotion. While pg_ctl promote won't force a synchronous checkpoint, it will schedule one, and in the test databases that scheduled checkpoint is almost instantaneous, leading to the control file being updated immediately. Hopefully you can wait a couple of days for me to get back to this?
I was able to reproduce this situation when doing a manual failover while running pgbench (-c 100 -T 120 -j 20) on a cluster with a pre-generated database of around 80 GB (pgbench -i -s 5000).
Sure! Thank you for your work!
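For reference, the reproduction workload described above as commands (scale 5000 gives roughly the 80 GB database mentioned):

```sh
pgbench -i -s 5000           # pre-generate a database of roughly 80 GB
pgbench -c 100 -j 20 -T 120  # 100 clients, 20 threads, run for 120 seconds
```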
Force-pushed from bd0582a to bd61324
@sgotti @rearden-steel this is ready for review now. Instead of adding this logic into an existing test, I've created another test with just two keepers that verifies we're safe from this specific race. The details of that test can be seen in stolon/tests/integration/ha_test.go, lines 1174 to 1187 (at bd61324).
I've added some test helpers and I'm going to go through the diff and comment where things may not be totally clear. Any questions, please do ask!
// wals and we'll force a full resync.
// We have to find a better way to detect if a standby is waiting
// for unavailable wals.
// TODO(sgotti) pg_rewind considers databases on the same timeline as in sync and
I edited this comment as part of an unrelated change and line-wrapped it to match the rest of the file. Can leave it or remove it as makes most sense!
ok
@@ -124,18 +124,21 @@ func TestInitWithMultipleKeepers(t *testing.T) {
 }

 func setupServers(t *testing.T, clusterName, dir string, numKeepers, numSentinels uint8, syncRepl bool, usePgrewind bool, primaryKeeper *TestKeeper) (testKeepers, testSentinels, *TestProxy, *TestStore) {
-	var initialClusterSpec *cluster.ClusterSpec
+	var initialClusterSpec = &cluster.ClusterSpec{
Small refactor here to put the common ClusterSpec up top, then assign the overrides. It might make sense to use something like https://github.com/imdario/mergo in future, so we can merge optional overrides provided to this setupServers helper.
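A sketch of that mergo idea, assuming its `WithOverride` option and using a simplified stand-in for cluster.ClusterSpec (field names here are illustrative, not the real spec):

```go
package integration

import "github.com/imdario/mergo"

// ClusterSpec is a simplified stand-in for cluster.ClusterSpec.
type ClusterSpec struct {
	SleepInterval string
	UsePgrewind   bool
}

// specWithOverrides starts from the common defaults every test shares, then
// merges any non-zero override fields on top of them.
func specWithOverrides(overrides ClusterSpec) (ClusterSpec, error) {
	spec := ClusterSpec{SleepInterval: "2s"}
	if err := mergo.Merge(&spec, overrides, mergo.WithOverride); err != nil {
		return ClusterSpec{}, err
	}
	return spec, nil
}
```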
ok (I assume nothing has been changed)
// If we want to pg_rewind then also checkpoint beforehand, otherwise we may fail due
// to uncheckpoint'ed timeline changes that have us silently fallback to basebackup.
UsePgrewind:              cluster.BoolP(usePgrewind),
CheckpointBeforePgrewind: cluster.BoolP(usePgrewind),
This is open for questioning: should we assign this by default? It doesn't harm us now but we may want to test when the parameter is not set (though I think this is unlikely).
Perhaps ok for now, until we have an issue?
ok (I assume nothing has been changed)
@@ -256,6 +270,32 @@ func waitLines(t *testing.T, q Querier, num int, timeout time.Duration) error {
	return fmt.Errorf("timeout waiting for %d lines, got: %d", num, c)
}

// waitForLine waits until a row with the expected ID appears in table01
func waitForLine(t *testing.T, q Querier, expected int, timeout time.Duration) error {
This was what I expected waitLines to do! Tripped myself up by assuming its behaviour. Two small things here:
- I've tried to comment my new test helpers to make it clear why you might use them, and how they behave. I would have found it really useful to have comments like these on the ones that already exist when getting familiar with the test suite, though I totally understand why they wouldn't be there when this is mostly a one-man operation.
- I'm a bit worried waitLines might be too loose an assertion in some tests. It could be that we're passing tests when a specific token value never replicated, while a subsequent one did. Not sure on this, but I'd normally avoid this type of assertion in favour of something more specific, like this waitForLine.
Neither are actionable, just food for thought.
I think the result is the same, just with different logic:
What the tests really want to check is that all the transactions are applied. Waiting for the number of rows in an append-only table is a way to do this.
If you only append rows to a table you know what the expected total number should be, so I think waiting for the total number of rows is enough. Waiting for the last row you inserted is another way of doing the same thing. Or am I missing something?
I'll stick to just one of them to not add additional confusion to the integration tests (if you really prefer waiting for the last row, feel free to create a PR that replaces it).
P.S. The real issue is the function name, which will require a good s/line/row/ :smile:
In this test, we're expecting to lose data. So we do three writes of 1, 2 and 3, but the final database should only ever get 1 and 3.
If I use the count of lines then I'd have to expect there to be 2 lines, but that will pass even if I never received the correct write and only have 1 and 2 in the final standby (while the primary node has 1 and 3).
Does that make any sense?
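For readers following the thread, a sketch of what a helper like waitForLine could look like. It assumes the suite's Querier exposes a database/sql-style Query method and that table01 has an integer id column — both are assumptions for illustration, so the interface is re-declared here to keep the sketch self-contained:

```go
package integration

import (
	"database/sql"
	"fmt"
	"testing"
	"time"
)

// Querier is re-declared so the sketch compiles standalone; the test suite
// already defines an equivalent interface.
type Querier interface {
	Query(query string, args ...interface{}) (*sql.Rows, error)
}

// waitForLine polls until a row with the expected id appears in table01,
// erroring once the timeout elapses.
func waitForLine(t *testing.T, q Querier, expected int, timeout time.Duration) error {
	start := time.Now()
	for time.Since(start) < timeout {
		rows, err := q.Query("SELECT count(*) FROM table01 WHERE id = $1", expected)
		if err == nil {
			var count int
			if rows.Next() {
				_ = rows.Scan(&count)
			}
			rows.Close()
			if count == 1 {
				return nil
			}
		}
		time.Sleep(500 * time.Millisecond)
	}
	return fmt.Errorf("timeout waiting for row with id %d in table01", expected)
}
```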
Force-pushed from bd61324 to 904b7cb
defaultPGParameters = cluster.PGParameters{
	"log_destination":   "stderr",
	"logging_collector": "false",
	"log_checkpoints":   "on",
Can't see why we wouldn't have this on. It doesn't hurt us and can be extremely useful.
👍
// GetPGControldata calls pg_controldata for the Postgres database directory. This can be
// useful in debugging database state, especially when comparing what was successfully
// checkpointed.
func (tk *TestKeeper) GetPGControldata() string {
Unused but was useful while debugging. Can remove if we want, or leave for future use.
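A sketch of what such a helper might look like. The TestKeeper field names (t, pgBinPath, dataDir) are assumptions for illustration, so a minimal stand-in type is declared here rather than the suite's real one:

```go
package integration

import (
	"os/exec"
	"path/filepath"
	"testing"
)

// TestKeeper is a minimal stand-in for the suite's type; the field names are
// assumptions for illustration.
type TestKeeper struct {
	t         *testing.T
	pgBinPath string
	dataDir   string
}

// GetPGControldata shells out to pg_controldata for the keeper's data
// directory and returns its raw output, which is handy when comparing what
// each instance has checkpointed.
func (tk *TestKeeper) GetPGControldata() string {
	bin := filepath.Join(tk.pgBinPath, "pg_controldata")
	out, err := exec.Command(bin, tk.dataDir).CombinedOutput()
	if err != nil {
		tk.t.Fatalf("pg_controldata failed: %v\n%s", err, out)
	}
	return string(out)
}
```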
OK, then keep it.
@lawrencejones Thanks for your PR.
Besides the inline comments, one thing that I'd like to avoid at all costs is issuing "write" commands from one keeper to another instance, since it looks quite dirty (it goes against the fact that a postgres instance should be managed only by its own keeper and, in this case, every resync will issue an unneeded checkpoint).
So, shouldn't it be possible to just issue a checkpoint after a promote from the primary keeper?
The main issue I see here is that there'll be a window between issuing a promote and finishing the checkpoint where the keeper has already reported its state to the sentinel and it has updated the old primary's follow config.
One possible solution to fix this, and also other related issues, would be to report to the sentinel the current checkpoint status after promote, or the checkpointed timeline (a new field in cluster.go `DBStatus`), so the sentinel will set the old primary's `followConfig` only when it's sure it's in a good state for a pg_rewind.
P.S. The same logic could be expanded to also report the current replication slots so the sentinel can avoid similar issues (and remove the use of the `--slot` option in pg_basebackup, which is another hacky thing).
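To make the proposal concrete, a hypothetical sketch of such a field — the name and tag are invented here for illustration and are not part of stolon:

```go
package cluster

// DBStatus sketch: only the proposed addition is shown.
type DBStatus struct {
	// CheckpointedTimelineID would report the latest timeline recorded in
	// pg_control, i.e. the timeline a pg_rewind started right now would
	// observe. The sentinel would then set the old primary's followConfig
	// only once this matches the new primary's timeline.
	CheckpointedTimelineID uint64 `json:"checkpointedTimelineID,omitempty"`
}
```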
doc/cluster_spec.md (outdated)
@@ -27,6 +27,7 @@ Some options in a running cluster specification can be changed to update the desired state:
| additionalWalSenders | number of additional wal_senders in addition to the ones internally defined by stolon, useful to provide enough wal senders for external standbys (changing this value requires an instance restart) | no | uint16 | 5 |
| additionalMasterReplicationSlots | a list of additional physical replication slots to be created on the master postgres instance. They will be prefixed with `stolon_` (like internal replication slots used for standby replication) to make them "namespaced" from other replication slots. Replication slots starting with `stolon_` and not defined here (and not used for standby replication) will be dropped from the master instance. | no | []string | null |
| usePgrewind | try to use pg_rewind for faster instance resynchronization. | no | bool | false |
| checkpointBeforePgrewind | force a checkpoint before pg_rewind to prevent the rewind racing the checkpointer process after a standby is newly promoted. This will cause increased IO on whatever Postgres node the currently resyncing Postgres is following, as the checkpoint will be immediate and will not respect spread configuration. | no | bool | false |
Perhaps we should make it clearer that the checkpoint is issued on the current primary. At first sight it's not clear.
// of downtime.
if forceCheckpoint {
	log.Infow("issuing checkpoint on primary")
	psqlName := filepath.Join(p.pgBinPath, "psql")
why are you using psql instead of directly calling it from go sql?
I was doing this to ensure the way we connect to the Postgres for pg_rewind, pg_basebackup and for issuing a checkpoint was consistent. psql should behave exactly the same as rewind/basebackup, whereas connecting from within Go could be subtly different in many ways.
Does that make sense, or do you think we should try constructing a Go connection?
https://gist.github.com/viggy28/954ff01cd3c29d317834a5c25951a1cd
@lawrencejones and @sgotti does this look okay? (Since sslmode=prefer is not supported by lib/pq, I have to choose the mode based on the SSL settings of the cluster.)
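For reference, a minimal sketch of the Go-side approach under discussion, assuming lib/pq. As noted above, lib/pq has no sslmode=prefer, so the mode would need to be derived from the cluster's SSL settings; the connection string below is purely illustrative:

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq"
)

// checkpoint opens a throwaway connection to the primary and issues a
// CHECKPOINT, which requires superuser privileges.
func checkpoint(connString string) error {
	db, err := sql.Open("postgres", connString)
	if err != nil {
		return err
	}
	defer db.Close()
	_, err = db.Exec("CHECKPOINT")
	return err
}

func main() {
	if err := checkpoint("host=192.0.2.10 port=5432 user=postgres sslmode=disable"); err != nil {
		log.Fatal(err)
	}
}
```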
Hey @sgotti! I've been rubbish here: PRs were left hanging while I went on holiday 🏖 I'm going to reply to each bit of your review separately. Hopefully we can reason our way to a conclusion!
Postgres will automatically begin a checkpoint whenever a promote happens, but it will do so in a spread fashion (ie, using all the checkpointing configuration parameters to ensure it throttles the checkpointing). If we were to get the keeper to issue a checkpoint itself after promotion (optionally, via a similar configuration parameter as this PR) then we'd force the master to checkpoint faster, which is good, but...

At the same time as the master checkpoints, our slaves will attempt to resync against the master. Until the checkpoint completes the pg_rewind could fail, and we'll fall back to pg_basebackup, which is what we really want to avoid.

If we really want to avoid this, then a possible solution would be to provide this checkpoint-after-promote functionality we've been discussing, but additionally have keepers retry pg_rewind up to a specific time threshold. A user could then set the retry timeout to 5m, for example, and enable checkpoint-after-promote to ensure the rewind will be retried after the checkpoint is expected to have completed.
It's quite sad for me that we have to do this, if you know what I mean? pg_basebackup has an option to wait for a checkpoint to complete before starting, so it's quite sad to me that pg_rewind doesn't also have this option. Perhaps we can simulate it by creating a backup label, as that appears to be what pg_basebackup does internally? Do you have any thoughts on this?
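The backup-label idea would lean on the fact that pg_start_backup waits for a checkpoint before returning, and with fast=false that checkpoint is spread. A sketch using the (pre-PostgreSQL 15, exclusive) backup API:

```sql
-- fast = false asks for a spread checkpoint; the call returns only once the
-- checkpoint has completed, after which a pg_rewind would see the new timeline.
SELECT pg_start_backup('stolon_pre_rewind', false);
SELECT pg_stop_backup();
```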
@sgotti I know you're probably busy on a lot of other things right now, but would really appreciate your feedback for this change. It matters quite a lot to us that rewinds are working, and this change makes that the case. If we want to take another path that's great, but we'll need to know what it is before we can get implementing it!
@lawrencejones Even if, theoretically, I'd prefer a solution that doesn't require doing operations from one keeper to another, in the end this is already done when doing a pg_basebackup or a pg_rewind. So I don't want to stop this PR trying to find a cleanliness that's not really possible. Let's go ahead with your solution! (In the end we are just doing what pg_rewind should do: a checkpoint inside pg_rewind, like pg_basebackup already does.) P.S. We should report this upstream if not already done.
Force-pushed from 904b7cb to 9a49a91
Hey @sgotti!
Makes sense to me :) I've adjusted the doc comment to make it clearer we'll be hitting the primary. All your other review comments were questions/have been addressed, I think. Can I get another look? Thanks.
Force-pushed from 9a49a91 to afb5211
Hi team, are there any updates on this? Anything I can help with to get this merged?
@viggy28 If no one is going to work on this PR (@lawrencejones) feel free to adopt and update it.
Hey @viggy28, if you want to take this over then please let me know. I'm not working on stolon things right now so if you need this, I'll be happy to hand it over!
Sure @lawrencejones. Looks like you have done most of the work. Thanks much. Trying to understand what is needed.
Anything else I am missing?
Dear contributors,
Is anybody still working on this issue?
Provide a configuration parameter that enables issuing of a CHECKPOINT
prior to starting a pg_rewind, ensuring the control file is up-to-date
with the latest timeline information before the rewind process starts.
When this setting is disabled, it's possible for a newly promoted master
to have a control file with a timeline from before the promotion
occurred. If a follower tries to resync using rewind against this
database it will quit without doing anything as it will see no timeline
deviation when it reads the control file, stating "no rewind required".
Some users run large databases where the cost of a basebackup is many
times more expensive than rewind. These users would probably trade the
performance impact of immediate checkpointing for faster recovery of the
standby, especially for those using synchronous replication who may be
unable to accept writes until the standby is recovered.
For now, this change will at least enable a "rewind at all costs" option
as opposed to the "maybe rewind if possible" that existed before. Future
work might support a retry timeout for rewind operations, for those
users who don't want the performance hit but do value rewinding over
basebackups. We could also use the backup API to trigger a checkpoint
with the target Postgres spread configuration, then retry rewinding
until the checkpoint_timeout has elapsed.
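A sketch of the retry idea from the last paragraph. runPgrewind stands in for the keeper's existing rewind invocation and is not a real stolon function:

```go
package keeper

import (
	"fmt"
	"log"
	"time"
)

// rewindWithRetry keeps attempting pg_rewind until it succeeds or the
// configured timeout elapses, giving the primary's post-promotion checkpoint
// time to finish between attempts.
func rewindWithRetry(runPgrewind func() error, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		err := runPgrewind()
		if err == nil {
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("pg_rewind still failing after %s: %v", timeout, err)
		}
		log.Printf("pg_rewind failed (%v), retrying in 10s", err)
		time.Sleep(10 * time.Second)
	}
}
```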
I think this is in a good place now. I've been running the new tests against both sets of binaries: if you build stolon from master (which will silently ignore the `CheckpointBeforePgrewind` configuration parameter) the new test fails, while with binaries built against this branch the tests pass.