spec.CheckpointBeforePgrewind #644

Open · lawrencejones wants to merge 1 commit into master from lawrence-checkpoint-before-rewind

Conversation

lawrencejones
Contributor

@lawrencejones commented May 17, 2019

Provide a configuration parameter that enables issuing a CHECKPOINT
prior to starting a pg_rewind, ensuring the control file is up to date
with the latest timeline information before the rewind process starts.

When this setting is disabled, it's possible for a newly promoted master
to have a control file with a timeline from before the promotion
occurred. If a follower tries to resync using rewind against this
database, it will quit without doing anything, as it will see no timeline
deviation when it reads the control file, stating "no rewind required".

Some users run large databases where the cost of a basebackup is many
times more expensive than rewind. These users would probably trade the
performance impact of immediate checkpointing for faster recovery of the
standby, especially for those using synchronous replication who may be
unable to accept writes until the standby is recovered.

For now, this change will at least enable a "rewind at all costs" option
as opposed to the "maybe rewind if possible" that existed before. Future
work might support a retry timeout for rewind operations, for those
users who don't want the performance hit but do value rewinding over
basebackups. We could also use the backup API to trigger a checkpoint
with the target Postgres spread configuration, then retry rewinding
until the checkpoint_timeout has elapsed.
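
For reference, the keeper-side flow this parameter enables amounts to roughly the following sketch (illustrative only, not this PR's exact diff: the function name, arguments, and the way the connection string is passed to psql and pg_rewind are assumptions):

    package keeper

    import (
        "context"
        "fmt"
        "os/exec"
        "path/filepath"
    )

    // checkpointThenRewind sketches the resync path described above: shell out
    // to psql against the followed primary and issue an immediate CHECKPOINT
    // (so its control file carries the post-promotion timeline), then run
    // pg_rewind against the same connection string.
    func checkpointThenRewind(ctx context.Context, pgBinPath, dataDir, connString string, forceCheckpoint bool) error {
        if forceCheckpoint {
            // An explicit CHECKPOINT is immediate and ignores the primary's
            // checkpoint spread settings; that is the IO trade-off mentioned above.
            psql := filepath.Join(pgBinPath, "psql")
            if out, err := exec.CommandContext(ctx, psql, connString, "-c", "CHECKPOINT").CombinedOutput(); err != nil {
                return fmt.Errorf("checkpoint on primary failed: %v: %s", err, out)
            }
        }
        pgRewind := filepath.Join(pgBinPath, "pg_rewind")
        if out, err := exec.CommandContext(ctx, pgRewind, "-D", dataDir, "--source-server="+connString).CombinedOutput(); err != nil {
            return fmt.Errorf("pg_rewind failed: %v: %s", err, out)
        }
        return nil
    }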


I think this is in a good place now. I've been running these tests like so:

$ STCTL_BIN="$(pwd)/bin/stolonctl" STKEEPER_BIN="$(pwd)/bin/stolon-keeper" STSENTINEL_BIN="$(pwd)/bin/stolon-sentinel" STDCTL_BIN="$(pwd)/bin/stolonctl" STPROXY_BIN="$(pwd)/bin/stolon-proxy" STOLON_TEST_STORE_BACKEND=etcdv3 ETCD_BIN=$(which etcd) go test tests/integration/*.go -v -test.run=TestTimelineForkPgRewindCheckpoint -timeout=3m

If you build the stolon binaries from master (which will silently ignore the CheckpointBeforePgrewind configuration parameter), then you'll see the following failure:

    utils.go:260: [keeper fdc71fc5]: 2019-06-12T17:26:45.669+0100       INFO    cmd/keeper.go:824       syncing using pg_rewind {"followedDB": "d6a67b88", "keeper": "e3414a18"}                                                                                                                                                                                                                                 
    utils.go:260: [keeper fdc71fc5]: 2019-06-12T17:26:45.669+0100       INFO    postgresql/postgresql.go:789    running pg_rewind                                                                                                                                                                                                                                                                                
    ha_test.go:1275: keeper tried rewinding but rewind thought it was not required: fdc71fc5                                                                                                                                                                                                                                                                                                                     
    utils.go:260: [keeper fdc71fc5]: fetched file "global/pg_control", length 8192                                                                                                                                                                                                                                                                                                                               
    utils.go:260: [keeper fdc71fc5]: source and target cluster are on the same timeline                                                                                                                                                                                                                                                                                                                          
    utils.go:260: [keeper fdc71fc5]: no rewind required   

Running with binaries built against this branch, the tests pass and the log output looks like this:

    ha_test.go:1213: Waiting 2 * sleepInterval for checkpoint configuration to be applied
    ha_test.go:1216: Creating lots of un-CHECKPOINT'd data
    ha_test.go:1221: Write token 1 to master: f0333453
    ha_test.go:1234: SIGSTOPping current standby keeper: 17d6df63
    ha_test.go:1238: SIGSTOPping current standby wal receivers: 17d6df63
    ha_test.go:1243: Writing token 2 to master, advancing current timeline: f0333453
    ha_test.go:1248: Stopping master: f0333453
    ha_test.go:1255: SIGKILLing current standby wal receivers: 17d6df63
    ha_test.go:1261: SIGCONTing current standby keeper: 17d6df63
    ha_test.go:1267: Waiting for new master to assume role: 17d6df63
    ha_test.go:1272: Write token 3 to new master, advancing current timeline: 17d6df63
    ha_test.go:1279: Starting old master and verifying it successfully rewinds: f0333453
    utils.go:260: [keeper f0333453]: 2019-06-12T17:59:57.228+0100       INFO    cmd/keeper.go:844       syncing using pg_rewind {"followedDB": "51bca7ec", "keeper": "17d6df63", "forcingCheckpoint": true}
    utils.go:260: [keeper f0333453]: 2019-06-12T17:59:57.229+0100       INFO    postgresql/postgresql.go:804    issuing checkpoint on primary
    utils.go:260: [keeper 17d6df63]: 2019-06-12 17:59:57.293 BST [23830] LOG:  checkpoint complete: wrote 1671 buffers (10.2%); 0 WAL file(s) added, 0 removed, 0 recycled; write=4.847 s, sync=0.014 s, total=4.863 s; sync files=335, longest=0.000 s, average=0.000 s; distance=34075 kB, estimate=34075 kB
    utils.go:260: [keeper 17d6df63]: 2019-06-12 17:59:57.295 BST [23830] LOG:  checkpoint starting: immediate force wait
    utils.go:260: [keeper 17d6df63]: 2019-06-12 17:59:57.296 BST [23830] LOG:  checkpoint complete: wrote 0 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.000 s, sync=0.000 s, total=0.001 s; sync files=0, longest=0.000 s, average=0.000 s; distance=0 kB, estimate=30668 kB
    utils.go:260: [keeper f0333453]: CHECKPOINT
    utils.go:260: [keeper f0333453]: 2019-06-12T17:59:57.351+0100       INFO    postgresql/postgresql.go:815    running pg_rewind
    utils.go:260: [keeper f0333453]: fetched file "global/pg_control", length 8192
    utils.go:260: [keeper f0333453]: fetched file "pg_wal/00000002.history", length 41
    utils.go:260: [keeper f0333453]: Source timeline history:
    utils.go:260: [keeper f0333453]: Target timeline history:
    utils.go:260: [keeper f0333453]: 1: 0/0 - 0/0
    utils.go:260: [keeper f0333453]: servers diverged at WAL location 0/4146FE0 on timeline 1
    utils.go:260: [keeper f0333453]: rewinding from last common checkpoint at 0/2000060 on timeline 1
    ha_test.go:1304: Waiting for the old master to receive token 3 from forked timeline: f0333453

@lawrencejones
Contributor Author

This actually addresses #601, though not via any of the remediations recommended in that issue.

@rearden-steel

I've also experienced this issue.
@sgotti why is wal_keep_segments hard-coded to such a small value as 8, and can I increase it manually in the code as a temporary workaround for this problem?

@Widerstehen

wal_keep_segments

How do I change wal_keep_segments?

@lawrencejones
Contributor Author

Hey @rearden-steel @frank111, I'm not sure how this relates to the wal_keep_segments GUC. This change is about recovering from forked timelines rather than having insufficient WAL to perform a recovery.

can I increase it manually in code

We've done so in our stolon fork as we also needed a higher value than 8. It's safe to do so, but I'd recommend pushing #591 so we can get this solved for everyone.

// take hours to recover when pg_rewind is <1m. Here we verify that the keeper
// successfully recovered the forked timeline via a rewind rather than basebackup.
if usePgrewind {
output := standbys[1].ReadOutput()
Member

This should already be covered by using Expect (see config_test.go)

@@ -253,7 +255,9 @@ func (p *Process) start() error {
go func() {
scanner := bufio.NewScanner(pr)
for scanner.Scan() {
p.t.Logf("[%s %s]: %s", p.name, p.uid, scanner.Text())
Member

This should already be covered by using Expect (see config_test.go)

Contributor Author

Hey @sgotti I'm struggling to figure out how to use the expect interface to try matching against what I need here.

I need to verify that this string occurs after a particular place in the code (so flush the buffer, then run the assertion) and I need to perform multiple assertions for several strings that may appear out of order.

Can you help me figure out how to do this with Expect?
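
(For illustration, independent of the Expect helper, the assertion described here amounts to something like the sketch below; it assumes ReadOutput, which appears elsewhere in this diff, returns the buffered output as a string.)

    package integration

    import (
        "strings"
        "testing"
    )

    // assertOutputContainsAll checks that every expected substring appears in
    // the captured keeper output, regardless of the order in which the lines
    // were emitted. The output is assumed to have been captured from a known
    // point (i.e. the buffer was flushed before the interesting code ran).
    func assertOutputContainsAll(t *testing.T, output string, expected ...string) {
        t.Helper()
        for _, want := range expected {
            if !strings.Contains(output, want) {
                t.Fatalf("expected keeper output to contain %q", want)
            }
        }
    }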

Contributor Author

Figured this out actually, thanks!

@rearden-steel

Hey @lawrencejones!
Is there any chance you could make the requested changes?
We're experiencing issues that should be fixed by this PR.

@lawrencejones
Contributor Author

lawrencejones commented Jun 11, 2019

Hey @rearden-steel, I'm midway through making this work. It still needs integration tests, which are what I've been working on, and a configurable variable that enables/disables checkpointing.

I'm out at a conference today/tomorrow but this is right on my to-do list come Thursday. I'm going to push the WIP changes now so you can see them, and caveat this update by saying I'm explicitly ignoring Simone's advice until I've got the tests working (at which point I'll action them).

@lawrencejones force-pushed the lawrence-checkpoint-before-rewind branch from 3963d9b to 5577b6a on June 11, 2019 at 13:14
@lawrencejones
Contributor Author

From where I got to last time, it's difficult to simulate a Postgres that takes a while to checkpoint after a promotion. While pg_ctl promote won't force a synchronous checkpoint, it will schedule one, and in the test databases that scheduled checkpoint is almost instantaneous, leading to the control file being updated immediately.

Hopefully you can wait a couple of days for me to get back to this?

@rearden-steel

rearden-steel commented Jun 11, 2019

it's difficult to simulate a Postgres that takes a while to checkpoint after a promotion

I was able to reach such a situation when doing a manual failover while running pgbench (-c 100 -T 120 -j 20) on a cluster (with a pre-generated database of around 80 GB, pgbench -i -s 5000).

Hopefully you can wait a couple of days for me to get back to this?

Sure! Thank you for your work!

@lawrencejones force-pushed the lawrence-checkpoint-before-rewind branch 2 times, most recently from bd0582a to bd61324 on June 12, 2019 at 16:58
@lawrencejones changed the title from "[WIP] Force checkpoint before rewind" to "spec.CheckpointBeforePgrewind" on Jun 12, 2019
@lawrencejones
Contributor Author

@sgotti @rearden-steel this is ready for review now.

Instead of adding this logic into an existing test, I've created another test with just two keepers that verifies we're safe from this specific race. The details of that test can be seen here:

// This test verifies that pg_rewind isn't subject to a race against a newly promoted
// Postgres promoting into a new timeline and only checkpointing after the fact. We
// simulate this with a two node cluster and:
//
// 1. Write junk data that we assume to be un-CHECKPOINT'ed
// 2. Pause the standby at 1a and advance the master to 1b
// 3. Shutdown master on timeline 1b
// 4. Resume the standby on timeline 1a and wait for it to promote into 2b
// 5. Start old master at 1b which can never rejoin the current 2b timeline, so it will
// resync. This resync should use pg_rewind and we verify the pg_rewind works as
// expected, despite our expectation that the new master would not have independently
// checkpointed after promotion without stolon having requested it
//
// This will only pass when CheckpointBeforePgrewind is set to true in the cluster spec.

I've added some test helpers and I'm going to go through the diff and comment where things may not be totally clear.

Any questions, please do ask!

// wals and we'll force a full resync.
// We have to find a better way to detect if a standby is waiting
// for unavailable wals.
// TODO(sgotti) pg_rewind considers databases on the same timeline as in sync and
Contributor Author

I edited this comment as part of an unrelated change and line-wrapped it to match the rest of the file. Can leave it or remove it as makes most sense!

Member

ok

@@ -124,18 +124,21 @@ func TestInitWithMultipleKeepers(t *testing.T) {
}

func setupServers(t *testing.T, clusterName, dir string, numKeepers, numSentinels uint8, syncRepl bool, usePgrewind bool, primaryKeeper *TestKeeper) (testKeepers, testSentinels, *TestProxy, *TestStore) {
var initialClusterSpec *cluster.ClusterSpec
var initialClusterSpec = &cluster.ClusterSpec{
Contributor Author

Small refactor here to put the common ClusterSpec up top, then assign the overrides. It might make sense to use something like https://github.com/imdario/mergo in future, so we can merge optional overrides provided in this setupServers helper.

Member

ok (I assume nothing has been changed)

// If we want to pg_rewind then also checkpoint beforehand, otherwise we may fail due
// to uncheckpoint'ed timeline changes that have us silently fallback to basebackup.
UsePgrewind: cluster.BoolP(usePgrewind),
CheckpointBeforePgrewind: cluster.BoolP(usePgrewind),
Contributor Author

This is open for questioning: should we assign this by default? It doesn't harm us now but we may want to test when the parameter is not set (though I think this is unlikely).

Perhaps ok for now, until we have an issue?

Member

ok (I assume nothing has been changed)

@@ -256,6 +270,32 @@ func waitLines(t *testing.T, q Querier, num int, timeout time.Duration) error {
return fmt.Errorf("timeout waiting for %d lines, got: %d", num, c)
}

// waitForLine waits until a row with the expected ID appears in table01
func waitForLine(t *testing.T, q Querier, expected int, timeout time.Duration) error {
Contributor Author

@lawrencejones Jun 12, 2019

This was what I expected waitForLines to do! Tripped myself up by assuming its behaviour. Two small things here:

  • I've tried to comment my new test helpers to make it clear why you might use them, and how they behave. I would have found it really useful to have comments like these on the ones that already exist when getting familiar with the test suite, though I totally understand when this is mostly a one man operation why they wouldn't be there
  • I'm a bit worried waitLines might be too loose an assertion in some tests. It could be that we're passing tests when a specific token value never replicated, while a subsequent one did. Not sure on this but I'd normally avoid this type of assertion in favour of something more specific, like this waitForLine

Neither are actionable, just food for thought.

Member

I think the result is the same just with different logic:

What the tests really want to check is that all the transactions are applied. Waiting for the number of rows in an append only table is a way to do this.

If you only append rows to a table you know what the expected total number should be. So I think waiting for the total number of rows is enough. Waiting for the last row you inserted is another way for doing the same thing. Or am I missing something?

I'll stick to just one of them to avoid adding additional confusion to the integration tests (if you really prefer waiting for the last row, feel free to create a PR that replaces it).

P.S. The real issue is the function name, which will require a good s/line/row/ :smile:

Contributor Author

In this test, we're expecting to lose data. So we do three writes of 1, 2 and 3, but the final database should only ever get 1 and 3.

If I use the count of lines then I'd have to expect there to be 2 lines, but that will pass even if I never received the correct write and only have 1 and 2 in the final standby (while the primary node has 1 and 3).

Does that make any sense?
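
(For illustration, a helper with this shape could be implemented roughly as follows; the Querier interface is assumed here to expose a database/sql-style Query method and may not match the real test utilities.)

    package integration

    import (
        "database/sql"
        "fmt"
        "testing"
        "time"
    )

    // Querier is assumed to look roughly like this in the test utilities.
    type Querier interface {
        Query(query string, args ...interface{}) (*sql.Rows, error)
    }

    // waitForLine polls table01 until a row with the expected ID appears or the
    // timeout elapses, so a test can assert that one specific write replicated
    // rather than just counting rows.
    func waitForLine(t *testing.T, q Querier, expected int, timeout time.Duration) error {
        start := time.Now()
        for time.Since(start) < timeout {
            rows, err := q.Query("SELECT id FROM table01 WHERE id = $1", expected)
            if err == nil {
                found := rows.Next()
                rows.Close()
                if found {
                    return nil
                }
            }
            time.Sleep(2 * time.Second)
        }
        return fmt.Errorf("timeout waiting for row with id %d", expected)
    }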

@lawrencejones force-pushed the lawrence-checkpoint-before-rewind branch from bd61324 to 904b7cb on June 12, 2019 at 17:18
defaultPGParameters = cluster.PGParameters{
"log_destination": "stderr",
"logging_collector": "false",
"log_checkpoints": "on",
Contributor Author

Can't see why we wouldn't have this on. It doesn't hurt us and can be extremely useful.

Member

👍

// GetPGControldata calls pg_controldata for the Postgres database directory. This can be
// useful in debugging database state, especially when comparing what was successfully
// checkpointed.
func (tk *TestKeeper) GetPGControldata() string {
Contributor Author

Unused but was useful while debugging. Can remove if we want, or leave for future use.

Member

OK, then keep it.

Member

@sgotti left a comment

@lawrencejones Thanks for your PR.

Besides the inline comments, one thing that I'd like to avoid at all costs is issuing "write" commands from one keeper to another instance, since it looks quite dirty (it goes against the principle that a postgres instance should be managed only by its keeper, and, in this case, every resync will issue an unneeded checkpoint).

So, shouldn't it be possible for the primary keeper to just issue a checkpoint after a promote?

The main issue I see here is that there'll be a window between issuing a promote and finishing the checkpoint where the keeper has already reported its state to the sentinel and it has updated the old primary's follow config.
One possible solution to fix this and also other related issues would be to report to the sentinel the current checkpoint status after promote, or the checkpoint timeline (a new field in cluster.go's DBStatus), so the sentinel will set the old primary's followConfig only when it's sure it's in a good state for a pg_rewind.

P.S. The same logic could be expanded to also report the current replication slots so the sentinel can avoid similar issues (and remove the use of the --slot option in pg_basebackup, which is another hacky thing).
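
(For illustration, the DBStatus addition suggested here could look roughly like the sketch below; the field name is hypothetical and not present in cluster.go.)

    package cluster

    // Hypothetical sketch of the suggestion above: have the keeper report the
    // timeline recorded by its latest checkpoint (as read from the control
    // file) so the sentinel can delay setting the old primary's followConfig
    // until the new primary has checkpointed on the new timeline.
    type DBStatus struct {
        // ... existing fields ...

        // CheckpointTimelineID would carry the "Latest checkpoint's TimeLineID"
        // value from pg_controldata, reported after promotion.
        CheckpointTimelineID uint64
    }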

@@ -27,6 +27,7 @@ Some options in a running cluster specification can be changed to update the des
| additionalWalSenders | number of additional wal_senders in addition to the ones internally defined by stolon, useful to provide enough wal senders for external standbys (changing this value requires an instance restart) | no | uint16 | 5 |
| additionalMasterReplicationSlots | a list of additional physical replication slots to be created on the master postgres instance. They will be prefixed with `stolon_` (like internal replication slots used for standby replication) to make them "namespaced" from other replication slots. Replication slots starting with `stolon_` and not defined here (and not used for standby replication) will be dropped from the master instance. | no | []string | null |
| usePgrewind | try to use pg_rewind for faster instance resyncronization. | no | bool | false |
| checkpointBeforePgrewind | Force a checkpoint before pg_rewind to prevent the rewind racing the checkpointer process after a standby is newly promoted. This will cause increased IO on whatever Postgres node the currently resync'ing Postgres is following, as the checkpoint will be immediate and will not respect spread configuration.
Member

Perhaps we should make it clearer that the checkpoint is issued on the current primary. At first sight it's not clear.

// of downtime.
if forceCheckpoint {
log.Infow("issuing checkpoint on primary")
psqlName := filepath.Join(p.pgBinPath, "psql")
Member

Why are you using psql instead of calling it directly from Go's database/sql?

Contributor Author

I was doing this to ensure the way we connect to the Postgres for pg_rewind, pg_basebackup and for issuing a checkpoint was consistent. psql should behave exactly the same as rewind/basebackup, whereas connecting from within Go could be subtly different in many ways.

Does that make sense, or do you think we should try constructing a Go connection?

https://gist.github.com/viggy28/954ff01cd3c29d317834a5c25951a1cd

@lawrencejones and @sgotti does this look okay? (Since sslmode=prefer is not supported by lib/pq, I have to replace it based on the SSL settings of the cluster.)
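
(For illustration, the Go-connection alternative under discussion amounts to something like the sketch below, assuming lib/pq; because lib/pq does not support sslmode=prefer, the mode has to be derived from the cluster's SSL settings rather than copied from the pg_rewind connection parameters. Names are illustrative, not the gist's exact code.)

    package postgresql

    import (
        "database/sql"
        "fmt"

        _ "github.com/lib/pq"
    )

    // checkpointPrimary issues an immediate CHECKPOINT on the followed primary
    // over a Go connection instead of shelling out to psql. The sslmode value
    // must be one lib/pq understands (e.g. "disable", "require", "verify-full"),
    // chosen from the cluster's SSL settings.
    func checkpointPrimary(host, port, user, password, sslmode string) error {
        connStr := fmt.Sprintf("host=%s port=%s user=%s password=%s dbname=postgres sslmode=%s",
            host, port, user, password, sslmode)
        db, err := sql.Open("postgres", connStr)
        if err != nil {
            return err
        }
        defer db.Close()
        _, err = db.Exec("CHECKPOINT")
        return err
    }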

@lawrencejones
Contributor Author

Hey @sgotti! I've been rubbish here: PRs were left hanging while I went on holiday 🏖

I'm going to reply to each bit of your review separately. Hopefully we can reason our way to a conclusion!

Besides the inline comments, one thing that I'd like to avoid at all costs is issuing "write" commands from one keeper to another instance, since it looks quite dirty (it goes against the principle that a postgres instance should be managed only by its keeper, and, in this case, every resync will issue an unneeded checkpoint).

So, shouldn't it be possible for the primary keeper to just issue a checkpoint after a promote?

Postgres will automatically begin a checkpoint whenever a promote happens, but it will do so in a spread fashion (i.e., using all the checkpointing configuration parameters to ensure it throttles the checkpointing). If we were to get the keeper to issue a checkpoint itself after promotion (optionally, via a configuration parameter similar to this PR's), then we'd force the master to checkpoint faster, which is good, but...

At the same time as the master checkpoints, our slaves will attempt to resync against the master. Until the checkpoint completes the pg_rewind could fail, and we'll fall back to pg_basebackup, which is what we really want to avoid.

If we really want to avoid this, then a possible solution would be to provide this checkpoint-after-promote functionality we've been discussing, but additionally have keepers retry pg_rewind for a specific time threshold. A user could then set the retry timeout to be 5m, for example, and enable checkpoint-after-promote to ensure the rewind will be retried after the checkpoint is expected to be completed.
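
(For illustration, such a retry window could look roughly like the sketch below; the attempt closure is hypothetical and would wrap pg_rewind plus detection of the "no rewind required" case.)

    package keeper

    import (
        "context"
        "fmt"
        "time"
    )

    // rewindWithRetry keeps retrying a rewind attempt for a configurable window
    // so a spread checkpoint on the newly promoted primary has time to land
    // before we give up and fall back to pg_basebackup. attempt reports whether
    // pg_rewind actually rewound (i.e. did not bail out with "no rewind
    // required" despite an expected timeline fork).
    func rewindWithRetry(ctx context.Context, retryTimeout, interval time.Duration, attempt func(context.Context) (bool, error)) error {
        deadline := time.Now().Add(retryTimeout)
        for {
            rewound, err := attempt(ctx)
            if err != nil {
                return err
            }
            if rewound {
                return nil
            }
            if time.Now().After(deadline) {
                return fmt.Errorf("pg_rewind still reports no rewind required after %s", retryTimeout)
            }
            select {
            case <-ctx.Done():
                return ctx.Err()
            case <-time.After(interval):
            }
        }
    }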

The main issue I see here is that there'll be a window between issuing a promote and finishing the checkpoint where the keeper has already reported its state to the sentinel and it has updated the old primary's follow config.

One possible solution to fix this and also other related issues would be to report to the sentinel the current checkpoint status after promote, or the checkpoint timeline (a new field in cluster.go's DBStatus), so the sentinel will set the old primary's followConfig only when it's sure it's in a good state for a pg_rewind.

It's quite sad for me that we have to do this, if you know what I mean? pg_basebackup has an option to wait for a checkpoint to complete before starting, so it's quite sad to me that pg_rewind doesn't also have this option. Perhaps we can simulate it by creating a backup label, as that appears to be what pg_basebackup does internally? Do you have any thoughts on this?

@lawrencejones
Contributor Author

@sgotti I know you're probably busy on a lot of other things right now, but would really appreciate your feedback for this change. It matters quite a lot to us that rewinds are working, and this change makes that the case.

If we want to take another path that's great, but we'll need to know what it is before we can get implementing it!

@sgotti
Member

sgotti commented Aug 23, 2019

@lawrencejones Even if, theoretically, I'd prefer a solution that doesn't require doing operations from one keeper to another, in the end this is already done when doing a pg_basebackup or a pg_rewind.

So I don't want to stop this PR trying to find a cleanliness that's not really possible. Let's go ahead with your solution! (In the end we are just doing what pg_rewind should do: a checkpoint inside pg_rewind, like pg_basebackup already does.)

P.S. We should report this upstream if not already done.

@lawrencejones force-pushed the lawrence-checkpoint-before-rewind branch from 904b7cb to 9a49a91 on August 27, 2019 at 11:11
@lawrencejones
Contributor Author

Hey @sgotti!

So I don't want to stop this PR trying to find a cleanliness that's not really possible. Let's go ahead with your solution! (In the end we are just doing what pg_rewind should do: a checkpoint inside pg_rewind, like pg_basebackup already does.)

Makes sense to me :)

I've adjusted the doc comment to make it clearer we'll be hitting the primary. All your other review comments were questions/have been addressed, I think.

Can I get another look? Thanks.

@lawrencejones force-pushed the lawrence-checkpoint-before-rewind branch from 9a49a91 to afb5211 on September 11, 2019 at 17:09
@viggy28

viggy28 commented Jun 23, 2020

Hi Team, wondering any updates on this? Anything I can help with to get this merged?

@sgotti
Member

sgotti commented Jun 27, 2020

@viggy28 If no one is going to work on this PR (@lawrencejones), feel free to adopt and update it.

@lawrencejones
Contributor Author

Hey @viggy28, if you want to take this over then please let me know. I’m not working on stolon things right now so if you need this, I’ll be happy to hand it over!

@viggy28

viggy28 commented Jul 1, 2020

Sure @lawrencejones. Looks like you have done most of the work. Thanks much.

Trying to understand what is needed.

  1. https://github.com/sorintlab/stolon/pull/644/files#r293416129
  2. The review on spec.CheckpointBeforePgrewind #644 is unresolved, but @sgotti also mentioned in a comment on #644 that it shouldn't be a blocker, which means we can resolve that?
  3. Looks like the build is failing.

Anything else I am missing?

@pirateunclejack

Dear contributors,
We have a stolon cluster which is so big that it takes hours to recover with a basebackup.
I applied this change on top of the latest master and ran it in our test environment. The standby keepers keep failing after pg_rewind is done:


pg_rewind: Done!
2024-02-27T09:24:12.615Z	INFO	cmd/keeper.go:928	syncing using pg_rewind	{"masterDB": "1f5bbc57", "keeper": "keeper0"}
2024-02-27T09:24:12.636Z	INFO	postgresql/postgresql.go:921	issuing checkpoint on primary
CHECKPOINT
2024-02-27T09:24:12.993Z	INFO	postgresql/postgresql.go:932	running pg_rewind
2024-02-27T09:24:12.994Z	DEBUG	postgresql/postgresql.go:936	execing cmd	{"cmd": "/usr/lib/postgresql/14/bin/pg_rewind --debug -D /stolon-data/postgres --source-server=application_name=stolon_7805b990 dbname=zabbix host=172.16.250.153 options=-c\\ synchronous_commit=off password=******************** port=5432 sslmode=prefer user=stolon"}
pg_rewind: executing "/usr/lib/postgresql/14/bin/postgres" for target server to complete crash recovery
2024-02-27 09:24:13.068 GMT [1590] LOG:  skipping missing configuration file "/stolon-data/postgres/postgresql.auto.conf"
2024-02-27T09:24:13.173Z	ERROR	cmd/keeper.go:736	cannot get configured pg parameters	{"error": "dial unix /tmp/.s.PGSQL.5432: connect: no such file or directory"}
2024-02-27T09:24:15.675Z	ERROR	cmd/keeper.go:736	cannot get configured pg parameters	{"error": "dial unix /tmp/.s.PGSQL.5432: connect: no such file or directory"}
2024-02-27 09:24:16.349 UTC [1590] LOG:  database system was interrupted while in recovery at log time 2024-02-27 09:23:11 UTC
2024-02-27 09:24:16.349 UTC [1590] HINT:  If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
2024-02-27 09:24:16.455 UTC [1590] LOG:  redo starts at B/280161F8
2024-02-27 09:24:16.556 UTC [1590] LOG:  invalid record length at B/28182E08: wanted 24, got 0
2024-02-27 09:24:16.556 UTC [1590] LOG:  redo done at B/28182DE0 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.10 s

PostgreSQL stand-alone backend 14.0 (Debian 14.0-1.pgdg110+1)
2024-02-27T09:24:17.548Z	DEBUG	postgresql/postgresql.go:625	execing cmd	{"cmd": "/usr/lib/postgresql/14/bin/postgres -V"}
2024-02-27T09:24:17.561Z	DEBUG	cmd/keeper.go:815	pgstate dump: (*cluster.PostgresState)(0xc0008f84d0)({
 UID: (string) (len=8) "7805b990",
 Generation: (int64) 0,
 ListenAddress: (string) (len=14) "172.16.115.237",
 Port: (string) (len=4) "5432",
 Healthy: (bool) false,
 SystemID: (string) "",
 TimelineID: (uint64) 0,
 XLogPos: (uint64) 0,
 TimelinesHistory: (cluster.PostgresTimelinesHistory) <nil>,
 PGParameters: (common.Parameters) <nil>,
 SynchronousStandbys: ([]string) <nil>,
 OlderWalFile: (string) ""
})

2024-02-27T09:24:18.177Z	ERROR	cmd/keeper.go:736	cannot get configured pg parameters	{"error": "dial unix /tmp/.s.PGSQL.5432: connect: no such file or directory"}
pg_rewind: fetched file "global/pg_control", length 8192
pg_rewind: source and target cluster are on the same timeline
pg_rewind: no rewind required
2024-02-27T09:24:18.222Z	DEBUG	postgresql/postgresql.go:625	execing cmd	{"cmd": "/usr/lib/postgresql/14/bin/postgres -V"}
2024-02-27T09:24:18.324Z	INFO	postgresql/postgresql.go:844	writing standby signal file
2024-02-27T09:24:18.368Z	INFO	postgresql/postgresql.go:319	starting database
2024-02-27T09:24:18.368Z	DEBUG	postgresql/postgresql.go:323	execing cmd	{"cmd": "/usr/lib/postgresql/14/bin/postgres -D /stolon-data/postgres -c unix_socket_directories=/tmp"}
2024-02-27T09:24:20.682Z	ERROR	cmd/keeper.go:736	cannot get configured pg parameters	{"error": "dial unix /tmp/.s.PGSQL.5432: connect: no such file or directory"}
2024-02-27 09:24:21.540 UTC [1593] LOG:  starting PostgreSQL 14.0 (Debian 14.0-1.pgdg110+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit
2024-02-27 09:24:21.540 UTC [1593] LOG:  listening on IPv4 address "172.16.115.237", port 5432
2024-02-27 09:24:21.589 UTC [1593] LOG:  listening on Unix socket "/tmp/.s.PGSQL.5432"
2024-02-27 09:24:22.064 UTC [1594] LOG:  database system was shut down at 2024-02-27 09:24:17 UTC
2024-02-27 09:24:22.064 UTC [1595] FATAL:  the database system is starting up
2024-02-27 09:24:22.085 UTC [1594] LOG:  entering standby mode
2024-02-27 09:24:22.221 UTC [1594] LOG:  invalid record length at B/28182EF8: wanted 24, got 0
2024-02-27 09:24:22.245 UTC [1598] LOG:  started streaming WAL from primary at B/28000000 on timeline 31
2024-02-27 09:24:22.267 UTC [1599] FATAL:  the database system is not yet accepting connections
2024-02-27 09:24:22.267 UTC [1599] DETAIL:  Consistent recovery state has not been yet reached.
2024-02-27 09:24:22.271 UTC [1594] LOG:  invalid resource manager ID 48 at B/28182EF8
2024-02-27 09:24:22.279 UTC [1598] FATAL:  terminating walreceiver process due to administrator command
2024-02-27 09:24:22.279 UTC [1594] LOG:  invalid resource manager ID 48 at B/28182EF8
2024-02-27 09:24:22.280 UTC [1594] LOG:  invalid resource manager ID 48 at B/28182EF8
2024-02-27 09:24:22.469 UTC [1601] FATAL:  the database system is not yet accepting connections
2024-02-27 09:24:22.469 UTC [1601] DETAIL:  Consistent recovery state has not been yet reached.
2024-02-27T09:24:22.583Z	DEBUG	postgresql/postgresql.go:625	execing cmd	{"cmd": "/usr/lib/postgresql/14/bin/postgres -V"}
2024-02-27T09:24:22.593Z	DEBUG	cmd/keeper.go:815	pgstate dump: (*cluster.PostgresState)(0xc0006ac630)({
 UID: (string) (len=8) "7805b990",
 Generation: (int64) 0,
 ListenAddress: (string) (len=14) "172.16.115.237",
 Port: (string) (len=4) "5432",
 Healthy: (bool) false,
 SystemID: (string) "",
 TimelineID: (uint64) 0,
 XLogPos: (uint64) 0,
 TimelinesHistory: (cluster.PostgresTimelinesHistory) <nil>,
 PGParameters: (common.Parameters) <nil>,
 SynchronousStandbys: ([]string) <nil>,
 OlderWalFile: (string) ""
})

2024-02-27 09:24:22.671 UTC [1603] FATAL:  the database system is not yet accepting connections
2024-02-27 09:24:22.671 UTC [1603] DETAIL:  Consistent recovery state has not been yet reached.
2024-02-27 09:24:22.873 UTC [1604] FATAL:  the database system is not yet accepting connections
2024-02-27 09:24:22.873 UTC [1604] DETAIL:  Consistent recovery state has not been yet reached.
2024-02-27 09:24:23.094 UTC [1605] FATAL:  the database system is not yet accepting connections
2024-02-27 09:24:23.094 UTC [1605] DETAIL:  Consistent recovery state has not been yet reached.
2024-02-27 09:24:23.185 UTC [1606] FATAL:  the database system is not yet accepting connections
2024-02-27 09:24:23.185 UTC [1606] DETAIL:  Consistent recovery state has not been yet reached.
2024-02-27T09:24:23.185Z	ERROR	cmd/keeper.go:736	cannot get configured pg parameters	{"error": "pq: the database system is not yet accepting connections"}
2024-02-27 09:24:23.297 UTC [1607] FATAL:  the database system is not yet accepting connections
2024-02-27 09:24:23.297 UTC [1607] DETAIL:  Consistent recovery state has not been yet reached.
2024-02-27 09:24:23.499 UTC [1608] FATAL:  the database system is not yet accepting connections

Is anybody still working on this issue?
