
2.25.1.0-b148

@timothy-e timothy-e tagged this 08 Jan 13:39
Summary:
### Background

In vanilla PG, if a backend is killed abruptly, the postmaster terminates all remaining backends and reinitializes (referred to below as a postmaster restart).

In Yugabyte, D14099 (b63b791121484490c923a1a96b7666897bdeebf3) avoided restarting all Postgres backends when one crashed, but did not fully clean up the crashed backend.

To follow up, D20454 (f90da744e9c3cb24d4c4622664f9895433a451d2) (and others) introduced new logic to clean up after terminated backends.

There are some situations that are difficult to clean up after. In these cases, we still restart the postmaster (and all PG backends). The general idea behind these cases is: over-zealously restarting the postmaster is much better than ending up in a scenario where the Postgres processes are stuck. If we find a specific case is causing postmaster restarts too often, we can decide to spend the extra effort to clean it up without a restart.

### pg_cron

While investigating pg_cron issues, it was found that a killed pg_cron worker would leave the pg_cron leader in a hung state, waiting for the worker to report its status. Calling `ReportBackgroundWorkerExit` from the postmaster on behalf of the killed process seems like it should work, but there's more going on in the communication between workers and leaders than just that. I tried a few more things, but quickly realized that whatever works for the pg_cron behaviour is unlikely to work for all background worker cases.

To avoid corrupted or hung state, rather than trying to figure out the perfect solution for all background workers, just restart the postmaster if a **background worker** is killed by an abnormal signal. This change has no impact on regular backends.
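
As a rough illustration of the rule above, here is a minimal C sketch and not the actual postmaster code. `ChildKind`, `bgworker_exit_forces_restart`, and `status_of_child` are hypothetical names invented for this example. It also assumes that any death by signal counts as "abnormal"; the real change may treat some signals as normal.

```c
#include <signal.h>
#include <stdbool.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical classification of a postmaster child; not a real PG/YB type. */
typedef enum
{
    CHILD_REGULAR_BACKEND,
    CHILD_BACKGROUND_WORKER
} ChildKind;

/*
 * Hypothetical helper: does the background-worker rule force a postmaster
 * restart for this exit? Assumption: any termination by signal is abnormal.
 */
static bool
bgworker_exit_forces_restart(ChildKind kind, int wait_status)
{
    /* The rule does not apply to regular backends; their handling is unchanged. */
    if (kind != CHILD_BACKGROUND_WORKER)
        return false;

    /* An ordinary exit() is reported to the leader by the worker itself. */
    return WIFSIGNALED(wait_status);
}

/*
 * Demo helper: fork a child that ends itself with `sig` (or exits cleanly
 * when sig == 0) and return the wait status the parent observes.
 */
static int
status_of_child(int sig)
{
    int   status = 0;
    pid_t pid = fork();

    if (pid == 0)
    {
        if (sig != 0)
            raise(sig);
        _exit(0);
    }
    if (pid < 0 || waitpid(pid, &status, 0) != pid)
        abort();                /* the sketch does not handle fork/wait errors */
    return status;
}
```

Under these assumptions, `bgworker_exit_forces_restart(CHILD_BACKGROUND_WORKER, status_of_child(SIGKILL))` is true, while a clean exit or a killed regular backend returns false and stays on the existing cleanup path.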

Test Plan:
There was some slight flakiness (~2%) observed in WIP versions of this test. Ran the new test 200
times to be safe.

```
./yb_build.sh --cxx-test pg_cron-test --gtest_filter PgCronTest.KillRunningJob -n 200
```

Just to validate that I didn't break anything here:
```
./yb_build.sh --cxx-test pg_cron-test --gtest_filter PgCronTest.DeactivateRunningJob
./yb_build.sh --cxx-test pgwrapper_pg_libpq-test --gtest_filter PgLibPqTest.TestLWPgBackendKillAfterLWLockAcquire
```

Reviewers: hsunder

Reviewed By: hsunder

Subscribers: yql, smishra

Differential Revision: https://phorge.dev.yugabyte.com/D39781