
[YSQL] Cleaning up after unexpectedly terminated backend's lockGroupLeader is missing locking #18008

Closed
1 task done
timothy-e opened this issue Jun 29, 2023 · 0 comments
Assignees
Labels
area/ysql Yugabyte SQL (YSQL) kind/bug This issue is a bug priority/medium Medium priority issue

Comments


timothy-e commented Jun 29, 2023

Jira Link: DB-7069

Description

Might be the root cause of #17961

Warning: Please confirm that this issue does not contain any sensitive information

  • I confirm this issue does not contain any sensitive information.
@timothy-e timothy-e added area/ysql Yugabyte SQL (YSQL) status/awaiting-triage Issue awaiting triage labels Jun 29, 2023
@timothy-e timothy-e self-assigned this Jun 29, 2023
@yugabyte-ci yugabyte-ci added kind/bug This issue is a bug priority/medium Medium priority issue and removed status/awaiting-triage Issue awaiting triage labels Jun 29, 2023
timothy-e added a commit that referenced this issue Jul 24, 2023
…l locking

Summary:
When many connections are being created and killed at the same time, it's possible for a new backend to write to the same freeProc list that the postmaster is writing to while cleaning up after a terminated connection.

| **Postmaster**                        | **Connection A**                                 | **Connection B**
| --                                    | --                                               | --
| forks new backend                     | running a long-running query                     | --
|                                       | is suddenly terminated by KILL, OOM, or segfault | begins setting itself up
| starts cleaning up after connection A | --                                               | acquires lock on `freeProc`
| reads `A->procgloballist` (a pointer to a node of `freeProc`) | --                       |
|                                       | --                                               | modifies `freeProc`
| modifies `freeProc` based on its stale read from A | --                                 |
|                                       | --                                               | releases lock on `freeProc`

`ProcGlobal->freeProc` is now corrupted because two processes modified it concurrently.

D20454 introduced cleaning up after terminated connections, but missed including this lock.
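The fix is essentially to make the postmaster take the same `ProcStructLock` spinlock that `InitProcess` and `ProcKill` hold when they touch the free lists. A minimal sketch of the idea, using a hypothetical helper inside `postmaster.c:CleanupKilledProcess` (the actual diff may differ in details):

```
lang=c
/*
 * Sketch only: return a killed backend's PGPROC to its free list while
 * holding ProcStructLock, so the push cannot interleave with InitProcess
 * in a newly forked backend. The helper name is hypothetical.
 */
static void
YbReturnProcToFreeList(PGPROC *proc)
{
	/* Each PGPROC remembers which free list it came from. */
	PGPROC	  **procgloballist = proc->procgloballist;

	SpinLockAcquire(ProcStructLock);

	/* Push proc back onto the head of its free list. */
	proc->links.next = (SHM_QUEUE *) *procgloballist;
	*procgloballist = proc;

	SpinLockRelease(ProcStructLock);
}
```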

== Changes: ==
- By adding a lock around the postmaster's freeProc list usage, we can prevent this concurrent write.
- Checked if the ProcStructLock spinlock is available before acquiring it. This doesn't fix the potential deadlock, but it makes it much less likely: retry logic means that if the lock happens to be taken, we try again a few more times, so we might be able to avoid restarting the postmaster (see the sketch below). A more robust solution would be a spinlock method `bool TryAcquire(int timeoutMs)`, but that would involve potentially dangerous changes to a fundamental feature. Differing from the original approach in ProcKill, we acquire the spinlock once and use it to clean up both the leader's proc struct (if it exists) and the killed process's proc struct. This has little impact on execution without lockGroups, but when lockGroups are introduced (maybe as part of parallelism), it will mean one less opportunity for deadlock to arise.
- Added additional logging to help future debugging efforts
- Modified `LWLockConditionalAcquire` to set `ybAnyLockAcquired` to true earlier, avoiding a situation where a backend is killed while holding an LWLock but without `ybAnyLockAcquired` set
- Restart the postmaster if a backend receives an unexpected kill signal. We know what to do when we receive a SIGABRT, SIGKILL, or SIGSEGV, but other signals are an indication that something mysterious went wrong, and it's possible that we won't be able to get ourselves back into a good state.
Jira: DB-7029, DB-7069
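
A rough sketch of the pre-check-plus-retry flow in the postmaster's cleanup path (retry count, delay, and helper names are assumptions; `CheckSpinLock` is the probe referenced in the test plan and is assumed to report whether `ProcStructLock` currently looks free):

```
lang=c
/* Sketch only: probe the spinlock a few times before committing to
 * SpinLockAcquire, so a lock abandoned by a dead backend is noticed
 * quickly instead of stalling the postmaster. */
for (int attempt = 0; attempt < 3; attempt++)
{
	if (CheckSpinLock(ProcStructLock))
	{
		/*
		 * The lock looks free. The check is only advisory (another
		 * process could still grab the lock first), but a live holder
		 * releases it quickly, so acquiring here is safe enough.
		 * Acquire once and clean up both the lock group leader's proc
		 * struct (if any) and the killed backend's proc struct.
		 */
		SpinLockAcquire(ProcStructLock);
		YbCleanupProcStructs(proc);		/* hypothetical helper */
		SpinLockRelease(ProcStructLock);
		return;
	}
	pg_usleep(100 * 1000);		/* 100ms between probes (assumed value) */
}

/*
 * Still held after every retry: assume the holder died with the lock and
 * restart the postmaster rather than wait for stuck-spinlock detection.
 */
```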

Test Plan:
== Max Connections Issue ==
The postmaster is single-threaded, so for the freeProc write to happen concurrently with another write, the other write must come from a backend that is still setting itself up.
- I added a 4s delay to new backends before acquiring the freeProc lock in `proc.c:InitProcess`
```
lang=diff
         * While we are holding the ProcStructLock, also copy the current shared
         * estimate of spins_per_delay to local storage.
         */
+       pg_usleep(1000 * 1000 * 2); // 2s
        SpinLockAcquire(ProcStructLock);
```
- I added an 8s delay to the postmaster between reading freeProc and writing the new freeProc in `postmaster.c:CleanupKilledProcess`
```
lang=diff
                        proc->links.next = (SHM_QUEUE *) *procgloballist;
+                       pg_usleep(1000 * 1000 * 8); // 8s
                        *procgloballist = proc;
```

Then run the steps:
1. Create connection A and allow it to connect.
2. Create a new connection B. After the injected delay, this connection will attempt to acquire the `freeProc` lock to modify `freeProc` and claim an entry for itself.
3. Immediately kill connection A. The postmaster will read `freeProc` and, 8s later, write a new value to it. In between, `freeProc` will have been changed by connection B.
4. Create connection C and
 - in the original state, it gets rejected with the error "FATAL:  latch already owned by 690287" (connection B) (I'm not sure why this occurs)
 - with the diff, it connects successfully
5. Create connection D and
 - in the original state, it gets rejected with the error "Sorry, too many clients already"
 - with the diff, it connects successfully, and (based on logging), the `freeProc` list is in the correct state.

If connection C or D is created within 8s (the duration of the postmaster sleep), it will take up to 8s to connect to the server, but otherwise the increased duration for which we hold the lock has no impact.
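
For repeatability, the same steps can be scripted. A hedged sketch using libpq (connection string, port, and timings are assumptions, and it relies on the debug `pg_usleep()` delays above being compiled in):

```
lang=c
/* Sketch only: drive the race described in steps 1-5. Build with -lpq. */
#include <signal.h>
#include <stdio.h>
#include <unistd.h>
#include "libpq-fe.h"

int
main(void)
{
	const char *conninfo = "host=127.0.0.1 port=5433 dbname=yugabyte";

	PGconn	   *a = PQconnectdb(conninfo);	/* connection A */

	if (PQstatus(a) != CONNECTION_OK)
		return 1;

	/* Start connection B non-blockingly so it is still setting itself up. */
	PGconn	   *b = PQconnectStart(conninfo);

	/* Immediately kill A so the postmaster begins its (delayed) cleanup. */
	kill(PQbackendPID(a), SIGKILL);

	/* Drive B's handshake to completion (crude poll loop for brevity). */
	PostgresPollingStatusType st = PGRES_POLLING_WRITING;

	while (st != PGRES_POLLING_OK && st != PGRES_POLLING_FAILED)
	{
		st = PQconnectPoll(b);
		usleep(10 * 1000);
	}

	sleep(10);					/* let the 8s postmaster sleep elapse */

	PGconn	   *c = PQconnectdb(conninfo);	/* connection C */
	PGconn	   *d = PQconnectdb(conninfo);	/* connection D */

	printf("C: %s\n", PQstatus(c) == CONNECTION_OK ? "ok" : PQerrorMessage(c));
	printf("D: %s\n", PQstatus(d) == CONNECTION_OK ? "ok" : PQerrorMessage(d));

	PQfinish(b);
	PQfinish(c);
	PQfinish(d);
	return 0;
}
```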

== Avoiding stuck spinlock ==
- I added a 5s delay to new backends after acquiring the freeProc lock in `proc.c:InitProcess`
```
lang=diff
         * While we are holding the ProcStructLock, also copy the current shared
         * estimate of spins_per_delay to local storage.
         */
        SpinLockAcquire(ProcStructLock);
+       pg_usleep(1000 * 1000 * 5); // 5s
```
Then run the steps:
1. Create connection A and allow it to connect.
2. Create a new connection B. This acquires the lock and then waits.
3. Immediately kill connection A. The Postmaster will want to acquire the lock before trying to return A to the freeProc list.
4. Since the lock is taken, `CheckSpinLock` fails and so the postmaster is restarted.

If the call to `CheckSpinLock` is removed (or the delay between retries is increased to 2s), then instead of step 4, the postmaster waits out the 5s before continuing with the cleanup of connection A.

Although this may result in false positives where we unnecessarily kill the postmaster, the alternative is worse. If a process dies holding the spinlock, the Postmaster becomes unresponsive until the spinlock is determined to be stuck, 2 minutes later, and then the Postmaster restarts.
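
For context, a simplified illustration (not the real `s_lock.c` implementation; the constants are chosen only to mirror the ~2 minute figure above) of why a dead spinlock holder stalls the postmaster for so long without the pre-check:

```
lang=c
/*
 * Illustration only: a waiter on an abandoned spinlock keeps retrying with
 * growing sleeps and only PANICs -- forcing a postmaster restart -- once a
 * total delay budget of roughly two minutes is exhausted.
 */
static void
illustrative_spin_wait(volatile slock_t *lock)
{
	int			total_slept_us = 0;
	int			cur_delay_us = 1000;		/* start at 1ms */

	while (TAS(lock))						/* nonzero while still held */
	{
		if (total_slept_us > 120 * 1000 * 1000)
			elog(PANIC, "stuck spinlock detected");

		pg_usleep(cur_delay_us);
		total_slept_us += cur_delay_us;
		cur_delay_us = Min(cur_delay_us * 2, 1000 * 1000);	/* cap at 1s */
	}
}
```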

== Unexpected Signals ==
Steps:
1. Create connections A and B.
2. Send `kill -31 <A_pid>` - this is the `SIGSYS` signal.
3. Observe from the logs that the postmaster is restarted (new PID) and both A and B need to be reconnected.
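
The "unexpected signal" handling from the changes list boils down to a classification like the following sketch (the helper name is hypothetical; only the listed signals are treated as crashes we know how to clean up after):

```
lang=c
#include <signal.h>

/* Sketch only: decide whether a backend's death is one we know how to
 * clean up after, or an unexpected signal that forces a postmaster restart. */
static bool
YbExitSignalHasKnownCleanup(int exitsignal)
{
	switch (exitsignal)
	{
		case SIGABRT:
		case SIGKILL:
		case SIGSEGV:
			return true;	/* expected crash: clean up the PGPROC and continue */
		default:
			return false;	/* e.g. SIGSYS (31): restart the postmaster */
	}
}
```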

Reviewers: kramanathan, amartsinchyk

Reviewed By: kramanathan, amartsinchyk

Subscribers: smishra, ssong, yql

Differential Revision: https://phorge.dev.yugabyte.com/D26563
@yugabyte-ci yugabyte-ci reopened this Jul 26, 2023
timothy-e added a commit that referenced this issue Jul 26, 2023
…s with additional locking

Summary:
Original Commit: D26563 / 2e67c23

Reviewers: kramanathan, amartsinchyk

Reviewed By: amartsinchyk

Subscribers: yql, ssong, smishra

Differential Revision: https://phorge.dev.yugabyte.com/D27282
timothy-e added a commit that referenced this issue Aug 8, 2023
…s with additional locking

Summary:
Original commit: D26563 / 2e67c23

Reviewers: amartsinchyk, kramanathan

Reviewed By: amartsinchyk

Subscribers: yql, cdavid

Differential Revision: https://phorge.dev.yugabyte.com/D27287
timothy-e added a commit that referenced this issue Aug 8, 2023
… with additional locking

Summary:
Original Commit: D26563 / 2e67c23

Reviewers: kramanathan, amartsinchyk

Reviewed By: amartsinchyk

Subscribers: yql

Differential Revision: https://phorge.dev.yugabyte.com/D27301