[YSQL] Cleaning up after unexpectedly terminated backend's lockGroupLeader is missing locking #18008
Closed
timothy-e added the area/ysql (Yugabyte SQL (YSQL)) and status/awaiting-triage (Issue awaiting triage) labels on Jun 29, 2023
yugabyte-ci added the kind/bug (This issue is a bug) and priority/medium (Medium priority issue) labels and removed the status/awaiting-triage (Issue awaiting triage) label on Jun 29, 2023
timothy-e added a commit that referenced this issue on Jul 24, 2023
…l locking

Summary:
When many connections are being created and killed at the same time, it's possible for a new backend to write to the same `freeProc` list that the postmaster is writing to while cleaning up after a terminated connection.

| **Postmaster** | **Connection A** | **Connection B** |
| -- | -- | -- |
| forks new backend | running long running query | -- |
| | is suddenly terminated by KILL, OOM, or segfault | begins setting itself up |
| starts cleaning up after connection A | -- | acquires lock on `freeProc` |
| reads `A->procgloballist` (a pointer to a node of `freeProc`) | -- | |
| | -- | modifies `freeProc` |
| modifies `freeProc` based on its stale read from A | -- | |
| | -- | releases lock on `freeProc` |

`ProcGlobal->freeProc` now has its state corrupted because it was modified concurrently. D20454 introduced cleaning up after terminated connections, but missed including this lock.

== Changes: ==
- By adding a lock around the postmaster's freeProc list usage, we can prevent this concurrent write.
- Checked if the ProcStructLock spinlock is available before acquiring it. This doesn't fix the potential deadlock, but makes it much less likely: the retry logic means that if the lock happens to be taken, we try again a few more times, so we might be able to avoid restarting the postmaster. A more robust solution would involve implementing a spinlock method `bool TryAcquire(int timeoutMs)`, but that would involve potentially dangerous changes to a fundamental feature.
- Differing from the original approach in ProcKill, we acquire the spinlock once and use it to clean up both the leader's proc struct (if it exists) and the killed process's proc struct. This has little impact on execution without lockGroups, but when lockGroups are introduced (maybe as part of parallelism), it will result in one less opportunity for deadlock to arise.
- Added additional logging to help future debugging efforts.
- Modified `ybAnyLockAcquired` in `LWLockConditionalAcquire` to be set to true earlier, avoiding a situation where a backend is killed with an LWLock acquired but without `ybAnyLockAcquired` set.
- Restart the postmaster if a backend receives an unexpected kill signal. We know what to do when we receive a SIGABRT, SIGKILL, or SIGSEGV, but other signals are an indication that something mysterious went wrong, and it's possible that we won't be able to get ourselves back into a good state.

Jira: DB-7029, DB-7069

Test Plan:

== Max Connections Issue ==
The postmaster is single threaded, so for the freeProc write to happen concurrently with another write, the other write must happen in a backend that is still setting itself up.

- I added a 4s delay to new backends before acquiring the freeProc lock in `proc.c:InitProcess`:

```lang=diff
 * While we are holding the ProcStructLock, also copy the current shared
 * estimate of spins_per_delay to local storage.
 */
+ pg_usleep(1000 * 1000 * 2); // 2s
 SpinLockAcquire(ProcStructLock);
```

- I added an 8s delay to the postmaster between reading freeProc and writing the new freeProc in `postmaster.c:CleanupKilledProcess`:

```lang=diff
 proc->links.next = (SHM_QUEUE *) *procgloballist;
+ pg_usleep(1000 * 1000 * 8); // 8s
 *procgloballist = proc;
```

Then run the steps:
1. Create connection A and allow it to connect.
2. Create a new connection B. In 4s, this connection will attempt to acquire the `freeProc` lock to modify `freeProc` and claim an entry for itself.
3. Immediately kill connection A. The postmaster will read freeProc and, in 8s, will write a new value to it. In between, `freeProc` will have been changed by connection B.
4. Create connection C:
   - in the original state, it gets rejected with the error "FATAL: latch already owned by 690287" (connection B) (I'm not sure why this occurs)
   - with the diff, it connects successfully
5. Create connection D:
   - in the original state, it gets rejected with the error "Sorry, too many clients already"
   - with the diff, it connects successfully, and (based on logging) the `freeProc` list is in the correct state

If connection C or D is created within 8s (the duration of the postmaster sleep), it will take 8s to create and connect to the server, but otherwise the increased duration that we hold the lock for has no impact.

== Avoiding stuck spinlock ==
- I added a 5s delay to new backends after acquiring the freeProc lock in `proc.c:InitProcess`:

```lang=diff
 * While we are holding the ProcStructLock, also copy the current shared
 * estimate of spins_per_delay to local storage.
 */
 SpinLockAcquire(ProcStructLock);
+ pg_usleep(1000 * 1000 * 5); // 5s
```

Then run the steps:
1. Create connection A and allow it to connect.
2. Create a new connection B. This acquires the lock and then waits.
3. Immediately kill connection A. The postmaster will want to acquire the lock before trying to return A to the freeProc list.
4. Since the lock is taken, `CheckSpinLock` fails, so the postmaster is restarted.

If we remove the call to `CheckSpinLock` (or increase the delays between retries to 2s), then instead of step 4 we simply wait out the 5s before continuing with the cleanup of connection A. Although this may result in false positives where we unnecessarily kill the postmaster, the alternative is worse: if a process dies holding the spinlock, the postmaster becomes unresponsive until the spinlock is determined to be stuck 2 minutes later, and only then restarts.

== Unexpected Signals ==
Steps:
1. Create connections A and B.
2. Send `kill -31 <A_pid>` - this is the `SIGSYS` signal.
3. Observe from the logs that the postmaster is restarted (new PID) and both A and B need to be reconnected.
Reviewers: kramanathan, amartsinchyk
Reviewed By: kramanathan, amartsinchyk
Subscribers: smishra, ssong, yql
Differential Revision: https://phorge.dev.yugabyte.com/D26563
timothy-e added a commit that referenced this issue on Jul 26, 2023
…s with additional locking

Summary: Original Commit: D26563 / 2e67c23. This backports the change above; the summary and test plan are identical to the original commit message.

Reviewers: kramanathan, amartsinchyk
Reviewed By: amartsinchyk
Subscribers: yql, ssong, smishra
Differential Revision: https://phorge.dev.yugabyte.com/D27282
timothy-e added a commit that referenced this issue on Aug 8, 2023
…s with additional locking

Summary: Original commit: D26563 / 2e67c23. This backports the change above; the summary and test plan are identical to the original commit message.

Reviewers: amartsinchyk, kramanathan
Reviewed By: amartsinchyk
Subscribers: yql, cdavid
Differential Revision: https://phorge.dev.yugabyte.com/D27287
timothy-e added a commit that referenced this issue on Aug 8, 2023
… with additional locking

Summary: Original Commit: D26563 / 2e67c23. This backports the change above; the summary and test plan are identical to the original commit message.

Reviewers: kramanathan, amartsinchyk
Reviewed By: amartsinchyk
Subscribers: yql
Differential Revision: https://phorge.dev.yugabyte.com/D27301
Jira Link: DB-7069
Description
Might be the root cause of #17961