
[YSQL] Cleaning up after unexpectedly terminated backend's lockGroupLeader is missing locking #18008

Closed
1 task done
timothy-e opened this issue Jun 29, 2023 · 0 comments
Assignees
Labels
area/ysql Yugabyte SQL (YSQL) kind/bug This issue is a bug priority/medium Medium priority issue

Comments


timothy-e commented Jun 29, 2023

Jira Link: DB-7069

Description

Might be the root cause of #17961

Warning: Please confirm that this issue does not contain any sensitive information

  • I confirm this issue does not contain any sensitive information.
@timothy-e timothy-e added area/ysql Yugabyte SQL (YSQL) status/awaiting-triage Issue awaiting triage labels Jun 29, 2023
@timothy-e timothy-e self-assigned this Jun 29, 2023
@yugabyte-ci yugabyte-ci added kind/bug This issue is a bug priority/medium Medium priority issue and removed status/awaiting-triage Issue awaiting triage labels Jun 29, 2023
timothy-e added a commit that referenced this issue Jul 24, 2023
…l locking

Summary:
When many connections are being created and killed at the same time, it's possible for a new backend to write to the same freeProc list that the postmaster is writing to while cleaning up after a terminated connection.

| **Postmaster**                        | **Connection A**                                 | **Connection B**
| --                                    | --                                               | --
| forks new backend                     | running a long-running query                     | --
|                                       | is suddenly terminated by KILL, OOM, or segfault | begins setting itself up
| starts cleaning up after connection A | --                                               | acquires lock on `freeProc`
| reads `A->procgloballist` (a pointer to a node of `freeProc`) | --                       |
|                                       | --                                               | modifies `freeProc`
| modifies `freeProc` based on its stale read from A | --                                 |
|                                       | --                                               | releases lock on `freeProc`

`ProcGlobal->freeProc` is now corrupted because two processes modified it concurrently.

D20454 introduced cleaning up after terminated connections, but missed including this lock.
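The fix is essentially to make the postmaster take the same `ProcStructLock` spinlock that `InitProcess` and `ProcKill` hold when they touch the free lists. A minimal sketch of the idea, using a hypothetical helper inside `postmaster.c:CleanupKilledProcess` (the actual diff may differ in details):

```
lang=c
/*
 * Sketch only: return a killed backend's PGPROC to its free list while
 * holding ProcStructLock, so the push cannot interleave with InitProcess
 * in a newly forked backend. The helper name is hypothetical.
 */
static void
YbReturnProcToFreeList(PGPROC *proc)
{
	/* Each PGPROC remembers which free list it came from. */
	PGPROC	  **procgloballist = proc->procgloballist;

	SpinLockAcquire(ProcStructLock);

	/* Push proc back onto the head of its free list. */
	proc->links.next = (SHM_QUEUE *) *procgloballist;
	*procgloballist = proc;

	SpinLockRelease(ProcStructLock);
}
```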

== Changes: ==
- By adding a lock around the postmaster's freeProc list usage, we can prevent this concurrent write.
- Checked if the ProcStructLock spinlock is available before acquiring it. This doesn't fix the potential deadlock, but it makes it much less likely: retry logic means that if the lock happens to be taken, we try again a few more times, so we might be able to avoid restarting the postmaster (see the sketch below). A more robust solution would be a spinlock method `bool TryAcquire(int timeoutMs)`, but that would involve potentially dangerous changes to a fundamental feature. Differing from the original approach in ProcKill, we acquire the spinlock once and use it to clean up both the leader's proc struct (if it exists) and the killed process's proc struct. This has little impact on execution without lockGroups, but when lockGroups are introduced (maybe as part of parallelism), it will mean one less opportunity for deadlock to arise.
- Added additional logging to help future debugging efforts
- Modified `LWLockConditionalAcquire` to set `ybAnyLockAcquired` to true earlier, avoiding a situation where a backend is killed while holding an LWLock but without `ybAnyLockAcquired` set
- Restart the postmaster if a backend receives an unexpected kill signal. We know what to do when we receive a SIGABRT, SIGKILL, or SIGSEGV, but other signals are an indication that something mysterious went wrong, and it's possible that we won't be able to get ourselves back into a good state.
Jira: DB-7029, DB-7069
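
A rough sketch of the pre-check-plus-retry flow in the postmaster's cleanup path (retry count, delay, and helper names are assumptions; `CheckSpinLock` is the probe referenced in the test plan and is assumed to report whether `ProcStructLock` currently looks free):

```
lang=c
/* Sketch only: probe the spinlock a few times before committing to
 * SpinLockAcquire, so a lock abandoned by a dead backend is noticed
 * quickly instead of stalling the postmaster. */
for (int attempt = 0; attempt < 3; attempt++)
{
	if (CheckSpinLock(ProcStructLock))
	{
		/*
		 * The lock looks free. The check is only advisory (another
		 * process could still grab the lock first), but a live holder
		 * releases it quickly, so acquiring here is safe enough.
		 * Acquire once and clean up both the lock group leader's proc
		 * struct (if any) and the killed backend's proc struct.
		 */
		SpinLockAcquire(ProcStructLock);
		YbCleanupProcStructs(proc);		/* hypothetical helper */
		SpinLockRelease(ProcStructLock);
		return;
	}
	pg_usleep(100 * 1000);		/* 100ms between probes (assumed value) */
}

/*
 * Still held after every retry: assume the holder died with the lock and
 * restart the postmaster rather than wait for stuck-spinlock detection.
 */
```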

Test Plan:
== Max Connections Issue ==
The postmaster is single-threaded, so for the freeProc write to happen concurrently with another write, the other write must come from a backend that is still setting itself up.
- I added a 4s delay to new backends before acquiring the freeProc lock in `proc.c:InitProcess`
```
lang=diff
         * While we are holding the ProcStructLock, also copy the current shared
         * estimate of spins_per_delay to local storage.
         */
+       pg_usleep(1000 * 1000 * 2); // 2s
        SpinLockAcquire(ProcStructLock);
```
- I added an 8s delay to the postmaster between reading freeProc and writing the new freeProc in `postmaster.c:CleanupKilledProcess`
```
lang=diff
                        proc->links.next = (SHM_QUEUE *) *procgloballist;
+                       pg_usleep(1000 * 1000 * 8); // 8s
                        *procgloballist = proc;
```

Then run the steps:
1. Create connection A and allow it to connect.
2. Create a new connection B. After the injected delay, this connection will attempt to acquire the `freeProc` lock to modify `freeProc` and claim an entry for itself.
3. Immediately kill connection A. The postmaster will read `freeProc` and, 8s later, write a new value to it. In between, `freeProc` will have been changed by connection B.
4. Create connection C and
 - in the original state, it gets rejected with the error "FATAL:  latch already owned by 690287" (connection B) (I'm not sure why this occurs)
 - with the diff, it connects successfully
5. Create connection D and
 - in the original state, it gets rejected with the error "Sorry, too many clients already"
 - with the diff, it connects successfully, and (based on logging), the `freeProc` list is in the correct state.

If connection C or D is created within 8s (the duration of the postmaster sleep), it will take up to 8s to connect to the server, but otherwise the increased duration for which we hold the lock has no impact.
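
For repeatability, the same steps can be scripted. A hedged sketch using libpq (connection string, port, and timings are assumptions, and it relies on the debug `pg_usleep()` delays above being compiled in):

```
lang=c
/* Sketch only: drive the race described in steps 1-5. Build with -lpq. */
#include <signal.h>
#include <stdio.h>
#include <unistd.h>
#include "libpq-fe.h"

int
main(void)
{
	const char *conninfo = "host=127.0.0.1 port=5433 dbname=yugabyte";

	PGconn	   *a = PQconnectdb(conninfo);	/* connection A */

	if (PQstatus(a) != CONNECTION_OK)
		return 1;

	/* Start connection B non-blockingly so it is still setting itself up. */
	PGconn	   *b = PQconnectStart(conninfo);

	/* Immediately kill A so the postmaster begins its (delayed) cleanup. */
	kill(PQbackendPID(a), SIGKILL);

	/* Drive B's handshake to completion (crude poll loop for brevity). */
	PostgresPollingStatusType st = PGRES_POLLING_WRITING;

	while (st != PGRES_POLLING_OK && st != PGRES_POLLING_FAILED)
	{
		st = PQconnectPoll(b);
		usleep(10 * 1000);
	}

	sleep(10);					/* let the 8s postmaster sleep elapse */

	PGconn	   *c = PQconnectdb(conninfo);	/* connection C */
	PGconn	   *d = PQconnectdb(conninfo);	/* connection D */

	printf("C: %s\n", PQstatus(c) == CONNECTION_OK ? "ok" : PQerrorMessage(c));
	printf("D: %s\n", PQstatus(d) == CONNECTION_OK ? "ok" : PQerrorMessage(d));

	PQfinish(b);
	PQfinish(c);
	PQfinish(d);
	return 0;
}
```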

== Avoiding stuck spinlock ==
- I added a 5s delay to new backends after acquiring the freeProc lock in `proc.c:InitProcess`
```
lang=diff
         * While we are holding the ProcStructLock, also copy the current shared
         * estimate of spins_per_delay to local storage.
         */
        SpinLockAcquire(ProcStructLock);
+       pg_usleep(1000 * 1000 * 5); // 5s
```
Then run the steps:
1. Create connection A and allow it to connect.
2. Create a new connection B. This acquires the lock and then waits.
3. Immediately kill connection A. The Postmaster will want to acquire the lock before trying to return A to the freeProc list.
4. Since the lock is taken, `CheckSpinLock` fails and so the postmaster is restarted.

If the call to `CheckSpinLock` is removed (or the delay between retries is increased to 2s), then instead of step 4, the postmaster waits out the 5s before continuing with the cleanup of connection A.

Although this may result in false positives where we unnecessarily kill the postmaster, the alternative is worse. If a process dies holding the spinlock, the Postmaster becomes unresponsive until the spinlock is determined to be stuck, 2 minutes later, and then the Postmaster restarts.
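
For context, a simplified illustration (not the real `s_lock.c` implementation; the constants are chosen only to mirror the ~2 minute figure above) of why a dead spinlock holder stalls the postmaster for so long without the pre-check:

```
lang=c
/*
 * Illustration only: a waiter on an abandoned spinlock keeps retrying with
 * growing sleeps and only PANICs -- forcing a postmaster restart -- once a
 * total delay budget of roughly two minutes is exhausted.
 */
static void
illustrative_spin_wait(volatile slock_t *lock)
{
	int			total_slept_us = 0;
	int			cur_delay_us = 1000;		/* start at 1ms */

	while (TAS(lock))						/* nonzero while still held */
	{
		if (total_slept_us > 120 * 1000 * 1000)
			elog(PANIC, "stuck spinlock detected");

		pg_usleep(cur_delay_us);
		total_slept_us += cur_delay_us;
		cur_delay_us = Min(cur_delay_us * 2, 1000 * 1000);	/* cap at 1s */
	}
}
```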

== Unexpected Signals ==
Steps:
1. Create connections A and B.
2. Send `kill -31 <A_pid>` - this is the `SIGSYS` signal.
3. Observe from the logs that the postmaster is restarted (new PID) and both A and B need to be reconnected.
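
The "unexpected signal" handling from the changes list boils down to a classification like the following sketch (the helper name is hypothetical; only the listed signals are treated as crashes we know how to clean up after):

```
lang=c
#include <signal.h>

/* Sketch only: decide whether a backend's death is one we know how to
 * clean up after, or an unexpected signal that forces a postmaster restart. */
static bool
YbExitSignalHasKnownCleanup(int exitsignal)
{
	switch (exitsignal)
	{
		case SIGABRT:
		case SIGKILL:
		case SIGSEGV:
			return true;	/* expected crash: clean up the PGPROC and continue */
		default:
			return false;	/* e.g. SIGSYS (31): restart the postmaster */
	}
}
```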

Reviewers: kramanathan, amartsinchyk

Reviewed By: kramanathan, amartsinchyk

Subscribers: smishra, ssong, yql

Differential Revision: https://phorge.dev.yugabyte.com/D26563
@yugabyte-ci yugabyte-ci reopened this Jul 26, 2023
timothy-e added a commit that referenced this issue Jul 26, 2023
…s with additional locking

Summary:
Original Commit: D26563 / 2e67c23

Reviewers: kramanathan, amartsinchyk

Reviewed By: amartsinchyk

Subscribers: yql, ssong, smishra

Differential Revision: https://phorge.dev.yugabyte.com/D27282
timothy-e added a commit that referenced this issue Aug 8, 2023
…s with additional locking

Summary:
Original commit: D26563 / 2e67c23

Reviewers: amartsinchyk, kramanathan

Reviewed By: amartsinchyk

Subscribers: yql, cdavid

Differential Revision: https://phorge.dev.yugabyte.com/D27287
timothy-e added a commit that referenced this issue Aug 8, 2023
… with additional locking

Summary:
Original Commit: D26563 / 2e67c23

Reviewers: kramanathan, amartsinchyk

Reviewed By: amartsinchyk

Subscribers: yql

Differential Revision: https://phorge.dev.yugabyte.com/D27301