[YSQL] Processes killed while writing to a BEEntry can cause corrupted state #20255
Closed
timothy-e added the `area/ysql` (Yugabyte SQL (YSQL)) and `status/awaiting-triage` (Issue awaiting triage) labels on Dec 9, 2023
yugabyte-ci added the `kind/bug` (This issue is a bug) and `priority/medium` (Medium priority issue) labels and removed the `status/awaiting-triage` (Issue awaiting triage) label on Dec 9, 2023
timothy-e added a commit that referenced this issue on Dec 11, 2023:
…ections

Summary: As a lightweight alternative to locks, beentries are protected by a version number (`st_changecount`). `st_changecount` is incremented twice per write: once when the write begins and once when it completes. So an even value means the entry is clean and safe to read, and an odd value means a write is in progress. If the value changes while we are reading it, we know that another process started writing.

These write sections are guarded by `PGSTAT_BEGIN_WRITE_ACTIVITY` and `PGSTAT_END_WRITE_ACTIVITY`. They open a critical section (`START_CRIT_SECTION` / `END_CRIT_SECTION`) and increment the counter after starting the critical section and before finishing it. If a process encounters an error for any reason inside a critical section, the error is escalated to `PANIC`, causing a postmaster restart. However, if the process is killed with `kill -9` (e.g. by the OOM killer), we did not detect that. (Vanilla Postgres restarts upon any backend kill, so this is a Yugabyte-specific issue.) In that case, the BEEntry's `st_changecount` is left odd when it should be even.

This doesn't affect normal backend operation too much, because the evenness checks are `Assert`s, which do not run in release builds. The only non-assert code that checks this condition is:

* [[ https://github.com/YugaByte/yugabyte-db/blob/master/src/postgres/contrib/yb_pg_metrics/yb_pg_metrics.c#L229-L324 | yb_pg_metrics.c ]]
* `pgstat_read_activity_complete`, which is called by `pgstat_get_backend_current_activity` (called by the deadlock code)
* `pgstat_read_current_status`

Curling `localhost:13000/rpcz` after reproducing this issue resulted in an infinite loop, waiting for the changecount to become even again. The loop has no sleep or wait logic, so it consumes lots of CPU while waiting; it froze my devserver for several minutes.

The solution is to restart the postmaster when a process dies while in a critical section.
This fix can only apply to backend PG code, not frontend PG code, because [[ https://github.com/yugabyte/yugabyte-db/blob/b385f2bd146f1141b2237a88da1ee775dfb64dd5/src/postgres/src/include/storage/lock.h#L18 | lock.h is not allowed to be included in frontend code ]]. Most uses are backend, including this one.

Jira: DB-9217

Test Plan:
1. Add a delay to pgstat when a process updates its own beentry:
```lang=diff
--- a/src/postgres/src/backend/postmaster/pgstat.c
+++ b/src/postgres/src/backend/postmaster/pgstat.c
@@ -3288,6 +3288,7 @@ pgstat_report_activity(BackendState state, const char *cmd_str)
 	 * Now update the status entry
 	 */
 	PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
+	pg_usleep(10 * 1000 * 1000);

 	beentry->st_state = state;
 	beentry->st_state_start_timestamp = current_timestamp;
diff --git a/src/postgres/src/backend/postmaster/postmaster.c b/src/postgres/src/backend/postmaster/postmaster.c
```
2. Create a connection A and a connection B.
3. In A, run `select pg_backend_pid();`. This will take extra time to execute because of the delays. Call the result `pid_a`.
4. In another shell, run `kill -9 <pid_a>`.

== Results with this change ==

Observe in the logs:
```
2023-12-10 02:44:29.630 UTC [396633] WARNING:  terminating active server processes due to backend crash of a process while it was in a critical section
2023-12-10 02:44:29.630 UTC [396633] WARNING:  server process (PID 397283) was terminated by signal 9: Killed
2023-12-10 02:44:29.630 UTC [396633] DETAIL:  Failed process was running: select pg_backend_pid();
```
and that connection B is forced to restart:
```
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and repeat your command.
server closed the connection unexpectedly
	This probably means the server terminated abnormally before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.
```
and `curl localhost:13000/rpcz` returns with no issues.

== Results without this change ==

Observe a normal kill in the logs:
```
2023-12-09 22:17:10.003 UTC [244532] WARNING:  server process (PID 244858) was terminated by signal 9: Killed
2023-12-09 22:17:10.003 UTC [244532] DETAIL:  Failed process was running: select pg_backend_pid();
```
```
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
63002 telgers+  20   0   12.7g  12.3g  16168 S  99.0 85.1   0:12.31 postgres
```

**High memory usage and unavailable node**

Without this change, and after following the repro steps above:
1. Close all ysqlsh connections (this means that rpcz will not report for the given beentry).
2. `curl localhost:13000/rpcz` returns with no issues.
3. Run `bin/ysqlsh` to create a new session. It will reuse the BEEntry of the killed backend (the one with the improperly incremented `st_changecount`).
4. Run `top` in another shell.
5. `curl --max-time 2 localhost:13000/rpcz` will terminate after 2s, but the webserver is still stuck in the infinite loop waiting for `st_changecount` to become even, allocating memory the whole time. The server runs extremely slowly, so it takes multiple minutes to run the `pkill -f -9 "YSQL webserver"` command that would save the machine.
6. Observe the results from the `top` command:
```
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
63002 telgers+  20   0   12.7g  12.3g  16168 S  99.0 85.1   0:12.31 postgres
```

Reviewers: amartsinchyk, kfranz, mihnea, smishra

Reviewed By: amartsinchyk, kfranz

Subscribers: kannan, yql

Differential Revision: https://phorge.dev.yugabyte.com/D30906
timothy-e added a commit that referenced this issue on Dec 12, 2023 (…s in a critical sections). Same summary as the commit above; original commit: 6e7330e / D30906. Differential Revision: https://phorge.dev.yugabyte.com/D30933
timothy-e added a commit that referenced this issue on Dec 12, 2023 (…in a critical sections). Same summary as the commit above; original commit: 6e7330e / D30906. Differential Revision: https://phorge.dev.yugabyte.com/D30932
timothy-e added a commit that referenced this issue on Dec 12, 2023 (…s in a critical sections). Same summary as the commit above; original commit: 6e7330e / D30906. Differential Revision: https://phorge.dev.yugabyte.com/D30931
timothy-e added a commit that referenced this issue on Dec 12, 2023 (…in a critical sections). Same summary as the commit above; original commit: 6e7330e / D30906. Differential Revision: https://phorge.dev.yugabyte.com/D30930
timothy-e added a commit that referenced this issue on Dec 12, 2023 (…in a critical sections). Same summary as the commit above; original commit: 6e7330e / D30906. Differential Revision: https://phorge.dev.yugabyte.com/D30935
timothy-e
added a commit
that referenced
this issue
Dec 12, 2023
…in a critical sections

Summary: Original commit: 6e7330e / D30906

Reviewers: amartsinchyk, kfranz, mihnea, smishra

Reviewed By: amartsinchyk

Subscribers: yql, kannan

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D30934
timothy-e
added a commit
that referenced
this issue
Dec 13, 2023
Summary: `localhost:13000/rpcz` calls `pullRpczEntries`, which can wait indefinitely in a for loop if a `beentry` is in an inconsistent state (`beentry->st_changecount` is odd) or being written to frequently (`beentry->st_changecount` is odd or changing). Postgres still has the same code flow of a break inside an infinite loop, but Yugabyte may have additional code that could result in an inconsistent state (beyond #20255).

To be extra cautious, we set the limit to 1000 attempts and log every hundredth attempt. If the beentry never becomes consistent, we proceed with the inconsistent results copied. This may be slightly misleading and result in a half-written value, but it is safe. See this comment in pgstat.c:
```
 * strcpy is safe even if the string is modified concurrently,
 * because there's always a \0 at the end of the buffer.
```

Jira: DB-9229

Test Plan:

=== Forced Repro ===

1. Apply the patch:
```lang=diff
--- a/src/postgres/src/backend/postmaster/pgstat.c
+++ b/src/postgres/src/backend/postmaster/pgstat.c
@@ -3287,19 +3287,24 @@ pgstat_report_activity(BackendState state, const char *cmd_str)
 	/*
 	 * Now update the status entry
 	 */
-	PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
+	YBC_LOG_INFO("waiting!");
+	for (int i = 0; i < 10000000; i++)
+	{
+		PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
+		pg_usleep(1); // 1 microsec
 
-	beentry->st_state = state;
-	beentry->st_state_start_timestamp = current_timestamp;
+		beentry->st_state = state;
+		beentry->st_state_start_timestamp = current_timestamp;
 
-	if (cmd_str != NULL)
-	{
-		memcpy((char *) beentry->st_activity_raw, cmd_str, len);
-		beentry->st_activity_raw[len] = '\0';
-		beentry->st_activity_start_timestamp = start_timestamp;
-	}
+		if (cmd_str != NULL)
+		{
+			memcpy((char *) beentry->st_activity_raw, cmd_str, len);
+			beentry->st_activity_raw[len] = '\0';
+			beentry->st_activity_start_timestamp = start_timestamp;
+		}
 
-	PGSTAT_END_WRITE_ACTIVITY(beentry);
+		PGSTAT_END_WRITE_ACTIVITY(beentry);
+	}
 }
 
 /*-----------
```
2. Run any query, e.g. `\d`.
3. While the query runs (it takes ~10s because of the 10M * 1µs sleeps), curl `localhost:13000/rpcz`. It will complete successfully:
```
I1211 17:29:59.046701 3847495808 pgstat.c:3290] waiting!
2023-12-11 17:30:00.798 EST [43411] WARNING:  backend status entry for pid 43433 required 100 attempts, continuing to retry
2023-12-11 17:30:00.798 EST [43411] WARNING:  backend status entry for pid 43433 required 200 attempts, continuing to retry
2023-12-11 17:30:00.799 EST [43411] WARNING:  backend status entry for pid 43433 required 300 attempts, continuing to retry
2023-12-11 17:30:00.799 EST [43411] WARNING:  backend status entry for pid 43433 required 400 attempts, continuing to retry
2023-12-11 17:30:00.799 EST [43411] WARNING:  backend status entry for pid 43433 required 500 attempts, continuing to retry
2023-12-11 17:30:00.799 EST [43411] WARNING:  backend status entry for pid 43433 required 600 attempts, continuing to retry
2023-12-11 17:30:00.799 EST [43411] WARNING:  backend status entry for pid 43433 required 700 attempts, continuing to retry
2023-12-11 17:30:00.799 EST [43411] WARNING:  backend status entry for pid 43433 required 800 attempts, continuing to retry
2023-12-11 17:30:00.799 EST [43411] WARNING:  backend status entry for pid 43433 required 900 attempts, continuing to retry
2023-12-11 17:30:00.799 EST [43411] WARNING:  backend status entry for pid 43433 required 1000 attempts, using inconsistent results
```

=== Testing normal impact ===

This code path should not normally be executed:

1. Create `script.sql` that contains 10000 lines of `\d`.
2. In one shell, run `for i in {0..11111120}; do curl -s localhost:13000/rpcz > /dev/null; done`.
3. In another shell, run `bin/ysqlsh -f script.sql` 50 times.
4. After a few minutes, it will complete with no logging.

Reviewers: amartsinchyk, kfranz, smishra, mihnea

Reviewed By: kfranz

Subscribers: yql

Differential Revision: https://phorge.dev.yugabyte.com/D30954
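The bounded read loop described above might look roughly like this, assuming a seqlock-style `st_changecount`. The `Entry` type and `read_entry_bounded` are hypothetical names for illustration, not the actual patch:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_ATTEMPTS 1000
#define LOG_EVERY    100

typedef struct {
    int  st_changecount;   /* even = consistent, odd = write in progress */
    int  st_procpid;
    char st_activity[64];
} Entry;

/* Copy an entry, giving up after MAX_ATTEMPTS instead of spinning
 * forever. Returns 1 if the copy is known-consistent, 0 if we hit the
 * attempt limit and kept a possibly inconsistent (but NUL-terminated,
 * hence safe) copy. */
static int read_entry_bounded(const Entry *e, Entry *out)
{
    for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
        int before = e->st_changecount;
        memcpy(out, e, sizeof(*out));
        out->st_activity[sizeof(out->st_activity) - 1] = '\0';
        if (before % 2 == 0 && before == e->st_changecount)
            return 1;      /* even and unchanged: consistent snapshot */
        if (attempt % LOG_EVERY == 0)
            fprintf(stderr,
                    "WARNING: backend status entry for pid %d required "
                    "%d attempts, continuing to retry\n",
                    e->st_procpid, attempt);
    }
    /* Entry never became consistent (e.g. the writer was killed
     * mid-write, leaving st_changecount odd): proceed with what we
     * have rather than hang the webserver. */
    return 0;
}
```

Compared to the unbounded loop, the worst case shifts from an infinite CPU-bound spin to at most 1000 cheap retries plus a handful of warnings, which is why the rpcz endpoint can no longer wedge the node.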
timothy-e
added a commit
that referenced
this issue
Dec 13, 2023
…eentries

Summary: Original commit: ab51835 / D30954

Reviewers: amartsinchyk, kfranz

Reviewed By: kfranz

Subscribers: yql

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D31015
timothy-e
added a commit
that referenced
this issue
Dec 13, 2023
…eentries

Summary: Original commit: ab51835 / D30954

Reviewers: amartsinchyk, kfranz, smishra, mihnea

Reviewed By: kfranz

Subscribers: yql

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D31013
timothy-e
added a commit
that referenced
this issue
Dec 13, 2023
…ntries

Summary: Original commit: ab51835 / D30954

Reviewers: amartsinchyk, kfranz, smishra, mihnea

Reviewed By: kfranz

Subscribers: yql

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D31012
timothy-e
added a commit
that referenced
this issue
Dec 13, 2023
…ntries

Summary: Original commit: ab51835 / D30954

Reviewers: amartsinchyk, kfranz

Reviewed By: kfranz

Subscribers: yql

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D31014
timothy-e
added a commit
that referenced
this issue
Dec 13, 2023
…ntries

Summary: Original commit: ab51835 / D30954

Reviewers: amartsinchyk, kfranz

Reviewed By: kfranz

Subscribers: yql

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D31016
timothy-e
added a commit
that referenced
this issue
Dec 13, 2023
…ntries

(Backport of the commit above; the commit message is identical except for the revision link.)

Differential Revision: https://phorge.dev.yugabyte.com/D31017
Jira Link: DB-9217
Description
`st_changecount` is incremented twice per write: once when the write begins and once when it completes. This means that if the value is even, it is a clean value and safe to read; if it is odd, a write is in progress.

If a backend is killed (e.g. with `kill -9` by the OOM killer) while inside a write section, the count is left odd. Curl `localhost:13000/rpcz` after that and it will spin indefinitely, waiting for the count to become even again.

Solution: anytime a backend is killed in a critical section, restart the postmaster.
Issue Type
kind/bug