[YSQL] Processes killed while writing to a BEEntry can cause corrupted state #20255
Closed
timothy-e added the `area/ysql` (Yugabyte SQL (YSQL)) and `status/awaiting-triage` (Issue awaiting triage) labels on Dec 9, 2023
yugabyte-ci added the `kind/bug` (This issue is a bug) and `priority/medium` (Medium priority issue) labels and removed the `status/awaiting-triage` (Issue awaiting triage) label on Dec 9, 2023
timothy-e added a commit that referenced this issue on Dec 11, 2023:
…ections

Summary: As a lightweight alternative to locks, beentries are protected by a version number (`st_changecount`). `st_changecount` is incremented twice per write: once when the write begins and once when it completes. So an even value means the entry is clean and safe to read, and an odd value means a write is in progress. If the value changes while we are reading it, we know that another process started writing.

These write sections are guarded by `PGSTAT_BEGIN_WRITE_ACTIVITY` and `PGSTAT_END_WRITE_ACTIVITY`. They open a critical section (`START_CRIT_SECTION` / `END_CRIT_SECTION`) and increment the counter after starting the critical section and before finishing it. If a process encounters an error for any reason inside a critical section, the error is escalated to `PANIC`, causing a postmaster restart. However, if the process is killed with `kill -9` (e.g. by the OOM killer), we did not detect that. (Vanilla Postgres restarts upon any backend kill, so this is a Yugabyte-specific issue.) In that case, the BEEntry's `st_changecount` is left odd when it should be even.

This doesn't affect normal backend operation too much, because the evenness checks are `Assert`s, which do not run in release builds. The only non-assert code that checks this condition is:

* [[ https://github.com/YugaByte/yugabyte-db/blob/master/src/postgres/contrib/yb_pg_metrics/yb_pg_metrics.c#L229-L324 | yb_pg_metrics.c ]]
* `pgstat_read_activity_complete`, which is called by `pgstat_get_backend_current_activity` (called by the deadlock code)
* `pgstat_read_current_status`

Curling `localhost:13000/rpcz` after reproducing this issue resulted in an infinite loop, waiting for the changecount to become even again. The loop has no sleep or wait logic, so it consumes lots of CPU while waiting; it froze my devserver for several minutes.

The solution is to restart the postmaster when a process dies while in a critical section.
This fix can only apply to backend PG code, not frontend PG code, because [[ https://github.com/yugabyte/yugabyte-db/blob/b385f2bd146f1141b2237a88da1ee775dfb64dd5/src/postgres/src/include/storage/lock.h#L18 | lock.h is not allowed to be included in frontend code ]]. Most uses are backend, including this one.

Jira: DB-9217

Test Plan:
1. Add a delay to pgstat when a process updates its own beentry:
```lang=diff
--- a/src/postgres/src/backend/postmaster/pgstat.c
+++ b/src/postgres/src/backend/postmaster/pgstat.c
@@ -3288,6 +3288,7 @@ pgstat_report_activity(BackendState state, const char *cmd_str)
 	 * Now update the status entry
 	 */
 	PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
+	pg_usleep(10 * 1000 * 1000);

 	beentry->st_state = state;
 	beentry->st_state_start_timestamp = current_timestamp;
diff --git a/src/postgres/src/backend/postmaster/postmaster.c b/src/postgres/src/backend/postmaster/postmaster.c
```
2. Create a connection A and a connection B.
3. In A, run `select pg_backend_pid();`. This will take extra time to execute because of the delays. Call the result `pid_a`.
4. In another shell, run `kill -9 <pid_a>`.

== Results with this change ==

Observe in the logs:
```
2023-12-10 02:44:29.630 UTC [396633] WARNING:  terminating active server processes due to backend crash of a process while it was in a critical section
2023-12-10 02:44:29.630 UTC [396633] WARNING:  server process (PID 397283) was terminated by signal 9: Killed
2023-12-10 02:44:29.630 UTC [396633] DETAIL:  Failed process was running: select pg_backend_pid();
```
and that connection B is forced to restart:
```
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and repeat your command.
server closed the connection unexpectedly
	This probably means the server terminated abnormally before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.
```
and `curl localhost:13000/rpcz` returns with no issues.

== Results without this change ==

Observe a normal kill in the logs:
```
2023-12-09 22:17:10.003 UTC [244532] WARNING:  server process (PID 244858) was terminated by signal 9: Killed
2023-12-09 22:17:10.003 UTC [244532] DETAIL:  Failed process was running: select pg_backend_pid();
```
```
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
63002 telgers+  20   0   12.7g  12.3g  16168 S  99.0 85.1   0:12.31 postgres
```

**High memory usage and unavailable node**

Without this change, and after following the repro steps above:
1. Close all ysqlsh connections (this means that rpcz will not report for the given beentry).
2. `curl localhost:13000/rpcz` returns with no issues.
3. Run `bin/ysqlsh` to create a new session. It will reuse the BEEntry of the killed backend (the one with the improperly incremented `st_changecount`).
4. Run `top` in another shell.
5. `curl --max-time 2 localhost:13000/rpcz` will terminate after 2s, but the webserver is still stuck in the infinite loop waiting for `st_changecount` to become even, allocating memory the whole time. The server runs extremely slowly, so it takes multiple minutes to run the `pkill -f -9 "YSQL webserver"` command that would save the machine.
6. Observe the results from the `top` command:
```
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
63002 telgers+  20   0   12.7g  12.3g  16168 S  99.0 85.1   0:12.31 postgres
```

Reviewers: amartsinchyk, kfranz, mihnea, smishra

Reviewed By: amartsinchyk, kfranz

Subscribers: kannan, yql

Differential Revision: https://phorge.dev.yugabyte.com/D30906
timothy-e added a commit that referenced this issue on Dec 12, 2023 (…s in a critical sections). Same summary as the commit above; original commit: 6e7330e / D30906. Differential Revision: https://phorge.dev.yugabyte.com/D30933
timothy-e added a commit that referenced this issue on Dec 12, 2023 (…in a critical sections). Same summary as the commit above; original commit: 6e7330e / D30906. Differential Revision: https://phorge.dev.yugabyte.com/D30932
timothy-e added a commit that referenced this issue on Dec 12, 2023 (…s in a critical sections). Same summary as the commit above; original commit: 6e7330e / D30906. Differential Revision: https://phorge.dev.yugabyte.com/D30931
timothy-e added a commit that referenced this issue on Dec 12, 2023 (…in a critical sections). Same summary as the commit above; original commit: 6e7330e / D30906. Differential Revision: https://phorge.dev.yugabyte.com/D30930
timothy-e added a commit that referenced this issue on Dec 12, 2023 (…in a critical sections). Same summary as the commit above; original commit: 6e7330e / D30906. Differential Revision: https://phorge.dev.yugabyte.com/D30935
timothy-e
added a commit
that referenced
this issue
Dec 12, 2023
…in a critical sections

Summary: Original commit: 6e7330e / D30906

Reviewers: amartsinchyk, kfranz, mihnea, smishra

Reviewed By: amartsinchyk

Subscribers: yql, kannan

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D30934
timothy-e
added a commit
that referenced
this issue
Dec 13, 2023
Summary: `localhost:13000/rpcz` calls `pullRpczEntries`, which can wait indefinitely in a for loop if a `beentry` is in an inconsistent state (`beentry->st_changecount` is odd) or being written to frequently (`beentry->st_changecount` is odd or changing). Postgres still has the same code flow of a break inside an infinite loop, but Yugabyte may have additional code that could result in an inconsistent state (beyond #20255).

To be extra cautious, we set the limit to 1000 attempts and log every hundredth attempt. If the beentry never becomes consistent, we proceed with the inconsistent results copied. This may be slightly misleading and result in a half-written value, but it is safe. See this comment in pgstat.c:
```
 * strcpy is safe even if the string is modified concurrently,
 * because there's always a \0 at the end of the buffer.
```

Jira: DB-9229

Test Plan:

=== Forced Repro ===

1. Apply the patch:
```lang=diff
--- a/src/postgres/src/backend/postmaster/pgstat.c
+++ b/src/postgres/src/backend/postmaster/pgstat.c
@@ -3287,19 +3287,24 @@ pgstat_report_activity(BackendState state, const char *cmd_str)
 	/*
 	 * Now update the status entry
 	 */
-	PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
+	YBC_LOG_INFO("waiting!");
+	for (int i = 0; i < 10000000; i++)
+	{
+		PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
+		pg_usleep(1); // 1 microsec
 
-	beentry->st_state = state;
-	beentry->st_state_start_timestamp = current_timestamp;
+		beentry->st_state = state;
+		beentry->st_state_start_timestamp = current_timestamp;
 
-	if (cmd_str != NULL)
-	{
-		memcpy((char *) beentry->st_activity_raw, cmd_str, len);
-		beentry->st_activity_raw[len] = '\0';
-		beentry->st_activity_start_timestamp = start_timestamp;
-	}
+		if (cmd_str != NULL)
+		{
+			memcpy((char *) beentry->st_activity_raw, cmd_str, len);
+			beentry->st_activity_raw[len] = '\0';
+			beentry->st_activity_start_timestamp = start_timestamp;
+		}
 
-	PGSTAT_END_WRITE_ACTIVITY(beentry);
+		PGSTAT_END_WRITE_ACTIVITY(beentry);
+	}
 }
 
 /*-----------
```
2. Run any query, e.g. `\d`.
3. While the query runs (it takes ~10s because of the 10M * 1µs sleeps), curl `localhost:13000/rpcz`. It will complete successfully:
```
I1211 17:29:59.046701 3847495808 pgstat.c:3290] waiting!
2023-12-11 17:30:00.798 EST [43411] WARNING:  backend status entry for pid 43433 required 100 attempts, continuing to retry
2023-12-11 17:30:00.798 EST [43411] WARNING:  backend status entry for pid 43433 required 200 attempts, continuing to retry
2023-12-11 17:30:00.799 EST [43411] WARNING:  backend status entry for pid 43433 required 300 attempts, continuing to retry
2023-12-11 17:30:00.799 EST [43411] WARNING:  backend status entry for pid 43433 required 400 attempts, continuing to retry
2023-12-11 17:30:00.799 EST [43411] WARNING:  backend status entry for pid 43433 required 500 attempts, continuing to retry
2023-12-11 17:30:00.799 EST [43411] WARNING:  backend status entry for pid 43433 required 600 attempts, continuing to retry
2023-12-11 17:30:00.799 EST [43411] WARNING:  backend status entry for pid 43433 required 700 attempts, continuing to retry
2023-12-11 17:30:00.799 EST [43411] WARNING:  backend status entry for pid 43433 required 800 attempts, continuing to retry
2023-12-11 17:30:00.799 EST [43411] WARNING:  backend status entry for pid 43433 required 900 attempts, continuing to retry
2023-12-11 17:30:00.799 EST [43411] WARNING:  backend status entry for pid 43433 required 1000 attempts, using inconsistent results
```

=== Testing normal impact ===

This code path should not normally be executed:

1. Create `script.sql` that contains 10000 lines of `\d`.
2. In one shell, run `for i in {0..11111120}; do curl -s localhost:13000/rpcz > /dev/null; done`.
3. In another shell, run `bin/ysqlsh -f script.sql` 50 times.
4. After a few minutes, it will complete with no logging.

Reviewers: amartsinchyk, kfranz, smishra, mihnea

Reviewed By: kfranz

Subscribers: yql

Differential Revision: https://phorge.dev.yugabyte.com/D30954
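The bounded read loop described above might look roughly like this, assuming a seqlock-style `st_changecount`. The `Entry` type and `read_entry_bounded` are hypothetical names for illustration, not the actual patch:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_ATTEMPTS 1000
#define LOG_EVERY    100

typedef struct {
    int  st_changecount;   /* even = consistent, odd = write in progress */
    int  st_procpid;
    char st_activity[64];
} Entry;

/* Copy an entry, giving up after MAX_ATTEMPTS instead of spinning
 * forever. Returns 1 if the copy is known-consistent, 0 if we hit the
 * attempt limit and kept a possibly inconsistent (but NUL-terminated,
 * hence safe) copy. */
static int read_entry_bounded(const Entry *e, Entry *out)
{
    for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
        int before = e->st_changecount;
        memcpy(out, e, sizeof(*out));
        out->st_activity[sizeof(out->st_activity) - 1] = '\0';
        if (before % 2 == 0 && before == e->st_changecount)
            return 1;      /* even and unchanged: consistent snapshot */
        if (attempt % LOG_EVERY == 0)
            fprintf(stderr,
                    "WARNING: backend status entry for pid %d required "
                    "%d attempts, continuing to retry\n",
                    e->st_procpid, attempt);
    }
    /* Entry never became consistent (e.g. the writer was killed
     * mid-write, leaving st_changecount odd): proceed with what we
     * have rather than hang the webserver. */
    return 0;
}
```

Compared to the unbounded loop, the worst case shifts from an infinite CPU-bound spin to at most 1000 cheap retries plus a handful of warnings, which is why the rpcz endpoint can no longer wedge the node.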
timothy-e
added a commit
that referenced
this issue
Dec 13, 2023
…eentries

Summary: Original commit: ab51835 / D30954

Reviewers: amartsinchyk, kfranz

Reviewed By: kfranz

Subscribers: yql

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D31015
timothy-e
added a commit
that referenced
this issue
Dec 13, 2023
…eentries

Summary: Original commit: ab51835 / D30954

Reviewers: amartsinchyk, kfranz, smishra, mihnea

Reviewed By: kfranz

Subscribers: yql

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D31013
timothy-e
added a commit
that referenced
this issue
Dec 13, 2023
…ntries

Summary: Original commit: ab51835 / D30954

Reviewers: amartsinchyk, kfranz, smishra, mihnea

Reviewed By: kfranz

Subscribers: yql

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D31012
timothy-e
added a commit
that referenced
this issue
Dec 13, 2023
…ntries

Summary: Original commit: ab51835 / D30954

Reviewers: amartsinchyk, kfranz

Reviewed By: kfranz

Subscribers: yql

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D31014
timothy-e
added a commit
that referenced
this issue
Dec 13, 2023
…ntries

Summary: Original commit: ab51835 / D30954

Reviewers: amartsinchyk, kfranz

Reviewed By: kfranz

Subscribers: yql

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D31016
timothy-e
added a commit
that referenced
this issue
Dec 13, 2023
…ntries

(Backport of the commit above; the commit message is identical except for the revision link.)

Differential Revision: https://phorge.dev.yugabyte.com/D31017
Jira Link: DB-9217
Description
`st_changecount` is incremented twice per write: once when the write begins and once when it completes. This means that if the value is even, it is a clean value and safe to read; if it is odd, a write is in progress.

If a backend is killed (e.g. with `kill -9` by the OOM killer) while inside a write section, the count is left odd. Curl `localhost:13000/rpcz` after that and it will spin indefinitely, waiting for the count to become even again.

Solution: anytime a backend is killed in a critical section, restart the postmaster.
Issue Type
kind/bug