[YSQL] Processes killed while writing to a BEEntry can cause corrupted state #20255

Closed
1 task done
timothy-e opened this issue Dec 9, 2023 · 0 comments
Assignees: timothy-e
Labels: area/ysql Yugabyte SQL (YSQL) · kind/bug This issue is a bug · priority/medium Medium priority issue

Comments


timothy-e commented Dec 9, 2023

Jira Link: DB-9217

Description

`st_changecount` is incremented twice per write: once when the write begins and once when it completes. So if the value is even, the entry is clean and safe to read; if it is odd, a write is in progress.
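A minimal sketch of that protocol (hypothetical, simplified types; the real entry is `PgBackendStatus`, and the real code also uses memory barriers):

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the real PgBackendStatus;
 * memory barriers are omitted for brevity. */
typedef struct BEEntry
{
    volatile int st_changecount;
    int          st_state;
} BEEntry;

static void
write_entry(BEEntry *e, int new_state)
{
    e->st_changecount++;        /* odd: write in progress */
    e->st_state = new_state;
    e->st_changecount++;        /* even again: entry is clean */
}

static bool
try_read_entry(const BEEntry *e, int *state_out)
{
    int before = e->st_changecount;

    *state_out = e->st_state;

    int after = e->st_changecount;

    /* Valid only if no write was in progress (even) and no write
     * started while we were reading (unchanged). */
    return (before & 1) == 0 && before == after;
}
```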

To reproduce:

  1. Add a sleep to a beentry write section (after `st_changecount` has been incremented once).
  2. Kill the backend while it is writing (so `st_changecount` is odd).
  3. Access `localhost:13000/rpcz`: it will spin indefinitely.

Solution: any time a backend is killed while in a critical section, restart the postmaster.
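The shape of that fix, sketched with illustrative names (the flag, slot type, and helper below are hypothetical, not the actual patch):

```c
#include <stdbool.h>

/* Illustrative per-backend slot; the actual patch tracks
 * critical-section state differently. */
typedef struct ProcSlot
{
    int  pid;
    bool in_critical_section;   /* set/cleared around critical sections */
} ProcSlot;

extern void request_full_restart(void);    /* hypothetical helper */

/* Postmaster-side reaper (sketch): a backend killed inside a critical
 * section may have left shared state (e.g. a half-written BEEntry)
 * corrupt, so escalate to a full restart, just as a PANIC would. */
static void
on_child_exit(const ProcSlot *slot, bool killed_by_signal)
{
    if (killed_by_signal && slot->in_critical_section)
        request_full_restart();
}
```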

Issue Type

kind/bug

Warning: Please confirm that this issue does not contain any sensitive information

  • I confirm this issue does not contain any sensitive information.
@timothy-e timothy-e added area/ysql Yugabyte SQL (YSQL) status/awaiting-triage Issue awaiting triage labels Dec 9, 2023
@timothy-e timothy-e self-assigned this Dec 9, 2023
@yugabyte-ci yugabyte-ci added kind/bug This issue is a bug priority/medium Medium priority issue and removed status/awaiting-triage Issue awaiting triage labels Dec 9, 2023
timothy-e added a commit that referenced this issue Dec 11, 2023
…ections

Summary:
As a lightweight alternative to locks, beentries are protected by a version number (`st_changecount`), which is incremented twice per write: once when the write begins and once when it completes. So an even value means the entry is clean and safe to read, while an odd value means it is currently being written. If the value changes while we are reading, we know that another process started writing.

These write sections are guarded by `PGSTAT_BEGIN_WRITE_ACTIVITY` and `PGSTAT_END_WRITE_ACTIVITY`, which enter and exit a critical section (`START_CRIT_SECTION`/`END_CRIT_SECTION`) and increment the counter just after entering it and just before leaving it.
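For reference, the macros are approximately the following (abridged from upstream `pgstat.h`; treat as a sketch):

```lang=c
#define PGSTAT_BEGIN_WRITE_ACTIVITY(beentry) \
    do { \
        START_CRIT_SECTION(); \
        (beentry)->st_changecount++; \
        pg_write_barrier(); \
    } while (0)

#define PGSTAT_END_WRITE_ACTIVITY(beentry) \
    do { \
        pg_write_barrier(); \
        (beentry)->st_changecount++; \
        Assert(((beentry)->st_changecount & 1) == 0); \
        END_CRIT_SECTION(); \
    } while (0)
```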

If a process encounters an error for any reason inside a critical section, the error is escalated to `PANIC`, causing a postmaster restart. However, if the process is killed with `kill -9` (e.g. by the OOM killer), we did not detect that. (Vanilla Postgres restarts all backends whenever any backend is killed, so this is a Yugabyte-specific issue.) In this case, the BEEntry's `st_changecount` is left odd when it should be even. This does not affect normal backend operation much, because the evenness checks are `Assert`s, which do not run in release builds.

The only non-assert code that checks this condition is:
* [[ https://github.com/YugaByte/yugabyte-db/blob/master/src/postgres/contrib/yb_pg_metrics/yb_pg_metrics.c#L229-L324 | yb_pg_metrics.c ]]
* `pgstat_read_activity_complete`, which is called by `pgstat_get_backend_current_activity` (called by the deadlock-detection code)
* `pgstat_read_current_status` (a simplified version of its retry loop is sketched below)
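The reader side retries until it sees an even, unchanged count, roughly like this (a simplified sketch of the loop in `pgstat_read_current_status`; memory barriers omitted), which is why a changecount stuck at an odd value spins forever:

```lang=c
#include <string.h>

static void
read_consistent_snapshot(volatile PgBackendStatus *beentry,
                         PgBackendStatus *local)
{
    for (;;)
    {
        int before = beentry->st_changecount;

        memcpy(local, (const PgBackendStatus *) beentry, sizeof(*local));

        int after = beentry->st_changecount;

        /* Even and unchanged: the copy is a consistent snapshot. */
        if (before == after && (before & 1) == 0)
            break;

        /* A write was in progress or started mid-copy: retry. If the
         * writer died mid-write, the count stays odd and this loop
         * never terminates. */
    }
}
```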

Curling `localhost:13000/rpcz` after reproducing this issue resulted in an infinite loop, waiting for the changecount to become even again. The loop has no sleep or backoff logic, so it consumes significant CPU while waiting; it froze my devserver for several minutes.

The solution is to restart the postmaster when a process dies while in a critical section. This fix can only apply to backend PG code, not frontend PG code, because [[ https://github.com/yugabyte/yugabyte-db/blob/b385f2bd146f1141b2237a88da1ee775dfb64dd5/src/postgres/src/include/storage/lock.h#L18 | lock.h is not allowed to be included in frontend code ]]. Most call sites are in backend code, including this one.
Jira: DB-9217

Test Plan:
1. Add a delay to pgstat when a process updates its own beentry:
```lang=diff
--- a/src/postgres/src/backend/postmaster/pgstat.c
+++ b/src/postgres/src/backend/postmaster/pgstat.c
@@ -3288,6 +3288,7 @@ pgstat_report_activity(BackendState state, const char *cmd_str)
         * Now update the status entry
         */
        PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
+       pg_usleep(10 * 1000 * 1000);

        beentry->st_state = state;
        beentry->st_state_start_timestamp = current_timestamp;
diff --git a/src/postgres/src/backend/postmaster/postmaster.c b/src/postgres/src/backend/postmaster/postmaster.c
```

2. Create two connections, A and B.
3. In A, run `select pg_backend_pid();`. This will take extra time to execute because of the delay. Call the result `pid_a`.
4. In another shell, run `kill -9 <pid_a>`.

== Results with this change ==
Observe in the logs:
```
2023-12-10 02:44:29.630 UTC [396633] WARNING:  terminating active server processes due to backend crash of a process while it was in a critical section
2023-12-10 02:44:29.630 UTC [396633] WARNING:  server process (PID 397283) was terminated by signal 9: Killed
2023-12-10 02:44:29.630 UTC [396633] DETAIL:  Failed process was running: select pg_backend_pid();
```
and that connection B is forced to restart:
```
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and repeat your command.
server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.
```

and `curl localhost:13000/rpcz` returns with no issues.

== Results without this change ==
Observe a normal kill in the logs:
```
2023-12-09 22:17:10.003 UTC [244532] WARNING:  server process (PID 244858) was terminated by signal 9: Killed
2023-12-09 22:17:10.003 UTC [244532] DETAIL:  Failed process was running: select pg_backend_pid();
```

**High memory usage and unavailable node**
Without this change, and after following the repro steps above:
1. Close all ysqlsh connections (this means that rpcz will not report for the given beentry).
2. `curl localhost:13000/rpcz` returns with no issues.
3. Run `bin/ysqlsh` to create a new session. It will reuse the BEEntry of the killed backend (the one with the improperly incremented `st_changecount`).
4. Run `top` in another shell.
5. `curl --max-time 2 localhost:13000/rpcz` will terminate after 2s, but the webserver is still stuck in the infinite loop waiting for `st_changecount` to become even, allocating memory the whole time. The machine becomes so slow that it takes multiple minutes to run the `pkill -f -9 "YSQL webserver"` command that would rescue it.
6. Observe the results from the `top` command:
```
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  63002 telgers+  20   0   12.7g  12.3g  16168 S  99.0  85.1   0:12.31 postgres
```

Reviewers: amartsinchyk, kfranz, mihnea, smishra

Reviewed By: amartsinchyk, kfranz

Subscribers: kannan, yql

Differential Revision: https://phorge.dev.yugabyte.com/D30906
timothy-e added a commit that referenced this issue Dec 12, 2023
…s in a critical sections

Summary:
Original commit: 6e7330e / D30906

Reviewers: amartsinchyk, kfranz, mihnea, smishra

Reviewed By: amartsinchyk

Subscribers: kannan, yql

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D30933
timothy-e added a commit that referenced this issue Dec 12, 2023
…in a critical sections

Summary:
Original commit: 6e7330e / D30906

Reviewers: amartsinchyk, kfranz, mihnea, smishra

Reviewed By: amartsinchyk

Subscribers: yql, kannan

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D30932
timothy-e added a commit that referenced this issue Dec 12, 2023
…s in a critical sections

Summary:
Original commit: 6e7330e / D30906

Reviewers: amartsinchyk, kfranz, mihnea, smishra

Reviewed By: amartsinchyk

Subscribers: kannan, yql

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D30931
timothy-e added a commit that referenced this issue Dec 12, 2023
…in a critical sections

Summary:
Original commit: 6e7330e / D30906

Reviewers: amartsinchyk, kfranz, mihnea, smishra

Reviewed By: amartsinchyk

Subscribers: yql, kannan

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D30930
timothy-e added a commit that referenced this issue Dec 12, 2023
…in a critical sections

Summary:
Original commit: 6e7330e / D30906

Reviewers: amartsinchyk, kfranz, mihnea, smishra

Reviewed By: amartsinchyk

Subscribers: kannan, yql

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D30935
timothy-e added a commit that referenced this issue Dec 12, 2023
…in a critical sections

Summary:
Original commit: 6e7330e / D30906

Reviewers: amartsinchyk, kfranz, mihnea, smishra

Reviewed By: amartsinchyk

Subscribers: yql, kannan

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D30934
timothy-e added a commit that referenced this issue Dec 13, 2023
Summary:
`localhost:13000/rpcz` calls `pullRpczEntries`, which can loop indefinitely if a beentry is in an inconsistent state (`st_changecount` stuck odd) or is being written to frequently (`st_changecount` keeps changing).

Postgres still has the same code flow of a break inside an infinite loop, but Yugabyte may have additional code that could result in an inconsistent state (beyond #20255). To be extra cautious, we cap the loop at 1000 attempts and log every hundredth attempt.

If the beentry never becomes consistent, we proceed with the inconsistent copy. The result may be slightly misleading (e.g. a half-written value), but it is safe.

See this comment in pgstat.c:
```
* strcpy is safe even if the string is modified concurrently,
* because there's always a \0 at the end of the buffer.
```
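A minimal sketch of the bounded retry (illustrative names and structure matching the warnings in the test plan below, not the literal patch):

```lang=c
#define YB_MAX_BEENTRY_READ_ATTEMPTS 1000   /* illustrative name */

static void
yb_copy_beentry(volatile PgBackendStatus *beentry, PgBackendStatus *local)
{
    for (int attempt = 1; attempt <= YB_MAX_BEENTRY_READ_ATTEMPTS; attempt++)
    {
        int before = beentry->st_changecount;

        memcpy(local, (const PgBackendStatus *) beentry, sizeof(*local));

        int after = beentry->st_changecount;

        if (before == after && (before & 1) == 0)
            return;     /* consistent snapshot obtained */

        if (attempt % 100 == 0)
            elog(WARNING,
                 "backend status entry for pid %d required %d attempts%s",
                 beentry->st_procpid, attempt,
                 attempt == YB_MAX_BEENTRY_READ_ATTEMPTS
                 ? ", using inconsistent results"
                 : ", continuing to retry");
    }
    /* Out of attempts: proceed with the (possibly inconsistent) copy. */
}
```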
Jira: DB-9229

Test Plan:
=== Forced Repro ===
1. Apply the patch
```lang=diff
--- a/src/postgres/src/backend/postmaster/pgstat.c
+++ b/src/postgres/src/backend/postmaster/pgstat.c
@@ -3287,19 +3287,24 @@ pgstat_report_activity(BackendState state, const char *cmd_str)
        /*
         * Now update the status entry
         */
-       PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
+       YBC_LOG_INFO("waiting!");
+       for (int i = 0; i < 10000000; i++)
+       {
+               PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
+               pg_usleep(1); // 1 microsec

-       beentry->st_state = state;
-       beentry->st_state_start_timestamp = current_timestamp;
+               beentry->st_state = state;
+               beentry->st_state_start_timestamp = current_timestamp;

-       if (cmd_str != NULL)
-       {
-               memcpy((char *) beentry->st_activity_raw, cmd_str, len);
-               beentry->st_activity_raw[len] = '\0';
-               beentry->st_activity_start_timestamp = start_timestamp;
-       }
+               if (cmd_str != NULL)
+               {
+                       memcpy((char *) beentry->st_activity_raw, cmd_str, len);
+                       beentry->st_activity_raw[len] = '\0';
+                       beentry->st_activity_start_timestamp = start_timestamp;
+               }

-       PGSTAT_END_WRITE_ACTIVITY(beentry);
+               PGSTAT_END_WRITE_ACTIVITY(beentry);
+       }
 }

 /*-----------
```

2. Run any query, e.g. `\d`.
3. While the query runs (it takes ~10s because of the 10M * 1µs sleeps), curl `localhost:13000/rpcz`. It will complete successfully:
```
I1211 17:29:59.046701 3847495808 pgstat.c:3290] waiting!
2023-12-11 17:30:00.798 EST [43411] WARNING:  backend status entry for pid 43433 required 100 attempts, continuing to retry
2023-12-11 17:30:00.798 EST [43411] WARNING:  backend status entry for pid 43433 required 200 attempts, continuing to retry
2023-12-11 17:30:00.799 EST [43411] WARNING:  backend status entry for pid 43433 required 300 attempts, continuing to retry
2023-12-11 17:30:00.799 EST [43411] WARNING:  backend status entry for pid 43433 required 400 attempts, continuing to retry
2023-12-11 17:30:00.799 EST [43411] WARNING:  backend status entry for pid 43433 required 500 attempts, continuing to retry
2023-12-11 17:30:00.799 EST [43411] WARNING:  backend status entry for pid 43433 required 600 attempts, continuing to retry
2023-12-11 17:30:00.799 EST [43411] WARNING:  backend status entry for pid 43433 required 700 attempts, continuing to retry
2023-12-11 17:30:00.799 EST [43411] WARNING:  backend status entry for pid 43433 required 800 attempts, continuing to retry
2023-12-11 17:30:00.799 EST [43411] WARNING:  backend status entry for pid 43433 required 900 attempts, continuing to retry
2023-12-11 17:30:00.799 EST [43411] WARNING:  backend status entry for pid 43433 required 1000 attempts, using inconsistent results
```

=== Testing normal impact ===
This code path should not normally be executed:
1. Create `script.sql` that contains 10000 lines of `\d`
2. In one shell, run `for i in {0..11111120}; do curl -s localhost:13000/rpcz > /dev/null; done`
3. In another shell, run `bin/ysqlsh -f script.sql` 50 times
4. After a few minutes, it completes with none of the retry warnings logged

Reviewers: amartsinchyk, kfranz, smishra, mihnea

Reviewed By: kfranz

Subscribers: yql

Differential Revision: https://phorge.dev.yugabyte.com/D30954
timothy-e added a commit that referenced this issue Dec 13, 2023
…eentries

Summary:
Original commit: ab51835 / D30954

Reviewers: amartsinchyk, kfranz

Reviewed By: kfranz

Subscribers: yql

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D31015
timothy-e added a commit that referenced this issue Dec 13, 2023
…eentries

Summary:
Original commit: ab51835 / D30954

Reviewers: amartsinchyk, kfranz, smishra, mihnea

Reviewed By: kfranz

Subscribers: yql

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D31013
timothy-e added a commit that referenced this issue Dec 13, 2023
…ntries

Summary:
Original commit: ab51835 / D30954
`localhost:13000/rpcz` calls `pullRpczEntries` which can wait infinitely in a for loop if a `beentry` is in an inconsistent state (`beentry->st_changecount` is odd) or being written to frequently (`beentry->st_changecount` is odd or changing).

Postgres still has the same code flow of a break inside an infinite loop, but Yugabyte may have additional code that could result in an inconsistent state (beyond #20255). To be extra cautious, we cap the loop at 1000 attempts and log every hundredth attempt.

If the BEEntry never becomes consistent, we proceed with the inconsistent copy. The result may be slightly misleading (e.g. a half-written value), but it is safe.

See this comment in pgstat.c:
```
* strcpy is safe even if the string is modified concurrently,
* because there's always a \0 at the end of the buffer.
```
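
To make the retry protocol concrete, here is a minimal sketch of such a bounded read loop. The names (`YbBeentrySketch`, `yb_read_activity`, `YB_MAX_READ_ATTEMPTS`) are hypothetical illustrations, not the actual pgstat identifiers, and the real code additionally issues memory barriers between the loads of `st_changecount`:
```lang=c
#include <stdio.h>
#include <string.h>

#define YB_MAX_READ_ATTEMPTS 1000

/* Hypothetical stand-in for the fields of a real backend status entry. */
typedef struct YbBeentrySketch
{
	volatile int st_changecount;
	char st_activity_raw[64];
} YbBeentrySketch;

static void
yb_read_activity(volatile YbBeentrySketch *beentry, char *out, size_t outlen)
{
	for (int attempt = 1;; attempt++)
	{
		int before = beentry->st_changecount;

		/*
		 * Copying while a writer is active is tolerable: the buffer always
		 * ends in '\0', so at worst we get a half-written string.
		 */
		strncpy(out, (const char *) beentry->st_activity_raw, outlen - 1);
		out[outlen - 1] = '\0';

		int after = beentry->st_changecount;

		/* Even and unchanged: no write started or finished during the copy. */
		if (before == after && (before & 1) == 0)
			return;

		if (attempt % 100 == 0)
			fprintf(stderr,
					"backend status entry required %d attempts, %s\n",
					attempt,
					attempt >= YB_MAX_READ_ATTEMPTS
						? "using inconsistent results"
						: "continuing to retry");

		if (attempt >= YB_MAX_READ_ATTEMPTS)
			return;				/* give up; keep the possibly-torn copy */
	}
}
```
The actual readers (`pgstat_read_current_status`, `pullRpczEntries`) follow the same pattern over the real `PgBackendStatus` fields; the attempt cap is what turns the previously unbounded spin into the bounded retry described above.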
Jira: DB-9229

Test Plan: identical to the forced-repro and normal-impact tests above (D31013).

Reviewers: amartsinchyk, kfranz, smishra, mihnea

Reviewed By: kfranz

Subscribers: yql

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D31012
timothy-e added a commit that referenced this issue Dec 13, 2023
…ntries

Summary: identical to D31012 above (backport of ab51835 / D30954).

Test Plan: identical to D31012 above.

Reviewers: amartsinchyk, kfranz

Reviewed By: kfranz

Subscribers: yql

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D31014
timothy-e added a commit that referenced this issue Dec 13, 2023
…ntries

Summary: identical to D31012 above (backport of ab51835 / D30954).

Test Plan: identical to D31012 above.

Reviewers: amartsinchyk, kfranz

Reviewed By: kfranz

Subscribers: yql

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D31016
timothy-e added a commit that referenced this issue Dec 13, 2023
…ntries

Summary: identical to D31012 above (backport of ab51835 / D30954).

Test Plan: identical to D31012 above.

Reviewers: amartsinchyk, kfranz

Reviewed By: kfranz

Subscribers: yql

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D31017