Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[YSQL] Postgres crash in `SetLatch(latch=0x024) at latch.c #20309

Closed
1 task done
Karvy-yb opened this issue Dec 14, 2023 · 1 comment
Closed
1 task done

[YSQL] Postgres crash in `SetLatch(latch=0x024) at latch.c #20309

Karvy-yb opened this issue Dec 14, 2023 · 1 comment
Assignees
Labels
area/ysql Yugabyte SQL (YSQL) kind/bug This issue is a bug priority/highest Highest priority issue qa_automation Bugs identified via itest-system, LST, Stress automation or causing automation failures qa_stress Bugs identified via Stress automation

Comments

@Karvy-yb
Copy link

Karvy-yb commented Dec 14, 2023

Jira Link: DB-9276

Description

Test: test_create_alter_delete_tables_vm_restarts
Version: 2.18.5.0-b91

Test steps:
image

Core dump:

* thread #1, name = 'postgres', stop reason = signal SIGSEGV
  * frame #0: 0x0000563277c2738b postgres`SetLatch(latch=0x0000000000000024) at latch.c:444:6
    frame #1: 0x00007fa0fd818c2f yb_pg_metrics.so`ws_sighup_handler(postgres_signal_arg=<unavailable>) at yb_pg_metrics.c:343:2
    frame #2: 0x00007fa102c80ba0 libpthread.so.0`__restore_rt
    frame #3: 0x00007fa104367dab ld.so`do_lookup_x(undef_name="pthread_rwlock_destroy", new_hash=<unavailable>, old_hash=0x00007fff4e6b2e60, ref=0x00007fa103b3abc8, result=0x00007fff4e6b2e70, scope=<unavailable>, i=6, version=0x00007fa0fda14a88, flags=5, skip=0x0000000000000000, type_class=1, undef_map=0x00007fa10456bae8) at dl-lookup.c:377
    frame #4: 0x00007fa104368a7c ld.so`_dl_lookup_symbol_x(undef_name="pthread_rwlock_destroy", undef_map=0x00007fa10456bae8, ref=0x00007fff4e6b2f28, symbol_scope=0x00007fa10456be40, version=0x00007fa0fda14a88, type_class=1, flags=5, skip_map=0x0000000000000000) at dl-lookup.c:829
    frame #5: 0x00007fa10436d18f ld.so`_dl_fixup(l=<unavailable>, reloc_arg=<unavailable>) at dl-runtime.c:111
    frame #6: 0x00007fa104373887 ld.so`_dl_runtime_resolve_avx512 at dl-trampoline.h:112
  thread #2, stop reason = signal 0
    frame #0: 0x00007fa102c7d3b8 libpthread.so.0`pthread_cond_timedwait@@GLIBC_2.3.2 at pthread_cond_timedwait.S:225
    frame #1: 0x00007fa0ffd3233e libserver_process.so`worker_thread + 382
    frame #2: 0x00007fa102c78694 libpthread.so.0`start_thread(arg=0x00007fa0f2e15700) at pthread_create.c:333
    frame #3: 0x00007fa1023b541d libc.so.6`__clone at clone.S:109
  thread #3, stop reason = signal 0
    frame #0: 0x00007fa1023ac5cd libc.so.6`poll at syscall-template.S:84
    frame #1: 0x00007fa0ffd31e4c libserver_process.so`master_thread + 572
    frame #2: 0x00007fa102c78694 libpthread.so.0`start_thread(arg=0x00007fa0f3616700) at pthread_create.c:333
    frame #3: 0x00007fa1023b541d libc.so.6`__clone at clone.S:109

Note: There is another core dump observed which is already reported in #17172
Stress report: http://stress.dev.yugabyte.com/stress_test/75f1961a-10ff-4d36-a378-93251e23de71

Issue Type

kind/bug

Warning: Please confirm that this issue does not contain any sensitive information

  • I confirm this issue does not contain any sensitive information.
@Karvy-yb Karvy-yb added area/ysql Yugabyte SQL (YSQL) priority/medium Medium priority issue status/awaiting-triage Issue awaiting triage qa_automation Bugs identified via itest-system, LST, Stress automation or causing automation failures qa_stress Bugs identified via Stress automation labels Dec 14, 2023
@yugabyte-ci yugabyte-ci added the kind/bug This issue is a bug label Dec 14, 2023
@Karvy-yb
Copy link
Author

Karvy-yb commented Dec 14, 2023

Observing the same issue for test_sequence_multizone_ddl in 2.18.6.0-b20 and 2.18.5.0-b91
Observing same issue for test_ysql_bank_operations_pessimistic_lock in 2.18.5.0-b91

@yugabyte-ci yugabyte-ci added priority/highest Highest priority issue and removed status/awaiting-triage Issue awaiting triage priority/medium Medium priority issue labels Dec 14, 2023
@timothy-e timothy-e changed the title [YSQL] [Stress Test ]Postgres crash in `SetLatch(latch=0x0000000000000024) at latch.c [YSQL] Postgres crash in `SetLatch(latch=0x024) at latch.c Dec 14, 2023
timothy-e added a commit that referenced this issue Dec 15, 2023
…cleanup

Summary:
Short version: `MyLatch` is safe to use at anytime, but `MyProc->procLatch` refers to a shared latch that may not exist at startup and shut down. We don't worry about the startup case, because exit handlers have not been registered yet.

= Background =
=== Latch Context ===
```lang=c, name=latch.h
/*
...
 * There are two kinds of latches: local and shared. A local latch is
 * initialized by InitLatch, and can only be set from the same process.
 * A local latch can be used to wait for a signal to arrive, by calling
 * SetLatch in the signal handler. A shared latch resides in shared memory,
 * and must be initialized at postmaster startup by InitSharedLatch. Before
 * a shared latch can be waited on, it must be associated with a process
 * with OwnLatch. Only the process owning the latch can wait on it, but any
 * process can set it.
...
```

=== Process Life Cycle ===
A process is forked by the postmaster. Inside the new process, we set the values in the following order:
1. `InitPostmasterChild` sets `MyLatch` to a local latch object.
2. `InitProcess` sets `MyProc` to a non-null value from the free proc list
3. `InitProcess` calls `OwnLatch(MyProc->procLatch)`, allowing itself to wait on the latch of the PGPROC struct.
4. `InitProcess` switches the value of`MyLatch` to `MyProc->procLatch` from the local latch. Since the PGPROC list is shared, other processes can send signals to this process now.
5. The process sets up its signal handlers. In most cases (including the webserver), these involve setting the process's own latch to cause it to enter normal execution again.
6. **The process runs normally, waiting on the latch for new events. **
7. When the process begins to shut down, we enter the cleanup phase in `ProcKill`. (This runs under normal circumstances, even though the name is aggressive)
8. The process sets its `MyLatch` to the local latch, because its about to release it's shared latch.
9. The process sets its `MyProc` to NULL.
10. The process disowns the shared latch.

= Problem =
If the process receives a SIGHUP signal...
* before (5) above, nothing happens - no signal handlers are registered.
* during normal execution, `MyProc->procLatch` and `MyLatch` both point to the latch of the PGPROC struct. Both work identically.
* during cleanup, after `MyProc` was set to null: `MyLatch` is still valid to set and wait on (points to its local latch) but `MyProc->procLatch` causes a segmentation fault.

The solution is to use `MyLatch` in all circumstances, so that we allow the process life cycle to control whether we use the shared latch or the local latch. PG16 has very few references to `MyProc->procLatch`.

Test Plan:
1. Apply the patch
```lang=diff
diff --git a/src/postgres/src/backend/storage/lmgr/proc.c b/src/postgres/src/backend/storage/lmgr/proc.c
index eedb2057ea..c4529af546 100644
--- a/src/postgres/src/backend/storage/lmgr/proc.c
+++ b/src/postgres/src/backend/storage/lmgr/proc.c
@@ -885,6 +885,7 @@ ProcKill(int code, Datum arg)
        SwitchBackToLocalLatch();
        proc = MyProc;
        MyProc = NULL;
+       pg_usleep(10 * 1000 * 1000);
        DisownLatch(&proc->procLatch);

        ReleaseProcToFreeList(proc);
@@ -995,6 +996,7 @@ AuxiliaryProcKill(int code, Datum arg)
         * latch.
         */
        SwitchBackToLocalLatch();
        proc = MyProc;
        MyProc = NULL;
+       pg_usleep(10 * 1000 * 1000);
        DisownLatch(&proc->procLatch);
```
2. Get the webserver PID `WEBSERVER_PID=$(pgrep -f "YSQL webserver")`
3. Send a SIGTERM to the webserver - this will cause it to shut itself down, but sleep for 10s during shut down. `kill -SIGTERM $WEBSERVER_PID`
4. While it's sleeping, send a SIGHUP `kill -SIGHUP $WEBSERVER_PID`
5. Before this change, a segmentation fault would occur - MyProc was null. After this change, we use `MyLatch` which has been switched to use a local latch instead of the shared `PGPROC` `procLatch`, so the SIGHUP completes successfully.

Reviewers: amartsinchyk, kfranz, smishra

Reviewed By: amartsinchyk, smishra

Subscribers: smishra, yql

Differential Revision: https://phorge.dev.yugabyte.com/D31071
timothy-e added a commit that referenced this issue Dec 15, 2023
…ighup handler at cleanup

Summary:
Original commit: D31071 / af01dc0

Short version: `MyLatch` is safe to use at anytime, but `MyProc->procLatch` refers to a shared latch that may not exist at startup and shut down. We don't worry about the startup case, because exit handlers have not been registered yet.

= Background =
=== Latch Context ===
```lang=c, name=latch.h
/*
...
 * There are two kinds of latches: local and shared. A local latch is
 * initialized by InitLatch, and can only be set from the same process.
 * A local latch can be used to wait for a signal to arrive, by calling
 * SetLatch in the signal handler. A shared latch resides in shared memory,
 * and must be initialized at postmaster startup by InitSharedLatch. Before
 * a shared latch can be waited on, it must be associated with a process
 * with OwnLatch. Only the process owning the latch can wait on it, but any
 * process can set it.
...
```

=== Process Life Cycle ===
A process is forked by the postmaster. Inside the new process, we set the values in the following order:
1. `InitPostmasterChild` sets `MyLatch` to a local latch object.
2. `InitProcess` sets `MyProc` to a non-null value from the free proc list
3. `InitProcess` calls `OwnLatch(MyProc->procLatch)`, allowing itself to wait on the latch of the PGPROC struct.
4. `InitProcess` switches the value of`MyLatch` to `MyProc->procLatch` from the local latch. Since the PGPROC list is shared, other processes can send signals to this process now.
5. The process sets up its signal handlers. In most cases (including the webserver), these involve setting the process's own latch to cause it to enter normal execution again.
6. **The process runs normally, waiting on the latch for new events. **
7. When the process begins to shut down, we enter the cleanup phase in `ProcKill`. (This runs under normal circumstances, even though the name is aggressive)
8. The process sets its `MyLatch` to the local latch, because its about to release it's shared latch.
9. The process sets its `MyProc` to NULL.
10. The process disowns the shared latch.

= Problem =
If the process receives a SIGHUP signal...
* before (5) above, nothing happens - no signal handlers are registered.
* during normal execution, `MyProc->procLatch` and `MyLatch` both point to the latch of the PGPROC struct. Both work identically.
* during cleanup, after `MyProc` was set to null: `MyLatch` is still valid to set and wait on (points to its local latch) but `MyProc->procLatch` causes a segmentation fault.

The solution is to use `MyLatch` in all circumstances, so that we allow the process life cycle to control whether we use the shared latch or the local latch. PG16 has very few references to `MyProc->procLatch`.

Test Plan:
1. Apply the patch
```lang=diff
diff --git a/src/postgres/src/backend/storage/lmgr/proc.c b/src/postgres/src/backend/storage/lmgr/proc.c
index eedb2057ea..c4529af546 100644
--- a/src/postgres/src/backend/storage/lmgr/proc.c
+++ b/src/postgres/src/backend/storage/lmgr/proc.c
@@ -885,6 +885,7 @@ ProcKill(int code, Datum arg)
        SwitchBackToLocalLatch();
        proc = MyProc;
        MyProc = NULL;
+       pg_usleep(10 * 1000 * 1000);
        DisownLatch(&proc->procLatch);

        ReleaseProcToFreeList(proc);
@@ -995,6 +996,7 @@ AuxiliaryProcKill(int code, Datum arg)
         * latch.
         */
        SwitchBackToLocalLatch();
        proc = MyProc;
        MyProc = NULL;
+       pg_usleep(10 * 1000 * 1000);
        DisownLatch(&proc->procLatch);
```
2. Get the webserver PID `WEBSERVER_PID=$(pgrep -f "YSQL webserver")`
3. Send a SIGTERM to the webserver - this will cause it to shut itself down, but sleep for 10s during shut down. `kill -SIGTERM $WEBSERVER_PID`
4. While it's sleeping, send a SIGHUP `kill -SIGHUP $WEBSERVER_PID`
5. Before this change, a segmentation fault would occur - MyProc was null. After this change, we use `MyLatch` which has been switched to use a local latch instead of the shared `PGPROC` `procLatch`, so the SIGHUP completes successfully.

Reviewers: amartsinchyk, kfranz, smishra

Reviewed By: smishra

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D31073
timothy-e added a commit that referenced this issue Dec 21, 2023
…ighup handler at cleanup

Summary:
Original commit: af01dc0 / D31071
Short version: `MyLatch` is safe to use at anytime, but `MyProc->procLatch` refers to a shared latch that may not exist at startup and shut down. We don't worry about the startup case, because exit handlers have not been registered yet.

= Background =
=== Latch Context ===
```lang=c, name=latch.h
/*
...
 * There are two kinds of latches: local and shared. A local latch is
 * initialized by InitLatch, and can only be set from the same process.
 * A local latch can be used to wait for a signal to arrive, by calling
 * SetLatch in the signal handler. A shared latch resides in shared memory,
 * and must be initialized at postmaster startup by InitSharedLatch. Before
 * a shared latch can be waited on, it must be associated with a process
 * with OwnLatch. Only the process owning the latch can wait on it, but any
 * process can set it.
...
```

=== Process Life Cycle ===
A process is forked by the postmaster. Inside the new process, we set the values in the following order:
1. `InitPostmasterChild` sets `MyLatch` to a local latch object.
2. `InitProcess` sets `MyProc` to a non-null value from the free proc list
3. `InitProcess` calls `OwnLatch(MyProc->procLatch)`, allowing itself to wait on the latch of the PGPROC struct.
4. `InitProcess` switches the value of`MyLatch` to `MyProc->procLatch` from the local latch. Since the PGPROC list is shared, other processes can send signals to this process now.
5. The process sets up its signal handlers. In most cases (including the webserver), these involve setting the process's own latch to cause it to enter normal execution again.
6. **The process runs normally, waiting on the latch for new events. **
7. When the process begins to shut down, we enter the cleanup phase in `ProcKill`. (This runs under normal circumstances, even though the name is aggressive)
8. The process sets its `MyLatch` to the local latch, because its about to release it's shared latch.
9. The process sets its `MyProc` to NULL.
10. The process disowns the shared latch.

= Problem =
If the process receives a SIGHUP signal...
* before (5) above, nothing happens - no signal handlers are registered.
* during normal execution, `MyProc->procLatch` and `MyLatch` both point to the latch of the PGPROC struct. Both work identically.
* during cleanup, after `MyProc` was set to null: `MyLatch` is still valid to set and wait on (points to its local latch) but `MyProc->procLatch` causes a segmentation fault.

The solution is to use `MyLatch` in all circumstances, so that we allow the process life cycle to control whether we use the shared latch or the local latch. PG16 has very few references to `MyProc->procLatch`.

Jira: DB-9276

Test Plan:
1. Apply the patch
```lang=diff
diff --git a/src/postgres/src/backend/storage/lmgr/proc.c b/src/postgres/src/backend/storage/lmgr/proc.c
index eedb2057ea..c4529af546 100644
--- a/src/postgres/src/backend/storage/lmgr/proc.c
+++ b/src/postgres/src/backend/storage/lmgr/proc.c
@@ -885,6 +885,7 @@ ProcKill(int code, Datum arg)
        SwitchBackToLocalLatch();
        proc = MyProc;
        MyProc = NULL;
+       pg_usleep(10 * 1000 * 1000);
        DisownLatch(&proc->procLatch);

        ReleaseProcToFreeList(proc);
@@ -995,6 +996,7 @@ AuxiliaryProcKill(int code, Datum arg)
         * latch.
         */
        SwitchBackToLocalLatch();
        proc = MyProc;
        MyProc = NULL;
+       pg_usleep(10 * 1000 * 1000);
        DisownLatch(&proc->procLatch);
```
2. Get the webserver PID `WEBSERVER_PID=$(pgrep -f "YSQL webserver")`
3. Send a SIGTERM to the webserver - this will cause it to shut itself down, but sleep for 10s during shut down. `kill -SIGTERM $WEBSERVER_PID`
4. While it's sleeping, send a SIGHUP `kill -SIGHUP $WEBSERVER_PID`
5. Before this change, a segmentation fault would occur - MyProc was null. After this change, we use `MyLatch` which has been switched to use a local latch instead of the shared `PGPROC` `procLatch`, so the SIGHUP completes successfully.

Reviewers: amartsinchyk, kfranz, smishra

Reviewed By: amartsinchyk

Subscribers: smishra, yql

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D31167
timothy-e added a commit that referenced this issue Dec 21, 2023
…hup handler at cleanup

Summary:
Original commit: af01dc0 / D31071
Short version: `MyLatch` is safe to use at anytime, but `MyProc->procLatch` refers to a shared latch that may not exist at startup and shut down. We don't worry about the startup case, because exit handlers have not been registered yet.

= Background =
=== Latch Context ===
```lang=c, name=latch.h
/*
...
 * There are two kinds of latches: local and shared. A local latch is
 * initialized by InitLatch, and can only be set from the same process.
 * A local latch can be used to wait for a signal to arrive, by calling
 * SetLatch in the signal handler. A shared latch resides in shared memory,
 * and must be initialized at postmaster startup by InitSharedLatch. Before
 * a shared latch can be waited on, it must be associated with a process
 * with OwnLatch. Only the process owning the latch can wait on it, but any
 * process can set it.
...
```

=== Process Life Cycle ===
A process is forked by the postmaster. Inside the new process, we set the values in the following order:
1. `InitPostmasterChild` sets `MyLatch` to a local latch object.
2. `InitProcess` sets `MyProc` to a non-null value from the free proc list
3. `InitProcess` calls `OwnLatch(MyProc->procLatch)`, allowing itself to wait on the latch of the PGPROC struct.
4. `InitProcess` switches the value of`MyLatch` to `MyProc->procLatch` from the local latch. Since the PGPROC list is shared, other processes can send signals to this process now.
5. The process sets up its signal handlers. In most cases (including the webserver), these involve setting the process's own latch to cause it to enter normal execution again.
6. **The process runs normally, waiting on the latch for new events. **
7. When the process begins to shut down, we enter the cleanup phase in `ProcKill`. (This runs under normal circumstances, even though the name is aggressive)
8. The process sets its `MyLatch` to the local latch, because its about to release it's shared latch.
9. The process sets its `MyProc` to NULL.
10. The process disowns the shared latch.

= Problem =
If the process receives a SIGHUP signal...
* before (5) above, nothing happens - no signal handlers are registered.
* during normal execution, `MyProc->procLatch` and `MyLatch` both point to the latch of the PGPROC struct. Both work identically.
* during cleanup, after `MyProc` was set to null: `MyLatch` is still valid to set and wait on (points to its local latch) but `MyProc->procLatch` causes a segmentation fault.

The solution is to use `MyLatch` in all circumstances, so that we allow the process life cycle to control whether we use the shared latch or the local latch. PG16 has very few references to `MyProc->procLatch`.

Jira: DB-9276

Test Plan:
1. Apply the patch
```lang=diff
diff --git a/src/postgres/src/backend/storage/lmgr/proc.c b/src/postgres/src/backend/storage/lmgr/proc.c
index eedb2057ea..c4529af546 100644
--- a/src/postgres/src/backend/storage/lmgr/proc.c
+++ b/src/postgres/src/backend/storage/lmgr/proc.c
@@ -885,6 +885,7 @@ ProcKill(int code, Datum arg)
        SwitchBackToLocalLatch();
        proc = MyProc;
        MyProc = NULL;
+       pg_usleep(10 * 1000 * 1000);
        DisownLatch(&proc->procLatch);

        ReleaseProcToFreeList(proc);
@@ -995,6 +996,7 @@ AuxiliaryProcKill(int code, Datum arg)
         * latch.
         */
        SwitchBackToLocalLatch();
        proc = MyProc;
        MyProc = NULL;
+       pg_usleep(10 * 1000 * 1000);
        DisownLatch(&proc->procLatch);
```
2. Get the webserver PID `WEBSERVER_PID=$(pgrep -f "YSQL webserver")`
3. Send a SIGTERM to the webserver - this will cause it to shut itself down, but sleep for 10s during shut down. `kill -SIGTERM $WEBSERVER_PID`
4. While it's sleeping, send a SIGHUP `kill -SIGHUP $WEBSERVER_PID`
5. Before this change, a segmentation fault would occur - MyProc was null. After this change, we use `MyLatch` which has been switched to use a local latch instead of the shared `PGPROC` `procLatch`, so the SIGHUP completes successfully.

Reviewers: amartsinchyk, kfranz, smishra

Reviewed By: amartsinchyk

Subscribers: yql, smishra

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D31170
timothy-e added a commit that referenced this issue Dec 21, 2023
…hup handler at cleanup

Summary:
Original commit: af01dc0 / D31071
Short version: `MyLatch` is safe to use at anytime, but `MyProc->procLatch` refers to a shared latch that may not exist at startup and shut down. We don't worry about the startup case, because exit handlers have not been registered yet.

= Background =
=== Latch Context ===
```lang=c, name=latch.h
/*
...
 * There are two kinds of latches: local and shared. A local latch is
 * initialized by InitLatch, and can only be set from the same process.
 * A local latch can be used to wait for a signal to arrive, by calling
 * SetLatch in the signal handler. A shared latch resides in shared memory,
 * and must be initialized at postmaster startup by InitSharedLatch. Before
 * a shared latch can be waited on, it must be associated with a process
 * with OwnLatch. Only the process owning the latch can wait on it, but any
 * process can set it.
...
```

=== Process Life Cycle ===
A process is forked by the postmaster. Inside the new process, we set the values in the following order:
1. `InitPostmasterChild` sets `MyLatch` to a local latch object.
2. `InitProcess` sets `MyProc` to a non-null value from the free proc list
3. `InitProcess` calls `OwnLatch(MyProc->procLatch)`, allowing itself to wait on the latch of the PGPROC struct.
4. `InitProcess` switches the value of`MyLatch` to `MyProc->procLatch` from the local latch. Since the PGPROC list is shared, other processes can send signals to this process now.
5. The process sets up its signal handlers. In most cases (including the webserver), these involve setting the process's own latch to cause it to enter normal execution again.
6. **The process runs normally, waiting on the latch for new events. **
7. When the process begins to shut down, we enter the cleanup phase in `ProcKill`. (This runs under normal circumstances, even though the name is aggressive)
8. The process sets its `MyLatch` to the local latch, because its about to release it's shared latch.
9. The process sets its `MyProc` to NULL.
10. The process disowns the shared latch.

= Problem =
If the process receives a SIGHUP signal...
* before (5) above, nothing happens - no signal handlers are registered.
* during normal execution, `MyProc->procLatch` and `MyLatch` both point to the latch of the PGPROC struct. Both work identically.
* during cleanup, after `MyProc` was set to null: `MyLatch` is still valid to set and wait on (points to its local latch) but `MyProc->procLatch` causes a segmentation fault.

The solution is to use `MyLatch` in all circumstances, so that we allow the process life cycle to control whether we use the shared latch or the local latch. PG16 has very few references to `MyProc->procLatch`.

Jira: DB-9276

Test Plan:
1. Apply the patch
```lang=diff
diff --git a/src/postgres/src/backend/storage/lmgr/proc.c b/src/postgres/src/backend/storage/lmgr/proc.c
index eedb2057ea..c4529af546 100644
--- a/src/postgres/src/backend/storage/lmgr/proc.c
+++ b/src/postgres/src/backend/storage/lmgr/proc.c
@@ -885,6 +885,7 @@ ProcKill(int code, Datum arg)
        SwitchBackToLocalLatch();
        proc = MyProc;
        MyProc = NULL;
+       pg_usleep(10 * 1000 * 1000);
        DisownLatch(&proc->procLatch);

        ReleaseProcToFreeList(proc);
@@ -995,6 +996,7 @@ AuxiliaryProcKill(int code, Datum arg)
         * latch.
         */
        SwitchBackToLocalLatch();
        proc = MyProc;
        MyProc = NULL;
+       pg_usleep(10 * 1000 * 1000);
        DisownLatch(&proc->procLatch);
```
2. Get the webserver PID `WEBSERVER_PID=$(pgrep -f "YSQL webserver")`
3. Send a SIGTERM to the webserver - this will cause it to shut itself down, but sleep for 10s during shut down. `kill -SIGTERM $WEBSERVER_PID`
4. While it's sleeping, send a SIGHUP `kill -SIGHUP $WEBSERVER_PID`
5. Before this change, a segmentation fault would occur - MyProc was null. After this change, we use `MyLatch` which has been switched to use a local latch instead of the shared `PGPROC` `procLatch`, so the SIGHUP completes successfully.

Reviewers: amartsinchyk, kfranz, smishra

Reviewed By: amartsinchyk

Subscribers: smishra, yql

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D31169
timothy-e added a commit that referenced this issue Dec 21, 2023
…hup handler at cleanup

Summary:
Original commit: af01dc0 / D31071
Short version: `MyLatch` is safe to use at anytime, but `MyProc->procLatch` refers to a shared latch that may not exist at startup and shut down. We don't worry about the startup case, because exit handlers have not been registered yet.

= Background =
=== Latch Context ===
```lang=c, name=latch.h
/*
...
 * There are two kinds of latches: local and shared. A local latch is
 * initialized by InitLatch, and can only be set from the same process.
 * A local latch can be used to wait for a signal to arrive, by calling
 * SetLatch in the signal handler. A shared latch resides in shared memory,
 * and must be initialized at postmaster startup by InitSharedLatch. Before
 * a shared latch can be waited on, it must be associated with a process
 * with OwnLatch. Only the process owning the latch can wait on it, but any
 * process can set it.
...
```

=== Process Life Cycle ===
A process is forked by the postmaster. Inside the new process, we set the values in the following order:
1. `InitPostmasterChild` sets `MyLatch` to a local latch object.
2. `InitProcess` sets `MyProc` to a non-null value from the free proc list
3. `InitProcess` calls `OwnLatch(MyProc->procLatch)`, allowing itself to wait on the latch of the PGPROC struct.
4. `InitProcess` switches the value of`MyLatch` to `MyProc->procLatch` from the local latch. Since the PGPROC list is shared, other processes can send signals to this process now.
5. The process sets up its signal handlers. In most cases (including the webserver), these involve setting the process's own latch to cause it to enter normal execution again.
6. **The process runs normally, waiting on the latch for new events. **
7. When the process begins to shut down, we enter the cleanup phase in `ProcKill`. (This runs under normal circumstances, even though the name is aggressive)
8. The process sets its `MyLatch` to the local latch, because its about to release it's shared latch.
9. The process sets its `MyProc` to NULL.
10. The process disowns the shared latch.

= Problem =
If the process receives a SIGHUP signal...
* before (5) above, nothing happens - no signal handlers are registered.
* during normal execution, `MyProc->procLatch` and `MyLatch` both point to the latch of the PGPROC struct. Both work identically.
* during cleanup, after `MyProc` was set to null: `MyLatch` is still valid to set and wait on (points to its local latch) but `MyProc->procLatch` causes a segmentation fault.

The solution is to use `MyLatch` in all circumstances, so that we allow the process life cycle to control whether we use the shared latch or the local latch. PG16 has very few references to `MyProc->procLatch`.

Jira: DB-9276

Test Plan:
1. Apply the patch
```lang=diff
diff --git a/src/postgres/src/backend/storage/lmgr/proc.c b/src/postgres/src/backend/storage/lmgr/proc.c
index eedb2057ea..c4529af546 100644
--- a/src/postgres/src/backend/storage/lmgr/proc.c
+++ b/src/postgres/src/backend/storage/lmgr/proc.c
@@ -885,6 +885,7 @@ ProcKill(int code, Datum arg)
        SwitchBackToLocalLatch();
        proc = MyProc;
        MyProc = NULL;
+       pg_usleep(10 * 1000 * 1000);
        DisownLatch(&proc->procLatch);

        ReleaseProcToFreeList(proc);
@@ -995,6 +996,7 @@ AuxiliaryProcKill(int code, Datum arg)
         * latch.
         */
        SwitchBackToLocalLatch();
        proc = MyProc;
        MyProc = NULL;
+       pg_usleep(10 * 1000 * 1000);
        DisownLatch(&proc->procLatch);
```
2. Get the webserver PID `WEBSERVER_PID=$(pgrep -f "YSQL webserver")`
3. Send a SIGTERM to the webserver - this will cause it to shut itself down, but sleep for 10s during shut down. `kill -SIGTERM $WEBSERVER_PID`
4. While it's sleeping, send a SIGHUP `kill -SIGHUP $WEBSERVER_PID`
5. Before this change, a segmentation fault would occur - MyProc was null. After this change, we use `MyLatch` which has been switched to use a local latch instead of the shared `PGPROC` `procLatch`, so the SIGHUP completes successfully.

Reviewers: amartsinchyk, kfranz, smishra

Reviewed By: amartsinchyk

Subscribers: yql, smishra

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D31168
timothy-e added a commit that referenced this issue Dec 21, 2023
…hup handler at cleanup

Summary:
Original commit: af01dc0 / D31071
Short version: `MyLatch` is safe to use at anytime, but `MyProc->procLatch` refers to a shared latch that may not exist at startup and shut down. We don't worry about the startup case, because exit handlers have not been registered yet.

= Background =
=== Latch Context ===
```lang=c, name=latch.h
/*
...
 * There are two kinds of latches: local and shared. A local latch is
 * initialized by InitLatch, and can only be set from the same process.
 * A local latch can be used to wait for a signal to arrive, by calling
 * SetLatch in the signal handler. A shared latch resides in shared memory,
 * and must be initialized at postmaster startup by InitSharedLatch. Before
 * a shared latch can be waited on, it must be associated with a process
 * with OwnLatch. Only the process owning the latch can wait on it, but any
 * process can set it.
...
```

=== Process Life Cycle ===
A process is forked by the postmaster. Inside the new process, we set the values in the following order:
1. `InitPostmasterChild` sets `MyLatch` to a local latch object.
2. `InitProcess` sets `MyProc` to a non-null value from the free proc list
3. `InitProcess` calls `OwnLatch(MyProc->procLatch)`, allowing itself to wait on the latch of the PGPROC struct.
4. `InitProcess` switches the value of`MyLatch` to `MyProc->procLatch` from the local latch. Since the PGPROC list is shared, other processes can send signals to this process now.
5. The process sets up its signal handlers. In most cases (including the webserver), these involve setting the process's own latch to cause it to enter normal execution again.
6. **The process runs normally, waiting on the latch for new events. **
7. When the process begins to shut down, we enter the cleanup phase in `ProcKill`. (This runs under normal circumstances, even though the name is aggressive)
8. The process sets its `MyLatch` to the local latch, because its about to release it's shared latch.
9. The process sets its `MyProc` to NULL.
10. The process disowns the shared latch.

= Problem =
If the process receives a SIGHUP signal...
* before (5) above, nothing happens - no signal handlers are registered.
* during normal execution, `MyProc->procLatch` and `MyLatch` both point to the latch of the PGPROC struct. Both work identically.
* during cleanup, after `MyProc` was set to null: `MyLatch` is still valid to set and wait on (points to its local latch) but `MyProc->procLatch` causes a segmentation fault.

The solution is to use `MyLatch` in all circumstances, so that we allow the process life cycle to control whether we use the shared latch or the local latch. PG16 has very few references to `MyProc->procLatch`.

Jira: DB-9276

Test Plan:
1. Apply the patch
```lang=diff
diff --git a/src/postgres/src/backend/storage/lmgr/proc.c b/src/postgres/src/backend/storage/lmgr/proc.c
index eedb2057ea..c4529af546 100644
--- a/src/postgres/src/backend/storage/lmgr/proc.c
+++ b/src/postgres/src/backend/storage/lmgr/proc.c
@@ -885,6 +885,7 @@ ProcKill(int code, Datum arg)
        SwitchBackToLocalLatch();
        proc = MyProc;
        MyProc = NULL;
+       pg_usleep(10 * 1000 * 1000);
        DisownLatch(&proc->procLatch);

        ReleaseProcToFreeList(proc);
@@ -995,6 +996,7 @@ AuxiliaryProcKill(int code, Datum arg)
         * latch.
         */
        SwitchBackToLocalLatch();
        proc = MyProc;
        MyProc = NULL;
+       pg_usleep(10 * 1000 * 1000);
        DisownLatch(&proc->procLatch);
```
2. Get the webserver PID `WEBSERVER_PID=$(pgrep -f "YSQL webserver")`
3. Send a SIGTERM to the webserver - this will cause it to shut itself down, but sleep for 10s during shut down. `kill -SIGTERM $WEBSERVER_PID`
4. While it's sleeping, send a SIGHUP `kill -SIGHUP $WEBSERVER_PID`
5. Before this change, a segmentation fault would occur - MyProc was null. After this change, we use `MyLatch` which has been switched to use a local latch instead of the shared `PGPROC` `procLatch`, so the SIGHUP completes successfully.

Reviewers: amartsinchyk, kfranz, smishra

Reviewed By: amartsinchyk

Subscribers: yql, smishra

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D31166
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/ysql Yugabyte SQL (YSQL) kind/bug This issue is a bug priority/highest Highest priority issue qa_automation Bugs identified via itest-system, LST, Stress automation or causing automation failures qa_stress Bugs identified via Stress automation
Projects
None yet
Development

No branches or pull requests

4 participants