[Bug]: Shutdown stalled waiting for TimescaleDB Background Worker Scheduler #4200

Closed
mkindahl opened this issue Mar 29, 2022 · 5 comments · Fixed by #4199 or #4323

Comments

@mkindahl
Contributor

mkindahl commented Mar 29, 2022

What type of bug is this?

Locking issue

What subsystems and features are affected?

Background worker

What happened?

If a shutdown is slow, it can cause the scheduler to block while shutting down.

TimescaleDB version affected

2.5.0

PostgreSQL version used

13.5

What operating system did you use?

Ubuntu 20.04.3 LTS

What installation method did you use?

Source

What platform did you run on?

On prem/Self-hosted

Relevant log output and stack trace

strace

strace: Process 1104 attached
futex(0x7fedaf9afc60, FUTEX_WAIT_PRIVATE, 2, NULL

gdb

redacted bt full

#0  0x00007f0e4d1040eb in __lll_lock_wait_private () from target:/lib/x86_64-linux-gnu/libc.so.6
[...]
#3  malloc (size=53)
[...]
#7  0x000055b9212235b1 in errmsg ()
#8  0x00007f0e27bf27a8 in handle_sigterm (postgres_signal_arg=15) at /build/timescaledb/src/bgw/scheduler.c:841
#9  <signal handler called>
[...]
#13 free (ptr=<optimized out>)
#14 0x00007f0e4db12cb4 in OPENSSL_LH_free () from target:/lib/x86_64-linux-gnu/libcrypto.so.1.1
[...]

How can we reproduce the bug?

Run the following and be patient until the situation occurs.

while sleep 0.1
    do pg_ctl restart -m fast
done
@mkindahl mkindahl added the bug label Mar 29, 2022
mkindahl added a commit to mkindahl/timescaledb that referenced this issue Mar 29, 2022
Functions `elog` and `ereport` are unsafe to use in signal handlers
since they call `malloc`. This commit removes them from signal
handlers.

Fixes timescale#4200
@mkindahl mkindahl self-assigned this Mar 29, 2022
mkindahl added a commit to mkindahl/timescaledb that referenced this issue Mar 29, 2022
Functions `elog` and `ereport` are unsafe to use in signal handlers
since they call `malloc`. This commit removes them from signal
handlers.

Fixes timescale#4200
@feikesteenbergen
Member

feikesteenbergen commented Mar 30, 2022

We have observed a few situations where, when shutting down or restarting PostgreSQL using fast mode, the shutdown stalls indefinitely.

What we observe is that almost all PostgreSQL processes have been terminated, yet a few remain, for example:

ps xef
    PID TTY      STAT   TIME COMMAND
   1092 ?        S      0:29 /usr/lib/postgresql/14/bin/postgres [...]
   1095 ?        Ss     0:00  \_ postgres: cluster0: checkpointer                                                                                                                        
   1099 ?        Ss     0:00  \_ postgres: cluster0: archiver last was 00000002000000000000004C                                                                                          
   1100 ?        Ss     0:00  \_ postgres: cluster0: stats collector                                                                                                                     
   1104 ?        Ss     0:00  \_ postgres: cluster0: TimescaleDB Background Worker Scheduler 

The checkpointer, stats collector and archiver are expected in this situation, as PostgreSQL will shut them down last.
The TimescaleDB Background Worker Scheduler is at this point not responding to signals, so a:

kill 1104

Does not have any effect.

Using strace shows us what the process is currently doing:

strace -p 1104
strace: Process 1104 attached
futex(0x7fedaf9afc60, FUTEX_WAIT_PRIVATE, 2, NULL

Taking a deeper dive with gdb shows us that during a free (#13), a malloc (#3) is called, causing a deadlock:

redacted bt full

#0  0x00007f0e4d1040eb in __lll_lock_wait_private () from target:/lib/x86_64-linux-gnu/libc.so.6
[...]
#3  malloc (size=53)
[...]
#7  0x000055b9212235b1 in errmsg ()
#8  0x00007f0e27bf27a8 in handle_sigterm (postgres_signal_arg=15) at /build/timescaledb/src/bgw/scheduler.c:841
#9  <signal handler called>
[...]
#13 free (ptr=<optimized out>)
#14 0x00007f0e4db12cb4 in OPENSSL_LH_free () from target:/lib/x86_64-linux-gnu/libcrypto.so.1.1
[...]
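
For illustration, the following is a minimal stand-alone C sketch (not TimescaleDB code; the handler name merely mirrors the one in the backtrace) of the pattern that produces this deadlock: the interrupted code is inside free() and holds the glibc allocator lock, and the signal handler then calls malloc() and blocks on the same lock forever.

#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static void
handle_sigterm(int sig)
{
	/*
	 * malloc() is not async-signal-safe: if SIGTERM arrived while the main
	 * code was inside malloc()/free() and holding the allocator lock, this
	 * call blocks on that lock and never returns.
	 */
	char *msg = malloc(64);

	if (msg != NULL)
	{
		snprintf(msg, 64, "terminating on signal %d\n", sig);
		write(STDERR_FILENO, msg, strlen(msg));
		free(msg);
	}
	_exit(1);
}

int
main(void)
{
	signal(SIGTERM, handle_sigterm);

	for (;;)
	{
		/*
		 * Constant allocator traffic makes it likely that a signal lands
		 * while the allocator lock is held, as in frame #13 of the backtrace.
		 */
		free(malloc(53));
	}
}

Under a repeated kill -TERM loop, such a process can eventually hang on a futex in the same way PID 1104 does above.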

We have been able to reproduce this in an environment with atypical memory allocation, which, as far as we can tell, increases the chances of hitting this kind of bug.

To reproduce, we executed the following and waited for the situation to occur:

while sleep 0.1
    do pg_ctl restart -m fast
done

In a regular environment we have not been able to reproduce this yet, but we have had reports of stalled shutdowns.

@feikesteenbergen feikesteenbergen changed the title [Bug]: Slow shutdown can cause BGW scheduler to block [Bug]: Shutdown stalled waiting for TimescaleDB Background Worker Scheduler Mar 30, 2022
@mkindahl
Contributor Author

@feikesteenbergen Is there a good way to monitor for a stalled shutdown other than inspecting the output of ps? It would be good to understand whether there is a way to create a monitoring rule for this situation.

mkindahl added a commit to mkindahl/timescaledb that referenced this issue Mar 30, 2022
Functions `elog` and `ereport` are unsafe to use in signal handlers
since they call `malloc`. This commit removes them from signal
handlers and also adds calls to `gettext` to support translation.

Fixes timescale#4200
mkindahl added a commit to mkindahl/timescaledb that referenced this issue Mar 30, 2022
Functions `elog` and `ereport` are unsafe to use in signal handlers
since they call `malloc`. This commit removes them from signal
handlers.

Fixes timescale#4200
@feikesteenbergen
Member

feikesteenbergen commented Mar 30, 2022

@mkindahl a stalled shutdown would present itself as a PostgreSQL server that is listening, but not accepting connections.

Therefore, to monitor this kind of situation, running pg_isready against the server frequently would make it possible to alert on this.

pg_isready returns 0 to the shell if the server is accepting connections normally, 1 if the server is rejecting connections (for example during startup), 2 if there was no response to the connection attempt, and 3 if no attempt was made (for example due to invalid parameters).

All fine exitcode=0

$ pg_isready
/var/run/postgresql:5432 - accepting connections

Shutting down exitcode=1

$ pg_isready
/var/run/postgresql:5432 - rejecting connections

NOTE: A PostgreSQL instance that is in recovery and isn't yet consistent, or doesn't have hot_standby=on, will also return this result.

Not running exitcode=2

$ pg_isready
/var/run/postgresql:5432 - no response

An alternative would be to alert on the absence of metrics; for example, pg_exporter can be used with Prometheus to detect the absence of useful metrics for a period of time.

mkindahl added a commit that referenced this issue Mar 30, 2022
Functions `elog` and `ereport` are unsafe to use in signal handlers
since they call `malloc`. This commit removes them from signal
handlers.

Fixes #4200
@mkindahl mkindahl added this to the TimescaleDB 2.6.1 milestone Apr 5, 2022
RafiaSabih pushed a commit to RafiaSabih/timescaledb that referenced this issue Apr 7, 2022
Functions `elog` and `ereport` are unsafe to use in signal handlers
since they call `malloc`. This commit removes them from signal
handlers.

Fixes timescale#4200
RafiaSabih pushed a commit to RafiaSabih/timescaledb that referenced this issue Apr 8, 2022
Functions `elog` and `ereport` are unsafe to use in signal handlers
since they call `malloc`. This commit removes them from signal
handlers.

Fixes timescale#4200
mkindahl added a commit to RafiaSabih/timescaledb that referenced this issue Apr 8, 2022
Functions `elog` and `ereport` are unsafe to use in signal handlers
since they call `malloc`. This commit removes them from signal
handlers.

Fixes timescale#4200
svenklemm pushed a commit that referenced this issue Apr 11, 2022
Functions `elog` and `ereport` are unsafe to use in signal handlers
since they call `malloc`. This commit removes them from signal
handlers.

Fixes #4200
@mkindahl
Contributor Author

mkindahl commented May 12, 2022

It turns out that write_stderr is not signal-safe after all: it uses vfprintf internally, which allocates a buffer using malloc. This can, again, cause a process that is shutting down to become unresponsive because it is holding a lock.

@mkindahl mkindahl reopened this May 12, 2022
@mkindahl
Contributor Author

The function write_stderr is defined in elog.c and calls either write_console or write_eventlog depending on the platform. Unfortunately, both of these functions are static, so they are not callable from outside elog.c. This makes it difficult to create a signal-safe version without copying code or re-implementing the solution for all platforms.
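
For reference, an async-signal-safe message on Unix can only use primitives such as write(2) with a preformatted string; a rough sketch of such a re-implementation (the helper name write_stderr_sigsafe is hypothetical, and this covers only the Unix side, not the Windows event-log path) would look like this:

#include <errno.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical Unix-only helper: no formatting and no allocation, just
 * write(2) with a string that was formatted outside signal context.
 */
void
write_stderr_sigsafe(const char *msg)
{
	int			save_errno = errno;	/* write() may clobber errno */

	(void) write(STDERR_FILENO, msg, strlen(msg));
	errno = save_errno;
}

This is exactly the kind of per-platform duplication the comment above tries to avoid, which is why the fix below takes a different route.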

mkindahl added a commit to mkindahl/timescaledb that referenced this issue May 12, 2022
In timescale#4199 existing calls of `ereport` were replaced with calls of
`write_stderr` to eliminate the use of signal-unsafe calls, in
particular calls to `malloc`. Unfortunately, `write_stderr` contains a
call to `vfprintf`, which allocates memory as well, occasionally
causing servers that are shutting down to become unresponsive.

Since the existing signal handlers just called `die` after printing out
a useful message, this commit fixes this by using `die` as a signal
handler.

Fixes timescale#4200
mkindahl added a commit to mkindahl/timescaledb that referenced this issue May 12, 2022
In timescale#4199 existing calls of `ereport` were replaced with calls of
`write_stderr` to eliminate the use of signal-unsafe calls, in
particular calls to `malloc`. Unfortunately, `write_stderr` contains a
call to `vfprintf`, which allocates memory as well, occasionally
causing servers that are shutting down to become unresponsive.

Since the existing signal handlers just called `die` after printing out
a useful message, this commit fixes this by using `die` as a signal
handler.

Fixes timescale#4200
mkindahl added a commit that referenced this issue May 12, 2022
In #4199 existing calls of `ereport` were replaced with calls of
`write_stderr` to eliminate the use of signal-unsafe calls, in
particular calls to `malloc`. Unfortunately, `write_stderr` contains a
call to `vfprintf`, which allocates memory as well, occasionally
causing servers that are shutting down to become unresponsive.

Since the existing signal handlers just called `die` after printing out
a useful message, this commit fixes this by using `die` as a signal
handler.

Fixes #4200
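
In outline, the approach described in the commit message amounts to the following sketch (not the actual TimescaleDB source; the entry-point name is illustrative). PostgreSQL's `die` handler only sets flags such as ProcDiePending, so the error reporting happens later from regular backend code instead of from signal context:

#include "postgres.h"

#include <signal.h>

#include "miscadmin.h"
#include "postmaster/bgworker.h"
#include "tcop/tcopprot.h"		/* declares die() */

/* Hypothetical background-worker entry point, for illustration only. */
void
scheduler_worker_main(Datum main_arg)
{
	/*
	 * die() only sets InterruptPending/ProcDiePending, so nothing
	 * signal-unsafe (such as malloc via ereport or vfprintf) runs in
	 * signal context.
	 */
	pqsignal(SIGTERM, die);
	BackgroundWorkerUnblockSignals();

	for (;;)
	{
		/* ... scheduler work ... */

		/*
		 * The pending termination request is acted on here, from regular
		 * code, where ereport() is safe to call.
		 */
		CHECK_FOR_INTERRUPTS();
	}
}

This sidesteps the signal-safety problem entirely rather than trying to make logging from the handler safe.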