[Bug]: Shutdown stalled waiting for TimescaleDB Background Worker Scheduler #4200
Functions `elog` and `ereport` are unsafe to use in signal handlers since they call `malloc`. This commit removes them from signal handlers. Fixes timescale#4200
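The usual way to keep allocating functions out of a handler is to have the handler record only that the signal arrived and to do the logging later from normal code. The following is a minimal sketch of that pattern in plain C, not the TimescaleDB code; the names `got_sigterm`, `record_sigterm`, and `process_pending_shutdown` are hypothetical.

```c
#include <signal.h>
#include <stdio.h>

/* Set by the handler. volatile sig_atomic_t is the object type the C
 * standard allows a signal handler to write safely. */
volatile sig_atomic_t got_sigterm = 0;

/* Hypothetical handler: no logging, no stdio, no malloc. It only records
 * that the signal arrived. */
void
record_sigterm(int signo)
{
    (void) signo;
    got_sigterm = 1;
}

/* Called from the main loop, outside signal context, where stdio and other
 * allocating functions are safe. Returns 1 if a shutdown request was
 * pending and has now been reported, 0 otherwise. */
int
process_pending_shutdown(void)
{
    if (!got_sigterm)
        return 0;
    got_sigterm = 0;
    fprintf(stderr, "terminating due to SIGTERM\n");
    return 1;
}
```

A worker would install the handler once with `signal(SIGTERM, record_sigterm)` and call `process_pending_shutdown()` on each iteration of its main loop.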
We have observed a few situations where when shutting down or restarting PostgreSQL using the What we observe is that almost all PostgreSQL processes have been terminated, yet a few remain, for example: ps xef
Running `kill 1104` does not have any effect. Using
Taking a deeper dive with redacted
We have been able to reproduce this in an environment with atypical memory allocation, which, as far as we can tell, increases the chances of hitting this kind of bug. To reproduce, we executed the following and waited for the situation to occur:

    while sleep 0.1
    do pg_ctl restart -m fast
    done

In a regular environment we have not been able to reproduce this yet, but we have had reports of stalled shutdowns.
@feikesteenbergen Is there a good way to monitor for a stalled shutdown other than inspecting the output of
Functions `elog` and `ereport` are unsafe to use in signal handlers since they call `malloc`. This commit removes them from signal handlers and also adds calls to `gettext` to support translation. Fixes timescale#4200
@mkindahl a stalled shutdown would present itself as a PostgreSQL server that is listening, but not accepting connections. Therefore, to monitor this kind of situation, running a frequent
All fine exitcode=0
Shutting down exitcode=1
Not running exitcode=2
An alternative would be to alert on the absence of metrics: for example, pg_exporter can be used with Prometheus to detect the absence of useful metrics for a period of time.
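The exit-code mapping above can be turned into a small probe. This is only a sketch: the check command itself is not preserved in this thread, so it is passed in as a parameter rather than guessed, and the names `ServerState`, `classify_exit_code`, and `probe_server` are hypothetical.

```c
#include <stdlib.h>
#include <sys/wait.h>

typedef enum
{
    SERVER_OK,            /* exit code 0: all fine */
    SERVER_SHUTTING_DOWN, /* exit code 1: shutting down */
    SERVER_NOT_RUNNING,   /* exit code 2: not running */
    SERVER_UNKNOWN        /* anything else, or the check could not be run */
} ServerState;

/* Map the exit code of the check command to a state, following the
 * table in the comment above. */
ServerState
classify_exit_code(int exitcode)
{
    switch (exitcode)
    {
        case 0:
            return SERVER_OK;
        case 1:
            return SERVER_SHUTTING_DOWN;
        case 2:
            return SERVER_NOT_RUNNING;
        default:
            return SERVER_UNKNOWN;
    }
}

/* Run the caller-supplied check command through the shell and classify
 * its exit status. */
ServerState
probe_server(const char *check_command)
{
    int status = system(check_command);

    if (status == -1 || !WIFEXITED(status))
        return SERVER_UNKNOWN;
    return classify_exit_code(WEXITSTATUS(status));
}
```

A monitor could call `probe_server` at a fixed interval and raise an alert when the result stays at `SERVER_SHUTTING_DOWN` for longer than a normal shutdown takes.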
It turns out that
The function
In timescale#4199 existing calls of `ereport` were replaced with calls of `write_stderr` to eliminate the use of signal-unsafe calls, in particular calls to `malloc`. Unfortunately, `write_stderr` contains a call to `vfprintf`, which allocates memory as well, occasionally causing servers that are shutting down to become unresponsive. Since the existing signal handlers just called `die` after printing out a useful message, this commit fixes this by using `die` as a signal handler. Fixes timescale#4200
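The commit described above drops the message entirely and installs `die` directly. For contrast, the sketch below shows what printing from a handler without `vfprintf` could look like: the message is built in a stack buffer and emitted with `write(2)`, which POSIX lists as async-signal-safe. It illustrates the constraint and is not what the commit does; `format_signal_message` and `report_signal` are hypothetical names.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Build "<prefix><signo>\n" in buf without stdio or malloc. The result is
 * always NUL-terminated (if buflen > 0) and truncated if buf is too small.
 * Returns the number of characters written, excluding the terminator. */
size_t
format_signal_message(char *buf, size_t buflen, const char *prefix, int signo)
{
    char digits[12];
    size_t n = 0;
    size_t d = 0;
    unsigned int v = signo < 0 ? 0 : (unsigned int) signo;

    if (buflen == 0)
        return 0;

    /* Copy the prefix, leaving room for the terminator. */
    while (*prefix != '\0' && n + 1 < buflen)
        buf[n++] = *prefix++;

    /* Produce the decimal digits in reverse order. */
    do
    {
        digits[d++] = (char) ('0' + v % 10);
        v /= 10;
    } while (v > 0 && d < sizeof(digits));

    /* Append the digits in the right order, then a newline. */
    while (d > 0 && n + 1 < buflen)
        buf[n++] = digits[--d];
    if (n + 1 < buflen)
        buf[n++] = '\n';

    buf[n] = '\0';
    return n;
}

/* Hypothetical handler body: formats on the stack and writes directly to
 * stderr, preserving errno for the interrupted code. */
void
report_signal(int signo)
{
    char buf[64];
    int saved_errno = errno;
    size_t len = format_signal_message(buf, sizeof(buf), "received signal ", signo);
    ssize_t rc = write(STDERR_FILENO, buf, len);

    (void) rc;
    errno = saved_errno;
}
```

Removing the message, as the commit does, avoids even this amount of code in signal context, which is the simpler choice when the handler only needs to trigger termination.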
What type of bug is this?
Locking issue
What subsystems and features are affected?
Background worker
What happened?
If a shutdown is slow, it can cause the scheduler to block while shutting down.
TimescaleDB version affected
2.5.0
PostgreSQL version used
13.5
What operating system did you use?
Ubuntu 20.04.3 LTS
What installation method did you use?
Source
What platform did you run on?
On prem/Self-hosted
Relevant log output and stack trace
strace
gdb
redacted
bt full
How can we reproduce the bug?
Run the following and be patient until the situation occurs.