The reported disk free is stuck at "0 bytes/unknown" if RabbitMQ times out reading "disk free" from the system #5721
Comments
RabbitMQ 3.8 has reached end of life. Subsequent versions had several PRs that modify the free disk space monitor: #4140, #4328, #3895.
Thanks @michaelklishin, I believe that my issue looks a lot like #4140 except that I'm on Linux, not on Windows.
@jperville - I can see how the code can be improved and will submit a PR. It will only be back-ported as far as the 3.9.x series, however.
* Crash when a sub-command times out
* Use atom `NaN` when free space can not be determined

Fixes #5721
When RabbitMQ starts up, the disk monitor tries to get the free space 10 times, with a two-minute interval between attempts, before giving up. @jperville - Unless you see 10 timeout messages in a row (2 minutes apart) in your logs, the disk monitor must have started up. Anyway, #5726 will prevent the "function clause" crash.
We had the same issue with RabbitMQ 3.9.13 on Windows. We'll restart the server next weekend and/or update. Good to see we're not alone ;-)
* Crash when a sub-command times out
* Use atom `NaN` when free space can not be determined

Fixes #5721

Use port to run /bin/sh on `unix` systems to then run `df` command

Update disk monitor tests to not use mocks because we no longer use rabbit_misc:os_cmd/1
Later this week #5726 will have shipped in
Summary
If for some reason a RabbitMQ node fails to read the "disk free" metric at least once, the `/metrics` route of the management console will report `rabbitmq_disk_space_available_bytes: unknown`.

The problem is that `unknown` is not a valid value for the metric and Prometheus will refuse to parse it with `strconv.ParseFloat: parsing "unknown": invalid syntax`, resulting in the target being reported as down by Prometheus.

Workaround
As a workaround, restarting the RabbitMQ node seems to "fix" the problem.
Investigation
I managed to investigate a bit before restarting the broken node.
RabbitMQ version: v3.8.34 (but from what I saw in the source code, it probably affects v3.10.x as well).
Error visible in the Prometheus "targets" page (Prometheus version 0.37.0):
Raw metric scraped by Prometheus:
Interesting part of the broken RabbitMQ node log:
How often did the `df -kP` command time out since the instance booted (18th August, 3 weeks ago):

I checked if the `df` command is still timing out or if it is working right now:

I looked up the code where `unknown` and `disk` appear in the v3.8.34 codebase.

It seems that `unknown` is a hardcoded fallback value which could easily be replaced with `NaN` to at least fix the Prometheus metric.

My theory
I believe that if the `df -kP` command fails for any reason (for example if the system is a bit loaded), then the thread which watches the disk health crashes and is never restarted by Erlang (`exit with reason no function clause matching lists:reverse({error,timeout}) line 147 in context start_error`). The metric then stays "stuck" at an invalid value.

At the minimum, the value for the Prometheus metric should be "NaN" instead of "unknown", since "unknown" is not a valid float.
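To illustrate the "NaN instead of unknown" idea, here is a minimal sketch in Python (not RabbitMQ's Erlang code). The helper name `format_gauge` is hypothetical; the only thing assumed about Prometheus is that its text exposition format accepts `NaN` as a sample value, which it does.

```python
import math


def format_gauge(name, value):
    """Render one sample line, falling back to NaN for non-numeric values."""
    # Hypothetical helper: anything that is not a real number (for example
    # the string "unknown") is reported as NaN so the scraper can parse it.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return f"{name} NaN"
    if isinstance(value, float) and math.isnan(value):
        return f"{name} NaN"
    return f"{name} {value}"


if __name__ == "__main__":
    print(format_gauge("rabbitmq_disk_space_available_bytes", "unknown"))
    print(format_gauge("rabbitmq_disk_space_available_bytes", 1024))
```

With something like this, a failed reading still produces a parseable line, so the target stays up and the gap shows as NaN rather than as a scrape error.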
It would be even nicer if the disk health thread restarted and reported the metric when `df -kP` works again instead of staying stuck forever.
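The recovery behaviour I have in mind could look roughly like the sketch below. It is written in Python only to show the idea and is not how RabbitMQ implements its disk monitor. The function names, the 5 second timeout and the injectable `run` parameter are all assumptions made for the example. A timeout or a failed `df` yields NaN for that one reading, and the next call simply tries again.

```python
import math
import subprocess


def parse_df_available_bytes(output):
    """Return the available bytes from the first data row of `df -kP` output."""
    lines = output.strip().splitlines()
    if len(lines) < 2:
        raise ValueError("unexpected df output")
    # POSIX columns: Filesystem, 1024-blocks, Used, Available, Capacity, Mounted on
    fields = lines[1].split()
    return int(fields[3]) * 1024


def read_disk_free(path="/", timeout_s=5.0, run=subprocess.run):
    """Read free disk space once; return NaN instead of raising on failure."""
    try:
        result = run(
            ["df", "-kP", path],
            capture_output=True,
            text=True,
            timeout=timeout_s,
            check=True,
        )
        return float(parse_df_available_bytes(result.stdout))
    except (subprocess.SubprocessError, OSError, ValueError, IndexError):
        # Timeout, non-zero exit, missing binary or unparseable output:
        # report "not available" for this reading only.
        return math.nan
```

Calling `read_disk_free()` on every monitoring tick means a single slow `df` only affects one sample, and the metric becomes valid again as soon as the command succeeds.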