The reported disk free is stuck to "0 bytes/unknown" if RabbitMQ timeouts reading "disk free" from the system #5721

jperville · 2022-09-06T08:33:45Z

Summary

If for some reason a RabbitMQ node fails to read the "disk free" metric at least once :

the management console's Overview tab will report that the node uses "Disk Space: 0 bytes"
and the /metrics route of the management console will report rabbitmq_disk_space_available_bytes: unknown

The problem is that unknown is not a valid value for the metric and Prometheus will refuse to parse it with strconv.ParseFloat: parsing "unknown": invalid syntax, resulting in the prometheus target being down for Prometheus.

Workaround

As a workaround, restarting the RabbitMQ node seem to "fix" the problem.

Investigation

I managed to investigate a bit before restarting the broken node.

Version of rabbitmq : v3.8.34 (but from what I saw in the source code it probably affects v3.10.x also).

Error visible in the prometheus "targets" page (prometheus version 0.37.0):

prometheus: strconv.ParseFloat: parsing "unknown": invalid syntax

Raw metric scraped by prometheus :

[root@vm1.xxx ~]# curl -sSL http://10.236.96.113:15692/metrics | grep unknown
rabbitmq_disk_space_available_bytes unknown

Interesting part of the broken RabbitMQ node log:

2022-09-04 22:20:51.935 [error] <0.17002.26> Command timed out: '/usr/bin/df -kP /var/lib/rabbitmq/mnesia/rabbit@rabbitmq-0.rabbitmq-cluster'
2022-09-04 22:20:52.189 [error] <0.17002.26> CRASH REPORT Process <0.17002.26> with 0 neighbours crashed with reason: no function clause matching lists:reverse({error,timeout}) line 147
2022-09-04 22:20:52.190 [error] <0.438.0> Supervisor rabbit_disk_monitor_sup had child rabbit_disk_monitor started with rabbit_disk_monitor:start_link(50000000) at <0.439.0> exit with reason no function clause matching lists:reverse({error,timeout}) line 147 in context start_error

How often did the df -kP command time out since the instance booted (18th August, 3 weeks ago) :

[root@vm1.xxx ~]# kubectl logs -f sts/rabbitmq -n backends | grep --color df
2022-09-04 22:20:39.998 [error] <0.439.0> Command timed out: '/usr/bin/df -kP /var/lib/rabbitmq/mnesia/rabbit@rabbitmq-0.rabbitmq-cluster'
2022-09-04 22:20:51.935 [error] <0.17002.26> Command timed out: '/usr/bin/df -kP /var/lib/rabbitmq/mnesia/rabbit@rabbitmq-0.rabbitmq-cluster'

I checked if the df command is still timing out or if it is working right now:

[root@vm1.xxx ~]# time kubectl exec -ti sts/rabbitmq -n backends -- /bin/bash -ce '/usr/bin/df -kP /var/lib/rabbitmq/mnesia/rabbit@rabbitmq-0.rabbitmq-cluster'
Filesystem         1024-blocks    Used Available Capacity Mounted on
/dev/mapper/vg-opt    41153760 6967688  32072536      18% /var/lib/rabbitmq

real    0m0.157s
user    0m0.165s
sys     0m0.017s

I looked up the code where unknown and disk appear in the v3.8.34 codebase.
It seems that unknown is a fallback hardcoded value which could easily be replaced with NaN to at least fix the prometheus metric.

julien:rabbitmq-server(139268m|tags/v3.8.34^0) $ fgrep -r unknown * --color | grep disk
deps/rabbit/src/rabbit_disk_monitor.erl:-spec get_disk_free() -> (integer() | 'unknown').
deps/rabbit/src/rabbit_disk_monitor.erl:    safe_ets_lookup(disk_free, unknown).
deps/rabbit/test/unit_disk_monitor_SUITE.erl:        DiskFree =/= unknown
deps/rabbit/test/unit_disk_monitor_mocks_SUITE.erl:    unknown = rabbit_disk_monitor:get_disk_free(),

My theory

I believe that if the df -kP command fails for any reason (for example if the system is a bit loaded), then the thread which watches the disk health crashes and is never restarted by erlang (exit with reason no function clause matching lists:reverse({error,timeout}) line 147 in context start_error). Then the metric becomes "stuck" to invalid.

At the minimum the value for the prometheus metric should be "NaN" instead of "unknown" since "unknown" is not a valid float.
It would be even nicer if the disk health thread restarted and would report the metric when df -kP works again instead of staying stuck forever.

The text was updated successfully, but these errors were encountered:

michaelklishin · 2022-09-06T08:40:51Z

RabbitMQ 3.8 has reached end of life.

Subsequent versions had several PRs that modify the free disk space monitor: #4140, #4328, #3895.

jperville · 2022-09-06T09:28:49Z

Thanks @michaelklishin , I believe that my issue looks a lot like #4140 except that I'm on Linux not on Windows.
I will try version 3.10.7 or higher, if I am lucky my system will be loaded enough that the bug will reproduce (or not).

lukebakken · 2022-09-07T13:00:03Z

then the thread which watches the disk health crashes and is never restarted by erlang

@jperville - rabbit_disk_monitor_sup supervisor will restart the disk monitor process. We don't have enough logging from your environment to confirm this.

I can see how the code can be improved and will submit a PR. It will only be back-ported as far as the 3.9.x series, however.

* Crash when a sub-command times out * Use atom `NaN` when free space can not be determined Fixes #5721

lukebakken · 2022-09-07T20:22:49Z

When RabbitMQ starts up, the disk monitor tries to get the free space 10 times, with a two minute interval between times, before giving up. @jperville - Unless you see 10 timeout messages in a row (2 minutes apart) in your logs, the disk monitor must have started up.

Anyway, #5726 will prevent the "function clause" crash.

robbert-mint · 2022-09-08T16:10:48Z

we had the same issue, RabbitMQ 3.9.13 on Windows. We'll restart the server next weekend and/or update. Good to see we're not alone ;-)

* Crash when a sub-command times out * Use atom `NaN` when free space can not be determined Fixes #5721 Use port to run /bin/sh on `unix` systems to then run `df` command Update disk monitor tests to not use mocks because we no longer use rabbit_misc:os_cmd/1

* Crash when a sub-command times out * Use atom `NaN` when free space can not be determined Fixes #5721 Use port to run /bin/sh on `unix` systems to then run `df` command Update disk monitor tests to not use mocks because we no longer use rabbit_misc:os_cmd/1 (cherry picked from commit a9e6fca)

michaelklishin · 2022-12-12T22:24:02Z

Later this week #5726 will have shipped in

3.11.0
3.10.8
3.9.27

michaelklishin closed this as completed Sep 6, 2022

lukebakken self-assigned this Sep 7, 2022

lukebakken reopened this Sep 7, 2022

lukebakken added a commit that referenced this issue Sep 7, 2022

Disk monitor strikes back!

10a3b54

* Crash when a sub-command times out * Use atom `NaN` when free space can not be determined Fixes #5721

lukebakken mentioned this issue Sep 7, 2022

Disk monitor strikes back! #5726

Merged

michaelklishin closed this as completed in #5726 Sep 9, 2022

lukebakken mentioned this issue Sep 13, 2022

RabbitMQ 3.10.7 - Windows - disk check crashing #5768

Closed

michaelklishin added this to the 3.10.8 milestone Dec 12, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

The reported disk free is stuck to "0 bytes/unknown" if RabbitMQ timeouts reading "disk free" from the system #5721

The reported disk free is stuck to "0 bytes/unknown" if RabbitMQ timeouts reading "disk free" from the system #5721

jperville commented Sep 6, 2022 •

edited

michaelklishin commented Sep 6, 2022

jperville commented Sep 6, 2022

lukebakken commented Sep 7, 2022 •

edited

lukebakken commented Sep 7, 2022

robbert-mint commented Sep 8, 2022

michaelklishin commented Dec 12, 2022

The reported disk free is stuck to "0 bytes/unknown" if RabbitMQ timeouts reading "disk free" from the system #5721

The reported disk free is stuck to "0 bytes/unknown" if RabbitMQ timeouts reading "disk free" from the system #5721

Comments

jperville commented Sep 6, 2022 • edited

Summary

Workaround

Investigation

My theory

michaelklishin commented Sep 6, 2022

jperville commented Sep 6, 2022

lukebakken commented Sep 7, 2022 • edited

lukebakken commented Sep 7, 2022

robbert-mint commented Sep 8, 2022

michaelklishin commented Dec 12, 2022

jperville commented Sep 6, 2022 •

edited

lukebakken commented Sep 7, 2022 •

edited