Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The reported disk free is stuck to "0 bytes/unknown" if RabbitMQ timeouts reading "disk free" from the system #5721

Closed
jperville opened this issue Sep 6, 2022 · 6 comments · Fixed by #5726
Assignees
Milestone

Comments

@jperville
Copy link

jperville commented Sep 6, 2022

Summary

If for some reason a RabbitMQ node fails to read the "disk free" metric at least once :

  • the management console's Overview tab will report that the node uses "Disk Space: 0 bytes"
  • and the /metrics route of the management console will report rabbitmq_disk_space_available_bytes: unknown

The problem is that unknown is not a valid value for the metric and Prometheus will refuse to parse it with strconv.ParseFloat: parsing "unknown": invalid syntax, resulting in the prometheus target being down for Prometheus.

Workaround

As a workaround, restarting the RabbitMQ node seem to "fix" the problem.

Investigation

I managed to investigate a bit before restarting the broken node.

Version of rabbitmq : v3.8.34 (but from what I saw in the source code it probably affects v3.10.x also).

Error visible in the prometheus "targets" page (prometheus version 0.37.0):

prometheus: strconv.ParseFloat: parsing "unknown": invalid syntax

Raw metric scraped by prometheus :

[root@vm1.xxx ~]# curl -sSL http://10.236.96.113:15692/metrics | grep unknown
rabbitmq_disk_space_available_bytes unknown

Interesting part of the broken RabbitMQ node log:

2022-09-04 22:20:51.935 [error] <0.17002.26> Command timed out: '/usr/bin/df -kP /var/lib/rabbitmq/mnesia/rabbit@rabbitmq-0.rabbitmq-cluster'
2022-09-04 22:20:52.189 [error] <0.17002.26> CRASH REPORT Process <0.17002.26> with 0 neighbours crashed with reason: no function clause matching lists:reverse({error,timeout}) line 147
2022-09-04 22:20:52.190 [error] <0.438.0> Supervisor rabbit_disk_monitor_sup had child rabbit_disk_monitor started with rabbit_disk_monitor:start_link(50000000) at <0.439.0> exit with reason no function clause matching lists:reverse({error,timeout}) line 147 in context start_error

How often did the df -kP command time out since the instance booted (18th August, 3 weeks ago) :

[root@vm1.xxx ~]# kubectl logs -f sts/rabbitmq -n backends | grep --color df
2022-09-04 22:20:39.998 [error] <0.439.0> Command timed out: '/usr/bin/df -kP /var/lib/rabbitmq/mnesia/rabbit@rabbitmq-0.rabbitmq-cluster'
2022-09-04 22:20:51.935 [error] <0.17002.26> Command timed out: '/usr/bin/df -kP /var/lib/rabbitmq/mnesia/rabbit@rabbitmq-0.rabbitmq-cluster'

I checked if the df command is still timing out or if it is working right now:

[root@vm1.xxx ~]# time kubectl exec -ti sts/rabbitmq -n backends -- /bin/bash -ce '/usr/bin/df -kP /var/lib/rabbitmq/mnesia/rabbit@rabbitmq-0.rabbitmq-cluster'
Filesystem         1024-blocks    Used Available Capacity Mounted on
/dev/mapper/vg-opt    41153760 6967688  32072536      18% /var/lib/rabbitmq

real    0m0.157s
user    0m0.165s
sys     0m0.017s

I looked up the code where unknown and disk appear in the v3.8.34 codebase.
It seems that unknown is a fallback hardcoded value which could easily be replaced with NaN to at least fix the prometheus metric.

julien:rabbitmq-server(139268m|tags/v3.8.34^0) $ fgrep -r unknown * --color | grep disk
deps/rabbit/src/rabbit_disk_monitor.erl:-spec get_disk_free() -> (integer() | 'unknown').
deps/rabbit/src/rabbit_disk_monitor.erl:    safe_ets_lookup(disk_free, unknown).
deps/rabbit/test/unit_disk_monitor_SUITE.erl:        DiskFree =/= unknown
deps/rabbit/test/unit_disk_monitor_mocks_SUITE.erl:    unknown = rabbit_disk_monitor:get_disk_free(),

My theory

I believe that if the df -kP command fails for any reason (for example if the system is a bit loaded), then the thread which watches the disk health crashes and is never restarted by erlang (exit with reason no function clause matching lists:reverse({error,timeout}) line 147 in context start_error). Then the metric becomes "stuck" to invalid.

At the minimum the value for the prometheus metric should be "NaN" instead of "unknown" since "unknown" is not a valid float.
It would be even nicer if the disk health thread restarted and would report the metric when df -kP works again instead of staying stuck forever.

@michaelklishin
Copy link
Member

RabbitMQ 3.8 has reached end of life.

Subsequent versions had several PRs that modify the free disk space monitor: #4140, #4328, #3895.

@jperville
Copy link
Author

Thanks @michaelklishin , I believe that my issue looks a lot like #4140 except that I'm on Linux not on Windows.
I will try version 3.10.7 or higher, if I am lucky my system will be loaded enough that the bug will reproduce (or not).

@lukebakken lukebakken self-assigned this Sep 7, 2022
@lukebakken lukebakken reopened this Sep 7, 2022
@lukebakken
Copy link
Collaborator

lukebakken commented Sep 7, 2022

then the thread which watches the disk health crashes and is never restarted by erlang

@jperville - rabbit_disk_monitor_sup supervisor will restart the disk monitor process. We don't have enough logging from your environment to confirm this.

I can see how the code can be improved and will submit a PR. It will only be back-ported as far as the 3.9.x series, however.

lukebakken added a commit that referenced this issue Sep 7, 2022
* Crash when a sub-command times out
* Use atom `NaN` when free space can not be determined

Fixes #5721
@lukebakken
Copy link
Collaborator

When RabbitMQ starts up, the disk monitor tries to get the free space 10 times, with a two minute interval between times, before giving up. @jperville - Unless you see 10 timeout messages in a row (2 minutes apart) in your logs, the disk monitor must have started up.

Anyway, #5726 will prevent the "function clause" crash.

@robbert-mint
Copy link

we had the same issue, RabbitMQ 3.9.13 on Windows. We'll restart the server next weekend and/or update. Good to see we're not alone ;-)

lukebakken added a commit that referenced this issue Sep 8, 2022
* Crash when a sub-command times out
* Use atom `NaN` when free space can not be determined

Fixes #5721

Use port to run /bin/sh on `unix` systems to then run `df` command

Update disk monitor tests to not use mocks because we no longer use rabbit_misc:os_cmd/1
lukebakken added a commit that referenced this issue Sep 8, 2022
* Crash when a sub-command times out
* Use atom `NaN` when free space can not be determined

Fixes #5721

Use port to run /bin/sh on `unix` systems to then run `df` command

Update disk monitor tests to not use mocks because we no longer use rabbit_misc:os_cmd/1
mergify bot pushed a commit that referenced this issue Sep 9, 2022
* Crash when a sub-command times out
* Use atom `NaN` when free space can not be determined

Fixes #5721

Use port to run /bin/sh on `unix` systems to then run `df` command

Update disk monitor tests to not use mocks because we no longer use rabbit_misc:os_cmd/1

(cherry picked from commit a9e6fca)
mergify bot pushed a commit that referenced this issue Sep 9, 2022
* Crash when a sub-command times out
* Use atom `NaN` when free space can not be determined

Fixes #5721

Use port to run /bin/sh on `unix` systems to then run `df` command

Update disk monitor tests to not use mocks because we no longer use rabbit_misc:os_cmd/1

(cherry picked from commit a9e6fca)
@michaelklishin michaelklishin added this to the 3.10.8 milestone Dec 12, 2022
@michaelklishin
Copy link
Member

Later this week #5726 will have shipped in

  • 3.11.0
  • 3.10.8
  • 3.9.27

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants