filesystem fill up time #29

Closed
anarcat opened this Issue Feb 11, 2019 · 9 comments

anarcat commented Feb 11, 2019

One reason why I still have the host stats dashboard is because it has this neat little table of "Filesystem Fill Up Time" which (tries to?) compute the time at which the filesystem will fill up.

I don't think it's working very well because the results are just off here. But it got me thinking about how this could be implemented and whether you'd be interested in adding this to the dashboard...

The hosts stats dashboard uses this formula:

(node_filesystem_size_bytes{job='node',instance='$instance'} - node_filesystem_free_bytes{job='node',instance='$instance'}) / deriv(node_filesystem_free_bytes{job='node',instance='$instance',fstype!='rootfs',mountpoint!~'/(run|var).*',mountpoint!=''}[3d]) > 0

This blog post suggests instead just using the derivative as a base:

(deriv(node_filesystem_free{device=~"/dev/sd.*",instance=~"$node:.*"}[4h]) > 0)

I would suggest using node_filesystem_avail_bytes in any case, as that is the user-visible metric that will detect actual failures in userspace...

I'm not very familiar with Prometheus formulas, so I'm not sure how it works. I suspect it just doesn't, because it gives me negative numbers here (they don't show up) or absurd estimates (293481462547366 year for a 99% full disk), etc.
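A rough sketch of how the first formula can end up with both kinds of result, assuming `deriv()` returns the per-second slope of free space. The function name `naive_fill_time` and all sample numbers are made up for illustration. When a disk is filling up, free space shrinks, so the slope is negative and the ratio is dropped by the `> 0` filter. When a disk barely changes, the slope is close to zero and the ratio becomes enormous.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def naive_fill_time(size_bytes, free_bytes, free_slope):
    """Mimic (size - free) / deriv(free): used bytes over the slope of free bytes.

    free_slope is in bytes per second, so the result is in seconds.
    """
    return (size_bytes - free_bytes) / free_slope


# Hypothetical 100 GB disk that is 99% full (illustrative values only).
size = 100e9
free = 1e9

# Disk filling up at 1 kB/s: free space shrinks, so the slope is negative.
filling = naive_fill_time(size, free, -1000.0)

# Nearly idle disk: free space creeps up by a millionth of a byte per second.
idle = naive_fill_time(size, free, 1e-6)

print(filling)                  # negative, so the "> 0" filter hides it
print(idle / SECONDS_PER_YEAR)  # an absurdly large number of years
```

Note that dividing used space by the slope of free space does not measure the time remaining in either case. An estimate of time remaining would presumably divide the available space by the rate at which it shrinks.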

Yet this could be an interesting addition.

rfrail3 pushed a commit that referenced this issue Feb 13, 2019

Owner

rfrail3 commented Feb 13, 2019

Hi Anarcat,

Thanks for the upgrade proposal, it looks nice.

I've been testing both formulas and the first one seems to work better, but note that it only reports values that are "> 0"; otherwise, the box will be empty.

The second formula doesn't report good values: in my test lab it gives 11ms for a filesystem without changes. In any case, if you want to test it, the corrected formula is:

deriv(node_filesystem_avail_bytes{instance=~"$node:$port",job=~"$job",device!~'rootfs'}[4h]) > 0

Please check the last commit on node-exporter-full.json: it has the new box under "CPU Memory Net Disk". You can move it somewhere else without any problem.

Regards,

Author

anarcat commented Feb 13, 2019

that looks okay, but I still find some strange things going on. take this graph for example:

image

This gives the following table:

Metric Current
/boot 2.39 day
/boot/efi 142257726.77 year

There are many problems here, the first of which of course is the host isn't continuously available (it's a workstation, and it shuts down once in a while). But then the other filesystems (I'm specifically interested in /, /home and /srv) do not show up, because of the > 0 constraint. When I shift the time range in grafana from the default (5 minutes?) to three days, all of a sudden, the estimates show up for the other partitions:

Metric Current
/boot 2.40 day
/home 10.80 week
/srv 14.05 week
/boot/efi 141341662.68 year

Here's the raw unprocessed output from Prometheus doing the query ((node_filesystem_size_bytes{device!~'rootfs'} - node_filesystem_avail_bytes{device!~'rootfs'}) / deriv(node_filesystem_avail_bytes{device!~'rootfs'}[3d])):

Element Value
{device="/dev/mapper/curie--vg-home",fstype="ext4",instance="curie:9100",job="node",mountpoint="/home"} -1996694.4176633644
{device="/dev/mapper/curie--vg-root",fstype="ext4",instance="curie:9100",job="node",mountpoint="/"} -14338319.371650279
{device="/dev/mapper/fedora_crypt",fstype="btrfs",instance="curie:9100",job="node",mountpoint="/srv"} -38339918.49760082

Notice how Prom thinks those numbers are negative. I would also point out that it's somewhat unlikely that (for example) /srv runs out of space in 14 weeks: it gained only 0.4% of space in the last three days, which, if I do a napkin rule-of-three, means it would gain 13% in 14 weeks (0.4 * 14 * 7 / 3), bringing it to 90% disk usage...
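The napkin rule-of-three above, spelled out as a small sketch. The variable names are illustrative; the numbers are the ones quoted in the comment.

```python
# Napkin rule of three: scale the growth seen over 3 days up to 14 weeks.
growth_pct = 0.4        # percent of the disk gained over the observed window
observed_days = 3
horizon_days = 14 * 7   # 14 weeks expressed in days

projected_pct = growth_pct * horizon_days / observed_days
print(f"{projected_pct:.1f}%")  # roughly 13%
```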

So I'm not sure those derivatives are that useful in predicting the future. There might be something fishy going on here... I find it especially strange that the estimates would vary based on the Grafana time range...

anarcat commented Feb 13, 2019

Another example of the estimate failing, on my home server:

Metric Current
/var 45.23 week
/ 1.33 year
/usr 1.51 year
/tmp 2.69 year
/home 118.48 year
/boot 2491947820794.12 year
/srv 117533249733896.45 year

Here's the absolute numbers:

image

And relative:

image

As you can see, /srv is quiiite full and a specific concern I was trying to address ("how much time do I have left with that poor HDD")... the answer (10^14 years, about 10^4 times the age of the universe) is ... rather unlikely. ;)

In fact, maybe we should use the infinity symbol (∞) instead of anything larger than the age of the universe (about 10^10 years)...
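A minimal sketch of that cutoff idea, assuming the estimate is already available as a number of years. The function name `format_estimate` and the constant are hypothetical, and this is not how the dashboard renders values.

```python
AGE_OF_UNIVERSE_YEARS = 1.38e10  # approximate value, used here as the cutoff


def format_estimate(years):
    """Render a fill-up estimate, showing the infinity symbol for implausible ones."""
    if years > AGE_OF_UNIVERSE_YEARS:
        return "∞"
    return f"{years:.2f} year"


print(format_estimate(1.33))                # an ordinary estimate is kept as-is
print(format_estimate(117533249733896.45))  # the /srv estimate is shown as ∞
```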

anarcat commented Feb 13, 2019

Maybe I'm just proving how useless those metrics are, sorry for thinking out loud. :)

rfrail3 commented Feb 13, 2019

Well, it's a fact that the formula doesn't work as expected. As the original was made by Robust Perception, maybe @brian-brazil or @Conorbro can say something about it and help us?

rfrail3 commented Feb 13, 2019

Could you check whether the predict_linear function returns results in your case:
https://www.robustperception.io/reduce-noise-from-disk-space-alerts#more-614

predict_linear(node_filesystem_free{instance=~"$node:$port",job=~"$job",device!~'rootfs'}[1h], 4 * 3600) < 0
I'm testing it, but I don't get any results in my setup... because I have an old version...

anarcat commented Feb 13, 2019

from what i understand, predict_linear tries to find the value at a specific time. we're looking for the opposite: the time for a specific value (namely, "zero space left")...
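To make the distinction concrete, here is a small sketch of both questions using a plain least-squares line fit, which is the kind of simple linear regression `predict_linear` is documented to use. The helper names and the sample data are assumptions for illustration, not Prometheus code.

```python
def linear_fit(samples):
    """Least-squares slope and intercept for (time_s, value) pairs."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = cov / var
    return slope, mean_v - slope * mean_t


def value_at(samples, t):
    """predict_linear-style question: what is the value at time t?"""
    slope, intercept = linear_fit(samples)
    return slope * t + intercept


def time_until_zero(samples, now):
    """Inverse question: seconds from `now` until the fitted line hits zero.

    Returns None when the value is flat or growing, i.e. it never reaches zero.
    """
    slope, intercept = linear_fit(samples)
    if slope >= 0:
        return None
    return -intercept / slope - now


# Hypothetical samples: 1000 bytes available, shrinking by 10 bytes per second.
samples = [(t, 1000 - 10 * t) for t in range(0, 50, 10)]
print(value_at(samples, 60))         # value predicted at t = 60 s
print(time_until_zero(samples, 40))  # seconds left, counted from t = 40 s
```

Returning None for a flat or growing series avoids the negative and near-infinite estimates seen earlier in the thread, at the cost of showing nothing for those filesystems.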

rfrail3 commented Feb 19, 2019

Did you finally find a working solution? If the dashboard isn't reliable, I think it's better to remove it.

anarcat commented Feb 19, 2019

i haven't, unfortunately, and i agree. :/

@rfrail3 rfrail3 closed this in 59adbc8 Feb 20, 2019
