Alert Manager Rule: add a too-many-files alert #2060
Comments
@vladzcloudius - is that the same rule as in https://github.com/scylladb/siren/issues/9366 ?
Yes. It's a request for the same alert. However the …
There's a question about the mount point: a tiny mount point can create false positives, so we should probably limit the alert to a minimum of a few thousand files.
@vladzcloudius I'm going to use the following expr:
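(A sketch of what such an expression could look like, assuming the per-node file count from the formula below divided by a shard count derived from scylla_reactor_utilization; the mount point and the shard-count metric are illustrative assumptions, not the final rule:)

sum(node_filesystem_files{mountpoint="/var/lib/scylla"} - node_filesystem_files_free{mountpoint="/var/lib/scylla"}) by (instance) / count(scylla_reactor_utilization) by (instance)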
Note that it is not relevant to siren, as they are using their own alerts.
It's their funeral ;)
I did not understand the above comment, neither the 'funeral' remark nor the 'siren practice'.
Let me try to elaborate:
Too many open files can result in ScyllaDB running out of memory and is, in general, an indication of a problem. This patch adds info, warn, and error alerts when there are too many open files per shard. The warnings will be: more than 20K - info, more than 30K - warn, more than 40K - error. Fixes scylladb#2060
Too many open files can result in ScyllaDB running out of memory and is, in general, an indication of a problem. This patch adds info, warn, and error alerts when there are too many open files per shard. The warnings will be: more than 20K - info, more than 30K - warn, more than 40K - error. Fixes #2060 (cherry picked from commit dee0194)
System information
Describe the feature and the current behavior/state.
Scylla and the Linux kernel do not cope well with very large numbers of files: nodes can end up in an out-of-memory situation where the kernel OOM killer forcibly kills scylla or other processes in the system, because when memory runs out the kernel has no ability to evict the fragmented pages occupied by inodes.
TL;DR: to prevent incidents where nodes start crashing due to a huge number of files, we should implement a too-many-files alert.
E.g., to retrieve the number of files in use, the monitoring stack has the following formula:
sum(node_filesystem_files{mountpoint="$mount_point", instance="$node"} - node_filesystem_files_free{mountpoint="$mount_point", instance="$node"}) by ([[by]])
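For instance, with the dashboard variables substituted (the mount point, node address, and grouping label here are illustrative values, not taken from any real deployment), the resulting query would be:

sum(node_filesystem_files{mountpoint="/var/lib/scylla", instance="192.168.1.1:9100"} - node_filesystem_files_free{mountpoint="/var/lib/scylla", instance="192.168.1.1:9100"}) by (instance)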
This is @avikivity's calculation to determine a reasonable threshold:
Let's add an "info" alert for 20K files/shard, "warn" for 30K, "error" for 40K and "critical" for 50K.
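A minimal sketch of what such rules could look like in Prometheus alerting-rule YAML, assuming the files-per-shard expression above; the mount point, the use of scylla_reactor_utilization to count shards, and the alert names are illustrative assumptions, not the shipped rules:

groups:
- name: too_many_files.rules
  rules:
  # "info" level: more than 20K files per shard on the Scylla data mount.
  - alert: tooManyFilesInfo
    expr: >
      sum(node_filesystem_files{mountpoint="/var/lib/scylla"}
      - node_filesystem_files_free{mountpoint="/var/lib/scylla"}) by (instance)
      / count(scylla_reactor_utilization) by (instance) > 20000
    for: 5m
    labels:
      severity: info
    annotations:
      description: "Node {{ $labels.instance }} has more than 20K files per shard"
  # "warn" level: the same expression with a 30000 threshold.
  - alert: tooManyFilesWarn
    expr: >
      sum(node_filesystem_files{mountpoint="/var/lib/scylla"}
      - node_filesystem_files_free{mountpoint="/var/lib/scylla"}) by (instance)
      / count(scylla_reactor_utilization) by (instance) > 30000
    for: 5m
    labels:
      severity: warn
    annotations:
      description: "Node {{ $labels.instance }} has more than 30K files per shard"
  # The "error" (> 40000) and "critical" (> 50000) rules follow the same pattern.

The "error" and "critical" variants would change only the threshold and the severity label, so alert routing can page on the higher severities while the lower ones stay informational.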
Who will benefit from this feature?
Everybody