Alert Manager Rule: add a too-many-files alert #2060

Closed
vladzcloudius opened this issue Sep 13, 2023 · 7 comments · Fixed by #2066
Labels
enhancement New feature or request

Comments

@vladzcloudius
Contributor

System information

  • Are you willing to contribute it (Yes/No): No

Describe the feature and the current behavior/state.
Neither Scylla nor the Linux kernel copes well with a very large number of files. A node can run out of memory because the kernel cannot evict fragmented pages occupied by inodes, and the kernel out-of-memory killer then forcibly kills Scylla or other processes on the system.

TL;DR: to prevent incidents where nodes start crashing due to a huge number of files, we should implement a too-many-files alert.
E.g., to retrieve the number of files, we have the following formula in monitoring:
sum(node_filesystem_files{mountpoint="$mount_point", instance="$node"}- node_filesystem_files_free{mountpoint="$mount_point", instance="$node"}) by ([[by]])

This is @avikivity 's calculation to determine a reasonable threshold:

    50,000 * nr_vcpus (to start with)
    ICS generates 8 files per gigabyte. i3en has 500 GB/vcpu, so 4,000 files/vcpu. LCS is 6x higher, so 24,000 files/vcpu, but we rarely have such huge LCS tables. I think 50,000/vcpu is a good start, but we may want to adjust it later.

Let's add an "info" alert for 20K files/shard, "warn" for 30K, "error" for 40K and "critical" for 50K.
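The arithmetic behind the quoted estimate and the proposed severity levels can be sketched as follows. This is only an illustration: the constant and function names are hypothetical, and the choice of a strict "more than" comparison is an assumption, not something this request specifies.

```python
# Figures taken from the quoted calculation; names are hypothetical.
ICS_FILES_PER_GB = 8   # ICS generates 8 files per gigabyte
GB_PER_VCPU = 500      # i3en has 500 GB per vCPU
LCS_MULTIPLIER = 6     # LCS is 6x higher than ICS

ics_files_per_vcpu = ICS_FILES_PER_GB * GB_PER_VCPU       # 4,000 files/vCPU
lcs_files_per_vcpu = ics_files_per_vcpu * LCS_MULTIPLIER  # 24,000 files/vCPU

# Proposed severity levels in files per shard, highest first.
SEVERITY_THRESHOLDS = [
    (50_000, "critical"),
    (40_000, "error"),
    (30_000, "warn"),
    (20_000, "info"),
]


def severity(files_per_shard):
    """Return the highest severity whose threshold is exceeded, or None.

    Assumption: an alert fires only when the count is strictly above
    the threshold.
    """
    for threshold, name in SEVERITY_THRESHOLDS:
        if files_per_shard > threshold:
            return name
    return None
```

With these numbers, an LCS-heavy node at the estimated 24,000 files per vCPU would already raise an "info" alert, while a typical ICS node at 4,000 would raise none.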

Who will benefit from this feature?
Everybody

@mykaul
Contributor

mykaul commented Sep 13, 2023

@vladzcloudius - is that the same rule as in https://github.com/scylladb/siren/issues/9366 ?

@vladzcloudius
Contributor Author

vladzcloudius commented Sep 13, 2023

    @vladzcloudius - is that the same rule as in scylladb/siren#9366 ?

Yes. It's a request for the same alert. However, the siren implementation has a lot of hard-coded values, and I expect a more generic implementation here.

@amnonh
Collaborator

amnonh commented Sep 16, 2023

There's a question about the mount point: a tiny mount point can create false positives, so we should probably limit the alert to a minimum of a few thousand files.

@amnonh
Collaborator

amnonh commented Sep 19, 2023

@vladzcloudius I'm going to use the following expr:

(node_filesystem_files - node_filesystem_files_free) / on(instance) group_left count(scylla_reactor_cpu_busy_ms) by (instance)

Note that it is not relevant to siren as they are using their own alerts
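To make the matching in that expression concrete, here is a rough Python sketch of what it computes. It assumes that `scylla_reactor_cpu_busy_ms` exposes one series per shard, so counting its series per instance gives the shard count. The function name and the dictionary layout are hypothetical and only stand in for Prometheus samples.

```python
from collections import Counter


def files_per_shard(fs_samples, shard_samples):
    """Mimic: (files - files_free) / on(instance) group_left count(...) by (instance).

    fs_samples: dicts with "instance", "mountpoint", "files", "files_free".
    shard_samples: dicts with "instance", one entry per
        scylla_reactor_cpu_busy_ms series (assumed to be one per shard).
    """
    # count(scylla_reactor_cpu_busy_ms) by (instance)
    shards = Counter(sample["instance"] for sample in shard_samples)

    result = {}
    for fs in fs_samples:
        shard_count = shards.get(fs["instance"])
        if not shard_count:
            # No matching series on the right-hand side: the left-hand
            # series is dropped, as in a PromQL binary operation.
            continue
        used_inodes = fs["files"] - fs["files_free"]
        # group_left keeps the left-hand labels (instance and mountpoint).
        result[(fs["instance"], fs["mountpoint"])] = used_inodes / shard_count
    return result


# Made-up sample data for illustration only.
example_fs = [
    {"instance": "n1", "mountpoint": "/var/lib/scylla",
     "files": 1_000_000, "files_free": 840_000},
    {"instance": "n2", "mountpoint": "/var/lib/scylla",
     "files": 1_000_000, "files_free": 999_000},
]
example_shards = [{"instance": "n1", "shard": str(i)} for i in range(8)]
example_result = files_per_shard(example_fs, example_shards)
```

Because the division is done per filesystem series, every mount point on an instance gets its own per-shard value, which is why a very small mount point could matter for false positives.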

@vladzcloudius
Contributor Author

vladzcloudius commented Sep 19, 2023

    @vladzcloudius I'm going to use the following expr:

    (node_filesystem_files - node_filesystem_files_free) / on(instance) group_left count(scylla_reactor_cpu_busy_ms) by (instance)

    Note that it is not relevant to siren as they are using their own alerts

It's their funeral ;)
@mykaul FYI ^.
You probably want to reconsider the siren practice mentioned above.

@mykaul
Contributor

mykaul commented Sep 19, 2023

I did not understand the above comment. Not the 'funeral' remark nor the 'siren practice'

@vladzcloudius
Contributor Author

vladzcloudius commented Sep 19, 2023

    I did not understand the above comment. Not the 'funeral' remark nor the 'siren practice'

Let me try to elaborate:

  • "It's their funeral" == "They are shooting themselves in the foot."
  • The 'siren practice' of not using the Alerts from this repo (that Amnon has mentioned) is very problematic IMO - there is no reason whatsoever to not use these generic rules (as a base) and not contribute new generic rules here if needed. We contribute new things here for a reason.

amnonh added a commit to amnonh/scylla-grafana-monitoring that referenced this issue Sep 19, 2023
Too many open files can result in ScyllaDB running out of memory and
are in general an indication of a problem.

This patch adds info, warn and error alerts if there are too many open
files per shard.

The warnings will be
More than 20k - info
More than 30k - warn
More than 40k - error

Fixes scylladb#2060
amnonh added a commit that referenced this issue Sep 19, 2023
Too many open files can result in ScyllaDB running out of memory and
are in general an indication of a problem.

This patch adds info, warn and error alerts if there are too many open
files per shard.

The warnings will be
More than 20k - info
More than 30k - warn
More than 40k - error

Fixes #2060

(cherry picked from commit dee0194)
amnonh added this to the Monitoring 4.6 milestone Oct 12, 2023
3 participants