Alert Manager Rule: add a too-many-files alert #2060

vladzcloudius · 2023-09-13T13:42:28Z

System information

Are you willing to contribute it (Yes/No): No

Describe the feature and the current behavior/state.
Neither Scylla nor the Linux kernel copes well with a very large number of files. Nodes can end up in an out-of-memory situation in which the kernel OOM killer forcibly kills scylla or other processes, because the kernel runs out of memory and cannot evict the fragmented pages occupied by inodes.

TL;DR: to prevent incidents where nodes start crashing due to a huge number of files, we should implement a too-many-files alert.
E.g., to retrieve the number of files, monitoring uses the following formula:
sum(node_filesystem_files{mountpoint="$mount_point", instance="$node"}- node_filesystem_files_free{mountpoint="$mount_point", instance="$node"}) by ([[by]])
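
For a quick check on a single node without Prometheus, the same quantity (total inodes minus free inodes) can be read with `os.statvfs`. This is a minimal sketch; the function name `used_inodes` and the example path are illustrative and not part of the monitoring stack.

```python
import os


def used_inodes(mount_point: str) -> int:
    """Return the number of inodes in use on the filesystem containing mount_point."""
    st = os.statvfs(mount_point)
    # f_files is the total inode count and f_ffree the free inode count,
    # the same pair that the formula above subtracts.
    return st.f_files - st.f_ffree


if __name__ == "__main__":
    # "/" is only an example path; point it at the data mount point of interest.
    print(used_inodes("/"))
```

Some filesystems report an inode total of zero, in which case this check (like the formula) is not meaningful.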

This is @avikivity 's calculation to determine a reasonable threshold:

    50,000 * nr_vcpus (to start with)
    ICS generates 8 files per gigabyte. i3en has 500 GB/vcpu, so 4,000 files/vcpu. LCS is 6x higher, so 24,000 files/vcpu, but we rarely have such huge LCS tables. I think 50,000/vcpu is a good start, but we may want to adjust it later.
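
The arithmetic behind this estimate can be reproduced in a few lines. This is a sketch only: the constants are taken from the calculation above, while the helper names and the 16-vCPU example node are hypothetical.

```python
ICS_FILES_PER_GB = 8          # from the estimate above
I3EN_GB_PER_VCPU = 500        # from the estimate above
LCS_FACTOR = 6                # LCS is estimated at 6x the ICS file count
THRESHOLD_PER_VCPU = 50_000   # proposed starting threshold


def files_per_vcpu(files_per_gb: int, gb_per_vcpu: int, factor: int = 1) -> int:
    """Expected number of files per vCPU for a given compaction-strategy factor."""
    return files_per_gb * gb_per_vcpu * factor


def node_threshold(nr_vcpus: int) -> int:
    """Node-wide threshold: 50,000 files per vCPU."""
    return THRESHOLD_PER_VCPU * nr_vcpus


ics = files_per_vcpu(ICS_FILES_PER_GB, I3EN_GB_PER_VCPU)              # 4,000
lcs = files_per_vcpu(ICS_FILES_PER_GB, I3EN_GB_PER_VCPU, LCS_FACTOR)  # 24,000

if __name__ == "__main__":
    print(f"ICS: {ics} files/vcpu, LCS: {lcs} files/vcpu")
    # 16 vCPUs is a hypothetical node size used only for illustration.
    print(f"threshold for 16 vcpus: {node_threshold(16)}")
```

With these numbers the proposed threshold leaves roughly 2x headroom over the LCS estimate.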

Let's add an "info" alert for 20K files/shard, "warn" for 30K, "error" for 40K and "critical" for 50K.
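
To make the proposed levels concrete, here is a small sketch that maps a files-per-shard value to a severity. The function name and the strict "more than" comparison are assumptions; the thresholds are the ones proposed above.

```python
from typing import Optional

# (files per shard, severity), highest threshold first.
SEVERITY_LEVELS = [
    (50_000, "critical"),
    (40_000, "error"),
    (30_000, "warn"),
    (20_000, "info"),
]


def severity(files_per_shard: float) -> Optional[str]:
    """Return the highest severity whose threshold is exceeded, or None."""
    for threshold, name in SEVERITY_LEVELS:
        # Assumption: an alert fires only when the value is strictly above the threshold.
        if files_per_shard > threshold:
            return name
    return None


if __name__ == "__main__":
    for value in (10_000, 25_000, 35_000, 45_000, 60_000):
        print(value, severity(value))
```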

Who will benefit with this feature?
Everybody


mykaul · 2023-09-13T14:07:37Z

@vladzcloudius - is that the same rule as in https://github.com/scylladb/siren/issues/9366 ?

vladzcloudius · 2023-09-13T14:18:43Z

> @vladzcloudius - is that the same rule as in scylladb/siren#9366 ?

Yes. It's a request for the same alert. However, the siren implementation has a lot of hard-coded values, and I expect a more generic implementation here.

amnonh · 2023-09-16T14:57:38Z

There's a question about the mount point: a tiny mount point can create false positives, so we should probably apply the alert only above a minimum of a few thousand files.

amnonh · 2023-09-19T10:47:43Z

@vladzcloudius I'm going to use the following expr:

(node_filesystem_files - node_filesystem_files_free) / on(instance) group_left count(scylla_reactor_cpu_busy_ms) by (instance)

Note that it is not relevant to siren as they are using their own alerts
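
For readers less familiar with PromQL vector matching, the sketch below mimics what the expression computes: used files per mount point, divided by the number of shards on the same instance. It assumes that `scylla_reactor_cpu_busy_ms` yields one series per shard, which is what the `count(...) by (instance)` relies on; the function name and the sample data are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

MountKey = Tuple[str, str]  # (instance, mountpoint)


def files_per_shard(
    files: Dict[MountKey, float],
    files_free: Dict[MountKey, float],
    reactor_series: Iterable[Tuple[str, str]],  # (instance, shard) label pairs
) -> Dict[MountKey, float]:
    """Used files per shard for every (instance, mountpoint) pair."""
    # count(...) by (instance): number of per-shard series on each instance.
    shards = Counter(instance for instance, _shard in reactor_series)
    result = {}
    for key, total in files.items():
        instance, _mountpoint = key
        # Series without a match on either side are dropped, as in PromQL.
        if key not in files_free or instance not in shards:
            continue
        # on(instance) group_left: many mount points match one shard count.
        result[key] = (total - files_free[key]) / shards[instance]
    return result


if __name__ == "__main__":
    files = {("node1", "/var/lib/scylla"): 1_000_000, ("node1", "/"): 500_000}
    free = {("node1", "/var/lib/scylla"): 840_000, ("node1", "/"): 480_000}
    reactor = [("node1", str(shard)) for shard in range(8)]
    print(files_per_shard(files, free, reactor))
```

Because the division happens per mount point, every filesystem on a node is compared against the same per-shard thresholds.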

vladzcloudius · 2023-09-19T14:17:52Z

> @vladzcloudius I'm going to use the following expr:
> (node_filesystem_files - node_filesystem_files_free) / on(instance) group_left count(scylla_reactor_cpu_busy_ms) by (instance)
> Note that it is not relevant to siren as they are using their own alerts

It's their funeral ;)
@mykaul FYI ^.
You probably want to reconsider the siren practice mentioned above.

mykaul · 2023-09-19T14:20:14Z

I did not understand the above comment. Not the 'funeral' remark nor the 'siren practice'

vladzcloudius · 2023-09-19T14:32:01Z

> I did not understand the above comment. Not the 'funeral' remark nor the 'siren practice'

Let me try to elaborate:

"It's their funeral" == "They are shooting themselves in the foot."
The 'siren practice' of not using the alerts from this repo (that Amnon has mentioned) is very problematic IMO: there is no reason whatsoever to avoid using these generic rules (as a base) or to avoid contributing new generic rules here when needed. We contribute new things here for a reason.

Too many open files can cause ScyllaDB to run out of memory and are, in general, an indication of a problem. This patch adds info, warn and error alerts if there are too many open files per shard. The alerts are: more than 20k - info; more than 30k - warn; more than 40k - error. Fixes scylladb#2060

Too many open files can cause ScyllaDB to run out of memory and are, in general, an indication of a problem. This patch adds info, warn and error alerts if there are too many open files per shard. The alerts are: more than 20k - info; more than 30k - warn; more than 40k - error. Fixes #2060 (cherry picked from commit dee0194)

vladzcloudius added the enhancement New feature or request label Sep 13, 2023

vladzcloudius mentioned this issue Sep 13, 2023

Alert Manager Rule: add a too-many-files alert scylladb/scylla-manager#3503

Closed

amnonh mentioned this issue Sep 19, 2023

prometheus.rules.yml: Add a warning for too many open files per shard #2066

Merged

amnonh closed this as completed in #2066 Sep 19, 2023

amnonh added this to the Monitoring 4.6 milestone Oct 12, 2023
