Node filesystem alerting is not aligned with kubelet eviction thresholds #3474

Closed
gdemonet opened this issue Aug 4, 2021 · 1 comment
Labels: kind:bug (Something isn't working), topic:monitoring (Everything related to monitoring of services in a running cluster)

gdemonet commented Aug 4, 2021

Component: kubernetes, alerting

What happened:

When trying to raise NodeFilesystemAlmostOutOfSpace alerts manually (by creating arbitrarily large files in the root FS), we realized that kubelet triggered the eviction process long before we reached the 5% alert threshold.

What was expected:

To get an alert well before kubelet starts evicting pods.

Resolution proposal:

Align the alert thresholds with the kubelet defaults (which we don't want to change for now), so that alerts fire before eviction starts:

  • NodeFilesystemAlmostOutOfSpace (kubelet uses nodefs.available<10%):
    • warning: less than 20% space left (previously 5%)
    • critical: less than 12% space left (previously 3%)
  • NodeFilesystemAlmostOutOfFiles (kubelet uses nodefs.inodesFree<5%):
    • warning: less than 15% inodes left (previously 5%)
    • critical: less than 8% inodes left (previously 3%)
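
The ordering constraint behind this proposal can be sketched as a small check. This is only an illustration, not part of the project: the dictionary and function names are hypothetical, and the numbers are the percentages quoted in this issue. It assumes an alert is useful only if its "percent left" threshold is strictly higher than the matching kubelet eviction threshold.

```python
# Kubelet default eviction thresholds quoted in this issue (percent left).
KUBELET_EVICTION = {"space": 10.0, "inodes": 5.0}

# Alert thresholds before and after the proposal (percent left).
# These names are hypothetical; only the numbers come from the issue.
PREVIOUS_ALERTS = {
    "space": {"warning": 5.0, "critical": 3.0},
    "inodes": {"warning": 5.0, "critical": 3.0},
}
PROPOSED_ALERTS = {
    "space": {"warning": 20.0, "critical": 12.0},
    "inodes": {"warning": 15.0, "critical": 8.0},
}


def alerts_precede_eviction(alerts, eviction):
    """Return {resource: bool}.

    A resource maps to True when, as free capacity shrinks, the warning
    fires first, then the critical alert, and only then kubelet eviction.
    """
    result = {}
    for resource, levels in alerts.items():
        limit = eviction[resource]
        result[resource] = levels["warning"] > levels["critical"] > limit
    return result


if __name__ == "__main__":
    print("previous:", alerts_precede_eviction(PREVIOUS_ALERTS, KUBELET_EVICTION))
    print("proposed:", alerts_precede_eviction(PROPOSED_ALERTS, KUBELET_EVICTION))
```

With the previous values, both resources fail the check, which matches the behaviour described above: eviction starts before any alert can fire. With the proposed values, both pass.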
gdemonet added the kind:bug and topic:monitoring labels Aug 4, 2021
alexandre-allard added a commit that referenced this issue Aug 4, 2021
We lower the thresholds for the following alerts

NodeFilesystemAlmostOutOfSpace:
  - warning from 5% to 20%
  - critical from 3% to 12%
NodeFilesystemAlmostOutOfFiles:
  - warning from 5% to 15%
  - critical from 3% to 8%

Otherwise we don't receive an alert before kubelet
starts evicting pods when the disk is full, as its
thresholds are set to 10% of available disk space
and 5% of free inodes.

Refs: #3474
alexandre-allard self-assigned this Aug 4, 2021
alexandre-allard added a commit that referenced this issue Aug 5, 2021
```
./tools/rule_extractor/rule_extractor.py \
  -i <control-plane-ip> -p 8443 -t rules
```

Refs: #3474
@alexandre-allard

Fixed by #3479
