Fix riemann-health memory reporting when using ZFS on Linux #289

Merged
jamtur01 merged 2 commits into main from fix-linux-zfs-arc-memory-usage
May 30, 2024

Conversation

Member

@smortex smortex commented May 29, 2024

On Linux, memory used by the ZFS ARC is reported as used while in fact
it mostly consists of cached data that can be reclaimed by the operating
system if necessary.

If the system uses ZFS, gather the ARC statistics to compute the size
of the ARC that can be evicted and do not count this as used memory.
This fixes the obviously wrong reporting we can see on systems running
ZFS.
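
For illustration, the computation described above could look roughly like the following Python sketch. It is not the code from this pull request. OpenZFS on Linux exposes ARC statistics in `/proc/spl/kstat/zfs/arcstats` as `name type data` rows; the choice of the `*_evictable_*` counters, the `used = total - free - buffers - cached` formula, and all function names are assumptions made for the example.

```python
ARCSTATS_PATH = "/proc/spl/kstat/zfs/arcstats"  # OpenZFS kstat file on Linux

# Assumption: these counters are the ones treated as reclaimable ARC memory.
EVICTABLE_KEYS = (
    "anon_evictable_data", "anon_evictable_metadata",
    "mru_evictable_data", "mru_evictable_metadata",
    "mfu_evictable_data", "mfu_evictable_metadata",
)


def parse_arcstats(text):
    """Parse the contents of the arcstats file into a {name: int} dict."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows have three columns (name, type, data); the two header
        # lines do not match this shape and are skipped.
        if len(fields) == 3 and fields[2].isdigit():
            stats[fields[0]] = int(fields[2])
    return stats


def evictable_arc_bytes(stats):
    """Sum the evictable counters; missing counters count as zero."""
    return sum(stats.get(key, 0) for key in EVICTABLE_KEYS)


def used_memory_fraction(total, free, buffers, cached, evictable_arc=0):
    """Fraction of memory in use, not counting evictable ARC as used."""
    used = total - free - buffers - cached - evictable_arc
    return used / total
```

On a host with ZFS loaded, the input would come from `open(ARCSTATS_PATH).read()`; on a system without ZFS the file does not exist and `evictable_arc` stays at its default of 0.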

Containers running on nodes with ZFS can access the host's ARC
statistics from the container, resulting in wrong memory reporting. To
avoid this, we detect if we are running from a container and skip
reading the ARC statistics in that case.
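
The description does not say how the container check is done, so the sketch below only shows one possible heuristic, in Python for illustration. It assumes the `/.dockerenv` (Docker) and `/run/.containerenv` (Podman) marker files and the `container=` variable that container managers following the systemd container interface set for PID 1. The function names and the injectable `exists` / `read_environ` parameters are hypothetical.

```python
import os


def read_pid1_environ():
    """Return PID 1's environment as bytes, or b"" if it cannot be read."""
    try:
        with open("/proc/1/environ", "rb") as f:
            return f.read()
    except OSError:
        # Typically a permission error when not running as root.
        return b""


def running_in_container(exists=os.path.exists, read_environ=read_pid1_environ):
    """Best-effort guess at whether this process runs inside a container."""
    # Marker files created by Docker and Podman respectively.
    if exists("/.dockerenv") or exists("/run/.containerenv"):
        return True
    # Entries in /proc/<pid>/environ are separated by NUL bytes.
    entries = read_environ().split(b"\0")
    return any(entry.startswith(b"container=") for entry in entries)


def evictable_arc_or_zero(read_evictable, in_container):
    """Skip the ARC statistics entirely when inside a container."""
    return 0 if in_container else read_evictable()
```

A heuristic like this can miss container runtimes that leave none of these markers, which is one reason to treat it as a sketch rather than a reliable detector.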

@smortex smortex force-pushed the fix-linux-zfs-arc-memory-usage branch from 4565cc6 to 736824e Compare May 29, 2024 08:35
@smortex smortex changed the title Fix Linux memory reporting when using ZFS Fix riemann-health memory reporting when using ZFS on Linux May 29, 2024
@smortex smortex force-pushed the fix-linux-zfs-arc-memory-usage branch from 736824e to bdd1960 Compare May 29, 2024 08:38
@smortex smortex added the bug Something isn't working label May 29, 2024
Member Author

smortex commented May 29, 2024

Here is a graph of a newly provisioned system showing the issue:

screenshot

Initially, 2.5% of the memory is in use. At 8:50, we initiate the transfer of a large amount of data from another machine. Data written to disk quickly fills the ZFS ARC and memory usage plateaus at 53%. In fact, the workload has not really changed compared to the initial situation, and memory usage was expected to stay the same, so I started digging into this to find out what was wrong.

I first tried to follow what htop does by subtracting the size of the ARC minus the minimum size of the ARC from used memory, but the result was still jumping up and down as files were sent, which did not look great. I then reworked the code to subtract the "evictable" memory metrics from used memory instead, which seems more reasonable (on the left, ZFS was not loaded and there was no ARC at all, so we expect the used memory when ZFS is in use to be slightly higher than when ZFS is not installed):

screenshot
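
The two corrections compared in this comment can be written side by side as a small Python sketch. The function names are hypothetical, and mapping "minimum size of the ARC" to the `c_min` counter in arcstats is an assumption. All values are in bytes.

```python
def used_htop_style(used, arc_size, arc_min):
    """First attempt: treat everything above the ARC minimum as cache."""
    shrinkable = max(arc_size - arc_min, 0)
    return used - shrinkable


def used_evictable_style(used, evictable):
    """Second attempt: only subtract what the ARC reports as evictable."""
    return used - evictable
```

The first variant depends only on the total ARC size, while the second depends on how much of the ARC is currently reported as evictable, so the two can give different results for the same system state.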

@smortex smortex force-pushed the fix-linux-zfs-arc-memory-usage branch from bdd1960 to 2fe0e27 Compare May 29, 2024 22:13
@smortex smortex force-pushed the fix-linux-zfs-arc-memory-usage branch from 2fe0e27 to fda2a2a Compare May 29, 2024 22:22
@smortex smortex marked this pull request as ready for review May 30, 2024 00:19
Member

@jamtur01 jamtur01 left a comment


LGTM

@jamtur01 jamtur01 merged commit a47f0fd into main May 30, 2024
7 checks passed
@smortex smortex deleted the fix-linux-zfs-arc-memory-usage branch May 30, 2024 18:22