[Core] Fix cgroup used memory calculation for Ray memory monitor #43071
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
    return kNull;
  }
- return current_usage_bytes - inactive_file_bytes;
+ return current_usage_bytes - inactive_file_bytes - active_file_bytes;
This is the key.
You should update the Python-side code too.
Please refine comments before merging.
https://github.com/ray-project/ray/pull/43071/files#r1483902223
Awesome. How do we plan to test it? Could you run it with high-IO workloads, or are we just going to provide a wheel to users who have the problem?
Why are these changes needed?
From the OOM killer's perspective, file page caches are reclaimable and can be used when the kernel needs memory (the memory can be reclaimed by writing the data back to the original file), so we should exclude them when calculating cgroup used memory.
Before this PR, we excluded only the inactive part of the file page cache. This PR excludes both the active and inactive parts.
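The calculation described above can be sketched as follows. This is a minimal illustration, not Ray's actual implementation: the helper names `FindStat` and `UsedBytesExcludingFileCache` are hypothetical, and the sketch assumes the cgroup v2 `memory.stat` key names `inactive_file` and `active_file`, with the total usage (for example, the value of `memory.current`) passed in by the caller.

```cpp
#include <cstdint>
#include <optional>
#include <sstream>
#include <string>

// Hypothetical helper: scans memory.stat-style "key value" lines and returns
// the value for `key`, or std::nullopt if the key is not present.
std::optional<int64_t> FindStat(const std::string &stat_text, const std::string &key) {
  std::istringstream in(stat_text);
  std::string name;
  int64_t value = 0;
  while (in >> name >> value) {
    if (name == key) {
      return value;
    }
  }
  return std::nullopt;
}

// Hypothetical helper: subtracts both inactive and active file page cache
// from the total usage, since both are reclaimable by the kernel.
// Returns std::nullopt if either counter is missing.
std::optional<int64_t> UsedBytesExcludingFileCache(int64_t current_usage_bytes,
                                                   const std::string &stat_text) {
  const auto inactive_file_bytes = FindStat(stat_text, "inactive_file");
  const auto active_file_bytes = FindStat(stat_text, "active_file");
  if (!inactive_file_bytes || !active_file_bytes) {
    return std::nullopt;
  }
  return current_usage_bytes - *inactive_file_bytes - *active_file_bytes;
}
```

With a total usage of 200 bytes, 30 bytes of inactive file cache, and 20 bytes of active file cache, the old formula would report 170 bytes used, while the formula in this PR reports 150.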
Related issue number
Closes #42894
Checks
I've signed off every commit (git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
If I've added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.