[Core] Fix cgroup used memory calculation for Ray memory monitor #43071

Merged
jjyao merged 4 commits into ray-project:master from jjyao/cgroup2 on Feb 14, 2024

@jjyao jjyao commented Feb 9, 2024

Why are these changes needed?

From the OOM killer's perspective, file page cache is reclaimable: the kernel can take it back when it needs memory (writing the data back to the original file first if necessary). We should therefore exclude it when calculating cgroup used memory.

Before this PR, we excluded only the inactive part of the file page cache. This PR excludes both the active and the inactive parts.
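
To illustrate the difference between the two rules, here is a small sketch. The function names and sample numbers are hypothetical and are not taken from the Ray code; only the subtraction itself mirrors the change in this PR.

```python
GIB = 1024 ** 3


def used_bytes_before(current_usage_bytes, inactive_file_bytes, active_file_bytes):
    """Old rule: only the inactive file page cache counts as reclaimable."""
    # active_file_bytes is deliberately ignored here; that is the old behavior.
    return current_usage_bytes - inactive_file_bytes


def used_bytes_after(current_usage_bytes, inactive_file_bytes, active_file_bytes):
    """New rule: both inactive and active file page cache count as reclaimable."""
    return current_usage_bytes - inactive_file_bytes - active_file_bytes


# Hypothetical cgroup: 10 GiB charged, of which 2 GiB is inactive file cache
# and 4 GiB is active file cache (for example, a recently read dataset).
sample = (10 * GIB, 2 * GIB, 4 * GIB)
before = used_bytes_before(*sample)
after = used_bytes_after(*sample)
```

With these made-up numbers the old rule reports 4 GiB more "used" memory than the new one, which is the kind of gap that could push the memory monitor over its threshold on IO-heavy workloads.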

Related issue number

Closes #42894

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
      return kNull;
    }
-   return current_usage_bytes - inactive_file_bytes;
+   return current_usage_bytes - inactive_file_bytes - active_file_bytes;
Contributor Author

This is the key.

Contributor

@WeichenXu123 WeichenXu123 left a comment

You should update python side code too.
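
For reference, a minimal sketch of what an equivalent Python-side calculation could look like, assuming cgroup v2, where `memory.current` holds a single integer and `memory.stat` holds `key value` lines including `inactive_file` and `active_file`. The helper names and the default directory are assumptions for illustration; this is not the actual code in `python/ray/_private/utils.py`.

```python
import os
from typing import Dict, Optional


def parse_memory_stat(stat_text: str) -> Dict[str, int]:
    """Parse the 'key value' lines of a cgroup v2 memory.stat file."""
    stats = {}
    for line in stat_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def cgroup_used_bytes(current_text: str, stat_text: str) -> Optional[int]:
    """Used memory = memory.current minus inactive and active file page cache.

    Returns None when a required value is missing or malformed.
    """
    stats = parse_memory_stat(stat_text)
    if "inactive_file" not in stats or "active_file" not in stats:
        return None
    try:
        current = int(current_text.strip())
    except ValueError:
        return None
    return current - stats["inactive_file"] - stats["active_file"]


def read_cgroup_used_bytes(cgroup_dir: str = "/sys/fs/cgroup") -> Optional[int]:
    """Hypothetical wrapper: read the two files from cgroup_dir (assumed path)."""
    try:
        with open(os.path.join(cgroup_dir, "memory.current")) as f:
            current_text = f.read()
        with open(os.path.join(cgroup_dir, "memory.stat")) as f:
            stat_text = f.read()
    except OSError:
        return None
    return cgroup_used_bytes(current_text, stat_text)
```

Keeping the parsing separate from the file reads makes the arithmetic easy to unit test with canned strings, without needing a real cgroup.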

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Contributor

@WeichenXu123 WeichenXu123 left a comment

@jjyao jjyao marked this pull request as ready for review February 9, 2024 18:14
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Contributor

@rkooo567 rkooo567 left a comment

Awesome. How do we plan to test it? Could you run it with high-IO workloads, or are we just going to provide a wheel to users who have the problem?
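
One possible way to produce a file-cache-heavy workload for such a test, sketched under assumptions: the function below writes a temporary file and reads it back so that its pages land in the page cache. The function name and the `observe` callback are hypothetical, and the effect on `active_file`/`inactive_file` assumes the temporary directory is on a disk-backed filesystem rather than tmpfs.

```python
import tempfile
from typing import Callable, Optional


def touch_page_cache(
    num_bytes: int,
    chunk_size: int = 1 << 20,
    observe: Optional[Callable[[], None]] = None,
) -> int:
    """Write num_bytes to a temp file, read them back, and return bytes read.

    observe (if given) is called while the file still exists, which is the
    point at which cgroup memory statistics could be sampled.
    """
    chunk = b"\0" * chunk_size
    total_read = 0
    with tempfile.NamedTemporaryFile() as f:
        remaining = num_bytes
        while remaining > 0:
            n = min(chunk_size, remaining)
            f.write(chunk[:n])
            remaining -= n
        f.flush()
        f.seek(0)
        # Reading the data back touches the cached pages again.
        while True:
            data = f.read(chunk_size)
            if not data:
                break
            total_read += len(data)
        if observe is not None:
            observe()
    # The temp file is deleted here, so its cache pages can be dropped.
    return total_read
```

Sampling the memory statistics inside `observe` while a large file is cached would show whether the monitor's used-memory figure still grows with file IO.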

python/ray/_private/utils.py (review thread: outdated, resolved)
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
@jjyao jjyao merged commit db9d606 into ray-project:master Feb 14, 2024
9 checks passed
@jjyao jjyao deleted the jjyao/cgroup2 branch February 14, 2024 05:37
kevin85421 pushed a commit to kevin85421/ray that referenced this pull request Feb 17, 2024
…-project#43071)

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Development

Successfully merging this pull request may close these issues.

[Core] Update the cgroup2 memory accounting logic to exclude buffer/cached memory
3 participants