[Core] Fix cgroup used memory calculation for Ray memory monitor #43071

jjyao · 2024-02-09T06:04:01Z

Why are these changes needed?

From the OOM killer's perspective, file page cache is reclaimable: the kernel can reclaim it when it needs memory (writing the data back to the original file if necessary), so we should exclude it when calculating cgroup used memory.

Before this PR, we excluded only the inactive part of the file page cache. This PR excludes both the active and inactive parts.
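To make the calculation described above concrete, here is a minimal Python sketch, not the code from this PR. It assumes the cgroup v2 `memory.stat` layout (one `key value` pair per line, with `inactive_file` and `active_file` keys). The function names and sample numbers are hypothetical and chosen only for illustration.

```python
def parse_memory_stat(text):
    """Parse the 'key value' lines of a cgroup memory.stat file into a dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        # Keep only well-formed "name integer" lines; skip anything else.
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def cgroup_used_bytes(current_usage_bytes, memory_stat_text):
    """Return usage minus file page cache (inactive + active).

    Returns None when either counter is missing, so the caller can fall
    back to another source instead of trusting a partial result.
    """
    stats = parse_memory_stat(memory_stat_text)
    if "inactive_file" not in stats or "active_file" not in stats:
        return None
    return current_usage_bytes - stats["inactive_file"] - stats["active_file"]


# Hypothetical sample contents, not taken from a real machine.
SAMPLE_STAT = """anon 1000
file 700
inactive_file 300
active_file 400
"""

print(cgroup_used_bytes(2000, SAMPLE_STAT))  # 2000 - 300 - 400 = 1300
```

On a real host, the two inputs would typically come from the cgroup's `memory.current` and `memory.stat` files; the sketch takes them as arguments so it can run anywhere.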

Related issue number

Closes #42894

Checks

- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
  - I've added any new APIs to the API Reference. For example, if I added a
    method in Tune, I've added it in doc/source/tune/api/ under the
    corresponding .rst file.
- I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

jjyao · 2024-02-09T06:04:41Z

src/ray/common/memory_monitor.cc

    return kNull;
  }
-  return current_usage_bytes - inactive_file_bytes;
+  return current_usage_bytes - inactive_file_bytes - active_file_bytes;


This is the key change: active file pages are now subtracted from the usage as well.
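As a rough illustration of why the one-line change above matters, the following Python sketch compares the old and new formulas against a usage threshold. All numbers, the `0.95` threshold, and the helper names are hypothetical and serve only to show how a large active file cache can push the old calculation over a limit.

```python
def used_before(current, inactive_file, active_file):
    """Old formula: only inactive file pages are excluded."""
    return current - inactive_file


def used_after(current, inactive_file, active_file):
    """New formula: both inactive and active file pages are excluded."""
    return current - inactive_file - active_file


# Hypothetical values in MiB for an IO-heavy workload with a hot page cache.
limit = 10240
current, inactive_file, active_file = 10000, 200, 6000
threshold = 0.95  # illustrative threshold, not read from any real config

before = used_before(current, inactive_file, active_file)  # 9800
after = used_after(current, inactive_file, active_file)    # 3800

# Under the old formula the cgroup looks nearly full; under the new one it does not.
print("old formula over threshold:", before / limit > threshold)
print("new formula over threshold:", after / limit > threshold)
```

With these inputs the old formula reports usage above the threshold while the new one stays well below it, even though the difference is entirely reclaimable page cache.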

src/ray/common/memory_monitor.cc

WeichenXu123

You should update the Python-side code too.


WeichenXu123

Please refine comments before merging.
https://github.com/ray-project/ray/pull/43071/files#r1483902223


rkooo567

Awesome. How do we plan to test it? Could you run it with high-IO workloads, or are we just going to provide a wheel to users who have the problem?



Fix cgroup used memory calculation for Ray memory monitor

246ed7c

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>


WeichenXu123 approved these changes Feb 9, 2024

WeichenXu123 reviewed Feb 9, 2024

WeichenXu123 suggested changes Feb 9, 2024


up

196fbb3

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

WeichenXu123 approved these changes Feb 9, 2024

WeichenXu123 reviewed Feb 9, 2024

jjyao marked this pull request as ready for review February 9, 2024 18:14

up

5807e82

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

rkooo567 approved these changes Feb 10, 2024

rkooo567 reviewed Feb 10, 2024

comment

907bb05

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

jjyao merged commit db9d606 into ray-project:master Feb 14, 2024
9 checks passed

jjyao deleted the jjyao/cgroup2 branch February 14, 2024 05:37

jjyao mentioned this pull request Apr 15, 2024

Memory monitoring shows incorrect memory usage. #33741

Closed
