[core] Correct OOM score adjustment logic for workers#62470
edoakes merged 9 commits into ray-project:master
Conversation
Signed-off-by: peterjc123 <peterghost86@gmail.com>
Code Review
This pull request modifies the AdjustWorkerOomScore function in src/ray/raylet/worker_pool.cc to allow for a lower OOM score adjustment. Specifically, the minimum allowed oom_score_adj value has been changed from 0 to -1000, enabling workers to be configured with a higher priority against the OOM killer. There are no review comments to address.
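The bound change described above can be sketched as a simple clamp against the kernel's valid `oom_score_adj` range. This is a minimal illustration, not the actual `worker_pool.cc` code; the function name `ClampWorkerOomScoreAdj` and the explicit `lower_bound` parameter are hypothetical, introduced only to show the effect of moving the floor from 0 to -1000.

```cpp
#include <algorithm>

// Hypothetical sketch: the kernel accepts oom_score_adj values in
// [-1000, 1000]. Before this PR, Ray floored worker values at 0;
// after, the floor is the kernel minimum of -1000.
int ClampWorkerOomScoreAdj(int requested, int lower_bound) {
  return std::clamp(requested, lower_bound, 1000);
}
```

With the old floor of 0, a requested value of -500 would be raised to 0; with the new floor of -1000, it is preserved as -500.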
|
Hi @peterjc123 thanks for the PR! Quick question about the use case for this: in what scenario would it be preferable to set the worker score lower than 0? |
|
@Kunchd Thanks for your prompt response. Our use case is a multi-tenant physical machine running multiple jobs with different priority levels. At the job level, those priorities are already expressed via oom_score_adj on the parent process. In that setup, Ray workers are child processes of the job and should generally inherit/follow the OOM preference established for that job. Allowing values below 0 is important because higher-priority jobs may already be assigned negative oom_score_adj values so they are less likely to be killed under memory pressure. With the previous lower bound of 0, Ray could not preserve that policy for workers belonging to those jobs. This change lets worker processes align with the parent/job-level OOM configuration, so the kernel’s OOM selection remains consistent with the job priorities already configured on the machine. |
|
Got it, so the goal here is essentially to provide OOM-killing priority at job-level granularity. However, there's an issue with this approach. One workaround: if your OOM score can be specified without sudo privileges, you could modify the OOM score at the start of your user-defined function (for tasks) or in the constructor of your user-defined class (for actors) based on the job. For a more complete solution, adding task/actor-granularity prioritization is something we've been considering, so we are also open to helping shepherd this effort if you are interested. |
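The workaround described above amounts to the worker process writing its own `/proc/self/oom_score_adj` at startup. A hedged sketch, assuming a Linux host; the helper name `WriteOomScoreAdj` is hypothetical, and raising priority (writing a value more negative than the current one) typically requires root or `CAP_SYS_RESOURCE`:

```cpp
#include <fstream>
#include <string>

// Write an oom_score_adj value to the given procfs path.
// Returns false if the file cannot be opened or written.
bool WriteOomScoreAdj(const std::string &path, int value) {
  std::ofstream f(path);
  if (!f) return false;
  f << value;
  return static_cast<bool>(f);
}

// In a task or actor, at startup (hypothetical usage):
//   WriteOomScoreAdj("/proc/self/oom_score_adj", 500);
```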
|
Well, actually the jobs are managed by K8s, which assigns the different OOM score adjustments during container setup, so we actually run isolated Ray servers in the containers. So in my use case, the code change is sufficient. |
|
I'm still not clear on how adjusting the OOM scores in different containers will allow you to specify different OOM scores on a per-job basis. Do you have Ray nodes dedicated to running specific jobs, where each node is configured with a different OOM score depending on its assigned job? And why can't you configure the priority between jobs with positive OOM scores only?

Clarifying questions aside, I do have one more concern with this change. Setting the score to a negative value might allow the kernel to OOM-kill the raylet before the workers. Doing so would take down the Ray node with all workers running on it, which would be very destructive. |
|
Let's answer the questions one by one.

The K8s setup and why we cannot use positive scores

In our multi-tenant K8s environment, each job runs in its own isolated K8s pod (container), so each pod runs its own independent Ray instance (raylet + workers) dedicated solely to that job. The K8s scheduling team enforces a strict, machine-wide OOM policy across all workloads on the physical machine, both Ray and non-Ray. They dictate the container's oom_score_adj to establish priority: -500 for online services, 0 for system, and 500 for interruptible workloads. Because this is a global infrastructure policy, I do not have the authority to shift these priorities into positive numbers. If my online service is assigned -500 by K8s, the raylet inherits -500. However, because Ray currently clamps worker processes to a minimum of 0, the workers inside my -500 container are artificially inflated to 0. This makes them highly vulnerable to the host machine's OOM killer compared to other -500-tier processes running on the same physical node.

The raylet safety concern

You make an excellent point about the danger of the kernel OOM-killing the raylet before the workers. Taking down the whole node is definitely something we want to avoid. By allowing negative numbers, my goal is actually to maintain Ray's intended kill hierarchy, just shifted into the negative space. Because the K8s container (and thus the raylet) is already sitting at -500, I need to be able to set the workers to something like -450. If workers are forced to 0, they are badly out of alignment with the pod's baseline. By removing the 0 floor, power users can configure the workers to be slightly more OOM-prone than the raylet (e.g., raylet at -500, workers at -450), preserving the safety mechanism you mentioned while still respecting the strict negative baselines enforced by the host K8s environment.
Because this change only takes effect if a user explicitly overrides the default configuration, it remains strictly opt-in. Standard users will still get the default positive scores, ensuring the default Ray experience remains safe. |
|
Thanks, that clarifies things a lot more. I still have a couple more questions to make sure we're making the right fix here, but I see the use case for this change now.
Modifying the OOM score can cause significant cluster-stability issues if misconfigured, so I want to make sure the change to support negative worker OOM scores clearly warns users about potential issues. I'll leave a couple of nit comments. |
As a tenant, I don't know the whole picture, but looking into the logs, I think they are actually oversubscribing the physical resources.
Anyway, I'm running the job using the Ray server in my pouch/container. I believe we are not using any Ray cluster tools to manage the box or jobs. |
|
Will approve after all tests pass. |
|
@Kunchd Thanks, I've updated the title and the descriptions of the PR. |
Cursor Bugbot reviewed the changes for commit 0393e2a and found 1 potential issue.
Kunchd
left a comment
LGTM. Thanks for the contribution!

Description
Looking at the comment on worker_oom_score_adjustment in ray_config_def.h, it says:
/// A value to add to workers' OOM score adjustment, so that the OS prioritizes
/// killing these over the raylet. 0 or positive values only (negative values
/// require sudo permissions).
But the code doesn't actually add this value to the current OOM score adjustment; it just sets it as the absolute OOM score adjustment. So I updated the logic to correctly reflect the documented behaviour.
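The corrected semantics can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual AdjustWorkerOomScore implementation: the configured worker_oom_score_adjustment is added to the raylet's current oom_score_adj, then clamped to the kernel's valid range; the function name ComputeWorkerOomScoreAdj is hypothetical.

```cpp
#include <algorithm>

// Hypothetical sketch of the fix: the worker's oom_score_adj is the
// raylet's current value plus the configured adjustment, clamped to
// the kernel range [-1000, 1000], rather than the adjustment taken
// as an absolute value floored at 0.
int ComputeWorkerOomScoreAdj(int raylet_score_adj, int adjustment) {
  return std::clamp(raylet_score_adj + adjustment, -1000, 1000);
}
```

For example, a raylet at -999 with an adjustment of 1 yields workers at -998, so workers remain slightly more OOM-prone than the raylet even in the negative range.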
Related issues
When the raylet process has an oom_score_adj of -999, the oom_score_adj of the worker processes can currently only be set to 0, but it should be possible to set it to -998.
Additional information