
[core] Correct OOM score adjustment logic for workers#62470

Merged
edoakes merged 9 commits into ray-project:master from peterjc123:pr/fix_oom_score_adj_range
Apr 16, 2026

Conversation

@peterjc123
Contributor

@peterjc123 peterjc123 commented Apr 9, 2026

Description

Looking at the comment for worker_oom_score_adjustment in ray_config_def.h, it says:

/// A value to add to workers' OOM score adjustment, so that the OS prioritizes
/// killing these over the raylet. 0 or positive values only (negative values
/// require sudo permissions).

However, the implementation does not actually add this value to the worker's current oom_score_adj; it simply sets it as the absolute adjustment. I updated the logic to correctly reflect the documented behaviour.

Related issues

When the raylet process has an oom_score_adj of -999, the oom_score_adj of the worker processes can currently only be set to 0 or higher, but it should be possible to set it to -998.

Additional information

Signed-off-by: peterjc123 <peterghost86@gmail.com>
@peterjc123 peterjc123 requested a review from a team as a code owner April 9, 2026 09:03
Comment thread src/ray/raylet/worker_pool.cc Outdated
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request modifies the AdjustWorkerOomScore function in src/ray/raylet/worker_pool.cc to allow for a lower OOM score adjustment. Specifically, the minimum allowed oom_score_adj value has been changed from 0 to -1000, enabling workers to be configured with a higher priority against the OOM killer. There are no review comments to address.

@peterjc123 peterjc123 changed the title Adjust OOM score adjustment range for workers [core] Adjust OOM score adjustment range for workers Apr 9, 2026
@ray-gardener ray-gardener bot added core Issues that should be addressed in Ray Core community-contribution Contributed by the community labels Apr 9, 2026
@Kunchd Kunchd self-assigned this Apr 9, 2026
@Kunchd
Contributor

Kunchd commented Apr 9, 2026

Hi @peterjc123 thanks for the PR! Quick question about the use case for this, in what scenario would it be preferable to set the worker score lower than 0?

@peterjc123
Contributor Author

@Kunchd Thanks for your prompt response. Our use case is a multi-tenant physical machine running multiple jobs with different priority levels. At the job level, those priorities are already expressed via oom_score_adj on the parent process.

In that setup, Ray workers are child processes of the job and should generally inherit/follow the OOM preference established for that job. Allowing values below 0 is important because higher-priority jobs may already be assigned negative oom_score_adj values so they are less likely to be killed under memory pressure.

With the previous lower bound of 0, Ray could not preserve that policy for workers belonging to those jobs. This change lets worker processes align with the parent/job-level OOM configuration, so the kernel’s OOM selection remains consistent with the job priorities already configured on the machine.

@Kunchd
Contributor

Kunchd commented Apr 10, 2026

Got it, so the goal here is to essentially provide OOM killing priority on a job level granularity. However, there's an issue with using the oom_score_adjustment environment variable to accomplish this. The oom score applies on a per-node basis, meaning all workers on that node regardless of what job they are running will share that specified oom score. Because of this, I don't think this environment variable can allow us to specify killing priority between jobs.

One workaround: if your oom score can be modified without sudo privileges, you could adjust it at the start of your user-defined function (for tasks) or in the constructor of your user-defined class (for actors), based on the job.

For a more complete solution, adding task/actor granularity prioritization is something we've been considering, so we are also open to help shepherd this effort if you are interested.

@peterjc123
Contributor Author

peterjc123 commented Apr 11, 2026

Actually, the jobs are managed by k8s, which assigns the different oom score adjustments during container setup, and we run isolated Ray servers inside the containers. So for my use case, this code change is sufficient.

@Kunchd
Contributor

Kunchd commented Apr 13, 2026

I'm still not clear on how adjusting the oom scores in different containers will allow you to be able to specify different oom scores on a per job basis. Do you have ray nodes that are dedicated to running specific jobs, and each of these nodes depending on their assigned job will have a different oom score configured? And why can't you configure the priority between jobs with positive oom scores only?

Clarifying question aside, I do have one more concern with this change. Changing the score to negative might allow the kernel to OOM kill the raylet before workers. Doing so will take down the ray node with all workers running on it, which will be very destructive.

@peterjc123
Contributor Author

peterjc123 commented Apr 14, 2026

@Kunchd

Let's answer the questions one by one.

The K8s setup and why we cannot use positive scores

In our multi-tenant K8s environment, each job runs in its own isolated K8s pod (container). Therefore, each pod runs its own independent Ray instance (Raylet + workers) dedicated solely to that job.

The K8s scheduling team enforces a strict, machine-wide OOM policy across all workloads on the physical machine—both Ray and non-Ray workloads. They dictate the container's oom_score_adj to establish priority: -500 for online services, 0 for system, and 500 for interruptible.

Because this is a global infrastructure policy, I do not have the authority to shift these priorities into positive numbers. If my online service is assigned -500 by K8s, the Raylet inherits -500. However, because Ray currently clamps worker processes to a minimum of 0, the workers inside my -500 container are artificially inflated to 0. This makes them highly vulnerable to the host machine's OOM killer compared to other -500 tier processes running on that same physical node.

The Raylet safety concern

You make an excellent point about the danger of the kernel OOM killing the Raylet before the workers. Taking down the whole node is definitely something we want to avoid.

By allowing negative numbers, my goal is actually to maintain Ray's intended kill hierarchy, just shifted into the negative space. Because the K8s container (and thus the Raylet) is already sitting at -500, I need to be able to set the workers to something like -450.

If workers are forced to 0, they are violently out of alignment with the pod's baseline. By removing the 0 floor, power users can configure the workers to be slightly more OOM-prone than the Raylet (e.g., Raylet at -500, workers at -450, preserving the safety mechanism you mentioned), while still respecting the strict negative baselines enforced by the host K8s environment.

Because this change only takes effect if a user explicitly overrides the default configuration, it remains strictly opt-in. Standard users will still get the default positive scores, ensuring the default Ray experience remains safe.

@Kunchd
Contributor

Kunchd commented Apr 14, 2026

Thanks, that clarifies things a lot more. I still have a couple more questions to make sure we're making the right fix here, but I see the use case for this change now.

  • For your multi-tenant environment, why do you want to adjust the oom scores for individual pods on a physical host? Are you oversubscribing the physical resources of that box?
  • When you mentioned that you are running a multi-tenant cluster, do you mean you are running multiple jobs on a single ray cluster, or multiple ray clusters on the same host, or some other configuration?

Modifying the oom score can cause significant cluster stability issues if misconfigured, so I want to make sure the change to support negative worker oom scores clearly warns users about potential issues.

I'll leave a couple of comments on nits.

Comment thread src/ray/raylet/worker_pool.cc Outdated
Comment thread src/ray/raylet/worker_pool.cc Outdated
@peterjc123
Contributor Author

  • For your multi-tenant environment, why do you want to adjust the oom scores for individual pods on a physical host? Are you oversubscribing the physical resources of that box?

As a tenant, I don't know the whole picture. But looking into the logs, I think they are actually oversubscribing the physical resources.

  • When you mentioned that you are running a multi-tenant cluster, do you mean you are running multiple jobs on a single ray cluster, or multiple ray clusters on the same host, or some other configuration?

I'm running the job using the Ray server inside my pouch/container. I believe we are not using any Ray cluster tools to manage the box or the jobs.

Comment thread src/ray/raylet/worker_pool.cc Outdated
Signed-off-by: peterjc123 <peterghost86@gmail.com>
@peterjc123 peterjc123 force-pushed the pr/fix_oom_score_adj_range branch from eda6691 to 5b8a9bd Compare April 15, 2026 11:29
@Kunchd Kunchd added the go add ONLY when ready to merge, run all tests label Apr 16, 2026
@Kunchd
Contributor

Kunchd commented Apr 16, 2026

Will approve after all tests pass.

@peterjc123 peterjc123 changed the title [core] Adjust OOM score adjustment range for workers [core] Correct OOM score adjustment logic for workers Apr 16, 2026
@peterjc123
Contributor Author

@Kunchd Thanks, I've updated the title and the descriptions of the PR.


@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


Reviewed by Cursor Bugbot for commit 0393e2a.

Comment thread src/ray/raylet/worker_pool.cc
Signed-off-by: peterjc123 <peterghost86@gmail.com>
Signed-off-by: peterjc123 <peterghost86@gmail.com>
Signed-off-by: peterjc123 <peterghost86@gmail.com>
Contributor

@Kunchd Kunchd left a comment


LGTM. Thanks for the contribution!

@edoakes edoakes merged commit e4d0c47 into ray-project:master Apr 16, 2026
6 checks passed
@peterjc123 peterjc123 deleted the pr/fix_oom_score_adj_range branch April 17, 2026 14:59