
[Data] Fix autoscaler to respect user-configured resource limits #60283

Merged

bveeramani merged 13 commits into master from marwan-autoscaling-respect-resource-limits on Jan 20, 2026

Conversation

@marwan116 marwan116 (Contributor) commented Jan 19, 2026

Summary

  • Fix Ray Data's cluster autoscalers (V1 and V2) to respect user-configured resource_limits set via ExecutionOptions
  • Cap autoscaling resource requests so they do not exceed user-specified CPU and GPU limits
  • Update get_total_resources() to return the minimum of cluster resources and user limits

Why are these changes needed?

Previously, Ray Data's cluster autoscalers did not respect user-configured resource limits. When a user set explicit limits like:

```python
ctx = ray.data.DataContext.get_current()
ctx.execution_options.resource_limits = ExecutionResources(cpu=8)
```

the autoscaler would ignore these limits and continue to request more cluster resources from Ray's autoscaler, causing unnecessary node upscaling even when the executor couldn't use the additional resources.

This was problematic because:

  1. Users explicitly setting resource limits expect Ray Data to stay within those bounds
  2. Unnecessary cluster scaling wastes cloud resources and money
  3. The ResourceManager.get_global_limits() already respects user limits, but the autoscaler bypassed this by requesting resources directly (the fix caps those requests at the user limits, as sketched below)
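
In essence, the fix clamps what the autoscaler asks for to the user-configured limits before the request reaches Ray's cluster autoscaler. The following is a minimal, self-contained sketch of that capping logic; the helper names are illustrative and not the PR's actual code:

```python
import math
from typing import Optional, Tuple


def _cap(requested: Optional[float], limit: Optional[float]) -> Optional[float]:
    """Return the requested amount clamped to the user limit (None/inf means unlimited)."""
    if limit is None or math.isinf(limit):
        return requested
    if requested is None:
        return limit
    return min(requested, limit)


def cap_resource_request(
    requested_cpu: Optional[float],
    requested_gpu: Optional[float],
    limit_cpu: Optional[float] = None,
    limit_gpu: Optional[float] = None,
) -> Tuple[Optional[float], Optional[float]]:
    """Clamp a (cpu, gpu) autoscaling request to user-configured resource limits."""
    return _cap(requested_cpu, limit_cpu), _cap(requested_gpu, limit_gpu)


# With resource_limits = ExecutionResources(cpu=8), a request for 32 CPUs and
# 4 GPUs is capped to (8, 4): the CPU limit applies, the GPU request passes through.
assert cap_resource_request(32, 4, limit_cpu=8) == (8, 4)
```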

Test Plan

Added comprehensive unit tests for both autoscaler implementations (an illustrative example of the kind of check follows).
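
For illustration only, here is a pytest-style sketch of the kind of assertion such a test could make, using a stand-in cap() helper rather than the actual autoscaler classes or the PR's real test code:

```python
# Illustrative pytest-style checks: an autoscaling request above the user CPU
# limit is clamped to that limit; requests under the limit, or with no limit
# configured, pass through unchanged.
def cap(requested: float, limit: float = float("inf")) -> float:
    return min(requested, limit)


def test_request_is_capped_to_user_cpu_limit():
    assert cap(32, limit=8) == 8    # capped to the user-configured limit
    assert cap(4, limit=8) == 4     # requests under the limit are untouched
    assert cap(16) == 16            # no limit configured -> request unchanged
```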

Related issue number

Fixes #60085

Checks

  • I've signed off every commit
  • I've run scripts/format.sh to lint the changes in this PR
  • I've included any doc changes needed
  • I've added any new tests if needed

@marwan116 marwan116 changed the title from "Update autoscaling to respect resource limits" to "[Data] Update autoscaling to respect resource limits" on Jan 19, 2026
@marwan116 marwan116 changed the title from "[Data] Update autoscaling to respect resource limits" to "[Data] Fix autoscaler to respect user-configured resource limits" on Jan 19, 2026
@gemini-code-assist gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request updates the autoscaling logic to respect user-configured resource limits for CPU and GPU. It introduces capping for resource requests in both DefaultClusterAutoscaler and DefaultClusterAutoscalerV2. The changes are well-tested. I've identified an opportunity to refactor duplicated code between the two autoscaler implementations. Additionally, this PR includes several version bumps across the codebase, likely for a new release.

@marwan116 marwan116 force-pushed the marwan-autoscaling-respect-resource-limits branch 5 times, most recently from 3c0ba89 to bd4088a on January 19, 2026 02:18

@marwan116 marwan116 force-pushed the marwan-autoscaling-respect-resource-limits branch from bd4088a to 91b9bc5 on January 19, 2026 02:43
Signed-off-by: Marwan Sarieddine <sarieddine.marwan@gmail.com>
@marwan116 marwan116 force-pushed the marwan-autoscaling-respect-resource-limits branch from 91b9bc5 to 48304b6 on January 19, 2026 02:48

```python
def _get_resource_limits(self) -> ExecutionResources:
    """Get user-configured resource limits from execution options."""
    return self._resource_manager._options.resource_limits
```
A reviewer (Member) commented:

Same comment as above -- rather than breaking abstraction barriers to get this information, can we explicitly pass it in as a dependency?

@marwan116 marwan116 (Contributor, Author) replied on Jan 19, 2026:

agreed - should now be resolved
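
For context, a hedged sketch of the shape the reviewer is suggesting: the limits are handed to the autoscaler as an explicit constructor argument instead of being read through the resource manager's private options. Class and parameter names here are illustrative, not the real Ray Data API:

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class ResourceLimits:
    """User-configured caps; None means unlimited."""
    cpu: Optional[float] = None
    gpu: Optional[float] = None


class ClusterAutoscalerSketch:
    def __init__(self, resource_limits: ResourceLimits):
        # The limits arrive as an explicit dependency; nothing here reaches
        # into resource_manager._options.
        self._resource_limits = resource_limits

    def request_resources(self, cpu: float, gpu: float) -> Tuple[float, float]:
        """Return the (cpu, gpu) request after applying the user limits."""
        limits = self._resource_limits
        capped_cpu = cpu if limits.cpu is None else min(cpu, limits.cpu)
        capped_gpu = gpu if limits.gpu is None else min(gpu, limits.gpu)
        return capped_cpu, capped_gpu


autoscaler = ClusterAutoscalerSketch(ResourceLimits(cpu=8))
assert autoscaler.request_resources(cpu=32, gpu=4) == (8, 4)
```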

@aslonnie aslonnie removed request for a team January 19, 2026 07:04
@ray-gardener ray-gardener bot added the data (Ray Data-related issues) label on Jan 19, 2026
…f resource_manager

Signed-off-by: Marwan Sarieddine <sarieddine.marwan@gmail.com>
…imits

Signed-off-by: Marwan Sarieddine <sarieddine.marwan@gmail.com>
@marwan116 marwan116 force-pushed the marwan-autoscaling-respect-resource-limits branch from c412bd2 to 72e734b on January 19, 2026 18:10
Signed-off-by: Marwan Sarieddine <sarieddine.marwan@gmail.com>
Signed-off-by: Marwan Sarieddine <sarieddine.marwan@gmail.com>
…thod

Signed-off-by: Marwan Sarieddine <sarieddine.marwan@gmail.com>
Signed-off-by: Marwan Sarieddine <sarieddine.marwan@gmail.com>
@marwan116 marwan116 force-pushed the marwan-autoscaling-respect-resource-limits branch from 72e734b to d0d56db on January 19, 2026 18:14
Signed-off-by: Marwan Sarieddine <sarieddine.marwan@gmail.com>

@marwan116 marwan116 force-pushed the marwan-autoscaling-respect-resource-limits branch from 4eab5ec to 30fc340 on January 19, 2026 18:48
@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

…nd refactor

Signed-off-by: Marwan Sarieddine <sarieddine.marwan@gmail.com>
@marwan116 marwan116 force-pushed the marwan-autoscaling-respect-resource-limits branch from 30fc340 to e7eafff on January 19, 2026 18:59
@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Signed-off-by: Marwan Sarieddine <sarieddine.marwan@gmail.com>
@marwan116 marwan116 force-pushed the marwan-autoscaling-respect-resource-limits branch from b90c981 to c32122f on January 19, 2026 20:13
@marwan116 marwan116 requested a review from bveeramani January 19, 2026 21:35
…ending requests

Signed-off-by: Marwan Sarieddine <sarieddine.marwan@gmail.com>
@marwan116 marwan116 force-pushed the marwan-autoscaling-respect-resource-limits branch from aa6f2b0 to 6c11a77 on January 19, 2026 23:01
Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
@bveeramani bveeramani enabled auto-merge (squash) January 20, 2026 19:50
@github-actions github-actions bot disabled auto-merge January 20, 2026 19:50
@github-actions github-actions bot added the go (add ONLY when ready to merge, run all tests) label on Jan 20, 2026
@bveeramani bveeramani enabled auto-merge (squash) January 20, 2026 20:22
@bveeramani bveeramani merged commit 55b2d08 into master Jan 20, 2026
7 checks passed
@bveeramani bveeramani deleted the marwan-autoscaling-respect-resource-limits branch January 20, 2026 21:15
jinbum-kim pushed a commit to jinbum-kim/ray that referenced this pull request Jan 29, 2026
zzchun pushed a commit to antgroup/ant-ray that referenced this pull request Jan 29, 2026
400Ping pushed a commit to 400Ping/ray that referenced this pull request Feb 1, 2026
ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request Feb 3, 2026
zzchun pushed a commit to antgroup/ant-ray that referenced this pull request Feb 5, 2026

Labels

  • data (Ray Data-related issues)
  • go (add ONLY when ready to merge, run all tests)


Development

Successfully merging this pull request may close these issues.

[Data] Fix cluster autoscaler v2 utilization calculation when resource_limits is set

3 participants