[Data] Fix autoscaler to respect user-configured resource limits #60283

Merged

bveeramani merged 13 commits into master on Jan 20, 2026
Conversation
Contributor
Code Review
This pull request updates the autoscaling logic to respect user-configured resource limits for CPU and GPU. It introduces capping of resource requests in both DefaultClusterAutoscaler and DefaultClusterAutoscalerV2. The changes are well tested. I've identified an opportunity to refactor duplicated code between the two autoscaler implementations. Additionally, this PR includes several version bumps across the codebase, likely for a new release.
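For orientation (this is not the PR's actual code), here is a minimal, self-contained sketch of the capping behavior described above; the class and function names are made up and stand in for Ray Data's `ExecutionResources`:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Resources:
    """Simplified stand-in for Ray Data's ExecutionResources (illustrative only)."""

    cpu: Optional[float] = None  # None means "no value / no limit configured"
    gpu: Optional[float] = None


def cap_request(requested: Resources, limits: Resources) -> Resources:
    """Clamp an autoscaling resource request so it never exceeds user limits."""

    def _cap(value: Optional[float], limit: Optional[float]) -> Optional[float]:
        if value is None or limit is None:
            return value  # nothing requested, or no limit configured
        return min(value, limit)

    return Resources(cpu=_cap(requested.cpu, limits.cpu),
                     gpu=_cap(requested.gpu, limits.gpu))


# A request for 32 CPUs and 2 GPUs gets capped at the user's CPU limit of 8.
print(cap_request(Resources(cpu=32, gpu=2), Resources(cpu=8)))
# -> Resources(cpu=8, gpu=2)
```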
bveeramani reviewed on Jan 19, 2026
```python
def _get_resource_limits(self) -> ExecutionResources:
    """Get user-configured resource limits from execution options."""
    return self._resource_manager._options.resource_limits
```
Member
Same comment as above -- rather than breaking abstraction barriers to get this information, can we explicitly pass it in as a dependency?
Contributor
Author
agreed - should now be resolved
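To illustrate the dependency-injection approach agreed on here, a rough sketch (class and parameter names are hypothetical, not Ray Data's actual API): instead of reading `self._resource_manager._options.resource_limits`, the caller passes the limits in when constructing the autoscaler.

```python
class AutoscalerSketch:
    """Illustrative only: receives the user-configured limits as an explicit
    dependency rather than reaching into ResourceManager internals."""

    def __init__(self, execution_resource_limits):
        # `execution_resource_limits` would be an ExecutionResources-like object
        # derived from ExecutionOptions.resource_limits by the caller.
        self._execution_resource_limits = execution_resource_limits

    def _get_resource_limits(self):
        return self._execution_resource_limits


# Hypothetical wiring at construction time (the caller owns the options object):
# autoscaler = AutoscalerSketch(execution_resource_limits=options.resource_limits)
```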
bveeramani approved these changes on Jan 20, 2026
jinbum-kim pushed a commit to jinbum-kim/ray that referenced this pull request on Jan 29, 2026
zzchun pushed a commit to antgroup/ant-ray that referenced this pull request on Jan 29, 2026
400Ping pushed a commit to 400Ping/ray that referenced this pull request on Feb 1, 2026
ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request on Feb 3, 2026
zzchun pushed a commit to antgroup/ant-ray that referenced this pull request on Feb 5, 2026
## Summary
- Fix Ray Data's cluster autoscalers (V1 and V2) to respect user-configured `resource_limits` set via `ExecutionOptions`
- Cap autoscaling resource requests to not exceed user-specified CPU and GPU limits
- Update `get_total_resources()` to return the minimum of cluster resources and user limits

## Why are these changes needed?
Previously, Ray Data's cluster autoscalers did not respect user-configured resource limits. When a user set explicit limits like:

```python
ctx = ray.data.DataContext.get_current()
ctx.execution_options.resource_limits = ExecutionResources(cpu=8)
```

the autoscaler would ignore these limits and continue to request more cluster resources from Ray's autoscaler, causing unnecessary node upscaling even when the executor couldn't use the additional resources.

This was problematic because:
1. Users explicitly setting resource limits expect Ray Data to stay within those bounds
2. Unnecessary cluster scaling wastes cloud resources and money
3. `ResourceManager.get_global_limits()` already respects user limits, but the autoscaler bypassed this by requesting resources directly

## Test Plan
Added comprehensive unit tests for both autoscaler implementations
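The PR's actual tests aren't reproduced on this page; as a rough, self-contained illustration of the behavior they are described as covering (all names below are made up):

```python
def cap_cpu(requested_cpu: float, cpu_limit: float) -> float:
    """Minimal stand-in for the request-capping behavior under test."""
    return min(requested_cpu, cpu_limit)


def test_autoscaler_respects_user_cpu_limit():
    # The user configures a limit of 8 CPUs; the autoscaler would otherwise ask for 32.
    assert cap_cpu(requested_cpu=32, cpu_limit=8) == 8


def test_request_below_limit_is_unchanged():
    # Requests that already fit within the limit pass through untouched.
    assert cap_cpu(requested_cpu=4, cpu_limit=8) == 4
```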
## Related issue number
Fixes #60085

## Checks
- [x] I've signed off every commit
- [x] I've run `scripts/format.sh` to lint the changes in this PR
- [x] I've included any doc changes needed
- [x] I've added any new tests if needed