
[Data] Fix autoscaler to respect user-configured resource limits #60283

Merged

bveeramani merged 13 commits into master from marwan-autoscaling-respect-resource-limits on Jan 20, 2026

Conversation

@marwan116 marwan116 (Contributor) commented Jan 19, 2026

Summary

  • Fix Ray Data's cluster autoscalers (V1 and V2) to respect user-configured resource_limits set via ExecutionOptions
  • Cap autoscaling resource requests so they do not exceed user-specified CPU and GPU limits
  • Update get_total_resources() to return the minimum of cluster resources and user limits

Why are these changes needed?

Previously, Ray Data's cluster autoscalers did not respect user-configured resource limits. When a user set explicit limits like:

```python
ctx = ray.data.DataContext.get_current()
ctx.execution_options.resource_limits = ExecutionResources(cpu=8)
```

the autoscaler would ignore these limits and continue to request more cluster resources from Ray's autoscaler, causing unnecessary node upscaling even when the executor couldn't use the additional resources.

This was problematic because:

  1. Users explicitly setting resource limits expect Ray Data to stay within those bounds
  2. Unnecessary cluster scaling wastes cloud resources and money
  3. The ResourceManager.get_global_limits() already respects user limits, but the autoscaler bypassed this by requesting resources directly (the fix caps those requests at the user limits, as sketched below)
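
In essence, the fix clamps what the autoscaler asks for to the user-configured limits before the request reaches Ray's cluster autoscaler. The following is a minimal, self-contained sketch of that capping logic; the helper names are illustrative and not the PR's actual code:

```python
import math
from typing import Optional, Tuple


def _cap(requested: Optional[float], limit: Optional[float]) -> Optional[float]:
    """Return the requested amount clamped to the user limit (None/inf means unlimited)."""
    if limit is None or math.isinf(limit):
        return requested
    if requested is None:
        return limit
    return min(requested, limit)


def cap_resource_request(
    requested_cpu: Optional[float],
    requested_gpu: Optional[float],
    limit_cpu: Optional[float] = None,
    limit_gpu: Optional[float] = None,
) -> Tuple[Optional[float], Optional[float]]:
    """Clamp a (cpu, gpu) autoscaling request to user-configured resource limits."""
    return _cap(requested_cpu, limit_cpu), _cap(requested_gpu, limit_gpu)


# With resource_limits = ExecutionResources(cpu=8), a request for 32 CPUs and
# 4 GPUs is capped to (8, 4): the CPU limit applies, the GPU request passes through.
assert cap_resource_request(32, 4, limit_cpu=8) == (8, 4)
```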

Test Plan

Added comprehensive unit tests for both autoscaler implementations (an illustrative example of the kind of check follows).
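
For illustration only, here is a pytest-style sketch of the kind of assertion such a test could make, using a stand-in cap() helper rather than the actual autoscaler classes or the PR's real test code:

```python
# Illustrative pytest-style checks: an autoscaling request above the user CPU
# limit is clamped to that limit; requests under the limit, or with no limit
# configured, pass through unchanged.
def cap(requested: float, limit: float = float("inf")) -> float:
    return min(requested, limit)


def test_request_is_capped_to_user_cpu_limit():
    assert cap(32, limit=8) == 8    # capped to the user-configured limit
    assert cap(4, limit=8) == 4     # requests under the limit are untouched
    assert cap(16) == 16            # no limit configured -> request unchanged
```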

Related issue number

Fixes #60085

Checks

  • I've signed off every commit
  • I've run scripts/format.sh to lint the changes in this PR
  • I've included any doc changes needed
  • I've added any new tests if needed

@marwan116 marwan116 changed the title from "Update autoscaling to respect resource limits" to "[Data] Update autoscaling to respect resource limits" on Jan 19, 2026
@marwan116 marwan116 changed the title from "[Data] Update autoscaling to respect resource limits" to "[Data] Fix autoscaler to respect user-configured resource limits" on Jan 19, 2026
@gemini-code-assist gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request updates the autoscaling logic to respect user-configured resource limits for CPU and GPU. It introduces capping for resource requests in both DefaultClusterAutoscaler and DefaultClusterAutoscalerV2. The changes are well-tested. I've identified an opportunity to refactor duplicated code between the two autoscaler implementations. Additionally, this PR includes several version bumps across the codebase, likely for a new release.

@marwan116 marwan116 force-pushed the marwan-autoscaling-respect-resource-limits branch 5 times, most recently from 3c0ba89 to bd4088a on January 19, 2026 02:18

@marwan116 marwan116 force-pushed the marwan-autoscaling-respect-resource-limits branch from bd4088a to 91b9bc5 on January 19, 2026 02:43
Signed-off-by: Marwan Sarieddine <sarieddine.marwan@gmail.com>
@marwan116 marwan116 force-pushed the marwan-autoscaling-respect-resource-limits branch from 91b9bc5 to 48304b6 on January 19, 2026 02:48

```python
def _get_resource_limits(self) -> ExecutionResources:
    """Get user-configured resource limits from execution options."""
    return self._resource_manager._options.resource_limits
```
A reviewer (Member) commented:

Same comment as above -- rather than breaking abstraction barriers to get this information, can we explicitly pass it in as a dependency?

@marwan116 marwan116 (Contributor, Author) replied on Jan 19, 2026:

agreed - should now be resolved
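
For context, a hedged sketch of the shape the reviewer is suggesting: the limits are handed to the autoscaler as an explicit constructor argument instead of being read through the resource manager's private options. Class and parameter names here are illustrative, not the real Ray Data API:

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class ResourceLimits:
    """User-configured caps; None means unlimited."""
    cpu: Optional[float] = None
    gpu: Optional[float] = None


class ClusterAutoscalerSketch:
    def __init__(self, resource_limits: ResourceLimits):
        # The limits arrive as an explicit dependency; nothing here reaches
        # into resource_manager._options.
        self._resource_limits = resource_limits

    def request_resources(self, cpu: float, gpu: float) -> Tuple[float, float]:
        """Return the (cpu, gpu) request after applying the user limits."""
        limits = self._resource_limits
        capped_cpu = cpu if limits.cpu is None else min(cpu, limits.cpu)
        capped_gpu = gpu if limits.gpu is None else min(gpu, limits.gpu)
        return capped_cpu, capped_gpu


autoscaler = ClusterAutoscalerSketch(ResourceLimits(cpu=8))
assert autoscaler.request_resources(cpu=32, gpu=4) == (8, 4)
```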

@aslonnie aslonnie removed request for a team January 19, 2026 07:04
@ray-gardener ray-gardener bot added the data (Ray Data-related issues) label on Jan 19, 2026
…f resource_manager

Signed-off-by: Marwan Sarieddine <sarieddine.marwan@gmail.com>
…imits

Signed-off-by: Marwan Sarieddine <sarieddine.marwan@gmail.com>
@marwan116 marwan116 force-pushed the marwan-autoscaling-respect-resource-limits branch from c412bd2 to 72e734b on January 19, 2026 18:10
Signed-off-by: Marwan Sarieddine <sarieddine.marwan@gmail.com>
Signed-off-by: Marwan Sarieddine <sarieddine.marwan@gmail.com>
…thod

Signed-off-by: Marwan Sarieddine <sarieddine.marwan@gmail.com>
Signed-off-by: Marwan Sarieddine <sarieddine.marwan@gmail.com>
@marwan116 marwan116 force-pushed the marwan-autoscaling-respect-resource-limits branch from 72e734b to d0d56db on January 19, 2026 18:14
Signed-off-by: Marwan Sarieddine <sarieddine.marwan@gmail.com>

@marwan116 marwan116 force-pushed the marwan-autoscaling-respect-resource-limits branch from 4eab5ec to 30fc340 on January 19, 2026 18:48
@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

…nd refactor

Signed-off-by: Marwan Sarieddine <sarieddine.marwan@gmail.com>
@marwan116 marwan116 force-pushed the marwan-autoscaling-respect-resource-limits branch from 30fc340 to e7eafff on January 19, 2026 18:59
@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Signed-off-by: Marwan Sarieddine <sarieddine.marwan@gmail.com>
@marwan116 marwan116 force-pushed the marwan-autoscaling-respect-resource-limits branch from b90c981 to c32122f on January 19, 2026 20:13
@marwan116 marwan116 requested a review from bveeramani January 19, 2026 21:35
…ending requests

Signed-off-by: Marwan Sarieddine <sarieddine.marwan@gmail.com>
@marwan116 marwan116 force-pushed the marwan-autoscaling-respect-resource-limits branch from aa6f2b0 to 6c11a77 on January 19, 2026 23:01
Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
@bveeramani bveeramani enabled auto-merge (squash) January 20, 2026 19:50
@github-actions github-actions bot disabled auto-merge January 20, 2026 19:50
@github-actions github-actions bot added the go (add ONLY when ready to merge, run all tests) label on Jan 20, 2026
@bveeramani bveeramani enabled auto-merge (squash) January 20, 2026 20:22
@bveeramani bveeramani merged commit 55b2d08 into master Jan 20, 2026
7 checks passed
@bveeramani bveeramani deleted the marwan-autoscaling-respect-resource-limits branch January 20, 2026 21:15
jinbum-kim pushed a commit to jinbum-kim/ray that referenced this pull request Jan 29, 2026
zzchun pushed a commit to antgroup/ant-ray that referenced this pull request Jan 29, 2026
400Ping pushed a commit to 400Ping/ray that referenced this pull request Feb 1, 2026
ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request Feb 3, 2026
zzchun pushed a commit to antgroup/ant-ray that referenced this pull request Feb 5, 2026

Labels

  • data (Ray Data-related issues)
  • go (add ONLY when ready to merge, run all tests)


Development

Successfully merging this pull request may close these issues.

[Data] Fix cluster autoscaler v2 utilization calculation when resource_limits is set

3 participants