[Serve] Fix app-level autoscaling policy state: cross-deployment contamination and state loss for skipped deployments #62484
Conversation
Signed-off-by: Vaishnavi Panchavati <vaishdho10@gmail.com>
Code Review
This pull request updates the autoscaling policy logic to ensure that custom policy states are correctly copied and merged, preventing cross-deployment contamination. The changes include updates to the autoscaling policy implementation and the addition of unit tests to verify state isolation and persistence for skipped deployments. Feedback highlights a potential regression regarding state loss for skipped deployments and suggests defensive handling for cases where the user policy returns None. Additionally, it is recommended to expand the unit tests to explicitly verify that user-provided state keys are preserved during the merge process.
```python
custom_policy_state_per_deployment = updated_custom_policy_state.get(
    dep_id, {}
).copy()
if dep_id not in desired_num_replicas_dict:
    final_state[dep_id] = _merge_user_state_with_internal_state(
        state_per_deployment[dep_id],
        custom_policy_state_per_deployment,
    )
    continue
```
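To make the aliasing hazard concrete, here is a self-contained sketch (names are illustrative, not Serve's actual state layout) of why the `.copy()` matters: without it, the dict returned by `.get()` aliases the dict stored in the policy state, so a downstream mutation leaks back into it.

```python
# Without .copy(): the local variable aliases the stored dict, so mutating
# it contaminates the policy state shared across control-loop iterations.
updated_custom_policy_state = {"app1:dep": {"counter": 1}}
aliased = updated_custom_policy_state.get("app1:dep", {})
aliased["internal_key"] = "leaked"
assert "internal_key" in updated_custom_policy_state["app1:dep"]  # contaminated

# With .copy(): mutations stay local to this deployment's merge step.
updated_custom_policy_state = {"app1:dep": {"counter": 1}}
isolated = updated_custom_policy_state.get("app1:dep", {}).copy()
isolated["internal_key"] = "local only"
assert "internal_key" not in updated_custom_policy_state["app1:dep"]  # untouched
```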
The current implementation introduces a regression where user-defined state is lost for skipped deployments if the policy does not explicitly return it in the state dictionary. In the previous version, the entire state (including user keys) was preserved for skipped deployments.
Additionally, calling .copy() on the result of .get(dep_id, {}) will raise an AttributeError if the user policy explicitly returns None as the state value for a deployment.
It is recommended to fallback to the previous state when a deployment is skipped and no new state is provided, while also handling potential None values defensively.
```python
user_state = updated_custom_policy_state.get(dep_id)
if dep_id not in desired_num_replicas_dict and user_state is None:
    # If the policy skipped this deployment and didn't provide state,
    # fallback to the previous state to avoid losing user data.
    final_state[dep_id] = state_per_deployment[dep_id]
    continue
# Ensure we have a dictionary and work on a copy to avoid
# cross-deployment contamination.
custom_policy_state_per_deployment = (
    user_state if isinstance(user_state, dict) else {}
).copy()
if dep_id not in desired_num_replicas_dict:
    final_state[dep_id] = _merge_user_state_with_internal_state(
        state_per_deployment[dep_id],
        custom_policy_state_per_deployment,
    )
    continue
```
> The current implementation introduces a regression where user-defined state is lost for skipped deployments if the policy does not explicitly return it in the state dictionary. In the previous version, the entire state (including user keys) was preserved for skipped deployments.
This is the expected behavior. We only want to pass back whatever the user passes as part of the policy in the current control loop.
> Additionally, calling `.copy()` on the result of `.get(dep_id, {})` will raise an `AttributeError` if the user policy explicitly returns `None` as the state value for a deployment.
Fixed in 9c347f5.
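The quoted hazard reproduces in isolation: `dict.get` only falls back to the default when the key is *absent*, so an explicitly stored `None` slips past the `{}` default and breaks `.copy()`. A minimal, self-contained repro (illustrative names):

```python
# dict.get returns the stored None, not the {} default, when the key exists.
updated_custom_policy_state = {"dep": None}  # policy explicitly returned None

raised = False
try:
    updated_custom_policy_state.get("dep", {}).copy()  # None.copy()
except AttributeError:
    raised = True
assert raised

# Defensive form from the suggestion above: normalize to a dict first.
user_state = updated_custom_policy_state.get("dep")
safe = (user_state if isinstance(user_state, dict) else {}).copy()
assert safe == {}
```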
```python
assert final_state[d1][SERVE_AUTOSCALING_DECISION_COUNTERS_KEY] == 4
assert final_state[d1][SERVE_AUTOSCALING_DECISION_TIMESTAMP_KEY] == fake_now

# d2 had counter=0, timestamp=None. Delay logic sees scale-up,
# increments counter to 1, sets timestamp to fake_now.
assert final_state[d2][SERVE_AUTOSCALING_DECISION_COUNTERS_KEY] == 1
assert final_state[d2][SERVE_AUTOSCALING_DECISION_TIMESTAMP_KEY] == fake_now
```
The test verifies that internal state keys (counters and timestamps) are isolated, but it should also verify that the user-provided state keys are correctly preserved after the merge operation. This ensures that _merge_user_state_with_internal_state doesn't accidentally drop user data.
```diff
 assert final_state[d1][SERVE_AUTOSCALING_DECISION_COUNTERS_KEY] == 4
 assert final_state[d1][SERVE_AUTOSCALING_DECISION_TIMESTAMP_KEY] == fake_now
+assert final_state[d1]["counter"] == 5
 # d2 had counter=0, timestamp=None. Delay logic sees scale-up,
 # increments counter to 1, sets timestamp to fake_now.
 assert final_state[d2][SERVE_AUTOSCALING_DECISION_COUNTERS_KEY] == 1
 assert final_state[d2][SERVE_AUTOSCALING_DECISION_TIMESTAMP_KEY] == fake_now
+assert final_state[d2]["counter"] == 5
```
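The property the suggestion above checks can be stated as a small standalone sketch. The merge function here is a stand-in for `_merge_user_state_with_internal_state`, not Serve's actual helper; it only illustrates that both internal bookkeeping keys and user-provided keys must survive the merge.

```python
# Stand-in for the real merge helper: user keys first, internal keys on top.
def merge_user_state_with_internal_state(internal: dict, user: dict) -> dict:
    merged = user.copy()       # start from the user's state
    merged.update(internal)    # layer internal bookkeeping keys on top
    return merged

internal = {"decision_counter": 1, "decision_timestamp": 123.0}
user = {"counter": 5}
final = merge_user_state_with_internal_state(internal, user)
assert final["counter"] == 5            # user key preserved
assert final["decision_counter"] == 1   # internal key preserved
```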
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 536db88.
```python
app_state_manager.update()
# The scaling decisions will not contain d1
assert d1_id not in deployment_state_manager._scaling_decisions
assert deployment_state_manager._scaling_decisions[d2_id] == 3
```
Test asserts wrong scaling decision due to delay logic
Low Severity
The assertion deployment_state_manager._scaling_decisions[d2_id] == 3 is likely incorrect. The policy is wrapped by _apply_app_level_autoscaling_config, which applies delay logic with default upscale_delay_s=30.0. Since the test does not patch time (unlike test_app_level_autoscaling_with_decorator_applies_delays), the delay never elapses, and _apply_delay_logic returns curr_target_num_replicas (1) rather than the desired value (3). The state persistence assertions are correct, but this scaling decision check appears wrong.
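The delay behavior described here can be sketched in isolation (illustrative names, not Serve's internals): with a nonzero upscale delay and unpatched time, the wrapped policy keeps returning the current target rather than the desired one, and only returns the desired value once the delay has elapsed or the delay is zero.

```python
# Simplified model of upscale-delay gating: hold the current target until
# the desired value has been sustained for at least upscale_delay_s.
def apply_delay_logic(curr: int, desired: int, elapsed_s: float,
                      upscale_delay_s: float = 30.0) -> int:
    if desired > curr and elapsed_s < upscale_delay_s:
        return curr  # delay not yet elapsed: hold at the current target
    return desired

assert apply_delay_logic(curr=1, desired=3, elapsed_s=0.0) == 1   # held back
assert apply_delay_logic(curr=1, desired=3, elapsed_s=31.0) == 3  # delay elapsed
assert apply_delay_logic(curr=1, desired=3, elapsed_s=0.0, upscale_delay_s=0.0) == 3
```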
This is taken care of in the `_create_app_config` helper, which resets all delay values to 0.0, so this asserts the right scaling decision.


Description
Changes to fix #62482
a) Copies the user state before merging.
b) Calls `_merge_user_state_with_internal_state` for skipped deployments instead of returning raw internal state.

Related issues
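The two fixes above can be sketched together in a self-contained example (the merge helper and dict shapes here are illustrative stand-ins, not Serve's actual implementation): (a) the user's returned dict is copied before being mutated by the merge, and (b) deployments the policy skipped still go through the merge so their internal bookkeeping is not lost.

```python
def merge_user_with_internal(internal: dict, user: dict) -> dict:
    merged = user.copy()      # (a) never mutate the policy's returned dict
    merged.update(internal)   # internal bookkeeping keys layered on top
    return merged

internal_state = {"dep1": {"decision_counter": 2}, "dep2": {"decision_counter": 0}}
user_state = {"dep1": {"counter": 5}}  # policy skipped dep2 entirely

final = {}
for dep, internal in internal_state.items():
    user = user_state.get(dep) or {}   # defensive: tolerate missing/None state
    # (b) skipped deployments still go through the merge, so internal
    # counters/timestamps survive even without new user state.
    final[dep] = merge_user_with_internal(internal, user)

assert final["dep1"] == {"counter": 5, "decision_counter": 2}
assert final["dep2"] == {"decision_counter": 0}
```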