
[Serve] Fix app-level autoscaling policy state: cross deployment contamination and state loss for skipped deployments#62484

Merged
abrarsheikh merged 4 commits into ray-project:master from vaishdho1:fix-app-level-autoscaling-policy-state
Apr 17, 2026

Conversation

Contributor

@vaishdho1 vaishdho1 commented Apr 9, 2026

Description

Changes to fix #62482

  • autoscaling_policy.py:
    a) Copies the user state before merging.
    b) Calls _merge_user_state_with_internal_state for skipped deployments instead of returning the raw internal state.
  • test_autoscaling_policy.py: Adds a unit test verifying that two deployments with different internal state don't cross-contaminate when the user policy returns a shared dict.
  • test_application_state.py: Adds a unit test verifying that user state persists across control loop iterations for a deployment the policy skips.
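
The contamination mode being fixed can be sketched independently of Serve. A minimal sketch, assuming a user policy that returns one shared dict for every deployment (all names below are illustrative, not the actual Serve internals): merging internal keys into that dict in place corrupts the other deployments' state, while copying before the merge keeps each deployment isolated.

```python
# Hypothetical helpers illustrating the bug and the fix; not Serve code.

def merge_in_place(user_state, internal_state):
    # Buggy variant: mutates the dict the policy handed back.
    user_state.update(internal_state)
    return user_state

def merge_with_copy(user_state, internal_state):
    # Fixed variant: each deployment works on its own copy.
    merged = user_state.copy()
    merged.update(internal_state)
    return merged

# The policy returns the SAME object for both deployments d1 and d2.
shared = {"counter": 5}

# Buggy path: d1's internal keys bleed into d2's state (and vice versa).
buggy_d1 = merge_in_place(shared, {"decision_counter": 4})
buggy_d2 = merge_in_place(shared, {"decision_counter": 1})
assert buggy_d1 is buggy_d2          # one corrupted object, not two states
assert buggy_d1["decision_counter"] == 1  # d1's value was overwritten

# Fixed path: the copies stay independent.
shared = {"counter": 5}
fixed_d1 = merge_with_copy(shared, {"decision_counter": 4})
fixed_d2 = merge_with_copy(shared, {"decision_counter": 1})
assert fixed_d1["decision_counter"] == 4
assert fixed_d2["decision_counter"] == 1
```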

Related issues

Fixes: #62482
Related to: #58857

Signed-off-by: Vaishnavi Panchavati <vaishdho10@gmail.com>
@vaishdho1 vaishdho1 requested a review from a team as a code owner April 9, 2026 22:02

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request updates the autoscaling policy logic to ensure that custom policy states are correctly copied and merged, preventing cross-deployment contamination. The changes include updates to the autoscaling policy implementation and the addition of unit tests to verify state isolation and persistence for skipped deployments. Feedback highlights a potential regression regarding state loss for skipped deployments and suggests defensive handling for cases where the user policy returns None. Additionally, it is recommended to expand the unit tests to explicitly verify that user-provided state keys are preserved during the merge process.

Comment thread: python/ray/serve/autoscaling_policy.py (outdated), lines +298 to +306
            custom_policy_state_per_deployment = updated_custom_policy_state.get(
                dep_id, {}
            ).copy()
            if dep_id not in desired_num_replicas_dict:
                final_state[dep_id] = _merge_user_state_with_internal_state(
                    state_per_deployment[dep_id],
                    custom_policy_state_per_deployment,
                )
                continue

High severity

The current implementation introduces a regression where user-defined state is lost for skipped deployments if the policy does not explicitly return it in the state dictionary. In the previous version, the entire state (including user keys) was preserved for skipped deployments.

Additionally, calling .copy() on the result of .get(dep_id, {}) will raise an AttributeError if the user policy explicitly returns None as the state value for a deployment.

It is recommended to fall back to the previous state when a deployment is skipped and no new state is provided, while also handling potential None values defensively.

            user_state = updated_custom_policy_state.get(dep_id)
            if dep_id not in desired_num_replicas_dict and user_state is None:
                # If the policy skipped this deployment and didn't provide state,
                # fall back to the previous state to avoid losing user data.
                final_state[dep_id] = state_per_deployment[dep_id]
                continue

            # Ensure we have a dictionary and work on a copy to avoid
            # cross-deployment contamination.
            custom_policy_state_per_deployment = (
                user_state if isinstance(user_state, dict) else {}
            ).copy()
            if dep_id not in desired_num_replicas_dict:
                final_state[dep_id] = _merge_user_state_with_internal_state(
                    state_per_deployment[dep_id],
                    custom_policy_state_per_deployment,
                )
                continue

Contributor Author


> The current implementation introduces a regression where user-defined state is lost for skipped deployments if the policy does not explicitly return it in the state dictionary. In the previous version, the entire state (including user keys) was preserved for skipped deployments.

This is the expected behavior. We only want to pass back whatever the user passes as part of the policy in the current control loop.

> Additionally, calling .copy() on the result of .get(dep_id, {}) will raise an AttributeError if the user policy explicitly returns None as the state value for a deployment.

Fixed in 9c347f5.

Comment on lines +1561 to +1567
assert final_state[d1][SERVE_AUTOSCALING_DECISION_COUNTERS_KEY] == 4
assert final_state[d1][SERVE_AUTOSCALING_DECISION_TIMESTAMP_KEY] == fake_now

# d2 had counter=0, timestamp=None. Delay logic sees scale-up,
# increments counter to 1, sets timestamp to fake_now.
assert final_state[d2][SERVE_AUTOSCALING_DECISION_COUNTERS_KEY] == 1
assert final_state[d2][SERVE_AUTOSCALING_DECISION_TIMESTAMP_KEY] == fake_now

Medium severity

The test verifies that internal state keys (counters and timestamps) are isolated, but it should also verify that the user-provided state keys are correctly preserved after the merge operation. This ensures that _merge_user_state_with_internal_state doesn't accidentally drop user data.

Suggested change:

  assert final_state[d1][SERVE_AUTOSCALING_DECISION_COUNTERS_KEY] == 4
  assert final_state[d1][SERVE_AUTOSCALING_DECISION_TIMESTAMP_KEY] == fake_now
+ assert final_state[d1]["counter"] == 5
  # d2 had counter=0, timestamp=None. Delay logic sees scale-up,
  # increments counter to 1, sets timestamp to fake_now.
  assert final_state[d2][SERVE_AUTOSCALING_DECISION_COUNTERS_KEY] == 1
  assert final_state[d2][SERVE_AUTOSCALING_DECISION_TIMESTAMP_KEY] == fake_now
+ assert final_state[d2]["counter"] == 5
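
The property the reviewer wants covered can be shown with a toy merge. A minimal sketch (an illustrative helper, not the real _merge_user_state_with_internal_state, and with hypothetical key names): internal bookkeeping keys are layered onto the user's dict without dropping user-provided keys such as "counter".

```python
# Toy illustration of the merge property under test: internal keys are
# overlaid on the user state, and user keys survive the merge.

INTERNAL_KEYS = ("decision_counter", "decision_timestamp")  # hypothetical names

def merge_user_state_with_internal_state(internal_state, user_state):
    merged = user_state.copy()  # never mutate the policy's dict
    for key in INTERNAL_KEYS:
        if key in internal_state:
            merged[key] = internal_state[key]
    return merged

merged = merge_user_state_with_internal_state(
    {"decision_counter": 4, "decision_timestamp": 123.0},
    {"counter": 5},
)
assert merged["counter"] == 5            # user key preserved
assert merged["decision_counter"] == 4   # internal key carried over
assert merged["decision_timestamp"] == 123.0
```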

Contributor Author


Added in 9c347f5

Signed-off-by: Vaishnavi Panchavati <vaishdho10@gmail.com>
Comment thread python/ray/serve/tests/unit/test_autoscaling_policy.py Outdated
Signed-off-by: Vaishnavi Panchavati <vaishdho10@gmail.com>

@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


Reviewed by Cursor Bugbot for commit 536db88.

app_state_manager.update()
# The scaling decisions will not contain d1
assert d1_id not in deployment_state_manager._scaling_decisions
assert deployment_state_manager._scaling_decisions[d2_id] == 3

Test asserts wrong scaling decision due to delay logic

Low Severity

The assertion deployment_state_manager._scaling_decisions[d2_id] == 3 is likely incorrect. The policy is wrapped by _apply_app_level_autoscaling_config, which applies delay logic with default upscale_delay_s=30.0. Since the test does not patch time (unlike test_app_level_autoscaling_with_decorator_applies_delays), the delay never elapses, and _apply_delay_logic returns curr_target_num_replicas (1) rather than the desired value (3). The state persistence assertions are correct, but this scaling decision check appears wrong.



Contributor Author


This is taken care of in the _create_app_config helper, which resets all delay values to 0.0, so this asserts the right scaling decision.
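
The delay gating behind this exchange can be sketched like this. A simplification, not the actual _apply_delay_logic implementation (the function name, signature, and parameter names here are assumptions): with a nonzero upscale delay the decision sticks at the current target until the delay elapses, while a 0.0 delay passes the desired value through immediately, which is why resetting delays to 0.0 in the test makes the decision 3.

```python
# Simplified sketch of upscale-delay gating: a scale-up only takes effect
# once it has been pending for at least upscale_delay_s seconds.

def apply_delay_logic(desired, current, upscale_delay_s, seconds_pending):
    if desired > current and seconds_pending < upscale_delay_s:
        return current  # delay not elapsed: keep the current target
    return desired

# Default 30s delay, time not advanced: decision stays at the current 1.
assert apply_delay_logic(desired=3, current=1,
                         upscale_delay_s=30.0, seconds_pending=0.0) == 1

# Delay reset to 0.0 (as the test's config helper does): decision is 3.
assert apply_delay_logic(desired=3, current=1,
                         upscale_delay_s=0.0, seconds_pending=0.0) == 3
```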

@ray-gardener ray-gardener bot added serve Ray Serve Related Issue community-contribution Contributed by the community labels Apr 10, 2026
@abrarsheikh abrarsheikh added the go add ONLY when ready to merge, run all tests label Apr 11, 2026
@abrarsheikh abrarsheikh merged commit db51527 into ray-project:master Apr 17, 2026
8 checks passed

Labels

community-contribution Contributed by the community go add ONLY when ready to merge, run all tests serve Ray Serve Related Issue

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Serve] App-level policy state loss and shared dict corruption (follow-up to #58857)

2 participants