[Serve] Fix app-level autoscaling policy state: cross-deployment contamination and state loss for skipped deployments #62484
Conversation
Signed-off-by: Vaishnavi Panchavati <vaishdho10@gmail.com>
Code Review
This pull request updates the autoscaling policy logic to ensure that custom policy states are correctly copied and merged, preventing cross-deployment contamination. The changes include updates to the autoscaling policy implementation and the addition of unit tests to verify state isolation and persistence for skipped deployments. Feedback highlights a potential regression regarding state loss for skipped deployments and suggests defensive handling for cases where the user policy returns None. Additionally, it is recommended to expand the unit tests to explicitly verify that user-provided state keys are preserved during the merge process.
```python
custom_policy_state_per_deployment = updated_custom_policy_state.get(
    dep_id, {}
).copy()
if dep_id not in desired_num_replicas_dict:
    final_state[dep_id] = _merge_user_state_with_internal_state(
        state_per_deployment[dep_id],
        custom_policy_state_per_deployment,
    )
    continue
```
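To make the aliasing hazard concrete, here is a self-contained sketch (names are illustrative, not Serve's actual state layout) of why the `.copy()` matters: without it, the dict returned by `.get()` aliases the dict stored in the policy state, so a downstream mutation leaks back into it.

```python
# Without .copy(): the local variable aliases the stored dict, so mutating
# it contaminates the policy state shared across control-loop iterations.
updated_custom_policy_state = {"app1:dep": {"counter": 1}}
aliased = updated_custom_policy_state.get("app1:dep", {})
aliased["internal_key"] = "leaked"
assert "internal_key" in updated_custom_policy_state["app1:dep"]  # contaminated

# With .copy(): mutations stay local to this deployment's merge step.
updated_custom_policy_state = {"app1:dep": {"counter": 1}}
isolated = updated_custom_policy_state.get("app1:dep", {}).copy()
isolated["internal_key"] = "local only"
assert "internal_key" not in updated_custom_policy_state["app1:dep"]  # untouched
```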
The current implementation introduces a regression where user-defined state is lost for skipped deployments if the policy does not explicitly return it in the state dictionary. In the previous version, the entire state (including user keys) was preserved for skipped deployments.
Additionally, calling .copy() on the result of .get(dep_id, {}) will raise an AttributeError if the user policy explicitly returns None as the state value for a deployment.
It is recommended to fallback to the previous state when a deployment is skipped and no new state is provided, while also handling potential None values defensively.
```python
user_state = updated_custom_policy_state.get(dep_id)
if dep_id not in desired_num_replicas_dict and user_state is None:
    # If the policy skipped this deployment and didn't provide state,
    # fallback to the previous state to avoid losing user data.
    final_state[dep_id] = state_per_deployment[dep_id]
    continue
# Ensure we have a dictionary and work on a copy to avoid
# cross-deployment contamination.
custom_policy_state_per_deployment = (
    user_state if isinstance(user_state, dict) else {}
).copy()
if dep_id not in desired_num_replicas_dict:
    final_state[dep_id] = _merge_user_state_with_internal_state(
        state_per_deployment[dep_id],
        custom_policy_state_per_deployment,
    )
    continue
```
> The current implementation introduces a regression where user-defined state is lost for skipped deployments if the policy does not explicitly return it in the state dictionary. In the previous version, the entire state (including user keys) was preserved for skipped deployments.
This is the expected behavior. We only want to pass back whatever the user passes as part of the policy in the current control loop.
> Additionally, calling `.copy()` on the result of `.get(dep_id, {})` will raise an `AttributeError` if the user policy explicitly returns `None` as the state value for a deployment.
Fixed in 9c347f5.
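The quoted hazard reproduces in isolation: `dict.get` only falls back to the default when the key is *absent*, so an explicitly stored `None` slips past the `{}` default and breaks `.copy()`. A minimal, self-contained repro (illustrative names):

```python
# dict.get returns the stored None, not the {} default, when the key exists.
updated_custom_policy_state = {"dep": None}  # policy explicitly returned None

raised = False
try:
    updated_custom_policy_state.get("dep", {}).copy()  # None.copy()
except AttributeError:
    raised = True
assert raised

# Defensive form from the suggestion above: normalize to a dict first.
user_state = updated_custom_policy_state.get("dep")
safe = (user_state if isinstance(user_state, dict) else {}).copy()
assert safe == {}
```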
```python
assert final_state[d1][SERVE_AUTOSCALING_DECISION_COUNTERS_KEY] == 4
assert final_state[d1][SERVE_AUTOSCALING_DECISION_TIMESTAMP_KEY] == fake_now

# d2 had counter=0, timestamp=None. Delay logic sees scale-up,
# increments counter to 1, sets timestamp to fake_now.
assert final_state[d2][SERVE_AUTOSCALING_DECISION_COUNTERS_KEY] == 1
assert final_state[d2][SERVE_AUTOSCALING_DECISION_TIMESTAMP_KEY] == fake_now
```
The test verifies that internal state keys (counters and timestamps) are isolated, but it should also verify that the user-provided state keys are correctly preserved after the merge operation. This ensures that _merge_user_state_with_internal_state doesn't accidentally drop user data.
```diff
 assert final_state[d1][SERVE_AUTOSCALING_DECISION_COUNTERS_KEY] == 4
 assert final_state[d1][SERVE_AUTOSCALING_DECISION_TIMESTAMP_KEY] == fake_now
+assert final_state[d1]["counter"] == 5
 # d2 had counter=0, timestamp=None. Delay logic sees scale-up,
 # increments counter to 1, sets timestamp to fake_now.
 assert final_state[d2][SERVE_AUTOSCALING_DECISION_COUNTERS_KEY] == 1
 assert final_state[d2][SERVE_AUTOSCALING_DECISION_TIMESTAMP_KEY] == fake_now
+assert final_state[d2]["counter"] == 5
```
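The property the suggestion above checks can be stated as a small standalone sketch. The merge function here is a stand-in for `_merge_user_state_with_internal_state`, not Serve's actual helper; it only illustrates that both internal bookkeeping keys and user-provided keys must survive the merge.

```python
# Stand-in for the real merge helper: user keys first, internal keys on top.
def merge_user_state_with_internal_state(internal: dict, user: dict) -> dict:
    merged = user.copy()       # start from the user's state
    merged.update(internal)    # layer internal bookkeeping keys on top
    return merged

internal = {"decision_counter": 1, "decision_timestamp": 123.0}
user = {"counter": 5}
final = merge_user_state_with_internal_state(internal, user)
assert final["counter"] == 5            # user key preserved
assert final["decision_counter"] == 1   # internal key preserved
```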
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 536db88.
```python
app_state_manager.update()
# The scaling decisions will not contain d1
assert d1_id not in deployment_state_manager._scaling_decisions
assert deployment_state_manager._scaling_decisions[d2_id] == 3
```
Test asserts wrong scaling decision due to delay logic
Low Severity
The assertion deployment_state_manager._scaling_decisions[d2_id] == 3 is likely incorrect. The policy is wrapped by _apply_app_level_autoscaling_config, which applies delay logic with default upscale_delay_s=30.0. Since the test does not patch time (unlike test_app_level_autoscaling_with_decorator_applies_delays), the delay never elapses, and _apply_delay_logic returns curr_target_num_replicas (1) rather than the desired value (3). The state persistence assertions are correct, but this scaling decision check appears wrong.
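The delay behavior described here can be sketched in isolation (illustrative names, not Serve's internals): with a nonzero upscale delay and unpatched time, the wrapped policy keeps returning the current target rather than the desired one, and only returns the desired value once the delay has elapsed or the delay is zero.

```python
# Simplified model of upscale-delay gating: hold the current target until
# the desired value has been sustained for at least upscale_delay_s.
def apply_delay_logic(curr: int, desired: int, elapsed_s: float,
                      upscale_delay_s: float = 30.0) -> int:
    if desired > curr and elapsed_s < upscale_delay_s:
        return curr  # delay not yet elapsed: hold at the current target
    return desired

assert apply_delay_logic(curr=1, desired=3, elapsed_s=0.0) == 1   # held back
assert apply_delay_logic(curr=1, desired=3, elapsed_s=31.0) == 3  # delay elapsed
assert apply_delay_logic(curr=1, desired=3, elapsed_s=0.0, upscale_delay_s=0.0) == 3
```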
This is taken care of in the `_create_app_config` helper, which resets all delay values to 0.0, so this asserts the right scaling decision.


Description
Changes to fix #62482
a) Copies the user state before merging.
b) Calls `_merge_user_state_with_internal_state` for skipped deployments instead of returning raw internal state.

Related issues
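The two fixes above can be sketched together in a self-contained example (the merge helper and dict shapes here are illustrative stand-ins, not Serve's actual implementation): (a) the user's returned dict is copied before being mutated by the merge, and (b) deployments the policy skipped still go through the merge so their internal bookkeeping is not lost.

```python
def merge_user_with_internal(internal: dict, user: dict) -> dict:
    merged = user.copy()      # (a) never mutate the policy's returned dict
    merged.update(internal)   # internal bookkeeping keys layered on top
    return merged

internal_state = {"dep1": {"decision_counter": 2}, "dep2": {"decision_counter": 0}}
user_state = {"dep1": {"counter": 5}}  # policy skipped dep2 entirely

final = {}
for dep, internal in internal_state.items():
    user = user_state.get(dep) or {}   # defensive: tolerate missing/None state
    # (b) skipped deployments still go through the merge, so internal
    # counters/timestamps survive even without new user state.
    final[dep] = merge_user_with_internal(internal, user)

assert final["dep1"] == {"counter": 5, "decision_counter": 2}
assert final["dep2"] == {"decision_counter": 0}
```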