[Core] Expose fallback_strategy in TaskInfoEntry and ActorTableData#60659

Merged
edoakes merged 13 commits into ray-project:master from nadongjun:fallback on Mar 12, 2026

@nadongjun
Contributor

Description

Add observability for fallback_strategy in State API and GCS.

While Ray currently provides visibility for label_selector (#53423), there is no mechanism to observe the fallback_strategy from outside the system.

This PR exposes fallback_strategy in TaskInfoEntry and ActorTableData. The ability to read and record fallback_strategy is essential for our custom autoscaler development. When primary label_selector constraints cannot be met, the autoscaler must access these recorded fallback strategies to prioritize and allocate alternative devices.

Beyond autoscaling, adding this feature will provide a better debugging experience by allowing users to transparently track the entire scheduling intent, including the fallback_strategy for both tasks and actors.
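
As an illustration of the autoscaler use case described above, here is a minimal sketch of how a consumer might walk the primary label_selector and then each recorded fallback in order. The record shape (a dict with "label_selector" and "fallback_strategy" keys) simply mirrors the deployment options and is an assumption, not the actual State API schema; value_matches, node_satisfies, and pick_selector are hypothetical helpers, and the matcher is a simplified stand-in that only handles "value", "!value", "in(...)", and "!in(...)".

```python
from typing import Dict, List, Optional, Tuple

Selector = Dict[str, str]


def value_matches(expr: str, actual: Optional[str]) -> bool:
    """Simplified matcher for 'value', '!value', 'in(a,b)' and '!in(a,b)'."""
    negate = expr.startswith("!")
    body = expr[1:] if negate else expr
    if body.startswith("in(") and body.endswith(")"):
        allowed = {v.strip() for v in body[3:-1].split(",")}
    else:
        allowed = {body}
    matched = actual in allowed
    return not matched if negate else matched


def node_satisfies(selector: Selector, node_labels: Dict[str, str]) -> bool:
    """A node satisfies a selector when every key's expression matches."""
    return all(
        value_matches(expr, node_labels.get(key)) for key, expr in selector.items()
    )


def pick_selector(
    record: dict, nodes: List[Dict[str, str]]
) -> Optional[Tuple[int, Selector]]:
    """Return (index, selector) of the first satisfiable option.

    Index 0 is the primary label_selector; 1..n are the fallbacks in order.
    Returns None when no option can be placed on the given nodes.
    """
    candidates = [record.get("label_selector") or {}]
    candidates += [
        fb.get("label_selector", {}) for fb in record.get("fallback_strategy") or []
    ]
    for index, selector in enumerate(candidates):
        if any(node_satisfies(selector, labels) for labels in nodes):
            return index, selector
    return None


# Assumed record shape, mirroring the deployment options in this PR.
record = {
    "label_selector": {"docker-image": "in(test-image)"},
    "fallback_strategy": [{"label_selector": {"docker-image": "in(test-image2)"}}],
}
nodes = [{"docker-image": "test-image2"}]
print(pick_selector(record, nodes))  # (1, {'docker-image': 'in(test-image2)'})
```

A None result would be the signal for an autoscaler to provision a node, and the ordered candidate list tells it which device classes are acceptable.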

Related issues

Related to #51564

Additional information

import ray
from ray import serve


# The primary label_selector is requested first; each fallback_strategy entry
# is an alternative selector for when the primary cannot be satisfied.
@serve.deployment(
    name="soft_docker_deployment",
    ray_actor_options={
        "label_selector": {"docker-image": "in(test-image)"},
        "fallback_strategy": [
            {"label_selector": {"docker-image": "in(test-image2)"}},
        ],
    },
)
class SoftDockerDeployment:
    def __call__(self, request):
        # Report the labels of the node this replica was scheduled on.
        node_labels = ray.get_runtime_context().get_node_labels()
        return {
            "message": "Hello from soft-docker deployment!",
            "node_labels": node_labels,
        }


if __name__ == "__main__":
    serve.start(http_options={"host": "0.0.0.0", "port": 8000})
    serve.run(SoftDockerDeployment.bind())

Output of GlobalStateAccessor.get_actor_table: (screenshot)

Output of ray list actors --detail: (screenshot)

Signed-off-by: Dongjun Na <kmu5544616@gmail.com>
@nadongjun nadongjun requested review from a team, edoakes and jjyao as code owners February 2, 2026 08:23
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request successfully exposes the fallback_strategy for tasks and actors across Ray's state API and GCS, involving changes to Protobuf definitions, the C++ backend, and the Python state API. The implementation is largely correct and well-aligned with the stated goals. However, I've identified two issues: one in the C++ backend concerning an incorrect check for the fallback_strategy, and another in the Python state API where a field is missing from a dataclass, which would likely cause a runtime error. My review provides specific suggestions to address these points.
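
The review does not show the affected code, so the following is only a generic sketch of why a field missing from a dataclass can surface as a runtime error: when a record is built by unpacking a dict into the constructor, any key the dataclass does not declare raises TypeError. The class names, payload, and try_build helper are hypothetical and are not Ray's actual state schema.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ActorRecordWithoutField:
    """Hypothetical schema that predates fallback_strategy."""

    actor_id: str
    label_selector: Optional[Dict[str, str]] = None


@dataclass
class ActorRecordWithField:
    """Hypothetical schema that declares fallback_strategy."""

    actor_id: str
    label_selector: Optional[Dict[str, str]] = None
    fallback_strategy: Optional[List[Dict[str, Dict[str, str]]]] = None


# Hypothetical payload as a backend might return it once the field is exposed.
payload = {
    "actor_id": "abc123",
    "label_selector": {"docker-image": "in(test-image)"},
    "fallback_strategy": [{"label_selector": {"docker-image": "in(test-image2)"}}],
}


def try_build(cls, data):
    """Build cls(**data), returning the TypeError instead of raising it."""
    try:
        return cls(**data)
    except TypeError as exc:
        return exc


# The undeclared key is rejected by the dataclass-generated __init__.
print(type(try_build(ActorRecordWithoutField, payload)).__name__)  # TypeError
# With the field declared, the payload round-trips.
print(try_build(ActorRecordWithField, payload).fallback_strategy)
```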

Signed-off-by: Dongjun Na <kmu5544616@gmail.com>
@ray-gardener ray-gardener bot added the core, observability, and community-contribution labels Feb 2, 2026
@nadongjun
Contributor Author

@edoakes @jjyao Gentle ping. Any thoughts on this?

Collaborator

@edoakes edoakes left a comment

LGTM. Thanks for the contribution

Could you add an integration test for the state API to assert that the information is populated correctly?

@edoakes edoakes added the go label (add ONLY when ready to merge, run all tests) Feb 10, 2026
@edoakes edoakes requested a review from a team February 10, 2026 13:55
Signed-off-by: Dongjun Na <kmu5544616@gmail.com>

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

@nadongjun
Contributor Author

@edoakes I've fixed the issue where fallback_strategy was missing during TaskInfo conversion and added both integration and GCS unit tests to verify the fix. Please take another look!
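
The kind of regression described here (a field dropped during conversion) can be guarded with a field-preservation check. The sketch below is a pure-Python stand-in, not the actual conversion code or the tests added in this PR; to_task_info, missing_scheduling_fields, SCHEDULING_FIELDS, and the dict shapes are all hypothetical.

```python
from typing import Any, Dict, List

# Hypothetical list of scheduling fields a conversion must carry over.
SCHEDULING_FIELDS = ("label_selector", "fallback_strategy")


def to_task_info(task_spec: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical conversion from a task spec to a task info entry."""
    info = {"task_id": task_spec["task_id"], "name": task_spec.get("name", "")}
    for field_name in SCHEDULING_FIELDS:
        # Copy each scheduling field only when the spec actually sets it.
        if task_spec.get(field_name):
            info[field_name] = task_spec[field_name]
    return info


def missing_scheduling_fields(
    task_spec: Dict[str, Any], info: Dict[str, Any]
) -> List[str]:
    """Return the scheduling fields set on the spec but lost in the info entry."""
    return [
        name
        for name in SCHEDULING_FIELDS
        if task_spec.get(name) and info.get(name) != task_spec[name]
    ]


spec = {
    "task_id": "t1",
    "name": "demo",
    "label_selector": {"docker-image": "in(test-image)"},
    "fallback_strategy": [{"label_selector": {"docker-image": "in(test-image2)"}}],
}

# A correct conversion loses nothing.
assert missing_scheduling_fields(spec, to_task_info(spec)) == []

# Simulate the bug: an entry built without fallback_strategy is flagged.
buggy = {k: v for k, v in to_task_info(spec).items() if k != "fallback_strategy"}
assert missing_scheduling_fields(spec, buggy) == ["fallback_strategy"]
```

Driving the check from a single field list means a newly exposed scheduling field only needs to be added in one place for the guard to cover it.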

@nadongjun
Contributor Author

@edoakes @MengjinYan @israbbani Gentle ping.

Collaborator

@edoakes edoakes left a comment

Thanks 👍

@edoakes
Collaborator

edoakes commented Mar 10, 2026

@nadongjun there is a merge conflict

@edoakes edoakes enabled auto-merge (squash) March 10, 2026 20:59
auto-merge was automatically disabled March 10, 2026 23:34

Head branch was pushed to by a user without write access

@nadongjun nadongjun requested a review from edoakes March 12, 2026 04:24
@nadongjun
Contributor Author

@edoakes Resolved the merge conflict.

@edoakes edoakes merged commit 1a6c6f0 into ray-project:master Mar 12, 2026
6 checks passed
Labels

community-contribution: Contributed by the community
core: Issues that should be addressed in Ray Core
go: add ONLY when ready to merge, run all tests
observability: Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling

2 participants