New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Serve] Serve python API to support multi application #31589
Conversation
17a23fc
to
6ec4d35
Compare
6ec4d35
to
3976d9b
Compare
Looks good at a high level. As discussed offline please handle the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work so far! I did a brief pass (will do a deeper dive later). My main concern is whether we need the ApplicationStateManager
. I left a comment with more details.
self.status: ApplicationStatus = ApplicationStatus.NOT_STARTED | ||
self.app_name = app_name | ||
self.deployment_params = deployment_params | ||
self.to_be_deleted = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self.to_be_deleted = False | |
self.all_deployments_deleted = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this attribute is under ApplicationState to indicate whether this application should be deleted or not. In the ApplicationStateManager update(), I think the attribute name is clearer to indicate "the application can be deleted" instead of "all deployments have been deleted". WDYT?
@@ -33,6 +33,8 @@ class ApplicationStatus(str, Enum): | |||
DEPLOYING = "DEPLOYING" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the difference between NOT_STARTED
and NOT_EXISTED
? Do we need both?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NOT_EXISTED is used when client is getting a non-existed application status.
@@ -255,6 +251,10 @@ async def run_control_loop(self) -> None: | |||
self.deployment_state_manager.update() | |||
except Exception: | |||
logger.exception("Exception updating deployment state.") | |||
try: | |||
self.application_state_manager.update() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ApplicationStateManager
feels like an awkward abstraction imo. On every control loop, the ServeController
will update the deployments by calling the deployment_state_manager
directly, and then it'll again update the deployments by calling the deployment_state_manager
indirectly via the application_state_manager
.
This makes it harder to understand how deployments are changed in a single control loop because there's two places where they're being updated. This also makes it harder to reason about concurrency (e.g. what happens if two apps have deployments w/ the same name? How would the deployment_state_manager
/application_state_manager
handle that?).
Since an application is essentially a collection of deployments and an application name, could we instead make it as a dictionary in the DeploymentStateManager
? Status calls could then be made directly to the DeploymentStateManager
, which would produce a result based on the dictionary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Currently there is no update call from
application_state_manager
todeployment_state_manager
, application manager only fetches information from the deployment state manager. Since we are going to deprecate the v1 API. It make sense to me we directly hidedeployment_state_manager
intoapplication_state_manager
, so which means we only callapplication_state_manager.update()
in the future. (I can't do it for now because we still support v1 API) - For
what happens if two apps have deployments w/ the same name
, it will deploy the one that comes later. make it as a dictionary in the DeploymentStateManager
. One reason I don't do that is that we still need to have the abstraction for application level even in the DeploymentStateManager. DeploymentStateManager class itself is used for deploying single Deployment, it doesn't have group of Deployments concept, and no application concept. we have to introduce it somehow. Given my first point, So I decide to have another class to reflect the application level information. And again after v1 deprecation, we can hidedeployment_state_manager
into theapplication_state_manager
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, so eventually the deployment_state_manager
will only get called via the application_state_manager
. I think that makes sense.
For
what happens if two apps have deployments w/ the same name
, it will deploy the one that comes later.
Does that mean two apps on a single cluster can't have deployments with the same name at the same time?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we will have application_state_manager.update() only in the controller level.
I think you are referring to the problem here #31589 (comment) :)
I am working on this so basically each deployment name will be {app_name}_{deployment_name}
. It will reflect in the following up update in this pr soon!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sihanwang41 I thought that currently, DeploymentStateManager
is used for deploying multiple deployments? It holds a list of DeploymentStates
. Essentially from my understanding it currently manages the state of a "single application".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zcin yes, it is good for single applications, if for multiple applications, we have to introducing group deployments
and app concept
somewhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah seems like this PR is the lowest friction path to add this concept, but it would probably make more sense to refactor and have "app manager" wrap "deployment manager" in the future.
apps_to_be_deleted = [] | ||
for name, app in self._application_states.items(): | ||
app.update() | ||
if app.to_be_deleted: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just call del
here? (Why do we need apps_to_be_deleted
?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I think this is similar to the existing deployment_state_manager
, I don't fully remember why we needed it there but it does seem that here we could just immediately del
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you suggest del entry from _application_states
while it is being iterated? i think it is not allowed?
8fd7c6e
to
9c94500
Compare
Looks good overall! Before merging I think we should add more to the PR description (either link to a doc or add the details directly) Will doc updates happen in a followup PR, or do more code updates need to happen first? |
9c94500
to
1200b32
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Didn't review the tests in depth yet. But the code changes look good to me. Ping when tests are passing and I'll do a final pass.
apps_to_be_deleted = [] | ||
for name, app in self._application_states.items(): | ||
app.update() | ||
if app.to_be_deleted: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I think this is similar to the existing deployment_state_manager
, I don't fully remember why we needed it there but it does seem that here we could just immediately del
@@ -255,6 +251,10 @@ async def run_control_loop(self) -> None: | |||
self.deployment_state_manager.update() | |||
except Exception: | |||
logger.exception("Exception updating deployment state.") | |||
try: | |||
self.application_state_manager.update() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah seems like this PR is the lowest friction path to add this concept, but it would probably make more sense to refactor and have "app manager" wrap "deployment manager" in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks good!
num_health_deployments = 0 | ||
for deployment_status in deployments_statuses: | ||
if deployment_status.status == DeploymentStatus.UNHEALTHY: | ||
self.status = ApplicationStatus.DEPLOY_FAILED | ||
return | ||
if deployment_status.status == DeploymentStatus.HEALTHY: | ||
num_health_deployments += 1 | ||
if num_health_deployments == len(deployments_statuses): | ||
self.status = ApplicationStatus.RUNNING |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be simplified with list.count
or Counter
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't find there is a clean way to use these function with Object.
deployments_to_delete.extend( | ||
self.application_state_manager.get_deployments(name) | ||
) | ||
self.application_state_manager.delete_application(name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a bit confusing that application_state_manager.delete_application()
doesn't actually delete the application; is this to support 1.x API for now? Will application_state_manager.delete_application()
eventually call deployment_state_manager.delete_deployment()
? (Same for deploy_application
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, currently it is only set the DELETING state.
51188d1
to
fe00aec
Compare
Signed-off-by: Sihan Wang <sihanwang41@gmail.com>
Signed-off-by: Sihan Wang <sihanwang41@gmail.com>
Signed-off-by: Sihan Wang <sihanwang41@gmail.com>
fe00aec
to
3addb93
Compare
Signed-off-by: Sihan Wang <sihanwang41@gmail.com>
059d822
to
d15b780
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code looks good, basically all nits.
python/ray/serve/api.py
Outdated
route_prefix: Route prefix for HTTP requests. If not provided, it will use | ||
ingrerss deployment route prefix. By default, the ingress route | ||
prefix is '/'. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
route_prefix: Route prefix for HTTP requests. If not provided, it will use | |
ingrerss deployment route prefix. By default, the ingress route | |
prefix is '/'. | |
route_prefix: Route prefix for HTTP requests. If not provided, it will use | |
route_prefix of the ingress deployment. By default, the ingress route | |
prefix is '/'. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should update this to not be specifiable in the ingress deployment in the future... not good to have two ways to accomplish the same thing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#31932 for tracking
python/ray/serve/controller.py
Outdated
|
||
Args: | ||
config: Contains the following: | ||
app_name: Application name. If not provided, it is empty string. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will empty string also delete all other apps on the cluster?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if it's empty string we should probably not prefix the deployment names, not sure if that's handled
(or it'll be weird: _{DeploymentName}
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will empty string also delete all other apps on the cluster?
Nope. If it is empty string, it will only delete the empty string app.
if it's empty string we should probably not prefix the deployment names, not sure if that's handled
Yes, the deployment name is handled in the sever.run() API level inside build() function.
python/ray/serve/controller.py
Outdated
config_checkpoints_dict[config.app_name] = ( | ||
self.application_state_manager.get_deployment_timestamp(config.app_name), | ||
config_dict, | ||
updated_version_dict, | ||
) | ||
|
||
self.kv_store.put( | ||
CONFIG_CHECKPOINT_KEY, | ||
pickle.dumps(config_checkpoints_dict), | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm is the ordering important & correct here? what happens if this call fails in the middle?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't intentionally change the order, change it back :)
is the ordering important & correct here
Yes or No here? With current logic (left hand), if the controller die beforeself.kv_store.put
, we will still lose the deploy information.
what happens if this call fails in the middle
If deploy failed for some reasons (e.g. runtime env), application state will help to catch the exception and update the status toDEPLOYED_FAILED
.
python/ray/serve/tests/test_api.py
Outdated
f_handle = serve.run(f.bind(), name="app_f") | ||
assert ray.get(f_handle.remote()) == "got f" | ||
assert requests.get("http://127.0.0.1:8000/").text == "got f" | ||
|
||
g_handle = serve.run(g.bind(), name="app_g", route_prefix="/app_g") | ||
assert ray.get(g_handle.remote()) == "got g" | ||
assert requests.get("http://127.0.0.1:8000/app_g").text == "got g" | ||
|
||
h_handle = serve.run(h.bind(), name="app_h") | ||
assert ray.get(h_handle.remote()) == "got h" | ||
assert requests.get("http://127.0.0.1:8000/my_prefix").text == "got h" | ||
|
||
graph_handle = serve.run( | ||
DAGDriver.bind(Model1.bind()), name="graph", route_prefix="/my_graph" | ||
) | ||
assert ray.get(graph_handle.predict.remote()) == "got model1" | ||
assert requests.get("http://127.0.0.1:8000/my_graph").text == '"got model1"' | ||
|
||
serve.run(MyFastAPIDeployment.bind(), name="FastAPI") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please add a few short comments describing what specific behavior is being tested for each of these conditions. helps a lot with readability!
60e8b84
to
e0cb138
Compare
Signed-off-by: Sihan Wang <sihanwang41@gmail.com>
e0cb138
to
34fe1cc
Compare
@@ -726,7 +754,7 @@ def run_graph( | |||
|
|||
# Override options for each deployment | |||
for options in deployment_override_options: | |||
name = options["name"] | |||
deployment_name = options["name"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No special meaning. Only change variable name from name to deployment_name
All failed tests appear to be flaky |
Signed-off-by: Sihan Wang sihanwang41@gmail.com
Why are these changes needed?
Related issue number
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.