Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Serve] Serve python API to support multi application #31589

Merged
merged 6 commits into from Jan 26, 2023

Conversation

sihanwang41
Copy link
Contributor

@sihanwang41 sihanwang41 commented Jan 11, 2023

Signed-off-by: Sihan Wang sihanwang41@gmail.com

Why are these changes needed?

  • Python API for multi applications. serve.run() & serve.delete()

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@sihanwang41 sihanwang41 marked this pull request as draft January 11, 2023 14:19
@sihanwang41 sihanwang41 force-pushed the multi_app_p1 branch 7 times, most recently from 17a23fc to 6ec4d35 Compare January 18, 2023 16:34
@sihanwang41 sihanwang41 changed the title [WIP][Serve][Part1] Serve API to support multi application [Serve][Part1] Serve API to support multi application Jan 18, 2023
@sihanwang41 sihanwang41 changed the title [Serve][Part1] Serve API to support multi application [Serve] Serve python API to support multi application Jan 18, 2023
@sihanwang41 sihanwang41 marked this pull request as ready for review January 18, 2023 16:58
@edoakes
Copy link
Contributor

edoakes commented Jan 18, 2023

Looks good at a high level. As discussed offline please handle the name collision behavior.

Copy link
Contributor

@shrekris-anyscale shrekris-anyscale left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work so far! I did a brief pass (will do a deeper dive later). My main concern is whether we need the ApplicationStateManager. I left a comment with more details.

python/ray/serve/_private/application_state.py Outdated Show resolved Hide resolved
self.status: ApplicationStatus = ApplicationStatus.NOT_STARTED
self.app_name = app_name
self.deployment_params = deployment_params
self.to_be_deleted = False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
self.to_be_deleted = False
self.all_deployments_deleted = False

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this attribute is under ApplicationState to indicate whether this application should be deleted or not. In the ApplicationStateManager update(), I think the attribute name is clearer to indicate "the application can be deleted" instead of "all deployments have been deleted". WDYT?

python/ray/serve/_private/common.py Outdated Show resolved Hide resolved
@@ -33,6 +33,8 @@ class ApplicationStatus(str, Enum):
DEPLOYING = "DEPLOYING"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the difference between NOT_STARTED and NOT_EXISTED? Do we need both?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NOT_EXISTED is used when client is getting a non-existed application status.

@@ -255,6 +251,10 @@ async def run_control_loop(self) -> None:
self.deployment_state_manager.update()
except Exception:
logger.exception("Exception updating deployment state.")
try:
self.application_state_manager.update()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ApplicationStateManager feels like an awkward abstraction imo. On every control loop, the ServeController will update the deployments by calling the deployment_state_manager directly, and then it'll again update the deployments by calling the deployment_state_manager indirectly via the application_state_manager.

This makes it harder to understand how deployments are changed in a single control loop because there's two places where they're being updated. This also makes it harder to reason about concurrency (e.g. what happens if two apps have deployments w/ the same name? How would the deployment_state_manager/application_state_manager handle that?).

Since an application is essentially a collection of deployments and an application name, could we instead make it as a dictionary in the DeploymentStateManager? Status calls could then be made directly to the DeploymentStateManager, which would produce a result based on the dictionary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Currently there is no update call from application_state_manager to deployment_state_manager, application manager only fetches information from the deployment state manager. Since we are going to deprecate the v1 API. It make sense to me we directly hide deployment_state_manager into application_state_manager, so which means we only call application_state_manager.update() in the future. (I can't do it for now because we still support v1 API)
  2. For what happens if two apps have deployments w/ the same name, it will deploy the one that comes later.
  3. make it as a dictionary in the DeploymentStateManager. One reason I don't do that is that we still need to have the abstraction for application level even in the DeploymentStateManager. DeploymentStateManager class itself is used for deploying single Deployment, it doesn't have group of Deployments concept, and no application concept. we have to introduce it somehow. Given my first point, So I decide to have another class to reflect the application level information. And again after v1 deprecation, we can hide deployment_state_manager into the application_state_manager.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, so eventually the deployment_state_manager will only get called via the application_state_manager. I think that makes sense.

For what happens if two apps have deployments w/ the same name, it will deploy the one that comes later.

Does that mean two apps on a single cluster can't have deployments with the same name at the same time?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we will have application_state_manager.update() only in the controller level.

I think you are referring to the problem here #31589 (comment) :)
I am working on this so basically each deployment name will be {app_name}_{deployment_name}. It will reflect in the following up update in this pr soon!

Copy link
Contributor

@zcin zcin Jan 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sihanwang41 I thought that currently, DeploymentStateManager is used for deploying multiple deployments? It holds a list of DeploymentStates. Essentially from my understanding it currently manages the state of a "single application".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zcin yes, it is good for single applications, if for multiple applications, we have to introducing group deployments and app concept somewhere.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah seems like this PR is the lowest friction path to add this concept, but it would probably make more sense to refactor and have "app manager" wrap "deployment manager" in the future.

apps_to_be_deleted = []
for name, app in self._application_states.items():
app.update()
if app.to_be_deleted:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just call del here? (Why do we need apps_to_be_deleted?)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I think this is similar to the existing deployment_state_manager, I don't fully remember why we needed it there but it does seem that here we could just immediately del

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you suggest del entry from _application_states while it is being iterated? i think it is not allowed?

@sihanwang41 sihanwang41 force-pushed the multi_app_p1 branch 2 times, most recently from 8fd7c6e to 9c94500 Compare January 19, 2023 18:38
@architkulkarni
Copy link
Contributor

Looks good overall! Before merging I think we should add more to the PR description (either link to a doc or add the details directly)

Will doc updates happen in a followup PR, or do more code updates need to happen first?

Copy link
Contributor

@edoakes edoakes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't review the tests in depth yet. But the code changes look good to me. Ping when tests are passing and I'll do a final pass.

apps_to_be_deleted = []
for name, app in self._application_states.items():
app.update()
if app.to_be_deleted:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I think this is similar to the existing deployment_state_manager, I don't fully remember why we needed it there but it does seem that here we could just immediately del

python/ray/serve/_private/deployment_graph_build.py Outdated Show resolved Hide resolved
@@ -255,6 +251,10 @@ async def run_control_loop(self) -> None:
self.deployment_state_manager.update()
except Exception:
logger.exception("Exception updating deployment state.")
try:
self.application_state_manager.update()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah seems like this PR is the lowest friction path to add this concept, but it would probably make more sense to refactor and have "app manager" wrap "deployment manager" in the future.

Copy link
Contributor

@zcin zcin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good!

python/ray/serve/controller.py Show resolved Hide resolved
Comment on lines +109 to +117
num_health_deployments = 0
for deployment_status in deployments_statuses:
if deployment_status.status == DeploymentStatus.UNHEALTHY:
self.status = ApplicationStatus.DEPLOY_FAILED
return
if deployment_status.status == DeploymentStatus.HEALTHY:
num_health_deployments += 1
if num_health_deployments == len(deployments_statuses):
self.status = ApplicationStatus.RUNNING
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be simplified with list.count or Counter?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't find there is a clean way to use these function with Object.

deployments_to_delete.extend(
self.application_state_manager.get_deployments(name)
)
self.application_state_manager.delete_application(name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a bit confusing that application_state_manager.delete_application() doesn't actually delete the application; is this to support 1.x API for now? Will application_state_manager.delete_application() eventually call deployment_state_manager.delete_deployment()? (Same for deploy_application)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, currently it is only set the DELETING state.

python/ray/serve/controller.py Outdated Show resolved Hide resolved
@sihanwang41 sihanwang41 force-pushed the multi_app_p1 branch 2 times, most recently from 51188d1 to fe00aec Compare January 24, 2023 03:57
Signed-off-by: Sihan Wang <sihanwang41@gmail.com>
Signed-off-by: Sihan Wang <sihanwang41@gmail.com>
Signed-off-by: Sihan Wang <sihanwang41@gmail.com>
Signed-off-by: Sihan Wang <sihanwang41@gmail.com>
Signed-off-by: Sihan Wang <sihanwang41@gmail.com>
Copy link
Contributor

@edoakes edoakes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code looks good, basically all nits.

python/ray/serve/_private/deployment_state.py Outdated Show resolved Hide resolved
python/ray/serve/api.py Outdated Show resolved Hide resolved
Comment on lines 479 to 481
route_prefix: Route prefix for HTTP requests. If not provided, it will use
ingrerss deployment route prefix. By default, the ingress route
prefix is '/'.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
route_prefix: Route prefix for HTTP requests. If not provided, it will use
ingrerss deployment route prefix. By default, the ingress route
prefix is '/'.
route_prefix: Route prefix for HTTP requests. If not provided, it will use
route_prefix of the ingress deployment. By default, the ingress route
prefix is '/'.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should update this to not be specifiable in the ingress deployment in the future... not good to have two ways to accomplish the same thing

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#31932 for tracking

python/ray/serve/api.py Show resolved Hide resolved
python/ray/serve/api.py Outdated Show resolved Hide resolved

Args:
config: Contains the following:
app_name: Application name. If not provided, it is empty string.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will empty string also delete all other apps on the cluster?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if it's empty string we should probably not prefix the deployment names, not sure if that's handled

(or it'll be weird: _{DeploymentName})

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will empty string also delete all other apps on the cluster?

Nope. If it is empty string, it will only delete the empty string app.

if it's empty string we should probably not prefix the deployment names, not sure if that's handled

Yes, the deployment name is handled in the sever.run() API level inside build() function.

Comment on lines 486 to 495
config_checkpoints_dict[config.app_name] = (
self.application_state_manager.get_deployment_timestamp(config.app_name),
config_dict,
updated_version_dict,
)

self.kv_store.put(
CONFIG_CHECKPOINT_KEY,
pickle.dumps(config_checkpoints_dict),
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm is the ordering important & correct here? what happens if this call fails in the middle?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't intentionally change the order, change it back :)

is the ordering important & correct here
Yes or No here? With current logic (left hand), if the controller die before self.kv_store.put, we will still lose the deploy information.

what happens if this call fails in the middle
If deploy failed for some reasons (e.g. runtime env), application state will help to catch the exception and update the status to DEPLOYED_FAILED.

python/ray/serve/schema.py Outdated Show resolved Hide resolved
python/ray/serve/schema.py Outdated Show resolved Hide resolved
Comment on lines 510 to 528
f_handle = serve.run(f.bind(), name="app_f")
assert ray.get(f_handle.remote()) == "got f"
assert requests.get("http://127.0.0.1:8000/").text == "got f"

g_handle = serve.run(g.bind(), name="app_g", route_prefix="/app_g")
assert ray.get(g_handle.remote()) == "got g"
assert requests.get("http://127.0.0.1:8000/app_g").text == "got g"

h_handle = serve.run(h.bind(), name="app_h")
assert ray.get(h_handle.remote()) == "got h"
assert requests.get("http://127.0.0.1:8000/my_prefix").text == "got h"

graph_handle = serve.run(
DAGDriver.bind(Model1.bind()), name="graph", route_prefix="/my_graph"
)
assert ray.get(graph_handle.predict.remote()) == "got model1"
assert requests.get("http://127.0.0.1:8000/my_graph").text == '"got model1"'

serve.run(MyFastAPIDeployment.bind(), name="FastAPI")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add a few short comments describing what specific behavior is being tested for each of these conditions. helps a lot with readability!

@edoakes edoakes added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 25, 2023
Signed-off-by: Sihan Wang <sihanwang41@gmail.com>
@@ -726,7 +754,7 @@ def run_graph(

# Override options for each deployment
for options in deployment_override_options:
name = options["name"]
deployment_name = options["name"]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No special meaning. Only change variable name from name to deployment_name

@edoakes
Copy link
Contributor

edoakes commented Jan 26, 2023

All failed tests appear to be flaky

@edoakes edoakes merged commit 0c68ba4 into ray-project:master Jan 26, 2023
cadedaniel pushed a commit to cadedaniel/ray that referenced this pull request Mar 22, 2023
)

- Python API for multi applications. serve.run() & serve.delete()
cassidylaidlaw pushed a commit to cassidylaidlaw/ray that referenced this pull request Mar 28, 2023
)

- Python API for multi applications. serve.run() & serve.delete()
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
@author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants