
[Feature] [Serve] Threading for Ray Serve #20169

Open
1 of 2 tasks
cin-duke opened this issue Nov 9, 2021 · 7 comments
Labels
enhancement (Request for new feature and/or capability), P2 (Important issue, but not time-critical), serve (Ray Serve Related Issue)
Milestone

Comments


cin-duke commented Nov 9, 2021

Search before asking

  • I had searched in the issues and found no similar feature requirement.

Description

Currently, Ray Serve only supports asyncio, so computation-heavy tasks block the event loop. This is not ideal for handling concurrent requests.
It'd be great if Ray Serve deployments supported threading, which would make it easier to process concurrent requests.

Use case

Allow a Ray Serve deployment to process computation-heavy tasks on other threads while waiting for results.

Related issues

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@cin-duke cin-duke added the enhancement Request for new feature and/or capability label Nov 9, 2021

cin-duke commented Nov 9, 2021

Hi @simon-mo, could you take a look at this feature request? Thank you 😃

@cin-duke cin-duke changed the title [Feature] Threading for Ray Serve [Feature][Serve] Threading for Ray Serve Nov 9, 2021
@cin-duke cin-duke changed the title [Feature][Serve] Threading for Ray Serve [Feature] [Serve] Threading for Ray Serve Nov 9, 2021

simon-mo commented Nov 9, 2021

Thanks for posting this @cin-duke. cc @jiaodong @edoakes from the Serve team.

Actually, after some thought, this is more nuanced than I expected.

When you want concurrent requests but the work is CPU-bound, for example a single call using 100% of a CPU, you should use replicas instead of threading so that each request gets its own process. Replicas are easier to manage and their performance is more predictable.

When you have a call that only uses, say, 20% CPU but is a blocking call (no async option), threading might still make sense. However, in this case replicas are still preferred because you can do YourDeployment.options(ray_actor_options={"num_cpus": 0.2}, num_replicas=10) to scale out.
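As a sketch of the replica-based approach (a config fragment only, assuming Ray Serve is installed and started; `do_blocking_work` is a hypothetical placeholder for the blocking library call):

```python
from ray import serve

# Ten replicas at 0.2 CPU each lets ten blocking calls run
# concurrently, each in its own process.
@serve.deployment(ray_actor_options={"num_cpus": 0.2}, num_replicas=10)
class YourDeployment:
    def __call__(self, request):
        # do_blocking_work is a hypothetical stand-in for a blocking,
        # low-CPU call with no async option.
        return do_blocking_work(request)
```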

The only case where threading would be useful is to run a lower number of replicas to avoid per-process overhead. In that case you can use a Python thread pool and orchestrate it with asyncio: https://docs.python.org/3/library/asyncio-eventloop.html#asyncio.loop.run_in_executor
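The run_in_executor pattern can be sketched without Ray at all. Here `blocking_call` is a hypothetical stand-in for a blocking library call (e.g. model inference), and `handle_request` plays the role of an async deployment method:

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

# Shared pool; in a Serve deployment this could live on the deployment class.
_pool = ThreadPoolExecutor(max_workers=4)

def blocking_call(x):
    # Stand-in for a blocking, non-async library call.
    time.sleep(0.05)
    return x * 2

async def handle_request(x):
    # Offload the blocking call to the pool so the event loop stays
    # free to accept other requests while this one runs.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(_pool, blocking_call, x)

async def main():
    # Ten concurrent "requests"; with 4 workers they overlap in the pool.
    return await asyncio.gather(*(handle_request(i) for i in range(10)))

print(asyncio.run(main()))
```

With this pattern the event loop is never blocked, so a single replica can keep accepting requests while up to `max_workers` blocking calls run in parallel.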

In summary, I think supporting threaded actors via max_concurrent_queries might be challenging. Let me know whether this makes sense!


cin-duke commented Nov 11, 2021

Thanks for your detailed explanation @simon-mo

Here are some of my comments:

> When you have a call that only uses say 20% CPU but it is blocking call (no async option), threading might still make sense. However in this case replicas are still preferred because you can do YourDeployment.options(ray_actor_options={"num_cpus": 0.2}, num_replicas=10) to scale out.

In this use case, threads would be useful because spawning multiple workers takes a lot of system RAM, and if the deployment uses GPU RAM, it is very hard to scale out.

> The only case where threading would be useful is to use lower number of replicas due to process overhead. In this case you can use a Python threadpool and orchestrate it with asyncio. https://docs.python.org/3/library/asyncio-eventloop.html#asyncio.loop.run_in_executor

I will look into this approach. However, it'd be easier for users if Ray implemented the thread pool internally; users would only need to set an argument to turn on threading and set the max concurrency.

@simon-mo

> In this use case, thread would be useful because spawning multiple workers will take lots of system RAM, and if the deployment uses GPU RAM, it will be very hard to scale out.

This is a great point! For the thread pool approach, if you can try to prototype it in your application and see it working, we can integrate it into Ray Serve :D.

@simon-mo simon-mo added the serve Ray Serve Related Issue label Nov 11, 2021
@simon-mo simon-mo added this to the Serve backlog milestone Nov 11, 2021
@cin-duke

Okay, we will experiment with it and let you know the result.

@simon-mo simon-mo added the P2 Important issue, but not time-critical label Jan 26, 2022
@jiaodong

Hi @cin-duke, just revisiting this issue since this topic was brought up by other community users as well. Any updates on your experiment, or is help needed?

@cin-duke

Hi @jiaodong, @simon-mo
Sorry for the late response; I haven't had time to investigate it. My simple solution is to create a function that calls the deployment, then create a thread pool to run that function on separate threads.
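That workaround might look roughly like this; `call_deployment` is a hypothetical stand-in for whatever actually invokes the deployment (a Serve handle or an HTTP request), simulated here with a blocking sleep so the sketch runs on its own:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def call_deployment(payload):
    # Stand-in for a call into the Serve deployment; here it just
    # simulates a blocking round trip.
    time.sleep(0.05)
    return f"processed-{payload}"

def run_concurrently(payloads, max_workers=8):
    # Fan the calls out over a thread pool so they run concurrently
    # instead of one after another.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(call_deployment, payloads))

results = run_concurrently(["a", "b", "c"])
```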
