
[RayServe] Autoscaling Issue with Neuron Devices (Inf2), RayServe, and Karpenter on EKS #44361

Open
vara-bonthu opened this issue Mar 29, 2024 · 4 comments


vara-bonthu commented Mar 29, 2024

What happened + What you expected to happen

What Happened:

When deploying models with RayServe autoscaling enabled on Amazon EKS, specifically across multiple inf2 nodes, scaling works correctly within a single inf2.24xlarge instance, up to 6 replicas. Beyond 6 replicas, however, RayServe fails to request new worker nodes, which should trigger Karpenter to provision additional nodes for the new pods. This happens even though autoscaling appears to be configured correctly for this scenario.
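For context, the deployment in question requests one NeuronCore per replica, roughly along the lines of the sketch below (a minimal illustration; the replica bounds and model code are placeholders, not the blueprint's exact values):

```python
# Minimal sketch of a Serve deployment that requests one NeuronCore per
# replica and relies on autoscaling. The replica bounds and model loading
# are illustrative placeholders, not the blueprint's exact values.
from ray import serve

@serve.deployment(
    name="stable-diffusion-v2",
    ray_actor_options={"num_cpus": 1, "resources": {"neuron_cores": 1}},
    autoscaling_config={"min_replicas": 1, "max_replicas": 12},
)
class StableDiffusionV2:
    def __init__(self):
        # Compile/load the model for the Neuron runtime here.
        pass

    async def __call__(self, request):
        # Run inference on the assigned NeuronCore and return the result.
        return {"status": "ok"}

app = StableDiffusionV2.bind()
```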

Logs Indicating the Issue:

The logs suggest that Ray's autoscaler could not find a suitable node type to satisfy the resource demands, despite the available Neuron device resources.

{'CPU': 1.0, 'neuron_cores': 1.0, 'node:__internal_implicit_resource_stable-diffusion-deployment:stable-diffusion-v2': 0.3333}: 6+ pending tasks/actors

2024-03-22 08:39:58,179	WARNING resource_demand_scheduler.py:782 -- The autoscaler could not find a node type to satisfy the request: [
{'CPU': 1.0, 'node:__internal_implicit_resource_stable-diffusion-deployment:stable-diffusion-v2': 0.3333, 'neuron_cores': 1.0},
 {'CPU': 1.0, 'node:__internal_implicit_resource_stable-diffusion-deployment:stable-diffusion-v2': 0.3333, 'neuron_cores': 1.0},
 {'CPU': 1.0, 'node:__internal_implicit_resource_stable-diffusion-deployment:stable-diffusion-v2': 0.3333, 'neuron_cores': 1.0},
 {'CPU': 1.0, 'node:__internal_implicit_resource_stable-diffusion-deployment:stable-diffusion-v2': 0.3333, 'neuron_cores': 1.0},
{'CPU': 1.0, 'node:__internal_implicit_resource_stable-diffusion-deployment:stable-diffusion-v2': 0.3333, 'neuron_cores': 1.0},
 {'CPU': 1.0, 'node:__internal_implicit_resource_stable-diffusion-deployment:stable-diffusion-v2': 0.3333, 'neuron_cores': 1.0}]. 

Please specify a node type with the necessary resources.

Possible Contributing Factors: The RayStartParam specifies the resource as resources: {"neuron_cores": 2}, which might not align with the resource tags added by the Neuron device plugin to the nodes managed by Karpenter (a diagnostic sketch follows below).
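A quick way to check whether the worker nodes actually advertise the neuron_cores custom resource that the pending requests ask for is to inspect the cluster's resource view from inside the cluster (a diagnostic sketch, assuming access to the running Ray cluster):

```python
# Diagnostic sketch: compare what the cluster advertises with what the
# pending replicas request. If "neuron_cores" is missing or lower than
# expected here, the autoscaler has no node type that can satisfy the
# {'CPU': 1.0, 'neuron_cores': 1.0, ...} demands shown in the warning above.
import ray

ray.init(address="auto")
print("cluster totals:", ray.cluster_resources())
print("currently free:", ray.available_resources())
```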

What You Expected to Happen:

Expected Behavior: RayServe's autoscaling should seamlessly request new worker nodes when the demand exceeds the capacity of the current nodes, especially in scenarios where more than 6 replicas are needed. Karpenter should then be able to provision new nodes based on the resource requests from RayServe, allowing for the continuous scaling of model deployments without manual intervention.

Seamless Integration and Scaling: Given the configuration and the Neuron devices available on Inf2 instances, I expected scaling to leverage NeuronCore resources effectively across multiple nodes, so that a larger number of model replicas could be deployed and managed dynamically based on load.

Additional Information:

Deployment Configuration: The issue arises with a specific RayServe configuration designed for deploying models on Amazon EKS with Inf2 instances. The configuration details can be found at RayServe Configuration for Stable Diffusion on EKS.

Potential Misalignment with Neuron Device Plugin and Karpenter: The issue might stem from how Neuron device resources are tagged and utilized by Karpenter in response to RayServe's resource requests, suggesting a potential area for troubleshooting and adjustment.

Versions / Dependencies

Reproduction script

Steps to Reproduce:

Deploy Infrastructure and RayServe Model Inference:

Follow the instructions in the blueprint for deploying the infrastructure and the RayServe model inference for Stable Diffusion: Deploying StableDiffusion Model Inference on EKS. The guide covers setting up Amazon EKS, configuring Karpenter, deploying RayServe, and preparing the model for inference.
Generate concurrent requests:

Use Postman to send multiple concurrent requests to the deployed RayServe model endpoint (a load-generation sketch follows below). The objective is to create a workload that triggers RayServe's autoscaling, requiring more replicas and, consequently, the provisioning of additional nodes by Karpenter.
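In place of Postman, a small script like the following can generate the concurrent load (a sketch; the endpoint URL and payload are placeholders, not the blueprint's actual values):

```python
# Load-generation sketch: sends many concurrent requests to push the replica
# count past what a single inf2.24xlarge can host. The URL and payload are
# placeholders, not the blueprint's actual values.
from concurrent.futures import ThreadPoolExecutor
import requests

ENDPOINT = "http://<your-ingress-or-service-host>/"            # placeholder
PAYLOAD = {"prompt": "a photo of an astronaut riding a horse"}  # placeholder

def send_request(i: int) -> int:
    resp = requests.post(ENDPOINT, json=PAYLOAD, timeout=300)
    return resp.status_code

with ThreadPoolExecutor(max_workers=20) as pool:
    for status in pool.map(send_request, range(40)):
        print(status)
```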

Monitor Logs for Scaling Activity:

Watch the Ray dashboard and the Karpenter logs to observe the scaling behavior (a simple polling sketch follows below). The expectation is that the number of replicas increases in response to the simulated demand, prompting Karpenter to provision new nodes to accommodate the additional replicas.
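Alongside the dashboard, the pending resource demands can be polled from the head node with `ray status` (a minimal monitoring sketch; assumes the Ray CLI can reach the cluster from where it runs):

```python
# Monitoring sketch: polls `ray status`, which summarizes pending resource
# demands (e.g. the neuron_cores bundles above) and node provisioning,
# every 30 seconds.
import subprocess
import time

while True:
    result = subprocess.run(["ray", "status"], capture_output=True, text=True)
    print(result.stdout)
    time.sleep(30)
```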
Identify autoscaling limitations:

The critical point of observation is when the number of replicas reaches 6. Beyond this point, note whether RayServe attempts to scale beyond the existing node capacity and whether Karpenter responds by provisioning additional nodes. The failure to do so is the issue being reported.

Expected Outcome:
The infrastructure and RayServe deployment should scale seamlessly in response to increased demand, with Karpenter provisioning new nodes as required to host the additional replicas.

Issue Severity

High: It blocks me from completing my task.

@vara-bonthu vara-bonthu added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Mar 29, 2024

GeneDer commented Apr 2, 2024

@vara-bonthu Can you scope this down to just what's missing or wrong in Ray Serve? Is this really an issue that requires changing Ray code? Or, if you already know the fix, feel free to contribute it to the codebase.

@GeneDer GeneDer self-assigned this Apr 3, 2024
@GeneDer GeneDer added P2 Important issue, but not time-critical serve Ray Serve Related Issue @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Apr 3, 2024

geetasg commented Apr 8, 2024

Does Ray for Neuron support autoscaling based on Neuron devices, represented by the device plugin as aws.amazon.com/neuron?

vara-bonthu (Author) commented

@GeneDer The issue is related solely to scaling new nodes for inf2 instances. It appears node_types is not set, causing the code to break at this line.

What is the easiest way to debug the code to print the values passed to this method?

I will try to dig further and keep you posted
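One low-effort option, as a sketch: add a temporary log line near the failing spot in resource_demand_scheduler.py and rebuild the image. The helper below uses hypothetical placeholder names, not the actual signature in that file:

```python
# Hypothetical debug aid (placeholder names, not the real signature):
# a temporary helper to dump the inputs the autoscaler is working with
# near the failing code path in resource_demand_scheduler.py.
import logging

logger = logging.getLogger(__name__)

def dump_scheduler_inputs(node_types, resource_demands):
    # node_types: the autoscaler's view of the available node configurations
    # resource_demands: the pending {'CPU': ..., 'neuron_cores': ...} bundles
    logger.warning("node_types=%r", node_types)
    logger.warning("resource_demands=%r", resource_demands)
```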


GeneDer commented Apr 8, 2024

@vara-bonthu those are great findings! If you are developing on a Mac, you can follow these instructions to set it up locally: https://docs.ray.io/en/master/ray-contribute/development.html#building-ray-on-linux-macos-full

If this has to go onto a cluster, I think you can raise a draft PR and one of the CI steps will generate a wheel that you can use to build the Docker image for testing. Here is an example of such a build that generates the wheel.

@anyscalesam anyscalesam added the core-hardware Hardware support in Ray core: e.g. accelerators label Apr 11, 2024