
[RayServe] Autoscaling Issue with Neuron Devices (Inf2), RayServe, and Karpenter on EKS #44361

Open
vara-bonthu opened this issue Mar 29, 2024 · 4 comments


vara-bonthu commented Mar 29, 2024

What happened + What you expected to happen

What Happened:

When deploying models with RayServe autoscaling enabled on Amazon EKS, specifically across multiple inf2 nodes, scaling works correctly within a single inf2.24xlarge instance, up to 6 replicas. Beyond 6 replicas, however, RayServe fails to request new worker nodes, which should trigger Karpenter to provision additional nodes for the new pods. This happens even though autoscaling appears to be configured correctly for this scenario.
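For context, the deployment in question requests one NeuronCore per replica, roughly along the lines of the sketch below (a minimal illustration; the replica bounds and model code are placeholders, not the blueprint's exact values):

```python
# Minimal sketch of a Serve deployment that requests one NeuronCore per
# replica and relies on autoscaling. The replica bounds and model loading
# are illustrative placeholders, not the blueprint's exact values.
from ray import serve

@serve.deployment(
    name="stable-diffusion-v2",
    ray_actor_options={"num_cpus": 1, "resources": {"neuron_cores": 1}},
    autoscaling_config={"min_replicas": 1, "max_replicas": 12},
)
class StableDiffusionV2:
    def __init__(self):
        # Compile/load the model for the Neuron runtime here.
        pass

    async def __call__(self, request):
        # Run inference on the assigned NeuronCore and return the result.
        return {"status": "ok"}

app = StableDiffusionV2.bind()
```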

Logs Indicating the Issue:

The logs suggest that Ray's autoscaler could not find a suitable node type to satisfy the resource demands, despite the available Neuron device resources.

{'CPU': 1.0, 'neuron_cores': 1.0, 'node:__internal_implicit_resource_stable-diffusion-deployment:stable-diffusion-v2': 0.3333}: 6+ pending tasks/actors

2024-03-22 08:39:58,179	WARNING resource_demand_scheduler.py:782 -- The autoscaler could not find a node type to satisfy the request: [
{'CPU': 1.0, 'node:__internal_implicit_resource_stable-diffusion-deployment:stable-diffusion-v2': 0.3333, 'neuron_cores': 1.0},
 {'CPU': 1.0, 'node:__internal_implicit_resource_stable-diffusion-deployment:stable-diffusion-v2': 0.3333, 'neuron_cores': 1.0},
 {'CPU': 1.0, 'node:__internal_implicit_resource_stable-diffusion-deployment:stable-diffusion-v2': 0.3333, 'neuron_cores': 1.0},
 {'CPU': 1.0, 'node:__internal_implicit_resource_stable-diffusion-deployment:stable-diffusion-v2': 0.3333, 'neuron_cores': 1.0},
{'CPU': 1.0, 'node:__internal_implicit_resource_stable-diffusion-deployment:stable-diffusion-v2': 0.3333, 'neuron_cores': 1.0},
 {'CPU': 1.0, 'node:__internal_implicit_resource_stable-diffusion-deployment:stable-diffusion-v2': 0.3333, 'neuron_cores': 1.0}]. 

Please specify a node type with the necessary resources.

Possible Contributing Factors: The RayStartParam specifies the resource as resources: {"neuron_cores": 2}, which might not align with the resource tags added by the Neuron device plugin to the nodes managed by Karpenter (a diagnostic sketch follows below).
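A quick way to check whether the worker nodes actually advertise the neuron_cores custom resource that the pending requests ask for is to inspect the cluster's resource view from inside the cluster (a diagnostic sketch, assuming access to the running Ray cluster):

```python
# Diagnostic sketch: compare what the cluster advertises with what the
# pending replicas request. If "neuron_cores" is missing or lower than
# expected here, the autoscaler has no node type that can satisfy the
# {'CPU': 1.0, 'neuron_cores': 1.0, ...} demands shown in the warning above.
import ray

ray.init(address="auto")
print("cluster totals:", ray.cluster_resources())
print("currently free:", ray.available_resources())
```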

What You Expected to Happen:

Expected Behavior: RayServe's autoscaling should seamlessly request new worker nodes when the demand exceeds the capacity of the current nodes, especially in scenarios where more than 6 replicas are needed. Karpenter should then be able to provision new nodes based on the resource requests from RayServe, allowing for the continuous scaling of model deployments without manual intervention.

Seamless Integration and Scaling: Given the configuration and the Neuron devices available on Inf2 instances, I expected scaling to leverage NeuronCore resources effectively across multiple nodes, so that a larger number of model replicas could be deployed and managed dynamically based on load.

Additional Information:

Deployment Configuration: The issue arises with a specific RayServe configuration designed for deploying models on Amazon EKS with Inf2 instances. The configuration details can be found at RayServe Configuration for Stable Diffusion on EKS.

Potential Misalignment with Neuron Device Plugin and Karpenter: The issue might stem from how Neuron device resources are tagged and utilized by Karpenter in response to RayServe's resource requests, suggesting a potential area for troubleshooting and adjustment.

Versions / Dependencies

Reproduction script

Steps to Reproduce:

Deploy Infrastructure and RayServe Model Inference:

Follow the instructions in the blueprint for deploying the infrastructure and the RayServe model inference for Stable Diffusion: Deploying StableDiffusion Model Inference on EKS. The guide covers setting up Amazon EKS, configuring Karpenter, deploying RayServe, and preparing the model for inference.
Generate concurrent requests:

Use Postman to send multiple concurrent requests to the deployed RayServe model endpoint (a load-generation sketch follows below). The objective is to create a workload that triggers RayServe's autoscaling, requiring more replicas and, consequently, the provisioning of additional nodes by Karpenter.
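In place of Postman, a small script like the following can generate the concurrent load (a sketch; the endpoint URL and payload are placeholders, not the blueprint's actual values):

```python
# Load-generation sketch: sends many concurrent requests to push the replica
# count past what a single inf2.24xlarge can host. The URL and payload are
# placeholders, not the blueprint's actual values.
from concurrent.futures import ThreadPoolExecutor
import requests

ENDPOINT = "http://<your-ingress-or-service-host>/"            # placeholder
PAYLOAD = {"prompt": "a photo of an astronaut riding a horse"}  # placeholder

def send_request(i: int) -> int:
    resp = requests.post(ENDPOINT, json=PAYLOAD, timeout=300)
    return resp.status_code

with ThreadPoolExecutor(max_workers=20) as pool:
    for status in pool.map(send_request, range(40)):
        print(status)
```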

Monitor Logs for Scaling Activity:

Watch the Ray dashboard and the Karpenter logs to observe the scaling behavior (a simple polling sketch follows below). The expectation is that the number of replicas increases in response to the simulated demand, prompting Karpenter to provision new nodes to accommodate the additional replicas.
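Alongside the dashboard, the pending resource demands can be polled from the head node with `ray status` (a minimal monitoring sketch; assumes the Ray CLI can reach the cluster from where it runs):

```python
# Monitoring sketch: polls `ray status`, which summarizes pending resource
# demands (e.g. the neuron_cores bundles above) and node provisioning,
# every 30 seconds.
import subprocess
import time

while True:
    result = subprocess.run(["ray", "status"], capture_output=True, text=True)
    print(result.stdout)
    time.sleep(30)
```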
Identify autoscaling limitations:

The critical point of observation is when the number of replicas reaches 6. Beyond this point, note whether RayServe attempts to scale beyond the existing node capacity and whether Karpenter responds by provisioning additional nodes. The failure to do so is the issue being reported.

Expected Outcome:
The infrastructure and RayServe deployment should scale seamlessly in response to increased demand, with Karpenter provisioning new nodes as required to host the additional replicas.

Issue Severity

High: It blocks me from completing my task.

@vara-bonthu vara-bonthu added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Mar 29, 2024

GeneDer commented Apr 2, 2024

@vara-bonthu Can you scope this down to just what's missing or wrong in Ray Serve? Is this really an issue that requires changing Ray code? Or, if you already know the fix, feel free to contribute it to the codebase.

@GeneDer GeneDer self-assigned this Apr 3, 2024
@GeneDer GeneDer added P2 Important issue, but not time-critical serve Ray Serve Related Issue @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Apr 3, 2024

geetasg commented Apr 8, 2024

Does Ray for Neuron support autoscaling based on Neuron devices, represented by the device plugin as aws.amazon.com/neuron?

vara-bonthu (Author) commented

@GeneDer The issue is related solely to scaling new nodes for inf2 instances. It appears node_types is not set, causing the code to break at this line.

What is the easiest way to debug the code to print the values passed to this method?

I will try to dig further and keep you posted
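One low-effort option, as a sketch: add a temporary log line near the failing spot in resource_demand_scheduler.py and rebuild the image. The helper below uses hypothetical placeholder names, not the actual signature in that file:

```python
# Hypothetical debug aid (placeholder names, not the real signature):
# a temporary helper to dump the inputs the autoscaler is working with
# near the failing code path in resource_demand_scheduler.py.
import logging

logger = logging.getLogger(__name__)

def dump_scheduler_inputs(node_types, resource_demands):
    # node_types: the autoscaler's view of the available node configurations
    # resource_demands: the pending {'CPU': ..., 'neuron_cores': ...} bundles
    logger.warning("node_types=%r", node_types)
    logger.warning("resource_demands=%r", resource_demands)
```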


GeneDer commented Apr 8, 2024

@vara-bonthu those are great findings! If you are developing on a Mac, you can follow these instructions to set it up locally: https://docs.ray.io/en/master/ray-contribute/development.html#building-ray-on-linux-macos-full

If this has to go onto a cluster, I think you can raise a draft PR and one of the CI steps will generate a wheel that you can use to build the Docker image for testing. Here is an example of such a build that generates the wheel.

@anyscalesam anyscalesam added the core-hardware Hardware support in Ray core: e.g. accelerators label Apr 11, 2024