[RayServe] Autoscaling Issue with Neuron Devices (Inf2), RayServe, and Karpenter on EKS #44361
Labels
@author-action-required: The PR author is responsible for the next step; remove the tag to send back to the reviewer.
bug: Something that is supposed to be working, but isn't.
core-hardware: Hardware support in Ray core, e.g. accelerators.
P2: Important issue, but not time-critical.
serve: Ray Serve related issue.
What happened + What you expected to happen
What Happened:
When deploying models using RayServe with autoscaling enabled on Amazon EKS, specifically across multiple inf2 nodes, the system scales correctly within a single inf2.24xlarge instance, up to 6 replicas. Beyond 6 replicas, however, RayServe fails to request new worker nodes, which should trigger Karpenter to provision additional nodes for placing the new pods. This occurs despite autoscaling appearing to be configured correctly for this scenario.
Logs Indicating the Issue:
The logs suggest that Ray's autoscaler could not find a suitable node type to satisfy the resource demands, despite the available Neuron device resources.
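Conceptually, that warning means no registered node type satisfies the pending resource demand. The following is an illustrative sketch (not Ray's actual autoscaler code) of why a name mismatch between the demanded custom resource and what the node actually advertises produces exactly this symptom:

```python
# Illustrative sketch (not Ray's actual autoscaler logic): a demand can only
# be satisfied if every requested resource key appears on some node type with
# sufficient capacity. If the demand says "neuron_cores" but the node only
# advertises the Kubernetes device-plugin name, nothing fits.

def node_can_fit(demand: dict, node_resources: dict) -> bool:
    """Return True if a single node type can satisfy the resource demand."""
    return all(node_resources.get(k, 0) >= v for k, v in demand.items())

# Node as Ray sees it when rayStartParams registers the custom resource:
matching_node = {"CPU": 90, "neuron_cores": 12}
# Node where only the device-plugin resource name is visible to Ray:
mismatched_node = {"CPU": 90, "aws.amazon.com/neuron": 6}

demand = {"neuron_cores": 2}
print(node_can_fit(demand, matching_node))    # True
print(node_can_fit(demand, mismatched_node))  # False -> "no suitable node type"
```

This is why the autoscaler can report unsatisfiable demands even though the hardware is physically available on the cluster.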
Possible Contributing Factors: The rayStartParams block specifies resources as resources: {"neuron_cores": 2}, which may not align with the resource labels that the Neuron device plugin adds to the nodes managed by Karpenter.
What You Expected to Happen:
Expected Behavior: RayServe's autoscaling should seamlessly request new worker nodes when the demand exceeds the capacity of the current nodes, especially in scenarios where more than 6 replicas are needed. Karpenter should then be able to provision new nodes based on the resource requests from RayServe, allowing for the continuous scaling of model deployments without manual intervention.
Seamless Integration and Scaling: Given the configuration and resources available, especially with Neuron devices on Inf2 instances, I expected a smooth scaling experience that leverages the Neuron core resources effectively across multiple nodes, allowing for a greater number of model replicas to be deployed and managed dynamically based on load.
Additional Information:
Deployment Configuration: The issue arises with a specific RayServe configuration designed for deploying models on Amazon EKS with Inf2 instances. The configuration details can be found at RayServe Configuration for Stable Diffusion on EKS.
Potential Misalignment with Neuron Device Plugin and Karpenter: The issue might stem from how Neuron device resources are tagged and utilized by Karpenter in response to RayServe's resource requests, suggesting a potential area for troubleshooting and adjustment.
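For the two sides to line up, the Ray-level custom resource and the Kubernetes device-plugin resource both need to be declared on the same worker group. A hypothetical KubeRay fragment is sketched below (field names follow the RayCluster/RayService schema; the aws.amazon.com/neuroncore resource name is an assumption that should be verified against kubectl describe node output for the inf2 nodes):

```yaml
# Hypothetical KubeRay workerGroupSpec fragment -- verify resource names
workerGroupSpecs:
  - groupName: inf2-workers
    minReplicas: 0
    maxReplicas: 8
    rayStartParams:
      # Custom Ray resources are passed as an escaped JSON string
      resources: '"{\"neuron_cores\": 2}"'
    template:
      spec:
        containers:
          - name: ray-worker
            image: rayproject/ray:2.9.0-py310
            resources:
              limits:
                aws.amazon.com/neuroncore: "2"  # assumed device-plugin name
```

If the rayStartParams resource name and the device-plugin resource name diverge, Ray's autoscaler and Karpenter each see a demand the other cannot match.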
Versions / Dependencies
rayproject/ray:2.9.0-py310. See the Dockerfile here: https://github.com/awslabs/data-on-eks/blob/main/ai-ml/trainium-inferentia/examples/inference/ray-serve/stable-diffusion-inf2/Dockerfile
Reproduction script
Steps to Reproduce:
Deploy Infrastructure and RayServe Model Inference:
Follow the instructions in the blueprint for deploying the infrastructure and RayServe model inference for Stable Diffusion, available at the following URL: Deploying Stable Diffusion Model Inference on EKS. The guide covers setting up Amazon EKS, configuring Karpenter, deploying RayServe, and preparing the model for inference.
Generate concurrent requests:
Utilize Postman to simulate multiple concurrent requests to the deployed RayServe model endpoint. The objective is to create a workload that triggers the auto-scaling behavior of RayServe, necessitating the scaling of replicas and, consequently, the provisioning of additional nodes by Karpenter.
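If Postman is not available, the same load can be generated with a short script. The sketch below keeps many requests in flight concurrently; the sender is a stub so the sketch runs standalone (the endpoint URL and payload are placeholders to adapt to the actual Serve route):

```python
# Minimal load-generation sketch as an alternative to Postman.
from concurrent.futures import ThreadPoolExecutor

def generate_load(send, num_requests: int, concurrency: int):
    """Fire num_requests calls to `send` with up to `concurrency` in flight."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [pool.submit(send, i) for i in range(num_requests)]
        return [f.result() for f in futures]

def send_request(i: int):
    # Placeholder sender: replace with an HTTP POST to your Serve endpoint,
    # e.g. urllib.request.urlopen("http://<serve-endpoint>/imagine", data=...)
    return i  # stub so the sketch runs without a live cluster

results = generate_load(send_request, num_requests=50, concurrency=10)
print(len(results))  # 50 completed requests -> sustained concurrent load
```

Sustained concurrency above the per-replica capacity is what pushes Serve's autoscaler past the 6-replica boundary where the reported failure appears.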
Monitor Logs for Scaling Activity:
Keep an eye on the Ray dashboard and Karpenter logs to observe the scaling behavior. The expectation is for the number of replicas to increase in response to the simulated demand, leading to Karpenter being prompted to provision new nodes to accommodate the additional replicas.
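One concrete thing to grep for while monitoring is the autoscaler's unsatisfiable-demand warning. The sketch below runs against a simulated log line so it is self-contained; the exact message wording is recalled from Ray 2.x and may differ slightly in other versions, and the pod/deployment names in the commented commands are placeholders:

```shell
# Real commands against the cluster would look like:
#   kubectl logs <ray-head-pod> | grep -i "node types"
#   kubectl logs -n karpenter deploy/karpenter | grep -i provision

# Simulated autoscaler log excerpt so this sketch runs standalone:
sample_log='WARNING resource_demand_scheduler.py: No available node types can fulfill resource request {"neuron_cores": 2.0}.'

# Count occurrences of the telltale warning:
echo "$sample_log" | grep -c "No available node types"
```

Seeing this warning while Karpenter's logs show no corresponding provisioning attempt is the signature of the scaling gap described above.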
Identify autoscaling limitations:
The critical point of observation is when the number of replicas reaches 6. Beyond this point, note whether RayServe attempts to scale beyond the existing node capacity and if Karpenter responds by provisioning additional nodes. The failure to do so underlines the issue being reported.
Expected Outcome:
The infrastructure and RayServe deployment should scale seamlessly in response to increased demand, with Karpenter provisioning new nodes as required to host the additional replicas.
Issue Severity
High: It blocks me from completing my task.