[RayService] Add New Status: NumServeEndpoints #1901

Merged

@Yicheng-Lu-llll Yicheng-Lu-llll commented Feb 4, 2024

Why are these changes needed?

This PR adds a NumServeEndpoints field to the RayService status. The field reports the number of Ray Pods that are actively serving, that is, the number of Ray Pods selected by the serve service. A Ray Pod that has no proxy actor or is unhealthy is not counted.

The new field lets users and CI tests see how many Ray Pods are able to serve, which makes debugging much easier. This matters because RayService's high-availability feature keeps the service working even when some worker Pods become unhealthy, so such failures are hard to spot from the outside.
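The counting rule described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the `Endpoints`, `EndpointSubset`, and `EndpointAddress` types are simplified local stand-ins for the Kubernetes `core/v1` types of the same names, `countServeEndpoints` is a hypothetical helper name, and the sketch assumes that "selected and healthy" corresponds to an address appearing in a subset's ready `Addresses` list.

```go
package main

import "fmt"

// Simplified local stand-ins for the k8s.io/api/core/v1 types of the same
// names, so that this sketch has no external dependencies.
type EndpointAddress struct {
	IP string
}

type EndpointSubset struct {
	Addresses         []EndpointAddress // selected Pods that are ready
	NotReadyAddresses []EndpointAddress // selected Pods that are not ready
}

type Endpoints struct {
	Subsets []EndpointSubset
}

// countServeEndpoints (hypothetical name) sums the ready addresses across
// all subsets. Not-ready addresses are deliberately ignored.
func countServeEndpoints(ep Endpoints) int32 {
	var n int32
	for _, subset := range ep.Subsets {
		n += int32(len(subset.Addresses))
	}
	return n
}

func main() {
	ep := Endpoints{Subsets: []EndpointSubset{{
		Addresses:         []EndpointAddress{{IP: "10.0.0.1"}, {IP: "10.0.0.2"}},
		NotReadyAddresses: []EndpointAddress{{IP: "10.0.0.3"}},
	}}}
	fmt.Println(countServeEndpoints(ep)) // prints 2
}
```

In this sketch the third Pod is selected by the service but not ready, so it does not contribute to the count.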

Related issue number

Checks

# Run a RayService sample from https://github.com/ray-project/kuberay/tree/master/ray-operator/config/samples
# It creates 1 head Pod and 1 worker Pod.
# Each Pod has a serve replica, so the number of active serving Ray Pods is 2.
kubectl apply -f /home/ubuntu/kuberay/ray-operator/config/samples/ray-service.sample.yaml
kubectl describe rayservice rayservice-sample | grep "Num Serve Endpoints"
# Num Serve Endpoints:  2

# Run another RayService sample.
# It creates 1 head Pod and 3 worker Pods. Only 1 worker Pod has a serve replica.
# The head Pod always has a proxy actor, so the number of active serving Ray Pods is 1+1=2.
kubectl apply -f https://raw.githubusercontent.com/Yicheng-Lu-llll/serve-file/main/rayservice-config_v2.9_replicas-1.yaml
kubectl describe rayservice rayservice-sample | grep "Num Serve Endpoints"
# Num Serve Endpoints:  2
  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(
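
Since the description mentions CI tests as a consumer of this field, a test could extract the value from the `kubectl describe` output shown in the manual checks. The following is a sketch under assumptions: `parseNumServeEndpoints` is a hypothetical helper, not part of KubeRay, and it assumes the output contains a line starting with `Num Serve Endpoints`, with or without a trailing colon.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// parseNumServeEndpoints (hypothetical helper) scans describe-style output
// for the "Num Serve Endpoints" line and returns its integer value.
func parseNumServeEndpoints(describeOutput string) (int, error) {
	const label = "Num Serve Endpoints"
	scanner := bufio.NewScanner(strings.NewReader(describeOutput))
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		if !strings.HasPrefix(line, label) {
			continue
		}
		// Drop the label and an optional colon, then parse what remains.
		value := strings.TrimSpace(strings.TrimPrefix(line, label))
		value = strings.TrimSpace(strings.TrimPrefix(value, ":"))
		return strconv.Atoi(value)
	}
	if err := scanner.Err(); err != nil {
		return 0, err
	}
	return 0, fmt.Errorf("%q not found in output", label)
}

func main() {
	out := "Status:\n  Num Serve Endpoints:  2\n"
	n, err := parseNumServeEndpoints(out)
	fmt.Println(n, err) // prints: 2 <nil>
}
```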

@Yicheng-Lu-llll Yicheng-Lu-llll changed the title from "WIP" to "[RayService] Add New Field to Status for Active Serving Ray Pods Count" on Feb 4, 2024
@kevin85421 kevin85421 self-requested a review February 5, 2024 21:47
@kevin85421 kevin85421 self-assigned this Feb 5, 2024
@@ -71,6 +71,9 @@ type RayServiceStatuses struct {
PendingServiceStatus RayServiceStatus `json:"pendingServiceStatus,omitempty"`
// ServiceStatus indicates the current RayService status.
ServiceStatus ServiceStatus `json:"serviceStatus,omitempty"`
// ActiveServingRayPods indicates the number of Ray Pods that are actively serving or have been selected by the serve service.
// Ray Pods without a proxy actor or those that are unhealthy will not be counted.
ActiveServingRayPods int32 `json:"ActiveServingRayPods,omitempty"`
Member

NumServeEndpoints
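
One consequence of keeping `omitempty` on the field under review is worth noting: an `int32` count of zero is dropped from the serialized status. The sketch below illustrates this with a cut-down, hypothetical `statusSketch` struct. The JSON tag `numServeEndpoints` is an assumption chosen to match the camelCase of the neighboring tags; the diff under review uses a different tag, and the merged tag is not shown in this conversation.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// statusSketch is a hypothetical, cut-down stand-in for RayServiceStatuses.
// The JSON tag name is an assumption made for illustration.
type statusSketch struct {
	NumServeEndpoints int32 `json:"numServeEndpoints,omitempty"`
}

// marshalStatus serializes a status that carries the given count.
func marshalStatus(n int32) string {
	b, err := json.Marshal(statusSketch{NumServeEndpoints: n})
	if err != nil {
		return "" // not expected for this simple struct
	}
	return string(b)
}

func main() {
	fmt.Println(marshalStatus(2)) // prints {"numServeEndpoints":2}
	fmt.Println(marshalStatus(0)) // prints {}
}
```

With this tag, a reader of the status cannot distinguish "zero serving Pods" from "field not populated" by looking at the JSON alone.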

@Yicheng-Lu-llll Yicheng-Lu-llll changed the title from "[RayService] Add New Field to Status for Active Serving Ray Pods Count" to "[RayService] Add New Status: NumServeEndpoints" on Feb 9, 2024
Signed-off-by: Yicheng-Lu-llll <luyc58576@gmail.com>
@Yicheng-Lu-llll Yicheng-Lu-llll marked this pull request as ready for review February 15, 2024 16:58
}

numServeEndpoints := 0
for _, subset := range serveEndPoints.Subsets {
Member

I am not familiar with endpoints. What is Subsets?

@@ -224,6 +229,25 @@ func (r *RayServiceReconciler) Reconcile(ctx context.Context, request ctrl.Reque
return ctrl.Result{RequeueAfter: ServiceDefaultRequeueDuration}, nil
}

func (r *RayServiceReconciler) calculateStatus(ctx context.Context, rayServiceInstance *rayv1.RayService, rayClusterInstance *rayv1.RayCluster) error {
serveSvc, err := common.BuildServeServiceForRayService(ctx, *rayServiceInstance, *rayClusterInstance)
Member

TODO: Associate RayService and its K8s service in a smarter way.

Member

cc @rueian

Signed-off-by: Yicheng-Lu-llll <luyc58576@gmail.com>
@kevin85421 kevin85421 merged commit 834aed3 into ray-project:master Feb 17, 2024
23 checks passed