TPU Multi-Host Support #1913

Merged: 16 commits into ray-project:master on Feb 22, 2024

Conversation

@ryanaoleary (Contributor) commented Feb 7, 2024:

Why are these changes needed?

Fix reconciliation logic for multi-host worker groups.

For NumOfHosts > 1, the controller now treats each replica as a multi-host group and scales by NumOfHosts Pods per replica. Additionally, if a Pod in a multi-host group fails or is deleted, the entire multi-host group is deleted. This PR adds unit tests covering multi-host Pod creation, deletion, and reconciliation logic.
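
For illustration, a minimal sketch of the per-replica creation idea, using the field names and the createWorkerPod call quoted in the review snippets below; the outer loop and the replicasToCreate variable are assumptions, not the exact controller code:

	// Sketch only: scaling up creates one group of NumOfHosts Pods per new replica.
	for i := int32(0); i < replicasToCreate; i++ {
		group := rand.Uint32() // shared identifier for every host Pod in this replica
		for host := uint32(0); host < uint32(worker.NumOfHosts); host++ {
			if err := r.createWorkerPod(ctx, *instance, *worker.DeepCopy(), group, host); err != nil {
				return err
			}
		}
	}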

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

ray-operator/controllers/ray/raycluster_controller.go (outdated review thread, on the following diff):
	r.Log.Info("reconcilePods", "removing worker groups in the scaleStrategy of", worker.GroupName)
	for _, groupToDelete := range worker.ScaleStrategy.MultihostGroupsToDelete {
		for _, pod := range workerPods.Items {
			if pod.Labels[utils.RayNodeGroupLabelKey] == groupToDelete {
Contributor:

Thinking some more about this, I am not sure this would work: if this is the worker group name, we'll end up deleting all Pods from the worker group, including all the replicas. I think the intention here should be to delete only a specific replica, including all of its hosts.

Also I am not sure how the autoscaler logic would work. I am assuming the resource scheduler should know if certain pods have been idle, and it adds those pods to the deletion request. I don't think the autoscaler has any idea if an entire multihost replica has been idle.

@kevin85421 FYI

Contributor:

After giving this a bit more thought, I think there are a couple of ways to solve this:

  1. In the get_node_data function in the KubeRay node provider, add some code to parse the replica index for each node.
  2. Then build a reverse lookup map that finds all other worker Pods belonging to the same replica.
  3. When one Pod needs to scale down, make sure that all other Pods from the same replica are also sent as part of WorkersToDelete.

OR:

Similar to the above, but in step 3 we send the name of the replica in the scale-down request. On the KubeRay operator side, we figure out which exact Pods belong to that replica and delete them together.
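
For illustration, a rough sketch of what the second option could look like on the operator side, assuming Pods carry the ray.io/multihost-replica label proposed in this PR, that MultihostGroupsToDelete names replicas rather than worker groups, and that corev1 is the k8s.io/api/core/v1 import. This per-replica deletion path was ultimately dropped in favor of Strategy No. 1 (see the summary later in this thread):

	// Sketch only: group worker Pods by their replica label so an entire replica
	// can be deleted together when Ray asks to scale it down.
	podsByReplica := make(map[string][]corev1.Pod)
	for _, pod := range workerPods.Items {
		replicaID := pod.Labels[utils.MultihostReplicaKey]
		podsByReplica[replicaID] = append(podsByReplica[replicaID], pod)
	}
	for _, replicaID := range worker.ScaleStrategy.MultihostGroupsToDelete {
		for _, pod := range podsByReplica[replicaID] {
			pod := pod // avoid taking the address of the shared loop variable
			if err := r.Delete(ctx, &pod); err != nil {
				return err
			}
		}
	}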

Another review thread, on the following diff:

	return err
	// Due to pods being scaled down, we are not guaranteed that the multihost group name will always be
	// incremental. So we just need to use some random integer here.
	group := rand.Uint32()
Contributor:

Given the above comment, perhaps we should make this id more deterministic?

Contributor:

(Update) I think it should be fine to leave it as is; see the reply on the first comment.

ray-operator/controllers/ray/utils/constant.go (resolved review thread)
ray-operator/controllers/ray/raycluster_controller.go (resolved review thread)
ray-operator/controllers/ray/utils/constant.go (outdated review thread, on the following diff):
@@ -22,6 +22,8 @@ const (
 	RayClusterServingServiceLabelKey = "ray.io/serve"
 	HashWithoutReplicasAndWorkersToDeleteKey = "ray.io/hash-without-replicas-and-workers-to-delete"
 	NumWorkerGroupsKey = "ray.io/num-worker-groups"
+	MultihostReplicaKey = "ray.io/multihost-replica"
+	RayNodeHostIndexKey = "ray.io/host-index"
Member:

In my understanding, this label is similar to TPU_WORKER_ID. We should not add this label because it will add a lot of complexity to KubeRay's reconciliation logic.

  • No. 1: KubeRay is not aware of multi-host PodSlice and TPU_WORKER_ID. All Pods in the same worker group are the same. If a Pod is deleted accidentally, we just create a new one.
  • No. 2: KubeRay is aware of multi-host PodSlice but not of TPU_WORKER_ID. Pods in the same worker group are different. If a Pod is deleted accidentally, we need to figure out which PodSlice doesn't have enough Pods and create one for that PodSlice.
  • No. 3: KubeRay is aware of both multi-host PodSlice and TPU_WORKER_ID. If a Pod is accidentally deleted, we need to determine which PodSlice it belonged to and which TPU_WORKER_ID it had in order to create a new Pod.

We should do our best to implement strategy No. 1. If that's not possible, we at least need to adhere to strategy No. 2.

ray-operator/controllers/ray/raycluster_controller.go (resolved review thread)
ray-operator/controllers/ray/raycluster_controller.go (resolved review thread)
ray-operator/controllers/ray/raycluster_controller.go (outdated review thread, on the following diff):
	group := rand.Uint32()
	var j uint32
	for j = 0; j < uint32(worker.NumOfHosts); j++ {
		if err := r.createWorkerPod(ctx, *instance, *worker.DeepCopy(), group, j); err != nil {
Member:

What will happen if we manually delete a Pod in a multi-host PodSlice? It seems the implementation may not be able to handle it.

ray-operator/controllers/ray/raycluster_controller.go (resolved review thread)
@kevin85421 (Member):

We need to address #1913 (comment) before everything moves forward.

@kevin85421 (Member):

Summarize the offline discussion:

  • We will go with the No. 1 strategy.
    • KubeRay is not responsible for labels ray.io/multihost-replica and ray.io/host-index.
    • KubeRay doesn't need to handle MultihostReplicasToDelete. Ray should tell KubeRay which 4 Pods need to be deleted instead of the ID of the multi-host PodSlice.
    • For multi-host cases (NumOfHosts > 1), we don't need to handle random deletion at this moment. If users want to manually delete a PodSlice, they should update both replicas and WorkersToDelete manually. See TPU Multi-Host Support #1913 (comment) for more details.
    • KubeRay only promises to create the correct number of Pods for the worker group.
      • The GKE webhook is responsible for:
        • Setting labels so that Pods are scheduled on the correct Kubernetes nodes.
        • Injecting environment variables so that Ray knows which Ray nodes belong to the same PodSlice.
        • Injecting the TPU_WORKER_ID environment variable.
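
For illustration, a hypothetical example of what a scale-down request could look like under Strategy No. 1 for a NumOfHosts=4 worker group: Ray names every host Pod of the replica in WorkersToDelete. The workerGroupSpec variable, the rayv1 alias for the KubeRay API package, and the Pod names are all assumptions:

	// Hypothetical sketch: scaling one multi-host replica down means listing every host
	// Pod of that replica in WorkersToDelete (and decrementing Replicas by one as well).
	workerGroupSpec.ScaleStrategy = rayv1.ScaleStrategy{
		WorkersToDelete: []string{
			"raycluster-sample-tpu-group-worker-aaaaa",
			"raycluster-sample-tpu-group-worker-bbbbb",
			"raycluster-sample-tpu-group-worker-ccccc",
			"raycluster-sample-tpu-group-worker-ddddd",
		},
	}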

@kevin85421 (Member):

@ryanaoleary @richardsliu any update?

@ryanaoleary (Contributor, Author) commented Feb 21, 2024:

> @ryanaoleary @richardsliu any update?

The PR should be ready for review now, going with Strategy No. 1. I took out the new labels and all of the deletion logic, since Ray should now use WorkersToDelete to tell KubeRay which Pods in a multi-host PodSlice to remove.

@@ -758,7 +757,10 @@ func (r *RayClusterReconciler) reconcilePods(ctx context.Context, instance *rayv
runningPods.Items = append(runningPods.Items, pod)
}
}
-diff := workerReplicas - int32(len(runningPods.Items))
+// A replica can contain multiple hosts, so we need to calculate this based on the number of hosts per replica.
+runningReplicas := int32(len(runningPods.Items)) / worker.NumOfHosts
Member:

Consider the following case:

  • len(runningPods.Items): 7
  • NumOfHosts: 4
  • workerReplicas: 2

	runningReplicas := int32(len(runningPods.Items)) / worker.NumOfHosts // 1
	diff := workerReplicas - runningReplicas // 2 - 1 = 1 => create 4 new Pods => 11 Pods in total.

Maybe we should use:

	numExpectedPods := workerReplicas * worker.NumOfHosts
	diff := numExpectedPods - int32(len(runningPods.Items))
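
(With the numbers above, numExpectedPods = 2 * 4 = 8 and diff = 8 - 7 = 1, so exactly one replacement Pod is created rather than four.)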

Contributor (Author):

I'm a bit unsure how we're supposed to handle the case where NumOfHosts > 1 and len(runningPods.Items) % NumOfHosts != 0. This would only happen when a Pod in the multi-host PodSlice crashed or was deleted, in which case should the entire PodSlice be deleted and then recreated if necessary? If that's not the case, then I think this makes sense.

Contributor (Author):

I added this change in b038e4b.

Member:

Let's take a step back:

  • KubeRay doesn't handle any logic about multi-host PodSlice ("when a pod in the multi-host podslice crashed ... entire podslice be deleted ...").
  • KubeRay only promises to create replicas * NumOfHosts Pods for a worker group.
  • Pod scheduling is handled by GKE webhook.
  • Scaling up and down is handled by the Ray Autoscaler (ex: "... should the entire podslice be deleted and then recreated ..." ).

Another review thread, on the following test code:

	// (1) 1 workerGroup (2) disable autoscaling
	assert.Equal(t, 1, len(testRayCluster.Spec.WorkerGroupSpecs), "This test assumes only one worker group.")

	// Disable autoscaling so that the random Pod deletion is enabled.
Member:

Random pod deletion is a pretty bad behavior for a multi-host setup. I am considering disabling it.

@kevin85421 merged commit dbd6b72 into ray-project:master on Feb 22, 2024. 23 checks passed.