
Add unit tests for raycluster_controller reconcilePods function #219

Merged

Conversation

Waynegates
Contributor

Why are these changes needed?

This PR adds unit tests for the KubeRay reconcilePods function to improve code reliability.

Related issue number

None

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@sriram-anyscale sriram-anyscale self-requested a review April 4, 2022 02:08
@sriram-anyscale
Collaborator

You could make all tests run with both values for the flag (PrioritizeWorkersToDelete)
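For illustration, one way to run every test under both flag values is a small table-driven wrapper; the runReconcileTest helper and the flag plumbing below are assumptions for the sketch, not code from this PR:

import (
	"fmt"
	"testing"
)

// Illustrative only: run the same reconcile test body once per flag value.
// runReconcileTest is a hypothetical helper standing in for the existing test body;
// how PrioritizeWorkersToDelete reaches the reconciler is also assumed here.
func TestReconcilePods_BothFlagValues(t *testing.T) {
	for _, prioritize := range []bool{false, true} {
		prioritize := prioritize
		t.Run(fmt.Sprintf("PrioritizeWorkersToDelete=%v", prioritize), func(t *testing.T) {
			runReconcileTest(t, prioritize)
		})
	}
}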

Labels: map[string]string{
	common.RayClusterLabelKey:   instanceName,
	common.RayNodeTypeLabelKey:  string(rayiov1alpha1.HeadNode),
	common.RayNodeGroupLabelKey: groupNameStr,
Collaborator

I think this should be "headgroup", not "small-group". The reconcilePods method does filter to get headPods and workerPods. Does your setup reproduce that correctly?

Contributor Author

Good point. I will update the group name. This test setup only covers part of the pod resources.


assert.Equal(t, int(expectReplicaNum), len(podList.Items),
"Replica number is wrong after reconcile expect %d actual %d", expectReplicaNum, len(podList.Items))
}
Collaborator
Should we also check that WorkersToDelete has been cleared at the end?

Contributor Author
Good point. Will add it.
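For reference, a sketch of what that check might look like with the testify assertions already used in the test (variable names follow the quoted test code; treat it as illustrative):

// Illustrative: after reconcilePods runs, the worker group should no longer
// list any workers to delete.
assert.Equal(t, 0, len(testRayCluster.Spec.WorkerGroupSpecs[0].ScaleStrategy.WorkersToDelete),
	"WorkersToDelete should be empty after reconcile")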

Collaborator
As I mentioned earlier, we should make sure WorkersToDelete has been deleted in the CR - not just in testRayCluster

"Replica number is wrong after reconcile expect %d actual %d", expectReplicaNum, len(podList.Items))

for i := 0; i < len(podList.Items); i++ {
if contains(testRayCluster.Spec.WorkerGroupSpecs[0].ScaleStrategy.WorkersToDelete, podList.Items[i].Name) {
Collaborator

This part was not clear to me when reading the code -- I would expect the WorkersToDelete to be cleared already. But maybe that's only the case in the PrioritizeWorkersToDelete = true codepath :)

Collaborator
(the same comment applies to the tests below)

Contributor Author
Yes, will update the code.

assert.Nil(t, err, "Fail to get pod list")
assert.Equal(t, len(testPods), len(podList.Items), "Init pod list len is wrong")

// Simulate 2 pod container crash.
Collaborator

@pcmoritz pcmoritz Apr 4, 2022

I'm not sure this actually simulates a crash of the pod -- the pod spec wouldn't get deleted even if that happened; rather, the pod would go into an ERRORED or similar state, I think. Maybe that's better to test here. Pod specs only get deleted if somebody deletes them (and that's also great to test for, but it probably shouldn't be called a crash in that case).

Contributor Author

@Waynegates Waynegates Apr 4, 2022

Sure. Will rename it to a pod-container-removed state, and will try to add an ERRORED unit test.
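For reference, one hedged way to simulate an errored container on a fake-client pod rather than deleting it; this assumes the test's fake client is available as fakeClient and that k8s.io/api/core/v1 is imported as corev1 (names are illustrative, not from the PR):

// Illustrative: mark one test pod's container as terminated with an error
// instead of deleting the pod object (needs the "context" and corev1 imports).
pod := podList.Items[0]
pod.Status.Phase = corev1.PodFailed
pod.Status.ContainerStatuses = []corev1.ContainerStatus{{
	Name: "ray-worker", // container name chosen for illustration
	State: corev1.ContainerState{
		Terminated: &corev1.ContainerStateTerminated{ExitCode: 1, Reason: "Error"},
	},
}}
updateErr := fakeClient.Status().Update(context.TODO(), &pod)
assert.Nil(t, updateErr, "Fail to update pod status")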

@pcmoritz
Collaborator

pcmoritz commented Apr 4, 2022

Thanks a lot for putting these tests together, this is really great! I left some small comments (which might just expose my lack of knowledge about how the code actually works).

r.Recorder.Eventf(instance, v1.EventTypeNormal, "Deleted", "Deleted pod %s", pod.Name)
}
}
worker.ScaleStrategy.WorkersToDelete = []string{}
instance.Spec.WorkerGroupSpecs[index].ScaleStrategy.WorkersToDelete = []string{}
Contributor Author

This is a bug fix to clear the instance.Spec.WorkerGroupSpecs[index].ScaleStrategy.WorkersToDelete

Collaborator
Isn't that precisely what the line above is doing?

worker = instance.Spec.WorkerGroupSpecs[index]

as per the outer for loop :)

Contributor Author

@Waynegates Waynegates Apr 5, 2022

Actually no. WorkerGroupSpecs is []WorkerGroupSpec, not []*WorkerGroupSpec, so worker = instance.Spec.WorkerGroupSpecs[index] copies the element, and updating worker will not update instance.Spec.WorkerGroupSpecs[index].
The unit tests caught this, which is why instance.Spec.WorkerGroupSpecs[index].ScaleStrategy.WorkersToDelete = []string{} is needed.
Check this for more info.
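A self-contained Go snippet (not from the PR) that demonstrates the copy-vs-pointer difference:

package main

import "fmt"

type ScaleStrategy struct{ WorkersToDelete []string }
type WorkerGroupSpec struct{ ScaleStrategy ScaleStrategy }

func main() {
	specs := []WorkerGroupSpec{{ScaleStrategy: ScaleStrategy{WorkersToDelete: []string{"pod-a"}}}}

	for index := range specs {
		worker := specs[index] // copies the element; clearing the copy leaves specs untouched
		worker.ScaleStrategy.WorkersToDelete = []string{}
	}
	fmt.Println(len(specs[0].ScaleStrategy.WorkersToDelete)) // 1: still set

	for index := range specs {
		worker := &specs[index] // points at the element; the slice sees the change
		worker.ScaleStrategy.WorkersToDelete = []string{}
	}
	fmt.Println(len(specs[0].ScaleStrategy.WorkersToDelete)) // 0: cleared
}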

Collaborator

Thanks for catching this bug.

Could you replace line 239 with:

for index := range instance.Spec.WorkerGroupSpecs {
        worker = &instance.Spec.WorkerGroupSpecs[index]

then you would not require both lines 272 and 273

However, even after you do this, does the CR actually get updated? The only place I see this happening is within updateStatus() which has:

if err := r.Status().Update(context.Background(), instance); err != nil {

Doesn't this only update the Status and not the Spec? See https://pkg.go.dev/sigs.k8s.io/controller-runtime/pkg/client#StatusClient

So how does this clear WorkersToDelete in the CR?

Separately (strictly speaking) the operator should not be changing the Spec, it should only update the Status given the declarative nature of the k8s programming style. But in this case I would say it is OK given that ScaleStrategy is not really declarative.
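For context on the Status vs. Spec question, a small hedged sketch of the two kinds of writes a controller-runtime client offers (the persistSpecChange name is just for illustration):

import (
	"context"

	"sigs.k8s.io/controller-runtime/pkg/client"
)

// Illustrative only. Status().Update writes just the /status subresource,
// so Spec edits made in memory are not persisted by it; a regular Update
// on the object is what would write a Spec change (such as a cleared
// WorkersToDelete) back to the CR.
func persistSpecChange(ctx context.Context, c client.Client, obj client.Object) error {
	if err := c.Status().Update(ctx, obj); err != nil { // status only
		return err
	}
	return c.Update(ctx, obj) // spec (and metadata) of the object itself
}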

Contributor Author

Also would like to discuss the sync logic.

Contributor Author

@Waynegates Waynegates Apr 5, 2022

Right now, this

for index := range instance.Spec.WorkerGroupSpecs {
        worker = &instance.Spec.WorkerGroupSpecs[index]

conflicts with the code in the diff < 0 branch (line 310).

@sriram-anyscale
Collaborator

After thinking about this a bit I feel we should not be clearing WorkersToDelete. The abstraction we should follow is:

  • The requestor (autoscaler in this case) looks at the real-world situation, decides what the goal state should be, and updates the Spec part of the CR to match this goal state. This should be declarative, but in our case we also have a procedural component - WorkersToDelete - which states how to get to the goal state. This is somewhat imperfect, but OK for now.
  • The operator should look at the real world situation and the goal state as described in the Spec and reconcile to make the real world be more aligned towards the goal state. It then updates the Status part of the CR to say what it did.

Some rules to follow:

  • The requestor should not depend on the current state of the Spec in what it does (otherwise it introduces race conditions with the operator). Instead it should work in an idempotent manner, i.e., if the requestor loop runs many times without the operator running, it should not keep updating the Spec to new values.
  • The operator should read from the Spec, but never update the Spec. The operator should not be specifying what the goal state should be, it just tries to get the real world to move towards the goal state. It should only write to the Status.
  • The only entities that read from the Status are dashboards, monitoring tools, etc. No action taken by the requestor or operator should depend on what is in the Status.

Having said all this, I argue that the operator should not update WorkersToDelete. But we should make sure that repeated iterations of the reconciler do not become less efficient as a consequence. On reading the code, we can make one small change to address this:

There are 4 places in the code that iterate over the pods in WorkersToDelete - they look like this:

		for _, podsToDelete := range worker.ScaleStrategy.WorkersToDelete {

We should simply add the following at the beginning of each of these loops:

   <if podsToDelete is not in runningPods> break

We can do this most efficiently perhaps by changing runningPods to be a list of pod names instead of a list of pods.
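A hedged sketch of that guard, assuming runningPods is a PodList and is first turned into a set of names (identifiers are illustrative; it uses continue so only the missing entry is skipped, a small variation on the break suggested above):

// Illustrative: build the name set once, then skip WorkersToDelete entries
// whose pods are already gone, so repeated reconciles stay cheap.
runningPodNames := make(map[string]struct{}, len(runningPods.Items))
for _, pod := range runningPods.Items {
	runningPodNames[pod.Name] = struct{}{}
}

for _, podsToDelete := range worker.ScaleStrategy.WorkersToDelete {
	if _, ok := runningPodNames[podsToDelete]; !ok {
		continue // pod no longer running; nothing to delete
	}
	// ... existing delete logic ...
}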

@pcmoritz
Collaborator

pcmoritz commented Apr 5, 2022

Thanks for writing down your thoughts -- not having the operator update WorkersToDelete makes sense to me, and with your optimization that should be fine. We can implement this first and test it, and if it works well we can hopefully move to a more declarative way to specify the goal state. For example, if we replace workersToDelete with a target list of pods, that should do the trick :)

@sriram-anyscale sriram-anyscale merged commit 57b4f87 into ray-project:master Apr 5, 2022
@Waynegates Waynegates deleted the brucez/reconcileUnitTest branch April 6, 2022 05:58
lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023
…project#219)

* Write raycluster_controller reconcilePods unit tests

* Improve unit tests code

* Fix lint issue and Improve unit tests code

* Run goimports

* Fix a bug of workersToDelete update and update unit tests

* Fix a bug of workersToDelete update and update unit tests

* Update unit tests log

* Update workersToDelete local var to avoid unuseful kube api server delete call

* Remove code of updating instance spec workersToDelete

Co-authored-by: Taikun Liu <taikunliu@taikuns-mbp.lan>