[RayJob][Status][4/n] Remove some JobDeploymentStatus and updateState function calls #1743

kevin85421 · 2023-12-13T01:31:22Z

Why are these changes needed?

A stable RayJob controller is impossible without a well-defined state machine. However, a state machine is hard to implement in the current code base because (1) many JobDeploymentStatus values are not well defined, and (2) updateState is called everywhere.

This PR removes the following JobDeploymentStatus values:
- JobDeploymentStatusFailedToGetOrCreateRayCluster: KubeRay doesn't use the status to make any decision.
- JobDeploymentStatusWaitForK8sJob: KubeRay doesn't use the status to make any decision.
- JobDeploymentStatusFailedJobDeploy: KubeRay doesn't use the status to make any decision.
- JobDeploymentStatusFailedToGetJobStatus: It is used by the function shouldUpdateJobStatus, but the check seems unnecessary because both jobInfo.JobStatus and rayv1.JobDeploymentStatusRunning are the source of truth.
```
if r.shouldUpdateJobStatus(rayJobInstance.Status.JobStatus, rayJobInstance.Status.JobDeploymentStatus, jobInfo) {
  err = r.updateState(ctx, rayJobInstance, jobInfo, jobInfo.JobStatus, rayv1.JobDeploymentStatusRunning, nil)
}
```
This PR removes the err parameter from updateState, changing updateState(..., err error) to updateState(...).
- After removing the JobDeploymentStatus values mentioned above, only one updateState call passes an err that is not explicitly set to nil. However, tracing the code shows that err must be nil in r.updateState(...rayv1.JobDeploymentStatusSuspended, err).
This PR removes numerous updateState function calls. It's preferable to minimize the number of places where the CR status is updated.
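
As a rough illustration of the goal described above (a small set of statuses that each drive a decision, and a single place where the status is written), a minimal sketch might look like the following. The type `deploymentStatus`, its constants, the `observation` struct, and the functions `nextStatus` and `reconcileOnce` are hypothetical names chosen for this example, and the transitions are assumptions; this is not KubeRay's actual code or the state machine that was later built.

```go
package main

import "fmt"

// deploymentStatus is a hypothetical, simplified stand-in for JobDeploymentStatus.
type deploymentStatus string

const (
	statusInitializing deploymentStatus = "Initializing"
	statusRunning      deploymentStatus = "Running"
	statusSuspended    deploymentStatus = "Suspended"
	statusComplete     deploymentStatus = "Complete"
)

// observation is a hypothetical summary of what one reconcile pass saw.
type observation struct {
	clusterReady bool
	jobFinished  bool
	suspend      bool
}

// nextStatus is a pure transition function: it only decides, it never writes.
func nextStatus(cur deploymentStatus, obs observation) deploymentStatus {
	switch cur {
	case statusInitializing:
		if obs.suspend {
			return statusSuspended
		}
		if obs.clusterReady {
			return statusRunning
		}
	case statusRunning:
		if obs.jobFinished {
			return statusComplete
		}
	case statusSuspended:
		if !obs.suspend {
			return statusInitializing
		}
	case statusComplete:
		// Terminal state: nothing left to decide.
	}
	return cur
}

// reconcileOnce applies the transition and reports whether the status changed.
// In a real controller this would be the only call site that persists the status.
func reconcileOnce(cur *deploymentStatus, obs observation) bool {
	next := nextStatus(*cur, obs)
	if next == *cur {
		return false
	}
	*cur = next
	return true
}

func main() {
	status := statusInitializing
	steps := []observation{
		{clusterReady: false},
		{clusterReady: true},
		{clusterReady: true, jobFinished: true},
	}
	for _, obs := range steps {
		changed := reconcileOnce(&status, obs)
		fmt.Println(status, changed)
	}
}
```

Keeping the decision (`nextStatus`) separate from the write (`reconcileOnce`) means every status in the enum has to appear in the switch, which makes an unused status easy to spot.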

Related issue number

Checks

I've made sure the tests are passing.
Testing Strategy
- Unit tests
- Manual tests
- This PR is not tested :(

architkulkarni

Nice, this is a welcome refactor!

In this PR we chose to remove a lot of JobDeploymentStatus values because those values are never used by KubeRay to make a decision. However, at first glance they might still have some value for the user when debugging a job. Are we making an explicit tradeoff here for a simpler state machine, at the expense of some debuggability? Or is it generally a best practice to only define status values if the controller uses the status to make a decision?

In any case, I think it's a fair tradeoff, since the user will typically just look at the logs anyway for debugging, so it's not that important to have fine-grained statuses. So I think the current PR is fine.

architkulkarni · 2023-12-13T23:43:29Z

ray-operator/controllers/ray/rayjob_controller.go

@@ -645,11 +633,6 @@ func (r *RayJobReconciler) getOrCreateRayClusterInstance(ctx context.Context, ra
 			return nil, err
 		}

-		// special case: is the job is complete status and cluster has been recycled.
-		if isJobSucceedOrFail(rayJobInstance.Status.JobStatus) && rayJobInstance.Status.JobDeploymentStatus == rayv1.JobDeploymentStatusComplete {


Why don't we need this anymore?

I forgot to explain it in the PR description.

If the status is JobDeploymentStatusComplete, KubeRay will skip the reconciliation right at the start of the reconcile process.

kuberay/ray-operator/controllers/ray/rayjob_controller.go

Lines 128 to 131 in f56c66f

	if rayJobInstance.Status.JobDeploymentStatus == rayv1.JobDeploymentStatusComplete {
		r.Log.Info("rayjob is complete, skip reconciliation", "rayjob", rayJobInstance.Name)
		return ctrl.Result{}, nil
	}

If the status updates to JobDeploymentStatusComplete successfully, KubeRay will skip the reconciliation immediately.

kuberay/ray-operator/controllers/ray/rayjob_controller.go

Lines 138 to 161 in f56c66f

	if isJobSucceedOrFail(rayJobInstance.Status.JobStatus) {
		// If the function `updateState` updates the JobStatus to Complete successfully, we can skip the reconciliation.
		rayClusterInstance := &rayv1.RayCluster{}
		rayClusterNamespacedName := types.NamespacedName{
			Namespace: rayJobInstance.Namespace,
			Name:      rayJobInstance.Status.RayClusterName,
		}
		if err := r.Get(ctx, rayClusterNamespacedName, rayClusterInstance); err != nil {
			if !errors.IsNotFound(err) {
				return ctrl.Result{RequeueAfter: RayJobDefaultRequeueDuration}, err
			}
			if err = r.updateState(ctx, rayJobInstance, nil, rayJobInstance.Status.JobStatus, rayv1.JobDeploymentStatusComplete, nil); err != nil {
				return ctrl.Result{RequeueAfter: RayJobDefaultRequeueDuration}, err
			}
			return ctrl.Result{}, nil
		}

		if rayClusterInstance.DeletionTimestamp != nil {
			if err = r.updateState(ctx, rayJobInstance, nil, rayJobInstance.Status.JobStatus, rayv1.JobDeploymentStatusComplete, nil); err != nil {
				return ctrl.Result{RequeueAfter: RayJobDefaultRequeueDuration}, err
			}
			return ctrl.Result{}, nil
		}
	}

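These are the only two places where JobDeploymentStatusComplete appears, apart from the check deleted at L649. In both, reconciliation stops as soon as the status is JobDeploymentStatusComplete, so by the time getOrCreateRayClusterInstance runs, the condition rayJobInstance.Status.JobDeploymentStatus == rayv1.JobDeploymentStatusComplete cannot be true and the body of the removed if statement would never execute.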
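
The argument can be checked on a toy model. The sketch below uses hypothetical, simplified names (`job`, `reconcile`, `getOrCreateCluster`, `helperSawComplete`) rather than the real controller code, and it assumes, as described above, that the helper is only reached from a reconcile pass that returns early whenever the status is or becomes Complete.

```go
package main

import "fmt"

// job is a hypothetical, stripped-down stand-in for a RayJob and its status.
type job struct {
	deploymentStatus string
	jobFinished      bool // models isJobSucceedOrFail(...) being true
	clusterGone      bool // models the RayCluster being deleted or not found
}

// helperSawComplete records whether the helper ever observed "Complete".
var helperSawComplete bool

// getOrCreateCluster models the helper that used to contain the special case.
func getOrCreateCluster(j *job) {
	if j.deploymentStatus == "Complete" {
		helperSawComplete = true
	}
}

// reconcile models the two early returns quoted above.
func reconcile(j *job) {
	if j.deploymentStatus == "Complete" {
		return // skip reconciliation; the helper is not called
	}
	if j.jobFinished && j.clusterGone {
		j.deploymentStatus = "Complete"
		return // returns right after the update, before the helper
	}
	getOrCreateCluster(j)
}

func main() {
	jobs := []*job{
		{deploymentStatus: "Initializing"},
		{deploymentStatus: "Running", jobFinished: true, clusterGone: true},
		{deploymentStatus: "Complete"},
	}
	for _, j := range jobs {
		reconcile(j)
		reconcile(j) // a second pass must not reach the helper with Complete either
	}
	fmt.Println("helper saw Complete:", helperSawComplete)
}
```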

Thanks, makes sense!

kevin85421 · 2023-12-14T01:11:08Z

> In this PR we chose to remove a lot of JobDeploymentStatus values because those values are never used by KubeRay to make a decision. However, at first glance they might still have some value for the user when debugging a job. Are we making an explicit tradeoff here for a simpler state machine, at the expense of some debuggability?

For improved observability, we could add a new string field (e.g., Details) to provide more detailed descriptions. I may add some states back if I find it necessary. At this moment, I just want to simplify the states as much as possible before I start to work on the new state machine.
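
A minimal sketch of that idea, assuming a hypothetical `Details` field on a simplified status struct (neither `jobStatus` nor `withDetails` below exists in KubeRay as part of this PR): the coarse status stays small enough to drive decisions, while the free-form string carries the debugging context that the removed statuses used to encode.

```go
package main

import "fmt"

// jobStatus is a hypothetical, simplified status struct for illustration only.
type jobStatus struct {
	DeploymentStatus string // coarse state the controller acts on
	Details          string // human-readable context; never used for decisions
}

// withDetails keeps the coarse state unchanged and records why the last step
// failed, instead of introducing a dedicated "FailedTo..." state.
func withDetails(s jobStatus, step, reason string) jobStatus {
	s.Details = fmt.Sprintf("%s failed: %s", step, reason)
	return s
}

func main() {
	s := jobStatus{DeploymentStatus: "Initializing"}
	s = withDetails(s, "get or create RayCluster", "connection refused")
	fmt.Printf("%s (%s)\n", s.DeploymentStatus, s.Details)
}
```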

> Or is it generally a best practice to only define status values if the controller uses the status to make a decision?

In my opinion, each status should have a specific associated goal, and different statuses should lead to different actions. The spark-on-k8s-operator follows a pattern similar to RayJob's, and its state machine is here. In the Spark operator's state machine, the function syncSparkApplication makes decisions based on the status, and I plan to build a similar but simplified one for RayJob as a first step. We can add new states when we decide to support new features such as retry. The following is my note about the state machine. There are three key functions:

architkulkarni · 2023-12-14T17:33:47Z

Got it, thanks a lot for the additional context!

update

a7c6f9f

kevin85421 force-pushed the remove-useless-state branch from 12251df to a7c6f9f Compare December 13, 2023 01:31

kevin85421 added 5 commits December 13, 2023 01:44

update

fc2bf02

update

6fa80a7

update

67ab7b6

update

ebe82e8

update

69680a7

kevin85421 requested a review from architkulkarni December 13, 2023 21:12

kevin85421 assigned architkulkarni Dec 13, 2023

kevin85421 changed the title ~~[RayJob][Status][4/n] Remove JobDeploymentStatusFailedToGetOrCreateRayCluster~~ [RayJob][Status][4/n] Remove some JobDeploymentStatus and updateState function calls Dec 13, 2023

kevin85421 marked this pull request as ready for review December 13, 2023 21:41

architkulkarni approved these changes Dec 13, 2023


kevin85421 merged commit 62bbc13 into ray-project:master Dec 14, 2023
25 checks passed
