Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] Long image pull time will trigger blue-green upgrade after the head is ready #1231

Merged
merged 1 commit into from
Jul 11, 2023

Conversation

kevin85421
Copy link
Member

@kevin85421 kevin85421 commented Jul 11, 2023

Why are these changes needed?

Without this PR, the rayServiceClusterStatus.DashboardStatus.HealthLastUpdateTime would be initialized by the updateAndCheckDashboardStatus function at a very early stage because the FetchHeadServiceURL function (code) fails to fetch the head service because it has not been created yet.

When using a very large image (e.g., ray-ml:2.5.0), the head Pod requires more time than the DeploymentUnhealthySecondThreshold (which is typically set to 300 seconds in most examples) to pull the image. Hence, before the head Pod is running and ready, the RayService will be considered unhealthy by calling markRestart (code). However, the new RayCluster preparation will not be triggered immediately because the pendingRayClusterInstance is not nil (code).

The new RayCluster preparation will be triggered immediately when the head Pod is running and ready. To elaborate, the GCS, Dashboard, and Dashboard Agent require a few seconds to be ready after the head Pod is running and ready, so the first few Put requests to create the serve applications may fail and thus will not reset HealthLastUpdateTime.

	if clientURL, err = utils.FetchHeadServiceURL(ctx, &r.Log, r.Client, rayClusterInstance, common.DefaultDashboardAgentListenPortName); err != nil || clientURL == "" {
		if !r.updateAndCheckDashboardStatus(rayServiceStatus, false, rayServiceInstance.Spec.DeploymentUnhealthySecondThreshold) {
			logger.Info("Dashboard is unhealthy, restart the cluster.")
			r.markRestart(rayServiceInstance)
		}
		err = r.updateState(ctx, rayServiceInstance, rayv1alpha1.WaitForDashboard, err)
		return ctrl.Result{RequeueAfter: ServiceDefaultRequeueDuration}, false, false, err
	}
  • The error thrown by FetchHeadServiceURL has two possibilities: (1) fail to get the head service (2) cannot find a port with name DefaultDashboardAgentListenPortName.
    • Both are not related to the data plane logic (i.e. Ray Dashboard in this context)
    • (1) should be promised by the RayCluster controller, and (2) should be maintained by users.
    • It is not helpful to prepare a new RayCluster in both cases.
    • Hence, I remove the lines from updateAndCheckDashboardStatus to updateState.

Reproduce

# Install a KubeRay operator with the current master branch (https://github.com/ray-project/kuberay/commit/422098d1f8308e4419f9617eb8c112d255a9fdc8)
helm install kuberay-operator . --set image.tag=422098d

# Create a RayService with a big image
# https://gist.github.com/kevin85421/1a4f5489d72dd0133ca6fba72406a22b
kubectl apply -f ray-service.big-image.yaml

# If the image needs more than 5 mins to pull, the new RayCluster preparation will be triggered after the head Pod is ready and running.
Screen Shot 2023-07-11 at 11 11 15 AM Screen Shot 2023-07-11 at 11 11 37 AM
  • The new RayCluster is created 5 minutes and 34 seconds (5m45s - 11s = 5m34s) after the old RayCluster is created.
  • The Pod uses 5m34s to pull the image (5m48s scheduled - 14s image present).

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(
RAY_IMAGE=rayproject/ray:2.5.0 OPERATOR_IMAGE=controller:latest pytest -vs tests/test_sample_rayservice_yamls.py --log-cli-level=INFO
Screen Shot 2023-07-11 at 9 17 20 AM

@kevin85421 kevin85421 added the v0.6.0 Must be included in v0.6.0. label Jul 11, 2023
@kevin85421 kevin85421 marked this pull request as ready for review July 11, 2023 18:15
Copy link
Contributor

@architkulkarni architkulkarni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice fix!

@kevin85421 kevin85421 merged commit a0e59be into ray-project:master Jul 11, 2023
20 checks passed
lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023
…head is ready (ray-project#1231)

Long image pull time will trigger blue-green upgrade after the head is ready
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
v0.6.0 Must be included in v0.6.0.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants