[Bug] Long image pull time will trigger blue-green upgrade after the head is ready #1231
Why are these changes needed?
Without this PR, `rayServiceClusterStatus.DashboardStatus.HealthLastUpdateTime` would be initialized by the `updateAndCheckDashboardStatus` function at a very early stage, because the `FetchHeadServiceURL` function (code) fails to fetch the head service, which has not been created yet.

When using a very large image (e.g., `ray-ml:2.5.0`), the head Pod requires more time than the `DeploymentUnhealthySecondThreshold` (which is typically set to 300 seconds in most examples) to pull the image. Hence, before the head Pod is running and ready, the RayService is considered unhealthy and `markRestart` (code) is called. However, the new RayCluster preparation will not be triggered immediately because the `pendingRayClusterInstance` is not nil (code).

The new RayCluster preparation will be triggered immediately when the head Pod is running and ready. To elaborate, the GCS, Dashboard, and Dashboard Agent require a few seconds to become ready after the head Pod is running and ready, so the first few `Put` requests to create the serve applications may fail and thus will not reset `HealthLastUpdateTime`.

`FetchHeadServiceURL` can fail in two ways: (1) it fails to get the head service, or (2) it cannot find a port named `DefaultDashboardAgentListenPortName`.

`updateAndCheckDashboardStatus` to `updateState`
Reproduce
Related issue number
Checks