
[Hotfix][Bug] Avoid unnecessary zero-downtime upgrade #1581

Merged

Conversation

kevin85421
Member

@kevin85421 kevin85421 commented Oct 28, 2023

Why are these changes needed?

The RayService's health check mechanism is not well-defined. This PR is a hotfix for v1.0.0, and I will revisit the mechanism after the release. Hence, I did not spend much time writing tests for this PR.

  • Change 1: updateAndCheckDashboardStatus

    • Problem: KubeRay will not update the status of the RayService custom resource when the only changes are related to timestamps. Without this PR, deleting the head Pod of a GCS ft-enabled RayCluster could trigger an unnecessary zero-downtime upgrade immediately after the head Pod restarts.

      # (without this PR)
      # Step 1: Create a GCS ft-enabled RayService
      kubectl apply -f ray-service.high-availability.yaml
      
      # Step 2: Check the RayService status
      kubectl describe rayservices.ray.io rayservice-ha
      # Dashboard Status:
      #   Health Last Update Time:  $t1
      #   Is Healthy:               true
      #   Last Update Time:         $t1
      
      # Step 3: Wait more than `deploymentUnhealthySecondThreshold` (300 here) seconds ($t2 >= $t1 + 300).
      # Delete the head Pod.
      export HEAD_POD=$(kubectl get pods --selector=ray.io/node-type=head -o custom-columns=POD:metadata.name --no-headers)
      kubectl delete pod $HEAD_POD
      
      # Step 4: `reconcileServe` will not send requests to the dashboard agent when the head Pod is not running and ready.
      
      # Step 5: When the new head Pod is ready and running again, KubeRay starts sending requests
      # to the dashboard agent and updating the RayService CR status. In addition, the dashboard agent
      # still needs several seconds to be ready because the readiness probe only checks Raylet and GCS.
      
      # Step 6: The first dashboard health check after the restart fails because the agent is not ready yet.
      # Health Last Update Time is still $t1, and the current time $t3 is larger than $t1 + 300, so a
      # zero-downtime upgrade is triggered accidentally.
    • With this PR, a zero-downtime upgrade is only triggered when the dashboard agent has been unhealthy for more than deploymentUnhealthySecondThreshold seconds (see the sketch after this list).

  • Change 2: reconcileServe

    • Problem: When we delete the dashboard agent process on the head Pod, the head Pod becomes not ready, so the dashboard agent health check is skipped and the zero-downtime upgrade is not triggered after deploymentUnhealthySecondThreshold seconds. The Ray head only fails and restarts after the liveness probe fails 120 consecutive times (~600 seconds).
    • With this PR, KubeRay skips the active RayCluster's head Pod status check, keeps sending requests to the dashboard agent, and triggers a zero-downtime upgrade after deploymentUnhealthySecondThreshold seconds of unhealthiness.
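
Below is a minimal, self-contained Go sketch of the behaviour Change 1 aims for. The DashboardStatus field names and the oldIsHealthy condition follow this PR; the helper shouldTriggerZeroDowntimeUpgrade, the standalone types, and the main function are illustrative assumptions, not the actual KubeRay implementation.

    package main

    import (
    	"fmt"
    	"time"
    )

    // Simplified stand-in for the RayService dashboard status fields touched by this PR.
    type DashboardStatus struct {
    	IsHealthy            bool
    	LastUpdateTime       *time.Time
    	HealthLastUpdateTime *time.Time
    }

    // updateDashboardStatus mirrors the updated condition in updateAndCheckDashboardStatus:
    // HealthLastUpdateTime is refreshed while the previous observation (oldIsHealthy) was healthy,
    // so after a healthy -> unhealthy transition it stays pinned at the first unhealthy observation
    // instead of an arbitrarily old timestamp.
    func updateDashboardStatus(status *DashboardStatus, isHealthy bool, now time.Time) {
    	oldIsHealthy := status.IsHealthy
    	status.LastUpdateTime = &now
    	status.IsHealthy = isHealthy
    	if status.HealthLastUpdateTime == nil || oldIsHealthy {
    		status.HealthLastUpdateTime = &now
    	}
    }

    // shouldTriggerZeroDowntimeUpgrade is a hypothetical helper showing how the threshold check can
    // use HealthLastUpdateTime: upgrade only after the dashboard agent has been unhealthy for more
    // than deploymentUnhealthySecondThreshold.
    func shouldTriggerZeroDowntimeUpgrade(status DashboardStatus, threshold time.Duration, now time.Time) bool {
    	if status.IsHealthy || status.HealthLastUpdateTime == nil {
    		return false
    	}
    	return now.Sub(*status.HealthLastUpdateTime) > threshold
    }

    func main() {
    	threshold := 300 * time.Second // deploymentUnhealthySecondThreshold
    	var status DashboardStatus
    	t0 := time.Now()

    	updateDashboardStatus(&status, true, t0)                       // healthy at t0
    	updateDashboardStatus(&status, false, t0.Add(600*time.Second)) // first unhealthy observation at t0+600s

    	// With the old condition (|| isHealthy), HealthLastUpdateTime would still be t0 here,
    	// so the very first unhealthy check would already exceed the threshold.
    	fmt.Println(shouldTriggerZeroDowntimeUpgrade(status, threshold, t0.Add(700*time.Second))) // false: unhealthy for only 100s
    	fmt.Println(shouldTriggerZeroDowntimeUpgrade(status, threshold, t0.Add(950*time.Second))) // true: unhealthy for 350s
    }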

Related issue number

Closes #1565

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(
  • Manual test
    • Create a GCS ft-enabled RayService
    • Kill the head Pod after deploymentUnhealthySecondThreshold seconds. => No zero-downtime upgrade should be triggered.
    • Kill the dashboard agent process on the head Pod. => Zero-downtime upgrade should be triggered after deploymentUnhealthySecondThreshold seconds.

@kevin85421 kevin85421 marked this pull request as ready for review October 28, 2023 06:23
Review thread on the updateAndCheckDashboardStatus change:

  rayServiceClusterStatus.DashboardStatus.LastUpdateTime = &timeNow
  rayServiceClusterStatus.DashboardStatus.IsHealthy = isHealthy
- if rayServiceClusterStatus.DashboardStatus.HealthLastUpdateTime.IsZero() || isHealthy {
+ if rayServiceClusterStatus.DashboardStatus.HealthLastUpdateTime.IsZero() || oldIsHealthy {
Contributor

I don't understand the purpose of this change and how it fixes the first issue in the PR. Can you explain this more?

KubeRay will not update the status of the RayService custom resource when the only changes are related to timestamps

It sounds like this is the real issue. Why don't we just update HealthLastUpdateTime every time? Wouldn't that fix the issue?

Member Author

It sounds like this is the real issue. Why don't we just update HealthLastUpdateTime every time?

This will overload the Kubernetes API Server.
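If HealthLastUpdateTime were written on every reconcile, every reconcile loop would issue a status update to the API server even when nothing but a timestamp had changed; skipping timestamp-only updates is what avoids that write volume.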

Member Author

I don't understand the purpose of this change and how it fixes the first issue in the PR. Can you explain this more?

By definition, healthLastUpdateTime: $t and isHealthy: true mean that the dashboard agent has been healthy from time = $t until now. Hence, the last moment the agent is seen healthy should be the starting point from which we count it as unhealthy.

Note that, healthLastUpdateTime currently does not strictly follow the definition above due to the check of head Pod status.
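
In other words, if the agent is healthy at $t1 and is first observed unhealthy at $t2, this PR moves healthLastUpdateTime to $t2, so the unhealthy duration is measured from $t2 rather than from a stale timestamp such as $t1.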

Contributor

Thanks, we discussed this more offline and it makes sense. The main point I was missing is that when the health state transitions, the status update actually goes through (isHealthy is a boolean field, not a timestamp, so when it changes the status is actually updated). Also, to simplify the issue: the unnecessary zero-downtime upgrade bug happens any time the service stays healthy for >600s and then transitions to unhealthy; it is not specifically related to a head Pod restarting.

@kevin85421 One last question. With the new approach, if it's unhealthy for a while, then the first time it becomes healthy, we don't actually update HealthLastUpdateTime. Instead, we'll update it in the next iteration. Is this your understanding as well? I think this off-by-one issue is still okay for a hotfix.
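(Concretely: if the agent is unhealthy at reconcile N and first observed healthy at reconcile N+1, oldIsHealthy is still false during reconcile N+1, so HealthLastUpdateTime is only refreshed at reconcile N+2.)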

Contributor

@architkulkarni architkulkarni left a comment

Looks good to me as a hotfix.

For the long term: The core issue (we need to track how long a resource has been healthy, but we can't actually update the last healthy time in the status each time because it overloads the k8s api server) is pretty bad, but it also sounds like a pretty general k8s problem. @hongchaodeng do you know if there's a recommended way for dealing with this?

@kevin85421
Member Author

we need to track how long a resource has been healthy, but we can't actually update the last healthy time in the status each time because it overloads the k8s api server

That is not a blocker. We can still implement the correct logic for deploymentUnhealthySecondThreshold. The core issue is that the current logic is too complicated: both KubeRay and the container's readiness/liveness probes perform health checks.

Development

Successfully merging this pull request may close these issues.

[Bug][RayService HA] Killing a head Pod triggers zero-downtime upgrade