
[Bug] [RayJob] First job pod sometimes fails to connect to Ray cluster #1381

Closed
architkulkarni opened this issue Aug 31, 2023 · 7 comments
Labels: bug (Something isn't working), P1 (Issue that should be fixed within a few weeks), rayjob

architkulkarni (Contributor) commented Aug 31, 2023

Search before asking

  • I searched the issues and found no similar issues.

KubeRay Component

ray-operator

What happened + What you expected to happen

Sometimes after submitting a RayJob, we see that the first job submission pod has failed:

NAME                                        READY   STATUS      RESTARTS   AGE
kuberay-operator-7447d85d58-89nhg           1/1     Running     0          5m44s
rayjob-sample-hwbgg                         0/1     Error       0          2m55s
rayjob-sample-raycluster-vkbxr-head-4pnd5   1/1     Running     0          5m12s
rayjob-sample-w5nvf                         0/1     Completed   0          2m48s

In the logs for the errored pod, we see:

  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/dashboard/modules/dashboard_sdk.py", line 248, in _check_connection_and_version
    self._check_connection_and_version_with_url(min_version, version_error_message)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/dashboard/modules/dashboard_sdk.py", line 278, in _check_connection_and_version_with_url
    raise ConnectionError(
ConnectionError: Failed to connect to Ray at address: http://rayjob-sample-raycluster-vkbxr-head-svc.default.svc.cluster.local:8265.

The pod gets retried, so the RayJob itself eventually succeeds, but this is still unexpected because the RayJob controller is supposed to wait for the cluster to be ready before submitting the job.

Reproduction script

kubectl apply -f config/samples/ray_v1alpha1_rayjob.yaml

Anything else

logs.zip

It only happens intermittently (roughly 5–20% of the time).

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
architkulkarni added the bug (Something isn't working), P1 (Issue that should be fixed within a few weeks), and rayjob labels on Aug 31, 2023
rueian (Contributor) commented Sep 13, 2023

Hi @architkulkarni, @kevin85421

Is anyone working on this? If not, I would like to work on this.

Investigation

Given the message from the logs.zip:

Max retries exceeded with url: /api/version (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f063be65ea0>: Failed to establish a new connection: [Errno 111] Connection refused'))

It seems that the cause is that the dashboard server had not started yet.

Solutions

I believe this issue can be addressed by sending an HTTP request to the dashboard server at line 209 of the section below.

// Check the current status of ray cluster before submitting.
if rayClusterInstance.Status.State != rayv1alpha1.Ready {
    r.Log.Info("waiting for the cluster to be ready", "rayCluster", rayClusterInstance.Name)
    err = r.updateState(ctx, rayJobInstance, nil, rayJobInstance.Status.JobStatus, rayv1alpha1.JobDeploymentStatusInitializing, nil)
    return ctrl.Result{RequeueAfter: RayJobDefaultRequeueDuration}, err
}

// Ensure k8s job has been created
jobName, wasJobCreated, err := r.getOrCreateK8sJob(ctx, rayJobInstance, rayClusterInstance)
if err != nil {
    err = r.updateState(ctx, rayJobInstance, nil, rayJobInstance.Status.JobStatus, rayv1alpha1.JobDeploymentStatusFailedJobDeploy, err)
    return ctrl.Result{RequeueAfter: RayJobDefaultRequeueDuration}, err
}

If that request fails, we then requeue the reconciliation request by taking the same action as when rayClusterInstance.Status.State != rayv1alpha1.Ready:

err = r.updateState(ctx, rayJobInstance, nil, rayJobInstance.Status.JobStatus, rayv1alpha1.JobDeploymentStatusInitializing, nil)
return ctrl.Result{RequeueAfter: RayJobDefaultRequeueDuration}, err

Or we could set the job status to rayv1alpha1.JobDeploymentStatusWaitForDashboard and requeue:

err = r.updateState(ctx, rayJobInstance, nil, rayJobInstance.Status.JobStatus, rayv1alpha1.JobDeploymentStatusWaitForDashboard, nil)
return ctrl.Result{RequeueAfter: RayJobDefaultRequeueDuration}, err

Would either of these be a preferable solution? Either way, we can make sure that the dashboard server is ready before creating the k8s Job instance.
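
For concreteness, here is a minimal sketch of what such a check could look like, assuming a hypothetical checkDashboardReady helper that probes the dashboard's /api/version endpoint over plain HTTP (the helper name, URL handling, and error messages are illustrative, not the eventual implementation):

import (
    "context"
    "fmt"
    "net/http"
)

// checkDashboardReady (hypothetical helper) returns nil once the Ray dashboard
// answers GET /api/version, the same endpoint the Python dashboard SDK probes
// in the error message above.
func checkDashboardReady(ctx context.Context, dashboardURL string) error {
    req, err := http.NewRequestWithContext(ctx, http.MethodGet, dashboardURL+"/api/version", nil)
    if err != nil {
        return err
    }
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return err // e.g. "connection refused" while the dashboard is still starting
    }
    defer resp.Body.Close()
    if resp.StatusCode != http.StatusOK {
        return fmt.Errorf("dashboard returned HTTP %d", resp.StatusCode)
    }
    return nil
}

If the helper returns an error, the reconciler would take one of the two requeue paths shown above instead of calling getOrCreateK8sJob.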

architkulkarni (Contributor, Author) commented

That sounds reasonable to me, and it would be great if you could submit a PR. @kevin85421 what do you think of the approach? Originally we wanted to get rid of the dashboard HTTP client for job submission, but here the dashboard HTTP client is just being used to check that the dashboard is ready, which seems fine.

kevin85421 (Member) commented

It is OK as a workaround because we want to implement this fix before the release of KubeRay 1.0.0.

I am not a fan of this solution. In Kubernetes' convention, "ready" signifies that the resource is ready to serve traffic. Unfortunately, the "ready" state in RayCluster doesn't accurately reflect this, necessitating the use of an HTTP client to check the head Pod's status. For a long-term solution, I am contemplating revising the definition of "ready" in RayCluster by updating certain functions and probes.

In addition, we may reconsider whether the Kubernetes Job should be created only after the RayCluster is ready. If we can create the Job earlier, it can start some processes earlier, such as pulling images; hence, we may also need to add a waiting mechanism to the Job itself.
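
One possible way (an assumption here, not a decided design) to make a Ready head Pod imply a reachable dashboard is to attach an HTTP readiness probe for the dashboard port to the head container that the operator builds. A sketch using the Kubernetes core/v1 types:

import (
    corev1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/util/intstr"
)

// Illustrative readiness probe: mark the head container ready only when the
// dashboard answers /api/version on port 8265. The timing values are
// placeholders; in older k8s.io/api versions the embedded handler field is
// named Handler rather than ProbeHandler.
func dashboardReadinessProbe() *corev1.Probe {
    return &corev1.Probe{
        ProbeHandler: corev1.ProbeHandler{
            HTTPGet: &corev1.HTTPGetAction{
                Path: "/api/version",
                Port: intstr.FromInt(8265),
            },
        },
        InitialDelaySeconds: 10,
        PeriodSeconds:       5,
        FailureThreshold:    12,
    }
}

The head container would then set its ReadinessProbe to this value, so that a RayCluster reported as Ready actually implies the dashboard can serve job submissions.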

I will sync with @rueian offline to discuss possible solutions.

rueian (Contributor) commented Sep 15, 2023

Agreed. I believe it would be nice if the "ready" condition could be customized for different situations: for example, a RayServe may only require a ready head node, while a RayJob may prefer to wait until all workers are ready.

Another possible solution would be simply adding a retry mechanism to the ray job submit command: even if we make sure the dashboard is alive, the job submission can still fail for other reasons. I have seen a case in which the raylet is dead but the dashboard is alive, so job submissions still fail.

Besides, with retries built into the ray job submit command, the submitter Job could be created earlier, making it possible to pull job images in advance.
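
A rough sketch of such a retry wrapper around the CLI, written as a standalone Go program rather than KubeRay code; the address, entrypoint, attempt count, and backoff are placeholders, and a real version would need to distinguish connection failures from application failures (see the two-level retry described below):

package main

import (
    "fmt"
    "os"
    "os/exec"
    "time"
)

// Naive retry loop around `ray job submit`: retry while the dashboard is
// unreachable, give up after a fixed number of attempts. Placeholder address
// and entrypoint; not KubeRay's actual submitter.
func main() {
    const attempts = 10
    args := []string{
        "job", "submit",
        "--address", "http://rayjob-sample-raycluster-head-svc:8265",
        "--", "python", "/home/ray/samples/sample_code.py",
    }
    var err error
    for i := 1; i <= attempts; i++ {
        cmd := exec.Command("ray", args...)
        cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
        if err = cmd.Run(); err == nil {
            return // submission succeeded and the CLI waited for the job
        }
        fmt.Fprintf(os.Stderr, "ray job submit attempt %d failed: %v; retrying\n", i, err)
        time.Sleep(10 * time.Second)
    }
    os.Exit(1) // leave further retries to the Kubernetes Job's backoff
}

To keep repeated attempts from starting duplicate jobs, the wrapper would also want to pin a fixed submission ID (if the installed Ray CLI supports one) so that a retry after a half-finished submission becomes a no-op.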

architkulkarni (Contributor, Author) commented

Great points, thanks for the discussion. By the way, any ideas why the job submitter pod gets retried here? The pod has RestartPolicyNever so I'm a bit confused.

kevin85421 (Member) commented

any ideas why the job submitter pod gets retried here?

We can have two levels of retry:

  • Connection: If the ray job submit command cannot establish a connection with the RayCluster, it shouldn't immediately fail. Instead, it should attempt to retry the connection. We could establish a timeout period, after which the ray job submit process fails if a connection cannot be established. It would be better to add some flags for this to the Ray Job Submission component.

  • Application: If ray job submit fails after the connection with the RayCluster has been established, the K8s Job Pod should fail, and the Kubernetes Job will create a new Pod until the number of failures exceeds the backoff limit (see the sketch after this list).
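
This also answers the question about RestartPolicyNever: the submitter Pod itself never restarts, but the owning Kubernetes Job creates replacement Pods until backoffLimit failures accumulate. A hedged sketch of such a Job spec, with placeholder name, image, and command:

import (
    batchv1 "k8s.io/api/batch/v1"
    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Illustrative submitter Job: RestartPolicyNever on the Pod, while the Job
// controller provides the application-level retries via BackoffLimit.
func submitterJob(backoffLimit int32) *batchv1.Job {
    return &batchv1.Job{
        ObjectMeta: metav1.ObjectMeta{Name: "rayjob-sample-submitter"}, // placeholder name
        Spec: batchv1.JobSpec{
            BackoffLimit: &backoffLimit,
            Template: corev1.PodTemplateSpec{
                Spec: corev1.PodSpec{
                    RestartPolicy: corev1.RestartPolicyNever,
                    Containers: []corev1.Container{{
                        Name:  "ray-job-submitter",
                        Image: "rayproject/ray:2.7.0", // placeholder image
                        Command: []string{
                            "ray", "job", "submit",
                            "--address", "http://rayjob-sample-raycluster-head-svc:8265",
                            "--", "python", "/home/ray/samples/sample_code.py",
                        },
                    }},
                },
            },
        },
    }
}

With backoffLimit set to 2, for example, Kubernetes keeps creating replacement submitter Pods until the failure count exceeds that limit, which is consistent with the Error Pod followed by a Completed Pod in the original report.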

rueian added a commit to rueian/kuberay that referenced this issue Sep 25, 2023
kevin85421 pushed a commit that referenced this issue Sep 26, 2023
… (#1429)

* [Bug][RayJob] Check dashboard readiness before creating job pod (#1381)

* [Bug][RayJob] Enhance the RayJob end-to-end tests to detect bugs similar to those described in (#1381)
kevin85421 pushed a commit to kevin85421/kuberay that referenced this issue Oct 17, 2023
…project#1381) (ray-project#1429)

* [Bug][RayJob] Check dashboard readiness before creating job pod (ray-project#1381)

* [Bug][RayJob] Enhance the RayJob end-to-end tests to detect bugs similar to those described in (ray-project#1381)
kevin85421 (Member) commented

Closed with #1733.
