Submit ray job after cluster is ready #405

Merged on Jul 25, 2022 (5 commits)

pingsutw
Contributor

Signed-off-by: Kevin Su pingsutw@apache.org

Why are these changes needed?

The Ray operator tries to submit the Ray job before the Ray cluster is ready. We can check the cluster state first and submit the job only once the cluster is ready.

Before

2022-07-23T08:44:48.609Z	INFO	raycluster-controller	reconcilePods	{"all workers already exist for group": "test-group"}
2022-07-23T08:44:48.755Z	ERROR	controller.rayjob	Reconciler error	{"reconciler group": "ray.io", "reconciler kind": "RayJob", "name": "test2-n0-0", "namespace": "flytesnacks-development", "error": "Get \"http://test2-n0-0-raycluster-fm75r-head-svc.flytesnacks-development.svc.cluster.local:8265/api/jobs/test2-n0-0-f4nh7\": dial tcp 10.96.63.5:8265: connect: connection refused"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:227
2022-07-23T08:44:48.756Z	INFO	raycluster-controller	reconciling RayJob	{"NamespacedName": "flytesnacks-development/test2-n0-0"}
2022-07-23T08:44:48.756Z	INFO	controllers.RayJob	getOrCreateRayClusterInstance	{"rayClusterInstanceName": "test2-n0-0-raycluster-fm75r"}
2022-07-23T08:44:48.855Z	ERROR	controller.rayjob	Reconciler error	{"reconciler group": "ray.io", "reconciler kind": "RayJob", "name": "test2-n0-0", "namespace": "flytesnacks-development", "error": "Get \"http://test2-n0-0-raycluster-fm75r-head-svc.flytesnacks-development.svc.cluster.local:8265/api/jobs/test2-n0-0-f4nh7\": dial tcp 10.96.63.5:8265: connect: connection refused"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:227

After

2022-07-24T02:21:57.951Z	INFO	raycluster-controller	reconciling RayJob	{"NamespacedName": "flytesnacks-development/achss4th7xtc5mh6hm27-n0-0"}
2022-07-24T02:21:57.951Z	INFO	controllers.RayJob	getOrCreateRayClusterInstance	{"rayClusterInstanceName": "achss4th7xtc5mh6hm27-n0-0-raycluster-sqfvh"}
2022-07-24T02:21:57.952Z	INFO	controllers.RayJob	waiting for the cluster to be ready	{"rayCluster": "achss4th7xtc5mh6hm27-n0-0-raycluster-sqfvh"}

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

Signed-off-by: Kevin Su <pingsutw@apache.org>
Collaborator

Jeffwan commented Jul 25, 2022

@pingsutw I will double check this one after #398.

Collaborator

Jeffwan commented Jul 25, 2022

Actually this one looks straightforward, I can merge it and update #398

Collaborator

Jeffwan commented Jul 25, 2022

The RayJob controller on master needs a lot of improvement. I think after this one and #398, it should be in good shape.

	if rayClusterInstance.Status.State != rayv1alpha1.Ready {
		r.Log.Info("waiting for the cluster to be ready", "rayCluster", rayClusterInstance.Name)
		err = r.updateState(ctx, rayJobInstance, rayv1alpha1.JobDeploymentStatusInitializing, nil)
		return ctrl.Result{}, err
	}
Collaborator

minor: We can requeue after x seconds, since the cluster takes a while to become ready. Without RequeueAfter we will see lots of duplicated logs.

This is minor, I can update it later in another PR
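
For reference, a rough sketch of what the requeue suggestion could look like. This is not the code from this PR or from any follow-up. `Result` and `ClusterState` are local stand-ins for controller-runtime's `ctrl.Result` and the RayCluster status state so the snippet compiles on its own, and `reconcileJob` and the 10-second `clusterNotReadyRequeue` are hypothetical (the comment above only says "x seconds").

```go
package main

import (
	"fmt"
	"time"
)

// Result is a local stand-in for controller-runtime's ctrl.Result,
// reduced to the one field this sketch needs.
type Result struct {
	RequeueAfter time.Duration
}

// ClusterState is a stand-in for the RayCluster status state type.
type ClusterState string

// Ready is a stand-in for rayv1alpha1.Ready; its value here is an assumption.
const Ready ClusterState = "ready"

// clusterNotReadyRequeue is a hypothetical delay chosen for illustration.
const clusterNotReadyRequeue = 10 * time.Second

// reconcileJob is a hypothetical, trimmed-down reconcile step.
// While the cluster is not ready it asks to be called again after a delay
// instead of returning an empty result, so the "waiting" log line is
// emitted once per delay rather than on every immediate retry.
// The boolean reports whether the job may be submitted now.
func reconcileJob(state ClusterState) (Result, bool) {
	if state != Ready {
		return Result{RequeueAfter: clusterNotReadyRequeue}, false
	}
	return Result{}, true
}

func main() {
	for _, s := range []ClusterState{"", Ready} {
		res, submit := reconcileJob(s)
		fmt.Printf("state=%q requeueAfter=%s submit=%t\n", s, res.RequeueAfter, submit)
	}
}
```

In the real controller the not-ready branch would presumably become `return ctrl.Result{RequeueAfter: ...}, err` after the `updateState` call; the right delay depends on how long head pod startup usually takes.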

@Jeffwan Jeffwan merged commit dd0b0a3 into ray-project:master Jul 25, 2022
lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023
* Submit ray job after cluster is ready

Signed-off-by: Kevin Su <pingsutw@apache.org>

* Fix test errors

Signed-off-by: Kevin Su <pingsutw@apache.org>

* Fix test errors

Signed-off-by: Kevin Su <pingsutw@apache.org>

* Fix test errors

Signed-off-by: Kevin Su <pingsutw@apache.org>

* Fix test errors

Signed-off-by: Kevin Su <pingsutw@apache.org>