[Bug][Operator][Leader election?] Operator failure and restart, logs attached #601
Comments
We are observing this as well. It seems likely that the issue is a result of this line, which watches every event in the cluster (and consequently triggers a reconcile on each one): https://github.com/kubernetes-sigs/controller-runtime/blob/f6f37e6cc1ec7b7d18a266a6614f86df211b1a0a/pkg/handler/enqueue.go#L35
We are now running a test with this line omitted, and can confirm that the spurious logs of unrelated resources are gone and memory usage is lower.
To watch events related to child objects only, this seems more appropriate, I believe: https://github.com/kubernetes-sigs/controller-runtime/blob/f6f37e6cc1ec7b7d18a266a6614f86df211b1a0a/pkg/handler/enqueue_owner.go#L42
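For context, here is a minimal sketch of the two handler choices, using the struct-based controller-runtime API from the commit linked above. The reconciler type, package name, and KubeRay import path are illustrative assumptions, not the actual KubeRay code:

```go
package controllers

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/source"

	// Assumed import path for the RayCluster API types.
	rayiov1alpha1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1alpha1"
)

// RayClusterReconciler is a stand-in for the operator's reconciler type.
type RayClusterReconciler struct {
	client.Client
}

func (r *RayClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	return ctrl.Result{}, nil
}

func (r *RayClusterReconciler) SetupWithManager(mgr ctrl.Manager) error {
	c, err := ctrl.NewControllerManagedBy(mgr).
		For(&rayiov1alpha1.RayCluster{}).
		Build(r)
	if err != nil {
		return err
	}

	// Problematic: EnqueueRequestForObject enqueues a reconcile for
	// *every* Event in the cluster, related to a RayCluster or not.
	//
	// err = c.Watch(&source.Kind{Type: &corev1.Event{}},
	// 	&handler.EnqueueRequestForObject{})

	// Scoped: EnqueueRequestForOwner only enqueues reconciles for Pods
	// whose controller owner reference points at a RayCluster.
	return c.Watch(&source.Kind{Type: &corev1.Pod{}},
		&handler.EnqueueRequestForOwner{
			OwnerType:    &rayiov1alpha1.RayCluster{},
			IsController: true,
		})
}
```

The builder's Owns() helper wires up the same EnqueueRequestForOwner handler under the hood, which is the usual way to watch child objects.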
Ok, watching events was not a good idea. cc @kevin85421 @Jeffwan @wilsonwang371 -- I'd recommend simplifying the implementation of the fault tolerance feature so that it doesn't do this.
Sounds reasonable. I will take a look at this and discuss it with @Jeffwan.
This PR seems to suggest that the event watching was meant to support recovering from readiness/liveness probe failures. My understanding is that this is already covered by watching the child Pods. I’m curious if I’m missing something, and whether I should be looking out for anything in particular while we run a test with event watching disabled.
We are filtering out the events that we do not care about here.
So I think we may need to change this part so that it only processes the Pod events it cares about.
Let's try filtering events with owner type = RayCluster; that should be enough to bring down the memory usage. A sketch of what such a filter could look like follows below.
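Here is one way such a filter might look as a controller-runtime predicate. Note that an Event only records the object it is about, not that object’s owner, so the check below is an assumption about how “owner type = RayCluster” would be approximated in a first pass:

```go
package controllers

import (
	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

// eventFilter drops Events before they reach the event handler. An Event
// only carries a reference to the object it is about (InvolvedObject), so
// this first pass keeps Pod events only; whether that Pod is actually owned
// by a RayCluster still has to be checked after fetching the Pod, e.g. via
// metav1.GetControllerOf on its owner references.
var eventFilter = predicate.NewPredicateFuncs(func(obj client.Object) bool {
	ev, ok := obj.(*corev1.Event)
	if !ok {
		return false
	}
	return ev.InvolvedObject.Kind == "Pod"
})

// Usage: passed as an extra argument when setting up the watch, so that
// filtered-out Events never enqueue a reconcile:
//
//	err = c.Watch(&source.Kind{Type: &corev1.Event{}},
//		&handler.EnqueueRequestForObject{}, eventFilter)
```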
We can land a hotfix PR for this issue for now, but eventually we need to stop watching events altogether, because operator operations should be idempotent and stateless. Events are time-sensitive, and deleting Pods based on events is not idempotent.
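To make the idempotency point concrete, here is a level-triggered sketch that derives the delete decision from the current state of the Pods rather than from Events; re-running it against the same cluster state yields the same result. It reuses the RayClusterReconciler stand-in from the sketch above, and the ray.io/cluster label key and isUnhealthy helper are hypothetical:

```go
package controllers

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// reconcilePodHealth looks at the Pods as they are *now*, rather than
// reacting to a stream of time-sensitive Events. Deleting an already-gone
// Pod is tolerated, so repeated runs converge to the same state.
func (r *RayClusterReconciler) reconcilePodHealth(ctx context.Context, namespace, clusterName string) error {
	var pods corev1.PodList
	// "ray.io/cluster" is an illustrative label key for selecting the
	// cluster's child Pods.
	if err := r.List(ctx, &pods,
		client.InNamespace(namespace),
		client.MatchingLabels{"ray.io/cluster": clusterName}); err != nil {
		return err
	}
	for i := range pods.Items {
		pod := &pods.Items[i]
		if !isUnhealthy(pod) { // hypothetical health check on pod status
			continue
		}
		if err := r.Delete(ctx, pod); err != nil && !apierrors.IsNotFound(err) {
			return err
		}
	}
	return nil
}

// isUnhealthy is a placeholder for whatever readiness/liveness-based
// check the operator applies.
func isUnhealthy(pod *corev1.Pod) bool {
	for _, cond := range pod.Status.Conditions {
		if cond.Type == corev1.PodReady && cond.Status == corev1.ConditionFalse {
			return true
		}
	}
	return false
}
```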
Hi Jeev, are you able to try our patched version later and help us verify this?
Yes, I’ll deploy from the PR branch tomorrow and let it run for a while to collect metrics. Thanks for working on this so promptly! :)
Getting lots of spurious logs related to events from unrelated objects (on
Thanks. Let me take a look.
Hi @jeevb, I manually tested the latest code on my machine, and it is working as expected now. The issue you are seeing is caused by too many debug messages that I forgot to disable. You can try the latest patch and see the result.
Hi @jeevb, can you take a look at this again and confirm there are no more extra logs?
Yes, will test today and report back! |
Not seeing the spam of messages anymore, but still seeing these ~10 log messages associated with unrelated pods at startup:
This is generally OK: it covers the case where we see an unhealthy Pod but are not going to act on it. If this is also something we don't want, we can remove it later.
Everything looks good so far. Anything in particular I should test? |
Search before asking
KubeRay Component
ray-operator
What happened + What you expected to happen
I was running the nightly KubeRay operator for development purposes and observed a failure with some strange error messages logged.
The logs had roughly 1000 lines of "Read request instance not found error" with names of pods unrelated to KubeRay, followed by some complaints about leases and leader election, followed by a crash and restart.
Ideally, there shouldn't be any issues related to leader election, since we don't support leader election right now.
Only the last few lines of "Read request instance not found error!" are pasted below.
Reproduction script
I don't know how to reproduce this yet.
Anything else
I don't know yet.
Are you willing to submit a PR?