operator pod panic: send on closed channel during switchover

How to replicate:
1. Start a new operator
2. Create cluster with 2+ instances
3. Trigger a switchover by changing the manifest
4. Observe that the operator pod crashes in the middle of the rolling update

```
time="2022-04-25T16:29:56Z" level=debug msg="performing rolling update" cluster-name=default/acid-minimal-cluster pkg=cluster
time="2022-04-25T16:29:56Z" level=info msg="there are 2 pods in the cluster to recreate" cluster-name=default/acid-minimal-cluster pkg=cluster
time="2022-04-25T16:29:56Z" level=debug msg="subscribing to pod \"default/acid-minimal-cluster-1\"" cluster-name=default/acid-minimal-cluster pkg=cluster
time="2022-04-25T16:30:21Z" level=info msg="pod \"default/acid-minimal-cluster-1\" has been recreated" cluster-name=default/acid-minimal-cluster pkg=cluster
time="2022-04-25T16:30:21Z" level=debug msg="unsubscribing from pod \"default/acid-minimal-cluster-1\" events" cluster-name=default/acid-minimal-cluster pkg=cluster
time="2022-04-25T16:30:21Z" level=debug msg="making GET http request: http://10.2.16.115:8008/cluster" cluster-name=default/acid-minimal-cluster pkg=cluster
time="2022-04-25T16:30:21Z" level=debug msg="switching over from \"acid-minimal-cluster-0\" to \"default/acid-minimal-cluster-1\"" cluster-name=default/acid-minimal-cluster pkg=cluster
time="2022-04-25T16:30:21Z" level=debug msg="making POST http request: http://10.2.16.115:8008/failover" cluster-name=default/acid-minimal-cluster pkg=cluster
time="2022-04-25T16:30:21Z" level=debug msg="subscribing to pod \"default/acid-minimal-cluster-1\"" cluster-name=default/acid-minimal-cluster pkg=cluster
time="2022-04-25T16:30:23Z" level=debug msg="successfully switched over from \"acid-minimal-cluster-0\" to \"default/acid-minimal-cluster-1\"" cluster-name=default/acid-minimal-cluster pkg=cluster
time="2022-04-25T16:30:23Z" level=debug msg="unsubscribing from pod \"default/acid-minimal-cluster-1\" events" cluster-name=default/acid-minimal-cluster pkg=cluster
time="2022-04-25T16:30:23Z" level=info msg="recreating old master pod \"default/acid-minimal-cluster-0\"" cluster-name=default/acid-minimal-cluster pkg=cluster
time="2022-04-25T16:30:23Z" level=debug msg="subscribing to pod \"default/acid-minimal-cluster-0\"" cluster-name=default/acid-minimal-cluster pkg=cluster
panic: send on closed channel

goroutine 196 [running]:
github.com/zalando/postgres-operator/pkg/cluster.(*Cluster).processPodEvent(0xc0005e0000, {0x1bdc360, 0xc002eb03c0})
	/workspace/pkg/cluster/cluster.go:1039 +0x205
k8s.io/client-go/tools/cache.(*FIFO).Pop(0xc000143680, 0xc0008f7790)
	/workspace/vendor/k8s.io/client-go/tools/cache/fifo.go:300 +0x1fc
github.com/zalando/postgres-operator/pkg/cluster.(*Cluster).processPodEventQueue(0xc0005e0000, 0x0)
	/workspace/pkg/cluster/cluster.go:1056 +0x65
created by github.com/zalando/postgres-operator/pkg/cluster.(*Cluster).Run
	/workspace/pkg/cluster/cluster.go:1047 +0x77
```

The operator pod comes back immediately and finishes the rolling update as expected (since it finds the corresponding annotations on the pods to be rotated). Looks similar to #342

The panic happens when the operator [tries to delete](https://github.com/zalando/postgres-operator/blob/master/pkg/cluster/pod.go#L408) the old master pod. The pod is not deleted, though. After the restart, the operator still finds the rolling update flag on one pod, meaning it has not been replaced yet. The replacement of the replica then works as expected.
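
For illustration only, here is a minimal sketch of the general failure mode and one common way to guard against it. It is not the operator's actual code: `podEvent`, `subscriptions`, `subscribe`, `unsubscribe` and `publish` are hypothetical names, and the assumption is that the panic comes from an event being sent to a subscriber channel that was closed by an earlier unsubscribe. The sketch removes the map entry and closes the channel under the same mutex that the sender holds, so a sender can never see a closed channel.

```go
package main

import (
	"fmt"
	"sync"
)

// podEvent is a hypothetical stand-in for the operator's pod event type.
type podEvent struct {
	PodName string
}

// subscriptions is a hypothetical registry mapping pod names to event channels.
type subscriptions struct {
	mu   sync.Mutex
	subs map[string]chan podEvent
}

func newSubscriptions() *subscriptions {
	return &subscriptions{subs: make(map[string]chan podEvent)}
}

// subscribe registers a buffered channel for the pod and returns its receive side.
func (s *subscriptions) subscribe(pod string) <-chan podEvent {
	s.mu.Lock()
	defer s.mu.Unlock()
	ch := make(chan podEvent, 1)
	s.subs[pod] = ch
	return ch
}

// unsubscribe deletes the entry and closes the channel while holding the lock,
// so publish can never look up a channel that is already closed.
// Calling it twice for the same pod is a no-op the second time.
func (s *subscriptions) unsubscribe(pod string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if ch, ok := s.subs[pod]; ok {
		delete(s.subs, pod)
		close(ch)
	}
}

// publish delivers the event if a subscriber is registered and has buffer space.
// It reports whether the event was delivered.
func (s *subscriptions) publish(ev podEvent) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	ch, ok := s.subs[ev.PodName]
	if !ok {
		return false // nobody is subscribed (any more); drop the event
	}
	select {
	case ch <- ev:
		return true
	default:
		return false // buffer full; drop instead of blocking under the lock
	}
}

func main() {
	const pod = "default/acid-minimal-cluster-0"
	s := newSubscriptions()
	ch := s.subscribe(pod)

	fmt.Println("delivered:", s.publish(podEvent{PodName: pod}))
	fmt.Println("received for:", (<-ch).PodName)

	s.unsubscribe(pod)
	// A late event after unsubscribe is dropped instead of panicking.
	fmt.Println("delivered after unsubscribe:", s.publish(podEvent{PodName: pod}))
}
```

The non-blocking send is a design choice made for this sketch: it avoids holding the lock while waiting on a slow receiver, at the cost of dropping events when the buffer is full. Whether dropping is acceptable depends on how the real subscriber consumes events.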
