operator pod panic: send on closed channel during switchover #1867

@FxKu

How to replicate:

  1. Start a new operator
  2. Create a cluster with 2+ instances
  3. Trigger a switchover by changing the manifest
  4. See that the operator pod crashes in the middle of the rolling update:

time="2022-04-25T16:29:56Z" level=debug msg="performing rolling update" cluster-name=default/acid-minimal-cluster pkg=cluster
time="2022-04-25T16:29:56Z" level=info msg="there are 2 pods in the cluster to recreate" cluster-name=default/acid-minimal-cluster pkg=cluster
time="2022-04-25T16:29:56Z" level=debug msg="subscribing to pod \"default/acid-minimal-cluster-1\"" cluster-name=default/acid-minimal-cluster pkg=cluster
time="2022-04-25T16:30:21Z" level=info msg="pod \"default/acid-minimal-cluster-1\" has been recreated" cluster-name=default/acid-minimal-cluster pkg=cluster
time="2022-04-25T16:30:21Z" level=debug msg="unsubscribing from pod \"default/acid-minimal-cluster-1\" events" cluster-name=default/acid-minimal-cluster pkg=cluster
time="2022-04-25T16:30:21Z" level=debug msg="making GET http request: http://10.2.16.115:8008/cluster" cluster-name=default/acid-minimal-cluster pkg=cluster
time="2022-04-25T16:30:21Z" level=debug msg="switching over from \"acid-minimal-cluster-0\" to \"default/acid-minimal-cluster-1\"" cluster-name=default/acid-minimal-cluster pkg=cluster
time="2022-04-25T16:30:21Z" level=debug msg="making POST http request: http://10.2.16.115:8008/failover" cluster-name=default/acid-minimal-cluster pkg=cluster
time="2022-04-25T16:30:21Z" level=debug msg="subscribing to pod \"default/acid-minimal-cluster-1\"" cluster-name=default/acid-minimal-cluster pkg=cluster
time="2022-04-25T16:30:23Z" level=debug msg="successfully switched over from \"acid-minimal-cluster-0\" to \"default/acid-minimal-cluster-1\"" cluster-name=default/acid-minimal-cluster pkg=cluster
time="2022-04-25T16:30:23Z" level=debug msg="unsubscribing from pod \"default/acid-minimal-cluster-1\" events" cluster-name=default/acid-minimal-cluster pkg=cluster
time="2022-04-25T16:30:23Z" level=info msg="recreating old master pod \"default/acid-minimal-cluster-0\"" cluster-name=default/acid-minimal-cluster pkg=cluster
time="2022-04-25T16:30:23Z" level=debug msg="subscribing to pod \"default/acid-minimal-cluster-0\"" cluster-name=default/acid-minimal-cluster pkg=cluster
panic: send on closed channel

goroutine 196 [running]:
github.com/zalando/postgres-operator/pkg/cluster.(*Cluster).processPodEvent(0xc0005e0000, {0x1bdc360, 0xc002eb03c0})
	/workspace/pkg/cluster/cluster.go:1039 +0x205
k8s.io/client-go/tools/cache.(*FIFO).Pop(0xc000143680, 0xc0008f7790)
	/workspace/vendor/k8s.io/client-go/tools/cache/fifo.go:300 +0x1fc
github.com/zalando/postgres-operator/pkg/cluster.(*Cluster).processPodEventQueue(0xc0005e0000, 0x0)
	/workspace/pkg/cluster/cluster.go:1056 +0x65
created by github.com/zalando/postgres-operator/pkg/cluster.(*Cluster).Run
	/workspace/pkg/cluster/cluster.go:1047 +0x77

The operator pod comes back immediately and finishes the rolling update as expected (since it finds the corresponding annotations on the pods to be rotated). This looks similar to #342.

The error happens when the operator tries to delete the old master pod. The pod is not deleted, though. The restarted operator still finds the rolling update flag on one pod, meaning it wasn't replaced yet. The replacement of the replica then works as expected.
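
For reference, here is a minimal, self-contained sketch of this class of panic and one way to guard against it. It is not the operator's code: `PodEvent`, `registry`, `staleSend` and `publish` are hypothetical names, and the assumption is that a sender ends up with a channel that an unsubscribe has already closed. In the guarded version, the lookup and the send happen under the same read lock that `unsubscribe` must take as a write lock before closing the channel.

```go
package main

import (
	"fmt"
	"sync"
)

// PodEvent is a hypothetical stand-in for a pod event.
type PodEvent struct {
	PodName string
}

// registry is a hypothetical subscriber table: one channel per pod name.
type registry struct {
	mu   sync.RWMutex
	subs map[string]chan PodEvent
}

func newRegistry() *registry {
	return &registry{subs: make(map[string]chan PodEvent)}
}

// subscribe registers a buffered channel for the given pod.
func (r *registry) subscribe(pod string) <-chan PodEvent {
	r.mu.Lock()
	defer r.mu.Unlock()
	ch := make(chan PodEvent, 1)
	r.subs[pod] = ch
	return ch
}

// unsubscribe removes the entry and closes the channel under the write lock,
// so it cannot run while a sender holds the read lock in publish.
func (r *registry) unsubscribe(pod string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if ch, ok := r.subs[pod]; ok {
		delete(r.subs, pod)
		close(ch)
	}
}

// publish looks up the channel and sends while holding the read lock.
// The send is non-blocking: if the subscriber's buffer is full, the event
// is dropped instead of stalling the event loop.
func (r *registry) publish(ev PodEvent) bool {
	r.mu.RLock()
	defer r.mu.RUnlock()
	ch, ok := r.subs[ev.PodName]
	if !ok {
		return false
	}
	select {
	case ch <- ev:
		return true
	default:
		return false
	}
}

// staleSend reproduces the failure mode in isolation: sending on a channel
// that has already been closed panics. The panic is recovered here so the
// message can be returned.
func staleSend() (msg string) {
	defer func() {
		if r := recover(); r != nil {
			msg = fmt.Sprint(r)
		}
	}()
	ch := make(chan PodEvent, 1)
	close(ch)        // the subscriber has already gone away
	ch <- PodEvent{} // panics at runtime
	return "no panic"
}

func main() {
	fmt.Println("stale send:", staleSend())

	r := newRegistry()
	pod := "default/acid-minimal-cluster-0"
	events := r.subscribe(pod)
	fmt.Println("delivered:", r.publish(PodEvent{PodName: pod}))
	<-events
	r.unsubscribe(pod)
	fmt.Println("delivered after unsubscribe:", r.publish(PodEvent{PodName: pod}))
}
```

The trade-off in this sketch is that events are dropped when a subscriber is slow. Whether that is acceptable depends on how the waiting code handles missed events.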
