
Panic of uncertain origin when restarting operator pod during a cluster sync that hung (due to bad SA config) #342

@valer-cara

Description

  • I updated pgop to the latest version.
  • The operator started syncing my clusters. The sync hung because the sts (StatefulSet) service account did not exist (the pgop config API changed, so the setting defaulted to operator, etc.).
  • Realising the SA setting was wrong, I updated the config and restarted the operator, and now I get this panic:
pgop-postgres-operator-8687f8cc54-trgb6 operator time="2018-07-13T12:16:14Z" level=info msg="pod \"app/rabbit-pg-1\" has been recreated" cluster-name=app/rabbit-pg pkg=cluster worker=0                        
pgop-postgres-operator-8687f8cc54-trgb6 operator time="2018-07-13T12:16:14Z" level=debug msg="unsubscribing from pod \"app/rabbit-pg-1\" events" cluster-name=app/rabbit-pg pkg=cluster worker=0                
pgop-postgres-operator-8687f8cc54-trgb6 operator time="2018-07-13T12:16:14Z" level=debug msg="failing over from \"rabbit-pg-0\" to \"app/rabbit-pg-1\"" cluster-name=app/rabbit-pg pkg=cluster worker=0         
pgop-postgres-operator-8687f8cc54-trgb6 operator time="2018-07-13T12:16:14Z" level=debug msg="making POST http request: http://100.107.29.224:8008/failover" cluster-name=app/rabbit-pg pkg=cluster worker=0    
pgop-postgres-operator-8687f8cc54-trgb6 operator time="2018-07-13T12:16:14Z" level=debug msg="subscribing to pod \"app/rabbit-pg-1\"" cluster-name=app/rabbit-pg pkg=cluster worker=0                           
pgop-postgres-operator-8687f8cc54-trgb6 operator time="2018-07-13T12:16:28Z" level=info msg="pod \"app/web-pg-1\" has been recreated" cluster-name=app/web-pg pkg=cluster worker=1                              
pgop-postgres-operator-8687f8cc54-trgb6 operator time="2018-07-13T12:16:28Z" level=debug msg="unsubscribing from pod \"app/web-pg-1\" events" cluster-name=app/web-pg pkg=cluster worker=1                      
pgop-postgres-operator-8687f8cc54-trgb6 operator time="2018-07-13T12:16:28Z" level=debug msg="failing over from \"web-pg-0\" to \"app/web-pg-1\"" cluster-name=app/web-pg pkg=cluster worker=1                  
pgop-postgres-operator-8687f8cc54-trgb6 operator time="2018-07-13T12:16:28Z" level=debug msg="making POST http request: http://100.107.29.222:8008/failover" cluster-name=app/web-pg pkg=cluster worker=1       
pgop-postgres-operator-8687f8cc54-trgb6 operator time="2018-07-13T12:16:28Z" level=debug msg="subscribing to pod \"app/web-pg-1\"" cluster-name=app/web-pg pkg=cluster worker=1                                 
pgop-postgres-operator-8687f8cc54-trgb6 operator time="2018-07-13T12:16:29Z" level=warning msg="could not perform failover: could not failover: patroni returned 'Failover failed'" cluster-name=app/web-pg pkg=cluster worker=1
pgop-postgres-operator-8687f8cc54-trgb6 operator time="2018-07-13T12:16:29Z" level=info msg="recreating old master pod \"app/web-pg-0\"" cluster-name=app/web-pg pkg=cluster worker=1                           
pgop-postgres-operator-8687f8cc54-trgb6 operator time="2018-07-13T12:16:29Z" level=debug msg="subscribing to pod \"app/web-pg-0\"" cluster-name=app/web-pg pkg=cluster worker=1                                 
pgop-postgres-operator-8687f8cc54-trgb6 operator time="2018-07-13T12:16:29Z" level=debug msg="unsubscribing from pod \"app/web-pg-1\" events" cluster-name=app/web-pg pkg=cluster worker=1                      
pgop-postgres-operator-8687f8cc54-trgb6 operator panic: send on closed channel
pgop-postgres-operator-8687f8cc54-trgb6 operator
pgop-postgres-operator-8687f8cc54-trgb6 operator goroutine 142 [running]:
pgop-postgres-operator-8687f8cc54-trgb6 operator github.com/zalando-incubator/postgres-operator/pkg/cluster.(*Cluster).Switchover.func1(0xc420027b00, 0xc4202ffc17, 0x3, 0xc4202ffbe8, 0x8, 0xc42008cba0, 0xc42008c660)
pgop-postgres-operator-8687f8cc54-trgb6 operator        /home/blk/golang/src/github.com/zalando-incubator/postgres-operator/pkg/cluster/cluster.go:883 +0x1ca                                                   
pgop-postgres-operator-8687f8cc54-trgb6 operator created by github.com/zalando-incubator/postgres-operator/pkg/cluster.(*Cluster).Switchover                                                                    
pgop-postgres-operator-8687f8cc54-trgb6 operator        /home/blk/golang/src/github.com/zalando-incubator/postgres-operator/pkg/cluster/cluster.go:877 +0x22f                                                   
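
For context on the message itself: in Go, `panic: send on closed channel` is raised when a goroutine sends on a channel that has already been closed. The trace above places the send in a goroutine started by `Switchover`, and the last log line before the panic is an unsubscribe from pod events, which may or may not be related. The snippet below is a minimal sketch of this class of failure and of one common way to guard against it. It is not the operator's code: `sendPanics` and `guardedChan` are hypothetical names made up for the illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// sendPanics reports whether sending on ch panics.
// A send on a closed channel always panics, buffered or not.
// Pass a closed channel or one with free buffer space, otherwise the send blocks.
func sendPanics(ch chan string) (panicked bool) {
	defer func() {
		if r := recover(); r != nil {
			panicked = true
		}
	}()
	ch <- "pod event"
	return false
}

// guardedChan is a hypothetical wrapper (not part of the operator) that
// serialises Send and Close behind a mutex so a late sender cannot hit
// a channel that a concurrent "unsubscribe" has already closed.
type guardedChan struct {
	mu     sync.Mutex
	closed bool
	ch     chan string
}

func newGuardedChan(size int) *guardedChan {
	return &guardedChan{ch: make(chan string, size)}
}

// Send delivers v and returns true, or returns false if the channel is
// closed or its buffer is full. It never panics and never blocks.
func (g *guardedChan) Send(v string) bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.closed {
		return false
	}
	select {
	case g.ch <- v:
		return true
	default:
		return false
	}
}

// Close closes the underlying channel once; repeated calls are no-ops.
func (g *guardedChan) Close() {
	g.mu.Lock()
	defer g.mu.Unlock()
	if !g.closed {
		g.closed = true
		close(g.ch)
	}
}

func main() {
	// Reproduce the failure mode: close first, then send.
	ch := make(chan string, 1)
	close(ch)
	fmt.Println("raw send panicked:", sendPanics(ch))

	// Guarded variant: a send after Close is dropped instead of panicking.
	g := newGuardedChan(1)
	fmt.Println("send before close:", g.Send("a"))
	g.Close()
	fmt.Println("send after close:", g.Send("b"))
}
```

The guarded variant trades the panic for a dropped message, so the caller has to decide what a `false` return means. Whether that trade-off fits the operator's pod-event subscription logic is an open question here.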

At the moment the operator is stuck in this state and keeps restarting. I'll research some more to figure out how to get it unstuck. Any hints appreciated :)

Thanks!
