Skip to content

Conversation

FxKu
Copy link
Member

@FxKu FxKu commented Apr 29, 2022

fixes #1867

Switchover function has this inner go routine which is not needed. I guess, it was once introduced with the idea to have two go routines that wait for each other when sending and consuming evens on the podEvent channels. But even with a WaitGroup added later, chances for a race condition between closing the channel (consumer side) and sending events were still high as the operator did panic in most of my tests. And because of the fact that we wait anyway for the go routine to finish the need for concurrency in this part of the code looks questionable.

With the go routine removed from Switchover, chances for hitting the race condition are already a lot lower. To fully eliminate the possibility of running into it, the PR moves c.podSubscribersMu.RUnlock() after the event handling. Thus, it will always send on the channel first before unregisterPodSubscriber called from Switchover can close it. Since unbuffered channels are usually blocking when there's no receiver (Switchover might already be done), a select block with a do nothing default case has to be provided. See this go example.

The PR changes the check and timeout setting for deleting and patching Pods. PatroniAPICheck* config values are used because their defaults correspond to the hard coded values of 1s and 5s that we used before. But PatroniAPICheck* constants should only be used when calling the Patroni REST API instead.

The PR also fixes WaitGroup when PostgresTeam informer is enabled. Thanks @dmvolod for the hint in #1874 .

@FxKu FxKu added this to the 1.8.1 milestone Apr 29, 2022
…en processPodEvent and unregisterPodSubscriber
@FxKu FxKu force-pushed the switchover-subscriber-fix branch from c9b8dc7 to 3641911 Compare May 5, 2022 14:06
@FxKu
Copy link
Member Author

FxKu commented May 16, 2022

👍

}

close(c.podSubscribers[podName])
delete(c.podSubscribers, podName)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

close() before delete() looks a bit safer here. (what does Goalng do to ch when the entry that contains it is removed from c.podSubscribers ?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Delete only removes the channel from c.podSubscribers map. We can still safely close it afterwards. As the go routine processPodEvent first gets the channel, then writes to it, the idea was to remove the channel from the map first then close it - saving some nanoseconds in this race condition we faced.

@sdudoladov
Copy link
Member

👍

1 similar comment
@FxKu
Copy link
Member Author

FxKu commented May 17, 2022

👍

@FxKu FxKu closed this May 17, 2022
@FxKu FxKu reopened this May 17, 2022
@FxKu FxKu merged commit 268a86a into master May 17, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

operator pod panic: send on closed channel during switchover
2 participants