gotascii

If the group coordinator is replaced with a different host but the broker id remains the same, the client goes into an endless reconnection loop. This PR marks the cluster metadata as stale when a ConnectionError is raised while joining a group, so that the next lookup refreshes it. The issue is reproducible by following these steps:

  • Start up a cluster with 3 nodes.
  • Publish some messages to a topic.
  • Connect to the topic and start an each_message loop.
  • A broker, say #0 for example, becomes memoized in @coordinator in ConsumerGroup.
  • Stop the each_message loop but do not exit the process.
  • Kill broker 0 and bring up a new host with a different IP as broker 0.
  • With the same consumer instance, run the each_message loop again.

When the above steps are taken:

  • ConsumerGroup#join is called.
  • Then coordinator.join_group on ConsumerGroup L:117 fails with ConnectionError.
  • ConsumerGroup#join sets @coordinator = nil.
  • Cluster#get_group_coordinator asks a broker for the broker id of the coordinator which is 0.
  • connect_to_broker pulls cached info for id 0 (i.e. the old IP).
  • Then coordinator.join_group on ConsumerGroup L:117 fails with ConnectionError again, restarting the loop.
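
The loop above can be sketched with a toy model. This is only an illustration under stated assumptions: `ToyCluster`, `join`, the `max_attempts` cap, and the IP addresses are hypothetical stand-ins, not ruby-kafka's classes. Only the shape of the problem (coordinator looked up by id, address resolved from a cache) comes from the steps above.

```ruby
# Toy model of the failure loop; none of these classes are ruby-kafka's own.
class ConnectionError < StandardError; end

class ToyCluster
  attr_reader :attempted_hosts

  def initialize(cached_hosts, live_hosts)
    @cached_hosts = cached_hosts   # broker id => host, as last fetched
    @live_hosts = live_hosts       # broker id => host that is running now
    @attempted_hosts = []
  end

  # The cluster reports the coordinator by broker id only.
  def coordinator_id
    0
  end

  # Resolves the id through the cache, so a replaced host is never noticed.
  def connect_to_broker(id)
    host = @cached_hosts.fetch(id)
    @attempted_hosts << host
    raise ConnectionError, "cannot reach #{host}" unless @live_hosts[id] == host
    host
  end
end

# Hypothetical join: retries on ConnectionError without touching the cache.
def join(cluster, max_attempts:)
  attempts = 0
  begin
    attempts += 1
    cluster.connect_to_broker(cluster.coordinator_id)
  rescue ConnectionError
    retry if attempts < max_attempts   # capped here only so the sketch ends
    nil
  end
end

cluster = ToyCluster.new({ 0 => "10.0.0.1" }, { 0 => "10.0.0.2" })
result = join(cluster, max_attempts: 3)
```

Every attempt goes to the cached address, so no number of retries can succeed until the cache is invalidated.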

Seeing as the retry for a ConnectionError is guarded by a sleep 1, I'm hoping this is a pretty safe place to refresh metadata.
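
Roughly, the idea is the following. This is a minimal sketch and not the actual diff: `join_with_refresh`, `StubCluster`, the `max_attempts` cap, and the injectable `sleeper` are hypothetical names introduced for illustration. Only `mark_as_stale!`, the ConnectionError rescue, and the one-second sleep come from this thread.

```ruby
class ConnectionError < StandardError; end

# Hypothetical helper: on ConnectionError, mark the metadata stale, wait,
# then retry. The sleep bounds invalidation to about once per second.
def join_with_refresh(cluster, max_attempts:, sleeper: ->(s) { sleep(s) })
  attempts = 0
  begin
    attempts += 1
    yield
  rescue ConnectionError
    raise if attempts >= max_attempts
    cluster.mark_as_stale!
    sleeper.call(1)
    retry
  end
end

# Stand-in cluster that only counts how often it was marked stale.
StubCluster = Struct.new(:stale_marks) do
  def mark_as_stale!
    self.stale_marks += 1
  end
end

cluster = StubCluster.new(0)
slept = []
calls = 0
outcome = join_with_refresh(cluster, max_attempts: 5, sleeper: ->(s) { slept << s }) do
  calls += 1
  raise ConnectionError, "old host" if calls < 3   # first two joins fail
  :joined
end
```

Injecting the sleeper is just a convenience for the sketch, so it can run without actually waiting.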

Contributor

dasch commented Oct 27, 2017

❤️

mark_as_stale! doesn't by itself refresh the metadata; it just means that a subsequent metadata query will trigger a refresh rather than serve cached info.
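
A small sketch of that lazy-invalidation behaviour, assuming a simple stale flag. `LazyMetadata`, `host_for`, and `fetch_count` are hypothetical names for illustration and not ruby-kafka's Cluster internals; only the `mark_as_stale!` name is taken from the library.

```ruby
# Hypothetical cache: marking it stale is cheap, the fetch happens on read.
class LazyMetadata
  attr_reader :fetch_count

  def initialize(&fetcher)
    @fetcher = fetcher   # block returning a fresh { broker id => host } map
    @stale = true        # nothing cached yet
    @fetch_count = 0
    @hosts = {}
  end

  # Only flips a flag; no network call is made here.
  def mark_as_stale!
    @stale = true
  end

  def host_for(id)
    refresh_if_stale
    @hosts.fetch(id)
  end

  private

  def refresh_if_stale
    return unless @stale
    @hosts = @fetcher.call
    @fetch_count += 1
    @stale = false
  end
end

live = { 0 => "10.0.0.1" }
metadata = LazyMetadata.new { live.dup }
metadata.host_for(0)                      # first read populates the cache

live[0] = "10.0.0.2"                      # broker 0 comes back on a new host
before = metadata.host_for(0)             # still served from the cache
metadata.mark_as_stale!
count_after_mark = metadata.fetch_count   # unchanged: marking does not fetch
after = metadata.host_for(0)              # this read triggers the refresh
```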

dasch merged commit a506abd into zendesk:master Oct 27, 2017