Nodes try to rejoin cluster when first listed node is down #347

Closed
ccrebolder opened this issue Feb 22, 2016 · 2 comments

Regarding the line https://github.com/jjasghar/rabbitmq/blob/master/providers/cluster.rb#L202, it looks as if the elsif statement is checking whether `var_node_name_to_join` is part of `cluster_status`, but I think it should be checking `var_node_name`. `var_node_name_to_join` is just set to the first node name in the array passed into the LWRP.

I discovered that when powering down or stopping the first listed node in `node['rabbitmq']['clustering']['cluster_nodes']`, the other nodes worked fine until chef-client ran. They would then attempt to rejoin the cluster, because the first node was no longer listed in the `running_nodes` output of `rabbitmqctl cluster_status`, and to rejoin they would try to connect again to the first node, which would fail as it was turned off. This would result in the whole cluster coming down.
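For context, here is a minimal Ruby sketch of the kind of check involved. The real `joined_cluster?` lives in `providers/cluster.rb`; this substring-based helper and the node names are simplified illustrations, not the cookbook's actual code:

```ruby
# Hypothetical stand-in for the cookbook's joined_cluster? check: treat
# membership as "does the node name appear in the running_nodes portion
# of `rabbitmqctl cluster_status` output?".
def joined_cluster?(node_name, cluster_status)
  cluster_status.include?(node_name)
end

# Pretend node-1 (first in cluster_nodes) is powered off, so only node-2
# appears in running_nodes on node-2's own status output:
running = "{running_nodes,['rabbit@node-2']}"

var_node_name = 'rabbit@node-2'          # the node chef-client runs on
var_node_name_to_join = 'rabbit@node-1'  # first entry in cluster_nodes

joined_cluster?(var_node_name_to_join, running)  # => false: triggers a rejoin
joined_cluster?(var_node_name, running)          # => true: node-2 is clustered
```

Checking the first node's name instead of the current node's name is what makes a healthy node try to rejoin (and fail) whenever the first node is offline.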

Let me know if you'd like more info, or a PR for this.

ccrebolder added a commit to ccrebolder/rabbitmq that referenced this issue Feb 23, 2016

Fix check for whether node has joined cluster
The call to `joined_cluster?` was passing in the `to_join` node name
instead of the current node name. This resulted in the nodes trying to
rejoin whenever the `to_join` node was offline.

Resolves rabbitmq#347
@Rarian

commented Mar 7, 2016

+1

@jjasghar jjasghar closed this in #348 Mar 8, 2016

@opsline-radek

commented Apr 10, 2016

This fix breaks new cluster builds. When a new node comes up, it forms a one-node cluster of its own, so the check always returns true and the node never joins the others.

When I run `cluster_status` on a new node I get this:

```
# rabbitmqctl cluster_status
Cluster status of node 'rabbit@production-rabbitmq-6' ...
[{nodes,[{disc,['rabbit@production-rabbitmq-6']}]},
 {running_nodes,['rabbit@production-rabbitmq-6']},
 {cluster_name,<<"mycluster">>},
 {partitions,[]},
 {alarms,[{'rabbit@production-rabbitmq-6',[]}]}]
```

Let's say the `cluster_nodes` list contains production-rabbitmq-5 and production-rabbitmq-6. With the new code, node 6 checks whether it itself is part of the cluster, and according to the output above it is, so it will never join node 5. It should check whether node 5 is part of the cluster and, if not, join it. The original code was correct.
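To make the failure mode concrete, here is a simplified, hypothetical substring check applied to the fresh-node output pasted above (the helper is an illustration, not the cookbook's actual code):

```ruby
# Hypothetical stand-in for the membership check: is the given node name
# present anywhere in the `rabbitmqctl cluster_status` output?
def joined_cluster?(node_name, cluster_status)
  cluster_status.include?(node_name)
end

# Status output of a brand-new node 6, which is a one-node cluster of itself:
fresh_status = "[{nodes,[{disc,['rabbit@production-rabbitmq-6']}]}," \
               " {running_nodes,['rabbit@production-rabbitmq-6']}]"

# Node 6 checking its own name: always true, so it never attempts to join 5.
joined_cluster?('rabbit@production-rabbitmq-6', fresh_status)  # => true
# Checking the node to join (node 5), as the original code did,
# correctly reports that the cluster has not been formed yet:
joined_cluster?('rabbit@production-rabbitmq-5', fresh_status)  # => false
```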

The fix is simple: if a node is down and has been removed from the Chef server, update the `cluster_nodes` attribute to remove it, or better, build the list dynamically from a Chef search in a wrapper cookbook.
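A rough sketch of the dynamic approach. The `search` call and attribute path mirror Chef's recipe DSL, and the `rabbitmq` role name is an assumption; the runnable part below only demonstrates building the sorted member list from hostnames:

```ruby
# In a wrapper cookbook recipe, something like (Chef DSL, assumed role name):
#
#   hostnames = search(:node, 'roles:rabbitmq').map { |n| n['hostname'] }
#
# would replace the hard-coded list. From there, building the attribute:
hostnames = ['production-rabbitmq-6', 'production-rabbitmq-5']  # e.g. from search

# Sorting matters: every node must compute the same ordered list so they
# all agree on which node the others should join first.
members = hostnames.map { |h| "rabbit@#{h}" }.sort
# members == ["rabbit@production-rabbitmq-5", "rabbit@production-rabbitmq-6"]

# In the recipe this would then be assigned to the attribute:
#   node.default['rabbitmq']['clustering']['cluster_nodes'] = members
```

A downed node that has been deleted from the Chef server then drops out of the search results on the next chef-client run, so the remaining nodes never try to contact it.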

jjasghar pushed a commit that referenced this issue Jun 16, 2017

Fix check for whether node has joined cluster
The call to `joined_cluster?` was passing in the `to_join` node name
instead of the current node name. This resulted in the nodes trying to
rejoin whenever the `to_join` node was offline.

Resolves #347