Remediate RabbitMQ reset failures #449

Merged
merged 1 commit into from Jun 15, 2017

@jkugler
Contributor

commented Jun 15, 2017

We were getting intermittent failures after the Erlang cookie was changed.
They looked like this:

Mixlib::ShellOut::ShellCommandFailed
------------------------------------
Expected process to exit with [0], but received '69'
---- Begin output of rabbitmqctl stop_app && rabbitmqctl reset && rabbitmqctl start_app ----
STDOUT: Stopping rabbit application on node '3f49a593-39c1-4954-9c38-f3e763cb4ee3@ip-10-72-81-113' ...
STDERR: Error: unable to connect to node '3f49a593-39c1-4954-9c38-f3e763cb4ee3@ip-10-72-81-113': nodedown

DIAGNOSTICS
===========

attempted to contact: ['3f49a593-39c1-4954-9c38-f3e763cb4ee3@ip-10-72-81-113']

3f49a593-39c1-4954-9c38-f3e763cb4ee3@ip-10-72-81-113:
  * connected to epmd (port 4369) on ip-10-72-81-113
  * epmd reports: node '3f49a593-39c1-4954-9c38-f3e763cb4ee3' not running at all
                  no other nodes on ip-10-72-81-113
  * suggestion: start the node

current node details:
- node name: 'rabbitmq-cli-19@ip-10-72-81-113'
- home dir: /var/lib/rabbitmq
- cookie hash: WYSyTQI4sAl0fW/1IdQOyQ==
---- End output of rabbitmqctl stop_app && rabbitmqctl reset && rabbitmqctl start_app ----
Ran rabbitmqctl stop_app && rabbitmqctl reset && rabbitmqctl start_app returned

The Erlang VM was up, but it had not yet had time to bring up the RabbitMQ app.
This patch adds retries to the command to give it the time it needs.
Since we saw this error only intermittently, a minute of retries
should be more than enough.
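
The retry approach described above can be sketched in plain Ruby. This is a minimal sketch under stated assumptions, not the patch itself: `run_with_retries`, its `sleeper` parameter, `reset_node!`, and the values 12 and 5 (chosen to add up to roughly a minute of waiting) are hypothetical. The semantics are meant to mirror Chef's common `retries` and `retry_delay` resource properties: one initial attempt, plus up to `retries` further attempts with a pause between them.

```ruby
# Hypothetical helper: run the block, and on failure wait `retry_delay`
# seconds and try again, up to `retries` additional times.
# `sleeper` is injectable so the waiting can be stubbed out in tests.
def run_with_retries(retries: 12, retry_delay: 5, sleeper: ->(seconds) { sleep(seconds) })
  attempts = 0
  begin
    attempts += 1
    yield attempts
  rescue StandardError
    # Give up and re-raise the last error once the retry budget is spent.
    raise if attempts > retries
    sleeper.call(retry_delay)
    retry
  end
end

# The command from the failure output above.
RESET_COMMAND = 'rabbitmqctl stop_app && rabbitmqctl reset && rabbitmqctl start_app'.freeze

# Hypothetical wrapper: treat a non-zero exit (or a missing binary) as a
# failure so that the helper retries it.
def reset_node!
  run_with_retries do
    system(RESET_COMMAND) or raise "reset-node failed: #{$?.inspect}"
  end
end
```

In a Chef recipe the same effect would normally come from setting `retries` and `retry_delay` on the `execute[reset-node]` resource rather than from a hand-written loop; the exact values used in this patch are not stated here.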

I did not add any tests because I am not sure how to test an intermittent failure.

@amulyas

Contributor

commented Jun 15, 2017

This will help us with the race condition we are currently hitting.

@michaelklishin

Member

commented Jun 15, 2017

@jjasghar are you fine with merging this?

@michaelklishin

Member

commented Jun 15, 2017

@jkugler thank you!

@jkugler

Contributor Author

commented Jun 15, 2017

Looks like FoodCritic doesn't like some of the providers...but I didn't change those. Will that prevent a merge?

https://travis-ci.org/rabbitmq/chef-cookbook/jobs/243461797

@jjasghar jjasghar merged commit 959f2c5 into rabbitmq:master Jun 15, 2017

1 check was pending

continuous-integration/travis-ci/pr The Travis CI build is in progress
@jjasghar

Collaborator

commented Jun 15, 2017

I'll release this cookbook tomorrow :)

@jkugler jkugler deleted the jkugler:fix_node_reset_failure branch Jun 15, 2017

@jkugler

Contributor Author

commented Jun 15, 2017

Will this be version 5.1.1?

@jjasghar

Collaborator

commented Jun 15, 2017

I'll have to verify the changes, but I'm pretty sure it'll be 5.2.0.

@jkugler

Contributor Author

commented Jun 15, 2017

OK, sounds good.

@jkugler

Contributor Author

commented Jun 16, 2017

Thanks for the new release!

@amulyas

Contributor

commented Jul 26, 2017

We are still seeing reset failures when recreating all nodes:

================================================================================
Error executing action `run` on resource 'execute[reset-node]'
================================================================================

Mixlib::ShellOut::ShellCommandFailed
------------------------------------
Expected process to exit with [0], but received '70'
---- Begin output of rabbitmqctl stop_app && rabbitmqctl reset && rabbitmqctl start_app ----
STDOUT: Stopping rabbit application on node '347ad9c0-05d5-4d46-a97c-77c110753d7f@ip-10-72-80-163'
Resetting node '347ad9c0-05d5-4d46-a97c-77c110753d7f@ip-10-72-80-163'
STDERR: Error: {no_running_cluster_nodes,"You cannot leave a cluster if no online nodes are present."}
---- End output of rabbitmqctl stop_app && rabbitmqctl reset && rabbitmqctl start_app ----
Ran rabbitmqctl stop_app && rabbitmqctl reset && rabbitmqctl start_app returned 70

Cookbook Trace:
---------------
/var/chef/cache/cookbooks/compat_resource/files/lib/chef_compat/monkeypatches/chef/runner.rb:78:in `run_action'
@amulyas

Contributor

commented Jul 26, 2017

STDERR: Error: {no_running_cluster_nodes,"You cannot leave a cluster if no online nodes are present."}

@jkugler

Contributor Author

commented Jul 26, 2017

This looks like another timing issue, but it may not be related to this fix. As I understand it, RabbitMQ shouldn't care if you tell it to leave a cluster that has no other nodes.
