RabbitMQ upgrade fails due to stale /etc/hosts entries #14

Closed
markgoddard opened this Issue Sep 15, 2017 · 2 comments

markgoddard commented Sep 15, 2017

Seen on Ocata and Pike.

When running kayobe overcloud service upgrade, the RabbitMQ upgrade fails. This is caused by stale entries in /etc/hosts that map the controller's hostname to the overcloud provisioning network IP address. RabbitMQ requires the hostname to resolve to the IP address on which it is listening, namely the internal network IP address. The issue can be resolved by removing the stale entries from /etc/hosts in the rabbitmq container. They should also be removed from the host, to prevent them propagating to newly created containers.
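
A quick way to check how the hostname resolves on the host and inside the container (a sketch only; the container name rabbitmq is taken from the failing task below, and getent is assumed to be available in the image):

# Should return the internal network IP, not the provisioning network IP.
getent hosts $(hostname -s)
docker exec rabbitmq getent hosts $(hostname -s)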

Here's an example of a broken /etc/hosts:

cat /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6

10.41.253.103 sv-b16-u23
10.41.253.103 sv-b16-u23
10.41.253.103 sv-b16-u23
10.41.253.103 sv-b16-u23
10.41.253.103 sv-b16-u23
10.41.253.103 sv-b16-u23
127.0.0.1 localhost
127.0.0.1 localhost
127.0.0.1 localhost
127.0.0.1 localhost
127.0.0.1 localhost
127.0.0.1 localhost
# BEGIN ANSIBLE GENERATED HOSTS
192.168.7.11 sv-b16-u23
# END ANSIBLE GENERATED HOSTS

The 10.41.253.103 entries are incorrect.
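
Removing them by hand looks roughly like the following (a sketch only, assuming all of the stale entries use the provisioning IP 10.41.253.103 shown above; inside the container /etc/hosts is bind-mounted by Docker, so it is overwritten in place rather than edited with sed -i):

# On the host:
sed -i '/^10\.41\.253\.103 /d' /etc/hosts
# In the rabbitmq container (overwrite in place, since the file is a bind mount):
docker exec rabbitmq sh -c 'grep -v "^10\.41\.253\.103 " /etc/hosts > /tmp/hosts && cat /tmp/hosts > /etc/hosts && rm /tmp/hosts'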

Here's an example output on failure:

PLAY [Apply role rabbitmq] *****************************************************

TASK [setup] *******************************************************************
ok: [sv-b16-u23]

TASK [common : include] ********************************************************
skipping: [sv-b16-u23]

TASK [common : Registering common role has run] ********************************
skipping: [sv-b16-u23]

TASK [rabbitmq : include] ******************************************************
included: /opt/alaska/alt-1/venvs/kolla/share/kolla-ansible/ansible/roles/rabbitmq/tasks/upgrade.yml for sv-b16-u23

TASK [rabbitmq : Checking if rabbitmq container needs upgrading] ***************
ok: [sv-b16-u23]

TASK [rabbitmq : include] ******************************************************
included: /opt/alaska/alt-1/venvs/kolla/share/kolla-ansible/ansible/roles/rabbitmq/tasks/config.yml for sv-b16-u23

TASK [rabbitmq : Ensuring config directories exist] ****************************
ok: [sv-b16-u23] => (item=rabbitmq)

TASK [rabbitmq : Copying over config.json files for services] ******************
ok: [sv-b16-u23] => (item=rabbitmq)

TASK [rabbitmq : Copying over rabbitmq configs] ********************************
ok: [sv-b16-u23] => (item=rabbitmq-env.conf)
ok: [sv-b16-u23] => (item=rabbitmq.config)
ok: [sv-b16-u23] => (item=rabbitmq-clusterer.config)
ok: [sv-b16-u23] => (item=definitions.json)

TASK [rabbitmq : Find gospel node] *********************************************
fatal: [sv-b16-u23]: FAILED! => {"changed": true, "cmd": ["docker", "exec", "-t", "rabbitmq", "/usr/local/bin/rabbitmq_get_gospel_node"], "delta": "0:00:01.263525", "end": "2017-09-15 15:28:36.476105", "failed": true, "failed_when_result": true, "rc": 0, "start": "2017-09-15 15:28:35.212580", "stderr": "", "stdout": "{\"failed\": true, \"changed\": true, \"error\": \"Traceback (most recent call last):\\n  File \\\"/usr/local/bin/rabbitmq_get_gospel_node\\\", line 29, in main\\n    shell=True, stderr=subprocess.STDOUT  # nosec: this command appears\\n  File \\\"/usr/lib64/python2.7/subprocess.py\\\", line 575, in check_output\\n    raise CalledProcessError(retcode, cmd, output=output)\\nCalledProcessError: Command '/usr/sbin/rabbitmqctl eval 'rabbit_clusterer:status().'' returned non-zero exit status 2\\n\"}", "stdout_lines": ["{\"failed\": true, \"changed\": true, \"error\": \"Traceback (most recent call last):\\n  File \\\"/usr/local/bin/rabbitmq_get_gospel_node\\\", line 29, in main\\n    shell=True, stderr=subprocess.STDOUT  # nosec: this command appears\\n  File \\\"/usr/lib64/python2.7/subprocess.py\\\", line 575, in check_output\\n    raise CalledProcessError(retcode, cmd, output=output)\\nCalledProcessError: Command '/usr/sbin/rabbitmqctl eval 'rabbit_clusterer:status().'' returned non-zero exit status 2\\n\"}"], "warnings": []}

NO MORE HOSTS LEFT *************************************************************
        to retry, use: --limit @/opt/alaska/alt-1/venvs/kolla/share/kolla-ansible/ansible/site.retry

PLAY RECAP *********************************************************************
sv-b16-u23                 : ok=75   changed=5    unreachable=0    failed=1
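
The failing check can be reproduced by hand using the command from the traceback, e.g.:

docker exec -it rabbitmq /usr/sbin/rabbitmqctl eval 'rabbit_clusterer:status().'

While the stale entries are present this exits with a non-zero status (2 above); after cleaning up /etc/hosts it should succeed.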

markgoddard added a commit that referenced this issue Sep 18, 2017

Add kayobe overcloud host upgrade
This command performs necessary changes on the host to prepare the control
plane for an upgrade.

Currently this performs a workaround for issue #14, RabbitMQ upgrade failure.

We clear stale entries from /etc/hosts on the overcloud hosts and from the
rabbitmq containers, which allows the upgrade to complete successfully. The
source of the stale entries is currently unknown.

markgoddard added a commit that referenced this issue Sep 20, 2017

Apply RabbitMQ workaround for issue #14 to all overcloud hosts
In some scenarios the RabbitMQ services may be running on hosts other than the
controllers. For example, when there are separate database servers.

@markgoddard markgoddard added the bug label Sep 20, 2017

@markgoddard markgoddard added the verne label Nov 15, 2017


markgoddard commented Nov 15, 2017

@darrylweaver I've verified that we will not hit this issue during the upcoming Pike upgrade. I believe the erroneous /etc/hosts entries are added during provisioning of the control plane, and should not return once removed.


markgoddard commented Mar 13, 2018

The workaround seemed to fix this issue.
