Exchange federation links not automatically restarted after a rolling upgrade to 3.8.15 or later release #3148

Closed
johanrhodin opened this issue Jun 25, 2021 · 13 comments

Comments

@johanrhodin
Contributor

Up to and including RabbitMQ 3.8.14, restarting one node in a multi-node cluster would cause the associated exchange federation links to move to another node in the cluster. (This is the documented behavior in https://www.rabbitmq.com/federation.html#clustering: "Exchange federation links will start on any node in the downstream cluster. They will fail over to other nodes if the node they are running on crashes or stops.")

In 3.8.15 and later, if the node hosting the links goes down, the federation link is removed, and the policy must be re-created for the links to reappear. While the node serving the link is stopping, the management interface shows a status of "Starting"; once the node has fully stopped, no links are shown at all.

[Screenshot attached: Screen Shot 2021-06-24 at 11 30 13 PM]

In my example scenario I used a one-node cluster (3.8.16 with Erlang 24.0.2) as upstream and a three-node cluster (3.8.14/15 with Erlang 23.2.3) as downstream. After noting which node ran the link, I stopped that node.
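
One way to see which node currently hosts a link is to ask the management HTTP API. The following is a minimal sketch, assuming the `rabbitmq_federation_management` plugin is enabled on the downstream cluster so that the `/api/federation-links/<vhost>` endpoint is available. The helper names `federation_links_url` and `list_federation_links` are hypothetical, and the vhost must be URL-encoded if it contains special characters.

```shell
# Build the federation link status URL for one vhost.
# $1: management API base URL (may include credentials), $2: vhost name.
federation_links_url() {
  printf '%s/api/federation-links/%s\n' "$1" "$2"
}

# Fetch the link status as JSON. Assumption: each entry in the response
# reports the node the link runs on and its status.
list_federation_links() {
  curl -sS -f "$(federation_links_url "$1" "$2")"
}

# Usage, with the variables from this report:
#   list_federation_links "$HTTPS_DOWNSTREAM" "$DOWNSTREAM_VHOST"
```

Running the call before and after stopping a node shows whether the link moved or disappeared.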

The following policy was used:
curl -i -X PUT -H 'Content-Type: application/json' $HTTPS_DOWNSTREAM/api/policies/$DOWNSTREAM_VHOST/fedit/ -d '{"pattern":".", "definition": {"federation-upstream-set":"all"}, "priority":0, "apply-to": "exchanges"}'

and the upstream is defined as:
curl -i -XPUT -H "content-type:application/json" -d'{"value":{"uri":"'$UPSTREAM_URL'","expires":3600000}}' $HTTPS_DOWNSTREAM/api/parameters/federation-upstream/$DOWNSTREAM_VHOST/upstream

This issue was reported both to us (CloudAMQP) and to RabbitMQ Slack: https://rabbitmq.slack.com/archives/C1EDN83PA/p1623755314386000

@michaelklishin
Member

Has any investigation into this been conducted by CloudAMQP?

The only changes relevant to Federation after 3.8.14 were

and the upgrade of Ranch to 2.0 which I doubt can matter here.

@johanrhodin
Contributor Author

No further investigation has been done apart from confirming the issue and constructing cases to test it.

I take it that one way to debug would be to build 3.8.x without 98724ef#diff-f478bae44be51a403ce03f61c0db27438df938766af16db7f2407c32b774eb8f and b9836cc#diff-aea6cb0036e8357870edde16f99c4b4dfd3d91e013d937ed66135da1703d18f6, see if the issue remains, and go from there?

@michaelklishin
Member

Reverting those is one option, but seeing what mirrored supervisor messages are produced at debug level might be enough.

@johanrhodin
Contributor Author

OK, attaching debug level logs (rabbitmqctl set_log_level debug) from the three downstream nodes.
rabbit@test-burly-silver-chamois-01.log
rabbit@test-burly-silver-chamois-02.log
rabbit@test-burly-silver-chamois-03.log

Node -02 was the one running the federation link; it was stopped and then started again.
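
For anyone repeating this, the log level can be raised on every downstream node in one pass. This is a minimal sketch, assuming the node names match the attached log file names and that the local `rabbitmqctl` can reach each node (same Erlang cookie, network access). `run` and `set_debug_on_nodes` are hypothetical helper names.

```shell
# Run a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = 1 ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Set the log level to debug on each node named in the arguments.
set_debug_on_nodes() {
  for node in "$@"; do
    run rabbitmqctl -n "$node" set_log_level debug
  done
}

# Usage (node names assumed from the log file names above):
#   set_debug_on_nodes rabbit@test-burly-silver-chamois-01 \
#                      rabbit@test-burly-silver-chamois-02 \
#                      rabbit@test-burly-silver-chamois-03
```

Setting `DRY_RUN=1` first prints the commands without touching any node.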

@michaelklishin
Member

I cannot reproduce this with three v3.8.x nodes. When I stop nodes that host some exchange federation links, the links migrate to one of the online nodes and recover their connections.

@michaelklishin
Member

@johanrhodin when you say

3.8.14/15 nodes

does this mean this is a mixed-version cluster? In a mixed-version cluster, 3.8.15 nodes use a different process group membership library, pg instead of pg2 (as pg2 was removed), so the mirrored supervisors that ultimately start and monitor federation links cannot see old cluster members after the upgrade (the nodes remain clustered, sure, but not as far as mirrored supervisors go).

It is expected that during a rolling upgrade to 3.8.15, features that rely on process groups will not observe new cluster members on old nodes. This is the same problem as described in #3080 but in a different place.
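
To make the version boundary concrete, here is a minimal sketch that flags a cluster whose nodes straddle the 3.8.15 cut-over described above. `version_ge` and `is_mixed_pg_cluster` are hypothetical helpers, not part of any RabbitMQ tool, and the list of node versions has to be gathered separately (for example from `rabbitmqctl cluster_status` output).

```shell
# Succeed when version $1 >= version $2 (numeric major.minor.patch compare).
version_ge() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    split(a, x, "."); split(b, y, ".")
    for (i = 1; i <= 3; i++) {
      if (x[i] + 0 > y[i] + 0) exit 0
      if (x[i] + 0 < y[i] + 0) exit 1
    }
    exit 0
  }'
}

# Succeed when the given node versions include both pre-3.8.15 nodes (pg2)
# and 3.8.15-or-later nodes (pg), i.e. the cluster is mid rolling upgrade.
is_mixed_pg_cluster() {
  old=0
  new=0
  for v in "$@"; do
    if version_ge "$v" 3.8.15; then new=1; else old=1; fi
  done
  [ "$old" -eq 1 ] && [ "$new" -eq 1 ]
}

# Usage:
#   is_mixed_pg_cluster 3.8.14 3.8.16 3.8.16 && echo "links may not fail over"
```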

As explained in #3080, there isn't much our team can do about this. pg2 was removed in Erlang 24 and replaced
with pg (the original name of the process group module), so even if we wanted to invest time into introducing
a module that would use both and merge their member sets, we cannot do that while still supporting Erlang 24.

For federation users, clearing and re-enabling the policy should be sufficient to bring back the links on the upgraded post-3.8.15 nodes.

@michaelklishin
Member

I could not reproduce this with 3.8.16 or 3.8.18, but could with a mixed-version cluster: a 3.8.14 cluster upgraded to 3.8.16 in a rolling fashion.

All the symptoms from #3080 were present: this is a side-effect of a changed process group module used by plugins
that are distributed one way or another: Federation, Shovel, management.

The bad news is that we cannot do anything about this without dropping (and never re-introducing) Erlang 24 support.
The good news is that there is a trivial workaround: after all nodes are upgraded to 3.8.16 or later, remove the policy that enables exchange federation
and re-create it. The link will be started and migrated between cluster nodes as you'd expect.
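
The workaround can be scripted against the management HTTP API. This is a minimal sketch, assuming the policy is named `fedit` as in the original report and that the base URL carries credentials. `run` and `recreate_federation_policy` are hypothetical helper names, and the vhost must be URL-encoded if it contains special characters.

```shell
# Run a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = 1 ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Delete a policy and create it again with the same definition.
# $1: API base URL, $2: vhost, $3: policy name, $4: policy JSON body.
recreate_federation_policy() {
  url="$1/api/policies/$2/$3"
  run curl -sS -f -X DELETE "$url" &&
    run curl -sS -f -X PUT -H 'Content-Type: application/json' -d "$4" "$url"
}

# Usage, once every node runs 3.8.16 or later:
#   recreate_federation_policy "$HTTPS_DOWNSTREAM" "$DOWNSTREAM_VHOST" fedit \
#     '{"pattern":".", "definition": {"federation-upstream-set":"all"}, "priority":0, "apply-to": "exchanges"}'
```

The PUT only runs if the DELETE succeeds, so a failed removal is not masked by a successful re-creation.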

@michaelklishin changed the title from "Exchange federation links not automatically restarted in RabbitMQ >=3.8.15" to "Exchange federation links not automatically restarted after a rolling upgrade to 3.8.15 or later release" on Jul 3, 2021
michaelklishin added a commit that referenced this issue Jul 5, 2021
michaelklishin added a commit that referenced this issue Jul 5, 2021

… changes in the logs.

References #3148.

(cherry picked from commit 65ccf7c)

@johanrhodin
Contributor Author

@michaelklishin I should have been more explicit with 3.8.14/15. What I meant was that with 3.8.14 it worked, and when the same cluster was upgraded past 3.8.14 (to 3.8.15 or higher) it didn't work anymore. I didn't try with a mixed cluster.

I still see this with all nodes involved running Erlang 24.0.2 and RabbitMQ 3.8.19. I will try and create a minimal working example for reproduction.

@johanrhodin
Contributor Author

OK I can't reproduce with 3.8.21. I assume it is because of #3263.

I can reproduce with 3.8.16, with the following:

# 1. Two cluster definitions
UPSTREAM_URL="amqps://xcjvgoyg:PASSWD@test-myrtle-green-stingray.rmq2.cloudamqp.com/xcjvgoyg" # 1 node 3.8.21

DOWNSTREAM_VHOST=kjdjuxhr

HTTPS_DOWNSTREAM="https://kjdjuxhr:PASSWD@test-exotic-blond-duckbill.rmq2.cloudamqp.com" # 3 nodes 3.8.21

# 2. Create federation-upstream on downstream
curl -i -XPUT -H "content-type:application/json" -d'{"value":{"uri":"'$UPSTREAM_URL'","expires":3600000}}' $HTTPS_DOWNSTREAM/api/parameters/federation-upstream/$DOWNSTREAM_VHOST/upstream

# 3. Create a federation policy on downstream
curl -i -X PUT -H 'Content-Type: application/json' $HTTPS_DOWNSTREAM/api/policies/$DOWNSTREAM_VHOST/fedit/ -d '{"pattern":".", "definition": {"federation-upstream-set":"all"}, "priority":0, "apply-to": "exchanges"}'

# 4. Stop RabbitMQ on the node that has the link running on downstream. 
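
After step 4, whether a link comes back can be checked by polling the link status. This is a minimal sketch, assuming the `rabbitmq_federation_management` plugin is enabled and that a healthy link appears in the JSON with the literal text `"status":"running"` (an assumption about the response format, not something confirmed in this thread). `wait_for_running_link` is a hypothetical helper name.

```shell
# wait_for_running_link FETCH ATTEMPTS DELAY_SECONDS
# FETCH names a command or function that prints the JSON returned by
# GET /api/federation-links/<vhost>. Succeeds as soon as one link is
# reported as running; fails after ATTEMPTS polls without one.
wait_for_running_link() {
  fetch=$1
  attempts=$2
  delay=$3
  i=0
  while [ "$i" -lt "$attempts" ]; do
    body=$("$fetch" 2>/dev/null) || body=""
    case $body in
      *'"status":"running"'*) return 0 ;;
    esac
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Usage:
#   fetch_links() { curl -sS -f "$HTTPS_DOWNSTREAM/api/federation-links/$DOWNSTREAM_VHOST"; }
#   wait_for_running_link fetch_links 30 2 || echo "no running federation link"
```

Passing the fetch command by name keeps the polling logic separate from the HTTP call, so it can be tried against canned JSON first.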

@lfstuttgart

Hi,

we seem to have the same situation with RabbitMQ 3.13.2 on Erlang 26.2.4. It is not consistent, but sometimes during a cluster restart (one node at a time) we lose the federation link.

In our case federation is used to transport messages from one vhost to another.

I have uploaded logs from our two-node cluster. The last time I see anything about the federation link in there is at 2024-06-17 14:48:13.975211+02:00 and in the lines immediately following. The federation upstream points to a load balancer address (x17-rabbit-ha.stuttgart.de).

This cluster is not yet operational. We have another, older cluster with RabbitMQ 3.8.2 on Erlang 22.2.7 with the same configuration, except for the cluster names, the address of the load balancer, and the introduction of certificates within the upstream URI, needed since Erlang 26.

Certificate data to show this should not be an issue:
CN: rabbit.stuttgart.de
SAN: DNS:rabbit.stuttgart.de, DNS:x17-rabbit-ha.stuttgart.de, DNS:x17-rabbit1.stuttgart.de, DNS:x17-rabbit2.stuttgart.de
IP Address:10.163.41.70, IP Address:10.163.41.71, IP Address:10.163.41.69
Validity
Not Before: Apr 8 13:12:00 2024 UTC
Not After : May 11 13:13:00 2025 UTC

Is there any solution to this?

Thank you very much in advance for your help!

rabbit@x17-rabbit2.log

rabbit@x17-rabbit1.log

@johanrhodin
Contributor Author

@lfstuttgart start a new discussion and provide exact steps to reproduce (even if it only happens sometimes). We can then take a look at this (potential) issue.
In that new discussion you can link to this one, for context.

@lfstuttgart

@johanrhodin thanks! Here is the new discussion: https://github.com/rabbitmq/rabbitmq-server/issues/11492

@johanrhodin
Contributor Author

Here is the new discussion: #11495
