no binding 'xxx' between exchange 'yyy' in vhost 'zzz' and queue #887

Closed
jianping-roth opened this Issue Jul 19, 2016 · 14 comments


We are running RabbitMQ 3.5.7 on Erlang R14B04, in a cluster of 3 nodes, and we have experienced a partition failure. Since then, we can no longer declare a binding between an exchange and a queue. In the RabbitMQ management web UI we can see the queue and the exchange, but the queue has no bindings. We have since restarted the RabbitMQ nodes one at a time; the problem persisted, and every binding declaration has failed since.

Please help us identify the root cause of the problem. FYI, 'rabbitmqctl list_queues' showed the queue in the 'running' state.
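
A minimal sketch of how such a missing binding can be confirmed from the command line, assuming the vhost 'zzz' from the title; both commands ship with RabbitMQ:

    # List queues and their state in the affected vhost
    rabbitmqctl list_queues -p zzz name state
    # List bindings; the failed binding should be absent from this output
    rabbitmqctl list_bindings -p zzz source_name destination_name routing_key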

Searching the internet, we found a few possible solutions:

  1. Delete and recreate the queue, perhaps with different parameters such as durability, to trick the server into letting go of any cached state (see the sketch after this list).
  2. Delete and recreate the virtual host: we cannot do this because our production system already contains many other queues and bindings.
  3. Shut down the entire RabbitMQ cluster. Again, we cannot do this in our production environment.
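
As a rough illustration of option 1, here is a sketch using the rabbitmqadmin tool from the management plugin; the queue name 'my_queue' and routing key 'my_key' are hypothetical placeholders, since the real names are not shown in this thread:

    # Hypothetical names throughout; match the original queue's
    # durability and arguments when redeclaring it.
    rabbitmqadmin -V zzz delete queue name=my_queue
    rabbitmqadmin -V zzz declare queue name=my_queue durable=true
    rabbitmqadmin -V zzz declare binding source=yyy destination=my_queue routing_key=my_key
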
michaelklishin (Member) commented Jul 19, 2016

Please post questions to rabbitmq-users or Stack Overflow. RabbitMQ uses GitHub issues for specific actionable items engineers can work on, not questions. Thank you.

michaelklishin (Member) commented Jul 19, 2016

The root cause isn't known, but the workarounds you list make some sense. There have been dozens of bug fixes, including fixes to Mnesia itself in Erlang 18.3.3 contributed by our team, so consider upgrading to 3.6.3 (which means a cluster shutdown, because it is a feature version A => feature version B upgrade).
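
For reference, a full-cluster shutdown of the kind described above is normally done node by node, and the conventional rule is that the last node stopped is the first node started; the node names below are hypothetical:

    # Stop every node (hypothetical node names); node1 is stopped last
    rabbitmqctl -n rabbit@node3 stop
    rabbitmqctl -n rabbit@node2 stop
    rabbitmqctl -n rabbit@node1 stop
    # ... upgrade packages on all nodes ...
    # Start node1 first, since it was stopped last, then node2 and node3
    rabbitmq-server -detached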

jianping-roth commented Jul 19, 2016

Thank you for your reply :)

noahhaon commented Jul 19, 2016

@jianping-roth Hi, we've hit this issue a few times after a partition on 3.5.7 and Erlang 17.5. Pivotal support was unable to find the root cause. Deleting and recreating the queue did not resolve the issue for us; we had to create a similar binding with a wildcard to avoid a cluster restart (see the sketch below).

Otherwise, stopping and starting the entire cluster fixes the problem - which is our SOP now for "handling" partitions in a RMQ cluster, as we frequently see a variety of stability issues after a partition. The best bet is just to restart the whole thing.
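
A sketch of the wildcard workaround mentioned above, assuming a topic exchange (where the '#' pattern matches any routing key) and the same hypothetical names as earlier:

    # On a topic exchange, '#' matches all routing keys
    rabbitmqadmin -V zzz declare binding source=yyy destination=my_queue routing_key='#'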

spencer1248 commented Mar 29, 2017

I just ran into this with rabbitmq-server 3.5.4 and Erlang R16B03 on a 3-node cluster following a network partition. We resolved it without taking down our cluster by deleting the exchange identified in our log. Deleting the queue did not resolve it, nor did manually creating a binding between the exchange and the queue.
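
A sketch of this delete-and-recreate-the-exchange approach, again with hypothetical names; note that deleting an exchange drops all of its bindings, so every binding on that exchange must be redeclared afterwards:

    # Hypothetical names; the type and durability must match the original exchange
    rabbitmqadmin -V zzz delete exchange name=yyy
    rabbitmqadmin -V zzz declare exchange name=yyy type=topic durable=true
    rabbitmqadmin -V zzz declare binding source=yyy destination=my_queue routing_key=my_key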

nagas commented Apr 7, 2017

We also ran into this issue after a network partition, with rabbitmq-server 3.6.6 and Erlang 19.2.

RogerSolerV commented Feb 7, 2018

Same here; deleting the affected exchanges and recreating everything from scratch worked. Thanks @spencer1248.

gujun4990 commented Mar 15, 2018

We encountered the same issue. Our RabbitMQ version is 3.6.5 and we also have a cluster of 3 nodes. Does anybody know whether this issue has been solved, or how to avoid it?

lukebakken (Member) commented Mar 15, 2018

@gujun4990 - please read the responses in this issue; several people list a fix that works. If you wish to discuss, please use the mailing list.

svrx commented May 19, 2018

This issue has been found to still affect 3.6.12. Binding to queues fails; only deleting the exchange and recreating it seems to work.

This is extremely painful to deal with in production, because it may happen during OS patch restarts and can require multiple tries to bring an application back up with the right subscriptions.

Would you consider reopening this bug? Let me know what information you would require.

michaelklishin (Member) commented May 20, 2018

We would not consider reopening this bug. 3.6.12 is 3 releases behind even the latest 3.6.x, and 3.6.x itself is technically out of support.

We have one known scenario where this was caused by a queue that had non-ASCII characters in its name. We don't have many details, but if there is a way to reproduce it from scratch, we'd like to hear about it on rabbitmq-users (the mailing list).

kajottnasdaq commented Jun 5, 2018

I was just hit with this on 3.7.4 with Erlang 20.3.4, after a switch patching reboot left a 2-node cluster partitioned... We are still in test, but risking this silent loss happening at patching time, as opposed to having a one-node cluster fail with a clean break, is making me seriously reconsider deploying a cluster in production. Right now the risks seem to outweigh the benefits.

michaelklishin (Member) commented Jun 5, 2018

I'm sorry, but we don't have much to add to this issue at this time. We have been trying to find a way to reproduce it for over a year. Maybe one day we will dedicate an engineer to work on this issue for months before we understand what's going on. Today is not that day.

rabbitmq locked and limited conversation to collaborators Jun 5, 2018

michaelklishin (Member) commented Jun 5, 2018

Those who have hypotheses, and evidence to back them, as to what the root cause is are welcome to share them on the mailing list.
