This repository has been archived by the owner on Nov 17, 2020. It is now read-only.

Bumping vhost start timeout #591

Merged

Conversation

kitsirota
Contributor

@kitsirota kitsirota commented Jul 27, 2018

This issue is new since we upgraded from 3.6.x. We run 5-node clusters on RabbitMQ 3.7.7 / Erlang 20.3.8.1. Once we reach about 15 vhosts, new vhosts can take longer than 15s to create, which typically leaves unhealthy vhosts with one or more "stopped" nodes.

Proposed Changes

We would like to bump the limit to 45 seconds so that we no longer have to detect failed vhosts with an external monitoring solution and restart them via the /api/vhosts/name/start/node endpoint.
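The external workaround described above can be sketched roughly as follows. This is a minimal illustration, not the author's actual tooling: the base URL, vhost, node name, and credentials are placeholders, while the endpoint path follows the RabbitMQ management HTTP API (`POST /api/vhosts/{name}/start/{node}`).

```python
import urllib.parse
import urllib.request


def start_vhost_url(base_url: str, vhost: str, node: str) -> str:
    """Build the management API URL that restarts a vhost on one node."""
    # vhost and node names must be percent-encoded (e.g. "@" -> "%40")
    return "{}/api/vhosts/{}/start/{}".format(
        base_url.rstrip("/"),
        urllib.parse.quote(vhost, safe=""),
        urllib.parse.quote(node, safe=""),
    )


def start_vhost(base_url: str, vhost: str, node: str, auth_header: str) -> None:
    """POST to the start endpoint; requires management-plugin credentials."""
    req = urllib.request.Request(
        start_vhost_url(base_url, vhost, node),
        method="POST",
        headers={"Authorization": auth_header},
    )
    urllib.request.urlopen(req)  # raises on a non-2xx response


# Example with placeholder host and node names:
# start_vhost("http://rmq.example.com:15672", "tenant-7",
#             "rabbit@node1", "Basic Z3Vlc3Q6Z3Vlc3Q=")
```

A monitoring job would call this for each vhost that the `/api/vhosts` listing reports as stopped on some node; bumping the timeout avoids needing that loop in the common case.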


Checklist

Put an x in the boxes that apply. You can also fill these out after
creating the PR. If you're unsure about any of them, don't hesitate to
ask on the mailing list. We're here to help! This is simply a reminder
of what we are going to look for before merging your code.

  • [x] I have read the CONTRIBUTING.md document
  • [x] I have signed the CA (see https://cla.pivotal.io/sign/rabbitmq)
  • [ ] All tests pass locally with my changes
  • [ ] I have added tests that prove my fix is effective or that my feature works
  • [ ] I have added necessary documentation (if appropriate)
  • [ ] Any dependent changes have been merged and published in related repositories

@michaelklishin
Member

This is merely a workaround for a problem somewhere else. 15 seconds to create a virtual host is pretty extreme.

@michaelklishin
Member

I am not against merging this, but please post some details to the mailing list on how this can be reproduced (and give 3.7.8-rc.1 a shot; it has non-trivial optimizations in virtual host recovery).

@kitsirota
Contributor Author

kitsirota commented Jul 27, 2018

@michaelklishin I absolutely agree; this is purely a band-aid for a condition that should probably be handled asynchronously. We're having trouble identifying the root cause of the variable request times.

In our use case, we're deploying containers in a Pivotal Cloud Foundry deployment. We seem to reach the point where vhosts take longer than 15 seconds to create once we're up to about 300-400 containers spread across roughly 15 vhosts. The underlying nodes don't seem to have any operational bottlenecks (no load/memory/IO issues when this starts happening).

I'll also try 3.7.8-rc1 and report back if that helps.

Thanks!

@michaelklishin
Member

@kitsirota so, 300-400 application instances? How many connections do they open on average? (A ballpark estimate would do.)

@michaelklishin michaelklishin merged commit ae223d9 into rabbitmq:master Jul 29, 2018
@michaelklishin michaelklishin added this to the 3.7.8 milestone Jul 29, 2018
@michaelklishin
Member

So apparently the management part of #575 was not cherry-picked to v3.7.x 🤦‍♂️, so the bump per se may or may not be necessary but 45s is not an unreasonable value.
