
User stacks starting before infra services end up in a deadlock because they exhaust the process-blocking thread pool #9680

Closed
aemneina opened this issue Aug 15, 2017 · 2 comments


@aemneina

Rancher versions:
rancher/server: v1.6.5

Steps to Reproduce:

  1. Have a large number of hosts and services per environment.
  2. Reboot all hosts.
  3. Observe that delayed processes grow indefinitely.
  4. Observe that the ProcessBlocking pool is fully consumed.

Results:
The ProcessBlocking thread pool was expanded so that processes could start processing again.
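The failure mode here is classic thread-pool starvation: every worker in the bounded ProcessBlocking pool is occupied by a process that is itself waiting on work queued behind it on the same pool. A minimal, self-contained JDK sketch of that pattern (illustrative only, not Cattle's actual code):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Starvation deadlock: each worker blocks on a child task that is
// queued behind it on the same bounded pool, so nothing ever runs.
public class PoolStarvationDemo {
    public static void main(String[] args) {
        ExecutorService blockingPool = Executors.newFixedThreadPool(2);

        for (int i = 0; i < 2; i++) {
            blockingPool.submit(() -> {
                // Child process queued on the same pool the parent occupies.
                Future<?> child = blockingPool.submit(() -> {});
                return child.get(); // parent holds its worker while waiting
            });
        }
        // Both workers now block on children that can never be scheduled:
        // the "delayed processes grow indefinitely" symptom from the steps
        // above. The JVM hangs here.
    }
}
```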

@deniseschannon

Available with rancher/server:v1.6.8-rc4

@moelsayed
Contributor

moelsayed commented Aug 18, 2017

I was able to reproduce the issue with the following parameters:

rancher v1.6.5
20 hosts
2 environments
50 services, scale 30 each
pool.processblockingexecutorservice.max.size = 5
pool.processblockingexecutorservice.core.size = 2
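For context, the core.size and max.size knobs correspond to the core and maximum pool sizes of a standard java.util.concurrent.ThreadPoolExecutor. A sketch of a pool configured with the values above (my reading of the settings, not Cattle's actual wiring; the queue bound is an assumption):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class BlockingPoolConfig {
    // pool.processblockingexecutorservice.core.size = 2
    // pool.processblockingexecutorservice.max.size  = 5
    static final ExecutorService PROCESS_BLOCKING = new ThreadPoolExecutor(
            2,                                // core threads kept alive
            5,                                // hard cap on worker threads
            60, TimeUnit.SECONDS,             // idle timeout above core size
            new LinkedBlockingQueue<>(100));  // queue bound is an assumption;
                                              // the pool only grows past core
                                              // size once this queue is full
}
```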

Steps to reproduce:

  • Install v1.6.5.
  • Create 2 environments and add 10 hosts to each.
  • Create 25 stacks per environment at scale 30, running nginx containers.
  • Shut down all the hosts.
  • Wait for all hosts to show as disconnected on the server.
  • Start the hosts back up.

Results:
The services on the setup remained in the updating-active state for several hours.

To verify the merged fix, I tested with the following setup parameters:

rancher v1.6.8-rc4
20 hosts
2 environments
50 services, scale 30 each
pool.processblockingexecutorservice.max.size = 5
pool.processblockingexecutorservice.core.size = 2
pool.processblockingextraexecutorservice.core.size = 2
pool.processblockingextraexecutorservice.max.size = 5
pool.processblockingsystemexecutorservice.max.size = 5
pool.processblockingsystemexecutorservice.core.size = 2

Following the same reproduction steps, the services recovered within 30-35 minutes of rebooting the hosts.
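That recovery matches the shape of the fix the new properties imply: instead of one shared pool, system (infrastructure) processes and "extra" (user) processes get dedicated executors, so blocked user stacks can no longer starve infra services of worker threads. A hedged sketch of the idea (the dispatch logic is illustrative; the names mirror the properties above, not Cattle's classes):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class SplitBlockingPools {
    private static ExecutorService pool(int core, int max) {
        // Bounded queue is an assumption; workers grow core -> max as it fills.
        return new ThreadPoolExecutor(core, max, 60, TimeUnit.SECONDS,
                                      new LinkedBlockingQueue<>(100));
    }

    // pool.processblockingsystemexecutorservice.{core,max}.size = 2 / 5
    static final ExecutorService SYSTEM = pool(2, 5);
    // pool.processblockingextraexecutorservice.{core,max}.size = 2 / 5
    static final ExecutorService EXTRA = pool(2, 5);

    static Future<?> submit(Runnable process, boolean infra) {
        // Infra (system) processes keep dedicated workers, so a flood of
        // blocked user-stack processes can no longer starve them.
        return (infra ? SYSTEM : EXTRA).submit(process);
    }
}
```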
