keeper: "our keeper requested role is not available" #146

Closed
wchrisjohnson opened this issue Sep 4, 2016 · 5 comments

wchrisjohnson commented Sep 4, 2016

It appears that when you attempt to start up a stolon cluster that requires two or more keepers, there is a distinct possibility that the cluster will not come up due to a deadlock in electing a leader/master.

I've been trying to start a stolon cluster with 1 proxy, 1 sentinel, and 2 or 3 keepers. The underlying environment is Docker-based. All of the containers are started at the same time, and all configuration of the containers is done via environment variables, not stolonctl.
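For context, the container configuration looks roughly like this (a minimal sketch; it assumes stolon's convention of mapping command-line flags to ST*-prefixed environment variables, and the specific variables and values shown are illustrative, not the actual setup):

```bash
# Illustrative keeper container environment (names/values are assumptions,
# based on stolon's flag-to-STKEEPER_* environment variable mapping).
export STKEEPER_CLUSTER_NAME=stolon-cluster
export STKEEPER_STORE_BACKEND=etcd                      # assumed store backend
export STKEEPER_PG_SU_PASSWORDFILE=/etc/secrets/stolon  # matches the path in the logs below
exec stolon-keeper
```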

The following is a very typical snapshot of the env (logs):

KEEPER #1

2016-09-04 19:17:29.672906 [keeper.go:997] W | keeper: superuser name and replication user name are the same. Different users are suggested.
2016-09-04 19:17:29.673004 [keeper.go:125] W | keeper: password file /etc/secrets/stolon permissions 01000000777 are too open. This file should only be readable to the user executing stolon! Continuing...
2016-09-04 19:17:29.673060 [keeper.go:125] W | keeper: password file /etc/secrets/stolon permissions 01000000777 are too open. This file should only be readable to the user executing stolon! Continuing...
2016-09-04 19:17:29.673122 [keeper.go:1042] I | keeper: generated id: a209cacc
2016-09-04 19:17:29.673177 [keeper.go:1049] I | keeper: id: a209cacc
2016-09-04 19:17:29.673191 [keeper.go:1052] I | keeper: running under kubernetes.
2016-09-04 19:17:39.679676 [postgresql.go:177] I | postgresql: Stopping database
2016-09-04 19:17:39.690762 [keeper.go:620] I | keeper: Initializing database
2016-09-04 19:17:42.928930 [postgresql.go:127] I | postgresql: Setting required accesses to pg_hba.conf
2016-09-04 19:17:42.929011 [postgresql.go:158] I | postgresql: Starting database
2016-09-04 19:17:43.945958 [postgresql.go:138] I | postgresql: Setting roles
2016-09-04 19:17:43.946005 [postgresql.go:245] I | postgresql: Adding replication role to superuser
2016-09-04 19:17:43.950187 [postgresql.go:177] I | postgresql: Stopping database
2016-09-04 19:17:44.964989 [keeper.go:648] I | keeper: current pg state: master

2016-09-04 19:17:44.965029 [keeper.go:660] I | keeper: our keeper requested role is not available

2016-09-04 19:17:44.965037 [postgresql.go:158] I | postgresql: Starting database
2016-09-04 19:17:50.999407 [keeper.go:648] I | keeper: current pg state: master
2016-09-04 19:17:50.999506 [keeper.go:660] I | keeper: our keeper requested role is not available
2016-09-04 19:17:56.012936 [keeper.go:648] I | keeper: current pg state: master
2016-09-04 19:17:56.013023 [keeper.go:660] I | keeper: our keeper requested role is not available
2016-09-04 19:18:01.027622 [keeper.go:648] I | keeper: current pg state: master
2016-09-04 19:18:01.027805 [keeper.go:660] I | keeper: our keeper requested role is not available
2016-09-04 19:18:06.044353 [keeper.go:648] I | keeper: current pg state: master

KEEPER #2

2016-09-04 19:17:12.683218 [keeper.go:997] W | keeper: superuser name and replication user name are the same. Different users are suggested.
2016-09-04 19:17:12.683297 [keeper.go:125] W | keeper: password file /etc/secrets/stolon permissions 01000000777 are too open. This file should only be readable to the user executing stolon! Continuing...
2016-09-04 19:17:12.683332 [keeper.go:125] W | keeper: password file /etc/secrets/stolon permissions 01000000777 are too open. This file should only be readable to the user executing stolon! Continuing...
2016-09-04 19:17:12.683591 [keeper.go:1042] I | keeper: generated id: 097ec11b
2016-09-04 19:17:12.683680 [keeper.go:1049] I | keeper: id: 097ec11b
2016-09-04 19:17:12.683695 [keeper.go:1052] I | keeper: running under kubernetes.
2016-09-04 19:17:22.708022 [postgresql.go:177] I | postgresql: Stopping database
2016-09-04 19:17:22.786089 [keeper.go:620] I | keeper: Initializing database
2016-09-04 19:17:32.169408 [postgresql.go:127] I | postgresql: Setting required accesses to pg_hba.conf
2016-09-04 19:17:32.169552 [postgresql.go:158] I | postgresql: Starting database
2016-09-04 19:17:33.663659 [postgresql.go:138] I | postgresql: Setting roles
2016-09-04 19:17:33.663715 [postgresql.go:245] I | postgresql: Adding replication role to superuser
2016-09-04 19:17:33.677821 [postgresql.go:177] I | postgresql: Stopping database
2016-09-04 19:17:34.690691 [keeper.go:648] I | keeper: current pg state: master

2016-09-04 19:17:34.690865 [keeper.go:660] I | keeper: our keeper requested role is not available

2016-09-04 19:17:34.690887 [postgresql.go:158] I | postgresql: Starting database
2016-09-04 19:17:41.135703 [keeper.go:648] I | keeper: current pg state: master
2016-09-04 19:17:41.135803 [keeper.go:660] I | keeper: our keeper requested role is not available
2016-09-04 19:17:46.311731 [keeper.go:648] I | keeper: current pg state: master
2016-09-04 19:17:46.311814 [keeper.go:660] I | keeper: our keeper requested role is not available
2016-09-04 19:17:51.419087 [keeper.go:648] I | keeper: current pg state: master
2016-09-04 19:17:51.419115 [keeper.go:660] I | keeper: our keeper requested role is not available
2016-09-04 19:17:56.432315 [keeper.go:648] I | keeper: current pg state: master
2016-09-04 19:17:56.432512 [keeper.go:660] I | keeper: our keeper requested role is not available
2016-09-04 19:18:01.453198 [keeper.go:648] I | keeper: current pg state: master

KEEPER #3

2016-09-04 19:17:12.619702 [keeper.go:997] W | keeper: superuser name and replication user name are the same. Different users are suggested.
2016-09-04 19:17:12.619889 [keeper.go:125] W | keeper: password file /etc/secrets/stolon permissions 01000000777 are too open. This file should only be readable to the user executing stolon! Continuing...
2016-09-04 19:17:12.619957 [keeper.go:125] W | keeper: password file /etc/secrets/stolon permissions 01000000777 are too open. This file should only be readable to the user executing stolon! Continuing...
2016-09-04 19:17:12.620002 [keeper.go:1042] I | keeper: generated id: 8e5b50f3
2016-09-04 19:17:12.620123 [keeper.go:1049] I | keeper: id: 8e5b50f3
2016-09-04 19:17:12.620137 [keeper.go:1052] I | keeper: running under kubernetes.
2016-09-04 19:17:22.636370 [postgresql.go:177] I | postgresql: Stopping database
2016-09-04 19:17:22.653779 [keeper.go:620] I | keeper: Initializing database
2016-09-04 19:17:32.168523 [postgresql.go:127] I | postgresql: Setting required accesses to pg_hba.conf
2016-09-04 19:17:32.168907 [postgresql.go:158] I | postgresql: Starting database
2016-09-04 19:17:33.661895 [postgresql.go:138] I | postgresql: Setting roles
2016-09-04 19:17:33.662090 [postgresql.go:245] I | postgresql: Adding replication role to superuser
2016-09-04 19:17:33.678542 [postgresql.go:177] I | postgresql: Stopping database
2016-09-04 19:17:34.692113 [keeper.go:648] I | keeper: current pg state: master

2016-09-04 19:17:34.692304 [keeper.go:660] I | keeper: our keeper requested role is not available

2016-09-04 19:17:34.692342 [postgresql.go:158] I | postgresql: Starting database
2016-09-04 19:17:41.134165 [keeper.go:648] I | keeper: current pg state: master
2016-09-04 19:17:41.134333 [keeper.go:660] I | keeper: our keeper requested role is not available
2016-09-04 19:17:46.311174 [keeper.go:648] I | keeper: current pg state: master
2016-09-04 19:17:46.311212 [keeper.go:660] I | keeper: our keeper requested role is not available
2016-09-04 19:17:51.418279 [keeper.go:648] I | keeper: current pg state: master
2016-09-04 19:17:51.418317 [keeper.go:660] I | keeper: our keeper requested role is not available
2016-09-04 19:17:56.432622 [keeper.go:648] I | keeper: current pg state: master
2016-09-04 19:17:56.432664 [keeper.go:660] I | keeper: our keeper requested role is not available
2016-09-04 19:18:01.448981 [keeper.go:648] I | keeper: current pg state: master

SENTINEL

2016-09-04 19:16:59.149493 [sentinel.go:870] I | sentinel: id: fd5d7ce1
2016-09-04 19:16:59.166822 [sentinel.go:94] I | sentinel: Trying to acquire sentinels leadership
2016-09-04 19:17:09.188373 [sentinel.go:104] I | sentinel: sentinel leadership acquired
2016-09-04 19:17:09.192785 [sentinel.go:797] I | sentinel: Initializing cluster
2016-09-04 19:17:14.207201 [sentinel.go:816] E | sentinel: failed to update clusterView: cannot choose initial master, no keepers registered
2016-09-04 19:17:19.217947 [sentinel.go:816] E | sentinel: failed to update clusterView: cannot choose initial master, no keepers registered
2016-09-04 19:17:24.323122 [sentinel.go:816] E | sentinel: failed to update clusterView: cannot choose initial master, more than 1 keeper registered
2016-09-04 19:17:29.417230 [sentinel.go:816] E | sentinel: failed to update clusterView: cannot choose initial master, more than 1 keeper registered
2016-09-04 19:17:34.430276 [sentinel.go:816] E | sentinel: failed to update clusterView: cannot choose initial master, more than 1 keeper registered
2016-09-04 19:17:39.443094 [sentinel.go:816] E | sentinel: failed to update clusterView: cannot choose initial master, more than 1 keeper registered
2016-09-04 19:17:44.459136 [sentinel.go:816] E | sentinel: failed to update clusterView: cannot choose initial master, more than 1 keeper registered
2016-09-04 19:17:49.482514 [sentinel.go:816] E | sentinel: failed to update clusterView: cannot choose initial master, more than 1 keeper registered

PROXY

2016-09-04 19:18:31.656789 [proxy.go:258] I | proxy: id: c66ab4ca
2016-09-04 19:18:31.684417 [proxy.go:107] I | proxy: Starting proxying
2016-09-04 19:18:31.684606 [proxy.go:188] I | proxy: no proxyconf available, closing connections to previous master
2016-09-04 19:18:36.705424 [proxy.go:188] I | proxy: no proxyconf available, closing connections to previous master
2016-09-04 19:18:41.738874 [proxy.go:188] I | proxy: no proxyconf available, closing connections to previous master
2016-09-04 19:18:46.748185 [proxy.go:188] I | proxy: no proxyconf available, closing connections to previous master
2016-09-04 19:18:51.756402 [proxy.go:188] I | proxy: no proxyconf available, closing connections to previous master
2016-09-04 19:18:56.764637 [proxy.go:188] I | proxy: no proxyconf available, closing connections to previous master

As you can see, two of the keepers (here Keeper #2 and Keeper #3) both try to acquire the leader/master role at the very same HH:MM:SS. It appears that this causes some sort of deadlock.

Possible solutions?

  1. Make the keeper leader determination finer-grained, e.g. break ties at millisecond granularity.
  2. Introduce a random delay of, say, 0 to 15/20/30 seconds into the election of the leader of each node type (keeper, sentinel, proxy), or of all node types, via a new ENV VAR.
  3. Introduce a random delay of, say, 0 to 15/20/30 seconds into the startup of each node or node type (could be done via a new ENV VAR).
  4. Another option?

I'd be willing to offer a PR for one of the above if we could come to an agreement on the preferred approach.

wchrisjohnson changed the title from "Bug: "our keeper requested role is not available" - URGENT" to "keeper: "our keeper requested role is not available" - URGENT" on Sep 4, 2016
wchrisjohnson changed the title from "keeper: "our keeper requested role is not available" - URGENT" to "keeper: "our keeper requested role is not available"" on Sep 4, 2016
wchrisjohnson commented:

@sgotti any thoughts on the above?

wchrisjohnson commented:

More background after further testing: given my setup of 1 sentinel, 1 proxy, and 3 keepers, I have the ability to start the sentinel first. I'm currently starting the three keeper nodes such that each one sleeps for a random 1-60 seconds before it starts (see the sketch below). I do not currently have the ability to control the order of these three nodes. I delay the start of the proxy by 5 minutes, so I can be sure it gets started last.
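The staggering looks roughly like this (a minimal sketch of a wrapper script; illustrative, not the exact setup):

```bash
#!/bin/bash
# Illustrative startup jitter: sleep a random 1-60 seconds before launching
# the keeper so the keepers don't all register at the same instant.
sleep $(( (RANDOM % 60) + 1 ))
exec stolon-keeper "$@"
```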

It would appear that the very fact that more than one keeper is being started is the hangup. I cannot seem to start more than one keeper node and consistently have the cluster come up.

If I HAVE to start the three keepers individually, I may be able to do so. I was hoping to avoid that sort of approach.


sgotti commented Sep 6, 2016

@wchrisjohnson this is the intended behavior.

When the master sentinel has to initialize the cluster for the first time and there is more than one registered keeper, the sentinel cannot make any assumption about which keeper to choose as master.

If you're starting from an empty db (so choosing any keeper as master will be ok), you can just set an initial cluster config (exec the sentinels with --initial-cluster-config=initconfig.json) with the init_with_multiple_keepers option set to true (see https://github.com/sorintlab/stolon/blob/master/doc/cluster_config.md#configuration-format).
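For example, a minimal initconfig.json for this case could look like the following sketch (only the init_with_multiple_keepers option from the linked doc is set; the file name and the sentinel invocation are illustrative, and your usual cluster/store flags still apply):

```bash
# Minimal sketch: allow the sentinel to pick an initial master even when
# multiple keepers are already registered (safe only for an empty db).
cat > initconfig.json <<'EOF'
{
  "init_with_multiple_keepers": true
}
EOF

stolon-sentinel --initial-cluster-config=initconfig.json  # plus the usual flags
```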

This option is not the default since a sane config should not make any assumptions.

wchrisjohnson commented:

@sgotti sounds good - I'll try that. FWIW, it would be useful to be able to pass that cluster config via an env var rather than having to read a file from the filesystem...
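In the meantime, something like this entrypoint wrapper could bridge the gap (a hypothetical sketch; STOLON_INITIAL_CLUSTER_CONFIG is an invented variable name, not an existing stolon feature):

```bash
#!/bin/bash
# Hypothetical workaround: materialize the initial cluster config from an
# env var into a file, then hand it to the sentinel via the existing flag.
printf '%s' "$STOLON_INITIAL_CLUSTER_CONFIG" > /tmp/initconfig.json
exec stolon-sentinel --initial-cluster-config=/tmp/initconfig.json "$@"
```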

wchrisjohnson commented:

This appears to be working - thanks @sgotti!
