Initial etcd leader-election implementation. #20
Conversation
|
/cc @pperalta |
|
EtcdTests contains a test which needs a real server to execute and hence causing the CI check to fail. Do you have any suggestions as how to go about fixing that? I'm planning to write a mock client. |
|
In spring-cloud-etcd I install etcd on travis: https://github.com/spring-cloud-incubator/spring-cloud-etcd/blob/master/.travis.yml |
|
@spencergibb thank you! That made my life easy :) |
| */ | ||
| private void notifyGranted() throws InterruptedException { | ||
| isLeader = true; | ||
| candidate.onGranted(context); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The contract for onGranted states:
Callback method invoked when this candidate is elected leader. Implementations may chose to launch a background thread to perform leadership roles and return immediately. Another option is for implementations to perform all leadership work in the thread invoking this method. In the latter case, the method must respond to thread interrupts by throwing
java.lang.InterruptedException. When the thread is interrupted, this indicates that this candidate is no longer leader.
If the implementation decides to perform its work in an endless loop, this will prevent leaderEventPublisher from publishing the application event (this is an issue with the other implementations; see #21). More critically, this will prevent the leader from renewing the TTL which will cause another process to become leader while this process continues to think that it is still the leader.
To honor the contract, I suggest the following:
- Invoke
candidate.onGrantedin a separate thread. - If leadership is revoked, that thread needs to be interrupted.
This is different from the ZooKeeper and Hazelcast implementations. The major difference is that with both of those, the underlying infrastructure maintains a connection with cluster peers and performs heartbeats; whereas with etcd that responsibility lies with the leader implementation.
Please let me know if you have any questions!
|
Hi Venil, Thank you for your contribution! Have you signed the Individual Contributor Agreement? This is a requirement for us to accept pull requests from contributors: https://support.springsource.com/spring_committer_signup Thanks, |
|
Hi Patrick, Thanks for the quick review. I've just signed the individual contributor agreement. I'll try implementing the candidate.onGranted mechanism as you've suggested; will keep you updated on it. |
events in etcd LeaderInitiator implementation.
|
Hi @pperalta, Apart from the changes you had mentioned, I'm also spawning a separate thread for publishing revocation events. This is to avoid publishing of those revocation events before the GrantNotification execution actually finishes. Please review and let me know if this commit works out. Thanks. Regards, |
blocking candidate implementations.
LeaderInitiator.
|
I've added 2 more tests.
|
|
Hi Venil, I had imagined a design with two threads: etcd leadership thread
worker thread
What do you think? |
|
Hi Patrick, In the design you've specified I would like to know how would leader yielding work? As the worker thread itself would be the interrupting thread, we might not clearly understand when the worker thread would exit, if its a blocking candidate implementation. This is the reason I've kept another thread for interrupting the grantor and awaiting its termination. |
|
I'll go through the designed you've proposed once again and let me see if I manage it without the revoker thread. Thanks, Venil |
|
Hi Venil, I put together a quick prototype of what I was thinking and pushed it to here: Please have a look and let me know what you think. Thanks, |
|
Hi Patrick, Thank you so much for the example code. I've implemented it accordingly now. Regards, |
LeaderInitiator.
|
Hi Patrick, Were you able to verify the final implementation? |
|
Hi Venil, I am still working on reviewing your changes. We had a long holiday weekend in the US so I haven't looked in a few days. I'll update the PR with comments soon! |
|
OK. Thanks Patrick. |
| stop(); | ||
| workerExecutorService.shutdown(); | ||
| leaderExecutorService.shutdown(); | ||
| workerExecutorService.awaitTermination(Long.MAX_VALUE, TimeUnit.DAYS); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need to block the thread until the service has been shut down?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pperalta If the current node was the cluster leader, Initiator would try to delete the candidate key on etcd (via the 'tryDeleteCandidateEntry' method) while shutting down. Now, EtcdClient which issues requests via its internal thread pool also implements the Closable interface, to shut down the said thread pool. When the Spring container is shutting down, EtcdClient's 'close' method gets called, causing it to accept no more requests i.e. if EtcdClient is closed before calling the 'tryDeleteCandidateEntry' method, its etcd request would fail. So, to avoid the etcd call failure, I'm blocking the shutdown thread until the Initiator and Worker threads finish execution i.e. the EtcdClient's 'close' method execution is delayed until these threads exit.
Let me know what you think or if you have any alternate ideas.
Thanks,
Venil
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense. The only thing I would change is to set the timeout to something reasonable (perhaps 30 seconds) so that the thread shutting down the application context doesn't get stuck forever if one of the executor threads gets stuck.
executing candidate's grant callback in etcd leader-election implementation.
|
Hi Patrick, Could you please go through these commits; I've tried and fixed the problems you've highlighted. Thanks, |
| class Worker implements Callable<Void> { | ||
|
|
||
| @Override | ||
| public Void call() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this can be simplified to:
@Override
public Void call() throws InterruptedException {
try {
candidate.onGranted(context);
Thread.sleep(Long.MAX_VALUE);
}
finally {
relinquishLeadership = true;
candidate.onRevoked(context);
}
return null;
}
|
I pushed some suggestions to pperalta@bb6676c for your consideration. It includes changes based on my latest feedback, and it attempts to simplify exception handling. I was also thinking that this project would be an interesting candidate to use the new Spring State Machine project. Since we are tracking multiple states with various |
|
Hi Patrick, I've updated the implementation as per your changes and reduced the thread shutdown wait to 30 seconds. Also, I've modified the etcd configuration properties to accept connect uris in a comma separated fashion instead of a YAML list one. I'm interested in implementing this feature using Spring StateMachine that you mentioned; I'm planning to create a separate PR for it. Should that be fine? |
|
Yes, if you want to try with the state machine that would be great! In the meantime do you want me to merge this PR or do you want me to wait until you get a chance to try the state machine? |
|
I would request you to please merge this PR while I work on the state machine based implementation. Thanks Patrick for bearing with me and all your help :) Regards, |
|
Rebased, squashed, and merged as ed17e15. Thanks Venil! |
Hi,
I've created a etcd implementation for spring-cloud-cluster leader-election along with autoconfiguration. I've also updated the tests and the leaderelection doc. Please pull.