Skip to content
This repository has been archived by the owner on Jan 21, 2022. It is now read-only.

Initial etcd leader-election implementation. #20

Closed
wants to merge 26 commits into from

Conversation

venilnoronha
Copy link
Contributor

Hi,

I've created a etcd implementation for spring-cloud-cluster leader-election along with autoconfiguration. I've also updated the tests and the leaderelection doc. Please pull.

@spencergibb
Copy link
Contributor

/cc @pperalta

@pperalta pperalta self-assigned this Nov 18, 2015
@venilnoronha
Copy link
Contributor Author

EtcdTests contains a test which needs a real server to execute and hence causing the CI check to fail. Do you have any suggestions as how to go about fixing that? I'm planning to write a mock client.

@spencergibb
Copy link
Contributor

In spring-cloud-etcd I install etcd on travis: https://github.com/spring-cloud-incubator/spring-cloud-etcd/blob/master/.travis.yml

@venilnoronha
Copy link
Contributor Author

@spencergibb thank you! That made my life easy :)

*/
private void notifyGranted() throws InterruptedException {
isLeader = true;
candidate.onGranted(context);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The contract for onGranted states:

Callback method invoked when this candidate is elected leader. Implementations may chose to launch a background thread to perform leadership roles and return immediately. Another option is for implementations to perform all leadership work in the thread invoking this method. In the latter case, the method must respond to thread interrupts by throwing java.lang.InterruptedException. When the thread is interrupted, this indicates that this candidate is no longer leader.

If the implementation decides to perform its work in an endless loop, this will prevent leaderEventPublisher from publishing the application event (this is an issue with the other implementations; see #21). More critically, this will prevent the leader from renewing the TTL which will cause another process to become leader while this process continues to think that it is still the leader.

To honor the contract, I suggest the following:

  1. Invoke candidate.onGranted in a separate thread.
  2. If leadership is revoked, that thread needs to be interrupted.

This is different from the ZooKeeper and Hazelcast implementations. The major difference is that with both of those, the underlying infrastructure maintains a connection with cluster peers and performs heartbeats; whereas with etcd that responsibility lies with the leader implementation.

Please let me know if you have any questions!

@pperalta
Copy link
Contributor

Hi Venil,

Thank you for your contribution! Have you signed the Individual Contributor Agreement? This is a requirement for us to accept pull requests from contributors:

https://support.springsource.com/spring_committer_signup

Thanks,
Patrick

@venilnoronha
Copy link
Contributor Author

Hi Patrick,

Thanks for the quick review. I've just signed the individual contributor agreement.

I'll try implementing the candidate.onGranted mechanism as you've suggested; will keep you updated on it.

events in etcd LeaderInitiator implementation.
@venilnoronha
Copy link
Contributor Author

Hi @pperalta,

Apart from the changes you had mentioned, I'm also spawning a separate thread for publishing revocation events. This is to avoid publishing of those revocation events before the GrantNotification execution actually finishes.

Please review and let me know if this commit works out.

Thanks.

Regards,
Venil

@venilnoronha
Copy link
Contributor Author

I've added 2 more tests.

  1. leader yielding
  2. revocation of leadership in blocking candidate implementations

@pperalta
Copy link
Contributor

Hi Venil,

I had imagined a design with two threads:

etcd leadership thread

  • attempts to acquire leadership
  • issues heartbeats
  • manages worker thread
  • raises leader events

worker thread

  • started by etcd leadership thread when leadership granted
  • interrupted by etcd leadership thread when leadership revoked

What do you think?

@venilnoronha
Copy link
Contributor Author

Hi Patrick,

In the design you've specified I would like to know how would leader yielding work?

As the worker thread itself would be the interrupting thread, we might not clearly understand when the worker thread would exit, if its a blocking candidate implementation. This is the reason I've kept another thread for interrupting the grantor and awaiting its termination.

@venilnoronha
Copy link
Contributor Author

I'll go through the designed you've proposed once again and let me see if I manage it without the revoker thread.

Thanks,

Venil

@pperalta
Copy link
Contributor

Hi Venil,

I put together a quick prototype of what I was thinking and pushed it to here:

pperalta@55d1128

Please have a look and let me know what you think.

Thanks,
Patrick

@venilnoronha
Copy link
Contributor Author

Hi Patrick,

Thank you so much for the example code. I've implemented it accordingly now.

Regards,
Venil

@venilnoronha
Copy link
Contributor Author

Hi Patrick,

Were you able to verify the final implementation?

@pperalta
Copy link
Contributor

Hi Venil,

I am still working on reviewing your changes. We had a long holiday weekend in the US so I haven't looked in a few days. I'll update the PR with comments soon!

@venilnoronha
Copy link
Contributor Author

OK.

Thanks Patrick.

stop();
workerExecutorService.shutdown();
leaderExecutorService.shutdown();
workerExecutorService.awaitTermination(Long.MAX_VALUE, TimeUnit.DAYS);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to block the thread until the service has been shut down?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pperalta If the current node was the cluster leader, Initiator would try to delete the candidate key on etcd (via the 'tryDeleteCandidateEntry' method) while shutting down. Now, EtcdClient which issues requests via its internal thread pool also implements the Closable interface, to shut down the said thread pool. When the Spring container is shutting down, EtcdClient's 'close' method gets called, causing it to accept no more requests i.e. if EtcdClient is closed before calling the 'tryDeleteCandidateEntry' method, its etcd request would fail. So, to avoid the etcd call failure, I'm blocking the shutdown thread until the Initiator and Worker threads finish execution i.e. the EtcdClient's 'close' method execution is delayed until these threads exit.

Let me know what you think or if you have any alternate ideas.

Thanks,

Venil

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense. The only thing I would change is to set the timeout to something reasonable (perhaps 30 seconds) so that the thread shutting down the application context doesn't get stuck forever if one of the executor threads gets stuck.

@venilnoronha
Copy link
Contributor Author

Hi Patrick,

Could you please go through these commits; I've tried and fixed the problems you've highlighted.

Thanks,
Venil

class Worker implements Callable<Void> {

@Override
public Void call() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can be simplified to:

@Override
public Void call() throws InterruptedException {
    try {
        candidate.onGranted(context);
        Thread.sleep(Long.MAX_VALUE);
    }
    finally {
        relinquishLeadership = true;
        candidate.onRevoked(context);
    }
    return null;
}

@pperalta
Copy link
Contributor

pperalta commented Dec 7, 2015

I pushed some suggestions to pperalta@bb6676c for your consideration. It includes changes based on my latest feedback, and it attempts to simplify exception handling.

I was also thinking that this project would be an interesting candidate to use the new Spring State Machine project. Since we are tracking multiple states with various if then else statements, using a state machine could simplify the implementation. This is your PR so there is no obligation, but I thought that I'd make the suggestion in case you're interested. 😊

@venilnoronha
Copy link
Contributor Author

Hi Patrick,

I've updated the implementation as per your changes and reduced the thread shutdown wait to 30 seconds. Also, I've modified the etcd configuration properties to accept connect uris in a comma separated fashion instead of a YAML list one.

I'm interested in implementing this feature using Spring StateMachine that you mentioned; I'm planning to create a separate PR for it. Should that be fine?

@pperalta
Copy link
Contributor

pperalta commented Dec 9, 2015

Yes, if you want to try with the state machine that would be great! In the meantime do you want me to merge this PR or do you want me to wait until you get a chance to try the state machine?

@venilnoronha
Copy link
Contributor Author

I would request you to please merge this PR while I work on the state machine based implementation.

Thanks Patrick for bearing with me and all your help :)

Regards,
Venil

@pperalta
Copy link
Contributor

pperalta commented Dec 9, 2015

Rebased, squashed, and merged as ed17e15. Thanks Venil!

@pperalta pperalta closed this Dec 9, 2015
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants