Fabric8 leader election #1658
Conversation
// @formatter:off
@ConfigurationProperties("spring.cloud.kubernetes.leader.election")
public record LeaderElectionProperties(
the properties needed in order to configure the new leader election:

- `waitForPodReady` - should we wait for the readiness of the pod before we even trigger the leader election process
- `publishEvents` - should we publish events (`ApplicationEvent`) when the state of leaders changes. We do this in the current implementation, so I added it here also
- `leaseDuration` - TTL of the lease if, for example, the leader dies. No other leader candidate can acquire the lease unless this one expires.
- `lockNamespace` - where to create the "lock" (this is either a lease or a config map)
- `lockName` - the name of the lease or configmap
- `renewDeadline` - once the lock is acquired and we are the current leader, we try to "extend" the lease. We must extend it within this deadline.
- `retryPeriod` - how often to retry when trying to get the lock in order to become the leader. In our current code, this is what we use in `LeaderInitiator::start`, more exactly in `scheduleAtFixedRate`.
I'll try to explain this a bit more verbosely.
The process internally in fabric8 is something like this:

1. First try to acquire the lock (the lock is either a configmap or a lease), and by "acquire" I mean write to it (or to its annotations, for a configmap). Whoever writes first becomes the leader (all others will get a 409).
2. All leader candidates that are not leaders will continue to spin forever until they get a chance to become the leader. They retry every `retryPeriod`.
3. The current leader, after it establishes itself as one, will spin forever too, but will try to extend its leadership. It extends it by updating the entries in the lease; the one we care about specifically is `renewTime`, which is updated every `retryPeriod`. For example, every 2 seconds (`retryPeriod`), it will update its `renewTime` with "now".
4. All other, non-leader, candidates are spinning and check a few things in each cycle:
   - "Am I the leader?" If the answer is no, they go to (2).
   - "Can I become the leader?" This is answered by looking at:
     `now().isAfter(leaderElectionRecord.getRenewTime().plus(leaderElectionConfig.getLeaseDuration()))`
     So they can only try to acquire the leadership once `renewTime` (when the last renewal happened) plus `leaseDuration` (basically a TTL) has expired.

As such, `leaseDuration` acts as a TTL, if that makes sense. But that means that no one will be able to even try to acquire the lock until that `leaseDuration` expires, and that is OK for the cases when the pod dies or is killed.
But in case of a graceful shutdown (and I implemented this via `CompletableFuture::cancel`, because this is what fabric8 expects), there is code that fabric8 will trigger to "reset" the lease: they will set the `renewTime` to "now" and `leaseDuration` to 1 second.
@Bean
@ConditionalOnMissingBean
LeaderElectionConfig fabric8LeaderElectionConfig(LeaderElectionProperties properties, Lock lock,
this is the configuration that must be provided to fabric8, populated with reasonable defaults
@Bean
@ConditionalOnMissingBean
Lock lock(KubernetesClient fabric8KubernetesClient, LeaderElectionProperties properties, String holderIdentity) {
if `lease` is available on the cluster, use it as the lock implementation. If not, use `configmap`
@Bean
@ConditionalOnMissingBean
Fabric8LeaderElectionInitiator fabric8LeaderElectionInitiator(String holderIdentity, String podNamespace,
the "initiator" of leader election
final class Fabric8LeaderElectionCallbacks extends LeaderCallbacks {

	Fabric8LeaderElectionCallbacks(Runnable onStartLeading, Runnable onStopLeading, Consumer<String> onNewLeader) {
encapsulates the callbacks that fabric8 offers to us
@Bean
String holderIdentity() throws UnknownHostException {
	String podHostName = LeaderUtils.hostName();
we use the pod name as the holder identity
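A minimal sketch of that bean's body (assuming, as Kubernetes does by default, that a pod's hostname equals the pod name; `LeaderUtils.hostName()` in the PR presumably does something similar):

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Sketch: derive the holder identity from the host name. In a Kubernetes pod,
// the hostname defaults to the pod name, so this identifies the candidate.
final class HolderIdentity {

	static String holderIdentity() throws UnknownHostException {
		return InetAddress.getLocalHost().getHostName();
	}
}
```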
/**
 * @author wind57
 */
@ExtendWith(OutputCaptureExtension.class)
I know that we can't really add more integration tests due to the already long build time, so this is as close as I can get to an integration test.
CompletableFuture<Void> podReadyFuture = new CompletableFuture<>();

// wait until pod is ready
if (leaderElectionProperties.waitForPodReady()) {
First we wait for the pod to be ready. For that, we schedule a `Runnable` via `scheduler.scheduleWithFixedDelay`, where `scheduler` is a `CachedSingleThreadScheduler`, which I borrowed from fabric8 and which is used in leader election in their implementation.
This `CachedSingleThreadScheduler` is pretty interesting: it has a single daemon thread, but in our case two Runnables that it executes. The first one is the one we submit; the second one is an internal one that checks if there is anything else pending to do (by looking at its inner queue).
So at any point in time, the queue that this executor uses has either two tasks (the one we submit + the internal one), or just one (if either of the two is currently executing). What it does is this: in its own internal `Runnable`, it checks how many items its internal queue has. If it's one, it means something was submitted by end users; if it's zero, it means nothing was submitted (or the task finished/was canceled) and it can shut itself down.
How I use it: I submit a `Runnable` to such an executor that checks every second if the pod is ready. When it is ready, it completes:
CompletableFuture<Void> podReadyFuture = new CompletableFuture<>();
....
podReadyFuture.complete(null)
This completion matters because later in the code, I do this:
// wait in a different thread until the pod is ready
// and in the same thread start the leader election
executorService.get().submit(() -> {
try {
if (leaderElectionProperties.waitForPodReady()) {
CompletableFuture<?> ready = podReadyFuture
.whenComplete((x, y) -> scheduledFuture.get().cancel(true));
ready.get();
}
leaderFuture.set(leaderElector(leaderElectionConfig, fabric8KubernetesClient).start());
leaderFuture.get();
}
catch (Exception e) {
if (e instanceof CancellationException) {
LOG.warn(() -> "leaderFuture was canceled");
}
throw new RuntimeException(e);
}
});
Let's break it down a bit:

1. `CompletableFuture<?> ready = podReadyFuture.whenComplete((x, y) -> scheduledFuture.get().cancel(true));`
   When the pod is ready, cancel the scheduled `Runnable`. This means that the executor will also stop, since only the internal "shutdown" task would be left inside it.
2. Block until the pod is ready: `ready.get()`
3. Once the pod is ready, kick off leader election:
   `leaderFuture.set(leaderElector(leaderElectionConfig, fabric8KubernetesClient).start());`
   `leaderFuture.get();`
   This will "hang" for as long as we are the leader or are trying to acquire the leadership.
@PreDestroy
void preDestroy() {
	LOG.info(() -> "preDestroy called in the leader initiator");
start cleaning up after a graceful shutdown
I've added some explanation on the code, but I might be biased since I "took apart" the entire fabric8 leader election and tried to understand it in its entire beauty. I've also contributed there with some minor PRs, so that I stay in good connection with the code. Though we might not add integration tests for obvious reasons, I did test lots of scenarios in a separate project that I will post on my github page so that in the future it would be easy to debug any issues. Documentation is pending, but in order to write it properly, I need to know what direction you see this going. I'll keep this one in sync with future merges from
}

@Bean
String podNamespace() {
How about adding one more namespace-inferring rule from the service account path before falling back to the ENV?
In most use cases, the pod should have it injected by default configuration.
Fabric8's KubernetesClient#getNamespace() already does this, see https://github.com/fabric8io/kubernetes-client/blob/main/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/Config.java#L915.
public static final String KUBERNETES_NAMESPACE_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/namespace";
indeed! good to have you around Vic!
@ryanjbaxter I'm proposing a new implementation of fabric8 leader election that is native in the library itself. For the time being I think that users should be left to opt in, and then in the long run maybe have this one as the only implementation... thank you for looking into it
[source]
----
spring.cloud.kubernetes.leader.enabled=false
Could we disable this if the new one is enabled, so folks don't need to set 2 properties?
spring.cloud.kubernetes.leader.election.lockName=other-name
----
The namespace (`default` if no explicit one is set) can also be configured:
Shouldn't the namespace be the namespace the pod is running in?
Once a certain pod establishes itself as the leader (by acquiring the lock), it will continuously (every `spring.cloud.kubernetes.leader.election.retryPeriod`) try to renew its lease; in other words, it will try to extend its leadership. When a renewal happens, the "record" stored inside the lock is updated. For example, `renewTime` is updated inside the record to denote when the last renewal happened. (You can always peek at these fields by using `kubectl describe lease ...`, for example.)
Renewal must happen within a certain interval, specified by `spring.cloud.kubernetes.leader.election.renewDeadline`. By default it is equal to 10 seconds, which means the leader pod has a maximum of 10 seconds to renew its leadership. If that does not happen, this pod is taken out of the leader election process and never participates again (unless you refresh the Spring context or restart the pod).
It can never participate again, meaning the pod can no longer become the leader? Why?