
Fabric8 leader election #1658

Open
wants to merge 193 commits into main

Conversation

@wind57 (Contributor) commented May 28, 2024

No description provided.

wind57 and others added 30 commits December 4, 2021 07:59
*/
// @formatter:off
@ConfigurationProperties("spring.cloud.kubernetes.leader.election")
public record LeaderElectionProperties(
@wind57 (Contributor, Author) May 29, 2024

The properties needed to configure the new leader election (a sketch of how they might map onto the record is shown after this list):

  • waitForPodReady - should we wait for the pod to be ready before we even trigger the leader election process
  • publishEvents - should we publish events (ApplicationEvent) when the state of leaders changes. We do this in the current implementation, so I added it here also
  • leaseDuration - TTL of the lease. If, for example, the leader dies, no other leader candidate can acquire the lease until this one expires.
  • lockNamespace - where to create the "lock" (either a lease or a config map)
  • lockName - the name of the lease or configmap
  • renewDeadline - once the lock is acquired and we are the current leader, we try to "extend" the lease; we must extend it within this deadline
  • retryPeriod - how often to retry when trying to get the lock in order to become the leader.
    In our current code, this is what we use in LeaderInitiator::start, more exactly in scheduleAtFixedRate
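
A minimal sketch of how these could map onto the record shown above (types, ordering and comments are my assumptions; the actual record in this PR may differ):

	import java.time.Duration;
	import org.springframework.boot.context.properties.ConfigurationProperties;

	@ConfigurationProperties("spring.cloud.kubernetes.leader.election")
	public record LeaderElectionProperties(
			boolean waitForPodReady,    // wait for pod readiness before starting the election
			boolean publishEvents,      // publish an ApplicationEvent when leadership changes
			Duration leaseDuration,     // TTL of the lease
			String lockNamespace,       // namespace where the lease/configmap lock lives
			String lockName,            // name of the lease/configmap
			Duration renewDeadline,     // deadline within which the current leader must renew
			Duration retryPeriod        // how often candidates retry acquiring the lock
	) {
	}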

I'll try to explain this in a bit more detail.

The process internally in fabric8 is something like this:

  • first try to acquire the lock (the lock is either a configmap or a lease); by "acquire" I mean write to it (or to its annotations, for a configmap). Whoever writes first becomes the leader (all others get a 409).
  • All leader candidates that are not the leader will continue to spin forever until they get a chance to become the leader. They retry every retryPeriod.
  • The current leader, after it establishes itself as one, will spin forever too, but will try to extend its leadership. It extends it by updating the entries in the lease; specifically, the one we care about is renewTime. It is updated every retryPeriod. For example, every 2 seconds (retryPeriod), it will update its renewTime to "now".
  • All other, non-leader candidates keep spinning and check a few things in each cycle:
  1. "am I the leader?" If the answer is no, they go to (2)
  2. "can I become the leader?" This is answered by looking at:
now().isAfter(leaderElectionRecord.getRenewTime().plus(leaderElectionConfig.getLeaseDuration()))

So they can only try to acquire leadership once renewTime (when the last renewal happened) plus leaseDuration (basically a TTL) has passed.

As such, leaseDuration acts as a TTL, if that makes sense. That means no one will even be able to try to acquire the lock until that leaseDuration expires, which is OK for the cases when the pod dies or is killed.

But in case of a graceful shutdown (and I implemented this via CompletableFuture::cancel because this is what fabric8 expects), there is code that fabric8 will trigger to "reset" the lease: it will set the renewTime to "now" and leaseDuration to 1 second.


@Bean
@ConditionalOnMissingBean
LeaderElectionConfig fabric8LeaderElectionConfig(LeaderElectionProperties properties, Lock lock,
@wind57 (Contributor, Author):

this is the configuration that must be provided to fabric8, populated with reasonable defaults
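
A rough sketch of how that configuration might be assembled with fabric8's LeaderElectionConfigBuilder (the name and the releaseOnCancel flag are my assumptions, not necessarily what this PR does):

	LeaderElectionConfig config = new LeaderElectionConfigBuilder()
			.withName(properties.lockName())                 // assumed: reuse the lock name
			.withLock(lock)                                  // LeaseLock or ConfigMapLock, see below
			.withLeaseDuration(properties.leaseDuration())
			.withRenewDeadline(properties.renewDeadline())
			.withRetryPeriod(properties.retryPeriod())
			.withLeaderCallbacks(callbacks)                  // Fabric8LeaderElectionCallbacks, see below
			.withReleaseOnCancel(true)                       // assumed: release the lease when cancelled
			.build();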


@Bean
@ConditionalOnMissingBean
Lock lock(KubernetesClient fabric8KubernetesClient, LeaderElectionProperties properties, String holderIdentity) {
@wind57 (Contributor, Author):

if a lease is available on the cluster, use it as the lock implementation; if not, use a configmap
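
One possible shape of that choice (the supports(...) check is just one way to detect the Lease API; the PR may detect it differently):

	Lock lock(KubernetesClient client, LeaderElectionProperties properties, String holderIdentity) {
		// io.fabric8.kubernetes.api.model.coordination.v1.Lease
		boolean leaseSupported = client.supports(Lease.class);
		return leaseSupported
				? new LeaseLock(properties.lockNamespace(), properties.lockName(), holderIdentity)
				: new ConfigMapLock(properties.lockNamespace(), properties.lockName(), holderIdentity);
	}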


@Bean
@ConditionalOnMissingBean
Fabric8LeaderElectionInitiator fabric8LeaderElectionInitiator(String holderIdentity, String podNamespace,
@wind57 (Contributor, Author):

the "initiator" of leader election

*/
final class Fabric8LeaderElectionCallbacks extends LeaderCallbacks {

Fabric8LeaderElectionCallbacks(Runnable onStartLeading, Runnable onStopLeading, Consumer<String> onNewLeader) {
@wind57 (Contributor, Author):

encapsulates the callbacks that fabric8 offers to us
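
For illustration, the three callbacks could be wired like this (logging only; the real class presumably publishes ApplicationEvents when publishEvents is enabled):

	Fabric8LeaderElectionCallbacks callbacks = new Fabric8LeaderElectionCallbacks(
			() -> System.out.println("started leading"),                    // onStartLeading
			() -> System.out.println("stopped leading"),                    // onStopLeading
			newLeader -> System.out.println("new leader: " + newLeader));   // onNewLeader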


@Bean
String holderIdentity() throws UnknownHostException {
String podHostName = LeaderUtils.hostName();
@wind57 (Contributor, Author):

we use the pod name as the holder identity
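
Inside a container the hostname equals the pod name, so this presumably boils down to something like the following (a guess, suggested by the throws UnknownHostException in the signature):

	String podHostName = InetAddress.getLocalHost().getHostName();   // java.net.InetAddress; the pod name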

/**
* @author wind57
*/
@ExtendWith(OutputCaptureExtension.class)
@wind57 (Contributor, Author):

I know that we can't really add more integration tests due to the already long build time, so this is as close as I can get to an integration test.

CompletableFuture<Void> podReadyFuture = new CompletableFuture<>();

// wait until pod is ready
if (leaderElectionProperties.waitForPodReady()) {
@wind57 (Contributor, Author):

first we wait for the pod to be ready. For that, we schedule a Runnable via scheduler.scheduleWithFixedDelay, where scheduler is a CachedSingleThreadScheduler, which I borrowed from fabric8 (it is used in leader election in their implementation).

This CachedSingleThreadScheduler is pretty interesting: it has a single daemon thread, but in our case two Runnables that it executes. The first one is the one we submit; the second one is an internal one that checks if there is anything else pending to do (by looking at its inner queue).

So at any point in time, the queue that this executor uses has either two tasks (the one we submit + the internal one), or just one (if either of the two is currently executing). What it does is this: in its own internal Runnable, it checks how many items its internal queue has. If it's one, it means something was submitted by end users; if it's zero, it means nothing was submitted (or it finished/was cancelled) and it can shut itself down.


How I use it: I submit a Runnable to such an executor that checks every second whether the pod is ready. When it is, it completes the future:

CompletableFuture<Void> podReadyFuture = new CompletableFuture<>();
....
podReadyFuture.complete(null);
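
Roughly, the scheduled Runnable could look like this (the use of Readiness.isPodReady and the exact pod lookup are my assumptions about the surrounding code):

	scheduledFuture.set(scheduler.scheduleWithFixedDelay(() -> {
		// io.fabric8.kubernetes.client.readiness.Readiness
		Pod pod = fabric8KubernetesClient.pods().inNamespace(podNamespace).withName(holderIdentity).get();
		if (pod != null && Readiness.isPodReady(pod)) {
			podReadyFuture.complete(null);
		}
	}, 0, 1, TimeUnit.SECONDS));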

This completion matters because later in the code, I do this:

		// wait in a different thread until the pod is ready
		// and in the same thread start the leader election
		executorService.get().submit(() -> {
			try {
				if (leaderElectionProperties.waitForPodReady()) {
					CompletableFuture<?> ready = podReadyFuture
							.whenComplete((x, y) -> scheduledFuture.get().cancel(true));
					ready.get();
				}
				leaderFuture.set(leaderElector(leaderElectionConfig, fabric8KubernetesClient).start());
				leaderFuture.get();
			}
			catch (Exception e) {
				if (e instanceof CancellationException) {
					LOG.warn(() -> "leaderFuture was canceled");
				}
				throw new RuntimeException(e);
			}
		});

Let's break it down a bit:

  • CompletableFuture<?> ready = podReadyFuture.whenComplete((x, y) -> scheduledFuture.get().cancel(true));

When the pod is ready, cancel the Runnable that was scheduled. This means the executor will also stop, since only the internal "shutdown" task would be left inside.

  • Block until the pod is ready: ready.get()

  • Once the pod is ready, kick off leader election:

leaderFuture.set(leaderElector(leaderElectionConfig, fabric8KubernetesClient).start());
leaderFuture.get();

This will "hang" for as long as we are the leader or try to acquire the leadership.


@PreDestroy
void preDestroy() {
LOG.info(() -> "preDestroy called in the leader initiator");
@wind57 (Contributor, Author):

start cleaning up after a graceful shutdown
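
Presumably the cleanup continues along these lines (my guess at the method body, based on the fields seen in the snippets above; cancelling the leader future is what makes fabric8 release/"reset" the lease, as described earlier):

	if (scheduledFuture.get() != null) {
		scheduledFuture.get().cancel(true);      // stop the pod-readiness polling, if it is still running
	}
	if (leaderFuture.get() != null) {
		leaderFuture.get().cancel(true);         // fabric8 reacts to the cancellation by resetting the lease
	}
	executorService.get().shutdownNow();         // stop the thread that runs the election loop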

@wind57 (Contributor, Author) commented May 29, 2024

I've added some explanation on the code, but I might be biased since I "took apart" the entire fabric8 leader election and tried to understand it in its entirety. I've also contributed some minor PRs there, so that I stay well connected with the code.

Though we might not add integration tests for obvious reasons, I did test lots of scenarios in a separate project that I will post on my github page, so that in the future it will be easy to debug any issues.

Documentation is still pending, but in order to write it properly, I need to know what direction you see this going. I'll keep this one in sync with future merges from main, until a decision is made (if any, of course :) )

}

@Bean
String podNamespace() {
(reviewer):

How about adding one more namespace-inferring rule, from the service account path, before falling back to the ENV?
In most use cases the pod has it injected by the default configuration.
The Fabric8 KubernetesClient#getNamespace() already does this, see https://github.com/fabric8io/kubernetes-client/blob/main/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/Config.java#L915.

public static final String KUBERNETES_NAMESPACE_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/namespace";
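
For illustration, the suggested order could look roughly like this (the env var name is just a placeholder for whatever the current implementation falls back to):

	String podNamespace() throws IOException {
		Path saNamespace = Path.of("/var/run/secrets/kubernetes.io/serviceaccount/namespace");
		if (Files.exists(saNamespace)) {
			return Files.readString(saNamespace).trim();
		}
		return System.getenv("POD_NAMESPACE");   // placeholder for the existing ENV-based fallback
	}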

@wind57 (Contributor, Author):

indeed! good to have you around Vic!

@wind57 wind57 marked this pull request as ready for review July 18, 2024 07:05
@wind57 (Contributor, Author) commented Jul 18, 2024

@ryanjbaxter I'm proposing a new implementation of fabric8 leader election, one that is native to the library itself. For the time being I think users should be left to opt in, and then in the long run maybe have this one as the only implementation... thank you for looking into it


[source]
----
spring.cloud.kubernetes.leader.enabled=false
(reviewer, Contributor):

Could we disable this if the new one is enabled, so folks don't need to set 2 properties?

spring.cloud.kubernetes.leader.election.lockName=other-name
----

The namespace can also be set (`default` being used if no explicit one exists):
(reviewer, Contributor):

Shouldn't the namespace be the namespace the pod is running in?


Once a certain pod establishes itself as the leader (by acquiring the lock), it will continuously (every `spring.cloud.kubernetes.leader.election.retryPeriod`) try to renew its lease, or in other words: it will try to extend its leadership. When a renewal happens, the "record" that is stored inside the lock is updated. For example, `renewTime` is updated inside the record to denote when the last renewal happened. (You can always peek at these fields by using `kubectl describe lease ...`, for example.)

Renewal must happen within a certain interval, specified by `spring.cloud.kubernetes.leader.election.renewDeadline`. By default it is equal to 10 seconds, which means that the leader pod has a maximum of 10 seconds to renew its leadership. If that does not happen, this pod is taken out of the leader election process and never participates again (unless you refresh the Spring context or restart the pod).
(reviewer, Contributor):

It can never participate again, meaning the pod can no longer become the leader? Why?
