
Use podAntiAffinity so brokers don't accidentally run on the same machine #33

Closed
tombentley opened this issue Oct 11, 2017 · 5 comments · Fixed by #486
Comments

tombentley (Member) commented Oct 11, 2017

We should use podAntiAffinity so brokers don't accidentally run on the same physical machine, because that would defeat the point of Kafka holding multiple replicas of partitions.
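As a minimal sketch, such a rule could live in the broker StatefulSet's pod template; the label key/values below (`app: kafka`) are illustrative placeholders, not the project's actual labels:

```yaml
# Hypothetical excerpt of a Kafka StatefulSet pod template.
# The app: kafka label is an assumed example, not the real selector.
spec:
  template:
    metadata:
      labels:
        app: kafka
    spec:
      affinity:
        podAntiAffinity:
          # Hard rule: never co-locate two pods carrying this label
          # on the same node (topologyKey = hostname).
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: kafka
              topologyKey: kubernetes.io/hostname
```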

scholzj (Member) commented Oct 12, 2017

I'm not sure we should really use this by default. Some of my thoughts why ...

Kubernetes will try to schedule the pods of a StatefulSet on separate nodes where possible on its own, so for basic situations anti-affinity is not needed. A situation where it might not work perfectly is, for example:

  1. You have a 3-node Kubernetes cluster.
  2. Kafka is deployed with 3 pods; one pod is scheduled per Kubernetes node.
  3. One of the nodes crashes => Kubernetes reschedules the pod from the crashed node to one of the two remaining nodes.
  4. The crashed node rejoins the cluster, but Kubernetes will not pro-actively kill one of the pods to move it to the newly rejoined node. So although we have 3 nodes again, the 3 brokers are running on only 2 nodes => problem.

Pod anti-affinity would also cause us some complications:

  • With a required rule, you will never be able to run a cluster on minikube / minishift / oc cluster up, which are all single-node environments.
  • We will need to carefully choose the anti-affinity label selector so that it matches only Kafka pods from the same Kafka cluster (e.g. if you, for whatever reason, have multiple Kafka installations).
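For instance, scoping the selector to a single cluster could look like the sketch below; both label keys here (`app`, `cluster`) are hypothetical examples, not labels this project currently sets:

```yaml
# Sketch: anti-affinity restricted to pods of ONE Kafka cluster,
# so two separate Kafka installations don't repel each other.
podAntiAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app: kafka            # hypothetical component label
          cluster: my-cluster   # hypothetical per-cluster label
      topologyKey: kubernetes.io/hostname
```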

With regards to replication, in the end this is not about preventing two Kafka brokers from running on the same host, but about preventing two brokers with different rack IDs from running on the same host. So what we should do is add support for broker.rack in the Kafka images, and add support for node selectors based on the "zone" where the node is running. That would ensure that several Kafka instances might run on the same host, but only ever with the same rack ID, so such instances would not share any replicas. And once we have that implemented, the only problem that sharing a physical machine between multiple Kafka pods can cause is that they will steal each other's disk cache.
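A sketch of that zone-level spreading, paired with Kafka's broker.rack property, might look like this; the zone label is the standard one Kubernetes nodes carried at the time, and the mechanism for injecting the zone into the broker configuration is left open here:

```yaml
# Sketch: spread brokers across zones rather than individual hosts.
# Assumes nodes carry the standard zone label of that era.
podAntiAffinity:
  preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app: kafka   # illustrative label
        topologyKey: failure-domain.beta.kubernetes.io/zone

# Each broker would then set, in its server.properties:
#   broker.rack=<zone of the node the pod landed on>
# so Kafka never places two replicas of a partition in one zone.
```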

The practical complication with podAntiAffinity is that we will need to configure it somehow: we will at least need a set of files without it for testing on single-node installations such as minikube / minishift, and a set of files with anti-affinity for running on real clusters. That sounds like something which should be done after #31. I guess that sooner or later we will get to the state where deploying the cluster is done by some controller anyway; there it would be even easier, because you would just tell the controller that it should deploy "with anti-affinity" and it would generate the configuration for you.

tombentley (Member, Author) commented

> It will never run a cluster on minikube / minishift / oc cluster up

Is that true of preferredDuringSchedulingIgnoredDuringExecution? Wouldn't the scheduler just ignore the anti-affinity with that setting?
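For reference, the soft variant looks roughly like this; on a single-node cluster the scheduler cannot satisfy the preference and simply places all pods on the one node anyway (labels again illustrative):

```yaml
# Sketch: soft anti-affinity. Scheduling never fails; the rule only
# biases placement away from nodes already running a matching pod.
podAntiAffinity:
  preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app: kafka   # illustrative label
        topologyKey: kubernetes.io/hostname
```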

> So what we should do is add support for broker.rack in the Kafka images

Can you open a separate issue for that, @scholzj?

> I guess that sooner or later we will anyway get to the state when deploying the cluster will be done by some controller - there it would be even easier because you would just tell the controller that it should deploy "with anti-affinity" and it would generate the code for you.

I'm happy to defer this issue until we have such a controller.

scholzj (Member) commented Oct 12, 2017

You are right, preferredDuringSchedulingIgnoredDuringExecution should still schedule everything on Minikube and the like (see, I don't know that much about Kubernetes and OpenShift ;-)). It is also interesting that the Kubernetes docs seem to discuss this even in connection with StatefulSets. I haven't seen a StatefulSet whose pods were not spread across nodes automatically, even without the preferredDuringSchedulingIgnoredDuringExecution rule. But maybe it is not guaranteed, so perhaps we should add it to be sure.

I raised #36 for the rack IDs.

@scholzj scholzj added the yaml label Oct 18, 2017
vinu commented May 30, 2018

Hey, anything on this yet? What is the best way to configure pod affinity and anti-affinity with Strimzi? Thanks.

ppatierno (Member) commented
Hi @vinu, this feature is planned for the current sprint, so it should be available at the end of the next 3 weeks.

@tombentley tombentley mentioned this issue Jun 11, 2018
tombentley added a commit that referenced this issue Jun 12, 2018
* Add support for user-configurable pod and node affinities

Fixes #33
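The commit above makes the affinities user-configurable. As a sketch of what that user-facing configuration might look like, under the assumption that affinity is exposed on the Kafka cluster descriptor (the field placement and labels here are illustrative guesses based on the commit message, not the actual API):

```yaml
# Sketch only: assumed shape of a user-supplied affinity on the
# Kafka cluster resource; field names are not taken from the real API.
spec:
  kafka:
    replicas: 3
    affinity:   # assumed field for pass-through pod affinity rules
      podAntiAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  strimzi.io/cluster: my-cluster   # illustrative label
              topologyKey: kubernetes.io/hostname
```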
@tombentley tombentley added this to the 0.5.0 milestone Jun 12, 2018
tomncooper added a commit to tomncooper/strimzi-kafka-operator that referenced this issue Apr 14, 2020