Add support for broker.rack #36

Closed · scholzj opened this issue Oct 12, 2017 · 19 comments

scholzj commented Oct 12, 2017

Kafka has a concept of racks. Racks group Kafka nodes into segments based on the probability that they will all fail together. E.g. all machines in a single rack will fail when the rack fails - hence "rack". In a public cloud, the role of the rack is typically taken over by the availability zone. The idea is that when a topic has N replicas, the replicas should be spread across different racks to minimise the probability that a single disaster event will take down all replicas of a particular topic / partition.

In the Kubernetes world, nodes often have labels describing the availability zone or some similar failure domain (failure-domain.beta.kubernetes.io/zone). We should:

  • Make sure we have Kafka nodes in each rack (node selectors? pod anti-affinity? In some environments this might be easier, as the pods cannot move between failure domains because the disks are bound to a single domain (e.g. AWS). In other environments this might be harder.)
  • Configure the Kafka nodes with the proper broker.rack property (see the sketch below)
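
To illustrate the mapping, a hedged sketch (the node name and zone value are made up): a worker node carrying the standard zone label that most cloud providers set automatically, and the broker.rack value a broker scheduled onto it should end up with.

```yaml
# Hypothetical worker node: most cloud providers set the zone label automatically.
# A Kafka broker scheduled onto this node should ideally be configured with
# broker.rack=eu-west-1a.
apiVersion: v1
kind: Node
metadata:
  name: worker-1                                        # made-up node name
  labels:
    failure-domain.beta.kubernetes.io/zone: eu-west-1a
```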
@tombentley

How many availability zones are there, typically, in a region?


scholzj commented Oct 14, 2017

It depends. In AWS this ranges from 2 to 6 - you can have a look here. Google seems to be doing a bit better and most of their regions have 3+ zones. I expect that on premises many customers might have only two as well.

2 zones should not be such a big deal for Kafka. ZooKeeper might be a bigger issue in such a situation.

@tombentley

KIP-36 describes rack-aware assignment, and explicitly says that "rack==availability zone" in AWS. It also says, when describing how the rack-aware assignment would work, "if the number of racks is less than the number of replicas (which should happen very infrequently)..." If there are only 2 AZs in a region, it seems this assertion of infrequency is likely to fail, and an uneven distribution is possible.

I still think we should do this, but we need to be a little bit cautious.


scholzj commented Oct 16, 2017

Yeah, it is a bit unfortunate that it cannot distribute the replicas equally across the racks when replicas > racks. But still better than nothing.


ppatierno commented May 15, 2018

After digging more into node affinity and pod affinity/anti-affinity ...

We need to have the brokers spread across different zones/racks as much as possible, and that can be provided by podAntiAffinity. It should use preferredDuringSchedulingIgnoredDuringExecution so that if there are not enough zones to spread the brokers across all of them, the scheduler is allowed to put two brokers in the same zone as well. The topology key for that is the failure-domain.beta.kubernetes.io/zone. The inner label selector should use the strimzi.io based labels to select the pods which are part of this anti-affinity scenario.
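
As a rough sketch of what that could look like in the Kafka pod template (not the actual operator code; the strimzi.io/name value is a made-up example):

```yaml
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          topologyKey: failure-domain.beta.kubernetes.io/zone
          labelSelector:
            matchLabels:
              strimzi.io/name: my-cluster-kafka   # hypothetical Strimzi pod label
```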

At the same time, I guess we want the user to configure "preferred" zones/racks in the cluster config map, so something like:

broker.rack = "zone1, zone2, ..." (or a JSON array or something like that)

It means that users would like the broker replicas to be spread across the specified zones/racks.
In this case nodeAffinity could be useful. It should use preferredDuringSchedulingIgnoredDuringExecution as well, with a matchExpressions entry where failure-domain.beta.kubernetes.io/zone is the key and, in this example, "zone1" and "zone2" (and more) are the values.
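
A hedged sketch of that proposal, with "zone1" and "zone2" as placeholder zone names:

```yaml
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
            - key: failure-domain.beta.kubernetes.io/zone
              operator: In
              values:
                - zone1
                - zone2
```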

These settings are added by the cluster operator to the StatefulSets at creation.

On the other side, we'll have an init container running in the same pod as the Kafka container. It will be in charge of getting info from the node it's running on - most importantly the zone - and "injecting" it as the broker.rack parameter for the Kafka container.

On how to implement the init container: it could just run a script using oc/kubectl to get the node info. It needs spec.nodeName from the pod it's running in, in order to read the zone label from that node. Or it could be a simple Java application using the fabric8 API. Going via a script could be less trivial, but running Java could be too much.
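
For illustration, a hedged sketch of what the script option could look like (the image name, volume name and output path are all hypothetical); the pod's service account would need RBAC permission to get nodes:

```yaml
initContainers:
  - name: kafka-init
    image: example/kafka-init:latest        # hypothetical image with kubectl available
    env:
      - name: NODE_NAME
        valueFrom:
          fieldRef:
            fieldPath: spec.nodeName        # downward API: the node this pod landed on
    command:
      - sh
      - -c
      # Read the node's zone label and store it where the Kafka container can
      # pick it up as broker.rack (requires RBAC rights to get nodes).
      - >
        kubectl get node "$NODE_NAME"
        -o jsonpath='{.metadata.labels.failure-domain\.beta\.kubernetes\.io/zone}'
        > /opt/kafka/init/rack.id
    volumeMounts:
      - name: rack-volume                   # hypothetical emptyDir shared with the Kafka container
        mountPath: /opt/kafka/init
```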

@scholzj @tombentley wdyt ?


scholzj commented May 15, 2018

At the same time, I guess we want the user to configure "preferred" zones/racks in the cluster config map, so something like:

broker.rack = "zone1, zone2, ..." (or a JSON array or something like that)

I'm not sure we need this. What is the use case you can see for selecting only some zones? Even if there is such a use case, it should be part of a general node selector, not part of the broker.rack implementation IMHO.

The topology key for that is the failure-domain.beta.kubernetes.io/zone.

Yeah, right ... so, do we actually need to make this configurable? Or can we just hardcode the affinity into the stateful sets / deployments?


One other problem which comes to my mind ... we might not have access to the nodes (such as on OSO). So we will either need to be able to switch the init container on / off from the config map and leave it to the user to decide whether they can give us the rights. Or we will have to find out whether we have the rights and act accordingly. This should affect only the init container and setting the rack ID ... it should not have an impact on the affinity in the deployment / StatefulSet.

@ppatierno

I'm not sure we need this. What is the use case you can see for selecting only some zones? Even if there is such a use case, it should be part of a general node selector, not part of the broker.rack implementation IMHO.

Yeah, right ... so, do we actually need to make this configurable? Or can we just hardcode the affinity into the stateful sets / deployments?

Ok, I misunderstood then. We don't want the user to specify "some" zones in the config map, but rather the node label to use for "selecting" nodes, so that it does not have to be the hard-coded failure-domain.beta.kubernetes.io/zone (as the default) but can be a different one. Am I right?


scholzj commented May 15, 2018

Ok, I misunderstood then. We don't want the user to specify "some" zones in the config map, but rather the node label to use for "selecting" nodes, so that it does not have to be the hard-coded failure-domain.beta.kubernetes.io/zone (as the default) but can be a different one. Am I right?

Well, I actually haven't thought about it before either. But you are right that failure-domain.beta.kubernetes.io/zone is a well-defined label which should specify the racks. And with the affinity being "optional" (i.e. applied only when the zones are available), I actually don't see any harm in setting this by default.

My expectation (which might of course be wrong) is that people usually do not have more zones than they want to use. Hence no need to select them.

Anyway ... we will need to allow users to select nodes. And although my primary motivation there would be, for example, to select a dedicated node, or a node with a local SSD or a super-fast storage network, I would assume it could also be used for selecting the zones if needed.

@ppatierno

It seems that the issue is broader than supporting the rack feature.

With bare-metal Kafka, the user can specify the broker.rack property for each broker, which is something they cannot do in a Kubernetes/OpenShift deployment because the replicas of a StatefulSet are created from the same Pod template.
For this reason, spreading the pods/replicas across different zones/racks can be done via podAntiAffinity using failure-domain.beta.kubernetes.io/zone (or something else configurable) as the topologyKey. After that, the zone is injected as the value of the broker.rack property for the broker.

What you are saying now ...

we will need to allow users to select nodes.

... sounds a little bit different to me. It's something that can live even without the rack feature.
We want a way, in the config map, to allow users to say "I want my Kafka brokers running on nodes which have a fast SSD", supposing for example that some nodes have the label disktype=ssd. In that case we need nodeAffinity using match expressions where the key is the label (i.e. disktype) with its values (i.e. ssd). How is that related to the broker.rack feature?


scholzj commented May 15, 2018

Sorry if I confused you. The node selectors are a separate issue and not necessarily needed for the broker.rack support. But they can IMHO be used as a way to select only some zones in case that is really needed. So I'm just saying that IMHO we do not need the configuration of selected zones as mentioned at the beginning.

ppatierno self-assigned this May 15, 2018
@tombentley

This concerns me:

It should use preferredDuringSchedulingIgnoredDuringExecution so that if there are not enough zones to spread the brokers across all of them, the scheduler is allowed to put two brokers in the same zone as well.

The reason is that if it is only preferred, then it is possible that a broker could change zone should it terminate and get rescheduled. That would be a problem, as the rack of a broker is used for partition assignment. If the broker can change zone, that would thwart the efforts of the rack-aware partition assignment: replicas carefully assigned to brokers in different racks could subsequently end up in the same rack, and the extra protection of rack-aware assignment would be lost.

@ppatierno

It's a valid point but ...

I don't see a solution, because if in AWS the range is 2 to 6 zones - let's consider having 2, for example - but you want to deploy a 3-broker Kafka cluster, the only way is to use preferred; otherwise, with required, the third broker cannot be deployed at all (as far as I understand how required vs preferred works). Is that something we can avoid?


scholzj commented May 15, 2018

Yeah, AFAIK it is as Paolo says - with required, you will not be able to deploy more brokers than we have zones.

Whether the pods change zone or not also depends on other factors. For example, EBS storage will not move between zones, and a pod with a persistent EBS volume will be bound to the zone through the volume. I'm not sure how, for example, GlusterFS deals with this. I would say we should go for it and just document the caveats. WDYT?

It is funny how in some cases it turns out how insufficient StatefulSets are. If we were scheduling our own Pods directly, this would be easier to handle.

@tombentley

It seems we've got two pieces here:

  • Using the zone node label as the broker rack. I don't think this is contentious, really, since that's pretty much what it is.
  • Then it's a question of scheduling and the right thing here depends on external factors:
    • For EBS, @scholzj asserts that pod anti-affinity should work.
    • For other storage options it might not.

Looking forward, there are other mitigations for the problematic case. For example, we could try using nodeSelectors or pod affinity to make the initial zone assignment sticky, or sticky-ish. Or just rely on the cluster balancer to reassign replicas that end up in the same zone.

In any case, it seems clear that we should do this for now.


scholzj commented May 15, 2018

For other storage options it might not

TBH, I think saying that it will not work is a bit of a strong statement. So let's imagine the following situation:

  • The cluster is deployed across 3 zones with 6 brokers
  • The affinity will put two brokers into each zone
  • The rack IDs will be assigned accordingly and the topics distributed
  • A disaster causes one of the zones to go down
  • The 2 brokers which crashed will be rescheduled into one of the two remaining zones; they will change their rack ID and break the distribution

So it works fairly well up to here. The only problem is that once the zone is recovered, someone will need to reschedule the brokers and possibly reassign the topics. But it is not all that bad - the primary target has been achieved and the data survived the zone crash.

Or did I miss something?

@tombentley

That's the scenario, @scholzj, but the problem is in the "someone will need to reschedule the brokers and possibly reassign the topics" part. If that step doesn't happen, the cluster appears to be working, but the user's expectation around redundancy has been broken. For sure we don't have to solve that in this issue, but it would be good to have some mitigation.

@ppatierno

@tombentley but that's part of the more advanced cluster balancing, isn't it? Please stay focused on the rack issue here! :P


scholzj commented May 15, 2018

Yeah ... well ... I think we should document this until we have a better solution. Rescheduling the pods can in theory be done by the CO. For reassigning the topics we would need the cluster balancer.


scholzj commented Aug 26, 2018

This should now all be implemented. I will close this issue.
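
For reference, in later Strimzi versions the rack support is exposed on the Kafka custom resource roughly as sketched below; the exact API version and topology key (the older failure-domain.beta.kubernetes.io/zone vs. the newer topology.kubernetes.io/zone) depend on the Strimzi and Kubernetes versions in use:

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster
spec:
  kafka:
    replicas: 3
    rack:
      topologyKey: topology.kubernetes.io/zone   # node label used to derive broker.rack
    # ... listeners, storage and the rest of the Kafka configuration ...
  zookeeper:
    replicas: 3
```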

scholzj closed this as completed Aug 26, 2018