Add support for broker.rack
#36
How many availability zones are there, typically, in a region?
It depends. In AWS this ranges from 2 to 6 - you can have a look here. Google seems to be doing a bit better and most of their regions have 3+ zones. I expect that on premise many customers might have only two as well. 2 zones should not be such a big deal for Kafka. Zookeeper might be a bigger issue in such a situation.
KIP-36 describes rack-aware assignment, and explicitly says that "rack == availability zone" in AWS. It also says, when describing how the rack-aware assignment would work, "if the number of racks is less than the number of replicas (which should happen very infrequently)...". If there are only 2 AZs in a region, this assertion of infrequency is likely to fail, and an uneven distribution is possible (e.g. with 3 replicas and 2 zones, one zone will always hold two replicas of a partition). I still think we should do this, but we need to be a little bit cautious.
Yeah, it is a bit unfortunate that it cannot distribute the replicas equally across the racks when replicas > racks. But still better than nothing.
After digging more into node affinity and pod affinity/anti-affinity ... We need to have brokers spread across different zones/racks as much as possible, and that can be provided by pod anti-affinity on the zone label. At the same time, I guess we want the user to configure "preferred" zones/racks in the cluster configmap, so something like: broker.rack = "zone1, zone2, ..." (or a JSON array or something like that). It means that users would like to have the broker replicas spread across the specified zones/racks. These settings are added by the cluster operator to the StatefulSets at creation. On the other side, we'll have an init-container running in the same pod as the Kafka container. It will be in charge of getting info from the node it's running on, most importantly getting the zone, and "injecting" it as broker.rack. On implementing the init-container: could it just run a script using oc/kubectl for getting the node info? It needs the rights to read the node information. A rough sketch of what that could look like is below. @scholzj @tombentley wdyt?
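A minimal sketch of such an init container, assuming kubectl is available in the init image and that the pod's service account has RBAC rights to read nodes; the image name, volume name, and file path are made up for illustration:

```yaml
# Hypothetical init container for the Kafka pod template: it looks up the
# zone label of the node the pod was scheduled on and writes it to a shared
# volume, from which the Kafka container can pick it up as broker.rack.
initContainers:
  - name: rack-init
    image: bitnami/kubectl:latest        # assumption: any image shipping kubectl
    command:
      - /bin/sh
      - -c
      - |
        # NODE_NAME is injected through the downward API below; reading the
        # node object requires RBAC permission to "get" nodes.
        ZONE=$(kubectl get node "$NODE_NAME" \
          -o jsonpath='{.metadata.labels.failure-domain\.beta\.kubernetes\.io/zone}')
        echo "broker.rack=${ZONE}" > /opt/kafka/init/rack.properties
    env:
      - name: NODE_NAME
        valueFrom:
          fieldRef:
            fieldPath: spec.nodeName
    volumeMounts:
      - name: rack-volume
        mountPath: /opt/kafka/init
```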
I'm not sure we need this. What is the use case you can see in selecting only some zones? Even if there is such a use case, it should be part of a general node selector, not part of the rack configuration.
Yeah, right ... so, do we actually need to make this configurable? Or can we just hardcode the affinity into the stateful sets / deployments (see the sketch below)? One other problem which comes to my mind ... we might not have access to the nodes (such as on OSO). So we will either need to be able to switch the init container on / off from the config map and leave it to the user to decide whether they can give us the rights, or we will have to find out if we have the rights and use them accordingly. This should affect the init container and setting the rack ID only ... it should not have an impact on the affinity in the deployment / statefulset.
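A sketch of what such a hardcoded affinity could look like in the StatefulSet pod template; the app: kafka label and the weight are assumptions for illustration:

```yaml
# Hypothetical pod anti-affinity for the Kafka StatefulSet: prefer spreading
# brokers across zones, without making it a hard scheduling requirement.
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: kafka                 # assumption: label set on broker pods
          topologyKey: failure-domain.beta.kubernetes.io/zone
```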
Ok, I misunderstood then. We don't want the user to specify "some" zones in the configmap, but rather the node label to use for "selecting" nodes, so that it is not the hard-coded failure-domain.beta.kubernetes.io/zone.
Well, I actually haven't thought about it before either. But you are right that the label should probably not be hardcoded. My expectation (which might of course be wrong) is that people usually do not have more zones than what they want to use. Hence no need to select them. Anyway ... we will need to allow users to select nodes. And although my primary motivation there would be, for example, to select a dedicated node or a node with local SSDs or with a super fast storage network, I would assume it could be used also for selecting the zones if needed (see the node selector sketch below).
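For illustration, a node selector along these lines could cover both cases; the label keys and values here are made up:

```yaml
# Hypothetical nodeSelector in the pod template: pin brokers to nodes with a
# particular disk type, or (only if really needed) to a specific zone.
nodeSelector:
  disktype: local-ssd                            # illustrative custom label
  failure-domain.beta.kubernetes.io/zone: zone1  # only if zone pinning is wanted
```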
It seems that the issue is broader than supporting the rack feature. In bare metal Kafka, the user can specify the broker.rack property for each broker directly. What you are saying now ... sounds a little bit different to me. It's something that can live even without the rack feature.
Sorry if I confused you. The node selectors are a separate issue and not necessarily needed for the rack feature.
This (relying only on "preferred" anti-affinity) concerns me: if it is only preferred, then a broker can change zone should it terminate and get rescheduled. That would be a problem, as the rack of a broker is used for partition assignment. If the broker can change zone, that would thwart the efforts of the rack-aware partition assignment: replicas carefully assigned to brokers in different racks could subsequently end up in the same rack, and the extra protection of rack-aware assignment would be lost.
It's a valid point but ... I don't see a solution, because if in AWS the range is from 2 to 6 zones, and let's consider having 2 for example but you want to deploy a 3-broker Kafka cluster, the only way is to use "preferred" instead of "required" anti-affinity.
Yeah, AFAIK this is as Paolo says - with required anti-affinity, you will not be able to deploy more brokers than we have zones (see the contrast below). Whether the pods change the zone or not depends also on other factors. For example, EBS storage will not move across zones, and a pod with a persistent EBS volume will be bound to the zone through the volume. I'm not sure how, for example, GlusterFS deals with this. I would say we should go for it and just document the caveats. WDYT? It is funny how in some cases it turns out how insufficient the stateful sets are. If we were scheduling our own Pods directly, this would be easier to handle.
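For contrast with the preferred variant sketched earlier, the required form would look roughly like this; it guarantees distinct zones but makes pods unschedulable once there are more brokers than zones (labels are the same assumptions as above):

```yaml
# Hypothetical "required" pod anti-affinity: a hard guarantee that no two
# brokers share a zone, at the price of refusing to schedule broker N+1
# when only N zones exist.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: kafka                   # assumption: label set on broker pods
        topologyKey: failure-domain.beta.kubernetes.io/zone
```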
It seems we've got two pieces here: getting the initial spread of brokers across zones right, and keeping the brokers in their zones afterwards.
Looking forward, there are other mitigations in the problematic case. For example, we could try using nodeSelectors or pod affinity to make the initial zone assignment sticky, or sticky-ish. Or just rely on the cluster balancer to reassign replicas in the same zone. In any case, it seems clear that we should do this for now.
TBH, I think saying that it will not work is a bit of a strong statement. So let's imagine the following situation: a zone crashes, the brokers from that zone get rescheduled into the surviving zones, and thanks to the rack-aware assignment the other replicas of their partitions are still available there.

So it works fairly well up to here. The only problem is that once the zone is recovered, someone will need to reschedule the brokers and possibly reassign the topics. But it is not all that bad - the primary target has been achieved and the data survived the zone crash. Or did I miss something?
That's the scenario @scholzj, but the problem is in the "someone will need to reschedule the brokers and possibly reassign the topics" part. If that step doesn't happen, the cluster appears to be working but the user's expectations around redundancy have been broken. For sure we don't have to solve that in this issue, but it would be good to have some mitigation.
@tombentley but that's part of the more advanced cluster balancing, isn't it? Please stay focused on the rack issue here! :P
Yeah ... well ... I think we should document this until we have a better solution. Rescheduling the pods can in theory be done by the CO. For reassigning the topics we would need the cluster balancer.
This should now all be implemented. I will close this issue.
Kafka has a concept of racks. Racks group Kafka nodes into segments based on the probability that they will all fail together. E.g. all machines in a single rack will fail when the rack fails - hence "rack". In the public cloud, for example, the role of the rack is taken over by the availability zone. The idea is that when a topic has N replicas, the replicas should be spread across different racks to minimise the probability that one disaster event will take down all replicas of a particular topic / partition.
In the Kubernetes world, the nodes often carry labels describing the availability zone or some similar concept of a disaster partition (failure-domain.beta.kubernetes.io/zone). We should use this label to set the broker.rack property on each broker; a sketch of how the pieces could fit together is below.
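For illustration, this is the kind of node label the operator would read; the node name and zone value are made up, and the trailing comment shows where the value would end up:

```yaml
# Hypothetical node object (abridged): the zone label carries the value that
# should become broker.rack on any broker scheduled onto this node.
apiVersion: v1
kind: Node
metadata:
  name: worker-1
  labels:
    failure-domain.beta.kubernetes.io/zone: eu-west-1a
# A broker running on worker-1 would then start with:
#   broker.rack=eu-west-1a
```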