Proposal for moving data between two JBOD disks using Cruise Control #106
Conversation
Signed-off-by: ShubhamRwt <shubhamrwt02@gmail.com>
Thanks for the proposal. I left some quick comments.
    strimzi.io/cluster: my-cluster
spec:
  mode: remove-disks
  brokerandlogdirs: [brokerid1-logdir1,brokerid2-logdir2...] # for eg. 0-/var/lib/kafka/data-0/kafka-log0
`brokerandlogdirs` seems strange as a name. If nothing else, you should use proper case: `brokerAndLogDirs`. For the users, it is quite complicated to understand what the paths are etc. as they are an internal implementation detail. So I also wonder if you should create more elegant APIs. E.g. something like:
spec:
  mode: remove-disks
  brokerandlogdirs:
    - brokerId: 0
      volumeId: 0
    - brokerId: 1
      volumeId: 0
    - brokerId: 2
      volumeId: 0
Or something similar that would be based on what the user configures in the Kafka / KafkaNodePool resources and not on the internal paths.
I agree with Jakub. In our current Strimzi implementation, the users have no knowledge of the internal paths on the volume where we configure the brokers to store logs. The only thing the user knows is the volume on a specific broker as pointed out in the above example.
The user doesn't know about `/var/lib/kafka/data-0/kafka-log0` at all, but we are using a very well defined pattern to derive it from the volume for a broker.
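The well-defined pattern mentioned here can be sketched in code. This is a hypothetical helper (the function names and the exact layout are assumptions inferred from the `/var/lib/kafka/data-0/kafka-log0` example in the diff, not Strimzi's actual implementation):

```python
def log_dir_path(broker_id: int, volume_id: int) -> str:
    # Assumed layout, inferred from the example path in the proposal:
    # volume <volume_id> is mounted at /var/lib/kafka/data-<volume_id>,
    # and the broker keeps its logs in kafka-log<broker_id> inside it.
    return f"/var/lib/kafka/data-{volume_id}/kafka-log{broker_id}"


def broker_and_log_dirs(pairs):
    # Turn user-facing (brokerId, volumeId) pairs into the
    # "brokerid-logdir" strings the Cruise Control endpoint expects.
    return [f"{broker}-{log_dir_path(broker, volume)}" for broker, volume in pairs]


print(broker_and_log_dirs([(0, 0), (1, 0)]))
# ['0-/var/lib/kafka/data-0/kafka-log0', '1-/var/lib/kafka/data-0/kafka-log1']
```

If the API exposed only `brokerId`/`volumeId`, the operator could derive the internal paths itself with a mapping like this, so users never see them.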
Totally agree. That endpoint has a really bad user interface, and we should offer a better view.
### Flow

- The user should be using the `Kafka` resource with JBOD configured.
What if the user uses KafkaNodePool resources? What if the user does not use JBOD type storage?
- The user accepts the proposal by applying the `strimzi.io/rebalance=approve` annotation on it.
- The `KafkaRebalanceAssemblyOperator` starts the interaction with Cruise Control via the `/remove_disks` endpoint for running the actual rebalancing.

Note: The movement of data between the JBOD disks doesn't affect the broker load, therefore there will be no changes in the before/after broker load.
Why does it not impact it? The moving of data between disks has to impact performance and load of the broker.
I misled Shubham on this one; I thought it wouldn't affect the broker load since the data going in and out of the broker would be the same regardless of how the data was distributed amongst the disks. But I understand that if one disk was faster than another, moving the data between them within a broker certainly would impact the performance.
@ShubhamRwt did we ever find out why the loadBefore/loadAfter information was not provided by this `remove_disk` endpoint? Is it just an oversight by upstream Cruise Control?
## Current situation

Currently, we get a multiple requests to add the ability for moving all Kafka logs between two disks on the JBOD storage array. This feature can be useful in following scenarios:
multiple requests ... from community users I guess? Let's make it clear.
Currently, we get a multiple requests to add the ability for moving all Kafka logs between two disks on the JBOD storage array. This feature can be useful in following scenarios:
- The current disk is too big and the user wants to use smaller one
- When we want to use different Storage Class with different parameters or different storage types.

For now, we can do this using the Kafka CLI tools but it is not very user-friendly.
which Kafka CLI tools?
I guess we want to mention the `kafka-reassign-partitions.sh` tool.
@@ -0,0 +1,71 @@
# Moving data between two JBOD disks using Cruise Control

This proposal is about integrating the `remove_disks` endpoint from Cruise Control into Strimzi cluster operator.
I can't find any documentation about this endpoint in Cruise Control, for example it's missing in the wiki.
Is there an official place where we can find how it works and put a link here?
Do we want to link to the yaml?
I think the correct link is this one, from LinkedIn repo.
You could also raise a documentation bug in Cruise Control for the missing endpoint documentation.
Two things I find weird about this endpoint:
- The `stop_ongoing_execution` parameter, when we have a dedicated endpoint for that.
- The fact that you use the same endpoint to poll for status updates, instead of using the `user_tasks` endpoint.
I will create a documentation bug and create a PR for the same on the upstream repo regarding this. The only piece of information about using this is on the PR.
- The user should be using the `Kafka` resource with JBOD configured.
- When the Kafka cluster is ready, the user creates a `KafkaRebalance` custom resource with the `spec.mode` field as `remove-disks` and the list of the broker and the corresponding logdirs to move in the `spec.brokerandlogdirs` field.
- The `KafkaRebalanceAssemblyOperator` starts the interaction with Cruise Control via the `/remove_disks` endpoint for getting an optimization proposal (by using the dryrun feature).
- The user accepts the proposal by applying the `strimzi.io/rebalance=approve` annotation on it.
I guess the user can still use the auto approval feature we have. It's better to highlight it here.
useful sought-after feature. 👍
I made a few suggestions as I read through
## Current situation

Currently, we get a multiple requests to add the ability for moving all Kafka logs between two disks on the JBOD storage array. This feature can be useful in following scenarios:
- The current disk is too big and the user wants to use smaller one
Suggested change:
- - The current disk is too big and the user wants to use smaller one
+ - The current disk is too big and the user wants to use a smaller one
Currently, we get a multiple requests to add the ability for moving all Kafka logs between two disks on the JBOD storage array. This feature can be useful in following scenarios:
- The current disk is too big and the user wants to use smaller one
- When we want to use different Storage Class with different parameters or different storage types.
Suggested change:
- - When we want to use different Storage Class with different parameters or different storage types.
+ - When we want to use a different Storage Class with different parameters or different storage types.
## Proposal

Cruise Control provides `remove_disks` HTTP REST endpoint to move replicas from a specified disk to other disks of the same broker.
This endpoint allows to trigger a rebalancing operation by moving replicas in a round-robin manner to the remaining disks, from the largest to the smallest, while checking the following constraint:
Suggested change:
- This endpoint allows to trigger a rebalancing operation by moving replicas in a round-robin manner to the remaining disks, from the largest to the smallest, while checking the following constraint:
+ This endpoint triggers a rebalancing operation by moving replicas in a round-robin manner to the remaining disks, from the largest to the smallest, while checking the following constraint:
```sh
errorMargin = configurable property (default 0.1); it makes sure that a disk percentage is always free when moving replicas
```

In order to use the above endpoint in the Strimzi cluster operator, it would be added to the [`CruiseControlApi`](https://github.com/strimzi/strimzi-kafka-operator/blob/main/cluster-operator/src/main/java/io/strimzi/operator/cluster/operator/resource/cruisecontrol/CruiseControlApi.java) interface and developing the corresponding implementation.
Suggested change:
- In order to use the above endpoint in the Strimzi cluster operator, it would be added to the [`CruiseControlApi`](https://github.com/strimzi/strimzi-kafka-operator/blob/main/cluster-operator/src/main/java/io/strimzi/operator/cluster/operator/resource/cruisecontrol/CruiseControlApi.java) interface and developing the corresponding implementation.
+ In order to use the `remove_disks` endpoint in the Strimzi cluster operator, it would be added to the [`CruiseControlApi`](https://github.com/strimzi/strimzi-kafka-operator/blob/main/cluster-operator/src/main/java/io/strimzi/operator/cluster/operator/resource/cruisecontrol/CruiseControlApi.java) interface and the corresponding implementation developed.
### Implementation

For implementing this feature, We will be adding a new mode to the `KafkaRebalanceMode` class.
Suggested change:
- For implementing this feature, We will be adding a new mode to the `KafkaRebalanceMode` class.
+ To implement this feature, we will be adding a new mode to the `KafkaRebalanceMode` class:
- The user should be using the `Kafka` resource with JBOD configured.
- When the Kafka cluster is ready, the user creates a `KafkaRebalance` custom resource with the `spec.mode` field as `remove-disks` and the list of the broker and the corresponding logdirs to move in the `spec.brokerandlogdirs` field.
- The `KafkaRebalanceAssemblyOperator` starts the interaction with Cruise Control via the `/remove_disks` endpoint for getting an optimization proposal (by using the dryrun feature).
Suggested change:
- - The `KafkaRebalanceAssemblyOperator` starts the interaction with Cruise Control via the `/remove_disks` endpoint for getting an optimization proposal (by using the dryrun feature).
+ - The `KafkaRebalanceAssemblyOperator` interacts with Cruise Control via the `/remove_disks` endpoint to generate an optimization proposal (by using the dryrun feature).
- When the Kafka cluster is ready, the user creates a `KafkaRebalance` custom resource with the `spec.mode` field as `remove-disks` and the list of the broker and the corresponding logdirs to move in the `spec.brokerandlogdirs` field.
- The `KafkaRebalanceAssemblyOperator` starts the interaction with Cruise Control via the `/remove_disks` endpoint for getting an optimization proposal (by using the dryrun feature).
- The user accepts the proposal by applying the `strimzi.io/rebalance=approve` annotation on it.
- The `KafkaRebalanceAssemblyOperator` starts the interaction with Cruise Control via the `/remove_disks` endpoint for running the actual rebalancing.
Suggested change:
- - The `KafkaRebalanceAssemblyOperator` starts the interaction with Cruise Control via the `/remove_disks` endpoint for running the actual rebalancing.
+ - The `KafkaRebalanceAssemblyOperator` interacts with Cruise Control via the `/remove_disks` endpoint to perform the actual rebalancing.
## Current situation

Currently, we get a multiple requests to add the ability for moving all Kafka logs between two disks on the JBOD storage array. This feature can be useful in following scenarios:
Would this be better in the `Motivation` section?
Currently, we get a multiple requests to add the ability for moving all Kafka logs between two disks on the JBOD storage array. This feature can be useful in following scenarios:
- The current disk is too big and the user wants to use smaller one
- When we want to use different Storage Class with different parameters or different storage types.

For now, we can do this using the Kafka CLI tools but it is not very user-friendly.
This is a reference to the `kafka-reassign-partitions` script, right? If so, you could emphasize how devising the partition mappings to migrate data between disks in this scenario is a manual, time-consuming process. Solutions exist for moving data off brokers, but not off specific disks of brokers which have a JBOD config.
# Moving data between two JBOD disks using Cruise Control

This proposal is about integrating the `remove_disks` endpoint from Cruise Control into Strimzi cluster operator.
This endpoint will allow us to move the data between two JBOD disks.
Does it only allow moving data between two disks?
Yes, according to the endpoint definition (YAML). It would help to see it in action with curl commands.
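In that spirit, here is a rough sketch of what such a request could look like. The base URL and the parameter names (`brokerid_and_logdirs`, `dryrun`, `json`) are assumptions taken from the endpoint discussion in this thread, not a confirmed Cruise Control API; verify them against the endpoint's YAML definition for the version in use:

```python
from urllib.parse import urlencode


def remove_disks_url(base_url, broker_logdirs, dryrun=True):
    # Hypothetical URL builder for the remove_disks endpoint.
    # broker_logdirs is a list of (broker id, log dir path) pairs.
    param = ",".join(f"{broker}-{logdir}" for broker, logdir in broker_logdirs)
    query = urlencode({"brokerid_and_logdirs": param,
                       "dryrun": str(dryrun).lower(),
                       "json": "true"})
    return f"{base_url}/kafkacruisecontrol/remove_disks?{query}"


url = remove_disks_url("http://localhost:9090",
                       [(0, "/var/lib/kafka/data-0/kafka-log0")])
print(url)
```

Issuing this with `dryrun=true` first, then again with `dryrun=false`, would mirror the proposal/approval flow described elsewhere in this thread.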
Hi @ShubhamRwt. I left just a few comments. I will have another pass when it is out of the draft state.
## Current situation

Currently, we get a multiple requests to add the ability for moving all Kafka logs between two disks on the JBOD storage array. This feature can be useful in following scenarios:
- The current disk is too big and the user wants to use smaller one
I guess the other way around too. Current disk is too small, and the user wants a bigger one, to avoid running out of disk space. Disk removal to reduce the total storage size is another use case.
## Proposal

Cruise Control provides `remove_disks` HTTP REST endpoint to move replicas from a specified disk to other disks of the same broker.
This endpoint allows to trigger a rebalancing operation by moving replicas in a round-robin manner to the remaining disks, from the largest to the smallest, while checking the following constraint:
So this is not round-robin, but a size-based scheduling.
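To make the distinction concrete, here is a small sketch of a size-based placement. This is one plausible reading of the description above (not Cruise Control's actual algorithm, and the names are hypothetical): replicas from the removed disk are taken from largest to smallest, and each one goes to the remaining disk with the most free space:

```python
def place_replicas(replica_sizes, disk_free):
    # replica_sizes: {replica name: size}, replicas on the removed disk.
    # disk_free: {disk name: free space} for the remaining disks;
    # mutated in place as replicas are assigned (sketch only).
    placement = {}
    for replica, size in sorted(replica_sizes.items(), key=lambda kv: -kv[1]):
        disk = max(disk_free, key=disk_free.get)  # disk with most free space
        placement[replica] = disk
        disk_free[disk] -= size
    return placement


print(place_replicas({"topic-0": 30, "topic-1": 10}, {"disk1": 40, "disk2": 50}))
```

As the reviewer notes, this is size-based scheduling rather than plain round-robin: the next target disk depends on how much space previous assignments consumed.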
```sh
remainingUsageAfterRemoval = current usage for remaining disks + additional usage from removed disks
remainingCapacity = sum of capacities of the remaining disks
errorMargin = configurable property (default 0.1); it makes sure that a disk percentage is always free when moving replicas
```
I guess users with very busy clusters may need more than 10% of available disk space. From the endpoint definition (YAML), I don't see how we can configure the errorMargin. Do you know how?
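For clarity, the quoted constraint can be sketched as a single check. This is our reading of the description (the exact inequality Cruise Control uses may differ): after moving the replicas, the remaining disks must still keep at least an `error_margin` fraction of their total capacity free:

```python
def removal_allowed(remaining_usage_after_removal: float,
                    remaining_capacity: float,
                    error_margin: float = 0.1) -> bool:
    # Sketch of the constraint described above, not Cruise Control's
    # actual code: the usage that ends up on the remaining disks must
    # stay below capacity minus the reserved error margin.
    return remaining_usage_after_removal < remaining_capacity * (1.0 - error_margin)


# e.g. 1000 GiB of usage after removal on 1200 GiB of remaining capacity:
# the threshold is 1080 GiB, so the removal is allowed.
print(removal_allowed(1000, 1200))  # True
```

With the default margin of 0.1 this reserves 10% of the remaining capacity, which is exactly what the comment above questions for very busy clusters.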
### Flow

- The user should be using the `Kafka` resource with JBOD configured.
- When the Kafka cluster is ready, the user creates a `KafkaRebalance` custom resource with the `spec.mode` field as `remove-disks` and the list of the broker and the corresponding logdirs to move in the `spec.brokerandlogdirs` field.
You have `remove_disks` in one place and `remove-disks` in another - it should be the same across the whole proposal.
Thanks @ShubhamRwt, what you describe here could certainly be made to work. But I'd like there to be some discussion about whether this is the right approach. Before I get on to that, let's talk about terminology for a moment:
I think this feature should stick with the established Strimzi terminology, so if we're going to add a new mode to the `KafkaRebalance` resource …

**User experience**

Kafka clusters need rebalancing because of uneven load on brokers imposed by the replicas they're hosting. But moving data between log dirs is quite different. It's not a global rebalance at all. What CC is providing is much less involved, because the space of possible changes is much smaller: It can only move replicas to the remaining disks on the same broker, and all it's using for that assignment (currently) is knowledge of the disk space required by each replica. It's possible that these things are done by different people, and they're always done for different reasons. Removing disks is usually going to be done because it's related to some operational necessity. Rebalancing is usually going to be done to get better resource usage. The schema of the … All this is to say that I'm not convinced by the approach of repurposing `KafkaRebalance`.
There are a few things that can go wrong between steps 1 and 2: …
Furthermore, we need to consider the complexity we're imposing on users with this approach. They need to operationally coordinate steps 1 and 2. If they're using gitops this would involve applying a change for step 1 and at some later point one for step 2. This might involve different people (e.g. on different shifts). This all adds to the burden we're placing on users. I think we should consider if this is the experience we want users to have.

**Possible alternative**

One alternative would be for this to be driven from the …
This is certainly more complicated for us to implement. (It happens over a number of reconciliations, for a start.) But I think it's much closer to what users actually need: in the normal case they have a single CR to interact with (the one which already declares the …). With this alternative the …
Hi @scholzj @ppatierno, I do think that the above suggestion by @tombentley looks good, so do we want to put the …
I do not think it looks good.
It might make sense to one day automate the removal of the disks. But I think that should be left to another proposal. It should also be done through the …
Hi @ppatierno @tombentley, it would be great to have some views from you too. I will be starting to re-write this.
Another thing to consider is that upstream CC is not so happy to enable the …
This proposal shows how we can implement the feature to move data between two JBOD disks using Cruise Control.