Adding action rebalance #137
Conversation
Great work!
cmd/topicctl/subcmd/rebalance.go
	clusterConfig config.ClusterConfig,
	adminClient admin.Client,
) error {
	// validate topic config is valid for the cluster config
This function seems to reimplement a lot of the CheckTopic function. I'd suggest using that function instead
Line 25 in dd82be7
func CheckTopic(ctx context.Context, config CheckConfig) (TopicCheckResults, error) {
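For reference, a minimal sketch of what delegating to CheckTopic from the rebalance path might look like. Only the signature above is confirmed; the CheckConfig field names and the AllOK method below are assumptions, not code from the PR:

// Hypothetical sketch: delegate per-topic validation to the existing check
// package instead of reimplementing it. Assumes imports of context, fmt, and
// topicctl's admin, check, and config packages; the CheckConfig fields and
// AllOK are assumed names.
func validateTopicForRebalance(
	ctx context.Context,
	clusterConfig config.ClusterConfig,
	topicConfig config.TopicConfig,
	adminClient admin.Client,
) error {
	results, err := check.CheckTopic(ctx, check.CheckConfig{
		AdminClient:   adminClient,
		ClusterConfig: clusterConfig,
		TopicConfig:   topicConfig,
	})
	if err != nil {
		return err
	}
	if !results.AllOK() {
		return fmt.Errorf("topic %s failed pre-rebalance checks", topicConfig.Meta.Name)
	}
	return nil
}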
This function seems to reimplement a lot of the CheckTopic function.

I intentionally did not use the CheckTopic function. CheckTopic performs in-depth checks on a topic's settings, configurations, and racks, along with throttles, figuring out wrong-leader aspects, and so on.
For the rebalance action, the priority is to ensure that crucial settings (partitions/retention) have not changed, since such changes are majorly disruptive. topicctl apply does not take any action when replication is modified anyway.
This approach has been followed in the interest of keeping the rebalance action's basic topic checks simple.
Ok, sounds good. What about taking the common functionality and putting it into helper functions that are used in each function? That way we have one blessed way of doing each check.
I am concerned about the possibility of the two ways of doing checks drifting from each other
Can you elaborate on this, please?
Rebalance only needs some specific checks. When the rebalance action proceeds with applying a topic, the apply step performs a bunch of other work.
What about taking the common functionality and putting it into helper functions that are used in each function? That way we have one blessed way of doing each check.

Made some changes. Please review and resolve.
Since a lot of this check code is duplicated here, I was thinking you could move those individual checks into small private functions in the check package and then use them both there and in your new rebalance functions. That way we have one defined way of doing a retention check, for example.
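For concreteness, one possible shape for that, as a sketch; the helper names and field accesses below are illustrative, not code from the PR:

package check

import (
	"fmt"
	"strconv"

	"github.com/segmentio/topicctl/pkg/admin"
	"github.com/segmentio/topicctl/pkg/config"
)

// Hypothetical private helpers in the check package. Both CheckTopic and
// the rebalance pre-checks would call these, so each check is defined once.

// checkPartitionCount verifies that the live partition count matches the
// count declared in the topic config.
func checkPartitionCount(topicConfig config.TopicConfig, topicInfo admin.TopicInfo) error {
	if len(topicInfo.Partitions) != topicConfig.Spec.Partitions {
		return fmt.Errorf(
			"topic %s has %d partitions in the cluster but %d in its config",
			topicConfig.Meta.Name,
			len(topicInfo.Partitions),
			topicConfig.Spec.Partitions,
		)
	}
	return nil
}

// checkRetention verifies that the live retention.ms matches the retention
// declared in the topic config.
func checkRetention(topicConfig config.TopicConfig, topicInfo admin.TopicInfo) error {
	expectedMs := topicConfig.Spec.RetentionMinutes * 60 * 1000
	actualMs, err := strconv.Atoi(topicInfo.Config["retention.ms"])
	if err != nil {
		return fmt.Errorf("could not parse retention.ms for topic %s: %v", topicConfig.Meta.Name, err)
	}
	if actualMs != expectedMs {
		return fmt.Errorf(
			"topic %s has retention.ms %d in the cluster but %d in its config",
			topicConfig.Meta.Name, actualMs, expectedMs,
		)
	}
	return nil
}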
TBH, it's not duplicate code.
Even if I could break up CheckTopic, its results object is a dict that maintains the output for each check, which is something we don't need!
Or am I missing something?
The three steps in here are already a unit.
I still feel this is simple. Also, if we feel like adding more checks in the future, then we can consider breaking up the CheckTopic functionality.
I am trying to avoid CheckTopic unless it is needed.
Please update the PR description to remove the examples where topics were being created and add an example of what it looks like when topics are rebalanced. A lot of logging was added and I wonder if things are too verbose now.
The PR description has been updated with detailed output screenshots, along with test cluster and topic information.
What do you think about doing checks for all the topics first, outputting a list of the ones that will be skipped and why, prompting the user for (y/n) to continue based on that "plan", and then rebalancing the topics that have passed the checks? This way it's more obvious which topics actually failed to rebalance vs. which ones didn't attempt to rebalance at all.

EDIT: I guess things can change between the plan and rebalance phases. Maybe this isn't a good idea.
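A rough sketch of that plan-then-confirm flow; preRebalanceChecks and rebalanceTopic are hypothetical helpers, and util.Confirm is assumed to behave as it does in other topicctl subcommands:

// Hypothetical plan-then-confirm flow; helper names are illustrative.
// Assumes imports of context, topicctl's config and util packages, and
// logrus as log.
func rebalanceAll(ctx context.Context, topics []config.TopicConfig) error {
	var passed, skipped []config.TopicConfig
	for _, topic := range topics {
		if err := preRebalanceChecks(ctx, topic); err != nil {
			log.Warnf("Will skip topic %s: %v", topic.Meta.Name, err)
			skipped = append(skipped, topic)
			continue
		}
		passed = append(passed, topic)
	}
	log.Infof("Plan: rebalance %d topics, skip %d", len(passed), len(skipped))
	if ok, _ := util.Confirm("OK to continue?", false); !ok {
		return nil
	}
	for _, topic := range passed {
		// Caveat from the EDIT above: cluster state can drift between the
		// plan phase and this apply phase.
		if err := rebalanceTopic(ctx, topic); err != nil {
			log.Errorf("Rebalance failed for topic %s: %v", topic.Meta.Name, err)
		}
	}
	return nil
}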
	dryRun                     bool
	partitionBatchSizeOverride int
	pathPrefix                 string
	autoContinueRebalance      bool
Is there a use case for needing this flag in the context of a "rebalance everything" command? Seems like it should be the default behavior.
Will be resolved in the next PR.
	partitionBatchSizeOverride   int
	pathPrefix                   string
	autoContinueRebalance        bool
	retentionDropStepDurationStr string
This flag also seems unnecessary in the context of a dedicated rebalancing command.
Will be resolved in the next PR.
// there can be a situation where topics folder is not in cluster dir
// hence we specify both --cluster-config and --path-prefix for action:rebalance
// However, we check the path-prefix topic configs with cluster yaml for consistency before applying
if rebalanceConfig.shared.clusterConfig == "" || rebalanceConfig.pathPrefix == "" {
	return fmt.Errorf("Must set args --cluster-config & --path-prefix (or) env variables TOPICCTL_CLUSTER_CONFIG & TOPICCTL_APPLY_PATH_PREFIX")
This wording is confusing. Can you clarify why the rebalance command requires both flags when apply --rebalance does not? Maybe something like:

Unlike apply --rebalance, which takes a single topic config as a primary argument, rebalance takes the cluster config as a primary argument, as it acts on all of the given cluster's configured topics. It also needs to know where those topic configs are, since they can exist in arbitrary locations. Thus, both --cluster-config and --path-prefix are required.
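As a sketch, that explanation could also be folded into the validation error itself; the wording is adapted from the suggestion above and is not code from the PR:

if rebalanceConfig.shared.clusterConfig == "" || rebalanceConfig.pathPrefix == "" {
	// Unlike `apply --rebalance`, which takes topic configs as arguments,
	// `rebalance` acts on all of a cluster's configured topics, so it needs
	// both the cluster config and the location of the topic configs.
	return fmt.Errorf(
		"rebalance requires both --cluster-config (or TOPICCTL_CLUSTER_CONFIG) and --path-prefix (or TOPICCTL_APPLY_PATH_PREFIX)",
	)
}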
	cancel()
}()

// Keep a cache of the admin clients with the cluster config path as the key
Can you explain the reasoning behind/purpose for this?
Will be fixed in the next PR.
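For reference, a minimal sketch of what such a cache could look like, with newAdminClient standing in for however the command actually constructs clients (an assumed helper, not from the PR):

// Hypothetical sketch: cache admin clients keyed by cluster config path so
// each cluster's client is constructed only once across topics.
adminClients := map[string]admin.Client{}

getAdminClient := func(ctx context.Context, clusterConfigPath string) (admin.Client, error) {
	if client, ok := adminClients[clusterConfigPath]; ok {
		return client, nil
	}
	client, err := newAdminClient(ctx, clusterConfigPath) // assumed constructor
	if err != nil {
		return nil, err
	}
	adminClients[clusterConfigPath] = client
	return client, nil
}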
// Print Metrics for each iteration
// This can be configured but that will need lots of TopicConfig struct changes
// For now, emitting metrics every 5 seconds. refer metricDuration
5 seconds seems excessively frequent, especially considering sleep-loop-duration defaults to 10s. I think 30s or even 60s should be sufficient to monitor processes expected to take hours.
Also, rather than emitting metrics on some timer, what do you think of the idea of emitting them when transitions occur?
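A sketch of what transition-driven emission could look like, with hypothetical names (not from the PR): the rebalance loop calls an emit function whenever a batch changes state instead of running a ticker goroutine.

// Hypothetical sketch of event-driven metric emission instead of a timer.
type rebalanceMetrics struct {
	TopicsDone        int
	PartitionsMoved   int
	PartitionsPending int
}

// emitMetrics logs the current snapshot; it would be called at state
// transitions (batch started, batch completed, topic finished) rather
// than on a fixed interval.
func emitMetrics(m rebalanceMetrics) {
	log.Infof("Metric: %+v", m)
}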
if err != nil {
	log.Errorf("Error: %+v", err)
}
go util.PrintMetrics(metricStr, stop)
Something like EmitMetrics would be a better name than PrintMetrics, which specifically implies a print() function; that wouldn't be the case if using a generalized stats package.
Duplicate of #142, closing this.
Introduction:
To rebalance a topic in topicctl, we currently use topicctl apply $topic --rebalance or topicctl apply $topic --rebalance --to-remove. In a production-scale Kafka cluster, there is a need to rebalance all topics for a cluster.

Goal:
Add a topicctl rebalance action that rebalances all of a cluster's configured topics. Metrics for each iteration are logged in the format [timestamp] INFO Metric: {}. Please refer to the screenshots below for output.

Considerations for action rebalance:
NOTE:
A topic is not considered for rebalance if it does not exist in the cluster, or if its live settings do not match its topic yaml config (see test-1 and test-3 in the test output below); a condensed sketch of this gate follows.
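A condensed sketch of that gate, with topicExists and settingsMatch as hypothetical stand-ins for the PR's actual checks:

// Hypothetical sketch of the per-topic gate before rebalancing; the helper
// names are illustrative, not from the PR.
func shouldRebalance(ctx context.Context, topicConfig config.TopicConfig) bool {
	if !topicExists(ctx, topicConfig.Meta.Name) {
		log.Warnf("Skipping %s: topic does not exist in the cluster", topicConfig.Meta.Name)
		return false
	}
	if !settingsMatch(ctx, topicConfig) {
		log.Warnf("Skipping %s: cluster settings do not match the topic yaml config", topicConfig.Meta.Name)
		return false
	}
	return true
}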
Test Output:
Tested on a local kafka cluster (3 brokers, 1 in each rack) with 3 topics:
test-1 -> exists in kafka, but its settings mismatch the test-1 topic yaml config
test-2 -> exists in kafka; settings match (replication 2)
test-3 -> does not exist in kafka
# go run cmd/topicctl/main.go -h
# go run cmd/topicctl/main.go rebalance -h
# go run cmd/topicctl/main.go rebalance --cluster-config $CLUSTER/cluster.yaml --path-prefix $CLUSTER/topics/ --auto-continue-rebalance --skip-confirm --partition-batch-size 5 --broker-throttle-mb 640
# go run cmd/topicctl/main.go rebalance --cluster-config $CLUSTER/cluster.yaml --path-prefix $CLUSTER/topics/ --auto-continue-rebalance --skip-confirm --partition-batch-size 5 --broker-throttle-mb 640 --to-remove 3