Adding action rebalance #137
Conversation
Great work!
cmd/topicctl/subcmd/rebalance.go
	clusterConfig config.ClusterConfig,
	adminClient admin.Client,
) error {
	// validate topic config is valid for the cluster config
This function seems to reimplement a lot of the CheckTopic function. I'd suggest using that function instead
Line 25 in dd82be7
func CheckTopic(ctx context.Context, config CheckConfig) (TopicCheckResults, error) {
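For reference, a minimal sketch of what delegating to CheckTopic from the rebalance path might look like. Only the signature above is confirmed; the CheckConfig field names and the AllOK method below are assumptions, not code from the PR:

// Hypothetical sketch: delegate per-topic validation to the existing check
// package instead of reimplementing it. Assumes imports of context, fmt, and
// topicctl's admin, check, and config packages; the CheckConfig fields and
// AllOK are assumed names.
func validateTopicForRebalance(
	ctx context.Context,
	clusterConfig config.ClusterConfig,
	topicConfig config.TopicConfig,
	adminClient admin.Client,
) error {
	results, err := check.CheckTopic(ctx, check.CheckConfig{
		AdminClient:   adminClient,
		ClusterConfig: clusterConfig,
		TopicConfig:   topicConfig,
	})
	if err != nil {
		return err
	}
	if !results.AllOK() {
		return fmt.Errorf("topic %s failed pre-rebalance checks", topicConfig.Meta.Name)
	}
	return nil
}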
This function seems to reimplement a lot of the CheckTopic function.

I intentionally did not use the CheckTopic function. CheckTopic performs in-depth checks on a topic's settings, configurations, and racks, along with throttles, figuring out wrong-leader aspects, and so on.
For the rebalance action, the priority is to ensure that crucial settings (partitions/retention) have not changed, since such changes are majorly disruptive. topicctl apply does not take any action when replication is modified anyway.
This approach has been followed in the interest of keeping the rebalance action's basic topic checks simple.
Ok, sounds good. What about taking the common functionality and putting it into helper functions that are used in each function? That way we have one blessed way of doing each check.
I am concerned about the possibility of the two ways of doing checks drifting from each other
Can you elaborate on this, please?
Rebalance only needs some specific checks. When the rebalance action proceeds with applying a topic, the apply step performs a bunch of other work.
What about taking the common functionality and putting it into helper functions that are used in each function? That way we have one blessed way of doing each check.

Made some changes. Please review and resolve.
Since a lot of this check code is duplicated here, I was thinking you could move those individual checks into small private functions in the check package and then use them both there and in your new rebalance functions. That way we have one defined way of doing a retention check, for example.
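For concreteness, one possible shape for that, as a sketch; the helper names and field accesses below are illustrative, not code from the PR:

package check

import (
	"fmt"
	"strconv"

	"github.com/segmentio/topicctl/pkg/admin"
	"github.com/segmentio/topicctl/pkg/config"
)

// Hypothetical private helpers in the check package. Both CheckTopic and
// the rebalance pre-checks would call these, so each check is defined once.

// checkPartitionCount verifies that the live partition count matches the
// count declared in the topic config.
func checkPartitionCount(topicConfig config.TopicConfig, topicInfo admin.TopicInfo) error {
	if len(topicInfo.Partitions) != topicConfig.Spec.Partitions {
		return fmt.Errorf(
			"topic %s has %d partitions in the cluster but %d in its config",
			topicConfig.Meta.Name,
			len(topicInfo.Partitions),
			topicConfig.Spec.Partitions,
		)
	}
	return nil
}

// checkRetention verifies that the live retention.ms matches the retention
// declared in the topic config.
func checkRetention(topicConfig config.TopicConfig, topicInfo admin.TopicInfo) error {
	expectedMs := topicConfig.Spec.RetentionMinutes * 60 * 1000
	actualMs, err := strconv.Atoi(topicInfo.Config["retention.ms"])
	if err != nil {
		return fmt.Errorf("could not parse retention.ms for topic %s: %v", topicConfig.Meta.Name, err)
	}
	if actualMs != expectedMs {
		return fmt.Errorf(
			"topic %s has retention.ms %d in the cluster but %d in its config",
			topicConfig.Meta.Name, actualMs, expectedMs,
		)
	}
	return nil
}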
TBH, it's not duplicate code.
Even if I could break up CheckTopic, its results object is a dict that maintains the output for each check, which is something we don't need!
Or am I missing something?
The three steps in here are already a unit.
I still feel this is simple. Also, if we feel like adding more checks in the future, then we can consider breaking up the CheckTopic functionality.
I am trying to avoid CheckTopic unless it is needed.
Please update the PR description to remove the examples where topics were being created and add an example of what it looks like when topics are rebalanced. A lot of logging was added and I wonder if things are too verbose now.
The PR description has been updated with detailed output screenshots, along with test cluster and topic information.
What do you think about doing checks for all the topics first, outputting a list of the ones that will be skipped and why, prompting the user for (y/n) to continue based on that "plan", and then rebalancing the topics that have passed the checks? This way it's more obvious which topics actually failed to rebalance vs. which ones didn't attempt to rebalance at all.

EDIT: I guess things can change between the plan and rebalance phases. Maybe this isn't a good idea.
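A rough sketch of that plan-then-confirm flow; preRebalanceChecks and rebalanceTopic are hypothetical helpers, and util.Confirm is assumed to behave as it does in other topicctl subcommands:

// Hypothetical plan-then-confirm flow; helper names are illustrative.
// Assumes imports of context, topicctl's config and util packages, and
// logrus as log.
func rebalanceAll(ctx context.Context, topics []config.TopicConfig) error {
	var passed, skipped []config.TopicConfig
	for _, topic := range topics {
		if err := preRebalanceChecks(ctx, topic); err != nil {
			log.Warnf("Will skip topic %s: %v", topic.Meta.Name, err)
			skipped = append(skipped, topic)
			continue
		}
		passed = append(passed, topic)
	}
	log.Infof("Plan: rebalance %d topics, skip %d", len(passed), len(skipped))
	if ok, _ := util.Confirm("OK to continue?", false); !ok {
		return nil
	}
	for _, topic := range passed {
		// Caveat from the EDIT above: cluster state can drift between the
		// plan phase and this apply phase.
		if err := rebalanceTopic(ctx, topic); err != nil {
			log.Errorf("Rebalance failed for topic %s: %v", topic.Meta.Name, err)
		}
	}
	return nil
}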
	dryRun                     bool
	partitionBatchSizeOverride int
	pathPrefix                 string
	autoContinueRebalance      bool
Is there a use case for needing this flag in the context of a "rebalance everything" command? Seems like it should be the default behavior.
Will be resolved in the next PR.
	partitionBatchSizeOverride   int
	pathPrefix                   string
	autoContinueRebalance        bool
	retentionDropStepDurationStr string
This flag also seems unnecessary in the context of a dedicated rebalancing command.
Will be resolved in the next PR.
// there can be a situation where topics folder is not in cluster dir
// hence we specify both --cluster-config and --path-prefix for action:rebalance
// However, we check the path-prefix topic configs with cluster yaml for consistency before applying
if rebalanceConfig.shared.clusterConfig == "" || rebalanceConfig.pathPrefix == "" {
	return fmt.Errorf("Must set args --cluster-config & --path-prefix (or) env variables TOPICCTL_CLUSTER_CONFIG & TOPICCTL_APPLY_PATH_PREFIX")
This wording is confusing. Can you clarify why the rebalance command requires both flags when apply --rebalance does not? Maybe something like:

Unlike apply --rebalance, which takes a single topic config as a primary argument, rebalance takes the cluster config as a primary argument, as it acts on all of the given cluster's configured topics. It also needs to know where those topic configs are, since they can exist in arbitrary locations. Thus, both --cluster-config and --path-prefix are required.
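As a sketch, that explanation could also be folded into the validation error itself; the wording is adapted from the suggestion above and is not code from the PR:

if rebalanceConfig.shared.clusterConfig == "" || rebalanceConfig.pathPrefix == "" {
	// Unlike `apply --rebalance`, which takes topic configs as arguments,
	// `rebalance` acts on all of a cluster's configured topics, so it needs
	// both the cluster config and the location of the topic configs.
	return fmt.Errorf(
		"rebalance requires both --cluster-config (or TOPICCTL_CLUSTER_CONFIG) and --path-prefix (or TOPICCTL_APPLY_PATH_PREFIX)",
	)
}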
	cancel()
}()

// Keep a cache of the admin clients with the cluster config path as the key
Can you explain the reasoning behind/purpose for this?
Will be fixed in the next PR.
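For reference, a minimal sketch of what such a cache could look like, with newAdminClient standing in for however the command actually constructs clients (an assumed helper, not from the PR):

// Hypothetical sketch: cache admin clients keyed by cluster config path so
// each cluster's client is constructed only once across topics.
adminClients := map[string]admin.Client{}

getAdminClient := func(ctx context.Context, clusterConfigPath string) (admin.Client, error) {
	if client, ok := adminClients[clusterConfigPath]; ok {
		return client, nil
	}
	client, err := newAdminClient(ctx, clusterConfigPath) // assumed constructor
	if err != nil {
		return nil, err
	}
	adminClients[clusterConfigPath] = client
	return client, nil
}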
// Print Metrics for each iteration
// This can be configured but that will need lots of TopicConfig struct changes
// For now, emitting metrics every 5 seconds. refer metricDuration
5 seconds seems excessively frequent, especially considering sleep-loop-duration defaults to 10s. I think 30s or even 60s should be sufficient to monitor processes expected to take hours.
Also, rather than emitting metrics on some timer, what do you think of the idea of emitting them when transitions occur?
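A sketch of what transition-driven emission could look like, with hypothetical names (not from the PR): the rebalance loop calls an emit function whenever a batch changes state instead of running a ticker goroutine.

// Hypothetical sketch of event-driven metric emission instead of a timer.
type rebalanceMetrics struct {
	TopicsDone        int
	PartitionsMoved   int
	PartitionsPending int
}

// emitMetrics logs the current snapshot; it would be called at state
// transitions (batch started, batch completed, topic finished) rather
// than on a fixed interval.
func emitMetrics(m rebalanceMetrics) {
	log.Infof("Metric: %+v", m)
}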
if err != nil {
	log.Errorf("Error: %+v", err)
}
go util.PrintMetrics(metricStr, stop)
Something like EmitMetrics would be a better name than PrintMetrics, which specifically implies a print() function; that wouldn't be the case if using a generalized stats package.
Duplicate of #142, closing this.
Introduction:
To rebalance a topic in topicctl, we currently use topicctl apply $topic --rebalance or topicctl apply $topic --rebalance --to-remove. In a production-scale Kafka cluster, there is a need to rebalance all topics for a cluster.

Goal:
Add a topicctl rebalance action that rebalances all of a cluster's configured topics. Metrics for each iteration are logged in the format [timestamp] INFO Metric: {}. Please refer to the screenshots below for output.

Considerations for action rebalance:
NOTE:
A topic is not considered for rebalance if it does not exist in the cluster, or if its live settings do not match its topic yaml config (see test-1 and test-3 in the test output below); a condensed sketch of this gate follows.
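A condensed sketch of that gate, with topicExists and settingsMatch as hypothetical stand-ins for the PR's actual checks:

// Hypothetical sketch of the per-topic gate before rebalancing; the helper
// names are illustrative, not from the PR.
func shouldRebalance(ctx context.Context, topicConfig config.TopicConfig) bool {
	if !topicExists(ctx, topicConfig.Meta.Name) {
		log.Warnf("Skipping %s: topic does not exist in the cluster", topicConfig.Meta.Name)
		return false
	}
	if !settingsMatch(ctx, topicConfig) {
		log.Warnf("Skipping %s: cluster settings do not match the topic yaml config", topicConfig.Meta.Name)
		return false
	}
	return true
}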
Test Output:
Tested on a local kafka cluster (3 brokers, 1 in each rack) with 3 topics:
test-1 -> exists in kafka, but its settings mismatch the test-1 topic yaml config
test-2 -> exists in kafka; settings match (replication 2)
test-3 -> does not exist in kafka
# go run cmd/topicctl/main.go -h
# go run cmd/topicctl/main.go rebalance -h
# go run cmd/topicctl/main.go rebalance --cluster-config $CLUSTER/cluster.yaml --path-prefix $CLUSTER/topics/ --auto-continue-rebalance --skip-confirm --partition-batch-size 5 --broker-throttle-mb 640
# go run cmd/topicctl/main.go rebalance --cluster-config $CLUSTER/cluster.yaml --path-prefix $CLUSTER/topics/ --auto-continue-rebalance --skip-confirm --partition-batch-size 5 --broker-throttle-mb 640 --to-remove 3