
Adding action rebalance #137

Closed · wants to merge 9 commits

Conversation

@ssingudasu (Contributor) commented May 18, 2023:

Introduction:
To rebalance a topic in topicctl, we currently use `topicctl apply $topic --rebalance` or `topicctl apply $topic --rebalance --to-remove`. In a production-scale Kafka cluster, there is a need to rebalance all topics in a cluster.

Goal:

  • Add a new action, rebalance, to topicctl, i.e. `topicctl rebalance`.
  • Print rebalance status / metric information to stdout during the rebalance, as log lines of the form `[timestamp] INFO Metric: {}`. Please refer to the screenshots below for output.
  • The rebalance status / metrics information is twofold: the overall topic rebalance status, and the topic rebalance status for each iteration.

Considerations for action rebalance:

  1. Expects --cluster-config (i.e. cluster.yaml) and --path-prefix (the topics folder).
  2. Uses the underlying topicctl pkg/apply to rebalance each topic.

NOTE:
A topic is not considered for rebalance if any of the following hold (see the sketch after this list):

  • the topic config is inconsistent with the cluster config (name, region, environment, etc.)
  • the topic's partition count in the kafka cluster does not match the partitions setting in the topic config
  • the topic's retention.ms in the kafka cluster does not match the retentionMinutes setting in the topic config
  • the topic does not exist in the kafka cluster
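
A minimal sketch of what these eligibility checks could look like. The types and field names below are simplified stand-ins invented for illustration, not topicctl's actual definitions:

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical, simplified stand-ins for topicctl's config and admin types.
type topicConfig struct {
	Name             string
	Cluster          string
	Partitions       int
	RetentionMinutes int
}

type liveTopic struct {
	Exists      bool
	Cluster     string
	Partitions  int
	RetentionMs int64
}

// eligibleForRebalance mirrors the skip rules above: the topic must exist,
// be consistent with the cluster config, and have unchanged partition and
// retention settings.
func eligibleForRebalance(cfg topicConfig, live liveTopic) error {
	if !live.Exists {
		return errors.New("topic does not exist in the kafka cluster")
	}
	if cfg.Cluster != live.Cluster {
		return errors.New("topic config is inconsistent with cluster config")
	}
	if cfg.Partitions != live.Partitions {
		return errors.New("partition counts do not match")
	}
	if int64(cfg.RetentionMinutes)*60*1000 != live.RetentionMs {
		return errors.New("retention.ms does not match retentionMinutes")
	}
	return nil
}

func main() {
	cfg := topicConfig{Name: "test-1", Cluster: "local", Partitions: 3, RetentionMinutes: 60}
	live := liveTopic{Exists: true, Cluster: "local", Partitions: 6, RetentionMs: 3600000}
	if err := eligibleForRebalance(cfg, live); err != nil {
		fmt.Printf("skipping %s: %v\n", cfg.Name, err) // skipping test-1: partition counts do not match
	}
}
```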

Test Output:

Tested on a local kafka cluster (3 brokers, 1 in each rack) with 3 topics:
test-1 -> exists in kafka, but its settings mismatch the test-1 topic yaml config
test-2 -> exists in kafka; settings match (replication 2)
test-3 -> does not exist in kafka


# go run cmd/topicctl/main.go -h
[screenshot: top-level help output]

# go run cmd/topicctl/main.go rebalance -h
[screenshots: rebalance help output]

# go run cmd/topicctl/main.go rebalance --cluster-config $CLUSTER/cluster.yaml --path-prefix $CLUSTER/topics/ --auto-continue-rebalance --skip-confirm --partition-batch-size 5 --broker-throttle-mb 640
[screenshot: rebalance run output]


# go run cmd/topicctl/main.go rebalance --cluster-config $CLUSTER/cluster.yaml --path-prefix $CLUSTER/topics/ --auto-continue-rebalance --skip-confirm --partition-batch-size 5 --broker-throttle-mb 640 --to-remove 3

[screenshots: step-by-step output of the rebalance run with --to-remove 3]

@ssingudasu requested a review from a team as a code owner May 18, 2023 15:12
@petedannemann (Contributor) left a comment:

Great work!

Review threads on cmd/topicctl/subcmd/rebalance.go (4 threads, all resolved)
clusterConfig config.ClusterConfig,
adminClient admin.Client,
) error {
// validate topic config is valid for the cluster config
Contributor:

This function seems to reimplement a lot of the CheckTopic function. I'd suggest using that function instead

func CheckTopic(ctx context.Context, config CheckConfig) (TopicCheckResults, error) {
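
A hypothetical sketch of the call pattern this suggestion implies: run CheckTopic first and only rebalance topics whose checks pass. The CheckConfig field and TopicCheckResults shape below are invented stand-ins, not topicctl's actual definitions:

```go
package main

import (
	"context"
	"fmt"
)

// Invented stand-ins for the check package's types; the real CheckConfig
// and TopicCheckResults carry much more detail.
type CheckConfig struct {
	TopicName string
}

type TopicCheckResults struct {
	OK bool
}

// checkTopic is a stub with the same shape as the CheckTopic signature
// quoted above; the real function performs the in-depth checks.
func checkTopic(ctx context.Context, cfg CheckConfig) (TopicCheckResults, error) {
	return TopicCheckResults{OK: true}, nil
}

func main() {
	results, err := checkTopic(context.Background(), CheckConfig{TopicName: "test-2"})
	if err != nil || !results.OK {
		fmt.Println("skipping topic: checks failed")
		return
	}
	fmt.Println("topic passed checks; proceeding with rebalance")
}
```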

@ssingudasu (Contributor, author) commented May 18, 2023:

> This function seems to reimplement a lot of the CheckTopic function.

I intentionally did not use the CheckTopic function.

Check performs in-depth checks on a topic: settings, configurations, and racks, along with throttles, figuring out wrong-leader aspects, and so on.

For action rebalance, the priority is to ensure that the crucial settings (partitions/retention) did not change, since those changes are majorly disruptive. topicctl apply does not take any action when replication is modified anyway.

This approach has been followed in the interest of keeping action rebalance's basic topic checks simple.

Contributor:

Ok, sounds good. What about taking the common functionality and putting it into helper functions that are used in each function? That way we have one blessed way of doing each check.

@petedannemann (Contributor) commented May 18, 2023:

I am concerned about the possibility of the two ways of doing checks drifting from each other

@ssingudasu (Contributor, author) commented May 18, 2023:

Can you elaborate on this, please?

Rebalance only needs some specific checks. When action rebalance proceeds with applying a topic, apply performs a bunch of other stuff.

Contributor (author):

> What about taking the common functionality and putting it into helper functions that are used in each function? That way we have one blessed way of doing each check.

Made some changes. Please review and resolve.

Contributor:

Since a lot of this check code is duplicated here, I was thinking that in the check package you could move those individual checks into small private functions, and then use them in the check package as well as in your new rebalance functions. That way we have one defined way of doing a retention check, for example.
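
A minimal sketch of the suggested refactor, using invented names: a single private helper owns the retention comparison, and both the check path and the rebalance path call it, so the two cannot drift apart.

```go
package main

import "fmt"

// checkRetentionMatch is the single "blessed" retention check: it compares
// the configured retentionMinutes against the live retention.ms value.
func checkRetentionMatch(retentionMinutes int, liveRetentionMs int64) bool {
	return int64(retentionMinutes)*60*1000 == liveRetentionMs
}

// Used by a check-style code path...
func retentionCheckResult(retentionMinutes int, liveRetentionMs int64) string {
	if checkRetentionMatch(retentionMinutes, liveRetentionMs) {
		return "OK"
	}
	return "retention mismatch"
}

// ...and by a rebalance-style code path.
func retentionUnchanged(retentionMinutes int, liveRetentionMs int64) bool {
	return checkRetentionMatch(retentionMinutes, liveRetentionMs)
}

func main() {
	fmt.Println(retentionCheckResult(60, 3600000)) // OK
	fmt.Println(retentionUnchanged(60, 1800000))   // false
}
```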

Contributor (author):

TBH, it's not duplicate code.

Even if I broke up CheckTopic, its results value is a dict that maintains the output for every check, which is something we don't need!

Or am I missing something?

The three checks here already form a unit.

I still feel this is simple. If we feel like adding more checks in the future, then we can consider breaking up the CheckTopic functionality.

I am trying to avoid CheckTopic unless it is needed.

@petedannemann (Contributor) commented May 18, 2023:

Please update the PR description to remove the examples where topics were being created, and add an example of what it looks like when topics are rebalanced. A lot of logging was added and I wonder if things are too verbose now.

@ssingudasu (Contributor, author):

> Please update the PR description to remove the examples where topics were being created, and add an example of what it looks like when topics are rebalanced. A lot of logging was added and I wonder if things are too verbose now.

The PR description has been updated with detailed output screenshots along with the test cluster and topic information.

@petedannemann (Contributor) commented May 18, 2023:

What do you think about doing checks for all the topics first, outputting a list of the ones that will be skipped and why, prompting the user for (y/n) to continue based on that "plan", and then rebalancing the topics that passed the checks? This way it's more obvious which topics actually failed to rebalance vs. which ones never attempted a rebalance at all.

EDIT: I guess things can change between the plan and rebalance phases. Maybe this isn't a good idea.
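
A sketch of the proposed two-phase flow, with invented names and canned data: run all checks first, print the plan (which topics will be skipped and why), prompt once, then rebalance the rest.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	// Phase 1: results of the up-front checks (canned here for illustration).
	skipReasons := map[string]string{
		"test-1": "settings mismatch with topic config",
		"test-3": "topic does not exist in kafka cluster",
	}
	toRebalance := []string{"test-2"}

	// Print the plan before doing anything.
	for topic, reason := range skipReasons {
		fmt.Printf("will skip %s: %s\n", topic, reason)
	}
	fmt.Printf("will rebalance: %s\n", strings.Join(toRebalance, ", "))

	// One confirmation for the whole plan.
	fmt.Print("continue? (y/n): ")
	answer, _ := bufio.NewReader(os.Stdin).ReadString('\n')
	if strings.TrimSpace(answer) != "y" {
		fmt.Println("aborted")
		return
	}

	// Phase 2: rebalance only the topics that passed the checks.
	for _, topic := range toRebalance {
		fmt.Printf("rebalancing %s...\n", topic) // the real code would call pkg/apply here
	}
}
```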

dryRun bool
partitionBatchSizeOverride int
pathPrefix string
autoContinueRebalance bool
Contributor:

Is there a use case for needing this flag in the context of a "rebalance everything" command? Seems like it should be the default behavior.

Contributor (author):

Will be resolved in the next PR.

partitionBatchSizeOverride int
pathPrefix string
autoContinueRebalance bool
retentionDropStepDurationStr string
Contributor:

This flag also seems unnecessary in the context of a dedicated rebalancing command.

Contributor (author):

Will be resolved in the next PR.

Comment on lines +126 to +130
// there can be a situation where topics folder is not in cluster dir
// hence we specify both --cluster-config and --path-prefix for action:rebalance
// However, we check the path-prefix topic configs with cluster yaml for consistency before applying
if rebalanceConfig.shared.clusterConfig == "" || rebalanceConfig.pathPrefix == "" {
return fmt.Errorf("Must set args --cluster-config & --path-prefix (or) env variables TOPICCTL_CLUSTER_CONFIG & TOPICCTL_APPLY_PATH_PREFIX")
@hhahn-tw (Contributor) commented May 19, 2023:

This wording is confusing. Can you clarify why the rebalance command requires both flags when apply --rebalance does not?

Maybe something like

Unlike apply --rebalance, which takes a single topic config as a primary argument, rebalance takes the cluster config as a primary argument as it acts on all of the given cluster's configured topics. It also needs to know where those topic configs are since they can exist in arbitrary locations. Thus, both --cluster-config and --path-prefix are required.

cancel()
}()

// Keep a cache of the admin clients with the cluster config path as the key
Contributor:

Can you explain the reasoning behind/purpose for this?

Contributor (author):

Will be fixed in the next PR.
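
For context, a sketch of the caching pattern the code comment above describes, with an invented adminClient type: one client is created per cluster config path and reused, instead of reconnecting for every topic.

```go
package main

import "fmt"

// adminClient is an invented stand-in for topicctl's admin client.
type adminClient struct {
	configPath string
}

func newAdminClient(configPath string) *adminClient {
	fmt.Printf("connecting using %s\n", configPath)
	return &adminClient{configPath: configPath}
}

// getClient returns a cached client for the given cluster config path,
// creating and storing it on first use.
func getClient(cache map[string]*adminClient, configPath string) *adminClient {
	if client, ok := cache[configPath]; ok {
		return client
	}
	client := newAdminClient(configPath)
	cache[configPath] = client
	return client
}

func main() {
	cache := map[string]*adminClient{}
	getClient(cache, "clusters/local/cluster.yaml") // connects
	getClient(cache, "clusters/local/cluster.yaml") // cache hit; no second connection
}
```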


// Print metrics for each iteration.
// This could be made configurable, but that would need lots of TopicConfig struct changes.
// For now, emitting metrics every 5 seconds; refer to metricDuration.
Contributor:

5 seconds seems excessively frequent, especially considering sleep-loop-duration defaults to 10s. I think 30s or even 60s should be sufficient to monitor processes expected to take hours.

Also, rather than emitting metrics on some timer, what do you think of the idea of emitting them when transitions occur?
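
A sketch contrasting the two approaches, with invented names: a ticker-based emitter (at a longer interval, as suggested) versus emitting a metric line only when a rebalance transition occurs.

```go
package main

import (
	"fmt"
	"time"
)

// emitOnInterval prints a metric line on a fixed timer until stopped.
func emitOnInterval(stop <-chan struct{}, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			fmt.Println("INFO Metric: {} (periodic)")
		case <-stop:
			return
		}
	}
}

// emitOnTransition prints a metric line only when a state change arrives.
func emitOnTransition(transitions <-chan string) {
	for t := range transitions {
		fmt.Printf("INFO Metric: {} (transition: %s)\n", t)
	}
}

func main() {
	stop := make(chan struct{})
	go emitOnInterval(stop, 30*time.Second) // 30s instead of 5s, per the review comment

	transitions := make(chan string, 1)
	go emitOnTransition(transitions)
	transitions <- "partition batch reassigned"
	time.Sleep(100 * time.Millisecond) // give the goroutine time to print
	close(stop)
	close(transitions)
}
```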

if err != nil {
log.Errorf("Error: %+v", err)
}
go util.PrintMetrics(metricStr, stop)
Contributor:

Something like EmitMetrics would be a better name than PrintMetrics, which specifically implies a print() function; that wouldn't be the case if using a generalized stats package.

@ssingudasu (Contributor, author):

Duplicate of #142.

Closing this.

@ssingudasu closed this Jul 18, 2023