Prevent broker scaledown if broker contains partition replicas - Part 1 #9042

Merged
merged 10 commits on Sep 22, 2023

Conversation

@ShubhamRwt (Contributor) commented Aug 29, 2023

Type of change

Select the type of your PR

  • Bugfix
  • Enhancement / new feature
  • Refactoring
  • Documentation

Description

This PR introduces a mechanism to stop the scale-down if a broker contains partition replicas. If partition replicas are present on the broker, the scale-down is halted and the reconciliation fails. In the second part of this work, we plan to revert the replica change instead of just failing the reconciliation.
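
For illustration, here is a minimal sketch of the core check described above, assuming a hypothetical helper class and direct use of the Kafka Admin client (the PR itself wires this through the operator's Admin client provider and Vert.x futures): describe all topics and report whether any partition replica still lives on one of the broker IDs that are about to be removed.

```java
import java.util.Collection;
import java.util.Set;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.TopicDescription;

// Hypothetical sketch of the scale-down check, not the PR's exact implementation.
public class BrokerScaleDownCheckSketch {
    /**
     * Returns true if any partition replica is still hosted on one of the brokers
     * that are about to be scaled down.
     */
    public static boolean brokersContainPartitionReplicas(Admin admin, Set<Integer> idsToBeRemoved) throws Exception {
        // Describe every topic in the cluster
        Set<String> topicNames = admin.listTopics().names().get();
        Collection<TopicDescription> descriptions = admin.describeTopics(topicNames).allTopicNames().get().values();

        // Check each partition's replica list against the broker IDs being removed
        return descriptions.stream()
                .flatMap(td -> td.partitions().stream())
                .flatMap(partition -> partition.replicas().stream())
                .anyMatch(replica -> idsToBeRemoved.contains(replica.id()));
    }
}
```

If a check like this returns true, the operator halts the scale-down and fails the reconciliation; setting the bypass annotation (discussed below) skips the check.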

Checklist

Please go through this checklist and make sure all applicable tasks have been done

  • Write tests
  • Make sure all tests pass
  • Update documentation
  • Check RBAC rights for Kubernetes / OpenShift roles
  • Try your changes from Pod inside your Kubernetes and OpenShift cluster, not just locally
  • Reference relevant issue(s) and close them after merging
  • Update CHANGELOG.md
  • Supply screenshots for visual changes, such as Grafana dashboards

@ShubhamRwt changed the title from "Prevent broker scaledown if broker contains partition replicas" to "Prevent broker scaledown if broker contains partition replicas - 1" on Aug 29, 2023
@ShubhamRwt changed the title from "Prevent broker scaledown if broker contains partition replicas - 1" to "Prevent broker scaledown if broker contains partition replicas - Part 1" on Aug 29, 2023
@scholzj (Member) left a comment

I left some nits. You should also consider adding some tests.

@scholzj added this to the 0.38.0 milestone on Sep 1, 2023
@scholzj (Member) left a comment

You should also add a note to the CHANGELOG.md about this.

@ShubhamRwt (Contributor Author) commented:

@scholzj Hi, I have incorporated all the changes; only the CHANGELOG entry is left. Does this look good apart from that?

Comment on lines 82 to 86
/**
* Annotation used to bypass the broker scale-down mechanism
*/
public static final String ANNO_STRIMZI_IO_SKIP_BROKER_SCALEDOWN_CHECK = STRIMZI_DOMAIN + "skip-broker-scaledown-check";

Member:

Note for the future: Depending on what is merged first, this should reconcile with #9103 as this is a public annotation. (nothing to do right now, we just need to keep it in mind)
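
To make the intent of this public annotation concrete, here is a hedged sketch of how it could be consumed (the helper class is hypothetical, assumes STRIMZI_DOMAIN resolves to strimzi.io/, and is not code from this PR):

```java
import io.fabric8.kubernetes.api.model.HasMetadata;
import java.util.Map;

// Hypothetical helper, only to illustrate how the public bypass annotation might be read.
public class ScaleDownAnnotationSketch {
    // Assumes STRIMZI_DOMAIN is "strimzi.io/"
    private static final String ANNO_STRIMZI_IO_SKIP_BROKER_SCALEDOWN_CHECK =
            "strimzi.io/skip-broker-scaledown-check";

    /** Returns true when the user has opted out of the broker scale-down check. */
    public static boolean skipBrokerScaleDownCheck(HasMetadata kafkaCr) {
        Map<String, String> annotations = kafkaCr.getMetadata() != null
                ? kafkaCr.getMetadata().getAnnotations()
                : null;
        return annotations != null
                && "true".equalsIgnoreCase(annotations.get(ANNO_STRIMZI_IO_SKIP_BROKER_SCALEDOWN_CHECK));
    }
}
```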

@ShubhamRwt (Contributor Author) commented:

Pushed the changes, @scholzj. Just a doubt though: in the CHANGELOG entry "Make sure brokers are empty before scaling them down", shouldn't we mention the annotation as well?

@scholzj (Member) commented Sep 15, 2023

I don't think the CHANGELOG should supply docs. There should be docs covering the feature and that should list the annotation. If you want, you could also mention it in some log message. But in the CHANGELOG it seems unnecessary.

@ShubhamRwt (Contributor Author) commented:

> I don't think the CHANGELOG should supply docs. There should be docs covering the feature and that should list the annotation. If you want, you could also mention it in some log message. But in the CHANGELOG it seems unnecessary.

Okay, thanks Jakub.

@scholzj (Member) left a comment

Nice job 👍

@scholzj (Member) commented Sep 15, 2023

/azp run regression

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@scholzj (Member) commented Sep 16, 2023

@ShubhamRwt Can you please rebase it and move the new annotation to the api module? Thanks.

@ShubhamRwt (Contributor Author) commented:

Sure @scholzj, I will do that.

@@ -367,6 +367,22 @@ public Set<NodeRef> nodes() {
return nodes;
}

/**
* Generates list of references to Kafka nodes going to be removed from the Kafka cluster. The references contain both the pod name and
* the ID of the Kafka node.
Member:

Is this description wrong? I mean, it's returning just the ID, not the pod name, right?

return Future.succeededFuture();
} else {
return brokerScaleDownOperations.canScaleDownBrokers(reconciliation, kafka.removedNodes(), secretOperator, adminClientProvider)
.compose(s -> {
Member:

Can we use a more meaningful name for variable s? :)

* @param adminClientProvider Used to create the Admin client instance
* @param idsToBeRemoved Ids to be removed

* @return returns a boolean future based on the outcome of the check
Member:

Wrong return description

* @param reconciliation Reconciliation marker
* @param secretOperator Secret operator for working with Secrets
* @param adminClientProvider Used to create the Admin client instance
* @param idsToBeRemoved Ids to be removed
Member:

Just a nit: can you list the parameters in the same order as the method signature, please?

final Future<Collection<TopicDescription>> descriptions;
try {
String bootstrapHostname = KafkaResources.bootstrapServiceName(reconciliation.name()) + "." + reconciliation.namespace() + ".svc:" + KafkaCluster.REPLICATION_PORT;
LOGGER.debugCr(reconciliation, "Creating AdminClient for cluster {}/{}", reconciliation.namespace());
Member:

The debug message has two placeholders, but you are passing only one argument, the namespace.
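
A sketch of one way to address this, filling both placeholders with values already available in the surrounding code (this only illustrates the reviewer's point, not the committed fix):

```java
// Pass both the namespace and the cluster name so the two {} placeholders are filled
LOGGER.debugCr(reconciliation, "Creating AdminClient for cluster {}/{}", reconciliation.namespace(), reconciliation.name());
```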

} else {
namesPromise.complete(names);
}
});
Member:

Should we have some VertxUtils class method to go from a Kafka future to a Vertx one?
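
A minimal sketch of such a utility, with a hypothetical class name (not an existing project class): adapt the Admin client's KafkaFuture into a Vert.x Future via a Promise.

```java
import io.vertx.core.Future;
import io.vertx.core.Promise;
import org.apache.kafka.common.KafkaFuture;

// Hypothetical utility illustrating the suggestion above; not an existing project class.
public class VertxUtilSketch {
    public static <T> Future<T> kafkaFutureToVertxFuture(KafkaFuture<T> kafkaFuture) {
        Promise<T> promise = Promise.promise();
        kafkaFuture.whenComplete((result, error) -> {
            if (error != null) {
                promise.fail(error);        // propagate the Admin client failure
            } else {
                promise.complete(result);   // complete with the Admin client result
            }
        });
        return promise.future();
    }
}
```

One caveat: the completion callback runs on the Admin client's thread, so callers that must stay on the Vert.x context may need to hop back onto it explicitly.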

} else {
descPromise.complete(tds.values());
}
});
Member:

Ditto as above?

Signed-off-by: ShubhamRwt <shubhamrwt02@gmail.com>
Signed-off-by: ShubhamRwt <shubhamrwt02@gmail.com>
Signed-off-by: ShubhamRwt <shubhamrwt02@gmail.com>
@scholzj (Member) commented Sep 19, 2023

/azp run regression

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@scholzj (Member) commented Sep 19, 2023

@ShubhamRwt @henryZrncik It still seems to be failing ... is there one more place where it needs to be fixed? :-/

[ERROR] Errors: 
[ERROR] io.strimzi.systemtest.kafka.KafkaNodePoolST.testKafkaNodePoolBrokerIdsManagementUsingAnnotations(ExtensionContext)
[ERROR]   Run 1: KafkaNodePoolST.testKafkaNodePoolBrokerIdsManagementUsingAnnotations:280 » Wait Timeout after 180000 ms waiting for Pod: namespace-7/my-cluster-ea890678-kafka-pool-a18fff75-a to have 4 stable replicas
[ERROR]   Run 2: KafkaNodePoolST.testKafkaNodePoolBrokerIdsManagementUsingAnnotations:280 » Wait Timeout after 180000 ms waiting for Pod: namespace-15/my-cluster-33db8172-kafka-pool-a77b98c8-a to have 4 stable replicas
[ERROR]   Run 3: KafkaNodePoolST.testKafkaNodePoolBrokerIdsManagementUsingAnnotations:280 » Wait Timeout after 180000 ms waiting for Pod: namespace-16/my-cluster-6fe3a87f-kafka-pool-af5bc5a8-a to have 4 stable replicas

@ShubhamRwt (Contributor Author) commented:

I tested it a bit and yes, it looks like the test passes when I disable the broker scale-down check.

@ShubhamRwt (Contributor Author) commented:

Would it be fine, @scholzj @henryZrncik, to modify issue #9134 itself and add this failing test to the issue description instead of creating a separate issue?

@im-konge (Member) commented:

@ShubhamRwt I guess it's not good practice to disable all the tests (checks) that are failing. The one check that is now disabled makes sense, but this test - testKafkaNodePoolBrokerIdsManagementUsingAnnotations - should be fixed before merging the PR, FMPOV.

@henryZrncik (Contributor) commented:

The given test does not contain anything that should cause such an error (the test itself is currently passing, so it really is something caused by this change).

This is basically the only test that involved node pools and any kind of scaling and was part of the regression run. So I think it is worth investigating how scaling up/down works with regard to node pools and this PR. I am writing this mainly because the failure occurs during scale-up (there is nothing that should prevent it).

As @im-konge mentioned, disabling the first test was a good choice, since we mentioned here that it needs to be addressed after adding this new functionality, but this test is something that should work even after the change.

@im-konge (Member) commented:

Let's see if there are more tests failing with NodePools.

@im-konge (Member) commented:

/azp run feature-gates-regression

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@scholzj (Member) commented Sep 20, 2023

> I guess it's not good practice to disable all the tests (checks) that are failing. The one check that is now disabled makes sense, but this test - testKafkaNodePoolBrokerIdsManagementUsingAnnotations - should be fixed before merging the PR, FMPOV.

@ShubhamRwt is not disabling these tests, is he? He is only disabling his check.

> The given test does not contain anything that should cause such an error (the test itself is currently passing, so it really is something caused by this change).

@henryZrncik Are you really sure the test is correct today? I suspect it is not - it is likely passing mostly by luck. You scale down broker 6, which might contain partition replicas, without rescheduling them to other brokers. So I think the test is at fault, not Shubham's code.

@im-konge (Member) commented:

> @ShubhamRwt is not disabling these tests, is he? He is only disabling his check.

I misread it. Also, I thought that the test was correct, but you mentioned that it passes mostly out of luck. Anyway, if it's a test issue and we should update/fix the test, @ShubhamRwt can disable the check for now.

Signed-off-by: ShubhamRwt <shubhamrwt02@gmail.com>
@ShubhamRwt (Contributor Author) commented:

I have now disabled the broker scale down check for the failing test.

@scholzj (Member) commented Sep 20, 2023

/azp run regression

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@@ -229,7 +229,7 @@ void testKafkaNodePoolBrokerIdsManagementUsingAnnotations(ExtensionContext exten

Kafka kafka = KafkaTemplates.kafkaPersistent(testStorage.getClusterName(), 1, 1)
.editOrNewMetadata()
.addToAnnotations(Annotations.ANNO_STRIMZI_IO_NODE_POOLS, "enabled")
.addToAnnotations(Map.of(Annotations.ANNO_STRIMZI_IO_NODE_POOLS, "enabled", Annotations.ANNO_STRIMZI_IO_SKIP_BROKER_SCALEDOWN_CHECK, "true"))
Member:

Maybe adding a comment here would be helpful for the one who will fix those two tests, thanks! :)

Contributor Author:

I was wondering if I should mention this failing test in issue #9134 itself? I have already left a comment about this issue in the other failing test.

Contributor Author:

Wdyt?

Contributor:

I think it is a good idea. We can change these tests at the same time anyway.

Contributor Author:

Okay, I will update the issue then. Thanks @henryZrncik

Contributor Author:

Done

Signed-off-by: ShubhamRwt <shubhamrwt02@gmail.com>
@scholzj (Member) commented Sep 21, 2023

/azp run regression

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@scholzj (Member) commented Sep 21, 2023

/azp run feature-gates-regression

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@scholzj (Member) commented Sep 21, 2023

/azp run kraft-regression

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).
