Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add annotations to perform connector/task restart operations #4114

Merged
merged 5 commits into from Jan 12, 2021

Conversation

ajborley
Copy link
Member

@ajborley ajborley commented Dec 15, 2020

Type of change

  • Enhancement / new feature

Description

This PR implements the first part of the 'Restarting Kafka Connect connectors and tasks' proposal (https://github.com/strimzi/proposals/blob/master/007-restarting-kafka-connect-connectors-and-tasks.md),
adding new annotations that cause the operator to restart connectors or tasks. The annotations can be applied to the KafkaConnector, and the KafkaMirrorMaker2 custom resources. The annotation acts as a trigger for a single restart call
by the operator, and is removed from the CR when the restart REST API call is successfully called.

Checklist

Please go through this checklist and make sure all applicable tasks have been done

  • Write tests
  • Make sure all tests pass
  • Update documentation
  • Check RBAC rights for Kubernetes / OpenShift roles
  • Try your changes from Pod inside your Kubernetes and OpenShift cluster, not just locally
  • Reference relevant issue(s) and close them after merging
  • Update CHANGELOG.md
  • Supply screenshots for visual changes, such as Grafana dashboards

@strimzi-ci
Copy link

Can one of the admins verify this patch?
Run tests: @strimzi-ci run tests

@scholzj scholzj added this to the 0.21.0 milestone Dec 16, 2020
Copy link
Member

@scholzj scholzj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR. This looks good to me. I left one question. You should also add it to the CHANGELOG.md under 0.21.0 release. And it would be also great to have at least some very simple doc to know how should this be used.

log.warn("{}: Failed to restart connector {}. {}", reconciliation, connectorName, throwable.getMessage());
return Future.succeededFuture();
})
.compose(ignored -> apiClient.statusWithBackOff(new BackOff(200L, 2, 10), host, port,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I understand why do we call this at the end. Could you please explain?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm guessing because ultimately maybeCreateOrUpdateConnector needs to return the status. But it would probably be better to make pauseResume, maybeRestartConnector and maybeRestartConnectorTask all return Future<Void> and make the call the apiClient.statusWithBackOff directly from maybeCreateOrUpdateConnector.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that is why I wondered about this. If it is about the status for the KafkaConnector status field, it should be probably called only at the end.

Copy link
Member Author

@ajborley ajborley Dec 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review - Tom is correct here & I have updated to move the apiClient.statusWithBackOff call to the end of the compose chains.

* @return the ID of the task to be restarted if the provided KafkaConnector resource instance has the strimzio.io/restart-task annotation or -1 otherwise.
*/
protected int getRestartTaskAnnotationTaskID(CustomResource resource, String connectorName) {
return Annotations.intAnnotation(resource, ANNO_STRIMZI_IO_RESTART_TASK, -1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should support a list of ids. It would make it simpler to restart several failed tasks.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review. I have started looking into this - it should be possible to restart tasks on a KafkaConnector resource using an annotation like: strimzi.io/restart-tasks: "0 1 2", although it does raise some questions:

  • what should happen if one of the restart calls fails (e.g. maybe there is no task 2)? Should the annotation be altered to only include the task IDs that were not restarted, so that they can be attempted on the next reconcile?
  • what should the annotation look like for KafkaMirrorMaker2 resources? The code currently supports annotations that address a single MM2 connector task like: strimzi.io/restart-connector-tasks: "<MM2-connector-name>:<task-id>". Do you think it should allow multiple MM2 connectors to be specified? Or just multiple tasks for a single connector? For example strimzi.io/restart-connector-tasks: "<MM2-connector-name>:0 1 2"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are good points. I guess it might be a good idea to leave this idea for now and see how this stuff is used in practice. It seems like it would be easy enough to add at a later date.

log.warn("{}: Failed to restart connector {}. {}", reconciliation, connectorName, throwable.getMessage());
return Future.succeededFuture();
})
.compose(ignored -> apiClient.statusWithBackOff(new BackOff(200L, 2, 10), host, port,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm guessing because ultimately maybeCreateOrUpdateConnector needs to return the status. But it would probably be better to make pauseResume, maybeRestartConnector and maybeRestartConnectorTask all return Future<Void> and make the call the apiClient.statusWithBackOff directly from maybeCreateOrUpdateConnector.

@ajborley
Copy link
Member Author

Latest commit adds the CHANGELOG update & some initial documentation. It also updates the code to move the apiClient.statusWithBackOff call to the end of the compose chains.

.compose(ignored -> removeRestartTaskAnnotation(reconciliation, resource),
throwable -> {
// Ignore restart failures - just try again on the next reconcile
log.warn("{}: Failed to restart connector task {}:{}. {}", reconciliation, connectorName, taskID, throwable.getMessage());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have my doubts about this, particularly in the case that the connector task doesn't exist (i.e. bad id, presumably an http 404). If it's only in the logs then the user might not realise why the annotation wasn't removed. For transient errors it will likely succeed, so leaving it in the annotation for another try is not a bad idea. But for permanent errors like non-existence its more of a problem, especially if the user doesn't have access to the logs. I think we should propagate the error as a warning in the status.

The error case doesn't seem to be covered by the tests.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The latest commit adds warning status conditions to the CR when the restart or restartTask REST calls fail. Have also added failure test cases.

Copy link
Member

@scholzj scholzj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One nit. But LGTM otherwise.

Copy link
Member

@scholzj scholzj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR.

@scholzj
Copy link
Member

scholzj commented Dec 24, 2020

/azp run regression

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@scholzj
Copy link
Member

scholzj commented Dec 24, 2020

(Just FYI: The failing regression test is not related to this PR)

@@ -476,28 +480,34 @@ protected KafkaConnectApi getKafkaConnectApi() {
* @param apiClient The client instance.
* @param connectorName The connector name.
* @param connectorSpec The desired connector spec.
* @param resource The resource that defines the connector.
* @return A Future whose result, when successfully completed, is a map of the current connector state.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs amending

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

. To restart the connector, annotate the `KafkaConnector` resource in Kubernetes.
For example, using `kubectl annotate`:
[source,shell,subs=+quotes]
kubectl annotate KafkaConnector _KafkaConnector-name_ strimzi.io/restart=true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this should be restart-connector, so it's consistent with restart-task?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar questions around this were raised in the proposal discussion:

For KafkaConnnector this PR has the 2 annotations which apply to the resource they are being applied to:

  • strimzi.io/restart=true
  • strimzi.io/restart-task=<TASK-ID>

And for KafkaMirrorMaker2 this PR supports annotations that refer to a single connector or task amongst potentially many that are managed by the resource:

  • strimzi.io/restart-connector=<MM2-CONNECTOR-NAME>
  • strimzi.io/restart-connector-task=<MM2-CONNECTOR-NAME>:<TASK-ID>

With this style it's clear which annotations apply to which resource (and which connector/task is being restarted).

If people think it would be clearer, I am fine to change this to use the same annotations for both KafkaConnnector and KafkaMirrorMaker2, - e.g:

For KafkaConnnector:

  • strimzi.io/restart-connector=true
  • strimzi.io/restart-connector-task=<TASK-ID>

For KafkaMirrorMaker2:

  • strimzi.io/restart-connector=<MM2-CONNECTOR-NAME>
  • strimzi.io/restart-connector-task=<MM2-CONNECTOR-NAME>:<TASK-ID>

What do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I prefer the second option, even though it means the values are different between KafkaConnector and KMM2 while the keys are the same. WDYT @ppatierno ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure, I think I prefer the first option but I have not a strong opinion tbh. Both of them are valid. The doubt on the second option is around the fact that it could be confusing for users to see the same annotation names but getting different values (bool vs string) even if in two different resources. Anyway, I am fine with both, in the end it's a matter of documenting the difference.

Copy link
Member

@tombentley tombentley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A nit and a question, but otherwise lgtm.

@ajborley ajborley force-pushed the restart-connectors-and-tasks branch from 456fe3e to 274a8cc Compare January 4, 2021 09:59
@scholzj
Copy link
Member

scholzj commented Jan 9, 2021

A nit and a question, but otherwise lgtm.

@tombentley @ajborley Is there still something to do on this PR? Or can it be merged now? From the last comment it is not completely clear to me if there is still something to do here or not.

@ajborley
Copy link
Member Author

ajborley commented Jan 9, 2021

A nit and a question, but otherwise lgtm.

@tombentley @ajborley Is there still something to do on this PR? Or can it be merged now? From the last comment it is not completely clear to me if there is still something to do here or not.

There may be another commit to use the same annotation keys for both KafkaConnectors and KafkaMirrorMaker2s - I was waiting for @ppatierno to respond to Tom’s latest comment (#4114 (comment)) as similar thoughts were discussed in the proposal PR:

@scholzj
Copy link
Member

scholzj commented Jan 10, 2021

@ppatierno Please have a look ^^^

@ppatierno
Copy link
Member

@ajborley @scholzj sorry I didn't notice the question. Just left a comment.

Copy link
Member

@ppatierno ppatierno left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other than the comment I left, the PR looks good to me anyway.

@scholzj scholzj modified the milestones: 0.21.0, 0.22.0 Jan 12, 2021
@ajborley
Copy link
Member Author

We have one vote for using the same annotation keys and one vote for leaving the PR as-is with different annotation keys for KafkaConnector and KafkaMirrorMaker2 resources. I don’t have strong feelings either way!
@scholzj do you want to have the casting vote here? 🙂

 - This commit implements the first part of the 'Restarting Kafka
Connect connectors and tasks' proposal
(https://github.com/strimzi/proposals), adding new annotations that
cause the operator to restart connectors or tasks. The annotations can
be applied to the KafkaConnector, and the KafkaMirrorMaker2 custom
resources. The annotation acts as a trigger for a single restart call
by the operator, and is removed from the CR when the restart REST API
call is successfully called.

Signed-off-by: Andrew Borley <borley@uk.ibm.com>
 - Add doc, CHANGELOG
 - Reduce calls to connector status.

Signed-off-by: Andrew Borley <borley@uk.ibm.com>
Signed-off-by: Andrew Borley <borley@uk.ibm.com>
- Also fix accidental doc deletion.

Signed-off-by: Andrew Borley <borley@uk.ibm.com>
Signed-off-by: Andrew Borley <borley@uk.ibm.com>
@scholzj scholzj force-pushed the restart-connectors-and-tasks branch from 274a8cc to d81b208 Compare January 12, 2021 09:44
@scholzj
Copy link
Member

scholzj commented Jan 12, 2021

I talked about it with Tom and Paolo and I guess it is fine as it is. I rebased it to solve the conflict and if the tests pass I will merge it. Great PR ... thanks.

@scholzj
Copy link
Member

scholzj commented Jan 12, 2021

/azp run regression

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@scholzj
Copy link
Member

scholzj commented Jan 12, 2021

Thanks for the PR.

@scholzj scholzj merged commit c292f1e into strimzi:master Jan 12, 2021
@ajborley ajborley deleted the restart-connectors-and-tasks branch January 12, 2021 16:06
@tomasz-zylka-fluke
Copy link

@scholzj was it included in 0.21 release? thanks!

@scholzj
Copy link
Member

scholzj commented Jan 13, 2021

I will include it if we find some bugs to do RC2. But if the RC1 is released as GA it will be in 0.22.

@tomasz-zylka-fluke
Copy link

hi @scholzj,
our understanding is that #4114 was not part of 0.21 release, therefore the logical question ;) would be when do you plan to release 0.22 - will it be this month (February) ?
thanks!

@scholzj
Copy link
Member

scholzj commented Feb 10, 2021

I think end-of-February is still the current plan. We need to finish the preparation for the CRDv1 ... I think once that is ready we will start with the release..

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants