Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feat/auto restart connectors #7500

Merged
merged 12 commits into from
Nov 11, 2022

Conversation

ThomasDangleterre
Copy link
Contributor

@ThomasDangleterre ThomasDangleterre commented Oct 19, 2022

Type of change

  • Enhancement / new feature

Description

Implements Automatically restarting FAILED connectors and tasks - proposal 007

Related issue : #2621

Checklist

Please go through this checklist and make sure all applicable tasks have been done

  • Write tests
  • Make sure all tests pass
  • Update documentation
  • Check RBAC rights for Kubernetes / OpenShift roles
  • Try your changes from Pod inside your Kubernetes and OpenShift cluster, not just locally
  • Reference relevant issue(s) and close them after merging
  • Update CHANGELOG.md
  • Supply screenshots for visual changes, such as Grafana dashboards

@strimzi-ci
Copy link

Can one of the admins verify this patch?

@ThomasDangleterre ThomasDangleterre force-pushed the feat/auto-restart-connectors branch from 87fb664 to 68d9a21 Compare October 20, 2022 15:30
@ThomasDangleterre ThomasDangleterre force-pushed the feat/auto-restart-connectors branch from 68d9a21 to b75fd15 Compare October 21, 2022 13:08
@scholzj scholzj added this to the 0.33.0 milestone Oct 25, 2022
Copy link
Member

@scholzj scholzj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR and sorry it took so long to get to it. I left some comments which we need to look into. Also, the tests failed => I restarted them in case it were just some flaky tests, but they seeemd related.

@EqualsAndHashCode(callSuper = true)
public class KafkaMirrorMaker2ConnectorSpec extends AbstractConnectorSpec {
private static final long serialVersionUID = 1L;

private AutoRestart autoRestartConnectorsAndTasks = new AutoRestart();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we deviate here from the proposal. The proposal suggests to have it once per KafkaMirrorMaker2 CR - so that would be added to KafkaMirrorMaker2Spec class. But here you put it per connector.

In general, I do not have problem with that. But:

  • It should be discussed and made clear since it deviates from the approved proposal
  • If this is the way we want to go, then you can make the code simple and just have the
     autoRestart:
       enabled: false
    directly in the AbstractConnectorSpec and use the same APi for both of them.

CC @strimzi/maintainers @ajborley

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a problem with that either - I think it's unlikely that a user would want to switch autoRestart off for one MM2 connector and not the others, but at least this way they have the option.

(thanks for tagging me in here Jakub, and thanks @ThomasDangleterre for picking this up! :) )

@@ -824,6 +883,13 @@ protected JsonObject asJson(KafkaConnectorSpec spec, KafkaConnectorConfiguration
return connectorConfigJson.put("connector.class", spec.getClassName());
}

protected Future<KafkaConnectorStatus> getPreviousKafkaConnectorStatus(CrdOperator<KubernetesClient, KafkaConnector, KafkaConnectorList> resourceOperator,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't we got the resource with its latest status on the beginning? Can't we use that instead of querying the API server again?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I saw, Kafka Connector statuses are recreated from scratch at each reconcile.
We get connector and tasks statuses from Connect API before auto restarting but it is lacking the auto restart status as it is stored in the resource and the status of the resource is null when we call maybeCreateOrUpdateConnector function.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are. But on the beginning, there is always a full KafkaConnector resource including the status. I'm not sure how easy or hard would it be to pass it around (currently, I think only the .spec part is passed around) -> maybe it would not be possible because of the duality with the Mirror Maker 2 connectors.

@scholzj scholzj requested a review from PaulRMellor October 25, 2022 12:37
Copy link
Contributor

@katheris katheris left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great to see this new feature being added, I had a few small suggested changes to wording etc. The main question I had though was around where in the reconciliation flow we are doing the auto restart and what that means for the connector status. It would be useful to have a test where the connector fails initially, is restarted successfully and we see the status is now ready.

Copy link
Member

@tombentley tombentley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I left a couple of comments.

@mimaison
Copy link
Contributor

I know I come pretty late in the process but after reviewing the proposal, I've got a few questions:

  1. Should we rename autoRestartConnectorsAndTasks to autoRestartFailedConnectorsAndTasks to be explicit it's not about restarting "stopped" tasks but instead only connectors and tasks in the FAILED state. Same for the MM2 flag.
  2. Are we sure we want this enabled by default? Not entirely sure what the policy is around backwards compatibility but I see this as a significant change of behavior.

I've not checked the PR yet, I'll try to take a look tomorrow.

@scholzj
Copy link
Member

scholzj commented Oct 26, 2022

Yeah, good point. It should not be anabled by default. The API classes should not have the default object but should be null by default.

Copy link
Contributor

@PaulRMellor PaulRMellor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! I made some suggestions on the doc. Can re-review for any additions and changes.

@ThomasDangleterre ThomasDangleterre force-pushed the feat/auto-restart-connectors branch from 92102c8 to cbc11e0 Compare November 3, 2022 16:46
@ThomasDangleterre ThomasDangleterre force-pushed the feat/auto-restart-connectors branch 2 times, most recently from bf3c488 to fc3f0ad Compare November 4, 2022 16:45
@ThomasDangleterre
Copy link
Contributor Author

Thanks everyone for your feedbacks 🙂

Does anyone know how I can test this method ?
https://github.com/strimzi/strimzi-kafka-operator/pull/7500/files#diff-b5b2f23d9bfdb2ded843e763e2a7dc4a438fa18c93820a742a1633e683d55d18R622-R641
It seems a bad idea to wait 2 minutes during IT tests. Maybe I should make this method static and create a unit test class for It ?

I didn't find any related tests for the underlying MirrorMaker2's connectors. Is there something I'm missing or Kafka Connector tests are enough ?

@scholzj
Copy link
Member

scholzj commented Nov 6, 2022

/azp run regression

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@scholzj
Copy link
Member

scholzj commented Nov 6, 2022

Does anyone know how I can test this method ?
https://github.com/strimzi/strimzi-kafka-operator/pull/7500/files#diff-b5b2f23d9bfdb2ded843e763e2a7dc4a438fa18c93820a742a1633e683d55d18R622-R641
It seems a bad idea to wait 2 minutes during IT tests. Maybe I should make this method static and create a unit test class for It ?

Two ideas come to my mind ... make it possible to configure the current time into the method. So instead of testing in a realtime, you can call the method with some artificial timestamp. E.g. always pass 9:00:00, 9:02:00, 9:05:00, 9:06:00 etc. and always check if it will or will not restart. I think that is how we test other things which depend on time such as certificate expirations etc. I guess other option might be to make the restart strategy more configurble and for example test it with restart after 200 milliseconds instead of 2 minutes or something like that. But the time based manipulation would be what I would try first I guess. @tombentley @katheris any better ideaa?

I didn't find any related tests for the underlying MirrorMaker2's connectors. Is there something I'm missing or Kafka Connector tests are enough ?

TBH, I'm not sure. IMHO the Mirror Maker connectors are passed into the connector operator. So maybe there are no other tests?

ThomasDangleterre and others added 7 commits November 7, 2022 11:15
Signed-off-by: ThomasDangleterre <thomas.dangleterre@decathlon.com>
Signed-off-by: ThomasDangleterre <thomas.dangleterre@decathlon.com>
Signed-off-by: ThomasDangleterre <thomas.dangleterre@decathlon.com>
Signed-off-by: ThomasDangleterre <thomas.dangleterre@decathlon.com>
Co-authored-by: Kate Stanley <11195226+katheris@users.noreply.github.com>
Signed-off-by: Thomas Dangleterre <75978678+ThomasDangleterre@users.noreply.github.com>
Signed-off-by: ThomasDangleterre <thomas.dangleterre@decathlon.com>
Co-authored-by: Tom Bentley <tombentley@users.noreply.github.com>
Signed-off-by: Thomas Dangleterre <75978678+ThomasDangleterre@users.noreply.github.com>
Signed-off-by: ThomasDangleterre <thomas.dangleterre@decathlon.com>
Co-authored-by: PaulRMellor <47596553+PaulRMellor@users.noreply.github.com>
Signed-off-by: Thomas Dangleterre <75978678+ThomasDangleterre@users.noreply.github.com>
Signed-off-by: ThomasDangleterre <thomas.dangleterre@decathlon.com>
Signed-off-by: ThomasDangleterre <thomas.dangleterre@decathlon.com>
… true

Signed-off-by: ThomasDangleterre <thomas.dangleterre@decathlon.com>
@ThomasDangleterre ThomasDangleterre force-pushed the feat/auto-restart-connectors branch from 4fef123 to d985ec7 Compare November 7, 2022 10:15
@ThomasDangleterre ThomasDangleterre marked this pull request as ready for review November 7, 2022 13:07
@tombentley
Copy link
Member

So instead of testing in a realtime, you can call the method with some artificial timestamp. E.g. always pass 9:00:00, 9:02:00, 9:05:00, 9:06:00 etc. and always check if it will or will not restart. I think that is how we test other things which depend on time such as certificate expirations etc.

Factoring out some notion of the "current time" for any code which would depend on the current time is usually the right approach. That could simply be a long (the unix epoch offset in milliseconds) or a java.time.Clock, whichever is more convenient.

Copy link
Member

@tombentley tombentley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few nits, but overall I'm happy with this. Thanks!

@ThomasDangleterre ThomasDangleterre force-pushed the feat/auto-restart-connectors branch from 148b2e7 to 5bd3e8b Compare November 9, 2022 15:20
@ThomasDangleterre ThomasDangleterre requested review from scholzj, katheris and PaulRMellor and removed request for scholzj, katheris and PaulRMellor November 9, 2022 15:20
Copy link
Member

@scholzj scholzj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. The build failed because of an unused import:

[ERROR] /home/vsts/work/1/s/cluster-operator/src/main/java/io/strimzi/operator/cluster/operator/assembly/AbstractConnectOperator.java:79:8: Unused import - java.time.temporal.ChronoUnit. [UnusedImports]

Can you please fix it?

@scholzj
Copy link
Member

scholzj commented Nov 10, 2022

@katheris Do you have something more for this PR?

…tor/resource/StatusUtils.java

Co-authored-by: Tom Bentley <tombentley@users.noreply.github.com>
Signed-off-by: Thomas Dangleterre <75978678+ThomasDangleterre@users.noreply.github.com>
@ThomasDangleterre ThomasDangleterre force-pushed the feat/auto-restart-connectors branch from 5bd3e8b to df17ca5 Compare November 10, 2022 09:45
@scholzj
Copy link
Member

scholzj commented Nov 10, 2022

/azp run regression

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@scholzj
Copy link
Member

scholzj commented Nov 11, 2022

@ThomasDangleterre Thanks a lot for this PR, this is a great addition!

@kmcrawford
Copy link

@scholzj is this live in 0.32 or do we need to wait until 0.33?

@im-konge
Copy link
Member

Hi @kmcrawford,
this feature is available from 0.33.0 - which is already released (latest released version is 0.33.2).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants