
Allow infinite retry of actions until timeout #519

Merged · 3 commits into main · Apr 4, 2024

Conversation

@deepthidevaki (Member) commented on Apr 3, 2024

To wait for a scaling operation to complete, we port-forward to a gateway and repeatedly query the status. However, this sometimes failed because the port-forwarding disconnected and was not re-established until the action itself was retried. By default, actions are retried 3 times, so we had configured a long timeout (around 30 minutes) for the query-and-wait step. That means that when the port-forwarding disconnected, the action kept running for 30 minutes without a single successful query response. If we fail the action early instead, it might result in an incident because the retry count is hard-coded to 3.

Instead, as a generic solution for retries, this PR allows infinite retry of all actions until the timeout is reached. This can simplify operations that have to repeatedly query and wait. For example, for verify readiness or cluster wait, we could in theory remove the loop within the code; the action would simply be retried, and only the timeout in the chaos action provider parameters would need adjusting.

PS: Although these changes would allow the implementation of verify readiness to be simplified, that is not done in this PR to reduce the impact of the changes.

closes #516
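
To make the problem concrete, here is a minimal Go sketch of the query-and-wait pattern described above; the function and parameter names are hypothetical, not the actual zbchaos code. The point is that failing fast on a query error lets the worker retry the whole action (and re-establish the port-forwarding) instead of idling for the full timeout without a single successful response.

package chaossketch

import (
	"context"
	"time"
)

// waitForChangeCompletion polls a status query (e.g. issued through a
// port-forwarded gateway) until the change completes, the context expires,
// or a query fails. Returning the error instead of swallowing it is what
// allows the retry-until-timeout behaviour to kick in at the worker level.
func waitForChangeCompletion(ctx context.Context, queryStatus func() (bool, error)) error {
	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()
	for {
		done, err := queryStatus()
		if err != nil {
			return err // e.g. the port-forwarding dropped; let the action be retried
		}
		if done {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}

With retry-until-timeout at the worker, the loop above could in principle disappear entirely: the action performs a single check, and the worker keeps retrying it until the configured timeout elapses.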

@deepthidevaki deepthidevaki marked this pull request as ready for review April 3, 2024 08:35
@Zelldon (Member) left a comment


Thanks for your changes, and I understand the idea behind them. Unfortunately, I'm not that happy with the changes yet (specifically with where the change was made).

I would like to see this as a parameter, either on the cluster subcommand or, if you need it, at the root level, so you can use it like zbchaos cluster scale --runUntilTimeout. That way we don't change the specification and also have this feature available in the CLI (useful for further experimenting and manual testing).

I hope this makes sense to you :)

@@ -69,7 +70,8 @@
 "type": "process",
 "path": "zbchaos",
 "arguments": ["verify", "readiness"],
-"timeout": 900
+"timeout": 900,
+"retryUntilTimeout" : "true"
Member:

❌ That should be part of the arguments array, so we don't change the specification which is defined here https://chaostoolkit.org/reference/api/experiment/#action-or-probe-provider

Member:

I would propose you make it part of the arguments and also support it in the CLI as a new parameter; otherwise we end up with a feature that is only supported by the worker but not by the CLI, which I would like to avoid. The CLI is especially useful for testing and trying things out.

Member Author:

Do we have to stick to chaostoolkit spec? 💭

Unfortunately, adding the parameter to the command does not solve this problem. We already have a retry implemented in the command. The problem is that the port-forwarding is not re-connected until the job is retried. We can only fail the job 3 times, because after that it raises an incident and the test fails.

Member Author:

We could probably create and close the port-forwarding in each iteration of the retry loop, but that doesn't look like an optimal solution.

Member Author:

Also, right now, because the default job timeout is 5 minutes, multiple workers end up running the same job if the action takes longer than 5 minutes. For actions like scale up, it always takes more than 5 minutes. This is why I want to provide a more generic solution at the chaos worker.

Member Author:

If we want to stick to the chaostoolkit spec, I think it would also make sense to make "retryUntilTimeout" the default behaviour. Right now the command implementations do not respect the specified timeout anyway: each command has its own way of handling retries and its own timeout, and the configured timeout is effectively ignored at the command level.

Comment on lines +122 to +123
timeout := time.Minute * 25
err = waitForChange(port, changeResponse.ChangeId, timeout)
Member:

That is a long time :D

@Zelldon (Member) commented on Apr 3, 2024

Alternatively, would it make sense to change your loop to contain more parts, like the port-forwarding? Like you mentioned here?

To wait for a scaling operation to complete, we port-forward to a gateway and repeatedly query the status. However, this sometimes failed because the port-forwarding disconnected and was not re-established until the action itself was retried.

Or is maybe some error not handled correctly? It should fail if the connection is closed, right?

@deepthidevaki (Member Author)

Alternatively, would it make sense to change your loop to contain more parts, like the port-forwarding? Like you mentioned here?

To wait for a scaling operation to complete, we port-forward to a gateway and repeatedly query the status. However, this sometimes failed because the port-forwarding disconnected and was not re-established until the action itself was retried.

Or is maybe some error not handled correctly? It should fail if the connection is closed, right?

We intentionally do not fail the command on network errors, because if the job fails 3 times it raises an incident and causes the test to fail.

@Zelldon (Member) commented on Apr 3, 2024

Would you be open to me taking a look at the code to see whether there is an alternative?

Is it currently causing trouble in our executions?

@Zelldon (Member) commented on Apr 3, 2024

And changing the default retries to more than 3 is not an option?

@deepthidevaki (Member Author)

And changing the default retries to more than 3 is not an option?

This can be changed only globally, not for a particular action.

Changing it globally would be an alternate solution. #519 (comment)

@deepthidevaki (Member Author)

Is it currently causing trouble in our executions?

Yes. There was at least one flaky e2e run because of this.

@Zelldon (Member) commented on Apr 3, 2024

What do you think, should we go with that? It feels like the simplest option and also acceptable, since it should be fine to retry (which is only done on failing actions anyway), and we still have the timeout that keeps us covered, right?

Feels more natural to use the built-in mechanics

@deepthidevaki (Member Author)

What do you think, should we go with that? It feels like the simplest option and also acceptable, since it should be fine to retry (which is only done on failing actions anyway), and we still have the timeout that keeps us covered, right?

Feels more natural to use the built-in mechanics

Do you mean changing it globally, i.e. always retrying until timeout for all actions? I like that. I was initially hesitant to do it because I was not sure how it would impact existing chaos experiments, but to me it sounds natural and fitting to our experiment process.

@Zelldon (Member) commented on Apr 3, 2024

Then let's do that 👍🏼 I think you only need to change the action.bpmn process model then :)

@deepthidevaki (Member Author)

Why change actions.bpmn? My thought was to replace the following line in chaos_worker.go

_, _ = client.NewFailJobCommand().JobKey(job.Key).Retries(job.Retries - 1).Send(ctx)

to

// do not reduce retry count
_, _ = client.NewFailJobCommand().JobKey(job.Key).Retries(job.Retries).Send(ctx)

Is there a way to specify the same behavior in the service task spec?

@Zelldon (Member) commented on Apr 3, 2024

Ah ok :) I was thinking of defining a high retries count https://docs.camunda.io/docs/next/components/modeler/bpmn/service-tasks/#task-definition

<zeebe:taskDefinition type="action" retries="50" />

and then maybe even make use of the backoff stuff 🤷🏼 https://docs.camunda.io/docs/next/apis-tools/go-client/job-worker/#backoff-configuration
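
For illustration only, the kind of delay schedule such a backoff configuration produces could be sketched in plain Go as follows; the function name, the factor of 2, and the jitter here are assumptions of this sketch, not the client's actual API (see the linked docs for the real configuration options).

package main

import (
	"fmt"
	"math"
	"math/rand"
	"time"
)

// retryDelay grows the wait between retries exponentially from minDelay up to
// maxDelay, with a little jitter so many workers do not retry in lockstep.
func retryDelay(attempt int, minDelay, maxDelay time.Duration) time.Duration {
	d := float64(minDelay) * math.Pow(2, float64(attempt))
	if d > float64(maxDelay) {
		d = float64(maxDelay)
	}
	jitter := 0.9 + rand.Float64()*0.2 // roughly +/-10%
	return time.Duration(d * jitter)
}

func main() {
	for attempt := 0; attempt < 5; attempt++ {
		fmt.Printf("retry %d after ~%v\n", attempt, retryDelay(attempt, time.Second, 30*time.Second))
	}
}

In the discussion below, a simpler fixed default backoff (around 10 seconds) is what ends up being proposed.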

@Zelldon (Member) commented on Apr 3, 2024

But I think we could also do it like you have proposed :) We can always change it if we see any issues with it.

@deepthidevaki (Member Author)

Thanks. Then I will make the following changes:

  1. Remove the new parameters
  2. Infinite retry until timeout for all actions (a sketch of the resulting error path follows below)
  3. Maybe also set a default backoff for all jobs (10s?)
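
To tie the plan together, here is a minimal sketch of the resulting error path referenced in point 2. The package name, import paths, and the handler-style signature (worker.JobClient, entities.Job) are assumptions of this sketch; only the unchanged Retries(job.Retries) call is exactly the replacement quoted earlier in the thread.

package chaossketch

import (
	"context"

	"github.com/camunda/zeebe/clients/go/v8/pkg/entities"
	"github.com/camunda/zeebe/clients/go/v8/pkg/worker"
)

// failJobKeepingRetries fails the job without decrementing its retry count,
// so Zeebe hands the job out again and the action is effectively retried
// until the configured job/experiment timeout elapses, instead of raising an
// incident after 3 attempts.
func failJobKeepingRetries(ctx context.Context, client worker.JobClient, job entities.Job) {
	_, _ = client.NewFailJobCommand().JobKey(job.Key).Retries(job.Retries).Send(ctx)
	// Point 3 above (a default backoff of ~10s) would additionally space out
	// these retries; its exact configuration is left open in this sketch.
}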

@Zelldon (Member) commented on Apr 3, 2024

Sounds great, thanks @deepthidevaki 🙇🏼 👍🏼 Please also add a comment with the reasoning for why we have this :)

If the operation fails, it will be retried. The experiments using this operation can configure longer timeouts and enable retry until timeout.
Now all actions will be retried forever until the timeout. So it is ok to use a smaller timeout; if the operation is not complete within that time, it will be retried anyway.
@Zelldon (Member) left a comment

Great, thanks @deepthidevaki 🚀

@@ -272,7 +272,7 @@ func forceFailover(flags *Flags) error {
 changeResponse, err := sendScaleRequest(port, brokersInRegion, true, -1)
 ensureNoError(err)

-timeout := time.Minute * 25
+timeout := time.Minute * 5
Member:

👍🏼

@Zelldon Zelldon merged commit 7466f8e into main Apr 4, 2024
3 checks passed
@Zelldon Zelldon deleted the dd-fix-retry-wait branch April 4, 2024 08:22

Successfully merging this pull request may close these issues.

Waiting for scaling to complete failed because port-forwarding to gateway disconnected