
Allow infinite retry of actions until timeout #519

Merged · 3 commits into main · Apr 4, 2024

Conversation

@deepthidevaki (Member) commented on Apr 3, 2024

To wait for a scaling operation to complete, we port-forward to a gateway and repeatedly query the status. However, this sometimes failed because the port-forwarding disconnected and was not re-established until the action itself was retried. By default, actions are retried 3 times, so we had configured a long timeout (around 30 minutes) for the query-and-wait step. That means that when the port-forwarding disconnected, the action kept running for 30 minutes without a single successful query response. If we fail the action early instead, it might result in an incident because the retry count is hard-coded to 3.

Instead, as a generic solution for retries, this PR allows infinite retry of all actions until the timeout is reached. This can simplify operations that have to repeatedly query and wait. For example, for verify readiness or cluster wait, we could in theory remove the loop within the code; the action would simply be retried, and only the timeout in the chaos action provider parameters would need adjusting.

PS: Although these changes would allow the implementation of verify readiness to be simplified, that is not done in this PR to reduce the impact of the changes.

closes #516
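
To make the problem concrete, here is a minimal Go sketch of the query-and-wait pattern described above; the function and parameter names are hypothetical, not the actual zbchaos code. The point is that failing fast on a query error lets the worker retry the whole action (and re-establish the port-forwarding) instead of idling for the full timeout without a single successful response.

package chaossketch

import (
	"context"
	"time"
)

// waitForChangeCompletion polls a status query (e.g. issued through a
// port-forwarded gateway) until the change completes, the context expires,
// or a query fails. Returning the error instead of swallowing it is what
// allows the retry-until-timeout behaviour to kick in at the worker level.
func waitForChangeCompletion(ctx context.Context, queryStatus func() (bool, error)) error {
	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()
	for {
		done, err := queryStatus()
		if err != nil {
			return err // e.g. the port-forwarding dropped; let the action be retried
		}
		if done {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}

With retry-until-timeout at the worker, the loop above could in principle disappear entirely: the action performs a single check, and the worker keeps retrying it until the configured timeout elapses.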

@deepthidevaki deepthidevaki marked this pull request as ready for review April 3, 2024 08:35
@Zelldon (Member) left a comment


Thanks for your changes, and I understand the idea behind them. Unfortunately, I'm not that happy with the changes yet (specifically with where the change was made).

I would like to see this as a parameter, either on the cluster subcommand or, if you need it, at the root level, so you can use it like zbchaos cluster scale --runUntilTimeout. That way we don't change the specification and also have this feature available in the CLI (useful for further experimenting and manual testing).

I hope this makes sense to you :)

@@ -69,7 +70,8 @@
 "type": "process",
 "path": "zbchaos",
 "arguments": ["verify", "readiness"],
-"timeout": 900
+"timeout": 900,
+"retryUntilTimeout" : "true"
Member:

❌ That should be part of the arguments array, so we don't change the specification which is defined here https://chaostoolkit.org/reference/api/experiment/#action-or-probe-provider

Member:

I would propose you make it part of the arguments and also support it in the CLI as a new parameter; otherwise we end up with a feature that is only supported by the worker but not by the CLI, which I would like to avoid. The CLI is especially useful for testing and trying things out.

Member Author:

Do we have to stick to chaostoolkit spec? 💭

Unfortunately, adding the parameter to the command does not solve this problem. We already have a retry implemented in the command. The problem is that the port-forwarding is not re-connected until the job is retried. We can only fail the job 3 times, because after that it raises an incident and the test fails.

Member Author:

We could probably create and close the port-forwarding in each iteration of the retry loop, but that doesn't look like an optimal solution.

Member Author:

Also, right now, because the default job timeout is 5 minutes, multiple workers end up running the same job if the action takes longer than 5 minutes. For actions like scale up, it always takes more than 5 minutes. This is why I want to provide a more generic solution at the chaos worker.

Member Author:

If we want to stick to the chaostoolkit spec, I think it would also make sense to make "retryUntilTimeout" the default behaviour. Right now the command implementations do not respect the specified timeout anyway: each command has its own way of handling retries and its own timeout, and the configured timeout is effectively ignored at the command level.

Comment on lines +122 to +123
timeout := time.Minute * 25
err = waitForChange(port, changeResponse.ChangeId, timeout)
Member:

That is a long time :D

@Zelldon (Member) commented on Apr 3, 2024

Alternatively, would it make sense to change your loop to contain more parts, like the port-forwarding? Like you mentioned here?

To wait for a scaling operation to complete, we port-forward to a gateway and repeatedly query the status. However, this sometimes failed because the port-forwarding disconnected and was not re-established until the action itself was retried.

Or is maybe some error not handled correctly? It should fail if the connection is closed, right?

@deepthidevaki (Member Author)

Alternatively, would it make sense to change your loop to contain more parts, like the port-forwarding? Like you mentioned here?

To wait for a scaling operation to complete, we port-forward to a gateway and repeatedly query the status. However, this sometimes failed because the port-forwarding disconnected and was not re-established until the action itself was retried.

Or is maybe some error not handled correctly? It should fail if the connection is closed, right?

We intentionally do not fail the command on network errors, because if the job fails 3 times it raises an incident and causes the test to fail.

@Zelldon (Member) commented on Apr 3, 2024

Would you be open to me taking a look at the code to see whether there is an alternative?

Is it currently causing trouble in our executions?

@Zelldon (Member) commented on Apr 3, 2024

And changing the default retries to more than 3 is not an option?

@deepthidevaki (Member Author)

And changing the default retries to more than 3 is not an option?

This can be changed only globally, not for a particular action.

Changing it globally would be an alternate solution. #519 (comment)

@deepthidevaki (Member Author)

Is it currently causing trouble in our executions?

Yes. There was at least one flaky e2e run because of this.

@Zelldon (Member) commented on Apr 3, 2024

What do you think, should we go with that? It feels like the simplest option and also acceptable, since it should be fine to retry (which is only done on failing actions anyway), and we still have the timeout that keeps us covered, right?

Feels more natural to use the built-in mechanics

@deepthidevaki (Member Author)

What do you think, should we go with that? It feels like the simplest option and also acceptable, since it should be fine to retry (which is only done on failing actions anyway), and we still have the timeout that keeps us covered, right?

Feels more natural to use the built-in mechanics

Do you mean changing it globally, i.e. always retrying until timeout for all actions? I like that. I was initially hesitant to do it because I was not sure how it would impact existing chaos experiments, but to me it sounds natural and fitting to our experiment process.

@Zelldon (Member) commented on Apr 3, 2024

Then let's do that 👍🏼 I think you only need to change the action.bpmn process model then :)

@deepthidevaki (Member Author)

Why change actions.bpmn? My thought was to replace the following line in chaos_worker.go

_, _ = client.NewFailJobCommand().JobKey(job.Key).Retries(job.Retries - 1).Send(ctx)

to

// do not reduce retry count
_, _ = client.NewFailJobCommand().JobKey(job.Key).Retries(job.Retries).Send(ctx)

Is there a way to specify the same behavior in the service task spec?

@Zelldon (Member) commented on Apr 3, 2024

Ah ok :) I was thinking of defining a high retries count https://docs.camunda.io/docs/next/components/modeler/bpmn/service-tasks/#task-definition

<zeebe:taskDefinition type="action" retries="50" />

and then maybe even make use of the backoff stuff 🤷🏼 https://docs.camunda.io/docs/next/apis-tools/go-client/job-worker/#backoff-configuration
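
For illustration only, the kind of delay schedule such a backoff configuration produces could be sketched in plain Go as follows; the function name, the factor of 2, and the jitter here are assumptions of this sketch, not the client's actual API (see the linked docs for the real configuration options).

package main

import (
	"fmt"
	"math"
	"math/rand"
	"time"
)

// retryDelay grows the wait between retries exponentially from minDelay up to
// maxDelay, with a little jitter so many workers do not retry in lockstep.
func retryDelay(attempt int, minDelay, maxDelay time.Duration) time.Duration {
	d := float64(minDelay) * math.Pow(2, float64(attempt))
	if d > float64(maxDelay) {
		d = float64(maxDelay)
	}
	jitter := 0.9 + rand.Float64()*0.2 // roughly +/-10%
	return time.Duration(d * jitter)
}

func main() {
	for attempt := 0; attempt < 5; attempt++ {
		fmt.Printf("retry %d after ~%v\n", attempt, retryDelay(attempt, time.Second, 30*time.Second))
	}
}

In the discussion below, a simpler fixed default backoff (around 10 seconds) is what ends up being proposed.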

@Zelldon (Member) commented on Apr 3, 2024

But I think we could also do it like you have proposed :) We can always change it if we see any issues with it.

@deepthidevaki (Member Author)

Thanks. Then I will make the following changes:

  1. Remove the new parameters
  2. Infinite retry until timeout for all actions (a sketch of the resulting error path follows below)
  3. Maybe also set a default backoff for all jobs (10s?)
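
To tie the plan together, here is a minimal sketch of the resulting error path referenced in point 2. The package name, import paths, and the handler-style signature (worker.JobClient, entities.Job) are assumptions of this sketch; only the unchanged Retries(job.Retries) call is exactly the replacement quoted earlier in the thread.

package chaossketch

import (
	"context"

	"github.com/camunda/zeebe/clients/go/v8/pkg/entities"
	"github.com/camunda/zeebe/clients/go/v8/pkg/worker"
)

// failJobKeepingRetries fails the job without decrementing its retry count,
// so Zeebe hands the job out again and the action is effectively retried
// until the configured job/experiment timeout elapses, instead of raising an
// incident after 3 attempts.
func failJobKeepingRetries(ctx context.Context, client worker.JobClient, job entities.Job) {
	_, _ = client.NewFailJobCommand().JobKey(job.Key).Retries(job.Retries).Send(ctx)
	// Point 3 above (a default backoff of ~10s) would additionally space out
	// these retries; its exact configuration is left open in this sketch.
}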

@Zelldon (Member) commented on Apr 3, 2024

Sounds great, thanks @deepthidevaki 🙇🏼 👍🏼 Please also add a comment with the reasoning for why we have this :)

If the operation fails, it will be retried. The experiments using this operation can configure longer timeouts and enable retry until timeout.
Now all actions will be retried forever until the timeout. So it is ok to use a smaller timeout; if the operation is not complete within that time, it will be retried anyway.
@Zelldon (Member) left a comment

Great, thanks @deepthidevaki 🚀

@@ -272,7 +272,7 @@ func forceFailover(flags *Flags) error {
 changeResponse, err := sendScaleRequest(port, brokersInRegion, true, -1)
 ensureNoError(err)

-timeout := time.Minute * 25
+timeout := time.Minute * 5
Member:

👍🏼

@Zelldon Zelldon merged commit 7466f8e into main Apr 4, 2024
3 checks passed
@Zelldon Zelldon deleted the dd-fix-retry-wait branch April 4, 2024 08:22

Successfully merging this pull request may close these issues.

Waiting for scaling to complete failed because port-forwarding to gateway disconnected