@sqwishy sqwishy commented Aug 27, 2022

When a job fails and a retry is possible, push_next_flow_job creates a new job scheduled for a later time determined by the retry configuration.

The previous job's args are not reused, because for-loops were seen running with the wrong arguments passed in. Instead, the last_result/previous_result is stored in the retry status so that, if a retry is necessary, it can be replayed to recalculate the job args.
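To illustrate the idea, here is a minimal sketch of how a retry decision could carry the previous result forward. It is not the code from this PR: `RetryConfig`, `RetryStatus`, `next_retry`, and `args_for_retry` are hypothetical names, and the fixed interval between tries is an assumption.

```rust
use std::time::Duration;

/// Hypothetical retry configuration: how many retries are allowed
/// and how long to wait between tries.
#[derive(Debug, Clone)]
struct RetryConfig {
    max_attempts: u32,
    interval: Duration,
}

/// Hypothetical retry status stored with the flow. It keeps the result
/// of the step before the failing one so the args can be rebuilt later.
#[derive(Debug, Clone, PartialEq)]
struct RetryStatus {
    fail_count: u32,
    previous_result: Option<String>,
}

/// Returns the delay before the next try and the updated status,
/// or `None` once the configured number of retries is used up.
fn next_retry(config: &RetryConfig, status: &RetryStatus) -> Option<(Duration, RetryStatus)> {
    if status.fail_count >= config.max_attempts {
        return None;
    }
    let updated = RetryStatus {
        fail_count: status.fail_count + 1,
        previous_result: status.previous_result.clone(),
    };
    Some((config.interval, updated))
}

/// Recomputes the args for the retried job from the stored previous
/// result instead of reusing the failed job's args.
fn args_for_retry(status: &RetryStatus) -> String {
    match &status.previous_result {
        Some(prev) => format!("previous_result={}", prev),
        None => String::from("previous_result=null"),
    }
}
```

The point of the sketch is that the retried job's args are derived from the stored `previous_result` each time, so a stale argument set from the failed job is never carried over.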

@sqwishy sqwishy requested a review from rubenfiszel as a code owner August 27, 2022 21:10
 }
 }
-tracing::error!(job_id = %job.id, "Error handling job: {err}");
+tracing::error!(job_id = %job.id, err = err.alt(), "error handling job");
Contributor

The errors are also displayed in the frontend, so we need to check what impact this has.

Contributor Author

@sqwishy sqwishy Aug 28, 2022

tracing::error! shouldn't end up anywhere near the frontend/user, should it? (Or does this end up in the job logs somehow? GitHub won't show me the context.)

Contributor Author

sqwishy commented Aug 28, 2022 via email

Comment on lines +1777 to +1790
/* pass fail */
(0, Some(99)),
(1, None),
/* pass pass fail */
(0, Some(99)),
(1, Some(99)),
(2, None),
/* pass pass pass */
(0, Some(3)),
(1, Some(5)),
(2, Some(7)),
/* fail the last step once */
(0xff, None),
(0xff, Some(9)),
Contributor

I've re-read the tests a few times and couldn't make much sense of the meaning of 99, of 3, 5, 7, or of the attempts assertions.

Contributor Author

The particular numbers are arbitrary. The values from the first two tries should be rejected, so making them 99 means that if a 99 ever shows up, we know it shouldn't be there and where it came from.

The attempts assert is to verify the steps are running/re-running in the right order.
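As a rough illustration of the sentinel idea, the sketch below replays only the first three groups of the scripted attempts (pass fail, pass pass fail, pass pass pass). It is not the test from this PR: `Attempt`, `SENTINEL`, and `replay` are hypothetical, and the sketch assumes that a failure discards everything collected in the current try before the next try starts again at step 0.

```rust
/// One scripted response: (step index, `Some(value)` to pass, `None` to fail).
type Attempt = (u8, Option<u32>);

/// Value returned by tries that are expected to be thrown away.
const SENTINEL: u32 = 99;

/// Replays the script and returns the values kept by the final try,
/// together with the order in which steps were run.
/// Assumption: a failed step discards every value collected in that try.
fn replay(script: &[Attempt]) -> (Vec<u32>, Vec<u8>) {
    let mut kept = Vec::new();
    let mut order = Vec::new();
    for &(step, outcome) in script {
        order.push(step);
        match outcome {
            Some(value) => kept.push(value),
            None => kept.clear(),
        }
    }
    (kept, order)
}
```

With this shape, a check that the kept values do not contain `SENTINEL` covers the "rejected values" half, and comparing `order` against the expected step sequence covers the "right order" half.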

Contributor

@rubenfiszel rubenfiszel left a comment

Lots of great things here, would prefer to merge #491 first

@sqwishy sqwishy linked an issue Aug 31, 2022 that may be closed by this pull request
also renamed duration to interval to be more specific about the retry
interval/period between tries or attempts
Contributor

@rubenfiszel rubenfiszel left a comment

The OpenAPI spec changes are missing.

@rubenfiszel rubenfiszel merged commit d69d002 into windmill-labs:main Sep 3, 2022
@github-actions github-actions bot locked and limited conversation to collaborators Sep 3, 2022
Development

Successfully merging this pull request may close these issues.

Flow error handling and recovery
