"Instance got rescheduled" should not be treated as failure #4
Fixes issue #4. Cirrus CI sometimes reschedules tasks onto a new runner (likely due to GCP preemption). During rescheduling, the build status likely contains a value indicating failure. This commit waits a little (2*delay = 6 seconds by default) and confirms the build failure multiple times before reporting it. After this change, build failures will take 6 seconds longer to be reported; successful builds are unaffected.
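The confirmation logic described above can be sketched as follows. This is a minimal illustration, not the actual cirrus-run implementation; the status names are taken from the discussion below, and `get_status`, `attempts`, and `delay` are hypothetical parameters:

```python
import time

# Statuses the API may report for a finished-and-failed build
# (per the discussion; the real code may use a different set).
FAILURE_STATUSES = {"FAILED", "ABORTED", "ERRORED"}

def confirmed_failure(get_status, attempts=3, delay=3):
    """Report failure only if `attempts` consecutive polls, spaced
    `delay` seconds apart, all return a failure status. A transient
    failure status seen during a VM reschedule is therefore not
    enough to fail the build."""
    for i in range(attempts):
        if get_status() not in FAILURE_STATUSES:
            return False  # status recovered: not a real failure
        if i < attempts - 1:
            time.sleep(delay)
    return True
```

With the defaults above, a genuine failure costs two extra `delay` sleeps (2*3 = 6 seconds), matching the overhead stated in the commit message, while a status that recovers mid-way is treated as a reschedule rather than a failure.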
A workaround has been added to master. Since the source of the problem was deduced almost on a hunch, without any hard evidence, there is a chance that this fix is not enough. I'll leave the issue open for a few weeks to gather feedback.
@sio does running with `-v` cause any slowdowns?
No, no slowdowns at all. Just a lot of noise in stdout :-) One `-v` is enough for this case. After adding the workaround described above I've experienced an automatic rerun in Cirrus CI which was handled correctly by cirrus-run: GitLab, Cirrus. cirrus-run waited until the second run finished and fetched logs for both runs. The nature of this rerun was somewhat different (it was not caused by GCP preemption), but that still speaks in favor of the existing fix. I noticed that a rerun had occurred only because it was a genuine build failure which I went to investigate.
We've hit issues with GitLab CI jobs reporting a failure despite the corresponding Cirrus CI job finishing successfully: this is apparently caused by the underlying VM being rescheduled. A workaround for this issue has been implemented as of sio/cirrus-run@5299874 and will be included in the upcoming 0.3.0 release; however, in order to validate that this workaround is effective it would be useful to have more data. Based on the conversation in sio/cirrus-run#4, enabling verbose mode makes it possible to collect this data without any performance impact, so let's enable it temporarily and then disable it again once cirrus-run 0.3.0 is out.

Signed-off-by: Andrea Bolognani <abologna@redhat.com>
Reviewed-by: Ján Tomko <jtomko@redhat.com>
All libvirt pipelines starting with https://gitlab.com/libvirt/libvirt/-/pipelines/174107420 are now using verbose mode. Hopefully a couple of weeks' worth of data will be enough to confirm the issue has been solved for good :)
Great! Thank you very much
It seems that there have been no more misreported build failures in libvirt/libvirt (by the way, do any other repos under the libvirt org use cirrus-run?). I think I'll go forward with the 0.3 release chores in the next couple of days.
Not yet... I haven't had the time to create jobs for the other repos, though it's definitely something that we want to do.
If you think you have enough data confirming that this and the other changes are good, then it sounds like a plan :)
I have encountered three automatic reruns in Cirrus CI since the workaround was implemented, all of them in my own project and none in libvirt. All of them were handled correctly by cirrus-run, so I think it's safe to assume the workaround is sufficient. Thank you very much for your help!

I wrote a helper script that queries the Cirrus API for builds with more than one task (since there is no way to detect reruns directly, I'm just looking for an unexpected number of tasks in a build): https://github.com/sio/cirrus-run/blob/inspect-issue-4/inspect.py I ran that script against both my repo and libvirt/libvirt; it just so happened that libvirt didn't encounter a rerun within the last 100 builds.

I've just released v0.3.0 and I'm closing this issue now. Feel free to reopen if new information arrives.
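The rerun-detection heuristic used by that helper script can be illustrated with a small sketch. The data shape here (builds as `(build_id, tasks)` pairs) is an assumption for the example, not the real Cirrus API schema or the actual inspect.py code:

```python
def suspected_reruns(builds, expected_tasks=1):
    """Return ids of builds whose task count differs from the
    expected number. Since the Cirrus API exposes no explicit
    rerun flag, an unexpected task count serves as a proxy for
    a build that was rerun."""
    return [build_id
            for build_id, tasks in builds
            if len(tasks) != expected_tasks]
```

For a project whose config defines a single task per build, any build listing two or more tasks is a likely rerun candidate worth inspecting manually.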
Originally reported by Andrea Bolognani in libvir-list:
In this case the Cirrus CI instance was rescheduled (likely due to the VM being preempted by GCP) and cirrus-run incorrectly reported a build failure.
The timing of the cirrus-run failure (24:19) roughly matches the timing of the first run attempt at Cirrus CI (00:01 + 05:04 + 19:10). The rerun was scheduled 2 seconds after the initial run ended. It is very likely that during those 2 seconds the API returned a status value indicating build failure (FAILED/ABORTED/ERRORED); specifics would have been available if such a failure had been caught in verbose mode (`cirrus-run -v`).

I do not see any field in the GraphQL schema that would indicate the build is being rerun. A possible workaround would be to add a timeout (5 seconds?) and check the build status again before reporting a build failure to the user (`cirrus_run/queries.py:wait_for_build`).
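To make the proposed workaround concrete, here is a hedged sketch of what a build-status check against a GraphQL API could look like. The query text and field names are assumptions modelled on this report, not copied from `cirrus_run/queries.py`, and the payload is only serialized, not sent:

```python
import json

# Hypothetical GraphQL query for a build's status; the real schema
# may differ.
BUILD_STATUS_QUERY = """
query ($build: ID!) {
  build(id: $build) { status }
}
"""

def build_status_payload(build_id):
    """Serialize a GraphQL request body for a build status check."""
    return json.dumps({"query": BUILD_STATUS_QUERY,
                       "variables": {"build": str(build_id)}})

def looks_like_failure(status):
    # These statuses may appear transiently while an instance is
    # being rescheduled, so a caller should wait a few seconds and
    # re-check before treating them as final.
    return status in {"FAILED", "ABORTED", "ERRORED"}
```

A polling loop built on these helpers would re-query after a short timeout whenever `looks_like_failure` returns true, and only report failure to the user if the status has not changed.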