
"Instance got rescheduled" should not be treated as failure #4

Closed
sio opened this issue Jul 30, 2020 · 8 comments
Labels
bug Something isn't working

Comments

@sio
Owner

sio commented Jul 30, 2020

Originally reported by Andrea Bolognani in libvir-list:

Since I have your attention, I'll also report the only issue we've
encountered so far that might be a genuine bug in cirrus-run. If you
look at this recent pipeline

https://gitlab.com/libvirt/libvirt/-/pipelines/170028119

you'll see that the x86-freebsd-12-build job has failed; however if
you look at the corresponding Cirrus CI job

https://cirrus-ci.com/build/6133607741784064

you'll notice that it has completed successfully. We've seen this
happen about once a week on average. It's as if cirrus-run somehow
lost track of the status of the Cirrus CI job...

Unfortunately I haven't had time to dig further, but if there's any
information that I could provide to help you figure out what's going
on please just ask.

In this case Cirrus CI instance was rescheduled (likely due to VM being preempted by GCP) and cirrus-run incorrectly reported build failure.

The timing of the cirrus-run failure (24:19) roughly matches the duration of the first run attempt at Cirrus CI (00:01 + 05:04 + 19:10). The rerun was scheduled 2 seconds after the initial run ended. It is very likely that during those 2 seconds the API returned a status value indicating build failure (FAILED/ABORTED/ERRORED); the specifics would have been available if such a failure had been caught in verbose mode (cirrus-run -v).

I do not see any field in the GraphQL schema that would indicate the build is being rerun. A possible workaround would be to add a short delay (5 seconds?) and check the build status again before reporting a build failure to the user (cirrus_run/queries.py:wait_for_build).
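The workaround described above can be sketched as a polling loop that only reports failure after the failure status persists across several consecutive checks, so that a transient FAILED/ABORTED/ERRORED reading during a reschedule is ignored. This is a minimal sketch, not the actual cirrus_run/queries.py code; the `get_status` callable and the exact status strings are assumptions based on the values mentioned in this thread.

```python
import time

FAILURE_STATUSES = {'FAILED', 'ABORTED', 'ERRORED'}

def wait_for_build(get_status, delay=3, confirmations=2):
    """Poll build status until it completes or fails persistently.

    get_status: hypothetical callable returning the current Cirrus CI
    build status string. A failure is only reported after it has been
    observed on more than `confirmations` consecutive polls; any
    non-failure status in between (e.g. the build being rescheduled
    back to a running state) resets the counter.
    """
    seen_failures = 0
    while True:
        status = get_status()
        if status == 'COMPLETED':
            return True
        if status in FAILURE_STATUSES:
            seen_failures += 1
            if seen_failures > confirmations:
                return False  # failure confirmed on repeated polls
        else:
            seen_failures = 0  # build is running again: reset the counter
        time.sleep(delay)
```

With `delay=3` and `confirmations=2` this matches the "2*delay = 6 seconds" behavior described in the fix: a genuine failure is reported about 6 seconds later, while a 2-second reschedule window is absorbed silently.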

@sio sio added the bug Something isn't working label Jul 30, 2020
sio added a commit that referenced this issue Jul 30, 2020
Fixes issue #4

Cirrus CI sometimes reschedules tasks on a new runner (likely due to GCP
preemption). During the rescheduling, the build status likely contains a
value indicating failure.

This commit adds a short wait (2*delay = 6 seconds by default) so that a
build failure is confirmed multiple times before being reported.

After this change build failures take 6 seconds longer to be
reported. Successful builds are unaffected.
@sio
Owner Author

sio commented Jul 31, 2020

Workaround has been added to master.

Since the source of the problem was deduced almost on a hunch, without any hard evidence, there is a chance that this fix is not enough.

I'll leave the issue open for a few weeks to gather feedback.

@sio sio mentioned this issue Jul 31, 2020
4 tasks
@andreabolognani

@sio does running cirrus-run in verbose mode cause severe slowdowns or anything like that? If not, we can temporarily switch libvirt's jobs to verbose mode and hope to catch another occurrence of rescheduling that way.

@sio
Owner Author

sio commented Aug 3, 2020

No, no slowdowns at all. Just a lot of noise in stdout :-) One '-v' is enough for this case.

After adding the workaround described above, I've experienced an automatic rerun in Cirrus CI which was handled correctly by cirrus-run: GitLab, Cirrus. cirrus-run waited until the second run was finished and fetched logs for both runs.

The nature of this rerun was somewhat different, though (it was not caused by GCP preemption), but that still speaks in favor of the existing fix. I only noticed the rerun because it was a genuine build failure which I went to investigate.

patchew-importer pushed a commit to patchew-project/libvirt that referenced this issue Aug 4, 2020
We've hit issues with GitLab CI jobs reporting a failure despite
the corresponding Cirrus CI job finishing successfully: this is
apparently caused by the underlying VM being rescheduled.

A workaround for this issue has been implemented as of

  sio/cirrus-run@5299874

which will be included in the upcoming 0.3.0 release; however, in
order to validate that this workaround is effective it would be
useful to have more data.

Based on the conversation in

  sio/cirrus-run#4

enabling verbose mode allows this data to be collected without any
impact on performance, so let's enable it temporarily and then
disable it again once cirrus-run 0.3.0 is out.

Signed-off-by: Andrea Bolognani <abologna@redhat.com>
Reviewed-by: Ján Tomko <jtomko@redhat.com>
@andreabolognani

All libvirt pipelines starting with https://gitlab.com/libvirt/libvirt/-/pipelines/174107420 are now using verbose mode. Hopefully a couple of weeks worth of data will be enough to confirm the issue has been solved for good :)

@sio
Owner Author

sio commented Aug 4, 2020

Great! Thank you very much

@sio
Owner Author

sio commented Aug 16, 2020

It seems that there were no more misreported build failures in libvirt/libvirt (by the way, do any other repos under libvirt org use cirrus-run?)

I think I'll go forward with 0.3 release chores in the next couple of days.

@andreabolognani

It seems that there were no more misreported build failures in libvirt/libvirt (by the way, do any other repos under libvirt org use cirrus-run?)

Not yet... I haven't had the time to create jobs for the other repos, though it's definitely something that we want to do.

I think I'll go forward with 0.3 release chores in the next couple of days.

If you think you have enough data confirming that this and the other changes are good, then it sounds like a plan :)

@sio
Owner Author

sio commented Aug 17, 2020

I have encountered three automatic re-runs in Cirrus CI since the workaround was implemented. All of them in my own project, none in libvirt.

All of them were handled correctly by cirrus-run. I think it's safe to assume the workaround is sufficient. Thank you very much for your help!

I wrote a helper script to query the Cirrus API for builds with more than one task (since there is no way to detect re-runs directly, I'm just looking for an unexpected number of tasks in a build): https://github.com/sio/cirrus-run/blob/inspect-issue-4/inspect.py I ran that script against both my repo and libvirt/libvirt; it just so happened that libvirt didn't encounter a re-run within the last 100 builds.
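The detection idea behind that helper can be sketched as a small filter over the build list returned by Cirrus CI's GraphQL API: a build normally has one task per job defined in .cirrus.yml, so a build with extra tasks is a likely re-run. This is a hypothetical sketch, not the actual inspect.py; the node shape and the `expected_tasks` parameter are assumptions for illustration.

```python
def find_reruns(builds, expected_tasks):
    """Return IDs of builds whose task count differs from the expected one.

    builds: a list of build nodes as they might appear in a Cirrus CI
    GraphQL response, each shaped like {"id": ..., "tasks": [...]}.
    A mismatched task count is treated as a likely re-run, since the
    API exposes no explicit re-run indicator.
    """
    return [build["id"] for build in builds
            if len(build["tasks"]) != expected_tasks]
```

For example, with one job defined in .cirrus.yml, a build carrying two tasks would be flagged while a single-task build would not.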

I've just released v0.3.0 and I'm closing this issue now. Feel free to reopen if some new information arrives.
