
"Instance got rescheduled" should not be treated as failure #4

Closed
sio opened this issue Jul 30, 2020 · 8 comments
Labels
bug Something isn't working

Comments

@sio
Owner

sio commented Jul 30, 2020

Originally reported by Andrea Bolognani in libvir-list:

Since I have your attention, I'll also report the only issue we've
encountered so far that might be a genuine bug in cirrus-run. If you
look at this recent pipeline

https://gitlab.com/libvirt/libvirt/-/pipelines/170028119

you'll see that the x86-freebsd-12-build job has failed; however if
you look at the corresponding Cirrus CI job

https://cirrus-ci.com/build/6133607741784064

you'll notice that it has completed successfully. We've seen this
happen about once a week on average. It's as if cirrus-run somehow
lost track of the status of the Cirrus CI job...

Unfortunately I haven't had time to dig further, but if there's any
information that I could provide to help you figure out what's going
on please just ask.

In this case Cirrus CI instance was rescheduled (likely due to VM being preempted by GCP) and cirrus-run incorrectly reported build failure.

The timing of the cirrus-run failure (24:19) roughly matches the duration of the first run attempt at Cirrus CI (00:01 + 05:04 + 19:10). The rerun was scheduled 2 seconds after the initial run ended. It is very likely that during those 2 seconds the API returned a status value indicating build failure (FAILED/ABORTED/ERRORED); the specifics would have been available if such a failure had been caught in verbose mode (cirrus-run -v).

I do not see any field in the GraphQL schema that would indicate the build is being rerun. A possible workaround would be to add a short delay (5 seconds?) and check the build status again before reporting a build failure to the user (cirrus_run/queries.py:wait_for_build).
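The workaround described above can be sketched as a polling loop that only reports failure after the failure status persists across several consecutive checks, so that a transient FAILED/ABORTED/ERRORED reading during a reschedule is ignored. This is a minimal sketch, not the actual cirrus_run/queries.py code; the `get_status` callable and the exact status strings are assumptions based on the values mentioned in this thread.

```python
import time

FAILURE_STATUSES = {'FAILED', 'ABORTED', 'ERRORED'}

def wait_for_build(get_status, delay=3, confirmations=2):
    """Poll build status until it completes or fails persistently.

    get_status: hypothetical callable returning the current Cirrus CI
    build status string. A failure is only reported after it has been
    observed on more than `confirmations` consecutive polls; any
    non-failure status in between (e.g. the build being rescheduled
    back to a running state) resets the counter.
    """
    seen_failures = 0
    while True:
        status = get_status()
        if status == 'COMPLETED':
            return True
        if status in FAILURE_STATUSES:
            seen_failures += 1
            if seen_failures > confirmations:
                return False  # failure confirmed on repeated polls
        else:
            seen_failures = 0  # build is running again: reset the counter
        time.sleep(delay)
```

With `delay=3` and `confirmations=2` this matches the "2*delay = 6 seconds" behavior described in the fix: a genuine failure is reported about 6 seconds later, while a 2-second reschedule window is absorbed silently.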

@sio sio added the bug Something isn't working label Jul 30, 2020
sio added a commit that referenced this issue Jul 30, 2020
Fixes issue #4

Cirrus CI sometimes reschedules tasks on a new runner (likely due to GCP
preemption). During the rescheduling, the build status likely contains a
value indicating failure.

This commit adds a short wait (2*delay = 6 seconds by default) so that a
build failure is confirmed multiple times before being reported.

After this change build failures take 6 seconds longer to be
reported. Successful builds are unaffected.
@sio
Owner Author

sio commented Jul 31, 2020

Workaround has been added to master.

Since the source of the problem was deduced almost on a hunch, without any hard evidence, there is a chance that this fix is not enough.

I'll leave the issue open for a few weeks to gather feedback.

@sio sio mentioned this issue Jul 31, 2020
4 tasks
@andreabolognani

@sio does running cirrus-run in verbose mode cause severe slowdowns or anything like that? If not, we can temporarily switch libvirt's jobs to verbose mode and hope to catch another occurrence of rescheduling that way.

@sio
Owner Author

sio commented Aug 3, 2020

No, no slowdowns at all. Just a lot of noise in stdout :-) One '-v' is enough for this case.

After adding the workaround described above, I've experienced an automatic rerun in Cirrus CI which was handled correctly by cirrus-run: GitLab, Cirrus. cirrus-run waited until the second run was finished and fetched logs for both runs.

The nature of this rerun was somewhat different, though (it was not caused by GCP preemption), but that still speaks in favor of the existing fix. I only noticed the rerun because it was a genuine build failure which I went to investigate.

patchew-importer pushed a commit to patchew-project/libvirt that referenced this issue Aug 4, 2020
We've hit issues with GitLab CI jobs reporting a failure despite
the corresponding Cirrus CI job finishing successfully: this is
apparently caused by the underlying VM being rescheduled.

A workaround for this issue has been implemented as of

  sio/cirrus-run@5299874

which will be included in the upcoming 0.3.0 release; however, in
order to validate that this workaround is effective it would be
useful to have more data.

Based on the conversation in

  sio/cirrus-run#4

enabling verbose mode allows this data to be collected without any
impact on performance, so let's enable it temporarily and then
disable it again once cirrus-run 0.3.0 is out.

Signed-off-by: Andrea Bolognani <abologna@redhat.com>
Reviewed-by: Ján Tomko <jtomko@redhat.com>
@andreabolognani

All libvirt pipelines starting with https://gitlab.com/libvirt/libvirt/-/pipelines/174107420 are now using verbose mode. Hopefully a couple of weeks worth of data will be enough to confirm the issue has been solved for good :)

@sio
Owner Author

sio commented Aug 4, 2020

Great! Thank you very much

@sio
Owner Author

sio commented Aug 16, 2020

It seems that there were no more misreported build failures in libvirt/libvirt (by the way, do any other repos under libvirt org use cirrus-run?)

I think I'll go forward with 0.3 release chores in the next couple of days.

@andreabolognani

It seems that there were no more misreported build failures in libvirt/libvirt (by the way, do any other repos under libvirt org use cirrus-run?)

Not yet... I haven't had the time to create jobs for the other repos, though it's definitely something that we want to do.

I think I'll go forward with 0.3 release chores in the next couple of days.

If you think you have enough data confirming that this and the other changes are good, then it sounds like a plan :)

@sio
Owner Author

sio commented Aug 17, 2020

I have encountered three automatic re-runs in Cirrus CI since the workaround was implemented. All of them in my own project, none in libvirt.

All of them were handled correctly by cirrus-run. I think it's safe to assume the workaround is sufficient. Thank you very much for your help!

I wrote a helper script to query the Cirrus API for builds with more than one task (since there is no way to detect re-runs directly, I'm just looking for an unexpected number of tasks in a build): https://github.com/sio/cirrus-run/blob/inspect-issue-4/inspect.py I ran that script against both my repo and libvirt/libvirt; it just so happened that libvirt didn't encounter a re-run within the last 100 builds.
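The detection idea behind that helper can be sketched as a small filter over the build list returned by Cirrus CI's GraphQL API: a build normally has one task per job defined in .cirrus.yml, so a build with extra tasks is a likely re-run. This is a hypothetical sketch, not the actual inspect.py; the node shape and the `expected_tasks` parameter are assumptions for illustration.

```python
def find_reruns(builds, expected_tasks):
    """Return IDs of builds whose task count differs from the expected one.

    builds: a list of build nodes as they might appear in a Cirrus CI
    GraphQL response, each shaped like {"id": ..., "tasks": [...]}.
    A mismatched task count is treated as a likely re-run, since the
    API exposes no explicit re-run indicator.
    """
    return [build["id"] for build in builds
            if len(build["tasks"]) != expected_tasks]
```

For example, with one job defined in .cirrus.yml, a build carrying two tasks would be flagged while a single-task build would not.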

I've just released v0.3.0 and I'm closing this issue now. Feel free to reopen if some new information arrives.
