Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test: CI flake in "UpdateMachine.load retries registerQueue on NetworkError" #602

Closed
chrisbobbe opened this issue Mar 29, 2024 · 5 comments · Fixed by #726
Closed

test: CI flake in "UpdateMachine.load retries registerQueue on NetworkError" #602

chrisbobbe opened this issue Mar 29, 2024 · 5 comments · Fixed by #726
Assignees
Labels
Milestone

Comments

@chrisbobbe
Copy link
Collaborator

Noticed when working on code that's almost definitely unrelated (implementing the web-auth feature):

#600 (comment)

❌ /home/runner/work/zulip-flutter/zulip-flutter/test/model/store_test.dart: UpdateMachine.load retries registerQueue on NetworkError (failed)
  Bad state: Exceeded timeout 1:00:00.000000 while flushing timers
  package:fake_async/fake_async.dart 217:9   FakeAsync.flushTimers.<fn>
  package:fake_async/fake_async.dart 241:21  FakeAsync._fireTimersWhile
  package:fake_async/fake_async.dart 214:5   FakeAsync.flushTimers
  test/fake_async.dart 29:7                  awaitFakeAsync
  test/model/store_test.dart 176:57          main.<fn>.<fn>
@gnprice
Copy link
Member

gnprice commented Mar 29, 2024

Thanks for spotting this. If that indeed also happens in main, then it's a bug in either the test code, or the code under test, so it'd be good to debug and fix.

@chrisbobbe
Copy link
Collaborator Author

chrisbobbe commented Apr 10, 2024

Got this again, this time locally in tools/check, while looking at #587. It didn't appear on a subsequent run.

@gnprice
Copy link
Member

gnprice commented Jun 5, 2024

This flake happened today in a Flutter "customer tests" run:

So it's past time we should resolve it. I'll investigate, and either fix the test or disable it.

@gnprice
Copy link
Member

gnprice commented Jun 5, 2024

Aha. The problem is here in BackoffMachine.wait:

    await Future<void>.delayed(Duration(milliseconds: duration.round()));

If duration is less than 0.5 (which happens with probability 0.5% on the first backoff, since it's uniform from 0 to 100), then it rounds to zero, and we wait for a zero duration.

That then confuses the sequence of happenings that the test is counting on, which is fairly brittle.

I'll try to tighten down the self-checks inside the test, so that if its fairly-brittle sequencing assumptions fail then the test detects that up front. And then I'll fix backoff so that it always waits a nonzero interval — Duration accepts microseconds, so applying a floor of one microsecond should have no material effect on the jitter but should solve this kind of problem for tests.

@gnprice
Copy link
Member

gnprice commented Jun 5, 2024

Oho, and one of the contributing factors here was this limitation called out in awaitFakeAsync:

  // TODO: if the future returned by [callback] completes with an error,
  //   it would be good to throw that error immediately rather than finish
  //   flushing timers.  (This probably requires [FakeAsync] to have a richer
  //   API, like a `fireNextTimer` that does one iteration of `flushTimers`.)
  //
  //   In particular, if flushing timers later causes an uncaught exception, the
  //   current behavior is that that uncaught exception gets printed first
  //   (while `flushTimers` is running), and then only later (after
  //   `flushTimers` has returned control to this function) do we throw the
  //   error that the [callback] future completed with.  That's confusing
  //   because it causes the exceptions to appear in test output in an order
  //   that's misleading about what actually happened.

Because what's happening is that we do have an appropriately paranoid check in the test body:

      // The retry doesn't happen immediately; there's a timer.
      check(async.pendingTimers).length.equals(1);
      async.elapse(Duration.zero);
      check(connection.lastRequest).isNull();

in particular that last line, checking that the elapse didn't cause a retry of the request…

…but when the check fails, that's an exception thrown by test body, aka the callback passed to awaitFakeAsync. So now the flushTimers carries on running to completion before anything else happens. If that runs to timeout (which is quick in real-time terms, thankfully), then it throws, so we don't reach the rest of awaitFakeAsync and never do rethrow the check failure, so the latter doesn't appear in the output at all.

edit: previously wrote: The exception from the check failure does get printed at the end, but on top of being confusingly out of order, the flushTimers may have gone on for a long time by then, running until timeout. but that only applies if the loop of timers eventually stops, not if it continues until timeout.

So, maybe I'll pursue that FakeAsync issue as a follow-up.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Status: Done
Development

Successfully merging a pull request may close this issue.

2 participants