tests: new sanitycheck tag "emu_time"; increase some timeouts #17107

Closed

Conversation


@marc-hb marc-hb commented Jun 27, 2019

Quoting bug "Configure QEMU to run independent of the host clock #14173"

  We have been struggling for years with issues related to how QEMU
  attempts to synchronize guest timer interrupts with the host clock,
  for example #12553. The symptom is that heavily loaded sanitycheck
  runs have tests spuriously failing due to timing related issues.

  This creates noise in our CI runs which masks true bugs in our system
  which manifest only intermittently, causing real issues that will
  happen all the time at scale to be 'swept under the rug'; right now
  any time a test fails sanitycheck retries it a few times and only
  consecutive failures produce an error.

There's also a lot of relevant information and more links in:
"List of tests that keep failing sporadically" #12553

This new "emu_time" tag helps by letting users either select or exclude
the tests that really need accurate time to pass and have a high chance
to actually be impacted by this emulation issue. As an example, it's
only with 'sanitycheck --exclude emu_time' that I could spot and file
the intermittent but non-emu_time issue #16915. As Andrew predicted above,
it was drowned in emu_time noise before that.

Conversely, "--tag emu_time" can be used by developers focusing on
fixing qemu's -icount feature, for instance #14173 or others.

Even before qemu's -icount is fixed, Continuous Integration could be
split in two separate runs: A. --tag emu_time and B. --exclude
emu_time. Only A tests would be allowed retries which would stop hiding
other, unrelated intermittent issues affecting B tests.
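
As a rough illustration of that split (the flag spellings are the ones already
used in this description; how CI would wire up the retries for run A is left open):

  sanitycheck --tag emu_time        # run A: timing-sensitive tests, retries allowed
  sanitycheck --exclude emu_time    # run B: everything else, no retries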

This initial commit does not pretend to exhaustively tag all affected
tests. However, it is already a functional and useful start: 14 tests
collected from, and validated over, weeks of local sanitycheck runs and
partially reviewed by qemu clock expert Andy Ross.

This commit also increases the individual timeout of 7 tests that have
been observed to consistently take much longer than the
median (2-3s). This flags how unusually long these are, lets users
temporarily reduce the very long 60s default timeout in their local
workspace and finally should reduce the chance of them timing out on a
heavily loaded system. Set their timeout to 3-4 times the duration
observed in CI and locally.
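
For illustration, a testcase.yaml entry touched by this kind of change ends up
looking roughly like this (hypothetical test name; the values just restate the
rules above):

  tests:
    kernel.some_timing_sensitive_test:   # hypothetical name, for illustration only
      tags: kernel emu_time              # needs an accurate emulated clock to pass reliably
      timeout: 20                        # ~3-4x the observed duration, instead of the 60s default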

Signed-off-by: Marc Herbert marc.herbert@intel.com

@zephyrbot zephyrbot added the "area: Networking", "area: Samples", "area: Tests" and "area: Kernel" labels Jun 27, 2019

zephyrbot commented Jun 27, 2019

All checks are passing now.

Review history of this comment for details about previous failed status.
Note that some checks might have not completed yet.

@marc-hb marc-hb added the "area: Sanitycheck" and "Enhancement" labels Jun 27, 2019

@nashif nashif left a comment

na, we should fix the problem for real, not hide it under the rug.


marc-hb commented Jun 27, 2019

This new tag:

  1. doesn't hide anything,
  2. exposes non-qemu issues that have been hidden for years and are still routinely being hidden now. See commit message.

@marc-hb marc-hb requested review from wentongwu and ceolin June 27, 2019 02:22
@@ -4,4 +4,4 @@ common:
 tests:
   net.trickle:
     min_ram: 12
-    tags: net trickle
+    tags: net trickle emu_time

Member

There were some issues with the Trickle test earlier (like months ago) but I thought I had changed the tests slightly to "fix" the issues, and I have not seen (at least in networking PRs) that the Trickle tests are failing. Do you have a link to a sanitycheck log about the failure case?

Collaborator Author

@jukkar are you saying these tests do not require a reliable clock, or not any more?

Member

> are you saying these tests do not require a reliable clock, or not any more?

I was not saying that (at least not yet :). I was just wondering if this is needed, because personally I have not seen issues with the Trickle tests lately.

Collaborator Author

This emu_time tag is purely about marking tests that require a reliable clock. Have you observed clock issues in general? For instance, have you seen any other test in this list sometimes failing with timing-related errors?

  • If not, then your system somehow provides tests with a reliable clock. See references in the commit message.
  • If you do observe random timing failures spread across various tests on your system, then I may have mistaken a past failure of this trickle test for an emu_time issue.

Collaborator Author

I removed this emu_time tag on tests/net/trickle and it does fail intermittently at the fairly recent commit 8f1780f. Caught this one just now:

starting test - test_trickle_2_wait
Trickle 1 0x00400040 callback called

    Assertion failed at ZEPHYR_BASE/tests/net/trickle/src/main.c:171:
                 test_trickle_2_wait: (cb_2_called is false)

Trickle 2 no timeout
FAIL - test_trickle_2_wait

I don't know these tests. @andyross, @jukkar, anyone: does this assert look like a clock-related / emu_time issue or something else?

Member

> I don't know these tests. @andyross, @jukkar, anyone: does this assert look like a clock-related / emu_time issue or something else?

The test is clock related. While running the test on mps2_an385, I noticed that the test emits an error print. I sent #17289, which fixes the error printing but does not touch the clock issues.
I could not reproduce the issue you mentioned after running sanitycheck or doing "make run" quite a few times on my host.

Collaborator Author

I kept running this particular test locally and regularly, and to be fair it seems to fail rarely. Then again, maybe it's just because the couple of workarounds recently merged in sanitycheck made all emu_time tests fail less often...

> I could not reproduce the issue you mentioned after running sanitycheck or doing "make run" quite a few times on my host.

Unless we're not talking about the same "make run", this runs only one test at a time AFAIK. For sanitycheck it depends on many factors; see @andyross's analysis referenced above.

@andrewboie andrewboie left a comment


> Only A tests would be allowed retries which would stop hiding
> other, unrelated intermittent issues affecting B tests.

Great!! Is there related logic implemented in CI to make this happen?

@nashif I think we should merge this; @wentongwu is of course working on fixing QEMU, but that may be some ways off, and we may not be able to get it working for all the QEMU architecture variants for 2.0. We can always remove these tags once the final fix is in.


marc-hb commented Jun 28, 2019

> Only A tests would be allowed retries which would stop hiding
> other, unrelated intermittent issues affecting B tests.

> Great!! Is there related logic implemented in CI to make this happen?

I don't know. I've used sanitycheck a million times but I'm not familiar with the CI-only code. I mention this in the commit message because 1. there's already some (indiscriminate) retry logic, so CI is hopefully "half-way there", and 2. isolating flaky tests is a well-known practice: https://martinfowler.com/articles/nonDeterminism.html#Quarantine

> Initially people will look at the failure report and notice that the failures are in non-deterministic tests, but soon they'll lose the discipline to do that. Once that discipline is lost, then a failure in the healthy deterministic tests will get ignored too. At that point you've lost the whole game and might as well get rid of all the tests.

There's plenty of literature about flaky tests; here are a couple more:
https://testing.googleblog.com/2016/05/flaky-tests-at-google-and-how-we.html
https://www.thoughtworks.com/insights/blog/no-more-flaky-tests-go-team


nashif commented Jun 28, 2019

I still think this is a bad idea and I'm not sure who is going to be the user of this. Tags are actually not great in this case, because you are going to exclude everyone, even when we have some architectures that have no issues with timing.

The qemu timing issue happens mostly in CI; when running locally, if something fails, you always have the option to rerun the tests and get them all to pass on a second run, so again, not sure who is going to be the user of this. I'd rather exclude tests selectively than use such a tag.

@andyross

Definitely approve of the tag. Ideally we would augment the "--only-failed" feature to be able to use it (maybe that works already; not sure what the precedence is of filters vs. --only-failed). Long term, yes, we want deterministic tests. But realistically we're going to be struggling with issues like this repeatedly, and having a framework for handling it can't be a bad thing.
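
For instance, something along these lines, assuming the two options compose (a sketch of one possible reading only; as noted, the precedence of tag filters vs. --only-failed is unverified):

  sanitycheck --only-failed --tag emu_time    # re-run only the failures that are timing-sensitive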


marc-hb commented Jun 29, 2019

> you are going to exclude everyone, even when we have some architectures that have no issues with timing.

Fair point. Which architectures? The failures I observed seemed spread across all qemus but I didn't look closely.
(This tag should obviously not be used when testing hardware, it's pretty clear from the name.)

> when running locally, if something fails, you always have the option to rerun the tests and get them all to pass on a second run,

Some of my runs last a day (I don't have a powerful system for every OS) so waiting a second day and manually comparing the sets of failures isn't really an option. Unless you're referring to --only-failed too? I haven't looked into it yet; I will.

In any case, an indiscriminate retry drops other, unrelated intermittent failures on the floor; this is known as the "broken window" in CI. Even with --exclude emu_time I'm still seeing a couple dozen other tests failing intermittently. I filed a couple of bugs already, but I'm not going to keep bothering unless a little bit is done to help with the categorization; I'll just keep a big fat "flaky" tag for myself and be done. Once PR #14593 reaches some level of completion I will hopefully not need wide and frequent test runs any more, so it'll be even easier for me to ignore intermittent failures.

> I'd rather exclude tests selectively than use such a tag.

I don't see the conceptual difference between a list of tests and a tag (and no one has to use either).

> not sure who is going to be the user of this

Anyone or anything interested in accurate test reports.

> Long term, yes, we want deterministic tests.

I found this interesting: https://testing.googleblog.com/2016/05/flaky-tests-at-google-and-how-we.html

> We have invested a lot of effort in removing flakiness from tests, but overall the insertion rate is about the same as the fix rate, meaning we are stuck with a certain rate of tests that provide value, but occasionally produce a flaky result.

Maybe it's because engineers have more time to write more (stress) tests when they run out of bugs to fix. The number of engineers is finite; the number of bugs, not so sure. Especially not when concurrency is involved.

@marc-hb marc-hb requested a review from mike-scott July 9, 2019 21:04
@mike-scott

The addition of the tag alone doesn't hurt anyone and might let CI make intelligent decisions about when to retry and when not to. However, adjusting the timeout values in this PR might need to be debated. That's an active change to the tests regardless of whether the tag is used or not.


@marc-hb marc-hb left a comment


> However, adjusting the timeout values in this PR might need to be debated.

I'll split it out; it is a mostly unrelated change. I put them together because of the proximity and the version-control inconvenience, but there aren't that many tests where I changed both after all.

> That's an active change to the tests regardless of whether the tag is used or not.

Not sure what you mean by "active". As the current default is 60 s, it seemed pretty minor to me.


@@ -2,7 +2,8 @@ sample:
   name: CMSIS_RTOS_V2 Synchronization
 tests:
   sample.portability.cmsis_rtos_v2.timer_synchronization:
     tags: cmsis_rtos_v2_synchronization
+    timeout: 60

Contributor

If the default is 60, why would we need it expressly set here?

Collaborator Author

Because I don't use 60 seconds as a default; it's way too long. Most tests complete in 5-10 seconds, so a 15-second default is enough.

Contributor

@marc-hb why are you doing that? sanitycheck detects when the tests complete; all changing the default to a shorter time is gonna do is increase the likelihood your run will fail


@marc-hb marc-hb Jul 18, 2019


> why are you doing that?

Simple: when some random test that normally takes 5 seconds hangs forever, I prefer to wait 15 seconds rather than 60.

> all changing the default to a shorter time is gonna do is increase the likelihood your run will fail

E pur si muove. I've been using a shorter default and the very few tests in this PR are the only ones that needed a longer timeout. Does this mean the 60s default is too conservative? I think so.

I thought it would be uncontroversial to flag the very few long-duration tests in the entire suite, but forget it; this is not the point of this PR and, as I said above, I'll remove them (when I get back from vacation).

(Now I just remembered the existing "slow" tag; I haven't looked at that yet.)

@@ -1,3 +1,5 @@
 tests:
   kernel.memory_pool:
     tags: kernel mem_pool
+    timeout: 60

Contributor

Same question about the default of 60 being set expressly here.

@@ -1,4 +1,5 @@
 tests:
   kernel.semaphore:
+    timeout: 60

Contributor

ditto.

@pfalcon pfalcon left a comment


I find the description of the issue here detailed and the argumentation convincing, so +1.

And vice-versa, I find comments like:

> na, we should fix the problem for real, not hide it under the rug.

to be rather unproductive. Translated into plain language, it's "Go and fix all the issues which we carefully accumulated here for ages - at once. If not, we'd rather continue to cultivate those issues further."

@mbolivar mbolivar removed their request for review August 17, 2019 03:29
@marc-hb marc-hb mentioned this pull request Oct 3, 2019

nashif commented Dec 18, 2019

Does not seem to be going anywhere, closing. Rebase and reopen with new data if you think this still needs to go in.

@nashif nashif closed this Dec 18, 2019