SkvbcPersistenceTest fails intermittently (in CI & locally) #294

Closed
teoparvanov opened this issue Dec 4, 2019 · 11 comments

teoparvanov commented Dec 4, 2019

Describe the bug
As stated in the title, there are intermittent failures of persistence-related system tests.
Most of the failures seem to be related to the fact that we restart replicas and some of their metrics aren't immediately available after the restart. However, other failures have been observed too, including failing assertions in the concord-bft code (see attached stack traces).

To Reproduce
Steps to reproduce the behavior:

  • in CI it happens approximately 1 out of 10 times; it cannot be reproduced deterministically
  • when running the system tests locally (Ubuntu 18.04), failures can be observed by running the following shell command:
    for i in `seq 20`; do python3 -m unittest test_skvbc_persistence.SkvbcPersistenceTest.test_st_while_primary_crashes 1>/dev/null; done

Expected behavior
SkvbcPersistenceTest should succeed consistently.

Screenshots
N/A

@teoparvanov teoparvanov self-assigned this Dec 4, 2019
@teoparvanov

Attachment: observed_stacktraces.txt

teoparvanov commented Dec 4, 2019

After introducing PR #295, we observed the following failure (possibly an actual state transfer + persistence bug):

Attachment: stacktrace_possible_st_bug.txt

@teoparvanov

PR #289 aims to solve the intermittent failures caused by flaky test logic.

Note: this PR does not try to address potential concord-bft issues.

@teoparvanov

Here's another intermittent failure instance:
Attachment: stacktrace_keyerror.txt

@teoparvanov

Most of the failures are observed when wait_for_state_transfer_to_stop() is invoked on a random "stable" replica towards the end of _run_state_transfer_while_crashing_primary_repeatedly().

To make the test more reliable, we should remove this randomness and make sure we select an "up_to_date" replica that has not been recently restarted, so that all of its metrics are ready (see the sketch below).
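
For illustration, a rough sketch of what that selection could look like in the Python test code. This is only a sketch under assumptions: the helper name, the stable_replicas/recently_restarted collections and the wait_for_state_transfer_to_stop(up_to_date_node, stale_node) signature are illustrative and may not match the actual test framework API.

    import random

    async def wait_for_st_on_reliable_replica(bft_network, stale_node,
                                              stable_replicas, recently_restarted):
        # Restrict the choice to "stable" replicas that were never crashed or
        # restarted during the test, so their metrics are guaranteed to be
        # up and populated.
        up_to_date_candidates = [r for r in stable_replicas
                                 if r not in recently_restarted]
        up_to_date_node = random.choice(up_to_date_candidates)

        # Hypothetical call mirroring the helper mentioned above.
        await bft_network.wait_for_state_transfer_to_stop(up_to_date_node,
                                                          stale_node)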

teoparvanov commented Dec 11, 2019

The last remaining failure is due to an assert in ReplicaLoader.cpp which fails occasionally (while filling up the SKVBC with data and waiting for checkpoints):

VerifyOR(seqNum > ld.lastStableSeqNum, e.getCheckpointMsg()->isStableState(), InconsistentErr);

We should add some logging to understand what's going on here, but the problem could be due to the strict inequality between seqNum and ld.lastStableSeqNum (illustrated below).
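
For illustration, a minimal Python model of the suspected failure mode. It assumes that VerifyOR raises InconsistentErr only when both conditions are false; the function name and sequence numbers below are made up:

    # Model of the loader check: the checkpoint entry is accepted if either
    # condition holds (assumed VerifyOR semantics).
    def checkpoint_entry_accepted(seq_num, last_stable_seq_num, is_stable_state):
        return seq_num > last_stable_seq_num or is_stable_state

    assert checkpoint_entry_accepted(151, 150, False)      # strictly greater: accepted
    assert checkpoint_entry_accepted(150, 150, True)        # stable checkpoint: accepted
    assert not checkpoint_entry_accepted(150, 150, False)   # equal but not stable: assert fires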

@yuliasherman

We don't need additional logging; assertUtils.hpp does this job for us. We just need a replica log file, which includes information about which parameters caused the assert to throw.

teoparvanov commented Dec 11, 2019

Thanks @yuliasherman, this would indeed help! However, to make it work, we need to run CI with log4cpp (USE_LOG4CPP=ON), which is not the case right now (not sure why)...

@yuliasherman

Oh, got your point now.

yuliasherman pushed a commit to yuliasherman/concord-bft that referenced this issue Jan 2, 2020
yuliasherman added a commit that referenced this issue Jan 2, 2020
* Return exception throw in Timers class in case timer not found

* Restore previous test behavior

* Fix issue "SkvbcPersistenceTest fails intermittently (in CI & locally) #294"
@teoparvanov

Yulia's fix seems to resolve the last intermittent failures.

I'll keep observing for several days and close the issue if it isn't reproduced anymore...

teoparvanov commented Jan 13, 2020

Closing this issue, as the failure hasn't been observed since the above fix.
