SkvbcPersistenceTest fails intermittently (in CI & locally) #294
Comments
After introducing PR #295, we observed the following failure (possibly an actual state transfer + persistence bug):
PR #289 aims to solve the intermittent failures caused by flaky test logic. Note: this PR does not try to address potential concord-bft issues.
Here's another intermittent failure instance:
Most of the failures are observed when wait_for_state_transfer_to_stop() is invoked on a randomly chosen "stable" replica towards the end of _run_state_transfer_while_crashing_primary_repeatedly(). To make the test more reliable, we should remove this randomness and select an "up_to_date" replica that has not been restarted recently, so that all of its metrics are ready.
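The selection rule proposed above could look roughly like the sketch below. It is not the project's test code: pick_up_to_date_replica, restart_times, and min_uptime_s are hypothetical names, and the 5-second uptime threshold is an arbitrary assumption.

```python
def pick_up_to_date_replica(replicas, restart_times, now, min_uptime_s=5.0):
    """Deterministically pick a replica that has not been restarted recently.

    replicas      -- iterable of replica ids considered "stable" (hypothetical input)
    restart_times -- dict of replica id -> timestamp of its last restart;
                     replicas that were never restarted are simply absent
    now           -- current timestamp, in the same clock as restart_times
    min_uptime_s  -- assumed minimum uptime before metrics are considered ready
    """
    never = float("-inf")  # a replica that never restarted has infinite uptime

    def uptime(replica):
        return now - restart_times.get(replica, never)

    candidates = [r for r in replicas if uptime(r) >= min_uptime_s]
    if not candidates:
        raise RuntimeError("no replica has been up long enough")

    # Longest uptime wins; ties are broken by the lowest replica id,
    # so repeated runs always select the same replica.
    return max(candidates, key=lambda r: (uptime(r), -r))
```

Because the choice depends only on restart history and replica id, a failing run can be replayed with the same replica selected.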
The last remaining failure is due to an assert in ReplicaLoader.cpp which fails occasionally (while filling up the SKVBC with data and waiting for checkpoints):
VerifyOR(seqNum > ld.lastStableSeqNum, e.getCheckpointMsg()->isStableState(), InconsistentErr);
We should add some logging to understand what's going on here, but the problem could be due to the strict inequality between seqNum and ld.lastStableSeqNum.
We don't need additional logging; assertUtils.hpp does this job for us. We just need a replica log file, which will show the parameter values that caused the assert to throw.
Thanks @yuliasherman, this would help indeed! However, to make it work, we need to run CI with log4cpp (USE_LOG4CPP=ON), which is not the case right now (not sure why)...
Oh, got your point now.
* Return exception throw in Timers class in case timer not found
* Restore previous test behavior
* Fix issue "SkvbcPersistenceTest fails intermittently (in CI & locally) #294"
Yulia's fix seems to resolve the last intermittent failures. I'll keep observing for several days and close the issue if it doesn't reproduce.
Closing this issue, as it hasn't been observed since the above fix.
Describe the bug
As stated in the title, there are intermittent failures of persistence-related system tests.
Most of the failures seem to be caused by the test restarting replicas and then reading metrics that aren't immediately available. However, other failures have been observed too, including failing assertions in the concord-bft code (see attached stack traces).
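One common way to tolerate metrics that appear only some time after a restart is to poll with a deadline instead of reading once. The sketch below is an illustration only: wait_for_metric and read_metric are hypothetical names, and treating KeyError as "metric not published yet" is an assumption, not the behavior of the project's metrics client.

```python
import time


def wait_for_metric(read_metric, timeout_s=5.0, interval_s=0.1,
                    sleep=time.sleep, clock=time.monotonic):
    """Call read_metric() until it succeeds or timeout_s elapses.

    read_metric -- hypothetical zero-argument callable; assumed to raise
                   KeyError while the metric is not yet available
    sleep/clock -- injectable so the helper can be tested without waiting
    """
    deadline = clock() + timeout_s
    while True:
        try:
            return read_metric()
        except KeyError:
            # Metric not published yet: give up only once the deadline passed.
            if clock() >= deadline:
                raise TimeoutError("metric not available before timeout")
            sleep(interval_s)
```

A test would wrap each metric read made shortly after a restart in such a helper, so that a slow-starting replica delays the assertion instead of failing it.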
To Reproduce
Steps to reproduce the behavior:
for i in `seq 20`; do python3 -m unittest test_skvbc_persistence.SkvbcPersistenceTest.test_st_while_primary_crashes 1>/dev/null; done
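The shell loop above reruns the test but does not tally the outcome. A small Python wrapper, sketched below, counts how many runs exit with a non-zero status. count_failures is a hypothetical helper; the test identifier is copied from the command above, and the wrapper is assumed to run from the directory containing test_skvbc_persistence.py.

```python
import subprocess
import sys

# Test identifier taken from the reproduction command above.
TEST_ID = "test_skvbc_persistence.SkvbcPersistenceTest.test_st_while_primary_crashes"


def count_failures(cmd, attempts=20):
    """Run cmd `attempts` times and return how many runs exited non-zero."""
    failures = 0
    for _ in range(attempts):
        # stdout is discarded, as in the shell loop; stderr, where unittest
        # writes its report, stays visible.
        result = subprocess.run(cmd, stdout=subprocess.DEVNULL)
        if result.returncode != 0:
            failures += 1
    return failures
```

Calling count_failures([sys.executable, "-m", "unittest", TEST_ID]) then returns the number of failed runs out of 20, which gives a rough failure rate to compare before and after a fix.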
Expected behavior
SkvbcPersistenceTest should succeed consistently.
Screenshots
N/A