Crash merging buckets #2218

Closed
MonsieurNicolas opened this issue Aug 7, 2019 · 2 comments

@MonsieurNicolas
Contributor

This needs to be investigated, seems to have occurred on 11.3.0

hi seem to be having problems [History ERROR] Replay failed: Error merging bucket curr=0acce3 with snap=03c5ba: Malformed bucket: old non-DEAD + new INIT.. There may be a problem with the local filesystem. Ensure that there is enough space to perform that operation and that disc is behaving correctly. [ApplyLedgerChainWork.cpp:290]
disk is fine and there is 10 GB of free space
Aug 06 11:03:21 ip-172-31-62-17 stellar-core[2960]: 2019-08-06T11:03:21.671 GCJCS [History INFO] Applying transactions for ledgers 25187933..25195267, LCL is [seq=25187932, hash=48e36b]
Aug 06 11:03:21 ip-172-31-62-17 stellar-core[2960]: 2019-08-06T11:03:21.671 GCJCS [History INFO] Catching up: applying checkpoint 1/115 (0%)
Aug 06 11:03:21 ip-172-31-62-17 stellar-core[2960]: 2019-08-06T11:03:21.678 GCJCS [Tx INFO] applying ledger 25187933 (txs:32, ops:73)
Aug 06 11:03:22 ip-172-31-62-17 stellar-core[2960]: stellar-core: bucket/BucketList.cpp:134: void stellar::BucketLevel::prepare(stellar::Application &, uint32_t, uint32_t, std::shared_ptr, const std::vector<std::shared_ptr > &, bool): Assertion `!mNextCurr.isMerging()' failed.
Aug 06 11:03:22 ip-172-31-62-17 systemd[1]: stellar.service: Main process exited, code=dumped, status=6/ABRT
Aug 06 11:03:22 ip-172-31-62-17 systemd[1]: stellar.service: Unit entered failed state.
Aug 06 11:03:22 ip-172-31-62-17 systemd[1]: stellar.service: Failed with result 'core-dump'.
stellar-core 11.3.0 (5f7821d)
antb321
4:36 AM - Yesterday
I downgraded to the last version of 11.2 and it still occurred and then downgraded to stellar-core 11.0.0 (236f831) and it seems to be resolved
EDITED
Here is the log from 11.2
....
Aug 06 10:11:37 ip-172-31-62-17 stellar-core[11983]: 2019-08-06T10:11:37.146 GCJCS [History ERROR] Replay failed: Error merging bucket curr=0acce3 with snap=03c5ba: Malformed bucket: old non-DEAD + new INIT.. There may be a problem with the local filesystem. Ensure that there is enough space to perform that operation and that disc is behaving correctly. [ApplyLedgerChainWork.cpp:290]
Aug 06 10:11:37 ip-172-31-62-17 stellar-core[11983]: 2019-08-06T10:11:37.153 GCJCS [Ledger ERROR] Catchup will restart at next close. [LedgerManagerImpl.cpp:692]

@MonsieurNicolas MonsieurNicolas added this to To do in v11.4.0 via automation Aug 7, 2019
@graydon
Contributor

graydon commented Aug 12, 2019

Eek! Well, the scary message in here is the first one:

Error merging bucket curr=0acce3 with snap=03c5ba: Malformed bucket: old non-DEAD + new INIT

I'll talk to the original reporter to try to figure out a little more context. Judging from the lines following it, it looks like it happened during a catchup from some non-initial state. Replaying through the same range (say 25187904..25195268) does not reproduce such a failure:

  • No replay errors occur
  • No bucket with either putative hash 0acce3 or 03c5ba occurs at all
    (of 15,071 retained buckets, with --disable-bucket-gc)

This does not mean it didn't happen, but it means it's not trivial to provoke.

Vague initial hypothesis: I wonder if there's some way to get a malformed bucket of this sort if core crashes at the right time, in the form of the failure-to-fsync bug currently pending. The malformedness is either merging old LIVE + new INIT, or old INIT + new INIT, for some ledger entry.
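
To make the two malformed cases concrete, here is a minimal sketch of how the lifecycle tags of two entries with the same key might be combined during a merge. It is a simplified illustration under stated assumptions, not stellar-core's actual code: `EntryType` and `mergeEqualKeys` are hypothetical names, and only the error string is taken from the log above.

```cpp
#include <cassert>
#include <optional>
#include <stdexcept>

// Hypothetical, simplified entry kinds (not stellar-core's real types).
enum class EntryType { Init, Live, Dead };

// Combine the tags of two entries sharing one key. `older` comes from the
// older bucket, `newer` from the newer one. Returns std::nullopt when the
// pair cancels out and nothing should be emitted.
std::optional<EntryType> mergeEqualKeys(EntryType older, EntryType newer)
{
    if (newer == EntryType::Init)
    {
        // An INIT asserts that no live entry with this key exists in older
        // buckets, so the only consistent older entry is a DEAD tombstone.
        if (older != EntryType::Dead)
        {
            throw std::runtime_error(
                "Malformed bucket: old non-DEAD + new INIT.");
        }
        // Deleted, then recreated: emit a plain LIVE entry.
        return EntryType::Live;
    }
    if (older == EntryType::Init)
    {
        if (newer == EntryType::Dead)
        {
            // Created and deleted within the merged span: emit nothing.
            return std::nullopt;
        }
        // INIT followed by an update stays INIT (carrying the newer value).
        return EntryType::Init;
    }
    // Otherwise the newer entry simply shadows the older one.
    return newer;
}
```

Under these assumed rules, both old LIVE + new INIT and old INIT + new INIT throw, while old DEAD + new INIT yields a LIVE entry.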

This could happen if, say, some live entry E was killed -- had a DEAD entry written to some bucket B -- and core crashed before B was durably stored in the bucketlist. Then B would (potentially) be zero-sized, and from the bucketlist's perspective E would still be alive, but the database would think E was dead and so, when reviving it, core would write a new INIT entry for it.
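
The "durably stored" step in this scenario can be illustrated with a small POSIX sketch: write to a temporary file, `fsync` it, rename it into place, then `fsync` the containing directory. This is a generic pattern shown under assumptions, not the change made in stellar-core; `writeDurably` and its parameters are hypothetical names.

```cpp
#include <cassert>
#include <cstdio>
#include <fcntl.h>
#include <stdexcept>
#include <string>
#include <unistd.h>

// Hypothetical helper (POSIX only): publish `data` as dir/name so that a
// crash leaves either the old state or the complete new file, never a
// visible zero-sized or partially written one.
void writeDurably(const std::string& dir, const std::string& name,
                  const std::string& data)
{
    assert(!name.empty());
    const std::string tmp = dir + "/" + name + ".tmp";
    const std::string dst = dir + "/" + name;

    int fd = ::open(tmp.c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        throw std::runtime_error("open failed: " + tmp);

    std::size_t off = 0;
    while (off < data.size())
    {
        ssize_t n = ::write(fd, data.data() + off, data.size() - off);
        if (n < 0)
        {
            ::close(fd);
            throw std::runtime_error("write failed: " + tmp);
        }
        off += static_cast<std::size_t>(n);
    }

    // Flush file contents before the final name becomes visible.
    if (::fsync(fd) != 0)
    {
        ::close(fd);
        throw std::runtime_error("fsync failed: " + tmp);
    }
    ::close(fd);

    if (std::rename(tmp.c_str(), dst.c_str()) != 0)
        throw std::runtime_error("rename failed: " + dst);

    // Flush the directory so the rename itself survives a crash.
    int dfd = ::open(dir.c_str(), O_RDONLY);
    if (dfd < 0)
        throw std::runtime_error("open dir failed: " + dir);
    if (::fsync(dfd) != 0)
    {
        ::close(dfd);
        throw std::runtime_error("fsync dir failed: " + dir);
    }
    ::close(dfd);
}
```

Skipping either `fsync` leaves a window in which the file name can exist after a crash while its contents do not, which is the kind of zero-sized bucket the hypothesis above relies on.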

@MonsieurNicolas
Contributor Author

as we merged #2204 we can reopen this issue if it happens again

v11.4.0 automation moved this from To do to Done Aug 19, 2019