Crash merging buckets #2218

MonsieurNicolas · 2019-08-07T23:21:37Z

This needs to be investigated; it seems to have occurred on 11.3.0. The report follows:

hi seem to be having problems [History ERROR] Replay failed: Error merging bucket curr=0acce3 with snap=03c5ba: Malformed bucket: old non-DEAD + new INIT.. There may be a problem with the local filesystem. Ensure that there is enough space to perform that operation and that disc is behaving correctly. [ApplyLedgerChainWork.cpp:290]
disk is fine and there is 10 GB of free space
Aug 06 11:03:21 ip-172-31-62-17 stellar-core[2960]: 2019-08-06T11:03:21.671 GCJCS [History INFO] Applying transactions for ledgers 25187933..25195267, LCL is [seq=25187932, hash=48e36b]
Aug 06 11:03:21 ip-172-31-62-17 stellar-core[2960]: 2019-08-06T11:03:21.671 GCJCS [History INFO] Catching up: applying checkpoint 1/115 (0%)
Aug 06 11:03:21 ip-172-31-62-17 stellar-core[2960]: 2019-08-06T11:03:21.678 GCJCS [Tx INFO] applying ledger 25187933 (txs:32, ops:73)
Aug 06 11:03:22 ip-172-31-62-17 stellar-core[2960]: stellar-core: bucket/BucketList.cpp:134: void stellar::BucketLevel::prepare(stellar::Application &, uint32_t, uint32_t, std::shared_ptr, const std::vector<std::shared_ptr > &, bool): Assertion `!mNextCurr.isMerging()' failed.
Aug 06 11:03:22 ip-172-31-62-17 systemd[1]: stellar.service: Main process exited, code=dumped, status=6/ABRT
Aug 06 11:03:22 ip-172-31-62-17 systemd[1]: stellar.service: Unit entered failed state.
Aug 06 11:03:22 ip-172-31-62-17 systemd[1]: stellar.service: Failed with result 'core-dump'.
stellar-core 11.3.0 (5f7821d)
antb321
4:36 AM - Yesterday
I downgraded to the last version of 11.2 and it still occurred, and then downgraded to stellar-core 11.0.0 (236f831) and it seems to be resolved
EDITED
Here is the log from 11.2
....
Aug 06 10:11:37 ip-172-31-62-17 stellar-core[11983]: 2019-08-06T10:11:37.146 GCJCS [History ERROR] Replay failed: Error merging bucket curr=0acce3 with snap=03c5ba: Malformed bucket: old non-DEAD + new INIT.. There may be a problem with the local filesystem. Ensure that there is enough space to perform that operation and that disc is behaving correctly. [ApplyLedgerChainWork.cpp:290]
Aug 06 10:11:37 ip-172-31-62-17 stellar-core[11983]: 2019-08-06T10:11:37.153 GCJCS [Ledger ERROR] Catchup will restart at next close. [LedgerManagerImpl.cpp:692]
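
For anyone triaging similar reports, here is a minimal sketch of pulling the `curr`/`snap` hash prefixes and the failure reason out of a log line like the ones above. The regular expression is inferred only from the two error lines quoted in this issue, and the helper name `parse_merge_error` is made up for illustration; it is not part of stellar-core.

```python
import re
from typing import Optional, Tuple

# Pattern inferred from the log lines quoted in this issue (an assumption,
# not a documented stellar-core log format).
MERGE_ERROR = re.compile(
    r"Error merging bucket curr=([0-9a-f]+) with snap=([0-9a-f]+): (.+?)(?:\.\.|$)"
)

# One of the lines from the report, trimmed to the relevant part.
SAMPLE = (
    "2019-08-06T10:11:37.146 GCJCS [History ERROR] Replay failed: "
    "Error merging bucket curr=0acce3 with snap=03c5ba: "
    "Malformed bucket: old non-DEAD + new INIT.. "
    "There may be a problem with the local filesystem."
)


def parse_merge_error(line: str) -> Optional[Tuple[str, str, str]]:
    """Return (curr_hash, snap_hash, reason), or None if the line does not match."""
    match = MERGE_ERROR.search(line)
    if match is None:
        return None
    return match.group(1), match.group(2), match.group(3)


if __name__ == "__main__":
    print(parse_merge_error(SAMPLE))
```

The extracted prefixes can then be compared against the bucket hashes seen on another node or during a replay.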

graydon · 2019-08-12T23:17:20Z

Eek! Well, the scary message in here is the first one:

Error merging bucket curr=0acce3 with snap=03c5ba: Malformed bucket: old non-DEAD + new INIT

I'll talk to the original reporter to try to figure out a little more context. Judging from the lines following it, it looks like it happened during a catchup from some non-initial state. Replaying through the same range (say 25187904..25195268) does not reproduce such a failure:

No replay errors occur
No bucket with either putative hash 0acce3 or 03c5ba occurs at all (of 15,071 retained buckets, with --disable-bucket-gc)

This does not mean it didn't happen, but it means it's not trivial to provoke.

Vague initial hypothesis: I wonder if there's some way to get a malformed bucket of this sort if core crashes at the right time, in the form of the failure-to-fsync bug currently pending. The malformedness is either merging old LIVE + new INIT, or old INIT + new INIT, for some ledger entry.

This could happen if, say, some live entry E was killed -- had a DEAD entry written to some bucket B -- and core crashed before B was durably stored in the bucketlist. Then B would (potentially) be zero-sized, and from the bucketlist perspective E would be still alive, but the database would think E was dead and so when reviving it, would write a new INIT entry for it.
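
To make the hypothesis concrete, here is a toy model of the per-key merge rule for INIT/LIVE/DEAD entries as I understand it. It is a simplified Python sketch, not the actual C++ merge code; the names `Kind`, `merge_entry`, `MalformedBucket` and `hypothesized_crash` are invented for illustration, and the values `"E@v1"`/`"E@v2"` are placeholders.

```python
from enum import Enum
from typing import Optional, Tuple


class Kind(Enum):
    INIT = "INIT"  # entry created at this point in the bucket list
    LIVE = "LIVE"  # entry updated
    DEAD = "DEAD"  # entry deleted


# (kind, value); the value is None for DEAD entries.
Entry = Tuple[Kind, Optional[str]]


class MalformedBucket(Exception):
    pass


def merge_entry(old: Entry, new: Entry) -> Optional[Entry]:
    """Merge two entries for the same key; None means they annihilate."""
    old_kind, _ = old
    new_kind, new_value = new
    if new_kind is Kind.INIT:
        # A creation is only legal if the older state says the key is gone.
        if old_kind is not Kind.DEAD:
            raise MalformedBucket("old non-DEAD + new INIT")
        # Deleted, then re-created: treated as a plain live entry.
        return (Kind.LIVE, new_value)
    if old_kind is Kind.INIT:
        if new_kind is Kind.DEAD:
            return None  # created and deleted within the merged range
        return (Kind.INIT, new_value)  # still a creation, with the newer value
    return new  # otherwise the newer entry wins


def hypothesized_crash() -> str:
    """Replay the scenario above: the DEAD entry for E in bucket B was lost."""
    old = (Kind.LIVE, "E@v1")  # bucket list still believes E is alive
    new = (Kind.INIT, "E@v2")  # database saw E as absent, so re-creating it emits INIT
    try:
        merge_entry(old, new)
    except MalformedBucket as err:
        return str(err)
    return "merged cleanly"


if __name__ == "__main__":
    print(hypothesized_crash())
```

Under this model, the lost DEAD entry is exactly what turns a legal old DEAD + new INIT merge into the old LIVE + new INIT case from the error message.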

MonsieurNicolas · 2019-08-19T19:20:10Z

As we merged #2204, we can reopen this issue if it happens again.

MonsieurNicolas added the bug label Aug 7, 2019

MonsieurNicolas assigned graydon Aug 7, 2019

MonsieurNicolas added this to To do in v11.4.0 via automation Aug 7, 2019

MonsieurNicolas closed this as completed Aug 19, 2019

v11.4.0 automation moved this from To do to Done Aug 19, 2019
