Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Node crashed, and stop syncing with EOF #12897

Open
5 of 11 tasks
emmanuelm41 opened this issue Feb 13, 2025 · 25 comments
Open
5 of 11 tasks

Node crashed, and stop syncing with EOF #12897

emmanuelm41 opened this issue Feb 13, 2025 · 25 comments
Labels

Comments

@emmanuelm41
Copy link

Checklist

  • This is not a security-related bug/issue. If it is, please follow please follow the security policy.
  • I have searched on the issue tracker and the lotus forum, and there is no existing related issue or discussion.
  • I am running the Latest release, the most recent RC(release canadiate) for the upcoming release or the dev branch(master), or have an issue updating to any of these.
  • I did not make any code changes to lotus.

Lotus component

  • lotus daemon - chain sync
  • lotus fvm/fevm - Lotus FVM and FEVM interactions
  • lotus miner/worker - sealing
  • lotus miner - proving(WindowPoSt/WinningPoSt)
  • lotus JSON-RPC API
  • lotus message management (mpool)
  • Other

Lotus Version

Daemon: v1.31.1

Repro Steps

  1. Run the node with ChainIndexer running
  2. Wait until some random block fails, and crashes the node entirely.

Describe the Bug

Once the node starts, it crashes for some reason in some random block. So far this happened twice. The first one was fix by upgrading the node from v1.31.0 to v1.31.1. The second one is still preventing our node to start. This is a full archival node.

Logging Information

2025-02-13T16:23:32.379Z	INFO	badgerbs	v2@v2.2007.4/value.go:1158	Replaying file id: 42814 at offset: 723470378
2025-02-13T16:23:32.379Z	INFO	badgerbs	v2@v2.2007.4/value.go:1178	Replay took: 6.132µs
2025-02-13T16:23:32.436Z	INFO	p2pnode	lp2p/addrs.go:101	Swarm listening at: [/ip6/::1/udp/52224/quic-v1/webtransport/certhash/uEiC94LiJnAp23RuZQI5Gh46hbZeByx22U5CP6PC8k0R8Sw/certhash/uEiBJ3qy7DeJtTOr9w4L3o9tqHPoeKzmW3kbRu06FhaB2Qg /ip4/127.0.0.1/udp/41958/quic-v1 /ip4/10.42.10.28/udp/41958/quic-v1 /ip6/::1/udp/57241/quic-v1 /ip4/127.0.0.1/tcp/44895 /ip4/10.42.10.28/tcp/44895 /ip6/::1/tcp/42707 /ip4/127.0.0.1/udp/39093/quic-v1/webtransport/certhash/uEiC94LiJnAp23RuZQI5Gh46hbZeByx22U5CP6PC8k0R8Sw/certhash/uEiBJ3qy7DeJtTOr9w4L3o9tqHPoeKzmW3kbRu06FhaB2Qg /ip4/10.42.10.28/udp/39093/quic-v1/webtransport/certhash/uEiC94LiJnAp23RuZQI5Gh46hbZeByx22U5CP6PC8k0R8Sw/certhash/uEiBJ3qy7DeJtTOr9w4L3o9tqHPoeKzmW3kbRu06FhaB2Qg]
2025-02-13T16:23:32.583Z	INFO	modules	modules/services.go:152	subscribing to pubsub topic /fil/blocks/testnetnet
2025-02-13T16:23:32.583Z	INFO	messagepool	messagepool/messagepool.go:447	mpool ready
2025-02-13T16:23:32.585Z	INFO	f3/manifest-provider	manifest/dynamic_provider.go:136	starting a dynamic manifest provider	{"manifestServerID": "12D3KooWENMwUF9YxvQxar7uBWJtZkA6amvK4xWmKXfSiHUo2Qq7"}
2025-02-13T16:23:32.585Z	INFO	modules	modules/eth.go:48	Start prefilling GetTipsetByHeight cache
2025-02-13T16:23:32.593Z	INFO	f3	go-f3@v0.7.3/f3.go:302	resuming F3 internals
2025-02-13T16:23:32.593Z	INFO	f3	go-f3@v0.7.3/f3.go:190	F3 is starting	{"initialDelay": 0, "hasPendingManifest": false, "NetworkName": "filecoin/48", "BootstrapEpoch": 4534590, "Finality": 900, "InitialPowerTable": "b", "CommitteeLookback": 10}
2025-02-13T16:23:32.595Z	INFO	chainindex	index/reconcile.go:66	starting chain reconciliation from head height 4702087; reconciliation epoch is 1
2025-02-13T16:23:32.599Z	INFO	chainindex	index/reconcile.go:91	found matching tipset at height 4702086, setting reconciliation epoch to 4702087
2025-02-13T16:23:32.599Z	INFO	chainindex	index/reconcile.go:112	Marking tipsets as reverted from height 4702087
2025-02-13T16:23:32.602Z	INFO	chainindex	index/reconcile.go:127	marked 77 tipsets as reverted from height 4702087
2025-02-13T16:23:32.602Z	INFO	chainindex	index/reconcile.go:136	indexing missing tipsets backwards from head height 4702087 to reconciliation epoch 4702087
2025-02-13T16:23:32.602Z	INFO	chainindex	index/reconcile.go:199	backfilling chain index backwards starting from head height 4702087
2025-02-13T16:23:32.603Z	WARN	events	events/observer.go:61	listenHeadChanges quit
2025-02-13T16:23:32.603Z	WARN	events	events/observer.go:61	listenHeadChanges quit
2025-02-13T16:23:32.603Z	WARN	events	events/observer.go:61	listenHeadChanges quit
2025-02-13T16:23:32.603Z	WARN	events	events/observer.go:66	not restarting listenHeadChanges: context error: context canceled
2025-02-13T16:23:32.603Z	WARN	events	events/observer.go:66	not restarting listenHeadChanges: context error: context canceled
2025-02-13T16:23:32.603Z	WARN	events	events/observer.go:66	not restarting listenHeadChanges: context error: context canceled
2025-02-13T16:23:32.603Z	WARN	sub	sub/incoming.go:54	quitting HandleIncomingBlocks loop
2025-02-13T16:23:32.603Z	WARN	events	events/observer.go:61	listenHeadChanges quit
2025-02-13T16:23:32.603Z	WARN	events	events/observer.go:66	not restarting listenHeadChanges: context error: context canceled
2025-02-13T16:23:32.603Z	WARN	peermgr	peermgr/peermgr.go:154	closing peermgr done
2025-02-13T16:23:32.603Z	WARN	peermgr	peermgr/peermgr.go:172	exiting peermgr run
2025-02-13T16:23:32.604Z	INFO	pubsub	go-libp2p-pubsub@v0.11.0/pubsub.go:671	pubsub processloop shutting down
2025-02-13T16:23:32.604Z	WARN	chainstore	store/store.go:670	reorgWorker quit
2025-02-13T16:23:32.604Z	WARN	modules	modules/eth.go:51	error when prefilling GetTipsetByHeight cache: %!w(*xerrors.wrapError=&{failed to fill cache 0xe2e698ec00 {[7038764 28404881 28481413]}})
2025-02-13T16:23:32.604Z	INFO	modules	modules/eth.go:53	Prefilling GetTipsetByHeight done in 19.467942ms
2025-02-13T16:23:32.874Z	INFO	badgerbs	v2@v2.2007.4/db.go:1027	Storing value log head: {Fid:42814 Len:682268 Offset:723470378}
2025-02-13T16:23:32.875Z	INFO	badgerbs	v2@v2.2007.4/levels.go:1000	[Compactor: 173] Running compaction: {level:0 score:1.73 dropPrefixes:[]} for level: 0
2025-02-13T16:23:32.877Z	INFO	badgerbs	v2@v2.2007.4/levels.go:962	LOG Compact 0->1, del 2 tables, add 1 tables, took 1.274169ms
2025-02-13T16:23:32.877Z	INFO	badgerbs	v2@v2.2007.4/levels.go:1010	[Compactor: 173] Compaction for level: 0 DONE
2025-02-13T16:23:32.877Z	INFO	badgerbs	v2@v2.2007.4/db.go:550	Force compaction on level 0 done
ERROR: initializing node: starting node: error while reconciling chain index with chain state: transaction failed: failed to backfill index: failed to apply tipset at height 4702087: failed to index tipset with parent events: failed to index tipset: failed to get messages for tipset: failed to get messages for block: loading bls messages for block: failed to get message: (bafy2bzacece4ltmbyynkyqtzdamnxgv6mjudw4fhkweg6ueocxubo4ar7h3wa):29: EOF
@emmanuelm41 emmanuelm41 added the kind/bug Kind: Bug label Feb 13, 2025
@github-project-automation github-project-automation bot moved this to 📌 Triage in FilOz Feb 13, 2025
@BigLep
Copy link
Member

BigLep commented Feb 13, 2025

Thanks @emmanuelm41 - acknowledged on the bug report.

Have you been able to try the advice of turning on your node with indexing disabled?

From @rvagg:

you can turn it off, setting EnableIndexer to false, but it’ll want you to turn off some other things, EnableEthRPC needs to be set to false and EnableActorEventsAPI needs to be set to false. Then it’ll start without the indexer and just proceed with syncing and operating normally, just without those APIs enabled and without keeping track in the chain indexer.
I’m assuming here that the node isn’t syncing at the moment? that you’re stuck at some epoch because the chainindexer is panicking, so we’ll get you caught up and see how we go.
Once you’re caught up, shut it down again, restore those options to how they were and start it up again but it’ll also attempt to reconcile its missing data. By default, MaxReconcileTipsets in the config is set to 3 days worth of epochs, so if you have a gap of 3 days then it should attempt to reconcile the gap by itself. If that’s not enough, the lotus index validate-backfill exists to help you deal with gaps - you can even run that over ranges of epochs you’re unsure about, supply it the --backfill flag and it’ll attempt to fill in any holes it finds. --log-good will show you which epochs it’s happy with if you want to see that. It’s a good tool to give you confidence that it’s got the holes it needs.
My only concern here is that there’s a corrupt message block, or something else about that message block that’s unexpected, and when you re-enable it and it tries to work on that epoch again it’s going to panic.

More info on using the backfill tool at https://github.com/filecoin-project/lotus/blob/master/documentation/en/chain-indexer-overview-for-operators.md#backfill

@emmanuelm41
Copy link
Author

I have not been able to test that exactly yet, but I can say it happened once, fix by its own after the upgrade, and later happened again. That makes me think that it will happen again, even if this workaround works.

@BigLep
Copy link
Member

BigLep commented Feb 14, 2025

Ack, thanks @emmanuelm41 . And is always failing on message bafy2bzacece4ltmbyynkyqtzdamnxgv6mjudw4fhkweg6ueocxubo4ar7h3wa in both instances?

@rvagg or @aarshkshah1992 : is there a way to get more state dumped when it crashes so we can debug?

@rvagg
Copy link
Member

rvagg commented Feb 14, 2025

@emmanuelm41 it would be good to check that you can even read this particular message, maybe there's genuine corruption happening here. This is what I get for bafy2bzacece4ltmbyynkyqtzdamnxgv6mjudw4fhkweg6ueocxubo4ar7h3wa:

$ curl -s http://localhost:1234/rpc/v1 -X POST -H "Content-Type": "appl
ication/json" --data '{"method":"Filecoin.ChainGetMessage", "params":[{"/":"bafy2bzacece4ltmbyynkyqtzdamnxgv6mjudw4fhkweg6ueocxubo4ar7h3wa"}], "id":1,"jsonrpc":"2.0"}' 
{"id":1,"jsonrpc":"2.0","result":{"Version":0,"To":"f03275720","From":"f3w73knlgas5qfykkq3fbxykqzttekb5gt6nn2gqnymgjex4qe3aihsooqecbxcfd7yuunqnka7hddbqd6ghra","Nonce":137038,"Value":"0","GasLimit":61031853,"GasFeeCap":"1198140","GasPremium":"1197086","Method":28,"Params":"gYGHDRoAAZ7I2CpYKQABguIDgegCIFStAy/Db1FxTawzKnql3hyVfN70ZuGQX4H0uTIV/WI9GgBHuZCAGgBRA2rYKlgoAAGB4gOSICAGmgNddduJTfMw7z4VcibMMkgIPIcEvenOLignceQZAg==","CID":{"/":"bafy2bzacece4ltmbyynkyqtzdamnxgv6mjudw4fhkweg6ueocxubo4ar7h3wa"}}}

@emmanuelm41
Copy link
Author

emmanuelm41 commented Feb 14, 2025

@rvagg I could try, but the thing here is that the node is down, completely down. So it never gets to start.

@BigLep no, not really. It failed on a different block before. This one is a different one.

@BigLep
Copy link
Member

BigLep commented Feb 14, 2025

@emmanuelm41 : thanks. I think @rvagg's suggestion is to start the node with Eth RPC / ChainIndexer disabled and then attempt the ChainGetMessage call.

@emmanuelm41
Copy link
Author

After a few restarts, It actually fixed itself. The answer to the curl is this one

{"id":1,"jsonrpc":"2.0","result":{"Version":0,"To":"f03275720","From":"f3w73knlgas5qfykkq3fbxykqzttekb5gt6nn2gqnymgjex4qe3aihsooqecbxcfd7yuunqnka7hddbqd6ghra","Nonce":137038,"Value":"0","GasLimit":61031853,"GasFeeCap":"1198140","GasPremium":"1197086","Method":28,"Params":"gYGHDRoAAZ7I2CpYKQABguIDgegCIFStAy/Db1FxTawzKnql3hyVfN70ZuGQX4H0uTIV/WI9GgBHuZCAGgBRA2rYKlgoAAGB4gOSICAGmgNddduJTfMw7z4VcibMMkgIPIcEvenOLignceQZAg==","CID":{"/":"bafy2bzacece4ltmbyynkyqtzdamnxgv6mjudw4fhkweg6ueocxubo4ar7h3wa"}}}

I guess the bug is not unrecoverable, as it is resolved by itself, but it is critical for us, as it could leave our nodes down for some time (uncertain) until manual intervention, or auto fix on some unknown amount of time.

@emmanuelm41
Copy link
Author

We stopped the deployment of this new version on other full nodes as we don't want to find ourselves with all nodes down later.

@emmanuelm41
Copy link
Author

Any updates on this? Something to test or try? Nodes are running now, but we stopped the deployment of the 1.31 version

@rvagg
Copy link
Member

rvagg commented Feb 20, 2025

@emmanuelm41 not yet, looking into it, but it does seem to me like you might have experienced a one-off and possibly won't encounter this again with other nodes 🤞 I know that's not a great answer. I am wondering now if it's related to this that's been swirling around, suggested to be splitstore related but may not be. Your error came from a situation just like this—BLS message not being found. #12907 (comment)

@emmanuelm41
Copy link
Author

What makes me a bit uncomfortable about this is the fact that we never saw this issue before on any previous lotus version, and we saw it three times. Two in one node, and one in another (both v1.31.1). So far it has not happened again though.

@rvagg
Copy link
Member

rvagg commented Feb 21, 2025

I think there's two things going on here - the underlying message-lookup problem which is probably not related to chainindexer and probably not even limited to 1.31, and the second problem is the lack of resilience of chainindexer in causing this panic. It's not clear to me yet how to handle that case because chainindexer really should know about messages and not finding one is a problem, but we probably shouldn't be panicking. I've asked @aarshkshah1992 to have a look at this specifically.

@aarshkshah1992
Copy link
Contributor

I can raise a PR to warn instead of panicking during backfilling as not having a message in the statestore can be a valid scenario if the state is corrupted.

@rjan90 rjan90 moved this from 📌 Triage to ⌨️ In Progress in FilOz Feb 25, 2025
@emmanuelm41
Copy link
Author

@aarshkshah1992 is it possible we would loose data on the chain indexers by doing so? Whenever it is EOF, the node directly hangs up. I guess the only way for the node to pass this situation is whenever that block is healthy again. However, if this is just a warning, what will happen with that block on the chain indexer? Will it be missing? Will it require manual work to fix it?

@emmanuelm41
Copy link
Author

@aarshkshah1992 @BigLep @rvagg We have the same error again on our node. It is panicking.

ERROR: initializing node: starting node: error while reconciling chain index with chain state: transaction failed: failed to backfill index: failed to apply tipset at height 4746105: failed to index tipset with parent events: failed to index tipset: failed to get messages for tipset: failed to get messages for block: loading bls messages for block: failed to get message: (bafy2bzacedwgr7fr2jln4clcnwuiqll6hpsedkfo6qbyzwvek24bbab6t4zki):38: EOF

@BigLep
Copy link
Member

BigLep commented Mar 3, 2025

Ack, thanks @emmanuelm41 for the patience. We'll get on this early this week now that the team is back from Denver.

What are Zondax's capabilities for testing this out? Do you support doing your own builds of Lotus? Do we need to release v1.31.2 release?

@aarshkshah1992
Copy link
Contributor

@emmanuelm41

  • You will have missing data in the indexer if you allow the node start even if re-conciliation fails using our fix at fix: allow users to optionally configure node startup even if index reconciliation fails #12930
  • However, you can always backfill those epochs and the epochs before it using the validate-backfill CLI after your chain state has been fixed and you have all the missing blocks/messages.
  • There is no way for the Indexer to Index something if the corresponding blocks/messages are missing.

@aarshkshah1992
Copy link
Contributor

@emmanuelm41

Can you check if those nodes have the missing messages ?

So for your first error message

ERROR: initializing node: starting node: error while reconciling chain index with chain state: transaction failed: failed to backfill index: failed to apply tipset at height 4702087: failed to index tipset with parent events: failed to index tipset: failed to get messages for tipset: failed to get messages for block: loading bls messages for block: failed to get message: (**bafy2bzacece4ltmbyynkyqtzdamnxgv6mjudw4fhkweg6ueocxubo4ar7h3wa**):29: EOF

You can run

lotus chain getmessage bafy2bzacece4ltmbyynkyqtzdamnxgv6mjudw4fhkweg6ueocxubo4ar7h3wa

And for your second error message

failed to index tipset: failed to get messages for tipset: failed to get messages for block: loading bls messages for block: failed to get message: (bafy2bzacedwgr7fr2jln4clcnwuiqll6hpsedkfo6qbyzwvek24bbab6t4zki):38: EOF

You can run

lotus chain getmessage bafy2bzacedwgr7fr2jln4clcnwuiqll6hpsedkfo6qbyzwvek24bbab6t4zki

Let me know the output. If you don't have these messages, some of your blocks haven't synced properly.

@emmanuelm41
Copy link
Author

@aarshkshah1992 the tricky part here is that the lotus node is not up and running, and it will never come back if that block is not fixed, as the chain indexer won't allow it to start. The only way for me to test this is to use this new configuration.

@emmanuelm41
Copy link
Author

Ack, thanks @emmanuelm41 for the patience. We'll get on this early this week now that the team is back from Denver.

What are Zondax's capabilities for testing this out? Do you support doing your own builds of Lotus? Do we need to release v1.31.2 release?

I would be just great if you can build a new release... however, we are testing the v1.31.1 tag. We are not on the 1.32

@aarshkshah1992
Copy link
Contributor

aarshkshah1992 commented Mar 4, 2025

@emmanuelm41 Can you take a look at #12897 (comment) and the fix and let me know if you are okay with that ?

It will allow you to start your node but you will have missing entries for the epochs that are not re-conciled/have missing messages.

You can still use validate-backfill CLI to backfill all the epochs you have data for -> fix the missing messages in the state -> run validate-backfill for those epochs and get your Index fully hydrated.

Once your node is up after deploying the fix -> you can also use the lotus chain getmessage CLI to confirm that those messages are indeed missing.

@BigLep
Copy link
Member

BigLep commented Mar 4, 2025

2025-03-04 notes from maintainer verbal sync:

These are the steps that need to happen:

  1. (maintainers) Merge fix: allow users to optionally configure node startup even if index reconciliation fails #12930 into master (targeting 2025-03-05)
  2. (maintainers) Cherry pick the change into a new branch release/v1.31.2-zondax which is release/v1.31.1 + this change
  3. (maintainers) Create a tag from this commit
  4. (Zondax) Consume tag release/v1.32.2-zondax
  5. (Zondax) Set ChainIndexerConfig.AllowIndexReconciliationFailure to true
  6. (Zondax) Start node

Once Zondax observes the error while reconciling chain index with chain state warning...

  1. (Zondax) Run lotus chain get message
  2. (Zondax) Report back the results
    If message isn't there, we need collaborate to figure out why it's missing

While the investigation in underway, to keep Zondax unblocked they can:

  1. (Zondax) Use index validate-backfill to backfill the epochs before the failure (e.g., from failureEpoch-1 to 0)
  2. (Zondax) Use index validate-backfill to backfill the epochs before the failure (e.g., from head to failureEpoch+1)

@aarshkshah1992, @rjan90: please amend/comment if any of this seems wrong.

@emmanuelm41
Copy link
Author

@aarshkshah1992 it is indeed a weird issue, as the node is back to life, by itself. After a few restarts, and some time being block there, one last restart made it work again. The output for those commands is

lotus chain getmessage bafy2bzacedwgr7fr2jln4clcnwuiqll6hpsedkfo6qbyzwvek24bbab6t4zki
{
  "Version": 0,
  "To": "f03335798",
  "From": "f3rukg7eijyng5pfouhu4ab2xkpajhw4m6vho5knvi4wdo5mlj64libmynsd7auj56prhrooip7cosgaqnggua",
  "Nonce": 53378,
  "Value": "0",
  "GasLimit": 56920302,
  "GasFeeCap": "167618",
  "GasPremium": "165489",
  "Method": 28,
  "Params": "gYGHDRmHm9gqWCkAAYLiA4HoAiCtgMpy40X0FWLWgGcXiloF6RczrVfOWSgVKvgh6nXVORoASGVrgBoAUa9F2CpYKAABgeIDkiAggHrjTBEugod2uSDRzMYKXIJw0lHtuDWnb7gzi6DIMh0=",
  "CID": {
    "/": "bafy2bzacedwgr7fr2jln4clcnwuiqll6hpsedkfo6qbyzwvek24bbab6t4zki"
  }
}
lotus chain getmessage bafy2bzacedwgr7fr2jln4clcnwuiqll6hpsedkfo6qbyzwvek24bbab6t4zki
{
  "Version": 0,
  "To": "f03335798",
  "From": "f3rukg7eijyng5pfouhu4ab2xkpajhw4m6vho5knvi4wdo5mlj64libmynsd7auj56prhrooip7cosgaqnggua",
  "Nonce": 53378,
  "Value": "0",
  "GasLimit": 56920302,
  "GasFeeCap": "167618",
  "GasPremium": "165489",
  "Method": 28,
  "Params": "gYGHDRmHm9gqWCkAAYLiA4HoAiCtgMpy40X0FWLWgGcXiloF6RczrVfOWSgVKvgh6nXVORoASGVrgBoAUa9F2CpYKAABgeIDkiAggHrjTBEugod2uSDRzMYKXIJw0lHtuDWnb7gzi6DIMh0=",
  "CID": {
    "/": "bafy2bzacedwgr7fr2jln4clcnwuiqll6hpsedkfo6qbyzwvek24bbab6t4zki"
  }
}

@emmanuelm41
Copy link
Author

Is it possible that those "missing blocks" get solved after some time, by themselves? And that may have happened many times in the past. This may just araise now that the chain indexer is there to check in live the block status

@rjan90
Copy link
Contributor

rjan90 commented Mar 6, 2025

Hey!

With #12930 merged, I've created a new tag that includes the cherry-picked changes on top of release/v1.31.1. The tag is named release/v1.31.2-zondax-tag and you can view it here: https://github.com/filecoin-project/lotus/tree/release/v1.31.2-zondax-tag.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Status: ⌨️ In Progress
Development

No branches or pull requests

5 participants