Node crashed, and stop syncing with EOF #12897
Comments
Thanks @emmanuelm41 - acknowledged on the bug report. Have you been able to try the advice of turning on your node with indexing disabled? From @rvagg:
More info on using the backfill tool at https://github.com/filecoin-project/lotus/blob/master/documentation/en/chain-indexer-overview-for-operators.md#backfill |
I have not been able to test that exactly yet, but I can say it happened once, fixed itself after the upgrade, and later happened again. That makes me think it will happen again, even if this workaround works. |
Ack, thanks @emmanuelm41. And is it always failing on the same message? @rvagg or @aarshkshah1992: is there a way to get more state dumped when it crashes so we can debug? |
@emmanuelm41 it would be good to check that you can even read this particular message, maybe there's genuine corruption happening here. This is what I get for
|
@emmanuelm41 : thanks. I think @rvagg's suggestion is to start the node with Eth RPC / ChainIndexer disabled and then attempt the ChainGetMessage call. |
After a few restarts, it actually fixed itself. The response to the curl is this one
I guess the bug is not unrecoverable, as it resolved itself, but it is critical for us: it could leave our nodes down for an uncertain amount of time, until either manual intervention or an automatic fix kicks in. |
We stopped the deployment of this new version on other full nodes as we don't want to find ourselves with all nodes down later. |
Any updates on this? Something to test or try? Nodes are running now, but we stopped the deployment of the 1.31 version |
@emmanuelm41 not yet, looking into it, but it does seem to me like you might have experienced a one-off and possibly won't encounter this again with other nodes 🤞 I know that's not a great answer. I am wondering now if it's related to this that's been swirling around, suggested to be splitstore related but may not be. Your error came from a situation just like this—BLS message not being found. #12907 (comment) |
What makes me a bit uncomfortable about this is the fact that we never saw this issue before on any previous lotus version, and we saw it three times: twice on one node and once on another (both v1.31.1). So far it has not happened again though. |
I think there's two things going on here - the underlying message-lookup problem which is probably not related to chainindexer and probably not even limited to 1.31, and the second problem is the lack of resilience of chainindexer in causing this panic. It's not clear to me yet how to handle that case because chainindexer really should know about messages and not finding one is a problem, but we probably shouldn't be panicking. I've asked @aarshkshah1992 to have a look at this specifically. |
I can raise a PR to warn instead of panicking during backfilling as not having a message in the statestore can be a valid scenario if the state is corrupted. |
@aarshkshah1992 is it possible we would lose data on the chain indexers by doing so? Whenever it is EOF, the node directly hangs up. I guess the only way for the node to get past this situation is for that block to be healthy again. However, if this is just a warning, what will happen with that block on the chain indexer? Will it be missing? Will it require manual work to fix it? |
@aarshkshah1992 @BigLep @rvagg We have the same error again on our node. It is panicking.
|
Ack, thanks @emmanuelm41 for the patience. We'll get on this early this week now that the team is back from Denver. What are Zondax's capabilities for testing this out? Do you support doing your own builds of Lotus? Do we need to cut a v1.31.2 release? |
|
Can you check if those nodes have the missing messages? So for your first error message
You can run
And for your second error message
You can run
Let me know the output. If you don't have these messages, some of your blocks haven't synced properly. |
@aarshkshah1992 the tricky part here is that the lotus node is not up and running, and it will never come back if that block is not fixed, as the chain indexer won't allow it to start. The only way for me to test this is to use this new configuration. |
It would be just great if you could build a new release... however, we are testing the v1.31.1 tag. We are not on 1.32. |
@emmanuelm41 Can you take a look at #12897 (comment) and the fix and let me know if you are okay with that? It will allow you to start your node but you will have missing entries for the epochs that are not reconciled/have missing messages. You can still use Once your node is up after deploying the fix -> you can also use the |
2025-03-04 notes from maintainer verbal sync: These are the steps that need to happen:
Once Zondax observes the
While the investigation is underway, to keep Zondax unblocked they can:
@aarshkshah1992, @rjan90: please amend/comment if any of this seems wrong. |
@aarshkshah1992 it is indeed a weird issue, as the node came back to life by itself. After a few restarts, and some time being blocked there, one last restart made it work again. The output for those commands is
|
Is it possible that those "missing blocks" get resolved after some time, by themselves? That may have happened many times in the past. It may only surface now that the chain indexer is there to check block status live. |
Hey! With #12930 merged, I've created a new tag that includes the cherry-picked changes on top of |
Checklist
Latest release, the most recent RC (release candidate) for the upcoming release or the dev branch (master), or have an issue updating to any of these.
Lotus component
Lotus Version
Repro Steps
Describe the Bug
Once the node starts, it crashes for some reason at some random block. So far this has happened twice. The first one was fixed by upgrading the node from v1.31.0 to v1.31.1. The second one is still preventing our node from starting. This is a full archival node.
Logging Information