
Horcrux crashes and then cannot restart because of pid file #226

Closed
Quilamir opened this issue Nov 29, 2023 · 8 comments · Fixed by #227

Comments


Quilamir commented Nov 29, 2023

Version v3.2.0

Here is the log for the crash

E[2023-11-29|17:02:42.517] Cosigner failed to set nonces and sign       chain_id=froopyland_100-1 height=1481301 round=0 type=prevote cosigner=1 err="rpc error: code = DeadlineExceeded desc = context deadline exceeded"
E[2023-11-29|17:02:43.519] Cosigner failed to set nonces and sign       chain_id=froopyland_100-1 height=1481301 round=0 type=prevote cosigner=3 err="rpc error: code = DeadlineExceeded desc = context deadline exceeded"
panic: runtime error: slice bounds out of range [6:4]
goroutine 4156 [running]:
github.com/strangelove-ventures/horcrux/signer.(*NonceCache).Delete(...)
        /horcrux/signer/cosigner_nonce_cache.go:86
github.com/strangelove-ventures/horcrux/signer.(*CosignerNonceCache).ClearNonces(0xc0001241c0, {0x17f0ac8, 0xc000121710})
        /horcrux/signer/cosigner_nonce_cache.go:368 +0x4a6
github.com/strangelove-ventures/horcrux/signer.(*ThresholdValidator).Sign.func2()
        /horcrux/signer/threshold_validator.go:743 +0x57a
golang.org/x/sync/errgroup.(*Group).Go.func1()
        /go/pkg/mod/golang.org/x/sync@v0.3.0/errgroup/errgroup.go:75 +0x56
created by golang.org/x/sync/errgroup.(*Group).Go in goroutine 4153
        /go/pkg/mod/golang.org/x/sync@v0.3.0/errgroup/errgroup.go:72 +0x96
horcrux.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
horcrux.service: Failed with result 'exit-code'.
horcrux.service: Scheduled restart job, restart counter is at 583.
Stopped MPC Signer node.
Started MPC Signer node.
Error: unclean shutdown detected. PID file exists at /root/.horcrux/horcrux.pid but PID 1173404 is not running.
manual deletion of PID file required

Since this is a crash, the PID file is not removed, so the service cannot restart properly afterward, leaving the signer stuck until the PID file is manually deleted.

I have not seen this error in previous versions, so I am assuming it is a bug in the latest release. For now, the only workaround I can think of is downgrading to the previous version.


agouin commented Nov 30, 2023

Thank you for the report! This is indeed due to an edge case bug in the new nonce pre-sharing mechanism in v3.2.0. We have a patch in #227 and will cut a v3.2.1 patch release ASAP.
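For context on why the log shows `slice bounds out of range [6:4]`: this is the signature of Go's append-based slice deletion idiom being called with an out-of-range index. A minimal sketch (illustrative only, not the actual Horcrux code):

```go
package main

import "fmt"

// deleteAt removes the element at index i via the common append idiom.
// If i is past the end of the slice, the expression s[i+1:] is
// evaluated as s[i+1:len(s)] and panics -- e.g. "[6:4]" when i+1 = 6
// and len(s) = 4, the same shape as the panic in the log above.
func deleteAt(s []int, i int) []int {
	return append(s[:i], s[i+1:]...)
}

func main() {
	s := []int{1, 2, 3, 4}
	fmt.Println(deleteAt(s, 1)) // [1 3 4]

	defer func() {
		fmt.Println("recovered:", recover())
	}()
	deleteAt(make([]int, 4, 8), 5) // s[6:4] -> out-of-range panic
}
```

The fix for this class of bug is a bounds check (or a rethink of how the index is computed) before slicing.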


agouin commented Nov 30, 2023

This edge case being triggered suggests that you might have a different issue as well though. You may want to increase your grpcTimeout value to account for the failed grpc requests from the leader to the other cosigners. Additionally the raftTimeout value may need to be increased if you are seeing frequent leader elections.

@Quilamir (Author)

Is there any kind of documentation about these settings I can go over?

Also, it seems the PID mechanism needs a change: if the process named in the PID file does not exist, a new instance should be started and the PID updated. As I understand it, the PID file is meant to prevent two instances from running at the same time, so this change would let the instance recover from an unexpected restart or an edge case like this one.


agouin commented Nov 30, 2023

Is there any kind of documentation about these settings I can go over?

Documentation for these parameters is here, but a fine-tuning guide would be a useful addition. I'd suggest determining the happy-path minimum value for both:

  • no gRPC timeouts occurring in the logs with the configured grpc-timeout
  • no unwanted leader elections with the configured raft-timeout

Then add a buffer to each based on the expected network latency variance in your network.
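For reference, the two values map onto the cosigner configuration roughly like this (an illustrative fragment only — the `grpcTimeout`/`raftTimeout` names come from the comment above, and the exact file layout and defaults may differ between Horcrux versions):

```yaml
# Illustrative fragment of ~/.horcrux/config.yaml -- not a complete
# config, and key placement may vary by version.
thresholdMode:
  threshold: 2
  # Raise grpcTimeout if the logs show gRPC deadline-exceeded errors
  # from the leader to the other cosigners under normal conditions.
  grpcTimeout: 1500ms
  # Raise raftTimeout if unwanted leader elections occur frequently.
  raftTimeout: 1500ms
```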

this will allow the instance to recover from an unexpected restart

I agree this would be a nice feature. The required manual removal of the PID file after an unclean shutdown was originally intentional, but in operation I think it is more painful than it is worth.

@Quilamir (Author)

The PID mechanism is good, but it just needs a check for whether the recorded process exists. If it does not, it is safe to start a new process and update the PID in the file. To avoid any race condition, it could instead simply delete the PID file and continue shutting down, so the next restart attempt succeeds (assuming it runs under an auto-restart mechanism like systemd or Docker restarts).


agouin commented Nov 30, 2023

Yes, exactly. We don't want to remove the PID mechanism, but we can delete the PID file on startup if (and only if) the process identified by the PID inside it no longer exists. That has been added to the PR, with updated tests.

@Quilamir (Author)

On second thought, the PID file is not necessary: if a new instance tries to run while one is already running, it will fail to bind the port and crash anyway.

I would remove the PID mechanism or at least provide a flag to disable it.

@activenodes

On second thought, the PID file is not necessary: if a new instance tries to run while one is already running, it will fail to bind the port and crash anyway.

I would remove the PID mechanism or at least provide a flag to disable it.

I agree with you. @agouin, correct us if we are wrong.

In my setups I added

[Service]
ExecStart=...
...
ExecStopPost=rm -f /xxx/xxx/.horcrux/xxx/horcrux.pid

and I have never had problems in the past.
