Read-only node showing failed to open store: set log info: failed to get last command index #1705

dwco-z · 2024-02-27T18:32:13Z

What version are you running?
8.16.6

Are you using Docker or Kubernetes to run your system?
No

Are you running a single node or a cluster?
Cluster with three voters and two read-only nodes.

Description
I was facing an issue with rqlited v8.16.6 where I had a cluster with 3 voters and 2 non-voters and after some days working somehow the cluster ended up in a state where the non-voters could not start anymore and they were always showing "failed to open store: set log info: failed to get last command index: failed to get log at index 12437: log not found" during startup. Performing a peers recovery managed to get everything back to normal. I don't have access to the database anymore and I still couldn't find a reproducible scenario. All I have are some logs from the attempts to start the read-only node while the voters were working, but I don't have the logs from when the error first started happening. Besides that, I know that the three-voters were being executed in physical machines and the 2 read-only nodes were running in virtual machines in a notebook that could be turned off sometimes.

These are the logs from the startup attempts

�
            _ _ _�
        | (_) |�
_ __ __ _| |_| |_ ___�
| '__/ _  | | | __/ _ \   The lightweight, distributed�
| | | (_| | | | ||  __/   relational database.�
|_|  \__, |_|_|\__\___|�
        | |               www.rqlite.io�
        |_|�
�
[rqlited] 2024/02/14 19:03:22 rqlited starting, version 8, SQLite 3.44.0, commit unknown, branch unknown, compiler gc�
[rqlited] 2024/02/14 19:03:22 go1.21.5, target architecture is amd64, operating system target is windows�
[rqlited] 2024/02/14 19:03:22 launch command: rqlited.exe -raft-log-level=DEBUG -node-id 192.168.2.61 -http-addr 192.168.2.61:8907 -raft-addr 192.168.2.61:8905 -raft-non-voter=true -node-ca-cert Database\DatabaseCertificateAuthority.cer -node-cert Database\DatabasePublicCertificate.cer -node-verify-client -node-key Database\DatabasePrivateCertificate.pem -http-ca-cert Database\DatabaseCertificateAuthority.cer -http-cert Database\DatabasePublicCertificate.cer -http-verify-client -http-key Database\DatabasePrivateCertificate.pem -raft-timeout=2s -raft-election-timeout=10s Database\Global�
[rqlited] 2024/02/14 19:03:22 enabling node-to-node encryption with cert: Database\DatabasePublicCertificate.cer, key: Database\DatabasePrivateCertificate.pem, CA cert Database\DatabaseCertificateAuthority.cer, mutual TLS enabled�
[rqlited] 2024/02/14 19:03:22 Raft TCP mux Listener registered with byte header 1�
[mux] 2024/02/14 19:03:22 TLS mux serving on 192.168.2.61:8905, advertising 192.168.2.61:8905�
[rqlited] 2024/02/14 19:03:22 preexisting node state detected in Database\Global�
[cluster] 2024/02/14 19:03:22 service listening on 192.168.2.61:8905�
[rqlited] 2024/02/14 19:03:22 cluster TCP mux Listener registered with byte header 2�
[http] 2024/02/14 19:03:22 secure HTTPS server enabled with cert Database\DatabasePublicCertificate.cer, key Database\DatabasePrivateCertificate.pem, CA cert Database\DatabaseCertificateAuthority.cer, mutual TLS enabled�
[http] 2024/02/14 19:03:22 execute queue processing started with capacity 1024, batch size 128, timeout 50ms�
[http] 2024/02/14 19:03:22 service listening on 192.168.2.61:8907�
[rqlited] 2024/02/14 19:03:22 HTTP server started�
[store] 2024/02/14 19:03:22 opening store with node ID 192.168.2.61, listening on 192.168.2.61:8905�
[store] 2024/02/14 19:03:22 ensuring data directory exists at Database\Global�
[store] 2024/02/14 19:03:22 old snapshot directory Database\Global\snapshots does not exist, nothing to upgrade�
[snapshot-store] 2024/02/14 19:03:22 store initialized using Database\Global\rsnapshots�
[snapshot-store] 2024/02/14 19:03:22 checking consistency of snapshot store at Database\Global\rsnapshots�
[snapshot-store] 2024/02/14 19:03:22 check complete�
[store] 2024/02/14 19:03:22 1 preexisting snapshots present�
[rqlited] 2024/02/14 19:03:22 failed to open store: set log info: failed to get last command index: failed to get log at index 12437: log not found�
�
            _ _ _�
        | (_) |�
_ __ __ _| |_| |_ ___�
| '__/ _  | | | __/ _ \   The lightweight, distributed�
| | | (_| | | | ||  __/   relational database.�
|_|  \__, |_|_|\__\___|�
        | |               www.rqlite.io�
        |_|�
�
[rqlited] 2024/02/14 19:05:27 rqlited starting, version 8, SQLite 3.44.0, commit unknown, branch unknown, compiler gc�
[rqlited] 2024/02/14 19:05:27 go1.21.5, target architecture is amd64, operating system target is windows�
[rqlited] 2024/02/14 19:05:27 launch command: rqlited.exe -raft-log-level=DEBUG -node-id 192.168.2.61 -http-addr 192.168.2.61:8907 -raft-addr 192.168.2.61:8905 -raft-non-voter=true -node-ca-cert Database\DatabaseCertificateAuthority.cer -node-cert Database\DatabasePublicCertificate.cer -node-verify-client -node-key Database\DatabasePrivateCertificate.pem -http-ca-cert Database\DatabaseCertificateAuthority.cer -http-cert Database\DatabasePublicCertificate.cer -http-verify-client -http-key Database\DatabasePrivateCertificate.pem -raft-timeout=2s -raft-election-timeout=10s Database\Global�
[rqlited] 2024/02/14 19:05:27 enabling node-to-node encryption with cert: Database\DatabasePublicCertificate.cer, key: Database\DatabasePrivateCertificate.pem, CA cert Database\DatabaseCertificateAuthority.cer, mutual TLS enabled�
[rqlited] 2024/02/14 19:05:27 Raft TCP mux Listener registered with byte header 1�
[mux] 2024/02/14 19:05:27 TLS mux serving on 192.168.2.61:8905, advertising 192.168.2.61:8905�
[rqlited] 2024/02/14 19:05:27 preexisting node state detected in Database\Global�
[cluster] 2024/02/14 19:05:27 service listening on 192.168.2.61:8905�
[rqlited] 2024/02/14 19:05:27 cluster TCP mux Listener registered with byte header 2�
[http] 2024/02/14 19:05:27 secure HTTPS server enabled with cert Database\DatabasePublicCertificate.cer, key Database\DatabasePrivateCertificate.pem, CA cert Database\DatabaseCertificateAuthority.cer, mutual TLS enabled�
[http] 2024/02/14 19:05:27 execute queue processing started with capacity 1024, batch size 128, timeout 50ms�
[http] 2024/02/14 19:05:27 service listening on 192.168.2.61:8907�
[rqlited] 2024/02/14 19:05:27 HTTP server started�
[store] 2024/02/14 19:05:27 opening store with node ID 192.168.2.61, listening on 192.168.2.61:8905�
[store] 2024/02/14 19:05:27 ensuring data directory exists at Database\Global�
[store] 2024/02/14 19:05:27 old snapshot directory Database\Global\snapshots does not exist, nothing to upgrade�
[snapshot-store] 2024/02/14 19:05:27 store initialized using Database\Global\rsnapshots�
[snapshot-store] 2024/02/14 19:05:27 checking consistency of snapshot store at Database\Global\rsnapshots�
[snapshot-store] 2024/02/14 19:05:27 check complete�
[store] 2024/02/14 19:05:27 1 preexisting snapshots present�
[rqlited] 2024/02/14 19:05:27 failed to open store: set log info: failed to get last command index: failed to get log at index 12437: log not found�
�
            _ _ _�
        | (_) |�
_ __ __ _| |_| |_ ___�
| '__/ _  | | | __/ _ \   The lightweight, distributed�
| | | (_| | | | ||  __/   relational database.�
|_|  \__, |_|_|\__\___|�
        | |               www.rqlite.io�
        |_|�
�
[rqlited] 2024/02/14 19:07:36 rqlited starting, version 8, SQLite 3.44.0, commit unknown, branch unknown, compiler gc�
[rqlited] 2024/02/14 19:07:36 go1.21.5, target architecture is amd64, operating system target is windows�
[rqlited] 2024/02/14 19:07:36 launch command: rqlited.exe -raft-log-level=DEBUG -node-id 192.168.2.61 -http-addr 192.168.2.61:8907 -raft-addr 192.168.2.61:8905 -raft-non-voter=true -node-ca-cert Database\DatabaseCertificateAuthority.cer -node-cert Database\DatabasePublicCertificate.cer -node-verify-client -node-key Database\DatabasePrivateCertificate.pem -http-ca-cert Database\DatabaseCertificateAuthority.cer -http-cert Database\DatabasePublicCertificate.cer -http-verify-client -http-key Database\DatabasePrivateCertificate.pem -raft-timeout=2s -raft-election-timeout=10s Database\Global�
[rqlited] 2024/02/14 19:07:36 enabling node-to-node encryption with cert: Database\DatabasePublicCertificate.cer, key: Database\DatabasePrivateCertificate.pem, CA cert Database\DatabaseCertificateAuthority.cer, mutual TLS enabled�
[rqlited] 2024/02/14 19:07:36 Raft TCP mux Listener registered with byte header 1�
[mux] 2024/02/14 19:07:36 TLS mux serving on 192.168.2.61:8905, advertising 192.168.2.61:8905�
[rqlited] 2024/02/14 19:07:36 preexisting node state detected in Database\Global�
[cluster] 2024/02/14 19:07:36 service listening on 192.168.2.61:8905�
[rqlited] 2024/02/14 19:07:36 cluster TCP mux Listener registered with byte header 2�
[http] 2024/02/14 19:07:36 secure HTTPS server enabled with cert Database\DatabasePublicCertificate.cer, key Database\DatabasePrivateCertificate.pem, CA cert Database\DatabaseCertificateAuthority.cer, mutual TLS enabled�
[http] 2024/02/14 19:07:36 execute queue processing started with capacity 1024, batch size 128, timeout 50ms�
[http] 2024/02/14 19:07:36 service listening on 192.168.2.61:8907�
[rqlited] 2024/02/14 19:07:36 HTTP server started�
[store] 2024/02/14 19:07:36 opening store with node ID 192.168.2.61, listening on 192.168.2.61:8905�
[store] 2024/02/14 19:07:36 ensuring data directory exists at Database\Global�
[store] 2024/02/14 19:07:36 old snapshot directory Database\Global\snapshots does not exist, nothing to upgrade�
[snapshot-store] 2024/02/14 19:07:36 store initialized using Database\Global\rsnapshots�
[snapshot-store] 2024/02/14 19:07:36 checking consistency of snapshot store at Database\Global\rsnapshots�
[snapshot-store] 2024/02/14 19:07:36 check complete�
[store] 2024/02/14 19:07:36 1 preexisting snapshots present�
[rqlited] 2024/02/14 19:07:36 failed to open store: set log info: failed to get last command index: failed to get log at index 12437: log not found�

The text was updated successfully, but these errors were encountered:

otoolep · 2024-02-27T19:12:11Z

This is an issue in some now-obsolete code, which is preventing the Store from opening and carrying on with normal catch-up with the rest of the cluster. Instead of debugging it, it's time I just removed the code, as it serves no purpose -- and hasn't done so since the 7.x series.

We do lose some logging around start-up times, but I think there are better ways to do this that don't involve Goroutines.

dwco-z · 2024-02-27T19:38:19Z

Thanks for your swift action on the matter @otoolep

otoolep · 2024-02-27T21:38:54Z

Merged.

Not necessarily the root fix, but this code is preventing the node from starting up correctly. Just nuke it, and let's see. There may actually be no issue.

wellescastro · 2024-02-27T21:50:16Z

Agreed. I'll let you know if anything else comes up. Thanks again.

otoolep linked a pull request Feb 27, 2024 that will close this issue

Remove obsolete log info code #1706

Merged

otoolep closed this as completed in #1706 Feb 27, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Read-only node showing failed to open store: set log info: failed to get last command index #1705

Read-only node showing failed to open store: set log info: failed to get last command index #1705

dwco-z commented Feb 27, 2024 •

edited

otoolep commented Feb 27, 2024

dwco-z commented Feb 27, 2024

otoolep commented Feb 27, 2024 •

edited

wellescastro commented Feb 27, 2024

Read-only node showing failed to open store: set log info: failed to get last command index #1705

Read-only node showing failed to open store: set log info: failed to get last command index #1705

Comments

dwco-z commented Feb 27, 2024 • edited

otoolep commented Feb 27, 2024

dwco-z commented Feb 27, 2024

otoolep commented Feb 27, 2024 • edited

wellescastro commented Feb 27, 2024

dwco-z commented Feb 27, 2024 •

edited

otoolep commented Feb 27, 2024 •

edited