readyz/freshness do not consider raft index #1672

Closed
aderouineau opened this issue Feb 6, 2024 · 93 comments

What version are you running?
8.19.0

Are you using Docker or Kubernetes to run your system?
Neither: running rqlited directly.

Are you running a single node or a cluster?
Single voter + read-only

What did you do?

  1. Launched a voter node: ./rqlite-v8.19.0-linux-amd64/rqlited -node-id 1 data1
  2. Imported a large (~10GB) SQLite database with .boot in rqlite CLI
  3. Launched a read-only node: ./rqlite-v8.19.0-linux-amd64/rqlited -raft-non-voter=true -http-addr=localhost:4011 -raft-addr=localhost:4012 -join=localhost:4002 data2
  4. Queried curl -G 'localhost:4011/readyz?noleader' which returned [+]node ok
  5. Queried curl -G 'localhost:4011/readyz' which returned [+]node ok [+]leader ok [+]store ok
  6. Queried curl -G 'localhost:4011/db/query?level=none&freshness=1s' --data-urlencode 'q=SELECT * FROM table LIMIT 5' which returned {"results":[{"error":"no such table: table"}]}

Logs of interest on the read-only node:

[rqlited] 2024/02/06 00:25:38 node HTTP API available at http://localhost:4011
[rqlited] 2024/02/06 00:25:38 connect using the command-line tool via 'rqlite -H localhost -p 4011'
[raft] 2024/02/06 00:25:39 [WARN]  failed to get previous log: previous-index=10 last-index=0 error="log not found"
[store] 2024/02/06 00:28:27 initiating node restore on node ID localhost:4012
[store] 2024/02/06 00:28:58 successfully opened database at /home/derouin/rqlite/data2/db.sqlite due to restore
[store] 2024/02/06 00:28:58 node restored in 31.508336751s
  • I expected /readyz to be able to consider the staleness of its raft log, either by default or with a parameter. In this case, it should not say "node ok" until the data was available locally.
  • I expected the SELECT query with level=none&freshness=1s to return an error due to staleness. It seems like freshness is based on the last time the node contacted the leader, but I think the test should also look at the raft index.
@otoolep (Member) commented Feb 6, 2024

Thanks for the feedback.

I expected /readyz to be able to consider the staleness of its raft log, either by default or with a parameter. In this case, it should not say "node ok" until the data was available locally.

The fact that the node is "behind" the leader has no bearing on its readiness. Ready is defined as the node being "ready to respond to database requests and cluster management operations" -- and once connected to the Leader this is true, except for None consistency. None doesn't involve consensus or the Leader, so that's why you're seeing what you're seeing. But /readyz is working as intended -- I can update the docs to make this clearer.

I expected the SELECT query with level=none&freshness=1s to return an error due to staleness. It seems like freshness is based on the last time the node contacted the leader, but I think the test should also look at the raft index.

That's fair, I can see how it could be a little deceiving, the docs need to be clarified. The issue is that even if the node you're querying knew how many Raft log entries it was behind the leader, it's not easy to convert that to a "time". Who knows, for example, when the logs will arrive. However I don't believe the underlying Raft system exposes the Leader's index, as it's contained in the heartbeat with the node. So I'm not sure if I can add support for that.

So yeah, I need to tweak the docs -- good feedback. But the system is working as designed in both cases.

@otoolep (Member) commented Feb 6, 2024

I think you've identified an interesting practical issue however -- how does one know when the second node has loaded a large dataset after joining the cluster, one that you previously booted a single node with?

One idea would be to enhance curl -G 'localhost:4011/readyz' so that it returns this:

[+]node ok
[+]leader ok
[+]store ok
[+]sync ok

sync would be a new addition. If sync is OK it means that the applied index (what's reflected in the node's database) is the same index as what is applied in the Leader's database. WDYT?

@otoolep (Member) commented Feb 6, 2024

Of course, if the node can't contact the Leader, the sync would not be OK. But in that case leader wouldn't be OK either.

otoolep added a commit to rqlite/rqlite.io that referenced this issue Feb 6, 2024
@otoolep (Member) commented Feb 6, 2024

how does one know when the second node has loaded a large dataset after joining the cluster, one that you previously booted a single node with?

Obviously you can look at the logs, but I'm talking about an automated check.

@aderouineau (Author) commented Feb 6, 2024

Thanks for the quick response, @otoolep!

Ready is defined as the node being "ready to respond to database requests and cluster management operations" -- and once connected to the Leader this is true except for None consistency. None doesn't involve consensus or the Leader, so that's why you're seeing what you're seeing. But /readyz is working as intended -- I can update the docs to make this clearer.

Do you agree that a read-only node only makes sense with the use of "None" consistency? And therefore that it is actually important for a read-only node to provide a status of the freshness of its data, not just whether it can contact the leader?

Could we add a freshness parameter to /readyz that would add your suggested [+]sync ok (or an error)?

That's fair, I can see how it could be a little deceiving, the docs need to be clarified.

It would be nice to clarify this quirk in the docs. I do think this is more than a little deceiving though, as the term "freshness" is generally understood to refer to the data. And I believe there is little use in checking the freshness of the node-leader connection if the data itself is far from fresh. Especially when the goal is for an application to query the replica locally, and therefore needs to wait for the replica to be fresh enough or else many queries will fail for bad reasons.

The issue is that even if the node you're querying knew how many Raft log entries it was behind the leader, it's not easy to convert that to a "time". Who knows, for example, when the logs will arrive.

Would the following be possible?

  • If the log is not being updated, use time.Since(s.raft.LastContact()).Nanoseconds()
  • If the log is being updated, update the timestamp in WaitForAppliedIndex().

By the way, tracking data freshness could allow an interesting feature: combining freshness with Weak consistency to allow a node to authoritatively answer a request if its data is fresh enough, and forward if it is not. Tracking the freshness of the data rather than the connection makes this more robust.

@otoolep (Member) commented Feb 6, 2024

Agreed, there is an assumption in the way freshness works that if the Leader has been contacted within, say, the last 5 seconds then the node isn't "behind" the Leader more than 5 seconds. But yeah, in certain circumstances (large data transfers, or when the node connects to the Leader after having been disconnected for longer than that) it's not necessarily correct. freshness is more about guarding against the node being disconnected altogether. So yes, the term is not great, but we're stuck with it now (it's been around for years) because I can't change the API in a breaking manner.

You are correct that read-only nodes only provide read scalability if you use None. But this means that when serving a None query on a read-only node, the node cannot reach out to the Leader to first check the Leader's Applied Index (the index of the log entry the Leader has most recently applied to the SQLite database) as that would defeat the point of None -- you might as well do Weak in that case.

No, None checks must run just with the information the node has at that time -- that places limits on what we can do. However if a node is in contact with the Leader, then it must have received an AppendEntries RPC request within (at least) the Heartbeat interval. And those RPCs contain information about the index of the last log the Leader has committed (which is not the same as Applied, but is close in practice).

So in principle the best definition a read-only node could offer is the following: a freshness of, say, 5 seconds means that a) the node has heard from the Leader in the last 5 seconds, and b) the AppliedIndex of the node is at least as great as CommitIndex value contained in the last message received from the Leader.

There is a fair amount of Raft terminology in the description above, but it comes down to what you're saying. Interesting, I hadn't thought about this whole area in a while, but I see it could be improved. It's not necessarily a redefinition of freshness but it's certainly a tightening up (and was my intention in the first place).
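In rough pseudocode, the node-side check would then be something like this (a sketch with made-up names, not the actual store code):

package store

import "time"

// isStale sketches the tightened check described above: a None read is stale
// if the node hasn't heard from the Leader within the freshness window, or if
// it hasn't yet applied everything the Leader had committed as of that last
// contact.
func isStale(freshness time.Duration, lastContact time.Time,
	leaderCommitIndex, appliedIndex uint64) bool {
	if freshness > 0 && time.Since(lastContact) > freshness {
		return true
	}
	return appliedIndex < leaderCommitIndex
}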

If the log is being updated, update the timestamp in WaitForAppliedIndex()
Yeah, that wouldn't work, because that function only uses the node's state when executed; it has nothing to do with any Leader state.

@otoolep (Member) commented Feb 6, 2024

Just to clarify something. With the current implementation, a node could, in theory, have heard from the Leader in, say, the last second, but be many log entries behind. This is because the node might still be catching up with the Leader, as the Leader is streaming log entries to it. But the freshness check could still pass. That's what I'm proposing to address here.

@otoolep (Member) commented Feb 6, 2024

the AppliedIndex of the node is at least as great as CommitIndex value contained in the last message received from the Leader.

Actually, this is where it gets tricky. This is the strictest form of the check. A client making a call might be OK if the node is at most N entries behind. So it might make sense to use the above as the default, but allow the client to loosen it somewhat.

@otoolep (Member) commented Feb 6, 2024

@aderouineau -- not sure if you're getting into the code, but if you look at #1673, it now redefines a freshness of, say, 5 seconds, to be the following:

  • the node must have heard from the Leader in the last 5 seconds (no change to definition). It is this message that tells the node what the latest committed index is on the Leader.
  • the node must have applied the latest committed index as set on the Leader to its local SQLite database.

I could make the case that this is a better definition of stale reads, and what I really meant ever since I introduced the concept.

@otoolep (Member) commented Feb 6, 2024

This basically addresses your issue, I think. :-)

@otoolep (Member) commented Feb 6, 2024

Actually, there are a couple more subtleties here, so not done yet.

@otoolep (Member) commented Feb 6, 2024

The issue with the current change is the following:

  • CommitIndex can run slightly ahead of DBAppliedIdx (in fact, it does this a lot: any time a log is sent to a follower, Commit Index is updated before DBAppliedIndex).
  • So read-staleness could appear more often than the user might expect, and increasing the freshness time wouldn't help.
  • We could consider bounding how much DBAppliedIndex < Commit Index, but the value to pick is not clear to me.

@aderouineau (Author):

Yes, the new condition in #1673 has a race condition which is exacerbated the larger the update.

@otoolep (Member) commented Feb 6, 2024

OK, thanks again for pushing on this, it's interesting. Let's see if we can solve it, or at least make clear statements about what we can do.

I saw your comment from earlier, about adding a time.Time somewhere to the Store. That might help, but fundamentally this comes down to this issue: we are trying to correlate the passage of time to index increments, but Raft says nothing about that correlation.

I think what I describe below is the hard case, let me know what you think. All time is in units of seconds.

  • Let's start in steady state, every node in a 3-node cluster is caught up with the Leader.
  • A write happens, and the Leader sends a Raft log entry to a Follower at T = 0. The Follower applies it and is completely up-to-date.
  • At T = 0.5 the follower loses network connection to the Leader for 2 seconds.
  • During this 2-second window the Leader gets some updates, and commits them successfully (it's a 3-node cluster, so losing contact with the Follower doesn't stop the cluster from doing work). Let's say the Leader's Commit Index gets to 10 during the 2-second window it has no connection to the Follower.
  • At T = 2.5 the Follower regains connection with the Leader, let's say it happens via a Heartbeat so the Leader doesn't send any Log Entries at reconnect. I don't know for a fact this is how the Hashicorp Library will work, I'd need to check. But it makes sense.
  • Let's say the Leader sends just the first Log Entry that occurred during the 2-second window to the Follower, as part of catching it up. Now the Follower has a Last Applied time (which you are proposing) set to now, as well as a Last Contact time of now. Let's say all this happens at T = 2.6.
  • At T = 2.7 a client issues a NONE read, with freshness set to 0.5 seconds. This will pass because both Last Contact Time and Last Applied Time are within the freshness, though the node has data which could be ~2 seconds out of date.

I believe this is the crux of the issue. Do you agree with my analysis? The issue is that a given log entry is applied at different absolute times on different nodes. Now, of course, you could start to introduce the concept of an absolute clock into Log Entries, but that introduces lots of complexity, and systems can really be screwed up if an operator changes the time on the machines. You really, really want to avoid introducing absolute time into distributed systems if you can. I would not introduce it for the purposes of solving this issue. I don't see any problem with offering a new option for NONE reads -- "check that the Last Applied time is also within the freshness", but I'm just pointing out that it doesn't prevent stale reads in the way the user might expect.

WDYT?

@aderouineau (Author):

I agree with your analysis, and maybe we can find different levels of mitigation.

Is it possible to know the latest log entry on the leader even if we are only sent an intermediate one? If it is, then a node could track the time at which it is made aware of the latest index, even if the corresponding log entry has not yet been sent.

but I'm just pointing out that it doesn't prevent stale reads in the way the user might expect.

Depending on the compromise, we will definitely need to manage expectations in the logs.

@otoolep (Member) commented Feb 6, 2024

Is it possible to know the latest log entry on the leader even if we are only sent an intermediate one?

Yes, the moment raft.LastContact() is updated on the Follower, raft.CommitIndex() is also updated.

So that's the info we have on the Follower. This is because comms between the Leader and the Follower (with the exception of a Snapshot, which is a special case, and actually what you're hitting in particular) are done via AppendEntriesRequests, and those RPCs contain that information.

https://pkg.go.dev/github.com/hashicorp/raft#AppendEntriesRequest
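The fields of interest look roughly like this (trimmed; see the godoc above for the full struct):

type AppendEntriesRequest struct {
	// ...
	PrevLogEntry      uint64
	PrevLogTerm       uint64
	Entries           []*Log // no entries for pure heartbeats
	LeaderCommitIndex uint64 // Commit index on the Leader at the time of the RPC
}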

So yes, we know what the Commit Index is on the Leader, as of the time returned by LastContact(). We can't know the entry itself, of course, until it arrives, but we do know its committed index. So given that, what do you suggest?

@otoolep (Member) commented Feb 6, 2024

The best thing we might be able to do is this: offer freshness control and Raft entry index control. They are controlled separately, but probably make the most sense together.

curl -G 'localhost:4011/db/query?level=none&freshness=1s&raft_delta=2'

This would mean "give me the data as long as you've heard from the Leader within the last second and your Applied Index is no more than 2 entries behind the Leader's Committed Index". It's definitely a technical issue at this stage, and most folks may not care, but at least end-users could tune to their needs. To keep existing behaviour, raft_delta would not be checked unless explicitly set (a value of 0 would count as being "set").
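Very roughly, the node-side logic might combine the two like this (a sketch only; raft_delta is just a proposal at this point, and the names are made up):

package store

import "time"

// staleForNoneRead: freshness bounds how long ago the Leader was last heard
// from, and raftDelta (when explicitly set) bounds how far the node's applied
// index may lag the Leader's commit index.
func staleForNoneRead(freshness time.Duration, raftDelta uint64, deltaSet bool,
	lastContact time.Time, leaderCommit, applied uint64) bool {
	if freshness > 0 && time.Since(lastContact) > freshness {
		return true
	}
	if deltaSet && leaderCommit > applied && leaderCommit-applied > raftDelta {
		return true
	}
	return false
}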

@aderouineau (Author) commented Feb 6, 2024

I think we can find a way to convert log delta into timestamps.

Essentially, I think a node can be considered fresh if both the following are within the freshold (hope you like this pun :p) duration:

  • The difference between LastContact() and now
  • The difference between lastCommitIndexSeen and lastCommitIndexApplied.

@otoolep (Member) commented Feb 6, 2024

The difference between lastCommitIndexSeen and lastCommitIndexApplied

Can you give me an example? Is the latter simply the latest time any log entry was applied to the database? Or are you proposing to track the applied time on a per index basis? If you're not proposing to track on a per index basis, I don't see how that solves the problem above.

@otoolep (Member) commented Feb 6, 2024

If you look at the example above, lastCommitIndexSeen is the LastContact, but lastCommitIndexApplied is only 0.1 seconds later than LastContact (assuming lastCommitIndexApplied is the latest time a log was applied to the database). This would be within a freshness threshold of even 0.2 seconds, but the data on the Follower is possibly 2 seconds behind the Leader.

If you are proposing to track it on a per index basis, via some sort map (indexes to timestamps) then how large do you make the map? It could get complicated.

@aderouineau (Author):

(assuming lastCommitIndexApplied is the latest time a log was applied to the database)

That is not the assumption I am proposing. When a log has just been applied, we check if it is the latest index seen. If it is, we can update lastCommitIndexApplied.

This does create an issue for constant updates where the node never gets a chance to apply a log that is the latest. I'm hoping there's a way to track that.
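Roughly what I have in mind, as a sketch (made-up names, not actual rqlite code):

package store

import (
	"sync"
	"time"
)

// freshnessTracker records when the node learned of the Leader's latest commit
// index, and when it applied a log that caught it up to that index.
type freshnessTracker struct {
	mu                     sync.Mutex
	latestIndexSeen        uint64    // highest Leader commit index seen so far
	lastCommitIndexSeen    time.Time // when latestIndexSeen was learned
	lastCommitIndexApplied time.Time // when the node applied up to latestIndexSeen
}

// SawLeaderCommit is called whenever the node hears a commit index from the Leader.
func (t *freshnessTracker) SawLeaderCommit(idx uint64) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if idx > t.latestIndexSeen {
		t.latestIndexSeen = idx
		t.lastCommitIndexSeen = time.Now()
	}
}

// Applied is called after a log entry has been applied to the local database.
func (t *freshnessTracker) Applied(idx uint64) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if idx >= t.latestIndexSeen {
		// The just-applied entry is the latest index we know about, so as far
		// as we can tell the local data is caught up right now.
		t.lastCommitIndexApplied = time.Now()
	}
}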

@aderouineau (Author):

By the way, the issue I identified was for a read-only node joining the cluster from scratch. Maybe it would help to consider this scenario separately? Or find a solution that at least mitigates this particular problem?

@otoolep (Member) commented Feb 6, 2024

That is not the assumption I am proposing. When a log has just been applied, we check if it is the latest index seen. If it is, we can update lastCommitIndexApplied.

I see, OK, let me think about that. It might improve the situation.

This does create an issue for constant updates where the node never gets a chance to apply a log that is the latest. I'm hoping there's a way to track that.

But that is just a variation on the hard problem I outlined above.

The core issue is that the AppliedIndex can always be behind the CommitIndex, and there is no way to translate that delta to absolute time. It can be behind due to catch-up-after-network-disconnect (the scenario I outlined above) or your scenario where the updates happen to be coming in fast when the NONE read is issued. It's the same fundamental issue as far as I can see.

I know it feels like this should be solvable, but the fundamental issue is that when a node receives a committed log entry from a Leader (and LastContact and CommitIndex are updated on the Follower), there is no way for the Follower to know when that Log Entry was actually committed on the cluster. It could have been committed 1 second ago, or it could have been committed 3 days ago. Nothing in the Raft protocol involves absolute time (for good reason).

I'm pushing on this because we both want to get it right. But this discussion has been super-helpful (thanks so much!) -- it's exposed an important weakness in the NONE read semantics. Even if we add the raft_delta flag, we could still be quite stale, even if both the freshness time is short, and raft_delta small. The docs definitely need work to make this clearer.

Let me know if you disagree with my analysis. I will update the docs, and think about adding the raft_delta option. If it existed today I believe it would allow you to know when the second node joining your cluster was caught up (assuming there are no ongoing writes).

@otoolep (Member) commented Feb 6, 2024

Hashicorp Consul offers similar functionality: https://developer.hashicorp.com/consul/docs/agent/config/config-files#max_stale

I wonder if it suffers from the same problem? Their docs imply it doesn't, but perhaps they've made the same mistake?

@otoolep (Member) commented Feb 6, 2024

I know it feels like this should be solvable, but the fundamental issue is that when a node receives a committed log entry from a Leader (and LastContact and CommitIndex are updated on the Follower), there is no way for the Follower to know when that Log Entry was actually committed on the cluster. It could have been committed 1 second ago, or it could have been committed 3 days ago. Nothing in the Raft protocol involves absolute time (for good reason).

Oops, I'm confusing things. It doesn't matter if it was committed 3 days ago; the main question is whether the Follower is running with the latest committed entry. Sheesh, there are a lot of details here! Ignore that.

That said, I think my objection that the "constant updates" problem you mentioned is just a variation of the hard problem is correct, right? I think if you're going to convince me this can be solved, I'd need a step-by-step example.

@otoolep (Member) commented Feb 6, 2024

@aderouineau -- your engagement on this issue continues to be very helpful. Thanks again.

@aderouineau (Author):

I wonder if it suffers from the same problem? Their docs imply it doesn't, but perhaps they've made the same mistake?

Quickly looking at their code, it does seem like they are making the same mistake. Examples:

@otoolep (Member) commented Feb 15, 2024

While the snapshot is being transferred and installed, can I see the output of /status?pretty on the receiving node?

@aderouineau (Author):

During the transfer:

{
  "addr": "localhost:4012",
  "apply_timeout": "10s",
  "command_commit_index": 0,
  "db_applied_index": 0,
  "db_conf": {
    "fk_constraints": false
  },
  "dir": "/home/derouin/rqlite/data2",
  "dir_size": 2361266176,
  "dir_size_friendly": "2.4 GB",
  "election_timeout": "1s",
  "fsm_index": 0,
  "fsm_update_time": "0001-01-01T00:00:00Z",
  "heartbeat_timeout": "1s",
  "last_applied_index": 0,
  "leader": {
    "addr": "localhost:4002",
    "node_id": "1"
  },
  "leader_appended_at_time": "0001-01-01T00:00:00Z",
  "no_freelist_sync": false,
  "node_id": "2",
  "nodes": [],
  "observer": {
    "dropped": 0,
    "observed": 1
  },
  "open": true,
  "raft": {
    "applied_index": 0,
    "bolt": {
      "TxStats": {
        "PageCount": 14,
        "PageAlloc": 57344,
        "CursorCount": 44,
        "NodeCount": 13,
        "NodeDeref": 0,
        "Rebalance": 0,
        "RebalanceTime": 0,
        "Split": 0,
        "Spill": 7,
        "SpillTime": 33843,
        "Write": 21,
        "WriteTime": 9530799
      },
      "FreePageN": 0,
      "PendingPageN": 2,
      "FreeAlloc": 8192,
      "FreelistInuse": 32,
      "TxN": 11,
      "OpenTxN": 0
    },
    "commit_index": 0,
    "fsm_pending": 0,
    "last_contact": "127.527386ms",
    "last_log_index": 0,
    "last_log_term": 0,
    "last_snapshot_index": 0,
    "last_snapshot_term": 0,
    "latest_configuration": "[]",
    "latest_configuration_index": 0,
    "log_size": 32768,
    "num_peers": 0,
    "protocol_version": 3,
    "protocol_version_max": 3,
    "protocol_version_min": 0,
    "snapshot_version_max": 1,
    "snapshot_version_min": 0,
    "state": "Follower",
    "term": 74,
    "voter": false
  },
  "ready": true,
  "reap_read_only_timeout": "0s",
  "reap_timeout": "0s",
  "request_marshaler": {
    "compression_batch": 50,
    "compression_size": 1024,
    "force_compression": false
  },
  "snapshot_interval": "10s",
  "snapshot_store": {
    "db_path": "",
    "dir": "/home/derouin/rqlite/data2/rsnapshots",
    "snapshots": []
  },
  "snapshot_threshold": 8192,
  "sqlite3": {...},
  "trailing_logs": 10240
}

The most interesting change during the restore is:

  "snapshot_store": {
    "db_path": "/home/derouin/rqlite/data2/rsnapshots/44-6-1708013657655.db",
    "dir": "/home/derouin/rqlite/data2/rsnapshots",
    "snapshots": [
      "44-6-1708013657655"
    ]
  }

When restore is done, this is where we get a lot more stuff: "db_applied_index" and "fsm_index" are 6 instead of 0, "last_applied_index" is 8 instead of 0, "nodes" is no longer empty, and there is more non-zero data in "raft".

@otoolep (Member) commented Feb 15, 2024

OK, well the output of /status confirms why the recent work doesn't solve this issue. It's as I suspected. The Leader is not heartbeating with the node while the snapshot transfer is taking place, and therefore you have these values:

  "leader_appended_at_time": "0001-01-01T00:00:00Z"
  "fsm_index": 0
  "command_commit_index": 0

Only after the snapshot has been transferred and installed are all these values brought up to date.

Now we could enhance the isStaleRead logic to do something different when fsm_index and command_commit_index are zero, but that doesn't really solve the problem because:

  • if they are zero, that doesn't mean that a snapshot is being transferred and installed. The node could be perfectly caught up, the cluster could be totally idle. Perhaps not in practice, but in theory yes.
  • if they are non-zero, but equal, a snapshot could be in transfer anyway. It just takes place later in the node's life.

This shows that when a snapshot is being transferred and installed, the staleness is undefined, at least by the current logic. That said, by definition, if a snapshot is being transferred and installed -- which the system can determine -- the node is definitely behind, but by how much, it seems impossible to know.

I think one change I would be OK with is the following. If freshness_strict is set, then a None read will return "stale read" if either of the following is true:

  • the current check (as it's currently coded in 8.20.1) is true
  • a snapshot transfer and install is in progress. No attempt would be made to measure "time behind leader". It's a stale read, full stop.
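In pseudocode, the combined check might look like this (a sketch only; snapshotInProgress is a hypothetical flag, and detecting that state is its own problem):

package store

import "time"

// staleStrict sketches the freshness_strict proposal: with a positive
// freshness, a None read is stale if the existing check says so, or, when
// freshness_strict is set, if a snapshot transfer/install is underway.
func staleStrict(freshness time.Duration, strict, currentCheckStale, snapshotInProgress bool) bool {
	if freshness <= 0 {
		return false // freshness not in play
	}
	if currentCheckStale {
		return true
	}
	return strict && snapshotInProgress
}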

Thoughts?

@otoolep (Member) commented Feb 15, 2024

Well, there may be some subtlety here -- the Last Contact time is only 127ms in the past, so the Leader is contacting the node, but it's not sending any log entries. So I take it back, it looks like it is heartbeating, but since it's not sending any Log Entries with the heartbeat, there is no mechanism to send any "appended at" timestamps (they only come with logs).

So a more precise statement would be that in the case you are hitting, there are no Leader timestamps available to compare to.

@otoolep (Member) commented Feb 15, 2024

So I'm back to the simple proposal -- if a snapshot transfer and install is taking place, all None level reads are stale for any value of freshness (except a value of 0 or less), if freshness_strict is also set.

@otoolep (Member) commented Feb 15, 2024

So I'm back to the simple proposal -- if a snapshot transfer and install is taking place, all None level reads are stale for any value of freshness (except a value of 0 or less), if freshness_strict is also set.

Note that this doesn't solve the case where the node comes up, it's behind the leader, needs a snapshot, but doesn't know it yet (and it hasn't started it yet). In this case reads may not be marked as stale. I do not know how this can be solved, nor do I know if it's even worth solving. We're starting to hit up against the fundamental nature of these systems now.

@otoolep (Member) commented Feb 15, 2024

Corrected latest comment.

@aderouineau (Author):

Thoughts?

Is it possible for a snapshot to be transferred when a node is in-sync, or does that only happen when a node has been out of the loop for too long? If it is the latter, then I think your proposal is a good compromise.

Note that this doesn't solve the case where the node comes up, it's behind the leader, needs a snapshot, but doesn't know it yet (and it hasn't started it yet).

Couldn't we use the fact that if commit_index is 0 then it must be waiting for stuff? I have the following index numbers when starting a new node from scratch (no data):

    "command_commit_index": 0,
    "db_applied_index": 0,
    "fsm_index": 0,
    "last_applied_index": 2,
      "applied_index": 2,
      "commit_index": 2,
      "last_log_index": 2,
      "last_snapshot_index": 0,
      "latest_configuration_index": 0,

@otoolep (Member) commented Feb 16, 2024

Is it possible for a snapshot to be transferred when a node is in-sync, or does that only happen when a node has been out of the loop for too long? If it is the latter, then I think your proposal is a good compromise.

The Leader will not send a transfer unless the node tells it it needs it. So the latter.

Couldn't we use the fact that if commit_index is 0 then it must be waiting for stuff?

Just because it's waiting for stuff doesn't mean any read of the node is stale. It has no way of knowing if there have been writes on the cluster or not until it starts getting information from the Leader. Does that make sense? This is more of an "are you ready" question, not "is your data stale".

@otoolep (Member) commented Feb 16, 2024

The Leader will not send a transfer unless the node tells it it needs it.

The node figures that out by looking at the information in the heartbeat AFAIK, but I have not looked at the code that closely.

@otoolep (Member) commented Feb 16, 2024

One thing that is important to understand is that the commit index can be greater than zero even though no writes were ever made on the cluster.

This is because cluster membership changes also go through the Raft system (this is described in the Raft paper). You may already know this. So while a commit index being zero does tell you the node is not yet ready, it doesn't allow you to say anything about the staleness of your SQLite data. There may be no SQLite data anywhere on the cluster the node is about to join.

That's why it's more suitable to enhance the /readyz response to signal this.

@otoolep (Member) commented Feb 16, 2024

I looked into the code further, and it's not as easy to determine when a snapshot is being streamed by the Leader to a node as I thought. I can detect when the Snapshot is being installed, but that is after it's been fully received. So it would only cover about half the transition.

So given the difficulty of detecting the "snapshot being streamed" state, I may not be able to add the final check. Is it possible? Probably. Is it worth the amount of changes I would need to make to the code to deal with this case? Doesn't look like it right now.

@aderouineau (Author):

Should we request consul-raft to make this easier?

@otoolep (Member) commented Feb 16, 2024

Should we request consul-raft to make this easier?

It's always an option to ask the maintainers upstream, but I don't plan to do it at this time -- unfortunately I don't have the time to engage with the upstream maintainers to the extent that would probably be required to get it in (if they even bothered to work on it -- they could consider all this an implementation detail).

There are other things we could do. For example, enhance /readyz so that it contacts the Leader when called, checks the Leader's commit index, and then /readyz doesn't return to the caller until the node's own commit index at least equals what it read from the Leader.

@otoolep (Member) commented Feb 16, 2024

There are other things we could do. For example, enhance /readyz so that it contacts the Leader when called

This would have to be an optional flag one would pass to /readyz, as I don't want to change existing behaviour. Something like /readyz?commit.

@otoolep (Member) commented Feb 16, 2024

BTW, one thing I don't understand is why the output you provided earlier on showed "commit_index": 0. We've established that the Leader is in contact with the node (last contact time was 127ms ago) so you'd think the commit_index would be updated (this is in the Raft section of your output). It comes from, presumably, here: https://pkg.go.dev/github.com/hashicorp/raft#AppendEntriesRequest If only that Commit Index were non-zero, we could solve this issue.

Perhaps you could check something for me. While the transfer and install is taking place on the read-only node, what is the output of /status?pretty on the Leader? If its Commit Index was > 0, but the read-only node's was zero, even though the Leader was heartbeating with the read-only node, that is something worth investigating -- and asking about upstream.

@otoolep (Member) commented Feb 16, 2024

Ha! I just discovered something -- raft.CommitIndex doesn't seem to return the LeaderCommitIndex sent in any Heartbeat message -- it returns the node's local CommitIndex. I instrumented the code to show this.

This may be a solution. Let me try something.

@otoolep (Member) commented Feb 16, 2024

OK, I think #1688 might help.

With this change all nodes have the latest Leader Commit Index, as sent by the latest heartbeat message from the Leader (this value only makes sense if a node is not the Leader). Given this information, /readyz could be enhanced so that it blocks (or times out) until the node's local commit index is:

  • greater than zero and
  • at least as great as the Leader's Commit Index was when /readyz was invoked.

Deployment of your read-only node could go something like this:

  • fire up your read-only node, telling it to join with the Leader
  • wait until /readyz?commit returns with HTTP 200
  • then move to None-level reads as normal, with whatever freshness you want.
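Client-side, the wait in the second step could be a simple poll until the endpoint returns HTTP 200. A sketch (the exact query parameter may end up named differently):

package main

import (
	"fmt"
	"net/http"
	"time"
)

// waitForSync polls /readyz until the node reports it has caught up with the
// Leader's commit index, or the deadline passes.
func waitForSync(base string, deadline time.Duration) error {
	end := time.Now().Add(deadline)
	for time.Now().Before(end) {
		resp, err := http.Get(base + "/readyz?commit&timeout=5s")
		if err == nil {
			resp.Body.Close()
			if resp.StatusCode == http.StatusOK {
				return nil
			}
		}
		time.Sleep(time.Second)
	}
	return fmt.Errorf("node at %s not caught up after %s", base, deadline)
}

func main() {
	if err := waitForSync("http://localhost:4011", 5*time.Minute); err != nil {
		panic(err)
	}
}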

@aderouineau (Author):

I can test that PR before merging into master

@otoolep (Member) commented Feb 16, 2024

Try #1689, see if it works for you.

http://localhost:4003/readyz?commit&timeout=30s

as an example call.

@otoolep (Member) commented Feb 16, 2024

@aderouineau -- I think #1689 is ready now, if you want to build it. I tested it locally, and it seems to work well.

@otoolep (Member) commented Feb 20, 2024

@aderouineau -- thanks again for your help with this issue. With #1689 merged, I believe I've done as much as I can do here. If you have more ideas, let me know. I tested /readyz?sync and it works as I hoped.

otoolep closed this as completed Feb 20, 2024
@otoolep (Member) commented Feb 20, 2024

Fix will be in 8.21.0.

@jasonbigl:

This is awesome!

I also need this feature. Before this release, I used a bash script (like the one below) to monitor the log for the keywords "successfully opened database". It works for me, but your solution is more elegant :)

# wait until database is ready to read
snapshotChecked=0
snapReady=$(grep -F "successfully opened database" /var/log/rqlite.log)
while [[ -z $snapReady && $snapshotChecked -lt 60 ]]; do

    sleep 1

    snapReady=$(grep -F "successfully opened database" /var/log/rqlite.log)

    echo "checking database ready for $snapshotChecked times"

    snapshotChecked=$((snapshotChecked + 1))
done

@otoolep (Member) commented Feb 21, 2024

Available in 8.21.0.

@aderouineau (Author):

@otoolep Sorry for the late reply. I tested this with 1 voter and 1 reader, and /readyz on the reader correctly blocks / times out until the node is restored.

Once a node is initialized, what happens when a future large snapshot transfer happens? Does /readyz still need to be used, or can strict_freshness be used from then on?

@otoolep (Member) commented Feb 23, 2024

@aderouineau -- I'm not sure, I think you'd need to test to be sure. strict_freshness may meet your needs once the initial transfer is done.
