readyz/freshness do not consider raft index #1672
Thanks for the feedback.
The fact that the node is "behind" the leader has no bearing on the readiness. Ready is defined as the node being "ready to respond to database requests and cluster management operations" -- and once connected to the Leader this is true, except for None consistency. None doesn't involve consensus or the Leader, so that's why you're seeing what you're seeing.
That's fair, I can see how it could be a little deceiving; the docs need to be clarified. The issue is that even if the node you're querying knew how many Raft log entries it was behind the leader, it's not easy to convert that to a "time". Who knows, for example, when the logs will arrive. However, I don't believe the underlying Raft system exposes the Leader's index, as it's contained in the heartbeat with the node. So I'm not sure I can add support for that. So yeah, I need to tweak the docs -- good feedback. But the system is working as designed in both cases.
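In sketch form, the check as described above amounts to the following (illustrative Go only, not rqlite's actual code; the names are invented):

```go
package freshness

import "time"

// isFresh is a sketch of the current behaviour described above: a
// None-consistency read is served as long as the node has heard from
// the Leader recently enough. How far behind the node's log might be
// is not considered at all.
func isFresh(lastContact time.Time, freshness time.Duration) bool {
	return time.Since(lastContact) <= freshness
}
```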
I think you've identified an interesting practical issue however -- how does one know when the second node has loaded a large dataset after joining the cluster, one that you previously booted a single node with? One idea would be to enhance /readyz.
Of course, if the node can't contact the Leader, the check will fail anyway.
Obviously you can look at the logs, but I'm talking about an automated check.
Thanks for the quick response, @otoolep!
Do you agree that a read-only node only makes sense with the use of "None" consistency? And therefore that it is actually important for a read-only node to provide a status of the freshness of its data, not just whether it can contact the leader? Could we add a check based on the Raft log index?
It would be nice to clarify this quirk in the docs. I do think this is more than a little deceiving though, as the term "freshness" is generally understood to refer to the data. And I believe there is little use in checking the freshness of the node-leader connection if the data itself is far from fresh -- especially when the goal is for an application to query the replica locally, and it therefore needs to wait for the replica to be fresh enough, or many queries will fail for bad reasons.
Would the following be possible?
By the way, tracking data freshness could allow an interesting feature: combining it with other read consistency levels.
Agreed, there is an assumption built into the way freshness works. You are correct that read-only nodes only provide read scalability if you use None consistency. No, freshness doesn't currently look at the Raft index.

So in principle the best definition a read-only node could offer is the following: a read is not stale if the node has heard from the Leader within the freshness window and has applied all the committed log entries it has seen from the Leader. There is a fair amount of Raft terminology in the description above, but it comes down to what you're saying.

Interesting -- I hadn't thought about this whole area in a while, but I see it could be improved. It's not necessarily a redefinition of freshness, more a tightening of what a stale read means.
Just to clarify something. With the current implementation, a node could, in theory, have heard from the Leader in, say, the last second, but be many log entries behind. This is because the node might still be catching up with the Leader, as the Leader is streaming log entries to it. But the freshness check could still pass. That's what I'm proposing to address here.
Actually, this is where it gets tricky. This is the strictest form of the check. A client making a call might be OK if the node is at most N entries behind. So it might make sense to use the above as a default, but allow the client to loosen it somewhat.
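A sketch of that idea, with the strictness under client control (invented names; zero means the strictest form):

```go
package freshness

// indexFresh is a sketch of the index-aware check discussed above.
// appliedIndex is what this node has applied to its SQLite database,
// commitIndex is the latest commit index it has seen from the Leader,
// and maxLag is client-supplied; maxLag == 0 is the strictest form.
func indexFresh(appliedIndex, commitIndex, maxLag uint64) bool {
	if appliedIndex >= commitIndex {
		return true // fully caught up with everything we've seen
	}
	return commitIndex-appliedIndex <= maxLag
}
```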
@aderouineau -- not sure if you're getting into the code, but if you look at #1673, it now redefines a stale read.
I could make the case that this is a better definition of stale reads, and what I really meant ever since I introduced the concept.
This basically addresses your issue, I think. :-)
Actually, there are a couple more subtleties here, so not done yet.
The issue with the current change is the following: there is a window between the node learning of a new Commit Index and finishing applying it, and a read issued in that window can still pass the check while seeing old data.
Yes, the new condition in #1673 has a race condition, which is exacerbated the larger the update.
OK, thanks again for pushing on this, it's interesting. Let's see if we can solve it, or at least make clear statements about what we can do. I saw your comment from earlier about adding a check on the last-applied time. I think what I describe below is the hard case; let me know what you think. All time is in units of seconds.
I believe this is the crux of the issue. Do you agree with my analysis? The issue is that a given log entry is applied at different absolute times on different nodes.

Now, of course, you could start to introduce the concept of an absolute clock into Log Entries, but that introduces lots of complexity, and systems can really be screwed up if an operator changes the time on the machines. You really, really want to avoid introducing absolute time into distributed systems if you can. I would not introduce it for the purposes of solving this issue.

I don't see any problem with offering a new option for None reads -- "check that the Last Applied time is also within the freshness" -- but I'm just pointing out that it doesn't prevent stale reads in the way the user might expect. WDYT?
I agree with your analysis, and maybe we can find different levels of mitigation. Is it possible to know the latest log entry on the leader even if we are only sent an intermediate one? If it is, then a node could track the time at which it is made aware of the latest index, even if the corresponding log entry has not yet been sent.
Depending on the compromise, we will definitely need to manage expectations in the logs.
Yes, the moment the Follower receives an AppendEntriesRequest from the Leader, it learns the Leader's Commit Index.
So that's the info we have on the Follower. This is because comms between the Leader and the Follower (with the exception of a Snapshot, which is a special case, and actually what you're hitting in particular) are done via AppendEntriesRequests, and those RPCs contain that information. https://pkg.go.dev/github.com/hashicorp/raft#AppendEntriesRequest So yes, we know what the Commit Index is on the Leader, as of the time returned by LastContact().
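For context, the Follower-side values discussed here are available from hashicorp/raft's public API (a small illustrative snippet; note the Leader's commit index itself arrives inside the RPC above and is not exposed directly):

```go
package raftinfo

import (
	"fmt"
	"time"

	"github.com/hashicorp/raft"
)

// report prints the Follower-side state discussed above. LastContact is
// what the current freshness check is built on; AppliedIndex is how far
// the local FSM (the SQLite database) has actually advanced.
func report(r *raft.Raft) {
	fmt.Println("since last Leader contact:", time.Since(r.LastContact()))
	fmt.Println("applied index:", r.AppliedIndex())
	fmt.Println("last index in local log:", r.LastIndex())
}
```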
The best thing we might be able to do is this: offer freshness control, and Raft entry index control. They are controlled separately, but probably make most sense together.
This would mean "give me the data as long as you've heard from the Leader within the last two seconds and your Applied Index is no more than 2 entries behind the Leader's Committed Index". It's definitely a technical issue at this stage, and most folks may not care, but at least end-users could tune it to their needs. To keep existing behaviour, the index check would be off by default.
I think we can find a way to convert log delta into timestamps. Essentially, I think a node can be considered fresh if both of the following are within the freshold (hope you like this pun :p) duration:

- the time since the node last learned the Leader's Commit Index (lastCommitIndexSeen)
- the time since the node last applied that latest-seen Commit Index to its database (lastCommitIndexApplied)
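A sketch of how that two-timestamp check might read (illustrative only, assuming both timestamps are maintained as just described):

```go
package freshness

import "time"

// freshEnough is a sketch of the two-condition proposal above: both the
// time since the node last learned the Leader's Commit Index, and the
// time since it last finished applying that latest-seen index, must be
// within the freshold.
func freshEnough(lastCommitIndexSeen, lastCommitIndexApplied time.Time, freshold time.Duration) bool {
	return time.Since(lastCommitIndexSeen) <= freshold &&
		time.Since(lastCommitIndexApplied) <= freshold
}
```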
Can you give me an example? Is the latter simply the latest time any log entry was applied to the database? Or are you proposing to track the applied time on a per-index basis? If not, I don't see how that solves the problem above.
If you look at the example above, lastCommitIndexSeen is the LastContact, but lastCommitIndexApplied is only 0.1 seconds later than LastContact (assuming lastCommitIndexApplied is the latest time a log was applied to the database). This would be within a freshness threshold of even 0.2 seconds, but the data on the Follower is possibly 2 seconds behind the Leader. If you are proposing to track it on a per-index basis, via some sort of map (indexes to timestamps), then how large do you make the map? It could get complicated.
That is not the assumption I am proposing. When a log has just been applied, we check if it is the latest index seen. If it is, we can update lastCommitIndexApplied. This does create an issue for constant updates, where the node never gets a chance to apply a log that is the latest. I'm hoping there's a way to track that.
By the way, the issue I identified was for a read-only node joining the cluster from scratch. Maybe it would help to consider this scenario separately? Or find a solution that at least mitigates this particular problem?
I see, OK, let me think about that. It might improve the situation.
But that is just a variation on the hard problem I outlined above. The core issue is that the AppliedIndex can always be behind the CommitIndex, and there is no way to translate that delta into absolute time. It can be behind due to catch-up-after-network-disconnect (the scenario I outlined above) or your scenario, where the updates happen to be coming in fast when the NONE read is issued. It's the same fundamental issue as far as I can see.

I know it feels like this should be solvable, but the fundamental issue is that when a node receives a committed log entry from a Leader (and LastContact and CommitIndex are updated on the Follower), there is no way for the Follower to know when that Log Entry was actually committed on the cluster. It could have been committed 1 second ago, or it could have been committed 3 days ago. Nothing in the Raft protocol involves absolute time (for good reason).

I'm pushing on this because we both want to get it right. But this discussion has been super-helpful (thanks so much!) -- it's exposed an important weakness in the NONE read semantics. Even if we add the Last Applied time check, it won't prevent stale reads in the way a user might expect.

Let me know if you disagree with my analysis. I will update the docs, and I'll think about adding the optional check regardless.
Hashicorp Consul offers similar functionality: https://developer.hashicorp.com/consul/docs/agent/config/config-files#max_stale I wonder if it suffers from the same problem? Their docs imply it doesn't, but perhaps they've made the same mistake?
Ooops, I'm confusing things. It doesn't matter if it was committed 3 days ago; the main question is whether the Follower is running with the latest committed entry. Sheesh, there are a lot of details here! Ignore that. That said, I think my objection -- that the "constant updates" problem you mentioned is just a variation of the hard problem -- is correct, right? I think if you're going to convince me this can be solved, I'd need a step-by-step example.
@aderouineau -- your engagement on this issue continues to be very helpful. Thanks again.
Quickly looking at their code, it does seem like they are making the same mistake.
While the snapshot is being transferred and installed, can I see the output of /status?
During the transfer and restore, most of the relevant values stay at their initial values. When the restore is done is where we get a lot more: "db_applied_index" and "fsm_index" are 6 instead of 0, "last_applied_index" is 8 instead of 0, "nodes" is no longer empty, and there is more non-zero data in "raft".
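Those fields can be watched programmatically; a small sketch (the JSON layout, in particular the "store" key, is assumed from the output described above):

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// storeStatus fetches /status and returns the "store" section, which is
// assumed (from the output described above) to contain db_applied_index,
// fsm_index and last_applied_index.
func storeStatus(addr string) (map[string]any, error) {
	resp, err := http.Get(addr + "/status")
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	var status map[string]any
	if err := json.NewDecoder(resp.Body).Decode(&status); err != nil {
		return nil, err
	}
	store, _ := status["store"].(map[string]any)
	return store, nil
}

func main() {
	store, err := storeStatus("http://localhost:4011")
	if err != nil {
		panic(err)
	}
	fmt.Println(store["db_applied_index"], store["fsm_index"], store["last_applied_index"])
}
```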
OK, well the output of /status tells the story. Only after the snapshot has been transferred and installed are all these values brought up to date. Now we could enhance the /readyz and freshness checks to take this into account.
This shows that when a snapshot is being transferred and installed, the staleness is undefined, at least by the current logic. That said, by definition, if a snapshot is being transferred and installed -- which the system can determine -- the node is definitely behind, but by how much seems impossible to know. I think one change I would be OK with is the following: if a snapshot is being transferred and installed, consider the node's data stale, regardless of LastContact.
Thoughts?
Well, there may be some subtlety here -- the Last Contact time is only 127ms in the past, so the Leader is contacting the node, but it's not sending any log entries. So I take it back: it looks like it is heartbeating, but since it's not sending any Log Entries with the heartbeat, there is no mechanism to send any "appended at" timestamps (they only come with logs). So a more precise statement would be that in the case you are hitting, there are no Leader timestamps available to compare to.
So I'm back to the simple proposal -- if a snapshot transfer and install is taking place, all None reads with a freshness bound are considered stale.
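In sketch form (illustrative names only):

```go
package freshness

import "time"

// stale is a sketch of the simple proposal above: while a snapshot is
// being transferred or installed the node's data is treated as stale,
// no matter how recently it heard from the Leader.
func stale(snapshotInProgress bool, lastContact time.Time, freshness time.Duration) bool {
	if snapshotInProgress {
		return true
	}
	return time.Since(lastContact) > freshness
}
```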
Note that this doesn't solve the case where the node comes up, is behind the leader, and needs a snapshot, but doesn't know it yet (and the transfer hasn't started yet). In this case reads may not be marked as stale. I do not know how this can be solved, nor do I know if it's even worth solving. We're starting to hit up against the fundamental nature of these systems now.
Corrected latest comment.
Is it possible for a snapshot to be transferred when a node is in-sync, or does that only happen when a node has been out of the loop for too long? If it is the latter, then I think your proposal is a good compromise.
Couldn't we use the fact that if the node's Commit Index is still zero, it is waiting for data from the Leader?
The Leader will not send a transfer unless the node tells it it needs it. So the latter.
Just because it's waiting for stuff doesn't mean any read of the node is stale. It has no way of knowing if there have been writes on the cluster or not until it starts getting information from the Leader. Does that make sense? This is more of an "are you ready" question, not an "is your data stale" question.
The node figures that out by looking at the information in the heartbeat AFAIK, but I have not looked at the code that closely.
One thing that is important to understand is that the Commit Index can be greater than zero even if no writes were ever made on the cluster. This is because cluster membership changes also go through the Raft system (this is described in the Raft paper). You may already know this. So while a Commit Index of zero does tell you the node is not yet ready, it doesn't allow you to say anything about the staleness of your SQLite data. There may be no SQLite data anywhere on the cluster the node is about to join. That's why it's more suitable to enhance the /readyz check itself.
I looked into the code further, and it's not as easy as I thought to determine when a snapshot is being streamed by the Leader to a node. I can detect when the Snapshot is being installed, but that is after it's been fully received, so it would only cover about half the transition. Given the difficulty of detecting the "snapshot being streamed" state, I may not be able to add the final check. Is it possible? Probably. Is it worth the amount of changes I would need to make to the code to deal with this case? Doesn't look like it right now.
Should we ask upstream (hashicorp/raft) to make this easier?
It's always an option to ask the maintainers upstream, but I don't plan to do it at this time -- unfortunately I don't have the time to engage with the upstream maintainers to the extent that would probably be required to get it in (if they even bothered to work on it -- they could consider all this implementation details). There are other things we could do on the rqlite side, however.
This would have to be an optional flag one would pass to /readyz.
BTW, one thing I don't understand is why the output you provided earlier showed the values it did. Perhaps you could check something for me: while the transfer and install is taking place on the read-only node, what is the output of /status on that node?
Ha! I just discovered something -- this may be a solution. Let me try something.
OK, I think #1688 might help. With this change all nodes have the latest Leader Commit Index, as sent by the latest heartbeat message from the Leader (this value only makes sense if a node is not the Leader). Given this information, /readyz can check whether the node's Applied Index has caught up with the Leader's Commit Index.
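Conceptually the new readiness condition is simple (a sketch, not the actual change in #1688):

```go
package freshness

// caughtUp is a sketch of the condition /readyz can now evaluate: the
// node has applied at least everything the Leader had committed as of
// the last heartbeat.
func caughtUp(appliedIndex, leaderCommitIndex uint64) bool {
	return appliedIndex >= leaderCommitIndex
}
```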
Deployment of your read-only node could go something like this: launch the node with -join as before, poll /readyz until it reports ready with the new check enabled, and only then route read traffic to it. See the sketch below.
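A sketch of such a deployment gate, assuming the sync flag added in #1689 and that /readyz returns a non-200 status while the node is not ready (host, port and timeout here are illustrative):

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// waitReady polls the read-only node's /readyz endpoint until it
// reports ready, then returns. Only after this should read traffic be
// routed to the node.
func waitReady(url string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		resp, err := http.Get(url)
		if err == nil {
			resp.Body.Close()
			if resp.StatusCode == http.StatusOK {
				return nil
			}
		}
		time.Sleep(time.Second)
	}
	return fmt.Errorf("node not ready after %s", timeout)
}

func main() {
	if err := waitReady("http://localhost:4011/readyz?sync", 2*time.Minute); err != nil {
		panic(err)
	}
	fmt.Println("read-only node is caught up; safe to serve reads")
}
```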
I can test that PR before merging into master |
Try #1689, see if it works for you. Use curl -G 'localhost:4011/readyz?sync' as an example call.
@aderouineau -- I think #1689 is ready now, if you want to build it. Tested it locally, and it seems to work well.
@aderouineau -- thanks again for your help with this issue. With #1689 merged, I believe I've done as much as I can do here. If you have more ideas, let me know.
Fix will be in 8.21.0.
This is awesome! I also need this feature. Before this release, I used the bash script below to monitor the log for the keywords "successfully opened database"; it works for me, but your solution is more elegant :)

# wait until the database is ready to read
snapshotChecked=0
snapReady=$(grep -F "successfully opened database" /var/log/rqlite.log)
while [[ -z $snapReady && $snapshotChecked -lt 60 ]]; do
    sleep 1
    snapReady=$(grep -F "successfully opened database" /var/log/rqlite.log)
    echo "checking database ready, attempt $snapshotChecked"
    snapshotChecked=$((snapshotChecked + 1))
done
Available in 8.21.0.
@otoolep Sorry for the late reply. I tested this with 1 voter and 1 reader, and the new check worked as expected. Once a node is initialized, what happens when a future large snapshot transfer happens? Does the new check catch that case too?
@aderouineau -- I'm not sure, I think you'd need to test to be sure.
What version are you running?
8.19.0
Are you using Docker or Kubernetes to run your system?
Neither: running rqlited directly.
Are you running a single node or a cluster?
Single voter + read-only
What did you do?
1. Started a single voter: ./rqlite-v8.19.0-linux-amd64/rqlited -node-id 1 data1
2. Loaded a large dataset with .boot in the rqlite CLI.
3. Started a read-only node: ./rqlite-v8.19.0-linux-amd64/rqlited -raft-non-voter=true -http-addr=localhost:4011 -raft-addr=localhost:4012 -join=localhost:4002 data2
4. Ran curl -G 'localhost:4011/readyz?noleader', which returned [+]node ok
5. Ran curl -G 'localhost:4011/readyz', which returned [+]node ok [+]leader ok [+]store ok
6. Ran curl -G 'localhost:4011/db/query?level=none&freshness=1s' --data-urlencode 'q=SELECT * FROM table LIMIT 5', which returned {"results":[{"error":"no such table: table"}]}
Logs of interest on the read-only node:
What did you expect to happen?
I expected /readyz to be able to consider the staleness of its Raft log, either by default or with a parameter. In this case, it should not say "node ok" until the data was available locally. I also expected level=none&freshness=1s to return an error due to staleness. It seems like freshness is based on the last time the node contacted the leader, but I think the test should also look at the Raft index.