Auto-restore occurring even with existing data #1561
Thanks for the report -- it may be a documentation bug, I'll need to think about the right behaviour here. I remember discussing this feature at the time I implemented it and making a specific choice, for a specific reason. I just can't remember why it works the way it does in the code. :-)
Ack, thanks @otoolep. I'd be interested in the use case for this behavior if it's intentional. It eludes me :)
Agreed, it does seem like a bug. However, the docs are not right either, as this feature is meant to work with clusters. There should be nothing stopping a 3-node cluster from doing an auto-restore (the code as it exists today specifically handles clustered systems correctly). But the question is whether this flag should be ignored if at least one write has been sent to the system at any point in the past. I think the answer is yes, but I need to think about it.
Did some digging, and connected with the person who requested this feature. This looks like a bug. I'll fix it within the next day or so and release 8.15. The docs will need updating too. I do not consider this a breaking change; it's a bug fix.
The docs will need fixing too. |
So the intended behavior is different from both the current code and the docs? How should it work? Just wondering how this will fit in the context of the Helm chart (where I am working on adding direct support for auto backup/restore).
The original implementation should have worked like this:
However, as part of fixing this bug, I will look into modifying these steps so the download only happens at stage 3, if it's actually needed. The goal of downloading at start-up is to make installation robust -- we want to have the data installed very quickly after the system becomes the Leader -- but that is pretty wasteful of network resources, since installation is a once-off.
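For illustration, here's a minimal Go sketch of that deferred-download shape. `Store`, `Downloader`, and every method name here are hypothetical stand-ins, not rqlite's actual internals:

```go
package restore

import "context"

// Store and Downloader are hypothetical stand-ins for rqlite's internal
// store and S3 download client; none of these names are rqlite's real API.
type Store interface {
	LeaderCh() <-chan struct{} // signalled when this node becomes Leader
	HasAppliedWrites() bool    // true if any write has ever been applied
	Install(data []byte) error // replace the SQLite file with the download
}

type Downloader interface {
	Download(ctx context.Context) ([]byte, error)
}

// autoRestore waits for leadership, then downloads the backup only if the
// system has never taken a write, so the flag is safe to leave set.
func autoRestore(ctx context.Context, s Store, d Downloader) error {
	<-s.LeaderCh() // defer the network cost until it may actually be needed

	if s.HasAppliedWrites() {
		return nil // existing data: never stomp on it
	}

	data, err := d.Download(ctx)
	if err != nil {
		return err
	}
	return s.Install(data)
}
```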
Yeah, deferring the download until it's known to be necessary makes sense to me, especially since, as you say, the data size could be large and elections usually complete fairly quickly. So with this change, it sounds like it should be feasible to always run with auto-restore enabled as part of normal operation. Will the nodes be returning 200 on /readyz before the backup is restored, or is the download+restore considered in-band of the bootstrap process, with the node not reporting ready at that point? And if yes to the former, will rqlite reject queries with HTTP 503 until the restore completes? Just wondering what the client experience would be with the Helm chart at this time: whether they'll get connection timeouts (because no pods in the rqlite cluster are passing readiness probes yet), or they'll get HTTP errors, or (worse) queries/executes will actually succeed before the restore completes.
Yes, the goal is that the flag is safe to leave set, knowing it will be ignored if it would stomp on existing data (which is the problem today).
Auto-restore is performed using the normal "write" operations, so the system must be ready before it can proceed. I think, therefore, it will report "ready" before the restore completes. Writing data to rqlite while a restore is in progress currently results in undefined behavior. I can probably change that if it makes sense, i.e. redefine "ready".
Which means that clients waiting for the cluster to come back up, retrying their transactions, could actually break the restoration process -- if, e.g., the restore tries to insert into a table in a way that now violates a (say) unique constraint because the client slipped something in first. Yeah, I'd revise "ready" to not pass until restoration completes. This will insulate rqlite from client access (at least for K8s or other deployments that use a reverse proxy that checks /readyz) until the data is restored.
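Assuming "ready" is redefined that way, a client (or a Helm hook) could simply gate writes on readiness. A minimal Go sketch, with the base URL and polling cadence as placeholder choices:

```go
package client

import (
	"fmt"
	"net/http"
	"time"
)

// waitReady polls rqlite's /readyz endpoint until it returns 200 OK, so a
// client never issues writes while a restore may still be in flight. The
// polling interval and URL handling are illustrative.
func waitReady(baseURL string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		resp, err := http.Get(baseURL + "/readyz")
		if err == nil {
			resp.Body.Close()
			if resp.StatusCode == http.StatusOK {
				return nil
			}
		}
		time.Sleep(500 * time.Millisecond)
	}
	return fmt.Errorf("rqlite not ready within %s", timeout)
}
```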
You'd probably be interested in knowing that that couldn't happen. That's because the restore just stomps on whatever SQLite file it finds there, completely replacing it with the file it downloaded from S3. But yes, the experience is not good either way, and enhancing the definition of "ready" makes sense.
Ah, that's an interesting tidbit, yeah. I'd misunderstood what you meant when you said restore uses normal write operations. |
By "normal" I mean it sends the entire SQLite database (compressed) through the Raft log (as a single log entry), so it gets distributed to every node in the cluster, but when applied to the database it doesn't "execute" any SQL commands, it just decompresses the data and writes it over any existing SQLite file. |
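For readers following along, here's a rough Go sketch of those stomp semantics; the function and file handling are illustrative, not rqlite's actual code:

```go
package restore

import (
	"bytes"
	"compress/gzip"
	"io"
	"os"
)

// applyRestore sketches the "stomp" behavior described above: decompress
// the payload carried in the Raft log entry and replace the on-disk SQLite
// file wholesale. No SQL executes, so in-flight client writes can't cause
// constraint violations during the restore.
func applyRestore(compressed []byte, dbPath string) error {
	gz, err := gzip.NewReader(bytes.NewReader(compressed))
	if err != nil {
		return err
	}
	defer gz.Close()

	tmp := dbPath + ".restore"
	f, err := os.Create(tmp)
	if err != nil {
		return err
	}
	if _, err := io.Copy(f, gz); err != nil {
		f.Close()
		return err
	}
	if err := f.Close(); err != nil {
		return err
	}
	// Rename atomically so readers never observe a half-written database.
	return os.Rename(tmp, dbPath)
}
```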
Only loosely related, but I'm noticing rqlite always uploads on startup even if the data hasn't changed since the last upload, despite the docs saying:
So there's another optimization opportunity here, though it may be tricky. It'd be easy enough to update StorageClient (and the S3 implementation of that interface) to include a function GetObjectTimestamp() or some such to get the mtime from S3, but I guess the challenging part is getting the mtime from SQLite to have something to compare it with. If that were possible -- i.e. SQLite could tell us when the last mutation occurred -- we could skip the upload on startup, and skip the full database dump + compress to compute the hash every interval.
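A sketch of what that extension might look like; the StorageClient shape and GetObjectTimestamp are assumptions matching the proposal above, not rqlite's real interface:

```go
package backup

import (
	"context"
	"time"
)

// StorageClient here is a stand-in for rqlite's storage abstraction;
// GetObjectTimestamp is the hypothetical addition proposed above, returning
// the backup object's last-modified time from S3.
type StorageClient interface {
	Upload(ctx context.Context, data []byte) error
	GetObjectTimestamp(ctx context.Context) (time.Time, error)
}

// shouldUpload skips the start-up upload when the remote backup is already
// at least as new as the last local mutation. Obtaining lastMutation from
// SQLite is the hard part noted above, so it's simply assumed here.
func shouldUpload(ctx context.Context, sc StorageClient, lastMutation time.Time) bool {
	remote, err := sc.GetObjectTimestamp(ctx)
	if err != nil {
		return true // can't tell; fall back to uploading
	}
	return lastMutation.After(remote)
}
```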
Good point, but that's because it stores the hash of the upload in RAM, which is lost, of course, on restart. I could store it on disk perhaps, best effort. Hadn't occurred to me! https://github.com/rqlite/rqlite/blob/master/auto/backup/uploader.go#L170 |
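A minimal sketch of that best-effort persistence; the file name, format, and function names are made up for illustration:

```go
package backup

import (
	"crypto/sha256"
	"encoding/hex"
	"os"
)

// storeLastHash persists the last-uploaded hash best-effort; losing it only
// costs one redundant upload after a restart, so write errors are ignored.
func storeLastHash(path string, data []byte) {
	sum := sha256.Sum256(data)
	_ = os.WriteFile(path, []byte(hex.EncodeToString(sum[:])), 0600)
}

// loadLastHash returns the persisted hash, if any, at start-up.
func loadLastHash(path string) (string, bool) {
	b, err := os.ReadFile(path)
	if err != nil {
		return "", false
	}
	return string(b), true
}
```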
The main drawback with that is that the next restart will probably produce a different leader, which won't have the stored hash. The most robust option would be to lean on the storage provider to hold that state.
True, that's probably why I didn't bother.
It also introduces a dependency on an external system, which I'm not too keen on. I'm not ruling it out; it could be a best-effort thing.
The dependency already exists for cloud storage though. Barring that, is it possible that this could be rqlite state shared with the other nodes over the Raft log? Maybe that's getting more complicated than it's worth.
Yeah, you're right, it's there already; it's just a little bit complicated. Gotta pull down the sum before the first upload, compare, etc. TBH, I built it before I thought rqlite would support very large data sets, and the price of a single upload at the start wasn't going to be huge (and you get simpler code in return). But with 8.x explicitly about larger data sets, it might be time to add it. A, say, 2GB upload at the start for no good reason is definitely wasteful.
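A rough sketch of that provider-held-sum approach; SumClient and GetObjectSum are hypothetical (e.g. backed by S3 object metadata or an ETag), not part of rqlite's real StorageClient:

```go
package backup

import (
	"context"
	"crypto/sha256"
	"encoding/hex"
)

// SumClient sketches the "lean on the storage provider" option: the sum of
// the last upload lives next to the backup itself in cloud storage.
type SumClient interface {
	GetObjectSum(ctx context.Context) (string, error)
	Upload(ctx context.Context, data []byte) error
}

// uploadIfChanged pulls down the stored sum before the first upload after a
// restart and skips the upload when the snapshot is unchanged.
func uploadIfChanged(ctx context.Context, c SumClient, snapshot []byte) error {
	local := sha256.Sum256(snapshot)
	remote, err := c.GetObjectSum(ctx)
	if err == nil && remote == hex.EncodeToString(local[:]) {
		return nil // remote already holds exactly this data
	}
	return c.Upload(ctx, snapshot)
}
```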
What version are you running?
8.14.1
Are you using Docker or Kubernetes to run your system?
K8s
Are you running a single node or a cluster?
Single node
What did you do?
Configured both auto-backup (with an aggressive 15-second interval for testing purposes) and auto-restore. I observed that when restarting a pod with the underlying storage intact between restarts, rqlite auto-restores from the cloud backup:
Here it detects preexisting node state but auto-restores anyway, despite the fact that the docs say:
I confirmed this by inserting a row into a table, restarting the pod, and then selecting from that table: the row I had inserted was missing.
rqlite logs attached:
rqlite-0.log
What did you expect to happen?
Not to auto-restore from cloud backup given that existing data is present.
Please include the Status, Nodes, and Expvar output from each node (or at least the Leader!)