Introduce sqlite for storing of filters for IndexStore #10272
Conversation
Would it be possible to use LiteDB instead, to at least avoid adding another native dependency?
Good question. I did not know we depend on this. However, I have personal experience with the LiteDB database and there are several issues:
To be honest, I would not put LiteDB in anything but a toy project. But I guess I'm much stricter than other people with regard to what I expect from software. Still, I'm not a fan of LiteDB for projects that must work rock solid.
What are the cons? It seems sqlite works everywhere: https://www.nuget.org/packages/Microsoft.Data.Sqlite/#supportedframeworks-body-tab
Concept ACK.
…05-IndexStore-with-sqlite

# Conflicts:
#	WalletWasabi.Fluent/Global.cs
#	WalletWasabi.Tests/UnitTests/Stores/IndexStoreTests.cs
#	WalletWasabi/Blockchain/BlockFilters/FilterProcessor.cs
#	WalletWasabi/Stores/IndexStore.cs
Code overall is nice.
I tried to break the feature but was not able to do so.
I don't like that it almost doubles the disk space filters use. Could we delete the MatureIndex once it has been pushed into the db?
🎉
This PR adds an
WalletWasabi/Stores/SqliteStorage.cs
Outdated
block_height INTEGER NOT NULL PRIMARY KEY,
block_hash BLOB NOT NULL,
filter_data BLOB NOT NULL,
previous_block_hash BLOB NOT NULL,
Why BLOB not VARCHAR(64) for hashes?
A hash in binary is 32 bytes long; VARCHAR(64) would require 64 bytes. My only motivation here was to save space.
Ok, makes sense. A counter-argument would be that with varchars it's easier to examine the database using sqlite tools. But this is something that would not be too hard to change in the future if needed.
Yeah, I was thinking about that. There are free tools like https://sqlitebrowser.org/ where you can just type the query:

```sql
SELECT hex(previous_block_hash) FROM filter
```

and you'll see the hash as a string of characters. Sure, there is some mental overhead for people (i.e. developers) doing that debugging, but my reasoning was that saving space has higher priority (for actual users) for something that keeps growing.

edit: But then I don't know off the top of my head how much actual data saving in kB/MB we are talking about.
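To make the hex() inspection point above concrete, here is a small sketch using Python's stdlib sqlite3 module. The table and column names follow the schema quoted in this review thread; everything else (the in-memory database, the dummy hash values) is purely illustrative and is not the PR's C# code.

```python
import sqlite3

# In-memory stand-in for IndexStore.sqlite; schema mirrors the review snippet.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE filter (
        block_height        INTEGER NOT NULL PRIMARY KEY,
        block_hash          BLOB NOT NULL,
        filter_data         BLOB NOT NULL,
        previous_block_hash BLOB NOT NULL
    )""")

# A hash stored as a raw BLOB occupies 32 bytes on disk...
prev_hash = bytes(range(32))
conn.execute(
    "INSERT INTO filter VALUES (?, ?, ?, ?)",
    (1, b"\x00" * 32, b"filter-bytes", prev_hash),
)

# ...yet hex() still renders it as the 64-character string a human can read.
(as_text,) = conn.execute(
    "SELECT hex(previous_block_hash) FROM filter"
).fetchone()
print(len(as_text))  # 64
```

So the BLOB column keeps the 32-byte storage cost while remaining inspectable from any sqlite tool via `hex()`.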
edit: But then I don't know off the top of my head how much actual data saving in kB/MB we are talking about.
64 bytes per block for block_hash and previous_block_hash combined. Is that all blocks, or only those after segwit activation? For all blocks it would be 781411 * 64 = 50010304 bytes, so roughly 48 MiB (about 50 MB) if my calculations are correct.
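The arithmetic above can be sanity-checked in a couple of lines. The block count is the one quoted in the comment; the 64 bytes come from two hashes per row, each 32 bytes cheaper as a BLOB than as a VARCHAR(64).

```python
# Back-of-envelope check of the quoted saving: each row stores two hashes
# (block_hash and previous_block_hash), each 32 bytes cheaper as a BLOB.
blocks = 781_411                 # block count used in the comment above
extra_per_block = 2 * (64 - 32)  # 64 extra bytes per block in total
saving = blocks * extra_per_block
print(saving)          # 50010304 bytes
print(saving / 2**20)  # roughly 47.7 MiB, i.e. about 50 MB
```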
ACK, maybe implement it shortly before the 2.0.4 release so reviewers won't have issues going back and forth, and users won't ever have duplicated filters.
@molnard and @nopara73 have not greenlighted this PR yet. It may happen that they will NACK it. Just FYI. And yes, we can then come up with a strategy to migrate users to the new format. Imho it would be great to have a conversion routine that just converts the mature index to sqlite and consequently deletes the mature index; then users would avoid re-downloading all filters, which would be great (imho not that hard to implement, so worth it imo).
Normally I'd scream bloody murder, but this time I must admit it is well argued, so cACK.

Reflections

Let me reflect on the specific points:
Assuming it is wrong, which I am not convinced of, we can easily "fix" it without a database.
Meh
?
Not as easy as opening a file :) Anyhow, if we were to change to binary to save disk space, then we wouldn't be able to inspect it at all. So with a file you can't have your cake and eat it too; with this you can at least eat part of it, which is compelling.
I don't believe this for a second
You added twice as much code as you removed here. Lines of code is usually a good proxy for complexity. Maybe that's not the case this time, as CodeScene reports improvements:
We don't need a database for that.
Small pro.
I don't see any logical implication between these two claims. The first claim seems to stand strong, but the second I don't believe; in fact I wouldn't be surprised if the current solution is more robust.
Small pro.
It does not get any better than this.
It's not such a big problem with modern SSDs; they have smart firmware that does wear leveling. When you write data to a specific logical sector, physically it's written to a different part of the SSD each time. But it is a real concern for SD cards, which don't have smart firmware.
Not sure how to interpret "wrong" here. But it is certainly not good in the sense that if such a big file is being stored and there is a power cut, then we might end up with a corrupted plaintext file.
Time-tested universal argument. Haha.
You might not actually have 2 more free GB on your computer/phone just to update the mature index (mainnet). This is not hard to imagine. Likely on a phone, storing those 2 GB might be slower than on laptops/desktops. But that's just efficiency.
Not sure why; databases typically strive for ACID properties. What database engines typically offer is "either your transaction is committed or not", so you might lose data if there is a power cut, but then you typically always lose some data when there is a power cut. Useful resources regarding what sqlite offers:
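The "either your transaction is committed or not" guarantee can be sketched in a few lines with Python's stdlib sqlite3 module (the table name is illustrative, not the PR's schema): if any statement in a transaction fails, everything done so far in that transaction is rolled back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE filter (height INTEGER PRIMARY KEY, data BLOB)")

# A transaction that fails midway leaves no partial state behind:
# SQLite rolls the whole thing back ("all or nothing").
try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("INSERT INTO filter VALUES (1, x'00')")
        conn.execute("INSERT INTO filter VALUES (1, x'11')")  # PK conflict
except sqlite3.IntegrityError:
    pass

(count,) = conn.execute("SELECT count(*) FROM filter").fetchone()
print(count)  # 0 -- the first insert was rolled back along with the second
```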
The first 4 files are packages.lock.json and those changes add 73+73+73+78 = 297 lines of code.
So if I do the math correctly, the PR adds about 100 lines of code. Plus,
Yes, we don't. We can implement what this PR does without the new dependency, but what puts me off doing that is that one then basically starts to implement a database (or database features).
My point was meant to be: databases allow the seek operation (get the last N rows). The plaintext files (as we currently have) allow us to read from the start, or read lines, skip processing them, and start processing from a certain point on. I don't believe that the current solution is more robust. I think that sqlite is more robust. Maybe we differ on the definition of robustness, though.
I still don't see a good reason to do it if it is not necessary. It feels wrong.
I don't disagree here. :)
Just a note unrelated to the PR but because the topic was brought up: should we even store the filters on mobile? I think the answer is no.
I agree and I mentioned it here.
Imho it would be nice to summarize (in an issue?) how it would work or what it would entail to achieve that. This PR (or any other that decreases the required disk space) can be seen as another step toward installing WW on more devices.
Linux does not like lots of files in a single directory either.
So that would be a next step too. Modified the OP to mention this as "Next steps".
WalletWasabi/Stores/SqliteStorage.cs
Outdated
```csharp
Connection = connection;
}

private SqliteConnection Connection { get; }
```
Usually connections are not stored in fields and properties but instantiated on demand. Internally there's a pool and an inner connection object (one more controversial thing in ADO.NET), so having a connection string cached in a field would be a better solution.
Sqlite is a bit different because it's an in-process RDBMS, so it can be kept as is.
So just to provide more context here: basically, every WW client downloads block filters to know which bitcoin blocks the user should download to get info about their coins. New users typically suffer from "downloading thousands of filters", which takes 5 minutes when one does not have Tor enabled; with Tor enabled it takes significantly longer, and it's a nuisance.
To have a long-lived sqlite connection makes sense to me because we want performance here. We do not want to spend more time than necessary, as people wait for their wallet to actually work (if you hurry to send a payment, it's very bothersome).
My second argument here is that I would love to use sqlite for another purpose -- storing RuntimeParams
(some P2P data) (https://github.com/zkSNACKs/WalletWasabi/blob/9b7c67327f9775f532d86336d2ed5d344abc168b/WalletWasabi/Helpers/RuntimeParams.cs) and as such I don't see the benefit of having short lived connections.
I can see a small con of short-lived connections: some other process can lock my sqlite data file in between, and WW will have a hard time dealing with that.
Is there any important aspect of this issue I'm missing?
Disposal of an ADO.NET connection may not dispose the underlying physical connection due to pooling turned on under the hood (SQLite isn't an exception and does it by default, see code). So, without opting out, the code will continue to work exactly like it does now, but it will follow the guidelines of ADO.NET usage and free resources when they are not used for some time (source).
```csharp
// Makes a new logical connection.
using var connection = new SqliteConnection();
// Gets an existing physical connection from the pool or opens it if it doesn't exist yet.
connection.Open();
// Dispose just returns the physical connection to the pool.
```
There are no performance penalties as long as you don't opt out of pooling (:
That design is weird, but it was made to improve the performance of applications written before pooling got its place in ADO.NET, without changing those applications.
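The open-use-dispose lifetime pattern being discussed can be sketched in Python's stdlib sqlite3. One important caveat: unlike ADO.NET, sqlite3 does no pooling, so each call here really does reopen the file; the sketch only illustrates the per-call lifetime model, not the pooled behavior described above. File, table, and function names are hypothetical.

```python
import os
import sqlite3
import tempfile

# Hypothetical file name; the PR's real database is created by
# WalletWasabi/Stores/SqliteStorage.cs, not by this sketch.
db_path = os.path.join(tempfile.mkdtemp(), "IndexStore.sqlite")

with sqlite3.connect(db_path) as conn:
    conn.execute(
        "CREATE TABLE filter (block_height INTEGER PRIMARY KEY, filter_data BLOB)"
    )
conn.close()

def add_filter(path: str, height: int, data: bytes) -> None:
    """Open a connection, write one filter, and release the connection again."""
    with sqlite3.connect(path) as conn:  # commits on success, rolls back on error
        conn.execute("INSERT INTO filter VALUES (?, ?)", (height, data))
    conn.close()  # sqlite3's context manager commits but does not close

add_filter(db_path, 1, b"\x01")
add_filter(db_path, 2, b"\x02")

with sqlite3.connect(db_path) as conn:
    (count,) = conn.execute("SELECT count(*) FROM filter").fetchone()
conn.close()
print(count)  # 2
```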
I understand what you are saying, but I don't follow what it would bring us. Having a single connection to work with for the duration of the application's life seems easy to grasp, and it models what we actually do (storing filters as we get them).
In the case of SQLite it's fine to keep it as is (it is an in-process database), but the general practice is disposal of a connection right after finishing all commands in the method. Why so? There are two reasons:

- A physical connection can be shared thanks to pooling. That's useful when more than one place accesses the same database, potentially from different threads, as long as they are not accessing it simultaneously.
- For out-of-process databases there's a limited number of connections, so a better option is to limit the physical connections made by clients. A good example is PostgreSQL, which starts an instance for each incoming connection and stops it when the connection closes. It's a very expensive operation. Moreover, the listening server allows only 100 parallel connections by default.

Sure, we won't use any out-of-process database, but you said that you are going to use SQLite in another place, so reason number one still applies, even if only partly due to the nature of SQLite.
Oh, forgot to mention a primary reason behind having a new connection each time. The server can terminate it, so with a single open connection the service is screwed. Not sure how it applies to SQLite since there's no server, but I have no idea about its internals. Therefore, if a native connection referenced by a held handle can be invalidated for whatever reason, then that concern applies here too.
Honestly, I'm not aware of SQLite internals and wouldn't rely on its implementation. That's why I highly recommend getting a new connection each time.
Oh, forgot to mention a primary reason behind having a new connection each time. The server can terminate it, so with a single open connection the service is screwed. Not sure how it applies to SQLite since there's no server, but I have no idea about its internals. Therefore, if a native connection referenced by a held handle can be invalidated for whatever reason, then that concern applies here too.
But we know that there is no server in sqlite.
A physical connection can be shared thanks to pooling. That's useful when more than one place accesses the same database, potentially from different threads, as long as they are not accessing it simultaneously.
All we have at the moment is this PR. RuntimeParams can be added to the database or not - no decision has been made yet. It could also be put in a different sqlite file (thinking about it, I'd prefer that). That is, we can have one sqlite file for filters and one for other things, and even though it sounds weird, it seems quite rational, because IndexStore.sqlite should be a file that we can ship as a static file with Wasabi Wallet, OR we should put it at (say) https://data.wasabiwallet.io/IndexStore.sqlite so that users or the software can download it to skip the initial pain of downloading the filters. That means that IndexStore.sqlite is a database that:
- has 1 writer
- has 1 or more readers in general
- the database is needed over all lifetime of the application (block can come at any minute)
- no one outside of the Wasabi Wallet app should touch the database
- the database is used as robust data storage more than anything else (i.e. we want it not to get corrupted and to be reasonably fast, but protection against corruption is more important to me)
Also, in my experience sqlite works just fine with a persistent connection, and there is really no good reason why it shouldn't (apart from some bug, but I have spent some time googling and have not found anything of that sort).
I believe you that pooling would yield mostly the same performance, but I still don't see a sufficient reason to go that way. I mean, there is only Wasabi Wallet and the database file, there are no third entities, and as I explained, the way we interact with the database is quite simple.
Now, in my testing I have not found any significant issue. So I would suggest: let's use the persistent connection for now and, as a next step, try to measure whether your approach is better. I'll be honest and say I don't believe it (I'll eat my hat if I'm wrong), but I'm happy to be proven wrong by measurements. ;-)
I'm okay with it.
I asked a question in Slack about the locks and the possibility of removing them given SQLite transactions. Perhaps you have to instantiate a new connection per method call of SqliteStorage, since no parallel commands are allowed per connection in ADO.NET and the databases I have used (except SQL Server with MARS enabled, but commands still run synchronously). Maybe it's different with SQLite, no idea about that (:
@YohDeadfall So I have checked SqliteConnection and added some info regarding thread-safety. And you are right that SqliteConnection is not supposed to be accessed from multiple threads:
Multi-thread. In this mode, SQLite can be safely used by multiple threads provided that no single database connection is used simultaneously in two or more threads.
So that's a very good point.
The point supports your approach of using multiple SqliteConnections in general. However, in this specific case, I believe we need IndexStore.IndexLock because one wants to handle bitcoin block reorgs correctly. So we use IndexLock to synchronize reads/writes of the sqlite database. To move to your idea with multiple sqlite connections, we would need to change how we use IndexLock. I'm not sure whether it can be done (maybe yes, maybe no, IDK). I'm not too keen to do it in this PR. I'm happy with this PR as is, even though it could work even better. Let's reserve that for a future PR.
Added "using multiple sqlite connections" as a possible next step to the OP.
Co-authored-by: Yoh Deadfall <yoh.deadfall@hotmail.com>
…05-IndexStore-with-sqlite
Testing

Guys, this is a PR that has big potential for good and could be a problem if it introduces a bug, so testing is important. Here is how I propose to test it (you are free to add more tests or execute a subset of these).
It works well. It didn't recover when the block_hash was corrupted, but I think the same happened before, so there is no breaking change introduced here. Also, the filter_data could be corrupted without being detected, but I think that is okay.
…IndexStore-with-sqlite
test ACK to see if Kimi can press the merge button.
I made the IndexStore.sqlite file corrupt by removing a random part of it in a hex editor. Wasabi can no longer start or quit properly; a crash report appears but it gets stuck after that. After killing the process, the same happens. Wasabi won't work anymore. The recovery process we have in master is not working.
TODO:
- Crash properly. After the crash, Wasabi should report what happened and quit properly. The restart function in the crash report should work as well.
- Recover properly. Detect the corruption, delete the file and crash. Next start, rebuild the database.
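The "detect the corruption, delete the file, rebuild" idea can be sketched with SQLite's built-in PRAGMA integrity_check. This is a hedged Python illustration of the recovery strategy, not the PR's C# code; the schema and the corruption method (overwriting bytes in the middle of the file, as in the manual test) are illustrative.

```python
import os
import sqlite3
import tempfile

# Build a small stand-in database (schema loosely based on the PR).
db_path = os.path.join(tempfile.mkdtemp(), "IndexStore.sqlite")
conn = sqlite3.connect(db_path)
conn.execute("CREATE TABLE filter (block_height INTEGER PRIMARY KEY, filter_data BLOB)")
conn.executemany(
    "INSERT INTO filter VALUES (?, ?)",
    [(height, b"\x00" * 32) for height in range(1000)],
)
conn.commit()
conn.close()

# Corrupt the file like the manual test: overwrite a chunk in the middle.
with open(db_path, "r+b") as f:
    f.seek(os.path.getsize(db_path) // 2)
    f.write(b"\xff" * 4096)

def is_intact(path: str) -> bool:
    """Run SQLite's built-in integrity check; treat any failure as corruption."""
    try:
        conn = sqlite3.connect(path)
        (result,) = conn.execute("PRAGMA integrity_check").fetchone()
        conn.close()
        return result == "ok"
    except sqlite3.DatabaseError:
        return False

intact = is_intact(db_path)
if not intact:
    os.remove(db_path)  # delete so the next start can rebuild from scratch

print(intact, os.path.exists(db_path))  # False False
```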
The migration code should be finished first as a proof of concept. If migration takes too much time, it means we have to figure out the UX for this - if that is even possible. I prefer to think about these things in advance, not after this PR is merged. For example: if migration takes 20 minutes, every user who downloads the next release will have to wait 20 minutes before they can use Wasabi, and that is not OK.
It should take a few seconds. I remember that decoding all the filters and re-encoding them in a binary format took about 2 seconds on my old machine using this code under "Convert filters to binary".
@molnard So I tested the latest commit as you did and indeed there is that weird exception. However, then I merged the latest plus So it is possible that what you saw was some artifact of the UI decoupling that has been taking place lately (or IDK). FTR, it's important to say that the recovery mechanism on
To answer the "if anything can break, it will break" argument: there are clear reasons why SQLite can break, summarized at https://www.sqlite.org/howtocorrupt.html (many are good ones, like deleting a hot journal file), and it makes no sense to me to be able to recover when a virus modifies a byte in our database. A thought experiment: if we were, then I could just create a virus with such behavior with the goal of taking down the Wasabi Coordinator by making all Wasabi clients download filters all the time. That is all to say: if the Wasabi app can break sqlite, then yes, we should try to recover; if there is an external malicious actor in place, then it's not our responsibility to support/fix that. But given that sqlite is several orders of magnitude more stable than what we have now on

Anyway, f3ae8f4 deletes the database if it is corrupted. It's just a bit more involved than expected.
For testnet:
I think that both can be heavily improved. So it should be a few seconds, as Lucas says.

Update with commit 2395e63ed4a1ef3d0467df9cfc70a3b13ff38937:

For testnet:
Still too much, but at least some progress.
tACK. 🎊
Congrats on getting this merged, huge effort, well done gentlemen!
Fixes #7324
Fixes #10235
I have been working on IndexStore and I took a bit of a radical approach by switching from the plaintext storage approach to storing in an sqlite database. It's radical because it adds a new dependency, Microsoft.Data.Sqlite, but it has some interesting pros (for example, a single binary file sidesteps platform line-ending differences, \r\n vs \n). That is all to say that having a single file to distribute and having a single checksum seems ideal in the long run.

Regarding the new dependency: yes, that's typically forbidden, but there are some justifications:
So my conclusion here is that it's as safe as it can be from our point of view. But still it adds a dependency. True.
API design notes

This PR stores filters in an IndexStore.sqlite file. There was a question whether all other data should belong to this database as well. I think it would be beneficial to have 2 sqlite database files, one for filters and one for everything else. There are multiple reasons for that:

- MatureIndex.dat is a separate file today and it's not mixed with anything else either (it cannot really be),
- a standalone IndexStorage.sqlite makes it possible for users to download it fast (say over clearnet; not everyone wants to wait to download 1 GB over Tor, and it might be pretty slow sometimes - in the meanwhile you can't use your wallet, so no payments, just waiting, and that's not good UX).

Measurements
A. Measure how much time is spent in IndexStore.AddNewFiltersAsync in total on testnet:

-> 56% improvement.

This measurement is not up-to-date. I believe it's even better now. On mainnet I expect a bigger saving. I need to re-measure if the PR is not NACKed immediately.
B. Measure with Tor off
Complete downloading of filters on testnet using clearnet (using Tor distorts numbers too much):
master
PR
Complete downloading of filters on mainnet using clearnet (using Tor distorts numbers too much):
master
PR
C. Measure max filter batch size in a single HTTP request
Note that this is a one-line change that's independent of this PR, but I find it interesting to measure too:

-> 50% improvement when switching from 10000 to 30000. This change is not in this PR.
Inspecting sqlite databases
I recommend https://sqlitebrowser.org/ to open .sqlite files.

SQL query to inspect data:
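As an illustration of the byte order involved when inspecting hashes, here is a hedged Python sketch using the well-known mainnet genesis block hash as the example value (the storage convention - raw little-endian bytes in the BLOB - is taken from the endianness note below; nothing here is the PR's actual code):

```python
# Block explorers display hashes big-endian, while the block_hash BLOB
# holds raw little-endian bytes, so an inspected hash reads "backwards".
genesis_display = "000000000019d6689c085ae165831e934ff763ae46a2a6c172b3f1b60a8ce26f"

# What the BLOB would hold: the display hex decoded and byte-reversed.
stored = bytes.fromhex(genesis_display)[::-1]

# Reversing again recovers the human-readable form, e.g. after running
# SELECT hex(block_hash) FROM filter you would byte-reverse the result.
recovered = stored[::-1].hex()
print(recovered == genesis_display)  # True
```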
Note that block_hash is stored in little-endian format, so hashes are in reverse order.

Next steps
If the PR is merged, then there are several steps worth considering:

- remove the old plaintext storage (the .dat files),
- use multiple SqliteConnections to avoid IndexLock, which serializes accesses to the sqlite database,
- store SmartHeaderChain in a database too.