replace json on disk, with leveldb/sql/...? wallet file does not scale. partial writes #4823
Would using sqlite3 be an option for this. I think it's pretty fast/efficient, though of course not a full blown sql server. I don't think installing a server is going to fly well for most users anyway but there are a lot of python apps that use sqlite3 as a backend (some even handling a lot of data like darktable). It has the advantage there are utils around for viewing/editing the db file (for advanced users doing fixes etc). The python module is widely available or often preinstalled. Anyway, just something to look at. JSON has some advantages but only when files are small enough to visually inspect/edit. |
this is done for channel_db and watchtower_db |
I looked at this again, specifically for the wallet file, to replace json. Being able to do partial writes will become necessary for large wallets, especially with lightning (and especially when we forward). The main issue is that we currently encrypt the whole wallet file (by default). This is good for protecting metadata and also for privacy reasons. It does not scale however: every time we need to make sure changes are persisted on disk, we serialize the whole wallet file as json, compress it with zlib, encrypt it with AES, and write the whole thing to disk. (Even with encryption disabled we write the whole json to disk every time.) We could simply replace json with sqlite/leveldb, but then we could no longer simply encrypt the whole wallet file. To be able to do partial writes and still have encryption, we either encrypt individual values inside the database ourselves (option 1), or encrypt transparently below the database layer (option 2).
The latter would clearly be preferred. (At the very least, there are many footguns and potential mistakes to be made with the first approach.) (Note: ElectrumSV implemented option 1 using sqlite.) To encrypt transparently below the db (option 2), we would probably need the db itself to know about the encryption and do it for us. (I guess it's not strictly needed, as you could create a virtual filesystem and do the encryption there yourself (see e.g. TrueCrypt/VeraCrypt), but that might be even more complicated.) I have found two viable ways to do option 2.
Option 2A: SQLCipher — https://github.com/sqlcipher/sqlcipher
Option 2B: RocksDB — RocksDB is a key-value store (like LevelDB). In any case, RocksDB is a C++ project, so we would then need python-bindings, see e.g. https://github.com/twmht/python-rocksdb |
Your proposed solutions are difficult because they both exchange the storage layer and add encryption in one go; it is going to be hard to get that right and have it be robust on all platforms (especially since the proposed alternatives need native code). I think "event sourcing" should be considered as an alternative. What that means is storing only the changes. This is simple, but requires more disk space. A typical big wallet pretty much only grows, though, so it shouldn't be too bad. Event sourcing is extremely inefficient when data is deleted/replaced; it is space-inefficient like a CoW filesystem where snapshots are made after each change. Let's imagine the wallet file is now
Each time "qt_window_position" is changed, we need to reencrypt all the transactions, which are way larger. In event sourcing, the final database state is implicit from a bunch of changes: initial_state:
change0:
change1:
Something like json-path can be used to identify keys in a deep structure. EDIT: or json-patch, which has an IETF spec. The change files can be encrypted individually and never change. On wallet startup, they are all decrypted and applied serially; the wallet lives unencrypted in memory, as a plain Python object, as before. In contrast to migrating to LevelDB, it doesn't require switching to a flatter DB layout. A naive implementation would write a single change file every single time the wallet changes. EDIT: There is a heavy-weight Python-based event sourcing library that even supports encryption: eventsourcing#Application-level encryption |
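To make the replay idea concrete, here is a minimal pure-Python sketch (the names and the simplified path syntax are illustrative only, not Electrum's code): the state lives in memory, and each change is a small record addressed by a JSON-pointer-style path, applied serially on startup.

```python
import copy

def apply_change(state: dict, change: dict) -> dict:
    """Apply one change record, e.g. {"path": "/config/qt_window_position", "value": [100, 200]}.
    The path is a simplified JSON-pointer: each segment names a dict key."""
    new_state = copy.deepcopy(state)
    *parents, leaf = [seg for seg in change["path"].split("/") if seg]
    node = new_state
    for seg in parents:
        node = node.setdefault(seg, {})
    node[leaf] = change["value"]
    return new_state

def replay(initial_state: dict, changes: list) -> dict:
    """The final database state is implicit: replay all changes serially."""
    state = initial_state
    for change in changes:
        state = apply_change(state, change)
    return state

initial = {"transactions": {"tx1": "..."}, "config": {"qt_window_position": [0, 0]}}
changes = [
    # Moving the window rewrites only this tiny record, not the transactions.
    {"path": "/config/qt_window_position", "value": [100, 200]},
    {"path": "/transactions/tx2", "value": "..."},
]
final = replay(initial, changes)
```

In the real scheme each change record would be individually encrypted and appended to storage; on startup they would be decrypted and replayed exactly like `replay()` above.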
I like the idea of event sourcing |
The json-patch spec is concerned with the encoding of the changes, not with how to store them. The json-patch library takes JSON objects or strings (i.e. encoded objects). So one would have to develop a framing mechanism to put them in one file. |
I can see how this would work with encryption, but one issue is what to do with the compression currently performed before encryption.
|
ElectrumSV will be dropping its encryption except for things like private key data. My research led me to believe that most encryption options were unproven and would impose too much development overhead, and likely risk. At least for now, we'll be leaving it up to the user to ensure their wallets are secured appropriately if they require privacy. |
I have implemented "append to encrypted" here: f28ebf9 |
Here is a proof of concept for storage using jsonpatch: 4f3b0bf To make it efficient, we might want to replace all lists with dicts in the wallet file. |
Here is my |
Rotonen on IRC suggested using `persistent`, potentially with ZODB. There is a module for these that can be used to provide encryption at rest. |
Fixes: after adding a payment request, if the process was killed, the payreq might get lost. In case of using the GUI, neither the callee nor the caller called wallet.save_db(). Unclear where wallet.save_db() should be called... Now each method tries to persist their changes by default, but as an optimisation, the caller can pass write_to_disk=False e.g. when calling multiple such methods and then call wallet.save_db() itself. If we had partial writes, which would either rm the need for wallet.save_db() or at least make it cheaper, this code might get simpler... related: #6435 related: #4823
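The batching pattern described in this commit message can be sketched as follows (hypothetical class and method names, not the actual Electrum code): mutating methods persist by default, and a caller making several changes opts out and saves once at the end.

```python
import json
import os
import tempfile

class WalletDB:
    def __init__(self, path: str):
        self.path = path
        self.data = {"payment_requests": {}}

    def add_payment_request(self, key: str, req: dict, *, write_to_disk: bool = True):
        """Persist by default, so a killed process cannot lose the payreq;
        callers batching several mutations pass write_to_disk=False."""
        self.data["payment_requests"][key] = req
        if write_to_disk:
            self.save_db()

    def save_db(self):
        # Write-and-rename, so a crash never leaves a half-written file behind.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(self.data, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.path)

path = os.path.join(tempfile.mkdtemp(), "wallet")
db = WalletDB(path)
# Batch several mutations, then persist once:
db.add_payment_request("r1", {"amount": 1}, write_to_disk=False)
db.add_payment_request("r2", {"amount": 2}, write_to_disk=False)
db.save_db()
with open(path) as f:
    on_disk = json.load(f)
```

With partial writes, `save_db()` would become cheap (or unnecessary) and the opt-out parameter could disappear.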
I am still looking into the background and previous proposals for this issue. Based on skimming the comments so far here and in #5999 , I would like to propose unifying the wallet format to internally-tagged JSON-per-line objects, which can be written incrementally and periodically "squashed" down to a single snapshot entry. Using @ysangkok 's example above, a wallet with an initial snapshot entry and several incremental entries would look something like:
Conversely, an unencrypted wallet might look like:
Note the above is very much a sketch of how it might work and by no means complete. More fields will almost certainly be necessary. Rather than computing the MAC over the entire entry (or the entire wallet) as we do now, the MAC would be computed over the ciphertext and other relevant fields in a defined fashion, e.g.:
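A hedged sketch of one way "the MAC computed over the ciphertext and other relevant fields in a defined fashion" could look; the field names (`type`, `seq`, `ct`, `mac`) and the choice of HMAC-SHA256 are illustrative assumptions, not part of any agreed format.

```python
import hashlib
import hmac
import json

MAC_KEY = b"\x00" * 32  # in a real design, derived from the wallet password

def make_entry(entry_type: str, seq: int, ciphertext_hex: str) -> str:
    # The MAC covers the type tag, sequence number and ciphertext, serialised
    # in a fixed order, so entries cannot be reordered, dropped or retagged
    # without detection.
    msg = json.dumps([entry_type, seq, ciphertext_hex]).encode()
    mac = hmac.new(MAC_KEY, msg, hashlib.sha256).hexdigest()
    return json.dumps({"type": entry_type, "seq": seq, "ct": ciphertext_hex, "mac": mac})

def verify_entry(line: str) -> dict:
    obj = json.loads(line)
    msg = json.dumps([obj["type"], obj["seq"], obj["ct"]]).encode()
    expected = hmac.new(MAC_KEY, msg, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(obj["mac"], expected):
        raise ValueError("bad MAC")
    return obj

line = make_entry("patch", 1, "deadbeef")
entry = verify_entry(line)
```

Each line of the wallet file would be one such self-verifying entry; squashing replaces many incremental entries with a single snapshot entry.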
This approach provides some major benefits:
|
I think you have a good point about robustness to crashes and write interruptions. |
@ln2max, I have updated and pushed the `jsonpatch` branch: https://github.com/spesmilo/electrum/commits/jsonpatch. The work has been split into four separate commits; only the last commit deals with the encryption scheme. I think we can improve it with your idea of using standalone encrypted blobs, one per update. This indeed makes it robust to crashes. I do not like the extra layer of JSON in your proposal. Basically, you are embedding encrypted JSON in another layer of JSON. This is unnecessarily complex, and it leaks some info about the wallet. I think we could use a sequence of encrypted binary blobs, without separator, each with its own MAC and length info. |
On Wed, Sep 22, 2021 at 05:33:24AM -0700, ThomasV wrote:
> ***@***.***, I have updated and pushed the `jsonpatch` branch:
> https://github.com/spesmilo/electrum/commits/jsonpatch
> The work has been split in four separate commits, only the last commit
> deals with the encryption scheme.
> I think we can improve it with your idea of using standalone encrypted
> blobs, one per update. This indeed makes it robust to crashes.
Thanks. I will take a look.
> I do not like the extra layer of JSON in your proposal. Basically, you
> are embedding encrypted JSON in another layer of JSON. This is
> unnecessarily complex, and it leaks some info about the wallet.
I see. Yes, this adds theoretical complexity (parsing JSON twice) but, as
far as I can tell, it reduces actual complexity (i.e. reduces the
Shannon information content of the overall design) by allowing us to use
the same underlying (and off the shelf/battle-tested) parser in all
cases.
JSON is admittedly not the world's best data serialization format.
However, in this case, JSON as an outer format provides robustness (it's
immediately clear to the parser if the JSON object is corrupted),
human-readability, security (JSON parsers are well-reviewed) and
(perhaps most importantly) modularity/expandability.
One of the original considerations was to provide an extensible header
format for encrypted wallets, so that new fields/descriptors/metadata
can be added later while ensuring backwards compatibility.
The key-value approach of a JSON object is a natural way to do this
while avoiding having to build a custom parser with all its attendant
downsides (security holes, etc).
On the other hand, a binary-only format with MAC and length headers
brings us back to the original question without resolving it.
> I think we could use a sequence of encrypted binary blobs, without
> separator, each with its own MAC and length info.
Someone who has access to the file but does not have the password would
not be able to see the separation between blobs.
In order to hide information about the wallet in this scheme, it would
be necessary to encrypt the MAC and length info of each incremental
update.
This scheme would strongly impact crash-resistance. If the length
information must be encrypted, and the length information must be known
for a block in order to locate the end of that block in the sea of
binary data, then the length information must be stored encrypted in the
previous (!) block.
As a result, we are now vulnerable to data corruption in either previous
or current block, and we also have to re-write the previous block when
appending to the wallet. (in order to encode MAC and length for the
subsequent block, into the previous block)
If the goal is simply to make it difficult for an adversary to determine
how active a wallet is (based on number-of-appends) then it may be
easier to fix the maximum length of a wallet before "garbage collection"
takes place.
In other words, once a wallet has reached N incremental updates, the
application would compact the incremental updates down to the "snapshot"
(initial block), where N is chosen so that the number is regularly
reached for an average wallet user.
In this case the adversary would be able to see that the wallet has up
to N-1 appended incremental updates, but because this would be an
extremely common state for a wallet to have, the amount of information
the adversary could derive from it would be small.
"This wallet has 3 appends and a total file size of X KB" is much less
significant than "this wallet of X KB has 33 appends", if the
probability of finding a wallet with 3 appends is much higher than
finding a wallet with 33 appends.
If we restrict most wallets to a maximum length of 5 appends, where most
users will exceed this in normal usage, then the adversary will
generally only encounter wallets with a number of appends in a highly
probable (and therefore probably not useful) range.
Since Electrum usage is subject to "long tail" effects and there will be
high-volume users for whom the default value of N is too small and
impacts performance, we can also increase this number for those users.
In this case, an adversary who gains access to the encrypted wallet
would see that it has an unusually large amount of activity. But would
this really reduce privacy? A highly active wallet is very likely to
have an unusually large overall filesize, and the overall filesize
cannot be hidden.
Therefore, since we can't hide the unusually active nature of the wallet
anyway, increasing N for heavy users will probably not reduce privacy by
very much.
|
No, the length of a cipher block does not need to be in the previous block; it is possible to put it at the beginning of each block. But actually, we might not need it at all (see below). What I am concerned about is not only a lack of privacy, but also the possibility that an adversary who has access to the encrypted wallet file could remove part of it, and the result would still be seen as valid. For example, they could delete a channel update, which would cause the wallet to use a revoked state. Note that this attack is also possible with the encrypted blobs method I proposed above. In order to avoid that, @SomberNight suggested that we use a single MAC at the beginning of the file, instead of one per encrypted blob. To illustrate this, assume that the file contains
The new proposal would look like this:
The write method would first append the new cipher, then update the MAC. The new MAC is obtained by hashing the previous MAC and the new ciphertext. When decrypting, one would compare the MAC at the beginning of the file to the MAC obtained at the end of each cipher block, and stop when they match. Note that each cipher block has its own PKCS7 padding, so its length is a multiple of 16. The length of each cipher block does not need to be included in the block, if we add a separator at the end of the encrypted text (for example \n). We could also use no separator at all and recompute the hash every 16 bytes. |
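A sketch of the append-plus-running-MAC scheme just described, with plain SHA-256 for the chain and encryption/padding omitted (as discussed further down in the thread, a keyed MAC would be needed so an attacker cannot recompute the header): the file is `running_mac || cipher_blocks`, appending updates only the fixed-position MAC, and the reader stops once the recomputed chain matches the header.

```python
import hashlib

MAC_LEN = 32
GENESIS = hashlib.sha256(b"").digest()  # running MAC of an empty file

def new_file() -> bytes:
    return GENESIS

def append_block(blob: bytes, cipher_block: bytes) -> bytes:
    """blob = running_mac || cipher_blocks.  Append the new ciphertext, then
    update only the fixed-position MAC: new_mac = H(old_mac || ciphertext)."""
    old_mac, body = blob[:MAC_LEN], blob[MAC_LEN:]
    new_mac = hashlib.sha256(old_mac + cipher_block).digest()
    return new_mac + body + cipher_block

def read_blocks(blob: bytes, block_len: int) -> list:
    """Recompute the chain block by block; stop when it matches the header
    MAC.  Trailing bytes past the match (e.g. a torn write) are ignored."""
    header_mac, body = blob[:MAC_LEN], blob[MAC_LEN:]
    mac, blocks, offset = GENESIS, [], 0
    while mac != header_mac:
        if offset >= len(body):
            raise ValueError("file truncated or corrupted")
        block = body[offset:offset + block_len]
        mac = hashlib.sha256(mac + block).digest()
        blocks.append(block)
        offset += block_len
    return blocks

blob = new_file()
blob = append_block(blob, b"A" * 16)
blob = append_block(blob, b"B" * 16)
assert read_blocks(blob, 16) == [b"A" * 16, b"B" * 16]
```

Fixed-length blocks stand in here for the "length at the beginning of each block" or separator-based framing discussed above; the chaining logic is the same either way.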
Indeed, but a header should remain a header (magic bytes at the beginning of the file). |
We need to be mindful that the attacker might delete some updates from the end and then try to recalculate and update the MAC. |
On Thu, Sep 23, 2021 at 06:23:20AM -0700, ghost43 wrote:
> > The write method would first append the new cipher, then update the MAC. The new MAC is obtained by hashing the previous MAC and the new ciphertext.
> We need to be mindful that the attacker might delete some updates from the end and then try to recalculate and update the MAC.
> Note that if the MAC is cleartext and is calculated over the ciphertexts, then the attacker can recalculate and update it.
> So maybe the MAC should be over the plaintexts.
I certainly hope not... as far as I know, HMAC is calculated using a key
so that an attacker *cannot* recalculate a plaintext MAC. The idea being
that Bob can verify the HMAC before decrypting the ciphertext, in order
to ensure the ciphertext has not been modified to e.g exploit the
decryption process.
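The point being made, sketched with Python's hmac module (the key here is illustrative; a real design derives it from the wallet password): without the key, an attacker cannot produce a valid tag for a modified or truncated ciphertext, so the tag can be verified before any decryption happens.

```python
import hashlib
import hmac

key = b"derived-from-wallet-password"  # illustrative placeholder
ciphertext = b"block1" + b"block2"

# Tag computed over the ciphertext with a secret key (encrypt-then-MAC).
tag = hmac.new(key, ciphertext, hashlib.sha256).digest()

# Bob verifies the tag before decrypting anything:
ok = hmac.compare_digest(tag, hmac.new(key, ciphertext, hashlib.sha256).digest())

# An attacker who deletes the last block cannot forge a matching tag
# without the key:
truncated = b"block1"
forgery_detected = not hmac.compare_digest(
    tag, hmac.new(key, truncated, hashlib.sha256).digest())
```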
|
On Thu, Sep 23, 2021 at 01:01:32AM -0700, ThomasV wrote:
> What I am concerned about is not only a lack of privacy, but the
> possibility that an adversary who has access to the encrypted wallet
> file could remove part of it, and the result would still be seen as
> valid. For example, they could delete a channel update, which would
> cause the wallet to use a revoked state.
Yes, that makes sense. Unfortunately, there's no easy way to distinguish
between "the system crashed while writing the wallet" and "an attacker
removed the last wallet entry". Therefore, any solution which is robust
against write errors is not going to be robust against deliberate edits
by an attacker.
The best we could probably do is, if the wallet-MAC fails or does not match,
attempt to recover as much of the wallet as possible.
In order to retain the security function of the MAC (protecting against
malicious ciphertexts) we would need to have both a wallet-MAC and a
blob-MAC, where we verify the blob-MAC for each blob when attempting to
recover.
> The write method would first append the new cipher, then update the
> MAC. When decrypting, one would compare the MAC at the beginning of the
> file to the MAC obtained at the end of each cipher block, and stop
> when they match.
What is the computational overhead associated with computing the MAC
many times over, is it going to slow down wallet reads significantly?
|
On Thu, Sep 23, 2021 at 01:26:21AM -0700, ThomasV wrote:
> > One of the original considerations was to provide an extensible header
> > format for encrypted wallets, so that new fields/descriptors/metadata
> > can be added later while ensuring backwards compatibility.
> Indeed, but a header should remain a header (magic bytes at the beginning of the file).
> I would not call "header" the operation of embedding every cipherblock in JSON...
*shrugs*
Magic bytes at the beginning of the file are not extensible or modular,
at some point you run out of allocated space in the array.
I don't understand the downside to JSON. As noted above, the performance
impact should be minimal despite the verbosity, due to the availability
of highly optimized parsers.
|
You are right, using an HMAC addresses my concern.
No, reads would be done purely in memory in any of the json-based schemes. Reads from the file are ~only done when opening/loading a wallet file. That is, the whole wallet file is loaded into memory; writes update both the in-memory representation and the file on disk; reads only touch the in-memory representation. (This is btw how it works already; the goal is to optimise the writes updating the file on disk so that they don't have to rewrite the whole file.) |
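The split described here is a classic write-through design; a minimal sketch with illustrative names (the full-file rewrite in `put()` is exactly the step the thread wants to replace with an append):

```python
import json
import os
import tempfile

class InMemoryDB:
    """The whole file is loaded once at startup; reads never touch the
    disk again.  Writes update the in-memory dict and persist."""

    def __init__(self, path: str):
        self.path = path
        if os.path.exists(path):
            with open(path) as f:
                self.data = json.load(f)
        else:
            self.data = {}

    def get(self, key):
        # Read path: memory only.
        return self.data.get(key)

    def put(self, key, value):
        # Write path: memory + disk.  Today this rewrites the whole file;
        # the proposals above would append an encrypted change instead.
        self.data[key] = value
        with open(self.path, "w") as f:
            json.dump(self.data, f)

path = os.path.join(tempfile.mkdtemp(), "db.json")
db = InMemoryDB(path)
db.put("k", 1)
db2 = InMemoryDB(path)  # reloading simulates reopening the wallet
```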
On Fri, Sep 24, 2021 at 05:41:33AM -0700, ghost43 wrote:
> > What is the computational overhead associated with computing the MAC
> > many times over, is it going to slow down wallet reads significantly?
> No, reads would be done purely in memory in any of the json-based schemes. Reads from the file are ~only done when opening/loading a wallet file.
Ah, sorry, I meant overhead with respect to computing HMAC
`length_of_file_in_bytes / 16` times, every time the file is read. I
don't know how fast Python's HMAC implementation is.
> Writes would be append-only to the file (apart from maybe updating a fixed position global MAC)
How good is Python's random-access write, i.e altering just the fixed
position global MAC?
As far as I know, it would be necessary to use the "w" file mode, and
thereby re-write the whole file every time we update the fixed position
global MAC. That would eliminate ~all the benefits of the
append-oriented format.
Come to think of it, if in-memory operations are not a bottleneck here,
then the only bottleneck which justifies an append-based format would be
if the "a" (append-only) file open mode offers better performance than
"w".
Has anyone benchmarked this to confirm it's the case?
|
Well, every time the file is read (i.e. the wallet gets opened), we decrypt it. I imagine decrypting it and hashing it have comparable performance. I would not be concerned about hashing the whole file once, when opening it, taking too long. In any case, we have been doing this for years (the current encryption scheme has an HMAC over the whole file already).
We have similar logic that updates short chunks of a raw file at fixed positions for the block header storage. It performs well. (See electrum/blockchain.py, lines 437 to 444, at b431d8e.)
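On the open-mode question raised above: Python does not need mode "w" (which truncates) for this; mode "r+b" opens the file for in-place update, and seek() plus write() patches a short chunk at a fixed offset without rewriting the rest. A small sketch (the 32-byte-header layout is illustrative):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "wallet")

# Layout: a 32-byte MAC header followed by the payload.
with open(path, "wb") as f:
    f.write(b"\x00" * 32 + b"payload")

# Patch only the header in place; the payload bytes are never rewritten.
with open(path, "r+b") as f:  # "r+b": read/write without truncation
    f.seek(0)
    f.write(b"\x11" * 32)
    f.flush()
    os.fsync(f.fileno())

with open(path, "rb") as f:
    data = f.read()
```

Appending would likewise use "ab" (or seek to the end in "r+b"), so neither the append nor the header update requires rewriting the file.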
I have not benchmarked the different file open modes;
Still, I believe, in principle, partial writes (e.g. appending updates to the end) should help with each of these steps. |
On Sat, Sep 25, 2021 at 01:24:34AM +0000, ghost43 wrote:
> I have not benchmarked the different file open modes;
> but I have some stats for [writing](https://github.com/spesmilo/electrum/blob/b431d8e9b8f75b99a43540ed6672772abfb68834/electrum/wallet_db.py#L1395) a large wallet file (~17 MB) to disk on master:
> - [`JsonDB.dump`](https://github.com/spesmilo/electrum/blob/b431d8e9b8f75b99a43540ed6672772abfb68834/electrum/json_db.py#L175) 0.4299 sec — it takes a lot of time to json-serialise the whole wallet file for every write.
> - [`zlib.compress`](https://github.com/spesmilo/electrum/blob/b431d8e9b8f75b99a43540ed6672772abfb68834/electrum/storage.py#L179) 0.4546 sec
> As you can see, the actual filesystem write is the least of our concerns.
One idea: changing from JSON to CBOR would provide about a 4.5x
performance increase at the cost of human readability, according to the
benchmarks at [the CBOR repo](https://github.com/whyrusleeping/cbor)
A very simple fix for this whole issue would therefore be to
(optionally, primarily for large wallets) encode the wallet file as
uncompressed CBOR instead of compressed JSON.
|
json on disk does not scale to large files, as every time we need to flush, the whole file needs to be rewritten.
This is a problem for: