Coinview UTXO is too bloated (Explore alternatives to dbreeze database) #2414

dangershony opened this issue Oct 8, 2018 · 13 comments

@dangershony
Contributor

dangershony commented Oct 8, 2018

After syncing the Bitcoin main chain to height 470k, the coinview DB folder is 480GB and the block store is 120GB. This does not sound right (for comparison, Bitcoin Core is about 200GB altogether); somewhere we store data inefficiently or bloat the data in the store.

This is an investigative task to:

  • Understand why coinview is so bloated
  • Find how bitcoin core is storing utxo data
@MithrilMan
Contributor

MithrilMan commented Oct 9, 2018

Analysis of the Bitcoin UTXO set: https://eprint.iacr.org/2017/1095.pdf

quoting:

2.2 The UTXO Bitcoin Core 0.15 format
One of the main changes from the last Bitcoin Core’s major release (v0.15) has
been a change of the internal representation of the chainstate in favor of a better
performance both in reading time and memory usage [6,7].
This new format uses a per-output model in contrast to the previously defined
per-transaction model, that is, every entry in the chainstate now represents
a single UTXO, instead of a collection of all the UTXOs available for a given
transaction. To achieve this, the key-value (known as outpoint-coin in the source
code) structure has been modified. Keys encode both the 32-byte transaction
hash and the index of the unspent output, preceded by the prefix “C”. Regarding
coins, each one encodes a code, that contains metadata about the block
height and whether the transaction is coinbase or not (notice that the transaction
version has been dropped), a compressed amount of bitcoins, and the output
type and script encoded in the same way as the version v0.14.
Storing unspent outputs one by one instead of aggregated in a same transaction
greatly simplifies the structure of the coin and reduces the UTXOs accessing
time. By using the previous structure when a transaction with more than one
unspent output was accessed all data needs to be decoded, and all the non used
outputs encoded and written back into the database. However, this new format
has the downside of increasing the total size of the database [7].
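
To make that layout concrete, here is a minimal C# sketch of such a per-output key, assuming the simple fixed-width form described above; the OutpointKey name is hypothetical, and real Bitcoin Core additionally encodes the output index as a compact varint and compresses the amount, which is omitted here.

using System;
using System.IO;

// Minimal sketch of a per-output chainstate key: 'C' prefix + 32-byte txid + output index.
// Illustrative only: Bitcoin Core encodes the index as a compact varint, not 4 fixed bytes,
// and serializes the coin (height, coinbase flag, compressed amount, script) as the value.
public static class OutpointKey
{
    public static byte[] Create(byte[] txId, uint vout)
    {
        if (txId == null || txId.Length != 32)
            throw new ArgumentException("txId must be a 32-byte transaction hash.", nameof(txId));

        using (var ms = new MemoryStream())
        {
            ms.WriteByte((byte)'C');                    // per-output record prefix
            ms.Write(txId, 0, txId.Length);             // transaction hash
            byte[] index = BitConverter.GetBytes(vout); // output index (varint in Core)
            ms.Write(index, 0, index.Length);
            return ms.ToArray();
        }
    }
}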

An article summarizing this and proposing an alternative solution involving a Patricia trie: https://medium.com/codechain/codechains-merkleized-utxo-set-c76c9376fd4f

@MithrilMan
Contributor

This change is quite old (one year old). It was related to the release of v0.15: https://bitcoincore.org/en/releases/0.15.0/

Here is an excerpt about the performance improvements in that version:

Performance Improvements
Version 0.15 contains a number of significant performance improvements, which make Initial Block Download, startup, transaction and block validation much faster:

The chainstate database (which is used for tracking UTXOs) has been changed from a per-transaction model to a per-output model (See PR 10195). Advantages of this model are that it:

  • avoids the CPU overhead of deserializing and serializing the unused outputs;
  • has more predictable memory usage;
  • uses simpler code;
  • is adaptable to various future cache flushing strategies.

As a result, validating the blockchain during Initial Block Download (IBD) and reindex is ~30-40% faster, uses 10-20% less memory, and flushes to disk far less frequently. The only downside is that the on-disk database is 15% larger. During the conversion from the previous format a few extra gigabytes may be used.

Earlier versions experienced a spike in memory usage while flushing UTXO updates to disk. As a result, only half of the available memory was actually used as cache, and the other half was reserved to accommodate flushing. This is no longer the case (See PR 10148), and the entirety of the available cache (see -dbcache) is now actually used as cache. This reduces the flushing frequency by a factor 2 or more.
In previous versions, signature validation for transactions has been cached when the transaction is accepted to the mempool. Version 0.15 extends this to cache the entire script validity (See PR 10192). This means that if a transaction in a block has already been accepted to the mempool, the scriptSig does not need to be re-evaluated. Empirical tests show that this results in new block validation being 40-50% faster.
LevelDB has been upgraded to version 1.20 (See PR 10544). This version contains hardware acceleration for CRC on architectures supporting SSE 4.2. As a result, synchronization and block validation are now faster.
SHA256 hashing has been optimized for architectures supporting SSE 4 (See PR 10821). SHA256 is around 50% faster on supported hardware, which results in around 5% faster IBD and block validation. In version 0.15, SHA256 hardware optimization is disabled in release builds by default, but can be enabled by using --enable-experimental-asm when building.
Refill of the keypool no longer flushes the wallet between each key which resulted in a ~20x speedup in creating a new wallet. Part of this speedup was used to increase the default keypool to 1000 keys to make recovery more robust. (See PR 10831).

@MithrilMan
Contributor

An interesting read about the Patricia trie implementation by CodeChain:
https://medium.com/codechain/merkle-trie-2fd8793731bf

@MithrilMan
Contributor

MithrilMan commented Oct 9, 2018

Nice, I found you created an issue one year ago referencing this change, hehe:
#481

@MithrilMan
Contributor

MithrilMan commented Oct 9, 2018

Not related to size but to performance: quoting an excerpt from the DBreeze documentation (https://github.com/hhblaze/DBreeze/blob/master/Documentation/_DBreeze.Documentation.actual.pdf):

Remember, that DBreeze insert and select algorithms work with maximum efficiency
in bulk operations, when keys are supplied sorted in ascending order (descending is a
bit slower). So, sort bulk chunks in memory before inserts/selects

I think we can take advantage of this during our flushes, for example in SaveChangesAsync.
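
A minimal sketch of what a sorted flush might look like against the DBreeze API (the CoinViewFlusher and ByteArrayComparer helpers are hypothetical; byte-wise lexicographic ordering of byte[] keys is an assumption to verify against DBreeze's own comparer):

using System;
using System.Collections.Generic;
using System.Linq;
using DBreeze;

// Sketch: order the pending coinview changes by key before a bulk insert,
// since DBreeze inserts fastest when keys arrive in ascending order.
public class CoinViewFlusher
{
    private readonly DBreezeEngine engine;

    public CoinViewFlusher(DBreezeEngine engine)
    {
        this.engine = engine;
    }

    public void Flush(IEnumerable<KeyValuePair<byte[], byte[]>> pendingChanges)
    {
        // Sort lexicographically (byte-wise) so DBreeze sees an ascending key stream.
        var sorted = pendingChanges.OrderBy(kv => kv.Key, new ByteArrayComparer());

        using (var tran = this.engine.GetTransaction())
        {
            foreach (var kv in sorted)
                tran.Insert("Coins", kv.Key, kv.Value);

            tran.Commit();
        }
    }
}

// Hypothetical helper: lexicographic comparer for byte[] keys.
public class ByteArrayComparer : IComparer<byte[]>
{
    public int Compare(byte[] x, byte[] y)
    {
        int len = Math.Min(x.Length, y.Length);
        for (int i = 0; i < len; i++)
        {
            int c = x[i].CompareTo(y[i]);
            if (c != 0)
                return c;
        }
        return x.Length.CompareTo(y.Length);
    }
}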

@MithrilMan
Contributor

About size, this comment could be a hint:
hhblaze/DBreeze#21 (comment)

@MithrilMan
Contributor

MithrilMan commented Oct 9, 2018

Quoting the documentation again:

We DON'T SEE SPEED DEGRADE, when we insert batch of growing up keys - any newly
inserted key is always bigger than maximal existing key (SelectForward will return newly
inserted key as the last one). For such case we should do nothing.
We CAN SEE SPEED DEGRADE, when we update batch of values or data-blocks or if we
insert a batch of keys in random order and, especially, if these keys have high entropy.
For such cases we have integrated new methods for transactions and for nested tables:
tran.Technical_SetTable_OverwriteIsNotAllowed("t1");

This answers the question about the commented-out code below (which, left commented out, is degrading performance):

//transaction.Technical_SetTable_OverwriteIsNotAllowed("Coins"); // Why it was there and what is it for? No one knows.

The documentation knows that :)
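
If the flag is in fact safe to re-enable, its use would look roughly like the sketch below (CoinsTableWriter is a hypothetical helper; how the flag interacts with the file-size growth discussed above is something to measure, not a given):

using System.Collections.Generic;
using DBreeze;

// Sketch: apply the documented flag before a batch insert into the "Coins" table.
// Per the DBreeze docs this avoids the speed degrade on random-order / high-entropy
// keys; its effect on on-disk size is an open question in this thread.
public static class CoinsTableWriter
{
    public static void InsertBatch(DBreezeEngine engine, IEnumerable<KeyValuePair<byte[], byte[]>> batch)
    {
        using (var tran = engine.GetTransaction())
        {
            tran.Technical_SetTable_OverwriteIsNotAllowed("Coins");

            foreach (var kv in batch)
                tran.Insert("Coins", kv.Key, kv.Value);

            tran.Commit();
        }
    }
}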

@MithrilMan
Contributor

Remove in DBreeze is only a "logical" operation, data stays physically inside the file, only search keys are rewritten.

This could explain the size problem.

@dangershony
Contributor Author

dangershony commented Oct 9, 2018

Remove in DBreeze is only a "logical" operation

OMG ok yeah that defo could explain the bloating

@blane-nelson

I created a simple utility to unload and reload the data in the tables to see if it would reclaim some space. While it did get some space back, the results were a bit underwhelming, as I was only able to see about a 4% reduction in the overall size of the files on disk. In my case that was about 25GB, so it provided some benefit, but I was hoping for more.

I went ahead and posted the code: https://github.com/Tuppence-io/Stratis.Utility.CompressDatabase
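
For illustration, the core of such an unload/reload pass boils down to something like the following sketch; this is not the actual utility code, and the DatabaseCompactor name, table name, and paths are placeholders:

using DBreeze;

// Sketch of the unload/reload idea: since Remove in DBreeze is only logical,
// copying all live rows into a fresh engine leaves the dead space behind.
public static class DatabaseCompactor
{
    public static void Compact(string sourcePath, string targetPath, string tableName)
    {
        using (var source = new DBreezeEngine(sourcePath))
        using (var target = new DBreezeEngine(targetPath))
        using (var readTran = source.GetTransaction())
        using (var writeTran = target.GetTransaction())
        {
            // SelectForward yields rows in ascending key order, which is also
            // the fastest insert order for DBreeze (see the excerpt above).
            foreach (var row in readTran.SelectForward<byte[], byte[]>(tableName))
                writeTran.Insert(tableName, row.Key, row.Value);

            writeTran.Commit();
        }
    }
}

The new files would then replace the old ones once the copy has been verified.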

@fshutdown added the bug label Feb 11, 2019
@dangershony
Contributor Author

Thanks @blane-nelson for the results.

We don't do many deletes during normal node operation; we do delete on reorgs, but those are quite rare, so I would not expect to see much reduction. The conclusion here is that we need to store less data or change the store.

We do have a refactor task to explore alternative databases; the results of that research will be made available in due time.

@dangershony changed the title from "Coinview UTXO is too bloated" to "Coinview UTXO is too bloated (Explore alternatives to dbreeze database)" Mar 5, 2019
@dangershony
Contributor Author

Related #2373

@dangershony
Contributor Author

Some research on moving to a different store can be done using this site:
https://db-engines.com/en/ranking/key-value+store
