Coinview UTXO is too bloated (Explore alternatives to dbreeze database) #2414

dangershony opened this issue Oct 8, 2018 · 13 comments

@dangershony
Contributor

dangershony commented Oct 8, 2018

After syncing the Bitcoin main chain to height 470k, the coinview DB folder is 480GB and the block store is 120GB. This does not sound right (for comparison, Bitcoin Core is about 200GB altogether); somewhere we store data inefficiently or bloat the data in the store.

This is an investigative task to:

  • Understand why coinview is so bloated
  • Find how bitcoin core is storing utxo data
@MithrilMan
Contributor

MithrilMan commented Oct 9, 2018

Analysis of the Bitcoin UTXO set: https://eprint.iacr.org/2017/1095.pdf

quoting:

2.2 The UTXO Bitcoin Core 0.15 format
One of the main changes from the last Bitcoin Core’s major release (v0.15) has
been a change of the internal representation of the chainstate in favor of a better
performance both in reading time and memory usage [6,7].
This new format uses a per-output model in contrast to the previously defined
per-transaction model, that is, every entry in the chainstate now represents
a single UTXO, instead of a collection of all the UTXOs available for a given
transaction. To achieve this, the key-value (known as outpoint-coin in the source
code) structure has been modified. Keys encode both the 32-byte transaction
hash and the index of the unspent output, preceded by the prefix “C”. Regarding
coins, each one encodes a code, that contains metadata about the block
height and whether the transaction is coinbase or not (notice that the transaction
version has been dropped), a compressed amount of bitcoins, and the output
type and script encoded in the same way as the version v0.14.
Storing unspent outputs one by one instead of aggregated in a same transaction
greatly simplifies the structure of the coin and reduces the UTXOs accessing
time. By using the previous structure when a transaction with more than one
unspent output was accessed all data needs to be decoded, and all the non used
outputs encoded and written back into the database. However, this new format
has the downside of increasing the total size of the database [7].
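
To make that layout concrete, here is a minimal C# sketch of such a per-output key, assuming the simple fixed-width form described above; the OutpointKey name is hypothetical, and real Bitcoin Core additionally encodes the output index as a compact varint and compresses the amount, which is omitted here.

using System;
using System.IO;

// Minimal sketch of a per-output chainstate key: 'C' prefix + 32-byte txid + output index.
// Illustrative only: Bitcoin Core encodes the index as a compact varint, not 4 fixed bytes,
// and serializes the coin (height, coinbase flag, compressed amount, script) as the value.
public static class OutpointKey
{
    public static byte[] Create(byte[] txId, uint vout)
    {
        if (txId == null || txId.Length != 32)
            throw new ArgumentException("txId must be a 32-byte transaction hash.", nameof(txId));

        using (var ms = new MemoryStream())
        {
            ms.WriteByte((byte)'C');                    // per-output record prefix
            ms.Write(txId, 0, txId.Length);             // transaction hash
            byte[] index = BitConverter.GetBytes(vout); // output index (varint in Core)
            ms.Write(index, 0, index.Length);
            return ms.ToArray();
        }
    }
}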

An article summarizing this and proposing an alternative solution involving a Patricia trie: https://medium.com/codechain/codechains-merkleized-utxo-set-c76c9376fd4f

@MithrilMan
Contributor

This change is quite old (one year old). It was related to the release of v0.15: https://bitcoincore.org/en/releases/0.15.0/

Here is an excerpt about the performance improvements in that version:

Performance Improvements
Version 0.15 contains a number of significant performance improvements, which make Initial Block Download, startup, transaction and block validation much faster:

The chainstate database (which is used for tracking UTXOs) has been changed from a per-transaction model to a per-output model (See PR 10195). Advantages of this model are that it:

  • avoids the CPU overhead of deserializing and serializing the unused outputs;
  • has more predictable memory usage;
  • uses simpler code;
  • is adaptable to various future cache flushing strategies.

As a result, validating the blockchain during Initial Block Download (IBD) and reindex is ~30-40% faster, uses 10-20% less memory, and flushes to disk far less frequently. The only downside is that the on-disk database is 15% larger. During the conversion from the previous format a few extra gigabytes may be used.

Earlier versions experienced a spike in memory usage while flushing UTXO updates to disk. As a result, only half of the available memory was actually used as cache, and the other half was reserved to accommodate flushing. This is no longer the case (See PR 10148), and the entirety of the available cache (see -dbcache) is now actually used as cache. This reduces the flushing frequency by a factor 2 or more.
In previous versions, signature validation for transactions has been cached when the transaction is accepted to the mempool. Version 0.15 extends this to cache the entire script validity (See PR 10192). This means that if a transaction in a block has already been accepted to the mempool, the scriptSig does not need to be re-evaluated. Empirical tests show that this results in new block validation being 40-50% faster.
LevelDB has been upgraded to version 1.20 (See PR 10544). This version contains hardware acceleration for CRC on architectures supporting SSE 4.2. As a result, synchronization and block validation are now faster.
SHA256 hashing has been optimized for architectures supporting SSE 4 (See PR 10821). SHA256 is around 50% faster on supported hardware, which results in around 5% faster IBD and block validation. In version 0.15, SHA256 hardware optimization is disabled in release builds by default, but can be enabled by using --enable-experimental-asm when building.
Refill of the keypool no longer flushes the wallet between each key which resulted in a ~20x speedup in creating a new wallet. Part of this speedup was used to increase the default keypool to 1000 keys to make recovery more robust. (See PR 10831).

@MithrilMan
Contributor

An interesting read about the Patricia trie implementation by CodeChain:
https://medium.com/codechain/merkle-trie-2fd8793731bf

@MithrilMan
Contributor

MithrilMan commented Oct 9, 2018

Nice, I found you created an issue one year ago referencing this change, hehe:
#481

@MithrilMan
Contributor

MithrilMan commented Oct 9, 2018

Not related to size but to performance: quoting an excerpt from the DBreeze documentation (https://github.com/hhblaze/DBreeze/blob/master/Documentation/_DBreeze.Documentation.actual.pdf):

Remember, that DBreeze insert and select algorithms work with maximum efficiency
in bulk operations, when keys are supplied sorted in ascending order (descending is a
bit slower). So, sort bulk chunks in memory before inserts/selects

I think we can take advantage of this during our flushes, for example in SaveChangesAsync.
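
A minimal sketch of what a sorted flush might look like against the DBreeze API (the CoinViewFlusher and ByteArrayComparer helpers are hypothetical; byte-wise lexicographic ordering of byte[] keys is an assumption to verify against DBreeze's own comparer):

using System;
using System.Collections.Generic;
using System.Linq;
using DBreeze;

// Sketch: order the pending coinview changes by key before a bulk insert,
// since DBreeze inserts fastest when keys arrive in ascending order.
public class CoinViewFlusher
{
    private readonly DBreezeEngine engine;

    public CoinViewFlusher(DBreezeEngine engine)
    {
        this.engine = engine;
    }

    public void Flush(IEnumerable<KeyValuePair<byte[], byte[]>> pendingChanges)
    {
        // Sort lexicographically (byte-wise) so DBreeze sees an ascending key stream.
        var sorted = pendingChanges.OrderBy(kv => kv.Key, new ByteArrayComparer());

        using (var tran = this.engine.GetTransaction())
        {
            foreach (var kv in sorted)
                tran.Insert("Coins", kv.Key, kv.Value);

            tran.Commit();
        }
    }
}

// Hypothetical helper: lexicographic comparer for byte[] keys.
public class ByteArrayComparer : IComparer<byte[]>
{
    public int Compare(byte[] x, byte[] y)
    {
        int len = Math.Min(x.Length, y.Length);
        for (int i = 0; i < len; i++)
        {
            int c = x[i].CompareTo(y[i]);
            if (c != 0)
                return c;
        }
        return x.Length.CompareTo(y.Length);
    }
}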

@MithrilMan
Contributor

About size, this comment could be a hint:
hhblaze/DBreeze#21 (comment)

@MithrilMan
Contributor

MithrilMan commented Oct 9, 2018

Quoting the documentation again:

We DON'T SEE SPEED DEGRADE, when we insert batch of growing up keys - any newly
inserted key is always bigger than maximal existing key (SelectForward will return newly
inserted key as the last one). For such case we should do nothing.
We CAN SEE SPEED DEGRADE, when we update batch of values or data-blocks or if we
insert a batch of keys in random order and, especially, if these keys have high entropy.
For such cases we have integrated new methods for transactions and for nested tables:
tran.Technical_SetTable_OverwriteIsNotAllowed("t1");

This answers the question about the commented-out code below (which, left commented out, is degrading performance):

//transaction.Technical_SetTable_OverwriteIsNotAllowed("Coins"); // Why it was there and what is it for? No one knows.

The documentation knows that :)
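
If the flag is in fact safe to re-enable, its use would look roughly like the sketch below (CoinsTableWriter is a hypothetical helper; how the flag interacts with the file-size growth discussed above is something to measure, not a given):

using System.Collections.Generic;
using DBreeze;

// Sketch: apply the documented flag before a batch insert into the "Coins" table.
// Per the DBreeze docs this avoids the speed degrade on random-order / high-entropy
// keys; its effect on on-disk size is an open question in this thread.
public static class CoinsTableWriter
{
    public static void InsertBatch(DBreezeEngine engine, IEnumerable<KeyValuePair<byte[], byte[]>> batch)
    {
        using (var tran = engine.GetTransaction())
        {
            tran.Technical_SetTable_OverwriteIsNotAllowed("Coins");

            foreach (var kv in batch)
                tran.Insert("Coins", kv.Key, kv.Value);

            tran.Commit();
        }
    }
}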

@MithrilMan
Contributor

Remove in DBreeze is only a "logical" operation, data stays physically inside the file, only search keys are rewritten.

This could explain the size problem.

@dangershony
Contributor Author

dangershony commented Oct 9, 2018

Remove in DBreeze is only a "logical" operation

OMG ok yeah that defo could explain the bloating

@blane-nelson

I created a simple utility to unload and reload the data in the tables to see if it would reclaim some space. While it did get some space back, the results were a bit underwhelming, as I was only able to see about a 4% reduction in the overall size of the files on disk. In my case that was about 25GB, so it provided some benefit, but I was hoping for more.

I went ahead and posted the code: https://github.com/Tuppence-io/Stratis.Utility.CompressDatabase
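
For illustration, the core of such an unload/reload pass boils down to something like the following sketch; this is not the actual utility code, and the DatabaseCompactor name, table name, and paths are placeholders:

using DBreeze;

// Sketch of the unload/reload idea: since Remove in DBreeze is only logical,
// copying all live rows into a fresh engine leaves the dead space behind.
public static class DatabaseCompactor
{
    public static void Compact(string sourcePath, string targetPath, string tableName)
    {
        using (var source = new DBreezeEngine(sourcePath))
        using (var target = new DBreezeEngine(targetPath))
        using (var readTran = source.GetTransaction())
        using (var writeTran = target.GetTransaction())
        {
            // SelectForward yields rows in ascending key order, which is also
            // the fastest insert order for DBreeze (see the excerpt above).
            foreach (var row in readTran.SelectForward<byte[], byte[]>(tableName))
                writeTran.Insert(tableName, row.Key, row.Value);

            writeTran.Commit();
        }
    }
}

The new files would then replace the old ones once the copy has been verified.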

@fshutdown added the bug label Feb 11, 2019
@dangershony
Contributor Author

Thanks @blane-nelson for the results.

We don't do many deletes during normal node operation; we do delete on reorgs, but those are quite rare, so I would not expect to see much reduction. The conclusion here is that we need to store less data or change the store.

We do have a refactor task to explore alternative databases; the results of that research will be made available in due time.

@dangershony changed the title from "Coinview UTXO is too bloated" to "Coinview UTXO is too bloated (Explore alternatives to dbreeze database)" Mar 5, 2019
@dangershony
Contributor Author

Related #2373

@dangershony
Contributor Author

Some research on moving to a different store can be done using this site:
https://db-engines.com/en/ranking/key-value+store
