[Refactor] Refactor of Clarity storage (iteration 1) #4437
base: next
Conversation
Codecov Report

Attention: Patch coverage is

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##              next    #4437      +/-  ##
==========================================
- Coverage    83.46%   83.40%    -0.06%
==========================================
  Files          448      455        +7
  Lines       324321   326383     +2062
==========================================
+ Hits        270698   272233     +1535
- Misses       53623    54150      +527
```

... and 23 files with indirect coverage changes. Continue to review the full report in Codecov by Sentry.
Force-pushed from 54be976 to 9b5f9cd
```diff
@@ -2865,65 +2865,6 @@ mod test {

     let contract_id = QualifiedContractIdentifier::local("docs-test").unwrap();

     {
```
This code was causing problems with contract existence/non-existence and it seems to be doing the same thing as the "more modern" code below - can anybody confirm that this removal is OK?
Note that the comparisons between the results from this code and the code below have also been removed.
Force-pushed from 9b5f9cd to 5a68254
I'm looking at the
EDIT: Have you considered

Great question! When it comes to
Meaning that if there were any ordering changes, then a migration would need to be written.

Yeah, I had a look at
EDIT: There is actually a discussion here regarding
EDIT: Just wanted to toss in that the actual serialization implementation is only a few lines of code, so it doesn't really matter which encoding we end up with; as long as it has a serde-like interface, it's trivial to swap out prior to merging -- the bigger work is done, i.e. getting binary serialization+storage in place 😄
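For illustration, a minimal sketch of what a `speedy` round trip looks like; the `ContractMeta` type and its fields are invented for this example (the PR derives these traits on `ContractContext` and `ContractAnalysis` instead):

```rust
use speedy::{Readable, Writable};

// Illustrative type only, not taken from the PR.
#[derive(Debug, PartialEq, Readable, Writable)]
struct ContractMeta {
    name: String,
    data_size: u64,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let meta = ContractMeta { name: "docs-test".into(), data_size: 42 };

    // Serialize to a compact, non-self-describing byte buffer.
    let bytes = meta.write_to_vec()?;

    // Read it back; field order is implicit in the encoding, which is
    // why reordering struct fields would require a data migration.
    let decoded = ContractMeta::read_from_buffer(&bytes)?;
    assert_eq!(meta, decoded);
    Ok(())
}
```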
There are some nice benchmarks of Rust serialization frameworks here. Both are similar in performance, and either one would be much faster than JSON, so the choice should depend on other aspects.

JSON is "self-describing" in that the field names, the types (to some extent), and the overall structure are encoded into the output data. The extra metadata gives the application information on what the data is and what to do with it, so even if the application's internal structures change, it can use that information to ignore unrecognized fields, detect that a field is missing and use a default value, or fail with a descriptive error.

On the other hand, many of these fast binary formats get their speed by omitting the extra metadata. Instead, they assume that the application knows exactly how the data should be structured. This limits the validation you can do, and the portability between languages, CPU architectures, and even different versions of the application. Sometimes the trade-off is worth it, depending on the use-case. What I'm specifically concerned about here is:

There are also some fast binary serialization formats that are portable and support versioning, like Cap’n Proto or FlatBuffers, but they work by generating code from a custom schema language, so there's significant added complexity (much more work than adding a `derive`).
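To make the self-describing point concrete, a small `serde_json` sketch (the `ContractMeta` struct is invented for this example) showing how JSON's embedded field names let an application tolerate added or missing fields:

```rust
use serde::Deserialize;

// Invented example type; not from the PR.
#[derive(Debug, Deserialize)]
struct ContractMeta {
    name: String,
    #[serde(default)] // tolerate data written before this field existed
    data_size: u64,
}

fn main() {
    // "extra" might have been written by a newer version of the
    // application; serde_json skips it because field names are encoded
    // in the data. A positional binary format would instead misread the
    // bytes or fail outright.
    let json = r#"{ "name": "docs-test", "extra": true }"#;
    let meta: ContractMeta = serde_json::from_str(json).unwrap();
    assert_eq!(meta.data_size, 0); // missing field fell back to its default
    println!("{meta:?}");
}
```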
Force-pushed from 5a68254 to 8b71be8

Force-pushed from e185650 to ed6615e
❗❗❗ This PR draft is currently being wrapped up and is NOT READY FOR CONSIDERATION ❗❗❗

Remaining Work:
👋 Introduction

This work was born from a combination of the A/B-tester I have recently been working on and the open umbrella issue #4316. When testing the A/B-tester it became clear that replaying the chain up to more recent blocks in-between each bugfix takes far too much time, so I started investigating ways to optimize specifically the Clarity VM, with a focus on the `oracle-v1` contracts. Needless to say, there are a number of optimizations which can be done. While there have been other initiatives looking for low-hanging fruit, I decided to try tackling some deeper optimizations.

The primary goal of this PR is to lay some groundwork for refactoring & optimizing the persistence layer of the Stacks node, with a focus on the Clarity database. I decided to start with two rather large data structures which are, today, serialized using `serde_json` and written to/read from SQLite as plain-text JSON. These structures are read and deserialized a number of times during contract execution.

As seems to be usual for my core refactoring attempts, this started out as a "small and isolated" change which quickly cascaded onto a much larger surface area. The PR in itself is quite large, but I hope that the changes are clear - I did make an effort on documentation :)
⛓ Consensus
While this PR does potentially affect Clarity consensus-critical storage, no behavior should have changed (i.e. consensus keys, serialization, etc.). Therefore this PR should be seen as non-consensus-critical -- however, an extra eye on the consensus-critical code paths is warranted, just in case I missed something that the tests aren't catching.
✅️ Issues Resolved in this PR

- `ContractContext`, source code & `ContractAnalysis` #4444
- Replace `serde_json` with `speedy` for `ContractContext` and `ContractAnalysis` (de)serialization #4449
  - Adds `speedy`'s `Readable` and `Writable` derives to types to be (de)serialized.
- `ContractContext` and `ContractAnalysis` types & contract source #4450
- Use the `fake` crate for helping to create faked data instead of repeating boilerplate initializations in tests #4446
  - Adds `fake`'s `Dummy` derive to types to be faked.
- `Box`ing of `SymbolicExpression`s in Clarity #4445
- Use `hashbrown`'s `HashMap` and `HashSet` #4441
- Use `rusqlite`'s `array` feature and start using binary/blob columns directly over the blob API #4442 (see the blob-column sketch after this list)
- Rename `get`/`put`/etc. methods in Clarity to `get_data`/`put_data` #4443
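As referenced in the list above, a minimal sketch of storing pre-serialized bytes in a SQLite BLOB column with `rusqlite`; the table and column names are invented, not the PR's actual schema:

```rust
use rusqlite::{params, Connection};

fn main() -> rusqlite::Result<()> {
    let conn = Connection::open_in_memory()?;
    conn.execute(
        "CREATE TABLE metadata (key TEXT PRIMARY KEY, value BLOB NOT NULL)",
        [],
    )?;

    // Store pre-serialized bytes directly as a BLOB instead of a
    // plain-text JSON column.
    let serialized: Vec<u8> = vec![0xDE, 0xAD, 0xBE, 0xEF];
    conn.execute(
        "INSERT INTO metadata (key, value) VALUES (?1, ?2)",
        params!["contract-src", &serialized],
    )?;

    // Read the bytes back without any text conversion.
    let stored: Vec<u8> = conn.query_row(
        "SELECT value FROM metadata WHERE key = ?1",
        params!["contract-src"],
        |row| row.get(0),
    )?;
    assert_eq!(stored, serialized);
    Ok(())
}
```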
🚩 Additional Changes & Improvements

- The `AnalysisDatabase` type has been removed and baked into the `ClarityDatabase` struct instead. This is motivated by the fact that these are two different interfaces working against the same database and tables, and thus moving such logic together helps to reduce the perceived complexity in preparation for a larger refactor. It would have been another story, IMO, if we were talking about a significant number of analysis-related methods -- but since it's only a few, the additional abstraction only adds complexity and confusion.
  - This includes renaming `AnalysisDatabase` to `ClarityDatabase` and `analysis_db` to `clarity_db`.
- This PR makes it an error to attempt to insert a contract multiple times into the backing store, even at different block hashes. Previously there was no check for this, allowing duplicate contracts to be deployed at different block heights, potentially without the caller realizing it. This situation needs to be explicitly handled now, as sketched below.
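A hypothetical sketch of the duplicate-insert guard described above; all type and method names here are invented for illustration and are not the PR's actual API:

```rust
// All names below are invented for illustration; the PR's actual API differs.
#[derive(Debug)]
enum ClarityDbError {
    ContractAlreadyExists(String),
}

#[derive(Default)]
struct ClarityDb {
    known_contracts: std::collections::HashSet<String>,
}

impl ClarityDb {
    // Inserting the same contract identifier twice is now an error, even
    // at a different block hash, instead of silently writing a duplicate.
    fn insert_contract(&mut self, contract_id: &str, _data: &[u8]) -> Result<(), ClarityDbError> {
        if !self.known_contracts.insert(contract_id.to_string()) {
            return Err(ClarityDbError::ContractAlreadyExists(contract_id.to_string()));
        }
        // ... perform the actual write here ...
        Ok(())
    }
}

fn main() {
    let mut db = ClarityDb::default();
    assert!(db.insert_contract("SP000.docs-test", b"...").is_ok());
    // The second insert must now be handled explicitly by the caller.
    assert!(db.insert_contract("SP000.docs-test", b"...").is_err());
}
```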
🧪 Testing

📌 A lot of `/tests/*.rs` etc. files have been touched due to renames and test-tweaking.

- `RollbackWrapper`.
- The storage change for `ContractContext`, source code & `ContractAnalysis` (#4444) is disabled by default in `test` and `dev` modes. This is because too many tests fail otherwise, as they rely on very specific control of what exists/doesn't exist in the underlying database.
- `ClarityDatabase::test_insert_contract_hash()` -- this method only ensured that the contract hash existed in the MARF, so it has been changed to one of `test_insert_contract()` or `test_insert_contract_with_analysis()` in said tests. These two methods have also been added to tests prior to their "need" of this stored data. Related to [Optimization] Refactor storage schema for Clarity contracts & analyses #4448.

⏰ Performance
The following benchmark shows a simulated single-contract write/read in the current `next` vs. this branch. This benchmark was performed on a modern Intel CPU with an NVMe drive. Note that these benchmarks use disk-based databases, and `optimized` includes both `speedy` serialization and `lz4_flex` compression, where `next` uses only `serde_json` serialization. No caching is used.

|          | `next`               | `optimized`                 |
|----------|----------------------|-----------------------------|
| `insert` | 672.04 us (✅ 1.00x) | 237.78 us (🚀 2.83x faster) |
| `select` | 295.95 us (✅ 1.00x) | 117.13 us (🚀 2.53x faster) |

ℹ️ More to come with hyperfine + stacks-inspect replay benchmarking.
🚩 New Dependencies

- `fake` -- this is where the `Dummy` attribute littered throughout this PR comes from (see the sketch after this list). Currently added as a dependency and not a dev-dependency because I had issues with conditional compiles in test/dev mode -- but it should work to get this into `dev-dependencies` with a little more effort. All code using `fake` is conditional on `cfg!(test)`, so it should be optimized away during a build anyway.
- `rusqlite` uses this crate, and specifically its `LruCache` struct, internally. `default-features = false` enables its "performance mode" with unsafe code.
- `speedy` -- this is where the `Readable` and `Writable` attributes littered throughout this PR come from. This is only used for node-specific, non-consensus data (i.e. data which was previously stored in the Clarity `metadata_table` SQLite table):
  - `ContractContext`
  - `ContractAnalysis`
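For reference, a minimal sketch of how `fake`'s `Dummy` derive is typically used; the struct and its field faker are invented for this example:

```rust
use fake::{Dummy, Fake, Faker};

// Invented example type; in the PR the derive goes on Clarity types.
#[derive(Debug, Dummy)]
struct ContractRow {
    issuer: String,
    #[dummy(faker = "1..4096")] // constrain the faked value to a range
    data_size: u64,
}

fn main() {
    // One call produces a fully populated value, replacing hand-written
    // boilerplate initialization in tests.
    let row: ContractRow = Faker.fake();
    println!("{row:?}");
}
```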
💡 Some Potential Next Steps

This PR sets the stage for further refactoring of the Clarity (and other) database(s). Some ideas for next steps are things like (but not limited to):

- Normalizing the `metadata_table` database table. By normalizing this structure we could perform much more efficient, targeted reads and writes for specific map keys.
- Operations like `index-of`, `element-at`, `slice`, etc. could be delegated to the underlying database engine, which specializes in query logic, selecting only the items needed for a particular operation (see the sketch below).
- `speedy` instead of `serde_json`.