
[Refactor] Refactor of Clarity storage (iteration 1) #4437

Draft
wants to merge 24 commits into base: next from refactor/clarity-contract-storage

Conversation

@cylewitruk (Member) commented Feb 27, 2024

❗❗❗ This PR draft is currently being wrapped up and is NOT READY FOR CONSIDERATION ❗❗❗

Remaining Work:

  • More mainnet sync tests
  • Some additional unit tests are needed, particularly surrounding the Clarity cache
  • Finish DB migration implementation and tests
  • Performance benchmarking (w/hyperfine)
  • Storage benchmarking (size-on-disk)
  • Create issues for changes

👋 Introduction

This work was born from a combination of the A/B-tester I have recently been working on and the open umbrella issue #4316. When testing the A/B-tester it became clear that replaying blocks up to recent heights between each bugfix takes far too much time, so I started investigating ways to optimize the Clarity VM specifically, with a focus on the oracle-v1 contracts. Needless to say, there are a number of optimizations that can be made. While other initiatives have been looking for low-hanging fruit, I decided to try tackling some deeper optimizations.

The primary goal of this PR is to lay some groundwork for refactoring and optimizing the persistence layer of the Stacks node, with a focus on the Clarity database. I decided to start with two rather large data structures which are, today, serialized using serde_json and written to/read from SQLite as plain-text JSON. These structures are read and deserialized a number of times during contract execution.

As seems to be usual for my core refactoring attempts, this started out as a "small and isolated" change which quickly cascaded into a much larger surface area. The PR in itself is quite large, but I hope that the changes are clear - I did make an effort on documentation :)

⛓ Consensus

While this PR potentially touches Clarity consensus-critical storage, no behavior should have changed (i.e. consensus keys, serialization, etc.). Therefore this PR should be seen as non-consensus-critical -- however, an extra eye on the consensus-critical code paths is warranted, just in case I missed something that the tests aren't catching.

✅️ Issues Resolved in this PR

🚩 Additional Changes & Improvements

  • The AnalysisDatabase type has been removed and its functionality baked into the ClarityDatabase struct instead. This is motivated by the fact that these were two different interfaces working against the same database and tables, so moving the logic together helps reduce the perceived complexity in preparation for a larger refactor. It would have been another story, IMO, if there were a significant number of analysis-related methods involved -- but since there are only a few, the additional abstraction only adds complexity and confusion.

    • 📌 This change touches a lot of files, but primarily due to the renaming of AnalysisDatabase to ClarityDatabase and analysis_db to clarity_db.
  • This PR makes it an error to attempt to insert a contract multiple times into the backing store, even at different block hashes. Previously there was no check for this, allowing duplicate contracts to be deployed at different block heights, potentially without the caller realizing it. This situation now needs to be handled explicitly (see the sketch after this list).
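
The following is a purely illustrative, toy in-memory sketch of the new duplicate-insert behavior -- the struct and error names are made up and are not the PR's actual ClarityDatabase API:

```rust
use std::collections::HashMap;

// Toy stand-in for the backing store; all names here are illustrative only.
#[derive(Debug, PartialEq)]
enum StoreError {
    ContractAlreadyExists(String),
}

#[derive(Default)]
struct ContractStore {
    // contract identifier -> contract source
    contracts: HashMap<String, String>,
}

impl ContractStore {
    fn insert_contract(&mut self, id: &str, source: &str) -> Result<(), StoreError> {
        // Previously a duplicate would simply be written again (possibly at a
        // different block hash); now the caller must handle this case explicitly.
        if self.contracts.contains_key(id) {
            return Err(StoreError::ContractAlreadyExists(id.to_string()));
        }
        self.contracts.insert(id.to_string(), source.to_string());
        Ok(())
    }
}

fn main() {
    let mut store = ContractStore::default();
    assert!(store
        .insert_contract("SP000.counter", "(define-data-var n int 0)")
        .is_ok());
    assert_eq!(
        store.insert_contract("SP000.counter", "(define-data-var n int 0)"),
        Err(StoreError::ContractAlreadyExists("SP000.counter".into())),
    );
}
```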

🧪 Testing

📌 A lot of /tests/*.rs etc. files have been touched due to renames and test-tweaking.

  • Unit tests have been written for new code.
    • ℹ️ Some areas are still uncovered -- I'm working on more tests for them.
  • Added new tests for existing code, particularly surrounding the RollbackWrapper.
  • Existing tests have been adapted to work with the new paradigm, with a few gotchas:
    • The clarity cache ([Optimization] Add caching for Clarity contracts - ContractContext, source code & ContractAnalysis #4444) is disabled by default in test and dev modes. Too many tests would otherwise fail because they rely on very specific control over what does and doesn't exist in the underlying database.
    • A lot of tests use methods which now expect the contract and/or its analysis to exist in the underlying database. These tests previously relied on ClarityDatabase::test_insert_contract_hash(), which only ensured that the contract hash existed in the MARF, so they started failing; in those tests the call has been changed to either test_insert_contract() or test_insert_contract_with_analysis(). These two methods have also been added to tests ahead of their need for the stored data. Related to [Optimization] Refactor storage schema for Clarity contracts & analyses #4448.
    • A number of tests expect to be able to insert the same contract multiple times, which is now an error. I've adapted these tests as well.
    • ⚠️ Please put extra focus on reviewing all adapted tests -- it is possible that I misunderstood what the test was trying to accomplish, or simply made a mistake.
  • All tests in the workspace are passing.
  • A release-mode node has been partially synced to mainnet without issue.
    • ℹ️ I'm working on more testing here.

⏰ Performance

The following benchmark shows a simulated single contract write/read on the current next branch vs. this branch. The benchmark was performed on a modern Intel CPU with an NVMe drive. Note that these benchmarks use disk-based databases and that optimized includes both speedy serialization and lz4_flex compression, whereas next uses only serde_json serialization. No caching is used.

        next                      optimized
insert  672.04 µs (✅ 1.00x)      237.78 µs (🚀 2.83x faster)
select  295.95 µs (✅ 1.00x)      117.13 µs (🚀 2.53x faster)

ℹ️ More to come with hyperfine + stacks-inspect replay benchmarking

🚩 New Dependencies

  • fake - Used to help with filling structs with random (but format-correct) data in tests. This is where the Dummy attribute littered throughout this PR comes from. It is currently added as a regular dependency rather than a dev-dependency because I had issues with conditional compilation in test/dev mode -- with a little more effort it should be possible to move it into dev-dependencies. All code using fake is gated on cfg!(test), so it should be optimized away in release builds anyway.
  • lru - Used in global caching of Clarity data, including contracts and analyses. Note that rusqlite uses this crate, and specifically its LruCache struct, internally.
  • lz4_flex - A pure-Rust implementation of the LZ4 compression algorithm. Currently set to use only safe code, but default-features = false enables its "performance mode" with unsafe code. (A sketch of the combined speedy + lz4_flex round-trip follows this list.)
  • speedy - A very fast binary serializer written in pure Rust. This is where the Readable and Writable attributes littered throughout this PR come from. This is only used for node-specific, non-consensus data (i.e. data which was previously stored in the Clarity metadata_table SQLite table):
    • Serialized ContractContext
    • Serialized ContractAnalysis
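
As a rough illustration of how these two crates fit together on the write and read paths, here is a minimal sketch; the ContractMeta struct is made up for the example and is not the real ContractContext or ContractAnalysis:

```rust
use lz4_flex::{compress_prepend_size, decompress_size_prepended};
use speedy::{Readable, Writable};

// Made-up stand-in for the serialized ContractContext / ContractAnalysis payloads.
#[derive(Debug, PartialEq, Readable, Writable)]
struct ContractMeta {
    name: String,
    function_count: u32,
}

fn main() {
    let meta = ContractMeta { name: "oracle-v1".into(), function_count: 42 };

    // Write path: speedy binary serialization, then LZ4 compression.
    let serialized = meta.write_to_vec().expect("serialization failed");
    let compressed = compress_prepend_size(&serialized);

    // Read path: LZ4 decompression, then speedy deserialization.
    let decompressed = decompress_size_prepended(&compressed).expect("decompression failed");
    let roundtripped =
        ContractMeta::read_from_buffer(&decompressed).expect("deserialization failed");

    assert_eq!(meta, roundtripped);
}
```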

💡 Some Potential Next Steps

This PR sets the stage for further refactoring the Clarity (and other) database(s). Some ideas for next steps are things like (but not limited to):

  • Optimizing how Clarity maps are stored. Maps are currently JSON-serialized in their entirety and written/read to/from the metadata_table database table. By normalizing this structure we could perform much more efficient, targeted reads and writes for specific map keys.
  • Clarity lists are another contender with the same motivations as above. Lists could be normalized and operations like index-of, element-at, slice etc. could be delegated to the underlying database engine, which specializes in query logic, selecting only the items needed for a particular operation.
  • Clarity functions can be normalized to allow fast lookup and validation of contract functions prior to loading the entire contract for a contract call.
  • Ensuring all binary data is stored as binary instead of its hexadecimal string representation.
  • Changing serialization of the remainder of non-consensus-critical node-local data to use speedy instead of serde_json.
  • Employing compression on large stored Clarity values.
  • Moving on to optimizing some of the other databases, with a focus on migrating binary data stored as hexadecimal strings to raw bytes.


codecov bot commented Feb 27, 2024

Codecov Report

Attention: Patch coverage is 82.89778%, with 517 lines in your changes missing coverage. Please review.

Project coverage is 83.40%. Comparing base (383d586) to head (ed6615e).

Additional details and impacted files
@@            Coverage Diff             @@
##             next    #4437      +/-   ##
==========================================
- Coverage   83.46%   83.40%   -0.06%     
==========================================
  Files         448      455       +7     
  Lines      324321   326383    +2062     
==========================================
+ Hits       270698   272233    +1535     
- Misses      53623    54150     +527     
Files Coverage Δ
clarity/src/vm/analysis/arithmetic_checker/mod.rs 92.75% <ø> (ø)
.../src/vm/analysis/contract_interface_builder/mod.rs 95.20% <100.00%> (ø)
clarity/src/vm/analysis/mod.rs 95.23% <100.00%> (+0.14%) ⬆️
clarity/src/vm/analysis/read_only_checker/mod.rs 87.16% <100.00%> (ø)
clarity/src/vm/analysis/read_only_checker/tests.rs 100.00% <100.00%> (ø)
clarity/src/vm/analysis/tests/mod.rs 100.00% <ø> (ø)
clarity/src/vm/analysis/trait_checker/mod.rs 100.00% <100.00%> (ø)
clarity/src/vm/analysis/type_checker/contexts.rs 93.50% <100.00%> (ø)
clarity/src/vm/analysis/type_checker/mod.rs 91.66% <100.00%> (ø)
...src/vm/analysis/type_checker/v2_05/tests/assets.rs 100.00% <100.00%> (ø)
... and 74 more

... and 23 files with indirect coverage changes



Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@cylewitruk cylewitruk force-pushed the refactor/clarity-contract-storage branch from 54be976 to 9b5f9cd on February 27, 2024 at 19:34
@@ -2865,65 +2865,6 @@ mod test {

let contract_id = QualifiedContractIdentifier::local("docs-test").unwrap();

{
cylewitruk (Member Author) commented on this diff:

This code was causing problems with contract existence/non-existence and it seems to be doing the same thing as the "more modern" code below - can anybody confirm that this removal is OK?
Note that the comparisons between the results from this code and the code below have also been removed.

@cylewitruk cylewitruk force-pushed the refactor/clarity-contract-storage branch from 9b5f9cd to 5a68254 on February 28, 2024 at 08:56

@jbencin (Contributor) commented Feb 28, 2024

I'm looking at the speedy docs and it looks like the purpose of this crate is to read/write to a buffer with as little overhead as possible. Does it contain any sort of versioning info or other metadata? What happens if in the future we add/remove/reorder fields in a serializable struct?

EDIT: Have you considered rkyv? It's similar (fast, minimalistic binary protocol, not "self describing", lacks schema evolution), but much more widely used (several popular crates support it, such as hashbrown) and has some additional features, like a limited form of validation

@cylewitruk (Member Author) commented Feb 28, 2024

I'm looking at the speedy docs and it looks like the purpose of this crate is to read/write to a buffer with as little overhead as possible. Does it contain any sort of versioning info or other metadata? What happens if in the future we add/remove/reorder fields in a serializable struct?

Great question!

When it comes to speedy, you can add and remove fields, but:

  • to deserialize a v1 { name } into a v2 { name, phone }, the added field phone needs the #[speedy(default_on_eof)] attribute to indicate that it's OK for it to be missing.
  • nothing is needed to deserialize from a v2 { name, phone } into a v1 { name }.
  • however, ordering must be maintained.

This means that if the field order ever changes, a migration would need to be written. A minimal sketch of these rules is below.
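
Assuming speedy behaves as described above, here is a small sketch with hypothetical PersonV1/PersonV2 structs (not anything in the codebase):

```rust
use speedy::{Readable, Writable};

// Hypothetical structs purely to illustrate the add/remove-field rules above.
#[derive(Readable, Writable)]
struct PersonV1 {
    name: String,
}

#[derive(Readable, Writable)]
struct PersonV2 {
    name: String,
    // Added field: without `default_on_eof`, reading v1 bytes into PersonV2 would fail.
    #[speedy(default_on_eof)]
    phone: String,
}

fn main() {
    // v1 -> v2: the missing trailing field falls back to its default (empty string).
    let v1_bytes = PersonV1 { name: "alice".into() }.write_to_vec().unwrap();
    let upgraded = PersonV2::read_from_buffer(&v1_bytes).unwrap();
    assert_eq!(upgraded.phone, "");

    // v2 -> v1: the extra trailing field is simply ignored on read.
    let v2_bytes = PersonV2 { name: "bob".into(), phone: "555-1234".into() }
        .write_to_vec()
        .unwrap();
    let downgraded = PersonV1::read_from_buffer(&v2_bytes).unwrap();
    assert_eq!(downgraded.name, "bob");
}
```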

EDIT: Have you considered rkyv? It's similar (fast, minimalistic binary protocol, not "self describing", lacks schema evolution), but much more widely used (several popular crates support it, such as hashbrown) and has some additional features, like a limited form of validation

Yeah, I had a look at rkyv as well (prior to speedy) - it's a bit less straightforward, but I'd be more than happy to add a bench and test suite for adding/removing/reordering like I did with speedy 👍

EDIT: There is actually a discussion here regarding speedy and forwards-compatibility, including a potential solution. To be honest, it doesn't look like much thought went into forwards-/backwards-compatibility in the existing [Stacks] code (though, then again, field names are serialized) -- but this is a good discussion to have so that we make sure we account for versioning moving forward.

EDIT: Just wanted to toss in that the actual serialization implementation is only a few lines of code, so it doesn't really matter which encoding we end up with; as long as it has a serde-like interface, it's trivial to swap out prior to merging. The bigger work is done, i.e. getting binary serialization + storage in place 😄

@jbencin (Contributor) commented Feb 28, 2024

I'd be more than happy to add a bench and test suite for adding/removing/reordering

There are some nice benchmarks of Rust serialization frameworks here. Both are similar in performance, and either one would be much faster than JSON, so the choice should depend on other aspects

it doesn't look like there was really any thought into forwards-/backwards-compatibility in the existing [Stacks] code, but then again field names are serialized

JSON is "self-describing" in that the field names, the types (to some extent), and the overall structure are encoded into the output data. The extra metadata gives the application information about what the data is and what to do with it, so even if the application's internal structures change, it can use that information to ignore unrecognized fields, know when a field is missing and use a default value, or fail with a descriptive error.
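
As a small illustration of that point, here is a hypothetical serde_json example (the Config struct is made up, not anything in the codebase):

```rust
use serde::Deserialize;

// Hypothetical struct: JSON written by an older version of the application can
// still be read, because serde matches fields by name and (by default) ignores
// unknown ones.
#[derive(Debug, Deserialize)]
struct Config {
    name: String,
    // Field added in a newer version: missing in old data, so fall back to the default.
    #[serde(default)]
    retries: u32,
}

fn main() {
    // Old JSON: contains a field we no longer know about, and lacks the new `retries` field.
    let old_json = r#"{ "name": "node-a", "legacy_flag": true }"#;
    let cfg: Config = serde_json::from_str(old_json).unwrap();
    assert_eq!(cfg.retries, 0);
    println!("{cfg:?}");
}
```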

On the other hand, many of these fast binary formats get their speed by omitting that extra metadata. Instead, they assume that the application knows exactly how the data should be structured. This limits the validation you can do, and the portability between languages, CPU architectures, and even different versions of the application. Sometimes the trade-off is worth it, depending on the use-case. What I'm specifically concerned about here is:

  • If something changes in the binary format...
    • Is there any chance of silent data corruption, like blindly memcpying binary data into the wrong variables?
    • In the event of an explicit failure, do we get meaningful error messages, like which struct/field it failed at?
  • Would this be more brittle than the current JSON encoding? Would we have to rebuild the chainstate for changes that would be backwards compatible if we were using JSON?

There are also some fast binary serialization formats that are portable and support versioning, like Cap’n Proto or FlatBuffers, but they work by generating code from a custom schema language, so there's significant added complexity (much more work than adding a #[derive(...)]). Probably not something we want to invest in right now
