Replace DbLedger with simple blob store #2566
Description
Problem
DbLedger
currently requires rocksdb
, while we use very little of what it has to offer. RocksDb types are also This also has significant build costs and is a large non-Rust dependency.
Solution
Re-implement file-based ledger storage, with a focus on writing as fast as possible, as well as providing more specifically useful API and other characteristics.
Goals
- Remove rocksdb dependency
- Store ledger data reliably
- Exceed current DbLedger performance
Design
There are two primary components: BlobStore
and KVStore
BlobStore
Essentially a struct that should behave as close to DbLedger
as possible so that replacement is easy. It relies on the KVStore
for all actual storage
KVStore
Key-value persistence layer that maps keys to a value.
Value
is just a type alias for clarity: type Value = Vec<u8>
. All operations must be ACID transactions to whatever degree possible.
Operations:
put (key: Key, value: Value) -> Result<()>
get (key: Key) -> Result<Option<Value>>
delete (key: Key) -> Result<bool>
- return value indicates if there was something to delete
put_many (rows: [(K, V)]) -> Result<()>
get_many (keys: [Key]) -> Result<Vec<Value>>
delete_many (keys: [Key]) -> Result<()>
scan(range: RangeInclusive<Key>) -> Result<impl Iterator<Item=(Key, Value)>>
The storage strategy is essentially a simple log-structured merge tree (LSMT) approach. There is an in-memory table and series of on-disk SSTables
[1]. When storing a record, it is first written to a write-log, then added to the in-memory table.
When this table fills up, it is then dumped to disk in two files. 1 file simple contains the (Key, Value)
pairs. The second file is the index file, which contains some metadata, and a list of (Key, Loc)
pairs where Loc
represents to location and extent of the record in the data file. These files combined form a logical SSTable
. When the in-memory table is dumped, it creates an SSTable
of level 0. As level 0 fills up, the compaction process begins.
The compaction strategy is known as "levelled compaction" and is mostly a simplified version of what's described by ScyllaDB and LevelDB [2][3].
Implementation
Performance is very critical for this feature. I have rudimentary benchmarks that indicate same-or-better perf when compared to the current DbLedger
but not very rigorous or well-documented ones. I will create them.
Questions
- Should most of this comment (probably everything above this) be expanded on and moved to something like a proposal in the book?
- Should this be a new crate in the workspace?
- Currently everything is mostly a new module in the solana crate. This seems like it could be a separate crate. That would make boundaries clear and might make builds, test, and benchmarking simpler (and possibly faster).
In-flight
- principled indexing and storage using simple LSMTree approach
- insert
- flushing in-memory table
- merging on disk tables
- read
- simple read
- merge read
- range scan
- recovery
- in-memory write buffer
- SSTable indexes
- integrate
- Change naming to be more inline with recent developments (block/slot/etc)
- Testing
- basic integration tests
- utility functions for testing
- Isolate and test unit-testable components or find some way to do I/O free testing
- Benchmarks
- basic benchmarks
- Documentation
- document all modules, types, and methods exposed to the rest of the crate
- document important internal modules, types, and methods