WIP: MMAP backed bucket map that is suitable as a KV store for the accounts index #18179
Conversation
I have done several passes of test code that could be helpful here to gather metrics across number of accounts, insert, lookup, and memory usage for hashmap, dashmap, btree, binned btree, skip-list, and binned skip-list. So far, I've been focusing on the various in-memory representations.
runtime/src/data_bucket.rs
Outdated
path: PathBuf,
mmap: MmapMut,
pub cell_size: u64,
pub capacity: u64,
Could you clarify whether capacity is in units of cell_size, or in bytes?
Edit: I believe it's cell_size.
Yea. num_cells would be better :)
runtime/src/data_bucket.rs
Outdated
let drive = &drives[r];
let pos = format!("{}", thread_rng().gen_range(0, u128::MAX));
let file = drive.join(pos);
let mut data = OpenOptions::new()
I guess in the future this could recycle old MMaps instead of remove_file below
Yea. V2
runtime/src/bucket_map.rs
Outdated
for i in self.data.len() as u64..(sz.0 + 1) {
    self.data.push(DataBucket::new(
        self.drives.clone(),
        (1 << i) + MIN_ELEMS,
Why do you grow the DataBucket cell_size exponentially?
I guess it is to fit data.len(), as per best_fit_bucket.
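To illustrate why exponentially sized data buckets pair well with a best-fit lookup, here is a minimal sketch. The `MIN_ELEMS` value, the `bucket_capacity` helper, and this reimplementation of `best_fit_bucket` are assumptions for illustration, not the PR's actual code:

```rust
// Hypothetical sketch: with exponentially growing bucket capacities,
// only O(log n) distinct sizes exist, so any element count maps to a
// bucket that wastes at most roughly 2x space. MIN_ELEMS here is an
// assumed placeholder constant.
const MIN_ELEMS: u64 = 1;

// Capacity of the i-th data bucket, matching the (1 << i) + MIN_ELEMS
// expression in the diff above.
fn bucket_capacity(i: u64) -> u64 {
    (1 << i) + MIN_ELEMS
}

// Smallest bucket index whose capacity can hold `data_len` cells.
fn best_fit_bucket(data_len: u64) -> u64 {
    (0..64).find(|&i| bucket_capacity(i) >= data_len).unwrap()
}

fn main() {
    assert_eq!(best_fit_bucket(1), 0); // capacity 2
    assert_eq!(best_fit_bucket(3), 1); // capacity 3
    assert_eq!(best_fit_bucket(100), 7); // capacity 129
    println!("ok");
}
```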
Some benchmarks. Single insert isn't super optimized yet, but it's still pretty good.
It seems to start outperforming the

4096 buckets on Lemond with each thread adding 4096 items. Overhead per account: (2774007808 + 2712141824) / (4096 * 4096) = 327 bytes
keys.par_chunks(chunks)
let mut hashes = vec![];
for ix in 0..self.accounts_index.account_maps.num_buckets() {
    let keys = self.accounts_index.account_maps.keys(ix);
@ryoqun @sakridge @jeffwashington
wdyt, would this work here? The initial design had a snapshot slot that would keep the data behind that slot from being cleaned up as the index is updated. So we shouldn't need a global lock on all the keys to do the snapshot.
@brooksprumo is good to reference here, as he's been working on snapshot-related logic.
I think this code path is already dead except for debug validation. We don't want to use the accounts index for hash calculation.
Even so, does bucket[x] have keys that are all < the keys in bucket[x+1]? This particular algorithm is written to assume the keys vector is in sorted order. The btree provides that currently. If not, we have to save key and hash from this scan, sort by key, then hash.
I can confirm what @jeffwashington said that on the snapshot side this is never called/used. The snapshot code that calls AccountsDb::calculate_accounts_hash_without_index() is inside an if that is always false.
There are hooks for testing, but none of the tests actually exercise this code path. I assume in the past this was different, but as of now AccountsDb::calculate_accounts_hash_without_index() is never called in the snapshot code (nor anything that calls these snapshot functions).
Relevant snapshot functions:
snapshot_utils::snapshot_bank()
snapshot_utils::package_snapshot()
Even so, does bucket[x] have keys that are all < the keys in bucket[x+1]?
Yep. that is guaranteed. https://github.com/solana-labs/solana/pull/18179/files#diff-d90ce8cbcea9459dbca60678b48f6d6d2816b8d4de8b8ebfd5c434b819f153b3R97
At boot, the validator creates a fixed number of buckets. On lemond with 128 cores, I see performance improve up to 4096 buckets.
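The ordering guarantee can be sketched as follows, assuming the bucket index is taken from the high bits of the key (a simplification for illustration; the real selection logic lives in bucket_map.rs, and this `bucket_ix` signature is hypothetical):

```rust
// Hypothetical sketch: if the bucket index is derived from the high
// bits of the 32-byte key, then every key in bucket x compares less
// than every key in bucket x + 1, so scanning buckets in index order
// visits key ranges in order.
fn bucket_ix(num_buckets_pow2: u32, key: &[u8; 32]) -> usize {
    // Interpret the first 8 key bytes as a big-endian integer and take
    // the top `num_buckets_pow2` bits as the bucket index.
    let location = u64::from_be_bytes(key[0..8].try_into().unwrap());
    (location >> (64 - num_buckets_pow2)) as usize
}

fn main() {
    let lo = [0u8; 32]; // smallest possible key
    let mut hi = [0u8; 32];
    hi[0] = 0xff;
    hi[1] = 0xff; // a key near the top of the keyspace
    // With 2^12 = 4096 buckets:
    assert_eq!(bucket_ix(12, &lo), 0);
    assert_eq!(bucket_ix(12, &hi), 4095);
    println!("ok");
}
```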
fyi, I am working on this
@aeyakovenko, @sakridge
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

This stale pull request has been automatically closed. Thank you for your contributions.
Just a comment here since it's stale: this is still actively in prototyping.
Problem
The accounts index takes up way too much RAM.
Summary of Changes
Design
Each bucket index or bucket data can double independently.
Locking
Each bucket has its own RWLock. We should test performance with spin::RwLock vs sync::RwLock. The DataBucket allocator can function with atomics, so it's possible to redesign this such that the write lock is only necessary when reallocating.
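The per-bucket locking scheme can be sketched as below, using `std::sync::RwLock` and a plain `HashMap` as a stand-in for the mmap-backed bucket; the type and method names are illustrative, not the PR's:

```rust
use std::collections::HashMap;
use std::sync::RwLock;

// Illustrative sketch of per-bucket locking: one RwLock per bucket, so
// an insert only write-locks the bucket the key hashes to, and readers
// of other buckets are never blocked. The HashMap stands in for the
// mmap-backed storage.
struct BucketMap {
    buckets: Vec<RwLock<HashMap<[u8; 32], u64>>>,
}

impl BucketMap {
    fn new(num_buckets: usize) -> Self {
        let buckets = (0..num_buckets)
            .map(|_| RwLock::new(HashMap::new()))
            .collect();
        Self { buckets }
    }

    // Simplified bucket selection: first key byte modulo bucket count.
    fn bucket_ix(&self, key: &[u8; 32]) -> usize {
        key[0] as usize % self.buckets.len()
    }

    // Write-locks exactly one bucket; all other buckets stay free.
    fn insert(&self, key: [u8; 32], value: u64) {
        let ix = self.bucket_ix(&key);
        self.buckets[ix].write().unwrap().insert(key, value);
    }

    // Read-locks exactly one bucket, allowing concurrent readers.
    fn get(&self, key: &[u8; 32]) -> Option<u64> {
        let ix = self.bucket_ix(key);
        self.buckets[ix].read().unwrap().get(key).copied()
    }
}

fn main() {
    let map = BucketMap::new(16);
    let key = [7u8; 32];
    map.insert(key, 42);
    assert_eq!(map.get(&key), Some(42));
    println!("ok");
}
```

Swapping `std::sync::RwLock` for `spin::RwLock` as suggested above would only change the lock type in the struct; the access pattern stays the same.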
In Order Scan
The buckets are in order, while keys in each bucket are going to be out of order. To scan the entire index in order, read each bucket in order and use a rayon parallel sort.
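The in-order scan described above can be sketched as follows. The PR proposes rayon's parallel sort; plain `sort_unstable` is used here to keep the sketch dependency-free, with the parallelizable step marked in a comment:

```rust
// Sketch of the in-order scan: buckets partition the keyspace in
// order, but keys inside a bucket are unordered. Sorting each bucket
// independently and concatenating the results yields a globally
// sorted scan, because every key in bucket x is less than every key
// in bucket x + 1.
fn scan_in_order(mut buckets: Vec<Vec<u64>>) -> Vec<u64> {
    for bucket in buckets.iter_mut() {
        // Each bucket sort is independent, so in practice this loop
        // (or the sort itself) can run in parallel, e.g. with rayon.
        bucket.sort_unstable();
    }
    buckets.into_iter().flatten().collect()
}

fn main() {
    // Bucket 0's keys are all smaller than bucket 1's, but each
    // bucket is internally unordered.
    let buckets = vec![vec![3, 1, 2], vec![6, 5, 4]];
    assert_eq!(scan_in_order(buckets), vec![1, 2, 3, 4, 5, 6]);
    println!("ok");
}
```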
Fixes #
tag @carllin @jeffwashington @sakridge @ryoqun