
WIP: MMAP backed bucket map that is suitable as a KV store for the accounts index #18179

Closed
wants to merge 39 commits

Conversation

aeyakovenko
Member

@aeyakovenko aeyakovenko commented Jun 23, 2021

Problem

Accounts index takes up way too much RAM

Summary of Changes

Design
  1. create a large set of buckets, 32k or more up front
  2. pubkeys based on their first 16 bits are stored in each bucket
  3. each bucket has a mmap for the index entry and N mmaps for variable sized [SlotInfo] data
  4. each data mmap stores 2^n SlotInfo slices

Each bucket index or bucket data can double independently.
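The prefix bucketing in step 2 can be sketched as follows. This is a minimal illustration, not the PR's actual code: `NUM_BUCKETS` and `bucket_index` are hypothetical names, and the sketch fixes the bucket count at 2^16 so the first 16 bits of the pubkey map directly to a bucket.

```rust
// Hypothetical sketch: route a pubkey to a bucket by its first 16 bits.
// With NUM_BUCKETS = 2^16, the two leading bytes are the bucket index.
const NUM_BUCKETS: usize = 1 << 16; // the PR suggests 32k or more buckets

fn bucket_index(pubkey: &[u8; 32]) -> usize {
    // Big-endian read of the first two bytes gives the top 16 bits.
    (((pubkey[0] as usize) << 8) | pubkey[1] as usize) % NUM_BUCKETS
}

fn main() {
    let mut key = [0u8; 32];
    key[0] = 0x12;
    key[1] = 0x34;
    assert_eq!(bucket_index(&key), 0x1234);
    println!("ok");
}
```

Because neighboring pubkey prefixes land in neighboring buckets, the buckets themselves stay ordered by key prefix, which the in-order scan below relies on.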

Locking

Each bucket has its own RWLock. We should test performance with spin::RwLock vs sync::RwLock. The DataBucket allocator can function with atomics, so it's possible to redesign this such that the write lock is only necessary when reallocating.
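A minimal sketch of the per-bucket locking described above, using `std::sync::RwLock`. The type and method names (`BucketMap`, `Bucket`, `insert`, `get`) are illustrative, not the PR's actual API; the point is that a write locks only its target bucket, so writers to different buckets never contend.

```rust
use std::sync::RwLock;

// Placeholder bucket state; the real bucket would wrap mmap-backed storage.
struct Bucket {
    entries: Vec<(u64, u64)>, // (key, value) pairs for illustration
}

struct BucketMap {
    buckets: Vec<RwLock<Bucket>>, // one lock per bucket
}

impl BucketMap {
    fn new(n: usize) -> Self {
        let buckets = (0..n)
            .map(|_| RwLock::new(Bucket { entries: vec![] }))
            .collect();
        Self { buckets }
    }

    fn insert(&self, ix: usize, k: u64, v: u64) {
        // Write-lock only the target bucket; other buckets stay available.
        self.buckets[ix].write().unwrap().entries.push((k, v));
    }

    fn get(&self, ix: usize, k: u64) -> Option<u64> {
        // Readers in the same bucket can proceed concurrently.
        self.buckets[ix]
            .read()
            .unwrap()
            .entries
            .iter()
            .find(|(key, _)| *key == k)
            .map(|(_, v)| *v)
    }
}

fn main() {
    let m = BucketMap::new(4);
    m.insert(1, 42, 7);
    assert_eq!(m.get(1, 42), Some(7));
    assert_eq!(m.get(0, 42), None);
    println!("ok");
}
```

Swapping `std::sync::RwLock` for `spin::RwLock` would be a one-line change here, which is what makes the performance comparison mentioned above cheap to test.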

In Order Scan

The buckets are in order, but keys within each bucket are out of order. To scan the entire index in order, read each bucket in order and use a rayon parallel sort.
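The scan can be sketched like this. It is a simplified illustration: keys are shown as `u16` prefixes, `in_order_scan` is a hypothetical name, and plain `sort_unstable` stands in for rayon's parallel sort so the sketch stays std-only.

```rust
// Hypothetical sketch of the in-order scan: buckets are already ordered
// by key prefix, but keys inside each bucket are unordered. Sorting each
// bucket and concatenating in bucket order yields a fully ordered scan.
// The PR would use rayon's par_sort; sort_unstable keeps this std-only.
fn in_order_scan(buckets: Vec<Vec<u16>>) -> Vec<u16> {
    let mut all = Vec::new();
    for mut b in buckets {
        b.sort_unstable(); // sort within the bucket
        all.extend(b);     // buckets themselves are already in prefix order
    }
    all
}

fn main() {
    // Two buckets, each holding unordered keys from its own prefix range.
    let buckets = vec![vec![0x0102, 0x0001], vec![0x1ff0, 0x1a00]];
    assert_eq!(in_order_scan(buckets), vec![0x0001, 0x0102, 0x1a00, 0x1ff0]);
    println!("ok");
}
```

This only works because every key in bucket[x] is less than every key in bucket[x+1], a property confirmed later in the thread.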

Fixes #

tag @carllin @jeffwashington @sakridge @ryoqun

@aeyakovenko aeyakovenko changed the title SSD backed bucket map that is suitable as a KV store for the accounts index WIP: MMAP backed bucket map that is suitable as a KV store for the accounts index Jun 23, 2021
@jeffwashington
Contributor

jeffwashington commented Jun 23, 2021

I have done several passes of test code that could be helpful here to gather metrics across number of accounts, insert, lookup, and memory usage for hashmap, dashmap, btree, binned btree, skip-list, and binned skip-list. So far, I've been focusing on the various in-memory representations.

Contributor

@jon-chuang jon-chuang left a comment

Just doing some basic sanity checking

path: PathBuf,
mmap: MmapMut,
pub cell_size: u64,
pub capacity: u64,
Contributor

@jon-chuang jon-chuang Jun 24, 2021


Could you clarify if capacity is in units of cell_size, or bytes?
Edit: I believe it’s cell_size.

Member Author

@aeyakovenko aeyakovenko Jun 26, 2021


Yea. num_cells would be better :)

let drive = &drives[r];
let pos = format!("{}", thread_rng().gen_range(0, u128::MAX),);
let file = drive.join(pos);
let mut data = OpenOptions::new()
Contributor

@jon-chuang jon-chuang Jun 24, 2021


I guess in the future this could recycle old MMaps instead of remove_file below

Member Author

@aeyakovenko aeyakovenko Jun 26, 2021


Yea. V2

for i in self.data.len() as u64..(sz.0 + 1) {
self.data.push(DataBucket::new(
self.drives.clone(),
(1 << i) + MIN_ELEMS,
Contributor

@jon-chuang jon-chuang Jun 24, 2021


Why do you grow the DataBucket cell_size exponentially?

Contributor

@jon-chuang jon-chuang Jun 24, 2021


I guess it is to fit the data.len() as per best_fit_bucket
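The exponential growth discussed here can be illustrated with a sketch of the fitting logic. `best_fit_bucket` is named after the PR's helper, but this body is illustrative, not the actual implementation: with power-of-two cell sizes, an allocation of `len` elements goes to data bucket ceil(log2(len)), so roughly 64 data buckets cover any possible length and at most half the cell is wasted.

```rust
// Hypothetical sketch: pick the smallest data bucket whose power-of-two
// cell size holds `len` elements. Bucket i has cell size 2^i, so the
// answer is ceil(log2(len)).
fn best_fit_bucket(len: u64) -> u64 {
    // next_power_of_two rounds len up to 2^i; the bucket index is i.
    64 - len.max(1).next_power_of_two().leading_zeros() as u64 - 1
}

fn main() {
    assert_eq!(best_fit_bucket(1), 0); // cell size 1
    assert_eq!(best_fit_bucket(2), 1); // cell size 2
    assert_eq!(best_fit_bucket(3), 2); // cell size 4
    assert_eq!(best_fit_bucket(5), 3); // cell size 8
    println!("ok");
}
```

This is why the loop above pushes new DataBuckets with capacity `(1 << i) + MIN_ELEMS`: each successive data bucket doubles its cell size so every slice length has a best-fit home.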

@aeyakovenko
Member Author

aeyakovenko commented Jun 26, 2021

Some benchmarks. Single insert isn't super optimized yet, but it's still pretty good.

  • on my intel mac
test bucket_map_bench_hashmap_baseline       ... bench:         269 ns/iter (+/- 134)
test bucket_map_bench_insert_1               ... bench:         770 ns/iter (+/- 116)
test bucket_map_bench_insert_16x512          ... bench:  15,533,122 ns/iter (+/- 44,959,341)
test bucket_map_bench_insert_16x512_baseline ... bench:  23,809,525 ns/iter (+/- 4,480,271)

It seems to start outperforming RwLock<HashMap> with 16+ threads/buckets and 256+ items per thread.

@aeyakovenko
Member Author

aeyakovenko commented Jun 26, 2021

4096 buckets on Lemond with each thread adding 4096 items:

running 1 test
time: 471ns per item
dir_size: "/tmp/bucket_map_test_mt" 2712141824
dir_size: "/mnt/data/aeyakovenko/bucket_map_test_mt" 2774007808

Overhead per account: (2774007808 + 2712141824)/(4096 * 4096) = 327 bytes
(that's pretty decent I think)

keys.par_chunks(chunks)
let mut hashes = vec![];
for ix in 0..self.accounts_index.account_maps.num_buckets() {
let keys = self.accounts_index.account_maps.keys(ix);
Member Author

@aeyakovenko aeyakovenko Jun 27, 2021


@ryoqun @sakridge @jeffwashington

wdyt, would this work here? The initial design had a snapshot slot that would keep the data behind that slot from being cleaned up as the index is updated, so we shouldn't need a global lock on all the keys to do the snapshot.

Contributor

@carllin carllin Jun 27, 2021


@brooksprumo is good to reference here, as he's been working on snapshotting related logic

Contributor

@jeffwashington jeffwashington Jun 28, 2021


I think this code path is already dead except for debug validation. We don’t want to use account index for hash calculation.
Even so, does bucket[x] have keys that are all < the keys in bucket[x+1]? This particular algorithm is written to assume the keys vector is in sorted order. The btree provides that currently. If not, we have to save key and hash from this scan, sort by key, then hash.

Contributor

@brooksprumo brooksprumo Jun 28, 2021


I can confirm what @jeffwashington said that on the snapshot side this is never called/used. The snapshot code that calls AccountsDb::calculate_accounts_hash_without_index() is inside an if that is always false.

There are hooks for testing, but none of the tests actually exercise this code path. I assume in the past this was different, but as of now AccountsDb::calculate_accounts_hash_without_index() is never called in the snapshot code (nor anything that calls these snapshot functions).

Relevant snapshot functions:
snapshot_utils::snapshot_bank()
snapshot_utils::package_snapshot()


Member Author

@aeyakovenko aeyakovenko Jun 28, 2021


@jeffwashington

Even so, does bucket[x] have keys that are all < the keys in bucket[x+1]?

Yep. that is guaranteed. https://github.com/solana-labs/solana/pull/18179/files#diff-d90ce8cbcea9459dbca60678b48f6d6d2816b8d4de8b8ebfd5c434b819f153b3R97

At boot, the validator creates a fixed number of buckets. On lemond with 128 cores, I see performance improve up to 4096 buckets.

@jeffwashington
Contributor

jeffwashington commented Jun 30, 2021

fyi, I am working on this

@jeffwashington
Contributor

jeffwashington commented Jun 30, 2021

@aeyakovenko , @sakridge
fyi that pubkeys are almost perfectly uniform currently on mnb - at least into the highest order 16 bins.

@jeffwashington
Contributor

jeffwashington commented Jun 30, 2021

ledger-tool verify test

wiggins, mnb, 61M accounts, 71M accounts with duplicates, single normal folder for files
master:             generate_index total_us= 16595216i scan_stores_us= 16823797i
new accounts_index: generate_index total_us=825187084i scan_stores_us=276917451i

@stale

stale bot commented Jul 8, 2021

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the stale label Jul 8, 2021
@stale

stale bot commented Jul 15, 2021

This stale pull request has been automatically closed. Thank you for your contributions.

@stale stale bot closed this Jul 15, 2021
@jeffwashington
Contributor

jeffwashington commented Jul 21, 2021

Just a comment here since it's stale: this is still actively being prototyped.

5 participants