Commit
persistent-state: Use block based SST format
We saw an issue with the plain table format: if compaction of a large enough
table failed and we then restarted, the retried compaction would end up
creating SST files larger than the upper bound that PlainTableReader can open
(2 GiB), despite our options being configured to cap SST file size at 256 MiB.

After running some experiments, it seems that the block-based format compacts
much more quickly (10x+ for the 125 GiB table tested).

Running our persistent_state benchmark on the new format also showed a 7-28%
speedup in put/get performance across the board:

RockDB get primary key/lookup
	time:   [15.188 us 15.198 us 15.208 us]
	change: [-10.775% -10.651% -10.526%] (p = 0.00 < 0.05)
RockDB get secondary key/lookup_multi
	time:   [1.4019 ms 1.4028 ms 1.4037 ms]
	change: [-23.846% -23.722% -23.573%] (p = 0.00 < 0.05)
RockDB get secondary key/lookup
	time:   [1.4236 ms 1.4252 ms 1.4268 ms]
	change: [-23.066% -22.942% -22.795%] (p = 0.00 < 0.05)
RockDB get secondary unique key/lookup_multi
	time:   [28.533 us 28.558 us 28.584 us]
	change: [-11.832% -11.638% -11.465%] (p = 0.00 < 0.05)
RockDB get secondary unique key/lookup
	time:   [27.850 us 27.869 us 27.888 us]
	change: [-7.8900% -7.7159% -7.5518%] (p = 0.00 < 0.05)
RocksDB lookup_range/lookup_range
	time:   [41.017 ms 41.047 ms 41.075 ms]
	change: [-28.254% -28.183% -28.115%] (p = 0.00 < 0.05)
RocksDB with large strings/lookup_range
	time:   [658.64 ms 677.51 ms 702.72 ms]
	change: [-16.058% -12.452% -9.1835%] (p = 0.00 < 0.05)
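The core of the change is replacing the plain-table factory with a block-based
one. A minimal standalone sketch of that configuration using the rust-rocksdb
crate (the database path and the surrounding `main` scaffolding here are
illustrative, not taken from the repo):

```rust
use rocksdb::{BlockBasedOptions, Options, DB};

fn main() {
    let mut opts = Options::default();
    opts.create_if_missing(true);

    // Block-based SSTs with a 10-bits-per-key bloom filter, mirroring the
    // settings this commit applies for hash map indices. Unlike the plain
    // table format, block-based tables have no 2 GiB reader limit, and the
    // index and filter blocks live inside the SST file, so reopening the
    // database does not require recomputing them.
    let mut block_opts = BlockBasedOptions::default();
    block_opts.set_bloom_filter(10.0, true);
    opts.set_block_based_table_factory(&block_opts);

    // Hypothetical path, for illustration only.
    let db = DB::open(&opts, "/tmp/example-db").expect("open failed");
    db.put(b"key", b"value").expect("put failed");
    assert_eq!(db.get(b"key").unwrap().as_deref(), Some(&b"value"[..]));
}
```

Note that `set_plain_table_factory` and `set_block_based_table_factory` are
mutually exclusive table-factory choices on the same `Options` value; setting
the block-based factory is all that is needed to drop the plain-table path.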

Refs: REA-2863
Change-Id: I15dc94d0f9346007e6a255483cf2565bae12a571
Reviewed-on: https://gerrit.readyset.name/c/readyset/+/5206
Tested-by: Buildkite CI
Reviewed-by: Griffin Smith <griffin@readyset.io>
lukoktonos committed Jun 21, 2023
1 parent 5fb5e88 commit 8a32b30
Showing 1 changed file with 5 additions and 15 deletions.
20 changes: 5 additions & 15 deletions dataflow-state/src/persistent_state.rs
@@ -85,8 +85,8 @@ use readyset_data::DfValue;
 use readyset_errors::{internal_err, invariant, ReadySetError, ReadySetResult};
 use readyset_util::intervals::BoundPair;
 use rocksdb::{
-    self, ColumnFamilyDescriptor, CompactOptions, EncodingType, IteratorMode,
-    PlainTableFactoryOptions, SliceTransform, WriteBatch, DB,
+    self, BlockBasedOptions, ColumnFamilyDescriptor, CompactOptions, IteratorMode, SliceTransform,
+    WriteBatch, DB,
 };
use serde::de::DeserializeOwned;
use serde::{Deserialize, Serialize};
@@ -1182,19 +1182,9 @@ impl IndexParams {
             // For hash map indices, optimize for point queries and in-prefix range iteration, but
             // don't allow cross-prefix range iteration.
             IndexType::HashMap => {
-                opts.set_plain_table_factory(&PlainTableFactoryOptions {
-                    user_key_length: 0, // variable key length
-                    bloom_bits_per_key: 10,
-                    hash_table_ratio: 0.75,
-                    index_sparseness: 16,
-                    huge_page_tlb_size: 0,
-                    encoding_type: EncodingType::default(),
-                    full_scan_mode: false,
-                    // Store the plain table index and bloom filter in the table file itself. This
-                    // speeds up re-opening the db on restart *significantly* (up to multiple hours
-                    // for large tables) by avoiding recomputing the index on startup
-                    store_index_in_file: true,
-                });
+                let mut block_opts = BlockBasedOptions::default();
+                block_opts.set_bloom_filter(10.0, true);
+                opts.set_block_based_table_factory(&block_opts);

                 // We're either going to be doing direct point lookups, in the case of unique
                 // indexes, or iterating within a range.
