Commit
persistent-state: Use block based SST format
We saw an issue with the plain table format: if compaction of a large enough
table failed and we then restarted, the retried compaction would end up
creating SST files larger than the upper bound that PlainTableReader can open
(2 GiB), despite our options being configured to cap SST file size at 256 MiB.

After running some experiments, it seems that the block-based format compacts
much more quickly (10x+ for the 125 GiB table tested).

Running our persistent_state benchmark on the new format also showed a 7-28%
speedup in put/get performance across the board:

RockDB get primary key/lookup
	time:   [15.188 us 15.198 us 15.208 us]
	change: [-10.775% -10.651% -10.526%] (p = 0.00 < 0.05)
RockDB get secondary key/lookup_multi
	time:   [1.4019 ms 1.4028 ms 1.4037 ms]
	change: [-23.846% -23.722% -23.573%] (p = 0.00 < 0.05)
RockDB get secondary key/lookup
	time:   [1.4236 ms 1.4252 ms 1.4268 ms]
	change: [-23.066% -22.942% -22.795%] (p = 0.00 < 0.05)
RockDB get secondary unique key/lookup_multi
	time:   [28.533 us 28.558 us 28.584 us]
	change: [-11.832% -11.638% -11.465%] (p = 0.00 < 0.05)
RockDB get secondary unique key/lookup
	time:   [27.850 us 27.869 us 27.888 us]
	change: [-7.8900% -7.7159% -7.5518%] (p = 0.00 < 0.05)
RocksDB lookup_range/lookup_range
	time:   [41.017 ms 41.047 ms 41.075 ms]
	change: [-28.254% -28.183% -28.115%] (p = 0.00 < 0.05)
RocksDB with large strings/lookup_range
	time:   [658.64 ms 677.51 ms 702.72 ms]
	change: [-16.058% -12.452% -9.1835%] (p = 0.00 < 0.05)
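The core of the change is replacing the plain-table factory with a block-based
one. A minimal standalone sketch of that configuration using the rust-rocksdb
crate (the database path and the surrounding `main` scaffolding here are
illustrative, not taken from the repo):

```rust
use rocksdb::{BlockBasedOptions, Options, DB};

fn main() {
    let mut opts = Options::default();
    opts.create_if_missing(true);

    // Block-based SSTs with a 10-bits-per-key bloom filter, mirroring the
    // settings this commit applies for hash map indices. Unlike the plain
    // table format, block-based tables have no 2 GiB reader limit, and the
    // index and filter blocks live inside the SST file, so reopening the
    // database does not require recomputing them.
    let mut block_opts = BlockBasedOptions::default();
    block_opts.set_bloom_filter(10.0, true);
    opts.set_block_based_table_factory(&block_opts);

    // Hypothetical path, for illustration only.
    let db = DB::open(&opts, "/tmp/example-db").expect("open failed");
    db.put(b"key", b"value").expect("put failed");
    assert_eq!(db.get(b"key").unwrap().as_deref(), Some(&b"value"[..]));
}
```

Note that `set_plain_table_factory` and `set_block_based_table_factory` are
mutually exclusive table-factory choices on the same `Options` value; setting
the block-based factory is all that is needed to drop the plain-table path.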

Refs: REA-2863
Change-Id: I15dc94d0f9346007e6a255483cf2565bae12a571
Reviewed-on: https://gerrit.readyset.name/c/readyset/+/5206
Tested-by: Buildkite CI
Reviewed-by: Griffin Smith <griffin@readyset.io>
lukoktonos committed Jun 21, 2023
1 parent 5fb5e88 commit 8a32b30
Showing 1 changed file with 5 additions and 15 deletions.
20 changes: 5 additions & 15 deletions dataflow-state/src/persistent_state.rs
@@ -85,8 +85,8 @@ use readyset_data::DfValue;
 use readyset_errors::{internal_err, invariant, ReadySetError, ReadySetResult};
 use readyset_util::intervals::BoundPair;
 use rocksdb::{
-    self, ColumnFamilyDescriptor, CompactOptions, EncodingType, IteratorMode,
-    PlainTableFactoryOptions, SliceTransform, WriteBatch, DB,
+    self, BlockBasedOptions, ColumnFamilyDescriptor, CompactOptions, IteratorMode, SliceTransform,
+    WriteBatch, DB,
 };
use serde::de::DeserializeOwned;
use serde::{Deserialize, Serialize};
@@ -1182,19 +1182,9 @@ impl IndexParams {
             // For hash map indices, optimize for point queries and in-prefix range iteration, but
             // don't allow cross-prefix range iteration.
             IndexType::HashMap => {
-                opts.set_plain_table_factory(&PlainTableFactoryOptions {
-                    user_key_length: 0, // variable key length
-                    bloom_bits_per_key: 10,
-                    hash_table_ratio: 0.75,
-                    index_sparseness: 16,
-                    huge_page_tlb_size: 0,
-                    encoding_type: EncodingType::default(),
-                    full_scan_mode: false,
-                    // Store the plain table index and bloom filter in the table file itself. This
-                    // speeds up re-opening the db on restart *significantly* (up to multiple hours
-                    // for large tables) by avoiding recomputing the index on startup
-                    store_index_in_file: true,
-                });
+                let mut block_opts = BlockBasedOptions::default();
+                block_opts.set_bloom_filter(10.0, true);
+                opts.set_block_based_table_factory(&block_opts);

                 // We're either going to be doing direct point lookups, in the case of unique
                 // indexes, or iterating within a range.
