
Devirtualization And Key Prefix Cache Benchmark

ZengJingtao edited this page Jul 5, 2023 · 4 revisions

Performance measurement of devirtualization and prefix cache

Devirtualization: eliminate virtual function calls; in this optimization, mainly the Comparator's virtual function calls

Prefix cache: extract a fixed-length (8-byte) prefix from each Key into a separate array, and binary-search that array using integer comparisons

For more information, please refer to Devirtualization And Key Prefix Cache Principle

The test machine has dual E5-2682 V4 CPUs (32 cores, 64 threads in total) and 256 GB of DDR4-2400 memory. Newer, better machines will show better performance.

1. Test parameters

Only the key parameters are listed here, along with the reasons they are set this way. For simplicity, Distributed Compaction was not used in this test.

1.1. db_bench parameters

| parameter name | parameter value | effect |
|---|---|---|
| num | 100000000 | Set num large enough that the DB's various metadata cannot all fit in the CPU cache |
| key_size | 8 | Makes the keys written by db_bench consecutive big-endian integers |
| value_size | 20 | Kept small to minimize the performance impact of memory allocation and memcpy |
| json | db_bench_enterprise.yaml | Uses ToplingZipTable; consecutive big-endian integers select the UintIndex_AllOne index, the fastest and smallest index in ToplingDB |
| json | db_bench_community.yaml | Uses BlockBasedTable with ample BlockCache; much slower than ToplingZipTable |

1.2. db_bench_*.yaml common parameters

| parameter name | parameter value | effect |
|---|---|---|
| target_file_size_base | 512K | Set very small to generate many files, which highlights the effect of the key prefix cache |
| max_bytes_for_level_base | 4M | Give each level a sufficient number of files |
| max_bytes_for_level_multiplier | 4 | Give each level a sufficient number of files |
| Statistics::stats_level | kDisableAll | Disable statistics collection |

1.3. ToplingZipTable parameters in db_bench_enterprise.yaml

| parameter name | parameter value | effect |
|---|---|---|
| acceptCompressionRatio | 0.8 | When testing a larger value_size, this can be set small so that ToplingZipTable gives up compressing values |
| minDictZipValueSize | 30 | Larger than the value_size=20 used in this test, so ToplingZipTable gives up compressing values |
| minPreadLen | -1 | Values are always read via mmap |
| enableStatistics | false | Disable statistics to avoid unnecessary overhead |

1.4. BlockBasedTable parameters in db_bench_community.yaml

The data at each level is uncompressed and there is no BloomFilter: given the key distribution in this test, the SST file's metadata alone determines whether a key is in the file, so no BloomFilter is needed.

2. Conditional compiling TOPLINGDB_NO_OPT_FindFileInRange

In order to reflect our optimization effect, a new conditional compilation macro TOPLINGDB_NO_OPT_FindFileInRange is added to control whether to enable our optimization.

```diff
 int FindFileInRange(const InternalKeyComparator& icmp,
                     const LevelFilesBrief& file_level, const Slice& key,
                     uint32_t left, uint32_t right) {
+#ifdef TOPLINGDB_NO_OPT_FindFileInRange
+  #pragma message "TOPLINGDB_NO_OPT_FindFileInRange is defined, intended for benchmark comparing"
+  // here is upstream rocksdb code
   auto cmp = [&](const FdWithKeyRange& f, const Slice& k) -> bool {
     return icmp.InternalKeyComparator::Compare(f.largest_key, k) < 0;
   };
   const auto& b = file_level.files;
   return static_cast<int>(std::lower_bound(b + left, b + right, key, cmp) - b);
+#else // ToplingDB Devirtualization and Key Prefix Cache optimization
+  if (icmp.IsForwardBytewise()) {
+    BytewiseCompareInternalKey cmp;
+    return (int)FindFileInRangeTmpl(cmp, file_level, key, left, right);
+  }
+  else if (icmp.IsReverseBytewise()) {
+    RevBytewiseCompareInternalKey cmp;
+    return (int)FindFileInRangeTmpl(cmp, file_level, key, left, right);
+  }
+  else {
+    FallbackVirtCmp cmp{&icmp};
+    return (int)FindFileInRangeTmpl(cmp, file_level, key, left, right);
+  }
+#endif
 }
```

3. Compiling

Before each compilation, touch db/version_set.cc (the file containing FindFileInRange) so that it is recompiled with the intended macro setting.

3.1. Disable FindFileInRange optimization

```shell
touch db/version_set.cc
make db_bench -j`nproc` OPTIMIZE_LEVEL='-O3' DEBUG_LEVEL=0 \
     EXTRA_CXXFLAGS=-DTOPLINGDB_NO_OPT_FindFileInRange
```

3.2. Enable FindFileInRange optimization

```shell
touch db/version_set.cc
make db_bench -j`nproc` OPTIMIZE_LEVEL='-O3' DEBUG_LEVEL=0
```

4. Use db_bench_enterprise.yaml

4.1. Fill the data with db_bench

Only fill the data and execute flush, ensuring that every level of the LSM tree has data. Whether the FindFileInRange optimization is enabled does not affect data filling.

```
$ ./db_bench -json=sideplugin/rockside/sample-conf/db_bench_enterprise.yaml \
             -key_size=8 -value_size=20 -batch_size=10 -num=100000000 \
             -disable_wal=true -benchmarks=fillseq,flush

Set seed to 1672971774512063 because --seed was 0
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
Integrated BlobDB: blob cache disabled
RocksDB:    version 7.10.0
Date:       Fri Jan  6 10:22:54 2023
CPU:        64 * Intel(R) Xeon(R) CPU E5-2682 v4 @ 2.50GHz
CPUCache:   40960 KB
Keys:       8 bytes each (+ 0 bytes user-defined timestamp)
Values:     20 bytes each (10 bytes after compression)
Entries:    100000000
Prefix:     0 bytes
Keys per prefix:    0
RawSize:    2670.3 MB (estimated)
FileSize:   1716.6 MB (estimated)
Write rate: 0 bytes/second
Read rate: 0 ops/second
Compression: Snappy
Compression sampling rate: 0
Memtablerep: CSPPMemTabFactory
Perf Level: 1
------------------------------------------------
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
Integrated BlobDB: blob cache disabled
DB path: [/dev/shm/db_bench_enterprise]
fillseq: 0.452 micros/op 2211711 ops/sec 45.214 seconds 100000000 operations; 59.1 MB/s
flush memtable
```

4.2. The shape of the LSM tree after filling the data

```
** Compaction Stats [default] **
Level    Files      Size     Score    omitted...
-----------------------------------   omitted...
  L0      1/0       1.64 MB   0.4     omitted...
  L1     17/0       3.14 MB   0.8     omitted...
  L2     76/0      15.26 MB   1.0     omitted...
  L3    328/0      63.70 MB   1.0     omitted...
  L4   1336/0     255.22 MB   1.0     omitted...
  L5   5392/0    1023.65 MB   1.0     omitted...
  L6   2952/0     558.28 MB   0.0     omitted...
 Sum  10102/0       1.88 GB   0.0     omitted...
```

4.3. Confirm that the UintIndex_AllOne index is used

You can view the information of a single SST through WebView, and you can see that ToplingZipTable does use the UintIndex_AllOne index.

The most commonly used primary key in MyTopling (MySQL), the auto_increment id, uses the UintIndex family of indexes; if there are no holes in the ids, it is UintIndex_AllOne. We deliberately select this index by setting key_size=8 in db_bench.

*(WebView screenshot: SST details showing the UintIndex_AllOne index)*

4.4. Read performance benchmark

```shell
$ ./db_bench -json sideplugin/rockside/sample-conf/db_bench_enterprise.yaml \
             -num=100000000 -benchmarks=readrandom -use_existing_db=true
```

The db_bench command performs a separate readrandom test, and use_existing_db=true means to use the previously populated data.

`@@` in the flame graphs below represents an anonymous namespace.

Disable FindFileInRange optimization

```
# Some spaces removed and lines wrapped in the results below
------------------------------------------------
DB path: [/dev/shm/db_bench_enterprise]
readrandom: 1.544 micros/op 647552 ops/sec 154.428 seconds 100000000 operations;
           17.3 MB/s (100000000 of 100000000 found)
```

Flame graph (open the image separately to zoom down to individual functions): *(flame graph image)*

Enable FindFileInRange optimization

```
# Some spaces removed in the results below
------------------------------------------------
DB path: [/dev/shm/db_bench_enterprise]
readrandom: 1.009 micros/op 991406 ops/sec 100.867 seconds 100000000 operations;
           26.5 MB/s (100000000 of 100000000 found)
```

Flame graph (open the image separately to zoom down to individual functions): *(flame graph image)*

explanation

On average, the optimized version of FindFileInRange takes only 65% as long as the original version, and from the flame graphs we can calculate the performance improvement of FindFileInRange itself.

The time spent outside FindFileInRange is the same in both tests, so dividing FindFileInRange's time by that common remainder gives its relative cost, and the two relative costs can be compared directly. The result: FindFileInRange itself became 4.25 times faster. The calculation is roughly at the level of a fifth-grade math word problem.

| Optimization | time share | relative cost | multiple |
|---|---|---|---|
| disable | 42.68% | 42.68/(100-42.68) = 0.7446 | 4.25 = 0.7446/0.1751 |
| enable | 14.90% | 14.90/(100-14.90) = 0.1751 | 1.00 |

5. Use db_bench_community.yaml

5.1. Fill the data with db_bench

The parameters are exactly the same as for db_bench_enterprise, except that db_bench_community.yaml is used.

```
$ ./db_bench -json=sideplugin/rockside/sample-conf/db_bench_community.yaml \
             -key_size=8 -value_size=20 -batch_size=10 -num=100000000 \
             -disable_wal=true -benchmarks=fillseq,flush
Set seed to 1672996553785931 because --seed was 0
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
Integrated BlobDB: blob cache disabled
RocksDB:    version 7.10.0
Date:       Fri Jan  6 17:15:53 2023
CPU:        64 * Intel(R) Xeon(R) CPU E5-2682 v4 @ 2.50GHz
CPUCache:   40960 KB
Keys:       8 bytes each (+ 0 bytes user-defined timestamp)
Values:     20 bytes each (10 bytes after compression)
Entries:    100000000
Prefix:     0 bytes
Keys per prefix:    0
RawSize:    2670.3 MB (estimated)
FileSize:   1716.6 MB (estimated)
Write rate: 0 bytes/second
Read rate: 0 ops/second
Compression: Snappy
Compression sampling rate: 0
Memtablerep: CSPPMemTabFactory
Perf Level: 1
------------------------------------------------
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
Integrated BlobDB: blob cache disabled
DB path: [/dev/shm/db_bench_community]
fillseq      :       0.449 micros/op 2225012 ops/sec 44.944 seconds 100000000 operations;   59.4 MB/s
flush memtable
```

We can see that, because the same MemTable is used, write performance is essentially identical.

5.2. The shape of the LSM tree after filling the data

```
** Compaction Stats [default] **
Level    Files      Size     Score    ...
-----------------------------------   ...
  L0      1/0       1.16 MB   0.3     ...
  L1      1/0       3.22 MB   0.8     ...
  L2      4/0      12.89 MB   0.8     ...
  L3     19/0      61.16 MB   1.0     ...
  L4     79/0     254.35 MB   1.0     ...
  L5    318/0    1023.89 MB   1.0     ...
  L6    555/0       1.74 GB   0.0     ...
 Sum    977/0       3.07 GB   0.0     ...
```

There are fewer files than with ToplingZipTable, probably because TableBuilder::EstimateFileSize() is implemented differently: ToplingZipTable returns the sum of the key and value sizes passed to Add, while BlockBasedTable has its own calculation.

With fewer files, FindFileInRange itself takes less time, so BlockBasedTable gains a small advantage here.

Neither is compressed, yet BlockBasedTable's files are larger. This is because ToplingZipTable's index here is UintIndex_AllOne, whose space usage is O(1): the index size is independent of the data, so it effectively occupies no space.

For the calculation in BlockBasedTableBuilder::EstimateFileSize(), ToplingDB has an option use_raw_size_as_estimated_file_size, defined as an option of the BlockBasedTable factory in json/yaml. When true, the calculation matches ToplingZipTable's; in this test we leave it undefined (default false) so that BlockBasedTable behaves the same as upstream RocksDB.
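As a rough illustration only (the exact yaml layout of the factory section is not shown here and this fragment is hypothetical), enabling the option would look something like:

```yaml
# Hypothetical fragment of db_bench_community.yaml: make BlockBasedTable
# estimate file size the same way ToplingZipTable does. Left at the
# default (false) in this benchmark to match upstream RocksDB behavior.
table_factory:
  block_based:
    class: BlockBasedTable
    params:
      use_raw_size_as_estimated_file_size: true
```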

5.3. View BlockIndex

ToplingDB's WebView can also show SST meta information for BlockBasedTable. *(WebView screenshot)*

5.4. Read performance benchmark

```shell
$ ./db_bench -json sideplugin/rockside/sample-conf/db_bench_community.yaml \
             -num=100000000 -benchmarks=readrandom -use_existing_db=true
```

The db_bench command performs a separate readrandom test, and use_existing_db=true means to use the previously populated data.

`@@` in the flame graphs below represents an anonymous namespace.

Disable FindFileInRange optimization

```
# Some spaces removed and lines wrapped in the results below
------------------------------------------------
DB path: [/dev/shm/db_bench_community]
readrandom: 4.182 micros/op 239138 ops/sec 418.168 seconds 100000000 operations;
            6.4 MB/s (100000000 of 100000000 found)
```

Flame graph (open the image separately to zoom down to individual functions): *(flame graph image)*

Enable FindFileInRange optimization

```
# Some spaces removed in the results below
------------------------------------------------
DB path: [/dev/shm/db_bench_community]
readrandom: 3.644 micros/op 274432 ops/sec 364.389 seconds 100000000 operations;
            7.3 MB/s (100000000 of 100000000 found)
```

Flame graph (open the image separately to zoom down to individual functions): *(flame graph image)*

Explanation

  1. Neither BlockBasedTable nor ToplingZipTable actually uses compression. BlockBasedTable requires a BlockCache, and the filesystem PageCache is also involved when reading files (unless direct I/O is used)
  2. ToplingZipTable does not require BlockCache, only PageCache of the file system

Even though BlockBasedTable is configured with an 8 GB BlockCache, almost three times the total data volume, it is still much slower than ToplingZipTable.

Below we use the same method to calculate the performance difference between the optimized version of FindFileInRange and the original version:

| Optimization | time share | relative cost | multiple |
|---|---|---|---|
| Disable | 11.65% | 11.65/(100-11.65) = 0.1319 | 4.19 = 0.1319/0.0315 |
| Enable | 3.05% | 3.05/(100-3.05) = 0.0315 | 1.00 |

Although db_bench_community with BlockBasedTable is much slower than ToplingZipTable, the same method yields a 4.19x improvement from the optimization; the gap from the 4.25 measured with db_bench_enterprise can be attributed to measurement error.

6. Performance comparison between ToplingZipTable and BlockBasedTable

With the same calculation method used for FindFileInRange, we can also quantify the performance difference between ToplingZipTable and BlockBasedTable.

The time spent outside TableReader::Get is an invariant. Because BlockBasedTable has fewer files, this portion is smaller for it than for ToplingZipTable, which gives ToplingZipTable a slight advantage in the comparison.

BlockBasedTableReader::Get is split into 4 parts in the flame graph (likely a flame-graph tool issue; the 4 parts should be counted together), and their total time share is 37.53 + 37.42 + 2.09 + 2.42 = 79.46%.

| SST | time share | relative cost | multiple |
|---|---|---|---|
| BlockBased | 79.46% | 79.46/(100-79.46) = 3.8685 | 7.19 = 3.8685/0.5378 |
| ToplingZip | 34.97% | 34.97/(100-34.97) = 0.5378 | 1.00 |

Calculating using a non-optimized FindFileInRange flame graph (34.08 + 36.86 + 2.09 = 73.03%):

| SST | time share | relative cost | multiple |
|---|---|---|---|
| BlockBased | 73.03% | 73.03/(100-73.03) = 2.7078 | 8.75 = 2.7078/0.3094 |
| ToplingZip | 23.63% | 23.63/(100-23.63) = 0.3094 | 1.00 |

The calculations put ToplingZipTable's performance at 7.19x and 8.75x that of BlockBasedTable; the spread between the two figures is fairly large and hard to explain.

Note, there are the following prerequisites here:

| Options | ToplingZipTable | BlockBasedTable |
|---|---|---|
| Compression | uncompressed | uncompressed |
| Cache | all hits in PageCache | all hits in BlockCache |
| BloomFilter | never needs a BloomFilter | the SSTs produced by this test do not need a BloomFilter |
| Index | the fastest index, UintIndex_AllOne, is selected | general index |

These test conditions are optimal for both sides. When ToplingZipTable uses its own general-purpose index, NestLoudsTrie, key search is slower; and when compression is enabled, value retrieval is slower.
