
Devirtualization And Key Prefix Cache Principle

ZengJingtao edited this page Jan 30, 2023 · 3 revisions

Devirtualization and prefix caching

1. Background

ToplingDB is a KV storage engine developed by Topling. It was forked from RocksDB, has undergone many modifications, and has fixed many RocksDB bugs. Dozens of these changes have also been sent as Pull Requests to upstream RocksDB, so that users who do not use ToplingDB can benefit as well. RocksDB's acceptance has not been very enthusiastic, but we have indeed tried our best!

This article mainly discusses devirtualization in ToplingDB. In C++, virtual functions are a core feature and the basis of runtime polymorphism. Since polymorphism is resolved at runtime, every virtual call carries a corresponding overhead. At the compiler level there is automatic devirtualization: when the compiler "knows" the actual type of an object, it "knows" which virtual function will be called, so it can call the function directly without going through the vtable, and can even inline it. The final keyword introduced in C++11, besides making code stricter at the language level, also helps the compiler perform this optimization better.
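As a minimal illustration (not RocksDB's actual classes; the names below are hypothetical), marking the concrete class final lets the compiler prove the dynamic type, skip the vtable lookup, and even inline the call:

```cpp
#include <string>

// Abstract interface: every call through a Comparator* normally goes
// through the vtable.
struct Comparator {
  virtual ~Comparator() = default;
  virtual int Compare(const std::string& a, const std::string& b) const = 0;
};

// `final` tells the compiler no further overrides exist, so calls made
// through the concrete type can be devirtualized.
struct BytewiseComparatorImpl final : Comparator {
  int Compare(const std::string& a, const std::string& b) const override {
    return a.compare(b);
  }
};

// Here the static type is the `final` class: the compiler can call
// Compare() directly (and inline it) instead of indirecting via the vtable.
inline int CompareDirect(const BytewiseComparatorImpl& c,
                         const std::string& a, const std::string& b) {
  return c.Compare(a, b);
}
```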

In RocksDB, virtual functions are used everywhere. A typical example is the Comparator, which determines the sort order of keys. The default comparator is BytewiseComparator; another commonly used one is ReverseBytewiseComparator. The devirtualization described in this article takes Comparator as its example.

2. Comparator vs Encoding

In most cases we use the default BytewiseComparator; a custom comparator is only needed in specific scenarios.

On the other hand, even in those special scenarios that seem to need a custom Comparator, we still have another solution: instead of customizing the Comparator, encode the key so that the default BytewiseComparator produces the desired order.

  1. In Todis, we use such an encoding scheme.
  2. In MyTopling, the encoding scheme of MyRocks is reused, which also works with BytewiseComparator.

3. Why BytewiseComparator

BytewiseComparator behaves like the comparison of std::string; in form, it is nothing special compared with the various custom comparators. In dialectics, form and content are an important pair of basic categories: content is the sum of all internal elements of a thing, and form is the structure and organization of those elements.

So, looking at the content, let's see what is special about BytewiseComparator:

// Reversing the positions of x and y yields ReverseBytewiseComparator
int compare(Slice x, Slice y) {
  int c = memcmp(x.data(), y.data(), std::min(x.size(), y.size()));
  if (c != 0) return c;
  return x.size() < y.size() ? -1 :
         x.size() > y.size() ? +1 : 0;
}

First: it is simple and uses exactly the necessary information, no more and no less; from the perspective of information theory, its entropy is the lowest.

Second: it is efficient. It uses system primitives (memcmp) throughout, and no other "universal" comparison method can be more efficient.

Further, from the perspective of the concept's extension, the key order defined by BytewiseComparator is the "natural" order of many kinds of data:

  1. Unsigned integers stored in BigEndian, of any width (8-bit, 16-bit, 32-bit, 64-bit, ...)
  2. Word lists (including but not limited to English words)

This is also why the order defined by BytewiseComparator is called "lexicographic order".
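A small sketch (assumed helper names, not ToplingDB code) of why BigEndian storage makes BytewiseComparator the natural order for unsigned integers: memcmp on the big-endian bytes agrees with numeric comparison:

```cpp
#include <cstdint>
#include <cstring>

// Store a 64-bit unsigned integer most-significant byte first.
inline void PutBigEndian64(uint64_t v, unsigned char out[8]) {
  for (int i = 7; i >= 0; --i) {  // out[0] receives the most significant byte
    out[i] = static_cast<unsigned char>(v & 0xFF);
    v >>= 8;
  }
}

// Compare two integers through their big-endian encodings: the memcmp
// result has the same sign as the numeric comparison.
inline int CompareEncoded(uint64_t x, uint64_t y) {
  unsigned char bx[8], by[8];
  PutBigEndian64(x, bx);
  PutBigEndian64(y, by);
  return std::memcmp(bx, by, 8);
}
```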

4. Data structure

A data structure comprises both a storage layout and the algorithms that process it. RocksDB is the de facto standard of the storage engine industry; its data structures are exquisitely implemented and in constant improvement and evolution, which makes a breakthrough through mere "improvement" difficult.

Therefore, we must find another way. The biggest core competitiveness of ToplingDB lies in its core performance components:

  1. Topling CSPP MemTable & CSPP WBWI
  2. Topling ZipTable

Their core idea is: use special data structures (Trie & Succinct) and support only BytewiseComparator, all for performance.

Therefore, Todis and MyTopling use encoding schemes equivalent to BytewiseComparator, which adapt perfectly to ToplingDB's performance components.

5. Hotspot transfer

Code consists of multiple steps, and each step consumes a certain amount of time; among them, the hotspot is the part most worth optimizing. For example, suppose a hotspot consumes 80% of the execution time, and we speed that hotspot up by 10x. Looking at the overall effect of the optimization, it seems acceptable:

  Overall program speedup: 100/28 = 3.57
  Optimized hotspot's share of the new total time: 8/28 = 28.6%
  The original remaining 20% now accounts for: 20/28 = 71.4%

What if the hotspot were made 100x faster? The result is somewhat disappointing:

  Overall program speedup: 100/20.8 = 4.81
  Optimized hotspot's share of the new total time: 0.8/20.8 = 3.85%
  The original remaining 20% now accounts for: 20/20.8 = 96.2%

Therefore, after a hotspot is optimized, code that was not a hotspot before becomes the new hotspot, and the more thorough the optimization, the more prominent the new hotspot!
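The arithmetic above can be sketched as a small helper (hypothetical names, not from the ToplingDB code base):

```cpp
// Model: total execution time `total`, of which `hotspot` is spent in the
// hot code; the hot code is sped up by `factor`, the rest is unchanged.
struct HotspotResult {
  double overall_speedup;  // old total time / new total time
  double hotspot_share;    // optimized hotspot's share of the new total
  double remainder_share;  // unchanged code's share of the new total
};

inline HotspotResult OptimizeHotspot(double total, double hotspot,
                                     double factor) {
  double new_hot = hotspot / factor;
  double new_total = (total - hotspot) + new_hot;
  return {total / new_total,            // e.g. 100/28 = 3.57 for 10x
          new_hot / new_total,          // e.g. 8/28   = 28.6%
          (total - hotspot) / new_total};  // e.g. 20/28 = 71.4%
}
```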

6. Topic: Devirtualization

Comparator is an abstract interface enabling runtime polymorphism, yet at runtime the concrete type is mostly BytewiseComparator, and some of ToplingDB's performance components support only BytewiseComparator. Those components eliminated their own hotspots, and then new hotspots emerged.

In RocksDB's entire execution stack, the Comparator usually sits at the bottom, so it is the most likely place for time to be spent. Besides specific component modules (such as MemTable and SST), the Comparator is also used by many management and scheduling modules. In the flame graph we can see that these management and scheduling modules are the new hotspots, for example:

  1. FindFileInRange: finds the SST files hit in each level of the LSM
  2. MergingIterator: merges data from multiple levels (plus the MemTable)

Their biggest cost comes from Comparator overhead: the mere call-chain overhead is even greater than the necessary computation (memcmp, ...), and this call-chain overhead is the target of our devirtualization. First we must identify under what circumstances we can devirtualize; naturally, these are the BytewiseComparator (and ReverseBytewiseComparator) scenarios mentioned above. This is very simple: each concrete Comparator has its own name, so we can just check the name. As for how to devirtualize, we use templates, achieving devirtualization with very little code modification.

6.1 FindFileInRange: Devirtualization V1

...... (omitted) ......

+template<class Cmp>
+size_t FindFileInRangeTmpl(const FdWithKeyRange* a, size_t lo, size_t hi,
+                           Slice key, Cmp cmp) {
+  while (lo < hi) {
+    size_t mid = (lo + hi) / 2;
+    if (cmp(a[mid].largest_key, key))
+      lo = mid + 1;
+    else
+      hi = mid;
+  }
+  return lo;
+}
+
@@ -96,6 +142,16 @@ int FindFileInRange(const InternalKeyComparator& icmp,
     const Slice& key,  uint32_t left, uint32_t right) {
+  if (IsForwardBytewiseComparator(icmp.user_comparator())) {
+    BytewiseCompareInternalKey cmp;
+    return (int)FindFileInRangeTmpl(file_level.files, left, right, key, cmp);
+  }
+  else if (IsBytewiseComparator(icmp.user_comparator())) {
+    RevBytewiseCompareInternalKey cmp;
+    return (int)FindFileInRangeTmpl(file_level.files, left, right, key, cmp);
+  }
   auto cmp = [&](const FdWithKeyRange& f, const Slice& k) -> bool {
     return icmp.InternalKeyComparator::Compare(f.largest_key, k) < 0;
   };

These are the key code changes; it's that simple! The omitted parts:

  1. BytewiseCompareInternalKey / RevBytewiseCompareInternalKey: inline, STL-style comparators
  2. IsForwardBytewiseComparator / IsBytewiseComparator: as the names imply, they judge the concrete type of the Comparator
The method of devirtualization of MergingIterator is similar.

6.2 FindFileInRange: Devirtualization V2

Soon, we found a new hotspot in the flame graph: IsForwardBytewiseComparator!

What was going on? It turns out that IsForwardBytewiseComparator judges by the Comparator's Name. Because this judgment is executed only once in FindFileInRange, and the Name is compared against a constant string, we had always assumed it was practically zero-cost, but the facts slapped us in the face!

Fortunately, once the problem was discovered, it was very simple to deal with:

- private:
-  size_t timestamp_size_;
+  bool IsForwardBytewise() const noexcept { return 0 == opt_cmp_type_; }
+  bool IsReverseBytewise() const noexcept { return 1 == opt_cmp_type_; }
+  bool IsBytewise() const noexcept { return opt_cmp_type_ <= 1; }
+
+ protected:
+  uint16_t timestamp_size_;
+  // 0: forward bytewise, 1: reverse bytewise, others: unknown
+  uint8_t opt_cmp_type_ = 255;
 };

 // Return a builtin comparator that uses lexicographic byte-wise
@@ -150,12 +156,21 @@ extern const Comparator* BytewiseComparator();
 // ordering.
 extern const Comparator* ReverseBytewiseComparator();

-bool IsForwardBytewiseComparator(const Comparator* cmp);
 bool IsForwardBytewiseComparator(const Slice& name);
-bool IsReverseBytewiseComparator(const Comparator* cmp);
 bool IsReverseBytewiseComparator(const Slice& name);
-
-bool IsBytewiseComparator(const Comparator* cmp);
 bool IsBytewiseComparator(const Slice& name);

+inline bool IsForwardBytewiseComparator(const Comparator* cmp) {
+  assert(cmp->IsForwardBytewise() == IsForwardBytewiseComparator(cmp->Name()));
+  return cmp->IsForwardBytewise();
+}
+inline bool IsReverseBytewiseComparator(const Comparator* cmp) {
+  assert(cmp->IsReverseBytewise() == IsReverseBytewiseComparator(cmp->Name()));
+  return cmp->IsReverseBytewise();
+}
+inline bool IsBytewiseComparator(const Comparator* cmp) {
+  assert(cmp->IsBytewise() == IsBytewiseComparator(cmp->Name()));
+  return cmp->IsBytewise();
+}

We added a member opt_cmp_type_ to record the actual type of the Comparator.

The timestamp_size_ member was added to RocksDB's Comparator long ago (to support user-layer timestamps). Because a timestamp is very short (generally 8 bytes), we changed timestamp_size_ from size_t to uint16_t, so that even with the opt_cmp_type_ member added, the actual size of Comparator does not change (in fact there are still 5 bytes of padding).

Through this modification we even discovered a bug in RocksDB. Although the bug was innocuous, we still submitted a corresponding Pull Request to RocksDB. This also reflects the importance of unit tests: our modifications concern only performance, not behavior, so as long as the existing unit tests pass, the correctness of the code is basically guaranteed.

6.3 FindFileInRange: Devirtualization V3

The parameter key in FindFileInRangeTmpl is always hot (in the L1 cache), but a file's largest_key may not be. This is an excellent scenario for prefetching:

@@ -128,6 +128,7 @@ size_t FindFileInRangeTmpl(const FdWithKeyRange* a, size_t lo, size_t hi,
                            Slice key, Cmp cmp) {
   while (lo < hi) {
     size_t mid = (lo + hi) / 2;
+    __builtin_prefetch(a[mid].largest_key.data_);
     if (cmp(a[mid].largest_key, key))
       lo = mid + 1;
     else

Between the prefetch and the memcmp access to largest_key there is plenty of code to execute (after inlining), which effectively hides the cache-miss latency.

6.4 FindFileInRange: Devirtualization V4

So far, FindFileInRangeTmpl still calls memcmp, and the content of file.largest_key is still accessed indirectly through pointers. How can it be optimized further?

We know that BytewiseComparator always compares from front to back: as soon as the prefixes differ at some point, the remaining content need not be compared. Therefore, we can apply a fixed-length prefix search optimization:

  1. When comparing the fixed prefixes, do not call memcmp: the fixed prefixes are 8 bytes, so compare them as uint64, fully inlined.
  2. Copy the fixed prefixes into a separate uint64 array, parallel to the file array, eliminating the indirect access.
  3. First search the equal range in the uint64 array, then do the exact search; CPU cache utilization is maximized when searching the uint64 array.

@@ -113,6 +113,9 @@ struct BytewiseCompareInternalKey {
     if (x.size_ != y.size_) return x.size_ < y.size_;
     return GetUnalignedU64(x.data_ + n) > GetUnalignedU64(y.data_ + n);
   }
+  FORCE_INLINE bool operator()(uint64_t x, uint64_t y) const noexcept { return x < y; }
 };
@@ -122,12 +125,31 @@ struct RevBytewiseCompareInternalKey {
     if (x.size_ != y.size_) return x.size_ > y.size_;
     return GetUnalignedU64(x.data_ + n) > GetUnalignedU64(y.data_ + n);
   }
+  FORCE_INLINE bool operator()(uint64_t x, uint64_t y) const noexcept { return x > y; }
 };
 template<class Cmp>
-size_t FindFileInRangeTmpl(const FdWithKeyRange* a, size_t lo, size_t hi,
+size_t FindFileInRangeTmpl(const LevelFilesBrief& brief, size_t lo, size_t hi,
                            Slice key, Cmp cmp) {
+  const uint64_t* pxcache = brief.prefix_cache;
+  const uint64_t  key_prefix = HostPrefixCache(key);
+  const FdWithKeyRange* a = brief.files;
+  size_t mid;
+  while (lo < hi) {
+    mid = (lo + hi) / 2;
+    if (cmp(pxcache[mid], key_prefix))
+      lo = mid + 1;
+    else if (cmp(key_prefix, pxcache[mid]))
+      hi = mid;
+    else
+      goto exact_search;
+  }
+  return lo;
+
   while (lo < hi) {
-    size_t mid = (lo + hi) / 2;
+    mid = (lo + hi) / 2;
+  exact_search:
     __builtin_prefetch(a[mid].largest_key.data_);
     if (cmp(a[mid].largest_key, key))

The implementation of HostPrefixCache is also very simple and straightforward:

inline uint64_t HostPrefixCache(const Slice& ikey) {
  uint64_t data = 0;
  // The memcpy below is optimized away by the compiler, as is __bswap_64
  memcpy(&data, ikey.data_, std::min<size_t>(ikey.size_ - 8, 8));
  if (port::kLittleEndian) return __bswap_64(data); else return data;
}
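To illustrate the principle with a self-contained stand-in (hypothetical name, written for plain user keys; the real HostPrefixCache above works on internal keys and excludes their 8-byte trailer), here is a portable version whose integer order agrees with byte order:

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <string>

// Pack the first up-to-8 bytes of a key into a uint64 whose integer order
// matches memcmp order on those bytes. Short keys are zero-padded, which
// sorts them before longer keys sharing the same prefix; keys with equal
// prefixes fall through to the exact comparison in the second search phase.
inline uint64_t PrefixCache(const std::string& key) {
  unsigned char buf[8] = {0};
  std::memcpy(buf, key.data(), std::min<size_t>(key.size(), 8));
  uint64_t be = 0;
  for (int i = 0; i < 8; ++i)  // interpret the bytes as big-endian;
    be = (be << 8) | buf[i];   // portable on any host endianness
  return be;
}
```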

After so many optimizations, FindFileInRange, which at the beginning exceeded 40% in the flame graph, is now barely visible there!

In fact, the latter two improvements are not devirtualization; they are simply further steps along the road that started from the idea of devirtualization.

For the devirtualization of MergingIterator, the idea and method are similar to those of FindFileInRange, and even the subsequent prefix search optimization follows a similar idea (only the key modifications are listed):

+struct MgHeapElem {
+  IteratorWrapper* iter;
+  uint64_t key_prefix;
+  MgHeapElem() : iter(nullptr), key_prefix(0) {}
+  MgHeapElem(std::nullptr_t) : iter(nullptr), key_prefix(0) {}
+  MgHeapElem(IteratorWrapper* i) : iter(i) { key_prefix = HostPrefixCache(i->key()); }
+  IteratorWrapper* operator->() const noexcept { return iter; }
+};
+inline void UpdateIterElem(MgHeapElem& x) { x.key_prefix = HostPrefixCache(x.iter->key()); }
+inline void UpdateIterElem(IteratorWrapper*) {} // do nothing
@@ -60,50 +85,66 @@ static FORCE_INLINE bool RevBytewiseCompareInternalKey(Slice x,
  struct MaxInlineBytewiseComp {
-  bool operator()(const IteratorWrapper* a, const IteratorWrapper* b) const noexcept {
-    return BytewiseCompareInternalKey(a->key(), b->key());
+  bool operator()(const MgHeapElem& a, const MgHeapElem& b) const noexcept {
+    if (a.key_prefix < b.key_prefix)      return true;
+    else if (a.key_prefix > b.key_prefix) return false;
+    else return BytewiseCompareInternalKey(a->key(), b->key());
   }
   MaxInlineBytewiseComp(const InternalKeyComparator*) {}
 };
@@ -259,6 +300,7 @@ class MergingIterTmpl : public MergingIterator {
     // as the current points to the current record. move the iterator forward.
     current_->Next();
     if (current_->Valid()) {
+      UpdateIterElem(current_);
@@ -301,6 +343,7 @@ class MergingIterTmpl : public MergingIterator {
     current_->Prev();
     if (current_->Valid()) {
+      UpdateIterElem(current_);

Even so, MergingIterator always remains visible in the flame graph, because the functions of its underlying iterators take a lot of time.

7. Conclusion

The compiler can perform the most basic devirtualization for us, but higher-level devirtualization yields much greater benefits. That, however, requires us to use our brains: analyze more, think more, and find the corresponding solutions.