
Performance Evaluation

Anastasia Braginsky edited this page Jun 22, 2020 · 2 revisions

Evaluation Setup

For evaluation we generate a variety of synthetic workloads. Our hardware features an industry-standard 12-core Xeon E5-4650 CPU with 192 GB RAM.

Compared solutions: We mostly focus on Oak’s ZeroCopy (ZC) API. To quantify the benefit of zero-copying, we also run gets with the legacy API, which we refer to as Oak-Copy. For scans, we experiment with both the Set and Stream APIs. Since our goal is to offer Oak as an alternative to Java’s standard skiplist, our baseline is the JDK8 ConcurrentSkipListMap, which we call CSLM.

To isolate the impact of off-heap allocation from other algorithmic aspects, we also implement an off-heap variant of the CSLM, which we call CSLM-OffHeap. Note that whereas CSLM and Oak-Copy offer an object-based API, CSLM-OffHeap also exposes Oak’s ZC API. Internally, CSLM-OffHeap maintains a concurrent skiplist over intermediate cell objects. Each cell references a key buffer and a value buffer allocated in off-heap blocks through Oak’s memory manager.
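The cell-based layout can be sketched with plain JDK primitives. This is a minimal illustration, not the actual CSLM-OffHeap code: the `CslmOffHeapSketch` and `Cell` names are hypothetical, direct `ByteBuffer`s stand in for Oak's managed off-heap blocks, and for brevity the skiplist is indexed by an on-heap key copy rather than by comparing the off-heap key buffers directly, as the real variant would.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ConcurrentSkipListMap;

// Hypothetical sketch of the CSLM-OffHeap idea: the skiplist orders on-heap
// cell objects, while the raw key/value bytes live in off-heap buffers.
class CslmOffHeapSketch {
    // Each cell pairs an off-heap key buffer with an off-heap value buffer.
    static final class Cell {
        final ByteBuffer key;   // direct (off-heap) buffer holding the raw key
        final ByteBuffer value; // direct (off-heap) buffer holding the raw value
        Cell(byte[] k, byte[] v) {
            key = ByteBuffer.allocateDirect(k.length).put(k);
            value = ByteBuffer.allocateDirect(v.length).put(v);
            key.flip();
            value.flip();
        }
    }

    // Index the cells by an on-heap copy of the key; a production variant
    // would compare the off-heap key buffers directly to avoid this copy.
    final ConcurrentSkipListMap<String, Cell> index = new ConcurrentSkipListMap<>();

    void put(String k, String v) {
        index.put(k, new Cell(k.getBytes(StandardCharsets.UTF_8),
                              v.getBytes(StandardCharsets.UTF_8)));
    }

    String get(String k) { // copies the value back on-heap, like Oak-Copy
        Cell c = index.get(k);
        if (c == null) return null;
        byte[] out = new byte[c.value.remaining()];
        c.value.duplicate().get(out);
        return new String(out, StandardCharsets.UTF_8);
    }
}
```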

Methodology: The exercised key and value sizes are 100B and 1KB, respectively. In each experiment, a specific range of keys is accessed; accessed keys are sampled uniformly at random from that range. The range is used to control the dataset size. Every experiment starts with an ingestion stage, which runs in a single thread and populates the KV-map with 50% of the unique keys in the range using putIfAbsent operations. It is followed by the sustained-rate stage, which runs the target workload for 30 seconds through one or more symmetric worker threads. Every data point is the median of 5 runs; the deviations among runs were all within 10%. In each experiment, all algorithms run with the same RAM budget. Oak and CSLM-OffHeap split the available memory between the off-heap pool and the heap, allocating the former with just enough resources to host the raw data; CSLM allocates all the available memory to the heap.
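The two stages can be sketched against the JDK baseline. The `WorkloadSketch` name, the tiny key range, and the shortened duration below are illustrative choices, not the evaluation's actual parameters; only the 1KB value size and the putIfAbsent-based ingestion follow the setup described above.

```java
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch of the two benchmark stages on the CSLM baseline.
class WorkloadSketch {
    static final int VALUE_SIZE = 1024; // 1KB values, as in the evaluation

    // Ingestion stage: single-threaded, fills the map with 50% of the
    // unique keys in [0, range) using putIfAbsent.
    static ConcurrentSkipListMap<Integer, byte[]> ingest(int range) {
        ConcurrentSkipListMap<Integer, byte[]> map = new ConcurrentSkipListMap<>();
        ThreadLocalRandom rnd = ThreadLocalRandom.current();
        while (map.size() < range / 2) {
            map.putIfAbsent(rnd.nextInt(range), new byte[VALUE_SIZE]);
        }
        return map;
    }

    // Sustained-rate stage (one worker shown): run a put-only workload for
    // the given duration, sampling keys uniformly from the same range.
    static long runPuts(ConcurrentSkipListMap<Integer, byte[]> map,
                        int range, long durationNanos) {
        ThreadLocalRandom rnd = ThreadLocalRandom.current();
        long ops = 0, deadline = System.nanoTime() + durationNanos;
        while (System.nanoTime() < deadline) {
            map.put(rnd.nextInt(range), new byte[VALUE_SIZE]);
            ops++;
        }
        return ops;
    }
}
```

In the real runs the sustained-rate stage lasts 30 seconds and is executed by 1 to 12 symmetric worker threads rather than the single short loop shown here.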

Scalability with parallelism benchmarks: In the following set of experiments, shown in Figures 1–6 below, we fix the available memory to 36GB and the ingested dataset size to 10M KV-pairs. We measure the sustained-rate stage throughput for multiple workloads, while scaling the number of worker threads from 1 to 12. In this setting, the raw data occupies less than 35% of the available memory, so the GC effect in all algorithms is minor.

Figure 1 below depicts the results of a put-only workload. Oak is markedly faster than the alternatives already with one thread, and continues to outperform CSLM by at least 2x at any number of threads. Moreover, whereas CSLM’s throughput flattens out at around 8 threads, Oak continues to improve all the way up to 12. CSLM-OffHeap scales as well as Oak does, suggesting that GC is the limiting factor for CSLM’s scalability. In addition, Oak’s improvement over CSLM-OffHeap remains steady as the number of threads grows, suggesting that Oak’s advantage stems from factors that also affect sequential execution speed, such as better locality, fewer Java objects (and consequently, less GC overhead), and fewer indirections.

Figure 1 (ingestion):

Figure 2 below evaluates incremental updates, invoked through Oak’s putIfAbsentComputeIfPresent API and via (non-atomic) merge in the skiplists. Each in-place update modifies 8 bytes of the value. This workload hardly increases the number of objects, and hence the GC effect is minor. As expected, all algorithms scale well; however, Oak's throughput is almost twice as high.
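On the skiplist side, such an incremental update can be sketched with the JDK's `merge` (which, as noted above, is not atomic on the skiplists). The `IncrementalUpdateSketch` name and the long-counter-at-offset-0 value layout are assumptions for illustration; only the "modify 8 bytes in place, create no new value object" pattern reflects the workload.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ConcurrentSkipListMap;

// Hypothetical sketch of an 8-byte in-place update on the CSLM baseline.
class IncrementalUpdateSketch {
    // merge: insert a fresh 1KB value if the key is absent, otherwise
    // rewrite 8 bytes of the existing one (a long counter at offset 0).
    static void update(ConcurrentSkipListMap<Integer, byte[]> map, int key) {
        map.merge(key, new byte[1024], (old, fresh) -> {
            ByteBuffer buf = ByteBuffer.wrap(old);
            buf.putLong(0, buf.getLong(0) + 1); // touch only 8 bytes
            return old; // reuse the existing array: no new objects, little GC
        });
    }
}
```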

Figure 2 (putIfAbsentComputeIfPresent):

PUT-IF-ABSENT-COMPUTE-IF-PRESENT only (after 10GB ingested)

Figure 3 illustrates the get-only workload. As expected in a read-only workload, all solutions scale linearly without saturating at 12 threads. Here, too, Oak (ZC) is much faster than the alternatives. In particular, Oak scales 12x with 12 threads compared to its single-threaded execution, and outperforms CSLM by 1.7x at peak performance. We also exercise the legacy Oak-Copy API, finding that copying induces a significant penalty but does not inhibit scalability; note that Oak's gets with copying/deserialization perform the same as CSLM's gets. A mix of 95% gets and 5% puts (Figure 4) shows similar results: Oak scales 12x with 12 threads and outperforms CSLM by 1.7x at all thread counts, while CSLM-OffHeap is slower than both.
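A single worker's iteration of the 95% get / 5% put mix might be sketched as follows against the JDK baseline; `MixedWorkloadSketch` and the parameter choices are hypothetical, and the real runs use a timed loop across up to 12 such workers.

```java
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch of one worker running the 95/5 get/put mix.
class MixedWorkloadSketch {
    static long runMix(ConcurrentSkipListMap<Integer, byte[]> map,
                       int range, int iterations) {
        ThreadLocalRandom rnd = ThreadLocalRandom.current();
        long hits = 0;
        for (int i = 0; i < iterations; i++) {
            int key = rnd.nextInt(range);       // uniform key sample
            if (rnd.nextInt(100) < 5) {
                map.put(key, new byte[1024]);   // 5% puts
            } else if (map.get(key) != null) {  // 95% gets
                hits++;
            }
        }
        return hits;
    }
}
```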

Figure 3 (retrieval):

GET only (after 10GB ingested)

Figure 4 (retrieval with background ingestion):

95% GET WITH 5% PUT (after 10GB ingested)

We proceed to ascending scans, shown in Figure 5 below. We focus on long scans traversing 10K key-value pairs. In such scans, performance is dominated by the iteration through the entries rather than by the search for the first key, which differentiates them from gets. In this scenario, Oak’s Set API is 2x slower than the alternatives: its performance suffers from the creation of ephemeral OakBuffer objects for all traversed keys and values, while its competitors retrieve references to existing objects. Oak’s Stream API, which reuses the same object for all the entries returned throughout the scan, eliminates this overhead. The locality of access offered by the chunk-based organization is particularly beneficial in this case, allowing Oak’s Stream API to outperform CSLM by nearly 4x with 12 threads. CSLM and CSLM-OffHeap exhibit identical throughput in ascending and descending scans.
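The baseline's ascending scan, where a single search positions the iterator and the rest is pure traversal over successive entries, can be sketched as follows; the `ScanSketch` name and parameters are illustrative.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

// Hypothetical sketch of an ascending scan on the CSLM baseline.
class ScanSketch {
    // One search positions the iterator at fromKey; the remaining work is
    // iteration over the next `count` entries, which dominates long scans.
    static long ascendingScan(ConcurrentSkipListMap<Integer, byte[]> map,
                              int fromKey, int count) {
        long bytes = 0;
        int seen = 0;
        for (Map.Entry<Integer, byte[]> e : map.tailMap(fromKey).entrySet()) {
            if (seen++ == count) break;
            bytes += e.getValue().length; // consume the entry
        }
        return bytes;
    }
}
```

Each `Map.Entry` here is a reference to an existing node; Oak's Set API would instead materialize an ephemeral buffer object per traversed entry, which is exactly the overhead its Stream API removes by reusing one object.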

Figure 5 (scan forward):

ASCENDING SCAN (after 10GB ingested)

Finally, Figure 6 depicts the performance of descending scans over 10K KV pairs. Recall that the skiplists implement such scans by issuing a new lookup for each traversed key, and thus pay the (logarithmic) search cost 10K times per scan. Oak, in contrast, issues a lookup only when a chunk (holding 2K–4K KV pairs) is exhausted. As expected, Oak outperforms CSLM by more than 8x even with the "slow" non-stream API. With the Stream API, Oak’s throughput doubles.
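The skiplist side of this comparison can be sketched with the JDK's descending view, whose iterator effectively re-locates each predecessor with a fresh lookup, which is the per-key search cost described above; the `DescendingScanSketch` name is hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

// Hypothetical sketch of a descending scan on the CSLM baseline.
class DescendingScanSketch {
    // The JDK skiplist has no back-pointers at its index levels, so stepping
    // backward through descendingMap() pays a search per traversed key.
    static int descendingScan(ConcurrentSkipListMap<Integer, byte[]> map,
                              int fromKey, int count) {
        int seen = 0;
        for (Map.Entry<Integer, byte[]> e :
                 map.headMap(fromKey, true).descendingMap().entrySet()) {
            if (seen == count) break;
            seen++; // consume e
        }
        return seen;
    }
}
```

Oak avoids this per-key cost because entries inside a chunk are contiguous, so it only searches again when it crosses a chunk boundary.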

Figure 6 (scan backward):

DESCENDING SCAN (after 10GB ingested)
