
GeneralPurposeAllocator: Considerably improve worst case performance #17383

Merged
3 commits merged into ziglang:master on Oct 3, 2023

Conversation

@squeek502 (Collaborator) commented on Oct 3, 2023

Before this PR, GeneralPurposeAllocator could run into incredibly degraded performance in scenarios where the bucket count for a particular size class grew to be large. For example, if exactly slot_count allocations of a single size class were performed and then all of them were freed except one, then the bucket for those allocations would have to be kept around indefinitely. If that pattern of allocation were done over and over, then the bucket list for that size class could grow incredibly large, and to find a particular bucket, the entire (doubly linked) list would have to be scanned linearly.

This allocation pattern has been seen in the wild: Vexu/arocc#508 (comment)

In that case, the length of the bucket list for the 128 size class would grow to tens of thousands of buckets and cause Debug runtime to balloon to ~8 minutes whereas with the c_allocator the Debug runtime would be ~3 seconds.

To address this, there are three different changes happening here:

  1. std.Treap is used instead of a doubly linked list for the lists of buckets. This takes the time complexity of searchBucket [used in resize and free] from O(n) to O(log n), but increases the time complexity of insert from O(1) to O(log n) [before, all new buckets would get added to the head of the list]. This is still a huge win because search happens far more often than insertion of new buckets. Note: Any data structure with O(log n) or better search/insert/delete would also work for this use-case (a rough sketch of the treap-keyed lookup follows this list).
  2. If the 'current' bucket for a size class is full, the list of buckets is never traversed; instead, a new bucket is allocated. Previously, traversing the bucket list could only find a non-full bucket in specific circumstances, and only because of a separate optimization that is no longer needed (before, after any resize/free, the affected bucket would be moved to the head of the bucket list so that searchBucket would perform better on average). Now, the current_bucket for each size class only changes when either (1) the current bucket is emptied/freed, or (2) a new bucket is allocated (because the current bucket is full or null). Because each bucket's alloc_cursor only moves forward (i.e. slots within a bucket are never re-used), any bucket besides the current_bucket is always known to be full, so traversing the list in the hope of finding an existing non-full bucket is pointless.
  3. Size + alignment information for small allocations has been moved into the Bucket data instead of keeping it in a separate HashMap. This offers an improvement over the HashMap since whenever we need to get/modify the length/alignment of an allocation it's extremely likely we will already have calculated any bucket-related information necessary to get the data.
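
To illustrate the first change, here is a rough sketch of keying buckets in a std.Treap by the address of the page each bucket manages, which makes lookup, insertion, and removal O(log n). This is only an illustration (the helper names are made up for the sketch, and how a treap node maps back to its bucket is omitted); it is not the actual code in this PR.

const std = @import("std");

// Each small-allocation bucket manages one page, so key the treap by the
// address of the page that the bucket manages.
const BucketTreap = std.Treap(usize, std.math.order);

fn insertBucketNode(treap: *BucketTreap, node: *BucketTreap.Node, page_addr: usize) void {
    node.key = page_addr;
    var entry = treap.getEntryFor(page_addr);
    entry.set(node); // O(log n) insertion
}

fn searchBucketNode(treap: *BucketTreap, addr: usize) ?*BucketTreap.Node {
    // Round the address being resized/freed down to its page so it matches
    // the key the bucket was inserted under.
    const page_addr = addr & ~(@as(usize, std.mem.page_size) - 1);
    var entry = treap.getEntryFor(page_addr);
    return entry.node; // O(log n) search instead of an O(n) list scan
}

fn removeBucketNode(treap: *BucketTreap, page_addr: usize) void {
    var entry = treap.getEntryFor(page_addr);
    entry.set(null); // O(log n) removal
}

test "treap-keyed bucket lookup" {
    var treap: BucketTreap = .{};
    var node: BucketTreap.Node = undefined;
    const page_addr: usize = std.mem.page_size * 42;
    insertBucketNode(&treap, &node, page_addr);
    try std.testing.expect(searchBucketNode(&treap, page_addr + 100) != null);
    removeBucketNode(&treap, page_addr);
    try std.testing.expect(searchBucketNode(&treap, page_addr + 100) == null);
}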

The first change is the most relevant and accounts for most of the benefit here. Also note that the overall functionality of GeneralPurposeAllocator is unchanged.

In the degraded arocc case, these changes bring Debug performance from ~8 minutes to ~20 seconds.

Benchmark 1: test-master.bat
  Time (mean ± σ):     481.263 s ±  5.440 s    [User: 479.159 s, System: 1.937 s]
  Range (min … max):   477.416 s … 485.109 s    2 runs

Benchmark 2: test-optim-treap.bat
  Time (mean ± σ):     19.639 s ±  0.037 s    [User: 18.183 s, System: 1.452 s]
  Range (min … max):   19.613 s … 19.665 s    2 runs

Summary
  'test-optim-treap.bat' ran
   24.51 ± 0.28 times faster than 'test-master.bat'

Note: Much of the time taken on Windows in this particular case is related to gathering stack traces. With .stack_trace_frames = 0 the runtime goes down to 6.7 seconds, which is a little more than 2.5x slower compared to when the c_allocator is used.
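
For reference, stack trace collection is controlled through the GPA's comptime config; the setting mentioned above looks like this:

// Trades leak/double-free stack traces for speed, especially in Debug builds.
var gpa = std.heap.GeneralPurposeAllocator(.{ .stack_trace_frames = 0 }){};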

These changes may or may not introduce a slight performance regression in the average case:

Here are the standard library test results on Windows in Debug mode:

Benchmark 1 (10 runs): std-tests-master.exe
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          16.0s  ± 30.8ms    15.9s  … 16.1s           1 (10%)        0%
  peak_rss           42.8MB ± 8.24KB    42.8MB … 42.8MB          0 ( 0%)        0%
Benchmark 2 (10 runs): std-tests-optim-treap.exe
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          16.2s  ± 37.6ms    16.1s  … 16.3s           0 ( 0%)        💩+  1.3% ±  0.2%
  peak_rss           42.8MB ± 5.18KB    42.8MB … 42.8MB          0 ( 0%)          +  0.1% ±  0.0%

And on Linux:

Benchmark 1: ./test-master
  Time (mean ± σ):     16.091 s ±  0.088 s    [User: 15.856 s, System: 0.453 s]
  Range (min … max):   15.870 s … 16.166 s    10 runs
 
Benchmark 2: ./test-optim-treap
  Time (mean ± σ):     16.028 s ±  0.325 s    [User: 15.755 s, System: 0.492 s]
  Range (min … max):   15.735 s … 16.709 s    10 runs
 
Summary
  './test-optim-treap' ran
    1.00 ± 0.02 times faster than './test-master'

Here are some more benchmark results using a very targeted benchmark that intentionally only does worst-case allocation patterns:

Benchmark code
const std = @import("std");

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer std.debug.assert(gpa.deinit() == .ok);
    const allocator = gpa.allocator();

    const alloc_size = 128;
    // Number of slots in one bucket of this size class, e.g. 4096 / 128 = 32
    // on a target with 4 KiB pages.
    const slot_count = @divExact(std.mem.page_size, std.math.ceilPowerOfTwoAssert(usize, alloc_size));
    const rounds = 5000;
    var unfreed_slices: [rounds][]u8 = undefined;
    var i: usize = 0;
    while (i < rounds) : (i += 1) {
        // Keep one allocation from each round alive so the round's bucket can never be freed.
        unfreed_slices[i] = try allocator.alloc(u8, alloc_size);
        // Fill the bucket's remaining slots (slots are never re-used, so these count
        // against the bucket even though they are freed immediately), forcing the
        // next round onto a brand new bucket.
        for (0..(slot_count - 1)) |_| {
            const slice = try allocator.alloc(u8, alloc_size);
            allocator.free(slice);
        }
    }

    for (&unfreed_slices) |slice| {
        allocator.free(slice);
    }
}

On Linux:

Debug:

Benchmark 1 (3 runs): ./gpa-degen-master
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          3.66s  ± 65.5ms    3.59s  … 3.70s           0 ( 0%)        0%
  peak_rss           41.9MB ± 2.36KB    41.9MB … 41.9MB          0 ( 0%)        0%
  cpu_cycles         14.5G  ±  280M     14.2G  … 14.6G           0 ( 0%)        0%
  instructions       25.1G  ±  559M     24.5G  … 25.5G           0 ( 0%)        0%
  cache_references   98.7M  ±  518K     98.1M  … 99.1M           0 ( 0%)        0%
  cache_misses       13.6M  ±  156K     13.4M  … 13.7M           0 ( 0%)        0%
  branch_misses      56.8M  ± 1.64M     55.0M  … 58.2M           0 ( 0%)        0%
Benchmark 2 (9 runs): ./gpa-degen-optim-treap
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           617ms ± 5.66ms     607ms …  624ms          0 ( 0%)        ⚡- 83.2% ±  1.2%
  peak_rss           41.9MB ± 1.81KB    41.9MB … 41.9MB          2 (22%)          -  0.0% ±  0.0%
  cpu_cycles         1.73G  ± 12.9M     1.71G  … 1.75G           0 ( 0%)        ⚡- 88.0% ±  1.3%
  instructions       2.79G  ± 12.1M     2.78G  … 2.82G           0 ( 0%)        ⚡- 88.9% ±  1.5%
  cache_references   38.7M  ±  502K     37.9M  … 39.3M           0 ( 0%)        ⚡- 60.8% ±  0.8%
  cache_misses        195K  ± 10.1K      184K  …  215K           0 ( 0%)        ⚡- 98.6% ±  0.8%
  branch_misses      4.39M  ± 25.9K     4.36M  … 4.43M           0 ( 0%)        ⚡- 92.3% ±  1.9%

ReleaseFast:

Benchmark 1 (27 runs): ./gpa-degen-master-release
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           187ms ± 4.41ms     178ms …  195ms          0 ( 0%)        0%
  peak_rss           20.7MB ± 2.22KB    20.7MB … 20.7MB          0 ( 0%)        0%
  cpu_cycles          705M  ± 13.2M      680M  …  728M           0 ( 0%)        0%
  instructions        115M  ± 17.8       115M  …  115M           1 ( 4%)        0%
  cache_references   29.5M  ±  230K     29.1M  … 30.0M           0 ( 0%)        0%
  cache_misses       12.8M  ± 15.0K     12.8M  … 12.8M           0 ( 0%)        0%
  branch_misses      35.4K  ±  338      35.2K  … 36.4K           1 ( 4%)        0%
Benchmark 2 (195 runs): ./gpa-degen-optim-treap-release
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          25.6ms ± 2.80ms    20.3ms … 31.9ms          0 ( 0%)        ⚡- 86.3% ±  0.7%
  peak_rss           20.9MB ± 2.05KB    20.9MB … 20.9MB          0 ( 0%)          +  1.0% ±  0.0%
  cpu_cycles         35.2M  ± 2.55M     30.8M  … 46.5M           2 ( 1%)        ⚡- 95.0% ±  0.3%
  instructions       40.3M  ±  912K     38.3M  … 43.4M           1 ( 1%)        ⚡- 65.0% ±  0.3%
  cache_references   1.23M  ±  186K      851K  … 1.78M           3 ( 2%)        ⚡- 95.8% ±  0.3%
  cache_misses       12.3K  ±  508      11.2K  … 14.1K           6 ( 3%)        ⚡- 99.9% ±  0.0%
  branch_misses      52.5K  ± 2.95K     48.9K  … 57.8K           0 ( 0%)        💩+ 48.3% ±  3.2%

On Windows:

Debug:

Benchmark 1 (3 runs): gpa-degen-master.exe
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          4.47s  ±  165ms    4.29s  … 4.62s           0 ( 0%)        0%
  peak_rss           44.5MB ± 2.36KB    44.5MB … 44.5MB          0 ( 0%)        0%
Benchmark 2 (9 runs): gpa-degen-optim-treap.exe
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           562ms ± 3.21ms     557ms …  567ms          0 ( 0%)        ⚡- 87.4% ±  2.5%
  peak_rss           44.5MB ± 2.05KB    44.5MB … 44.5MB          0 ( 0%)          -  0.0% ±  0.0%

ReleaseFast:

Benchmark 1 (9 runs): gpa-degen-master-release.exe
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           564ms ± 44.9ms     497ms …  603ms          0 ( 0%)        0%
  peak_rss           23.5MB ± 2.05KB    23.5MB … 23.5MB          0 ( 0%)        0%
Benchmark 2 (120 runs): gpa-degen-optim-treap-release.exe
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          41.8ms ± 1.55ms    38.5ms … 46.1ms          0 ( 0%)        ⚡- 92.6% ±  1.4%
  peak_rss           23.7MB ± 18.6KB    23.7MB … 23.9MB          1 ( 1%)          +  0.9% ±  0.1%

Various notes:

  • A memory pool is used for the Treap.Nodes. This has two slightly weird things:
    • Because the GPA doesn't have an init function and is directly instantiated instead, the memory pool can't use backing_allocator and instead always uses the page_allocator
    • The number of allocated Nodes will always stay at the peak number of Nodes needed: if a program needs 5000 buckets at some point, all 5000 of those nodes will live for the rest of the program even if all memory in the buckets is freed, but those 5000 nodes will also be re-used whenever a new node is needed (a minimal sketch of this retention/reuse behavior follows these notes).
  • I initially used a skip list implementation that I wrote for this because I wasn't aware of std.Treap, but std.Treap slightly outperformed it in my benchmarks and provides all the same benefits.
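
Here is a minimal sketch of the retention/reuse behavior described above, using std.heap.MemoryPool directly. The Node type and pool sizes are stand-ins for illustration; this is not the code from this PR.

const std = @import("std");

pub fn main() !void {
    // Hypothetical payload standing in for a treap node.
    const Node = struct { key: usize = 0 };

    var pool = std.heap.MemoryPool(Node).init(std.heap.page_allocator);
    defer pool.deinit();

    // Grow the pool to its peak size.
    var nodes: [8]*Node = undefined;
    for (&nodes) |*n| n.* = try pool.create();

    // destroy() only returns items to the pool's internal free list; the memory
    // stays reserved until deinit(), but it is re-used by later create() calls
    // instead of growing the pool further.
    for (nodes) |n| pool.destroy(n);
    const reused = try pool.create();
    pool.destroy(reused);
}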

@andrewrk (Member) left a comment

Fantastic work!

  • Because the GPA doesn't have an init function and is directly instantiated instead, the memory pool can't use backing_allocator and instead always uses the page_allocator

I was already thinking before this that GPA should lose the feature of using backing_allocator and become hard-coded to always use OS calls directly. An allocator that wraps an existing one and tries to provide some kind of features on top of it could be interesting, but it's a bit of a different concern than what GPA aims to provide. So, this is a step in the right direction IMO.

@andrewrk andrewrk merged commit 47f0860 into ziglang:master Oct 3, 2023
10 checks passed
squeek502 added a commit to squeek502/zig that referenced this pull request Oct 4, 2023
…rching the list

Follow-up to ziglang#17383. This is a minor optimization that only matters when a small allocation is resized or freed soon after it is allocated.

The only real difference I was able to observe with this was via a synthetic benchmark that allocates a full bucket and then frees all but one of the slots, over and over in a loop:

Debug build:

Benchmark 1 (9 runs): gpa-degen-master.exe
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           575ms ± 5.19ms     569ms …  583ms          0 ( 0%)        0%
  peak_rss           43.8MB ± 1.37KB    43.8MB … 43.8MB          1 (11%)        0%
Benchmark 2 (10 runs): gpa-degen-search-cur.exe
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           532ms ± 5.55ms     520ms …  539ms          0 ( 0%)        ⚡-  7.5% ±  0.9%
  peak_rss           43.8MB ± 65.2KB    43.8MB … 44.0MB          1 (10%)          +  0.0% ±  0.1%

ReleaseFast build:

Benchmark 1 (129 runs): gpa-degen-master-release.exe
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          38.9ms ± 1.12ms    36.7ms … 42.4ms          8 ( 6%)        0%
  peak_rss           23.2MB ± 2.39KB    23.2MB … 23.2MB          0 ( 0%)        0%
Benchmark 2 (151 runs): gpa-degen-search-cur-release.exe
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          33.2ms ±  999us    31.9ms … 36.3ms         20 (13%)        ⚡- 14.7% ±  0.6%
  peak_rss           23.2MB ± 2.26KB    23.2MB … 23.2MB          0 ( 0%)          +  0.0% ±  0.0%
andrewrk pushed a commit that referenced this pull request Oct 4, 2023