
GeneralPurposeAllocator: Considerably improve worst case performance #17383

Merged
3 commits merged into ziglang:master on Oct 3, 2023

Conversation

@squeek502 (Collaborator) commented on Oct 3, 2023

Before this PR, GeneralPurposeAllocator could run into incredibly degraded performance in scenarios where the bucket count for a particular size class grew to be large. For example, if exactly slot_count allocations of a single size class were performed and then all of them were freed except one, then the bucket for those allocations would have to be kept around indefinitely. If that pattern of allocation were done over and over, then the bucket list for that size class could grow incredibly large, and to find a particular bucket, the entire (doubly linked) list would have to be scanned linearly.

This allocation pattern has been seen in the wild: Vexu/arocc#508 (comment)

In that case, the length of the bucket list for the 128 size class would grow to tens of thousands of buckets and cause Debug runtime to balloon to ~8 minutes whereas with the c_allocator the Debug runtime would be ~3 seconds.

To address this, there are three different changes happening here:

  1. std.Treap is used instead of a doubly linked list for the lists of buckets. This takes the time complexity of searchBucket [used in resize and free] from O(n) to O(log n), but increases the time complexity of insert from O(1) to O(log n) [before, all new buckets would get added to the head of the list]. This is still a huge win because search happens far more often than insertion of new buckets. Note: Any data structure with O(log n) or better search/insert/delete would also work for this use-case (a rough sketch of the treap-keyed lookup follows this list).
  2. If the 'current' bucket for a size class is full, the list of buckets is never traversed; instead, a new bucket is allocated. Previously, traversing the bucket list could only find a non-full bucket in specific circumstances, and only because of a separate optimization that is no longer needed (before, after any resize/free, the affected bucket would be moved to the head of the bucket list so that searchBucket would perform better on average). Now, the current_bucket for each size class only changes when either (1) the current bucket is emptied/freed, or (2) a new bucket is allocated (because the current bucket is full or null). Because each bucket's alloc_cursor only moves forward (i.e. slots within a bucket are never re-used), any bucket besides the current_bucket is always known to be full, so traversing the list in the hope of finding an existing non-full bucket is pointless.
  3. Size + alignment information for small allocations has been moved into the Bucket data instead of keeping it in a separate HashMap. This offers an improvement over the HashMap since whenever we need to get/modify the length/alignment of an allocation it's extremely likely we will already have calculated any bucket-related information necessary to get the data.
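
To illustrate the first change, here is a rough sketch of keying buckets in a std.Treap by the address of the page each bucket manages, which makes lookup, insertion, and removal O(log n). This is only an illustration (the helper names are made up for the sketch, and how a treap node maps back to its bucket is omitted); it is not the actual code in this PR.

const std = @import("std");

// Each small-allocation bucket manages one page, so key the treap by the
// address of the page that the bucket manages.
const BucketTreap = std.Treap(usize, std.math.order);

fn insertBucketNode(treap: *BucketTreap, node: *BucketTreap.Node, page_addr: usize) void {
    node.key = page_addr;
    var entry = treap.getEntryFor(page_addr);
    entry.set(node); // O(log n) insertion
}

fn searchBucketNode(treap: *BucketTreap, addr: usize) ?*BucketTreap.Node {
    // Round the address being resized/freed down to its page so it matches
    // the key the bucket was inserted under.
    const page_addr = addr & ~(@as(usize, std.mem.page_size) - 1);
    var entry = treap.getEntryFor(page_addr);
    return entry.node; // O(log n) search instead of an O(n) list scan
}

fn removeBucketNode(treap: *BucketTreap, page_addr: usize) void {
    var entry = treap.getEntryFor(page_addr);
    entry.set(null); // O(log n) removal
}

test "treap-keyed bucket lookup" {
    var treap: BucketTreap = .{};
    var node: BucketTreap.Node = undefined;
    const page_addr: usize = std.mem.page_size * 42;
    insertBucketNode(&treap, &node, page_addr);
    try std.testing.expect(searchBucketNode(&treap, page_addr + 100) != null);
    removeBucketNode(&treap, page_addr);
    try std.testing.expect(searchBucketNode(&treap, page_addr + 100) == null);
}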

The first change is the most relevant and accounts for most of the benefit here. Also note that the overall functionality of GeneralPurposeAllocator is unchanged.

In the degraded arocc case, these changes bring Debug performance from ~8 minutes to ~20 seconds.

Benchmark 1: test-master.bat
  Time (mean ± σ):     481.263 s ±  5.440 s    [User: 479.159 s, System: 1.937 s]
  Range (min … max):   477.416 s … 485.109 s    2 runs

Benchmark 2: test-optim-treap.bat
  Time (mean ± σ):     19.639 s ±  0.037 s    [User: 18.183 s, System: 1.452 s]
  Range (min … max):   19.613 s … 19.665 s    2 runs

Summary
  'test-optim-treap.bat' ran
   24.51 ± 0.28 times faster than 'test-master.bat'

Note: Much of the time taken on Windows in this particular case is related to gathering stack traces. With .stack_trace_frames = 0 the runtime goes down to 6.7 seconds, which is a little more than 2.5x slower compared to when the c_allocator is used.
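
For reference, stack trace collection is controlled through the GPA's comptime config; the setting mentioned above looks like this:

// Trades leak/double-free stack traces for speed, especially in Debug builds.
var gpa = std.heap.GeneralPurposeAllocator(.{ .stack_trace_frames = 0 }){};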

These changes may or may not introduce a slight performance regression in the average case:

Here are the standard library test results on Windows in Debug mode:

Benchmark 1 (10 runs): std-tests-master.exe
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          16.0s  ± 30.8ms    15.9s  … 16.1s           1 (10%)        0%
  peak_rss           42.8MB ± 8.24KB    42.8MB … 42.8MB          0 ( 0%)        0%
Benchmark 2 (10 runs): std-tests-optim-treap.exe
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          16.2s  ± 37.6ms    16.1s  … 16.3s           0 ( 0%)        💩+  1.3% ±  0.2%
  peak_rss           42.8MB ± 5.18KB    42.8MB … 42.8MB          0 ( 0%)          +  0.1% ±  0.0%

And on Linux:

Benchmark 1: ./test-master
  Time (mean ± σ):     16.091 s ±  0.088 s    [User: 15.856 s, System: 0.453 s]
  Range (min … max):   15.870 s … 16.166 s    10 runs
 
Benchmark 2: ./test-optim-treap
  Time (mean ± σ):     16.028 s ±  0.325 s    [User: 15.755 s, System: 0.492 s]
  Range (min … max):   15.735 s … 16.709 s    10 runs
 
Summary
  './test-optim-treap' ran
    1.00 ± 0.02 times faster than './test-master'

Here are some more benchmark results using a very targeted benchmark that intentionally only does worst-case allocation patterns:

Benchmark code
const std = @import("std");

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer std.debug.assert(gpa.deinit() == .ok);
    const allocator = gpa.allocator();

    const alloc_size = 128;
    // Number of slots in one bucket of this size class, e.g. 4096 / 128 = 32
    // on a target with 4 KiB pages.
    const slot_count = @divExact(std.mem.page_size, std.math.ceilPowerOfTwoAssert(usize, alloc_size));
    const rounds = 5000;
    var unfreed_slices: [rounds][]u8 = undefined;
    var i: usize = 0;
    while (i < rounds) : (i += 1) {
        // Keep one allocation from each round alive so the round's bucket can never be freed.
        unfreed_slices[i] = try allocator.alloc(u8, alloc_size);
        // Fill the bucket's remaining slots (slots are never re-used, so these count
        // against the bucket even though they are freed immediately), forcing the
        // next round onto a brand new bucket.
        for (0..(slot_count - 1)) |_| {
            const slice = try allocator.alloc(u8, alloc_size);
            allocator.free(slice);
        }
    }

    for (&unfreed_slices) |slice| {
        allocator.free(slice);
    }
}

On Linux:

Debug:

Benchmark 1 (3 runs): ./gpa-degen-master
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          3.66s  ± 65.5ms    3.59s  … 3.70s           0 ( 0%)        0%
  peak_rss           41.9MB ± 2.36KB    41.9MB … 41.9MB          0 ( 0%)        0%
  cpu_cycles         14.5G  ±  280M     14.2G  … 14.6G           0 ( 0%)        0%
  instructions       25.1G  ±  559M     24.5G  … 25.5G           0 ( 0%)        0%
  cache_references   98.7M  ±  518K     98.1M  … 99.1M           0 ( 0%)        0%
  cache_misses       13.6M  ±  156K     13.4M  … 13.7M           0 ( 0%)        0%
  branch_misses      56.8M  ± 1.64M     55.0M  … 58.2M           0 ( 0%)        0%
Benchmark 2 (9 runs): ./gpa-degen-optim-treap
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           617ms ± 5.66ms     607ms …  624ms          0 ( 0%)        ⚡- 83.2% ±  1.2%
  peak_rss           41.9MB ± 1.81KB    41.9MB … 41.9MB          2 (22%)          -  0.0% ±  0.0%
  cpu_cycles         1.73G  ± 12.9M     1.71G  … 1.75G           0 ( 0%)        ⚡- 88.0% ±  1.3%
  instructions       2.79G  ± 12.1M     2.78G  … 2.82G           0 ( 0%)        ⚡- 88.9% ±  1.5%
  cache_references   38.7M  ±  502K     37.9M  … 39.3M           0 ( 0%)        ⚡- 60.8% ±  0.8%
  cache_misses        195K  ± 10.1K      184K  …  215K           0 ( 0%)        ⚡- 98.6% ±  0.8%
  branch_misses      4.39M  ± 25.9K     4.36M  … 4.43M           0 ( 0%)        ⚡- 92.3% ±  1.9%

ReleaseFast:

Benchmark 1 (27 runs): ./gpa-degen-master-release
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           187ms ± 4.41ms     178ms …  195ms          0 ( 0%)        0%
  peak_rss           20.7MB ± 2.22KB    20.7MB … 20.7MB          0 ( 0%)        0%
  cpu_cycles          705M  ± 13.2M      680M  …  728M           0 ( 0%)        0%
  instructions        115M  ± 17.8       115M  …  115M           1 ( 4%)        0%
  cache_references   29.5M  ±  230K     29.1M  … 30.0M           0 ( 0%)        0%
  cache_misses       12.8M  ± 15.0K     12.8M  … 12.8M           0 ( 0%)        0%
  branch_misses      35.4K  ±  338      35.2K  … 36.4K           1 ( 4%)        0%
Benchmark 2 (195 runs): ./gpa-degen-optim-treap-release
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          25.6ms ± 2.80ms    20.3ms … 31.9ms          0 ( 0%)        ⚡- 86.3% ±  0.7%
  peak_rss           20.9MB ± 2.05KB    20.9MB … 20.9MB          0 ( 0%)          +  1.0% ±  0.0%
  cpu_cycles         35.2M  ± 2.55M     30.8M  … 46.5M           2 ( 1%)        ⚡- 95.0% ±  0.3%
  instructions       40.3M  ±  912K     38.3M  … 43.4M           1 ( 1%)        ⚡- 65.0% ±  0.3%
  cache_references   1.23M  ±  186K      851K  … 1.78M           3 ( 2%)        ⚡- 95.8% ±  0.3%
  cache_misses       12.3K  ±  508      11.2K  … 14.1K           6 ( 3%)        ⚡- 99.9% ±  0.0%
  branch_misses      52.5K  ± 2.95K     48.9K  … 57.8K           0 ( 0%)        💩+ 48.3% ±  3.2%

On Windows:

Debug:

Benchmark 1 (3 runs): gpa-degen-master.exe
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          4.47s  ±  165ms    4.29s  … 4.62s           0 ( 0%)        0%
  peak_rss           44.5MB ± 2.36KB    44.5MB … 44.5MB          0 ( 0%)        0%
Benchmark 2 (9 runs): gpa-degen-optim-treap.exe
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           562ms ± 3.21ms     557ms …  567ms          0 ( 0%)        ⚡- 87.4% ±  2.5%
  peak_rss           44.5MB ± 2.05KB    44.5MB … 44.5MB          0 ( 0%)          -  0.0% ±  0.0%

ReleaseFast:

Benchmark 1 (9 runs): gpa-degen-master-release.exe
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           564ms ± 44.9ms     497ms …  603ms          0 ( 0%)        0%
  peak_rss           23.5MB ± 2.05KB    23.5MB … 23.5MB          0 ( 0%)        0%
Benchmark 2 (120 runs): gpa-degen-optim-treap-release.exe
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          41.8ms ± 1.55ms    38.5ms … 46.1ms          0 ( 0%)        ⚡- 92.6% ±  1.4%
  peak_rss           23.7MB ± 18.6KB    23.7MB … 23.9MB          1 ( 1%)          +  0.9% ±  0.1%

Various notes:

  • A memory pool is used for the Treap.Nodes. This has two slightly weird things:
    • Because the GPA doesn't have an init function and is directly instantiated instead, the memory pool can't use backing_allocator and instead always uses the page_allocator
    • The number of allocated Nodes will always stay at the peak number of Nodes needed: if a program needs 5000 buckets at some point, all 5000 of those nodes will live for the rest of the program even if all memory in the buckets is freed, but those 5000 nodes will also be re-used whenever a new node is needed (a minimal sketch of this retention/reuse behavior follows these notes).
  • I initially used a skip list implementation that I wrote for this because I wasn't aware of std.Treap, but std.Treap slightly outperformed it in my benchmarks and provides all the same benefits.
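
Here is a minimal sketch of the retention/reuse behavior described above, using std.heap.MemoryPool directly. The Node type and pool sizes are stand-ins for illustration; this is not the code from this PR.

const std = @import("std");

pub fn main() !void {
    // Hypothetical payload standing in for a treap node.
    const Node = struct { key: usize = 0 };

    var pool = std.heap.MemoryPool(Node).init(std.heap.page_allocator);
    defer pool.deinit();

    // Grow the pool to its peak size.
    var nodes: [8]*Node = undefined;
    for (&nodes) |*n| n.* = try pool.create();

    // destroy() only returns items to the pool's internal free list; the memory
    // stays reserved until deinit(), but it is re-used by later create() calls
    // instead of growing the pool further.
    for (nodes) |n| pool.destroy(n);
    const reused = try pool.create();
    pool.destroy(reused);
}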

@andrewrk (Member) left a comment

Fantastic work!

  • Because the GPA doesn't have an init function and is directly instantiated instead, the memory pool can't use backing_allocator and instead always uses the page_allocator

I was already thinking before this that GPA should lose the feature of using backing_allocator and become hard-coded to always use OS calls directly. An allocator that wraps an existing one and tries to provide some kind of features on top of it could be interesting, but it's a bit of a different concern than what GPA aims to provide. So, this is a step in the right direction IMO.

@andrewrk andrewrk merged commit 47f0860 into ziglang:master Oct 3, 2023
10 checks passed
squeek502 added a commit to squeek502/zig that referenced this pull request Oct 4, 2023
…rching the list

Follow-up to ziglang#17383. This is a minor optimization that only matters when a small allocation is resized or freed soon after it is allocated.

The only real difference I was able to observe with this was via a synthetic benchmark that allocates a full bucket and then frees all but one of the slots, over and over in a loop:

Debug build:

Benchmark 1 (9 runs): gpa-degen-master.exe
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           575ms ± 5.19ms     569ms …  583ms          0 ( 0%)        0%
  peak_rss           43.8MB ± 1.37KB    43.8MB … 43.8MB          1 (11%)        0%
Benchmark 2 (10 runs): gpa-degen-search-cur.exe
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           532ms ± 5.55ms     520ms …  539ms          0 ( 0%)        ⚡-  7.5% ±  0.9%
  peak_rss           43.8MB ± 65.2KB    43.8MB … 44.0MB          1 (10%)          +  0.0% ±  0.1%

ReleaseFast build:

Benchmark 1 (129 runs): gpa-degen-master-release.exe
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          38.9ms ± 1.12ms    36.7ms … 42.4ms          8 ( 6%)        0%
  peak_rss           23.2MB ± 2.39KB    23.2MB … 23.2MB          0 ( 0%)        0%
Benchmark 2 (151 runs): gpa-degen-search-cur-release.exe
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          33.2ms ±  999us    31.9ms … 36.3ms         20 (13%)        ⚡- 14.7% ±  0.6%
  peak_rss           23.2MB ± 2.26KB    23.2MB … 23.2MB          0 ( 0%)          +  0.0% ±  0.0%
andrewrk pushed a commit that referenced this pull request Oct 4, 2023