
Conversation

@niaow (Member) commented Dec 1, 2025

The allocator originally just looped through the blocks until it found a sufficiently long range. This is simple, but it fragments the heap very easily and can degrade to a full heap scan for long requests.

Instead, we now maintain a nested list of free ranges, sorted by size. The allocator selects the shortest range that is long enough, which generally reduces fragmentation. This data structure can find a range in time directly proportional to the requested length.
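
A minimal sketch of the idea (names are illustrative, not the actual runtime code): free ranges are grouped by exact length, the groups are kept in ascending length order, and an allocation walks the groups from the smallest until one is long enough.

type freeRange struct {
    start, length uintptr
    next          *freeRange // next free range of the same length
}

type sizeClass struct {
    length uintptr
    ranges *freeRange // free ranges of exactly this length
    next   *sizeClass // next size class, in ascending length order
}

// findBestFit returns the shortest free range with length >= n, or nil.
// The walk visits at most one size class per length below n, so the
// search cost is bounded by the requested length, not the heap size.
func findBestFit(classes *sizeClass, n uintptr) *freeRange {
    for c := classes; c != nil; c = c.next {
        if c.length >= n {
            return c.ranges
        }
    }
    return nil
}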

Performance in the problematic go/format benchmark:

                    │ linear.txt  │            best-fit.txt             │
                    │   sec/op    │   sec/op     vs base                │
Format/array1-10000   31.77m ± 4%   25.71m ± 2%  -19.08% (p=0.000 n=20)

                    │  linear.txt  │             best-fit.txt             │
                    │     B/s      │     B/s       vs base                │
Format/array1-10000   1.945Mi ± 4%   2.403Mi ± 2%  +23.53% (p=0.000 n=20)

@niaow (Member, Author) commented Dec 1, 2025

This is the same basic mechanism as #1181, but it is a lot cleaner.

@niaow (Member, Author) commented Dec 1, 2025

This adds 100-300 bytes of code. We need to decide if this is worth it.

@eliasnaur (Contributor) commented:

I often run out of memory because of fragmentation, so I heartily support anything that combats it.

@aykevl (Member) commented Dec 2, 2025

@dgryski, can you take a look to see whether it helps with GC performance?

@dgryski (Member) commented Dec 2, 2025

In general, "best fit" is going to reduce fragmentation at the expense of CPU time. An allocation-heavy benchmark (in this case the binary trees benchmark game) shows this to be the case:

~/go/src/github.com/dgryski/trifles/binarytrees $ hyperfine -N "./trees-dev.exe 15"  "./trees-best.exe 15"
Benchmark 1: ./trees-dev.exe 15
  Time (mean ± σ):     784.9 ms ±  15.6 ms    [User: 1507.8 ms, System: 2174.3 ms]
  Range (min … max):   758.9 ms … 804.3 ms    10 runs

Benchmark 2: ./trees-best.exe 15
  Time (mean ± σ):      1.027 s ±  0.022 s    [User: 1.877 s, System: 2.854 s]
  Range (min … max):    0.998 s …  1.057 s    10 runs

Summary
  ./trees-dev.exe 15 ran
    1.31 ± 0.04 times faster than ./trees-best.exe 15

Our current allocation scheme is "next fit".
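
For contrast, a rough sketch of next fit (illustrative, not the actual TinyGo code): a roving cursor resumes the linear scan where the previous allocation ended, falling back to a scan from the heap base.

var cursor int // block index where the previous search stopped

// scan looks for a run of n consecutive free blocks in free[lo:hi).
func scan(free []bool, lo, hi, n int) (int, bool) {
    run := 0
    for i := lo; i < hi; i++ {
        if !free[i] {
            run = 0
            continue
        }
        run++
        if run == n {
            return i - n + 1, true
        }
    }
    return 0, false
}

// nextFit avoids rescanning the (likely full) start of the heap on
// every request, which is cheap but tends to spread allocations out.
func nextFit(free []bool, n int) (int, bool) {
    if start, ok := scan(free, cursor, len(free), n); ok {
        cursor = start + n
        return start, true
    }
    if start, ok := scan(free, 0, len(free), n); ok {
        cursor = start + n
        return start, true
    }
    return 0, false
}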

Interestingly, using -gc=precise -target=wasip1, best fit comes out faster.

Running the binary trees benchmark with -gc=precise on native instead of -gc=conservative occasionally gives a SEGV. :(

@niaow (Member, Author) commented Dec 2, 2025

It might be worth waiting until #5104 is merged. I remember I was able to optimize the free-range construction on my experiments branch by exploiting the new metadata format. The current free-range construction code just loops over the individual blocks.

Also, are you using array-backed trees, or trees where each node is allocated as a separate fixed-size object? If the latter, that is basically the worst case for this change: when every allocation is the same size, there isn't really any meaningful fragmentation in the first place.
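
For example (hypothetical shape, not the exact benchmark code), with separately-allocated nodes every request has the same size, so no fit policy can pick a better block than any other:

// node is one fixed-size allocation; the whole benchmark only ever
// asks the allocator for blocks of this single size.
type node struct {
    left, right *node
}

// bottomUp builds a complete tree one node at a time. With a single
// allocation size, every free block fits equally well, so best fit
// cannot improve on next fit here.
func bottomUp(depth int) *node {
    if depth == 0 {
        return &node{}
    }
    return &node{left: bottomUp(depth - 1), right: bottomUp(depth - 1)}
}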

@niaow (Member, Author) commented Dec 2, 2025

Can you link the trees code so I can debug the SEGV?

@dgryski (Member) commented Dec 2, 2025

I'm using https://benchmarksgame-team.pages.debian.net/benchmarksgame/program/binarytrees-go-2.html. So yes, fixed-size allocations for the tree nodes.

@niaow (Member, Author) commented Dec 2, 2025

Oh right, that SEGV is the race condition where we release the GC lock before writing the layout bitmap. I fixed it in #5102 while reorganizing the alloc code, but then kinda forgot about it.

@niaow (Member, Author) commented Dec 2, 2025

Also, the main issue with the binary trees benchmark here is that the collector is not the bottleneck; the lock is. If you switch to -scheduler=tasks to eliminate the lock contention, there is a gigantic performance improvement. The actual impact of the best-fit change is negligible.

[niaow@finch tinygo]$ time /tmp/bintree-dev.elf 15
stretch tree of depth 16         check: 131071
32768    trees of depth 4        check: 1015808
8192     trees of depth 6        check: 1040384
2048     trees of depth 8        check: 1046528
512      trees of depth 10       check: 1048064
128      trees of depth 12       check: 1048448
32       trees of depth 14       check: 1048544
long lived tree of depth 15      check: 65535

real    0m5.167s
user    0m5.343s
sys     0m22.502s
[niaow@finch tinygo]$ time /tmp/dev-tasks.elf 15
stretch tree of depth 16         check: 131071
32768    trees of depth 4        check: 1015808
8192     trees of depth 6        check: 1040384
2048     trees of depth 8        check: 1046528
512      trees of depth 10       check: 1048064
128      trees of depth 12       check: 1048448
32       trees of depth 14       check: 1048544
long lived tree of depth 15      check: 65535

real    0m0.220s
user    0m0.209s
sys     0m0.012s
[niaow@finch tinygo]$ time /tmp/bintree-best-fit-tasks.elf 15
stretch tree of depth 16         check: 131071
32768    trees of depth 4        check: 1015808
8192     trees of depth 6        check: 1040384
2048     trees of depth 8        check: 1046528
512      trees of depth 10       check: 1048064
128      trees of depth 12       check: 1048448
32       trees of depth 14       check: 1048544
long lived tree of depth 15      check: 65535

real    0m0.226s
user    0m0.218s
sys     0m0.009s
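
(For anyone reproducing this: the tasks binaries above would have been built with TinyGo's scheduler flag, along these lines; the output path and package directory are illustrative.)

$ tinygo build -o /tmp/dev-tasks.elf -scheduler=tasks ./binarytrees
$ time /tmp/dev-tasks.elf 15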

amken3d pushed a commit to amken3d/tinygo that referenced this pull request Dec 3, 2025
Add SSTGCHint() and related functions to optimize GC behavior for
SST's simpler memory patterns.

SST has fundamentally different memory characteristics:
- Single shared stack (no per-goroutine allocations)
- Fixed-size event queues (pre-allocated)
- Tasks created once at startup
- Run-to-completion (no blocking state)

The best-fit allocator from PR tinygo-org#5105 is not critical for SST because
the allocation patterns are much more predictable and less prone to
fragmentation.