
Conversation

@niaow (Member) commented Dec 1, 2025

The allocator originally just looped through the blocks until it found a sufficiently long range. This is simple, but it fragments the heap very easily and can degrade to a full heap scan for long requests.

Instead, we now maintain a nested list of free ranges, sorted by size. The allocator selects the shortest range that is long enough, which generally reduces fragmentation. This data structure can find a range in time directly proportional to the requested length.
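
A minimal sketch of the idea (names are illustrative, not the actual runtime code): free ranges are grouped by exact length, the groups are kept in ascending length order, and an allocation walks the groups from the smallest until one is long enough.

type freeRange struct {
    start, length uintptr
    next          *freeRange // next free range of the same length
}

type sizeClass struct {
    length uintptr
    ranges *freeRange // free ranges of exactly this length
    next   *sizeClass // next size class, in ascending length order
}

// findBestFit returns the shortest free range with length >= n, or nil.
// The walk visits at most one size class per length below n, so the
// search cost is bounded by the requested length, not the heap size.
func findBestFit(classes *sizeClass, n uintptr) *freeRange {
    for c := classes; c != nil; c = c.next {
        if c.length >= n {
            return c.ranges
        }
    }
    return nil
}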

Performance in the problematic go/format benchmark:

                    │ linear.txt  │            best-fit.txt             │
                    │   sec/op    │   sec/op     vs base                │
Format/array1-10000   31.77m ± 4%   25.71m ± 2%  -19.08% (p=0.000 n=20)

                    │  linear.txt  │             best-fit.txt             │
                    │     B/s      │     B/s       vs base                │
Format/array1-10000   1.945Mi ± 4%   2.403Mi ± 2%  +23.53% (p=0.000 n=20)

@niaow (Member, Author) commented Dec 1, 2025

This is the same basic mechanism as #1181, but it is a lot cleaner.

@niaow (Member, Author) commented Dec 1, 2025

This adds 100-300 bytes of code. We need to decide if this is worth it.

@eliasnaur (Contributor) commented:

I often run out of memory because of fragmentation, so I heartily support anything that combats it.

@aykevl (Member) commented Dec 2, 2025

@dgryski, can you take a look to see whether it helps with GC performance?

@dgryski (Member) commented Dec 2, 2025

In general, "best fit" is going to reduce fragmentation at the expense of CPU time. An allocation-heavy benchmark (in this case the binary trees benchmark game) shows this to be the case:

~/go/src/github.com/dgryski/trifles/binarytrees $ hyperfine -N "./trees-dev.exe 15"  "./trees-best.exe 15"
Benchmark 1: ./trees-dev.exe 15
  Time (mean ± σ):     784.9 ms ±  15.6 ms    [User: 1507.8 ms, System: 2174.3 ms]
  Range (min … max):   758.9 ms … 804.3 ms    10 runs

Benchmark 2: ./trees-best.exe 15
  Time (mean ± σ):      1.027 s ±  0.022 s    [User: 1.877 s, System: 2.854 s]
  Range (min … max):    0.998 s …  1.057 s    10 runs

Summary
  ./trees-dev.exe 15 ran
    1.31 ± 0.04 times faster than ./trees-best.exe 15

Our current allocation scheme is "next fit".
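
For contrast, a rough sketch of next fit (illustrative, not the actual TinyGo code): a roving cursor resumes the linear scan where the previous allocation ended, falling back to a scan from the heap base.

var cursor int // block index where the previous search stopped

// scan looks for a run of n consecutive free blocks in free[lo:hi).
func scan(free []bool, lo, hi, n int) (int, bool) {
    run := 0
    for i := lo; i < hi; i++ {
        if !free[i] {
            run = 0
            continue
        }
        run++
        if run == n {
            return i - n + 1, true
        }
    }
    return 0, false
}

// nextFit avoids rescanning the (likely full) start of the heap on
// every request, which is cheap but tends to spread allocations out.
func nextFit(free []bool, n int) (int, bool) {
    if start, ok := scan(free, cursor, len(free), n); ok {
        cursor = start + n
        return start, true
    }
    if start, ok := scan(free, 0, len(free), n); ok {
        cursor = start + n
        return start, true
    }
    return 0, false
}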

Interestingly, using -gc=precise -target=wasip1, best fit comes out faster.

Running the binary trees benchmark with -gc=precise on native instead of -gc=conservative occasionally gives a SEGV. :(

@niaow (Member, Author) commented Dec 2, 2025

It might be worth waiting until #5104 is merged. I remember I was able to optimize the free-range construction on my experiments branch by exploiting the new metadata format. The current free-range construction code just loops over the individual blocks.

Also, are you using array-backed trees, or trees where each node is allocated as a separate fixed-size object? If the latter, that is basically the worst case for this change: when every allocation is the same size, there isn't really any meaningful fragmentation in the first place.
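
For example (hypothetical shape, not the exact benchmark code), with separately-allocated nodes every request has the same size, so no fit policy can pick a better block than any other:

// node is one fixed-size allocation; the whole benchmark only ever
// asks the allocator for blocks of this single size.
type node struct {
    left, right *node
}

// bottomUp builds a complete tree one node at a time. With a single
// allocation size, every free block fits equally well, so best fit
// cannot improve on next fit here.
func bottomUp(depth int) *node {
    if depth == 0 {
        return &node{}
    }
    return &node{left: bottomUp(depth - 1), right: bottomUp(depth - 1)}
}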

@niaow (Member, Author) commented Dec 2, 2025

Can you link the trees code so I can debug the SEGV?

@dgryski (Member) commented Dec 2, 2025

I'm using https://benchmarksgame-team.pages.debian.net/benchmarksgame/program/binarytrees-go-2.html. So yes, fixed-size allocations for the tree nodes.

@niaow (Member, Author) commented Dec 2, 2025

Oh right, that SEGV is the race condition where we release the GC lock before writing the layout bitmap. I fixed it in #5102 while reorganizing the alloc code, but then kinda forgot about it.

@niaow (Member, Author) commented Dec 2, 2025

Also, the main issue with the binary trees benchmark here is that the collector is not the bottleneck; the lock is. If you switch to -scheduler=tasks to eliminate the lock contention, there is a gigantic performance improvement. The actual impact of the best-fit change is negligible.

[niaow@finch tinygo]$ time /tmp/bintree-dev.elf 15
stretch tree of depth 16         check: 131071
32768    trees of depth 4        check: 1015808
8192     trees of depth 6        check: 1040384
2048     trees of depth 8        check: 1046528
512      trees of depth 10       check: 1048064
128      trees of depth 12       check: 1048448
32       trees of depth 14       check: 1048544
long lived tree of depth 15      check: 65535

real    0m5.167s
user    0m5.343s
sys     0m22.502s
[niaow@finch tinygo]$ time /tmp/dev-tasks.elf 15
stretch tree of depth 16         check: 131071
32768    trees of depth 4        check: 1015808
8192     trees of depth 6        check: 1040384
2048     trees of depth 8        check: 1046528
512      trees of depth 10       check: 1048064
128      trees of depth 12       check: 1048448
32       trees of depth 14       check: 1048544
long lived tree of depth 15      check: 65535

real    0m0.220s
user    0m0.209s
sys     0m0.012s
[niaow@finch tinygo]$ time /tmp/bintree-best-fit-tasks.elf 15
stretch tree of depth 16         check: 131071
32768    trees of depth 4        check: 1015808
8192     trees of depth 6        check: 1040384
2048     trees of depth 8        check: 1046528
512      trees of depth 10       check: 1048064
128      trees of depth 12       check: 1048448
32       trees of depth 14       check: 1048544
long lived tree of depth 15      check: 65535

real    0m0.226s
user    0m0.218s
sys     0m0.009s
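
(For anyone reproducing this: the tasks binaries above would have been built with TinyGo's scheduler flag, along these lines; the output path and package directory are illustrative.)

$ tinygo build -o /tmp/dev-tasks.elf -scheduler=tasks ./binarytrees
$ time /tmp/dev-tasks.elf 15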

amken3d pushed a commit to amken3d/tinygo that referenced this pull request Dec 3, 2025
Add SSTGCHint() and related functions to optimize GC behavior for
SST's simpler memory patterns.

SST has fundamentally different memory characteristics:
- Single shared stack (no per-goroutine allocations)
- Fixed-size event queues (pre-allocated)
- Tasks created once at startup
- Run-to-completion (no blocking state)

The best-fit allocator from PR tinygo-org#5105 is not critical for SST because
the allocation patterns are much more predictable and less prone to
fragmentation.