sha256: use freelist to avoid allocations #14
Closed
Running the benchmarks and inspecting profiles, it seems that most allocations were related to the sha256 calculations used to build the Merkle tree.
The intermediate slice created by appending the left and right child to calculate the hash value has a high cost: since the slice isn't returned, the cost and size of the allocation aren't amortized.
A usual alternative is re-using these kinds of slices to avoid allocations in further operations. I leveraged the `sync.Pool` type, which tends to be a reasonable choice: it comes from the stdlib, it has some internal optimizations regarding thread locality, and, of course, it's thread-safe. Whenever the pool needs a new slice, it creates one of size `2*NodeSize`.

This implementation works correctly as long as the total size of `lChild` and `rChild` is at most `2*NodeSize`, considering the nature of `copy`. If that isn't the case, this could be switched to using `append`, most probably with the same kind of perf improvement. Using pools with "growable" elements may be a more delicate discussion.

Here's a bench compare between `develop` and this change:

So, both total bytes allocated and the number of allocations improved dramatically. Total time/op wasn't affected, or got slightly better in some cases.
By the way, I think the benchmark results (even before this change) are dominated by the core tree-building operations. Since this change sits at the core of most operations, that explains why it improved nearly all benchmarks.
Also, since less garbage is created, this change should reduce GC pressure over the application's whole lifecycle.