Reduce memory allocations on servers that process a large number of streams #1999
djc merged 1 commit into quinn-rs:main
Conversation
Haven't looked closely at your changes yet, but conceptually the idea of having some kind of free list seems sound. If we want to merge this, I think we should probably come up with some heuristics that make this an improvement for the majority of downstream users (in the default configuration), though we could presumably add some configuration knobs.
(force-pushed e9cad7d to 29c32e7)
Thanks for the super quick feedback, definitely appreciated!
Sweet! I haven't thought about heuristics yet, but there are a few things we might need to tune.
I'm happy to improve our performance for this use case: numerous disposable streams are a major QUIC design pattern, and one I use personally in downstream projects besides. I think the downside of this change is modest. Connections which use a consistent proportion of their stream window (whether that's all, none, or somewhere in between) will receive the benefits without drawbacks. Only for connections which usually leave their stream window mostly empty, but on rare occasion fill it, will we see an increase in average memory use from the pooled Recvs. I'm interested in input from @jeromegn, responsible for previous work to reduce average memory use in this area.
How much of an impact do you see from these factors? The profile excerpts you showed demonstrate, unsurprisingly, that this change reduces allocator load from the affected code, but it's not obvious to me what proportion of Quinn's overall allocator load this accounts for. We should still have at least one allocation per incoming datagram for the datagram itself, for example.
(force-pushed 29c32e7 to 83ffc41)
Wow, awesome find!
While very welcome, that may be a harder problem, since ownership of datagram memory is extremely dynamic: it may move between threads, hang around until application data is consumed, etc. There might be some low-hanging fruit in handling frames that don't carry application data without allocating, at least.
(force-pushed 83ffc41 to cad7ddc)
I think I've addressed the nits. If it's OK with you, I'd like to remove the free_recv alloc (#1999 (comment)) in a follow-up PR. I did just try to do it, but it's more work than expected, and I'm working on another PR that is more important for us (reducing the Bytes allocations).
SGTM, I expect that to take significant additional review anyway.
(force-pushed cad7ddc to 7ff82fb)
Recv is used to receive data during the lifetime of a stream. When a stream is closed, instead of throwing away the corresponding Recv, put it in a free list so it can be reused. This reduces allocations by avoiding repeated allocations of the BinaryHeaps inside Assemblers, and of the Boxes for the Recvs themselves. For servers that serve millions of streams, this drastically reduces allocations: effectively, a max_(uni|bi)_streams window worth of Recvs is allocated once and then reused, instead of allocating for each new stream.
(force-pushed 7ff82fb to 3f2e1aa)
Thanks!

The Solana blockchain uses uni streams to deliver transactions. A transaction can be up to 1232 bytes, and each transaction is a stream. A validator accepts thousands of connections, and each connection sends millions of streams.
The Solana validator Agave uses Quinn for transaction processing, with jemalloc as the global allocator. We're encountering an issue where, due to the high volume of allocations while ingesting transactions, jemalloc struggles during network spikes. It blows out the tcache, spends excessive time managing arenas, allocating large amounts of memory, and then attempting to free it during decay. This results in a significant amount of time spent on syscalls and arena locking, leading to contention and random stalls in other threads.
Quinn is one of the largest sources of allocations, so I've been working on improving things.
This is a profile of a single node validator ingesting 100k transactions with master, with a 512 streams receive window:
And this is with the PR applied:
As you can see, with master we get 2 allocations for each stream ingested. With the PR, we get 2 allocations for each stream in the window.
The PR essentially introduces a free list of Recvs. When it's time to create a new Recv, it tries to pop one from the free list.
This approach works great for us, but it relies on a few key assumptions: that there will be a large number of streams, that the stream receive window will generally be fully utilized, and that each stream will receive a similar amount of data. If these assumptions aren't met, this PR may lead to some memory inefficiency.
I'm wondering whether you'd be interested in merging something like this; otherwise I'm happy to keep it in our fork. Also, this is just a first pass: if you're interested, I could try to make this behavior opt-in, or tunable somehow to limit the amount of spare memory allocated. I'm open to ideas!