
Reduce memory allocations on servers that process a large number of streams #1999

Merged
djc merged 1 commit into quinn-rs:main from alessandrod:reuse-recvs on Oct 14, 2024

Conversation

@alessandrod
Contributor

The Solana blockchain uses unidirectional streams to deliver transactions. A transaction can be up to 1232 bytes, and each transaction is its own stream. A validator accepts thousands of connections, and each connection sends millions of streams.

The Solana validator Agave uses Quinn for transaction processing, with jemalloc as the global allocator. We're encountering an issue where, due to the high volume of allocations while ingesting transactions, jemalloc struggles during network spikes. It blows out the tcache, spends excessive time managing arenas, allocating large amounts of memory, and then attempting to free it during decay. This results in a significant amount of time spent on syscalls and arena locking, leading to contention and random stalls in other threads.

Quinn is one of the largest sources of allocations, so I've been working on improving things.

This is a profile of a single node validator ingesting 100k transactions with master, with a 512 streams receive window:

[profile screenshot]

And this is with the PR applied:

[profile screenshot]

As you can see, with master we get 2 allocations for each stream ingested. With the PR we get 2 allocations for each stream in the window.

The PR essentially introduces a free list of Recvs. When it's time to create a new Recv, it tries to pop one from the free list.

This approach works great for us, but it relies on a few key assumptions: that there will be a large number of streams, the stream receive window will generally be fully utilized, and each stream will receive a similar amount of data. If these assumptions aren't met, this PR may lead to some memory inefficiency.
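The free-list idea described above can be sketched roughly as follows. This is an illustrative reimplementation, not Quinn's actual internals: `Recv`, `reinit`, `alloc_recv`, and `stream_closed` are stand-in names, and the `BinaryHeap<u64>` stands in for the Assembler's reassembly heap.

```rust
use std::collections::BinaryHeap;

/// Stand-in for quinn-proto's per-stream receive state.
struct Recv {
    // The BinaryHeap inside the Assembler is one of the expensive
    // allocations; clearing it keeps the backing buffer for reuse.
    assembler: BinaryHeap<u64>,
}

impl Recv {
    fn new() -> Self {
        Recv { assembler: BinaryHeap::new() }
    }

    /// Reset per-stream state without freeing the heap's backing storage.
    fn reinit(&mut self) {
        self.assembler.clear();
    }
}

struct StreamsState {
    /// The free list: Recvs from closed streams, kept for reuse.
    free_recvs: Vec<Box<Recv>>,
}

impl StreamsState {
    /// Pop a spare Recv if one exists, otherwise allocate a fresh one.
    fn alloc_recv(&mut self) -> Box<Recv> {
        match self.free_recvs.pop() {
            Some(mut recv) => {
                recv.reinit(); // reuse: no allocation
                recv
            }
            None => Box::new(Recv::new()), // free list empty: allocate
        }
    }

    /// Called when a stream closes: return its Recv to the free list
    /// instead of dropping (and deallocating) it.
    fn stream_closed(&mut self, recv: Box<Recv>) {
        self.free_recvs.push(recv);
    }
}
```

Once the free list warms up to the size of the stream receive window, a steady stream churn stops hitting the allocator at all.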

I'm wondering whether you'd be interested in merging something like this; otherwise I'm happy to keep it in our fork. This is also just a first pass: if you're interested, I could try to make this behavior opt-in, or tunable somehow to limit the amount of spare memory allocated. I'm open to ideas!

@djc
Member

djc commented Sep 29, 2024

Haven't looked closely at your changes yet, but the idea of having some kind of free list seems conceptually sound. If we want to merge this, I think we should come up with heuristics that make it an improvement for a majority of downstream users (in the default configuration), though we could presumably add some configuration knobs.

@alessandrod
Contributor Author

Thanks for the super quick feedback, definitely appreciated!

Haven't looked closely at your changes yet, but the idea of having some kind of free list seems conceptually sound. If we want to merge this, I think we should come up with heuristics that make it an improvement for a majority of downstream users (in the default configuration), though we could presumably add some configuration knobs.

Sweet! I haven't thought about heuristics yet, but things we might need to tune are:

  • the number of Recvs in the free list: is there a reason to ever want this lower than max_*_streams?
  • the number of elements in the Assembler BinaryHeap: I guess the larger the stream, the more elements it'll contain. We might need to shrink it at some point.
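One possible shape for those two knobs, sketched very roughly. The names here (`FreeList`, `max_free_recvs`, `MAX_SPARE_HEAP`, `put`) are made up for illustration and are not Quinn configuration; the real heuristics would need tuning against actual workloads.

```rust
use std::collections::BinaryHeap;

/// Stand-in for quinn-proto's per-stream receive state.
struct Recv {
    assembler: BinaryHeap<u64>,
}

/// Illustrative cap on how much spare Assembler capacity one
/// pooled Recv may retain.
const MAX_SPARE_HEAP: usize = 1024;

struct FreeList {
    recvs: Vec<Box<Recv>>,
    /// Could default to the configured max concurrent streams, so a
    /// fully utilized stream window never allocates after warm-up.
    max_free_recvs: usize,
}

impl FreeList {
    /// Return a Recv to the pool, bounding both the pool's length and
    /// the memory each pooled Recv keeps around.
    fn put(&mut self, mut recv: Box<Recv>) {
        if self.recvs.len() >= self.max_free_recvs {
            return; // pool is full: drop the Recv, freeing it normally
        }
        recv.assembler.clear();
        // Shrink oversized heaps so one unusually large stream
        // doesn't pin memory for the lifetime of the connection.
        if recv.assembler.capacity() > MAX_SPARE_HEAP {
            recv.assembler.shrink_to(MAX_SPARE_HEAP);
        }
        self.recvs.push(recv);
    }
}
```

Capping the pool at the stream window size addresses the first bullet; shrinking retained heaps addresses the second without giving up reuse entirely.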

@Ralith
Collaborator

Ralith commented Sep 29, 2024

I'm happy to improve our performance for this use case, because numerous disposable streams are a major QUIC design pattern, and one I use personally in downstream projects besides.

I think the downside of this change is modest. Connections which use a consistent proportion of their stream window (whether that's all, none, or in between) will receive benefits without drawbacks. Only for connections which usually leave their stream window mostly empty but on rare occasion fill it will we see an increase in average memory use by the pooled Recvs, and that memory use should still be much smaller than the worst-case memory when the stream window is actually full of streams which are individually full of unread stream data. I'd therefore prefer to avoid introducing new configuration unless someone has a concrete use case that we can't cover automatically.

I'm interested in input from @jeromegn, responsible for previous work to reduce average memory use due to Recvs in #1682.

It blows out the tcache, spends excessive time managing arenas, allocating large amounts of memory, and then attempting to free it during decay. This results in a significant amount of time spent on syscalls and arena locking, leading to contention and random stalls in other threads.

How much of an impact do you see to these factors? The profile excerpts you showed demonstrate unsurprisingly that this change reduces allocator load from the affected code, but it's not obvious to me what proportion of Quinn's allocator load this accounts for. We should still have at least one allocation per incoming datagram for the datagram itself, for example.

@alessandrod
Contributor Author

It blows out the tcache, spends excessive time managing arenas, allocating large amounts of memory, and then attempting to free it during decay. This results in a significant amount of time spent on syscalls and arena locking, leading to contention and random stalls in other threads.

How much of an impact do you see to these factors? The profile excerpts you showed demonstrate unsurprisingly that this change reduces allocator load from the affected code, but it's not obvious to me what proportion of Quinn's allocator load this accounts for. We should still have at least one allocation per incoming datagram for the datagram itself, for example.

In the screenshot you can see that the binary heap allocation accounts for 44% of all allocations (blocks) made by Quinn, and the Recv box for another 44%. Transactions can be up to 1232 bytes, but many are much smaller, so you often get a few transactions in a single datagram.

The datagram allocation you're referring to was, in that particular run, 5%:

[profile screenshot]

I'm going to send a PR for this one too :)

@Ralith
Collaborator

Ralith commented Sep 30, 2024

In the screenshot you can see that the binary heap allocation accounts for 44% of all allocations (blocks) made by Quinn, and the Recv box for another 44%.

Wow, awesome find!

I'm going to send a PR for [reducing datagram allocations] too :)

While very welcome, that may be a harder problem, since ownership of datagram memory is extremely dynamic: it may move between threads, hang around until application data is consumed, and so on. There might be some low-hanging fruit for handling frames that don't carry application data without allocating, at least.

veeso added a commit to veeso/solana-quinn that referenced this pull request Oct 4, 2024
@alessandrod
Contributor Author

I think I've addressed the nits. If it's OK with you, I'd like to remove the free_recv alloc (#1999 (comment)) in a follow-up PR. I did just try to do it, but it's more work than expected, and I'm working on another PR that is more important for us (reducing the Bytes allocations).

@Ralith
Collaborator

Ralith commented Oct 11, 2024

I'd like to remove the free_recv alloc (#1999 (comment)) in a follow-up PR

SGTM, I expect that to take significant additional review anyway.

Recv is used to receive data during the lifetime of a stream. When a
stream is closed, instead of throwing away the corresponding Recv, put
it in a free list so it can be reused.

This reduces allocations by avoiding repeated allocations of BinaryHeaps
within Assemblers, and of the Boxes holding the Recvs themselves. For
servers that serve millions of streams, this drastically reduces
allocations: effectively, a max_(uni|bi)_streams window's worth of Recvs
is allocated once and then reused instead of allocating for each new
stream.
@djc djc added this pull request to the merge queue Oct 14, 2024
@djc
Member

djc commented Oct 14, 2024

Thanks!

Merged via the queue into quinn-rs:main with commit 41850c8 Oct 14, 2024