mv: adjust the overhead estimation for view updates #17420
Conversation
In order to avoid running out of memory, we can't underestimate the memory used when processing a view update. In particular, we need to handle remote view updates well, because we may create many of them at the same time, in contrast to local updates, which are processed synchronously. After investigating a coredump from a crash caused by running out of memory due to these remote view updates, we found that the current estimation is much lower than what we observed in practice; we identified an overhead of up to 2288 bytes for each remote view update. The overhead consists of:
- 512 bytes - a write_response_handler
- less than 512 bytes - excess memory allocated for the mutation in bytes_ostream
- 448 bytes - the apply_to_remote_endpoints coroutine started in mutate_MV()
- 192 bytes - a continuation chained to the coroutine above
- 320 bytes - the coroutine in result_parallel_for_each started in mutate_begin()
- 112 bytes - a continuation chained to the coroutine above
- 192 bytes - 5 unspecified allocations of 32, 32, 32, 48 and 48 bytes

This patch changes the previous overhead estimate of 256 bytes to 2288 bytes, which should account for all allocations in the current version of the code. It's worth noting that changes in the related pieces of code may result in a different overhead. The allocations are mostly captures for the background tasks. Coroutines seem to allocate extra memory; however, testing shows that replacing a coroutine with continuations may generate a few smaller futures/continuations with a larger total size. Besides that, since we wait for a response to each remote view update, we need the relatively large write_response_handler, which also includes the mutation in case we need to reuse it.
The change should not significantly affect workloads with many local updates, because we don't keep many of them alive at the same time anyway. An added benefit of correct memory utilization estimation is avoiding evictions of other memory that would otherwise be necessary to accommodate the excess memory used by view updates. Fixes scylladb#17364
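The per-update accounting described above can be sketched as follows. This is an illustrative sketch only: the constant and helper names are hypothetical, not Scylla's actual identifiers. It shows how the measured allocations add up to the new 2288-byte estimate that gets charged on top of the mutation's own size.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical names; the per-item figures come from the coredump analysis.
constexpr std::size_t remote_view_update_overhead =
      512   // write_response_handler
    + 512   // worst-case bytes_ostream over-allocation for the mutation
    + 448   // apply_to_remote_endpoints coroutine frame (mutate_MV)
    + 192   // continuation chained to that coroutine
    + 320   // result_parallel_for_each coroutine frame (mutate_begin)
    + 112   // continuation chained to that coroutine
    + 192;  // five small allocations: 32 + 32 + 32 + 48 + 48 bytes

static_assert(remote_view_update_overhead == 2288);

// Memory charged against the view-update budget for one remote update.
std::size_t estimated_update_memory(std::size_t mutation_bytes) {
    return mutation_bytes + remote_view_update_overhead;
}
```

The key point is that the overhead dominates for small mutations, which is exactly the case where the old 256-byte estimate was most wrong.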
Looks good. I don't like that we replaced the number 256, which we pulled out of thin air, with the number 2288, which, while it has some experimental backing, may change in the future without anyone noticing. One way to handle this would have been to have a single object that holds everything we need - the mutation, the task, etc. - and to be able to request the size of that single object. But I'm guessing that would be very hard to do, so increasing this number is a very good start, and this is an important patch.
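The "single object" idea above could look roughly like this. The types here are hypothetical placeholders, not Scylla's actual ones: the point is that if the handler, coroutine frames, and continuations all lived inside one aggregate, the accounting could ask that object for its own footprint instead of relying on a hand-maintained constant that can silently drift from the code.

```cpp
#include <cstddef>

// Hypothetical aggregate; member sizes stand in for the real objects.
struct remote_view_update_state {
    char handler_storage[512];        // would hold the write_response_handler
    char coroutine_frames[448 + 320]; // the two coroutine frames
    char continuations[192 + 112];    // chained continuations
    std::size_t dynamic_bytes = 0;    // mutation serialization + small allocs

    std::size_t memory_footprint() const {
        // sizeof(*this) tracks code changes automatically; only the
        // dynamically-sized parts still need explicit accounting.
        return sizeof(*this) + dynamic_bytes;
    }
};
```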
We should backport this PR everywhere possible. It's a trivial change, but it has profound implications for how the MV flow control works, or doesn't work. Before this patch, for small view updates, while MV flow control strives to limit view-update memory use to 10%, it could severely underestimate that memory use and reach closer to 100%.
🟢 CI State: SUCCESS ✅
An important note: it is a pre-existing condition that we don't have a test for MV flow control. Not having such a test allowed us to miss critical bugs in the MV flow control algorithm, such as the infamous da54d0f and now this one - in the first case the flow control never worked at all (it never delayed the client), and in the second case it worked, but reached much more than 10% memory use. Because it's an existing problem, and because your fix is "obvious" and also backed by a test not included here (which you wrote for a different feature, flow control of a large partition deletion), I am not insisting that this PR wait for that test to exist. However, @kostja, I do think that someone should be assigned to write such a test. An effective test should be able to reproduce the bug fixed in this PR.
Are the numbers before or after 14bf09f?
They're after 14bf09f, but it doesn't seem like that patch helped much in this case - I'm not sure in how many places the patch applies here, but in each of those cases it may have improved the situation by at most 16 bytes out of a total of over 2000. In particular, the
How comfortable are we with backporting this to 5.4 and 5.2?
I added the backport 5.4 and 5.2 labels because I thought we should do it. I think the patch is relatively risk-free in that it just allows MV flow control to kick in earlier (because of the higher estimate), but it will not reduce performance - the controller will converge on the right write rate, the rate at which the view-update memory stays roughly constant, neither growing nor shrinking. (It doesn't matter how we measure the memory use for the convergence.) The worst thing that could happen if we over-estimate the memory use is that we throttle writes too much in the case of a big sudden burst. We wanted a write burst that is close to 10% of memory not to experience large delays (allowing bursts is why we used the function).

The only thing that makes me unhappy is the lack of a test that reproduces this problem and its fix. But that's the way it is. I'll do the backports later today or tomorrow, unless someone objects.
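The throttling behaviour described above can be illustrated with a toy controller. This is purely a sketch, not Scylla's actual flow-control algorithm, and the delay curve is made up: the point is that over-estimating per-update memory merely starts throttling earlier, while under-estimating it (the bug) lets real usage approach 100% of memory before any delay kicks in.

```cpp
#include <cstddef>

// Hypothetical throttle: delay writes as view-update memory approaches
// its budget (10% of total memory, as described above).
double throttle_delay_ms(std::size_t view_update_bytes, std::size_t total_memory) {
    const double budget = 0.10 * static_cast<double>(total_memory);
    const double used = static_cast<double>(view_update_bytes);
    if (used <= 0.5 * budget) {
        return 0.0;  // allow bursts: no delay while well below the budget
    }
    // Grow the delay as usage climbs from half the budget toward it.
    double over = (used - 0.5 * budget) / (0.5 * budget);
    return 10.0 * over * over;  // quadratic back-off, purely illustrative
}
```

Under this model, inflating each update's estimated size just shifts the same curve left, so the controller converges on a slightly lower write rate rather than misbehaving.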
Why not let the bot do the backports?
It's worth noting that the overhead will most likely be slightly different in previous versions. In particular, I see this code path has recently been coroutinized, and we store coroutines differently than futures in memory. IIRC, when I compared the memory used by the two approaches, the continuation-based case used slightly more memory, something like 2600 vs 2288. Changing it from 256 in 5.2 and 5.4 would still be much more precise, but to get exact numbers I'd need to perform the analysis again.
How do I do that?
Since this was closed last month, I assume it was before the automation was merged. You just need to add a label.