mv: adjust the overhead estimation for view updates #17420

Closed

Conversation

Contributor

@wmitros wmitros commented Feb 20, 2024

To avoid running out of memory, we must not underestimate the memory used when processing a view update. In particular, we need to handle remote view updates well, because we may create many of them at the same time, in contrast to local updates, which are processed synchronously.

After investigating a coredump generated in a crash caused by running out of memory due to these remote view updates, we found that the current estimate is much lower than what we observed in practice; we identified an overhead of up to 2288 bytes for each remote view update. The overhead consists of:

  • 512 bytes - a write_response_handler
  • less than 512 bytes - excessive memory allocation for the mutation in bytes_ostream
  • 448 bytes - the apply_to_remote_endpoints coroutine started in mutate_MV()
  • 192 bytes - a continuation to the coroutine above
  • 320 bytes - the coroutine in result_parallel_for_each started in mutate_begin()
  • 112 bytes - a continuation to the coroutine above
  • 192 bytes - 5 unspecified allocations of 32, 32, 32, 48 and 48 bytes

This patch raises the previous overhead estimate of 256 bytes to 2288 bytes, which should account for all allocations in the current version of the code. It's worth noting that changes in the related pieces of code may result in a different overhead.
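As a rough sketch only (the constant and function names below are hypothetical, not Scylla's actual code), the accounting change amounts to adding a larger fixed per-update overhead on top of the mutation size when charging view-update memory:

```cpp
#include <cstddef>

// Hypothetical illustration of the estimate described above; the real
// accounting lives in Scylla's view-update path and uses different names.
constexpr std::size_t remote_view_update_overhead_bytes = 2288;  // previously 256

// Memory charged for one remote view update: the serialized mutation
// plus the fixed per-update overhead (handler, coroutine frames, etc.).
inline std::size_t estimated_view_update_memory(std::size_t mutation_size_bytes) {
    return mutation_size_bytes + remote_view_update_overhead_bytes;
}
```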

The allocations seem to be mostly captures for the background tasks. Coroutines seem to allocate extra memory; however, testing shows that replacing a coroutine with continuations may result in several smaller futures/continuations with a larger total size. Besides that, since we wait for a response for each remote view update, we need the relatively large write_response_handler, which also holds the mutation in case we need to reuse it.

The change should not significantly affect workloads with many local updates, because we don't keep many of them around at the same time anyway. An added benefit of a correct memory utilization estimate is that we avoid evicting other memory that would otherwise be needed to absorb the excess memory used by view updates.

Fixes #17364

Contributor

@nyh nyh left a comment


Looks good. I don't like that we replaced the number 256, which we pulled out of thin air, with the number 2288, which, while it has some experimental backing, may change in the future without anyone noticing. One way to do this would have been to have one object which holds everything we need - the mutation, the task, etc. - and to be able to request the size of that single object. But I'm guessing that would be very hard to do, so increasing this number is a very good start, and this is an important patch.
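For illustration, a minimal hypothetical sketch of the single-object approach suggested above (not an existing or proposed Scylla API; the names and fields are invented): one wrapper owns the per-update state and reports its own footprint, so the estimate cannot silently drift from the actual allocations.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Invented for illustration: one object that owns what a remote view
// update needs and can report how much memory it actually uses.
struct remote_view_update_state {
    std::vector<std::int8_t> serialized_mutation;  // stand-in for the frozen mutation
    // ...response handler, coroutine frame, etc. would also be owned here

    std::size_t memory_usage() const {
        // sizeof(*this) covers the fixed part; capacity() the dynamic part.
        return sizeof(*this) + serialized_mutation.capacity();
    }
};
```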

Contributor

nyh commented Feb 20, 2024

We should backport this PR everywhere possible. It's a trivial change, but it has profound implications for how the MV flow control works (or doesn't work). Before this patch, for small view updates, while MV flow control strives to limit view update memory use to 10%, it could severely underestimate that memory use and reach closer to 100%.

@nyh nyh added the backport/5.2 and backport/5.4 labels Feb 20, 2024
@scylladb-promoter
Contributor

🟢 CI State: SUCCESS

✅ - Build
✅ - Container Test
✅ - dtest
✅ - Unit Tests

Build Details:

  • Duration: 2 hr 6 min
  • Builder: spider2.cloudius-systems.com

Contributor

nyh commented Feb 20, 2024

An important note: it is a pre-existing condition that we don't have a test for MV flow control. Not having such a test allowed us to miss critical bugs in the MV flow control algorithm, such as the infamous da54d0f and now this one - in the first case the flow control never worked at all (it never delayed the client), and in the second case it worked but reached much more than 10% of memory use.

Because it's an existing problem, and because your fix is "obvious" and also backed by a test not included here (which you wrote for a different feature, flow control of large partition deletions), I am not insisting that this PR wait for such a test to exist. However, @kostja, I do think that someone should be assigned to write it. An effective test should be able to reproduce the bug fixed in this PR.

Contributor

mykaul commented Feb 20, 2024

Are the numbers before or after 14bf09f?

Contributor Author

wmitros commented Feb 20, 2024

Are the numbers before or after 14bf09f?

They're after 14bf09f, but it doesn't seem like that patch helped much in this case - I'm not sure in how many places the patch applies here, but in each of them it may have improved the situation by at most 16 bytes out of a total of over 2000. In particular, the "less than 512 bytes - excessive memory allocation for the mutation in bytes_ostream" item is a separate problem.

Contributor

mykaul commented Mar 17, 2024

How comfortable are we with backporting this to 5.4 and 5.2?

Contributor

nyh commented Mar 17, 2024

I added the backport 5.4 and 5.2 labels because I thought we should do it.

I think the patch is relatively risk-free in that it just allows MV flow control to kick in earlier (because of the higher estimate), but it will not reduce performance - the controller will converge on the right write rate, i.e., the rate at which the view-update memory stays roughly constant, neither growing nor shrinking (it doesn't matter how we measure the memory use for the convergence). The worst thing that could happen if we over-estimate the memory use is that we throttle writes too much in case of a big sudden burst. We wanted a write burst that is close to 10% of memory to not experience large delays (allowing bursts is why we used the function x^3 and not x), and if we overestimate, we can find ourselves allowing only smaller bursts. But I don't think this is a big risk.
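A toy sketch of the cubic-delay idea mentioned above (hypothetical names and a made-up max_delay parameter, not the actual controller code): with x^3 the delay stays negligible for small bursts and only grows steeply as view-update memory approaches its budget.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>

// Toy illustration: map the fraction of the view-update memory budget in
// use (0.0-1.0) to a write delay. x^3 keeps small bursts essentially
// undelayed and throttles hard only near the budget limit.
std::chrono::microseconds throttle_delay(double used_fraction,
                                         std::chrono::microseconds max_delay) {
    const double x = std::clamp(used_fraction, 0.0, 1.0);
    const double factor = x * x * x;
    return std::chrono::microseconds(
        static_cast<std::int64_t>(factor * static_cast<double>(max_delay.count())));
}
```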

The only thing that makes me unhappy is the lack of test that reproduces this problem and its fix. But that's the way it is.

I'll do the backports later today or tomorrow, unless someone objects.

Contributor

mykaul commented Mar 17, 2024

Why not let the bot do the backports?

Contributor Author

wmitros commented Mar 18, 2024

I'll do the backports later today or tomorrow, unless someone objects.

It's worth noting that the overhead will most likely be slightly different in previous versions. In particular, I see that this code path has recently been coroutinized, and coroutines are stored in memory differently than futures. IIRC, when I was comparing the memory used by the two approaches, the continuation-based case used slightly more memory, something like 2600 vs 2288 bytes. Changing the estimate from 256 in 5.2 and 5.4 would still be much more precise, but to get exact numbers I'd need to perform the analysis again.

Contributor

nyh commented Mar 18, 2024

Why not let the bot do the backports?

How do I do that?

Contributor

mykaul commented Mar 18, 2024

Why not let the bot do the backports?

How do I do that?

@yaronkaikov ?

@yaronkaikov
Contributor

Why not let the bot do the backports?

How do I do that?

Since this was closed last month, I assume it was before the automation was merged. You just need to add the promoted-to-master label (which today we add automatically once the commit is promoted).

@scylladb scylladb deleted a comment from mergify bot Mar 18, 2024
@yaronkaikov
Contributor

5.4 - #17859

5.2 - #17858

@mykaul mykaul removed the backport/5.2 label Mar 27, 2024
dgarcia360 pushed a commit to dgarcia360/scylla that referenced this pull request Apr 30, 2024
Labels
Backport candidate, backport/5.4, promoted-to-master
Development

Successfully merging this pull request may close these issues.

Materialized view flow control doesn't account small view updates properly, and can lead to OOM