mv: adjust the overhead estimation for view updates #17420

Closed

Conversation

Contributor

@wmitros wmitros commented Feb 20, 2024

To avoid running out of memory, we must not underestimate the memory used when processing a view update. In particular, we need to handle remote view updates well, because we may create many of them at the same time, in contrast to local updates, which are processed synchronously.

After investigating a coredump generated in a crash caused by running out of memory due to these remote view updates, we found that the current estimate is much lower than what we observed in practice; we identified an overhead of up to 2288 bytes for each remote view update. The overhead consists of:

  • 512 bytes - a write_response_handler
  • less than 512 bytes - excessive memory allocation for the mutation in bytes_ostream
  • 448 bytes - the apply_to_remote_endpoints coroutine started in mutate_MV()
  • 192 bytes - a continuation to the coroutine above
  • 320 bytes - the coroutine in result_parallel_for_each started in mutate_begin()
  • 112 bytes - a continuation to the coroutine above
  • 192 bytes - 5 unspecified allocations of 32, 32, 32, 48 and 48 bytes

This patch raises the previous overhead estimate of 256 bytes to 2288 bytes, which should account for all allocations in the current version of the code. It's worth noting that changes in the related pieces of code may result in a different overhead.
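As a rough sketch only (the constant and function names below are hypothetical, not Scylla's actual code), the accounting change amounts to adding a larger fixed per-update overhead on top of the mutation size when charging view-update memory:

```cpp
#include <cstddef>

// Hypothetical illustration of the estimate described above; the real
// accounting lives in Scylla's view-update path and uses different names.
constexpr std::size_t remote_view_update_overhead_bytes = 2288;  // previously 256

// Memory charged for one remote view update: the serialized mutation
// plus the fixed per-update overhead (handler, coroutine frames, etc.).
inline std::size_t estimated_view_update_memory(std::size_t mutation_size_bytes) {
    return mutation_size_bytes + remote_view_update_overhead_bytes;
}
```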

The allocations seem to be mostly captures for the background tasks. Coroutines seem to allocate extra memory; however, testing shows that replacing a coroutine with continuations may result in several smaller futures/continuations with a larger total size. Besides that, since we wait for a response for each remote view update, we need the relatively large write_response_handler, which also holds the mutation in case we need to reuse it.

The change should not significantly affect workloads with many local updates, because we don't keep many of them around at the same time anyway. An added benefit of a correct memory utilization estimate is that we avoid evicting other memory that would otherwise be needed to absorb the excess memory used by view updates.

Fixes #17364

Contributor

@nyh nyh left a comment


Looks good. I don't like that we replaced the number 256, which we pulled out of thin air, with the number 2288, which, while it has some experimental backing, may change in the future without anyone noticing. One way to do this would have been to have one object which holds everything we need - the mutation, the task, etc. - and to be able to request the size of that single object. But I'm guessing that would be very hard to do, so increasing this number is a very good start, and this is an important patch.
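For illustration, a minimal hypothetical sketch of the single-object approach suggested above (not an existing or proposed Scylla API; the names and fields are invented): one wrapper owns the per-update state and reports its own footprint, so the estimate cannot silently drift from the actual allocations.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Invented for illustration: one object that owns what a remote view
// update needs and can report how much memory it actually uses.
struct remote_view_update_state {
    std::vector<std::int8_t> serialized_mutation;  // stand-in for the frozen mutation
    // ...response handler, coroutine frame, etc. would also be owned here

    std::size_t memory_usage() const {
        // sizeof(*this) covers the fixed part; capacity() the dynamic part.
        return sizeof(*this) + serialized_mutation.capacity();
    }
};
```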

Contributor

nyh commented Feb 20, 2024

We should backport this PR everywhere possible. It's a trivial change, but it has profound implications for how the MV flow control works (or doesn't work). Before this patch, for small view updates, while MV flow control strives to limit view update memory use to 10%, it could severely underestimate that memory use and reach closer to 100%.

@nyh nyh added the backport/5.2 and backport/5.4 labels Feb 20, 2024
@scylladb-promoter
Contributor

🟢 CI State: SUCCESS

✅ - Build
✅ - Container Test
✅ - dtest
✅ - Unit Tests

Build Details:

  • Duration: 2 hr 6 min
  • Builder: spider2.cloudius-systems.com

Contributor

nyh commented Feb 20, 2024

An important note: it is a pre-existing condition that we don't have a test for MV flow control. Not having such a test allowed us to miss critical bugs in the MV flow control algorithm, such as the infamous da54d0f and now this one - in the first case the flow control never worked at all (it never delayed the client), and in the second case it worked but reached much more than 10% of memory use.

Because it's an existing problem, and because your fix is "obvious" and also backed by a test not included here (which you wrote for a different feature, flow control of large partition deletions), I am not insisting that this PR wait for such a test to exist. However, @kostja, I do think that someone should be assigned to write it. An effective test should be able to reproduce the bug fixed in this PR.

Contributor

mykaul commented Feb 20, 2024

Are the numbers before or after 14bf09f?

Contributor Author

wmitros commented Feb 20, 2024

Are the numbers before or after 14bf09f?

They're after 14bf09f, but it doesn't seem like that patch helped much in this case - I'm not sure in how many places the patch applies here, but in each of them it may have improved the situation by at most 16 bytes out of a total of over 2000. In particular, the "less than 512 bytes - excessive memory allocation for the mutation in bytes_ostream" item is a separate problem.

Contributor

mykaul commented Mar 17, 2024

How comfortable are we with backporting this to 5.4 and 5.2?

Contributor

nyh commented Mar 17, 2024

I added the backport 5.4 and 5.2 labels because I thought we should do it.

I think the patch is relatively risk-free in that it just allows MV flow control to kick in earlier (because of the higher estimate), but it will not reduce performance - the controller will converge on the right write rate, i.e., the rate at which the view-update memory stays roughly constant, neither growing nor shrinking (it doesn't matter how we measure the memory use for the convergence). The worst thing that could happen if we over-estimate the memory use is that we throttle writes too much in case of a big sudden burst. We wanted a write burst that is close to 10% of memory to not experience large delays (allowing bursts is why we used the function x^3 and not x), and if we overestimate, we can find ourselves allowing only smaller bursts. But I don't think this is a big risk.
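A toy sketch of the cubic-delay idea mentioned above (hypothetical names and a made-up max_delay parameter, not the actual controller code): with x^3 the delay stays negligible for small bursts and only grows steeply as view-update memory approaches its budget.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>

// Toy illustration: map the fraction of the view-update memory budget in
// use (0.0-1.0) to a write delay. x^3 keeps small bursts essentially
// undelayed and throttles hard only near the budget limit.
std::chrono::microseconds throttle_delay(double used_fraction,
                                         std::chrono::microseconds max_delay) {
    const double x = std::clamp(used_fraction, 0.0, 1.0);
    const double factor = x * x * x;
    return std::chrono::microseconds(
        static_cast<std::int64_t>(factor * static_cast<double>(max_delay.count())));
}
```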

The only thing that makes me unhappy is the lack of test that reproduces this problem and its fix. But that's the way it is.

I'll do the backports later today or tomorrow, unless someone objects.

Contributor

mykaul commented Mar 17, 2024

Why not let the bot do the backports?

Contributor Author

wmitros commented Mar 18, 2024

I'll do the backports later today or tomorrow, unless someone objects.

It's worth noting that the overhead will most likely be slightly different in previous versions. In particular, I see that this code path has recently been coroutinized, and coroutines are stored in memory differently than futures. IIRC, when I was comparing the memory used by the two approaches, the continuation-based case used slightly more memory, something like 2600 vs 2288 bytes. Changing the estimate from 256 in 5.2 and 5.4 would still be much more precise, but to get exact numbers I'd need to perform the analysis again.

Contributor

nyh commented Mar 18, 2024

Why not let the bot do the backports?

How do I do that?

Contributor

mykaul commented Mar 18, 2024

Why not let the bot do the backports?

How do I do that?

@yaronkaikov ?

@yaronkaikov
Contributor

Why not let the bot do the backports?

How do I do that?

Since this was closed last month, I assume it was before the automation was merged. You just need to add the promoted-to-master label (which today we add automatically once the commit is promoted).

@scylladb scylladb deleted a comment from mergify bot Mar 18, 2024
@yaronkaikov
Contributor

5.4 - #17859

5.2 - #17858

@mykaul mykaul removed the backport/5.2 label Mar 27, 2024
dgarcia360 pushed a commit to dgarcia360/scylla that referenced this pull request Apr 30, 2024
Labels
Backport candidate, backport/5.4, promoted-to-master
Development

Successfully merging this pull request may close these issues.

Materialized view flow control doesn't account small view updates properly, and can lead to OOM