Assert failure: (../../../src/v/utils/retry_chain_node.cc:167) '_num_children == 0' Fiber stopped before its dependencies #3378
Comments
This looks like a lifetime issue. Most likely we're continuing to push data to the …
This should be fixed by the …
For v21.11.3-si-beta5
Reproduced in … It only started happening after about 2 hours, by which time the cluster was also hitting lots of bad_allocs. So this is probably something that's triggered in the exception handling path.
Is the bad_alloc condition understood already? If not, next time you see it, assuming we aren't dealing with something like a massive-RAM situation, grabbing a core with …
The bad_alloc is (probably) an accumulation of remote_segment instances and readers (#3392). We'll see if we're still leaking memory after that fix (I'm building test cases that run faster on local instances, where getting a core is also much easier than on GKE pods).
After leaving the system running for longer, I ended up with nodes in a CrashLoopBackOff, hitting this assertion a few seconds after startup. That is far too early for bad_allocs to happen, so it looks like they weren't part of the cause. Reader lifetime changed a bit in aeda711, but I'm not sure whether that was intended to have anything to do with this issue. Looking at the backtrace (https://gist.github.com/VadimPlh/e13ff63a35fbd46d82139f81324b6193) (it's in ~remote_segment_batch_reader), I wonder if it's intrinsically problematic to be destroying these without calling stop().
> I wonder if it's intrinsically problematic to be destroying these without calling stop()
I think that is exactly what happened, though I'm not sure why it happened.
The mechanism of the failure is simple. The *remote_segment_batch_reader*
has a *_parser* field which is initialized in the *init_parser()* method, and
the *retry_chain_node* field of the reader is passed there. When the destructor
of the reader runs without a prior call to *stop()*, the *retry_chain_node*
is destroyed first. Its destructor detects that it is still in use by the
parser and triggers the assertion.
The easiest fix for this is to change the order of fields inside
*remote_segment_batch_reader* so that *_parser* is destroyed before the
*retry_chain_node* instance. But it would be better to make sure the *stop()*
method is always called for the reader.
...and not in the destructor chain of partition_record_batch_reader_impl. This is necessary to make sure the reader is properly cleaned up via stop().
Fixes: redpanda-data#3378
Signed-off-by: John Spray <jcs@vectorized.io>
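A rough sketch of the shutdown idiom the commit message describes, assuming the usual two-phase-teardown pattern: the owning object stops the reader explicitly on its own shutdown path instead of relying on the destructor chain. The names (reader, partition_reader_owner) are illustrative, not the actual Redpanda types, and the real code is asynchronous (Seastar futures), which this sketch elides.

```cpp
#include <cassert>
#include <memory>

// The reader owns state that must be torn down in a specific order, so it
// exposes an explicit stop() instead of relying on its destructor.
struct reader {
    void stop() {
        // Ordered teardown of dependents would happen here.
        _stopped = true;
    }
    ~reader() {
        // Destruction without stop() is the bug pattern this issue hit.
        assert(_stopped && "reader destroyed without stop()");
    }
    bool _stopped = false;
};

// Stand-in for the owning object: it stops the reader on its own shutdown
// path rather than from its destructor chain.
struct partition_reader_owner {
    void stop() {
        if (_reader) {
            _reader->stop();
            _reader.reset();
        }
    }
    std::unique_ptr<reader> _reader = std::make_unique<reader>();
};

int main() {
    partition_reader_owner owner;
    owner.stop(); // explicit shutdown before owner goes out of scope
}
```

Compared with relying on field ordering, an explicit stop() gives a well-defined point where asynchronous cleanup can still be awaited before the object is destroyed.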
How to reproduce this on a single-node dev environment (using a release build of v21.11.3-si-beta6):
Topic settings:
Redpanda resource settings: 4096 MB RAM, 4 cores. Shadow indexing cache size is set small, to 640 MB.
Sequential producer and consumer.
The read is the thing that actually triggers the crash, but you need enough data to flow through the system to get to a point where reads are having to hydrate segments, and hitting resource limits while doing so.
...and not in the destructor chain of partition_record_batch_reader_impl. This is necessary to make sure the reader is properly cleaned up via stop().
Fixes: redpanda-data#3378
Signed-off-by: John Spray <jcs@vectorized.io>
(cherry picked from commit 6104b67)
Backported to 21.11.x in #3448
Version & Environment
redpanda version: Redpanda v21.11.3-beta2
BYOC cluster
SI cache size 20G
randomly kill pods
produce/consume some data with franz-go bench
https://gist.github.com/VadimPlh/e13ff63a35fbd46d82139f81324b6193