
Assert failure: (../../../src/v/utils/retry_chain_node.cc:167) '_num_children == 0' Fiber stopped before its dependencies #3378

Closed · VadimPlh opened this issue Dec 30, 2021 · 10 comments · Fixed by #3401
Labels: area/cloud-storage (Shadow indexing subsystem), kind/bug (Something isn't working)

Comments

@VadimPlh (Contributor)

Version & Environment

redpanda version: Redpanda v21.11.3-beta2

BYOC cluster
SI cache size: 20G
Randomly killing pods
Producing/consuming data with the franz-go bench tool

Backtrace: https://gist.github.com/VadimPlh/e13ff63a35fbd46d82139f81324b6193
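The assertion message itself describes a parent/child lifetime contract inside retry_chain_node: a node counts the child nodes that still reference it, and that count must be zero by the time the node (the fiber owning it) is stopped or destroyed. Below is a minimal, hypothetical C++ sketch of that accounting pattern, with simplified names and no Seastar dependencies (not Redpanda's actual implementation), just to make the reported failure mode concrete.

```cpp
// Hypothetical simplification of the parent/child accounting implied by the
// assertion "_num_children == 0 ... Fiber stopped before its dependencies".
#include <cassert>
#include <cstddef>

class retry_node {
public:
    retry_node() = default;                          // root node
    explicit retry_node(retry_node* parent) : _parent(parent) {
        if (_parent) { ++_parent->_num_children; }   // register with parent
    }
    ~retry_node() {
        // Rough equivalent of the vassert at retry_chain_node.cc:167:
        // the fiber owning this node must not go away while dependent
        // child fibers still point at it.
        assert(_num_children == 0 && "Fiber stopped before its dependencies");
        if (_parent) { --_parent->_num_children; }   // deregister
    }
    std::size_t children() const { return _num_children; }

private:
    retry_node* _parent = nullptr;
    std::size_t _num_children = 0;
};

int main() {
    retry_node root;
    {
        retry_node child(&root);  // a dependent fiber's node
        // ... retryable work happens here ...
    }                             // child deregisters before root dies: OK
    assert(root.children() == 0);

    // The bug pattern behind this issue: if `root` were destroyed (its fiber
    // stopped) while a child node still existed, the assert in ~retry_node
    // would fire, which is what the crash report shows.
    return 0;
}
```

In the backtrace above the failing destructor appears to be ~remote_segment_batch_reader, i.e. the reader's node playing the role of `root` here while something still depends on it.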

VadimPlh added the kind/bug label Dec 30, 2021
@Lazin (Contributor) commented Dec 30, 2021

This looks like a lifetime issue. Most likely we continue pushing data to the remote_segment_batch_reader after the remote_segment is closed.

@Lazin (Contributor) commented Dec 30, 2021

This should be fixed in the v21.11.3_si_beta4 release.

@VadimPlh (Contributor, Author) commented Jan 1, 2022

On v21.11.3-si-beta5 the same problem appears in the logs: link

jcsp added the area/cloud-storage label Jan 4, 2022
@jcsp (Contributor) commented Jan 4, 2022

Reproduced in v21.11.3-si-beta5 without any chaos (no pod kills, leader transfers, etc.): just one producer, one sequential reader, and 8 parallel random readers.

It only started happening after about 2 hours, by which time the cluster was also hitting lots of bad_alloc, so this is probably something that is triggered on the exception handling path.

@dotnwat (Member) commented Jan 4, 2022

It only started happening after about 2 hours, by which time the cluster was also hitting lots of bad_alloc, so this is probably something that is triggered on the exception handling path.

Is the bad_alloc condition understood already? If not, next time you see it, and assuming we aren't dealing with something like a massive RAM consumption situation, grabbing a core with gcore might be helpful for analysis.

@jcsp (Contributor) commented Jan 4, 2022

Is the bad_alloc condition understood already?

The bad_alloc is (probably) an accumulation of remote_segment instances and readers (#3392). We'll see whether we're still leaking memory after that fix (I'm building test cases that run faster on local instances, where getting a core is also much easier than on GKE pods).

@jcsp (Contributor) commented Jan 5, 2022

After leaving the system running for longer, I ended up with nodes in CrashLoopBackOff, hitting this assertion a few seconds after startup. That is far too early for bad_allocs to happen, so it looks like they weren't part of the cause.

Reader lifetime changed a bit in aeda711, but I'm not sure whether that was intended to have anything to do with this issue.

Looking at the backtrace (https://gist.github.com/VadimPlh/e13ff63a35fbd46d82139f81324b6193), which is in ~remote_segment_batch_reader, I wonder whether it's intrinsically problematic to destroy these readers without calling stop() first.
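For context, this is the same constraint the eventual fix (the commit referenced below) relies on: an object that owns asynchronous work exposes a stop() that must complete before the object is destroyed, and leaving that teardown to a destructor chain is exactly what trips lifetime assertions like the one in this issue. A framework-free, hypothetical sketch of the pattern follows (illustrative names, not Redpanda's or Seastar's API):

```cpp
// Sketch only: a reader-like object whose background work must be stopped
// explicitly before destruction, rather than from a destructor chain.
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

class segment_batch_reader {
public:
    segment_batch_reader() {
        _worker = std::thread([this] {
            while (!_stop_requested.load()) {
                // stand-in for pulling batches out of a remote segment
                std::this_thread::sleep_for(std::chrono::milliseconds(1));
            }
        });
    }

    // Explicit teardown: must run (and finish) before destruction.
    void stop() {
        _stop_requested.store(true);
        if (_worker.joinable()) { _worker.join(); }
        _stopped = true;
    }

    ~segment_batch_reader() {
        // Mirrors the idea behind the crash: destruction without a prior
        // stop() means dependent work may still be in flight.
        assert(_stopped && "reader destroyed without stop()");
    }

private:
    std::thread _worker;
    std::atomic<bool> _stop_requested{false};
    bool _stopped = false;
};

int main() {
    segment_batch_reader reader;
    reader.stop();  // clean up via stop(), not via whichever destructor
                    // happens to own the reader
    return 0;
}
```

The fix referenced below moves the reader cleanup to an explicit stop() call instead of relying on the destructor chain of partition_record_batch_reader_impl, which is consistent with this constraint.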

@Lazin (Contributor) commented Jan 5, 2022 via email

jcsp added a commit to jcsp/redpanda that referenced this issue Jan 5, 2022
...and not in the destructor chain of
partition_record_batch_reader_impl.

This is necessary to make sure the reader is properly cleaned
up via stop().

Fixes: redpanda-data#3378

Signed-off-by: John Spray <jcs@vectorized.io>
@jcsp (Contributor) commented Jan 5, 2022

How to reproduce this in a single-node dev environment:

(Using a release build of v21.11.3-si-beta6)

Topic settings:
rpk --brokers 192.168.1.100:9092 topic create -c segment.bytes=$(( 2**23 )) -c retention.bytes=$(( 2*2**23 )) -c redpanda.remote.read=true -c redpanda.remote.write=true verifier-a -r 1 -p 512

Redpanda resource settings: 4096MB RAM, 4 cores.

Shadow indexing cache size is set small, to 640MB.

Sequential producer and consumer
while true ; do ./si-verifier --brokers 192.168.1.100:9092 --seq_read=0 --msg_size=128000 --produce_msgs=1000 --rand_read_msgs=0 --parallel=0 --topic verifier-a ; done

while true ; do ./si-verifier --brokers 192.168.1.100:9092 --seq_read=1 --msg_size=128000 --produce_msgs=0 --rand_read_msgs=0 --topic verifier-a ; done

The read is what actually triggers the crash, but you need enough data flowing through the system to reach the point where reads have to hydrate segments, and hit resource limits while doing so.

jcsp added a commit to jcsp/redpanda that referenced this issue Jan 10, 2022
jcsp added a commit to jcsp/redpanda that referenced this issue Jan 10, 2022
jcsp assigned himself and unassigned Lazin Jan 10, 2022
jcsp added a commit to jcsp/redpanda that referenced this issue Jan 11, 2022
(cherry picked from commit 6104b67)
jcsp added a commit to jcsp/redpanda that referenced this issue Jan 11, 2022
(cherry picked from commit 6104b67)
@jcsp (Contributor) commented Jan 11, 2022

Backported to 21.11.x in #3448
