
Assert failure: (../../../src/v/utils/retry_chain_node.cc:167) '_num_children == 0' Fiber stopped before its dependencies #3378

Closed · VadimPlh opened this issue Dec 30, 2021 · 10 comments · Fixed by #3401
Labels: area/cloud-storage (Shadow indexing subsystem), kind/bug (Something isn't working)

Comments

@VadimPlh (Contributor)

Version & Environment

redpanda version: Redpanda v21.11.3-beta2

BYOC cluster
SI cache size: 20G
Randomly killing pods
Producing/consuming data with the franz-go bench tool

Backtrace: https://gist.github.com/VadimPlh/e13ff63a35fbd46d82139f81324b6193
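The assertion message itself describes a parent/child lifetime contract inside retry_chain_node: a node counts the child nodes that still reference it, and that count must be zero by the time the node (the fiber owning it) is stopped or destroyed. Below is a minimal, hypothetical C++ sketch of that accounting pattern, with simplified names and no Seastar dependencies (not Redpanda's actual implementation), just to make the reported failure mode concrete.

```cpp
// Hypothetical simplification of the parent/child accounting implied by the
// assertion "_num_children == 0 ... Fiber stopped before its dependencies".
#include <cassert>
#include <cstddef>

class retry_node {
public:
    retry_node() = default;                          // root node
    explicit retry_node(retry_node* parent) : _parent(parent) {
        if (_parent) { ++_parent->_num_children; }   // register with parent
    }
    ~retry_node() {
        // Rough equivalent of the vassert at retry_chain_node.cc:167:
        // the fiber owning this node must not go away while dependent
        // child fibers still point at it.
        assert(_num_children == 0 && "Fiber stopped before its dependencies");
        if (_parent) { --_parent->_num_children; }   // deregister
    }
    std::size_t children() const { return _num_children; }

private:
    retry_node* _parent = nullptr;
    std::size_t _num_children = 0;
};

int main() {
    retry_node root;
    {
        retry_node child(&root);  // a dependent fiber's node
        // ... retryable work happens here ...
    }                             // child deregisters before root dies: OK
    assert(root.children() == 0);

    // The bug pattern behind this issue: if `root` were destroyed (its fiber
    // stopped) while a child node still existed, the assert in ~retry_node
    // would fire, which is what the crash report shows.
    return 0;
}
```

In the backtrace above the failing destructor appears to be ~remote_segment_batch_reader, i.e. the reader's node playing the role of `root` here while something still depends on it.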

VadimPlh added the kind/bug label Dec 30, 2021
@Lazin (Contributor) commented Dec 30, 2021

This looks like a lifetime issue. Most likely we continue pushing data to the remote_segment_batch_reader after the remote_segment is closed.

@Lazin (Contributor) commented Dec 30, 2021

This should be fixed in the v21.11.3_si_beta4 release.

@VadimPlh (Contributor, Author) commented Jan 1, 2022

On v21.11.3-si-beta5 the same problem appears in the logs: link

jcsp added the area/cloud-storage label Jan 4, 2022
@jcsp (Contributor) commented Jan 4, 2022

Reproduced in v21.11.3-si-beta5 without any chaos (no pod kills, leader transfers, etc.): just one producer, one sequential reader, and 8 parallel random readers.

It only started happening after about 2 hours, by which time the cluster was also hitting lots of bad_alloc, so this is probably something that is triggered on the exception handling path.

@dotnwat (Member) commented Jan 4, 2022

It only started happening after about 2 hours, by which time the cluster was also hitting lots of bad_alloc, so this is probably something that is triggered on the exception handling path.

Is the bad_alloc condition understood already? If not, next time you see it, and assuming we aren't dealing with something like a massive RAM consumption situation, grabbing a core with gcore might be helpful for analysis.

@jcsp (Contributor) commented Jan 4, 2022

Is the bad_alloc condition understood already?

The bad_alloc is (probably) an accumulation of remote_segment instances and readers (#3392). We'll see whether we're still leaking memory after that fix (I'm building test cases that run faster on local instances, where getting a core is also much easier than on GKE pods).

@jcsp (Contributor) commented Jan 5, 2022

After leaving the system running for longer, I ended up with nodes in CrashLoopBackOff, hitting this assertion a few seconds after startup. That is far too early for bad_allocs to happen, so it looks like they weren't part of the cause.

Reader lifetime changed a bit in aeda711, but I'm not sure whether that was intended to have anything to do with this issue.

Looking at the backtrace (https://gist.github.com/VadimPlh/e13ff63a35fbd46d82139f81324b6193), which is in ~remote_segment_batch_reader, I wonder whether it's intrinsically problematic to destroy these readers without calling stop() first.
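For context, this is the same constraint the eventual fix (the commit referenced below) relies on: an object that owns asynchronous work exposes a stop() that must complete before the object is destroyed, and leaving that teardown to a destructor chain is exactly what trips lifetime assertions like the one in this issue. A framework-free, hypothetical sketch of the pattern follows (illustrative names, not Redpanda's or Seastar's API):

```cpp
// Sketch only: a reader-like object whose background work must be stopped
// explicitly before destruction, rather than from a destructor chain.
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

class segment_batch_reader {
public:
    segment_batch_reader() {
        _worker = std::thread([this] {
            while (!_stop_requested.load()) {
                // stand-in for pulling batches out of a remote segment
                std::this_thread::sleep_for(std::chrono::milliseconds(1));
            }
        });
    }

    // Explicit teardown: must run (and finish) before destruction.
    void stop() {
        _stop_requested.store(true);
        if (_worker.joinable()) { _worker.join(); }
        _stopped = true;
    }

    ~segment_batch_reader() {
        // Mirrors the idea behind the crash: destruction without a prior
        // stop() means dependent work may still be in flight.
        assert(_stopped && "reader destroyed without stop()");
    }

private:
    std::thread _worker;
    std::atomic<bool> _stop_requested{false};
    bool _stopped = false;
};

int main() {
    segment_batch_reader reader;
    reader.stop();  // clean up via stop(), not via whichever destructor
                    // happens to own the reader
    return 0;
}
```

The fix referenced below moves the reader cleanup to an explicit stop() call instead of relying on the destructor chain of partition_record_batch_reader_impl, which is consistent with this constraint.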

@Lazin (Contributor) commented Jan 5, 2022 via email

jcsp added a commit to jcsp/redpanda that referenced this issue Jan 5, 2022
...and not in the destructor chain of
partition_record_batch_reader_impl.

This is necessary to make sure the reader is properly cleaned
up via stop().

Fixes: redpanda-data#3378

Signed-off-by: John Spray <jcs@vectorized.io>
@jcsp (Contributor) commented Jan 5, 2022

How to reproduce this in a single-node dev environment:

(Using a release build of v21.11.3-si-beta6)

Topic settings:
rpk --brokers 192.168.1.100:9092 topic create -c segment.bytes=$(( 2**23 )) -c retention.bytes=$(( 2*2**23 )) -c redpanda.remote.read=true -c redpanda.remote.write=true verifier-a -r 1 -p 512

Redpanda resource settings: 4096MB RAM, 4 cores.

Shadow indexing cache size is set small, to 640MB.

Sequential producer and consumer
while true ; do ./si-verifier --brokers 192.168.1.100:9092 --seq_read=0 --msg_size=128000 --produce_msgs=1000 --rand_read_msgs=0 --parallel=0 --topic verifier-a ; done

while true ; do ./si-verifier --brokers 192.168.1.100:9092 --seq_read=1 --msg_size=128000 --produce_msgs=0 --rand_read_msgs=0 --topic verifier-a ; done

The read is what actually triggers the crash, but you need enough data flowing through the system to reach the point where reads have to hydrate segments, and hit resource limits while doing so.

jcsp added a commit to jcsp/redpanda that referenced this issue Jan 10, 2022
jcsp added a commit to jcsp/redpanda that referenced this issue Jan 10, 2022
jcsp assigned himself and unassigned Lazin Jan 10, 2022
jcsp added a commit to jcsp/redpanda that referenced this issue Jan 11, 2022
(cherry picked from commit 6104b67)
jcsp added a commit to jcsp/redpanda that referenced this issue Jan 11, 2022
(cherry picked from commit 6104b67)
@jcsp (Contributor) commented Jan 11, 2022

Backported to 21.11.x in #3448
