CI Failure (ducktape.errors.TimeoutError: topic_storage_purged) in TopicDeleteCloudStorageTest.topic_delete_unavailable_test
#14065
Failure in debug build; debugging the files that are not being deleted by the test.
Looking in docker-rp-13, there are a couple of instances of
and then no more until the timeout, as if the scrubber stopped running.
So the inconsistency is on partition 1. Partitions 0, 1 and 2 go through deletion @ 07:52:43,
then files for partition 0 are removed @ 07:52:43,
files for partition 1 are removed @ 07:53:18,
and then the timeout (30 seconds) expires and the node is terminated, so I'm leaning towards recent changes making this process take longer overall. The question is whether bumping the timeout is the right thing to do.
Increasing the timeout works, but the original failing result is kinda weird, considering the code. The delete_partition op should delete all files and then launch a background thread to finalize the remote partition, which takes a sub-second; then the op should immediately process the next delete_partition op. In a failing test run, instead, I see that after the first delete_partition op the code does nothing on the shard and waits for housekeeping to kick in to process the next one, which would mean that either the topic_delta deque contains only one element each time, or that the previous delete_partition op generated an exception (but I don't see traces of this in the log). Edit: once local delete of a partition is requested, _pending_deltas gets all the partition assignments for the topic, and then notifies the fetch_delta() watcher.
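For context, a minimal plain-C++ analogue of the queue-and-watcher pattern described above (hypothetical names such as pending_deltas and delta_worker; this is not Redpanda's actual controller code): all partition deltas for a topic are queued together and the watcher is expected to drain them in one pass, so a 30-second gap between partitions would point at either a short queue or a stalled consumer.

```cpp
// Minimal plain-C++ analogue (not Redpanda's actual code) of the pattern
// described above: local deletes push one delta per partition into a pending
// queue and notify a single watcher, which should drain the whole queue in
// one pass rather than one entry per housekeeping tick.
#include <chrono>
#include <condition_variable>
#include <deque>
#include <iostream>
#include <mutex>
#include <string>
#include <thread>

struct delta { std::string ntp; }; // hypothetical "delete this partition" delta

std::mutex mtx;
std::condition_variable cv;
std::deque<delta> pending_deltas;
bool stopping = false;

void notify_delete(const std::string& topic, int partitions) {
    std::lock_guard<std::mutex> lk(mtx);
    // All partition assignments for the topic are queued together...
    for (int p = 0; p < partitions; ++p) {
        pending_deltas.push_back({topic + "/" + std::to_string(p)});
    }
    cv.notify_one(); // ...and the fetch_delta()-style watcher is woken once.
}

void delta_worker() {
    std::unique_lock<std::mutex> lk(mtx);
    while (!stopping) {
        cv.wait(lk, [] { return stopping || !pending_deltas.empty(); });
        // Drain everything that is queued: in the healthy case this processes
        // partitions 0, 1 and 2 back to back, with no 30s housekeeping gap.
        while (!pending_deltas.empty()) {
            delta d = std::move(pending_deltas.front());
            pending_deltas.pop_front();
            lk.unlock();
            std::cout << "delete_partition " << d.ntp << "\n"; // stand-in for the real op
            lk.lock();
        }
    }
}

int main() {
    std::thread worker(delta_worker);
    notify_delete("topic", 3);
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
    { std::lock_guard<std::mutex> lk(mtx); stopping = true; }
    cv.notify_one();
    worker.join();
}
```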
so the sequence of events from https://ci-artifacts.dev.vectorized.cloud/redpanda/38604/018b186e-7e74-47df-853e-7211a9498a1e/vbuild/ducktape/results/2023-10-10--001/TopicDeleteCloudStorageTest/topic_delete_unavailable_test/cloud_storage_type=CloudStorageType.ABS/143/
In the regular case, all the partitions get processed one after another, but in the slow case only one partition gets processed immediately, and the other after 30 seconds.
These lines come after the "Stopping archiver" log line on partition 2.
Demarcation log line: partition::stop
Current direction is understanding who is holding this fiber active
well after the archival for partition 2 is stopped. It's a symptom of ntp_archiver::stop not having completed, and this is holding up data deletion. For the other partitions/nodes, the archival stop is immediate.
Some other comments: ntp_archiver::stop gets called by partition::stop and it completes, because in the log we see this, which comes after awaiting ntp_archiver::stop(), while for partition 0 it's immediate.
The difference is that rp_13 is leader for partition 2 and not for 0, so it could be beneficial to look at the leader node for partition 0, given that the scrubber prints this before waiting on gate.close().
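For context, a minimal plain-C++ analogue of the gate semantics involved here (hypothetical gate class, not the real Seastar ss::gate API): stop() cannot finish until every holder has left the gate, so a single fiber stuck on a network call keeps ntp_archiver::stop, and therefore the deletion path, waiting.

```cpp
// Minimal plain-C++ analogue of a Seastar-style gate (not the real ss::gate
// API): every background fiber "enters" the gate, and stop() cannot finish
// closing it until every holder has left. One fiber stuck on a network call
// therefore blocks the whole stop()/delete path.
#include <chrono>
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <thread>

class gate {
    std::mutex mtx_;
    std::condition_variable cv_;
    int holders_ = 0;
public:
    void enter() { std::lock_guard<std::mutex> lk(mtx_); ++holders_; }
    void leave() {
        std::lock_guard<std::mutex> lk(mtx_);
        if (--holders_ == 0) cv_.notify_all();
    }
    // close() is what ntp_archiver::stop()/scrubber::stop() conceptually await.
    void close() {
        std::unique_lock<std::mutex> lk(mtx_);
        cv_.wait(lk, [this] { return holders_ == 0; });
    }
    int count() { std::lock_guard<std::mutex> lk(mtx_); return holders_; }
};

int main() {
    gate g;
    // Two background fibers hold the gate, matching the "(2)" seen in the log.
    auto fiber = [&g](int ms) {
        g.enter();
        std::this_thread::sleep_for(std::chrono::milliseconds(ms)); // stand-in for a network op
        g.leave();
    };
    std::thread a(fiber, 50), b(fiber, 300);
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
    std::cout << "holders before close: " << g.count() << "\n";
    g.close(); // blocks until the slow holder (the stuck fiber) leaves
    std::cout << "gate closed, deletion can proceed\n";
    a.join(); b.join();
}
```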
The (2) means that there are 2 holders for the gate, which seems strange. There should be at most 3 holders for the gate: 1 releases it immediately just after start, the other 2:
So the count of 2 is not clear. Then this is the list of methods that hold ntp_archiver's gate:
Next step:
Thanks to @VladLazar for pointing out the source: redpanda/src/v/archival/scrubber.cc, line 77 in 31e94ca.
This retry_chain_node is not aware of the abort source of the scrubber. A fiber from scrubber::run is created before scrubber::stop; it gets to hang on network operations due to the firewall, and holds the scrubber's gate. At shutdown, the retry_chain is destroyed and the partition deletion completes. A quick solution is to have a finer-grained retry_chain_node for each network operation, with 30 seconds of timeout instead of the current 5 minutes. Another solution would be to modify retry_chain_node with the possibility to listen for multiple abort sources, with an …
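For illustration, a minimal plain-C++ sketch of the first proposal, using hypothetical abort_source and run_network_op names rather than the real retry_chain_node or Seastar APIs: each network operation gets its own short deadline and also observes the component's abort source, so stop() can interrupt a request the firewall has blackholed instead of waiting out the full 5-minute retry window.

```cpp
// Minimal plain-C++ sketch (hypothetical names, not the real retry_chain_node
// API) of the proposed fix: give each network operation its own, shorter
// deadline and make it observe the component's abort source, so scrubber::stop
// can interrupt a hung request instead of waiting out a 5-minute retry window.
#include <algorithm>
#include <chrono>
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <thread>

using namespace std::chrono_literals;

struct abort_source {                 // stand-in for the scrubber's abort source
    std::mutex mtx;
    std::condition_variable cv;
    bool aborted = false;
    void request_abort() {
        { std::lock_guard<std::mutex> lk(mtx); aborted = true; }
        cv.notify_all();
    }
};

// Wait for a (simulated) network operation, bounded by both a per-operation
// timeout and the abort source. Returns true only if the op "completed".
bool run_network_op(abort_source& as,
                    std::chrono::milliseconds op_duration,
                    std::chrono::milliseconds per_op_timeout) {
    std::unique_lock<std::mutex> lk(as.mtx);
    // The firewall case: op_duration is effectively infinite, so we rely on
    // either the per-op timeout (e.g. 30s instead of 5min) or the abort firing.
    bool interrupted = as.cv.wait_for(lk, std::min(op_duration, per_op_timeout),
                                      [&] { return as.aborted; });
    return !interrupted && op_duration <= per_op_timeout;
}

int main() {
    abort_source as;
    // Simulate a request blackholed by the firewall (very long duration),
    // bounded by a short per-op timeout for the sake of the demo.
    std::thread scrub([&] {
        bool ok = run_network_op(as, 10h, /*per_op_timeout=*/200ms);
        std::cout << (ok ? "op completed\n" : "op gave up (timeout or abort)\n");
    });
    std::this_thread::sleep_for(50ms);
    as.request_abort(); // scrubber::stop analogue: returns promptly, no 5min wait
    scrub.join();
}
```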
sev/low, because with this finding we know that the hang is not due to a deadlock; it's a 5-minute timeout holding the stop gate.
Thanks!
Andrea and I discussed this today. I'll look into fixing this in the next few days since Andrea's gonna be out for the rest of the week.
Appeared again in #14995.
Not seen in at least two months; closing.
https://buildkite.com/redpanda/redpanda/builds/38604
JIRA Link: CORE-1496