CI Failure (ducktape.errors.TimeoutError: topic_storage_purged) in TopicDeleteCloudStorageTest.topic_delete_unavailable_test
#14065
Failure in debug build; debugging the files that are not being deleted by the test.
Looking in docker-rp-13, there are a couple of instances of
and then no more until the timeout, as if the scrubber stopped running.
So the inconsistency is on partition 1. Partitions 0, 1 and 2 go through deletion @ 07:52:43,
then files for partition 0 are removed @ 07:52:43,
files for partition 1 are removed @ 07:53:18,
and then the timeout (30 seconds) expires and the node is terminated, so I'm leaning towards recent changes making this process take longer overall. The question is whether bumping the timeout is the right thing to do.
Increasing the timeout works, but the original failing result is kinda weird, considering the code. The delete_partition op should delete all files and then launch a background thread to finalize the remote partition, which takes a sub-second; then the op should immediately process the next delete_partition op. In a failing test run, instead, I see that after the first delete_partition op the code does nothing on the shard and waits for housekeeping to kick in to process the next one, which would mean that either the topic_delta deque contains only one element each time, or that the previous delete_partition op generated an exception (but I don't see traces of this in the log). Edit: once local delete of a partition is requested, _pending_deltas gets all the partition assignments for the topic, and then notifies the fetch_delta() watcher.
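For context, a minimal plain-C++ analogue of the queue-and-watcher pattern described above (hypothetical names such as pending_deltas and delta_worker; this is not Redpanda's actual controller code): all partition deltas for a topic are queued together and the watcher is expected to drain them in one pass, so a 30-second gap between partitions would point at either a short queue or a stalled consumer.

```cpp
// Minimal plain-C++ analogue (not Redpanda's actual code) of the pattern
// described above: local deletes push one delta per partition into a pending
// queue and notify a single watcher, which should drain the whole queue in
// one pass rather than one entry per housekeeping tick.
#include <chrono>
#include <condition_variable>
#include <deque>
#include <iostream>
#include <mutex>
#include <string>
#include <thread>

struct delta { std::string ntp; }; // hypothetical "delete this partition" delta

std::mutex mtx;
std::condition_variable cv;
std::deque<delta> pending_deltas;
bool stopping = false;

void notify_delete(const std::string& topic, int partitions) {
    std::lock_guard<std::mutex> lk(mtx);
    // All partition assignments for the topic are queued together...
    for (int p = 0; p < partitions; ++p) {
        pending_deltas.push_back({topic + "/" + std::to_string(p)});
    }
    cv.notify_one(); // ...and the fetch_delta()-style watcher is woken once.
}

void delta_worker() {
    std::unique_lock<std::mutex> lk(mtx);
    while (!stopping) {
        cv.wait(lk, [] { return stopping || !pending_deltas.empty(); });
        // Drain everything that is queued: in the healthy case this processes
        // partitions 0, 1 and 2 back to back, with no 30s housekeeping gap.
        while (!pending_deltas.empty()) {
            delta d = std::move(pending_deltas.front());
            pending_deltas.pop_front();
            lk.unlock();
            std::cout << "delete_partition " << d.ntp << "\n"; // stand-in for the real op
            lk.lock();
        }
    }
}

int main() {
    std::thread worker(delta_worker);
    notify_delete("topic", 3);
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
    { std::lock_guard<std::mutex> lk(mtx); stopping = true; }
    cv.notify_one();
    worker.join();
}
```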
so the sequence of events from https://ci-artifacts.dev.vectorized.cloud/redpanda/38604/018b186e-7e74-47df-853e-7211a9498a1e/vbuild/ducktape/results/2023-10-10--001/TopicDeleteCloudStorageTest/topic_delete_unavailable_test/cloud_storage_type=CloudStorageType.ABS/143/
In the regular case, all the partitions get processed one after another, but in the slow case only one partition gets processed immediately, and the other after 30 seconds.
These lines come after the "Stopping archiver" log line on partition 2.
Demarcation log line: partition::stop
Current direction is understanding who is holding this fiber active
well after the archival for partition 2 is stopped. It's a symptom of ntp_archiver::stop not having completed, and this is holding up data deletion. For the other partitions/nodes, the archival stop is immediate.
Some other comments: ntp_archiver::stop gets called by partition::stop and it completes, because in the log we see this, which comes after awaiting ntp_archiver::stop(), while for partition 0 it's immediate.
The difference is that rp_13 is leader for partition 2 and not for 0, so it could be beneficial to look at the leader node for partition 0, given that the scrubber prints this before waiting on gate.close().
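For context, a minimal plain-C++ analogue of the gate semantics involved here (hypothetical gate class, not the real Seastar ss::gate API): stop() cannot finish until every holder has left the gate, so a single fiber stuck on a network call keeps ntp_archiver::stop, and therefore the deletion path, waiting.

```cpp
// Minimal plain-C++ analogue of a Seastar-style gate (not the real ss::gate
// API): every background fiber "enters" the gate, and stop() cannot finish
// closing it until every holder has left. One fiber stuck on a network call
// therefore blocks the whole stop()/delete path.
#include <chrono>
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <thread>

class gate {
    std::mutex mtx_;
    std::condition_variable cv_;
    int holders_ = 0;
public:
    void enter() { std::lock_guard<std::mutex> lk(mtx_); ++holders_; }
    void leave() {
        std::lock_guard<std::mutex> lk(mtx_);
        if (--holders_ == 0) cv_.notify_all();
    }
    // close() is what ntp_archiver::stop()/scrubber::stop() conceptually await.
    void close() {
        std::unique_lock<std::mutex> lk(mtx_);
        cv_.wait(lk, [this] { return holders_ == 0; });
    }
    int count() { std::lock_guard<std::mutex> lk(mtx_); return holders_; }
};

int main() {
    gate g;
    // Two background fibers hold the gate, matching the "(2)" seen in the log.
    auto fiber = [&g](int ms) {
        g.enter();
        std::this_thread::sleep_for(std::chrono::milliseconds(ms)); // stand-in for a network op
        g.leave();
    };
    std::thread a(fiber, 50), b(fiber, 300);
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
    std::cout << "holders before close: " << g.count() << "\n";
    g.close(); // blocks until the slow holder (the stuck fiber) leaves
    std::cout << "gate closed, deletion can proceed\n";
    a.join(); b.join();
}
```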
The (2) means that there are 2 holders for the gate, which seems strange. There should be at most 3 holders for the gate: 1 releases it immediately just after start, the other 2:
So the count of 2 is not clear. Then this is the list of methods that hold ntp_archiver's gate:
Next step:
Thanks to @VladLazar for pointing out the source: redpanda/src/v/archival/scrubber.cc, line 77 in 31e94ca.
This retry_chain_node is not aware of the abort source of the scrubber. A fiber from scrubber::run is created before scrubber::stop; it gets to hang on network operations due to the firewall, and holds the scrubber's gate. At shutdown, the retry_chain is destroyed and the partition deletion completes. A quick solution is to have a finer-grained retry_chain_node for each network operation, with 30 seconds of timeout instead of the current 5 minutes. Another solution would be to modify retry_chain_node with the possibility to listen for multiple abort sources, with an …
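For illustration, a minimal plain-C++ sketch of the first proposal, using hypothetical abort_source and run_network_op names rather than the real retry_chain_node or Seastar APIs: each network operation gets its own short deadline and also observes the component's abort source, so stop() can interrupt a request the firewall has blackholed instead of waiting out the full 5-minute retry window.

```cpp
// Minimal plain-C++ sketch (hypothetical names, not the real retry_chain_node
// API) of the proposed fix: give each network operation its own, shorter
// deadline and make it observe the component's abort source, so scrubber::stop
// can interrupt a hung request instead of waiting out a 5-minute retry window.
#include <algorithm>
#include <chrono>
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <thread>

using namespace std::chrono_literals;

struct abort_source {                 // stand-in for the scrubber's abort source
    std::mutex mtx;
    std::condition_variable cv;
    bool aborted = false;
    void request_abort() {
        { std::lock_guard<std::mutex> lk(mtx); aborted = true; }
        cv.notify_all();
    }
};

// Wait for a (simulated) network operation, bounded by both a per-operation
// timeout and the abort source. Returns true only if the op "completed".
bool run_network_op(abort_source& as,
                    std::chrono::milliseconds op_duration,
                    std::chrono::milliseconds per_op_timeout) {
    std::unique_lock<std::mutex> lk(as.mtx);
    // The firewall case: op_duration is effectively infinite, so we rely on
    // either the per-op timeout (e.g. 30s instead of 5min) or the abort firing.
    bool interrupted = as.cv.wait_for(lk, std::min(op_duration, per_op_timeout),
                                      [&] { return as.aborted; });
    return !interrupted && op_duration <= per_op_timeout;
}

int main() {
    abort_source as;
    // Simulate a request blackholed by the firewall (very long duration),
    // bounded by a short per-op timeout for the sake of the demo.
    std::thread scrub([&] {
        bool ok = run_network_op(as, 10h, /*per_op_timeout=*/200ms);
        std::cout << (ok ? "op completed\n" : "op gave up (timeout or abort)\n");
    });
    std::this_thread::sleep_for(50ms);
    as.request_abort(); // scrubber::stop analogue: returns promptly, no 5min wait
    scrub.join();
}
```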
sev/low, because with this finding we know that the hang is not due to a deadlock; it's a 5-minute timeout holding the stop gate.
Thanks!
Andrea and I discussed this today. I'll look into fixing this in the next few days since Andrea's gonna be out for the rest of the week.
Appeared again in #14995.
Not seen in at least two months; closing.
https://buildkite.com/redpanda/redpanda/builds/38604
JIRA Link: CORE-1496