
Clear topics orphan files #8396

Merged: 4 commits merged into redpanda-data:dev from clear_orphan_files on Feb 13, 2023

Conversation

@ZeDRoman (Contributor) commented Jan 24, 2023

When a node is restarted while it is executing a partition delete operation, the operation may not finish, leaving orphan files for that partition on the device.
After the restart the node no longer has that partition's data in memory, so it cannot retry the delete operation and will not clean up the partition files during reconciliation.
This PR adds a garbage-collection mechanism that force-deletes orphan partition files after node bootstrap.
We gather all existing NTPs, then walk every data directory and check whether each topic directory corresponds to an existing NTP; if it does not, we delete it.
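To make the approach concrete, here is a simplified, illustrative sketch of the walk (the names remove_orphan_topic_directories and topic_is_live are placeholders, not the functions this PR actually adds, and it glosses over revision checks and seastar's async I/O):

#include <filesystem>
#include <functional>

namespace fs = std::filesystem;

// Walk <data_dir>/<namespace>/<topic> and delete topic directories that no
// longer correspond to any topic known to the node after bootstrap.
void remove_orphan_topic_directories(
  const fs::path& data_dir,
  const std::function<bool(const fs::path&)>& topic_is_live) {
    for (const auto& ns_dir : fs::directory_iterator(data_dir)) {
        if (!ns_dir.is_directory()) {
            continue;
        }
        for (const auto& topic_dir : fs::directory_iterator(ns_dir.path())) {
            if (topic_dir.is_directory() && !topic_is_live(topic_dir.path())) {
                // Orphaned topic directory: remove its leftover partition files.
                fs::remove_all(topic_dir.path());
            }
        }
    }
}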

Fixes #7895

Backports Required

  • none - not a bug fix
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v22.3.x
  • v22.2.x
  • v22.1.x

UX Changes

  • Mechanism for cleaning partition orphan files

Release Notes

Bug Fixes

  • Mechanism for cleaning partition orphan files

@ZeDRoman force-pushed the clear_orphan_files branch 2 times, most recently from 4ff1f95 to 850b0cb on January 25, 2023 16:42
@ZeDRoman marked this pull request as ready for review on January 26, 2023 09:57
}
}
}
auto last_applied_revision = _topics.local().last_applied_revision();
Contributor

last_applied_revision is updated as the topic table applies the deltas. So by the time this cleanup runs, it may not have applied all the deltas yet, meaning we could be operating on a not-yet-up-to-date snapshot of the topics, no?

Member

similar question: what are the guarantees that a node has the correct information it needs to make this orphan file removal decision independently?

Contributor Author

After controller backend bootstrap we must have reconciled all controller_log messages and topic_table deltas that were on the node before the restart.
Orphan files can only be left behind by topics that existed before the node was restarted, so all deltas for topics that may have orphan files must have been reconciled by this point.
If we delete topic files while the node is running, we retry until the deletion succeeds, so we cannot end up with orphan files at runtime.

Contributor

Orphan files can only be left behind by topics that existed before the node was restarted, so all deltas for topics that may have orphan files must have been reconciled by this point.

Ah yes, last_applied_revision is guaranteed to be at least the dirty offset of the controller log at bootstrap.

ss::sstring topic_directory_path,
model::topic_namespace nt,
const log_manager::ntp_to_revision& topics_on_node,
model::revision_id& last_applied_revision) {
Contributor

nit: revision_id is trivially copyable, pass by value.

Member

nit: revision_id is trivially copyable, pass by value.

agree

Contributor Author

fixed

model::revision_id& last_applied_revision) {
return directory_walker::walk(
topic_directory_path,
[&topics_on_node, topic_directory_path, nt, &last_applied_revision](
Contributor

nit: pass revision by value.

Contributor Author

fixed

co_return;
}

for (const auto& ns : namespaces) {
Contributor

think this loop may affect startup time on nodes with a lot of topics * partitions. We are recursing through every ns/topic/partition on the node. wondering if this can work in the background.
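For reference, the direction the PR eventually took is to spawn the scan in the background under the backend's gate after bootstrap, e.g. (sketch, mirroring the snippet that appears later in this review):

ssx::spawn_with_gate(
  _gate, [this] { return clear_orphan_topic_files(); });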

Contributor Author

fixed

@@ -473,6 +476,43 @@ ss::future<> controller_backend::fetch_deltas() {
});
}

ss::future<> controller_backend::clear_orphan_topic_files() {
if (ss::this_shard_id() != 0) {
Member

is there a shard alias you can use instead of hard coding 0?

Contributor Author

I didn't find an alias for that.
It seems like the common approach for such things.

auto topics = _topics.local().all_topics();
storage::log_manager::ntp_to_revision topics_on_node;
// Init with default namespace to clean if there is no topics
absl::flat_hash_set<model::ns> namespaces = {{model::ns("kafka")}};
Member

use existing pre-defined constant for namespace name.
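A sketch of the fix, using the existing model::kafka_namespace constant (the same form appears in a later suggestion in this thread):

// Init with the default namespace so cleanup runs even when there are no topics.
absl::flat_hash_set<model::ns> namespaces = {{model::kafka_namespace}};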

Contributor Author

fixed

}
}
}
auto last_applied_revision = _topics.local().last_applied_revision();
Member

similar question: what are the guarantees that a node has the correct information it needs to make this orphan file removal decision independently?

ss::sstring topic_directory_path,
model::topic_namespace nt,
const log_manager::ntp_to_revision& topics_on_node,
model::revision_id& last_applied_revision) {
Member

nit: revision_id is trivially copyable, pass by value.

agree

@ZeDRoman (Contributor Author)

CI failure #8383

@bharathv (Contributor) left a comment

lgtm, just a clarifying question and some nits.

@@ -297,6 +298,10 @@ void controller_backend::setup_metrics() {
ss::future<> controller_backend::start() {
setup_metrics();
return bootstrap_controller_backend().then([this] {
if (ss::this_shard_id() == ss::shard_id{0}) {
Contributor

think Noah meant cluster::controller_stm_shard instead of 0.
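Presumably the fixed check looks something like the following (sketch only):

if (ss::this_shard_id() == cluster::controller_stm_shard) {
    ssx::spawn_with_gate(
      _gate, [this] { return clear_orphan_topic_files(); });
}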

Contributor Author

fixed

}
}
}
auto last_applied_revision = _topics.local().last_applied_revision();
Contributor

Orphan files can only be left behind by topics that existed before the node was restarted, so all deltas for topics that may have orphan files must have been reconciled by this point.

Ah yes, last_applied_revision is guaranteed to be at least the dirty offset of the controller log at bootstrap.

for (const auto& t : topics) {
namespaces.emplace(t.ns);
auto meta = _topics.local().get_topic_assignments(t);
if (meta) {
Contributor

nit: we can add a check like if (_gate.is_closed()) break; on each topic iteration to abort early if needed.

Also, can you invert the check to reduce indentation (a combined sketch follows the snippet below):

if (!meta) {
continue;
}
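A sketch of the loop with both suggestions applied (member names are taken from the snippets above; seastar's gate exposes is_closed()):

for (const auto& t : topics) {
    if (_gate.is_closed()) {
        // Shutting down: abort the scan early.
        break;
    }
    namespaces.emplace(t.ns);
    auto meta = _topics.local().get_topic_assignments(t);
    if (!meta) {
        continue;
    }
    // ... record the ntp -> revision entries for this topic ...
}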

@@ -473,6 +478,40 @@ ss::future<> controller_backend::fetch_deltas() {
});
}

ss::future<> controller_backend::clear_orphan_topic_files() {
auto topics = _topics.local().all_topics();
storage::log_manager::ntp_to_revision topics_on_node;
Contributor

q: This is a map of ntp -> revision for all the ntps in the system, which can potentially be big. What is the use of constructing this up front? When recursing through the topic/partition directories, can't we just look up the topic table? Something like:

ntp foo = partition_path::parse_partition_directory()

if (foo.revision_id < last_applied_id_snapshot && !topic_table.contains() && !topic_table.previous_replicas.contains())  {
  cleanup();
}

Just trying to understand the use of creating a snapshot of all the ntps in the system.

Contributor Author

Previously I was thinking about creating this map to get a snapshot of the topic table right after bootstrap.
But I agree that we don't need it now.

Contributor

+1 on using the topic table directly.

Contributor Author

fixed

@ZeDRoman (Contributor Author) commented Feb 1, 2023

Rebase dev

@VladLazar (Contributor) left a comment

Largely makes sense to me. The snapshotting of the topic table at the beginning of the operation makes me a bit uneasy.

Also, I found myself wondering if this couldn't be done while replaying the controller log. Perhaps we could write a new message type to the controller log (something like delete_complete) once deletion is complete. When looking for the starting point for the replay we could look for that instead of the deletion. If we don't find delete_complete, it means we might have left orphaned files around.

src/v/cluster/controller_backend.cc (outdated comment, resolved)
Comment on lines 305 to 307
ssx::spawn_with_gate(
_gate, [this] { return clear_orphan_topic_files(); });
}
Contributor

clear_orphan_topic_files can throw std::filesystem::filesystem_error. It would be nice to log something in that case. Also, if I recall correctly, seastar doesn't like failed background futures.
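A sketch of how the background call could be guarded so a filesystem error is logged rather than left as a failed background future (the clusterlog logger name is an assumption):

ssx::spawn_with_gate(_gate, [this] {
    return clear_orphan_topic_files().handle_exception_type(
      [](const std::filesystem::filesystem_error& err) {
          vlog(
            clusterlog.error,
            "Failed to clear orphan topic files: {}",
            err.what());
      });
});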

Contributor Author

fixed

@@ -473,6 +478,40 @@ ss::future<> controller_backend::fetch_deltas() {
});
}

ss::future<> controller_backend::clear_orphan_topic_files() {
auto topics = _topics.local().all_topics();
storage::log_manager::ntp_to_revision topics_on_node;
Contributor

+1 on using the topic table directly.

@ZeDRoman (Contributor Author) commented Feb 1, 2023

Largely makes sense to me. The snapshotting of the topic table at the beginning of the operation makes me a bit uneasy.

Also, I found myself wondering if this couldn't be done while replaying the controller log. Perhaps we could write a new message type to the controller log (something like delete_complete) once deletion is complete. When looking for the starting point for the replay we could look for that instead of the deletion. If we don't find delete_complete, it means we might have left orphaned files around.

We have discussed this multiple times with Michal and John.

  1. This delete_complete message would be replicated on all nodes, but we don't want to replicate messages that relate to only one node.
  2. We would need to keep the delete-topic message in the controller snapshot until there is a topic_deleted message from every node. We don't want to introduce that additional logic.

@VladLazar (Contributor)

Largely makes sense to me. The snapshotting of the topic table at the beginning of the operation makes me a bit uneasy.
Also, I found myself wondering if this couldn't be done while replaying the controller log. Perhaps we could write a new message type to the controller log (something like delete_complete) once deletion is complete. When looking for the starting point for the replay we could look for that instead of the deletion. If we don't find delete_complete, it means we might have left orphaned files around.

We have discussed this multiple times with Michal and John.

1. This delete_complete message would be replicated on all nodes, but we don't want to replicate messages that relate to only one node.

2. We would need to keep the delete-topic message in the controller snapshot until there is a topic_deleted message from every node. We don't want to introduce that additional logic.

Thanks for the context. The reasoning makes sense to me. I think this approach is fine.

@ZeDRoman force-pushed the clear_orphan_files branch 3 times, most recently from 0befbee to a149c60 on February 6, 2023 15:03
@@ -47,6 +47,7 @@ v_cc_library(
v::compression
v::rprandom
v::resource_mgmt
v::cluster
Contributor

Don't think we should make storage depend on cluster; that creates a circular dependency.

I believe you added it to pass _topic_table into storage. To avoid this, how about making the function definition something like:

ss::future<> log_manager::remove_orphan_files(
  ss::sstring data_directory_path,
  absl::flat_hash_set<model::ns> namespaces,
  ss::noncopyable_function<bool(model::ntp, partition_path::metadata)> orphan_filter);

and then pass the filter as a lambda from the controller backend that captures all the needed context, something like (some code needs to move around):

ss::future<> controller_backend::clear_orphan_topic_files(
  model::revision_id bootstrap_revision) {
    // Init with default namespace to clean if there is no topics
    absl::flat_hash_set<model::ns> namespaces = {{model::kafka_namespace}};
    for (const auto& t : _topics.local().all_topics()) {
        namespaces.emplace(t.ns);
    }

    return _storage.local().log_mgr().remove_orphan_files(
      _data_directory,
      std::move(namespaces),
      [&, bootstrap_revision](
        model::ntp ntp, storage::partition_path::metadata p) {
          return topic_files_are_orphan(
            ntp, p, _topics, bootstrap_revision, _self);
      });
}

Contributor Author

Great idea!
Thank you

bharathv previously approved these changes Feb 7, 2023
@ZeDRoman (Contributor Author) commented Feb 7, 2023

/ci-repeat 3
skip-units

@bharathv (Contributor) commented Feb 8, 2023

/ci-repeat 5

bharathv previously approved these changes Feb 8, 2023
@bharathv (Contributor) commented Feb 9, 2023

@ZeDRoman Can you confirm if the following failures are related/unrelated to this patch? (check debug builds in the last repeat run, failed more than once). Release build failures are #8679

test_id:    rptest.tests.partition_balancer_test.PartitionBalancerTest.test_rack_awareness
status:     FAIL
run time:   6 minutes 35.753 seconds

test_id:    rptest.tests.topic_delete_test.TopicDeleteCloudStorageTest.topic_delete_unavailable_test
status:     FAIL
run time:   3 minutes 53.274 seconds

@ZeDRoman (Contributor Author) commented Feb 9, 2023

/ci-repeat 5

@ZeDRoman (Contributor Author) commented Feb 9, 2023

@ZeDRoman Can you confirm if the following failures are related/unrelated to this patch? (check debug builds in the last repeat run, failed more than once). Release build failures are #8679

test_id:    rptest.tests.partition_balancer_test.PartitionBalancerTest.test_rack_awareness
status:     FAIL
run time:   6 minutes 35.753 seconds

test_id:    rptest.tests.topic_delete_test.TopicDeleteCloudStorageTest.topic_delete_unavailable_test
status:     FAIL
run time:   3 minutes 53.274 seconds

I have checked the logs. It doesn't seem that these test failures are related to the changes in this PR.
But it is strange that the tests have failed multiple times. I want to restart CI.

When redpanda is restarted while a delete operation is not finished,
partition files might be left on disk.
We need to clean up orphan partition files after bootstrap.
@ZeDRoman (Contributor Author)

/ci-repeat 5

@ZeDRoman (Contributor Author)

CI failures:
#8745
#8291
ShadowIndexingCompactedTopicTest.test_upload - new, but not a result of this PR

@bharathv merged commit b2ab646 into redpanda-data:dev on Feb 13, 2023
@ZeDRoman mentioned this pull request on Mar 3, 2023
Development

Successfully merging this pull request may close these issues.

Ensure topic files are deleted when delete topic is requested
4 participants