Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cloud_storage: topic deletion fixes #8090

Merged
merged 14 commits into from
Jan 18, 2023
Merged

cloud_storage: topic deletion fixes #8090

merged 14 commits into from
Jan 18, 2023

Conversation

jcsp
Copy link
Contributor

@jcsp jcsp commented Jan 6, 2023

This PR fixes #8046 and #8071 together, in order to conveniently test them together before re-enabling the tests.

#8046 was just a bug in the test: it was using segment counts as a proxy for amounts of data removed from local storage, which does not work properly when a leadership transfer happens to overlap with the test, generating extra segments. While debugging this I found another test issues, where under some circumstances we were mistaking a change in the manifest.json object's ETag for a failure to delete. These are both fixed.

#8071 was a real bug in Redpanda, where partition::stop could get hung up on S3 operations in flight. When the test used firewall to block connectivity to S3, then stuck PUT requests could exist, preventing partition shutdown. This causes the test to fail its requirement that the deletion of local data proceeds even if remote data can't be deleted.

Fixing #8071 in a general way involved subscribing to an abort source to call shutdown() on http clients when a partition is stopping. To get that abort source into the right places without too much complexity, retry_chain_node is improved to remove its mode with no abort source, so that its public API can conveniently expose the abort source for the remote object to use consistently. The removed constructors for retry_chain_node were only used in unit tests, or by mistake: there is no legitimate situation where a retry_chain_node should exist without an abort source somewhere in its hierarchy, as the purpose of the class involves sleeps on retry.

Backports Required

Needs backporting together with ntp_archiver refactor.

  • none - not a bug fix
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v22.3.x
  • v22.2.x
  • v22.1.x

UX Changes

None

Release Notes

Improvements

  • Topic deletion now removes local data more reliably in situations where a tiered storage topic experiences an inabilty to connect to an object storage backend.

@jcsp
Copy link
Contributor Author

jcsp commented Jan 6, 2023

/ci-repeat 5 dt-repeat=100 skip-unit

@jcsp jcsp force-pushed the issue-8046 branch 3 times, most recently from d99a122 to f611d65 Compare January 9, 2023 14:48
@jcsp jcsp changed the title tests: improvements to topic deletion tests tests: cloud storage topic deletion fixes Jan 9, 2023
@jcsp jcsp added the kind/bug Something isn't working label Jan 9, 2023
@jcsp jcsp requested review from abhijat and Lazin January 9, 2023 14:56
@jcsp jcsp marked this pull request as ready for review January 9, 2023 14:58
@jcsp
Copy link
Contributor Author

jcsp commented Jan 9, 2023

This has a lint issue, but I'm letting it run through CI before I push the fix.

@jcsp jcsp added the area/cloud-storage Shadow indexing subsystem label Jan 9, 2023
@jcsp jcsp changed the title tests: cloud storage topic deletion fixes cloud_storage: topic deletion fixes Jan 9, 2023
Comment on lines 430 to 441
auto as_sub = parent.root_abort_source().subscribe(
// Lifetimes:
// - `lease` is scoped to this function, as is the
// abort source subscription: as will always be deregistered
// before lease is destroyed.
// - `ctxlog` is also function scoped.
[&lease, &ctxlog]() noexcept {
vlog(
ctxlog.debug,
"Cancelling in-flight requests on partition shutdown");
lease.client->shutdown();
});
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

interesting that the subscription is an raii object here. i wonder if this pattern is hinting at a need for some refactoring related to how the client relates to the parent context?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the raii object will be alive until the GET request is completed, the actual processing of the downloaded data is happening inside the functor which is passed into this method as a parameter

@jcsp
Copy link
Contributor Author

jcsp commented Jan 10, 2023

/ci-repeat 5 skip-unit dt-repeat=10 tests/rptest/tests/topic_delete_test.py

Lazin
Lazin previously approved these changes Jan 11, 2023
/// Find abort source in the root of the tree
/// Always traverses the tree back to the root and returns the abort
/// source if it was set in the root c-tor.
ss::abort_source& root_abort_source();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like root_abort_source is used in previous commit.

vlog(
ctxlog.debug,
"Cancelling in-flight requests on partition shutdown");
lease.client->shutdown();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we no longer need to stop all connections to S3 inside application.cc during shutdown anymore. WDYT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR didn't cover all the places we call into a client, but yeah, if we come back and make all uses of clients abort-safe, we should in principle no longer need the early shutdown. To make that really robust against future mistakes, it would be neat to make the client lease itself carry the abort source, so that it's impossible to get a client without providing an abort source.

Comment on lines 430 to 441
auto as_sub = parent.root_abort_source().subscribe(
// Lifetimes:
// - `lease` is scoped to this function, as is the
// abort source subscription: as will always be deregistered
// before lease is destroyed.
// - `ctxlog` is also function scoped.
[&lease, &ctxlog]() noexcept {
vlog(
ctxlog.debug,
"Cancelling in-flight requests on partition shutdown");
lease.client->shutdown();
});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the raii object will be alive until the GET request is completed, the actual processing of the downloaded data is happening inside the functor which is passed into this method as a parameter

ss::lowres_clock::duration initial_backoff);
retry_chain_node(
ss::lowres_clock::duration timeout,
ss::lowres_clock::duration initial_backoff);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call to remove this. Just to clarify why this was added. In situation when the fiber uses some other abort source to stop or doesn't have an abort source at all (which is never the case) it might be useful to have this.

/// Create a head of the chain without backoff
retry_chain_node();
// No default constructor: we always need an abort source.
retry_chain_node() = delete;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: type of _parent can be changed now since we will never use std::monostate variant.

@jcsp
Copy link
Contributor Author

jcsp commented Jan 12, 2023

/ci-repeat 5 skip-unit dt-repeat=10 tests/rptest/tests/topic_delete_test.py

@jcsp
Copy link
Contributor Author

jcsp commented Jan 13, 2023

/ci-repeat 5 skip-unit dt-repeat=10 tests/rptest/tests/topic_delete_test.py

@jcsp
Copy link
Contributor Author

jcsp commented Jan 13, 2023

More debug logging, and have also generalized the abort source handling to cover all the request types in remote.cc, in case the failures resulted from other request types in flight (although I couldn't see that in the logs)

@jcsp
Copy link
Contributor Author

jcsp commented Jan 13, 2023

/ci-repeat 5 skip-unit dt-repeat=10 tests/rptest/tests/topic_delete_test.py

@jcsp
Copy link
Contributor Author

jcsp commented Jan 13, 2023

/ci-repeat 5 skip-unit dt-repeat=10 tests/rptest/tests/topic_delete_test.py

@jcsp
Copy link
Contributor Author

jcsp commented Jan 13, 2023

Last log seems like it might be log_eviction_stm hanging the partition shutdown, which is weird but hmm.

It was kind of odd that the predicate function got called
with a reference to the abort source to set the reason on it,
rather than just having the predicate return the reason.

No functional change, just a refactor.
This enables cluster::partition::stop to proceed promptly,
rather than being blocked if ntp_archiver has an S3 request
in flight.

Fixes: redpanda-data#8071
For well-behaved code that respects the caller's abort
source (and not just their own on-shutdown abort source),
the abort source to use should always be the one that
is passed into retry_chain_node.
...by calling shutdown() on the http client if it fires.

This prevents cluster::partition::stop hanging if a request is
in flight.
This never made sense: any context where we're doing retries
is a context where we should be able to abort on shutdown.
Using the remote's abort source means these sleeps will only
abort on shutdown, not when the calling partition is trying
to stop for other reasons.
Constructing a retry_chain_node without an abort source
never makes sense: it is always used in a context where
it should be abortable.
...and use it in remote_partition::erase.  This is necessary
because we now require a usable abourt source in all cloud
storage paths, and the partition's abort source is already
fired once we get to removing the persistent state.
Now that retry_chain_node always has a parent (either
an abort source or another retry_chain_node), the monostate
variant is no longer used.
Rather than having remote subscribe to an abort source
to shutdown the client after leasing it, require that
anyone leasing a client object provide an abort source,
which the client will subscribe to.

As with the change to retry_chain_node, there are no
use cases where a proper abort source is not available,
so this is not an onerous interface change.

The abort source has to outlive the lease, but that's
a common pattern in our code, where abort sources
are passed by reference from longer lived objects
into shorted lived objects.
These may attempt to generate raft writes, do
remote I/O, or generate snapshots.  It makes sense to shut
them down ASAP, rather than keeping them alive while shutting
down the more lightweight local storage components.
This test aims to reproduce rare failures seen in
topic_delete_unavailable_test, where the test system
was running so slowly that a node ended up having to
recover via snapshot, and subsequently experienced
a hang in log_eviction_stm::stop.
@jcsp jcsp marked this pull request as ready for review January 17, 2023 21:11
@jcsp
Copy link
Contributor Author

jcsp commented Jan 17, 2023

@Lazin sorry this one got so noisy. Please could you re-review?

After all that debug + the raft fix landing elsehwere, the only changed bits since your previous review are the commits from "utils: remove unused variant type in retry_chain_node" onward.

@jcsp jcsp requested a review from Lazin January 17, 2023 23:09
@Lazin
Copy link
Contributor

Lazin commented Jan 18, 2023

Looks like the code would be a bit simpler if we will be able to pass more than one abort source to the remote. I'm struggling with a similar problem. There is a separate worker which starts upload through ntp_archiver and it have to use its own abort_source, which is a bummer since ntp_archiver uses its own.

@jcsp jcsp merged commit 1e99f51 into redpanda-data:dev Jan 18, 2023
@jcsp jcsp deleted the issue-8046 branch January 18, 2023 09:36
@vshtokman
Copy link
Contributor

/backport v22.3.x

@vbotbuildovich
Copy link
Collaborator

Failed to run cherry-pick command. I executed the below command:

git cherry-pick -x 7e47ffab4989814d9438c335de06d008c2879ebb 2348e5d56a9a464f647d0d4cc7347b83a5e067ed 5f96ac737079500db2a079ad97f471a71902f65c ceb19997e81caca72c1a84f4fba4b492322cf073 3a08af73ca12c34d8adfbb5dd138208ffe214e47 ecf3bd3aa8afcaac6e5cbccd3ac675b39141d81d 7aa73d134e01d881b2066ff97621e6067f918104 632676ceaaaf9088dae791f46ae8004286219175 c10e266650c2a2bc15b3fd6393d93489d4b49c99 37c3f78b85b099d48598659cbe6e892aaeab6bb4 5bb964875d6496e2524d6b349ffc74760eaf5c51 49fe3eb536d71b632d3ba6f1729bbdb2a45a636a 73abc87984c231c65b7b3b89e635489619b8bf5d 3fdf1ada82cc94177608e75109918b49e6f44c6a

Workflow run logs.

@jcsp jcsp mentioned this pull request Feb 10, 2023
6 tasks
@jcsp
Copy link
Contributor Author

jcsp commented Feb 10, 2023

A lot of this is too invasive to backport, but the test bits are worth a try:
#8791

@jcsp jcsp mentioned this pull request Feb 13, 2023
6 tasks
@jcsp
Copy link
Contributor Author

jcsp commented Feb 13, 2023

#8850

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/cloud-storage Shadow indexing subsystem area/redpanda kind/bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CI Failure (Segments were not removed) in TopicDeleteCloudStorageTest.topic_delete_cloud_storage_test
5 participants