remote partition_manifest validation for topic recovery #16915
Conversation
Force-pushed from 6ea2d18 to de76f5f
Force-pushed from 2b779c2 to 338a119
new failures in https://buildkite.com/redpanda/redpanda/builds/46492#018e5ba2-f06b-4254-944c-8f621456b8ca:
new failures in https://buildkite.com/redpanda/redpanda/builds/46492#018e5ba2-f06e-4180-b3a8-65034779a9d0:
new failures in https://buildkite.com/redpanda/redpanda/builds/46492#018e5ba2-f067-4f2b-88c7-069a925426c0:
new failures in https://buildkite.com/redpanda/redpanda/builds/46492#018e5ba2-f071-43b5-b667-bd2aa49f158f:
new failures in https://buildkite.com/redpanda/redpanda/builds/46492#018e5bb4-6654-4dfa-8349-e920128e17e2:
new failures in https://buildkite.com/redpanda/redpanda/builds/46492#018e5bb4-665b-4689-b62f-152e1645b0c1:
new failures in https://buildkite.com/redpanda/redpanda/builds/46492#018e5bb4-6658-4926-a2da-c93177cd2c1b:
new failures in https://buildkite.com/redpanda/redpanda/builds/46492#018e5bb4-6656-4377-8779-0cf4949f088a:
new failures in https://buildkite.com/redpanda/redpanda/builds/46523#018e5e0a-4a9c-47db-8868-807498b8dbfe:
new failures in https://buildkite.com/redpanda/redpanda/builds/46523#018e5e0a-4a94-40ed-b0a7-6143a871d571:
new failures in https://buildkite.com/redpanda/redpanda/builds/46523#018e5e0a-4aa0-43dd-a7bc-96bd7db712e5:
new failures in https://buildkite.com/redpanda/redpanda/builds/46523#018e5e0a-4a99-4e35-a423-269656f9ef63:
new failures in https://buildkite.com/redpanda/redpanda/builds/46523#018e5e1c-34e7-4b8f-a78e-a84eed8c4ce4:
new failures in https://buildkite.com/redpanda/redpanda/builds/46523#018e5e1c-34ef-4c03-a951-f8ebde0cd29b:
new failures in https://buildkite.com/redpanda/redpanda/builds/46523#018e5e1c-34ec-4da9-a494-3ddde49d513c:
new failures in https://buildkite.com/redpanda/redpanda/builds/46523#018e5e1c-34ea-49ca-b480-2c2e5c118ba8:
new failures in https://buildkite.com/redpanda/redpanda/builds/46568#018e61e6-c6cb-4fca-a45b-d3cbeeefa737:
new failures in https://buildkite.com/redpanda/redpanda/builds/46568#018e61e6-c6c8-45c0-be1d-eb9cb0d6cf85:
new failures in https://buildkite.com/redpanda/redpanda/builds/46568#018e61e6-c6c5-4606-a862-8228315cfe14:
new failures in https://buildkite.com/redpanda/redpanda/builds/46568#018e61f9-fa6d-4d1e-8c80-810f4abe6242:
new failures in https://buildkite.com/redpanda/redpanda/builds/46568#018e61f9-fa6a-4301-a79c-38f16f6b0550:
new failures in https://buildkite.com/redpanda/redpanda/builds/46568#018e61f9-fa64-4d64-941f-b783c746bf60:
new failures in https://buildkite.com/redpanda/redpanda/builds/46568#018e61f9-fa67-48b5-ab62-9c2d18000713:
new failures in https://buildkite.com/redpanda/redpanda/builds/46717#018e76e5-1db9-4459-884e-3cda38001308:
new failures in https://buildkite.com/redpanda/redpanda/builds/46717#018e76e5-1db3-4876-91dc-3e97ffba9112:
new failures in https://buildkite.com/redpanda/redpanda/builds/46717#018e76e5-1dbe-4cb0-bad6-94ded757fb18:
new failures in https://buildkite.com/redpanda/redpanda/builds/46717#018e76f7-fcd1-441a-a00f-0362c30c0046:
new failures in https://buildkite.com/redpanda/redpanda/builds/46717#018e76f7-fcd7-496f-a234-a2ab12290165:
new failures in https://buildkite.com/redpanda/redpanda/builds/46717#018e76f7-fcd4-4ea2-a1a5-2ff4321e2916:
new failures in https://buildkite.com/redpanda/redpanda/builds/46933#018e821a-f5a5-45c6-8123-08a27295983c:
Force-pushed from 291d47f to 6396208
nobody seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account. You have signed the CLA already but the status is still pending? Let us recheck it.
Force-pushed from 6396208 to a6bf506
Force push to fix conflicts and restart CI.
The high level structure seems reasonable. Left a bunch of nits, and some questions about how this is exposed via configs.
Also I think the topic recovery validator and new bits in the anomaly detector could use some unit testing.
Minor comments/questions.
How do you feel about test coverage, and what testing do you think we need to do?
Agree with pretty much all of Andrew's comments as well.
Force-pushed from a6bf506 to fc65155
Force push: rebase, fixing merge conflicts, addressed some of the comments. TBD:
waiting for CI to check regressions
Force-pushed from 64a4d10 to 0fdd16c
Fixed merge conflict, added a unit test for the anomalies detector, structured topic_recovery_validator with an internal class to factor out common parts. TBD: extend TopicRecoveryTest to test more combinations of recovery checks.
max_num_segments limits the total number of segment_meta entries that will be checked in a run. To check a segment_meta, the code verifies the existence of its segment in remote storage; this is done with a HEAD request. The limit is counted from most recent to oldest, so a depth of 1 causes only the newest segment to be checked, and an appropriately low number limits the scrub to the STM manifest section alone.

We could limit by:
- num segments
- offset
- total time

A limit by num segments correlates directly with the speed of the operation while being deterministic. Offset could map better to what the user wants, but the current implementation of reverse iteration in segment_meta_cstore makes it somewhat involved to implement. Total time is a strong guarantee, but the final result depends on the load of the system.

This commit implements a limit by number of segments with the parameter max_num_segments. This mode is easier to reason about, and other PRs can implement the other two modes.
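A minimal standalone sketch of that newest-to-oldest walk, using stand-in types rather than the actual Redpanda classes (segment_exists_in_cloud stands in for the HEAD request issued through cloud_storage::remote):

```cpp
#include <cstddef>
#include <string>
#include <vector>

struct segment_meta {
    std::string remote_path; // hypothetical field: object key of the segment
};

// Stand-in for the HEAD request against object storage.
bool segment_exists_in_cloud(const segment_meta&) { return true; }

// Walk segment_meta entries from newest to oldest and stop after
// max_num_segments existence checks; fail fast on the first missing segment.
bool validate_newest_segments(
  const std::vector<segment_meta>& segments, size_t max_num_segments) {
    size_t checked = 0;
    for (auto it = segments.rbegin();
         it != segments.rend() && checked < max_num_segments;
         ++it, ++checked) {
        if (!segment_exists_in_cloud(*it)) {
            return false;
        }
    }
    return true;
}
```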
Force-pushed from dc06b08 to 4119e98
Last failure was interesting, I had to manually cast the enum before serializing it:

inline void rjson_serialize(
  json::Writer<json::StringBuffer>& w, const cluster::recovery_checks& rc) {
    w.StartObject();
    // TODO investigate the reason. seems like a manual cast to uint16_t of
    // rc.mode enum is needed, otherwise we get an assertion at decoding time,
    // when we try to read an unsigned but the json value is not tagged as
    // unsigned
    write_member(
      w,
      "mode",
      static_cast<std::underlying_type_t<decltype(rc.mode)>>(rc.mode));
    write_member(w, "max_segment_depth", rc.max_segment_depth);
    w.EndObject();
}
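For what it's worth, the same pattern reproduces with plain rapidjson (a standalone illustration of the cast, not the Redpanda write_member helper):

```cpp
#include <rapidjson/stringbuffer.h>
#include <rapidjson/writer.h>

#include <cstdint>
#include <iostream>
#include <type_traits>

enum class validation_mode : uint16_t { no_check = 0, check_existence = 1 };

int main() {
    rapidjson::StringBuffer sb;
    rapidjson::Writer<rapidjson::StringBuffer> w(sb);
    auto mode = validation_mode::check_existence;
    w.StartObject();
    w.Key("mode");
    // Write the underlying integral type so the JSON value is tagged as an
    // unsigned number; a scoped enum has no implicit conversion anyway.
    w.Uint(static_cast<std::underlying_type_t<validation_mode>>(mode));
    w.EndObject();
    std::cout << sb.GetString() << "\n"; // {"mode":1}
}
```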
The #17198 issue is known.
In my last review I hadn't considered adding the option to the create request rather than the topic property, and the more I think about it the more natural it seems. WDYT?
class partition_validator {
public:
    // Each partition gets a separate retry_chain_node and a logger tied to it.
    // From this a common retry_chain_logger is created
Do you think it's worth unit testing this partition validator on its own? Defining it in the .cc file seems to discourage testing
Good point. @andijcr we want unit testing to be the first line of defense for testing new code.
However, if we have some testing that we feel will be exercising this, I'm OK if we do this as a follow-up rather than continuing to iterate on this large PR. If we don't think any of this code is getting exercised, then we need to do that.
tests/rptest/tests/topic_recovery_test.py::TopicRecoveryTest.[test_prevent_topic_recovery|test_many_partitions] are written to test this code. I'll get a unit test going like the one for anomalies_detector. The dimensions are [validation mode, cardinality 4], [partition damage, cardinality ~6], [download results, cardinality ~3].
auto validation_map = co_await maybe_validate_recovery_topic(
  assignable_config, bucket, _cloud_storage_api.local(), _as.local());
if (std::ranges::any_of(
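The diff hunk is truncated at the std::ranges::any_of call; a hedged guess at how the condition might conclude, with the result names taken from the commit messages further down (the predicate and the error path are assumptions, not the actual code):

```cpp
if (std::ranges::any_of(validation_map, [](const auto& entry) {
        // passed and no_manifest are acceptable for recovery; failure and
        // download_issue should block topic creation (see commit notes below).
        return entry.second != validation_result::passed
               && entry.second != validation_result::no_manifest;
    })) {
    // hypothetical error path: refuse to create the recovery topic
    co_return errc::topic_operation_error;
}
```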
Should we consider the timeout here too?
You mean the timeout for the create_topics operations?
Yes, we should (https://github.com/redpanda-data/core-internal/issues/1221).
In the ducktape local environment, and from some testing also on a cluster with S3, the current defaults seem to be OK.
Setting the cluster default validation to no_check would work as an escape hatch in case this operation becomes the bottleneck.
Maybe I can return an optional<map<partition_id, validation_res>> to remove the runtime cost in the no_check case.
partition_manifest_exists chains a check for a serde manifest and a JSON manifest
Describes how to perform validation for a partition. One of: manifest existence only; manifest file integrity and metadata check on the contained objects; no_check.
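Those three options suggest an enum roughly along these lines (a sketch; only no_check and check_manifest_own_metadata_only appear verbatim elsewhere in this thread, the existence-only name is a placeholder):

```cpp
#include <cstdint>

// Sketch of the per-partition validation mode described above.
enum class recovery_validation_mode : uint16_t {
    no_check = 0,                     // skip validation entirely
    check_manifest_existence,         // HEAD request for the partition manifest only
    check_manifest_own_metadata_only, // decode the manifest and verify its segment_meta
};
```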
Force-pushed from 4119e98 to 5cf31d6
Cluster-level default values:
- cloud_storage_recovery_topic_validation_mode is the recovery_validation_mode
- cloud_storage_recovery_topic_validation_depth is the validation depth, meaningful when validation_mode is model::recovery_validation_mode::check_manifest_own_metadata_only
- cloud_storage_recovery_topic_force_ovveride_cfg is the escape hatch to use cluster-level defaults instead of the topic properties
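A sketch of how the effective settings could be resolved from those cluster properties (accessor names are assumed to mirror the property names, as with other entries in config::shard_local_cfg()):

```cpp
// Sketch: read the cluster-level defaults; the depth is only meaningful for
// the metadata-checking mode.
auto mode = config::shard_local_cfg()
              .cloud_storage_recovery_topic_validation_mode.value();
auto depth = config::shard_local_cfg()
               .cloud_storage_recovery_topic_validation_depth.value();
if (mode != model::recovery_validation_mode::check_manifest_own_metadata_only) {
    depth = 0; // depth is ignored for existence-only and no_check modes
}
```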
Force-pushed from 5cf31d6 to 36f0e99
As discussed, removed recovery_checks from topic_properties and updated the rest of the code. Now checks are performed based on cluster properties.
@@ -988,6 +988,40 @@ ss::future<download_result> remote::segment_exists(
      existence_check_type::segment);
}

ss::future<remote::partition_manifest_existence>
remote::partition_manifest_exists(
why do we need a dedicated method to check manifests?
It's to perform an existence check (HEAD instead of GET).
Since the partition_manifest could be in serde or JSON format, I chained the checks here.
I guess it would be difficult to find a manifest.json in the wild, but I don't think we have a policy that prevents reading very old data.
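A standalone sketch of that chaining; head_object() stands in for the HEAD request issued through cloud_storage::remote, and the path construction is simplified (the real paths include a hash prefix and revision):

```cpp
#include <string>

enum class manifest_existence { missing, exists_serde, exists_json };

// Stand-in for the HEAD request against object storage.
bool head_object(const std::string& key);

// Prefer the serde manifest and fall back to the legacy JSON manifest.
manifest_existence partition_manifest_exists_sketch(const std::string& prefix) {
    if (head_object(prefix + "/manifest.bin")) {
        return manifest_existence::exists_serde;
    }
    if (head_object(prefix + "/manifest.json")) {
        return manifest_existence::exists_json;
    }
    return manifest_existence::missing;
}
```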
, rev_id_{rev_id}
, op_rtc_{retry_chain_node{
    as,
    300s,
Setting timeout in the root rtc is very often an error. It only works if the object is used immediately after creation.
You can create a temporary rtc based on this one right before invoking a method of _remote and pass the timeout to its c-tor.
So something like:
opt_rtc_{as} {...} // have to check syntax
and in do_validate_manifest_existence() and do_validate_manifest_metadata():
auto rtc = retry_chain_node(&opt_rtc_, 300s, config::shard_local_cfg().cloud_storage_initial_backoff_ms.value())?
Would the op_logger_ still work like this, or do I need to move it after rtc?
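Expanding that question into a sketch of the reviewer's suggestion (the retry_chain_node constructor argument order and the retry_chain_logger wiring would need checking, as the comment above notes; names other than those already in the thread are assumptions):

```cpp
// Root node owned by the validator: carries only the abort_source, no timeout.
retry_chain_node op_rtc_{as};

// Inside do_validate_manifest_existence() / do_validate_manifest_metadata():
// a short-lived child node carries the per-operation timeout and backoff, and
// is the one passed down to the cloud_storage::remote call.
retry_chain_node op_local_rtc{
  &op_rtc_,
  300s,
  config::shard_local_cfg().cloud_storage_initial_backoff_ms.value()};
// a logger tied to the short-lived node (wiring assumed)
retry_chain_logger op_logger{validator_log, op_local_rtc};
```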
New function that will perform validation for all the partitions of a topic, in parallel. Returns a map(partition_id, validation_result) that can be used to decide whether to block topic recovery in case of a fatal error.
The function will read topic_property::recovery_checks and the cluster defaults to perform either a manifest existence check or metadata checks, via the anomalies_detector class. Parallelism is limited, and the checks are performed with a long timeout to be resilient against backoff from cloud storage.
For each partition the possible result can be:
- passed <- validation successful
- no_manifest <- no manifest in cloud storage, allowed
- failure <- some inconsistencies that need intervention
- download_issue <- likely misconfigured cloud storage or a service issue
passed/no_manifest can be accepted for recovery, while failure/download_issue should raise an error. Error logs will point out partitions that did not pass the check.
If any partition fails validation, stop creation. A missing manifest is not considered a failure for the purpose of recovery.
For consistency between versions, be explicit about the default behavior of checks during recovery.
Simple test to ensure that a higher number of topics/partitions is not detrimental to the system
and an optional topic to restore, and an optional dict of topic properties to add during recovery.
test that missing segments will stop recovery early
When the check mode is no_check, validation is skipped. This commit reduces the runtime cost by not populating the map<partition_id, validation_result>. On the caller side, this is already interpreted as "validation ok".
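The shape described there might look roughly like this (return and parameter types are stand-ins; only the optional<map<...>> idea comes from the discussion above):

```cpp
// Sketch: std::nullopt means "no_check, nothing was validated", which the
// caller already treats as "validation ok", so the map is never allocated.
std::optional<std::map<model::partition_id, validation_result>>
maybe_validate_recovery_topic_sketch(recovery_validation_mode mode /*, ... */) {
    if (mode == recovery_validation_mode::no_check) {
        return std::nullopt; // skip all per-partition work
    }
    std::map<model::partition_id, validation_result> results;
    // ... run per-partition validation in parallel and populate results ...
    return results;
}
```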
Force-pushed from 36f0e99 to 0ce9a2a
Updated on comments; the major change is matching partition_probe to determine which segment_meta anomaly to consider fatal.
Introduced a new function to perform metadata validation for a remote topic, to be used during the process of topic recovery.
For each partition, it checks whether the manifest exists in the cloud and optionally checks that the file can be decoded and is self-consistent up to the N most recent segment_meta.
For each partition, the S3 cost is 1 HeadObject request for the partition_manifest, OR 1 GetObject request plus 1 HeadObject request per segment_meta (N in total).
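As a rough worked example of that cost (numbers chosen purely for illustration): for a topic with 100 partitions and a validation depth of N = 10, existence-only checking issues about 100 HeadObject requests in total, while the metadata check issues about 100 GetObject requests plus 100 × 10 = 1,000 HeadObject requests.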
In general, we could limit the depth of the check by:
- num segments: correlates directly with the speed of the operation while being deterministic.
- offset: could map better to what the user wants, but it's more challenging to implement with the current reverse-iteration implementation of segment_meta_cstore.
- total time: a strong guarantee, but the final result depends on the load of the system.
This PR implements a max_num_segments limit, as it's easier to reason about. A follow-up can implement the other two modes if there is a request for this.
The check is performed in parallel for each partition, with a cap on concurrency.
The result can be passed, no_manifest, failure, or download_issue (see above).
The validation mode is driven by a new (nullopt) topic property and a cluster-level default + force flag.
A new topic_property allows users to define validation manually by editing the topic_manifest.json file; otherwise, cluster-level defaults will be used.
The last one is an escape hatch to override recovery_checks with the cluster defaults.
Fixes https://github.com/redpanda-data/core-internal/issues/1138
Fixes https://github.com/redpanda-data/core-internal/issues/1139
Backports Required
Release Notes
Improvements