-
Notifications
You must be signed in to change notification settings - Fork 552
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cloud_storage: Improve scrubber #15948
Conversation
src/v/config/configuration.cc
Outdated
@@ -1741,7 +1741,7 @@ configuration::configuration() | |||
"cloud_storage_full_scrub_interval_ms", | |||
"Time interval between a final scrub and thte next scrub", | |||
{.needs_restart = needs_restart::no, .visibility = visibility::tunable}, | |||
1h) | |||
10h) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we consider something "round" (to the day) like 12h?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed to 12
src/v/cloud_storage/types.h
Outdated
|
||
void compact(model::offset local); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
src/v/cloud_storage/types.cc
Outdated
/// Returns number of discarded elements | ||
template<class T, size_t quota = max_number_of_manifest_anomalies> | ||
size_t insert_with_size_limit( | ||
absl::node_hash_set<T>& self, const absl::node_hash_set<T>& other) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I found self
to be a bit confusing, since it's easy to read this and think "I'm reading the a class method, oh wait no I'm not, what's going on?". Maybe consider something like dest
and to_add
or something
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
src/v/cloud_storage/types.cc
Outdated
@@ -262,17 +267,53 @@ size_t anomalies::count_segment_meta_anomaly_type(anomaly_type type) const { | |||
return static_cast<size_t>(count); | |||
} | |||
|
|||
/// Returns number of discarded elements | |||
template<class T, size_t quota = max_number_of_manifest_anomalies> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: IMO size_limit
would be more descriptive than quota
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
src/v/cloud_storage/types.cc
Outdated
auto total_size = self.size() + other.size(); | ||
auto to_remove = total_size - quota; | ||
size_t num_removed = 0; | ||
if (self.size() < to_remove) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: could be <=
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
@@ -246,6 +248,9 @@ void scrub_segment_meta( | |||
} | |||
} | |||
|
|||
// Limit on number of anomalies that can be stored in the manifest | |||
static constexpr size_t max_number_of_manifest_anomalies = 100; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do you feel about making this configurable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's ok for this value to be hardcoded. I can't imagine a situation when we need to tweak it manually. The actual value doesn't really matter because in most practical situations this value should be low (ideally zero).
src/v/cloud_storage/types.cc
Outdated
/// Returns number of discarded elements | ||
template<class T, size_t quota = max_number_of_manifest_anomalies> | ||
size_t insert_with_size_limit( | ||
absl::node_hash_set<T>& self, const absl::node_hash_set<T>& other) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering if it makes sense for the container to be ordered. Naively I'd expect the container to be evicted from LRU, but instead we're removing random elements to satisfy the limits. In the field this may be surprising and easy to make incorrect assumptions about.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think serde doesn't support ordered set. And I don't want to change the data format much because it's already part of the release and somebody may already turn the feature on.
Also, this will prioritize new data anyway because we're removing from the 'dest' container first.
Limit number of elements stored in the anomalies collection in the manifest. Only last 100 elements are stored. Also, several counters are added to indicate number of discarded anomalies.
3434eed
to
5effd5b
Compare
new failures in https://buildkite.com/redpanda/redpanda/builds/43445#018cd647-a1b6-4178-9abb-5959726988e1:
new failures in https://buildkite.com/redpanda/redpanda/builds/43445#018cd658-4094-4a62-9819-0315c3121947:
|
/backport v23.3.x |
Limit number of scrubber anomalies in memory to 100.
Limit full scrub interval to 10h.
Backports Required
Release Notes