Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cloud_storage: Improve scrubber #15948

Merged
merged 2 commits into from
Jan 11, 2024
Merged

Conversation

Lazin
Copy link
Contributor

@Lazin Lazin commented Jan 4, 2024

Limit number of scrubber anomalies in memory to 100.
Limit full scrub interval to 10h.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.3.x
  • v23.2.x
  • v23.1.x

Release Notes

  • none

@@ -1741,7 +1741,7 @@ configuration::configuration()
"cloud_storage_full_scrub_interval_ms",
"Time interval between a final scrub and thte next scrub",
{.needs_restart = needs_restart::no, .visibility = visibility::tunable},
1h)
10h)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we consider something "round" (to the day) like 12h?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed to 12

Comment on lines 418 to 419

void compact(model::offset local);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not used?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

/// Returns number of discarded elements
template<class T, size_t quota = max_number_of_manifest_anomalies>
size_t insert_with_size_limit(
absl::node_hash_set<T>& self, const absl::node_hash_set<T>& other) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I found self to be a bit confusing, since it's easy to read this and think "I'm reading the a class method, oh wait no I'm not, what's going on?". Maybe consider something like dest and to_add or something

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -262,17 +267,53 @@ size_t anomalies::count_segment_meta_anomaly_type(anomaly_type type) const {
return static_cast<size_t>(count);
}

/// Returns number of discarded elements
template<class T, size_t quota = max_number_of_manifest_anomalies>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: IMO size_limit would be more descriptive than quota

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

auto total_size = self.size() + other.size();
auto to_remove = total_size - quota;
size_t num_removed = 0;
if (self.size() < to_remove) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: could be <=

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@@ -246,6 +248,9 @@ void scrub_segment_meta(
}
}

// Limit on number of anomalies that can be stored in the manifest
static constexpr size_t max_number_of_manifest_anomalies = 100;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do you feel about making this configurable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's ok for this value to be hardcoded. I can't imagine a situation when we need to tweak it manually. The actual value doesn't really matter because in most practical situations this value should be low (ideally zero).

/// Returns number of discarded elements
template<class T, size_t quota = max_number_of_manifest_anomalies>
size_t insert_with_size_limit(
absl::node_hash_set<T>& self, const absl::node_hash_set<T>& other) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if it makes sense for the container to be ordered. Naively I'd expect the container to be evicted from LRU, but instead we're removing random elements to satisfy the limits. In the field this may be surprising and easy to make incorrect assumptions about.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think serde doesn't support ordered set. And I don't want to change the data format much because it's already part of the release and somebody may already turn the feature on.
Also, this will prioritize new data anyway because we're removing from the 'dest' container first.

Limit number of elements stored in the anomalies collection in the
manifest. Only last 100 elements are stored. Also, several counters are
added to indicate number of discarded anomalies.
@vbotbuildovich
Copy link
Collaborator

vbotbuildovich commented Jan 4, 2024

new failures in https://buildkite.com/redpanda/redpanda/builds/43445#018cd647-a1b6-4178-9abb-5959726988e1:

"rptest.tests.internal_topic_protection_test.InternalTopicProtectionLargeClusterTest.test_schemas_topic"

new failures in https://buildkite.com/redpanda/redpanda/builds/43445#018cd658-4094-4a62-9819-0315c3121947:

"rptest.tests.internal_topic_protection_test.InternalTopicProtectionLargeClusterTest.test_schemas_topic"

@Lazin Lazin requested a review from andrwng January 9, 2024 13:26
@Lazin Lazin merged commit 4b6ea26 into redpanda-data:dev Jan 11, 2024
19 checks passed
@Lazin
Copy link
Contributor Author

Lazin commented Jan 11, 2024

/backport v23.3.x

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants