demote ERROR message to DEBUG for timequerys at the edge of spillover retention #16302
Log a warning if, when trying to get an async_manifest_view_cursor for a timequery, the result is out_of_range. Such an error can happen if retention is kicking in and deleting some or all spillover manifests: in this case there is a window where the manifests are reclaimable but still kept in memory, and the timequery hits one of the manifests in the reclaimable range. Handling this as a warning with no result is acceptable because the Kafka client issuing the request can handle this failure, and handling this edge case otherwise would increase the complexity of the call stack.
Force-pushed from 9c723c0 to 4ca7281.
log_start_offset.value()());
co_return;
}
&& ss::visit(
Why is ss::visit better than holds_alternative/get?
OK, now I see that it's not the full change
new failures in https://buildkite.com/redpanda/redpanda/builds/44350#018d4656-2621-46f4-8e9b-d464c2fe8463:
failure is likely unrelated "RuntimeError: Internal object storage scrub detected fatal anomalies: [{'ns': 'kafka', 'topic': '__consumer_offsets', 'partition': 15, 'revision_id': 32, 'missing_segments': ['f9fd300e/kafka/__consumer_offsets/15_32/44-219-21700-4-v1.log.5'], 'last_complete_scrub_at': 1706284164049}]"
/backport v23.3.x
/backport v23.2.x
Consider this trace:
A list_offsets request with a timestamp hits the spillover region while retention is running, and the first few spillover manifests become eligible for GC.
In this case, the timequery will still start from the first spillover manifest, but if it hits one of the collectible manifests, this will be logged as an error, and no result will be returned.
The Kafka client is okay with this result, but the ERROR log line causes the tests to fail.
In theory, we could restrict the search space (as was attempted here), but we are dealing with suspension points at an unstable moment, and we could hit various edge cases (what if retention hits the whole spillover region?).
So this PR recognizes this category of out_of_range error and logs a DEBUG message instead of an ERROR.
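A minimal sketch of the idea, not the actual Redpanda code: the types `cursor`, `cursor_error`, `cursor_result`, and `log_level` and both helper functions are hypothetical stand-ins, and `std::visit` is used in place of `ss::visit`. It only illustrates singling out the out_of_range case of a failed cursor lookup and picking a lower severity for it.

```cpp
#include <type_traits>
#include <variant>

// Hypothetical stand-ins for illustration only; not Redpanda's real types.
enum class cursor_error { out_of_range, io_failure };
enum class log_level { debug, error };

struct cursor {
    long base_offset;
};

// A cursor lookup either yields a cursor or an error code.
using cursor_result = std::variant<cursor, cursor_error>;

// Pick the log severity for a failed timequery cursor lookup.
// out_of_range is expected while retention is reclaiming spillover
// manifests, so it is demoted; everything else stays an error.
inline log_level severity_for(cursor_error e) {
    return e == cursor_error::out_of_range ? log_level::debug
                                           : log_level::error;
}

// True when the lookup failed in the benign way, i.e. the timequery
// should quietly return "no result" to the client.
inline bool is_benign_timequery_miss(const cursor_result& r) {
    return std::visit(
      [](const auto& v) {
          using T = std::decay_t<decltype(v)>;
          if constexpr (std::is_same_v<T, cursor_error>) {
              return v == cursor_error::out_of_range;
          } else {
              return false; // a valid cursor is not a miss
          }
      },
      r);
}
```

With these assumed helpers, a caller would check `is_benign_timequery_miss(result)` first, log at `severity_for(...)`, and return without a result, leaving every other error on the existing ERROR path.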
The change is in the last commit; the previous ones are minor things found while studying this trace.
annotated trace below
Fixes #15489
Fixes #16026
Backports Required
Release Notes