[v23.3.x] Backport timequery bugfixes to 23.3.x #18599
Commits on May 21, 2024
rptest: allow skipping end-of-test scrubbing
(cherry picked from commit 4cdcc0e)
Commit a63cae3
m/tests: allow generating records with matching timestamps
(cherry picked from commit 99d2bec)
Commit 5e5b475
storage: stylistic improvements to timequery tests
(cherry picked from commit d97d61f)
Commit 855386e
storage: document log reader semantics
(cherry picked from commit 4f87afa)
Commit 7ec2bae
model: add bounded_offset_interval type
Encapsulates common operations on offset intervals. For now, although it is named bounded, the maximum offset can still be set to `model::offset::max()`. I will likely change this in the future, as it requires changing quite a few call sites, most likely only tests.

This little data structure tries to be very lightweight and impose minimal overhead on basic interval operations like intersection or inclusion. It is also quite hard to use incorrectly, thanks to the checked construction variant and debug assertions.

Later, we might want to refactor things like log_reader to use this instead of the min and max offsets they use today. Once that is done, the checked variant needs to be called only once at the kafka layer; for everyone else it becomes a ~0-cost abstraction.

(cherry picked from commit f13bfa6)
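As a rough illustration of the shape described above, here is a minimal self-contained sketch of such an interval type. The `checked`/`unchecked` factory names and the plain `offset_t` alias are illustrative assumptions, not the actual Redpanda implementation:

```cpp
#include <cassert>
#include <stdexcept>

// Illustrative stand-in for model::offset (a strongly-typed wrapper in
// the real codebase).
using offset_t = long;

class bounded_offset_interval {
    offset_t _min;
    offset_t _max;

    bounded_offset_interval(offset_t min, offset_t max)
      : _min(min), _max(max) {
        // Debug-only sanity check on the unchecked path.
        assert(min <= max);
    }

public:
    // Checked variant: validates inputs once, e.g. at the kafka layer.
    static bounded_offset_interval checked(offset_t min, offset_t max) {
        if (min < 0 || max < 0 || min > max) {
            throw std::invalid_argument("invalid offset interval");
        }
        return {min, max};
    }

    // Unchecked variant: ~0-cost construction for internal callers that
    // already hold a valid range.
    static bounded_offset_interval unchecked(offset_t min, offset_t max) {
        return {min, max};
    }

    offset_t min() const noexcept { return _min; }
    offset_t max() const noexcept { return _max; }

    // Inclusion and intersection: the basic interval operations the
    // commit message mentions.
    bool contains(offset_t o) const noexcept { return _min <= o && o <= _max; }
    bool overlaps(const bounded_offset_interval& other) const noexcept {
        return _min <= other._max && other._min <= _max;
    }
};
```

The split mirrors the commit's intent: validation happens once at the boundary, while internal code pays only for the comparisons themselves.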
Commit 2c5174d
storage: add min/max bounds for timequeries
The Kafka ListOffsets request allows querying offsets by timestamp. The result of such a request is the first Kafka offset in the log that has a timestamp equal to or greater than the requested timestamp, or -1 if no such record can be found.

The implementation we have today assumes that the start of the physical log matches the start of the log as it is seen by external users/Kafka clients. However, this is not always true; in particular, when [trim-prefix][1] (prefix truncation) is used. There are 2 sources of problems:

1) trim-prefix is applied synchronously at the cluster layer, where it changes the visibility of the log from the client's point of view, but it is applied asynchronously to the consensus log/physical log/disk_log_impl class and to cloud storage.

2) trim-prefix is executed with the offset of a record that is in the middle of a batch.

As a result, in these scenarios, if a client sends a Kafka Fetch request with the received offset, it is answered with an OffsetOutOfRange error.

This commit changes how such queries are implemented at the lower levels of the system, by carrying the information about the visible start and end of the log together with the timestamp. Then, at the lower levels, we use these to limit our search to only that range.

Although this commit does not change the implementation of the tiered storage timequery, it does fix the trim-prefix problem there too in the general case, because of the check and "bump" added in #11579. Tiered Storage timequeries have some additional problems which I plan to address in #18097.

[1]: https://docs.redpanda.com/current/reference/rpk/rpk-topic/rpk-topic-trim-prefix/

(cherry picked from commit 76a1ea2)
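Conceptually, the timequery now travels with the visible offset bounds, and the lower layers clamp the search to them. The sketch below illustrates this under assumed, simplified types; `timequery_config` and `batch` here are hypothetical, not the real Redpanda structures:

```cpp
#include <optional>
#include <vector>

// Hypothetical, simplified stand-ins.
struct batch {
    long base_offset;
    long timestamp;
};

// The query carries the visible bounds alongside the timestamp.
struct timequery_config {
    long ts;         // requested timestamp
    long min_offset; // first offset visible to kafka clients
    long max_offset; // last offset visible to kafka clients
};

// Returns the first visible offset whose timestamp is >= cfg.ts, or
// std::nullopt (mapped to -1 at the kafka layer) if none exists.
std::optional<long> timequery(
  const std::vector<batch>& log, const timequery_config& cfg) {
    for (const auto& b : log) {
        if (b.base_offset > cfg.max_offset) {
            break; // past the visible end of the log
        }
        if (b.base_offset < cfg.min_offset) {
            continue; // physically present, but prefix-truncated
        }
        if (b.timestamp >= cfg.ts) {
            return b.base_offset;
        }
    }
    return std::nullopt;
}
```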
Commit 9436a0a
cloud_storage: do not read from cloud if it was truncated completely
The previous code contained a bug which was masked by the retry logic in replicated partition:

```cpp
const auto kafka_start_override = _partition->kafka_start_offset_override();
if (
  !kafka_start_override.has_value()
  || kafka_start_override.value() <= res.value().offset) {
    // The start override doesn't affect the result of the timequery.
    co_return res;
}
vlog(
  klog.debug,
  "{} timequery result {} clamped by start override, fetching result at "
  "start {}",
  ntp(),
  res->offset,
  kafka_start_override.value());
...
```

(cherry picked from commit f9ed5ca)
Commit 480d777
rptest: test trim-prefix with timequery
(cherry picked from commit a40999d)
Commit b534246
kafka: fix off-by-one in timequery
It is not easy to test that this is right, so I'm not going to for now.

(cherry picked from commit 8f2de96)
Commit ae6dc6d
rptest: kill existing connections in firewall_blocked
The intention of firewall_blocked is to always prevent communication. However, if a connection already exists, it might take a long time until Linux gives up retrying and the problem is reported. To avoid this, we now kill existing connections instantly. This was necessary for a variation of a timequery test I was writing; although it is not strictly necessary anymore, I consider it a nice addition.

(cherry picked from commit 4c706fb)
Commit 022d912
archival: live settable manifest upload timeout
This is needed for a ducktape test where we want to change the manifest upload timeout at runtime, e.g. set it to 0 to prevent manifest uploads from one point in time onward. The property is declared in configuration.cc as not requiring a restart, but in fact it did require one prior to this commit. Other properties of the archiver configuration should be fixed too, in a separate commit.

(cherry picked from commit aab5fe7)
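For a property to be truly live-settable, its consumers have to re-read it (or watch for changes) rather than cache the value at construction time. Below is a minimal sketch of that pattern with hypothetical names; it is not Redpanda's actual config API:

```cpp
#include <functional>
#include <utility>
#include <vector>

// Hypothetical live-settable property: read the current value on every
// use, or register a watcher to react to updates.
template<typename T>
class live_property {
    T _value;
    std::vector<std::function<void(const T&)>> _watchers;

public:
    explicit live_property(T initial) : _value(std::move(initial)) {}

    // Read the current value; callers must not cache the result.
    const T& operator()() const { return _value; }

    // Subscribe to changes, e.g. to adjust an in-flight upload timer.
    void watch(std::function<void(const T&)> cb) {
        _watchers.push_back(std::move(cb));
    }

    // Invoked by the config subsystem when an operator changes the setting.
    void set(T v) {
        _value = std::move(v);
        for (auto& cb : _watchers) {
            cb(_value);
        }
    }
};
```

With this pattern, an updated timeout takes effect on the next upload without a restart, which is the behavior the ducktape test described above relies on.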
Commit 4951141
cloud_storage: throw if cannot create manifest cursor
It is not reasonable to continue work after this point. The absence of a cursor is interpreted as EOF in other parts of the system, so not throwing makes it impossible to differentiate between "no more data available" and "an error occurred". This is required for an upcoming commit that fixes a timequery bug where cloud storage returns "no offset found" instead of propagating an internal error.

(cherry picked from commit b53deac)
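The underlying issue is that an optional return value conflates "no more data" with "failed to build a cursor". A minimal sketch of the throwing variant, using hypothetical stand-in types:

```cpp
#include <optional>
#include <stdexcept>

// Hypothetical, simplified stand-ins for the real manifest-view types.
struct cursor {};
struct timequery {};

// Placeholder for the lower-level factory; the real implementation
// searches the manifest view for a matching region.
std::optional<cursor> make_cursor(const timequery&) { return std::nullopt; }

// Before: returning nullopt on failure was indistinguishable from EOF.
// After: failure throws, so nullopt can be reserved for genuine
// end-of-data by the layers above.
cursor require_cursor(const timequery& q) {
    auto c = make_cursor(q);
    if (!c) {
        throw std::runtime_error(
          "failed to create manifest cursor for timequery");
    }
    return *c;
}
```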
Commit cb17f1a
cloud_storage: do not ignore out of range errors
This code tried to be clever and ignored exceptions in some cases where doing so was assumed to be safe, i.e. if the start offset stored in the cloud moved forward. I don't believe covering these edge cases is necessary:

- Reasoning about correctness becomes harder, as it masks the case where we read an out of range offset by mistake.
- It hides from the client the fact that the offset they just tried to read does not exist anymore. As a user, if the log is prefix-truncated, then I expect the reply to Fetch to be out of range and not an empty response.

(cherry picked from commit d1543ee)
Commit 312ba09
cloud_storage: test querying with spillover
Introduce a new test to show the existence of a bug. In particular:

```cpp
// BUG: This assertion is disabled because it currently fails.
// test_log.debug("Timestamp undershoots the partition");
// BOOST_TEST_REQUIRE(timequery(*this, model::timestamp(100), 3 * 6));
```

The assertion is commented out because it is failing.

(cherry picked from commit 41eed62)
Commit 33c9346
cloud_storage: search for query from archive start not clean offset
Starting the cursor from the clean offset is only required when computing retention, because of an implementation detail which is documented in the added comment and referenced commits. In all other cases we must start searching from the archive start offset. This is particularly important for timequeries, which must return the first visible batch above the timestamp.

(cherry picked from commit c5eb52d)
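In other words, the cursor's starting offset now depends on what the query is for. A hypothetical sketch of that dispatch (illustrative names only, not the real code):

```cpp
// Hypothetical sketch: retention is the only query type that must walk
// from the clean offset; every other query starts at the archive start
// offset, so prefix-truncated data is never surfaced to readers.
enum class query_purpose { retention, timequery, fetch };

long cursor_begin_offset(
  query_purpose purpose, long clean_offset, long archive_start_offset) {
    switch (purpose) {
    case query_purpose::retention:
        // Retention accounting needs to see data below the archive start
        // that has not been garbage-collected yet (the implementation
        // detail referenced in the commit message).
        return clean_offset;
    default:
        return archive_start_offset;
    }
}
```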
Commit 6ed4048
cloud_storage: throw on out of range timequery with spillover
When reading from tiered storage, we create an `async_manifest_view_cursor` using a query (offset or timestamp) and a begin offset which is set to the start of the stm region or the start of the archive (spillover).

There is a bug inside `async_manifest_view_cursor` which causes it to throw an out_of_range error when spillover contains data which is logically prefix-truncated but matches the timequery. This happens mainly because the begin offset is not properly propagated together with the query, which makes it possible for the query to match a spillover manifest that is below the begin offset.

In this commit, we remove the logic that ignores the out of range error and instead propagate it to the caller. In a later commit, we will revisit the code so that this edge case is handled correctly inside the async_manifest_view, and it seeks to the correct offset rather than throwing an out of range exception up to the client.

(cherry picked from commit 680a67e)
Commit 50e90ce
cloud_storage: simplify the out_of_range logging code
No functional changes. (cherry picked from commit 3a9058a)
Commit 574b9e8
cloud_storage: carry min/max offset bounds into cloud storage timequery
Tiered Storage physically has a superset of the addressable data. This can be caused at least by the following: a) trim-prefix, b) retention being applied but garbage collection not finishing yet.

For offset queries this isn't problematic, because the bounds can be applied at a higher level. In particular, the partition object validates that an offset is in range before passing control to the remote partition. For timequeries, prior to this commit, such bounds were not enforced, leading to a bug where cloud storage would return an offset of -1 (no data found) when there actually was data, or would return a wrong offset.

Wrong offset: this would be returned because reads could start prior to the partition's visible/addressable start offset, e.g. after retention was applied but before GC was run, or after a trim-prefix with an offset which falls in the middle of a batch.

Missing offset: this would be returned when the higher-level reader was created with visible/addressable partition offset bounds, say [1000, 1200], but cloud storage would find the offset in a manifest with bounds [300, 400], leading to an out of range error which used to be ignored.

(cherry picked from commit 0735bdf)
Commit a449bbf
kafka: remove timequery retry hack from replicated partition
It is not required anymore, because we now carry the actual partition start and end down to the lowest layer of the system.

(cherry picked from commit 9846ed9)
Commit 063aeee
cloud_storage: add additional timequery tests
These cover more edge cases and better highlight an existing bug in the timequery, in which the start offset for the timequery is ignored or handled inconsistently. See the added code comments starting with "BUG:".

(cherry picked from commit 5ae5fcd)
Commit 17ac0c7
cloud_storage: reword code comment for in_stm with timequery
I believe this makes it clearer. (cherry picked from commit 943aa52)
Commit 2897d54
cloud_storage: document in_archive and in_stm methods
(cherry picked from commit 1d7a1e3)
Commit 1059e17