
[v23.3.x] Backport timequery bugfixes to 23.3.x #18599

Merged

merged 22 commits into v23.3.x from nv/backport-v23.3.x-timequery on May 21, 2024

Conversation

nvartolomei
Contributor

Backport of #18097
Backport of #18112

Closes #18282
Closes #18566

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.1.x
  • v23.3.x
  • v23.2.x

Release Notes

Bug Fixes

  • Fix a scenario where list_offset with a timestamp could return an offset lower than the partition start after a trim-prefix command. This could leave consumers stuck with an out-of-range-offset exception if they began consuming from an offset below the one used in the trim-prefix command.
  • Fix an edge case where a timequery returns no results if it races with tiered storage retention and garbage collection. This matters at least for consumers that fall behind retention: they interpret such a response as the partition being empty and jump to the HWM instead of resuming from the first available message.

Encapsulates common operations on offset intervals. For now, although it
is named bounded, the maximum offset can still be set to
`model::offset::max()`. I will likely change this in the future, as it
requires changing quite a few call sites, most likely only tests.

This little data structure tries to be very lightweight and to impose
minimal overhead on basic interval operations like intersection or
inclusion. It is also quite hard to use incorrectly, thanks to the
checked construction variant and debug assertions.

Later, we might want to refactor things like log_reader to use this
instead of the min and max offsets they use today. Once that is done,
the checked variant needs to be called only once, at the kafka layer.
For everyone else it becomes a ~0 cost abstraction.
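
A minimal sketch of the idea, with hypothetical names and plain integer
offsets (the real class operates on `model::offset` and, per the message
above, adds debug assertions):

```cpp
#include <optional>

class bounded_offset_interval {
public:
    // Checked construction: invalid bounds yield std::nullopt instead
    // of an interval that silently misbehaves.
    static std::optional<bounded_offset_interval>
    checked(long min, long max) {
        if (min < 0 || max < min) {
            return std::nullopt;
        }
        return bounded_offset_interval(min, max);
    }

    bool overlaps(const bounded_offset_interval& o) const {
        return _min <= o._max && o._min <= _max;
    }

    bool contains(long offset) const {
        return _min <= offset && offset <= _max;
    }

private:
    bounded_offset_interval(long min, long max) : _min(min), _max(max) {}

    long _min;
    long _max;
};
```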

(cherry picked from commit f13bfa6)
The kafka ListOffsets request allows querying offsets by timestamp. The
result of such a request is the first kafka offset in the log that has a
timestamp equal to or greater than the requested timestamp, or -1 if no
such record can be found.

The implementation we have today assumes that the start of the physical
log matches the start of the log as it is seen by external users/kafka
clients.

However, this is not always true. In particular, when [trim-prefix][1]
(prefix truncation) is used. There are 2 sources of problems:

  1) trim-prefix is applied synchronously at the cluster layer, where it
     changes the visibility of the log from the client point-of-view,
     but asynchronously to the consensus log/physical log/disk_log_impl
     class and cloud storage.

  2) trim-prefix is executed with the offset of a record that is in the
     middle of a batch.

As a result, in these scenarios, if a client sends a kafka Fetch
request with the received offset, it will be met with an
OffsetOutOfRange error.

This commit changes how such queries are implemented at the lower levels
of the system: we carry the information about the visible start and end
of the log together with the timestamp, and at the lower levels we use
these bounds to limit the search to that range.
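
A rough sketch of that idea, with hypothetical names (the real query
plumbing in the tree differs):

```cpp
#include <algorithm>
#include <optional>
#include <vector>

// Illustrative types only.
struct timequery_config {
    long min_offset;     // visible start of the log (after trim-prefix)
    long max_offset;     // visible end of the log
    long long timestamp; // queried timestamp
};

struct batch_info {
    long base_offset;
    long long max_timestamp;
};

// Search only within the visible range: a batch that matches the
// timestamp but lies below min_offset (still on disk, already trimmed)
// must not produce an offset the client cannot fetch.
std::optional<long> timequery(
  const timequery_config& cfg, const std::vector<batch_info>& log) {
    for (const auto& b : log) {
        if (b.base_offset > cfg.max_offset) {
            break; // past the visible end of the log
        }
        if (b.max_timestamp < cfg.timestamp) {
            continue; // batch is entirely below the queried timestamp
        }
        // Clamp to the visible start so trim-prefix cannot surface
        // an offset below what clients are allowed to fetch.
        return std::max(b.base_offset, cfg.min_offset);
    }
    return std::nullopt; // maps to -1 in the ListOffsets response
}
```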

Although this commit does not change the implementation of the tiered
storage timequery, it does fix the trim-prefix problem there too in the
general case, because of the check and "bump" added in
#11579.

Tiered Storage timequeries have some additional problems which I plan to
address in #18097.

[1]: https://docs.redpanda.com/current/reference/rpk/rpk-topic/rpk-topic-trim-prefix/

(cherry picked from commit 76a1ea2)
Previous code contained a bug that is masked by the retry logic in the
replicated partition:

    const auto kafka_start_override = _partition->kafka_start_offset_override();
    if (
      !kafka_start_override.has_value()
      || kafka_start_override.value() <= res.value().offset) {
        // The start override doesn't affect the result of the timequery.
        co_return res;
    }
    vlog(
      klog.debug,
      "{} timequery result {} clamped by start override, fetching result at "
      "start {}",
      ntp(),
      res->offset,
      kafka_start_override.value());

    ...

(cherry picked from commit f9ed5ca)
Not easy to test that this is right, so I am not going to for now.

(cherry picked from commit 8f2de96)
The intention of firewall_blocked is to always prevent communication.
However, if a connection already exists, it might take a long time until
Linux gives up retrying and the problem is reported. To avoid this, we
attempt to kill the connections instantly.

This was necessary for a variation of a timequery test I was writing.
Although it is not strictly necessary anymore, I consider it a nice
addition.

(cherry picked from commit 4c706fb)
This is needed for a ducktape test where we want to change the manifest
upload timeout at runtime, e.g. set it to 0 to prevent manifest
uploading from one point in time onward.

It is declared in configuration.cc as not requiring a restart, but in
fact it did require one prior to this commit.

Other properties of the archiver configuration should be fixed too in a
separate commit.

(cherry picked from commit aab5fe7)
It is not reasonable to continue work after this point. The absence of a
cursor is interpreted as EOF in other parts of the system. Not throwing
makes it impossible to differentiate between "no more data available"
and "an error occurred".

This is required for an upcoming commit that fixes a timequery bug where
cloud storage returns "no offset found" instead of propagating an
internal error.
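
A small sketch of the contract this enforces, with hypothetical names:

```cpp
#include <optional>
#include <stdexcept>

struct cursor { long offset; };

// An empty optional is reserved for genuine end-of-data; internal
// failures throw so callers can tell the two cases apart.
std::optional<cursor> next_cursor(bool end_of_log, bool lookup_ok) {
    if (end_of_log) {
        return std::nullopt; // "no more data available"
    }
    if (!lookup_ok) {
        // Returning std::nullopt here would make an internal error
        // indistinguishable from EOF at every call site.
        throw std::runtime_error("failed to materialize cursor");
    }
    return cursor{0};
}
```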

(cherry picked from commit b53deac)
This code tried to be clever and ignored exceptions in some of the cases
where it was assumed to be safe to do so, i.e. if the start offset
stored in the cloud moved forward.

I don't believe covering these edge cases is necessary.
- Reasoning about correctness becomes harder, as it masks the case where
  we read an out-of-range offset by mistake.
- It hides from the client the fact that the offset they just tried to
  read does not exist anymore. As a user, if the log is prefix
  truncated, then I expect the reply to Fetch to be out of range and not
  an empty response.

(cherry picked from commit d1543ee)
Introduce a new test to show the existence of a bug. In particular,

```cpp
// BUG: This assertion is disabled because it currently fails.
// test_log.debug("Timestamp undershoots the partition");
// BOOST_TEST_REQUIRE(timequery(*this, model::timestamp(100), 3 * 6));
```

The assertion is commented out because it is failing.

(cherry picked from commit 41eed62)
Starting the cursor from the clean offset is only required when
computing retention, because of an implementation detail which is
documented in the added comment and referenced commits.

In all other cases we must start searching from the archive start
offset. This is particularly important for timequeries, which must
return the first visible batch above the timestamp.
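
A hypothetical one-liner capturing that rule (names are illustrative):

```cpp
enum class scan_purpose { retention, timequery, read };

// Only the retention computation starts from the clean offset; every
// other scan starts from the archive start offset so that it sees
// exactly the visible/addressable data.
long cursor_begin_offset(
  scan_purpose p, long clean_offset, long archive_start_offset) {
    return p == scan_purpose::retention ? clean_offset
                                        : archive_start_offset;
}
```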

(cherry picked from commit c5eb52d)
When reading from tiered storage, we create an
`async_manifest_view_cursor` using a query (offset or timestamp) and a
begin offset which is set to the start of the stm region or the start of
the archive (spillover).

There is a bug inside `async_manifest_view_cursor` which causes it to
throw an out_of_range error when spillover contains data which is
logically prefix-truncated but matches the timequery. This happens
mainly because the begin offset is not properly propagated together with
the query, which makes it possible for the query to match a spillover
manifest that is below the begin offset.

In this commit, we remove the logic to ignore the out of range error and
propagate it to the caller.

In a later commit, we will revisit the code so that this edge case is
handled correctly inside the async_manifest_view and it seeks to the
correct offset rather than throwing an out_of_range exception up to the
client.

(cherry picked from commit 680a67e)
No functional changes.

(cherry picked from commit 3a9058a)
Tiered Storage physically has a superset of the addressable data. This
can be caused at least by the following: a) trim-prefix, b) retention
being applied but garbage collection not having finished yet.

For offset queries this isn't problematic because the bounds can be
applied at a higher level. In particular, the partition object validates
that the offset is in range before passing control to the remote
partition.

For timequeries, prior to this commit, such bounds were not enforced,
leading to a bug where cloud storage would return offset -1 (no data
found) when there actually was data, or would return a wrong offset.

Wrong offset: it would be returned because reads could start prior to
the partition's visible/addressable offset, e.g. after retention was
applied but before GC ran, or after a trim-prefix with an offset which
falls in the middle of a batch.

Missing offset: it would be returned when the higher-level reader was
created with visible/addressable partition offset bounds, say [1000,
1200], but cloud storage found the offset in a manifest with bounds
[300, 400], leading to an out-of-range error which used to be ignored.
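
A sketch of how bounding the manifest search avoids both cases
(hypothetical names; the timestamp match itself is omitted for brevity):

```cpp
#include <cstddef>
#include <optional>
#include <vector>

struct manifest { long base_offset; long last_offset; };

// Restrict the search to manifests that overlap the visible range, so
// a timequery can neither match data below the addressable start (the
// "wrong offset" case) nor trip over a manifest entirely below the
// range (the "missing offset" case).
std::optional<std::size_t> pick_manifest(
  const std::vector<manifest>& manifests,
  long visible_start,
  long visible_end) {
    for (std::size_t i = 0; i < manifests.size(); ++i) {
        const auto& m = manifests[i];
        if (m.last_offset < visible_start) {
            continue; // logically prefix-truncated, skip entirely
        }
        if (m.base_offset > visible_end) {
            break; // past the visible end of the partition
        }
        return i; // first manifest overlapping the visible range
    }
    return std::nullopt;
}
```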

(cherry picked from commit 0735bdf)
It is not required anymore because we carry the actual partition start
and end to the lowest layer of the system.

(cherry picked from commit 9846ed9)
These cover more edge cases and better highlight an existing bug in
which the start offset for the timequery is ignored or handled
inconsistently. See the added code comments starting with "BUG:".

(cherry picked from commit 5ae5fcd)
I believe this makes it clearer.

(cherry picked from commit 943aa52)

@nvartolomei nvartolomei requested review from Lazin and andrwng May 21, 2024 18:29

andrwng commented May 21, 2024

CI failure appears to be #14892


andrwng commented May 21, 2024

> CI failure appears to be #14892

Oh oops you already said this :)

@nvartolomei nvartolomei changed the title Backport timequery bugfixes to 23.3.x [v23.3.x] Backport timequery bugfixes to 23.3.x May 21, 2024
@nvartolomei nvartolomei added this to the v23.3.x-next milestone May 21, 2024
@nvartolomei nvartolomei added the kind/backport PRs targeting a stable branch label May 21, 2024
@nvartolomei nvartolomei modified the milestones: v23.3.x-next, v23.3.16 May 21, 2024
@dotnwat dotnwat merged commit c53eeae into v23.3.x May 21, 2024
16 of 19 checks passed
@dotnwat dotnwat deleted the nv/backport-v23.3.x-timequery branch May 21, 2024 20:23
Labels
area/redpanda kind/backport PRs targeting a stable branch