Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cloud_storage: Add randomized tests back #8810

Merged
merged 4 commits into from
Feb 17, 2023

Conversation

Lazin
Copy link
Contributor

@Lazin Lazin commented Feb 10, 2023

Add the tests that check tiered storage read path with presence of inconsistencies (like overlaps between segments).
Modify offset_translator_state so it could tolerate seeing gaps out of order.

Backports Required

  • none - not a bug fix
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v22.3.x
  • v22.2.x
  • v22.1.x

UX Changes

Release Notes

Improvements

  • Make tiered-storage robust in presence of inconsistencies in data in the object store

src/v/storage/offset_translator_state.cc Show resolved Hide resolved
int64_t gap_length = delta - last_delta;

if (gap_length != 0) {
// The 'gap_length' can be negative
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here - I think we should rewrite this function a bit if it allows the offset to be in the middle of the map so that consistency is checked in any case. It should compare offset with the last offset in the map and:

  • if offset is within the map range, check that delta is consistent
  • if not, check that it is non-decreasing

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

src/v/storage/tests/offset_translator_state_test.cc Outdated Show resolved Hide resolved
@Lazin Lazin force-pushed the pr/add-randomized-tests-back branch from d7cdfea to 26ad183 Compare February 14, 2023 20:02
@Lazin Lazin requested a review from ztlpn February 14, 2023 20:03
base_offset,
rbegin->first);
if (base_offset <= rbegin->first) {
if (!_last_offset2batch.contains(last_offset)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should add handling of two additional cases:

  • the gap is before the beginning of the map - then we can't really say anything, because corresponding batches have already been prefix truncated from the translator, just exit.
  • last offset is in the map, but base offset is different - that's inconsistency.

something like:

    if (base_offset <= rbegin->first) {
        if (last_offset > _last_offset2batch.begin()->first) {
            // check that exactly the same the gap is already present in the map
            auto it = _last_offset2batch.find(last_offset);
            if (it == _last_offset2batch.end()) {
                throw;
            }
            if (it->second.base_offset != base_offset) {
                throw;
            }
        }
        return;
    }

@@ -144,7 +157,7 @@ void offset_translator_state::add_gap(
bool offset_translator_state::add_absolute_delta(
model::offset offset, int64_t delta) {
auto prev = model::prev_offset(offset);

vlog(stlog.debug, "add_absolute_delta: {}, delta{}", offset, delta);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
vlog(stlog.debug, "add_absolute_delta: {}, delta{}", offset, delta);
vlog(stlog.debug, "add_absolute_delta: {}, delta {}", offset, delta);

src/v/storage/offset_translator_state.cc Outdated Show resolved Hide resolved
last_delta);

if (gap_length < 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's also check that the gap is not too big. so the check is probably best expressed in terms of base_offset: last_offset < base_offset <= prev

@@ -219,6 +219,12 @@ remote_segment::offset_data_stream(
.file_pos = 0,
});
}
vlog(
_ctxlog.debug,
"Offset data stream start reading at {}, log offset {}, delta {}",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about an alternative solution - can we for a single partition read save somewhere next rp offset and pass it to the segment reader. The segment reader will then just skip every batch before this offset. This way overlaps shouldn't be a problem.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no good place to save it. The higher level reader (the one that scans remote_partition) is not retained between fetch requests. It's recreated and uses its reader_config to find cached remote_segment reader.

The solution I was thinking about is to store offset_translator_state inside the remote_segment_batch_reader . Currently it's part of the partition_record_batch_reader_impl. The partition reader passes its _ot_state to segment reader on every read_some call. Because of that the states of the reader and offset translator may diverge. To avoid this we can move ot-state to the segment reader. When the partition reader scans to the end of the segment it will fetch ot_state from the old reader and pass it to the next one. When the reader is saved for later use it will truncate it to avoid storing information about scanned record batches. But this can be implemented a bit later since it's a larger change. I'd rather use simple fix now to merge this before release.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will also require a deep copy of the ot_state to avoid interaction between different fetch requests.

_last_offset2batch.emplace(
prev,
batch_info{.base_offset = base_offset, .next_delta = delta});
return true;
} else {
// The record batch is already added, the delta computed using
// previous calls to 'add_gap' and 'add_absolute_delta' should match
// the parameter of this call. There is no need to change the state
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I have a suspicion that we can still end up with inconsistent data in the offset translator even if the segments themselves are good. The problem is that add_absolute_delta adds a fake gap, not a real one and if we then go back and start adding real gaps, we'll see inconsistency.

Example (not sure if this is possible with the way remote_partition chooses the order of segments, but probably it is):

  • execute fetch from offset 300:
  • start reading segment [200;400]: do add_absolute_delta(200, d_200)
  • add some gaps from the segment [200;400]
  • segment ends, we request the next segment: it is overlapping segment [100;500]
  • start reading segment [100;500]: do add_absolute_delta(100, d_100), but the offset translator state doesn't change! Because offset 100 is before the OT map end.
  • start adding gaps from segment [100;500]. Uh-oh, the gaps are inconsistent, the OT state in the 100-200 range is bogus.

I see two solutions:

  1. (outlined in another comment) just save the next rp offset and skip batches in the new segment until we reach it
  2. make add_absolute_delta throw away all OT map entries greater than offset - this way if we read some batches twice, that's okay because we will just rebuild OT state. This way we can even leave code of add_gap as-is because it will only add batches to the end of the OT map.

@Lazin Lazin force-pushed the pr/add-randomized-tests-back branch 2 times, most recently from 7c1ca8e to bd0522f Compare February 15, 2023 16:38
_parent._cur_ot_state->get().add_gap(
header.base_offset, header.last_offset());
_parent._cur_delta = model::offset_delta(
_parent._cur_ot_state->get().last_delta());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But what if current offset is not the last offset that we've seen? Then it will diverge from the true delta.

I think it should just be unconditional += header.last_offset_delta + 1 because the segment reader always reads one segment in order without any duplicates.

@Lazin Lazin requested a review from ztlpn February 15, 2023 19:33
@Lazin Lazin force-pushed the pr/add-randomized-tests-back branch from bd0522f to 9d0ba2f Compare February 15, 2023 19:33
Comment on lines 240 to 243
auto end = _last_offset2batch.end();
if (it == end || it->second.next_delta < delta) {
vassert(
it != _last_offset2batch.end(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: the code is a bit strange in that we check it == end in the if condition and then assert that it is not end. Probably should be just assert?

// The record batch is already added, the delta computed using
// previous calls to 'add_gap' and 'add_absolute_delta' should match
// the parameter of this call. There is no need to change the state
// of the offset_translator_state. The invariant: any gap above the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we do change it in the end :) so comment should be altered

last_base_offset,
gap_length));
}

_last_offset2batch.emplace(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about the case when gap_length=0? Previously we didn't modify the map in this case

@Lazin Lazin force-pushed the pr/add-randomized-tests-back branch from 9d0ba2f to a5f20c6 Compare February 16, 2023 14:27
@Lazin Lazin requested a review from ztlpn February 16, 2023 14:27
@Lazin
Copy link
Contributor Author

Lazin commented Feb 16, 2023

CI failure is unrelated

@Lazin Lazin force-pushed the pr/add-randomized-tests-back branch from a5f20c6 to 66d76b1 Compare February 17, 2023 13:19
ztlpn
ztlpn previously approved these changes Feb 17, 2023
Make offset_translator_state process gaps out of order with the
condition that the out of order data is the same as previously seen by
the offsset_translator_state. This is needed to handle inconsistencies
in the manifests, e.g. overlapping segments.

Fix problem in remote_segment that may cuase the state of the reader to
go out of sync with offset translator state.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants