-
Notifications
You must be signed in to change notification settings - Fork 551
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cloud_storage: Add randomized tests back #8810
Conversation
int64_t gap_length = delta - last_delta; | ||
|
||
if (gap_length != 0) { | ||
// The 'gap_length' can be negative |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here - I think we should rewrite this function a bit if it allows the offset to be in the middle of the map so that consistency is checked in any case. It should compare offset
with the last offset in the map and:
- if
offset
is within the map range, check that delta is consistent - if not, check that it is non-decreasing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
d7cdfea
to
26ad183
Compare
base_offset, | ||
rbegin->first); | ||
if (base_offset <= rbegin->first) { | ||
if (!_last_offset2batch.contains(last_offset)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should add handling of two additional cases:
- the gap is before the beginning of the map - then we can't really say anything, because corresponding batches have already been prefix truncated from the translator, just exit.
- last offset is in the map, but base offset is different - that's inconsistency.
something like:
if (base_offset <= rbegin->first) {
if (last_offset > _last_offset2batch.begin()->first) {
// check that exactly the same the gap is already present in the map
auto it = _last_offset2batch.find(last_offset);
if (it == _last_offset2batch.end()) {
throw;
}
if (it->second.base_offset != base_offset) {
throw;
}
}
return;
}
@@ -144,7 +157,7 @@ void offset_translator_state::add_gap( | |||
bool offset_translator_state::add_absolute_delta( | |||
model::offset offset, int64_t delta) { | |||
auto prev = model::prev_offset(offset); | |||
|
|||
vlog(stlog.debug, "add_absolute_delta: {}, delta{}", offset, delta); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
vlog(stlog.debug, "add_absolute_delta: {}, delta{}", offset, delta); | |
vlog(stlog.debug, "add_absolute_delta: {}, delta {}", offset, delta); |
last_delta); | ||
|
||
if (gap_length < 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's also check that the gap is not too big. so the check is probably best expressed in terms of base_offset: last_offset < base_offset <= prev
@@ -219,6 +219,12 @@ remote_segment::offset_data_stream( | |||
.file_pos = 0, | |||
}); | |||
} | |||
vlog( | |||
_ctxlog.debug, | |||
"Offset data stream start reading at {}, log offset {}, delta {}", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking about an alternative solution - can we for a single partition read save somewhere next rp offset and pass it to the segment reader. The segment reader will then just skip every batch before this offset. This way overlaps shouldn't be a problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no good place to save it. The higher level reader (the one that scans remote_partition
) is not retained between fetch requests. It's recreated and uses its reader_config
to find cached remote_segment
reader.
The solution I was thinking about is to store offset_translator_state
inside the remote_segment_batch_reader
. Currently it's part of the partition_record_batch_reader_impl
. The partition reader passes its _ot_state
to segment reader on every read_some
call. Because of that the states of the reader and offset translator may diverge. To avoid this we can move ot-state to the segment reader. When the partition reader scans to the end of the segment it will fetch ot_state from the old reader and pass it to the next one. When the reader is saved for later use it will truncate it to avoid storing information about scanned record batches. But this can be implemented a bit later since it's a larger change. I'd rather use simple fix now to merge this before release.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will also require a deep copy of the ot_state to avoid interaction between different fetch requests.
_last_offset2batch.emplace( | ||
prev, | ||
batch_info{.base_offset = base_offset, .next_delta = delta}); | ||
return true; | ||
} else { | ||
// The record batch is already added, the delta computed using | ||
// previous calls to 'add_gap' and 'add_absolute_delta' should match | ||
// the parameter of this call. There is no need to change the state |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I have a suspicion that we can still end up with inconsistent data in the offset translator even if the segments themselves are good. The problem is that add_absolute_delta
adds a fake gap, not a real one and if we then go back and start adding real gaps, we'll see inconsistency.
Example (not sure if this is possible with the way remote_partition
chooses the order of segments, but probably it is):
- execute fetch from offset 300:
- start reading segment [200;400]: do
add_absolute_delta(200, d_200)
- add some gaps from the segment [200;400]
- segment ends, we request the next segment: it is overlapping segment [100;500]
- start reading segment [100;500]: do
add_absolute_delta(100, d_100)
, but the offset translator state doesn't change! Because offset 100 is before the OT map end. - start adding gaps from segment [100;500]. Uh-oh, the gaps are inconsistent, the OT state in the 100-200 range is bogus.
I see two solutions:
- (outlined in another comment) just save the next rp offset and skip batches in the new segment until we reach it
- make
add_absolute_delta
throw away all OT map entries greater thanoffset
- this way if we read some batches twice, that's okay because we will just rebuild OT state. This way we can even leave code ofadd_gap
as-is because it will only add batches to the end of the OT map.
7c1ca8e
to
bd0522f
Compare
The tests check two cases: - one overlapping batch - duplicate segments
_parent._cur_ot_state->get().add_gap( | ||
header.base_offset, header.last_offset()); | ||
_parent._cur_delta = model::offset_delta( | ||
_parent._cur_ot_state->get().last_delta()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But what if current offset is not the last offset that we've seen? Then it will diverge from the true delta.
I think it should just be unconditional += header.last_offset_delta + 1
because the segment reader always reads one segment in order without any duplicates.
bd0522f
to
9d0ba2f
Compare
auto end = _last_offset2batch.end(); | ||
if (it == end || it->second.next_delta < delta) { | ||
vassert( | ||
it != _last_offset2batch.end(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: the code is a bit strange in that we check it == end
in the if condition and then assert that it is not end. Probably should be just assert?
// The record batch is already added, the delta computed using | ||
// previous calls to 'add_gap' and 'add_absolute_delta' should match | ||
// the parameter of this call. There is no need to change the state | ||
// of the offset_translator_state. The invariant: any gap above the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: we do change it in the end :) so comment should be altered
last_base_offset, | ||
gap_length)); | ||
} | ||
|
||
_last_offset2batch.emplace( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about the case when gap_length=0? Previously we didn't modify the map in this case
9d0ba2f
to
a5f20c6
Compare
CI failure is unrelated |
a5f20c6
to
66d76b1
Compare
Make offset_translator_state process gaps out of order with the condition that the out of order data is the same as previously seen by the offsset_translator_state. This is needed to handle inconsistencies in the manifests, e.g. overlapping segments. Fix problem in remote_segment that may cuase the state of the reader to go out of sync with offset translator state.
66d76b1
to
e98fac4
Compare
Add the tests that check tiered storage read path with presence of inconsistencies (like overlaps between segments).
Modify
offset_translator_state
so it could tolerate seeing gaps out of order.Backports Required
UX Changes
Release Notes
Improvements