
storage: ensure monotonic stable offset updates #17039

Merged

Conversation

nvartolomei
Contributor

This change is similar to
#1826 but for stable offsets this time.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.3.x
  • v23.2.x

Release Notes

  • none

@nvartolomei
Contributor Author

/dt

nvartolomei added a commit to nvartolomei/redpanda that referenced this pull request Mar 14, 2024
Dirty entries are ones that aren't eligible for eviction because they might
not yet have been written durably to disk. Evicting them would lead to
"disappeared" data on a cache miss.

We make the implicit assumption that readers that want to read beyond the
storage stable offset will always read through the batch cache. Otherwise,
read-after-write semantics won't hold, as the data buffered in the appender's
chunk cache isn't exposed.

---

Batch cache entries are `batch_cache::range` data structures. A range is
a 32KiB memory arena with some ancillary data structures. It is associated
with a `batch_cache_index`, and a `batch_cache_index` has a 1-to-1 mapping
with a segment. When a batch is put in the cache, the `batch_cache_index`
appends it to an existing range or allocates a new one and inserts it into
the `batch_cache`.

We propose to track dirty batches at `batch_cache::range` granularity
rather than per batch.

Using the insight that dirty batches are inserted with monotonically
increasing offsets, and the fact that we always mark a prefix of offsets as
clean (clear the dirty flag), it is enough to track only the maximum dirty
batch offset in a range and advance it when new dirty batches are inserted.
When marking offsets as clean (on flush), we check whether
`clean_offset >= max_dirty_offset` and, if so, mark the range as clean.
Otherwise, nothing needs to be done.
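
For illustration, here is a minimal sketch of the per-range bookkeeping described above, using made-up names (`range_sketch`, `add_dirty_batch`, `mark_clean`) rather than the actual `batch_cache::range` interface:

```cpp
// Sketch only: per-range dirty bookkeeping. Names and types are simplified
// placeholders, not the real batch_cache::range API.
#include <algorithm>
#include <cstdint>
#include <optional>

struct range_sketch {
    // Highest offset of any dirty batch currently held by this range.
    std::optional<int64_t> max_dirty_offset;

    // Dirty batches are appended with monotonically increasing offsets, so
    // advancing the maximum is the only bookkeeping needed on insert.
    void add_dirty_batch(int64_t last_offset) {
        max_dirty_offset = max_dirty_offset
                             ? std::max(*max_dirty_offset, last_offset)
                             : last_offset;
    }

    // A flush marks a prefix of offsets clean: if the prefix covers the
    // maximum dirty offset, the whole range becomes clean; otherwise the
    // range still holds dirty data and nothing changes.
    void mark_clean(int64_t clean_offset) {
        if (max_dirty_offset && clean_offset >= *max_dirty_offset) {
            max_dirty_offset.reset();
        }
    }

    bool dirty() const { return max_dirty_offset.has_value(); }
};
```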

A `batch_cache_index` can be associated with multiple ranges which
potentially need to be marked as clean. As an optimization, we propose to
track which ranges might contain dirty data and, when a request to mark a
prefix of offsets clean comes in, to iterate only through this subset of
ranges. We use this information together with the already existing index
that maps offsets to ranges, `absl::btree_map<model::offset,
batch_cache::entry>`, to iterate only through the entries and ranges that
contain the dirty offsets we are interested in. The complexity of this
iteration is O(N), where N is the number of dirty batches that must be
marked as clean.
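
Again purely as a sketch (with `std::map` standing in for the `absl::btree_map`, and invented names), the prefix-clean walk over the offset index could look roughly like this:

```cpp
// Sketch only: mark a prefix of offsets clean by walking just the dirty
// window of the offset index. entry_sketch condenses the per-range logic
// from the previous sketch; none of these names are the real interfaces.
#include <cstdint>
#include <map>

struct entry_sketch {
    int64_t max_dirty_offset = -1; // -1 means "no dirty batches in this range"

    void mark_clean(int64_t up_to) {
        if (max_dirty_offset >= 0 && up_to >= max_dirty_offset) {
            max_dirty_offset = -1;
        }
    }
};

// Iterate only the entries that can contain dirty offsets: start at the
// smallest tracked dirty offset and stop once past `up_to`, so the cost is
// proportional to the number of dirty batches being marked clean.
void mark_prefix_clean(
  std::map<int64_t, entry_sketch>& index, int64_t dirty_min, int64_t up_to) {
    for (auto it = index.lower_bound(dirty_min);
         it != index.end() && it->first <= up_to;
         ++it) {
        it->second.mark_clean(up_to);
    }
}
```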

To this end, we introduce a new data structure called
`batch_cache_dirty_tracker` to keep track of the minimum and maximum dirty
offsets in an index, which we need for this optimization.

Technically, we could achieve a similar performance optimization by doing a
binary search over the index using the desired mark-clean `up_to` offset as a
needle, iterating backwards through ranges, marking them as clean, and
stopping once we hit a clean range. However, the new data structure has some
additional benefits.

It facilitates some invariant checks:
  - Dirty offsets are monotonic: `vassert(_max <= range.first && _max <
    range.second, ...)`
  - Truncation is requested only with a clean index:
    `vassert(_dirty_tracker.clean(), ...)`. See the next section for why
    this is important.
  - At the end of batch cache / batch cache index lifetime there are no
    dirty batches left in the index. This already caught 2 bugs in the
    existing code:
      1) redpanda-data#17032
      2) redpanda-data#17039

Also, it improves the debugging experience as we include the state of
the dirty tracker in the `batch_cache_index` ostream insertion operator.
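
To make the invariants above concrete, here is a hedged sketch of such a tracker; `_min`, `_max`, and `clean()` mirror the description, but the `mark_dirty`/`mark_clean` signatures are assumptions made for this sketch, not the actual `batch_cache_dirty_tracker` interface:

```cpp
// Sketch only: min/max dirty-offset tracking for an index, with the
// invariants above expressed as plain asserts.
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>

class dirty_tracker_sketch {
public:
    // Record a dirty batch spanning the offset range [first, last].
    void mark_dirty(int64_t first, int64_t last) {
        // Invariant: dirty offsets are inserted monotonically (mirrors the
        // vassert quoted above).
        assert(_max <= first && _max < last);
        _min = std::min(_min, first);
        _max = std::max(_max, last);
    }

    // A flush marks a prefix of offsets clean; the tracked window collapses
    // only once it is fully covered.
    void mark_clean(int64_t up_to) {
        if (up_to >= _max) {
            _min = std::numeric_limits<int64_t>::max();
            _max = -1;
        }
    }

    // Used for the other invariants: truncation is requested only on a clean
    // index, and no dirty batches are left behind at end of lifetime.
    bool clean() const { return _max < 0; }

    int64_t min_dirty() const { return _min; }
    int64_t max_dirty() const { return _max; }

private:
    int64_t _min = std::numeric_limits<int64_t>::max();
    int64_t _max = -1;
};
```

The truncation path described in the next section would then simply assert `clean()` before evicting whole ranges.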

---

Truncation needs special attention too. When truncation is requested, we
must evict batches with offsets higher than the truncation point but still
retain dirty batches with smaller offsets. Since we can't have holes in the
ranges and can evict only at range granularity, we require a flush prior to
truncation. The flush is already present, so no changes are required.
Contributor

@andrwng left a comment


overall lgtm, just some clarifying questions

@@ -340,7 +340,7 @@ ss::future<> segment::do_flush() {
         // outstanding flushes once the one executed later in terms of offset
         // finishes we guarantee that all previous flushes finished.
         _tracker.committed_offset = std::max(o, _tracker.committed_offset);
-        _tracker.stable_offset = _tracker.committed_offset;
+        _tracker.stable_offset = std::max(o, _tracker.stable_offset);

Contributor

Trying to connect the dots here: should `o` be exactly the same as what was advanced to during the appender flush?


Contributor

Also, did you have a specific sequence in mind that led to this being non-monotonic previously? That would really help in understanding this change (this change looks much more intuitive, but I just want to know what went wrong before).


Contributor Author

> Trying to connect the dots here: should `o` be exactly the same as what was advanced to during the appender flush?

Yes.

> Also, did you have a specific sequence in mind that led to this being non-monotonic previously? That would really help in understanding this change (this change looks much more intuitive, but I just want to know what went wrong before).

The updated test fails without this change.

1. A issues a flush and stores the dirty offset (N) in `o`.
2. B does 10 additional writes, issues a flush, stores the dirty offset (N+10) in `o`, actually flushes, and updates `stable_offset` to N+10.
3. A finally does the actual `appender::flush` and moves `stable_offset` back to N, overwriting the value set by B.
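
For illustration, a minimal standalone sketch of that interleaving (with invented names like `offset_tracker` and `advance_*`, not the actual Redpanda code), showing why taking `std::max` with the captured offset keeps the stable offset monotonic while a plain overwrite lets the stale flush move it backwards:

```cpp
// Sketch only: two concurrent flushes, A and B, each capture the dirty
// offset up front and apply it once their (simulated) appender flush
// completes. Names are hypothetical.
#include <algorithm>
#include <cassert>
#include <cstdint>

struct offset_tracker {
    int64_t stable_offset{-1};

    // Pre-fix behavior: blindly overwrite with the captured offset.
    void advance_overwrite(int64_t o) { stable_offset = o; }

    // Post-fix behavior: only ever move forward.
    void advance_monotonic(int64_t o) {
        stable_offset = std::max(o, stable_offset);
    }
};

int main() {
    constexpr int64_t n = 100;

    // Interleaving from the discussion: A captures o = N and suspends in
    // appender::flush(); B appends more, captures o = N + 10, finishes its
    // flush first; A resumes and applies its stale o = N.
    offset_tracker before;
    before.advance_overwrite(n + 10);  // B finishes first
    before.advance_overwrite(n);       // A finishes last with a stale offset
    assert(before.stable_offset == n); // stable offset moved backwards (bug)

    offset_tracker after;
    after.advance_monotonic(n + 10);
    after.advance_monotonic(n);            // stale update becomes a no-op
    assert(after.stable_offset == n + 10); // monotonicity preserved
    return 0;
}
```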

@piyushredpanda merged commit 1c4ca8c into redpanda-data:dev Mar 14, 2024
17 checks passed
@vbotbuildovich
Collaborator

/backport v23.3.x

@vbotbuildovich
Collaborator

/backport v23.2.x

@vbotbuildovich
Collaborator

Failed to create a backport PR to v23.2.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-17039-v23.2.x-647 remotes/upstream/v23.2.x
git cherry-pick -x b9e93efc7d0c2e81064abfd7cb847d44a2ac5910

Workflow run logs.

nvartolomei added a commit to nvartolomei/redpanda that referenced this pull request Mar 14, 2024
nvartolomei added a commit to nvartolomei/redpanda that referenced this pull request Mar 15, 2024
nvartolomei added a commit to nvartolomei/redpanda that referenced this pull request Mar 15, 2024
nvartolomei added a commit to nvartolomei/redpanda that referenced this pull request Mar 15, 2024
nvartolomei added a commit to nvartolomei/redpanda that referenced this pull request Mar 15, 2024