
storage: move ghost batch generation into log reader #16721

Merged 2 commits on Feb 28, 2024

Conversation

andrwng
Contributor

@andrwng andrwng commented Feb 27, 2024

Ghost batches are batches that contain no data; their purpose is to
ensure:

  • the caller (recovery_stm) sees a contiguous set of offsets
  • the terms of the returned batches match the original terms
    provided at append time, regardless of whether compactions have
    happened

These requirements are currently honored by the Raft layer at recovery
time, which generates batches in between the batches read from storage.
Since the storage layer owns term metadata and compactions, generating
these ghost batches in the Raft layer makes it trickier to evolve the
underlying storage implementation: the guarantees would need to be
tested and validated at multiple layers.

So, to support development and testing of future log implementations,
this patch pulls out ghost batch generation into the storage layer as an
option to the log_reader.
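To make the offset-contiguity requirement concrete, here is a minimal, self-contained C++ sketch of gap filling. This is not Redpanda's actual log_reader code: the `batch` struct, `fill_ghost_batches`, and the choice of borrowing the following batch's term are all illustrative simplifications.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Simplified model of a batch: just its offset range and term.
struct batch {
    int64_t base_offset;
    int64_t last_offset;
    int64_t term;
    bool ghost; // true if generated to fill a gap rather than read from disk
};

// Return the batches read from the log, with empty "ghost" batches inserted
// so the caller sees a contiguous run of offsets starting at `start`.
// Simplification: each ghost batch borrows the term of the batch after the
// gap; the real implementation must report the term the missing offsets
// were originally appended at.
std::vector<batch>
fill_ghost_batches(int64_t start, const std::vector<batch>& read) {
    std::vector<batch> out;
    int64_t next = start;
    for (const auto& b : read) {
        if (b.base_offset > next) {
            // Gap detected: cover [next, b.base_offset - 1] with one ghost.
            out.push_back({next, b.base_offset - 1, b.term, true});
        }
        out.push_back(b);
        next = b.last_offset + 1;
    }
    return out;
}
```

The point of the patch is that this loop runs inside the storage reader as batches stream out, rather than as a second pass in the Raft layer after the read completes.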

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.3.x
  • v23.2.x
  • v23.1.x

Release Notes

  • none

I have an upcoming change that will iterate through a randomized
complete list (i.e. all numbers between A and B, in random order). This
adds a method to generate such a list.
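A minimal sketch of such a method, using std::iota to build the complete range and std::shuffle to randomize it (the name `random_complete_range` is hypothetical; the helper added by the commit may differ):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

// Generate every integer in [a, b], inclusive, in random order.
std::vector<int64_t> random_complete_range(int64_t a, int64_t b) {
    std::vector<int64_t> v(static_cast<size_t>(b - a + 1));
    std::iota(v.begin(), v.end(), a); // fill with a, a+1, ..., b
    std::shuffle(v.begin(), v.end(), std::mt19937{std::random_device{}()});
    return v;
}
```

Because the list is complete, sorting the result recovers exactly the range [a, b], which is what makes it useful for exhaustive-but-order-randomized testing.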
@andrwng
Contributor Author

andrwng commented Feb 27, 2024

/ci-repeat

@andrwng andrwng marked this pull request as ready for review February 27, 2024 16:03
@vbotbuildovich
Collaborator

new failures in https://buildkite.com/redpanda/redpanda/builds/45411#018deb99-4dff-4dcf-8fb3-1e3c6ba0cf90:

"rptest.tests.topic_recovery_test.TopicRecoveryTest.test_missing_partition.cloud_storage_type=CloudStorageType.ABS"


@andrwng
Contributor Author

andrwng commented Feb 28, 2024

CI failure: #16764

Member

@dotnwat dotnwat left a comment


this looks good, but there is something i don't understand yet.

the motivation here seems to be that storage contains all the term info, compaction info, etc., so that makes it better suited to deal with ghost batches as the storage implementation evolves. additionally, it seems like there is a simplification and potential performance benefit, since raft doesn't need to do the read and then go back and fill in the gaps as a second pass.

however, i don't see why storage needs to do this from the point of view of making it easier to evolve the implementation, because currently it seems like raft depends on information in batches that will be the same both before and after the storage evolution project.

maybe there is a specific change in the evolution project that fills in this knowledge gap (yep)?

@andrwng
Contributor Author

andrwng commented Feb 28, 2024

> this looks good, but there is something i don't understand yet.
>
> the motivation here seems to be that storage contains all the term info, compaction info, etc., so that makes it better suited to deal with ghost batches as the storage implementation evolves. additionally, it seems like there is a simplification and potential performance benefit, since raft doesn't need to do the read and then go back and fill in the gaps as a second pass.
>
> however, i don't see why storage needs to do this from the point of view of making it easier to evolve the implementation, because currently it seems like raft depends on information in batches that will be the same both before and after the storage evolution project.
>
> maybe there is a specific change in the evolution project that fills in this knowledge gap (yep)?

You're right that this isn't needed for the goal of an MVCC-versioned log. The aspect of the storage format evolution this helps with is implementing compaction that can remove entire terms' worth of batches -- once we get there, the owner of term metadata should be the one to create ghost batches.

@dotnwat dotnwat merged commit 5b8dc1e into redpanda-data:dev Feb 28, 2024
16 checks passed
4 participants