Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Kafka: Add support for the delete-records API #10061

Merged
merged 12 commits into from
Jun 22, 2023

Conversation

graphcareful
Copy link
Contributor

@graphcareful graphcareful commented Apr 13, 2023

This PR adds the delete-records feature to redpanda. This feature allows a kafka client to truncate data from a topic/partition at a given offset. To support this in redpanda two non-trivial tasks were implemented:

  1. Addition of new stm (and corresponding special batch type) to process prefix truncation events.
  2. Addition of new concept (visible_start_offset) which is a delta into the first segment of the log representing the first offset a fetch request could read from.

Note, as of this branch delete-records requests for cloud enabled topics will be rejected. Work on supporting this is slated to be completed in a follow up PR, which depends on a currently still in review PR #9994

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.1.x
  • v22.3.x
  • v22.2.x

Release Notes

Features

  • Adds support for the delete-records API

@graphcareful
Copy link
Contributor Author

Added more ducktape tests

@emaxerrno
Copy link
Contributor

wondering if franz-go has some delete records test, or if the librdkafka test suite should be enabled for this or some other external-API project that exercises this path for edge case detection.

Copy link
Contributor

@jcsp jcsp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like lots of tests are failing, perhaps related to the timeout change in the PR.

tests/rptest/tests/prefix_truncate_test.py Outdated Show resolved Hide resolved
tests/rptest/tests/prefix_truncate_test.py Outdated Show resolved Hide resolved
tests/rptest/tests/prefix_truncate_test.py Outdated Show resolved Hide resolved
tests/rptest/util.py Outdated Show resolved Hide resolved
src/v/cluster/partition.h Outdated Show resolved Hide resolved
src/v/features/feature_table.h Outdated Show resolved Hide resolved
src/v/kafka/server/handlers/delete_records.cc Outdated Show resolved Hide resolved
tests/rptest/tests/prefix_truncate_test.py Outdated Show resolved Hide resolved
src/v/raft/log_eviction_stm.cc Outdated Show resolved Hide resolved
@graphcareful
Copy link
Contributor Author

graphcareful commented Apr 17, 2023

Changes in force-push

  • Modify prefix_truncate name schema to delete_records
  • Fixed bug where truncating with -1 (high watermark) may cause a crash
  • Fixed bug that made offset translation not work correctly due to not excluding the new special batch type in translation accounting
  • Fixed bug that made delete_records request return early due to off-by-one error in firing new min offset event notification. Returned low_watermark would appear lower then expected.
  • Fixed bug that would made subsequent request to prefix truncate within the same segment fail due to not updating conditionals in consensus::write_snapshot
  • Fixed bug that would cause a crash when writing a snapshot at the first segment. consensus::write_snapshot expects get_term to always work, which was previously true since write_snapshot would never be called at a last_included_index value of 0. Now this is not the case since write_snapshot can be called just to increment the new log start delta.
  • Fixed bug where log_eviction_stm would start processing events from start_offset() instead of visible_start_offset(). No issues arose because events would be ignored, after correctly determining that the older events aren't applicable to a log with a start_offset + delta ahead of the requested.
  • Fixed bug where delete-records at the high watermark - 1 offset (truncate all data) failed due to off-by-one error
  • Fixed bug where log_eviction_stm was not truncating the underlying log at segment end offsets but rather segment start offsets. (only for events arisen from delete-records requests)
  • Fixed bug that caused some CloudStorage tests in CI to fail, random default timeout was incorrectly modified
  • Modify KCL::delete_records for better error condition handling
  • Much more ducktape tests

@graphcareful
Copy link
Contributor Author

graphcareful commented Apr 17, 2023

Leaving a comment here about the proposed testing plan. All testing will be done in Ducktape, desired scenarios to test are:

  • Test moving delta of current segment
  • Test moving delta of current segment multiple times in a row
  • Test deletion of data at segment boundary
  • Test deletion of data within a segment, between boundaries
  • Test truncation at offsets <= 0 and >= high watermark all result in out of bounds errors
  • Test correct errors returned on topics/partitions that don't exist
  • Test truncation at high watermark (-1) deletes all data
  • Test an edge case situations where begin, end are very close like [38, 40)
  • Test all of the above parameterized with cloud storage enabled topics
  • Test integration with transactions
  • Test scenarios where node failures are introduced (list scenarios - in-progress)

@graphcareful
Copy link
Contributor Author

Test failures look like:

FAIL: <NodeCrash docker-rp-17: ERROR 2023-04-18 01:52:13,290 [shard 1] assert - Assert failure: (/var/lib/buildkite-agent/builds/buildkite-amd64-builders-i-0ffc8dc9474419720-1/redpanda/redpanda/src/v/raft/consensus.cc:2273) 'config.has_value()' Configuration for offset 7919 must be available in configuration manager: {configurations: { offset: {9049}, idx: {1},cfg: {current: {voters: {{id: {3}, revision: {25}}, {id: {1}, revision: {25}}, {id: {2}, revision: {25}}}, learners: {}}, old:{nullopt}, revision: 25, update: {nullopt}, version: 5}} } 
>

Note this must indicate a bug in the changes I've made to log_eviction_stm, when handling GC events from storage.

@graphcareful graphcareful force-pushed the delete-records branch 3 times, most recently from 1626948 to ebb4b29 Compare April 18, 2023 21:57
@graphcareful
Copy link
Contributor Author

graphcareful commented Apr 18, 2023

Changes in force-push

  • Rebase against dev to resolve conflicts in feature_table (another change added a new feature)
  • Modified log_eviction housekeeping loop to periodically sleep when there is no work to do, or in the case the work item was re-enqueued due to the offset being greater then the max collectible.
  • Fixed a bug that was resulting in test failures incorrect variable was being referred to in log_eviction_stm::do_process_log_eviction
  • Fixed a newly written flaky ducktape test by repeating if NOT_LEADER is observed
  • Made some changes around previous off-by-one fixes to remove confusing logic around subtracting an extra 1 here and there.

Copy link
Contributor

@VladLazar VladLazar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's still some things I don't understand, so I'll need to come back to it.

src/v/raft/log_eviction_stm.h Outdated Show resolved Hide resolved
src/v/storage/disk_log_impl.cc Outdated Show resolved Hide resolved
src/v/raft/log_eviction_stm.cc Outdated Show resolved Hide resolved
@graphcareful
Copy link
Contributor Author

Changes in force-push

  • Fix for CI issues which were causing some retention based tests to fail. Issue was writing a snapshot with offset::min as the value of log_start_delta
  • Fixed issue where truncation at bounds wasn't entirely respected, new ducktape test written to test for this case. (log of one record, truncate at 0 & 1)
  • Expanded a ducktape test to truncates at the high watermark, only truncation at 1 before the high watermark was tested.

@graphcareful
Copy link
Contributor Author

Changes in force-push

  • Reduce eviction housekeeping interval from 1s to 200ms. Fixed a failing raft unit test that had a 2s timeout on checking a segments eviction. 1s also seemed a bit too long, open to discussion on this.
  • Removed a useless tracelog from log_eviction_stm

src/v/raft/log_eviction_stm.cc Outdated Show resolved Hide resolved
src/v/raft/consensus.cc Outdated Show resolved Hide resolved
src/v/raft/consensus.cc Outdated Show resolved Hide resolved
src/v/raft/log_eviction_stm.h Outdated Show resolved Hide resolved
src/v/raft/log_eviction_stm.cc Outdated Show resolved Hide resolved
src/v/raft/log_eviction_stm.cc Outdated Show resolved Hide resolved
@mmaslankaprv mmaslankaprv self-requested a review June 21, 2023 17:39
mmaslankaprv
mmaslankaprv previously approved these changes Jun 21, 2023
mmaslankaprv
mmaslankaprv previously approved these changes Jun 21, 2023
- This commit modifies the log_eviction_stm to be a proper stm that will
scan the underlying log for new special batches of type
'prefix_truncate'.

- These new special batches are indicators to modify the start offset.
Since this is replicated to peers, eventually all peers will modify
their respective start offsets as well.

- There is a new background processing fiber added to this class as well
that attempts to write a raft snapshot , reaping as much data as
possible as close to the current start offset.

- The class is a subclass of persisted_stm and persists to the kvstore
its last applied offset and the current start offset. In the case of
failures these pieces of information can be used to startup from the
last processed offset.
- The kafka layer presents the prefix_truncate abstraction, it performs
bounds checking with the given kafka offset and translates this to a rp
model::offset before invoking cluster::prefix_truncate , which will boil
down to replicating the prefix truncate special batch.
- This stm has a conditional in its last_stable_offset() method that
returns an invalid offset in the case it hasn't completed bootstrapping.

- The issue is that this bootstrap phase isn't considered finished after
bootstrapping from apply_snapshot(). This would cause other stms to
pause thinking the rm_stm had work to do at an offset at 0, causing
that other stm to timeout and fail processing of said event.

- Solution is simple, to set `_boostrap_committed_offset`within the
`apply_snapshot()` method

- Fixes: redpanda-data#11131

- Fixes: redpanda-data#11130
- Adds support for the DeleteRecords API
- Now that the start offset is not an independent concept across all
nodes its desired to ensure the leader node has cought up with previous
leader with respect to processing of prefix truncate special batches
before returning the start offset.

- This avoids returning a stale start_offset in the case leadership has
recently changed and the log_eviction_stm hasn't yet caught up.
@graphcareful
Copy link
Contributor Author

/ci-repeat 1
debug
skip-units
dt-repeat=20
tests/rptest/tests/rack_aware_replica_placement_test.py::RackAwarePlacementTest.test_replica_placement

@graphcareful
Copy link
Contributor Author

/ci-repeat 1
debug
skip-units
dt-repeat=20
tests/rptest/tests/memory_stress_test.py::MemoryStressTest.test_fetch_with_many_partitions

@graphcareful
Copy link
Contributor Author

/ci-repeat 1
debug
skip-units
dt-repeat=200
tests/rptest/tests/memory_stress_test.py::MemoryStressTest.test_fetch_with_many_partitions

@piyushredpanda piyushredpanda merged commit 01e5279 into redpanda-data:dev Jun 22, 2023
30 of 31 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

10 participants