rm_stm: fix fence_pid_epoch cleanup #17880

Merged (3 commits) on Apr 16, 2024

bharathv
Contributor

fence_pid_epoch maps a producer id to its latest epoch. The current cleanup code does not check the epoch before cleaning up the pid state, so it can remove the state belonging to the latest epoch. Consider the following series of events:

[x, y] = pid[id=x, epoch=y]

[1, 0] begin_tx - fence_pid_epoch[1] = 0
[1, 1] begin_tx - fence_pid_epoch[1] = 1
evict [1, 0]
erase(fence_pid[1]) ==> removes (1)

This leaves the state inconsistent and stalls the transaction, because the partition cannot make progress until it verifies the epoch.
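
The sequence above can be reproduced with a minimal sketch. This is not the actual rm_stm code: `fence_table`, `begin_tx`, `evict_unchecked`, and `evict_checked` are hypothetical names used only for illustration, and the equality check is one assumed way to guard the erase (C++17).

```cpp
#include <cstdint>
#include <unordered_map>

// Hypothetical stand-ins for illustration; not the real rm_stm types.
using producer_id = std::int64_t;
using producer_epoch = std::int16_t;

struct fence_table {
    // Maps a producer id to the latest epoch seen for it.
    std::unordered_map<producer_id, producer_epoch> fence_pid_epoch;

    // Record the epoch on begin_tx, keeping only the newest one per id.
    void begin_tx(producer_id id, producer_epoch epoch) {
        auto [it, inserted] = fence_pid_epoch.try_emplace(id, epoch);
        if (!inserted && epoch > it->second) {
            it->second = epoch;
        }
    }

    // Cleanup without an epoch check: evicting [id, old_epoch] also
    // drops the entry that now belongs to a newer epoch.
    void evict_unchecked(producer_id id, producer_epoch /*epoch*/) {
        fence_pid_epoch.erase(id);
    }

    // Cleanup with an epoch check (assumed shape of the guard): erase
    // only when the stored epoch is the one being evicted.
    void evict_checked(producer_id id, producer_epoch epoch) {
        auto it = fence_pid_epoch.find(id);
        if (it != fence_pid_epoch.end() && it->second == epoch) {
            fence_pid_epoch.erase(it);
        }
    }
};
```

With the unchecked variant, `begin_tx(1, 0)`, `begin_tx(1, 1)`, then evicting `[1, 0]` leaves no entry for producer 1. With the checked variant the entry for epoch 1 survives the same sequence.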

This is a long-standing bug that was exposed by racy evictions.

Note: this code is going to be revamped soon. The plan is to add a self-contained unit test fixture that supports transactions end-to-end, which should give better test coverage.

Fixes #17827

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.3.x
  • v23.2.x

Release Notes

Bug Fixes

  • Fix a race between producer eviction and producer registration that results in an invalid transaction state.

@bharathv
Contributor Author

Failure unrelated: #16198

@piyushredpanda piyushredpanda merged commit a7e7201 into redpanda-data:dev Apr 16, 2024
16 of 19 checks passed
@vbotbuildovich
Collaborator

/backport v23.3.x

@bharathv bharathv deleted the tx_fence_epoch_cleanup branch April 16, 2024 15:01
@vbotbuildovich
Collaborator

Failed to create a backport PR to v23.3.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-17880-v23.3.x-190 remotes/upstream/v23.3.x
git cherry-pick -x 7229e5fd621b44ff453be84b9af3a3a2469c035b 996e138f0801ae2846fa77be57c75439f9653416 4a9420872a973d2893b280388c8766d93eded8b4

Workflow run logs.

@bharathv
Contributor Author

/backport v24.1.x

@vbotbuildovich
Collaborator

Branch name "v24.1.x" not found.

Workflow run logs.

piyushredpanda added a commit that referenced this pull request Apr 17, 2024
[backport] [v23.3.x] rm_stm: fix fence_pid_epoch cleanup #17880
Successfully merging this pull request may close these issues.

Transaction end up tx_errc::invalid_txn_state forever