raftstore: gc abnormal snapshots and destroy peer if failed to apply snapshots. #16992
…a snapshot. Signed-off-by: lucasliang <nkcs_lykx@hotmail.com>
Skipping CI for Draft Pull Request.
Signed-off-by: lucasliang <nkcs_lykx@hotmail.com>
I have 2 questions about this PR:
By the way, regarding point 2, I agree with what you mentioned before. But for safety, this PR keeps the current implementation.
I'll have a try for the
I don't think we can directly mark it as
There are two ways to fix the panic:
Why is this extra RPC needed? On the leader side, the peer's state is switched back to normal once the snapshot has been sent. On the follower side, when applying the snapshot fails, it is also feasible to restore the raft state to what it was before this snapshot. So, if I understand correctly, the leader should start another snapshot without any extra operation?
Do you mean persisting the previous state so it can be restored even after restarting TiKV? That's doable (without introducing a new RPC), but it does add extra complexity to raftstore (and we'll need to review every code path related to snapshot handling).
Signed-off-by: lucasliang <nkcs_lykx@hotmail.com>
Signed-off-by: lucasliang <nkcs_lykx@hotmail.com>
This PR needs additional support on the PD side for the implementation to work as intended. On hold until tikv/pd#8266 is merged.
Signed-off-by: lucasliang <nkcs_lykx@hotmail.com>
Signed-off-by: lucasliang <nkcs_lykx@hotmail.com>
Signed-off-by: lucasliang <nkcs_lykx@hotmail.com>
Signed-off-by: lucasliang <nkcs_lykx@hotmail.com>
Rest LGTM
Signed-off-by: lucasliang <nkcs_lykx@hotmail.com>
rest LGTM
Signed-off-by: lucasliang <nkcs_lykx@hotmail.com>
Signed-off-by: lucasliang <nkcs_lykx@hotmail.com>
LGTM
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: glorv, overvenus
What is changed and how it works?
Issue Number: Close #15292
What's Changed:
Previously, TiKV would panic when applying a snapshot failed under abnormal conditions such as I/O errors or other unexpected issues; handling this scenario had been left as a pending task.
This pull request resolves the issue by adding a `tombstone: bool` flag to `SnapshotApplied`, indicating whether the failure was caused by an abnormal snapshot.
Additionally, the region_id of the abnormal peer is recorded in the newly added `StoreMeta::damaged_regions`, which is used to remove the associated peer.
Finally, the peer is destroyed later through the StoreHeartbeat to PD, which creates a `remove peer` operator to remove the peer safely.
Related changes
pingcap/docs / pingcap/docs-cn:
Check List
Tests
Side effects
Release note
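As an illustration of the flow described under "What's Changed" above, here is a minimal, self-contained Rust sketch. It is not the code from this PR: every type and function in it (`SnapshotApplied` as a plain struct, `StoreHeartbeat`, `Operator`, `on_snapshot_applied`, `build_store_heartbeat`, `pd_handle_heartbeat`) is a simplified, hypothetical stand-in. Only the names `tombstone`, `SnapshotApplied`, `StoreMeta::damaged_regions`, and the "remove peer" operator come from the description.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for the message reported after a snapshot apply attempt.
#[derive(Debug, Clone, Copy)]
struct SnapshotApplied {
    region_id: u64,
    /// True when applying failed because the snapshot was abnormal.
    tombstone: bool,
}

/// Hypothetical, heavily reduced store metadata.
#[derive(Debug, Default)]
struct StoreMeta {
    /// Regions whose peer on this store should be removed.
    damaged_regions: HashSet<u64>,
}

/// Hypothetical heartbeat payload sent from the store to PD.
#[derive(Debug, Default)]
struct StoreHeartbeat {
    damaged_region_ids: Vec<u64>,
}

/// Hypothetical scheduling operator created on the PD side.
#[derive(Debug, PartialEq, Eq)]
enum Operator {
    RemovePeer { region_id: u64 },
}

/// Record the region instead of panicking when the apply result is a tombstone.
fn on_snapshot_applied(meta: &mut StoreMeta, res: SnapshotApplied) {
    if res.tombstone {
        meta.damaged_regions.insert(res.region_id);
    }
}

/// Report the damaged regions in the next store heartbeat (sorted for determinism).
fn build_store_heartbeat(meta: &StoreMeta) -> StoreHeartbeat {
    let mut ids: Vec<u64> = meta.damaged_regions.iter().copied().collect();
    ids.sort_unstable();
    StoreHeartbeat { damaged_region_ids: ids }
}

/// PD side (assumed behavior): one remove-peer operator per reported region.
fn pd_handle_heartbeat(hb: &StoreHeartbeat) -> Vec<Operator> {
    hb.damaged_region_ids
        .iter()
        .map(|&region_id| Operator::RemovePeer { region_id })
        .collect()
}

fn main() {
    let mut meta = StoreMeta::default();
    on_snapshot_applied(&mut meta, SnapshotApplied { region_id: 7, tombstone: true });
    on_snapshot_applied(&mut meta, SnapshotApplied { region_id: 9, tombstone: false });

    let hb = build_store_heartbeat(&meta);
    println!("{:?}", pd_handle_heartbeat(&hb));
}
```

The point of the sketch is the division of labor: the store only records and reports the damaged region, and the actual removal is driven by PD through its usual operator mechanism, which is why the PR depends on the PD-side change.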