raftstore: fix the deadlock between apply snapshot and create peer (#10910) #10918

Merged
merged 1 commit into from Sep 7, 2021

ti-srebot
Contributor

@ti-srebot ti-srebot commented Sep 7, 2021

cherry-pick #10910 to release-5.2
You can switch your code base to this Pull Request by using git-extras:

# In tikv repo:
git pr https://github.com/tikv/tikv/pull/10918

After applying modifications, you can push your changes to this PR via:

git push git@github.com:ti-srebot/tikv.git pr/10918:release-5.2-1e453231b7dd

Signed-off-by: nolouch nolouch@gmail.com

What problem does this PR solve?

Issue Number: close #10909

Problem Summary:

What is changed and how it works?

Introduced by: #10802
The deadlock occurs between applying a snapshot and creating a new peer. It is caused by the order in which store_meta and global_replication_state are locked. As the stack traces below show, one thread is blocked in on_ready_apply_snapshot and another in maybe_create_peer_internal, each waiting on a mutex.

#0  0x00007f2536da14ed in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f2536d9cdcb in _L_lock_883 () from /lib64/libpthread.so.0
#2  0x00007f2536d9cc98 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x000055a6ffc1505e in raftstore::store::fsm::peer::PeerFsmDelegate$LT$EK$C$ER$C$T$GT$::on_ready_apply_snapshot::hb8ba6dd7648d255a ()
#4  0x000055a6ffc104b6 in raftstore::store::fsm::peer::PeerFsmDelegate$LT$EK$C$ER$C$T$GT$::post_raft_ready_append::h1fcc32c1524e2da0 ()
#5  0x000055a6ffc278b2 in raftstore::store::fsm::store::RaftPoller$LT$EK$C$ER$C$T$GT$::handle_raft_ready_write::h01da8c1bdafe94bc ()
#6  0x000055a6ff94e036 in _$LT$raftstore..store..fsm..store..RaftPoller$LT$EK$C$ER$C$T$GT$$u20$as$u20$batch_system..batch..PollHandler$LT$raftstore..store..fsm..peer..PeerFsm$LT$EK$C$ER$GT$$C$raftstore..store..fsm..store..StoreFsm$LT$EK$GT$$GT$$GT$::end::ha140758f8328ea0d ()
#7  0x000055a6ff903e8e in batch_system::batch::Poller$LT$N$C$C$C$Handler$GT$::poll::h5c7b49aadf7c2a97 ()
#8  0x000055a6ffce20ae in std::sys_common::backtrace::__rust_begin_short_backtrace::h900083fbcc827393 ()
#9  0x000055a6ffcf8afd in std::panicking::try::do_call::hfbcd62ff2215291e ()
#10 0x000055a6ffe967ad in core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::hb9b7ea1369649798 ()
#11 0x000055a7003b6897 in call_once<(), dyn core::ops::function::FnOnce<(), Output=()>, alloc::alloc::Global> () at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/alloc/src/boxed.rs:1572
#12 call_once<(), alloc::boxed::Box<dyn core::ops::function::FnOnce<(), Output=()>, alloc::alloc::Global>, alloc::alloc::Global> () at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/alloc/src/boxed.rs:1572
#13 std::sys::unix::thread::Thread::new::thread_start::hb71b17934c5f5e68 () at library/std/src/sys/unix/thread.rs:91
#14 0x00007f2536d9add5 in start_thread () from /lib64/libpthread.so.0
#15 0x00007f25363a3ead in clone () from /lib64/libc.so.6

#0  0x00007f2536da14ed in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f2536d9cdcb in _L_lock_883 () from /lib64/libpthread.so.0
#2  0x00007f2536d9cc98 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x000055a6ffc455c2 in raftstore::store::fsm::store::StoreFsmDelegate$LT$EK$C$ER$C$T$GT$::maybe_create_peer_internal::hb4a38ea8f75478f8 ()
#4  0x000055a6ffc385ac in raftstore::store::fsm::store::StoreFsmDelegate$LT$EK$C$ER$C$T$GT$::maybe_create_peer::h8f442248e8d62dfc ()
#5  0x000055a6ffc37521 in raftstore::store::fsm::store::StoreFsmDelegate$LT$EK$C$ER$C$T$GT$::on_raft_message::h4d43b5dfc6a12f45 ()
#6  0x000055a6ffc3453a in raftstore::store::fsm::store::StoreFsmDelegate$LT$EK$C$ER$C$T$GT$::handle_msgs::h6fa8c55f9c0a2df8 ()
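
The lock-ordering hazard described above can be illustrated with a minimal Rust sketch. The types StoreMeta, GlobalReplicationState, and Shared, and the functions apply_snapshot, create_peer, and run_concurrently, are hypothetical stand-ins rather than TiKV code, and the sketch does not claim to reproduce the actual change in this PR. It only shows one general remedy, assuming both paths need both locks: acquire the two mutexes in the same fixed order everywhere.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Hypothetical stand-ins for the two pieces of shared state named in the
// description; the real TiKV types are far richer.
#[derive(Default)]
struct StoreMeta {
    applied_snapshots: u64,
    last_seen_peers: u64,
}

#[derive(Default)]
struct GlobalReplicationState {
    created_peers: u64,
}

#[derive(Default)]
struct Shared {
    store_meta: Mutex<StoreMeta>,
    global_replication_state: Mutex<GlobalReplicationState>,
}

// Both paths lock `store_meta` first and `global_replication_state` second.
// If one path used the opposite order, each thread could end up holding one
// mutex while waiting forever for the other.
fn apply_snapshot(shared: &Shared) {
    let mut meta = shared.store_meta.lock().unwrap();
    let state = shared.global_replication_state.lock().unwrap();
    meta.applied_snapshots += 1;
    meta.last_seen_peers = state.created_peers;
}

fn create_peer(shared: &Shared) {
    let _meta = shared.store_meta.lock().unwrap();
    let mut state = shared.global_replication_state.lock().unwrap();
    state.created_peers += 1;
}

// Runs both paths on separate threads and returns
// (applied snapshots, created peers) once both threads finish.
fn run_concurrently(iterations: u64) -> (u64, u64) {
    let shared = Arc::new(Shared::default());

    let snapshot_thread = {
        let shared = Arc::clone(&shared);
        thread::spawn(move || {
            for _ in 0..iterations {
                apply_snapshot(&shared);
            }
        })
    };
    let peer_thread = {
        let shared = Arc::clone(&shared);
        thread::spawn(move || {
            for _ in 0..iterations {
                create_peer(&shared);
            }
        })
    };

    snapshot_thread.join().unwrap();
    peer_thread.join().unwrap();

    let snapshots = shared.store_meta.lock().unwrap().applied_snapshots;
    let peers = shared.global_replication_state.lock().unwrap().created_peers;
    (snapshots, peers)
}

fn main() {
    let (snapshots, peers) = run_concurrently(10_000);
    println!("applied {} snapshots, created {} peers", snapshots, peers);
}
```

Swapping the two lock calls in only one of the functions would reintroduce the hazard: both threads could then block in pthread_mutex_lock, which is the state the two stack traces above capture.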

What's Changed:

Related changes

  • PR to update pingcap/docs / pingcap/docs-cn:
  • PR to update pingcap/tidb-ansible:
  • Need to cherry-pick to the release branch

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Side effects

  • Performance regression
    • Consumes more CPU
    • Consumes more MEM
  • Breaking backward compatibility

Release note

raftstore: fix the deadlock issue in raftstore

Signed-off-by: ti-srebot <ti-srebot@pingcap.com>
@ti-chi-bot
Member

ti-chi-bot commented Sep 7, 2021

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • BusyJay
  • zhouqiang-cl

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

@ti-srebot
Contributor Author

/run-all-tests

@ti-srebot
Contributor Author

@nolouch please accept the invitation so that you can push to the cherry-pick pull request. Comment with /cherry-pick-invite if the invitation has expired.
https://github.com/ti-srebot/tikv/invitations

@nolouch
Contributor

nolouch commented Sep 7, 2021

/test

@nolouch nolouch requested a review from BusyJay September 7, 2021 12:12
@ti-chi-bot ti-chi-bot added the status/LGT1 Status: PR - There is already 1 approval label Sep 7, 2021
@zhouqiang-cl zhouqiang-cl added the cherry-pick-approved Cherry pick PR approved by release team. label Sep 7, 2021
@zhouqiang-cl
Contributor

/test

@zhouqiang-cl
Contributor

/merge

@ti-chi-bot
Member

@zhouqiang-cl: It seems you want to merge this PR, I will help you trigger all the tests:

/run-all-tests

You only need to trigger /merge once, and if the CI test fails, you just re-trigger the test that failed and the bot will merge the PR for you after the CI passes.

If you have any questions about the PR merge process, please refer to pr process.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@ti-chi-bot
Member

@zhouqiang-cl: /merge in this pull request requires 2 approval(s).

In response to this:

/merge


@nolouch
Contributor

nolouch commented Sep 7, 2021

/test

@zhouqiang-cl
Contributor

/test

Contributor

@zhouqiang-cl zhouqiang-cl left a comment

LGTM

@ti-chi-bot ti-chi-bot added status/LGT2 Status: PR - There are already 2 approvals and removed status/LGT1 Status: PR - There is already 1 approval labels Sep 7, 2021
@zhouqiang-cl
Contributor

/merge

@ti-chi-bot
Member

@zhouqiang-cl: It seems you want to merge this PR, I will help you trigger all the tests:

/run-all-tests

You only need to trigger /merge once, and if the CI test fails, you just re-trigger the test that failed and the bot will merge the PR for you after the CI passes.

If you have any questions about the PR merge process, please refer to pr process.


@ti-chi-bot
Member

This pull request has been accepted and is ready to merge.

Commit hash: c3a583c

@ti-chi-bot ti-chi-bot added the status/can-merge Status: Can merge to base branch label Sep 7, 2021
Member

@NingLin-P NingLin-P left a comment

LGTM

@nolouch
Contributor

nolouch commented Sep 7, 2021

/test

@ti-chi-bot ti-chi-bot merged commit 40779dd into tikv:release-5.2 Sep 7, 2021
Labels
cherry-pick-approved Cherry pick PR approved by release team. release-note size/M status/can-merge Status: Can merge to base branch status/LGT2 Status: PR - There are already 2 approvals type/cherry-pick Type: PR - Cherry pick
6 participants