Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix raft appendlog deadlock #3141

Merged
merged 2 commits into from
Oct 20, 2021

Conversation

kikimo
Copy link
Contributor

@kikimo kikimo commented Oct 19, 2021

This fix #3140

In bool RaftPart::checkAppendLogResult(AppendLogResult res) we can see that it release logsLock_ before setting replicatingLogs_ to false, during this period, log can be append to logs_ until it overflows and set bufferOverFlow_ to true:

bool RaftPart::checkAppendLogResult(AppendLogResult res) {
if (res != AppendLogResult::SUCCEEDED) {
{
std::lock_guard<std::mutex> lck(logsLock_);
logs_.clear();
cachingPromise_.setValue(res);
cachingPromise_.reset();
bufferOverFlow_ = false;
}
sendingPromise_.setValue(res);
replicatingLogs_ = false;
return false;
}
return true;
}

this is disatrous, because we can see from folly::Future<AppendLogResult> RaftPart::appendLogAsync() that once bufferOverFlow is set to true, appendLogAsync() will return error directly, meanwhile there is no other task comsuming the logs_ and reseting bufferOverFlow_ to false, that mean the whole raft append log process will be blocked:

folly::Future<AppendLogResult> RaftPart::appendLogAsync(ClusterID source,
LogType logType,
std::string log,
AtomicOp op) {
if (blocking_) {
// No need to block heartbeats and empty log.
if ((logType == LogType::NORMAL && !log.empty()) || logType == LogType::ATOMIC_OP) {
return AppendLogResult::E_WRITE_BLOCKING;
}
}
LogCache swappedOutLogs;
auto retFuture = folly::Future<AppendLogResult>::makeEmpty();
if (bufferOverFlow_) {
LOG_EVERY_N(WARNING, 100) << idStr_
<< "The appendLog buffer is full."
" Please slow down the log appending rate."
<< "replicatingLogs_ :" << replicatingLogs_;
return AppendLogResult::E_BUFFER_OVERFLOW;
}
{
std::lock_guard<std::mutex> lck(logsLock_);
VLOG(2) << idStr_ << "Checking whether buffer overflow";
if (logs_.size() >= FLAGS_max_batch_size) {
// Buffer is full
LOG(WARNING) << idStr_
<< "The appendLog buffer is full."
" Please slow down the log appending rate."
<< "replicatingLogs_ :" << replicatingLogs_;
bufferOverFlow_ = true;
return AppendLogResult::E_BUFFER_OVERFLOW;
}
VLOG(2) << idStr_ << "Appending logs to the buffer";
// Append new logs to the buffer
DCHECK_GE(source, 0);
logs_.emplace_back(source, logType, std::move(log), std::move(op));
switch (logType) {
case LogType::ATOMIC_OP:
retFuture = cachingPromise_.getSingleFuture();
break;
case LogType::COMMAND:
retFuture = cachingPromise_.getAndRollSharedFuture();
break;
case LogType::NORMAL:
retFuture = cachingPromise_.getSharedFuture();
break;
}
bool expected = false;
if (replicatingLogs_.compare_exchange_strong(expected, true)) {
// We need to send logs to all followers
VLOG(2) << idStr_ << "Preparing to send AppendLog request";
sendingPromise_ = std::move(cachingPromise_);
cachingPromise_.reset();
std::swap(swappedOutLogs, logs_);
bufferOverFlow_ = false;
} else {
VLOG(2) << idStr_ << "Another AppendLogs request is ongoing, just return";
return retFuture;
}
}

@kikimo kikimo added the ready-for-testing PR: ready for the CI test label Oct 19, 2021
@critical27
Copy link
Contributor

Good job, generally LGTM. Have you try this PR in your test, maybe we don't merge to 2.6.0?

@kikimo
Copy link
Contributor Author

kikimo commented Oct 19, 2021

Good job, generally LGTM. Have you try this PR in your test, maybe we don't merge to 2.6.0?

I too think this should not go to 2.6.0, just hold for a while, I'll do more test.

Tried this pr locally, append log does not block anymore but a simple match query taks more than 2s to complete after we stoped the stresser, think there may have other problem.

@kikimo kikimo changed the title fix raft appendlog deadlock [Don't merge]Fix raft appendlog deadlock Oct 19, 2021
@critical27
Copy link
Contributor

Good job, generally LGTM. Have you try this PR in your test, maybe we don't merge to 2.6.0?

I too think this should not go to 2.6.0, just hold for a while, I'll do more test.

Tried this pr locally, append log does not block anymore but a simple match query taks more than 2s to complete after we stoped the stresser, think there may have other problem.

OK, contact me in dingtalk if any problem.

@kikimo
Copy link
Contributor Author

kikimo commented Oct 19, 2021

Good job, generally LGTM. Have you try this PR in your test, maybe we don't merge to 2.6.0?

I too think this should not go to 2.6.0, just hold for a while, I'll do more test.
Tried this pr locally, append log does not block anymore but a simple match query taks more than 2s to complete after we stoped the stresser, think there may have other problem.

OK, contact me in dingtalk if any problem.

👌

@critical27 critical27 changed the title [Don't merge]Fix raft appendlog deadlock Fix raft appendlog deadlock Oct 20, 2021
Copy link
Contributor

@critical27 critical27 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good job! Thx a lot~

Copy link
Contributor

@liuyu85cn liuyu85cn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good Job, LGTM

@codecov-commenter
Copy link

Codecov Report

Merging #3141 (5f4c3d3) into master (d05ca51) will increase coverage by 0.19%.
The diff coverage is 84.48%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #3141      +/-   ##
==========================================
+ Coverage   84.98%   85.17%   +0.19%     
==========================================
  Files        1289     1294       +5     
  Lines      117763   118265     +502     
==========================================
+ Hits       100076   100731     +655     
+ Misses      17687    17534     -153     
Impacted Files Coverage Δ
src/codec/RowWriterV2.h 94.44% <ø> (ø)
src/codec/test/ResultSchemaProvider.h 100.00% <ø> (ø)
src/codec/test/SchemaWriter.h 100.00% <ø> (ø)
src/common/meta/SchemaProviderIf.h 100.00% <ø> (ø)
src/graph/util/ExpressionUtils.h 100.00% <ø> (ø)
src/graph/util/ToJson.h 100.00% <ø> (ø)
src/graph/validator/ACLValidator.cpp 85.07% <ø> (ø)
src/graph/validator/ACLValidator.h 100.00% <ø> (ø)
src/graph/validator/AdminValidator.cpp 78.78% <ø> (ø)
src/graph/validator/AdminValidator.h 67.53% <ø> (ø)
... and 106 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e179fc4...5f4c3d3. Read the comment docs.

@critical27 critical27 merged commit 50b1760 into vesoft-inc:master Oct 20, 2021
@critical27 critical27 added cherry-pick-v2.6 PR: need cherry-pick to this version and removed cherry-pick-v2.6 PR: need cherry-pick to this version labels Oct 21, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ready-for-testing PR: ready for the CI test
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Raft append log may block forever
4 participants