raftstore: fix high commit log duration when adding new peer #13078
Conversation
Signed-off-by: Connor1996 <zbk602423539@gmail.com>
[REVIEW NOTIFICATION] This pull request has been approved by:
To complete the pull request process, please ask the reviewers in the list to review. The full list of commands accepted by this bot can be found here. Reviewers can indicate their review by submitting an approval review.
PTAL @cosven
if alive_replicated_idx > p.matched && p.matched >= truncated_idx {
    alive_replicated_idx = p.matched;
} else if p.matched == 0 {
    // the new peer is still applying snapshot, do not compact cache now
What if the new peer takes a very long time to apply the snapshot? Will the cache grow until OOM?
In that case, `*last_heartbeat > cache_alive_limit` can't be met, and the cache is compacted without considering the new peer.
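For readers following along, here is a minimal, self-contained sketch of the policy under discussion; the `PeerProgress` type, the tick-based heartbeats, and the exact handling of `matched == 0` (clamping to the truncated index) are simplifications of mine, not the actual TiKV code:

```rust
// A simplified model: the alive index only moves down to a follower's `matched`
// when that follower has heartbeated recently, and a peer with `matched == 0`
// (still applying a snapshot) keeps the cache from being compacted past the
// already-truncated index.
struct PeerProgress {
    matched: u64,        // highest log index known to be replicated to the peer
    last_heartbeat: u64, // tick of the last heartbeat seen from the peer
}

fn alive_cache_idx(
    peers: &[PeerProgress],
    leader_applied_idx: u64,
    truncated_idx: u64,
    cache_alive_limit: u64, // heartbeats at or before this tick count as dead
) -> u64 {
    let mut alive_idx = leader_applied_idx;
    for p in peers {
        // Peers that have not heartbeated recently are ignored entirely, so a
        // long-dead peer cannot make the cache grow without bound.
        if p.last_heartbeat <= cache_alive_limit {
            continue;
        }
        if alive_idx > p.matched && p.matched >= truncated_idx {
            alive_idx = p.matched;
        } else if p.matched == 0 {
            // The new peer is still applying a snapshot: don't compact the
            // cache now (modelled here as clamping to the truncated index),
            // so its log gap can still be served from memory afterwards.
            alive_idx = truncated_idx;
        }
    }
    alive_idx
}

fn main() {
    let peers = vec![
        PeerProgress { matched: 0, last_heartbeat: 100 },  // new peer, applying snapshot
        PeerProgress { matched: 95, last_heartbeat: 100 }, // healthy follower
    ];
    // With the `matched == 0` guard, the cache is compacted only up to the
    // truncated index instead of all the way to the leader's applied index.
    assert_eq!(alive_cache_idx(&peers, 100, 80, 40), 80);
    println!("entry cache kept for the snapshot-applying peer");
}
```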
let rid = self.get_region_id();
if self.engines.raft.has_builtin_entry_cache() {
    self.engines.raft.gc_entry_cache(rid, idx);
}
Why delete this part? (e.g., what if raft uses RocksDB?)
Neither raft-engine nor RocksDB has a builtin entry cache now; it's deprecated.
@@ -4958,18 +4963,14 @@ where
 self.fsm
     .peer
     .mut_store()
-    .maybe_gc_cache(alive_cache_idx, applied_idx);
+    .compact_cache_to(alive_replicated_idx + 1);
I have a similar question. Will the cache cost too much memory, since it was eagerly cleaned before and now it is not?
No. If there is a large log lag, force raft log compaction will be triggered, and the cache compaction is performed as well. Check the `mut_store().compact_to()` in `on_ready_compact_log`.
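A rough sketch of that point, using a much-simplified `EntryCache` rather than the real `PeerStorage` internals: the forced-compaction path truncates the in-memory cache together with the raft log, which is what keeps memory bounded even when the periodic GC skips a peer.

```rust
// Stand-in types; the real code lives in PeerStorage and raftstore.
struct EntryCache {
    first_index: u64,
    entries: Vec<u64>, // stand-in for cached raft entries
}

impl EntryCache {
    /// Drop every cached entry with index < `idx`.
    fn compact_to(&mut self, idx: u64) {
        if idx <= self.first_index {
            return;
        }
        let drop_n = (idx - self.first_index).min(self.entries.len() as u64) as usize;
        self.entries.drain(..drop_n);
        self.first_index = idx;
    }
}

// Conceptually mirrors the handler discussed above: once the raft log is
// truncated at `truncated_idx`, the cache is truncated to the same point.
fn on_ready_compact_log(cache: &mut EntryCache, truncated_idx: u64) {
    cache.compact_to(truncated_idx + 1);
}

fn main() {
    let mut cache = EntryCache { first_index: 10, entries: (10..110).collect() };
    on_ready_compact_log(&mut cache, 59);
    assert_eq!(cache.first_index, 60);
    assert_eq!(cache.entries.len(), 50);
    println!("cache bounded after forced log compaction");
}
```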
Signed-off-by: Connor1996 <zbk602423539@gmail.com>
if *last_heartbeat > cache_alive_limit {
    if alive_replicated_idx > p.matched && p.matched >= truncated_idx {
        alive_replicated_idx = p.matched;
    } else if p.matched == 0 {
Why check for `0`?
Please check the PR description; that's the reason why the entry cache is dropped mistakenly.
Better to wrap the commit message at 80 characters.
What if `matched` is not 0 but less than `truncated_idx`?
If it's less than `truncated_idx`, the entry cache must have already been dropped by `on_ready_compact_log`, so there is no need to consider it. It seems we should keep the name `alive_cache_idx`.
For example, if a node is always lagging behind, the leader will wait for its first snapshot and then skip all following snapshots?
Yes. If that's the case, we should adjust the force compact policy to consider in-flight snapshots. Nothing can be done here, as the cache is dropped in `on_ready_compact_log` anyway.
let rid = self.get_region_id();
self.engines.raft.gc_entry_cache(rid, apply_idx + 1);
}
if replicated_idx == apply_idx {
This is still necessary.
Compacting the cache to `min(alive_cache_idx, applied_idx)` in the latest commit seems better than this.
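A tiny illustration of that policy with a hypothetical helper (not the actual call site):

```rust
// Compact the cache only up to whichever is smaller: the index every alive
// peer has replicated, or the leader's own applied index.
fn cache_compact_point(alive_cache_idx: u64, applied_idx: u64) -> u64 {
    // +1 assuming the compaction call drops entries strictly below the given
    // index, so the entry at the minimum (already replicated) is dropped too.
    std::cmp::min(alive_cache_idx, applied_idx) + 1
}

fn main() {
    // An alive follower lags at index 80 while the leader has applied 100:
    // keep the cache from 81 on so the follower can be served from memory.
    assert_eq!(cache_compact_point(80, 100), 81);
    // Everyone has caught up: the whole cache up to applied_idx can go.
    assert_eq!(cache_compact_point(100, 100), 101);
}
```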
`alive_cache_idx` may be accessed again to find the matched entry.
What do you mean? I don't get it.
`replicated_idx + 1` may not be a good index.
I don't ever use `replicated_idx + 1`... I still don't know what your point is. As for `min(alive_cache_idx, applied_idx)`, it already covers the case when the region is inactive.
    Some(idx) => idx,
};
if cache_first_idx > replicated_idx + 1 {
    // Catching up log requires accessing fs already, let's optimize for
This is still necessary.
No, it makes things worse once the cache is dropped by mistake. `alive_cache_idx` and force compaction already exclude peers that lag too far behind; that policy is better than this.
Signed-off-by: Connor1996 <zbk602423539@gmail.com>
Signed-off-by: Connor1996 <zbk602423539@gmail.com>
Signed-off-by: Connor1996 <zbk602423539@gmail.com>
/merge
@Connor1996: It seems you want to merge this PR. I will help you trigger all the tests: /run-all-tests. If you have any questions about the PR merge process, please refer to pr process. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.
This pull request has been accepted and is ready to merge. Commit hash: f370239
/run-tests retry=4
/run-tests
Signed-off-by: ti-srebot <ti-srebot@pingcap.com>
cherry pick to release-6.1 in PR #13089
…#13089) close #13077, ref #13078

When adding a new peer, `alive_cache_idx` would not account for a new peer that is still applying a snapshot, so the entry cache may be compacted because `alive_cache_idx` equals `applied_idx`. After the snapshot is applied, the new peer's log gap is no longer in the entry cache, which triggers an async fetch that reads from disk. Considering that raft-engine's read performance is not as good as RocksDB's, once there are a lot of Regions triggering async fetch, replicating logs to the new peer becomes slow. If there is a conf change promoting the learner and demoting another peer, the commit index can't be advanced in the joint state because the to-be-learner peer doesn't catch up on logs in time.

Signed-off-by: ti-srebot <ti-srebot@pingcap.com>
Co-authored-by: Connor <zbk602423539@gmail.com>
) close tikv#13077

When adding a new peer, `alive_cache_idx` would not account for a new peer that is still applying a snapshot, so the entry cache may be compacted because `alive_cache_idx` equals `applied_idx`. After the snapshot is applied, the new peer's log gap is no longer in the entry cache, which triggers an async fetch that reads from disk. Considering that raft-engine's read performance is not as good as RocksDB's, once there are a lot of Regions triggering async fetch, replicating logs to the new peer becomes slow. If there is a conf change promoting the learner and demoting another peer, the commit index can't be advanced in the joint state because the to-be-learner peer doesn't catch up on logs in time.

Signed-off-by: Connor1996 <zbk602423539@gmail.com>
Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>
…13120) ref #13060, ref #13078

In some cases, such as the one mentioned in #13078, the commit log duration became high. In that case, the needed log is not in the entry cache and there are many raft log async fetch tasks. This commit adds a log to show the cache first index and peers' progress when there is any long-uncommitted proposal. It also adds a metric to show the duration of the async fetch tasks.

Signed-off-by: cosven <yinshaowen241@gmail.com>
Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>
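To make the failure mode described in these commit messages concrete, here is a conceptual sketch, with illustrative types rather than TiKV's real storage API, of the cache-first read path and its disk fallback:

```rust
// If the requested range is still in the in-memory entry cache, it can be
// served immediately; otherwise it has to be read from the raft engine as an
// asynchronous fetch, which is the slow path described above.
enum FetchResult {
    FromCache(Vec<u64>),
    AsyncFetchFromDisk { low: u64, high: u64 },
}

struct Store {
    cache_first_index: u64,
    cache_last_index: u64,
}

impl Store {
    fn entries(&self, low: u64, high: u64) -> FetchResult {
        if low >= self.cache_first_index && high <= self.cache_last_index + 1 {
            FetchResult::FromCache((low..high).collect())
        } else {
            // The gap is no longer cached; it must be fetched from disk.
            FetchResult::AsyncFetchFromDisk { low, high }
        }
    }
}

fn main() {
    // The cache was compacted to applied_idx while the new peer applied its
    // snapshot, so the entries it needs next are gone from memory.
    let store = Store { cache_first_index: 100, cache_last_index: 120 };
    match store.entries(60, 100) {
        FetchResult::FromCache(_) => println!("served from cache"),
        FetchResult::AsyncFetchFromDisk { low, high } => {
            println!("async fetch from raft engine for [{}, {})", low, high)
        }
    }
}
```

Once the cache no longer covers the gap, every such Region pays the async-fetch cost, which is what showed up as high commit log duration.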
What is changed and how it works?
Issue Number: Close #13077
What's Changed:
Related changes
Check List
Tests
before
after
Release note