
raftstore: fix high commit log duration when adding new peer #13078

Merged
merged 9 commits into tikv:master on Jul 21, 2022

Conversation


@Connor1996 (Member) commented Jul 20, 2022

What is changed and how it works?

Issue Number: Close #13077

What's Changed:


When adding a new peer, `alive_cache_idx` did not take into account that the new
peer is still applying a snapshot, so the entry cache may be compacted because
`alive_cache_idx` equals `applied_idx`. After the snapshot is applied, the new
peer's log gap is no longer in the entry cache, which triggers async fetches
that read from disk.

Since Raft Engine's read performance is not as good as RocksDB's, once many
Regions trigger async fetches, replicating logs to the new peer becomes slow.
If a conf change then promotes the learner and demotes another peer, the commit
index cannot advance in the joint state because the to-be-learner peer does not
catch up on logs in time.
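
For illustration, here is a minimal sketch (not the actual raftstore code; the `Progress` struct and parameter names are simplified) of how the alive cache index can be computed so that a peer with `matched == 0`, i.e. one still applying a snapshot, holds back cache compaction, while the heartbeat check still keeps memory bounded:

```rust
use std::time::Instant;

struct Progress {
    matched: u64, // highest log index known to be replicated to the peer
    last_heartbeat: Instant,
}

// Sketch only: compute the index below which the entry cache may be compacted.
fn alive_cache_idx(
    progress: &[Progress],
    applied_idx: u64,
    truncated_idx: u64,
    cache_alive_limit: Instant,
) -> u64 {
    let mut alive_idx = applied_idx;
    for p in progress {
        // Peers that have not heartbeated recently are ignored, so a single
        // dead or extremely slow peer cannot pin the cache forever (no OOM).
        if p.last_heartbeat <= cache_alive_limit {
            continue;
        }
        if p.matched == 0 {
            // Still applying a snapshot: keep everything after the truncated
            // index so the post-snapshot log gap can be served from memory.
            alive_idx = alive_idx.min(truncated_idx);
        } else if p.matched >= truncated_idx {
            alive_idx = alive_idx.min(p.matched);
        }
    }
    alive_idx
}
```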

Related changes

  • Need to cherry-pick to the release branch

Check List

Tests

  • Integration test
  • Manual test (add detailed scripts or steps below)

before: [screenshot]

after: [screenshot]

Release note

Fix possible QPS drop due to high commit log duration 

Signed-off-by: Connor1996 <zbk602423539@gmail.com>
Signed-off-by: Connor1996 <zbk602423539@gmail.com>

ti-chi-bot commented Jul 20, 2022

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • BusyJay
  • tabokie

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

@Connor1996
Member Author

PTAL @cosven

            if alive_replicated_idx > p.matched && p.matched >= truncated_idx {
                alive_replicated_idx = p.matched;
            } else if p.matched == 0 {
                // the new peer is still applying snapshot, do not compact cache now
Contributor

What if the new peer takes a very long time applying the snapshot? Will the cache grow until OOM?

Member Author

If so, `*last_heartbeat > cache_alive_limit` can no longer be met, and the cache is compacted without considering the new peer.

        let rid = self.get_region_id();
        if self.engines.raft.has_builtin_entry_cache() {
            self.engines.raft.gc_entry_cache(rid, idx);
        }
Contributor

Why delete this part? (e.g. what if raft uses RocksDB?)

Member Author

Neither Raft Engine nor RocksDB has a builtin entry cache now; it's deprecated.

@@ -4958,18 +4963,14 @@ where
                 self.fsm
                     .peer
                     .mut_store()
-                    .maybe_gc_cache(alive_cache_idx, applied_idx);
+                    .compact_cache_to(alive_replicated_idx + 1);
@cosven (Member) commented Jul 21, 2022

I have a similar question. Will the cache cost too much memory? It was eagerly cleaned before, and now it is not.

Member Author

No. If there is a large log lag, a forced raft log compaction is triggered, and the cache is compacted as well. Check the `mut_store().compact_to()` call in `on_ready_compact_log`.
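
Roughly, the interplay looks like the following minimal sketch, assuming simplified types; the names below are illustrative, not TiKV's exact API. It shows why memory stays bounded: when a forced raft log compaction (CompactLog) is applied, the in-memory entry cache is trimmed together with the on-disk log.

```rust
use std::collections::VecDeque;

struct EntryCache {
    first_index: u64,
    entries: VecDeque<Vec<u8>>,
}

impl EntryCache {
    // Drop all cached entries with index < idx.
    fn compact_to(&mut self, idx: u64) {
        while self.first_index < idx {
            if self.entries.pop_front().is_none() {
                break;
            }
            self.first_index += 1;
        }
    }
}

// Called when a CompactLog admin command has been applied; `truncated_index`
// is the index recorded in the apply result.
fn on_ready_compact_log(cache: &mut EntryCache, truncated_index: u64) {
    // Entries up to and including truncated_index are no longer needed in memory.
    cache.compact_to(truncated_index + 1);
    // The on-disk raft log GC task would be scheduled separately here.
}
```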

Signed-off-by: Connor1996 <zbk602423539@gmail.com>
            if *last_heartbeat > cache_alive_limit {
                if alive_replicated_idx > p.matched && p.matched >= truncated_idx {
                    alive_replicated_idx = p.matched;
                } else if p.matched == 0 {
Member

Why check for 0?

Member Author

Please check the PR description; that's the reason the entry cache is dropped by mistake.

Member

Better to wrap the commit message at 80 characters.

Member

What if `matched` is not 0 but less than `truncated_idx`?

Member Author

If it's less than `truncated_idx`, the entry cache must already have been dropped by `on_ready_compact_log`, so there is no need to consider it. Seems we should keep the name `alive_cache_idx`.

Member

For example, if a node is always lagging behind, will the leader wait for its first snapshot and then skip all following snapshots?

Member Author

Yes. If that's the case, we should adjust the force compact policy to consider in-flight snapshots. Nothing can be done here, as the cache is dropped in `on_ready_compact_log` anyway.

            let rid = self.get_region_id();
            self.engines.raft.gc_entry_cache(rid, apply_idx + 1);
        }
        if replicated_idx == apply_idx {
Member

This is still necessary.

Member Author

The latest commit compacts the cache to `min(alive_cache_idx, applied_idx)`, which seems better than this.
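
A minimal sketch of what that compaction call looks like, with simplified names and a plain `Vec` cache (not the exact raftstore code): entries that an alive-but-lagging peer or the local apply state may still need are kept, and everything older is dropped.

```rust
// Sketch only: compact the entry cache to min(alive_cache_idx, applied_idx) + 1.
fn maybe_compact_entry_cache(
    cache: &mut Vec<(u64, Vec<u8>)>, // (log index, entry payload), oldest first
    alive_cache_idx: u64,
    applied_idx: u64,
) {
    let compact_to = alive_cache_idx.min(applied_idx) + 1;
    // Keep only entries at or after compact_to.
    cache.retain(|(idx, _)| *idx >= compact_to);
}
```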

Member

`alive_cache_idx` may be accessed again to find the matched entry.

Member Author

What do you mean? I don't get it.

Member

replicated_idx + 1 may not be a good index.

Member Author

I don't ever use `replicated_idx + 1`... I still don't know what your point is. `min(alive_cache_idx, applied_idx)` already covers the case when the region is inactive.

            Some(idx) => idx,
        };
        if cache_first_idx > replicated_idx + 1 {
            // Catching up log requires accessing fs already, let's optimize for
Member

This is still necessary.

Member Author

No, it makes things worse once the cache is dropped by mistake. `alive_cache_idx` and force compaction already exclude a peer that lags too far behind; that policy is better than this one.

Signed-off-by: Connor1996 <zbk602423539@gmail.com>
Signed-off-by: Connor1996 <zbk602423539@gmail.com>
Signed-off-by: Connor1996 <zbk602423539@gmail.com>
Signed-off-by: Connor1996 <zbk602423539@gmail.com>
@ti-chi-bot ti-chi-bot added the status/LGT1 Status: PR - There is already 1 approval label Jul 21, 2022
@ti-chi-bot ti-chi-bot added status/LGT2 Status: PR - There are already 2 approvals and removed status/LGT1 Status: PR - There is already 1 approval labels Jul 21, 2022
@Connor1996
Member Author

/merge

@ti-chi-bot
Member

@Connor1996: It seems you want to merge this PR, I will help you trigger all the tests:

/run-all-tests

You only need to trigger /merge once, and if the CI test fails, you just re-trigger the test that failed and the bot will merge the PR for you after the CI passes.

If you have any questions about the PR merge process, please refer to pr process.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@ti-chi-bot
Member

This pull request has been accepted and is ready to merge.

Commit hash: f370239

@ti-chi-bot ti-chi-bot added the status/can-merge Status: Can merge to base branch label Jul 21, 2022
@Connor1996 Connor1996 changed the title raftstore: Fix entry cache gc may dropped by mistake when adding new peer raftstore: Fix high commit log duration because entry cache gc dropped by mistake when adding new peer Jul 21, 2022
@tabokie
Member

tabokie commented Jul 21, 2022

/run-tests retry=4

@BusyJay BusyJay changed the title raftstore: Fix high commit log duration because entry cache gc dropped by mistake when adding new peer raftstore: fix high commit log duration when adding new peer Jul 21, 2022
@tabokie
Member

tabokie commented Jul 21, 2022

/run-tests

@ti-chi-bot ti-chi-bot merged commit 1f0a1a3 into tikv:master Jul 21, 2022
@ti-chi-bot ti-chi-bot added this to the Pool milestone Jul 21, 2022
ti-srebot pushed a commit to ti-srebot/tikv that referenced this pull request Jul 21, 2022
Signed-off-by: ti-srebot <ti-srebot@pingcap.com>
@ti-srebot
Contributor

cherry pick to release-6.1 in PR #13089

ti-chi-bot pushed a commit that referenced this pull request Jul 21, 2022
…#13089)

close #13077, ref #13078

Signed-off-by: ti-srebot <ti-srebot@pingcap.com>

Co-authored-by: Connor <zbk602423539@gmail.com>
LintianShi pushed a commit to LintianShi/tikv that referenced this pull request Jul 27, 2022

close tikv#13077

Signed-off-by: Connor1996 <zbk602423539@gmail.com>

Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>
ti-chi-bot added a commit that referenced this pull request Aug 2, 2022
…13120)

ref #13060, ref #13078

In some cases, such as the one mentioned in #13078, the commit log duration
becomes high. In such cases, the needed log is not in the entry cache and there
are many raft log async fetch tasks.

This commit adds a log showing the cache first index and the peers' progress
when there is a long-uncommitted proposal. It also adds a metric showing the
duration of the async fetch tasks.

Signed-off-by: cosven <yinshaowen241@gmail.com>

Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>
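
For reference, a hedged sketch of the kind of diagnostics described above; the function name, threshold, and plain-`eprintln` output are placeholders for the real logging and metrics plumbing in #13120, not the actual code.

```rust
use std::time::{Duration, Instant};

// Hypothetical sketch: log the cache first index and each peer's matched index
// when a proposal has stayed uncommitted for too long.
fn maybe_log_slow_commit(
    region_id: u64,
    proposed_at: Instant,
    cache_first_index: u64,
    peers_matched: &[(u64, u64)], // (peer_id, matched log index)
) {
    const LONG_UNCOMMITTED: Duration = Duration::from_secs(5);
    let elapsed = proposed_at.elapsed();
    if elapsed > LONG_UNCOMMITTED {
        eprintln!(
            "[region {}] proposal uncommitted for {:?}, cache_first_index={}, peers={:?}",
            region_id, elapsed, cache_first_index, peers_matched,
        );
    }
}
```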
Labels
needs-cherry-pick-release-6.1, release-note, size/L, status/can-merge (Status: Can merge to base branch), status/LGT2 (Status: PR - There are already 2 approvals)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

v6.1.0: After network isolation is recovered, QPS drops more than 50% due to high commit log duration
7 participants