log-backup: added intervally resolve regions #14180

Merged
merged 15 commits into from Mar 13, 2023

@YuJuncen YuJuncen commented Feb 8, 2023

Signed-off-by: hillium yujuncen@pingcap.com

What is changed and how it works?

Issue Number: Ref #13638

What's Changed:

This PR is the "resolved TS" part of #14023.

This PR adds a "two-phase" flush to log backup to reduce checkpoint lag.
We add a `MinTs` task, which resolves the regions and advances the `resolved_ts` in the checkpoint manager.
Then, when a flush happens, the current `resolved_ts` becomes the `checkpoint_ts`.
This allows us to advance `checkpoint_ts` even after the leader has gone. When the leader changes frequently, this can greatly reduce checkpoint lag.
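
A minimal sketch of the two phases described above, for illustration only. The `CheckpointManagerSketch` type, its fields, and the rule that the global checkpoint is the minimum over regions are assumptions made for this example, not the actual code in this PR.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified stand-in for the checkpoint manager.
#[derive(Default)]
struct CheckpointManagerSketch {
    /// Phase 1: latest resolved TS per region, advanced by the periodic `MinTs` task.
    resolved_ts: HashMap<u64, u64>,
    /// Phase 2: per-region TS that was promoted by the most recent flush.
    checkpoint_ts: HashMap<u64, u64>,
}

impl CheckpointManagerSketch {
    /// Phase 1: record a resolve result for a region. The value never moves backwards.
    fn resolve(&mut self, region_id: u64, ts: u64) {
        let slot = self.resolved_ts.entry(region_id).or_insert(0);
        *slot = (*slot).max(ts);
    }

    /// Phase 2: on flush, the current `resolved_ts` becomes the `checkpoint_ts`.
    fn flush(&mut self) {
        for (region_id, ts) in &self.resolved_ts {
            let slot = self.checkpoint_ts.entry(*region_id).or_insert(0);
            *slot = (*slot).max(*ts);
        }
    }

    /// Assumption for this sketch: the global checkpoint is the minimum over all regions.
    fn global_checkpoint(&self) -> Option<u64> {
        self.checkpoint_ts.values().copied().min()
    }
}

fn main() {
    let mut mgr = CheckpointManagerSketch::default();
    mgr.resolve(1, 100);
    mgr.resolve(2, 90);
    // Resolving alone does not move the checkpoint.
    assert_eq!(mgr.global_checkpoint(), None);
    mgr.flush();
    assert_eq!(mgr.global_checkpoint(), Some(90));
    println!("checkpoint after flush: {:?}", mgr.global_checkpoint());
}
```

The point of the split is that `resolve` can keep running on a timer, independent of which store currently holds the leader, while `flush` only promotes whatever has already been resolved.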

Related changes

  • PR to update pingcap/docs/pingcap/docs-cn:
  • Need to cherry-pick to the release branch

Check List

Tests

  • Unit test
  • Integration test

Side effects

  • Performance regression
    • Consumes more CPU
    • Consumes more MEM

Release note

Make the checkpoint lag of PITR more stable when leadership transfers occur.

Signed-off-by: hillium <yujuncen@pingcap.com>
ti-chi-bot commented Feb 8, 2023

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • 3pointer
  • hicqu


YuJuncen and others added 4 commits February 15, 2023 11:22
Signed-off-by: hillium <yujuncen@pingcap.com>
Signed-off-by: hillium <yujuncen@pingcap.com>
Comment on lines 975 to 1002
RegionCheckpointOperation::PrepareMinTsForResolve => {
    let min_ts = self.pool.block_on(self.prepare_min_ts());
    let start_time = Instant::now();
    try_send!(
        self.scheduler,
        Task::RegionCheckpointsOp(RegionCheckpointOperation::Resolve {
            min_ts,
            start_time
        })
    );
}
RegionCheckpointOperation::Resolve { min_ts, start_time } => {
    let sched = self.scheduler.clone();
    try_send!(
        self.scheduler,
        Task::ModifyObserve(ObserveOp::ResolveRegions {
            callback: Box::new(move |mut resolved| {
                let t =
                    Task::RegionCheckpointsOp(RegionCheckpointOperation::Resolved {
                        checkpoints: resolved.take_resolve_result(),
                        start_time,
                    });
                try_send!(sched, t);
            }),
            min_ts
        })
    );
}
Contributor

It seems that `RegionCheckpointOperation::Resolve` is only scheduled by `RegionCheckpointOperation::PrepareMinTsForResolve`?

Suggested change
-RegionCheckpointOperation::PrepareMinTsForResolve => {
-    let min_ts = self.pool.block_on(self.prepare_min_ts());
-    let start_time = Instant::now();
-    try_send!(
-        self.scheduler,
-        Task::RegionCheckpointsOp(RegionCheckpointOperation::Resolve {
-            min_ts,
-            start_time
-        })
-    );
-}
-RegionCheckpointOperation::Resolve { min_ts, start_time } => {
-    let sched = self.scheduler.clone();
-    try_send!(
-        self.scheduler,
-        Task::ModifyObserve(ObserveOp::ResolveRegions {
-            callback: Box::new(move |mut resolved| {
-                let t =
-                    Task::RegionCheckpointsOp(RegionCheckpointOperation::Resolved {
-                        checkpoints: resolved.take_resolve_result(),
-                        start_time,
-                    });
-                try_send!(sched, t);
-            }),
-            min_ts
-        })
-    );
-}
+RegionCheckpointOperation::PrepareMinTsForResolve => {
+    let min_ts = self.pool.block_on(self.prepare_min_ts());
+    let start_time = Instant::now();
+    let sched = self.scheduler.clone();
+    try_send!(
+        self.scheduler,
+        Task::ModifyObserve(ObserveOp::ResolveRegions {
+            callback: Box::new(move |mut resolved| {
+                let t =
+                    Task::RegionCheckpointsOp(RegionCheckpointOperation::Resolved {
+                        checkpoints: resolved.take_resolve_result(),
+                        start_time,
+                    });
+                try_send!(sched, t);
+            }),
+            min_ts
+        })
+    );
+}

Contributor Author

We must reschedule the task to consume all the pending events after we updated the concurrency manager.
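
To illustrate the ordering argument, here is a small sketch that assumes the scheduler behaves like a FIFO queue. The `Task` enum, the `run` function, and the placeholder `min_ts` value are hypothetical and only model the idea: re-enqueueing `Resolve` places it behind every event that was already pending when `PrepareMinTs` ran.

```rust
use std::sync::mpsc;

/// Hypothetical task types; they only mirror the shape of the discussion.
enum Task {
    PrepareMinTs,
    Event(u32),
    Resolve { min_ts: u64 },
}

/// Drains the queue on one thread and returns the order in which tasks were handled.
fn run(rx: &mpsc::Receiver<Task>, tx: &mpsc::Sender<Task>) -> Vec<String> {
    let mut handled = Vec::new();
    while let Ok(task) = rx.try_recv() {
        match task {
            Task::PrepareMinTs => {
                // Placeholder for fetching a TS and updating the concurrency manager.
                let min_ts = 42;
                handled.push("prepare".to_string());
                // Re-enqueue instead of resolving inline: the new task lands
                // behind everything that is already waiting in the queue.
                tx.send(Task::Resolve { min_ts }).unwrap();
            }
            Task::Event(id) => handled.push(format!("event {id}")),
            Task::Resolve { min_ts } => handled.push(format!("resolve at {min_ts}")),
        }
    }
    handled
}

fn main() {
    let (tx, rx) = mpsc::channel();
    tx.send(Task::PrepareMinTs).unwrap();
    tx.send(Task::Event(1)).unwrap();
    tx.send(Task::Event(2)).unwrap();
    let order = run(&rx, &tx);
    // Both pending events are handled before the resolve step.
    assert_eq!(order, ["prepare", "event 1", "event 2", "resolve at 42"]);
    println!("{order:?}");
}
```

If the resolve step ran inline inside the `PrepareMinTs` arm, it would run before `event 1` and `event 2`, which is the situation the reply above wants to avoid.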

Contributor

got it

@ti-chi-bot

@Leavrth: Thanks for your review. The bot only counts approvals from reviewers and higher roles in list, but you're still welcome to leave your comments.

Signed-off-by: hillium <yujuncen@pingcap.com>
@@ -60,49 +64,59 @@ impl SubscriptionManager {
         while let Some(msg) = self.input.next().await {
             match msg {
                 SubscriptionOp::Add(sub) => {
-                    self.subscribers.insert(Uuid::new_v4(), sub);
+                    let uid = Uuid::new_v4();
+                    info!("log backup adding new subscriber"; "id" => %uid);
Contributor

will it cause too many logs?

Contributor Author

Generally, one advancer in its lifetime should only subscribe once. 🤔

Signed-off-by: hillium <yujuncen@pingcap.com>
@@ -2595,6 +2597,13 @@ impl BackupStreamConfig {
             );
             self.num_threads = default_cfg.num_threads;
         }
+        if self.max_flush_interval < ReadableDuration::secs(1) {
Contributor

maybe also check min_ts_interval
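
A rough sketch of what checking both intervals could look like. It uses `std::time::Duration` instead of `ReadableDuration`, and the `BackupStreamConfigSketch` type, the default values, and the one-second lower bound for `min_ts_interval` are all assumptions for illustration. Only the `max_flush_interval < 1s` condition appears in the diff above.

```rust
use std::time::Duration;

/// Hypothetical, reduced stand-in for the real config struct.
#[derive(Debug, Clone, PartialEq)]
struct BackupStreamConfigSketch {
    max_flush_interval: Duration,
    min_ts_interval: Duration,
}

impl Default for BackupStreamConfigSketch {
    fn default() -> Self {
        Self {
            // Assumed defaults, chosen only for this example.
            max_flush_interval: Duration::from_secs(180),
            min_ts_interval: Duration::from_secs(10),
        }
    }
}

impl BackupStreamConfigSketch {
    /// Resets intervals that are too small to their defaults and
    /// returns the names of the fields that were reset.
    fn validate(&mut self) -> Vec<&'static str> {
        let default_cfg = Self::default();
        let lower_bound = Duration::from_secs(1);
        let mut reset = Vec::new();
        if self.max_flush_interval < lower_bound {
            self.max_flush_interval = default_cfg.max_flush_interval;
            reset.push("max_flush_interval");
        }
        // The additional check suggested in the review (bound is assumed).
        if self.min_ts_interval < lower_bound {
            self.min_ts_interval = default_cfg.min_ts_interval;
            reset.push("min_ts_interval");
        }
        reset
    }
}

fn main() {
    let mut cfg = BackupStreamConfigSketch {
        max_flush_interval: Duration::from_millis(100),
        min_ts_interval: Duration::from_millis(10),
    };
    let reset = cfg.validate();
    assert_eq!(reset, ["max_flush_interval", "min_ts_interval"]);
    assert_eq!(cfg, BackupStreamConfigSketch::default());
    println!("reset fields: {reset:?}");
}
```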

Signed-off-by: hillium <yujuncen@pingcap.com>
@YuJuncen
Contributor Author

/test

@ti-chi-bot
Member

@YuJuncen: Your PR was out of date, I have automatically updated it for you.

If the CI test fails, you just re-trigger the test that failed and the bot will merge the PR for you after the CI passes.

@YuJuncen
Contributor Author

/test
[2023-03-13T04:23:05.868Z] thread 'test::resolved_follower' panicked at 'called `Option::unwrap()` on a `None` value', components/backup-stream/tests/mod.rs:545:84 🤔

@YuJuncen
Contributor Author

(Perhaps we need to add a callback for flush...)

Signed-off-by: hillium <yujuncen@pingcap.com>
@ti-chi-bot ti-chi-bot removed the status/can-merge Status: Can merge to base branch label Mar 13, 2023
@YuJuncen
Contributor Author

/test

@hicqu
Contributor

hicqu commented Mar 13, 2023

/merge

@ti-chi-bot
Member

@hicqu: It seems you want to merge this PR, I will help you trigger all the tests:

/run-all-tests

You only need to trigger /merge once, and if the CI test fails, you just re-trigger the test that failed and the bot will merge the PR for you after the CI passes.

@ti-chi-bot
Member

This pull request has been accepted and is ready to merge.

Commit hash: 96a0b08

@ti-chi-bot ti-chi-bot added the status/can-merge Status: Can merge to base branch label Mar 13, 2023
@ti-chi-bot ti-chi-bot merged commit 571e513 into tikv:master Mar 13, 2023
1 check passed
@ti-chi-bot ti-chi-bot added this to the Pool milestone Mar 13, 2023
@ti-chi-bot
Member

In response to a cherrypick label: new pull request created to branch release-6.5: #14381.

ti-chi-bot pushed a commit to ti-chi-bot/tikv that referenced this pull request Mar 13, 2023
ref tikv#13638

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
@ti-chi-bot
Member

In response to a cherrypick label: new pull request created to branch release-6.6: #14382.

@ti-chi-bot
Member

In response to a cherrypick label: new pull request created to branch release-6.2: #14383.

ti-chi-bot pushed a commit to ti-chi-bot/tikv that referenced this pull request Mar 13, 2023
ref tikv#13638

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
ti-chi-bot pushed a commit to ti-chi-bot/tikv that referenced this pull request Mar 13, 2023
ref tikv#13638

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
@ti-chi-bot
Member

In response to a cherrypick label: new pull request created to branch release-6.3: #14384.

ti-chi-bot pushed a commit to ti-chi-bot/tikv that referenced this pull request Mar 13, 2023
ref tikv#13638

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
@ti-chi-bot
Member

In response to a cherrypick label: new pull request created to branch release-6.4: #14385.

YuJuncen added a commit to ti-chi-bot/tikv that referenced this pull request Aug 10, 2023
ref tikv#13638

Signed-off-by: hillium <yujuncen@pingcap.com>

Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>
ti-chi-bot bot pushed a commit that referenced this pull request Aug 14, 2023
ref #13638

Signed-off-by: hillium <yujuncen@pingcap.com>

Co-authored-by: 山岚 <36239017+YuJuncen@users.noreply.github.com>
Co-authored-by: hillium <yujuncen@pingcap.com>
Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>