fix: fix epoch check panic when checkpoint #2005

yezizp2012 · 2022-04-21T04:47:50Z

What's changed and what's your intention?

After investigating the bug mentioned in #1995 and reported in the Slack group, we found that it is caused by injecting a barrier immediately after a quick checkpoint when no actors exist in the cluster. In this situation two equal epochs might be generated, which fails the epoch check.
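To make the failure mode concrete, here is a minimal sketch of a strictly-increasing epoch check. The name `check_epoch` and the `Result` return type are assumptions for illustration only; the real check in the barrier manager is an assertion that panics, and its exact form is not shown here.

```rust
/// Hypothetical stand-in for the epoch check: a new epoch is accepted only
/// if it is strictly greater than the previous one. The real code asserts
/// (and therefore panics); this sketch returns an error instead.
fn check_epoch(prev: u64, curr: u64) -> Result<(), String> {
    if curr > prev {
        Ok(())
    } else {
        Err(format!(
            "epoch check failed: prev = {}, curr = {}",
            prev, curr
        ))
    }
}
```

If two barriers are injected back to back and both receive the same epoch value, `check_epoch(e, e)` returns an error, which corresponds to the panic described above.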

After this PR is merged, we should introduce or develop a monotonic clock library for epoch generation.
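One possible shape for such a generator, as a rough sketch only: it reads the wall clock in milliseconds and bumps the value by one whenever the clock has not advanced past the last epoch handed out. `EpochGenerator`, `next_epoch`, and `next_epoch_from` are hypothetical names, and using plain milliseconds as the epoch value is an assumption, not the project's actual encoding.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Hypothetical generator that never returns the same epoch twice and never
/// goes backwards, even if the wall clock stalls or jumps back.
struct EpochGenerator {
    last: u64,
}

impl EpochGenerator {
    fn new() -> Self {
        Self { last: 0 }
    }

    /// Core logic, with the clock reading passed in so it can be tested
    /// deterministically: take the later of `now_ms` and `last + 1`.
    fn next_epoch_from(&mut self, now_ms: u64) -> u64 {
        let next = now_ms.max(self.last + 1);
        self.last = next;
        next
    }

    /// Convenience wrapper that reads the system wall clock in milliseconds.
    /// A clock reading before the Unix epoch is treated as 0.
    fn next_epoch(&mut self) -> u64 {
        let now_ms = SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .map(|d| d.as_millis() as u64)
            .unwrap_or(0);
        self.next_epoch_from(now_ms)
    }
}
```

With this approach, two calls within the same millisecond yield consecutive values rather than equal ones, at the cost of the epoch drifting slightly ahead of the wall clock under bursts.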

Checklist

I have written necessary docs and comments
I have added necessary unit tests and integration tests

Refer to a related PR or issue link (optional)

Resolve #1995

yezizp2012 · 2022-04-21T04:54:58Z

Starting the cluster with ./risedev p fails occasionally because the option enable_recovery is set to true, so the barrier manager performs a recovery on startup. There is a quick checkpoint near the end of recovery, and due to the tokio::time::MissedTickBehavior::Delay setting on min_interval, a new checkpoint is scheduled immediately after recovery. Cc @mczhuang
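The timing can be illustrated with a small standard-library model of an interval that uses "delay" semantics: an overdue tick fires at once, and the next deadline is one period after the tick actually fired. This is a simplified model written for this explanation, not tokio's implementation; `DelayInterval` is a hypothetical type, and the 100 ms period and 350 ms recovery time used in the note below are made-up numbers.

```rust
use std::time::Duration;

/// Simplified model of an interval timer with "delay" missed-tick behavior.
/// Times are expressed as offsets from the moment the interval was created.
struct DelayInterval {
    period: Duration,
    deadline: Duration,
}

impl DelayInterval {
    fn new(period: Duration) -> Self {
        // The first tick is due immediately.
        Self { period, deadline: Duration::ZERO }
    }

    /// `now` is when the caller starts waiting for the next tick.
    /// Returns the time at which that tick fires.
    fn tick(&mut self, now: Duration) -> Duration {
        // An overdue deadline fires right away; otherwise wait for it.
        let fired_at = now.max(self.deadline);
        // Delay semantics: schedule the next tick one period after this one.
        self.deadline = fired_at + self.period;
        fired_at
    }
}
```

With a 100 ms period, a first tick at 0 ms, and a recovery that keeps the loop busy until 350 ms, the next `tick` call fires at 350 ms with no wait, so a new checkpoint follows right behind the quick checkpoint taken at the end of recovery.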

codecov · 2022-04-21T04:56:03Z

Codecov Report

Merging #2005 (0499228) into main (99aede2) will decrease coverage by 0.00%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main    #2005      +/-   ##
==========================================
- Coverage   70.79%   70.79%   -0.01%     
==========================================
  Files         627      627              
  Lines       80775    80788      +13     
==========================================
+ Hits        57188    57190       +2     
- Misses      23587    23598      +11

Flag	Coverage Δ
rust	`70.79% <100.00%> (-0.01%)`	⬇️


Impacted Files	Coverage Δ
src/meta/src/barrier/info.rs	`100.00% <100.00%> (ø)`
src/meta/src/barrier/mod.rs	`69.00% <100.00%> (-1.07%)`	⬇️
src/meta/src/barrier/command.rs	`56.07% <0.00%> (-2.81%)`	⬇️
.../src/executor/managed_state/aggregation/extreme.rs	`90.02% <0.00%> (-0.27%)`	⬇️
src/meta/src/hummock/hummock_manager.rs	`91.23% <0.00%> (-0.11%)`	⬇️


BugenZhao

Can we also remove the check and assert here?

https://github.com/singularity-data/risingwave/blob/4a8a66bdb84423733ce7a84bd9e2630460aecdab/src/meta/src/barrier/mod.rs#L344-L348

yezizp2012 · 2022-04-21T05:09:22Z

> Can we also remove the check and assert here?
>
> https://github.com/singularity-data/risingwave/blob/4a8a66bdb84423733ce7a84bd9e2630460aecdab/src/meta/src/barrier/mod.rs#L344-L348

Yes, NTFS.

yezizp2012 · 2022-04-21T06:38:10Z

> Can we also remove the check and assert here?
>
> https://github.com/singularity-data/risingwave/blob/4a8a66bdb84423733ce7a84bd9e2630460aecdab/src/meta/src/barrier/mod.rs#L344-L348

Emmm, we may have to keep this check. It is applied to each node, and in the future some nodes may contain no actors while others do.

fix: fix epoch check panic when checkpoint

0499228

yezizp2012 requested review from BugenZhao, MrCroxx, mczhuang, skyzh and TennyZhuang April 21, 2022 04:48

github-actions bot added the type/fix Bug fix label Apr 21, 2022

skyzh approved these changes Apr 21, 2022

BugenZhao approved these changes Apr 21, 2022

yezizp2012 merged commit 0a31fa7 into main Apr 21, 2022

yezizp2012 deleted the fix/epoch-check-panic branch April 21, 2022 05:07

yezizp2012 mentioned this pull request Apr 21, 2022

meta: introduce a monotonic clock lib for epoch generate #2011

Closed
