fix: fix epoch check panic when checkpoint #2005

Merged
merged 1 commit into from
Apr 21, 2022

@yezizp2012 yezizp2012 commented Apr 21, 2022

What's changed and what's your intention?

After investigating the bug mentioned in #1995 and reported in the Slack group, I found that it is caused by injecting a barrier immediately after a quick checkpoint when no actors exist in the cluster. In this situation two equal epochs might be generated, which fails the epoch check.
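To make the failure mode concrete, here is a minimal sketch. It assumes, for illustration only, that an epoch is a wall-clock reading in milliseconds and that the check requires each epoch to be strictly greater than the previous one. `Epoch` and `check_epoch` are hypothetical names, not the types in `src/meta/src/barrier`.

```rust
/// Hypothetical epoch type: a wall-clock timestamp in milliseconds.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct Epoch(u64);

/// Hypothetical check: the current epoch must be strictly greater
/// than the previous one.
fn check_epoch(prev: Epoch, curr: Epoch) -> Result<(), String> {
    if curr > prev {
        Ok(())
    } else {
        Err(format!(
            "epoch did not advance: prev={:?}, curr={:?}",
            prev, curr
        ))
    }
}

fn main() {
    // Two barriers injected back to back can read the same millisecond.
    let now_ms = 1_650_517_620_000;
    let checkpoint_epoch = Epoch(now_ms);
    let next_barrier_epoch = Epoch(now_ms);

    // Equal epochs fail a strict "greater than" check.
    assert!(check_epoch(checkpoint_epoch, next_barrier_epoch).is_err());

    // One millisecond later the check passes.
    assert!(check_epoch(checkpoint_epoch, Epoch(now_ms + 1)).is_ok());
}
```

With no actors in the cluster, a checkpoint can complete almost instantly, so the interval between the two clock readings can be shorter than the clock resolution.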

After this PR is merged, we should introduce or develop a monotonic clock library for epoch generation.
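One possible shape for such a generator, sketched under the assumption that epochs are millisecond timestamps: remember the last value handed out and never return anything less than or equal to it. `MonotonicEpochGen` and its methods are hypothetical and are not part of this PR.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Hypothetical generator that hands out strictly increasing epochs.
struct MonotonicEpochGen {
    last: u64,
}

impl MonotonicEpochGen {
    fn new() -> Self {
        Self { last: 0 }
    }

    /// Produce the next epoch from a given physical time in milliseconds.
    /// If the clock has not advanced (or went backwards), bump past the
    /// last epoch instead of repeating it.
    fn next_from(&mut self, physical_ms: u64) -> u64 {
        let next = physical_ms.max(self.last + 1);
        self.last = next;
        next
    }

    /// Produce the next epoch from the system clock.
    fn next_epoch(&mut self) -> u64 {
        let now_ms = SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .map(|d| d.as_millis() as u64)
            .unwrap_or(0);
        self.next_from(now_ms)
    }
}

fn main() {
    let mut gen = MonotonicEpochGen::new();
    let a = gen.next_epoch();
    let b = gen.next_epoch();
    // Holds even when both calls land in the same millisecond.
    assert!(b > a);
}
```

The trade-off in this sketch is that a generated epoch can run slightly ahead of physical time when many epochs are requested within one millisecond.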

Checklist

  • I have written necessary docs and comments
  • I have added necessary unit tests and integration tests

Refer to a related PR or issue link (optional)

Resolve #1995

yezizp2012 commented Apr 21, 2022

Starting the cluster with ./risedev p fails occasionally because the option enable_recovery is set to true, so the barrier manager performs a recovery on startup. There is a quick checkpoint near the end of recovery, and because min_interval uses tokio::time::MissedTickBehavior::Delay, a new checkpoint is scheduled immediately after recovery. Cc @mczhuang
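The timing can be illustrated with a simplified model of "delay" semantics for a missed tick: if the deadline has already passed when the interval is polled, the tick fires right away and the next deadline is one period after that moment. This is a standalone sketch, not tokio's implementation, and `poll_tick_delay` and the numbers used are hypothetical.

```rust
/// Simplified model of polling an interval with "delay" semantics.
/// All times are milliseconds on a hypothetical clock.
/// Returns (time the tick fires, deadline of the following tick).
fn poll_tick_delay(deadline_ms: u64, now_ms: u64, period_ms: u64) -> (u64, u64) {
    if now_ms > deadline_ms {
        // The deadline was missed: fire immediately and restart the
        // period from the current time.
        (now_ms, now_ms + period_ms)
    } else {
        // On time: wait for the deadline, then schedule the next one.
        (deadline_ms, deadline_ms + period_ms)
    }
}

fn main() {
    let period = 100; // assumed interval between barriers
    let first_deadline = 100; // first tick was due at t = 100
    let recovery_done = 350; // recovery (with its quick checkpoint) ends here

    let (fires_at, next_deadline) = poll_tick_delay(first_deadline, recovery_done, period);

    // The tick fires the moment recovery finishes, right after the
    // quick checkpoint, rather than one full period later.
    assert_eq!(fires_at, recovery_done);
    assert_eq!(next_deadline, 450);
}
```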

codecov bot commented Apr 21, 2022

Codecov Report

Merging #2005 (0499228) into main (99aede2) will decrease coverage by 0.00%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main    #2005      +/-   ##
==========================================
- Coverage   70.79%   70.79%   -0.01%     
==========================================
  Files         627      627              
  Lines       80775    80788      +13     
==========================================
+ Hits        57188    57190       +2     
- Misses      23587    23598      +11     
Flag Coverage Δ
rust 70.79% <100.00%> (-0.01%) ⬇️

Impacted Files                                          Coverage Δ
src/meta/src/barrier/info.rs                            100.00% <100.00%> (ø)
src/meta/src/barrier/mod.rs                             69.00% <100.00%> (-1.07%) ⬇️
src/meta/src/barrier/command.rs                         56.07% <0.00%> (-2.81%) ⬇️
.../src/executor/managed_state/aggregation/extreme.rs   90.02% <0.00%> (-0.27%) ⬇️
src/meta/src/hummock/hummock_manager.rs                 91.23% <0.00%> (-0.11%) ⬇️

@BugenZhao BugenZhao left a comment

@yezizp2012 yezizp2012 merged commit 0a31fa7 into main Apr 21, 2022
@yezizp2012 yezizp2012 deleted the fix/epoch-check-panic branch April 21, 2022 05:07
@yezizp2012

Can we also remove the check and assert here?

https://github.com/singularity-data/risingwave/blob/4a8a66bdb84423733ce7a84bd9e2630460aecdab/src/meta/src/barrier/mod.rs#L344-L348

Emmm, we may have to keep this check. It is applied per node, and in the future some nodes may contain no actors while others do.

Labels
type/fix Bug fix
Projects
None yet
Development

Successfully merging this pull request may close these issues.

meta: epoch goes backwards
3 participants