Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reproduce safety error in Fast-HotStuff with Twins #55

Merged
merged 31 commits into from
Aug 1, 2022
Merged

Conversation

johningve
Copy link
Member

@johningve johningve commented May 21, 2022

This PR adds a test case showing that our Twins implementation can detect the safety violation in Fast-HotStuff that was found in the Twins paper.

The main problem I encountered is this:

Because our view synchronizer requires a timeout certificate created from a quorum of timeout messages, All nodes become stuck because they cannot gather a quorum of timeout messages:

At round 5, node B timeouts. It is prevented from sending timeout messages to the other nodes due to the firewall of round 5.
At round 6, nodes A and D timeout. They are prevented from communicating with node B due to the firewall of round 6.
At round 7, node C timeouts. It is prevented from communicating with the other nodes due to the firewall of round 7.

Several changes were made to the Twins executor to get it working:

  • Each node now runs concurrently. EDIT: I figured out how to run the event loops in turns.
  • Nodes are allowed to advance to the next network view on timeout.
  • Each scenario is executed for a fixed amount of time with fixed timeout durations and message delays. EDIT: now, the executor runs for some number of "rounds" in which all replicas are allowed to process events

The test scenario results in the following commits, where node 2 disagrees with all other nodes at height 4:

fhsbug_test.go:123: Node r1n1 commits: 
                 Proposer: 1, View: 1, Hash: TNPUwB
                 Proposer: 1, View: 2, Hash: /kQdhx
                 Proposer: 1, View: 3, Hash: 9GFYkx
                 Proposer: 1, View: 6, Hash: PgtGrf
        
    fhsbug_test.go:123: Node r2n2 commits: 
                 Proposer: 1, View: 1, Hash: TNPUwB
                 Proposer: 1, View: 2, Hash: /kQdhx
                 Proposer: 1, View: 3, Hash: 9GFYkx
                 Proposer: 1, View: 4, Hash: aRlvqx
        
    fhsbug_test.go:123: Node r3n3 commits: 
                 Proposer: 1, View: 1, Hash: TNPUwB
                 Proposer: 1, View: 2, Hash: /kQdhx
                 Proposer: 1, View: 3, Hash: 9GFYkx
                 Proposer: 1, View: 6, Hash: PgtGrf
        
    fhsbug_test.go:123: Node r4n4 commits: 
                 Proposer: 1, View: 1, Hash: TNPUwB
                 Proposer: 1, View: 2, Hash: /kQdhx
                 Proposer: 1, View: 3, Hash: 9GFYkx
                 Proposer: 1, View: 6, Hash: PgtGrf

This makes Propose() continue even if the QC block could not be found.

There is no reason why the proposal should be aborted if this happens.
This disables the check of the proposer's identity in OnPropose.
By disabling this, we can perform the HighQC update in AdvanceView
This enables the GetLeader function to work correctly with the absence
of an early HighQC update.
Broadcast and strict mode are on by default, which should be consistent
with the old behavior.
This advances the node to the next view in the network when a timeout
occurs, which should help prevent nodes from becoming "stuck".
@codecov-commenter
Copy link

codecov-commenter commented May 23, 2022

Codecov Report

Merging #55 (bde5359) into master (3e72795) will increase coverage by 1.25%.
The diff coverage is 91.47%.

@@            Coverage Diff             @@
##           master      #55      +/-   ##
==========================================
+ Coverage   53.72%   54.97%   +1.25%     
==========================================
  Files          70       70              
  Lines        6362     6437      +75     
==========================================
+ Hits         3418     3539     +121     
+ Misses       2670     2626      -44     
+ Partials      274      272       -2     
Flag Coverage Δ
unittests 54.97% <91.47%> (+1.25%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
consensus/modules.go 95.69% <ø> (+2.15%) ⬆️
crypto/ecdsa/ecdsa.go 60.52% <ø> (-1.46%) ⬇️
internal/cli/twins.go 6.91% <16.66%> (+0.15%) ⬆️
consensus/consensus.go 68.09% <50.00%> (+4.62%) ⬆️
twins/network.go 87.72% <95.52%> (+1.93%) ⬆️
consensus/events.go 100.00% <100.00%> (ø)
consensus/types.go 79.45% <100.00%> (+12.02%) ⬆️
synchronizer/synchronizer.go 85.82% <100.00%> (+2.64%) ⬆️
twins/generator.go 93.13% <100.00%> (+1.19%) ⬆️
twins/scenario.go 96.77% <100.00%> (+5.52%) ⬆️
... and 16 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3e72795...bde5359. Read the comment docs.

@johningve johningve marked this pull request as ready for review May 23, 2022 10:38
@johningve johningve requested a review from meling May 23, 2022 10:39
This changes the way the executor uses rounds.
In a round, each replica is allowed to process all of its pending
events. Any messages that are sent are held by the network until the
start of the next round. A timeout mananger module sends a timeout every
5 rounds unless a view change happens.
@johningve johningve marked this pull request as draft May 23, 2022 14:03
@johningve johningve marked this pull request as ready for review May 24, 2022 09:12
@johningve johningve merged commit f7d16d3 into master Aug 1, 2022
@johningve johningve deleted the fhsbug_test branch August 1, 2022 12:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants