New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ADR] ADR-36: blockchain reactor refactor spec #3506
Conversation
milosevic
commented
Mar 29, 2019
- Updated all relevant documentation in docs
- Updated all code comments where relevant
- Wrote tests
- Updated CHANGELOG_PENDING.md
@milosevic Corrected the markdown and fixed up some Go indentation and naming inconsistencies. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really like the proposed solution (obvious reasons ;). It is clear what the responsibilties are and how the controller could work.
The event handling code is a quite dense and could benefit from inline comments for easier reasoning of the proposal.
|
||
type ControllerState struct { | ||
Height int64 // the first block that is not committed | ||
State State // mode of operation of the state machine |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As this is part of ControllerState
and the enum does have the suffix Mode
already it might be beneficial streamline the naming to:
type Mode int
const (
ModeUnknown State = iota
ModeFastSync
ModeConsensus
)
type ControllerState struct {
Height int64 // the first block that is not committed
Mode Mode // mode of operation of the state machine
PeerMap map[ID]PeerStats // map of peer IDs to peer statistics
MaxRequestPending int64 // maximum height of the pending requests
FailedRequests []int64 // list of failed block requests
PendingRequestsNum int // number of pending requests
Store []BlockInfo // contains list of downloaded blocks
Executor BlockExecutor // store, verify and executes blocks
}
The `Controller` is initialized by providing initial height (`startHeight`) from which it will start downloading blocks from peers. | ||
|
||
``` go | ||
func ControllerInit(state ControllerState, startHeight int64) ControllerState { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does init get an exisiting state? Is it not supposed to create a fresh state for us?
If so a more idiomatic name would be NewControllerState
.
state.PendingRequestsNum = 0 | ||
} | ||
|
||
func ControllerHandle(event Event, state ControllerState) (ControllerState, Message, TimeoutTrigger, Error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be more explicit:
handleEvent(state ControllerState, event Event) (ControllerState, Message, Timeout, Error)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As an alternative the FSM handle function could just return the new state and any error that occurred while processing the event from old state. Any actions (like sending a message or triggering a timeout) would be part of the handle implementation (action functions). Where needed they can make calls to interface functions where we can plug in tester or real reactors.
I was also wondering if instead or along with the code we could have a diagram or description on what happens on peer removal and update. There is also a case currently where we detect slow peers. We should also put in place the design to have N pending requests, I don't think 1 will be enough. This complicates things quite a bit unfortunately.
state = verifyBlocks(state) // it can lead to event.PeerID being removed from peer list | ||
|
||
if _, ok := state.PeerMap[event.PeerID]; ok { | ||
msg = createBlockRequestMessage(state, event.PeerID, peerStats.Height) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previous blockchain reactor will send multi block requests in the same time, but new version only send new request until last block have received, will it slow down?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the new version, there is one pending request per peer. After a response (block) arrives, we will try to send a new request to the peer. This makes protocol more reactive towards responsive peers as the peers that send responses faster will be asked more. Having multiple pending requests (especially big number of requests as in the current version) is problematic as it opens the space for DDoS attacks. It is also not clear that having many pending requests is indeed more efficient. We will need to measure this. The main goal of this ADR is correctness and improved testability, and this should hopefully make performance improvements easier after.
verified = true | ||
|
||
if verified { | ||
block.Execute() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
step of block.Execute()
takes most of the time of state transition. Since state transition is serial. We may find a way to let block.Execute()
tasks run async.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree. Verifying and executing blocks are good candidates to be isolated in a separate routine but as stated above the idea is to focus on performance improvements after (if needed).
I think it should be in scope for the blockchain reactor to sync with a known root of trust at a certain height. ie. syncing with weak subjectivity |
Co-Authored-By: milosevic <zarko@interchain.io>
Codecov Report
@@ Coverage Diff @@
## develop #3506 +/- ##
==========================================
+ Coverage 64.28% 64.4% +0.11%
==========================================
Files 213 213
Lines 17426 17456 +30
==========================================
+ Hits 11203 11242 +39
+ Misses 5282 5275 -7
+ Partials 941 939 -2
|
@zmanian In terms of specification, does this mean that "fast-sync" will have as an input parameter hash of the block (and maybe some additional info) at some height H provided by the known root of trust, so the protocol can verify it once it reaches height H? |
Block Block | ||
Commit Commit | ||
PeerID ID | ||
CurrentHeight int64 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rename to CurrentPeerHeight
? Otherwise not clear what CurrentHeight
refers to.
|
||
The Controller state machine can be in two modes (states): `FastSyncMode` when | ||
it is trying to catch up with the network by downloading committed blocks, | ||
and `ConsensusMode` in which it executes Tendermint consensus protocol. We |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we even need consensus mode in blockchain package?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should help other peers sync-up even during consensus mode. Idea here is to emphasise that fast-sync is a mode of operation and not one shot functionality (as it is in the current code).
blockNumber = r | ||
delete(state.FailedRequests, r) | ||
} else if state.MaxRequestPending < peerHeight { | ||
state.MaxRequestPending++ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't we set it to peerHeight
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also, I can't find the place where we decrease it after the peer with the max height disconnects.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no notion of max peer height in this proposal (therefore it is never reset). Termination mechanism is different (based on number of pending requests, size of the peer set and termination timeout). I have added more comments in the latest version. MaxRequestPending is the maximum height for which request is made to some peer.
|
||
func createBlockRequestMessage(state ControllerState, peerID ID, peerHeight int64) BlockRequestMessage { | ||
msg = nil | ||
blockNumber = -1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rename to BlockHeight
?
faulty nodes, i.e., it is possible that all nodes in some peer set are faulty. | ||
* we assume that communication between correct nodes is synchronous, i.e., if a correct node `p` sends a message `m` to a correct node `q` at time `t`, then `q` will receive | ||
message the latest at time `t+Delta` where `Delta` is a system parameter that is known. `Delta` is normally order of magnitude higher than the real communication delay | ||
(maximum) between correct nodes. Therefore if a correct node `p` sends a request message to a correct node `q` at time `t` and there is no the corresponding reply at time |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there is no the corresponding reply at time
(maximum) between correct nodes. Therefore if a correct node `p` sends a request message to a correct node `q` at time `t` and there is no the corresponding reply at time | |
(maximum) between correct nodes. Therefore if a correct node `p` sends a request message to a correct node `q` at time `t` and there is no corresponding reply at time |
type NewBlock struct { | ||
Height int64 | ||
Block Block | ||
Commit Commit |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need an overview of how all these data structures are used and when. Some control flow diagrams would be useful.
Also, a part of this ADR should be the concurrency model/ design for scale/performance and testability. I believe this would be part of what you call the Executor task. I did some experiments with a version of the reactor poolRoutine that incorporates some of the Controller/ FSM changes. Sending one request per peer is slowing things down dramatically. Having hardcoded fixed limits may not be the best approach, maybe adaptive would work better. But in order to know, we need to have a good design (e.g. collect good stats, be able to abstract the things we sync, etc.) to understand what happens. Eyeing the logs is a nightmare. Not sure what we have today to collect operational data from nodes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will try to add control diagram similar as we have for the current version. Maybe you can share source of the current diagram? Regarding performance concerns, extending this proposal to send multiple requests to a peer should be relatively straightforward. I will try to do it. I agree that number of pending requests per peer should be adaptive in ideal case but we need to estimate the complexity of such approach. We can maybe start with adding multiple pending requests (hardcoded value) to this proposal as this will let faster peer send multiple responses, and have them faster in position to serve new requests.
|
||
type RemovePeer struct { | ||
PeerID ID | ||
Height int64 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need Height here? In what conditions is this sent? Switch removes peer on error? Heartbeat loss?
Similar questions for the messages/ types below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need Height, good point. It could be in case switch removes peer on error or in case StatusReport is not received on time.
msg = createBlockResponseMessage(state, event) | ||
return state, msg, timeout, error | ||
default: | ||
error = "Only respond to BlockRequests while in ConsensusMode!" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a note that this is not the case with current implementation.
switch event := event.(type) { | ||
case EventBlockRequest: | ||
msg = createBlockResponseMessage(state, event) | ||
return state, msg, timeout, error |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here we are not in consensus mode and we reply to Block request messages. Also, this (the fast sync helper functionality) could be done outside the controller.
Edit: sorry misread the code.
msg = createBlockResponseMessage(state, event) | ||
return state, msg, timeout, error | ||
default: | ||
error = "Only respond to BlockRequests while in ModeConsensus!" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
error = "Only respond to BlockRequests while in ModeConsensus!" | |
error = "While in ModeConsensus only respond to Block Requests" |
What about the status requests? Where do we reply to those?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is part of the Heartbeat mechanism that is part of the Executor logic. I assumed that Executor send periodically to each peer StatusRequest message and expects StatusReport to arrive within response timeout. Controller only see StatusReport event and RemovePeer in case peer is removed or has timed out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's in a name... "peer"? :)
A node that had previously answered to the status request broadcast? Or the peers the switch knows about (here we don't know the peer fast sync status)?
If the former, will Executor keep track of peers in order to know who didn't answer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A peer is a node with which we have open connection (at the switch level). Executor sends status requests to all peers and need to manage timeouts. Therefore, we probably need to have peer lists at the Executor level also to able to determine who didn't responded on time.
@milosevic Given that we want to stick to consistent ADR numbering captured in #2313 this ADR should get the number 40. I amended the original issue accordingly. |
* Add blockchain reactor refactor ADR * docs: Correct markdown and go syntax in adr-036 * Update docs/architecture/adr-036-blockchain-reactor-refactor.md Co-Authored-By: milosevic <zarko@interchain.io> * Add comments and address reviewer comments * Improve formating * Small improvements * adr-036 -> adr-040