
[ADR] ADR-36: blockchain reactor refactor spec #3506

Merged
merged 8 commits into develop from zm_blockchain_reactor_spec on May 10, 2019

Conversation

milosevic (Contributor)

  • Updated all relevant documentation in docs
  • Updated all code comments where relevant
  • Wrote tests
  • Updated CHANGELOG_PENDING.md

@xla xla changed the title [ADR]: ADR-36: blockchain reactor refactor spec [ADR] ADR-36: blockchain reactor refactor spec Mar 29, 2019

xla commented Mar 29, 2019

@milosevic Corrected the markdown and fixed up some Go indentation and naming inconsistencies.

@xla left a comment (Contributor)

Really like the proposed solution (obvious reasons ;). It is clear what the responsibilities are and how the controller could work.

The event handling code is quite dense and could benefit from inline comments to make the proposal easier to reason about.


type ControllerState struct {
Height int64 // the first block that is not committed
State State // mode of operation of the state machine
Contributor:

As this is part of ControllerState and the enum values already carry the Mode suffix, it might be beneficial to streamline the naming to:

type Mode int

const (
	ModeUnknown Mode = iota
	ModeFastSync
	ModeConsensus
)

type ControllerState struct {
	Height             int64            // the first block that is not committed
	Mode               Mode             // mode of operation of the state machine
	PeerMap            map[ID]PeerStats // map of peer IDs to peer statistics
	MaxRequestPending  int64            // maximum height of the pending requests
	FailedRequests     []int64          // list of failed block requests
	PendingRequestsNum int              // number of pending requests
	Store              []BlockInfo      // contains list of downloaded blocks
	Executor           BlockExecutor    // store, verify and executes blocks
}
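If the naming is streamlined this way, the zero value doubles as an "uninitialized" marker. A minimal runnable sketch, with an added String method (my addition, not part of the suggestion) for readable logs:

``` go
package main

import "fmt"

// Mode is the operating mode of the Controller state machine.
// ModeUnknown is the zero value, so an uninitialized state is detectable.
type Mode int

const (
	ModeUnknown Mode = iota
	ModeFastSync
	ModeConsensus
)

// String makes Mode values readable in logs and error messages.
func (m Mode) String() string {
	switch m {
	case ModeFastSync:
		return "FastSync"
	case ModeConsensus:
		return "Consensus"
	default:
		return "Unknown"
	}
}

func main() {
	fmt.Println(ModeFastSync) // fmt picks up the String method
}
```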

The `Controller` is initialized with an initial height (`startHeight`) from which it will start downloading blocks from peers.

``` go
func ControllerInit(state ControllerState, startHeight int64) ControllerState {
Contributor:

Why does init get an existing state? Is it not supposed to create a fresh state for us?

If so, a more idiomatic name would be `NewControllerState`.
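A sketch of what such a constructor might look like, using the field names from the ControllerState snippet above (the MaxRequestPending initialization is my assumption, not the ADR's):

``` go
package main

import "fmt"

type (
	ID        string
	PeerStats struct{ Height int64 }
	BlockInfo struct{}
)

type ControllerState struct {
	Height             int64
	PeerMap            map[ID]PeerStats
	MaxRequestPending  int64
	FailedRequests     []int64
	PendingRequestsNum int
	Store              []BlockInfo
}

// NewControllerState returns a fresh state starting at startHeight,
// rather than mutating a caller-provided state.
func NewControllerState(startHeight int64) ControllerState {
	return ControllerState{
		Height:            startHeight,
		PeerMap:           make(map[ID]PeerStats),
		MaxRequestPending: startHeight - 1, // no requests sent yet
	}
}

func main() {
	s := NewControllerState(100)
	fmt.Println(s.Height, s.PendingRequestsNum)
}
```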

state.PendingRequestsNum = 0
}

func ControllerHandle(event Event, state ControllerState) (ControllerState, Message, TimeoutTrigger, Error) {
Contributor:

This could be more explicit:

handleEvent(state ControllerState, event Event) (ControllerState, Message, Timeout, Error)
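The explicit signature could dispatch on the event type; a hedged sketch under that reading (the event and message types here are placeholders, not the ADR's exact definitions):

``` go
package main

import (
	"errors"
	"fmt"
)

type (
	Event   interface{}
	Message interface{}
	Timeout struct{ Name string }
)

type EventStatusReport struct{ PeerHeight int64 }
type EventTimeout struct{ Name string }
type BlockRequestMessage struct{ Height int64 }

type ControllerState struct{ Height int64 }

// handleEvent is a pure function: it consumes the old state and one event,
// and returns the new state plus any message/timeout to emit.
func handleEvent(state ControllerState, event Event) (ControllerState, Message, *Timeout, error) {
	switch ev := event.(type) {
	case EventStatusReport:
		if ev.PeerHeight > state.Height {
			// The peer is ahead of us: emit a request for our next block.
			return state, BlockRequestMessage{Height: state.Height}, nil, nil
		}
		return state, nil, nil, nil
	case EventTimeout:
		return state, nil, nil, errors.New("timeout fired: " + ev.Name)
	default:
		return state, nil, nil, errors.New("unhandled event")
	}
}

func main() {
	st := ControllerState{Height: 10}
	_, msg, _, _ := handleEvent(st, EventStatusReport{PeerHeight: 20})
	fmt.Printf("%+v\n", msg)
}
```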

Contributor:

As an alternative, the FSM handle function could just return the new state and any error that occurred while processing the event from the old state. Any actions (like sending a message or triggering a timeout) would be part of the handle implementation (action functions). Where needed, they can call interface functions where we can plug in test or real reactors.

I was also wondering if, instead of or along with the code, we could have a diagram or description of what happens on peer removal and update. There is also a case in the current code where we detect slow peers. We should also put in place the design for N pending requests; I don't think 1 will be enough. Unfortunately this complicates things quite a bit.
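One reading of this alternative: the handler returns only the new state plus an error, and performs actions through an injected interface, so a test double can stand in for the real reactor. A sketch under that assumption (the Actions interface and all names are illustrative):

``` go
package main

import "fmt"

type BlockRequest struct {
	Height int64
	PeerID string
}

// Actions abstracts side effects so a test double can replace the real reactor.
type Actions interface {
	SendBlockRequest(req BlockRequest)
	ScheduleTimeout(name string)
}

type State struct{ NextHeight int64 }

// handle advances the state and drives side effects through acts instead of
// returning messages to the caller.
func handle(state State, peerID string, acts Actions) State {
	acts.SendBlockRequest(BlockRequest{Height: state.NextHeight, PeerID: peerID})
	acts.ScheduleTimeout("BlockResponseTimeout")
	state.NextHeight++
	return state
}

// recorder is a test double that just records what the handler asked for.
type recorder struct {
	sent     []BlockRequest
	timeouts []string
}

func (r *recorder) SendBlockRequest(req BlockRequest) { r.sent = append(r.sent, req) }
func (r *recorder) ScheduleTimeout(name string)       { r.timeouts = append(r.timeouts, name) }

func main() {
	rec := &recorder{}
	s := handle(State{NextHeight: 5}, "peer1", rec)
	fmt.Println(s.NextHeight, len(rec.sent))
}
```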

state = verifyBlocks(state) // it can lead to event.PeerID being removed from peer list

if _, ok := state.PeerMap[event.PeerID]; ok {
msg = createBlockRequestMessage(state, event.PeerID, peerStats.Height)
Contributor:

The previous blockchain reactor sends multiple block requests at the same time, but the new version only sends a new request after the last block has been received. Will this slow things down?

Contributor Author:

With the new version, there is one pending request per peer. After a response (block) arrives, we try to send a new request to that peer. This makes the protocol more reactive towards responsive peers, as peers that send responses faster will be asked more often. Having multiple pending requests (especially the large number of requests in the current version) is problematic, as it opens space for DDoS attacks. It is also not clear that having many pending requests is actually more efficient; we will need to measure this. The main goal of this ADR is correctness and improved testability, which should make performance improvements easier afterwards.
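The one-pending-request-per-peer bookkeeping described above could be sketched like this (types and names are my assumptions, not the ADR's code): a peer gets a new request only once its previous one is answered, so faster peers are naturally asked more often.

``` go
package main

import "fmt"

type peerState struct {
	pending bool  // is there an outstanding request to this peer?
	height  int64 // last reported peer height
}

type scheduler struct {
	peers      map[string]*peerState
	nextHeight int64
}

// onResponse marks the peer free again and immediately tries to reuse it,
// which is what makes responsive peers serve more blocks.
func (s *scheduler) onResponse(peerID string) (string, int64, bool) {
	if p, ok := s.peers[peerID]; ok {
		p.pending = false
	}
	return s.nextRequest()
}

// nextRequest picks any free peer whose height covers the next needed block.
func (s *scheduler) nextRequest() (string, int64, bool) {
	for id, p := range s.peers {
		if !p.pending && p.height >= s.nextHeight {
			p.pending = true
			h := s.nextHeight
			s.nextHeight++
			return id, h, true
		}
	}
	return "", 0, false // every eligible peer already has a request in flight
}

func main() {
	s := &scheduler{
		peers:      map[string]*peerState{"p1": {height: 100}},
		nextHeight: 10,
	}
	id, h, ok := s.nextRequest()
	fmt.Println(id, h, ok)
	_, _, ok = s.nextRequest() // p1 is busy: nothing to send
	fmt.Println(ok)
}
```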

verified = true

if verified {
block.Execute()
Contributor:

The block.Execute() step takes most of the time of a state transition, and state transitions are serial. We may want to find a way to let block.Execute() tasks run asynchronously.

Contributor Author:

Agreed. Verifying and executing blocks are good candidates to be isolated in a separate routine, but as stated above the idea is to focus on performance improvements afterwards (if needed).
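Isolating execution in a separate routine, as suggested, could look like the sketch below (names are illustrative): blocks flow through a channel to a dedicated executor goroutine, so the state-machine loop is not blocked while the channel preserves ordering.

``` go
package main

import "fmt"

type Block struct{ Height int64 }

// startExecutor consumes verified blocks from a channel in its own
// goroutine and reports executed heights back, keeping block execution
// off the state machine's critical path while preserving order.
func startExecutor(blocks <-chan Block) <-chan int64 {
	done := make(chan int64)
	go func() {
		defer close(done)
		for b := range blocks {
			// Real code would run block execution (ABCI calls, state
			// updates) here before reporting the height.
			done <- b.Height
		}
	}()
	return done
}

func main() {
	blocks := make(chan Block, 3)
	executed := startExecutor(blocks)
	for h := int64(1); h <= 3; h++ {
		blocks <- Block{Height: h}
	}
	close(blocks)
	for h := range executed {
		fmt.Println("executed block", h)
	}
}
```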

@zmanian (Contributor)

zmanian commented Apr 1, 2019

I think it should be in scope for the blockchain reactor to sync with a known root of trust at a certain height, i.e., syncing with weak subjectivity.

@codecov-io

codecov-io commented Apr 1, 2019

Codecov Report

Merging #3506 into develop will increase coverage by 0.11%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           develop   #3506      +/-   ##
==========================================
+ Coverage    64.28%   64.4%   +0.11%     
==========================================
  Files          213     213              
  Lines        17426   17456      +30     
==========================================
+ Hits         11203   11242      +39     
+ Misses        5282    5275       -7     
+ Partials       941     939       -2
| Impacted Files | Coverage Δ |
|---|---|
| libs/db/debug_db.go | 15.78% <0%> (-4.53%) ⬇️ |
| blockchain/pool.go | 80.26% <0%> (-0.99%) ⬇️ |
| rpc/core/status.go | 0% <0%> (ø) ⬆️ |
| consensus/state.go | 79.29% <0%> (+0.47%) ⬆️ |
| consensus/reactor.go | 72.96% <0%> (+0.94%) ⬆️ |
| blockchain/reactor.go | 71.49% <0%> (+0.96%) ⬆️ |
| libs/db/go_level_db.go | 66.46% <0%> (+1.41%) ⬆️ |

@milosevic (Contributor Author)

@zmanian In terms of specification, does this mean that "fast-sync" will take as an input parameter the hash of the block (and maybe some additional info) at some height H provided by the known root of trust, so the protocol can verify it once it reaches height H?

Block Block
Commit Commit
PeerID ID
CurrentHeight int64
Contributor:

Rename to CurrentPeerHeight? Otherwise it is not clear what CurrentHeight refers to.


The Controller state machine can be in two modes (states): `FastSyncMode` when
it is trying to catch up with the network by downloading committed blocks,
and `ConsensusMode` in which it executes Tendermint consensus protocol. We
Contributor:

Do we even need consensus mode in blockchain package?

Contributor Author:

We should help other peers sync up even during consensus mode. The idea here is to emphasize that fast-sync is a mode of operation and not a one-shot functionality (as it is in the current code).

blockNumber = r
delete(state.FailedRequests, r)
} else if state.MaxRequestPending < peerHeight {
state.MaxRequestPending++
Contributor:

shouldn't we set it to peerHeight?

Contributor:

also, I can't find the place where we decrease it after the peer with the max height disconnects.

Contributor Author:

There is no notion of max peer height in this proposal (therefore it is never reset). The termination mechanism is different (based on the number of pending requests, the size of the peer set, and a termination timeout). I have added more comments in the latest version. MaxRequestPending is the maximum height for which a request has been made to some peer.
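My reading of the height-selection logic in the snippets above, as a runnable sketch (the ADR excerpts show FailedRequests both as a slice and as a map; this sketch uses a slice): retry a failed height first, otherwise advance MaxRequestPending, capped by the peer's height.

``` go
package main

import "fmt"

type state struct {
	MaxRequestPending int64
	FailedRequests    []int64
}

// nextRequestHeight prefers re-requesting a previously failed height;
// otherwise it asks for the next height above MaxRequestPending, but
// never beyond what the peer can serve.
func nextRequestHeight(s *state, peerHeight int64) (int64, bool) {
	if n := len(s.FailedRequests); n > 0 {
		h := s.FailedRequests[n-1]
		s.FailedRequests = s.FailedRequests[:n-1]
		return h, true
	}
	if s.MaxRequestPending < peerHeight {
		s.MaxRequestPending++
		return s.MaxRequestPending, true
	}
	return 0, false // nothing this peer can usefully serve
}

func main() {
	s := &state{MaxRequestPending: 7, FailedRequests: []int64{5}}
	h, _ := nextRequestHeight(s, 10)
	fmt.Println(h) // retries the failed height first
	h, _ = nextRequestHeight(s, 10)
	fmt.Println(h) // then advances MaxRequestPending to 8
}
```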


func createBlockRequestMessage(state ControllerState, peerID ID, peerHeight int64) BlockRequestMessage {
msg = nil
blockNumber = -1
Contributor:

rename to BlockHeight?

faulty nodes, i.e., it is possible that all nodes in some peer set are faulty.
* we assume that communication between correct nodes is synchronous, i.e., if a correct node `p` sends a message `m` to a correct node `q` at time `t`, then `q` will receive
message the latest at time `t+Delta` where `Delta` is a system parameter that is known. `Delta` is normally order of magnitude higher than the real communication delay
(maximum) between correct nodes. Therefore if a correct node `p` sends a request message to a correct node `q` at time `t` and there is no the corresponding reply at time
Contributor:

there is no the corresponding reply at time

Suggested change
(maximum) between correct nodes. Therefore if a correct node `p` sends a request message to a correct node `q` at time `t` and there is no the corresponding reply at time
(maximum) between correct nodes. Therefore if a correct node `p` sends a request message to a correct node `q` at time `t` and there is no corresponding reply at time

type NewBlock struct {
Height int64
Block Block
Commit Commit
Contributor:

I think we need an overview of how all these data structures are used and when. Some control-flow diagrams would be useful.
Also, part of this ADR should be the concurrency model/design for scale, performance, and testability. I believe this would be part of what you call the Executor task. I did some experiments with a version of the reactor poolRoutine that incorporates some of the Controller/FSM changes: sending one request per peer slows things down dramatically. Having hardcoded fixed limits may not be the best approach; adaptive limits might work better. But in order to know, we need a good design (e.g. collect good stats, be able to abstract the things we sync, etc.) to understand what happens. Eyeing the logs is a nightmare. I'm not sure what we have today to collect operational data from nodes.

Contributor Author:

I will try to add a control diagram similar to the one we have for the current version; maybe you can share the source of that diagram? Regarding performance concerns, extending this proposal to send multiple requests to a peer should be relatively straightforward, and I will try to do it. I agree that in the ideal case the number of pending requests per peer should be adaptive, but we need to estimate the complexity of such an approach. We can maybe start by adding multiple pending requests (a hardcoded value) to this proposal, as this will let faster peers send multiple responses and put them in position to serve new requests sooner.
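Extending the scheduler to N pending requests per peer, as discussed, might look like the sketch below (maxPendingPerPeer is a hypothetical hardcoded constant, not a value from the ADR):

``` go
package main

import "fmt"

const maxPendingPerPeer = 10 // hardcoded cap; could later be made adaptive

type peer struct {
	pending int
	height  int64
}

// assignRequests hands out block heights to peers until each peer reaches
// the pending cap or there are no heights left below the peer's height.
func assignRequests(peers map[string]*peer, next int64) map[string][]int64 {
	out := make(map[string][]int64)
	progress := true
	for progress {
		progress = false
		for id, p := range peers {
			if p.pending < maxPendingPerPeer && p.height >= next {
				out[id] = append(out[id], next)
				p.pending++
				next++
				progress = true
			}
		}
	}
	return out
}

func main() {
	peers := map[string]*peer{"p1": {height: 1000}}
	reqs := assignRequests(peers, 1)
	fmt.Println(len(reqs["p1"])) // capped at maxPendingPerPeer
}
```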


type RemovePeer struct {
PeerID ID
Height int64
Contributor:

Do we need Height here? Under what conditions is this sent? Does the switch remove a peer on error? On heartbeat loss?
Similar questions for the messages/types below.

Contributor Author:

We don't need Height, good point. It could be sent when the switch removes a peer on error, or when a StatusReport is not received on time.

msg = createBlockResponseMessage(state, event)
return state, msg, timeout, error
default:
error = "Only respond to BlockRequests while in ConsensusMode!"
Contributor:

Just a note that this is not the case with the current implementation.

switch event := event.(type) {
case EventBlockRequest:
msg = createBlockResponseMessage(state, event)
return state, msg, timeout, error
@ancazamfir (Contributor) Apr 1, 2019

Here we are not in consensus mode and we reply to BlockRequest messages. Also, this (the fast-sync helper functionality) could be done outside the controller.
Edit: sorry, misread the code.

msg = createBlockResponseMessage(state, event)
return state, msg, timeout, error
default:
error = "Only respond to BlockRequests while in ModeConsensus!"
Contributor:

Suggested change
error = "Only respond to BlockRequests while in ModeConsensus!"
error = "While in ModeConsensus only respond to Block Requests"

What about the status requests? Where do we reply to those?

@milosevic (Contributor Author) Apr 1, 2019

It is part of the heartbeat mechanism, which belongs to the Executor logic. I assumed that the Executor periodically sends a StatusRequest message to each peer and expects a StatusReport to arrive within the response timeout. The Controller only sees StatusReport events, plus RemovePeer in case a peer is removed or has timed out.

Contributor:

What's in a name... "peer"? :)
Is it a node that had previously answered the status request broadcast, or the peers the switch knows about (where we don't know the peer's fast-sync status)?
If the former, will the Executor keep track of peers in order to know who didn't answer?

Contributor Author:

A peer is a node with which we have an open connection (at the switch level). The Executor sends status requests to all peers and needs to manage timeouts. Therefore, we probably need peer lists at the Executor level too, to be able to determine who didn't respond on time.

@xla

xla commented Apr 14, 2019

@milosevic Given that we want to stick to consistent ADR numbering captured in #2313 this ADR should get the number 40. I amended the original issue accordingly.

@ebuchman ebuchman mentioned this pull request Apr 14, 2019
@ebuchman ebuchman merged commit 6f1ccb6 into develop May 10, 2019
@ebuchman ebuchman deleted the zm_blockchain_reactor_spec branch May 10, 2019 17:37
@melekes melekes mentioned this pull request May 30, 2019
44 tasks
brapse pushed a commit to brapse/tendermint that referenced this pull request Jun 5, 2019
* Add blockchain reactor refactor ADR

* docs: Correct markdown and go syntax in adr-036

* Update docs/architecture/adr-036-blockchain-reactor-refactor.md

Co-Authored-By: milosevic <zarko@interchain.io>

* Add comments and address reviewer comments

* Improve formating

* Small improvements

* adr-036 -> adr-040