
[ADR] ADR-36: blockchain reactor refactor spec #3506

Merged
merged 8 commits into develop from zm_blockchain_reactor_spec on May 10, 2019

Conversation

milosevic (Contributor)

  • Updated all relevant documentation in docs
  • Updated all code comments where relevant
  • Wrote tests
  • Updated CHANGELOG_PENDING.md

@xla xla changed the title [ADR]: ADR-36: blockchain reactor refactor spec [ADR] ADR-36: blockchain reactor refactor spec Mar 29, 2019

xla commented Mar 29, 2019

@milosevic Corrected the markdown and fixed up some Go indentation and naming inconsistencies.

@xla left a comment (Contributor)

Really like the proposed solution (obvious reasons ;). It is clear what the responsibilities are and how the controller could work.

The event handling code is quite dense and could benefit from inline comments to make the proposal easier to reason about.


type ControllerState struct {
Height int64 // the first block that is not committed
State State // mode of operation of the state machine
Contributor:

As this is part of ControllerState and the enum values already carry the Mode suffix, it might be beneficial to streamline the naming to:

type Mode int

const (
	ModeUnknown Mode = iota
	ModeFastSync
	ModeConsensus
)

type ControllerState struct {
	Height             int64            // the first block that is not committed
	Mode               Mode             // mode of operation of the state machine
	PeerMap            map[ID]PeerStats // map of peer IDs to peer statistics
	MaxRequestPending  int64            // maximum height of the pending requests
	FailedRequests     []int64          // list of failed block requests
	PendingRequestsNum int              // number of pending requests
	Store              []BlockInfo      // contains list of downloaded blocks
	Executor           BlockExecutor    // store, verify and executes blocks
}
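If the naming is streamlined this way, the zero value doubles as an "uninitialized" marker. A minimal runnable sketch, with an added String method (my addition, not part of the suggestion) for readable logs:

``` go
package main

import "fmt"

// Mode is the operating mode of the Controller state machine.
// ModeUnknown is the zero value, so an uninitialized state is detectable.
type Mode int

const (
	ModeUnknown Mode = iota
	ModeFastSync
	ModeConsensus
)

// String makes Mode values readable in logs and error messages.
func (m Mode) String() string {
	switch m {
	case ModeFastSync:
		return "FastSync"
	case ModeConsensus:
		return "Consensus"
	default:
		return "Unknown"
	}
}

func main() {
	fmt.Println(ModeFastSync) // fmt picks up the String method
}
```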

The `Controller` is initialized with an initial height (`startHeight`) from which it will start downloading blocks from peers.

``` go
func ControllerInit(state ControllerState, startHeight int64) ControllerState {
Contributor:

Why does init get an existing state? Is it not supposed to create a fresh state for us?

If so, a more idiomatic name would be `NewControllerState`.
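A sketch of what such a constructor might look like, using the field names from the ControllerState snippet above (the MaxRequestPending initialization is my assumption, not the ADR's):

``` go
package main

import "fmt"

type (
	ID        string
	PeerStats struct{ Height int64 }
	BlockInfo struct{}
)

type ControllerState struct {
	Height             int64
	PeerMap            map[ID]PeerStats
	MaxRequestPending  int64
	FailedRequests     []int64
	PendingRequestsNum int
	Store              []BlockInfo
}

// NewControllerState returns a fresh state starting at startHeight,
// rather than mutating a caller-provided state.
func NewControllerState(startHeight int64) ControllerState {
	return ControllerState{
		Height:            startHeight,
		PeerMap:           make(map[ID]PeerStats),
		MaxRequestPending: startHeight - 1, // no requests sent yet
	}
}

func main() {
	s := NewControllerState(100)
	fmt.Println(s.Height, s.PendingRequestsNum)
}
```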

state.PendingRequestsNum = 0
}

func ControllerHandle(event Event, state ControllerState) (ControllerState, Message, TimeoutTrigger, Error) {
Contributor:

This could be more explicit:

handleEvent(state ControllerState, event Event) (ControllerState, Message, Timeout, Error)
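The explicit signature could dispatch on the event type; a hedged sketch under that reading (the event and message types here are placeholders, not the ADR's exact definitions):

``` go
package main

import (
	"errors"
	"fmt"
)

type (
	Event   interface{}
	Message interface{}
	Timeout struct{ Name string }
)

type EventStatusReport struct{ PeerHeight int64 }
type EventTimeout struct{ Name string }
type BlockRequestMessage struct{ Height int64 }

type ControllerState struct{ Height int64 }

// handleEvent is a pure function: it consumes the old state and one event,
// and returns the new state plus any message/timeout to emit.
func handleEvent(state ControllerState, event Event) (ControllerState, Message, *Timeout, error) {
	switch ev := event.(type) {
	case EventStatusReport:
		if ev.PeerHeight > state.Height {
			// The peer is ahead of us: emit a request for our next block.
			return state, BlockRequestMessage{Height: state.Height}, nil, nil
		}
		return state, nil, nil, nil
	case EventTimeout:
		return state, nil, nil, errors.New("timeout fired: " + ev.Name)
	default:
		return state, nil, nil, errors.New("unhandled event")
	}
}

func main() {
	st := ControllerState{Height: 10}
	_, msg, _, _ := handleEvent(st, EventStatusReport{PeerHeight: 20})
	fmt.Printf("%+v\n", msg)
}
```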

Contributor:

As an alternative, the FSM handle function could just return the new state and any error that occurred while processing the event from the old state. Any actions (like sending a message or triggering a timeout) would be part of the handle implementation (action functions). Where needed, they can call interface functions where we can plug in test or real reactors.

I was also wondering if, instead of or along with the code, we could have a diagram or description of what happens on peer removal and update. There is also a case in the current code where we detect slow peers. We should also put in place the design for N pending requests; I don't think 1 will be enough. Unfortunately this complicates things quite a bit.
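One reading of this alternative: the handler returns only the new state plus an error, and performs actions through an injected interface, so a test double can stand in for the real reactor. A sketch under that assumption (the Actions interface and all names are illustrative):

``` go
package main

import "fmt"

type BlockRequest struct {
	Height int64
	PeerID string
}

// Actions abstracts side effects so a test double can replace the real reactor.
type Actions interface {
	SendBlockRequest(req BlockRequest)
	ScheduleTimeout(name string)
}

type State struct{ NextHeight int64 }

// handle advances the state and drives side effects through acts instead of
// returning messages to the caller.
func handle(state State, peerID string, acts Actions) State {
	acts.SendBlockRequest(BlockRequest{Height: state.NextHeight, PeerID: peerID})
	acts.ScheduleTimeout("BlockResponseTimeout")
	state.NextHeight++
	return state
}

// recorder is a test double that just records what the handler asked for.
type recorder struct {
	sent     []BlockRequest
	timeouts []string
}

func (r *recorder) SendBlockRequest(req BlockRequest) { r.sent = append(r.sent, req) }
func (r *recorder) ScheduleTimeout(name string)       { r.timeouts = append(r.timeouts, name) }

func main() {
	rec := &recorder{}
	s := handle(State{NextHeight: 5}, "peer1", rec)
	fmt.Println(s.NextHeight, len(rec.sent))
}
```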

state = verifyBlocks(state) // it can lead to event.PeerID being removed from peer list

if _, ok := state.PeerMap[event.PeerID]; ok {
msg = createBlockRequestMessage(state, event.PeerID, peerStats.Height)
Contributor:

The previous blockchain reactor sends multiple block requests at the same time, but the new version only sends a new request after the last block has been received. Will this slow things down?

Contributor Author:

With the new version, there is one pending request per peer. After a response (block) arrives, we try to send a new request to that peer. This makes the protocol more reactive towards responsive peers, as peers that send responses faster will be asked more often. Having multiple pending requests (especially the large number of requests in the current version) is problematic, as it opens space for DDoS attacks. It is also not clear that having many pending requests is actually more efficient; we will need to measure this. The main goal of this ADR is correctness and improved testability, which should make performance improvements easier afterwards.
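The one-pending-request-per-peer bookkeeping described above could be sketched like this (types and names are my assumptions, not the ADR's code): a peer gets a new request only once its previous one is answered, so faster peers are naturally asked more often.

``` go
package main

import "fmt"

type peerState struct {
	pending bool  // is there an outstanding request to this peer?
	height  int64 // last reported peer height
}

type scheduler struct {
	peers      map[string]*peerState
	nextHeight int64
}

// onResponse marks the peer free again and immediately tries to reuse it,
// which is what makes responsive peers serve more blocks.
func (s *scheduler) onResponse(peerID string) (string, int64, bool) {
	if p, ok := s.peers[peerID]; ok {
		p.pending = false
	}
	return s.nextRequest()
}

// nextRequest picks any free peer whose height covers the next needed block.
func (s *scheduler) nextRequest() (string, int64, bool) {
	for id, p := range s.peers {
		if !p.pending && p.height >= s.nextHeight {
			p.pending = true
			h := s.nextHeight
			s.nextHeight++
			return id, h, true
		}
	}
	return "", 0, false // every eligible peer already has a request in flight
}

func main() {
	s := &scheduler{
		peers:      map[string]*peerState{"p1": {height: 100}},
		nextHeight: 10,
	}
	id, h, ok := s.nextRequest()
	fmt.Println(id, h, ok)
	_, _, ok = s.nextRequest() // p1 is busy: nothing to send
	fmt.Println(ok)
}
```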

verified = true

if verified {
block.Execute()
Contributor:

The block.Execute() step takes most of the time of a state transition, and state transitions are serial. We may want to find a way to let block.Execute() tasks run asynchronously.

Contributor Author:

Agreed. Verifying and executing blocks are good candidates to be isolated in a separate routine, but as stated above the idea is to focus on performance improvements afterwards (if needed).
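Isolating execution in a separate routine, as suggested, could look like the sketch below (names are illustrative): blocks flow through a channel to a dedicated executor goroutine, so the state-machine loop is not blocked while the channel preserves ordering.

``` go
package main

import "fmt"

type Block struct{ Height int64 }

// startExecutor consumes verified blocks from a channel in its own
// goroutine and reports executed heights back, keeping block execution
// off the state machine's critical path while preserving order.
func startExecutor(blocks <-chan Block) <-chan int64 {
	done := make(chan int64)
	go func() {
		defer close(done)
		for b := range blocks {
			// Real code would run block execution (ABCI calls, state
			// updates) here before reporting the height.
			done <- b.Height
		}
	}()
	return done
}

func main() {
	blocks := make(chan Block, 3)
	executed := startExecutor(blocks)
	for h := int64(1); h <= 3; h++ {
		blocks <- Block{Height: h}
	}
	close(blocks)
	for h := range executed {
		fmt.Println("executed block", h)
	}
}
```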

@zmanian (Contributor)

zmanian commented Apr 1, 2019

I think it should be in scope for the blockchain reactor to sync with a known root of trust at a certain height, i.e., syncing with weak subjectivity.

@codecov-io

codecov-io commented Apr 1, 2019

Codecov Report

Merging #3506 into develop will increase coverage by 0.11%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           develop   #3506      +/-   ##
==========================================
+ Coverage    64.28%   64.4%   +0.11%     
==========================================
  Files          213     213              
  Lines        17426   17456      +30     
==========================================
+ Hits         11203   11242      +39     
+ Misses        5282    5275       -7     
+ Partials       941     939       -2
| Impacted Files | Coverage Δ |
|---|---|
| libs/db/debug_db.go | 15.78% <0%> (-4.53%) ⬇️ |
| blockchain/pool.go | 80.26% <0%> (-0.99%) ⬇️ |
| rpc/core/status.go | 0% <0%> (ø) ⬆️ |
| consensus/state.go | 79.29% <0%> (+0.47%) ⬆️ |
| consensus/reactor.go | 72.96% <0%> (+0.94%) ⬆️ |
| blockchain/reactor.go | 71.49% <0%> (+0.96%) ⬆️ |
| libs/db/go_level_db.go | 66.46% <0%> (+1.41%) ⬆️ |

@milosevic (Contributor Author)

@zmanian In terms of specification, does this mean that "fast-sync" will take as an input parameter the hash of the block (and maybe some additional info) at some height H provided by the known root of trust, so the protocol can verify it once it reaches height H?

Block Block
Commit Commit
PeerID ID
CurrentHeight int64
Contributor:

Rename to CurrentPeerHeight? Otherwise it is not clear what CurrentHeight refers to.


The Controller state machine can be in two modes (states): `FastSyncMode` when
it is trying to catch up with the network by downloading committed blocks,
and `ConsensusMode` in which it executes Tendermint consensus protocol. We
Contributor:

Do we even need consensus mode in blockchain package?

Contributor Author:

We should help other peers sync up even during consensus mode. The idea here is to emphasize that fast-sync is a mode of operation and not a one-shot functionality (as it is in the current code).

blockNumber = r
delete(state.FailedRequests, r)
} else if state.MaxRequestPending < peerHeight {
state.MaxRequestPending++
Contributor:

shouldn't we set it to peerHeight?

Contributor:

also, I can't find the place where we decrease it after the peer with the max height disconnects.

Contributor Author:

There is no notion of max peer height in this proposal (therefore it is never reset). The termination mechanism is different (based on the number of pending requests, the size of the peer set, and a termination timeout). I have added more comments in the latest version. MaxRequestPending is the maximum height for which a request has been made to some peer.
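My reading of the height-selection logic in the snippets above, as a runnable sketch (the ADR excerpts show FailedRequests both as a slice and as a map; this sketch uses a slice): retry a failed height first, otherwise advance MaxRequestPending, capped by the peer's height.

``` go
package main

import "fmt"

type state struct {
	MaxRequestPending int64
	FailedRequests    []int64
}

// nextRequestHeight prefers re-requesting a previously failed height;
// otherwise it asks for the next height above MaxRequestPending, but
// never beyond what the peer can serve.
func nextRequestHeight(s *state, peerHeight int64) (int64, bool) {
	if n := len(s.FailedRequests); n > 0 {
		h := s.FailedRequests[n-1]
		s.FailedRequests = s.FailedRequests[:n-1]
		return h, true
	}
	if s.MaxRequestPending < peerHeight {
		s.MaxRequestPending++
		return s.MaxRequestPending, true
	}
	return 0, false // nothing this peer can usefully serve
}

func main() {
	s := &state{MaxRequestPending: 7, FailedRequests: []int64{5}}
	h, _ := nextRequestHeight(s, 10)
	fmt.Println(h) // retries the failed height first
	h, _ = nextRequestHeight(s, 10)
	fmt.Println(h) // then advances MaxRequestPending to 8
}
```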


func createBlockRequestMessage(state ControllerState, peerID ID, peerHeight int64) BlockRequestMessage {
msg = nil
blockNumber = -1
Contributor:

rename to BlockHeight?

faulty nodes, i.e., it is possible that all nodes in some peer set are faulty.
* we assume that communication between correct nodes is synchronous, i.e., if a correct node `p` sends a message `m` to a correct node `q` at time `t`, then `q` will receive
message the latest at time `t+Delta` where `Delta` is a system parameter that is known. `Delta` is normally order of magnitude higher than the real communication delay
(maximum) between correct nodes. Therefore if a correct node `p` sends a request message to a correct node `q` at time `t` and there is no the corresponding reply at time
Contributor:

there is no the corresponding reply at time

Suggested change
(maximum) between correct nodes. Therefore if a correct node `p` sends a request message to a correct node `q` at time `t` and there is no the corresponding reply at time
(maximum) between correct nodes. Therefore if a correct node `p` sends a request message to a correct node `q` at time `t` and there is no corresponding reply at time

type NewBlock struct {
Height int64
Block Block
Commit Commit
Contributor:

I think we need an overview of how all these data structures are used and when. Some control-flow diagrams would be useful.
Also, part of this ADR should be the concurrency model/design for scale, performance, and testability. I believe this would be part of what you call the Executor task. I did some experiments with a version of the reactor poolRoutine that incorporates some of the Controller/FSM changes: sending one request per peer slows things down dramatically. Having hardcoded fixed limits may not be the best approach; adaptive limits might work better. But in order to know, we need a good design (e.g. collect good stats, be able to abstract the things we sync, etc.) to understand what happens. Eyeing the logs is a nightmare. I'm not sure what we have today to collect operational data from nodes.

Contributor Author:

I will try to add a control diagram similar to the one we have for the current version; maybe you can share the source of that diagram? Regarding performance concerns, extending this proposal to send multiple requests to a peer should be relatively straightforward, and I will try to do it. I agree that in the ideal case the number of pending requests per peer should be adaptive, but we need to estimate the complexity of such an approach. We can maybe start by adding multiple pending requests (a hardcoded value) to this proposal, as this will let faster peers send multiple responses and put them in position to serve new requests sooner.
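Extending the scheduler to N pending requests per peer, as discussed, might look like the sketch below (maxPendingPerPeer is a hypothetical hardcoded constant, not a value from the ADR):

``` go
package main

import "fmt"

const maxPendingPerPeer = 10 // hardcoded cap; could later be made adaptive

type peer struct {
	pending int
	height  int64
}

// assignRequests hands out block heights to peers until each peer reaches
// the pending cap or there are no heights left below the peer's height.
func assignRequests(peers map[string]*peer, next int64) map[string][]int64 {
	out := make(map[string][]int64)
	progress := true
	for progress {
		progress = false
		for id, p := range peers {
			if p.pending < maxPendingPerPeer && p.height >= next {
				out[id] = append(out[id], next)
				p.pending++
				next++
				progress = true
			}
		}
	}
	return out
}

func main() {
	peers := map[string]*peer{"p1": {height: 1000}}
	reqs := assignRequests(peers, 1)
	fmt.Println(len(reqs["p1"])) // capped at maxPendingPerPeer
}
```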


type RemovePeer struct {
PeerID ID
Height int64
Contributor:

Do we need Height here? Under what conditions is this sent? Does the switch remove a peer on error? On heartbeat loss?
Similar questions for the messages/types below.

Contributor Author:

We don't need Height, good point. It could be sent when the switch removes a peer on error, or when a StatusReport is not received on time.

msg = createBlockResponseMessage(state, event)
return state, msg, timeout, error
default:
error = "Only respond to BlockRequests while in ConsensusMode!"
Contributor:

Just a note that this is not the case with the current implementation.

switch event := event.(type) {
case EventBlockRequest:
msg = createBlockResponseMessage(state, event)
return state, msg, timeout, error
@ancazamfir (Contributor) Apr 1, 2019

Here we are not in consensus mode and we reply to BlockRequest messages. Also, this (the fast-sync helper functionality) could be done outside the controller.
Edit: sorry, misread the code.

msg = createBlockResponseMessage(state, event)
return state, msg, timeout, error
default:
error = "Only respond to BlockRequests while in ModeConsensus!"
Contributor:

Suggested change
error = "Only respond to BlockRequests while in ModeConsensus!"
error = "While in ModeConsensus only respond to Block Requests"

What about the status requests? Where do we reply to those?

@milosevic (Contributor Author) Apr 1, 2019

It is part of the heartbeat mechanism, which belongs to the Executor logic. I assumed that the Executor periodically sends a StatusRequest message to each peer and expects a StatusReport to arrive within the response timeout. The Controller only sees StatusReport events, plus RemovePeer in case a peer is removed or has timed out.

Contributor:

What's in a name... "peer"? :)
Is it a node that had previously answered the status request broadcast, or the peers the switch knows about (where we don't know the peer's fast-sync status)?
If the former, will the Executor keep track of peers in order to know who didn't answer?

Contributor Author:

A peer is a node with which we have an open connection (at the switch level). The Executor sends status requests to all peers and needs to manage timeouts. Therefore, we probably need peer lists at the Executor level too, to be able to determine who didn't respond on time.

@xla

xla commented Apr 14, 2019

@milosevic Given that we want to stick to consistent ADR numbering captured in #2313 this ADR should get the number 40. I amended the original issue accordingly.

@ebuchman ebuchman mentioned this pull request Apr 14, 2019
@ebuchman ebuchman merged commit 6f1ccb6 into develop May 10, 2019
@ebuchman ebuchman deleted the zm_blockchain_reactor_spec branch May 10, 2019 17:37
@melekes melekes mentioned this pull request May 30, 2019
44 tasks
brapse pushed a commit to brapse/tendermint that referenced this pull request Jun 5, 2019
* Add blockchain reactor refactor ADR

* docs: Correct markdown and go syntax in adr-036

* Update docs/architecture/adr-036-blockchain-reactor-refactor.md

Co-Authored-By: milosevic <zarko@interchain.io>

* Add comments and address reviewer comments

* Improve formating

* Small improvements

* adr-036 -> adr-040