p2p: blockchain dismiss request channel delay #3459

unclezoro · 2019-03-21T09:14:02Z

Fix issue #3457

Updated all relevant documentation in docs
Updated all code comments where relevant
Wrote tests
Updated CHANGELOG_PENDING.md

unclezoro · 2019-03-22T17:07:13Z

@melekes @ebuchman could you help review this when you get time? Thanks.

melekes · 2019-03-23T12:34:28Z

There's a data race

WARNING: DATA RACE
Write at 0x00c00051a3d0 by goroutine 48:
  runtime.closechan()
      /usr/local/go/src/runtime/chan.go:334 +0x0
  github.com/tendermint/tendermint/blockchain.(*BlockPool).OnStop()
      /go/src/github.com/tendermint/tendermint/blockchain/pool.go:107 +0x6e
  github.com/tendermint/tendermint/libs/common.(*BaseService).Stop()
      /go/src/github.com/tendermint/tendermint/libs/common/service.go:167 +0x4e3
  github.com/tendermint/tendermint/blockchain.TestBlockPoolBasic()
      /go/src/github.com/tendermint/tendermint/blockchain/pool_test.go:120 +0x5e5
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:865 +0x163

Previous read at 0x00c00051a3d0 by goroutine 322:
  runtime.chansend()
      /usr/local/go/src/runtime/chan.go:142 +0x0
  github.com/tendermint/tendermint/blockchain.(*BlockPool).sendRequest()
      /go/src/github.com/tendermint/tendermint/blockchain/pool.go:393 +0xdc
  github.com/tendermint/tendermint/blockchain.(*bpRequester).requestRoutine()
      /go/src/github.com/tendermint/tendermint/blockchain/pool.go:604 +0x2e3

Goroutine 48 (running) created at:
  testing.(*T).Run()
      /usr/local/go/src/testing/testing.go:916 +0x699
  testing.runTests.func1()
      /usr/local/go/src/testing/testing.go:1157 +0xa8
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:865 +0x163
  testing.runTests()
      /usr/local/go/src/testing/testing.go:1155 +0x523
  testing.(*M).Run()
      /usr/local/go/src/testing/testing.go:1072 +0x2eb
  github.com/tendermint/tendermint/blockchain.TestMain()
      /go/src/github.com/tendermint/tendermint/blockchain/store_test.go:111 +0x5cb
  main.main()
      _testmain.go:114 +0x330

Goroutine 322 (running) created at:
  github.com/tendermint/tendermint/blockchain.(*bpRequester).OnStart()
      /go/src/github.com/tendermint/tendermint/blockchain/pool.go:523 +0x64
  github.com/tendermint/tendermint/libs/common.(*BaseService).Start()
      /go/src/github.com/tendermint/tendermint/libs/common/service.go:139 +0x504
  github.com/tendermint/tendermint/blockchain.(*BlockPool).makeNextRequester()
      /go/src/github.com/tendermint/tendermint/blockchain/pool.go:379 +0x229
  github.com/tendermint/tendermint/blockchain.(*BlockPool).makeRequestersRoutine()
      /go/src/github.com/tendermint/tendermint/blockchain/pool.go:130 +0x174
==================
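
For readers following the trace: the race detector pairs a channel close in BlockPool.OnStop with a concurrent send on the same channel in sendRequest. One common way to avoid that pattern is to leave the data channel open and signal shutdown through a separate quit channel. The following is a minimal sketch under that assumption, not the fix applied in this PR; the pool type, its fields, and its method names are hypothetical stand-ins.

```go
package main

import (
	"fmt"
	"sync"
)

// pool is a hypothetical stand-in for BlockPool; all names are illustrative.
type pool struct {
	requestsCh chan int      // carries block heights to request
	quit       chan struct{} // closed once on stop; requestsCh is never closed
	stopOnce   sync.Once
}

func newPool() *pool {
	return &pool{requestsCh: make(chan int), quit: make(chan struct{})}
}

// sendRequest reports whether the request was handed off before shutdown.
func (p *pool) sendRequest(height int) bool {
	select {
	case p.requestsCh <- height:
		return true
	case <-p.quit:
		return false
	}
}

// stop signals shutdown without closing the channel that senders write to.
func (p *pool) stop() {
	p.stopOnce.Do(func() { close(p.quit) })
}

func main() {
	p := newPool()
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		// Nobody receives from requestsCh, so this returns only after stop.
		fmt.Println("sent:", p.sendRequest(1))
	}()
	p.stop()
	wg.Wait()
}
```

Because senders only ever select on the quit channel and never observe a closed data channel, stopping the pool cannot collide with an in-flight send.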

melekes · 2019-03-23T12:36:55Z

blockchain/reactor.go

-	for {
-		select {
-		case request := <-bcR.requestsCh:
+	go func() {


Not sure how extracting requestsCh into its own goroutine solves anything... but I don't understand the issue well enough

unclezoro · 2019-03-23

The issue is this: writing a BlockRequest into the requestsCh channel also starts a timer that stops the peer 15s later if no block has been received. But popping that BlockRequest from requestsCh and actually sending it out may be delayed by more than 15s, so the peer gets stopped with error("send nothing to us").
Extracting requestsCh into its own goroutine makes sure that every BlockRequest is handled in a timely manner.

melekes · 2019-03-23T12:38:03Z

Should we instead forbid any ABCI queries while we're in fast sync mode?

unclezoro · 2019-03-23T16:49:52Z

> Should we instead forbid any ABCI queries while we're in fast sync mode?

Heavy ABCI queries are just one condition that makes this problem easy to reproduce. The longer it takes to apply a block, the more easily the issue shows up, but we can't control that time in the Tendermint layer.
So the key factor is how fast a block is applied, not the ABCI queries.

unclezoro · 2019-03-23T16:51:40Z

I will try to fix the data race tomorrow. Thanks for the reminder.

unclezoro · 2019-03-24T14:02:34Z

Updated, @melekes. Let's see the CI result.

codecov-io · 2019-03-24T14:09:11Z

Codecov Report

Merging #3459 into develop will increase coverage by 0.14%.
The diff coverage is 58.33%.

@@             Coverage Diff             @@
##           develop    #3459      +/-   ##
===========================================
+ Coverage     64.2%   64.34%   +0.14%     
===========================================
  Files          213      213              
  Lines        17362    17447      +85     
===========================================
+ Hits         11147    11227      +80     
+ Misses        5290     5287       -3     
- Partials       925      933       +8

Impacted Files	Coverage Δ
libs/db/mem_batch.go	`92.59% <ø> (ø)`	⬆️
blockchain/reactor.go	`71.49% <58.33%> (-0.97%)`	⬇️
mempool/reactor.go	`68.35% <0%> (-10.64%)`	⬇️
libs/events/events.go	`93.2% <0%> (-4.86%)`	⬇️
privval/signer_remote.go	`80% <0%> (-2%)`	⬇️
blockchain/pool.go	`80.59% <0%> (-1.65%)`	⬇️
mempool/mempool.go	`79.36% <0%> (-1.18%)`	⬇️
rpc/client/localclient.go	`63.8% <0%> (ø)`	⬆️
config/toml.go	`65.95% <0%> (ø)`	⬆️
rpc/client/httpclient.go	`66.51% <0%> (ø)`	⬆️
... and 11 more

melekes

🍰

ancazamfir · 2019-03-26T17:05:08Z

Instead of the requestsCh handling, we should probably pull the didProcessCh handling into a separate goroutine, since this is the one "starving" the other channel handlers. I believe that, the way it is right now, we still have issues with high delays in errorsCh handling that might cause requests to be sent to invalid/disconnected peers.

unclezoro · 2019-04-01T02:19:45Z

> Instead of the requestsCh handling, we should probably pull the didProcessCh handling into a separate goroutine, since this is the one "starving" the other channel handlers. I believe that, the way it is right now, we still have issues with high delays in errorsCh handling that might cause requests to be sent to invalid/disconnected peers.

You are right. The code is updated now: I put requestsCh in the main routine. @ancazamfir, would you like to review again?

melekes · 2019-04-11T13:39:26Z

@guagualvcha could you please add a changelog entry (CHANGELOG_PENDING.md)?

ancazamfir

Looks good.

unclezoro requested review from ebuchman, melekes and xla as code owners March 21, 2019 09:14

unclezoro changed the title ~~p2p: blockchain dismiss request channel delay~~ [WIP]p2p: blockchain dismiss request channel delay Mar 21, 2019

p2p: blockchain dismiss request channel delay

734e783

unclezoro force-pushed the p2p branch from 4c29053 to 734e783 Compare March 21, 2019 11:07

unclezoro changed the title ~~[WIP]p2p: blockchain dismiss request channel delay~~ p2p: blockchain dismiss request channel delay Mar 21, 2019

unclezoro mentioned this pull request Mar 21, 2019

[R4R]p2p: blockchain dismiss request channel delay bnb-chain/bnc-tendermint#64

Merged

4 tasks

melekes reviewed Mar 23, 2019

melekes approved these changes Mar 25, 2019

melekes requested a review from ancazamfir March 26, 2019 11:29

unclezoro force-pushed the p2p branch from 9e50ce6 to 5515da8 Compare April 1, 2019 02:17

fix data-race bug;and auto imports

c6df60a

unclezoro force-pushed the p2p branch from 5515da8 to c6df60a Compare April 1, 2019 02:18

melekes approved these changes Apr 11, 2019

ancazamfir approved these changes Apr 11, 2019

melekes merged commit 439312b into tendermint:develop Apr 16, 2019

ancazamfir mentioned this pull request Jul 8, 2019

[blockchain] Riri processor-scheduler peer behaviour #3777

Closed
