p2p: "peer did not send us anything" in fast_sync mode when under high pressure #3457

unclezoro · 2019-03-20T16:55:01Z

Reproduce step:

To make huge pressure to rpc api, first change tendermint code, just increase the lock time so that a small scale of query request can make ApplyBlock a great delay:
Write a shell script, and run it on local node:

while true; do
time  curl 127.0.0.1:26657/{abci-query-for something} &
sleep 0.2
done

Restart node and wait for about 2 or 3 minutes, we will see errors:

To find the root cause, we modified the source code:

Give BlockRequest new field Time

type BlockRequest struct {
	Time   time.Time //add to test
	Height int64
	PeerID p2p.ID
}

and init it with time.Now() when send it to pool.requestsCh

func (pool *BlockPool) sendRequest(height int64, peerID p2p.ID) {
	if !pool.IsRunning() {
		return
	}
	pool.requestsCh <- BlockRequest{time.Now(),height, peerID}
}

In poolRoutine() of BlockchainReactor , we compare the create time and current time.
Then we check the the peer that seems to be sent us nothing, find the block heigh that it is expected to receive but get nothing, we find that the request is still not send out !!! Generate at
19min42second, but pop from channel at 20min10second.
Just check the code of poolRoutine. we find that trySyncTicker is trigger in every 10ms which will send didProcessCh an object. When process a block takes more that 10ms, didProcessCh alway get an object so that get event from requestsCh will delay.

I did a simple test:

package main

import (
	"fmt"
	"math/rand"
	"time"
)

type Event struct {
	Time time.Time
}

func main() {
	eventChannel := make(chan Event,600)
	didProcessCh := make(chan struct{}, 1)
	trySyncTicker := time.NewTicker(10 * time.Millisecond)
	go func() {
		for{
			t:=rand.Intn(1000)
			time.Sleep(time.Duration(t)*time.Millisecond)
			eventChannel<-Event{time.Now()}
		}
	}()

	for {
		select {
		case e:=<-eventChannel:
			end:=time.Now()
			d:=end.Sub(e.Time)
			fmt.Println(d)
		case <-trySyncTicker.C: // chan time
			select {
			case didProcessCh <- struct{}{}:
			default:
			}
		case <-didProcessCh:
				// We need both to sync the first block.
				// Try again quickly next loop.
				time.Sleep(2*time.Second)
				didProcessCh <- struct{}{}
		}
	}
}

The result is

1.929595741s
3.046203077s
2.199312363s
8.143827024s
12.067358858s
11.749071705s
11.321399465s
10.779326276s
12.323167204s
14.024024162s
13.327849443s
12.816276574s
16.655774867s
16.563461207s
15.832692446s
15.554935313s
21.34267714s
20.893256979s
26.66588s
28.562135556s
34.070507764s
33.602116951s
35.070530314s
34.812306947s
36.765648747s

The text was updated successfully, but these errors were encountered:

Fixes tendermint#3457 The topic of the issue is that : write a BlockRequest int requestsCh channel will create an timer at the same time that stop the peer 15s later if no block have been received . But pop a BlockRequest from requestsCh and send it out may delay more than 15s later. So that the peer will be stopped for error("send nothing to us"). Extracting requestsCh into its own goroutine can make sure that every BlockRequest been handled timely. Instead of the requestsCh handling, we should probably pull the didProcessCh handling in a separate go routine since this is the one "starving" the other channel handlers. I believe the way it is right now, we still have issues with high delays in errorsCh handling that might cause sending requests to invalid/ disconnected peers.

unclezoro changed the title ~~p2p: "peer did not send us anything" in high pressure~~ p2p: "peer did not send us anything" in 发售in high pressure Mar 20, 2019

unclezoro changed the title ~~p2p: "peer did not send us anything" in 发售in high pressure~~ p2p: "peer did not send us anything" in fast_sync mode when under high pressure Mar 20, 2019

xla added T:bug Type Bug (Confirmed) C:p2p Component: P2P pkg labels Mar 20, 2019

unclezoro mentioned this issue Mar 21, 2019

p2p: blockchain dismiss request channel delay #3459

Merged

4 tasks

ebuchman closed this as completed in 439312b Apr 19, 2019

ancazamfir mentioned this issue Jul 8, 2019

[blockchain] Riri processor-scheduler peer behaviour #3777

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

p2p: "peer did not send us anything" in fast_sync mode when under high pressure #3457

p2p: "peer did not send us anything" in fast_sync mode when under high pressure #3457

unclezoro commented Mar 20, 2019 •

edited

p2p: "peer did not send us anything" in fast_sync mode when under high pressure #3457

p2p: "peer did not send us anything" in fast_sync mode when under high pressure #3457

Comments

unclezoro commented Mar 20, 2019 • edited

unclezoro commented Mar 20, 2019 •

edited