Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

p2p: "peer did not send us anything" in fast_sync mode when under high pressure #3457

Closed
unclezoro opened this issue Mar 20, 2019 · 0 comments
Labels
C:p2p Component: P2P pkg T:bug Type Bug (Confirmed)

Comments

@unclezoro
Copy link
Contributor

unclezoro commented Mar 20, 2019

Reproduce step:

  1. To make huge pressure to rpc api, first change tendermint code, just increase the lock time so that a small scale of query request can make ApplyBlock a great delay:
    image

  2. Write a shell script, and run it on local node:

while true; do
time  curl 127.0.0.1:26657/{abci-query-for something} &
sleep 0.2
done
  1. Restart node and wait for about 2 or 3 minutes, we will see errors:
    image

To find the root cause, we modified the source code:

  1. Give BlockRequest new field Time
type BlockRequest struct {
	Time   time.Time //add to test
	Height int64
	PeerID p2p.ID
}

and init it with time.Now() when send it to pool.requestsCh

func (pool *BlockPool) sendRequest(height int64, peerID p2p.ID) {
	if !pool.IsRunning() {
		return
	}
	pool.requestsCh <- BlockRequest{time.Now(),height, peerID}
}
  1. In poolRoutine() of BlockchainReactor , we compare the create time and current time.
    image

  2. Then we check the the peer that seems to be sent us nothing, find the block heigh that it is expected to receive but get nothing, we find that the request is still not send out !!! Generate at
    19min42second, but pop from channel at 20min10second.
    image

  3. Just check the code of poolRoutine. we find that trySyncTicker is trigger in every 10ms which will send didProcessCh an object. When process a block takes more that 10ms, didProcessCh alway get an object so that get event from requestsCh will delay.

I did a simple test:

package main

import (
	"fmt"
	"math/rand"
	"time"
)

type Event struct {
	Time time.Time
}

func main() {
	eventChannel := make(chan Event,600)
	didProcessCh := make(chan struct{}, 1)
	trySyncTicker := time.NewTicker(10 * time.Millisecond)
	go func() {
		for{
			t:=rand.Intn(1000)
			time.Sleep(time.Duration(t)*time.Millisecond)
			eventChannel<-Event{time.Now()}
		}
	}()

	for {
		select {
		case e:=<-eventChannel:
			end:=time.Now()
			d:=end.Sub(e.Time)
			fmt.Println(d)
		case <-trySyncTicker.C: // chan time
			select {
			case didProcessCh <- struct{}{}:
			default:
			}
		case <-didProcessCh:
				// We need both to sync the first block.
				// Try again quickly next loop.
				time.Sleep(2*time.Second)
				didProcessCh <- struct{}{}
		}
	}
}

The result is

1.929595741s
3.046203077s
2.199312363s
8.143827024s
12.067358858s
11.749071705s
11.321399465s
10.779326276s
12.323167204s
14.024024162s
13.327849443s
12.816276574s
16.655774867s
16.563461207s
15.832692446s
15.554935313s
21.34267714s
20.893256979s
26.66588s
28.562135556s
34.070507764s
33.602116951s
35.070530314s
34.812306947s
36.765648747s
@unclezoro unclezoro changed the title p2p: "peer did not send us anything" in high pressure p2p: "peer did not send us anything" in 发售in high pressure Mar 20, 2019
@unclezoro unclezoro changed the title p2p: "peer did not send us anything" in 发售in high pressure p2p: "peer did not send us anything" in fast_sync mode when under high pressure Mar 20, 2019
@xla xla added T:bug Type Bug (Confirmed) C:p2p Component: P2P pkg labels Mar 20, 2019
brapse pushed a commit to brapse/tendermint that referenced this issue Jun 5, 2019
Fixes tendermint#3457

The topic of the issue is that : write a BlockRequest int requestsCh channel will create an timer at the same time that stop the peer 15s later if no block have been received . But pop a BlockRequest from requestsCh and send it out may delay more than 15s later. So that the peer will be stopped for error("send nothing to us").
Extracting requestsCh into its own goroutine can make sure that every BlockRequest been handled timely.

Instead of the requestsCh handling, we should probably pull the didProcessCh handling in a separate go routine since this is the one "starving" the other channel handlers. I believe the way it is right now, we still have issues with high delays in errorsCh handling that might cause sending requests to invalid/ disconnected peers.
Stumble pushed a commit to lino-network/tendermint that referenced this issue Jul 25, 2019
Fixes tendermint#3457

The topic of the issue is that : write a BlockRequest int requestsCh channel will create an timer at the same time that stop the peer 15s later if no block have been received . But pop a BlockRequest from requestsCh and send it out may delay more than 15s later. So that the peer will be stopped for error("send nothing to us").
Extracting requestsCh into its own goroutine can make sure that every BlockRequest been handled timely.

Instead of the requestsCh handling, we should probably pull the didProcessCh handling in a separate go routine since this is the one "starving" the other channel handlers. I believe the way it is right now, we still have issues with high delays in errorsCh handling that might cause sending requests to invalid/ disconnected peers.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C:p2p Component: P2P pkg T:bug Type Bug (Confirmed)
Projects
None yet
Development

No branches or pull requests

2 participants