
[blockchain] Riri processor-scheduler peer behaviour #3777

Closed
brapse opened this issue Jul 8, 2019 · 3 comments
brapse commented Jul 8, 2019

The new blockchain reactor design specified in #3753 introduces asynchronous message communication between the block processing logic and the request scheduling logic. This async communication introduces latency between the scheduler perceiving bad peer behaviour and the processor being informed to remove blocks provided by that peer. A malicious peer could use this window to saturate the block processing thread, slowing down the FastSync'ing node.

This issue can be solved by having the block processing thread yield to peerErrors more frequently, ensuring that few, if any, blocks from an errored peer are processed.
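
For illustration, that fix might take roughly the following shape in Go. This is a sketch only; processRoutine, purgePeer, and the event types here are hypothetical stand-ins, not the actual v2 reactor code:

```go
package processor

// Hypothetical event types standing in for the real reactor messages.
type peerError struct{ peerID string }

type blockResponse struct {
	peerID string
	height int64
}

type procState struct{} // block store, peer set, etc. (elided)

func (s *procState) purgePeer(id string)          {} // drop queued blocks from this peer
func (s *procState) processBlock(b blockResponse) {} // validate and apply a single block

// processRoutine returns to its select after every block, so a queued
// peerError is handled before any further blocks from that peer are processed.
func processRoutine(events <-chan interface{}, stop <-chan struct{}) {
	s := &procState{}
	for {
		select {
		case ev := <-events:
			switch ev := ev.(type) {
			case peerError:
				s.purgePeer(ev.peerID)
			case blockResponse:
				s.processBlock(ev) // one block, then yield back to the select
			}
		case <-stop:
			return
		}
	}
}
```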

brapse self-assigned this Jul 8, 2019
ancazamfir commented

Hmmm... not sure. You would be yielding to the messages in a channel that may contain a lot of blocks from the scheduler, requests from other fast-sync peers, etc., before you get to the peerError from the scheduler. Unless we add another channel for errors (this is what v0 and v1 have).
Also, this is not the only issue; we also have to deal carefully with out-of-sync views of the world between the processor and scheduler.
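
For reference, a dedicated error channel can be given strict priority in Go with a nested select. This is a sketch with hypothetical channel and function names, not the v0/v1 code:

```go
package processor

type peerError struct{ peerID string }
type blockResponse struct{ peerID string }

func purgePeer(id string)          {} // drop state associated with the peer
func processBlock(b blockResponse) {} // validate and apply a single block

// processRoutine drains peerErrs before taking the next block: the first
// select with an empty default is non-blocking, so any pending error is
// always handled ahead of the block backlog.
func processRoutine(blocks <-chan blockResponse, peerErrs <-chan peerError, stop <-chan struct{}) {
	for {
		select {
		case e := <-peerErrs:
			purgePeer(e.peerID)
			continue
		default:
		}
		select {
		case e := <-peerErrs:
			purgePeer(e.peerID)
		case b := <-blocks:
			processBlock(b)
		case <-stop:
			return
		}
	}
}
```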

At this point I am not sure anymore that we need to have the processor and scheduler routines. I will try to give a minimal explanation of how v0 and v1 got to having the two routines; links to the issue and PR are below.

We used to have a single goroutine that was doing both the scheduling and the processing. Given the way it was written, and given that we were running timers that triggered in separate goroutines, the following was seen (a reconstruction is sketched after the list):

  • a high number of block requests were sent out, and the resulting blocks were received in channel 1.
  • processing would then start, and after each processed block a signal would be sent to channel 2 in order to keep processing blocks quickly without sitting in a tight loop.
  • because processing blocks takes a long time, some blocks would sit in channel 1 for more than 15 seconds.
  • the timer associated with the peer(s) sending those blocks would trigger, and its handler would run in a separate goroutine, rescheduling the pending requests (those for the blocks already waiting in channel 1).
  • later, when we actually got to reading those blocks from channel 1, the code dropped them, since those requests had by then been made to other peers.
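
A minimal reconstruction of that race in Go, with hypothetical names (the actual v0 code differed):

```go
package fastsync

import (
	"sync"
	"time"
)

type blockResponse struct {
	peerID string
	height int64
}

type pool struct {
	mtx       sync.Mutex
	requester map[int64]string // height -> peer the request is currently assigned to
	blockCh   chan blockResponse
}

// onPeerTimeout runs in the timer's goroutine and reassigns the peer's pending
// requests, even though some of those blocks already sit unread in blockCh.
func (p *pool) onPeerTimeout(timedOut, replacement string) {
	p.mtx.Lock()
	defer p.mtx.Unlock()
	for h, id := range p.requester {
		if id == timedOut {
			p.requester[h] = replacement
		}
	}
}

// processLoop reads blocks long after they were queued; if the request was
// rescheduled in the meantime, the already-received block gets dropped.
func (p *pool) processLoop() {
	for b := range p.blockCh {
		p.mtx.Lock()
		owner := p.requester[b.height]
		p.mtx.Unlock()
		if owner != b.peerID {
			continue // dropped: the request now belongs to another peer
		}
		time.Sleep(100 * time.Millisecond) // stand-in for slow block validation
	}
}
```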

With the current approach (use of timestamps and not real timers) this wouldn't happen, right?
Note also that the time spent in the scheduler routine is insignificant compared with that spent in the processing routine. AFAIK, we haven't seen any performance improvement after splitting.
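
For contrast, a sketch of the timestamp-based approach (the types here are hypothetical): the timeout check runs synchronously inside the scheduler routine, so no timer handler races with processing.

```go
package scheduler

import "time"

type request struct {
	peerID string
	sentAt time.Time
}

type scheduler struct {
	pending map[int64]request // height -> outstanding request
	timeout time.Duration
}

// checkTimeouts is called from the scheduler's own routine with the current
// time; peers whose requests have been outstanding too long are reported,
// and nothing fires concurrently with the rest of the scheduler's logic.
func (s *scheduler) checkTimeouts(now time.Time) []string {
	var timedOut []string
	for _, r := range s.pending {
		if now.Sub(r.sentAt) > s.timeout {
			timedOut = append(timedOut, r.peerID)
		}
	}
	return timedOut
}
```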

v0 issue: #3457
v1 PR: #3459


brapse commented Jul 9, 2019

Unless we add another channel for errors (this is what v0 and v1 have).

Yes, I think that could definitely be one solution. It could be done internal to the processRoutine because, in the end, it's the processRoutine that is most concerned about missing a peerError.

At this point I am not sure anymore that we need to have the processor and scheduler routines.

The motivation for having separate routines in v2 was to decompose the problem into components that are easier to test and to comprehend for review purposes. There are additional benefits, but I would say that is by far the largest.

With the current approach (use of timestamps and not real timers) this wouldn't happen, right?

If I understand you correctly, we are talking about thread saturation, where an expensive process like block validation holds control of a thread for too long and either doesn't check for errors or doesn't check the time and issue timeouts. In both cases, the solution seems to be to ensure that threads yield frequently enough to be responsive to external state changes (events).
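
As an illustration of such a yield point (a sketch; the names and batching are hypothetical):

```go
package processor

import "fmt"

type block struct{ height int64 }
type peerError struct{ peerID string }

func validate(b block) error { return nil } // stand-in for expensive validation

// validateBatch polls peerErrs every few blocks, so an expensive validation
// pass stays responsive to external state changes instead of holding the
// goroutine until the whole batch is done.
func validateBatch(blocks []block, peerErrs <-chan peerError) (int, error) {
	for i, b := range blocks {
		if err := validate(b); err != nil {
			return i, err
		}
		if i%8 == 7 { // yield point: non-blocking check for external events
			select {
			case e := <-peerErrs:
				return i + 1, fmt.Errorf("peer %s errored mid-batch", e.peerID)
			default:
			}
		}
	}
	return len(blocks), nil
}
```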


brapse commented Nov 28, 2019

This has been resolved by using a priority-queue routine which ensures peerErrors take priority over blocks.
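
The underlying idea can be sketched as follows (hypothetical names, not the actual implementation): a demux routine buffers incoming events in two queues and always delivers pending peerErrors before pending blocks.

```go
package processor

type peerError struct{ peerID string }

// demuxRoutine forwards events from in to out, letting queued peerErrors jump
// ahead of queued blocks. A send on a nil channel never proceeds, which
// disables the delivery case whenever both queues are empty.
func demuxRoutine(in <-chan interface{}, out chan<- interface{}, stop <-chan struct{}) {
	var errQ, blockQ []interface{}
	for {
		var next interface{}
		var deliver chan<- interface{}
		if len(errQ) > 0 {
			next, deliver = errQ[0], out
		} else if len(blockQ) > 0 {
			next, deliver = blockQ[0], out
		}
		select {
		case ev := <-in:
			if _, ok := ev.(peerError); ok {
				errQ = append(errQ, ev)
			} else {
				blockQ = append(blockQ, ev)
			}
		case deliver <- next:
			if len(errQ) > 0 {
				errQ = errQ[1:]
			} else {
				blockQ = blockQ[1:]
			}
		case <-stop:
			return
		}
	}
}
```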

brapse closed this as completed Nov 28, 2019