
Some external nodes are not able to catch up to the network #5294

Open
pgarg66 opened this issue Jul 26, 2019 · 11 comments

@pgarg66 (Contributor) commented Jul 26, 2019

Problem

During the TdS dry run, many external nodes were unable to catch up to the network. I ran some experiments posing as an external node on Azure and GCP VMs. The Azure VM could not catch up even to an idling network because repairs were slow; the GCP VM (running the exact same software) caught up more reliably. The problem appears to be the network interconnect between cloud providers, and it is magnified because we transmit blobs between nodes as large UDP packets (64KB). Each 64KB datagram is carried as dozens of MTU-sized IP fragments, and losing any one fragment discards the whole datagram; for example, a 1% IP packet drop rate translates to ~50% blob loss.
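
For intuition, here's a back-of-the-envelope sketch of why a small per-fragment drop rate compounds so badly (the fragment count and drop independence are my assumptions; real loss is often burstier, which shifts the exact figure):

```rust
// A 64KB UDP datagram on a ~1500-byte-MTU path is carried as roughly 45 IP
// fragments (~1480 payload bytes each); losing any single fragment discards
// the entire datagram.
fn blob_loss(fragment_drop_rate: f64, fragments: i32) -> f64 {
    1.0 - (1.0 - fragment_drop_rate).powi(fragments)
}

fn main() {
    // With 1% independent fragment drops, well over a third of 64KB
    // datagrams are lost before any retransmission is counted.
    println!("{:.0}%", 100.0 * blob_loss(0.01, 45)); // prints "36%"
}
```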

An iperf test between different cloud providers and regions produced the following results:

| Peer1 | Peer2 | 60KB UDP | 10KB UDP | 2KB UDP | MTU-sized UDP |
| --- | --- | --- | --- | --- | --- |
| GCP US | GCP EU | 24 Mbps | 145 Mbps | 723 Mbps | 990 Mbps |
| Azure US | GCP EU | Did not finish | 400 Kbps | 130 Mbps | 985 Mbps |
| AWS US | GCP EU | 24 Mbps | 147 Mbps | 735 Mbps | 990 Mbps |
| AWS US | Azure US | 24 Mbps | 147 Mbps | 733 Mbps | 990 Mbps |
| AWS US | Azure Africa | Did not finish | 83 Mbps | 174 Mbps | 600 Mbps |

Proposed Solution

The bigger packets are producing lower throughput, most likely because any dropped IP fragment corrupts the whole UDP datagram. Making matters worse, a dropped blob is repaired by retransmitting it whole, so the chance of again losing part of the blob in the retransmit remains high.

MTU-sized packets have the highest throughput. The concern is that even 2KB packets cut throughput to a fraction of line rate in some cases (130 Mbps vs. 985 Mbps between Azure US and GCP EU).

TCP could be a solution, but it brings its own overheads and problems. Shrinking blobs to MTU size would hurt parallel processing (of transactions and of signature verification).

Splitting a blob into multiple smaller UDP packets (sized to the MTU) looks like the best solution. In this scheme an individual UDP packet can be repaired instead of re-fetching the whole blob. It would require rework in a number of places (blocktree, blobs, repair service, erasure coding, etc.).
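
A minimal sketch of what fragmentation and reassembly could look like (all names and sizes here are hypothetical, not the actual blob format):

```rust
// Hypothetical fragment format: split one blob into MTU-safe chunks that
// can be transmitted, repaired, and reassembled individually.
const MAX_FRAGMENT_PAYLOAD: usize = 1280; // conservative, below common path MTUs

struct BlobFragment {
    slot: u64,           // slot of the parent blob
    blob_index: u64,     // parent blob's index within the slot
    fragment_index: u16, // position of this chunk
    fragment_count: u16, // total chunks in the parent blob
    payload: Vec<u8>,
}

fn fragment_blob(slot: u64, blob_index: u64, data: &[u8]) -> Vec<BlobFragment> {
    let count = data.chunks(MAX_FRAGMENT_PAYLOAD).count() as u16;
    data.chunks(MAX_FRAGMENT_PAYLOAD)
        .enumerate()
        .map(|(i, chunk)| BlobFragment {
            slot,
            blob_index,
            fragment_index: i as u16,
            fragment_count: count,
            payload: chunk.to_vec(),
        })
        .collect()
}

fn reassemble(mut fragments: Vec<BlobFragment>) -> Option<Vec<u8>> {
    let count = fragments.first()?.fragment_count as usize;
    if fragments.len() != count {
        return None; // still missing chunks; repair those individually
    }
    fragments.sort_by_key(|f| f.fragment_index);
    Some(fragments.into_iter().flat_map(|f| f.payload).collect())
}
```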

ToDo
  • Blob structure rework
  • Blob fragmentation and reassembly
  • Erasure changes
  • Repair/Window Service changes
  • Blocktree rework
  • Gossip changes

@mvines added this to the Mavericks v0.18.0 milestone Jul 26, 2019


@mvines added this to To do in TdS Stage 1 via automation Jul 26, 2019

@aeyakovenko (Member) commented Jul 27, 2019

@pgarg66

Do you mean we transmit one blob in N UDP chunks to each peer? The first chunk is the blob header, the rest follow. RS blobs are chunked as well.

What's the hit to using an MTU-sized blob?

@aeyakovenko (Member) commented Jul 27, 2019

@pgarg66 Also, @sakridge is planning a change that would remove transaction processing from the banking stage; we only need to check that the spending account can pay the fee. The replay stage is already linearized, so it can read all the consecutive available blobs from blocktree and run them in parallel. We might need to use one thread per N blobs, though.
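
A rough sketch of the "one thread per N blobs" idea using rayon (the `Entry` type, `process_entry`, and the batch size are placeholders, not the real replay-stage code):

```rust
use rayon::prelude::*;

struct Entry; // placeholder for an entry read out of blocktree

fn process_entry(_entry: &Entry) {
    // placeholder for transaction execution / verification
}

// Replay consecutive entries in parallel, batching N per rayon task so
// per-task overhead stays small.
const ENTRIES_PER_THREAD: usize = 32; // illustrative value

fn replay(entries: &[Entry]) {
    entries
        .par_chunks(ENTRIES_PER_THREAD)
        .for_each(|batch| batch.iter().for_each(process_entry));
}
```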

@pgarg66 (Contributor, Author) commented Jul 27, 2019

> Do you mean we transmit one blob in N UDP chunks to each peer? The first chunk is the blob header, the rest follow. RS blobs are chunked as well.
>
> What's the hit to using an MTU-sized blob?

We'll need a header in each chunk, and each chunk must be verifiable on its own to guard against malicious fragments.

MTU-sized blobs won't fit some transactions, and we currently don't support splitting a transaction across multiple blobs. Also, fewer transactions per blob reduces the benefit of parallel processing.
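
Rough numbers for that per-chunk header (a hypothetical layout extending the fragment sketch above; field sizes are assumptions, not a real wire format):

```rust
// If every chunk must be independently verifiable, each one carries roughly:
struct ChunkHeader {
    signature: [u8; 64], // ed25519 signature over the rest of the chunk
    slot: u64,           // 8 bytes
    blob_index: u64,     // 8 bytes
    fragment_index: u16, // 2 bytes
    fragment_count: u16, // 2 bytes
}
// ~84 bytes of header per ~1280-byte chunk, i.e. ~6-7% bandwidth overhead,
// plus one signature verification per chunk instead of one per 64KB blob.
```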

@aeyakovenko (Member) commented Jul 27, 2019

@pgarg66 @sagar-solana
We can add a way to execute transactions stored in an account's userdata.

@mvines (Member) commented Jul 27, 2019

I'm stoked about the prospect of being able to remove

`sysctl_write net.inet.udp.maxdgram 65535`

The fact that this sysctl adjustment even needs to exist implies we're doing something abnormal and asking for a hard time.

@leoluk commented Aug 1, 2019

The reason for the performance issues is that sending UDP packets larger than the link MTU requires the kernel (or the NIC, if offloading is enabled) to fragment the IP packet. IP fragmentation is slow and unreliable, and should almost always be avoided and handled by the application instead:

  • As already mentioned above, if a single IP fragment is lost, the whole packet is corrupted.

  • Fragmentation performance is a complicated topic, and enabling NIC offloading can even reduce performance - many virtualized NICs fragment more slowly than the guest kernel.

  • Depending on whether path MTU discovery works properly (often it does not, e.g. when ICMP packets are filtered), IPv4 packets may be fragmented further by routers on the path.

  • Most stateful firewalls drop IP fragments, especially oversized ones.

...and many more: https://tools.ietf.org/id/draft-ietf-intarea-frag-fragile-01.html
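
One way an application can guarantee it never relies on kernel fragmentation is to set the DF bit, so oversized sends fail with EMSGSIZE instead of silently fragmenting. A Linux-only sketch using the libc crate (illustrative; not something the codebase does today, as far as I know):

```rust
use std::net::UdpSocket;
use std::os::unix::io::AsRawFd;

// Linux-only: enable strict path MTU discovery on a UDP socket. The kernel
// sets DF on outgoing packets, and sends larger than the path MTU fail with
// EMSGSIZE rather than being fragmented.
fn set_dont_fragment(sock: &UdpSocket) -> std::io::Result<()> {
    let val: libc::c_int = libc::IP_PMTUDISC_DO;
    let rc = unsafe {
        libc::setsockopt(
            sock.as_raw_fd(),
            libc::IPPROTO_IP,
            libc::IP_MTU_DISCOVER,
            &val as *const libc::c_int as *const libc::c_void,
            std::mem::size_of::<libc::c_int>() as libc::socklen_t,
        )
    };
    if rc == 0 { Ok(()) } else { Err(std::io::Error::last_os_error()) }
}
```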

@mvines (Member) commented Aug 2, 2019

Thanks @leoluk, that’s helpful!

@zmanian commented Aug 4, 2019

I wonder how some kind of overlay network like an SD-WAN would interact with this kind of thing. Several SD-WAN vendors who were interested in exploring optimization of the Tendermint layer have reached out in the past, but fancy throughput maximization hasn't been our focus so far.

@leoluk commented Aug 4, 2019

Most overlay networks would exacerbate the problem by reducing the MTU. Throughput optimization is all about flow control and retransmission at the application layer.

@aeyakovenko (Member) commented Aug 4, 2019

@leoluk @zmanian the "easy" hack would be to fragment and reassemble large packets in user space. Since the MTU-packet drop rate is about 1% in the worst case we have seen, we would only need ~2% erasure codes. The problem is that this would be horribly inefficient: we would need to sign each fragment, and we would essentially be doing signature verification and erasure encode/decode twice.

Our main gain with Turbine is that we run erasure coding over the entire peer window instead of just p2p, so it's already doing its own signature verification and erasure coding.
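
For reference, here is roughly what ~2% parity over MTU-sized fragments could look like with the reed-solomon-erasure crate (shard counts and sizes are illustrative only):

```rust
use reed_solomon_erasure::galois_8::ReedSolomon;

fn main() -> Result<(), reed_solomon_erasure::Error> {
    // ~2% overhead: 100 data shards + 2 parity shards per batch.
    let rs = ReedSolomon::new(100, 2)?;

    // 102 MTU-sized shards; the first 100 hold data, the last 2 get parity.
    let mut shards: Vec<Vec<u8>> = (0..102).map(|_| vec![0u8; 1280]).collect();
    rs.encode(&mut shards)?;

    // Simulate losing two fragments in transit...
    let mut received: Vec<Option<Vec<u8>>> = shards.into_iter().map(Some).collect();
    received[3] = None;
    received[77] = None;

    // ...and recovering them from the surviving shards.
    rs.reconstruct(&mut received)?;
    Ok(())
}
```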

@mvines moved this from To do to In progress in TdS Stage 0 Aug 7, 2019

@mvines moved this from In progress to Blocking Dry Run 3 in TdS Stage 0 Aug 8, 2019

@mvines moved this from Blocking Dry Run 3 to Non-blocking in TdS Stage 0 Aug 15, 2019

@mvines moved this from Non-blocking to Blocking Dry Run 4 in TdS Stage 0 Aug 15, 2019
