Border Router performance optimization #4334

rohrerj · 2023-04-03T14:53:07Z

Right now, the performance of the border router is strongly limited because a single goroutine per border router interface is responsible for reading, parsing, processing and forwarding the packets.

Proposal: To improve the performance of the border router we make changes to the router/dataplane.go file to separate the forwarding pipeline of each border router interface into a receiver, multiple processing routines and a forwarder, where the pool of processing routines are shared among all border router interfaces of a border router.

The receiver makes a batch read from the network socket using pre-allocated buffers from a pool and parses the source and flowID from the packet. Then we hash them to map the tuple of source and flowID to a processing routine and forward the packet to them. The processing routine will then process the packet as usual and forward the packet to the forwarder which is responsible for the egress interface. The forwarder forwards the packet to the NIC and returns the buffer used to store the packet back to the buffer pool of the receiver that received that packet.

Because of the hashing, all packets of the same source and same flowID are processed by the same processing routine and hence no packet reordering can happen.
The number of processing routines will be configurable.

These improvements are supervised by Marc Wyss from the network security group at ETH.

matzf · 2023-04-03T16:26:40Z

This sounds great, cool that you want to tackle this!

As this is a significant change, we'll want to work out a more detailed design document as described in the contribution guide once there is general agreement that this proposal should be accepted. There is some more detailed information on these design documents in my work-in-progress PR here. I should really get around to finalize this 😳
The design document should describe the implementation in more detail. Things like how configuration options will work, how the code will be organized, how the implementation tasks can be split into multiple steps, etc. For this proposal specifically, I think it should also discuss the details of the buffer recycle mechanism. Also, any considerations on memory locality, locking goroutines to threads and cores etc, if this is something that we want to tackle. Maybe the whole thing could be organized in a way such that it would be at some point possible to replace the reading/writing goroutines with a different asynchronous IO mechanism, such as io_uring on linux?
I'm writing this just so you can maybe already start to thinking about this and prepare.

Perhaps one detail worth discussing on this level already is this; we should consider how we could add any special "slow path" handling of packets (e.g. malformed packets or SCMP traceroute), perhaps also including some form of rate limits for this, so that the "fast-path" is minimally affected.

marcfrei · 2023-04-05T10:53:21Z

Adding to the last point by @matzf, the new design should also consider more general traffic control schemes like, e.g., the one proposed in #4054, that we might want to add in the future.

matzf · 2023-04-06T13:31:16Z

Summarizing some additional points from the discussion of this proposal in the chat for future reference:

The current implementation was never built with performance in mind, it was mostly a PoC for the new Header format. (See posix-router: add header v2 posix-router #3869). From this perspective, it makes sense to quite heavily restructure it.
This implementation also serves as a reference implementation and so should remain somewhat simple and not be made unreadable by very complicated optimizations. This refers mainly to the actual packet processing logic, and applies less to the i/o parts.
There was some discussion on whether simple channels are a good choice for the communication between the goroutines involved here.
- Channels have been found to be slow in the past, but was with older Go versions so this may no longer be the case.
- Alternative could be a custom ringbuffer. There was an existing implementation in the code base that was used in the pre-header-v2 incarnation of the router (still used for dispatcher & gateway).
As a historical note and pointer for reference: the first version of the Go SCION router (pre header-v2) was implemented essentially in the newly proposed way, i.e., a reader go routine that was batch reading from the UDP socket, a set of processing go routines and one output go-routine per external SCION interface. Between the go-routines our own implementation of a ringbuffer was used.
That implementation was far from perfectly optimized and was hard to understand. Some of the complexity may also have been due to the (over-)ambitious configuration hot-reloading.
See https://github.com/scionproto/scion/tree/92531f5cb62197b9d705001c13e5a6bdb7ba1fa4/go/border

Going forward, we're likely going accept this proposal. The final comment period starts now, and unless there are objections during this period, the proposal will be marked as accepted. As mentioned above, @rohrerj, you can then submit a more in-depth design document describing how you plan to implement this proposal.

dtaht · 2023-04-08T21:45:32Z

This design is remarkably similar to my fq_codel stuff (RFC8290), which does flow isolation and active queue management. One of the tricks in fq_codel to speed processing is that sparser updates are processed sooner than more bulky updates.
Also somewhat related to that, in trying to understand scion for the first time, is what happens when the application is overloaded? How does it shed load?

matzf · 2023-04-17T09:46:38Z

No objections were raised during the final comment period, which means that this proposal is now accepted.
As mentioned before, please submit a design document, describing in more detail how this proposal will be implemented.

matzf · 2023-04-17T09:48:25Z

Thanks for the input @dtaht. I saw that you've joined the chat, I will respond to your question over there.

rohrerj · 2023-04-20T12:40:47Z

I created the PR 4339 for the design document.

Add a low-priority slow path for special case packet handling, in particular SCMP errors and traceroutes, for the the asynchronous packet processing pipeline in the router added in scionproto#4351. This is implementation part three of three described in the design document (scionproto#4339, doc/dev/design/BorderRouter.rst). Closes scionproto#4334

rohrerj added the i/proposal A new idea requiring additional input and discussion label Apr 3, 2023

matzf added the i/proposal final comment period Final comment period (10 days) label Apr 6, 2023

matzf added i/proposal accepted and removed i/proposal final comment period Final comment period (10 days) labels Apr 17, 2023

rohrerj mentioned this issue Apr 20, 2023

doc: design document for the border router redesign #4339

Merged

matzf closed this as completed in b4a2011 Sep 28, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Border Router performance optimization #4334

Border Router performance optimization #4334

rohrerj commented Apr 3, 2023

matzf commented Apr 3, 2023 •

edited

Loading

marcfrei commented Apr 5, 2023

matzf commented Apr 6, 2023

dtaht commented Apr 8, 2023

matzf commented Apr 17, 2023

matzf commented Apr 17, 2023

rohrerj commented Apr 20, 2023

Border Router performance optimization #4334

Border Router performance optimization #4334

Comments

rohrerj commented Apr 3, 2023

matzf commented Apr 3, 2023 • edited Loading

marcfrei commented Apr 5, 2023

matzf commented Apr 6, 2023

dtaht commented Apr 8, 2023

matzf commented Apr 17, 2023

matzf commented Apr 17, 2023

rohrerj commented Apr 20, 2023

matzf commented Apr 3, 2023 •

edited

Loading