# AsyncBTree: Revisiting Binary tree topology for efficient FPGA-based NoC implementation

Kizheppatt Vipin

Department of Electrical and Computer Engineering
Nazarbayev University, Astana, Kazakhstan
email:vipin.kizheppatt@nu.edu.kz

Abstract—Binary tree topology generally fails to attract network on chip (NoC) implementations due to its low bisection bandwidth. Fat trees are proposed to alleviate this issue by using increasingly thicker links to connect switches towards the root node. This scheme is very efficient in interconnected networks such as computer networks, which use generic switches for interconnection. In an NoC context, especially for field programmable gate arrays (FPGAs), fat trees require more complex switches as we move higher in the hierarchy. This restricts the maximum clock frequency at which the network operates and offsets the higher bandwidth achieved through using fatter links. In this paper we discuss the implementation of a binary tree-based NoC, which achieves better bandwidth by varying the clock frequency between the switches as we move higher in the hierarchy. This scheme enables using simpler switch architecture thus supporting higher maximum frequency of operation. The effect on bandwidth and resource requirement of this architecture is compared with other FPGA-based NoCs for different network sizes and traffic patterns.

Index Terms-network on chip, topology, binary tree, FPGA

### I. INTRODUCTION

Network on Chip (NoC) architectures enable high performance, scalable and power efficient multi-core systems for modern compute and communication intensive applications [1], [2]. Researchers have proposed different NoC topologies such as mesh, ring, torus, binary trees, star etc., each having varying degrees of quality of service, bandwidth and latency [3], [4]. Despite their simple architecture and routing algorithms, binary trees are generally not attractive for NoC implementations. It is mainly because of its lower bisection bandwidth. Fat trees are proposed as a remedy to improve the bandwidth by adding more number of links when moving towards the root node. This solution works well in traditional interconnect networks such as computer networks [5]. In an NoC environment, their advantage is limited since the switch complexity increses as the network size increases. This limits the maximum supported clock frequency of the network thus bringing down the system performance.

Theoretically instead of increasing the link width between tree levels, increasing the clock frequency between them should provide the same benefit. In traditional computer networks this may not be possible since the network interfaces operate on predefined standards which restrict the frequency of operation. In an NoC environment, this is very much possible since all the compute instances are within the same chip and are not restricted by any physical protocol. Modern FPGAs









Fig. 1. Different NoC topologies

support asynchronous FIFOs, which make the implementation of such asynchronous switches easier. These switches have simpler architecture than fat tree switches and support better clock frequency. But they are more resource intensive compared to traditional binary trees.

Although asynchronous NoCs are proposed before for integrated circuits, they are not evaluated for binary tree performance improvement [6], [7], [8]. A quantitative analysis is missing in the literature especially for FPGA implementations. In this work we present a quantitative analysis of different tree topologies namely the binary tree, binary fat tree and asynchronous binary tree when targeting FPGA based NoC implementation. The main contributions of this work are

- detailed design of an open-source globally asynchronouslocally synchronous (GALS) binary tree-based NoC implementation targeting Xilinx FPGAs,
- performance evaluation of the proposed infrastructure with traditional binary trees and the state-of-the-art open source binary fat tree, torus and mesh topologies,



Fig. 2. (a)A binary tree topology utilizing switches operating at different clock frequencies (b) The tree as an H-tree for better floorplanning on the FPGA

• and an analysis of trade-off points for the different implementations when targeting FPGAs.

The remainder of this paper is organized as, Section II discusses the relevant background, Section III discusses the architecture of the proposed NoC, Section IV discussed the performance metrics and V concludes the paper and gives the future research directions.

#### II. BACKGROUND

Network on chip is an interconnect approach that helps different IPs and subsystems in a chip to communicate with each other in an efficient and scalable manner. In this approach each processing element (PE) is connected to a switch and multiple switches are interconnected to form a network. A PE could be a processor core, a DSP core or an IP block. The network infrastructure helps in routing data from one PE to another in the form of data packets. NoCs have found varying applications such as image and signal processing [1], multiprocessor systems [9], virtual machine implementations [10] etc. Based on how the switches are interconnected, there are different NoC topologies such as mesh, torus, tree, ring, star, BFT etc. as shown in Fig. 1 [11].

In a mesh topology every switch, except the ones on the edges, is connected to 4 other neighboring switches. A torus topology is similar to mesh but cyclic in nature. In a binary tree, switches are arranged in a hierarchy. Each switch has a parent node and two child nodes. Unlike mesh and torus where each switch has a corresponding PE, in a tree topology only the switches at the bottom most level (leaf nodes) are connected to PEs.

For interconnected networks, an important performance parameter is the bisection bandwidth [12]. It is defined as the minimum bandwidth between two equal partitions of the network. For a mesh topology, it is  $\sqrt{n} * B$ , where n is the number of switches and B is the bandwidth of a single link. For torus, it is twice that of mesh but for a binary tree, it is only B. To address this issue, instead of using a single link between switches, more links can be used between them as we go higher in the tree hierarchy. Such topology



Fig. 3. Architecture of the different switches used in the AsyncBTree. (a) Type-1 switch used with the leaf nodes (PEs), with complete cross bar switch and asynchronous FIFOs with receive and transmit interfaces (b) Type-2 switch used in intermediate nodes with partial crossbar switch and synchronous FIFOs (c) Type-3 switches used in intermediate switches with partial cross bar and asynchronous FIFOs with receive and transmit interfaces.

is called a fat tree [13], [14]. Although this will improve the bisection bandwidth, the switches in the upper hierarchy becomes more and more complex. In this work we analyze whether using asynchronous switches with same link width can provide similar performance of fat trees while keeping relatively simpler switches.

Æthereal is a popular NoC implementation which provides guaranteed quality of service through pipelined time-division-multiplexed circuit switching [15]. But it requires to explicitly set up a channel on the routing path before transmitting the first payload packet, and a flow cannot use more than its guaranteed bandwidth share even if the network is underutilized. The MANGO [16] architecture is a clock-less NoC to provide connection-oriented guaranteed services (GS) as well as connection-less best-effort (BE) routing. The clock-less implementation has the advantage of zero dynamic power consumption when the network is idle, but the main challenge



Fig. 4. AXI4-Stream protocol

is its interfacing with the standard IP-cores. IP cores available from vendors and third-party developers are synchronous in nature thus requiring a clock signal between them and the NoC. Both these implementations target ASIC implementations and are not evaluated on FPGAs to understand the maximum possible speed of their operation.

There have been previous efforts to develop NoC architectures specifically targeting FPGAs. CONNECT NoC generator is the most popular among them [17]. CONNECT is inspired by the fact that FPGAs have a large routing infrastructure available compared to memory and logic elements and tries to exploit it. It supports different NoC topologies and uses a single stage pipeline mechanism to minimize hardware and latency. It has low operating frequency and is still quite resource intensive as seen in Section IV. Split-merge is another NoC infrastructure developed at University of Pennsylvania [18]. It tries to overcome the limited clock performance of CONNECT at the expense of few more resources.

To the best of our knowledge, there have been no previous efforts to implement asynchronous switch-based NoCs targeting FPGAs. In this work we give a quantitative analysis of the performance of asynchronous binary trees. We compare their performance with traditional and fat trees as well as other popular FPGA NoCs. The binary tree implementations are available as open-source for other researchers to verify and to improvise the designs.

# III. ARCHITECTURE

An AsyncBTree (asynchronous binary tree) tries to achieve better performance compared to a conventional binary tree and binary fat tree in terms of resource utilization and throughput by applying topology specific optimizations and using asynchronous links between different tree levels. Fig. 2 shows the architecture of an AsyncBTree utilizing different kind of optimized switches at different levels. The detailed architecture of the switches are discussed in Section III-A. The binary tree is placed and routed in the FPGA as an H-tree for efficient resource utilization and better floor-planning. Such placement also supports partial reconfiguration of a portion of the NoC in an efficient manner.

As the first optimization, the root node of the tree is removed and the switches at level-1 are directly connected. Since the connectivity of the root node is only 2, in practical systems they act as a transparent switch. They could be useful where packets are injected to the NoC through the root node by making its connectivity into 3. For FPGAs external interfaces such as PCI express and Ethernet are used for injecting packets. The hard-macros corresponding to these interfaces are situated along the periphery of the chips. Thus, it will provide better clock performance when the packets are injected from one the leaf nodes which incorporates one of these hard macros. Removing the root node helps in reducing the resource utilization.

#### A. Switches

AsyncBTrees use three different kinds of switches for packet routing. The architecture of the proposed type-1 switch is as shown in Fig. 3(a). Type-1 switches are used only in the leaf nodes for directly interfacing with PEs. The switch has separate interface for receiving and transmitting packets from two PEs and a single interface to the parent node. Each receive and transmit interface to/from the PEs are connected to asynchronous FIFOs. The interface to the parent node implements a single synchronous FIFO for the receive interface but no transmit FIFO. Thus each switch contains 5 FIFOs.

Asynchronous FIFOs can operate their read and write interfaces using independent clocks. Depth of the FIFOs are kept very low (16) to reduce the resource utilization. The asynchronous FIFOs receive data from the downstream ports on Clock 1 signal. The received packets are routed to the appropriate output ports by an arbitrator through a cross-bar switch. The read side of the receive FIFO, the write side of the transmit FIFO, the arbitrator and the cross bar works on Clock 2 signal, whose frequency will be much higher than that of Clock 1. In order to match the performance of a binary fat tree, Clock 2 frequency should be twice as that of Clock 1. Type-1 switches implement a full cross bar, which enables loop back of packets at the switch level. The arbitrator internally uses flit-level round-robin arbitration scheme to select the input port when more than one port requests for the same output port.

Fig. 3(b) shows the architecture of a type-2 switch. Type-2 switches are synchronous in nature and are similar to the switches of traditional binary trees. These switches implement only partial cross-bar switch where an incoming packet cannot be routed back to the same port. Since these switches are used only in intermediate levels, there is no necessity to to support loop-back since they are already implemented by Type-1 switches. This simplifies the switch design and helps in reducing resource utilization.

Type-3 switches (Fig. 3(b)) are similar to type-1 switches except that they implement only a partial cross-bar similar to type-2 switches. Theoretically to match the performance of a binary fat tree, AsyncBTrees should be implementing type-3 switches at every level with increasing clock frequencies. But in practical scenarios, doubling clock frequency at each level is not possible in FPGAs as the tree size increases. The maximum frequency supported by modern devices are in the order of hundreds of megahertz. Hence for practical implementations, AsyncBTrees increases the clock frequency



Fig. 5. Packet structure

only for alternative levels. For example a NoC with 8 PEs and three levels (as shown in Fig. 2), clock frequency is doubled between the PE interface and the output of level-3 (type-1) switches. The input and output frequencies of next level swiches (here type-2 switches) are same, which is equal to the output frequency of type-1 switches. Again, the clock frequency is doubled between the input and output of level-1 switches (here type-3 switches). Although asynchronous FIFOs can operate using same clock signal for read and write interfaces, the resource utilization of an asynchronous FIFOs is much more than its synchronous counterpart for a given FIFO size. For reducing resource consumption, levels which operate on synchronous clock uses type-2 switches and levels which operate on asynchronous clock uses type-3 switches.

## B. Packet Format

The NoC uses a simple packet format with destination PE address and payload as shown in Fig. 5. The NoC implementation is fully configurable such that it can support any data width with varying number of PEs. The total packet size depends upon these two parameters.

## C. Routing

AyncBTree uses fixed routing based on the destination address of the packet header. The routing is flit-level meaning each packet is expected to have the destination PE address in the header. Larger packets are sent as multiple flits. One major advantage of binary trees is the multiple packets sent from one PE to another will be always delivered in the sent order. In other packet switched networks such as mesh or torus, the packets could be delivered in out of order depending on the routing algorithms. In this case additional logic is required for packet reassembly and packet numbers also have to be inserted into the payload.

The routing table of each switch in AsyncBTree contains four entries corresponding to the smallest and largest PE addresses in its left and right sub-trees. If the destination address is within the range of left sub-tree, it is routed left and if it is within the range of right sub-tree the packet is sent right. If the address is not within these ranges, the packet is routed towards the parent node. Due to the deterministic routing policy and since the routing is at flit-level, the NoC is free of dead or live locks.

#### D. Flow control

The current AsyncBTree implementation does not include virtual channels and flow control is achieved through at electrical signaling level. The interface between switches as well as switches and PEs confirm to AXI4-stream interface. Such interface also enables seamless integration of several

vendor supported IP cores directly with the NoC. As per AXI4-stream protocol, a successful data transfer happens only when the *valid* signal from the transmitter and *ready* signal from the receiver are asserted simultaneously as shown in Fig. 4.

For comparison purpose, following the same design principles, we implemented traditional (synchronous) binary trees also. They follow the same arbitration and routing architecture except that all asynchronous FIFOs are replaced with synchronous counterparts.

### IV. RESULTS AND DISCUSSION

In this section we discuss the implementation and performance comparison of the different NoC implementations. All designs are modeled using Verilog HDL and extensively simulated for their functional correctness. We use the CONNECT open-source NoC platform as the fat tree, mesh and torus references [17]. The designs are simulated as well as implemented with Vivado 2017.3 targeting Xilinx xc7vx690t FPGA (VC709 evaluation board).

Table I compares the resource utilization and the maximum frequency of operation for the binary tree, AsyncBTree, and fat tree for different network sizes (number of PEs). For all implementations, the interface between PEs and the switches are kept 32-bits wide. As expected, binary trees are least resource intensive due to their simple switch architecture. AsyncBTree consumes 45% to 65% more LUTs (look-uptables) and about 165% more flip flops compared to the binary tree implementation. The multiple frequencies in the AsyncB-Tree rows represent the maximum clock frequency supported at different tree levels. At the lowest level (switches connected to PEs), the clock performance is better than that of binary trees but deteriorates as going higher in the hierarchy. This could be because of the additional pipelining present inside the asynchronous FIFOs. This also means if AsyncBTree is used as a synchronous NoC (all tree levels are clocked by a single clock source), its resource consumption and performance will be worse than a binary tree.

Compared to AsyncBTree, fat trees consume  $\sim 3.7 \times$  LUTs but less than half the number of flip flops. AsyncBTree

TABLE I
RESOURCE UTILIZATION AND MAXIMUM CLOCK PERFORMANCE OF
DIFFERENT BINARY TREE BASED NOC CONFIGURATIONS

| NoC         | # of PEs | LUTs  | FFs   | Fmax (MHz)  |
|-------------|----------|-------|-------|-------------|
| Fat Tree    | 4        | 2833  | 760   | 180         |
| Fat Tree    | 8        | 3660  | 912   | 150         |
| Fat Tree    | 16       | 12431 | 3040  | 135         |
| Fat Tree    | 32       | 37047 | 8512  | 94          |
| Binary Tree | 4        | 522   | 510   | 450         |
| Binary Tree | 8        | 1616  | 1566  | 407         |
| Binary Tree | 16       | 3603  | 3726  | 420         |
| Binary Tree | 32       | 6077  | 6850  | 425         |
| AsyncBTree  | 4        | 760   | 1350  | 481,416     |
| AsyncBTree  | 8        | 2242  | 4100  | 506,407,418 |
| AsyncBTree  | 16       | 4212  | 7506  | 483,386,416 |
| AsyncBTree  | 32       | 10046 | 18360 | 460,387,377 |



Fig. 6. Throughput of different Binary NoC architectures with varying size corresponding to different traffic patterns when all of them are clocked at the same clock frequency



Fig. 7. Worst-case latency of different Binary NoC architectures with varying size corresponding to different traffic patterns when all of them are clocked at the same clock frequency



Fig. 8. Throughput of different NoC architectures with varying size corresponding to different traffic patterns when all of them are clocked at the maximum supported clock frequency

requires more flip flops due to the presence of asynchronous FIFOs. For fat trees, the impact due to complex switches can be clearly seen in the clock performance, where they are not even able to achieve 200 MHz for a high-end FPGA. Due to the low clock performance of the NoC, the PEs also have to be under-clocked in most scenarios for overall synchronous operations. Considering the fact that the number of LUTs available in 7-series Xilinx FPGAs are half of that of number of flip-flops, AsyncBTree is much lite compared fat trees at the same time given more than  $4\times$  clock performance.

Table II lists the resource utilization and clock performance of the most popular FPGA NoC topologies namely mesh and torus. Data shows that these topologies are quite resource intensive compared to all binary tree configurations, especially the number of LUTs. In terms of clock performance, for larger

NoC configurations, they perform better than fat trees but inferior to traditional binary trees and the proposed AsyncBTrees.

TABLE II
RESOURCE UTILIZATION AND MAXIMUM CLOCK PERFORMANCE OF MESH AND TORUS NOC TOPOLOGIES WITH DIFFERENT CONFIGURATIONS

| NoC   | # of PEs | LUTs  | FFs  | Fmax (MHz) |
|-------|----------|-------|------|------------|
| Torus | 4        | 3763  | 960  | 164        |
| Torus | 8        | 7615  | 1920 | 164        |
| Torus | 16       | 15365 | 3840 | 149        |
| Torus | 32       | 32689 | 7680 | 141        |
| Mesh  | 4        | 1337  | 576  | 215        |
| Mesh  | 8        | 3940  | 1344 | 179        |
| Mesh  | 16       | 10471 | 3072 | 165        |
| Mesh  | 32       | 23798 | 6528 | 158        |
|       |          |       |      |            |



Fig. 9. Maximum latency of different Binary NoC architectures with varying size corresponding to different traffic patterns when all of them are clocked at the maximum supported clock frequency

Fig. 6 and 7 compares the throughput and latency of the three implementations with different NoC sizes for different traffic patterns such as random, tornado and reverse [19]. The different patterns are generated based on how the destination addresses as generated for each data packet. In each case, the PE to switch interface is clocked at the lowest frequency supported among the three implementations. For AsyncBTree upper levels are clocked at double the frequency of lower levels, but limited by maximum supported frequency given in Table I. It could be seen that for NoC size up to 8 PEs, AsyncBTree performance is better than or comparable to that of fat trees. For larger tree sizes, fat trees perform better since the clock frequencies cannot be scaled beyond a limit. If PEs run at lower clock frequencies (~50 MHz), AsyncBTree can provide better performance for NoC with up to 16 PEs.

Fig. 8 and 9 compares the throughput and latency when each implementation is running at its maximum supported frequency. Again for smaller NoC sizes binary tree and AsyncBTree out performs fat tree for random traffic pattern. But for larger NoC sizes fat trees are clearly advantageous. Fig. 8(a) also shows the performance of mesh topology, which is the most popular NoC topology, compared to different tree topologies. Again for smaller NoC sizes (8 or less) AsyncTree performance is better. Several FPGA-based multi-processor systems have 8 core or less thus AsyncBTree could be much suitable for their implementation.

Fig. 10 compares the performance of each NoC compared to their resource utilization. The total number of resources is calculated by adding the number of LUTs with the scaled number of flip-flops. The number of flip-flops is multiplied by a factor of 0.5 since in Xilinx 7-series FPGAs there are twice the number of flip-flops compared to LUTs in every logic slice. Synchronous binary trees clearly have an upper hand in this regard. AsyncBTree give moderately high throughput by consuming relatively less resources. But for larger NoC size, mesh topology is still the suitable candidate.

# V. CONCLUSION

In this paper we discussed the implementation of an asynchronous binary tree based NoC architecture targeting FPGA implementation and compared with other tree and mesh implementations. It could be seen that for low performance applications, binary trees are best suitable due to their high



Fig. 10. Throughput vs cost

throughput to resource utilization ratio. For high-performance large NoC sizes, mesh topology is the ideal candidate. For high-performance small NoC sizes, AsyncBTree is a suitable candidate. In future we will be analyzing the effect of using asynchronous switches with other topologies such as mesh and torus. Implementation of AsyncBTree and binary tree are provided as open source which could complement the CONNECT NoC platform [20].

### REFERENCES

- J. Joshi, K. Karandikar, S. Bade, M. Bodke, R. Adyanthaya, and B. Ahirwal, "Multi-core image processing system using network on chip interconnect," in *Midwest Symposium on Circuits and Systems*, 2007.
- [2] C. Neeb, M. J. Thul, and N. Wehn, "Network-on-chip-centric approach to interleaving in high throughput channel decoders," in *IEEE Interna*tional Symposium on Circuits and Systems, May 2005, pp. 1766–1769 Vol. 2.
- [3] W. Dally and B. Towles, Principles and Practices of Interconnection Networks. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2003.
- [4] S. Kumar, A. Jantsch, J. P. Soininen, M. Forsell, M. Millberg, J. Oberg, K. Tiensyrja, and A. Hemani, "A network on chip architecture and design methodology," in *Proceedings IEEE Computer Society Annual Symposium on VLSI (ISVLSI)*, 2002, pp. 105–112.
- [5] G. Shainer, "Networks: Topologies how to design," HPC Advisory Council, Tech. Rep., 2011.
- [6] T. Bjerregaard and J. Sparso, "A router architecture for connectionoriented service guarantees in the MANGO clockless network-on-chip," in *Design, Automation and Test in Europe*, March 2005, pp. 1226–1231.
- [7] E. Beigne, P. Vivet, M. Renaudin, and J. Quartana, "Communication node architecture in a globally asynchronous network on chip system," U.S. Patent US7 940 666B2, 2011.

- [8] I. M. Panades and A. Greiner, "Bi-synchronous fifo for synchronous circuit communication well suited for network-on-chip in gals architectures," in *International Symposium on Networks-on-Chip (NOCS)*, May 2007, pp. 83–94.
- [9] D. Bertozzi, A. Jalabert, S. Murali, R. Tamhankar, S. Stergiou, L. Benini, and G. D. Micheli, "NoC synthesis flow for customized domain specific multiprocessor systems-on-chip," *IEEE Transactions on Parallel and Distributed Systems*, vol. 16, no. 2, pp. 113–129, Feb 2005.
- [10] G. Mathias and K. Kent, "An embedded Java virtual machine using network-on-chip design," in Seventeenth IEEE International Workshop on Rapid System Prototyping, 2006.
- [11] M. Ortín-Obón, D. Suárez-Gracia, M. Villarroya-Gaudó, C. Izu, and V. Viñals-Yúfera, "Analysis of network-on-chip topologies for costefficient chip multiprocessors," *Microprocessors and Microsystems*, vol. 42, pp. 24 – 36, 2016.
- [12] W.-C. Tsa, Y.-C. Lan, Y.-H. Hu, and S.-J. Chen, "Networks on chips: Structure and design methodologies," *Journal of Electrical and Computer Engineering*, vol. 2012, p. 15, 2012.
- puter Engineering, vol. 2012, p. 15, 2012.
  [13] C. E. Leiserson, "Fat-trees: Universal networks for hardware-efficient supercomputing," *IEEE Transactions on Computers*, vol. C-34, no. 10, pp. 892–901, Oct 1985.
- [14] S. R. Ohring, M. Ibel, S. K. Das, and M. J. Kumar, "On generalized fat trees," in *International Parallel Processing Symposium*, April 1995, pp. 37–44.
- [15] K. Goossens, J. Dielissen, and A. Radulescu, "Aethereal network on chip: concepts, architectures, and implementations," *IEEE Design Test of Computers*, vol. 22, no. 5, pp. 414–421, Sept 2005.
  [16] T. Bjerregaard and J. Sparso, "Implementation of guaranteed services in
- [16] T. Bjerregaard and J. Sparso, "Implementation of guaranteed services in the MANGO clockless network-on-chip," *IEE Proceedings - Computers* and Digital Techniques, vol. 153, no. 4, pp. 217–229, July 2006.
- [17] M. K. Papamichael and J. C. Hoe, "CONNECT: re-examining conventional wisdom for designing nocs in the context of FPGAs," in Proceedings of the ACM/SIGDA international symposium on Field Programmable Gate Arrays, 2012, pp. 37–46.
- [18] Y. Huan and A. DeHon, "FPGA optimized packet-switched NoC using split and merge primitives," in 2012 International Conference on Field-Programmable Technology, Dec 2012, pp. 47–52.
- [19] J. H. Bahn and N. Bagherzadeh, "A generic traffic model for on-chip interconnection networks," in *International Workshop on Network on Chip Architectures (NoCArc)*, 2008.
- [20] [Online]. Available: https://github.com/dsdnu/TreeNoC