# Network-on-Multi-Chip (NoMC) for multi-FPGA multimedia systems

Marta Stepniewska, Adam Luczak, Jakub Siast
Chair of Multimedia Telecommunications and Microelectronics,
Poznan University of Technology
{mstep, aluczak, jsiast}@multimedia.edu.pl

#### Abstract

Some applications, especially in the area of multimedia processing, are too large for a single chip and need to be implemented on a multi-chip platform. An efficient communication infrastructure for such systems may be designed with the use of Networks-on-Chip (NoCs). However, a network for multi-chip systems requires a scalable architecture. Moreover, for multimedia purposes, such a NoC should support a multicast transmission mode. In order to meet these requirements, we propose the NoMC (Network-on-Multi-Chip), a hierarchical interconnect system designed for multi-chip systems. The performance of the proposed network is assessed using a model of an MVC (Multiview Video Coding) coder. In such a system, the multicast transmission mode may yield an overall bandwidth gain of up to 30%. Moreover, the synthesis results show that the proposed network elements are easily synthesizable for FPGA devices.<sup>1</sup>

## 1. Introduction

New multimedia systems, especially those intended for 3D video, require processing with a variety of complex coding and decoding techniques. Multiview Video Coding (MVC) [1] is one such technique which, due to its complexity, consumes hardware resources extensively. Because of the large size of a complete MVC encoder, its implementation may need to span several chips. A similar situation may occur for other new video processing algorithms.

A multi-chip system including, e.g., several FPGA devices may be successfully exploited as a test platform as well as a target platform for low- and medium-volume designs. However, the bottleneck of such a system is the communication infrastructure.

An efficient solution for connecting the modules of a hardware application is the Network-on-Chip (NoC).

However, a NoC for multi-chip systems should exhibit the following features:

- Scalability, i.e. extending the current structure with new network modules must not require changes in the existing part of the network,
- Multicast support,
- Synthesizability for FPGA.

Scalability in a multi-chip system may be achieved by introducing a hierarchy into the interconnect system. However, not all hierarchical networks are flexibly scalable. Some works, e.g. [3], introduce a hierarchy to improve the flow of network traffic and to ease resource management. A large NoC is divided into subnets, which are connected through additional links. Such an approach, however, is not suitable for multi-chip systems, because the higher-level interconnections need to be redesigned whenever a subnet is changed.

A more flexible solution has been shown in [4]. However, it is designed for a hierarchical arrangement of Chip Multiprocessors (CMPs) based on a mesh topology. In the case of non-homogeneous tiles, such a structure is inefficient. Moreover, the mesh topology used as a higher-level interconnect system (also shown in [2]) is not easily scalable.

Another solution, presented in [5], is a three-level hierarchical NoC with a Wishbone [6] bus at the lowest level. Communication with the two higher levels of the hierarchy is provided by network interfaces. The higher levels consist of fully connected routers. Such an approach requires a routing protocol that avoids deadlocks in the system. Another issue is the high density of connections between routers, which becomes a problem in large networks.

Moreover, in multimedia applications many cores use the same data in the coding process. In the unicast transmission mode these data need to be sent separately to each destination. In order to enhance network performance, multicast transmission should be introduced.



<sup>1</sup> The work was supported by public funds as a research project.



Figure 1. The NoMC elements

In this paper we propose a variation of the NoC for multi-FPGA systems called the Network-on-Multi-Chip (NoMC). The NoMC is a hierarchical network. Moreover, to provide efficient data processing and to improve network performance, we introduce multicast transmission. Our proposal is designed to synthesize easily on FPGAs because, in most cases, a circuit optimized for an FPGA is also efficient as an ASIC (but not the reverse). For this reason, it is convenient to design NoCs with FPGAs in mind.

This paper is organized as follows. Section 2 summarizes the basic solutions chosen for our proposal. Section 3 describes the network elements and the architecture of both the local and the external networks. Section 4 describes the network protocol. The assessment of our proposal is discussed in Section 5, followed by the conclusions.

## 2. The proposed architecture overview

We propose a hierarchical network architecture called the Network-on-Multi-Chip (NoMC) which is intended for multi-chip designs. The proposed architecture consists of 3 levels of hierarchy, starting from the lowest level:

- Local network (also called the group or group of PEs) that contains the Processing Elements (PEs), network interfaces called the EndPoints (EPs) and routers,
- Cluster that provides connectivity for a set of groups (local networks),
- System, which is introduced to interconnect the clusters.

The lowest level of the hierarchy, i.e. the local network, provides basic connectivity for the PEs. The two higher levels, i.e. the cluster and the system, provide interconnection between groups and are referred to as the external network.

The proposed hierarchy levels are separated by dedicated devices called gateways, which format data to suit the requirements of the respective level. Such an approach allows each level to be designed separately, i.e. the architecture and the protocol used at one level do not influence those of another.

The local network architecture is defined only by a set of devices (the routers and the EndPoints, i.e. the network interfaces for the PEs) that can be connected in any topology. Since routers are expensive in terms of hardware consumption, their number should be as low as possible. To meet that requirement, the functionality of the EndPoint is slightly extended compared to commonly known network interfaces [7]. The proposed EndPoint is able to perform basic switching operations, so EPs may be connected to each other without the need for more sophisticated routers implementing routing tables.

The proposed solution for the external network architecture (the cluster and system levels) is based on a tree topology. The distinction between the cluster and the system level has been introduced in order to connect the clusters flexibly. The tree topology is used at both levels because it arranges routers hierarchically. Such an architecture allows a simplified routing algorithm and packet handling protocol, which reduces hardware consumption.

The proposed addressing scheme is adjusted to the hierarchical architecture. Each address consists of three parts, each referring to one level of the network hierarchy. At a particular level only that level's own part of the address is recognized. In order to introduce a multicast transmission mode, we propose to allow more than one destination address per packet. Each address is then checked in every network element. A packet is copied only where the routes to its destination addresses split.

## 3. NoMC architecture

The NoMC is defined by a set of network elements which can be connected differently at each hierarchy level. There are three types of network elements: EndPoints (i.e. network interfaces), routers and gateways.

### 3.1. EndPoint

The EndPoint is a network interface that provides connectivity for the PE. It includes a multiplexer in the transmitting part (TX) and a demultiplexer in the receiving part (RX), as shown in Fig. 1a. When a packet arrives, its destination address is checked (this is done on the fly, thanks to the network protocol described in Section 4.2). If the packet is destined for the attached PE, it is directed to the PE's receiving buffer. Otherwise, it is passed on to the next interface. When the PE sends a packet, the EndPoint first checks whether the TX input (Fig. 1a) is free. If it is, the packet is placed in the TX output FIFO. Such an approach allows building EP chains (Fig. 2). In commonly known NoCs [7] each network interface has to be connected to a router port, whereas in the NoMC only one EndPoint of the chain needs to be. Such an approach is especially suitable for FPGAs because it can reduce the number of routers and thus the chip area occupied by the network components.
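The receive-side behaviour of an EP chain can be illustrated with a small behavioural model. This is only a sketch: the class and field names (`Packet`, `EndPoint`, `next_ep`, `rx_buffer`) are hypothetical, the TX multiplexer and FIFO are not modelled, and a packet that matches no EndPoint in the chain is simply reported as undelivered.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Packet:
    dest: int      # EndPoint part of the destination address
    payload: str   # user data (placeholder)


@dataclass
class EndPoint:
    address: int
    next_ep: Optional["EndPoint"] = None              # next interface in the chain
    rx_buffer: List[Packet] = field(default_factory=list)

    def receive(self, packet: Packet) -> bool:
        """Deliver the packet to the attached PE or pass it down the chain."""
        if packet.dest == self.address:
            self.rx_buffer.append(packet)             # destined for the attached PE
            return True
        if self.next_ep is not None:
            return self.next_ep.receive(packet)       # pass on to the next interface
        return False                                  # end of chain (assumed handling)


def build_chain(addresses: List[int]) -> List[EndPoint]:
    """Connect EndPoints one after another, as in an EP chain."""
    eps = [EndPoint(a) for a in addresses]
    for current, following in zip(eps, eps[1:]):
        current.next_ep = following
    return eps


chain = build_chain([1, 2, 3])
delivered = chain[0].receive(Packet(dest=3, payload="reference frame"))
```

Only the first EndPoint of such a chain would need a router port; the remaining ones are reached through their predecessors.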

### 3.2. Router

The routers at each level share similar functionality, i.e. routing and the packet handling protocol. They contain up to 4 ports. In order to mark the default direction to/from a gateway or a root router, two port types (R: to/from Root, and L: Leaf) have been introduced, as shown in Fig. 1b. Port R indicates the direction to/from the group, cluster or system root. This port can be connected only to a router of the same level or to the gateway to the higher level of the hierarchy. If this port is not connected, the router acts as a system root and, e.g., discards packets with an unknown destination. In contrast, port L indicates the direction to devices that are at a lower position in the system or cluster hierarchy. That is, port L can be connected to port R of another router at the same level or to a group gateway.

Additionally, in a local network, port L may be connected to an EP.

### 3.3. Gateway

The gateway is an interface between two levels of the network hierarchy. Its main task is the translation of the destination address between the two levels. Moreover, the gateway may perform more complicated actions, such as packet format and/or address translation, if the attached part of the network does not comply with the higher-level network protocol.

The gateway is specified as a device with two ports, R and L, whose meaning is similar to that of the router ports. Port L is connected to a network element inside a group or a cluster (the lower level in the system hierarchy). Port R indicates the direction to the root. Figure 1c shows the basic gateway architecture.

### 3.4. Local network

The local network is the basic entity of the NoMC. A good example of a local network is the interconnect system of an AVC/H.264 encoder. The local network consists of PEs, network interfaces called EndPoints, and routers. Figure 2 shows an exemplary local network architecture with two EP chains connected to one router. The connection to the external network is provided by the gateway.

Network and processing elements can be connected in any topology. Connecting the EPs in chains is optional; it is also possible to connect each network interface directly to a router (i.e. not to exploit the switching capabilities of the EPs). Although such a network would be more homogeneous, it would consume many more chip resources.



Figure 2. An exemplary local network interconnect system

### 3.5. External network

The external network provides connectivity for groups of PEs and for clusters. The inter-group and inter-cluster networks are based on a tree topology, which introduces a hierarchical arrangement of network elements with the root at the top. Both levels use routers of a similar architecture. Such an approach is highly scalable and, thanks to its flexibility, simplifies interconnect design for big applications. An exemplary cluster scheme is shown in Fig. 3.



Figure 3. Example of an external network interconnect system

The proposed addressing scheme is very flexible and may be adjusted to various needs in the design process. The group level is intended to be used on a single chip. When several chips are connected, the cluster level can be introduced. In more complicated designs (e.g. when connecting boards with several chips on each) it may be useful to add the system-level network.

## 4. Network protocol

The network protocol is designed to take advantage of the hierarchical arrangement of the network elements. At each level, the gateway or the root router is identified as the device of last resort: all packets with an unknown destination address are directed to it. This follows from a routing protocol based on registering EPs, groups and clusters in the network, each registering only at its own level of the hierarchy. The registration mechanism ensures that the root device of each hierarchy level knows the whole network attached to it.

### 4.1. Addressing

Addressing in the NoMC is hierarchical. A 24-bit address consists of three fields, each corresponding to one level of the network, as shown in Fig. 5. Each part of the address (referring to the EndPoint, group or cluster) is 7 bits long; the remaining bits are reserved for future use.



Figure 5. Fields in a network address

Using this convention, the system can address up to about 2 million EndPoints. Moreover, only small routing tables are required at each level. Such short routing tables (128 entries of 7 bits for each level of the hierarchy) are well suited to implementation in small blocks of distributed memory.
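The figures above follow directly from the field widths. Reading the address as three independent 7-bit fields, and each routing table as 128 entries of 7 bits, the upper bound on the number of addressable EndPoints and the table size per level are:

$$
N_{\mathrm{EP}} = 2^{7} \cdot 2^{7} \cdot 2^{7} = 2^{21} = 2\,097\,152 \approx 2 \cdot 10^{6},
\qquad
S_{\mathrm{table}} = 2^{7} \cdot 7\ \mathrm{bit} = 896\ \mathrm{bit}.
$$

The symbols $N_{\mathrm{EP}}$ and $S_{\mathrm{table}}$ are introduced here only for illustration. The first value is an upper bound, since it does not account for any field values that may be reserved.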

Each network level deals only with its corresponding part of the address. The higher-level part of the address should be equal to zero, which indicates that the packet is destined for a device at the same hierarchy level. If it is not, the packet is passed through the dedicated port R of the routers, towards the gateway. Inside a group or cluster only the gateway knows the address of that group or cluster. If the gateway receives a packet destined for a device inside its group or cluster, it strips off the group or cluster address, i.e. sets it to zero. Conversely, if the packet is to be sent up in the hierarchy, the gateway adds the group or cluster address.
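A minimal sketch of this address handling is given below. The bit layout is an assumption, since it is not spelled out in the text: each 7-bit field is placed in its own byte of the 24-bit address, with the cluster field in the most significant byte and the EndPoint field in the least significant one. All function names are hypothetical, and the sketch does not decide whether the gateway operates on the source or the destination address of a packet.

```python
FIELD_BITS = 7
FIELD_MASK = (1 << FIELD_BITS) - 1   # 0x7F, largest value of one field
SLOT_BITS = 8                        # assumed: one byte per field, top bit unused


def pack_address(cluster: int, group: int, ep: int) -> int:
    """Build a 24-bit address from the three 7-bit fields."""
    for value in (cluster, group, ep):
        if not 0 <= value <= FIELD_MASK:
            raise ValueError("each address field must fit in 7 bits")
    return (cluster << (2 * SLOT_BITS)) | (group << SLOT_BITS) | ep


def unpack_address(address: int):
    """Return the (cluster, group, ep) fields of a 24-bit address."""
    return (
        (address >> (2 * SLOT_BITS)) & FIELD_MASK,
        (address >> SLOT_BITS) & FIELD_MASK,
        address & FIELD_MASK,
    )


def strip_group(address: int) -> int:
    """Group gateway, packet entering the group: set the group field to zero."""
    cluster, _group, ep = unpack_address(address)
    return pack_address(cluster, 0, ep)


def add_group(address: int, own_group: int) -> int:
    """Group gateway, packet leaving the group: insert the gateway's group address."""
    cluster, _group, ep = unpack_address(address)
    return pack_address(cluster, own_group, ep)


def is_local(address: int) -> bool:
    """True if both higher-level fields are zero, i.e. the target is in this group."""
    cluster, group, _ep = unpack_address(address)
    return cluster == 0 and group == 0
```

A cluster gateway would apply the same two operations to the cluster field instead of the group field.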

### 4.2. Packet format and multicast transmission

Packets have a restricted size, which depends on the receiving buffers used in the network elements: a buffer must be able to receive a full, maximum-length packet. In our implementation packets have at most 32 flits. Such short messages make network traffic more fluent. A packet starts with the Destination Address command and ends with the EndOfPacket command, as shown in Fig. 6.



Figure 6. The NoMC packet format

We also propose to enable a multicast transmission mode. Current NoC proposals [7] support only a unicast transmission mode, i.e. packet delivery between a single source and a single destination. Multicast transmission allows sending one packet to multiple destinations, copying it in routers only where the routes split.

In order to enable multicast, a packet may contain more than one Destination Address. Routers and EPs check all of the Destination Addresses and copy the packet if needed.
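The copy decision in a router can be sketched as grouping the Destination Addresses of a packet by output port. The code below is an illustrative model only: the port names, the dictionary-based routing table and the function `split_multicast` are hypothetical, and unknown destinations are sent to port R as the default direction without considering the port on which the packet arrived.

```python
from collections import defaultdict
from typing import Dict, List

ROOT_PORT = "R"  # default direction for destinations missing from the table


def split_multicast(destinations: List[int],
                    routing_table: Dict[int, str]) -> Dict[str, List[int]]:
    """Group destination addresses by output port.

    One copy of the packet is produced per returned port, carrying only the
    destination addresses reachable through that port.
    """
    copies: Dict[str, List[int]] = defaultdict(list)
    for dest in destinations:
        port = routing_table.get(dest, ROOT_PORT)
        copies[port].append(dest)
    return dict(copies)


table = {1: "L0", 2: "L0", 3: "L1"}           # learned or predefined routes
copies = split_multicast([1, 2, 3, 7], table)
```

In this example four destinations result in three copies: destinations 1 and 2 share port `L0`, so the packet is not duplicated until their routes split further down the tree.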

In order to speed up packet processing in network elements, we use a 4-bit wide description word (the tag), which specifies the type of data carried in the respective packet flit. The tag simplifies many operations, such as packet multiplexing in the EndPoints or checking destination addresses on the fly. Such an approach significantly reduces the latency introduced by network devices.

The buses between network elements are 32 bits wide and may carry network protocol commands or user data, as shown in Fig. 7. A network protocol flit consists of an 8-bit command and a 24-bit parameter.
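The split of a protocol flit into command and parameter can be sketched as follows. The position of the command in the most significant byte and the command code `CMD_DEST_ADDRESS` are assumptions made for this example; the text fixes only the field widths. The 4-bit tag is not modelled here.

```python
COMMAND_BITS = 8
PARAM_BITS = 24
COMMAND_MASK = (1 << COMMAND_BITS) - 1
PARAM_MASK = (1 << PARAM_BITS) - 1

CMD_DEST_ADDRESS = 0x01  # hypothetical code for the Destination Address command


def pack_flit(command: int, parameter: int) -> int:
    """Build a 32-bit protocol flit (assumed layout: command in the top byte)."""
    if not 0 <= command <= COMMAND_MASK:
        raise ValueError("command must fit in 8 bits")
    if not 0 <= parameter <= PARAM_MASK:
        raise ValueError("parameter must fit in 24 bits")
    return (command << PARAM_BITS) | parameter


def unpack_flit(flit: int):
    """Return the (command, parameter) pair of a 32-bit protocol flit."""
    return (flit >> PARAM_BITS) & COMMAND_MASK, flit & PARAM_MASK
```

The 24-bit parameter is wide enough to hold a complete network address, which is consistent with a Destination Address command occupying a single flit.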



Figure 7. The packet flit

### 4.4. Routing protocol

The proposed routing protocol is simple and therefore requires little hardware. Routing tables in routers may be predefined or learned dynamically using a protocol based on registration packets. Each device (an EP in a local network, a gateway at the cluster and system levels) sends a registration packet at reset, with the destination address set to unknown and its own address as the source. Such a packet also contains a command which updates the routing table of a router. The registration packet is passed through the routers to the gateway or the root router of its own hierarchy level, where it is discarded. At the end of this process, the root at each level has information about the whole network attached to it.
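The registration mechanism can be modelled for one hierarchy level as shown below. This is a behavioural sketch under simplifying assumptions: the `Router` class is hypothetical, neighbours are identified by name rather than by port number, and a packet with an unknown destination reaching the root is reported as discarded by returning `None`.

```python
from typing import Dict, Optional


class Router:
    def __init__(self, name: str, parent: Optional["Router"] = None):
        self.name = name
        self.parent = parent              # router reached through port R, None at the root
        self.table: Dict[int, str] = {}   # address -> neighbour on the L side

    def register(self, address: int, came_from: str) -> None:
        """Handle a registration packet that arrived from the L side."""
        self.table[address] = came_from   # update the routing table
        if self.parent is not None:
            self.parent.register(address, self.name)  # forward towards the root
        # at the root the registration packet is discarded

    def next_hop(self, dest: int) -> Optional[str]:
        """Return the neighbour to forward to, or None if the packet is discarded."""
        if dest in self.table:
            return self.table[dest]
        if self.parent is not None:
            return self.parent.name       # unknown destination: use port R
        return None                       # root with unknown destination


root = Router("root")
a = Router("A", parent=root)
b = Router("B", parent=root)
a.register(1, "EP1")
a.register(2, "EP2")
b.register(3, "EP3")
```

After registration the root knows every address below it, while router `A` knows only its own subtree and forwards everything else through port R.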

## 5. Assessment

In order to assess the performance of our proposal, a model of a Multiview Video Coding (MVC) [1] coder has been used. The application codes several views of the same scene. Each view is encoded with a modified AVC/H.264 coding scheme. The modification allows coding the current view using information available in neighboring views. The coding scheme is shown in Fig. 8a. The tail of each arrow marks a video frame that is used as a reference picture by the video frame at the head of the arrow. Horizontal arrows show how pictures are referenced within the same view. Vertical arrows refer to data exchanged between views.

The modeled system contains 18 cameras with attached AVC/H.264 video codecs. Each codec is a local network that may be placed on a separate chip. Codecs receive the uncompressed video stream from a camera directly through a dedicated interface, not over the NoC. The codecs are connected by an external network, which means that they form one cluster, as shown in Fig. 8b. Codecs send the coded video stream to a stream formatter, a device that controls the format of the MVC stream and manages data storage.

### 5.1. Performance analysis

In order to assess the multicast gain, the traffic volume has been analyzed for two cases: the unicast and the multicast transmission modes. Unicast transmission means that one source sends data to only one destination at a time. In the case of two receivers, the data stream needs to be sent twice. In contrast, multicast transmission allows sending one data stream to multiple destinations; the stream is copied only where the routes split.

In the modeled MVC application, according to the MVC coding scheme (Fig. 8b), all codecs send data to the stream formatter and half of them (i.e. AVC codecs 1, 3, 5, ..., 17) send coded reference frames to their neighbors.





Figure 8. The MVC coding scheme (a) and the assessed model of the MVC coder (b)

Table 1. The synthesis results

| Network elements          | Spartan-3 XC3S1500 |          |           | Spartan-6 XC6SLX75-3 |           |           |
|---------------------------|--------------------|----------|-----------|----------------------|-----------|-----------|
|                           | LUT                | FF       | CLK (MHz) | LUT                  | FF        | CLK (MHz) |
| Router (4 ports)          | 1291 (4%)          | 759 (2%) | 88.8      | 1106 (2%)            | 759 (<1%) | 278.8     |
| Router (3 ports)          | 895 (3%)           | 576 (2%) | 111.2     | 647 (1%)             | 573 (<1%) | 277.3     |
| EndPoint                  | 518 (1%)           | 339 (1%) | 156.5     | 436 (<1%)            | 345 (<1%) | 315.4     |
| Gateway                   | 638 (2%)           | 402 (1%) | 154.8     | 515 (1%)             | 409 (<1%) | 315.3     |
| Maximum available bitrate | 710.7 MB/s         |          |           | 2218.3 MB/s          |           |           |

In order to compare the traffic in the unicast and the multicast transmission modes, the traffic volume has been evaluated in relation to the hop count. The highest gain, about 45%, is obtained for the stream sent by codec no. 9. This is because the path from codec 9 to codec 10 largely overlaps the path to the stream formatter. In other cases the multicast gain for a single stream is not less than 25%. The overall gain from multicast transmission (including the streams for AVC codecs 2, 4, 6, ..., 18) is about 30%.
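One plausible way to carry out such a hop-count comparison is sketched below. In unicast mode every destination is charged the full path length, whereas in multicast mode each link of the tree is counted once, no matter how many destinations lie behind it. The toy topology (`parent`) and all function names are hypothetical; the numbers it produces are unrelated to the MVC model assessed in this paper.

```python
from typing import Dict, List, Tuple


def path_to_root(node: str, parent: Dict[str, str]) -> List[str]:
    """List the nodes from `node` up to the root of the tree."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def tree_path(src: str, dst: str, parent: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return the directed links traversed from src to dst in a tree."""
    up = path_to_root(src, parent)
    down = path_to_root(dst, parent)
    # Drop the shared part above the lowest common ancestor.
    while len(up) > 1 and len(down) > 1 and up[-2] == down[-2]:
        up.pop()
        down.pop()
    nodes = up + down[-2::-1]   # climb to the common ancestor, then descend
    return list(zip(nodes, nodes[1:]))


def unicast_hops(src: str, dests: List[str], parent: Dict[str, str]) -> int:
    """Every destination receives its own copy along the whole path."""
    return sum(len(tree_path(src, d, parent)) for d in dests)


def multicast_hops(src: str, dests: List[str], parent: Dict[str, str]) -> int:
    """Each link is used once; copies are made only where routes split."""
    links = set()
    for d in dests:
        links.update(tree_path(src, d, parent))
    return len(links)


# Toy tree: two codecs under router r1, a stream formatter under router r2.
parent = {"r1": "root", "r2": "root", "c1": "r1", "c2": "r1", "fmt": "r2"}
u = unicast_hops("c1", ["c2", "fmt"], parent)
m = multicast_hops("c1", ["c2", "fmt"], parent)
gain = 1 - m / u
```

The more links the paths to the individual destinations have in common, the larger the difference between the two counts.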

### 5.2. Synthesis results

The proposed NoMC elements have been synthesized for two FPGA devices: the Xilinx Spartan-3 and the Spartan-6. The results are summarized in Table 1.

Routers for each level of the hierarchy have the same size, because they share the same functionality. The difference lies in the part of the address that is analyzed, which does not influence the size of the device. The size of the EndPoint is about 40% of that of a 4-port router, which is due to its additional switching capabilities. However, using such network interfaces may significantly reduce resource consumption, because there is no need to attach a router to each EP.

For a local network that requires a few 4-port and 3-port switches connecting about 15 EPs, the overall resource consumption of the network elements would be less than 20%. For the external network of the described MVC coder, the overall hardware cost of the routers is still less than 20% of the Spartan-6 device.

The cost of implementing the multicast mode (in terms of occupied hardware) is below one percent of the router size. A multicast packet is forwarded to several output ports at the same time and no intermediate buffering is implemented.

Assuming a 32-bit wide bus between devices, the maximum available bitrate (shown in Table 1) results from the maximum clock frequency of the slowest network device. For the Spartan-3 synthesis it is 88.8 MHz, for a 4-port router. For the Spartan-6 FPGA it is 277.3 MHz, for a 3-port router.

## 5. Conclusions

In this paper we have presented the NoMC architecture, a hierarchical NoC designed for multi-chip systems. The proposed interconnect system consists of three levels of hierarchy, separated by dedicated devices referred to as gateways. The main task of a gateway is to format data to suit the requirements of the respective level. This flexible and scalable approach allows the design of each level to be separated.

The addressing scheme reflects the division of the NoMC into three levels. Therefore, the address consists of three fields, each corresponding to the respective hierarchy level.

Furthermore, we have proposed efficient network elements such as the EndPoint (i.e. the network interface) and the router. The functionality of the EP has been slightly extended with basic switching capabilities. Therefore, network interfaces can be connected in chains without using larger and more sophisticated routers. The proposed routers share the same functionality at each level, i.e. handling packets with an unknown destination and the routing protocol. However, the router at each level deals only with the part of the address that corresponds to that level of the hierarchy. Such an approach results in small routing tables (127 bits long) in the routers.

Moreover, in order to improve network performance we have introduced the multicast transmission mode. The presented results show that multicast transmission decreases the overall bandwidth consumption by up to 30% in the presented MVC coder. Moreover, the synthesis results show that the proposed network components consume few resources. The network cost, in terms of occupied hardware, can easily be kept within 20% of the chip area when implemented on a Xilinx Spartan-6 chip.

Further research will focus on extending the network protocol. The authors plan to add hot-plug and self-testing functionality to the network protocol.

## 6. References

- [1] ISO/IEC 14496-10:2008/FDAM 1:2008(E), "Information technology - Coding of audio-visual objects - Part 10: Advanced Video Coding, Amendment 1: Multiview Video Coding".
- [2] A. Lankes, T. Wild, A. Herkersdorf, "Hierarchical NoCs for Optimized Access to Shared Memory and IO Resources", *DSD 2009*, 07 December 2009, pp. 255-262.
- [3] R. Holsmark, S. Kumar, M. Palesi, A. Mejia, "HiRA: A methodology for deadlock free routing in hierarchical networks on chip", *3rd ACM/IEEE International Symposium on Networks-on-Chip*, 10-13 May 2009, pp. 2-11.
- [4] C. Puttmann, J.-C. Niemann, M. Porrmann, U. Ruckert, "GigaNoC - A Hierarchical Network-on-Chip for Scalable Chip-Multiprocessors", *DSD 2007*, 29-31 August 2007, pp. 495-502.
- [5] X. Leng, N. Xu, F. Dong, Z. Zhou, "Implementation and simulation of a cluster-based hierarchical NoC architecture for multi-processor SoC", *ISCIT 2005*, vol. 2, pp. 1203-1206.
- [6] "WISHBONE System-on-Chip (SoC) Interconnection Architecture for Portable IP Cores", *Revision: B.3*, September 7, 2002.
- [7] E. Salminen, A. Kulmala, T. D. Hämäläinen, "Survey of Network-on-chip Proposals", White Paper, OCP-IP, March 2008.