# A FRAMEWORK FOR MAPPING NEURAL NETWORKS ON PARALLEL ARCHITECTURES

A synopsis submitted in partial fulfillment of the requirements for the degree

of

# **Doctor of Philosophy**

Submitted by:

SAURABH TEWARI (2015CSZ8046)

under the guidance of Prof. Anshul Kumar Prof. Kolin Paul



DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING INDIAN INSTITUTE OF TECHNOLOGY DELHI NEW DELHI

# List of Included Papers

This thesis is based on the following publications: **Published:** 

- 1. **S. Tewari**, A. Kumar and K. Paul, "SACC: Split and Combine Approach to Reduce the Off-chip Memory Accesses of LSTM Accelerators", in DATE 2021.
- 2. **S. Tewari**, A. Kumar and K. Paul, "Minimizing Off-Chip Memory Access for CNN Accelerators", in IEEE Consumer Electronics Magazine 2021.
- 3. S. Tewari, A. Kumar and K. Paul, "Bus Width Aware Off-Chip Memory Access Minimization for CNN Accelerators", in ISVLSI 2020.
- 4. D. Stathis, Y. Yang, S. Tewari, A. Hemani, K. Paul, M. Grabherr and R. Ahmad, "Approximate Computing Applied to Bacterial Genome Identification using Self-Organizing Maps", in ISVLSI 2019.

# Contents

| 1                   | 1.1 Impact of off-chip memory access                  | 1<br>2 |  |  |  |  |
|---------------------|-------------------------------------------------------|--------|--|--|--|--|
| 2 Mapping Framework |                                                       |        |  |  |  |  |
| 3                   | Related Work                                          | 4      |  |  |  |  |
| 4                   | Off-Chip Memory Access Optimizations for CNN          | 4      |  |  |  |  |
|                     | 4.1 Introduction                                      | 4      |  |  |  |  |
|                     | 4.2 Related Work                                      | 5      |  |  |  |  |
|                     | 4.3 Bus Width Aware Off-chip Memory Access Estimation | 5      |  |  |  |  |
|                     | 4.4 Optimal data partitioning for CNNs                | 6      |  |  |  |  |
|                     | 4.5 Results                                           | 6      |  |  |  |  |
| 5                   | Data reuse approach for RNNs                          | 7      |  |  |  |  |
|                     | 5.1 Introduction                                      | 7      |  |  |  |  |
|                     | 5.2 Previous Work                                     | 8      |  |  |  |  |
|                     | 5.3 Proposed data reuse approach                      | 8      |  |  |  |  |
|                     | 5.4 Results                                           | 9      |  |  |  |  |
| 6                   | Optimizing SOMs using low bit-width                   | 10     |  |  |  |  |
|                     | 6.1 Introduction                                      | 10     |  |  |  |  |
|                     | 6.2 Previous Work                                     | 11     |  |  |  |  |
|                     | 6.3 Analyzing impact of low bit-width                 | 11     |  |  |  |  |
|                     | 6.4 Results                                           | 12     |  |  |  |  |
| 7                   | Conclusion                                            | 13     |  |  |  |  |

# 1 Introduction

Deep neural networks (DNNs) are machine learning algorithms used in wide range of applications due to their high accuracy. To improve the accuracy, recent DNNs use large number of parameters and perform memory and compute intensive operations. DNNs have training and inference phase. They are first trained (training phase) to determine the learning parameters (weights and biases) and then these learned parameters are used in applications to predict the output (inference phase). The training of these deep networks require large number of inputs and several iterations to update the weights and baises. This may take several hours to days and require significant computations. The training is mostly performed offline on high end machines and servers. The parameters learned during training phase are then used in applications for inference. In this work, we focus on the inference phase of DNNs, which is performed energy constraint devices.

Many latest consumer electronics devices like smartphones and home appliances use DNNs to provide new features and improve user experiences. The manufacturers are shifting the processing of these algorithms to edge devices to improve response time and eliminate network bandwidth issues. These edge devices, usually battery operated, commonly use DNN accelerators to speed-up processing and reduce energy consumption.

Deep neural networks consists of an input layer, an output layer and several intermediate hidden layers. Recent DNN can have upto thousands of hidden layers. Figure 1 shows two broad class of neural networks. Most of the recent DNNs can be classified as following,

- 1. Feed forward: This class of NNs use the outputs from a layer as input to the subsequent layers. For a given input, output is independent of previous seen inputs. Convolution Neural Networks (CNN) belongs to this category and commonly used for image and video processing applications.
- 2. Recurrent NN: This class of NNs maintain the state information to store the contextual information and state is updated with every input. The output depends on the input as well as the state information. For the same input, the output may vary depending on the current state. This class of networks is used for processing sequential data e.g. speech recognition and language processing tasks. Long Short Term Memory networks (LSTMs) are one of the most popular variants of RNN.



Figure 1: Feedforward and Recurrent Neural Networks.

To improve the performance FPGA [2,14,16,41,43,46], GPU [8] and ASIC [5–7,10] accelerators with large number of parallel compute resources are proposed. While these accelerators can match the performance to some extent, their energy efficiency is still a concern. Energy efficiency of DNN accelerators is of paramount importance for their ubiquitous usage in energy constrained devices like smart-phones, tablets etc. More than 80% of the overall energy consumption of these accelerators is due to off-chip memory accesses [5]. In this work, we propose approaches to optimize the performance and energy efficiency of feed forward and recurrent neural network accelerators during inference phase.

## 1.1 Impact of off-chip memory access

#### 1.1.1 Performance

DNN accelerators have large number of processing elements (PEs) to exploit the parallelism of DNNs. While these PEs can perform several operations per cycle, their performance is limited by off-chip memory bandwidth. In the near forseeable future, off-chip memory accesses limits the system performance [42]. Fig. 2 shows, two kernels with different compute intensity (FLOPS/byte), possibly due to different data reuse techniques. Kernel 2 can perform more computations per byte, compared to Kernel 1. Performance of Kernel 2 is limited by computational roof and it utilizes the compute resources fully (compute bound), while Kernel 1's performance is limited by memory bandwidth (memory bound). Memory bound kernels results



Operational Intensity(Flops/Bytes)

Figure 2: roofline model

in under-utilization of compute resources, which results in degraded performance and energy consumption. To improve the DNN accelerators' performance, maximizing the data reuse from on-chip memories is essential.

#### 1.1.2 Energy efficiency

The energy of DNN accelerators is the sum of computations and data accesses energy.

$$E = E_{comp} + E_{data} \tag{1}$$

where  $E_{comp}$  and  $E_{data}$  are the computation and data movement energy, respectively.  $E_{comp}$  for a DNN layer depends on the number of operations in the layer, which is fixed as it depends on layer shape e.g. number of activations, filters, and filter sizes. However  $E_{data}$  can be optimized using data reuse techniques. Typically DNN accelerator system has multiple levels of memory hierarchy. Memories at different levels of hierarchy have different sizes, access-energies and access-time. The off-chip memory (DDR), at the top of the hierarchy, has the largest capacity, and highest access-energy and access-time compared to memories at lower level of hierarchy. The off-chip memory access energy is upto two orders higher compared to on-chip memory access energy.

Figure 3 illustrates the impact of data reuse on the energy efficiency of a system with 2 levels of memory (DDR and L1). The energy efficiency can be expressed as below,

Energy-efficiency = 
$$(1 - \frac{E_{data}^2}{E_{data}^1}) \times 100$$
  
=  $(1 - (\frac{1}{n} + \frac{e_{L1}}{e_{ddr}})) \times 100$ 

As,  $e_{L1} \ll e_{DDR}$ , the ratio  $\frac{e_{L1}}{e_{ddr}}$  is very small, the energy efficiency depends on number of reuses (n) from on-chip memory L1. It improves considerably with increasing the number of data reuse (n).

Reducing the off-chip memory accesses is the key to reduce the energy consumption of DNN accelerators. State of the art DNN accelerators aim to maximize the data reuse from on-chip memory to reduce the off-chip memory accesses. Our work aims to improve the performance and energy efficiency of DNN accelerators by reducing the off-chip memory accesses and maximizing the data reuse from on-chip memories.



Figure 3: Data access energy  $E_{data}$ .

# 2 Mapping Framework

Towards this end we propose a mapping framework which optimizes the off-chip memory accesses for feed-forward and recurrent neural networks. Our main contributions in this thesis dessertations are



Figure 4: Mapping DNN on coarse grain parallel architectures

- 1. We propose a model that precisely estimates the off-chip memory accesses and the energy consumption of the DNN's layers, while taking into account the accelerator's architectural parameters.
- 2. We propose an approach that determines the optimal partitioning of the DNN layers for reducing the off-chip memory accesses and energy consumptions. We analytically express the off-chip memory accesses of DNNs using the layer shapes and tiling parameters and express it as a constraint optimization problem.
- 3. We propose a novel data reuse approach to improve the performance and energy efficiency of recurrent neural networks (RNN). The proposed approach partitions the data and schedules the operations in a way that reduces the off-chip memory accesses of large matrices by half.
- 4. We explore the limits of a neural network, self-organizing map (SOM), using different bit resolutions and the effect that it has on the accuracy, as well as the benefits that this low resolution can provide for a hardware architecture.
- 5. We do an in depth analysis of the reduction in resolution vs. loss in accuracy as the basis for designing a battery operated system where the area, power and performance are of critical importance.

# 3 Related Work

We compare our approaches and show that our work significantly improves the performance and energy efficiency of DNN accelerators compared to the state of the art. To address the computational and energy efficiency of DNNs, several ASIC [3,9,40] and FPGA based accelerators [4,12,15,19,26] are proposed. The energy efficiency of DNN accelerators is critical for their widespread usage, and off-chip memory access is the key to improving energy consumption. Most of these works focused on improving energy efficiency by reducing off-chip memory accesses. Figure 5 shows broad categories of state of the art approaches that aim to improve energy efficiency of DNN accelerators by reducing the off-chip memory accesses.



Figure 5: Energy efficiency optimization approaches

Some approaches [12, 26, 35] used on-chip memory to store all the weights. Sizes of weights in recent multi-layer LSTM models can be several MB's, and using large on-chip memory is expensive. These approaches are not scalable and effective only for small LSTM models. The proposed approach is independent of model size and effective for large LSTM models.

Several approaches used the fact that neural networks are error-tolerant and have lots of redundancy. They used the quantization and pruning techniques to compress the models' size. Approaches [12,39] used 18-bit, Chang et al. [4] used 16-bit, Han et al. [19] used 12-bits precision for storing the inputs and weights, Lee et al. [26] used 8-bit inputs and 6-bits for weights to reduce the model size. The proposed approach is orthogonal to the quantization techniques and can be integrated with different quantization techniques to reduce the memory accesses further.

Han et al. [19] used pruning to compress the model. However, pruning results in irregular network structure, and the sparse matrix require additional computational and storage resources and causes unbalanced load distribution. To overcome this Wang et al. [39] used block-circulant matrices representations to compress the LSTM/RNN model and to eliminate irregularities resulted from compression. Some approaches ([19, 30, 31]) used load balance aware pruning techniques to overcome the unbalanced load distribution problem. Quantization and pruning approaches impacts the accuracy of the networks and may not be suitable where accuracy is at priority.

The other line of works reduced the memory accesses without compromising the accuracy of the network by applying the data-reuse techniques. Que et al. [34] proposed a blocking-batching scheme to reuse the LSTM weights fetched from external memory. Park et al. ([32]) proposed a time step interleaved weight reuse scheme (TSI-WR).

# 4 Off-Chip Memory Access Optimizations for CNN

## 4.1 Introduction

CNNs perform compute and memory intensive operations. To speed up the performance, CNN accelerators have large number of processing units (PE). CNNs have large amount of parallelism. However, limited on-chip memory, leads to repeated accesses from the off-chip memory and results in large volume of off-chip memory accesses. This lowers the resource utilization due to limited memory bandwidth and degrades energy efficiency significantly. To fit the layer data into on-chip memory, CNN accelerators apply loop tiling to partition the layer data into small tiles. Loop tiling is a compiler technique [1] that partitions the loop iteration space and large arrays into smaller tiles to increase the data locality and ensures that data fits into smaller memories. The tile dimensions significantly impacts the off-chip memory accesses [27,



Figure 6: Typical DNN accelerator architecture

46]. Determining optimal tile dimensions requires analyzing off-chip memory accesses for different tile dimensions. Memory accesses also depend on the architectural parameters of the accelerator like bus width and data alignment. We have proposed an approach that considers architectural parameters to compute the off-chip memory accesses accurately and determines the optimal tile dimensions and data reuse scheme for CNN layers.

#### 4.2 Related Work

To meet the computation and energy demands of CNNs, researchers have proposed efficient accelerators [2, 7,14,43]. Zhang et al. [46] and Li et al. [27] used loop tiling to optimize the off-chip memory accesses. They expressed the off-chip memory access as a function of tile dimensions and layer shape. Zhang et al. [46] determined optimal tile dimensions by enumerating all the legal tile dimensions. To reduce the hardware design complexity, they determined a global optimal tile dimension and used a common data reuse scheme for all the layers. Li et al. [27] proposed a layer-wise adaptive data partitioning and scheduling scheme. However, previous approaches ignored the architectural parameters and address alignment, and assumed all tiles of the same dimensions have the same off-chip memory accesses. With these assumptions, the tile dimensions determined by previous apporaches lead to a suboptimal solution.

## 4.3 Bus Width Aware Off-chip Memory Access Estimation

The CNN accelerators use a wide data bus to access off-chip memory to meet the high memory bandwidth requirement [5,7]. If the number of bytes accessed from an off-chip memory address is not a multiple of bus width or the address is not aligned to the word boundary, it results in unused bytes lanes of the data bus. Figure 7 illustrates memory accesses on a 64-bit data bus. Fig. 7a shows a read transaction of 8 bytes from an aligned address and uses the full bus width. However, if only 5 bytes are read from an aligned address, as shown in Figure 7b, 8 bytes are still accessed. If 5 bytes are read from an unaligned address, it results in 16 bytes of data access, as shown in Fig. 7d. The unused byte lanes do not carry any useful data, but they contribute to overall energy consumption. The length of the data read should be chosen such that bus utilization is high, and off-chip memory accesses and energy consumption are minimized. To estimate the performance and energy efficiency, we have developed an offline model that estimates the



Figure 7: Off-chip memory accesses on 64-bit wide data bus

off-chip memory accesses while taking into account the data alignment and memory bus width.

# 4.4 Optimal data partitioning for CNNs

Fig. 8 shows a CL data stored in off-chip memory and its tiles in the accelerator's on-chip buffer. The



Figure 8: CNN layer tiles in off-chip and on-chip memory

off-chip memory access of 3D data can be computed as following

$$\mathbb{B}_{3D} = r \times \sum_{t=1}^{N_{tiles}} \mathbb{B}_t \tag{2}$$

where  $N_{tiles}$  is the number of tiles. r and  $\mathbb{B}_t$  are the trips count and the number of bytes accessed from off-chip memory of the  $t^{th}$  tile, respectively. The trips count r depends on the data reuse scheme, and  $\mathbb{B}_t$  is computed using the model described in section 4.3

Tiles of ifm, ofm and wts reside in on-chip memory. If the on-chip memory buffer size is buffSize, and  $V_{ifm}$ ,  $V_{ofm}$ , and  $V_{wts}$  are the size of ifm, ofm, and wts tiles, respectively, then constraints on tile dimensions are

$$(V_{ifm} + V_{wts} + V_{ofm}) \cdot DW \le buffSize \tag{3}$$

Determining the tile dimensions which minimize the off-chip memory accesses, is a constraint optimization problem. The number of bytes accessed from off-chip memory and the constraints (3) are non-linear functions of tile dimensions and solving it is non-trivial.

We propose an approach to determine the optimal tile dimensions by computing the number of bytes accessed from off-chip memory  $(\mathbb{B})$  at all the feasible points using section 4.3. Our approach determines the optimal tile dimensions for each layer for different data reuse schemes and it can be configured for different on-chip memory sizes, bus widths, and data bit width.

#### 4.5 Results

We experimented with three popular CNN networks, AlexNet [25], VGG16 [36], and ResNet [21] having 8, 16, and 50 layers, respectively, with varying layer shapes and using filters of dimensions  $1\times1$ ,  $3\times3$ ,  $5\times5$ ,  $7\times7$ , and  $11\times11$ . To compare the results with other approaches, we have used the on-chip buffer size of 108 KB, batch size of 3 for VGG16, and 4 for ResNet and AlexNet. Figure 9 shows the comparison of the proposed approach with SmartShuttle (SS) [27] for VGG16. The SS approach determines the tile dimensions and data reuse scheme adaptively for each layer to minimize the data size (volume) accessed from off-chip memory.

#### 4.5.1 Latency And Energy Analysis

Figure 9a show the energy, off-chip memory accesses, and latency efficiency achieved using the BWA compared to the SS approach for VGG16 for 8 bits data width and 64 bits bus width. Using our FPGA implementation, we measured the memory access latencies and execution time for CLs to estimate the energy efficiency of the BWA approach compared to the SS approach. We observed that the changes in energy and latency are proportional to the changes in memory access. This observation confirms that off-chip memory access dominates the energy consumption of the CNN accelerators.







- (a) VGG16: Energy and latency efficiency
- (b)  $\mathbb{B}_{VGG16}$ : Varying on-chip buffer sizes
- (c) VGG16: Fully connected layers

Figure 9: Latency and off-chip memory access. BWA: Bus Width Aware, SS:SmartShuttle

#### 4.5.2 On-chip Memory Size

The size of the on-chip memory constrains the tile dimensions. Larger on-chip memory can accommodate bigger tiles which reduces the trips and number of off-chip memory access transactions. Figure 9b shows off-chip memory access for VGG16. Off-chip memory access reduces with increasing on-chip memory size for both approaches. However, the BWA approach performs better than SS for all the on-chip memory sizes.

## 4.5.3 Impact of Bus Width on $\mathbb{B}$ of FCLs

In FCLs, the height and width of tiles and data are the same. Tiles in FCL can be accessed from off-chip memory using a single transaction which results in fewer transactions in FCLs compared to CLs. Whereas in CLs, multiple transactions are required to access different rows of the tiles. Thus the impact of bus width is less on  $\mathbb{B}$  of FCLs as compared to CLs. Figure 9c shows the off-chip memory accesses of FCLs of VGG16 for 8 bits data width. BWA reduces  $\mathbb{B}_{VGG16}^{FC}$  by 1%, 2%, 3%, and 4% on 32, 64, 128, and 256 bits wide buses, respectively.

# 5 Data reuse approach for RNNs

## 5.1 Introduction

Many applications involve sequential data processing and time-series predictions, e.g., natural language processing, speech recognition, video activity recognition, sentiment classification. Processing sequential data requires remembering the contextual information from previous data. Recurrent neural networks (RNNs) are deep learning algorithms specialized in handling such problems by maintaining an internal state based on previously seen data. LSTMs [22] are variants of RNNs designed to handle long-range dependencies by storing useful information about previous inputs for a long duration.

LSTM computations involve multiplications, and these matrix-vector multiplications are performed several times. The size of these matrices can be significant in several MB's and often exceed the size of the accelerator's on-chip memory. These matrices are partitioned into blocks and accessed from off-chip memory repeatedly by the accelerator, which results in a large volume of off-chip memory accesses and energy consumption. The high energy consumption limits the usage of these accelerators on embedded devices.

Typically the computations of LSTM cell is described by the following equations

$$i = \sigma(W^{i} \cdot x_{t} + R^{i} \cdot h_{t-1} + b^{i})$$

$$f = \sigma(W^{f} \cdot x_{t} + R^{f} \cdot h_{t-1} + b^{f})$$

$$g = \tanh(W^{g} \cdot x_{t} + R^{g} \cdot h_{t-1} + b^{g})$$

$$o = \sigma(W^{o} \cdot x_{t} + R^{o} \cdot h_{t-1} + b^{o})$$

$$c_{t} = f \odot c_{t-1} + i \odot g$$

$$h_{t} = o \odot \tanh(c_{t})$$

$$(4)$$

where  $x_t$  is the input,  $h_t$  is the hidden state, and  $c_t$  is the cell state at time t. i, f, g, o are the computed gate values.  $\odot$  denotes the element-wise multiplications.  $W^j$  and  $R^j$  are the input and hidden state weight matrices, and  $b^j$  is the bias vector, learned during the training process, where  $j \in \{i, f, g, o\}$ . The dimension of  $h_t$  is referred to as the number of hidden states of the LSTM (N). At every time step,  $x_t$  is taken as input, and cell state  $(c_t)$  and hidden state  $(h_t)$  are computed using (4). The dependency of  $h_t$  on  $h_{t-1}$  and  $c_{t-1}$  prevents the parallel processing of multiple time steps and limits the data reuse.

Figure 10 shows the LSTM cell computations for two consecutive time steps using conventional and the proposed approach. To compute the hidden state vector  $h_t$ , conventional approaches accesses R matrix at each time step t, as shown in Fig. 10a. Accessing the weights at each time step results in large volume of off-chip memory accesses.





- (a) Conventional approaches: R matrix is accessed at each time step.
- (b) Proposed approach: R matrix accesses are reduced by half.

Figure 10: LSTM cell computations for consecutive time-steps showing the weight accesses.

#### 5.2 Previous Work

Que et al. [34] proposed a blocking-batching scheme to reuse the LSTM weights fetched from external memory. Their approach reuses the weights of W matrix on a group of input vectors, by processing those input vectors as a batch. The input vectors in the same batch share the same weight matrices (W). However, it is difficult to collect required number of input vectors. As the LSTM cell states computations depend on cell states of the previous time-step, benefit of their batching schemes is limited to W matrix. Reusing weights of R across different time-steps has not been successful because of the state dependency.

Park et al. (32) proposed a time step interleaved weight reuse scheme (TSI-WR) which reuses the weights of R matrix between two adjacent time steps by performing computations in a time-interleaved manner. Their approach logically partitions the R matrix into blocks. A block is accessed from off-chip memory to compute the two consecutive hidden state vectors partially. However, their approach do not fully exploit the data reuse, and several weights are accessed repeatedly from the off-chip memory. In addition, the data reuse in TSI-WR approach depends on the on-chip storage size which limits the benefits of their approach.

#### 5.3 Proposed data reuse approach

We proposes an approach that reduces off-chip memory accesses by splitting the computation in two. At each time step, while the computations of a hidden state vector of one time step  $h_t$  completes, partial computation of next time step  $(S_{t+1})$  is also performed by reusing the weights. Our approach schedules the computations in a way that reuses all the weights of R between two adjacent time steps. The data

reuse in our approach is independent of on-chip buffer sizes which makes it suitable for accelerators with very small on-chip memory.

The proposed data reuse approach splits and combines the LSTM cell computations in a way that reduces the off-chip memory accesses of hidden state matrices by 50%. The computation of the  $h_t$  can be expressed as shown below

$$h_t[k] = F(S_t[k] + q_t[k]) \tag{5}$$

where F is a non-linear function.  $q_t$  is computed as  $W \cdot x_t + b$  and its computations are independent of previous step cell states.  $S_t[k]$  is the sum of N product terms as shown below,

$$S_t[k] = \sum_{n=0}^{N-1} R[k][n] \cdot h_{t-1}[n]$$
(6)

 $S_t[k]$  can be computed as a sum of the following two partial sums  $S_t^L[k]$  and  $S_t^U[k]$ 

$$S_t^L[k] = \sum_{n=0}^k R[k][n] \cdot h_{t-1}[n]$$
(7)

$$S_t^U[k] = \sum_{n=k+1}^{N-1} R[k][n] \cdot h_{t-1}[n]$$
(8)

Equation (7) uses the lower-diagonal and diagonal elements of R ( $R^L$ ), and (8) uses the upper diagonal elements of R ( $R^U$ ). As shown in Figure 11,  $R^L$  and  $R^U$  are accessed in consecutive time steps and reused



Figure 11: Splitting the hidden state vector computations into partial sums

in the partial sum computations of two steps. At time step t,  $S_t^U$  and  $h_{t-1}$  are the inputs from the previous time step, and  $R^L$  is reused to compute the partial sums  $S_t^L$  and  $S_{t+1}^L$ . Input  $S_t^U$  is added to  $S_t^L$  to compute  $h_t$ , and  $S_{t+1}^L$  is passed to  $(t+1)^{th}$  step computations. In the same way, at time step t+1,  $R^U$  is reused to compute  $S_{t+1}^U$  and  $S_{t+2}^U$ . Elements of  $R^L$  are accessed from top to bottom, left to right, while elements of  $R^U$  are accessed in the reverse order to satisfy the dependencies. As shown in Figure 11, the proposed approach accesses the weight matrix R once, to compute the output of two consecutive time steps  $h_t$  and  $h_{t+1}$ .

# 5.4 Results

We have compared our approach with conventional approaches and TSI-WR approach [32]. Conventional approaches access the hidden state weight matrices R at each step from the off-chip memory. We have used the same on-chip buffer size to store the weight matrices  $(4 \times B^2)$  to perform a fair comparison. The proposed approach requires additional on-chip memory (4N+4B) to store the four partial sum and temporary vectors. We have experimented with LSTM models used in speech recognition (for TIMIT [13]) and character level Language Modelling (LM) [37].



Figure 12: (a) off-chip memory access and latency (b) Off-chip memory access comparison between TSI-WR and SACC approach for two consecutive time steps. (c) normalized energy. CONV: conventional

#### 5.4.1 Memory Accesses

Figure 12a shows the proposed approach's reduction in off-chip memory accesses compared to conventional approaches. The proposed approach reduces the memory accesses of the R matrices by half. Total memory accesses, including R, W, and b, for LM and TIMIT models reduce by 28% and 32%, and memory access latencies reduce by 29% and 31%, respectively. The proposed approach reduces the run-time by 12.8% and 16% for LM and TIMIT models.

#### 5.4.2 Comparison with TSI-WR

Figure 12b compares the off-chip memory accesses of the SACC and the TSI-WR approaches using the simulation results. The performance of the TSI-WR approach depends on on-chip buffer sizes. TSI-WR reduces 50% off-chip memory accesses when the on-chip buffer size is 70% of the R matrix. The proposed approach reduces R's memory access by 50%, irrespective of the on-chip buffer size.

## 5.4.3 Energy Efficiency

We computed the energy consumption using the design power reported by the Vivado synthesis tool, execution time, and off-chip memory accesses. Figure 12c shows the normalized energy consumption of the conventional and the proposed approach for TIMIT-1024. Due to the reduction in the run-time and off-chip memory accesses, the SACC approach reduces the energy consumption on average by 13% and 16% for LM and TIMIT models.

# 6 Optimizing SOMs using low bit-width

# 6.1 Introduction

One of the popular low-power stategy is performing computations at reduced bit-widths. In some cases, this results in trading off accuracy for smaller power consumption. The benefits of doing this results in improved energy performance metrics in both the data path and in the memory path. This is because there is a reduction of the energy cost for data transfers, which usually dominates the total energy consumption for such systems. Machine learning tasks do have intrinsic error resilience and hence it is expected that bit-width reduction will work well.

We explore the design space of a self-organizing map (SOM) used for rapid and accurate identification of bacterial genomes which is an important health care problem. SOM has been implemented as an FPGA design and shown to have better computational efficiency compared to GPUs. To further lower the energy consumption, we exploit the robustness of SOM by successively lowering the resolution to gain further improvements in efficiency and lower the implementation cost without substantially sacrificing the accuracy. We do an in depth analysis of the reduction in resolution vs. loss in accuracy as the basis for designing a system with the lowest cost and acceptable accuracy. The objective of this method is to design a bacterial recognition system for battery operated clinical use where the area, power and performance

are of critical importance. We demonstrate that with 39% loss in accuracy in 12 bits and 1% in 16 bit representation can yield significant savings in energy and area.

Prior work in [44] has introduced Self-organizing maps (SOM) for rapid genome identification. SOM uses a type of unsupervised learning called competitive ANN learning model. The model reduces the data dimensions and it clusters similar data together [24]. A trained SOM network does not require to go through the whole DNA sequence to recognize the pathogen, but only requires a small part of its DNA. SOM can be highly parallelized and such parallel implementation have been proposed for synchoros VLSI design, custom FPGA and GPUs [28, 33, 44]. Another important aspect of SOM and other ANN is their robustness. ANNs have been proven to work with low bit resolution without sacrificing much of their accuracy [23]. In this work, we explore the limits of the SOM using different bit resolutions and the effect that it has on the accuracy of the SOM, as well as the benefits that this low resolution can provide for a hardware architecture.

#### 6.2 Previous Work

An emerging design paradigm that is able to achieve better energy efficiency by trading off the quality (e.g., accuracy) and effort (e.g., energy) of computation is approximate computing [48]. Many modern applications, such as machine learning and signal processing, are able to produce results with acceptable quality despite most of the calculations being computed imprecisely [45]. The tolerance of imprecise computation in approximate computing to acquire substantial performance gains and is the basis for a wide range of architectural innovations [11].

Earlier work in approximate computing was focused on the design of basic elements, such as approximate adders and logic [18]. Although, these techniques adequately demonstrated the benefit of approximate computing, the fixed functionality and low-level design limits further performance improvement. Additionally, many of these techniques use complementary metal-oxide-semiconductor (CMOS) technology. Innovations in device technology provide a great opportunity for radically different forms of architecture design [38].

One of the popular low-power strategy implemented in custom designs consists in performing computations at reduced bit-widths. In some cases, this results in trading off accuracy for smaller power consumption. The benefits of doing this results in improved energy performance metrics in both the data path and in the memory path. This is because there is a reduction of the energy cost for data transfers, which usually dominates the total energy consumption for such systems. Machine learning tasks do have intrinsic error resilience and hence it is expected that bit-width reduction will work well Additionally, there is a body of literature that has demonstrated that high-precision computations are often unnecessary in presence of statistical algorithms [29, 47]. Znag et.al. report a less than 5% of quality loss obtained by simulation of the real hardware implemented in a 45nm CMOS technology. Gupta et. al. also present similar results where they train deep networks with 16 bits fixed-point number representations and stochastic rounding [17]. Talathi et. al. show that the best performance with reduced precision can be achieved with 8 bits weights and 16 bits activation, which, if reduced to 8 bits, results in a 2% drop in accuracy. Hashemi et. al. look at a broad range of numerical representations applied to ANNs in both inputs and network parameters and analyze the trade-off between accuracy and hardware implementation metrics and conclude that a wide range of approximation parameters are feasible with negligible degradation in performance [20].

## 6.3 Analyzing impact of low bit-width

We have implemented SOM on FPGA, mapped on a Xilinx Virtex7 485t chip, of SOM for identification of bacterial genomes. A custom semi systolic array was hand crafted, for different bit width implementations, to analyze the area versus energy trade off.

Figure 13a shows a high level schematic of the FPGA implementation of BioSOM and illustrates the key components in the design. The input is a n-bit vector. Each pair of bits in the input represents one of nucleotide A,C,G or T. Thus a 16-bit word contains 8 symbols. The Neural Network weights are stored in BRAMs. Each neuron has 8 weights and each weight is stored as a fixed-point number. Bit width analysis is performed by varying the number of bits (8, 12, 16, 24 and 32) used to represent the weights.

During the training phase Neuron Update component is enabled by setting (Update Enable=1). The weights of the Neurons are updated using the distance output of the Compute Distance module. The



Figure 13: (a) Hardware Module for BioSOM. (b) Neuron Update Module.

Neuron Update component is also pipe-lined design with II=1. The pipeline stages are shown in Figure 13b.

## 6.4 Results

The SOM has been implemented with a range of fixed-point formats. With fewer bits, one naturally expect that the SOM network for bacterial identification to suffer from accuracy degradation. A MATLAB simulation model was created to analyze the accuracy loss when using fixed-point implementation. We trained 10 SOMs with 10 different bacteria DNA sequences. Each SOM network has 100 neurons inside, and each neuron has 20 weights. We trained the networks by two independent training processes running in parallel. One is implemented using double precision floating point and the other is implemented with fixed-point weights. After training, we used the trained networks to identify the unknow sequence and record their scores.

The FPGA design is implemented with Vivado v.2016.4 used for synthesis and analysis of the HDL Designs. Our design is implemented in VHDL and validated using the Vivado simulator. Experimentation is done for different fixed point representations of weights by modifying parameters in VHDL code.



Figure 14: comparison between different fixed-point from at (a) FPGA LUT utilization (b) energy.

The area and power numbers for different weight resolutions are extracted from the reports generated by Vivado tool post placement and routing with a working frequency of 100 MHz. Table 1 compares the resources and area for 8, 12, 16, 24, and 32 bits fixed point formats, for a SOM network with 512 neurons. The second part of the table compares the average power in the different fixed point formats, for the same SOM.

The results are summarized in the Figure 14a and 14b. Both the amount of utilized LUTs and total energy in Joule is presented against the classification error. From the figures, we can easily conclude that, we can substantially reduce the resources used and the energy by using a 16-bit fixed-point representation, without losing accuracy. We can reduce the resources even further by moving to the 12-bit representation, by sacrificing 39% of the SOM accuracy.

Table 1: Resource Comparison of different fixed point formats

| Resource      | 8b    | 12b   | 16b   | 24b   | 32b   |
|---------------|-------|-------|-------|-------|-------|
| LUTs          | 1823  | 2611  | 3196  | 4375  | 5549  |
| Registers     | 3481  | 4679  | 5871  | 8255  | 10639 |
| Slice         | 854   | 1158  | 1369  | 1809  | 2395  |
| LUT FF Pairs  | 1007  | 1369  | 1750  | 2372  | 3043  |
| B-RAM         | 4     | 6     | 8     | 11    | 15    |
| DSP48E1       | 17    | 17    | 17    | 17    | 33    |
| Bonded IOB    | 57    | 61    | 65    | 73    | 81    |
| Power(W)      | 8b    | 12b   | 16b   | 24b   | 32b   |
| Total Power   | 0.295 | 0.314 | 0.332 | 0.356 | 0.392 |
| Dynamic       | 0.052 | 0.071 | 0.089 | 0.113 | 0.148 |
| Device Static | 0.243 | 0.243 | 0.243 | 0.244 | 0.244 |

# 7 Conclusion

With the sudden surge in Neural Network based applications, there is a pressing need for improving the performance and energy efficiency of DNN accelerators for their ubiquitous usage in energy constrained devices. The performance of DNN accelerator's is limited by the memory bandwidth and off-chip memory accessed dominates the energy consumption. The key to improving the energy efficiency of these NN is reducing the expensive off-chip memory accesses. Towards this we proposed approaches to reduce the off-chip memory accesses for DNNs. The proposed approaches optimizes the data reuse from on-chip memories for CNNs and LSTMs by partioning the data and scheduling the operations intelligently, without compromising the accuracy. We have also analyzed the impact of low bit resolution on the accuracy and energy, area performance.

## References

- [1] A. V. Aho, M. S. Lam, R. Sethi, and J. D. Ullman. Compilers: Principles, techniques, and tools, 2006.
- [2] M. Alwani, H. Chen, M. Ferdman, and P. Milder. Fused-layer CNN accelerators. In MICRO, 2016.
- [3] E. Azari and S. Vrudhula. Elsa: A throughput-optimized design of an lstm accelerator for energy-constrained devices. ACM Transactions on Embedded Computing Systems (TECS), 19(1):1–21, 2020.
- [4] A. X. M. Chang, B. Martini, and E. Culurciello. Recurrent neural networks hardware implementation on fpga. arXiv preprint arXiv:1511.05552, 2015.
- [5] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam. DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning. *ASPLOS*, 2014.
- [6] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, et al. DaDianNao: A machine-learning supercomputer. In MICRO, 2014.
- [7] Y.-H. Chen, J. S. Emer, and V. Sze. Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks. ACM/IEEE ISCA, 2016.
- [8] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, and E. Shelhamer. cuDNN:efficient primitives for deep learning. arXiv preprint arXiv:1410.0759, 2014.
- [9] F. Conti, L. Cavigelli, G. Paulin, I. Susmelj, and L. Benini. Chipmunk: A systolically scalable 0.9 mm 2, 3.08 gop/s/mw@ 1.2 mw accelerator for near-sensor recurrent neural network inference. In 2018 IEEE Custom Integrated Circuits Conference (CICC), pages 1-4. IEEE, 2018.

- [10] Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen, and O. Temam. ShiDianNao: Shifting vision processing closer to the sensor. In *ISCA*, 2015.
- [11] H. Esmaeilzadeh, A. Sampson, L. Ceze, and D. Burger. Neural acceleration for general-purpose approximate programs. In 45th Annual IEEE/ACM International Symposium on Microarchitecture, pages 449–460, 2012.
- [12] J. C. Ferreira and J. Fonseca. An fpga implementation of a long short-term memory neural network. In 2016 International Conference on ReConFigurable Computing and FPGAs (ReConFig), pages 1–8. IEEE, 2016.
- [13] J. S. Garofolo. Timit acoustic phonetic continuous speech corpus. Linguistic Data Consortium, 1993, 1993.
- [14] V. Gokhale, J. Jin, A. Dundar, B. Martini, and E. Culurciello. A 240 G-ops/s mobile coprocessor for deep neural networks. In *CVPR Workshops*, 2014.
- [15] Y. Guan, Z. Yuan, G. Sun, and J. Cong. Fpga-based accelerator for long short-term memory recurrent neural networks. In 2017 22nd Asia and South Pacific Design Automation Conference (ASP-DAC), pages 629–634. IEEE, 2017.
- [16] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan. Deep learning with limited numerical precision. In ICML, 2015.
- [17] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan. Deep learning with limited numerical precision. In *Proceedings of the 32Nd International Conference on International Conference on Machine Learning*, pages 1737–1746, 2015.
- [18] V. Gupta, D. Mohapatra, A. Raghunathan, and K. Roy. Low-power digital signal processing using approximate adders. *IEEE Tran. on Computer-Aided Design of Integrated Circuits and Systems*, 32(1):124–137, 2013.
- [19] S. Han, J. Kang, H. Mao, Y. Hu, X. Li, Y. Li, D. Xie, H. Luo, S. Yao, Y. Wang, et al. Ese: Efficient speech recognition engine with sparse lstm on fpga. In *Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays*, pages 75–84, 2017.
- [20] S. Hashemi, N. Anthony, H. Tann, R. I. Bahar, and S. Reda. Understanding the impact of precision quantization on the accuracy and energy of neural networks. In *Proceedings of the Design, Automation* Test in Europe Conference Exhibition (DATE), pages 1478–1483, 2017.
- [21] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
- [22] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
- [23] L. Jiao, C. Luo, W. Cao, X. Zhou, and L. Wang. Accelerating low bit-width convolutional neural networks with embedded fpga. In 27th International Conference on Field Programmable Logic and Applications, pages 1–4, Sep. 2017.
- [24] T. Kohonen. Essentials of the self-organizing map. Neural Netw., 37:52–65, Jan. 2013.
- [25] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
- [26] M. Lee, K. Hwang, J. Park, S. Choi, S. Shin, and W. Sung. Fpga-based low-power speech recognition with recurrent neural networks. In 2016 IEEE International Workshop on Signal Processing Systems (SiPS), pages 230–235. IEEE, 2016.
- [27] J. Li, G. Yan, W. Lu, S. Jiang, S. Gong, J. Wu, and X. Li. SmartShuttle: Optimizing off-chip memory accesses for deep learning accelerators. *DATE*, 2018.

- [28] S. McConnell, R. Sturgeon, G. Henry, A. Mayne, and R. Hurley. Scalability of self-organizing maps on a GPU cluster using OpenCL and CUDA. *Journal of Physics: Conference Series*, 341:012018, feb 2012.
- [29] B. Moons, R. Uytterhoeven, W. Dehaene, and M. Verhelst. 14.5 envision: A 0.26-to-10tops/w subword-parallel dynamic-voltage-accuracy-frequency-scalable convolutional neural network processor in 28nm fdsoi. In *IEEE International Solid-State Circuits Conference (ISSCC)*, pages 246–247, 2017.
- [30] J. Park, J. Kung, W. Yi, and J.-J. Kim. Maximizing system performance by balancing computation loads in lstm accelerators. In 2018 Design, Automation & Test in Europe Conference & Exhibition (DATE), pages 7–12. IEEE, 2018.
- [31] J. Park, W. Yi, D. Ahn, J. Kung, and J.-J. Kim. Balancing computation loads and optimizing input vector loading in lstm accelerators. *IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems*, 39(9):1889–1901, 2019.
- [32] N. Park, Y. Kim, D. Ahn, T. Kim, and J.-J. Kim. Time-step interleaved weight reuse for lstm neural network computing. In *Proceedings of the ACM/IEEE International Symposium on Low Power Electronics and Design*, pages 13–18, 2020.
- [33] M. Porrmann, U. Witkowski, and U. Rückert. *Implementation of Self-Organizing Feature Maps in Reconfigurable Hardware*, pages 247–269. Springer US, Boston, MA, 2006.
- [34] Z. Que, T. Nugent, S. Liu, L. Tian, X. Niu, Y. Zhu, and W. Luk. Efficient weight reuse for large lstms. In 2019 IEEE 30th International Conference on Application-specific Systems, Architectures and Processors (ASAP), volume 2160, pages 17–24. IEEE, 2019.
- [35] V. Rybalkin, A. Pappalardo, M. M. Ghaffar, G. Gambardella, N. Wehn, and M. Blott. Finn-l: Library extensions and design trade-off analysis for variable precision lstm networks on fpgas. In 2018 28th international conference on field programmable logic and applications (FPL), pages 89–897. IEEE, 2018.
- [36] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
- [37] M. Sundermeyer, H. Ney, and R. Schlüter. From feedforward to recurrent 1stm neural networks for language modeling. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(3):517– 529, 2015.
- [38] S. Venkataramani, A. Sabne, V. Kozhikkottu, K. Roy, and A. Raghunathan. Salsa: Systematic logic synthesis of approximate circuits. In *Proceedings of the 49th Annual Design Automation Conference*, pages 796–801, 2012.
- [39] S. Wang, Z. Li, C. Ding, B. Yuan, Q. Qiu, Y. Wang, and Y. Liang. C-lstm: Enabling efficient lstm using structured compression techniques on fpgas. In *Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays*, pages 11–20, 2018.
- [40] Z. Wang, J. Lin, and Z. Wang. Accelerating recurrent neural networks: A memory-efficient approach. *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, 25(10):2763–2775, 2017.
- [41] X. Wei, Y. Liang, and J. Cong. Overcoming data transfer bottlenecks in FPGA-based DNN accelerators via layer conscious memory management. In *DAC*, 2019.
- [42] S. Williams, A. Waterman, and D. Patterson. Roofline: an insightful visual performance model for multicore architectures. *Communications of the ACM*, 52(4):65–76, 2009.
- [43] L. Xie, X. Fan, W. Cao, and L. Wang. High throughput CNN accelerator design based on FPGA. In FPT, 2018.
- [44] Y. Yang, D. Stathis, P. Sharma, K. Paul, A. Hemani, M. Grabherr, and R. Ahmad. RiBoSOM. In Proceedings of the 18th International Conference on Embedded Computer Systems Architectures, Modeling, and Simulation - SAMOS, pages 105–114, New York, New York, USA, 2018. ACM Press.

- [45] R. Ye, T. Wang, F. Yuan, R. Kumar, and Q. Xu. On reconfiguration-oriented approximate adder design and its application. In *IEEE/ACM International Conference on Computer-Aided Design (ICCAD)*, pages 48–54, 2013.
- [46] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong. Optimizing FPGA-based accelerator design for deep convolutional neural networks. In *FPGA*, 2015.
- [47] Q. Zhang, T. Wang, Y. Tian, F. Yuan, and Q. Xu. Approxann: An approximate computing framework for artificial neural network. In *Proceedings of the Design, Automation Test in Europe Conference Exhibition (DATE)*, pages 701–706, San Jose, CA, USA, 2015. EDA Consortium.
- [48] Q. Zhang, F. Yuan, R. Ye, and Q. Xu. Approxit: An approximate computing framework for iterative methods. In *Proceedings of the 51st Annual Design Automation Conference*, pages 97:1–97:6, 2014.