# CHORD: CORDIC Hardware of RISC-V Device

Ke Wu, Jiayi Chen

Department of Micor/Nano Electronics, Shanghai Jiao Tong University, Shanghai, China. oracle0133@gmail.com, miracle3310@sjtu.edu.cn

Abstract—CHORD is a hardware accelerator IP for trigonometric functions. It implements the CORDIC (COordinate Rotation DIgital Computer) algorithm with a six-stage pipeline and is equipped with an AHB-Lite bus interface. This IP is designed as a co-processor of RISC-V device. We synthesis the design on T-HEAD's wujian100 platform and verify it in RTL and hardware level. Driver library in C is provided. The driver supports scalar and vector operations. The CHORD consumes 23 mW in our FPGA board and increase the hardware utilization by 3%.

Index Terms—CORDIC, hardware accelerator, wujian100

#### I. Introduction

CORDIC, also known as the Volder's algorithm, is an efficient algorithm to calculate trigonometric functions and their inverse functions. The implementation of the algorithm only requires addition, subtraction and bitshift units. Therefore, CORDIC based accelerators are suitable for systems where cost and power consumption are critical.

Wujian100 is a simple yet fully featured MCU base SoC developed by T-HEAD. The SoC implements a RISC-V processor based on the rv32ec ISA and fundemental peripherals, e.g., TIM (Timer), DMA (Direct Memory Access) and USI (Unified Serial Interface). Communication between modules observes the AHB protocol and a multi-master, multi-slave AHB matrix is provided.

In this paper, we present a CORDIC based hardware accelerator, which is called CHORD. CHORD implements the CORDIC algorithm with a six-stage pipeline and provides acceleration of sin, cos and arctan functions. The accelerator is designed as a co-processor of RISC-V devices, therefore, an AHB-Lite bus interface is integrated inside the system. We connected our CHORD accelerator to the processor as a peripheral via the AHB bus. We choose the wujian100 platform to realize our design and the hardware is implemented on the XC7A200T FPGA. The calculation result is transferred to a PC by UART serial port to verify our design.

The paper is organized as follows: section II introduces the modules of our design in detail, section III presents our verification method in HDL and hardware level, section IV is the implementation process and introduces the drivers written in C, section V discusses the results such as area consumption and calculation delay, and section VI is the conclusion of the paper.

# II. SYSTEM MODULES

In this part we make an introduction to system modules of CHORD, including their HDL-level implementations. This

This is the course project of MR334 supervised by Prof. Guanghui He. Special thanks to Prof. He and our TA Gang Wang.

system is composeds of five modules: pipeline, interface\_in, interface\_out, FIFO and AHB-Lite bus interface, as shown in Fig.1. The pipeline module implements CORDIC algorithm with six-stage pipeline. Interface\_in and interface\_out modules are used to transform data organization form between bus and pipeline. FIFO is inserted between interface\_out and AHB-Lite bus interface, which serves as the buffer between CORDIC and AHB bus. AHB-Lite bus interface is the bus interface of our co-processor. The implementations of two key modules, e.g. pipeline and AHB-Lite bus interface, are described in detail below.

# A. CORDIC Execution Module

To compute trigonometric values of the input degree  $\varphi$ , the CORDIC algorithm will start from a given vector  $(x_0, y_0)$ , and then apply succesive rotations until the desired angle is reached. After n iterations the rotated vector should be  $(\sin \theta_n, \cos \theta_n)$ , where  $\theta_n$  is approximately  $\varphi$ .

In each iteration i, the vector is updated through one  $\theta_i$  rotation,

$$x_{i+1} = \cos \theta_i (x_i - y_i \tan \theta_i) \tag{1}$$

$$y_{i+1} = \cos \theta_i (y_i + x_i \tan \theta_i) \tag{2}$$

The insight of CORDIC is to choose those  $\tan \theta_i$  in the form of  $2^{-i}$ , so the multiplication operation can be converted in one bitshift operation, which makes an significant improvement on compute efficiency.

In order to further reduce multiplication operation, we define

$$k_i = \cos \theta_i = \cos(\arctan 2^{-i}) = \frac{1}{\sqrt{1 + 2^{-2i}}}$$
 (3)

$$d_i = sign(\theta_i) \tag{4}$$

then the multipiler  $k_i$  can be avoided in each iteration, as we compute  $k = \prod k_i$  as CORDIC gain and set initial vector as (k, 0).

Equation 1 and 2 can thus be simplified to

$$x_{i+1} = x_i - y_i d_i 2^{-i} (5)$$

$$y_{i+1} = y_i + x_i d_i 2^{-i} (6)$$

and the final rotation degree is

$$\theta_n = \sum d_i \arctan 2^{-i} \tag{7}$$

The computation process of inverse trigonometric functions is similar to trigonometric functions. To compute  $\arctan t$ , we initialize the vector  $(x_0, y_0)$  as (1, t). Obviously the angle between the initial vector and X-axis is the required result



Fig. 1. Overall architecture of system

 $\arctan t \text{ since } \tan(\arctan t) = t$ . Let the initial vector be rotated to X-axis, after n iterations CORDIC will give the final rotation degree  $\theta_n$ , and that is just the output.

CHORD implements CORDIC algorithm with six-stage pipeline, that is to say the iteration times n=6. Declare the register variable degree\_approx\_reg to indicate the degree that has been rotated, x\_reg and y\_reg to indicate the vector components. In each stage i, if degree\_approx\_reg[i-1] is smaller than desired degree  $\varphi$ , then degree\_approx\_reg[i] adds  $\theta_i$  which is stored a priori in pipeline module, meanwhile x\_reg[i] and y\_reg[i] are updated like

If degree\_approx\_reg[i-1] is bigger than  $\varphi$ , swap the above plus and minus signs.

#### B. AHB-Lite Bus Interface

Wujian100 supports AHB-Lite bus protocol. It is a bus interface that supports a single bus master and provides high-bandwidth operation (though CHORD requires only low-bandwidth). In our design, the AHB-Lite bus interface is slave while RISC-V kernel is master.

Since wujian100 does not support burst operation, we only implement SINGLE transfer mode, therefore HBURST is ignored.

Considering the width of the output from pipeline is bigger than 16bits, we set the width of the data bus to 32bit. This width is unconfigurable so HSIZE is ignored, too.

HREADY indicates if a tranfer is completed. This signal is useful when the slave want to extend the data phase, but it may also lead to endless loops in bus if HREADY is always set LOW. Considering the small data size we would like to just ignore HREADY, and thus HREADYOUT is not used.

We introduce finite-state machine (FSM) in our bus interface to control those AHB-Lite signals. There are four states in the interface: INIT, IDLE, READ and WRITE. State INIT occurs only once when the bus interface is initialized after HRESETn is set from LOW to HIGH. State IDLE means the bus interface is not selected or the master set transfer types to IDLE, that is to say state IDLE happens when HSEL is LOW or HTRANS equals to b00. State READ/WRITE means the master want to read/write from/to CHORD, which depends on HWRITE.

Apart from dealing with bus signals, the bus interface is connected with module interface\_in and FIFO, and should transfer data with them if its state is READ or WRITE. When READ, the bus interface will enable reading from FIFO and transfer data from FIFO to HRDATA. When WRITE, the bus interface caches HWDATA in internal registers and send data to interface in as the input data for CORDIC execution module.

#### III. IMPLEMENTATION AND DRIVERS

In this part, we talk about the implementation of CHORD's hardware and the drivers' usages.

#### A. CHORD on SoC

To mount CHORD on E902 SoC, replace the instantiated module x\_main\_dummy\_top0 with the CORDIC coprocessor in the file ahb\_matrix\_top.v. According to wujian100\_open Userguide, The addresss range of CHORD is 0x40010000-0x4001FFFF.

# B. Drivers

We provide five functions written in C for CHORD. Based on their processing styles, they can be catalogued into two types.

1) scalar processing: These functions are designed for direct scalar processing, i.e., send a data in int16\_t and immediately return the corresponding function value in int16\_t. However, using these functions for large scale data processing may be costly due to the overhead of pipeline. The scalar processing functions are:

chord\_cos, chord\_sin and chord\_arctan.

2) vector processing: These functions are designed for vector processing. That is, given the pointers of the input vector and output vector's head, the data od input vector is continuously sent to CHORD and results are temporary stored in the output buffer of CHORD. The results are then sent back to output vector after all the data processing are finished. The vector can be processed is restricted by the output buffer size of CHORD, which is CHORD\_BUFFER\_SIZE = 512 in our implementation.

In the cosine/sine case, the function receives a pointer of type int16\_t\* and can process at most 512 data. The result is stored in the vector of type uint32\_t\* since both cosine and sine values are returned.

In the tangent case, the result is stored in the vector of type uint16\_t\* since only degree values are required.

The vector processing functions are chord\_cos\_sin\_v and chord\_arctan\_v.

#### IV. VERIFICATION OF THE SYSTEM

In this part, we introduce our verification method of CHORD. Based on our design, the verification can be partitioned into two parts: verification of the functionality in HDL level and verification of the hardware implementation.

# A. Functionality Verification

For a n step CORDIC algorithm with iteration steps of  $\{\theta_1, \theta_2, \dots, \theta_n\}$ , the possible angles can be precisely reached are

$$\Theta = \{ \sum_{i=1}^{n} d_i \theta_i | d_i \in \{-1, 1\} \}.$$
 (8)

However, for a given input  $\phi$ , the CORDIC cannot find its closest degree in  $\Theta$  if the iteration steps set is chosen as

$$\{\arctan 2^0, \arctan 2^{-1}, \dots, \arctan 2^{-n+1}\}.$$
 (9)

For example, if we want to calculate the cosine and sine value of  $71.56501^{\circ}$ , the approximation sequence generated by 6-step CORDIC is  $\arctan 2^{0} + \arctan 2^{-1} + \arctan 2^{-2} - \arctan 2^{-3} - \arctan 2^{-4} - \arctan 2^{-5} \approx 73.11^{\circ}$ , but the closet degree is  $70.0201^{\circ}$ . Therefore, we set the verification criteria as

$$|f(F\phi) - f(\theta)| \le \epsilon \tag{10}$$

in which  $F\phi$  is the approximation degree by CORDIC,  $\theta = \arg\min_{\theta \in \Theta} \{|\phi - \theta|\}$  and f is sin or cos. If the task of CORDIC is to figure out arc-tangent value of given input t, we have

$$|F't - \theta| \le \epsilon \tag{11}$$

where F't is the approximation degree by CORDIC. If we remember  $m(\phi) = \min\{\theta | \theta \in \Theta, \theta > \phi\}$ . A proper choice of  $\epsilon$  should be

$$\epsilon = \max\{|f(\theta) - f(m(\theta))| | \theta \in \Theta\}$$
 (12)

for  $f = \sin$  or  $f = \cos$ , and

$$\epsilon = \max\{|\theta - m(\theta)| | \theta \in \Theta\}. \tag{13}$$

for arc-tangent. In the case of our six-stage CORDIC (n=6), the  $\epsilon$  value is shown in Table I

 $\begin{tabular}{l} TABLE\ I\\ \epsilon\ FOR\ COSINE,\ SINE\ AND\ ARC-TANGENT\ FUNCTIONS \end{tabular}$ 

| f      | $\epsilon$ |
|--------|------------|
| cos    | 0.0624     |
| sin    | 0.0610     |
| arctan | 0.0625     |

We sweep the input from  $-90^{\circ}$  to  $90^{\circ}$ , the comparison between the HDL simulation results and theoretical results is shown in Fig. 2 and Fig. 3.

We see that the error is bounded by 0.06, which is very close to the value of  $\epsilon$ . Moreover, to fully verify our HDL design,



Fig. 2. CORDIC HDL simulation results v.s. theoretical results



Fig. 3. Error of simulation results of functions sine and cosine

we run the simulation with random input vector with length of 512 for 1000 rounds. The result shows that our design is correct.

#### B. Hardware Verification

The hardware is done with XC7A200T FPGA. We transfer the calculation result back to PC by UART and redo the verification process in previous subsection. An example of UART serial port's output is shown in Fig. 4. The hardware implementation is correct according to our verification results.

#### V. RESULTS

The resources utilization of CHORD is shown in Fig. 5 and Table II. We compare the implementation report of wujian100 with and without CHORD. The majorities of our hardware consumption are LUT used as logic in CORDIC and FF used as buffer in FIFO. Power consumption is shown in Fig 6. Area consumption is shown in Fig. 7.

The whole system can run in XC7A200T FPGA at the clock frequency of 20 MHz. Although a timing violation of 0.9 ns is reported, the functionality is proper since it can calculate and communicate with PC by UART effortlessly.

# Connected. Begin CHORD test... Test I: non continuous sin/cos evaluation cos, sin -90: -0.0195, -1.0000 cos, sin -89: 0.0429, -1.0000 cos, sin -88: 0.0429, -1.0000 cos, sin -87: 0.0429, -1.0000 cos, sin -86: 0.0429, -1.0000 cos, sin -86: 0.0429, -1.0000 cos, sin -85: 0.1054, -0.9960 cos, sin -84: 0.1054, -0.9960 cos, sin -83: 0.1054, -0.9960 cos, sin -82: 0.1679, -0.9882 cos, sin -81: 0.1679, -0.9882 cos, sin -80: 0.1679, -0.9882 cos, sin -79: 0.1679, -0.9882

Fig. 4. example of UART serial port's output



Fig. 5. Resources utilization ratio



Fig. 6. Power consumption



Fig. 7. left: wujian100 without CHORD, right: wujian100 with CHORD

#### TABLE II RESOURCES UTILIZATION

|        | without CHORD | with CHOURD |
|--------|---------------|-------------|
| LUT    | 24710         | 28912       |
| IO     | 62            | 62          |
| BRAM   | 64            | 64          |
| FF     | 13420         | 22346       |
| LUTRAM | 72            | 74          |

# VI. CONCLUSION

In this paper, we present a CORDIC based hardware accelerator, e.g. CHORD, and integrate it in wujian100 SoC using AHB-Lite bus interface. We also provide C functions with the help of CDK tool chain. A specific verification is performed to discover the error of the outputs from six-stage pipeline CORDIC implementation, which is within the expected range of theoretical analysis. The synthesises and implementation report shows that CHORD consumes about 23 mW in out XC7A200T FPGA platform and increases hardware utilization by 3%.

#### REFERENCES

- [1] Garcia E O, Cumplido R, Arias M. Pipelined CORDIC design on FPGA for a digital sine and cosine waves generator[C]//2006 3rd International Conference on Electrical and Electronics Engineering. IEEE, 2006: 1-4.
- [2] Teja R R, Reddy P S. Sine/cosine generator using pipelined cordic processor[J]. International Journal of Engineering and Technology, 2011, 3(4): 431.
- [3] AMBA® 3 AHB-Lite Protocol.
- [4] wujian100\_open Userguide v1.0.