# Accelerators: Enhancing the Capabilities of the C2000™ MCU Family



Kenneth W. Schachter

C2000 Technical Staff
Texas Instruments

Real-time control systems require fast and efficient processing, with latency kept to a minimum in order to maintain stability and boost overall performance. In addition, the increasing sophistication of modern motor systems, power electronics, smart grid technology, robotics and similar applications requires the central processor to keep up with numerous tasks simultaneously.

The C2000™ family of microcontrollers (MCUs) from Texas Instruments addresses these challenges with an array of integrated on-chip hardware accelerators that dramatically increase the performance of the MCU in many real-time applications. The four key accelerators are:

- Floating-Point Unit (FPU)
- Real-Time Control Co-Processor (CLA)
- Trigonometric Math Unit (TMU)
- Viterbi, Complex Math, and CRC Unit (VCU)

At the center of each C2000 MCU lies a fast fixed-point central processing unit (CPU) that on its own provides excellent 32-bit processing capabilities. The FPU provides seamless integration of floating-point hardware into the CPU. To augment this further, the CLA provides an independent floating-point CPU operating at the full speed of the device and designed to perform control law computations with minimal latency. This can effectively double the raw computing capabilities of the device. The TMU accelerator provides hardware support for common trigonometric math functions, while the VCU accelerator adds hardware support for communications, complex math and CRC calculations. This paper provides an overview of each of these accelerators.

# Floating-Point Unit (FPU)

Many control system designs typically start with simulation tools, where the algorithms are developed with floating-point math. These algorithms can then easily be ported to a microcontroller that has native floating-point math support. Floating-point math provides a large dynamic range, thereby making it easier to develop code compared to fixed-point math. The programmer no longer needs to worry about scaling and saturation. Additionally, robustness is improved since floating-point values do not wrap around the number line on an overflow or underflow, as they would in fixed-point math. This enables high performance mathematical capabilities needed for advanced control systems. Also, the C2000 MCU architecture has been optimized to support high-level language programming, along with seamless support from a complete set of TI development tools.
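
The scaling, saturation and wrap-around concerns mentioned above can be made concrete with a small host-side sketch. It assumes a Q15 fixed-point format (16-bit signed, value = integer / 32768); the format choice and the helper names `q15_from_float`, `q15_to_float`, `q15_add_wrap` and `q15_add_sat` are illustrative and not taken from any TI library.

```c
#include <stdint.h>

/* Q15 format (an assumption for this sketch): x represents x / 32768. */
#define Q15_ONE 32768.0f

/* Convert a float in [-1, 1) to Q15 (truncating, for brevity). */
static int16_t q15_from_float(float x)
{
    return (int16_t)(x * Q15_ONE);
}

static float q15_to_float(int16_t q)
{
    return (float)q / Q15_ONE;
}

/* Plain 16-bit add: the sum wraps around the number line on overflow.
 * The arithmetic is done on unsigned values so the wrap is well defined. */
static int16_t q15_add_wrap(int16_t a, int16_t b)
{
    uint16_t sum = (uint16_t)((uint16_t)a + (uint16_t)b);
    /* Reinterpret the 16-bit pattern as a two's complement value. */
    return (sum & 0x8000u) ? (int16_t)((int32_t)sum - 65536) : (int16_t)sum;
}

/* Saturating add: the extra care a fixed-point programmer has to take. */
static int16_t q15_add_sat(int16_t a, int16_t b)
{
    int32_t sum = (int32_t)a + (int32_t)b;
    if (sum > INT16_MAX) return INT16_MAX;
    if (sum < INT16_MIN) return INT16_MIN;
    return (int16_t)sum;
}
```

Adding 0.75 to 0.75 illustrates the difference: the wrapping Q15 add yields -0.5, the saturating add clamps just below 1.0, and the single-precision float add simply yields 1.5 with no scaling decisions required.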

The C2000 MCUs feature a C28x CPU core which is designed around a 32-bit fixed-point accumulator-based architecture. It utilizes the best features of digital signal processors and microcontroller architectures. The addition of the FPU to the C28x fixed-point CPU core enables the C2000 MCUs to support hardware IEEE-754 single-precision floating-point format operations. Devices with the C28x+FPU add an extended set of floating-point registers and instructions to the standard C28x architecture. These additional registers are: eight floating-point result registers, a floating-point status register and a repeat block register. The repeat block register adds zero-overhead looping over a block of instructions, which gives the processor more flexibility than the repeat-single instruction alone. All of the registers are shadowed, except the repeat block register. Shadowing is useful with high-priority interrupts for fast context save and restore of the floating-point registers.

The compiler tools provide C programming support for the CPU which makes it easy to write software in addition to porting existing code. Since the FPU instructions are extensions of the standard C28x instruction set, most instructions operate in one or two pipeline cycles and some can be done in parallel. With the hardware support of the FPU, single cycle instructions can be performed on data conversions (integer to float), fast inverse and inverse square root operations, and multiply-accumulate (MAC). Floating-point performance dramatically enhances the mathematical computation horsepower used in signal processing and control algorithms. On average, greater than a 2.5 times performance improvement can be achieved with floating-point math as compared to fixed-point math.
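
The "greater than 2.5 times" figure can be cross-checked against the cycle counts in the FPU versus fixed-point table of this section. The sketch below assumes the average is taken over the rows where the FPU version is the faster one (the FIR row, where fixed-point is slightly faster, is left out); that reading, and the names `benchmark_t`, `speedup` and `mean_speedup`, are assumptions of this example.

```c
#include <stddef.h>

/* One row of the FPU versus fixed-point table (cycle counts from the table). */
typedef struct {
    const char   *name;
    unsigned long fpu_cycles;
    unsigned long fixed_cycles;
} benchmark_t;

/* Rows where the FPU version is the faster one. */
static const benchmark_t fpu_rows[] = {
    { "Complex FFT, 512 pt",  24249UL,  59075UL },
    { "Complex FFT, 1024 pt", 53219UL, 132823UL },
    { "Real FFT, 512 pt",     13675UL,  34615UL },
    { "Real FFT, 1024 pt",    30357UL,  77004UL },
    { "Square root",             22UL,     64UL },
};
#define FPU_ROW_COUNT (sizeof fpu_rows / sizeof fpu_rows[0])

/* Speedup of the FPU version: fixed-point cycles divided by FPU cycles. */
static double speedup(const benchmark_t *row)
{
    return (double)row->fixed_cycles / (double)row->fpu_cycles;
}

/* Arithmetic mean of the per-row speedups. */
static double mean_speedup(const benchmark_t *rows, size_t count)
{
    double sum = 0.0;
    if (count == 0) {
        return 0.0;
    }
    for (size_t i = 0; i < count; ++i) {
        sum += speedup(&rows[i]);
    }
    return sum / (double)count;
}
```

Under that assumption the mean speedup over the five rows comes out at about 2.58.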

| Function                      | Type               | FPU cycles | Fixed-point cycles | Performance Improvements                                                                     |
|-------------------------------|--------------------|------------|--------------------|----------------------------------------------------------------------------------------------|
| Complex FFT                   | 512 pt             | 24249      | 59075              | 2.43x (FPU vs Fixed-Point)                                                                   |
|                               | 1024 pt            | 53219      | 132823             | 2.49x (FPU vs Fixed-Point)                                                                   |
| Real FFT                      | 512 pt             | 13675      | 34615              | 2.53x (FPU vs Fixed-Point)                                                                   |
|                               | 1024 pt            | 30357      | 77004              | 2.53x (FPU vs Fixed-Point)                                                                   |
| Square Root                   | Compiler intrinsic | 22         | 64                 | 2.90x (FPU vs Fixed-Point)                                                                   |
| FIR (Finite Impulse Response) | 64 pts             | 119        | 111                | 1.07x (Fixed vs FPU); both algorithms make use of the circular addressing mode of the C28x  |

# Real-Time Control Co-Processor (CLA)

Enabling extremely high performance computation and efficient processing is critical for solving today's complex real-time control applications. Real-time control systems require minimal latency, where the time delay between sampling, processing and outputting must fit within a tight time window in order to meet performance objectives. A typical digital controller consists of an ADC to read the input signals (e.g. voltage and current), a math engine to compute the control law algorithms (e.g. PID, 2-pole/2-zero and 3-pole/3-zero compensators) and a PWM channel to output the calculated waveform. Many advanced control systems would greatly benefit from an architecture that integrates these functions in such a way as to minimize latency, yielding the absolute minimum sample-to-output delay. Ideally, this architecture would execute time-critical control loops concurrently with the main CPU and free it up to perform other required tasks. In addition, the architecture must have a built-in protection mechanism to guard against over-current and over-voltage conditions. To address these important requirements, TI developed the CLA.

The CLA is an independent 32-bit floating-point hardware accelerator that is designed for math-intensive computations. This accelerator can offer a significant boost to the performance of typical math functions commonly found in control algorithms. The CLA is designed to execute real-time control algorithms in parallel with the C28x CPU, effectively doubling the computational performance. This makes the CLA perfect for managing low-level control loops with higher cycle performance improvements over the C28x CPU. Another advantage of the CLA is that since it directly accesses memory, the overhead penalty for managing a data page pointer is removed. Additionally, the multiplier on the CLA does not require any delay slots, thus providing true single-cycle performance. A device using the CLA can achieve about a 1.3 times performance improvement over the C28x CPU for applications like motor control and solar, as shown in the table below. Furthermore, by using the CLA to service time-critical functions, the C28x CPU is freed up for other tasks, such as communications and diagnostics.

| Application          | CPU execution cycles (min/max) | CLA execution cycles (min/max) | Performance Improvements |
|----------------------|--------------------------------|--------------------------------|--------------------------|
| Motor: AC Induction  | 888/952                        | 639/694                        | 1.39x (vs CPU)           |
| Power Control: 2p2z  | 48                             | 39                             | 1.23x (vs CPU)           |
| Power Control: 3p3z  | 68                             | 52                             | 1.31x (vs CPU)           |
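
For readers unfamiliar with the 2p2z entry in the table, a two-pole/two-zero compensator is a second-order difference equation evaluated once per sample. The following is a minimal floating-point sketch, not TI's library implementation: the struct layout, the name `comp2p2z_step`, the sign convention (feedback signs folded into `a1` and `a2`) and the output clamp are all assumptions made for illustration.

```c
/* Two-pole/two-zero (2p2z) compensator state and coefficients.
 * Assumed difference equation:
 *   u[n] = b0*e[n] + b1*e[n-1] + b2*e[n-2] + a1*u[n-1] + a2*u[n-2]
 */
typedef struct {
    float b0, b1, b2;   /* numerator (zero) coefficients */
    float a1, a2;       /* denominator (pole) coefficients, signs folded in */
    float e1, e2;       /* previous two error samples */
    float u1, u2;       /* previous two outputs */
    float u_min, u_max; /* output clamp, e.g. PWM duty limits */
} comp2p2z_t;

/* Run one control-law step: takes the newest error sample (reference minus
 * ADC measurement) and returns the clamped actuator command. */
static float comp2p2z_step(comp2p2z_t *c, float error)
{
    float u = c->b0 * error + c->b1 * c->e1 + c->b2 * c->e2
            + c->a1 * c->u1 + c->a2 * c->u2;

    /* Clamp before storing so the history never holds an unreachable output. */
    if (u > c->u_max) u = c->u_max;
    if (u < c->u_min) u = c->u_min;

    /* Shift the delay lines for the next sample. */
    c->e2 = c->e1;
    c->e1 = error;
    c->u2 = c->u1;
    c->u1 = u;
    return u;
}
```

In a real system this step would run once per ADC conversion and its result would be written to the PWM compare register; the five multiply-accumulate terms are the kind of short, math-heavy kernel the cycle counts above refer to.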

The CLA is able to minimize latency because it has direct access to the various control peripherals such as the ADC and PWM modules. Utilizing this low-latency architecture and capability to directly access the various control peripherals provides a fast trigger response. The CLA is able to read the ADC result register on the same cycle that the ADC sample conversion is completed. This "just-in-time" reading of the ADC reduces the sample-to-output delay and enables faster system response for higher frequency control loops.

Programming the CLA consists of initialization code and tasks. A task is similar to an interrupt service routine, and once started it runs to completion. Each task is capable of being triggered by a variety of peripherals without CPU intervention. This makes the CLA very efficient since it does not use interrupts for hardware synchronization, nor must the CLA do any context switching. Compared with the traditional interrupt-based scheme, the CLA approach greatly reduces jitter and latency, which makes execution timing deterministic. The CLA supports eight independent tasks, each of which is mapped back to an event trigger, such as a timer or the availability of an ADC result. Separate tasks can be used to support multiple control loops or phases at the same time.
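
The task model described above can be pictured with a small host-side simulation. This is only a conceptual sketch, not the CLA register interface or any TI API: the type `cla_model_t`, the functions `cla_trigger` and `cla_dispatch`, the two example tasks, and the rule that the lowest-numbered pending task runs first are all assumptions of this model.

```c
#include <stdint.h>

#define NUM_TASKS 8  /* the text above states eight independent tasks */

typedef void (*task_fn)(void);

/* Hypothetical model: one handler per task plus a pending-trigger bit mask. */
typedef struct {
    task_fn task[NUM_TASKS]; /* task[0] models Task 1, task[7] models Task 8 */
    uint8_t pending;         /* bit i set: the trigger for task i has fired */
} cla_model_t;

/* Counters standing in for real control-loop work. */
static int current_loop_runs;
static int voltage_loop_runs;

static void current_loop_task(void) { ++current_loop_runs; }
static void voltage_loop_task(void) { ++voltage_loop_runs; }

/* A peripheral event (timer, ADC result ready, ...) marks its task pending. */
static void cla_trigger(cla_model_t *m, unsigned index)
{
    if (index < NUM_TASKS) {
        m->pending |= (uint8_t)(1u << index);
    }
}

/* Run one pending task to completion; no context is saved or restored.
 * Returns the index of the task that ran, or -1 if nothing was pending. */
static int cla_dispatch(cla_model_t *m)
{
    for (unsigned i = 0; i < NUM_TASKS; ++i) {
        uint8_t mask = (uint8_t)(1u << i);
        if ((m->pending & mask) && m->task[i]) {
            m->pending &= (uint8_t)~mask;
            m->task[i]();   /* runs to completion before the next dispatch */
            return (int)i;
        }
    }
    return -1;
}
```

Because each task starts from a trigger and finishes before another begins, there is no interrupt nesting or register stacking to account for, which is where the reduced jitter comes from.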

Another key benefit of the CLA, over hardware-based control law implementations, is flexibility. The CLA is a fully software programmable solution where developers can freely modify their control system without the time and high cost required to redesign a hardware-based solution. Also, the CLA is significantly more power-efficient in executing operations when compared to the main C28x CPU which is an advantage for power-sensitive applications.

With the performance and efficiency advantages provided by the CLA, complete complex real-time control applications can be implemented on a single C2000 device. Some of these applications include motor control (such as field-oriented control and sensorless feedback), power conversion, renewable energy and electric vehicles.

# Trigonometric Math Unit (TMU)

The TMU is an extension of the FPU and enhances the instruction set of the C28x+FPU by efficiently executing trigonometric and arithmetic operations commonly used in control system applications. Similar to the FPU, the TMU is an IEEE-754 floating-point math accelerator tightly coupled with the CPU. However, where the FPU provides general-purpose floating-point math support, the TMU focuses on accelerating several specific trigonometric math operations that would otherwise be quite cycle intensive. These operations include sine, cosine, arctangent, divide and square root. The TMU instructions include:

| Operation                          | C Equivalent Operation                     |
|------------------------------------|--------------------------------------------|
| Multiply by 2*pi                   | a = b * 2pi                                |
| Divide by 2*pi                     | a = b / 2pi                                |
| Divide                             | a = b / c                                  |
| Square Root                        | a = sqrt(b)                                |
| Sin Per Unit                       | a = sin(b*2pi)                             |
| Cos Per Unit                       | a = cos(b*2pi)                             |
| Arc Tangent Per Unit               | a = atan(b)/2pi                            |
| Arc Tangent 2 & Quadrant Operation | operation to assist in calculating ATANPU2 |

The TMU uses the same pipeline, memory bus architecture and FPU registers as the C28x+FPU, thereby removing any special requirements for interrupt context save or restore.

The C2000 compiler has built-in support that allows automatic generation of the TMU instructions. The user writes code in C using math.h functions, and the compiler uses the TMU instructions, where applicable, instead of RTS library calls. This results in significantly fewer cycles and dramatically increases the performance of trigonometric operations.

The TMU can have a significant impact on many commonly used real-time control algorithms such as:

- Park and Inverse Park Transforms
- Space Vector Generation
- DQ<sub>0</sub> and Inverse DQ<sub>0</sub> Transforms
- FFT Magnitude and Phase Calculations

For example, a Park Transform typically takes anywhere from 80 to more than 100 cycles to execute on the FPU. With the TMU a Park Transform takes only 13 cycles, yielding an 85 percent improvement as compared to without the TMU.



In a typical system application, such as digital motor control (AC induction and permanent magnet) and 3-phase solar applications, about a 1.4 times performance improvement can be achieved using the TMU over just the FPU.

| Application             | FPU execution cycles (min/max) | TMU execution cycles (min/max) | Performance Improvements |
|-------------------------|--------------------------------|--------------------------------|--------------------------|
| Motor: AC Induction     | 888/952                        | 593/670                        | 1.42x (vs FPU)           |
| Motor: Permanent Magnet | 738/786                        | 547/592                        | 1.32x (vs FPU)           |
| Solar: 3-Phase          | 1351/1358                      | 985/983                        | 1.38x (vs FPU)           |

An existing C28x design can realize an immediate advantage using the TMU without the need to rewrite any code. Simulation-based generated code can realize the same benefits. Portability is maintained since the same code can be used on TI MCUs with and without the TMU support.

# Viterbi, Complex Math, and CRC Unit (VCU)

Today's advanced control systems, such as motor control and power applications, can benefit from intelligent management and communications to optimize efficient operation. Power line communication (PLC) has become an ideal solution for intelligent management since the existing infrastructure can be used cost effectively. Communicating data in noisy environments is very challenging and computationally intensive. A typical microcontroller running a control application at its limit cannot tolerate the additional burden of supporting power line communications, and may require an additional processor. To solve this problem, TI developed the VCU. This unit is a tightly coupled fixed-point accelerator that improves performance of communication-based applications by a factor of roughly seven. Additionally, cost savings are realized by eliminating the need for a separate processor. Besides communications, the VCU is very useful for general-purpose signal processing applications such as filtering and spectral analysis. For example, spectral analysis can be used to process motor vibration noise to determine the impact of vibration on a system, estimate the motor operating life and calibrate the control loop to improve efficiency.

The VCU has been designed to be flexible in supporting various communication technologies. For the typical MCU, four key operations consume most of the processing power: Viterbi decoding, complex Fast Fourier Transform (FFT), complex filters and Cyclic Redundancy Check (CRC). Using the hardware capabilities of the VCU, an application will significantly benefit from the increased performance over a software implementation. As an example, the performance contributions of each key operation are:

- Viterbi decoding is commonly used in baseband communications applications. The Viterbi decode algorithm consists of three main parts: branch metric calculation, add-compare-select (Viterbi butterfly) and traceback operation. With the VCU, the branch metric calculation can be completed in a single cycle (code rate = 1/2, and two cycles for code rate = 1/3). The Viterbi butterfly takes 2 cycles per stage, as compared to 15 cycles per stage without the VCU. The traceback takes 3 cycles per stage, as compared to 22 cycles per stage without the VCU.
- The complex FFT is used in spread spectrum communications, as well as many other signal processing algorithms. For a 16-bit fixed-point complex FFT the VCU only requires 5 cycles per stage, as compared to approximately 20 cycles per stage without the VCU.
- Complex filters are used to improve data reliability, transmission distance and power efficiency, and are commonly used in various other signal processing applications. The VCU can perform a complex I and Q multiply with coefficients (four multiplies) in a single cycle, as compared to approximately 10 cycles without the VCU. In addition, the VCU can read/write the real and imaginary parts of 16-bit complex data to memory in a single cycle.
- CRC algorithms are used for verifying data integrity over large data blocks, communication packets or code sections. The VCU can perform 8-bit, 16-bit, 24-bit and 32-bit CRCs completely in the background, offloading the main C28x CPU. For example, the VCU can compute the CRC for a block length of 10 bytes in 10 cycles, as compared to approximately 250 cycles without the VCU. A CRC result register contains the current CRC and is updated each time a CRC instruction is executed. This simplifies the CRC calculations and access to the final CRC value.
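
To give a sense of the work a software CRC involves, and therefore what the hardware offloads, the sketch below computes a 16-bit CRC bit by bit in portable C. The parameters chosen here (polynomial 0x1021, initial value 0xFFFF, no reflection, commonly known as CRC-16/CCITT-FALSE) and the function name `crc16_ccitt` are assumptions for illustration; the polynomials actually supported by the VCU are documented in the device references, not here.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-16 with polynomial 0x1021 and initial value 0xFFFF.
 * Each input byte costs eight shift/test/XOR iterations in software. */
static uint16_t crc16_ccitt(const uint8_t *data, size_t len)
{
    uint16_t crc = 0xFFFFu;

    for (size_t i = 0; i < len; ++i) {
        /* Fold the next byte into the top of the running remainder. */
        crc ^= (uint16_t)((uint16_t)data[i] << 8);

        for (int bit = 0; bit < 8; ++bit) {
            if (crc & 0x8000u) {
                crc = (uint16_t)((crc << 1) ^ 0x1021u);
            } else {
                crc = (uint16_t)(crc << 1);
            }
        }
    }
    return crc;
}
```

The inner loop runs eight times per byte, which is consistent with a software CRC costing tens of cycles per byte while a dedicated CRC instruction consumes a byte per cycle.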



Devices with the C28x+VCU add an extended set of registers and instructions to the standard C28x architecture, which are used to support the acceleration of communications-based algorithms. The additional registers are: nine result registers, two traceback registers, a configuration and status register, and a CRC result register. The VCU performs fixed-point operations using the same existing instruction set format, pipeline and memory bus architecture as the C28x.

Programming the VCU is made easy with TI's controlSUITE™ software suite. TI provides a complete library of C-callable assembly functions. These functions are implemented using the VCU instruction set to optimize efficiency and minimize overhead. TI also provides higher-level functions to support PLC communications standards such as PRIME and G3.

# Summary and Device Family Accelerator Matrix

Utilizing the high performance C28x CPU along with the advanced hardware accelerators described in this white paper, the TI C2000 family of MCUs provides the advanced processing power required for today's complex real-time control systems.

Combining these accelerators with the various control-optimized peripherals, such as high-speed ADCs and high-resolution PWMs, developers can minimize latency while increasing system performance. TI provides a comprehensive set of development tools and software that enable developers to quickly design, test and produce extremely reliable control systems. A wide range of TI C2000 MCUs are available to solve the most demanding control system requirements. The C2000 portfolio can be conveniently divided into two groups: Delfino™ and Piccolo™ microcontroller portfolios.

The TMS320C2000™ Delfino series of microcontrollers is designed for high performance real-time control applications. Based on an extremely fast C28x CPU, advanced control peripherals and integrated analog functions, the Delfino series can reduce system cost while increasing system reliability. The newest family members feature a dual-core microcontroller running up to 200 MHz on each CPU. Combining each CPU with its own CLA, the device has the capability for delivering the equivalent performance of 800 MIPS. The Delfino series is ideal for applications requiring advanced signal processing such as industrial drives, digital power, renewable energy and smart sensing.
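
As a rough check of the 800 MIPS figure, assume each C28x CPU and each CLA completes one instruction per cycle at 200 MHz (an idealization for this estimate, not a measured throughput):

$$
\underbrace{2 \times 200}_{\text{C28x CPUs}} + \underbrace{2 \times 200}_{\text{CLAs}} = 800\ \text{MIPS}
$$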

### TMS320C2000 Delfino Microcontrollers

|               | FPU | CLA | TMU | VCU |
|---------------|-----|-----|-----|-----|
| TMS320F2837xD | X   | X   | X   | X   |
| TMS320F2837xS | X   | X   | X   | X   |
| TMS320C2834x  | X   |     |     |     |
| TMS320F2833x  | X   |     |     |     |

X Indicates available – check specific device for details

The TMS320C2000 Piccolo series of microcontrollers offers low-cost solutions for real-time control applications. Its high level of integration of control and analog peripherals reduces system complexity and offers greater efficiency for cost-sensitive designs. Devices in the Piccolo series range from 40-60 MHz fixed-point performance up to 120 MHz floating-point performance with a CLA running concurrently, effectively doubling the throughput to 240 MIPS. The Piccolo series is ideal for applications such as white goods appliances, motor control, hybrid electric vehicle (HEV) and PLC.

### TMS320C2000 Piccolo Microcontrollers

|              | FPU | CLA | TMU | VCU |
|--------------|-----|-----|-----|-----|
| TMS320F2807x | X   | X   | X   |     |
| TMS320F2806x | X   | X   |     | X   |
| TMS320F2805x |     | X   |     |     |
| TMS320F2803x |     | X   |     |     |
| TMS320F2802x |     |     |     |     |

Important Notice: The products and services of Texas Instruments Incorporated and its subsidiaries described herein are sold subject to TI's standard terms and conditions of sale. Customers are advised to obtain the most current and complete information about TI products and services before placing orders. TI assumes no liability for applications assistance, customer's applications or product designs, software performance, or infringement of patents. The publication of information regarding any other company's products or services does not constitute TI's approval, warranty or endorsement thereof.

The platform bar, C2000, controlSUITE, Delfino, Piccolo and TMS320C2000 are trademarks of Texas Instruments. All other trademarks are the property of their respective owners.



