# Leakage Efficient Chip-Level Dual-Vdd Assignment with Time Slack Allocation for FPGA Power Reduction \*

#### Yan Lin and Lei He

Electrical Engineering Department University of California, Los Angeles, CA 90095 {ylin, lhe}@ee.ucla.edu, http://eda.ee.ucla.edu

#### **ABSTRACT**

To reduce power, Vdd programmability has been proposed recently to select Vdd-level for interconnects and to powergate unused interconnects. However, Vdd-level converters used in the Vdd-programmable method consume a large amount of leakage. In this paper, we develop chip-level dual-Vdd assignment algorithms to guarantee that no low-Vdd interconnect switch drives high-Vdd interconnect switches. This removes the need of Vdd-level converters and reduces interconnect leakage and interconnect device area by 91.78% and 25.48%, respectively. The assignment algorithms include power sensitivity based heuristics with implicit time slack allocation and a linear programming (LP) based method with explicit time slack allocation. Both first allocate time slack to interconnects with higher transition density and assign low-Vdd to them for more power reduction. Compared to the aforementioned Vdd-programmable method using Vdd-level converters, the LP based algorithm reduces interconnect power by 65.13% without performance loss for the MCNC benchmark circuits. Compared to the LP based algorithm, the sensitivity based heuristics can obtain slightly smaller power reduction but run 4X faster.

Categories and Subject Descriptors: B.7.1 [Integrated Circuits]: Types and Design Styles General Terms: Algorithms, Design Keywords: FPGA, low power, time slack, programmable-Vdd

#### 1. INTRODUCTION

FPGA power modeling and reduction has become an active research recently. [1, 2] present power evaluation frameworks and study power characteristics for parameterized FPGA architectures. [3] proposes configuration inversion method to reduce leakage power of multiplexers. [4] studies

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

*DAC 2005*, June 13–17, 2005, Anaheim, California, USA. Copyright 2005 ACM 1-59593-058-2/05/0006 ...\$5.00.

the interaction between a suite of power-aware FPGA CAD algorithms. In addition, low power FPGA circuits and architectures have been proposed, including region-based power-gating [5] and Vdd-programmability for FPGA logic blocks and interconnects [6, 7, 8, 9, 10].

In this paper, we are aiming at improving the Vdd programmable interconnects proposed in [8], where a Vdd-level converter is inserted in front of each interconnect switch to avoid excessive leakage. However, the fine-grained level converter insertion introduces large leakage overhead. For example, the leakage overhead of level converters in routing channels is 36% of total power for circuit s38584 in 100nm technology. In this paper, we use the Vdd-programmable interconnect same as that in [8], but remove the level converters in routing channels by developing novel CAD algorithms. Experimental results show that compared to [8], we reduce interconnect leakage power and area by 91.78% and 25.48% respectively, and reduce total interconnect power by up to 65.13% without performance loss.

The rest part of the paper is organized as follows. Section 2 presents modeling and problem formulation. Section 3 presents both power sensitivity based heuristics and a linear programming (LP) based algorithm. Section 4 presents the experimental results and Section 5 concludes this paper. Proofs of Theorems are included in the technical report [11].

# 2. MODELING AND PROBLEM FORMULATION

#### 2.1 Preliminaries

Interconnects consume most of the area and power of FP-GAs. We assume the traditional island style routing architecture in our study. The logic blocks are surrounded by routing channels consisting of wire segments. The input and output pins of a logic block can be connected to the wire segments in the surrounding channels via a connection switch. Wire segments can be formed into a long connection via a routing switch at each intersection of routing channels. An interconnect switch is either a routing switch or a connection switch, and can be implemented by a tri-state buffer or a buffer. An interconnect segment is a wire segment driven by an interconnect switch. Suggested by [12], we assume a uniform length 4 for all wire segments.

Vdd programmability can be applied to interconnects to reduce FPGA power. Figure 1 shows the Vdd-programmable interconnect switch in [8]. For the routing switch in Figure 1 (a), two PMOS power transistors M3 and M4 are inserted

<sup>\*</sup>This paper is partially supported by NSF CAREER award CCR-0093273/0401682 and NSF grant CCR-0306682. Address comments to lhe@ee.ucla.edu.

between the tri-state buffer and VddH, VddL power rails, respectively. Turning off one of the power transistors can select a Vdd-level while turning off both power transistors can power-gate an unused routing switch. As shown in [8], power-gating an unused routing switch can reduce the leakage power by a factor of over 300. Another type of routing resources is the connection block in Figure 1 (b). Similar to routing switches, programmable-Vdd is also applied to connection switches inside connection blocks. The same power transistors from [8] are used in this paper.



Figure 1: Vdd-programmable interconnect switch and configurable Vdd-level conversion. (SR stands for SRAM cell and LC stands for level converter.)

A Vdd-level converter is needed whenever a VddL interconnect switch drives a VddH interconnect switch to avoid excessive leakage, and can be bypassed in other cases. As shown in Figure 1 (c), a pass transistor M1 and a MUX together with a SRAM cell can be used to implement a configurable level conversion, which is inserted in front of each interconnect switch to provide fine-grained Vdd programmability for interconnects in [8]. Same as [8], in this paper we start with the single-Vdd placed and routed netlists by VPR and then perform Vdd-level assignment for interconnects. For the rest part of the paper, we use switch to represent interconnect switch for simplicity whenever there is no ambiguity.

#### 2.2 Motivation

It has been shown that the fine-grained Vdd-level converter insertion introduces large leakage overhead. If CAD algorithms can guarantee that no VddL interconnect switch drives VddH switches, no level converter is needed. In this paper, we propose two ways to avoid using level converters. In the first approach, we enforce that there is only one Vdd-level within each routing tree, namely, tree based assignment. In the second approach, we can have different Vdd-levels within a routing tree, but no VddL switch drives VddH switches, namely, segment based assignment. To make the presentation simple, we summarize the notations frequently used in this paper in Table 1. They will be explained in detail when first used.

#### 2.3 Delay Modeling with Dual-Vdd

A directed acyclic timing graph  $\mathcal{G}(\mathcal{V}, \mathcal{E})$  [12] is constructed to model the circuit for timing analysis. Vertices represent the inputs/outputs of basic circuit elements such as registers and LUTs. Edges are added between the inputs of combinational logic elements (e.g. LUTs) and their outputs, and between the connected pins specified by the circuit netlist.

| $\mathcal{G}(\mathcal{V},\mathcal{E})$ | timing graph                                                              |
|----------------------------------------|---------------------------------------------------------------------------|
| PI                                     | set of all primary inputs and register outputs                            |
| $\mathcal{PO}$                         | set of all primary outputs and register inputs                            |
| $\mathcal{FO}_v$                       | set of all fanout vertices of vertex $v$ in $G$                           |
| $\mathcal{SRC}$                        | set of vertices corresponding to routing tree sources                     |
| $\mathcal{R}_i$                        | i <sup>th</sup> routing tree in FPGA                                      |
| $\mathcal{FO}_{ij}$                    | set of fanout switches of $j^{th}$ switch in $\mathcal{R}_i$              |
| $\mathcal{SL}_{ij}$                    | set of sinks in the fanout cone of $j^{th}$ switch in $\mathcal{R}_i$     |
| a(v)                                   | arrival time of vertex $v$ in $\mathcal{G}$                               |
| d(u, v)                                | delay from vertex $u$ to vertex $v$ in $G$                                |
| $N_r$                                  | total number (#) of routing trees in FPGA                                 |
| $v_{ij}$                               | Vdd-level of $j^{th}$ switch in $\mathcal{R}_i$                           |
| $l_{ik}$                               | # of switches in the path from source to $k^{th}$ sink in $\mathcal{R}_i$ |
| $s_{ik}$                               | allocated slack for $k^{th}$ sink in $\mathcal{R}_i$                      |
| $p_{i0}$                               | vertex in $\mathcal{G}$ corresponding to the source of $\mathcal{R}_i$    |
| $p_{ik}$                               | vertex in $\mathcal{G}$ corresponding to $k^{th}$ sink of $\mathcal{R}_i$ |
| $f_s(i)$                               | transition density of $\mathcal{R}_i$                                     |
| $N_k(i)$                               | # of sinks in $\mathcal{R}_i$                                             |
| $N_s(i)$                               | total # of switches in $\mathcal{R}_i$                                    |
| $N_l(i)$                               | # of VddL switches in $\mathcal{R}_i$                                     |
| $F_n(i)$                               | estimated # of VddL switches in $\mathcal{R}_i$                           |

Table 1: Notations frequently used in this paper.

Register inputs are not joined to register outputs. Each edge is annotated with the delay required to pass through the circuit element or routing. We use  $\mathcal{PI}$  to represent the set of primary inputs and register outputs, and  $\mathcal{PO}$  to represent the set of primary outputs and register inputs.

Elmore delay model is used to calculate the routing delay. We define the  $fanout\ cone$  of a switch as the sub-tree of the routing tree rooted at the switch. Assigning VddL to a switch affects the delay from source to all the sinks in its fanout cone, and therefore affects the delay of the corresponding edges in  $\mathcal{G}$ . To incorporate dual-Vdd into timing analysis, we use SPICE to pre-characterize the intrinsic delay and effective driving resistance for a switch under VddH and VddL, respectively. Vdd-level has little impact on the input and load capacitance of a switch, and such impact is ignored in this paper.

## 2.4 Power Modeling with Dual-Vdd

There are three types of power sources in FPGAs, switching power, short-circuit power and static (leakage) power. The first two contribute to the dynamic power and can only occur when a signal transition happens at the gate output. Although timing change may change the transition density, we assume that the transition density for an interconnect switch will not change when VddL is used, and the switches within one routing tree have the same transition density. The third type of power, static power, is the power consumed when there is no signal transition for a circuit element. We assume that the unused switches are power-gated and consume no leakage. Despite of simplification in the modeling, a more accurate power simulation will be performed to verify experimental results in Section 4.

Given Vdd-level of interconnect switches and transition density  $^2$  of routing trees, the interconnect power P using programmable dual-Vdd can be expressed as

$$P = 0.5 f_{clk} \cdot c \sum_{i=0}^{N_r - 1} f_s(i) \sum_{i=0}^{N_s(i) - 1} V dd_{ij}^2 + \sum_{i=0}^{N_r - 1} \sum_{i=0}^{N_s(i) - 1} P_s(V dd_{ij})$$

where  $N_r$  is the total number of routing trees,  $f_s(i)$  is the transition density of  $i^{th}$  routing tree  $\mathcal{R}_i$ ,  $N_s(i)$  is the num-

 $<sup>^1\</sup>mathrm{Same}$  as [8], configurable level converters are inserted at logic block inputs and outputs, and can be used when needed.

 $<sup>^2\</sup>mathrm{We}$  perform cycle-accurate simulation on single-Vdd placed and routed circuit to obtain transition density.

ber of switches in  $\mathcal{R}_i$ , and  $Vdd_{ij}$ ,  $P_s(Vdd_{ij})$  and c are the Vdd-level, leakage power and load capacitance of each switch respectively. For simplicity, we assume that all the switches have the same load capacitance. Our algorithms can however be easily extended to remove this simplification.  $v_{ij}$ indicates Vdd-level of  $j^{th}$  switch in  $\mathcal{R}_i$  as follows

$$v_{ij} = \begin{cases} 1 & \text{if Vdd-level of } j^{th} \text{ switch in } \mathcal{R}_i \text{ is VddH} \\ 0 & \text{if Vdd-level of } j^{th} \text{ switch in } \mathcal{R}_i \text{ is VddL} \end{cases}$$

The interconnect dynamic power reduction  $P_r$  using programmable dual-Vdd can be expressed as

$$P_r \propto \sum_{i=0}^{N_r - 1} f_s(i) \sum_{j=0}^{N_s(i) - 1} (1 - v_{ij}) = \sum_{i=0}^{N_r - 1} f_s(i) N_l(i)$$
 (1)

where  $N_l(i)$  is the number of VddL switches that can be achieved in  $\mathcal{R}_i$ .

#### 2.5 **Problem Formulation**

Removing Vdd-level converters requires that no VddL switch should drive VddH switches. For the tree based assignment, only one Vdd-level can be used within each routing tree, and the Vdd-level constraints can be expressed as

$$v_{ij} = v_{ik} \qquad 0 \le i < N_r \land 0 \le j, k < N_s(i) \tag{2}$$

i.e., each pair of switches within a routing tree have the same Vdd-level. For the segment based assignment, we can have different Vdd-levels within one routing tree, and the Vdd-level constraints can be expressed as

$$v_{ik} \le v_{ij}$$
  $0 \le i < N_r \land 0 \le j < N_s(i) \land k \in \mathcal{FO}_{ij}$  (3)

i.e., no VddL switch should drive VddH switches.  $\mathcal{FO}_{ij}$  gives the set of fanout switches of  $j^{th}$  switch in  $\mathcal{R}_i$ .

The timing constraints require that the maximal arrival time at  $\mathcal{PO}$  with respect to  $\mathcal{PI}$  is at most  $T_{spec}$ , i.e., for all paths from  $\mathcal{PI}$  to  $\mathcal{PO}$ , the sum of edge delays in each path p must be at most  $T_{spec}$ . As the number of paths from  $\mathcal{PI}$  to  $\mathcal{PO}$  can be exponential, the direct path-based formulation on timing constraints is impractical for analysis and optimization. Alternatively, we use the net-based formulation which partitions the constraints on path delay into constraints on delay across circuit elements or routing. Let a(v) be the arrival time for vertex v in  $\mathcal{G}$  and the timing constraints become

$$a(v) \le T_{spec}$$
  $\forall v \in \mathcal{PO}$  (4)  
 $a(v) = 0$   $\forall v \in \mathcal{PI}$  (5)

$$a(v) = 0 \qquad \forall v \in \mathcal{PI}$$
 (5)

$$a(u) + d(u, v) \le a(v) \quad \forall u \in \mathcal{V} \land v \in \mathcal{FO}_u$$
 (6)

where V is the set of vertices in  $\mathcal{G}$ , d(u,v) is the delay from vertex u to v and  $\mathcal{FO}_u$  is the set of fanout vertices of u.

The below objective function (7) is to maximize the dynamic power reduction (1).

$$Maximize \sum_{i=0}^{N_r-1} f_s(i) N_l(i)$$
 (7)

Note that (7) may help to minimize interconnect leakage power that exponentially depends on the Vdd-level. The tree based assignment problem consists of objective function (7), Vdd-level constraints (2) and timing constraints (4), (5) and (6). The segment based assignment problem is same as the tree based problem except that Vdd-level constraints (3) replace (2).

#### CHIP-LEVEL VDD ASSIGNMENT

#### 3.1 **Sensitivity Based Algorithms**

Optimal Vdd-level assignment to circuit elements in a circuit is known to be NP-complete. Below, we present two simple yet practical power sensitivity based heuristic algorithms without performance loss, namely, tree based heuristic and segment based heuristic.

#### 3.1.1 Tree Based Heuristic

Starting with a placed and routed single-Vdd circuit netlist. we calculate power sensitivity  $\Delta P/\Delta V_{dd}$ , which is the power reduction by changing VddH to VddL, for each switch with the wire it drives. The total power P includes both the dynamic power  $P_d$  and the leakage power  $P_s$ . We define the power sensitivity of tree  $\mathcal{R}_i$  as  $\sum_{j=0}^{N_s(i)-1} \Delta P_{ij}/\Delta V_{dd}$ , where  $\Delta P_{ij}/\Delta V_{dd}$  is the power sensitivity of  $j^{th}$  switch in  $\mathcal{R}_i$ .

A greedy algorithm is performed to assign Vdd-level for routing trees. In the beginning, VddH is assigned to all the trees and the power sensitivity is calculated for each tree. We then iteratively perform the following steps. VddL is assigned to the tree with the largest power sensitivity. After updating the circuit timing, we accept the assignment if the critical path delay does not increase. Otherwise, we reject the assignment and restore the Vdd-level of this tree to VddH. In either case, the tree will be marked as 'tried' and will not be re-visited in subsequent iterations. After the dual-Vdd assignment, we obtain a dual-Vdd netlist.

#### 3.1.2 Segment Based Heuristic

The segment based heuristic is quite similar to the tree based heuristic except two differences. First, the assignment unit in the segment based heuristic is an interconnect switch instead of a routing tree. We define a switch as a candidate switch if it is 'untried', and it does not drive any switch or all of its fanout switches have been marked as 'tried' and assigned to VddL. In the assignment, we try to assign VddL to the candidate switch with maximum power sensitivity in each iteration. Second, when VddL cannot be assigned to a candidate switch due to the timing violation, we mark all the upstream switches of that candidate switch in the same routing tree as 'tried' and those upstream switches stay VddH. As there is no level converter in routing channels, VddH has to be assigned to all the upstream switches of a VddH switch. We summarize the segment based heuristic in Figure 2.

```
Segment based heuristic:
Assign VddH to all switches and mark them as 'untried';
Calculate power-sensitivity for all switches;
While(∃ 'untried' switch){
   Assign VddL to the candidate switch j with the largest
power sensitivity;
   If (critical path delay increases) {
      Find all the upstream switches of j in the same tree;
      Assign VddH to j and those upstream switches, and
mark them as 'tried';
   Élse mark j as 'tried';
```

Figure 2: Segment based heuristic.

#### 3.2 Linear Programming Based Algorithm

The above algorithms implicitly allocate time slack first to routing trees or switches with higher transition density to reduce more power. Below, we present a linear programming (LP) based algorithm with explicit time slack allocation considering both global and local optimality. As the segment based assignment in general reduces more power than the tree based assignment, we only consider segment based assignment in the LP based algorithm that includes three phases: We first allocate time slack to each routing tree by formulating the problem as an LP problem to maximize a lower bound of power reduction. We then perform a bottom-up assignment algorithm to achieve the optimal solution within each routing tree given the allocated time slack. We finally perform a refinement to leverage surplus time slack. The details are discussed below.

#### 3.2.1 Chip-level Time Slack Allocation

#### A. Estimation for Number of Low-Vdd Switches

The slack  $s_{ij}$  of a connection between the source and  $j^{th}$ sink in  $\mathcal{R}_i$  is defined as the amount of delay which could be added to this connection without increasing the cycle time  $T_{spec}$ . We represent the slack  $s_{ij}$  in multiple of  $\Delta d$ , where  $\Delta d$ is the delay increase for an interconnect segment by changing the Vdd-level from VddH to VddL. Figure 3 presents a 3pin routing tree as an example. S0 and S1 are the slacks allocated to two sinks Sink0 and Sink1, respectively. In Figure 3 (a), VddL can be assigned to b2 given S0 = 1 and VddL can be assigned to b3 given S1 = 1. When we increase the slack S1 for Sink1 to 2 in Figure 3 (b), b0 has to stay VddH restricted by S0 = 1. In other words, b0 is restricted by both S0 and S1, and VddL can only be assigned to b0when  $S0 \ge 3 \land S1 \ge 2$ . Figure 3 (c) shows the case in which VddL is assigned to all the switches given  $S0 = 3 \land S1 = 2$ . Therefore, there is an upper bound for slack that is useful and slack more than the upper bound cannot lead to more VddL switches. For the rest part of the paper, we use slack to represent the useful slack.



Figure 3: Estimation of number of VddL switches.

Figure 3 (a) and (b) show that we may achieve the same number of VddL switches with different slacks. Given a routing tree with arbitrary topology and allocated slack for each sink, we need to estimate the number of VddL switches that can be achieved. We use  $l_{ik}$  to represent the number of switches in the path from the source to  $k^{th}$  sink in  $\mathcal{R}_i$ . We define sink list  $\mathcal{SL}_{ij}$  as the set of sinks in the fanout cone of  $j^{th}$  switch in  $\mathcal{R}_i$ . We then estimate the number of VddL switches that can be achieved given the allocated slack as

$$F_n(i) = \sum_{j=0}^{N_s(i)-1} \min(\frac{s_{ik}}{l_{ik}} : \forall k \in \mathcal{SL}_{ij})$$
(8)

To estimate the number of VddL switches that can be achieved in tree  $\mathcal{R}_i$ , we first deliberately distribute the slack  $s_{ik}$  evenly to the  $l_{ik}$  switches in the path from source to  $k^{th}$  sink in  $\mathcal{R}_i$ . For a switch with multiple sinks in its fanout cone, we choose the minimum  $s_{ik}/l_{ik}$  as the slack distributed to the switch.

We then add the slack distributed to all the switches in  $\mathcal{R}_i$  and get the estimated number of VddL switches. The rationale is that we consider  $k^{th}$  sink with minimum  $s_{ik}/l_{ik}$  in sink list  $\mathcal{SL}_{ij}$  as the most critical sink to  $j^{th}$  switch in  $\mathcal{R}_i$ . Figure 3 (d) gives an example and the estimated number of VddL switches is calculated as

$$F_n = S0/3 + S0/3 + S1/2 + min(S0/3, S1/2)$$

For (8) which estimates VddL switch number, we have the following theorem that can be proved by induction.

Theorem 1. Given a routing tree and allocated slack in multiple of  $\Delta d$ , (8) gives a lower bound of number of VddL interconnect switches that can be achieved.

#### B. LP Problem Formulation

The objective function (7) is to maximize power reduction which is the weighted sum of VddL switch number within each tree, where the weight is the transition density. To incorporate (8), which gives a lower bound of VddL switch number, into mathematical programming, we introduce a variable  $f_n(i,j)$  for  $j^{th}$  switch in  $\mathcal{R}_i$  and some additional constraints. The new objective function after transformation plus the additional constraints can be expressed as

$$Maximize \sum_{i=0}^{N_r-1} f_s(i) F_n(i)$$
 (9)

s.t.

$$F_n(i) = \sum_{j=0}^{N_s(i)-1} f_n(i,j) \qquad 0 \le i < N_r$$
 (10)

$$f_n(i,j) \le \frac{s_{ik} - 1}{l_{ik}} \quad 0 \le i < N_r \land \forall k \in \mathcal{SL}_{ij} \quad (11)$$

The slack  $s_{ik}$  is a continuous variable normalized to  $\Delta d$  in (11) and also the rest part of the paper. To make (10) a lower bound of number of VddL switches, we replace  $\frac{s_{ik}}{l_{ik}}$  with  $\frac{s_{ik}-1}{l_{ik}}$  in (11) to avoid floor function  $\lfloor s_{ik} \rfloor$  that is not a linear operation. The slack upper bound constraints can be expressed as

$$0 \le s_{ik} \le l_{ik} \qquad 0 \le i < N_r \land 1 \le k \le N_k(i) \tag{12}$$

where  $N_k(i)$  is the number of sinks in  $\mathcal{R}_i$ .

We modify the timing constraints (6) as follows. For the edges corresponding to routing in  $\mathcal{G}$ , the constraints considering slack can be expressed as

$$a(p_{i0}) + d(p_{i0}, p_{ik}) + s_{ik} \cdot \Delta d \le a(p_{ik})$$
  
$$0 \le i < N_r \land \forall p_{ik} \in \mathcal{FO}_{p_{i0}}$$
 (13)

where vertex  $p_{i0}$  is the source of  $\mathcal{R}_i$  in  $\mathcal{G}$ , vertex  $p_{ik}$  is  $k^{th}$  sink of  $\mathcal{R}_i$  in  $\mathcal{G}$  and  $d(p_{i0}, p_{ik})$  is the delay from  $p_{i0}$  to  $p_{ik}$  in  $\mathcal{R}_i$  using VddH. For the edges other than routing in  $\mathcal{G}$ , the constraints can be expressed as

$$a(u) + d(u, v) < a(v)$$
  $\forall u \in \mathcal{V} \land u \notin \mathcal{SRC} \land v \in \mathcal{FO}_u$  (14)

where SRC is a subset of V and gives the set of vertices corresponding to routing tree sources.

We formulate the *time slack allocation problem* using objective function (9), additional constraints (10) and (11), slack upper bound constraints (12), plus timing constraints (4), (5), (13) and (14). It is easy to verify the following theorem.

Theorem 2. The time slack allocation problem is a linear programming (LP) problem.

In this paper, we use the LP solver from [13] to solve the above problem. For the rest part of the paper, we use LP problem to represent the time slack allocation problem.

#### 3.2.2 Net-level Assignment

Given the allocated slack for each routing tree after solving the LP problem, we perform a bottom-up assignment within each tree to leverage the allocated slack. For each tree  $\mathcal{R}_i$ , VddH is first assigned to all the switches in  $\mathcal{R}_i$ . We then iteratively perform the following steps in a bottom-up fashion. We assign VddL to a candidate switch and mark the switch as 'tried'. After updating the circuit timing, we reject the assignment and restore the Vdd-level of the switch to VddH if the delay increase at any sink exceeds the allocated slack. The iteration terminates when there is no candidate switch in  $\mathcal{R}_i$ .

THEOREM 3. Given a routing tree  $\mathcal{R}_i$  and allocated slack for each sink, the bottom-up assignment gives the optimal assignment solution when Vdd-level converters cannot be used.

The above theorem can be easily proved by contradiction.

THEOREM 4. Given a routing tree  $\mathcal{R}_i$  in which each switch has a uniform load capacitance and the same transition density, and Vdd-level converter can be used, there exists a power-optimal Vdd-level assignment for any given slacks without using Vdd-level converters.

Sketch of proof: It is easy to prove that for an optimal solution using level converters, each VddL switch in  $\mathcal{R}_i$  can drive at most one VddH switch by contradiction. By keeping swapping Vdd-level of the VddL switch and its fanout VddH switch in the optimal solution, we can achieve a solution with the same number of VddL switches as the optimal solution, but no level converter is needed.  $\square$ 

#### 3.2.3 Refinement

After net-level assignment, we may further reduce power by leveraging surplus slack. Figure 3(b) shows a routing tree containing surplus slack. b0 has to stay VddH restricted by S0=1. Therefore, Sink1 can only consume one unit slack from S1 and there is surplus slack of 1. To leverage surplus slack, we mark all the VddH switches as 'untried' but keep the VddL switches as 'tried', and then perform the segment based heuristic (see Figure 2) to achieve more VddL switches and further reduce power.

#### 4. EXPERIMENTAL RESULTS

In this section, we conduct experiments on the MCNC benchmark set and present the interconnect power and area reduction by the three algorithms compared to the baseline using Vdd-programmable interconnects with level converters [8]. We use the same Vdd-programmable interconnects in [8], but no level converter is inserted in routing channels. The unused interconnect switches are power-gated in either case. Same as [8], we use 1.3v for VddH and 0.8v for VddL under 100nm technology node. We use the FPGA evaluation package fpgaEva-LP2 [2, 10] to verify our power reduction. Because the power model in fpgaEva-LP2 is more accurate than the power model in our problem formulations,

using fpga Eva-LP2 verifies both our modeling and problem formulations.

We present the interconnect power reduction in Table 2. Column 6 and Column 7 are the interconnect dynamic and leakage power for the baseline, respectively. By removing level converters in routing channels, we reduce interconnect leakage power by 91.78% (see column 8). We can also reduce area by removing configurable level converters. As shown in Table 3, the interconnect device area is reduced by 25.48% compared to [8], where the area is represented in number of minimum width transistors [12].

| circuit      | w/ LCs       | w/o LCs  | area      |
|--------------|--------------|----------|-----------|
|              | baseline [8] | ,        | reduction |
| alu4         | 8027562      | 6031490  | 24.87%    |
| apex2        | 12832956     | 9533624  | 25.71%    |
| apex4        | 8807502      | 6559485  | 25.52%    |
| bigkey       | 19520485     | 15065392 | 22.82%    |
| clma         | 123197209    | 89125972 | 27.66%    |
| des          | 28285474     | 21783998 | 22.99%    |
| diffeq       | 8479705      | 6397439  | 24.56%    |
| dsip         | 23769620     | 18101740 | 23.85%    |
| elliptic     | 27411520     | 20210164 | 26.27%    |
| ex1010       | 36205361     | 26561010 | 26.64%    |
| ex5p         | 9176510      | 6825863  | 25.62%    |
| frisc        | 57239492     | 41537145 | 27.43%    |
| misex3       | 8587536      | 6430858  | 25.11%    |
| $_{\rm pdc}$ | 53364989     | 38756961 | 27.37%    |
| s298         | 10362364     | 7824915  | 24.49%    |
| s38417       | 46594875     | 34463763 | 26.04%    |
| s38584       | 37840516     | 28014506 | 25.97%    |
| seq          | 12832956     | 9533624  | 25.71%    |
| spla         | 30784702     | 22541947 | 26.78%    |
| tseng        | 5911100      | 4484115  | 24.14%    |
| avg.         | -            | -        | 25.48%    |

Table 3: Interconnect device area reduction.

Column 2-5 in Table 2 present the percentage of VddL switches achieved by the three algorithms compared to the sensitivity based heuristic that uses a switch as an assignment unit in [8]. The tree based heuristic, segment based heuristic and LP based algorithm achieve 67.89%, 85.72% and 85.93% VddL switches, respectively. Both the segment based heuristic and LP based algorithm are better than the tree based heuristic, and achieve almost the same VddL switches. In contrast, the sensitivity based heuristic in [8] achieved 83.97% VddL switches for Vdd-programmable interconnects with level converters. Both segment based heuristic and LP based algorithm achieve more VddL switches than [8] because we remove the delay overhead of level converters in the routing<sup>3</sup>. Column 9-11 presents the interconnect dynamic power achieved by the three algorithms. Compared to [8], the segment based heuristic and LP based algorithm reduce interconnect dynamic power by 1.92% and 4.68%, respectively. The tree based heuristic cannot reduce interconnect dynamic power compared to [8] as the assignment unit is a tree. Note that we assume wire segment length and wire capacitance per segment do not change when level converters are removed. Therefore, the dynamic power reduction is pessimistic and will be larger in reality.

Column 12-14 in Table 2 present the overall interconnect power reduction. Compared to [8], The tree based heuristic, segment based heuristic and LP based algorithm reduce interconnect power by 58.03%, 64.19% and 65.13%, respectively. The LP based algorithm achieves the best power reduction as it considers both global and local optimality. The segment based heuristic achieves slightly smaller power reduction compared to the LP based algorithm.

<sup>&</sup>lt;sup>3</sup>Without considering the delay overhead of level converters, [8] may achieve more VddL switches as the Vdd-programmable interconnects with level converters are more flexible in Vdd-level assignment.

| 1        | 2                  | 3         | 4         | 5      | 6                    | 7         | 8                                          | 9         | 10            | 11      | 12            | 13        | 14      |
|----------|--------------------|-----------|-----------|--------|----------------------|-----------|--------------------------------------------|-----------|---------------|---------|---------------|-----------|---------|
|          | % of VddL switches |           |           |        |                      | ect power | interconnect power w/o LCs compared to [8] |           |               |         |               |           |         |
|          |                    | w/o LCs   |           |        | w/ LCs (baseline)[8] |           | leakage                                    | d;        | dynamic power |         | overall power |           |         |
| circuit  | w/ LCs             | tree      | segment   | LP     | dynamic              | leakage   | power                                      | tree      | segment       | LP      | tree          | segment   | LP      |
|          | [8]                | based     | based     | based  | power                | power     | (due to LC                                 | based     | based         | based   | based         | based     | based   |
|          |                    | heuristic | heuristic | alg.   | (watt)               | (watt)    | removal)                                   | heuristic | heuristic     | alg.    | heuristic     | heuristic | alg.    |
| alu4     | 69.12%             | 49.01%    | 73.70%    | 74.93% | 0.03735              | 0.03444   | -91.22%                                    | +20.90%   | -1.35%        | -5.00%  | -32.88%       | -44.46%   | -46.36% |
| apex2    | 76.97%             | 51.59%    | 80.50%    | 80.65% | 0.04472              | 0.05665   | -91.03%                                    | +37.92%   | +1.02%        | -4.90%  | -34.15%       | -50.43%   | -53.04% |
| apex4    | 71.44%             | 48.12%    | 73.24%    | 74.09% | 0.02192              | 0.03863   | -91.14%                                    | +30.72%   | +4.09%        | -3.52%  | -47.03%       | -56.67%   | -59.42% |
| bigkey   | 85.72%             | 80.70%    | 87.04%    | 87.78% | 0.07125              | 0.08036   | -92.92%                                    | +3.62%    | -1.33%        | -1.42%  | -47.55%       | -49.87%   | -49.92% |
| des      | 87.56%             | 76.81%    | 89.36%    | 89.61% | 0.08156              | 0.11627   | -93.83%                                    | +8.40%    | -2.18%        | -2.96%  | -51.68%       | -56.04%   | -56.37% |
| diffeq   | 93.89%             | 84.61%    | 94.01%    | 94.07% | 0.00476              | 0.03666   | -89.68%                                    | -4.37%    | -6.60%        | -6.54%  | -79.88%       | -80.14%   | -80.13% |
| dsip     | 86.56%             | 80.76%    | 87.55%    | 87.73% | 0.07656              | 0.09994   | -94.27%                                    | +6.04%    | -1.19%        | -1.53%  | -50.76%       | -53.89%   | -54.04% |
| elliptic | 96.70%             | 88.34%    | 97.14%    | 97.09% | 0.01716              | 0.12369   | -91.81%                                    | -4.96%    | -6.16%        | -8.38%  | -81.23%       | -81.38%   | -81.65% |
| ex1010   | 77.61%             | 56.49%    | 81.01%    | 80.30% | 0.03800              | 0.16439   | -91.93%                                    | +35.78%   | -1.11%        | -7.57%  | -67.95%       | -74.88%   | -76.09% |
| ex5p     | 73.26%             | 53.16%    | 76.67%    | 75.44% | 0.01968              | 0.04033   | -91.54%                                    | +16.45%   | -2.17%        | -3.75%  | -56.13%       | -62.23%   | -62.75% |
| frisc    | 99.44%             | 97.17%    | 99.42%    | 99.45% | 0.01251              | 0.26407   | -93.74%                                    | -11.93%   | -11.33%       | -12.18% | -90.03%       | -90.01%   | -90.04% |
| misex3   | 73.77%             | 47.95%    | 75.41%    | 75.94% | 0.03653              | 0.03721   | -91.07%                                    | +27.48%   | +2.20%        | -2.83%  | -32.34%       | -44.86%   | -47.35% |
| pdc      | 80.06%             | 53.66%    | 82.08%    | 82.22% | 0.05591              | 0.24593   | -92.92%                                    | +36.01%   | -3.21%        | -5.74%  | -69.04%       | -76.31%   | -76.78% |
| s298     | 87.42%             | 50.84%    | 88.67%    | 88.99% | 0.01269              | 0.04383   | -90.93%                                    | +41.37%   | -5.20%        | -6.29%  | -61.23%       | -71.68%   | -71.92% |
| s38417   | 90.76%             | 83.72%    | 92.04%    | 92.41% | 0.06916              | 0.21047   | -91.27%                                    | +10.91%   | +0.24%        | -3.19%  | -66.00%       | -68.63%   | -69.48% |
| s38584   | 98.07%             | 94.60%    | 98.39%    | 98.36% | 0.06632              | 0.17088   | -91.17%                                    | +4.30%    | -1.94%        | -2.02%  | -64.48%       | -66.22%   | -66.24% |
| seq      | 71.87%             | 48.12%    | 74.38%    | 75.32% | 0.04767              | 0.05663   | -91.59%                                    | +28.07%   | +1.06%        | -5.03%  | -36.90%       | -49.24%   | -52.03% |
| spla     | 77.64%             | 49.11%    | 80.28%    | 80.46% | 0.04260              | 0.13954   | -92.37%                                    | +39.25%   | +0.29%        | -5.74%  | -61.59%       | -70.70%   | -72.11% |
| tseng    | 97.63%             | 95.22%    | 97.75%    | 97.88% | 0.00627              | 0.02527   | -89.48%                                    | -0.06%    | -1.60%        | -0.37%  | -71.69%       | -72.00%   | -71.75% |
| avg.     | 83.97%             | 67.89%    | 85.72%    | 85.93% | -                    | -         | -91.78%                                    | +17.15%   | -1.92%        | -4.68%  | -58.03%       | -64.19%   | -65.13% |

Table 2: Percentage of VddL switches and interconnect power achieved by the three algorithms compared to the baseline using Vdd-programmable interconnects with level converters (LCs).

|          |            | runtime (s) |               |          |  |  |  |
|----------|------------|-------------|---------------|----------|--|--|--|
| circuit  | # of nodes | tree based  | segment based | LP based |  |  |  |
| alu4     | 10716      | 60.52       | 124.4         | 482.53   |  |  |  |
| apex2    | 14860      | 180.75      | 378.59        | 1153.28  |  |  |  |
| apex4    | 9131       | 66.93       | 177.52        | 461.37   |  |  |  |
| clma     | 91620      | 8763.24     | 16799.67      | >20H     |  |  |  |
| elliptic | 30192      | 607.85      | 913.04        | 3136.59  |  |  |  |
| ex1010   | 33265      | 836.32      | 1422.79       | 5109.22  |  |  |  |
| frisc    | 40662      | 1135.84     | 1912.15       | 6135.38  |  |  |  |
| pdc      | 40001      | 1254.57     | 2508.57       | 8210.07  |  |  |  |
| s38417   | 57503      | 1821.09     | 2895.79       | 9152.52  |  |  |  |
| s38584   | 46014      | 1255.31     | 1892.86       | 6863.62  |  |  |  |
| geome    | tric mean  | 1X          | 1.85X         | 6.66X    |  |  |  |

Table 4: Runtime comparison.

Table 4 compares the runtime  $^4$  between the three algorithms. The tree based heuristic is the fastest among the three algorithms. The segment based heuristic and the LP based algorithm take 1.85X and 6.66X runtime compared to the fastest one. For the largest circuit clma, the LP based algorithm cannot solve the LP problem after running 20 hours. Compared to the LP based algorithm, the segment based heuristic has slightly smaller power reduction, but runs 4X faster and is effective for large circuits. The LP based algorithm is worthwhile for small circuits and can achieve best power reduction.

## 5. CONCLUSIONS

We have developed chip-level dual-Vdd assignment algorithms to guarantee that no VddL switch drives VddH switches. This removes the need of Vdd-level converters and reduces interconnect leakage and interconnect device area by 91.78% and 25.48% respectively compared to [8]. We have presented two simple yet practical power sensitivity based heuristics, tree based heuristic and segment based heuristic, which implicitly allocate time slack first to interconnects with higher transition density and assign VddL to them for more power reduction. We have also presented a linear programming (LP) based algorithm in which time slack is first explicitly allocated to each routing tree by formulating the problem as an LP problem to maximize a lower bound of power reduction, and then the Vdd-level assignment is solved optimally within each routing tree given the allocated time slack. We have conducted the experiments on MCNC benchmark set and compared the power reduction by the

three algorithms. Compared to [8], the LP based algorithm obtains the best power reduction and reduces interconnect power by 65.13% without performance loss. The tree based heuristic and segment based heuristic reduce interconnect power by 58.03% and 64.19%, respectively. Compared to the LP based algorithm, the segment based heuristic has slightly smaller power reduction, but runs 4X faster and is effective for large circuits.

#### 6. REFERENCES

- K. Poon, A. Yan, and S. Wilton, "A flexible power model for FPGAs," in Proc. of 12th International conference on Field-Programmable Logic and Applications, Sep 2002.
- [2] F. Li, D. Chen, L. He, and J. Cong, "Architecture evaluation for power-efficient FPGAs," in Proc. ACM Intl. Symp. Field-Programmable Gate Arrays, Feb 2003.
- [3] J. H. Anderson, F. N. Najm, and T. Tuan, "Active leakage power optimization for FPGAs," in Proc. ACM Intl. Symp. Field-Programmable Gate Arrays, Februray 2004.
- [4] J. Lamoureux and S. J. Wilton, "On the interaction between power-aware FPGA CAD algorithms," in *Proc. Intl. Conf.* Computer-Aided Design, pp. 701–708, November 2003.
- [5] A. Gayasen, Y. Tsai, N. Vijaykrishnan, M. Kandemir, M. J. Irwin, and T. Tuan, "Reducing leakage energy in FPGAs using region-constrained placement," in *Proc. ACM Intl. Symp.* Field-Programmable Gate Arrays, February 2004.
- [6] F. Li, Y. Lin, L. He, and J. Cong, "Low-power FPGA using pre-defined dual-vdd/dual-vt fabrics," in *Proc. ACM Intl.* Symp. Field-Programmable Gate Arrays, Februray 2004.
- [7] F. Li, Y. Lin, and L. He, "FPGA power reduction using configurable dual-vdd," in *Proc. Design Automation Conf.*, June 2004.
- [8] Fei Li and Yan Lin and Lei He, "Vdd programmability to reduce FPGA interconnect power," in *Proc. Intl. Conf.* Computer-Aided Design, November 2004.
- [9] Jason H. Anderson and Farid N. Najm, "Low-power programmable routing circuitry for FPGAs," in *Proc. Intl.* Conf. Computer-Aided Design, November 2004.
- [10] Y. Lin, F. Li, and L. He, "Power modeling and architecture evaluation for FPGA with novel circuits for vdd programmability," in Proc. ACM Intl. Symp. Field-Programmable Gate Arrays, Februray 2005.
- [11] Y. Lin and L. He, "Leakage efficient chip-level dual-vdd assignment with time slack allocation for FPGA power reduction," Tech. Rep. 05-257, UCLA Engr., available at http://eda.ee.ucla.edu
- [12] V. Betz, J. Rose, and A. Marquardt, Architecture and CAD for Deep-Submicron FPGAs. Kluwer Academic Publishers, Feb 1999.
- [13] M Berkelaar, lp-solver 3.2: a public domain (MI)LP solver. ftp://ftp.ics.ele.tue.nl/pub/lp\_solve/.

<sup>&</sup>lt;sup>4</sup>The runtime includes single-Vdd placement and routing by VPR and generating the interface files between VPR and fpgaEva-LP2.