# A Novel Switchable Pin Method for Regulating Power in Chip-Multiprocessor

Zhou Zhao, Ashok Srivastava, Lu Peng and Shaoming Chen
Division of Electrical and Computer Engineering,
Louisiana State University, Baton Rouge, LA 70803,
U.S.A.

{zzhao13, eesriv, lpeng, schen26}@lsu.edu

Saraju P. Mohanty
Department of Computer Science and Engineering,
University of North Texas, Denton, TX 76207,
U.S.A.
saraju.mohanty@unt.edu

#### Abstract

Transistor scaling has allowed a large number of circuits to be integrated into integrated circuit (IC) chips implemented in nanometer CMOS technology nodes. However, dark silicon, which refers to under-utilized circuitry, will become dominant in future chips because of limited thermal design power (TDP). Furthermore, large voltage loss due to complex routing and placement will also degrade the performance of ICs. In addition, effectively managing power dissipation in a packaged chip is one of the major issues of IC design. Previous work by our group focused mainly on RCL simulation and elementary IC simulation. This work not only builds a power delivery network (PDN) model, but also designs switchable pins serving two cores at the layout level. The essence of our idea is to supply power to the chip through traditional I/O pads. To balance power supply and I/O bandwidth, we place several groups of parallel switchable pins between the core and memory so that I/O pads can dynamically switch between two modes: data transmission and power supply. To remove the risk that a large current flowing through an I/O pad damages the pad frame, we redesign the traditional I/O pad to operate bi-directionally. Using the TSMC 180nm CMOS process for design and simulation, our test results show that the proposed switchable pins compensate well for voltage loss in a chip multiprocessor, and that the transition time between the two modes is very short. For data transmission, we perform a sensitivity study to explore the impact of the switchable pins. Our simulation results demonstrate that the performance degradation is within an acceptable range when switchable pins are added to the chip multiprocessor.

#### Keywords

Dark Silicon, Multiprocessor System, Voltage Loss, Power Delivery Network, Switchable Pin, Pad Frame

#### 1. Introduction

Efficient circuit and hardware implementations are critical to achieving high-performance computing. Growing clock frequency and design complexity inevitably increase processor power dissipation. For emerging nanometer silicon MOSFETs, quantum effects become dominant, and thus sub-threshold leakage [1] and the resulting heat dissipation are critical problems for future IC development. Because of limited thermal design power (TDP), dark silicon [2, 3], which refers to transistors that must run at reduced frequency or even be turned off, will appear in future chips and will counteract the purpose of transistor scaling. From the perspective of voltage regulation, complex routing and placement can cause an uneven power supply across the blocks of a chip, which affects the reliable operation of the chip.

Effective power distribution plays an important role in IC chip design [4]. Extensive work has been done on power management in both the circuit and architecture areas. On-chip and off-chip voltage regulators have been designed [5, 6], which can robustly manage power modes according to different workloads. Many passive devices are integrated with packaged chips and motherboards; however, this comes at the cost of added area and complexity. In [7], an architectural concept is proposed in which the chip multiprocessor serves multiple functions in portable devices. The sub-core is designed in such a way that it largely enhances the utilization ratio of a single chip. In [8], an extreme turbo technique is demonstrated in which a core runs at ultra-high speed for a short time and is then under-clocked for a while, at the cost of a non-conventional cooling device with phase change materials.

The present work is based on our earlier work [9, 10, 11], which proposed a switchable pin concept for efficient power delivery and simulated it at both the RCL level and a simple IC level. The method converts I/O pins, together with their pads, into a group of power pads to compensate for voltage loss in the chip. This work uses a dedicated PDN simulation to build a proper PCB environment, so that the switchable pins work reliably in a chip multiprocessor with two sub-cores with the help of a clock block. We also explore the voltage compensation provided by switchable pins, the chip cost they introduce, and their negative influence on signal integrity.

The contributions of this work include: 1) We analyze the voltage loss due to long global wires under the routing/placement rules of current VLSI chips, and anticipate the trend of voltage loss in chips designed in nanometer-scale process technologies. 2) With switchable pins added, the interface between core and memory needs to be modified to keep proper signal integrity; we therefore model a power delivery network (PDN) and build a specific PDN with the proposed switchable pins for our design. Our PDN is guided by the rules of proper PCB design and by parameters extracted from the IC fabrication process (TSMC 180nm). 3) We modify the traditional I/O pad to carry the current needed to boost the IC chip through the pad frame, since some of the I/O pads need to be used as power pads and an overlarge current might damage the pad frame. 4) Our circuit design uses a microprocessor with two sub-cores and a group of memory chips at the layout level as the platform. The design of the interface on the PCB is based on PDN modeling, and a specific clock circuit is introduced to control the switchable pins. 5) We verify that switchable pins compensate well for voltage loss at low cost. We also perform a sensitivity study to explore the impact of switchable pins, including signal integrity at different frequencies.

The rest of the paper is organized as follows. Section 2 presents the analysis of voltage loss in chips. Section 3 presents the concept of the switchable pin and its PDN modeling. Section 4 covers the modification of the I/O pad. Section 5 presents our circuit design implementing the switchable pin and the related test results. The summary is presented in Section 6.

Fig. 1. Cross-section view of power supply interconnection from multiple layers.

#### 2. Analysis of Voltage Loss

The finite width of wires, contact resistance and complex routing cause voltage loss in chips. A previous theoretical study has shown that voltage drop can influence the performance of data transmission [12]. Current IC fabrication provides multiple metal layers on the silicon wafer to relieve the pressure on chip layout. However, complex interconnects and contacts, together with inductive effects at high frequency, still degrade the voltage distribution in chips. Fig. 1 shows a cross-section view of power supply interconnection from multiple layers. For a global metal wire in a chip, the voltage loss can be described by the following equation:

$$V_{loss} = \sum_{i=1}^{m-1} R_{c\_i}\, i_{w} + \rho_{w} \frac{l}{w}\, i_{w} + Z_{L}\, i_{w} \tag{1}$$

where $m$ is the number of metal layers, $R_{c\_i}$ is the contact resistance of the $i$-th contact, $i_w$ is the current flowing through the wire, and $l$ and $w$ are the length and width of the global wire, respectively. $Z_L$ is the impedance of the inductance in the global interconnection. The three terms in Eq. 1 reflect the voltage loss contributed by the contacts, the wire resistance, and the inductive effect at high frequency, respectively.
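As a rough illustration of how the three terms of Eq. 1 combine, the following Python sketch evaluates the voltage loss for one global wire. The function name and all numerical values are hypothetical and are not taken from the paper. The sketch also assumes that $\rho_w$ is a sheet resistance (ohms per square) and that the inductive term uses the magnitude $|Z_L| = 2\pi f L$ of an ideal inductor.

```python
import math


def voltage_loss(contact_resistances, sheet_resistance, length, width,
                 inductance, frequency, current):
    """Voltage loss along one global wire, following the three terms of Eq. 1.

    Assumptions of this sketch (not stated in the paper):
    - sheet_resistance is rho_w in ohm/square, so rho_w * l / w is in ohms;
    - the inductive term uses |Z_L| = 2*pi*f*L (magnitude only, phase ignored).
    """
    # Contact term: one contact between each pair of adjacent layers (m - 1 contacts).
    v_contacts = sum(contact_resistances) * current
    # Wire term: rho_w * (l / w) * i_w.
    v_wire = sheet_resistance * (length / width) * current
    # Inductive term: |Z_L| * i_w.
    v_inductive = 2.0 * math.pi * frequency * inductance * current
    return v_contacts + v_wire + v_inductive


# Hypothetical example: 10 metal layers (9 contacts of 1 ohm each),
# a 1 mm x 1 um wire at 0.05 ohm/square, 1 nH at 1 GHz, 1 mA of current.
loss = voltage_loss([1.0] * 9, 0.05, 1e-3, 1e-6, 1e-9, 1e9, 1e-3)
print(f"V_loss = {loss * 1e3:.2f} mV")
```

With these made-up numbers the wire resistance dominates, and the inductive term grows linearly with frequency.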

Fig. 2. Voltage loss ratio in various fabrication processes.

Fig. 3. Growth rate of fall/rise time for various processes at 1GHz.

To estimate the voltage loss in chips, we make the following considerations: a) The fabrication process parameters used for estimation come from the Predictive Technology Model (PTM) [13]. b) The global wire serving the power supply is straight, without complex turns, which means we ignore the mutual inductance between neighboring wires. The interconnect parameters for the selected fabrication process are extracted as in [14]. c) We consider 10 metal layers in a chip. The top and bottom layers are used as the global layer and the layer supplying power to sub-blocks, respectively. d) The voltage loss due to inductance at higher frequencies includes both dynamic and static loss [15]. Here we focus only on static loss, which means that for the loss due to the inductive effect, the absolute value of the impedance is used in the estimation without considering phase. e) We select a virtual core with 100 million transistors. For each transistor, we use the minimum dimensions of the selected process. Considering the gaps between adjacent transistors/sub-blocks, the area of the virtual core is taken to be about 1.2 times the total area of its 100 million transistors. For a single transistor, DRC rules cannot be neglected; following the area-saving layout style that is currently mainstream for a single transistor, the area of a single transistor is $(1.5\times l)\times(3\times w)$, where $l$ and $w$ here denote the transistor length and width. f) The largest voltage loss occurs at the geometric center of the virtual chip. g) For a virtual chip multiprocessor, we assume 2, 4, 6 and 8 virtual cores, corresponding to the 32nm, 20nm, 10nm and 7nm fabrication processes, respectively.
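The area estimate in consideration (e) can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the core is assumed to be square, and the example dimensions ($l = w = 32$ nm) are placeholders rather than values from the paper.

```python
import math


def transistor_area(l_min, w_min):
    """Layout area of one transistor per consideration (e): (1.5*l) x (3*w)."""
    return (1.5 * l_min) * (3.0 * w_min)


def core_area(n_transistors, l_min, w_min, gap_factor=1.2):
    """Virtual core area: gap_factor times the summed transistor area."""
    return gap_factor * n_transistors * transistor_area(l_min, w_min)


def core_side(n_transistors, l_min, w_min, gap_factor=1.2):
    """Side length of the core, assuming a square core (assumption of this sketch)."""
    return math.sqrt(core_area(n_transistors, l_min, w_min, gap_factor))


# Hypothetical example: 100 million transistors with l = w = 32 nm.
side = core_side(100e6, 32e-9, 32e-9)
print(f"core side = {side * 1e3:.3f} mm")
```

Under consideration (f), the distance from the chip edge to the geometric center scales with this side length, which is what sets the length of the longest supply wire in the estimate.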

Fig. 4. Package a) standard, and b) with switchable pins.

Guided by considerations (a) to (g) above, we can calculate the voltage drop in the virtual single core and chip multiprocessor for various nanometer-scale processes, as shown in Fig. 2. We observe that the voltage loss increases with frequency because of the inductive effect. It is anticipated that if emerging chips work at ultra-high frequency, or have too many layers, voltage loss will continue to increase. Another issue that needs to be analyzed is the relationship between voltage loss and the transmission time of signals in chips. In this work, we mainly study how voltage loss influences rise and fall times, which are significant factors determining signal transmission. The mathematical relationship is as follows [16]:

$$t_{f} = t_{r} = \frac{2C_{L}}{\beta(V_{DD} - |V_{TH}|)} \left[ \frac{|V_{TH}| - 0.1V_{DD}}{V_{DD} - |V_{TH}|} + 0.5 \ln \left( \frac{19V_{DD} - 20|V_{TH}|}{V_{DD}} \right) \right] \tag{2}$$

Here $C_L$ is the load capacitance, $\beta$ is the transistor gain factor, $V_{DD}$ is the supply voltage and $V_{TH}$ is the threshold voltage. Current digital VLSI is based on static logic, and the load capacitance is mostly contributed by the equivalent gate capacitance of the next stage. Fig. 3 shows the growth rate of the fall/rise time calculated at 1GHz using PTM under the influence of voltage loss, which increases with frequency. It can be concluded that if no voltage calibration is used in VLSI chips, voltage drop will seriously degrade the rise and fall times. The mainstream strategy to avoid this problem is to embed an on-chip voltage regulator. However, this method comes at the cost of power dissipation and chip area.
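To make the dependence of Eq. 2 on the supply voltage concrete, the sketch below evaluates the transition time and the relative growth caused by a given voltage loss. The function names and the example parameters ($C_L$ = 1 fF, $\beta$ = 1 mA/V², $V_{DD}$ = 1.0 V, $V_{TH}$ = 0.3 V) are hypothetical. Treating voltage loss as a simple reduction of $V_{DD}$ is an assumption of this sketch, not necessarily the procedure used for Fig. 3.

```python
import math


def transition_time(c_load, beta, v_dd, v_th):
    """Rise/fall time from Eq. 2 for load c_load, gain factor beta, supply v_dd."""
    v_th = abs(v_th)
    overdrive = v_dd - v_th
    log_arg = (19.0 * v_dd - 20.0 * v_th) / v_dd
    if overdrive <= 0.0 or log_arg <= 0.0:
        raise ValueError("Eq. 2 requires V_DD sufficiently above |V_TH|")
    bracket = (v_th - 0.1 * v_dd) / overdrive + 0.5 * math.log(log_arg)
    return 2.0 * c_load / (beta * overdrive) * bracket


def growth_rate(c_load, beta, v_dd, v_th, v_loss):
    """Relative increase of the transition time when V_DD drops by v_loss.

    Modeling voltage loss as a lower effective V_DD is an assumption here.
    """
    nominal = transition_time(c_load, beta, v_dd, v_th)
    degraded = transition_time(c_load, beta, v_dd - v_loss, v_th)
    return (degraded - nominal) / nominal


# Hypothetical parameters: 1 fF load, beta = 1 mA/V^2, V_DD = 1.0 V, V_TH = 0.3 V.
t_nominal = transition_time(1e-15, 1e-3, 1.0, 0.3)
print(f"nominal transition time = {t_nominal * 1e12:.2f} ps")
print(f"growth for 0.1 V loss   = {growth_rate(1e-15, 1e-3, 1.0, 0.3, 0.1):.1%}")
```

Both the prefactor and the bracketed term grow as the effective supply voltage falls, which is why even a modest voltage loss lengthens the transition time noticeably.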



Fig. 5. Conceptual diagram of SoC with switchable pins.

#### 3. Concept of Switchable Pin and Its Power Delivery Model

Normally, a digital VLSI chip with complex functions has one power pad, one ground pad, a clock pad, and several data pads used for writing or reading. This arrangement maximizes the bandwidth of data transmission for a specific package. However, as analyzed in the last section, a single pair of pads used for power supply cannot guarantee that each sub-block works at the ideal voltage, because of complex routing and placement. If we assign more pads as power pads, the power supply performance is enhanced compared to that of a typical package. However, this strategy decreases bandwidth, which is also an important factor in current high-speed chips.

One observation is that a processor does not always need large bandwidth, since its demand varies with its state [17]. Under some specific instructions, some data pins are idle. This is where the novel switchable pin concept applies: several I/O pads dynamically switch between traditional data transmission and power supply. With this strategy, when large bandwidth is needed, the switchable-pin package behaves like a normal one. When the required bandwidth is reduced, some of the I/O pads are changed to power pads to compensate for the voltage loss in the chip. Fig. 4 shows the standard package and the package with switchable pins. In Fig. 4 (a), the color gradient from pure black to white reflects the trend of voltage drop in a die when the standard package is used. As shown in Fig. 4 (b), multiple switchable pads are used to ensure that each part of the die is supplied with the intended voltage without voltage drop.

Ensuring correct data transmission between processors and memories is the key problem when applying switchable pins in a system-on-chip (SoC). Our solution is that when some of the I/O pads are used for power delivery, the rest of the I/O pads not only keep working for their original data lines, but also transmit the data of the switched pins. To achieve this, two groups of parallel switches, located in-core and off-core, are needed. Fig. 5 shows the conceptual diagram of our design. We define two modes: data mode and power mode. In data mode, only Sdata is enabled and the entire system works as usual: the original core is supplied by one power pad, all of the I/O pads work for data transmission, and the supported core is idle. In power mode, both Spower and Sdata are enabled. The switchable pins begin to work, and some of the I/O pads belonging to the original core start to be used as virtual power pads for the supported core. In this mode, step-by-step data transmission in the original core is achieved through the limited I/O pads with the aid of a clock shifter (not shown in Fig. 5). Note that in our previous design [11], all of the data go to one storage unit, which corresponds to a single non-switchable line, while the rest of the storage units, which correspond to switchable lines, are idle. The disadvantage of the previous approach is that it largely decreases the utilization rate of the memory chip. Furthermore, in order to recognize data from different lines, the decoder and controller managing WRITE/READ in memory need to be modified, which increases the design difficulty. It also brings extra delay, thereby degrading the performance of the high-speed data mode. In this work, we modified the off-core switches so that all storage units receive data step by step in power mode. The essence of our modification is adding another group of parallel switches. The signal from each data line has to pass through three switches, at the cost of delay. However, this method avoids modifying the decoder and controller of the memory. These considerations guide our redesign of the switchable pins.



Fig. 6. Power delivery network of: a) 'two-chip' mode, b) 'three-chip' mode and c) normal mode without switchable pins.

There are two feasible options for achieving the switchable function: a tri-state buffer and CMOS switches. A tri-state buffer provides better signal integrity than a CMOS switch. In addition, to improve signal integrity, an equalizer can be added near the data ports. The equalizer, like the tri-state buffer, reduces signal attenuation at the cost of a very large delay. However, we cannot ignore the delay issue, since our modification between the core chip and memory chip already adds a large delay to data transmission. Thus, as a tradeoff, we choose a CMOS switch with two inverters instead of a tri-state buffer to balance delay and signal integrity.

It is obvious that the left group of switches shown in Fig. 5 must be in the processor chip to achieve correct mode switching. For the right group of switches, we have two strategies: 1) The first method adds an independent chip of parallel switches between the processor and memory. The advantage of this idea is that it avoids modifying the memory, which reduces the difficulty of memory design. We define this as the 'three-chip' model. 2) The second method embeds the parallel switches into the memory. This reduces the delay between the processor and memory, since it eliminates an extra package. We define this as the 'two-chip' model.

We have built a power delivery network (PDN) to compare the performance of the two strategies and to address a major concern: how to set up a proper package/PCB model to describe signal transmission through the SoC. In [18], the core is decomposed and a detailed network for on-chip power distribution is built. In [19], a systematic model is proposed, which includes the chip, the PCB and the related parasitic parameters. Since in this step we mainly focus on predicting the signal integrity of the proposed idea, we assume that all signals that start from the chip output and end at the I/O pins of the memory are ideal. Under this assumption, monitoring signals going through the package and PCB is much more important than monitoring those going through the on-chip circuits. Thus, the systematic model is suitable for our prediction. Besides the package, bumps and vias are also very important factors influencing performance. Here we conduct a detailed study of the model in support of our proposed idea.

First, our power delivery model should work well in both directions (READ and WRITE): a given model should give acceptable performance in both operations. The core frequency is usually much higher than the memory frequency. Thus, the performance of WRITE is relatively more important than that of READ. Second, unlike previous work, we introduce the parameters of the bump, bump wire and via into our model to ensure the accuracy of signal transmission. Finally, the matched resistance is a free parameter that we can set according to the reflections occurring in the PCB.

Fig. 7. Simulation results of signal attenuation in a) read stage and b) write stage.

Table 1 Summary of dimensions in PDN

| Parameter       | Value |
|-----------------|-------|
| $\varepsilon_r$ | 4.4   |
| $T$             | 50mil |
| $D_1$           | 20mil |
| $D_2$           | 32mil |
| $d$             | 10mil |

Guided by the above principles, we have built two models corresponding to the 'three-chip' model and the 'two-chip' model, respectively. Fig. 6 shows the two proposed models and an ideal model without switchable pins. To determine each parameter in our models, we mainly use the datasheet of the TSMC 180nm fabrication process to set the pad frame model. For the PCB environment, we use the classical C4 package, which is still the mainstream in current SoCs. We use the work in [20] to set up our C4 package, including the bump and its wire. For the via, which can also influence the performance of the entire model, we use the industry standard model [21]. Note that we do not use the DIMM package, which is the mainstream package of DRAM used on computer mainboards, but uniformly use the C4 package for both the core chip and the memory chip for compatibility with all SoCs. The major parasitic parameters of a via are inductance and capacitance. These two parameters can be calculated as follows [22]:

$$C_{via} = \frac{1.41\varepsilon_r T D_1}{D_2 - D_1} \tag{3}$$

$$L_{via} = 5.08h \left[ \ln \left( \frac{4h}{d} \right) + 1 \right] \tag{4}$$

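where $\varepsilon_r$ is the relative permittivity of the PCB, $D_1$ and $D_2$ are the diameters of the via pad and anti-pad, respectively, $h$ is the length of the via, $d$ is the diameter of the via barrel and $T$ is the thickness of the PCB. With all dimensions expressed in inches, Eq. 3 gives the capacitance in pF and Eq. 4 gives the inductance in nH.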
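As a quick numerical check of Eqs. 3 and 4, the following sketch evaluates both expressions. It assumes that the Table 1 dimensions are in mils (thousandths of an inch), that the via length $h$ equals the board thickness $T$, and that the formulas take inches and return pF and nH. The function names are hypothetical. Under these assumptions the results land close to the 0.5pF and 1.02nH used in our models.

```python
import math

MIL_TO_INCH = 1e-3  # 1 mil = 0.001 inch


def via_capacitance_pf(eps_r, t_inch, d1_inch, d2_inch):
    """Eq. 3: parasitic via capacitance in pF, with dimensions in inches."""
    return 1.41 * eps_r * t_inch * d1_inch / (d2_inch - d1_inch)


def via_inductance_nh(h_inch, d_inch):
    """Eq. 4: parasitic via inductance in nH, with dimensions in inches."""
    return 5.08 * h_inch * (math.log(4.0 * h_inch / d_inch) + 1.0)


# Table 1 values, read as mils (an assumption of this sketch).
eps_r = 4.4
t = 50 * MIL_TO_INCH
d1 = 20 * MIL_TO_INCH
d2 = 32 * MIL_TO_INCH
d = 10 * MIL_TO_INCH
h = t  # assumption: the via spans the full board thickness

print(f"C_via = {via_capacitance_pf(eps_r, t, d1, d2):.3f} pF")
print(f"L_via = {via_inductance_nh(h, d):.3f} nH")
```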

Using the dimensional constraints of PCB design, we define each dimensional parameter of the via in our models as shown in Table 1. From these, we obtain a capacitance of 0.5pF and an inductance of 1.02nH. The complete set of parameters in our models is then given in Table 2. Fig. 7 shows the simulation results of signal attenuation. From the simulations, we notice that the performance of data transmission in WRITE mode is better than in READ mode. In READ mode, the signal attenuates quickly when the frequency exceeds 1GHz. This is acceptable, since current mainstream DDR3 runs below 900MHz [23]. Comparing the two proposed models from the viewpoint of signal integrity, the 'two-chip' model is a good practical choice for the design flow of specific ICs with a PCB. The cost of this method is that parallel switches must be added in memory to ensure correct data transmission. However, as mentioned before, we do not need to modify the decoder and controller in memory. We also plot the performance of the model without switchable pins. It can be seen that our switchable pins keep signal integrity within an acceptable range, especially at low frequencies.

Table 2 Summary of parameters in PDN

| Component | Parameter      | Value  |
|-----------|----------------|--------|
| PAD       | $R_{ESD}$      | 50kΩ   |
|           | $C_{pad}$      | 250fF  |
| BUMP      | $L_{BUMP}$     | 60pH   |
|           | $R_{BUMP}$     | 30mΩ   |
|           | $C_{BUMP}$     | 0.2pF  |
| BOND WIRE | $L_{BONDWIRE}$ | 2.58nH |
|           | $R_{BONDWIRE}$ | 90mΩ   |
|           | $C_{BONDWIRE}$ | 0.02pF |
| VIA       | $L_{VIA}$      | 1.02nH |
|           | $C_{VIA}$      | 0.5pF  |
| TLINE     | Delay          | 40ps   |
|           | Impedance      | 50Ω    |
|           | $R_{matched}$  | 500kΩ  |
| DRAM      | $L_{DRAM}$     | 0.5nH  |
|           | $C_{DRAM}$     | 300fF  |
| CORE      | $R_{AC}$       | 50Ω    |
|           | $R_{WIRE}$     | 100Ω   |

#### 4. Redesign of I/O Pad

In the previous section, we discussed how the switchable pin works in an SoC and modeled its power delivery network. One thing that cannot be ignored is that, unlike traditional I/O pads, which connect to the gates of transistors that drive logic circuits or to the output stage of the chip, the pads in our work connect directly to the power node of circuits. This is a major difference between the two types of connections. Since the CMOS transistor is a voltage-controlled device in which the gate resistance is extremely high, the current through the gate is negligible. The power supply route, however, carries a large dynamic current during complex logic processing. For achieving the switchable function, a bi-directional pad seems a potential candidate. But as shown in Fig. 8, when power is supplied into the chip, the p-MOSFET in the buffer must carry a huge current, as a power pad does, because it has to power the entire core. A normal transistor cannot drive such a large current. Based on this analysis, the original I/O pad is not suitable for our design.

Fig. 8. Logic and circuit diagrams of a traditional bi-direction pad and the problem when used for power supply.

This can be addressed by suitably modifying the p-MOSFET in the buffer design, for which we consider three options: 1) We can use numerous p-MOSFETs in parallel to reduce the current going through each transistor. However, for a normal MOSFET, the maximum current is at the mA level. To reach the large current drawn by complex processors, hundreds or even thousands of pMOS transistors would be required in parallel, which greatly increases the cost of the package. For the discharge current through an nMOS transistor, a large current likewise needs numerous nMOS transistors in parallel. Thus, this method is infeasible for our work. 2) To tolerate a huge current, the power MOSFET is a good choice [24]. Compared to the previous method, this implementation avoids the area cost in the package. However, the delay of a power MOS transistor is larger than that of a traditional MOS transistor [25], which adversely affects data transmission at high speed. Another issue is that the large current going through this transistor generates considerable heat in the package, and excessive heat can degrade data transmission to some degree [26]. 3) In current IC package technology, signal transmission in a pad frame used for mixed-signal IC design is straightforward and can be achieved by metal interconnection without logic gates. Such a pad is used for transmitting the variable voltage of an analog signal, and a variable current that is larger than the current in the I/O port of a digital IC. Therefore, we selected the third method as the initial solution for handling the overlarge current going through I/O pads.

Looking closely at the current analog pad, typically only one metal layer is needed for the transmission route of a variable signal. To carry the drive current a core needs, in this work we integrate six metal layers to let the current go through. We also need to control the signal direction. The designed pad should work well not only for power supply from off-chip to in-chip, but also for data transmission from in-chip to off-chip.

Fig. 9. Redesigned I/O pad used for both data transmission and power supply: a) logic diagram, and b) layout.

Fig. 10. Delay comparison between mainstream pads and proposed redesigned pad.

The novelty of our approach lies in combining a traditional bi-directional I/O pad and an analog pad with some modifications, as shown in Fig. 9. In the modified I/O pad, we set two routes: one is the traditional output port using a tri-state buffer, and the other consists of six metal layers used for the power supply. When Data\_IO\_EN is high, the tri-state buffer is enabled so that the signal goes from core to off-core. When Data\_IO\_EN is low, the data route is blocked, and off-core power drives the supported core. This redesign can tolerate a large current because the original logic gates are replaced by six overlapping metal layers. This comes at the cost of signal integrity: when data travel from off-core to in-core, there is no buffer to restore the signal. Since timely power boosting is required to turn on the supported core, response time matters; we therefore compared the delays of the analog pad, the traditional bi-directional pad, the power pad, the power MOSFET based pad and our redesign, as shown in Fig. 10. From the results, we can see that our design is faster than the bi-directional pad, the power MOSFET based bi-directional pad and the power pad. Although the analog pad is the fastest, our modification preserves signal integrity from in-chip to off-chip. Regarding robustness, since our modification only affects the logic block of the I/O pad and leaves the ESD part unchanged, there is no risk of transistor breakdown due to unwanted high voltage.

Fig. 11. The layout design flow.

#### 5. Circuit Implementation and Testability

##### 5.1 Circuit Implementation

For the circuit design, we used part of openMSP430 (a 16-bit mixed-signal microcontroller) [27] as a single core and DRAM as memory [28]. Note that the standard openMSP430 has a digital block that processes only digital data, and an analog part that includes Sigma-Delta ADCs, passive-device-based DACs and analog comparators to meet the requirements of various mixed-signal processing tasks. In our design, we used only the digital block, for two reasons: 1) Our proposed switchable pin is mainly intended for digital VLSI chips, especially real chip multiprocessors. 2) Studying performance at high speed is essential for digital VLSI. If analog blocks were added, the overall operating speed could not be as fast as that of the digital block alone. Specifically, a normal Sigma-Delta ADC works at the MHz level, and the DAC used in openMSP430 is not an RF DAC, which means its conversion frequency cannot reach the GHz level. The analog blocks would therefore largely reduce the overall operating speed. Thus, we choose the pure digital block of openMSP430 as the single core for our simulation. The entire digital block is built from standard logic gates, with registers used for temporary data storage. This helps to boost the core frequency, since only the cascaded logic chain contributes delay, and the registers help avoid data loss during high-speed operation.



Fig. 12. A core chip (12 of 16 I/O pads, modified as switchable pins).

At the circuit level, the design technology is TSMC 180nm. The design flow is shown briefly in Fig. 11 and explained as follows. First, we export an EDF file from the Verilog source using *Mentor Graphics LeonardoSpectrum*. Then a TPR file is exported from the EDF file in *Tanner Schematic-Edit*. With the support of a standard cell library at the layout level and custom-defined routing and placement rules, the TPR file is used to automatically draw the complete layout of the core and the logic blocks of the DRAM in *Tanner Layout-Edit*. We manually drew the layouts of the storage units in the DRAM, the extra circuits for switchable pins and the wires connecting sub-blocks. Finally, to place the pad frames for both the core chip and the DRAM chips, the redesigned pad, traditional I/O pad, power pad and ground pad are used to complete the entire layout.

The layout diagram is shown in Fig. 12, in which 12 of the 16 I/O pads are switchable pins supplying the supported core. Note that the core chip contains two single cores, referred to as the original core and the supported core. Our circuit implementation focuses on verifying the proposed switchable pin rather than the internal circuits of the single core. Thus, for the pad definition of the single core, we group the instruction ports of one core and connect them to one pad. The same grouping is used for the data inputs and outputs that do not connect to DRAMs. For the pad definition of the two cores, we do not share pads between identical ports but use independent pads. The single real power pad serves only the original core. The supported core is powered in power mode through the switchable pins. A single core has 16 ports connecting to DRAM, and a single DRAM has 8 data ports. Therefore, to verify the correctness of data transmission, we used 4 DRAM chips to handle the data from the two cores.

Table 3 Summary of designed cases

| Case name | $N_{swp}$ | $N_{data\_busy}$ | $N_{data\_normal}$ | $N_{data\_power}$ |
|-----------|-----------|------------------|--------------------|-------------------|
| SWP_8     | 8         | 8                | 0                  | 2                 |
| SWP_10    | 10        | 5                | 1                  | 3                 |
| SWP_12    | 12        | 4                | 0                  | 4                 |
| SWP_14    | 14        | 2                | 0                  | 8                 |

The additional circuits that control the switchable pins include parallel switches in both the core chip and the DRAM chips, and a clock tree that shifts the clock signal to achieve step-by-step data transmission during power mode. For the parallel switches in both the core chip and the DRAM chips, the resistance of a switch is much smaller than that of the core and DRAM, so the possibility of an overlarge voltage across the switches should not be of concern. The switch does not need to be very large, which helps to suppress delay. In particular, we place the parallel switches in the DRAM chip in front of the data storage units without modifying the DRAM controller and decoder. This means that the in-memory switches determine where signals go in a specific mode, while the traditional controller still controls READ/WRITE. The reason for this placement is to reduce the additional circuits in DRAM and avoid extra delay. The clock block is built from parallel shift registers, as in our previous work [11]. To precisely distinguish and control the two modes, we add a non-overlapping block [29] at the end of the shift register to prevent the two modes from being active at the same time and causing data contention.
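The step-by-step transmission driven by the shifted clock can be pictured with a small behavioral model. The sketch below is purely illustrative and does not describe the actual circuit: it assumes a ring shift register holding a single active phase, with each phase letting exactly one data source drive the shared pad. The function names `one_hot_phases` and `step_by_step` are hypothetical.

```python
from itertools import islice


def one_hot_phases(n_phases):
    """Yield successive states of an n-stage ring shift register holding a single 1."""
    state = [1] + [0] * (n_phases - 1)
    while True:
        yield tuple(state)
        # Shift by one stage; the last stage wraps around to the first.
        state = [state[-1]] + state[:-1]


def step_by_step(sources):
    """Send one value per clock cycle over a shared pad, one source per phase."""
    n = len(sources)
    sent = []
    for cycle, phases in enumerate(islice(one_hot_phases(n), n)):
        # Non-overlap check: exactly one phase may be active in any cycle.
        if sum(phases) != 1:
            raise RuntimeError("overlapping phases would cause data contention")
        active = phases.index(1)
        sent.append((cycle, active, sources[active]))
    return sent


# Three data sources sharing one pad in power mode (hypothetical values).
for cycle, source, value in step_by_step(["d0", "d1", "d2"]):
    print(f"cycle {cycle}: source {source} drives the pad with {value}")
```

The explicit one-active-phase check plays the role that the non-overlapping block plays in hardware: it guarantees that two drivers are never enabled in the same cycle.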

#### 5.2 Testability

The post-layout simulation was run after the core chip and the DRAM chips were combined, following the PDN designed in Section 3. For the final verification of our design, we divided the tests into three parts. The first part concerns power: how much bonus power can be supplied to the supported core, how much power is consumed by the switchable pins and the extra clock block in the core chip, and how much voltage compensation the switchable pins provide in the core chip. The second part concerns data transmission. As in the PDN simulation, we use signal attenuation to evaluate performance under the influence of the switchable pins. The last part concerns area, measuring how much extra area is added by the circuits serving the switchable pins.

An important question is how many switchable pins can work properly for a given core chip. We define $N_{swp}$ as the number of switchable pins, $N_{data\_busy}$ as the number of unchanged I/O pads that carry the data of the switched pads in power mode, and $N_{data\_normal}$ as the number of unchanged I/O pads that still work normally in power mode. Each pad that carries data for switched pads in power mode is responsible for transmitting $N_{data\_power}$ data sources (including its own data). We name each case after the number of switchable pins in the core chip. The guiding principle is that, to ease the data transmission burden carried by the unchanged I/O pads in power mode, for a given case we use as many of the unchanged I/O pads for data transmission as we can. Guided by this, several cases are described in Table 3.
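The bookkeeping behind Table 3 can be checked with a short script. The sketch below assumes 16 I/O pads in total (as in the layout of Fig. 12) and a selection rule that is only one reading of the principle above, not an algorithm stated in this work: use the largest number of unchanged pads that divides the switched pads evenly, so that every busy pad carries the same number of data sources. The function and variable names are illustrative.

```python
TOTAL_PADS = 16  # assumption: total I/O pads, taken from the Fig. 12 layout description


def derive_case(n_swp, total_pads=TOTAL_PADS):
    """Return (n_data_busy, n_data_normal, n_data_power) for n_swp switchable pins.

    Assumed rule (an interpretation, not stated in the text): pick the largest
    number of unchanged pads that shares the switched pads' data evenly.
    """
    unchanged = total_pads - n_swp
    for n_busy in range(unchanged, 0, -1):
        if n_swp % n_busy == 0:
            # Each busy pad carries its own data plus an equal share of the switched pads.
            n_data_power = n_swp // n_busy + 1
            return n_busy, unchanged - n_busy, n_data_power
    raise ValueError("no unchanged pad is left to carry data")


for n_swp in (8, 10, 12, 14):
    print(f"SWP_{n_swp}", derive_case(n_swp))
```

Under these assumptions the script gives the same four rows as Table 3, and each row satisfies $N_{swp} + N_{data\_busy} + N_{data\_normal} = 16$ and $N_{data\_busy} \cdot N_{data\_power} = N_{swp} + N_{data\_busy}$.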



Fig. 13. Current simulation in both data mode and power mode.



Fig. 14. Power dissipation of extra circuits serving for switchable pins.

For the power test, we apply random signals to all input ports to obtain varying dynamic current through a single core. Fig. 13 shows the current test for the case SWP\_8. In power mode, the total power is doubled compared to data mode, because the two cores are the same and the current drawn by each core is approximately identical. Across the different cases, since the additional circuits serving the switchable pins are much smaller than the two cores, the current in the additional blocks is also much smaller than that in the two cores. Thus, no matter how many I/O pads are used as switchable pins, the power is roughly doubled when the chip works in power mode. This shows that our switchable pins can act as a dynamic power supply, as a traditional power pad does. The response time of a mode transition is very small compared to the duration of one mode. As long as very fast transitions between the two modes are not required, this mode delay is acceptable.

Fig. 15. a) Mean value and b) standard deviation of supplied voltage under power pad and switchable pin supply.

The extra circuits serving the switchable pins inevitably consume power in the core chip. We calculate the additional power dissipation as follows: 1) For the clock block, which controls the transition between the two modes, power dissipation is measured as in traditional digital circuits. 2) For the parallel switches between the core and the pads, we first obtain the power dissipation of a single switch. We simulate the entire system over a period that includes both power mode and data mode with WRITE and READ, then multiply the average voltage across the switch by the average current through it, taken over that period, to get the total energy consumed in the period. Finally, from this energy and the length of the period, we obtain the average power dissipation of a single switch. Repeating this for every switch and summing the results gives the extra power dissipated by the switches. Fig. 14 shows the extra power dissipation introduced by the clock block and the switches for the four cases described in Table 3. The results show that the extra circuits do not add much power dissipation to the entire system. From Fig. 13, we can conclude that the overall power dissipation of the extra circuits does not exceed 7% of the total power dissipation of our core chip.
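The per-switch calculation described above can be sketched numerically. This is a minimal illustration, not the simulation flow used in this work: it assumes the switch voltage and current are available as sampled waveforms and, instead of multiplying averages, integrates the instantaneous power $v(t) \cdot i(t)$ with the trapezoidal rule to get the energy, then divides by the period length. The two approaches coincide when voltage and current are constant over the period. The waveform values and function names are hypothetical.

```python
def trapezoid(ys, ts):
    """Integrate samples ys over time points ts with the trapezoidal rule."""
    return sum((ys[k] + ys[k + 1]) / 2 * (ts[k + 1] - ts[k])
               for k in range(len(ts) - 1))


def switch_average_power(ts, volts, amps):
    """Average power (W) of one switch over the sampled period."""
    period = ts[-1] - ts[0]
    inst_power = [v * i for v, i in zip(volts, amps)]
    energy = trapezoid(inst_power, ts)  # joules consumed in the period
    return energy / period


def total_switch_power(waveforms):
    """Sum the average power over all switches; each item is (ts, volts, amps)."""
    return sum(switch_average_power(ts, v, i) for ts, v, i in waveforms)


# Hypothetical samples for one switch: time (s), voltage drop (V), current (A).
ts = [0.0, 1e-9, 2e-9]
volts = [0.0, 0.02, 0.0]
amps = [0.0, 0.01, 0.0]
print(switch_average_power(ts, volts, amps))
```

The extra power of the switches is then `total_switch_power` over all switch waveforms, to which the clock block power is added.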

To verify the voltage compensation provided by our switchable pins, we mainly monitor the voltage distribution in the supported core, since in our setup the original core is close to the traditional power pad and is fully powered by it. The supported core is far from the power pad, so its voltage loss should be compensated mainly by the switchable pins. Thus, we supply the supported core in turn from the power pad and from the switchable pins, and compare the resulting voltage distributions. Automatic placement and routing in *Tanner Layout-Edit* follows a row-by-row style. Thus, we sample the supply voltages of all rows under power pad supply and under switchable pin supply. From these samples, we compute the mean value and standard deviation for the four cases described in Table 3. The comparison of the mean value and standard deviation under the two supply methods is shown in Fig. 15. We can see that the proposed switchable pin does compensate the voltage loss and provides the supported core with a better power supply than the power pad does.
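The row-wise statistics can be computed as in the sketch below. It assumes the sampled row voltages are available as plain lists. The numbers are made-up placeholders, not measured values from this work, and the population standard deviation is an assumption, since the text does not say which estimator was used.

```python
from statistics import fmean, pstdev


def supply_stats(row_voltages):
    """Return (mean, population standard deviation) of per-row supply voltages."""
    return fmean(row_voltages), pstdev(row_voltages)


# Hypothetical per-row samples (V) for the supported core; not data from the paper.
power_pad_rows = [1.80, 1.76, 1.72, 1.68]
switchable_pin_rows = [1.79, 1.78, 1.78, 1.77]

for name, rows in (("power pad", power_pad_rows),
                   ("switchable pins", switchable_pin_rows)):
    mean_v, std_v = supply_stats(rows)
    print(f"{name}: mean = {mean_v:.3f} V, std = {std_v:.4f} V")
```

A higher mean indicates less voltage loss, and a lower standard deviation indicates a more uniform supply across rows.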

Fig. 16. Signal attenuation in WRITE stage.

Fig. 17. Signal attenuation in READ stage.

The data test is more complex than the power test. To explore the influence of the switchable pins, we define two additional cases, SWP\_0\_C1 and SWP\_0\_C2. SWP\_0\_C1 represents a chip with only one core and no switchable pins. SWP\_0\_C2 refers to a chip in which both cores are supplied normally by a single power pad, without switchable pins. In this case, the data line we monitor is the one far from the power pad, in order to see how voltage loss influences signal integrity. Since these two cases have no switchable pins, they do not have two modes. For all cases, we compare WRITE and READ, in both power mode and data mode where the two modes exist. Fig. 16 and Fig. 17 show the test results for all cases.

The results show that the WRITE stage performs better than the READ stage, which agrees with our prediction from the PDN simulations. From the circuit point of view, this is because in the WRITE stage, data going from the core to the DRAM is buffered by the tri-state buffer in the pad, whereas in the READ stage there is no buffer block in the pad, since we modified the original I/O pad to meet our current requirement. Thus, data from the DRAM to the core is not as clean as data from the core to the DRAM, which explains the performance degradation in the READ stage. Comparing the two modes, data integrity in data mode is better than in power mode, as the simulations show. In data mode, the entire system works like a normal one without switchable pins; only the turned-on CMOS switches add delay to data transmission. In power mode, only a limited number of I/O pads are in charge of data transmission. One data line needs to transmit data from several data sources in one period, which greatly increases the difficulty of data transfer. As a result, the line charges and discharges incompletely, producing an imperfect signal.

Among all the cases in this work, the single-core chip without switchable pins performs the best. The chip composed of two cores without switchable pins achieves acceptable performance at low and medium clock frequencies. As the frequency increases, however, inductive effects become dominant and degrade signal integrity because of the increase in both fall and rise times. Another observation is that signal attenuation grows with the number of switchable pins, because each data source on a shared line gets a shorter time for transmission. From our results, we conclude that SWP\_8, SWP\_10, and SWP\_12 achieve acceptable signal integrity in both the WRITE stage and the READ stage. In the case of SWP\_14, however, signal attenuation is very large, since one I/O route has to carry 8 data routes in power mode.
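The trend across the cases can be illustrated with a first-order model. The sketch below assumes that in power mode a shared pad divides one clock period equally among its $N_{data\_power}$ sources and that the line behaves as a single RC stage. Both are simplifying assumptions that ignore the inductive effects noted above, and the time constant `tau_s` is a hypothetical value rather than one extracted from the layout. It shows how the reachable fraction of full swing shrinks as the per-source time slot shortens.

```python
import math

# Data sources carried by each busy pad in power mode, taken from Table 3.
N_DATA_POWER = {"SWP_8": 2, "SWP_10": 3, "SWP_12": 4, "SWP_14": 8}


def slot_time(clock_hz, n_sources):
    """Time available per data source if one clock period is shared equally."""
    return 1.0 / (clock_hz * n_sources)


def settled_fraction(slot_s, tau_s):
    """Fraction of full swing reached by a first-order RC line within slot_s."""
    return 1.0 - math.exp(-slot_s / tau_s)


clock_hz = 1e9   # example clock frequency (assumption for illustration)
tau_s = 50e-12   # hypothetical RC time constant of one data line

for case, n_sources in N_DATA_POWER.items():
    slot = slot_time(clock_hz, n_sources)
    frac = settled_fraction(slot, tau_s)
    print(f"{case}: slot = {slot * 1e12:.0f} ps, settled fraction = {frac:.3f}")
```

With any fixed time constant, the settled fraction decreases monotonically from SWP\_8 to SWP\_14, which is consistent with the larger attenuation observed as more pins are switched.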

In terms of area, the two cores occupy around 6.63 mm², and the extra area added by the extra circuits for SWP\_8, SWP\_10, SWP\_12, and SWP\_14 is 0.1257 mm², 0.1263 mm², 0.1281 mm², and 0.1319 mm², respectively, which is below 2% of the two-core area in every case. The extra circuits therefore cost little chip area.

Finally, we compare this work with our previous work. The work in [9, 10] gave the initial reports presenting the concept of the switchable pin. Those simulations focus mainly on system-level architectural simulation, using an RCL model to predict the feasibility of switchable pins in a chip multiprocessor. In [11], an elementary IC-level simulation was done without PDN simulation and without considering the potential transistor breakdown if a traditional I/O pad is used. Compared to [11], this work makes several improvements: 1) We build a specific PDN that guides how to set up the PCB environment so that the switchable pin works well in the entire system. 2) We redesign the I/O pad to strengthen the robustness of the entire system, preventing large currents from breaking down the core chip. 3) In the test part, we comprehensively simulate all aspects that evaluate the performance of the whole system and the extra cost introduced by the switchable pin. 4) We carry out a sensitivity study of the influence of different numbers of switchable pins working for the core chip.

#### 6. Conclusion

This work presents a novel concept, the switchable pin, to regulate power distribution in a chip multiprocessor at low cost. We used several nanometer-scale CMOS technology nodes to predict the serious performance degradation caused by voltage loss in chips with complex functions. Motivated by the idea of placing more power pads on the chip, we proposed the switchable pin and described its fundamental principle. We demonstrated the feasibility of our idea by studying a power delivery network suitable for an SoC with switchable pins. Simulation shows that the signal attenuation introduced by the switchable pin is acceptable. Furthermore, because the I/O pads must tolerate large currents in order to supply bonus power, we redesigned the I/O pads to include output routes using tri-state buffers and power routes using multiple metal layers. Finally, we combined an automatic layout flow with manual layout in EDA software to implement our idea at the circuit level. The final test results show that, using switchable pins, the power in the chip can be doubled without a long response time or a large voltage loss in the pads. Through the signal integrity simulations, we also found that switchable pins do not seriously degrade data transmission performance, especially at medium frequencies (2 GHz for WRITE and 1 GHz for READ). This work is the first of its kind, and so it has been limited to particular considerations such as PCB design for a C4 package and the use of a 180 nm CMOS process.

#### Acknowledgment

Part of this work is supported by NSF grant 1422408.

#### References

- [1] R. Anjana, A. K. Somkuwar, Analysis of sub threshold leakage reduction techniques in deep submicron regime for CMOS VLSI circuits, Proceedings of 2013 International Conference on Emerging Trends in VLSI, Embedded System, Nano Electronics and Telecommunication System (ICEVENT 2013), 2013, pp. 1-5.
- [2] H. Esmaeilzadeh, E. Blem, R. S. Amant, K. Sankaralingam, D. Burger, Dark silicon and the end of multicore scaling, Proceedings of 38th Annual International Symposium on Computer Architecture (ISCA 2011), 2011, pp. 365-376.
- [3] Y. Zhang, L. Peng, X. Fu, Y. Hu, Lighting the dark silicon by exploiting heterogeneity on future processors, Proceedings of 50th ACM/EDAC/IEEE Design Automation Conference (DAC), 2013, pp. 1-7.
- [4] Z. Zhao, A. Srivastava, S. M. Chen, S. P. Mohanty, An Algorithm Used in a Power Monitor to Mitigate Dark Silicon on VLSI Chip, Proceedings of the 14th IEEE Computer Society Annual Symposium on VLSI (ISVLSI), 2015, pp. 191-194.
- [5] R. J. Milliken, J. S. Martinez, E. S. Sinencio, Full On-Chip CMOS Low-Dropout Voltage Regulator, IEEE Transactions on Circuits and Systems I: Regular Papers, 54 (9), (2007) 1879-1890.
- [6] S. K. Lau, P. K. T. Mok, K. N. Leung, A Low-Dropout Regulator for SoC With Q-Reduction, IEEE Journal of Solid-State Circuits, 42 (3), (2007), 658-664.
- [7] S. Swanson, M. B. Taylor, Greendroid: Exploring the next evolution in smartphone application processors, IEEE Communications Magazine, 49 (4), (2011), 112-119.
- [8] A. Raghavan, Y. Luo, A. Chandawalla, M. Papaefthymiou, K. P. Pipe, T. F. Wenisch, M. M. K. Martin, Computational sprinting, Proceedings of IEEE 18th International Symposium on High Performance Computer Architecture (HPCA 2012), 2012, pp. 1-12.
- [9] S. M. Chen, Y. Hu, Y. Zhang, L. Peng, J. Ardonne, S. Irving, A. Srivastava, Increasing off-chip bandwidth in multi-core processors with switchable pins, Proceedings of 2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA), 2014, pp. 385-396.

- [10] S. M. Chen, L. Peng, Y. Hu, Z. Zhao, A. Srivastava, Y. Zhang, J. W. Choi, B. Li, E. Song, Powering Up Dark Silicon: Mitigating the Limitation of Power Delivery via Dynamic Pin Switching, IEEE Transactions on Emerging Topics in Computing, 3 (4), (2015), 489-501.
- [11] Z. Zhao, A. Srivastava, L. Peng, S. M. Chen, S. P. Mohanty, Circuit Implementation of Switchable Pins in Chip Multiprocessor, Proceedings of IEEE 1st International Symposium on Nanoelectronic and Information Systems (iNIS), 2015, pp. 89-94.
- [12] S. P. Mohanty, Nanoelectronic Mixed-Signal System Design, McGraw Hill Professional, 2015, 1st Edition, ISBN-10: 0071825711, ISBN-13: 978-0071825719.
- [13] Latest Predictive Transistor Model, Available: http://ptm.asu.edu/latest.html (accessed Feb. 2016) [Online].
- [14] Interconnect Estimation, Available: http://ptm.asu.edu/ (accessed Feb. 2016) [Online].
- [15] M. Shevgoor, J. S. Kim, N. Chatterjee, R. Balasubramonian, A. Davis, A. N. Udipi, Quantifying the relationship between the power delivery network and architectural policies in a 3D-stacked memory device, Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2013), 2013, pp. 198-209.
- [16] J. M. Rabaey, A. P. Chandrakasan, B. Nikolić, 2nd Edition, Digital Integrated Circuits, Prentice Hall, Upper Saddle River, NJ, 2002.
- [17] H. Akkary, M. A. Driscoll, A dynamic multithreading processor, Proceedings of the 31st Annual ACM/IEEE International Symposium on Microarchitecture (MICRO), 1998, pp. 226-236.
- [18] R. Zhang, K. Wang, B. H. Meyer, M. R. Stan, K. Skadron, Architecture implications of pads as a scarce resource, Proceedings of 2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA), 2014, pp. 373-384.
- [19] M. Popovich, A. V. Mezhiba, E. G. Friedman, 1st Edition, Power distribution networks with on-chip decoupling capacitors, Springer Science & Business Media, New York, NY, 2007.
- [20] K. DeHaven, J. Dietz, Controlled collapse chip connection (C4)-an enabling technology, Proceedings of 44th Electronic Components and Technology Conference, 1994, pp. 1-6.
- [21] Via Optimization Techniques for High-Speed Channel Designs, Available: https://www.altera.com/content/dam/altera-www/global/en\_US/pdfs/literature/an/an529.pdf (accessed Jan. 2016) [Online].
- [22] High Speed PCB Layout Techniques, Available: http://www.ti.com/lit/ml/slyp173/slyp173.pdf (accessed Jan. 2016) [Online].
- [23] MICRON DDR3 SDRAM, Available: https://www.micron.com/products/dram/ddr3-sdram (accessed Feb. 2016) [Online].

- [24] R. R. Boudreaux, R. M. Nelms, A comparison of MOSFETs, IGBTs, and MCTs for solid state circuit breakers, Proceedings of 1996 Applied Power Electronics Conference and Exposition (APEC), 1996, pp. 227-233.
- [25] D. A. Grant, J. Gowar, 1st Edition, Power MOSFETs: Theory and Applications. Wiley-Interscience, Hoboken, NJ, 1989.
- [26] F. Shoucair, Design Consideration in High Temperature Analog CMOS Integrated Circuits, IEEE Transactions on Components, Hybrids, and Manufacturing Technology, 9 (3), 1986, 242-251.
- [27] openMSP430, Available: http://opencores.org/project,openmsp430 (accessed Jan. 2016) [Online].
- [28] K. Itoh, 5th Edition, VLSI memory chip design, Springer Science & Business Media, New York, NY, 2013.
- [29] A. M. Abo, Design for Reliability of Low-Voltage, Switched-Capacitor Circuits, Ph.D. Dissertation, University of California, Berkeley, USA, 1999.