# STREAM: Stress and Thermal Aware Reliability Management for 3-D ICs

Hai Wang, Darong Huang, Rui Liu, Chi Zhang, He Tang, Member, IEEE, and Yuan Yuan

Abstract-Accurate and fast reliability management is important for 3-D integrated circuits (3-D ICs) because of the severe on-chip thermal and reliability problems. However, due to the lack of stress information and difficulties in implementing management method for reliability, existing full-chip reliability management methods suffer from low management accuracy and high system performance degradation. In this paper, we propose a new stress and thermal aware reliability management method for 3-D ICs called STREAM. Unlike traditional methods which do not perform explicit stress analysis due to the large computing cost, STREAM employs an artificial neural network-based stress model to estimate stress accurately at runtime. In order to further improve the reliability management accuracy and improve the system performance, a lifetime estimator with lifetime banking technology and a specially designed lifetime model predictive control are integrated into the reliability management framework. Our numerical results show that STREAM performs the stress and thermal aware full-chip reliability management with both high accuracy and speed. It is able to boost the performance of 3-D ICs and outperforms the state-of-the-art 3-D IC reliability management method.

Index Terms—3-D integrated circuit (3-D IC), artificial neural network (ANN), model predictive control (MPC), reliability management, stress and thermal aware.

# I. INTRODUCTION

3-D integrated circuits (3-D ICs) exploit the *z*-direction of traditional 2-D ICs by integrating multiple silicon layers vertically using through-silicon vias (TSVs) to achieve performance improvements [1]. A 3-D IC consisting of two layers connected by TSVs is shown in Fig. 1. Although the 3-D IC has many advantages, its stacked structure brings about severe thermal and reliability problems, because it has a higher power density than traditional 2-D IC chips. These problems are very challenging and are the major obstacles that prevent the commercialization of 3-D ICs [2], [3].

Manuscript received March 26, 2018; revised June 24, 2018; accepted August 4, 2018. Date of publication October 19, 2018; date of current version October 16, 2019. This work was supported in part by the National Natural Science Foundation of China under Grant 61404024, in part by the Fundamental Research Funds for the Central Universities under Grant ZYGX2016J043, and in part by the Scientific Research Foundation for the Returned Overseas Chinese Scholars, State Education Ministry. This paper was recommended by Associate Editor C. Zhuo. (Corresponding author: Hai Wang.)

H. Wang, D. Huang, C. Zhang, and H. Tang are with the State Key Laboratory of Electronic Thin Films and Integrated Devices, University of Electronic Science and Technology of China, Chengdu 610054, China, and also with the School of Electronic Science and Engineering, University of Electronic Science and Technology of China, Chengdu 610054, China (e-mail: wanghai@uestc.edu.cn).

R. Liu is with the Institute of Chemical Materials, China Academy of Engineering Physics, Mianyang 621900, China.

Y. Yuan is with the School of Automation Engineering, University of Electronic Science and Technology of China, Chengdu 610054, China.

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TCAD.2018.2877019

Fig. 1. 3-D IC microprocessor with two layers connected by TSVs.

Fig. 2. Structure of a TSV used in 3-D IC. The TSV is filled with Cu and contains a $SiO_2$ liner. (a) Longitudinal-section view. (b) Cross-sectional view.

Much research has been done to solve the thermal issues of 3-D ICs. Wang *et al.* [4] focused on 3-D thermal modeling and analysis. A TSV placement technique was proposed in [5] to minimize lateral heat blockages caused by TSV structures in 3-D ICs. Cong *et al.* [6], [7] developed thermal-aware placement approaches for 3-D ICs to reduce the maximum on-chip temperature. Dynamic thermal management methods for 3-D IC systems were proposed in [8]–[11]. However, these studies consider only temperature and ignore the reliability issues of 3-D ICs.

Unlike traditional IC chips, 3-D ICs contain a special component, the TSV, whose structure is shown in Fig. 2. Because of the TSV, temperature has a much more complex impact on the final reliability of 3-D ICs. As a result, simply optimizing the temperature distribution no longer necessarily leads to good reliability for 3-D ICs. Recently, much research has addressed the reliability issues of 3-D ICs directly [12]–[14]. Among these issues, reliability problems caused by stress, especially TSV induced stress, have the most significant impact: the tensile stress generated by the mismatch in thermal expansion coefficients between the TSV and silicon can cause reliability problems such as cracking and timing violation [15]. To solve these problems, some techniques such as TSV tapering and TSV placement were introduced in [16]–[18] for 3-D IC design and manufacturing.

Besides reliability optimization at the design or manufacturing stage using the techniques mentioned above, reliability management at runtime is also important to guarantee the safety and enhance the performance of 3-D ICs. One critical aspect of runtime reliability management is to correctly estimate the reliability information of 3-D ICs. Some existing methods rely on finite element methods (FEMs) [19], [20]. However, FEM is too expensive for runtime reliability management because of its large computing overhead. To make runtime reliability management feasible, an approximate expression of the reliability information of 3-D ICs is used in [21]. But such an approximate expression sacrifices accuracy and may lead to poor management results. On the other hand, very few control methodologies have been proposed for 3-D IC runtime reliability management. The state-of-the-art 3-D IC reliability management work [21] uses a primitive control scheme to avoid thermal and reliability violations of 3-D ICs. This scheme does not lead to good reliability control quality, and cannot boost 3-D IC performance even when the chip is far from any reliability violation.

In this paper, we propose a new stress and thermal aware runtime reliability management method for 3-D ICs, called STREAM, to mitigate the mentioned problems in the existing 3-D IC reliability management. STREAM contains three novel components: 1) an artificial neural network (ANN)-based stress model for runtime stress estimation; 2) a lifetime estimator with lifetime banking technique to boost system performance; and 3) a specially designed model predictive control (MPC) method called lifetime MPC to improve the control quality in reliability management. With these new components, STREAM is able to fully exploit the performance potential of the 3-D IC system and still keep its designed operating lifetime.

The remaining part of this paper is organized as follows. In Section II, we first review some important work in reliability management. Next, in Section III, we present the thermal and reliability models of 3-D ICs used in STREAM. Then, we demonstrate STREAM, which is the new stress and thermal aware runtime reliability management method for 3-D ICs, in Section IV. The experimental results showing the performance of STREAM are presented in Section V. Finally, Section VI concludes this paper.

# II. RELATED WORK AND MOTIVATION

In this section, we briefly review some important research in reliability management, especially in 3-D IC stress-aware reliability management.

Reliability modeling and estimation of microprocessor at the architecture level were studied in [22]–[24]. Among them, temperature-related reliability aware microprocessor model (RAMP) [22] is well known for considering major failure mechanisms in integrated circuits and computing an overall mean time to failure (MTTF). RAMP model is extended to interconnect lifetime prediction and many-core systems in [25] and [26]. In order to increase the accuracy of RAMP model, a hierarchical reliability modeling framework based on Monte Carlo technique was studied in [27].

A dynamic reliability management method was introduced in [28]. This method uses a reliability banking strategy, which accumulates a reliability reserve under light load and temporarily boosts chip performance by consuming that reserve when the system load is heavy. A dynamic reliability management method based on workload driven conditions with a banking strategy was developed in [29]. Ideas similar to the banking strategy were also applied to electromigration (EM) reliability management problems [30], [31]. For network-on-chip based microprocessors, a dynamic reliability management scheme was introduced in [32]. However, these methods mainly consider traditional integrated circuit structures, and cannot handle 3-D ICs, in which TSV induced stress is a main reliability concern.

TSV structures are used in 3-D ICs to connect multiple die layers in the vertical direction, which also cause some reliability problems. The TSV stress induced reliability problems have been studied in [13] and [14]. TSV tapering techniques have been proposed in [16] to alleviate the power and thermal issues of 3-D IC. There are also static stress-aware reliability managements proposed by adjusting the TSV keep-out zone size, TSV placement, or TSV structures in [17] and [18].

For runtime stress-aware reliability management, a runtime thermal management technique considering TSV stress induced reliability problems was proposed in [21]. It relieves TSV induced stress by eliminating the temperature gradients on each layer of the chip, and also considers the thermal cycling effect. However, that work does not quantify the influence of temperature and stress on the reliability problems of 3-D ICs and does not fully exploit the performance of 3-D ICs.

As discussed above, existing 3-D IC runtime reliability management methods share the following problems.

- 1) Existing methods lack explicit stress information for runtime reliability management decisions, due to the large computing cost of stress estimation. Without accurate stress information, the reliability management may lead to poor system performance or even reliability violation.
- 2) With the existing runtime reliability management methods, the 3-D IC system cannot boost its performance when executing heavy load applications, meaning that the high performance potential of the 3-D IC system is limited overly pessimistically by existing methods.
- 3) The existing methods lack an advanced control scheme to quantify the power adjustment suggestions for reliability management actions. Without such an advanced control scheme, the system may experience large reliability control overshoot or system performance degradation.

In this paper, a new 3-D IC runtime reliability management method STREAM is proposed to solve the above problems in existing methods. The major contributions of this paper are summarized as follows.

- 1) To avoid the stress approximation error of existing reliability management of 3-D ICs, STREAM uses an ANN to perform runtime stress analysis. Thanks to this ANN stress model, designing an accurate runtime reliability management method with 3-D IC performance boost becomes possible.
- 2) To fully exploit the 3-D IC's performance potential under the reliability constraint, we introduce a lifetime estimator with lifetime banking technique to the 3-D IC reliability management method, with the reliability information estimated by the ANN stress model. Thanks to the lifetime estimator, STREAM can temporarily boost 3-D IC performance and still maintain its designed operating lifetime.
- 3) To further improve the control quality in 3-D IC reliability management, a specially designed MPC method, called lifetime MPC, is integrated into STREAM. With the new lifetime MPC, STREAM is able to quantify the future power suggestion values. The power suggestions lead to smoother and more efficient reliability control, compared with existing 3-D IC runtime reliability management methods.
- 4) Experimental results demonstrate that STREAM is able to accurately estimate the stress information and boost 3-D IC performance with accurate and smooth reliability control. It outperforms the state-of-the-art 3-D IC reliability management method in both control quality and 3-D IC system performance.

# III. THERMAL AND RELIABILITY MODELS

In this section, we present the thermal and reliability models used in STREAM. The 3-D IC thermal model shown in Section III-A will be used in the lifetime MPC component (which will be presented later in Section IV-D) of STREAM in order to compute the suggested power. The reliability models shown in Section III-B will be used in the lifetime estimator component (which will be presented later in Section IV-C) of STREAM to compute the lifetime information of the 3-D IC system.

## A. Thermal Model of the 3-D IC

Since stress is influenced by temperature, and reliability (lifetime consumption) is affected by both temperature and stress, we need the 3-D IC thermal model in reliability management for lifetime consumption prediction.

Due to the well-known duality between thermal system and electrical circuit system, we can build the thermal model of the 3-D IC using thermal equivalent resistors (thermal resistors), thermal equivalent capacitors (thermal capacitors), and thermal equivalent independent current and voltage sources. Assume there is a 3-D IC with n silicon and interface material layers and l cores. By using finite difference method, each layer can be divided into m grids. Then we can generate the thermal model of the 3-D IC as ordinary differential equations like

$$\begin{aligned} GT(t) + C\frac{dT(t)}{dt} &= B_c P(t) \\ Y(t) &= LT(t) \end{aligned} \tag{1}$$

where  $T(t) \in \mathbb{R}^q$  is the thermal vector representing temperatures of q grids of the 3-D IC, including mn grids on the n layers, and (q-mn) grids on package components (spreader, heat sink, etc.);  $G \in \mathbb{R}^{q \times q}$  matrix includes thermal resistance information;  $C \in \mathbb{R}^{q \times q}$  matrix includes thermal capacitance information;  $B_c \in \mathbb{R}^{q \times l}$  matrix contains the power injection topology information; and  $P(t) \in \mathbb{R}^l$  is the power vector with power dissipations of l cores at time t, and it is the input of the model.  $Y(t) \in \mathbb{R}^{mn}$  is the thermal vector with temperature information of the l cores, and it is the output of the model;  $L \in \mathbb{R}^{mn \times q}$  is the output selection matrix, which selects the mn temperatures on the 3-D IC layers from T(t). For the details of generating the thermal model matrices  $(G, C, B_c, \text{ and } L)$  in (1), please refer to the thermal modeling works [33]–[36] and especially the 3-D IC thermal modeling works [37], [38].

In order to analyze the thermal system, the original ordinary differential equation (1) in continuous time is discretized into the following difference equation by using the Euler method or other numerical integration methods as:

$$\begin{aligned} T(k) &= AT(k-1) + B_d P(k) \\ Y(k) &= LT(k) \end{aligned} \tag{2}$$

where the variables T(k), P(k), and Y(k) are the discretized versions of T(t), P(t), and Y(t) in (1), and A and $B_d$ are formulated using G, C, and $B_c$ according to the specific numerical integration method used to discretize (1). For example, if we use the backward Euler method, which ensures absolute stability, to perform the discretization with time step h, we have

$$A = \left(\frac{C}{h} + G\right)^{-1} \frac{C}{h}, \qquad B_d = \left(\frac{C}{h} + G\right)^{-1} B_c. \tag{3}$$

Then, the thermal model in (2) can be used to compute the temperatures of the 3-D IC, with Y(k) as the output, by feeding in the power consumption P(k) of the chip's power units as the input.
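The discretization in (2) and (3) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the two-node network, its conductance and capacitance values, and the helper names (`mat_vec`, `solve`, `backward_euler_step`) are all assumptions made for the example. Instead of forming the inverse in (3) explicitly, each step solves the equivalent linear system $(C/h + G)\,T(k) = (C/h)\,T(k-1) + B_c P(k)$.

```python
def mat_vec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def solve(M, b):
    """Solve M x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(row) + [bi] for row, bi in zip(M, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x


def backward_euler_step(G, C, Bc, T_prev, P, h):
    """One step of (2): solve (C/h + G) T(k) = (C/h) T(k-1) + Bc P(k)."""
    n = len(T_prev)
    lhs = [[C[i][j] / h + G[i][j] for j in range(n)] for i in range(n)]
    c_term = mat_vec(C, T_prev)
    b_term = mat_vec(Bc, P)
    rhs = [c_term[i] / h + b_term[i] for i in range(n)]
    return solve(lhs, rhs)


# Hypothetical two-node network: node 0 is a core, node 1 a heat sink.
# Temperatures are rises above ambient; all values are illustrative only.
G = [[2.0, -2.0], [-2.0, 3.0]]   # thermal conductances (W/K)
C = [[0.5, 0.0], [0.0, 5.0]]     # thermal capacitances (J/K)
Bc = [[1.0], [0.0]]              # the single core injects power into node 0
T = [0.0, 0.0]
for _ in range(2000):
    T = backward_euler_step(G, C, Bc, T, P=[10.0], h=0.1)
```

With constant power, the iteration settles at the steady state of (1), that is, the solution of $GT = B_c P$, which gives a quick sanity check on the matrices.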

## B. Reliability Models

Despite their many advantages over traditional 2-D ICs, 3-D ICs suffer from severe reliability problems, due to their high power density and poor heat removal ability in the vertical dimension [38], [39]. The TSV structure makes the reliability of 3-D ICs even worse: because the TSV and the die are manufactured using different materials, thermal variations in both space and time lead to stress variations around the TSV, which shorten the lifetime of 3-D ICs [40]. There are two main stress-induced reliability problems: the thermal cycling effect and the stress migration effect. We will introduce the thermal cycling effect in Section III-B1 and the stress migration effect in Section III-B2.

1) Reliability Model for Thermal Cycling Effect: Fatigue failures can be induced by thermal cycling. With the increasing of thermal cycle number, the permanent damage accumulates and eventually leads to failure. Thermal cycling effect can be modeled as [41]

$$MTTF_{TC} = \frac{1}{\nu_{TC}} = A_0 \left(\frac{1}{\sigma - \sigma_0}\right)^{q_c} \tag{4}$$

where MTTF $_{TC}$  is the MTTF due to thermal cycles.  $\nu_{TC}$  is the inverse of MTTF $_{TC}$ , meaning the average lifetime consumption rate of thermal cycling effect.  $A_0$  is an empirically determined material-dependent constant by assuming the thermal cycling frequency to be constant [22], and  $q_c$  is the Coffin-Manson exponent. Since the maximum stress always appears around TSV for 3-D IC, we consider the worst case stress by selecting the maximum stress around TSV. In this way,  $\sigma$  is the runtime maximum stress and  $\sigma_0$  is the maximum stress when the 3-D IC is under ambient temperature. Specifically, in STREAM,  $\sigma$  will be given as the output of the ANN stress model shown later in Section IV-B.

2) Reliability Model for Stress Migration Effect: The stress migration effect is caused by mechanical stress induced by the difference in thermal expansion coefficients between the TSV and silicon. Such mechanical stress influences the movement of metal atoms, leading to voids in circuits. The stress migration effect can be modeled as [41]

$$MTTF_{SM} = \frac{1}{\nu_{SM}} = B_0 \sigma^{-n_a} e^{\frac{E_a}{kT_a}} \tag{5}$$

where MTTF$_{SM}$ is the MTTF due to stress migration. $\nu_{SM}$ is the average lifetime consumption rate of stress migration. $B_0$, $E_a$, and $n_a$ are material-dependent constants, and k is the Boltzmann constant. For the stress migration effect of 3-D ICs, we use the runtime maximum stress $\sigma$ around the TSV and the average temperature $T_a$ of the chip following the RAMP model [22]. In STREAM, $\sigma$ will be given by the ANN stress model, and $T_a$ is obtained from the on-chip temperature sensors.

For the two reliability models above, we use explicit expression of stress instead of the temperature difference used in RAMP model, thanks to the ANN stress model in STREAM. This improves the fidelity of the reliability model and the management method.

3) Unified Reliability Model: In order to unify the MTTFs introduced above, we use the industry standard sum-of-failure-rates (SOFR) model [22] to get the unified MTTF as

$$MTTF_{IC} = \frac{1}{\nu_{IC}} = \frac{1}{\nu_{SM} + \nu_{TC}} \tag{6}$$

where $\nu_{IC}$ is the average lifetime consumption rate of the IC considering all failure effects.
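As a minimal sketch of how (4), (5), and (6) combine, the following Python functions compute the two lifetime consumption rates and the unified MTTF. The function names and every numeric constant below are placeholders chosen for illustration; they are not the calibrated values used in the paper, and the resulting MTTF has no physical meaning beyond showing the trend.

```python
import math

BOLTZMANN_EV = 8.617333262e-5  # Boltzmann constant in eV/K


def rate_thermal_cycling(sigma, sigma0, A0, qc):
    """nu_TC = 1 / MTTF_TC, with MTTF_TC from (4). Requires sigma > sigma0."""
    return (sigma - sigma0) ** qc / A0


def rate_stress_migration(sigma, Ta, B0, na, Ea):
    """nu_SM = 1 / MTTF_SM, with MTTF_SM from (5). Ta in kelvin, Ea in eV."""
    return sigma ** na * math.exp(-Ea / (BOLTZMANN_EV * Ta)) / B0


def mttf_unified(nu_sm, nu_tc):
    """Unified MTTF from the SOFR model in (6)."""
    return 1.0 / (nu_sm + nu_tc)


# Placeholder constants (assumptions for illustration only).
A0, qc = 1.0e6, 2.35
B0, na, Ea = 1.0e-3, 2.5, 0.9
sigma0 = 112.0   # stress at ambient temperature (MPa), hypothetical
Ta = 350.0       # average chip temperature (K), hypothetical

nu_tc = rate_thermal_cycling(150.0, sigma0, A0, qc)
nu_sm = rate_stress_migration(150.0, Ta, B0, na, Ea)
mttf_ic = mttf_unified(nu_sm, nu_tc)

# The same chip under a higher maximum stress.
mttf_high_stress = mttf_unified(
    rate_stress_migration(200.0, Ta, B0, na, Ea),
    rate_thermal_cycling(200.0, sigma0, A0, qc),
)
```

Because the rates add, the unified MTTF is always shorter than the MTTF of either mechanism alone, and a larger maximum stress shortens it further.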

The SOFR model is an industry standard, practical model that balances accuracy and computing overhead for runtime reliability management [42]. It is based on two assumptions. The first assumption is that the IC is a series failure system. This means any failure mechanism will make the entire system fail. The second assumption is that each failure rate is constant. This means every failure mechanism has an exponential lifetime distribution.
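As a short worked derivation (a standard textbook result, not taken from the paper), the two assumptions above, together with independence of the failure mechanisms, give (6) directly. Here $R(t)$, a symbol introduced only for this illustration, denotes the probability that the system survives to time $t$:

$$\begin{aligned} R_{IC}(t) &= R_{SM}(t)\,R_{TC}(t) = e^{-\nu_{SM} t}\,e^{-\nu_{TC} t} = e^{-(\nu_{SM}+\nu_{TC})t} \\ MTTF_{IC} &= \int_0^\infty R_{IC}(t)\,dt = \frac{1}{\nu_{SM}+\nu_{TC}} \end{aligned}$$

The product in the first line is where the series-system assumption enters, and the exponential form is where the constant failure rate assumption enters.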

Please note that although both stress migration and thermal cycling depend on temperature and stress, system failure events caused by these two failure mechanisms are still assumed to be independent in the SOFR MTTF computing process. This is because temperature and stress are viewed as *constant* parameters in MTTF, representing the specific condition under which the MTTF is computed. Let us denote the probability of system failure by stress migration as $P(X_{SM})$ and that by thermal cycling as $P(X_{TC})$. Then, in a specific temperature-stress condition, the event of system failure by stress migration does not change the probability of system failure by thermal cycling, i.e., $P(X_{TC}|X_{SM}) = P(X_{TC})$, indicating the two failures are independent.

It is also noted that there are other failure mechanisms besides stress migration and thermal cycling in 3-D ICs, such as EM, negative-bias temperature instability, time-dependent dielectric breakdown, etc. In order to adapt them into STREAM, generally, we can compute the system MTTF individually for each failure mechanism using existing methods. Then, the MTTFs of all failure mechanisms can be unified using the industrial standard SOFR model in the same way as we unify the two stress-aware failure mechanisms in (6). For more discussions on this topic, please refer to the Ph.D. thesis by Srinivasan [22], who proposed RAMP [22], [42] to specially deal with this problem.

# IV. NEW STRESS AND THERMAL AWARE RELIABILITY MANAGEMENT METHOD FOR 3-D ICS

In this section, a new 3-D IC stress and thermal aware reliability management method, STREAM, is presented. In Section IV-A, we first show the basic flow of STREAM and briefly discuss the functions of its major components, including the lifetime estimator and the lifetime MPC. Then, we present the important techniques and major components of STREAM in detail. Specifically, in Section IV-B, we introduce the ANN-based stress model for 3-D ICs, which is an important component of the lifetime estimator. Then, we show the new lifetime estimator with lifetime banking technique in Section IV-C. Next, the new lifetime MPC is presented in Section IV-D. Finally, Section IV-E summarizes the flow of STREAM.

## A. Basic Flow of STREAM

The goal of STREAM is to compute the future power suggestion for the 3-D IC plant, which optimizes the performance of the 3-D IC system without violating the designed lifetime of the chip.

To achieve this goal, it is vital to estimate the lifetime information of the system from its stress and temperature state. Although the temperature state can be obtained easily, it is difficult to obtain the stress information at runtime to estimate the system lifetime for reliability management. As a result, we first propose a new ANN based stress model which estimates the maximum stress around each TSV by taking temperature as input.

The lifetime of the system can be estimated with the stress information obtained by the ANN stress model. But since existing methods only consider current lifetime information which pessimistically limits the performance of the 3-D IC system, we further propose the new lifetime estimator with lifetime banking technique. With the new lifetime estimator, the 3-D IC system will deposit lifetime when the system has light task loads, and will consume the lifetime deposit (LTDP) to boost performance with heavy task loads.

Finally, we need to determine the reliability management actions that fully utilize the LTDP provided by the lifetime estimator to boost system performance. Traditional methods use primitive control schemes to guide the management actions, which do not ensure control optimality. In STREAM, we propose to use the lifetime MPC to quantitatively compute the proper future power suggestions. Since the lifetime MPC uses MPC, which considers not only the current state of the 3-D IC system but also its future state, it leads to high system performance by fully utilizing the LTDP.

Fig. 3. Stress and thermal aware reliability management flow. The detailed structures of the lifetime estimator and lifetime MPC are given further in Figs. 8 and 9, respectively.

The basic flow of STREAM is given in Fig. 3. The major components of STREAM include an ANN based stress model, a lifetime estimator with lifetime banking technique, and a lifetime MPC. Integrated with the ANN-based stress model, the lifetime estimator takes the 3-D IC temperature information to quantify the accurate lifetime information using lifetime banking technique. Then, based on such lifetime information and 3-D IC temperature information, lifetime MPC is used to compute the power suggestion for the future management cycle, which concludes the reliability management loop.

## B. ANN-Based Stress Model for 3-D ICs

1) Motivation of Using ANN-Based Stress Model: The main problem in using the reliability models presented in Section III-B is how to obtain accurate stress information $\sigma$. Conventionally, stress information can be estimated by FEMs [19], [20]. But FEM cannot be used for runtime thermal management due to its large computing cost. Previous works [21], [43] use the temperature difference to approximate the complicated $\sigma$, which can introduce large error. In this paper, we use an ANN [44] to perform fast and accurate stress analysis for reliability management.

A widely used TSV structure with full copper filling and a silicon dioxide liner between copper and silicon is shown in Fig. 2. Although TSVs are a necessary structure in 3-D ICs, they lead to a thermally induced stress problem, which harms the reliability of the chip. There are two major reasons for the problem. First, the TSV has a much higher thermal conductivity than the silicon wafer. As a result, a large temperature gradient may appear in the area close to the TSV, which usually leads to large thermal stress. Second, the mismatch in the coefficient of thermal expansion (CTE) also brings a significant stress increase. Specifically, copper's CTE $(17 \times 10^{-6} \text{ K}^{-1})$ is nearly seven times the CTE of silicon $(2.56 \times 10^{-6} \text{ K}^{-1})$. For the same temperature increase, copper expands much more than silicon, resulting in considerable stress.
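To put the CTE figures quoted above in perspective, a small worked calculation helps. The temperature rise $\Delta T = 50$ K is an assumed value chosen only for illustration, and $\Delta\varepsilon$ denotes the unconstrained (free expansion) strain mismatch, a symbol introduced here:

$$\frac{\alpha_{Cu}}{\alpha_{Si}} = \frac{17 \times 10^{-6}}{2.56 \times 10^{-6}} \approx 6.64, \qquad \Delta\varepsilon = (\alpha_{Cu} - \alpha_{Si})\,\Delta T = 14.44 \times 10^{-6}\ \text{K}^{-1} \times 50\ \text{K} \approx 7.2 \times 10^{-4}$$

The actual stress around a TSV depends on geometry, the liner, and the surrounding constraints, so this number only indicates the size of the mismatch that the structure has to absorb.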

One example of temperature and stress distributions of a 3-D IC simulated using the FEM tool COMSOL is given in Fig. 4. The experimental settings follow the work in [18]. Please note that in Fig. 4(c), the whole range of stress is actually from 112 to 834 MPa, but we choose to display only the range from 130 to 150 MPa to make the stress distribution more viewable. We can see that the stress distribution in 3-D IC is largely influenced by both TSV distribution and temperature distribution. As a result, we build an ANN stress model to capture such complex effects at runtime.

2) Structure and Training of the ANN Stress Model: In machine learning and cognitive science, ANNs are a family of statistical learning models inspired by biological neural networks to estimate or approximate functions that depend on a large number of inputs. ANNs are generally presented as systems of interconnected "neurons," which connect and send messages to each other.

The basic structure of the ANN stress model used in this paper is shown in Fig. 5. The input of this model is the temperature distribution around a TSV in the 3-D IC, denoted as $\{T_1, T_2, \ldots, T_{n_g}\}$, where $n_g$ is the number of grids around each TSV. The maximum stress around a TSV is chosen as the output stress information $\sigma$ to save the computing cost of the ANN model, because it is the most important quantity for reliability management. Please note that outputs other than the maximum stress, or additional outputs, can also be implemented if necessary. Each circle in the figure is a neuron. All neurons in our model have the same function structure

$$out = \sum_{i=1}^{n_i} I_i w_i \tag{7}$$

where the terms  $I_i$  ( $T_i$  for the neuron in the input layer) and  $w_i$ ,  $i = 1, 2, ..., n_i$  are inputs and weights of the neuron, out is the output of the neuron ( $\sigma$  for the neuron in the output layer). The values of the weights  $w_i$  need to be determined in the training process to make the ANN work as desired, which will be shown later. There are one input layer, one output layer, and usually one or several hidden layers in this ANN stress model. With one or more hidden layers, the network is able to model higher-order statistical properties.
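The neuron function in (7) and the layered structure described above can be sketched as a plain forward pass. This is only an illustration under stated assumptions: the weights, layer sizes, and temperature values are made up, the function names are hypothetical, and the optional `activation` argument is an addition of this sketch (the text specifies only the weighted sum).

```python
def neuron(inputs, weights):
    """Weighted sum of a neuron's inputs, as in (7)."""
    return sum(i * w for i, w in zip(inputs, weights))


def forward(temps, hidden_weights, output_weights, activation=None):
    """Forward pass through one hidden layer and a single output neuron.

    temps:          temperatures T_1..T_ng around one TSV
    hidden_weights: one weight list per hidden neuron
    output_weights: one weight per hidden neuron
    activation:     optional nonlinearity applied to hidden outputs
                    (an assumption of this sketch, not stated in the text)
    """
    hidden = [neuron(temps, w) for w in hidden_weights]
    if activation is not None:
        hidden = [activation(h) for h in hidden]
    return neuron(hidden, output_weights)


# Illustrative values only: 4 grid temperatures (K), 3 hidden neurons.
temps = [350.0, 352.0, 355.0, 351.0]
hidden_weights = [
    [0.25, 0.25, 0.25, 0.25],   # average temperature
    [0.0, 0.0, 1.0, 0.0],       # temperature of one grid
    [1.0, -1.0, 0.0, 0.0],      # a temperature difference
]
output_weights = [1.0, 0.5, 2.0]
sigma = forward(temps, hidden_weights, output_weights)
```

In a trained model the weights come from the training process rather than being written by hand; the hand-picked rows here merely show that a hidden neuron can represent quantities such as an average or a local temperature difference.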

Before being applied in STREAM, the ANN stress model needs to be trained using temperature input and stress output data (called training samples) specially generated for training purposes. The goal of training is to find the optimal weights [$w_i$ in (7)] in the ANN model, which lead to good output accuracy. In this paper, we use the backpropagation (BP) method to train our ANN stress model, as it is a common method of training ANNs used in conjunction with an optimization method such as gradient descent. The BP method calculates the gradient of a loss function with respect to all the weights in the network. The gradient is fed to the optimization method, which in turn uses it to update the weights, in an attempt to minimize the loss function. More specifically, before training, the weights of the ANN model are set randomly. They are then updated by learning from the training samples. Each sample, taken from the system to be modeled, is an input-output set $\{T_1, T_2, \ldots, T_{n_g}, \sigma_s\}$, where $\{T_1, T_2, \ldots, T_{n_g}\}$ are the inputs and $\sigma_s$ is the corresponding output of the system to be modeled. Since the ANN is not exactly the same as the system to be modeled, feeding the sample inputs $\{T_1, T_2, \dots, T_{n_g}\}$ to the ANN generates an output $\sigma_a$ which differs from $\sigma_s$ (especially when the weights are initially random). During the training process, the weights are tuned by optimization to make $\sigma_a$ close to $\sigma_s$. Considering the large number of samples and the advantages of resilient backpropagation (RPROP) in nonlinear mapping, processing efficiency, and low resource usage [45], we select RPROP as the training algorithm. For a comprehensive introduction to the ANN training process, please refer to the book [46].

Fig. 4. One example of temperature and stress distributions of a 3-D IC simulated in COMSOL. (a) Die structure of a 32-core (16 cores on each layer) 3-D IC microprocessor with two die layers and 144 TSVs. Package structure is not shown in the figure. (b) Temperature (K) distribution of the bottom surface of the 3-D IC. (c) Thermal induced Von Mises stress (MPa) distribution of the bottom surface of the 3-D IC. Stress display range is manually constrained from 130 to 150 MPa for better demonstration.
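The training loop described above can be illustrated with a simplified RPROP variant, in which each weight has its own step size that grows while the gradient sign stays the same and shrinks when it flips. This sketch is not the authors' training code: it fits a single linear neuron to synthetic samples, and the function name `rprop_train`, the hyperparameters, and the data are all assumptions made for illustration.

```python
import random


def rprop_train(samples, n_inputs, epochs=300, eta_plus=1.2, eta_minus=0.5,
                step_init=0.01, step_min=1e-6, step_max=1.0, seed=0):
    """Fit weights of one linear neuron with sign-based RPROP updates.

    samples: list of (inputs, target) pairs.
    Returns the weights and the loss recorded at the start of each epoch.
    """
    rng = random.Random(seed)
    w = [rng.uniform(-0.1, 0.1) for _ in range(n_inputs)]  # random start
    step = [step_init] * n_inputs
    prev_grad = [0.0] * n_inputs
    history = []
    for _ in range(epochs):
        # Full-batch gradient of the squared-error loss.
        grad = [0.0] * n_inputs
        loss = 0.0
        for x, target in samples:
            err = sum(wi * xi for wi, xi in zip(w, x)) - target
            loss += 0.5 * err * err
            for i in range(n_inputs):
                grad[i] += err * x[i]
        history.append(loss)
        for i in range(n_inputs):
            s = grad[i] * prev_grad[i]
            if s > 0:       # same sign as before: accelerate
                step[i] = min(step[i] * eta_plus, step_max)
            elif s < 0:     # sign flipped: we overshot, slow down
                step[i] = max(step[i] * eta_minus, step_min)
                grad[i] = 0.0   # skip the update right after a flip
            # Only the sign of the gradient sets the update direction.
            if grad[i] > 0:
                w[i] -= step[i]
            elif grad[i] < 0:
                w[i] += step[i]
            prev_grad[i] = grad[i]
    return w, history


# Synthetic samples from a hypothetical linear "system to be modeled".
data_rng = random.Random(1)
true_w = [2.0, -1.0, 0.5]
samples = []
for _ in range(20):
    x = [data_rng.random() for _ in true_w]
    samples.append((x, sum(wi * xi for wi, xi in zip(true_w, x))))

weights, history = rprop_train(samples, n_inputs=3)
```

Because only the gradient sign is used, the update size is insensitive to the scale of the loss, which is one reason RPROP is convenient when inputs such as temperatures and outputs such as stress have very different magnitudes.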

Please note that the trained ANN model works only on the specific 3-D IC it is trained on. If an ANN model trained for one 3-D IC design is applied to another 3-D IC with more layers or with a different TSV distribution, it may show large error. Although it is possible to train a universal ANN model which works for many different 3-D ICs using samples from all of them, training such a universal ANN model is not recommended, for two reasons. First, this ANN model would have to be large, in order to contain all the information learned from different cases and ensure universal accuracy. Such a large ANN model leads to large stress estimation computing overhead, which is unacceptable for a runtime algorithm. Second, training a specific ANN model for each 3-D IC design is easier than training the universal ANN model, since only sampling data for this specific 3-D IC design is required. It also leads to a more accurate and compact ANN model, because the model only needs to consider this specific 3-D IC design.

3) Advantage of Using the ANN Stress Model: This ANN stress model is able to estimate important stress information at extremely fast speed with good accuracy, which will be shown in the experiments. As a result, it is able to provide explicit stress information to greatly improve the management quality and enable the lifetime banking technique of STREAM. To see this more clearly, we make a simple comparison here with the traditional methods without such explicit stress information. As discussed before, traditional methods use temperature difference as the stress approximation to avoid the high overhead of explicit stress computation. In Fig. 6, we plot the MTTF results of using the explicit stress as in STREAM and



Fig. 5. ANN stress model used in STREAM with one hidden layer.



Fig. 6. MTTF values without considering explicit stress and considering explicit stress.

using the stress approximation as in existing methods [22]. The MTTF results are computed considering both thermal cycling and stress migration effects (as introduced in Section III-B) using the Monte Carlo method [22] to ensure accuracy. From the figure, we can see that the MTTF computed with the stress approximation is consistently larger than that computed with the explicit stress information. This means the reliability estimation may be over-optimistic without explicit stress information. Due to such over-optimism, the runtime reliability management method may falsely compute a "proper" power budget which is excessive in reality, putting the chip in a dangerous state.

# C. New Lifetime Estimator With Lifetime Banking Technique

Setting a temperature threshold in thermal/reliability management is pessimistic and leads to poor system performance. Actually, when the IC chip is operating at low temperature

state, its aging is reduced and its lifetime is increased. By utilizing such lifetime increasing, system performance can be boosted for a period of time without shortening the designed lifetime, because the reduced lifetime by violating the temperature threshold can be compensated by the increased lifetime in low temperature. Based on this idea, lifetime banking technique has been proposed to boost system performance by quantifying and balancing the lifetime saving (also called LTDP) and lifetime reduction [28].

However, using lifetime banking technique for runtime reliability management of 3-D ICs is challenging, because it is difficult to quantify the lifetime of 3-D IC with TSV induced stress at runtime. Thanks to the accurate stress information provided by the ANN stress model presented previously, we are able to compute the lifetime information by feeding both explicit thermal and stress information to the MTTF equations shown in Section III-B. Now we present the basic theory and structure of the new lifetime estimator with lifetime banking technique.

An IC chip has an expected lifetime under a particular operating condition (with a certain temperature distribution, stress distribution, etc.). For example, a 3-D IC may have an MTTF of ten years at 90  $^{\circ}$ C uniform temperature distribution and the corresponding stress distribution. In order to perform lifetime banking, we use the lifetime consumption rate, rather than the MTTF, to compute the LTDP or its consumption. The average lifetime consumption rate  $v$  is simply the inverse of the MTTF:

$$v = \frac{1}{\text{MTTF}}. \tag{8}$$

So, for the same example given previously, we can also say this 3-D IC has an average lifetime consumption rate v = 1/10 at 90 °C uniform temperature distribution and the corresponding stress distribution.

Similarly, we can also transform the designed 3-D IC lifetime, denoted as MTTF<sub>th</sub>, into the designed 3-D IC lifetime consumption rate, denoted as  $v_{th}$ . For example, if the 3-D IC is designed to have an MTTF<sub>th</sub> of 15 years, then  $v_{th} = 1/15$ . According to the reliability banking theory, if the realtime 3-D IC lifetime consumption rate is lower than the designed one  $v_{th}$ , the true lifetime of the 3-D IC is increased, and vice versa. This property is important for reliability management: performance boost which shortens the designed lifetime will be allowed if the lifetime was previously increased. The goal of our new lifetime estimator is to calculate how much lifetime has been previously increased, so that the reliability management algorithm can quantify the suitable future power budget for the performance boost.

In our lifetime estimator during the runtime reliability management, we can first calculate the real lifetime consumption rate v(t) at each management step, using both the temperature and stress information. Then, in order to know how much lifetime has been increased (or reduced) because of this v(t), the LTDP is defined using v(t) and  $v_{th}$  as [28]

$$LTDP = \int \left[ v_{th} - v(t) \right] dt. \tag{9}$$

When the 3-D IC is running at a safe state (low temperature, small stress, etc.), v(t) will be less than  $v_{th}$ , and LTDP



Fig. 7. Mechanism of lifetime banking technology.



Fig. 8. Flow of the novel lifetime estimator.

will increase. This mechanism is illustrated in Fig. 7. The shaded area filled with slash (from 0 to 2 s) represents the positive temporal LTDP when the 3-D IC is running at a safe state. Such positive LTDP can be used to boost system performance in the future. On the other hand, when v(t) is larger than  $v_{th}$ , the LTDP will decrease, because the temporal LTDP will become negative, which is shown as the shaded area with backslash (from 2 s to 5 s). In traditional 3-D IC reliability management, such negative temporal LTDP is not allowed because a strict temperature threshold is set. But in STREAM, such temperature and stress violation will be allowed as long as the total deposit lifetime is positive (LTDP > 0) to boost system performance. However, when the total LTDP is consumed to be zero which means the LTDP balance point is reached (i.e., the slash shaded area equals the backslash shaded area, at 5 s in the figure), the reliability management method will constrain system performance to make sure the total LTDP will not become negative. We remark that the lifetime estimator is only responsible for computing the lifetime information LTDP, and the detailed management step will be shown next in Section IV-D.
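The banking mechanism of Fig. 7 can be illustrated with a minimal sketch that integrates (9) using a rectangle rule. The rate values, the one-second sampling, and the names `ltdp_trace` and `boost_allowed` are assumptions made for illustration only; they are chosen so that the deposit built up in the first 2 s is exactly consumed by 5 s, as in the figure.

```python
def ltdp_trace(rates, v_th, dt, ltdp0=0.0):
    """Accumulate LTDP = integral of (v_th - v(t)) dt with a rectangle rule."""
    ltdp = ltdp0
    trace = []
    for v in rates:
        ltdp += (v_th - v) * dt   # positive when v < v_th (deposit), negative otherwise
        trace.append(ltdp)
    return trace


# Hypothetical consumption rates (arbitrary units), one sample per second.
V_TH = 0.10
rates = [0.04, 0.04, 0.14, 0.14, 0.14]   # 2 s in a safe state, then 3 s of boost
trace = ltdp_trace(rates, V_TH, dt=1.0)

# Boost stays permitted only while the deposit is still positive.
boost_allowed = [ltdp > 1e-12 for ltdp in trace]
```

In this toy trace the deposit peaks after the safe interval and returns to zero at the balance point, at which stage the management method would have to constrain performance again.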

Fig. 8 summarizes the flow of the lifetime estimator. First, the ANN stress model (denoted as "ANN" in the figure) is used to compute the stress information of the 3-D IC using temperature information. Then, we feed both the temperature and stress information to the reliability models (denoted as "reliability models" in the figure) to compute the average lifetime consumption rate v(t) of the 3-D IC under the current temperature and stress state. Finally, the LTDP is computed by the "lifetime banking" component and output to the lifetime MPC.

# D. New Lifetime Model Predictive Control for 3-D IC Reliability Management

Now that we have the LTDP provided by the lifetime estimator, we still have to quantify the power suggestions for the 3-D IC system, so that performance can be boosted by properly consuming the LTDP. In STREAM, we propose a specially

designed MPC method, called lifetime MPC, to achieve such purpose, as shown in Fig. 3.

MPC is a widely used advanced control scheme which considers not only the current but also the future. Standard MPC has been applied to the dynamic thermal management problem for 2-D ICs because of its smooth and accurate temperature control ability [47]–[49]. However, applying MPC directly in the 3-D IC reliability problem is difficult due to the complex and nonlinear relationship between the control target (lifetime) and plant input (power of 3-D IC). In this paper, lifetime MPC is introduced to perform 3-D IC reliability management.

In this part, Section IV-D1 introduces the target temperature estimator which transforms the lifetime information into the target temperature to enable the model-based control. Then, we present the lifetime MPC which computes the power suggestions for reliability management with performance boost ability in Section IV-D2.

1) Target Temperature Estimator for Lifetime Control: As discussed above, model based control schemes like MPC have been directly applied to control the temperature of 2-D IC systems [47]–[49]. This is because the relationship between temperature and power is linked by a linear thermal model as shown in (1) or (2). With a target temperature distribution and current temperature status of the system provided, MPC is able to compute the future power distribution which leads to such target temperature in the future. However, if we want to boost the performance of the 3-D IC system with lifetime banking strategy, we need to control the lifetime instead of the temperature. In other words, we want to compute the future power distribution, which leads to a given lifetime target. Unfortunately, this is very difficult because the relationship between lifetime and temperature is highly nonlinear, especially with the existence of the TSV structure in 3-D IC.

In order to enable MPC in lifetime control, we need to transform the lifetime information into temperature information. To be specific, with the LTDP information provided at current time by the lifetime estimator, we want to know the corresponding future temperature distribution of the 3-D IC that consumes such LTDP. Then, we can use such future temperature distribution as the target temperature in MPC to compute the future power distribution. We developed the *target temperature estimator* to accomplish such goal.

The task of the target temperature estimator is to transform LTDP into the target temperature distribution. Although the target temperature estimator can be formed in many ways, we build it as a lookup table for simplicity in this paper. In order to build such a lookup table, the following experiments are performed offline. First, we feed different temperature distributions (the *i*th distribution is denoted as  $Y_i$ ) into the ANN stress model to compute the stress information for these temperature distributions. Next, we compute the average lifetime consuming rate ( $v_i$  for the *i*th temperature distribution  $Y_i$ ) for these different temperature distributions with both temperature and stress information. Then, we need to compute the LTDP consumptions for different temperature distributions. For each reliability management time interval with the length  $\Delta t$ , we want to consume all the LTDP stored to maximize system



Fig. 9. Flow of lifetime MPC for 3-D ICs' reliability management.

performance. According to the LTDP definition in (9), we have the following equation:

$$LTDP_i = (v_{th} - v_i)\Delta t \tag{10}$$

where LTDP<sub>i</sub> is the LTDP consumed (with negative sign) by the *i*th temperature distribution  $Y_i$  in time period  $\Delta t$ . Finally, we store the target temperature distributions ( $Y_i$  for the *i*th target temperature distributions) and the corresponding LTDP consumptions (LTDP<sub>i</sub> for the *i*th target temperature distributions) into the target temperature estimator lookup table.

The function of the target temperature estimator during the reliability management process is shown in Fig. 9. It receives lifetime information LTDP as the input. Then, it will search for LTDP consumption with the nearest absolute value stored in the lookup table (assume it is LTDP $_j$ ), and then output the corresponding target temperature distribution ( $Y_j$ ) to the MPC. The target temperature distribution serves as the maximum allowed temperature distribution for 3-D IC without LTDP overdraft for the future reliability management time interval.
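A minimal sketch of such a lookup table is given below. The rate model `toy_rate`, the candidate distributions, and the function names are hypothetical stand-ins (the paper computes $v_i$ from the ANN stress model and the MTTF equations); keeping only entries that consume LTDP is also an assumption of this sketch.

```python
def build_table(distributions, rate_of, v_th, dt):
    """Offline: pair each candidate distribution Y_i with LTDP_i from (10)."""
    table = []
    for Y in distributions:
        ltdp_i = (v_th - rate_of(Y)) * dt
        if ltdp_i <= 0.0:          # assumption: store only consuming entries
            table.append((ltdp_i, Y))
    return table


def target_temperature(table, ltdp):
    """Online: return the Y_j whose |LTDP_j| is nearest to the available LTDP."""
    _, Y = min(table, key=lambda entry: abs(abs(entry[0]) - ltdp))
    return Y


V_TH = 1.0 / 15.0


def toy_rate(Y):
    """Hypothetical rate model: doubles for every 10 degrees above 90 C."""
    return V_TH * 2.0 ** ((max(Y) - 90.0) / 10.0)


candidates = [[90.0, 88.0], [100.0, 97.0], [110.0, 105.0]]
table = build_table(candidates, toy_rate, V_TH, dt=1.0)
```

With no deposit available the search returns the threshold-like entry, and a larger deposit selects a hotter target. Note that a plain nearest-value search can pick an entry that consumes slightly more than the deposit; a stricter variant would restrict the search to entries with $|LTDP_j| \le LTDP$.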

2) Lifetime Model Predictive Control: In the previous part, we have introduced the target temperature estimator which is able to transform the LTDP into target temperature distribution. In this part, we demonstrate the lifetime MPC method which computes the *future-aware* power distribution suggestion for the 3-D IC to achieve the target temperature distribution. Then, the computed power distribution suggestion will serve as the maximum allowed power distribution for the reliability management with dynamic voltage and frequency scaling (DVFS) management actions.

By using the thermal model in the form of (2), the goal of MPC is to compute the power suggestion P(k) for the future reliability management interval in order to track the maximum allowed temperature distribution estimated by the target temperature estimator. Now, we briefly introduce the process of MPC. More detailed presentations of MPC can be found in [49] and [50].

First, we define the target temperature distributions over several time steps into the future in a vector form as

$$\mathcal{Y}_{tg} = \left[Y_j^T, Y_{th}^T, \dots, Y_{th}^T\right]^T \in \mathbb{R}^{mnN_p \times 1}.$$

In this vector,  $Y_j \in \mathbb{R}^{mn \times 1}$  is the target temperature distribution given by the target temperature estimator (assume the jth temperature distribution in the lookup table is picked), and  $Y_{\rm th} \in \mathbb{R}^{mn \times 1}$  is the threshold temperature distribution which consumes exactly zero LTDP.  $N_p$  is the number of time steps, from the current time into the future, over which temperatures are predicted, and is called the prediction horizon.

In order to keep the core temperatures tracking the temperature goal in the prediction horizon, at a time k, the future control trajectory (which is actually unknown and needs to be computed in the end) is introduced as

$$\Delta \mathcal{P}_k = [\Delta P(k), \Delta P(k+1), \dots, \Delta P(k+N_c-1)]^T$$

where  $\Delta P(k) = P(k) - P(k-1)$  and  $N_c$  is called the control horizon.

The prediction of core temperatures is defined as

$$\mathcal{Y}_k = \left[ Y(k+1|k)^T, Y(k+2|k)^T, \dots, Y(k+N_p|k)^T \right]^T$$

where Y(k + j|k) is the predicted core temperatures at time (k + j) using information of current time k.

 $\mathcal{Y}_k$  can be calculated by assuming  $\Delta \mathcal{P}_k$  is known, using

$$\mathcal{Y}_k = V\hat{T}(k) + \Phi \Delta \mathcal{P}_k \tag{11}$$

where  $\hat{T}(k) = [\Delta T(k)^T, Y(k)^T]^T$  with  $\Delta T(k) = T(k) - T(k-1)$ , V and  $\Phi$  are known matrices formed by thermal model matrices in (2), and their detailed structures are not given here due to page limitation.

Next, we would like to calculate the power, which minimizes the difference between temperatures  $\mathcal{Y}_k$  generated by such power and the provided target temperatures  $\mathcal{Y}_{tg}$ . We can first introduce the measurement of such difference as

$$F = (\mathcal{Y}_{tg} - \mathcal{Y}_k)^T (\mathcal{Y}_{tg} - \mathcal{Y}_k) \tag{12}$$

and the optimal power distribution is the one leading to F = 0 (or  $\mathcal{Y}_k = \mathcal{Y}_{tg}$ ).

As a result, optimization is performed to minimize (12) by taking the first derivative of (12) with respect to  $\Delta \mathcal{P}_k$  and making it equal to zero. The solution of  $\Delta \mathcal{P}_k$  is

$$\Delta \mathcal{P}_k = (\Phi^T \Phi)^{-1} \Phi^T (\mathcal{Y}_{tg} - V \hat{T}(k)). \tag{13}$$
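The intermediate steps behind (13) can be written out as follows, where $E_k$ is a shorthand introduced here for the tracking error of the free response. Substituting (11) into (12) gives

$$E_k = \mathcal{Y}_{tg} - V\hat{T}(k), \qquad F = (E_k - \Phi \Delta \mathcal{P}_k)^T (E_k - \Phi \Delta \mathcal{P}_k).$$

Setting the gradient with respect to $\Delta \mathcal{P}_k$ to zero yields the normal equations

$$\frac{\partial F}{\partial \Delta \mathcal{P}_k} = -2\Phi^T (E_k - \Phi \Delta \mathcal{P}_k) = 0 \;\Longrightarrow\; \Phi^T \Phi \, \Delta \mathcal{P}_k = \Phi^T E_k,$$

which reduce to (13) when $\Phi^T \Phi$ is invertible, i.e., when $\Phi$ has full column rank. In general this least-squares solution minimizes $F$; it reaches $F = 0$ only if $E_k$ lies in the column space of $\Phi$.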

At each MPC time k, we only use the first computed control signal  $\Delta P(k)$  from  $\Delta \mathcal{P}_k$  and update the power distribution as

$$\bar{P}(k) \leftarrow P(k-1) + \Delta P(k) \tag{14}$$

where  $\bar{P}(k)$  is the final computed power distribution. If such computed power is actually consumed, the resulting temperature distribution of 3-D IC would track the target temperature distribution given by the target temperature estimator. In other words, the computed power distribution is the *maximum power distribution allowed which just consumes all the LTDP*. Please note that with such maximum power distribution, temperature is allowed to exceed the threshold temperature  $Y_{th}$ , which enables performance boost of 3-D IC.
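To make the computation concrete, the following minimal sketch applies the least-squares solution (13) to a single-node plant. The scalar model $y(k+1) = a\,y(k) + b\,P(k)$, the control horizon $N_c = 1$, and the function name are assumptions of this sketch (the paper uses the full thermal model (2)); temperatures can be read as rises above ambient. The power update adds the increment to the previously applied power, following the definition $\Delta P(k) = P(k) - P(k-1)$.

```python
def lifetime_mpc_step(a, b, y_now, y_prev, p_prev, targets):
    """One MPC update for the scalar plant y(k+1) = a*y(k) + b*P(k), N_c = 1.

    Returns (delta_p, p_new) for a single control move Delta P(k).
    """
    dy = y_now - y_prev        # Delta T(k) in the augmented state
    free = []                  # entries of V * T_hat(k): prediction with Delta P = 0
    phi = []                   # single column of Phi: step-response coefficients
    s = 0.0                    # a + a^2 + ... + a^i
    g = 0.0                    # 1 + a + ... + a^(i-1)
    for i in range(1, len(targets) + 1):
        s += a ** i
        g += a ** (i - 1)
        free.append(y_now + s * dy)
        phi.append(b * g)
    # With one column, (Phi^T Phi)^-1 Phi^T (Y_tg - V T_hat) is a ratio of dot products.
    num = sum(f * (t - r) for f, t, r in zip(phi, targets, free))
    den = sum(f * f for f in phi)
    delta_p = num / den
    return delta_p, p_prev + delta_p


# Hypothetical plant at steady state (y = 20 for P = 10), asked to reach 30 next step.
a, b = 0.5, 1.0
delta_p, p_new = lifetime_mpc_step(a, b, y_now=20.0, y_prev=20.0,
                                   p_prev=10.0, targets=[30.0])
y_next = a * 20.0 + b * p_new   # simulate the plant one step ahead

# A longer horizon whose later targets fall back to a lower threshold value.
delta_p_long, _ = lifetime_mpc_step(a, b, 20.0, 20.0, 10.0, [30.0, 25.0, 25.0])
```

With a one-step horizon the plant lands exactly on the target. With the longer horizon, the later threshold targets pull the computed increment down, which is the compromise the least-squares cost (12) encodes.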

Next, thermal management method can be performed with the maximum power distribution (power suggestion) provided by MPC in (14). DVFS can be integrated with MPC easily by adjusting the frequency and voltage level of each core to ensure the maximum power distribution will not be exceeded.
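One way to turn a per-core power suggestion into a DVFS setting is sketched below. It is only an illustration: the evenly spaced voltage/frequency table, the dynamic power model $c_{\text{eff}} V^2 f$, the value of `c_eff`, and the function names are assumptions, not details given in the paper.

```python
def dvfs_levels(n=19, v_lo=0.5, v_hi=1.5, f_lo=0.2, f_hi=2.0):
    """Evenly spaced (voltage in V, frequency in GHz) pairs; spacing is assumed."""
    return [(v_lo + (v_hi - v_lo) * i / (n - 1),
             f_lo + (f_hi - f_lo) * i / (n - 1)) for i in range(n)]


def pick_level(levels, power_budget, c_eff):
    """Index of the highest level whose modeled power c_eff * V^2 * f fits the budget."""
    best = 0  # fall back to the lowest level if nothing fits
    for idx, (v, f) in enumerate(levels):
        if c_eff * v * v * f <= power_budget:
            best = idx
    return best


levels = dvfs_levels()
level_mid = pick_level(levels, power_budget=1.0, c_eff=1.0)    # hypothetical units
level_max = pick_level(levels, power_budget=5.0, c_eff=1.0)
level_min = pick_level(levels, power_budget=0.0, c_eff=1.0)
```

Because the modeled power grows monotonically with the level index, a single pass suffices; each core would run this selection against its own entry of the suggested power distribution.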

Fig. 9 summarizes the flow of lifetime MPC for reliability management of 3-D ICs. First, the target temperature estimator takes the LTDP provided by the lifetime estimator as input, and outputs the target temperature distribution. Then, MPC takes

#### **Algorithm 1 STREAM for 3-D ICs**

- 1: Get temperature information of 3-D IC.
- 2: Compute stress information using ANN stress model with temperature information as input.
- 3: Compute MTTF with stress and temperature information as input.
- 4: Compute lifetime information LTDP with MTTF as input.
- 5: Compute target temperature distribution using target temperature estimator with LTDP as input.
- 6: Compute power suggestion for the next management cycle using MPC with target temperature distribution and current temperature information as input.
- 7: Perform standard thermal management techniques such as DVFS with power suggestion as a reference.
- 8: Go to step 1 for the next management cycle.

the target temperature distribution and the current temperature distribution of the 3-D IC as input, and outputs the maximum power distribution as the power suggestion. Finally, the 3-D IC adjusts its power distribution using a method such as DVFS to meet the suggested power requirement, which concludes the reliability management cycle.

#### E. Full Workflow of STREAM

We have shown the key components of STREAM in the previous sections. The full workflow of STREAM is summarized in Algorithm 1.
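The steps of Algorithm 1 can be arranged as one management cycle, sketched below. Every callable passed in (`read_temps`, `ann_stress`, `mttf_of`, `target_temp`, `mpc_power`, `apply_dvfs`) is a hypothetical placeholder for the corresponding STREAM component, and the stub values in the usage part are invented purely to exercise the control flow.

```python
def stream_cycle(state, read_temps, ann_stress, mttf_of,
                 target_temp, mpc_power, apply_dvfs):
    """One pass through steps 1-7 of Algorithm 1; all callables are placeholders."""
    temps = read_temps()                            # step 1: temperature information
    stress = ann_stress(temps)                      # step 2: ANN stress model
    mttf = mttf_of(temps, stress)                   # step 3: MTTF from both inputs
    rate = 1.0 / mttf                               # consumption rate, as in (8)
    state["ltdp"] += (state["v_th"] - rate) * state["dt"]   # step 4: LTDP, as in (9)
    target = target_temp(state["ltdp"])             # step 5: target temperature
    power = mpc_power(target, temps)                # step 6: MPC power suggestion
    apply_dvfs(power)                               # step 7: DVFS with that reference
    return power                                    # step 8: caller repeats next cycle


# Stub components with made-up behavior, only to show how the pieces connect.
state = {"ltdp": 0.0, "v_th": 0.1, "dt": 1.0}
applied = []
power = stream_cycle(
    state,
    read_temps=lambda: [60.0],
    ann_stress=lambda temps: [2.0 * t for t in temps],
    mttf_of=lambda temps, stress: 20.0,
    target_temp=lambda ltdp: [90.0 + 100.0 * ltdp],
    mpc_power=lambda target, temps: [target[0] - temps[0]],
    apply_dvfs=applied.append,
)
```

Passing the components in as arguments keeps the loop independent of how each block is realized, so a different stress model or controller could be swapped in without touching the cycle itself.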

# V. EXPERIMENTAL RESULTS

The experiments are performed on a laptop PC with 6-GB memory and a Core i5-3210M CPU clocked at 2.0 GHz. The new reliability management is implemented in MATLAB R2016b, based on the 3-D IC microprocessor shown in Fig. 4(a) with 32 identical Alpha 21264 cores and 144 TSVs. The dimension of this 3-D IC chip is  $10 \text{ mm} \times 10 \text{ mm} \times 0.3 \text{ mm}$ , and the chip consists of two identical layers with TSVs uniformly distributed across the layer. The TSV in the experiment has the structure shown in Fig. 2. The radii in the TSV structure are set as  $r_i = 20 \mu \text{m}$  and  $r_o = 24 \mu \text{m}$ , where  $r_i$  and  $r_o$  are shown in Fig. 2. The ambient temperature is set as 20 °C. HotSpot [51] with 3-D extension [37] is used to extract the G, C, and  $B_c$  matrices of the 3-D IC thermal model in (1). COMSOL 5.0 is used to obtain the golden stress data. We use the architectural level power estimator Wattch [52] to generate the power traces by running SPEC CPU2006 benchmarks [53] on the Alpha 21264 core. As this work focuses on reliability, we assume there is no task processing related communication or synchronization among cores in the 3-D IC microprocessor, i.e., we assume one task is assigned to one core. In addition, we assume there are no memory bandwidth related problems in the multicore system due to the limitation of the experiment platform. We remark that further system performance optimization can be studied as future work with more realistic architectural platform settings. We set 19 voltage/frequency levels (from 0.5 V @ 200 MHz to 1.5 V @ 2 GHz) for DVFS in this experiment. The penalty to change







Fig. 10. Accuracy verification results of the ANN stress model for common 3-D IC running conditions. "SPEC" denotes the SPEC benchmark free run case. "Synthetic" denotes synthetic workload case. "Target" represents the stress condition around the lifetime MPC target. "Light" means the 3-D IC has very light task loads.

the DVFS level is 10  $\mu$ s, during which the pipeline is stalled, by following the settings in [28].

In order to demonstrate the improvements over existing work, we compare STREAM with the state-of-the-art 3-D IC reliability management method [21], which is the most advanced 3-D IC reliability management method to date. It shares the same experimental settings as STREAM. For simplicity, we refer to [21] as the existing method in the experiments.

First of all, the accuracy of the fast explicit stress estimation will be verified in Section V-A. Next, we will show the improvements of using MPC in 3-D IC reliability management in Section V-B. Then, the benefits of introducing lifetime estimator with lifetime banking technique are demonstrated in Section V-C. Last, the overall reliability and performance comparison against the state-of-the-art 3-D IC reliability management method will be given in Section V-D.

# A. Validation of the ANN Stress Model

First, we perform simulations using COMSOL with different power distributions covering a large variety of the 3-D IC operating conditions. In this step, we choose the stress-free temperature as 275 °C, which is the TSV's annealing temperature [18]. The ANN model has two hidden layers with 30 and 10 neurons, respectively. We use 2580 COMSOL simulation samples to train the ANN model and another 588 samples for validation. Since the ANN model and the COMSOL model have different temperature grids, we use interpolation to map between them. To get the temperature information around TSVs, a total of 32 thermal sensors are placed on chip, one on each core. Temperature values around TSVs are then obtained using interpolation with negligible computing overhead [54]. If there are extremely few on-chip thermal sensors, a more advanced temperature distribution recovery method is recommended, which can be found in runtime full-chip thermal estimation works like [55].

In order to test the ANN model accuracy for different common 3-D IC running conditions, we show the ANN model accuracies separately for different test cases including "SPEC," "Synthetic," "Target," and "Light" in Fig. 10. Each case is explained as follows. SPEC denotes the SPEC benchmark free run case, covering the SPEC benchmark's running conditions. Synthetic denotes the synthetic workload running case [like the one shown in Fig. 14(a)], which is used to emulate the real world user-controlled application running condition. Target represents the stress condition around the lifetime MPC target. Light means the 3-D IC has very light task loads and STREAM will deposit lifetime in this condition.

The average errors of different test cases are 0.1% for SPEC, 0.08% for Synthetic, 0.06% for Target, and 0.09% for Light. These results reveal that the ANN stress model is accurate for different common 3-D IC running conditions.

We also tested the overhead of the ANN stress model in the verification process. Since each core can compute its stress independently, the average computing time of the ANN stress model (including interpolation process, which is negligible) is only 0.25 ms (only around 0.025% throughput degradation).

# B. Advantages of Using Lifetime MPC

In this part, we demonstrate the benefit of using the lifetime MPC in STREAM.

We compare STREAM with lifetime MPC with the existing method. Since the existing method does not compute a power suggestion for management reference, it relies on DVFS scale factors. The maximum frequency of 3-D IC is set to 2.0 GHz, with the basic frequency scaling step set as 0.1 GHz in DVFS. Scale factor ( $\zeta$ ) is introduced to determine the DVFS level in the existing method. For example, when the core is running at 2.0 GHz and the temperature is higher than threshold, DVFS needs to level down the frequency by  $\zeta \times 0.1$  GHz. Please note that STREAM does not need scale factor since it can quantify the power suggestion using lifetime MPC to determine the proper DVFS level. In this experiment, we test DVFS scale factor  $\zeta = 1, 2, 3$  to include both low and high scale factors.
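The scale-factor rule described above can be sketched in a few lines. Only the step-down rule comes from the text; the function name, the clamping to the 0.2 to 2.0 GHz range, and leaving the frequency unchanged below the threshold are assumptions of this sketch.

```python
def scale_factor_step(freq_ghz, temp_c, threshold_c, zeta,
                      step_ghz=0.1, f_min=0.2, f_max=2.0):
    """Lower the frequency by zeta * step_ghz when the temperature exceeds the threshold."""
    if temp_c > threshold_c:
        freq_ghz -= zeta * step_ghz
    # Rounding removes floating-point drift; clamping keeps a valid DVFS frequency.
    return min(f_max, max(f_min, round(freq_ghz, 10)))


# A core at 2.0 GHz that is above a 90 C threshold, for each tested scale factor.
next_freqs = [scale_factor_step(2.0, 106.0, 90.0, zeta) for zeta in (1, 2, 3)]
unchanged = scale_factor_step(2.0, 85.0, 90.0, zeta=3)
```

The fixed step size is what makes this rule blind to how far the temperature is from the threshold: the same $\zeta$ is applied whether the violation is 1 °C or 16 °C.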

Fig. 11 shows the transient maximum temperatures using STREAM (with zero LTDP) and the existing method [21]. We start at a high temperature (around 106 °C) to exclude the lifetime banking effect in STREAM for a fair comparison. Then, starting from 1 s, all the reliability management methods begin to perform management using DVFS every 1 s to track the target temperature (the maximum allowed temperature), which is set as 90 °C.

From Fig. 11, we observe that the existing method is unable to track the target temperature well with either a high or a low scale factor. With high scale factor ( $\zeta = 3$ ), temperature



Fig. 11. Transient maximum temperatures using STREAM (with zero LTDP) and the existing method [21]. All reliability management methods start at 1 s with the management period of 1 s.

of the existing method will oscillate around the target temperature, with large control overshoot. On the other hand, with low scale factor ( $\zeta=1$  and  $\zeta=2$ ), temperature of the existing method converges very slowly, using five control periods (5 s) when  $\zeta=1$  and two control periods (2 s) when  $\zeta=2$  to reach the target temperature. This is because the existing method does not know how much the power should be adjusted for the future step: a high scale factor may over-adjust the power, and a low scale factor may under-adjust it.

With MPC, STREAM is able to quantify the power suggestion in the future and choose the correct DVFS scale ratio based on the power suggestion as shown in Section IV-D. As a result, temperature using STREAM quickly converges to the target temperature in just one control period, and stays just at the target temperature with very small overshoot.

On the overhead side, MPC in this experiment employs a moderate sized 3-D IC thermal model with dimensions as  $G \in \mathbb{R}^{92 \times 92}$ ,  $C \in \mathbb{R}^{92 \times 92}$ , and  $B_c \in \mathbb{R}^{92 \times 32}$  to balance the computing overhead and reliability management accuracy. Such thermal model has an average thermal estimation error to be 0.24 °C, which is accurate enough for power suggestion computation in MPC. For each management period (1 s), the computing overhead from MPC is tested as 0.092 ms on a 2-GHz core, which is negligible for a 32-core 3-D IC (only around 0.00032% throughput degradation).

# C. Advantages of Using Lifetime Estimator With Lifetime Banking

In this part, we demonstrate the advantages of using lifetime estimator with lifetime banking.

Same as the previous experiment, we compare STREAM with the existing method [21] which does not have lifetime banking ability. The management period is set as 0.1 s for all methods. Since system performance will be tested, the management overhead is accounted for all methods. Specifically, STREAM has 0.092-ms MPC overhead for single core and 0.25-ms ANN stress computation overhead for all cores. The overheads of lifetime computation and target temperature estimation in STREAM are too small to be counted.

We use cool phase to denote the state that 3-D IC chip is running at low temperature (with light task load) and hot phase to denote the state that the chip is running at high temperature (with heavy task load). We also use temperature difference to denote the maximum temperature difference



Fig. 12. Performance boost from reliability banking with different workload percentage.

between hot phase and cool phase. In this experiment, we let the 3-D IC start at the cool phase cycle and then switch it to the hot phase cycle, and record the million instructions per second (MIPS) results of using STREAM and the existing method. We performed several experiments with different cool phase percentage and different temperature differences (we fix the hot phase temperature at 120 °C and adjust cool phase temperature to get different temperature differences).

Fig. 12 shows the hot phase performance boost of STREAM against the existing method, for different cool phase percentages and temperature differences. We can see that STREAM is able to gain performance boost up to 9% when the temperature difference and cool phase percentage are both large. This is expected since LTDP will increase during the cool phase according to (9). Then, longer cool phase time and lower cool phase temperature mean more LTDP can be accumulated for the higher performance boost strength during hot phase.
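This trend follows directly from (9). As a simplified illustration (the constant per-phase rates and the symbols $v_c$, $v_h$, $t_c$, $t_b$ are assumptions introduced here, not quantities reported in the experiment), suppose the chip consumes lifetime at a constant rate $v_c < v_{th}$ during a cool phase of length $t_c$ and at a constant rate $v_h > v_{th}$ during the boosted hot phase, starting from zero LTDP. The deposit accumulated in the cool phase and the boost time $t_b$ it can fund are

$$LTDP_{\text{cool}} = (v_{th} - v_c)\, t_c, \qquad (v_h - v_{th})\, t_b \le (v_{th} - v_c)\, t_c \;\Longrightarrow\; t_b \le \frac{v_{th} - v_c}{v_h - v_{th}}\, t_c.$$

A longer cool phase (larger $t_c$) or a cooler one (smaller $v_c$) therefore extends the time during which the threshold may be exceeded.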

From the observation above, we conclude that STREAM, equipped with lifetime estimator, is able to improve system performance as long as there are cool phases, which are common in 3-D IC systems.

# D. Overall Reliability and Performance Enhancement by Using STREAM

In this part, we test the overall reliability and performance enhancement by using STREAM.

First, we let the same SPEC benchmark workload run on all cores of 3-D IC, and compute the average throughput (MIPS) improvement ratio of STREAM over the existing method. The average throughput improvement ratio is calculated as

$$ratio = \frac{MIPS_{STREAM} - MIPS_{existing}}{MIPS_{existing}} \tag{15}$$

where MIPS<sub>STREAM</sub> is the average system MIPS with STREAM, and MIPS<sub>existing</sub> is the average MIPS with the existing method. Since different 3-D IC cores have different running speeds with reliability management, the SPEC benchmark on each core will restart upon completion for a fair throughput comparison.

Fig. 13 shows the average throughput improvement ratio of STREAM over the existing method with different benchmark workloads. The improvement ratio reaches 5.5% with "bwaves." The improvement ratio with bwaves is



Fig. 13. Average throughput improvement ratio by using STREAM over by using the existing method [21] on 3-D ICs.

significant because this benchmark has long cool phases with low cool phase temperature, which enable performance boost in STREAM. On the other hand, we note that the improvement ratio is very small (nearly zero) for benchmarks "hmmer," "mcf," and "sjeng," for two different reasons. The reason for hmmer is that the 3-D IC cores are always in the hot phase when running this benchmark, so no LTDP can be accumulated in STREAM. The reason for mcf and sjeng is that the 3-D IC cores are always in the cool phase when running these two benchmarks, which means no performance boost is needed at all. Overall, thanks to its performance boost ability, STREAM outperforms the existing method in system throughput.

In addition to the benchmark test above, we also created a synthetic workload to emulate the real world application behavior. This synthetic workload has three cool phases and two hot phases with different temperatures. Fig. 14(a) shows the transient thermal behavior of 3-D IC with this synthetic workload under different reliability management methods and without reliability management. Without reliability management, the temperature can reach 110 °C which harms the reliability of the 3-D IC system. With the existing method, the temperature can be controlled below 90 °C, which is the threshold temperature without harming the reliability. However, the threshold temperature is not violated even after the cool phase, meaning the performance potential of the 3-D IC system is not fully exploited. In addition, there is significant temperature oscillation around the threshold temperature, indicating poor control performance with large control overshoot with the existing method.

Now let us analyze the performance of STREAM with both Fig. 14(a) and (b). We can see that from 0 to 100 s, the 3-D IC stays in the cool phase and LTDP increases with time. No reliability management is needed for STREAM during this time period. From 100 to 150 s, the 3-D IC runs in the hot phase and begins to consume the banked lifetime. Since the LTDP is not fully consumed during this hot phase, STREAM does not take any control action even though the threshold temperature is violated. System performance with STREAM is higher than that with the existing method for this hot phase. After the second cool phase (from 150 to 200 s), the second hot phase arrives. At



Fig. 14. Max temperature of 3-D IC's synthetic workload under different management methods. (a) Max temperature of synthetic workload with STREAM, existing method [21], and free run without any reliability management. (b) LTDP information of STREAM.

the beginning of this hot phase (from 200 to 240 s), LTDP is being consumed, and STREAM takes no control action as designed. At around 240 s, the LTDP is fully consumed. STREAM immediately takes control action using DVFS (with the DVFS level set according to the computed power suggestion) in order to keep the LTDP at zero. Since the LTDP value never becomes negative according to Fig. 14(b), the lifetime of the 3-D IC is never shorter than the designed lifetime, meaning STREAM is able to bring performance boost to the 3-D IC while still maintaining its designed lifetime.

## VI. CONCLUSION

In this paper, we have demonstrated a new stress and thermal aware reliability management method for 3-D ICs called STREAM. Unlike the existing methods, STREAM uses an ANN stress model to estimate the explicit stress of 3-D IC at runtime. Thanks to the ANN stress model, an accurate lifetime estimator is proposed with lifetime banking technology to boost the performance of 3-D IC. In order to improve the control quality of reliability management, a specially designed lifetime MPC has been integrated into STREAM to compute the power suggestion for management. The new method has been tested with SPEC benchmarks and synthetic workload. The results show that STREAM successfully manages the reliability of 3-D IC, and leads to higher chip performance than the state-of-the-art 3-D IC reliability management method.

#### REFERENCES

- [1] R. S. Patti, "Three-dimensional integrated circuits and the future of system-on-chip designs," *Proc. IEEE*, vol. 94, no. 6, pp. 1214–1224, Jun. 2006.
- [2] P. Leduc et al., "Challenges for 3D IC integration: Bonding quality and thermal management," in Proc. IEEE Int. Interconnect Technol. Conf., Burlingame, CA, USA, Jun. 2007, pp. 210–212.
- [3] K. Puttaswamy and G. H. Loh, "Thermal analysis of a 3D die-stacked high-performance microprocessor," in *Proc. IEEE/ACM Int. Great Lakes Symp. VLSI (GLSVLSI)*, Apr. 2006, pp. 19–24.
- [4] F. Wang, Z. Zhu, Y. Yang, and N. Wang, "A thermal model for the top layer of 3D integrated circuits considering through silicon vias," in *Proc. IEEE Int. Conf. ASIC*, Oct. 2011, pp. 618–620.
- [5] Y. Chen, E. Kursun, D. Motschman, C. Johnson, and Y. Xie, "Analysis and mitigation of lateral thermal blockage effect of through-silicon-via in 3D IC designs," in *Proc. Int. Symp. Low Power Electron. Design* (ISLPED), Aug. 2011, pp. 397–402.
- [6] J. Cong, G. Luo, J. Wei, and Y. Zhang, "Thermal-aware 3D IC placement via transformation," in *Proc. Asia South Pac. Design Autom. Conf. (ASP-DAC)*, Jan. 2007, pp. 780–785.
- [7] J. Cong, G. Luo, and Y. Shi, "Thermal-aware cell and through-silicon-via co-placement for 3D ICs," in *Proc. Design Autom. Conf. (DAC)*, New York, NY, USA, Jun. 2011, pp. 670–675.
- [8] J. H. Lau and T. G. Yue, "Thermal management of 3D IC integration with TSV (through silicon via)," in *Proc. Electron. Compon. Technol. Conf. (ECTC)*, San Diego, CA, USA, May 2009, pp. 635–640.
- [9] A. K. Coskun, J. L. Ayala, D. Atienza, T. S. Rosing, and Y. Leblebici, "Dynamic thermal management in 3D multicore architectures," in *Proc. Eur. Design Test Conf. (DATE)*, Apr. 2009, pp. 1410–1415.
- [10] F. Zanini, M. M. Sabry, D. Atienza, and G. D. Micheli, "Hierarchical thermal management policy for high-performance 3D systems with liquid cooling," *IEEE J. Emerg. Sel. Topics Circuits Syst.*, vol. 1, no. 2, pp. 88–101, Jun. 2011.
- [11] K. Kang, J. Kim, S. Yoo, and C.-M. Kyung, "Runtime power management of 3-D multi-core architectures under peak power and temperature constraints," *IEEE Trans. Comput.-Aided Design Integr. Circuits Syst.*, vol. 30, no. 6, pp. 905–918, Jun. 2011.
- [12] K. N. Tu, "Reliability challenges in 3D IC packaging technology," *Microelectron. Rel.*, vol. 51, no. 3, pp. 517–523, 2011.
- [13] D. Z. Pan et al., "Design for manufacturability and reliability for TSV-based 3D ICs," in Proc. Asia South Pac. Design Autom. Conf. (ASP-DAC), Jan. 2012, pp. 750–755.
- [14] L. Yu et al., "Methodology for analysis of TSV stress induced transistor variation and circuit performance," in Proc. Int. Symp. Qual. Electron. Design (ISQED), Mar. 2012, pp. 216–222.
- [15] J.-S. Yang, K. Athikulwongse, Y.-J. Lee, S. K. Lim, and D. Z. Pan, "TSV stress aware timing analysis with applications to 3D-IC layout optimization," in *Proc. Design Autom. Conf. (DAC)*, Anaheim, CA, USA, Jun. 2010, pp. 803–806.
- [16] A. Todri et al., "A study of tapered 3-D TSVs for power and thermal integrity," *IEEE Trans. Very Large Scale Integr. (VLSI) Syst.*, vol. 21, no. 2, pp. 306–319, Feb. 2013.
- [17] K. Athikulwongse, A. Chakraborty, J.-S. Yang, D. Z. Pan, and S. K. Lim, "Stress-driven 3D-IC placement with TSV keep-out zone and regularity study," in *Proc. Int. Conf. Comput.-Aided Design (ICCAD)*, San Jose, CA, USA, Nov. 2010, pp. 669–674.
- [18] M. Jung, J. Mitra, D. Z. Pan, and S. K. Lim, "TSV stress-aware full-chip mechanical reliability analysis and optimization for 3D IC," *Commun. ACM*, vol. 57, no. 1, pp. 107–115, Jan. 2014.
- [19] K. H. Lu et al., "Thermo-mechanical reliability of 3D ICs containing through silicon vias," in Proc. Electron. Compon. Technol. Conf. (ECTC), May 2009, pp. 630–634.
- [20] T. Chen, D. M. Liao, and J. X. Zhou, "Numerical simulation of casting thermal stress and deformation based on finite difference method," *Mater. Sci. Forum*, vol. 762, pp. 224–229, Jul. 2013.
- [21] Q. Zou, E. Kursun, and Y. Xie, "Thermomechanical stress-aware management for 3-D IC designs," *IEEE Trans. Very Large Scale Integr.* (VLSI) Syst., vol. 25, no. 9, pp. 2678–2682, Sep. 2017.
- [22] J. Srinivasan, "Lifetime reliability aware microprocessors," Ph.D. dissertation, Dept. Comput. Sci., Univ. Illinois at Urbana–Champaign, Springfield, IL, USA, 2006.
- [23] A. K. Coskun, T. S. Rosing, K. Mihic, G. D. Micheli, and Y. Leblebici, "Analysis and optimization of MPSoC reliability," *J. Low Power Electron.*, vol. 2, no. 1, pp. 56–69, 2006.

- [24] J. Shin, V. Zyuban, Z. Hu, J. A. Rivers, and P. Bose, "A framework for architecture-level lifetime reliability modeling," in *Proc. Int. Conf. Depend. Syst. Netw.*, Jun. 2007, pp. 534–543.
- [25] Z. Lu, W. Huang, M. R. Stan, K. Skadron, and J. Lach, "Interconnect lifetime prediction for reliability-aware systems," *IEEE Trans. Very Large Scale Integr. (VLSI) Syst.*, vol. 15, no. 2, pp. 159–172, Feb. 2007.
- [26] L. Huang and Q. Xu, "On modeling the lifetime reliability of homogeneous manycore systems," in *Proc. IEEE Pac. Rim Int. Symp. Depend. Comput.*, Dec. 2008, pp. 87–94.
- [27] Y. Xiang, T. Chantem, R. P. Dick, X. S. Hu, and L. Shang, "System-level reliability modeling for MPSoCs," in *Proc. Int. Conf. Hardw. Softw. Codesign Syst. Synth. (CODES+ISSS)*, Oct. 2010, pp. 297–306.
- [28] Z. Lu, J. Lach, M. R. Stan, and K. Skadron, "Improved thermal management with reliability banking," *IEEE Micro*, vol. 25, no. 6, pp. 40–49, Nov./Dec. 2005.
- [29] E. Karl, D. Blaauw, D. Sylvester, and T. Mudge, "Reliability modeling and management in dynamic microprocessor-based systems," in *Proc. Design Autom. Conf. (DAC)*, Jul. 2006, pp. 1057–1060.
- [30] T. Kim, Z. Sun, H.-B. Chen, H. Wang, and S. X.-D. Tan, "Energy and lifetime optimizations for dark silicon manycore microprocessor considering both hard and soft errors," *IEEE Trans. Very Large Scale Integr.* (VLSI) Syst., vol. 25, no. 9, pp. 2561–2574, Sep. 2017.
- [31] T. Kim, Z. Liu, and S. X.-D. Tan, "Dynamic reliability management based on resource-based EM modeling for multi-core microprocessors," *Microelectron. J.*, vol. 74, pp. 106–115, Apr. 2018.
- [32] A. Y. Yamamoto and C. Ababei, "Unified reliability estimation and management of NoC based chip multiprocessors," *Microprocessors Microsyst.*, vol. 38, no. 1, pp. 53–63, 2014.
- [33] W. Huang et al., "HotSpot: A compact thermal modeling methodology for early-stage VLSI design," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 14, no. 5, pp. 501–513, May 2006.
- [34] W. Huang, K. Sankaranarayanan, K. Skadron, R. J. Ribando, and M. R. Stan, "Accurate, pre-RTL temperature-aware processor design using a parameterized, geometric thermal model," *IEEE Trans. Comput.*, vol. 57, no. 9, pp. 1277–1288, Sep. 2008.
- [35] H. Wang, S. X.-D. Tan, D. Li, A. Gupta, and Y. Yuan, "Composable thermal modeling and simulation for architecture-level thermal designs of multi-core microprocessors," ACM Trans. Design Autom. Electron. Syst., vol. 18, no. 2, pp. 1–28, Mar. 2013.
- [36] V. Hanumaiah, S. Vrudhula, and K. S. Chatha, "Performance optimal online DVFS and task migration techniques for thermally constrained multi-core processors," *IEEE Trans. Comput.-Aided Design Integr. Circuits Syst.*, vol. 30, no. 11, pp. 1677–1690, Nov. 2011.
- [37] J. Meng, K. Kawakami, and A. K. Coskun, "Optimizing energy efficiency of 3-D multicore systems with stacked DRAM under power and thermal constraints," in *Proc. Design Autom. Conf. (DAC)*, San Francisco, CA, USA, Jun. 2012, pp. 648–655.
- [38] A. Sridhar, A. Vincenzi, M. Ruggiero, T. Brunschwiler, and D. Atienza, "3D-ICE: Fast compact transient thermal modeling for 3D ICs with intertier liquid cooling," in *Proc. Int. Conf. Comput.-Aided Design (ICCAD)*, Nov. 2010, pp. 463–470.
- [39] F. Li et al., "Design and management of 3D chip multiprocessors using network-in-memory," in Proc. Int. Symp. Comput. Archit. (ISCA), Boston, MA, USA, Jun. 2006, pp. 130–141.
- [40] M. Jung, J. Mitra, D. Z. Pan, and S. K. Lim, "TSV stress-aware full-chip mechanical reliability analysis and optimization for 3-D IC," *IEEE Trans. Comput.-Aided Design Integr. Circuits Syst.*, vol. 31, no. 8, pp. 1194–1207, Aug. 2012.
- [41] "Failure mechanisms and models for semiconductor devices," JEDEC Solid State Technol. Assoc., Richmond, VA, USA, Rep. JEP122H, 2003.
- [42] J. Srinivasan, S. V. Adve, P. Bose, and J. A. Rivers, "Exploiting structural duplication for lifetime reliability enhancement," in *Proc. Int. Symp. Comput. Archit. (ISCA)*, Jun. 2005, pp. 520–531.
- [43] J. Srinivasan, S. V. Adve, P. Bose, and J. A. Rivers, "The case for lifetime reliability-aware microprocessors," in *Proc. Int. Symp. Comput. Archit. (ISCA)*, Jun. 2004, pp. 276–287.
- [44] L. Zhang, H. Wang, and S. X.-D. Tan, "Fast stress analysis for runtime reliability enhancement of 3D IC using artificial neural network," in *Proc. Int. Symp. Qual. Electron. Design (ISQED)*, Mar. 2016, pp. 173–178.
- [45] Z. Xiong, Y. Zhang, L. Ou, and L. Li, "Two-phases parallel neural network algorithm based on RPROP," in *Proc. IEEE Region 10 Conf.* (TENCON), Nov. 2005, pp. 1–6.
- [46] I. Goodfellow, Y. Bengio, and A. Courville, *Deep Learning*. Cambridge, MA, USA: MIT Press, 2016.

- [47] F. Zanini, D. Atienza, L. Benini, and G. De Micheli, "Multicore thermal management with model predictive control," in *Proc. Eur. Conf. Circuit Theory Design*, Aug. 2009, pp. 90–95.

- [48] A. Bartolini, M. Cacciari, A. Tilli, and L. Benini, "Thermal and energy management of high-performance multicores: Distributed and self-calibrating model-predictive controller," *IEEE Trans. Parallel Distrib. Syst.*, vol. 24, no. 1, pp. 170–183, Jan. 2013.
- [49] H. Wang et al., "Hierarchical dynamic thermal management method for high-performance many-core microprocessors," ACM Trans. Design Autom. Electron. Syst., vol. 22, no. 1, p. 1, Jul. 2016.
- [50] L. Wang, Model Predictive Control System Design and Implementation Using MATLAB. London, U.K.: Springer-Verlag, 2009.
- [51] K. Skadron et al., "Temperature-aware microarchitecture," in Proc. Int. Symp. Comput. Archit. (ISCA), Jun. 2003, pp. 2–13.
- [52] D. Brooks, V. Tiwari, and M. Martonosi, "Wattch: A framework for architectural-level power analysis and optimizations," in *Proc. Int. Symp. Comput. Archit. (ISCA)*, Jun. 2000, pp. 83–94.
- [53] J. L. Henning, "SPEC CPU2006 benchmark descriptions," ACM SIGARCH Comput. Archit. News, vol. 34, no. 4, pp. 1–17, Sep. 2006.
- [54] J. Long, S. O. Memik, G. Memik, and R. Mukherjee, "Thermal monitoring mechanisms for chip multiprocessors," ACM Trans. Archit. Code Optim., vol. 5, no. 2, pp. 1–9, Aug. 2008.
- [55] H. Wang, S. X.-D. Tan, G. Liao, R. Quintanilla, and A. Gupta, "Full-chip runtime error-tolerant thermal estimation and prediction for practical thermal management," in *Proc. Int. Conf. Comput.-Aided Design* (ICCAD), San Jose, CA, USA, Nov. 2011, pp. 716–723.



**Rui Liu** received the B.S. degree from the Beijing Institute of Technology, Beijing, China, in 2007, the M.S. degree from the Institute of Mechanics, Chinese Academy of Sciences, Beijing, in 2010, and the Ph.D. degree from the Politecnico di Milano, Milan, Italy, in 2017.

He is currently a Research Assistant with the Institute of Chemical Materials, China Academy of Engineering Physics, Beijing. His current research interest includes simulation of the behavior of materials under complex loading conditions.



**Chi Zhang** received the bachelor's degree from the Taiyuan University of Science and Technology, Taiyuan, China, in 1994, and the master's degree from the Microelectronics Research Institute, Chinese Academy of Sciences, Beijing, China, in 2003. He is currently pursuing the Ph.D. degree with the University of Electronic Science and Technology of China, Chengdu, China.

His current research interests include mixed-signal integrated circuit design, electronic design automation technology, and multimode biometrics technology.



**Hai Wang** received the B.S. degree from the Huazhong University of Science and Technology, Wuhan, China, and the M.S. and Ph.D. degrees from the University of California at Riverside, Riverside, CA, USA, in 2007, 2008, and 2012, respectively.

He is currently an Associate Professor with the University of Electronic Science and Technology of China, Chengdu, China. His current research interests include electrical/thermal verification and optimization of very large-scale integration circuits and systems.

Dr. Wang has served as a Technical Program Committee Member for several international conferences, including Design, Automation and Test in Europe, ASP-DAC, and the International Symposium on Quality Electronic Design, and also served as a Reviewer for several journals, including the IEEE TRANSACTIONS ON COMPUTERS, the IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, and ACM Transactions on Design Automation of Electronic Systems.



**He Tang** (M'09) received the B.S.E.E. degree from the University of Electronic Science and Technology of China, Chengdu, China, in 2005, the M.S. degree in electrical and computer engineering from the Illinois Institute of Technology, Chicago, IL, USA, in 2007, and the Ph.D. degree in electrical engineering from the University of California at Riverside, Riverside, CA, USA, in 2010.

From 2010 to 2012, he was an Analog IC Designer with OmniVision Technologies, Inc., Santa Clara, CA, USA, where he worked on high-speed I/O interfaces. Since 2012, he has been an Associate Professor and subsequently a Professor with the University of Electronic Science and Technology of China. His past research includes high-speed high-resolution pipelined analog-to-digital converters (ADCs) with digital calibration and high-performance ultralow-power successive approximation ADCs. He has authored or coauthored over 40 papers. His current research interests include data converters and analog/mixed-signal IC designs.

Dr. Tang has been serving on the IEEE CAS Analog Signal Processing Technical Committee since 2013.



**Darong Huang** received the B.S. degree from the University of Electronic Science and Technology of China, Chengdu, China, in 2016, where he is currently pursuing the master's degree.

His current research interests include thermal analysis, and thermal and reliability management of integrated circuits.



**Yuan Yuan** received the B.S. and M.S. degrees from the University of Electronic Science and Technology of China, Chengdu, China, in 1992 and 2005, respectively.

He is currently an Associate Professor with the University of Electronic Science and Technology of China. He has published over ten research papers in international conferences and journals. His current research interests include electronic measuring equipment design, computer-based measuring technology, and embedded systems.