# Predictive Maintenance with Bayesian Recurrent Neural Networks

*Kejia Shi (ks3403) and Zhirui Wang (zw2389)*

Data Science Institute, Columbia University

## 1. Introduction

Sequentially predicting the time to failure (also known as Remaining Useful Life, or RUL) of engines is vital to ensuring operational safety in manufacturing and in all kinds of services, including the management of data servers and scientific research facilities. Collecting and analyzing machine data on the working environment and user settings enables us to model engine degradation better than pure human estimation can.

In this project, we build deep generative models based on the C-MAPSS Data, or the Turbofan Engine Degradation Simulation Data Set from NASA. We predict the time to failure (RUL) of those engines.

## 2. Review of Methods

|Name|Model|Advantages (Not limited to)|Shortcomings (In our context)|
|:---|:----|:--------------------------|:----------------------------|
| **Survival analysis** | *Survival analysis* is a branch of statistics for analyzing the expected duration of time until one or more events happen, such as death in biological organisms and failure in mechanical systems.<br>**Typical models**: accelerated failure time (AFT) models.<br>**Details**: AFT models incorporate covariates $X\in \mathbb{R}^d$ into the survival function $S(t)=P(T>t)$ and are equivalent to log-linear models of the time $t$: $$S(t \mid \beta,X)=S_0(\exp(\beta^T X)\cdot t)$$ Under this setting, a Weibull distribution is commonly chosen for the baseline survival function ($S_0\sim\text{Weibull}(S_0:\lambda,k)$), which makes the error term of the log-linear model follow a Gumbel (extreme value) distribution ($\epsilon\sim\text{Gumbel}(\epsilon:\mu,\gamma)$).<br>**Note**: A typical Bayesian setting places priors on both the coefficients of the additive features and the hyperparameters of the Gumbel distribution. For example, we can pick independent, vague priors: $\beta\sim \text{Normal}(0,\sigma_1^2 I_d)$, $\mu=0$, $\gamma\sim \text{Half-Normal}(0,\sigma_2^2)$. | - Capture the characteristics of many applications.<br>- Correctly incorporate information from both censored and uncensored observations when estimating important model parameters ("censoring": data are gathered only up to some point before the "death" event is observed). | - Work efficiently only with representative, carefully engineered features.<br>- Need an effective underlying model: the original sensor data may have been simulated under complex physics or materials science principles. With 26 features in total, it is nearly impossible for us to capture the true relationships.<br>- Lack explanatory power for features in cumbersome models. |
| **Churn prediction** | *Churn prediction* predicts whether customers are about to leave (non-event prediction, *churn score*). This can be viewed as a *machine learning* problem.<br>**Typical models**: GLM, SVM, k-NN, neural networks and so on.<br>**Details**: A typical recurrent neural network can capture such recurrent, time-varying structure. For each time step from $t=1$ to $t=\tau$:<br>$\mathbf{a}^{(t)}=\mathbf{b}+\mathbf{W}\mathbf{h}^{(t-1)}+\mathbf{U}\mathbf{x}^{(t)}$, $\mathbf{h}^{(t)}=\tanh(\mathbf{a}^{(t)})$,<br>$\mathbf{o}^{(t)}=\mathbf{c}+\mathbf{V}\mathbf{h}^{(t)}$, $\hat{\mathbf{y}}^{(t)}=\text{softmax}(\mathbf{o}^{(t)})$,<br>with bias vectors $\mathbf{b}$ and $\mathbf{c}$ and weight matrices $\mathbf{U}$ (input-to-hidden), $\mathbf{V}$ (hidden-to-output) and $\mathbf{W}$ (hidden-to-hidden). We train the model with a chosen loss function, such as the log-likelihood loss.<br>**Note**: Although the predictive task in churn analysis differs from estimating the survival function, it is worth noting that predicting the event status, given that the individuals have survived up to the specified time, is essentially related to the hazard function, which is also part of survival analysis. | - Straightforward, as the predictive goal (in a narrower definition of inference) is clearly defined given a sufficient number of features.<br>- Many models to choose from | - Similar to the downsides of survival analysis.<br>- Treats model parameters as fixed values (no uncertainty) |
| **Churn prediction** (probabilistic) | A probabilistic setup uses the mathematics of probability theory to express all forms of uncertainty and noise associated with our model.<br>**Typical models**: Gaussian Mixture Model, with the Expectation-Maximization method to approximate the mean and covariance.<br>**Details**: see [Lin et al., 2013](http://www.sciencedirect.com/science/article/pii/S095183201300149X) | - Approximate any arbitrary distribution with a mixture model.<br>- Simplify the representation of the model framework. | - The number of clusters in the GMM is not known in advance.<br>- Assumes an underlying Gaussian generative distribution, which may not be the case for our dataset (the clusters lack explanatory power) |
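
The AFT survival function in the survival-analysis row of the table can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the Weibull scale `lam`, shape `k`, the coefficients `beta` and the covariates `x` are hypothetical values, not quantities fitted to the C-MAPSS data, and the helper names are ours.

```python
import numpy as np

def weibull_baseline_survival(t, lam, k):
    """Baseline Weibull survival S_0(t) = exp(-(t / lam) ** k) for t >= 0."""
    t = np.asarray(t, dtype=float)
    return np.exp(-(t / lam) ** k)

def aft_survival(t, x, beta, lam, k):
    """AFT survival S(t | beta, x) = S_0(exp(beta^T x) * t)."""
    accel = np.exp(np.dot(beta, x))  # time-scaling factor from the covariates
    return weibull_baseline_survival(accel * np.asarray(t, dtype=float), lam, k)

# Hypothetical coefficients and covariates (illustration only).
beta = np.array([0.5, -0.2])
x = np.array([1.0, 2.0])
t_grid = np.linspace(0.0, 300.0, 7)  # operating cycles

s_base = weibull_baseline_survival(t_grid, lam=200.0, k=1.5)
s_cov = aft_survival(t_grid, x, beta, lam=200.0, k=1.5)

for t, s0, s1 in zip(t_grid, s_base, s_cov):
    print(f"t={t:6.1f}  S_0(t)={s0:.3f}  S(t|x)={s1:.3f}")
```

With these values $\beta^T X > 0$, so time is rescaled by a factor greater than one and the survival curve with covariates lies below the baseline curve for every $t > 0$.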

As summarized in the above table, this predictive task fits in both churn prediction and survival contexts. Our models extend the usage of recurrent neural network for churn predictions ([Uz, 2017](https://github.com/Azure/lstms_for_predictive_maintenance/blob/master/Deep%20Learning%20Basics%20for%20Predictive%20Maintenance.ipynb)) to noisy settings.

The recurrent neural network may be able to capture hidden relationships in the data, save us time on feature engineering and uncover temporal patterns in recurrent events. Although a traditional recurrent neural network may already give competitive predictions, it treats all model parameters as fixed values.
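
To make the "fixed values" point concrete, here is a minimal NumPy sketch of the standard RNN recurrence from the table in this section, using the column-vector convention of those equations. The dimensions, the random inputs and the helper names `softmax` and `rnn_forward` are assumptions made for illustration, and the weights are random rather than trained.

```python
import numpy as np

def softmax(o):
    """Numerically stable softmax over a 1-D vector."""
    e = np.exp(o - np.max(o))
    return e / e.sum()

def rnn_forward(x_seq, U, W, V, b, c):
    """Run a^(t) = b + W h^(t-1) + U x^(t), h^(t) = tanh(a^(t)),
    o^(t) = c + V h^(t), y_hat^(t) = softmax(o^(t)) over a sequence."""
    h = np.zeros(W.shape[0])  # initial hidden state h^(0)
    outputs = []
    for x_t in x_seq:
        a = b + W @ h + U @ x_t
        h = np.tanh(a)
        o = c + V @ h
        outputs.append(softmax(o))
    return np.stack(outputs)

# Hypothetical sizes: tau time steps, D inputs, H hidden units, K classes.
rng = np.random.default_rng(0)
tau, D, H, K = 5, 3, 4, 2
x_seq = rng.normal(size=(tau, D))
U = rng.normal(scale=0.1, size=(H, D))  # input-to-hidden
W = rng.normal(scale=0.1, size=(H, H))  # hidden-to-hidden
V = rng.normal(scale=0.1, size=(K, H))  # hidden-to-output
b = np.zeros(H)
c = np.zeros(K)

y_hat = rnn_forward(x_seq, U, W, V, b, c)
print(y_hat.shape)  # one probability vector per time step
```

Once `U`, `W`, `V`, `b` and `c` are set, the same input sequence always yields the same output, so the network by itself carries no notion of parameter uncertainty.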

The baseline Bayesian RNN model differs from the standard RNN by placing the following prior distributions on its parameters.

$$\mathbf{b}\sim\text{Normal}(\mathbf{0}_H,\mathbf{\Sigma_1}_H),c\sim\text{Normal}(0,\sigma_2^2),$$
$$\mathbf{U}\sim\text{Normal}(\mathbf{0}_{(D,H)},\mathbf{\Sigma_3}_{(D,H)}),\mathbf{V}\sim\text{Normal}(\mathbf{0}_{(H,1)},\mathbf{\Sigma_4}_{(H,1)}),\mathbf{W}\sim\text{Normal}(\mathbf{0}_{(H,H)},\mathbf{\Sigma_5}_{(H,H)}).$$
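
The effect of these distributions can be illustrated by drawing the parameters repeatedly and pushing the same input sequence through the network. The sketch below is not the project's model code. It assumes isotropic covariances with one shared standard deviation `sigma`, a row-vector convention so that the shapes $(D,H)$, $(H,H)$ and $(H,1)$ above line up, and a scalar output per time step. The names `sample_prior` and `rnn_output` and all sizes are hypothetical. It shows only the prior predictive spread, with no inference performed.

```python
import numpy as np

def sample_prior(rng, D, H, sigma=1.0):
    """Draw one set of RNN parameters from zero-mean Normal priors.
    Assumption: every covariance is sigma**2 times the identity."""
    return {
        "b": rng.normal(0.0, sigma, size=H),
        "c": rng.normal(0.0, sigma),
        "U": rng.normal(0.0, sigma, size=(D, H)),
        "V": rng.normal(0.0, sigma, size=(H, 1)),
        "W": rng.normal(0.0, sigma, size=(H, H)),
    }

def rnn_output(x_seq, p):
    """Scalar output per time step, row-vector convention:
    h_t = tanh(b + h_{t-1} W + x_t U), o_t = h_t V + c."""
    h = np.zeros(p["W"].shape[0])
    outs = []
    for x_t in x_seq:
        h = np.tanh(p["b"] + h @ p["W"] + x_t @ p["U"])
        outs.append((h @ p["V"])[0] + p["c"])
    return np.array(outs)

rng = np.random.default_rng(0)
tau, D, H, n_draws = 6, 3, 4, 200  # hypothetical sizes
x_seq = rng.normal(size=(tau, D))

# Each draw of the parameters gives a different output trajectory.
draws = np.stack(
    [rnn_output(x_seq, sample_prior(rng, D, H, sigma=0.5)) for _ in range(n_draws)]
)
pred_mean = draws.mean(axis=0)
pred_std = draws.std(axis=0)
print(pred_mean.shape, pred_std.shape)
```

The per-step standard deviation `pred_std` is the quantity a fixed-parameter RNN cannot provide; after inference, the same loop would draw from the approximate posterior instead of the prior.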

In practice, the compositional structure that probabilistic programming adds "compiles" the model down into inference procedures. Because the model parameters are drawn from distributions rather than held fixed, their uncertainty is taken into account. We then perform inference, where up-to-date algorithms can be applied.

## 3. Dataset

## 4. Model

## 5. Inference
### 5.1 Black-Box Variational Inference

### 5.2 Bayes by Backprop

## 6. Criticism

## 7. Summary

### References
