# Predictive Maintenance -- Survival Analysis
tags: tech blog
, tutorials
, predictive maintenance
, python
, lifelines
-
-
Predictive maintenance is to predict which machinery at which condition needs preventative maintenance so as to eliminate outages and the costs associated with it. Instead of predicting each individual part's failure, Survival analysis is a statistics approach to estimate failure rate. It is also called reliability analysis and event history analysis. Simply put, it intends to answer -- "How long until an event occurs?"
tech blog
, tutorials
, predictive maintenance
, python
, lifelines
For example,
- How long patients survive?
- How long mechanical parts last?
In this tutorial, we exemplify the second question with 28,529 censored parts which had no failure history, 1334 failed parts and their time to event (fail or exit the censorship). Input table is prepared as below with event
and time_to_event
columns.
df.event.value_counts()
# censor 28529
# fail 1334
# Name: event, dtype: int64
Kaplan-Meier Estimation
At each t + 1, compute how many at risk, which is the last entrance substracted by parts failed at t and also those did not fail but monitored until t (i.e. # censored)
from lifelines import KaplanMeierFitter
durations = df['time_to_event'].apply(lambda x: 0.5 if x==0 else x) # workarounds for DOA
event_observed = df['event'].apply(lambda x: 1 if x=='fail' else 0)
km = KaplanMeierFitter()
km.fit(durations, event_observed, label='KM')
- Event Table:
km.event_table
Survival Function
- What fraction survive past t?
- e.g., 5-yr survival rates, median survival time
- Also called Reliability Function
Estimated by event table
Survival function can be estimated by the following formula where
km.survival_function_
Hazard Function
- Of the people who survive until t, what fraction die at t?
- Conditional density probability
$$ \lambda(t) = {f(y)\over S(t)} $$
- The hazard function might be of more intrinsic interest than the p.d.f. to a patient who had survived a certain time period and wanted to know something about their prognosis.
Popular Distributions
Weibull
from lifelines import *
wbf = WeibullFitter().fit(durations[1:], event_observed[1:], label='WeibullFitter')
wbf.summary
plt.figure(figsize=(10, 24))
plt.subplot(4, 1, 1)
wbf.plot_survival_function()
plt.title('Survival Curve')
plt.xlabel('Time to Event (days)')
plt.ylabel('Survival Probability')
plt.subplot(4, 1, 2)
wbf.plot_cumulative_density()
plt.title('Cumulative Failure Density')
plt.xlabel('Time to Event (days)')
plt.ylabel('Cum. Density Prob. (failure)')
plt.subplot(4, 1, 3)
wbf.plot_hazard()
plt.title('Hazard Function')
plt.xlabel('Time to Event (days)')
plt.ylabel('Hazard Function (pdf)')
plt.subplot(4, 1, 4)
wbf.plot_cumulative_hazard()
plt.title('Cumulative Hazard Function')
plt.xlabel('Time to Event (days)')
plt.ylabel('Cumulative Hazard Function')
plt.show()
QQPlot
In fact, weibull distribution fits our data poortly, and we can observe that by looking at qqplot as below. None falls on the diagonal line. Instead, Log Normal seems to be a better fit.
from lifelines.plotting import qq_plot
durations = df['time_to_event'].apply(lambda x: 0.5 if x==0 else x)
event_observed = df['event'].apply(lambda x: 1 if x=='fail' else 0)
fig, axes = plt.subplots(2, 2, figsize=(10, 10))
axes = axes.reshape(4, )
for i, model in enumerate([WeibullFitter(), LogNormalFitter(), LogLogisticFitter(), ExponentialFitter()]):
model.fit(durations, event_observed)
qq_plot(model, ax=axes[i])
LogNormal
lnf = LogNormalFitter().fit(durations, event_observed)
lnf.summary
plt.figure(figsize=(10, 24))
plt.subplot(4, 1, 1)
lnf.plot_survival_function()
plt.title('Survival Curve')
plt.xlabel('Time to Event (days)')
plt.ylabel('Survival Probability')
plt.subplot(4, 1, 2)
lnf.plot_cumulative_density()
plt.title('Cumulative Failure Density')
plt.xlabel('Time to Event (days)')
plt.ylabel('Cum. Density Prob. (failure)')
plt.subplot(4, 1, 3)
lnf.plot_hazard()
plt.title('Hazard Function')
plt.xlabel('Time to Event (days)')
plt.ylabel('Hazard Function (pdf)')
plt.subplot(4, 1, 4)
lnf.plot_cumulative_hazard()
plt.title('Cumulative Hazard Function')
plt.xlabel('Time to Event (days)')
plt.ylabel('Cumulative Hazard Function')
plt.show()
Poisson
Having explored hazard function, which shows unpredictable but extremely low failure rate, it is reasonable to assume the part's failure rate will converge to a constant (random failure in bathtub curve). With that constant failure rate - probability of failure in day, I can convert that to Mean Time Between Failure (MTBF) and Annualized Failure Rate (AFR).
# ultimate failure rate
fr = lnf.hazard_.iloc[-1, 0]
MTBF_days = 1/fr
MTBF_mons = round(MTBF_days / 30, 1)
MTBF_yrs = round(MTBF_days / 365, 1)
AFR = fr*365
num_active_parts = 200000
num_active_parts * AFR # estimated number of failures per year out of 200K active parts
To fit in a possion distribution, I only need one independent variable. That is average failure rate, and I use AFR multiplied by number of active part base.
from scipy.stats import poisson
mu = num_active_parts * AFR
x = np.arange(0, 1000, 1)
plt.plot(x, poisson.pmf(x, mu))
plt.title('Number of failures occur in a year in Prob.')
plt.plot(x, poisson.cdf(x, mu))
plt.title('Cumu. Demand Distribution (Probability of number of failures less than x)')
plt.xlabel('x (demand)')
plt.show()
i = np.arange(0.01, 1, 0.01)
plt.plot(i, poisson.isf(1-i, mu)) # cdf modified from isf, for some reason, there is no inverse CDF, but inverse survival function
plt.title('Inverse CDF')
plt.show()
With the cummulative density distribution, I can use its inverse function to map cummulative failure rate back to number of parts that failures migth happen - which is equivalent to the number of spares we would like prepare when a failure happens.
(Bonus) Spares Demand Forecast using Newsvendor Model
Newsvendor Model
The optimal spare quantiy given underage cost
Equals to derivative = 0, equals to $$ P(Demand<Q*) = {C_u\over C_u+C_o} \ ==Q* = F^{-1}({C_u \over C_u+C_o}) $$
Results
SLAs = [0.50, 0.75, 0.95, 0.96, 0.97, 0.98, 0.99, 0.9999999, 1]
print('{:<30}{:<40}'.format('Reliability', '# Spares'))
for sla in SLAs:
print('{:<30}{:<40}'.format(sla, poisson.isf(1-sla, mu)))
This simplified model (constant failure rate = 0.0003 per day / AFR = 203 per year) fitted in poisson distribution recommends 237 spares in a year to achieve 99% reliability or when