### Predict device failures

<p>
__TODO: add a sidebar to the notebook__
<br>
https://github.com/ipython-contrib/jupyter_contrib_nbextensions



This notebook predicts failures of devices used in telemetry. The data was downloaded from: http://aws-proserve-data-science.s3.amazonaws.com/device_failure.csv
<br>
I saved a local copy at: /Users/valentin/GoogleDrive/Data/aws/device_failure.csv

##### Goal:
You are tasked with building a predictive model using machine learning to predict the
probability of a device failure. When building this model, be sure to minimize false positives and
false negatives. The column you are trying to predict is called failure with binary value 0 for
non-failure and 1 for failure.


#### Plan for development
1. Read the data
2. Data exploration
    * Calculate the distribution of total fail/not fail cases
    * Distribution of fail/not-fail cases by year
    * Distribution of fail/not-fail cases by month
    * Are there any devices that have higher failure rates than others?
    * When a device fails, does it disappear from the data, or does it show up again?
    * Are there particular days of the week when a higher percent of devices fail?
3. Understand the important variables for splitting the dataset
    * Do a pre-modeling analysis using decision trees
    * Use decision trees to understand which are the important drivers for the outcome
4. Derive new features
    * Determine which features are categorical, and if needed create dummies
    * Use decision trees to decide how to split by categories of attributes
    * Should I do PCA? (likely no..)
5. Sampling of the data (70/30)
    * Create random samples from the data - 70% for training and 30% for validation
    * If the failures are too few compared to the non-failures, I will need to oversample the failures (and add a weight), or simply undersample the non-failures
6. Do we need separate models for different groups of devices?
    * Use decision trees to figure this out
        * Run decision trees on all devices
        * Run decision trees by types of devices, determined by the first letter - S, W, or Z
7. Variable reduction
    * Use decision trees
    * Use p-values
    * LASSO
    * Random forest's variable importance
    * Stepwise logistic regression
    * Compare the suggested features from each method
8. Decide on a methodology for the model estimation
    * Random forest
    * Random forest with boosting
    * SVM
    * Survival analysis
9. Validate the results
10. Calculate c-statistic and ROC
11. Create a confusion matrix - reduce false positives and false negatives
12. Does the methodology we use allow us to understand what drives the outcome? If it doesn't, do we need another methodology that can help with that?
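
Steps 5, 10 and 11 of the plan can be sketched roughly as follows. This is only an illustration on synthetic data from `make_classification` (an assumption, not the device data), and it uses `class_weight="balanced"` as one possible alternative to the over/undersampling mentioned in step 5. The random forest and the 0.5 threshold are placeholder choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the device data: 9 numeric features, roughly 5% positives.
X, y = make_classification(n_samples=2000, n_features=9, n_informative=4,
                           weights=[0.95, 0.05], random_state=0)

# Step 5: 70/30 split, stratified so both parts keep the same failure rate.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Reweight the rare class instead of resampling rows.
model = RandomForestClassifier(n_estimators=100, class_weight="balanced",
                               random_state=0)
model.fit(X_train, y_train)

# Step 10: c-statistic (area under the ROC curve) on the validation set.
proba = model.predict_proba(X_valid)[:, 1]
auc = roc_auc_score(y_valid, proba)

# Step 11: confusion matrix at a 0.5 probability threshold.
# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_valid, (proba >= 0.5).astype(int))

print("c-statistic: %.3f" % auc)
print(cm)
```

Moving the threshold away from 0.5 trades false positives against false negatives, so it would need to be tuned on the real data.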


In [75]:
import pandas as pd
import numpy as np
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

%matplotlib inline

# Return the path to the data depending on the OS I use - Windows or macOS
def selectOS(_os):
    if _os == "win":
        data_location = "C:/Users/bre49823/GoogleDrive/Data/aws/device_failure.csv"
    elif _os == "mac":
        data_location = "/Users/valentin/GoogleDrive/Data/aws/device_failure.csv"
    else:
        # Fail early instead of hitting an UnboundLocalError on return
        raise ValueError("Unknown OS: %r (expected 'win' or 'mac')" % _os)
    return data_location

# Read in the data and print out the header
device_data = pd.read_csv(selectOS("win"))

# Check if the number of rows is as expected
print(len(device_data))

device_data.head(10)

124494


Unnamed: 0,date,device,failure,attribute1,attribute2,attribute3,attribute4,attribute5,attribute6,attribute7,attribute8,attribute9
0,2015-01-01,S1F01085,0,215630672,56,0,52,6,407438,0,0,7
1,2015-01-01,S1F0166B,0,61370680,0,3,0,6,403174,0,0,0
2,2015-01-01,S1F01E6Y,0,173295968,0,0,0,12,237394,0,0,0
3,2015-01-01,S1F01JE0,0,79694024,0,0,0,6,410186,0,0,0
4,2015-01-01,S1F01R2B,0,135970480,0,0,0,15,313173,0,0,3
5,2015-01-01,S1F01TD5,0,68837488,0,0,41,6,413535,0,0,1
6,2015-01-01,S1F01XDJ,0,227721632,0,0,0,8,402525,0,0,0
7,2015-01-01,S1F023H2,0,141503600,0,0,1,19,494462,16,16,3
8,2015-01-01,S1F02A0J,0,8217840,0,1,0,14,311869,0,0,0
9,2015-01-01,S1F02DZ2,0,116440096,0,323,9,9,407905,0,0,164


In [None]:
# Calculate the distribution of total fail/not fail cases
# The mean of a 0/1 column is the fraction of failures; this also avoids integer division
pct_fail_cases = device_data["failure"].mean() * 100
print("The data is very imbalanced: failures are rare compared to non-failures.")
print("Only %.3f%% of the cases are device failures" % pct_fail_cases)

We need to dig a little deeper into the distributions of failure/non-failure and understand the following:
1. Distribution of failure/non-failure by year
2. Distribution of failure/non-failure by month
3. Distribution of failure/non-failure by device - are there specific devices that tend to fail more than others?

In [None]:
# Derive calendar features and the device type (first letter of the device ID)
dates = pd.to_datetime(device_data["date"])
device_data["year"] = dates.dt.year
device_data["month"] = dates.dt.month
device_data["day_of_week"] = dates.dt.dayofweek  # Monday = 0, Sunday = 6
device_data["device_type"] = device_data["device"].str[0]

# Distribution of observations by year (counts) and by month (share of all rows)
print(pd.crosstab(device_data["year"],
                  columns = "count"))
pd.crosstab(device_data["month"],
            columns = "count") / len(device_data.month)

The data covers only 2015, with observations for every month except December. Note, however, that 50% of the observations fall in the period January - March.
<p>
Next, we'll look at the distribution of failure/non-failure by month.

In [None]:
# Distribution of failure/non-failure by month
failure_month = pd.crosstab(index = device_data["failure"],
                            columns = device_data["month"],
                            margins = True)

failure_month.index = ["non_failure", "failure", "month_total"]
failure_month

In [63]:
failure_month_pct = failure_month / failure_month.loc["month_total"]
failure_month_pct

month,1,2,3,4,5,6,7,8,9,10,11,All
non_failure,0.999041,0.999282,0.999546,0.999251,0.998147,0.999427,0.998481,0.999521,1.0,0.99898,1.0,0.999149
failure,0.000959,0.000718,0.000454,0.000749,0.001853,0.000573,0.001519,0.000479,0.0,0.00102,0.0,0.000851
month_total,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [None]:
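# Plot the % of failures by month (drop the "All" margin column, convert fractions to percent)
monthly_failure_pct = failure_month_pct.loc["failure"].drop("All") * 100
plt.figure(figsize = (15, 5))
plt.plot(monthly_failure_pct.index.astype(int), monthly_failure_pct.values, linewidth = 0.8, color = "#ff0000")
plt.xlabel("Month")
plt.ylabel("Percent device failure")
plt.title("Percent Failed Devices by Month")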

In [None]:
failure_day = pd.crosstab(index = device_data["date"],
                          columns = device_data["failure"],
                          margins = True)

failure_day.columns = ["non_failure", "failure", "total"]
failure_day

In [None]:
# Plot the number of failures by day (the last row is the "All" margin, so drop it)
failure_day[:-1].reset_index().plot(x = "date", y = "failure",
                                    figsize = (15, 5),
                                    title = "Number of Failed Devices by Day")

In [74]:
# Distribution of failure/non-failure by device type - are there specific devices that tend to fail more than others?
failure_device = pd.crosstab(index = device_data["device_type"],
                             columns = device_data["failure"],
                             margins = True)
failure_device.columns = ["non_failure", "failure", "total"]

print (failure_device["failure"] / failure_device["total"])

failure_device

device_type
S      0.000766
W      0.000971
Z      0.000834
All    0.000851
dtype: float64


device_type,non_failure,failure,total
S,54816,42,54858
W,43226,42,43268
Z,26346,22,26368
All,124388,106,124494


In [52]:
# Are there particular days of the week when a higher percent of devices fail?
failure_day_week = pd.crosstab(index = device_data["day_of_week"],
                               columns = device_data["failure"],
                               margins = True)
failure_day_week

day_of_week,failure=0,failure=1,All
0,17859,27,17886
1,17516,18,17534
2,17121,15,17136
3,18119,22,18141
4,18029,12,18041
5,17889,8,17897
6,17855,4,17859
All,124388,106,124494
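
Since the question is about the percent of devices failing, it helps to turn the counts above into rates. The sketch below rebuilds the table from the counts printed above (rather than from `device_data`) and adds a `failure_pct` column; the column names are my own choice.

```python
import pandas as pd

# Counts copied from the crosstab above (pandas dayofweek: 0 = Monday ... 6 = Sunday).
counts = pd.DataFrame(
    {"non_failure": [17859, 17516, 17121, 18119, 18029, 17889, 17855],
     "failure": [27, 18, 15, 22, 12, 8, 4]},
    index=pd.Index(range(7), name="day_of_week"))

# Failure rate per day of the week, in percent.
counts["total"] = counts["non_failure"] + counts["failure"]
counts["failure_pct"] = counts["failure"] / counts["total"] * 100

print(counts["failure_pct"].round(3))
```

On these counts the rate is highest on Monday and lowest on Sunday, although with only 106 failures in total the differences should be read with caution.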


In [None]:
# Build a decision tree to understand how the data should be split

# http://scikit-learn.org/stable/modules/tree.html
# http://chrisstrelioff.ws/sandbox/2015/06/08/decision_trees_in_python_with_scikit_learn_and_pandas.html
# Example of decision tree models with visualizations: http://dataaspirant.com/2017/02/01/decision-tree-algorithm-python-with-scikit-learn/
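
A minimal sketch of what this step could look like, assuming a shallow `DecisionTreeClassifier` with `class_weight="balanced"` to cope with the imbalance. It runs on made-up data where the failure rule is planted on `attribute2`, so the data, the rule and the hyperparameters are all illustrative assumptions rather than results from the device data.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Made-up data: three attribute columns with random integer values.
rng = np.random.RandomState(0)
n = 1000
toy = pd.DataFrame({
    "attribute1": rng.randint(0, 1000, n),
    "attribute2": rng.randint(0, 100, n),
    "attribute3": rng.randint(0, 10, n),
})

# Planted rule (an assumption for the demo): failures depend on attribute2 only.
toy["failure"] = (toy["attribute2"] > 90).astype(int)

features = ["attribute1", "attribute2", "attribute3"]

# A shallow tree keeps the splits readable; "balanced" upweights the rare class.
clf = DecisionTreeClassifier(max_depth=3, class_weight="balanced", random_state=0)
clf.fit(toy[features], toy["failure"])

# Rank the features by how much they contribute to the splits.
importance = pd.Series(clf.feature_importances_, index=features)
importance = importance.sort_values(ascending=False)
print(importance)
```

On the real data, the same ranking could be compared with the other variable-reduction methods listed in the plan.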

