In this notebook, I'll explore some predictive models and look at which features can be used to predict failing drives. I'll finish by visualizing those features, showing what impending failure looks like.

We start by loading the data.

In [None]:
import numpy as np
import pandas as pd

hdd = pd.read_csv('../input/harddrive.csv')

hdd.shape

Show the first rows.

In [None]:
hdd.head()

Now let's show some basic info. As we'll see below, there are few rows for failing drives and many rows for non-failing ones. That's something we should consider in our modeling efforts!

In [None]:
import seaborn as sns

print(hdd.groupby('failure').size())

sns.countplot(x="failure", data=hdd)
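
To put a number on the imbalance, here is a minimal sketch on a made-up `failure` column (the counts are invented for illustration, not taken from the dataset). The ratio of non-failing to failing rows is one rough starting point when picking class weights later on.

```python
import pandas as pd

# Toy stand-in for the `failure` column: 1 marks a row where the drive failed.
failure = pd.Series([0] * 995 + [1] * 5, name="failure")

counts = failure.value_counts()

# Non-failing rows per failing row; a rough first guess for a class weight.
imbalance_ratio = counts[0] / counts[1]

print(counts)
print("non-failing rows per failing row: %.0f" % imbalance_ratio)
```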

Now let's get rid of some less interesting columns.

In [None]:
## Drop any constant-value columns
## Takes too long :-(
#for i in hdd.columns:
#    if len(hdd.loc[:,i].unique()) == 1:
#        hdd.drop(i, axis=1, inplace=True)

# Drop the normalized columns (DataFrame.select was removed in pandas 1.0,
# so filter the column names directly)
hdd = hdd.loc[:, [c for c in hdd.columns if not c.endswith('normalized')]]

hdd.shape
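
The constant-column check in the commented-out loop can probably be done without dropping columns one at a time. Below is a sketch on a toy frame (the column values are made up): it counts distinct values per column with `nunique` and drops all constant columns in a single call. I haven't timed this on the full dataset, so treat the speed-up as an assumption.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `hdd`; the values are invented for illustration.
toy = pd.DataFrame({
    "smart_1_raw": [10, 12, 11],
    "smart_2_raw": [0, 0, 0],                 # constant
    "smart_3_raw": [np.nan, np.nan, np.nan],  # all missing
})

# dropna=False counts NaN as a value, matching len(col.unique()) == 1
# from the original loop.
n_distinct = toy.nunique(dropna=False)
constant_cols = n_distinct[n_distinct == 1].index

# Drop every constant column in one call instead of once per column.
toy = toy.drop(columns=constant_cols)
print(list(toy.columns))
```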

Prep the data to build a model.

In [None]:
X = hdd.drop(['date', 'serial_number', 'model', 'capacity_bytes', 'failure'], axis=1)[:100000]
y = hdd['failure'][:100000]

# Fill missing values with 0; assigning the result avoids modifying a slice in place
X = X.fillna(value=0)

Now let's make a model. I'm not aiming for a particularly good model! I'm just interested in finding the columns that are useful for prediction, so I can plot them later.

In [None]:
from sklearn.metrics import classification_report
from sklearn.tree import DecisionTreeClassifier

##
## Commented out because it runs long. This searches for good class weights to deal
## with the imbalanced class distribution.
##
# from sklearn.model_selection import GridSearchCV
#
# gsc = GridSearchCV(
#      estimator=DecisionTreeClassifier(min_samples_split=20, min_samples_leaf=20),
#      param_grid={
#          'class_weight': [{0: 1, 1: x} for x in range(150, 251, 25)]
#      },
#      scoring='f1',
#      cv=5
# )
#
# grid_result = gsc.fit(X, y)
#
# print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

tree = DecisionTreeClassifier(min_samples_split=20, min_samples_leaf=20, class_weight={0: 1, 1: 175})
tree.fit(X, y)
# Note: this report is computed on the same rows the tree was trained on
print(classification_report(y, tree.predict(X)))
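
Since the report above scores the tree on its own training rows, a held-out split would give a more honest picture. This is a minimal sketch on synthetic data (`X_demo` and `y_demo` are made up, not the drive data), reusing the same tree settings and assuming a stratified 70/30 split is acceptable.

```python
import numpy as np
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for X and y (not the real drive data).
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(2000, 3))
y_demo = (X_demo[:, 0] > 1.5).astype(int)  # rare positive class

# stratify keeps the share of positives similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

demo_tree = DecisionTreeClassifier(
    min_samples_split=20, min_samples_leaf=20, class_weight={0: 1, 1: 175}
)
demo_tree.fit(X_train, y_train)

# Score only on rows the tree has not seen during fitting.
print(classification_report(y_test, demo_tree.predict(X_test)))
```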

So the model manages to predict "something" (bear in mind that the report is computed on the training rows). Now let's check which features it used!

In [None]:
for feat, imp in zip(X.columns, tree.feature_importances_):
    if imp > 0.0001:
        print("- %s  => %.3f" % (feat, imp))
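
The same listing is easier to read when sorted. Here is a sketch on a tiny made-up dataset (the `noise` and `signal` columns are invented) that wraps the importances in a pandas Series and sorts them, keeping the same 0.0001 cut-off as above.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Tiny made-up dataset: only `signal` separates the two classes.
X_demo = pd.DataFrame({
    "noise": [0, 1, 0, 1, 0, 1, 0, 1],
    "signal": [0, 0, 0, 0, 5, 6, 7, 8],
})
y_demo = [0, 0, 0, 0, 1, 1, 1, 1]

demo_tree = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

# Pair each importance with its column name, filter, then sort largest first.
importances = pd.Series(demo_tree.feature_importances_, index=X_demo.columns)
ranked = importances[importances > 0.0001].sort_values(ascending=False)
print(ranked)
```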

Now we plot some of the relevant variables for one particular drive...

In [None]:
one_drive = hdd[hdd['serial_number'] == 'S30114J3']

one_drive['smart_197_raw'].plot()
one_drive['smart_198_raw'].plot()
one_drive['failure'].plot()

In the chart above, the green line is the failure indicator: on the last day you can see the drive failed. The orange line is the variable smart_197_raw, which clearly changes value several days before the failure.

For Seagate drives, these SMART values mean:
- 197 current_pending_sectors
- 198 offline_uncorrectable

So from the description it is also plausible that these SMART indicators predict failure.
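
To put a number on "several days before the failure", one could measure the gap between the first change in smart_197_raw and the failure date. The readings below are made up for a single hypothetical drive, and "first change from the initial value" is just one possible definition of a warning.

```python
import pandas as pd

# Made-up daily readings for one drive (one row per day).
drive = pd.DataFrame({
    "date": pd.date_range("2016-01-01", periods=8, freq="D"),
    "smart_197_raw": [0, 0, 0, 0, 8, 8, 16, 16],
    "failure": [0, 0, 0, 0, 0, 0, 0, 1],
})

# First day the counter leaves its initial value, and the day the drive failed.
baseline = drive["smart_197_raw"].iloc[0]
first_change = drive.loc[drive["smart_197_raw"] != baseline, "date"].min()
failure_day = drive.loc[drive["failure"] == 1, "date"].min()

lead_days = (failure_day - first_change).days
print("warning lead time: %d days" % lead_days)
```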

I had fun figuring this out from the data, and I hope this is interesting to someone!