## Predicting drive failure with XGBoost and RAPIDS

**Dataset**: Hard disk SMART data and failure dataset from Backblaze ([More information](https://www.backblaze.com/b2/hard-drive-test-data.html))

**Task**: Predict hard disk failure with RAPIDS

In [None]:
import time
import itertools
import numpy as np
import pandas as pd
from tqdm import tqdm_notebook

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt
%matplotlib inline

# RAPIDS
import cudf
import xgboost as xgb

### 1. Load Data

#### Training Data

Use Pandas to load training data from CSV ([download link](https://s3-ap-southeast-1.amazonaws.com/deeplearning-iap-material/hdd_test_data/train.csv)). This consists of the pre-processed drive data from **January 2015** to **September 2018**.

In [None]:
!wget -nc https://s3-ap-southeast-1.amazonaws.com/deeplearning-mat/hdd_test_data/train.csv

In [None]:
df = pd.read_csv("train.csv")
df.info()

Split into features (`df_train`) and labels (`df_train_target`), where each is a Pandas `DataFrame`.

In [None]:
df_train = df.drop(["failure"],axis=1).apply(pd.to_numeric).astype(np.float32)
df_train_target = pd.DataFrame(df["failure"]).apply(pd.to_numeric)

#### Evaluation Data

Do the same to load evaluation data from CSV ([download link](https://s3-ap-southeast-1.amazonaws.com/deeplearning-iap-material/hdd_test_data/eval.csv)), if you have a separate file to load.

In our case, this consists of the pre-processed drive data from **October 2017 to December 2018**.

In [None]:
!wget -nc https://s3-ap-southeast-1.amazonaws.com/deeplearning-mat/hdd_test_data/eval.csv

In [None]:
df_t = pd.read_csv("eval.csv")
df_t.info()

In [None]:
df_test = df_t.drop(["failure"],axis=1).apply(pd.to_numeric).astype(np.float32)
df_test_target = pd.DataFrame(df_t["failure"]).apply(pd.to_numeric)

### 2. Model Parameters

In [None]:
MAX_TREE_DEPTH = 8
TREE_METHOD = 'hist'
ITERATIONS = 85
SUBSAMPLE = 0.6
REGULARIZATION = 1.3
GAMMA = 0.3
POS_WEIGHT = 1
EARLY_STOP = 16
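
Each training cell in this notebook builds the same parameter dictionary from these constants and changes only `tree_method`. As a minimal sketch, the mapping from constants to XGBoost parameter names could be collected in one place. `build_params` is a hypothetical helper that is not part of the original notebook, and the `learning_rate` value is copied from the training cells.

```python
# Constants as defined in the cell above.
MAX_TREE_DEPTH = 8
SUBSAMPLE = 0.6
REGULARIZATION = 1.3
GAMMA = 0.3
POS_WEIGHT = 1

def build_params(tree_method):
    """Hypothetical helper: map the notebook constants to XGBoost parameter names."""
    return {
        'tree_method': tree_method,      # 'hist' on CPU, 'gpu_hist' on GPU
        'max_depth': MAX_TREE_DEPTH,     # maximum depth of each tree
        'alpha': REGULARIZATION,         # L1 regularization on leaf weights
        'gamma': GAMMA,                  # minimum loss reduction needed to split a node
        'subsample': SUBSAMPLE,          # fraction of training rows sampled per round
        'scale_pos_weight': POS_WEIGHT,  # relative weight of the positive (failure) class
        'learning_rate': 0.05,           # value used in the training cells
    }

cpu_params = build_params('hist')
gpu_params = build_params('gpu_hist')
```

With a helper like this, the CPU and GPU runs are guaranteed to differ only in `tree_method`, which keeps the timing comparison fair.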

### 3. Train with CPU

XGBoost training with CPU (`params['tree_method'] = 'hist'`), using a Pandas `DataFrame` loaded into `xgb.DMatrix`. For more information, check out [this page in the XGBoost Documentation](https://xgboost.readthedocs.io/en/latest/python/python_intro.html).

As we can see, training with even a high-end Intel Xeon CPU is pretty slow!

In [None]:
!lscpu | grep 'Model name:'
!lscpu | grep 'CPU(s)'

In [None]:
start_time = time.time()

xgtrain = xgb.DMatrix(df_train, df_train_target)
xgeval = xgb.DMatrix(df_test, df_test_target)

params = {'tree_method': TREE_METHOD, 'max_depth': MAX_TREE_DEPTH, 'alpha': REGULARIZATION,
          'gamma': GAMMA, 'subsample': SUBSAMPLE, 'scale_pos_weight': POS_WEIGHT, 'learning_rate': 0.05, 'silent': 1}

bst = xgb.train(params, xgtrain, ITERATIONS, evals=[(xgtrain, "train"), (xgeval, "eval")],
                early_stopping_rounds=EARLY_STOP)

timetaken_cpu = time.time() - start_time

# free up memory
del xgtrain
del bst
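
Passing `early_stopping_rounds=EARLY_STOP` stops training once the evaluation metric has failed to improve for `EARLY_STOP` consecutive rounds. When `evals` has several entries, XGBoost monitors the last one (here `"eval"`). The sketch below illustrates that rule in plain Python on a made-up list of per-round losses. `rounds_until_stop` is a hypothetical helper, not an XGBoost function, and it assumes a metric where lower is better.

```python
def rounds_until_stop(losses, patience):
    """Return (rounds run, index of best round) under a patience-based stopping rule."""
    best_loss = float('inf')
    best_round = -1
    for i, loss in enumerate(losses):
        if loss < best_loss:
            # new best score: remember it and reset the patience window
            best_loss, best_round = loss, i
        elif i - best_round >= patience:
            # no improvement for `patience` rounds in a row: stop here
            return i + 1, best_round
    return len(losses), best_round

# Made-up evaluation losses, one per boosting round.
losses = [0.70, 0.60, 0.55, 0.56, 0.57, 0.58, 0.54]
rounds_run, best_round = rounds_until_stop(losses, patience=3)
print(rounds_run, best_round)
```

In this toy run, training stops after six rounds with the best score at round index 2, so the later improvement to 0.54 is never seen. That is the trade-off a small patience value makes.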

### 4. Train with GPU

To use the GPU, we set `params['tree_method'] = 'gpu_hist'`.

In [None]:
# GPU, without using cuDF

start_time = time.time()

xgtrain = xgb.DMatrix(df_train, df_train_target)
xgeval = xgb.DMatrix(df_test, df_test_target)

params = {'tree_method': "gpu_"+TREE_METHOD, 'max_depth': MAX_TREE_DEPTH, 'alpha': REGULARIZATION,
          'gamma': GAMMA, 'subsample': SUBSAMPLE, 'scale_pos_weight': POS_WEIGHT, 'learning_rate': 0.05, 'silent': 1}

bst = xgb.train(params, xgtrain, ITERATIONS, evals=[(xgtrain, "train"), (xgeval, "eval")],
                early_stopping_rounds=EARLY_STOP)

timetaken_gpu_nocudf = time.time() - start_time

# free up memory
del xgtrain
del bst

Use the full RAPIDS stack by combining XGBoost with cuDF for additional speed-up. To do this, we load each Pandas `DataFrame` into a cuDF DataFrame (Python object type `cudf.dataframe.dataframe.DataFrame`).

In [None]:
# load into cuDF Dataframe

gdf_train = cudf.DataFrame.from_pandas(df_train)
gdf_train_target = cudf.DataFrame.from_pandas(df_train_target)

gdf_eval = cudf.DataFrame.from_pandas(df_test)
gdf_eval_target = cudf.DataFrame.from_pandas(df_test_target)

In [None]:
# GPU, using cuDF

start_time = time.time()

xgtrain = xgb.DMatrix(gdf_train, gdf_train_target)
xgeval = xgb.DMatrix(gdf_eval, gdf_eval_target)

params = {'tree_method': "gpu_"+TREE_METHOD, 'max_depth': MAX_TREE_DEPTH, 'alpha': REGULARIZATION,
          'gamma': GAMMA, 'subsample': SUBSAMPLE, 'scale_pos_weight': POS_WEIGHT, 'learning_rate': 0.05, 'silent': 1}

bst = xgb.train(params, xgtrain, ITERATIONS, evals=[(xgtrain, "train"), (xgeval, "eval")],
                early_stopping_rounds=EARLY_STOP)

timetaken_gpu = time.time() - start_time

In [None]:
print("Check GPU memory usage")
!gpustat

### 5. Results

We see a significant speed-up when we use the RAPIDS stack.

In [None]:
print("CPU Time Taken:\n", round(timetaken_cpu,1))
print("\nGPU (no cuDF) Time Taken:\n", round(timetaken_gpu_nocudf,1))
print("\nGPU (cuDF) Time Taken:\n", round(timetaken_gpu,1))
print("\nTotal speed-up with RAPIDS:\n", round(timetaken_cpu/timetaken_gpu,1), "x")
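
A speed-up is usually reported as the ratio of baseline time to accelerated time, which is a different quantity from the percentage of time saved. The sketch below shows both on made-up timings (these are not measurements from this notebook). `speedup` and `percent_time_saved` are hypothetical helper names.

```python
def speedup(baseline_seconds, accelerated_seconds):
    """How many times faster the accelerated run is than the baseline."""
    return baseline_seconds / accelerated_seconds

def percent_time_saved(baseline_seconds, accelerated_seconds):
    """Share of the baseline wall-clock time that the accelerated run avoids."""
    return 100.0 * (1.0 - accelerated_seconds / baseline_seconds)

# Made-up example timings in seconds.
example_cpu, example_gpu = 120.0, 15.0

print("Speed-up:", round(speedup(example_cpu, example_gpu), 1), "x")
print("Time saved:", round(percent_time_saved(example_cpu, example_gpu), 1), "%")
```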

Let's look at the model's performance on the evaluation set.

In [None]:
preds = bst.predict(xgeval)

THRESHOLD = 0.5

# label a drive as failing (1) when its predicted score exceeds the threshold
y_pred = (preds > THRESHOLD).astype(int)

y_true = df_test_target.values.reshape(len(preds))

In [None]:
print("Accuracy (Eval)", round(accuracy_score(y_true, y_pred),3))

In [None]:
print(classification_report(y_true, y_pred, target_names=["normal", "fail"]))
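
`classification_report` lists precision and recall for each class. As a rough illustration of where those figures come from, the sketch below derives them for the "fail" class from raw counts, using made-up labels rather than the notebook's data. It also shows how accuracy can look healthy while the rarer class is predicted poorly.

```python
import numpy as np

# Made-up labels for illustration: 0 = normal, 1 = fail.
y_true_demo = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_pred_demo = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 0])

tp = int(np.sum((y_true_demo == 1) & (y_pred_demo == 1)))  # failures caught
fp = int(np.sum((y_true_demo == 0) & (y_pred_demo == 1)))  # false alarms
fn = int(np.sum((y_true_demo == 1) & (y_pred_demo == 0)))  # failures missed
tn = int(np.sum((y_true_demo == 0) & (y_pred_demo == 0)))  # healthy drives left alone

accuracy = (tp + tn) / len(y_true_demo)
precision = tp / (tp + fp)  # of the drives flagged as failing, how many really failed
recall = tp / (tp + fn)     # of the drives that failed, how many were flagged

print("accuracy:", accuracy, "precision:", precision, "recall:", recall)
```

Here accuracy is 0.8, yet only half of the failing drives are caught, which is why the per-class report is worth reading alongside the accuracy score.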

In [None]:
plt.style.use('seaborn-dark')
def plot_confusion_matrix(cm, labels,
                          normalize=True,
                          title='Confusion Matrix (Validation Set)',
                          cmap=plt.cm.Blues):
    """
    Plot the confusion matrix as a heatmap with a value in each cell.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(labels))
    plt.xticks(tick_marks, labels, rotation=45)
    plt.yticks(tick_marks, labels)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

plt.figure(figsize=(14,7))
cnf_matrix = confusion_matrix(y_true, y_pred)
plot_confusion_matrix(cnf_matrix, labels=["normal", "fail"])
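
With `normalize=True`, each row of the confusion matrix is divided by its row total, so every cell shows the fraction of a true class that was assigned to each predicted class. A small worked example with made-up counts (not results from this notebook):

```python
import numpy as np

# Made-up counts: rows are true labels, columns are predicted labels.
cm_demo = np.array([[90, 10],   # true "normal": 90 predicted normal, 10 predicted fail
                    [5, 15]])   # true "fail":    5 predicted normal, 15 predicted fail

# Divide each row by its total so that every row sums to 1.
cm_normalized = cm_demo.astype(float) / cm_demo.sum(axis=1)[:, np.newaxis]
print(cm_normalized)
```

After normalization, the diagonal holds the per-class recall (0.9 for "normal" and 0.75 for "fail" in this toy case), which makes classes of very different sizes directly comparable.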