## Predicting drive failure with XGBoost and RAPIDS

**Dataset**: Hard disk SMART data and failure dataset from Backblaze ([More information](https://www.backblaze.com/b2/hard-drive-test-data.html))

**Task**: Predict hard disk failure with RAPIDS

In [1]:
import time
import numpy as np
import pandas as pd
from tqdm import tqdm_notebook

from sklearn.metrics import mean_squared_error, accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# RAPIDS
import cudf
import xgboost as xgb

### 1. Load Data

#### Training Data

Use Pandas to load training data from CSV. This consists of the drive data from July and August 2018.

In [2]:
df = pd.read_csv("merged.csv")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9073 entries, 0 to 9072
Columns: 283 entries, failure to feat279
dtypes: float64(280), int64(3)
memory usage: 19.6 MB


Split into features (`df_train`) and labels (`df_train_target`), where each is a Pandas `DataFrame`.

In [3]:
df_train = df.drop(["failure"],axis=1).apply(pd.to_numeric)
df_train_target = pd.DataFrame(df["failure"]).apply(pd.to_numeric)

#### Evaluation Data

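The evaluation data, which consists of the drive data from September 2018, would be loaded from CSV in the same way. The cell below is wrapped in a string literal, so it is disabled and does not run in this notebook.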

In [4]:
"""
df_t = pd.read_csv("eval.csv")

# fill in null data
df_t.fillna(0, inplace=True)

df_test = df_t.drop(["failure"],axis=1).apply(pd.to_numeric)
df_test_target = pd.DataFrame(df_t["failure"]).apply(pd.to_numeric)
"""

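The evaluation cell above is disabled, so this notebook reports no held-out score. As a rough sketch of what scoring could look like once `eval.csv` is loaded: the training cells in this notebook do not set an `objective`, so the model outputs continuous scores that need a threshold before they become failure or no-failure predictions. The helper names `confusion_counts` and `summarize`, the 0.5 threshold, and the toy data are assumptions for illustration and are not part of the original notebook.

```python
def confusion_counts(scores, labels, threshold=0.5):
    """Count TP/FP/TN/FN after thresholding continuous scores."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def summarize(scores, labels, threshold=0.5):
    """Derive accuracy, precision and recall from the confusion counts."""
    tp, fp, tn, fn = confusion_counts(scores, labels, threshold)
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Toy scores and labels (hypothetical stand-ins for bst.predict(...) output
# and the "failure" column of the evaluation data).
toy_scores = [0.92, 0.10, 0.55, 0.30, 0.05, 0.70]
toy_labels = [1, 0, 0, 1, 0, 1]
metrics = summarize(toy_scores, toy_labels)
print(metrics)
```

If failures are rare in the evaluation set, accuracy alone can look high even when most failures are missed, so precision and recall are worth reporting alongside it.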

In [5]:
MAX_TREE_DEPTH = 24

### 2. Train with CPU

XGBoost training on the CPU (`params['tree_method'] = 'hist'`), using a Pandas `DataFrame` loaded into `xgb.DMatrix`. For more information, check out [this page in the XGBoost Documentation](https://xgboost.readthedocs.io/en/latest/python/python_intro.html).
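The training runs in this notebook use the same parameter dictionary and differ only in `tree_method`. A small helper can make that explicit and keep the runs comparable. This is a sketch: the function name `make_params` is hypothetical, and the optional `objective` argument is an assumption for readers who want to treat failure prediction as binary classification (`binary:logistic` is a standard XGBoost objective). The original cells leave the objective at its default, which is why the logs report `train-rmse`.

```python
MAX_TREE_DEPTH = 24  # same value as the notebook's In [5] cell


def make_params(tree_method, max_depth=MAX_TREE_DEPTH, learning_rate=0.1,
                objective=None):
    """Build an XGBoost parameter dict that varies only in tree_method."""
    # Only the two methods compared in this notebook are accepted here.
    if tree_method not in ("hist", "gpu_hist"):
        raise ValueError("unexpected tree_method: %r" % tree_method)
    params = {
        "tree_method": tree_method,
        "max_depth": max_depth,
        "learning_rate": learning_rate,
        "silent": 1,  # kept from the notebook; newer XGBoost uses 'verbosity'
    }
    if objective is not None:
        # Assumption: the original notebook does not set an objective.
        params["objective"] = objective
    return params


cpu_params = make_params("hist")
gpu_params = make_params("gpu_hist")
# Collect the keys whose values differ between the two runs.
differing = {k for k in cpu_params if cpu_params[k] != gpu_params[k]}
print(differing)
```

Either dictionary can then be passed as the first argument to `xgb.train`, as the training cells do with their inline `params`.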

In [6]:
!lscpu | grep 'Model name:'

Model name:          Intel(R) Xeon(R) CPU @ 2.50GHz


In [7]:
start_time = time.time()

xgtrain = xgb.DMatrix(df_train, df_train_target)

params = {'tree_method': 'hist', 'max_depth': MAX_TREE_DEPTH, 'learning_rate': 0.1, 'silent': 1}
bst = xgb.train(params, xgtrain, 50, evals=[(xgtrain, "train")])

timetaken_cpu = time.time() - start_time

# free up memory
del xgtrain
del bst

[14:30:32] Tree method is selected to be 'hist', which uses a single updater grow_fast_histmaker.
[0]	train-rmse:0.458602
[1]	train-rmse:0.422086
[2]	train-rmse:0.388583
[3]	train-rmse:0.359051
[4]	train-rmse:0.32972
[5]	train-rmse:0.304788
[6]	train-rmse:0.281717
[7]	train-rmse:0.260684
[8]	train-rmse:0.242567
[9]	train-rmse:0.22425
[10]	train-rmse:0.209296
[11]	train-rmse:0.193687
[12]	train-rmse:0.180069
[13]	train-rmse:0.166977
[14]	train-rmse:0.155834
[15]	train-rmse:0.14548
[16]	train-rmse:0.136427
[17]	train-rmse:0.128002
[18]	train-rmse:0.119581
[19]	train-rmse:0.11277
[20]	train-rmse:0.107372
[21]	train-rmse:0.101577
[22]	train-rmse:0.096232
[23]	train-rmse:0.09021
[24]	train-rmse:0.085917
[25]	train-rmse:0.082956
[26]	train-rmse:0.078009
[27]	train-rmse:0.075276
[28]	train-rmse:0.071852
[29]	train-rmse:0.068625
[30]	train-rmse:0.06612
[31]	train-rmse:0.064061
[32]	train-rmse:0.061051
[33]	train-rmse:0.059857
[34]	train-rmse:0.056776
[35]	train-rmse:0.053911
[36]	train-rmse:0.
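The cell above times training with paired `time.time()` calls. One alternative, sketched here, is a small context manager built on `time.perf_counter()`, a clock intended for measuring short durations. The names `Timer` and `timings` are hypothetical, and the `sum(range(...))` workload is only a stand-in for `xgb.DMatrix(...)` followed by `xgb.train(...)`.

```python
import time


class Timer:
    """Context manager that records elapsed wall-clock seconds."""

    def __enter__(self):
        self._start = time.perf_counter()
        self.elapsed = None
        return self

    def __exit__(self, exc_type, exc, tb):
        self.elapsed = time.perf_counter() - self._start
        return False  # do not swallow exceptions from the timed block


timings = {}
with Timer() as t:
    # Stand-in workload; in the notebook this would be DMatrix creation
    # followed by xgb.train(...).
    total = sum(range(100_000))
timings["cpu"] = t.elapsed
print("elapsed seconds:", round(timings["cpu"], 4))
```

As in the notebook, anything placed inside the timed block counts toward the result, so wrapping `xgb.DMatrix(...)` as well as `xgb.train(...)` measures data conversion plus boosting.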

In [8]:
start_time = time.time()

xgtrain = xgb.DMatrix(df_train, df_train_target)

params = {'tree_method': 'gpu_hist', 'max_depth': MAX_TREE_DEPTH, 'learning_rate': 0.1, 'silent': 1}
bst = xgb.train(params, xgtrain, 50, evals=[(xgtrain, "train")])

timetaken_gpu_nocudf = time.time() - start_time

# free up memory
del xgtrain
del bst

[0]	train-rmse:0.458601
[1]	train-rmse:0.422046
[2]	train-rmse:0.388733
[3]	train-rmse:0.359053
[4]	train-rmse:0.331591
[5]	train-rmse:0.306396
[6]	train-rmse:0.28309
[7]	train-rmse:0.262217
[8]	train-rmse:0.24321
[9]	train-rmse:0.225444
[10]	train-rmse:0.209671
[11]	train-rmse:0.195425
[12]	train-rmse:0.183419
[13]	train-rmse:0.170999
[14]	train-rmse:0.160371
[15]	train-rmse:0.150939
[16]	train-rmse:0.140717
[17]	train-rmse:0.132857
[18]	train-rmse:0.124626
[19]	train-rmse:0.117563
[20]	train-rmse:0.111101
[21]	train-rmse:0.104307
[22]	train-rmse:0.099047
[23]	train-rmse:0.093748
[24]	train-rmse:0.088491
[25]	train-rmse:0.084197
[26]	train-rmse:0.080656
[27]	train-rmse:0.076897
[28]	train-rmse:0.072605
[29]	train-rmse:0.070083
[30]	train-rmse:0.067563
[31]	train-rmse:0.064898
[32]	train-rmse:0.062391
[33]	train-rmse:0.059701
[34]	train-rmse:0.056999
[35]	train-rmse:0.055053
[36]	train-rmse:0.054009
[37]	train-rmse:0.052124
[38]	train-rmse:0.049648
[39]	train-rmse:0.047926
[40]	train-r
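The CPU (`hist`) and GPU (`gpu_hist`) logs above track each other closely but are not identical, for example 0.422086 versus 0.422046 at iteration 1. To compare runs programmatically, the printed lines can be parsed into a per-iteration mapping. This is a minimal sketch: `parse_eval_log` is a hypothetical helper, and the regular expression assumes the `[iteration]`, whitespace, `name:value` layout shown in the output. Lines that are truncated in the captured output do not match and are skipped rather than guessed.

```python
import re

# Matches lines such as "[3]<tab>train-rmse:0.359051".
LINE_RE = re.compile(r"^\[(\d+)\]\s+([\w-]+):(\d+\.\d+)$")


def parse_eval_log(text):
    """Return {iteration: value} for every well-formed log line."""
    values = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            values[int(match.group(1))] = float(match.group(3))
    return values


# Short excerpts of the two logs, including one truncated line each.
cpu_log = "[0]\ttrain-rmse:0.458602\n[1]\ttrain-rmse:0.422086\n[36]\ttrain-rmse:0."
gpu_log = "[0]\ttrain-rmse:0.458601\n[1]\ttrain-rmse:0.422046\n[40]\ttrain-r"

cpu = parse_eval_log(cpu_log)
gpu = parse_eval_log(gpu_log)
# Absolute gap at each iteration present in both runs.
gaps = {i: abs(cpu[i] - gpu[i]) for i in cpu.keys() & gpu.keys()}
print(gaps)
```

When re-running the notebook, passing a dictionary as the `evals_result` argument of `xgb.train` records the same values directly, without parsing text.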

### 3. Train with GPU

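To train on the GPU, we set `params['tree_method'] = 'gpu_hist'`. The cell above (In [8]) already does this with the Pandas `DataFrame` as input. Optionally, we can also load the Pandas `DataFrame` into a cuDF DataFrame (Python object type `cudf.dataframe.dataframe.DataFrame`) before building the `xgb.DMatrix`, as the following cells do.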

In [9]:
# load into cuDF Dataframe

gdf_train = cudf.DataFrame.from_pandas(df_train)
gdf_train_target = cudf.DataFrame.from_pandas(df_train_target)

In [10]:
start_time = time.time()

xgtrain = xgb.DMatrix(gdf_train, gdf_train_target)

params = {'tree_method': 'gpu_hist', 'max_depth': MAX_TREE_DEPTH, 'learning_rate': 0.1, 'silent': 1}
bst = xgb.train(params, xgtrain, 50, evals=[(xgtrain, "train")])

timetaken_gpu = time.time() - start_time

[0]	train-rmse:0.458601
[1]	train-rmse:0.422046
[2]	train-rmse:0.388733
[3]	train-rmse:0.359053
[4]	train-rmse:0.331591
[5]	train-rmse:0.306396
[6]	train-rmse:0.28309
[7]	train-rmse:0.262217
[8]	train-rmse:0.24321
[9]	train-rmse:0.225444
[10]	train-rmse:0.209671
[11]	train-rmse:0.195425
[12]	train-rmse:0.183419
[13]	train-rmse:0.170999
[14]	train-rmse:0.160371
[15]	train-rmse:0.150939
[16]	train-rmse:0.140717
[17]	train-rmse:0.132857
[18]	train-rmse:0.124626
[19]	train-rmse:0.117563
[20]	train-rmse:0.111101
[21]	train-rmse:0.104307
[22]	train-rmse:0.099047
[23]	train-rmse:0.093748
[24]	train-rmse:0.088491
[25]	train-rmse:0.084197
[26]	train-rmse:0.080656
[27]	train-rmse:0.076897
[28]	train-rmse:0.072605
[29]	train-rmse:0.070083
[30]	train-rmse:0.067563
[31]	train-rmse:0.064898
[32]	train-rmse:0.062391
[33]	train-rmse:0.059701
[34]	train-rmse:0.056999
[35]	train-rmse:0.055053
[36]	train-rmse:0.054009
[37]	train-rmse:0.052124
[38]	train-rmse:0.049648
[39]	train-rmse:0.047926
[40]	train-r

In [11]:
print("Check GPU memory usage")
!gpustat

# free up memory
del xgtrain
del bst

Check GPU memory usage
[1m[37mfe29fe2f1b6e[m  Sat Jan 19 14:31:53 2019
[36m[0][m [34mTesla V100-SXM2-16GB[m |[31m 38'C[m, [32m  0 %[m | [36m[1m[33m 1495[m / [33m16130[m MB |


### 4. Results

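We see a significant speed-up when we train on the GPU: about 17 seconds versus 44.6 seconds on the CPU, roughly 2.6 times faster. In this run, loading the data into cuDF first (17.1 s) gave no improvement over passing the Pandas `DataFrame` directly (16.6 s). The final figure printed below, 261.2 %, is the CPU time expressed as a percentage of the GPU (cuDF) time.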

In [12]:
print("CPU Time Taken:", round(timetaken_cpu,1))
print("GPU (no cuDF) Time Taken:", round(timetaken_gpu_nocudf,1))
print("GPU (cuDF) Time Taken:", round(timetaken_gpu,1))
print("Total speed-up with RAPIDS:", round(timetaken_cpu/timetaken_gpu*100,1), "%")

CPU Time Taken: 44.6
GPU (no cuDF) Time Taken: 16.6
GPU (cuDF) Time Taken: 17.1
Total speed-up with RAPIDS: 261.2 %
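
The last printed figure is the CPU time divided by the GPU (cuDF) time, multiplied by 100. The same measurement can be stated as a ratio or as a reduction in wall-clock time, which some readers may find easier to interpret. The sketch below uses the rounded timings printed above, so its results can differ slightly from values computed from the unrounded variables. The function names are hypothetical.

```python
def speedup_ratio(baseline_s, accelerated_s):
    """How many times faster the accelerated run is (baseline / accelerated)."""
    return baseline_s / accelerated_s


def time_reduction_pct(baseline_s, accelerated_s):
    """Percentage of the baseline wall-clock time that was saved."""
    return (1.0 - accelerated_s / baseline_s) * 100.0


# Rounded timings in seconds, copied from the printed output.
cpu_s, gpu_nocudf_s, gpu_cudf_s = 44.6, 16.6, 17.1

for label, gpu_s in [("GPU (no cuDF)", gpu_nocudf_s), ("GPU (cuDF)", gpu_cudf_s)]:
    print(label,
          "ratio: %.2fx" % speedup_ratio(cpu_s, gpu_s),
          "time saved: %.1f%%" % time_reduction_pct(cpu_s, gpu_s))
```

Reporting both forms avoids the ambiguity of a bare percentage, which could be read either as a ratio or as a reduction.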
