## Predicting drive failure with XGBoost and RAPIDS

**Dataset**: Hard disk SMART data and failure dataset from Backblaze ([More information](https://www.backblaze.com/b2/hard-drive-test-data.html))

**Task**: Predict hard disk failure with RAPIDS

In [1]:
import time
import numpy as np
import pandas as pd
from tqdm import tqdm_notebook

from sklearn.metrics import mean_squared_error, accuracy_score

# RAPIDS
import cudf
import xgboost as xgb

### 1. Load Data

#### Training Data

Use pandas to load the training data from CSV. The file holds pre-processed drive data from February 2017 to September 2018.

In [2]:
df = pd.read_csv("merged.csv")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8545 entries, 0 to 8544
Columns: 283 entries, failure to feat279
dtypes: float64(280), int64(3)
memory usage: 18.4 MB


Split into features (`df_train`) and labels (`df_train_target`), each a pandas `DataFrame`.

In [3]:
df_train = df.drop(["failure"],axis=1).apply(pd.to_numeric)
df_train_target = pd.DataFrame(df["failure"]).apply(pd.to_numeric)

#### Evaluation Data

Load the evaluation data from CSV in the same way, if you have a separate file. The cell is wrapped in a string so that it does not run by default.

In [4]:
"""
df_t = pd.read_csv("eval.csv")

# fill in null data
df_t.fillna(0, inplace=True)

df_test = df_t.drop(["failure"],axis=1).apply(pd.to_numeric)
df_test_target = pd.DataFrame(df_t["failure"]).apply(pd.to_numeric)
"""

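If no separate evaluation file is available, one option is to hold out part of the training file instead. The sketch below is only an illustration: the helper name `holdout_split`, the 20% fraction, the seed and the toy data are assumptions, not part of the original notebook.

```python
import pandas as pd


def holdout_split(df, label_col="failure", eval_fraction=0.2, seed=0):
    """Split a DataFrame into train/eval features and labels (hypothetical helper)."""
    # Randomly pick the evaluation rows; the remaining rows are used for training.
    eval_rows = df.sample(frac=eval_fraction, random_state=seed)
    train_rows = df.drop(eval_rows.index)

    def split_xy(part):
        # Features are every column except the label; labels stay a one-column DataFrame.
        return part.drop([label_col], axis=1), part[[label_col]]

    return split_xy(train_rows), split_xy(eval_rows)


# Tiny made-up frame standing in for the real drive data
toy = pd.DataFrame({"failure": [0, 1] * 5, "feat0": range(10), "feat1": range(10, 20)})
(train_X, train_y), (eval_X, eval_y) = holdout_split(toy)
print(len(train_X), len(eval_X))
```

A purely random split like this ignores class balance and time ordering, so for real failure data a stratified or date-based split may be more appropriate.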

### 2. Train with CPU

XGBoost training on the CPU (`params['tree_method'] = 'hist'`), using a pandas `DataFrame` loaded into an `xgb.DMatrix`. For more information, check out [this page in the XGBoost Documentation](https://xgboost.readthedocs.io/en/latest/python/python_intro.html).

In [5]:
MAX_TREE_DEPTH = 24

In [6]:
!lscpu | grep 'Model name:'
!lscpu | grep 'CPU(s)'

Model name:          Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz
CPU(s):              40
On-line CPU(s) list: 0-39
NUMA node0 CPU(s):   0-39


In [7]:
start_time = time.time()

xgtrain = xgb.DMatrix(df_train, df_train_target)

params = {'tree_method': 'hist', 'max_depth': MAX_TREE_DEPTH, 'learning_rate': 0.1, 'silent': 1}
bst = xgb.train(params, xgtrain, 50, evals=[(xgtrain, "train")])

timetaken_cpu = time.time() - start_time

# free up memory
del xgtrain
del bst

[15:30:04] Tree method is selected to be 'hist', which uses a single updater grow_fast_histmaker.
[0]	train-rmse:0.458406
[1]	train-rmse:0.421363
[2]	train-rmse:0.388094
[3]	train-rmse:0.357538
[4]	train-rmse:0.328977
[5]	train-rmse:0.301982
[6]	train-rmse:0.27773
[7]	train-rmse:0.25579
[8]	train-rmse:0.236438
[9]	train-rmse:0.218327
[10]	train-rmse:0.202473
[11]	train-rmse:0.187359
[12]	train-rmse:0.174484
[13]	train-rmse:0.161807
[14]	train-rmse:0.151045
[15]	train-rmse:0.140964
[16]	train-rmse:0.132527
[17]	train-rmse:0.125665
[18]	train-rmse:0.117584
[19]	train-rmse:0.11015
[20]	train-rmse:0.103756
[21]	train-rmse:0.097196
[22]	train-rmse:0.090893
[23]	train-rmse:0.086196
[24]	train-rmse:0.082491
[25]	train-rmse:0.078661
[26]	train-rmse:0.075509
[27]	train-rmse:0.072039
[28]	train-rmse:0.070015
[29]	train-rmse:0.065965
[30]	train-rmse:0.062502
[31]	train-rmse:0.059484
[32]	train-rmse:0.057047
[33]	train-rmse:0.05503
[34]	train-rmse:0.051889
[35]	train-rmse:0.04986
[36]	train-rmse:0
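The log reports `train-rmse` because no `objective` is set, so XGBoost falls back to its default squared-error regression objective and the model outputs real-valued scores rather than class labels. To score such a model as a failure classifier, the scores can be thresholded. The sketch below shows only the mechanics: the 0.5 threshold, the helper names and the example numbers are assumptions, and `sklearn.metrics.accuracy_score` (imported at the top of the notebook) could replace the hand-written `accuracy`.

```python
import numpy as np


def scores_to_labels(scores, threshold=0.5):
    """Map regression-style scores to 0/1 failure predictions (threshold is an assumption)."""
    return (np.asarray(scores) >= threshold).astype(int)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))


# Made-up scores and labels, only to show the mechanics
example_scores = [0.03, 0.91, 0.48, 0.77]
example_truth = [0, 1, 1, 1]
example_pred = scores_to_labels(example_scores)
print(example_pred, accuracy(example_truth, example_pred))
```

In the notebook itself the scores would come from `bst.predict(...)` on a `DMatrix` built from held-out data, before `bst` is deleted.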

### 3. Train with GPU

To use the GPU, we set `params['tree_method'] = 'gpu_hist'` and, optionally, load the pandas `DataFrame` into a cuDF `DataFrame` (Python object type `cudf.dataframe.dataframe.DataFrame`).
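The CPU and GPU runs differ only in the `tree_method` entry of the parameter dictionary. A small helper makes that explicit. The name `make_params` and the `new_style` flag are assumptions added for illustration: XGBoost 2.0 and later deprecate `gpu_hist` in favour of `tree_method='hist'` combined with `device='cuda'`, which is what `new_style=True` produces. The helper omits the `silent` key used in the notebook cells.

```python
def make_params(use_gpu, max_depth=24, learning_rate=0.1, new_style=False):
    """Build an XGBoost parameter dict for CPU or GPU training (hypothetical helper)."""
    params = {"max_depth": max_depth, "learning_rate": learning_rate}
    if not use_gpu:
        # CPU histogram-based tree construction
        params["tree_method"] = "hist"
    elif new_style:
        # XGBoost >= 2.0: histogram method plus an explicit device
        params["tree_method"] = "hist"
        params["device"] = "cuda"
    else:
        # Older releases, as used in this notebook
        params["tree_method"] = "gpu_hist"
    return params


print(make_params(use_gpu=False))
print(make_params(use_gpu=True))
print(make_params(use_gpu=True, new_style=True))
```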

In [8]:
# run 1
# GPU, without using cuDF

start_time = time.time()

xgtrain = xgb.DMatrix(df_train, df_train_target)

params = {'tree_method': 'gpu_hist', 'max_depth': MAX_TREE_DEPTH, 'learning_rate': 0.1, 'silent': 1}
bst = xgb.train(params, xgtrain, 50, evals=[(xgtrain, "train")])

timetaken_gpu_nocudf = time.time() - start_time

# free up memory
del xgtrain
del bst

[0]	train-rmse:0.458405
[1]	train-rmse:0.421806
[2]	train-rmse:0.388041
[3]	train-rmse:0.356389
[4]	train-rmse:0.327884
[5]	train-rmse:0.301324
[6]	train-rmse:0.277336
[7]	train-rmse:0.255256
[8]	train-rmse:0.235924
[9]	train-rmse:0.217393
[10]	train-rmse:0.201276
[11]	train-rmse:0.186706
[12]	train-rmse:0.173898
[13]	train-rmse:0.160896
[14]	train-rmse:0.148913
[15]	train-rmse:0.138999
[16]	train-rmse:0.130038
[17]	train-rmse:0.122823
[18]	train-rmse:0.115481
[19]	train-rmse:0.108235
[20]	train-rmse:0.10212
[21]	train-rmse:0.096187
[22]	train-rmse:0.091343
[23]	train-rmse:0.087358
[24]	train-rmse:0.0826
[25]	train-rmse:0.079298
[26]	train-rmse:0.076284
[27]	train-rmse:0.073177
[28]	train-rmse:0.07026
[29]	train-rmse:0.067298
[30]	train-rmse:0.064111
[31]	train-rmse:0.060426
[32]	train-rmse:0.058145
[33]	train-rmse:0.055393
[34]	train-rmse:0.052997
[35]	train-rmse:0.050639
[36]	train-rmse:0.048696
[37]	train-rmse:0.046288
[38]	train-rmse:0.043618
[39]	train-rmse:0.041953
[40]	train-rms

In [9]:
# load into cuDF Dataframe

gdf_train = cudf.DataFrame.from_pandas(df_train)
gdf_train_target = cudf.DataFrame.from_pandas(df_train_target)

In [10]:
# run 2
# GPU, using cuDF

start_time = time.time()

xgtrain = xgb.DMatrix(gdf_train, gdf_train_target)

params = {'tree_method': 'gpu_hist', 'max_depth': MAX_TREE_DEPTH, 'learning_rate': 0.1, 'silent': 1}
bst = xgb.train(params, xgtrain, 50, evals=[(xgtrain, "train")])

timetaken_gpu = time.time() - start_time

[0]	train-rmse:0.458405
[1]	train-rmse:0.421806
[2]	train-rmse:0.388041
[3]	train-rmse:0.356389
[4]	train-rmse:0.327884
[5]	train-rmse:0.301324
[6]	train-rmse:0.277336
[7]	train-rmse:0.255256
[8]	train-rmse:0.235924
[9]	train-rmse:0.217393
[10]	train-rmse:0.201276
[11]	train-rmse:0.186706
[12]	train-rmse:0.173898
[13]	train-rmse:0.160896
[14]	train-rmse:0.148913
[15]	train-rmse:0.138999
[16]	train-rmse:0.130038
[17]	train-rmse:0.122823
[18]	train-rmse:0.115481
[19]	train-rmse:0.108235
[20]	train-rmse:0.10212
[21]	train-rmse:0.096187
[22]	train-rmse:0.091343
[23]	train-rmse:0.087358
[24]	train-rmse:0.0826
[25]	train-rmse:0.079298
[26]	train-rmse:0.076284
[27]	train-rmse:0.073177
[28]	train-rmse:0.07026
[29]	train-rmse:0.067298
[30]	train-rmse:0.064111
[31]	train-rmse:0.060426
[32]	train-rmse:0.058145
[33]	train-rmse:0.055393
[34]	train-rmse:0.052997
[35]	train-rmse:0.050639
[36]	train-rmse:0.048696
[37]	train-rmse:0.046288
[38]	train-rmse:0.043618
[39]	train-rmse:0.041953
[40]	train-rms

In [11]:
print("Check GPU memory usage")
!gpustat

# free up memory
del xgtrain
del bst

Check GPU memory usage
[1m[37mjupyter-timothy-5fliu[m  Wed Jan 23 15:32:12 2019
[36m[0][m [34mTesla V100-DGXS-16GB[m |[31m 41'C[m, [32m  0 %[m | [36m[1m[33m 3083[m / [33m16125[m MB |


### 4. Results

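With the RAPIDS stack, training on the GPU is roughly 8x faster than on the CPU (99.7 s versus 12.5 s with cuDF). Loading the data through cuDF gives a further small gain over the GPU run that uses pandas (13.1 s). All times are in seconds. The last figure printed is the CPU time as a percentage of the GPU (cuDF) time, so 799.2 % corresponds to a speed-up of about 8x.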

In [13]:
print("CPU Time Taken:\n", round(timetaken_cpu,1))
print("\nGPU (no cuDF) Time Taken:\n", round(timetaken_gpu_nocudf,1))
print("\nGPU (cuDF) Time Taken:\n", round(timetaken_gpu,1))
print("\nTotal speed-up with RAPIDS:\n", round(timetaken_cpu/timetaken_gpu*100,1), "%")

CPU Time Taken:
 99.7

GPU (no cuDF) Time Taken:
 13.1

GPU (cuDF) Time Taken:
 12.5

Total speed-up with RAPIDS:
 799.2 %
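The final figure is a ratio expressed as a percentage, which is easy to misread as "799 % faster". The sketch below separates the two common ways of reporting the result, using the rounded timings printed above. The helper names are assumptions, and because the inputs are rounded the ratio differs slightly from the 799.2 % computed from the unrounded timings.

```python
def speedup(baseline_seconds, new_seconds):
    """How many times faster the new run is than the baseline."""
    return baseline_seconds / new_seconds


def time_saved_percent(baseline_seconds, new_seconds):
    """Percentage of the baseline time that the new run saves."""
    return (1 - new_seconds / baseline_seconds) * 100


# Rounded timings printed above, in seconds
cpu, gpu_nocudf, gpu_cudf = 99.7, 13.1, 12.5
print("Speed-up, GPU (cuDF) vs CPU:", round(speedup(cpu, gpu_cudf), 2))
print("Speed-up, GPU (no cuDF) vs CPU:", round(speedup(cpu, gpu_nocudf), 2))
print("Time saved, GPU (cuDF) vs CPU (%):", round(time_saved_percent(cpu, gpu_cudf), 1))
```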
