## Predicting drive failure with XGBoost and RAPIDS

**Dataset**: Hard disk SMART data and failure dataset from Backblaze ([More information](https://www.backblaze.com/b2/hard-drive-test-data.html))

**Task**: Predict hard disk failure with RAPIDS

In [1]:
import time
import numpy as np
import pandas as pd
from tqdm import tqdm_notebook

from sklearn.metrics import mean_squared_error, accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# RAPIDS
import cudf
import xgboost as xgb

### 1. Load Data

#### Training Data

Use Pandas to load training data from CSV. This consists of the drive data from July and August 2018.

In [2]:
df = pd.read_csv("merged.csv")
df["hdd_model"] = df["hdd_model"].astype('category')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4287 entries, 0 to 4286
Columns: 619 entries, failure to feat615
dtypes: category(1), float64(616), int64(2)
memory usage: 20.2 MB


Split into features (`df_train`) and labels (`df_train_target`), where each is a Pandas `DataFrame`.

In [3]:
df_train = df.drop(["hdd_model"],axis=1).drop(["failure"],axis=1).apply(pd.to_numeric)
df_train_target = pd.DataFrame(df["failure"]).apply(pd.to_numeric)
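
As a minimal sketch of the same split on a toy frame (the values are invented for illustration; only the column names `failure` and `hdd_model` come from the notebook), both non-feature columns can be dropped in a single `drop` call:

```python
import pandas as pd

# Toy stand-in for merged.csv; the values are invented for illustration.
toy = pd.DataFrame({
    "failure": [0, 1, 0],
    "hdd_model": ["A", "B", "A"],
    "feat1": [0.1, 0.2, 0.3],
    "feat2": [1.0, 2.0, 3.0],
})

# Features: everything except the label and the categorical model name.
toy_features = toy.drop(columns=["failure", "hdd_model"]).apply(pd.to_numeric)

# Labels: a one-column DataFrame, matching the shape used in the notebook.
toy_target = toy[["failure"]].apply(pd.to_numeric)

print(toy_features.shape, toy_target.shape)
```

Keeping the labels as a one-column `DataFrame` rather than a `Series` mirrors the notebook; either form should be accepted by `xgb.DMatrix`.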

#### Evaluation Data

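The same steps load the evaluation data from CSV, which consists of the drive data from September 2018. The cell below is currently disabled (its code is wrapped in a string literal), so `df_test` and `df_test_target` are not created in this run.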

In [4]:
"""
df_t = pd.read_csv("eval.csv")

# fill in null data
df_t.fillna(0, inplace=True)

df_test = df_t.drop(["failure"],axis=1).apply(pd.to_numeric)
df_test_target = pd.DataFrame(df_t["failure"]).apply(pd.to_numeric)
"""

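
If the evaluation cell is re-enabled, the evaluation features generally need the same columns, in the same order, as the training features before they are passed to the model. A minimal sketch of one way to do this with `DataFrame.reindex`, using toy frames (the helper name `align_to_training` and the data are hypothetical, not part of the notebook):

```python
import pandas as pd

def align_to_training(eval_features, train_columns):
    """Return eval_features with exactly train_columns, in that order.

    Columns missing from the evaluation data are filled with 0, mirroring
    the fillna(0) used in the notebook; extra columns are dropped.
    """
    return eval_features.reindex(columns=list(train_columns), fill_value=0)

# Invented toy data: the evaluation frame lacks feat3 and has an extra column.
train_features = pd.DataFrame({"feat1": [0.1], "feat2": [1.0], "feat3": [5.0]})
eval_features = pd.DataFrame({"feat2": [2.0], "hdd_model": ["A"], "feat1": [0.4]})

aligned = align_to_training(eval_features, train_features.columns)
print(list(aligned.columns))
```

Filling missing features with 0 is an assumption carried over from the disabled cell; whether 0 is a sensible default depends on the SMART attribute in question.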

In [5]:
MAX_TREE_DEPTH = 24

### 2. Train with CPU

XGBoost training on the CPU (`params['tree_method'] = 'hist'`), using a Pandas `DataFrame` loaded into `xgb.DMatrix`. For more information, check out [this page in the XGBoost Documentation](https://xgboost.readthedocs.io/en/latest/python/python_intro.html).

In [6]:
!lscpu | grep 'Model name:'

Model name:          Intel(R) Core(TM) i5-8500 CPU @ 3.00GHz


In [7]:
start_time = time.time()

xgtrain = xgb.DMatrix(df_train, df_train_target)

params = {'tree_method': 'hist', 'max_depth': MAX_TREE_DEPTH, 'learning_rate': 0.1, 'silent': 1}
bst = xgb.train(params, xgtrain, 50, evals=[(xgtrain, "train")])

timetaken_cpu = time.time() - start_time

# free up memory
del xgtrain
del bst

[04:15:35] Tree method is selected to be 'hist', which uses a single updater grow_fast_histmaker.
[0]	train-rmse:0.457066
[1]	train-rmse:0.418206
[2]	train-rmse:0.383866
[3]	train-rmse:0.352145
[4]	train-rmse:0.322875
[5]	train-rmse:0.295982
[6]	train-rmse:0.271924
[7]	train-rmse:0.249402
[8]	train-rmse:0.23034
[9]	train-rmse:0.211748
[10]	train-rmse:0.194118
[11]	train-rmse:0.179599
[12]	train-rmse:0.167187
[13]	train-rmse:0.15468
[14]	train-rmse:0.144558
[15]	train-rmse:0.134902
[16]	train-rmse:0.125312
[17]	train-rmse:0.11699
[18]	train-rmse:0.109507
[19]	train-rmse:0.102615
[20]	train-rmse:0.095921
[21]	train-rmse:0.089034
[22]	train-rmse:0.082617
[23]	train-rmse:0.07664
[24]	train-rmse:0.072542
[25]	train-rmse:0.068811
[26]	train-rmse:0.064983
[27]	train-rmse:0.061901
[28]	train-rmse:0.057977
[29]	train-rmse:0.054897
[30]	train-rmse:0.052134
[31]	train-rmse:0.049396
[32]	train-rmse:0.047
[33]	train-rmse:0.044104
[34]	train-rmse:0.042404
[35]	train-rmse:0.040754
[36]	train-rmse:0.0
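
The `train-rmse` values in the log are root-mean-squared errors between the model's predicted scores and the 0/1 `failure` labels. For a failure predictor it can be more informative to also threshold the scores and look at accuracy and recall. A minimal NumPy sketch on made-up scores (the 0.5 threshold and both arrays are assumptions for illustration, not values from this notebook):

```python
import numpy as np

labels = np.array([0, 0, 1, 1, 0])            # invented ground truth
scores = np.array([0.1, 0.4, 0.8, 0.3, 0.0])  # invented model outputs

# RMSE: the quantity reported as train-rmse in the training log.
rmse = np.sqrt(np.mean((scores - labels) ** 2))

# Threshold the scores to get hard 0/1 predictions (0.5 is an assumed cut-off).
predicted = (scores >= 0.5).astype(int)
accuracy = np.mean(predicted == labels)

# Recall on the failure class: fraction of actual failures that were flagged.
recall = predicted[labels == 1].mean()

print(rmse, accuracy, recall)
```

Because drive failures are rare, accuracy alone can look high even when most failures are missed, which is why recall is shown alongside it.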

### 3. Train with GPU

To use the GPU, we set `params['tree_method'] = 'gpu_hist'` and also (optionally) load the Pandas `DataFrame` into a cuDF `DataFrame` (Python object type `cudf.dataframe.dataframe.DataFrame`).

In [8]:
# load into cuDF Dataframe

gdf_train = cudf.DataFrame.from_pandas(df_train)
gdf_train_target = cudf.DataFrame.from_pandas(df_train_target)

In [9]:
start_time = time.time()

xgtrain = xgb.DMatrix(gdf_train, gdf_train_target)

params = {'tree_method': 'gpu_hist', 'max_depth': MAX_TREE_DEPTH, 'learning_rate': 0.1, 'silent': 1}
bst = xgb.train(params, xgtrain, 50, evals=[(xgtrain, "train")])

timetaken_gpu = time.time() - start_time

[0]	train-rmse:0.457066
[1]	train-rmse:0.418205
[2]	train-rmse:0.383863
[3]	train-rmse:0.352269
[4]	train-rmse:0.323068
[5]	train-rmse:0.295971
[6]	train-rmse:0.271935
[7]	train-rmse:0.251275
[8]	train-rmse:0.231848
[9]	train-rmse:0.213321
[10]	train-rmse:0.19536
[11]	train-rmse:0.179152
[12]	train-rmse:0.164567
[13]	train-rmse:0.153777
[14]	train-rmse:0.142924
[15]	train-rmse:0.133262
[16]	train-rmse:0.12436
[17]	train-rmse:0.116547
[18]	train-rmse:0.110366
[19]	train-rmse:0.102741
[20]	train-rmse:0.095936
[21]	train-rmse:0.089365
[22]	train-rmse:0.082975
[23]	train-rmse:0.077779
[24]	train-rmse:0.072882
[25]	train-rmse:0.068129
[26]	train-rmse:0.064025
[27]	train-rmse:0.060719
[28]	train-rmse:0.058283
[29]	train-rmse:0.054669
[30]	train-rmse:0.051563
[31]	train-rmse:0.049245
[32]	train-rmse:0.046539
[33]	train-rmse:0.04424
[34]	train-rmse:0.042664
[35]	train-rmse:0.039821
[36]	train-rmse:0.037918
[37]	train-rmse:0.036624
[38]	train-rmse:0.034571
[39]	train-rmse:0.033505
[40]	train-rm
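
Both training cells measure wall-clock time with the same `time.time()` bracket. A small context manager can remove the duplication. The sketch below is one possible helper (the name `timed` and the `timings` dict are hypothetical), using `time.perf_counter`, which is intended for measuring elapsed intervals:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, results):
    """Store the wall-clock duration of the with-block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Record the duration even if the block raises.
        results[label] = time.perf_counter() - start

timings = {}
with timed("cpu", timings):
    sum(i * i for i in range(100_000))  # stand-in for the xgb.train(...) call

print(timings)
```

Wrapping each training cell's body in `with timed("cpu", timings):` or `with timed("gpu", timings):` would keep the two measurements directly comparable.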

In [10]:
print("Check GPU memory usage")
!gpustat

# free up memory
del xgtrain
del bst

Check GPU memory usage
584fc0f60364  Sat Jan 19 04:15:59 2019
[0] GeForce GTX 1080 Ti | 48'C,   1 % |  1462 / 11175 MB |


### 4. Results

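We see a speed-up when we use the RAPIDS stack: training took 9.4 s on the GPU versus 12.6 s on the CPU, which makes the GPU run about 1.35x as fast. Both timings include construction of the `xgb.DMatrix`. The last line of the output reports the CPU-to-GPU time ratio as a percentage.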

In [11]:
print("CPU Time Taken:", round(timetaken_cpu,1))
print("GPU Time Taken:", round(timetaken_gpu,1))
print("Speed-up with RAPIDS:", round(timetaken_cpu/timetaken_gpu*100,1), "%")

CPU Time Taken: 12.6
GPU Time Taken: 9.4
Speed-up with RAPIDS: 134.6 %
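
The ratio `timetaken_cpu / timetaken_gpu` and the percentage of time saved are different quantities, and it helps to report them separately. A minimal sketch (the function name `summarize_speedup` is hypothetical; the inputs are the rounded timings printed above, so the ratio comes out slightly different from the 134.6 % computed from the unrounded values):

```python
def summarize_speedup(cpu_seconds, gpu_seconds):
    """Return (speed-up factor, percent of wall-clock time saved)."""
    factor = cpu_seconds / gpu_seconds
    percent_saved = (cpu_seconds - gpu_seconds) / cpu_seconds * 100
    return factor, percent_saved

# Rounded timings taken from the printed output above.
factor, percent_saved = summarize_speedup(12.6, 9.4)
print(f"GPU is {factor:.2f}x as fast; wall-clock time reduced by {percent_saved:.1f} %")
```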
