## Predicting drive failure with XGBoost and RAPIDS

**Dataset**: Hard disk SMART data and failure dataset from Backblaze ([More information](https://www.backblaze.com/b2/hard-drive-test-data.html))

**Task**: Predict hard disk failure with RAPIDS

In [1]:
import time
import numpy as np
import pandas as pd
from tqdm import tqdm_notebook

from sklearn.metrics import mean_squared_error, accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# RAPIDS
import cudf
import xgboost as xgb

### 1. Load Data

#### Training Data

Use Pandas to load training data from CSV. This consists of the drive data from July and August 2018.

In [2]:
df = pd.read_csv("train.csv")

# fill in null data
df.fillna(0, inplace=True)

Split into features (`df_train`) and labels (`df_train_target`), each a Pandas `DataFrame`.

In [3]:
df_train = df.drop(["failure"],axis=1).apply(pd.to_numeric)
df_train_target = pd.DataFrame(df["failure"]).apply(pd.to_numeric)
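
To make the null-filling and feature/label split concrete, here is a minimal sketch on a tiny made-up frame. The column names `smart_5_raw` and `smart_187_raw` and all values are hypothetical stand-ins, not rows from `train.csv`. It also prints the fraction of failure rows, a quick sanity check worth running because drive failures are typically rare.

```python
import pandas as pd

# Hypothetical stand-in for train.csv: two SMART columns plus the label.
toy = pd.DataFrame({
    "smart_5_raw": [0, 8, None, 0],
    "smart_187_raw": [0, 3, 0, None],
    "failure": [0, 1, 0, 0],
})

# Same null handling as the notebook: replace missing readings with 0.
toy = toy.fillna(0)

# Same split as the notebook: everything except "failure" is a feature.
features = toy.drop(["failure"], axis=1).apply(pd.to_numeric)
labels = pd.DataFrame(toy["failure"]).apply(pd.to_numeric)

# Fraction of rows labelled as failures.
failure_rate = float(labels["failure"].mean())
print(features.shape, labels.shape, failure_rate)
```

On the real data, the same `labels["failure"].mean()` check applied to `df_train_target` would show how imbalanced the classes are.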

#### Evaluation Data

Do the same thing to load evaluation data from CSV. This consists of the drive data from September 2018.

In [4]:
df_t = pd.read_csv("eval.csv")

# fill in null data
df_t.fillna(0, inplace=True)

df_test = df_t.drop(["failure"],axis=1).apply(pd.to_numeric)
df_test_target = pd.DataFrame(df_t["failure"]).apply(pd.to_numeric)
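
Because the training and evaluation frames are built from separate CSV files, it can help to confirm that they expose the same feature columns in the same order before handing them to XGBoost. The helper below is a hypothetical sketch (the name `column_mismatch` and the example column names are assumptions); in the notebook one would pass `df_train.columns` and `df_test.columns`.

```python
def column_mismatch(train_cols, eval_cols):
    """Return (missing_in_eval, extra_in_eval, same_order) for two column sequences."""
    train_cols, eval_cols = list(train_cols), list(eval_cols)
    missing = [c for c in train_cols if c not in eval_cols]
    extra = [c for c in eval_cols if c not in train_cols]
    same_order = train_cols == eval_cols
    return missing, extra, same_order


# Hypothetical column names, chosen only to show a mismatch.
train_cols = ["smart_5_raw", "smart_187_raw", "smart_197_raw"]
eval_cols = ["smart_5_raw", "smart_197_raw", "smart_198_raw"]

missing, extra, same_order = column_mismatch(train_cols, eval_cols)
print("missing in eval:", missing)
print("extra in eval:", extra)
print("same order:", same_order)
```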

### 2. Train with CPU

XGBoost training on the CPU (`params['tree_method'] = 'hist'`), using a Pandas `DataFrame` loaded into `xgb.DMatrix`. For more information, see [this page in the XGBoost documentation](https://xgboost.readthedocs.io/en/latest/python/python_intro.html).

In [5]:
!lscpu | grep 'Model name:'

Model name:          Intel(R) Core(TM) i5-8500 CPU @ 3.00GHz


In [6]:
start_time = time.time()

xgtrain = xgb.DMatrix(df_train, df_train_target)
xgeval = xgb.DMatrix(df_test, df_test_target)

params = {'tree_method': 'hist', 'max_depth': 24, 'learning_rate': 0.1, 'silent': 1}
bst = xgb.train(params, xgtrain, 50, evals=[(xgtrain, "train"), (xgeval, "eval")])

timetaken_cpu = time.time() - start_time

# free up memory (including the evaluation DMatrix)
del xgtrain
del xgeval
del bst

[03:41:53] Tree method is selected to be 'hist', which uses a single updater grow_fast_histmaker.
[0]	train-rmse:0.450007	eval-rmse:0.450008
[1]	train-rmse:0.405014	eval-rmse:0.405017
[2]	train-rmse:0.364522	eval-rmse:0.364526
[3]	train-rmse:0.328079	eval-rmse:0.328086
[4]	train-rmse:0.295281	eval-rmse:0.295294
[5]	train-rmse:0.265764	eval-rmse:0.265779
[6]	train-rmse:0.239199	eval-rmse:0.239218
[7]	train-rmse:0.215292	eval-rmse:0.215315
[8]	train-rmse:0.193778	eval-rmse:0.193805
[9]	train-rmse:0.174416	eval-rmse:0.174448
[10]	train-rmse:0.156991	eval-rmse:0.15703
[11]	train-rmse:0.141311	eval-rmse:0.141355
[12]	train-rmse:0.1272	eval-rmse:0.127252
[13]	train-rmse:0.114503	eval-rmse:0.114563
[14]	train-rmse:0.103076	eval-rmse:0.10315
[15]	train-rmse:0.092794	eval-rmse:0.092885
[16]	train-rmse:0.08354	eval-rmse:0.083653
[17]	train-rmse:0.075215	eval-rmse:0.07535
[18]	train-rmse:0.067726	eval-rmse:0.067882
[19]	train-rmse:0.060989	eval-rmse:0.06117
[20]	train-rmse:0.054925	eval-rmse:0.05
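
The logged RMSE falls by almost exactly 10% per round. One rough way to see why, assuming the default squared-error objective, an initial prediction (`base_score`) of 0.5, and labels that are almost all 0: each boosting round moves the prediction a fraction `learning_rate` of the way toward the label, so the residual shrinks by a factor of `1 - learning_rate`. The sketch below is an idealized calculation under those assumptions, not a reproduction of the training run.

```python
learning_rate = 0.1  # same value as in params
base_score = 0.5     # assumed initial prediction


def idealized_rmse(n_rounds, lr=learning_rate, start=base_score):
    """RMSE after each round if every label were 0 and each tree fit the residual exactly."""
    rmse = start
    values = []
    for _ in range(n_rounds):
        rmse *= 1.0 - lr  # residual shrinks by (1 - lr) each round
        values.append(rmse)
    return values


# train-rmse for rounds 0-4, copied from the CPU log.
logged = [0.450007, 0.405014, 0.364522, 0.328079, 0.295281]

ideal = idealized_rmse(len(logged))
for i, (expected, seen) in enumerate(zip(ideal, logged)):
    print(i, round(expected, 6), seen)
```

The logged values sit slightly above the idealized ones, which is what one would expect when a small number of labels are 1 rather than 0 and the trees are regularized.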

### 3. Train with GPU

To use the GPU, we set `params['tree_method'] = 'gpu_hist'` and (optionally) load the Pandas `DataFrame` into a cuDF `DataFrame` (Python object type `cudf.dataframe.dataframe.DataFrame`).

In [7]:
# load into cuDF Dataframe

gdf_train = cudf.DataFrame.from_pandas(df_train)
gdf_train_target = cudf.DataFrame.from_pandas(df_train_target)

gdf_test = cudf.DataFrame.from_pandas(df_test)
gdf_test_target = cudf.DataFrame.from_pandas(df_test_target)

In [8]:
start_time = time.time()

xgtrain = xgb.DMatrix(gdf_train, gdf_train_target)
xgeval = xgb.DMatrix(gdf_test, gdf_test_target)

params = {'tree_method': 'gpu_hist', 'max_depth': 24, 'learning_rate': 0.1, 'silent': 1}
bst = xgb.train(params, xgtrain, 50, evals=[(xgtrain, "train"), (xgeval, "eval")])

timetaken_gpu = time.time() - start_time

[0]	train-rmse:0.450007	eval-rmse:0.450008
[1]	train-rmse:0.405014	eval-rmse:0.405017
[2]	train-rmse:0.364521	eval-rmse:0.364526
[3]	train-rmse:0.328079	eval-rmse:0.328086
[4]	train-rmse:0.29528	eval-rmse:0.295293
[5]	train-rmse:0.265763	eval-rmse:0.265781
[6]	train-rmse:0.239198	eval-rmse:0.239221
[7]	train-rmse:0.215291	eval-rmse:0.215321
[8]	train-rmse:0.193776	eval-rmse:0.19381
[9]	train-rmse:0.174413	eval-rmse:0.174453
[10]	train-rmse:0.156989	eval-rmse:0.157034
[11]	train-rmse:0.141309	eval-rmse:0.141362
[12]	train-rmse:0.127197	eval-rmse:0.127262
[13]	train-rmse:0.114498	eval-rmse:0.114575
[14]	train-rmse:0.103073	eval-rmse:0.103163
[15]	train-rmse:0.092791	eval-rmse:0.092898
[16]	train-rmse:0.083539	eval-rmse:0.083664
[17]	train-rmse:0.075214	eval-rmse:0.075359
[18]	train-rmse:0.067725	eval-rmse:0.067893
[19]	train-rmse:0.060988	eval-rmse:0.061179
[20]	train-rmse:0.054928	eval-rmse:0.055149
[21]	train-rmse:0.049477	eval-rmse:0.049731
[22]	train-rmse:0.044573	eval-rmse:0.044866
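
Both training cells measure wall-clock time by hand with `time.time()`, and the measured span covers `DMatrix` construction as well as training. A small context manager is one way to keep the CPU and GPU measurements consistent. The names `timed` and `timings` below are hypothetical helpers, not part of the notebook, and the loop is only a stand-in workload.

```python
import time
from contextlib import contextmanager

timings = {}


@contextmanager
def timed(label, store=timings):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        store[label] = time.perf_counter() - start


with timed("demo"):
    # Stand-in for DMatrix construction followed by xgb.train(...).
    total = sum(i * i for i in range(100_000))

print("demo took", round(timings["demo"], 4), "s")
```

`time.perf_counter()` is used here because it is a monotonic, high-resolution clock, which makes it a safer choice than `time.time()` for measuring intervals.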


In [9]:
print("Check GPU memory usage")
!gpustat

# free up memory (including the evaluation DMatrix)
del xgtrain
del xgeval
del bst

Check GPU memory usage
[1m[37mf1f2465685ad[m  Mon Dec 24 03:42:48 2018
[36m[0][m [34mGeForce GTX 1080 Ti[m |[31m 45'C[m, [32m  3 %[m | [36m[1m[33m 9776[m / [33m11175[m MB |


### 4. Results

End to end (building the `DMatrix` objects plus training), the GPU run with the RAPIDS stack took 15.1 s versus 45.4 s on the CPU, roughly a 3x speed-up.

In [10]:
print("CPU Time Taken (s):", round(timetaken_cpu,1))
print("GPU Time Taken (s):", round(timetaken_gpu,1))
# report the speed-up as a ratio: CPU time divided by GPU time
print("Speed-up with RAPIDS:", round(timetaken_cpu/timetaken_gpu,1), "x")

CPU Time Taken (s): 45.4
GPU Time Taken (s): 15.1
Speed-up with RAPIDS: 3.0 x
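
The ratio of the two timings can be read either as a speed-up factor or as a percentage of time saved, and the two are easy to confuse. The sketch below works through both using the rounded timings printed in the results; the function names are hypothetical.

```python
def speedup_factor(cpu_seconds, gpu_seconds):
    """How many times faster the GPU run was than the CPU run."""
    return cpu_seconds / gpu_seconds


def time_saved_percent(cpu_seconds, gpu_seconds):
    """Percentage of the CPU wall-clock time eliminated by the GPU run."""
    return (1 - gpu_seconds / cpu_seconds) * 100


cpu_s, gpu_s = 45.4, 15.1  # rounded timings from the results

factor = round(speedup_factor(cpu_s, gpu_s), 1)
saved = round(time_saved_percent(cpu_s, gpu_s), 1)
print("speed-up factor:", factor, "x")
print("time saved:", saved, "%")
```

With these timings the GPU run is about 3 times faster, which corresponds to roughly two thirds less wall-clock time.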
