# Random Forest Classification

**Authorship**<br />
Original Author: Saloni Jain<br />
Last Edit: Taurean Dyer, 9/25/2019<br />

**Test System Specs**<br />
Test System Hardware: GV100<br />
Test System Software: Ubuntu 18.04<br />
RAPIDS Version: 0.10.0a - Docker Install<br />
Driver: 410.79<br />
CUDA: 10.0<br />


**Known Working Systems**<br />
RAPIDS Versions: 0.4, 0.5, 0.5.1, 0.6, 0.6.1, 0.7, 0.8, 0.9, 0.10

## Intro
Random Forest is an ensemble classification algorithm that builds several decision trees and aggregates their outputs to make a prediction. Aggregating over many trees makes the model more robust to overfitting than a single decision tree.
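
To make the aggregation step concrete, here is a minimal sketch of majority voting over per-tree predictions. It is an illustration only: the function name `majority_vote` and the toy predictions are hypothetical, and real implementations may aggregate differently (for example, by averaging class probabilities rather than counting votes).

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Combine per-tree class labels into one label per sample.

    tree_predictions: list of lists, one inner list per tree,
    each holding that tree's predicted label for every sample.
    """
    n_samples = len(tree_predictions[0])
    forest_prediction = []
    for i in range(n_samples):
        # Count how many trees voted for each label on sample i
        votes = Counter(tree[i] for tree in tree_predictions)
        # most_common(1) returns [(label, count)] for the top label
        forest_prediction.append(votes.most_common(1)[0][0])
    return forest_prediction

# Three hypothetical trees voting on four samples
tree_predictions = [
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 0],
]
print(majority_vote(tree_predictions))  # [1, 1, 1, 0]
```

Using an odd number of trees on a binary problem, as in this toy example, avoids tied votes.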

To convert your dataset to cudf format, please read the cudf documentation at https://rapidsai.github.io/projects/cudf/en/latest/. For additional information on the RandomForest model, please refer to the cuml documentation at https://rapidsai.github.io/projects/cuml/en/latest/index.html

This notebook demonstrates fitting a RandomForestClassifier on the Higgs dataset. It is a binary classification problem: distinguish a signal process that produces Higgs bosons from a background process that does not. The notebook also compares the performance (accuracy and speed) with sklearn's parallel RandomForestClassifier implementation.

In [None]:
from cuml import RandomForestClassifier as cuRF
from sklearn.ensemble import RandomForestClassifier as sklRF
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
import cudf
import numpy as np
import pandas as pd
import os
from urllib.request import urlretrieve
import gzip

## Helper function to download and extract the Higgs dataset

In [None]:
import shutil

def download_higgs(compressed_filepath, decompressed_filepath):
    higgs_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00280/HIGGS.csv.gz'
    # Download the gzipped file only if it is not already on disk
    if not os.path.isfile(compressed_filepath):
        urlretrieve(higgs_url, compressed_filepath)
    # Decompress in chunks so the whole file is never held in memory
    if not os.path.isfile(decompressed_filepath):
        with gzip.open(compressed_filepath, 'rb') as cf, \
                open(decompressed_filepath, 'wb') as df:
            shutil.copyfileobj(cf, df)

## Download Higgs data and read using cudf

In [None]:
data_dir = '../../data/rf/'
if not os.path.exists(data_dir):
    print('creating rf data directory')
    os.makedirs(data_dir, exist_ok=True)  # also creates missing parent directories

In [None]:
compressed_filepath = data_dir+'HIGGS.csv.gz' # Set this to the path of the gzipped Higgs data file, if you already have one
decompressed_filepath = data_dir+'HIGGS.csv' # Set this to the path of the decompressed Higgs data file, if you already have one
download_higgs(compressed_filepath, decompressed_filepath)

col_names = ['label'] + ["col-{}".format(i) for i in range(2, 30)] # Assign column names
dtypes_ls = ['int32'] + ['float32' for _ in range(2, 30)] # Assign dtypes to each column
data = cudf.read_csv(decompressed_filepath, names=col_names, dtype=dtypes_ls)
data.head().to_pandas()

## Make train test splits

In [None]:
X, y = data[data.columns.difference(['label'])].as_matrix(), data['label'].to_array() # Separate data into X and y
del data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=500_000)

In [None]:
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
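
When `test_size` is an integer, as in the split above, `train_test_split` treats it as an absolute number of rows to hold out rather than a fraction. The sketch below illustrates the idea with the standard library only. The helper `holdout_split` and the toy data are hypothetical, and this is not how sklearn implements the split internally.

```python
import random

def holdout_split(X, y, test_size, seed=None):
    """Shuffle row indices, then hold out `test_size` rows for testing."""
    if not 0 < test_size < len(X):
        raise ValueError("test_size must be between 1 and len(X) - 1")
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    # The first `test_size` shuffled indices form the test set
    test_idx, train_idx = indices[:test_size], indices[test_size:]
    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test

# Toy data: 10 rows, 2 features each; the label is the parity of the row index
X_toy = [[i, i * 10] for i in range(10)]
y_toy = [i % 2 for i in range(10)]
X_tr, X_te, y_tr, y_te = holdout_split(X_toy, y_toy, test_size=3, seed=0)
print(len(X_tr), len(X_te))  # 7 3
```

Because rows and labels are indexed with the same shuffled indices, each feature row stays paired with its label after the split.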

#### You can consult the RandomForestClassifier docstring to check all the parameters, but here are some of the more important ones:
1. n_estimators: (default = 10) Number of trees in the forest.
2. max_depth: (default = -1) Maximum tree depth. Unlimited (i.e., until leaves are pure) if -1.
3. n_bins: (default = 8) Number of bins used by the split algorithm.

Note on `n_bins`: Reducing `n_bins` shrinks the histograms used to decide where to split tree nodes. This shortens training time, but if you reduce it too far, you may harm model accuracy.
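
To see why fewer bins means less work but coarser splits, consider the rough sketch below. It assumes equal-width bins over a single feature, which is an assumption made for illustration; cuML's actual binning strategy may differ. The helper `candidate_thresholds` and the toy feature values are hypothetical.

```python
def candidate_thresholds(values, n_bins):
    """Return the interior edges of `n_bins` equal-width bins over `values`."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # n_bins bins have n_bins - 1 interior edges; each is a candidate split
    return [lo + width * k for k in range(1, n_bins)]

# Toy feature column (hypothetical values)
feature = [0.0, 0.5, 1.2, 2.0, 2.7, 3.1, 3.9, 4.0]
for n_bins in (2, 4, 8):
    thresholds = candidate_thresholds(feature, n_bins)
    print(n_bins, "bins ->", len(thresholds), "candidate split(s)")
```

With 2 bins there is a single candidate threshold to evaluate per feature, while 8 bins give 7. Fewer candidates are cheaper to score, but the best available split may sit farther from the ideal boundary.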

In [None]:
# cuml Random Forest params

cu_rf_params = {
    'n_estimators': 25,
    'max_depth': 13,
    'n_bins': 15,
}

#### The methods that can be used with the RandomForestClassifier are:
1. fit: Fits the model with X and y.
2. get_params: Returns the model's parameters, sklearn style.
3. predict: Predicts the y for X.
4. set_params: Sets the model's parameters from a dictionary of params, sklearn style.
5. cross_validate: Predicts the accuracy of the model for X.

###### Note on input to `fit` method: Since `fit` is processed on the GPU, it can accept `cudf` dataframes or `numpy` arrays

In [None]:
%%time
# Train cuml RF

cu_rf = cuRF(**cu_rf_params)
cu_rf.fit(X_train, y_train)

#### Set Sklearn params and fit RandomForestClassifier

In [None]:
# sklearn Random Forest params

skl_rf_params = {
    'n_estimators': 25,
    'max_depth': 13,
}

In [None]:
%%time
# Train sklearn RF in parallel

skl_rf = sklRF(**skl_rf_params, n_jobs=20)
skl_rf.fit(X_train, y_train)
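
The `%%time` cell magic used above is only available in Jupyter/IPython. If you run these steps as a plain Python script, a small helper such as the one sketched below can stand in for it. The name `timed` is hypothetical, and the workload shown is a stand-in, not a model fit.

```python
import time

def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Stand-in workload; in the notebook this would be a model's fit call
result, seconds = timed(sum, range(1_000_000))
print(f"finished in {seconds:.4f} s")
```

In this notebook's terms, the call would look like `_, cu_seconds = timed(cu_rf.fit, X_train, y_train)`, and likewise for `skl_rf.fit`.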

## Predict and compare cuml and sklearn RandomForestClassifier

###### Note on input to cuml `predict` method: Since `predict` is processed on the CPU, it can only accept `numpy` arrays

In [None]:
# Predict

# accuracy_score expects (y_true, y_pred)
print("cuml RF Accuracy Score: ", accuracy_score(y_test, cu_rf.predict(X_test)))
print("sklearn RF Accuracy Score: ", accuracy_score(y_test, skl_rf.predict(X_test)))
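
For reference, the accuracy reported above is simply the fraction of test samples whose predicted label matches the true label. Below is a minimal pure-Python sketch of that computation. The function `accuracy` and the toy labels are hypothetical, and this is not sklearn's implementation.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Toy labels: 3 of the 5 predictions are correct
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]
print(accuracy(y_true, y_pred))  # 0.6
```

Accuracy is symmetric in its two arguments, so swapping the true and predicted labels gives the same value. That is not true of every classification metric.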