# Introduction to RF classification in cuML

Aim: Demonstrate how to use GPU Random Forest in cuML. 

Dataset: A small, self-generated example dataset of fruits of different sizes and colors.

In [None]:
from cuml import RandomForestClassifier as cuRF
import numpy as np

Here we map each fruit label to a consecutive integer, which the Random Forest classifier requires.

We also create an inverse dictionary to convert the integers back to text labels.

In [None]:
fruit_to_label = {'apple': 0, 'water melon': 1, 'cherry': 2, 'strawberry': 3}
label_to_fruit = {label: fruit for fruit, label in fruit_to_label.items()}
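
The round trip between text labels and integers can be checked with a short, self-contained sketch. The `fruits` list is a made-up example input, not part of the dataset used later.

```python
fruit_to_label = {'apple': 0, 'water melon': 1, 'cherry': 2, 'strawberry': 3}
label_to_fruit = {label: fruit for fruit, label in fruit_to_label.items()}

# The labels should be consecutive integers starting at 0.
assert sorted(fruit_to_label.values()) == list(range(len(fruit_to_label)))

# Hypothetical example input: encode text labels, then decode them again.
fruits = ['cherry', 'apple', 'water melon']
encoded = [fruit_to_label[name] for name in fruits]
decoded = [label_to_fruit[label] for label in encoded]

print(encoded)  # [2, 0, 1]
print(decoded)  # ['cherry', 'apple', 'water melon']
```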

The dataset below uses floating-point numbers for fruit size and a float-based one-hot encoding for the color types.

Each row in the dataset below is arranged as:
Red, Green, Blue, Size(cm), Fruit (label)

The features are converted to float32 and the labels to int32.

In [None]:
# Red(0/1), Green(0/1), Blue(0/1), Size(cm), Fruit (label)
features = ['Red', 'Green', 'Blue', 'Size(cm)']
dataset = np.array([[1.0, 0.0, 0.0, 7.0, fruit_to_label['apple']],
                   [0.0, 1.0, 0.0, 20.0, fruit_to_label['water melon']],
                   [1.0, 0.0, 0.0, 1.0, fruit_to_label['cherry']],
                   [0.0, 1.0, 0.0, 7.5, fruit_to_label['apple']],
                   [1.0, 0.0, 0.0, 1.0, fruit_to_label['strawberry']],
                   [1.0, 0.0, 0.0, 0.8, fruit_to_label['cherry']]])

X_train = dataset[:, :-1].astype(np.float32)
y_train = dataset[:, -1].astype(np.int32)
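
As a minimal sketch of the row layout described above, a small helper can build a feature row from a color name and a size. The helper `encode_sample` and the constant `COLORS` are hypothetical names introduced for illustration only; the notebook itself writes the rows out by hand.

```python
import numpy as np

# Column order assumed to match the dataset: Red, Green, Blue, Size(cm)
COLORS = ['Red', 'Green', 'Blue']

def encode_sample(color, size_cm):
    """Hypothetical helper: one-hot encode the color, then append the size."""
    one_hot = [1.0 if color == c else 0.0 for c in COLORS]
    return one_hot + [float(size_cm)]

# Build two feature rows and convert them to float32, as the training data is.
rows = np.array([encode_sample('Red', 7.0),
                 encode_sample('Green', 20.0)], dtype=np.float32)

print(rows.shape)  # (2, 4)
print(rows.dtype)  # float32
```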

We now set the hyperparameters for the cuML Random Forest and create an RF object.

In [None]:
# cuML Random Forest parameters
cu_rf_params = {
    'n_estimators': 3,  # number of trees in the forest
    'max_depth': 8,  # maximum depth of each tree
    'n_bins': 4,  # number of bins used in split point calculation
    'n_streams': 1,  # number of CUDA streams used for parallel processing on the GPU
    'rows_sample': 0.67,  # fraction of input rows sampled for each tree
    'split_algo': 0,  # split algorithm (0 = HIST, 1 = GLOBAL_QUANTILE)
    'seed': 13233466  # seed for the random number generator
}
cu_rf = cuRF(**cu_rf_params)
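
To give a rough feel for what `rows_sample` and `n_estimators` mean on a six-row dataset, the sketch below draws a row subset per tree with NumPy. It is an illustration only, not cuML's internal sampling code: the rounding rule and sampling with replacement are assumptions, and the indices drawn here will not match what cuML picks for the same seed.

```python
import numpy as np

n_rows = 6           # rows in the training set above
rows_sample = 0.67   # fraction of rows considered per tree
n_estimators = 3     # number of trees

rng = np.random.default_rng(13233466)

# Assumed rounding rule: truncate to an integer, but keep at least one row.
rows_per_tree = max(1, int(rows_sample * n_rows))

# Assumed bootstrap-style sampling (with replacement), one draw per tree.
samples = [rng.choice(n_rows, size=rows_per_tree, replace=True)
           for _ in range(n_estimators)]

print(rows_per_tree)  # 4
for tree_id, idx in enumerate(samples):
    print(f"tree {tree_id}: rows {sorted(idx.tolist())}")
```

With so few rows per tree, individual trees may never see some classes, which is worth keeping in mind when reading the predictions on a toy dataset like this one.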

Next, we train the model, building a Random Forest on our dataset.

In [None]:
cu_rf.fit(X_train, y_train)

Now we create a dataset to use for inference. It contains two samples.

In [None]:
X_test = np.array([[0.0, 1.0, 0.0, 22.0], [1.0, 0.0, 0.0, 0.6]])
X_test = X_test.astype(np.float32)

We run inference on the CPU rather than in the default GPU mode. At the moment, RF supports only binary classification for GPU inference, so for multi-class inference we fall back to CPU mode.

In [None]:
predict_label = cu_rf.predict(X_test, predict_model='CPU')
print("Predicted fruit 1-->", label_to_fruit[predict_label[0]])
print("Predicted fruit 2-->", label_to_fruit[predict_label[1]])
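
For more than a couple of samples, the predictions can be decoded in a loop instead of one `print` per index. The sketch below is self-contained and therefore uses a stand-in array in place of the model output: the values in `predict_label` are placeholders chosen for illustration, not results produced by the model above.

```python
import numpy as np

fruit_to_label = {'apple': 0, 'water melon': 1, 'cherry': 2, 'strawberry': 3}
label_to_fruit = {label: fruit for fruit, label in fruit_to_label.items()}

# Placeholder values standing in for the array returned by cu_rf.predict(...)
predict_label = np.array([1, 2], dtype=np.int32)

# int(...) converts each NumPy integer to a plain Python int before the lookup.
predicted_fruits = [label_to_fruit[int(p)] for p in predict_label]

for i, fruit in enumerate(predicted_fruits, start=1):
    print(f"Predicted fruit {i}--> {fruit}")
```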