# Decision Forest Classification and Regression with Intel® Data Analytics Acceleration Library in Amazon SageMaker

## Introduction

Intel® Data Analytics Acceleration Library (Intel® DAAL) is a library of building blocks optimized for Intel® architecture that cover all stages of data analytics: data acquisition from a data source, preprocessing, transformation, data mining, modeling, validation, and decision making. One of its algorithms is Decision Forest.

The library provides decision forest classification and regression algorithms based on an ensemble of tree-structured predictors (decision trees) built using the general technique of bootstrap aggregation (bagging) and random choice of features. A decision tree is a binary tree. Each internal (split) node represents a decision function used to select the next (child) node at the prediction stage. Each leaf (terminal) node represents a response value, which is the result of the prediction from the tree.
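
The prediction stage described above can be illustrated with a minimal sketch. It is not Intel® DAAL code: the `Node` class, the `<=` threshold test used as the decision function, and majority voting across trees are assumptions made for illustration only.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Node:
    """One node of a binary decision tree (hypothetical structure for illustration)."""
    feature: Optional[int] = None      # index of the feature tested at a split node
    threshold: Optional[float] = None  # go left if x[feature] <= threshold, else right
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    response: Optional[int] = None     # set only on leaf (terminal) nodes


def predict_tree(node: Node, x: Sequence[float]) -> int:
    """Walk from the root to a leaf and return the leaf's response value."""
    while node.response is None:
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.response


def predict_forest(trees: Sequence[Node], x: Sequence[float]) -> int:
    """Classification by majority vote over the trees (assumed aggregation rule)."""
    votes = Counter(predict_tree(tree, x) for tree in trees)
    return votes.most_common(1)[0][0]


# Three tiny hand-built trees over two features.
tree_a = Node(feature=0, threshold=0.5, left=Node(response=0), right=Node(response=1))
tree_b = Node(feature=1, threshold=2.0, left=Node(response=0), right=Node(response=1))
tree_c = Node(feature=0, threshold=0.8, left=Node(response=0), right=Node(response=1))

forest_prediction = predict_forest([tree_a, tree_b, tree_c], [0.6, 1.0])
print(forest_prediction)  # tree_a votes 1, tree_b and tree_c vote 0, so this prints 0
```

For regression, the per-tree responses would typically be averaged instead of voted on.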

Intel® DAAL developer guide: https://software.intel.com/en-us/daal-programming-guide

Intel® DAAL documentation for Decision Forest: https://software.intel.com/en-us/daal-programming-guide-decision-forest

## Decision Forest Usage with SageMaker Estimator
First, import the SageMaker package, get the execution role, and create a session.

In [1]:
import sagemaker

role = sagemaker.get_execution_role()
sess = sagemaker.Session()

Next, specify the hyperparameters of Decision Forest.
#### Hyperparameters
<table style="border: 1px solid black;">
    <tr>
        <td><strong>Parameter name</strong></td>
        <td><strong>Type</strong></td>
        <td><strong>Default value</strong></td>
        <td><strong>Description</strong></td>
    </tr>
    <tr>
        <td>nClasses</td>
        <td>int</td>
        <td>3</td>
        <td>Number of classes in data (only for classification)</td>
    </tr>
    <tr>
        <td>fptype</td>
        <td>str</td>
        <td>"double"</td>
        <td>The floating-point type that the algorithm uses for intermediate computations. Can be "float" or "double"</td>
    </tr>
    <tr>
        <td>method</td>
        <td>str</td>
        <td>"defaultDense"</td>
        <td>The only training method supported so far is the default dense method</td>
    </tr>
    <tr>
        <td>nTrees</td>
        <td>int</td>
        <td>100</td>
        <td>The number of trees in the forest</td>
    </tr>
    <tr>
        <td>observationsPerTreeFraction</td>
        <td>float</td>
        <td>1</td>
        <td>Fraction of the training set S used to form the bootstrap set for training a single tree; must be in (0, 1]. The observations are sampled randomly with replacement</td>
    </tr>
    <tr>
        <td>featuresPerNode</td>
        <td>int</td>
        <td>0</td>
        <td>The number of features tried as possible splits per node. If the parameter is set to 0, the library uses the square root of the number of features for classification and (the number of features)/3 for regression</td>
    </tr>
    <tr>
        <td>maxTreeDepth</td>
        <td>int</td>
        <td>0</td>
        <td>Maximal tree depth. Default is 0 (unlimited).</td>
    </tr>
    <tr>
        <td>minObservationsInLeafNode</td>
        <td>int</td>
        <td>1 for classification, 5 for regression</td>
        <td>Minimum number of observations in a leaf node</td>
    </tr>
    <tr>
        <td>seed</td>
        <td>int</td>
        <td>777</td>
        <td>The seed for the random number generator, which is used to choose the bootstrap set, select the candidate split features at every split node of a tree, and generate the permutations required to compute MDA variable importance</td>
    </tr>
    <tr>
        <td>impurityThreshold</td>
        <td>float</td>
        <td>0</td>
        <td>The threshold value used as a stopping criterion: if the impurity value in a node is smaller than the threshold, the node is not split further</td>
    </tr>
    <tr>
        <td>varImportance</td>
        <td>str</td>
        <td>None</td>
        <td>The variable importance computation mode. Possible values:<br/>none - variable importance is not calculated<br/>MDI - Mean Decrease of Impurity, also known as the Gini importance or Mean Decrease Gini<br/>MDA_Raw - Mean Decrease of Accuracy (permutation importance)<br/>MDA_Scaled - the MDA_Raw value scaled by its standard deviation</td>
    </tr>
    <tr>
        <td>resultsToCompute</td>
        <td>str</td>
        <td>"computeOutOfBagError|computeOutOfBagErrorPerObservation"</td>
        <td>Provide one of the following values to request a single characteristic or use bitwise OR to request a combination of the characteristics:<br/>computeOutOfBagError, computeOutOfBagErrorPerObservation</td>
    </tr>
    <tr>
        <td>memorySavingMode</td>
        <td>bool</td>
        <td>False</td>
        <td>If True, memory saving mode is enabled</td>
    </tr>
    <tr>
        <td>bootstrap</td>
        <td>bool</td>
        <td>False for classification, True for regression</td>
        <td>If True, bootstrap is enabled</td>
    </tr>
</table>
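
To make two of the table entries concrete, here is a minimal sketch of how `featuresPerNode = 0` could resolve to a default and how a bootstrap set could be drawn from `observationsPerTreeFraction`. This is not the library's implementation: the function names are hypothetical, and rounding down with a lower bound of 1 is an assumption.

```python
import math
import random
from typing import List


def default_features_per_node(n_features: int, task: str) -> int:
    """Resolve featuresPerNode=0 following the table's description.

    Rounding down and the lower bound of 1 are assumptions of this sketch.
    """
    if task == "classification":
        return max(1, int(math.sqrt(n_features)))
    if task == "regression":
        return max(1, n_features // 3)
    raise ValueError("task must be 'classification' or 'regression'")


def bootstrap_indices(n_observations: int, fraction: float, rng: random.Random) -> List[int]:
    """Sample row indices randomly with replacement for training one tree."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("observationsPerTreeFraction must be in (0, 1]")
    sample_size = max(1, int(fraction * n_observations))
    return [rng.randrange(n_observations) for _ in range(sample_size)]


rng = random.Random(777)  # Python's generator, not the one used by the library
sample = bootstrap_indices(n_observations=10, fraction=0.5, rng=rng)

print(len(sample))                                      # 5 indices, duplicates allowed
print(default_features_per_node(16, "classification"))  # sqrt(16) = 4
print(default_features_per_node(12, "regression"))      # 12 / 3 = 4
```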

Example of hyperparameters dictionary:

In [2]:
decision_forest_params = {
    "nClasses": 3,
    "fptype":"double",
    "method":"defaultDense",
    "nTrees":"100",
    "observationsPerTreeFraction":"1",
    "featuresPerNode":"0",
    "maxTreeDepth":"0",
    "minObservationsInLeafNode":"1",
    "seed":"777",
    "impurityThreshold":"0",
    "varImportance":"None",
    "resultsToCompute":"0",
    "memorySavingMode":"False",
    "bootstrap":"False",
    "distributed":"False"
}
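
Before submitting a job, it can help to check a dictionary like the one above against the ranges listed in the table. The following is a minimal sketch, not part of the SageMaker SDK or Intel® DAAL: `check_decision_forest_params` is a hypothetical helper, and the lower bounds it enforces for the integer parameters are assumptions. It also converts every value to a string, since SageMaker training jobs receive hyperparameters as strings.

```python
def check_decision_forest_params(params: dict) -> dict:
    """Check a few documented ranges and return a copy with string values.

    Hypothetical helper for illustration only.
    """
    if params.get("fptype", "double") not in ("float", "double"):
        raise ValueError('fptype must be "float" or "double"')

    fraction = float(params.get("observationsPerTreeFraction", 1))
    if not 0.0 < fraction <= 1.0:
        raise ValueError("observationsPerTreeFraction must be in (0, 1]")

    # Assumption: parameters that use 0 to mean "library default" or
    # "unlimited" must not be negative.
    for name in ("featuresPerNode", "maxTreeDepth"):
        if int(params.get(name, 0)) < 0:
            raise ValueError(name + " must be >= 0")

    # Assumption: a forest needs at least one tree.
    if int(params.get("nTrees", 100)) < 1:
        raise ValueError("nTrees must be >= 1")

    return {key: str(value) for key, value in params.items()}


example_params = {
    "nClasses": 3,
    "fptype": "double",
    "nTrees": "100",
    "observationsPerTreeFraction": "1",
}
checked = check_decision_forest_params(example_params)
print(checked["nClasses"], type(checked["nClasses"]).__name__)  # prints: 3 str
```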

Then create a SageMaker Estimator instance with the following parameters:
<table style="border: 1px solid black;">
    <tr>
        <td><strong>Parameter name</strong></td>
        <td><strong>Description</strong></td>
    </tr>
    <tr>
        <td>algorithm_arn</td>
        <td>The ARN of the algorithm to use for training. You can find it on the algorithm page in your subscriptions</td>
    </tr>
    <tr>
        <td>role</td>
        <td>An AWS IAM role. The SageMaker training jobs and APIs that create SageMaker endpoints use this role to access training data and models</td>
    </tr>
    <tr>
        <td>train_instance_count</td>
        <td>Number of Amazon EC2 instances to use for training. Must be 1, because this is not a distributed version of the algorithm</td>
    </tr>
    <tr>
        <td>train_instance_type</td>
        <td>Type of EC2 instance to use for training. See the available types on the algorithm's AWS Marketplace page</td>
    </tr>
    <tr>
        <td>input_mode</td>
        <td>The input mode that the algorithm supports. May be "File" or "Pipe"</td>
    </tr>
    <tr>
        <td>output_path</td>
        <td>S3 location for saving the training result (model artifacts and output files)</td>
    </tr>
    <tr>
        <td>sagemaker_session</td>
        <td>Session object which manages interactions with Amazon SageMaker APIs and any other AWS services needed</td>
    </tr>
    <tr>
        <td>hyperparameters</td>
        <td>Dictionary containing the hyperparameters to initialize this estimator with</td>
    </tr>
</table>
Full SageMaker Estimator documentation: https://sagemaker.readthedocs.io/en/latest/estimators.html

In [9]:
daal_decision_forest_arn = "<algorithm-arn>" # find it on the algorithm page in your subscriptions

daal_decision_forest = sagemaker.algorithm.AlgorithmEstimator(
    algorithm_arn=daal_decision_forest_arn,
    role=role,
    base_job_name="<base-job-name>",
    train_instance_count=1,
    train_instance_type='ml.m4.xlarge',
    input_mode="File",
    output_path="s3://<bucket-name>/<output-path>",
    sagemaker_session=sess,
    hyperparameters=decision_forest_params
)

### Training stage
During the training stage, the Decision Forest algorithm consumes input data from an S3 location.
This container supports only CSV (comma-separated values) files.

In [None]:
daal_decision_forest.fit({"training": "s3://<bucket-name>/<training-data-path>"})

### Real-time prediction
First, deploy a SageMaker endpoint that consumes data.

In [None]:
predictor = daal_decision_forest.deploy(1, "ml.m4.xlarge", serializer=sagemaker.predictor.csv_serializer)

Next, pass the data as a NumPy array to the predictor instance. The predictions are returned as space-separated values.

This example passes random data, but you can use any 2D NumPy array.

In [None]:
import numpy as np

predict_data = np.random.random(size=(10,10))
print(predictor.predict(predict_data).decode("utf-8"))
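
The decoded response is plain text, so it usually needs to be converted to numbers before further processing. Here is a minimal sketch, assuming the response contains only whitespace-separated numeric values. `parse_predictions` is a hypothetical helper, and the example response is made up rather than real endpoint output.

```python
from typing import List


def parse_predictions(response: bytes) -> List[float]:
    """Decode an endpoint response and split it into floats.

    str.split() with no arguments splits on any whitespace, so values
    separated by spaces or by newlines are both handled.
    """
    text = response.decode("utf-8")
    return [float(token) for token in text.split()]


example_response = b"0 2 1 1 0"  # made-up stand-in for predictor.predict(...)
parsed = parse_predictions(example_response)
print(parsed)  # prints: [0.0, 2.0, 1.0, 1.0, 0.0]
```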

Don't forget to delete the endpoint when you no longer need it.

In [None]:
sess.delete_endpoint(predictor.endpoint)

### Batch transform job
If you don't need real-time prediction, you can use a batch transform job. It uses the saved model, computes the predictions once, and saves them to the specified or auto-generated output path.

More about transform jobs: https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-batch.html

Transformer API: https://sagemaker.readthedocs.io/en/latest/transformer.html

In [None]:
transformer = daal_decision_forest.transformer(1, 'ml.m4.xlarge')
transformer.transform("s3://<bucket-name>/<prediction-data-path>", content_type='text/csv')
transformer.wait()
print(transformer.output_path)
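
Batch transform writes one result object per input object, named after the input file with an `.out` suffix appended. A minimal sketch of locating the result for a single input file follows. `transform_output_uri` is a hypothetical helper, the bucket and paths are placeholders, and inputs given as S3 prefixes are not covered.

```python
import posixpath


def transform_output_uri(output_path: str, input_uri: str) -> str:
    """Build the expected S3 URI of the result for one input object.

    Assumes input_uri points to a single file, not to an S3 prefix.
    """
    file_name = posixpath.basename(input_uri)
    return output_path.rstrip("/") + "/" + file_name + ".out"


result_uri = transform_output_uri(
    "s3://example-bucket/transform-output/",   # placeholder for transformer.output_path
    "s3://example-bucket/prediction/data.csv"  # placeholder input location
)
print(result_uri)  # prints: s3://example-bucket/transform-output/data.csv.out
```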