
# 3.4.2 Random Forests


## Lesson Objectives 

After completing this lesson, you should be able to: 

* Understand the Pipelines API for Random Forests and Gradient-Boosted Trees
* Describe the default Input and Output columns
* Perform classification and regression with Random Forests (RFs)
* Understand and use Random Forests parameters

## Ensemble Method

* An Ensemble is a learning algorithm which creates an aggregate model composed of a set of other base models
* 'Random Forests' and 'Gradient-Boosted Trees' are ensemble algorithms based on decision trees
* Ensemble algorithms are among the top performers for classification and regression problems


## Random Forests (RFs)

* Random Forests are ensembles of Decision Trees
* One of the most successful machine learning models for classification and regression
* Random Forests combine many decision trees in order to reduce the risk of overfitting
* The Pipelines API for Random Forests supports both binary and multiclass classification
* Supports regression
* It also supports continuous and categorical features


## RF: Basic Algorithm

This is a quick description of the basic algorithm of Random Forests:

* RF trains a set of decision trees separately, so the training can be done in parallel
* RF injects randomness into the training process so that each tree is a bit different. This randomness comes from two sources:
  * bootstrapping: subsampling the original data set on each iteration to get a different training set
  * considering different random subsets of features to split on at each tree node
* Each tree then makes a prediction, and combining the predictions from several trees reduces the variance of the predictions and improves performance on test data
  * classification: majority vote - each tree's prediction is counted as a vote for one class and the predicted label is the class with the largest number of votes
  * regression: average - each tree predicts a real value and the predicted label is equal to the average of all predictions
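
The steps above can be sketched in plain Scala, without Spark. This is only an illustration of the idea, not Spark's implementation: the helper names (`bootstrapIndices`, `randomFeatureSubset`, `majorityVote`, `averagePrediction`) and the sample numbers are made up for this example.

```scala
import scala.util.Random

// Bootstrapping: draw n row indices with replacement from a data set of n rows.
def bootstrapIndices(n: Int, rng: Random): Seq[Int] =
  Seq.fill(n)(rng.nextInt(n))

// Feature subsetting: pick k distinct candidate features out of numFeatures.
def randomFeatureSubset(numFeatures: Int, k: Int, rng: Random): Seq[Int] =
  rng.shuffle((0 until numFeatures).toList).take(k).sorted

// Classification: the class with the most votes wins.
// (Ties are broken arbitrarily in this sketch.)
def majorityVote(treePredictions: Seq[Double]): Double =
  treePredictions.groupBy(identity).maxBy(_._2.size)._1

// Regression: the forest's prediction is the mean of the tree predictions.
def averagePrediction(treePredictions: Seq[Double]): Double =
  treePredictions.sum / treePredictions.size

val rng = new Random(42)
val sampleIdx = bootstrapIndices(10, rng)                  // row indices for one tree
val candidateFeatures = randomFeatureSubset(692, 27, rng)  // candidates at one node

val vote = majorityVote(Seq(1.0, 0.0, 1.0))
val avgPrediction = averagePrediction(Seq(0.0, 0.5, 1.0))
println(s"vote = $vote, average = $avgPrediction")
```

With three trees voting 1.0, 0.0 and 1.0, the majority class is 1.0; averaging 0.0, 0.5 and 1.0 gives 0.5.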


## Random Forest Parameters I

Now let's look at the parameters of Random Forests in `spark.ml`.

I start with the most important parameters, the number of trees and the maximum depth, which CAN be tuned to improve performance:
* **numTrees**: the total number of trees in the forest. As the number of trees increases:
  * the variance of predictions decreases, improving test-time accuracy
  * training time on the other hand increases roughly linearly with the number of trees
* **maxDepth**: the maximum depth of each tree in the forest. As trees get deeper:
  * model gets more expressive and powerful 
  * takes longer to train 
  * more prone to overfitting
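
To get a feel for why more trees reduce variance, here is a toy simulation in plain Scala. It assumes each "tree" is just the true value plus independent Gaussian noise, which is a strong simplification: real trees in a forest are correlated, so the actual gain is smaller. All names here are hypothetical.

```scala
import scala.util.Random

// Hypothetical stand-in for one tree: the true value 1.0 plus Gaussian noise.
def noisyTreePrediction(rng: Random): Double = 1.0 + rng.nextGaussian()

// A "forest" prediction is the mean of numTrees such predictions.
def forestPrediction(numTrees: Int, rng: Random): Double =
  Seq.fill(numTrees)(noisyTreePrediction(rng)).sum / numTrees

// Sample variance of the forest prediction over many repeated trials.
def predictionVariance(numTrees: Int, trials: Int, seed: Long): Double = {
  val rng = new Random(seed)
  val preds = Seq.fill(trials)(forestPrediction(numTrees, rng))
  val m = preds.sum / trials
  preds.map(p => (p - m) * (p - m)).sum / (trials - 1)
}

val varianceByNumTrees =
  Seq(1, 10, 100).map(n => n -> predictionVariance(n, 2000, 7L))

varianceByNumTrees.foreach { case (n, v) =>
  println(f"numTrees = $n%3d  variance = $v%.4f")
}
```

Under the independence assumption the variance shrinks roughly in proportion to 1 / numTrees, while the work done grows linearly with the number of trees.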


## Random Forest Parameters II

The second set of parameters for Random Forests DO NOT require tuning, but they CAN be tuned to speed up training:
* **subsamplingRate**: specifies the fraction of the size of the original data set to be used for training each tree in the forest
  * default = 1.0
  * With the default, each tree's training sample is as large as the original data set
  * Decreasing this value can speed up training as it uses a smaller sample, but the accuracy of the model may suffer
* **featureSubsetStrategy**: specifies the number of features to use as candidates for splitting at each tree node, given as a fraction or a function of the total number of features (for example `"all"`, `"sqrt"`, `"log2"` or `"onethird"`)
  * using fewer candidate features can speed up training
  * using too few can hurt the model's predictive performance
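
As a rough illustration of how much a strategy shrinks the candidate set, the sketch below computes the number of candidate features per node for a few named strategies. The function `candidateFeatureCount` is hypothetical, and rounding up is an assumption about how the counts are derived rather than something stated in this lesson.

```scala
// Number of candidate features per node for a few named strategies.
// Assumption: fractional results are rounded up.
def candidateFeatureCount(strategy: String, numFeatures: Int): Int = strategy match {
  case "all"      => numFeatures
  case "sqrt"     => math.ceil(math.sqrt(numFeatures.toDouble)).toInt
  case "log2"     => math.max(1, math.ceil(math.log(numFeatures.toDouble) / math.log(2.0)).toInt)
  case "onethird" => math.ceil(numFeatures / 3.0).toInt
  case other      =>
    throw new IllegalArgumentException(s"Strategy not covered by this sketch: $other")
}

// The sample data set used later in this lesson has 692 features.
val totalFeatures = 692
Seq("all", "sqrt", "log2", "onethird").foreach { s =>
  println(s"$s -> ${candidateFeatureCount(s, totalFeatures)} candidate features per node")
}
```

For 692 features this gives 692, 27, 10 and 231 candidates respectively, which shows why a stricter strategy makes each split cheaper to evaluate.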


## Inputs and Outputs

The Inputs taken and the Outputs produced by Random Forests in the Pipelines API are, not surprisingly, exactly the same as Decision Trees.

| Param name  | Type(s) | Default    | Description      |
| ----------- | ------- | ---------- | ---------------- |
| labelCol    | Double  | "label"    | Label to predict |
| featuresCol | Vector  | "features" | Feature vector   |

| Param name       | Type(s) | Default         | Description                              | Notes               |
| ---------------- | ------- | --------------- | ---------------------------------------- | ------------------- |
| predictionCol    | Double  | "prediction"    | Predicted label                          |                     |
| rawPredictionCol | Vector  | "rawPrediction" | Vector of length # classes, with the counts of training instance labels at the tree node which makes the prediction | Classification only |
| probabilityCol   | Vector  | "probability"   | Vector of length # classes equal to rawPrediction normalized to a multinomial distribution | Classification only |

A quick recap: the inputs are the label and features columns, and the outputs are the prediction, rawPrediction and probability columns, where the last two apply only to classification.
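
The relationship between `rawPrediction`, `probability` and `prediction` described in the table can be sketched in plain Scala. The raw values below are invented for a two-class problem; `normalizeToProbabilities` is a hypothetical helper, not a Spark function.

```scala
// Hypothetical raw prediction for a two-class problem (one entry per class).
val rawPredictionExample = Array(2.0, 1.0)

// Normalize the raw scores so that they sum to 1.
def normalizeToProbabilities(raw: Array[Double]): Array[Double] = {
  val total = raw.sum
  if (total == 0.0) raw.clone() else raw.map(_ / total)
}

val probabilityExample = normalizeToProbabilities(rawPredictionExample)

// The predicted label index is the class with the highest probability.
val predictedIndex = probabilityExample.indexOf(probabilityExample.max).toDouble

println(probabilityExample.mkString("[", ", ", "]") + " -> prediction " + predictedIndex)
```

Here class 0 receives two thirds of the total, so the predicted label index is 0.0.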


## Continuing from Previous Example

In [1]:
import sys.process._
// Download the sample data set; the expression evaluates to wget's exit code (0 on success)
"wget https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/SC0105EN/data/sample_libsvm_data.txt -P /resources/data/".!

0

In [2]:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().getOrCreate()
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.ml.feature.{StringIndexer, IndexToString, VectorIndexer}
import spark.implicits._
import org.apache.spark.sql.functions._

spark = org.apache.spark.sql.SparkSession@2988afba

In [3]:
// Load the LIBSVM file as an RDD of labeled points and convert it to a DataFrame
val data = MLUtils.loadLibSVMFile(spark.sparkContext, "/resources/data/sample_libsvm_data.txt").toDF()
data.show()

+-----+--------------------+
|label|            features|
+-----+--------------------+
|  0.0|(692,[127,128,129...|
|  1.0|(692,[158,159,160...|
|  1.0|(692,[124,125,126...|
|  1.0|(692,[152,153,154...|
|  1.0|(692,[151,152,153...|
|  0.0|(692,[129,130,131...|
|  1.0|(692,[158,159,160...|
|  1.0|(692,[99,100,101,...|
|  0.0|(692,[154,155,156...|
|  0.0|(692,[127,128,129...|
|  1.0|(692,[154,155,156...|
|  0.0|(692,[153,154,155...|
|  0.0|(692,[151,152,153...|
|  1.0|(692,[129,130,131...|
|  0.0|(692,[154,155,156...|
|  1.0|(692,[150,151,152...|
|  0.0|(692,[124,125,126...|
|  0.0|(692,[152,153,154...|
|  1.0|(692,[97,98,99,12...|
|  1.0|(692,[124,125,126...|
+-----+--------------------+
only showing top 20 rows



data = [label: double, features: vector]

## RF Classification I

Once again I'm going to build on the previous example: the `DecisionTreeClassifier`.

Remember the Pipeline I used then had 4 stages: two preprocessing estimators, one decision tree classifier and one postprocessing transformer. 

Since I'm using the same training data the only thing I need to change is the classifier itself. All the rest, pre and post processing estimators and transformers, remain the same. 

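So first I create a new instance of a `RandomForestClassifier`. It will take as inputs the columns named `indexedLabel` and `indexedFeatures`. The number of trees I'm going to train is quite small: just 3.

Because the data was loaded through the older MLlib API, its feature vectors are converted to `spark.ml` vectors with `toML` before they are passed to the Pipelines API.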

In [4]:
import org.apache.spark.mllib.util.MLUtils.{
  convertVectorColumnsFromML => fromML,
  convertVectorColumnsToML => toML
}

In [5]:
// Map the labels to label indices; fitted on the full data set so every label is seen
val labelIndexer = new StringIndexer().setInputCol("label").setOutputCol("indexedLabel").fit(data)
// Map predicted label indices back to the original labels
val labelConverter = new IndexToString().setInputCol("prediction").setOutputCol("predictedLabel").setLabels(labelIndexer.labels)
// Treat features with at most 4 distinct values as categorical
val featureIndexer = new VectorIndexer().setInputCol("features").setOutputCol("indexedFeatures").setMaxCategories(4).fit(toML(data))

import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.classification.RandomForestClassificationModel

// Random Forest classifier with only 3 trees, reading the indexed label and feature columns
val rfC = new RandomForestClassifier().setLabelCol("indexedLabel").setFeaturesCol("indexedFeatures").setNumTrees(3)

labelIndexer = strIdx_591a2e3622a7
labelConverter = idxToStr_3274656ce132
featureIndexer = vecIdx_d2de5de41a47
rfC = rfc_27801d1cc651

## RF Classification II

Then I create a new Pipeline also with 4 stages but replacing the `DecisionTreeClassifier` with the new `RandomForestClassifier` as its third stage.  

This is the `pipelineRFC`: the Random Forest Classifier.

In [6]:
import org.apache.spark.ml.Pipeline

// split into training and test data
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))

val pipelineRFC = new Pipeline().setStages(Array(labelIndexer, featureIndexer, rfC, labelConverter))
val modelRFC = pipelineRFC.fit(toML(trainingData))
val predictionsRFC = modelRFC.transform(toML(testData))

trainingData = [label: double, features: vector]
testData = [label: double, features: vector]
pipelineRFC = pipeline_804359d5c72d
modelRFC = pipeline_804359d5c72d
predictionsRFC = [label: double, features: vector ... 6 more fields]

All the rest is exactly the same as before: calling the `fit` method to get a model and calling the `transform` method to make predictions.

The predictions are then returned in the `predictionsRFC` `DataFrame`.

## RF Classification III  

Let's take a look at the `predictionsRFC` `DataFrame`:

In [7]:
predictionsRFC.select("predictedLabel", "label", "features").show(10)

+--------------+-----+--------------------+
|predictedLabel|label|            features|
+--------------+-----+--------------------+
|           0.0|  0.0|(692,[98,99,100,1...|
|           0.0|  0.0|(692,[100,101,102...|
|           0.0|  0.0|(692,[123,124,125...|
|           0.0|  0.0|(692,[124,125,126...|
|           0.0|  0.0|(692,[125,126,127...|
|           0.0|  0.0|(692,[126,127,128...|
|           0.0|  0.0|(692,[126,127,128...|
|           0.0|  0.0|(692,[126,127,128...|
|           0.0|  0.0|(692,[126,127,128...|
|           0.0|  0.0|(692,[152,153,154...|
+--------------+-----+--------------------+
only showing top 10 rows



## RF Classification IV

In [8]:
val rfModelC = modelRFC.stages(2).asInstanceOf[RandomForestClassificationModel]

rfModelC = RandomForestClassificationModel (uid=rfc_27801d1cc651) with 3 trees

In [9]:
rfModelC.featureImportances

(692,[301,429,454,462,492,512],[0.3000608642726719,0.03327246906066141,0.04598930481283424,0.3035897435897436,0.029743589743589732,0.2873440285204992])

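We can extract the fitted `RandomForestClassificationModel` from the pipeline model (it is the stage at index 2) and, from that, read the feature importances. The result is a sparse vector: only the six features actually used by the three trees have a non-zero importance.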
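
The sparse vector is easier to read once it is ranked. The sketch below copies the indices and values from the output above into plain Scala arrays (the variable names are made up for this example) and sorts them by importance.

```scala
// Indices and values copied from the featureImportances output above.
val importanceIndices = Array(301, 429, 454, 462, 492, 512)
val importanceValues = Array(
  0.3000608642726719, 0.03327246906066141, 0.04598930481283424,
  0.3035897435897436, 0.029743589743589732, 0.2873440285204992)

// Pair each feature index with its importance and sort in descending order.
val rankedImportances = importanceIndices.zip(importanceValues).sortBy(-_._2)

rankedImportances.foreach { case (idx, weight) =>
  println(f"feature $idx%3d  importance $weight%.4f")
}

// The importances of this run add up to 1 (within floating-point error).
val totalImportance = importanceValues.sum
```

Features 462, 301 and 512 carry almost all of the weight in this run, and each of them is the root split of one of the three trees.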

## RF Classification V

Now let's take a look at the model's rules. I can use `toDebugString` to inspect the rules of each and every tree:

In [10]:
println("Learned classification forest model:\n" + rfModelC.toDebugString)

Learned classification forest model:
RandomForestClassificationModel (uid=rfc_27801d1cc651) with 3 trees
  Tree 0 (weight 1.0):
    If (feature 512 <= 1.5)
     If (feature 454 <= 24.5)
      Predict: 0.0
     Else (feature 454 > 24.5)
      Predict: 1.0
    Else (feature 512 > 1.5)
     Predict: 1.0
  Tree 1 (weight 1.0):
    If (feature 462 <= 62.5)
     If (feature 492 <= 205.5)
      Predict: 1.0
     Else (feature 492 > 205.5)
      Predict: 0.0
    Else (feature 462 > 62.5)
     Predict: 0.0
  Tree 2 (weight 1.0):
    If (feature 301 <= 27.0)
     If (feature 429 <= 7.0)
      Predict: 0.0
     Else (feature 429 > 7.0)
      Predict: 1.0
    Else (feature 301 > 27.0)
     Predict: 1.0
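
To make the printed rules concrete, the three trees above can be transcribed by hand into plain Scala functions and combined with a majority vote, as described earlier in this lesson. This is a hand-written sketch: the `FeatureMap` type, the helper names and the two input examples are hypothetical, the predictions are label indices, and Spark's own aggregation may weight votes by leaf class probabilities rather than counting hard votes.

```scala
// Sparse feature representation: index -> value, missing entries are 0.0.
type FeatureMap = Map[Int, Double]
def featureValue(x: FeatureMap, i: Int): Double = x.getOrElse(i, 0.0)

// Tree 0 from the debug string.
def tree0(x: FeatureMap): Double =
  if (featureValue(x, 512) <= 1.5) {
    if (featureValue(x, 454) <= 24.5) 0.0 else 1.0
  } else 1.0

// Tree 1 from the debug string.
def tree1(x: FeatureMap): Double =
  if (featureValue(x, 462) <= 62.5) {
    if (featureValue(x, 492) <= 205.5) 1.0 else 0.0
  } else 0.0

// Tree 2 from the debug string.
def tree2(x: FeatureMap): Double =
  if (featureValue(x, 301) <= 27.0) {
    if (featureValue(x, 429) <= 7.0) 0.0 else 1.0
  } else 1.0

// Majority vote over the three trees (each tree has weight 1.0).
def forestVote(x: FeatureMap): Double =
  Seq(tree0(x), tree1(x), tree2(x)).groupBy(identity).maxBy(_._2.size)._1

// Two made-up inputs.
val allZeroFeatures: FeatureMap = Map.empty
val brightPixels: FeatureMap = Map(512 -> 3.0, 462 -> 10.0, 492 -> 100.0, 301 -> 30.0)

println(s"all zero -> ${forestVote(allZeroFeatures)}, bright pixels -> ${forestVote(brightPixels)}")
```

For the all-zero input the trees vote 0.0, 1.0 and 0.0, so the forest predicts 0.0; for the second input all three trees vote 1.0.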



## RF for Regression

Having completed an example of classification with Random Forests, it is time for an example of regression. Once again I will build on the previous regression example using Decision Trees.

The Pipeline for regression, in that case, had only 2 stages - the `featureIndexer` and the `DecisionTreeRegressor`.

Now I replace the Decision Tree with the `RandomForestRegressor` and create a new `Pipeline`. This is the `pipelineRFR`, from Random Forest `Regressor`. All the rest is exactly the same as before: calling the `fit` method to get a model and calling the `transform` method to make predictions.

In [11]:
import org.apache.spark.ml.regression.RandomForestRegressor
import org.apache.spark.ml.regression.RandomForestRegressionModel

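// No setNumTrees call here, so the forest uses the default of 20 trees
val rfR = new RandomForestRegressor().setLabelCol("label").setFeaturesCol("indexedFeatures")

// Two stages only: index the features, then fit the regressor
val pipelineRFR = new Pipeline().setStages(Array(featureIndexer, rfR))

// Fit on the training split
val modelRFR = pipelineRFR.fit(toML(trainingData))

// Predict on the held-out test split
val predictions = modelRFR.transform(toML(testData))
predictions.show()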

+-----+--------------------+--------------------+----------+
|label|            features|     indexedFeatures|prediction|
+-----+--------------------+--------------------+----------+
|  0.0|(692,[98,99,100,1...|(692,[98,99,100,1...|       0.0|
|  0.0|(692,[100,101,102...|(692,[100,101,102...|      0.25|
|  0.0|(692,[123,124,125...|(692,[123,124,125...|       0.0|
|  0.0|(692,[124,125,126...|(692,[124,125,126...|      0.05|
|  0.0|(692,[125,126,127...|(692,[125,126,127...|       0.1|
|  0.0|(692,[126,127,128...|(692,[126,127,128...|       0.0|
|  0.0|(692,[126,127,128...|(692,[126,127,128...|       0.0|
|  0.0|(692,[126,127,128...|(692,[126,127,128...|       0.0|
|  0.0|(692,[126,127,128...|(692,[126,127,128...|      0.05|
|  0.0|(692,[152,153,154...|(692,[152,153,154...|      0.15|
|  1.0|(692,[123,124,125...|(692,[123,124,125...|       1.0|
|  1.0|(692,[124,125,126...|(692,[124,125,126...|       1.0|
|  1.0|(692,[124,125,126...|(692,[124,125,126...|       1.0|
|  1.0|(692,[126,127,128

rfR = rfr_0a54b4fa24bc
pipelineRFR = pipeline_75994e068ce2
modelRFR = pipeline_75994e068ce2
predictions = [label: double, features: vector ... 2 more fields]

## RF for Regression II

The predictions are then returned in the `predictionsRFR` data frame:

In [12]:
val predictionsRFR = modelRFR.transform(toML(testData))
predictionsRFR.select("prediction", "label", "features").show(5)

+----------+-----+--------------------+
|prediction|label|            features|
+----------+-----+--------------------+
|       0.0|  0.0|(692,[98,99,100,1...|
|      0.25|  0.0|(692,[100,101,102...|
|       0.0|  0.0|(692,[123,124,125...|
|      0.05|  0.0|(692,[124,125,126...|
|       0.1|  0.0|(692,[125,126,127...|
+----------+-----+--------------------+
only showing top 5 rows



predictionsRFR = [label: double, features: vector ... 2 more fields]
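
One way to summarize regression predictions is the root-mean-square error (RMSE). The sketch below computes it in plain Scala for just the five rows shown above, copied by hand; the names are made up for this example, and in the notebook itself Spark's `RegressionEvaluator` would be the usual tool for the full test set.

```scala
// (prediction, label) pairs copied from the five rows shown above.
val shownRows = Seq((0.0, 0.0), (0.25, 0.0), (0.0, 0.0), (0.05, 0.0), (0.1, 0.0))

// RMSE: square root of the mean squared difference between prediction and label.
def rmse(rows: Seq[(Double, Double)]): Double = {
  val squaredErrors = rows.map { case (prediction, label) =>
    val diff = prediction - label
    diff * diff
  }
  math.sqrt(squaredErrors.sum / rows.size)
}

val rmseOnShownRows = rmse(shownRows)
println(f"RMSE on the five shown rows: $rmseOnShownRows%.4f")
```

On these five rows the RMSE is about 0.12. Predictions such as 0.05 and 0.25 are multiples of 1/20, which is what averaging 20 trees that mostly predict 0.0 or 1.0 would produce.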

## Lesson Summary

Having completed this lesson we should now be able to:

* Understand how to run a random forest in Spark
* Grasp most of the parameters and their effects 
* Understand inputs and outputs 
* Understand how to use Random Forests for regression and classification

### About the Authors

[Petro Verkhogliad](https://www.linkedin.com/in/vpetro) is Consulting Manager at Lightbend. He holds a Master's degree in Computer Science with a specialization in Intelligent Systems. He is passionate about functional programming and applications of AI.