# Tree Methods (Documentation Example)

This is a quick walkthrough of the random forest example from the Spark documentation.

Remember, you can use tree methods for both regression and classification problems. 

In [4]:
# Must be included at the beginning of each new notebook. Remember to change the app name.
import findspark
findspark.init('/home/ubuntu/spark-3.2.1-bin-hadoop2.7')
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('tree_methods_doc').getOrCreate()

from pyspark.ml import Pipeline
from pyspark.ml.classification import (RandomForestClassifier, GBTClassifier, DecisionTreeClassifier)
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

In [None]:
# Load and parse the data file, converting it to a DataFrame.
data = spark.read.format("libsvm").load("Datasets/sample_libsvm_data.txt")

In [None]:
# Let's get a better look at the data.
data.show()

data.printSchema()

In [None]:
# Split the data into training and test sets (30% held out for testing).
(trainingData, testData) = data.randomSplit([0.7, 0.3])
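
Because the split is random, the two sets are only approximately 70% and 30% of the rows, and they change from run to run unless a seed is fixed (`randomSplit` accepts an optional `seed` argument). The idea can be sketched in plain Python. This is an illustration only, not Spark's implementation, and the helper name `random_split` is hypothetical:

```python
import random

def random_split(rows, weights, seed=None):
    """Assign each row to one split by a random draw (hypothetical helper)."""
    rng = random.Random(seed)
    total = sum(weights)
    # Turn the weights into cumulative boundaries, e.g. [0.7, 1.0].
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, bound in enumerate(bounds):
            # The last split catches any draw not claimed earlier.
            if draw < bound or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits

rows = list(range(100))
train, test = random_split(rows, [0.7, 0.3], seed=42)
train_again, test_again = random_split(rows, [0.7, 0.3], seed=42)
print(len(train), len(test))   # roughly 70 and 30, not exactly
print(train == train_again)    # same seed, same split
```

Fixing the seed makes results comparable between runs, which helps when comparing models.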

In [None]:
# Create the decision tree and random forest models. Note the number of trees in the forest.
# More trees mean more computation time, but they can also improve accuracy, so there is a tradeoff.
dt = DecisionTreeClassifier(labelCol="label", featuresCol="features")
rf = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=20)
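
To see why the number of trees matters, it helps to picture how a forest combines its trees. The sketch below assumes simple hard majority voting, which is the textbook description. Spark's internal aggregation may differ in detail, and `majority_vote` is a hypothetical helper, not a Spark function:

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the class predicted by the most trees (hypothetical helper)."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Pretend five trees each predicted a class for the same row.
votes = [1.0, 0.0, 1.0, 1.0, 0.0]
forest_prediction = majority_vote(votes)
print(forest_prediction)  # 1.0
```

Each extra tree adds one more vote to compute, which is where the cost in the tradeoff comes from.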

In [None]:
# Train both models.
model_rf = rf.fit(trainingData)
model_dt = dt.fit(trainingData)

In [None]:
# Now generate predictions on the test set.
prediction_rf = model_rf.transform(testData)
prediction_dt = model_dt.transform(testData)

In [None]:
# Let's look at the random forest predictions first.
prediction_rf.show()

In [None]:
# Select example rows to display.
prediction_rf.select("prediction", "label", "features").show(5)

In [None]:
# Create an evaluator that compares (prediction, true label) pairs and reports accuracy.
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")

In [None]:
# A test error of zero means that the model accuracy is at 100%. 
# In most cases this is unrealistic, but here it's correct due to the simple data used in the documentation.
accuracy = evaluator.evaluate(prediction_rf)
print("Test Error = %g" % (1.0 - accuracy))
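
The relationship between accuracy and test error is simple arithmetic. A small plain-Python sketch with made-up (prediction, label) pairs, which are not taken from the dataset above:

```python
def accuracy(pairs):
    """Fraction of (prediction, label) pairs where the two values match."""
    correct = sum(1 for prediction, label in pairs if prediction == label)
    return correct / len(pairs)

# Hypothetical predictions: three of the four are correct.
pairs = [(1.0, 1.0), (0.0, 0.0), (1.0, 0.0), (0.0, 0.0)]
acc = accuracy(pairs)
test_error = 1.0 - acc
print("Test Error = %g" % test_error)  # Test Error = 0.25
```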

## Gradient Boosted Trees

Gradient-boosted trees (GBTs) are a popular classification and regression method using ensembles of decision trees. More information about the spark.ml implementation can be found further in the section on [GBTs](http://spark.apache.org/docs/latest/ml-classification-regression.html#gradient-boosted-trees-gbts). For more information on the algorithm itself, please see the [spark.mllib documentation on GBTs.](http://spark.apache.org/docs/latest/mllib-ensembles.html#gradient-boosted-trees-gbts)
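
The core idea of boosting is that each new tree is fit to the errors left by the trees before it. Here is a minimal plain-Python sketch, assuming squared-error regression on a single feature with one-split trees (stumps). It is an illustration of the idea only, not the spark.ml implementation, and all names in it are hypothetical:

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that best fits the residuals."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def boost(xs, ys, n_rounds=10, learning_rate=0.5):
    """Build an additive model: a constant plus a sum of scaled stumps."""
    base = sum(ys) / len(ys)
    stumps = []
    preds = [base] * len(xs)
    for _ in range(n_rounds):
        # Each round fits a stump to what the current model still gets wrong.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

def mse(predict, xs, ys):
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
mse_1 = mse(boost(xs, ys, n_rounds=1), xs, ys)
mse_10 = mse(boost(xs, ys, n_rounds=10), xs, ys)
print(mse_1, mse_10)  # training error shrinks as rounds are added
```

The `n_rounds` argument plays a role similar to `maxIter` in `GBTClassifier`: more rounds mean more trees and more training time.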

Luckily, Spark makes GBTs very easy to use; it is basically just an import switch:

In [None]:
from pyspark.ml.classification import GBTClassifier

# Load and parse the data file, converting it to a DataFrame.
data = spark.read.format("libsvm").load("Datasets/sample_libsvm_data.txt")

# Split the data into training and test sets (30% held out for testing).
(trainingData, testData) = data.randomSplit([0.7, 0.3])

# Configure a GBT model.
gbt = GBTClassifier(labelCol="label", featuresCol="features", maxIter=10)

# Train the model.
model = gbt.fit(trainingData)

# Make predictions.
predictions = model.transform(testData)

# Select example rows to display.
predictions.select("prediction", "label", "features").show(5)

In [None]:
# Select (prediction, true label) and compute test error
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test Error = %g" % (1.0 - accuracy))

Let's move on to a more realistic example!