This notebook compares three tree-based classification methods:
* A single decision tree
* A random forest
* A gradient boosted tree classifier
    
The dataset described below is used to classify colleges as private or public. `Private` is the label; the remaining columns are the features:

    Private A factor with levels No and Yes indicating private or public university
    Apps Number of applications received
    Accept Number of applications accepted
    Enroll Number of new students enrolled
    Top10perc Pct. new students from top 10% of H.S. class
    Top25perc Pct. new students from top 25% of H.S. class
    F.Undergrad Number of fulltime undergraduates
    P.Undergrad Number of parttime undergraduates
    Outstate Out-of-state tuition
    Room.Board Room and board costs
    Books Estimated book costs
    Personal Estimated personal spending
    PhD Pct. of faculty with Ph.D.’s
    Terminal Pct. of faculty with terminal degree
    S.F.Ratio Student/faculty ratio
    perc.alumni Pct. alumni who donate
    Expend Instructional expenditure per student
    Grad.Rate Graduation rate

## 1. Import libraries and create a Spark session

In [1]:
from pyspark.sql import SparkSession
from pyspark.ml.feature import (StringIndexer, 
                                VectorAssembler)
from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import (DecisionTreeClassifier,
                                       RandomForestClassifier,
                                       GBTClassifier)
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml import Pipeline

In [2]:
spark = SparkSession.builder.appName('TreeClassifier').getOrCreate()

## 2. Get input data

In [3]:
df = spark.read.csv('College.csv', inferSchema=True, header=True)

In [4]:
df.printSchema()

root
 |-- School: string (nullable = true)
 |-- Private: string (nullable = true)
 |-- Apps: integer (nullable = true)
 |-- Accept: integer (nullable = true)
 |-- Enroll: integer (nullable = true)
 |-- Top10perc: integer (nullable = true)
 |-- Top25perc: integer (nullable = true)
 |-- F_Undergrad: integer (nullable = true)
 |-- P_Undergrad: integer (nullable = true)
 |-- Outstate: integer (nullable = true)
 |-- Room_Board: integer (nullable = true)
 |-- Books: integer (nullable = true)
 |-- Personal: integer (nullable = true)
 |-- PhD: integer (nullable = true)
 |-- Terminal: integer (nullable = true)
 |-- S_F_Ratio: double (nullable = true)
 |-- perc_alumni: integer (nullable = true)
 |-- Expend: integer (nullable = true)
 |-- Grad_Rate: integer (nullable = true)
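
The feature list at the top uses dotted names (for example `F.Undergrad`), while the loaded CSV uses underscores (`F_Undergrad`). A minimal sketch of that mapping, assuming the only difference is the dot-to-underscore swap visible in the schema above; `to_csv_name` is a hypothetical helper and not part of the notebook:

```python
# Dotted names as they appear in the dataset description.
documented = ['F.Undergrad', 'P.Undergrad', 'Room.Board',
              'S.F.Ratio', 'perc.alumni', 'Grad.Rate']


def to_csv_name(name):
    """Hypothetical helper: map a documented name to the CSV column name."""
    return name.replace('.', '_')


csv_names = [to_csv_name(n) for n in documented]
print(csv_names)
```

Underscored names are also easier to work with in Spark, where a dot inside a column name generally has to be escaped with backticks in column expressions.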



In [5]:
df.head()

Row(School='Abilene Christian University', Private='Yes', Apps=1660, Accept=1232, Enroll=721, Top10perc=23, Top25perc=52, F_Undergrad=2885, P_Undergrad=537, Outstate=7440, Room_Board=3300, Books=450, Personal=2200, PhD=70, Terminal=78, S_F_Ratio=18.1, perc_alumni=12, Expend=7041, Grad_Rate=60)

## 3. Make features column

In [6]:
# Note: Spark ML classifiers expect the data as two columns
# ("label", "features"): a numeric label and a vector of features

In [7]:
df.columns

['School',
 'Private',
 'Apps',
 'Accept',
 'Enroll',
 'Top10perc',
 'Top25perc',
 'F_Undergrad',
 'P_Undergrad',
 'Outstate',
 'Room_Board',
 'Books',
 'Personal',
 'PhD',
 'Terminal',
 'S_F_Ratio',
 'perc_alumni',
 'Expend',
 'Grad_Rate']

#### Make the features column

In [8]:
assembler = VectorAssembler(
  inputCols=['Apps', 'Accept', 'Enroll', 'Top10perc', 'Top25perc', 'F_Undergrad',
             'P_Undergrad', 'Outstate', 'Room_Board', 'Books', 'Personal', 'PhD',
             'Terminal', 'S_F_Ratio', 'perc_alumni', 'Expend', 'Grad_Rate'],
  outputCol="features")

In [9]:
df_output = assembler.transform(df)
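
Conceptually, `VectorAssembler` concatenates the listed input columns, in order, into one numeric vector per row. A plain-Python sketch of that idea, using the first row shown by `df.head()` above; it only imitates the result and does not call Spark, and `assemble` is a hypothetical helper:

```python
# First row of the dataset, as printed by df.head() earlier.
row = {'School': 'Abilene Christian University', 'Private': 'Yes',
       'Apps': 1660, 'Accept': 1232, 'Enroll': 721, 'Top10perc': 23,
       'Top25perc': 52, 'F_Undergrad': 2885, 'P_Undergrad': 537,
       'Outstate': 7440, 'Room_Board': 3300, 'Books': 450,
       'Personal': 2200, 'PhD': 70, 'Terminal': 78, 'S_F_Ratio': 18.1,
       'perc_alumni': 12, 'Expend': 7041, 'Grad_Rate': 60}

# Same column order as the inputCols passed to VectorAssembler.
input_cols = ['Apps', 'Accept', 'Enroll', 'Top10perc', 'Top25perc',
              'F_Undergrad', 'P_Undergrad', 'Outstate', 'Room_Board',
              'Books', 'Personal', 'PhD', 'Terminal', 'S_F_Ratio',
              'perc_alumni', 'Expend', 'Grad_Rate']


def assemble(row, input_cols):
    """Hypothetical helper: pick the input columns and cast them to float."""
    return [float(row[c]) for c in input_cols]


features = assemble(row, input_cols)
print(features[:3])  # [1660.0, 1232.0, 721.0]
```

`School` and `Private` are left out of `inputCols` on purpose: the school name is an identifier rather than a predictor, and `Private` is the label.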

#### Transform categorical data to numerical data

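The `Private` column holds the strings "Yes" and "No". `StringIndexer` converts them to numeric label indices; in the output below, "Yes" becomes 0.0.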

In [10]:
indexer = StringIndexer(inputCol="Private", outputCol="PrivateIndex")
df_output_fixed = indexer.fit(df_output).transform(df_output)
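
By default, `StringIndexer` orders labels by descending frequency, so the most common string gets index 0.0. A plain-Python sketch of that ordering on made-up values (the counts here are illustrative and not taken from `College.csv`; `fit_index` is a hypothetical helper):

```python
from collections import Counter

# Made-up label column: 'Yes' appears more often than 'No'.
values = ['Yes', 'No', 'Yes', 'Yes', 'No', 'Yes']


def fit_index(values):
    """Hypothetical helper: most frequent label gets 0.0, the next 1.0, ..."""
    counts = Counter(values)
    # Sort by descending count; ties are broken alphabetically in this sketch.
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {label: float(i) for i, label in enumerate(ordered)}


mapping = fit_index(values)
indexed = [mapping[v] for v in values]
print(mapping)  # {'Yes': 0.0, 'No': 1.0}
```

In this notebook "Yes" maps to 0.0, which suggests that private colleges are the majority class, so a prediction of 0.0 means "private".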

In [11]:
df_output_fixed.show(1)

+--------------------+-------+----+------+------+---------+---------+-----------+-----------+--------+----------+-----+--------+---+--------+---------+-----------+------+---------+--------------------+------------+
|              School|Private|Apps|Accept|Enroll|Top10perc|Top25perc|F_Undergrad|P_Undergrad|Outstate|Room_Board|Books|Personal|PhD|Terminal|S_F_Ratio|perc_alumni|Expend|Grad_Rate|            features|PrivateIndex|
+--------------------+-------+----+------+------+---------+---------+-----------+-----------+--------+----------+-----+--------+---+--------+---------+-----------+------+---------+--------------------+------------+
|Abilene Christian...|    Yes|1660|  1232|   721|       23|       52|       2885|        537|    7440|      3300|  450|    2200| 70|      78|     18.1|         12|  7041|       60|[1660.0,1232.0,72...|         0.0|
+--------------------+-------+----+------+------+---------+---------+-----------+-----------+--------+----------+-----+--------+---+--------+---------+-----------+------+---------+--------------------+------------+

#### Select only the two columns Spark needs

In [12]:
df_final_data = df_output_fixed.select("features", 'PrivateIndex')

## 4. The Classifiers

#### Train and test split

In [13]:
train_data, test_data = df_final_data.randomSplit([0.8, 0.2])  # 80% and 20%
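
`randomSplit` assigns each row to a split at random according to the weights, so the resulting sizes are only approximately 80% and 20%, and they change between runs unless a seed is fixed. A plain-Python sketch of that behaviour, assuming a simple per-row uniform draw; `random_split` is a hypothetical helper and the 1000 placeholder rows are arbitrary:

```python
import random


def random_split(rows, weights, seed):
    """Hypothetical helper: send each row to a split via one uniform draw."""
    total = sum(weights)
    # Cumulative upper bounds of the normalized weights, e.g. [0.8, 1.0].
    bounds, acc = [], 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        x = rng.random()
        for i, upper in enumerate(bounds):
            # The last split also catches any floating-point leftover.
            if x < upper or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits


rows = list(range(1000))  # placeholder row ids
train, test = random_split(rows, [0.8, 0.2], seed=42)
print(len(train), len(test))  # close to 800 and 200, but not exact
```

In PySpark, a seed can be passed as the second argument, for example `df_final_data.randomSplit([0.8, 0.2], seed=42)`, which generally makes the split, and therefore the accuracies reported below, repeatable from run to run.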

#### Create all three models

In [14]:
# Use mostly default params
dtc = DecisionTreeClassifier(labelCol='PrivateIndex', featuresCol='features')
rfc = RandomForestClassifier(labelCol='PrivateIndex', featuresCol='features')
gbt = GBTClassifier(labelCol='PrivateIndex', featuresCol='features')

#### Train all three models:

In [15]:
%%time
dtc_model = dtc.fit(train_data)
rfc_model = rfc.fit(train_data)
gbt_model = gbt.fit(train_data)

CPU times: user 37.7 ms, sys: 6.63 ms, total: 44.3 ms
Wall time: 18.9 s


## 5. Model Comparison

Let's compare each of these models!

In [16]:
dtc_predictions = dtc_model.transform(test_data)
rfc_predictions = rfc_model.transform(test_data)
gbt_predictions = gbt_model.transform(test_data)

**Evaluation Metrics:**

In [17]:
# Compare predictions with the true labels and compute accuracy on the test set
acc_evaluator = MulticlassClassificationEvaluator(labelCol="PrivateIndex", 
                                                  predictionCol="prediction", 
                                                  metricName="accuracy")

In [18]:
dtc_acc = acc_evaluator.evaluate(dtc_predictions)
rfc_acc = acc_evaluator.evaluate(rfc_predictions)
gbt_acc = acc_evaluator.evaluate(gbt_predictions)
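
With `metricName="accuracy"`, the evaluator reports the fraction of test rows whose prediction equals the label. A plain-Python sketch of that computation; the label and prediction lists are made up for illustration, and `accuracy` is a hypothetical helper:

```python
def accuracy(labels, predictions):
    """Hypothetical helper: fraction of rows where prediction == label."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


# Made-up values standing in for the PrivateIndex and prediction columns.
labels = [0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0]
predictions = [0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0]

acc = accuracy(labels, predictions)
print('accuracy = {0:2.2f}%'.format(acc * 100))  # accuracy = 75.00%
```

The test error is the complement, `1 - acc`.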

In [19]:
print("The results!")
print('-'*70)
print('A single decision tree, accuracy = {0:2.2f}%'.format(dtc_acc*100))
print('-'*70)
print('A random forest ensemble, accuracy = {0:2.2f}%'.format(rfc_acc*100))
print('-'*70)
print('An ensemble using GBT, accuracy = {0:2.2f}%'.format(gbt_acc*100))

The results!
----------------------------------------------------------------------
A single decision tree, accuracy = 88.51%
----------------------------------------------------------------------
A random forest ensemble, accuracy = 92.53%
----------------------------------------------------------------------
An ensemble using GBT, accuracy = 90.80%
