# Tree Methods Code Along

In this lecture we will code along with some data and test out 3 different tree methods:

* A single decision tree
* A random forest
* A gradient boosted tree classifier
    
We will use a college dataset to classify colleges as private or public. `Private` is the label; the remaining columns are the candidate features:

    Private A factor with levels No and Yes indicating whether the university is private (Yes) or public (No)
    Apps Number of applications received
    Accept Number of applications accepted
    Enroll Number of new students enrolled
    Top10perc Pct. new students from top 10% of H.S. class
    Top25perc Pct. new students from top 25% of H.S. class
    F.Undergrad Number of fulltime undergraduates
    P.Undergrad Number of parttime undergraduates
    Outstate Out-of-state tuition
    Room.Board Room and board costs
    Books Estimated book costs
    Personal Estimated personal spending
    PhD Pct. of faculty with Ph.D.’s
    Terminal Pct. of faculty with terminal degree
    S.F.Ratio Student/faculty ratio
    perc.alumni Pct. alumni who donate
    Expend Instructional expenditure per student
    Grad.Rate Graduation rate
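
The descriptions above use dotted names such as `F.Undergrad`, while the schema printed further down shows underscored names such as `F_Undergrad`. Dots are awkward in Spark column names because a dot is also the syntax for accessing struct fields. If your copy of the CSV still has dotted headers, a minimal sketch of a cleanup step might look like this. The helper name `sanitize_column_name` is hypothetical and not part of the original notebook.

```python
import re


def sanitize_column_name(name: str) -> str:
    """Replace every character that is not a letter, digit, or underscore with '_'."""
    return re.sub(r"[^0-9A-Za-z_]", "_", name)


# Dotted names as they appear in the feature descriptions above
dotted = ["F.Undergrad", "P.Undergrad", "Room.Board",
          "S.F.Ratio", "perc.alumni", "Grad.Rate"]

cleaned = [sanitize_column_name(c) for c in dotted]
print(cleaned)
```

On a Spark DataFrame, the cleaned names could then be applied in one step with `data.toDF(*new_names)`, where `new_names` is the sanitized version of `data.columns`.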

In [1]:
#Tree methods Example
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('treecode').getOrCreate()

In [3]:
# Load the data
data = spark.read.csv('./data/College.csv',inferSchema=True,header=True)

In [4]:
data.printSchema()

root
 |-- School: string (nullable = true)
 |-- Private: string (nullable = true)
 |-- Apps: integer (nullable = true)
 |-- Accept: integer (nullable = true)
 |-- Enroll: integer (nullable = true)
 |-- Top10perc: integer (nullable = true)
 |-- Top25perc: integer (nullable = true)
 |-- F_Undergrad: integer (nullable = true)
 |-- P_Undergrad: integer (nullable = true)
 |-- Outstate: integer (nullable = true)
 |-- Room_Board: integer (nullable = true)
 |-- Books: integer (nullable = true)
 |-- Personal: integer (nullable = true)
 |-- PhD: integer (nullable = true)
 |-- Terminal: integer (nullable = true)
 |-- S_F_Ratio: double (nullable = true)
 |-- perc_alumni: integer (nullable = true)
 |-- Expend: integer (nullable = true)
 |-- Grad_Rate: integer (nullable = true)



In [5]:
data.head(1)

[Row(School='Abilene Christian University', Private='Yes', Apps=1660, Accept=1232, Enroll=721, Top10perc=23, Top25perc=52, F_Undergrad=2885, P_Undergrad=537, Outstate=7440, Room_Board=3300, Books=450, Personal=2200, PhD=70, Terminal=78, S_F_Ratio=18.1, perc_alumni=12, Expend=7041, Grad_Rate=60)]

### Spark Formatting of Data

In [8]:
# A few things we need to do before Spark can accept the data!
# It needs to be in the form of two columns
# ("label","features")

# Import VectorAssembler and Vectors
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

In [6]:
data.columns

['School',
 'Private',
 'Apps',
 'Accept',
 'Enroll',
 'Top10perc',
 'Top25perc',
 'F_Undergrad',
 'P_Undergrad',
 'Outstate',
 'Room_Board',
 'Books',
 'Personal',
 'PhD',
 'Terminal',
 'S_F_Ratio',
 'perc_alumni',
 'Expend',
 'Grad_Rate']

In [10]:
assembler = VectorAssembler(inputCols = ['Apps',
 'Accept',
 'Enroll',
 'Top10perc',
 'Top25perc',
 'F_Undergrad',
 'P_Undergrad',
 'Outstate',
 'Room_Board',
 'Books',
 'Personal',
 'PhD',
 'Terminal',
 'S_F_Ratio',
 'perc_alumni',
 'Expend',
 'Grad_Rate'],outputCol= 'features')
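
Typing out all the input columns by hand works, but it is easy to miss one. As a sketch of an alternative, the feature list can be derived from the column names by excluding the identifier (`School`) and the label (`Private`). The variable names `non_features` and `feature_cols` are made up for this example.

```python
# Column names as printed by data.columns above
columns = ['School', 'Private', 'Apps', 'Accept', 'Enroll', 'Top10perc',
           'Top25perc', 'F_Undergrad', 'P_Undergrad', 'Outstate',
           'Room_Board', 'Books', 'Personal', 'PhD', 'Terminal',
           'S_F_Ratio', 'perc_alumni', 'Expend', 'Grad_Rate']

# School is an identifier and Private is the label, so neither is a feature
non_features = {'School', 'Private'}

# Keep the original column order for the remaining numeric columns
feature_cols = [c for c in columns if c not in non_features]
print(len(feature_cols))
```

The resulting list could be passed directly as `VectorAssembler(inputCols=feature_cols, outputCol='features')`, which should produce the same assembler as the cell above.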

In [11]:
output = assembler.transform(data)

Deal with the `Private` column, which holds the strings "Yes" and "No" and must be converted to a numeric label:

In [12]:
from pyspark.ml.feature import StringIndexer

In [13]:
indexer = StringIndexer(inputCol= 'Private',outputCol = 'PrivateIndex')

In [14]:
output_fixed = indexer.fit(output).transform(output)
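
It helps to know which string ends up as which number. By default `StringIndexer` orders labels by descending frequency, so the most common value gets index 0.0. The following plain-Python sketch mimics that behaviour on a tiny list. It is an illustration only, not Spark's implementation, and the alphabetical tie-breaking is an assumption made for this example.

```python
from collections import Counter


def fit_string_index(values):
    """Map each distinct string to a float index, most frequent first."""
    counts = Counter(values)
    # Sort by descending count; break ties alphabetically (assumption)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


sample = ["Yes", "Yes", "No", "Yes", "No"]
mapping = fit_string_index(sample)        # learned in the "fit" step
indexed = [mapping[v] for v in sample]    # applied in the "transform" step
print(mapping)
print(indexed)
```

Because the mapping depends on the data, it is worth checking which value of `Private` maps to 0.0 in `output_fixed` before interpreting predictions.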

In [15]:
output_fixed.printSchema()

root
 |-- School: string (nullable = true)
 |-- Private: string (nullable = true)
 |-- Apps: integer (nullable = true)
 |-- Accept: integer (nullable = true)
 |-- Enroll: integer (nullable = true)
 |-- Top10perc: integer (nullable = true)
 |-- Top25perc: integer (nullable = true)
 |-- F_Undergrad: integer (nullable = true)
 |-- P_Undergrad: integer (nullable = true)
 |-- Outstate: integer (nullable = true)
 |-- Room_Board: integer (nullable = true)
 |-- Books: integer (nullable = true)
 |-- Personal: integer (nullable = true)
 |-- PhD: integer (nullable = true)
 |-- Terminal: integer (nullable = true)
 |-- S_F_Ratio: double (nullable = true)
 |-- perc_alumni: integer (nullable = true)
 |-- Expend: integer (nullable = true)
 |-- Grad_Rate: integer (nullable = true)
 |-- features: vector (nullable = true)
 |-- PrivateIndex: double (nullable = false)



In [16]:
final_data = output_fixed.select('features','PrivateIndex')

In [17]:
train_data,test_data = final_data.randomSplit([0.7,0.3])
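
Without a seed, this split differs on every run, so the scores reported later will not be exactly reproducible. `randomSplit` accepts an optional `seed` argument for that purpose, for example `final_data.randomSplit([0.7,0.3], seed=42)`. The plain-Python sketch below illustrates the idea of a seeded, per-row random split. The function `random_split` is hypothetical and only approximates what Spark does: each row is assigned independently, so the split sizes are close to 70/30 rather than exact.

```python
import random


def random_split(rows, weights, seed=None):
    """Assign each row independently to one split, with probability proportional to its weight."""
    rng = random.Random(seed)
    total = sum(weights)

    # Cumulative upper bounds in [0, 1]; weights are normalized first
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)

    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, bound in enumerate(bounds):
            # The last split catches any floating-point remainder
            if draw < bound or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits


train, test = random_split(range(1000), [0.7, 0.3], seed=42)
train_again, test_again = random_split(range(1000), [0.7, 0.3], seed=42)
print(len(train), len(test))
print(train == train_again)  # same seed, same split
```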

### The Classifiers

In [18]:
from pyspark.ml.classification import DecisionTreeClassifier,GBTClassifier,RandomForestClassifier
from pyspark.ml import Pipeline

In [None]:
from pyspark.ml.regression import RandomForestRegressor

**RandomForestClassifier:**
- Used for classification problems.
- The target variable is categorical, that is, it takes discrete labels or classes.
- Examples of classification problems include labelling emails as spam or not spam, identifying images of cats or dogs, etc.

**RandomForestRegressor:**
- Used for regression problems.
- The target variable is numeric, that is, it takes continuous values.
- Examples of regression problems include predicting house prices, estimating income, etc.

**Pipeline:**
In PySpark, a Pipeline is a sequence of stages chained together so that a data-processing workflow runs in a fixed order. Each stage is either a data transformation or a machine learning model, and the pipeline provides a way to organize and run those stages as a single unit.
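
To make the idea of chained stages concrete, here is a toy pipeline in plain Python. It is a sketch of the pattern only, not PySpark's implementation: the classes `Center`, `Threshold`, and `ToyPipeline` are invented for this example. Fitting walks through the stages in order, fits each one on the output of the previous stage, and transforming replays the fitted stages in the same order.

```python
class Center:
    """Toy stage that learns the mean in fit() and subtracts it in transform()."""

    def fit(self, rows):
        self.mean = sum(rows) / len(rows)
        return self

    def transform(self, rows):
        return [r - self.mean for r in rows]


class Threshold:
    """Toy stage with nothing to learn: maps positive values to 1.0, others to 0.0."""

    def fit(self, rows):
        return self

    def transform(self, rows):
        return [1.0 if r > 0 else 0.0 for r in rows]


class ToyPipeline:
    """Runs stages in order; each stage sees the previous stage's output."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        current = rows
        for stage in self.stages:
            stage.fit(current)
            current = stage.transform(current)
        return self

    def transform(self, rows):
        current = rows
        for stage in self.stages:
            current = stage.transform(current)
        return current


pipeline = ToyPipeline([Center(), Threshold()]).fit([1.0, 2.0, 3.0, 6.0])
result = pipeline.transform([2.0, 4.0])
print(result)
```

In PySpark the corresponding object is `Pipeline(stages=[...])` from `pyspark.ml`. For this notebook the stages could plausibly be the `StringIndexer`, the `VectorAssembler`, and one of the classifiers, and calling `fit` on it returns a `PipelineModel` that can `transform` new data.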



In [22]:
dtc = DecisionTreeClassifier(labelCol = 'PrivateIndex',featuresCol= 'features')
rfc = RandomForestClassifier(labelCol = 'PrivateIndex',featuresCol= 'features')
gbt = GBTClassifier(labelCol = 'PrivateIndex',featuresCol= 'features')

In [23]:
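# Train the models
dtc_model = dtc.fit(train_data)
rfc_model = rfc.fit(train_data)
gbt_model = gbt.fit(train_data)
# Score the held-out test set; the evaluators below use these DataFrames
dtc_preds = dtc_model.transform(test_data)
rfc_preds = rfc_model.transform(test_data)
gbt_preds = gbt_model.transform(test_data)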

## Model Comparison

Let's compare each of these models!

**Evaluation Metrics:** `BinaryClassificationEvaluator` reports the area under the ROC curve (`areaUnderROC`) by default.

In [26]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

In [28]:
my_binary_eval = BinaryClassificationEvaluator(labelCol = 'PrivateIndex')
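
Since no `metricName` is given, this evaluator uses its default metric, the area under the ROC curve. One way to read that number: it is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The plain-Python sketch below computes it directly from that definition on a toy example. It is meant to explain the metric, not to reproduce how Spark computes it, and `area_under_roc` is a hypothetical helper.

```python
def area_under_roc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one example of each class")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Four toy examples: three of the four positive/negative pairs are ordered correctly
auc = area_under_roc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is the scale on which the scores below should be read.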

In [30]:
print('DTC')
print(my_binary_eval.evaluate(dtc_preds))

DTC
0.8818934688499905


In [31]:
print('RFC')
print(my_binary_eval.evaluate(rfc_preds))

RFC
0.977131564088086


In [33]:
print('GBT')
print(my_binary_eval.evaluate(gbt_preds))

GBT
0.935064935064935


### Accuracy

In [38]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

In [40]:
acc_eval = MulticlassClassificationEvaluator(labelCol='PrivateIndex',metricName='accuracy')
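
Accuracy is the fraction of rows whose predicted class equals the true label. Unlike the area under the ROC curve, it looks only at the final hard predictions and ignores how confident the model was, which is one reason the two metrics can rank models differently. A minimal plain-Python sketch, with a made-up helper name `accuracy`:

```python
def accuracy(labels, predictions):
    """Fraction of positions where the prediction equals the label."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on empty input")
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


# Three of the four toy predictions match their labels
acc = accuracy([1.0, 0.0, 1.0, 1.0], [1.0, 0.0, 0.0, 1.0])
print(acc)
```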

In [41]:
print('DTC')
print(acc_eval.evaluate(dtc_preds))

DTC
0.8878923766816144


In [42]:
print('RFC')
print(acc_eval.evaluate(rfc_preds))

RFC
0.9192825112107623


In [44]:
print('GBT')
print(acc_eval.evaluate(gbt_preds))

GBT
0.8834080717488789
