# Tree Methods: Decision Tree & Random Forest

## Task: Use decision tree and random forest classifiers to determine which independent variable (A, B, C, or D) contributes most to the dependent variable (Spoiled) being 1.

## Result: The independent variable that contributes most to the dependent variable being 1 is C, because feature index 2 (the third independent variable) has the highest feature importance, at 0.9506 (RFC*) and 0.9686 (DTC*).

*RFC stands for random forest classifier.  
*DTC stands for decision tree classifier.

In [2]:
#Tree methods Example
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('tree_project').getOrCreate()

In [3]:
# Use Spark to read in the dog food csv file.
data = spark.read.csv("spark_master/Spark_for_Machine_Learning/Tree_Methods/dog_food.csv",inferSchema=True,header=True)

In [4]:
data.printSchema()

root
 |-- A: integer (nullable = true)
 |-- B: integer (nullable = true)
 |-- C: double (nullable = true)
 |-- D: integer (nullable = true)
 |-- Spoiled: double (nullable = true)



In [5]:
for item in data.head(5):
    print(item)
    print('\n')

Row(A=4, B=2, C=12.0, D=3, Spoiled=1.0)


Row(A=5, B=6, C=12.0, D=7, Spoiled=1.0)


Row(A=6, B=2, C=13.0, D=6, Spoiled=1.0)


Row(A=4, B=2, C=12.0, D=1, Spoiled=1.0)


Row(A=4, B=2, C=12.0, D=3, Spoiled=1.0)




### Spark Formatting of Data

In [8]:
# A few things we need to do before Spark can accept the data!
# It needs to be in the form of two columns
# ("label","features")

# Import VectorAssembler
from pyspark.ml.feature import VectorAssembler

In [9]:
data.columns

['A', 'B', 'C', 'D', 'Spoiled']

In [10]:
assembler = VectorAssembler(
    inputCols=['A', 'B', 'C', 'D'],
    outputCol="features")

In [11]:
output = assembler.transform(data)
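
To make the assembling step concrete, here is a minimal plain-Python sketch of the idea: the input columns are packed, in order, into a single `features` list per row. It does not use Spark; the `assemble` helper is hypothetical and written only for illustration, and the two rows are copied from the `data.head(5)` output.

```python
# Two rows copied from the data.head(5) output, stored as plain dicts.
rows = [
    {"A": 4, "B": 2, "C": 12.0, "D": 3, "Spoiled": 1.0},
    {"A": 5, "B": 6, "C": 12.0, "D": 7, "Spoiled": 1.0},
]
input_cols = ["A", "B", "C", "D"]


def assemble(row, cols):
    """Hypothetical helper: copy the row and add a 'features' list
    holding the values of `cols`, in the order given."""
    out = dict(row)
    out["features"] = [float(row[c]) for c in cols]
    return out


assembled = [assemble(r, input_cols) for r in rows]
print(assembled[0]["features"])
```

The order of `input_cols` matters: position 0 holds A, position 1 holds B, position 2 holds C, and position 3 holds D. Any index reported against the feature vector later refers to these positions.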

In [15]:
output.printSchema()

root
 |-- A: integer (nullable = true)
 |-- B: integer (nullable = true)
 |-- C: double (nullable = true)
 |-- D: integer (nullable = true)
 |-- Spoiled: double (nullable = true)
 |-- features: vector (nullable = true)



In [16]:
final_data = output.select('features', 'Spoiled')

In [17]:
final_data.printSchema()

root
 |-- features: vector (nullable = true)
 |-- Spoiled: double (nullable = true)



### The Classifiers

In [12]:
from pyspark.ml.classification import DecisionTreeClassifier,RandomForestClassifier
from pyspark.ml import Pipeline

In [21]:
# Use mostly defaults to make this comparison "fair"

dtc = DecisionTreeClassifier(labelCol='Spoiled',featuresCol='features')
rfc = RandomForestClassifier(labelCol='Spoiled',featuresCol='features',numTrees=20)

In [22]:
# Train the models (there are two models, so it might take some time)
dtc_model = dtc.fit(final_data)
rfc_model = rfc.fit(final_data)

In [23]:
dtc_model.featureImportances

SparseVector(4, {0: 0.0026, 1: 0.0089, 2: 0.9686, 3: 0.0199})
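
Tree-based feature importance is built from how much each split reduces impurity, and Gini impurity is the default criterion for Spark's tree classifiers. The sketch below is a plain-Python illustration of that idea, not Spark's implementation. The label lists are made-up toy data and the helper names `gini` and `impurity_decrease` are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted


# Toy labels (assumed for illustration): three spoiled, three not spoiled.
parent = [1, 1, 1, 0, 0, 0]

# A split that separates the classes perfectly removes all impurity.
clean_gain = impurity_decrease(parent, [1, 1, 1], [0, 0, 0])

# A split that leaves both children mixed removes very little.
noisy_gain = impurity_decrease(parent, [1, 0, 1], [0, 1, 0])

print(clean_gain, noisy_gain)
```

A feature whose splits look like the clean case accumulates most of the total impurity reduction, which is consistent with one feature taking nearly all of the importance here.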

In [24]:
rfc_model.featureImportances

SparseVector(4, {0: 0.0143, 1: 0.0171, 2: 0.9506, 3: 0.018})
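
The importance vectors are keyed by position, so it helps to map each index back to its column name. The following is a minimal plain-Python sketch, assuming the index order matches the `inputCols` list `['A', 'B', 'C', 'D']` given to the assembler. The scores are copied from the two outputs above, and `rank_features` is a hypothetical helper.

```python
# Column order assumed to match the assembler's inputCols.
feature_cols = ["A", "B", "C", "D"]

# Scores copied from the featureImportances outputs above.
dtc_importances = {0: 0.0026, 1: 0.0089, 2: 0.9686, 3: 0.0199}
rfc_importances = {0: 0.0143, 1: 0.0171, 2: 0.9506, 3: 0.018}


def rank_features(importances, cols):
    """Hypothetical helper: pair each index with its column name and
    sort by importance, highest first."""
    named = [(cols[i], score) for i, score in importances.items()]
    return sorted(named, key=lambda pair: pair[1], reverse=True)


dtc_ranked = rank_features(dtc_importances, feature_cols)
rfc_ranked = rank_features(rfc_importances, feature_cols)

print("DTC:", dtc_ranked)
print("RFC:", rfc_ranked)
```

Both rankings put C first. Each vector also sums to 1 (up to rounding), so a score can be read as that feature's share of the total importance.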

## Result: The independent variable that contributes most to the dependent variable being 1 is C, because feature index 2 (the third independent variable) has the highest feature importance, at 0.9506 (RFC*) and 0.9686 (DTC*).

*RFC stands for random forest classifier.  
*DTC stands for decision tree classifier.