# Tree Methods Consulting Project 

A dog food company wants to understand why some batches of its dog food are spoiling much more quickly than intended.
Unfortunately, the company hasn't upgraded to the latest machinery, so the amounts 
of the five chemicals in its preservative mix can vary a lot from batch to batch. 
Which chemical has the strongest effect? 
The company first mixes up a batch of preservative that
contains 4 different preservative chemicals (A, B, C, D) and then completes it with a "filler" chemical. 
The food scientists believe one of the A, B, C, or D preservatives is causing the problem, 
but they need help figuring out which one.
We use a random forest (RF) to find out which parameter has the most predictive power, and thus which chemical 
is most strongly associated with the early spoiling.

* A : Percentage of preservative A in the mix
* B : Percentage of preservative B in the mix
* C : Percentage of preservative C in the mix
* D : Percentage of preservative D in the mix
* Spoiled : Label indicating whether or not the dog food batch was spoiled

In [None]:
from pyspark.sql import SparkSession

In [2]:
spark = SparkSession.builder.appName('dogfood').getOrCreate()

In [3]:
data = spark.read.csv('dog_food.csv', header = True, inferSchema = True)

In [4]:
data.printSchema()

root
 |-- A: integer (nullable = true)
 |-- B: integer (nullable = true)
 |-- C: double (nullable = true)
 |-- D: integer (nullable = true)
 |-- Spoiled: double (nullable = true)



In [5]:
data.describe().show()

+-------+------------------+------------------+------------------+------------------+-------------------+
|summary|                 A|                 B|                 C|                 D|            Spoiled|
+-------+------------------+------------------+------------------+------------------+-------------------+
|  count|               490|               490|               490|               490|                490|
|   mean|  5.53469387755102| 5.504081632653061| 9.126530612244897| 5.579591836734694| 0.2857142857142857|
| stddev|2.9515204234399057|2.8537966089662063|2.0555451971054275|2.8548369309982857|0.45221563164613465|
|    min|                 1|                 1|               5.0|                 1|                0.0|
|    max|                10|                10|              14.0|                10|                1.0|
+-------+------------------+------------------+------------------+------------------+-------------------+



In [7]:
data.show(5)

+---+---+----+---+-------+
|  A|  B|   C|  D|Spoiled|
+---+---+----+---+-------+
|  4|  2|12.0|  3|    1.0|
|  5|  6|12.0|  7|    1.0|
|  6|  2|13.0|  6|    1.0|
|  4|  2|12.0|  1|    1.0|
|  4|  2|12.0|  3|    1.0|
+---+---+----+---+-------+
only showing top 5 rows



In [8]:
from pyspark.sql.types import IntegerType

In [12]:
data = data.withColumn('C',data['C'].cast(IntegerType()))
data = data.withColumn('Spoiled',data['Spoiled'].cast(IntegerType()))

In [13]:
data.show(5)

+---+---+---+---+-------+
|  A|  B|  C|  D|Spoiled|
+---+---+---+---+-------+
|  4|  2| 12|  3|      1|
|  5|  6| 12|  7|      1|
|  6|  2| 13|  6|      1|
|  4|  2| 12|  1|      1|
|  4|  2| 12|  3|      1|
+---+---+---+---+-------+
only showing top 5 rows



# Formatting Data

In [14]:
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

In [15]:
data.columns

['A', 'B', 'C', 'D', 'Spoiled']

In [16]:
assembler = VectorAssembler(inputCols = ['A', 'B', 'C', 'D'],
                           outputCol = 'features')

In [17]:
output = assembler.transform(data)

In [22]:
trainD, testD = output.randomSplit([0.7, 0.3])

# The Classifiers

In [18]:
from pyspark.ml.classification import DecisionTreeClassifier, RandomForestClassifier, GBTClassifier


In [23]:
dtc = DecisionTreeClassifier(labelCol = 'Spoiled', featuresCol = 'features')
rfc = RandomForestClassifier(labelCol='Spoiled',featuresCol='features')
gbt = GBTClassifier(labelCol='Spoiled',featuresCol='features')

In [24]:
dtc_model = dtc.fit(trainD)
rfc_model = rfc.fit(trainD)
gbt_model = gbt.fit(trainD)

# Features That Caused Early Spoilage

In [48]:
dtc_model.featureImportances

SparseVector(4, {0: 0.0257, 1: 0.002, 2: 0.9714, 3: 0.0009})

In [54]:
rfc_model.featureImportances

SparseVector(4, {0: 0.0222, 1: 0.0213, 2: 0.9378, 3: 0.0187})

In [55]:
gbt_model.featureImportances

SparseVector(4, {0: 0.0614, 1: 0.0347, 2: 0.8869, 3: 0.017})
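
Each `featureImportances` vector is indexed by position in the `inputCols` list passed to `VectorAssembler`, so index 2 corresponds to preservative C. As a minimal sketch, the snippet below copies the importance values printed above into plain Python dictionaries and reports the top-ranked chemical for each model. The names `feature_names`, `importances`, and `top_feature` are illustrative and not part of the notebook.

```python
# Column order passed to VectorAssembler(inputCols=...) in the notebook.
feature_names = ['A', 'B', 'C', 'D']

# Importance values copied by hand from the outputs printed above.
importances = {
    'decision tree': {0: 0.0257, 1: 0.002, 2: 0.9714, 3: 0.0009},
    'random forest': {0: 0.0222, 1: 0.0213, 2: 0.9378, 3: 0.0187},
    'gradient-boosted trees': {0: 0.0614, 1: 0.0347, 2: 0.8869, 3: 0.017},
}


def top_feature(scores):
    """Return (name, score) of the feature with the largest importance."""
    index = max(scores, key=scores.get)
    return feature_names[index], scores[index]


for model_name, scores in importances.items():
    name, score = top_feature(scores)
    print('{0}: preservative {1} ({2:.1%} of total importance)'.format(
        model_name, name, score))
```

All three models rank preservative C far ahead of A, B, and D, which points to C as the chemical most strongly associated with early spoiling.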

# Model Comparison

In [25]:
dtc_predictions = dtc_model.transform(testD)
rfc_predictions = rfc_model.transform(testD)
gbt_predictions = gbt_model.transform(testD)

In [36]:
dtc_predictions.select('spoiled', 'prediction').show(100)

+-------+----------+
|spoiled|prediction|
+-------+----------+
|      1|       0.0|
|      1|       1.0|
|      0|       0.0|
|      0|       0.0|
|      0|       0.0|
|      1|       1.0|
|      0|       0.0|
|      0|       0.0|
|      1|       1.0|
|      0|       0.0|
|      0|       0.0|
|      1|       1.0|
|      1|       1.0|
|      0|       0.0|
|      0|       0.0|
|      0|       0.0|
|      0|       0.0|
|      0|       0.0|
|      0|       0.0|
|      1|       1.0|
|      0|       0.0|
|      1|       1.0|
|      1|       1.0|
|      0|       0.0|
|      0|       0.0|
|      0|       0.0|
|      0|       0.0|
|      1|       1.0|
|      0|       0.0|
|      1|       1.0|
|      0|       0.0|
|      1|       0.0|
|      0|       0.0|
|      1|       1.0|
|      0|       0.0|
|      0|       0.0|
|      1|       1.0|
|      1|       1.0|
|      0|       0.0|
|      0|       0.0|
|      0|       0.0|
|      0|       0.0|
|      0|       0.0|
|      0|       0.0|
|      0|    

# Evaluation

In [26]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

In [27]:
acc_evaluator = MulticlassClassificationEvaluator(labelCol = 'Spoiled', predictionCol ='prediction', metricName = 'accuracy')

In [28]:
dtc_acc = acc_evaluator.evaluate(dtc_predictions)
rfc_acc = acc_evaluator.evaluate(rfc_predictions)
gbt_acc = acc_evaluator.evaluate(gbt_predictions)
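
The `accuracy` metric is the fraction of rows whose `prediction` matches the label. To make that concrete, here is a rough sketch in plain Python that tallies confusion counts for the first ten rows of the decision tree output shown under Model Comparison. It assumes that a label of 1 means "spoiled"; the helper `confusion_counts` is hypothetical and not part of PySpark.

```python
# (label, prediction) pairs for the first ten rows of the decision tree
# output printed in the Model Comparison section.
rows = [(1, 0.0), (1, 1.0), (0, 0.0), (0, 0.0), (0, 0.0),
        (1, 1.0), (0, 0.0), (0, 0.0), (1, 1.0), (0, 0.0)]


def confusion_counts(pairs, positive=1):
    """Count true/false positives and negatives (assumes 1 = spoiled)."""
    counts = {'tp': 0, 'fp': 0, 'fn': 0, 'tn': 0}
    for label, prediction in pairs:
        actual_positive = label == positive
        predicted_positive = int(prediction) == positive
        if actual_positive and predicted_positive:
            counts['tp'] += 1
        elif predicted_positive:
            counts['fp'] += 1
        elif actual_positive:
            counts['fn'] += 1
        else:
            counts['tn'] += 1
    return counts


counts = confusion_counts(rows)
accuracy = (counts['tp'] + counts['tn']) / len(rows)
recall = counts['tp'] / (counts['tp'] + counts['fn'])
print(counts)
print('accuracy = {0:.2f}, recall for spoiled batches = {1:.2f}'.format(
    accuracy, recall))
```

On these ten rows the tree misses one spoiled batch, so accuracy is 0.90 while recall on the spoiled class is 0.75. With imbalanced labels the two numbers can diverge, which is worth keeping in mind when reading the accuracy figures.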

In [60]:
print('Results:')
print('-' * 80)
print('A single decision tree had an accuracy of : {0:2.2f}%'.format(dtc_acc * 100))
print('-' * 80)
print('A random forest ensemble had an accuracy of:  {0:2.2f}%'.format(rfc_acc * 100))
print('-' * 80)
print('An ensemble using GBT  had an accuracy of:  {0:2.2f}%'.format(gbt_acc * 100))

Results:
--------------------------------------------------------------------------------
A single decision tree had an accuracy of : 95.48%
--------------------------------------------------------------------------------
A random forest ensemble had an accuracy of:  96.13%
--------------------------------------------------------------------------------
An ensemble using GBT  had an accuracy of:  95.48%
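
To put these accuracies in context, it helps to compare them with a classifier that always predicts the majority class. The sketch below derives that baseline from the `count` and `mean` rows of the `data.describe()` output, assuming `Spoiled` only takes the values 0 and 1 so that its mean is the fraction of spoiled batches. The baseline is computed on the full dataset, so the class mix in the random test split may differ slightly.

```python
# Values copied from the data.describe() output earlier in the notebook.
total_batches = 490
spoiled_rate = 0.2857142857142857  # mean of the Spoiled column

# Assumption: Spoiled is a 0/1 label, so mean * count = number of 1s.
spoiled = round(total_batches * spoiled_rate)
not_spoiled = total_batches - spoiled

# Accuracy of always predicting the more common class.
baseline_accuracy = max(spoiled, not_spoiled) / total_batches

print('Spoiled batches: {0}, unspoiled batches: {1}'.format(spoiled, not_spoiled))
print('Majority-class baseline accuracy: {0:2.2f}%'.format(baseline_accuracy * 100))
```

All three tree models score well above this baseline of roughly 71%, so the reported accuracies reflect real predictive signal and not just the class imbalance.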
