## The data

The task is to investigate why specific batches of dog food at a dog food company are spoiling faster than expected.

The company uses four preservative chemicals (A, B, C, D) plus a "filler" chemical. Each batch is made by combining the four preservatives first and finishing with the filler. Unfortunately, the company hasn't upgraded to the latest machinery, so the amounts of these chemicals vary a lot from batch to batch. Given this variation, we need to identify which preservative has the most significant impact on spoilage.

Food scientists suspect that one of the A, B, C, or D preservatives is responsible for the issue and seek assistance in determining which one. 

We will fit a Random Forest (RF) model and use its feature importances to find the preservative with the highest predictive power, which is the most likely cause of the premature spoilage. The columns in the data are:

    A      : Percentage of preservative A in the mix
    B      : Percentage of preservative B in the mix
    C      : Percentage of preservative C in the mix
    D      : Percentage of preservative D in the mix
    Spoiled: Label indicating whether or not the dog food batch was spoiled.

In [1]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('dogfood').getOrCreate()


In [2]:
# Load training data
data = spark.read.csv('dog_food.csv', inferSchema=True, header=True)

In [3]:
data.printSchema()

root
 |-- A: integer (nullable = true)
 |-- B: integer (nullable = true)
 |-- C: double (nullable = true)
 |-- D: integer (nullable = true)
 |-- Spoiled: double (nullable = true)



In [4]:
data.head(1)

[Row(A=4, B=2, C=12.0, D=3, Spoiled=1.0)]

In [5]:
data.describe().show()

+-------+------------------+------------------+------------------+------------------+-------------------+
|summary|                 A|                 B|                 C|                 D|            Spoiled|
+-------+------------------+------------------+------------------+------------------+-------------------+
|  count|               490|               490|               490|               490|                490|
|   mean|  5.53469387755102| 5.504081632653061| 9.126530612244897| 5.579591836734694| 0.2857142857142857|
| stddev|2.9515204234399057|2.8537966089662063|2.0555451971054275|2.8548369309982857|0.45221563164613465|
|    min|                 1|                 1|               5.0|                 1|                0.0|
|    max|                10|                10|              14.0|                10|                1.0|
+-------+------------------+------------------+------------------+------------------+-------------------+
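The summary also tells us the class balance. Assuming `Spoiled` only takes the values 0.0 and 1.0 (the min and max above are consistent with that, but it is an assumption), its mean is the fraction of spoiled batches. A quick plain-Python check, with the numbers copied from the table:

```python
# Values copied from the describe() output above.
n_batches = 490
spoiled_mean = 0.2857142857142857  # mean of the Spoiled column

# Assumption: Spoiled is a 0/1 label, so its mean is the share of spoiled batches.
n_spoiled = round(n_batches * spoiled_mean)
n_ok = n_batches - n_spoiled

print(f"spoiled: {n_spoiled}, not spoiled: {n_ok}")
```

So roughly two batches in seven are spoiled. The classes are imbalanced, but not severely.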



## Data transformation

In [6]:
from pyspark.ml.feature import VectorAssembler



In [7]:
assembler = VectorAssembler(inputCols=data.columns[:-1], outputCol='features')

In [8]:
output = assembler.transform(data)

In [9]:
output.printSchema()

root
 |-- A: integer (nullable = true)
 |-- B: integer (nullable = true)
 |-- C: double (nullable = true)
 |-- D: integer (nullable = true)
 |-- Spoiled: double (nullable = true)
 |-- features: vector (nullable = true)



In [10]:
final_data = output.select('features', 'Spoiled')

In [11]:
final_data.printSchema()

root
 |-- features: vector (nullable = true)
 |-- Spoiled: double (nullable = true)
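To make the transformation concrete: `VectorAssembler` concatenates the input columns of each row, in order, into a single numeric vector. The sketch below mimics that idea in plain Python for the first row shown earlier. The helper `assemble` is hypothetical and only illustrates the behavior; it is not part of PySpark.

```python
def assemble(row, input_cols):
    """Mimic VectorAssembler for one row: take the input columns in order, as floats."""
    return [float(row[c]) for c in input_cols]

columns = ['A', 'B', 'C', 'D', 'Spoiled']
first_row = {'A': 4, 'B': 2, 'C': 12.0, 'D': 3, 'Spoiled': 1.0}  # from data.head(1)

# Everything except the last column (the label), as in data.columns[:-1].
input_cols = columns[:-1]
features = assemble(first_row, input_cols)

print(input_cols, features)
```

The position of each value in the vector follows the order of `inputCols`, which matters later when reading the feature importances by index.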



## Creating the model

In [12]:
from pyspark.ml.classification import RandomForestClassifier

In [13]:
rfc = RandomForestClassifier(labelCol='Spoiled', featuresCol='features', numTrees=20, maxDepth=5)

In [14]:
rfc_model = rfc.fit(final_data)

In [15]:
rfc_model.featureImportances

SparseVector(4, {0: 0.0151, 1: 0.0176, 2: 0.9489, 3: 0.0184})

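The feature at index 2 (chemical C) is by far the most important, accounting for about 95% of the total importance. Feature importance measures predictive power rather than proving causation, but it strongly suggests that preservative C is behind the early spoilage.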
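The indices in the importance vector follow the order of the assembler's `inputCols`, so they can be mapped back to column names. A minimal plain-Python sketch, with the importances copied by hand from the output above (in the notebook they could be read from `rfc_model.featureImportances.toArray()` instead):

```python
# Same order as the assembler's inputCols (data.columns[:-1]).
feature_names = ['A', 'B', 'C', 'D']

# Copied from the SparseVector printed above: {index: importance}.
importances = {0: 0.0151, 1: 0.0176, 2: 0.9489, 3: 0.0184}

# Pair each importance with its column name and sort, most important first.
ranked = sorted(
    ((feature_names[i], weight) for i, weight in importances.items()),
    key=lambda pair: pair[1],
    reverse=True,
)

for name, weight in ranked:
    print(f"{name}: {weight:.4f}")
```

The importances sum to 1, so each value can be read as that feature's share of the total.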