# Problem statement

The food scientists believe that one of the preservatives A, B, C, or D is causing the problem, but they need your help to figure out which one! Use machine learning with a random forest (RF) to find out which parameter has the most predictive power, and thus which chemical causes the early spoiling. So create a model, and then work out how to decide which chemical is the problem!

## Solution:
As the problem statement shows, the goal here is not to make predictions. Our objective is to find out which preservative contributes most to spoiling the food. To do this, we need to measure the relationship between each of the variables A, B, C, and D and the target variable "Spoiled". We will fit a Random Forest on the data and inspect its feature importances. The feature with the highest importance is the one with the greatest effect on spoilage.
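To make "feature importance" more concrete: in tree ensembles such as Spark's random forest, a feature's importance is built from the impurity reduction achieved by the splits that use it, averaged over the trees. The sketch below computes that reduction for a single split in plain Python, using Gini impurity (the default impurity for `RandomForestClassifier`). The labels and the two candidate splits are made-up values for illustration only, not taken from the dog food data, and `gini` and `gini_gain` are hypothetical helper names.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_gain(parent, left, right):
    """Impurity reduction from splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted_children

# Hypothetical labels (1 = spoiled), for illustration only.
parent = [0, 0, 0, 0, 1, 1, 1, 1]
# A split that separates the two classes perfectly...
perfect_gain = gini_gain(parent, [0, 0, 0, 0], [1, 1, 1, 1])
# ...and one that leaves both children as mixed as the parent.
useless_gain = gini_gain(parent, [0, 0, 1, 1], [0, 0, 1, 1])
print(perfect_gain, useless_gain)
```

A feature whose splits look like the first case accumulates a large share of the total importance, while a feature whose splits look like the second contributes almost nothing.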

In [3]:
from pyspark.sql import SparkSession
spark=SparkSession.builder.appName("DogFood").getOrCreate()

In [4]:
#Load data from table as Spark DataFrame
data=spark.read.table("d_food")

In [5]:
#Check schema of table (datatypes of variables)
data.printSchema()  

In [6]:
#Cast the feature columns and the label column to integers
for col_name in ["A","B","C","D","Spoiled"]:
    data=data.withColumn(col_name,data[col_name].cast("integer"))

In [7]:
import matplotlib.pyplot as plt
df=data.toPandas()    #Convert Spark Dataframe to Pandas

In [8]:
#Plot the distribution of preservative A
fig,ax=plt.subplots()
ax.hist(df['A'], color="red")
display(fig)

In [9]:
#Plot the distribution of preservative B
fig1,ax1=plt.subplots()
ax1.hist(df['B'], color="red")
display(fig1)

In [10]:
#Plot the distribution of preservative C
fig,ax=plt.subplots()
ax.hist(df['C'], color="red")
display(fig)

In [11]:
#Plot the distribution of preservative D
fig,ax=plt.subplots()
ax.hist(df['D'], color="red")
display(fig)

In [12]:
from pyspark.ml.feature import StringIndexer, VectorAssembler

In [13]:
#Combine the four preservative columns into a single feature vector
assembler = VectorAssembler(
    inputCols=['A','B','C','D'],
    outputCol="features")

In [14]:
output = assembler.transform(data)

In [15]:
#Index the label column; by default StringIndexer assigns index 0 to the most frequent value
indexer = StringIndexer(inputCol="Spoiled", outputCol="SpoiledIndexed")
output_fixed = indexer.fit(output).transform(output)

In [16]:
final_data = output_fixed.select("features",'SpoiledIndexed')

In [17]:
from pyspark.ml.classification import RandomForestClassifier

In [18]:
rfc = RandomForestClassifier(labelCol='SpoiledIndexed',featuresCol='features')

In [19]:
rfc_model = rfc.fit(final_data)

In [20]:
rfc_model.featureImportances
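`featureImportances` is returned as a vector indexed by position, in the same order as the assembler's `inputCols`, so it helps to pair each value with its column name before reading it. The helper below is a minimal sketch of that step; `rank_importances` is a hypothetical name, and the feature names and numbers in the example are placeholders for illustration only, not the model's actual output.

```python
def rank_importances(feature_names, importances):
    """Pair each feature name with its importance, highest first."""
    if len(feature_names) != len(importances):
        raise ValueError("expected one importance value per feature name")
    pairs = zip(feature_names, (float(value) for value in importances))
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)

# Placeholder names and values, NOT real results from rfc_model.
example_names = ['f0', 'f1', 'f2', 'f3']
example_importances = [0.1, 0.2, 0.6, 0.1]
ranking = rank_importances(example_names, example_importances)
for name, score in ranking:
    print(name, score)
```

In this notebook the call would presumably be `rank_importances(assembler.getInputCols(), rfc_model.featureImportances.toArray())`, which lists the preservatives from most to least important.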

## Conclusion

The `featureImportances` output shows that the third feature (index 2 in the vector), preservative **"C"**, has the highest importance for our target variable **Spoiled**. Preservative C is therefore the chemical causing the early spoiling of the food.