# Apple Quality Classification

In this notebook we will build a binary classification model on the apple quality dataset from Kaggle using TensorFlow (Keras). We hope to construct a model that classifies an apple as 'good' or 'bad' from its measured features.
First, we will explore the dataset with a short exploratory data analysis.

In [122]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import sklearn
import pandas as pd
import keras
from keras.models import Sequential
from keras.layers import Dense
import tensorflow as tf
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv('apple_quality.csv')
df.head()

Unnamed: 0,A_id,Size,Weight,Sweetness,Crunchiness,Juiciness,Ripeness,Acidity,Quality
0,0.0,-3.970049,-2.512336,5.34633,-1.012009,1.8449,0.32984,-0.491590483,good
1,1.0,-1.195217,-2.839257,3.664059,1.588232,0.853286,0.86753,-0.722809367,good
2,2.0,-0.292024,-1.351282,-1.738429,-0.342616,2.838636,-0.038033,2.621636473,bad
3,3.0,-0.657196,-2.271627,1.324874,-0.097875,3.63797,-3.413761,0.790723217,good
4,4.0,1.364217,-1.296612,-0.384658,-0.553006,3.030874,-1.303849,0.501984036,good


There are 4000 rows of data with 9 columns: 'A_id' holds unique IDs, 'Quality' labels each apple as 'good' or 'bad', and the remaining 7 columns are numerical features: 'Size', 'Weight', 'Sweetness', 'Crunchiness', 'Juiciness', 'Ripeness', and 'Acidity'.

## Check for missing and duplicated values

In [123]:
# Check data description
df.describe()

Unnamed: 0,A_id,Size,Weight,Sweetness,Crunchiness,Juiciness,Ripeness
count,4000.0,4000.0,4000.0,4000.0,4000.0,4000.0,4000.0
mean,1999.5,-0.503015,-0.989547,-0.470479,0.985478,0.512118,0.498277
std,1154.844867,1.928059,1.602507,1.943441,1.402757,1.930286,1.874427
min,0.0,-7.151703,-7.149848,-6.894485,-6.055058,-5.961897,-5.864599
25%,999.75,-1.816765,-2.01177,-1.738425,0.062764,-0.801286,-0.771677
50%,1999.5,-0.513703,-0.984736,-0.504758,0.998249,0.534219,0.503445
75%,2999.25,0.805526,0.030976,0.801922,1.894234,1.835976,1.766212
max,3999.0,6.406367,5.790714,6.374916,7.619852,7.364403,7.237837


In [124]:
# Check for missing values
df.isnull().any()

A_id            True
Size            True
Weight          True
Sweetness       True
Crunchiness     True
Juiciness       True
Ripeness        True
Acidity        False
Quality         True
dtype: bool

In [125]:
# Check the missing value index
np.where(df.isnull())[0]

array([4000, 4000, 4000, 4000, 4000, 4000, 4000, 4000], dtype=int64)

All the missing values are in the row at index 4000, which is the last row of the file. Let's handle that by dropping it!

In [126]:
# Drop the last row
df.drop(df.tail(1).index, inplace=True)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4000 entries, 0 to 3999
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   A_id         4000 non-null   float64
 1   Size         4000 non-null   float64
 2   Weight       4000 non-null   float64
 3   Sweetness    4000 non-null   float64
 4   Crunchiness  4000 non-null   float64
 5   Juiciness    4000 non-null   float64
 6   Ripeness     4000 non-null   float64
 7   Acidity      4000 non-null   object 
 8   Quality      4000 non-null   object 
dtypes: float64(7), object(2)
memory usage: 281.4+ KB
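
As a minimal sketch of an alternative, `dropna` removes incomplete rows wherever they occur, so it does not rely on the bad row being the last one. The frame `toy` and its values are hypothetical stand-ins, not the actual dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the apple data; the last row is incomplete.
toy = pd.DataFrame({
    "Size": [-3.97, -1.20, np.nan],
    "Acidity": ["-0.49", "-0.72", "n/a"],
    "Quality": ["good", "good", None],
})

# Rows that contain at least one missing value
incomplete = toy[toy.isnull().any(axis=1)]
print(incomplete.index.tolist())

# Drop those rows and rebuild a clean 0..n-1 index
clean = toy.dropna().reset_index(drop=True)
print(clean.shape)
```

Inspecting `incomplete` before dropping makes it easy to confirm that only the expected rows are removed.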


In [127]:
# Convert 'Acidity' type to float
df['Acidity'] = df['Acidity'].astype(float)
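
`astype(float)` raises an error if any entry cannot be parsed as a number. A more defensive variation, sketched here on hypothetical values, is `pd.to_numeric` with `errors="coerce"`, which turns unparseable entries into `NaN` so they can be counted and handled explicitly.

```python
import pandas as pd

# Hypothetical values; "n/a" stands in for any non-numeric entry.
acidity = pd.Series(["-0.491590483", "2.621636473", "n/a"])

# errors="coerce" converts unparseable entries to NaN instead of raising
acidity_num = pd.to_numeric(acidity, errors="coerce")

print(acidity_num.dtype)
print(int(acidity_num.isna().sum()))  # number of entries that failed to parse
```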

## Build classification model

As we noticed before, we have one binary column and a set of numerical columns. We will encode the binary column as our target, and use the remaining 7 numerical columns as the input features for our prediction. We also drop the ID column since we don't need it.

In [128]:
# Encode target column to 'good': 1 and 'bad': 0
target_column = 'Quality'
df[target_column] = df[target_column].map({'good': 1, 'bad': 0})

# Keep the target separately, then drop the target and ID columns from the features
output_rows = df[target_column]
numerical_column = df.columns.drop(target_column)
df.drop(target_column, axis=1, inplace=True)
df.drop('A_id', axis=1, inplace=True)

We will normalize our data with a Min-Max scaler, then split it into train and test sets and train our model.
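
Min-Max scaling maps each feature column to the range [0, 1] via `(x - min) / (max - min)`. A common variation on this step, sketched below on hypothetical random data, splits first and fits the scaler on the training split only, so that no statistics from the test split influence the transform. The names `X`, `y`, `X_tr_s`, and `X_te_s` are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical feature matrix (rows = apples, columns = 7 features) and labels
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 7))
y = rng.integers(0, 2, size=100)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Fit on the training split only, then apply the same transform to both splits
scaler = MinMaxScaler().fit(X_tr)
X_tr_s = scaler.transform(X_tr)
X_te_s = scaler.transform(X_te)

# Check against the per-column formula (x - min) / (max - min)
manual = (X_tr - X_tr.min(axis=0)) / (X_tr.max(axis=0) - X_tr.min(axis=0))
print(np.allclose(X_tr_s, manual))
```

With this approach the scaled test values may fall slightly outside [0, 1], which is expected because the minimum and maximum come from the training split.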

In [129]:
scaler = MinMaxScaler()
scaler.fit(df)
t_df = scaler.transform(df)

X_train, X_test, y_train, y_test = train_test_split(t_df, output_rows, test_size=0.25, random_state=0)

# Model: one hidden dense layer with 64 units taking the 7 input features, followed by a sigmoid output unit
model = keras.models.Sequential([
    keras.layers.Dense(units=64, activation='relu', input_shape=(7,)),
    keras.layers.Dense(1, activation='sigmoid')
])

adam = keras.optimizers.Adam(learning_rate=0.005)

model.compile(loss='binary_crossentropy', optimizer=adam, metrics=["accuracy"])

model.fit(X_train, y_train, epochs=200)

Epoch 1/200
...
Epoch 200/200


<keras.src.callbacks.History at 0x2a9ffbe5c50>

As we can see, our model has reached 90% accuracy. Since `model.fit` was called without validation data, this figure is measured on the training set.
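
To measure performance on held-out data, the sigmoid outputs can be thresholded and compared with the true labels. The sketch below uses hypothetical labels and probabilities; in this notebook, `y_prob` would come from `model.predict(X_test)` and `y_true` from `y_test`. The helper `evaluate_probabilities` is an illustrative name, not part of Keras or scikit-learn.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

def evaluate_probabilities(y_true, y_prob, threshold=0.5):
    """Threshold sigmoid outputs and return (accuracy, confusion matrix)."""
    # Flatten (n, 1) model outputs to (n,) and convert to 0/1 predictions
    y_pred = (np.asarray(y_prob).ravel() >= threshold).astype(int)
    return accuracy_score(y_true, y_pred), confusion_matrix(y_true, y_pred)

# Hypothetical labels and probabilities, shaped like the output of a Dense(1) layer
y_true = np.array([1, 0, 1, 1, 0, 0])
y_prob = np.array([[0.9], [0.2], [0.4], [0.8], [0.6], [0.1]])

acc, cm = evaluate_probabilities(y_true, y_prob)
print(acc)
print(cm)  # rows are true classes, columns are predicted classes
```

The confusion matrix shows how errors split between 'good' apples predicted as 'bad' and the reverse, which a single accuracy number hides.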

## References 
Elgiriyewithana, N. (2023). Apple Quality [Data set]. Kaggle. https://www.kaggle.com/datasets/nelgiriyewithana/apple-quality