# Apple Quality Classification

In this notebook we build a binary classification model on the Kaggle apple quality dataset using TensorFlow (through Keras). The goal is a model that predicts whether an apple is 'good' or 'bad' from its measured features.
We start with an exploratory look at the dataset.

In [105]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import sklearn
import pandas as pd
import keras
from keras.models import Sequential
from keras.layers import Dense
import tensorflow as tf
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv('apple_quality.csv')
df.head()

,A_id,Size,Weight,Sweetness,Crunchiness,Juiciness,Ripeness,Acidity,Quality
0,0.0,-3.970049,-2.512336,5.34633,-1.012009,1.8449,0.32984,-0.491590483,good
1,1.0,-1.195217,-2.839257,3.664059,1.588232,0.853286,0.86753,-0.722809367,good
2,2.0,-0.292024,-1.351282,-1.738429,-0.342616,2.838636,-0.038033,2.621636473,bad
3,3.0,-0.657196,-2.271627,1.324874,-0.097875,3.63797,-3.413761,0.790723217,good
4,4.0,1.364217,-1.296612,-0.384658,-0.553006,3.030874,-1.303849,0.501984036,good


The dataset holds 4000 rows of apple data in 9 columns: 'A_id' is a unique ID, 'Quality' labels each apple as 'good' or 'bad', and the remaining seven columns ('Size', 'Weight', 'Sweetness', 'Crunchiness', 'Juiciness', 'Ripeness', and 'Acidity') are numerical features.

## Check for missing and duplicated values

In [106]:
# Check data description
df.describe()

,A_id,Size,Weight,Sweetness,Crunchiness,Juiciness,Ripeness
count,4000.0,4000.0,4000.0,4000.0,4000.0,4000.0,4000.0
mean,1999.5,-0.503015,-0.989547,-0.470479,0.985478,0.512118,0.498277
std,1154.844867,1.928059,1.602507,1.943441,1.402757,1.930286,1.874427
min,0.0,-7.151703,-7.149848,-6.894485,-6.055058,-5.961897,-5.864599
25%,999.75,-1.816765,-2.01177,-1.738425,0.062764,-0.801286,-0.771677
50%,1999.5,-0.513703,-0.984736,-0.504758,0.998249,0.534219,0.503445
75%,2999.25,0.805526,0.030976,0.801922,1.894234,1.835976,1.766212
max,3999.0,6.406367,5.790714,6.374916,7.619852,7.364403,7.237837


In [107]:
# Check for missing values
df.isnull().any()

A_id            True
Size            True
Weight          True
Sweetness       True
Crunchiness     True
Juiciness       True
Ripeness        True
Acidity        False
Quality         True
dtype: bool

Every column except 'Acidity' reports missing values. They turn out to sit in a single row, at index 4000 (the last row of the file). Let's check and handle that!

In [108]:
# Check the missing value index
np.where(df.isnull())[0]

array([4000, 4000, 4000, 4000, 4000, 4000, 4000, 4000], dtype=int64)
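
Before dropping anything, it can help to look at the offending rows directly. The sketch below uses a small toy frame as a stand-in for the apple data (the values, including the text in the last 'Acidity' cell, are hypothetical) and shows one way to list rows with missing values and remove them by content rather than by position.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the apple data (hypothetical values):
# the last row is mostly empty, like a trailing non-data line in a CSV.
toy = pd.DataFrame({
    "A_id": [0.0, 1.0, np.nan],
    "Size": [-3.97, -1.20, np.nan],
    "Acidity": ["-0.49", "-0.72", "some trailing note"],
    "Quality": ["good", "good", np.nan],
})

# Boolean mask: True for rows that contain at least one missing value
rows_with_nan = toy.isnull().any(axis=1)
print(toy[rows_with_nan])      # show the offending rows in full
print(toy.isnull().sum())      # missing-value count per column

# Drop every row with a missing value and rebuild a clean 0..n-1 index
clean = toy.dropna().reset_index(drop=True)
print(clean.shape)
```

Compared with dropping the last row by position, `dropna()` keeps working if the incomplete row is not the final one.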

In [109]:
# Drop the last row
df.drop(df.tail(1).index, inplace=True)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4000 entries, 0 to 3999
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   A_id         4000 non-null   float64
 1   Size         4000 non-null   float64
 2   Weight       4000 non-null   float64
 3   Sweetness    4000 non-null   float64
 4   Crunchiness  4000 non-null   float64
 5   Juiciness    4000 non-null   float64
 6   Ripeness     4000 non-null   float64
 7   Acidity      4000 non-null   object 
 8   Quality      4000 non-null   object 
dtypes: float64(7), object(2)
memory usage: 281.4+ KB
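
Two loose ends remain in the `df.info()` output: 'Acidity' is stored as `object` (strings) rather than `float64`, and duplicated rows have not been checked yet. A minimal sketch of both steps on a toy frame (hypothetical values) follows; whether the real data contains duplicates is not shown in this notebook.

```python
import pandas as pd

# Toy frame (hypothetical values): 'Acidity' is stored as strings,
# and the third row repeats the second.
toy = pd.DataFrame({
    "Size": [-3.97, -1.20, -1.20],
    "Acidity": ["-0.491590483", "-0.722809367", "-0.722809367"],
    "Quality": ["good", "good", "good"],
})

# Convert the string column to floats; errors="coerce" turns any
# unparseable entry into NaN instead of raising an exception.
toy["Acidity"] = pd.to_numeric(toy["Acidity"], errors="coerce")
print(toy.dtypes)

# Count fully duplicated rows, then keep only the first occurrence of each
n_duplicates = int(toy.duplicated().sum())
deduped = toy.drop_duplicates().reset_index(drop=True)
print(n_duplicates, deduped.shape)
```

On the real frame the same calls would be `pd.to_numeric(df['Acidity'], errors='coerce')` and `df.duplicated().sum()`.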


## Build classification model

As noted above, the data has one binary column and several numerical columns. We encode the binary 'Quality' column as the target and use the remaining 7 numerical columns as the inputs for our prediction. We also drop the 'A_id' column, since an ID carries no information about quality.

In [110]:
# Encode target column to 'good': 1 and 'bad': 0
target_column = 'Quality'
df[target_column] = df[target_column].map({'good': 1, 'bad': 0})

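# Keep the target aside, then remove the target and ID columns from the features
output_rows = df[target_column]
# Names of the feature columns (everything except the target and the ID)
numerical_column = df.columns.drop([target_column, 'A_id'])
df.drop(target_column, axis=1, inplace=True)
df.drop('A_id', axis=1, inplace=True)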
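
The same encoding and feature selection can be written without modifying the original frame in place, which makes the cell safe to re-run. This is a sketch on toy data (hypothetical values); the names `X` and `y` are illustrative and are not used elsewhere in the notebook.

```python
import pandas as pd

# Toy frame (hypothetical values) with an ID, two features, and the label
toy = pd.DataFrame({
    "A_id": [0.0, 1.0, 2.0],
    "Size": [-3.97, -1.20, -0.29],
    "Weight": [-2.51, -2.84, -1.35],
    "Quality": ["good", "good", "bad"],
})

# Encode the label: 'good' -> 1, 'bad' -> 0
y = toy["Quality"].map({"good": 1, "bad": 0})

# map() yields NaN for any label missing from the dictionary,
# so verify that every row was encoded before training on it.
assert y.notna().all(), "unexpected label in 'Quality'"

# Features: everything except the ID and the label; 'toy' itself is unchanged
X = toy.drop(columns=["A_id", "Quality"])
print(X.columns.tolist(), y.tolist())
```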

We normalize the features with a Min-Max scaler, split the data into training and test sets, and then train the model.
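
Min-max scaling maps each feature to the range [0, 1] with x' = (x - min) / (max - min), computed per column. The sketch below checks that formula against `MinMaxScaler` on random toy data. Note that it uses a different order from this notebook: it splits first and fits the scaler on the training part only, a common way to keep test-set statistics out of preprocessing.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Toy data (random, for illustration only): 40 samples, 3 features
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = rng.integers(0, 2, size=40)

# Split first, so the scaler never sees the test rows
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Learn per-column min and max from the training set only
scaler = MinMaxScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)   # reuses the training min/max

# The same transformation written out by hand
col_min = X_train.min(axis=0)
col_max = X_train.max(axis=0)
manual = (X_train - col_min) / (col_max - col_min)

print(np.allclose(manual, X_train_s))
```

With this ordering, scaled test values can fall slightly outside [0, 1], which is expected.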

In [119]:
scaler = MinMaxScaler()
scaler.fit(df)
t_df = scaler.transform(df)

X_train, X_test, y_train, y_test = train_test_split(t_df, output_rows, test_size=0.25, random_state=0)

# One hidden dense layer with 64 neurons taking the 7 input features, followed by a single sigmoid output unit
model = keras.models.Sequential([
    keras.layers.Dense(units=64, activation='relu', input_shape=(7,)),
    keras.layers.Dense(1, activation='sigmoid')
])

adam = keras.optimizers.Adam(learning_rate=0.005)

model.compile(loss='binary_crossentropy', optimizer=adam, metrics=["accuracy"])

model.fit(X_train, y_train, epochs=100)

Epoch 1/100
...
Epoch 100/100


<keras.src.callbacks.History at 0x2a9fe37ac90>
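
The training log above does not say how well the model does on unseen data. A minimal sketch of turning sigmoid outputs into class labels and scoring them is shown below. The probabilities and labels here are made up for illustration; in this notebook they would presumably come from `model.predict(X_test).ravel()` and `y_test`, and `model.evaluate(X_test, y_test)` would report the loss and accuracy directly.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical stand-ins for y_test and the model's sigmoid outputs
y_true = np.array([1, 0, 1, 1, 0, 0])
probs = np.array([0.91, 0.20, 0.45, 0.77, 0.62, 0.08])

# A sigmoid output is a probability of class 1 ('good');
# threshold at 0.5 to obtain hard 0/1 predictions.
y_pred = (probs >= 0.5).astype(int)

acc = accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes:
# [[true 'bad' -> 'bad', true 'bad' -> 'good'],
#  [true 'good' -> 'bad', true 'good' -> 'good']]
cm = confusion_matrix(y_true, y_pred)

print(f"accuracy: {acc:.3f}")
print(cm)
```

The confusion matrix shows whether errors lean toward calling bad apples good or the reverse, which a single accuracy number hides.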