Carlo Antonio T. Taleon BSCS-2A | 2020-2021 | Written January 2021

# **Wine Classification using TensorFlow with Keras**
- This is my final project for the Introduction to Artificial Intelligence class (1st semester) at WVSU-CICT.
- It's also the first notebook I'm writing on Kaggle. :)
- This project classifies wines using the [Wine Data Set on UCI](https://archive.ics.uci.edu/ml/datasets/wine).

## **1. Import modules**

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import ConfusionMatrixDisplay
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.optimizers import Adam
import matplotlib.pyplot as plt
import numpy as np

## **2. Data Preparation**
### Importing data and adding headers

In [None]:
df = pd.read_csv('../input/wineuci/Wine.csv')

df.columns = [
    'name',
    'alcohol',
    'malicAcid',
    'ash',
    'ashalcalinity',
    'magnesium',
    'totalPhenols',
    'flavanoids',
    'nonFlavanoidPhenols',
    'proanthocyanins',
    'colorIntensity',
    'hue',
    'od280_od315',
    'proline',
]

### Defining features and labels

In [None]:
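X = df.drop(['name', 'alcohol'], axis=1)  # features: every column except the label 'name'; note that 'alcohol' is dropped too
y = df['name'] - 1  # shift class labels from 1..3 to 0..2 (0-based indices for the sparse loss)
labels = ['Wine 1', 'Wine 2', 'Wine 3']  # dataset does not supply the names of the labels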

### Splitting X and y and defining input shape

In [None]:
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=1) # random_state=1 for same results every time.
input_shape = [X_train.shape[1]]
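
When no sizes are passed, `train_test_split` holds out 25% of the rows. The idea behind a seeded, shuffled split can be sketched with NumPy as follows. This is only an illustration and will not reproduce scikit-learn's exact shuffle; the helper name `shuffled_split` is hypothetical, and the row count of 178 is an assumption based on the size of the UCI Wine data.

```python
import numpy as np

def shuffled_split(n_samples, test_fraction=0.25, seed=1):
    """Return (train_indices, val_indices) from a seeded random permutation."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)
    # Round the held-out part up, so 25% of 178 rows becomes 45 rows
    n_test = int(np.ceil(test_fraction * n_samples))
    return indices[n_test:], indices[:n_test]

# 178 rows is an assumption (size of the UCI Wine data)
train_idx, val_idx = shuffled_split(178)
print(len(train_idx), len(val_idx))
```

Because the permutation is seeded, calling the helper again with the same seed gives the same split, which is the role `random_state=1` plays above.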

## **3. The Model**
### Defining and compiling the model

In [None]:
model = keras.Sequential([
    # input layer
    layers.BatchNormalization(input_shape=input_shape),
    # hidden layer 1
    layers.Dense(units=256, activation='relu'),
    layers.BatchNormalization(),
    layers.Dropout(rate=0.4),
    # hidden layer 2
    layers.Dense(units=128, activation='relu'),
    layers.BatchNormalization(),
    layers.Dropout(rate=0.4),
    # hidden layer 3
    layers.Dense(units=64, activation='relu'),
    layers.BatchNormalization(),
    layers.Dropout(rate=0.4),
    # hidden layer 4
    layers.Dense(units=32, activation='relu'),
    layers.BatchNormalization(),
    layers.Dropout(rate=0.4),
    # output layer: one unit per wine class, softmax gives class probabilities
    layers.Dense(units=3, activation='softmax')
])

model.compile(
    optimizer=Adam(learning_rate=0.0001),
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)
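
The `sparse_categorical_crossentropy` loss expects integer class indices that start at 0, which is why the labels were shifted by 1 earlier. As a rough illustration of the formula (a NumPy sketch, not Keras's actual implementation; the probabilities below are made-up values), the loss is the mean negative log of the probability the model assigns to the true class:

```python
import numpy as np

def sparse_categorical_crossentropy(y_true, probs, eps=1e-7):
    """Mean of -log(probability of the true class) over the batch."""
    probs = np.clip(probs, eps, 1.0)  # avoid log(0)
    # Pick, for each row, the probability in the column of its true class
    picked = probs[np.arange(len(y_true)), y_true]
    return float(-np.log(picked).mean())

# Made-up softmax outputs for 3 samples (rows) over 3 classes (columns)
probs = np.array([
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.3, 0.6],
])
y_true = np.array([0, 1, 2])  # 0-based class indices

loss = sparse_categorical_crossentropy(y_true, probs)
print(round(loss, 4))
```

The more probability the model puts on the correct class, the closer each term gets to 0, so a confident and correct model has a low loss.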

## **4. Training**

In [None]:
history = model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    batch_size=512,
    epochs=700,
)
print("")

## **5. Evaluation of Model's Performance**
### Results of Training
Learning curves showing loss and accuracy during training.

In [None]:
### Loss Graph
history_df = pd.DataFrame(history.history)
history_df.loc[:, ['loss', 'val_loss']].plot(title="Learning Curve: Loss over Epochs")
plt.ylabel("Loss")
plt.xlabel("Epochs")
plt.legend(['Training Loss', 'Validation Loss'])

### Accuracy Graph
history_df.loc[:, ['accuracy', 'val_accuracy']].plot(title="Learning Curve: Accuracy over Epochs")
plt.ylabel("Accuracy")
plt.xlabel("Epochs")
plt.legend(['Training Accuracy', 'Validation Accuracy'])

### Results of Predictions

In [None]:
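y_actual = y_val.to_numpy()  # evaluate on the held-out validation split, not the training data
y_pred = model.predict(X_val, verbose=0)
y_pred = np.argmax(y_pred, axis=-1)  # most probable class per sample

print("On {} samples of the held-out validation set:".format(len(y_pred)))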
print("Prediction:")
print(y_pred)
print("Actual:")
print(y_actual)

### Classification Report
print("\nClassification Report:")
print(classification_report(y_actual,y_pred, target_names=labels))

### Confusion Matrix Graph
cm = confusion_matrix(y_true=y_actual, y_pred=y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=labels)
disp.plot()
plt.title('Confusion Matrix')
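
To make the prediction step concrete, here is a small NumPy sketch of how `argmax` turns softmax probabilities into class labels and how a confusion matrix is counted from them. The probabilities and labels are made-up values, and the `_demo` names are hypothetical so they do not clash with the variables in the notebook.

```python
import numpy as np

# Made-up softmax outputs for 4 samples (rows) over 3 classes (columns)
probs_demo = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.4, 0.3],
    [0.2, 0.2, 0.6],
])
y_true_demo = np.array([0, 1, 0, 2])

# Index of the largest probability in each row is the predicted class
y_pred_demo = np.argmax(probs_demo, axis=-1)

# Count (actual, predicted) pairs: rows are actual classes, columns are predictions
n_classes = probs_demo.shape[1]
cm_demo = np.zeros((n_classes, n_classes), dtype=int)
for actual, predicted in zip(y_true_demo, y_pred_demo):
    cm_demo[actual, predicted] += 1

# Correct predictions sit on the diagonal
accuracy_demo = np.trace(cm_demo) / cm_demo.sum()
print(y_pred_demo)
print(cm_demo)
print(accuracy_demo)
```

The rows-as-actual, columns-as-predicted layout matches the convention of `confusion_matrix` in scikit-learn.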

### My Evaluation:

- I tuned the neural network through trial and error until it consistently reached low loss values on both the training and validation data.
- I also made sure that the accuracy on both the training and validation data is above 0.9.
- Specifically, I increased the dropout rate and added batch normalization to try to reduce overfitting. I learned this from the Deep Learning course here on Kaggle.
- The learning curves for loss and accuracy show signs of overfitting, since the model does better on the training data than on the validation data, but the gap is minuscule enough that it does not noticeably affect the accuracy of the model.
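
To illustrate what the dropout layers do during training, here is a small NumPy sketch of inverted dropout with the same rate of 0.4 used in the model. It is a simplified illustration, not Keras's implementation: the helper name `dropout`, the seed, and the all-ones activations are assumptions made for the example. Units are zeroed at random and the survivors are scaled up by `1 / (1 - rate)`, so the expected activation stays the same.

```python
import numpy as np

def dropout(x, rate, rng):
    """Inverted dropout: zero out roughly `rate` of the units, rescale the rest."""
    keep = rng.random(x.shape) >= rate  # True for units that survive
    return np.where(keep, x / (1.0 - rate), 0.0)

rng = np.random.default_rng(1)       # arbitrary seed for repeatability
activations = np.ones((1000, 32))    # made-up activations, all equal to 1
dropped = dropout(activations, rate=0.4, rng=rng)

fraction_zeroed = (dropped == 0).mean()
mean_after = dropped.mean()
print("fraction zeroed:", fraction_zeroed)
print("mean after dropout:", mean_after)
```

Since a different random subset of units is silenced at every step, the network cannot rely on any single unit, which is why dropout tends to reduce overfitting. At prediction time the layer is inactive and passes its input through unchanged.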