1. <a href="#load">Loading data</a>
2. <a href="#eda">Exploratory data analysis</a>
3. <a href="#proc">Pre-processing</a>
4. <a href="#prep">Data preparation</a>
    * <a href="#tts">train-test split / LabelEncoder</a>
    * <a href="#scal">Scaler</a>
5. <a href="#modl">Build Model</a>
6. <a href="#eval">Evaluate Model</a>

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler, RobustScaler
from sklearn.metrics import classification_report, confusion_matrix
import tensorflow as tf
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

# <a id="load">Loading data</a>

In [None]:
file = "/kaggle/input/red-wine-quality-cortez-et-al-2009/winequality-red.csv"
df = pd.read_csv(file)

# <a id="eda">Exploratory Data Analysis</a>

In [None]:
print(df.shape)
df.head()

In [None]:
df.info()

Checking linear correlation among features:

In [None]:
sns.pairplot(df, hue="quality")

In [None]:
mask = np.zeros_like(df.corr())
mask[np.triu_indices_from(mask)] = True

plt.figure(figsize=(18,8))
sns.heatmap(df.corr(), cmap='viridis', mask=mask, annot=False, square=True)

There appear to be some highly correlated features in the dataset, mainly among the following (their correlations are all above 0.6 in absolute value):

* 'citric acid', 'density' and 'pH' to 'fixed acidity'
* 'citric acid' to 'volatile acidity'
* 'total sulfur dioxide' to 'free sulfur dioxide'.
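
The pairs above were read off the heatmap. As a minimal sketch, the same kind of list can be produced programmatically by keeping the upper triangle of the correlation matrix and filtering on the 0.6 threshold. The helper `correlated_pairs` and the synthetic columns `a`, `b`, `c` are hypothetical and only stand in for the wine features:

```python
import numpy as np
import pandas as pd

def correlated_pairs(frame, threshold=0.6):
    """Return (feature_1, feature_2, corr) for pairs with |corr| > threshold."""
    corr = frame.corr()
    # Keep only the strict upper triangle so each pair is listed once
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    # The comparison is False for NaN, so masked entries are dropped here too
    pairs = pairs[pairs.abs() > threshold]
    return [(a, b, round(float(v), 3)) for (a, b), v in pairs.items()]

# Synthetic stand-in data (hypothetical columns, not the wine dataset)
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=200),  # built to correlate strongly with a
    "c": rng.normal(size=200),                 # independent noise
})

pairs = correlated_pairs(toy)
print(pairs)
```

On the real data, `correlated_pairs(df.drop(columns="quality"))` would presumably list the feature pairs discussed above.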

Let's first confirm those pairs, then check each feature's correlation with the target (the _quality_ column) and the target's distribution:

In [None]:
# flag absolute correlations above 0.6 for the 'fixed acidity', 'volatile acidity' and 'total sulfur dioxide' features

(abs(df.corr()[['fixed acidity', 'volatile acidity', 'total sulfur dioxide']])>0.6)*1

In [None]:
df.corr()['quality'].iloc[:-1].sort_values().plot(kind='bar')

In [None]:
sns.histplot(df.quality)

In [None]:
df.quality.value_counts()

The dataset is clearly imbalanced. We could consider applying SMOTE or a cost-sensitive learning technique; see https://machinelearningmastery.com/multi-class-imbalanced-classification/
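
As one cost-sensitive option, class weights can be derived from the label counts with scikit-learn's `compute_class_weight`. This is only a sketch on toy labels (the counts in `y_toy` are made up, not the wine data), and it is not applied anywhere in this notebook:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy encoded labels standing in for the quality classes (hypothetical counts)
y_toy = np.array([0] * 10 + [1] * 60 + [2] * 30)

classes = np.unique(y_toy)
# 'balanced' computes n_samples / (n_classes * count_per_class)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)

# Map each class index to its weight; rarer classes get larger weights
class_weight = {int(c): float(w) for c, w in zip(classes, weights)}
print(class_weight)
```

A dictionary of this shape can be passed to Keras through the `class_weight` argument of `model.fit`, so that errors on rare classes contribute more to the loss.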

Now let's scale all the data so that a single boxplot of all features makes their variability easy to compare.

WARNING: this scaling is used only for this visual inspection! Do not reuse this scaler for training later: it is fitted on the whole dataset, so it would introduce data snooping.
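
For contrast, here is a minimal sketch of the leakage-free pattern to use when scaling for training: fit the scaler on the training split only, then apply the same fitted transform to the test split. The array `X_toy` and the names `X_tr`, `X_te` and `demo_scaler` are hypothetical and use synthetic data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in features (not the wine data)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))

X_tr, X_te = train_test_split(X_toy, test_size=0.2, random_state=42)

# Fit on the training split only, so no test statistics leak into the scaler
demo_scaler = MinMaxScaler().fit(X_tr)
X_tr_scaled = demo_scaler.transform(X_tr)
X_te_scaled = demo_scaler.transform(X_te)

# Training values span exactly [0, 1]; test values may fall slightly outside
print(X_tr_scaled.min(), X_tr_scaled.max())
print(X_te_scaled.min(), X_te_scaled.max())
```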

In [None]:
scaler = MinMaxScaler()

X_train = pd.DataFrame(scaler.fit_transform(df))
X_train.columns = df.columns

In [None]:
sns.set_theme(style="ticks", palette="pastel")

plt.figure(figsize=(18,8))
sns.boxplot(data=X_train)

# <a id="proc">Pre-processing</a>

Based on our previous data exploration, let's drop some highly correlated features:

In [None]:
# dropping 'citric acid', 'density', 'pH', 'total sulfur dioxide':

df = df.drop(columns=['citric acid', 'density', 'pH', 'total sulfur dioxide'])

In [None]:
df.head()

# <a id="prep">Data preparation</a>

Data preparation steps:
* label encoding of the target and train-test split
* robust scaling (RobustScaler)

### <a id="tts">train-test split / LabelEncoder</a>

In [None]:
X = df.loc[:, df.columns != 'quality'].values
y = df.quality.values
le = LabelEncoder()
y = le.fit_transform(y)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

y_train_cat = to_categorical(y_train, 6)
y_test_cat = to_categorical(y_test, 6)

### <a id="scal">Scaler</a>

In [None]:
scaler = RobustScaler()

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

print(X_train.shape)
print(X_test.shape)

# <a id="modl">Build Model</a>

In [None]:
xavier_init = tf.keras.initializers.GlorotNormal()

model = Sequential()
model.add(Dense(64, kernel_initializer=xavier_init,  activation='relu'))
model.add(Dense(32, kernel_initializer=xavier_init, activation='relu'))
model.add(Dense(16, kernel_initializer=xavier_init, activation='relu'))
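# softmax yields a probability distribution over the 6 mutually exclusive classes,
# which is what categorical_crossentropy expects
model.add(Dense(6, kernel_initializer=xavier_init, activation='softmax'))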

model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

In [None]:
model.fit(X_train,
          y_train_cat,
          epochs=30,
          validation_data=(X_test,y_test_cat),
          verbose=1)

In [None]:
losses = pd.DataFrame(model.history.history)

losses[['loss','val_loss']].plot()
losses[['accuracy','val_accuracy']].plot()

# <a id="eval">Evaluate Model</a>

In [None]:
print(model.metrics_names)
print(model.evaluate(X_test,y_test_cat,verbose=0))

In [None]:
predictions = le.inverse_transform(np.argmax(model.predict(X_test), axis=-1))

In [None]:
print(classification_report(le.inverse_transform(y_test),
                            predictions))

In [None]:
CM = confusion_matrix(le.inverse_transform(y_test), predictions)
CM