# Project: Wine quality classification

# About the dataset:

Link to the dataset on Kaggle:
https://www.kaggle.com/danielpanizzo/wine-quality

Citation Request:
This dataset is publicly available for research. The details are described in [Cortez et al., 2009].
Please include this citation if you plan to use this database:

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis.
Modeling wine preferences by data mining from physicochemical properties.
In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

Available at: [@Elsevier] http://dx.doi.org/10.1016/j.dss.2009.05.016
[Pre-press (pdf)] http://www3.dsi.uminho.pt/pcortez/winequality09.pdf
[bib] http://www3.dsi.uminho.pt/pcortez/dss09.bib

Title: Wine Quality

Sources
Created by: Paulo Cortez (Univ. Minho), Antonio Cerdeira, Fernando Almeida, Telmo Matos and Jose Reis (CVRVV) @ 2009

Past Usage:

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis.
Modeling wine preferences by data mining from physicochemical properties.
In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

In the above reference, two datasets were created, using red and white wine samples.
The inputs include objective tests (e.g. pH values) and the output is based on sensory data
(median of at least 3 evaluations made by wine experts). Each expert graded the wine quality
between 0 (very bad) and 10 (very excellent). Several data mining methods were applied to model
these datasets under a regression approach. The support vector machine model achieved the
best results. Several metrics were computed: MAD, confusion matrix for a fixed error tolerance (T),
etc. Also, we plot the relative importances of the input variables (as measured by a sensitivity
analysis procedure).
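
The metrics named above can be sketched in a few lines. This is a minimal illustration, assuming that MAD means the mean absolute deviation between true and predicted scores and that a prediction counts as correct when its absolute error is at most T. The function names and the sample values are made up for this example and are not taken from the paper.

```python
import numpy as np


def mean_absolute_deviation(y_true, y_pred):
    """Average absolute difference between true and predicted scores."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))


def accuracy_within_tolerance(y_true, y_pred, tolerance):
    """Share of predictions whose absolute error is at most `tolerance`."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred) <= tolerance))


# made-up quality scores and regression outputs, for illustration only
y_true = np.array([5, 6, 7, 5, 6])
y_pred = np.array([5.2, 6.9, 6.4, 4.0, 6.1])

mad = mean_absolute_deviation(y_true, y_pred)
acc_half = accuracy_within_tolerance(y_true, y_pred, tolerance=0.5)
acc_one = accuracy_within_tolerance(y_true, y_pred, tolerance=1.0)
print(mad, acc_half, acc_one)
```

A larger tolerance can only keep or raise the accuracy, which is why a regression model may look much better at T = 1.0 than at T = 0.5.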

Relevant Information:

The two datasets are related to red and white variants of the Portuguese "Vinho Verde" wine.
For more details, consult: http://www.vinhoverde.pt/en/ or the reference [Cortez et al., 2009].
Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables
are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).

These datasets can be viewed as classification or regression tasks.
The classes are ordered and not balanced (e.g. there are much more normal wines than
excellent or poor ones). Outlier detection algorithms could be used to detect the few excellent
or poor wines. Also, we are not sure if all input variables are relevant. So
it could be interesting to test feature selection methods.

Number of Instances: red wine - 1599; white wine - 4898.

Number of Attributes: 11 + output attribute

Note: several of the attributes may be correlated, thus it makes sense to apply some sort of
feature selection.
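
One simple way to look for correlated attributes is to list the column pairs whose absolute Pearson correlation exceeds a threshold. The sketch below uses synthetic data, and both the helper name `correlated_pairs` and the 0.8 threshold are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd


def correlated_pairs(frame, threshold=0.8):
    """Return (col_a, col_b, r) for column pairs with |Pearson r| >= threshold."""
    corr = frame.corr()
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            r = corr.loc[a, b]
            if abs(r) >= threshold:
                pairs.append((a, b, float(r)))
    return pairs


# synthetic stand-in data: the second column is built from the first one
rng = np.random.default_rng(0)
sugar = rng.normal(size=500)
demo = pd.DataFrame({
    "sugar": sugar,
    "density_like": sugar * 0.9 + rng.normal(scale=0.2, size=500),
    "noise": rng.normal(size=500),
})

pairs = correlated_pairs(demo, threshold=0.8)
print(pairs)
```

For each reported pair, one of the two columns is a candidate for removal, although Pearson correlation only captures linear relationships.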

Attribute information:

For more information, read [Cortez et al., 2009].

Input variables (based on physicochemical tests):
1 - fixed acidity (tartaric acid - g / dm^3)
2 - volatile acidity (acetic acid - g / dm^3)
3 - citric acid (g / dm^3)
4 - residual sugar (g / dm^3)
5 - chlorides (sodium chloride - g / dm^3)
6 - free sulfur dioxide (mg / dm^3)
7 - total sulfur dioxide (mg / dm^3)
8 - density (g / cm^3)
9 - pH
10 - sulphates (potassium sulphate - g / dm^3)
11 - alcohol (% by volume)
Output variable (based on sensory data):
12 - quality (score between 0 and 10)

Missing Attribute Values: None

Description of attributes:

1 - fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)

2 - volatile acidity: the amount of acetic acid in wine, which at too high levels can lead to an unpleasant, vinegar taste

3 - citric acid: found in small quantities, citric acid can add 'freshness' and flavor to wines

4 - residual sugar: the amount of sugar remaining after fermentation stops; it's rare to find wines with less than 1 gram/liter, and wines with greater than 45 grams/liter are considered sweet

5 - chlorides: the amount of salt in the wine

6 - free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine

7 - total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine

8 - density: the density of wine is close to that of water depending on the percent alcohol and sugar content

9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale

10 - sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant

11 - alcohol: the percent alcohol content of the wine


# White wines: STEP 1: Learning the dataset and feature engineering

In [None]:
# importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# importing the dataset
df = pd.read_csv("/kaggle/input/wine-quality/wineQualityWhites.csv", index_col=0)

Exploring the dataset and engineering features

In [None]:
# showing the first five rows of the dataset
df.head()

In [None]:
# showing the column names
list(df.columns)

In [None]:
# showing column data types and non-null counts
df.info()

In [None]:
# showing statistical data of the dataset
df.describe()

Visualize data columns

Explore distribution, skewness, outliers and other statistical properties

In [None]:
# plotting fixed.acidity variable
ax = sns.boxplot(x=df["fixed.acidity"]);

In [None]:
# finding outliers
df.loc[df["fixed.acidity"] > 11]

In [None]:
# deleting the outliers
df = df.drop(df.loc[df["fixed.acidity"] > 11].index)
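
The cut-off values in this notebook are read off the box plots by eye. As an alternative, the 1.5 × IQR rule that box plots conventionally use for their whiskers can be applied programmatically. The sketch below is an assumption about how that could look, not the approach used in this notebook; the helper `iqr_outlier_mask` and the sample values are hypothetical.

```python
import pandas as pd


def iqr_outlier_mask(series, k=1.5):
    """Boolean mask that is True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# made-up values with one obvious outlier
demo = pd.DataFrame({"fixed.acidity": [6.2, 6.8, 7.0, 7.1, 6.5, 6.9, 7.3, 14.2]})

mask = iqr_outlier_mask(demo["fixed.acidity"])
cleaned = demo.loc[~mask]
print(mask.sum(), len(cleaned))
```

This rule tends to flag many more rows than a hand-picked threshold on skewed columns, so a larger `k` may be preferable there.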

In [None]:
# printing skewness and kurtosis
skewness = df["fixed.acidity"].skew()
kurtosis = df["fixed.acidity"].kurtosis()
print("skewness is {:.2f}, kurtosis is {:.2f}".format(skewness, kurtosis))

In [None]:
# plotting volatile.acidity variable
ax = sns.boxplot(x=df["volatile.acidity"]);

In [None]:
# finding outliers
df.loc[df["volatile.acidity"] > 1.05]

In [None]:
# deleting the outliers
df = df.drop(df.loc[df["volatile.acidity"] > 1.05].index)

In [None]:
# printing skewness and kurtosis
skewness = df["volatile.acidity"].skew()
kurtosis = df["volatile.acidity"].kurtosis()
print("skewness is {:.2f}, kurtosis is {:.2f}".format(skewness, kurtosis))

In [None]:
# volatile.acidity distribution is highly skewed
# trying log transformation to improve the skewness
log_volatile_acidity_skewness = np.log(df["volatile.acidity"]).skew()
log_volatile_acidity_skewness

In [None]:
# creating log(volatile.acidity) column
df["log(volatile.acidity)"] = np.log(df['volatile.acidity'])

In [None]:
# removing volatile.acidity column
df.drop(["volatile.acidity"], axis=1, inplace=True)
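
The effect of the log transformation on skewness can be checked on synthetic data. This is a minimal sketch, assuming a right-skewed, strictly positive variable drawn from a log-normal distribution; the distribution parameters are arbitrary and do not describe the wine data.

```python
import numpy as np
import pandas as pd

# synthetic right-skewed, strictly positive values (parameters chosen arbitrarily)
rng = np.random.default_rng(42)
values = pd.Series(rng.lognormal(mean=-1.3, sigma=0.4, size=2000))

# np.log is only defined for positive inputs, so check before transforming
assert (values > 0).all()

skew_before = values.skew()
skew_after = np.log(values).skew()
print("skewness before: {:.2f}, after log: {:.2f}".format(skew_before, skew_after))
```

For columns that contain zeros, `np.log1p` is a common substitute because it computes log(1 + x) and stays finite at zero.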

In [None]:
# plotting citric.acid variable
ax = sns.boxplot(x=df["citric.acid"]);

In [None]:
# finding outliers
df.loc[df["citric.acid"] > 1.1]

In [None]:
# deleting the outliers
df = df.drop(df.loc[df["citric.acid"] > 1.1].index)

In [None]:
# printing skewness and kurtosis
skewness = df["citric.acid"].skew()
kurtosis = df["citric.acid"].kurtosis()
print("skewness is {:.2f}, kurtosis is {:.2f}".format(skewness, kurtosis))

In [None]:
# plotting residual.sugar variable
ax = sns.boxplot(x=df["residual.sugar"]);

In [None]:
# finding outliers
df.loc[df["residual.sugar"] > 30]

In [None]:
# deleting the outliers
df = df.drop(df.loc[df["residual.sugar"] > 30].index)

In [None]:
# printing skewness and kurtosis
skewness = df["residual.sugar"].skew()
kurtosis = df["residual.sugar"].kurtosis()
print("skewness is {:.2f}, kurtosis is {:.2f}".format(skewness, kurtosis))

In [None]:
# plotting chlorides variable
ax = sns.boxplot(x=df["chlorides"]);

In [None]:
# finding outliers
df.loc[df["chlorides"] > 0.22]

In [None]:
# deleting the outliers
df = df.drop(df.loc[df["chlorides"] > 0.22].index)

In [None]:
# printing skewness and kurtosis
skewness = df["chlorides"].skew()
kurtosis = df["chlorides"].kurtosis()
print("skewness is {:.2f}, kurtosis is {:.2f}".format(skewness, kurtosis))

In [None]:
# chlorides distribution is highly skewed
# trying log transformation to improve the skewness
log_chlorides_skewness = np.log(df["chlorides"]).skew()
log_chlorides_skewness

In [None]:
# creating log(chlorides) column
df["log(chlorides)"] = np.log(df['chlorides'])

In [None]:
# removing chlorides column
df.drop(["chlorides"], axis=1, inplace=True)

In [None]:
# plotting free.sulfur.dioxide variable
ax = sns.boxplot(x=df["free.sulfur.dioxide"]);

In [None]:
# finding outliers
df.loc[df["free.sulfur.dioxide"] > 200]

In [None]:
# deleting the outliers
df = df.drop(df.loc[df["free.sulfur.dioxide"] > 200].index)

In [None]:
# printing skewness and kurtosis
skewness = df["free.sulfur.dioxide"].skew()
kurtosis = df["free.sulfur.dioxide"].kurtosis()
print("skewness is {:.2f}, kurtosis is {:.2f}".format(skewness, kurtosis))

In [None]:
# plotting total.sulfur.dioxide variable
ax = sns.boxplot(x=df["total.sulfur.dioxide"]);

In [None]:
# finding outliers
df.loc[df["total.sulfur.dioxide"] > 320]

In [None]:
# deleting the outliers
df = df.drop(df.loc[df["total.sulfur.dioxide"] > 320].index)

In [None]:
# printing skewness and kurtosis
skewness = df["total.sulfur.dioxide"].skew()
kurtosis = df["total.sulfur.dioxide"].kurtosis()
print("skewness is {:.2f}, kurtosis is {:.2f}".format(skewness, kurtosis))

In [None]:
# plotting density variable
ax = sns.boxplot(x=df["density"]);

In [None]:
# printing skewness and kurtosis
skewness = df["density"].skew()
kurtosis = df["density"].kurtosis()
print("skewness is {:.2f}, kurtosis is {:.2f}".format(skewness, kurtosis))

In [None]:
# plotting pH variable
ax = sns.boxplot(x=df["pH"]);

In [None]:
# printing skewness and kurtosis
skewness = df["pH"].skew()
kurtosis = df["pH"].kurtosis()
print("skewness is {:.2f}, kurtosis is {:.2f}".format(skewness, kurtosis))

In [None]:
# plotting sulphates variable
ax = sns.boxplot(x=df["sulphates"]);

In [None]:
# finding outliers
df.loc[df["sulphates"] > 1.05]

In [None]:
# deleting the outliers
df = df.drop(df.loc[df["sulphates"] > 1.05].index)

In [None]:
# printing skewness and kurtosis
skewness = df["sulphates"].skew()
kurtosis = df["sulphates"].kurtosis()
print("skewness is {:.2f}, kurtosis is {:.2f}".format(skewness, kurtosis))

In [None]:
# plotting alcohol variable
ax = sns.boxplot(x=df["alcohol"]);

In [None]:
# printing skewness and kurtosis
skewness = df["alcohol"].skew()
kurtosis = df["alcohol"].kurtosis()
print("skewness is {:.2f}, kurtosis is {:.2f}".format(skewness, kurtosis))

In [None]:
# plotting quality variable
ax = sns.catplot(x="quality", data=df, kind='count', aspect=1.5);

In [None]:
# counting all the values of categorical dependent variable quality
df["quality"].value_counts(dropna=False)

In [None]:
# deleting the outliers
df = df.drop(df.loc[df["quality"] == 3].index)
df = df.drop(df.loc[df["quality"] == 9].index)
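
Instead of naming the sparse quality scores one by one, the same filtering can be driven by the class counts. The sketch below is one possible formulation; the helper `drop_rare_classes`, the `min_count` value and the sample data are assumptions for illustration.

```python
import pandas as pd


def drop_rare_classes(frame, target, min_count):
    """Keep only rows whose target value occurs at least `min_count` times."""
    counts = frame[target].value_counts()
    keep = counts[counts >= min_count].index
    return frame[frame[target].isin(keep)]


# made-up quality column in which scores 3 and 9 appear only once
demo = pd.DataFrame({"quality": [3, 5, 5, 5, 6, 6, 6, 6, 7, 7, 9]})

filtered = drop_rare_classes(demo, "quality", min_count=2)
print(filtered["quality"].value_counts().sort_index())
```

Note that a model trained after this step can never predict the removed scores, so the reported accuracy only applies to the remaining classes.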

In [None]:
# counting all the values of categorical dependent variable quality
df["quality"].value_counts(dropna=False)

In [None]:
# reordering the dataframe columns in original order
df = df[['fixed.acidity',
 'log(volatile.acidity)',
 'citric.acid',
 'residual.sugar',
 'log(chlorides)',
 'free.sulfur.dioxide',
 'total.sulfur.dioxide',
 'density',
 'pH',
 'sulphates',
 'alcohol',
 'quality']]

In [None]:
# calculating a correlation matrix
corr_matrix = df.corr()

In [None]:
# drawing a heatmap
plt.figure(figsize = (20,20))
ax = sns.heatmap(corr_matrix, annot=True, square=True, cmap='Blues')
plt.show()

Discussion: density is correlated with residual.sugar and alcohol.

Discussion: the variables citric.acid and free.sulfur.dioxide have negligible influence on the quality.
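
Reading weak predictors off a large heatmap can be error-prone, so the same information can be extracted as a ranked list. The sketch below uses synthetic data; the helper `rank_by_target_correlation` and the column names are hypothetical.

```python
import numpy as np
import pandas as pd


def rank_by_target_correlation(frame, target):
    """Absolute Pearson correlation of each feature with the target, strongest first."""
    corr = frame.corr()[target].drop(target)
    return corr.abs().sort_values(ascending=False)


# synthetic data: one informative feature and one unrelated feature
rng = np.random.default_rng(1)
n = 1000
alcohol = rng.normal(size=n)
demo = pd.DataFrame({
    "alcohol_like": alcohol,
    "weak_like": rng.normal(size=n),
    "quality_like": alcohol + rng.normal(scale=1.0, size=n),
})

ranking = rank_by_target_correlation(demo, "quality_like")
print(ranking)
```

Features at the bottom of the ranking are candidates for removal, with the caveat that a low linear correlation does not rule out a nonlinear relationship.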

In [None]:
# removing citric.acid column
df.drop(["citric.acid"], axis=1, inplace=True)

In [None]:
# removing free.sulfur.dioxide column
df.drop(["free.sulfur.dioxide"], axis=1, inplace=True)

In [None]:
df.head()

# White wines: STEP 2: Choosing best performing machine learning model

In [None]:
# defining variables
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values

In [None]:
# splitting the dataset into a training set and a test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

In [None]:
# feature scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
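
Because the scaler here is fitted on the whole training set, any later cross-validation on `X_train` sees data that was scaled using statistics from all folds. The effect is usually small, but it can be avoided by placing the scaler in a scikit-learn pipeline so that it is refitted inside each fold. The sketch below shows the idea on synthetic data; the data and the choice of logistic regression are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic, unscaled stand-in data with a binary target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)

# the scaler is fitted on the training part of every fold only
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(model, X_demo, y_demo, cv=5)
print("Accuracy: {:.2f} %".format(scores.mean() * 100))
```

With this pattern the raw, unscaled features are passed to `cross_val_score`, and the pipeline takes care of scaling.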

# Logistic Regression

In [None]:
# training the model on the training set
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state=42)
classifier.fit(X_train, y_train)

In [None]:
# predicting y_test
y_pred = classifier.predict(X_test)

In [None]:
# making the confusion matrix
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)
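
Overall accuracy can hide poor results on the less frequent quality scores. Per-class recall and precision can be derived directly from a confusion matrix, following the scikit-learn convention that rows are true classes and columns are predicted classes. The matrix below is made up for illustration and the helper name is hypothetical.

```python
import numpy as np


def per_class_recall_precision(cm):
    """Recall and precision per class from a confusion matrix (rows: true, columns: predicted)."""
    cm = np.asarray(cm, dtype=float)
    true_pos = np.diag(cm)
    row_totals = cm.sum(axis=1)  # number of samples per true class
    col_totals = cm.sum(axis=0)  # number of samples per predicted class
    recall = np.divide(true_pos, row_totals,
                       out=np.zeros_like(true_pos), where=row_totals > 0)
    precision = np.divide(true_pos, col_totals,
                          out=np.zeros_like(true_pos), where=col_totals > 0)
    return recall, precision


# made-up 3-class confusion matrix dominated by the middle class
cm_demo = np.array([[2, 8, 0],
                    [1, 80, 9],
                    [0, 15, 5]])

recall, precision = per_class_recall_precision(cm_demo)
accuracy = np.trace(cm_demo) / cm_demo.sum()
print(accuracy, recall, precision)
```

In this made-up example the accuracy is above 70% even though only a fifth of the first class is recognised, which is the kind of pattern to look for with unbalanced quality scores.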

In [None]:
# applying k-fold cross validation
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10)
print("Accuracy: {:.2f} %".format(accuracies.mean()*100))
print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))

# K Nearest Neighbors

In [None]:
# training the model on the training set
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
classifier.fit(X_train, y_train)

In [None]:
# predicting y_test
y_pred = classifier.predict(X_test)

In [None]:
# making the confusion matrix
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

In [None]:
# applying k-fold cross validation
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10)
print("Accuracy: {:.2f} %".format(accuracies.mean()*100))
print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))

# Support Vector Machine

In [None]:
# training the SVM model on the training set
from sklearn.svm import SVC
classifier = SVC(kernel = 'linear', random_state = 42)
classifier.fit(X_train, y_train)

In [None]:
# predicting the test set results
y_pred = classifier.predict(X_test)

In [None]:
# making the confusion matrix
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

In [None]:
# applying k-fold cross validation
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10)
print("Accuracy: {:.2f} %".format(accuracies.mean()*100))
print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))

# Kernel SVM

In [None]:
# training the model on the training set
from sklearn.svm import SVC
classifier = SVC(kernel = 'rbf', random_state = 42)
classifier.fit(X_train, y_train)

In [None]:
# predicting the test set results
y_pred = classifier.predict(X_test)

In [None]:
# making the confusion matrix
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

In [None]:
# applying k-fold cross validation
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10)
print("Accuracy: {:.2f} %".format(accuracies.mean()*100))
print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))

# Naive Bayes

In [None]:
# training the model on the training set
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)

In [None]:
# predicting the test set results
y_pred = classifier.predict(X_test)

In [None]:
# making the confusion matrix
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

In [None]:
# applying k-fold cross validation
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10)
print("Accuracy: {:.2f} %".format(accuracies.mean()*100))
print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))

# Decision Tree Classification

In [None]:
# training the model on the training set
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion = 'entropy', random_state = 42)
classifier.fit(X_train, y_train)

In [None]:
# predicting the test set results
y_pred = classifier.predict(X_test)

In [None]:
# making the confusion matrix
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

In [None]:
# applying k-fold cross validation
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10)
print("Accuracy: {:.2f} %".format(accuracies.mean()*100))
print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))

# Random Forest Classification

In [None]:
# training the model on the training set
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators = 15, criterion = 'entropy', random_state = 42)
classifier.fit(X_train, y_train)

In [None]:
# predicting the test set results
y_pred = classifier.predict(X_test)

In [None]:
# making the confusion matrix
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

In [None]:
# applying k-fold cross validation
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10)
print("Accuracy: {:.2f} %".format(accuracies.mean()*100))
print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))

# XG Boost

In [None]:
# training the model on the training set
# newer xgboost releases expect class labels 0..n_classes-1, so quality is label-encoded first
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier
le = LabelEncoder()
y_train_enc = le.fit_transform(y_train)
classifier = XGBClassifier()
classifier.fit(X_train, y_train_enc)

In [None]:
# predicting the test set results and mapping them back to the original quality scores
y_pred = le.inverse_transform(classifier.predict(X_test))

In [None]:
# making the confusion matrix
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

In [None]:
# applying k-fold cross validation
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train_enc, cv = 10)
print("Accuracy: {:.2f} %".format(accuracies.mean()*100))
print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))

# Cat Boost

In [None]:
# training the model on the training set
from catboost import CatBoostClassifier
classifier = CatBoostClassifier()
classifier.fit(X_train, y_train)

In [None]:
# predicting the test set results
y_pred = classifier.predict(X_test)

In [None]:
# making the confusion matrix
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

In [None]:
# applying k-fold cross validation
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10)
print("Accuracy: {:.2f} %".format(accuracies.mean()*100))
print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))

Discussion: the random forest classifier showed the best performance, with 67% accuracy on the test set and 65% accuracy under 10-fold cross-validation.
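
A comparison like this one can also be produced in a single loop, which keeps the evaluation identical for every model. The sketch below runs on synthetic data from `make_classification`; the dataset, the selection of three models and their settings are assumptions for illustration and do not reproduce the numbers above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# synthetic 3-class problem with 9 features as a stand-in dataset
X_demo, y_demo = make_classification(n_samples=400, n_features=9, n_informative=5,
                                     n_classes=3, random_state=42)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "k nearest neighbors": KNeighborsClassifier(n_neighbors=5),
    "random forest": RandomForestClassifier(n_estimators=15, random_state=42),
}

# evaluate every model with the same 5-fold cross-validation
results = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=5)
    results[name] = (scores.mean(), scores.std())

# print the models from best to worst mean accuracy
for name, (mean, std) in sorted(results.items(), key=lambda item: item[1][0], reverse=True):
    print("{}: {:.2f} % (+/- {:.2f} %)".format(name, mean * 100, std * 100))
```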

# STEP 3: Building and optimising deep learning model

In [None]:
# importing tensorflow
import tensorflow as tf
tf.__version__

In [None]:
# initializing the ANN
ann = tf.keras.models.Sequential()

In [None]:
# showing the shape of the X_train array
X_train.shape

In [None]:
# adding the input layer and the first hidden layer
ann.add(tf.keras.layers.Dense(units=9, activation='relu'))

In [None]:
# adding the second hidden layer
ann.add(tf.keras.layers.Dense(units=36, activation='relu'))

In [None]:
# adding the third hidden layer
ann.add(tf.keras.layers.Dense(units=288, activation='relu'))

In [None]:
# adding the fourth hidden layer
ann.add(tf.keras.layers.Dense(units=1152, activation='relu'))

In [None]:
# adding the fifth hidden layer
ann.add(tf.keras.layers.Dense(units=1152, activation='relu'))

In [None]:
# adding the sixth hidden layer
ann.add(tf.keras.layers.Dense(units=288, activation='relu'))

In [None]:
# adding the seventh hidden layer
ann.add(tf.keras.layers.Dense(units=36, activation='relu'))

In [None]:
# adding the output layer
ann.add(tf.keras.layers.Dense(units=1))
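
To get a feeling for the size of this network, the number of trainable parameters of a stack of dense layers can be computed by hand: each layer has (inputs × units) weights plus one bias per unit. The sketch below assumes 9 input features, which matches the columns left after the feature engineering step but should be checked against `X_train.shape`; the helper name is hypothetical.

```python
def dense_parameter_count(n_inputs, layer_units):
    """Total weights and biases of consecutive fully connected layers."""
    total = 0
    fan_in = n_inputs
    for units in layer_units:
        total += fan_in * units + units  # weight matrix plus bias vector
        fan_in = units
    return total


# layer sizes as defined in the cells above, ending with the single output unit
layer_units = [9, 36, 288, 1152, 1152, 288, 36, 1]
n_inputs = 9  # assumption: number of feature columns in X_train

total_params = dense_parameter_count(n_inputs, layer_units)
print(total_params)
```

Under this assumption the network has about two million parameters, which is large compared with a training set of a few thousand rows and makes overfitting a realistic concern.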

Training the ANN

In [None]:
# compiling the ANN
ann.compile(optimizer = 'adam', loss = 'mean_squared_error')

In [None]:
# training the ANN model on the Training set
ann.fit(X_train, y_train, batch_size = 64, epochs = 1000)

In [None]:
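# predicting the test set results and rounding them to the nearest integer quality score
y_pred = ann.predict(X_test)
y_pred = np.around(y_pred, decimals=0).astype(int).ravel()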
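
Since the network is trained as a regressor, rounded outputs can fall outside the quality scores that remain in the data (4 to 8 after scores 3 and 9 were removed). A possible refinement, sketched below with made-up raw outputs, is to clip the rounded values to that range. The helper `scores_to_labels` is hypothetical and is not part of the notebook.

```python
import numpy as np


def scores_to_labels(raw_scores, lowest, highest):
    """Round regression outputs to integers and clip them to the valid label range."""
    labels = np.rint(np.asarray(raw_scores, dtype=float)).astype(int).ravel()
    return np.clip(labels, lowest, highest)


# made-up network outputs with the (n, 1) shape that Keras returns
raw = np.array([[3.4], [5.49], [6.5], [7.5], [8.7]])

labels = scores_to_labels(raw, lowest=4, highest=8)
print(labels)
```

Note that NumPy rounds exact halves to the nearest even integer, so 6.5 becomes 6 while 7.5 becomes 8.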

In [None]:
# making the confusion matrix
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

Discussion: the ANN reached 62% accuracy on the test set, which is below the 67% of the random forest classifier.