# Quality of red wine

**Dataset:** https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009

**Notebook's author: Danh Nguyễn**

### Table of Contents

* [1. Problem defining](#chapter1)
* [2. Data collecting](#chapter2)
* [3. Data processing](#chapter3)
    * [3.1 Prepare a dataframe](#3.1)
    * [3.2 Handle null values](#3.2)
    * [3.3 Discretize dataset](#3.3)
    * [3.4 Split train and test set](#3.4)
* [4. Problem modeling](#chapter4)
    * [4.1 Overview data and train set](#4.1)
    * [4.2 Visualize data and train set](#4.2)
    * [4.3 Bayesian network](#4.3)
* [5. Training and Predicting](#chapter5)
    * [5.1 Build Bayesian model](#5.1)
    * [5.2 Predict by Bayesian model](#5.2)
    * [5.3 Performance metrics](#5.3)
    * [5.4 Adjust structures](#5.4)
    * [5.5 Another model](#5.5)
* [Future work](#futurework)
* [Reference](#reference)

# 1. Problem defining <a class="anchor" id="chapter1"></a>

Predict the quality of red wine from its properties using a Bayesian model.

The dataset has eleven input features (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, and alcohol) and one target, quality. Quality has 6 classes, from 3 to 8. We discretize the values of each input feature into 5 classes and predict the quality of wine with the Bayesian model.

In addition, a linear regression model is used to predict the quality of wine, and its results are compared with those of the Bayesian model.

# 2. Data collecting <a class="anchor" id="chapter2"></a>

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
# import packages
# Plot and image packages
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import Image
from IPython import display
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
!pip install pgmpy

In [None]:
data = pd.read_csv("../input/red-wine-quality-cortez-et-al-2009/winequality-red.csv")
data.head()

# 3. Data processing <a class="anchor" id="chapter3"></a>

## 3.1 Preparing a data frame <a class="anchor" id="3.1"></a>

In [None]:
#pd.read_csv already returns a DataFrame, so just display it
data

##### **Note:** All features in this dataset are relevant to determining the quality of wine; therefore, all of the data is used.

## 3.2 Handle null values <a class="anchor" id="3.2"></a>

In [None]:
#Check NaN and fill NaN
#Check NaN:
data.isna().sum()

##### **Note**: There are no NaN values, so nothing needs to be filled or dropped.

## 3.3 Discretize data <a class="anchor" id="3.3"></a>

In [None]:
#Discretize the data into 5 equal-width classes
def discretize(feature, nclass):
    """Map a numeric column to equal-width classes 1..nclass (modifies `feature` in place)."""
    min_val = feature.min()
    max_val = feature.max()
    interval = (max_val - min_val)/nclass
    classes = (feature - min_val)//interval + 1
    #The maximum value lands in class nclass + 1, so fold it back into the last class
    feature[:] = classes.clip(upper=nclass)

discreted5_data = data.copy()
for col in discreted5_data.columns.drop('quality'):
    column = discreted5_data[col].copy()
    discretize(column, 5)  #modifies `column` in place
    discreted5_data[col] = column
discreted5_data
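
For comparison, equal-width binning can also be sketched with pandas' built-in `pd.cut`. The toy column below is hypothetical and is not taken from the wine dataset; the sketch only illustrates the binning rule, and it is not the method used in this notebook.

```python
import pandas as pd

# Toy column (hypothetical values, not taken from the wine dataset)
toy = pd.Series([0.0, 1.0, 2.5, 4.9, 5.0, 10.0])

# bins=5 splits the range [min, max] into 5 equal-width intervals.
# right=False makes the intervals left-closed, [a, b); pandas widens the last
# edge slightly so that the maximum still falls into the top interval.
# labels=False returns integer codes 0..4, so add 1 to get classes 1..5.
toy_classes = pd.cut(toy, bins=5, labels=False, right=False) + 1
print(toy_classes.tolist())
```

Values exactly on an interval boundary can be sensitive to floating-point rounding, so small differences from a hand-written floor-division rule are possible on real data.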

## 3.4 Splitting train and test set <a class="anchor" id="3.4"></a>

In [None]:
#Split data into train and test set (80% train, 20% test)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(discreted5_data.drop(['quality'], axis = 1), discreted5_data['quality'], test_size = 0.2, random_state = 42, stratify = discreted5_data['quality'])
train_data = pd.concat([y_train, X_train], axis=1)
train_data

##### **Note**: The train set has 1279 rows (80% of the 1599 samples) and 12 columns. All values except quality are split into 5 classes per feature.
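
As a small illustration of what `stratify` does in `train_test_split`, the sketch below uses a hypothetical toy dataset with an 80/20 class ratio and checks that both splits keep that ratio. The names `toy_X` and `toy_y` are made up for this example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 80 of class 0 and 20 of class 1
toy_X = pd.DataFrame({"feature": np.arange(100)})
toy_y = pd.Series([0] * 80 + [1] * 20, name="label")

# stratify=toy_y keeps the class proportions the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

print("train:", y_tr.value_counts().to_dict())
print("test: ", y_te.value_counts().to_dict())
```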

# 4. Problem modeling <a class="anchor" id="chapter4"></a>

## 4.1 Overview data and train set <a class="anchor" id="4.1"></a>

In [None]:
#Overview the initial dataset
data.describe()

In [None]:
#Overview the discretized data
discreted5_data.describe()

In [None]:
# Overview the train set
train_data.describe()

## 4.2 Visualize dataset and train set <a class="anchor" id="4.2"></a>

In [None]:
#Write function for plotting distribution
def plot_distribution(dataset, titlename):
    fig = plt.figure(figsize = (18, 10))
    fig.suptitle(titlename, fontsize=24)
    fig.subplots_adjust(top=.85, wspace=.6, hspace=.6)
    #One subplot per column on a 3x4 grid (the dataset has 12 columns)
    for i, col in enumerate(dataset.columns):
        ax = fig.add_subplot(3, 4, i+1)
        #histplot with a KDE curve replaces the deprecated sns.distplot (needs seaborn >= 0.11)
        sns.histplot(dataset[col], kde=True, stat="density", color='red', ax=ax)
        ax.set_xlabel(col)
        ax.set_ylabel("Density")
        ax.tick_params(axis='both', which='major', labelsize=8.5)
    plt.show()
#Plotting distribution of the initial dataset
plot_distribution(data,"Distribution of features in the initial dataset")

In [None]:
#Plotting the distribution of the train dataset
plot_distribution(train_data, "Distribution of features in the train dataset")

##### **Insight:** Some features are not normally distributed; a log transformation could bring them closer to a normal distribution.
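
One way to put a number on this is the sample skewness, available as `Series.skew()` in pandas. The sketch below uses synthetic right-skewed data (an assumption for illustration, not the wine features) and compares the skewness before and after a `log1p` transformation.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample (log-normal), standing in for a skewed feature
rng = np.random.default_rng(0)
toy_skewed = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=1000))

skew_before = toy_skewed.skew()            # strongly positive for this sample
skew_after = np.log1p(toy_skewed).skew()   # closer to 0 after the transform

print(f"skewness before: {skew_before:.2f}, after log1p: {skew_after:.2f}")
```

A skewness near 0 indicates a roughly symmetric distribution, which is a necessary but not sufficient sign of normality.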

In [None]:
#Plotting heatmap of the initial dataset to see correlation between features
plt.figure(figsize=(15,8))
sns.heatmap(data.corr(), annot=True, linewidths=2)

In [None]:
#Plotting the top 10 highest correlated features with respect to wine quality
plt.figure(figsize=(15,15))
data.corr().quality.apply(lambda x: abs(x)).sort_values(ascending=False).iloc[1:11][::-1].plot(kind='barh',color='pink') 
plt.title("Top 10 highly correlated properties with Quality", size=20, pad=26)
plt.xlabel("Correlation coefficient")
plt.ylabel("Property")

In [None]:
# Visualize the target (quality)
sns.countplot(x=train_data.quality)

##### **Note:** Quality levels 5 and 6 are more frequent than the other levels => imbalanced data

##### => A resampling method could be used to mitigate the imbalance
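
A minimal sketch of one resampling option, random oversampling with `sklearn.utils.resample`, is shown below. The toy frame and its values are hypothetical; on real data the resampling would normally be applied to the training set only.

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical toy training set: quality 5 is frequent, quality 8 is rare
toy_train = pd.DataFrame({
    "alcohol": [1, 2, 3, 4, 5, 1, 2, 3],
    "quality": [5, 5, 5, 5, 5, 5, 8, 8],
})

majority = toy_train[toy_train["quality"] == 5]
minority = toy_train[toy_train["quality"] == 8]

# Draw minority rows with replacement until both classes have the same size
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)
balanced = pd.concat([majority, minority_upsampled])

print(balanced["quality"].value_counts().to_dict())
```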

## 4.3 Bayesian network <a class="anchor" id="4.3"></a>

![Bayesian network for wine quality](https://media-exp3.licdn.com/dms/image/C4D12AQGIueUur0CPig/article-inline_image-shrink_1000_1488/0/1566703342795?e=1632355200&v=beta&t=ds0DR6_Y8VMg2AMKSDeH9aum5nKoQhubOq_-MXTpPZ0)

#### **Explanation:** Based on the network in reference [2] and the correlations in the heatmap, we obtain this Bayesian network for wine quality.
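
The structure can also be inspected without any modeling library. The sketch below writes the network as the list of (parent, child) edges used to build the model in section 5.1, checks that it is acyclic with the standard-library `graphlib` (Python 3.9+), and counts how many parent configurations the `quality` node has if every parent takes up to 5 classes. The helper names are illustrative only.

```python
from graphlib import TopologicalSorter

# (parent, child) edges, as used to build the model in section 5.1
edges = [
    ("volatile acidity", "fixed acidity"), ("density", "fixed acidity"),
    ("fixed acidity", "citric acid"), ("pH", "citric acid"),
    ("total sulfur dioxide", "free sulfur dioxide"),
    ("residual sugar", "quality"), ("chlorides", "quality"),
    ("free sulfur dioxide", "quality"), ("sulphates", "quality"),
    ("alcohol", "quality"), ("citric acid", "quality"),
]

# Map every node to the list of its parents
parents = {}
for parent, child in edges:
    parents.setdefault(parent, [])
    parents.setdefault(child, []).append(parent)

# static_order() raises graphlib.CycleError if the graph is not a DAG
order = list(TopologicalSorter(parents).static_order())

# With up to 5 classes per parent, quality's CPD has up to 5**6 columns
n_states = 5
n_parent_configs = n_states ** len(parents["quality"])

print("parents of quality:", parents["quality"])
print("parent configurations for quality:", n_parent_configs)
```

With six parents, the `quality` node has up to 15625 parent configurations, far more than the number of training rows, so many configurations are never observed during training.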

# 5. Training and predicting <a class="anchor" id="chapter5"></a>

## 5.1 Build Bayesian model <a class="anchor" id="5.1"></a>

In [None]:
#Build Bayesian model from the network
from pgmpy.models import BayesianModel
model1 = BayesianModel([('volatile acidity', 'fixed acidity'), ('density','fixed acidity'), ('fixed acidity','citric acid'), ('pH','citric acid'), ('total sulfur dioxide','free sulfur dioxide'), ('residual sugar','quality'), ('chlorides','quality'), ('free sulfur dioxide','quality'), ('sulphates','quality'), ('alcohol','quality'), ('citric acid','quality')])

In [None]:
#Estimate the CPDs by maximum likelihood and add them to the nodes
from pgmpy.estimators import MaximumLikelihoodEstimator
model1.fit(train_data, estimator = MaximumLikelihoodEstimator)
def get_and_add_cpds(model, df):
    #fit() already stores the estimated CPDs on the model; this re-adds them explicitly
    for col in df.columns:
        model.add_cpds(model.get_cpds(col))
get_and_add_cpds(model1, train_data)

In [None]:
#Print the conditional probability distributions (CPDs) of chlorides, pH, and sulphates
print(model1.get_cpds('chlorides'))
print(model1.get_cpds('pH'))
print(model1.get_cpds('sulphates'))
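
For discrete data, maximum likelihood estimation of a CPD amounts to counting co-occurrences and normalizing within each parent state. The sketch below reproduces that idea with `pd.crosstab` on a hypothetical toy frame with a single parent; it is an illustration of the principle, not a replacement for the pgmpy estimator.

```python
import pandas as pd

# Hypothetical toy data: one parent (alcohol class) and the target (quality)
toy = pd.DataFrame({
    "alcohol": [1, 1, 1, 2, 2, 2, 2, 2],
    "quality": [5, 5, 6, 5, 6, 6, 6, 7],
})

# P(quality | alcohol): count pairs, then normalize each alcohol column to sum to 1
toy_cpd = pd.crosstab(toy["quality"], toy["alcohol"], normalize="columns")
print(toy_cpd)
```

Each column of the table is a probability distribution over quality for one alcohol class, which matches the layout of the tables printed by `get_cpds`.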

## 5.2 Predict quality by Bayesian model <a class="anchor" id="5.2"></a>

In [None]:
#Predict quality by using Bayesian model
y_pred = model1.predict(X_test)

## 5.3 Performance metrics (precision & recall & confusion matrix) <a class="anchor" id="5.3"></a>

#### **Precision**: 

#### Precision is the ratio of correctly predicted positive observations to the total predicted positive observations. [3]

#### Precision = TP/(TP+FP)

#### **Recall**: 

#### Recall is the ratio of correctly predicted positive observations to all observations in the actual class. [3]

#### Recall = TP/(TP+FN)

#### **F1 score**: 

#### The F1 score is the harmonic mean of Precision and Recall. [3]

#### F1 Score = 2*(Recall * Precision) / (Recall + Precision)

#### The **root-mean-square deviation** (RMSD) or **root-mean-square error** (RMSE) is a frequently used measure of the differences between values (sample or population values) predicted by a model or an estimator and the values observed. [5]
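
The formulas above can be checked by hand on a small example. The labels below are hypothetical, and class 6 is treated as the positive class; in the multi-class report each class takes that role in turn.

```python
import numpy as np

# Hypothetical true and predicted quality labels
toy_true = np.array([5, 5, 6, 6, 6, 7])
toy_pred = np.array([5, 6, 6, 6, 5, 6])

positive = 6  # treat quality 6 as the positive class
tp = np.sum((toy_pred == positive) & (toy_true == positive))  # predicted 6, truly 6
fp = np.sum((toy_pred == positive) & (toy_true != positive))  # predicted 6, truly other
fn = np.sum((toy_pred != positive) & (toy_true == positive))  # predicted other, truly 6

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * (recall * precision) / (recall + precision)

# RMSE treats the labels as numbers, so it also reflects how far off a prediction is
rmse = np.sqrt(np.mean((toy_true - toy_pred) ** 2))

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} rmse={rmse:.3f}")
```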

In [None]:
#Evaluating model by using classification report and confusion matrix
from sklearn.metrics import classification_report, confusion_matrix, mean_squared_error
def report_efficiency(true, pred):
    #Flatten Series/DataFrame/2-D inputs into 1-D integer label arrays
    true = np.ravel(true).astype(int)
    pred = np.ravel(pred).astype(int)
    print(classification_report(true, pred))
    #Take labels from both arrays so the matrix shape always matches the index
    labels = np.union1d(true, pred)
    cf_matrix = confusion_matrix(true, pred, labels=labels)
    df_cfmatrix = pd.DataFrame(cf_matrix, index = labels, columns = labels)
    print(df_cfmatrix)
    print("RMSE: " + str(mean_squared_error(true, pred)**0.5))

report_efficiency(y_test, y_pred)

#### **Accuracy = 0.56**
#### **Root-mean-square error: 0.955**

## 5.4 Adjust some data structures <a class="anchor" id="5.4"></a>

#### **Note**: The normal distribution is the most important probability distribution in statistics because it fits many natural phenomena. For example, heights, blood pressure, measurement error, and IQ scores follow the normal distribution. It is also known as the Gaussian distribution and the bell curve. [4]

#### => Use log transformation to make data normally distributed.


In [None]:
#Copy to a new dataframe
adjusted_data = data.copy()

A log1p transformation, log(1 + x), is used to bring skewed features closer to a normal distribution. A plain log transformation is avoided because the majority of values are close to zero, where log(x) diverges towards negative infinity.
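
The difference between the two transformations near zero can be seen on a few hypothetical values:

```python
import numpy as np

# Hypothetical values, including an exact zero and values close to zero
toy_values = np.array([0.0, 0.01, 0.1, 1.0])

# np.log(0) is -inf (NumPy warns about the division by zero, silenced here)
with np.errstate(divide="ignore"):
    plain_log = np.log(toy_values)

# np.log1p(x) computes log(1 + x), which stays finite and maps 0 to 0
shifted_log = np.log1p(toy_values)

print("log:  ", plain_log)
print("log1p:", shifted_log)
```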

In [None]:
#Log1p transformation
def log1p_transform(col):
    return np.log1p(col)

for col in adjusted_data.columns:
    #quality is the target; density and pH are left untransformed
    if col not in ('quality', 'density', 'pH'):
        adjusted_data[col] = log1p_transform(adjusted_data[col])
plot_distribution(adjusted_data, "Distribution of features in the adjusted dataset")

##### **Insight**: Almost all of the skewed features are now approximately normally distributed.

In [None]:
#Discretize the adjusted dataset
for col in adjusted_data.columns.drop('quality'):
    column = adjusted_data[col].copy()
    discretize(column, 5)  #modifies `column` in place
    adjusted_data[col] = column
adjusted_data

In [None]:
#Split the adjusted dataset into train and test set
train2, test2 = train_test_split (adjusted_data, test_size = 0.2, random_state = 42)
X_test2, y_test2 = test2.drop(['quality'], axis = 1), test2[['quality']]

In [None]:
#Build Bayesian model from the new training data
model2 = BayesianModel([('volatile acidity', 'fixed acidity'), ('density','fixed acidity'), ('fixed acidity','citric acid'), ('pH','citric acid'), ('total sulfur dioxide','free sulfur dioxide'), ('residual sugar','quality'), ('chlorides','quality'), ('free sulfur dioxide','quality'), ('sulphates','quality'), ('alcohol','quality'), ('citric acid','quality')])
model2.fit(train2, estimator = MaximumLikelihoodEstimator)
get_and_add_cpds(model2, train2)

In [None]:
#Predict based on the new training data
y_pred2 = model2.predict(X_test2)

In [None]:
report_efficiency(y_test2, y_pred2)

#### **Accuracy: 0.42**
#### **Root-mean-square error: 1.39**

##### **Insight**: Adjusting the data with the log1p transformation does not improve the score, possibly because it changes how the values fall into the discrete classes.

## 5.5 Use another model to predict (Linear Regression) <a class="anchor" id="5.5"></a>

* Use log1p transformation to make the skewed dataset normally distributed.
* Use linear regression model.
* Round each predicted value to its nearest integer.
* Evaluate with the classification report, confusion matrix, and root-mean-square error.
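
The steps above can be sketched end to end on synthetic data. Everything in this example (the feature matrix, the integer target, and the variable names) is made up for illustration and does not reproduce the notebook's results.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the wine data: 3 features and an integer score in 3..8
rng = np.random.default_rng(0)
toy_X = rng.uniform(0, 1, size=(200, 3))
noise = rng.normal(0, 0.5, size=200)
toy_y = np.clip(np.rint(3 + 5 * toy_X[:, 0] + noise), 3, 8).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42
)

# Steps 1-2: log1p-transform the features, then fit a linear regression
reg = LinearRegression().fit(np.log1p(X_tr), y_tr)
raw_pred = reg.predict(np.log1p(X_te))

# Step 3: round to the nearest integer (halves round up)
rounded_pred = np.floor(raw_pred + 0.5).astype(int)

# Step 4: evaluate the rounded predictions as class labels
toy_accuracy = np.mean(rounded_pred == y_te)
toy_rmse = np.sqrt(np.mean((rounded_pred - y_te) ** 2))
print(f"accuracy={toy_accuracy:.2f} rmse={toy_rmse:.2f}")
```

Because regression respects the ordering of the quality scores, its mistakes tend to be off by one class, which keeps the RMSE low even when the accuracy is moderate.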

In [None]:
#Copy new dataset from the initial dataset and use log1p transform
new_data = data.copy()
for col in new_data.columns:
    #quality is the target; density and pH are left untransformed
    if col not in ('quality', 'density', 'pH'):
        new_data[col] = log1p_transform(new_data[col])
plot_distribution(new_data, "Distribution of features in the new dataset")

In [None]:
#Split train and test set
nX_train, nX_test,ny_train, ny_test = train_test_split(new_data.drop(['quality'], axis = 1), new_data[['quality']], test_size = 0.2, random_state = 42, stratify = new_data['quality'])

In [None]:
#Build linear regression model
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
LR = Pipeline([
        ('lr',  LinearRegression())
 ])  

LR.fit(nX_train,ny_train)

In [None]:
#Predict by linear regression model
ny_pred = LR.predict(nX_test)

In [None]:
#Change the predicted value into the nearest integer (halves round up)
ny_pred = np.floor(ny_pred + 0.5)

In [None]:
#Evaluation
report_efficiency(ny_test, ny_pred)

#### **Accuracy: 0.6**
#### **Root-mean-square error:  0.69**

##### **Insight**: The linear regression model gives better predictions than the Bayesian model, likely because the original features are continuous rather than discrete.

# Future work <a class="anchor" id="futurework"></a>

* Use resampling methods (such as sklearn.utils.resample, stratified K-Fold, SMOTE) to handle imbalanced data.
* Use other models such as KNN, XGBoost, and LightGBM to improve the results.
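
As a starting point for these ideas, the sketch below combines stratified K-fold cross-validation with a KNN classifier from scikit-learn. The data is synthetic and imbalanced by construction, and all parameter values are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic imbalanced 3-class problem (roughly 60% / 30% / 10%)
toy_X, toy_y = make_classification(
    n_samples=300, n_features=11, n_informative=4, n_classes=3,
    weights=[0.6, 0.3, 0.1], random_state=0,
)

# Stratified folds keep the class proportions similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
knn = KNeighborsClassifier(n_neighbors=5)
fold_scores = cross_val_score(knn, toy_X, toy_y, cv=cv, scoring="accuracy")

print("accuracy per fold:", fold_scores.round(2))
print("mean accuracy:", fold_scores.mean().round(2))
```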

# Reference <a class="anchor" id="reference"></a>

#### [1] MaSSP pipeline notebook, by *Thanh Vuong*

#### [2] Bayesian Networks with Continuous Distributions - Regression model to describe wine quality, by *Robson Fernandes* (via LinkedIn)

#### [3] Accuracy, Precision, Recall & F1 Score: Interpretation of Performance Measures, by *Renuka Joshi*

#### [4] Normal Distribution in Statistics, by *Jim Frost*

#### [5] Root-mean-square deviation, *via Wikipedia*