The problem we have to solve is to **predict the quality of wine** with the help of a *classification model*. A dataset is provided for the training and testing phases. Let's start by looking at the dataset.

In [None]:
'''Starting off with Importing Important Libraries '''
import pandas as pd
import numpy as np
import warnings
import matplotlib.pyplot as plt
import seaborn as sb
import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff

In [None]:
warnings.filterwarnings('ignore')

In [None]:
# Reading and then loading the dataset onto the notebook.
ds = pd.read_csv('../input/red-wine-quality-cortez-et-al-2009/winequality-red.csv')
ds

# EXPLORATORY DATA ANALYSIS
In this dataset for quality prediction, there are **11 attributes and 1 label or target class.** We first look at the target variable to see how many classes (unique values) it contains.

In [None]:
print(f"The number of unique classes in the target variable is {ds.quality.nunique()}.\n"
      f"Those unique target values are {ds.quality.unique()}")

ds.quality.value_counts()

We see that the quality of red wine takes *6 unique values*, from 3 up to 8, *3 being the lowest quality and 8 the highest.* The majority of the wine samples are of medium quality (score 5 or 6). The remaining samples are either high quality (score 7 or 8) or low quality (score 3 or 4).

**The dataset is heavily imbalanced.** Training on it as-is *can result in a model that overfits to the majority classes*: it learns them better than the others and tends to predict them for most samples.
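
To make the imbalance concern concrete, the sketch below computes the accuracy of a "model" that always predicts the most frequent class. The labels are hypothetical toy values chosen for illustration, not the real wine counts; in the notebook one would use `ds.quality` instead.

```python
from collections import Counter

# Hypothetical toy labels (NOT the real wine data): class 5 dominates.
labels = [5] * 70 + [6] * 20 + [7] * 7 + [4] * 3

counts = Counter(labels)
majority_class, majority_count = counts.most_common(1)[0]

# A classifier that always predicts the majority class gets this accuracy for free.
baseline_accuracy = majority_count / len(labels)

print(counts)
print(f"Always predicting {majority_class} gives accuracy {baseline_accuracy:.2f}")
```

Any real model should be judged against this kind of baseline, which is one reason accuracy alone can be misleading on imbalanced data.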

We now look at the datatypes of the attributes and the target variable. We also check whether any null values are present in the dataset.

In [None]:
ds.info()

From the above results, we see that the **dataset contains no null values** and that the *attributes are of float datatype*. The target variable, as we saw above, has int datatype.

We can also visualise the presence of null values in data using a heatmap.

In [None]:
sb.heatmap(ds.isnull())

Now, we use descriptive statistics on the dataset and try to get some insights from them.

In [None]:
ds.iloc[:,0:-1].describe() # excluding the target class

From the above values, we can draw some insights for the data.

1. The standard deviation of the features 'free sulfur dioxide' and 'total sulfur dioxide' is high, so the data in these features is widely spread.
2. The difference between the mean and the median is large in the 'total sulfur dioxide' feature, so the data in this column is skewed.
3. From the minimum values, we can see that no negative values are present in the dataset.
4. Some outliers may be present in the features 'residual sugar', 'free sulfur dioxide' and 'total sulfur dioxide'.

We will now look at each feature individually and see what it tells us about the problem.

Plotting the boxplot and distplot of the 'fixed acidity' feature to visualise the data spread in the column.

In [None]:
plt.figure(figsize = (12,5))
plt.subplot(1, 2, 1)
sb.boxplot(x = ds['fixed acidity'], color = 'yellow') # pass the column explicitly as x
plt.title("Box Plot for Fixed Acidity")

plt.subplot(1, 2, 2)
sb.distplot(ds['fixed acidity'])
plt.title("Distribution Plot for Fixed Acidity")

plt.tight_layout(pad = 4)
plt.show()

From the above plots, we see that the 'fixed acidity' feature has some outliers. The data is also slightly skewed, as seen in the distplot.

Similarly, we can plot the boxplots of the remaining features and check for the presence of outliers.

In [None]:
clist = ['volatile acidity','citric acid','residual sugar','chlorides','free sulfur dioxide','total sulfur dioxide',
         'density','pH','sulphates', 'alcohol']

plt.figure(figsize = (16,14))
for i in range(0, len(clist)):
    plt.subplot(4,3, i+1)
    sb.boxplot(x = ds[clist[i]], color = 'yellow') # pass the column explicitly as x
print("BoxPlots of the features:")
plt.show()

The above plots indicate that *there are some outliers present in all the features of the dataset.* They need to be removed so that we can improve the learning of the model.

Now, plotting the distplot for the other features to look at their distribution of data:

In [None]:
plt.figure(figsize = (16, 14))
for i in range(0, len(clist)):
    plt.subplot(4,3, i+1)
    sb.distplot(ds[clist[i]])
print("Distplots of the features:")
plt.show()

From the above plots, we can see that **'residual sugar', 'chlorides', 'sulphates', 'free sulfur dioxide' and 'total sulfur dioxide' are positively skewed (right skewed).** *Some skewness is present in the other features as well.* Removing the outliers can reduce this skewness. If skewness is still present afterwards, we use a log or Box-Cox transform to reduce it.

Now, we remove the outliers from the dataset using the z-score:

In [None]:
from scipy.stats import zscore
zabs = np.abs(zscore(ds)) # calculating the absolute z-score

# Removing the outliers
ds_new = ds[(zabs < 3).all(axis = 1)].copy() # copy so that later column assignments do not act on a view of ds
ds_new.shape

From the shape of the new dataset, we see that *1451 of the 1599 rows are left after removing outliers.* So a total of **148 rows have been removed**, *which is around 9.26% of the total data.*

After removing the outliers, we check whether the skewness has been reduced by comparing the old and the new data.
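
As a rough illustration of the filter used above, the sketch below applies the same rule (keep values whose absolute z-score is below 3) to a single hypothetical toy column. The values are made up for illustration. The z-score is computed with the population standard deviation, which is also the default of `scipy.stats.zscore`.

```python
import numpy as np

# Hypothetical toy column: 19 ordinary values and one extreme value.
values = np.array([10.0] * 10 + [11.0] * 9 + [100.0])

# z-score = (value - mean) / standard deviation
z = (values - values.mean()) / values.std()

# Keep only the values within 3 standard deviations of the mean.
keep = np.abs(z) < 3
filtered = values[keep]

print(f"kept {filtered.size} of {values.size} values")
```

In the notebook the rule is applied row-wise: a row is kept only if every column passes the test, which is what `.all(axis = 1)` expresses.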

In [None]:
clist = ['fixed acidity', 'volatile acidity','citric acid','residual sugar','chlorides','free sulfur dioxide','total sulfur dioxide',
         'density','pH','sulphates', 'alcohol']
for i in range(0, len(clist)):
    print(f"The old vs new skewness for feature {clist[i]} is: {ds[clist[i]].skew()} : {ds_new[clist[i]].skew()}")

From the above values, we can see that the *skewness is significantly reduced after removing the outliers.* Some skewness is still present in a few columns. **We reduce the remaining skewness in those columns using a log transform.**
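
To show why a log transform helps with right-skewed data, here is a minimal sketch on synthetic values. The `skewness` helper is a hypothetical name and uses the simple moment-based definition; pandas' `.skew()` applies a bias correction, so its numbers differ slightly.

```python
import numpy as np

def skewness(x):
    """Moment-based skewness: mean of cubed deviations divided by std cubed."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return (centered ** 3).mean() / (x.std() ** 3)

# Synthetic right-skewed data: exponentials of evenly spaced numbers.
raw = np.exp(np.linspace(0.0, 3.0, 50))
logged = np.log(raw)  # the log undoes the exponential, leaving symmetric data

print(f"skewness before log: {skewness(raw):.3f}")
print(f"skewness after log:  {skewness(logged):.3f}")
```

The long right tail is compressed by the logarithm, so the transformed values are close to symmetric.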

In [None]:
nlist = ['fixed acidity','residual sugar','chlorides','free sulfur dioxide','total sulfur dioxide', 'sulphates', 'alcohol']

for i in range(0, len(nlist)):
    ds_new[nlist[i]] = np.log(ds_new[nlist[i]])
    
print("Skewness for new dataset after log transform:")
ds_new.skew()

Some *skewness is still present in the 'residual sugar' column*. So we use a **Box-Cox transform on that feature.**
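
For reference, the one-parameter Box-Cox transform of a strictly positive value $x$ with parameter $\lambda$ is:

$$
y(\lambda) =
\begin{cases}
\dfrac{x^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\
\ln x, & \lambda = 0
\end{cases}
$$

When no `lmbda` argument is given, `scipy.stats.boxcox` estimates $\lambda$ by maximum likelihood and returns the transformed data together with the fitted $\lambda$, which is why the code takes element `[0]` of the result. The transform is only defined for positive inputs.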

In [None]:
from scipy.stats import boxcox
ds_new['residual sugar'] = boxcox(ds_new['residual sugar'])[0]
print("Skewness for new dataset after boxcox transform:")
ds_new.skew()

The **skewness of the dataset is now treated**. We can visualise it again using the distplot.

In [None]:
plt.figure(figsize = (16, 14))
for i in range(0, len(clist)):
    plt.subplot(4,3, i+1)
    sb.distplot(ds_new[clist[i]])
print("Distplots of the features:")
plt.show()

We can see that the *features are more normally distributed than before and the skewness is reduced.*

Checking the **Correlation between the attributes and target class.**

In [None]:
plt.figure(figsize = (10, 6))
sb.heatmap(ds_new.corr(), annot = True)

From the heatmap, we can see that the *'quality' of the wine samples is most strongly correlated with the 'alcohol' and 'sulphates' content.* 'citric acid' and 'fixed acidity' also show some correlation with quality. *'free sulfur dioxide' and 'total sulfur dioxide' are highly correlated with each other.* Similarly, 'citric acid' and 'density' are highly correlated with 'fixed acidity'.

We can visualise the relation between features and target using the scatter plot.

In [None]:
plt.figure(figsize = (10, 6))
sb.scatterplot(x = ds_new['alcohol'], y = ds_new['quality'], hue = ds_new['quality'],
               size = ds_new['quality'])

We can see that high-quality wine has an alcohol value above 2.28, whereas for low-quality wine the alcohol value starts from 2.19. Note that these are log-transformed values, since 'alcohol' was log-transformed above. Similarly, we can draw scatter plots for the other features to visualise their relation to the target variable.

Plotting a scatter plot to see the relationship between 'free sulfur dioxide' and 'total sulfur dioxide' and how it varies with the quality of wine.
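
Because the 'alcohol' column was transformed with `np.log` (the natural logarithm) earlier, the values 2.28 and 2.19 read off the plot are on the log scale. A small sketch of converting them back to the original units:

```python
import math

# Log-scale thresholds quoted in the text above.
log_high_quality = 2.28
log_low_quality = 2.19

# exp() inverts the natural logarithm applied by np.log.
alcohol_high = math.exp(log_high_quality)
alcohol_low = math.exp(log_low_quality)

print(f"high-quality threshold in original units: {alcohol_high:.2f}")
print(f"low-quality starting point in original units: {alcohol_low:.2f}")
```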

In [None]:
plt.figure(figsize = (10, 6))
sb.scatterplot(x = ds_new['free sulfur dioxide'], y = ds_new['total sulfur dioxide'], hue = ds_new['quality'],
               size = ds_new['quality'])

We see that as *free sulfur dioxide increases, total sulfur dioxide also increases.*

# Data Imbalance:
Now that we have cleaned the data and looked at the correlation between the features and target, we visualise the data imbalance.

In [None]:
fig = px.histogram(ds_new, x = 'quality', color = 'quality', opacity = 0.8, nbins = 15)
fig.show()

As we can see, the **data is heavily imbalanced** and *this can cause overfitting of the model.* To avoid this and to improve performance and prediction, *we balance the dataset.*

Since we only have 47 wine samples with a quality score of 4 and only 10 samples with a quality score of 8, exact prediction of these values will be hard. So, we *set an arbitrary cutoff for the dependent variable (wine quality): scores of 7 or higher are classified as 'good/1' and the remainder as 'not good/0'*. That is, we **replace quality scores less than 7 with 0 and scores equal to or greater than 7 with 1.**

In [None]:
# Quality scores below 7 (4, 5 and 6) become 0; scores of 7 and 8 become 1.
# Assigning the whole column avoids a chained in-place replace, which newer pandas versions may not apply to ds_new.
ds_new['quality'] = (ds_new['quality'] >= 7).astype(int)

print(f"The number of unique classes in the target variable after replacing "
      f"is {ds_new.quality.nunique()}.\nThe new unique target values are {ds_new.quality.unique()}")

ds_new.quality.value_counts()

Now, we **balance the dataset using the SMOTE** class. For that, we split the new dataset into x and y variables, x being the features and y being the target.

In [None]:
# Splitting into attributes and target
x = ds_new.iloc[:, 0:-1]
y = ds_new.iloc[:, -1]

# Data balancing
from imblearn.over_sampling import SMOTE
oversample = SMOTE()
x_new, y_new = oversample.fit_resample(x, y)

# New Data Visualization
good = y_new[y_new == 1] # samples labelled as good quality (1)
bad = y_new[y_new == 0] # samples labelled as poor quality (0)

fig = go.Figure()
fig.add_traces(go.Histogram(x = good, name='Good Quality', marker_color='purple', opacity=0.9))
fig.add_traces(go.Histogram(x = bad, name='Poor Quality', marker_color='thistle', opacity=0.9))
fig.update_layout(title_text="Red Wine Quality Score", xaxis_title_text='Good or Poor', yaxis_title_text='Value Count')
fig.show()

From the above plot, we can see that the *number of samples for good wine has been increased using the SMOTE algorithm.* As a result, **the data imbalance has been removed** and we now have 1250 samples each for good and poor quality wine. Now, *we can build a prediction model with less risk of it overfitting to the majority class.*
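
SMOTE creates new minority samples by interpolating between an existing minority sample and one of its nearest minority-class neighbours. The sketch below is a deliberately simplified illustration of that idea using only the single nearest neighbour and hypothetical toy points; it is not the `imblearn` implementation, and `synthesize` is a made-up helper name.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical minority-class points in a 2-feature space (toy data).
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.5]])

def synthesize(points, n_new, rng):
    """Create n_new synthetic points, each lying on the segment between a
    random minority point and its nearest minority neighbour."""
    new_points = []
    for _ in range(n_new):
        i = rng.integers(len(points))
        dists = np.linalg.norm(points - points[i], axis=1)
        dists[i] = np.inf              # exclude the point itself
        j = int(np.argmin(dists))      # index of the nearest neighbour
        lam = rng.random()             # interpolation factor in [0, 1)
        new_points.append(points[i] + lam * (points[j] - points[i]))
    return np.array(new_points)

synthetic = synthesize(minority, n_new=5, rng=rng)
print(synthetic)
```

One design consideration: when resampling is done before the train/test split, as in this notebook, synthetic points derived from samples that end up in the test set can also appear in the training set. Resampling only the training portion is a common way to avoid that, and passing a `random_state` to `SMOTE` makes the result reproducible.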

## Scaling:
During the exploratory data analysis, we found that all the features are of float datatype, and that there is not a large difference between the maximum and minimum values of the features across the dataset. So, Min-Max scaling is skipped for this data.
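
Whether scaling can be skipped depends on the feature ranges rather than on the datatype, and distance-based models such as KNN and SVC are the ones most affected. As a hedged illustration on hypothetical toy values (not the wine features), the sketch below shows how a feature with a large range dominates a Euclidean distance until the columns are standardised.

```python
import numpy as np

# Hypothetical toy features: column 0 spans 2 units, column 1 spans 200.
X_toy = np.array([[1.0, 100.0],
                  [2.0, 300.0],
                  [3.0, 200.0]])

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return float(np.sqrt(((a - b) ** 2).sum()))

# Standardise each column to zero mean and unit variance.
X_scaled = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

raw_dist = euclidean(X_toy[0], X_toy[1])           # dominated by column 1
scaled_dist = euclidean(X_scaled[0], X_scaled[1])  # both columns contribute

print(f"raw distance: {raw_dist:.2f}, standardised distance: {scaled_dist:.2f}")
```

Running `ds_new.describe()` on the transformed data is a quick way to check how different the ranges in this notebook actually are.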

# MODEL BUILDING:
## Best Random State:
In order to achieve high performance and accuracy, we first look for the random state at which the model gives the best score. For that, we write a small loop which returns the best random state found.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

max_accuracy = 0
best_rs = 0
for i in range(1, 150):
    x_train, x_test, y_train, y_test = train_test_split(x_new, y_new, test_size = 0.25, random_state = i)
    lg = LogisticRegression()
    lg.fit(x_train, y_train)
    pred = lg.predict(x_test)
    acc = accuracy_score(y_test, pred)
    if acc > max_accuracy: # keep the best accuracy seen so far and the random state that produced it
        max_accuracy = acc
        best_rs = i
print(f"Best Random State is {best_rs}")

The best random state found is 98, so we split the data using a random state of 98.
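
Since the seed above is chosen by its test accuracy, it can be useful to see how much the accuracy varies from one split to another. The sketch below does this on synthetic stand-in data from `make_classification` (an assumption made so the example is self-contained); in the notebook one would pass `x_new` and `y_new` instead.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, not the wine dataset.
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)

accuracies = []
for seed in range(20):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.25, random_state=seed)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    accuracies.append(accuracy_score(y_te, model.predict(X_te)))

accuracies = np.array(accuracies)
print(f"mean={accuracies.mean():.3f}, std={accuracies.std():.3f}, "
      f"min={accuracies.min():.3f}, max={accuracies.max():.3f}")
```

The maximum over seeds is an optimistic figure by construction, so the mean and spread give a more representative picture of expected performance.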

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x_new, y_new, test_size = 0.25, random_state = 98)
print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)

## Model Selection:
The problem is a classification problem, so we need to select a classification model. We look for the best model among *Logistic Regression, Decision Tree Classifier, KNN Classifier and Support Vector Classifier (SVC).* Since the **dataset is not that large, we do not use bagging or boosting.** We fit each model on the training data, evaluate it on the testing data, and compare the accuracy scores.

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# For Logistic Regression
lg = LogisticRegression()
lg.fit(x_train, y_train)
pred_lg = lg.predict(x_test)
print("Accuracy Score of Logistic Regression model is", accuracy_score(y_test, pred_lg)*100)

# For Decision Tree Classifier
dtc = DecisionTreeClassifier()
dtc.fit(x_train, y_train)
pred_dtc = dtc.predict(x_test)
print("Accuracy Score of Decision Tree Classifier model is", accuracy_score(y_test, pred_dtc)*100)

# For K-Nearest Neighbour Classifier
knc = KNeighborsClassifier(n_neighbors = 5)
knc.fit(x_train, y_train)
pred_knc = knc.predict(x_test)
print("Accuracy Score of K-Nearest Neighbour Classifier model is", accuracy_score(y_test, pred_knc)*100)

# For Support Vector Classifier
svc = SVC(kernel = 'rbf')
svc.fit(x_train, y_train)
pred_svc = svc.predict(x_test)
print("Accuracy Score of Support Vector Classifier model is", accuracy_score(y_test, pred_svc)*100)

From the above accuracy scores, we find that the *accuracy score of the Decision Tree Classifier is the highest. But this could be the result of the model overfitting.*

## Cross Validation:
In order to check whether the *accuracy score obtained above is reliable and whether the model is overfitting,* **we cross-validate each model using accuracy as the scoring criterion.** This tells us whether the model is actually performing well and is not overfitting. *The model with the smallest difference between its accuracy score and its mean cross-validation accuracy is considered the best model.*

In [None]:
from sklearn.model_selection import cross_val_score

lg_scores = cross_val_score(lg, x_new, y_new, cv = 10) # cross validating the model (default scoring for classifiers is accuracy)
print(lg_scores) # accuracy scores of each cross validation cycle
print(f"Mean of accuracy scores for Logistic Regression is {lg_scores.mean()*100}\n")

dtc_scores = cross_val_score(dtc, x_new, y_new, cv = 10)
print(dtc_scores)
print(f"Mean of accuracy scores for Decision Tree Classifier is {dtc_scores.mean()*100}\n")

knc_scores = cross_val_score(knc, x_new, y_new, cv = 10)
print(knc_scores)
print(f"Mean of accuracy scores for KNN Classifier is {knc_scores.mean()*100}\n")

svc_scores = cross_val_score(svc, x_new, y_new, cv = 10)
print(svc_scores)
print(f"Mean of accuracy scores for Support Vector Classifier is {svc_scores.mean()*100}\n")

From the above results, we see that the *least difference between the accuracy score and the mean cross-validation accuracy is given by the KNN classifier.* We now tune **the SVC model using hyperparameter tuning.** After that, we evaluate the model on the basis of AUC score, recall and precision.
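
The selection rule described in this section (pick the model whose single-split accuracy is closest to its cross-validated mean) can be sketched as follows. The model names and scores are hypothetical placeholders, not results from this notebook; substitute the values printed by the cells above.

```python
# Hypothetical placeholder scores, for illustration only.
single_split_acc = {"model_a": 0.82, "model_b": 0.91, "model_c": 0.84}
cv_mean_acc = {"model_a": 0.80, "model_b": 0.85, "model_c": 0.835}

# Gap between the single-split accuracy and the cross-validated mean.
gaps = {name: abs(single_split_acc[name] - cv_mean_acc[name])
        for name in single_split_acc}

# The model with the smallest gap is the one this rule selects.
best_model = min(gaps, key=gaps.get)

for name, gap in gaps.items():
    print(f"{name}: gap = {gap:.3f}")
print(f"selected: {best_model}")
```

A small gap indicates consistency rather than high accuracy, so it is worth reading it together with the cross-validated mean itself.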

## Hyper-Parameter Tuning:
**Tuning the SVC model:**

In [None]:
from sklearn.model_selection import GridSearchCV
svc = SVC()
parameters = { 'kernel' : ['rbf', 'linear'], 'gamma': [1, 0.1, 0.01, 0.001, 0.0001], 'C': [0.1, 1, 10, 100, 1000]}
gs = GridSearchCV(estimator = svc, param_grid = parameters, scoring = 'f1', cv = 5)
gs.fit(x_train, y_train)
print("The best parameters for SVC Model are:")
print(gs.best_params_)

**Best parameters for SVC Classifier after tuning are C - 1000, gamma - 1, kernel - 'rbf'.**
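
To get a feel for the cost of the search above, the sketch below enumerates the same grid with `itertools.product` and counts the model fits. `GridSearchCV` evaluates every combination on every fold and then, by default, refits once on the full training data.

```python
from itertools import product

# Same grid and fold count as in the tuning cell above.
parameters = {
    'kernel': ['rbf', 'linear'],
    'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
    'C': [0.1, 1, 10, 100, 1000],
}
cv_folds = 5

# Every combination of one value per parameter.
candidates = [dict(zip(parameters, combo))
              for combo in product(*parameters.values())]
n_candidates = len(candidates)
n_fits = n_candidates * cv_folds  # fits during the search, before the final refit

print(f"{n_candidates} candidates x {cv_folds} folds = {n_fits} fits")
```

Since `gamma` has no effect on the linear kernel, the linear candidates that differ only in `gamma` are redundant. Passing `param_grid` as a list of two dictionaries, one per kernel, would avoid fitting them.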

# MODEL EVALUATION:
## SVC Evaluation:
Evaluation of SVC using classification report and AUC score.

In [None]:
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import RocCurveDisplay # plot_roc_curve was removed in scikit-learn 1.2

svc = SVC(kernel = 'rbf', C = 1000, gamma = 1)
svc.fit(x_train, y_train)
pred_svc = svc.predict(x_test)

print("Accuracy Score of SVC model is", accuracy_score(y_test, pred_svc))
print("Confusion matrix for SVC Model is")
print(confusion_matrix(y_test, pred_svc))
print("Classification Report of the SVC Model is")
print(classification_report(y_test, pred_svc))

From the above classification report, we see that the SVC model has an **F1-score of 0.95, with precision and recall greater than 0.92.** Now, we plot the ROC curve and look at the AUC score for the model.

In [None]:
RocCurveDisplay.from_estimator(svc, x_test, y_test) # arguments are the fitted model, the test features and the test labels
plt.title("Receiver Operating Characteristic")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.show()

From the above ROC curve, we see that the model has an **AUC score of 0.97.** This means that the prediction model separates the two classes well.
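
One way to read an AUC value: it is the fraction of (good, poor) sample pairs in which the model gives the good sample the higher score. The sketch below computes this directly on hypothetical toy scores, not on the output of the SVC model.

```python
# Hypothetical decision scores (toy values, not model output).
pos_scores = [0.9, 0.8, 0.4]        # samples that are truly good
neg_scores = [0.7, 0.3, 0.2, 0.1]   # samples that are truly poor

# Count correctly ordered (positive, negative) pairs; ties count as half.
correct = 0.0
for p in pos_scores:
    for n in neg_scores:
        if p > n:
            correct += 1.0
        elif p == n:
            correct += 0.5

auc = correct / (len(pos_scores) * len(neg_scores))
print(f"AUC = {auc:.3f}")
```

Under this reading, an AUC of 0.97 means that about 97% of such pairs are ranked in the right order.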

We can also *visualise the performance of the model during testing phase* using the histogram plot as shown below.

In [None]:
act_pos = y_test[y_test == 1]
pred_pos = pred_svc[pred_svc == 1]
act_neg = y_test[y_test == 0]
pred_neg = pred_svc[pred_svc == 0]

fig1 = go.Figure()

fig1.add_traces(go.Histogram(x = act_pos, name='Actual Good', marker_color='springgreen', opacity=0.9))

fig1.add_traces(go.Histogram(x = pred_pos, name='Predicted Good', marker_color='mediumspringgreen', opacity=0.9))

fig1.add_traces(go.Histogram(x = act_neg, name='Actual Poor', marker_color='peru', opacity=0.9))

fig1.add_traces(go.Histogram(x = pred_neg, name='Predicted Poor', marker_color='tan', opacity=0.9))

fig1.update_layout(title_text="Model's Wine Quality Prediction Result", xaxis_title_text='Actual and Predicted',
                   yaxis_title_text='Counts', bargap=0.1, bargroupgap=0.3)

fig1.show()

From the histogram above, we can see that the predicted counts for each class closely match the actual counts, so the model's **predictions are very accurate**. This model can be used to classify wine as good or poor.

# SERIALISATION:
## Saving the model-
The fitted **model can now be saved as an object outside the notebook** and *used for prediction.*

In [None]:
import joblib # used for serialisation
joblib.dump(svc, 'Wine_Quality_Prediction_Model.obj') # saving the model as an object