# **About dataset**

This dataset is related to red variants of the Portuguese "Vinho Verde" wine. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).

This dataset can be viewed as a classification task. The classes are ordered and not balanced (e.g. there are many more normal wines than excellent or poor ones). Outlier detection algorithms could be used to detect the few excellent or poor wines. Also, we are not sure whether all input variables are relevant, so it could be interesting to test feature selection methods.

Input variables (based on physicochemical tests):
1. fixed acidity
2. volatile acidity
3. citric acid
4. residual sugar
5. chlorides
6. free sulfur dioxide
7. total sulfur dioxide
8. density
9. pH
10. sulphates
11. alcohol
Output variable (based on sensory data):
12. quality (score between 0 and 10)

Our goal is to build a **machine learning** model that can predict the quality of a wine from the input variables (features) listed above.

# **Viewing and understanding the basic details of our dataset**

In [None]:
#importing libraries that are used in this notebook
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import pearsonr
from sklearn.model_selection import train_test_split as tts
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier as RFC
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix as CM

In [None]:
#importing dataset
dataset = pd.read_csv('/kaggle/input/red-wine-quality-cortez-et-al-2009/winequality-red.csv')
dataset.head()

In [None]:
dataset.shape

In [None]:
dataset.describe()

From the data description above we can see that density is nearly constant throughout the dataset: its mean, min, 25%, 50% and 75% values are all ~0.99.

So I am choosing to drop the density feature, on the assumption that it will not have much significance in predicting the quality of wine.

In [None]:
dataset.drop(labels='density', axis=1, inplace=True)
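
A narrow absolute range does not by itself show that a feature is uninformative, because the standardisation step used later rescales every column to unit variance. One quick sanity check before dropping a column is to compare its relative spread and its correlation with the target. The sketch below uses a small synthetic frame; the column values and the name `demo` are made up for illustration and are not taken from the wine data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the wine data: all values are illustrative only.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'density': rng.normal(0.9967, 0.0019, size=200),
    'alcohol': rng.normal(10.4, 1.1, size=200),
})
demo['quality'] = rng.integers(3, 9, size=200)

features = demo[['density', 'alcohol']]

# Coefficient of variation: standard deviation relative to the mean.
cv = features.std() / features.mean()
print(cv)

# Correlation with the target is a separate question from absolute spread.
print(features.corrwith(demo['quality']))
```

On the real data the same two lines could be run on `dataset` before the `drop` call to see whether the assumption holds.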

# **Checking for missing values**

In [None]:
dataset.isnull().sum()

There are no missing values in our dataset

# **Checking for outliers in our dataset**

In [None]:
#Plotting boxplots to see if there are any outliers in our data (points beyond the whiskers, i.e. more than 1.5*IQR outside the 25th-75th percentile box, are drawn as outliers)
fig, ax = plt.subplots(ncols=5, nrows=2, figsize=(15, 5))
ax = ax.flatten()
index = 0
for i in dataset.columns:
  if i != 'quality':
    sns.boxplot(y=i, data=dataset, ax=ax[index])
    index +=1
plt.tight_layout(pad=0.4)
plt.show()

From the above box plots we can clearly see that there are outliers in all features.

**BUT**

Here I am choosing not to remove or modify the outliers: we want the model to capture fine distinctions rather than a rough approximation, and a high quality wine may have a very rare composition (hence an outlier) compared with average quality wines. So we will not remove or modify the outlier values in our dataset.
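
To put a number on what the box plots show, the usual whisker rule can be applied per column. The helper below is a minimal sketch, assuming the common 1.5 × IQR convention; `count_iqr_outliers` is a hypothetical name and the toy values are not from the wine data.

```python
import pandas as pd

def count_iqr_outliers(series, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    mask = (series < q1 - k * iqr) | (series > q3 + k * iqr)
    return int(mask.sum())

# Toy column with one obvious extreme value (illustrative only).
toy = pd.Series([1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 4.0, 4.0, 5.0, 50.0])
print(count_iqr_outliers(toy))  # 1
```

In the notebook this could be applied to every feature with `dataset.drop(columns='quality').apply(count_iqr_outliers)`.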

# **Feature Extraction**

Plotting bar plots to see the relation between each independent feature and the dependent feature 'quality'

In [None]:
fig, ax = plt.subplots(ncols=5, nrows=2, figsize=(15, 5))
ax = ax.flatten()
index=0
for i in dataset.columns:
  if i != 'quality':
    sns.barplot(x='quality', y=i, data=dataset, ax=ax[index])
    index+=1
plt.tight_layout(pad=0.4)
plt.show()

**From the above visualisation we derive that:**
1. Features fixed acidity and residual sugar show no clear trend across quality scores, so they might not help to classify/predict the quality.
2. Quality increases with
    * decrease in volatile acidity.
    * increase in citric acid.
    * decrease in chlorides.
    * decrease in pH.
    * increase in sulphates.
    * increase in alcohol.
3. Free sulfur dioxide alone will not be able to predict the quality.
4. Total sulfur dioxide alone will not be able to predict the quality.
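
The height of each bar drawn by `sns.barplot` is, by default, the mean of the feature within that quality score, so the same trends can be read from a grouped table. A minimal sketch on an invented toy frame (the numbers are not from the wine data):

```python
import pandas as pd

# Tiny illustrative frame; the numbers are invented.
toy = pd.DataFrame({
    'quality': [5, 5, 6, 6, 7, 7],
    'alcohol': [9.5, 9.9, 10.4, 10.8, 11.4, 11.8],
    'volatile acidity': [0.70, 0.66, 0.55, 0.51, 0.40, 0.38],
})

# Mean of every feature per quality score: the numbers behind each bar.
means = toy.groupby('quality').mean()
print(means)
```

On the real data, `dataset.groupby('quality').mean()` would give the corresponding table for all features.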

**Plotting correlation heatmap to verify the above statements**




In [None]:
plt.figure(figsize=(10, 10))
sns.heatmap(dataset.corr(method='pearson'), annot=True, square=True)
plt.show()

print('Correlation of different features of our dataset with quality:')
for i in dataset.columns:
  corr, _ = pearsonr(dataset[i], dataset['quality'])
  print('%s : %.4f' %(i,corr))

In [None]:
#for a better view this way can be used
print('Another (more clear) view of correlations among features:\n')
dataset.corr().style.background_gradient(cmap="coolwarm")

**From the above plots and values we can conclude:**
1. volatile acidity, chlorides and pH are negatively correlated with quality, so our statement that quality increases as these features decrease was right; the features we expected to increase with quality (citric acid, sulphates and alcohol) are positively correlated with it.
2. free sulfur dioxide and total sulfur dioxide are strongly correlated with each other, with a correlation of 0.67.
3. Many features have an absolute correlation with quality below 0.5, and could be candidates for removal from the dataset.

BUT for the same reason as mentioned above in the outlier section (we want the model to capture fine distinctions rather than a rough approximation, and a high quality wine may have a very rare composition compared with average quality wines), we need to take every feature into account while predicting the quality of wine, so we will not remove any features from our dataset.
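
When comparing features by correlation with quality, the sign only gives the direction, so it is the absolute value that should be ranked. A minimal sketch on an invented toy frame (values are illustrative, not from the wine data):

```python
import pandas as pd

# Invented values, chosen only to show the ranking step.
toy = pd.DataFrame({
    'alcohol': [9.0, 9.5, 10.0, 11.0, 12.0, 12.5],
    'volatile acidity': [0.80, 0.70, 0.65, 0.50, 0.40, 0.35],
    'residual sugar': [2.0, 2.6, 1.9, 2.5, 2.1, 2.4],
    'quality': [4, 5, 5, 6, 7, 7],
})

# Pearson correlation of each feature with quality (quality itself dropped).
corr = toy.corr(method='pearson')['quality'].drop('quality')

# Rank by strength, ignoring direction.
ranked = corr.abs().sort_values(ascending=False)
print(ranked)
```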

# **Machine learning**

Now we implement several classification models and select the best of them based on their cross-validated accuracy.

**First let's prepare our dataset for Machine Learning**

In [None]:
#our dataset
dataset.head()

Dividing the quality of wine into two buckets, i.e. good wine and bad wine; our final result will be given in terms of these two classes.

In [None]:
bins = (2, 6.5, 8)
group_names = ['bad', 'good']
dataset['quality'] = pd.cut(dataset['quality'], bins = bins, labels = group_names)
dataset.head()

With the above code we have divided the quality of wine into two buckets:
* Bad wine : quality above 2 and up to 6.5
* Good wine : quality above 6.5 and up to 8

This can be changed as per the requirement of our client.
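
The bucket edges behave as follows: by default `pd.cut` builds intervals that are open on the left and closed on the right, so the two buckets are (2, 6.5] and (6.5, 8]. A small sketch on a hand-written series of scores (not the wine data) shows the mapping, and also that a score outside the outer edges is left unassigned:

```python
import pandas as pd

scores = pd.Series([3, 4, 5, 6, 7, 8])
labels = pd.cut(scores, bins=(2, 6.5, 8), labels=['bad', 'good'])
print(labels.tolist())  # ['bad', 'bad', 'bad', 'bad', 'good', 'good']

# Scores on or below the lowest edge, or above the highest, become NaN.
outside = pd.cut(pd.Series([2, 9]), bins=(2, 6.5, 8), labels=['bad', 'good'])
print(outside.isna().tolist())  # [True, True]
```

If the threshold is changed for a different requirement, the outer edges should still cover every score present in the data, otherwise those rows end up with missing labels.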

Now we will map the values of bad and good to 0 and 1 respectively, as machine learning models can perform calculation only on numerical data.

In [None]:
dataset['quality'] = dataset['quality'].map({'bad' : 0, 'good' : 1})
dataset.head(10)

Let's count and visualise the total number of different wine samples

In [None]:
print(dataset['quality'].value_counts())
fig, ax = plt.subplots(ncols=2, nrows=1, figsize=(15, 5))
ax = ax.flatten()
print('\nVisualisation of the number of bad (0) and good (1) wine samples')
#pie chart of the class proportions
dataset['quality'].value_counts().plot(kind='pie', ax=ax[0])
#count plot of the same classes; x and data are passed by keyword, as newer seaborn versions require
sns.countplot(x='quality', data=dataset, ax=ax[1])
plt.show()

Creating set of independent and dependent features

In [None]:
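#selecting by column name instead of position, so this still works if the column order changes
X = dataset.drop(columns='quality')
Y = dataset['quality']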

Creating training and test sets

In [None]:
X_train, X_test, Y_train, Y_test = tts(X, Y, test_size=0.20, random_state=0)
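
Because the two classes are not balanced, a plain random split can leave the training and test sets with slightly different shares of good wines. One option, shown here only as a sketch on toy labels (the 90/10 ratio is an assumption for illustration), is the `stratify` argument of `train_test_split`, which keeps the class ratio the same in both parts:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy labels: 90 zeros ("bad") and 10 ones ("good").
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.20, random_state=0, stratify=y
)

# Share of "good" samples in each part.
print(y_tr.mean(), y_te.mean())
```

In this notebook the equivalent call would be `tts(X, Y, test_size=0.20, random_state=0, stratify=Y)`.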

Feature scaling, but not scaling the dependent variable as it holds categorical data

In [None]:
from sklearn.preprocessing import StandardScaler as ss
SS = ss()
X_train = SS.fit_transform(X_train)
X_test = SS.transform(X_test)
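
The scaler is fitted on the training set only, and the test set is transformed with the statistics learned from the training set, so no information from the test data leaks into preprocessing. A minimal sketch on random toy arrays (the names `train` and `test` and their values are made up) makes this visible:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
train = rng.normal(10.0, 2.0, size=(80, 3))
test = rng.normal(10.0, 2.0, size=(20, 3))

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train only
test_scaled = scaler.transform(test)        # reuses the training statistics

# Training columns end up with mean ~0 and standard deviation ~1.
print(train_scaled.mean(axis=0).round(6))
print(train_scaled.std(axis=0).round(6))

# The test set is shifted and scaled by the *training* mean and std.
manual = (test - train.mean(axis=0)) / train.std(axis=0)
print(np.allclose(test_scaled, manual))
```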

**Now implementing machine learning models**

Logistic Regression Classification

In [None]:
logisticRegression = LogisticRegression(solver='lbfgs', random_state=0)
logisticRegression.fit(X_train, Y_train)
Y_pred_logisticRegression = logisticRegression.predict(X_test)
Y_compare_logisticRegression = pd.DataFrame({'Actual' : Y_test, 'Predicted' : Y_pred_logisticRegression})
print(Y_compare_logisticRegression.head())
print('\nConfusion matrix:')
print(CM(Y_test, Y_pred_logisticRegression))

K-Nearest Neighbour Classification

In [None]:
knn = KNN(n_neighbors=2, metric='minkowski', p=2,)
knn.fit(X_train, Y_train)
Y_pred_knn = knn.predict(X_test)
Y_compare_knn = pd.DataFrame({'Actual' : Y_test, 'Predicted' : Y_pred_knn})
print(Y_compare_knn.head())
print('\nConfusion matrix:')
print(CM(Y_test, Y_pred_knn))

Support Vector Classification

In [None]:
svc = SVC(kernel='rbf', gamma='scale', random_state=0)
svc.fit(X_train, Y_train)
Y_pred_svc = svc.predict(X_test)
Y_compare_svc = pd.DataFrame({'Actual' : Y_test, 'Predicted' : Y_pred_svc})
print(Y_compare_svc.head())
print('\nConfusion matrix:')
print(CM(Y_test, Y_pred_svc))

Naive Bayes Classification

In [None]:
nb = GaussianNB()
nb.fit(X_train, Y_train)
Y_pred_nb = nb.predict(X_test)
Y_compare_nb = pd.DataFrame({'Actual' : Y_test, 'Predicted' : Y_pred_nb})
print(Y_compare_nb.head())
print('\nConfusion matrix:')
print(CM(Y_test, Y_pred_nb))

Random Forest Classification

In [None]:
rfc = RFC(n_estimators=25, criterion='gini', random_state=0,)
rfc.fit(X_train, Y_train)
Y_pred_rfc = rfc.predict(X_test)
Y_compare_rfc = pd.DataFrame({'Actual' : Y_test, 'Predicted' : Y_pred_rfc})
print(Y_compare_rfc.head())
print('\nConfusion matrix:')
print(CM(Y_test, Y_pred_rfc))

**Checking accuracy of different classification models**

In [None]:
#K-fold cross validation
modelNames = ['Logistic Regression', 'K-Nearest Neighbour', 'Support Vector', 'Naive Bayes', 'Random Forest']
modelClassifiers = [logisticRegression, knn, svc, nb, rfc]
score = []
#10-fold cross validation of every model on the training set
for name, classifier in zip(modelNames, modelClassifiers):
  accuracy = cross_val_score(classifier, X_train, Y_train, scoring='accuracy', cv=10)
  print('Accuracy of %s Classification model is %.2f' % (name, accuracy.mean()))
  score.append(accuracy.mean())

In [None]:
pd.DataFrame({'Model Name' : modelNames,'Score' : score}).sort_values(by='Score', ascending=True).plot(x=0, y=1, kind='bar', figsize=(15,5), title='Comparison of accuracies of different classification models')
plt.show()

From the above scores and visualisations we can conclude that the Random Forest Classification model gives the best score, and we can use it to predict the quality of wine for this particular problem.

However, other models like Logistic Regression, KNN and SVC have scores comparable to Random Forest and may also be used to predict the quality of wine.
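
Since bad wines heavily outnumber good ones, accuracy alone can look high even for a model that never predicts a good wine. A useful reference point is a baseline that always predicts the majority class. The sketch below uses toy labels with an assumed 86/14 split (chosen for illustration) and compares accuracy with the F1 score of the minority class:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

# Toy labels: 86 "bad" (0) and 14 "good" (1); features are irrelevant here.
y = np.array([0] * 86 + [1] * 14)
X = np.zeros((100, 1))

# Baseline that always predicts the most frequent class.
baseline = DummyClassifier(strategy='most_frequent').fit(X, y)
pred = baseline.predict(X)

acc = accuracy_score(y, pred)
f1 = f1_score(y, pred, zero_division=0)
print(acc, f1)  # 0.86 0.0
```

Any model in the comparison above is only useful to the extent that it beats such a baseline, so reporting F1 or recall for the good class next to accuracy gives a fuller picture.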

# **Final Summary**

From the above data engineering and machine learning (classification) techniques we can conclude that:

1. We have chosen not to remove outliers or to select only the more relevant features from our dataset, as we wanted the model to capture fine distinctions rather than a rough approximation (a high quality wine may have a very rare composition compared with average quality wines).
2. The Random Forest Classification model gave the best accuracy and can be considered a good model for predicting the quality of wine for this problem.
3. However, other models like Logistic Regression, KNN and SVC have scores comparable to Random Forest and may also be used to predict the quality of wine.
4. The Naive Bayes model gave the lowest accuracy, so it can be considered a poor model for predicting the quality of wine.
5. Performance tuning using methods like Grid Search can be done to improve the accuracy of these models, and after tuning a different model might turn out to be the best for our problem.
6. We might get different results if we remove outliers and apply feature selection.
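
The grid search mentioned in point 5 could look roughly like the sketch below. It runs on synthetic data from `make_classification` rather than the wine data, and the parameter grid is an arbitrary small example, not a recommended setting.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data standing in for the wine features.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Small illustrative grid; real tuning would explore more values.
param_grid = {'n_estimators': [10, 25], 'max_depth': [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring='accuracy',
    cv=3,
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

In the notebook, the same call would be fitted on `X_train` and `Y_train`, and `search.best_estimator_` could then be evaluated on the test set.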

Finally, I would like to end this notebook with the fact that no Data Science technique is perfect: there are many other ways and models to get better results, and there is always scope for improvement.
