
# Red Wine Quality

The two datasets are related to red and white variants of the Portuguese "Vinho Verde" wine. For more details, consult the reference [Cortez et al., 2009]. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).
This dataset is also available from the UCI machine learning repository, https://archive.ics.uci.edu/ml/datasets/wine+quality.


Input variables (based on physicochemical tests):

1 - fixed acidity 

2 - volatile acidity 

3 - citric acid 

4 - residual sugar 

5 - chlorides 

6 - free sulfur dioxide 

7 - total sulfur dioxide 

8 - density 

9 - pH 

10 - sulphates 

11 - alcohol 

Output variable (based on sensory data): 

12 - quality (score between 0 and 10) 

Besides regression modelling, an interesting alternative is to set an arbitrary cutoff on the dependent variable (wine quality), for example classifying wines with a score of 7 or higher as 'good/1' and the remainder as 'not good/0'.
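As a minimal sketch of this cutoff idea, assuming a threshold of 7 and a handful of made-up scores (the `quality` values and the `CUTOFF` name below are illustrative, not taken from the dataset):

```python
import pandas as pd

# Hypothetical quality scores standing in for the real `quality` column.
quality = pd.Series([5, 6, 7, 4, 8, 6, 7], name="quality")

CUTOFF = 7  # assumed threshold: scores of 7 or higher count as "good"

# The comparison yields booleans; casting to int gives the 0/1 labels.
is_good = (quality >= CUTOFF).astype(int)

print(is_good.tolist())  # [0, 0, 1, 0, 1, 0, 1]
```

The threshold is a modelling choice, so it is worth keeping it in a single named constant that can be changed easily.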

In this notebook we will treat the problem both as a classification task and as a regression task, to compare the two approaches.

### Import libraries

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib import pyplot
%matplotlib inline
sns.set(style='white', context='notebook', palette='deep')

from scipy.stats import pearsonr
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

%config Completer.use_jedi = False


In [None]:
wine = pd.read_csv('../input/red-wine-quality-cortez-et-al-2009/winequality-red.csv')

In [None]:
wine.head()

In [None]:
wine.shape

In [None]:
wine.info()

As we can see, we have only continuous (numeric) variables.

In [None]:
wine.describe()


## Exploratory data analysis

Let's start by looking at our target, quality.

In [None]:
wine.quality.value_counts()

In [None]:
sns.countplot(x="quality", data=wine)

**Analysing Fixed acidity**

In [None]:
wine['fixed acidity'].plot.hist(grid=True)


In [None]:
print('Min value',wine['fixed acidity'].min())
print('Max value', wine['fixed acidity'].max())

As we can see, fixed acidity has values between 4.6 and 15.9.

**Analysing Volatile acidity**

In [None]:
wine['volatile acidity'].plot.hist(grid=True)


In [None]:
print('Min value',wine['volatile acidity'].min())
print('Max value', wine['volatile acidity'].max())

As we can see, volatile acidity has values between 0.12 and 1.58.

**Analysing citric acid**

In [None]:
wine['citric acid'].plot.hist(grid=True)


In [None]:
print('Min value',wine['citric acid'].min())
print('Max value', wine['citric acid'].max())

As we can see, citric acid has values between 0 and 1.

**Analysing residual sugar**

In [None]:
wine['residual sugar'].plot.hist(grid=True)


In [None]:
print('Min value',wine['residual sugar'].min())
print('Max value',wine['residual sugar'].max())

As we can see, residual sugar has values between 0.9 and 15.5.



**Analysing chlorides**

In [None]:
wine['chlorides'].plot.hist(grid=True)


In [None]:
print('Min value',wine['chlorides'].min())
print('Max value',wine['chlorides'].max())

As we can see, chlorides have values between 0.012 and 0.611.


**Analysing free sulfur dioxide**

In [None]:
wine['free sulfur dioxide'].plot.hist(grid=True)

In [None]:
print('Min value',wine['free sulfur dioxide'].min())
print('Max value',wine['free sulfur dioxide'].max())

As we can see, free sulfur dioxide has values between 1 and 72.



**Analysing total sulfur dioxide**

In [None]:
wine['total sulfur dioxide'].plot.hist(grid=True)

In [None]:
print('Min value',wine['total sulfur dioxide'].min())
print('Max value',wine['total sulfur dioxide'].max())

As we can see, total sulfur dioxide has values between 1 and 289.
Given such a wide range, we almost certainly have outliers here.

**Analysing density**

In [None]:
wine['density'].plot.hist(grid=True)

In [None]:
print('Min value',wine['density'].min())
print('Max value',wine['density'].max())

As we can see, density has values between 0.99007 and 1.00369.



**Analysing pH**

In [None]:
wine['pH'].plot.hist(grid=True)

In [None]:
print('Min value',wine['pH'].min())
print('Max value',wine['pH'].max())

As we can see, pH has values between 2.74 and 4.01.



**Analysing sulphates**

In [None]:
wine['sulphates'].plot.hist(grid=True)

In [None]:
print('Min value',wine['sulphates'].min())
print('Max value',wine['sulphates'].max())

As we can see, sulphates have values between 0.33 and 2.0.




**Analysing alcohol**

In [None]:
wine['alcohol'].plot.hist(grid=True)

In [None]:
print('Min value',wine['alcohol'].min())
print('Max value',wine['alcohol'].max())

As we can see, alcohol has values between 8.4 and 14.9.


All these minimum and maximum values will help us when building the application!
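The per-feature minimum and maximum can also be collected in a single table instead of one cell per column. A minimal sketch, assuming a small made-up frame named `sample` that reuses two of the dataset's column names (on the real data, `wine` would take its place):

```python
import pandas as pd

# Tiny made-up frame; the values are illustrative only.
sample = pd.DataFrame({
    "pH": [3.51, 3.20, 3.26, 3.16],
    "alcohol": [9.4, 9.8, 9.8, 10.0],
})

# agg with a list of function names returns one row per statistic;
# transposing gives one row per feature with 'min' and 'max' columns.
ranges = sample.agg(["min", "max"]).T
print(ranges)
```

A table like this could later serve as the source of valid input ranges for the application.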

**Let's have a look at density and alcohol**

In [None]:
sns.scatterplot(x='density', y='alcohol', data=wine)

**Let's have a look at pH and alcohol**

In [None]:
sns.scatterplot(x='pH', y='alcohol', data=wine)

In [None]:
sns.boxplot(data=wine, x='density')

Most values have a density between 0.996 and 0.998.

### Checking for Missing Values

In [None]:
wine.isnull().sum()

We do not have any missing values.

### Checking for Outliers
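One common way to quantify outliers is the 1.5 × IQR rule that box plots use for their whiskers. The sketch below is an assumption about how such a check could look, not a step from the original analysis; `count_iqr_outliers` is a hypothetical helper and `sample` holds made-up values:

```python
import pandas as pd

def count_iqr_outliers(df: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = df.select_dtypes("number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Boolean frame: True where a value lies beyond either fence.
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return outside.sum()

# Made-up values: one extreme total sulfur dioxide reading, no extreme density.
sample = pd.DataFrame({
    "total sulfur dioxide": [30.0, 34.0, 38.0, 42.0, 46.0, 289.0],
    "density": [0.9960, 0.9965, 0.9970, 0.9975, 0.9980, 0.9985],
})

outlier_counts = count_iqr_outliers(sample)
print(outlier_counts)
```

Calling `count_iqr_outliers(wine)` on the full frame would give one count per feature; whether those points should be removed is a separate decision.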

## Let's try to understand what makes a quality wine

Here we see that fixed acidity shows no clear trend across quality levels, so on its own it does not help to classify the quality.



In [None]:
fig = plt.figure(figsize = (10,6))
sns.barplot(x = 'quality', y = 'fixed acidity', data = wine)

Here we see a fairly clear downward trend in volatile acidity as quality increases.

In [None]:
fig = plt.figure(figsize = (10,6))
sns.barplot(x = 'quality', y = 'volatile acidity', data = wine)

In [None]:
fig = plt.figure(figsize = (10,6))
sns.barplot(x = 'quality', y = 'residual sugar', data = wine)

In [None]:
fig = plt.figure(figsize = (10,6))
sns.barplot(x = 'quality', y = 'free sulfur dioxide', data = wine)

Citric acid content increases as the quality of the wine increases.


In [None]:
fig = plt.figure(figsize = (10,6))
sns.barplot(x = 'quality', y = 'citric acid', data = wine)

Chloride content decreases as the quality of the wine increases.



In [None]:
fig = plt.figure(figsize = (10,6))
sns.barplot(x = 'quality', y = 'chlorides', data = wine)

Sulphate level increases with the quality of the wine.


In [None]:
fig = plt.figure(figsize = (10,6))
sns.barplot(x = 'quality', y = 'sulphates', data = wine)

Alcohol level also increases as the quality of the wine increases.


In [None]:
fig = plt.figure(figsize = (10,6))
sns.barplot(x = 'quality', y = 'alcohol', data = wine)

Quality also tends to increase as the pH of the wine decreases.

In [None]:
fig = plt.figure(figsize = (10,6))
sns.barplot(x = 'quality', y = 'pH', data = wine)

#### From the above visualisation we observe that:

- Fixed acidity and residual sugar might not help to classify/predict the quality.
- Quality increases with:
  - a decrease in volatile acidity.
  - an increase in citric acid.
  - a decrease in chlorides.
  - a decrease in pH.
  - an increase in sulphates.
  - an increase in alcohol.
- Free sulfur dioxide alone will not be able to predict the quality.
- Total sulfur dioxide alone will not be able to predict the quality.
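The bar plots show the mean of each feature per quality level (the default estimator of `sns.barplot`), so the same trends can be read as numbers with a `groupby`. A minimal sketch on made-up values; `sample` and its contents are illustrative, not taken from the dataset:

```python
import pandas as pd

# Made-up rows: alcohol rises and volatile acidity falls with quality.
sample = pd.DataFrame({
    "quality": [5, 5, 6, 6, 7, 7],
    "alcohol": [9.4, 9.8, 10.2, 10.6, 11.4, 11.8],
    "volatile acidity": [0.70, 0.66, 0.56, 0.52, 0.40, 0.36],
})

# One row per quality level, one column per feature mean.
means = sample.groupby("quality").mean()
print(means)

# A monotonic column of means corresponds to a clear trend in the bar plot.
print(means["alcohol"].is_monotonic_increasing)
print(means["volatile acidity"].is_monotonic_decreasing)
```

On the real data, `wine.groupby('quality').mean()` would give the full table for all eleven features.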

### Correlation

In [None]:
plt.figure(figsize=(10, 10))
sns.heatmap(wine.corr(method='pearson'), annot=True, square=True)
plt.show()

In [None]:
print('Correlation of different features of our dataset with quality:')
for i in wine.columns:
  corr, _ = pearsonr(wine[i], wine['quality'])
  print('%s : %.4f' %(i,corr))

In [None]:
#for another view, this method can be used to view correlations
print('Another view of correlations among features:\n')
wine.corr().style.background_gradient(cmap="coolwarm")

From the above plots and values we observe:

- volatile acidity, chlorides and pH are negatively correlated with quality, which supports our earlier statement that quality increases as these features decrease. Conversely, the features we saw rising with quality (citric acid, sulphates and alcohol) are positively correlated with it.
- free sulfur dioxide and total sulfur dioxide are highly correlated with each other, with a correlation of 0.67.
- Many features have a correlation below 0.5 with quality and could be candidates for removal from the dataset.
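To compare features by how strongly they relate to quality, the correlation column can be ranked by absolute value while keeping the sign. This is a sketch on made-up numbers; the names `sample`, `corr` and `ranked` are illustrative:

```python
import pandas as pd

# Made-up values for two features and the target.
sample = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.5, 11.2, 12.0],
    "volatile acidity": [0.70, 0.62, 0.55, 0.41, 0.33],
    "quality": [5, 5, 6, 7, 7],
})

# Pearson correlation of every feature with quality, excluding quality itself.
corr = sample.corr(method="pearson")["quality"].drop("quality")

# Order by strength (absolute value) but keep the signed coefficients.
ranked = corr.reindex(corr.abs().sort_values(ascending=False).index)
print(ranked)
```

Keeping the sign matters here: a strongly negative feature is just as informative as a strongly positive one.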

We are aiming for fine-grained accuracy, not just a rough approximation. A high-quality wine may have a rare composition compared with average-quality wines, so every feature needs to be taken into account when predicting quality. I am therefore choosing not to remove any feature from the dataset.

We will first handle this problem with classification algorithms.

## Machine learning

We divide the quality of wine into two buckets, i.e. good wine and bad wine, and base our final result on this split.

In [None]:
bins = (2, 6.5, 8)
group_names = ['bad', 'good']
wine['quality'] = pd.cut(wine['quality'], bins = bins, labels = group_names)
wine.head()

With the above code we have divided the quality of wine into two buckets:
- Bad wine: range 2 – 6.5
- Good wine: range 6.5 – 8
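By default `pd.cut` builds intervals that are open on the left and closed on the right, so the bins `(2, 6.5, 8)` become (2, 6.5] and (6.5, 8]. A small sketch with made-up scores shows which label each score receives and what happens outside the bin edges:

```python
import pandas as pd

# Made-up integer scores covering both buckets.
scores = pd.Series([3, 4, 5, 6, 7, 8])
labels = pd.cut(scores, bins=(2, 6.5, 8), labels=["bad", "good"])
print(labels.tolist())  # ['bad', 'bad', 'bad', 'bad', 'good', 'good']

# Scores on the excluded left edge (2) or above the last edge (9)
# fall into no interval and become NaN.
out_of_range = pd.cut(pd.Series([2, 9]), bins=(2, 6.5, 8), labels=["bad", "good"])
print(out_of_range.isna().tolist())  # [True, True]
```

If scores outside these edges can occur, the bin edges need to be widened, otherwise those rows end up with a missing label.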



Now we will map the values bad and good to 0 and 1 respectively, as machine learning models operate on numerical data:

In [None]:
wine['quality'] = wine['quality'].map({'bad' : 0, 'good' : 1})
wine.head(10)

Let's visualise our new data!

In [None]:
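print(wine['quality'].value_counts())
fig, ax = plt.subplots(ncols=2, nrows=1, figsize=(18, 5))
ax = ax.flatten()
# Pie chart of class shares (left) and bar chart of class counts (right)
wine['quality'].value_counts().plot(kind='pie', ax=ax[0])
# Pass the column by keyword; recent seaborn versions do not accept it positionally
sns.countplot(x='quality', data=wine, ax=ax[1])
plt.show()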

### Train test split

In [None]:
X = wine.drop('quality', axis = 1)
y = wine['quality']
 
#Creating training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
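When the two classes are not equally frequent, an optional variant is to stratify the split so that both parts keep the same class proportions. The split above is not stratified; the sketch below only illustrates the `stratify` parameter of `train_test_split` on synthetic labels (`X_demo` and `y_demo` are made up):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, deliberately imbalanced labels: 80 "bad" (0) and 20 "good" (1).
y_demo = np.array([0] * 80 + [1] * 20)
X_demo = np.arange(100).reshape(-1, 1)  # placeholder feature column

# stratify=y_demo keeps the 80/20 class ratio in both the train and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42, stratify=y_demo
)

print("share of good wines in train:", y_tr.mean())
print("share of good wines in test: ", y_te.mean())
```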

In [None]:
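# Feature scaling; the dependent variable is not scaled because it holds class labels
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
# Reuse the mean and standard deviation learned on the training set;
# refitting on the test set would scale the two sets inconsistently
X_test = sc.transform(X_test)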
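A scaler is normally fitted on the training split only and then applied unchanged to the test split, so that no test-set statistics enter the preprocessing. A minimal sketch on synthetic data (the `*_demo` arrays are made up) shows the effect:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic features drawn from the same distribution for both splits.
rng = np.random.default_rng(0)
X_train_demo = rng.normal(loc=10.0, scale=2.0, size=(80, 3))
X_test_demo = rng.normal(loc=10.0, scale=2.0, size=(20, 3))

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_demo)  # learn mean/std from training data
X_test_scaled = scaler.transform(X_test_demo)        # reuse the training statistics

# Training columns are centred exactly on 0; test columns only approximately,
# because they were scaled with the training mean and standard deviation.
print(X_train_scaled.mean(axis=0).round(3))
print(X_test_scaled.mean(axis=0).round(3))
```

The small offset in the test means is expected and harmless; it reflects that the test set is treated as unseen data.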

#### Our training and testing data are now ready for the machine learning algorithms

### Logistic Regression

In [None]:
logreg = LogisticRegression(solver='lbfgs', random_state=42)
logreg.fit(X_train, y_train)
pred_logreg = logreg.predict(X_test)
Y_compare_logisticRegression = pd.DataFrame({'Actual' : y_test, 'Predicted' : pred_logreg})
print(Y_compare_logisticRegression.sample(5))
print('\nConfusion matrix:')
print(confusion_matrix(y_test, pred_logreg))

In [None]:
import itertools
def plot_confusion_matrix(cm, classes, normalize = False,
                          title='Confusion matrix',
                          cmap=plt.cm.Greens): # can change color 
    plt.figure(figsize = (5, 5))
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title, size = 24)
    plt.colorbar(aspect=4)
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45, size = 14)
    plt.yticks(tick_marks, classes, size = 14)
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    # Label the plot
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 fontsize = 20,
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    
    plt.tight_layout()
    plt.ylabel('True label', size = 18)
    plt.xlabel('Predicted label', size = 18)

# Let's plot it out
cm = confusion_matrix(y_test, pred_logreg)
plot_confusion_matrix(cm, classes = ['0 - Bad', '1 - Good'],
                      title = 'Confusion Matrix')
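A binary confusion matrix also yields precision and recall for the 'good' class, which can be more telling than accuracy alone. The sketch below uses a made-up matrix (`cm_demo`, not the notebook's result) in scikit-learn's layout, where rows are true classes and columns are predicted classes:

```python
import numpy as np

# Made-up 2x2 confusion matrix, class order [0 (bad), 1 (good)].
cm_demo = np.array([[90, 10],
                    [6, 14]])

# Row-major flattening gives: true negatives, false positives,
# false negatives, true positives.
tn, fp, fn, tp = cm_demo.ravel()

accuracy = (tp + tn) / cm_demo.sum()
precision = tp / (tp + fp)  # of the wines predicted good, the share that is truly good
recall = tp / (tp + fn)     # of the truly good wines, the share that was found

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

The same arithmetic applies to the matrix returned by `confusion_matrix(y_test, pred_logreg)`.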

### K-Nearest Neighbour Classification



In [None]:
knn = KNN(n_neighbors=2, metric='minkowski', p=2,)
knn.fit(X_train, y_train)
pred_knn = knn.predict(X_test)
Y_compare_knn = pd.DataFrame({'Actual' : y_test, 'Predicted' : pred_knn})
print(Y_compare_knn.head())
print('\nConfusion matrix:')
print(confusion_matrix(y_test, pred_knn))

In [None]:
cm = confusion_matrix(y_test, pred_knn)
plot_confusion_matrix(cm, classes = ['0 - Bad', '1 - Good'],
                      title = 'Confusion Matrix')

### Random Forest Classification

In [None]:
rfc = RandomForestClassifier(n_estimators=25, criterion='gini', random_state=0,)
rfc.fit(X_train, y_train)
pred_rf = rfc.predict(X_test)
Y_compare_rfc = pd.DataFrame({'Actual' : y_test, 'Predicted' : pred_rf})
print(Y_compare_rfc.head())
print('\nConfusion matrix:')
print(confusion_matrix(y_test, pred_rf))

In [None]:
cm = confusion_matrix(y_test, pred_rf)
plot_confusion_matrix(cm, classes = ['0 - Bad', '1 - Good'],
                      title = 'Confusion Matrix')

### Checking accuracy of the different models

In [None]:
# K-fold cross validation
modelNames = ['Logistic Regression', 'K-Nearest Neighbour', 'Random Forest']
modelClassifiers = [logreg, knn, rfc]
models = pd.DataFrame({'modelNames' : modelNames, 'modelClassifiers' : modelClassifiers})
score = []
for name, clf in zip(models['modelNames'], models['modelClassifiers']):
    # Mean accuracy over 10 folds of the training set
    accuracy = cross_val_score(clf, X_train, y_train, scoring='accuracy', cv=10)
    print('Accuracy of %s Classification model is %.2f' % (name, accuracy.mean()))
    score.append(accuracy.mean())

In [None]:
#Plotting the accuracies of different models
pd.DataFrame({'Model Name' : modelNames,'Score' : score}).sort_values(by='Score', ascending=True).plot(x=0, y=1, kind='bar', figsize=(15,5), title='Comparison of accuracies of different classification models')
plt.show()
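When reading these accuracies, a useful reference point is the accuracy of always predicting the most frequent class, since that baseline can already be high if one class dominates. A minimal sketch with made-up labels (`y_demo` is illustrative, not the notebook's `y_train`):

```python
import pandas as pd

# Made-up labels: 0 = bad, 1 = good.
y_demo = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

# Share of the most frequent class = accuracy of a constant prediction.
baseline_accuracy = y_demo.value_counts(normalize=True).max()
print(f"Majority-class baseline accuracy: {baseline_accuracy:.2f}")
```

A model is only informative to the extent that its cross-validated accuracy exceeds this baseline.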

### End of classification

#### If you find this notebook useful then please upvote