
# **WINE QUALITY ANALYSIS AND PREDICTION**

Link for Dataset: https://archive.ics.uci.edu/ml/datasets/wine+quality

Problem Statement

The dataset is related to red vinho verde wine samples, from the north of Portugal. The goal is to model wine quality based on physicochemical tests.
Input variables (based on physicochemical tests):

1. fixed acidity
2. volatile acidity
3. citric acid
4. residual sugar
5. chlorides
6. free sulfur dioxide
7. total sulfur dioxide
8. density
9. pH
10. sulphates
11. alcohol

Output variable (based on sensory data):

12. quality (score between 0 and 10)

Steps in modeling:

1. Data exploration
2. Feature engineering
3. Model training
4. Tuning and evaluation
5. Prediction


# 1. Data Loading & Exploration

In [None]:
#importing required libraries for exploratory data analysis

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
#loading data into pandas dataframe

df=pd.read_csv('../input/red-wine-quality-cortez-et-al-2009/winequality-red.csv')
df.head()

In [None]:
df.describe()

In [None]:
#checking for missing or null values

df.isna().sum()

In [None]:
df.columns

In [None]:
# extracting feature names from the dataset

features = (list(df.columns))
features.remove('quality')
features

## 1.1 Plotting frequency distribution of all features

In [None]:
i=1
plt.figure(figsize = (20,15))
for fe in features:
    plt.subplot(4,4,i,)
    sns.histplot(x=fe, data = df,legend=True,hue='quality',palette='Spectral_r')
    i=i+1

In [None]:
#correlation between the features and quality
sns.heatmap(df.corr())

## 1.2 Plotting features with target variable (quality)

In [None]:
i=1
plt.figure(figsize = (20,15))
for fe in features:
    plt.subplot(4,4,i,)
    sns.barplot(x='quality', y=fe, data = df)
    i=i+1

**From the graphs we can see that:**

1. Quality tends to increase with citric acid, sulphates and alcohol (positive correlation).
2. Quality tends to increase as volatile acidity and chlorides decrease (negative correlation).
3. Density and pH show little visible effect on quality (correlation close to zero).
4. Fixed acidity, residual sugar and sulphur dioxide show no particular trend with quality.
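The visual impressions above can be checked numerically by ranking the features by their correlation with the target. The following is a minimal sketch, not part of the original notebook: the helper name `correlation_with_target` and the toy values are assumptions made for illustration only.

```python
import pandas as pd

def correlation_with_target(frame: pd.DataFrame, target: str) -> pd.Series:
    """Pearson correlation of every other column with `target`, sorted descending."""
    corr = frame.corr()[target].drop(target)
    return corr.sort_values(ascending=False)

# Toy values for illustration only (not taken from the wine dataset).
toy = pd.DataFrame({
    "alcohol": [9.0, 9.5, 10.0, 11.0, 12.0],
    "volatile acidity": [0.9, 0.7, 0.6, 0.4, 0.3],
    "quality": [4, 5, 5, 6, 7],
})
ranked = correlation_with_target(toy, "quality")
print(ranked)
```

On the notebook's data this would be called as `correlation_with_target(df, 'quality')`, and it should be run while `quality` is still numeric, that is, before it is converted into two categories.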

# 2. Feature Engineering

Converting quality into two categories:

1. 0 for bad quality
2. 1 for good quality

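The quality scores in this dataset range from 3 to 8, so we divide them into two categories: low quality (0) for scores of 6 and below, and high quality (1) for scores of 7 and above. The bin edges `(2, 6.5, 8)` used below correspond to the intervals (2, 6.5] and (6.5, 8].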

In [None]:
bins = (2, 6.5, 8)
group_names = [0, 1]
df['quality'] = pd.cut(df['quality'], bins = bins, labels = group_names)
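To see how these bin edges map individual scores, here is a small self-contained check on the integer scores 3 to 8. The variable names are illustrative. Because `pd.cut` treats intervals as right-closed by default, a score equal to the lowest edge (2) would fall outside every bin and become `NaN`.

```python
import pandas as pd

# Integer quality scores covering the range seen in the data.
scores = pd.Series([3, 4, 5, 6, 7, 8])

# Same bin edges and labels as the notebook cell above.
labels = pd.cut(scores, bins=(2, 6.5, 8), labels=[0, 1])

# Map each score to the class it is assigned.
mapping = dict(zip(scores.tolist(), labels.astype(int).tolist()))
print(mapping)
```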

In [None]:
i=1
plt.figure(figsize = (20,15))
for fe in features:
    plt.subplot(4,4,i,)
    sns.histplot(x=fe, data = df,legend=True,hue='quality',palette='icefire')
    i=i+1

It can be inferred that for a wine to be of good quality:

  1.  fixed acidity should be between and 12
  2.  volatile acidity should be around 0.30
  3.  citric acid should be around 0.4
  4.  residual sugar should be around 0.2
  5.  chlorides should be less than 0.1
  6.  free sulphur dioxide should be between 0-20
  7.  total sulphur dioxide should be below 50
  8.  density should be less than 1
  9.  pH should be acidic
  10. sulphates should be below 1
  11. alcohol should be greater than 10

    (All measures in standard measuring units)

In [None]:
df.head(20)

In [None]:
# number of samples in each category of quality
sns.countplot(x= 'quality',data = df)

Note that the dataset has a significant class imbalance: far fewer wines are labelled good (1) than bad (0). Accuracy is therefore a poor metric for judging and comparing our models, because a model that always predicts the majority class would still reach a high accuracy.

The F1 score is the harmonic mean of precision and recall, so such a model would get a very low F1 score on the good-quality class.

Hence, the F1 score is a more reliable and robust measure than accuracy for this problem.
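The point can be illustrated with a trivial classifier that always predicts the majority class. This is a minimal sketch on made-up labels, not on the wine data: accuracy looks good while the F1 score for the minority class is zero.

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy labels for illustration: 9 "bad" wines (0) and 1 "good" wine (1).
y_true = [0] * 9 + [1]

# A trivial model that always predicts the majority class.
y_pred = [0] * 10

acc = accuracy_score(y_true, y_pred)
# zero_division=0 makes the undefined precision count as 0 without a warning.
f1 = f1_score(y_true, y_pred, zero_division=0)
print(f"accuracy={acc:.2f}, f1={f1:.2f}")
```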

In [None]:
# count the samples in each quality class
df['quality'].value_counts()

## 2.2 Converting the data to same scale

The frequency plots show that the features are not on the same scale, so we rescale each of them to the range 0-1.

In [None]:
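from sklearn.preprocessing import minmax_scale

# separate the features from the target
X = df.drop('quality', axis=1)
y = df['quality']

def scale_it(X):
    # rescale the values to the range [0, 1]
    return minmax_scale(X)

# scale every feature column independently
X = X.apply(scale_it)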

In [None]:
X
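Min-max scaling maps each value x of a column to (x - min) / (max - min), where min and max are taken over that column. The sketch below, on a made-up column, checks that this formula matches what `minmax_scale` returns.

```python
import numpy as np
from sklearn.preprocessing import minmax_scale

# Toy column for illustration only.
values = np.array([9.0, 10.0, 12.0, 14.0])

# Manual min-max scaling: the smallest value maps to 0 and the largest to 1.
manual = (values - values.min()) / (values.max() - values.min())
library = minmax_scale(values)

print(manual)
print(np.allclose(manual, library))
```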

## 2.3 Performing train-test split of data

In [None]:
from sklearn.model_selection import train_test_split
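# hold out 20% of the samples for testing; stratify keeps the class ratio the same
# in both parts and random_state makes the split reproducible
X_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=7)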
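With imbalanced classes, a purely random split can leave the test set with very few good wines. One option is the `stratify` parameter of `train_test_split`, which keeps the class proportions equal in both parts. A minimal sketch on toy labels (80 zeros and 20 ones, chosen only for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples with a 80/20 class imbalance.
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 80 + [1] * 20)

# Stratified split: both parts keep the 20% share of class 1.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=7
)
print(y_tr.mean(), y_te.mean())
```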

# 3. Training model

In [None]:
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(random_state=7)
model.fit(X_train,y_train)

# 4. Model Evaluation

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, f1_score, accuracy_score
predicted = model.predict(x_test)

In [None]:
print(classification_report(y_test,predicted))

In [None]:
confusion_matrix(y_test, predicted)

In [None]:
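print("Accuracy:", accuracy_score(y_test, predicted))
# F1 for the good-quality class (1), which is more informative with imbalanced classes
print("F1 score:", f1_score(y_test, predicted))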
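The outline lists tuning as part of this step. One possible way to tune the decision tree is a grid search scored by F1, sketched below. This is an assumption about how tuning could be done, not the notebook's own procedure: the synthetic data from `make_classification` stands in for the wine data, and the parameter grid is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the wine data (11 features, about 15% positives).
X_syn, y_syn = make_classification(
    n_samples=400, n_features=11, n_informative=5,
    weights=[0.85, 0.15], random_state=7,
)

# Hypothetical grid of tree settings to compare.
param_grid = {"max_depth": [3, 5, None], "min_samples_leaf": [1, 5]}

# 5-fold cross-validated search that ranks candidates by F1 instead of accuracy.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=7), param_grid, scoring="f1", cv=5
)
search.fit(X_syn, y_syn)
print(search.best_params_, round(search.best_score_, 3))
```

On the notebook's data the call would be `search.fit(X_train, y_train)`, with the test set kept aside for the final evaluation.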

# 5. Making predictions

In [None]:
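# raw (unscaled) measurements for one wine, in the same order as the feature columns
data = [4.4,0.50,0.20,1.1,0.276,7.0,23.0,0.438,1.51,0.52,9.4]

# scale the sample with the per-column min/max of the original data, as was done for X
raw = df.drop('quality', axis=1)
sample = pd.DataFrame([data], columns=raw.columns)
sample = (sample - raw.min()) / (raw.max() - raw.min())

if model.predict(sample)[0] == 1:
    print('Nice Quality! :)')
else:
    print('Not good! ;(')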

In [None]:
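# these values are already scaled to 0-1, so they must not be scaled again
data = [0.247788,0.294521,0.120000, 0.089041,0.156928,0.154930,0.233216,0.428047,0.244094,0.079641,0.223077]
sample = pd.DataFrame([data], columns=X.columns)
if model.predict(sample)[0] == 1:
    print('Nice Quality! :)')
else:
    print('Not good! ;(')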
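A new sample has to be scaled with the minimum and maximum learned from the training data, not with its own values. The sketch below wraps that in a small helper. The name `predict_quality`, the fitted `MinMaxScaler` object, and the two-feature toy data are assumptions for illustration and do not appear in the notebook.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

def predict_quality(model, scaler, raw_sample, columns):
    """Scale one raw sample with the already-fitted scaler and return the predicted class."""
    frame = pd.DataFrame([raw_sample], columns=columns)
    scaled = pd.DataFrame(scaler.transform(frame), columns=columns)
    return int(model.predict(scaled)[0])

# Toy training data for illustration only (two made-up features).
toy_X = pd.DataFrame({
    "alcohol": [9.0, 9.5, 12.0, 12.5],
    "sulphates": [0.4, 0.5, 0.8, 0.9],
})
toy_y = [0, 0, 1, 1]

# Fit the scaler once, then train the model on the scaled features.
toy_scaler = MinMaxScaler().fit(toy_X)
toy_scaled = pd.DataFrame(toy_scaler.transform(toy_X), columns=toy_X.columns)
toy_model = DecisionTreeClassifier(random_state=7).fit(toy_scaled, toy_y)

good = predict_quality(toy_model, toy_scaler, [12.2, 0.85], toy_X.columns)
bad = predict_quality(toy_model, toy_scaler, [9.2, 0.45], toy_X.columns)
print(good, bad)
```

Keeping the fitted scaler next to the model means the same transformation is applied at training time and at prediction time.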