The dataset is taken from Kaggle (https://www.kaggle.com/rajyellow46/wine-quality). It combines two datasets covering the red and white variants of the Portuguese "Vinho Verde" wine, described in [Cortez et al., 2009]. Due to privacy and logistic issues, only physicochemical (input) and sensory (output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).

These datasets can be treated as either classification or regression tasks. The data contains several variables, some of which are correlated with each other. Let's analyse the data and check how well these variables predict wine quality.

First we import the libraries that help us load the data and analyse it.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# 1. Import Libraries

In [None]:
import matplotlib 
from matplotlib import pyplot as plt
import seaborn as sns
sns.set(color_codes = True)
%matplotlib inline


from sklearn.linear_model import LinearRegression,SGDClassifier, RidgeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import LabelEncoder,MinMaxScaler , StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# 2. Import Data

In [None]:
df = pd.read_csv('/kaggle/input/wine-quality/winequalityN.csv')
df.head()

In [None]:
# Count the NaN values in each column
for i in df.columns:
    print (i+": "+str(df[i].isna().sum()))
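
The per-column loop above can also be written as a single vectorised call. The following is a minimal sketch on a made-up frame; the `toy` frame and its values are assumptions for illustration, not rows from the wine data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the wine data
toy = pd.DataFrame({
    "pH": [3.2, np.nan, 3.4],
    "sulphates": [0.5, 0.6, np.nan],
    "alcohol": [9.4, 10.1, 11.0],
})

# Count the missing values of every column in one call
missing_counts = toy.isna().sum()
print(missing_counts)

# Keep only the names of the columns that actually contain missing values
cols_with_nan = missing_counts[missing_counts > 0].index.tolist()
print(cols_with_nan)
```

Collecting the affected column names this way can be handy later, when the same columns need to be imputed.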

# 3. Visualize Data

In [None]:
# Correlation measures the linear relationship between each pair of variables.
# It shows how strongly each feature depends on the others, which lets us spot collinearity between features. An absolute correlation above 0.5 can cause problems, which we can avoid by dropping one feature of each highly correlated pair.

In [None]:
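# 'type' is a text column, so compute the correlation on the numeric columns only
correlation = df.select_dtypes(include='number').corr()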

In [None]:
plt.figure(figsize = (15,8))
sns.heatmap(correlation,annot = True, cmap = 'Blues')
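
Reading highly correlated pairs off a heatmap by eye gets tedious as the number of features grows. The sketch below lists the pairs whose absolute correlation exceeds the 0.5 threshold mentioned above. It uses synthetic data, and the column names `a`, `b`, `c` are assumptions made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "b" is built from "a", "c" is unrelated noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

corr = toy.corr().abs()

# Keep the strict upper triangle so each pair is listed once and the diagonal is skipped
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()

# 0.5 is the threshold used in the text above
high_pairs = pairs[pairs > 0.5]
print(high_pairs)
```

The same few lines could be pointed at the `correlation` frame computed above to decide which features are candidates for dropping.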

In [None]:
df.head()

In [None]:
# Use the correlations to check whether the columns with NA values have little effect on quality.
# If so, dropping those values would be fine, but here we replace them with the median of each feature.
df['pH'] = df['pH'].fillna(df['pH'].median())
df['sulphates'] = df['sulphates'].fillna(df['sulphates'].median())
df['chlorides'] = df['chlorides'].fillna(df['chlorides'].median())
df['residual sugar'] = df['residual sugar'].fillna(df['residual sugar'].median())
df['citric acid'] = df['citric acid'].fillna(df['citric acid'].median())
df['volatile acidity'] = df['volatile acidity'].fillna(df['volatile acidity'].median())
df['fixed acidity'] = df['fixed acidity'].fillna(df['fixed acidity'].median())
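
The seven near-identical lines above can be collapsed into one statement, because `fillna` accepts a Series of per-column fill values. This is a minimal sketch on a made-up frame; `toy` and `cols` are hypothetical names, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one gap per column
toy = pd.DataFrame({
    "pH": [3.0, np.nan, 3.4, 3.2],
    "sulphates": [0.4, 0.6, np.nan, 0.8],
})

cols = ["pH", "sulphates"]

# median() returns one value per column; fillna matches them up by column name
toy[cols] = toy[cols].fillna(toy[cols].median())
print(toy)
```

Keeping the column list in one place also makes it harder to forget a column or to mistype a name in one of the repeated lines.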

In [None]:

x = np.unique(df["quality"])
x

As we can see, the quality score takes integer values in a narrow range (3 to 8 for the red wines; the white wines also include a few scores of 9). Low-quality wines have low scores and high-quality wines have high scores, so we will assign a class to each score and try to predict the classes.

# 4. Preprocessing Data

In [None]:
def values(x):
    # Bin the numeric quality score into three classes
    if x <= 5:
        return 'low'
    elif x < 7:
        return 'medium'
    else:
        return 'high'

df['level'] = df['quality'].apply(values)
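
For integer scores, the same binning can be expressed with `pd.cut`, which keeps the bin edges and labels together. The sketch below is an alternative under that assumption, not the notebook's own method; the `quality` values are invented for the example.

```python
import pandas as pd

# Hypothetical integer quality scores
quality = pd.Series([3, 5, 6, 7, 9])

# Right-closed bins: (-inf, 5] -> low, (5, 6] -> medium, (6, inf) -> high
level = pd.cut(
    quality,
    bins=[-float("inf"), 5, 6, float("inf")],
    labels=["low", "medium", "high"],
)
print(level.tolist())
```

Note that a non-integer score such as 6.5 would land in `high` here but in `medium` in the `values` function above, so the two only agree on whole-number scores.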

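Using a preprocessing method, convert the quality classes into a numerical variable. Note that LabelEncoder assigns integer codes in alphabetical order of the class names (high = 0, low = 1, medium = 2), so the codes act as class identifiers rather than as an ordinal ranking.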

In [None]:
label = LabelEncoder()

quality_score  = label.fit_transform(df['level'])

print(quality_score)
print((label.classes_))
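
If a true low < medium < high ordering is wanted, an explicit mapping is one simple option. The sketch below contrasts it with `LabelEncoder`; the `levels` series and the `order` dictionary are assumptions introduced for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical class labels
levels = pd.Series(["low", "medium", "high", "low"])

# LabelEncoder sorts the labels, so the codes follow alphabetical order
enc = LabelEncoder()
alpha_codes = enc.fit_transform(levels)
print(list(enc.classes_), alpha_codes.tolist())

# An explicit mapping keeps the low < medium < high order
order = {"low": 0, "medium": 1, "high": 2}
ordinal_codes = levels.map(order)
print(ordinal_codes.tolist())
```

For the classifiers used later the distinction does not change the predictions, since they treat the codes as unordered class labels, but it matters when reading a confusion matrix or treating the target as ordered.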

In [None]:
# seaborn provides nice visualisations; this bar plot shows the mean alcohol content of each class.
plt.figure(figsize = (15,8))
ax = sns.barplot(x="level", y="alcohol", data=df)

In [None]:
# Next, check the mean sulphates level of each class and which class has the most.
plt.figure(figsize = (15,8))
ax = sns.barplot(x="level", y="sulphates", data=df)

In [None]:
ax = sns.countplot(x="level", data=df, palette="Set3")

Outliers can introduce errors, so first check whether any are present in the data; if they are, try to remove them.
We look for outliers using histograms drawn with matplotlib, and also check how each variable is distributed to help decide which algorithm is best suited to predicting accurately.

In [None]:
df.hist(bins=10,figsize=(15,12))
plt.show()
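
Histograms show outliers visually; a numeric rule can complement them. The sketch below applies the common 1.5 × IQR rule to a made-up series. The rule and the `values` data are assumptions chosen for illustration, not something the notebook itself uses.

```python
import pandas as pd

# Hypothetical measurements with one obviously extreme value
values = pd.Series([1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 4.0, 4.0, 5.0, 40.0])

# Interquartile range and the usual 1.5 * IQR fences
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Anything outside the fences is flagged as a potential outlier
outliers = values[(values < lower) | (values > upper)]
print(outliers.tolist())
```

Whether flagged points should be removed is a judgment call; skewed physicochemical measurements can have long tails that are genuine rather than erroneous.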

In [None]:
# As discussed in the introduction, the data contains two types of wine, so we use dummy encoding to convert this categorical feature into a numerical one.
# drop_first removes the alphabetically first category, so the remaining column flags the other one.
df['type'] = pd.get_dummies(df['type'],drop_first = True)

In [None]:
x = df.iloc[:,:-2]
x.head()

In [None]:
ax = sns.countplot(x="type", data=df, palette="Set3")

From the count plot above, you can see that most of the data comes from white wine.

Gradient-based models minimise a cost function, and they converge more reliably when the features are on comparable scales. Some features in this dataset have much larger values than others, so we apply feature scaling.

In [None]:
standard = StandardScaler()

std_x = standard.fit_transform(x)
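
The cell above fits the scaler on every row, including rows that later end up in the test set. A common variation fits the scaler on the training rows only by wrapping it in a pipeline. The sketch below shows that pattern on synthetic data; the data from `make_classification` and the name `model` are assumptions, and this is an alternative to, not a description of, the notebook's approach.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class data standing in for the wine features
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=40)

# The pipeline fits the scaler on the training rows only,
# then reuses those statistics when scoring the test rows
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

test_accuracy = model.score(X_test, y_test)
print(test_accuracy)
```

With this arrangement the test rows never influence the mean and standard deviation used for scaling, which keeps the reported score closer to what unseen data would give.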

# 5. Split Data

In [None]:
x_train,x_test,y_train,y_test = train_test_split(std_x,quality_score,test_size = 0.20,random_state = 40)


print("Training data:{}".format(x_train.shape))
print("Test data:{}".format(x_test.shape))
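
Because the classes are not equally frequent, a plain random split can leave the test set with a different class mix from the training set. Passing `stratify` to `train_test_split` preserves the proportions. The sketch below uses a made-up 80/20 label vector; the arrays `X` and `y` here are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 samples of class 0, 20 of class 1
y = np.array([0] * 80 + [1] * 20)
X = np.arange(100).reshape(-1, 1)

# stratify=y keeps the 80/20 class ratio in both parts of the split
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=40, stratify=y)

print(np.bincount(y_tr), np.bincount(y_te))
```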

In [None]:
results = []

# 6. Build Model and Check Accuracy for Each Model

In [None]:
clf = SGDClassifier(max_iter = 10000,random_state = 0)



clf.fit(x_train,y_train)
y_predicted = clf.predict(x_test)
score = clf.score(x_test,y_test)


print(score)
results.append(score)

In [None]:
clf_1 = RidgeClassifier(alpha = 2,max_iter = 10000)



clf_1.fit(x_train,y_train)
y_predicted = clf_1.predict(x_test)
score = clf_1.score(x_test,y_test)


print(score)
results.append(score)

In [None]:
clf = LogisticRegression(max_iter= 10000,solver ='newton-cg',random_state = 0,n_jobs = 2 )

clf.fit(x_train,y_train)
y_predicted = clf.predict(x_test)
score = clf.score(x_test,y_test)


print(score)
results.append(score)

# 7. Use Confusion Matrix

In [None]:
cnf_matrix = confusion_matrix(y_test, y_predicted)
np.set_printoptions(precision=2)
cnf_matrix
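
Accuracy alone hides how each class fares. Per-class recall and precision can be read straight off a confusion matrix, as sketched below. The label arrays are invented for the example and are not predictions from the models above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels for three classes
y_true = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 2, 2, 2, 2, 0])

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)

# Recall: correct predictions divided by the row (true class) totals
recall = cm.diagonal() / cm.sum(axis=1)

# Precision: correct predictions divided by the column (predicted class) totals
precision = cm.diagonal() / cm.sum(axis=0)

print(cm)
print(recall, precision)
```

Applied to `cnf_matrix`, the same two lines would show which quality class a model struggles with most.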

In [None]:
import itertools

def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.tight_layout()

In [None]:
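# Order the class names like the encoder's classes so the tick labels match the confusion-matrix rows
classes = df['level'].value_counts().reindex(label.classes_)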

plt.figure()
plot_confusion_matrix(cnf_matrix, classes=classes.index,
                      title='Confusion matrix, without normalization')
# With normalization
plt.figure()
plot_confusion_matrix(cnf_matrix, classes= classes.index, normalize=True,
                      title='Normalized confusion matrix')

plt.show()

In [None]:
clf_1 = DecisionTreeClassifier(criterion = 'entropy',min_samples_split = 7,max_depth = 8)



clf_1.fit(x_train,y_train)
y_predicted = clf_1.predict(x_test)
score = clf_1.score(x_test,y_test)


print(score)
results.append(score)

In [None]:
cnf_matrix = confusion_matrix(y_test, y_predicted)
np.set_printoptions(precision=2)
cnf_matrix

In [None]:
# Build Model
clf = RandomForestClassifier(criterion= "entropy",bootstrap = False,n_estimators = 1000,n_jobs = 2,verbose = 1,max_features =3)
clf.fit(x_train, y_train)
y_predicted = clf.predict(x_test)
score=clf.score(x_test,y_test)
results.append(score)

print(score)

In [None]:
cnf_matrix = confusion_matrix(y_test, y_predicted)
np.set_printoptions(precision=2)
cnf_matrix

In [None]:
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=classes.index,
                      title='Confusion matrix, without normalization')
# With normalization
plt.figure()
plot_confusion_matrix(cnf_matrix, classes= classes.index, normalize=True,
                      title='Normalized confusion matrix')

plt.show()

In [None]:
result_df = pd.DataFrame({"ML Models":["SGDClassifier","Ridge classifier","Logistic Regression",
                                       "Decision Tree","Random Forest"],"Score":results})

In [None]:
result_df
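
Sorting the summary table makes the comparison easier to read. The sketch below does this on a stand-in frame; the model names and scores in `toy_results` are placeholders invented for the example, not results from this notebook.

```python
import pandas as pd

# Placeholder names and scores, for illustration only
toy_results = pd.DataFrame({
    "ML Models": ["Model A", "Model B", "Model C"],
    "Score": [0.61, 0.72, 0.58],
})

# Highest score first, with a clean 0..n-1 index
ranked = toy_results.sort_values("Score", ascending=False).reset_index(drop=True)
print(ranked)
```

The same call on `result_df` would put the best-scoring model in the first row.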