In this kernel I perform Exploratory Data Analysis (EDA) on the **Red Wine Quality** dataset and try to identify the relationship between the quality of a wine and its other features. After EDA and data pre-processing, I apply the **k-NN (k-Nearest Neighbors)**, **Logistic Regression** and **Decision Tree** algorithms to make predictions. I plan to add other algorithms to this kernel in the future.

I hope you find this kernel helpful. Some **<font color='red'>UPVOTES</font>** would be very much appreciated.

In [None]:
import warnings
warnings.filterwarnings('ignore')

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

### Importing required libraries

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

%matplotlib inline

# setting plot style for all the plots
plt.style.use('fivethirtyeight')

### Loading the data

In [None]:
df = pd.read_csv('/kaggle/input/red-wine-quality-cortez-et-al-2009/winequality-red.csv')
df.head()

### Dimensions of the dataset

In [None]:
print('Number of rows in the dataset: ',df.shape[0])
print('Number of columns in the dataset: ',df.shape[1])

### Features in the data set

In [None]:
df.info()

### Basic statistical details about the dataset

In [None]:
df.describe().round(decimals=3)

**The statistics shown in the table above are:**

**1. Count** tells us the number of non-empty (non-null) rows in a feature.

**2. Mean** tells us the mean value of that feature.

**3. Std** tells us the standard deviation of that feature.

**4. Min** tells us the minimum value of that feature.

**5. 25%, 50%, and 75%** are the percentiles (quartiles) of each feature; the 50% value is the median.

**6. Max** tells us the maximum value of that feature.
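
To make these statistics concrete, here is a small sketch that computes them by hand with the standard library. The values are made up for illustration and are not taken from the wine dataset. As far as I know, `describe()` reports the sample standard deviation (dividing by n - 1) and linearly interpolated percentiles, which is what `statistics.stdev` and the `'inclusive'` quantile method compute.

```python
import statistics

# Hypothetical feature values, NOT taken from the wine dataset
values = [6.0, 7.0, 7.5, 8.0, 9.0, 10.5]

# Quartiles with linear interpolation between neighbouring data points
q1, q2, q3 = statistics.quantiles(values, n=4, method='inclusive')

summary = {
    'count': len(values),
    'mean': statistics.mean(values),
    'std': statistics.stdev(values),   # sample standard deviation (n - 1)
    'min': min(values),
    '25%': q1,
    '50%': q2,                         # the median
    '75%': q3,
    'max': max(values),
}

for name, value in summary.items():
    print(f'{name:>5}: {value:.3f}')
```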

## Exploratory Data Analysis (EDA)

### 1. Number of wines of a given quality in the dataset

In [None]:
plt.figure(figsize=(10, 6))
sns.countplot(x='quality', data=df)
plt.title('Number of wines present in the dataset of a given quality')
plt.show()

#### Plotting the relationship between quality of wine and various other features

In [None]:
# Function to plot barplot and boxplot of a given feature
def plot(x_val, y_val, palette='pastel'):
    fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(12, 6))
    sns.barplot(x=x_val, y=y_val, data=df, ax=ax[0], palette=palette)
    sns.boxplot(x= x_val, y= y_val, data=df, ax=ax[1],palette=palette, linewidth=3)
    plt.tight_layout(w_pad=2)
    plt.show()

### 2. Fixed Acidity vs. Quality

In [None]:
plot('quality','fixed acidity')

### 3. Volatile Acidity vs. Quality

In [None]:
plot('quality', 'volatile acidity')

### 4. Citric Acid vs. Quality

In [None]:
plot('quality', 'citric acid')

### 5. Residual Sugar vs. Quality

In [None]:
plot('quality', 'residual sugar')

### 6. Chlorides vs. Quality

In [None]:
plot('quality', 'chlorides')

### 7. Correlation Heatmap between various features

In [None]:
plt.figure(figsize=(12,8))
corr = df.corr()
# np.bool was removed from recent NumPy versions; the built-in bool works everywhere
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True

sns.heatmap(corr,mask=mask, annot=True, linewidths=1, cmap='YlGnBu')
plt.show()

## Preprocessing the data before applying Machine Learning algorithms

### 1. Dividing the wine quality as good or bad to make it a binary classification problem

In [None]:
bins = (2, 6.5, 8)
group_names = ['bad', 'good']
df['quality'] = pd.cut(df['quality'], bins = bins, labels = group_names)

The quality column in the dataset now has only two values i.e. good and bad.
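
To show how the bin edges `(2, 6.5, 8)` turn scores into labels, here is a plain-Python sketch of the same mapping. It assumes the default right-inclusive intervals of `pd.cut`, i.e. `(2, 6.5]` becomes `bad` and `(6.5, 8]` becomes `good`. The helper `to_group` is a hypothetical name used only for this illustration.

```python
import bisect

bin_edges = (2, 6.5, 8)           # same edges as in the cell above
group_names = ['bad', 'good']

def to_group(score):
    """Map a quality score to its label using right-inclusive bins (lo, hi]."""
    if not bin_edges[0] < score <= bin_edges[-1]:
        return None               # outside all bins (pd.cut would give NaN)
    # bisect_left counts how many edges lie strictly below the score
    return group_names[bisect.bisect_left(bin_edges, score) - 1]

# The quality scores 3 to 8 and the label each one receives
labels = {score: to_group(score) for score in range(3, 9)}
print(labels)
```

So scores 3 to 6 end up as `bad`, and only scores 7 and 8 count as `good`.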

In [None]:
df.head()

### Number of good and bad quality wines in the dataset

In [None]:
plt.figure(figsize=(7,6))
sns.countplot(x='quality', data=df, palette='pastel')
plt.title('Number of good and bad quality wines')
plt.show()

### 2. Assigning a label(numerical value) to the quality variable.

In [None]:
from sklearn.preprocessing import LabelEncoder

In [None]:
label_encoder = LabelEncoder()
df['quality'] = label_encoder.fit_transform(df['quality'])
df.head(3)

The **'quality'** column now contains the values 0 (bad) and 1 (good). Although LabelEncoder assigns incremental integer values (0, 1, 2, 3, ...), it can be used here in place of a one-hot encoder because the quality column has only two classes.
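
As a rough sketch of what the encoding step does, the snippet below sorts the distinct class names and numbers them from 0. The sample labels are made up, and this is only an illustration of the idea, not scikit-learn's actual implementation.

```python
# Hypothetical labels, only for illustration
labels = ['bad', 'good', 'bad', 'bad', 'good']

# Sort the distinct classes, then number them 0, 1, 2, ...
classes = sorted(set(labels))
to_code = {name: code for code, name in enumerate(classes)}

encoded = [to_code[name] for name in labels]
print(to_code)
print(encoded)
```

Because the classes are sorted alphabetically, `bad` maps to 0 and `good` maps to 1.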

## Implementing Machine Learning Algorithms

### 1. Splitting the features and target variables

In [None]:
X = df.drop('quality', axis=1)
y = df['quality']

### 2. Scaling the features

In [None]:
from sklearn.preprocessing import scale

In [None]:
X_scaled = scale(X)
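
To illustrate what scaling does to a single column, here is a minimal standard-library sketch of standardization: subtract the column mean and divide by the population standard deviation, so the result has mean 0 and standard deviation 1. The column values and the helper name `standardize` are assumptions made for this example.

```python
import statistics

def standardize(column):
    """Return the z-scores of a column: (value - mean) / population std."""
    mean = statistics.fmean(column)
    # Fall back to 1.0 for a constant column to avoid dividing by zero
    std = statistics.pstdev(column) or 1.0
    return [(value - mean) / std for value in column]

# Hypothetical alcohol values, NOT taken from the wine dataset
alcohol = [9.0, 10.0, 11.0, 12.0, 13.0]
scaled = standardize(alcohol)
print([round(value, 3) for value in scaled])
```

Scaling matters for k-NN in particular, because distances would otherwise be dominated by the features with the largest numeric range.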

### 3. Splitting the dataset into Training and Testing sets

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, stratify=y, random_state=41)
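
The `stratify=y` argument keeps the share of good and bad wines the same in the training and testing sets. The sketch below shows the idea on made-up labels: split each class separately, then combine. It is a simplified illustration with a hypothetical helper `stratified_split`, not the algorithm that `train_test_split` actually uses.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size, seed):
    """Split row indices so each class keeps roughly the same share in both sets."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical imbalanced labels: 80 bad (0) and 20 good (1)
labels = [0] * 80 + [1] * 20
train_idx, test_idx = stratified_split(labels, test_size=0.3, seed=41)

train_share = sum(labels[i] for i in train_idx) / len(train_idx)
test_share = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), train_share, test_share)
```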

### 4. Applying ML Algorithms

### i. K Nearest Neighbors

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

In [None]:
knn = KNeighborsClassifier()

In [None]:
params = {
    'n_neighbors':list(range(1,15)),
    'p':[1, 2, 3, 4],
    'leaf_size':list(range(1,50)),
    'weights':['uniform', 'distance']
}

In [None]:
# Doing Gridsearch to find optimal parameters
knn_grid = GridSearchCV(estimator=knn, param_grid=params, scoring='accuracy',cv=5,n_jobs=-1)
knn_grid.fit(X_train, y_train)
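
Before running a grid search it helps to know how many models it will fit. The sketch below counts the combinations in the k-NN grid above and multiplies by the 5 cross-validation folds. The variable names are my own, and the grid is copied here so the snippet runs on its own.

```python
import math

# Same grid as in the cell above, repeated so this snippet is self-contained
grid = {
    'n_neighbors': list(range(1, 15)),
    'p': [1, 2, 3, 4],
    'leaf_size': list(range(1, 50)),
    'weights': ['uniform', 'distance'],
}
cv_folds = 5

# Every combination of values is one candidate model
n_candidates = math.prod(len(values) for values in grid.values())
n_fits = n_candidates * cv_folds

# The same count if leaf_size is left at its default
n_without_leaf_size = n_candidates // len(grid['leaf_size'])

print('Candidates:', n_candidates)
print('Fits with 5-fold CV:', n_fits)
print('Candidates without leaf_size:', n_without_leaf_size)
```

As far as I understand, `leaf_size` only affects the speed and memory use of the tree-based neighbor search, not the predictions, so leaving it out of the grid should shrink the search considerably without changing the best score.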

#### Best parameters for the model

In [None]:
knn_grid.best_params_

#### Best score for the model

In [None]:
knn_grid.best_score_

#### Making predictions

In [None]:
knn_predict = knn_grid.predict(X_test)

#### Accuracy

In [None]:
from sklearn.metrics import accuracy_score,confusion_matrix
print('Accuracy Score: ',accuracy_score(y_test,knn_predict))
print('Using k-NN we get an accuracy score of: ',
      round(accuracy_score(y_test,knn_predict),5)*100,'%')

#### Confusion Matrix

In [None]:
from sklearn.metrics import confusion_matrix


# Function to plot the confusion matrix as a heatmap
def conf_matrix(actual, predicted, model_name):
    cnf_matrix = confusion_matrix(actual, predicted)
    class_names = [0,1]
    fig,ax = plt.subplots()
    tick_marks = np.arange(len(class_names))
    plt.xticks(tick_marks,class_names)
    plt.yticks(tick_marks,class_names)

    #create a heat map
    sns.heatmap(pd.DataFrame(cnf_matrix), annot = True, cmap = 'YlGnBu',
               fmt = 'g')
    ax.xaxis.set_label_position('top')
    plt.tight_layout()
    plt.title('Confusion matrix for ' + model_name + ' Model', y = 1.1)
    plt.ylabel('Actual label')
    plt.xlabel('Predicted label')
    plt.show()

In [None]:
conf_matrix(y_test, knn_predict, 'k-Nearest Neighbors')
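
To read the heatmap correctly, it helps to see how the four cells are counted. The sketch below builds a 2x2 confusion matrix by hand from made-up labels, with rows as the actual class and columns as the predicted class, the same layout `confusion_matrix` uses.

```python
from collections import Counter

# Hypothetical labels, only for illustration (0 = bad, 1 = good)
actual    = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
predicted = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# Count each (actual, predicted) pair
counts = Counter(zip(actual, predicted))

# Rows are the actual class, columns are the predicted class
matrix = [[counts[(a, p)] for p in (0, 1)] for a in (0, 1)]

# Accuracy is the diagonal (correct predictions) over all predictions
accuracy = (matrix[0][0] + matrix[1][1]) / len(actual)

print(matrix)
print('Accuracy:', accuracy)
```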

#### Classification report

In [None]:
from sklearn.metrics import classification_report

In [None]:
print(classification_report(y_test, knn_predict))
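
The report lists precision, recall and F1-score for each class. Here is a minimal sketch of how those three numbers are computed for the positive class, using made-up labels. The helper name `binary_report` is hypothetical.

```python
def binary_report(actual, predicted, positive=1):
    """Return precision, recall and F1 for the given positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical imbalanced labels (0 = bad, 1 = good)
actual    = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
predicted = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

precision, recall, f1 = binary_report(actual, predicted)
accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
print(f'accuracy={accuracy:.3f} precision={precision:.3f} '
      f'recall={recall:.3f} f1={f1:.3f}')
```

In this toy example the accuracy is 0.7 even though only half of the good wines are found, which is why the per-class numbers are worth checking when the classes are imbalanced.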

#### Receiver Operating Characteristic (ROC) Curve

In [None]:
from sklearn.metrics import roc_auc_score,roc_curve

In [None]:
y_probabilities = knn_grid.predict_proba(X_test)[:,1]

#Create true and false positive rates
false_positive_rate_knn,true_positive_rate_knn,threshold_knn = roc_curve(y_test,y_probabilities)

#Plot ROC Curve
plt.figure(figsize=(10,6))
plt.title('Receiver Operating Characteristic')
plt.plot(false_positive_rate_knn,true_positive_rate_knn, linewidth=2)
plt.plot([0,1],ls='--', linewidth=2)
plt.plot([0,0],[1,0],c='.5', linewidth=2)
plt.plot([1,1],c='.5',linewidth=2)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()

In [None]:
#Calculate area under the curve
roc_auc_score(y_test,y_probabilities)
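
To show where the curve and the area come from, the sketch below computes ROC points by lowering the decision threshold one score at a time and then sums the area with the trapezoid rule. The labels and scores are made up, and the helper names are my own. This is an illustration of the idea rather than scikit-learn's implementation.

```python
def roc_points(y_true, scores):
    """Return (false positive rate, true positive rate) pairs for each threshold."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]
    # Lower the threshold step by step; more samples get predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

def trapezoid_area(points):
    """Area under a piecewise-linear curve given as (x, y) points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical labels and predicted probabilities of the positive class
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

points = roc_points(y_true, scores)
auc = trapezoid_area(points)
print(points)
print('AUC:', auc)
```

An AUC of 0.5 corresponds to the dashed diagonal (random guessing), and 1.0 corresponds to a curve that hugs the top-left corner.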

### ii. Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
logreg = LogisticRegression()

In [None]:
params = {'C':[0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000],
             'class_weight': [{1:0.5, 0:0.5}, {1:0.4, 0:0.6},{1:0.6, 0:0.4}, {1:0.7, 0:0.3},{1:0.3, 0:0.7}],
             'penalty': ['l1', 'l2'],
             'solver': ['liblinear', 'saga'],
             'max_iter':[50,100,150,200]
             }

In [None]:
# Doing Gridsearch to find optimal parameters
log_grid = GridSearchCV(estimator=logreg, param_grid=params, scoring='accuracy', cv=5, n_jobs=-1)
log_grid.fit(X_train, y_train)

#### Best parameters for the model

In [None]:
log_grid.best_params_

#### Best score for the model

In [None]:
log_grid.best_score_

#### Making predictions

In [None]:
log_predict = log_grid.predict(X_test)

#### Accuracy

In [None]:
print('Accuracy Score: ',accuracy_score(y_test,log_predict))
print('Using Logistic Regression we get an accuracy score of: ',
      round(accuracy_score(y_test,log_predict),5)*100,'%')

#### Confusion Matrix

In [None]:
conf_matrix(y_test, log_predict, 'Logistic Regression')

#### Classification report

In [None]:
print(classification_report(y_test, log_predict))

#### Receiver Operating Characteristic (ROC) Curve

In [None]:
y_probabilities = log_grid.predict_proba(X_test)[:,1]

#Create true and false positive rates
false_positive_rate_log,true_positive_rate_log,threshold_log = roc_curve(y_test,y_probabilities)

#Plot ROC Curve
plt.figure(figsize=(10,6))
plt.title('Receiver Operating Characteristic')
plt.plot(false_positive_rate_log,true_positive_rate_log, linewidth=2)
plt.plot([0,1],ls='--', linewidth=2)
plt.plot([0,0],[1,0],c='.5', linewidth=2)
plt.plot([1,1],c='.5', linewidth=2)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()

In [None]:
#Calculate area under the curve
roc_auc_score(y_test,y_probabilities)

### iii. Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
dt = DecisionTreeClassifier()

In [None]:
param_grid = {
    'criterion': ['gini','entropy'],
    'max_depth': [None, 1, 2, 3, 4, 5, 6],
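    # 'auto' was removed in recent scikit-learn; for classifiers it was the same as 'sqrt'
    'max_features': ['sqrt', 'log2'],
    # max_leaf_nodes must be None or greater than 1
    'max_leaf_nodes': [None, 2, 3, 4, 5, 6],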
    'min_samples_leaf': [1,2,3,4,5,6,7],
    'min_samples_split': [2,3,4,5,6,7,8,9,10]
}

In [None]:
# Doing Gridsearch to find optimal parameters
dt_grid = GridSearchCV(estimator=dt, param_grid=param_grid, scoring='accuracy',cv=5, n_jobs=-1)
dt_grid.fit(X_train, y_train)

#### Best parameters for the model

In [None]:
dt_grid.best_params_

#### Best score for the model

In [None]:
dt_grid.best_score_

#### Making predictions

In [None]:
dt_predict = dt_grid.predict(X_test)

#### Accuracy

In [None]:
print('Accuracy Score: ',accuracy_score(y_test,dt_predict))
print('Using Decision Tree Classifier we get an accuracy score of: ',
      round(accuracy_score(y_test,dt_predict),5)*100,'%')

#### Confusion Matrix

In [None]:
conf_matrix(y_test, dt_predict, 'Decision Tree')

#### Classification report

In [None]:
print(classification_report(y_test, dt_predict))

#### Receiver Operating Characteristic (ROC) Curve

In [None]:
y_probabilities = dt_grid.predict_proba(X_test)[:,1]

#Create true and false positive rates
false_positive_rate_dt,true_positive_rate_dt,threshold_dt = roc_curve(y_test,y_probabilities)

#Plot ROC Curve
plt.figure(figsize=(10,6))
plt.title('Receiver Operating Characteristic')
plt.plot(false_positive_rate_dt,true_positive_rate_dt, linewidth=2)
plt.plot([0,1],ls='--', linewidth=2)
plt.plot([0,0],[1,0],c='.5', linewidth=2)
plt.plot([1,1],c='.5', linewidth=2)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()

In [None]:
#Calculate area under the curve
roc_auc_score(y_test,y_probabilities)

### Comparing ROC Curve of k-Nearest Neighbors, Logistic Regression and Decision Tree


In [None]:
#Plot ROC Curve
plt.figure(figsize=(10,6))
plt.title('Receiver Operating Characteristic Curve')
plt.plot(false_positive_rate_knn,true_positive_rate_knn,linewidth=2, label='k-Nearest Neighbor')
plt.plot(false_positive_rate_log,true_positive_rate_log, linewidth=2, label='Logistic Regression')
plt.plot(false_positive_rate_dt,true_positive_rate_dt, linewidth=2, label='Decision Tree')
plt.plot([0,1],ls='--', linewidth=2)
plt.plot([0,0],[1,0],c='.5', linewidth=2)
plt.plot([1,1],c='.5', linewidth=2)
plt.ylabel('True positive rate')
plt.xlabel('False positive rate')
plt.legend()
plt.show()

**What's next?**
1. Applying SVM and Random Forest Algorithms
2. Applying various ensemble methods such as bagging, boosting.
3. Compare the models on the basis of their accuracy score.

**Suggestions are welcome**

**<font color='red'>UPVOTE</font>** if you found the notebook helpful.