<font color = 'purple'>
    
# Introduction
<font color = 'black'>

In this notebook we are going to try to determine which physicochemical properties make a good-quality wine.


<font color = 'purple'>

Content:
1. [Load and Check Data](#1)
2. [Variable Description](#2)
3. [Variable Visualization](#3)
   * [Fixed acidity](#4)
   * [Volatile acidity](#5)
   * [Citric acid](#6)
   * [Residual sugar](#7)      
   * [Chlorides](#8)   
   * [Free sulfur dioxide](#9)   
   * [Total sulfur dioxide](#10) 
   * [Density](#11)
   * [pH](#12)
   * [Sulphates](#13)             
   * [Alcohol](#14)           
   * [Quality](#15)
1. [Preparing Data](#16)
1. [Modelling](#17)
    * [Train-Test Split](#18)
    * [Simple Logistic Regression Model](#19)
    * [Hyperparameter Tuning-Grid Search-Cross Validation](#20)


In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
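import seaborn as sns
# Newer matplotlib releases no longer accept the "seaborn-whitegrid" style name,
# so the white-grid look is set through seaborn instead.
sns.set_style("whitegrid")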

import warnings
warnings.filterwarnings("ignore")


# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

<a id="1"></a>
<font color='purple'>
## Load and Check Data

In [None]:
data=pd.read_csv("/kaggle/input/red-wine-quality-cortez-et-al-2009/winequality-red.csv")

In [None]:
data.info()

In [None]:
data.isnull().any()

In [None]:
data.columns

In [None]:
data.head()

In [None]:
data.tail()

In [None]:
data["quality"].unique()

<a id="2"></a>
<font color='purple'>
## Variable Description

1. fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)
1. volatile acidity: the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste
1. citric acid: found in small quantities, citric acid can add 'freshness' and flavor to wines
1. residual sugar: the amount of sugar remaining after fermentation stops; it's rare to find wines with less than 1 gram/liter, and wines with greater than 45 grams/liter are considered sweet
1. chlorides: the amount of salt in the wine
1. free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
1. total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
1. density: the density of wine is close to that of water, depending on the percent alcohol and sugar content
1. pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
1. sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant
1. alcohol: the percent alcohol content of the wine

<a id="3"></a>
<font color='purple'>
## Variable Visualization

We are going to check the correlation between features with a heatmap. Then we will use this plot to decide which features affect quality more and which affect it less.

In [None]:
data.corr()

In [None]:
f,ax=plt.subplots(figsize=(10,10))
sns.heatmap(data.corr(), annot=True, linewidths=.5, ax=ax)  # draw on the 10x10 axes created above
plt.show()
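The quality row of the heatmap can also be read as a sorted list, which makes it easier to see which features are weakly related to the target. The snippet below is only a sketch: `rank_by_target_correlation` is a helper name I made up, and `toy` is a small table of invented values standing in for the wine data.

```python
import pandas as pd

def rank_by_target_correlation(df: pd.DataFrame, target: str) -> pd.Series:
    """Pearson correlation of every feature with `target`, strongest first."""
    corr = df.corr()[target].drop(target)
    # Order by absolute value so strong negative correlations rank high too.
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

# Invented values, for illustration only (not taken from the wine dataset).
toy = pd.DataFrame({
    "alcohol": [9.0, 9.5, 10.0, 11.0, 12.0, 12.5],
    "volatile acidity": [0.9, 0.8, 0.6, 0.5, 0.4, 0.3],
    "residual sugar": [2.0, 2.6, 1.9, 2.5, 2.1, 2.4],
    "quality": [3, 4, 5, 6, 7, 8],
})

ranking = rank_by_target_correlation(toy, "quality")
print(ranking)
```

On the real table the call would be `rank_by_target_correlation(data, "quality")`; features at the bottom of the list are the candidates for dropping.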

It looks like residual sugar, pH and free sulfur dioxide have a weaker correlation with quality than the other features. I am going to look at the barplots too, so I can be sure which features to drop before I start modelling. First, I am going to normalize our features so we get healthier results.


In [None]:
x_data=data.drop(["quality"],axis=1)
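# Min-max scaling per column; DataFrame.min()/max() are used because np.min/np.max
# on a DataFrame reduce over both axes in recent pandas versions.
x=(x_data-x_data.min())/(x_data.max()-x_data.min())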
x.head()
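The normalization used here is min-max scaling: each column is mapped to the range [0, 1] with `(value - min) / (max - min)`. Here is a small worked example on invented numbers; `min_max_scale` and `toy` are names I picked for the illustration.

```python
import pandas as pd

def min_max_scale(df: pd.DataFrame) -> pd.DataFrame:
    """Rescale every column to [0, 1] using that column's own min and max."""
    col_min = df.min()               # per-column minimum (a Series)
    col_range = df.max() - col_min   # per-column max - min
    return (df - col_min) / col_range

# Invented values, for illustration only.
toy = pd.DataFrame({"alcohol": [9.0, 10.0, 13.0], "pH": [3.0, 3.2, 3.4]})
scaled = min_max_scale(toy)
print(scaled)
```

For the alcohol column the middle value becomes (10 - 9) / (13 - 9) = 0.25, while the smallest and largest values become 0 and 1.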

<a id="4"></a>
<font color='green'>
Fixed Acidity    

In [None]:
sns.barplot(y=x['fixed acidity'], x=data['quality'], data=data)
plt.show()

The connection with quality is irregular; there is no clear trend.

<a id="5"></a>
<font color='green'>
Volatile Acidity

In [None]:
sns.barplot(y=x['volatile acidity'], x=data['quality'], data=data)
plt.show()

We can easily see that there is a trend here.

<a id="6"></a>
<font color='green'>
Citric Acid

In [None]:
sns.barplot(y=x['citric acid'], x=data['quality'], data=data)
plt.show()

There is an almost linear increase with quality in this plot.

<a id="7"></a>
<font color='green'>
Residual Sugar

In [None]:
sns.barplot(y=x['residual sugar'], x=data['quality'], data=data)
plt.show()

As we saw in the heatmap, there is not much connection with quality here, so we can drop this feature.

<a id="8"></a>
<font color='green'>
Chlorides

In [None]:
sns.barplot(y=x['chlorides'], x=data['quality'], data=data)
plt.show()

This plot shows an almost linear decrease with quality, so we can use this feature.

<a id="9"></a>
<font color='green'>
Free Sulfur Dioxide

In [None]:
sns.barplot(y=x['free sulfur dioxide'], x=data['quality'], data=data)
plt.show()

When we checked the heatmap we were not sure about this feature, but after checking this barplot I decided not to drop it.

<a id="10"></a>
<font color='green'>
Total Sulfur Dioxide

In [None]:
sns.barplot(y=x['total sulfur dioxide'], x=data['quality'], data=data)
plt.show()

<a id="11"></a>
<font color='green'>
Density

In [None]:
sns.barplot(y=x['density'], x=data['quality'], data=data)
plt.show()

There is very little connection with this feature. We can drop it.

<a id="12"></a>
<font color='green'>
pH

In [None]:
sns.barplot(y=x['pH'], x=data['quality'], data=data)
plt.show()

This barplot confirms the heatmap. There is not much connection. We are going to drop the pH feature.

<a id="13"></a>
<font color='green'>
Sulphates

In [None]:
sns.barplot(y=x['sulphates'], x=data['quality'], data=data)
plt.show()

<a id="14"></a>
<font color='green'>
Alcohol

In [None]:
sns.barplot(y=x['alcohol'], x=data['quality'], data=data)
plt.show()

<a id="15"></a>
<font color='green'>
Quality

In [None]:
data["quality"]

In [None]:
data["quality"].unique()

In [None]:
sns.countplot(x="quality", data=data)  # keyword arguments; recent seaborn versions reject positional data
plt.show()

We are done visualizing the variables. After examining all the plots, I decided to drop the "fixed acidity", "residual sugar", "density" and "pH" features.

<a id="16"></a>
<font color='purple'>
## Preparing Data

I am going to convert the quality feature into 0 and 1. If I split it as <7 bad and >=7 good, that would give us imbalanced data: even if our model were wrong and predicted 0 all the time, our accuracy would still be high because of this imbalance. So I am going to use this rule instead: if quality is <6 the wine is bad quality, and if it is >=6 it is good quality.
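To make the imbalance argument concrete, the sketch below compares two thresholds on a handful of invented quality scores. `binarize_report` and `toy_quality` are hypothetical names, and the numbers are not from the wine dataset. The "majority baseline" is the accuracy a model would get by always predicting the more common class.

```python
import pandas as pd

def binarize_report(quality: pd.Series, threshold: int) -> dict:
    """Share of 'good' wines and always-predict-majority accuracy for a threshold."""
    labels = (quality >= threshold).astype(int)  # 1 = good, 0 = bad
    share_good = float(labels.mean())
    return {
        "threshold": threshold,
        "share_good": share_good,
        # Accuracy of a model that always predicts the larger class.
        "majority_baseline_accuracy": max(share_good, 1 - share_good),
    }

# Invented scores, for illustration only.
toy_quality = pd.Series([3, 4, 5, 5, 5, 5, 6, 6, 6, 7])

reports = [binarize_report(toy_quality, t) for t in (6, 7)]
for report in reports:
    print(report)
```

The closer the majority baseline is to 0.5, the more an accuracy score actually tells us about the model. Running `binarize_report(data["quality"], 6)` and `binarize_report(data["quality"], 7)` before the conversion below would show the real proportions.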

In [None]:
data.head()

In [None]:
data["quality"]=[0 if i <6 else 1 for i in data["quality"]]
data.head()

In [None]:
data["quality"].unique()

In [None]:
x.drop(labels=["residual sugar","density","pH","fixed acidity"], axis=1 , inplace=True)
x.head()

In [None]:
x=x.rename(columns={'citric acid':'citric_acid','free sulfur dioxide':'free_sulfur_dioxide','total sulfur dioxide':'total_sulfur_dioxide'})
x.columns

In [None]:
y=data["quality"]
y.head()

<a id="17"></a>
<font color='purple'>
## Modelling

We are going to import the libraries.

In [None]:
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

<a id="18"></a>
<font color='green'>
### Train-Test Split

In [None]:
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.3, random_state=42)

print("x_train",len(x_train))
print("x_test",len(x_test))
print("y_train",len(y_train))
print("y_test",len(y_test))
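One option worth considering for the split is the `stratify` argument of `train_test_split`, which keeps the share of good and bad wines the same in the train and test sets. The sketch below uses random toy data (`X_toy`, `y_toy` are made up), not the wine table.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data for illustration: 100 rows, 40 of class 0 and 60 of class 1.
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(100, 3))
y_toy = np.array([0] * 40 + [1] * 60)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

print("train share of class 1:", y_tr.mean())
print("test share of class 1:", y_te.mean())
```

In this notebook the equivalent change would be adding `stratify=y` to the existing call.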

<a id="19"></a>
<font color='green'>
### Simple Logistic Regression Model

In [None]:
logreg= LogisticRegression()
logreg.fit(x_train,y_train)
acc_train=round(logreg.score(x_train,y_train)*100,2)
acc_test=round(logreg.score(x_test,y_test)*100,2)
print("Training Accuracy: {}%".format(acc_train))
print("Test Accuracy: {}%".format(acc_test))
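Accuracy is a single number, so it can hide which class the model gets wrong. A confusion matrix and a classification report show this per class. The sketch below runs on synthetic data from `make_classification` (an assumption for the sake of a runnable example), not on the wine features.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic two-class data, for illustration only.
X_toy, y_toy = make_classification(n_samples=300, n_features=7, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
y_pred = model.predict(X_te)

# Rows are the true classes, columns are the predicted classes.
cm = confusion_matrix(y_te, y_pred)
print(cm)
print(classification_report(y_te, y_pred))
```

With the notebook's own objects this would be `confusion_matrix(y_test, logreg.predict(x_test))`.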


I am going to try to improve the accuracy with hyperparameter tuning.

<a id="20"></a>
<font color='green'>
### Hyperparameter Tuning-Grid Search-Cross Validation
    

I am looking for the best parameters, and I will use grid search for this.

In [None]:
random_state=42
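classifier=[DecisionTreeClassifier(random_state=random_state),
            SVC(random_state=random_state),
            RandomForestClassifier(random_state=random_state),
            # liblinear supports both the "l1" and "l2" penalties searched below (the default lbfgs solver rejects "l1")
            LogisticRegression(random_state=random_state, solver="liblinear"),
            KNeighborsClassifier()]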

dt_param_grid={"min_samples_split":range(10,500,20),"max_depth":range(1,20,2)}
                                                                   
svc_param_grid={"kernel":["rbf"],"gamma":[0.001,0.01,0.1,1],"C":[1,10,50,100,200,300,1000]}                                                                   
                                   
rf_param_grid={"max_features":[2,3],"min_samples_split":[8,10,12],"min_samples_leaf":[3,4,5],"bootstrap":[False],"n_estimators":[100,200,500]}  
                                                              
logreg_param_grid={"C":np.logspace(-3,3,7),"penalty":["l1","l2"]}                                                                  
                                           
knn_param_grid={"n_neighbors":np.linspace(1,19,10, dtype=int).tolist(),"weights":["uniform","distance"],"metric":["euclidean","manhattan"]}                                                                   
                                                                   
classifier_param=[dt_param_grid,svc_param_grid,rf_param_grid,logreg_param_grid,knn_param_grid]    

In [None]:
cv_result=[]
best_estimators=[]

for i in range(len(classifier)):
    
    clf=GridSearchCV(classifier[i],param_grid=classifier_param[i],cv=StratifiedKFold(n_splits=10),scoring="accuracy",n_jobs=-1,verbose=1)
    clf.fit(x_train,y_train)
    cv_result.append(clf.best_score_)
    best_estimators.append(clf.best_estimator_)
    print(cv_result[i])
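Two things could be tried on top of this loop. The features in this notebook were scaled on the whole dataset before the split, so one variation is to put the scaler inside a `Pipeline`, which fits it only on the training folds during cross-validation. The other is to score the tuned model on the held-out test set, since `best_score_` is a cross-validation mean. The sketch below shows both on synthetic data with a deliberately tiny grid; the data, the grid values and the step names `scale` and `rf` are my own assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic two-class data, for illustration only.
X_toy, y_toy = make_classification(n_samples=300, n_features=7, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

# The scaler is a pipeline step, so it is refit on each training fold.
pipe = Pipeline([
    ("scale", MinMaxScaler()),
    ("rf", RandomForestClassifier(random_state=42)),
])

# Parameters of a pipeline step are addressed as "<step name>__<parameter>".
param_grid = {"rf__n_estimators": [50, 100], "rf__min_samples_leaf": [3, 5]}

search = GridSearchCV(pipe, param_grid, cv=StratifiedKFold(n_splits=5), scoring="accuracy")
search.fit(X_tr, y_tr)

test_accuracy = search.score(X_te, y_te)  # uses the refit best estimator
print("best params:", search.best_params_)
print("CV accuracy:", round(search.best_score_, 3))
print("test accuracy:", round(test_accuracy, 3))
```

If the test accuracy is much lower than the cross-validation accuracy, the model is probably overfitting to the training data.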

In [None]:
cv_results= pd.DataFrame({"Cross Validation Means":cv_result,"ML Models":["DecisionTreeClassifier", "SVC", "RandomForestClassifier","LogisticRegression","KNeighborsClassifier"]})

g=sns.barplot(x="Cross Validation Means", y="ML Models", data=cv_results)  # keyword x/y; recent seaborn rejects positional data
g.set_xlabel("Mean Accuracy")
g.set_title("Cross Validation Scores")

In [None]:
cv_result[2]

As we can see in the barplot, the Random Forest Classifier gives us the best score.

We got 80% cross-validation accuracy. This is the first machine learning project that I have done all by myself. You can leave comments or suggestions to help me improve. Thank you :)