# <font color="red"> <div align="center"> CLASSIFICATION PROJECT  
    
## <div align="center"> WINE QUALITY


## Contents
1.  Introduction
2.  The Aim of Analysis
3.  General Information of the Data
4.  Data Exploration
5.  Checking for NULL Values 
6.  Filling of the Row Data 
7.  General Looking at Wine Quality Classes
     * 7. 1 Creating 2 Bins Model of Two Types of Wine Quality Classes
8.  Overview about Outliers 
     * 8.1    Winsorization
9.  LOGISTIC REGRESSION CLASSIFIER
     * 9.1    Creating Train / Test Groups with 2 Bins Model 
     * 9.1.a  LogisticRegression
     * 9.1.b  Performance Measurements
     * 9.2    Creating Train / Test Groups with 3 Bins Model
     * 9.2.a  LogisticRegression
     * 9.2.b  Performance Measurements
10.  Imbalanced Data
     * 10.1   Resampling Imbalance Data
     * 10.2   Cross Validation
     * 10.3   K-Fold Cross Validation
     * 10.4   Cross_val_score & Cross_validate 
11. Hyperparameter Tuning
     * 11.1   Grid Search
     * 11.2   RandomizedSearchCV

# <div align="center">  **1. Introduction**

### <font color="gray">The dataset was downloaded from the UCI Machine Learning Repository.

### <font color="gray">The two datasets are related to red and white variants of the Portuguese "Vinho Verde" wine

# <div align="center"> **2. The Aim of Analysis**

### <font color="gray"> This study aims to identify the factors that affect WINE QUALITY, using binary and multiclass classification methods such as Support Vector Machines, K-NN, Logistic Regression and Softmax. The models are evaluated with the Confusion Matrix, Accuracy, Precision, Specificity, F1 Score, ROC/AUC and Logarithmic Loss, and refined with Cross Validation, K-Fold Cross Validation, Grid Search and SMOTE.

# <div align="center">  **3. General Information of the Data**


### <font color="black">Type:<font color="gray"> Two types of wines such as red wine and white wine.
    
### <font color="black">Fixed acidity:<font color="gray"> Fixed acids include tartaric, malic, citric, and succinic acids; all of them except succinic acid are found in grapes.

### <font color="gray">Acids are one of the fundamental properties of wine and contribute greatly to its taste. Acidity in food and drink tastes tart and zesty, and it is sometimes confused with the taste of alcohol. Wines with higher acidity feel lighter-bodied because they come across as “spritzy”. Reducing acids significantly might lead to wines tasting flat. If you prefer a wine that is richer and rounder, you probably enjoy slightly less acidity.

### <font color="black">Volatile acidity:<font color="gray"> These acids are to be distilled out from the wine before completing the production process. Volatile acidity consists primarily of acetic acid, though other acids like lactic, formic and butyric acids might also be present. An excess of volatile acids is undesirable and leads to an unpleasant flavour.

### <font color="black">Citric acid:<font color="gray"> This is one of the fixed acids which gives a wine its freshness. Usually most of it is consumed during the fermentation process and sometimes it is added separately to give the wine more freshness.

### <font color="black">Residual sugar: <font color="gray">This typically refers to the natural sugar from grapes which remains after the fermentation process stops, or is stopped.

### <font color="black">Chlorides: <font color="gray">Chloride concentration in the wine is influenced by terroir and its highest levels are found in wines coming from countries where irrigation is carried out using salty water or in areas with brackish terrains.

### <font color="black">Free sulfur dioxide:<font color="gray"> This is the part of the sulfur dioxide added to a wine that remains free after the rest binds. Winemakers will always try to get the highest proportion of free sulfur to bind. Sulfur dioxide compounds are also known as sulfites; too much of them is undesirable and gives a pungent odour.

### <font color="black">Total sulfur dioxide:<font color="gray"> This is the sum of the bound and the free sulfur dioxide. It is mainly added to kill harmful bacteria and preserve quality and freshness. There are usually legal limits for sulfur levels in wines, and an excess can even kill good yeast and give off an undesirable odour.

### <font color="black">Density: <font color="gray">This can be represented as a comparison of the weight of a specific volume of wine to an equivalent volume of water. It is generally used as a measure of the conversion of sugar to alcohol. 

### <font color="black">pH: <font color="gray"> Also known as the potential of hydrogen, this is a numeric scale that specifies the acidity or basicity of the wine. Fixed acidity contributes the most towards the pH of wines. As you might know, solutions with a pH less than 7 are acidic, while solutions with a pH greater than 7 are basic. With a pH of 7, pure water is neutral. Most wines have a pH between 2.9 and 3.9 and are therefore acidic.

### <font color="black">Sulphates: <font color="gray">These are mineral salts containing sulfur. Sulphates are to wine as gluten is to food. They are a regular part of winemaking around the world and are considered essential. They are connected to the fermentation process and affect the wine's aroma and flavour. 

### <font color="black">Alcohol: <font color="gray"> It's usually measured in % vol or alcohol by volume (ABV).

### <font color="black">Quality:<font color="gray"> Wine experts graded the wine quality between 0 (very bad) and 10 (excellent). The final quality score is the median of at least three evaluations made by wine experts.


# <div align="center"> **4. Data Exploration**

### 0. Importing Packages

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings('ignore')

#### ***Getting Data***

In [None]:
from subprocess import check_output

print(check_output(["ls", "../input/wine-quality"]).decode("utf8"))

In [None]:
df = pd.read_csv('../input/wine-quality/winequalityN.csv')

#### ***About data***

In [None]:
df.info()

#### ***Switching Column Names into a suitable format***

In [None]:
print(*df.columns, sep='\n')

In [None]:
df.columns = ('type', 'fixed_acidity', 'volatile_acidity', 'citric_acid',
       'residual_sugar', 'chlorides', 'free_sulfur_dioxide',
       'total_sulfur_dioxide', 'density', 'pH', 'sulphates', 'alcohol',
       'quality')

#### ***First 5 rows***

In [None]:
df.head()

# <div align="center"> **5. Checking for NULL Values**

#### ***Looking at NaN values with a heatmap***

In [None]:
import seaborn as sns

In [None]:
sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='viridis')

#### ***Checking for NULL Values***

In [None]:
Sum = df.isnull().sum()
Percentage = ( df.isnull().sum()/df.isnull().count())

pd.concat([Sum,Percentage], axis =1, keys= ['Sum', 'Percentage'])

In [None]:
def null_cell(df): 
    total_missing_values = df.isnull().sum() 
    missing_values_per = df.isnull().sum()/df.isnull().count() 
    null_values = pd.concat([total_missing_values, missing_values_per], axis=1, keys=['total_null', 'total_null_perc']) 
    null_values = null_values.sort_values('total_null', ascending=False) 
    return null_values[null_values['total_null'] > 0] 

# <div align="center"> **6. Filling of the Row Data**

In [None]:
fill_list = (null_cell(df)).index

In [None]:
df_mean = df.copy()

for col in fill_list:
    # Assign the result back: calling fillna(inplace=True) on a selection may not update df_mean
    df_mean[col] = df_mean[col].fillna(df_mean[col].mean())

In [None]:
sns.heatmap(df_mean.isnull(),yticklabels=False,cbar=False,cmap='viridis')

In [None]:
corr_matrix = df_mean.select_dtypes(include='number').corr()  # leave out the text column 'type'
corr_list = corr_matrix.quality.abs().sort_values(ascending=False).index[0:]

In [None]:
corr_list

In [None]:
plt.figure(figsize=(11,9))
dropSelf = np.zeros_like(corr_matrix)
dropSelf[np.triu_indices_from(dropSelf)] = True

sns.heatmap(corr_matrix, cmap=sns.diverging_palette(220, 10, as_cmap=True), annot=True, fmt=".2f", mask=dropSelf)

sns.set(font_scale=1.5)

Wine quality has the highest correlation with alcohol. Its correlations with citric acid, free_sulfur_dioxide, sulphates and pH are positive but very low.
Quality also has a low negative correlation with density, volatile acidity, chlorides, total_sulfur_dioxide and residual_sugar. 
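
The coefficients in the heatmap are Pearson correlations, which is what `DataFrame.corr()` computes by default. As a rough illustration of where such a number comes from, here is a minimal sketch in plain Python. The helper name `pearson_r` and the five sample values are made up for this example and are not rows from the wine data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values, chosen only to show a positive relationship
alcohol = [9.0, 10.0, 11.0, 12.0, 13.0]
quality = [5, 5, 6, 6, 7]

r = pearson_r(alcohol, quality)
print(round(r, 3))
```

A value near +1 or -1 means a strong linear relationship, and a value near 0 means a weak one, which is how the low coefficients in the heatmap should be read.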

 #### ***Distribution  of Variables***

In [None]:
from scipy.stats import norm 

In [None]:
plt.figure(figsize = (20,22))

for i in range(1,13):
    plt.subplot(5,4,i)
    sns.distplot(df_mean[df_mean.columns[i]], fit=norm)
    

# <div align="center"> **7. General Looking at Wine Quality Classes**

## <font color="darkblue"> 7. 1 Creating 2 Bins Model of Two Types of Wine Quality Classes

In [None]:
df_bins= df_mean.copy()

In [None]:
bins = [0,5,10]


labels = [0, 1] # 'low'=0, 'high'=1
df_bins['quality_range']= pd.cut(x=df_bins['quality'], bins=bins, labels=labels)

print(df_bins[['quality_range','quality']].head(5))

df_bins = df_bins.drop('quality', axis=1) 
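
With its default `right=True`, `pd.cut` builds intervals that are open on the left and closed on the right, so `bins = [0, 5, 10]` means (0, 5] and (5, 10]: a score of 5 is labelled 0 and a score of 6 is labelled 1. The sketch below mimics that rule in plain Python to make the boundary behaviour explicit. `cut_right_closed` is a hypothetical helper written for this illustration, not a pandas function.

```python
from bisect import bisect_left

def cut_right_closed(value, edges, labels):
    """Label of the interval (edges[i], edges[i+1]] containing value, or None if outside."""
    if not (edges[0] < value <= edges[-1]):
        return None
    # bisect_left gives the index of the first edge >= value,
    # so the interval to its left is the one that contains value
    return labels[bisect_left(edges, value) - 1]

edges = [0, 5, 10]
labels = [0, 1]  # 'low'=0, 'high'=1

for q in (3, 5, 6, 10):
    print(q, cut_right_closed(q, edges, labels))
```

Moving the middle edge changes the class balance considerably, which is why the same data looks balanced with a cut at 5 and imbalanced with a cut at 4.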

## Quality in  Different Wine Types

In [None]:
plt.figure(figsize=(8,5))

sns.countplot(x = 'type', hue = 'quality_range', data = df_bins)
plt.show()
# 'low'=0, 'high'=1

As the chart shows, white wines make up most of the data set. With the cut at a score of 5, the high-quality class (1) is the larger class for both wine types, 
so high-quality white wine has the highest count and low-quality red wine the lowest. 

## Quality & Alcohol Relation 

In [None]:
plt.figure(figsize=(8,7))
sns.scatterplot(x='quality_range', 
                y='alcohol', 
                hue='type',
                data=df_bins);
plt.xlabel('Quality',size=15)
plt.ylabel('Alcohol', size =15)
plt.show()

Red and white wines show similar results on the chart. High-quality wines appear mostly as red points on this plot and have a higher alcohol level.

## Quality & Volatile Acidity by Types

In [None]:
f, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 8))
f.suptitle('Wine Types by Quality & Acidity', fontsize=14)

sns.violinplot(x='quality_range', y='volatile_acidity', hue='type', data=df_bins, split=True, inner='quart', linewidth=1.3,
               palette={'red': 'red', 'white': 'white'}, ax=ax1)
ax1.set_xlabel("Wine Quality Class ",size = 15,alpha=0.8)
ax1.set_ylabel("Wine Volatile Acidity",size = 15,alpha=0.8)  # the plotted column is volatile_acidity

sns.violinplot(x='quality_range', y='alcohol', hue='type', data=df_bins, split=True, inner='quart', linewidth=1.3,
               palette={'red': 'darkred', 'white': 'white'}, ax=ax2)
ax2.set_xlabel("Wine Quality Class",size = 15,alpha=0.8)
ax2.set_ylabel("Wine Alcohol",size = 15,alpha=0.8)
plt.show()

Volatile acidity is low in both quality classes, especially in white wine, while red wine reaches higher values in the low-quality class, up to about 1.70. In the low-quality class, the alcohol level is again higher in red wine than in white wine. The high-quality class has the highest alcohol level for both wine types. 

## Chlorides Level in Quality Classes 

In [None]:
plt.figure(figsize= (6,4))

low_quality = df_bins [df_bins['quality_range']== 0]['chlorides']
high_quality   = df_bins [df_bins['quality_range']== 1][ 'chlorides']
ax = sns.kdeplot(data= low_quality, label= 'low_quality', shade=True, color=None)
ax = sns.kdeplot(data= high_quality,label= 'high_quality',shade=True, color= "r")

plt.title("Chloride Level in Wine Classes")
plt.xlim(0.0,0.3)
plt.legend()
plt.show()

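This plot compares quality classes rather than wine types: the chloride level is a bit higher in the low-quality class, in line with the negative correlation between chlorides and quality noted earlier. 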

## Fixed Acidity & Volatile Acidity & Citric Acid Density in Quality Classes

In [None]:
f, (ax1, ax2, ax3) = plt.subplots(3, figsize = (10,10))

f.suptitle('Wine Quality - Acidity Levels', fontsize=14)


fixed_acidity_low_quality    = df_bins [df_bins['quality_range']== 0]['fixed_acidity']
fixed_acidity_high_quality   = df_bins [df_bins['quality_range']== 1]['fixed_acidity']


volatile_acidity_low_quality = df_bins [df_bins['quality_range']== 0]['volatile_acidity']
volatile_acidity_high_quality= df_bins [df_bins['quality_range']== 1]['volatile_acidity']

citric_acid_low_quality      = df_bins [df_bins['quality_range']== 0]['citric_acid']
citric_acid_high_quality     = df_bins [df_bins['quality_range']== 1]['citric_acid']


sns.kdeplot(data=fixed_acidity_low_quality, label="low_quality", shade=True,ax=ax1)
sns.kdeplot(data=fixed_acidity_high_quality, label="high_quality", shade=True, ax=ax1)
ax1.set_xlabel("fixed_acidity",size = 15,alpha=0.8)
ax1.set_ylabel("Density",size = 15,alpha=0.8)  # the y-axis of a KDE plot is density
ax1.legend()


sns.kdeplot(data=volatile_acidity_low_quality, label="low_quality", shade=True,ax=ax2)
sns.kdeplot(data=volatile_acidity_high_quality, label="high_quality", shade=True, ax=ax2)
ax2.set_xlabel("volatile_acidity",size = 15,alpha=0.8)
ax2.set_ylabel("Density",size = 15,alpha=0.8)
ax2.legend()


sns.kdeplot(data=citric_acid_low_quality, label="low_quality", shade=True,ax=ax3)
sns.kdeplot(data=citric_acid_high_quality, label="high_quality", shade=True, ax=ax3)
ax3.set_xlabel("citric_acid",size = 15,alpha=0.8)
ax3.set_ylabel("Density",size = 15,alpha=0.8)
ax3.legend()  # one legend per panel instead of a single plt.legend() on the last axis


plt.show()

## Residual Sugar Levels by Wine Quality Classes

In [None]:
plt.figure(figsize=(8,5))

residual_sugar_low   = df_bins [df_bins['quality_range']== 0]['residual_sugar']
residual_sugar_high  = df_bins [df_bins['quality_range']== 1]['residual_sugar'] 
ax = sns.kdeplot(data= residual_sugar_low, label= 'low quality', shade=True)
ax = sns.kdeplot(data= residual_sugar_high,   label= 'high quality',   shade=True)

plt.title("Distributions of Residual Sugar by Wine Qualities")
plt.legend()
plt.show()

## Sulfur Dioxide Distribution in Wine Quality Classes

In [None]:
plt.figure(figsize=(12,8))
sns.scatterplot(x='total_sulfur_dioxide', y='free_sulfur_dioxide', hue='quality_range',data=df_bins);
plt.xlabel('total_sulfur_dioxide',size=15)
plt.ylabel('free_sulfur_dioxide', size =15)

There are some extreme values in the low-quality wine class. Total sulfur dioxide rises well above the rest for a few low-quality wines, while most of the distribution stays below a free sulfur dioxide level of about 100.   

## pH Level in Wine Quality

In [None]:
plt.figure(figsize=(8,7))

pH_low_quality  = df_bins [df_bins['quality_range']== 0]['pH']
pH_high_quality = df_bins [df_bins['quality_range']== 1][ 'pH']
ax = sns.kdeplot(data= pH_low_quality, label= 'low_quality', shade=True) 
ax = sns.kdeplot(data= pH_high_quality,label= 'high_quality',   shade=True)

plt.title("pH Levels in Low/High Quality Wines")
plt.xlabel('pH')
plt.legend()
plt.show()

## Density by Wine Quality Classes

In [None]:
plt.figure(figsize=(8,5))

density_low_quality  = df_bins [df_bins['quality_range']== 0]['density']
density_high_quality = df_bins [df_bins['quality_range']== 1][ 'density']
ax = sns.kdeplot(data= density_low_quality, label= 'low_quality', shade=True) 
ax = sns.kdeplot(data= density_high_quality,label= 'high_quality', shade=True)

plt.title("Density Levels in Low/High Quality of Wines")
plt.xlabel('density')
plt.legend()
plt.show()

## Sulphate Values in Wine Quality Classes

In [None]:
plt.figure(figsize=(8,5))

# Filter df_bins with its own mask, as in the other plots
sulphates_low_quality    = df_bins [df_bins['quality_range']== 0]['sulphates']
sulphates_high_quality   = df_bins [df_bins['quality_range']== 1][ 'sulphates']
ax = sns.kdeplot(data= sulphates_low_quality, label= 'low_quality',  shade=True) 
ax = sns.kdeplot(data= sulphates_high_quality,label= 'high_quality', shade=True)

plt.title("Sulphates Levels in Low/High Quality of Wines")
plt.xlabel('sulphates')
plt.legend()
plt.show()

Most low-quality wines have sulphate levels between 0.4 and 0.6. Both quality classes have similar values.

# <div align="center">**8.  Overview about Outliers**

In [None]:
outliers_by_12_variables = ['fixed_acidity', 'volatile_acidity', 'citric_acid',
                            'residual_sugar', 'chlorides', 'free_sulfur_dioxide',
                            'total_sulfur_dioxide', 'density', 'pH', 'sulphates', 'alcohol'] 
plt.figure(figsize=(22,20))

for i in range(0,11):
    plt.subplot(5, 4, i+1)
    plt.boxplot(df_bins[outliers_by_12_variables[i]])
    plt.title(outliers_by_12_variables[i])

## <font color="darkblue"> 8.1 Winsorization

In [None]:
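from scipy.stats.mstats import winsorize  # imported here so the helper is self-contained

def winsor(x, multiplier=3): 
    # Find the smallest upper-tail fraction that pulls the maximum below median + multiplier * std
    upper= x.median() + x.std()*multiplier
    for limit in np.arange(0.001, 0.20, 0.001):
        if np.max(winsorize(x,(0,limit))) < upper:
            return limit
    return None  # no fraction below 20% was enough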

In [None]:
from scipy.stats.mstats import winsorize

# kolon_isimleri = column names
kolon_isimleri = ['fixed_acidity', 'volatile_acidity', 'citric_acid', 'residual_sugar', 'chlorides', 'free_sulfur_dioxide',
                                  'total_sulfur_dioxide', 'density', 'pH', 'sulphates', 'alcohol']

# Loop over every column; range(1, ...) skipped 'fixed_acidity'
for col in kolon_isimleri:

    df_bins[col] = winsorize(df_bins[col], (0, winsor(df_bins[col])))
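
Upper-tail winsorization replaces the largest values with the largest value that is kept, instead of dropping the rows. The sketch below shows that idea in plain Python on a made-up list. `winsorize_upper` is a simplified, hypothetical helper for illustration and is not the SciPy implementation used above.

```python
def winsorize_upper(values, fraction):
    """Cap the largest `fraction` of values at the largest value that stays unchanged."""
    n = len(values)
    k = int(fraction * n)  # how many of the top values get capped
    if k == 0:
        return list(values)
    cap = sorted(values)[n - k - 1]  # largest value outside the capped tail
    return [min(v, cap) for v in values]

# Hypothetical measurements with one extreme value
sample = [1, 2, 3, 4, 100]
print(winsorize_upper(sample, 0.2))  # the top 20% (one value) is pulled down to 4
```

The row count stays the same, so the target column and the other features of an outlying wine are still available to the model.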

# <div align="center">**9. LOGISTIC REGRESSION CLASSIFIER**

## <font color="darkblue"> 9.1 Creating Train / Test Groups with 2 Bins Model 

To make all variables numeric, I mapped the wine types as follows, using the previous data frame 'df_bins':

In [None]:
df_bins.type = df_bins.type.map({'white':0, 'red':1})

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score 

In [None]:
X = df_bins[['type', 'alcohol', 'density', 'volatile_acidity', 'chlorides',
       'citric_acid', 'fixed_acidity', 'free_sulfur_dioxide',
       'total_sulfur_dioxide', 'sulphates', 'residual_sugar', 'pH']] 
y = df_bins.quality_range

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=40)


## <font color="darkblue"> 9.1.a  LogisticRegression

In [None]:
lr = LogisticRegression(random_state=40)
lr.fit(X_train, y_train)

In [None]:
train_accuracy = lr.score(X_train, y_train)
test_accuracy = lr.score(X_test, y_test)
print('One-vs-rest', '-'*35, 
      'Accuracy in Train Group   : {:.2f}'.format(train_accuracy), 
      'Accuracy in Test  Group   : {:.2f}'.format(test_accuracy), sep='\n')

#### Confusion Matrix in Chart

In [None]:
from sklearn.metrics import confusion_matrix as cm

predictions = lr.predict(X_test)
score = round(accuracy_score(y_test, predictions), 3)
cm1 = cm(y_test, predictions)
sns.heatmap(cm1, annot=True, fmt=".0f")
plt.xlabel('Predicted Values')
plt.ylabel('Actual Values')
plt.title('Accuracy Score: {0}'.format(score), size = 15)
plt.show()

In [None]:
pred_test  = lr.predict(X_test)
pred_train = lr.predict(X_train)

#### Confusion Matrix in array format

In [None]:
from sklearn.metrics import confusion_matrix 


# Use a separate name so the imported alias `cm` is not overwritten by an array
cm_array = confusion_matrix(y_test,pred_test)
cm_array

## <font color="darkblue"> 9.1.b   Performance Measurements

In [None]:
quality_pred = LogisticRegression(random_state=40)
quality_pred.fit(X_train,y_train)

In [None]:
confusion_matrix_train = confusion_matrix(y_train,pred_train)
confusion_matrix_test = confusion_matrix(y_test,pred_test)

print('Confusion Matrix Train Data', '--'*20, confusion_matrix_train, sep='\n')
print('Confusion Matrix Test Data', '--'*20, confusion_matrix_test, sep='\n')

In [None]:
TN = confusion_matrix_test[0][0]
TP = confusion_matrix_test[1][1]
FP = confusion_matrix_test[0][1]
FN = confusion_matrix_test[1][0]

print("(Total) True Negative       :", TN)
print("(Total) True Positive       :", TP)
print("(Total) False Positive      :", FP)
print("(Total) False Negative      :", FN)

In [None]:
FP+FN 

It is worth examining the FP and FN cases in a separate, deeper study: focusing on the false predictions can point the way to better accuracy and results.  
A new data set can be built from the predictions together with the X_test and y_test data, and the misclassified rows can then be inspected separately. 
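
One possible shape for such an error table is sketched below in plain Python. The rows, labels and the helper `misclassified` are all hypothetical. In the notebook the same idea could be applied to `X_test`, `y_test` and `pred_test`.

```python
def misclassified(rows, actual, predicted):
    """Collect (row, actual, predicted, error_type) for every wrong binary prediction."""
    errors = []
    for row, y_true, y_hat in zip(rows, actual, predicted):
        if y_true != y_hat:
            # Predicted 1 but actually 0 is a false positive, the reverse a false negative
            kind = "FP" if y_hat == 1 else "FN"
            errors.append((row, y_true, y_hat, kind))
    return errors

# Made-up feature rows and labels, for illustration only
rows = [{"alcohol": 9.5}, {"alcohol": 12.1}, {"alcohol": 10.0}, {"alcohol": 11.3}]
actual = [0, 1, 1, 0]
predicted = [0, 1, 0, 1]

errors = misclassified(rows, actual, predicted)
for row, y_true, y_hat, kind in errors:
    print(kind, row, "actual:", y_true, "predicted:", y_hat)
```

Grouping the collected rows by error type makes it easier to see whether false positives and false negatives cluster in particular feature ranges.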

### <font color='dark pink'> Accuracy

In [None]:
from sklearn.metrics import accuracy_score

print("Accuracy Score of Our Model     : ",  quality_pred.score(X_test, y_test))
#print("Accuracy Score of Our Model     : ",  accuracy_score(y_test, pred_test)) # same 

### <font color='dark pink'> Error Rate

In [None]:
Error_Rate = 1- (accuracy_score(y_test, pred_test))  
Error_Rate

### <font color='dark pink'> Precision: out of all the predicted positive instances, how many were predicted correctly = TP / (TP + FP) 


In [None]:
from sklearn.metrics import precision_score

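# The default average='binary' scores the positive class (1), i.e. TP / (TP + FP);
# average='micro' on a binary target would simply reproduce the accuracy
print("precision_score()         : ",  precision_score(y_test, pred_test))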

### <font color='dark pink'> Recall: out of all the actual positive instances, how many were identified correctly = TP / (TP + FN) 

In [None]:
from sklearn.metrics import recall_score

# Default average='binary': TP / (TP + FN) for the positive class (1)
print("recall_score()            : ",  recall_score(y_test, pred_test))

### <font color='dark pink'> Specificity: TN / (TN + FP) 

In [None]:
print(" Specificity Score   : ",  (TN)/(TN + FP)) 

### <font color='dark pink'> F1-Score: the harmonic mean of Precision and Recall = (2 * Recall * Precision) / (Recall + Precision)

In [None]:
from sklearn.metrics import f1_score

# Default average='binary', so both scores refer to the positive class (1)
precision_s = precision_score(y_test, pred_test)
recall_s    = recall_score(y_test, pred_test)


print("F1_score     : ",  2*((precision_s*recall_s)/(precision_s + recall_s)))  # by the formula
#print("F1_score     : ",  f1_score(y_test, pred_test))  # same value from scikit-learn

In [None]:
from sklearn.metrics import classification_report, precision_recall_fscore_support

print(classification_report(y_test,pred_test))

# Scores for the positive class (1); these match row "1" of the report above
print("f1_score        : {:.2f}".format(f1_score(y_test, pred_test)))
print("recall_score    : {:.2f}".format(recall_score(y_test, pred_test)))
print("precision_score : {:.2f}".format(precision_score(y_test, pred_test)))

print('\n')
metrics =  precision_recall_fscore_support(y_test, pred_test)
print("Precision       :" , metrics[0]) 
#print("Recall          :" , metrics[1]) 
print("F1 Score        :" , metrics[2]) 
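
The formulas in the headings of this section can be checked by hand from the four confusion-matrix counts. The sketch below does that in plain Python. The counts are invented for the example and `binary_metrics` is a hypothetical helper, not a scikit-learn function.

```python
def binary_metrics(tn, fp, fn, tp):
    """Precision, recall, specificity, F1 and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
        "accuracy": accuracy,
    }

# Made-up counts for 100 test samples
m = binary_metrics(tn=50, fp=10, fn=5, tp=35)
for name, value in m.items():
    print(f"{name:12s}: {value:.4f}")
```

Note that precision, recall and F1 here describe the positive class only. Micro-averaging over both classes of a binary problem pools all correct and incorrect predictions, so it returns the accuracy instead.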

### <font color='dark pink'> ROC/AUC(Area Under Curve)

In [None]:
probs = quality_pred.predict_proba(X_test)[:,1]  #Predict probabilities for the test data

from sklearn.metrics import roc_curve, roc_auc_score

fpr, tpr, thresholds  = roc_curve(y_test, probs) #Get the ROC Curve


import matplotlib.pyplot as plt


plt.figure(figsize=(8,5))
# Plot ROC curve
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate = 1 - Specificity Score')
plt.ylabel('True Positive Rate  = Recall Score')
plt.title('ROC Curve')
plt.show()

In [None]:
print('AUC Value : ', roc_auc_score(y_test.values, probs))
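
One way to read the AUC value: it is the probability that a randomly chosen positive sample gets a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it that way in plain Python on four made-up samples. `auc_by_pairs` is a hypothetical helper and is far slower than `roc_auc_score` on real data.

```python
def auc_by_pairs(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical labels and predicted probabilities
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

print(auc_by_pairs(y_true, scores))  # 3 of the 4 pairs are ordered correctly
```

A value of 0.5 corresponds to the dashed diagonal in the ROC plot, and 1.0 means every positive sample is scored above every negative one.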

### <font color='dark pink'>PRECISION RECALL CURVE 
(The precision is the ratio tp / (tp + fp) where tp is the number of true positives and fp the number of false positives. The precision is intuitively the ability of the classifier not to label as positive a sample that is negative.

The recall is the ratio tp / (tp + fn) where tp is the number of true positives and fn the number of false negatives. The recall is intuitively the ability of the classifier to find all the positive samples.)

In [None]:
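from sklearn.metrics import precision_recall_curve
# Use the predicted probabilities, not the hard 0/1 predictions, so the curve covers every threshold
precision, recall, _ = precision_recall_curve(y_test, probs)

plt.plot(recall, precision)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.show()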
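
Each point on a precision-recall curve comes from one decision threshold applied to the predicted probabilities. The plain-Python sketch below evaluates a few thresholds on made-up scores. `precision_recall_at` is a hypothetical helper, and the convention of reporting precision as 1.0 when nothing is predicted positive is an assumption of this sketch.

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when scores >= threshold are predicted as class 1."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(y_true, predicted) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, predicted) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, predicted) if y == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 1.0  # assumed convention for no positives
    recall = tp / (tp + fn)
    return precision, recall

# Hypothetical labels and predicted probabilities
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

for t in (0.35, 0.5, 0.9):
    print(t, precision_recall_at(y_true, scores, t))
```

Raising the threshold usually trades recall for precision, which is why the curve needs probabilities: hard 0/1 predictions correspond to a single threshold only.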

### <font color='dark pink'> Log Loss (a penalty based on how far the predicted probability is from the true label for every observation, averaged over all observations)

In [None]:
from sklearn.metrics import log_loss

print("Log-Loss     : " , log_loss(y_test.values, probs))
print("Error Rate   : " , 1- accuracy_score(y_test.values, pred_test))
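
For a binary target, log loss averages -[y·log(p) + (1 - y)·log(1 - p)] over the observations, where p is the predicted probability of class 1. The sketch below writes this out in plain Python. `binary_log_loss` is a hypothetical helper, and the clipping constant `eps` is an assumption added to avoid log(0).

```python
import math

def binary_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-likelihood of the true labels under the predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, probs):
        p = min(max(p, eps), 1 - eps)  # keep p strictly inside (0, 1)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Made-up examples: confident and right versus confident and wrong
good = binary_log_loss([1, 0], [0.9, 0.1])
bad = binary_log_loss([1, 0], [0.1, 0.9])
print(round(good, 4), round(bad, 4))
```

Unlike the error rate, log loss also punishes overconfidence: a wrong prediction made with probability 0.9 costs far more than one made with probability 0.6.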

## General Looking at Results

In [None]:
C_values = [0.001,0.01,0.1,1,10,100, 1000]

# Collect one row per C value; DataFrame.append was removed in pandas 2.0
rows = []

for c in C_values:
    
    # Apply logistic regression model to training data
    lr = LogisticRegression(penalty = 'l2', C = c, random_state = 0)
    lr.fit(X_train,y_train)
    rows.append({'C Value': c,
                 'Accuracy Train' : lr.score(X_train, y_train),
                 'Accuracy Test': lr.score(X_test, y_test)
                 })

accuracy_values = pd.DataFrame(rows, columns=['C Value', 'Accuracy Train', 'Accuracy Test'])
display(accuracy_values)
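
The loop above varies `C`, which scikit-learn describes as the inverse of regularization strength. As a reminder of what that means, one common way to write the L2-penalized logistic regression objective is shown below. The notation is an assumption of this note: labels are coded as $y_i \in \{-1, +1\}$, $w$ is the weight vector and $b$ the intercept.

```latex
\min_{w,\,b}\;\; \frac{1}{2}\, w^{\top} w \;+\; C \sum_{i=1}^{n} \log\!\left(1 + \exp\!\left(-y_i \left(x_i^{\top} w + b\right)\right)\right)
```

A small `C` gives the penalty term $\tfrac{1}{2} w^{\top} w$ relatively more weight (stronger regularization), while a large `C` lets the data-fit term dominate.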

# <font color="darkblue">9.2 Creating 3 Bins Models by a large Margin

In [None]:
df_mean.head(1)

In [None]:
df_bins3= df_mean.copy()

In [None]:
df_bins3.type = df_bins3.type.map({'white':0, 'red':1})

In [None]:
bins = [0,4,7,10]

labels = [0,1,2] # 'low'=0,'average'=1, 'high'=2

df_bins3['quality_range']= pd.cut(x=df_bins3['quality'], bins=bins, labels=labels)

#df_bins3.type = df_bins3.type.map({'white':0, 'red':1})

print(df_bins3[['quality_range','quality']].head(5))


In [None]:
X = df_bins3[['type', 'alcohol', 'density', 'volatile_acidity', 'chlorides',
       'citric_acid', 'fixed_acidity', 'free_sulfur_dioxide',
       'total_sulfur_dioxide', 'sulphates', 'residual_sugar', 'pH']]
y = df_bins3.quality_range

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=40)


In [None]:
X_test.head()

## <font color="darkblue">9.2.a  LogisticRegression

In [None]:
lr    = LogisticRegression(random_state=40)
lr.fit(X_train, y_train)

In [None]:
train_accuracy = lr.score(X_train, y_train)
test_accuracy = lr.score(X_test, y_test)
print('One-vs-rest', '-'*35, 
      'Accuracy Score of Train Model : {:.2f}'.format(train_accuracy), 
      'Accuracy Score of Test  Model : {:.2f}'.format(test_accuracy), sep='\n')

In [None]:
from sklearn.metrics import confusion_matrix as cm

predictions = lr.predict(X_test)
score = round(accuracy_score(y_test, predictions), 3)
cm1 = cm(y_test, predictions)
sns.heatmap(cm1, annot=True, fmt=".0f")
plt.xlabel('Predicted Values')
plt.ylabel('Actual Values')
plt.title('Accuracy Score: {0}'.format(score), size = 15)
plt.show()

In [None]:
y_pred = lr.predict(X_test)
y_pred[y_pred == 2]

In [None]:
# Use a separate name so the imported alias `cm` is not overwritten by an array
cm_array = confusion_matrix(y_test,y_pred)
cm_array

## <font color="darkblue">9.2.b   Performance Measurements

In [None]:
quality_pred = LogisticRegression(random_state=40)
quality_pred.fit(X_train,y_train)

In [None]:
pred_train = lr.predict(X_train)
pred_test  = lr.predict(X_test)

In [None]:
confusion_matrix_train = confusion_matrix(y_train,pred_train)
confusion_matrix_test = confusion_matrix(y_test,pred_test)

print('Confusion Matrix Train Data', '--'*20, confusion_matrix_train, sep='\n')
print('Confusion Matrix Test  Data ', '--'*20, confusion_matrix_test, sep='\n')

In [None]:
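# With three classes, count TP/FP/FN/TN per class (one-vs-rest) from the 3x3 matrix
# (rows = actual classes, columns = predicted classes)
TP = np.diag(confusion_matrix_test)
FP = confusion_matrix_test.sum(axis=0) - TP
FN = confusion_matrix_test.sum(axis=1) - TP
TN = confusion_matrix_test.sum() - (TP + FP + FN)

print("True Negative  per class :", TN)
print("True Positive  per class :", TP)
print("False Positive per class :", FP)
print("False Negative per class :", FN)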

### <font color='dark pink'> Accuracy

In [None]:
from sklearn.metrics import accuracy_score

print("Accuracy Score of Test Model : ",  quality_pred.score(X_test, y_test))

### <font color='dark pink'> Error Rate

In [None]:
Error_Rate = 1 - (accuracy_score(y_test, pred_test))
Error_Rate

### <font color='dark pink'> Precision

In [None]:
from sklearn.metrics import precision_score

print("precision_score        : ",  precision_score(y_test, pred_test, average='micro'))

### <font color='dark pink'>  Recall / Sensitivity

In [None]:
from sklearn.metrics import recall_score

print("recall_score        : ",  recall_score(y_test, pred_test, average='micro'))

### <font color='dark pink'>  F1 (F1 Score)

In [None]:
from sklearn.metrics import f1_score

precision_s = precision_score(y_test, pred_test,average='micro')
recall_s    = recall_score(y_test, pred_test, average='micro')


print("F1_score     : ",  2*((precision_s*recall_s)/(precision_s + recall_s)))  # by the mathematical formula
print("f1_score()   : ",  f1_score(y_test, pred_test,average='micro'))  # by the scikit-learn function

In [None]:
from sklearn.metrics import classification_report, precision_recall_fscore_support

print(classification_report(y_test,pred_test) )

print("f1_score()         : {:.2f}".format(f1_score(y_test, pred_test, average='micro')))
print("recall_score()     : {:.2f}".format(recall_score(y_test, pred_test, average='micro')))
print("precision_score()  : {:.2f}".format(precision_score(y_test, pred_test, average='micro')))

print('\n')
# Store under the name printed below; otherwise the stale 2-bins `metrics` would be shown
metrics =  precision_recall_fscore_support(y_test, pred_test)
print("Precision   :" , metrics[0]) 
print("Recall      :" , metrics[1]) 
print("F1 Score    :" , metrics[2]) 

warnings.filterwarnings('ignore')

In [None]:
from sklearn.preprocessing import LabelBinarizer

In [None]:
def multiclass_roc_auc_score(y_test, y_pred, average="macro"):
    lb = LabelBinarizer()
    lb.fit(y_test)
    y_test = lb.transform(y_test)
    y_pred = lb.transform(y_pred)
    return roc_auc_score(y_test, y_pred, average=average)

In [None]:
print('AUC Value : ', multiclass_roc_auc_score(y_test.values, y_pred))

### <font color='dark pink'> ROC/AUC(Area Under Curve)

In [None]:
probs = quality_pred.predict_proba(X_test)[:,1]

from sklearn.metrics import roc_curve, roc_auc_score

fpr, tpr, thresholds  = roc_curve(y_test, probs, pos_label=1)


# Plot ROC curve
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show()

### <font color='dark pink'> PRECISION RECALL CURVE

In [None]:
from sklearn.metrics import precision_recall_curve
precision, recall, _ = precision_recall_curve(y_test, probs, pos_label=1)

plt.plot(recall, precision)  # recall on the x-axis, as in the 2-bins section
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.show()

### General Looking at Results 

In [None]:
C_values = [0.001,0.01,0.1,1,10,100, 1000]

# Collect one row per C value; DataFrame.append was removed in pandas 2.0
rows = []

for c in C_values: 
    
    # Apply logistic regression model to training data
    lr = LogisticRegression(penalty = 'l2', C = c, random_state = 0)
    lr.fit(X_train,y_train)
    rows.append({'C Value': c,
                 'Accuracy Train' : lr.score(X_train, y_train),
                 'Accuracy Test': lr.score(X_test, y_test)
                 })

accuracy_values = pd.DataFrame(rows, columns=['C Value', 'Accuracy Train', 'Accuracy Test'])
display(accuracy_values)

# <div align="center">  **10. Imbalanced Data**

To see how resampling changes the logistic regression results, I also look at an imbalanced version of the data. In the previous steps I binned the quality variable into low and high ranges; this section shows the results after applying a resampling method.  

In [None]:
df_mean_imb = df_mean.copy() 

In [None]:
bins = [0,4,10] 


labels = [0, 1] # 'low'=0, 'high'=1 
df_mean_imb['quality_range']= pd.cut(x=df_mean_imb['quality'], bins=bins, labels=labels) 

print(df_mean_imb[['quality_range','quality']].head(5)) 

df_mean_imb = df_mean_imb.drop('quality', axis=1) #

In [None]:
sns.countplot(x=df_mean_imb.quality_range)
 #'low'=0, 'high'=1

# Share of class 1 (high quality); the remainder is class 0 (low quality)
high_share = (df_mean_imb.quality_range == 1).mean() * 100
print("Low Quality  0   : %{:.2f}".format(100 - high_share))
print("High Quality 1   : %{:.2f}".format(high_share))

Splitting the data into two classes at a quality score of 4 produces an imbalanced data set. 

In [None]:
balance = (df_mean_imb.quality_range.value_counts()[1]/df_mean_imb.quality_range.shape[0])*100
print('Data Quality Percentage:\n', balance,'%')

## <font color="darkblue">10.1 Resampling Imbalance Data

In [None]:
from sklearn.utils import resample 
from imblearn.over_sampling import SMOTE 
smote = SMOTE() 

In [None]:
df_mean_imb.type = df_mean_imb.type.map({'white':0, 'red':1}) 

In [None]:
X =  df_mean_imb.drop(['quality_range'], axis=1) 
y =  df_mean_imb.quality_range 

X_sm, y_sm =smote.fit_resample(X,y) 

print(X.shape, y.shape) 
print(X_sm.shape, y_sm.shape) 
sns.countplot(x=y_sm) 
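
SMOTE balances the classes by creating synthetic minority samples, each placed at a random position on the line segment between a real minority sample and one of its nearest minority-class neighbours. The sketch below shows only that interpolation step in plain Python. `smote_like_sample` and the two feature vectors are hypothetical, and the neighbour search that imbalanced-learn performs is left out.

```python
import random

def smote_like_sample(point, neighbor, rng):
    """Synthetic point on the segment between a minority sample and one of its neighbours."""
    gap = rng.random()  # random position in [0, 1) along the segment
    return [p + gap * (q - p) for p, q in zip(point, neighbor)]

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Two made-up minority-class rows: [alcohol, volatile_acidity]
a = [10.0, 0.30]
b = [12.0, 0.50]

synthetic = smote_like_sample(a, b, rng)
print(synthetic)
```

Because synthetic rows are interpolations of real ones, a common precaution is to resample only the training portion after splitting, so that near-copies of test rows do not end up in the training data.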

In [None]:
def create_model(X, y): 
    X_train, X_test, y_train, y_test =  train_test_split(X, y, test_size=0.20, random_state=40, stratify = y) 
    logreg_model = LogisticRegression() 
    logreg_model.fit(X_train, y_train) 

    pred_train = logreg_model.predict(X_train) 
    pred_test = logreg_model.predict(X_test) 
    confusion_matrix_train = confusion_matrix(y_train, pred_train) 
    confusion_matrix_test = confusion_matrix(y_test, pred_test) 
    print("Accuracy of Test Model : ",  logreg_model.score(X_test, y_test)) 
    print("Train Data Set") 
    print(classification_report(y_train,pred_train) ) 
    print("Test Data Set ") 
    print(classification_report(y_test,pred_test) ) 
    return  None 

In [None]:
warnings.filterwarnings('ignore')  # silence warnings before the model is fitted
create_model(X_sm, y_sm)
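
One caveat about the cell above: `create_model` splits the data after SMOTE has already been applied, so synthetic points derived from rows that land in the test set can end up in the training set and may inflate the test scores. A minimal sketch of the alternative order (split first, then oversample only the training part) follows. It uses synthetic data from `make_classification` and plain random oversampling via `sklearn.utils.resample` as a stand-in for SMOTE; both are assumptions for illustration, not the notebook's data or method.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic imbalanced data (about 5% minority class); illustrative only
X_demo, y_demo = make_classification(n_samples=2000, n_features=8,
                                     weights=[0.05, 0.95], random_state=40)

# 1) Split first, so the test set keeps the original class distribution
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=40, stratify=y_demo)

# 2) Oversample the minority class (0) in the training part only
minority = y_tr == 0
X_min_up, y_min_up = resample(X_tr[minority], y_tr[minority],
                              replace=True,
                              n_samples=int((~minority).sum()),
                              random_state=40)
X_bal = np.vstack([X_tr[~minority], X_min_up])
y_bal = np.concatenate([y_tr[~minority], y_min_up])

# 3) Fit on the balanced training data, evaluate on the untouched test set
demo_model = LogisticRegression(max_iter=1000)
demo_model.fit(X_bal, y_bal)
print(classification_report(y_te, demo_model.predict(X_te)))
```

With imbalanced-learn, step 2 could instead call `SMOTE().fit_resample(X_tr, y_tr)`; the point of the sketch is only the order of the steps.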

## <font color="darkblue">10.2 Cross Validation with 2 Bins Model

In [None]:
df_bins.head()

In [None]:
X = df_bins.drop(['quality_range'], axis=1)
y = df_bins.quality_range
y = np.array(y)

In [None]:
plt.style.use('fivethirtyeight')

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=40)
print("Number of Rows in    Training dataset :  {} ".format(len(X_train)))
print("Number of Targets in Training dataset :  {} ".format(len(y_train)))
print("Number of Rows in    Test dataset :  {} ".format(len(X_test)))
print("Number of Targets in Test dataset :  {} ".format(len(y_test)))

In [None]:
sns.countplot(x=y_test)
plt.ylim((0,1000))

In [None]:
plt.figure(figsize=(15,9))
y_list = [y, y_train, y_test]
titles = ['All Data','Train Data', 'Test Data']

for i in range(1,4):
    plt.subplot(1,3,i)
    sns.countplot(x=y_list[i-1])
    plt.title(titles[i-1])
    


In [None]:
print("Share of '0' in the full data set : {:.0f}%".format(len(y[y==0])/len(y)*100))
print("Share of '0' in the test data     : {:.0f}%".format(len(y_test[y_test==0])/len(y_test)*100))
print("Share of '0' in the training data : {:.0f}%".format(len(y_train[y_train==0])/len(y_train)*100))

In [None]:
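# Fit a logistic regression model on the training split
model = LogisticRegression()
model.fit(X_train, y_train)
tahmin_eğitim = model.predict(X_train)  # predictions on the training set
tahmin_test = model.predict(X_test)     # predictions on the test set
model.score(X_test, y_test)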

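The class proportions of y are similar in the train and test sets, and the model was trained on this split. However, a single split says little about how the result depends on which rows of X land in each set, so we turn to cross-validation next.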
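
Note that the split above does not pass `stratify`, so the class proportions in the train and test sets match the full data only approximately. A small sketch on hypothetical labels (not the wine data) shows how `stratify` keeps the proportions as close as the sample sizes allow.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 30% class 0, 70% class 1
y_demo = np.array([0] * 300 + [1] * 700)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=40, stratify=y_demo)

for name, part in [("All", y_demo), ("Train", y_tr), ("Test", y_te)]:
    print("{:<5} share of class 0: {:.3f}".format(name, np.mean(part == 0)))
```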

##  <font color="darkblue">10.3 K-Fold Cross Validation

In [None]:
from sklearn.model_selection import KFold 
kf = KFold(n_splits=5, shuffle=True, random_state=40) 

In [None]:
X.loc[[3,5]] 


In [None]:
parcalar = kf.split(X)
for num, (train_index, test_index) in enumerate(parcalar): 
    print("{}.Training Set Size : {}".format(num+1,len(train_index)))  
    print("{}.Test Set Size     : {}".format(num+1,len(test_index))) 
    print('-'*26)

In [None]:
model2 = LogisticRegression()
pieces = kf.split(X)
accuracy_list = []

# egitim_indeks / test_indeks are the positional train / test indices of each fold
for i, (egitim_indeks, test_indeks) in enumerate(pieces):
    
    # Select the rows of the current fold by position
    X_train, y_train = X.iloc[egitim_indeks], y[egitim_indeks]
    X_test, y_test = X.iloc[test_indeks], y[test_indeks]
    
    model2.fit(X_train, y_train)
    tahmin = model2.predict(X_test)
    accuracy_value = model2.score(X_test, y_test)  
    
    accuracy_list.append(accuracy_value)
    
    print("{}.Accuracy Value of Pieces: {:.3f}".format(i+1, accuracy_value))
    print("-"*30)

In [None]:
print("Average Accuracy Value : {:.2f}".format(np.mean(accuracy_list)))

We split the data into 5 folds with the KFold method and trained the model on each training fold. In the following section, the cross_validate and cross_val_score tools handle all of these steps themselves.
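
As a sanity check on the manual loop, the sketch below runs the same 5-fold procedure by hand and through `cross_val_score`, passing the same `KFold` object as `cv`. It uses synthetic data from `make_classification` (an assumption, to keep the example self-contained); the two approaches should give the same fold accuracies.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X_demo, y_demo = make_classification(n_samples=500, n_features=6, random_state=40)
kf_demo = KFold(n_splits=5, shuffle=True, random_state=40)

# Manual loop: fit on each training fold, score on the held-out fold
manual_scores = []
for train_idx, test_idx in kf_demo.split(X_demo):
    fold_model = LogisticRegression(max_iter=1000)
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(fold_model.score(X_demo[test_idx], y_demo[test_idx]))

# Same folds, handled internally by scikit-learn
auto_scores = cross_val_score(LogisticRegression(max_iter=1000),
                              X_demo, y_demo, cv=kf_demo)

print(np.round(manual_scores, 3))
print(np.round(auto_scores, 3))
```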

##  <font color="darkblue">10.4   Cross Validation Score & Cross Validate

In [None]:
from sklearn.model_selection import cross_validate, cross_val_score

In [None]:
lrm = LogisticRegression()
cv = cross_validate(estimator=lrm,
                     X=X,
                     y=y,
                     cv=10,return_train_score=True
                    )
print('Test Scores            : ', cv['test_score'], sep = '\n')
print("-"*50)
print('Train Scores           : ', cv['train_score'], sep = '\n')

In [None]:
print('Mean of Test Set  : ', cv['test_score'].mean())
print('Mean of Train Set : ', cv['train_score'].mean())

The average accuracy score is computed from 10 accuracy scores, one per fold.

The accuracy scores are still similar (.96-.97) to those obtained with the methods applied previously.
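
The next cell also requests `r2` as a scoring metric. R² is a regression metric, and on 0/1 labels it can be negative even when accuracy is high. A minimal worked example with hypothetical labels (96 of 100 samples in class 1) and a classifier that always predicts the majority class:

```python
import numpy as np
from sklearn.metrics import accuracy_score, r2_score

# Hypothetical imbalanced labels: 4 low-quality (0), 96 high-quality (1)
y_true = np.array([0] * 4 + [1] * 96)
y_hat = np.ones_like(y_true)          # always predict the majority class

acc = accuracy_score(y_true, y_hat)   # 96 of 100 correct
r2 = r2_score(y_true, y_hat)          # 1 - SS_res / SS_tot

print("Accuracy: {:.2f}".format(acc))
print("R2      : {:.3f}".format(r2))
```

Here SS_res = 4 and SS_tot = 3.84, so R² = 1 - 4/3.84 ≈ -0.04. For an imbalanced classification problem, metrics such as recall or F1 are usually more informative than R².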

In [None]:
cv = cross_validate(estimator=lrm, 
                     X=X,
                     y=y,
                     cv=10,return_train_score=True,
                     scoring = ['accuracy', 'r2', 'precision']
                    )

In [None]:
print('Test Set Accuracy   Mean      : {:.2f}'.format(cv['test_accuracy'].mean()))
print('Test Set R Square   Mean      : {:.2f}'.format(cv['test_r2'].mean()))
print('Test Set Precision  Mean      : {:.2f}'.format(cv['test_precision'].mean()))
print('Train Set Accuracy  Mean      : {:.2f}'.format(cv['train_accuracy'].mean()))
print('Train Set R Square  Mean      : {:.2f}'.format(cv['train_r2'].mean()))
print('Train Set Precision Mean      : {:.2f}'.format(cv['train_precision'].mean()))

In [None]:
cv = cross_val_score(estimator=lrm,
                     X=X,
                     y=y,
                     cv=10                    
                    )
print('Model Scores           : ', cv, sep = '\n')

The cross_val_score and cross_validate functions return only scores. To obtain the model's predictions for each sample, we can use the cross_val_predict function.
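
A short sketch of what the out-of-fold predictions are good for: every sample is predicted exactly once, by a model that did not see it during training, so the predictions can be passed to `confusion_matrix`. The data here are synthetic (`make_classification`), an assumption made to keep the example self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict

X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=40)

# Each sample is predicted by the model trained on the folds that exclude it
y_oof = cross_val_predict(LogisticRegression(max_iter=1000), X_demo, y_demo, cv=10)

cm = confusion_matrix(y_demo, y_oof)
print(cm)
```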

In [None]:
from sklearn.model_selection import cross_val_predict 

In [None]:
y_pred = cross_val_predict(estimator=lrm, X=X, y=y, cv=10)
print(y_pred[0:10])

# <div align="center">  **11. Hyperparameter Tuning**

Apart from choosing an appropriate algorithm for our model, choosing suitable parameter values is also important for accurate predictions. I will use Grid Search and Random Search for this purpose.

To list the parameters that can be tuned, I used the get_params() method.

In [None]:
logreg = LogisticRegression()
print(logreg.get_params())

##  <font color="darkblue">11.1 Grid Search

In [None]:
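# The default lbfgs solver does not support the l1 penalty;
# liblinear supports both l1 and l2
parameters = {"C": [10 ** x for x in range (-5, 5, 1)],
                "penalty": ['l1', 'l2'],
                "solver": ['liblinear']
                }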

In [None]:
parameters

In [None]:
from sklearn.model_selection import GridSearchCV


grid_cv = GridSearchCV(estimator=logreg,
                       param_grid = parameters,
                       cv = 10
                      )
grid_cv.fit(X, y)

In [None]:
print("The Best Parameters : ", grid_cv.best_params_)
print("The Best Score     : ", grid_cv.best_score_)

In [None]:
results = grid_cv.cv_results_
df = pd.DataFrame(results)
df.head()

In [None]:
df = df[['param_penalty','param_C', 'mean_test_score']]
df = df.sort_values(by='mean_test_score', ascending = False)
df

In [None]:
# The 10 most successful parameter combinations on a chart.
plt.style.use('fivethirtyeight')

plt.figure(figsize=(12,12))

sns.scatterplot(x = 'param_C', y = 'mean_test_score', hue = 'param_penalty', data = df[0:10], s=150)

plt.xscale('symlog')
#plt.ylim((0.9,1))
plt.show()

##  <font color="darkblue">11.2 RandomizedSearchCV

Grid Search evaluates every combination of our parameters. RandomizedSearchCV instead evaluates only a chosen number of randomly sampled combinations.
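
To put the difference in numbers, the sketch below counts the cross-validation fits for the C and penalty grid used in this section, with 10 folds and `n_iter = 10`. `ParameterGrid` is used here only for counting; the variable names are illustrative.

```python
from sklearn.model_selection import ParameterGrid

param_grid_demo = {"C": [10 ** x for x in range(-5, 5, 1)],
                   "penalty": ["l1", "l2"]}

n_combinations = len(ParameterGrid(param_grid_demo))  # 10 C values x 2 penalties
n_folds = 10
n_iter = 10

print("Grid Search fits      :", n_combinations * n_folds)
print("Randomized Search fits:", n_iter * n_folds)
```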

In [None]:
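# liblinear supports both the l1 and l2 penalties (the default lbfgs does not support l1)
parametres = {"C": [10 ** x for x in range (-5, 5, 1)],
                "penalty": ['l1', 'l2'],
                "solver": ['liblinear']
                }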

I will sample 10 combinations using the 'n_iter' parameter.
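
RandomizedSearchCV can also draw values from a continuous distribution rather than a fixed list. The sketch below is one possible variant, not the setup used in this notebook: it samples C from `scipy.stats.loguniform` over the same range as the list, uses the `liblinear` solver (which supports both penalties), and runs on synthetic data from `make_classification`.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=111)

demo_search = RandomizedSearchCV(
    estimator=LogisticRegression(solver="liblinear"),
    param_distributions={"C": loguniform(1e-5, 1e4),   # continuous, log-scaled
                         "penalty": ["l1", "l2"]},
    n_iter=10,            # number of sampled combinations
    cv=5,
    scoring="precision",
    random_state=111,
)
demo_search.fit(X_demo, y_demo)

print(demo_search.best_params_)
print(len(demo_search.cv_results_["params"]))  # one entry per sampled combination
```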

In [None]:
from sklearn.model_selection import RandomizedSearchCV

import warnings
warnings.filterwarnings('ignore')


rs_cv = RandomizedSearchCV(estimator=logreg,
                           param_distributions = parametres,
                           cv = 10,
                           n_iter = 10,
                           random_state = 111,
                           scoring = 'precision'
                      )
rs_cv.fit(X, y)

In [None]:
print("The Best Parameters        : ", rs_cv.best_params_)
print("All Precision Values       : ", rs_cv.cv_results_['mean_test_score'])
print("The Best Precision Value   : ", rs_cv.best_score_)

In [None]:
results_rs = rs_cv.cv_results_
df_rs = pd.DataFrame(results_rs)
df_rs = df_rs[['param_penalty','param_C', 'mean_test_score']]
df_rs = df_rs.sort_values(by='mean_test_score', ascending = False)
df_rs

In [None]:
plt.style.use('fivethirtyeight')
plt.figure(figsize=(12,12))
sns.scatterplot(x = 'param_C', y = 'mean_test_score', hue = 'param_penalty', data = df_rs, s=200)
plt.xscale('symlog')
plt.ylim((0.0,1))
plt.show()



At the beginning of this study, I checked the general characteristics of the data set. The data have some NULL values. Although dropping them was an option, given the low percentage of missing values, I preferred to fill them with the mean of the data.

The data set shows that red wine is very rich in wine quality, which correlates strongly with alcohol.

I also looked at quality levels in each variable by using suitable charts for a general understanding.  

In the following sections, I examined 2 different types of models with different bins. Behind this study, I created many models in search of better accuracy and recall scores. This study shows only the best model, with good scores and predictions.

The first model used 2 bins over all variables, with quality ranges 0-5 and 5-10. It gives an accuracy score of 0.74 on both the train and test samples.

The df_bins3 data frame was split into 3 bins to check accuracy levels. The first binning used the ranges 0-5, 5-6, and 6-10; this model gives a score of 0.58.

On the other hand, when the bins are set to 0-4, 4-7, and 7-10, the score reaches 0.93. I continued with this model in the further steps on other performance measurements.

A general note: these results are for imbalanced data, so I wanted to see the scores after balancing the data set. For this reason, and to check the effect on the logistic regression model, I resampled the imbalanced data, using the low and high quality bins created in the previous steps.
However, very high scores together with a negative R square show that the data set ultimately needs another approach. For further studies, a more suitable data set can be chosen.

A quick note on resampling: splitting the data into 2 bins at 5 gives a balanced distribution. However, when the first bin runs from 0 to 4, the data become imbalanced.
After resampling the data, I needed to convert the y values to array format to obtain the Cross Validation scores.

Generally, our measurements and model scores worked well for the aim of the study. The study could be completed in a shorter way, without repeating similar functions; however, it also aims to use different methods to obtain accurate scores from various sources.

I focused on classification methods in this study. However, I agree that other algorithms, such as Random Forest and Boosting, may give better results. I will use these methods in my next kernel.
