# "Wine Quality."

### _"Quality ratings of Portuguese white wines" (Classification task)._

## Table of Contents


## Part 0: Introduction

### Overview
The dataset contains 12 columns and 4,898 entries describing Portuguese white wines.
    
**Metadata:**
    
* **fixed acidity** 

* **volatile acidity**

* **citric acid** 

* **residual sugar** 

* **chlorides** 

* **free sulfur dioxide** 

* **total sulfur dioxide**

* **density** 

* **pH** 

* **sulphates** 

* **alcohol** 

* **quality** - score between 3 and 9


### Questions:
    
Predict which wines are 'Good/1' and 'Not Good/0' (use binary classification; check the balance of classes; calculate predictions; choose the best model).


## [Part 1: Import, Load Data](#Part-1:-Import,-Load-Data.)
* ### Import libraries, Read data from ‘.csv’ file

## [Part 2: Exploratory Data Analysis](#Part-2:-Exploratory-Data-Analysis.)
* ### Info, Head, Describe
* ### Encoding 'quality' attribute
* ### 'quality' attribute value counts and visualisation
* ### Resampling of an imbalanced dataset
* ### Random under-sampling of an imbalanced dataset
* ### Random over-sampling of an imbalanced dataset
* ### Initialisation of target
* ### Drop column 'quality'

## [Part 3: Data Wrangling and Transformation](#Part-3:-Data-Wrangling-and-Transformation.)
* ### StandardScaler
* ### Creating datasets for ML part
* ### 'Train\Test' splitting method

## [Part 4: Machine Learning](#Part-4:-Machine-Learning.)
* ### Build, train and evaluate models without hyperparameters
    * #### Logistic Regression, K-Nearest Neighbors, Decision Trees 
    * #### Classification report
    * #### Confusion Matrix
    * #### ROC-AUC score
* ### Build, train and evaluate models with hyperparameters
    * #### Logistic Regression, K-Nearest Neighbors, Decision Trees 
    * #### Classification report
    * #### Confusion Matrix
    * #### ROC-AUC score

## [Conclusion](#Conclusion.)



## Part 1: Import, Load Data.

* ### Import libraries

In [39]:
# import standard libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from scipy.stats import norm
%matplotlib inline
sns.set()

import sklearn.metrics as metrics
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_auc_score
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

import warnings
warnings.filterwarnings('ignore')



* ### Read data from ‘.csv’ file

In [40]:
# read data from '.csv' file
dataset = pd.read_csv('winequality.csv') 

## Part 2: Exploratory Data Analysis.

* ### Info

In [41]:
# print the full summary of the dataset  
dataset.info()

* ### Head

In [42]:
# preview of the first 5 lines of the loaded data 
dataset.head()

* ### Describe

In [None]:
dataset.describe()

Suppose you were given this dataset and asked a specific question: classify which wines are good and which are not. There is no ready-made target attribute "Y" holding the answer. However, the helper attribute "quality" can be used to build the target "Y" for training the model. The "quality" attribute takes values from 3 to 9, where 3 is "Not Good" and 9 is "Good" wine quality. The higher the number, the better the wine.

* ### Encoding 'quality' attribute

In [43]:
# lambda function; wine quality from 3-6 == 0, from 7-9 == 1.
dataset['quality'] = dataset.quality.apply(lambda q: 0 if q <= 6 else 1)
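
The same threshold rule can be sketched on a toy series. This is a minimal illustration, assuming a vectorised comparison instead of `apply`; the `quality` values below are made up and only cover the 3 to 9 range described above.

```python
import pandas as pd

# Hypothetical quality scores covering the documented 3-9 range.
quality = pd.Series([3, 4, 5, 6, 7, 8, 9])

# Scores of 7 and above become class 1 ('Good'), everything else class 0.
encoded = (quality >= 7).astype(int)
print(encoded.tolist())  # [0, 0, 0, 0, 1, 1, 1]
```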

In [44]:
# preview of the first 5 lines of the loaded data 
dataset.head()

* ### 'quality' attribute value counts and visualisation

In [None]:
print('Not good wine', round(dataset['quality'].value_counts()[0]/len(dataset) * 100,2), '% of the dataset')
print('Good wine', round(dataset['quality'].value_counts()[1]/len(dataset) * 100,2), '% of the dataset')

dataset['quality'].value_counts()
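
Class shares can also be read in one call. The sketch below assumes `value_counts(normalize=True)` on a small, made-up binary series standing in for the encoded `quality` column.

```python
import pandas as pd

# Hypothetical encoded target: seven 'Not good' wines and three 'Good' wines.
quality = pd.Series([0, 0, 0, 0, 1, 0, 1, 0, 0, 1])

# normalize=True returns class shares instead of raw counts.
shares = quality.value_counts(normalize=True).mul(100).round(2)
print(shares)
```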

In [45]:
# visualisation plot: bar chart of the class counts
dataset['quality'].value_counts().plot(kind='bar')

* ### Resampling of an imbalanced dataset

In [46]:
# class count
#count_class_0, count_class_1 = dataset.quality.value_counts()

# divide by class
#class_0 = dataset[dataset['quality'] == 0]
#class_1 = dataset[dataset['quality'] == 1]

* ### Random under-sampling of an imbalanced dataset

In [None]:
#class_0_under = class_0.sample(count_class_1)
#dataset_under = pd.concat([class_0_under, class_1], axis=0)

#print('Random under-sampling:')
#print(dataset_under.quality.value_counts())

#dataset_under.quality.value_counts().plot(kind='bar', title='Count (target)');

* ### Random over-sampling of an imbalanced dataset

In [None]:
#class_1_over = class_1.sample(count_class_0, replace=True)
#dataset_over = pd.concat([class_0, class_1_over], axis=0)

#print('Random over-sampling:')
#print(dataset_over.quality.value_counts())

#dataset_over.quality.value_counts().plot(kind='bar', title='Count (target)');
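
The two resampling ideas from the commented-out cells can be tried on a toy frame. This is a minimal sketch on made-up data (the `toy` frame and its `alcohol` values are hypothetical); `random_state` is added here only to make the draw repeatable.

```python
import pandas as pd

# Toy imbalanced frame: 8 rows of class 0 and 2 rows of class 1.
toy = pd.DataFrame({'alcohol': range(10), 'quality': [0] * 8 + [1] * 2})

class_0 = toy[toy['quality'] == 0]
class_1 = toy[toy['quality'] == 1]

# Under-sampling: shrink the majority class to the minority size.
toy_under = pd.concat([class_0.sample(len(class_1), random_state=0), class_1], axis=0)

# Over-sampling: draw the minority class with replacement up to the majority size.
toy_over = pd.concat([class_0, class_1.sample(len(class_0), replace=True, random_state=0)], axis=0)

print(toy_under['quality'].value_counts())
print(toy_over['quality'].value_counts())
```

Under-sampling discards majority rows, while over-sampling repeats minority rows, so the over-sampled frame contains duplicates.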

* ### Initialisation of target

In [None]:
# initialisation of target
target = dataset['quality']

# for under-sampling dataset
#target_under = dataset_under['quality']

# for over-sampling dataset
#target_over = dataset_over['quality'] 

* ### Drop column 'quality'

In [None]:
dataset = dataset.drop(columns=['quality'])

# for under-sampling dataset
#dataset_under = dataset_under.drop(columns=['quality'])

# for over-sampling dataset
#dataset_over = dataset_over.drop(columns=['quality'])

## Part 3: Data Wrangling and Transformation.

* ### StandardScaler

In [None]:
# StandardScaler 
sc = StandardScaler()

dataset_sc = sc.fit_transform(dataset)

# for under-sampling dataset
#dataset_sc = sc.fit_transform(dataset_under)

# for over-sampling dataset
#dataset_sc = sc.fit_transform(dataset_over)

dataset_sc = pd.DataFrame(dataset_sc)
dataset_sc.head()
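
As a quick check of what `StandardScaler` does, the sketch below compares it with the manual formula (value minus column mean, divided by column standard deviation) on a small made-up array `X_toy`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix with two columns on very different scales.
X_toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

scaled = StandardScaler().fit_transform(X_toy)

# Manual standardisation; np.std uses the population formula (ddof=0), as StandardScaler does.
manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(np.allclose(scaled, manual))  # True
```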


* ### Creating datasets for ML part

In [47]:
# set 'X' for the features and 'y' for the target ('quality')
y = target
X = dataset_sc.copy()

# for under-sampling dataset 
#y = target_under
#X = dataset_sc.copy()

# for over-sampling dataset 
#y = target_over
#X = dataset_sc.copy() 


In [48]:
# preview of the first 5 lines of the loaded data 
X.head()

* ### 'Train\Test' split

In [50]:
# apply 'Train\Test' splitting method
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
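
With imbalanced classes, one option is to keep the class ratio the same in both parts of the split. The sketch below assumes the `stratify` argument of `train_test_split` and uses made-up arrays; it is a variation, not the split used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 samples with an 80/20 class imbalance.
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0] * 16 + [1] * 4)

# stratify=y_toy keeps the 80/20 ratio in both the train and the test part.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0, stratify=y_toy
)

print(y_tr.mean(), y_te.mean())
```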

In [51]:
# print shape of X_train and y_train
X_train.shape, y_train.shape

In [52]:
# print shape of X_test and y_test
X_test.shape, y_test.shape

## Part 4: Machine Learning.

* ### Build, train and evaluate models without hyperparameters

* Logistic Regression
* K-Nearest Neighbors
* Decision Trees


In [53]:
# Logistic Regression
LR = LogisticRegression()
LR.fit(X_train, y_train)
LR_pred = LR.predict(X_test)

# K-Nearest Neighbors
KNN = KNeighborsClassifier()
KNN.fit(X_train, y_train)
KNN_pred = KNN.predict(X_test)

# Decision Tree
DT = DecisionTreeClassifier(random_state = 0)
DT.fit(X_train, y_train)
DT_pred = DT.predict(X_test)

* ### Classification report

In [None]:
print("LR Classification Report: \n", classification_report(y_test, LR_pred, digits = 6))
print("KNN Classification Report: \n", classification_report(y_test, KNN_pred, digits = 6))
print("DT Classification Report: \n", classification_report(y_test, DT_pred, digits = 6))
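
For a one-number comparison next to the full reports, accuracy can be collected in a loop. This is a minimal sketch on synthetic data from `make_classification`; the `models` and `scores` dictionaries are illustrative names, not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the scaled wine features.
X_toy, y_toy = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

models = {
    'LR': LogisticRegression(),
    'KNN': KNeighborsClassifier(),
    'DT': DecisionTreeClassifier(random_state=0),
}

# Fit each model with default settings and store its test accuracy.
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_te, model.predict(X_te))

print(scores)
```

On an imbalanced target, accuracy alone can look good for a model that rarely predicts the minority class, so it is best read together with the per-class precision and recall.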

* ### Confusion matrix

In [None]:
LR_confusion_mx = confusion_matrix(y_test, LR_pred)
print("LR Confusion Matrix: \n", LR_confusion_mx)
print()
KNN_confusion_mx = confusion_matrix(y_test, KNN_pred)
print("KNN Confusion Matrix: \n", KNN_confusion_mx)
print()
DT_confusion_mx = confusion_matrix(y_test, DT_pred)
print("DT Confusion Matrix: \n", DT_confusion_mx)
print()
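
To read the four cells of a binary confusion matrix by name, the matrix can be flattened. The sketch below uses made-up labels; for labels 0 and 1, scikit-learn orders the cells as TN, FP, FN, TP.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions.
y_true = [0, 0, 0, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 1, 0, 0, 1, 0]

# Rows are true classes, columns are predicted classes, both in ascending label order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(tn, fp, fn, tp)  # 4 1 1 2
```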

* ### ROC-AUC score

In [None]:
# roc_auc_score expects the true labels first, then the predictions
roc_auc_score(y_test, DT_pred)
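
ROC-AUC is usually computed from scores or probabilities rather than hard 0/1 labels. The sketch below assumes `predict_proba` on a shallow tree fitted to synthetic data; all names and data here are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the wine features.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=6, weights=[0.8, 0.2], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

# Probability of the positive class (column 1) for every test sample.
proba = clf.predict_proba(X_te)[:, 1]

auc_from_proba = roc_auc_score(y_te, proba)
auc_from_labels = roc_auc_score(y_te, clf.predict(X_te))
print(auc_from_proba, auc_from_labels)
```

With hard labels the ROC curve has a single operating point, so the two values generally differ.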

* ### Build, train and evaluate models with hyperparameters

In [54]:
# Logistic Regression
LR = LogisticRegression()
LR_params = {'C':[1,2,3,4,5,6,7,8,9,10], 'penalty':['l1', 'l2', 'elasticnet', 'none'], 'solver':['lbfgs', 'newton-cg', 'liblinear', 'sag', 'saga'], 'random_state':[0]}
LR1 = GridSearchCV(LR, param_grid = LR_params)
LR1.fit(X_train, y_train)
LR1_pred = LR1.predict(X_test)

# K-Nearest Neighbors
KNN = KNeighborsClassifier()
KNN_params = {'n_neighbors':[5,7,9,11]}
KNN1 = GridSearchCV(KNN, param_grid = KNN_params)             
KNN1.fit(X_train, y_train)
KNN1_pred = KNN1.predict(X_test)

# Decision Tree
DT = DecisionTreeClassifier()
DT_params = {'max_depth':[2,10,15,20], 'criterion':['gini', 'entropy'], 'random_state':[0]}
DT1 = GridSearchCV(DT, param_grid = DT_params)
DT1.fit(X_train, y_train)
DT1_pred = DT1.predict(X_test)

In [55]:
# print the best hyper parameters set
print("Logistic Regression Best Hyper Parameters:   ", LR1.best_params_)
print("K-Nearest Neighbour Best Hyper Parameters:   ", KNN1.best_params_)
print("Decision Tree Best Hyper Parameters:         ", DT1.best_params_)
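
Besides `best_params_`, a fitted `GridSearchCV` exposes the cross-validated score of every candidate. This is a minimal sketch on synthetic data; the choice of `scoring='roc_auc'` and `cv=5` is an assumption for illustration, not the notebook's setting.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the training data.
X_toy, y_toy = make_classification(n_samples=200, n_features=6, random_state=0)

search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={'n_neighbors': [5, 7, 9, 11]},
    scoring='roc_auc',  # assumed metric for this sketch
    cv=5,
)
search.fit(X_toy, y_toy)

# One row per candidate, with its mean cross-validated score and rank.
results = pd.DataFrame(search.cv_results_)[
    ['param_n_neighbors', 'mean_test_score', 'rank_test_score']
]
print(search.best_params_, round(search.best_score_, 3))
print(results.sort_values('rank_test_score'))
```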


* ### Classification report

In [None]:
print("LR Classification Report: \n", classification_report(y_test, LR1_pred, digits = 6))
print("KNN Classification Report: \n", classification_report(y_test, KNN1_pred, digits = 6))
print("DT Classification Report: \n", classification_report(y_test, DT1_pred, digits = 6))

* ### Confusion matrix

In [56]:
# confusion matrix of DT model
DT_confusion_mx = confusion_matrix(y_test, DT1_pred)
print('DT Confusion Matrix')

# visualisation
# confusion_matrix sorts labels in ascending order: rows are true 0, 1 and columns are predicted 0, 1
ax = plt.subplot()
sns.heatmap(DT_confusion_mx, annot = True, fmt = 'd', cmap = 'Blues', ax = ax, linewidths = 0.5, annot_kws = {'size': 15})
ax.set_ylabel('True label')
ax.set_xlabel('Predicted label')
ax.xaxis.set_ticklabels(['0', '1'], fontsize = 10)
ax.yaxis.set_ticklabels(['0', '1'], fontsize = 10)
plt.show()
print() 

* ### ROC-AUC score

In [None]:
# roc_auc_score expects the true labels first, then the predictions
roc_auc_score(y_test, DT1_pred)

##  Conclusion.

In [57]:
# submission of .csv file with predictions
sub = pd.DataFrame()
sub['ID'] = X_test.index
sub['quality'] = DT1_pred
sub.to_csv('WinePredictionsTest.csv', index=False)


Question: Predict which wines are 'Good/1' and 'Not Good/0' (use binary classification; check the balance of classes; calculate predictions; choose the best model).

Answers:

   1. Binary classification was applied.

   2. Classes were highly imbalanced with 78.36 % of '0' class and only 21.64 % of '1' class in our dataset.

   3. Three options were applied in order to calculate the best predictions:
      -  Calculate predictions with imbalanced dataset
      -  Calculate predictions with random under-sampling technique of an imbalanced dataset
      -  Calculate predictions with random over-sampling technique of an imbalanced dataset

   4. Three ML models were used: Logistic Regression, KNN, Decision Tree (without and with hyper parameters).

   5. The best result was chosen:
       - Random over-sampling dataset with 3838 entities in class '0' and 3838 entities in class '1', 7676 entities in total.
       - Train/Test split: test_size=0.2, random_state=0
       - Decision Tree model without hyperparameter tuning, with an accuracy score equal to ... and a ROC-AUC score equal to ... .

        
