# Red Wine Quality

This dataset is provided by [Cortez et al., 2009](https://doi.org/10.1016/j.dss.2009.05.016) and is part of the work ***Modeling wine preferences by data mining from physicochemical properties*** <a href="#1">1</a>, uploaded to the [Kaggle platform](https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009).

In the present work we use this dataset to apply the concepts from the syllabus of the course Introduction to Data Science, offered by **ICMC - USP** as part of the **Master in Business Administration in Data Science Program** <a href="#2">2</a>.

This work is divided as follows:

<a href="#intro">1. Introduction</a>

<a href="#eda">2. Exploratory Data Analysis</a>

<a href="#data-cleansing">3. Data Cleansing and Data Treatment</a>

<a href="#ml-models">4. Machine Learning Models</a>

- <a href="#classification">4.1. Classification</a>

    - <a href="#scenario1">4.1.1. Scenario 1 - Multi-class problem</a>
    
    - <a href="#scenario2">4.1.2. Scenario 2 - 2-classes problem</a>
    
- <a href="#regression">4.2. Regression</a>
    
<a href="#conclusions">5. Conclusions</a>

<a href="#references">6. References</a>


## <a id="intro">1. Introduction</a>

As defined by the author of the dataset on Kaggle:
*"This datasets is related to red variants of the Portuguese "Vinho Verde" wine. For more details, consult the reference [Cortez et al., 2009]. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).
The datasets can be viewed as classification or regression tasks. The classes are ordered and not balanced (e.g. there are much more normal wines than excellent or poor ones)."*

This dataset presents data about wine quality. The main goal of this work is to explore it, draw some hypotheses and, by applying some techniques, reach conclusions about which features affect wine quality and how.

On Kaggle, this dataset is published on its own, meaning that there is no competition behind it.
So it is up to whoever takes this dataset to divide it into train, validation and test subsets.
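As a minimal sketch of such a division, assuming a 60% / 20% / 20% proportion and stand-in random data (the names `X_demo`, `y_demo` and the proportions are illustrative assumptions, not part of the dataset), `train_test_split` can be applied twice:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 samples, 3 features, binary labels
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = rng.integers(0, 2, size=100)

# First hold out 20% of the samples as the test set ...
X_rest, X_test_demo, y_rest, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)
# ... then split the remaining 80% into train (75%) and validation (25%),
# which gives 60% / 20% / 20% of the original data
X_train_demo, X_val_demo, y_train_demo, y_val_demo = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42)

print(len(X_train_demo), len(X_val_demo), len(X_test_demo))  # 60 20 20
```

In the rest of this work the validation role is played by cross-validation on the training set, so only a train/test split is made explicitly.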

In [None]:
# Imports and setup

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # plotting library used to display the charts

import warnings
warnings.filterwarnings('ignore')


DF = pd.read_csv('/kaggle/input/red-wine-quality-cortez-et-al-2009/winequality-red.csv')

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

#import networkx as nx

#import plotly.express as px
#import plotly.figure_factory as ff
#from plotly.graph_objs import graph_objs
#import plotly.graph_objects as go
import matplotlib.pyplot as plt
#import seaborn as sns

import itertools
import time

from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score

from sklearn.dummy import DummyClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import RidgeClassifier, Lasso, LogisticRegression
from sklearn.naive_bayes import GaussianNB
#from xgboost.sklearn import XGBClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from mlxtend.plotting import plot_decision_regions
#from sklearn.gaussian_process import GaussianProcessClassifier
#from sklearn.gaussian_process.kernels import RBF

from sklearn.model_selection import GridSearchCV
from sklearn.metrics import average_precision_score
from sklearn.metrics import make_scorer
from sklearn import preprocessing
from sklearn.linear_model import LinearRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import accuracy_score


In [None]:
for c,d in zip(DF.columns, DF.dtypes):
    print("Column {}, type {}".format(c,d))
print("Number of missing values: {}.".format(DF.isna().sum().sum()))
print("Number of duplicated rows: {}.".format(DF.shape[0] - DF.drop_duplicates().shape[0]))
DF.drop_duplicates(inplace=True)
DF.describe()

As we can see, all features are numeric and there are no rows with null values (every column has the same number of values, which equals the number of rows in the dataset).

Some features also appear to be correlated, in groups such as (fixed acidity, volatile acidity and citric acid) and (free sulfur dioxide, total sulfur dioxide and sulphates).

Finally, Wine Quality is recorded as an integer ranging from 3 to 8.

To understand the dataset's columns better, we can use the descriptions available on Kaggle:

**fixed acidity:** most acids involved with wine are fixed or nonvolatile (do not evaporate readily)

**volatile acidity:** the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste

**citric acid:** found in small quantities, citric acid can add 'freshness' and flavor to wines

**residual sugar:** the amount of sugar remaining after fermentation stops, it's rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet

**chlorides:** the amount of salt in the wine

**free sulfur dioxide:** the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine

**total sulfur dioxide:** amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine

**density:** the density of wine is close to that of water depending on the percent alcohol and sugar content

**pH:** describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale

**sulphates:** a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant

**alcohol:** the percent alcohol content of the wine

**quality:** output variable (based on sensory data, score between 0 and 10)

After reading the description of each feature, we can confirm or reject some of our first impressions of the dataset:

Yes, some features are indeed related to each other: for instance, the sulfur dioxide measures and sulphates are all linked to SO2 gas.

The Wine Quality score can range from 0 to 10, although the samples in this dataset only range from 3 to 8.

One last interesting point: quality evaluation is a subjective matter (despite the efforts of oenologists to classify quality consistently). Nonetheless, there seem to be some underlying rules in wine quality classification, such as:

- volatile acidity can't be *too high*, or the wine will taste unpleasant;
- citric acid in the *correct amount* adds 'freshness' and flavor to wines (which is a **very distinct characteristic** of this wine, as stated on the author's academic page: *"Vinho verde is a unique product from the Minho (northwest) region of Portugal. Medium in alcohol, is it particularly appreciated due to its freshness (specially in the summer)"* <a href='#3'>3</a>);
- SO2 can become evident at free SO2 concentrations over 50 ppm, which may interfere with the wine's taste and aroma.

Given the rules above, a first hypothesis comes to mind:

Rules like these could be learned by a Decision Tree, because this type of algorithm builds a (not necessarily balanced) tree in which each internal node tests a feature against a criterion and each leaf determines the class of a sample.

So **Hypothesis 1**: *"A Decision Tree will perform well in classifying this dataset, by identifying the underlying rules that determine wine quality."*
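To make the idea of "rules learned by a tree" concrete, here is a small sketch on synthetic data (not the wine dataset). The feature names `feature_a` and `feature_b` and the hidden rule are assumptions made only for this illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data: the label follows a single hidden rule on the first feature
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 1, size=(300, 2))
y_toy = (X_toy[:, 0] > 0.5).astype(int)  # class 1 when feature_a > 0.5

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_toy, y_toy)

# export_text prints the learned feature/threshold tests as nested rules
print(export_text(tree, feature_names=['feature_a', 'feature_b']))
```

The printed rules should show a single test on `feature_a` with a threshold close to 0.5, which is the kind of structure Hypothesis 1 expects to find in the wine data.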

## <a id='eda'>2. Exploratory Data Analysis</a>

We can also analyze the distribution of the data graphically:

In [None]:
# Plot the distribution of each feature as a histogram

def histogram(data, title, ax):
    n_bins = 30
    ax.hist(data, n_bins, density=True, histtype='bar')
    ax.set_title(title)

fig, axes = plt.subplots(nrows=4, ncols=3, figsize=(15,15))
for i in range(4):
    for j in range(3):
        idx_col = i*3+j
        # Skip grid cells that have no corresponding column
        if(idx_col >= DF.shape[1]):
            continue
        col = list(DF.columns)[idx_col]
        ax = axes[i][j]
        histogram(DF[col], col, ax)

fig.tight_layout()
plt.show()

Another interesting analysis concerns correlation. From the correlations we can see which variables move together and make decisions such as dropping one of two redundant features or, when a feature correlates with the target, examining that feature more closely.

In [None]:
corr = DF.corr()
#Plot Correlation Matrix using Matplotlib
plt.figure(figsize=(7, 5))
plt.imshow(corr, cmap='YlOrBr', interpolation='none', aspect='auto')
plt.colorbar()
plt.xticks(range(len(corr)), corr.columns, rotation='vertical')
plt.yticks(range(len(corr)), corr.columns);
plt.suptitle('Correlation between variables', fontsize=15, fontweight='bold')
plt.grid(False)
plt.show()

In the previous plot the correlations are hard to tell apart, so for a closer look we switch to a text-based approach:

In [None]:
def correlation_pairs(df, threshold, sort=False):
    """
        Return the pairs of features of a given DataFrame :df:
        whose absolute correlation is greater than :threshold:.
        If :sort: is True, the strongest correlations come first.
    """
    pairs = []
    corr = df.corr()
    cols = corr.columns
    # Visit only the lower triangle so that each pair is reported once
    for i in range(len(cols)):
        for j in range(i):
            corr_val = corr.iloc[i, j]
            if(abs(corr_val) > threshold):
                pairs.append((cols[i], cols[j], corr_val))
    if sort:
        pairs.sort(key=lambda p: abs(p[2]), reverse=True)
    return pairs

correlation_pairs(DF, 0.3)

Now we can see which variables are most correlated, confirm some of our initial thoughts (the citric acid - fixed acidity - volatile acidity and total sulfur dioxide - free sulfur dioxide correlations), and spot new and perhaps more interesting ones, such as **quality - volatile acidity - alcohol**.
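The same pair extraction can also be sketched without explicit loops. The example below uses toy data with hypothetical columns `a`, `b` and `c` (where `b` is built from `a`), and is only one possible way of writing it:

```python
import numpy as np
import pandas as pd

# Toy data: b is almost a multiple of a, c is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({'a': a,
                    'b': 2 * a + rng.normal(scale=0.1, size=200),
                    'c': rng.normal(size=200)})

corr_toy = toy.corr()
# Keep only the strict lower triangle so that each pair appears once
lower = np.tril(np.ones(corr_toy.shape, dtype=bool), k=-1)
pairs_toy = corr_toy.where(lower).stack()
# Comparisons with NaN are False, so masked entries never pass the filter
strong = pairs_toy[pairs_toy.abs() > 0.3]
print(strong)
```

Here only the pair (`b`, `a`) is expected to pass the 0.3 threshold, since `c` is unrelated to the other two columns.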

Let's analyze the features for two different groups to see whether their patterns differ:

(*Obs.: This division was first proposed by the dataset's publisher on Kaggle.*)

In [None]:
DF_qlt7 = DF[DF['quality'] < 7]
DF_qge7 = DF[DF['quality'] >= 7]

In [None]:
print(DF_qlt7.shape)
DF_qlt7.describe()

In [None]:
print(DF_qge7.shape)
DF_qge7.describe()

In [None]:
qlt7_stats = DF_qlt7.describe().loc[['mean', 'std']]
qge7_stats = DF_qge7.describe().loc[['mean', 'std']]

Taking a closer look at the percentage difference between the mean and standard deviation of each feature in the two sets, we have:

In [None]:
round(((qlt7_stats - qge7_stats) / qge7_stats) * 100, 2)

To remove the effect of the different number of samples, we can draw a sample from the larger set with the same size as the smaller one.

In [None]:
DF_qlt7_eq = DF_qlt7.sample(n=DF_qge7.shape[0], random_state=1)
qlt7_eq_stats = DF_qlt7_eq.describe().loc[['mean', 'std']]
round(((qlt7_eq_stats - qge7_stats) / qge7_stats) * 100, 2)

Apparently, some features really differ between the two groups (a large percentage difference in mean together with a small one in standard deviation):

- citric acid;
- total sulfur dioxide;
- alcohol.

These features may be a good starting point for telling bad wines from good ones.

So **Hypothesis 2:** *Since the features 'citric acid', 'total sulfur dioxide' and 'alcohol' differ relatively strongly between wines considered bad and wines considered good, they will be highly relevant to the classifiers.*

## <a id='data-cleansing'>3. Data Cleansing and Data Treatment</a>

Some of the most widely used techniques of Data Cleansing and Data Treatment include:

- Manual feature deletion (deleting features that do not contribute to the model, e.g. the name of the wine).
- Data Integration (when you have multiple sources of data, you may want to merge them in some way in order to enrich your data).
- Data Sampling (when you have too many samples, you may use only part of them to reduce the cost of training the model).
- Dimensionality Reduction (the same idea as Data Sampling, but applied to columns).
- Data Balancing (in classification, when one class has many more samples than the others, the data is unbalanced. This makes classifiers harder to train and evaluate, because even a dummy classifier, one that always predicts the most frequent class in the dataset, reaches a high accuracy score. Balancing the data means taking the same number of samples per class).
- Data Cleansing (aims to resolve common problems with data, such as inconsistency, redundancy, missing values and outliers).
- Data Transformation (changing the representation of the data, e.g. bringing features to the same scale with standardization, min-max scaling or softmax, translating values, or encoding them).
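The remark about the dummy classifier can be checked with a small sketch. The labels below are hypothetical (90 samples of one class and 10 of the other) and are not taken from the wine dataset:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical unbalanced labels: 90 samples of class 0 and 10 of class 1
y_unbal = np.array([0] * 90 + [1] * 10)
X_unbal = np.zeros((100, 1))  # the features are ignored by the dummy classifier

dummy = DummyClassifier(strategy='most_frequent').fit(X_unbal, y_unbal)
pred = dummy.predict(X_unbal)

# High accuracy, although the minority class is never predicted
print(accuracy_score(y_unbal, pred))                           # 0.9
print(f1_score(y_unbal, pred, pos_label=1, zero_division=0))   # 0.0
```

This is why accuracy alone can be misleading on unbalanced data, and why a per-class metric such as the f1-score is worth checking as well.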

In this project we have already used one data cleansing tool (drop_duplicates) to delete duplicated rows from the dataset. Next we detect and treat outliers (if present).

After that, we standardize the data to reduce the effect of different scales between features.

The other techniques do not contribute much to this project, so we will not use them.
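As a quick illustration of what rescaling does, the sketch below compares standardization with min-max scaling on two hypothetical features with very different scales (the values are made up for the example):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Two hypothetical features on very different scales
X_scales = np.array([[1.0, 100.0],
                     [2.0, 300.0],
                     [3.0, 500.0]])

X_std = StandardScaler().fit_transform(X_scales)  # (x - mean) / std, per column
X_mm = MinMaxScaler().fit_transform(X_scales)     # (x - min) / (max - min), per column

print(X_std.mean(axis=0), X_std.std(axis=0))  # approximately [0, 0] and [1, 1]
print(X_mm.min(axis=0), X_mm.max(axis=0))     # [0, 0] and [1, 1]
```

After either transformation both columns live on comparable scales, which matters for distance-based models such as KNN and SVM.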

In [None]:
def boxplot(data, title, ax): #index
    green_diamond = dict(markerfacecolor='g', marker='D')
    ax.set_title(title)
    ax.boxplot(data, flierprops=green_diamond)

fig, axes = plt.subplots(nrows=4, ncols=3, figsize=(15,15))
for i in range(4):
    for j in range(3):
        idx_col = i*3+j
        if(idx_col >= DF.shape[1]):
            continue
        col = list(DF.columns)[idx_col]
        ax = axes[i][j]
        boxplot(DF[col], col, ax)

fig.tight_layout()
plt.show()

In [None]:
# To detect and remove outliers we use the zscore function of the scipy.stats package
from scipy.stats import zscore

z = np.abs(zscore(DF.to_numpy()))
threshold = 3 # flag values more than 3 standard deviations away from the mean
# A row is an outlier if at least one of its values exceeds the threshold
z_outlier_rows = (z > threshold).any(axis=1)
print("Using Z-score to remove outliers we would remove ~ {} % of our data.".format(round(z_outlier_rows.mean() * 100, 2)))

Q1 = DF.quantile(0.25)
Q3 = DF.quantile(0.75)
IQR = Q3 - Q1
iqrmask = (DF < (Q1 - 1.5 * IQR)) | (DF > (Q3 + 1.5 * IQR))
iqr_outlier_rows = iqrmask.any(axis=1)
print("Using IQR to remove outliers we would remove ~ {} % of our data.".format(round(iqr_outlier_rows.mean() * 100, 2)))

# So we use Z-score to reduce the data loss
DF = DF[~z_outlier_rows]

In [None]:
# First of all, we need to split the data into two sets:
# X -> all independent variables (features)
# y -> the dependent variable (target feature)

X = DF[DF.columns[:-1]]
y = DF[DF.columns[-1]]

In [None]:
from sklearn.preprocessing import StandardScaler

columns = X.columns

# Fit the scaler on the features
scaler = StandardScaler().fit(X)
# Standardize data (mean = 0, variance = 1)
X = pd.DataFrame(scaler.transform(X), columns=columns)

Let's see how our dataset looks after the data treatment:

In [None]:
X.describe()

## <a id='ml-models'>4. Machine Learning Models</a>

In this section we start our analysis with Machine Learning models.
A Machine Learning model is a function that tries to predict a dependent value $y$ from a set of independent values $X_{i}$. In the linear case, each $X_{i}$ is multiplied by an adjusted coefficient $\Theta_{i}$, with some error $\epsilon$:

$$f(X) = \sum_{i} \Theta_{i} X_{i} + \Theta_{0} + \epsilon = y$$

When the dependent variable takes discrete values (e.g. natural numbers), we say that our function is a **classifier function**.

When the dependent variable takes continuous values (real numbers), we say that our function is a **regression function**.
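As a worked example of the linear form above, the sketch below evaluates the prediction for one sample. The coefficients and feature values are hypothetical, and the thresholding at 0 is just one simple way of turning a continuous output into a class:

```python
import numpy as np

# Hypothetical coefficients and one sample with three features
theta_0 = 0.5
theta = np.array([2.0, -1.0, 0.25])
x = np.array([1.0, 3.0, 4.0])

# Linear prediction: theta_0 + sum_i theta_i * x_i
y_hat = theta_0 + theta @ x
print(y_hat)  # 0.5 + 2.0 - 3.0 + 1.0 = 0.5

# A regression function returns the continuous value itself;
# a classifier maps it to a discrete class, here by thresholding at 0
label = int(y_hat > 0)
print(label)  # 1
```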

Some of the most popular models are (among others):

- KNN (K-Nearest Neighbors)
- Decision Trees
- Naive-Bayes
- SVM (Support Vector Machines)
- Random Forests

We will use, tune and compare them in order to predict the *target feature* ***'quality'***.

To test some concepts, we will also use **Linear Regression models** such as **Lasso** and **Ridge** regression to predict the alcohol percentage.

But first, let's see how the data is distributed across the classes.

To see this we need a 2-dimensional plot, and to help choose what goes on its two axes we use PCA (Principal Component Analysis).
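Before running PCA on the wine data, a small sketch on synthetic data may help in reading its output. The data below is an assumption made for illustration: the first two features are strongly correlated and the third is independent.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: features 0 and 1 are strongly correlated, feature 2 is independent
rng = np.random.default_rng(0)
base = rng.normal(size=300)
X_pca = np.column_stack([base,
                         base + rng.normal(scale=0.1, size=300),
                         rng.normal(size=300)])

pca_demo = PCA().fit(X_pca)

# One ratio per component, in decreasing order and summing to 1
print(pca_demo.explained_variance_ratio_)
# Each row holds the weights (loadings) of all original features in one component
print(pca_demo.components_)
```

Note that each principal component is a linear combination of all original features, so an explained variance ratio describes a component rather than a single column. Here the first component mixes the two correlated features with similar weights.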

In [None]:
pca = PCA()
pca_result = pca.fit_transform(X)
var_exp = pca.explained_variance_ratio_

# Each principal component is a linear combination of all features, so the
# explained variance belongs to components (PC1, PC2, ...), not to single columns
comp_names = ['PC{}'.format(i + 1) for i in range(len(var_exp))]
comp_x_var_exp = sorted(zip(comp_names, var_exp), key=lambda x: x[1])
importances = [ratio for _, ratio in comp_x_var_exp]
comp_rank = [name for name, _ in comp_x_var_exp]

for name, ratio in comp_x_var_exp:
    print(name, ratio)

plt.title('Explained variance ratio per principal component')
plt.barh(range(len(importances)), importances, color='b', align='center')
plt.yticks(range(len(importances)), comp_rank, color='k', size=15)
plt.xlabel('Explained variance ratio', fontsize=15)
plt.xticks(color='k', size=15)
plt.xlim([0.0, 1])
plt.tight_layout()
plt.show()

We can also look at the relation between the number of components and how much variance those components explain.

In [None]:
pca = PCA().fit(X)
plt.figure(figsize=(8,5))
ncomp = np.arange(1, np.shape(X)[1]+1)
plt.plot(ncomp, np.cumsum(pca.explained_variance_ratio_), 'ro-')
plt.xlabel('number of components', fontsize=15)
plt.ylabel('cumulative explained variance', fontsize=15);
plt.xticks(color='k', size=15)
plt.yticks(color='k', size=15)
plt.grid(True)
plt.show()

Now we plot the data using two of the attributes:

In [None]:
features_names = ['fixed acidity', 'volatile acidity']
X_ = X[features_names]
class_labels = np.unique(y)
# Plotting
all_colors = ['red', 'blue', 'orange', 'purple', 'green', 'yellow', 'black']
colors = all_colors[:len(class_labels)]
for i, c in enumerate(class_labels):
    # Boolean mask (by position) of the samples that belong to class c
    mask = (y == c).to_numpy()
    # show each class in a different color
    plt.scatter(X_.loc[mask, features_names[0]], X_.loc[mask, features_names[1]], color = colors[i], label = c)
plt.xlabel(features_names[0])
plt.ylabel(features_names[1])
plt.legend()
plt.show()

### <a id='classification'>4.1. Classification</a>

Now we use our custom function test_models to test each model and compare them.
To test each model we use Grid Search Cross Validation, which tunes a model by trying every combination of the selected parameter values and evaluates each combination with cross-validation.
After that, the best parameters are selected for each model.
To visualize the fitted models, we then plot the decision boundaries of each one.
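As a small illustration of the grid search described above, here is a sketch on synthetic data. The grid values, the 5 folds and the data are assumptions chosen only to keep the example short:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, ParameterGrid
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (not the wine dataset)
X_gs, y_gs = make_classification(n_samples=200, n_features=5, random_state=0)

# Hypothetical small grid: 3 x 2 = 6 parameter combinations
grid = {'n_neighbors': [3, 5, 7], 'weights': ['uniform', 'distance']}
print(len(ParameterGrid(grid)))  # 6

# Every combination is scored with 5-fold cross-validation
search = GridSearchCV(KNeighborsClassifier(), grid, cv=5, scoring='accuracy')
search.fit(X_gs, y_gs)

print(search.best_params_)
print(round(search.best_score_, 3))
```

The number of model fits grows as (number of combinations) x (number of folds), which is why large grids such as the ones used below can take a while to run.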

In [None]:
def test_models(models, cv, X, y, scoring=None):
    """Tune each entry of `models` with GridSearchCV and store the results in its dict."""
    for entry in models:
        print("Testing ", entry['name'])

        model = entry['model']

        # Optionally wrap the estimator in a multi-class strategy (e.g. OneVsRestClassifier)
        multiclassClassifier = entry.get('multiclassClassifier')
        if multiclassClassifier is not None:
            model = multiclassClassifier(model)

        clf = GridSearchCV(model, entry['params'], cv=cv, scoring=scoring, verbose=2, n_jobs=-1)
        clf.fit(X, y)
        # Approximate total fit time: mean fit time of every candidate times the number of folds
        entry['exec_time'] = (sum(clf.cv_results_['mean_fit_time']) * cv)
        entry['best_params'] = clf.best_params_
        entry['best_model'] = clf.best_estimator_
        entry['best_score'] = clf.best_score_

### <a id='scenario1'>4.1.1. Scenario 1 - Multi-class problem</a>

In this first scenario we show in detail how we train, test and evaluate our models (with all the extra difficulties of a multi-class problem).

In [None]:
lb = preprocessing.LabelBinarizer()
y_encoded = lb.fit_transform(y)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y_encoded, test_size = 0.2, random_state = 42)

In [None]:
# ALL
# ---------------------------------------------------------------------
knn_params = {
    'n_neighbors' : list(range(6,50)),
    'weights' : ['uniform', 'distance'],
    'p' : [1, 2]
}
# ---------------------------------------------------------------------
svc_params = {
    'estimator__C' : [0.01, 0.1, 1, 10],
    'estimator__gamma' : ['auto', 'scale'],
    'estimator__class_weight' : [None, 'balanced'],
}
# ---------------------------------------------------------------------
dt_params = {
    'max_depth' : [1, 3, 5, 8, 13, 21, 34],
    'criterion' : ['gini', 'entropy'],
    'splitter' : ['best', 'random']
}
# ---------------------------------------------------------------------
gnb_params = {}
# ---------------------------------------------------------------------
rf_params = {
    'max_depth' : [1, 3, 5, 7, 11, 21],
    'n_estimators' : [3, 10, 20, 50, 100, 200],
    'max_features' : [2, 3, 5, 7, 9]
}
# ---------------------------------------------------------------------

Models = [
    {'name': "Dummy", 'model' : DummyClassifier(strategy='most_frequent', random_state=0), 'params' : {}, 'multiclassTransformation' : True, 'best_model' : None,'best_score' : 0, 'best_params' : None, 'exec_time' : 0.0},
    {'name': "KNN", 'model' : KNeighborsClassifier(), 'params' : knn_params, 'multiclassTransformation' : True, 'best_model' : None,'best_score' : 0, 'best_params' : None, 'exec_time' : 0.0},
    {'name': "DecisionTreeClassifier", 'model' : DecisionTreeClassifier(), 'params' : dt_params, 'multiclassTransformation' : True, 'best_model' : None,'best_score' : 0, 'best_params' : None, 'exec_time' : 0.0},
    {'name': "GaussianNB", 'model' : GaussianNB(), 'params' : gnb_params, 'multiclassTransformation' : True, 'multiclassClassifier' : OneVsRestClassifier, 'best_model' : None,'best_score' : 0, 'best_params' : None, 'exec_time' : 0.0},
    {'name': "SVC", 'model' : SVC(), 'params' : svc_params, 'multiclassTransformation' : True, 'multiclassClassifier' : OneVsRestClassifier, 'best_model' : None,'best_score' : 0, 'best_params' : None, 'exec_time' : 0.0},
    {'name': "RandomForestClassifier", 'model' : RandomForestClassifier(), 'params' : rf_params, 'multiclassTransformation' : True, 'best_model' : None,'best_score' : 0, 'best_params' : None, 'exec_time' : 0.0},
]

test_models(Models, 10, X_train, y_train, scoring=make_scorer(accuracy_score))
Models

In [None]:
from sklearn.metrics import confusion_matrix, classification_report

for model in Models:
    clf = model['best_model']
    name = model['name']
    
    predictions_clf = clf.predict(X_test)
    predictions_clf_decoded = lb.inverse_transform(predictions_clf)
    y_test_decoded = lb.inverse_transform(y_test)
    print(name ,classification_report(y_test_decoded, predictions_clf_decoded))

These are the results of the classifiers (considering the **f1-score**):

- KNN (k = 8, using Euclidean distance): 55%
- Decision Tree (criterion = 'entropy'): 49%
- Naive-Bayes (kernel='gaussian'): 51%
- SVC (C = 10): 57%
- Random Forest (# estimators = 100): 56%

So the best classifier is SVC (C = 10).
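For reference, the sketch below shows how an f1-score is obtained from predictions, using hypothetical true and predicted quality labels (not results from this notebook). It assumes the weighted average, which is one of the averages that `classification_report` prints:

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical true and predicted quality labels
y_true_demo = [5, 5, 5, 5, 6, 6, 7, 7]
y_pred_demo = [5, 5, 5, 6, 6, 5, 5, 7]

print(accuracy_score(y_true_demo, y_pred_demo))  # 0.625 (5 of 8 correct)

# Per-class F1 (harmonic mean of precision and recall), in label order 5, 6, 7
f1_per_class = f1_score(y_true_demo, y_pred_demo, average=None)
print(f1_per_class)

# 'weighted' averages the per-class F1 using the class frequencies as weights
f1_weighted = f1_score(y_true_demo, y_pred_demo, average='weighted')
print(f1_weighted)
```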

Plotting the decision boundaries shows the difference between the classifiers:

In [None]:
features_names = ['fixed acidity', 'volatile acidity']
X_ = X[features_names]

classifiers = [x['best_model'] for x in Models]
model_names = [x['name'] for x in Models]

# We save this variable to restore later
max_features = Models[5]['best_model'].max_features
# changing the best model of RandomForests to match with new number of features
Models[5]['best_model'].max_features = 2

for model in Models:
    clf = model['best_model']
    name = model['name']
    
    clf.fit(X_, y)
    
    plot_decision_regions(np.array(X_), np.array(y), clf=clf, legend=2)
    
    plt.xlabel('X1')
    plt.ylabel('X2')
    plt.title('Decision Regions for ' + name)
    
    plt.show()

Next we train only the Random Forest algorithm separately and plot the feature importances of the best model found.

In [None]:
rf_params = {
    'max_depth' : [1, 3, 5, 7, 11, 21],
    'n_estimators' : [3, 10, 20, 50, 100, 200],
    'max_features' : [2, 3, 5, 7, 9]
}
# ---------------------------------------------------------------------

RFModel = [
    {'name': "RandomForestClassifier", 'model' : RandomForestClassifier(), 'params' : rf_params, 'multiclassTransformation' : True, 'best_model' : None,'best_score' : 0, 'best_params' : None, 'exec_time' : 0.0},
]

test_models(RFModel, 10, X, y, scoring=make_scorer(accuracy_score))

In [None]:
features_names = X.columns # names of the features only (the target is not an input of the model)
importances = RFModel[0]['best_model'].feature_importances_
indices = np.argsort(importances)
lmeas_order = []
for i in indices:
    lmeas_order.append(features_names[i])
plt.figure(figsize=(10,6))
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), lmeas_order, fontsize=15)
plt.xlabel('Relative Importance',fontsize=15)
plt.xticks(color='k', size=15)
plt.yticks(color='k', size=15)
plt.show()

### <a id='scenario2'>4.1.2. Scenario 2 - 2-classes problem</a>

One last thing we can do, just as a test, is to separate the wines in a binary way, as the dataset author suggests:

```
$quality$ > 6.5 => "good"
TRUE => "bad"
```

Considering this, we have:

In [None]:
# =====================================================================================================
X = DF[DF.columns[:-1]]
y = DF[DF.columns[-1]].copy()

# =====================================================================================================

columns = X.columns
scaler = StandardScaler().fit(X)
X = pd.DataFrame(scaler.transform(X), columns=columns)

# =====================================================================================================

y[DF[DF.columns[-1]] > 6.5] = 1
y[DF[DF.columns[-1]] <= 6.5] = 0

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

# =====================================================================================================

# ALL
# ---------------------------------------------------------------------
knn_params = {
    'n_neighbors' : list(range(6,50)),
    'weights' : ['uniform', 'distance'],
    'p' : [1, 2]
}
# ---------------------------------------------------------------------
svc_params = {
    'estimator__C' : [0.01, 0.1, 1, 10],
    'estimator__gamma' : ['auto', 'scale'],
    'estimator__class_weight' : [None, 'balanced'],
}
# ---------------------------------------------------------------------
dt_params = {
    'max_depth' : [1, 3, 5, 8, 13, 21, 34],
    'criterion' : ['gini', 'entropy'],
    'splitter' : ['best', 'random']
}
# ---------------------------------------------------------------------
gnb_params = {}
# ---------------------------------------------------------------------
rf_params = {
    'max_depth' : [1, 3, 5, 7, 11, 21],
    'n_estimators' : [3, 10, 20, 50, 100, 200],
    'max_features' : [2, 3, 5, 7, 9]
}
# ---------------------------------------------------------------------

Models = [
    {'name': "Dummy", 'model' : DummyClassifier(strategy='most_frequent', random_state=0), 'params' : {}, 'multiclassTransformation' : True, 'best_model' : None,'best_score' : 0, 'best_params' : None, 'exec_time' : 0.0},
    {'name': "KNN", 'model' : KNeighborsClassifier(), 'params' : knn_params, 'multiclassTransformation' : True, 'best_model' : None,'best_score' : 0, 'best_params' : None, 'exec_time' : 0.0},
    {'name': "DecisionTreeClassifier", 'model' : DecisionTreeClassifier(), 'params' : dt_params, 'multiclassTransformation' : True, 'best_model' : None,'best_score' : 0, 'best_params' : None, 'exec_time' : 0.0},
    {'name': "GaussianNB", 'model' : GaussianNB(), 'params' : gnb_params, 'multiclassTransformation' : True, 'multiclassClassifier' : OneVsRestClassifier, 'best_model' : None,'best_score' : 0, 'best_params' : None, 'exec_time' : 0.0},
    {'name': "SVC", 'model' : SVC(), 'params' : svc_params, 'multiclassTransformation' : True, 'multiclassClassifier' : OneVsRestClassifier, 'best_model' : None,'best_score' : 0, 'best_params' : None, 'exec_time' : 0.0},
    {'name': "RandomForestClassifier", 'model' : RandomForestClassifier(), 'params' : rf_params, 'multiclassTransformation' : True, 'best_model' : None,'best_score' : 0, 'best_params' : None, 'exec_time' : 0.0},
]

test_models(Models, 10, X_train, y_train, scoring=make_scorer(accuracy_score))

# =====================================================================================================

for model in Models:
    clf = model['best_model']
    name = model['name']
    
    predictions_clf = clf.predict(X_test)
    print(name ,classification_report(y_test, predictions_clf))


In [None]:
features_names = ['fixed acidity', 'volatile acidity']
X_ = X[features_names]

classifiers = [x['best_model'] for x in Models]
model_names = [x['name'] for x in Models]

# We save this variable to restore later
max_features = Models[5]['best_model'].max_features
# changing the best model of RandomForests to match with new number of features
Models[5]['best_model'].max_features = 2

for model in Models:
    clf = model['best_model']
    name = model['name']
    
    clf.fit(X_, y)
    
    plot_decision_regions(np.array(X_), np.array(y), clf=clf, legend=2)
    
    plt.xlabel('X1')
    plt.ylabel('X2')
    plt.title('Decision Regions for ' + name)
    
    plt.show()

And again, we show the feature importances for the Random Forest algorithm:

In [None]:
rf_params = {
    'max_depth' : [1, 3, 5, 7, 11, 21],
    'n_estimators' : [3, 10, 20, 50, 100, 200],
    'max_features' : [2, 3, 5, 7, 9]
}
# ---------------------------------------------------------------------

RFModel = [
    {'name': "RandomForestClassifier", 'model' : RandomForestClassifier(), 'params' : rf_params, 'multiclassTransformation' : True, 'best_model' : None,'best_score' : 0, 'best_params' : None, 'exec_time' : 0.0},
]

test_models(RFModel, 10, X, y, scoring=make_scorer(accuracy_score))

# =====================================================================================================

features_names = X.columns # names of the features only (the target is not an input of the model)
importances = RFModel[0]['best_model'].feature_importances_
indices = np.argsort(importances)
lmeas_order = []
for i in indices:
    lmeas_order.append(features_names[i])
plt.figure(figsize=(10,6))
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), lmeas_order, fontsize=15)
plt.xlabel('Relative Importance',fontsize=15)
plt.xticks(color='k', size=15)
plt.yticks(color='k', size=15)
plt.show()

As we can see, the f1-score improved a lot!

So, training a model to classify 2 classes is much better, simpler and faster than multi-class classification.

Since in this particular case we can do both classifications, we are going to choose the 2-class classification.
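One plausible contributor to the higher score is that confusing two neighbouring quality grades counts as an error in the multi-class setting but not after the grades are merged into two classes. The sketch below is a toy illustration only: the labels are made up, and the cut-off of 7 is a hypothetical choice that may differ from the one used in Scenario 2.

```python
import numpy as np
from sklearn.metrics import f1_score

# Made-up quality scores for illustration only (not taken from the dataset)
toy_true = np.array([5, 5, 6, 6, 5, 6, 7, 7, 5, 6])
toy_pred = np.array([5, 6, 6, 5, 5, 6, 7, 6, 5, 6])

# Hypothetical cut-off (an assumption): scores of 7 or more count as "good"
cutoff = 7
toy_true_bin = (toy_true >= cutoff).astype(int)
toy_pred_bin = (toy_pred >= cutoff).astype(int)

# Same predictions, scored once with all grades and once with two classes
f1_multi = f1_score(toy_true, toy_pred, average='weighted')
f1_binary = f1_score(toy_true_bin, toy_pred_bin, average='weighted')
print('multi-class f1: {:.2f}, 2-class f1: {:.2f}'.format(f1_multi, f1_binary))
```

In this toy case the 5-versus-6 mistakes disappear once both grades fall into the same class, so the 2-class score is higher even though the predictions are identical.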

### <a id='regression'>4.2. Regression</a>

To test some concepts of regression, we are going to try to predict the alcohol values from the other features.

To do that, we are going to fit a plain linear regression as a baseline and then compare the Ridge and Lasso regression models.
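For reference, these are the objectives minimized by scikit-learn's `Ridge` and `Lasso` estimators, as stated in the library documentation. The symbols $w$ (coefficients), $X$ (feature matrix), $y$ (target) and $n$ (number of training samples) are notation introduced for this note:

$$
\hat{w}_{\text{ridge}} = \arg\min_{w} \; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2
$$

$$
\hat{w}_{\text{lasso}} = \arg\min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
$$

Because the two data terms are scaled differently, the same numeric value of $\alpha$ does not correspond to the same amount of regularization in both models.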

In [None]:
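from sklearn.linear_model import Lasso, LinearRegression, Ridge, RidgeCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Predictors: the first 10 columns plus the last column of DF
X_for_regression = DF[list(DF.columns[:10]) + [DF.columns[-1]]]
columns = X_for_regression.columns

# Fit the scaler on the regression predictors themselves, not on the
# classification matrix X, whose columns need not match `columns`
scaler = StandardScaler().fit(X_for_regression)
X_for_regression = pd.DataFrame(scaler.transform(X_for_regression), columns=columns, index=DF.index)
# Target: the column at position 10 of DF (alcohol)
y_for_regression = DF[DF.columns[10]]

# Split the data into training and testing sets
p = 0.3  # fraction of samples in the test set
x_train, x_test, y_train, y_test = train_test_split(X_for_regression, y_for_regression, test_size = p, random_state = 42)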

In [None]:
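from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Baseline: ordinary least squares on the training split
lm = LinearRegression()
lm.fit(x_train, y_train)

y_pred = lm.predict(x_test)

R2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
# format the values into the string instead of printing the raw template
print('R2 Coefficient: {} and MSE: {}'.format(round(R2, 2), round(mse, 2)))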
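To make the two metrics printed above more concrete, here is a minimal sketch that computes them from their definitions and compares the results with scikit-learn. The numbers are made-up toy values, not results from the wine data, and the `toy_*` names are introduced only for this example.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up target values and predictions, for illustration only
toy_y = np.array([9.4, 9.8, 10.5, 11.2, 12.0])
toy_hat = np.array([9.6, 9.9, 10.1, 11.0, 12.3])

# MSE: the mean of the squared residuals
mse_manual = np.mean((toy_y - toy_hat) ** 2)

# R2: 1 minus the ratio of the residual sum of squares to the total sum of squares
ss_res = np.sum((toy_y - toy_hat) ** 2)
ss_tot = np.sum((toy_y - toy_y.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print('R2: {:.2f} and MSE: {:.2f}'.format(r2_manual, mse_manual))
print('sklearn agrees:',
      np.isclose(r2_manual, r2_score(toy_y, toy_hat)),
      np.isclose(mse_manual, mean_squared_error(toy_y, toy_hat)))
```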

In [None]:
import matplotlib.pyplot as plt

fig = plt.figure()
l = plt.plot(y_pred, y_test, 'bo')
plt.setp(l, markersize=10)
plt.setp(l, markerfacecolor='C0')

plt.ylabel("y", fontsize=15)
plt.xlabel("Prediction", fontsize=15)

# reference line where the prediction equals the observed value
xl = np.arange(min(y_test), 1.2*max(y_test),(max(y_test)-min(y_test))/10)
yl = xl
plt.plot(xl, yl, 'r--')

plt.show()

In [None]:
np.random.seed(42)

vRDGMse = []   # test MSE of ridge for each alpha
vLASMse = []   # test MSE of lasso for each alpha
valpha = []
# varying values of alpha
for alpha in np.arange(1,30,1):
    
    # The `normalize` argument was removed from Ridge and Lasso in scikit-learn 1.2;
    # the features were already standardized with StandardScaler above
    ridge = Ridge(alpha = alpha, random_state=101)
    ridge.fit(x_train, y_train)             # Fit a ridge regression on the training data
    y_pred = ridge.predict(x_test)           # Use this model to predict the test data
    mse = mean_squared_error(y_test, y_pred)
    vRDGMse.append(mse)
    
    lasso = Lasso(alpha = alpha, random_state=101)
    lasso.fit(x_train, y_train)             # Fit a lasso regression on the training data
    y_pred = lasso.predict(x_test)           # Use this model to predict the test data
    mse = mean_squared_error(y_test, y_pred)
    vLASMse.append(mse)
    
    valpha.append(alpha)
    
plt.plot(valpha, vRDGMse, '-ro')
plt.plot(valpha, vLASMse, '-bo')
plt.xlabel("alpha", fontsize=15)
plt.ylabel("Mean Squared Error", fontsize=15)
plt.legend(['Ridge', 'Lasso'])
plt.show()

Looking at the graph above, we may conclude that as alpha increases, *ridge regression* performs worse, while lasso improves, but not considerably.
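To help read curves like these, the sketch below shows how the two penalties act on the coefficients as alpha grows: lasso sets coefficients exactly to zero, while ridge only shrinks them. It is a minimal illustration on synthetic data from `make_regression`, not on the wine data, and all `toy` names are introduced only for this example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration): 10 features, 3 informative
X_toy, y_toy = make_regression(n_samples=200, n_features=10, n_informative=3,
                               noise=10.0, random_state=42)
X_toy = StandardScaler().fit_transform(X_toy)
y_toy = (y_toy - y_toy.mean()) / y_toy.std()

# Smallest alpha at which scikit-learn's Lasso sets every coefficient to zero
alpha_max = np.max(np.abs(X_toy.T @ y_toy)) / len(y_toy)

rows = []
for alpha_toy in [0.01 * alpha_max, 0.5 * alpha_max, 1.01 * alpha_max]:
    ridge_toy = Ridge(alpha=alpha_toy).fit(X_toy, y_toy)
    lasso_toy = Lasso(alpha=alpha_toy).fit(X_toy, y_toy)
    # count how many coefficients are still different from zero
    n_ridge = np.count_nonzero(ridge_toy.coef_)
    n_lasso = np.count_nonzero(lasso_toy.coef_)
    rows.append((alpha_toy, n_ridge, n_lasso))
    print('alpha={:.3f}  non-zero coefficients: ridge={}, lasso={}'.format(
        alpha_toy, n_ridge, n_lasso))
```

Once alpha is large enough that lasso has removed every feature, it predicts only the mean of the training target, so its test error stops changing with alpha.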

## <a id='conclusions'>5. Conclusions</a>

So, this work was a very interesting way to use almost all of the concepts from class.
From the hypotheses I made, I reached these conclusions:

- Classifying the multi-class wine dataset was neither simple nor easy. Looking at the 2-D plots, we could not see an easy separation of the data, and the classifiers confirmed that (the best classifier reached an f1-score of 56%). Classifying the 2-class dataset was a lot easier and simpler (as confirmed by the run times of the algorithms) and far more accurate (the two best classifiers reached an f1-score of 89%).
- My Hypothesis 1 was wrong, as we saw in two different scenarios. This may be explained by the fact that the dataset doesn't follow a simple set of rules, as I had expected it would.
- My Hypothesis 2 was not so wrong in the end, as we can see in the feature importance plot of one of the best classification models (in both scenarios), the Random Forest algorithm.
- The last challenge, trying to predict the alcohol level from the other features, was very interesting too, mostly because I could test some regression techniques.

## <a id='references'>6. References</a>


<a id='1'></a>
[1. P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.](https://www.sciencedirect.com/science/article/pii/S0167923609001377?via%3Dihub)

<a id='2'></a>
[2. Master Business Administration in Data Science - USP - ICMC](http://cemeai.icmc.usp.br/MBA/)

<a id='3'></a>
[3. http://www3.dsi.uminho.pt/pcortez/wine/ accessed on: March 03 - 2020](http://www3.dsi.uminho.pt/pcortez/wine/)

Other very helpful links:

- [Matplotlib - docs](https://matplotlib.org/)

- [Scikit-Learn - docs](https://scikit-learn.org/stable/index.html)

- [Towards Data Science - Removing outliers](https://towardsdatascience.com/ways-to-detect-and-remove-the-outliers-404d16608dba)

### Acknowledgements

I would like to thank all the people who have contributed to this work.

Paulo Cortez, António Cerdeira, Fernando Almeida, Telmo Matos, José Reis (authors of Modeling wine preferences by data mining from physicochemical properties), who gave us the database and provided me with a copy of their amazing paper!

The professors, tutors and classmates for all concepts, help and tips.