# _Wine Quality_

A practical study of wine quality classification (using a *dataset* from [_Kaggle_](https://www.kaggle.com/rajyellow46/wine-quality?select=winequalityN.csv)).

---

[Open In Colab](https://colab.research.google.com/drive/1AcaArOrR-e1XQl4N8jUDXBR8ZRt3sSmP?usp=sharing)

[Open in Kaggle](https://www.kaggle.com/leonichel/wine-quality)

---

[Leonichel Guimarães (PIBITI/CNPq-FA-UEM)](https://github.com/leonichel)

Professor Linnyer Ruiz (advisor)

---

References:

GÉRON, Aurélien. _Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems_. 2. ed. O'Reilly Media, 2019.

---

Manna Team  |  UEM       |     CNPq
:----------:|:----------:|:----------:|
<img src="https://manna.team/_next/static/images/logo2-e283461cfa92b2105bfd67e8e530529e.png" alt="Manna Team" width="200"/> | <img src="https://marcoadp.github.io/WebSiteDIN/img/logo-uem2.svg" alt="UEM" width="200"/> | <img src="https://www.gov.br/cnpq/pt-br/canais_atendimento/identidade-visual/logo_cnpq.svg" alt="CNPq" width="200"/>

## Reading and exploring the dataset

### Importing libraries

In [None]:
# basics
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go

# statistics
from scipy import stats # for z-scores

# sklearn
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.utils import class_weight
from sklearn.decomposition import PCA
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler, RobustScaler, QuantileTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import confusion_matrix, classification_report, precision_recall_curve, roc_curve, roc_auc_score
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV, cross_val_score, train_test_split, cross_val_predict

from sklearn.linear_model import RidgeClassifier, LogisticRegression, SGDClassifier, Perceptron, PassiveAggressiveClassifier

from sklearn.svm import SVC 
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier

from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, ExtraTreesClassifier, AdaBoostClassifier, GradientBoostingClassifier

from sklearn.ensemble import VotingClassifier
from sklearn.ensemble import StackingClassifier

from lightgbm import LGBMClassifier
from xgboost import XGBClassifier

### Reading the dataset

In [None]:
df = pd.read_csv('../input/wine-quality/winequalityN.csv')

In [None]:
df.head()

### Exploring the dataset

In [None]:
df.info()

* There is 1 integer attribute (numeric and categorical): *quality*;
* There is 1 text attribute (textual and categorical): *type*;
* There are 11 real-valued attributes (numeric and non-categorical): all the remaining ones.

In [None]:
df.describe()

* The non-categorical attributes need to be standardized or normalized.
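
As a rough illustration of the two options named in the note above, the sketch below standardizes (zero mean, unit variance) and min-max normalizes a small made-up list using only the standard library. The `values` list and both helper functions are hypothetical and exist only for this example; the notebook itself relies on scikit-learn's `StandardScaler` inside its pipelines.

```python
from statistics import mean, pstdev

# Made-up values standing in for one numeric column (not taken from the dataset).
values = [2.0, 4.0, 6.0, 8.0]

def standardize(xs):
    """Rescale to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(xs), pstdev(xs)
    return [(x - mu) / sigma for x in xs]

def min_max_normalize(xs):
    """Rescale linearly to the [0, 1] interval."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

standardized = standardize(values)
normalized = min_max_normalize(values)
print(standardized)
print(normalized)
```

Standardization keeps the shape of the distribution while centering it, whereas min-max normalization pins the extremes to 0 and 1 and is therefore more sensitive to outliers.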

In [None]:
df.describe(include=['O'])

In [None]:
df['quality'].value_counts()

* Few wines have a score above 8.

### Null values (*missing data*)

In [None]:
df.isnull().sum().sort_values(ascending=False)

In [None]:
null_values_cols = df.isnull().sum().sort_values(ascending=False).index[:7]

* The attributes with null values are stored in the *null_values_cols* vector and need to be handled.
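
One possible way to handle such missing values is median imputation, which is also what the numeric pipeline later in this notebook uses. The sketch below applies `SimpleImputer` to a made-up single-column array; the `column` data is hypothetical and not taken from the dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up column with one missing entry (not taken from the dataset).
column = np.array([[1.0], [np.nan], [3.0], [10.0]])

# The median of the observed values (1, 3, 10) replaces the missing entry.
imputer = SimpleImputer(strategy='median')
filled = imputer.fit_transform(column)
print(filled.ravel())
```

The median is less affected by extreme values than the mean, which is a common reason to prefer it for skewed measurements.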

In [None]:
df['quality_label'] = df.quality.apply(lambda q: 0 if q <= 5 else 1)
df.drop('quality', axis=1, inplace=True)
df

### Data visualization

#### Histograms

In [None]:
fig = go.Figure()

fig.add_trace(
    go.Histogram(
        x = df['fixed acidity'],
        name = 'fixed acidity',
    )
)

for column in df.columns[2:11]:
    fig.add_trace(
        go.Histogram(
            x = df[column],
            name = column,
            visible='legendonly'
        )
    )

fig.update_layout(barmode='overlay', template='plotly_dark')
fig.update_traces(opacity=0.75)
fig.show()

#### *Boxplots*

In [None]:
fig = go.Figure()

fig.add_trace(
    go.Violin(
        y = df['fixed acidity'],
        name = 'fixed acidity',
        box_visible=True,
    )
)

for column in df.columns[2:11]:
    fig.add_trace(
        go.Violin(
            y = df[column],
            name = column,
            box_visible=True,
            visible='legendonly'
        )
    )

fig.update_layout(barmode='overlay', template='plotly_dark',  width=1790,
    height=800)
fig.update_traces(opacity=0.75)
fig.show()

#### *Outliers*

In [None]:
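# z-scores (NaN-aware): keep only the rows with at least one value more than 3 standard deviations above the mean

cols = df.columns[1:10]
z = stats.zscore(df[cols].to_numpy(), nan_policy='omit')
outliers = df.loc[(z > 3).any(axis=1), cols]
outliers.tail()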

In [None]:
outliers.info()

* The *outliers* variable stores the outlier points.
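
To make the z-score criterion concrete, here is a minimal standard-library sketch on a made-up sample. The `sample` list is hypothetical; the threshold of 3 and the one-sided comparison mirror the cell above.

```python
from statistics import mean, pstdev

# Made-up sample: nineteen ordinary values plus one extreme value.
sample = list(range(1, 20)) + [100]

mu, sigma = mean(sample), pstdev(sample)
z_scores = [(x - mu) / sigma for x in sample]

# One-sided threshold: flag values more than 3 standard deviations above the mean.
flagged = [x for x, z in zip(sample, z_scores) if z > 3]
print(flagged)
```

Because the comparison is one-sided, unusually low values are never flagged; comparing the absolute z-score instead would catch both tails.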

#### *Heatmap*

In [None]:
df.select_dtypes('number').corr()  # correlations are only defined for the numeric columns

In [None]:
corr = df.select_dtypes('number').corr()
fig = go.Figure(go.Heatmap(x=corr.index, y=corr.columns, z=corr.values))
fig.update_layout(template='plotly_dark')
fig.show()

In [None]:
topCols = np.abs(df.select_dtypes('number').corr()).nlargest(4, 'quality_label')['quality_label'][1:].index
topCols

* The attributes most strongly correlated with *quality_label* are stored in the *topCols* vector.

#### *Scatter matrix*

In [None]:
fig = px.scatter_matrix(df, opacity=0.3, template='plotly_dark', height=2000)
fig.show()

#### Analysis of high-quality wines

In [None]:
fig = go.Figure()

fig.add_trace(
    go.Violin(
        x = df['quality_label'],
        y = df['fixed acidity'],
        name = 'fixed acidity',
        box_visible=True,
    )
)

for column in df.columns[2:11]:
    fig.add_trace(
        go.Violin(
            x = df['quality_label'],
            y = df[column],
            name = column,
            box_visible=True,
            visible='legendonly'
        )
    )

fig.update_layout(barmode='overlay', template='plotly_dark',  width=1790,
    height=800)
fig.update_traces(opacity=0.75)
fig.show()

In [None]:
quartils = {'Atributo': [], 'Qualidade': [], 'Quartil25': [], 
            'Quartil50': [], 'Quartil75': []}

for i in df.columns[1:11]:
    for j in df['quality_label'].value_counts().index:
        top = df[i][df['quality_label'] == j]
        q25, q50, q75 = np.nanpercentile(top.values, [25, 50, 75])  # ignores missing values
        quartils['Atributo'].append(i)
        quartils['Qualidade'].append(j)
        quartils['Quartil25'].append(q25)
        quartils['Quartil50'].append(q50)
        quartils['Quartil75'].append(q75)

quartils = pd.DataFrame(quartils)
quartils

## Preprocessing

### Removing *outliers*

In [None]:
df.drop(outliers.index, inplace=True)

### Splitting the dataset into training and test sets

In [None]:
train, test = train_test_split(df, test_size=0.2, random_state=0)
test

In [None]:
y = train['quality_label']
train.drop(['quality_label'] , axis=1, inplace=True)
X = train.copy()
y

### Getting categorical and non-categorical features

In [None]:
numerical_features = train.select_dtypes(exclude=['object']).columns.tolist()
categorical_features = train.select_dtypes(include=['object']).columns.tolist()
numerical_features

### _Pipelines_

In [None]:
# Numerical
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

In [None]:
# Categorical
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

In [None]:
# Combining
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer,   numerical_features),
        ('cat', categorical_transformer, categorical_features)])
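
To see what a preprocessor of this shape produces, the sketch below rebuilds the same two branches on a hypothetical three-row frame. The `toy` data and the `toy_preprocessor` name are made up for this example; only the transformer settings are copied from the cells above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical three-row frame with one numeric and one categorical column.
toy = pd.DataFrame({'alcohol': [9.0, np.nan, 11.0],
                    'type': ['red', 'white', 'red']})

toy_preprocessor = ColumnTransformer(transformers=[
    ('num', Pipeline([('imputer', SimpleImputer(strategy='median')),
                      ('scaler', StandardScaler())]), ['alcohol']),
    ('cat', Pipeline([('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
                      ('onehot', OneHotEncoder(handle_unknown='ignore'))]), ['type'])])

transformed = toy_preprocessor.fit_transform(toy)
# The result may be sparse or dense depending on its density, so convert to a dense array.
transformed = transformed.toarray() if hasattr(transformed, 'toarray') else np.asarray(transformed)
print(transformed.shape)  # one scaled numeric column + two one-hot columns
```

The missing `alcohol` value is filled with the median before scaling, and each category of `type` becomes its own indicator column.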

## Learning models

### *Pipelines*

In [None]:
# Pipelines with linear learning models

pipe_ridge = Pipeline(
    steps   = [('preprocessor', preprocessor),
            ('ridge', RidgeClassifier())])  
  
pipe_logistic = Pipeline(
    steps  = [('preprocessor', preprocessor),
            ('logistic', LogisticRegression(random_state=0))])  

pipe_SGD = Pipeline(
    steps  = [('preprocessor', preprocessor),
            ('SGD', SGDClassifier())])

pipe_perceptron = Pipeline(
    steps   = [('preprocessor', preprocessor),
            ('perceptron', Perceptron())])

pipe_PasAgg = Pipeline(
    steps   = [('preprocessor', preprocessor),
            ('PasAgg', PassiveAggressiveClassifier())])

linear_pipes = [pipe_ridge, pipe_logistic, pipe_SGD, pipe_perceptron, pipe_PasAgg]

In [None]:
# Pipelines with non-linear learning models

pipe_SVM = Pipeline(
    steps = [('preprocessor', preprocessor),
            ('SVM', SVC())])

pipe_KN = Pipeline(
    steps = [('preprocessor', preprocessor),
            ('KN', KNeighborsClassifier())])

pipe_tree = Pipeline(
    steps = [('preprocessor', preprocessor),
            ('tree', DecisionTreeClassifier())])

pipe_MLP = Pipeline(
    steps = [('preprocessor', preprocessor),
            ('MLP', MLPClassifier())])

n_linear_pipes = [pipe_SVM, pipe_KN, pipe_tree, pipe_MLP]

In [None]:
# Pipelines with ensemble learning models

pipe_GB = Pipeline(
    steps = [('preprocessor', preprocessor),
            ('GB', GradientBoostingClassifier())])

pipe_ET = Pipeline(
    steps = [('preprocessor', preprocessor),
            ('ET', ExtraTreesClassifier())])

pipe_RF = Pipeline(
    steps = [('preprocessor', preprocessor),
            ('RF', RandomForestClassifier())])

pipe_bagging = Pipeline(
    steps = [('preprocessor', preprocessor),
            ('bagging', BaggingClassifier())])

pipe_ADA = Pipeline(
    steps= [('preprocessor', preprocessor),
            ('ADA', AdaBoostClassifier(DecisionTreeClassifier()))])

pipe_LGBM = Pipeline(
    steps= [('preprocessor', preprocessor),
            ('LGBM', LGBMClassifier())])

pipe_XGB = Pipeline(
    steps= [('preprocessor', preprocessor),
            ('XGB', XGBClassifier())])

ensemble_pipes = [pipe_GB, pipe_ET, pipe_RF, pipe_bagging, pipe_ADA, pipe_LGBM, pipe_XGB]

### Training and testing

In [None]:
# Linear

for pipe in linear_pipes:
    print('Model: ', pipe.steps[1][0])
    y_pred = cross_val_predict(pipe, X, y, cv=5)
    print(classification_report(y, y_pred))

In [None]:
# Non-linear

for pipe in n_linear_pipes:
    print('Model: ', pipe.steps[1][0])
    y_pred = cross_val_predict(pipe, X, y, cv=5)
    print(classification_report(y, y_pred))

In [None]:
# Ensembles
for pipe in ensemble_pipes:
    print('Model: ', pipe.steps[1][0])
    y_pred = cross_val_predict(pipe, X, y, cv=5)
    print(classification_report(y, y_pred))

### Hyperparameter optimization

#### Hyperparameters

In [None]:
# Linear
# class_weight takes the Python object None, not the string 'None'

parameters_ridge = {'ridge__alpha': np.arange(0.1, 1, 0.1),
                    'ridge__class_weight': ['balanced', None]}

parameters_logistic = {'logistic__solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
                    'logistic__penalty': ['none', 'l1', 'l2', 'elasticnet'],
                    'logistic__C': [100, 10, 1.0, 0.1, 0.01],
                    'logistic__class_weight': ['balanced', None]}

parameters_SGD = {'SGD__alpha': [1e-4, 1e-3, 1e-2, 1e-1, 1e0, 1e1, 1e2, 1e3],
                'SGD__max_iter': [100, 1000, 2000, 5000],
                'SGD__class_weight': ['balanced', None]}

parameters_perceptron = {'perceptron__eta0': [0.0001, 0.001, 0.01, 0.1, 1.0],
                        'perceptron__max_iter': [100, 1000, 10000],
                        'perceptron__class_weight': ['balanced', None]}

parameters_PasAgg = {'PasAgg__C': [10, 1.0, 0.1, 0.01],
                    'PasAgg__max_iter': [500, 1000, 1500, 5000],
                    'PasAgg__class_weight': ['balanced', None]}

# Non-linear

# 'precomputed' is omitted: it expects a square kernel matrix instead of the feature matrix
parameters_SVM = {'SVM__kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
                'SVM__C': [100, 10, 1.0, 0.1, 0.01, 0.001],
                'SVM__class_weight': ['balanced', None]}

parameters_KN = {'KN__n_neighbors' : np.arange(1, 21, 1),
                'KN__metric':  ['euclidean', 'manhattan', 'minkowski'],
                'KN__weights':  ['uniform', 'distance']}

parameters_tree = {'tree__max_depth': [3, None],
                'tree__max_features': np.arange(1, 9, 1),
                'tree__min_samples_leaf': np.arange(1, 9, 1),
                'tree__criterion': ["gini", "entropy"],
                'tree__class_weight': ['balanced', None]}

# MLPClassifier has no class_weight parameter, so it is not part of the search space
parameters_MLP = {'MLP__max_iter': [100, 500, 1000, 1500, 5000],
                'MLP__hidden_layer_sizes': [(50,50,50), (50,100,50), (100,)],
                'MLP__activation': ['tanh', 'relu'],
                'MLP__solver': ['sgd', 'adam'], 
                'MLP__learning_rate': ['constant','adaptive']}

# Ensembles
# 'auto' is dropped from max_features: for these classifiers it was an alias of 'sqrt',
# and newer scikit-learn releases no longer accept it

parameters_GB = {'GB__learning_rate': [0.01, 0.025, 0.05, 0.075, 0.1, 0.15, 0.2],
                'GB__max_depth': [3, 5, 8, 10, 14],
                'GB__subsample': [0.5, 0.618, 0.8, 0.85, 0.9, 0.95, 1.0],
                'GB__min_samples_leaf': np.linspace(0.1, 0.5, 12), 
                'GB__min_samples_split': np.linspace(0.1, 0.5, 12),
                'GB__max_features': ['log2', 'sqrt']}

parameters_ET = {'ET__n_estimators': np.arange(90,200,10),
                'ET__min_samples_leaf': np.linspace(0.1, 0.5, 12), 
                'ET__min_samples_split': np.linspace(0.1, 0.5, 12),
                'ET__max_features': ['log2', 'sqrt'],
                'ET__class_weight': ['balanced', None]}

parameters_RF = {'RF__n_estimators': np.arange(90,200,10),
                'RF__max_depth': np.arange(1, 110, 10),
                'RF__min_samples_leaf': [1,2,4], 
                'RF__min_samples_split': [2,5,10],
                'RF__max_features': ['log2', 'sqrt'],
                'RF__class_weight': ['balanced', None]}

parameters_bagging = {'bagging__n_estimators': np.arange(5, 200, 5),
                      'bagging__max_features': np.arange(1, 9, 1)}


#### Randomized search

In [None]:
# Linear

rscv_ridge = RandomizedSearchCV(pipe_ridge, parameters_ridge, cv=5, 
    random_state=0, n_jobs=-1, scoring='neg_mean_squared_error', verbose=0)

rscv_ridge.fit(X, y)

rscv_logistic = RandomizedSearchCV(pipe_logistic, parameters_logistic, cv=5, 
    random_state=0, n_jobs=-1, scoring='neg_mean_squared_error', verbose=0)

rscv_logistic.fit(X, y)

rscv_SGD = RandomizedSearchCV(pipe_SGD, parameters_SGD, cv=5, 
    random_state=0, n_jobs=-1, scoring='neg_mean_squared_error', verbose=0)

rscv_SGD.fit(X, y)

rscv_perceptron = RandomizedSearchCV(pipe_perceptron, parameters_perceptron, cv=5, 
    random_state=0, n_jobs=-1, scoring='neg_mean_squared_error', verbose=0)

rscv_perceptron.fit(X, y)

rscv_PasAgg = RandomizedSearchCV(pipe_PasAgg, parameters_PasAgg, cv=5, 
    random_state=0, n_jobs=-1, scoring='neg_mean_squared_error', verbose=0)

rscv_PasAgg.fit(X, y)

rscv_linear = [rscv_ridge, rscv_logistic, rscv_SGD, rscv_perceptron, rscv_PasAgg]

# Non-linear

rscv_SVM = RandomizedSearchCV(pipe_SVM, parameters_SVM, cv=5, 
    random_state=0, n_jobs=-1, scoring='neg_mean_squared_error', verbose=0)

rscv_SVM.fit(X, y)

rscv_KN = RandomizedSearchCV(pipe_KN, parameters_KN, cv=5, 
    random_state=0, n_jobs=-1, scoring='neg_mean_squared_error', verbose=0)

rscv_KN.fit(X, y)

rscv_tree = RandomizedSearchCV(pipe_tree, parameters_tree, cv=5, 
    random_state=0, n_jobs=-1, scoring='neg_mean_squared_error', verbose=0)

rscv_tree.fit(X, y)

rscv_MLP = RandomizedSearchCV(pipe_MLP, parameters_MLP, cv=5, 
    random_state=0, n_jobs=-1, scoring='neg_mean_squared_error', verbose=0)

#rscv_MLP.fit(X, y)

rscv_n_linear = [rscv_SVM, rscv_KN, rscv_tree]

# Ensembles

rscv_GB = RandomizedSearchCV(pipe_GB, parameters_GB, cv=5, 
    random_state=0, n_jobs=-1, scoring='neg_mean_squared_error', verbose=0)

rscv_GB.fit(X, y)

rscv_ET = RandomizedSearchCV(pipe_ET, parameters_ET, cv=5, 
    random_state=0, n_jobs=-1, scoring='neg_mean_squared_error', verbose=0)

rscv_ET.fit(X, y)

rscv_RF = RandomizedSearchCV(pipe_RF, parameters_RF, cv=5, 
    random_state=0, n_jobs=-1, scoring='neg_mean_squared_error', verbose=0)

rscv_RF.fit(X, y)

rscv_ensemble = [rscv_GB, rscv_ET, rscv_RF]

In [None]:
# Linear
for rscv in rscv_linear:
    print('Model: ', rscv.best_estimator_.steps[1][0])
    y_pred = cross_val_predict(rscv.best_estimator_, X, y, cv=5)
    print(classification_report(y, y_pred))

In [None]:
# Non-linear
for rscv in rscv_n_linear:
    print('Model: ', rscv.best_estimator_.steps[1][0])
    y_pred = cross_val_predict(rscv.best_estimator_, X, y, cv=5)
    print(classification_report(y, y_pred))

In [None]:
# Ensembles
for rscv in rscv_ensemble:
    print('Model: ', rscv.best_estimator_.steps[1][0])
    y_pred = cross_val_predict(rscv.best_estimator_, X, y, cv=5)
    print(classification_report(y, y_pred))

In [None]:
model = rscv_RF.best_estimator_
model

## Evaluation

In [None]:
probs = model.predict_proba(X)[:,1]
fpr, tpr, thresholds  = roc_curve(y, probs)
sns.lineplot(x=fpr, y=tpr);

## Predictions on the test set

In [None]:
y_test = test['quality_label']
test.drop(['quality_label'] , axis=1, inplace=True)
X_test = test.copy()
y_test

In [None]:
# use the model fitted on the training set; cross_val_predict would refit it on folds of the test set
y_pred = model.predict(X_test)
y_pred

## Validation

* Confusion matrix
* Precision (how many of the selected items are relevant?)
* Recall or sensitivity (how many of the relevant items are selected?)
* _F1 score_
* _ROC curve_
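
As a worked example of how the first metrics in this list relate to the confusion matrix, the sketch below derives precision, recall and the F1 score from four counts. The counts are hypothetical and are not results from this notebook.

```python
# Hypothetical confusion-matrix counts for a binary problem (not results from this notebook).
# tn is part of the matrix but is not used by the three metrics below.
tp, fp, fn, tn = 40, 10, 20, 30

precision = tp / (tp + fp)  # share of predicted positives that are truly positive
recall = tp / (tp + fn)     # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Because the F1 score is a harmonic mean, it stays low whenever either precision or recall is low, which makes it a stricter summary than a plain average.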


In [None]:
cm = confusion_matrix(y_test, y_pred)
cm

In [None]:
probs = model.predict_proba(X_test)[:,1]
fpr, tpr, thresholds  = roc_curve(y_test, probs)
sns.lineplot(x=fpr, y=tpr);

In [None]:
print(classification_report(y_test, y_pred))

## Possible improvements

* Handle the imbalanced dataset;
* Add 'ovo' or 'ovr' for multi-class classification.
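
One common way to approach the class imbalance mentioned in this list is to weight each class inversely to its frequency, which is the heuristic behind scikit-learn's `class_weight='balanced'` option. The sketch below reproduces that formula with the standard library on made-up labels; the `labels` list is hypothetical.

```python
from collections import Counter

# Made-up, imbalanced binary labels (not taken from the dataset).
labels = [0] * 2 + [1] * 6

counts = Counter(labels)
n_samples, n_classes = len(labels), len(counts)

# 'balanced' heuristic: n_samples / (n_classes * number of samples in the class).
weights = {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}
print(weights)
```

The rarer class receives the larger weight, so errors on it cost more during training; resampling the data is an alternative that this sketch does not cover.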

## Exporting/importing the model

### Export

In [None]:
import pickle

# the context manager closes the file even if pickling fails
with open('model.sav', 'wb') as f:
    pickle.dump(model, f)

### Import

In [None]:
with open('model.sav', 'rb') as f:
    loaded_model = pickle.load(f)

In [None]:
loaded_model