# Wine Quality Analysis

<img style="margin-left:0" src="https://thumbor.forbes.com/thumbor/fit-in/1200x0/filters%3Aformat%28jpg%29/https%3A%2F%2Fspecials-images.forbesimg.com%2Fdam%2Fimageserve%2F1133888244%2F0x0.jpg%3Ffit%3Dscale" width="600px" />

This notebook analyses a dataset of **red** and **white** variants of the Portuguese "Vinho Verde" wine, based on **physicochemical test results** and the quality scores that experts assigned to each wine sample.

- EDA Part of the Analysis: https://www.kaggle.com/glushko/wine-quality-domain-driven-eda-part-i
- Feel free to upvote this notebook if you find it helpful 💫

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.experimental import enable_hist_gradient_boosting  # noqa
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, FunctionTransformer, PolynomialFeatures, PowerTransformer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedShuffleSplit, StratifiedKFold, cross_validate, GridSearchCV, cross_val_predict
from sklearn.metrics import f1_score, balanced_accuracy_score, classification_report, confusion_matrix, plot_confusion_matrix, ConfusionMatrixDisplay
from sklearn import set_config
from sklearn.utils.multiclass import unique_labels

from yellowbrick.model_selection import ValidationCurve
import shap

In [None]:
plt.rcParams['figure.figsize'] = (12, 8)
set_config(display='diagram')


RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
shap.initjs()

In [None]:
full_df = pd.read_csv('../input/wine-quality/winequalityN.csv')

full_df.head()

# Classification Objective 🎯

The most obvious classification objective for this training set is **multiclass wine quality classification**. 

The dataset is **highly imbalanced**. We have only 5 samples of excellent wines and 30 samples of the lowest quality wines. Once the test split and cross-validation folds are taken into account, only a couple of those examples may be left for training. This makes SMOTE and similar synthetic resampling methods hard to apply, because they need enough minority samples to build neighbourhoods to sample from.
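
To get a feel for how thin the rarest class becomes, here is a small sketch on a made-up label vector. The counts 5 and 30 come from the paragraph above; the 1000 remaining samples and the label values 3, 6 and 9 are placeholders, not the real distribution.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

# Hypothetical labels: 5 "excellent" (9) and 30 "lowest" (3) samples as quoted
# above; the 1000 samples labelled 6 stand in for every other class.
y_demo = np.array([9] * 5 + [3] * 30 + [6] * 1000)
X_demo = np.zeros((len(y_demo), 1))

# Hold out 20% for testing, stratified by class
_, _, y_tr, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)
rare_in_train = int((y_tr == 9).sum())

# Count how many rare samples each CV fold actually gets to fit on.
# scikit-learn may warn here that the rarest class has fewer members than folds.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
rare_per_fold = [
    int((y_tr[fit_idx] == 9).sum())
    for fit_idx, _ in skf.split(np.zeros((len(y_tr), 1)), y_tr)
]

print('rare samples in the training split:', rare_in_train)
print('rare samples available per CV fit:', rare_per_fold)
```

With so few points per fit, a neighbour-based oversampler has very little to interpolate between.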

Other possible objectives are:
- multiclass quality classification with only 3 classes: low, medium, high quality wines
- binary quality classification: good or bad quality wine
- binary wine type classification: red or white wine (which would also suffer from imbalance, but could be fixed by synthetic resampling)

In [None]:
full_df['quality'].value_counts()

We will stick with **multiclass quality classification** and 3 classes: low, medium, high quality wines:

In [None]:
def impute_quality_group(quality):
    if quality <= 5:
        return 0 # low
    if quality > 5 and quality < 7:
        return 1 # average
    if quality >= 7:
        return 2 # high

full_df['quality_group'] = full_df['quality'].apply(impute_quality_group)

In [None]:
full_df['quality_group'].value_counts()
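
The same grouping can also be written without a row-wise `apply`. The sketch below uses `np.select` on a handful of made-up scores; `toy_quality` and `toy_groups` are illustrative names, and the thresholds mirror `impute_quality_group` above.

```python
import numpy as np
import pandas as pd

# Made-up quality scores, only to illustrate the mapping
toy_quality = pd.Series([3, 4, 5, 6, 7, 8, 9])

# Conditions are checked in order; rows matching none of them get `default`
toy_groups = np.select(
    [toy_quality <= 5, toy_quality < 7],
    [0, 1],      # 0 = low, 1 = average
    default=2,   # 2 = high
)

print(toy_groups.tolist())
```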

# Feature Engineering

In [None]:
for feature in ['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'pH', 'sulphates']:
    full_df[feature] = full_df.groupby(['type'])[feature].transform(lambda x: x.fillna(x.median()))

In [None]:
def impute_sweetness(residual_sugar):
    if residual_sugar < 1:
        return 0
    if residual_sugar >= 1 and residual_sugar < 9:
        return 1
    if residual_sugar >= 9 and residual_sugar < 18:
        return 2
    if residual_sugar >= 18 and residual_sugar < 50:
        return 3
    if residual_sugar >= 50 and residual_sugar < 120:
        return 4
    if residual_sugar >= 120:
        return 5

full_df['sweetness'] = full_df['residual sugar'].apply(impute_sweetness)

In [None]:
full_df['fixed_acidity_red_wine'] = (full_df['type'] == 'red') * full_df['fixed acidity']
full_df['fixed_acidity_white_wine'] = (full_df['type'] == 'white') * full_df['fixed acidity']

full_df['molecular_sulfur_dioxid'] = full_df['free sulfur dioxide'] / (1 + 10 ** (full_df['pH'] - 1.8))
full_df['free_total_so2_rate'] = full_df['free sulfur dioxide'] / full_df['total sulfur dioxide']
full_df['bound_sulfur_dioxid'] = full_df['total sulfur dioxide'] - full_df['free sulfur dioxide']
full_df['sugar_acidity_ratio'] = full_df['residual sugar'] / full_df['fixed acidity']

alcohol_labels = ['low', 'medium', 'high']
alcohol_bins = [0, 9.5, 11.5, 20]
full_df['alcohol_groups'] = pd.cut(full_df['alcohol'], bins=alcohol_bins, labels=alcohol_labels) 

pH_labels = ['high', 'mod high', 'medium', 'low']
pH_bins = [2.5, 3.2, 3.3, 3.4, 4.1]
full_df['pH_groups'] = pd.cut(full_df['pH'], bins=pH_bins, labels=pH_labels) 
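
As a quick sanity check of the `molecular_sulfur_dioxid` expression, the sketch below evaluates it on a hypothetical sample. The helper name `molecular_so2`, the `pka` argument and the value of 30 for free SO2 are assumptions made for illustration; only the formula and the 1.8 constant come from the cell above.

```python
def molecular_so2(free_so2, ph, pka=1.8):
    """Same expression as the `molecular_sulfur_dioxid` feature, for scalars."""
    return free_so2 / (1 + 10 ** (ph - pka))

# Hypothetical sample with a free SO2 value of 30 at a few pH levels:
# the higher the pH, the smaller the molecular share.
for ph in (3.0, 3.3, 3.6):
    print(ph, round(molecular_so2(30, ph), 3))
```

When the pH equals the 1.8 constant, exactly half of the free SO2 is counted as molecular, and the share shrinks quickly as the pH grows.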

Feature Engineering:
- `total sulfur dioxide` - doesn't improve the models in its raw form
- `free_total_so2_rate` - brings no improvement
- `sweetness` - degrades performance of all models
- `alcohol_groups` - degrades performance of all models
- `pH_groups` - degrades performance of all models
- `sugar_acidity_ratio` - improves the std of the scores but degrades the CV scores

In [None]:
model_features = [
    'type',
    'alcohol',
    'fixed acidity',
    'volatile acidity',
    'citric acid',
    'pH',
    'residual sugar',
    'free sulfur dioxide',
    'chlorides',
    'density',
    'sulphates',
    'bound_sulfur_dioxid',
    'molecular_sulfur_dioxid',
    'sugar_acidity_ratio'
]

X = full_df[model_features]
y = full_df['quality_group']

stratified_splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=RANDOM_SEED)

for train_idx, test_idx in stratified_splitter.split(X, y):
    X_train, y_train = X.iloc[train_idx], y.iloc[train_idx]
    X_test, y_test = X.iloc[test_idx], y.iloc[test_idx]

# Data Processing

In [None]:
def get_feature_transformer():
    # Box-Cox needs strictly positive input, so shift by one; the inverse of 1 + x is x - 1
    oneplus_transformer = FunctionTransformer(func=lambda x: 1 + x, inverse_func=lambda x: x - 1)
    boxcox_transformer = PowerTransformer(method='box-cox', standardize=False)

    numerical_transformer = Pipeline([
        ('positive_transforming', oneplus_transformer),
        ('boxcox_transforming', boxcox_transformer),
    ])

    return ColumnTransformer([
            ('feature_transforming', numerical_transformer, [
                'fixed acidity', 'chlorides', 'citric acid', 'volatile acidity', 
                'sulphates', 'alcohol', 'residual sugar', 'free sulfur dioxide', 
                'pH', 'sugar_acidity_ratio'
            ]),
            ('wine_type_onehot', OneHotEncoder(), ['type']),
        ],
        remainder='passthrough'
    )
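
The `1 + x` step is there because Box-Cox only accepts strictly positive values. The sketch below shows this on a made-up column containing a zero; `toy_column` is an illustrative stand-in for any feature that can be zero, not data from the wine table.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Made-up single feature that contains a zero
toy_column = np.array([[0.0], [0.04], [0.26], [0.3], [0.5], [1.0]])

boxcox = PowerTransformer(method='box-cox', standardize=False)

try:
    boxcox.fit_transform(toy_column)
    raw_fit_failed = False
except ValueError as error:
    raw_fit_failed = True  # Box-Cox rejects non-positive input
    print(error)

# After the 1 + x shift every value is strictly positive and the fit goes through
shifted = boxcox.fit_transform(1 + toy_column)
print(shifted.ravel())
```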

# Modelling 🧪

In [None]:
def plot_confusion_matrix_by_predictions(y_true, y_predicted, *, labels=None,
                          sample_weight=None, normalize=None,
                          display_labels=None, include_values=True,
                          xticks_rotation='horizontal',
                          values_format=None,
                          cmap='viridis', ax=None):
    
    cm = confusion_matrix(y_true, y_predicted, sample_weight=sample_weight,
                          labels=labels, normalize=normalize)

    if display_labels is None:
        if labels is None:
            display_labels = unique_labels(y_true, y_predicted)
        else:
            display_labels = labels

    disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                                  display_labels=display_labels)

    return disp.plot(include_values=include_values,
                     cmap=cmap, ax=ax, xticks_rotation=xticks_rotation,
                     values_format=values_format)


In [None]:
def score_classification_model(model, X_train, y_train):
    
    cv_scores = cross_validate(
        model, X_train, y_train, 
        scoring=['f1_weighted', 'balanced_accuracy'],
        cv=5,
        n_jobs=-1, verbose=0
    )

    cv_y_predicted = cross_val_predict(
        model, X_train, y_train,
        cv=5,
        n_jobs=-1
    )

    cv_f1_weighted, f1_weighted_std = cv_scores['test_f1_weighted'].mean(), cv_scores['test_f1_weighted'].std()
    cv_balanced_accuracy, balanced_accuracy_std = cv_scores['test_balanced_accuracy'].mean(), cv_scores['test_balanced_accuracy'].std()

    model.fit(X_train, y_train)

    y_train_predicted = model.predict(X_train)

    train_f1_weighted = f1_score(y_train, y_train_predicted, average='weighted')
    train_balanced_accuracy = balanced_accuracy_score(y_train, y_train_predicted)

    print('[Train] F1 Weighted: %.4f' % (train_f1_weighted))
    print('[Train] Balanced Accuracy: %.4f' % (train_balanced_accuracy))
    print('Train Set Report:')
    print(classification_report(y_train, y_train_predicted, digits=3))

    print('[CV] F1 Weighted: %.4f (%.4f)' % (cv_f1_weighted, f1_weighted_std))
    print('[CV] Balanced Accuracy: %.4f (%.4f)' % (cv_balanced_accuracy, balanced_accuracy_std))
    print('CV Report:')
    print(classification_report(y_train, cv_y_predicted, digits=3))
    
    # display confusion matrices

    _, (ax0, ax1) = plt.subplots(1, 2)

    ax0.set_title('Train Confusion Matrix')
    plot_confusion_matrix(
        model, X_train, y_train,
        cmap=plt.cm.Blues,
        normalize='true',
        ax=ax0,
    )

    ax1.set_title('CV Confusion Matrix')
    plot_confusion_matrix_by_predictions(
        y_train, cv_y_predicted,
        cmap=plt.cm.Blues,
        normalize='true',
        ax=ax1,
    )

    return y_train_predicted, cv_y_predicted

In [None]:
# sklearn's pipeline API is limited at this point and doesn't provide a way to get the column names of the transformed X array.
# This helper covers that gap.

def get_columns_from_transformer(column_transformer, input_columns):
    col_name = []

    # the last entry of transformers_ is the ColumnTransformer's 'remainder'
    for transformer_in_columns in column_transformer.transformers_[:-1]:
        raw_col_name = transformer_in_columns[2]
        if isinstance(transformer_in_columns[1], Pipeline):
            transformer = transformer_in_columns[1].steps[-1][1]
        else:
            transformer = transformer_in_columns[1]
        if hasattr(transformer, 'get_feature_names_out'):  # newer scikit-learn releases
            names = transformer.get_feature_names_out(raw_col_name)
        elif hasattr(transformer, 'get_feature_names'):  # older scikit-learn releases
            names = transformer.get_feature_names(raw_col_name)
        else:  # no name-producing method, use the raw column names
            names = raw_col_name
        if isinstance(names, np.ndarray):  # e.g. OneHotEncoder returns an array of names
            col_name += names.tolist()
        elif isinstance(names, list):
            col_name += names
        elif isinstance(names, str):
            col_name.append(names)

    _, _, remainder_columns = column_transformer.transformers_[-1]

    # remainder columns may be given as positions or as column names
    for col in remainder_columns:
        col_name.append(col if isinstance(col, str) else input_columns[col])

    return col_name

## LogisticRegression

In [None]:
logistic_regression = LogisticRegression(
    solver='liblinear',
    penalty='l1',
    C=0.9,
    max_iter=500,
    class_weight='balanced',
    random_state=RANDOM_SEED,
    n_jobs=-1,
)

logistic_regression_pipeline = Pipeline([
    ('feature_processing', get_feature_transformer()),
    ('scaling', StandardScaler()),
    ('quality_classification', logistic_regression),
])

logistic_regression_pipeline

## [Tr] F1 Weighted: 0.5732, Balanced Accuracy: 0.5920
## [CV] F1 Weighted: 0.5688 (0.0135), Balanced Accuracy: 0.5866 (0.0099)
# solver='liblinear',
# penalty='l1',
# C=0.9,
# max_iter=500,
# class_weight='balanced'

In [None]:
score_classification_model(logistic_regression_pipeline, X_train, y_train);

### Hyperparam Tuning

In [None]:
parameters = {
    'quality_classification__penalty': ['l2', 'l1', 'elasticnet', 'none'], # 'l1', 'elasticnet', 'none'
    'quality_classification__C': [1.0, 0.95, 0.9, 0.8], # 1.0
    'quality_classification__tol': [1e-4],
    'quality_classification__class_weight': ['balanced'],
    'quality_classification__solver': ['lbfgs', 'liblinear', 'sag', 'saga'], # lbfgs
    'quality_classification__max_iter': [500],
    'quality_classification__l1_ratio': [1.0, 0.0, 0.3, 0.4, 0.5],
}

param_searcher = GridSearchCV(
   estimator=logistic_regression_pipeline,
   scoring='balanced_accuracy',
   param_grid=parameters,
   cv=5,
   n_jobs=-1, 
   verbose=3
)

#param_searcher.fit(X_train, y_train)
#param_searcher.best_params_, param_searcher.best_score_
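
Not every `LogisticRegression` solver supports every penalty: `liblinear` handles `l1` and `l2`, `lbfgs` and `sag` handle `l2` only, and `elasticnet` (the only penalty that uses `l1_ratio`) needs `saga`. A single dict crosses all values, so many candidates would fail or be redundant. One possible regrouping, sketched with `ParameterGrid` just to count candidates, is shown below; the names `crossed_parameters` and `valid_parameters` and the exact sub-grids are assumptions, not the grid the scores in this notebook were tuned with.

```python
from sklearn.model_selection import ParameterGrid

C_values = [1.0, 0.95, 0.9, 0.8]

# Same varying values as the single-dict grid above, crossed without any filtering
crossed_parameters = {
    'quality_classification__penalty': ['l2', 'l1', 'elasticnet', 'none'],
    'quality_classification__C': C_values,
    'quality_classification__solver': ['lbfgs', 'liblinear', 'sag', 'saga'],
    'quality_classification__l1_ratio': [1.0, 0.0, 0.3, 0.4, 0.5],
}

# Hypothetical regrouping: one sub-grid per family of compatible settings
valid_parameters = [
    {
        'quality_classification__solver': ['liblinear'],
        'quality_classification__penalty': ['l1', 'l2'],
        'quality_classification__C': C_values,
    },
    {
        'quality_classification__solver': ['lbfgs', 'sag'],
        'quality_classification__penalty': ['l2'],
        'quality_classification__C': C_values,
    },
    {
        'quality_classification__solver': ['saga'],
        'quality_classification__penalty': ['elasticnet'],
        'quality_classification__l1_ratio': [0.0, 0.3, 0.4, 0.5, 1.0],
        'quality_classification__C': C_values,
    },
]

n_crossed = len(ParameterGrid(crossed_parameters))
n_valid = len(ParameterGrid(valid_parameters))
print(n_crossed, n_valid)
```

A list of dicts like `valid_parameters` can be passed to `GridSearchCV` as `param_grid` in the same way as a single dict.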

## Polynomial Logistic Regression

In [None]:
logistic_classifier = LogisticRegression(
    penalty='l2',
    solver='newton-cg',
    class_weight='balanced',
    random_state=RANDOM_SEED,
    n_jobs=-1,
)

polynomial_pipeline = Pipeline([
    ('feature_processing', get_feature_transformer()),
    ('polynomial_features', PolynomialFeatures()),
    ('scaling', StandardScaler()),
    ('quality_classification', logistic_classifier),
])

polynomial_pipeline

In [None]:
score_classification_model(polynomial_pipeline, X_train, y_train);

### Hyperparam Tuning

In [None]:
parameters = [
    {
        'quality_classification__solver': ['newton-cg'], # lbfgs, liblinear, 'lbfgs', 'sag', 'saga',
        'quality_classification__penalty': ['l2', 'l1', 'elasticnet', 'none'], # 'l1', 'elasticnet', 'none'
        'quality_classification__C': [1.0], # 1.0
        'quality_classification__l1_ratio': [1.0, 0.9],
        'quality_classification__max_iter': [100, 200],
        'quality_classification__class_weight': ['balanced'],
        'polynomial_features__degree': [2],
    },
]

param_searcher = GridSearchCV(
   estimator=polynomial_pipeline,
   scoring='balanced_accuracy',
   param_grid=parameters,
   cv=5,
   n_jobs=-1, 
   verbose=3
)

#param_searcher.fit(X_train, y_train)
#param_searcher.best_params_, param_searcher.best_score_

## SVC

In [None]:
from sklearn.svm import LinearSVC, SVC

lsvm_classifier = LinearSVC(
    C=0.01,
    max_iter=1000,
    loss='squared_hinge',
    class_weight='balanced',
    random_state=RANDOM_SEED,
)

lsvm_pipeline = Pipeline([
    ('feature_processing', get_feature_transformer()),
    ('scaling', StandardScaler()),
    ('quality_classification', lsvm_classifier),
])

psvm_classifier = SVC(
    kernel='poly',
    degree=4,
    coef0=1,
    class_weight='balanced',
    random_state=RANDOM_SEED,
)

psvm_pipeline = Pipeline([
    ('feature_processing', get_feature_transformer()),
    ('scaling', StandardScaler()),
    ('quality_classification', psvm_classifier),
])

ksvm_classifier = SVC(
    kernel='rbf',
    C=5,
    gamma=0.01,
    class_weight='balanced',
    random_state=RANDOM_SEED,
)

ksvm_pipeline = Pipeline([
    ('feature_processing', get_feature_transformer()),
    ('scaling', StandardScaler()),
    ('quality_classification', ksvm_classifier),
])

In [None]:
score_classification_model(lsvm_pipeline, X_train, y_train);

# [CV] F1 Weighted: 0.5552 (0.0132)
# [CV] Balanced Accuracy: 0.5846 (0.0094)

In [None]:
score_classification_model(psvm_pipeline, X_train, y_train);

## [Train] F1 Weighted: 0.5476, Balanced Accuracy: 0.5654
## [CV] F1 Weighted: 0.5225 (0.0097), Balanced Accuracy: 0.5395 (0.0134)
# kernel='poly',
# degree=2,
# class_weight='balanced'

## [Train] F1 Weighted: 0.5979, Balanced Accuracy: 0.6337
## [CV] F1 Weighted: 0.5807 (0.0036), Balanced Accuracy: 0.6147 (0.0090)
# kernel='poly',
# degree=2,
# coef0=1,
# class_weight='balanced'

## [Train] F1 Weighted: 0.6323, Balanced Accuracy: 0.6724
## [CV] F1 Weighted: 0.5865 (0.0068), Balanced Accuracy: 0.6264 (0.0127)
# kernel='poly',
# degree=3,
# coef0=1,
# class_weight='balanced'

## [Train] F1 Weighted: 0.6874, Balanced Accuracy: 0.7242
## [CV] F1 Weighted: 0.6072 (0.0154), Balanced Accuracy: 0.6388 (0.0167)
# kernel='poly',
# degree=4,
# coef0=1,
# class_weight='balanced'

In [None]:
score_classification_model(ksvm_pipeline, X_train, y_train);

## [Train] F1 Weighted: 0.7151, Balanced Accuracy: 0.7483
## [CV] F1 Weighted: 0.6048 (0.0102), [CV] Balanced Accuracy: 0.6400 (0.0124)
# kernel='rbf',
# C=10,
# class_weight='balanced'

## [Train] F1 Weighted: 0.6772, Balanced Accuracy: 0.7152
## [CV] F1 Weighted: 0.5971 (0.0140), [CV] Balanced Accuracy: 0.6353 (0.0190)
# kernel='rbf',
# C=5,
# class_weight='balanced'

## [Train] F1 Weighted: 0.5954, Balanced Accuracy: 0.6341
## [CV] F1 Weighted: 0.5784 (0.0041), [CV] Balanced Accuracy: 0.6191 (0.0039)
# kernel='rbf',
# C=5,
# gamma=0.01,
# class_weight='balanced'

### Hypertuning

In [None]:
parameters = {
    'quality_classification__C': [0.01, 0.1, 1],
}

param_searcher = GridSearchCV(
   estimator=lsvm_pipeline,
   scoring='balanced_accuracy',
   param_grid=parameters,
   cv=5,
   n_jobs=-1, 
   verbose=3
)

# param_searcher.fit(X_train, y_train)
# param_searcher.best_params_, param_searcher.best_score_

In [None]:
parameters = {
    'quality_classification__C': [20, 60, 70, 80, 90],
    'quality_classification__gamma': ['scale', 'auto', 0.01, 0.1, 1, 5, 10],
}

param_searcher = GridSearchCV(
   estimator=ksvm_pipeline,
   scoring='balanced_accuracy',
   param_grid=parameters,
   cv=5,
   n_jobs=-1, 
   verbose=3
)

#param_searcher.fit(X_train, y_train)
#param_searcher.best_params_, param_searcher.best_score_

## DecisionTree

In [None]:
from sklearn.tree import DecisionTreeClassifier

tree_classifier = DecisionTreeClassifier(
    max_depth=12,
    max_leaf_nodes=65,
    class_weight='balanced',
    random_state=RANDOM_SEED,
)

tree_pipeline = Pipeline([
    ('feature_processing', get_feature_transformer()),
    ('quality_classification', tree_classifier),
])

## [CV] F1 Weighted: 0.5421 (0.0147), Balanced Accuracy: 0.5787 (0.0163)
# max_leaf_nodes=25

In [None]:
score_classification_model(tree_pipeline, X_train, y_train);

### Hypertuning

In [None]:
parameters = {
    'quality_classification__max_depth': np.arange(1, 15),
    'quality_classification__max_leaf_nodes': np.arange(6, 80, 5),  # max_leaf_nodes must be at least 2, so the grid starts at 6 instead of 1
}

param_searcher = GridSearchCV(
   estimator=tree_pipeline,
   scoring='balanced_accuracy',
   param_grid=parameters,
   cv=5,
   n_jobs=-1, 
   verbose=3
)

param_searcher.fit(X_train, y_train)
param_searcher.best_params_, param_searcher.best_score_

## RandomForest

In [None]:
rf_classifier = RandomForestClassifier(
    criterion='entropy',
    n_estimators=200,
    max_depth=6,
    max_leaf_nodes=10,
    max_features='sqrt',
    class_weight='balanced',
    random_state=RANDOM_SEED,
    n_jobs=-1,
)

rf_pipeline = Pipeline([
    ('feature_processing', get_feature_transformer()),
    ('quality_classification', rf_classifier),
])

rf_pipeline

## F1 Weighted: 0.6949 (0.0167), Balanced Accuracy: 0.6948 (0.0211)
# criterion='entropy',
# n_estimators=179,
# min_samples_split=5,
# min_samples_leaf=4,
# max_features='sqrt',
# class_weight='balanced'

## F1 Weighted: 0.4946 (0.0078), Balanced Accuracy: 0.5912 (0.0106)
# criterion='entropy',
# n_estimators=200,
# max_depth=6,
# max_leaf_nodes=10,
# max_features='sqrt',
# class_weight='balanced'

In [None]:
y_train_pred, y_cv_pred = score_classification_model(rf_pipeline, X_train, y_train);

# Model Inspection 🔎

In [None]:
X_train_features = get_columns_from_transformer(rf_pipeline.named_steps['feature_processing'], list(X_train.columns))

In [None]:
features_importance = sorted(zip(rf_pipeline.named_steps['quality_classification'].feature_importances_, X_train_features), reverse=True)
pd.DataFrame(features_importance, columns=['importance', 'feature'])

In [None]:
# the pipeline is already fitted at this point, so transforming is enough
X_train_transformed = rf_pipeline.named_steps['feature_processing'].transform(X_train)

rf_explainer = shap.TreeExplainer(rf_classifier)
rf_explanation = rf_explainer.shap_values(X_train_transformed)

In [None]:
shap.summary_plot(rf_explanation, X_train_transformed, X_train_features)

# Generalization

In [None]:
def score_model_generalization(model, X_test, y_test):
    y_test_predicted = model.predict(X_test)

    test_f1_weighted = f1_score(y_test, y_test_predicted, average='weighted')
    test_balanced_accuracy = balanced_accuracy_score(y_test, y_test_predicted)

    print('[Test] F1 Weighted: %.4f' % (test_f1_weighted))
    print('[Test] Balanced Accuracy: %.4f' % (test_balanced_accuracy))
    print('Test Set Report:')
    print(classification_report(y_test, y_test_predicted, digits=3))

    plot_confusion_matrix_by_predictions(
        y_test, y_test_predicted,
        cmap=plt.cm.Greens,
        normalize='true',
    )

## Logistic Regression

In [None]:
score_model_generalization(logistic_regression_pipeline, X_test, y_test)

## Polynomial Logistic Regression

In [None]:
score_model_generalization(polynomial_pipeline, X_test, y_test)

## SVM

In [None]:
score_model_generalization(lsvm_pipeline, X_test, y_test)

In [None]:
score_model_generalization(psvm_pipeline, X_test, y_test)

In [None]:
score_model_generalization(ksvm_pipeline, X_test, y_test)

## Random Forest

In [None]:
score_model_generalization(rf_pipeline, X_test, y_test)

# Summary 💫

The Wine Quality dataset is a good example of the datasets you may face in real life. It's **imbalanced**, and the **quality classes** are hard to separate. This pushed us to rethink our classification objective and to consider what we could realistically squeeze out of the data.

We trained several models, from a simple Logistic Regression and SVM to RandomForest, and measured their performance with the **balanced accuracy** metric.

It turned out that the polynomial **SVC model** performs best for us:
- CV: 63.88% (±1.67%)
- Test: 61.73% 
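
One rough way to read these two numbers together is to express the CV-to-test gap in units of the CV standard deviation. This is only a heuristic, not a formal test; the variable names below are illustrative and the values are the ones listed above.

```python
cv_mean, cv_std = 63.88, 1.67  # CV balanced accuracy and its std, in percent
test_score = 61.73             # test balanced accuracy, in percent

gap = cv_mean - test_score
gap_in_stds = gap / cv_std

print(f'gap: {gap:.2f} points, {gap_in_stds:.2f} CV standard deviations')
```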

Our goal was to get the highest balanced accuracy while keeping the gap between train and CV scores small.

RandomForest and XGBoost overfit easily: they show around 70% balanced accuracy on the CV and test sets while reaching almost 100% on the training set. However, we don't believe these models would generalize, even though they showed good results on the current test set, which includes only about 1300 observations (20% of the overall dataset).

**Another approach to improve the accuracy** is to train two separate models for red and white wines. However, since the quality classes are hardly separable, we expect it to bring only a small improvement.
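
A minimal sketch of what the per-type setup could look like is given below. It runs on synthetic data generated on the spot (not the wine table), uses only two made-up features, and every name in it (`toy_df`, `toy_features`, `models_by_type`, `toy_predictions`) is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)

# Synthetic stand-in for the wine table (NOT the real data)
toy_df = pd.DataFrame({
    'type': rng.choice(['red', 'white'], size=200),
    'alcohol': rng.normal(10.5, 1.0, size=200),
    'pH': rng.normal(3.2, 0.15, size=200),
    'quality_group': rng.integers(0, 3, size=200),
})
toy_features = ['alcohol', 'pH']

# Fit one classifier per wine type, using only the rows of that type
models_by_type = {
    wine_type: RandomForestClassifier(
        n_estimators=20, class_weight='balanced', random_state=42
    ).fit(group[toy_features], group['quality_group'])
    for wine_type, group in toy_df.groupby('type')
}

# Route every row to the model trained for its wine type
toy_predictions = np.empty(len(toy_df), dtype=int)
for wine_type, model in models_by_type.items():
    mask = (toy_df['type'] == wine_type).to_numpy()
    toy_predictions[mask] = model.predict(toy_df.loc[mask, toy_features])

print(pd.Series(toy_predictions).value_counts())
```

On the real data each per-type model would need its own stratified split and cross-validation, and the smaller of the two subsets would leave even fewer rare-class samples per fold.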