## Kaggle - Red Wine Quality Regression

This notebook contains the regression pipeline for the Red Wine Quality Kaggle dataset. The file is a Python module that can be opened as a notebook using Jupytext.

1. [Business/Data Understanding](#1.-Business/Data-Understanding)
2. [Exploratory Data Analysis](#2.-Exploratory-Data-Analysis)
3. [Data Preparation](#3.-Data-Preparation)
4. [Modeling](#4.-Modeling)
5. Evaluation

[References](#References)

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import GridSearchCV
from joblib import dump, load

from google.colab import files

### 1. Business/Data Understanding

Wine quality refers to the factors that go into producing a wine, as well as the indicators or characteristics that tell you if the wine is of high quality.

When you know what influences and signifies wine quality, you’ll be in a better position to make good purchases. You’ll also begin to recognize your preferences and how your favorite wines can change with each harvest. Your appreciation for wines will deepen once you’re familiar with wine quality levels and how wines vary in taste from region to region.

Some wines are of higher quality than others: from climate to viticulture to winemaking, a myriad of factors make some wines exceptional and others run-of-the-mill [1].

### 2. Exploratory Data Analysis

The original dataset is split into a training set (80%) and a validation set (20%).

The exploration is applied only to the training set, since the validation set is supposed to be unseen data. The basic aspects of the training data are shown below.

In [None]:
raw_df = pd.read_csv('data/winequality-red.csv', header=0)
train_df, validation_df = train_test_split(
  raw_df,
  test_size=0.2,
  random_state=3
)

train_df.to_csv('data/train.csv', header=True, index=True)
validation_df.to_csv('data/validation.csv', header=True, index=True)

The dataset info shows that the data is clean: there are no null values. The target variable, quality, consists of integers, while the features are floats.

In [None]:
# Read the training split from the path it was written to above.
train = pd.read_csv('data/train.csv', header=0, index_col=0)
print(f'Dataset shape: {train.shape} \n')
print(train.info())
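The null and dtype claims above can also be checked programmatically. The sketch below uses a small hypothetical frame (`toy`, with one missing value injected on purpose) instead of the real `train` data, so the counts it prints are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the training frame; the real one is read from CSV.
toy = pd.DataFrame({
    'alcohol': [9.4, 9.8, np.nan, 10.5],   # one missing value injected on purpose
    'sulphates': [0.56, 0.68, 0.65, 0.58],
    'quality': [5, 5, 6, 7],
})

# Count missing values per column; all zeros would mean there are no nulls.
null_counts = toy.isnull().sum()
print(null_counts)

# Separate integer columns (the target) from float columns (the features).
int_cols = [c for c in toy.columns if pd.api.types.is_integer_dtype(toy[c])]
float_cols = [c for c in toy.columns if pd.api.types.is_float_dtype(toy[c])]
print(int_cols, float_cols)
```

Running the same two checks on `train` should report zero nulls and `quality` as the only integer column if the description above holds.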

Using the `describe()` method we can see the mean, standard deviation and the quartiles.

It is also worth noting that quality takes only 6 distinct values.

In [None]:
print(train['quality'].unique())
train.describe()
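Beyond listing the distinct values, it can help to see how often each one occurs. This is a minimal sketch on a hypothetical `quality` series; the actual frequencies would come from `train['quality']`.

```python
import pandas as pd

# Hypothetical quality labels; the real values come from train['quality'].
quality = pd.Series([5, 6, 5, 7, 6, 5, 4, 8, 3, 6], name='quality')

# Frequency of each distinct score, ordered by score rather than by count.
counts = quality.value_counts().sort_index()
print(counts)

# Share of samples per score, useful for spotting imbalance between scores.
shares = (counts / counts.sum()).round(2)
print(shares)
```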

In the last row of the correlation plot we can see that some features, such as **sulphates** and **alcohol**, are positively correlated with quality, while others, like **volatile acidity**, are negatively correlated.

In [None]:
corr = train.corr()
mask = np.triu(np.ones_like(corr, dtype=bool))
f, ax = plt.subplots(figsize=(11, 9))
cmap = sns.diverging_palette(230, 20, as_cmap=True)
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})
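The last row of the heatmap can also be read off numerically by sorting each feature's correlation with the target. The sketch below assumes a small hypothetical frame (`toy`) rather than the real `train` data, so the values are illustrative only.

```python
import pandas as pd

# Hypothetical toy frame; in the notebook this would be the `train` DataFrame.
toy = pd.DataFrame({
    'alcohol': [9.0, 9.5, 10.0, 11.0, 12.0],
    'volatile acidity': [0.9, 0.7, 0.6, 0.4, 0.3],
    'quality': [4, 5, 5, 6, 7],
})

# Pearson correlation of every feature with the target, sorted ascending so
# the most negative features come first and the most positive ones last.
quality_corr = toy.corr()['quality'].drop('quality').sort_values()
print(quality_corr)
```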

The violin plots show the distribution of each feature and which features contain outliers.

In [None]:
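# One subplot per column, so the figure adapts to the number of columns.
fig, axes = plt.subplots(len(train.columns), 1, figsize=(15, 20))
for count, col in enumerate(train.columns):
  # Passing the column as `x` draws a horizontal violin.
  sns.violinplot(x=train[col], ax=axes[count])
  axes[count].set_title(col)
fig.tight_layout()  # keep titles from overlapping neighbouring plots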
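The outliers seen in the plots can be quantified with a simple rule. The sketch below uses the common 1.5 × IQR rule, which is an assumption here and not necessarily the criterion intended by the notebook; `count_iqr_outliers` and the `toy` frame are hypothetical names introduced for illustration.

```python
import pandas as pd

def count_iqr_outliers(series: pd.Series, k: float = 1.5) -> int:
    """Count values lying more than k * IQR outside the first/third quartile."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return int(((series < lower) | (series > upper)).sum())

# Hypothetical values; in the notebook this would be the `train` DataFrame.
toy = pd.DataFrame({
    'chlorides': [0.07, 0.08, 0.08, 0.09, 0.08, 0.61],  # one extreme value
    'pH': [3.1, 3.2, 3.3, 3.4, 3.5, 3.3],               # no extreme values
})

outliers = {col: count_iqr_outliers(toy[col]) for col in toy.columns}
print(outliers)
```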

### 3. Data Preparation

Since the data is already very clean, it is only necessary to scale the features. We are going to use two different scalers, `StandardScaler` and `MinMaxScaler`, and compare the results obtained.

In [None]:
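# Split features (all columns but the last) from the target (last column, quality).
# The target is kept 2-D because the scalers expect 2-D input.
train_x = train.iloc[:,:-1]
train_y = train.iloc[:,-1:]

# Standardization: zero mean and unit variance per column.
standard_scaler_x = StandardScaler().fit(train_x)
standard_train_x = standard_scaler_x.transform(train_x)
standard_scaler_y = StandardScaler().fit(train_y)
standard_train_y = standard_scaler_y.transform(train_y)

# Min-max normalization: rescale each column to the [0, 1] range.
normal_scaler_x = MinMaxScaler().fit(train_x)
normal_train_x = normal_scaler_x.transform(train_x)
normal_scaler_y = MinMaxScaler().fit(train_y)
normal_train_y = normal_scaler_y.transform(train_y)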

### 4. Modeling

We evaluate the performance of neural networks under the two preprocessing scalers by running the same grid search on each scaled version of the training data.

In [None]:
params = {
    'hidden_layer_sizes': [(i, ) for i in range(1, 22, 2)],
    'activation': ('identity', 'logistic', 'tanh', 'relu'),
    'batch_size': [50, 100, 200],
    'max_iter': [200, 400, 800],
}

base_mlp = MLPRegressor(random_state=3)
mlp = GridSearchCV(
    base_mlp,
    params,
    scoring='neg_root_mean_squared_error',
    n_jobs=-1,
    cv=5,
    return_train_score=True,
)
# Flatten the (n, 1) target to 1-D, the shape MLPRegressor expects for a single output.
mlp.fit(standard_train_x, standard_train_y.ravel())

results = pd.DataFrame(mlp.cv_results_)
results = results[['param_hidden_layer_sizes', 'param_activation',
  'param_batch_size', 'param_max_iter','mean_test_score', 'rank_test_score',
  'mean_train_score']]
results.to_csv('mlp_standard_results.csv', index=False)
files.download('mlp_standard_results.csv')

dump(mlp, 'mlp_standard.joblib')
files.download('mlp_standard.joblib')
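The cost of the search above is determined by the size of the grid. As a rough sketch, `ParameterGrid` can count the candidate combinations before any training starts; the dictionary below simply repeats the grid from the cell above, and the fold count of 5 matches `cv=5`.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the search above, repeated so the snippet runs on its own.
params = {
    'hidden_layer_sizes': [(i, ) for i in range(1, 22, 2)],
    'activation': ('identity', 'logistic', 'tanh', 'relu'),
    'batch_size': [50, 100, 200],
    'max_iter': [200, 400, 800],
}

n_candidates = len(ParameterGrid(params))  # 11 * 4 * 3 * 3 combinations
n_fits = n_candidates * 5                  # one fit per candidate per CV fold
print(n_candidates, n_fits)
```

With the default `refit=True`, `GridSearchCV` performs one additional fit of the best candidate on the full training data.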

In [None]:
# Re-run the same grid search on the min-max-scaled data (this refits `mlp` in place).
mlp.fit(normal_train_x, normal_train_y.ravel())

results = pd.DataFrame(mlp.cv_results_)
results = results[['param_hidden_layer_sizes', 'param_activation',
  'param_batch_size', 'param_max_iter','mean_test_score', 'rank_test_score',
  'mean_train_score']]
results.to_csv('mlp_minmax_results.csv', index=False)
files.download('mlp_minmax_results.csv')

dump(mlp, 'mlp_minmax.joblib')
files.download('mlp_minmax.joblib')
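One caveat when comparing the two result files: the scores are computed on the scaled target, so an RMSE from the standardized run and one from the min-max run are expressed in different units. A possible way to put them on equal footing is to inverse-transform the predictions back to the original quality scale before computing the error. The sketch below illustrates this on synthetic data; `rmse_in_original_units`, the data, and the model settings are all hypothetical, and the error is measured on the training data for brevity.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic stand-in data (not the wine dataset): 3 features, 1 target.
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 3))
y = (5.5 + X[:, 0] - 0.5 * X[:, 1]
     + rng.normal(scale=0.3, size=200)).reshape(-1, 1)

def rmse_in_original_units(y_scaler):
    """Train on the scaled target, then score on the unscaled target."""
    y_scaled = y_scaler.fit_transform(y).ravel()
    model = MLPRegressor(hidden_layer_sizes=(5,), max_iter=500, random_state=3)
    model.fit(X, y_scaled)
    # Map predictions back to the original target scale before scoring.
    pred = y_scaler.inverse_transform(model.predict(X).reshape(-1, 1))
    return float(np.sqrt(np.mean((pred - y) ** 2)))

rmse_standard = rmse_in_original_units(StandardScaler())
rmse_minmax = rmse_in_original_units(MinMaxScaler())
print(rmse_standard, rmse_minmax)
```

In the notebook, the same idea would apply to `standard_scaler_y` and `normal_scaler_y`, ideally scoring on the validation data instead of the training data.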

### References

[1] [Wine Quality Introduction](https://www.jjbuckley.com/wine-knowledge/blog/the-4-factors-and-4-indicators-of-wine-quality/1009) 