# Predicting Wine Quality From Physical Properties

In this notebook, we build an ML model to predict the quality of wine from labeled data. Our procedure is:
1. Perform exploratory data analysis
2. Test out several ML models with minimal hyperparameter tuning
3. Tune the hyperparameters of one of the best-performing models
4. Further characterize the best-performing model

## Dataset definition

This data is from 

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties.
In Decision Support Systems, Elsevier, 47(4):547-553, 2009. 

It consists of a quality ranking and measured physical attributes for 1599 different vinho verde red wines from Portugal. The data was collected from May 2004 to February 2007.

The quality is based on sensory data. Scores are integers on a scale from 0 to 10, although the observed values range only from 3 to 8.

There are 11 physical attributes and all are numerical and continuous. They are (note 1 liter = 1 dm^3)
- Fixed acidity (g(tartaric acid)/dm^3)
- Volatile acidity (g(acetic acid)/dm^3)
- Citric acid (g/dm^3)
- Residual sugar (g/dm^3)
- Chlorides (g(sodium chloride)/dm^3)
- Free sulfur dioxide (mg/dm^3)
- Total sulfur dioxide (mg/dm^3)
- Density (g/cm^3)
- pH (acidity of the wine; lower values are more acidic)
- Sulphates (g(potassium sulphate)/dm^3)
- Alcohol (% vol)

## 1. Exploratory Data Analysis

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

In [None]:
# load data
wine = pd.read_csv(r'/kaggle/input/red-wine-quality-cortez-et-al-2009/winequality-red.csv')

In [None]:
features = wine.drop('quality', axis='columns').columns.tolist()
print(features)

In [None]:
summary = pd.DataFrame()
summary['dtype'] = wine.dtypes
summary['unique'] = wine.nunique(axis=0)
summary['missing'] = wine.isnull().sum()
summary['mean'] = wine.mean()
summary['std'] = wine.std()
summary

In [None]:
wine.describe()

Conveniently, there are no missing values. The variables have somewhat different scales: a typical standard deviation is ~1, but density has a standard deviation of only 0.0019.
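The differing scales matter for the scale-sensitive models used later (ridge, k-nearest neighbors, the neural network), which is why those pipelines start with a `StandardScaler`. A minimal sketch, using made-up columns rather than the wine data, of what that scaling step does:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical values): one column with a spread near 1,
# one with a very small spread, loosely mimicking density
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'wide': rng.normal(loc=10.0, scale=1.0, size=500),
    'narrow': rng.normal(loc=0.997, scale=0.002, size=500),
})

# StandardScaler subtracts each column's mean and divides by its standard deviation
scaled = pd.DataFrame(StandardScaler().fit_transform(toy), columns=toy.columns)

print(toy.std())     # very different spreads before scaling
print(scaled.std())  # both close to 1 after scaling
```

Tree-based models are insensitive to this kind of rescaling, which is consistent with the notebook fitting them on the raw features.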

In [None]:
print(wine.shape)

The data has 11 features and one target (quality). There are 1599 instances in the data. The number of instances is much larger than the number of features, so we won't try dimensionality reduction techniques.

In [None]:
wine.head()

In [None]:
sns.countplot(x='quality', data=wine)

Most of the wines are in the middle qualities of 5 or 6. Very few wines are at the extremes of 3 and 8. Some wines are 4 and 7.

In [None]:
sns.pairplot(wine)

In [None]:
plt.figure(figsize=(8, 8))
sns.heatmap(wine.corr(), annot=True, fmt='.1f', cmap='coolwarm', center=0)
plt.title('Pearson Correlation of Variables')

None of the features has a strong correlation with quality by itself. A few have a moderate correlation with quality, though: alcohol content, volatile acidity and sulphates. Some of the features have significant correlations with each other that are not surprising, such as pH with fixed acidity and density with alcohol content.

Looking at the distributions, features like pH and density are not too skewed and have a non-zero center. Other features like total sulfur dioxide and residual sugar have the peak of their distribution near zero and are strongly right-skewed. Alcohol has a peak near 10%, but is also strongly right-skewed.
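The visual impressions above can also be checked numerically. The following is a sketch on synthetic data (the column names and distributions are assumptions chosen for illustration, not the wine features) showing how pandas reports correlation with a target and skewness:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
toy = pd.DataFrame({
    'symmetric': rng.normal(3.3, 0.15, n),       # roughly bell-shaped, non-zero center
    'right_skewed': rng.lognormal(0.0, 0.8, n),  # peak near zero, long right tail
})
# Hypothetical target that depends moderately on the symmetric column only
toy['target'] = 0.5 * toy['symmetric'] + rng.normal(0, 0.2, n)

# Pearson correlation of every feature with the target
corr_with_target = toy.corr()['target'].drop('target')

# Sample skewness: near 0 for symmetric data, clearly positive for a right tail
skewness = toy.drop(columns='target').skew()

print(corr_with_target)
print(skewness)
```

On the real data, the same two calls would be `wine.corr()['quality']` and `wine.skew()`.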

In [None]:
f, axs = plt.subplots(len(features), 1, sharex=True, figsize=(4, 20))
for i, variable in enumerate(features):
    sns.boxplot(y=variable, x='quality', data=wine, ax=axs[i])

In [None]:
f, axs = plt.subplots(len(features), 1, sharex=True, figsize=(5, 24))
for i, variable in enumerate(features):
    sns.pointplot(x='quality', y=variable, data=wine, ax=axs[i])

Some features are clearly correlated with quality. For some features like chlorides and sulphates, the relationship looks approximately linear. For free and total sulfur dioxide, however, the relationship is significantly nonlinear. Thus a regressor that is capable of representing nonlinear relationships between features and the target may perform better.
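To illustrate why a nonlinear regressor can help, here is a small sketch on synthetic data (not the wine data) in which the target depends on the feature through a U-shaped curve. A linear model cannot capture this relationship, while a shallow decision tree can approximate it:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x_toy = rng.uniform(-1, 1, size=(500, 1))
# U-shaped dependence: almost no linear correlation between feature and target
y_toy = x_toy[:, 0] ** 2 + rng.normal(0, 0.05, size=500)

scoring = 'neg_mean_squared_error'
linear_mse = -cross_val_score(LinearRegression(), x_toy, y_toy, cv=5, scoring=scoring).mean()
tree_mse = -cross_val_score(
    DecisionTreeRegressor(max_depth=4, random_state=0), x_toy, y_toy, cv=5, scoring=scoring
).mean()

print(f'Linear MSE: {linear_mse:.4f}')
print(f'Tree MSE:   {tree_mse:.4f}')
```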

## 2. Testing Models with Minimal Tuning

We'll evaluate several common ML models with minimal hyperparameter tuning based on their mean squared error (MSE). We'll hold out 20 percent of the data for final estimation of performance and use 10-fold cross-validation on the remaining data for model selection.

The models we evaluate are
  - Dummy Regressor that always predicts mean of training set
  - Linear regression
  - Ridge regression
  - k-Nearest Neighbors
  - Decision Tree
  - Extra Trees Regressor
  - LightGBM Regressor
  - Neural Network
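The dummy regressor gives a useful reference point for all the other scores: a model that always predicts the training mean has a training MSE equal to the variance of the target. A minimal sketch with made-up quality scores (the values below are hypothetical):

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

# Hypothetical quality scores; the features are ignored by the dummy model
y_toy = np.array([5, 5, 6, 6, 6, 7, 4, 5, 6, 8], dtype=float)
X_toy = np.zeros((len(y_toy), 1))

toy_dummy = DummyRegressor(strategy='mean').fit(X_toy, y_toy)
baseline_mse = mean_squared_error(y_toy, toy_dummy.predict(X_toy))

print(baseline_mse, np.var(y_toy))  # the two values agree
```

Any model worth keeping should reach a validation MSE clearly below this baseline.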

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_validate

from sklearn.metrics import mean_squared_error

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import StandardScaler

from sklearn.base import BaseEstimator
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import ExtraTreesRegressor
import lightgbm as lgbm

In [None]:
X = wine.drop('quality', axis='columns')
y = wine['quality']

In [None]:
# split off some data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [None]:
from typing import Dict

def evaluate_model(estimator: BaseEstimator, cv: int = 10) -> Dict[str, float]:
    """Print and return the cross-validation results for a model.

    Uses the global X_train and y_train. Standard deviations are taken across folds.
    """
    scoring = 'neg_mean_squared_error'
    scores = cross_validate(estimator, X_train, y_train, return_train_score=True, cv=cv, scoring=scoring)
    # scores are negated MSEs, so flip the sign to report the MSE itself
    train_mean, train_std = -1*scores['train_score'].mean(), scores['train_score'].std()
    print(f'Train MSE: {train_mean} ({train_std})')
    val_mean, val_std = -1*scores['test_score'].mean(), scores['test_score'].std()
    print(f'Validation MSE: {val_mean} ({val_std})')
    fit_mean, fit_std = scores['fit_time'].mean(), scores['fit_time'].std()
    print(f'Fit time: {fit_mean} ({fit_std})')
    score_mean, score_std = scores['score_time'].mean(), scores['score_time'].std()
    print(f'Score time: {score_mean} ({score_std})')
    result = {
        'Train MSE': train_mean,
        'Train std': train_std,
        'Validation MSE': val_mean,
        'Validation std': val_std,
        'Fit time (s)': fit_mean,
        'Score time (s)': score_mean,
    }
    return result

In [None]:
dummy = DummyRegressor()
dummy_result = evaluate_model(dummy)

In [None]:
linear = LinearRegression()
linear_result = evaluate_model(linear)

In [None]:
from sklearn.linear_model import RidgeCV
ridge = Pipeline([
    ('scale', StandardScaler()),
    ('ridge', RidgeCV())
])
ridge_result = evaluate_model(ridge)

In [None]:
knn = Pipeline([
    ('scale', StandardScaler()),
    ('knn', KNeighborsRegressor(n_neighbors=50)),
])
#knn = KNeighborsRegressor()
knn_result = evaluate_model(knn)

In [None]:
dt = DecisionTreeRegressor(max_depth=4)
dt_result = evaluate_model(dt)

In [None]:
extra_tree = ExtraTreesRegressor()
extra_tree_result = evaluate_model(extra_tree)

In [None]:
lgb = lgbm.LGBMRegressor()
lgb_result = evaluate_model(lgb)

In [None]:
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
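# Note: keras.wrappers.scikit_learn is not available in recent Keras releases;
# the separate SciKeras package (scikeras.wrappers.KerasRegressor) is its successor.
from keras.wrappers.scikit_learn import KerasRegressor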

def create_nn_model() -> Sequential:
    """Create neural network model"""
    model = Sequential()
    model.add(Dense(100, input_dim=11, activation='relu'))
    model.add(Dropout(0.3))
    model.add(Dense(1, activation='linear'))
    model.compile(loss='mean_squared_error', optimizer='adam', metrics=['mse'])
    return model

nn = Pipeline([
    ('scale', StandardScaler()),
    ('nn', KerasRegressor(build_fn=create_nn_model, epochs=150, batch_size=50, verbose=0))
])
# we only use 3-fold CV for the neural network since it trains slower
nn_result = evaluate_model(nn, cv=3)

In [None]:
# Summarize Performances
pd.DataFrame({
    'dummy': dummy_result,
    'linear': linear_result,
    'ridge': ridge_result,
    'knn': knn_result,
    'dt': dt_result,
    'extra_trees': extra_tree_result,
    'lgbm': lgb_result,
    'nn': nn_result,
}).transpose().sort_values(by='Validation MSE', ascending=True)

The Extra Trees and LightGBM models have similar validation MSEs, both clearly lower than those of the other models we tested.

## 3. Tuning Hyperparameters of Chosen Model: Extra Trees

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
from typing import List

def plot_param_search(estimator: BaseEstimator, parameter: str, parameter_values: List):
    """Plot training and validation MSE as a function of parameter values
    """
    param_grid = {parameter: parameter_values}
    estimator_cv = GridSearchCV(estimator, return_train_score=True, param_grid=param_grid, scoring='neg_mean_squared_error', cv=10)
    estimator_cv.fit(X_train, y_train)
    results = estimator_cv.cv_results_
    f, axs = plt.subplots(2, 1, sharex=True)
    n_splits = estimator_cv.n_splits_
    axs[0].errorbar(parameter_values, -1*results['mean_train_score'], yerr=results['std_train_score']/np.sqrt(n_splits))
    axs[1].errorbar(parameter_values, -1*results['mean_test_score'], yerr=results['std_test_score']/np.sqrt(n_splits))
    axs[0].set_ylabel('Train MSE')
    axs[1].set_ylabel('Validation\nMSE')
    axs[1].set_xlabel(parameter)
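The error bars drawn by `plot_param_search` are standard errors of the mean: the standard deviation of the per-fold scores divided by the square root of the number of folds. A small sketch of that arithmetic with hypothetical per-fold values:

```python
import numpy as np

# Hypothetical validation MSEs from the 10 folds of one parameter setting
fold_mse = np.array([0.36, 0.41, 0.33, 0.39, 0.44, 0.35, 0.38, 0.40, 0.37, 0.42])

mean_mse = fold_mse.mean()
std_mse = fold_mse.std()                    # NumPy default (ddof=0)
sem_mse = std_mse / np.sqrt(len(fold_mse))  # standard error of the mean

print(f'{mean_mse:.3f} +/- {sem_mse:.3f}')
```

The standard error describes the uncertainty of the mean score, so it is smaller than the fold-to-fold spread by a factor of about 3 for 10 folds.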

In [None]:
parameter = 'n_estimators'
parameter_values = [10, 20, 50, 100, 200, 500]
et = ExtraTreesRegressor()
plot_param_search(et, parameter, parameter_values)

In [None]:
parameter = 'min_samples_split'
parameter_values = [2, 3, 4, 5, 10, 20, 50]
et = ExtraTreesRegressor(n_estimators=100)
plot_param_search(et, parameter, parameter_values)

In [None]:
parameter = 'max_features'
parameter_values = [1, 2, 4, 8, 11]
et = ExtraTreesRegressor(n_estimators=100, min_samples_split=5)
plot_param_search(et, parameter, parameter_values)

In [None]:
tuned_model = ExtraTreesRegressor(n_estimators=100, min_samples_split=5)

In [None]:
evaluate_model(tuned_model)

Hyperparameter tuning didn't significantly change the MSE. It seems the defaults were already pretty good.
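The searches above vary one hyperparameter at a time. An alternative, sketched here on synthetic data from `make_regression` with an arbitrarily chosen small grid, is to search the parameters jointly so that interactions between them are not missed. This is an illustration of the technique, not the procedure used in this notebook:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in with 11 features (assumption: not the wine data)
X_toy, y_toy = make_regression(n_samples=200, n_features=11, noise=10.0, random_state=0)

# Every combination of the two parameters is evaluated (3 x 3 = 9 candidates)
param_grid = {
    'min_samples_split': [2, 5, 10],
    'max_features': [4, 8, 11],
}
search = GridSearchCV(
    ExtraTreesRegressor(n_estimators=50, random_state=0),
    param_grid=param_grid,
    scoring='neg_mean_squared_error',
    cv=3,
)
search.fit(X_toy, y_toy)

print(search.best_params_)
print(-search.best_score_)  # best mean validation MSE
```

The cost grows with the product of the grid sizes, which is one reason to search a single parameter at a time when fits are slow.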

## 4. Further Characterization of Tuned Model

In [None]:
best_model = tuned_model
best_model.fit(X_train, y_train)
print('Mean Squared Error of Tuned Model on Test Set:')
print(mean_squared_error(y_test, best_model.predict(X_test)))

In [None]:
plt.scatter(y_test, best_model.predict(X_test), s=1)
plt.title('Wine Quality')
plt.xlabel('Ground Truth')
plt.ylabel('Prediction')

In [None]:
y_predict = best_model.predict(X_test)
# only use scores that occur in the test set; mean_squared_error raises on an empty selection
scores_present = np.sort(y_test.unique())
mse_list = []
for score in scores_present:
    at_score = (y_test == score).to_numpy()
    mse = mean_squared_error(y_test[at_score], y_predict[at_score])
    mse_list.append(mse)
plt.figure()
plt.plot(scores_present, mse_list)
plt.xlabel('Ground Truth Score')
plt.ylabel('Test Set MSE')

The model overestimates the quality of the low-scoring wines and underestimates the quality of the high-scoring wines. The MSE for wines with low or high ground-truth scores is higher than for those in the middle. These wines account for only a small percentage of the total, though. One could look into developing a different model if, for example, estimating the scores of the highest-rated wines is more important than the others.
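The direction of the error can be quantified as the mean signed error (prediction minus ground truth) per score. The sketch below uses simulated scores and simulated predictions that are pulled halfway toward the overall mean; both are assumptions made to reproduce the pattern, not output from the model above:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Simulated ground truth, concentrated on the middle scores
true_score = rng.choice([3, 4, 5, 6, 7, 8], size=2000, p=[0.01, 0.03, 0.43, 0.40, 0.12, 0.01])
# Simulated predictions shrunk toward the overall mean, plus noise
overall_mean = true_score.mean()
prediction = overall_mean + 0.5 * (true_score - overall_mean) + rng.normal(0, 0.3, size=2000)

errors = pd.DataFrame({'true': true_score, 'error': prediction - true_score})
# Positive values mean overestimation, negative values mean underestimation
bias_by_score = errors.groupby('true')['error'].mean()

print(bias_by_score)
```

With the real model, replacing `true_score` and `prediction` by `y_test` and `best_model.predict(X_test)` would give the corresponding table.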

In [None]:
feature_importances = pd.Series(best_model.feature_importances_, index=X_train.columns).sort_values(ascending=False)
feature_importances.plot(kind='bar')
plt.title('Feature Importances')

As could be expected from the exploratory data analysis, alcohol, volatile acidity and sulphates are all important features in the model. Alcohol content is the most important feature.
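The `feature_importances_` attribute of tree ensembles is impurity-based and computed on the training data. Permutation importance on held-out data is a common cross-check. The following sketch uses synthetic data in which only the first column is informative; the data and names are hypothetical:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(400, 3))
y_toy = 2.0 * X_toy[:, 0] + rng.normal(0, 0.1, size=400)  # only column 0 matters

X_fit, X_val, y_fit, y_val = train_test_split(X_toy, y_toy, test_size=0.25, random_state=0)
toy_model = ExtraTreesRegressor(n_estimators=50, random_state=0).fit(X_fit, y_fit)

# Shuffle one column at a time on held-out data and measure the drop in score
perm = permutation_importance(toy_model, X_val, y_val, n_repeats=10, random_state=0)

print(toy_model.feature_importances_)  # impurity-based
print(perm.importances_mean)           # permutation-based
```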

In [None]:
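# plot_partial_dependence was removed in scikit-learn 1.2;
# PartialDependenceDisplay.from_estimator (available since 1.0) replaces it
from sklearn.inspection import PartialDependenceDisplay
plt.figure(figsize=(12, 12))
PartialDependenceDisplay.from_estimator(best_model, X_train, list(X_train.columns), ax=plt.gca())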

The model has a significant and nonlinear partial dependence on alcohol and sulphates. The model has a more modest partial dependence on other features.
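For readers unfamiliar with partial dependence, the quantity plotted is the average model prediction when one feature is forced to a given value for every sample while the other features keep their observed values. A minimal hand-rolled sketch on synthetic data (the helper `manual_partial_dependence` is hypothetical and written only for illustration):

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 1, size=(300, 2))
# Strong nonlinear dependence on column 0, weak linear dependence on column 1
y_toy = np.sin(3 * X_toy[:, 0]) + 0.1 * X_toy[:, 1]

toy_model = ExtraTreesRegressor(n_estimators=50, random_state=0).fit(X_toy, y_toy)

def manual_partial_dependence(model, X, feature_index, grid):
    """Average prediction with one feature set to each grid value in turn."""
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature_index] = value  # overwrite the feature for all samples
        averages.append(model.predict(X_mod).mean())
    return np.array(averages)

grid = np.linspace(0.05, 0.95, 10)
pdp_feature0 = manual_partial_dependence(toy_model, X_toy, 0, grid)
pdp_feature1 = manual_partial_dependence(toy_model, X_toy, 1, grid)

print(pdp_feature0.max() - pdp_feature0.min())  # large variation
print(pdp_feature1.max() - pdp_feature1.min())  # small variation
```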