## Regression models on the wine quality dataset

In this notebook, we use regression models to predict wine quality on the wine quality dataset.
We run the models on the white wine data only; the same experiments can be repeated on the red wine data by swapping in `redX` and `redy`.

In [1]:
import pandas as pd
import sklearn
from sklearn.model_selection import GridSearchCV  # used by every grid search below

In [2]:
red = pd.read_csv('winequality-red.csv', header=0)
white = pd.read_csv('winequality-white.csv', header=0)

In [55]:
# show the columns of the data
white.columns

Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality'],
      dtype='object')

In [56]:
# find X and y for the white wine dataset
# (.values replaces as_matrix(), which was removed in pandas 1.0)
whiteX = white.loc[:, 'fixed acidity': 'alcohol'].values
whitey = white.loc[:, 'quality'].values

In [57]:
# find X and y for the red wine dataset
# (.values replaces as_matrix(), which was removed in pandas 1.0)
redX = red.loc[:, 'fixed acidity': 'alcohol'].values
redy = red.loc[:, 'quality'].values
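
The slicing above can be wrapped in a small helper so that switching between the red and white data is a one-line change. This is a minimal sketch that assumes the column layout shown earlier; `split_features_target` is a hypothetical helper, and the toy frame is made up for illustration rather than taken from the dataset.

```python
import pandas as pd

def split_features_target(df, first="fixed acidity", last="alcohol", target="quality"):
    """Return (X, y) as NumPy arrays: the columns first..last and the target column."""
    X = df.loc[:, first:last].to_numpy()
    y = df[target].to_numpy()
    return X, y

# Toy frame with a subset of the real column names (values are invented).
toy = pd.DataFrame({
    "fixed acidity": [7.0, 6.3, 8.1],
    "volatile acidity": [0.27, 0.30, 0.28],
    "alcohol": [8.8, 9.5, 10.1],
    "quality": [6, 6, 5],
})

X, y = split_features_target(toy)
print(X.shape, y.shape)
```

With the real data, the call would be `split_features_target(white)` or `split_features_target(red)`.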

## First model: SVM

We use the support vector regressor (`SVR`) from sklearn. Make sure to set `max_iter` so that fitting does not run indefinitely.
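
To see what the `max_iter` cap does in practice, the sketch below fits an `SVR` on synthetic data (not the wine data) with a deliberately tiny cap and counts any `ConvergenceWarning` raised during fitting. The `StandardScaler` step is an assumption of this sketch and is not part of the notebook; SVMs are sensitive to feature scale, so scaling may help the solver finish well within the cap. `count_early_stops` is a hypothetical helper.

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the wine features: 200 rows, 11 columns on very different scales.
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 11)) * np.logspace(-2, 2, 11)
y = X[:, 0] * 10 + rng.normal(scale=0.1, size=200)

def count_early_stops(model):
    """Fit the model and return how many ConvergenceWarnings were raised."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        model.fit(X, y)
    return sum(issubclass(w.category, ConvergenceWarning) for w in caught)

# Same hyperparameters as the notebook, but with a deliberately tiny iteration cap.
capped = SVR(max_iter=5, gamma=0.03, epsilon=0.1, C=1)
# Assumption of this sketch: scale the features before the SVR.
scaled = make_pipeline(StandardScaler(), SVR(max_iter=1000000, gamma=0.03, epsilon=0.1, C=1))

print("warnings with max_iter=5:", count_early_stops(capped))
print("warnings with scaling and a large cap:", count_early_stops(scaled))
predictions = scaled.predict(X)
```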

In [3]:
from sklearn.svm import SVR
svr = SVR(max_iter=1000000, gamma=0.03, epsilon=0.1, C=1)


In [64]:
param_SVM = {'kernel': ['linear', 'poly', 'rbf', 'sigmoid'], 'degree':[2,3,4,5,6]}
SVM_grid = GridSearchCV(svr,  param_SVM, scoring = 'neg_mean_absolute_error', n_jobs=8, cv=5, verbose=True, return_train_score=True)
SVM_grid.fit(whiteX, whitey)

Fitting 5 folds for each of 20 candidates, totalling 100 fits


[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:  2.8min


[Parallel(n_jobs=8)]: Done 100 out of 100 | elapsed: 11.9min finished


GridSearchCV(cv=5, error_score='raise',
       estimator=SVR(C=1, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma=0.03,
  kernel='rbf', max_iter=1000000, shrinking=True, tol=0.001, verbose=False),
       fit_params=None, iid=True, n_jobs=8,
       param_grid={'kernel': ['linear', 'poly', 'rbf', 'sigmoid'], 'degree': [2, 3, 4, 5, 6]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring='neg_mean_absolute_error', verbose=True)

In [65]:
SVM_grid.cv_results_

{'mean_fit_time': array([ 26.44275041, 139.02047524,   1.95401397,   0.45022049,
         33.03270149, 182.2554975 ,   2.15190535,   0.56166439,
         38.67539644, 184.91105847,   1.93239598,   0.59925499,
         39.91496577, 188.22781754,   2.12911825,   0.55905452,
         40.96137075, 152.13425841,   1.87154036,   0.56012678]),
 'mean_score_time': array([0.04879308, 0.12294068, 0.14000206, 0.05242782, 0.11315107,
        0.10205355, 0.1967515 , 0.06241283, 0.10652165, 0.09881649,
        0.18498449, 0.07589769, 0.10089107, 0.1041357 , 0.16071076,
        0.05933366, 0.11529622, 0.05376692, 0.15313897, 0.06840892]),
 'mean_test_score': array([-7.79333211e-01, -9.70428488e+00, -6.67113575e-01, -6.63495304e-01,
        -7.79333211e-01, -1.41210332e+04, -6.67113575e-01, -6.63495304e-01,
        -7.79333211e-01, -5.62139608e+06, -6.67113575e-01, -6.63495304e-01,
        -7.79333211e-01, -8.83146508e+09, -6.67113575e-01, -6.63495304e-01,
        -7.79333211e-01, -1.56651914e+13, -6.

In [66]:
SVM_grid.best_score_, SVM_grid.best_params_

(-0.6634953042058312, {'degree': 2, 'kernel': 'sigmoid'})
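
In the scores above, the `linear`, `rbf`, and `sigmoid` entries repeat unchanged for every value of `degree`, because `SVR` only uses `degree` with the polynomial kernel. The sketch below shows one way to avoid the redundant fits by passing a list of sub-grids; `reduced_grid` is an illustrative name, not something used in the notebook.

```python
from sklearn.model_selection import ParameterGrid

# The grid used above: every kernel is paired with every degree.
full_grid = {'kernel': ['linear', 'poly', 'rbf', 'sigmoid'], 'degree': [2, 3, 4, 5, 6]}

# A list of dicts is searched one sub-grid at a time, so degree is varied for 'poly' only.
reduced_grid = [
    {'kernel': ['linear', 'rbf', 'sigmoid']},
    {'kernel': ['poly'], 'degree': [2, 3, 4, 5, 6]},
]

n_full = len(ParameterGrid(full_grid))
n_reduced = len(ParameterGrid(reduced_grid))
print(n_full, n_reduced)  # 20 candidates versus 8
```

`GridSearchCV` accepts the list form directly as its `param_grid` argument.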

## Multi-layer Perceptron Regressor

Again, we use the sklearn API.

In [72]:
from sklearn.neural_network import MLPRegressor
MLP = MLPRegressor(early_stopping=True)
param_MLP = {'learning_rate': ['invscaling', 'adaptive'], 
             'solver':['lbfgs', 'sgd', 'adam'],
             'hidden_layer_sizes': [(9000, ), (4000, ), (1000, )],
            'activation': ['tanh', 'relu']}
MLP_grid = GridSearchCV(MLP, param_MLP,  cv=5, scoring = 'neg_mean_absolute_error', n_jobs=8, verbose=True)
MLP_grid.fit(whiteX, whitey)

Fitting 5 folds for each of 36 candidates, totalling 180 fits


[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed: 22.2min
[Parallel(n_jobs=8)]: Done 180 out of 180 | elapsed: 62.2min finished


GridSearchCV(cv=5, error_score='raise',
       estimator=MLPRegressor(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=True, epsilon=1e-08,
       hidden_layer_sizes=(100,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False),
       fit_params=None, iid=True, n_jobs=8,
       param_grid={'learning_rate': ['invscaling', 'adaptive'], 'solver': ['lbfgs', 'sgd', 'adam'], 'hidden_layer_sizes': [(9000,), (4000,), (1000,)], 'activation': ['tanh', 'relu']},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='neg_mean_absolute_error', verbose=True)

In [73]:
MLP_grid.best_score_, MLP_grid.best_estimator_, MLP_grid.best_params_

(-0.5989539574396067,
 MLPRegressor(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
        beta_2=0.999, early_stopping=True, epsilon=1e-08,
        hidden_layer_sizes=(9000,), learning_rate='invscaling',
        learning_rate_init=0.001, max_iter=200, momentum=0.9,
        nesterovs_momentum=True, power_t=0.5, random_state=None,
        shuffle=True, solver='lbfgs', tol=0.0001, validation_fraction=0.1,
        verbose=False, warm_start=False),
 {'activation': 'relu',
  'hidden_layer_sizes': (9000,),
  'learning_rate': 'invscaling',
  'solver': 'lbfgs'})

## Random Forest Regressor

In [74]:
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(random_state=0, n_jobs=8)
param_rf = {'max_depth':[1,2,3,4,5], 'n_estimators': [10, 50]}
rf_grid = GridSearchCV(rf, param_rf, cv=5, scoring = 'neg_mean_absolute_error', n_jobs=8, verbose=True)
rf_grid.fit(whiteX, whitey)

Fitting 5 folds for each of 10 candidates, totalling 50 fits


[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:    1.7s
[Parallel(n_jobs=8)]: Done  50 out of  50 | elapsed:    2.3s finished


GridSearchCV(cv=5, error_score='raise',
       estimator=RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=8,
           oob_score=False, random_state=0, verbose=0, warm_start=False),
       fit_params=None, iid=True, n_jobs=8,
       param_grid={'max_depth': [1, 2, 3, 4, 5], 'n_estimators': [10, 50]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='neg_mean_absolute_error', verbose=True)

In [75]:
rf_grid.best_score_

-0.5864756244532766

In [76]:
rf_grid.best_params_

{'max_depth': 5, 'n_estimators': 50}
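
The scores reported by the grid searches are negative because scikit-learn scorers follow a higher-is-better convention, so `neg_mean_absolute_error` is the mean absolute error (MAE) with its sign flipped. The following minimal sketch, on synthetic data rather than the wine data, shows how to read such scores as ordinary MAE values.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic regression data standing in for the wine features and quality scores.
rng = np.random.RandomState(0)
X = rng.uniform(size=(300, 11))
y = 3 + 4 * X[:, 0] + rng.normal(scale=0.5, size=300)

rf = RandomForestRegressor(n_estimators=50, max_depth=5, random_state=0)
neg_mae = cross_val_score(rf, X, y, cv=5, scoring='neg_mean_absolute_error')

# Scores are negated errors, so flip the sign to read them as MAE.
mae_per_fold = -neg_mae
print("MAE per fold:", mae_per_fold.round(3))
print("mean MAE:", mae_per_fold.mean().round(3))
```

By the same reading, a `best_score_` of about -0.586 corresponds to an MAE of about 0.586 quality points.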

## Gradient Boosting Regressor

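In our experiments, Gradient Boosting consistently gives the best results; here it reaches the lowest cross-validated mean absolute error of all the models in this notebook (about 0.575).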

In [77]:
from sklearn.ensemble import GradientBoostingRegressor
gb = GradientBoostingRegressor()
param_gb = {'max_depth':[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 15, 20], 'n_estimators': [12, 18, 22, 25, 30, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 50, 55, 60, 65, 70]}
gb_grid = GridSearchCV(gb, param_gb, cv=5, scoring = 'neg_mean_absolute_error', n_jobs=8, verbose=True)
gb_grid.fit(whiteX, whitey)

Fitting 5 folds for each of 273 candidates, totalling 1365 fits


[Parallel(n_jobs=8)]: Done 160 tasks      | elapsed:    1.8s
[Parallel(n_jobs=8)]: Done 662 tasks      | elapsed:   16.7s
[Parallel(n_jobs=8)]: Done 952 tasks      | elapsed:   41.9s
[Parallel(n_jobs=8)]: Done 1302 tasks      | elapsed:  1.9min
[Parallel(n_jobs=8)]: Done 1365 out of 1365 | elapsed:  2.2min finished


GridSearchCV(cv=5, error_score='raise',
       estimator=GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
             learning_rate=0.1, loss='ls', max_depth=3, max_features=None,
             max_leaf_nodes=None, min_impurity_decrease=0.0,
             min_impurity_split=None, min_samples_leaf=1,
             min_samples_split=2, min_weight_fraction_leaf=0.0,
             n_estimators=100, presort='auto', random_state=None,
             subsample=1.0, verbose=0, warm_start=False),
       fit_params=None, iid=True, n_jobs=8,
       param_grid={'max_depth': [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 15, 20], 'n_estimators': [12, 18, 22, 25, 30, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 50, 55, 60, 65, 70]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='neg_mean_absolute_error', verbose=True)

In [78]:
gb_grid.best_score_

-0.574935677964577

In [79]:
gb_grid.best_params_

{'max_depth': 4, 'n_estimators': 43}
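
The grid above refits the model for each of the 21 values of `n_estimators`. An alternative, sketched here on synthetic data, is to fit once with the largest value and use `staged_predict` to score every intermediate ensemble size. This is a different technique from the grid search used in the notebook, and it relies on a single validation split rather than 5-fold cross-validation, so its answer need not match the result above.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the wine features and quality scores.
rng = np.random.RandomState(0)
X = rng.uniform(size=(400, 11))
y = 3 + 4 * X[:, 0] - 2 * X[:, 1] ** 2 + rng.normal(scale=0.5, size=400)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

# Fit once with the largest ensemble size from the grid.
gb = GradientBoostingRegressor(max_depth=4, n_estimators=70, random_state=0)
gb.fit(X_train, y_train)

# staged_predict yields the predictions after 1, 2, ..., n_estimators trees.
val_mae = [mean_absolute_error(y_val, pred) for pred in gb.staged_predict(X_val)]
best_n = int(np.argmin(val_mae)) + 1
print("best n_estimators on the validation split:", best_n)
```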

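## AdaBoost

AdaBoost also performs reasonably well, with a cross-validated mean absolute error of about 0.60, though it trails Gradient Boosting and Random Forest.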

In [80]:
from sklearn.ensemble import AdaBoostRegressor
ada = AdaBoostRegressor()
param_ada = {'n_estimators': [10, 100, 1000, 10000, 50000]}
ada_grid = GridSearchCV(ada, param_ada, cv=5, scoring = 'neg_mean_absolute_error', n_jobs=8, verbose=True)
ada_grid.fit(whiteX, whitey)

Fitting 5 folds for each of 5 candidates, totalling 25 fits


[Parallel(n_jobs=8)]: Done  10 out of  25 | elapsed:    0.4s remaining:    0.6s
[Parallel(n_jobs=8)]: Done  25 out of  25 | elapsed:    1.3s finished


GridSearchCV(cv=5, error_score='raise',
       estimator=AdaBoostRegressor(base_estimator=None, learning_rate=1.0, loss='linear',
         n_estimators=50, random_state=None),
       fit_params=None, iid=True, n_jobs=8,
       param_grid={'n_estimators': [10, 100, 1000, 10000, 50000]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='neg_mean_absolute_error', verbose=True)

In [81]:
ada_grid.best_score_

-0.6006140187754887

In [82]:
ada_grid.best_params_

{'n_estimators': 100}

# More linear models

The following models are all linear models. Unsurprisingly, they perform similarly, with cross-validated mean absolute errors between about 0.59 and 0.62.
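
One likely reason the linear models land so close together is that a small penalty barely changes the least-squares solution. The sketch below illustrates this on synthetic data (not the wine data) by comparing the coefficients of `LinearRegression` with those of `Ridge(alpha=0.1)`; the data and `true_coef` are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic linear data with known coefficients.
rng = np.random.RandomState(0)
X = rng.normal(size=(500, 5))
true_coef = np.array([1.5, -2.0, 0.0, 0.5, 3.0])
y = X @ true_coef + rng.normal(scale=0.5, size=500)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=0.1).fit(X, y)

# With alpha this small relative to the sample size, the coefficients nearly coincide.
max_gap = np.abs(ols.coef_ - ridge.coef_).max()
print("largest coefficient difference:", max_gap)
```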

## Lasso Regression

In [83]:
from sklearn.linear_model import Lasso
lasso = Lasso()
param_lasso = {'alpha': [0.1, 0.5, 1]}
lasso_grid = GridSearchCV(lasso, param_lasso, cv=5, scoring = 'neg_mean_absolute_error', n_jobs=8, verbose=True)
lasso_grid.fit(whiteX, whitey)

Fitting 5 folds for each of 3 candidates, totalling 15 fits


[Parallel(n_jobs=8)]: Done  15 out of  15 | elapsed:    0.1s finished


GridSearchCV(cv=5, error_score='raise',
       estimator=Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False),
       fit_params=None, iid=True, n_jobs=8,
       param_grid={'alpha': [0.1, 0.5, 1]}, pre_dispatch='2*n_jobs',
       refit=True, return_train_score='warn',
       scoring='neg_mean_absolute_error', verbose=True)

In [84]:
lasso_grid.best_score_, lasso_grid.best_params_

(-0.6248222827899399, {'alpha': 0.1})

## Ridge Regression

In [85]:
from sklearn.linear_model import Ridge
ridge = Ridge()
param_ridge = {'alpha': [0.1, 0.2, 0.3, 0.5, 1]}
ridge_grid = GridSearchCV(ridge, param_ridge, cv=5, scoring = 'neg_mean_absolute_error', n_jobs=8, verbose=True)
ridge_grid.fit(whiteX, whitey)

Fitting 5 folds for each of 5 candidates, totalling 25 fits


[Parallel(n_jobs=8)]: Done  10 out of  25 | elapsed:    0.1s remaining:    0.1s
[Parallel(n_jobs=8)]: Done  25 out of  25 | elapsed:    0.1s finished


GridSearchCV(cv=5, error_score='raise',
       estimator=Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001),
       fit_params=None, iid=True, n_jobs=8,
       param_grid={'alpha': [0.1, 0.2, 0.3, 0.5, 1]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='neg_mean_absolute_error', verbose=True)

In [86]:
ridge_grid.best_score_, ridge_grid.best_params_

(-0.5959974639014806, {'alpha': 0.1})

## Multiple Linear Regression

In [87]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression(n_jobs=8)
param_lr = {'normalize': [True, False]}
lr_grid = GridSearchCV(lr, param_lr, cv=5, scoring = 'neg_mean_absolute_error', n_jobs=8, verbose=True)
lr_grid.fit(whiteX, whitey)

Fitting 5 folds for each of 2 candidates, totalling 10 fits


[Parallel(n_jobs=8)]: Done   6 out of  10 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=8)]: Done  10 out of  10 | elapsed:    0.1s finished


GridSearchCV(cv=5, error_score='raise',
       estimator=LinearRegression(copy_X=True, fit_intercept=True, n_jobs=8, normalize=False),
       fit_params=None, iid=True, n_jobs=8,
       param_grid={'normalize': [True, False]}, pre_dispatch='2*n_jobs',
       refit=True, return_train_score='warn',
       scoring='neg_mean_absolute_error', verbose=True)

In [88]:
lr_grid.best_score_, lr_grid.best_params_

(-0.5938098071626763, {'normalize': False})

## Elastic nets

In [89]:
from sklearn.linear_model import ElasticNet
enet = ElasticNet()
param_enet = {'alpha': [0.1, 0.5, 1], 'l1_ratio': [0.5, 0.7, 0.3]}
enet_grid = GridSearchCV(enet, param_enet, cv=5, scoring = 'neg_mean_absolute_error', n_jobs=8, verbose=True)
enet_grid.fit(whiteX, whitey)

Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=8)]: Done  45 out of  45 | elapsed:    0.2s finished


GridSearchCV(cv=5, error_score='raise',
       estimator=ElasticNet(alpha=1.0, copy_X=True, fit_intercept=True, l1_ratio=0.5,
      max_iter=1000, normalize=False, positive=False, precompute=False,
      random_state=None, selection='cyclic', tol=0.0001, warm_start=False),
       fit_params=None, iid=True, n_jobs=8,
       param_grid={'alpha': [0.1, 0.5, 1], 'l1_ratio': [0.5, 0.7, 0.3]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='neg_mean_absolute_error', verbose=True)

In [90]:
enet_grid.best_score_, enet_grid.best_params_

(-0.6237531427820255, {'alpha': 0.1, 'l1_ratio': 0.5})
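
To compare the models side by side, the best scores reported in the cells above can be collected and ranked. The sketch below copies those numbers by hand; the model labels are informal names chosen for this summary.

```python
# Best cross-validated scores (negative MAE) copied from the cells above.
best_scores = {
    'SVR': -0.6634953042058312,
    'MLP': -0.5989539574396067,
    'Random Forest': -0.5864756244532766,
    'Gradient Boosting': -0.574935677964577,
    'AdaBoost': -0.6006140187754887,
    'Lasso': -0.6248222827899399,
    'Ridge': -0.5959974639014806,
    'Linear Regression': -0.5938098071626763,
    'Elastic Net': -0.6237531427820255,
}

# Convert to MAE (the negated score) and sort with the smallest error first.
ranking = sorted(
    ((name, -score) for name, score in best_scores.items()),
    key=lambda item: item[1],
)
for name, mae in ranking:
    print(f"{name:<18} {mae:.4f}")
```

Gradient Boosting comes first and SVR last in this ranking, with all nine models within about 0.09 MAE of each other.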