# Machine Learning with H2O - Tutorial 3b: Regression Models (Grid Search)

<hr>

**Objective**:

- This tutorial explains how to fine-tune regression models for better out-of-sample performance, measured on a held-out test set.

<hr>

**Wine Quality Dataset:**

- Source: https://archive.ics.uci.edu/ml/datasets/Wine+Quality
- CSV (https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv)

<hr>
    
**Steps**:

1. GBM with default settings
2. GBM with manual settings
3. GBM with manual settings & cross-validation
4. GBM with manual settings, cross-validation and early stopping
5. GBM with cross-validation, early stopping and full grid search
6. GBM with cross-validation, early stopping and random grid search
7. Model stacking (combining different GLM, DRF, GBM and DNN models)


<hr>

**Full Technical Reference:**

- http://docs.h2o.ai/h2o/latest-stable/h2o-docs/booklets/RBooklet.pdf

<br>


In [1]:
# Start and connect to a local H2O cluster
suppressPackageStartupMessages(library(h2o))
h2o.init(nthreads = -1)


H2O is not running yet, starting it now...

Note:  In case of errors look at the following log files:
    /tmp/Rtmptn0ivG/h2o_joe_started_from_r.out
    /tmp/Rtmptn0ivG/h2o_joe_started_from_r.err


Starting H2O JVM and connecting: .. Connection successful!

R is connected to the H2O cluster: 
    H2O cluster uptime:         1 seconds 808 milliseconds 
    H2O cluster version:        3.10.3.5 
    H2O cluster version age:    10 days  
    H2O cluster name:           H2O_started_from_R_joe_jyt717 
    H2O cluster total nodes:    1 
    H2O cluster total memory:   5.21 GB 
    H2O cluster total cores:    8 
    H2O cluster allowed cores:  8 
    H2O cluster healthy:        TRUE 
    H2O Connection ip:          localhost 
    H2O Connection port:        54321 
    H2O Connection proxy:       NA 
    R Version:                  R version 3.3.2 (2016-10-31) 



<br>

In [2]:
# Import wine quality data from a local CSV file
wine = h2o.importFile("winequality-white.csv")
head(wine, 5)



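| fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 7.0 | 0.27 | 0.36 | 20.7 | 0.045 | 45 | 170 | 1.001 | 3.0 | 0.45 | 8.8 | 6 |
| 6.3 | 0.3 | 0.34 | 1.6 | 0.049 | 14 | 132 | 0.994 | 3.3 | 0.49 | 9.5 | 6 |
| 8.1 | 0.28 | 0.4 | 6.9 | 0.05 | 30 | 97 | 0.9951 | 3.26 | 0.44 | 10.1 | 6 |
| 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47 | 186 | 0.9956 | 3.19 | 0.4 | 9.9 | 6 |
| 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47 | 186 | 0.9956 | 3.19 | 0.4 | 9.9 | 6 |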


In [3]:
# Define features (or predictors)
features = colnames(wine)  # we want to use all the information
features = setdiff(features, 'quality')    # we need to exclude the target 'quality'
features

In [4]:
# Split the H2O data frame into training/test sets
# so we can evaluate performance on held-out data
wine_split = h2o.splitFrame(wine, ratios = 0.8, seed = 1234)

wine_train = wine_split[[1]] # roughly 80% of the rows, used for training
wine_test = wine_split[[2]]  # the remaining ~20%, held out for evaluation

In [5]:
dim(wine_train)

In [6]:
dim(wine_test)
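
The two `dim()` calls are worth running because `h2o.splitFrame` assigns rows to splits randomly, so the 80/20 ratio is approximate rather than exact. The block below is a minimal base-R sketch of that idea, not H2O's internal code. The row count `n` is an assumption taken from the UCI page for the white-wine file, and R's random number generator differs from H2O's, so the counts will not match the H2O frames.

```r
# Sketch of a probabilistic train/test split in base R (illustrative only)
set.seed(1234)            # seeds R's RNG; unrelated to the H2O seed above
n <- 4898                 # assumed row count of winequality-white.csv
u <- runif(n)             # one uniform draw per row
is_train <- u < 0.8       # each row goes to training with probability 0.8

n_train <- sum(is_train)
n_test  <- sum(!is_train)
cat("train rows:", n_train, " test rows:", n_test, "\n")
cat("train fraction:", round(n_train / n, 4), "\n")
```

The printed fraction should be close to, but usually not exactly, 0.8.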

<br>

## Step 1 - Gradient Boosting Machines (GBM) with Default Settings

In [7]:
# Build a Gradient Boosting Machines (GBM) model with default settings
gbm_default = h2o.gbm(x = features,
                      y = 'quality',
                      training_frame = wine_train,
                      seed = 1234,
                      model_id = 'gbm_default')



In [8]:
# Check the model performance on the test dataset
h2o.performance(gbm_default, wine_test)

H2ORegressionMetrics: gbm

MSE:  0.4551121
RMSE:  0.67462
MAE:  0.5219768
RMSLE:  0.1001376
Mean Residual Deviance :  0.4551121
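
To make the metrics in this output concrete, here is a small base-R sketch that computes MSE, RMSE, MAE and RMSLE from a pair of vectors using their usual definitions. The helper name `regression_metrics` and the toy vectors are hypothetical and not part of the tutorial. In the output above the Mean Residual Deviance equals the MSE, which is what one would expect under a squared-error loss.

```r
# Usual definitions of the regression metrics reported by h2o.performance()
regression_metrics <- function(actual, predicted) {
  err <- actual - predicted
  mse <- mean(err^2)
  c(MSE   = mse,
    RMSE  = sqrt(mse),
    MAE   = mean(abs(err)),
    RMSLE = sqrt(mean((log1p(predicted) - log1p(actual))^2)))
}

# Toy quality scores and predictions (made up for illustration)
actual    <- c(6, 6, 5, 7, 6)
predicted <- c(5.8, 6.3, 5.4, 6.1, 6.0)
m <- regression_metrics(actual, predicted)
print(round(m, 4))

# Sanity check against the output above: RMSE is the square root of MSE
rmse_from_mse <- sqrt(0.4551121)
print(rmse_from_mse)
```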


<br>

## Step 2 - GBM with Manual Settings

In [9]:
# Build a GBM with manual settings
gbm_manual = h2o.gbm(x = features,
                     y = 'quality',
                     training_frame = wine_train,
                     seed = 1234,
                     model_id = 'gbm_manual',
                     ntrees = 100,
                     sample_rate = 0.9,
                     col_sample_rate = 0.9)



In [10]:
# Check the model performance on the test dataset
h2o.performance(gbm_manual, wine_test)

H2ORegressionMetrics: gbm

MSE:  0.4432567
RMSE:  0.6657752
MAE:  0.5114358
RMSLE:  0.0989581
Mean Residual Deviance :  0.4432567


<br>

## Step 3 - GBM with Manual Settings & Cross-Validation (CV)

In [11]:
# Build a GBM with manual settings & cross-validation
gbm_manual_cv = h2o.gbm(x = features,
                        y = 'quality',
                        training_frame = wine_train,
                        seed = 1234,
                        model_id = 'gbm_manual_cv',
                        ntrees = 100,
                        sample_rate = 0.9,
                        col_sample_rate = 0.9,
                        nfolds = 5)



In [12]:
# Check the cross-validation model performance
gbm_manual_cv

Model Details:

H2ORegressionModel: gbm
Model ID:  gbm_manual_cv 
Model Summary: 
  number_of_trees number_of_internal_trees model_size_in_bytes min_depth
1             100                      100               32355         5
  max_depth mean_depth min_leaves max_leaves mean_leaves
1         5    5.00000          7         31    20.59000


H2ORegressionMetrics: gbm
** Reported on training data. **

MSE:  0.2743835
RMSE:  0.5238162
MAE:  0.4075921
RMSLE:  0.07748354
Mean Residual Deviance :  0.2743835



H2ORegressionMetrics: gbm
** Reported on cross-validation data. **
** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **

MSE:  0.4502182
RMSE:  0.670983
MAE:  0.5185164
RMSLE:  0.1000784
Mean Residual Deviance :  0.4502182


Cross-Validation Metrics Summary: 
                        mean          sd cv_1_valid cv_2_valid cv_3_valid
mae                0.5183839 0.005799581  0.5173063 0.53289497   0.507699
mse               0.45017532 0.0090

In [13]:
# Check the model performance on the test dataset
h2o.performance(gbm_manual_cv, wine_test)
# The metrics should match gbm_manual above: the final model is trained on the
# same training frame with the same parameters and seed; CV only adds holdout estimates

H2ORegressionMetrics: gbm

MSE:  0.4432567
RMSE:  0.6657752
MAE:  0.5114358
RMSLE:  0.0989581
Mean Residual Deviance :  0.4432567
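
The cross-validation output in this step reports metrics "computed for combined holdout predictions" and, separately, a per-fold summary with a mean. The sketch below illustrates the difference in base R. It uses `lm` on the built-in `mtcars` data as a stand-in for the GBM and the wine data, so every name in it is an assumption made for illustration only.

```r
# 5-fold cross-validation by hand, with lm() standing in for the GBM
set.seed(1234)
k <- 5
n <- nrow(mtcars)
fold <- sample(rep(seq_len(k), length.out = n))   # random fold label per row

holdout_pred <- numeric(n)
for (i in seq_len(k)) {
  # Fit on the other k-1 folds, predict the rows that were held out
  fit <- lm(mpg ~ wt + hp, data = mtcars[fold != i, ])
  holdout_pred[fold == i] <- predict(fit, newdata = mtcars[fold == i, ])
}

sq_err <- (mtcars$mpg - holdout_pred)^2
combined_mse <- mean(sq_err)                 # pooled over all holdout rows
per_fold_mse <- tapply(sq_err, fold, mean)   # one MSE per fold

cat("combined holdout MSE :", combined_mse, "\n")
cat("mean of per-fold MSEs:", mean(per_fold_mse), "\n")
```

The two numbers generally differ slightly when folds have unequal sizes, which is consistent with the small gap between the combined MSE (0.4502182) and the `mean` column (0.45017532) in the output above.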


<br>

## Step 4 - GBM with Manual Settings, CV and Early Stopping

In [14]:
# Build a GBM with manual settings, CV and early stopping
gbm_manual_cv_es = h2o.gbm(x = features,
                           y = 'quality',
                           training_frame = wine_train,
                           seed = 1234,
                           model_id = 'gbm_manual_cv_es',
                           ntrees = 10000,              # upper bound; early stopping picks the actual number
                           sample_rate = 0.9,
                           col_sample_rate = 0.9,
                           nfolds = 5,
                           stopping_metric = 'MSE',     # metric monitored for early stopping
                           stopping_rounds = 15,        # stop once MSE stops improving over 15 scoring rounds
                           score_tree_interval = 1)     # score the model after every tree



In [15]:
# Check the model summary
# which also includes cross-validation model performance
summary(gbm_manual_cv_es)

Model Details:

H2ORegressionModel: gbm
Model Key:  gbm_manual_cv_es 
Model Summary: 
  number_of_trees number_of_internal_trees model_size_in_bytes min_depth
1             155                      155               49780         5
  max_depth mean_depth min_leaves max_leaves mean_leaves
1         5    5.00000          7         32    20.37419

H2ORegressionMetrics: gbm
** Reported on training data. **

MSE:  0.2210799
RMSE:  0.4701914
MAE:  0.3620056
RMSLE:  0.06954328
Mean Residual Deviance :  0.2210799



H2ORegressionMetrics: gbm
** Reported on cross-validation data. **
** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **

MSE:  0.4428879
RMSE:  0.6654982
MAE:  0.5094015
RMSLE:  0.09937082
Mean Residual Deviance :  0.4428879


Cross-Validation Metrics Summary: 
                        mean           sd cv_1_valid cv_2_valid cv_3_valid
mae               0.50952744  0.006480625  0.4993615  0.5179948 0.49803287
mse               0.44306728

In [16]:
# Check the model performance on the test dataset
h2o.performance(gbm_manual_cv_es, wine_test)

H2ORegressionMetrics: gbm

MSE:  0.4287345
RMSE:  0.6547782
MAE:  0.4990124
RMSLE:  0.09753734
Mean Residual Deviance :  0.4287345
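
In this step the model stopped at 155 trees even though `ntrees` was set to 10000. The sketch below shows the general idea with a simplified patience rule in base R. It is not H2O's exact criterion, which is documented as comparing moving averages of the stopping metric and also applies a `stopping_tolerance`. The function `stop_round` and the validation curve are hypothetical.

```r
# Simplified early stopping: stop once the best score so far has not
# improved for `patience` consecutive scoring rounds
stop_round <- function(scores, patience) {
  best <- Inf
  since_best <- 0
  for (i in seq_along(scores)) {
    if (scores[i] < best) {
      best <- scores[i]
      since_best <- 0
    } else {
      since_best <- since_best + 1
    }
    if (since_best >= patience) return(i)
  }
  length(scores)   # rule never triggered: all rounds were used
}

# Made-up validation MSE curve: improves quickly, flattens, then creeps up
trees   <- 1:300
val_mse <- 0.40 + 0.5 * exp(-trees / 30) + 0.0002 * pmax(trees - 120, 0)

stopped_at <- stop_round(val_mse, patience = 15)
best_at    <- which.min(val_mse[seq_len(stopped_at)])
cat("best round:", best_at, " stopped at round:", stopped_at, "\n")
```

With a rule like this, a large `ntrees` acts as an upper bound and the validation metric decides where training actually ends.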


<br>

## Step 5 - GBM with CV, Early Stopping and Full Grid Search

In [17]:
# define the criteria for full grid search
search_criteria = list(strategy = "Cartesian")

In [18]:
# define the range of hyper-parameters for grid search
param_list <- list(
  sample_rate = c(0.7, 0.8, 0.9),
  col_sample_rate = c(0.7, 0.8, 0.9)
)
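
A Cartesian search trains one model per combination of the listed values, so its cost grows multiplicatively with each added hyper-parameter. The base-R sketch below enumerates the combinations for the two ranges above and estimates the number of model fits. The estimate assumes that each combination trains one model per fold plus one final model; the names `grid_values`, `n_combos` and `n_fits` are introduced only for illustration.

```r
# Enumerate the full (Cartesian) grid for the two ranges defined above
grid_values <- list(
  sample_rate     = c(0.7, 0.8, 0.9),
  col_sample_rate = c(0.7, 0.8, 0.9)
)
combos   <- expand.grid(grid_values)   # every pairing of the listed values
n_combos <- nrow(combos)

# Assumption: nfolds CV models plus one final model per combination
nfolds <- 5
n_fits <- n_combos * (nfolds + 1)

print(combos)
cat("combinations:", n_combos, " estimated model fits:", n_fits, "\n")
```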

In [19]:
# Set up GBM grid search
# Add a seed for reproducibility
# Full Grid Search
gbm_full_grid <- h2o.grid(
  
    # Core parameters for model training
    x = features,
    y = 'quality',
    training_frame = wine_train,
    ntrees = 10000,
    nfolds = 5,
    seed = 1234,

    # Parameters for grid search
    grid_id = "gbm_full_grid",
    hyper_params = param_list,
    algorithm = "gbm",
    search_criteria = search_criteria,

    # Parameters for early stopping
    stopping_metric = "MSE",
    stopping_rounds = 15,
    score_tree_interval = 1
  
)



In [20]:
# Sort and show the grid search results
gbm_full_grid <- h2o.getGrid(grid_id = "gbm_full_grid", sort_by = "mse")
print(gbm_full_grid)

H2O Grid Details

Grid ID: gbm_full_grid 
Used hyper parameters: 
  -  col_sample_rate 
  -  sample_rate 
Number of models: 9 
Number of failed models: 0 

Hyper-Parameter Search Summary: ordered by increasing mse
  col_sample_rate sample_rate             model_ids                 mse
1             0.8         0.9 gbm_full_grid_model_7 0.43780785687779805
2             0.7         0.9 gbm_full_grid_model_6 0.44060532786277523
3             0.8         0.8 gbm_full_grid_model_4 0.44096100224896634
4             0.9         0.9 gbm_full_grid_model_8 0.44288792056243054
5             0.9         0.8 gbm_full_grid_model_5 0.44475412455519636
6             0.9         0.7 gbm_full_grid_model_2  0.4457317997358452
7             0.7         0.8 gbm_full_grid_model_3   0.448140619501795
8             0.7         0.7 gbm_full_grid_model_0  0.4528872144586896
9             0.8         0.7 gbm_full_grid_model_1  0.4529771807006373


In [21]:
# Extract the best model from full grid search
best_model_id <- gbm_full_grid@model_ids[[1]] # top of the list
best_gbm_from_full_grid <- h2o.getModel(best_model_id)
summary(best_gbm_from_full_grid)

Model Details:

H2ORegressionModel: gbm
Model Key:  gbm_full_grid_model_7 
Model Summary: 
  number_of_trees number_of_internal_trees model_size_in_bytes min_depth
1             187                      187               57184         5
  max_depth mean_depth min_leaves max_leaves mean_leaves
1         5    5.00000          7         31    19.16043

H2ORegressionMetrics: gbm
** Reported on training data. **

MSE:  0.2103961
RMSE:  0.4586895
MAE:  0.3519789
RMSLE:  0.06802612
Mean Residual Deviance :  0.2103961



H2ORegressionMetrics: gbm
** Reported on cross-validation data. **
** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **

MSE:  0.4378079
RMSE:  0.6616705
MAE:  0.5053946
RMSLE:  0.09860186
Mean Residual Deviance :  0.4378079


Cross-Validation Metrics Summary: 
                        mean           sd cv_1_valid cv_2_valid cv_3_valid
mae                0.5049402  0.010057319  0.5143834 0.52700925 0.48693994
mse               0.437

In [22]:
# Check the model performance on the test dataset
h2o.performance(best_gbm_from_full_grid, wine_test)

H2ORegressionMetrics: gbm

MSE:  0.4196124
RMSE:  0.647775
MAE:  0.4896544
RMSLE:  0.09630233
Mean Residual Deviance :  0.4196124


<br>

## Step 6 - GBM with CV, Early Stopping and Random Grid Search

In [23]:
# define the criteria for random grid search
search_criteria = list(strategy = "RandomDiscrete",
                       max_models = 9,
                       seed = 1234)

In [24]:
# define the range of hyper-parameters for grid search
# 27 combinations in total
param_list <- list(
    sample_rate = c(0.7, 0.8, 0.9),
    col_sample_rate = c(0.7, 0.8, 0.9),
    max_depth = c(3, 5, 7)
)
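
With `max_models = 9` the random search evaluates only a third of the 27 combinations. The base-R sketch below illustrates this by sampling 9 distinct rows from the full grid. It assumes sampling without repeats, and because R's random number generator is not the one H2O uses, the rows it picks will not match the models in the grid results that follow. The names `rand_values`, `all_combos` and `picked` are hypothetical.

```r
# Build all 27 combinations, then draw a random subset of 9
rand_values <- list(
  sample_rate     = c(0.7, 0.8, 0.9),
  col_sample_rate = c(0.7, 0.8, 0.9),
  max_depth       = c(3, 5, 7)
)
all_combos <- expand.grid(rand_values)

set.seed(1234)                                   # R seed, not the H2O seed
picked   <- all_combos[sample(nrow(all_combos), size = 9), ]
coverage <- nrow(picked) / nrow(all_combos)

print(picked)
cat("evaluated", nrow(picked), "of", nrow(all_combos),
    "combinations; coverage =", round(coverage, 3), "\n")
```

The budget stays fixed at 9 models however many values are added to each range, which is the main practical difference from the full grid in Step 5.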

In [25]:
# Set up GBM grid search
# Add a seed for reproducibility
gbm_rand_grid <- h2o.grid(
  
    # Core parameters for model training
    x = features,
    y = 'quality',
    training_frame = wine_train,
    ntrees = 10000,
    nfolds = 5,
    seed = 1234,

    # Parameters for grid search
    grid_id = "gbm_rand_grid",
    hyper_params = param_list,
    algorithm = "gbm",
    search_criteria = search_criteria,

    # Parameters for early stopping
    stopping_metric = "MSE",
    stopping_rounds = 15,
    score_tree_interval = 1
  
)



In [26]:
# Sort and show the grid search results
gbm_rand_grid <- h2o.getGrid(grid_id = "gbm_rand_grid", sort_by = "mse", decreasing = FALSE)
print(gbm_rand_grid)

H2O Grid Details

Grid ID: gbm_rand_grid 
Used hyper parameters: 
  -  col_sample_rate 
  -  max_depth 
  -  sample_rate 
Number of models: 9 
Number of failed models: 0 

Hyper-Parameter Search Summary: ordered by increasing mse
  col_sample_rate max_depth sample_rate             model_ids
1             0.9         7         0.9 gbm_rand_grid_model_5
2             0.7         7         0.7 gbm_rand_grid_model_1
3             0.9         7         0.7 gbm_rand_grid_model_6
4             0.8         7         0.7 gbm_rand_grid_model_4
5             0.7         5         0.8 gbm_rand_grid_model_0
6             0.8         3         0.9 gbm_rand_grid_model_7
7             0.9         3         0.9 gbm_rand_grid_model_2
8             0.8         3         0.8 gbm_rand_grid_model_3
9             0.7         3         0.7 gbm_rand_grid_model_8
                  mse
1  0.4227388012308513
2  0.4327748309201154
3  0.4369533108701783
4  0.4397321318633594
5   0.448140619501795
6  0.4647039373596

In [27]:
# Extract the best model from random grid search
best_model_id <- gbm_rand_grid@model_ids[[1]] # top of the list
best_gbm_from_rand_grid <- h2o.getModel(best_model_id)
summary(best_gbm_from_rand_grid)

Model Details:

H2ORegressionModel: gbm
Model Key:  gbm_rand_grid_model_5 
Model Summary: 
  number_of_trees number_of_internal_trees model_size_in_bytes min_depth
1             142                      142               87919         7
  max_depth mean_depth min_leaves max_leaves mean_leaves
1         7    7.00000         16         82    44.04930

H2ORegressionMetrics: gbm
** Reported on training data. **

MSE:  0.1197865
RMSE:  0.3461018
MAE:  0.2597976
RMSLE:  0.05153244
Mean Residual Deviance :  0.1197865



H2ORegressionMetrics: gbm
** Reported on cross-validation data. **
** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **

MSE:  0.4227388
RMSE:  0.6501837
MAE:  0.4856414
RMSLE:  0.09723137
Mean Residual Deviance :  0.4227388


Cross-Validation Metrics Summary: 
                         mean           sd cv_1_valid cv_2_valid cv_3_valid
mae                 0.4854467  0.006546657 0.49238658  0.4940408 0.46821254
mse                0.

In [28]:
# Check the model performance on the test dataset
h2o.performance(best_gbm_from_rand_grid, wine_test)

H2ORegressionMetrics: gbm

MSE:  0.404719
RMSE:  0.6361753
MAE:  0.473215
RMSLE:  0.09498904
Mean Residual Deviance :  0.404719


In [29]:
h2o.performance(best_gbm_from_rand_grid, wine_test)@metrics$MSE

<br>

## Comparison of Model Performance on Test Data

In [30]:
cat('GBM with Default Settings                        :', 
          h2o.performance(gbm_default, wine_test)@metrics$MSE, "\n")
cat('GBM with Manual Settings                         :', 
          h2o.performance(gbm_manual, wine_test)@metrics$MSE, "\n")
cat('GBM with Manual Settings & CV                    :', 
          h2o.performance(gbm_manual_cv, wine_test)@metrics$MSE, "\n")
cat('GBM with Manual Settings, CV & Early Stopping    :', 
          h2o.performance(gbm_manual_cv_es, wine_test)@metrics$MSE, "\n")
cat('GBM with CV, Early Stopping & Full Grid Search   :', 
          h2o.performance(best_gbm_from_full_grid, wine_test)@metrics$MSE, "\n")
cat('GBM with CV, Early Stopping & Random Grid Search :', 
          h2o.performance(best_gbm_from_rand_grid, wine_test)@metrics$MSE, "\n")

GBM with Default Settings                        : 0.4551121 
GBM with Manual Settings                         : 0.4432567 
GBM with Manual Settings & CV                    : 0.4432567 
GBM with Manual Settings, CV & Early Stopping    : 0.4287345 
GBM with CV, Early Stopping & Full Grid Search   : 0.4196124 
GBM with CV, Early Stopping & Random Grid Search : 0.404719 
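
To put these numbers on a common scale, the sketch below takes the test MSE values printed above, converts them to RMSE, and expresses each as a percentage improvement over the default model. It is plain base R and copies the values by hand; the shortened row labels are introduced here for readability.

```r
# Test-set MSE values copied from the comparison output above
test_mse <- c(
  "Default"                       = 0.4551121,
  "Manual"                        = 0.4432567,
  "Manual + CV"                   = 0.4432567,
  "Manual + CV + early stopping"  = 0.4287345,
  "CV + ES + full grid search"    = 0.4196124,
  "CV + ES + random grid search"  = 0.404719
)

# Percentage reduction in MSE relative to the default model
pct_vs_default <- 100 * (test_mse[1] - test_mse) / test_mse[1]

comparison <- data.frame(
  MSE        = test_mse,
  RMSE       = sqrt(test_mse),
  pct_better = pct_vs_default
)
print(round(comparison, 4))
```

All figures come from a single train/test split, so small differences between neighbouring rows should be read with some caution.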


<br>