# Machine Learning with H2O - Tutorial 3: Basic Regression Models

<hr>

**Objective**:

- This tutorial explains how to build regression models with four different H2O algorithms.

<hr>

**Wine Quality Dataset:**

- Source: https://archive.ics.uci.edu/ml/datasets/Wine+Quality
- CSV (https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv)

<hr>
    
**Algorithms**:

1. GLM
2. DRF
3. GBM
4. DNN


<hr>

**Full Technical Reference:**

- http://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/modeling.html

<br>


In [1]:
# Start and connect to a local H2O cluster
import h2o
h2o.init(nthreads = -1)

Checking whether there is an H2O instance running at http://localhost:54321..... not found.
Attempting to start a local H2O server...
  Java Version: java version "1.8.0_121"; Java(TM) SE Runtime Environment (build 1.8.0_121-b13); Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)
  Starting server from /home/joe/anaconda3/lib/python3.5/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /tmp/tmp6vv0ebqc
  JVM stdout: /tmp/tmp6vv0ebqc/h2o_joe_started_from_python.out
  JVM stderr: /tmp/tmp6vv0ebqc/h2o_joe_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321... successful.


H2O cluster uptime:,02 secs
H2O cluster version:,3.10.3.5
H2O cluster version age:,6 days
H2O cluster name:,H2O_from_python_joe_pi4bgn
H2O cluster total nodes:,1
H2O cluster free memory:,5.210 Gb
H2O cluster total cores:,8
H2O cluster allowed cores:,8
H2O cluster status:,"accepting new members, healthy"
H2O connection url:,http://127.0.0.1:54321


<br>

In [2]:
# Import wine quality data from a local CSV file
wine = h2o.import_file("winequality-white.csv")
wine.head(5)

Parse progress: |█████████████████████████████████████████████████████████| 100%


fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
7.0,0.27,0.36,20.7,0.045,45,170,1.001,3.0,0.45,8.8,6
6.3,0.3,0.34,1.6,0.049,14,132,0.994,3.3,0.49,9.5,6
8.1,0.28,0.4,6.9,0.05,30,97,0.9951,3.26,0.44,10.1,6
7.2,0.23,0.32,8.5,0.058,47,186,0.9956,3.19,0.4,9.9,6
7.2,0.23,0.32,8.5,0.058,47,186,0.9956,3.19,0.4,9.9,6




In [3]:
# Define features (or predictors)
features = list(wine.columns) # we want to use all the information
features.remove('quality')    # we need to exclude the target 'quality' (otherwise there is nothing to predict)
features

['fixed acidity',
 'volatile acidity',
 'citric acid',
 'residual sugar',
 'chlorides',
 'free sulfur dioxide',
 'total sulfur dioxide',
 'density',
 'pH',
 'sulphates',
 'alcohol']
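
An equivalent way to build the predictor list is a list comprehension that filters out the target. The sketch below uses a plain Python list of the column names printed above in place of `wine.columns`, so it runs without an H2O cluster; the names `columns` and `target` are illustrative and not part of the original notebook.

```python
# Column names as printed by wine.head(5); with H2O running, use wine.columns instead
columns = ['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
           'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
           'pH', 'sulphates', 'alcohol', 'quality']
target = 'quality'

# Keep every column except the target
features = [c for c in columns if c != target]
print(len(features))
```

Unlike `list.remove`, this form does not raise an error if the target column is missing, which can be convenient when the same code is reused on other frames.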

In [4]:
# Split the H2O data frame into training/test sets
# so we can evaluate performance on held-out data
# (no seed is set here, so the split differs between runs; split_frame accepts a seed argument)
wine_split = wine.split_frame(ratios = [0.8])

wine_train = wine_split[0] # using roughly 80% for training
wine_test = wine_split[1]  # using the remaining ~20% for held-out evaluation

In [5]:
wine_train.shape

(3920, 12)

In [6]:
wine_test.shape

(978, 12)
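
The two shapes add up to 4,898 rows, and 3,920 / 4,898 is about 80.03%, not exactly 80%. That is consistent with `split_frame` assigning rows to splits at random rather than cutting at an exact row count. The following is a simplified simulation of that idea in plain Python, not H2O's actual implementation; `simulate_split` and its parameters are hypothetical names.

```python
import random

def simulate_split(n_rows, ratio, seed):
    """Assign each row to train with probability `ratio`, otherwise to test."""
    rng = random.Random(seed)
    n_train = sum(1 for _ in range(n_rows) if rng.random() < ratio)
    return n_train, n_rows - n_train

n_rows = 3920 + 978  # total rows from the two shapes above
for seed in (1, 2, 3):
    n_train, n_test = simulate_split(n_rows, 0.8, seed)
    print(seed, n_train, n_test, round(n_train / n_rows, 4))
```

Each seed gives a slightly different train/test count, which is why fixing a seed matters when results need to be compared across runs.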

<br>

## Generalized Linear Model

In [7]:
# Build a Generalized Linear Model (GLM) with default settings

# Import the estimator class for GLM
from h2o.estimators.glm import H2OGeneralizedLinearEstimator

# Set up GLM for regression
glm_default = H2OGeneralizedLinearEstimator(family = 'gaussian', model_id = 'glm_default')

# Use .train() to build the model
glm_default.train(x = features, 
                  y = 'quality', 
                  training_frame = wine_train)

glm Model Build progress: |███████████████████████████████████████████████| 100%


In [8]:
# Check the model performance on the training dataset
glm_default

Model Details
H2OGeneralizedLinearEstimator :  Generalized Linear Modeling
Model Key:  glm_default


ModelMetricsRegressionGLM: glm
** Reported on train data. **

MSE: 0.5467603340138831
RMSE: 0.7394324404662559
MAE: 0.5779878196869473
RMSLE: 0.10933916345150961
R^2: 0.2911164642372668
Mean Residual Deviance: 0.5467603340138831
Null degrees of freedom: 3919
Residual degrees of freedom: 3909
Null deviance: 3023.4875000000006
Residual deviance: 2143.300509334422
AIC: 8781.798802624067
Scoring History: 


,timestamp,duration,iteration,negative_log_likelihood,objective
,2017-02-24 14:29:03,0.000 sec,0,3023.4875000,0.7712978
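
Several of the training metrics above are related to one another for a Gaussian GLM: RMSE is the square root of MSE, MSE equals the residual deviance divided by the number of rows, and R^2 equals one minus the ratio of residual deviance to null deviance. The sketch below checks these identities against the printed numbers (with 3,920 training rows, from `wine_train.shape`). It is a sanity check on reported values, not a call into H2O.

```python
import math

# Values copied from the GLM training output above
n_rows = 3920
mse = 0.5467603340138831
rmse = 0.7394324404662559
r2 = 0.2911164642372668
null_deviance = 3023.4875000000006
residual_deviance = 2143.300509334422

# Each check prints True when the identity holds within a small tolerance
rmse_ok = math.isclose(math.sqrt(mse), rmse, rel_tol=1e-9)
mse_ok = math.isclose(residual_deviance / n_rows, mse, rel_tol=1e-9)
r2_ok = math.isclose(1 - residual_deviance / null_deviance, r2, rel_tol=1e-6)
print(rmse_ok, mse_ok, r2_ok)
```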




In [9]:
# Check the model performance on the test dataset
glm_default.model_performance(wine_test)


ModelMetricsRegressionGLM: glm
** Reported on test data. **

MSE: 0.6379083941821849
RMSE: 0.7986916765449512
MAE: 0.6084586866057362
RMSLE: 0.12068062132646235
R^2: 0.23682987530728572
Mean Residual Deviance: 0.6379083941821849
Null degrees of freedom: 977
Residual degrees of freedom: 967
Null deviance: 817.5084757653062
Residual deviance: 623.8744095101769
AIC: 2359.773515167577




<br>

## Distributed Random Forest

In [10]:
# Build a Distributed Random Forest (DRF) model with default settings

# Import the estimator class for DRF
from h2o.estimators.random_forest import H2ORandomForestEstimator

# Set up DRF for regression
# Add a seed for reproducibility
drf_default = H2ORandomForestEstimator(model_id = 'drf_default', seed = 1234)

# Use .train() to build the model
drf_default.train(x = features, 
                  y = 'quality', 
                  training_frame = wine_train)

drf Model Build progress: |███████████████████████████████████████████████| 100%


In [11]:
# Check the DRF model summary
drf_default

Model Details
H2ORandomForestEstimator :  Distributed Random Forest
Model Key:  drf_default


ModelMetricsRegression: drf
** Reported on train data. **

MSE: 0.3786311991201336
RMSE: 0.615330154567557
MAE: 0.4423354990678668
RMSLE: 0.09221859689553939
Mean Residual Deviance: 0.3786311991201336
Scoring History: 


,timestamp,duration,number_of_trees,training_rmse,training_mae,training_deviance
,2017-02-24 14:29:03,0.010 sec,0.0,,,
,2017-02-24 14:29:03,0.279 sec,1.0,0.9087144,0.5780743,0.8257618
,2017-02-24 14:29:04,0.354 sec,2.0,0.8919799,0.5738643,0.7956282
,2017-02-24 14:29:04,0.413 sec,3.0,0.8502710,0.5492569,0.7229608
,2017-02-24 14:29:04,0.469 sec,4.0,0.8352497,0.5485249,0.6976421
---,---,---,---,---,---,---
,2017-02-24 14:29:05,1.848 sec,46.0,0.6177766,0.4440441,0.3816479
,2017-02-24 14:29:05,1.870 sec,47.0,0.6170476,0.4438940,0.3807478
,2017-02-24 14:29:05,1.890 sec,48.0,0.6163446,0.4433232,0.3798807



See the whole table with table.as_data_frame()
Variable Importances: 


variable,relative_importance,scaled_importance,percentage
alcohol,20878.3164062,1.0,0.2070740
density,11681.6083984,0.5595091,0.1158598
free sulfur dioxide,10304.6474609,0.4935574,0.1022029
volatile acidity,10046.6591797,0.4812006,0.0996441
total sulfur dioxide,7576.9418945,0.3629096,0.0751491
chlorides,7170.6318359,0.3434488,0.0711193
pH,6974.2666016,0.3340435,0.0691717
residual sugar,6736.8276367,0.3226710,0.0668168
fixed acidity,6671.1586914,0.3195257,0.0661655




In [12]:
# Check the model performance on the test dataset
drf_default.model_performance(wine_test)


ModelMetricsRegression: drf
** Reported on test data. **

MSE: 0.39639765836475405
RMSE: 0.6296011899327654
MAE: 0.4470666172070494
RMSLE: 0.0956471646937074
Mean Residual Deviance: 0.39639765836475405




<br>

## Gradient Boosting Machines

In [13]:
# Build a Gradient Boosting Machine (GBM) model with default settings

# Import the estimator class for GBM
from h2o.estimators.gbm import H2OGradientBoostingEstimator

# Set up GBM for regression
# Add a seed for reproducibility
gbm_default = H2OGradientBoostingEstimator(model_id = 'gbm_default', seed = 1234)

# Use .train() to build the model
gbm_default.train(x = features, 
                  y = 'quality', 
                  training_frame = wine_train)

gbm Model Build progress: |███████████████████████████████████████████████| 100%


In [14]:
# Check the GBM model summary
gbm_default

Model Details
H2OGradientBoostingEstimator :  Gradient Boosting Machine
Model Key:  gbm_default


ModelMetricsRegression: gbm
** Reported on train data. **

MSE: 0.3279901359237029
RMSE: 0.5727042307541502
MAE: 0.4468592957574494
RMSLE: 0.08454073238229556
Mean Residual Deviance: 0.3279901359237029
Scoring History: 


,timestamp,duration,number_of_trees,training_rmse,training_mae,training_deviance
,2017-02-24 14:29:06,0.004 sec,0.0,0.8782356,0.6676549,0.7712978
,2017-02-24 14:29:06,0.074 sec,1.0,0.8470435,0.6414797,0.7174826
,2017-02-24 14:29:06,0.092 sec,2.0,0.8205102,0.6196456,0.6732370
,2017-02-24 14:29:06,0.108 sec,3.0,0.7978752,0.6061423,0.6366049
,2017-02-24 14:29:06,0.132 sec,4.0,0.7779741,0.5961975,0.6052437
---,---,---,---,---,---,---
,2017-02-24 14:29:06,0.696 sec,46.0,0.5786491,0.4516237,0.3348347
,2017-02-24 14:29:06,0.705 sec,47.0,0.5775187,0.4506466,0.3335279
,2017-02-24 14:29:06,0.713 sec,48.0,0.5770047,0.4500542,0.3329344



See the whole table with table.as_data_frame()
Variable Importances: 


variable,relative_importance,scaled_importance,percentage
alcohol,3443.4946289,1.0,0.3760723
volatile acidity,1333.1967773,0.3871639,0.1456016
free sulfur dioxide,1227.8638916,0.3565749,0.1340980
residual sugar,572.8696289,0.1663629,0.0625645
pH,529.3963623,0.1537381,0.0578166
citric acid,438.0522766,0.1272115,0.0478407
fixed acidity,384.2873535,0.1115981,0.0419689
chlorides,319.1284485,0.0926758,0.0348528
sulphates,317.8505859,0.0923047,0.0347132




In [15]:
# Check the model performance on the test dataset
gbm_default.model_performance(wine_test)


ModelMetricsRegression: gbm
** Reported on test data. **

MSE: 0.4982022441718445
RMSE: 0.7058344311322907
MAE: 0.545334849281441
RMSLE: 0.10596392876553129
Mean Residual Deviance: 0.4982022441718445




<br>

## H2O Deep Learning

In [16]:
# Build a Deep Learning (Deep Neural Networks, DNN) model with default settings

# Import the estimator class for DNN
from h2o.estimators.deeplearning import H2ODeepLearningEstimator

# Set up DNN for regression
dnn_default = H2ODeepLearningEstimator(model_id = 'dnn_default')

# (not run) Change 'reproducible' to True if you want to reproduce the results
# The model will be built using a single thread (could be very slow)
# dnn_default = H2ODeepLearningEstimator(model_id = 'dnn_default', reproducible = True)

# Use .train() to build the model
dnn_default.train(x = features, 
                  y = 'quality', 
                  training_frame = wine_train)

deeplearning Model Build progress: |██████████████████████████████████████| 100%


In [17]:
# Check the DNN model summary
dnn_default

Model Details
H2ODeepLearningEstimator :  Deep Learning
Model Key:  dnn_default


ModelMetricsRegression: deeplearning
** Reported on train data. **

MSE: 0.43786400091608246
RMSE: 0.6617129293856079
MAE: 0.5095157254634646
RMSLE: 0.09753833427573282
Mean Residual Deviance: 0.43786400091608246
Scoring History: 


,timestamp,duration,training_speed,epochs,iterations,samples,training_rmse,training_deviance,training_mae
,2017-02-24 14:29:07,0.000 sec,,0.0,0,0.0,,,
,2017-02-24 14:29:08,1.857 sec,5781 obs/sec,1.0,1,3920.0,0.7517084,0.5650655,0.5805199
,2017-02-24 14:29:10,3.825 sec,15100 obs/sec,10.0,10,39200.0,0.6617129,0.4378640,0.5095157




In [18]:
# Check the model performance on the test dataset
dnn_default.model_performance(wine_test)


ModelMetricsRegression: deeplearning
** Reported on test data. **

MSE: 0.5250970590349768
RMSE: 0.7246358113114317
MAE: 0.5599060268088915
RMSLE: 0.10771735822508255
Mean Residual Deviance: 0.5250970590349768
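
To compare the four default models, the training and test RMSE values reported above can be collected in one place. The sketch below uses those printed numbers directly so it runs without a cluster. With live models, the same values could probably be read from the metrics objects (for example `model_performance(wine_test).rmse()`, stated here as an assumption about the H2O Python API version in use). The numbers come from a single unseeded random split, so the ranking may change on another run.

```python
# RMSE values copied from the outputs above: (train, test)
rmse = {
    'GLM': (0.7394324404662559, 0.7986916765449512),
    'DRF': (0.615330154567557, 0.6296011899327654),
    'GBM': (0.5727042307541502, 0.7058344311322907),
    'DNN': (0.6617129293856079, 0.7246358113114317),
}

# Rank models by test RMSE (lower is better)
ranking = sorted(rmse, key=lambda name: rmse[name][1])

print('model  train_rmse  test_rmse     gap')
for name in ranking:
    train, test = rmse[name]
    print(f'{name:<6} {train:10.4f} {test:10.4f} {test - train:7.4f}')
```

On this split, DRF has the lowest test RMSE and GLM the highest. The train/test gap is smallest for DRF; as far as the H2O documentation describes, DRF training metrics are computed on out-of-bag samples, which would make them closer to a held-out estimate than the training metrics of the other models.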




<br>

## Making Predictions

In [19]:
# Use the GLM model to make predictions
yhat_test_glm = glm_default.predict(wine_test)
yhat_test_glm.head(5)

glm prediction progress: |████████████████████████████████████████████████| 100%


predict
5.51509
5.73903
5.49107
5.48028
5.91343




In [20]:
# Use the DRF model to make predictions
yhat_test_drf = drf_default.predict(wine_test)
yhat_test_drf.head(5)

drf prediction progress: |████████████████████████████████████████████████| 100%


predict
5.70643
5.71689
5.86
5.7
5.99533




In [21]:
# Use the GBM model to make predictions
yhat_test_gbm = gbm_default.predict(wine_test)
yhat_test_gbm.head(5)

gbm prediction progress: |████████████████████████████████████████████████| 100%


predict
5.52585
5.88708
5.63179
5.66957
5.9565




In [22]:
# Use the DNN model to make predictions
yhat_test_dnn = dnn_default.predict(wine_test)
yhat_test_dnn.head(5)

deeplearning prediction progress: |███████████████████████████████████████| 100%


predict
5.5508
5.64736
5.93948
5.5074
6.09279
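
The four prediction frames can be easier to compare side by side. The sketch below does this in plain Python with the five predictions printed above for each model, and adds a per-row spread (largest minus smallest prediction) as a rough measure of how much the models disagree. With live H2O frames, column-binding (for example `H2OFrame.cbind`) would be one way to build a similar table; that is mentioned as an assumption rather than shown.

```python
# First five test-set predictions copied from the outputs above
preds = {
    'glm': [5.51509, 5.73903, 5.49107, 5.48028, 5.91343],
    'drf': [5.70643, 5.71689, 5.86, 5.7, 5.99533],
    'gbm': [5.52585, 5.88708, 5.63179, 5.66957, 5.9565],
    'dnn': [5.5508, 5.64736, 5.93948, 5.5074, 6.09279],
}

models = list(preds)
print('row  ' + '  '.join(f'{m:>7}' for m in models) + '   spread')

spreads = []
for i, row in enumerate(zip(*(preds[m] for m in models))):
    spread = max(row) - min(row)  # disagreement between models on this row
    spreads.append(spread)
    print(f'{i:<4} ' + '  '.join(f'{v:7.4f}' for v in row) + f'  {spread:7.4f}')
```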




<br>