### H2O AutoML

AutoML is a function in H2O that automates training and tuning a large number of models, with the goal of finding the "best" model with minimal prior knowledge or effort from the data scientist.

As of H2O 3.16.*, AutoML trains and cross-validates a default Random Forest, an Extremely Randomized Forest, a random grid of Gradient Boosting Machines (GBMs), a random grid of Deep Neural Nets, and a fixed grid of GLMs, and then trains two Stacked Ensemble models at the end. One ensemble contains all the models (optimized for model performance); the second contains only the best-performing model from each algorithm class/family (optimized for production use). The run below uses H2O 3.28.0.1, which also attempts XGBoost models when XGBoost is available.

In [1]:
import h2o
from h2o.automl import H2OAutoML
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.
Attempting to start a local H2O server...
; OpenJDK 64-Bit Server VM AdoptOpenJDK (build 11.0.2+9, mixed mode)
  Starting server from D:\Anaconda\lib\site-packages\h2o\backend\bin\h2o.jar
  Ice root: C:\Users\ERIC~1.YUA\AppData\Local\Temp\tmp5greonik
  JVM stdout: C:\Users\ERIC~1.YUA\AppData\Local\Temp\tmp5greonik\h2o_eric_yuan_started_from_python.out
  JVM stderr: C:\Users\ERIC~1.YUA\AppData\Local\Temp\tmp5greonik\h2o_eric_yuan_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.


| | |
|---|---|
| H2O cluster uptime: | 01 secs |
| H2O cluster timezone: | America/New_York |
| H2O data parsing timezone: | UTC |
| H2O cluster version: | 3.28.0.1 |
| H2O cluster version age: | 30 days |
| H2O cluster name: | H2O_from_python_eric_yuan_d19m8r |
| H2O cluster total nodes: | 1 |
| H2O cluster free memory: | 3.975 Gb |
| H2O cluster total cores: | 8 |
| H2O cluster allowed cores: | 8 |


### Classification

In [8]:
df = h2o.import_file('loan.csv')
df['bad_loan'] = df['bad_loan'].asfactor()

Parse progress: |█████████████████████████████████████████████████████████| 100%


In [9]:
y = 'bad_loan'
x = list(df.columns)
# remove the response
x.remove(y)
# remove the interest rate column because it's correlated with the outcome
x.remove('int_rate')

* Run the model

The `max_models` argument specifies the number of individual (or "base") models. It does not count the two ensemble models that are trained at the end.

In [10]:
aml = H2OAutoML(max_models = 10, seed = 1)
aml.train(x = x, y = y, training_frame = df)

AutoML progress: |
14:41:47.722: AutoML: XGBoost is not available; skipping it.

████████████████████████████████████████████████████████| 100%


* Leaderboard

In [11]:
lb = aml.leaderboard
# set rows = lb.nrows to make sure we are viewing the whole leaderboard
lb.head(rows = lb.nrows)

| model_id | auc | logloss | aucpr | mean_per_class_error | rmse | mse |
|---|---|---|---|---|---|---|
| StackedEnsemble_AllModels_AutoML_20200116_144147 | 0.685387 | 0.444791 | 0.325564 | 0.366704 | 0.37394 | 0.139831 |
| StackedEnsemble_BestOfFamily_AutoML_20200116_144147 | 0.684397 | 0.445114 | 0.324195 | 0.366049 | 0.37409 | 0.139943 |
| GBM_1_AutoML_20200116_144147 | 0.68269 | 0.444099 | 0.321073 | 0.368341 | 0.373826 | 0.139746 |
| GBM_2_AutoML_20200116_144147 | 0.681843 | 0.444419 | 0.320672 | 0.368599 | 0.373961 | 0.139847 |
| GBM_3_AutoML_20200116_144147 | 0.679967 | 0.445154 | 0.318167 | 0.369837 | 0.374304 | 0.140104 |
| GBM_grid__1_AutoML_20200116_144147_model_1 | 0.676778 | 0.455592 | 0.316477 | 0.372625 | 0.378487 | 0.143253 |
| GBM_5_AutoML_20200116_144147 | 0.674414 | 0.447373 | 0.313482 | 0.373524 | 0.375221 | 0.140791 |
| GLM_1_AutoML_20200116_144147 | 0.674163 | 0.447619 | 0.314141 | 0.373453 | 0.374973 | 0.140605 |
| GBM_4_AutoML_20200116_144147 | 0.674056 | 0.44777 | 0.311629 | 0.373953 | 0.375489 | 0.140992 |
| DeepLearning_1_AutoML_20200116_144147 | 0.670627 | 0.44915 | 0.309202 | 0.375012 | 0.375728 | 0.141172 |
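
The "best of family" idea described in the introduction can be illustrated on the leaderboard rows shown above. The following is a minimal sketch in plain Python, not H2O's implementation: the helper names `family_of` and `best_of_family` are hypothetical, the algorithm family is inferred from the `model_id` prefix (an assumption made only for illustration), and only the rows visible in the output are used.

```python
# Leaderboard rows copied from the output above: (model_id, auc)
leaderboard = [
    ("StackedEnsemble_AllModels_AutoML_20200116_144147", 0.685387),
    ("StackedEnsemble_BestOfFamily_AutoML_20200116_144147", 0.684397),
    ("GBM_1_AutoML_20200116_144147", 0.68269),
    ("GBM_2_AutoML_20200116_144147", 0.681843),
    ("GBM_3_AutoML_20200116_144147", 0.679967),
    ("GBM_grid__1_AutoML_20200116_144147_model_1", 0.676778),
    ("GBM_5_AutoML_20200116_144147", 0.674414),
    ("GLM_1_AutoML_20200116_144147", 0.674163),
    ("GBM_4_AutoML_20200116_144147", 0.674056),
    ("DeepLearning_1_AutoML_20200116_144147", 0.670627),
]


def family_of(model_id):
    """Infer the algorithm family from the model_id prefix (illustrative assumption)."""
    return model_id.split("_")[0]


def best_of_family(rows):
    """Return the highest-AUC base model per family, skipping the ensembles."""
    best = {}
    for model_id, auc in rows:
        family = family_of(model_id)
        if family == "StackedEnsemble":
            continue  # ensembles are built from base models, not candidates themselves
        if family not in best or auc > best[family][1]:
            best[family] = (model_id, auc)
    return best


# Rank by AUC, highest first, as in the leaderboard shown above
ranked = sorted(leaderboard, key=lambda row: row[1], reverse=True)
picks = best_of_family(ranked)

for family, (model_id, auc) in sorted(picks.items()):
    print(f"{family}: {model_id} (AUC {auc})")
```

With these rows, one model is kept per family, so the four lower-ranked GBMs would be left out of a best-of-family ensemble.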




* Ensemble Exploration

In [12]:
# Get model ids for all models in the AutoML Leaderboard
model_ids = list(aml.leaderboard['model_id'].as_data_frame().iloc[:,0])
# Get the "All Models" Stacked Ensemble model
se = h2o.get_model([mid for mid in model_ids if "StackedEnsemble_AllModels" in mid][0])
# Get the Stacked Ensemble metalearner model
metalearner = h2o.get_model(se.metalearner()['name'])

Examine the metalearner (combiner) algorithm in the ensemble to see how much each base learner contributes. The AutoML Stacked Ensembles use the default metalearner algorithm (a GLM with non-negative weights), so the metalearner's variable importance is the magnitude of its standardized GLM coefficients. Note that `coef()`, used below, returns the coefficients on the original scale; `coef_norm()` returns the standardized ones.

In [25]:
metalearner.coef()

{'Intercept': -2.862149139561866,
 'GBM_1_AutoML_20200116_144147': 1.0377256312897167,
 'GBM_2_AutoML_20200116_144147': 0.8788526764811525,
 'GBM_3_AutoML_20200116_144147': 0.5592781916044135,
 'GBM_grid__1_AutoML_20200116_144147_model_1': 1.0173414987228273,
 'GBM_5_AutoML_20200116_144147': 0.38183683443236327,
 'GLM_1_AutoML_20200116_144147': 0.7096711102694062,
 'GBM_4_AutoML_20200116_144147': 0.3670508256343875,
 'DeepLearning_1_AutoML_20200116_144147': 1.0680026153829598,
 'XRT_1_AutoML_20200116_144147': 0.5070725960843376,
 'DRF_1_AutoML_20200116_144147': 0.3593000898694801}
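
To make the role of these coefficients concrete, the sketch below combines base-model predictions the way a binomial GLM with a logit link would: a weighted sum plus the intercept, passed through the logistic function. It assumes the metalearner's inputs are each base model's predicted probability of the positive class; the function name `ensemble_probability` and the base predictions are hypothetical, and the coefficients are copied from the output above.

```python
import math

# Coefficients copied from the metalearner.coef() output above
coefs = {
    "Intercept": -2.862149139561866,
    "GBM_1_AutoML_20200116_144147": 1.0377256312897167,
    "GBM_2_AutoML_20200116_144147": 0.8788526764811525,
    "GBM_3_AutoML_20200116_144147": 0.5592781916044135,
    "GBM_grid__1_AutoML_20200116_144147_model_1": 1.0173414987228273,
    "GBM_5_AutoML_20200116_144147": 0.38183683443236327,
    "GLM_1_AutoML_20200116_144147": 0.7096711102694062,
    "GBM_4_AutoML_20200116_144147": 0.3670508256343875,
    "DeepLearning_1_AutoML_20200116_144147": 1.0680026153829598,
    "XRT_1_AutoML_20200116_144147": 0.5070725960843376,
    "DRF_1_AutoML_20200116_144147": 0.3593000898694801,
}


def ensemble_probability(base_preds, coefs):
    """Logistic combination of base-model probabilities (assumed logit-link GLM)."""
    z = coefs["Intercept"] + sum(coefs[name] * p for name, p in base_preds.items())
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical base-model predictions for one row: every model predicts 0.2
base_preds = {name: 0.2 for name in coefs if name != "Intercept"}
p_ensemble = ensemble_probability(base_preds, coefs)
print(f"ensemble probability: {p_ensemble:.4f}")

# Rank base learners by raw coefficient, largest first
weights = sorted(
    ((name, w) for name, w in coefs.items() if name != "Intercept"),
    key=lambda item: item[1],
    reverse=True,
)
for name, w in weights:
    print(f"{w:.4f}  {name}")
```

Because every weight is non-negative, raising any single base model's prediction can only raise the ensemble's output. The ranking here uses raw coefficients, so it is only a rough guide to relative contribution.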

### Regression

In [32]:
data_path = "https://github.com/h2oai/h2o-tutorials/raw/master/h2o-world-2017/automl/data/powerplant_output.csv"
df = h2o.import_file(data_path)

Parse progress: |█████████████████████████████████████████████████████████| 100%


In [34]:
df.describe()

Rows: 9568
Cols: 5

| | TemperatureCelcius | ExhaustVacuumHg | AmbientPressureMillibar | RelativeHumidity | HourlyEnergyOutputMW |
|---|---|---|---|---|---|
| type | real | real | real | real | real |
| mins | 1.81 | 25.36 | 992.89 | 25.56 | 420.26 |
| mean | 19.651231187290968 | 54.30580372073578 | 1013.2590781772575 | 73.30897784280936 | 454.36500940635455 |
| maxs | 37.11 | 81.56 | 1033.3 | 100.16 | 495.76 |
| sigma | 7.452473229611079 | 12.707892998326809 | 5.938783705811604 | 14.600268756728953 | 17.066994999803416 |
| zeros | 0 | 0 | 0 | 0 | 0 |
| missing | 0 | 0 | 0 | 0 | 0 |
| 0 | 14.96 | 41.76 | 1024.07 | 73.17 | 463.26 |
| 1 | 25.18 | 62.96 | 1020.04 | 59.08 | 444.37 |
| 2 | 5.11 | 39.4 | 1012.16 | 92.14 | 488.56 |


In [35]:
y = "HourlyEnergyOutputMW"

In [36]:
# Hold out roughly 20% of the rows as a test set (split_frame ratios are approximate)
splits = df.split_frame(ratios = [0.8], seed = 1)
train = splits[0]
test = splits[1]

In [37]:
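# Cap the AutoML run at 60 seconds instead of a fixed number of models
aml = H2OAutoML(max_runtime_secs = 60, seed = 1, project_name = "powerplant_lb_frame")
# x is omitted, so every column except y is used as a predictor;
# the leaderboard is scored on the held-out test frame
aml.train(y = y, training_frame = train, leaderboard_frame = test)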

AutoML progress: |
14:58:35.810: AutoML: XGBoost is not available; skipping it.

██████████████████████████████████████████████████Failed polling AutoML progress log: HTTP 500 Server Error:
Server error java.lang.ArrayIndexOutOfBoundsException:
  Error: Index 72 out of bounds for length 72
  Request: None

██████| 100%


In [38]:
aml.leaderboard.head()

| model_id | mean_residual_deviance | rmse | mse | mae | rmsle |
|---|---|---|---|---|---|
| GBM_4_AutoML_20200116_145835 | 10.4855 | 3.23813 | 10.4855 | 2.24599 | 0.00713297 |
| GBM_3_AutoML_20200116_145835 | 11.1887 | 3.34496 | 11.1887 | 2.35355 | 0.00736291 |
| GBM_grid__1_AutoML_20200116_145835_model_1 | 11.3694 | 3.37186 | 11.3694 | 2.36821 | 0.00742649 |
| GBM_2_AutoML_20200116_145835 | 11.4439 | 3.38288 | 11.4439 | 2.39635 | 0.0074478 |
| GBM_1_AutoML_20200116_145835 | 11.8546 | 3.44305 | 11.8546 | 2.40537 | 0.00758284 |
| DRF_1_AutoML_20200116_145835 | 12.1175 | 3.48102 | 12.1175 | 2.45802 | 0.0076833 |
| XRT_1_AutoML_20200116_145835 | 12.1331 | 3.48327 | 12.1331 | 2.464 | 0.00768583 |
| GBM_grid__1_AutoML_20200116_145835_model_7 | 12.7733 | 3.57398 | 12.7733 | 2.57113 | 0.00787061 |
| GBM_5_AutoML_20200116_145835 | 13.1665 | 3.62857 | 13.1665 | 2.62643 | 0.0079734 |
| GBM_grid__1_AutoML_20200116_145835_model_5 | 13.7594 | 3.70937 | 13.7594 | 2.70828 | 0.00815945 |




In [39]:
# predict() on the AutoML object scores the frame with the leader model
pred = aml.predict(test)
pred.head()

gbm prediction progress: |████████████████████████████████████████████████| 100%


predict
486.333
473.891
466.384
452.318
447.87
469.437
442.477
464.248
442.807
431.665




In [40]:
# Evaluate the leader model on the held-out test set
perf = aml.leader.model_performance(test)
perf


ModelMetricsRegression: gbm
** Reported on test data. **

MSE: 10.48551445955257
RMSE: 3.238134410359238
MAE: 2.2459942869807
RMSLE: 0.007132968640986882
Mean Residual Deviance: 10.48551445955257
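
The metrics reported above are related in simple ways: RMSE is the square root of MSE, and in this output the mean residual deviance equals the MSE. As a rough sketch of how such metrics are computed, the plain-Python function below (the name `regression_metrics` and the toy data are hypothetical, and it uses the common textbook definitions rather than H2O's source) reproduces them for a small example and checks the reported RMSE against the reported MSE.

```python
import math


def regression_metrics(actual, predicted):
    """Compute MSE, RMSE, MAE and RMSLE using the usual textbook definitions."""
    n = len(actual)
    errors = [p - a for a, p in zip(actual, predicted)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    # RMSLE compares log(1 + value), so it penalizes relative rather than absolute error
    sq_log_errors = [
        (math.log1p(p) - math.log1p(a)) ** 2 for a, p in zip(actual, predicted)
    ]
    rmsle = math.sqrt(sum(sq_log_errors) / n)
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae, "rmsle": rmsle}


# Hypothetical toy data, only to exercise the function
toy = regression_metrics(actual=[3.0, 5.0], predicted=[1.0, 5.0])
print(toy)

# Values copied from the performance report above
reported_mse = 10.48551445955257
reported_rmse = 3.238134410359238
print(math.isclose(math.sqrt(reported_mse), reported_rmse, rel_tol=1e-9))
```

The small RMSLE in the report follows from the scale of the target: an absolute error of a few MW is tiny relative to outputs between roughly 420 and 496 MW.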


