### Ensemble Model

* H2O initialization

In [2]:
import h2o
from h2o.estimators.random_forest import H2ORandomForestEstimator
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator
from h2o.grid.grid_search import H2OGridSearch
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.
Attempting to start a local H2O server...
; OpenJDK 64-Bit Server VM AdoptOpenJDK (build 11.0.2+9, mixed mode)
  Starting server from D:\Anaconda\lib\site-packages\h2o\backend\bin\h2o.jar
  Ice root: C:\Users\ERIC~1.YUA\AppData\Local\Temp\tmp5w_uyf5n
  JVM stdout: C:\Users\ERIC~1.YUA\AppData\Local\Temp\tmp5w_uyf5n\h2o_eric_yuan_started_from_python.out
  JVM stderr: C:\Users\ERIC~1.YUA\AppData\Local\Temp\tmp5w_uyf5n\h2o_eric_yuan_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.


| Property | Value |
|---|---|
| H2O cluster uptime | 01 secs |
| H2O cluster timezone | America/New_York |
| H2O data parsing timezone | UTC |
| H2O cluster version | 3.28.0.1 |
| H2O cluster version age | 30 days |
| H2O cluster name | H2O_from_python_eric_yuan_y582bk |
| H2O cluster total nodes | 1 |
| H2O cluster free memory | 3.975 Gb |
| H2O cluster total cores | 8 |
| H2O cluster allowed cores | 8 |


In [3]:
# Import a sample binary outcome train/test set into H2O
train = h2o.import_file("https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")
test = h2o.import_file("https://s3.amazonaws.com/erin-data/higgs/higgs_test_5k.csv")

Parse progress: |█████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%


In [4]:
# Identify predictors and response
x = train.columns
y = "response"
x.remove(y)

# For binary classification, response should be a factor
train[y] = train[y].asfactor()
test[y] = test[y].asfactor()

* training model

In [5]:
# There are a few ways to assemble a list of models to stack together:
# 1. Train individual models and put them in a list
# 2. Train a grid of models
# 3. Train several grids of models
# Note: All base models must have the same cross-validation folds and
# the cross-validated predicted values must be kept.

# Number of CV folds (to generate level-one data for stacking)
nfolds = 5

# Train and cross-validate a GBM
my_gbm = H2OGradientBoostingEstimator(distribution="bernoulli",
                                      ntrees=10,
                                      max_depth=3,
                                      min_rows=2,
                                      learn_rate=0.2,
                                      nfolds=nfolds,
                                      fold_assignment="Modulo",
                                      keep_cross_validation_predictions=True,
                                      seed=1)
my_gbm.train(x=x, y=y, training_frame=train)

# Train and cross-validate a RF
my_rf = H2ORandomForestEstimator(ntrees=50,
                                 nfolds=nfolds,
                                 fold_assignment="Modulo",
                                 keep_cross_validation_predictions=True,
                                 seed=1)
my_rf.train(x=x, y=y, training_frame=train)

gbm Model Build progress: |███████████████████████████████████████████████| 100%
drf Model Build progress: |███████████████████████████████████████████████| 100%
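
Both base models use `fold_assignment="Modulo"`, which is what keeps their cross-validation folds identical: row `i` goes to fold `i % nfolds`, independent of the seed. A minimal plain-Python sketch of that rule follows; the helper name `modulo_folds` is hypothetical and not part of the H2O API.

```python
from collections import Counter

def modulo_folds(n_rows, nfolds):
    """Assign row i to fold i % nfolds (illustrative helper, not an H2O API)."""
    return [i % nfolds for i in range(n_rows)]

# Two models that apply the same rule hold out exactly the same rows per fold
folds_gbm = modulo_folds(10, 5)
folds_rf = modulo_folds(10, 5)
print(folds_gbm)
print(folds_gbm == folds_rf)

# The folds are also evenly sized when the row count is a multiple of nfolds
print(Counter(folds_gbm))
```

With a random fold assignment and different seeds, the two models would hold out different rows, and their cross-validated predictions could not be lined up row by row for stacking.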


* ensemble

In [7]:
# Train a stacked ensemble using the GBM and RF above
ensemble = H2OStackedEnsembleEstimator(model_id = "my_ensemble_binomial",
                                       base_models = [my_gbm, my_rf])
ensemble.train(x=x, y=y, training_frame=train)

stackedensemble Model Build progress: |███████████████████████████████████| 100%
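
Conceptually, the metalearner of a stacked ensemble is trained on the base models' cross-validated (out-of-fold) predictions, the "level-one data" mentioned in the training cell. The sketch below shows how such data can be assembled in plain Python. The toy learners, the data, and all helper names are hypothetical; this is not how H2O implements it internally.

```python
def out_of_fold_predictions(fit, xs, ys, nfolds):
    """Return one prediction per row, each made by a model that never saw that row.

    fit(train_xs, train_ys) must return a predict(x) function.
    Rows are assigned to folds with the modulo rule (row i -> fold i % nfolds).
    """
    oof = [None] * len(xs)
    for k in range(nfolds):
        train_idx = [i for i in range(len(xs)) if i % nfolds != k]
        predict = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        for i in range(len(xs)):
            if i % nfolds == k:
                oof[i] = predict(xs[i])
    return oof

def fit_mean(train_xs, train_ys):
    """Toy base learner: always predicts the mean of the training labels."""
    mean = sum(train_ys) / len(train_ys)
    return lambda x: mean

def fit_step(train_xs, train_ys):
    """Toy base learner: predicts the label mean on each side of the median x."""
    cut = sorted(train_xs)[len(train_xs) // 2]
    low = [y for x, y in zip(train_xs, train_ys) if x < cut]
    high = [y for x, y in zip(train_xs, train_ys) if x >= cut]
    low_mean = sum(low) / len(low) if low else 0.5
    high_mean = sum(high) / len(high) if high else 0.5
    return lambda x: high_mean if x >= cut else low_mean

# Made-up toy data
xs = [0, 1, 2, 3, 4, 5]
ys = [0, 0, 0, 1, 1, 1]

oof_mean = out_of_fold_predictions(fit_mean, xs, ys, nfolds=2)
oof_step = out_of_fold_predictions(fit_step, xs, ys, nfolds=2)

# Level-one data: one column per base model, one row per training row
level_one = list(zip(oof_mean, oof_step))
for row, target in zip(level_one, ys):
    print(row, target)
```

A metalearner would then be fit on `level_one` against `ys`. Because both columns come from the same fold layout, each row of `level_one` holds predictions made without access to that row's label.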


* model evaluation

In [8]:
# Eval ensemble performance on the test data
perf_stack_test = ensemble.model_performance(test)
# Compare to base learner performance on the test set
perf_gbm_test = my_gbm.model_performance(test)
perf_rf_test = my_rf.model_performance(test)
# AUC
baselearner_best_auc_test = max(perf_gbm_test.auc(), perf_rf_test.auc())
stack_auc_test = perf_stack_test.auc()
print("Best Base-learner Test AUC:  {0}".format(baselearner_best_auc_test))
print("Ensemble Test AUC:  {0}".format(stack_auc_test))

Best Base-learner Test AUC:  0.7697982150254795
Ensemble Test AUC:  0.7735371695404032
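
The AUC values compared above can be read as the probability that a randomly chosen positive row receives a higher score than a randomly chosen negative row. A minimal pairwise sketch of that definition follows, using made-up toy data; H2O's own AUC computation may differ in detail, and `auc_by_pairs` is a hypothetical helper.

```python
def auc_by_pairs(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Made-up toy data: 3 of the 4 positive/negative pairs are ordered correctly
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(auc_by_pairs(toy_labels, toy_scores))  # 0.75
```

This pairwise loop is quadratic in the number of rows, so it is only meant to illustrate the metric, not to replace `model_performance(test).auc()`.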


### Model Deployment

H2O allows you to convert the models you have built to either a Plain Old Java Object (POJO) or a Model ObJect, Optimized (MOJO).

In [22]:
import os

In [29]:
# Build the output path portably instead of hard-coding a Windows separator
modelfile = ensemble.download_mojo(path=os.path.join(os.getcwd(), "models"), get_genmodel_jar=True)
# print("Model saved to " + modelfile)

To score with the exported model from Java code, see the H2O productionizing guide: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/productionizing.html

* save and load model  
A running H2O cluster is required to save, load, and run binary models.

When saving an H2O binary model with h2o.save_model (Python), or in Flow, you will only be able to load and use that saved binary model with the same version of H2O that you used to train your model. H2O binary models are not compatible across H2O versions. If you update your H2O version, then you will need to retrain your model. For production, you can save your model as a POJO/MOJO. These artifacts are not tied to a particular version of H2O because they are just plain Java code and do not require an H2O cluster to be running.
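
Because a binary model can only be loaded by the H2O version that trained it, it can help to record that version next to the saved file. The following is a sketch of one possible approach, not an H2O feature: the helper names and the `.version.json` sidecar convention are assumptions made for illustration, and the demo uses an empty placeholder file in place of a real saved model.

```python
import json
import os
import tempfile

def write_version_sidecar(model_path, h2o_version):
    """Store the H2O version next to a saved binary model (hypothetical helper)."""
    sidecar = model_path + ".version.json"
    with open(sidecar, "w") as f:
        json.dump({"h2o_version": h2o_version}, f)
    return sidecar

def saved_with_same_version(model_path, running_version):
    """Return True only if the recorded version matches the running one exactly."""
    with open(model_path + ".version.json") as f:
        recorded = json.load(f)["h2o_version"]
    return recorded == running_version

# Demonstration with a placeholder file standing in for a saved model
demo_dir = tempfile.mkdtemp()
demo_model_path = os.path.join(demo_dir, "my_ensemble_binomial")
open(demo_model_path, "w").close()

write_version_sidecar(demo_model_path, "3.28.0.1")
print(saved_with_same_version(demo_model_path, "3.28.0.1"))  # True
print(saved_with_same_version(demo_model_path, "3.30.0.1"))  # False
```

In a live session, the path returned by `h2o.save_model` and the string in `h2o.__version__` would be passed in instead of the placeholder values.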

In [30]:
# save the model
model_path = h2o.save_model(model = ensemble, force = True)
# print(model_path)
saved_model = h2o.load_model(model_path)

In [31]:
saved_model.predict(test)

stackedensemble prediction progress: |████████████████████████████████████| 100%


| predict | p0 | p1 |
|---|---|---|
| 0 | 0.666194 | 0.333806 |
| 1 | 0.582765 | 0.417235 |
| 1 | 0.60526 | 0.39474 |
| 1 | 0.19191 | 0.80809 |
| 1 | 0.45349 | 0.54651 |
| 1 | 0.314965 | 0.685035 |
| 1 | 0.282787 | 0.717213 |
| 0 | 0.662572 | 0.337428 |
| 0 | 0.711472 | 0.288528 |
| 1 | 0.603612 | 0.396388 |
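
Note that several rows are labeled `1` even though `p1` is below 0.5 (for example 0.417235), so the `predict` column is not a plain 0.5 cut on `p1`. For binomial models, H2O typically derives the label from a model-selected threshold (by default the one that maximizes F1), but the value used here does not appear in the output. The sketch below only checks that some threshold reproduces these ten rows; `0.35` is an assumed value chosen for illustration.

```python
# p1 values and labels copied from the prediction output above
p1 = [0.333806, 0.417235, 0.39474, 0.80809, 0.54651,
      0.685035, 0.717213, 0.337428, 0.288528, 0.396388]
predicted = [0, 1, 1, 1, 1, 1, 1, 0, 0, 1]

def label_from_p1(p, threshold):
    """Label a row as class 1 when its class-1 probability reaches the threshold."""
    return 1 if p >= threshold else 0

# Assumed threshold: the value H2O actually used is not shown in the output
assumed_threshold = 0.35

labels = [label_from_p1(p, assumed_threshold) for p in p1]
print(labels == predicted)                                # True
print([label_from_p1(p, 0.5) for p in p1] == predicted)   # False
```

Any threshold between 0.337428 and 0.39474 would reproduce these ten labels, so the rows shown only bound the threshold rather than pin it down.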




In [33]:
# h2o.shutdown() is deprecated; shut the cluster down through the cluster object
h2o.cluster().shutdown()



H2O session _sid_95c3 closed.
