## Setting up H2O AutoML

In [1]:
! apt-get install default-jre
!java -version

Reading package lists... Done
Building dependency tree       
Reading state information... Done
default-jre is already the newest version (2:1.11-68ubuntu1~18.04.1).
default-jre set to manually installed.
0 upgraded, 0 newly installed, 0 to remove and 16 not upgraded.
openjdk version "11.0.9.1" 2020-11-04
OpenJDK Runtime Environment (build 11.0.9.1+1-Ubuntu-0ubuntu1.18.04)
OpenJDK 64-Bit Server VM (build 11.0.9.1+1-Ubuntu-0ubuntu1.18.04, mixed mode, sharing)


In [2]:
! pip install h2o

Collecting h2o
  Downloading https://files.pythonhosted.org/packages/3b/d1/9edb2359afa29049bb6e75f9a673bb68c897e7f3e6238ffb77ba9eddf04b/h2o-3.32.0.3.tar.gz (164.6MB)
     |████████████████████████████████| 164.6MB 80kB/s 
Collecting colorama>=0.3.8
  Downloading https://files.pythonhosted.org/packages/44/98/5b86278fbbf250d239ae0ecb724f8572af1c91f4a11edf4d36a206189440/colorama-0.4.4-py2.py3-none-any.whl
Building wheels for collected packages: h2o
  Building wheel for h2o (setup.py) ... done
  Created wheel for h2o: filename=h2o-3.32.0.3-py2.py3-none-any.whl size=164649662 sha256=68c6cf623e5c66454e425b2edf580e697656289bc600bc5c60bffdbbfab1f9ac
  Stored in directory: /root/.cache/pip/wheels/0a/fd/63/96d322a27867a81a2904172a75aed5241913d603a4b8c4b277
Successfully built h2o
Installing collected packages: colorama, h2o
Successfully installed colorama-0.4.4 h2o-3.32.0.3


In [3]:
import h2o
from h2o.automl import H2OAutoML
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.
Attempting to start a local H2O server...
  Java Version: openjdk version "11.0.9.1" 2020-11-04; OpenJDK Runtime Environment (build 11.0.9.1+1-Ubuntu-0ubuntu1.18.04); OpenJDK 64-Bit Server VM (build 11.0.9.1+1-Ubuntu-0ubuntu1.18.04, mixed mode, sharing)
  Starting server from /usr/local/lib/python3.6/dist-packages/h2o/backend/bin/h2o.jar
  Ice root: /tmp/tmpliscoi3q
  JVM stdout: /tmp/tmpliscoi3q/h2o_unknownUser_started_from_python.out
  JVM stderr: /tmp/tmpliscoi3q/h2o_unknownUser_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.


| Property | Value |
|---|---|
| H2O_cluster_uptime | 03 secs |
| H2O_cluster_timezone | Etc/UTC |
| H2O_data_parsing_timezone | UTC |
| H2O_cluster_version | 3.32.0.3 |
| H2O_cluster_version_age | 22 days |
| H2O_cluster_name | H2O_from_python_unknownUser_m0hbxk |
| H2O_cluster_total_nodes | 1 |
| H2O_cluster_free_memory | 3.180 Gb |
| H2O_cluster_total_cores | 2 |
| H2O_cluster_allowed_cores | 2 |


### Load Data

In [4]:
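# Use the local copy of the ratings data.
# Note: the original fallback branch pointed at the same file name,
# so no remote download actually happens in this notebook.
import os

data_path = "ratings_new.csv"
if not os.path.isfile(data_path):
    raise FileNotFoundError(f"Expected data file not found: {data_path}")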


# Load data into H2O
df = h2o.import_file(data_path)

Parse progress: |█████████████████████████████████████████████████████████| 100%


Let's take a look at the data.

In [5]:
df.describe()

Rows: 100836
Cols: 3

| | userId | movieId | rating |
|---|---|---|---|
| type | int | int | real |
| mins | 1.0 | 1.0 | 0.5 |
| mean | 326.12756356856625 | 19435.29571779919 | 3.5015569836169598 |
| maxs | 610.0 | 193609.0 | 5.0 |
| sigma | 182.618491463499 | 35530.98719870016 | 1.0425292390606347 |
| zeros | 0 | 0 | 0 |
| missing | 0 | 0 | 0 |
| 0 | 1.0 | 1.0 | 4.0 |
| 1 | 1.0 | 3.0 | 4.0 |
| 2 | 1.0 | 6.0 | 4.0 |


Identify the response column and save the column name as `y`.

In [6]:
y = "rating"

Lastly, let's split the data into two frames: a `train` frame (roughly 80% of the rows) and a `test` frame (roughly 20%). The `test` frame will be used to score the leaderboard and to demonstrate how to generate predictions using an AutoML leader model.

In [7]:
splits = df.split_frame(ratios = [0.8], seed = 1)
train = splits[0]
test = splits[1]
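
`split_frame` assigns rows at random, so the resulting frames are close to, but usually not exactly, the requested sizes. The toy sketch below illustrates that idea using only the standard library. It is not H2O's implementation, and `approximate_split` is a hypothetical helper introduced here for illustration.

```python
import random


def approximate_split(n_rows, train_ratio, seed):
    """Assign each row index to train or test with an independent random draw."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for i in range(n_rows):
        # Each row lands in train with probability train_ratio.
        if rng.random() < train_ratio:
            train_idx.append(i)
        else:
            test_idx.append(i)
    return train_idx, test_idx


# 100836 is the row count reported by df.describe() above.
train_idx, test_idx = approximate_split(n_rows=100836, train_ratio=0.8, seed=1)
print(len(train_idx), len(test_idx))
print(round(len(test_idx) / 100836, 4))  # near 0.2, rarely exactly 0.2
```

In the notebook itself, `train.nrow` and `test.nrow` report the actual frame sizes.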

## Run AutoML 

Run AutoML, stopping after 60 seconds. The `max_runtime_secs` argument limits the AutoML run by time. With a time-limited stopping criterion, the number of models trained will vary between runs: on different hardware, or on the same machine with different available compute resources, AutoML may be able to train more models in one run than in another.
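
As a rough illustration of why a time budget makes the model count hardware-dependent, here is a toy simulation. It is not how H2O schedules its work: `models_trained`, the per-model costs, and the `speedup` factor are all hypothetical values chosen for the example.

```python
def models_trained(budget_secs, model_costs_secs, speedup=1.0):
    """Count how many models finish, in order, before the time budget runs out."""
    elapsed = 0.0
    count = 0
    for cost in model_costs_secs:
        actual = cost / speedup  # faster hardware shortens every training job
        if elapsed + actual > budget_secs:
            break
        elapsed += actual
        count += 1
    return count


# Hypothetical per-model training times in seconds.
costs = [5, 10, 15, 20, 25, 30]

slow = models_trained(60, costs)               # baseline machine
fast = models_trained(60, costs, speedup=2.0)  # machine twice as fast
print(slow, fast)
```

With the same 60-second budget, the faster machine completes more of the queue, so its leaderboard contains more models.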

The `test` frame is passed explicitly to the `leaderboard_frame` argument here, which means that instead of using cross-validated metrics, we use test set metrics for generating the leaderboard.

In [8]:
aml = H2OAutoML(max_runtime_secs = 60, seed = 1, project_name = "ml100k_lb_frame")
aml.train(y = y, training_frame = train, leaderboard_frame = test)

AutoML progress: |████████████████████████████████████████████████████████| 100%


In [9]:
aml2 = H2OAutoML(max_runtime_secs = 60, seed = 1, project_name = "ml100k_full_data")
aml2.train(y = y, training_frame = df)

AutoML progress: |████████████████████████████████████████████████████████| 100%


*Note: We specify a `project_name` here for clarity.*

## Leaderboard

In [10]:
aml.leaderboard.head()

| model_id | mean_residual_deviance | rmse | mse | mae | rmsle |
|---|---|---|---|---|---|
| StackedEnsemble_AllModels_AutoML_20210115_191607 | 0.858878 | 0.926757 | 0.858878 | 0.726354 | 0.247382 |
| StackedEnsemble_BestOfFamily_AutoML_20210115_191607 | 0.867075 | 0.931168 | 0.867075 | 0.723846 | 0.24995 |
| XGBoost_grid__1_AutoML_20210115_191607_model_1 | 0.877164 | 0.93657 | 0.877164 | 0.731772 | 0.251504 |
| GBM_4_AutoML_20210115_191607 | 1.03747 | 1.01856 | 1.03747 | 0.811097 | 0.271089 |
| GBM_3_AutoML_20210115_191607 | 1.04534 | 1.02242 | 1.04534 | 0.814428 | 0.271873 |
| GBM_5_AutoML_20210115_191607 | 1.05093 | 1.02515 | 1.05093 | 0.816439 | 0.272812 |
| GBM_grid__1_AutoML_20210115_191607_model_1 | 1.05216 | 1.02575 | 1.05216 | 0.816484 | 0.272803 |
| GBM_2_AutoML_20210115_191607 | 1.05791 | 1.02855 | 1.05791 | 0.818203 | 0.273351 |
| GBM_1_AutoML_20210115_191607 | 1.07241 | 1.03557 | 1.07241 | 0.823277 | 0.274831 |
| GLM_1_AutoML_20210115_191607 | 1.08146 | 1.03993 | 1.08146 | 0.827454 | 0.275764 |
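
The rows above appear in ascending order of `mean_residual_deviance`, and the ranking can change if a different metric is used. The sketch below copies the top three rows from this leaderboard and picks the best model under two metrics. It uses only the standard library, and `best_by` is a hypothetical helper, not part of the H2O API.

```python
import csv
import io

# Top three rows copied from the leaderboard output above.
leaderboard_csv = """model_id,mean_residual_deviance,rmse,mse,mae,rmsle
StackedEnsemble_AllModels_AutoML_20210115_191607,0.858878,0.926757,0.858878,0.726354,0.247382
StackedEnsemble_BestOfFamily_AutoML_20210115_191607,0.867075,0.931168,0.867075,0.723846,0.24995
XGBoost_grid__1_AutoML_20210115_191607_model_1,0.877164,0.93657,0.877164,0.731772,0.251504
"""

rows = list(csv.DictReader(io.StringIO(leaderboard_csv)))


def best_by(metric):
    """Return the model_id with the lowest value of the given metric."""
    return min(rows, key=lambda r: float(r[metric]))["model_id"]


print(best_by("rmse"))  # the AllModels ensemble has the lowest RMSE
print(best_by("mae"))   # the BestOfFamily ensemble has the lowest MAE
```

Here the two ensembles swap places depending on whether RMSE or MAE is the criterion, which is worth keeping in mind when reading a leaderboard.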




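The leaderboard for the first run (`aml`), shown above, is a snapshot of the top models scored on the `test` frame. The two Stacked Ensembles sit at the top; Stacked Ensembles can almost always outperform a single model. The next cell shows the leaderboard for the second run (`aml2`), which was trained on the full data without a `leaderboard_frame` and is therefore ranked on cross-validated metrics.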

In [11]:
aml2.leaderboard.head()

| model_id | mean_residual_deviance | rmse | mse | mae | rmsle |
|---|---|---|---|---|---|
| StackedEnsemble_AllModels_AutoML_20210115_191757 | 0.850114 | 0.922016 | 0.850114 | 0.71688 | 0.247793 |
| StackedEnsemble_BestOfFamily_AutoML_20210115_191757 | 0.863591 | 0.929296 | 0.863591 | 0.722773 | 0.2496 |
| XGBoost_grid__1_AutoML_20210115_191757_model_1 | 0.872988 | 0.934338 | 0.872988 | 0.730253 | 0.251222 |
| GBM_4_AutoML_20210115_191757 | 1.03387 | 1.0168 | 1.03387 | 0.808668 | 0.27126 |
| GBM_grid__1_AutoML_20210115_191757_model_1 | 1.05042 | 1.0249 | 1.05042 | 0.815446 | 0.273121 |
| GBM_1_AutoML_20210115_191757 | 1.05739 | 1.02829 | 1.05739 | 0.817521 | 0.273793 |
| GBM_5_AutoML_20210115_191757 | 1.06057 | 1.02984 | 1.06057 | 0.819031 | 0.274274 |
| GBM_3_AutoML_20210115_191757 | 1.06242 | 1.03074 | 1.06242 | 0.81946 | 0.274269 |
| GBM_2_AutoML_20210115_191757 | 1.0743 | 1.03648 | 1.0743 | 0.823384 | 0.275606 |
| GLM_1_AutoML_20210115_191757 | 1.08424 | 1.04127 | 1.08424 | 0.830256 | 0.276578 |
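
Both leaderboards place the Stacked Ensembles first. A toy sketch of why combining models can help: when two base models make errors in different directions, a weighted blend can beat either one. This is only an illustration with made-up numbers. H2O's Stacked Ensemble trains a metalearner on cross-validated base-model predictions, whereas this sketch fits a single blending weight on the same data it is scored on.

```python
def mse(preds, targets):
    """Mean squared error between two equal-length lists."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)


# Hypothetical targets and base-model predictions (not taken from the notebook).
targets = [4.0, 3.0, 5.0, 2.0, 4.5]
model_a = [3.5, 3.4, 4.2, 2.9, 4.0]
model_b = [4.6, 2.5, 5.0, 1.2, 4.9]


def blend(w):
    """Weighted average of the two base models' predictions."""
    return [w * a + (1 - w) * b for a, b in zip(model_a, model_b)]


# Grid-search the blending weight; w=0 and w=1 recover the base models.
weights = [i / 100 for i in range(101)]
best_w = min(weights, key=lambda w: mse(blend(w), targets))

ensemble_mse = mse(blend(best_w), targets)
print(best_w, ensemble_mse, mse(model_a, targets), mse(model_b, targets))
```

Because the grid includes the weights 0 and 1, the blend can never score worse than the better base model on this data; whether it generalizes is a separate question, which is why real stacking relies on held-out or cross-validated predictions.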




## Predict Using Leader Model

If you need to generate predictions on a test set, you can call `predict()` on the `H2OAutoML` object directly, or on the leader model object (`aml.leader`).

In [12]:
pred = aml.predict(test)
pred.head()

stackedensemble prediction progress: |████████████████████████████████████| 100%


predict
4.24333
3.9449
4.31107
4.0852
4.02601
4.38628
4.20321
4.4545
4.38827
4.52566




If needed, the standard `model_performance()` method can be applied to the AutoML leader model and a test set to generate an H2O model performance object.

In [13]:
perf = aml.leader.model_performance(test)
perf


ModelMetricsRegressionGLM: stackedensemble
** Reported on test data. **

MSE: 0.8588780678874272
RMSE: 0.9267567468799065
MAE: 0.7263537121139809
RMSLE: 0.24738164203141583
R^2: 0.20690125290850947
Mean Residual Deviance: 0.8588780678874272
Null degrees of freedom: 20192
Residual degrees of freedom: 20186
Null deviance: 21868.68751719775
Residual deviance: 17343.324824850817
AIC: 54249.32456439552
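
Several of the reported figures are related to each other, which makes for a quick sanity check. The sketch below uses the values printed above and assumes that the null degrees of freedom equal the number of test rows minus one; under that assumption, RMSE is the square root of MSE, and the residual deviance divided by the row count matches the MSE (and the mean residual deviance).

```python
import math

# Values copied from the model performance output above.
mse = 0.8588780678874272
rmse = 0.9267567468799065
residual_deviance = 17343.324824850817
null_dof = 20192

# Assumption: null degrees of freedom = number of test rows - 1.
n_rows = null_dof + 1

rmse_matches = math.isclose(math.sqrt(mse), rmse, rel_tol=1e-9)
deviance_matches = math.isclose(residual_deviance / n_rows, mse, rel_tol=1e-9)
print(n_rows, rmse_matches, deviance_matches)
```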


