https://github.com/h2oai/h2o-tutorials/blob/master/h2o-world-2017/automl/Python/automl_regression_powerplant_output.ipynb

# H2O AutoML Regression Demo

This is a Jupyter Notebook. When you execute code within the notebook, the results appear beneath the code. To execute a code chunk, place your cursor on the cell and press Shift+Enter.

## Start H2O

Import the `h2o` Python module and `H2OAutoML` class and initialize a local H2O cluster.

In [1]:
import h2o
from h2o.automl import H2OAutoML

In [2]:
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.
Attempting to start a local H2O server...
  Java Version: openjdk version "1.8.0_362"; OpenJDK Runtime Environment (build 1.8.0_362-b08); OpenJDK 64-Bit Server VM (build 25.362-b08, mixed mode)
  Starting server from /home/stever7/.local/lib/python3.9/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /tmp/tmpnccc5poq
  JVM stdout: /tmp/tmpnccc5poq/h2o_stever7_started_from_python.out
  JVM stderr: /tmp/tmpnccc5poq/h2o_stever7_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.


H2O_cluster_uptime:,01 secs
H2O_cluster_timezone:,Etc/GMT
H2O_data_parsing_timezone:,UTC
H2O_cluster_version:,3.36.0.3
H2O_cluster_version_age:,"1 year, 2 months and 29 days !!!"
H2O_cluster_name:,H2O_from_python_stever7_wdi091
H2O_cluster_total_nodes:,1
H2O_cluster_free_memory:,26.63 Gb
H2O_cluster_total_cores:,64
H2O_cluster_allowed_cores:,64


## Load Data

For the AutoML regression demo, we use the Combined Cycle Power Plant dataset:

http://archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant

The goal here is to predict the energy output (in megawatts), given the temperature, ambient pressure, relative humidity, and exhaust vacuum values. In this demo, you will use H2O's AutoML to outperform the state-of-the-art results on this task:

https://www.sciencedirect.com/science/article/pii/S0142061514000908

In [3]:
# Use local data file or download from GitHub

import os

docker_data_path = "/home/h2o/data/automl/powerplant_output.csv"

if os.path.isfile(docker_data_path):
  data_path = docker_data_path
else:
  data_path = "https://github.com/h2oai/h2o-tutorials/raw/master/h2o-world-2017/automl/data/powerplant_output.csv"

# Load data into H2O
df = h2o.import_file(data_path)

Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%


Let's take a look at the data.

In [4]:
df.describe()

Rows:9568
Cols:5




Unnamed: 0,TemperatureCelcius,ExhaustVacuumHg,AmbientPressureMillibar,RelativeHumidity,HourlyEnergyOutputMW
type,real,real,real,real,real
mins,1.81,25.36,992.89,25.56,420.26
mean,19.651231187290968,54.30580372073578,1013.2590781772575,73.30897784280934,454.36500940635443
maxs,37.11,81.56,1033.3,100.16,495.76
sigma,7.452473229611079,12.707892998326809,5.938783705811605,14.600268756728953,17.066994999803416
zeros,0,0,0,0,0
missing,0,0,0,0,0
0,14.96,41.76,1024.07,73.17,463.26
1,25.18,62.96,1020.04,59.08,444.37
2,5.11,39.4,1012.16,92.14,488.56


Next, let's identify the response column and save the column name as `y`. In this dataset, we will use all columns except the response as predictors, so we can skip setting the `x` argument explicitly.

In [5]:
y = "HourlyEnergyOutputMW"
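
If you do want to pass the predictors explicitly, one way is to build the list from the column names and drop the response. The sketch below uses the column names shown in the `describe()` output above as a plain Python list, so it runs without a cluster; with a live frame you would start from `df.columns` instead.

```python
# Column names as shown in the describe() output above.
columns = [
    "TemperatureCelcius",
    "ExhaustVacuumHg",
    "AmbientPressureMillibar",
    "RelativeHumidity",
    "HourlyEnergyOutputMW",
]
y = "HourlyEnergyOutputMW"

# Every column except the response is treated as a predictor.
x = [col for col in columns if col != y]
print(x)
```

The resulting list could then be supplied to `H2OAutoML.train()` through its `x` argument, alongside `y` and `training_frame`.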

Lastly, let's split the data into two frames, a `train` (80%) and a `test` frame (20%). The `test` frame will be used to score the leaderboard and to demonstrate how to generate predictions using an AutoML leader model.

In [6]:
splits = df.split_frame(ratios=[0.8], seed=1)
train = splits[0]
test = splits[1]
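
`split_frame()` assigns rows to the splits at random, so the resulting frames are close to, but usually not exactly, 80% and 20% of the rows. The following pure-Python sketch illustrates that idea with an independent random draw per row. It is an illustration of the concept only, not H2O's actual implementation, and the helper name `random_split` is hypothetical.

```python
import random


def random_split(n_rows, train_ratio=0.8, seed=1):
    """Assign each row index to train or test with an independent random draw."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for i in range(n_rows):
        if rng.random() < train_ratio:
            train_idx.append(i)
        else:
            test_idx.append(i)
    return train_idx, test_idx


# 9568 rows, as reported by describe() above.
train_idx, test_idx = random_split(9568)
print(len(train_idx), len(test_idx))
print(round(len(train_idx) / 9568, 3))  # close to 0.8, but rarely exactly 0.8
```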

## Run AutoML

Run AutoML, stopping after 120 seconds. The `max_runtime_secs` argument limits the AutoML run by time. With a time-based stopping criterion, the number of models trained will vary between runs: on different hardware, or on the same machine with different available compute resources, AutoML may be able to train more models in one run than in another.

The `test` frame is passed explicitly to the `leaderboard_frame` argument here, which means that instead of using cross-validated metrics, we use test set metrics for generating the leaderboard.

In [7]:
# A first attempt with max_runtime_secs=60 did not succeed in this environment. H2O reported:
#   "AutoML was not able to build any model within a max runtime constraint of 60 seconds,
#   you may want to increase this value before retrying."
# aml = H2OAutoML(max_runtime_secs=60, seed=1, project_name="powerplant_lb_frame")
aml = H2OAutoML(max_runtime_secs=120, seed=1, project_name="powerplant_lb_frame")
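
Because a time limit makes the number of trained models depend on the hardware, a run limited by model count can be easier to compare across machines. The sketch below sets the `max_models` argument of `H2OAutoML` instead of `max_runtime_secs`. The value 10 and the project name are illustrative assumptions, and the helper `build_automl` is hypothetical. The `h2o` import is deferred into the helper so that the settings can be inspected without a running cluster.

```python
# Settings for a run limited by model count (illustrative values, not from the original demo).
automl_kwargs = {
    "max_models": 10,                          # stop after this many base models
    "seed": 1,
    "project_name": "powerplant_max_models",   # hypothetical project name
}


def build_automl(kwargs):
    """Construct an H2OAutoML object from keyword arguments (requires the h2o package)."""
    from h2o.automl import H2OAutoML  # imported lazily; only needed when actually building
    return H2OAutoML(**kwargs)


print(automl_kwargs)
```

With a cluster running, `build_automl(automl_kwargs)` would return an object that can be trained in the same way as `aml` in the next cell.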

In [8]:
aml.train(y=y, training_frame=train, leaderboard_frame=test)

AutoML progress: |███████████████████████████████████████████████████████████████| (done) 100%




Model Details
H2OXGBoostEstimator :  XGBoost
Model Key:  XGBoost_grid_1_AutoML_1_20230516_164330_model_4


Model Summary: 


Unnamed: 0,Unnamed: 1,number_of_trees
0,,67.0




ModelMetricsRegression: xgboost
** Reported on train data. **

MSE: 2.4041656821357664
RMSE: 1.5505372237182073
MAE: 1.1008531422110077
RMSLE: 0.0034319598503326277
Mean Residual Deviance: 2.4041656821357664

ModelMetricsRegression: xgboost
** Reported on cross-validation data. **

MSE: 10.346373781608678
RMSE: 3.216577961375828
MAE: 2.298453566151329
RMSLE: 0.0070591134540485745
Mean Residual Deviance: 10.346373781608678

Cross-Validation Metrics Summary: 


Unnamed: 0,Unnamed: 1,mean,sd,cv_1_valid,cv_2_valid,cv_3_valid,cv_4_valid,cv_5_valid
0,mae,2.298449,0.023193,2.31917,2.314929,2.260921,2.294006,2.303218
1,mean_residual_deviance,10.346351,0.859431,11.369689,9.503255,9.952131,11.166836,9.73984
2,mse,10.346351,0.859431,11.369689,9.503255,9.952131,11.166836,9.73984
3,r2,0.964415,0.002922,0.96093,0.967818,0.965555,0.961815,0.965954
4,residual_deviance,10.346351,0.859431,11.369689,9.503255,9.952131,11.166836,9.73984
5,rmse,3.214377,0.132903,3.371897,3.082735,3.1547,3.341681,3.120872
6,rmsle,0.007055,0.000277,0.007378,0.006785,0.006893,0.007331,0.006888



Scoring History: 


Unnamed: 0,Unnamed: 1,timestamp,duration,number_of_trees,training_rmse,training_mae,training_deviance
0,,2023-05-16 16:44:33,18.183 sec,0.0,454.182071,453.861607,206281.353311
1,,2023-05-16 16:44:33,18.229 sec,5.0,76.482944,76.296754,5849.64078
2,,2023-05-16 16:44:33,18.341 sec,10.0,13.289866,12.895358,176.62054
3,,2023-05-16 16:44:33,18.473 sec,15.0,3.445073,2.733949,11.868531
4,,2023-05-16 16:44:33,18.605 sec,20.0,2.344798,1.703017,5.498076
5,,2023-05-16 16:44:33,18.733 sec,25.0,2.185036,1.571625,4.774384
6,,2023-05-16 16:44:34,18.860 sec,30.0,2.057268,1.468438,4.232352
7,,2023-05-16 16:44:34,18.987 sec,35.0,1.984562,1.416055,3.938486
8,,2023-05-16 16:44:34,19.116 sec,40.0,1.874295,1.334366,3.512981
9,,2023-05-16 16:44:34,19.248 sec,45.0,1.76867,1.259601,3.128193



Variable Importances: 


Unnamed: 0,variable,relative_importance,scaled_importance,percentage
0,TemperatureCelcius,3142538.0,1.0,0.776129
1,ExhaustVacuumHg,819400.2,0.260745,0.202372
2,AmbientPressureMillibar,48602.12,0.015466,0.012004
3,RelativeHumidity,38446.41,0.012234,0.009495
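
The `mean` and `sd` columns of the Cross-Validation Metrics Summary are statistics over the five per-fold values. As a quick check, the sketch below recomputes them for the `mae` row from the fold values printed above. Using the sample standard deviation (`statistics.stdev`) is an assumption, though it agrees with the reported figure.

```python
import statistics

# Per-fold MAE values copied from the Cross-Validation Metrics Summary above.
fold_mae = [2.31917, 2.314929, 2.260921, 2.294006, 2.303218]

mean_mae = statistics.mean(fold_mae)
sd_mae = statistics.stdev(fold_mae)  # sample standard deviation (divides by n - 1)

print(round(mean_mae, 6), round(sd_mae, 6))
```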




*Note: If you see the following error, it means that you need to install the pandas module.*

```
H2OTypeError: Argument `python_obj` should be a None | list | tuple | dict | numpy.ndarray | pandas.DataFrame | scipy.sparse.issparse, got H2OTwoDimTable 
```

For demonstration purposes, we will also execute a second AutoML run, this time providing the original, full dataset, `df` (without passing a `leaderboard_frame`). This is a more efficient use of our data, since we can use 100% of the data for training rather than the 80% used above. This time the leaderboard will use cross-validated metrics.

*Note: Using an explicit `leaderboard_frame` for scoring may be useful in some cases, which is why the option is available.*

In [9]:
# As with the first run, max_runtime_secs=60 was too short in this environment. H2O reported:
#   "AutoML was not able to build any model within a max runtime constraint of 60 seconds,
#   you may want to increase this value before retrying."
# aml2 = H2OAutoML(max_runtime_secs=60, seed=1, project_name="powerplant_full_data")
aml2 = H2OAutoML(max_runtime_secs=120, seed=1, project_name="powerplant_full_data")

In [10]:
aml2.train(y=y, training_frame=df)

AutoML progress: |███████████████████████████████████████████████████████████████| (done) 100%
Model Details
H2OXGBoostEstimator :  XGBoost
Model Key:  XGBoost_grid_1_AutoML_2_20230516_164532_model_4


Model Summary: 


Unnamed: 0,Unnamed: 1,number_of_trees
0,,67.0




ModelMetricsRegression: xgboost
** Reported on train data. **

MSE: 2.6135914765667665
RMSE: 1.6166605941157737
MAE: 1.1259990009575782
RMSLE: 0.0035750335097736036
Mean Residual Deviance: 2.6135914765667665

ModelMetricsRegression: xgboost
** Reported on cross-validation data. **

MSE: 9.610184685113556
RMSE: 3.100029787778426
MAE: 2.198337861128077
RMSLE: 0.00681187657010913
Mean Residual Deviance: 9.610184685113556

Cross-Validation Metrics Summary: 


Unnamed: 0,Unnamed: 1,mean,sd,cv_1_valid,cv_2_valid,cv_3_valid,cv_4_valid,cv_5_valid
0,mae,2.198338,0.044306,2.164867,2.167556,2.265034,2.22293,2.171302
1,mean_residual_deviance,9.610259,1.096051,8.887444,8.462913,10.765993,10.796785,9.138162
2,mse,9.610259,1.096051,8.887444,8.462913,10.765993,10.796785,9.138162
3,r2,0.966952,0.004146,0.969431,0.970862,0.96271,0.962203,0.969554
4,residual_deviance,9.610259,1.096051,8.887444,8.462913,10.765993,10.796785,9.138162
5,rmse,3.096047,0.175908,2.981182,2.909108,3.281157,3.285846,3.022939
6,rmsle,0.006803,0.000383,0.006574,0.006389,0.007212,0.00721,0.006632



Scoring History: 


Unnamed: 0,Unnamed: 1,timestamp,duration,number_of_trees,training_rmse,training_mae,training_deviance
0,,2023-05-16 16:46:42,27.422 sec,0.0,454.185754,453.865009,206284.698696
1,,2023-05-16 16:46:42,28.135 sec,5.0,76.480405,76.294833,5849.252351
2,,2023-05-16 16:46:43,28.178 sec,10.0,13.270206,12.898634,176.098375
3,,2023-05-16 16:46:43,28.244 sec,15.0,3.335044,2.664468,11.122516
4,,2023-05-16 16:46:43,28.326 sec,20.0,2.243106,1.606557,5.031524
5,,2023-05-16 16:46:43,28.449 sec,25.0,2.128762,1.503646,4.531627
6,,2023-05-16 16:46:43,28.489 sec,30.0,2.039606,1.43603,4.159994
7,,2023-05-16 16:46:43,28.547 sec,35.0,1.942949,1.365574,3.77505
8,,2023-05-16 16:46:43,28.603 sec,40.0,1.875012,1.316291,3.515669
9,,2023-05-16 16:46:43,28.639 sec,45.0,1.832215,1.282705,3.357011



Variable Importances: 


Unnamed: 0,variable,relative_importance,scaled_importance,percentage
0,TemperatureCelcius,3936985.0,1.0,0.775243
1,ExhaustVacuumHg,1032642.0,0.262293,0.203341
2,AmbientPressureMillibar,58499.7,0.014859,0.011519
3,RelativeHumidity,50258.18,0.012766,0.009896
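
The three importance columns in the table above are related by simple normalizations: `scaled_importance` divides each value by the largest one, and `percentage` divides each value by the total. The sketch below reproduces both from the printed `relative_importance` values. Those printed values are rounded, so the results match the table only approximately.

```python
# relative_importance values copied from the Variable Importances table above.
relative_importance = {
    "TemperatureCelcius": 3936985.0,
    "ExhaustVacuumHg": 1032642.0,
    "AmbientPressureMillibar": 58499.7,
    "RelativeHumidity": 50258.18,
}

largest = max(relative_importance.values())
total = sum(relative_importance.values())

# Normalize by the largest value and by the sum, respectively.
scaled = {name: value / largest for name, value in relative_importance.items()}
percentage = {name: value / total for name, value in relative_importance.items()}

for name in relative_importance:
    print(name, round(scaled[name], 6), round(percentage[name], 6))
```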




*Note: We specify a `project_name` here for clarity.*

## Leaderboard

Next, we will view the AutoML Leaderboard. Since we specified a `leaderboard_frame` in the `H2OAutoML.train()` method for scoring and ranking the models, the AutoML leaderboard uses the performance on this data to rank the models.

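After viewing the `"powerplant_lb_frame"` AutoML project leaderboard, we compare it to the leaderboard for the `"powerplant_full_data"` project. The error values are lower when the full dataset is used for training, although the two leaderboards are not scored on the same data: the first uses the held-out `test` frame and the second uses cross-validated metrics.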

A default performance metric for each machine learning task (binary classification, multiclass classification, regression) is specified internally, and the leaderboard is sorted by that metric. In this regression run, the ranking metric is mean residual deviance. A different ranking metric can be requested through the `sort_metric` argument of `H2OAutoML`.

In [11]:
aml.leaderboard.head()

model_id,mean_residual_deviance,rmse,mse,mae,rmsle
XGBoost_grid_1_AutoML_1_20230516_164330_model_4,10.2428,3.20043,10.2428,2.23104,0.00705131
XGBoost_grid_1_AutoML_1_20230516_164330_model_3,10.4336,3.23011,10.4336,2.2744,0.00711938
GBM_grid_1_AutoML_1_20230516_164330_model_5,10.5207,3.24357,10.5207,2.2389,0.00713964
GBM_4_AutoML_1_20230516_164330,10.5916,3.25447,10.5916,2.28383,0.00716625
XGBoost_2_AutoML_1_20230516_164330,10.8055,3.28717,10.8055,2.35187,0.00725191
GBM_3_AutoML_1_20230516_164330,11.09,3.33016,11.09,2.34868,0.00733045
GBM_2_AutoML_1_20230516_164330,11.2763,3.35801,11.2763,2.39109,0.00739036
GBM_5_AutoML_1_20230516_164330,11.465,3.386,11.465,2.4209,0.007449
XGBoost_1_AutoML_1_20230516_164330,11.5945,3.40507,11.5945,2.41618,0.0075004
XRT_1_AutoML_1_20230516_164330,11.9057,3.45046,11.9057,2.42398,0.00761006




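Now we will view a snapshot of the top models from the second run. When AutoML has time to build Stacked Ensembles, they usually appear at or near the top of the leaderboard, since a Stacked Ensemble can almost always outperform a single model. In the output below, however, the top rows contain only XGBoost and GBM models.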

In [12]:
aml2.leaderboard.head()

model_id,mean_residual_deviance,rmse,mse,mae,rmsle
XGBoost_grid_1_AutoML_2_20230516_164532_model_4,9.61018,3.10003,9.61018,2.19834,0.00681188
XGBoost_grid_1_AutoML_2_20230516_164532_model_3,9.94907,3.15422,9.94907,2.25798,0.00692968
GBM_grid_1_AutoML_2_20230516_164532_model_5,10.062,3.17207,10.062,2.23362,0.00696448
GBM_4_AutoML_2_20230516_164532,10.2867,3.2073,10.2867,2.29428,0.00704443
GBM_5_AutoML_2_20230516_164532,10.6581,3.26467,10.6581,2.3595,0.00716636
XGBoost_grid_1_AutoML_2_20230516_164532_model_2,10.6597,3.26493,10.6597,2.32741,0.00718573
GBM_3_AutoML_2_20230516_164532,10.7117,3.27287,10.7117,2.36223,0.00718902
XGBoost_1_AutoML_2_20230516_164532,10.7313,3.27586,10.7313,2.35948,0.00720418
GBM_2_AutoML_2_20230516_164532,10.7831,3.28376,10.7831,2.38277,0.00721314
XGBoost_3_AutoML_2_20230516_164532,11.0484,3.32391,11.0484,2.41398,0.00730508




This dataset comes from the UCI Machine Learning Repository of machine learning datasets. 

http://archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant

The data was used in a publication in the *International Journal of Electrical Power & Energy Systems* in 2014. 

https://www.sciencedirect.com/science/article/pii/S0142061514000908

In the paper, the authors achieved a mean absolute error (MAE) of 2.818 and a root mean squared error (RMSE) of 3.787 with their best model. The AutoML leader scores an MAE of about 2.23 and an RMSE of about 3.20 on the `test` frame, so with H2O's AutoML we've already beaten those results in just 120 seconds of compute time!
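
To put the comparison in relative terms, the sketch below computes the percentage reduction in error from the paper's figures to the leader's metrics in the first leaderboard above. It is plain arithmetic on the printed numbers, and it assumes the two sets of metrics are comparable, which depends on the evaluation protocol used in the paper.

```python
# Figures quoted from the paper and from the first leaderboard's top row.
paper = {"mae": 2.818, "rmse": 3.787}
automl = {"mae": 2.23104, "rmse": 3.20043}


def relative_reduction(baseline, new):
    """Percentage by which `new` is lower than `baseline`."""
    return 100.0 * (baseline - new) / baseline


reductions = {m: relative_reduction(paper[m], automl[m]) for m in paper}
for metric, pct in reductions.items():
    print(metric, round(pct, 1))
```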

## Predict Using Leader Model

If you need to generate predictions on a test set, you can call `predict()` on the `H2OAutoML` object directly, or on the leader model object (`aml.leader`).

In [13]:
pred = aml.predict(test)
pred.head()

xgboost prediction progress: |███████████████████████████████████████████████████| (done) 100%


predict
486.013
475.466
465.117
451.432
448.032
468.971
445.349
462.903
443.487
432.499




If needed, the standard `model_performance()` method can be applied to the AutoML leader model and a test set to generate an H2O model performance object.

In [14]:
perf = aml.leader.model_performance(test)
perf


ModelMetricsRegression: xgboost
** Reported on test data. **

MSE: 10.242775520821024
RMSE: 3.200433645745686
MAE: 2.2310432744744846
RMSLE: 0.007051308135367831
Mean Residual Deviance: 10.242775520821024
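
The metrics in this report are related to one another; for this model, the mean residual deviance printed above is identical to the MSE. The sketch below gives minimal pure-Python definitions of MSE, RMSE, MAE and RMSLE and applies them to a few values. The `actual` values are the first three responses shown by `describe()`, while the `predicted` values are made up for illustration and do not come from the model. The RMSLE definition uses `log(1 + value)`, which is a common convention and is assumed here. The last lines check that the reported RMSE is the square root of the reported MSE.

```python
import math


def mse(actual, predicted):
    """Mean squared error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(mse(actual, predicted))


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmsle(actual, predicted):
    """Root mean squared logarithmic error, using log(1 + value) (assumed convention)."""
    sq = [(math.log1p(a) - math.log1p(p)) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(sq) / len(sq))


actual = [463.26, 444.37, 488.56]   # first three response values from describe()
predicted = [461.0, 446.0, 486.0]   # made-up predictions, for illustration only

print(mse(actual, predicted), rmse(actual, predicted))
print(mae(actual, predicted), rmsle(actual, predicted))

# Consistency check on the figures reported on the test data above.
reported_mse = 10.242775520821024
reported_rmse = 3.200433645745686
print(abs(math.sqrt(reported_mse) - reported_rmse) < 1e-6)
```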


