# H2O AutoML Regression Demo

This is a [Jupyter](https://jupyter.org/) Notebook. When you execute code within the notebook, the results appear beneath the code. To execute a code cell, place your cursor in the cell and press *Shift+Enter*.

### Start H2O

Import the **h2o** Python module and `H2OAutoML` class and initialize a local H2O cluster.

In [1]:
import h2o
from h2o.automl import H2OAutoML
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321..... not found.
Attempting to start a local H2O server...
  Java Version: openjdk version "1.8.0_382"; OpenJDK Runtime Environment (build 1.8.0_382-b05); OpenJDK 64-Bit Server VM (build 25.382-b05, mixed mode)
  Starting server from /u/home/c/ccp2287/.conda/envs/h2oai/lib/python3.12/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /tmp/tmpv_5ns2dx
  JVM stdout: /tmp/tmpv_5ns2dx/h2o_ccp2287_started_from_python.out
  JVM stderr: /tmp/tmpv_5ns2dx/h2o_ccp2287_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.


| | |
|---|---|
| H2O_cluster_uptime: | 01 secs |
| H2O_cluster_timezone: | America/Los_Angeles |
| H2O_data_parsing_timezone: | UTC |
| H2O_cluster_version: | 3.44.0.2 |
| H2O_cluster_version_age: | 20 days |
| H2O_cluster_name: | H2O_from_python_ccp2287_csx8ti |
| H2O_cluster_total_nodes: | 1 |
| H2O_cluster_free_memory: | 26.67 Gb |
| H2O_cluster_total_cores: | 36 |
| H2O_cluster_allowed_cores: | 36 |


### Load Data

For the AutoML regression demo, we use the [Combined Cycle Power Plant](http://archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant) dataset. The goal is to predict the hourly energy output (in megawatts) from the temperature, ambient pressure, relative humidity and exhaust vacuum values. In this demo, you will use H2O's AutoML to outperform the [state-of-the-art results](https://www.sciencedirect.com/science/article/pii/S0142061514000908) on this task.

In [2]:
# Download the data file from GitHub
data_path = "https://github.com/h2oai/h2o-tutorials/raw/master/h2o-world-2017/automl/data/powerplant_output.csv"

# Load data into H2O
df = h2o.import_file(data_path)

# Create test data: splits[0] holds roughly 80% of the rows, splits[1] the remaining 20%
splits = df.split_frame(ratios = [0.8], seed = 1)
test = splits[1]


Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%


Let's take a look at the data.

In [3]:
df.describe()

| | TemperatureCelcius | ExhaustVacuumHg | AmbientPressureMillibar | RelativeHumidity | HourlyEnergyOutputMW |
|---|---|---|---|---|---|
| type | real | real | real | real | real |
| mins | 1.81 | 25.36 | 992.89 | 25.56 | 420.26 |
| mean | 19.651231187290968 | 54.30580372073578 | 1013.2590781772575 | 73.30897784280934 | 454.36500940635443 |
| maxs | 37.11 | 81.56 | 1033.3 | 100.16 | 495.76 |
| sigma | 7.452473229611079 | 12.707892998326809 | 5.938783705811605 | 14.600268756728953 | 17.066994999803416 |
| zeros | 0 | 0 | 0 | 0 | 0 |
| missing | 0 | 0 | 0 | 0 | 0 |
| 0 | 14.96 | 41.76 | 1024.07 | 73.17 | 463.26 |
| 1 | 25.18 | 62.96 | 1020.04 | 59.08 | 444.37 |
| 2 | 5.11 | 39.4 | 1012.16 | 92.14 | 488.56 |


Next, let's identify the response column and save the column name as `y`.  In this dataset, we will use all columns except the response as predictors, so we can skip setting the `x` argument explicitly.

In [4]:
y = "HourlyEnergyOutputMW"

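The data-loading cell above already split the data 80/20 and kept the 20% portion as the `test` frame. The `test` frame is used later to demonstrate how to generate predictions and performance metrics with the AutoML leader model. Note that AutoML is trained on the full `df` frame below, so `test` is not an independent hold-out set in this demo.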
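
As a rough mental model of a seeded 80/20 split such as `split_frame(ratios = [0.8], seed = 1)`, the sketch below assigns each row to the first part with probability 0.8. This is an illustration in plain Python, not H2O's implementation; the helper name `split_rows` and the row count are made up for the example. It shows why a probabilistic split yields part sizes that are close to, but not exactly, the requested ratio.

```python
import random


def split_rows(rows, ratio=0.8, seed=1):
    """Send each row to the first part with probability `ratio` (hypothetical helper)."""
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    first, second = [], []
    for row in rows:
        (first if rng.random() < ratio else second).append(row)
    return first, second


rows = list(range(10_000))  # stand-in for the rows of a data frame
train_part, test_part = split_rows(rows)
print(len(train_part), len(test_part))  # close to 8000 and 2000, not exact
```

Calling `split_rows` again with the same seed returns the same two parts, which is the role the `seed` argument plays in the cell above.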

## Run AutoML 

Run AutoML, stopping after 60 seconds. The `max_runtime_secs` argument provides a way to limit the AutoML run by time. When using a time-limited stopping criterion, the number of models trained will vary between runs.
In this example, we use 100% of the data for training, and the leaderboard uses cross-validated metrics to assess the models.

In [5]:
aml = H2OAutoML(max_runtime_secs = 60, seed = 1, project_name = "powerplant_frame")
aml.train(y = y, training_frame = df)

AutoML progress: |███████████████████████████████████████████████████████████████| (done) 100%


| number_of_trees |
|---|
| 26.0 |

| | mean | sd | cv_1_valid | cv_2_valid | cv_3_valid | cv_4_valid | cv_5_valid |
|---|---|---|---|---|---|---|---|
| mae | 2.2743654 | 0.0484475 | 2.2643597 | 2.2478166 | 2.359796 | 2.2566676 | 2.2431872 |
| mean_residual_deviance | 10.113454 | 1.0494863 | 9.68008 | 8.815423 | 11.392504 | 10.969993 | 9.709271 |
| mse | 10.113454 | 1.0494863 | 9.68008 | 8.815423 | 11.392504 | 10.969993 | 9.709271 |
| r2 | 0.9652283 | 0.0039612 | 0.9667051 | 0.9696488 | 0.9605398 | 0.9615966 | 0.9676514 |
| residual_deviance | 10.113454 | 1.0494863 | 9.68008 | 8.815423 | 11.392504 | 10.969993 | 9.709271 |
| rmse | 3.1767414 | 0.1649548 | 3.1112828 | 2.9690778 | 3.3752782 | 3.3120978 | 3.1159704 |
| rmsle | 0.0069755 | 0.00036 | 0.0068496 | 0.0065198 | 0.0074113 | 0.0072656 | 0.0068314 |

| timestamp | duration | number_of_trees | training_rmse | training_mae | training_deviance |
|---|---|---|---|---|---|
| 2023-11-28 22:11:12 | 11.708 sec | 0.0 | 454.1857535 | 453.8650095 | 206284.6986959 |
| 2023-11-28 22:11:12 | 11.762 sec | 5.0 | 76.480405 | 76.2948331 | 5849.2523512 |
| 2023-11-28 22:11:12 | 11.801 sec | 10.0 | 13.2702063 | 12.8986344 | 176.0983753 |
| 2023-11-28 22:11:12 | 11.845 sec | 15.0 | 3.3350436 | 2.6644685 | 11.122516 |
| 2023-11-28 22:11:12 | 11.890 sec | 20.0 | 2.2431058 | 1.6065574 | 5.0315239 |
| 2023-11-28 22:11:12 | 11.927 sec | 25.0 | 2.1287619 | 1.5036465 | 4.5316272 |
| 2023-11-28 22:11:12 | 11.941 sec | 26.0 | 2.0962997 | 1.4770928 | 4.3944726 |

| variable | relative_importance | scaled_importance | percentage |
|---|---|---|---|
| TemperatureCelcius | 3930630.75 | 1.0 | 0.7791057 |
| ExhaustVacuumHg | 1024434.5 | 0.2606285 | 0.2030572 |
| AmbientPressureMillibar | 49977.78125 | 0.012715 | 0.0099063 |
| RelativeHumidity | 40011.6601562 | 0.0101795 | 0.0079309 |
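
The `mean` and `sd` columns of the cross-validation summary above aggregate the five per-fold values. The short sketch below recomputes them for the `mae` row using only the Python standard library. The fold values are copied from the output above; using the sample standard deviation (n - 1 denominator) is an assumption that happens to be consistent with the reported numbers.

```python
from statistics import mean, stdev

# Per-fold MAE values (cv_1_valid .. cv_5_valid) copied from the summary above
fold_mae = [2.2643597, 2.2478166, 2.359796, 2.2566676, 2.2431872]

cv_mean = mean(fold_mae)
cv_sd = stdev(fold_mae)  # sample standard deviation (n - 1 denominator)

print(f"mean={cv_mean:.7f} sd={cv_sd:.7f}")
```

The same calculation applies to every other row of the summary, for example `rmse` or `r2`.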


## Leaderboard

Next, we will view the AutoML Leaderboard.  Since we did not specify a `leaderboard_frame` in the `H2OAutoML.train()` method, the AutoML leaderboard uses cross-validation metrics to score and rank the models.

The leaderboard shown below belongs to the `"powerplant_frame"` AutoML project, which was trained on the full dataset.

A default performance metric for each machine learning task (binary classification, multiclass classification, regression) is specified internally and the leaderboard will be sorted by that metric.  In the case of regression, the default ranking metric is mean residual deviance.  To rank the leaderboard by a different metric, set the `sort_metric` argument of `H2OAutoML`.
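
To make the leaderboard columns concrete, here is a minimal sketch of the usual definitions of MAE, MSE, RMSE and RMSLE in plain Python. It assumes a squared-error (Gaussian) loss, under which the mean residual deviance equals the MSE; that assumption is consistent with the identical `mse` and `mean_residual_deviance` values in the output of this notebook. The function name and the predicted values are hypothetical; the actual values are the first three `HourlyEnergyOutputMW` rows shown earlier.

```python
import math


def regression_metrics(actual, predicted):
    """Compute common regression metrics for two equal-length sequences."""
    n = len(actual)
    errors = [p - a for a, p in zip(actual, predicted)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    # RMSLE compares log(1 + value), so it measures relative rather than absolute error
    sq_log_errors = [(math.log1p(p) - math.log1p(a)) ** 2 for a, p in zip(actual, predicted)]
    return {
        "mae": mae,
        "mse": mse,
        "rmse": math.sqrt(mse),
        "rmsle": math.sqrt(sum(sq_log_errors) / n),
        "mean_residual_deviance": mse,  # assumption: squared-error loss
    }


actual = [463.26, 444.37, 488.56]     # first three rows of the response column
predicted = [460.26, 446.37, 488.56]  # made-up predictions for illustration
metrics = regression_metrics(actual, predicted)
print(metrics)
```

Because RMSE is a monotone function of MSE, ranking models by RMSE, MSE or mean residual deviance gives the same order under this assumption.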

In [6]:
aml.leaderboard.head()

| model_id | rmse | mse | mae | rmsle | mean_residual_deviance |
|---|---|---|---|---|---|
| XGBoost_grid_1_AutoML_1_20231128_221027_model_4 | 3.18016 | 10.1134 | 2.27437 | 0.00698293 | 10.1134 |
| GBM_4_AutoML_1_20231128_221027 | 3.2073 | 10.2867 | 2.29428 | 0.00704443 | 10.2867 |
| XGBoost_grid_1_AutoML_1_20231128_221027_model_3 | 3.22011 | 10.3691 | 2.32292 | 0.0070722 | 10.3691 |
| GBM_5_AutoML_1_20231128_221027 | 3.26467 | 10.6581 | 2.3595 | 0.00716636 | 10.6581 |
| XGBoost_grid_1_AutoML_1_20231128_221027_model_2 | 3.26493 | 10.6597 | 2.32741 | 0.00718573 | 10.6597 |
| GBM_3_AutoML_1_20231128_221027 | 3.27287 | 10.7117 | 2.36223 | 0.00718902 | 10.7117 |
| XGBoost_1_AutoML_1_20231128_221027 | 3.27586 | 10.7313 | 2.35948 | 0.00720418 | 10.7313 |
| GBM_2_AutoML_1_20231128_221027 | 3.28376 | 10.7831 | 2.38277 | 0.00721314 | 10.7831 |
| XGBoost_3_AutoML_1_20231128_221027 | 3.32391 | 11.0484 | 2.41398 | 0.00730508 | 11.0484 |
| XGBoost_2_AutoML_1_20231128_221027 | 3.32473 | 11.0538 | 2.39346 | 0.00730588 | 11.0538 |
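
The choice of ranking metric can change the order of closely matched models. The sketch below re-ranks the top five leaderboard rows in plain Python, first by RMSE and then by MAE. The values are copied from the leaderboard above; the helper `rank_by` is a hypothetical name used only for this illustration and is not part of H2O.

```python
# Top five leaderboard rows (model_id, rmse, mae) copied from the output above
rows = [
    {"model_id": "XGBoost_grid_1_AutoML_1_20231128_221027_model_4", "rmse": 3.18016, "mae": 2.27437},
    {"model_id": "GBM_4_AutoML_1_20231128_221027", "rmse": 3.2073, "mae": 2.29428},
    {"model_id": "XGBoost_grid_1_AutoML_1_20231128_221027_model_3", "rmse": 3.22011, "mae": 2.32292},
    {"model_id": "GBM_5_AutoML_1_20231128_221027", "rmse": 3.26467, "mae": 2.3595},
    {"model_id": "XGBoost_grid_1_AutoML_1_20231128_221027_model_2", "rmse": 3.26493, "mae": 2.32741},
]


def rank_by(rows, metric):
    """Return model ids ordered from best (lowest) to worst for an error metric."""
    return [r["model_id"] for r in sorted(rows, key=lambda r: r[metric])]


by_rmse = rank_by(rows, "rmse")
by_mae = rank_by(rows, "mae")
print(by_rmse[3:])  # positions 4 and 5 when ranked by RMSE
print(by_mae[3:])   # the same two models swap places when ranked by MAE
```

The leader is the same under both metrics here, but `GBM_5` and `XGBoost_grid_1 ... model_2` trade places, which is why it matters which metric the leaderboard is sorted by.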


This dataset comes from the [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant).  The data was used in a [publication](https://www.sciencedirect.com/science/article/pii/S0142061514000908) in the *International Journal of Electrical Power & Energy Systems* in 2014.  In the paper, the authors achieved a mean absolute error (MAE) of 2.818 and a root mean squared error (RMSE) of 3.787 with their best model.  The AutoML leader above reaches a cross-validated MAE of 2.274 and RMSE of 3.180, so with H2O's AutoML, we've already beaten the state of the art in just 60 seconds of compute time!
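
The size of the gap can be quantified with a small calculation. The sketch below compares the paper's reported errors with the leader's cross-validated errors from the leaderboard above. The helper name is hypothetical, and the comparison assumes the two evaluation protocols are similar enough to compare directly, which this notebook does not verify.

```python
# Errors reported in the 2014 paper and by the AutoML leader (from the leaderboard)
paper = {"mae": 2.818, "rmse": 3.787}
leader = {"mae": 2.27437, "rmse": 3.18016}


def relative_improvement(baseline, new):
    """Fraction by which `new` is lower than `baseline`."""
    return (baseline - new) / baseline


improvement = {m: relative_improvement(paper[m], leader[m]) for m in paper}
for metric, value in improvement.items():
    print(f"{metric}: {value:.1%} lower than the paper")
```

With these numbers, the leader's MAE is roughly 19% lower and its RMSE roughly 16% lower than the values reported in the paper.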

## Predict Using Leader Model

If you need to generate predictions on a test set, you can make predictions on the `H2OAutoML` object directly, or on the leader model object (`aml.leader`).

In [7]:
pred = aml.predict(test)
pred.head()

xgboost prediction progress: |███████████████████████████████████████████████████| (done) 100%


predict
487.538
475.761
466.843
453.029
449.231
468.274
443.634
463.748
441.56
433.288


If needed, the standard `model_performance()` method can be applied to the AutoML leader model and a test set to generate an H2O model performance object.

In [8]:
perf = aml.leader.model_performance(test)
perf
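
The performance object exposes individual metrics through accessor methods such as `rmse()`, `mae()` and `r2()`. The sketch below gathers them into a dictionary. The function `summarize_performance` and the stand-in class are hypothetical and exist only so the example runs without an H2O cluster; the stand-in values are made up and are not results from this notebook.

```python
def summarize_performance(perf):
    """Collect headline regression metrics from a model performance object.

    Assumes `perf` provides rmse(), mae() and r2() accessor methods.
    """
    return {"rmse": perf.rmse(), "mae": perf.mae(), "r2": perf.r2()}


class _StandInPerf:
    """Stand-in for a performance object, for demonstration only (made-up values)."""

    def rmse(self):
        return 3.0

    def mae(self):
        return 2.0

    def r2(self):
        return 0.9


summary = summarize_performance(_StandInPerf())
print(summary)
```

In the notebook itself, the call would be `summarize_performance(perf)` with the `perf` object created above.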