# H2O AutoML Regression Demo

This is a [Jupyter](https://jupyter.org/) Notebook. When you execute code within the notebook, the results appear beneath the code. To execute a code chunk, place your cursor on the cell and press *Shift+Enter*. 

## Start H2O

Import the **h2o** Python module and `H2OAutoML` class and initialize a local H2O cluster.

In [1]:
import h2o
from h2o.automl import H2OAutoML
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.
Attempting to start a local H2O server...
  Java Version: openjdk version "11.0.8" 2020-07-14; OpenJDK Runtime Environment (build 11.0.8+10-post-Ubuntu-0ubuntu118.04.1); OpenJDK 64-Bit Server VM (build 11.0.8+10-post-Ubuntu-0ubuntu118.04.1, mixed mode, sharing)
  Starting server from /home/krishna/.local/lib/python3.7/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /tmp/tmpn1363shb
  JVM stdout: /tmp/tmpn1363shb/h2o_krishna_started_from_python.out
  JVM stderr: /tmp/tmpn1363shb/h2o_krishna_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.


| Property | Value |
|---|---|
| H2O cluster uptime | 03 secs |
| H2O cluster timezone | Asia/Kolkata |
| H2O data parsing timezone | UTC |
| H2O cluster version | 3.26.0.2 |
| H2O cluster version age | 1 year, 1 month and 19 days !!! |
| H2O cluster name | H2O_from_python_krishna_ohir23 |
| H2O cluster total nodes | 1 |
| H2O cluster free memory | 3.848 Gb |
| H2O cluster total cores | 4 |
| H2O cluster allowed cores | 4 |


## Load Data

We will use the [Combined Cycle Power Plant](http://archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant) dataset.  
Input variables: temperature, exhaust vacuum, ambient pressure, and relative humidity.  
Output variable: hourly energy output in MW (`HourlyEnergyOutputMW`).

In [2]:
filename = "datasets/powerplant_output.csv"
df = h2o.import_file(filename)

Parse progress: |█████████████████████████████████████████████████████████| 100%


Let's take a look at the data.

In [3]:
df.describe()

Rows:9568
Cols:5




| | TemperatureCelcius | ExhaustVacuumHg | AmbientPressureMillibar | RelativeHumidity | HourlyEnergyOutputMW |
|---|---|---|---|---|---|
| type | real | real | real | real | real |
| mins | 1.81 | 25.36 | 992.89 | 25.56 | 420.26 |
| mean | 19.651231187290964 | 54.30580372073579 | 1013.2590781772575 | 73.30897784280938 | 454.36500940635455 |
| maxs | 37.11 | 81.56 | 1033.3 | 100.16 | 495.76 |
| sigma | 7.452473229611079 | 12.70789299832681 | 5.938783705811606 | 14.600268756728953 | 17.066994999803413 |
| zeros | 0 | 0 | 0 | 0 | 0 |
| missing | 0 | 0 | 0 | 0 | 0 |
| 0 | 14.96 | 41.76 | 1024.07 | 73.17 | 463.26 |
| 1 | 25.18 | 62.96 | 1020.04 | 59.08 | 444.37 |
| 2 | 5.11 | 39.4 | 1012.16 | 92.14 | 488.56 |


In [4]:
y = "HourlyEnergyOutputMW"

# There is no need to specify the x argument:
# all columns other than y are treated as input variables (features).
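
For illustration, leaving `x` unset amounts to using every column except the response. The following is a minimal plain-Python sketch of that rule; the `columns` list is typed in by hand from the `describe()` output above rather than read from `df`, so it runs without an H2O cluster.

```python
# Column names copied by hand from the describe() output above.
columns = [
    "TemperatureCelcius",
    "ExhaustVacuumHg",
    "AmbientPressureMillibar",
    "RelativeHumidity",
    "HourlyEnergyOutputMW",
]
y = "HourlyEnergyOutputMW"

# Every column other than the response is treated as a predictor.
x = [col for col in columns if col != y]
print(x)
```

In the notebook itself, `df.columns` would supply the same list, and the resulting `x` could be passed explicitly as `aml.train(x = x, y = y, ...)`.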

In [5]:
# Split the data into 70% for training and 30% for testing.
# The training data is used to build the models;
# the test data is used to assess model quality.
splits = df.split_frame(ratios = [0.7], seed = 1)
train = splits[0]
test = splits[1]
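
Note that `split_frame` assigns rows at random, so the resulting split is approximate rather than exact. The sketch below is not H2O's implementation. It is a plain-Python illustration, under the assumption of one independent uniform draw per row, of why a ratio of 0.7 yields roughly (not exactly) 70% of the rows in the first split. The helper `random_split` is hypothetical.

```python
import random

def random_split(n_rows, ratio, seed):
    """Put each row index in the first split with probability `ratio`, else in the second."""
    rng = random.Random(seed)
    first, second = [], []
    for i in range(n_rows):
        # One uniform draw per row decides which split the row lands in.
        (first if rng.random() < ratio else second).append(i)
    return first, second

# 9568 is the row count reported by df.describe() above.
train_idx, test_idx = random_split(n_rows=9568, ratio=0.7, seed=1)
print(len(train_idx), len(test_idx), len(train_idx) / 9568)
```

In the notebook, `train.nrow` and `test.nrow` report the actual sizes of the two frames.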

## Run AutoML 

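`max_runtime_secs`: the maximum time, in seconds, that AutoML spends building models; the run will not continue beyond this limit.

`leaderboard_frame`: the data on which the leaderboard metrics are computed. Here we pass the `test` frame, so the models are ranked by their performance on the held-out test data.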

In [6]:
aml = H2OAutoML(max_runtime_secs = 60, seed = 1, project_name = "powerplant_lb_frame")
aml.train(y = y, training_frame = train, leaderboard_frame = test)

AutoML progress: |████████████████████████████████████████████████████████| 100%


*Note: If you see the following error, it means that you need to install the pandas module.*
```
H2OTypeError: Argument `python_obj` should be a None | list | tuple | dict | numpy.ndarray | pandas.DataFrame | scipy.sparse.issparse, got H2OTwoDimTable 
``` 
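
A quick way to check whether pandas is available in the current environment is sketched below, using only the standard library. If it prints `False`, installing pandas (for example with `pip install pandas`) should resolve the error.

```python
import importlib.util

# find_spec returns None when the module cannot be located on the import path.
has_pandas = importlib.util.find_spec("pandas") is not None
print("pandas installed:", has_pandas)
```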

In [7]:
# Print the leaderboard of the top models
aml.leaderboard.head()

| model_id | mean_residual_deviance | rmse | mse | mae | rmsle |
|---|---|---|---|---|---|
| XGBoost_1_AutoML_20200915_120529 | 10.8828 | 3.29891 | 10.8828 | 2.38014 | 0.00727403 |
| StackedEnsemble_BestOfFamily_AutoML_20200915_120529 | 10.9836 | 3.31416 | 10.9836 | 2.40979 | 0.00730677 |
| StackedEnsemble_AllModels_AutoML_20200915_120529 | 11.3797 | 3.37339 | 11.3797 | 2.46036 | 0.00744631 |
| XGBoost_2_AutoML_20200915_120529 | 19.1693 | 4.37827 | 19.1693 | 3.43954 | 0.00960476 |
| XGBoost_3_AutoML_20200915_120529 | 29500.3 | 171.757 | 29500.3 | 171.486 | 0.47233 |
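
The leaderboard rows above appear in ascending order of `mean_residual_deviance`, computed on the leaderboard frame. As a plain-Python sketch of that ranking, the rows can be copied into a list and the leader recovered by taking the minimum. The values are transcribed from the output above, and `leaderboard_rows` is an illustrative list, not an H2O object.

```python
# (model_id, mean_residual_deviance, rmse, mae), transcribed from the leaderboard above.
leaderboard_rows = [
    ("XGBoost_1_AutoML_20200915_120529", 10.8828, 3.29891, 2.38014),
    ("StackedEnsemble_BestOfFamily_AutoML_20200915_120529", 10.9836, 3.31416, 2.40979),
    ("StackedEnsemble_AllModels_AutoML_20200915_120529", 11.3797, 3.37339, 2.46036),
    ("XGBoost_2_AutoML_20200915_120529", 19.1693, 4.37827, 3.43954),
    ("XGBoost_3_AutoML_20200915_120529", 29500.3, 171.757, 171.486),
]

# The leader is the row with the lowest mean residual deviance.
best_row = min(leaderboard_rows, key=lambda row: row[1])
print(best_row[0])

# Check that the rows are already listed best-first.
deviances = [row[1] for row in leaderboard_rows]
print(deviances == sorted(deviances))
```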




The top model on the leaderboard is an XGBoost model.

This dataset comes from the [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant). The data was used in a [publication](https://www.sciencedirect.com/science/article/pii/S0142061514000908) in the *International Journal of Electrical Power & Energy Systems* in 2014. The paper reports a mean absolute error (MAE) of 2.818 and a root mean squared error (RMSE) of 3.787. The H2O AutoML leader (the first row of the leaderboard) improves on both, with an MAE of 2.380 and an RMSE of 3.299 on the test frame.
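
To make the comparison concrete, the relative reduction in each metric can be computed from the paper's figures and the leader's row of the leaderboard. This is a small arithmetic sketch; the dictionary names are illustrative. The paper's evaluation protocol may differ from the single 70/30 split used here, so the comparison is indicative only.

```python
# Metrics quoted from the paper and from the first leaderboard row above.
paper_metrics = {"mae": 2.818, "rmse": 3.787}
leader_metrics = {"mae": 2.38014, "rmse": 3.29891}

# Relative reduction: how much lower the leader's error is, as a fraction of the paper's.
reductions = {
    name: (paper_metrics[name] - leader_metrics[name]) / paper_metrics[name]
    for name in paper_metrics
}

for name, reduction in reductions.items():
    print(f"{name.upper()}: {paper_metrics[name]} -> {leader_metrics[name]} ({reduction:.1%} lower)")
```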

## Predict Using Leader Model

If you need to generate predictions on a test set, you can call `predict` on the `H2OAutoML` object directly, or on the leader model object (`aml.leader`).

In [8]:
pred = aml.predict(test)
print(pred.head(2))

xgboost prediction progress: |████████████████████████████████████████████| 100%


predict
485.749
473.517





In [9]:
# Print the performance of the leader model on the test data
perf = aml.leader.model_performance(test)
perf


ModelMetricsRegression: xgboost
** Reported on test data. **

MSE: 10.88278476366283
RMSE: 3.2989066012336314
MAE: 2.3801365480312664
RMSLE: 0.007274030873983928
Mean Residual Deviance: 10.88278476366283
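
For reference, the metrics above can be reproduced from actual and predicted values using their standard definitions. The sketch below is plain Python, not H2O's code. The `actual` values are the first three responses shown in the `describe()` output, while the `predicted` values are made up for illustration. The final check confirms that the reported RMSE is the square root of the reported MSE.

```python
import math

def regression_metrics(actual, predicted):
    """Compute MSE, RMSE, MAE and RMSLE using the standard definitions."""
    n = len(actual)
    errors = [p - a for a, p in zip(actual, predicted)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    # RMSLE compares log(1 + value), so it measures relative rather than absolute error.
    log_errors = [math.log1p(p) - math.log1p(a) for a, p in zip(actual, predicted)]
    rmsle = math.sqrt(sum(e * e for e in log_errors) / n)
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae, "rmsle": rmsle}

actual = [463.26, 444.37, 488.56]   # first three responses from describe()
predicted = [465.0, 442.0, 486.0]   # hypothetical predictions, for illustration only
metrics = regression_metrics(actual, predicted)
print(metrics)

# Values copied from the model performance output above.
reported_mse = 10.88278476366283
reported_rmse = 3.2989066012336314
print(math.isclose(math.sqrt(reported_mse), reported_rmse, rel_tol=1e-6))
```

In the output above, the mean residual deviance coincides with the MSE, which is why the two lines show the same value.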




Acknowledgements: I prepared this notebook based on https://github.com/navdeep-G/sdss-h2o-automl