# H2O AutoML Regression Demo

This is a [Jupyter](https://jupyter.org/) Notebook. When you run a cell, its output appears beneath the code. To run a cell, place your cursor in it and press *Shift+Enter*.

## Start H2O

Import the **h2o** Python module and the `H2OAutoML` class, then initialize a local H2O cluster.

In [1]:
import os
import h2o
import pandas as pd
import datetime as dt

In [2]:
from h2o.automl import H2OAutoML
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321..... not found.
Attempting to start a local H2O server...
  Java Version: java version "1.8.0_152"; Java(TM) SE Runtime Environment (build 1.8.0_152-b16); Java HotSpot(TM) 64-Bit Server VM (build 25.152-b16, mixed mode)
  Starting server from /Users/wilsonpok/anaconda3/lib/python3.6/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /var/folders/3h/lk0_ptbj3fxd0vd0jwvg3cw00000gn/T/tmpgvfjz1so
  JVM stdout: /var/folders/3h/lk0_ptbj3fxd0vd0jwvg3cw00000gn/T/tmpgvfjz1so/h2o_wilsonpok_started_from_python.out
  JVM stderr: /var/folders/3h/lk0_ptbj3fxd0vd0jwvg3cw00000gn/T/tmpgvfjz1so/h2o_wilsonpok_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321... successful.


| Property | Value |
|---|---|
| H2O cluster uptime | 01 secs |
| H2O cluster timezone | Australia/Sydney |
| H2O data parsing timezone | UTC |
| H2O cluster version | 3.20.0.7 |
| H2O cluster version age | 19 days |
| H2O cluster name | H2O_from_python_wilsonpok_ayf6ko |
| H2O cluster total nodes | 1 |
| H2O cluster free memory | 3.556 Gb |
| H2O cluster total cores | 8 |
| H2O cluster allowed cores | 8 |


## Load Data

In [3]:
df = pd.read_csv(os.path.expanduser('~/python-scripts/automl/demand-forecasting/input/train.csv'))

Let's take a look at the data.

In [4]:
df.describe()

| | store | item | sales |
|---|---|---|---|
| count | 913000.0 | 913000.0 | 913000.0 |
| mean | 5.5 | 25.5 | 52.250287 |
| std | 2.872283 | 14.430878 | 28.801144 |
| min | 1.0 | 1.0 | 0.0 |
| 25% | 3.0 | 13.0 | 30.0 |
| 50% | 5.5 | 25.5 | 47.0 |
| 75% | 8.0 | 38.0 | 70.0 |
| max | 10.0 | 50.0 | 231.0 |


In [5]:
df.head()

| | date | store | item | sales |
|---|---|---|---|---|
| 0 | 2013-01-01 | 1 | 1 | 13 |
| 1 | 2013-01-02 | 1 | 1 | 11 |
| 2 | 2013-01-03 | 1 | 1 | 14 |
| 3 | 2013-01-04 | 1 | 1 | 13 |
| 4 | 2013-01-05 | 1 | 1 | 10 |


## Preprocessing

In [6]:
df['date'] = pd.to_datetime(df['date'])
print(df['date'].dtype)

datetime64[ns]
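
As an alternative to converting after loading, the date column can be parsed while the CSV is read. The sketch below uses a small in-memory CSV (an assumption made for illustration, mirroring the first rows shown earlier) rather than the notebook's `train.csv`.

```python
import io
import pandas as pd

# Illustrative in-memory CSV with the same columns as the demo data.
csv_text = """date,store,item,sales
2013-01-01,1,1,13
2013-01-02,1,1,11
2013-01-03,1,1,14
"""

# parse_dates converts the named column to a datetime dtype during loading.
toy = pd.read_csv(io.StringIO(csv_text), parse_dates=['date'])
print(toy.dtypes)
```

With the column parsed up front, the `.dt` accessor is available immediately and the separate `pd.to_datetime` call is not needed.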


In [7]:
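# Calendar features derived from the parsed date (weekday: Monday=0 ... Sunday=6)
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['weekday'] = df['date'].dt.weekday
# Quarter-like index that starts at 1 in January 2013 (2013 * 4 - 8051 = 1).
# divmod(month, 3)[0] is floor(month / 3), so the index steps up in March, June,
# September and December rather than at calendar-quarter boundaries.
df['quoter'] = df['year'] * 4 + divmod(df['month'], 3)[0] - 8051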
df.head()

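| | date | store | item | sales | year | month | weekday | quoter |
|---|---|---|---|---|---|---|---|---|
| 0 | 2013-01-01 | 1 | 1 | 13 | 2013 | 1 | 1 | 1 |
| 1 | 2013-01-02 | 1 | 1 | 11 | 2013 | 1 | 2 | 1 |
| 2 | 2013-01-03 | 1 | 1 | 14 | 2013 | 1 | 3 | 1 |
| 3 | 2013-01-04 | 1 | 1 | 13 | 2013 | 1 | 4 | 1 |
| 4 | 2013-01-05 | 1 | 1 | 10 | 2013 | 1 | 5 | 1 |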
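
The `quoter` formula uses `divmod(month, 3)[0]`, which is `floor(month / 3)`. That value changes at months 3, 6, 9 and 12, so March is grouped with April and May instead of with January and February. If a calendar-aligned quarter index is what is wanted, one possible variant shifts the month to the range 0 to 11 first. The sketch below compares the two on a toy frame with one row per month of 2013; the `calendar_quarter` column name is hypothetical.

```python
import pandas as pd

# Toy frame with one date per month of 2013 (illustrative data, not the demo's CSV).
toy = pd.DataFrame({'date': pd.date_range('2013-01-01', periods=12, freq='MS')})
toy['year'] = toy['date'].dt.year
toy['month'] = toy['date'].dt.month

# Notebook formula: floor(month / 3) steps up at months 3, 6, 9 and 12.
toy['quoter'] = toy['year'] * 4 + divmod(toy['month'], 3)[0] - 8051

# Calendar-aligned variant: shift the month to 0-11 before the floor division.
toy['calendar_quarter'] = toy['year'] * 4 + (toy['month'] - 1) // 3 - 8051

print(toy[['month', 'quoter', 'calendar_quarter']])
```

In this toy frame the notebook formula yields five distinct values over the year (December falls into index 5), while the shifted variant yields four groups of three months each.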


In [8]:
# Mean sales per (item, store, month) and per (store, item, weekday), broadcast to every row
df['item_store_month_sales'] = df.groupby(['item', 'store', 'month'])['sales'].transform('mean')
df['store_item_weekday_sales'] = df.groupby(['store', 'item', 'weekday'])['sales'].transform('mean')
# Rounded copies of the two group means
df['round_item_store_month_sales'] = df['item_store_month_sales'].round()
df['round_store_item_weekday_sales'] = df['store_item_weekday_sales'].round()
df.head()

| | date | store | item | sales | year | month | weekday | quoter | item_store_month_sales | store_item_weekday_sales | round_item_store_month_sales | round_store_item_weekday_sales |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013-01-01 | 1 | 1 | 13 | 2013 | 1 | 1 | 1 | 13.709677 | 18.168582 | 14.0 | 18.0 |
| 1 | 2013-01-02 | 1 | 1 | 11 | 2013 | 1 | 2 | 1 | 13.709677 | 18.793103 | 14.0 | 19.0 |
| 2 | 2013-01-03 | 1 | 1 | 14 | 2013 | 1 | 3 | 1 | 13.709677 | 19.452107 | 14.0 | 19.0 |
| 3 | 2013-01-04 | 1 | 1 | 13 | 2013 | 1 | 4 | 1 | 13.709677 | 21.015326 | 14.0 | 21.0 |
| 4 | 2013-01-05 | 1 | 1 | 10 | 2013 | 1 | 5 | 1 | 13.709677 | 22.97318 | 14.0 | 23.0 |
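
These aggregates are means of `sales`, the prediction target, computed over every row of `df`. Each feature therefore contains information from rows that later end up in the test split. One way to avoid that is to compute the means on the training rows only and merge them onto both splits. The sketch below shows the idea on toy data; the helper `add_group_mean` and the frames `train_part` and `test_part` are hypothetical names, not part of the notebook.

```python
import pandas as pd

def add_group_mean(fit_df, apply_df, keys, target, name):
    """Compute the mean of `target` per `keys` on fit_df and attach it to apply_df."""
    means = (fit_df.groupby(keys)[target]
                   .mean()
                   .rename(name)
                   .reset_index())
    return apply_df.merge(means, on=keys, how='left')

# Illustrative data: the fourth row is an outlier that lands in the hold-out.
toy = pd.DataFrame({
    'store': [1, 1, 1, 1, 2, 2],
    'item':  [1, 1, 1, 1, 1, 1],
    'month': [1, 1, 1, 1, 1, 1],
    'sales': [10, 12, 14, 100, 20, 22],
})
train_part = toy.iloc[[0, 1, 2, 4]]   # rows used to fit the aggregate
test_part = toy.iloc[[3, 5]]          # held-out rows

keys = ['item', 'store', 'month']
train_feat = add_group_mean(train_part, train_part, keys, 'sales', 'item_store_month_sales')
test_feat = add_group_mean(train_part, test_part, keys, 'sales', 'item_store_month_sales')
print(test_feat)
```

Here the held-out row for store 1 receives the training mean of 12.0 rather than 34.0, the value it would get if its own sales of 100 were included. Training rows still see their own target in the mean; out-of-fold encoding is one way to address that as well.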


In [9]:
df_select = df[['sales',
                'month',
                'quoter',
                'item_store_month_sales',
                'store_item_weekday_sales',
                'round_item_store_month_sales',
                'round_store_item_weekday_sales']]

## Train Model

In [10]:
hf = h2o.H2OFrame(df_select)


Parse progress: |█████████████████████████████████████████████████████████| 100%


In [11]:
hf.head()

| sales | month | quoter | item_store_month_sales | store_item_weekday_sales | round_item_store_month_sales | round_store_item_weekday_sales |
|---|---|---|---|---|---|---|
| 13 | 1 | 1 | 13.7097 | 18.1686 | 14 | 18 |
| 11 | 1 | 1 | 13.7097 | 18.7931 | 14 | 19 |
| 14 | 1 | 1 | 13.7097 | 19.4521 | 14 | 19 |
| 13 | 1 | 1 | 13.7097 | 21.0153 | 14 | 21 |
| 10 | 1 | 1 | 13.7097 | 22.9732 | 14 | 23 |
| 12 | 1 | 1 | 13.7097 | 23.7969 | 14 | 24 |
| 10 | 1 | 1 | 13.7097 | 15.5846 | 14 | 16 |
| 9 | 1 | 1 | 13.7097 | 18.1686 | 14 | 18 |
| 12 | 1 | 1 | 13.7097 | 18.7931 | 14 | 19 |
| 9 | 1 | 1 | 13.7097 | 19.4521 | 14 | 19 |

In [12]:
y = 'sales'

In [13]:
splits = hf.split_frame(ratios = [0.8], seed = 1)
train = splits[0]
test = splits[1]
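
`split_frame` assigns rows to the two frames at random, so test rows are interleaved in time with training rows. For a forecasting problem, holding out the most recent dates is a common alternative. Because `hf` no longer carries the `date` column, such a split would have to happen in pandas before the conversion to an `H2OFrame`. A minimal sketch on toy data, where the cutoff date and the names `train_pd` and `test_pd` are assumptions:

```python
import pandas as pd

# Illustrative frame: ten consecutive days with the sales values shown earlier.
toy = pd.DataFrame({
    'date': pd.date_range('2013-01-01', periods=10, freq='D'),
    'sales': [13, 11, 14, 13, 10, 12, 10, 9, 12, 9],
})

cutoff = pd.Timestamp('2013-01-09')        # hypothetical cutoff date
train_pd = toy[toy['date'] < cutoff]       # everything before the cutoff
test_pd = toy[toy['date'] >= cutoff]       # the most recent rows
print(len(train_pd), len(test_pd))
```

Every training date then precedes every test date, which is closer to how a forecast is used in practice.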

In [99]:
train.head()

| sales | month | quoter | item_store_month_sales | store_item_weekday_sales | round_item_store_month_sales | round_store_item_weekday_sales |
|---|---|---|---|---|---|---|
| 13 | 1 | 1 | 13.7097 | 18.1686 | 14 | 18 |
| 11 | 1 | 1 | 13.7097 | 18.7931 | 14 | 19 |
| 13 | 1 | 1 | 13.7097 | 21.0153 | 14 | 21 |
| 10 | 1 | 1 | 13.7097 | 22.9732 | 14 | 23 |
| 12 | 1 | 1 | 13.7097 | 23.7969 | 14 | 24 |
| 10 | 1 | 1 | 13.7097 | 15.5846 | 14 | 16 |
| 9 | 1 | 1 | 13.7097 | 18.1686 | 14 | 18 |
| 12 | 1 | 1 | 13.7097 | 18.7931 | 14 | 19 |
| 9 | 1 | 1 | 13.7097 | 21.0153 | 14 | 21 |
| 7 | 1 | 1 | 13.7097 | 22.9732 | 14 | 23 |




## Run AutoML

In [14]:
aml = H2OAutoML(max_runtime_secs = 60, seed = 1, project_name = 'lb_frame')
aml.train(y = y, training_frame = train, leaderboard_frame = test)

AutoML progress: |████████████████████████████████████████████████████████| 100%


## Leaderboard

In [15]:
aml.leaderboard.head()

| model_id | mean_residual_deviance | rmse | mse | mae | rmsle |
|---|---|---|---|---|---|
| DRF_0_AutoML_20180920_131306 | 62.4272 | 7.90109 | 62.4272 | 6.06351 | 0.17414 |




## Predict Using Leader Model

To generate predictions on a test set, call `predict()` either on the `H2OAutoML` object directly or on its leader model (`aml.leader`).

In [16]:
pred = aml.predict(test)
pred.head()

drf prediction progress: |████████████████████████████████████████████████| 100%


predict
13.5924
13.5924
10.0629
10.7422
17.3786
10.7422
12.7469
14.8313
12.9752
13.2621




In [17]:
perf = aml.leader.model_performance(test)
perf


ModelMetricsRegression: drf
** Reported on test data. **

MSE: 62.42715974167017
RMSE: 7.901085984956129
MAE: 6.063511321128079
RMSLE: 0.17414034032296777
Mean Residual Deviance: 62.42715974167017
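
In the report above, RMSE is the square root of MSE (7.9011² ≈ 62.427), and the mean residual deviance has the same value as MSE. As a reference for how these figures relate, the sketch below recomputes the same kinds of metrics with NumPy on made-up numbers. The function name `regression_metrics` is hypothetical, and the RMSLE line assumes the usual `log(1 + x)` definition.

```python
import numpy as np

def regression_metrics(actual, predicted):
    """Return MSE, RMSE, MAE and RMSLE for two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    err = predicted - actual
    mse = np.mean(err ** 2)
    return {
        'mse': mse,
        'rmse': np.sqrt(mse),
        'mae': np.mean(np.abs(err)),
        # Assumed definition: root mean squared difference of log(1 + value).
        'rmsle': np.sqrt(np.mean((np.log1p(predicted) - np.log1p(actual)) ** 2)),
    }

# Made-up values for illustration only.
m = regression_metrics([10, 20, 30], [12, 18, 33])
print(m)
```

Because RMSLE works on logarithms, it penalizes relative rather than absolute errors, which is why it is so much smaller than RMSE in the report.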


