# H2O-3's AutoML vs Driverless AI (with Auto Feature Engineering)

Load the same data set into H2O-3's AutoML and Driverless AI, and compare whether Driverless AI's feature engineering is 'squeezing' out more performance. First evaluate the train and test MAE on the raw data set in AutoML, then feed the feature-engineered data set from Driverless AI back into AutoML and evaluate it the same way.
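
For reference, MAE (mean absolute error) is the average of the absolute differences between actual and predicted values. The following is a minimal pure-Python sketch of the metric. The function name `mean_absolute_error` and the toy values are illustrative only; they are not part of this notebook's data or of H2O's API.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between paired actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one observation is required")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Toy values for illustration only (not taken from the loan loss data).
toy_actual = [0, 0, 1, 16]
toy_predicted = [1, 0, 3, 10]
toy_mae = mean_absolute_error(toy_actual, toy_predicted)
```

Because MAE is in the same units as the target, a lower value means the predicted `loss` is closer to the actual `loss` on average.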


In [1]:
#Import H2O libraries
import h2o
from h2o.automl import H2OAutoML

In [2]:
#Initialize local H2O Cluster
h2o.init(min_mem_size='8G', name='cluster-buster')

Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.
Attempting to start a local H2O server...
  Java Version: openjdk version "1.8.0_152-release"; OpenJDK Runtime Environment (build 1.8.0_152-release-1056-b12); OpenJDK 64-Bit Server VM (build 25.152-b12, mixed mode)
  Starting server from /Users/thomasott/opt/anaconda3/envs/py37/lib/python3.7/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /var/folders/9c/tqhnrz3x207bf8pzjxjhkm100000gn/T/tmp99hg826j
  JVM stdout: /var/folders/9c/tqhnrz3x207bf8pzjxjhkm100000gn/T/tmp99hg826j/h2o_thomasott_started_from_python.out
  JVM stderr: /var/folders/9c/tqhnrz3x207bf8pzjxjhkm100000gn/T/tmp99hg826j/h2o_thomasott_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.


H2O cluster uptime: 01 secs
H2O cluster timezone: America/New_York
H2O data parsing timezone: UTC
H2O cluster version: 3.28.0.2
H2O cluster version age: 1 month and 4 days
H2O cluster name: cluster-buster
H2O cluster total nodes: 1
H2O cluster free memory: 7.667 Gb
H2O cluster total cores: 8
H2O cluster allowed cores: 8


In [3]:
#Load in Loan Loss Training Set
#train = h2o.import_file('./data/train_v2.csv')
my_data = h2o.import_file('./data/train_v2.csv')

Parse progress: |█████████████████████████████████████████████████████████| 100%


In [4]:
train,test = my_data.split_frame(ratios=[.7])
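
The split above passes no `seed`, so the rows that land in `train` and `test` can differ from run to run (H2O's `split_frame` does accept a `seed` argument). The sketch below illustrates, in plain Python, how a seeded per-row random assignment gives a repeatable, approximately 70/30 split. It only illustrates the idea under that assumption and is not H2O's actual implementation; `split_rows` is a hypothetical helper.

```python
import random


def split_rows(n_rows, train_ratio=0.7, seed=1234):
    """Assign each row index to train or test using a seeded random draw."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for i in range(n_rows):
        # Each row goes to train with probability train_ratio, so the
        # resulting sizes are approximately (not exactly) 70/30.
        if rng.random() < train_ratio:
            train_idx.append(i)
        else:
            test_idx.append(i)
    return train_idx, test_idx


train_a, test_a = split_rows(1000)
train_b, test_b = split_rows(1000)  # same seed, so the same split
```

Fixing the seed makes each run repeatable, although it does not by itself guarantee that the raw and the munged data sets end up with the same rows in their test sets.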

In [5]:
# Set target and predictor variables
y = "loss"
x = train.col_names
x.remove(y)

#Drop the ID column
x.remove("id")

In [6]:
#Call H2O-3 AutoML. Set early stopping metric to MAE and CV Folds = 5
aml = H2OAutoML(max_models = 10, seed=1234, stopping_metric = "MAE", sort_metric = "MAE", nfolds = 5)
aml.train(x=x, y=y, training_frame=train)

AutoML progress: |████████████████████████████████████████████████████████| 100%


In [7]:
#Example LeaderBoard (Note: my small CPU can only train a single model, run this on a larger machine!)
lb = aml.leaderboard
lb.head(rows=lb.nrows)

| model_id | mae | mean_residual_deviance | rmse | mse | rmsle |
|---|---|---|---|---|---|
| XGBoost_3_AutoML_20200225_115914 | 1.46113 | 19.8157 | 4.45148 | 19.8157 | 0.728169 |
| XGBoost_1_AutoML_20200225_115914 | 1.46444 | 19.94 | 4.46543 | 19.94 | 0.734998 |
| XGBoost_2_AutoML_20200225_115914 | 1.47659 | 19.9842 | 4.47037 | 19.9842 | |
| StackedEnsemble_AllModels_AutoML_20200225_115914 | 1.48171 | 19.7464 | 4.44369 | 19.7464 | |
| StackedEnsemble_BestOfFamily_AutoML_20200225_115914 | 1.48382 | 19.7672 | 4.44603 | 19.7672 | |
| GBM_2_AutoML_20200225_115914 | 1.48416 | 19.9109 | 4.46216 | 19.9109 | |
| GLM_1_AutoML_20200225_115914 | 1.48791 | 19.7957 | 4.44924 | 19.7957 | |
| GBM_5_AutoML_20200225_115914 | 1.48796 | 19.8627 | 4.45676 | 19.8627 | |
| GBM_3_AutoML_20200225_115914 | 1.48815 | 19.9096 | 4.46202 | 19.9096 | |
| GBM_1_AutoML_20200225_115914 | 1.48846 | 20.1486 | 4.48872 | 20.1486 | |




Select the AutoML leader and score it on the test data set.

In [8]:
best_model = aml.leader.model_performance(test).mae()

Display the leader's test MAE, which is stored in 'best_model'.

In [9]:
best_model

1.393881910643572

In [10]:
#Import Driverless AI transformed features from original training set
my_data_munged = h2o.import_file('./data/train_v2.zip.1575494585.3482676.bin.munged_train.csv')

Parse progress: |█████████████████████████████████████████████████████████| 100%


In [11]:
train,test = my_data_munged.split_frame(ratios=[.7])

In [12]:
#Check out the munged dataset
train

| 306_f39 | 380_f46 | 407_f493 | 783_ClusterDist50:f170:f517:f766.5 | 783_ClusterDist50:f170:f517:f766.6 | 783_ClusterDist50:f170:f517:f766.7 | 783_ClusterDist50:f170:f517:f766.9 | 783_ClusterDist50:f170:f517:f766.10 | loss |
|---|---|---|---|---|---|---|---|---|
| 0.7471 | 0.80819 | 124.0 | 3.04902 | 3.70761 | 2.21696 | 3.53115 | 2.47696 | 0 |
| 0.77405 | 0.8207 | 903.0 | 5.23919 | 3.15662 | 1.08105 | 5.94185 | 4.17258 | 0 |
| 0.78385 | 0.86382 | 130.94 | 4.65057 | 2.7428 | 1.28935 | 5.13865 | 3.25539 | 0 |
| 0.79085 | 0.82485 | 399.0 | 1.9061 | 4.80431 | 3.01405 | 2.78507 | 2.98162 | 0 |
| 0.7269 | 0.89431 | 836.75 | 3.71168 | 3.10235 | 2.00807 | 4.07024 | 2.66102 | 1 |
| 0.7995 | 0.88271 | 82.0 | 4.05847 | 3.72011 | 1.46525 | 4.88991 | 3.85273 | 0 |
| 0.78255 | 0.8171 | 655.99 | 3.82943 | 2.81248 | 2.98224 | 3.56421 | 1.55374 | 0 |
| 0.8034 | 0.93131 | 299.0 | 5.03231 | 1.61661 | 2.9867 | 4.81117 | 2.60366 | 16 |
| 0.82245 | 0.9 | 42.97 | 3.60227 | 3.23619 | 1.95323 | 4.02183 | 2.68071 | 0 |
| 0.76405 | 0.80422 | 785.0 | 4.57044 | 2.06931 | 3.03125 | 4.30213 | 2.27379 | 0 |




In [13]:
# Set target and predictor variables
y = "loss"
x = train.col_names
x.remove(y)

In [14]:
#Run another AutoML with munged data set
aml_munge = H2OAutoML(max_models = 10, seed=1234, stopping_metric = "MAE", sort_metric = "MAE", nfolds = 5)
aml_munge.train(x=x, y=y, training_frame=train)

AutoML progress: |████████████████████████████████████████████████████████| 100%


In [15]:
#check out top models (note the munged data set makes training faster)
lb = aml_munge.leaderboard
lb.head(rows=lb.nrows)

| model_id | mae | mean_residual_deviance | rmse | mse | rmsle |
|---|---|---|---|---|---|
| DeepLearning_grid__2_AutoML_20200225_140150_model_1 | 0.951571 | 19.0482 | 4.36442 | 19.0482 | |
| DeepLearning_grid__1_AutoML_20200225_140150_model_1 | 0.963559 | 19.228 | 4.38497 | 19.228 | |
| DeepLearning_1_AutoML_20200225_140150 | 1.21543 | 18.7741 | 4.33291 | 18.7741 | |
| XGBoost_grid__1_AutoML_20200225_140150_model_1 | 1.38965 | 18.7391 | 4.32887 | 18.7391 | 0.716022 |
| XGBoost_3_AutoML_20200225_140150 | 1.41063 | 18.6989 | 4.32422 | 18.6989 | |
| XGBoost_grid__1_AutoML_20200225_140150_model_2 | 1.41079 | 18.6808 | 4.32213 | 18.6808 | 0.717532 |
| XGBoost_1_AutoML_20200225_140150 | 1.41265 | 18.8048 | 4.33645 | 18.8048 | 0.728388 |
| XGBoost_grid__1_AutoML_20200225_140150_model_3 | 1.4195 | 19.0051 | 4.35948 | 19.0051 | 0.726911 |
| StackedEnsemble_BestOfFamily_AutoML_20200225_140150 | 1.42807 | 18.6226 | 4.31539 | 18.6226 | 0.720902 |
| StackedEnsemble_AllModels_AutoML_20200225_140150 | 1.4284 | 18.6249 | 4.31566 | 18.6249 | 0.721562 |




In [16]:
best_model = aml_munge.leader.model_performance(test).mae()

In [17]:
best_model

6.927857999672322
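
With `nfolds = 5` and no separate leaderboard frame, the leaderboard MAE values are cross-validated on the training split, while `best_model` holds the MAE on the held-out test split, so the two can be set side by side. The sketch below does that with the numbers printed in this notebook. `generalization_ratio` is a hypothetical helper, and what counts as a worrying ratio is a judgment call rather than an H2O rule.

```python
def generalization_ratio(cv_mae, test_mae):
    """Ratio of test MAE to cross-validated MAE.

    Values far above 1 suggest the model does much worse on held-out rows
    than cross-validation implied.
    """
    if cv_mae <= 0:
        raise ValueError("cv_mae must be positive")
    return test_mae / cv_mae


# Values copied from the leaderboards and test scores in this notebook.
runs = {
    "raw": {"cv_mae": 1.46113, "test_mae": 1.393881910643572},
    "munged": {"cv_mae": 0.951571, "test_mae": 6.927857999672322},
}

ratios = {
    name: generalization_ratio(r["cv_mae"], r["test_mae"])
    for name, r in runs.items()
}
for name, ratio in ratios.items():
    print(f"{name}: test MAE is {ratio:.2f}x the cross-validated MAE")
```

On these numbers the raw-data leader scores about the same on the test split as in cross-validation, while the munged-data leader's test MAE is roughly seven times its cross-validated MAE. The lower leaderboard MAE on the munged data therefore does not, on its own, show that the engineered features generalize better. The two runs also use different unseeded splits, so their test scores are not strictly comparable.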

In [18]:
h2o.cluster().shutdown()

H2O session _sid_b599 closed.
