# Application of AutoML

<u>**Contents**</u>
1. Introduction to AutoML
2. Initial setup
3. Import data to H2O
4. Build Models & Display Leaderboard results
5. Save Submission file

## 1. Introduction to AutoML
H2O AutoML is an automated machine learning meta-algorithm that is part of the [H2O software library](http://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/intro.html#what-is-h2o). (It should not be confused with [H2O DriverlessAI](https://www.h2o.ai/products/h2o-driverless-ai/), which is a commercial product built from an entirely different code base.) H2O’s AutoML automates the machine learning workflow by training and tuning many models within a user-specified time limit. It then automatically trains two Stacked Ensembles on collections of the individual models: one based on all previously trained models, and another on the best model of each family. In most cases, these ensembles are the top-performing models on the AutoML Leaderboard.

The H2O Python module is not intended as a replacement for other popular machine learning frameworks such as scikit-learn, pylearn2, and their ilk, but is intended to bring H2O to a wider audience of data and machine learning devotees who work exclusively with Python.

H2O from Python is a tool for rapidly turning over models, doing data munging, and building applications in a fast, scalable environment without any of the mental anguish about parallelism and distribution of work.

## 2. Initial setup

Every new Python session begins by initializing a connection between the Python client and the H2O cluster.
By default, `h2o.init()` attempts to discover an H2O instance at localhost:54321. If it fails to find a running H2O instance at this address, it looks for an *h2o jar* in several possible locations and starts a local server from it. If no jar is found, an *H2OStartupError* is raised.
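
The discovery step described above starts with checking whether anything is listening at the default address. A minimal sketch of such a probe using only the Python standard library is shown here; the helper name `is_port_open` is hypothetical and not part of the H2O API, and an open port does not prove that the listener is actually H2O.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        # create_connection raises an OSError subclass on refusal or timeout
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# 54321 is the default port mentioned above
print("Something is listening on localhost:54321:", is_port_open("localhost", 54321))
```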

After making a successful connection, we can obtain a high-level summary of the cluster status:

In [2]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import h2o
print(h2o.__version__)
from h2o.automl import H2OAutoML

h2o.init(max_mem_size='16G')

3.30.0.4
Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.
Attempting to start a local H2O server...
  Java Version: openjdk version "11.0.7" 2020-04-14; OpenJDK Runtime Environment (build 11.0.7+10-post-Ubuntu-2ubuntu218.04); OpenJDK 64-Bit Server VM (build 11.0.7+10-post-Ubuntu-2ubuntu218.04, mixed mode, sharing)
  Starting server from /opt/conda/lib/python3.7/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /tmp/tmpw2n55vbb
  JVM stdout: /tmp/tmpw2n55vbb/h2o_unknownUser_started_from_python.out
  JVM stderr: /tmp/tmpw2n55vbb/h2o_unknownUser_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.


H2O_cluster_uptime:,02 secs
H2O_cluster_timezone:,Etc/UTC
H2O_data_parsing_timezone:,UTC
H2O_cluster_version:,3.30.0.4
H2O_cluster_version_age:,2 months and 11 days
H2O_cluster_name:,H2O_from_python_unknownUser_osul4c
H2O_cluster_total_nodes:,1
H2O_cluster_free_memory:,16 Gb
H2O_cluster_total_cores:,4
H2O_cluster_allowed_cores:,4


## 3. Import data to H2O

Data in H2O is compressed and held in the JVM heap (i.e. data is “in memory”), not in the local memory of the Python process. The H2OFrame is an iterable (supporting list comprehensions).

In [3]:
train = h2o.import_file("../input/melanoma-train-test-creator/train_meta_size_3.csv")
test = h2o.import_file("../input/melanoma-train-test-creator/test_meta_size_3.csv")

Parse progress: |█████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%


Now, let us print a few rows from the training set.

In [4]:
train.head()

sex,age_approx,anatom_site_general_challenge,w,h,pred,fold,target
1,45,0,6000,4000,0.018716,0,0
0,45,5,6000,4000,0.020294,0,0
0,50,1,1872,1053,0.029681,4,0
0,45,0,1872,1053,0.0233586,0,0
0,55,5,6000,4000,0.0240381,0,0
0,40,1,6000,4000,0.028978,2,0
1,25,1,5184,3456,0.0425243,2,0
0,35,4,2592,1936,0.0217059,0,0
1,30,4,6000,4000,0.0274129,0,0
0,50,1,6000,4000,0.0253603,0,0




Now, let us print a few rows from the test set.

In [5]:
test.head()

sex,age_approx,anatom_site_general_challenge,w,h,pred
1,70,4,6000,4000,0.0273591
1,40,1,6000,4000,0.0257988
0,55,4,6000,4000,0.0259827
0,50,4,6000,4000,0.024942
0,45,1,1920,1080,0.0325687
1,50,1,1872,1053,0.0297833
1,45,5,1872,1053,0.037029
1,50,1,1920,1080,0.0389982
0,45,4,1920,1080,0.0416879
1,65,1,6000,4000,0.0330207




## 4. Build Models & Display Leaderboard results

Now, let us set the response column and the predictor columns used to train AutoML.
The Automatic Machine Learning (AutoML) function automates the supervised machine learning model training process. The current version of AutoML trains and cross-validates:
- a Random Forest (DRF),
- an Extremely-Randomized Forest (DRF/XRT),
- a random grid of Generalized Linear Models (GLM),
- a random grid of XGBoost (XGBoost),
- a random grid of Gradient Boosting Machines (GBM),
- a random grid of Deep Neural Nets (DeepLearning), and
- 2 Stacked Ensembles, one of all the models, and one of only the best models of each kind.

The `aml.train` method begins an AutoML task, a background task that automatically builds a number of models with various algorithms and tracks their performance in a leaderboard. At any point in the process you may use H2O’s performance or prediction functions on the resulting models.
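
The predictor list has to leave out the response and the fold column, otherwise the models could train on them. As a rough sketch in plain Python, the selection can be written out explicitly; the column names are copied from the `train.head()` and `test.head()` outputs above, and the variable names are only illustrative.

```python
# Column names as shown in the train.head() and test.head() outputs
train_columns = ["sex", "age_approx", "anatom_site_general_challenge",
                 "w", "h", "pred", "fold", "target"]
test_columns = ["sex", "age_approx", "anatom_site_general_challenge",
                "w", "h", "pred"]

response = "target"
fold_column = "fold"

# Keep every training column that is neither the response nor the fold id
predictors = [c for c in train_columns if c not in (response, fold_column)]

print(predictors)
```

For this dataset the result is identical to the test set's column list, which is why passing `test.columns` as `x` works in the cell that follows.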

In [8]:
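# Predictors: every column of the test set (it has no 'fold' or 'target' column)
x = test.columns
# Response column
y = 'target'

# For binary classification, response should be a factor
train[y] = train[y].asfactor()

# Stop after 120 seconds or 10000 models, whichever limit is reached first
aml = H2OAutoML(max_models=10000, seed=47, max_runtime_secs=120)
# Use the precomputed 'fold' column to define the cross-validation splits
aml.train(x=x, y=y, training_frame=train, fold_column="fold")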

AutoML progress: |
05:42:21.219: Fold column fold will be used for cross-validation. nfolds parameter will be ignored.

████████████████████████████████████████████████████████| 100%
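
The log message above says that the `fold` column defines the cross-validation splits. Conceptually, each distinct fold value takes one turn as the holdout set while the remaining rows are used for training. A small sketch of that idea in plain Python, assuming this standard k-fold interpretation (the function name is illustrative and this is not H2O's internal code):

```python
from collections import defaultdict


def splits_from_fold_column(folds):
    """Yield (fold_id, train_rows, holdout_rows) for each distinct fold id."""
    rows_by_fold = defaultdict(list)
    for row_index, fold_id in enumerate(folds):
        rows_by_fold[fold_id].append(row_index)
    for fold_id in sorted(rows_by_fold):
        holdout_rows = rows_by_fold[fold_id]
        train_rows = [i for i, f in enumerate(folds) if f != fold_id]
        yield fold_id, train_rows, holdout_rows


# Fold ids of the first ten training rows shown in the train.head() output
toy_folds = [0, 0, 4, 0, 0, 2, 2, 0, 0, 0]
for fold_id, train_rows, holdout_rows in splits_from_fold_column(toy_folds):
    print(fold_id, len(train_rows), holdout_rows)
```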


In [9]:
# View the AutoML Leaderboard
lb = aml.leaderboard
lb.head(rows=lb.nrows)  # Print all rows instead of default (10 rows)

model_id,auc,logloss,aucpr,mean_per_class_error,rmse,mse
XGBoost_grid__1_AutoML_20200813_054221_model_2,0.911987,0.060198,0.298916,0.349583,0.120039,0.0144093
GBM_grid__1_AutoML_20200813_054221_model_1,0.910542,0.061322,0.286562,0.317242,0.121066,0.014657
XGBoost_grid__1_AutoML_20200813_054221_model_1,0.908936,0.0617488,0.285684,0.355685,0.121306,0.0147152
StackedEnsemble_BestOfFamily_AutoML_20200813_054221,0.901772,0.0700141,0.292884,0.339626,0.123204,0.0151793
GBM_5_AutoML_20200813_054221,0.896781,0.0667642,0.271831,0.367002,0.122653,0.0150439
GBM_2_AutoML_20200813_054221,0.894854,0.0682731,0.25578,0.354676,0.123364,0.0152186
GBM_1_AutoML_20200813_054221,0.894349,0.0681951,0.27625,0.342328,0.12276,0.0150701
GBM_3_AutoML_20200813_054221,0.889017,0.0676689,0.269545,0.366842,0.122509,0.0150086
GBM_4_AutoML_20200813_054221,0.875091,0.0677697,0.266501,0.352063,0.122588,0.0150277
GLM_1_AutoML_20200813_054221,0.870531,0.0668567,0.284015,0.342258,0.122103,0.0149091
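
The leaderboard above is ordered by `auc`. As a reminder of what that number measures: AUC is the probability that a randomly chosen positive row receives a higher score than a randomly chosen negative row, with ties counted as one half. The brute-force sketch below illustrates this definition on made-up scores; it is not how H2O computes the metric, and the function name is hypothetical.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative row")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Made-up labels and scores, for illustration only
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(toy_labels, toy_scores))
```

In the toy data, three of the four positive/negative pairs are ordered correctly.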




Let us now inspect the leader model and its details.

In [10]:
# The leader model is stored here
aml.leader

Model Details
H2OXGBoostEstimator :  XGBoost
Model Key:  XGBoost_grid__1_AutoML_20200813_054221_model_2


Model Summary: 


number_of_trees
29.0




ModelMetricsBinomial: xgboost
** Reported on train data. **

MSE: 0.013191819431686588
RMSE: 0.11485564605924511
LogLoss: 0.053787350190107226
Mean Per-Class Error: 0.13406561823442575
AUC: 0.9388228740334933
AUCPR: 0.4019694101070116
Gini: 0.8776457480669866
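
Two of the numbers above follow directly from others: RMSE is the square root of MSE, and the Gini coefficient equals `2 * AUC - 1`. A quick check against the training metrics just printed:

```python
import math

# Values copied from the training metrics above
mse = 0.013191819431686588
auc = 0.9388228740334933

rmse = math.sqrt(mse)   # root mean squared error
gini = 2 * auc - 1      # Gini coefficient derived from AUC

print(rmse, gini)
```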

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.18720608353614807: 


Act/Pred,0,1,Error,Rate
0,31810.0,301.0,0.0094,(301.0/32111.0)
1,346.0,235.0,0.5955,(346.0/581.0)
Total,32156.0,536.0,0.0198,(647.0/32692.0)
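
The `Error` column and the max-F1 value can be recomputed from the four counts in this confusion matrix. A short sketch using the training-data counts above, with class 1 treated as the positive class:

```python
# Counts from the training confusion matrix (rows = actual, columns = predicted)
tn, fp = 31810, 301   # actual 0: predicted 0, predicted 1
fn, tp = 346, 235     # actual 1: predicted 0, predicted 1

error_class_0 = fp / (tn + fp)                   # share of actual 0 rows predicted as 1
error_class_1 = fn / (fn + tp)                   # share of actual 1 rows predicted as 0
error_total = (fp + fn) / (tn + fp + fn + tp)    # overall misclassification rate

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(round(error_class_0, 4), round(error_class_1, 4),
      round(error_total, 4), round(f1, 5))
```

Note that more than half of the actual positives are missed at this threshold, which the low overall error rate hides because positives make up under 2% of the rows.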



Maximum Metrics: Maximum metrics at their respective thresholds


metric,threshold,value,idx
max f1,0.187206,0.42077,162.0
max f2,0.083639,0.464608,235.0
max f0point5,0.37978,0.51041,95.0
max accuracy,0.484938,0.98492,69.0
max precision,0.869504,1.0,0.0
max recall,0.001103,1.0,394.0
max specificity,0.869504,1.0,0.0
max absolute_mcc,0.239419,0.416654,140.0
max min_per_class_accuracy,0.022986,0.858864,324.0
max mean_per_class_accuracy,0.016151,0.865934,338.0



Gains/Lift Table: Avg response rate:  1.78 %, avg score:  1.74 %


group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,score,cumulative_response_rate,cumulative_score,capture_rate,cumulative_capture_rate,gain,cumulative_gain
1,0.010002,0.284215,31.145563,31.145563,0.553517,0.506331,0.553517,0.506331,0.311532,0.311532,3014.556259,3014.556259
2,0.020005,0.158105,12.045245,21.595404,0.214067,0.208244,0.383792,0.357287,0.120482,0.432014,1104.52452,2059.54039
3,0.030007,0.11382,8.259597,17.150135,0.146789,0.133881,0.304791,0.282818,0.082616,0.51463,725.959671,1615.013483
4,0.04001,0.084697,6.366772,14.454294,0.11315,0.097686,0.256881,0.236535,0.063683,0.578313,536.677246,1345.429424
5,0.050073,0.069412,4.446751,12.443004,0.079027,0.077,0.221136,0.204472,0.04475,0.623064,344.675096,1144.300424
6,0.100024,0.037252,3.135599,7.794994,0.055726,0.049985,0.138532,0.127323,0.156627,0.77969,213.559935,679.499439
7,0.150006,0.023294,1.446314,5.679631,0.025704,0.029216,0.100938,0.094634,0.072289,0.851979,44.631402,467.963066
8,0.200018,0.014515,1.204525,4.560683,0.021407,0.018848,0.081052,0.075685,0.060241,0.91222,20.452452,356.068304
9,0.300012,0.005264,0.481957,3.201246,0.008565,0.008851,0.056892,0.053409,0.048193,0.960413,-51.80428,220.124637
10,0.400098,0.003332,0.189167,2.447766,0.003362,0.004106,0.043502,0.041076,0.018933,0.979346,-81.083327,144.77659
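
Each row of the gains/lift table compares the response rate inside a score-ranked group with the overall response rate. The sketch below reproduces the first row. The group size (327 rows) and its number of positives (181) are not printed in the table; they are back-calculated here from `cumulative_data_fraction` and `capture_rate`, so treat them as assumptions.

```python
# Totals taken from the confusion matrix above
total_rows, total_positives = 32692, 581
# Back-calculated for the top group (assumption, see the note above)
group_rows, group_positives = 327, 181

overall_rate = total_positives / total_rows     # the "avg response rate"
response_rate = group_positives / group_rows    # positives within the group
lift = response_rate / overall_rate             # how much better than random
capture_rate = group_positives / total_positives
gain = 100 * (lift - 1)                         # lift expressed as a percentage gain

print(round(lift, 6), round(response_rate, 6), round(capture_rate, 6), round(gain, 3))
```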




ModelMetricsBinomial: xgboost
** Reported on cross-validation data. **

MSE: 0.014409338874733614
RMSE: 0.12003890567117652
LogLoss: 0.06019796648484496
Mean Per-Class Error: 0.16491678419055333
AUC: 0.9119868789902668
AUCPR: 0.298916466670715
Gini: 0.8239737579805335

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.25084608793258667: 


Act/Pred,0,1,Error,Rate
0,31878.0,233.0,0.0073,(233.0/32111.0)
1,402.0,179.0,0.6919,(402.0/581.0)
Total,32280.0,412.0,0.0194,(635.0/32692.0)



Maximum Metrics: Maximum metrics at their respective thresholds


metric,threshold,value,idx
max f1,0.250846,0.360524,137.0
max f2,0.08754,0.413167,233.0
max f0point5,0.309096,0.422725,116.0
max accuracy,0.629635,0.983268,38.0
max precision,0.865087,0.8,2.0
max recall,0.000845,1.0,397.0
max specificity,0.894473,0.999969,0.0
max absolute_mcc,0.263918,0.357086,131.0
max min_per_class_accuracy,0.018471,0.828906,335.0
max mean_per_class_accuracy,0.016134,0.835083,341.0



Gains/Lift Table: Avg response rate:  1.78 %, avg score:  1.83 %


group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,score,cumulative_response_rate,cumulative_score,capture_rate,cumulative_capture_rate,gain,cumulative_gain
1,0.010002,0.300992,27.015764,27.015764,0.480122,0.504082,0.480122,0.504082,0.270224,0.270224,2601.576424,2601.576424
2,0.020005,0.169704,9.980346,18.498055,0.17737,0.22216,0.328746,0.363121,0.099828,0.370052,898.034602,1749.805513
3,0.030007,0.118732,7.399222,14.798444,0.131498,0.141864,0.262997,0.289369,0.07401,0.444062,639.922205,1379.84441
4,0.04001,0.090969,6.882997,12.819582,0.122324,0.103243,0.227829,0.242837,0.068847,0.512909,588.299726,1181.958239
5,0.050012,0.073296,4.129798,11.081626,0.073394,0.081368,0.196942,0.210543,0.041308,0.554217,312.979835,1008.162558
6,0.100024,0.037829,3.028519,7.055072,0.053823,0.052002,0.125382,0.131273,0.151463,0.70568,202.851879,605.507219
7,0.150006,0.024603,1.584058,5.232145,0.028152,0.030486,0.092985,0.097691,0.079174,0.784854,58.405821,423.214461
8,0.200049,0.01573,1.444546,4.284666,0.025672,0.019763,0.076147,0.078197,0.072289,0.857143,44.454591,328.466579
9,0.300043,0.006331,0.791787,3.12061,0.014072,0.010042,0.055459,0.055483,0.079174,0.936317,-20.821318,212.06102
10,0.400006,0.004175,0.223834,2.396693,0.003978,0.005016,0.042594,0.042871,0.022375,0.958692,-77.616569,139.669312




Cross-Validation Metrics Summary: 


metric,mean,sd,cv_1_valid,cv_2_valid,cv_3_valid,cv_4_valid,cv_5_valid
accuracy,0.9803273,0.0027140486,0.98011017,0.98178756,0.9800061,0.97620875,0.983524
auc,0.91692907,0.009671419,0.908659,0.91693956,0.9235938,0.92912954,0.90632343
aucpr,0.31594193,0.045517918,0.27431262,0.27157378,0.32654408,0.3250794,0.3821997
err,0.019672677,0.0027140486,0.01988984,0.018212426,0.019993896,0.023791252,0.016475972
err_count,128.6,17.472836,130.0,119.0,131.0,155.0,108.0
f0point5,0.41412586,0.06238799,0.39915967,0.39772728,0.40674603,0.34902596,0.5179704
f1,0.38129663,0.057962973,0.36893204,0.32,0.38497654,0.35684648,0.47572815
f2,0.3561895,0.061585728,0.3429603,0.26768643,0.36541888,0.36502546,0.43985638
lift_top_group,27.276306,3.65676,24.757576,23.896551,28.241379,26.379963,33.10606
logloss,0.060200468,0.0024591843,0.063315995,0.062013086,0.058073375,0.05997271,0.057627168
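
The `mean` and `sd` columns summarise the five per-fold values. For the `auc` row they can be reproduced with the standard library; the match suggests that `sd` is the sample standard deviation (n - 1 in the denominator).

```python
import statistics

# Per-fold AUC values from the auc row above (cv_1_valid .. cv_5_valid)
fold_auc = [0.908659, 0.91693956, 0.9235938, 0.92912954, 0.90632343]

mean_auc = statistics.mean(fold_auc)
sd_auc = statistics.stdev(fold_auc)   # sample standard deviation

print(round(mean_auc, 6), round(sd_auc, 6))
```

This per-fold mean (about 0.9169) differs slightly from the cross-validation AUC of 0.9120 reported earlier and on the leaderboard. A likely explanation is that the latter is computed once over the combined holdout predictions of all folds rather than averaged fold by fold.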



See the whole table with table.as_data_frame()

Scoring History: 


timestamp,duration,number_of_trees,training_rmse,training_logloss,training_auc,training_pr_auc,training_lift,training_classification_error
2020-08-13 05:43:38,24.902 sec,0.0,0.5,0.693147,0.5,0.017772,1.0,0.982228
2020-08-13 05:43:38,25.075 sec,5.0,0.153841,0.145807,0.861057,0.350381,28.134251,0.01973
2020-08-13 05:43:38,25.204 sec,10.0,0.118905,0.073205,0.917781,0.366021,29.768963,0.021045
2020-08-13 05:43:39,25.387 sec,15.0,0.116292,0.059169,0.927338,0.378239,30.021305,0.018904
2020-08-13 05:43:39,25.633 sec,20.0,0.115692,0.055896,0.931702,0.38719,29.941038,0.018965
2020-08-13 05:43:39,25.986 sec,25.0,0.115269,0.054652,0.935713,0.393811,30.272112,0.018843
2020-08-13 05:43:40,26.295 sec,29.0,0.114856,0.053787,0.938823,0.401969,31.145563,0.019791



Variable Importances: 


variable,relative_importance,scaled_importance,percentage
pred,1664.447266,1.0,0.726601
w,225.146362,0.135268,0.098286
age_approx,192.724884,0.115789,0.084133
h,110.730476,0.066527,0.048339
anatom_site_general_challenge,77.80275,0.046744,0.033964
sex,19.877613,0.011942,0.008677
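
The `scaled_importance` and `percentage` columns are both derived from `relative_importance`: the first divides by the largest value, the second by the sum of all values. A quick check with the numbers above:

```python
# relative_importance values copied from the table above
relative_importance = {
    "pred": 1664.447266,
    "w": 225.146362,
    "age_approx": 192.724884,
    "h": 110.730476,
    "anatom_site_general_challenge": 77.80275,
    "sex": 19.877613,
}

largest = max(relative_importance.values())
total = sum(relative_importance.values())

# Scale by the most important variable, and by the total
scaled = {name: value / largest for name, value in relative_importance.items()}
percentage = {name: value / total for name, value in relative_importance.items()}

for name in relative_importance:
    print(name, round(scaled[name], 6), round(percentage[name], 6))
```

The `pred` column accounts for roughly 73% of the total importance, so the model leans mostly on that one feature.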




Let us now predict on the test set.

In [11]:
preds = aml.predict(test)

xgboost prediction progress: |████████████████████████████████████████████| 100%


We now display a few predictions.

In [13]:
preds

predict,p0,p1
0,0.999197,0.000803167
0,0.999098,0.00090187
0,0.9992,0.000800428
0,0.999205,0.000795351
0,0.995831,0.00416911
0,0.997427,0.00257268
0,0.959241,0.040759
0,0.981264,0.018736
0,0.979788,0.020212
0,0.998572,0.00142819
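
Each row gives the two class probabilities, which sum to 1 up to rounding, plus a hard label in `predict`. The label comes from comparing `p1` with a decision threshold. The sketch below assumes that threshold is the cross-validation max-F1 threshold printed earlier; which threshold H2O actually applies can depend on the version and on which metrics are available, so this is an assumption rather than a statement of the library's behaviour.

```python
# Assumption: the max-F1 threshold from the cross-validation metrics above
ASSUMED_THRESHOLD = 0.25084608793258667

# (p0, p1) pairs copied from the predictions shown above
probabilities = [
    (0.999197, 0.000803167), (0.999098, 0.00090187),
    (0.9992, 0.000800428), (0.999205, 0.000795351),
    (0.995831, 0.00416911), (0.997427, 0.00257268),
    (0.959241, 0.040759), (0.981264, 0.018736),
    (0.979788, 0.020212), (0.998572, 0.00142819),
]


def to_label(p1, threshold=ASSUMED_THRESHOLD):
    """Return 1 when the positive-class probability reaches the threshold."""
    return 1 if p1 >= threshold else 0


labels = [to_label(p1) for _, p1 in probabilities]
print(labels)
```

For the competition submission only the `p1` column matters, since the score is based on the predicted probability rather than the hard label.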




## 5. Save Submission file
We now create our submission file to be uploaded to Kaggle. **This prediction generated a Leaderboard score of 0.9395.**

In [15]:
sample_submission = pd.read_csv('../input/siim-isic-melanoma-classification/sample_submission.csv')

sample_submission['target'] = preds['p1'].as_data_frame().values
sample_submission.to_csv('submission h2o.csv', index=False)
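
Because the predictions are copied into the submission by position, it is worth confirming that the number of predictions matches the number of rows in the sample submission before writing the file. A standard-library sketch of that check follows; the id column name `image_name`, the two example ids, and the helper name `build_submission` are placeholders and are not taken from the competition files.

```python
import csv
import io


def build_submission(ids, p1_values, id_column="image_name"):
    """Return CSV text pairing each id with its predicted probability."""
    if len(ids) != len(p1_values):
        raise ValueError(
            f"{len(ids)} ids but {len(p1_values)} predictions; rows would be misaligned"
        )
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([id_column, "target"])
    for row_id, p1 in zip(ids, p1_values):
        writer.writerow([row_id, p1])
    return buffer.getvalue()


# Placeholder ids with the first two p1 values from the predictions above
print(build_submission(["id_0", "id_1"], [0.000803167, 0.00090187]))
```
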

## 6. References
* http://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html
* http://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/intro.html#what-is-h2o