# H2O Machine Learning

### Installation
Before installing H2O itself, a few dependencies are required. Install them with pip install. On some systems, superuser privileges may be required; in that case, prefix the pip install command with sudo.

In [3]:
!pip install requests
!pip install tabulate
!pip install scikit-learn
!pip install colorama
!pip install future

Collecting tabulate
  Downloading https://files.pythonhosted.org/packages/12/c2/11d6845db5edf1295bc08b2f488cf5937806586afe42936c3f34c097ebdc/tabulate-0.8.2.tar.gz (45kB)
Building wheels for collected packages: tabulate
  Running setup.py bdist_wheel for tabulate: started
  Running setup.py bdist_wheel for tabulate: finished with status 'done'
  Stored in directory: C:\Users\Sandeep\AppData\Local\pip\Cache\wheels\2a\85\33\2f6da85d5f10614cbe5a625eab3b3aebfdf43e7b857f25f829
Successfully built tabulate
Installing collected packages: tabulate
Successfully installed tabulate-0.8.2
Collecting future
  Downloading https://files.pythonhosted.org/packages/90/52/e20466b85000a181e1e144fd8305caf2cf475e2f9674e797b222f8105f5f/future-0.17.1.tar.gz (829kB)
Building wheels for collected packages: future
  Running setup.py bdist_wheel for future: started
  Running setup.py bdist_wheel for future: finished with status 'done'
  Stored in directory: C:\Users\Sandeep\AppData\Local\pip\Cache\wheels\0c\61\d2\d

In [4]:
!pip install -f http://h2o-release.s3.amazonaws.com/h2o/latest_stable_Py.html h2o

Looking in links: http://h2o-release.s3.amazonaws.com/h2o/latest_stable_Py.html
Collecting h2o
  Downloading https://files.pythonhosted.org/packages/6e/e4/1b34202b4887f8187f72acaa178eb4ff87982a9583008c78e1929d8a5e23/h2o-3.22.0.2.tar.gz (120.6MB)
Building wheels for collected packages: h2o
  Running setup.py bdist_wheel for h2o: started
  Running setup.py bdist_wheel for h2o: finished with status 'done'
  Stored in directory: C:\Users\Sandeep\AppData\Local\pip\Cache\wheels\0d\17\52\9ea300738f719aca7b88a790ce94b8c928e7c6098e72627c7f
Successfully built h2o
Installing collected packages: h2o
Successfully installed h2o-3.22.0.2




In [5]:
import h2o
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321..... not found.
Attempting to start a local H2O server...
; Java HotSpot(TM) 64-Bit Server VM (build 25.151-b12, mixed mode)
  Starting server from C:\Users\Sandeep\Anaconda2\lib\site-packages\h2o\backend\bin\h2o.jar
  Ice root: c:\users\sandeep\appdata\local\temp\tmpgdwkd6
  JVM stdout: c:\users\sandeep\appdata\local\temp\tmpgdwkd6\h2o_Sandeep_started_from_python.out
  JVM stderr: c:\users\sandeep\appdata\local\temp\tmpgdwkd6\h2o_Sandeep_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321... successful.


H2O cluster uptime:,12 secs
H2O cluster timezone:,America/New_York
H2O data parsing timezone:,UTC
H2O cluster version:,3.22.0.2
H2O cluster version age:,24 days
H2O cluster name:,H2O_from_python_Sandeep_bwwipb
H2O cluster total nodes:,1
H2O cluster free memory:,1.314 Gb
H2O cluster total cores:,4
H2O cluster allowed cores:,4


### Data Import
Let’s import a dataset and train a model on it very quickly.

In [6]:
airlines_train_data = h2o.import_file("https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv")

Parse progress: |█████████████████████████████████████████████████████████| 100%


H2O downloads the dataset and parses it automatically, guessing the data type of each column along the way. H2O's type recognition is generally reliable, but each decision can be overridden manually if required. The imported dataset can also be given a name using the destination_frame argument.
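As a minimal sketch of overriding the parser's guesses and naming the frame, `import_file` accepts `destination_frame` and `col_types` arguments. The frame name `airlines_train`, the helper `import_airlines`, and the choice of columns to override are illustrative assumptions rather than part of the original notebook, and the call needs a cluster already started with `h2o.init()`.

```python
AIRLINES_URL = "https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv"

# Illustrative overrides (an assumption, not the tutorial's settings):
# FlightNum is an identifier, so it is treated as categorical here.
COLUMN_TYPES = {
    "FlightNum": "enum",
    "IsArrDelayed": "enum",
    "Distance": "numeric",
}


def import_airlines(frame_id="airlines_train"):
    """Import the airlines CSV under an explicit frame name with type overrides."""
    # Imported lazily because the call only works against a running H2O cluster.
    import h2o

    return h2o.import_file(
        path=AIRLINES_URL,
        destination_frame=frame_id,  # key under which the frame is stored
        col_types=COLUMN_TYPES,      # per-column overrides of the guessed types
    )
```

Calling `import_airlines()` after `h2o.init()` should return an H2OFrame stored under the chosen key instead of an auto-generated one.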

In [7]:
h2o.ls()

key
Key_Frame__https___s3_amazonaws_com_h2o_airlin...


A preview of the imported data can be displayed by typing the name of the variable that points to the H2OFrame, in this case airlines_train_data.

In [8]:
airlines_train_data

Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay,IsArrDelayed,IsDepDelayed
1987,10,14,3,741,730,912,849,PS,1451,,91,79,,23,11,SAN,SFO,447,,,0,,0,,,,,,YES,YES
1987,10,15,4,729,730,903,849,PS,1451,,94,79,,14,-1,SAN,SFO,447,,,0,,0,,,,,,YES,NO
1987,10,17,6,741,730,918,849,PS,1451,,97,79,,29,11,SAN,SFO,447,,,0,,0,,,,,,YES,YES
1987,10,18,7,729,730,847,849,PS,1451,,78,79,,-2,-1,SAN,SFO,447,,,0,,0,,,,,,NO,NO
1987,10,19,1,749,730,922,849,PS,1451,,93,79,,33,19,SAN,SFO,447,,,0,,0,,,,,,YES,YES
1987,10,21,3,728,730,848,849,PS,1451,,80,79,,-1,-2,SAN,SFO,447,,,0,,0,,,,,,NO,NO
1987,10,22,4,728,730,852,849,PS,1451,,84,79,,3,-2,SAN,SFO,447,,,0,,0,,,,,,YES,NO
1987,10,23,5,731,730,902,849,PS,1451,,91,79,,13,1,SAN,SFO,447,,,0,,0,,,,,,YES,YES
1987,10,24,6,744,730,908,849,PS,1451,,84,79,,19,14,SAN,SFO,447,,,0,,0,,,,,,YES,YES
1987,10,25,7,729,730,851,849,PS,1451,,82,79,,2,-1,SAN,SFO,447,,,0,,0,,,,,,YES,NO




### Model Training
With the data imported, a model can be built quickly. H2O offers many algorithms; this tutorial uses the widely known Gradient Boosting Machine (GBM) method. Let’s train a model that predicts whether a plane arrives late based on the month, the day of the week, and the distance the plane has to travel to its destination. GBM resides in the h2o.estimators.gbm module. The first step is to import H2OGradientBoostingEstimator so that the fully qualified name does not have to be typed later.

In [9]:
from h2o.estimators.gbm import H2OGradientBoostingEstimator

First, construct a new GBM estimator instance by calling gbm_model = H2OGradientBoostingEstimator(). Invoking gbm_model.train(...) then runs the gradient boosting algorithm on the data. There are many hyperparameters to experiment with, and every data scientist can explore them independently; overriding the defaults here would only make this tutorial more complicated. H2O needs to know only three things:

predictor columns,
response variable column,
training frame — a dataset to train the model on.

In [10]:
# Build a GBM with default hyperparameters and fit it on three predictor columns.
gbm_model = H2OGradientBoostingEstimator()
gbm_model.train(x=["Month", "DayOfWeek", "Distance"], y="IsArrDelayed", training_frame=airlines_train_data)

gbm Model Build progress: |███████████████████████████████████████████████| 100%


In [11]:
gbm_model

Model Details
H2OGradientBoostingEstimator :  Gradient Boosting Machine
Model Key:  GBM_model_python_1544977393365_1


ModelMetricsBinomial: gbm
** Reported on train data. **

MSE: 0.234966307798
RMSE: 0.484733233643
LogLoss: 0.660935092712
Mean Per-Class Error: 0.41540494848
AUC: 0.623775438846
pr_auc: 0.682065359816
Gini: 0.247550877693
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.43158830279: 


,NO,YES,Error,Rate
NO,604.0,18933.0,0.9691,(18933.0/19537.0)
YES,232.0,24209.0,0.0095,(232.0/24441.0)
Total,836.0,43142.0,0.4358,(19165.0/43978.0)


Maximum Metrics: Maximum metrics at their respective thresholds



metric,threshold,value,idx
max f1,0.4315883,0.7164228,368.0
max f2,0.3552172,0.8622405,395.0
max f0point5,0.5134535,0.6336807,279.0
max accuracy,0.5119928,0.5946610,280.0
max precision,0.9725341,1.0,0.0
max recall,0.3474692,1.0,397.0
max specificity,0.9725341,1.0,0.0
max absolute_mcc,0.6058388,0.1759478,151.0
max min_per_class_accuracy,0.5398879,0.5824639,235.0


Gains/Lift Table: Avg response rate: 55.58 %, avg score: 55.57 %



,group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,score,cumulative_response_rate,cumulative_score,capture_rate,cumulative_capture_rate,gain,cumulative_gain
,1,0.0100732,0.8970957,1.7059334,1.7059334,0.9480813,0.9110773,0.9480813,0.9110773,0.0171842,0.0171842,70.5933384,70.5933384
,2,0.0221247,0.8642732,1.6465782,1.6736022,0.9150943,0.8751877,0.9301131,0.8915280,0.0198437,0.0370279,64.6578244,67.3602218
,3,0.0302651,0.8302098,1.5430211,1.6384797,0.8575419,0.8474965,0.9105935,0.8796848,0.0125609,0.0495888,54.3021057,63.8479712
,4,0.0401110,0.7531671,1.4087318,1.5820847,0.7829099,0.7973865,0.8792517,0.8594834,0.0138701,0.0634589,40.8731759,58.2084665
,5,0.0514803,0.6960789,1.3495152,1.5307221,0.75,0.7118751,0.8507067,0.8268844,0.0153431,0.0788020,34.9515159,53.0722141
,6,0.1036882,0.6493719,1.3009263,1.4150179,0.7229965,0.6674533,0.7864035,0.7466095,0.0679187,0.1467207,30.0926344,41.5017942
,7,0.1503479,0.6180875,1.2249985,1.3560464,0.6807992,0.6351934,0.7536298,0.7120321,0.0571581,0.2038787,22.4998491,35.6046388
,8,0.2010096,0.6001114,1.1710335,1.3094164,0.6508079,0.6088750,0.7277149,0.6860327,0.0593265,0.2632053,17.1033501,30.9416443
,9,0.3004002,0.5785416,1.0785418,1.2330291,0.5994052,0.5889311,0.6852623,0.6539056,0.1071969,0.3704022,7.8541818,23.3029116



Scoring History: 


,timestamp,duration,number_of_trees,training_rmse,training_logloss,training_auc,training_pr_auc,training_lift,training_classification_error
,2018-12-16 11:28:13,0.116 sec,0.0,0.4968816,0.6869170,0.5,0.0,1.0,0.4442448
,2018-12-16 11:28:14,1.315 sec,1.0,0.4954874,0.6841017,0.5838815,0.6515037,1.6161761,0.4442448
,2018-12-16 11:28:15,1.698 sec,2.0,0.4943602,0.6818033,0.5847311,0.6524637,1.6552965,0.4442448
,2018-12-16 11:28:15,1.888 sec,3.0,0.4934351,0.6798920,0.5848234,0.6525991,1.6863896,0.4442448
,2018-12-16 11:28:15,2.062 sec,4.0,0.4926797,0.6783060,0.5851173,0.6531053,1.6572993,0.4442448
---,---,---,---,---,---,---,---,---,---
,2018-12-16 11:28:17,4.250 sec,38.0,0.4857586,0.6630447,0.6185145,0.6775247,1.7038126,0.4367184
,2018-12-16 11:28:17,4.300 sec,39.0,0.4856860,0.6628943,0.6189695,0.6779198,1.7105962,0.4369003
,2018-12-16 11:28:17,4.337 sec,40.0,0.4855816,0.6626826,0.6196851,0.6785724,1.7073866,0.4366956



See the whole table with table.as_data_frame()
Variable Importances: 


variable,relative_importance,scaled_importance,percentage
Distance,1379.9940185,1.0,0.5013133
Month,970.7394409,0.7034374,0.3526425
DayOfWeek,402.0243530,0.2913233,0.1460442
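The Error column of the confusion matrix and the max f1 value reported above can be cross-checked by hand from the four cell counts. The following is a small sketch in plain Python; the variable names are my own, and the counts are copied from the table (rows are actual classes, columns are predicted classes).

```python
# Counts copied from the confusion matrix at the max-F1 threshold.
tn, fp = 604.0, 18933.0   # actual NO:  predicted NO, predicted YES
fn, tp = 232.0, 24209.0   # actual YES: predicted NO, predicted YES

total = tn + fp + fn + tp

# Per-class and overall error rates (the "Error" column of the table).
error_no = fp / (tn + fp)
error_yes = fn / (fn + tp)
error_total = (fp + fn) / total

# Precision, recall and F1 for the YES class.
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print("error NO:    %.4f" % error_no)
print("error YES:   %.4f" % error_yes)
print("error total: %.4f" % error_total)
print("precision:   %.4f" % precision)
print("recall:      %.4f" % recall)
print("F1:          %.7f" % f1)
```

The three error rates round to the 0.9691, 0.0095 and 0.4358 shown in the table, and the F1 value agrees with the max f1 entry of 0.7164228. The high F1 comes almost entirely from recall: the model labels nearly every flight as delayed.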




Overall, this model is not expected to perform very well: the confusion matrix shows an error rate of about 97% for the NO class, meaning that almost every on-time flight is predicted as delayed. By experimenting with different GBM hyperparameters and including different predictors in the model, much better results can be achieved; tuning the model further is a natural next step for a data scientist. To get detailed information about the model and its scoring history, invoke the print(gbm_model) command. The output contains a table with the importance of each variable (relative, scaled, percentage) used in the model. A detailed scoring history is also available, as well as basic measures such as mean squared error (MSE). As you begin exploring H2O, the reference guide will walk you through all of H2O’s functionality. The output of cell In [12] shows a shortened example of this detailed view of the GBM model; some of the text was omitted.
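One way such experimentation might look is sketched here: a held-out validation split, a few explicit hyperparameters, and a wider set of predictors that are known before departure. The helper `train_tuned_gbm`, the hyperparameter values, and the predictor choice are illustrative assumptions, not tuned settings or the tutorial's own method.

```python
# Illustrative values only; they are not tuned and not taken from the tutorial.
GBM_PARAMS = {"ntrees": 100, "max_depth": 6, "learn_rate": 0.05, "seed": 42}

# Columns available before departure; delay columns are excluded on purpose
# because they would leak the answer.
PREDICTORS = ["Month", "DayOfWeek", "Distance", "UniqueCarrier",
              "Origin", "Dest", "CRSDepTime"]
RESPONSE = "IsArrDelayed"


def train_tuned_gbm(frame):
    """Train a GBM on 80% of `frame` and report AUC on the remaining 20%."""
    # Imported lazily because training needs a running H2O cluster.
    from h2o.estimators.gbm import H2OGradientBoostingEstimator

    train, valid = frame.split_frame(ratios=[0.8], seed=42)
    model = H2OGradientBoostingEstimator(**GBM_PARAMS)
    model.train(x=PREDICTORS, y=RESPONSE,
                training_frame=train, validation_frame=valid)
    return model, model.auc(valid=True)
```

With a running cluster, `model, valid_auc = train_tuned_gbm(airlines_train_data)` would give a validation AUC that can be compared against the training AUC of about 0.62 reported for the default model; whether it improves depends on the data and the settings chosen.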

In [12]:
print(gbm_model)

Model Details
H2OGradientBoostingEstimator :  Gradient Boosting Machine
Model Key:  GBM_model_python_1544977393365_1


ModelMetricsBinomial: gbm
** Reported on train data. **

MSE: 0.234966307798
RMSE: 0.484733233643
LogLoss: 0.660935092712
Mean Per-Class Error: 0.41540494848
AUC: 0.623775438846
pr_auc: 0.682065359816
Gini: 0.247550877693
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.43158830279: 


,NO,YES,Error,Rate
NO,604.0,18933.0,0.9691,(18933.0/19537.0)
YES,232.0,24209.0,0.0095,(232.0/24441.0)
Total,836.0,43142.0,0.4358,(19165.0/43978.0)


Maximum Metrics: Maximum metrics at their respective thresholds



metric,threshold,value,idx
max f1,0.4315883,0.7164228,368.0
max f2,0.3552172,0.8622405,395.0
max f0point5,0.5134535,0.6336807,279.0
max accuracy,0.5119928,0.5946610,280.0
max precision,0.9725341,1.0,0.0
max recall,0.3474692,1.0,397.0
max specificity,0.9725341,1.0,0.0
max absolute_mcc,0.6058388,0.1759478,151.0
max min_per_class_accuracy,0.5398879,0.5824639,235.0


Gains/Lift Table: Avg response rate: 55.58 %, avg score: 55.57 %



,group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,score,cumulative_response_rate,cumulative_score,capture_rate,cumulative_capture_rate,gain,cumulative_gain
,1,0.0100732,0.8970957,1.7059334,1.7059334,0.9480813,0.9110773,0.9480813,0.9110773,0.0171842,0.0171842,70.5933384,70.5933384
,2,0.0221247,0.8642732,1.6465782,1.6736022,0.9150943,0.8751877,0.9301131,0.8915280,0.0198437,0.0370279,64.6578244,67.3602218
,3,0.0302651,0.8302098,1.5430211,1.6384797,0.8575419,0.8474965,0.9105935,0.8796848,0.0125609,0.0495888,54.3021057,63.8479712
,4,0.0401110,0.7531671,1.4087318,1.5820847,0.7829099,0.7973865,0.8792517,0.8594834,0.0138701,0.0634589,40.8731759,58.2084665
,5,0.0514803,0.6960789,1.3495152,1.5307221,0.75,0.7118751,0.8507067,0.8268844,0.0153431,0.0788020,34.9515159,53.0722141
,6,0.1036882,0.6493719,1.3009263,1.4150179,0.7229965,0.6674533,0.7864035,0.7466095,0.0679187,0.1467207,30.0926344,41.5017942
,7,0.1503479,0.6180875,1.2249985,1.3560464,0.6807992,0.6351934,0.7536298,0.7120321,0.0571581,0.2038787,22.4998491,35.6046388
,8,0.2010096,0.6001114,1.1710335,1.3094164,0.6508079,0.6088750,0.7277149,0.6860327,0.0593265,0.2632053,17.1033501,30.9416443
,9,0.3004002,0.5785416,1.0785418,1.2330291,0.5994052,0.5889311,0.6852623,0.6539056,0.1071969,0.3704022,7.8541818,23.3029116



Scoring History: 


,timestamp,duration,number_of_trees,training_rmse,training_logloss,training_auc,training_pr_auc,training_lift,training_classification_error
,2018-12-16 11:28:13,0.116 sec,0.0,0.4968816,0.6869170,0.5,0.0,1.0,0.4442448
,2018-12-16 11:28:14,1.315 sec,1.0,0.4954874,0.6841017,0.5838815,0.6515037,1.6161761,0.4442448
,2018-12-16 11:28:15,1.698 sec,2.0,0.4943602,0.6818033,0.5847311,0.6524637,1.6552965,0.4442448
,2018-12-16 11:28:15,1.888 sec,3.0,0.4934351,0.6798920,0.5848234,0.6525991,1.6863896,0.4442448
,2018-12-16 11:28:15,2.062 sec,4.0,0.4926797,0.6783060,0.5851173,0.6531053,1.6572993,0.4442448
---,---,---,---,---,---,---,---,---,---
,2018-12-16 11:28:17,4.250 sec,38.0,0.4857586,0.6630447,0.6185145,0.6775247,1.7038126,0.4367184
,2018-12-16 11:28:17,4.300 sec,39.0,0.4856860,0.6628943,0.6189695,0.6779198,1.7105962,0.4369003
,2018-12-16 11:28:17,4.337 sec,40.0,0.4855816,0.6626826,0.6196851,0.6785724,1.7073866,0.4366956



See the whole table with table.as_data_frame()
Variable Importances: 


variable,relative_importance,scaled_importance,percentage
Distance,1379.9940185,1.0,0.5013133
Month,970.7394409,0.7034374,0.3526425
DayOfWeek,402.0243530,0.2913233,0.1460442
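The three importance columns are related by simple normalization. Judging from the numbers in the table (an inference from this output, not a statement taken from H2O's documentation), scaled_importance is the relative importance divided by the largest value, and percentage is the relative importance divided by the sum. A short plain-Python check:

```python
# relative_importance values copied from the Variable Importances table.
relative = {
    "Distance": 1379.9940185,
    "Month": 970.7394409,
    "DayOfWeek": 402.0243530,
}

largest = max(relative.values())
total = sum(relative.values())

# Normalize by the largest value and by the sum, respectively.
scaled = {name: value / largest for name, value in relative.items()}
percentage = {name: value / total for name, value in relative.items()}

for name in ["Distance", "Month", "DayOfWeek"]:
    print("%-10s scaled=%.7f percentage=%.7f"
          % (name, scaled[name], percentage[name]))
```

The printed values agree with the table to within rounding, so Distance accounts for roughly half of the total importance in this model.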





### Prediction
Once a model is created, predictions are made by calling the predict(data) method on the model, where the data argument is the variable pointing to an H2OFrame containing the data to predict on. As a very simple check that prediction works, let’s have gbm_model predict on the original training dataset by issuing the gbm_model.predict(airlines_train_data) command.

In [14]:
gbm_model.predict(airlines_train_data)

gbm prediction progress: |████████████████████████████████████████████████| 100%


predict,NO,YES
YES,0.141994,0.858006
YES,0.101574,0.898426
YES,0.203606,0.796394
YES,0.12399,0.87601
YES,0.138436,0.861564
YES,0.141994,0.858006
YES,0.101574,0.898426
YES,0.103421,0.896579
YES,0.203606,0.796394
YES,0.12399,0.87601




The result of the prediction is an H2OFrame. A reference to it can be saved in a variable as well, e.g. prediction = gbm_model.predict(airlines_train_data). The table printed is only a preview of the first few predictions. As the example above shows, the predictions are not very accurate for flights that were not delayed: all ten previewed rows are predicted YES, although the data preview earlier shows IsArrDelayed = NO for two of those flights. This is expected, as the model is very basic, and the confusion matrix seen earlier in this tutorial already pointed to this poor performance.
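The preview can be checked against the actual labels with a few lines of plain Python. This sketch assumes that the ten previewed predictions line up row by row with the ten rows of the data preview, and that the predicted label is YES whenever the YES probability reaches the max-F1 threshold reported in the model output (my understanding of H2O's default for binomial models, stated here as an assumption).

```python
# Max-F1 threshold copied from the model output.
threshold = 0.43158830279

# YES probabilities from the prediction preview (first ten rows).
p_yes = [0.858006, 0.898426, 0.796394, 0.87601, 0.861564,
         0.858006, 0.898426, 0.896579, 0.796394, 0.87601]

# IsArrDelayed values from the first ten rows of the data preview.
actual = ["YES", "YES", "YES", "NO", "YES",
          "NO", "YES", "YES", "YES", "YES"]

# Assumed labeling rule: YES when the probability reaches the threshold.
predicted = ["YES" if p >= threshold else "NO" for p in p_yes]

hits = sum(1 for p, a in zip(predicted, actual) if p == a)
missed = [a for p, a in zip(predicted, actual) if p != a]

print("correct: %d of %d" % (hits, len(actual)))
print("actual labels of the misclassified rows: %s" % missed)
```

Under these assumptions, eight of the ten rows are classified correctly, and both errors are flights that actually arrived on time, which is the same pattern the confusion matrix shows at scale.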