## A Day in the Life of a Data Scientist Using H2O-3 (Demo)

1. Ease of Loading Data into Cluster

2. Introducing the new Isolation Forest algorithm / appending anomaly flags to the dataset

3. Building Multiple Tuned Models with H2O-3 AutoML / Saving the model

4. Introducing Explainability for H2O-3 Algorithms

5. Introducing Generalized Additive Models & Parallel SVM

In [1]:
import h2o
import numpy as np

from h2o.estimators.gam import H2OGeneralizedAdditiveEstimator #GAM
from h2o.estimators import H2OSupportVectorMachineEstimator #SVM
from h2o.estimators import H2OIsolationForestEstimator #ISOLATION FOREST

from h2o.automl import H2OAutoML #AUTOML

import matplotlib.pyplot as plt

## Start the H2O-3 Cluster (version 3.30)

In [2]:
# Start Cluster to load data.
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.
Attempting to start a local H2O server...
  Java Version: java version "1.8.0_181"; Java(TM) SE Runtime Environment (build 1.8.0_181-b13); Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)
  Starting server from /Users/thomasott/opt/anaconda3/lib/python3.7/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /var/folders/9c/tqhnrz3x207bf8pzjxjhkm100000gn/T/tmp3ceedpuj
  JVM stdout: /var/folders/9c/tqhnrz3x207bf8pzjxjhkm100000gn/T/tmp3ceedpuj/h2o_thomasott_started_from_python.out
  JVM stderr: /var/folders/9c/tqhnrz3x207bf8pzjxjhkm100000gn/T/tmp3ceedpuj/h2o_thomasott_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.


H2O_cluster_uptime:,02 secs
H2O_cluster_timezone:,America/New_York
H2O_data_parsing_timezone:,UTC
H2O_cluster_version:,3.30.1.2
H2O_cluster_version_age:,1 month and 2 days
H2O_cluster_name:,H2O_from_python_thomasott_w07kqb
H2O_cluster_total_nodes:,1
H2O_cluster_free_memory:,3.556 Gb
H2O_cluster_total_cores:,8
H2O_cluster_allowed_cores:,8


## Step 1 - Load Data Quickly into the Cluster
Note: Data can already live on the cluster or be imported from an external source such as an S3 bucket or Hive.

In [3]:
# Paths to the training and test datasets
train_path = "https://s3.amazonaws.com/h2o-training/events/ibm_index/CreditCard_Cat-train.csv"
test_path = "https://s3.amazonaws.com/h2o-training/events/ibm_index/CreditCard_Cat-test.csv"

# import the train and test datasets
train = h2o.import_file(train_path, destination_frame='CreditCard_Cat-train.csv')
test = h2o.import_file(test_path, destination_frame='CreditCard_Cat-test.csv')

Parse progress: |█████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%


## Inspect the Dataset

In [4]:
train.head()

ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,PAY_5,PAY_6,BILL_AMT1,BILL_AMT2,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,DEFAULT_PAYMENT_NEXT_MONTH
1,20000,female,university,married,24,-2,2,-1,-1,-2,-2,3913,3102,689,0,0,0,0,689,0,0,0,0,1
2,120000,female,university,single,26,-1,2,0,0,0,2,2682,1725,2682,3272,3455,3261,0,1000,1000,1000,0,2000,1
3,90000,female,university,single,34,0,0,0,0,0,0,29239,14027,13559,14331,14948,15549,1518,1500,1000,1000,1000,5000,0
4,50000,female,university,married,37,1,0,0,0,0,0,46990,48233,49291,28314,28959,29547,2000,2019,1200,1100,1069,1000,0
5,50000,male,university,married,57,2,0,-1,0,0,0,8617,5670,35835,20940,19146,19131,2000,36681,10000,9000,689,679,0
6,50000,male,graduate,single,37,3,0,0,0,0,0,64400,57069,57608,19394,19619,20024,2500,1815,657,1000,1000,800,0
7,500000,male,graduate,single,29,4,0,0,0,0,0,367965,412023,445007,542653,483003,473944,55000,40000,38000,20239,13750,13770,0
8,100000,female,university,single,23,5,-1,-1,0,0,-1,11876,380,601,221,-159,567,380,601,0,581,1687,1542,0
9,140000,female,highschool,married,28,6,0,2,0,0,0,11285,14096,12108,12211,11793,3719,3329,0,432,1000,1000,1000,0
10,20000,male,highschool,single,35,7,-2,-2,-2,-1,-1,0,0,0,0,13007,13912,0,0,0,13007,1122,0,0




## Step 2 - Introducing the new Isolation Forest Algorithm for detecting Anomalies
We'll use the Isolation Forest Algorithm to find anomalies and add them as a column to our dataset.
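
The intuition behind the algorithm can be sketched without H2O: a tree isolates a row by repeatedly splitting on a random threshold, and unusual rows tend to be separated after fewer splits, which gives them a shorter mean path length. The following is a minimal one-dimensional illustration in plain Python, not H2O's implementation; the `path_length` helper, the toy data, and the depth cap of 8 are assumptions made for this example.

```python
import random

def path_length(point, data, rng, depth=0, max_depth=8):
    """Number of random splits needed to isolate `point` within `data` (toy, 1-D)."""
    lo, hi = min(data), max(data)
    if len(data) <= 1 or lo == hi or depth >= max_depth:
        return depth
    split = rng.uniform(lo, hi)
    # keep only the values that fall on the same side of the split as `point`
    same_side = [v for v in data if (v < split) == (point < split)]
    return path_length(point, same_side, rng, depth + 1, max_depth)

rng = random.Random(1234)
inliers = [i / 63 for i in range(64)]  # 64 evenly spaced values in [0, 1]
outlier = 10.0                         # one value far away from the rest
data = inliers + [outlier]

n_trees = 100
outlier_mean = sum(path_length(outlier, data, rng) for _ in range(n_trees)) / n_trees
inlier_mean = sum(path_length(inliers[32], data, rng) for _ in range(n_trees)) / n_trees

print("mean path length, outlier:", outlier_mean)
print("mean path length, inlier: ", inlier_mean)
```

The outlier is usually cut off by the very first split, while a value in the middle of the cluster needs many more, which is the pattern the `mean_length` column is used to detect in the cells that follow.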

In [5]:
myX = ['LIMIT_BAL','SEX','EDUCATION','MARRIAGE','AGE','PAY_0','PAY_2','PAY_3','PAY_4','PAY_5',
 'PAY_6','BILL_AMT1','BILL_AMT2','BILL_AMT3','BILL_AMT4','BILL_AMT5','BILL_AMT6','PAY_AMT1','PAY_AMT2',
 'PAY_AMT3','PAY_AMT4','PAY_AMT5','PAY_AMT6']

isolation_model_train = H2OIsolationForestEstimator(model_id = "isolation_forest_train.hex", seed = 1234)
isolation_model_train.train(training_frame = train, x = myX)

isolation_model_test = H2OIsolationForestEstimator(model_id = "isolation_forest_test.hex", seed = 1234)
isolation_model_test.train(training_frame = test, x = myX)




isolationforest Model Build progress: |███████████████████████████████████| 100%
isolationforest Model Build progress: |███████████████████████████████████| 100%


## Check the mean_length for the anomalies


In [6]:
predictions_train = isolation_model_train.predict(train)
predictions_test = isolation_model_test.predict(test)
predictions_train.head()

isolationforest prediction progress: |████████████████████████████████████| 100%
isolationforest prediction progress: |████████████████████████████████████| 100%


predict,mean_length
0.0705882,6.76
0.0647059,6.78
0.0,7.0
0.0,7.0
0.141176,6.52
0.0352941,6.88
0.729412,4.52
0.0941176,6.68
0.0294118,6.9
0.158824,6.46
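
In the ten rows shown above, the two columns carry the same information on different scales: `predict` rises as `mean_length` falls. As a quick check, the sketch below fits a straight line through the displayed values with NumPy. The linear relationship and its constants are inferred from these ten rows only, so treat them as an observation about this output rather than a documented formula.

```python
import numpy as np

# values copied from the prediction output above
predict = np.array([0.0705882, 0.0647059, 0.0, 0.0, 0.141176,
                    0.0352941, 0.729412, 0.0941176, 0.0294118, 0.158824])
mean_length = np.array([6.76, 6.78, 7.0, 7.0, 6.52, 6.88, 4.52, 6.68, 6.9, 6.46])

# least-squares line: predict ~ slope * mean_length + intercept
slope, intercept = np.polyfit(mean_length, predict, 1)
fitted = slope * mean_length + intercept
max_abs_error = np.abs(fitted - predict).max()

# score on the fitted line at the mean_length cutoff of 5.5 used in the next cell
score_at_cutoff = slope * 5.5 + intercept

print("slope:", slope, "intercept:", intercept)
print("largest deviation from the line:", max_abs_error)
print("predict value at mean_length = 5.5:", score_at_cutoff)
```

Under that reading, filtering on `mean_length < 5.5` selects the same rows as filtering on a `predict` score above roughly 0.44.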




In [7]:
anomalies_train = train[predictions_train["mean_length"] < 5.5]

print("Number of Anomalies in Train Set: " + str(anomalies_train.nrow))

anomalies_test = test[predictions_test["mean_length"] < 5.5]

print("Number of Anomalies in Test Set: " + str(anomalies_test.nrow))

Number of Anomalies in Train Set: 344
Number of Anomalies in Test Set: 73


In [8]:
# Show mean_length alongside the predictors for the anomalous training and test rows (the result is displayed, not stored)
isolation_model_train.predict(anomalies_train)["mean_length"].cbind(anomalies_train[myX])
isolation_model_test.predict(anomalies_test)["mean_length"].cbind(anomalies_test[myX])

isolationforest prediction progress: |████████████████████████████████████| 100%
isolationforest prediction progress: |████████████████████████████████████| 100%


mean_length,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,PAY_5,PAY_6,BILL_AMT1,BILL_AMT2,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6
4.98,500000,male,graduate,single,27,0,0,0,0,0,0,252881,224633,486367,616836,250600,296217,30032,271636,30876,60044,50539,100562
5.4,20000,male,university,single,22,3,2,2,7,7,7,2400,2400,2400,2400,2400,2400,0,0,0,0,0,0
5.42,200000,male,graduate,single,30,2,4,4,4,4,4,173639,176605,179754,182763,184570,184091,7310,7459,7468,6275,3000,0
5.02,320000,male,university,married,39,0,0,-1,-1,0,0,310243,0,189018,131916,258800,96900,0,189018,132846,130400,193800,45313
5.3,490000,male,graduate,married,39,0,0,-1,-1,0,0,189460,16769,2310,48409,296319,181627,16405,2324,25053,280695,17398,160111
5.48,100000,male,highschool,married,42,7,6,5,4,3,2,33816,33024,32308,31399,30448,29933,0,60,0,0,0,118
5.38,600000,male,graduate,married,36,0,0,0,0,0,0,372396,416438,459749,455910,463611,466570,50000,50000,15000,20000,15000,7000
5.42,730000,male,university,married,37,0,0,0,0,-1,0,70309,61991,49082,26873,514114,499100,20000,14023,9035,528897,22005,15000
5.22,500000,male,graduate,married,46,-1,-1,-1,0,0,0,46178,56570,117102,159284,112078,136341,57498,120899,101500,30418,80668,50384
5.42,430000,male,graduate,married,39,-1,-1,0,0,-1,-1,43970,46127,336073,325463,38290,21800,50942,325470,20003,39068,21800,351282




In [9]:
# Flag each training row as an anomaly ('Yes') or not ('No') using the mean_length cutoff
global_surrogate_data_train = train[:, :]
global_surrogate_data_train["anomaly"] = (predictions_train["mean_length"] < 5.5).ifelse("Yes", "No")
global_surrogate_data_train["anomaly"].table()

anomaly,Count
No,23655
Yes,344




In [10]:
global_surrogate_data_test = test[:, :]
global_surrogate_data_test["anomaly"] = (predictions_test["mean_length"] < 5.5).ifelse("Yes", "No")
global_surrogate_data_test["anomaly"].table()

anomaly,Count
No,5927
Yes,73
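
The two tables give the share of rows flagged by the `mean_length < 5.5` cutoff. A small worked calculation using the counts printed above:

```python
# counts copied from the two anomaly tables above
train_counts = {"No": 23655, "Yes": 344}
test_counts = {"No": 5927, "Yes": 73}

def anomaly_rate(counts):
    """Fraction of rows flagged as anomalies."""
    return counts["Yes"] / (counts["Yes"] + counts["No"])

train_rate = anomaly_rate(train_counts)
test_rate = anomaly_rate(test_counts)
print(f"train: {train_rate:.2%}, test: {test_rate:.2%}")
```

Both rates fall between 1% and 1.5%, so the same cutoff flags a similar share of each set.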




In [11]:
# Check that the training data on the cluster now includes the anomaly column
global_surrogate_data_train.head()

ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,PAY_5,PAY_6,BILL_AMT1,BILL_AMT2,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,DEFAULT_PAYMENT_NEXT_MONTH,anomaly
1,20000,female,university,married,24,-2,2,-1,-1,-2,-2,3913,3102,689,0,0,0,0,689,0,0,0,0,1,No
2,120000,female,university,single,26,-1,2,0,0,0,2,2682,1725,2682,3272,3455,3261,0,1000,1000,1000,0,2000,1,No
3,90000,female,university,single,34,0,0,0,0,0,0,29239,14027,13559,14331,14948,15549,1518,1500,1000,1000,1000,5000,0,No
4,50000,female,university,married,37,1,0,0,0,0,0,46990,48233,49291,28314,28959,29547,2000,2019,1200,1100,1069,1000,0,No
5,50000,male,university,married,57,2,0,-1,0,0,0,8617,5670,35835,20940,19146,19131,2000,36681,10000,9000,689,679,0,No
6,50000,male,graduate,single,37,3,0,0,0,0,0,64400,57069,57608,19394,19619,20024,2500,1815,657,1000,1000,800,0,No
7,500000,male,graduate,single,29,4,0,0,0,0,0,367965,412023,445007,542653,483003,473944,55000,40000,38000,20239,13750,13770,0,Yes
8,100000,female,university,single,23,5,-1,-1,0,0,-1,11876,380,601,221,-159,567,380,601,0,581,1687,1542,0,No
9,140000,female,highschool,married,28,6,0,2,0,0,0,11285,14096,12108,12211,11793,3719,3329,0,432,1000,1000,1000,0,No
10,20000,male,highschool,single,35,7,-2,-2,-2,-1,-1,0,0,0,0,13007,13912,0,0,0,13007,1122,0,0,No




In [12]:
# Check that the test data on the cluster now includes the anomaly column
#global_surrogate_data_test.head()

In [13]:
# set predictors and response
response = "DEFAULT_PAYMENT_NEXT_MONTH"
predictors = global_surrogate_data_train.columns
predictors.remove('ID')
predictors.remove(response)  # keep the target out of the predictor list

In [14]:
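# convert target to factor in both frames so it is treated as a class label
global_surrogate_data_train[response] = global_surrogate_data_train[response].asfactor()
global_surrogate_data_test[response] = global_surrogate_data_test[response].asfactor()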

In [15]:
# assign IDs for later use
h2o.assign(global_surrogate_data_test, "CreditCard_TEST")
h2o.assign(global_surrogate_data_train, "CreditCard_TRAIN")

ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,PAY_5,PAY_6,BILL_AMT1,BILL_AMT2,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,DEFAULT_PAYMENT_NEXT_MONTH,anomaly
1,20000,female,university,married,24,-2,2,-1,-1,-2,-2,3913,3102,689,0,0,0,0,689,0,0,0,0,1,No
2,120000,female,university,single,26,-1,2,0,0,0,2,2682,1725,2682,3272,3455,3261,0,1000,1000,1000,0,2000,1,No
3,90000,female,university,single,34,0,0,0,0,0,0,29239,14027,13559,14331,14948,15549,1518,1500,1000,1000,1000,5000,0,No
4,50000,female,university,married,37,1,0,0,0,0,0,46990,48233,49291,28314,28959,29547,2000,2019,1200,1100,1069,1000,0,No
5,50000,male,university,married,57,2,0,-1,0,0,0,8617,5670,35835,20940,19146,19131,2000,36681,10000,9000,689,679,0,No
6,50000,male,graduate,single,37,3,0,0,0,0,0,64400,57069,57608,19394,19619,20024,2500,1815,657,1000,1000,800,0,No
7,500000,male,graduate,single,29,4,0,0,0,0,0,367965,412023,445007,542653,483003,473944,55000,40000,38000,20239,13750,13770,0,Yes
8,100000,female,university,single,23,5,-1,-1,0,0,-1,11876,380,601,221,-159,567,380,601,0,581,1687,1542,0,No
9,140000,female,highschool,married,28,6,0,2,0,0,0,11285,14096,12108,12211,11793,3719,3329,0,432,1000,1000,1000,0,No
10,20000,male,highschool,single,35,7,-2,-2,-2,-1,-1,0,0,0,0,13007,13912,0,0,0,13007,1122,0,0,No




## Step 3 - Building Multiple Tuned Models with H2O-3 AutoML / Saving the model


In [None]:
# build an AutoML model (for the demo we keep the number of models built low)

aml = H2OAutoML(max_models=4, seed=1234)
aml.train(x = predictors, y = response, training_frame=global_surrogate_data_train)

AutoML progress: |███

In [None]:
# View the AutoML Leaderboard
lb = aml.leaderboard
lb.head(rows=lb.nrows)

In [None]:
aml.leader

In [None]:
preds = aml.predict(global_surrogate_data_test)

In [None]:
preds.head()

In [None]:
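# score on the test frame that carries the anomaly column the model was trained with
aml.leader.model_performance(global_surrogate_data_test)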

In [None]:
# Get model ids for all models in the AutoML Leaderboard
model_ids = list(aml.leaderboard['model_id'].as_data_frame().iloc[:,0])
# Get the "All Models" Stacked Ensemble model
se = h2o.get_model([mid for mid in model_ids if "StackedEnsemble_AllModels" in mid][0])
# Get the Stacked Ensemble metalearner model
metalearner = h2o.get_model(se.metalearner()['name'])

In [None]:
metalearner.coef_norm()

In [None]:
%matplotlib inline
metalearner.std_coef_plot()
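
A Stacked Ensemble's metalearner is itself a model (a GLM by default in H2O-3) trained on the base models' predictions, so its coefficients indicate how much each base model contributes to the ensemble. The sketch below shows that kind of combination for a binary target with NumPy; the base-model names, weights, and intercept are made-up values for illustration and are not taken from this run.

```python
import numpy as np

# hypothetical base-model probabilities of default for three customers
base_predictions = {
    "gbm": np.array([0.10, 0.60, 0.85]),
    "drf": np.array([0.20, 0.55, 0.70]),
    "glm": np.array([0.15, 0.40, 0.90]),
}
# hypothetical metalearner coefficients (one weight per base model) and intercept
weights = {"gbm": 2.0, "drf": 1.0, "glm": 0.5}
intercept = -2.0

# logistic-regression style combination: weighted sum followed by a sigmoid
linear = intercept + sum(weights[name] * base_predictions[name] for name in weights)
ensemble_probability = 1.0 / (1.0 + np.exp(-linear))
print(ensemble_probability)
```

A base model with a coefficient of zero would drop out of the weighted sum entirely, which is why the coefficient plot is a quick way to see which models the ensemble actually relies on.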

## Saving the Model for External Deployment

In [None]:
h2o.save_model(aml.leader, path = "./credit_card_bin")
aml.leader.download_mojo(path = "./credit_card_bin")

## Step 4 - Introducing Explainability for H2O-3 Algorithms
H2O-3 now supports generating Shapley values for its Gradient Boosting Machine (GBM), Distributed Random Forest (DRF), and XGBoost models. This example uses the Boston House Prices dataset and H2O-3's DRF algorithm.

In [None]:
import shap
from h2o.estimators.random_forest import H2ORandomForestEstimator
from h2o import H2OFrame

In [None]:
X, y = shap.datasets.boston()
boston_housing = H2OFrame(X).cbind(H2OFrame(y, column_names=["medv"]))

boston_housing.head()

In [None]:
x = ['CRIM',
 'ZN',
 'INDUS',
 'CHAS',
 'NOX',
 'RM',
 'AGE',
 'DIS',
 'RAD',
 'TAX',
 'PTRATIO',
 'B',
 'LSTAT']

In [None]:
# load JS visualization code to notebook
shap.initjs()

In [None]:
# train a DRF model in H2O

model = H2ORandomForestEstimator(ntrees=100)
model.train(x=x, y="medv", training_frame=boston_housing)

In [None]:
model

In [None]:
# calculate SHAP values using function predict_contributions
contributions = model.predict_contributions(boston_housing)

In [None]:
# convert the H2O Frame to use with shap's visualization functions
contributions_matrix = contributions.as_data_frame().to_numpy()  # as_matrix() was removed in pandas 1.0
#contributions_matrix

In [None]:
# shap values are calculated for all features
shap_values = contributions_matrix[:,0:13]
shap_values.shape

In [None]:
# the expected value (bias term) is the last returned column and is the same for every row
expected_value = contributions_matrix[:,13].min()
expected_value
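
The slicing in the last two cells relies on the layout of the contributions frame: one column per feature followed by a final bias column. For a regression model, each row's feature contributions plus the bias term add up to that row's prediction. A small NumPy sketch with made-up numbers (two features instead of thirteen; the values are assumptions for illustration):

```python
import numpy as np

# hypothetical contributions: two feature columns followed by the bias term
contributions_matrix = np.array([
    [ 1.50, -0.50, 22.0],
    [-2.00,  0.25, 22.0],
])

n_features = contributions_matrix.shape[1] - 1
shap_values = contributions_matrix[:, :n_features]    # per-feature contributions
expected_value = contributions_matrix[0, n_features]  # bias term, identical in every row

# additivity: contributions plus bias reconstruct each prediction
predictions = shap_values.sum(axis=1) + expected_value
print(predictions)
```

This additivity is what lets the force plots below start from the expected value and push toward each individual prediction.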

In [None]:
# visualize the first prediction's explanation
shap.force_plot(expected_value, shap_values[0,:], X.iloc[0,:])

In [None]:
# visualize the training set predictions
shap.force_plot(expected_value, shap_values, X)

In [None]:
# create a SHAP dependence plot to show the effect of a single feature across the whole dataset
shap.dependence_plot("RM", shap_values, X)

In [None]:
# summarize the effects of all the features
shap.summary_plot(shap_values, X)

In [None]:
shap.summary_plot(shap_values, X, plot_type="bar")

## Step 5 - Introducing Generalized Additive Models & Parallel SVM

In [None]:
# create one frame of knot locations per GAM column
knots1 = [-1.99905699, -0.98143075, 0.02599159, 1.00770987, 1.99942290]
frameKnots1 = h2o.H2OFrame(python_obj=knots1)
knots2 = [-1.999821861, -1.005257990, -0.006716042, 1.002197392, 1.999073589]
frameKnots2 = h2o.H2OFrame(python_obj=knots2)
knots3 = [-1.999675688, -0.979893796, 0.007573327,1.011437347, 1.999611676]
frameKnots3 = h2o.H2OFrame(python_obj=knots3)
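
Each knot list above has five ascending values, one list per GAM column, which has to line up with the `num_knots` and `gam_columns` arguments used further down. A plain-Python sanity check along those lines (the `check_knots` helper is hypothetical, written for this example, and the ascending-order requirement is an assumption based on how these lists are laid out):

```python
def check_knots(knots, expected_count):
    """Return True if there are `expected_count` knots in strictly ascending order."""
    ascending = all(a < b for a, b in zip(knots, knots[1:]))
    return len(knots) == expected_count and ascending

knots1 = [-1.99905699, -0.98143075, 0.02599159, 1.00770987, 1.99942290]
knots2 = [-1.999821861, -1.005257990, -0.006716042, 1.002197392, 1.999073589]
knots3 = [-1.999675688, -0.979893796, 0.007573327, 1.011437347, 1.999611676]
num_knots = [5, 5, 5]  # one entry per GAM column

results = [check_knots(k, n) for k, n in zip([knots1, knots2, knots3], num_knots)]
print(results)
```

Running a check like this before building the frames catches a mistyped or misordered knot early, before the model is trained.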

In [None]:
# import the dataset
h2o_data = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/glm_test/multinomial_10_classes_10_cols_10000_Rows_train.csv")

In [None]:
h2o_data.head()

In [None]:
# convert the C1, C2, and C11 columns to factors
h2o_data["C1"] = h2o_data["C1"].asfactor()
h2o_data["C2"] = h2o_data["C2"].asfactor()
h2o_data["C11"] = h2o_data["C11"].asfactor()

In [None]:
# split into train and test sets
train, test = h2o_data.split_frame(ratios = [.8])

# set the predictor and response columns
y = "C11"
x = ["C1","C2"]

# specify the number of knots for each GAM column
numKnots = [5,5,5]

# build the GAM model
h2o_model = H2OGeneralizedAdditiveEstimator(family='multinomial',
                                            gam_columns=["C6","C7","C8"],
                                            scale=[1,1,1],
                                            num_knots=numKnots,
                                            knot_ids=[frameKnots1.key, frameKnots2.key, frameKnots3.key])
h2o_model.train(x=x, y=y, training_frame=train)



In [None]:
# get the model coefficients
h2oCoeffs = h2o_model.coef()

# generate predictions using the test data
pred = h2o_model.predict(test)

In [None]:
print(pred)

In [None]:
from h2o.estimators import H2OSupportVectorMachineEstimator


# Import the splice dataset into H2O:
splice = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/splice/splice.svm")



In [None]:
# Build and train the model:
svm_model = H2OSupportVectorMachineEstimator(gamma=0.01,
                                             rank_ratio = 0.1,
                                             disable_training_metrics = False)
svm_model.train(y = "C1", training_frame = splice)
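
The `gamma` argument is the coefficient of the kernel. Assuming the Gaussian (RBF) kernel, K(x, y) = exp(-gamma * ||x - y||^2), a smaller `gamma` makes distant points look more similar. A minimal NumPy sketch of that formula (the `rbf_kernel` helper and the sample vectors are illustrative and not part of H2O's API):

```python
import math
import numpy as np

def rbf_kernel(x, y, gamma):
    """Gaussian (RBF) kernel value for two vectors."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.exp(-gamma * np.dot(diff, diff)))

a = [0.0, 0.0]
b = [6.0, 8.0]  # squared distance to a is 6^2 + 8^2 = 100

same_point = rbf_kernel(a, a, gamma=0.01)       # identical vectors
far_small_gamma = rbf_kernel(a, b, gamma=0.01)  # exp(-0.01 * 100)
far_large_gamma = rbf_kernel(a, b, gamma=0.1)   # exp(-0.1 * 100)

print(same_point, far_small_gamma, far_large_gamma)
```

With `gamma=0.01`, as in the cell above, two points at a squared distance of 100 still have a kernel value of exp(-1), whereas a tenfold larger `gamma` makes them look almost unrelated.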



In [None]:
# Evaluate performance on the training data:
perf = svm_model.model_performance()

# Generate predictions (if necessary):
pred = svm_model.predict(splice)

In [None]:
print(perf)

In [None]:
print(pred)