Copyright © 2021, SAS Institute Inc., Cary, NC, USA.  All Rights Reserved.
SPDX-License-Identifier: Apache-2.0

# HMEQ Dataset : Build and Import Trained H2O.ai Models into SAS Model Manager

This notebook provides an example of how to build and train an H2O.ai model in Python and then import the model into SAS Model Manager using the HMEQ data set. Lines of code that must be modified by the user, such as directory paths, are marked with the comment "_Changes needed by user_".

_**Note:** If you download only this notebook and not the rest of the repository, you must also download the hmeq.csv file from the data folder in the examples directory. This file is used when executing this notebook example._

Here are the steps shown in this notebook:

1. Import and review data and preprocess for model training.
2. Build, train, and assess an H2O.ai generalized linear estimator model.
3. Serialize the model into pickle or MOJO files.
4. Write the metadata JSON files needed for importing into SAS Model Manager.
5. Write a score code Python file for model scoring.
6. Zip the model, JSON, and score code files into an archive file.
7. Import the ZIP archive file to SAS Model Manager via the Session object and relevant function call.
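
The archiving step in the list above is handled by `pzmm` during import later in this notebook. To make the idea concrete, here is a minimal standard-library sketch of what bundling a model folder into a ZIP archive can look like. The function name `zip_model_folder` and the placeholder file contents are hypothetical and are not part of `sasctl`; this is an illustration, not the library's implementation.

```python
import tempfile
import zipfile
from pathlib import Path


def zip_model_folder(folder: Path, archive_path: Path) -> list:
    """Zip every regular file directly inside `folder` and return the archived names.

    Hypothetical helper for illustration only; pzmm performs its own archiving.
    """
    names = []
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for file in sorted(folder.iterdir()):
            # Skip subfolders and the archive itself if it lives in the same folder
            if file.is_file() and file != archive_path:
                archive.write(file, arcname=file.name)
                names.append(file.name)
    return names


# Demonstrate on a throwaway folder with placeholder files
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp) / "model"
    folder.mkdir()
    (folder / "inputVar.json").write_text("[]")
    (folder / "score_glmfit.py").write_text("# placeholder score code\n")
    archived = zip_model_folder(folder, Path(tmp) / "glmfit.zip")
    print(archived)
```

Storing each file under its bare name (`arcname=file.name`) keeps the archive flat, which is the simplest layout to inspect by hand.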

### Python Package Imports

In [1]:
# Standard Library
from pathlib import Path
import warnings

# Third Party
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import h2o
from h2o.estimators.glm import H2OGeneralizedLinearEstimator

# Application Specific
import sasctl.pzmm as pzmm
from sasctl import Session

In [2]:
# Global Package Options
pd.options.mode.chained_assignment = None  # default="warn"
plt.rc("font", size=14)
# Ignore warnings from pandas about SWAT using a feature that will be deprecated soon
warnings.simplefilter(action="ignore", category=FutureWarning)

In [3]:
h2o.__version__

'3.38.0.4'

On SAS Viya, models created in H2O version 3.24 or earlier are supported only in the binary model format. For H2O versions 3.26 and later, models can be in either the MOJO or the binary model format. If you use a binary model, the H2O version on the SAS Viya server must exactly match the version of H2O used to create the model.
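
The compatibility rule above can be written as a small check. The sketch below is only an illustration: the function names are hypothetical, and treating any version below 3.26 as binary-only is an assumption, since the text does not mention versions between 3.24 and 3.26.

```python
def allowed_model_formats(h2o_version: str) -> list:
    """Return the model formats usable on SAS Viya for a given H2O version string.

    Assumption: anything below 3.26 is treated as binary-only.
    """
    major, minor = (int(part) for part in h2o_version.split(".")[:2])
    if (major, minor) >= (3, 26):
        return ["mojo", "binary"]
    return ["binary"]


def binary_model_is_compatible(model_version: str, server_version: str) -> bool:
    """Binary models require an exact H2O version match with the server."""
    return model_version == server_version


print(allowed_model_formats("3.38.0.4"))
print(binary_model_is_compatible("3.38.0.4", "3.38.0.3"))
```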

In [4]:
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321 . connected.


H2O_cluster_uptime: 9 mins 08 secs
H2O_cluster_timezone: America/New_York
H2O_data_parsing_timezone: UTC
H2O_cluster_version: 3.38.0.4
H2O_cluster_version_age: 3 months and 26 days !!!
H2O_cluster_name: H2O_from_python
H2O_cluster_total_nodes: 1
H2O_cluster_free_memory: 15.93 Gb
H2O_cluster_total_cores: 16
H2O_cluster_allowed_cores: 16


### Import and Review Data Set

In [5]:
hmeq_data = h2o.import_file("data/hmeq.csv", sep=",")
hmeq_data.shape

Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%


(5960, 13)

### Preprocess Data

In [6]:
hmeq_data["BAD"] = hmeq_data["BAD"].asfactor()

train, validation, test = hmeq_data.split_frame(ratios=[.6, .2], seed=42)

y = "BAD"
x = list(hmeq_data.columns)
x.remove(y)
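
With `ratios=[.6, .2]`, the remaining 20% of rows go to the third frame (`test`). H2O's `split_frame` splits probabilistically, so the resulting frame sizes are approximate rather than exact. The standard-library sketch below illustrates that idea with one random draw per row. It is not H2O's algorithm, the function name `split_indices` is hypothetical, and the row counts it produces will differ from H2O's.

```python
import random


def split_indices(n_rows, ratios, seed):
    """Assign each row index to one of len(ratios) + 1 parts using one random draw per row."""
    rng = random.Random(seed)
    # Cumulative boundaries, e.g. [0.6, 0.8] for ratios [0.6, 0.2]
    bounds = []
    total = 0.0
    for ratio in ratios:
        total += ratio
        bounds.append(total)
    parts = [[] for _ in range(len(ratios) + 1)]
    for row in range(n_rows):
        draw = rng.random()
        # The bucket is the number of boundaries that the draw has passed
        bucket = sum(draw >= bound for bound in bounds)
        parts[bucket].append(row)
    return parts


train_idx, valid_idx, test_idx = split_indices(5960, [0.6, 0.2], seed=42)
print(len(train_idx), len(valid_idx), len(test_idx))
```

Because each row is assigned independently, the part sizes land near 60/20/20 without matching those fractions exactly, which mirrors why the training frame in this notebook has 3581 rows rather than exactly 60% of 5960.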

### Create, Train, and Assess Model

In [7]:
glm = H2OGeneralizedLinearEstimator(family="binomial", model_id="glmfit", lambda_search=True)
glm.train(x=x, y=y, training_frame=train, validation_frame=validation)

glm Model Build progress: |██████████████████████████████████████████████████████| (done) 100%


,family,link,regularization,lambda_search,number_of_predictors_total,number_of_active_predictors,number_of_iterations,training_frame
,binomial,logit,"Elastic Net (alpha = 0.5, lambda = 9.244E-4 )","nlambda = 100, lambda.max = 0.2455, lambda.min = 9.244E-4, lambda.1se = -1.0",18,17,93,py_3_sid_b54b

,0,1,Error,Rate
0,2446.0,414.0,0.1448,(414.0/2860.0)
1,314.0,407.0,0.4355,(314.0/721.0)
Total,2760.0,821.0,0.2033,(728.0/3581.0)

metric,threshold,value,idx
max f1,0.2574175,0.5278859,208.0
max f2,0.1522974,0.6261121,279.0
max f0point5,0.3886078,0.5571256,145.0
max accuracy,0.5672639,0.8352416,91.0
max precision,0.9988094,1.0,0.0
max recall,0.0012376,1.0,399.0
max specificity,0.9988094,1.0,0.0
max absolute_mcc,0.2666152,0.4020977,204.0
max min_per_class_accuracy,0.1782785,0.7066434,258.0
max mean_per_class_accuracy,0.2143163,0.7195026,234.0

group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,score,cumulative_response_rate,cumulative_score,capture_rate,cumulative_capture_rate,gain,cumulative_gain,kolmogorov_smirnov
1,0.0100531,0.9466854,4.9667129,4.9667129,1.0,0.9805079,1.0,0.9805079,0.0499307,0.0499307,396.6712899,396.6712899,0.0499307
2,0.0201061,0.852347,4.8287487,4.8977308,0.9722222,0.8987982,0.9861111,0.939653,0.0485437,0.0984743,382.8748652,389.7730775,0.0981247
3,0.0301592,0.7479588,3.8629989,4.5528202,0.7777778,0.7984125,0.9166667,0.8925729,0.038835,0.1373093,286.2998921,355.2820157,0.1341624
4,0.0402122,0.6912384,3.1731777,4.2079095,0.6388889,0.7152522,0.8472222,0.8482427,0.0319001,0.1692094,217.3177685,320.7909539,0.1615171
5,0.0502653,0.6267417,3.5870704,4.0837417,0.7222222,0.6558977,0.8222222,0.8097737,0.036061,0.2052705,258.7070427,308.3741717,0.1940816
6,0.1002513,0.4302426,2.441736,3.2650257,0.4916201,0.5199788,0.6573816,0.6652798,0.1220527,0.3273232,144.173595,226.502575,0.2843162
7,0.1502374,0.3459143,2.053278,2.8618606,0.4134078,0.3838592,0.5762082,0.5716473,0.1026352,0.4299584,105.3277958,186.1860592,0.3502381
8,0.2002234,0.284427,1.720314,2.576872,0.3463687,0.3119048,0.5188285,0.5068022,0.0859917,0.5159501,72.0313965,157.6871964,0.3953207
9,0.3001955,0.2098447,1.331856,2.1622527,0.2681564,0.2431157,0.4353488,0.4189885,0.1331484,0.6490985,33.1855973,116.2252685,0.4368607
10,0.4001676,0.1695745,0.887904,1.8438878,0.1787709,0.1882518,0.3712491,0.3613446,0.0887656,0.7378641,-11.2096018,84.3887831,0.4228291

,0,1,Error,Rate
0,861.0,97.0,0.1013,(97.0/958.0)
1,107.0,131.0,0.4496,(107.0/238.0)
Total,968.0,228.0,0.1706,(204.0/1196.0)

metric,threshold,value,idx
max f1,0.3127132,0.5622318,155.0
max f2,0.1877475,0.6551476,230.0
max f0point5,0.4296159,0.6079404,107.0
max accuracy,0.4438566,0.8461538,105.0
max precision,0.9939862,1.0,0.0
max recall,0.0096573,1.0,398.0
max specificity,0.9939862,1.0,0.0
max absolute_mcc,0.3127132,0.4565355,155.0
max min_per_class_accuracy,0.1954052,0.7478992,224.0
max mean_per_class_accuracy,0.191452,0.7508026,227.0

group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,score,cumulative_response_rate,cumulative_score,capture_rate,cumulative_capture_rate,gain,cumulative_gain,kolmogorov_smirnov
1,0.0100334,0.942287,5.0252101,5.0252101,1.0,0.9698172,1.0,0.9698172,0.0504202,0.0504202,402.5210084,402.5210084,0.0504202
2,0.0200669,0.8859499,4.6064426,4.8158263,0.9166667,0.9133302,0.9583333,0.9415737,0.0462185,0.0966387,360.6442577,381.5826331,0.0955948
3,0.0301003,0.8333951,3.7689076,4.4668534,0.75,0.8662565,0.8888889,0.9164679,0.0378151,0.1344538,276.8907563,346.6853408,0.1302784
4,0.0401338,0.7848628,3.7689076,4.2923669,0.75,0.8154078,0.8541667,0.8912029,0.0378151,0.1722689,276.8907563,329.2366947,0.164962
5,0.0501672,0.7212019,2.512605,3.9364146,0.5,0.7506003,0.7833333,0.8630824,0.0252101,0.197479,151.2605042,293.6414566,0.1839091
6,0.1003344,0.4947107,3.2663866,3.6014006,0.65,0.598425,0.7166667,0.7307537,0.1638655,0.3613445,226.6386555,260.140056,0.3258539
7,0.1505017,0.3581892,2.010084,3.0709617,0.4,0.4130834,0.6111111,0.6248636,0.1008403,0.4621849,101.0084034,207.0961718,0.389116
8,0.2006689,0.3002103,2.010084,2.8057423,0.4,0.3274215,0.5583333,0.5505031,0.1008403,0.5630252,101.0084034,180.5742297,0.452378
9,0.3001672,0.2236898,1.1824024,2.2676435,0.2352941,0.2578008,0.4512535,0.4534792,0.1176471,0.6806723,18.2402373,126.7643548,0.4750355
10,0.4005017,0.1761419,1.0469188,1.9618252,0.2083333,0.1963836,0.3903967,0.3890711,0.105042,0.7857143,4.6918768,96.1825231,0.4809126

,timestamp,duration,iteration,lambda,predictors,deviance_train,deviance_test,alpha,iterations,training_rmse,training_logloss,training_r2,training_auc,training_pr_auc,training_lift,training_classification_error,validation_rmse,validation_logloss,validation_r2,validation_auc,validation_pr_auc,validation_lift,validation_classification_error
,2023-05-02 11:09:59,0.000 sec,1,.25E0,1,1.0045100,0.9980509,0.5,,,,,,,,,,,,,,,
,2023-05-02 11:09:59,0.003 sec,3,.22E0,2,0.9946355,0.9864127,0.5,,,,,,,,,,,,,,,
,2023-05-02 11:09:59,0.006 sec,5,.2E0,2,0.9859035,0.9760418,0.5,,,,,,,,,,,,,,,
,2023-05-02 11:09:59,0.008 sec,7,.19E0,2,0.9781697,0.9667699,0.5,,,,,,,,,,,,,,,
,2023-05-02 11:09:59,0.009 sec,9,.17E0,3,0.9686784,0.9544753,0.5,,,,,,,,,,,,,,,
,2023-05-02 11:09:59,0.011 sec,11,.15E0,3,0.9587676,0.9411532,0.5,,,,,,,,,,,,,,,
,2023-05-02 11:09:59,0.013 sec,13,.14E0,3,0.9500953,0.9293709,0.5,,,,,,,,,,,,,,,
,2023-05-02 11:09:59,0.015 sec,15,.13E0,3,0.9424603,0.9188844,0.5,,,,,,,,,,,,,,,
,2023-05-02 11:09:59,0.017 sec,17,.12E0,5,0.9341431,0.9080921,0.5,,,,,,,,,,,,,,,
,2023-05-02 11:09:59,0.019 sec,19,.11E0,5,0.9236449,0.8955093,0.5,,,,,,,,,,,,,,,

variable,relative_importance,scaled_importance,percentage
DELINQ,0.7970149,1.0,0.1473006
JOB.Sales,0.7478347,0.9382946,0.1382114
JOB.Office,0.510924,0.641047,0.0944266
JOB.Self,0.45714,0.5735652,0.0844865
CLAGE,0.4484982,0.5627225,0.0828894
DEBTINC,0.4467601,0.5605417,0.0825681
DEROG,0.4263238,0.5349007,0.0787912
NINQ,0.3037353,0.3810912,0.056135
VALUE,0.2412667,0.3027129,0.0445898
MORTDUE,0.2384536,0.2991833,0.0440699


In [8]:
# Check the model performance on the test set and print the maximum accuracy as [[threshold, accuracy]]
glm_performance = glm.model_performance(test)
print(glm_performance.accuracy())

[[0.551348008992684, 0.8486897717666948]]
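
The accuracy output above is a `[threshold, value]` pair, and the confusion matrices in the training output can be read the same way: rows are actual classes, columns are predicted classes. As a sanity check, the sketch below recomputes the error rate and F1 score from the validation confusion matrix counts using plain Python. The helper name `classification_metrics` is hypothetical and is not an H2O function.

```python
def classification_metrics(tn, fp, fn, tp):
    """Compute common binary classification metrics from confusion matrix counts."""
    total = tn + fp + fn + tp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "error": (fp + fn) / total,
        "accuracy": (tn + tp) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Counts copied from the validation confusion matrix in the training output
validation_metrics = classification_metrics(tn=861, fp=97, fn=107, tp=131)
print(round(validation_metrics["error"], 4))  # 0.1706, the "Total" error in the table
print(round(validation_metrics["f1"], 7))     # 0.5622318, the "max f1" value in the table

# Unpack the [[threshold, accuracy]] pair printed for the test set
threshold, accuracy = [[0.551348008992684, 0.8486897717666948]][0]
print(f"accuracy {accuracy:.4f} at threshold {threshold:.4f}")
```

The recomputed F1 matches the reported `max f1` value, which is consistent with the confusion matrix being evaluated at the F1-maximizing threshold.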


### Register Models in SAS Model Manager

In [9]:
model_prefix = "glmfit"
binary_folder = Path.cwd() / "data/hmeqModels/H2OBinaryGLM/" # Changes needed by user
mojo_folder = Path.cwd() / "data/hmeqModels/H2OMOJOGLM/" # Changes needed by user
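
If the target folders do not exist yet, they can be created up front. This notebook does not state whether `pzmm` creates missing folders, so the sketch below is a defensive assumption. The helper name `ensure_folders` is hypothetical, and a temporary directory stands in for `Path.cwd()`.

```python
import tempfile
from pathlib import Path


def ensure_folders(*folders):
    """Create each folder (and any missing parents) and report whether it now exists."""
    for folder in folders:
        folder.mkdir(parents=True, exist_ok=True)
    return [folder.is_dir() for folder in folders]


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)  # stand-in for Path.cwd() in this sketch
    binary_dir = base / "data/hmeqModels/H2OBinaryGLM"
    mojo_dir = base / "data/hmeqModels/H2OMOJOGLM"
    created = ensure_folders(binary_dir, mojo_dir)
    print(created)
```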

In [10]:
# Save the model as an H2O binary model file
pzmm.PickleModel.pickle_trained_model(
    model_prefix=model_prefix, 
    trained_model=glm, 
    pickle_path=binary_folder, 
    is_h2o_model=True, 
    is_binary_model=True
)

In [11]:
# Save the model as an H2O MOJO model file
pzmm.PickleModel.pickle_trained_model(
    model_prefix=model_prefix,
    trained_model=glm,
    pickle_path=mojo_folder, 
    is_h2o_model=True
)

In [12]:
train_df = train.as_data_frame()
# Write input variable mapping to a json file
pzmm.JSONFiles.write_var_json(train_df[x], is_input=True, json_path=binary_folder)
pzmm.JSONFiles.write_var_json(train_df[x], is_input=True, json_path=mojo_folder)

# Set output variables and assign an event threshold, then write output variable mapping
output_var = pd.DataFrame(
    columns=["EM_CLASSIFICATION", "EM_PROBABILITY"], 
    data=[[train_df[y].astype("category").cat.categories.astype("str"), 0.5]]
)
pzmm.JSONFiles.write_var_json(output_var, is_input=False, json_path=binary_folder)
pzmm.JSONFiles.write_var_json(output_var, is_input=False, json_path=mojo_folder)

# Write model properties to a json file
pzmm.JSONFiles.write_model_properties_json(
    model_name=model_prefix,
    model_desc="Binary H2O model.",
    target_variable=y,
    target_values=["1", "0"],
    json_path=binary_folder,
    modeler="sasdemo"
)
pzmm.JSONFiles.write_model_properties_json(
    model_name=model_prefix,
    model_desc="MOJO H2O model.",
    target_variable=y,
    target_values=["1", "0"],
    json_path=mojo_folder,
    modeler="sasdemo"
)

# Write model metadata to a json file
pzmm.JSONFiles.write_file_metadata_json(model_prefix=model_prefix, json_path=binary_folder)
pzmm.JSONFiles.write_file_metadata_json(model_prefix=model_prefix, json_path=mojo_folder, is_h2o_model=True)

inputVar.json was successfully written and saved to ~\examples\data\hmeqModels\H2OBinaryGLM\inputVar.json
inputVar.json was successfully written and saved to ~\examples\data\hmeqModels\H2OMOJOGLM\inputVar.json
outputVar.json was successfully written and saved to ~\examples\data\hmeqModels\H2OBinaryGLM\outputVar.json
outputVar.json was successfully written and saved to ~\examples\data\hmeqModels\H2OMOJOGLM\outputVar.json
ModelProperties.json was successfully written and saved to ~\examples\data\hmeqModels\H2OBinaryGLM\ModelProperties.json
ModelProperties.json was successfully written and saved to ~\examples\data\hmeqModels\H2OMOJOGLM\ModelProperties.json
fileMetadata.json was successfully written and saved to ~\examples\data\hmeqModels\H2OBinaryGLM\fileMetadata.json
fileMetadata.json was successfully written and saved to ~\examples\data\hmeqModels\H2OMOJOGLM\fileMetadata.json
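
Before importing, it can be useful to confirm that each model folder holds the four JSON files named in the output above. The sketch below is a standard-library check. The helper name `missing_json_files` is hypothetical, and the demonstration uses a temporary folder with placeholder files rather than the real model folders.

```python
import tempfile
from pathlib import Path

# File names taken from the output messages above
EXPECTED_JSON = (
    "inputVar.json",
    "outputVar.json",
    "ModelProperties.json",
    "fileMetadata.json",
)


def missing_json_files(folder):
    """Return the expected JSON file names that are not present in `folder`."""
    return [name for name in EXPECTED_JSON if not (Path(folder) / name).is_file()]


# Demonstrate on a throwaway folder that is missing one file
with tempfile.TemporaryDirectory() as tmp:
    for name in EXPECTED_JSON[:3]:
        (Path(tmp) / name).write_text("{}")
    missing = missing_json_files(tmp)
    print(missing)
```

In the notebook itself, the same check could be pointed at `binary_folder` and `mojo_folder`; an empty list would mean all four files are present.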


In [13]:
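import getpass
# Prompt for credentials without echoing them to the screen
username = getpass.getpass("Username: ")
password = getpass.getpass("Password: ")
host = "sas.demo.com"  # Changes needed by user
sess = Session(host, username, password, protocol="http")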

In [14]:
binary_model = pzmm.ImportModel.import_model(
    model_files=binary_folder, 
    model_prefix=model_prefix + "_binary", 
    project="H2OModels", 
    input_data=train_df[x], 
    predict_method=[glm.predict, [list]], 
    binary_h2o_model=True, 
    score_metrics=["EM_CLASSIFICATION", "EM_EVENTPROBABILITY"],
    missing_values=True,
    overwrite_model=True,
    model_file_name="glmfit.pickle"
)
# Reset the stored score code so the MOJO model's score code starts from an empty string
pzmm.ScoreCode.score_code = ""

mojo_model = pzmm.ImportModel.import_model(
    model_files=mojo_folder, 
    model_prefix=model_prefix + "_mojo", 
    project="H2OModels", 
    input_data=train_df[x], 
    predict_method=[glm.predict, [list]], 
    mojo_model=True, 
    score_metrics=["EM_CLASSIFICATION", "EM_EVENTPROBABILITY"],
    missing_values=True,
    overwrite_model=True,
    model_file_name="glmfit.mojo"
)

Model score code was written successfully to ~\examples\data\hmeqModels\H2OBinaryGLM\score_glmfit_binary.py and uploaded to SAS Model Manager.
All model files were zipped to ~\examples\data\hmeqModels\H2OBinaryGLM.


  warn(f"No project with the name or UUID {project} was found.")


A new project named H2OModels was created.
Model was successfully imported into SAS Model Manager as glmfit_binary with the following UUID: 5d178ea5-6477-428c-8fb9-3eb0cdca6eff.
Model score code was written successfully to ~\examples\data\hmeqModels\H2OMOJOGLM\score_glmfit_mojo.py and uploaded to SAS Model Manager.
All model files were zipped to ~\examples\data\hmeqModels\H2OMOJOGLM.
Model was successfully imported into SAS Model Manager as glmfit_mojo with the following UUID: 7d64c416-ff07-4f1f-aa64-f295d93b831c.
