Copyright © 2020, SAS Institute Inc., Cary, NC, USA.  All Rights Reserved.
SPDX-License-Identifier: Apache-2.0

# Fleet Maintenance H2O: Build and Import Trained Models into SAS Model Manager

This notebook provides an example of how to build and train a Python model and then import the model into SAS Model Manager using the fleet maintenance data set. Lines of code that must be modified by the user, such as directory paths, are noted with the comment "_Changes required by user._".

_**Note:** If you download only this notebook and not the rest of the repository, you must also download the fleet maintenance CSV file from the data folder in the examples directory. This file is used when executing this notebook example._

Here are the steps shown in this notebook:

1. Import and review the data, and preprocess it for model training.
2. Build, train, and assess an H2O generalized linear model.
3. Save the model as a MOJO file.
4. Write the metadata JSON files needed for importing into SAS Model Manager.
5. Write a score code Python file for model scoring.
6. Zip the MOJO, JSON, and score code files into an archive file.
7. Import the ZIP archive file to SAS Model Manager via the Session object and relevant function call.

### Python Package Imports

In [1]:
# Dataframes for data manipulations
import pandas as pd
pd.options.mode.chained_assignment = None  # default='warn'
# Mathematical calculations and array handling
import numpy as np
# File handling and reading
import json
import gzip
import zipfile
import io

# H2O models
import h2o
from h2o.estimators.glm import H2OGeneralizedLinearEstimator

# Embedded plotting
import matplotlib.pyplot as plt 
plt.rc("font", size=14)

# Pathing support
from pathlib import Path

In [2]:
# sasctl interface for importing models
import sasctl.pzmm as pzmm
from sasctl import Session
from sasctl.services import model_repository as modelRepo

In [3]:
# Use 2 CPU cores & 4 GB of RAM locally
h2o.init(nthreads=2, max_mem_size=4)

Checking whether there is an H2O instance running at http://localhost:54321 . connected.


H2O_cluster_uptime:,3 hours 17 mins
H2O_cluster_timezone:,America/New_York
H2O_data_parsing_timezone:,UTC
H2O_cluster_version:,3.32.0.2
H2O_cluster_version_age:,3 months and 4 days
H2O_cluster_name:,H2O_from_python_sclind_5uqkz7
H2O_cluster_total_nodes:,1
H2O_cluster_free_memory:,3.999 Gb
H2O_cluster_total_cores:,16
H2O_cluster_allowed_cores:,2


### Import Data Set

In [4]:
fleetData = h2o.import_file('data/fleet_maintenance.csv')
fleetData.head(rows=3)

Parse progress: |█████████████████████████████████████████████████████████| 100%


Maintenance_flag,Speed_sensor,Vibration,Engine_Load,Coolant_Temp,Intake_Pressure,Engine_RPM,Speed_OBD,Intake_Air,Flow_Rate,Throttle_Pos,Voltage,Ambient,Accel,Engine_Oil_Temp,Speed_GPS,GPS_Longitude,GPS_Latitude,GPS_Bearing,GPS_Altitude,Turbo_Boost,Trip_Distance,Litres_Per_km,Accel_Ssor_Total,CO2,Trip_Time,CO_emission,HC_emission,PM_emission,NOx_emission,CO2_emission,Fuel_level,Oil_life,Vibration_alert,VibrationAlert_Total,Vibration_Recent,Turbo_alert,Emission_alert,Fog_control,Engine_control
0,35,249.189,21.5686,88,116,1115.5,35,10,18.33,80,14.1,7,27.8431,85,36.216,9.1417,48.9328,75.2,164,2.75572,310.262,2.3515,0.045858,62.1972,11539,0,0,0,0,0,0,0,1,123,12,1,1,1,1
0,142,243.237,20.3922,88,135,1782.5,142,16,35.41,80,14.1,8,34.5098,85,148.968,9.88538,48.4948,274.4,436,5.51143,161.025,1.24465,0.043655,32.9209,4242,0,0,0,0,0,0,0,1,123,12,1,1,1,1
0,128,244.015,43.5294,81,109,1588.0,128,9,27.08,80,14.2,8,14.902,79,132.408,9.64806,48.4796,257.1,508,1.74045,158.238,2.1241,0.073833,56.1824,4146,0,0,0,0,0,0,0,1,123,12,1,1,1,1




### Preprocess Data

In [5]:
fleetData['Maintenance_flag'] = fleetData['Maintenance_flag'].asfactor()

train, validation, test = fleetData.split_frame(ratios=[.6, .2], seed=42)

y = 'Maintenance_flag'
x = list(fleetData.columns)
x.remove(y)
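
With `ratios=[.6, .2]`, the remaining 20% of the rows go to the third frame, so the call returns train, validation, and test partitions of roughly 60/20/20. H2O documents `split_frame` as a probabilistic split rather than an exact one, so the partition sizes vary slightly. The pure-Python sketch below illustrates that idea, assuming each row is assigned by an independent uniform draw. `split_rows` is a hypothetical helper written for this illustration and is not H2O's implementation.

```python
import random


def split_rows(n_rows, ratios, seed=42):
    """Assign row indices to len(ratios) + 1 partitions by uniform random draws."""
    if sum(ratios) >= 1:
        raise ValueError("ratios must sum to less than 1")
    rng = random.Random(seed)
    # Cumulative upper bounds, e.g. [0.6, 0.8] for ratios [0.6, 0.2];
    # the last partition receives whatever is left over.
    bounds = []
    total = 0.0
    for ratio in ratios:
        total += ratio
        bounds.append(total)
    parts = [[] for _ in range(len(ratios) + 1)]
    for i in range(n_rows):
        u = rng.random()
        # The partition index is the number of bounds that u has passed.
        k = sum(u >= b for b in bounds)
        parts[k].append(i)
    return parts


train_idx, valid_idx, test_idx = split_rows(10_000, [0.6, 0.2])
print(len(train_idx), len(valid_idx), len(test_idx))
```

The three counts add up to the number of rows, but each one only approximates its nominal share.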

### Create, Train, and Assess Model

In [6]:
# Generate the generalized linear estimator model called glmFit and train it on the train partition
glmFit = H2OGeneralizedLinearEstimator(family='binomial', model_id='glmFit', 
                                       lambda_search=True)
glmFit.train(x=x, y=y, training_frame=train, validation_frame=validation)



glm Model Build progress: |███████████████████████████████████████████████| 100%


In [7]:
# Check the model performance on the test partition and print its accuracy
# (the result is a [threshold, accuracy] pair)
glmPerf = glmFit.model_performance(test)
print(glmPerf.accuracy())

[[0.2722683116304135, 0.8069427527405603]]
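
The printed pair reads as `[threshold, accuracy]`: when no threshold is supplied, H2O reports the accuracy at the threshold that maximizes it. The sketch below shows how such a pair could be computed from predicted probabilities and true labels. The scores and labels are made up for illustration, `accuracy_at` and `max_accuracy` are hypothetical helpers, and the `>=` comparison is an assumption rather than a statement about H2O's internals.

```python
def accuracy_at(threshold, probs, labels):
    """Fraction of rows classified correctly when p >= threshold predicts class 1."""
    preds = [1 if p >= threshold else 0 for p in probs]
    correct = sum(int(pred == label) for pred, label in zip(preds, labels))
    return correct / len(labels)


def max_accuracy(probs, labels):
    """Return [[threshold, accuracy]] for the candidate threshold with the highest accuracy."""
    candidates = sorted(set(probs))
    best_t, best_acc = max(
        ((t, accuracy_at(t, probs, labels)) for t in candidates),
        key=lambda pair: pair[1],
    )
    return [[best_t, best_acc]]


# Made-up event probabilities and true labels, for illustration only
probs = [0.05, 0.10, 0.20, 0.30, 0.45, 0.60, 0.80, 0.90]
labels = [0, 0, 0, 1, 0, 1, 1, 1]

result = max_accuracy(probs, labels)
print(result)  # [[0.3, 0.875]]
```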


In [8]:
mojoPath = str(Path.cwd() / 'data/FleetMaintenanceModels/GLMH2OSimple/')
glmFit.save_mojo(mojoPath, force=True)

'C:\\Users\\sclind\\Documents\\Python Scripts\\GitLab\\python-sasctl\\examples\\data\\FleetMaintenanceModels\\GLMH2OSimple\\glmFit.zip'

In [9]:
# gzip the MOJO file so that it is easier to transfer to SAS Model Manager
mojoZip = Path(mojoPath) / 'glmFit.zip'
mojoGz = Path(mojoPath) / 'glmFit.mojo'
with open(mojoZip, 'rb') as fileIn, gzip.open(mojoGz, 'wb') as fileOut:
    fileOut.writelines(fileIn)
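
Note that the resulting `glmFit.mojo` is a gzip stream that wraps the original ZIP archive, so it must be decompressed before it can be read as a MOJO. A minimal round-trip check, using only the standard library and stand-in bytes in a temporary directory (the file names and payload here are placeholders, not the real model), might look like this:

```python
import gzip
import shutil
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "model.zip"   # stand-in for glmFit.zip
    target = Path(tmp) / "model.mojo"  # stand-in for glmFit.mojo
    payload = b"stand-in bytes for a MOJO archive" * 100
    source.write_bytes(payload)

    # Stream the source file into a gzip-compressed target
    with open(source, "rb") as file_in, gzip.open(target, "wb") as file_out:
        shutil.copyfileobj(file_in, file_out)

    # Decompress again and confirm that the bytes are unchanged
    with gzip.open(target, "rb") as file_check:
        restored = file_check.read()

    round_trip_ok = restored == payload
    compressed_size = target.stat().st_size

print(round_trip_ok, compressed_size, len(payload))
```

`shutil.copyfileobj` copies in fixed-size chunks, which is one alternative to `writelines` for binary data that has no meaningful line structure.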

In [10]:
# Import the mojo file into H2O and run a prediction with the test dataset
importModel = h2o.import_mojo(str(Path(mojoPath) / 'glmFit.zip'))
predictions = importModel.predict(test)
print(predictions)

generic Model Build progress: |███████████████████████████████████████████| 100%
Model Details
H2OGenericEstimator :  Import MOJO Model
Model Key:  Generic_model_python_1614009514990_7


GLM Model: summary


Unnamed: 0,Unnamed: 1,family,link,regularization,lambda_search,number_of_predictors_total,number_of_active_predictors,number_of_iterations,training_frame
0,,binomial,logit,"Elastic Net (alpha = 0.5, lambda = 0.003702 )","nlambda = 100, lambda.max = 0.2673, lambda.min = 0.003702, lambda....",25,24,94,py_4_sid_b8c1




ModelMetricsBinomialGLMGeneric: generic
** Reported on train data. **

MSE: 0.11970005630976009
RMSE: 0.3459769592180382
LogLoss: 0.3601253555062796
Null degrees of freedom: 5010
Residual degrees of freedom: 4986
Null deviance: 5327.303889273391
Residual deviance: 3609.1763128839343
AIC: 3659.1763128839343
AUC: 0.8740980211847207
AUCPR: 0.5702749410637339
Gini: 0.7481960423694414

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.2572011221149892: 


Unnamed: 0,Unnamed: 1,0,1,Error,Rate
0,0,3110.0,780.0,0.2005,(780.0/3890.0)
1,1,189.0,932.0,0.1686,(189.0/1121.0)
2,Total,3299.0,1712.0,0.1934,(969.0/5011.0)



Maximum Metrics: Maximum metrics at their respective thresholds


Unnamed: 0,metric,threshold,value,idx
0,max f1,0.257201,0.65796,228.0
1,max f2,0.179235,0.791471,267.0
2,max f0point5,0.257201,0.584766,228.0
3,max accuracy,0.420882,0.811415,162.0
4,max precision,0.909619,1.0,0.0
5,max recall,0.020911,1.0,379.0
6,max specificity,0.909619,1.0,0.0
7,max absolute_mcc,0.185502,0.557549,264.0
8,max min_per_class_accuracy,0.272493,0.805531,222.0
9,max mean_per_class_accuracy,0.185502,0.829621,264.0



Gains/Lift Table: Avg response rate: 22.37 %, avg score: 22.37 %


Unnamed: 0,group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,score,cumulative_response_rate,cumulative_score,capture_rate,cumulative_capture_rate,gain,cumulative_gain,kolmogorov_smirnov
0,1,0.010178,0.789252,2.62948,2.62948,0.588235,0.822662,0.588235,0.822662,0.026762,0.026762,162.947998,162.947998,0.021363
1,2,0.020156,0.735417,3.307886,2.965324,0.74,0.760078,0.663366,0.791679,0.033006,0.059768,230.788582,196.532445,0.051028
2,3,0.030134,0.710016,3.039679,2.989945,0.68,0.723356,0.668874,0.769056,0.03033,0.090098,203.967886,198.994512,0.077245
3,4,0.040112,0.686704,3.039679,3.002317,0.68,0.699504,0.671642,0.751754,0.03033,0.120428,203.967886,200.231669,0.103462
4,5,0.05009,0.665486,2.32446,2.867286,0.52,0.676023,0.641434,0.736669,0.023194,0.143622,132.44603,186.728554,0.120486
5,6,0.10018,0.591526,2.29739,2.582338,0.513944,0.62594,0.577689,0.681304,0.115076,0.258698,129.739028,158.233791,0.204199
6,7,0.15007,0.524434,2.539026,2.567939,0.568,0.556582,0.574468,0.639841,0.126673,0.38537,153.902587,156.793896,0.303108
7,8,0.20016,0.461084,2.724812,2.607196,0.609562,0.493115,0.58325,0.603123,0.136485,0.521855,172.481173,160.719625,0.4144
8,9,0.30014,0.314922,2.185985,2.466886,0.489022,0.392024,0.551862,0.532803,0.218555,0.74041,118.598485,146.688581,0.567146
9,10,0.40012,0.199996,1.641719,2.260697,0.367265,0.250601,0.505736,0.462288,0.164139,0.90455,64.171924,126.069705,0.649794




ModelMetricsBinomialGLMGeneric: generic
** Reported on validation data. **

MSE: 0.12348432766604255
RMSE: 0.35140336888829415
LogLoss: 0.36895478084504657
Null degrees of freedom: 1653
Residual degrees of freedom: 1629
Null deviance: 1755.8834812454106
Residual deviance: 1220.5024150354138
AIC: 1270.5024150354138
AUC: 0.866145750951673
AUCPR: 0.5263752060895416
Gini: 0.732291501903346

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.2188383236789634: 


Unnamed: 0,Unnamed: 1,0,1,Error,Rate
0,0,979.0,306.0,0.2381,(306.0/1285.0)
1,1,42.0,327.0,0.1138,(42.0/369.0)
2,Total,1021.0,633.0,0.2104,(348.0/1654.0)



Maximum Metrics: Maximum metrics at their respective thresholds


Unnamed: 0,metric,threshold,value,idx
0,max f1,0.218838,0.652695,237.0
1,max f2,0.176502,0.796237,258.0
2,max f0point5,0.26449,0.575966,215.0
3,max accuracy,0.386754,0.80653,164.0
4,max precision,0.709841,0.568182,23.0
5,max recall,0.040906,1.0,358.0
6,max specificity,0.872615,0.999222,0.0
7,max absolute_mcc,0.176502,0.558638,258.0
8,max min_per_class_accuracy,0.267237,0.798444,213.0
9,max mean_per_class_accuracy,0.176502,0.831668,258.0



Gains/Lift Table: Avg response rate: 22.31 %, avg score: 22.76 %


Unnamed: 0,group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,score,cumulative_response_rate,cumulative_score,capture_rate,cumulative_capture_rate,gain,cumulative_gain,kolmogorov_smirnov
0,1,0.010278,0.77513,1.845688,1.845688,0.411765,0.806822,0.411765,0.806822,0.01897,0.01897,84.568787,84.568787,0.011188
1,2,0.020556,0.729967,2.900367,2.373027,0.647059,0.746441,0.529412,0.776632,0.02981,0.04878,190.036665,137.302726,0.036329
2,3,0.03023,0.703275,2.241192,2.33084,0.5,0.714825,0.52,0.756853,0.02168,0.070461,124.119241,133.084011,0.051784
3,4,0.040508,0.68953,2.109358,2.274643,0.470588,0.69678,0.507463,0.741611,0.02168,0.092141,110.935756,127.464304,0.06646
4,5,0.050181,0.675537,2.241192,2.268195,0.5,0.681757,0.506024,0.730073,0.02168,0.113821,124.119241,126.819473,0.081915
5,6,0.100363,0.603289,2.538218,2.403206,0.566265,0.637073,0.536145,0.683573,0.127371,0.241192,153.821791,140.320632,0.18127
6,7,0.14994,0.528513,2.623835,2.476156,0.585366,0.564661,0.552419,0.644255,0.130081,0.371274,162.383502,147.615613,0.284892
7,8,0.200121,0.468444,2.430209,2.464635,0.542169,0.499429,0.549849,0.607939,0.121951,0.493225,143.020864,146.463456,0.377272
8,9,0.299879,0.325597,2.254775,2.394823,0.50303,0.403635,0.534274,0.539975,0.224932,0.718157,125.47754,139.482254,0.538391
9,10,0.400242,0.201532,1.86316,2.261505,0.415663,0.256644,0.504532,0.468928,0.186992,0.905149,86.315996,126.150533,0.649896




Scoring History: 


Unnamed: 0,Unnamed: 1,timestamp,duration,iteration,lambda,predictors,deviance_train,deviance_test,training_rmse,training_logloss,...,training_pr_auc,training_lift,training_classification_error,validation_rmse,validation_logloss,validation_r2,validation_auc,validation_pr_auc,validation_lift,validation_classification_error
0,,2021-02-22 14:26:14,0.000 sec,1,0.27,1,1.063122,1.061598,,,...,,,,,,,,,,
1,,2021-02-22 14:26:14,0.008 sec,3,0.24,5,1.048855,1.047204,,,...,,,,,,,,,,
2,,2021-02-22 14:26:14,0.014 sec,5,0.22,5,1.035905,1.034183,,,...,,,,,,,,,,
3,,2021-02-22 14:26:14,0.019 sec,7,0.2,5,1.024723,1.02293,,,...,,,,,,,,,,
4,,2021-02-22 14:26:14,0.027 sec,9,0.18,5,1.01507,1.013207,,,...,,,,,,,,,,
5,,2021-02-22 14:26:14,0.035 sec,11,0.17,6,1.0061,1.004124,,,...,,,,,,,,,,
6,,2021-02-22 14:26:15,0.042 sec,13,0.15,6,0.995571,0.993271,,,...,,,,,,,,,,
7,,2021-02-22 14:26:15,0.048 sec,15,0.14,7,0.985972,0.983486,,,...,,,,,,,,,,
8,,2021-02-22 14:26:15,0.053 sec,17,0.13,8,0.974664,0.971717,,,...,,,,,,,,,,
9,,2021-02-22 14:26:15,0.059 sec,19,0.12,8,0.964305,0.960871,,,...,,,,,,,,,,



See the whole table with table.as_data_frame()

Variable Importances: 


Unnamed: 0,variable,relative_importance,scaled_importance,percentage
0,GPS_Altitude,1.259765,1.0,0.161117
1,Trip_Distance,0.806042,0.639835,0.103088
2,GPS_Longitude,0.75786,0.601588,0.096926
3,Flow_Rate,0.626926,0.497653,0.08018
4,Ambient,0.602574,0.478322,0.077066
5,Trip_Time,0.550254,0.436791,0.070374
6,GPS_Latitude,0.508591,0.403719,0.065046
7,Engine_Oil_Temp,0.504403,0.400394,0.06451
8,Voltage,0.454727,0.360961,0.058157
9,Litres_Per_km,0.187535,0.148865,0.023985



See the whole table with table.as_data_frame()

generic prediction progress: |████████████████████████████████████████████| 100%


predict,p0,p1
0,0.962103,0.0378967
0,0.96441,0.0355896
0,0.992403,0.00759678
0,0.968775,0.0312247
1,0.704515,0.295485
0,0.880342,0.119658
0,0.871856,0.128144
0,0.9621,0.0378999
0,0.93915,0.0608499
0,0.937026,0.0629742





### Register Model in SAS Model Manager with pzmm

In [11]:
modelPrefix = 'glmFit'
zipFolder = Path.cwd() / 'data/FleetMaintenanceModels/GLMH2OSimple/'

traindf = train.as_data_frame()

In [12]:
# Write the input and output variable JSON files
J = pzmm.JSONFiles()
J.writeVarJSON(traindf[x], isInput=True, jPath=zipFolder)
outputVar = pd.DataFrame(columns=['EM_EVENTPROBABILITY',
                                  'EM_CLASSIFICATION'])
outputVar['EM_CLASSIFICATION'] = traindf[y].astype('category').cat.categories.astype('str')
outputVar['EM_EVENTPROBABILITY'] = 0.5
J.writeVarJSON(outputVar, isInput=False, jPath=zipFolder)

# Write the model properties JSON file
J.writeModelPropertiesJSON(modelName=modelPrefix,
                           modelDesc='',
                           targetVariable=y,
                           modelType='Classification',
                           modelPredictors=x,
                           targetEvent=1,
                           numTargetCategories=2,
                           eventProbVar='EM_EVENTPROBABILITY',
                           jPath=zipFolder,
                           modeler='sclind')

J.writeFileMetadataJSON(modelPrefix=modelPrefix,
                        jPath=zipFolder,
                        isH2OModel=True)

In [13]:
import getpass

username = getpass.getpass('Username: ')
password = getpass.getpass('Password: ')
host = 'sas.server.com'  # Changes required by user.
sess = Session(host, username, password, protocol='http')

In [14]:
I = pzmm.ImportModel()
with sess:
    I.pzmmImportModel(zipFolder, modelPrefix, 'H2O Project', traindf[x], traindf[y],
                      '{}.predict({})', isH2OModel=True)
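
The archive step listed at the top of this notebook can be pictured with the standard-library `zipfile` and `io` modules imported earlier. The sketch below bundles a few files into an in-memory ZIP archive. It is not pzmm's implementation, `zip_files_in_memory` is a hypothetical helper, and the file names and contents are placeholders chosen for illustration.

```python
import io
import zipfile


def zip_files_in_memory(files):
    """Write a {name: content} mapping into a ZIP archive held in a BytesIO buffer."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, content in files.items():
            archive.writestr(name, content)
    buffer.seek(0)  # rewind so that the caller can read from the start
    return buffer


# Placeholder names and contents; a real archive would hold the MOJO,
# the JSON metadata files, and the score code file.
files = {
    "inputVar.json": "[]",
    "outputVar.json": "[]",
    "glmFitScore.py": "# score code placeholder\n",
}

archive_buffer = zip_files_in_memory(files)

# Reopen the archive and list what it contains
with zipfile.ZipFile(archive_buffer) as archive:
    names = sorted(archive.namelist())

print(names)
```

Keeping the archive in memory avoids leaving an extra file on disk when the bytes are only needed for an upload.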