# Model Lifecycle: Build, Import, and Score Test Decision Tree Classifier Models

This notebook provides an example of implementing the entire model lifecycle using the HMEQ data set. Lines of code that must be modified by the user, such as directory paths or the host server, are marked with the comment "_Changes required by user_".

_**Note:** If you download only this notebook and not the rest of the repository, you must also download the hmeq.csv file and the HMEQPERF_1_Q1.csv file from the data folder in the examples directory. These files are used when executing this notebook example._

_**Note:** This example can optionally use CAS Gateway to run score testing quickly. This option is available in SAS Viya 2025.01 or later. To enable it, replace False with True on the lines marked with the comment "Change to True if your Viya version is compatible with CAS Gateway"._

Here are the steps shown in this notebook:

1. Import, review, and preprocess HMEQ data for model training.
2. Build, train, and assess a Decision Tree Classifier model.
3. Score the model and save the resulting JSON information.
4. Import the model and the associated JSON files into SAS Model Manager.
5. Score test the model, with or without CAS Gateway, and display the results.

### Python Package Imports

In [1]:
# Standard Library
from pathlib import Path
import sys
import warnings

# Third Party
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from requests import HTTPError
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Application Specific
import sasctl.pzmm as pzmm
from sasctl import Session
from sasctl._services.score_definitions import ScoreDefinitions as sd
from sasctl._services.score_execution import ScoreExecution as se

In [2]:
# Global Package Options
pd.options.mode.chained_assignment = None  # default="warn"
plt.rc("font", size=14)
# Ignore warnings from pandas about SWAT using a feature that will be deprecated soon
warnings.simplefilter(action="ignore", category=FutureWarning)

### Import and Review Data Set

In [3]:
hmeq_data = pd.read_csv("data/hmeq.csv", sep=",") # Adjust the path if the notebook is not run from the examples directory
hmeq_data.head()

| | BAD | LOAN | MORTDUE | VALUE | REASON | JOB | YOJ | DEROG | DELINQ | CLAGE | NINQ | CLNO | DEBTINC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 26800 | 46236.0 | 62711.0 | DebtCon | Office | 17.0 | 0.0 | 0.0 | 175.075058 | 1.0 | 22.0 | 33.059934 |
| 1 | 0 | 26900 | 74982.0 | 126972.0 | DebtCon | Office | 0.0 | 0.0 | 0.0 | 315.818911 | 0.0 | 23.0 | 38.32599 |
| 2 | 0 | 26900 | 67144.0 | 92923.0 | DebtCon | Other | 16.0 | 0.0 | 0.0 | 89.112173 | 1.0 | 17.0 | 32.791478 |
| 3 | 0 | 26900 | 45763.0 | 73797.0 | DebtCon | Other | 23.0 | | 0.0 | 291.591681 | 1.0 | 29.0 | 39.370858 |
| 4 | 0 | 27000 | 144901.0 | 178093.0 | DebtCon | ProfExe | 7.0 | 0.0 | 0.0 | 331.113972 | 0.0 | 34.0 | 40.566552 |


### Preprocess Data

In [4]:
predictor_columns = ["LOAN", "MORTDUE", "VALUE", "YOJ", "DEROG", "DELINQ", "CLAGE", "NINQ", "CLNO", "DEBTINC"]

target_column = "BAD"
x = hmeq_data[predictor_columns]
y = hmeq_data[target_column]

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)

In [5]:
# For missing values, impute the data set's mean value
x_test.fillna(x_test.mean(), inplace=True)
x_train.fillna(x_train.mean(), inplace=True)
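
The cell above fills each split with its own column means. A common alternative is to compute the means from the training split only and reuse them for the test split, so that test data does not influence preprocessing. The following is a minimal sketch of that idea on toy data, not a change to the notebook's results; the names `toy_train`, `toy_test`, and `train_means` are hypothetical.

```python
import pandas as pd

# Hypothetical toy frames standing in for x_train and x_test
toy_train = pd.DataFrame({"YOJ": [2.0, None, 10.0], "DEROG": [0.0, 1.0, None]})
toy_test = pd.DataFrame({"YOJ": [None, 4.0], "DEROG": [None, 0.0]})

# Column means computed from the training split only
train_means = toy_train.mean()

# fillna returns new frames, so no slice of another frame is modified in place
toy_train_filled = toy_train.fillna(train_means)
toy_test_filled = toy_test.fillna(train_means)

print(toy_test_filled)
```

Because `fillna` is called without `inplace=True`, this variant also does not need the chained-assignment warning to be suppressed.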

### Create, Train, and Assess Model

In [6]:
dtc = DecisionTreeClassifier(max_depth=7, min_samples_split=2, min_samples_leaf=2, max_leaf_nodes=500)
dtc = dtc.fit(x_train, y_train)

In [7]:
# Calculate the importance of a predictor 
def sort_feature_importance(model, data):
    features = {}
    for importance, name in sorted(zip(model.feature_importances_, data.columns), reverse=True):
        features[name] = str(np.round(importance*100, 2)) + "%"
    return features

In [8]:
# Displays the percentage weight of the predictors
importances = pd.DataFrame.from_dict(sort_feature_importance(dtc, x_train), orient="index").rename(columns={0: "DecisionTree"})
importances

| | DecisionTree |
|---|---|
| DEBTINC | 58.35% |
| DELINQ | 18.57% |
| CLAGE | 8.07% |
| DEROG | 4.86% |
| VALUE | 3.24% |
| YOJ | 2.78% |
| MORTDUE | 1.87% |
| CLNO | 1.2% |
| NINQ | 0.88% |
| LOAN | 0.17% |


In [9]:
# Displays model score metrics
y_dtc_predict = dtc.predict(x_test)
y_dtc_proba = dtc.predict_proba(x_test)
print(confusion_matrix(y_test, y_dtc_predict))
print(classification_report(y_test, y_dtc_predict))
print("Decision Tree Model Accuracy = " + str(np.round(dtc.score(x_test, y_test)*100,2)) + "%")

[[1427   14]
 [ 272   75]]
              precision    recall  f1-score   support

           0       0.84      0.99      0.91      1441
           1       0.84      0.22      0.34       347

    accuracy                           0.84      1788
   macro avg       0.84      0.60      0.63      1788
weighted avg       0.84      0.84      0.80      1788

Decision Tree Model Accuracy = 84.0%
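
To see how the classification report relates to the confusion matrix, the class 1 metrics can be recomputed by hand from the four counts printed above. This sketch assumes the scikit-learn layout, where rows are actual classes and columns are predicted classes; the variable names are illustrative only.

```python
# Counts copied from the confusion matrix printed above
tn, fp = 1427, 14   # actual 0: predicted 0, predicted 1
fn, tp = 272, 75    # actual 1: predicted 0, predicted 1

precision = tp / (tp + fp)                          # share of predicted 1s that are correct
recall = tp / (tp + fn)                             # share of actual 1s that are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
accuracy = (tp + tn) / (tp + tn + fp + fn)          # share of all rows classified correctly

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
# prints: precision=0.84 recall=0.22 f1=0.34 accuracy=0.84
```

The low recall for class 1 shows that the 84% accuracy is driven mostly by the majority class: the tree identifies only about one in five of the actual BAD=1 cases in the test split.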


### Register Model in SAS Model Manager with pzmm

In [10]:
# Output variables expected in SAS Model Manager. If a classification value is expected to be output, it should be the first metric.
score_metrics = ["EM_CLASSIFICATION", "EM_EVENTPROBABILITY"]

# Path to where the model should be stored
path = Path.cwd() / "data/hmeqModels/DecisionTreeClassifier"

# Serialize the models to a pickle format
pzmm.PickleModel.pickle_trained_model(
    model_prefix="DecisionTreeClassifier",
    trained_model=dtc,
    pickle_path=path,
)

Model DecisionTreeClassifier was successfully pickled and saved to ~/python-sasctl/examples/data/hmeqModels/DecisionTreeClassifier/DecisionTreeClassifier.pickle.
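
Before uploading the model, it can be useful to confirm that a pickle file loads back into an equivalent object. The sketch below shows the round trip with the standard library only. It uses a plain dictionary as a stand-in for the trained model and a temporary directory instead of the notebook's path, so both are assumptions made for illustration.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in for a trained model; in the notebook this would be dtc
stand_in_model = {"max_depth": 7, "min_samples_leaf": 2}

with tempfile.TemporaryDirectory() as tmp_dir:
    pickle_file = Path(tmp_dir) / "DecisionTreeClassifier.pickle"

    # Serialize the object to disk
    with open(pickle_file, "wb") as f:
        pickle.dump(stand_in_model, f)

    # Load it back as a new object
    with open(pickle_file, "rb") as f:
        restored_model = pickle.load(f)

# True when the restored object carries the same contents as the original
print(restored_model == stand_in_model)
```

With a real estimator, a comparable check would be to load the file written above and compare its predictions on a few rows with those of the in-memory model.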


In [11]:
def write_json_files(data, predict, target, path, prefix):    
    # Write input variable mapping to a json file
    pzmm.JSONFiles.write_var_json(input_data=data[predict], is_input=True, json_path=path)
    
    # Set output variables and assign an event threshold, then write output variable mapping
    output_var = pd.DataFrame(columns=score_metrics, data=[["A", 0.5]]) # data argument includes example expected types for outputs
    pzmm.JSONFiles.write_var_json(output_var, is_input=False, json_path=path)
    
    # Write model properties to a json file
    pzmm.JSONFiles.write_model_properties_json(
        model_name=prefix, 
        target_variable=target, # Target variable to make predictions about (BAD in this case)
        target_values=["1", "0"], # Possible values for the target variable (1 or 0 for binary classification of BAD)
        json_path=path, 
        model_desc=f"Description for the {prefix} model.",
        model_algorithm="",
        modeler="sasdemo",
    )
    
    # Write model metadata to a json file so that SAS Model Manager can properly identify all model files
    pzmm.JSONFiles.write_file_metadata_json(model_prefix=prefix, json_path=path)


write_json_files(hmeq_data, predictor_columns, target_column, path, "DecisionTreeClassifier")

inputVar.json was successfully written and saved to ~/python-sasctl/examples/data/hmeqModels/DecisionTreeClassifier/inputVar.json
outputVar.json was successfully written and saved to ~/python-sasctl/examples/data/hmeqModels/DecisionTreeClassifier/outputVar.json
ModelProperties.json was successfully written and saved to ~/python-sasctl/examples/data/hmeqModels/DecisionTreeClassifier/ModelProperties.json
fileMetadata.json was successfully written and saved to ~/python-sasctl/examples/data/hmeqModels/DecisionTreeClassifier/fileMetadata.json


In [21]:
import getpass
def write_model_stats(x_train, y_train, test_predict, test_proba, y_test, model, path, prefix):
    # Calculate train predictions
    train_predict = model.predict(x_train)
    train_proba = model.predict_proba(x_train)
    
    # Assign data to lists of actual and predicted values
    train_data = pd.concat([y_train.reset_index(drop=True), pd.Series(train_predict), pd.Series(data=train_proba[:,1])], axis=1)
    test_data = pd.concat([y_test.reset_index(drop=True), pd.Series(test_predict), pd.Series(data=test_proba[:,1])], axis=1)
    
    # Calculate the model statistics, ROC chart, and Lift chart; then write to json files
    pzmm.JSONFiles.calculate_model_statistics(
        target_value=1, 
        prob_value=0.5, 
        train_data=train_data, 
        test_data=test_data, 
        json_path=path
    )

    full_training_data = pd.concat([y_train.reset_index(drop=True), x_train.reset_index(drop=True)], axis=1)
        
username = getpass.getpass("Username: ") # Explicit prompts distinguish the two inputs
password = getpass.getpass("Password: ")
host = "demo.sas.com" # Changes required by user
sess = Session(host, username, password, protocol="http") # For TLS-enabled servers, change protocol value to "https"
conn = sess.as_swat() # Connect to SWAT through the sasctl authenticated connection

test_predict = y_dtc_predict
test_proba = y_dtc_proba

write_model_stats(x_train, y_train, test_predict, test_proba, y_test, dtc, path, "DecisionTreeClassifier")

dmcas_fitstat.json was successfully written and saved to ~/python-sasctl/examples/data/hmeqModels/DecisionTreeClassifier/dmcas_fitstat.json
dmcas_roc.json was successfully written and saved to ~/python-sasctl/examples/data/hmeqModels/DecisionTreeClassifier/dmcas_roc.json
dmcas_lift.json was successfully written and saved to ~/python-sasctl/examples/data/hmeqModels/DecisionTreeClassifier/dmcas_lift.json


In [13]:
pzmm.ImportModel.import_model(
    model_files=path,
    model_prefix="DecisionTreeClassifier", # What is the model name?
    project="HMEQModels", # What is the project name?
    input_data=x, # What does example input data look like?
    predict_method=[dtc.predict_proba, [int, int]], # What is the predict method and what does it return?
    score_metrics=score_metrics, # What are the output variables?
    overwrite_model=True, # Overwrite the model if it already exists?
    target_values=["0", "1"], # What are the expected values of the target variable?
    target_index=1, # What is the index of the target value in target_values?
    model_file_name="DecisionTreeClassifier" + ".pickle", # How was the model file serialized?
    missing_values=True # Does the data include missing values?
)
# Reinitialize the score_code variable when writing more than one model's score code
pzmm.ScoreCode.score_code = ""

  warn(f"No project with the name or UUID {project} was found.")


Model score code was written successfully to ~/python-sasctl/examples/data/hmeqModels/DecisionTreeClassifier/score_DecisionTreeClassifier.py and uploaded to SAS Model Manager.
All model files were zipped to ~/python-sasctl/examples/data/hmeqModels/DecisionTreeClassifier.
A new project named HMEQModels was created.
Model was successfully imported into SAS Model Manager as DecisionTreeClassifier with the following UUID: 1481eec5-48a4-4f52-9baa-d44b3f62f9af.


### Implementing Score Testing

In [6]:
# Create the score definition for the model imported above, referenced here by name
score_definition = sd.create_score_definition(
    score_def_name="example_score_def_name", # Name of the score definition, which can be any string
    model="DecisionTreeClassifier", # Can be the model name, UUID, or dictionary representation of the model
    table_name="HMEQPERF_1_Q1", # Table name for input data
    use_cas_gateway=False, # Change to True if your Viya version is compatible with CAS Gateway
    table_file="data/HMEQPERF_1_Q1.csv" # File path of HMEQPERF_1_Q1; needed only if the table does not yet exist on the server, otherwise this line can be commented out
)


In [None]:
# Executing the score definition
score_execution = se.create_score_execution(
    score_definition.get("id"), # Score definition id created in the previous cell
)

# Prints score_execution_id
print(score_execution)

In [None]:
# Use this function call to wait until the score execution is finished, as get_score_execution_results will throw an error if it hasn't finished
se.poll_score_execution_state(score_execution)
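
The polling step above blocks until the execution reaches a final state. As an illustration of the general pattern only, and not of how sasctl implements it, the sketch below polls a caller-supplied function with a fixed interval and a timeout. The function name `wait_until_done` and the state names are assumptions made for this example.

```python
import time


def wait_until_done(get_state, interval=1.0, timeout=60.0):
    """Call get_state() until it returns a terminal state or the timeout elapses."""
    terminal_states = {"completed", "failed", "canceled"}  # assumed state names
    deadline = time.monotonic() + timeout
    while True:
        state = get_state()
        if state in terminal_states:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Still in state {state!r} after {timeout} seconds")
        time.sleep(interval)


# Simulated execution that reports "running" twice before finishing
states = iter(["running", "running", "completed"])
final_state = wait_until_done(lambda: next(states), interval=0.01)
print(final_state)
```

A timeout keeps the notebook from hanging indefinitely if an execution never leaves its running state.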

In [None]:
# The following lines print the output table with scoring results. Ensure that the use_cas_gateway argument is the same as it is in the score definition call.
score_results = se.get_score_execution_results(score_execution, use_cas_gateway=False)
score_results

***

### Using the Score Testing Task

The above commands can be run in a single function call, found in the tasks module of sasctl.

In [None]:
from sasctl.tasks import score_model_with_cas

score_model_with_cas(
    score_def_name="score_definition_example",
    model='DecisionTreeClassifier',
    table_name='HMEQPERF_1_Q1', # If this call is made before running the code above, the table_file argument must be included if the file is not yet on the server
    use_cas_gateway=False # Change to True if your Viya version is compatible with CAS Gateway
)