Copyright © 2020, SAS Institute Inc., Cary, NC, USA.  All Rights Reserved.
SPDX-License-Identifier: Apache-2.0

# Build and Import a Trained Model into SAS Model Manager

This notebook provides an example of how to build and train a Python model and then import the model into SAS Model Manager. Lines of code that must be modified by the user, such as directory paths, are noted with the comment "_Changes required by user._".

_**Note:** If you download only this notebook and not the rest of the repository, you must also download the hmeq.csv, hmeqPrediction.csv, and dmcas_fitstat.csv files from the [/samples/Python_Models/DTree_sklearn_PyPickleModel/Data](../samples/Python_Models/DTree_sklearn_PyPickleModel/Data) directory. These files are used when executing this notebook example._

Here are the steps:

1. Build and train a model.
2. Serialize the model into a pickle file and deploy the pickle file into SAS Model Manager.
3. Write JSON files that are associated with the trained model and write the model score code .py file. Also, write JSON files for one of the following data options:
   (a) Generate Fit Statistics from user-defined input.
   (b) Calculate Fit Statistics, ROC curve and Lift information from data.
4. Import the model into SAS Model Manager using an import_model call. This call generates the necessary score code, creates a ZIP archive file for the model, and then sends the archive to SAS Model Manager.

### Step 1: Build and Train a Model

In [1]:
from pathlib import Path
import pandas as pd

import sklearn.tree as tree
from sklearn.model_selection import train_test_split

In [2]:
data_folder = Path.cwd() / '../../samples/Python_Models/DTree_sklearn_PyPickleModel/Data/' # Changes required by user.
zip_folder = Path.cwd() / '../../samples/Python_Models/DTree_sklearn_PyPickleModel/Model/' # Changes required by user.
model_prefix = 'hmeqClassTree'

In [3]:
y_name = 'BAD'
cat_name = ['JOB', 'REASON']
int_name = ['CLAGE', 'CLNO', 'DEBTINC', 'DELINQ', 'DEROG', 'NINQ', 'YOJ']

input_data = pd.read_csv((Path(data_folder) / 'hmeq.csv'), sep=',',
                        usecols=[y_name]+cat_name+int_name)

In [4]:
use_column = [y_name]
use_column.extend(cat_name + int_name)
input_data = input_data[use_column].dropna()

x_train, x_test, y_train, y_test = train_test_split(input_data, input_data[y_name],
                                                test_size=0.2, random_state=42)

In [None]:
model = tree.DecisionTreeClassifier(criterion='entropy', max_depth=5,
                                    min_samples_split=20,
                                    min_samples_leaf=10,
                                    random_state=42)
print(model)

In [6]:
x_train_dummies = pd.get_dummies(x_train[cat_name].astype('category'))
x_train = x_train_dummies.join(x_train[int_name])
y_train = y_train.astype('category')
trained_model = model.fit(x_train, y_train)

In [7]:
x_test_dummies = pd.get_dummies(x_test[cat_name].astype('category'))
x_test = x_test_dummies.join(x_test[int_name])
y_test = y_test.astype('category')
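
Because the training and test frames are one-hot encoded separately, a category that is missing from one split would yield different dummy columns. The following is a minimal sketch of one way to align them, using small made-up frames rather than hmeq.csv; the names `train`, `test`, and `test_aligned` are illustrative only and are not part of the notebook.

```python
import pandas as pd

# Made-up data: the 'Sales' level appears only in the training split.
train = pd.DataFrame({"JOB": ["Mgr", "Office", "Sales"]})
test = pd.DataFrame({"JOB": ["Mgr", "Office"]})

train_dummies = pd.get_dummies(train[["JOB"]].astype("category"))
test_dummies = pd.get_dummies(test[["JOB"]].astype("category"))

# Reindex the test dummies to the training columns; absent levels become 0.
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(list(train_dummies.columns))
print(list(test_aligned.columns))
```

If both splits already contain every category level, the reindex step leaves the test frame unchanged.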

In [8]:
y_category = y_train.cat.categories
output_var = pd.DataFrame(columns=['EM_EVENTPROBABILITY', 'EM_CLASSIFICATION'])
output_var['EM_CLASSIFICATION'] = y_category.astype('str')
output_var['EM_EVENTPROBABILITY'] = 0.5
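
Which column of `predict_proba` holds the event probability depends on the order of the fitted classes, which scikit-learn exposes as `classes_`. The sketch below uses a tiny made-up dataset (not hmeq.csv) to look up that column instead of assuming it; `X`, `y`, `clf`, and `event_col` are illustrative names.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Made-up, perfectly separable data with a 0/1 target.
X = pd.DataFrame({"x": [0.0, 0.1, 0.9, 1.0]})
y = pd.Series([0, 0, 1, 1]).astype("category")

clf = DecisionTreeClassifier(random_state=42).fit(X, y)

# Columns of predict_proba follow the order of clf.classes_.
classes = list(clf.classes_)
event_col = classes.index(1)  # position of the event class (1)

proba = clf.predict_proba(X)
print(classes, event_col)
print(proba[:, event_col])
```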

### Step 2: Serialize a Model Into a Pickle File

In [9]:
import sasctl.pzmm as pzmm

In [None]:
pzmm.PickleModel.pickle_trained_model(trained_model=trained_model,
                                      model_prefix=model_prefix,
                                      pickle_path=zip_folder)
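
To illustrate what serialization involves, here is a minimal round-trip sketch that uses only the standard `pickle` module. It is not how `pickle_trained_model` is implemented; the dictionary is a hypothetical stand-in for `trained_model`, and the file name mirrors the `model_prefix + ".pickle"` value passed to `import_model` in Step 4.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical stand-in for the trained model object.
stand_in_model = {"criterion": "entropy", "max_depth": 5}

with tempfile.TemporaryDirectory() as tmp:
    pickle_file = Path(tmp) / "hmeqClassTree.pickle"

    # Serialize the object to disk.
    with open(pickle_file, "wb") as f:
        pickle.dump(stand_in_model, f)

    # Load it back and confirm that it survived the round trip.
    with open(pickle_file, "rb") as f:
        restored = pickle.load(f)

print(restored == stand_in_model)
```

Only unpickle files from sources you trust, because loading a pickle can execute arbitrary code.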

### Step 3: Write JSON Model Files

In [None]:
JSONFiles = pzmm.JSONFiles()
JSONFiles.write_var_json(input_data[cat_name+int_name], is_input=True, json_path=zip_folder)

JSONFiles.write_var_json(output_var, is_input=False, json_path=zip_folder)

model_name = 'Home Equity Loan Classification Tree'
JSONFiles.write_model_properties_json(model_name=model_name,
                                   target_variable=y_name,
                                   target_values=[str(y) for y in y_category],
                                   json_path=zip_folder,
                                   model_desc=f"Description for {model_name} model",
                                   model_algorithm="",
                                   modeler='sasdemo')

JSONFiles.write_file_metadata_json(model_prefix, json_path=zip_folder)

In [None]:
# (a) Writes Fit Statistics to dmcas_fitstat.json file from user-defined input.
# This cell can be skipped if calculating statistics automatically from data.
fit_stat_tuples = [('GAMMA', 1.65412, 'TRAIN'),
                 ('NObs', 176, 'TEST'),
                 ('MCLL', .196882, 'VALIDATE')]
csv_path = data_folder / 'dmcas_fitstat.csv' # Changes required by user.
fitstat_df = pd.read_csv(csv_path)
JSONFiles = pzmm.JSONFiles()
JSONFiles.input_fit_statistics(fitstat_df=fitstat_df,
                               user_input=True,
                               tuple_list=fit_stat_tuples,
                               json_path=zip_folder)
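
Each entry in `fit_stat_tuples` above follows the pattern (statistic name, value, data role). The sketch below is a sanity check you could run before writing the JSON file. `check_fit_stat_tuples` is a hypothetical helper, not part of sasctl, and the set of allowed roles is an assumption based only on the roles used in this notebook.

```python
# Assumption: only the roles that appear in this notebook are considered valid.
ASSUMED_ROLES = {"TRAIN", "TEST", "VALIDATE"}


def check_fit_stat_tuples(tuples):
    """Hypothetical helper: return a list of problems found in the tuples."""
    problems = []
    for i, item in enumerate(tuples):
        if len(item) != 3:
            problems.append(f"tuple {i}: expected 3 elements, got {len(item)}")
            continue
        name, value, role = item
        if not isinstance(name, str):
            problems.append(f"tuple {i}: statistic name must be a string")
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"tuple {i}: value must be numeric")
        if role not in ASSUMED_ROLES:
            problems.append(f"tuple {i}: unexpected role {role!r}")
    return problems


fit_stat_tuples = [("GAMMA", 1.65412, "TRAIN"),
                   ("NObs", 176, "TEST"),
                   ("MCLL", .196882, "VALIDATE")]

print(check_fit_stat_tuples(fit_stat_tuples))
```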

In [14]:
# To calculate statistics from data, a connection to a SAS server is required.

from sasctl import Session
import getpass

username = getpass.getpass("Username: ")
password = getpass.getpass("Password: ")
host = "demo.sas.com"  # Changes required by user.
sess = Session(host, username, password, protocol="http") # For TLS-enabled servers, change protocol value to "https"
conn = sess.as_swat() # Connect to SWAT through the sasctl authenticated connection

In [None]:
# (b) Calculates Fit Statistics, ROC curve and Lift information from data to create the relevant JSON files.
# This cell can be skipped if statistics were defined by the user.
train_predict = trained_model.predict(x_train)
train_proba = trained_model.predict_proba(x_train)

test_predict = trained_model.predict(x_test)
test_proba = trained_model.predict_proba(x_test)

train_data = pd.concat([y_train.reset_index(drop=True), pd.Series(train_predict), pd.Series(data=train_proba[:,1])], axis=1)
test_data = pd.concat([y_test.reset_index(drop=True), pd.Series(test_predict), pd.Series(data=test_proba[:,1])], axis=1)

JSONFiles.calculate_model_statistics(target_value=1,
                                     prob_value=.5,
                                     train_data=train_data,
                                     test_data=test_data,
                                     json_path=zip_folder
                                     )
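
The `reset_index(drop=True)` calls in the cell above matter because `pd.concat` with `axis=1` aligns rows by index label, and the target series keeps the shuffled labels from the train/test split. The sketch below uses made-up values to show the difference; `y`, `pred`, and `proba` are illustrative stand-ins for the notebook variables.

```python
import pandas as pd

# Made-up target whose index labels were shuffled by a split.
y = pd.Series([1, 0, 1], index=[7, 2, 5], name="BAD")
pred = pd.Series([1, 0, 0])          # default index 0, 1, 2
proba = pd.Series([0.9, 0.2, 0.4])   # default index 0, 1, 2

# Without resetting the index, rows are matched by label and NaN values appear.
misaligned = pd.concat([y, pred, proba], axis=1)

# After resetting the index, rows are matched by position.
aligned = pd.concat([y.reset_index(drop=True), pred, proba], axis=1)

print(misaligned.shape, aligned.shape)
```

The aligned frame has one row per observation with the actual value, the predicted class, and the event probability, which is the three-column layout built in the cell above.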


### Step 4: Import Model into SAS Model Manager

In [None]:
# import_model generates necessary score code, ZIPs the model, and imports it into SAS Model Manager
# If calling import_model multiple times, ensure that the score_code value is reset before each call.

pzmm.ImportModel.import_model(
        model_files=zip_folder, # Where are the model files?
        model_prefix=model_prefix, # What is the model name?
        project="HMEQModels", # What is the project name?
        input_data=x_train, # What does example input data look like?
        predict_method=[tree.DecisionTreeClassifier.predict_proba, [int, int]], # What is the predict method and what does it return?
        score_metrics=output_var.columns.to_list(), # What are the output variables?
        overwrite_model=True, # Overwrite the model if it already exists?
        target_values=["0", "1"], # What are the expected values of the target variable?
        target_index=1, # What is the index of the target value in target_values?
        model_file_name=model_prefix + ".pickle", # How was the model file serialized?
        missing_values=True # Does the data include missing values?
    )

pzmm.ScoreCode.score_code = ''