# The Data Science Pilot Action Set 

The dataSciencePilot action set consists of actions that implement a policy-based, configurable, and scalable approach to automating data science workflows. This action set can be used to automate an end-to-end workflow or to automate individual steps in the workflow, such as data preparation, feature preprocessing, feature engineering, feature selection, and hyperparameter tuning. More information about this action set is available on [its documentation page](https://go.documentation.sas.com/?docsetId=casactml&docsetTarget=casactml_datasciencepilot_toc.htm&docsetVersion=8.5&locale=en).
***

## Table of Contents
Today we will set up the notebook and go through each of the nine actions listed below.

1. [Setting Up the Notebook](#Setting-Up-the-Notebook)
1. [Explore Data](#Explore-Data)
1. [Explore Correlations](#Explore-Correlations)
1. [Analyze Missing Patterns](#Analyze-Missing-Patterns)
1. [Detect Interactions](#Detect-Interactions)
1. [Screen Variables](#Screen-Variables)
1. [Feature Machine](#Feature-Machine)
1. [Generate Shadow Features](#Generate-Shadow-Features)
1. [Select Features](#Select-Features)
1. [Data Science Automated Machine Learning Pipeline](#Data-Science-Automated-Machine-Learning-Pipeline)
1. [Conclusion](#Conclusion)
***

## Setting Up the Notebook

First, we must import the Scripting Wrapper for Analytics Transfer (SWAT) package and use it to connect to our Cloud Analytic Services (CAS) server.

In [1]:
import swat
import numpy as np
import pandas as pd

In [2]:
conn = swat.CAS('localhost', 5570, authinfo='~/.authinfo', caslib="CASUSER")

Now we will load the dataSciencePilot action set and the decisionTree action set.

In [3]:
conn.builtins.loadactionset('dataSciencePilot')
conn.builtins.loadactionset('decisionTree')

NOTE: Added action set 'dataSciencePilot'.
NOTE: Added action set 'decisionTree'.


Next, we must connect to our data source. We are using a data set for predicting home equity loan defaults.

In [4]:
tbl = 'hmeq'
hmeq = conn.read_csv("./data/hmeq.csv", casout=dict(name=tbl, replace=True))

NOTE: Cloud Analytic Services made the uploaded file available as table HMEQ in caslib CASUSER(sasdemo).
NOTE: The table HMEQ has been created in caslib CASUSER(sasdemo) from binary data uploaded to Cloud Analytic Services.


In [5]:
hmeq.head()

Unnamed: 0,BAD,LOAN,MORTDUE,VALUE,REASON,JOB,YOJ,DEROG,DELINQ,CLAGE,NINQ,CLNO,DEBTINC
0,1.0,1100.0,25860.0,39025.0,HomeImp,Other,10.5,0.0,0.0,94.366667,1.0,9.0,
1,1.0,1300.0,70053.0,68400.0,HomeImp,Other,7.0,0.0,2.0,121.833333,0.0,14.0,
2,1.0,1500.0,13500.0,16700.0,HomeImp,Other,4.0,0.0,0.0,149.466667,1.0,10.0,
3,1.0,1500.0,,,,,,,,,,,
4,0.0,1700.0,97800.0,112000.0,HomeImp,Office,3.0,0.0,0.0,93.333333,0.0,14.0,


Our target is “BAD”, which flags a bad loan. I am setting up variables to hold our target name and our policy settings. Each policy applies to specific actions, and I will provide more information about each policy later in the notebook.

In [6]:
# Target Name 
trt='BAD'
# Exploration Policy 
expo = {'cardinality': {'lowMediumCutoff':40}}
# Screen Policy 
scpo = {'missingPercentThreshold':35}
# Selection Policy 
sepo = {'criterion': 'SU', 'topk':4}
# Transformation Policy 
trpo = {'entropy': True, 'iqv': True, 'kurtosis': True, 'outlier': True}

***
## Explore Data

The [exploreData action](https://go.documentation.sas.com/?docsetId=casactml&docsetTarget=casactml_datasciencepilot_details22.htm&docsetVersion=8.5&locale=en) calculates various statistical measures for each column in your data set, such as Minimum, Maximum, Mean, Median, Mode, Number Missing, Standard Deviation, and more. The exploreData action also creates a hierarchical variable grouping with two levels. The first level groups variables according to their data type (interval, nominal, date, time, or datetime). The second level uses the following statistical metrics to group the interval and nominal data:
- Missing rate (interval and nominal).
- Cardinality (nominal). 
- Entropy (nominal). 
- Index of Qualitative Variation (IQV; interval and nominal).
- Skewness (interval).
- Kurtosis (interval).
- Outliers (interval).
- Coefficient of Variation (CV; interval).

This action returns a CAS table listing all the variables, the variable groupings, and the summary statistics. These groupings allow for a pipelined approach to data transformation and cleaning. 
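
To make a few of these metrics concrete, here is a minimal sketch in plain Python that computes the missing rate, the cardinality, and the IQV of a nominal column. The `job` list holds toy values, not rows from hmeq, and the function names are my own. The IQV formula is the standard textbook definition, so the action's exact computation and its cutoffs for the rated columns may differ.

```python
from collections import Counter

def missing_rate(values):
    """Percentage of entries that are missing (None)."""
    return 100.0 * sum(v is None for v in values) / len(values)

def cardinality(values):
    """Number of distinct non-missing levels."""
    return len({v for v in values if v is not None})

def iqv(values):
    """Index of Qualitative Variation: K / (K - 1) * (1 - sum of squared level shares)."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    k = len(counts)
    if k < 2:
        return 0.0  # a single level has no variation
    n = len(present)
    return k / (k - 1) * (1.0 - sum((c / n) ** 2 for c in counts.values()))

# Toy values for illustration only
job = ["Other", "Other", "Other", None, "Office", "Other", "Mgr", "Office"]
print(missing_rate(job))    # 12.5
print(cardinality(job))     # 3
print(round(iqv(job), 3))   # 0.857
```

An IQV of 0 means every non-missing row has the same level, and an IQV of 1 means the levels are evenly spread.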

In [7]:
conn.dataSciencePilot.exploreData(   
        table  = tbl,
        target = trt,     
        casOut = {'name': 'EXPLORE_DATA_OUT_PY', 'replace' : True},
        explorationPolicy = expo
    )
conn.fetch(table = {'name': 'EXPLORE_DATA_OUT_PY'})

Unnamed: 0,Variable,VarType,MissingRated,CardinalityRated,EntropyRated,IQVRated,CVRated,SkewnessRated,KurtosisRated,OutlierRated,...,MomentCVPer,RobustCVPer,MomentSkewness,RobustSkewness,MomentKurtosis,RobustKurtosis,LowerOutlierMomentPer,UpperOutlierMomentPer,LowerOutlierRobustPer,UpperOutlierRobustPer
0,BAD,binary-target,,,,,,,,,...,,,,,,,,,,
1,REASON,character-nominal,1.0,1.0,3.0,,,,,,...,,,,,,,,,,
2,JOB,character-nominal,1.0,1.0,3.0,3.0,,,,,...,,,,,,,,,,
3,LOAN,numeric-nominal,1.0,3.0,,,,,,,...,,,,,,,,,,
4,MORTDUE,interval,2.0,,,,3.0,1.0,2.0,3.0,...,60.272664,69.553515,1.814481,0.844221,6.481866,0.370274,0.0,2.958471,2.241823,1.727306
5,VALUE,interval,1.0,,,,3.0,1.0,3.0,3.0,...,56.384362,60.247883,3.053344,0.989755,24.362805,0.425793,0.0,2.47948,0.444596,2.599179
6,YOJ,interval,2.0,,,,3.0,1.0,1.0,2.0,...,84.88853,142.857143,0.98846,0.977944,0.372072,-0.006105,0.0,2.31405,0.0,0.055096
7,DEROG,numeric-nominal,2.0,1.0,2.0,1.0,,,,,...,,,,,,,,,,
8,DELINQ,numeric-nominal,2.0,1.0,2.0,1.0,,,,,...,,,,,,,,,,
9,CLAGE,interval,2.0,,,,3.0,1.0,2.0,2.0,...,47.734255,67.143526,1.343412,0.282945,7.599549,0.061058,0.0,1.150035,0.0,0.902335


*** 
## Explore Correlations

If a target is specified, the [exploreCorrelation action](https://go.documentation.sas.com/?docsetId=casactml&docsetTarget=casactml_datasciencepilot_details21.htm&docsetVersion=8.5&locale=en) performs a linear and nonlinear correlation analysis of the input variables and the target. If a target is not specified, the exploreCorrelation action performs a linear and nonlinear correlation analysis for all pairwise combinations of the input variables. The correlation statistics available depend on the data type of each input variable in the pair. 
- Nominal-nominal correlation pairs have the following statistics available: Mutual Information (MI), Symmetric Uncertainty (SU), Information Value (IV; for binary target), Entropy, chi-square, G test (G2), and Cramer’s V. 
- Nominal-interval correlation pairs have the following statistics available: Mutual Information (MI), Symmetric Uncertainty (SU), Entropy, and F-test. 
- Interval-interval correlation pairs have the following statistics available: Mutual Information (MI), Symmetric Uncertainty (SU), Entropy, and Pearson correlation. 

This action returns a CAS table listing all the variable pairs and the correlation statistics. 
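
As a rough illustration of two of these statistics, the sketch below computes Mutual Information and Symmetric Uncertainty for a pair of nominal variables from their empirical entropies. It uses the standard definitions (MI = H(X) + H(Y) - H(X,Y) and SU = 2 * MI / (H(X) + H(Y))). The data are toy values, and the action's own binning and logarithm base are not reproduced here.

```python
import math
from collections import Counter

def entropy(xs):
    """Empirical Shannon entropy (natural log) of a list of levels."""
    n = len(xs)
    return -sum((c / n) * math.log(c / n) for c in Counter(xs).values())

def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

def symmetric_uncertainty(xs, ys):
    """SU = 2 * I(X;Y) / (H(X) + H(Y)); 0 means unrelated, 1 means fully determined."""
    denom = entropy(xs) + entropy(ys)
    return 2.0 * mutual_information(xs, ys) / denom if denom > 0 else 0.0

# Toy values for illustration only
reason = ["HomeImp", "HomeImp", "DebtCon", "DebtCon"]
bad_dependent = [1, 1, 0, 0]     # fully determined by reason
bad_independent = [1, 0, 1, 0]   # unrelated to reason
print(symmetric_uncertainty(reason, bad_dependent))  # 1.0
print(abs(symmetric_uncertainty(reason, bad_independent)) < 1e-12)  # True
```

Because SU is a ratio of entropies, it does not depend on the logarithm base, which makes it convenient for comparing variables with different numbers of levels.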

In [8]:
conn.dataSciencePilot.exploreCorrelation(
        table = tbl, 
        casOut = {'name':'CORR_PY', 'replace':True},
        target = trt
)
conn.fetch(table = {"name" : "CORR_PY"})

Unnamed: 0,FirstVariable,SecondVariable,Type,MI
0,CLAGE,BAD,_it_,0.030242
1,CLNO,BAD,_it_,0.015505
2,DEBTINC,BAD,_it_,0.063485
3,DELINQ,BAD,_it_,0.076942
4,DEROG,BAD,_it_,0.048241
5,LOAN,BAD,_it_,0.036787
6,MORTDUE,BAD,_it_,0.012855
7,NINQ,BAD,_it_,0.021363
8,VALUE,BAD,_it_,0.016458
9,YOJ,BAD,_it_,0.009881


***
## Analyze Missing Patterns

If the target is specified, the [analyzeMissingPatterns action](https://go.documentation.sas.com/?docsetId=casactml&docsetTarget=casactml_datasciencepilot_details04.htm&docsetVersion=8.5&locale=en) performs a missing pattern analysis of the input variables and the target. If a target is not specified, the analyzeMissingPatterns action performs a missing pattern analysis for all pairwise combinations of the input variables. This analysis provides the correlation strength between missing patterns across variable pairs and dependencies of missingness in one variable and the values of the other variable. This action returns a CAS table listing all the missing variable pairs and the statistics around missingness. 
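
One simple way to see what "dependency of missingness on another variable" means is to compare the target's event rate where an input is missing against the rate where it is present. The sketch below does that in plain Python with toy values. It is only an intuition aid and not the statistics that the action reports.

```python
def event_rate(flags):
    """Share of 1s in a list of 0/1 target values (NaN for an empty list)."""
    return sum(flags) / len(flags) if flags else float("nan")

def event_rate_by_missingness(values, target):
    """Return (event rate where the input is missing, event rate where it is present)."""
    when_missing = [t for v, t in zip(values, target) if v is None]
    when_present = [t for v, t in zip(values, target) if v is not None]
    return event_rate(when_missing), event_rate(when_present)

# Toy values for illustration only
debtinc = [None, None, None, 34.8, 36.9, None, 31.2, 28.1]
bad = [1, 1, 1, 0, 0, 1, 0, 1]
print(event_rate_by_missingness(debtinc, bad))  # (1.0, 0.25)
```

A large gap between the two rates suggests that the missingness itself carries information about the target, which is what a missing-indicator feature later tries to capture.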

In [9]:
conn.dataSciencePilot.analyzeMissingPatterns(
        table = tbl, 
        target = trt, 
        casOut = {'name':'MISS_PATTERN_PY', 'replace':True}
)
conn.fetch(table = {'name': 'MISS_PATTERN_PY'})

Unnamed: 0,FirstVariable,SecondVariable,Type,MI,NormMI,SU,EntropyPerChange
0,CLAGE,BAD,_mt_,0.000672,0.036636,0.001324,0.09315
1,CLNO,BAD,_mt_,0.000258,0.022695,0.000542,0.035732
2,DEBTINC,BAD,_mt_,0.184595,0.555613,0.25161,25.605476
3,DELINQ,BAD,_mt_,0.003061,0.078129,0.005183,0.424657
4,DEROG,BAD,_mt_,0.003954,0.08875,0.006342,0.548446
5,LOAN,BAD,_mt_,0.0,0.0,0.0,0.0
6,MORTDUE,BAD,_mt_,1.1e-05,0.004749,2e-05,0.001564
7,NINQ,BAD,_mt_,0.001243,0.049837,0.002177,0.172475
8,VALUE,BAD,_mt_,0.035911,0.263255,0.083951,4.981264
9,YOJ,BAD,_mt_,0.002535,0.07111,0.004426,0.3516


***
## Detect Interactions

The [detectInteractions action](https://go.documentation.sas.com/?docsetId=casactml&docsetVersion=8.5&docsetTarget=casactml_datasciencepilot_details05.htm&locale=en) assesses the interactions between pairs of predictor variables and the correlation of each interaction with the response variable. Specifically, it checks whether the product of a pair of predictor variables correlates with the response variable. Because checking the correlation between the product of every predictor pair and the response variable can be computationally intensive, this action relies on the XYZ algorithm to search for these interactions efficiently in a high-dimensional space.

The detectInteractions action requires that all predictor variables be in a binary format, but the response variable can be numeric, binary, or multi-class. Additionally, the detectInteractions action can handle data in a sparse format, such as when predictor variables are encoded using a one-hot encoding scheme. In the example below, we will specify that our inputs are sparse. The output table shows the gamma value for each pair of variables.
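
To show what is being searched for, here is a brute-force sketch that scores every pair of binary columns by the absolute Pearson correlation between their product and the response. This is deliberately the expensive approach that the XYZ algorithm avoids. The score is not claimed to equal the Gamma value that the action reports, and the column names and values are toy examples.

```python
import math
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0

def rank_interactions(columns, y):
    """Score every pair of binary columns by |corr(x_j * x_k, y)|, strongest first."""
    scores = []
    for (name_j, xj), (name_k, xk) in combinations(columns.items(), 2):
        product = [a * b for a, b in zip(xj, xk)]
        scores.append((abs(pearson(product, y)), name_j, name_k))
    return sorted(scores, reverse=True)

# Toy one-hot style columns; y is constructed as job_other * reason_homeimp
cols = {
    "job_other": [1, 1, 0, 0, 1, 0, 1, 0],
    "reason_homeimp": [1, 0, 1, 0, 1, 1, 0, 0],
    "loan_low": [0, 1, 1, 0, 0, 1, 1, 0],
}
y = [1, 0, 0, 0, 1, 0, 0, 0]
top = rank_interactions(cols, y)
print(top[0][1:])  # ('job_other', 'reason_homeimp')
```

With p binary columns this loop evaluates p * (p - 1) / 2 products, which is why an efficient search matters once the inputs are one-hot encoded.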

In [10]:
# Transform data into a binned/label-encoded format
conn.dataPreprocess.transform(
    table = hmeq, 
    copyVars = ["BAD"], 
    casOut = {"name": "hmeq_transform", "replace": True}, 
    requestPackages = [{"inputs":["JOB", "REASON"], 
                        "catTrans":{"method": "label", "arguments":{"overrides":{"binMissing": True}}}}, 
                      {"inputs":["MORTDUE", "DEBTINC", "LOAN"], 
                       "discretize": {"method": "quantile", "arguments":{"overrides":{"binMissing": True}}} }])
conn.fetch(table = {'name': 'hmeq_transform'})

Unnamed: 0,BAD,_TR2_DEBTINC,_TR2_LOAN,_TR2_MORTDUE,_TR1_JOB,_TR1_REASON
0,1.0,0.0,1.0,1.0,3.0,2.0
1,1.0,0.0,1.0,3.0,3.0,2.0
2,1.0,0.0,1.0,1.0,3.0,2.0
3,1.0,0.0,1.0,0.0,0.0,0.0
4,0.0,0.0,1.0,4.0,2.0,2.0
5,1.0,4.0,1.0,1.0,3.0,2.0
6,1.0,0.0,1.0,2.0,3.0,2.0
7,1.0,4.0,1.0,1.0,3.0,2.0
8,1.0,0.0,1.0,1.0,3.0,2.0
9,1.0,0.0,1.0,0.0,5.0,2.0


In [11]:
conn.dataSciencePilot.detectInteractions(
    table ='hmeq_transform', 
    target = trt, 
    event = '1', 
    sparse = True, 
    inputs = ["_TR1_JOB", "_TR1_REASON", "_TR2_MORTDUE", "_TR2_DEBTINC", "_TR2_LOAN"], 
    inputLevels = [7, 3, 6, 6, 6], 
    casOut = {'name': 'DETECT_INT_OUT_PY', 'replace': True})
conn.fetch(table={'name':'DETECT_INT_OUT_PY'})



Unnamed: 0,FirstVarID,FirstVarName,SecondVarID,SecondVarName,Gamma
0,7.0,_TR1_JOB_7,12.0,_TR2_MORTDUE_2,0.502352
1,10.0,_TR1_REASON_3,12.0,_TR2_MORTDUE_2,0.502352
2,22.0,_TR2_DEBTINC_6,12.0,_TR2_MORTDUE_2,0.502352
3,28.0,_TR2_LOAN_6,12.0,_TR2_MORTDUE_2,0.502352
4,5.0,_TR1_JOB_5,12.0,_TR2_MORTDUE_2,0.48106
5,6.0,_TR1_JOB_6,12.0,_TR2_MORTDUE_2,0.463729
6,7.0,_TR1_JOB_7,14.0,_TR2_MORTDUE_4,0.45234
7,10.0,_TR1_REASON_3,14.0,_TR2_MORTDUE_4,0.45234
8,22.0,_TR2_DEBTINC_6,14.0,_TR2_MORTDUE_4,0.45234
9,28.0,_TR2_LOAN_6,14.0,_TR2_MORTDUE_4,0.45234
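
The variable IDs in this table appear to follow from the order of `inputs` and the values in `inputLevels`: each level's ID is its 1-based level number plus the total number of levels of the inputs listed before it. This is inferred from the output above rather than taken from the documentation. The short sketch below reproduces the IDs shown.

```python
from itertools import accumulate

inputs = ["_TR1_JOB", "_TR1_REASON", "_TR2_MORTDUE", "_TR2_DEBTINC", "_TR2_LOAN"]
input_levels = [7, 3, 6, 6, 6]

# Offset of each input = total number of levels of the inputs listed before it
offsets = [0] + list(accumulate(input_levels))[:-1]

def var_id(name, level):
    """ID of the indicator for `level` (1-based) of input `name` (inferred rule)."""
    return offsets[inputs.index(name)] + level

print(var_id("_TR1_JOB", 7))       # 7
print(var_id("_TR1_REASON", 3))    # 10
print(var_id("_TR2_MORTDUE", 2))   # 12
print(var_id("_TR2_DEBTINC", 6))   # 22
print(var_id("_TR2_LOAN", 6))      # 28
```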


***
## Screen Variables

The [screenVariables action](https://go.documentation.sas.com/?docsetId=casactml&docsetTarget=casactml_datasciencepilot_details25.htm&docsetVersion=8.5&locale=en) makes one of the following recommendations for each input variable:
-	Remove variable if there are significant data-quality issues. 
-	Transform and keep variable if there are some data-quality issues. 
-	Keep variable if there are no data quality issues. 

The screenVariables action considers the following features of the input variables to make its recommendation:
-	Missing rate exceeds the threshold in the screenPolicy (default is 90).
-	Constant value across the input variable.
-	Mutual Information (MI) about the target is below the threshold in the screenPolicy (default is 0.05).
-	Entropy across levels.
-	Entropy reduction of the target exceeds the threshold in the screenPolicy (default is 90); also referred to as leakage.
-	Symmetric Uncertainty (SU) of two variables exceeds the threshold in the screenPolicy (default is 1); also referred to as redundancy.

This action returns a CAS table listing all the input variables, the recommended action, and the reason for the recommended action.  
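
The sketch below mimics only the two simplest checks from the list above (missing rate and constant value) on a single column, to show how a threshold changes the recommendation. The function name, the toy values, and the wording of the two "remove" reasons are my own. The MI, entropy, leakage, and redundancy checks are left out, and the action's real decision logic is certainly richer.

```python
def screen_column(values, missing_percent_threshold=90.0):
    """Return (recommendation, reason) for one column; None marks a missing value."""
    present = [v for v in values if v is not None]
    missing_pct = 100.0 * (len(values) - len(present)) / len(values)
    if missing_pct > missing_percent_threshold:
        return "remove", "missing rate above threshold"   # hypothetical wording
    if len(set(present)) <= 1:
        return "remove", "constant value"                 # hypothetical wording
    return "keep", "passed all screening tests"

# Toy column with 75% missing values
debtinc = [None, None, None, 34.8, None, 37.1, None, None]
print(screen_column(debtinc))                                # ('keep', 'passed all screening tests')
print(screen_column(debtinc, missing_percent_threshold=35))  # ('remove', 'missing rate above threshold')
```

The second call uses the same threshold of 35 that is stored in `scpo` earlier in this notebook.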

In [12]:
conn.dataSciencePilot.screenVariables(
    table = tbl, 
    target = trt, 
    casOut = {'name': 'SCREEN_VARIABLES_OUT_PY', 'replace': True}, 
    screenPolicy = {}
)
conn.fetch(table = {'name': 'SCREEN_VARIABLES_OUT_PY'})

Unnamed: 0,Variable,Recommendation,Reason
0,REASON,keep,passed all screening tests
1,JOB,keep,passed all screening tests
2,LOAN,keep,passed all screening tests
3,MORTDUE,keep,passed all screening tests
4,VALUE,keep,passed all screening tests
5,YOJ,keep,passed all screening tests
6,DEROG,keep,passed all screening tests
7,DELINQ,keep,passed all screening tests
8,CLAGE,keep,passed all screening tests
9,NINQ,keep,passed all screening tests


*** 
## Feature Machine

The [featureMachine action](https://go.documentation.sas.com/?docsetId=casactml&docsetTarget=casactml_datasciencepilot_details23.htm&docsetVersion=8.5&locale=en) automates the parallel generation of features. The featureMachine action first explores the data and groups the input variables into categories with the same statistical profile, like the exploreData action. Next, the featureMachine action screens variables to identify noise variables to exclude from further analysis, like the screenVariables action. Finally, the featureMachine action generates new features by using the available structured pipelines:
-	Missing indicator addition. 
-	Mode imputation and rare value grouping. 
-	Missing level and rare value grouping. 
-	Median imputation. 
-	Mode imputation and label encoding. 
-	Missing level and label encoding. 
-	Yeo-Johnson transformation and median imputation. 
-	Box-Cox transformation. 
-	Quantile binning with missing bins.
-	Regression tree binning.
-	Decision tree binning. 
-	MDLP binning. 
-	Target encoding. 
-	Date, time, and datetime transformations. 

Depending on the parameters specified in the transformationPolicy, the featureMachine action can generate several features for each input variable. This action returns four CAS tables: the first lists information about the transformation pipelines, the second lists information about the transformed features, the third is the input table scored with the transformed features, and the fourth is an analytical store for scoring any additional input tables.
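
To give a feel for what one input can turn into, here is a minimal sketch of two of the pipelines listed above, median imputation and missing indicator addition, applied to one toy column. The function names are my own, and the way the action encodes its indicator column may differ from the 0/1 flag used here.

```python
from statistics import median

def median_impute(values):
    """Replace missing entries (None) with the median of the observed values."""
    fill = median(v for v in values if v is not None)
    return [fill if v is None else v for v in values]

def missing_indicator(values):
    """1 where the original value was missing, 0 otherwise (encoding is an assumption)."""
    return [1 if v is None else 0 for v in values]

# Toy values for illustration only
clage = [94.0, 122.0, 150.0, None, 93.0]
print(median_impute(clage))      # [94.0, 122.0, 150.0, 108.0, 93.0]
print(missing_indicator(clage))  # [0, 0, 0, 1, 0]
```

Keeping both outputs lets a model use the imputed value while still knowing which rows were originally missing.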

In [13]:
conn.dataSciencePilot.featureMachine(
    table = tbl, 
    target = trt, 
    copyVars = trt, 
    explorationPolicy = expo, 
    screenPolicy = scpo, 
    transformationPolicy = trpo, 
    transformationOut       = {"name" : "TRANSFORMATION_OUT", "replace" : True},
    featureOut              = {"name" : "FEATURE_OUT", "replace" : True},
    casOut                  = {"name" : "CAS_OUT", "replace" : True},
    saveState               = {"name" : "ASTORE_OUT", "replace" : True}  
)

Unnamed: 0,casLib,Name,Rows,Columns,casTable
0,CASUSER(sasdemo),TRANSFORMATION_OUT,33,21,"CASTable('TRANSFORMATION_OUT', caslib='CASUSER..."
1,CASUSER(sasdemo),FEATURE_OUT,59,9,"CASTable('FEATURE_OUT', caslib='CASUSER(sasdem..."
2,CASUSER(sasdemo),CAS_OUT,5960,60,"CASTable('CAS_OUT', caslib='CASUSER(sasdemo)')"
3,CASUSER(sasdemo),ASTORE_OUT,1,2,"CASTable('ASTORE_OUT', caslib='CASUSER(sasdemo)')"


In [14]:
conn.fetch(table = {'name': 'TRANSFORMATION_OUT'})

Unnamed: 0,FTGPipelineId,Name,NVariables,IsInteraction,ImputeMethod,OutlierMethod,OutlierTreat,OutlierArgs,FunctionMethod,FunctionArgs,...,MapIntervalArgs,HashMethod,HashArgs,DateTimeMethod,DiscretizeMethod,DiscretizeArgs,CatTransMethod,CatTransArgs,InteractionMethod,InteractionSynthesizer
0,1.0,miss_ind,5.0,,,,,,,,...,,MissIndicator,2.0,,,,,,,
1,2.0,grp_rare1,2.0,,Mode,,,,,,...,,,,,,,Group Rare,5.0,,
2,3.0,hc_tar_frq_rat,1.0,,,,,,,,...,10.0,,,,,,,,,
3,4.0,hc_lbl_cnt,1.0,,,,,,,,...,0.0,,,,,,,,,
4,5.0,hc_cnt,1.0,,,,,,,,...,0.0,,,,,,,,,
5,6.0,hc_cnt_log,1.0,,,,,,Log,e,...,0.0,,,,,,,,,
6,7.0,lchehi_lab,1.0,,,,,,,,...,,,,,,,Label (Sparse One-Hot),0.0,,
7,8.0,lcnhenhi_grp_rare,1.0,,,,,,,,...,,,,,,,Group Rare,5.0,,
8,9.0,lcnhenhi_dtree5,1.0,,,,,,,,...,,,,,,,DTree,5.0,,
9,10.0,lcnhenhi_dtree10,1.0,,,,,,,,...,,,,,,,DTree,10.0,,


In [15]:
conn.fetch(table = {'name': 'FEATURE_OUT'})

Unnamed: 0,FeatureId,Name,IsNominal,FTGPipelineId,NInputs,InputVar1,InputVar2,InputVar3,Label
0,1.0,cpy_int_med_imp_CLAGE,0.0,32.0,1.0,CLAGE,,,CLAGE: Low missing rate - median imputation
1,2.0,miss_ind_CLAGE,1.0,1.0,1.0,CLAGE,,,CLAGE: Significant missing - missing indicator
2,3.0,nhoks_nloks_dtree_10_CLAGE,1.0,31.0,1.0,CLAGE,,,"CLAGE: Not high (outlier, kurtosis, skewness) ..."
3,4.0,nhoks_nloks_dtree_5_CLAGE,1.0,30.0,1.0,CLAGE,,,"CLAGE: Not high (outlier, kurtosis, skewness) ..."
4,5.0,nhoks_nloks_log_CLAGE,0.0,26.0,1.0,CLAGE,,,"CLAGE: Not high (outlier, kurtosis, skewness) ..."
5,6.0,nhoks_nloks_pow_n0_5_CLAGE,0.0,25.0,1.0,CLAGE,,,"CLAGE: Not high (outlier, kurtosis, skewness) ..."
6,7.0,nhoks_nloks_pow_n1_CLAGE,0.0,24.0,1.0,CLAGE,,,"CLAGE: Not high (outlier, kurtosis, skewness) ..."
7,8.0,nhoks_nloks_pow_n2_CLAGE,0.0,23.0,1.0,CLAGE,,,"CLAGE: Not high (outlier, kurtosis, skewness) ..."
8,9.0,nhoks_nloks_pow_p0_5_CLAGE,0.0,27.0,1.0,CLAGE,,,"CLAGE: Not high (outlier, kurtosis, skewness) ..."
9,10.0,nhoks_nloks_pow_p1_CLAGE,0.0,28.0,1.0,CLAGE,,,"CLAGE: Not high (outlier, kurtosis, skewness) ..."


In [16]:
conn.fetch(table = {'name': 'CAS_OUT'})

Unnamed: 0,BAD,cpy_int_med_imp_CLAGE,miss_ind_CLAGE,nhoks_nloks_dtree_10_CLAGE,nhoks_nloks_dtree_5_CLAGE,nhoks_nloks_log_CLAGE,nhoks_nloks_pow_n0_5_CLAGE,nhoks_nloks_pow_n1_CLAGE,nhoks_nloks_pow_n2_CLAGE,nhoks_nloks_pow_p0_5_CLAGE,...,hc_lbl_cnt_LOAN,hc_tar_frq_rat_LOAN,cpy_nom_miss_lev_lab_NINQ,lcnhenhi_dtree10_NINQ,lcnhenhi_dtree5_NINQ,lcnhenhi_grp_rare_NINQ,miss_ind_NINQ,cpy_nom_miss_lev_lab_JOB,lchehi_lab_JOB,cpy_nom_miss_lev_lab_REASON
0,1.0,94.366667,1.0,3.0,2.0,4.557729,0.1024,0.010486,0.00011,9.765586,...,528.0,0.5,2.0,2.0,2.0,2.0,1.0,3.0,3.0,2.0
1,1.0,121.833333,1.0,4.0,2.0,4.810828,0.090228,0.008141,6.6e-05,11.08302,...,461.0,0.5,1.0,1.0,1.0,1.0,1.0,3.0,3.0,2.0
2,1.0,149.466667,1.0,4.0,2.0,5.013742,0.081523,0.006646,4.4e-05,12.266486,...,385.0,0.5,2.0,2.0,2.0,2.0,1.0,3.0,3.0,2.0
3,1.0,173.466667,0.0,0.0,0.0,5.161734,0.075708,0.005732,3.3e-05,13.208583,...,385.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,93.333333,1.0,2.0,2.0,4.546835,0.10296,0.010601,0.000112,9.712535,...,359.0,0.5,1.0,1.0,1.0,1.0,1.0,2.0,2.0,2.0
5,1.0,101.466002,1.0,3.0,2.0,4.629531,0.098789,0.009759,9.5e-05,10.122549,...,359.0,0.5,2.0,2.0,2.0,2.0,1.0,3.0,3.0,2.0
6,1.0,77.1,1.0,2.0,2.0,4.35799,0.113155,0.012804,0.000164,8.83742,...,401.0,0.5,2.0,2.0,2.0,2.0,1.0,3.0,3.0,2.0
7,1.0,88.76603,1.0,2.0,2.0,4.497207,0.105547,0.01114,0.000124,9.474494,...,401.0,0.5,1.0,1.0,1.0,1.0,1.0,3.0,3.0,2.0
8,1.0,216.933333,1.0,7.0,4.0,5.384189,0.067739,0.004589,2.1e-05,14.762565,...,259.0,0.5,2.0,2.0,2.0,2.0,1.0,3.0,3.0,2.0
9,1.0,115.8,1.0,3.0,2.0,4.760463,0.092529,0.008562,7.3e-05,10.807405,...,259.0,0.5,1.0,1.0,1.0,1.0,1.0,5.0,5.0,2.0


*** 
## Generate Shadow Features

The [generateShadowFeatures action](https://go.documentation.sas.com/?docsetId=casactml&docsetTarget=casactml_datasciencepilot_examples39.htm&docsetVersion=8.5&locale=en) performs a scalable random permutation of input features to create shadow features. The shadow features are randomly selected from a matching distribution of each input feature. These shadow features can be used for all-relevant feature selection, which removes the inputs whose variable importance is lower than the shadow feature’s variable importance. The shadow features can also be used in a post-fit analysis using Permutation Feature Importance (PFI). By replacing each input with its shadow feature one by one and measuring the change in model performance, one can determine that feature’s importance based on the relative size of the model’s performance change.

In the example below, I will use the outputs of the feature machine for all-relevant feature selection. This involves getting the variable metadata from my feature machine table, generating my shadow features, finding the variable importance for my features and shadow features using a random forest, and comparing each variable's importance to that of its shadow features. In the end, I will keep only the variables with a higher importance than their shadow features for the next phase.
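
As a simple stand-in for the idea, the sketch below builds shadow columns by shuffling a copy of an input, so each shadow keeps the input's values but loses any relationship to the target. The action itself draws from a matching distribution rather than only reordering values, so this is an approximation. The `_fpi_` and `_fpn_` prefixes (interval and nominal) are inferred from the column names in the output further down, and the function name and toy values are my own.

```python
import random

def make_shadows(name, values, n_probes=2, nominal=False, seed=0):
    """Return {shadow_name: shuffled copy} with one entry per probe."""
    rng = random.Random(seed)                  # fixed seed keeps the sketch reproducible
    prefix = "_fpn_" if nominal else "_fpi_"   # inferred naming convention
    shadows = {}
    for k in range(1, n_probes + 1):
        shuffled = list(values)
        rng.shuffle(shuffled)
        shadows[f"{prefix}{name}_{k}"] = shuffled
    return shadows

# Toy values for illustration only
yoj = [10.5, 7.0, 4.0, 3.0, 9.0, 5.0]
shadows = make_shadows("YOJ", yoj, n_probes=2)
print(list(shadows))  # ['_fpi_YOJ_1', '_fpi_YOJ_2']
print(all(sorted(s) == sorted(yoj) for s in shadows.values()))  # True
```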

In [17]:
# Getting variable names and metadata from feature machine output
fm = conn.CASTable('FEATURE_OUT').to_frame()
inputs = fm['Name'].to_list()
nom = fm.loc[fm['IsNominal'] == 1]
nom = nom['Name'].to_list()

# Generating Shadow Features
conn.dataSciencePilot.generateShadowFeatures(
    table = 'CAS_OUT', 
    nProbes = 2, 
    inputs = inputs, 
    nominals = nom,
    casout={"name" : "SHADOW_FEATURES_OUT", "replace" : True},
    copyVars = trt
)
conn.fetch(table = {"name" : "SHADOW_FEATURES_OUT"})

Unnamed: 0,BAD,_fpi_cpy_int_med_imp_CLAGE_1,_fpi_cpy_int_med_imp_CLAGE_2,_fpi_cpy_int_med_imp_DEBTINC_1,_fpi_cpy_int_med_imp_DEBTINC_2,_fpi_cpy_int_med_imp_MORTDUE_1,_fpi_cpy_int_med_imp_MORTDUE_2,_fpi_cpy_int_med_imp_VALUE_1,_fpi_cpy_int_med_imp_VALUE_2,_fpi_cpy_int_med_imp_YOJ_1,...,_fpn_miss_ind_YOJ_1,_fpn_miss_ind_YOJ_2,_fpn_nhoks_nloks_dtree_10_CLAGE_1,_fpn_nhoks_nloks_dtree_10_CLAGE_2,_fpn_nhoks_nloks_dtree_10_YOJ_1,_fpn_nhoks_nloks_dtree_10_YOJ_2,_fpn_nhoks_nloks_dtree_5_CLAGE_1,_fpn_nhoks_nloks_dtree_5_CLAGE_2,_fpn_nhoks_nloks_dtree_5_YOJ_1,_fpn_nhoks_nloks_dtree_5_YOJ_2
0,1.0,212.857966,306.301194,41.920768,37.885292,126964.545638,143702.041714,115978.674484,67369.485639,16.501445,...,1.0,1.0,5.0,5.0,8.0,4.0,5.0,3.0,2.0,4.0
1,1.0,107.271223,173.466669,47.838093,43.263531,98964.859966,71690.484398,140962.807997,43080.986769,0.028263,...,0.0,1.0,4.0,3.0,6.0,2.0,1.0,4.0,2.0,4.0
2,1.0,184.889813,212.829761,38.964341,36.458693,31115.897464,65019.225046,90000.081104,83919.782979,23.570016,...,1.0,1.0,4.0,6.0,4.0,8.0,2.0,4.0,4.0,4.0
3,1.0,622.587866,107.172996,34.818262,36.463339,62360.100423,94601.260163,49543.401705,31888.847989,26.744998,...,1.0,1.0,7.0,9.0,9.0,9.0,4.0,2.0,4.0,1.0
4,0.0,121.889601,218.133473,28.422306,41.635977,48107.013889,75397.282159,31532.854191,65026.000575,5.013121,...,1.0,1.0,8.0,5.0,9.0,3.0,2.0,1.0,4.0,4.0
5,1.0,181.232278,112.785876,34.769221,36.994494,65020.344536,47127.398889,288193.525528,115523.037663,0.097692,...,1.0,1.0,6.0,7.0,9.0,10.0,5.0,4.0,0.0,2.0
6,1.0,208.091265,81.477013,31.280852,28.122599,20635.084053,54337.933251,46838.33504,35974.002994,1.063992,...,1.0,1.0,0.0,9.0,8.0,8.0,4.0,2.0,2.0,5.0
7,1.0,202.794672,261.1654,34.818263,26.903285,57406.281905,41256.638428,195909.103089,86605.88914,5.01519,...,1.0,1.0,1.0,8.0,7.0,9.0,5.0,2.0,4.0,4.0
8,1.0,108.27154,130.776027,27.694422,31.058132,140511.373088,52062.739878,94710.729253,39620.951869,3.005884,...,1.0,1.0,9.0,9.0,7.0,9.0,4.0,3.0,1.0,5.0
9,1.0,114.463463,367.218162,34.81827,38.265665,159481.832637,88873.73265,26510.127331,182172.822104,3.020234,...,1.0,1.0,4.0,8.0,4.0,10.0,5.0,2.0,2.0,1.0


In [18]:
# Getting Feature Importance for Original Features
feats = conn.decisionTree.forestTrain(
    table = 'CAS_OUT', 
    inputs = inputs, 
    target = trt, 
    varImp = True)
real_features = feats.DTreeVarImpInfo

# Getting Feature Importance for Shadow Features
inp = conn.CASTable('SHADOW_FEATURES_OUT').axes[1].to_list()
shadow_feats = conn.decisionTree.forestTrain(
    table = 'SHADOW_FEATURES_OUT', 
    inputs = inp, 
    target = trt, 
    varImp = True)
sf = shadow_feats.DTreeVarImpInfo

# Building dataframe for easy comparison 
feat_comp = pd.DataFrame(columns=['Variable', 'Real_Imp', 'SF_Imp1', 'SF_Imp2'])
# Filling Variable Column of Data Frame from Feature
feat_comp['Variable'] = real_features['Variable']
# Filling Importance Column of Data Frame from Feature
feat_comp['Real_Imp'] = real_features['Importance']
# Finding each Feature's Shadow Feature
for index, row in sf.iterrows():
    temp_name = row['Variable']
    temp_num = int(temp_name[-1:])
    temp_name = temp_name[5:-2]
    temp_imp = row['Importance']
    for ind, ro in feat_comp.iterrows():
        if temp_name == ro['Variable']:
            if temp_num == 1:
                # Filling First Shadow Feature's Importance
                feat_comp.at[ind, 'SF_Imp1'] = temp_imp
            else:
                # Filling Second Shadow Feature's Importance
                feat_comp.at[ind, 'SF_Imp2'] = temp_imp
feat_comp.head()

Unnamed: 0,Variable,Real_Imp,SF_Imp1,SF_Imp2
0,hk_dtree_disct10_DEBTINC,50.62582,0.356029,0.42216
1,hk_dtree_disct5_DEBTINC,44.679171,0.0912159,0.153254
2,miss_ind_DEBTINC,29.850201,0.0304762,
3,cpy_int_med_imp_DEBTINC,24.158801,0.595817,0.525752
4,grp_rare1_DELINQ,17.844217,0.0468703,0.0970889


In [19]:
# Determining which features have an importance smaller than their shadow feature's importance
to_drop = list()
for ind, ro in feat_comp.iterrows():
    if ro['Real_Imp'] <= ro['SF_Imp1'] or ro['Real_Imp'] <= ro['SF_Imp2']:
        to_drop.append(ro['Variable'])
to_drop

['ho_winsor_VALUE',
 'ho_winsor_MORTDUE',
 'nhoks_nloks_pow_n1_YOJ',
 'nhoks_nloks_dtree_10_YOJ',
 'hc_cnt_LOAN',
 'nhoks_nloks_pow_n2_YOJ',
 'nhoks_nloks_pow_p0_5_YOJ',
 'nhoks_nloks_pow_p2_YOJ',
 'hc_cnt_log_LOAN',
 'nhoks_nloks_pow_p1_YOJ',
 'ho_quan_disct10_MORTDUE',
 'ho_dtree_disct10_MORTDUE',
 'miss_ind_CLAGE',
 'ho_dtree_disct5_MORTDUE',
 'miss_ind_NINQ']
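
The nested loops above can also be written with dictionaries. The sketch below parses each shadow name back to its source feature and applies the same rule as the cell above (drop a feature when its importance is less than or equal to that of any of its shadows). It assumes the `_fpi_`/`_fpn_` prefix and `_<probe>` suffix seen in the shadow table. The importance values are hypothetical, and the helper names are my own.

```python
import re

def split_shadow_name(name):
    """'_fpn_miss_ind_YOJ_2' -> ('miss_ind_YOJ', 2); None if the name does not match."""
    m = re.fullmatch(r"_fp[in]_(.+)_(\d+)", name)
    return (m.group(1), int(m.group(2))) if m else None

def features_to_drop(real_imp, shadow_imp):
    """Drop a feature if any of its shadow features is at least as important."""
    best_shadow = {}
    for name, imp in shadow_imp.items():
        parsed = split_shadow_name(name)
        if parsed:
            base, _probe = parsed
            best_shadow[base] = max(imp, best_shadow.get(base, float("-inf")))
    return [f for f, imp in real_imp.items()
            if imp <= best_shadow.get(f, float("-inf"))]

# Hypothetical importance values for illustration only
real_imp = {"miss_ind_YOJ": 0.2, "cpy_int_med_imp_CLAGE": 5.0}
shadow_imp = {
    "_fpn_miss_ind_YOJ_1": 0.3,
    "_fpn_miss_ind_YOJ_2": 0.1,
    "_fpi_cpy_int_med_imp_CLAGE_1": 0.4,
    "_fpi_cpy_int_med_imp_CLAGE_2": 0.6,
}
print(features_to_drop(real_imp, shadow_imp))  # ['miss_ind_YOJ']
```

Unlike slicing with `[5:-2]`, the regular expression also handles probe numbers with more than one digit.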

In [20]:
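# Dropping Columns from CAS_OUT
# (this yields a Python-side CASTable reference without the dropped columns;
#  the next cell passes the table name 'CAS_OUT' and its output still lists them)
CAS_OUT = conn.CASTable('CAS_OUT')
CAS_OUT = CAS_OUT.drop(to_drop, axis=1)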

*** 
## Select Features

The [selectFeatures action](https://go.documentation.sas.com/?docsetId=casactml&docsetTarget=casactml_datasciencepilot_details26.htm&docsetVersion=8.5&locale=en) performs a filter-based selection by the criterion selected in the selectionPolicy (default is the best ten input variables according to the Mutual Information statistic). The criteria available for selection include Chi-Square, Cramer’s V, F-test, G2, Information Value, Mutual Information, Normalized Mutual Information statistic, Pearson correlation, and the Symmetric Uncertainty statistic. This action returns a CAS table listing the variables, their rank according to the selected criterion, and the value of the selected criterion.

In [21]:
conn.dataSciencePilot.screenVariables(
    table='CAS_OUT', 
    target=trt, 
    screenPolicy=scpo, 
    casout={"name" : "SCREEN_VARIABLES_OUT", "replace" : True}
)
conn.fetch(table = {"name" : "SCREEN_VARIABLES_OUT"})

Unnamed: 0,Variable,Recommendation,Reason
0,cpy_int_med_imp_CLAGE,keep,passed all screening tests
1,miss_ind_CLAGE,keep,passed all screening tests
2,nhoks_nloks_dtree_10_CLAGE,keep,passed all screening tests
3,nhoks_nloks_dtree_5_CLAGE,keep,passed all screening tests
4,nhoks_nloks_log_CLAGE,keep,passed all screening tests
5,nhoks_nloks_pow_n0_5_CLAGE,keep,passed all screening tests
6,nhoks_nloks_pow_n1_CLAGE,keep,passed all screening tests
7,nhoks_nloks_pow_n2_CLAGE,keep,passed all screening tests
8,nhoks_nloks_pow_p0_5_CLAGE,keep,passed all screening tests
9,nhoks_nloks_pow_p1_CLAGE,keep,passed all screening tests
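
Note that the cell above calls screenVariables, so the table it prints is a screening report rather than a feature ranking. A minimal sketch of what the matching selectFeatures call might look like is shown below. It assumes that selectFeatures accepts `table`, `target`, `selectionPolicy`, and `casOut` in the same way as the other dataSciencePilot calls in this notebook, so check the linked documentation before relying on it. The helper name `run_select_features` and the output table name are hypothetical.

```python
def run_select_features(conn, table, target, selection_policy, out_name):
    """Call selectFeatures on an open CAS connection and fetch its output table.

    ASSUMPTION: the parameter names below mirror the other dataSciencePilot
    actions used in this notebook; they are not confirmed here.
    """
    conn.dataSciencePilot.selectFeatures(
        table=table,
        target=target,
        selectionPolicy=selection_policy,
        casOut={"name": out_name, "replace": True},
    )
    return conn.fetch(table={"name": out_name})
```

With the variables defined earlier, the call would be `run_select_features(conn, 'CAS_OUT', trt, sepo, 'SELECT_FEATURES_OUT')`, which asks for the top four variables by Symmetric Uncertainty.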


***
## Data Science Automated Machine Learning Pipeline

The [dsAutoMl action](https://go.documentation.sas.com/?docsetId=casactml&docsetTarget=casactml_datasciencepilot_details20.htm&docsetVersion=8.5&locale=en) creates a policy-based, scalable, end-to-end automated machine learning pipeline for both regression and classification problems. The only inputs required from the user are the input data set and the target variable, but optional parameters include the policy parameters for data exploration, variable screening, feature selection, and feature transformation. Overriding the default policy parameters allows data scientists to configure the pipeline to fit their data science workflow. In addition, a data scientist may also select additional models to consider. By default, only a decision tree model is included in the pipeline, but neural networks, random forest models, and gradient boosting models are also available.

The dsAutoMl action first explores the data and groups the input variables into categories with the same statistical profile, like the exploreData action. Next, the dsAutoMl action screens variables to identify noise variables to exclude from further analysis, like the screenVariables action. Then, the dsAutoMl action generates several new features for the input variables, like the featureMachine action. Once the new, cleaned features are available, the dsAutoMl action selects features based on the selected criterion, like the selectFeatures action.

From here, various pipelines are created using subsets of the selected features, chosen for each pipeline using a feature-representation algorithm. Then the chosen models are added to each pipeline and the hyperparameters for the selected models are optimized, like the modelComposer action of the Autotune action set. These hyperparameters are optimized for the selected objective parameter when cross-validated. By default, classification problems are optimized to have the smallest Misclassification Error Rate (MCE) and regression problems are optimized to have the smallest Average Square Error (ASE). Data scientists can then select their champion and challenger models from the pipelines.

This action returns several CAS tables: the first lists information about the transformation pipelines, the second lists information about the transformed features, the third lists pipeline performance according to the objective parameter, and the last tables are analytical stores for creating the feature set and scoring with our model when new data is available.

In [22]:
conn.dataSciencePilot.dsAutoMl(
    table = tbl,
    target = trt, 
    explorationPolicy = expo, 
    screenPolicy = scpo, 
    selectionPolicy = sepo,
    transformationPolicy = trpo,
    modelTypes = ["decisionTree", "gradboost"],
    objective = "ASE",
    sampleSize = 10,
    topKPipelines = 10,
    kFolds = 5,
    transformationOut = {"name" : "TRANSFORMATION_OUT_PY", "replace" : True},
    featureOut = {"name" : "FEATURE_OUT_PY", "replace" : True},
    pipelineOut = {"name" : "PIPELINE_OUT_PY", "replace" : True},
    saveState = {"modelNamePrefix" : "ASTORE_OUT_PY", "replace" : True, "topK":1}
)

NOTE: Added action set 'autotune'.
NOTE: Added action set 'decisionTree'.
NOTE: Early stopping is activated; 'NTREE' will not be tuned.
NOTE: Added action set 'autotune'.
NOTE: The number of bins will not be tuned since all inputs are nominal.
NOTE: Added action set 'decisionTree'.
NOTE: Early stopping is activated; 'NTREE' will not be tuned.
NOTE: The number of bins will not be tuned since all inputs are nominal.
NOTE: Added action set 'autotune'.
NOTE: The number of bins will not be tuned since all inputs are nominal.
NOTE: Added action set 'decisionTree'.
NOTE: Early stopping is activated; 'NTREE' will not be tuned.
NOTE: The number of bins will not be tuned since all inputs are nominal.
NOTE: Added action set 'autotune'.
NOTE: The number of bins will not be tuned since all inputs are nominal.
NOTE: Added action set 'decisionTree'.
NOTE: Early stopping is activated; 'NTREE' will not be tuned.
NOTE: The number of bins will not be tuned since all inputs are nominal.
NOTE: Added action

Unnamed: 0,Descr,Value
0,Number of Tree Nodes,599.0
1,Max Number of Branches,2.0
2,Number of Levels,15.0
3,Number of Leaves,300.0
4,Number of Bins,100.0
5,Minimum Size of Leaves,5.0
6,Maximum Size of Leaves,442.0
7,Number of Variables,4.0
8,Confidence Level for Pruning,0.25
9,Number of Observations Used,5960.0

Unnamed: 0,Descr,Value
0,Number of Observations Read,5960.0
1,Number of Observations Used,5960.0
2,Misclassification Error (%),11.476510067

Unnamed: 0,LEVNAME,LEVINDEX,VARNAME
0,1,0,P_BAD1
1,0,1,P_BAD0

Unnamed: 0,LEVNAME,LEVINDEX,VARNAME
0,,0,I_BAD

Unnamed: 0,Variable,Event,CutOff,TP,FP,FN,TN,Sensitivity,Specificity,KS,...,F_HALF,FPR,ACC,FDR,F1,C,Gini,Gamma,Tau,MISCEVENT
0,P_BAD0,0,0.00,4771.0,1189.0,0.0,0.0,1.000000,0.000000,0.0,...,0.833770,1.000000,0.800503,0.199497,0.889200,0.922537,0.845073,0.854121,0.269958,0.199497
1,P_BAD0,0,0.01,4771.0,1015.0,0.0,174.0,1.000000,0.146341,0.0,...,0.854558,0.853659,0.829698,0.175423,0.903855,0.922537,0.845073,0.854121,0.269958,0.170302
2,P_BAD0,0,0.02,4771.0,1015.0,0.0,174.0,1.000000,0.146341,0.0,...,0.854558,0.853659,0.829698,0.175423,0.903855,0.922537,0.845073,0.854121,0.269958,0.170302
3,P_BAD0,0,0.03,4771.0,1015.0,0.0,174.0,1.000000,0.146341,0.0,...,0.854558,0.853659,0.829698,0.175423,0.903855,0.922537,0.845073,0.854121,0.269958,0.170302
4,P_BAD0,0,0.04,4770.0,989.0,1.0,200.0,0.999790,0.168209,0.0,...,0.857698,0.831791,0.833893,0.171731,0.905983,0.922537,0.845073,0.854121,0.269958,0.166107
5,P_BAD0,0,0.05,4770.0,989.0,1.0,200.0,0.999790,0.168209,0.0,...,0.857698,0.831791,0.833893,0.171731,0.905983,0.922537,0.845073,0.854121,0.269958,0.166107
6,P_BAD0,0,0.06,4770.0,989.0,1.0,200.0,0.999790,0.168209,0.0,...,0.857698,0.831791,0.833893,0.171731,0.905983,0.922537,0.845073,0.854121,0.269958,0.166107
7,P_BAD0,0,0.07,4769.0,974.0,2.0,215.0,0.999581,0.180824,0.0,...,0.859496,0.819176,0.836242,0.169598,0.907171,0.922537,0.845073,0.854121,0.269958,0.163758
8,P_BAD0,0,0.08,4767.0,949.0,4.0,240.0,0.999162,0.201850,0.0,...,0.862493,0.798150,0.840101,0.166025,0.909126,0.922537,0.845073,0.854121,0.269958,0.159899
9,P_BAD0,0,0.09,4767.0,949.0,4.0,240.0,0.999162,0.201850,0.0,...,0.862493,0.798150,0.840101,0.166025,0.909126,0.922537,0.845073,0.854121,0.269958,0.159899

Unnamed: 0,NOBS,ASE,DIV,RASE,MCE,MCLL
0,5960.0,0.081785,5960.0,0.285982,0.114765,0.262608

Unnamed: 0,Parameter,Value
0,Model Type,Decision Tree
1,Tuner Objective Function,Misclassification
2,Search Method,GRID
3,Number of Grid Points,6
4,Maximum Tuning Time in Seconds,36000
5,Validation Type,Cross-Validation
6,Num Folds in Cross-Validation,5
7,Log Level,0
8,Seed,726654185
9,Number of Parallel Evaluations,4

Unnamed: 0,Evaluation,MAXLEVEL,NBINS,CRIT,MeanConseqError,EvaluationTime
0,0,11,20,gainRatio,0.140101,0.526701
1,4,15,100,gain,0.1151,1.270005
2,2,15,100,gainRatio,0.119799,1.488537
3,3,10,100,gainRatio,0.122987,1.228688
4,1,10,100,gain,0.129321,0.680179
5,5,5,100,gain,0.138948,0.668237
6,6,5,100,gainRatio,0.149161,0.405213

Unnamed: 0,Iteration,Evaluations,Best_obj,Time_sec
0,0,1,0.140101,0.526701
1,1,7,0.1151,2.161789

Unnamed: 0,Evaluation,Iteration,MAXLEVEL,NBINS,CRIT,MeanConseqError,EvaluationTime
0,0,0,11,20,gainRatio,0.140101,0.526701
1,1,1,10,100,gain,0.129321,0.680179
2,2,1,15,100,gainRatio,0.119799,1.488537
3,3,1,10,100,gainRatio,0.122987,1.228688
4,4,1,15,100,gain,0.1151,1.270005
5,5,1,5,100,gain,0.138948,0.668237
6,6,1,5,100,gainRatio,0.149161,0.405213

Unnamed: 0,Parameter,Name,Value
0,Evaluation,Evaluation,4
1,Maximum Tree Levels,MAXLEVEL,15
2,Maximum Bins,NBINS,100
3,Criterion,CRIT,gain
4,Misclassification,Objective,0.1151004199

Unnamed: 0,Parameter,Value
0,Initial Configuration Objective Value,0.140101
1,Best Configuration Objective Value,0.1151
2,Worst Configuration Objective Value,0.149161
3,Initial Configuration Evaluation Time in Seconds,0.526701
4,Best Configuration Evaluation Time in Seconds,1.126155
5,Number of Improved Configurations,3.0
6,Number of Evaluated Configurations,7.0
7,Total Tuning Time in Seconds,2.308218
8,Parallel Tuning Speedup,2.527624

Unnamed: 0,Task,Time_sec,Time_percent
0,Model Training,3.962936,67.924703
1,Model Scoring,1.38282,23.701521
2,Total Objective Evaluations,5.348993,91.68171
3,Tuner,0.485315,8.31829
4,Total CPU Time,5.834308,100.0

Unnamed: 0,Hyperparameter,RelImportance
0,MAXLEVEL,1.0
1,CRIT,0.066046
2,NBINS,0.0

Unnamed: 0,Descr,Value
0,Number of Trees,150.0
1,Distribution,2.0
2,Learning Rate,0.1
3,Subsampling Rate,0.6
4,Number of Selected Variables (M),4.0
5,Number of Bins,77.0
6,Number of Variables,4.0
7,Max Number of Tree Nodes,119.0
8,Min Number of Tree Nodes,57.0
9,Max Number of Branches,2.0

Unnamed: 0,Progress,Metric
0,1.0,0.199497
1,2.0,0.199497
2,3.0,0.199497
3,4.0,0.199497
4,5.0,0.176174
5,6.0,0.153356
6,7.0,0.149832
7,8.0,0.138758
8,9.0,0.133557
9,10.0,0.130872

Unnamed: 0,Descr,Value
0,Number of Observations Read,5960.0
1,Number of Observations Used,5960.0
2,Misclassification Error (%),10.72147651

Unnamed: 0,TreeID,Trees,NLeaves,MCR,LogLoss,ASE,RASE,MAXAE
0,0.0,1.0,47.0,0.199497,0.458278,0.145427,0.381349,0.819707
1,1.0,2.0,99.0,0.199497,0.429789,0.134712,0.367032,0.836517
2,2.0,3.0,152.0,0.199497,0.408185,0.126404,0.355533,0.851858
3,3.0,4.0,211.0,0.199497,0.390526,0.119608,0.345844,0.863870
4,4.0,5.0,263.0,0.176174,0.376211,0.114106,0.337796,0.875788
5,5.0,6.0,322.0,0.153356,0.364140,0.109631,0.331106,0.887132
6,6.0,7.0,377.0,0.149832,0.353821,0.105886,0.325401,0.895821
7,7.0,8.0,431.0,0.138758,0.345072,0.102787,0.320604,0.904793
8,8.0,9.0,485.0,0.133557,0.337656,0.100209,0.316557,0.912727
9,9.0,10.0,541.0,0.130872,0.331382,0.098148,0.313286,0.919855

Unnamed: 0,LEVNAME,LEVINDEX,VARNAME
0,1,0,P_BAD1
1,0,1,P_BAD0

Unnamed: 0,LEVNAME,LEVINDEX,VARNAME
0,,0,I_BAD

Unnamed: 0,Variable,Event,CutOff,TP,FP,FN,TN,Sensitivity,Specificity,KS,...,F_HALF,FPR,ACC,FDR,F1,C,Gini,Gamma,Tau,MISCEVENT
0,P_BAD0,0,0.00,4771.0,1189.0,0.0,0.0,1.000000,0.000000,0.0,...,0.833770,1.000000,0.800503,0.199497,0.889200,0.926975,0.853951,0.862263,0.272794,0.199497
1,P_BAD0,0,0.01,4771.0,1178.0,0.0,11.0,1.000000,0.009251,0.0,...,0.835054,0.990749,0.802349,0.198016,0.890112,0.926975,0.853951,0.862263,0.272794,0.197651
2,P_BAD0,0,0.02,4771.0,1170.0,0.0,19.0,1.000000,0.015980,0.0,...,0.835991,0.984020,0.803691,0.196937,0.890777,0.926975,0.853951,0.862263,0.272794,0.196309
3,P_BAD0,0,0.03,4770.0,1123.0,1.0,66.0,0.999790,0.055509,0.0,...,0.841478,0.944491,0.811409,0.190565,0.894599,0.926975,0.853951,0.862263,0.272794,0.188591
4,P_BAD0,0,0.04,4770.0,1098.0,1.0,91.0,0.999790,0.076535,0.0,...,0.844457,0.923465,0.815604,0.187117,0.896701,0.926975,0.853951,0.862263,0.272794,0.184396
5,P_BAD0,0,0.05,4770.0,1072.0,1.0,117.0,0.999790,0.098402,0.0,...,0.847578,0.901598,0.819966,0.183499,0.898898,0.926975,0.853951,0.862263,0.272794,0.180034
6,P_BAD0,0,0.06,4769.0,1043.0,2.0,146.0,0.999581,0.122792,0.0,...,0.851030,0.877208,0.824664,0.179456,0.901257,0.926975,0.853951,0.862263,0.272794,0.175336
7,P_BAD0,0,0.07,4768.0,1018.0,3.0,171.0,0.999371,0.143818,0.0,...,0.854021,0.856182,0.828691,0.175942,0.903287,0.926975,0.853951,0.862263,0.272794,0.171309
8,P_BAD0,0,0.08,4768.0,994.0,3.0,195.0,0.999371,0.164003,0.0,...,0.856968,0.835997,0.832718,0.172510,0.905345,0.926975,0.853951,0.862263,0.272794,0.167282
9,P_BAD0,0,0.09,4766.0,978.0,5.0,211.0,0.998952,0.177460,0.0,...,0.858832,0.822540,0.835067,0.170265,0.906515,0.926975,0.853951,0.862263,0.272794,0.164933

Unnamed: 0,NOBS,ASE,DIV,RASE,MCE,MCLL
0,5960.0,0.078295,5960.0,0.279812,0.107215,0.258225

Unnamed: 0,Parameter,Value
0,Model Type,Gradient Boosting Tree
1,Tuner Objective Function,Misclassification
2,Search Method,GRID
3,Number of Grid Points,16
4,Maximum Tuning Time in Seconds,36000
5,Validation Type,Cross-Validation
6,Num Folds in Cross-Validation,5
7,Log Level,0
8,Seed,726654418
9,Number of Parallel Evaluations,4

Unnamed: 0,Evaluation,M,LEARNINGRATE,SUBSAMPLERATE,LASSO,RIDGE,NBINS,MAXLEVEL,MeanConseqError,EvaluationTime
0,0,4,0.1,0.5,0.0,1.0,50,5,0.186242,0.924383
1,6,4,0.1,0.6,0.0,0.0,77,7,0.121141,11.662343
2,2,4,0.1,0.8,0.0,0.0,77,7,0.122148,10.705619
3,5,4,0.1,0.8,0.5,0.0,77,7,0.122987,9.525259
4,10,4,0.1,0.6,0.5,0.0,77,7,0.134009,5.635216
5,16,4,0.1,0.6,0.0,0.0,77,5,0.136934,2.912493
6,13,4,0.1,0.8,0.0,0.0,77,5,0.145796,3.245351
7,9,4,0.1,0.8,0.5,0.0,77,5,0.174442,3.087513
8,14,4,0.05,0.8,0.5,0.0,77,5,0.199216,0.955785
9,7,4,0.05,0.8,0.0,0.0,77,5,0.199362,1.859044

Unnamed: 0,Iteration,Evaluations,Best_obj,Time_sec
0,0,1,0.186242,0.924383
1,1,17,0.121141,17.755302

Unnamed: 0,Evaluation,Iteration,M,LEARNINGRATE,SUBSAMPLERATE,LASSO,RIDGE,NBINS,MAXLEVEL,MeanConseqError,EvaluationTime
0,0,0,4,0.1,0.5,0.0,1.0,50,5,0.186242,0.924383
1,1,1,4,0.05,0.6,0.5,0.0,77,5,0.19943,1.23553
2,2,1,4,0.1,0.8,0.0,0.0,77,7,0.122148,10.705619
3,3,1,4,0.05,0.6,0.0,0.0,77,7,0.199664,2.768309
4,4,1,4,0.05,0.6,0.0,0.0,77,5,0.199664,2.254907
5,5,1,4,0.1,0.8,0.5,0.0,77,7,0.122987,9.525259
6,6,1,4,0.1,0.6,0.0,0.0,77,7,0.121141,11.662343
7,7,1,4,0.05,0.8,0.0,0.0,77,5,0.199362,1.859044
8,8,1,4,0.05,0.6,0.5,0.0,77,7,0.199497,3.138668
9,9,1,4,0.1,0.8,0.5,0.0,77,5,0.174442,3.087513

Unnamed: 0,Parameter,Name,Value
0,Evaluation,Evaluation,6.0
1,Number of Variables to Try,M,4.0
2,Learning Rate,LEARNINGRATE,0.1
3,Sampling Rate,SUBSAMPLERATE,0.6
4,Lasso,LASSO,0.0
5,Ridge,RIDGE,0.0
6,Number of Bins,NBINS,77.0
7,Maximum Tree Levels,MAXLEVEL,7.0
8,Misclassification,Objective,0.1211409396

Unnamed: 0,Parameter,Value
0,Initial Configuration Objective Value,0.186242
1,Best Configuration Objective Value,0.121141
2,Worst Configuration Objective Value,0.199664
3,Initial Configuration Evaluation Time in Seconds,0.924383
4,Best Configuration Evaluation Time in Seconds,11.662332
5,Number of Improved Configurations,2.0
6,Number of Evaluated Configurations,17.0
7,Total Tuning Time in Seconds,19.482536
8,Parallel Tuning Speedup,3.315336

Unnamed: 0,Task,Time_sec,Time_percent
0,Model Training,59.722464,92.462293
1,Model Scoring,4.34667,6.729513
2,Total Objective Evaluations,64.076689,99.203503
3,Tuner,0.514467,0.796497
4,Total CPU Time,64.591156,100.0

Unnamed: 0,Hyperparameter,RelImportance
0,LEARNINGRATE,1.0
1,MAXLEVEL,0.157487
2,LASSO,0.045519
3,SUBSAMPLERATE,0.00881
4,M,0.0
5,RIDGE,0.0
6,NBINS,0.0

Unnamed: 0,Descr,Value
0,Number of Trees,150.0
1,Distribution,2.0
2,Learning Rate,0.1
3,Subsampling Rate,0.6
4,Number of Selected Variables (M),4.0
5,Number of Bins,77.0
6,Number of Variables,4.0
7,Max Number of Tree Nodes,107.0
8,Min Number of Tree Nodes,43.0
9,Max Number of Branches,2.0

Unnamed: 0,Progress,Metric
0,1.0,0.199497
1,2.0,0.199497
2,3.0,0.199497
3,4.0,0.197483
4,5.0,0.165436
5,6.0,0.152181
6,7.0,0.139933
7,8.0,0.136242
8,9.0,0.131879
9,10.0,0.130201

Unnamed: 0,Descr,Value
0,Number of Observations Read,5960.0
1,Number of Observations Used,5960.0
2,Misclassification Error (%),10.687919463

Unnamed: 0,TreeID,Trees,NLeaves,MCR,LogLoss,ASE,RASE,MAXAE
0,0.0,1.0,44.0,0.199497,0.461408,0.146438,0.382672,0.819707
1,1.0,2.0,90.0,0.199497,0.433832,0.135944,0.368706,0.834834
2,2.0,3.0,131.0,0.199497,0.412887,0.127711,0.357367,0.850252
3,3.0,4.0,175.0,0.197483,0.396104,0.121037,0.347903,0.863593
4,4.0,5.0,218.0,0.165436,0.382247,0.115562,0.339944,0.875763
5,5.0,6.0,269.0,0.152181,0.370704,0.111132,0.333364,0.887209
6,6.0,7.0,322.0,0.139933,0.360500,0.107283,0.327541,0.896381
7,7.0,8.0,367.0,0.136242,0.351936,0.104032,0.322540,0.905584
8,8.0,9.0,413.0,0.131879,0.344754,0.101382,0.318406,0.913174
9,9.0,10.0,461.0,0.130201,0.338286,0.099128,0.314845,0.920931

Unnamed: 0,LEVNAME,LEVINDEX,VARNAME
0,1,0,P_BAD1
1,0,1,P_BAD0

Unnamed: 0,LEVNAME,LEVINDEX,VARNAME
0,,0,I_BAD

Unnamed: 0,Variable,Event,CutOff,TP,FP,FN,TN,Sensitivity,Specificity,KS,...,F_HALF,FPR,ACC,FDR,F1,C,Gini,Gamma,Tau,MISCEVENT
0,P_BAD0,0,0.00,4771.0,1189.0,0.0,0.0,1.000000,0.000000,0.0,...,0.833770,1.000000,0.800503,0.199497,0.889200,0.901954,0.803909,0.816656,0.256808,0.199497
1,P_BAD0,0,0.01,4771.0,1104.0,0.0,85.0,1.000000,0.071489,0.0,...,0.843798,0.928511,0.814765,0.187915,0.896299,0.901954,0.803909,0.816656,0.256808,0.185235
2,P_BAD0,0,0.02,4771.0,1070.0,0.0,119.0,1.000000,0.100084,0.0,...,0.847876,0.899916,0.820470,0.183188,0.899171,0.901954,0.803909,0.816656,0.256808,0.179530
3,P_BAD0,0,0.03,4771.0,1042.0,0.0,147.0,1.000000,0.123633,0.0,...,0.851265,0.876367,0.825168,0.179253,0.901550,0.901954,0.803909,0.816656,0.256808,0.174832
4,P_BAD0,0,0.04,4771.0,1012.0,0.0,177.0,1.000000,0.148865,0.0,...,0.854926,0.851135,0.830201,0.174996,0.904112,0.901954,0.803909,0.816656,0.256808,0.169799
5,P_BAD0,0,0.05,4771.0,971.0,0.0,218.0,1.000000,0.183347,0.0,...,0.859981,0.816653,0.837081,0.169105,0.907638,0.901954,0.803909,0.816656,0.256808,0.162919
6,P_BAD0,0,0.06,4771.0,956.0,0.0,233.0,1.000000,0.195963,0.0,...,0.861845,0.804037,0.839597,0.166929,0.908935,0.901954,0.803909,0.816656,0.256808,0.160403
7,P_BAD0,0,0.07,4771.0,949.0,0.0,240.0,1.000000,0.201850,0.0,...,0.862717,0.798150,0.840772,0.165909,0.909542,0.901954,0.803909,0.816656,0.256808,0.159228
8,P_BAD0,0,0.08,4771.0,941.0,0.0,248.0,1.000000,0.208579,0.0,...,0.863717,0.791421,0.842114,0.164741,0.910236,0.901954,0.803909,0.816656,0.256808,0.157886
9,P_BAD0,0,0.09,4771.0,934.0,0.0,255.0,1.000000,0.214466,0.0,...,0.864594,0.785534,0.843289,0.163716,0.910844,0.901954,0.803909,0.816656,0.256808,0.156711

Unnamed: 0,NOBS,ASE,DIV,RASE,MCE,MCLL
0,5960.0,0.081221,5960.0,0.284993,0.106879,0.276219

Unnamed: 0,Parameter,Value
0,Model Type,Gradient Boosting Tree
1,Tuner Objective Function,Misclassification
2,Search Method,GRID
3,Number of Grid Points,16
4,Maximum Tuning Time in Seconds,36000
5,Validation Type,Cross-Validation
6,Num Folds in Cross-Validation,5
7,Log Level,0
8,Seed,726656387
9,Number of Parallel Evaluations,4

Unnamed: 0,Evaluation,M,LEARNINGRATE,SUBSAMPLERATE,LASSO,RIDGE,NBINS,MAXLEVEL,MeanConseqError,EvaluationTime
0,0,4,0.1,0.5,0.0,1.0,50,5,0.199497,0.928658
1,2,4,0.1,0.6,0.0,0.0,77,7,0.121962,11.134599
2,3,4,0.1,0.8,0.5,0.0,77,7,0.126618,9.764959
3,7,4,0.1,0.6,0.5,0.0,77,5,0.12804,6.044404
4,9,4,0.1,0.8,0.0,0.0,77,5,0.128141,6.599532
5,11,4,0.1,0.8,0.0,0.0,77,7,0.128396,9.20615
6,5,4,0.1,0.6,0.0,0.0,77,5,0.12955,6.472404
7,8,4,0.1,0.6,0.5,0.0,77,7,0.130851,9.779642
8,15,4,0.1,0.8,0.5,0.0,77,5,0.147987,4.466785
9,6,4,0.05,0.8,0.0,0.0,77,5,0.199362,1.984114

Unnamed: 0,Iteration,Evaluations,Best_obj,Time_sec
0,0,1,0.199497,0.928658
1,1,17,0.121962,21.858145

Unnamed: 0,Evaluation,Iteration,M,LEARNINGRATE,SUBSAMPLERATE,LASSO,RIDGE,NBINS,MAXLEVEL,MeanConseqError,EvaluationTime
0,0,0,4,0.1,0.5,0.0,1.0,50,5,0.199497,0.928658
1,1,1,4,0.05,0.8,0.0,0.0,77,7,0.199463,3.10433
2,2,1,4,0.1,0.6,0.0,0.0,77,7,0.121962,11.134599
3,3,1,4,0.1,0.8,0.5,0.0,77,7,0.126618,9.764959
4,4,1,4,0.05,0.6,0.5,0.0,77,7,0.19953,3.689206
5,5,1,4,0.1,0.6,0.0,0.0,77,5,0.12955,6.472404
6,6,1,4,0.05,0.8,0.0,0.0,77,5,0.199362,1.984114
7,7,1,4,0.1,0.6,0.5,0.0,77,5,0.12804,6.044404
8,8,1,4,0.1,0.6,0.5,0.0,77,7,0.130851,9.779642
9,9,1,4,0.1,0.8,0.0,0.0,77,5,0.128141,6.599532

Unnamed: 0,Parameter,Name,Value
0,Evaluation,Evaluation,2.0
1,Number of Variables to Try,M,4.0
2,Learning Rate,LEARNINGRATE,0.1
3,Sampling Rate,SUBSAMPLERATE,0.6
4,Lasso,LASSO,0.0
5,Ridge,RIDGE,0.0
6,Number of Bins,NBINS,77.0
7,Maximum Tree Levels,MAXLEVEL,7.0
8,Misclassification,Objective,0.121961723

Unnamed: 0,Parameter,Value
0,Initial Configuration Objective Value,0.199497
1,Best Configuration Objective Value,0.121962
2,Worst Configuration Objective Value,0.199609
3,Initial Configuration Evaluation Time in Seconds,0.928658
4,Best Configuration Evaluation Time in Seconds,10.997567
5,Number of Improved Configurations,5.0
6,Number of Evaluated Configurations,17.0
7,Total Tuning Time in Seconds,24.530787
8,Parallel Tuning Speedup,3.358467

Unnamed: 0,Task,Time_sec,Time_percent
0,Model Training,77.405939,93.955399
1,Model Scoring,4.508735,5.472707
2,Total Objective Evaluations,81.92206,99.437071
3,Tuner,0.463774,0.562929
4,Total CPU Time,82.385834,100.0

Unnamed: 0,CAS_Library,Name,Rows,Columns
0,CASUSER(SASDEMO),ASTORE_OUT_PY_gradBoost_1,1,2

Unnamed: 0,Hyperparameter,RelImportance
0,LEARNINGRATE,1.0
1,MAXLEVEL,0.012582
2,SUBSAMPLERATE,0.00055
3,LASSO,0.000193
4,M,0.0
5,RIDGE,0.0
6,NBINS,0.0

Unnamed: 0,casLib,Name,Rows,Columns,casTable
0,CASUSER(sasdemo),PIPELINE_OUT_PY,10,15,"CASTable('PIPELINE_OUT_PY', caslib='CASUSER(sa..."
1,CASUSER(sasdemo),TRANSFORMATION_OUT_PY,17,21,"CASTable('TRANSFORMATION_OUT_PY', caslib='CASU..."
2,CASUSER(sasdemo),FEATURE_OUT_PY,23,15,"CASTable('FEATURE_OUT_PY', caslib='CASUSER(sas..."
3,CASUSER(sasdemo),ASTORE_OUT_PY_fm_,1,2,"CASTable('ASTORE_OUT_PY_fm_', caslib='CASUSER(..."
4,CASUSER(sasdemo),ASTORE_OUT_PY_gradBoost_1,1,2,"CASTable('ASTORE_OUT_PY_gradBoost_1', caslib='..."


In [23]:
conn.fetch(table = {"name" : "TRANSFORMATION_OUT_PY"})

Unnamed: 0,FTGPipelineId,Name,NVariables,IsInteraction,ImputeMethod,OutlierMethod,OutlierTreat,OutlierArgs,FunctionMethod,FunctionArgs,...,MapIntervalArgs,HashMethod,HashArgs,DateTimeMethod,DiscretizeMethod,DiscretizeArgs,CatTransMethod,CatTransArgs,InteractionMethod,InteractionSynthesizer
0,1.0,miss_ind,3.0,,,,,,,,...,,MissIndicator,2.0,,,,Label (Sparse One-Hot),,,
1,2.0,hc_tar_frq_rat,1.0,,,,,,,,...,10.0,,,,,,,,,
2,3.0,hc_lbl_cnt,1.0,,,,,,,,...,0.0,,,,,,,,,
3,4.0,hc_cnt,1.0,,,,,,,,...,0.0,,,,,,,,,
4,5.0,hc_cnt_log,1.0,,,,,,Log,e,...,0.0,,,,,,,,,
5,6.0,lcnhenhi_grp_rare,2.0,,,,,,,,...,,,,,,,Group Rare,5.0,,
6,7.0,lcnhenhi_dtree5,2.0,,,,,,,,...,,,,,,,DTree,5.0,,
7,8.0,lcnhenhi_dtree10,2.0,,,,,,,,...,,,,,,,DTree,10.0,,
8,9.0,hk_yj_n2,1.0,,Median,,,,Yeo-Johnson,-2,...,,,,,,,,,,
9,10.0,hk_yj_n1,1.0,,Median,,,,Yeo-Johnson,-1,...,,,,,,,,,,


In [24]:
conn.fetch(table = {"name" : "FEATURE_OUT_PY"})

Unnamed: 0,FeatureId,Name,IsNominal,FTGPipelineId,NInputs,InputVar1,InputVar2,InputVar3,Label,RankCrit,BestTransRank,GlobalIntervalRank,GlobalNominalRank,GlobalRank,IsGenerated
0,1.0,cpy_int_med_imp_DEBTINC,0.0,16.0,1.0,DEBTINC,,,DEBTINC: Low missing rate - median imputation,0.086483,1.0,1.0,,4.0,1.0
1,2.0,hk_dtree_disct10_DEBTINC,1.0,15.0,1.0,DEBTINC,,,DEBTINC: High kurtosis - ten bin decision tree...,0.102374,3.0,,3.0,3.0,0.0
2,3.0,hk_dtree_disct5_DEBTINC,1.0,14.0,1.0,DEBTINC,,,DEBTINC: High kurtosis - five bin decision tre...,0.12996,2.0,,2.0,2.0,1.0
3,4.0,hk_yj_0_DEBTINC,0.0,11.0,1.0,DEBTINC,,,DEBTINC: High kurtosis - Yeo-Johnson(lambda=0)...,0.080955,3.0,3.0,,6.0,0.0
4,5.0,hk_yj_n1_DEBTINC,0.0,10.0,1.0,DEBTINC,,,DEBTINC: High kurtosis - Yeo-Johnson(lambda=-1...,0.060571,4.0,4.0,,9.0,0.0
5,6.0,hk_yj_n2_DEBTINC,0.0,9.0,1.0,DEBTINC,,,DEBTINC: High kurtosis - Yeo-Johnson(lambda=-2...,0.007162,6.0,10.0,,17.0,0.0
6,7.0,hk_yj_p1_DEBTINC,0.0,12.0,1.0,DEBTINC,,,DEBTINC: High kurtosis - Yeo-Johnson(lambda=1)...,0.086483,1.0,1.0,,4.0,1.0
7,8.0,hk_yj_p2_DEBTINC,0.0,13.0,1.0,DEBTINC,,,DEBTINC: High kurtosis - Yeo-Johnson(lambda=2)...,0.044039,5.0,5.0,,12.0,0.0
8,9.0,miss_ind_DEBTINC,1.0,1.0,1.0,DEBTINC,,,DEBTINC: Significant missing - missing indicator,0.25161,1.0,,1.0,1.0,1.0
9,10.0,cpy_nom_miss_lev_lab_DELINQ,1.0,17.0,1.0,DELINQ,,,DELINQ: Low missing rate - missing level,0.06843,1.0,,4.0,7.0,1.0


In [25]:
conn.fetch(table = {"name" : "PIPELINE_OUT_PY"})

Unnamed: 0,PipelineId,ModelType,MLType,Objective,ObjectiveType,Target,NFeatures,Feat1Id,Feat1IsNom,Feat2Id,Feat2IsNom,Feat3Id,Feat3IsNom,Feat4Id,Feat4IsNom
0,2.0,binary classification,gradBoost,0.114747,MCE,BAD,4.0,10.0,1.0,15.0,1.0,9.0,1.0,23.0,0.0
1,9.0,binary classification,dtree,0.1151,MCE,BAD,4.0,13.0,1.0,18.0,1.0,3.0,1.0,23.0,0.0
2,10.0,binary classification,gradBoost,0.121141,MCE,BAD,4.0,13.0,1.0,18.0,1.0,3.0,1.0,23.0,0.0
3,1.0,binary classification,dtree,0.121455,MCE,BAD,4.0,10.0,1.0,15.0,1.0,9.0,1.0,23.0,0.0
4,3.0,binary classification,dtree,0.126139,MCE,BAD,3.0,13.0,1.0,15.0,1.0,3.0,1.0,,
5,4.0,binary classification,gradBoost,0.127818,MCE,BAD,3.0,13.0,1.0,15.0,1.0,3.0,1.0,,
6,8.0,binary classification,gradBoost,0.132595,MCE,BAD,3.0,13.0,1.0,15.0,1.0,9.0,1.0,,
7,7.0,binary classification,dtree,0.133389,MCE,BAD,3.0,13.0,1.0,15.0,1.0,9.0,1.0,,
8,5.0,binary classification,dtree,0.180872,MCE,BAD,1.0,10.0,1.0,,,,,,
9,6.0,binary classification,gradBoost,0.185434,MCE,BAD,1.0,10.0,1.0,,,,,,
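
Choosing a champion and a challenger from this table amounts to ranking the pipelines by their objective value. The sketch below does that in plain Python on a few rows copied by hand from the output above. The `Algorithm` key and the `rank_pipelines` helper are hypothetical names used only for illustration (in the output above, the algorithm name appears under the `MLType` column); with a live session the rows would come from the fetched table instead.

```python
# Rows copied from the PIPELINE_OUT_PY output above: pipeline ID, algorithm, and MCE.
pipelines = [
    {"PipelineId": 10, "Algorithm": "gradBoost", "Objective": 0.121141},
    {"PipelineId": 2, "Algorithm": "gradBoost", "Objective": 0.114747},
    {"PipelineId": 1, "Algorithm": "dtree", "Objective": 0.121455},
    {"PipelineId": 9, "Algorithm": "dtree", "Objective": 0.1151},
    {"PipelineId": 3, "Algorithm": "dtree", "Objective": 0.126139},
]


def rank_pipelines(rows, lower_is_better=True):
    """Return the rows sorted from best to worst objective value."""
    return sorted(rows, key=lambda r: r["Objective"], reverse=not lower_is_better)


# MCE is an error measure, so the smallest value ranks first.
ranked = rank_pipelines(pipelines)
champion, challenger = ranked[0], ranked[1]

print("Champion:  ", champion)
print("Challenger:", challenger)
```

Here the champion is the gradient boosting pipeline with the lowest MCE and the challenger is the decision tree pipeline that follows it. In practice the choice may also weigh model complexity and scoring cost, not only the objective value.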


***
## Conclusion

The dataSciencePilot action set consists of actions that implement a policy-based, configurable, and scalable approach to automating data science workflows. This action set can be used to automate an end-to-end workflow or to automate steps in the workflow such as data preparation, feature preprocessing, feature engineering, feature selection, and hyperparameter tuning. In this notebook, we demonstrated how to use each action of the dataSciencePilot action set from a Python interface.

In [26]:
conn.close()