# The Data Science Pilot Action Set 

The dataSciencePilot action set consists of actions that implement a policy-based, configurable, and scalable approach to automating data science workflows. This action set can be used to automate an end-to-end workflow or to automate individual steps in the workflow, such as data preparation, feature preprocessing, feature engineering, feature selection, and hyperparameter tuning. More information about this action set is available on [its documentation page](https://go.documentation.sas.com/?docsetId=casactml&docsetTarget=casactml_datasciencepilot_table.htm&docsetVersion=8.4&locale=en).
***

## Table of Contents
Today we will set up the notebook and go through each of the seven actions. 

1. [Setting Up the Notebook](#Setting-Up-the-Notebook)
1. [Explore Data](#Explore-Data)
1. [Explore Correlations](#Explore-Correlations)
1. [Analyze Missing Patterns](#Analyze-Missing-Patterns)
1. [Screen Variables](#Screen-Variables)
1. [Feature Machine](#Feature-Machine)
1. [Select Features](#Select-Features)
1. [Data Science Automated Machine Learning Pipeline](#Data-Science-Automated-Machine-Learning-Pipeline)
1. [Conclusion](#Conclusion)
***

## Setting Up the Notebook

First, we must import the Scripting Wrapper for Analytics Transfer (SWAT) package and use it to connect to our Cloud Analytic Services (CAS) server.

In [1]:
import swat
import numpy as np

In [2]:
conn = swat.CAS('localhost', 5570, authinfo='~/.authinfo', caslib="CASUSER")

Now we will load the dataSciencePilot action set.

In [3]:
conn.builtins.loadactionset('dataSciencePilot')

NOTE: Added action set 'dataSciencePilot'.


Next, we must load our data into CAS. We are using a data set for predicting home equity loan defaults.

In [4]:
tbl = 'hmeq'
hmeq = conn.read_csv("./data/hmeq.csv", casout=dict(name=tbl, replace=True))

NOTE: Cloud Analytic Services made the uploaded file available as table HMEQ in caslib CASUSER(sasdemo05).
NOTE: The table HMEQ has been created in caslib CASUSER(sasdemo05) from binary data uploaded to Cloud Analytic Services.


In [5]:
hmeq.head()

Unnamed: 0,BAD,LOAN,MORTDUE,VALUE,REASON,JOB,YOJ,DEROG,DELINQ,CLAGE,NINQ,CLNO,DEBTINC
0,1.0,1100.0,25860.0,39025.0,HomeImp,Other,10.5,0.0,0.0,94.366667,1.0,9.0,
1,1.0,1300.0,70053.0,68400.0,HomeImp,Other,7.0,0.0,2.0,121.833333,0.0,14.0,
2,1.0,1500.0,13500.0,16700.0,HomeImp,Other,4.0,0.0,0.0,149.466667,1.0,10.0,
3,1.0,1500.0,,,,,,,,,,,
4,0.0,1700.0,97800.0,112000.0,HomeImp,Office,3.0,0.0,0.0,93.333333,0.0,14.0,


Our target is BAD, which flags a bad loan. I am setting up variables to hold the target name and our policy information. Each policy applies to specific actions, and I will provide more information about each policy later in the notebook.

In [6]:
# Target Name 
trt='BAD'
# Exploration Policy 
expo = {'cardinality': {'lowMediumCutoff':40}}
# Screen Policy 
scpo = {'missingPercentThreshold':35}
# Selection Policy 
sepo = {'criterion': 'SU', 'topk':4}
# Transformation Policy 
trpo = {'entropy': True, 'iqv': True, 'kurtosis': True, 'outlier': True}

***
## Explore Data

The [exploreData action](https://go.documentation.sas.com/?docsetId=casactml&docsetTarget=casactml_datasciencepilot_details07.htm&docsetVersion=8.4&locale=en) calculates various statistical measures for each column in your data set, such as Minimum, Maximum, Mean, Median, Mode, Number Missing, Standard Deviation, and more. The exploreData action also creates a hierarchical variable grouping with two levels. The first level groups variables according to their data type (interval, nominal, date, time, or datetime). The second level uses the following statistical metrics to group the interval and nominal data:
- Missing rate (interval and nominal).
- Cardinality (nominal). 
- Entropy (nominal). 
- Index of Qualitative Variation (IQV; interval and nominal).
- Skewness (interval).
- Kurtosis (interval).
- Outliers (interval).
- Coefficient of Variation (CV; interval).

This action returns a CAS table listing all the variables, the variable groupings, and the summary statistics. These groupings allow for a pipelined approach to data transformation and cleaning. 
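
To make a few of the grouping metrics concrete, here is a minimal pure-Python sketch of missing rate, cardinality, entropy, and IQV for one nominal column. The function name `nominal_profile`, the use of `None` for missing values, and the choice of base-2 entropy are assumptions made for illustration; the exploreData action may define or scale these statistics differently.

```python
import math
from collections import Counter

def nominal_profile(values):
    """Summarize a nominal column; None marks a missing value (an assumption)."""
    present = [v for v in values if v is not None]
    if not present:
        raise ValueError("column has no non-missing values")
    counts = Counter(present)
    n, k = len(present), len(counts)
    probs = [c / n for c in counts.values()]
    # Shannon entropy in bits.
    entropy = -sum(p * math.log2(p) for p in probs)
    # IQV rescales 1 - sum(p^2) so that a uniform distribution scores 1.
    iqv = (k / (k - 1)) * (1 - sum(p * p for p in probs)) if k > 1 else 0.0
    return {
        "missing_rate": 1 - n / len(values),
        "cardinality": k,
        "entropy": entropy,
        "iqv": iqv,
    }

# Toy column shaped like REASON, with one missing value.
profile = nominal_profile(["HomeImp", "DebtCon", "DebtCon", None])
print(profile)
```

A column whose levels are evenly spread scores high on entropy and IQV, while a column dominated by one level scores low on both.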


In [7]:
conn.dataSciencePilot.exploreData(   
        table  = tbl,
        target = trt,     
        casOut = {'name': 'EXPLORE_DATA_OUT_PY', 'replace' : True},
        explorationPolicy = expo
    )
conn.fetch(table = {'name': 'EXPLORE_DATA_OUT_PY'})

Unnamed: 0,Variable,VarType,MissingRated,CardinalityRated,EntropyRated,IQVRated,CVRated,SkewnessRated,KurtosisRated,OutlierRated,...,MomentCVPer,RobustCVPer,MomentSkewness,RobustSkewness,MomentKurtosis,RobustKurtosis,LowerOutlierMomentPer,UpperOutlierMomentPer,LowerOutlierRobustPer,UpperOutlierRobustPer
0,BAD,binary-target,,,,,,,,,...,,,,,,,,,,
1,REASON,character-nominal,1.0,1.0,3.0,,,,,,...,,,,,,,,,,
2,JOB,character-nominal,1.0,1.0,3.0,3.0,,,,,...,,,,,,,,,,
3,LOAN,numeric-nominal,1.0,3.0,,,,,,,...,,,,,,,,,,
4,MORTDUE,interval,2.0,,,,3.0,1.0,2.0,3.0,...,60.272664,69.553515,1.814481,0.844221,6.481866,0.370274,0.0,2.958471,2.241823,1.727306
5,VALUE,interval,1.0,,,,3.0,1.0,3.0,3.0,...,56.384362,60.247883,3.053344,0.989755,24.362805,0.425793,0.0,2.47948,0.444596,2.599179
6,YOJ,interval,2.0,,,,3.0,1.0,1.0,2.0,...,84.88853,142.857143,0.98846,0.977944,0.372072,-0.006105,0.0,2.31405,0.0,0.055096
7,DEROG,numeric-nominal,2.0,1.0,2.0,1.0,,,,,...,,,,,,,,,,
8,DELINQ,numeric-nominal,2.0,1.0,2.0,1.0,,,,,...,,,,,,,,,,
9,CLAGE,interval,2.0,,,,3.0,1.0,2.0,2.0,...,47.734255,67.143526,1.343412,0.282945,7.599549,0.061058,0.0,1.150035,0.0,0.902335


*** 
## Explore Correlations

If a target is specified, the [exploreCorrelation action](https://go.documentation.sas.com/?docsetId=casactml&docsetTarget=casactml_datasciencepilot_details06.htm&docsetVersion=8.4&locale=en) performs a linear and nonlinear correlation analysis of the input variables and the target. If a target is not specified, the exploreCorrelation action performs a linear and nonlinear correlation analysis for all pairwise combinations of the input variables. The correlation statistics available depend on the data type of each input variable in the pair. 
- Nominal-nominal correlation pairs have the following statistics available: Mutual Information (MI), Symmetric Uncertainty (SU), Information Value (IV; for binary target), Entropy, chi-square, G test (G2), and Cramer’s V. 
- Nominal-interval correlation pairs have the following statistics available: Mutual Information (MI), Symmetric Uncertainty (SU), Entropy, and F-test. 
- Interval-interval correlation pairs have the following statistics available: Mutual Information (MI), Symmetric Uncertainty (SU), Entropy, and Pearson correlation. 

This action returns a CAS table listing all the variable pairs and the correlation statistics. 
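
As a rough illustration of two of these statistics, the sketch below computes Mutual Information and Symmetric Uncertainty for a pair of nominal variables from their empirical frequencies. The helper names and the use of natural logarithms are assumptions; the exploreCorrelation action may bin interval variables and scale its statistics differently.

```python
import math
from collections import Counter

def entropy(xs):
    """Empirical Shannon entropy (natural log) of a sequence of labels."""
    n = len(xs)
    return -sum((c / n) * math.log(c / n) for c in Counter(xs).values())

def mutual_information(xs, ys):
    # MI(X; Y) = H(X) + H(Y) - H(X, Y)
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

def symmetric_uncertainty(xs, ys):
    # SU normalizes MI to the range [0, 1].
    hx, hy = entropy(xs), entropy(ys)
    if hx + hy == 0:
        return 0.0
    return 2 * mutual_information(xs, ys) / (hx + hy)

# Made-up values: JOB fully determines BAD in the first pair.
su_dependent = symmetric_uncertainty(["Other", "Other", "Office", "Office"], [1, 1, 0, 0])
# In the second pair the two variables are independent.
su_independent = symmetric_uncertainty(["a", "a", "b", "b"], [1, 0, 1, 0])
print(su_dependent, su_independent)
```

An SU close to 1 means that one variable nearly determines the other, and an SU close to 0 means that the pair shares almost no information.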

In [8]:
conn.dataSciencePilot.exploreCorrelation(
        table = tbl, 
        casOut = {'name':'CORR_PY', 'replace':True},
        target = trt
)
conn.fetch(table = {"name" : "CORR_PY"})

Unnamed: 0,FirstVariable,SecondVariable,Type,MI
0,CLAGE,BAD,_it_,0.031648
1,CLNO,BAD,_it_,0.017042
2,DEBTINC,BAD,_it_,0.070887
3,DELINQ,BAD,_it_,0.077422
4,DEROG,BAD,_it_,0.050566
5,LOAN,BAD,_it_,0.043922
6,MORTDUE,BAD,_it_,0.014353
7,NINQ,BAD,_it_,0.021779
8,VALUE,BAD,_it_,0.0203
9,YOJ,BAD,_it_,0.015134


***
## Analyze Missing Patterns

If the target is specified, the [analyzeMissingPatterns action](https://go.documentation.sas.com/?docsetId=casactml&docsetTarget=casactml_datasciencepilot_details04.htm&docsetVersion=8.4&locale=en) performs a missing pattern analysis of the input variables and the target. If a target is not specified, the analyzeMissingPatterns action performs a missing pattern analysis for all pairwise combinations of the input variables. This analysis provides the correlation strength between missing patterns across variable pairs and dependencies of missingness in one variable and the values of the other variable. This action returns a CAS table listing all the missing variable pairs and the statistics around missingness. 
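
One simple way to picture a dependency between missingness and the target is to compare the event rate of the target when a variable is missing with the rate when it is present. The sketch below does that in plain Python. It is an illustration only: the function name and the toy values are made up, and the action reports information-theoretic statistics rather than these two rates.

```python
def target_rate_by_missingness(values, target):
    """Return (event rate when the value is missing, event rate when present)."""
    missing = [t for v, t in zip(values, target) if v is None]
    present = [t for v, t in zip(values, target) if v is not None]

    def rate(ts):
        return sum(ts) / len(ts) if ts else float("nan")

    return rate(missing), rate(present)

# Made-up values shaped like DEBTINC and BAD; None marks a missing value.
debtinc = [None, None, None, 34.8, 36.9, None, 30.1, 33.2]
bad = [1, 1, 1, 0, 0, 0, 0, 1]
rate_missing, rate_present = target_rate_by_missingness(debtinc, bad)
print(rate_missing, rate_present)
```

A large gap between the two rates suggests that the missingness itself carries information about the target, which is the situation where a missing indicator feature is worth adding.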

In [9]:
conn.dataSciencePilot.analyzeMissingPatterns(
        table = tbl, 
        target = trt, 
        casOut = {'name':'MISS_PATTERN_PY', 'replace':True}
)
conn.fetch(table = {'name': 'MISS_PATTERN_PY'})

Unnamed: 0,FirstVariable,SecondVariable,Type,MI,NormMI,SU,EntropyPerChange
0,CLAGE,BAD,_mt_,0.000672,0.036636,0.001324,0.09315
1,CLNO,BAD,_mt_,0.000258,0.022695,0.000542,0.035732
2,DEBTINC,BAD,_mt_,0.184595,0.555613,0.25161,25.605476
3,DELINQ,BAD,_mt_,0.003061,0.078129,0.005183,0.424657
4,DEROG,BAD,_mt_,0.003954,0.08875,0.006342,0.548446
5,LOAN,BAD,_mt_,0.0,0.0,0.0,0.0
6,MORTDUE,BAD,_mt_,1.1e-05,0.004749,2e-05,0.001564
7,NINQ,BAD,_mt_,0.001243,0.049837,0.002177,0.172475
8,VALUE,BAD,_mt_,0.035911,0.263255,0.083951,4.981264
9,YOJ,BAD,_mt_,0.002535,0.07111,0.004426,0.3516


***
## Screen Variables

The [screenVariables action](https://go.documentation.sas.com/?docsetId=casactml&docsetTarget=casactml_datasciencepilot_details09.htm&docsetVersion=8.4&locale=en) makes one of the following recommendations for each input variable:
- Remove the variable if there are significant data-quality issues.
- Transform and keep the variable if there are some data-quality issues.
- Keep the variable if there are no data-quality issues.

The screenVariables action considers the following characteristics of the input variables when making its recommendation:
- Missing rate exceeds the threshold in the screenPolicy (default is 90).
- Constant value across the input variable.
- Mutual Information (MI) about the target is below the threshold in the screenPolicy (default is 0.05).
- Entropy across levels.
- Entropy reduction of the target exceeds the threshold in the screenPolicy (default is 90); also referred to as leakage.
- Symmetric Uncertainty (SU) of two variables exceeds the threshold in the screenPolicy (default is 1); also referred to as redundancy.

This action returns a CAS table listing all the input variables, the recommended action, and the reason for the recommended action.  
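
The rule-based flavor of screening can be sketched in a few lines of plain Python. The function below covers only the missing-rate and constant-value checks. The function name, the rule that any missing value triggers a "transform" recommendation, and all reason strings except "passed all screening tests" are assumptions made for illustration, not the action's actual logic.

```python
def screen_column(values, missing_percent_threshold=90.0):
    """Return a (recommendation, reason) pair for one column; None is missing."""
    n_missing = sum(v is None for v in values)
    missing_percent = 100.0 * n_missing / len(values)
    if missing_percent > missing_percent_threshold:
        return "remove", "missing rate exceeds threshold"
    distinct = {v for v in values if v is not None}
    if len(distinct) <= 1:
        return "remove", "constant value"
    if n_missing > 0:
        # Hypothetical rule: any missing value calls for a transformation.
        return "transform", "some values are missing"
    return "keep", "passed all screening tests"

# With a threshold of 35, a column that is 80% missing is removed.
mostly_missing = screen_column([None, None, None, None, 1.0], 35.0)
clean = screen_column([1, 2, 3])
print(mostly_missing, clean)
```

Lowering the threshold, as the screen policy defined earlier does with a value of 35, makes the screen stricter about sparsely populated variables.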


In [10]:
conn.dataSciencePilot.screenVariables(
    table = tbl, 
    target = trt, 
    casOut = {'name': 'SCREEN_VARIABLES_OUT_PY', 'replace': True}, 
    screenPolicy = {}
)
conn.fetch(table = {'name': 'SCREEN_VARIABLES_OUT_PY'})

Unnamed: 0,Variable,Recommendation,Reason
0,REASON,keep,passed all screening tests
1,JOB,keep,passed all screening tests
2,LOAN,keep,passed all screening tests
3,MORTDUE,keep,passed all screening tests
4,VALUE,keep,passed all screening tests
5,YOJ,keep,passed all screening tests
6,DEROG,keep,passed all screening tests
7,DELINQ,keep,passed all screening tests
8,CLAGE,keep,passed all screening tests
9,NINQ,keep,passed all screening tests


*** 
## Feature Machine

The [featureMachine action](https://go.documentation.sas.com/?docsetId=casactml&docsetTarget=casactml_datasciencepilot_details08.htm&docsetVersion=8.4&locale=en) generates features automatically and in parallel. The featureMachine action first explores the data and groups the input variables into categories with the same statistical profile, like the exploreData action. Next, the featureMachine action screens variables to identify noise variables to exclude from further analysis, like the screenVariables action. Finally, the featureMachine action generates new features by using the available structured pipelines:
-	Missing indicator addition. 
-	Mode imputation and rare value grouping. 
-	Missing level and rare value grouping. 
-	Median imputation. 
-	Mode imputation and label encoding. 
-	Missing level and label encoding. 
-	Yeo-Johnson transformation and median imputation. 
-	Box-Cox transformation. 
-	Quantile binning with missing bins.
-	Regression tree binning.
-	Decision tree binning. 
-	MDLP binning. 
-	Target encoding. 
-	Date, time, and datetime transformations. 

Depending on the parameters specified in the transformationPolicy, the featureMachine action can generate several features for each input variable. This action returns four CAS tables: the first describes the transformation pipelines, the second describes the transformed features, the third is the input table scored with the transformed features, and the fourth is an analytical store for scoring any additional input tables.
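
To show what one of these pipelines does to a column, here is a minimal sketch of median imputation followed by a Yeo-Johnson transformation with a fixed power parameter. The function names and the order of the two steps are assumptions for illustration; the action chooses its own parameters and may sequence the steps differently.

```python
import math
from statistics import median

def yeo_johnson(x, lam):
    """Yeo-Johnson power transformation of one value for a given lambda."""
    if x >= 0:
        return math.log1p(x) if lam == 0 else ((x + 1) ** lam - 1) / lam
    return -math.log1p(-x) if lam == 2 else -(((1 - x) ** (2 - lam) - 1) / (2 - lam))

def impute_then_transform(values, lam):
    """Fill missing values (None) with the median, then transform every value."""
    fill = median(v for v in values if v is not None)
    return [yeo_johnson(fill if v is None else v, lam) for v in values]

# Toy column with one missing value; lam=0 reduces to log(x + 1) for x >= 0.
transformed = impute_then_transform([3.0, None, 0.0, 8.0], lam=0)
print(transformed)
```

Unlike the Box-Cox transformation, Yeo-Johnson is defined for zero and negative values, which is why it can be applied to a wider range of interval variables.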

In [11]:
conn.dataSciencePilot.featureMachine(
    table = tbl, 
    target = trt, 
    copyVars = trt, 
    explorationPolicy = expo, 
    screenPolicy = scpo, 
    transformationPolicy = trpo, 
    transformationOut       = {"name" : "TRANSFORMATION_OUT", "replace" : True},
    featureOut              = {"name" : "FEATURE_OUT", "replace" : True},
    casOut                  = {"name" : "CAS_OUT", "replace" : True},
    saveState               = {"name" : "ASTORE_OUT", "replace" : True}  
)

Unnamed: 0,casLib,Name,Rows,Columns,casTable
0,CASUSER(sasdemo05),TRANSFORMATION_OUT,31,21,"CASTable('TRANSFORMATION_OUT', caslib='CASUSER..."
1,CASUSER(sasdemo05),FEATURE_OUT,52,8,"CASTable('FEATURE_OUT', caslib='CASUSER(sasdem..."
2,CASUSER(sasdemo05),CAS_OUT,5960,53,"CASTable('CAS_OUT', caslib='CASUSER(sasdemo05)')"
3,CASUSER(sasdemo05),ASTORE_OUT,1,2,"CASTable('ASTORE_OUT', caslib='CASUSER(sasdemo..."


In [12]:
conn.fetch(table = {'name': 'TRANSFORMATION_OUT'})

Unnamed: 0,FTGPipelineId,Name,NVariables,IsInteraction,ImputeMethod,OutlierMethod,OutlierTreat,OutlierArgs,FunctionMethod,FunctionArgs,...,MapIntervalArgs,HashMethod,HashArgs,DateTimeMethod,DiscretizeMethod,DiscretizeArgs,CatTransMethod,CatTransArgs,InteractionMethod,InteractionSynthesizer
0,1.0,hc_tar_frq_rat,1.0,,,,,,,,...,0.0,,,,,,,,,
1,2.0,hc_lbl_cnt,1.0,,,,,,,,...,0.0,,,,,,,,,
2,3.0,hc_cnt,1.0,,,,,,,,...,0.0,,,,,,,,,
3,4.0,hc_cnt_log,1.0,,,,,,Log,e,...,0.0,,,,,,,,,
4,5.0,lchehi_lab,1.0,,,,,,,,...,,,,,,,Label (Sparse One-Hot),0.0,,
5,6.0,lcnhenhi_grp_rare,1.0,,,,,,,,...,,,,,,,Group Rare,5.0,,
6,7.0,lcnhenhi_dtree5,1.0,,,,,,,,...,,,,,,,DTree,5.0,,
7,8.0,lcnhenhi_dtree10,1.0,,,,,,,,...,,,,,,,DTree,10.0,,
8,9.0,ho_winsor,2.0,,Median,Modified IQR,Winsor,0.0,,,...,,,,,,,,,,
9,10.0,ho_quan_disct5,2.0,,,Modified IQR,Trim,0.0,,,...,,,,,Equal-Freq (Quantile),5.0,,,,


In [13]:
conn.fetch(table = {'name': 'FEATURE_OUT'})

Unnamed: 0,FeatureId,Name,IsNominal,FTGPipelineId,NInputs,InputVar1,InputVar2,InputVar3
0,1.0,cpy_int_med_imp_CLAGE,0.0,30.0,1.0,CLAGE,,
1,2.0,nhoks_nloks_dtree_10_CLAGE,1.0,29.0,1.0,CLAGE,,
2,3.0,nhoks_nloks_dtree_5_CLAGE,1.0,28.0,1.0,CLAGE,,
3,4.0,nhoks_nloks_log_CLAGE,0.0,24.0,1.0,CLAGE,,
4,5.0,nhoks_nloks_pow_n0_5_CLAGE,0.0,23.0,1.0,CLAGE,,
5,6.0,nhoks_nloks_pow_n1_CLAGE,0.0,22.0,1.0,CLAGE,,
6,7.0,nhoks_nloks_pow_n2_CLAGE,0.0,21.0,1.0,CLAGE,,
7,8.0,nhoks_nloks_pow_p0_5_CLAGE,0.0,25.0,1.0,CLAGE,,
8,9.0,nhoks_nloks_pow_p1_CLAGE,0.0,26.0,1.0,CLAGE,,
9,10.0,nhoks_nloks_pow_p2_CLAGE,0.0,27.0,1.0,CLAGE,,


In [14]:
conn.fetch(table = {'name': 'CAS_OUT'})

Unnamed: 0,BAD,cpy_int_med_imp_CLAGE,nhoks_nloks_dtree_10_CLAGE,nhoks_nloks_dtree_5_CLAGE,nhoks_nloks_log_CLAGE,nhoks_nloks_pow_n0_5_CLAGE,nhoks_nloks_pow_n1_CLAGE,nhoks_nloks_pow_n2_CLAGE,nhoks_nloks_pow_p0_5_CLAGE,nhoks_nloks_pow_p1_CLAGE,...,hc_cnt_log_LOAN,hc_lbl_cnt_LOAN,hc_tar_frq_rat_LOAN,cpy_nom_miss_lev_lab_NINQ,lcnhenhi_dtree10_NINQ,lcnhenhi_dtree5_NINQ,lcnhenhi_grp_rare_NINQ,cpy_nom_miss_lev_lab_JOB,lchehi_lab_JOB,cpy_nom_miss_lev_lab_REASON
0,1.0,94.366667,3.0,2.0,4.557729,0.1024,0.010486,0.00011,9.765586,95.366667,...,0.0,528.0,0.0,2.0,2.0,2.0,2.0,3.0,3.0,2.0
1,1.0,121.833333,4.0,2.0,4.810828,0.090228,0.008141,6.6e-05,11.08302,122.833333,...,0.0,461.0,0.0,1.0,1.0,1.0,1.0,3.0,3.0,2.0
2,1.0,149.466667,4.0,2.0,5.013742,0.081523,0.006646,4.4e-05,12.266486,150.466667,...,0.693147,385.0,0.0,2.0,2.0,2.0,2.0,3.0,3.0,2.0
3,1.0,173.466667,0.0,0.0,5.161734,0.075708,0.005732,3.3e-05,13.208583,174.466667,...,0.693147,385.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,93.333333,2.0,2.0,4.546835,0.10296,0.010601,0.000112,9.712535,94.333333,...,0.693147,359.0,0.5,1.0,1.0,1.0,1.0,2.0,2.0,2.0
5,1.0,101.466002,3.0,2.0,4.629531,0.098789,0.009759,9.5e-05,10.122549,102.466002,...,0.693147,359.0,0.5,2.0,2.0,2.0,2.0,3.0,3.0,2.0
6,1.0,77.1,2.0,2.0,4.35799,0.113155,0.012804,0.000164,8.83742,78.1,...,0.693147,401.0,0.0,2.0,2.0,2.0,2.0,3.0,3.0,2.0
7,1.0,88.76603,2.0,2.0,4.497207,0.105547,0.01114,0.000124,9.474494,89.76603,...,0.693147,401.0,0.0,1.0,1.0,1.0,1.0,3.0,3.0,2.0
8,1.0,216.933333,7.0,4.0,5.384189,0.067739,0.004589,2.1e-05,14.762565,217.933333,...,1.791759,259.0,0.166667,2.0,2.0,2.0,2.0,3.0,3.0,2.0
9,1.0,115.8,3.0,2.0,4.760463,0.092529,0.008562,7.3e-05,10.807405,116.8,...,1.791759,259.0,0.166667,1.0,1.0,1.0,1.0,5.0,5.0,2.0


*** 
## Select Features

The [selectFeatures action](https://go.documentation.sas.com/?docsetId=casactml&docsetTarget=casactml_datasciencepilot_details10.htm&docsetVersion=8.4&locale=en) performs a filter-based selection by the criterion selected in the selectionPolicy (default is the best ten input variables according to the Mutual Information statistic). The criteria available for selection include Chi-Square, Cramer’s V, F-test, G2, Information Value, Mutual Information, Normalized Mutual Information statistic, Pearson correlation, and the Symmetric Uncertainty statistic. This action returns a CAS table listing the variables, their rank according to the selected criterion, and the value of the selected criterion. Note that cell 15 does not call selectFeatures itself: it applies the screenVariables action, with the screen policy defined earlier, to the CAS_OUT table produced by featureMachine, so its output lists screening recommendations rather than a ranking.
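
Filter-based selection amounts to ranking variables by a score and keeping the top k. The sketch below ranks the ten MI values shown in the exploreCorrelation output earlier in this notebook and keeps the best four. The function name `select_top_k` is hypothetical, and the sketch ranks raw input variables by MI rather than generated features by SU.

```python
def select_top_k(scores, k):
    """Rank variables by score (highest first) and keep the best k."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(rank, name, score) for rank, (name, score) in enumerate(ranked[:k], start=1)]

# MI with BAD, copied from the exploreCorrelation output shown earlier.
mi_with_bad = {
    "CLAGE": 0.031648, "CLNO": 0.017042, "DEBTINC": 0.070887,
    "DELINQ": 0.077422, "DEROG": 0.050566, "LOAN": 0.043922,
    "MORTDUE": 0.014353, "NINQ": 0.021779, "VALUE": 0.0203, "YOJ": 0.015134,
}
top4 = select_top_k(mi_with_bad, k=4)
for rank, name, score in top4:
    print(rank, name, score)
```

A filter like this scores each variable on its own, so it is fast but does not account for redundancy among the selected variables.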

In [15]:
conn.dataSciencePilot.screenVariables(
    table='CAS_OUT', 
    target=trt, 
    screenPolicy=scpo, 
    casout={"name" : "SCREEN_VARIABLES_OUT", "replace" : True}
)
conn.fetch(table = {"name" : "SCREEN_VARIABLES_OUT"})

Unnamed: 0,Variable,Recommendation,Reason
0,cpy_int_med_imp_CLAGE,keep,passed all screening tests
1,nhoks_nloks_dtree_10_CLAGE,keep,passed all screening tests
2,nhoks_nloks_dtree_5_CLAGE,keep,passed all screening tests
3,nhoks_nloks_log_CLAGE,keep,passed all screening tests
4,nhoks_nloks_pow_n0_5_CLAGE,keep,passed all screening tests
5,nhoks_nloks_pow_n1_CLAGE,keep,passed all screening tests
6,nhoks_nloks_pow_n2_CLAGE,keep,passed all screening tests
7,nhoks_nloks_pow_p0_5_CLAGE,keep,passed all screening tests
8,nhoks_nloks_pow_p1_CLAGE,keep,passed all screening tests
9,nhoks_nloks_pow_p2_CLAGE,keep,passed all screening tests


***
## Data Science Automated Machine Learning Pipeline

The [dsAutoMl action](https://go.documentation.sas.com/?docsetId=casactml&docsetTarget=casactml_datasciencepilot_details05.htm&docsetVersion=8.4&locale=en) creates a policy-based, scalable, end-to-end automated machine learning pipeline for both regression and classification problems. The only input required from the user is the input data set and the target variable, but optional parameters include the policy parameters for data exploration, variable screening, feature selection, and feature transformation. Overriding the default policy parameters allows data scientists to configure the pipeline for their own workflow. In addition, a data scientist may select additional models to consider. By default, only a decision tree model is included in the pipeline, but neural networks, random forest models, and gradient boosting models are also available.

The dsAutoMl action first explores the data and groups the input variables into categories with the same statistical profile, like the exploreData action. Next, the dsAutoMl action screens variables to identify noise variables to exclude from further analysis, like the screenVariables action. Then, the dsAutoMl action generates several new features for the input variables, like the featureMachine action. Once the new, cleaned features are available, the dsAutoMl action selects features based on the selected criterion, like the selectFeatures action.

From here, various pipelines are created using subsets of the selected features, chosen for each pipeline using a feature-representation algorithm. Then the chosen models are added to each pipeline and the hyperparameters for the selected models are optimized, like the modelComposer action of the Autotune action set. The hyperparameters are tuned against the selected objective, which is evaluated with cross-validation. By default, classification problems are optimized to have the smallest Misclassification Error Rate (MCE) and regression problems are optimized to have the smallest Average Square Error (ASE). Data scientists can then select their champion and challenger models from the pipelines.

This action returns four CAS tables: the first describes the transformation pipelines, the second describes the transformed features, the third lists pipeline performance according to the objective parameter, and the fourth is an analytical store for scoring any additional input tables.
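
For readers unfamiliar with the two default objectives, the following sketch computes a misclassification rate and an average squared error from predicted event probabilities. The function names, the 0.5 cutoff, and the toy values are assumptions for illustration; CAS computes these statistics on its own scored output and may use different conventions.

```python
def misclassification_rate(y_true, p_event, cutoff=0.5):
    """Fraction of rows whose predicted class differs from the actual class."""
    wrong = sum((p >= cutoff) != bool(y) for y, p in zip(y_true, p_event))
    return wrong / len(y_true)

def average_squared_error(y_true, p_event):
    """Mean squared difference between the 0/1 target and the predicted probability."""
    return sum((y - p) ** 2 for y, p in zip(y_true, p_event)) / len(y_true)

# Made-up targets and predicted probabilities of the event BAD = 1.
y = [1, 0, 1, 0]
p = [0.9, 0.2, 0.4, 0.6]
mce = misclassification_rate(y, p)
ase = average_squared_error(y, p)
print(mce, ase)
```

The misclassification rate only looks at which side of the cutoff each prediction falls on, while the average squared error also rewards well-calibrated probabilities, so the two objectives can rank pipelines differently.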

In [16]:
conn.dataSciencePilot.dsAutoMl(
    table                   = tbl,
    target                  = trt,
    explorationPolicy       = expo,
    screenPolicy            = scpo,
    selectionPolicy         = sepo,
    transformationPolicy    = trpo,
    modelTypes              = ["decisionTree"],
    objective               = "ASE",
    sampleSize              = 10,
    topKPipelines           = 10,
    kFolds                  = 5,
    transformationOut       = {"name" : "TRANSFORMATION_OUT_PY", "replace" : True},
    featureOut              = {"name" : "FEATURE_OUT_PY", "replace" : True},
    pipelineOut             = {"name" : "PIPELINE_OUT_PY", "replace" : True},
    saveState               = {"name" : "ASTORE_OUT_PY", "replace" : True}
)

NOTE: Added action set 'autotune'.
NOTE: Added action set 'autotune'.
NOTE: The number of bins will not be tuned since all inputs are nominal.
NOTE: Added action set 'autotune'.
NOTE: The number of bins will not be tuned since all inputs are nominal.
NOTE: Added action set 'autotune'.
NOTE: Added action set 'autotune'.
NOTE: The number of bins will not be tuned since all inputs are nominal.
NOTE: Added action set 'autotune'.
NOTE: The number of bins will not be tuned since all inputs are nominal.
NOTE: Added action set 'autotune'.
NOTE: The number of bins will not be tuned since all inputs are nominal.
NOTE: Added action set 'autotune'.
NOTE: Added action set 'autotune'.
NOTE: Added action set 'autotune'.
NOTE: The number of bins will not be tuned since all inputs are nominal.


Unnamed: 0,Descr,Value
0,Number of Tree Nodes,207.0
1,Max Number of Branches,2.0
2,Number of Levels,13.0
3,Number of Leaves,104.0
4,Number of Bins,20.0
5,Minimum Size of Leaves,5.0
6,Maximum Size of Leaves,1030.0
7,Number of Variables,3.0
8,Confidence Level for Pruning,0.25
9,Number of Observations Used,5960.0

Unnamed: 0,Descr,Value
0,Number of Observations Read,5960.0
1,Number of Observations Used,5960.0
2,Misclassification Error (%),13.087248322

Unnamed: 0,LEVNAME,LEVINDEX,VARNAME
0,1,0,P_BAD1
1,0,1,P_BAD0

Unnamed: 0,LEVNAME,LEVINDEX,VARNAME
0,,0,I_BAD

Unnamed: 0,Variable,Event,CutOff,TP,FP,FN,TN,Sensitivity,Specificity,KS,...,F_HALF,FPR,ACC,FDR,F1,C,Gini,Gamma,Tau,MISCEVENT
0,P_BAD0,0,0.00,4771.0,1189.0,0.0,0.0,1.000000,0.000000,0.0,...,0.833770,1.000000,0.800503,0.199497,0.889200,0.889178,0.778356,0.816383,0.248645,0.199497
1,P_BAD0,0,0.01,4771.0,1122.0,0.0,67.0,1.000000,0.056350,0.0,...,0.841654,0.943650,0.811745,0.190395,0.894786,0.889178,0.778356,0.816383,0.248645,0.188255
2,P_BAD0,0,0.02,4771.0,1122.0,0.0,67.0,1.000000,0.056350,0.0,...,0.841654,0.943650,0.811745,0.190395,0.894786,0.889178,0.778356,0.816383,0.248645,0.188255
3,P_BAD0,0,0.03,4771.0,1122.0,0.0,67.0,1.000000,0.056350,0.0,...,0.841654,0.943650,0.811745,0.190395,0.894786,0.889178,0.778356,0.816383,0.248645,0.188255
4,P_BAD0,0,0.04,4771.0,1122.0,0.0,67.0,1.000000,0.056350,0.0,...,0.841654,0.943650,0.811745,0.190395,0.894786,0.889178,0.778356,0.816383,0.248645,0.188255
5,P_BAD0,0,0.05,4771.0,1122.0,0.0,67.0,1.000000,0.056350,0.0,...,0.841654,0.943650,0.811745,0.190395,0.894786,0.889178,0.778356,0.816383,0.248645,0.188255
6,P_BAD0,0,0.06,4771.0,1122.0,0.0,67.0,1.000000,0.056350,0.0,...,0.841654,0.943650,0.811745,0.190395,0.894786,0.889178,0.778356,0.816383,0.248645,0.188255
7,P_BAD0,0,0.07,4771.0,1122.0,0.0,67.0,1.000000,0.056350,0.0,...,0.841654,0.943650,0.811745,0.190395,0.894786,0.889178,0.778356,0.816383,0.248645,0.188255
8,P_BAD0,0,0.08,4771.0,1122.0,0.0,67.0,1.000000,0.056350,0.0,...,0.841654,0.943650,0.811745,0.190395,0.894786,0.889178,0.778356,0.816383,0.248645,0.188255
9,P_BAD0,0,0.09,4766.0,1065.0,5.0,124.0,0.998952,0.104289,0.0,...,0.848194,0.895711,0.820470,0.182644,0.899076,0.889178,0.778356,0.816383,0.248645,0.179530

Unnamed: 0,NOBS,ASE,DIV,RASE,MCE,MCLL
0,5960.0,0.091684,5960.0,0.302793,0.130872,0.302066

Unnamed: 0,Parameter,Value
0,Model Type,Decision Tree
1,Tuner Objective Function,Misclassification
2,Search Method,GA
3,Population Size,10
4,Maximum Iterations,5
5,Maximum Tuning Time in Seconds,36000
6,Validation Type,Cross-Validation
7,Num Folds in Cross-Validation,5
8,Log Level,0
9,Seed,360427786

Unnamed: 0,Evaluation,MAXLEVEL,CRIT,MeanConseqError,EvaluationTime
0,0,11,GAINRATIO,0.13775,5.634602
1,31,13,CHISQUARE,0.129698,0.718316
2,24,14,GAINRATIO,0.12972,1.231974
3,20,13,GAINRATIO,0.130034,0.70828
4,29,20,GAINRATIO,0.130034,1.25499
5,25,17,GINI,0.130436,1.230738
6,38,15,CHISQUARE,0.130728,2.293269
7,32,15,GAINRATIO,0.131711,1.330985
8,19,6,GINI,0.131879,0.820333
9,16,12,GINI,0.132238,1.191354

Unnamed: 0,Iteration,Evaluations,Best_obj,Time_sec
0,0,1,0.13775,5.634602
1,1,14,0.133413,12.845557
2,2,21,0.130034,14.492667
3,3,31,0.129698,17.108658
4,4,37,0.129698,18.675609
5,5,41,0.129698,20.969113

Unnamed: 0,Evaluation,Iteration,MAXLEVEL,CRIT,MeanConseqError,EvaluationTime
0,0,0,11,GAINRATIO,0.13775,5.634602
1,1,1,20,GAIN,0.140098,0.666434
2,2,1,8,CHISQUARE,0.135904,6.602241
3,3,1,2,GINI,0.14969,6.404023
4,4,1,6,CHAID,0.184792,6.43012
5,5,1,16,GAINRATIO,0.137918,0.951816
6,6,1,4,GINI,0.139093,0.401025
7,7,1,12,GAINRATIO,0.13775,0.743847
8,8,1,14,CHISQUARE,0.139428,1.130453
9,9,1,18,CHAID,0.149329,1.144902

Unnamed: 0,Parameter,Name,Value
0,Evaluation,Evaluation,31
1,Maximum Tree Levels,MAXLEVEL,13
2,Criterion,CRIT,CHISQUARE
3,Misclassification,Objective,0.1296979866

Unnamed: 0,Parameter,Value
0,Initial Configuration Objective Value,0.13775
1,Best Configuration Objective Value,0.129698
2,Worst Configuration Objective Value,0.19944
3,Initial Configuration Evaluation Time in Seconds,5.634602
4,Best Configuration Evaluation Time in Seconds,0.718308
5,Number of Improved Configurations,8.0
6,Number of Evaluated Configurations,41.0
7,Total Tuning Time in Seconds,21.085831
8,Parallel Tuning Speedup,2.206446

Unnamed: 0,Task,Time_sec,Time_percent
0,Model Training,31.911317,68.589983
1,Model Scoring,8.633135,18.556006
2,Total Objective Evaluations,40.566772,87.193962
3,Tuner,5.957977,12.806038
4,Total CPU Time,46.524749,100.0

Unnamed: 0,casLib,Name,Rows,Columns,casTable
0,CASUSER(sasdemo05),PIPELINE_OUT_PY,10,15,"CASTable('PIPELINE_OUT_PY', caslib='CASUSER(sa..."
1,CASUSER(sasdemo05),TRANSFORMATION_OUT_PY,16,21,"CASTable('TRANSFORMATION_OUT_PY', caslib='CASU..."
2,CASUSER(sasdemo05),FEATURE_OUT_PY,20,8,"CASTable('FEATURE_OUT_PY', caslib='CASUSER(sas..."
3,CASUSER(sasdemo05),ASTORE_OUT_PY,1,2,"CASTable('ASTORE_OUT_PY', caslib='CASUSER(sasd..."


In [17]:
conn.fetch(table = {"name" : "TRANSFORMATION_OUT_PY"})

Unnamed: 0,FTGPipelineId,Name,NVariables,IsInteraction,ImputeMethod,OutlierMethod,OutlierTreat,OutlierArgs,FunctionMethod,FunctionArgs,...,MapIntervalArgs,HashMethod,HashArgs,DateTimeMethod,DiscretizeMethod,DiscretizeArgs,CatTransMethod,CatTransArgs,InteractionMethod,InteractionSynthesizer
0,1.0,hc_tar_frq_rat,1.0,,,,,,,,...,0.0,,,,,,,,,
1,2.0,hc_lbl_cnt,1.0,,,,,,,,...,0.0,,,,,,,,,
2,3.0,hc_cnt,1.0,,,,,,,,...,0.0,,,,,,,,,
3,4.0,hc_cnt_log,1.0,,,,,,Log,e,...,0.0,,,,,,,,,
4,5.0,lcnhenhi_grp_rare,2.0,,,,,,,,...,,,,,,,Group Rare,5.0,,
5,6.0,lcnhenhi_dtree5,2.0,,,,,,,,...,,,,,,,DTree,5.0,,
6,7.0,lcnhenhi_dtree10,2.0,,,,,,,,...,,,,,,,DTree,10.0,,
7,8.0,hk_yj_n2,1.0,,Median,,,,Yeo-Johnson,-2,...,,,,,,,,,,
8,9.0,hk_yj_n1,1.0,,Median,,,,Yeo-Johnson,-1,...,,,,,,,,,,
9,10.0,hk_yj_0,1.0,,Median,,,,Yeo-Johnson,0,...,,,,,,,,,,


In [18]:
conn.fetch(table = {"name" : "FEATURE_OUT_PY"})

Unnamed: 0,FeatureId,Name,IsNominal,FTGPipelineId,NInputs,InputVar1,InputVar2,InputVar3
0,1.0,cpy_int_med_imp_DEBTINC,0.0,15.0,1.0,DEBTINC,,
1,2.0,hk_dtree_disct10_DEBTINC,1.0,14.0,1.0,DEBTINC,,
2,3.0,hk_dtree_disct5_DEBTINC,1.0,13.0,1.0,DEBTINC,,
3,4.0,hk_yj_0_DEBTINC,0.0,10.0,1.0,DEBTINC,,
4,5.0,hk_yj_n1_DEBTINC,0.0,9.0,1.0,DEBTINC,,
5,6.0,hk_yj_n2_DEBTINC,0.0,8.0,1.0,DEBTINC,,
6,7.0,hk_yj_p1_DEBTINC,0.0,11.0,1.0,DEBTINC,,
7,8.0,hk_yj_p2_DEBTINC,0.0,12.0,1.0,DEBTINC,,
8,9.0,cpy_nom_miss_lev_lab_DELINQ,1.0,16.0,1.0,DELINQ,,
9,10.0,lcnhenhi_dtree10_DELINQ,1.0,7.0,1.0,DELINQ,,


In [19]:
conn.fetch(table = {"name" : "PIPELINE_OUT_PY"})

Unnamed: 0,PipelineId,ModelType,MLType,Objective,ObjectiveType,Target,NFeatures,Feat1Id,Feat1IsNom,Feat2Id,Feat2IsNom,Feat3Id,Feat3IsNom,Feat4Id,Feat4IsNom
0,8.0,binary classification,dtree,0.112303,MCE,BAD,4.0,12.0,1.0,16.0,1.0,1.0,0.0,17.0,0.0
1,4.0,binary classification,dtree,0.112919,MCE,BAD,4.0,12.0,1.0,14.0,1.0,3.0,1.0,20.0,0.0
2,3.0,binary classification,dtree,0.126005,MCE,BAD,3.0,10.0,1.0,16.0,1.0,2.0,1.0,,
3,10.0,binary classification,dtree,0.129698,MCE,BAD,3.0,12.0,1.0,14.0,1.0,2.0,1.0,,
4,5.0,binary classification,dtree,0.131754,MCE,BAD,3.0,12.0,1.0,16.0,1.0,3.0,1.0,,
5,9.0,binary classification,dtree,0.134083,MCE,BAD,3.0,12.0,1.0,13.0,1.0,5.0,0.0,,
6,1.0,binary classification,dtree,0.147626,MCE,BAD,4.0,12.0,1.0,16.0,1.0,6.0,0.0,20.0,0.0
7,2.0,binary classification,dtree,0.168793,MCE,BAD,2.0,11.0,1.0,15.0,1.0,,,,
8,7.0,binary classification,dtree,0.169798,MCE,BAD,2.0,11.0,1.0,16.0,1.0,,,,
9,6.0,binary classification,dtree,0.173096,MCE,BAD,2.0,12.0,1.0,16.0,1.0,,,,


Currently, dsAutoMl does not output an analytic store file for the best performing model pipeline, but we can create one in just a few easy steps. First, we will examine the pipeline table output from dsAutoMl. This table lists the best performing pipelines and the features each model was built on.

In [20]:
# Get the best performing model pipeline
pipeline=conn.fetch(table="PIPELINE_OUT_PY")['Fetch']
best_pipeline = pipeline.iloc[0]
NFeatures = int(best_pipeline.NFeatures)
# Get the information on all the features
all_features = conn.fetch(table = "FEATURE_OUT_PY")['Fetch']
# Save the features we want
features = []
nominals = []
for i in range(1, NFeatures+1):
    # Get column name
    col_name = 'Feat{}Id'.format(i)
    # Get feature ID
    feat_id = best_pipeline[col_name]
    select_feat = all_features[all_features['FeatureId'] == feat_id]
    # Get feature name
    feat_name = select_feat['Name'].values[0]
    features.append(feat_name)
    # Check if feature is nominal 
    if select_feat['IsNominal'].values[0] == 1:
        nominals.append(feat_name)
# I would like to give credit for this code block to Biruk Gebremariam

Next, we will run the feature generation analytic store file on our input data.

In [21]:
conn.loadactionset(actionset='astore')

NOTE: Added action set 'astore'.


In [22]:
conn.astore.score(table=tbl, copyvars=trt, rstore='ASTORE_OUT_PY', casout=dict(name='feat_table', replace=True))

Unnamed: 0,casLib,Name,Rows,Columns,casTable
0,CASUSER(sasdemo05),feat_table,5960,21,"CASTable('feat_table', caslib='CASUSER(sasdemo..."

Unnamed: 0,Task,Seconds,Percent
0,Loading the Store,0.000152,0.010257
1,Creating the State,0.005963,0.402747
2,Scoring,0.008686,0.586657
3,Total,0.014806,1.0


The copyvars parameter in the score call above carries our target variable into feat_table alongside the new features.

Finally, we create our best performing model. We will use only the features from the best performing pipeline as inputs.

In [23]:
conn.loadactionset(actionset='autotune')
conn.loadactionset(actionset='decisionTree')

NOTE: Added action set 'autotune'.
NOTE: Added action set 'decisionTree'.


In [24]:
conn.autotune.tunedecisiontree(trainoptions=dict(table='feat_table',
                                                 target=trt,
                                                 inputs=features,
                                                 nominals=nominals,
                                                 casOut = 'py_dtree'))

NOTE: Autotune is started for 'Decision Tree' model.
NOTE: Autotune option SEARCHMETHOD='GA'.
NOTE: Autotune option MAXTIME=36000 (sec.).
NOTE: Autotune option SEED=360356071.
NOTE: Autotune objective is 'Mean Square Error'.
NOTE: Autotune number of parallel evaluations is set to 4, each using 0 worker nodes.
         Iteration       Evals     Best Objective  Elapsed Time
                 0           1             0.1257          7.88
                 1          17             0.1053         14.76
                 2          30             0.1053         16.46
                 3          41             0.1051         17.71
                 4          50             0.1051         18.70
                 5          56             0.1048         19.19
NOTE: Data was partitioned during tuning, to tune based on validation score; the final model is trained and scored on all data.
NOTE: Autotune time is 19.38 seconds.


Unnamed: 0,Descr,Value
0,Number of Tree Nodes,39.0
1,Max Number of Branches,2.0
2,Number of Levels,6.0
3,Number of Leaves,20.0
4,Number of Bins,200.0
5,Minimum Size of Leaves,5.0
6,Maximum Size of Leaves,1827.0
7,Number of Variables,4.0
8,Alpha for Cost-Complexity Pruning,0.0
9,Number of Observations Used,5960.0

Unnamed: 0,Descr,Value
0,Number of Observations Read,5960.0
1,Number of Observations Used,5960.0
2,Mean Squared Error,0.0973217559

Unnamed: 0,Parameter,Value
0,Model Type,Decision Tree
1,Tuner Objective Function,Mean Square Error
2,Search Method,GA
3,Population Size,10
4,Maximum Iterations,5
5,Maximum Tuning Time in Seconds,36000
6,Validation Type,Single Partition
7,Validation Partition Fraction,0.30
8,Log Level,2
9,Seed,360356071

Unnamed: 0,Evaluation,MAXLEVEL,NBINS,CRIT,MeanSqErr,EvaluationTime
0,0,11,20,VARIANCE,0.125745,7.877451
1,52,6,200,VARIANCE,0.104843,0.204783
2,34,8,197,VARIANCE,0.105112,0.30392
3,47,8,197,FTEST,0.105112,0.266141
4,1,8,200,VARIANCE,0.105291,0.118484
5,28,8,200,FTEST,0.105291,0.383924
6,21,8,162,VARIANCE,0.105335,0.248293
7,37,8,194,VARIANCE,0.105529,0.226766
8,46,8,195,VARIANCE,0.105589,0.319373
9,51,8,198,VARIANCE,0.105831,0.293959

Unnamed: 0,Iteration,Evaluations,Best_obj,Time_sec
0,0,1,0.125745,7.877451
1,1,17,0.105291,22.632541
2,2,30,0.105291,24.334493
3,3,41,0.105112,25.583986
4,4,50,0.105112,26.575767
5,5,56,0.104843,27.072136

Unnamed: 0,Evaluation,Iteration,MAXLEVEL,NBINS,CRIT,MeanSqErr,EvaluationTime
0,0,0,11,20,VARIANCE,0.125745,7.877451
1,1,1,8,200,VARIANCE,0.105291,0.118484
2,2,1,10,160,VARIANCE,0.110225,6.865246
3,3,1,2,180,FTEST,0.148717,6.571633
4,4,1,20,80,VARIANCE,0.120451,6.876473
5,5,1,4,40,CHAID,0.150733,0.139903
6,6,1,12,120,FTEST,0.114429,0.62987
7,7,1,18,60,FTEST,0.123758,0.666872
8,8,1,6,140,CHAID,0.147757,0.291074
9,9,1,16,100,VARIANCE,0.123096,0.636404

Unnamed: 0,Parameter,Name,Value
0,Evaluation,Evaluation,52
1,Maximum Tree Levels,MAXLEVEL,6
2,Maximum Bins,NBINS,200
3,Criterion,CRIT,VARIANCE
4,Mean Square Error,Objective,0.1048429391

Unnamed: 0,Parameter,Value
0,Initial Configuration Objective Value,0.125745
1,Best Configuration Objective Value,0.104843
2,Worst Configuration Objective Value,0.150733
3,Initial Configuration Evaluation Time in Seconds,7.877451
4,Best Configuration Evaluation Time in Seconds,0.204662
5,Number of Improved Configurations,3.0
6,Number of Evaluated Configurations,56.0
7,Total Tuning Time in Seconds,19.377036
8,Parallel Tuning Speedup,1.44887

Unnamed: 0,Task,Time_sec,Time_percent
0,Model Training,18.745287,66.769054
1,Model Scoring,1.52518,5.432555
2,Total Objective Evaluations,20.276328,72.222486
3,Tuner,7.798485,27.777514
4,Total CPU Time,28.074813,100.0

Unnamed: 0,CAS_Library,Name,Rows,Columns
0,CASUSER(sasdemo05),py_dtree,39,27
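
The log above shows that this run used a seed of 360356071 and a time limit of 36000 seconds. If a more repeatable or shorter run is wanted, the tuner settings can be passed explicitly. The sketch below only builds the argument dictionaries and does not contact the server. It assumes the `tunerOptions` parameter names `seed`, `maxTime`, `maxIters`, and `popSize` from the autotune action set; the helper name `build_tune_args`, the placeholder column names, and the chosen values are hypothetical.

```python
def build_tune_args(table, target, inputs, nominals, model_table,
                    seed=12345, max_time=600):
    """Build keyword arguments for autotune.tuneDecisionTree (assumed parameter names)."""
    train_options = dict(table=table, target=target, inputs=list(inputs),
                         nominals=list(nominals), casOut=model_table)
    # A fixed seed is intended to make the search repeatable;
    # maxTime caps the tuning time in seconds.
    # maxIters and popSize mirror the values reported in the tuner output above.
    tuner_options = dict(seed=seed, maxTime=max_time, maxIters=5, popSize=10)
    return dict(trainOptions=train_options, tunerOptions=tuner_options)


# Placeholder column names for illustration only
tune_args = build_tune_args('feat_table', 'target_column',
                            ['feat_a', 'feat_b'], ['feat_b'], 'py_dtree')
print(tune_args['tunerOptions'])
```

In the notebook, the result could then be unpacked into the action call, for example `conn.autotune.tuneDecisionTree(**build_tune_args('feat_table', trt, features, nominals, 'py_dtree'))`.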


To score new data, we must first generate the features and then use our model to make predictions. We have an analytic store for the feature generation and a model table for the decision tree, so the block of code below takes us from data to prediction in just a few lines.

In [25]:
hmeq_week1 = conn.read_sas("./data/hmeq_week1.sas7bdat", casout=dict(name='hmeq_week1', replace=True))
conn.astore.score(table='hmeq_week1', copyvars=trt, rstore='ASTORE_OUT_PY', casout=dict(name='hmeq_week1_feats', replace=True))
conn.decisionTree.dtreeScore(modelTable={"name":"py_dtree"},table={"name":"hmeq_week1_feats"}, casOut={"name":"scored_week1"})

NOTE: Cloud Analytic Services made the uploaded file available as table HMEQ_WEEK1 in caslib CASUSER(sasdemo05).
NOTE: The table HMEQ_WEEK1 has been created in caslib CASUSER(sasdemo05) from binary data uploaded to Cloud Analytic Services.


Unnamed: 0,casLib,Name,Rows,Columns,casTable
0,CASUSER(sasdemo05),scored_week1,1000,11,"CASTable('scored_week1', caslib='CASUSER(sasde..."

Unnamed: 0,Descr,Value
0,Number of Observations Read,1000.0
1,Number of Observations Used,1000.0
2,Mean Squared Error,0.0860899127
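
If new data arrives regularly, the two scoring steps above could be wrapped in a small helper. This is a sketch under assumptions: the function name `score_new_table` and the table name `hmeq_week2` are hypothetical, and the dry run uses a mock object in place of a real CAS connection so that the table names being passed can be checked without a server.

```python
from unittest.mock import MagicMock


def score_new_table(conn, table_name, target, scored_name):
    """Generate features for a loaded CAS table, then score it with the tuned tree."""
    feats_name = table_name + '_feats'
    # Step 1: apply the feature-generation analytic store
    conn.astore.score(table=table_name, copyvars=target, rstore='ASTORE_OUT_PY',
                      casout=dict(name=feats_name, replace=True))
    # Step 2: score the generated features with the decision tree model table
    return conn.decisionTree.dtreeScore(modelTable={'name': 'py_dtree'},
                                        table={'name': feats_name},
                                        casOut={'name': scored_name})


# Dry run with a mock connection (no CAS server involved)
dry_run = MagicMock()
score_new_table(dry_run, 'hmeq_week2', 'target_column', 'scored_week2')
astore_kwargs = dry_run.astore.score.call_args[1]
dtree_kwargs = dry_run.decisionTree.dtreeScore.call_args[1]
print(astore_kwargs['casout']['name'])
print(dtree_kwargs['table'])
```

With a live session, the call would take the real connection and target instead, for example `score_new_table(conn, 'hmeq_week1', trt, 'scored_week1')` after the table has been loaded.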


***
## Conclusion

The dataSciencePilot action set consists of actions that implement a policy-based, configurable, and scalable approach to automating data science workflows. This action set can be used to automate an end-to-end workflow or to automate individual steps in the workflow, such as data preparation, feature preprocessing, feature engineering, feature selection, and hyperparameter tuning. In this notebook, we demonstrated how to use each action in the dataSciencePilot action set through a Python interface.

In [26]:
conn.close()