# Data Preparation using SAS Viya, CAS, and R

Today's demonstration focuses on using SAS Viya, CAS, and R for efficient data preparation through several approaches:

- Use the *SWAT package* to make API calls to the Viya server and leverage the compute power of the CAS engine
- Use standard R methods to reference and manipulate data, even when the calculations execute on CAS
- Execute *CAS actions* directly via SWAT calls
    - Today's example: the **dataSciencePilot action set**, which consists of actions that implement a policy-based, configurable, and scalable approach to automating data science workflows. This action set can be used to automate an end-to-end workflow or to automate individual steps in the workflow, such as data preparation, feature preprocessing, feature engineering, feature selection, and hyperparameter tuning. More information about this action set is available on [its documentation page](https://go.documentation.sas.com/?docsetId=casactml&docsetTarget=casactml_datasciencepilot_toc.htm&docsetVersion=8.5&locale=en).
***

## Table of Contents
Today we will set up the notebook, manage data, and apply eight actions available through the **dataSciencePilot** action set designed for data preparation.

1. [Setting Up the Notebook](#Setting-Up-the-Notebook)
1. [Manage Data](#Manage-Data)
1. [Prepare Data using dataSciencePilot action set](#Explore-Data)
    1. [Explore Correlations](#Explore-Correlations)
    1. [Analyze Missing Patterns](#Analyze-Missing-Patterns)
    1. [Detect Interactions](#Detect-Interactions)
    1. [Screen Variables](#Screen-Variables)
    1. [Feature Machine](#Feature-Machine)
    1. [Generate Shadow Features](#Generate-Shadow-Features)
    1. [Select Features](#Select-Features)
1. [Conclusion](#Conclusion)
***

## Setting Up the Notebook

First, we must load the Scripting Wrapper for Analytics Transfer (SWAT) package and use it to connect to our Cloud Analytic Services (CAS) server.

In [1]:
library(swat)
library(stringr)

SWAT 1.4.1



In [2]:
server = CAS('localhost', 5570, authinfo='~/.authinfo', caslib="OpenDemo")

NOTE: Connecting to CAS and generating CAS action functions for loaded

      action sets...

NOTE: To generate the functions with signatures (for tab completion), set 

      options(cas.gen.function.sig=TRUE).



Now we will load action sets, which are analogous to packages in R (or libraries in Python).

In [3]:
loadActionSet(server, 'dataSciencePilot')
loadActionSet(server, 'table')

NOTE: Added action set 'dataSciencePilot'.

NOTE: Information for action set 'dataSciencePilot':

NOTE:    dataSciencePilot

NOTE:       exploreData - Exploration, automatic variable analysis and grouping using comprehensive statistical profiling of the variables.

NOTE:       screenVariables - Screens noise variables and variables that need special transformations to be useful in the downstream analytics.

NOTE:       analyzeMissingPatterns - Missing pattern analysis

NOTE:       exploreCorrelation - Explore linear and nonlinear correlation among the variables.

NOTE:       detectInteractions - Variable interaction detection and ranking

NOTE:       generateShadowFeatures - Generate shadow features.

NOTE:       featureMachine - Automated feature transformation and generation engine

NOTE:       selectFeatures - Feature selection

NOTE:       dsAutoMl - Automated machine learning pipeline exploration, execution and ranking.

NOTE: Added action set 'table'.

NOTE: Information for actio

## Manage Data

We can access and manage data that already exists on the Viya server. Here we will connect to a file previously saved to disk on the Viya server and "lift" that table in-memory in CAS. Throughout this example, we will use this data set for predicting home equity loan defaults. 

In [4]:
### If the table is already loaded in memory, drop it before reloading from disk.
### (An error here simply means the table was not loaded yet, so there was nothing to drop.)
hmeq_dropTbl <- cas.table.dropTable(server, caslib="OpenDemo", name="hmeq_from_R_Jupyter")


ERROR: The action stopped due to errors.



In [5]:
### Load table from disk:
hmeq_loadTbl <- cas.table.loadTable(server, caslib="Public", path="HMEQ.sashdat",
                                   casout=list(caslib="OpenDemo", name="hmeq_from_R_Jupyter"))

NOTE: Cloud Analytic Services made the file HMEQ.sashdat available as table HMEQ_FROM_R_JUPYTER in caslib OpenDemo.



We can now create a reference (called a 'CASTable') that points to that in-memory CAS table. This allows us to perform CAS Actions as well as standard interactions available on local R data.frames -- but all are executed on the server side in the CAS engine. 

In [6]:
hmeqCAStable <- defCasTable(server, tablename="hmeq_from_R_Jupyter", caslib="OpenDemo")

In [7]:
print("--- First six rows of the *in-memory* HMEQ table ---")
head(hmeqCAStable)

[1] "--- First six rows of the *in-memory* HMEQ table ---"


BAD,LOAN,MORTDUE,VALUE,REASON,JOB,YOJ,DEROG,DELINQ,CLAGE,NINQ,CLNO,DEBTINC
<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,1100,25860.0,39025.0,HomeImp,Other,10.5,0.0,0.0,94.36667,1.0,9.0,
1,1300,70053.0,68400.0,HomeImp,Other,7.0,0.0,2.0,121.83333,0.0,14.0,
1,1500,13500.0,16700.0,HomeImp,Other,4.0,0.0,0.0,149.46667,1.0,10.0,
1,1500,,,,,,,,,,,
0,1700,97800.0,112000.0,HomeImp,Office,3.0,0.0,0.0,93.33333,0.0,14.0,
1,1700,30548.0,40320.0,HomeImp,Other,9.0,0.0,0.0,101.466,1.0,8.0,37.11361


Our target is `BAD`, which flags whether a loan went bad. I am setting up variables to hold the target name and the policy information. Each policy applies to specific actions, and I will provide more information about each policy later in the notebook.

In [8]:
# Target Name 
trt <- "BAD"
# Exploration Policy 
# (written as a dict literal, this would be {'cardinality': {'lowMediumCutoff': 40}}; in R we use nested lists)
expo <- list(cardinality = list(lowMediumCutoff=40))
# Screen Policy 
scpo <- list(missingPercentThreshold = 35)
# Selection Policy 
sepo <- list(criterion = "SU", topk = 4)
# Transformation Policy 
trpo <- list(entropy = TRUE, iqv = TRUE, kurtosis = TRUE, outlier = TRUE)

***
## Explore Data

The [exploreData action](https://go.documentation.sas.com/?docsetId=casactml&docsetTarget=casactml_datasciencepilot_details22.htm&docsetVersion=8.5&locale=en) calculates various statistical measures for each column in your data set, such as Minimum, Maximum, Mean, Median, Mode, Number Missing, Standard Deviation, and more. The exploreData action also creates a hierarchical variable grouping with two levels. The first level groups variables according to their data type (interval, nominal, date, time, or datetime). The second level uses the following statistical metrics to group the interval and nominal data:
- Missing rate (interval and nominal).
- Cardinality (nominal). 
- Entropy (nominal). 
- Index of Qualitative Variation (IQV; interval and nominal).
- Skewness (interval).
- Kurtosis (interval).
- Outliers (interval).
- Coefficient of Variation (CV; interval).

This action returns a CAS table listing all the variables, the variable groupings, and the summary statistics. These groupings allow for a pipelined approach to data transformation and cleaning. 
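To make a few of these metrics concrete, here is a minimal local sketch in base R. It is not how the exploreData action is implemented; `profile_column` and the toy data frame are hypothetical names introduced only for illustration, and the formulas (moment skewness, Shannon entropy in bits) are textbook definitions that I am assuming are close in spirit to what the action reports.

```r
# Hypothetical helper: base-R approximations of a few exploreData-style metrics.
profile_column <- function(x) {
  obs <- x[!is.na(x)]                          # non-missing observations
  out <- list(
    missing_pct = 100 * mean(is.na(x)),        # missing rate in percent
    cardinality = length(unique(obs))          # number of distinct levels
  )
  if (is.numeric(obs)) {
    centered <- obs - mean(obs)
    # moment-based skewness: third central moment over variance^(3/2)
    out$skewness <- mean(centered^3) / mean(centered^2)^1.5
  } else {
    p <- as.numeric(table(obs)) / length(obs)  # level proportions
    out$entropy <- -sum(p * log2(p))           # Shannon entropy in bits
  }
  out
}

# Toy data (made up for this sketch, not taken from HMEQ)
toy <- data.frame(
  LOAN = c(1100, 1300, 1500, NA, 1700),
  JOB  = c("Other", "Other", "Office", NA, "Mgr"),
  stringsAsFactors = FALSE
)

profile <- lapply(toy, profile_column)
str(profile)
```

On the CAS side, the same kind of profile comes back for every column at once in the output table, which is what makes the grouped, pipelined approach practical on large data.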

In [9]:
cas.dataSciencePilot.exploreData(
        server,
        table  = list(name ="hmeq_from_R_Jupyter", caslib="OpenDemo"),
        target = trt,     
        casOut = list(name = "EXPLORE_DATA_OUT_R", replace = TRUE),
        explorationPolicy = expo
    )

casLib,Name,Rows,Columns
<chr>,<chr>,<dbl>,<dbl>
CASUSER(sasdemo),EXPLORE_DATA_OUT_R,13,42


In [10]:
cas.table.fetch(server, table = list(name = "EXPLORE_DATA_OUT_R"))

_Index_,Variable,VarType,MissingRated,CardinalityRated,EntropyRated,IQVRated,CVRated,SkewnessRated,KurtosisRated,⋯,MomentCVPer,RobustCVPer,MomentSkewness,RobustSkewness,MomentKurtosis,RobustKurtosis,LowerOutlierMomentPer,UpperOutlierMomentPer,LowerOutlierRobustPer,UpperOutlierRobustPer
<int>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,BAD,binary-target,,,,,,,,⋯,,,,,,,,,,
2,REASON,character-nominal,1.0,1.0,3.0,,,,,⋯,,,,,,,,,,
3,JOB,character-nominal,1.0,1.0,3.0,3.0,,,,⋯,,,,,,,,,,
4,LOAN,numeric-nominal,1.0,3.0,,,,,,⋯,,,,,,,,,,
5,MORTDUE,interval,2.0,,,,3.0,1.0,2.0,⋯,60.27266,69.55352,1.8144807,0.8442205,6.4818663,0.37027422,0.0,2.9584712,2.2418229,1.72730614
6,VALUE,interval,1.0,,,,3.0,1.0,3.0,⋯,56.38436,60.24788,3.0533443,0.9897549,24.3628049,0.42579341,0.0,2.4794802,0.4445964,2.59917921
7,YOJ,interval,2.0,,,,3.0,1.0,1.0,⋯,84.88853,142.85714,0.9884601,0.9779435,0.3720725,-0.00610491,0.0,2.3140496,0.0,0.05509642
8,DEROG,numeric-nominal,2.0,1.0,2.0,1.0,,,,⋯,,,,,,,,,,
9,DELINQ,numeric-nominal,2.0,1.0,2.0,1.0,,,,⋯,,,,,,,,,,
10,CLAGE,interval,2.0,,,,3.0,1.0,2.0,⋯,47.73425,67.14353,1.343412,0.2829449,7.5995493,0.06105818,0.0,1.1500354,0.0,0.90233546


*** 
## Explore Correlations

If a target is specified, the [exploreCorrelation action](https://go.documentation.sas.com/?docsetId=casactml&docsetTarget=casactml_datasciencepilot_details21.htm&docsetVersion=8.5&locale=en) performs a linear and nonlinear correlation analysis of the input variables and the target. If a target is not specified, the exploreCorrelation action performs a linear and nonlinear correlation analysis for all pairwise combinations of the input variables. The correlation statistics available depend on the data type of each input variable in the pair. 
- Nominal-nominal correlation pairs have the following statistics available: Mutual Information (MI), Symmetric Uncertainty (SU), Information Value (IV; for binary target), Entropy, chi-square, G test (G2), and Cramer’s V. 
- Nominal-interval correlation pairs have the following statistics available: Mutual Information (MI), Symmetric Uncertainty (SU), Entropy, and F-test. 
- Interval-interval correlation pairs have the following statistics available: Mutual Information (MI), Symmetric Uncertainty (SU), Entropy, and Pearson correlation. 

This action returns a CAS table listing all the variable pairs and the correlation statistics. 
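As a rough illustration of two of these statistics, the sketch below computes Mutual Information and Symmetric Uncertainty for a pair of nominal vectors in base R. The function names are my own, the inputs are assumed to have no missing values, and natural logarithms are an assumption; the action may use a different log base or estimator.

```r
# Shannon entropy of a nominal vector (natural log; assumes no missing values)
entropy <- function(x) {
  p <- as.numeric(table(x)) / length(x)
  -sum(p * log(p))
}

# MI(X, Y) = H(X) + H(Y) - H(X, Y); the joint entropy uses the observed level pairs
mutual_information <- function(x, y) {
  entropy(x) + entropy(y) - entropy(interaction(x, y, drop = TRUE))
}

# SU(X, Y) = 2 * MI(X, Y) / (H(X) + H(Y)), which lies between 0 and 1
symmetric_uncertainty <- function(x, y) {
  2 * mutual_information(x, y) / (entropy(x) + entropy(y))
}

# Toy vectors (made up): y_copy repeats x exactly, y_indep is independent of x
x       <- c("a", "a", "b", "b")
y_copy  <- x
y_indep <- c("u", "v", "u", "v")

su_copy  <- symmetric_uncertainty(x, y_copy)
mi_indep <- mutual_information(x, y_indep)
```

A variable that fully determines the other gives an SU of 1, while independent variables give an MI of 0, which is why both are useful as ranking criteria.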

In [11]:
cas.dataSciencePilot.exploreCorrelation(
        server,
        table = list(name ="hmeq_from_R_Jupyter", caslib="OpenDemo"),
        casOut = list(name="CORR_from_R", replace=TRUE),
        target = trt
)
cas.table.fetch(server, table = list(name="CORR_from_R"))

casLib,Name,Rows,Columns
<chr>,<chr>,<dbl>,<dbl>
CASUSER(sasdemo),CORR_from_R,12,4


_Index_,FirstVariable,SecondVariable,Type,MI
<int>,<chr>,<chr>,<chr>,<dbl>
1,CLAGE,BAD,_it_,0.030242323
2,CLNO,BAD,_it_,0.015505245
3,DEBTINC,BAD,_it_,0.063484853
4,DELINQ,BAD,_it_,0.076942337
5,DEROG,BAD,_it_,0.048241053
6,LOAN,BAD,_it_,0.036786854
7,MORTDUE,BAD,_it_,0.012854994
8,NINQ,BAD,_it_,0.021362923
9,VALUE,BAD,_it_,0.01645833
10,YOJ,BAD,_it_,0.00988147


***
## Analyze Missing Patterns

If the target is specified, the [analyzeMissingPatterns action](https://go.documentation.sas.com/?docsetId=casactml&docsetTarget=casactml_datasciencepilot_details04.htm&docsetVersion=8.5&locale=en) performs a missing pattern analysis of the input variables and the target. If a target is not specified, the analyzeMissingPatterns action performs a missing pattern analysis for all pairwise combinations of the input variables. This analysis provides the correlation strength between missing patterns across variable pairs and dependencies of missingness in one variable and the values of the other variable. This action returns a CAS table listing all the missing variable pairs and the statistics around missingness. 
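The intuition behind "dependencies of missingness in one variable and the values of the other variable" can be sketched locally by comparing the target event rate between rows where a variable is missing and rows where it is present. This is only a simplified stand-in, not the statistics the action computes, and the toy values below are made up.

```r
# Toy data (made up): DEBTINC is missing more often when BAD == 1
toy <- data.frame(
  BAD     = c(1, 1, 1, 0, 0, 0, 0, 1),
  DEBTINC = c(NA, NA, NA, 35.2, 30.1, NA, 41.0, 38.5)
)

# Flag missing rows, then compare the event rate across the two groups
missing_flag <- is.na(toy$DEBTINC)
event_rate   <- tapply(toy$BAD, missing_flag, mean)
event_rate
```

A large gap between the two rates suggests that missingness itself carries information about the target, which is the kind of pattern the MI and SU columns in the output table quantify.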

In [12]:
cas.dataSciencePilot.analyzeMissingPatterns(
    server,
    table = list(name ="hmeq_from_R_Jupyter", caslib="OpenDemo"),
    target = trt, 
    casOut = list(name="MISS_PATTERN_from_R", replace=TRUE)
)


casLib,Name,Rows,Columns
<chr>,<chr>,<dbl>,<dbl>
CASUSER(sasdemo),MISS_PATTERN_from_R,12,7


In [13]:
cas.table.fetch(server, table = list(name="MISS_PATTERN_from_R"))

_Index_,FirstVariable,SecondVariable,Type,MI,NormMI,SU,EntropyPerChange
<int>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>
1,CLAGE,BAD,_mt_,0.0006715406,0.03663577,0.001324012,0.093150467
2,CLNO,BAD,_mt_,0.0002576016,0.022695164,0.000542062,0.035732328
3,DEBTINC,BAD,_mt_,0.1845951,0.555613202,0.2516096,25.605476106
4,DELINQ,BAD,_mt_,0.003061436,0.078129153,0.005182954,0.42465662
5,DEROG,BAD,_mt_,0.003953859,0.088749839,0.006342418,0.548446081
6,LOAN,BAD,_mt_,0.0,0.0,0.0,0.0
7,MORTDUE,BAD,_mt_,1.127866e-05,0.004749429,1.966636e-05,0.001564481
8,NINQ,BAD,_mt_,0.001243406,0.049836964,0.002176776,0.172474876
9,VALUE,BAD,_mt_,0.03591094,0.263255279,0.08395083,4.981263843
10,YOJ,BAD,_mt_,0.002534755,0.071110352,0.004426398,0.351599951


***
## Detect Interactions

The [detectInteractions action](https://go.documentation.sas.com/?docsetId=casactml&docsetVersion=8.5&docsetTarget=casactml_datasciencepilot_details05.htm&locale=en) assesses the interactions between pairs of predictor variables and the correlation of each interaction with the response variable. Specifically, it checks whether the product of a pair of predictor variables correlates with the response variable. Because checking the correlation between the product of every predictor pair and the response variable can be computationally intensive, this action relies on the XYZ algorithm to search for these interactions efficiently in a high-dimensional space.

The detectInteractions action requires that all predictor variables be in a binary format, but the response variable can be numeric, binary, or multi-class. Additionally, the detectInteractions action can handle data in a sparse format, such as when predictor variables are encoded using a one-hot encoding scheme. In the example below, we will specify that our inputs are sparse. The output table shows the gamma value for each pair of variables.
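To show what is being searched for, here is a brute-force sketch in base R that scores every pair of binary predictors by the absolute correlation between their product and the response. This is deliberately the expensive approach that the XYZ algorithm is designed to avoid; it is not the action's method, and `interaction_strength` and the simulated data are hypothetical.

```r
# Brute-force stand-in (NOT the XYZ algorithm): score every predictor pair by
# |cor(x_i * x_j, y)|. The cost grows quadratically with the number of predictors.
interaction_strength <- function(X, y) {
  pairs <- combn(colnames(X), 2)   # 2 x n_pairs matrix of column names
  strength <- apply(pairs, 2, function(p) abs(cor(X[, p[1]] * X[, p[2]], y)))
  data.frame(first = pairs[1, ], second = pairs[2, ], strength = strength,
             stringsAsFactors = FALSE)
}

# Simulated binary predictors; the response is exactly the x1 * x2 interaction
set.seed(1)
X <- matrix(rbinom(600, 1, 0.5), ncol = 3,
            dimnames = list(NULL, c("x1", "x2", "x3")))
y <- X[, "x1"] * X[, "x2"]

res  <- interaction_strength(X, y)
best <- res[which.max(res$strength), ]
best
```

With three predictors there are only three pairs, but with thousands of one-hot columns the pair count explodes, which is the motivation for an efficient search.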

In [14]:
# Transform data into a binned, label-encoded format
cas.dataPreprocess.transform(
    server,
    table = list(name ="hmeq_from_R_Jupyter", caslib="OpenDemo"),
    copyVars = "BAD", 
    casOut = list(name="hmeq_transform", replace=TRUE), 
    requestPackages = list(
        list(inputs=list("JOB", "REASON"),
            catTrans=list(
                method="label",
                arguments=list(
                    overrides=list(binMissing=TRUE)
                )
            )
        ),
        list(inputs=list("MORTDUE", "DEBTINC", "LOAN"),
             discretize=list(method="quantile", 
                             arguments=list(overrides=list(
                                 binMissing=TRUE)
                             )
             )
        )
    )
)

ActualName,NTransVars,DisctMethod,CatTransMethod
<chr>,<dbl>,<chr>,<chr>
_TR1,2,,Label (Sparse One-Hot)
_TR2,3,Equal-Freq (Quantile),

Variable,Transformation,ResultVar,N,NMiss,NBins
<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>
DEBTINC,_TR2,_TR2_DEBTINC,4693,1267,6.0
LOAN,_TR2,_TR2_LOAN,5960,0,6.0
MORTDUE,_TR2,_TR2_MORTDUE,5442,518,6.0
JOB,_TR1,_TR1_JOB,5681,279,
REASON,_TR1,_TR1_REASON,5708,252,

Variable,N,NMiss,NLevels
<chr>,<dbl>,<dbl>,<dbl>
JOB,5681,279,6
REASON,5708,252,2

Variable,Transformation,BinId,BinLowerBnd,BinUpperBnd,BinWidth,NInBin,Mean,Std,Min,Max
<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
DEBTINC,_TR2,0,,,,1267,,,,
DEBTINC,_TR2,1,0.5244992,27.61633,27.09183,938,22.34201,5.042519,0.5244992,27.61084
DEBTINC,_TR2,2,27.61633,32.8557,5.239368,939,30.30564,1.488466,27.61633,32.85379
DEBTINC,_TR2,3,32.8557,36.58787,3.732168,938,34.82965,1.061672,32.8557,36.57539
DEBTINC,_TR2,4,36.58787,39.85215,3.264277,939,38.22702,0.9338563,36.58787,39.84839
DEBTINC,_TR2,5,39.85215,203.31215,163.46,939,43.1842,9.388867,39.85215,203.31215
LOAN,_TR2,0,,,,0,,,,
LOAN,_TR2,1,1100.0,10000.0,8900.0,1130,7112.12389,1964.838,1100.0,9900.0
LOAN,_TR2,2,10000.0,14400.0,4400.0,1253,12049.0822,1278.119,10000.0,14300.0
LOAN,_TR2,3,14400.0,18800.0,4400.0,1187,16412.38416,1245.88,14400.0,18700.0

casLib,Name,Rows,Columns
<chr>,<chr>,<dbl>,<dbl>
CASUSER(sasdemo),hmeq_transform,5960,6


In [15]:
cas.table.fetch(server, table = list(name = "hmeq_transform"))

_Index_,BAD,_TR2_DEBTINC,_TR2_LOAN,_TR2_MORTDUE,_TR1_JOB,_TR1_REASON
<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,1,0,1,1,3,2
2,1,0,1,3,3,2
3,1,0,1,1,3,2
4,1,0,1,0,0,0
5,0,0,1,4,2,2
6,1,4,1,1,3,2
7,1,0,1,2,3,2
8,1,4,1,1,3,2
9,1,0,1,1,3,2
10,1,0,1,0,5,2


In [16]:
cas.dataSciencePilot.detectInteractions(
    server,
    table ='hmeq_transform', 
    target = trt, 
    event = '1', 
    sparse = TRUE, 
    inputs = list("_TR1_JOB", "_TR1_REASON", "_TR2_MORTDUE", "_TR2_DEBTINC", "_TR2_LOAN"), 
    inputLevels = list(7, 3, 6, 6, 6), 
    casOut = list(name = "DETECT_INT_OUT_from_R", replace=TRUE)
)




Description,Value
<chr>,<dbl>
Number of Observations Read,5960.0
Number of Observations Used,4039.0
Number of Inputs,5.0
Background Average Interaction Strength,0.3214374
Subsample Size,3.0

casLib,Name,Rows,Columns
<chr>,<chr>,<dbl>,<dbl>
CASUSER(sasdemo),DETECT_INT_OUT_from_R,135,5


***
## Screen Variables

The [screenVariables action](https://go.documentation.sas.com/?docsetId=casactml&docsetTarget=casactml_datasciencepilot_details25.htm&docsetVersion=8.5&locale=en) makes one of the following recommendations for each input variable:
-	Remove variable if there are significant data-quality issues. 
-	Transform and keep variable if there are some data-quality issues. 
-	Keep variable if there are no data quality issues. 

The screenVariables action bases its recommendation on the following checks of each input variable:
- Missing rate exceeds the threshold in the screenPolicy (default is 90).
- Constant value across the input variable.
- Mutual Information (MI) about the target is below the threshold in the screenPolicy (default is 0.05).
- Entropy across levels.
- Entropy reduction of the target exceeds the threshold in the screenPolicy (default is 90); also referred to as leakage.
- Symmetric Uncertainty (SU) of two variables exceeds the threshold in the screenPolicy (default is 1); also referred to as redundancy.

This action returns a CAS table listing all the input variables, the recommended action, and the reason for the recommended action.  
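Two of these checks are simple enough to sketch locally. The helper below applies only a missing-rate threshold and a constant-value check; `screen_column`, its return strings, and the toy data are hypothetical, and the real action applies more tests and produces its own wording in the Reason column.

```r
# Hypothetical local screen: missing-rate threshold plus constant-value check only
screen_column <- function(x, missing_pct_threshold = 90) {
  missing_pct <- 100 * mean(is.na(x))
  if (missing_pct > missing_pct_threshold) {
    return("remove: missing rate above threshold")
  }
  if (length(unique(x[!is.na(x)])) <= 1) {
    return("remove: constant value")
  }
  "keep"
}

# Toy data (made up): b is 75% missing, c is constant
toy <- data.frame(
  a = c(1, 2, 3, 4),
  b = c(NA, NA, NA, 1),
  c = c(5, 5, 5, 5)
)

# A threshold of 35 mirrors the missingPercentThreshold set in scpo earlier
recommendation <- sapply(toy, screen_column, missing_pct_threshold = 35)
recommendation
```

Lowering the threshold, as the `scpo` policy does, makes the screen stricter about sparse columns.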

In [17]:
cas.dataSciencePilot.screenVariables(
    server,
    table = list(name ="hmeq_from_R_Jupyter", caslib="OpenDemo"),
    target = trt, 
    casOut = list(name="SCREEN_VARIABLES_FROM_R", replace=TRUE)
)

cas.table.fetch(server, table = list(name="SCREEN_VARIABLES_FROM_R"))

casLib,Name,Rows,Columns
<chr>,<chr>,<dbl>,<dbl>
CASUSER(sasdemo),SCREEN_VARIABLES_FROM_R,12,3


_Index_,Variable,Recommendation,Reason
<int>,<chr>,<chr>,<chr>
1,REASON,keep,passed all screening tests
2,JOB,keep,passed all screening tests
3,LOAN,keep,passed all screening tests
4,MORTDUE,keep,passed all screening tests
5,VALUE,keep,passed all screening tests
6,YOJ,keep,passed all screening tests
7,DEROG,keep,passed all screening tests
8,DELINQ,keep,passed all screening tests
9,CLAGE,keep,passed all screening tests
10,NINQ,keep,passed all screening tests


*** 
## Feature Machine

The [featureMachine action](https://go.documentation.sas.com/?docsetId=casactml&docsetTarget=casactml_datasciencepilot_details23.htm&docsetVersion=8.5&locale=en) generates features automatically and in parallel. The featureMachine action first explores the data and groups the input variables into categories with the same statistical profile, like the exploreData action. Next, the featureMachine action screens variables to identify noise variables to exclude from further analysis, like the screenVariables action. Finally, the featureMachine action generates new features by using the available structured pipelines:
-	Missing indicator addition. 
-	Mode imputation and rare value grouping. 
-	Missing level and rare value grouping. 
-	Median imputation. 
-	Mode imputation and label encoding. 
-	Missing level and label encoding. 
-	Yeo-Johnson transformation and median imputation. 
-	Box-Cox transformation. 
-	Quantile binning with missing bins.
-	Regression tree binning.
-	Decision tree binning. 
-	MDLP binning. 
-	Target encoding. 
-	Date, time, and datetime transformations. 

Depending on the parameters specified in the transformationPolicy, the featureMachine action can generate several features for each input variable. This action returns four CAS tables: the first describes the transformation pipelines, the second describes the transformed features, the third is the input table scored with the transformed features, and the fourth is an analytical store for scoring any additional input tables.
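For a feel of what three of these pipelines produce, here is a base-R sketch of a missing indicator, median imputation, and quantile binning with a separate bin for missing values. The variable names and the encoding conventions (1 for missing, bin 0 for missing) are my own assumptions for this sketch; the featureMachine action's generated features may be encoded differently.

```r
# Toy interval variable with one missing value (values made up for this sketch)
clage <- c(94.4, 121.8, 149.5, NA, 93.3, 101.5)

# Missing indicator: 1 where the value is missing (convention assumed here)
miss_ind <- as.integer(is.na(clage))

# Median imputation: replace missing values with the median of the observed ones
med_imp <- ifelse(is.na(clage), median(clage, na.rm = TRUE), clage)

# Quantile binning into two bins, with missing values sent to bin 0
breaks <- quantile(clage, probs = c(0, 0.5, 1), na.rm = TRUE)
bin <- cut(clage, breaks = breaks, include.lowest = TRUE, labels = FALSE)
bin[is.na(bin)] <- 0L

data.frame(clage, miss_ind, med_imp, bin)
```

Each input variable can feed several such pipelines at once, which is why the scored output table has many more columns than the input.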

In [18]:
cas.dataSciencePilot.featureMachine(
    server,
    table = list(name ="hmeq_from_R_Jupyter", caslib="OpenDemo"),
    target = trt, 
    copyVars = trt, 
    explorationPolicy = expo, 
    screenPolicy = scpo, 
    transformationPolicy = trpo, 
    transformationOut       = list(name="TRANSFORMATION_OUT", replace=TRUE),
    featureOut              = list(name="FEATURE_OUT", replace=TRUE),
    casOut                  = list(name="CAS_OUT", replace=TRUE),
    saveState               = list(name="ASTORE_OUT", replace=TRUE)
)

casLib,Name,Rows,Columns
<chr>,<chr>,<dbl>,<dbl>
CASUSER(sasdemo),TRANSFORMATION_OUT,33,21
CASUSER(sasdemo),FEATURE_OUT,59,9
CASUSER(sasdemo),CAS_OUT,5960,60
CASUSER(sasdemo),ASTORE_OUT,1,2


In [19]:
cas.table.fetch(server, table = list(name="TRANSFORMATION_OUT"))

_Index_,FTGPipelineId,Name,NVariables,IsInteraction,ImputeMethod,OutlierMethod,OutlierTreat,OutlierArgs,FunctionMethod,⋯,MapIntervalArgs,HashMethod,HashArgs,DateTimeMethod,DiscretizeMethod,DiscretizeArgs,CatTransMethod,CatTransArgs,InteractionMethod,InteractionSynthesizer
<int>,<dbl>,<chr>,<dbl>,<chr>,<chr>,<chr>,<chr>,<dbl>,<chr>,⋯,<dbl>,<chr>,<dbl>,<chr>,<chr>,<dbl>,<chr>,<dbl>,<chr>,<chr>
1,1,miss_ind,5,,,,,,,⋯,,MissIndicator,2.0,,,,,,,
2,2,grp_rare1,2,,Mode,,,,,⋯,,,,,,,Group Rare,5.0,,
3,3,hc_tar_frq_rat,1,,,,,,,⋯,10.0,,,,,,,,,
4,4,hc_lbl_cnt,1,,,,,,,⋯,0.0,,,,,,,,,
5,5,hc_cnt,1,,,,,,,⋯,0.0,,,,,,,,,
6,6,hc_cnt_log,1,,,,,,Log,⋯,0.0,,,,,,,,,
7,7,lchehi_lab,1,,,,,,,⋯,,,,,,,Label (Sparse One-Hot),0.0,,
8,8,lcnhenhi_grp_rare,1,,,,,,,⋯,,,,,,,Group Rare,5.0,,
9,9,lcnhenhi_dtree5,1,,,,,,,⋯,,,,,,,DTree,5.0,,
10,10,lcnhenhi_dtree10,1,,,,,,,⋯,,,,,,,DTree,10.0,,


In [20]:
cas.table.fetch(server, table = list(name="FEATURE_OUT"))

_Index_,FeatureId,Name,IsNominal,FTGPipelineId,NInputs,InputVar1,InputVar2,InputVar3,Label
<int>,<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<chr>,<chr>,<chr>,<chr>
1,1,cpy_int_med_imp_CLAGE,0,32,1,CLAGE,,,CLAGE: Low missing rate - median imputation
2,2,miss_ind_CLAGE,1,1,1,CLAGE,,,CLAGE: Significant missing - missing indicator
3,3,nhoks_nloks_dtree_10_CLAGE,1,31,1,CLAGE,,,"CLAGE: Not high (outlier, kurtosis, skewness) - ten bin decision tree binning"
4,4,nhoks_nloks_dtree_5_CLAGE,1,30,1,CLAGE,,,"CLAGE: Not high (outlier, kurtosis, skewness) - five bin decision tree binning"
5,5,nhoks_nloks_log_CLAGE,0,26,1,CLAGE,,,"CLAGE: Not high (outlier, kurtosis, skewness) - log + impute(median)"
6,6,nhoks_nloks_pow_n0_5_CLAGE,0,25,1,CLAGE,,,"CLAGE: Not high (outlier, kurtosis, skewness) - power(-0.5) + impute(median)"
7,7,nhoks_nloks_pow_n1_CLAGE,0,24,1,CLAGE,,,"CLAGE: Not high (outlier, kurtosis, skewness) - power(-1) + impute(median)"
8,8,nhoks_nloks_pow_n2_CLAGE,0,23,1,CLAGE,,,"CLAGE: Not high (outlier, kurtosis, skewness) - power(-2) + impute(median)"
9,9,nhoks_nloks_pow_p0_5_CLAGE,0,27,1,CLAGE,,,"CLAGE: Not high (outlier, kurtosis, skewness) - power(0.5) + impute(median)"
10,10,nhoks_nloks_pow_p1_CLAGE,0,28,1,CLAGE,,,"CLAGE: Not high (outlier, kurtosis, skewness) - power(1) + impute(median)"


In [21]:
cas.table.fetch(server, table = list(name="CAS_OUT"))

_Index_,BAD,cpy_int_med_imp_CLAGE,miss_ind_CLAGE,nhoks_nloks_dtree_10_CLAGE,nhoks_nloks_dtree_5_CLAGE,nhoks_nloks_log_CLAGE,nhoks_nloks_pow_n0_5_CLAGE,nhoks_nloks_pow_n1_CLAGE,nhoks_nloks_pow_n2_CLAGE,⋯,hc_lbl_cnt_LOAN,hc_tar_frq_rat_LOAN,cpy_nom_miss_lev_lab_NINQ,lcnhenhi_dtree10_NINQ,lcnhenhi_dtree5_NINQ,lcnhenhi_grp_rare_NINQ,miss_ind_NINQ,cpy_nom_miss_lev_lab_JOB,lchehi_lab_JOB,cpy_nom_miss_lev_lab_REASON
<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,1,94.36667,1,3,2,4.557729,0.10240041,0.010485844,0.0001099529,⋯,528,0.5,2,2,2,2,1,3,3,2
2,1,121.83333,1,4,2,4.810828,0.09022811,0.008141113,6.627771e-05,⋯,461,0.5,1,1,1,1,1,3,3,2
3,1,149.46667,1,4,2,5.013742,0.08152294,0.00664599,4.416919e-05,⋯,385,0.5,2,2,2,2,1,3,3,2
4,1,173.46667,0,0,0,5.161734,0.07570835,0.005731754,3.2853e-05,⋯,385,0.5,0,0,0,0,0,0,0,0
5,0,93.33333,1,2,2,4.546835,0.10295973,0.010600707,0.000112375,⋯,359,0.5,1,1,1,1,1,2,2,2
6,1,101.466,1,3,2,4.629531,0.09878934,0.009759335,9.524461e-05,⋯,359,0.5,2,2,2,2,1,3,3,2
7,1,77.1,1,2,2,4.35799,0.11315519,0.012804097,0.0001639449,⋯,401,0.5,2,2,2,2,1,3,3,2
8,1,88.76603,1,2,2,4.497207,0.10554654,0.011140072,0.0001241012,⋯,401,0.5,1,1,1,1,1,3,3,2
9,1,216.93333,1,7,4,5.384189,0.0677389,0.004588559,2.105488e-05,⋯,259,0.5,2,2,2,2,1,3,3,2
10,1,115.8,1,3,2,4.760463,0.09252915,0.008561644,7.330175e-05,⋯,259,0.5,1,1,1,1,1,5,5,2


*** 
## Generate Shadow Features

The [generateShadowFeatures action](https://go.documentation.sas.com/?docsetId=casactml&docsetTarget=casactml_datasciencepilot_examples39.htm&docsetVersion=8.5&locale=en) performs a scalable random permutation of input features to create shadow features. The shadow features are randomly selected from a matching distribution of each input feature. These shadow features can be used for all-relevant feature selection, which removes the inputs whose variable importance is lower than the shadow feature’s variable importance. The shadow features can also be used in a post-fit analysis using Permutation Feature Importance (PFI). By replacing each input with its shadow feature one by one and measuring the change in model performance, one can determine that feature's importance from the relative size of the model’s performance change.

In the example below, I will use the outputs of the feature machine for all-relevant feature selection. This involves getting the variable metadata from my feature machine table, generating my shadow features, finding the variable importance for my features and shadow features using a random forest, and comparing each variable's importance to that of its shadow features. In the end, I will keep only the variables whose importance is higher than that of their shadow features for the next phase.
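The idea can be sketched locally before running it on CAS. In the base-R example below, each shadow feature is a random permutation of its original column, so it keeps the same distribution but loses any relationship with the target. As a simplification, I use absolute correlation with the target as a stand-in for forest variable importance; the data are simulated and all names are hypothetical.

```r
set.seed(42)
n <- 500

# Simulated data: "signal" drives the target, "noise" does not
signal <- rnorm(n)
noise  <- rnorm(n)
target <- as.integer(signal + rnorm(n, sd = 0.5) > 0)
features <- data.frame(signal = signal, noise = noise)

# Shadow features: permute each column independently
shadows <- as.data.frame(lapply(features, sample))
names(shadows) <- paste0("shadow_", names(features))

# Importance proxy (an assumption of this sketch, not forest importance)
importance <- function(x) abs(cor(x, target))
real_imp   <- sapply(features, importance)
shadow_imp <- sapply(shadows, importance)

# Keep only the features that beat their own shadow
keep <- names(features)[real_imp > shadow_imp]
keep
```

A feature that cannot outperform a shuffled copy of itself is unlikely to be carrying real information, which is the test applied in the cells that follow.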

In [22]:
# Getting variable names and metadata from feature machine output
fm <- to.casDataFrame(defCasTable(server, "FEATURE_OUT"))
inputs <- fm$Name
nom <- fm$Name[fm$IsNominal==1]

# Generating Shadow Features
cas.dataSciencePilot.generateShadowFeatures(
    server,
    table = 'CAS_OUT', 
    nProbes = 2, 
    inputs = inputs, 
    nominals = nom,
    casout=list(name = "SHADOW_FEATURES_OUT", replace=TRUE),
    copyVars = trt
)


casLib,Name,Rows,Columns
<chr>,<chr>,<dbl>,<dbl>
CASUSER(sasdemo),SHADOW_FEATURES_OUT,5960,119


In [23]:
cas.table.fetch(server, table = "SHADOW_FEATURES_OUT")

_Index_,BAD,_fpi_cpy_int_med_imp_CLAGE_1,_fpi_cpy_int_med_imp_CLAGE_2,_fpi_cpy_int_med_imp_DEBTINC_1,_fpi_cpy_int_med_imp_DEBTINC_2,_fpi_cpy_int_med_imp_MORTDUE_1,_fpi_cpy_int_med_imp_MORTDUE_2,_fpi_cpy_int_med_imp_VALUE_1,_fpi_cpy_int_med_imp_VALUE_2,⋯,_fpn_miss_ind_YOJ_1,_fpn_miss_ind_YOJ_2,_fpn_nhoks_nloks_dtree_10_CLAGE_1,_fpn_nhoks_nloks_dtree_10_CLAGE_2,_fpn_nhoks_nloks_dtree_10_YOJ_1,_fpn_nhoks_nloks_dtree_10_YOJ_2,_fpn_nhoks_nloks_dtree_5_CLAGE_1,_fpn_nhoks_nloks_dtree_5_CLAGE_2,_fpn_nhoks_nloks_dtree_5_YOJ_1,_fpn_nhoks_nloks_dtree_5_YOJ_2
<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,1,367.24187,215.61179,22.05703,34.81827,76600.853,54077.154,68489.29,54514.17,⋯,1,0,2,6,8,8,1,5,4,4
2,1,102.51186,143.73687,30.70182,42.4977,22112.279,33337.008,62824.75,131523.25,⋯,1,1,5,10,2,3,2,5,0,2
3,1,206.62355,150.15827,40.00731,34.81826,75961.838,16355.44,96608.76,50000.39,⋯,1,1,10,10,1,9,3,2,4,0
4,1,295.08331,171.7267,40.95501,34.81826,131443.202,43809.377,96268.25,67192.53,⋯,1,1,1,2,2,4,0,2,5,4
5,0,174.44322,97.40284,39.93486,31.55124,62212.928,43778.751,84083.8,49013.17,⋯,1,1,1,3,10,1,2,5,4,3
6,1,130.82698,261.91422,39.42332,34.81826,56805.911,137001.396,171706.44,184217.5,⋯,1,1,4,10,5,7,2,2,2,4
7,1,223.86455,267.67097,40.41403,43.24281,51989.981,125690.917,45654.14,131367.07,⋯,1,1,4,9,9,8,0,5,2,4
8,1,174.55157,273.31307,19.24355,43.78095,59903.271,56715.712,35400.55,108632.44,⋯,1,1,9,9,3,10,1,0,4,5
9,1,203.03879,108.60735,30.44789,38.72168,50371.576,41130.316,48047.23,61456.22,⋯,1,1,4,9,0,7,2,5,2,1
10,1,125.77261,196.91074,38.27022,25.61715,46769.137,53615.495,48129.13,54306.85,⋯,1,1,4,3,8,8,2,2,5,0


In [24]:
# First, need to load the decisionTree Action Set:
loadActionSet(server, 'decisionTree')

# Getting Feature Importance for Original Features
feats <- cas.decisionTree.forestTrain(
    server,
    table = 'CAS_OUT', 
    inputs = inputs, 
    target = trt, 
    varImp = TRUE)

real_features <- feats$DTreeVarImpInfo


# Getting Feature Importance for Shadow Features
# (exclude the target, which was copied into the shadow table via copyVars)
inp <- setdiff(colnames(defCasTable(server, 'SHADOW_FEATURES_OUT')), trt)
shadow_feats <- cas.decisionTree.forestTrain(
    server,
    table = 'SHADOW_FEATURES_OUT', 
    inputs = inp, 
    target = trt, 
    varImp = TRUE)

sf <- shadow_feats$DTreeVarImpInfo


# Building dataframe for easy comparison
feat_comp <- data.frame(Variable=real_features$Variable, 
                        Real_Imp=real_features$Importance,
                        SF_Imp1=0,
                        SF_Imp2=0) 

NOTE: Added action set 'decisionTree'.

NOTE: Information for action set 'decisionTree':

NOTE:    decisionTree

NOTE:       dtreeTrain - Trains a decision tree

NOTE:       dtreeScore - Scores a table using a decision tree model

NOTE:       dtreeSplit - Splits decision tree nodes

NOTE:       dtreePrune - Prune a decision tree

NOTE:       dtreeMerge - Merges decision tree nodes

NOTE:       dtreeCode - Generates DATA step scoring code from a decision tree model

NOTE:       forestTrain - Trains a forest. This action requires a SAS Visual Data Mining and Machine Learning license

NOTE:       forestScore - Scores a table using a forest model

NOTE:       forestCode - Generates DATA step scoring code from a forest model

NOTE:       gbtreeTrain - Trains a gradient boosting tree. This action requires a SAS Visual Data Mining and Machine Learning license

NOTE:       gbtreeScore - Scores a table using a gradient boosting tree model

NOTE:       gbtreeCode - Generates DATA step scoring co

In [25]:
### Finding each feature's shadow features.
### Shadow names look like "_fpi_<feature>_<k>" or "_fpn_<feature>_<k>", where k is the probe number (1 or 2).
for (cRow in seq_len(nrow(sf))) {
    ### remove the five-character prefix and the final two characters ("_k")
    temp_name <- str_sub(sf$Variable[cRow], 6, -3)
    ### grab the probe number (the final character)
    temp_num <- str_sub(sf$Variable[cRow], -1, -1)
    ### then loop through each row of feat_comp (feature comparison object) and fill in importance values
    for (fcRow in seq_len(nrow(feat_comp))) {
        if (temp_name == feat_comp$Variable[fcRow]) {
            if (temp_num == "1") {
                ### fill in first shadow feature's importance value
                feat_comp$SF_Imp1[fcRow] <- sf$Importance[cRow]
            } else {
                ### fill in second shadow feature's importance value
                feat_comp$SF_Imp2[fcRow] <- sf$Importance[cRow]
            }
        }
    }
}
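For reference, the same name matching can be done without nested loops. The sketch below is one possible vectorized alternative in base R, assuming every shadow name follows the `_fpi_`/`_fpn_` prefix and `_<k>` suffix pattern seen in the output above and that every shadow feature has a match in the comparison table. The importance values here are made up purely to show the mechanics.

```r
# Example shadow names in the pattern shown in SHADOW_FEATURES_OUT
shadow_names <- c("_fpi_cpy_int_med_imp_CLAGE_1", "_fpn_miss_ind_YOJ_2")
shadow_imp   <- c(0.2, 0.1)   # illustrative values only

# Strip the prefix and the probe suffix to recover the original feature name
base_name <- sub("^_fp[in]_(.*)_[0-9]+$", "\\1", shadow_names)
# Pull the probe number from the end of the name
probe <- as.integer(sub("^.*_([0-9]+)$", "\\1", shadow_names))

# Toy comparison table (stands in for feat_comp)
feat_comp_demo <- data.frame(
  Variable = c("miss_ind_YOJ", "cpy_int_med_imp_CLAGE"),
  SF_Imp1 = 0, SF_Imp2 = 0,
  stringsAsFactors = FALSE
)

# Look up each shadow feature's row once, then assign by probe number
rows <- match(base_name, feat_comp_demo$Variable)
feat_comp_demo$SF_Imp1[rows[probe == 1]] <- shadow_imp[probe == 1]
feat_comp_demo$SF_Imp2[rows[probe == 2]] <- shadow_imp[probe == 2]
feat_comp_demo
```

Using a pattern rather than fixed character positions also keeps working if the probe number ever has more than one digit.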

In [26]:
# Flag features whose importance is no greater than that of at least one of their shadow features
feat_comp$to_drop <- ifelse(feat_comp$Real_Imp <= feat_comp$SF_Imp1, 1, 
                            ifelse(feat_comp$Real_Imp <= feat_comp$SF_Imp2, 1, 0))

cols_to_drop <- as.character(feat_comp$Variable[feat_comp$to_drop==1])
"Dropping the following columns: "
print(paste(cols_to_drop))

 [1] "hc_cnt_log_LOAN"          "nhoks_nloks_pow_p0_5_YOJ"
 [3] "nhoks_nloks_pow_n0_5_YOJ" "nhoks_nloks_pow_p2_YOJ"  
 [5] "ho_quan_disct10_MORTDUE"  "hc_cnt_LOAN"             
 [7] "nhoks_nloks_pow_n1_YOJ"   "ho_winsor_VALUE"         
 [9] "ho_winsor_MORTDUE"        "hc_lbl_cnt_LOAN"         
[11] "nhoks_nloks_pow_n2_YOJ"   "nhoks_nloks_pow_p1_YOJ"  
[13] "nhoks_nloks_dtree_5_YOJ" 
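
The drop rule above amounts to asking whether a feature beats its strongest shadow feature. A small sketch of that equivalent formulation follows, using made-up importances and hypothetical variable names; keeping features that have no shadow importance at all is an assumption of this sketch, not something the notebook specifies.

```r
# Hypothetical importances; NA marks a feature with no recorded shadow importance.
feat_comp <- data.frame(
  Variable = c("A", "B", "C"),
  Real_Imp = c(0.25, 0.40, 0.10),
  SF_Imp1  = c(0.10, 0.05, NA),
  SF_Imp2  = c(0.30, 0.02, NA),
  stringsAsFactors = FALSE
)

# Strongest shadow importance per feature; stays NA only when both are missing.
best_shadow <- pmax(feat_comp$SF_Imp1, feat_comp$SF_Imp2, na.rm = TRUE)

# A feature is dropped when it fails to beat its strongest shadow feature.
feat_comp$to_drop <- !is.na(best_shadow) & feat_comp$Real_Imp <= best_shadow

cols_to_drop <- feat_comp$Variable[feat_comp$to_drop]
print(cols_to_drop)
```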


In [27]:
### drop the columns using the alterTable CAS Action:
cas.table.alterTable(server, name = "CAS_OUT", drop = cols_to_drop)

*** 
## Select Features

The [selectFeatures action](https://go.documentation.sas.com/?docsetId=casactml&docsetTarget=casactml_datasciencepilot_details26.htm&docsetVersion=8.5&locale=en) performs a filter-based selection using the criterion chosen in the selectionPolicy (by default, the ten best input variables according to the Mutual Information statistic). The criteria available for selection include Chi-Square, Cramer’s V, F-test, G2, Information Value, Mutual Information, Normalized Mutual Information, Pearson correlation, and the Symmetric Uncertainty statistic. This action returns a CAS table listing the variables, their rank according to the selected criterion, and the value of the selected criterion.

In [28]:
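# Note: this cell calls screenVariables (with screenPolicy) on the reduced CAS_OUT table,
# so the output below is a screening report rather than a selectFeatures ranking.
cas.dataSciencePilot.screenVariables(
    server,
    table='CAS_OUT', 
    target=trt, 
    screenPolicy=scpo, 
    casout=list(name="SCREEN_VARIABLES_OUT", replace=TRUE)
)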

cas.table.fetch(server, table = "SCREEN_VARIABLES_OUT")

casLib,Name,Rows,Columns
<chr>,<chr>,<dbl>,<dbl>
CASUSER(sasdemo),SCREEN_VARIABLES_OUT,46,3


_Index_,Variable,Recommendation,Reason
<int>,<chr>,<chr>,<chr>
1,cpy_int_med_imp_CLAGE,keep,passed all screening tests
2,miss_ind_CLAGE,keep,passed all screening tests
3,nhoks_nloks_dtree_10_CLAGE,keep,passed all screening tests
4,nhoks_nloks_dtree_5_CLAGE,keep,passed all screening tests
5,nhoks_nloks_log_CLAGE,keep,passed all screening tests
6,nhoks_nloks_pow_n0_5_CLAGE,keep,passed all screening tests
7,nhoks_nloks_pow_n1_CLAGE,keep,passed all screening tests
8,nhoks_nloks_pow_n2_CLAGE,keep,passed all screening tests
9,nhoks_nloks_pow_p0_5_CLAGE,keep,passed all screening tests
10,nhoks_nloks_pow_p1_CLAGE,keep,passed all screening tests


***
## Data Science Automated Machine Learning Pipeline

The [dsAutoMl action](https://go.documentation.sas.com/?docsetId=casactml&docsetTarget=casactml_datasciencepilot_details20.htm&docsetVersion=8.5&locale=en) creates a policy-based, scalable, end-to-end automated machine learning pipeline for both regression and classification problems. The only inputs required from the user are the input data set and the target variable; optional parameters include the policy parameters for data exploration, variable screening, feature selection, and feature transformation. Overriding the default policy parameters allows data scientists to configure the pipeline to fit their workflow. A data scientist may also select additional models to consider. By default, only a decision tree model is included in the pipeline, but neural networks, random forest models, and gradient boosting models are also available.

The dsAutoMl action first explores the data and groups the input variables into categories with the same statistical profile, like the exploreData action. Next, it screens variables to identify noise variables to exclude from further analysis, like the screenVariables action. Then it generates several new features from the input variables, like the featureMachine action. Once these new, cleaned features are available, it selects features according to the chosen criterion, like the selectFeatures action.

From here, various pipelines are created using subsets of the selected features, chosen for each pipeline using a feature-representation algorithm. Then the chosen models are added to each pipeline and the hyperparameters for the selected models are optimized, like the modelComposer action of the autotune action set. These hyperparameters are optimized for the selected objective parameter under cross-validation. By default, classification problems are optimized to have the smallest Misclassification Error Rate (MCE) and regression problems are optimized to have the smallest Average Square Error (ASE). Data scientists can then select their champion and challenger models from the pipelines.

This action returns several CAS tables: the first describes the transformation pipelines, the second describes the transformed features, the third lists pipeline performance according to the objective parameter, and the remaining tables are analytical stores used to create the feature set and score new data with the model.
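
The two default objectives and the cross-validation scheme described above can be sketched in a few lines of base R. This is an illustration under stated assumptions, not the action's implementation: the data are simulated, a logistic regression from `glm` stands in for the tuned models, and the `mce` and `ase` helpers and the 0.5 cutoff are choices made for this sketch.

```r
# Misclassification error: share of observations whose predicted class is wrong.
mce <- function(actual, p_event, cutoff = 0.5) {
  mean(as.integer(p_event >= cutoff) != actual)
}

# Average squared error between the 0/1 target and the predicted event probability.
ase <- function(actual, p_event) {
  mean((actual - p_event)^2)
}

# Simulated data: one numeric input and a binary target.
set.seed(42)
n <- 200
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(1.5 * x))
dat <- data.frame(x = x, y = y)

# Assign each row to one of k folds, then score every fold with a model
# trained on the remaining folds.
k <- 5
fold <- sample(rep(seq_len(k), length.out = n))
fold_scores <- t(sapply(seq_len(k), function(i) {
  fit <- glm(y ~ x, family = binomial(), data = dat[fold != i, ])
  held_out <- dat[fold == i, ]
  p <- predict(fit, newdata = held_out, type = "response")
  c(MCE = mce(held_out$y, p), ASE = ase(held_out$y, p))
}))

# Average the per-fold scores to get the cross-validated objective values.
cv_scores <- colMeans(fold_scores)
print(round(cv_scores, 4))
```

A tuner would repeat this evaluation for each candidate hyperparameter configuration and keep the configuration with the smallest cross-validated objective.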

In [29]:
cas.dataSciencePilot.dsAutoMl(
    server,
    table                = list(name="hmeq_from_R_Jupyter", caslib="OpenDemo"),
    target               = trt,
    explorationPolicy    = expo,
    screenPolicy         = scpo,
    selectionPolicy      = sepo,
    transformationPolicy = trpo,
    modelTypes           = c("decisionTree", "gradboost"),
    objective            = "ASE",
    sampleSize           = 10,
    topKPipelines        = 10,
    kFolds               = 5,
    transformationOut    = list(name="TRANSFORMATION_OUT_R", replace=TRUE),
    featureOut           = list(name="FEATURE_OUT_R", replace=TRUE),
    pipelineOut          = list(name="PIPELINE_OUT_R", replace=TRUE),
    saveState            = list(modelNamePrefix="ASTORE_OUT_R", replace=TRUE)
)

NOTE: Added action set 'autotune'.


NOTE: Added action set 'decisionTree'.




NOTE: Early stopping is activated; 'NTREE' will not be tuned.

NOTE: Added action set 'autotune'.


NOTE: The number of bins will not be tuned since all inputs are nominal.

NOTE: Added action set 'decisionTree'.




NOTE: Early stopping is activated; 'NTREE' will not be tuned.

NOTE: The number of bins will not be tuned since all inputs are nominal.

NOTE: Added action set 'autotune'.


NOTE: Added action set 'decisionTree'.




NOTE: Early stopping is activated; 'NTREE' will not be tuned.

NOTE: Added action set 'autotune'.


NOTE: The number of bins will not be tuned since all inputs are nominal.

NOTE: Added action set 'decisionTree'.




NOTE: Early stopping is activated; 'NTREE' will not be tuned.

NOTE: The number of bins will not be tuned since all inputs are nominal.

NOTE: Added action set 'autotune'.


NOTE: The number of bins will not be tuned since all inputs are nominal.

NOTE: Added action se

Descr,Value
<chr>,<dbl>
Number of Tree Nodes,15.0
Max Number of Branches,2.0
Number of Levels,6.0
Number of Leaves,8.0
Number of Bins,50.0
Minimum Size of Leaves,38.0
Maximum Size of Leaves,4179.0
Number of Variables,1.0
Confidence Level for Pruning,0.25
Number of Observations Used,5960.0

Descr,Value
<chr>,<chr>
Number of Observations Read,5960.0
Number of Observations Used,5960.0
Misclassification Error (%),18.238255034

LEVNAME,LEVINDEX,VARNAME
<chr>,<int>,<chr>
1,0,P_BAD1
0,1,P_BAD0

LEVNAME,LEVINDEX,VARNAME
<chr>,<int>,<chr>
,0,I_BAD

Variable,Event,CutOff,TP,FP,FN,TN,Sensitivity,Specificity,KS,⋯,F_HALF,FPR,ACC,FDR,F1,C,Gini,Gamma,Tau,MISCEVENT
<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
P_BAD0,0,0.00,4771,1189,0,0,1.0000000,0.00000000,0,⋯,0.8337702,1.0000000,0.8005034,0.1994966,0.8891995,0.6665542,0.3331083,0.5522005,0.1064111,0.1994966
P_BAD0,0,0.01,4771,1138,0,51,1.0000000,0.04289319,0,⋯,0.8397578,0.9571068,0.8090604,0.1925876,0.8934457,0.6665542,0.3331083,0.5522005,0.1064111,0.1909396
P_BAD0,0,0.02,4771,1138,0,51,1.0000000,0.04289319,0,⋯,0.8397578,0.9571068,0.8090604,0.1925876,0.8934457,0.6665542,0.3331083,0.5522005,0.1064111,0.1909396
P_BAD0,0,0.03,4771,1138,0,51,1.0000000,0.04289319,0,⋯,0.8397578,0.9571068,0.8090604,0.1925876,0.8934457,0.6665542,0.3331083,0.5522005,0.1064111,0.1909396
P_BAD0,0,0.04,4771,1138,0,51,1.0000000,0.04289319,0,⋯,0.8397578,0.9571068,0.8090604,0.1925876,0.8934457,0.6665542,0.3331083,0.5522005,0.1064111,0.1909396
P_BAD0,0,0.05,4771,1138,0,51,1.0000000,0.04289319,0,⋯,0.8397578,0.9571068,0.8090604,0.1925876,0.8934457,0.6665542,0.3331083,0.5522005,0.1064111,0.1909396
P_BAD0,0,0.06,4771,1138,0,51,1.0000000,0.04289319,0,⋯,0.8397578,0.9571068,0.8090604,0.1925876,0.8934457,0.6665542,0.3331083,0.5522005,0.1064111,0.1909396
P_BAD0,0,0.07,4771,1138,0,51,1.0000000,0.04289319,0,⋯,0.8397578,0.9571068,0.8090604,0.1925876,0.8934457,0.6665542,0.3331083,0.5522005,0.1064111,0.1909396
P_BAD0,0,0.08,4771,1138,0,51,1.0000000,0.04289319,0,⋯,0.8397578,0.9571068,0.8090604,0.1925876,0.8934457,0.6665542,0.3331083,0.5522005,0.1064111,0.1909396
P_BAD0,0,0.09,4771,1138,0,51,1.0000000,0.04289319,0,⋯,0.8397578,0.9571068,0.8090604,0.1925876,0.8934457,0.6665542,0.3331083,0.5522005,0.1064111,0.1909396

NOBS,ASE,DIV,RASE,MCE,MCLL
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
5960,0.13934,5960,0.3732827,0.1823826,0.4461202

Parameter,Value
<chr>,<chr>
Model Type,Decision Tree
Tuner Objective Function,Misclassification
Search Method,GRID
Number of Grid Points,6
Maximum Tuning Time in Seconds,36000
Validation Type,Cross-Validation
Num Folds in Cross-Validation,5
Log Level,0
Seed,796709594
Number of Parallel Evaluations,4

Evaluation,MAXLEVEL,CRIT,MeanConseqError,EvaluationTime
<int>,<int>,<chr>,<dbl>,<dbl>
0,11,gainRatio,0.1827172,0.780735
4,10,gain,0.1788898,1.660795
6,10,gainRatio,0.1788898,0.866489
3,15,gain,0.1814066,1.263057
5,5,gain,0.1815743,0.898322
2,5,gainRatio,0.1836618,0.855706
1,15,gainRatio,0.1848362,0.633738

Iteration,Evaluations,Best_obj,Time_sec
<int>,<int>,<dbl>,<dbl>
0,1,0.1827172,0.780735
1,7,0.1788898,2.50424

Evaluation,Iteration,MAXLEVEL,CRIT,MeanConseqError,EvaluationTime
<int>,<int>,<int>,<chr>,<dbl>,<dbl>
0,0,11,gainRatio,0.1827172,0.780735
1,1,15,gainRatio,0.1848362,0.633738
2,1,5,gainRatio,0.1836618,0.855706
3,1,15,gain,0.1814066,1.263057
4,1,10,gain,0.1788898,1.660795
5,1,5,gain,0.1815743,0.898322
6,1,10,gainRatio,0.1788898,0.866489

Parameter,Name,Value
<chr>,<chr>,<chr>
Evaluation,Evaluation,4
Maximum Tree Levels,MAXLEVEL,10
Criterion,CRIT,gain
Misclassification,Objective,0.1788897717

Parameter,Value
<chr>,<dbl>
Initial Configuration Objective Value,0.1827172
Best Configuration Objective Value,0.1788898
Worst Configuration Objective Value,0.1848362
Initial Configuration Evaluation Time in Seconds,0.780735
Best Configuration Evaluation Time in Seconds,1.0288429
Number of Improved Configurations,2.0
Number of Evaluated Configurations,7.0
Total Tuning Time in Seconds,2.6810091
Parallel Tuning Speedup,2.3942669

Task,Time_sec,Time_percent
<chr>,<dbl>,<dbl>
Model Training,3.657799,56.98348
Model Scoring,2.0837827,32.46247
Total Objective Evaluations,5.7476764,89.5409
Tuner,0.6713748,10.4591
Total CPU Time,6.4190512,100.0

Hyperparameter,RelImportance
<chr>,<dbl>
MAXLEVEL,1.0
CRIT,0.4852744

Descr,Value
<chr>,<dbl>
Number of Trees,150.0
Distribution,2.0
Learning Rate,0.1
Subsampling Rate,0.6
Number of Selected Variables (M),1.0
Number of Bins,50.0
Number of Variables,1.0
Max Number of Tree Nodes,7.0
Min Number of Tree Nodes,7.0
Max Number of Branches,2.0

Progress,Metric
<dbl>,<dbl>
1,0.1994966

Descr,Value
<chr>,<chr>
Number of Observations Read,5960.0
Number of Observations Used,5960.0
Misclassification Error (%),19.94966443

TreeID,Trees,NLeaves,MCR,LogLoss,ASE,RASE,MAXAE
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
0,1,4,0.1994966,0.4890773,0.1561412,0.395147,0.8088122

LEVNAME,LEVINDEX,VARNAME
<chr>,<int>,<chr>
1,0,P_BAD1
0,1,P_BAD0

LEVNAME,LEVINDEX,VARNAME
<chr>,<int>,<chr>
,0,I_BAD

Variable,Event,CutOff,TP,FP,FN,TN,Sensitivity,Specificity,KS,⋯,F_HALF,FPR,ACC,FDR,F1,C,Gini,Gamma,Tau,MISCEVENT
<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
P_BAD0,0,0.00,4771,1189,0,0,1,0,0,⋯,0.8337702,1,0.8005034,0.1994966,0.8891995,0.6614801,0.3229601,0.6516301,0.1031693,0.1994966
P_BAD0,0,0.01,4771,1189,0,0,1,0,0,⋯,0.8337702,1,0.8005034,0.1994966,0.8891995,0.6614801,0.3229601,0.6516301,0.1031693,0.1994966
P_BAD0,0,0.02,4771,1189,0,0,1,0,0,⋯,0.8337702,1,0.8005034,0.1994966,0.8891995,0.6614801,0.3229601,0.6516301,0.1031693,0.1994966
P_BAD0,0,0.03,4771,1189,0,0,1,0,0,⋯,0.8337702,1,0.8005034,0.1994966,0.8891995,0.6614801,0.3229601,0.6516301,0.1031693,0.1994966
P_BAD0,0,0.04,4771,1189,0,0,1,0,0,⋯,0.8337702,1,0.8005034,0.1994966,0.8891995,0.6614801,0.3229601,0.6516301,0.1031693,0.1994966
P_BAD0,0,0.05,4771,1189,0,0,1,0,0,⋯,0.8337702,1,0.8005034,0.1994966,0.8891995,0.6614801,0.3229601,0.6516301,0.1031693,0.1994966
P_BAD0,0,0.06,4771,1189,0,0,1,0,0,⋯,0.8337702,1,0.8005034,0.1994966,0.8891995,0.6614801,0.3229601,0.6516301,0.1031693,0.1994966
P_BAD0,0,0.07,4771,1189,0,0,1,0,0,⋯,0.8337702,1,0.8005034,0.1994966,0.8891995,0.6614801,0.3229601,0.6516301,0.1031693,0.1994966
P_BAD0,0,0.08,4771,1189,0,0,1,0,0,⋯,0.8337702,1,0.8005034,0.1994966,0.8891995,0.6614801,0.3229601,0.6516301,0.1031693,0.1994966
P_BAD0,0,0.09,4771,1189,0,0,1,0,0,⋯,0.8337702,1,0.8005034,0.1994966,0.8891995,0.6614801,0.3229601,0.6516301,0.1031693,0.1994966

NOBS,ASE,DIV,RASE,MCE,MCLL
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
5960,0.1561412,5960,0.395147,0.1994966,0.4890773

Parameter,Value
<chr>,<chr>
Model Type,Gradient Boosting Tree
Tuner Objective Function,Misclassification
Search Method,GRID
Number of Grid Points,16
Maximum Tuning Time in Seconds,36000
Validation Type,Cross-Validation
Num Folds in Cross-Validation,5
Log Level,0
Seed,796697333
Number of Parallel Evaluations,4

Evaluation,M,LEARNINGRATE,SUBSAMPLERATE,LASSO,RIDGE,MAXLEVEL,MeanConseqError,EvaluationTime
<int>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<int>,<dbl>,<dbl>
0,1,0.1,0.5,0.0,1,5,0.1994966,1.482672
4,1,0.1,0.6,0.0,0,7,0.1993287,3.180527
3,1,0.1,0.6,0.5,0,5,0.1993622,2.722858
1,1,0.05,0.6,0.5,0,7,0.1994631,2.648849
2,1,0.05,0.8,0.5,0,5,0.1994966,3.312076
5,1,0.1,0.8,0.5,0,5,0.1994966,2.133869
6,1,0.1,0.8,0.0,0,7,0.1994966,2.099159
8,1,0.1,0.6,0.5,0,7,0.1994966,1.851147
11,1,0.1,0.8,0.5,0,7,0.1994966,2.354108
12,1,0.05,0.8,0.0,0,5,0.1994966,2.317221

Iteration,Evaluations,Best_obj,Time_sec
<int>,<int>,<dbl>,<dbl>
0,1,0.1994966,1.482672
1,17,0.1993287,10.47585

Evaluation,Iteration,M,LEARNINGRATE,SUBSAMPLERATE,LASSO,RIDGE,MAXLEVEL,MeanConseqError,EvaluationTime
<int>,<int>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<int>,<dbl>,<dbl>
0,0,1,0.1,0.5,0.0,1,5,0.1994966,1.482672
1,1,1,0.05,0.6,0.5,0,7,0.1994631,2.648849
2,1,1,0.05,0.8,0.5,0,5,0.1994966,3.312076
3,1,1,0.1,0.6,0.5,0,5,0.1993622,2.722858
4,1,1,0.1,0.6,0.0,0,7,0.1993287,3.180527
5,1,1,0.1,0.8,0.5,0,5,0.1994966,2.133869
6,1,1,0.1,0.8,0.0,0,7,0.1994966,2.099159
7,1,1,0.1,0.8,0.0,0,5,0.19953,1.99389
8,1,1,0.1,0.6,0.5,0,7,0.1994966,1.851147
9,1,1,0.05,0.6,0.0,0,7,0.19953,2.305925

Parameter,Name,Value
<chr>,<chr>,<chr>
Evaluation,Evaluation,4.0
Number of Variables to Try,M,1.0
Learning Rate,LEARNINGRATE,0.1
Sampling Rate,SUBSAMPLERATE,0.6
Lasso,LASSO,0.0
Ridge,RIDGE,0.0
Maximum Tree Levels,MAXLEVEL,7.0
Misclassification,Objective,0.1993286897

Parameter,Value
<chr>,<dbl>
Initial Configuration Objective Value,0.1994966
Best Configuration Objective Value,0.1993287
Worst Configuration Objective Value,0.199631
Initial Configuration Evaluation Time in Seconds,1.482672
Best Configuration Evaluation Time in Seconds,2.7723918
Number of Improved Configurations,3.0
Number of Evaluated Configurations,17.0
Total Tuning Time in Seconds,10.6326342
Parallel Tuning Speedup,3.4214151

Task,Time_sec,Time_percent
<chr>,<dbl>,<dbl>
Model Training,29.3729393,80.742235
Model Scoring,6.291065,17.293286
Total Objective Evaluations,35.6815948,98.083875
Tuner,0.6970606,1.916125
Total CPU Time,36.3786554,100.0

Hyperparameter,RelImportance
<chr>,<dbl>
LEARNINGRATE,1.0
LASSO,0.31008052
SUBSAMPLERATE,0.2314095
MAXLEVEL,0.08785053
M,0.0
RIDGE,0.0

CAS_Library,Name,Rows,Columns
<chr>,<chr>,<int>,<int>
CASUSER(SASDEMO),ASTORE_OUT_R_dtree_3,15,50

Descr,Value
<chr>,<dbl>
Number of Trees,150.0
Distribution,2.0
Learning Rate,0.1
Subsampling Rate,0.8
Number of Selected Variables (M),4.0
Number of Bins,77.0
Number of Variables,4.0
Max Number of Tree Nodes,103.0
Min Number of Tree Nodes,47.0
Max Number of Branches,2.0

Progress,Metric
<dbl>,<dbl>
1,0.1994966
2,0.1994966
3,0.1994966
4,0.1988255
5,0.1731544
6,0.1563758
7,0.139094
8,0.1333893
9,0.1312081
10,0.1307047

Descr,Value
<chr>,<chr>
Number of Observations Read,5960.0
Number of Observations Used,5960.0
Misclassification Error (%),11.627516779

TreeID,Trees,NLeaves,MCR,LogLoss,ASE,RASE,MAXAE
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
0,1,42,0.1994966,0.4601767,0.14605236,0.382168,0.8197072
1,2,83,0.1994966,0.4329927,0.13575179,0.3684451,0.8343021
2,3,131,0.1994966,0.411031,0.12709593,0.3565052,0.8483222
3,4,181,0.1988255,0.3942918,0.12042337,0.3470207,0.8601935
4,5,229,0.1731544,0.3808748,0.11510621,0.3392731,0.8728619
5,6,278,0.1563758,0.3695443,0.11069534,0.3327091,0.8826373
6,7,324,0.139094,0.3597853,0.10699658,0.3271033,0.893069
7,8,372,0.1333893,0.3514781,0.10389526,0.3223279,0.9023394
8,9,417,0.1312081,0.3441044,0.10125893,0.3182121,0.9100018
9,10,466,0.1307047,0.3381804,0.09916642,0.314907,0.9152618

LEVNAME,LEVINDEX,VARNAME
<chr>,<int>,<chr>
1,0,P_BAD1
0,1,P_BAD0

LEVNAME,LEVINDEX,VARNAME
<chr>,<int>,<chr>
,0,I_BAD

Variable,Event,CutOff,TP,FP,FN,TN,Sensitivity,Specificity,KS,⋯,F_HALF,FPR,ACC,FDR,F1,C,Gini,Gamma,Tau,MISCEVENT
<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
P_BAD0,0,0.00,4771,1189,0,0,1.0000000,0.00000000,0,⋯,0.8337702,1.0000000,0.8005034,0.1994966,0.8891995,0.8915153,0.7830305,0.8022346,0.2501384,0.1994966
P_BAD0,0,0.01,4771,1189,0,0,1.0000000,0.00000000,0,⋯,0.8337702,1.0000000,0.8005034,0.1994966,0.8891995,0.8915153,0.7830305,0.8022346,0.2501384,0.1994966
P_BAD0,0,0.02,4771,1175,0,14,1.0000000,0.01177460,0,⋯,0.8354054,0.9882254,0.8028523,0.1976118,0.8903611,0.8915153,0.7830305,0.8022346,0.2501384,0.1971477
P_BAD0,0,0.03,4771,1148,0,41,1.0000000,0.03448276,0,⋯,0.8385770,0.9655172,0.8073826,0.1939517,0.8926099,0.8915153,0.7830305,0.8022346,0.2501384,0.1926174
P_BAD0,0,0.04,4771,1114,0,75,1.0000000,0.06307822,0,⋯,0.8426053,0.9369218,0.8130872,0.1892948,0.8954580,0.8915153,0.7830305,0.8022346,0.2501384,0.1869128
P_BAD0,0,0.05,4771,1089,0,100,1.0000000,0.08410429,0,⋯,0.8455921,0.9158957,0.8172819,0.1858362,0.8975637,0.8915153,0.7830305,0.8022346,0.2501384,0.1827181
P_BAD0,0,0.06,4771,1069,0,120,1.0000000,0.10092515,0,⋯,0.8479969,0.8990749,0.8206376,0.1830479,0.8992555,0.8915153,0.7830305,0.8022346,0.2501384,0.1793624
P_BAD0,0,0.07,4771,1054,0,135,1.0000000,0.11354079,0,⋯,0.8498094,0.8864592,0.8231544,0.1809442,0.9005285,0.8915153,0.7830305,0.8022346,0.2501384,0.1768456
P_BAD0,0,0.08,4771,1045,0,144,1.0000000,0.12111018,0,⋯,0.8509007,0.8788898,0.8246644,0.1796768,0.9012940,0.8915153,0.7830305,0.8022346,0.2501384,0.1753356
P_BAD0,0,0.09,4771,1027,0,162,1.0000000,0.13624895,0,⋯,0.8530916,0.8637511,0.8276846,0.1771300,0.9028290,0.8915153,0.7830305,0.8022346,0.2501384,0.1723154

NOBS,ASE,DIV,RASE,MCE,MCLL
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
5960,0.08653991,5960,0.2941767,0.1162752,0.2937926

Parameter,Value
<chr>,<chr>
Model Type,Gradient Boosting Tree
Tuner Objective Function,Misclassification
Search Method,GRID
Number of Grid Points,16
Maximum Tuning Time in Seconds,36000
Validation Type,Cross-Validation
Num Folds in Cross-Validation,5
Log Level,0
Seed,796703972
Number of Parallel Evaluations,4

Evaluation,M,LEARNINGRATE,SUBSAMPLERATE,LASSO,RIDGE,NBINS,MAXLEVEL,MeanConseqError,EvaluationTime
<int>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<int>,<int>,<dbl>,<dbl>
0,4,0.1,0.5,0.0,1,50,5,0.1994966,1.733289
7,4,0.1,0.8,0.0,0,77,7,0.1195037,23.40294
10,4,0.1,0.6,0.0,0,77,7,0.1238279,30.075051
13,4,0.1,0.6,0.5,0,77,7,0.126197,15.179399
8,4,0.1,0.8,0.5,0,77,7,0.1270112,19.644942
2,4,0.1,0.6,0.5,0,77,5,0.1313331,10.599564
3,4,0.1,0.8,0.5,0,77,5,0.1323621,11.85415
14,4,0.1,0.8,0.0,0,77,5,0.1327181,11.864749
9,4,0.1,0.6,0.0,0,77,5,0.135806,14.500616
12,4,0.05,0.6,0.5,0,77,5,0.1993289,2.844246

Iteration,Evaluations,Best_obj,Time_sec
<int>,<int>,<dbl>,<dbl>
0,1,0.1994966,1.733289
1,17,0.1195037,46.78629

Evaluation,Iteration,M,LEARNINGRATE,SUBSAMPLERATE,LASSO,RIDGE,NBINS,MAXLEVEL,MeanConseqError,EvaluationTime
<int>,<int>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<int>,<int>,<dbl>,<dbl>
0,0,4,0.1,0.5,0.0,1,50,5,0.1994966,1.733289
1,1,4,0.05,0.6,0.5,0,77,7,0.1994631,4.547067
2,1,4,0.1,0.6,0.5,0,77,5,0.1313331,10.599564
3,1,4,0.1,0.8,0.5,0,77,5,0.1323621,11.85415
4,1,4,0.05,0.8,0.0,0,77,5,0.1994966,5.601267
5,1,4,0.05,0.8,0.0,0,77,7,0.199631,4.342541
6,1,4,0.05,0.6,0.0,0,77,7,0.199631,4.621323
7,1,4,0.1,0.8,0.0,0,77,7,0.1195037,23.40294
8,1,4,0.1,0.8,0.5,0,77,7,0.1270112,19.644942
9,1,4,0.1,0.6,0.0,0,77,5,0.135806,14.500616

Parameter,Name,Value
<chr>,<chr>,<chr>
Evaluation,Evaluation,7.0
Number of Variables to Try,M,4.0
Learning Rate,LEARNINGRATE,0.1
Sampling Rate,SUBSAMPLERATE,0.8
Lasso,LASSO,0.0
Ridge,RIDGE,0.0
Number of Bins,NBINS,77.0
Maximum Tree Levels,MAXLEVEL,7.0
Misclassification,Objective,0.11950366

Parameter,Value
<chr>,<dbl>
Initial Configuration Objective Value,0.1994966
Best Configuration Objective Value,0.1195037
Worst Configuration Objective Value,0.199631
Initial Configuration Evaluation Time in Seconds,1.733289
Best Configuration Evaluation Time in Seconds,23.4029281
Number of Improved Configurations,4.0
Number of Evaluated Configurations,17.0
Total Tuning Time in Seconds,49.601222
Parallel Tuning Speedup,3.4308716

Task,Time_sec,Time_percent
<chr>,<dbl>,<dbl>
Model Training,159.235996,93.571676
Model Scoring,9.182404,5.395846
Total Objective Evaluations,168.435163,98.977372
Tuner,1.740262,1.022628
Total CPU Time,170.175424,100.0

CAS_Library,Name,Rows,Columns
<chr>,<chr>,<int>,<int>
CASUSER(SASDEMO),ASTORE_OUT_R_gradBoost_2,1,2

Hyperparameter,RelImportance
<chr>,<dbl>
LEARNINGRATE,1.0
MAXLEVEL,0.016590887
SUBSAMPLERATE,0.005235429
LASSO,0.002886346
M,0.0
RIDGE,0.0
NBINS,0.0

casLib,Name,Rows,Columns
<chr>,<chr>,<dbl>,<dbl>
CASUSER(sasdemo),PIPELINE_OUT_R,10,15
CASUSER(sasdemo),TRANSFORMATION_OUT_R,17,21
CASUSER(sasdemo),FEATURE_OUT_R,23,15
CASUSER(sasdemo),ASTORE_OUT_R_fm_,1,2
CASUSER(sasdemo),ASTORE_OUT_R_dtree_1,527,52
CASUSER(sasdemo),ASTORE_OUT_R_gradBoost_1,1,2
CASUSER(sasdemo),ASTORE_OUT_R_gradBoost_2,1,2
CASUSER(sasdemo),ASTORE_OUT_R_dtree_2,621,38
CASUSER(sasdemo),ASTORE_OUT_R_dtree_3,15,50


In [30]:
cas.table.fetch(server, table = list(name = "TRANSFORMATION_OUT_R"))

_Index_,FTGPipelineId,Name,NVariables,IsInteraction,ImputeMethod,OutlierMethod,OutlierTreat,OutlierArgs,FunctionMethod,⋯,MapIntervalArgs,HashMethod,HashArgs,DateTimeMethod,DiscretizeMethod,DiscretizeArgs,CatTransMethod,CatTransArgs,InteractionMethod,InteractionSynthesizer
<int>,<dbl>,<chr>,<dbl>,<chr>,<chr>,<chr>,<chr>,<dbl>,<chr>,⋯,<dbl>,<chr>,<dbl>,<chr>,<chr>,<dbl>,<chr>,<dbl>,<chr>,<chr>
1,1,miss_ind,3,,,,,,,⋯,,MissIndicator,2.0,,,,Label (Sparse One-Hot),,,
2,2,hc_tar_frq_rat,1,,,,,,,⋯,10.0,,,,,,,,,
3,3,hc_lbl_cnt,1,,,,,,,⋯,0.0,,,,,,,,,
4,4,hc_cnt,1,,,,,,,⋯,0.0,,,,,,,,,
5,5,hc_cnt_log,1,,,,,,Log,⋯,0.0,,,,,,,,,
6,6,lcnhenhi_grp_rare,2,,,,,,,⋯,,,,,,,Group Rare,5.0,,
7,7,lcnhenhi_dtree5,2,,,,,,,⋯,,,,,,,DTree,5.0,,
8,8,lcnhenhi_dtree10,2,,,,,,,⋯,,,,,,,DTree,10.0,,
9,9,hk_yj_n2,1,,Median,,,,Yeo-Johnson,⋯,,,,,,,,,,
10,10,hk_yj_n1,1,,Median,,,,Yeo-Johnson,⋯,,,,,,,,,,


In [31]:
cas.table.fetch(server, table = list(name = "FEATURE_OUT_R"))

_Index_,FeatureId,Name,IsNominal,FTGPipelineId,NInputs,InputVar1,InputVar2,InputVar3,Label,RankCrit,BestTransRank,GlobalIntervalRank,GlobalNominalRank,GlobalRank,IsGenerated
<int>,<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<chr>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,1,cpy_int_med_imp_DEBTINC,0,16,1,DEBTINC,,,DEBTINC: Low missing rate - median imputation,0.086482795,1,1.0,,4,1
2,2,hk_dtree_disct10_DEBTINC,1,15,1,DEBTINC,,,DEBTINC: High kurtosis - ten bin decision tree binning,0.10237428,3,,3.0,3,0
3,3,hk_dtree_disct5_DEBTINC,1,14,1,DEBTINC,,,DEBTINC: High kurtosis - five bin decision tree binning,0.129960492,2,,2.0,2,1
4,4,hk_yj_0_DEBTINC,0,11,1,DEBTINC,,,DEBTINC: High kurtosis - Yeo-Johnson(lambda=0) + impute(median),0.080955282,3,3.0,,6,0
5,5,hk_yj_n1_DEBTINC,0,10,1,DEBTINC,,,DEBTINC: High kurtosis - Yeo-Johnson(lambda=-1) + impute(median),0.060570668,4,4.0,,9,0
6,6,hk_yj_n2_DEBTINC,0,9,1,DEBTINC,,,DEBTINC: High kurtosis - Yeo-Johnson(lambda=-2) + impute(median),0.007162127,6,10.0,,17,0
7,7,hk_yj_p1_DEBTINC,0,12,1,DEBTINC,,,DEBTINC: High kurtosis - Yeo-Johnson(lambda=1) + impute(median),0.086482795,1,1.0,,4,1
8,8,hk_yj_p2_DEBTINC,0,13,1,DEBTINC,,,DEBTINC: High kurtosis - Yeo-Johnson(lambda=2) + impute(median),0.044039344,5,5.0,,12,0
9,9,miss_ind_DEBTINC,1,1,1,DEBTINC,,,DEBTINC: Significant missing - missing indicator,0.251609647,1,,1.0,1,1
10,10,cpy_nom_miss_lev_lab_DELINQ,1,17,1,DELINQ,,,DELINQ: Low missing rate - missing level,0.06842977,1,,4.0,7,1


In [32]:
cas.table.fetch(server, table = list(name = "PIPELINE_OUT_R"))

_Index_,PipelineId,ModelType,MLType,Objective,ObjectiveType,Target,NFeatures,Feat1Id,Feat1IsNom,Feat2Id,Feat2IsNom,Feat3Id,Feat3IsNom,Feat4Id,Feat4IsNom
<int>,<dbl>,<chr>,<chr>,<dbl>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,1,binary classification,dtree,0.1194631,MCE,BAD,4,10,1,15.0,1.0,9.0,1.0,23.0,0.0
2,6,binary classification,gradBoost,0.1202823,MCE,BAD,4,13,1,18.0,1.0,3.0,1.0,23.0,0.0
3,2,binary classification,gradBoost,0.1241641,MCE,BAD,4,10,1,15.0,1.0,9.0,1.0,23.0,0.0
4,5,binary classification,dtree,0.1275376,MCE,BAD,4,13,1,18.0,1.0,3.0,1.0,23.0,0.0
5,7,binary classification,dtree,0.1803706,MCE,BAD,1,10,1,,,,,,
6,9,binary classification,dtree,0.1825513,MCE,BAD,1,13,1,,,,,,
7,8,binary classification,gradBoost,0.1852033,MCE,BAD,1,10,1,,,,,,
8,3,binary classification,dtree,0.1862319,MCE,BAD,1,13,1,,,,,,
9,4,binary classification,gradBoost,0.1993287,MCE,BAD,1,13,1,,,,,,
10,10,binary classification,gradBoost,0.1993287,MCE,BAD,1,13,1,,,,,,
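
Choosing a champion and a challenger from this table comes down to sorting by the objective. The sketch below does that on a local data frame whose values are copied from the first four rows of the output above; building the data frame by hand, rather than from the fetched CAS table, is a simplification made for this example.

```r
# Rows copied from the first four lines of the PIPELINE_OUT_R output above.
pipelines <- data.frame(
  PipelineId = c(1, 6, 2, 5),
  MLType     = c("dtree", "gradBoost", "gradBoost", "dtree"),
  Objective  = c(0.1194631, 0.1202823, 0.1241641, 0.1275376),
  stringsAsFactors = FALSE
)

# The objective is MCE, so lower is better: sort ascending and take the top two.
ranked     <- pipelines[order(pipelines$Objective), ]
champion   <- ranked[1, ]
challenger <- ranked[2, ]

print(champion)
print(challenger)
```

Here the decision tree in pipeline 1 is the champion and the gradient boosting model in pipeline 6 is the challenger, with a gap of less than 0.001 in MCE.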


***
## Conclusion

The dataSciencePilot action set consists of actions that implement a policy-based, configurable, and scalable approach to automating data science workflows. This action set can be used to automate an end-to-end workflow or to automate steps in the workflow such as data preparation, feature preprocessing, feature engineering, feature selection, and hyperparameter tuning. In this notebook, we demonstrated how to use each step of the dataSciencePilot action set from an R programmer's interface.

In [33]:
cas.close(server)