## CAS Connection

### Connect to the Cas Server

In [1]:
import swat
s = swat.CAS(host, port)
s.session.setLocale(locale="en_US") 
s.sessionProp.setSessOpt(timeout=864000)

# Document Classification
## Part 3: Build Lean Models using Feature Importance
In this notebook, you will quantify feature importance to determine which features are most important to the models trained in Part 2. Three methods are presented: a standard method, split-based Gini feature importance which is built into the decisionTree action set, and two new methods (patent pending) which consider the network structure of the tree based models. These methods are called Betweenness Centrality Feature Importance and Leaf Based Feature Importance.

# Load Data

The Cora data set is publicly available via [this hyperlink](https://linqs.soe.ucsc.edu/data).

In [2]:
import document_classification_scripts as scripts
import importlib
importlib.reload(scripts)
from document_classification_scripts import AttributeDict, nClasses, nWords, targetColumn, baseFeatureList
demo = scripts.Demo(s)

NOTE: Added action set 'sampling'.
NOTE: Added action set 'pca'.
NOTE: Added action set 'fedsql'.
NOTE: Added action set 'deepLearn'.
NOTE: Added action set 'network'.
NOTE: Added action set 'transpose'.
NOTE: Added action set 'table'.
NOTE: Added action set 'builtins'.
NOTE: Added action set 'neuralNet'.
NOTE: Added action set 'autotune'.
NOTE: Added action set 'session'.
NOTE: Added action set 'decisionTree'.
NOTE: Added action set 'aStore'.
NOTE: Added action set 'aggregation'.


In [3]:
demo.loadRawData()

NOTE: Cloud Analytic Services made the uploaded file available as table CONTENT in caslib CASUSER(brrees).
NOTE: The table CONTENT has been created in caslib CASUSER(brrees) from binary data uploaded to Cloud Analytic Services.
NOTE: Cloud Analytic Services made the uploaded file available as table CITES in caslib CASUSER(brrees).
NOTE: The table CITES has been created in caslib CASUSER(brrees) from binary data uploaded to Cloud Analytic Services.


# Data Preprocessing
### Creates a custom format definition for target labels

In [4]:
demo.defineTargetVariableFormat()

NOTE: Format library MYFMTLIB added. Format search update using parameter APPEND completed.


### Partitions data into training and test

In [5]:
demo.loadOrPartitionData()

NOTE: Cloud Analytic Services added the caslib 'cora'.
NOTE: Cloud Analytic Services made the file contentPartitioned.sashdat available as table CONTENTPARTITIONED in caslib CASUSER(brrees).
NOTE: Cloud Analytic Services made the file contentTrain.sashdat available as table CONTENTTRAIN in caslib CASUSER(brrees).
NOTE: Cloud Analytic Services made the file contentTest.sashdat available as table CONTENTTEST in caslib CASUSER(brrees).


### Performs Principal Component Analysis (PCA)

In [6]:
nPca = 40
demo.performPca(nPca)
pcaFeatureList = [f"pca{i}" for i in range(1,nPca)]



### Joins citations and training data targets

In [7]:
demo.joinTrainingTargets()

NOTE: Table CITESTRAIN was created in caslib CASUSER(brrees) with 3562 rows returned.
NOTE: Table CITESCOMBINED was created in caslib CASUSER(brrees) with 5429 rows returned.


## Generate Network Features

In [8]:
%%capture
networkParam=AttributeDict({
    "useCentrality":True,
    "useNodeSimilarity":True,
    "useCommunity":True,
    "useCore":True
})

tableContentNetwork, networkFeatureList = demo.addNetworkFeatures(
    "contentTrain", "citesTrain", networkParam)
tableContentPartitionedNetwork, networkFeatureList = demo.addNetworkFeatures(
    "contentPartitioned", "citesCombined", networkParam)

tableContentNetworkPca, networkFeatureList = demo.addNetworkFeatures(
    "contentTrainPca", "citesTrain", networkParam)
tableContentPartitionedNetworkPca, networkFeatureList = demo.addNetworkFeatures(
    "contentPartitionedPca", "citesCombined", networkParam)

In [9]:
s.datastep.runCode(
    code = f"data contentTestNetwork; set {tableContentPartitionedNetwork}(where=(partition=0)); run;"
)
print(f"contentTestNetwork: (rows, cols) = {s.CASTable('contentTestNetwork').shape}")

s.datastep.runCode(
    code = f"data contentTestPcaNetwork; set {tableContentPartitionedNetworkPca}(where=(partition=0)); run;"
)
print(f"contentTestPcaNetwork: (rows, cols) = {s.CASTable('contentTestPcaNetwork').shape}")

The history saving thread hit an unexpected error (DatabaseError('database disk image is malformed',)).History will not be written to the database.
contentTestNetwork: (rows, cols) = (542, 1485)
contentTestPcaNetwork: (rows, cols) = (542, 92)


# Load the Autotuned Network+PCA Model (trained in Part 2)

Here we load the best hyperparameter configuration found by autotune, for the feature set using PCA and Network features. We again train a forest model using these hyperparameters in order to examine the feature importance values computed by the forestTrain action in the decisionTree action set. 

In [10]:
networkPcaModelAuto = "networkPcaModelAuto"

In [11]:
bestConfig = demo.loadOrTuneForestModel(networkPcaModelAuto,
                           "contentTrainPcaNetwork",
                           pcaFeatureList + networkFeatureList
                          )
print(bestConfig)

NOTE: Cloud Analytic Services made the file networkPcaModelAutoAStore.sashdat available as table NETWORKPCAMODELAUTOASTORE in caslib CASUSER(brrees).
NOTE: Cloud Analytic Services made the file networkPcaModelAuto.sashdat available as table NETWORKPCAMODELAUTO in caslib CASUSER(brrees).
Best Configuration

                                     Parameter       Value
Name                                                      
Evaluation                          Evaluation           9
NTREE                          Number of Trees          63
M                   Number of Variables to Try          66
BOOTSTRAP                            Bootstrap  0.72222222
MAXLEVEL                   Maximum Tree Levels          21
NBINS                           Number of Bins          47
Objective   Misclassification Error Percentage        9.06


## Train PCA Forest Model

In [12]:
%%time
resultsTrainNetworkPcaModelAuto = demo.trainForestModel(networkPcaModelAuto,
                 "contentTrainPcaNetwork",
                 pcaFeatureList + networkFeatureList,
                 forestParam=bestConfig)

NOTE: 6755422 bytes were written to the table "networkPcaModelAutoAStore" in the caslib "CASUSER(brrees)".
CPU times: user 15.6 ms, sys: 0 ns, total: 15.6 ms
Wall time: 12.5 s


# View Gini (Split Based) Feature Importances

In [13]:
topNCutoff=12

In [14]:
resultsTrainNetworkPcaModelAuto['DTreeVarImpInfo'].head(topNCutoff)

Unnamed: 0,Variable,Importance,Std
0,core_out_Genetic_Algorithms,202.501823,83.936062
1,core_out_Probabilistic_Methods,191.616565,94.22554
2,core_out_Theory,157.864192,74.408969
3,core_out_Case_Based,144.608804,70.726945
4,core_out_Reinforcement_Learning,121.608562,54.491919
5,core_out_Neural_Networks,102.232383,43.44342
6,core_out_Rule_Learning,97.366988,45.322815
7,pca3,16.903084,3.368072
8,deg_in_Genetic_Algorithms,10.87443,19.279559
9,pca1,10.382247,2.518907


# New Methods for Feature Importance Calculation

The following function calls include prototype python implementations of two new alternative methods to the commonly used methods for calculating feature importance. Either of the three methods (split based feature importance above, or the two methods below) can be used to determine a smaller feature set to use for the next iteration of model building.

## Calculate and View Betweenness Feature Importances (patent pending)

In [15]:
demo.calculateBetweennessImportance(networkPcaModelAuto, casOut="betweennessImportances")

NOTE: The number of nodes in the input graph is 8097.
NOTE: The number of links in the input graph is 8034.
NOTE: Processing centrality metrics.
NOTE: Processing centrality metrics used 0.04 (cpu: 0.64) seconds.


In [16]:
s.CASTable("betweennessImportances").nlargest(topNCutoff,"betweenImportance")

Unnamed: 0,Variable,betweenImportance
0,core_out_Neural_Networks,0.016174
1,core_out_Theory,0.013138
2,core_out_Reinforcement_Learning,0.011181
3,core_out_Probabilistic_Methods,0.010852
4,pca3,0.00836
5,core_out_Case_Based,0.00834
6,core_out_Rule_Learning,0.008216
7,deg_in_Genetic_Algorithms,0.00613
8,pca2,0.005769
9,pca1,0.005418


## Calculate and View Leaf Based Feature Importances (patent pending)

In [17]:
classes = [
    "Rule_Learning",
    "Theory",
    "Genetic_Algorithms",
    "Reinforcement_Learning",
    "Case_Based",
    "Neural_Networks",
    "Probabilistic_Methods"
]

In [18]:
leafBasedImportances = demo.leafBasedImportances(networkPcaModelAuto,
                 "contentTrainPcaNetwork",
                 pcaFeatureList + networkFeatureList,
                 classes
                )

tree 1 of 63
tree 2 of 63
tree 3 of 63
tree 4 of 63
tree 5 of 63
tree 6 of 63
tree 7 of 63
tree 8 of 63
tree 9 of 63
tree 10 of 63
tree 11 of 63
tree 12 of 63
tree 13 of 63
tree 14 of 63
tree 15 of 63
tree 16 of 63
tree 17 of 63
tree 18 of 63
tree 19 of 63
tree 20 of 63
tree 21 of 63
tree 22 of 63
tree 23 of 63
tree 24 of 63
tree 25 of 63
tree 26 of 63
tree 27 of 63
tree 28 of 63
tree 29 of 63
tree 30 of 63
tree 31 of 63
tree 32 of 63
tree 33 of 63
tree 34 of 63
tree 35 of 63
tree 36 of 63
tree 37 of 63
tree 38 of 63
tree 39 of 63
tree 40 of 63
tree 41 of 63
tree 42 of 63
tree 43 of 63
tree 44 of 63
tree 45 of 63
tree 46 of 63
tree 47 of 63
tree 48 of 63
tree 49 of 63
tree 50 of 63
tree 51 of 63
tree 52 of 63
tree 53 of 63
tree 54 of 63
tree 55 of 63
tree 56 of 63
tree 57 of 63
tree 58 of 63
tree 59 of 63
tree 60 of 63
tree 61 of 63
tree 62 of 63
tree 63 of 63


In [19]:
rankedFeaturesLeafBased = scripts.printImportances(leafBasedImportances, topNCutoff)

              core_out_Rule_Learning:  28290.745613
         core_out_Genetic_Algorithms:  27315.230665
                 core_out_Case_Based:  21857.950878
     core_out_Reinforcement_Learning:  11610.907497
      core_out_Probabilistic_Methods:  8518.756256
                     core_out_Theory:  4120.936694
           deg_in_Genetic_Algorithms:  3317.974407
            core_out_Neural_Networks:  2512.329228
                deg_in_Rule_Learning:   720.522850
                                pca3:   502.303616
                                pca2:   414.799033
                   deg_in_Case_Based:   208.073518


In [20]:
rankedFeaturesSplitBased = resultsTrainNetworkPcaModelAuto['DTreeVarImpInfo']['Variable'].tolist()[0:topNCutoff]
rankedFeaturesBetweenness = s.CASTable("betweennessImportances").nlargest(topNCutoff,"betweenImportance")["Variable"].tolist()

Note that split based Gini feature importance and Betweenness feature importance produce the same set of top 12 features.

Leaf Based feature importance, on the other hand includes two features (generated from network in-degree) and excludes two features (from PCA) in its top 12:

In [21]:
a=set(rankedFeaturesLeafBased) - set(rankedFeaturesBetweenness)
b=set(rankedFeaturesBetweenness) - set(rankedFeaturesLeafBased)
print(f"""In Leaf Based Top {topNCutoff}, but not Betweenness Top {topNCutoff}:
   {a}""")
print(f"""In Betweenness Top {topNCutoff}, but not Leaf Based Top {topNCutoff}:
   {b}""")

a=set(rankedFeaturesLeafBased) - set(rankedFeaturesSplitBased)
b=set(rankedFeaturesSplitBased) - set(rankedFeaturesLeafBased)
print(f"""
In Leaf Based Top {topNCutoff}, but not Split Based Top {topNCutoff}:
   {a}""")
print(f"""In Split Based Top {topNCutoff}, but not Leaf Based Top {topNCutoff}:
   {b}""")

a=set(rankedFeaturesSplitBased) - set(rankedFeaturesBetweenness)
b=set(rankedFeaturesBetweenness) - set(rankedFeaturesSplitBased)
print(f"""
In Split Based Top {topNCutoff}, but not Betweenness Top {topNCutoff}:
   {a}""")
print(f"""In Betweenness Top {topNCutoff}, but not Split Based Top {topNCutoff}:
   {b}""")

In Leaf Based Top 12, but not Betweenness Top 12:
   {'deg_in_Case_Based', 'deg_in_Rule_Learning'}
In Betweenness Top 12, but not Leaf Based Top 12:
   {'pca10', 'pca1'}

In Leaf Based Top 12, but not Split Based Top 12:
   {'deg_in_Case_Based', 'deg_in_Rule_Learning'}
In Split Based Top 12, but not Leaf Based Top 12:
   {'pca10', 'pca1'}

In Split Based Top 12, but not Betweenness Top 12:
   set()
In Betweenness Top 12, but not Split Based Top 12:
   set()


# Build models using only top N features

The best-performing models from Part 1 and Part 2 use a total of 85 features -- 40 PCA features and 45 Network features.

Can we achieve similar model performance by using only the 12 most important features?

# First, try the top 12 features by Split Based Feature Importance

In [22]:
topNFeatureList = rankedFeaturesSplitBased
topNFeatureList

['core_out_Genetic_Algorithms',
 'core_out_Probabilistic_Methods',
 'core_out_Theory',
 'core_out_Case_Based',
 'core_out_Reinforcement_Learning',
 'core_out_Neural_Networks',
 'core_out_Rule_Learning',
 'pca3',
 'deg_in_Genetic_Algorithms',
 'pca1',
 'pca2',
 'pca10']

## Train Neural Net Model Using Top N Split Based Features

In [23]:
deepLearnParam = AttributeDict({
    "randomSeed": 1337,
    "dropout": 0.5,
    "activation": "RECTIFIER",
    "outputActivation": "SOFTMAX",
    "denseLayers": [50, 50],
    "nOutputs": nClasses,
    "nEpochs": 100,
    "algoMethod": "ADAM",
    "useLocking": False
})

In [24]:
topNNnModel = "topNNnModelSplit"
demo.defineNnModel(topNNnModel, deepLearnParam)

In [25]:
%%time
demo.trainNnModel(topNNnModel,"contentTrainPcaNetwork", topNFeatureList, deepLearnParam)

CPU times: user 46.9 ms, sys: 0 ns, total: 46.9 ms
Wall time: 1.15 s


Unnamed: 0,Descr,Value
0,Model Name,topnnnmodelsplit
1,Model Type,Deep Neural Network
2,Number of Layers,4
3,Number of Input Layers,1
4,Number of Output Layers,1
5,Number of Fully Connected Layers,2
6,Number of Weight Parameters,3450
7,Number of Bias Parameters,107
8,Total Number of Model Parameters,3557
9,Approximate Memory Cost for Training (MB),1

Unnamed: 0,Epoch,LearningRate,Loss,FitError
0,0.0,0.001,1.795905,0.674977
1,1.0,0.001,1.539395,0.554017
2,2.0,0.001,1.346250,0.465836
3,3.0,0.001,1.245496,0.408587
4,4.0,0.001,1.143274,0.385042
5,5.0,0.001,1.044882,0.341182
6,6.0,0.001,1.002767,0.338412
7,7.0,0.001,0.941305,0.314404
8,8.0,0.001,0.895646,0.303786
9,9.0,0.001,0.865053,0.276085

Unnamed: 0,casLib,Name,Rows,Columns,casTable
0,CASUSER(brrees),topNNnModelSplitWeights,3557,3,"CASTable('topNNnModelSplitWeights', caslib='CA..."


In [26]:
demo.scoreNnModel(topNNnModel,"contentTestPcaNetwork")


                         Descr         Value
0  Number of Observations Read           542
1  Number of Observations Used           542
2  Misclassification Error (%)      11.99262
3                   Loss Error      0.487157
Accuracy = 0.8800738


0.8800738

### Bootstrap Runs

In [27]:
%%time
accuracies = demo.bootstrapNnModel(topNNnModel,"contentTrainPcaNetwork",
                                   "contentTestPcaNetwork",
                                   topNFeatureList,
                                   deepLearnParam,
                                   25
                                  );

NOTE: Simple Random Sampling is in effect.
NOTE: Using SEED=5678 for sampling.
Accuracy = 0.8790984000000001
NOTE: Simple Random Sampling is in effect.
NOTE: Using SEED=5679 for sampling.
Accuracy = 0.8770492
NOTE: Simple Random Sampling is in effect.
NOTE: Using SEED=5680 for sampling.
Accuracy = 0.8872951
NOTE: Simple Random Sampling is in effect.
NOTE: Using SEED=5681 for sampling.
Accuracy = 0.8709015999999999
NOTE: Simple Random Sampling is in effect.
NOTE: Using SEED=5682 for sampling.
Accuracy = 0.8852459
NOTE: Simple Random Sampling is in effect.
NOTE: Using SEED=5683 for sampling.
Accuracy = 0.8811475
NOTE: Simple Random Sampling is in effect.
NOTE: Using SEED=5684 for sampling.
Accuracy = 0.8811475
NOTE: Simple Random Sampling is in effect.
NOTE: Using SEED=5685 for sampling.
Accuracy = 0.8852459
NOTE: Simple Random Sampling is in effect.
NOTE: Using SEED=5686 for sampling.
Accuracy = 0.8770492
NOTE: Simple Random Sampling is in effect.
NOTE: Using SEED=5687 for sampling.
Acc

## Train Forest Model Using Top N Split Based Features

In [28]:
topNForestModel = "topNForestModelSplit"

In [29]:
%%time
demo.trainForestModel(
    topNForestModel, "contentTrainPcaNetwork", topNFeatureList)

NOTE: 1372882 bytes were written to the table "topNForestModelSplitAStore" in the caslib "CASUSER(brrees)".
CPU times: user 15.6 ms, sys: 0 ns, total: 15.6 ms
Wall time: 301 ms


Unnamed: 0,Descr,Value
0,Number of Trees,50.0
1,Number of Selected Variables (M),4.0
2,Random Number Seed,12345.0
3,Bootstrap Percentage (%),63.212056
4,Number of Bins,50.0
5,Number of Variables,12.0
6,Confidence Level for Pruning,0.25
7,Max Number of Tree Nodes,43.0
8,Min Number of Tree Nodes,25.0
9,Max Number of Branches,2.0

Unnamed: 0,Variable,Importance,Std
0,core_out_Genetic_Algorithms,129.630012,84.690007
1,core_out_Case_Based,108.441181,61.202702
2,core_out_Probabilistic_Methods,101.093844,81.521286
3,core_out_Neural_Networks,100.689179,89.778221
4,core_out_Rule_Learning,70.951737,38.078643
5,core_out_Reinforcement_Learning,56.38613,41.892267
6,core_out_Theory,56.09093,55.52476
7,deg_in_Genetic_Algorithms,15.455629,62.070077
8,pca3,6.644762,11.032325
9,pca1,4.594636,7.703717

Unnamed: 0,casLib,Name,Rows,Columns,casTable
0,CASUSER(brrees),topNForestModelSplit,1724,35,"CASTable('topNForestModelSplit', caslib='CASUS..."


In [30]:
resultsScoreTopNForestModel=demo.scoreForestModel(topNForestModel,"contentTestPcaNetwork")

Accuracy = 0.7915129151291513


### Bootstrap Runs

In [31]:
%%time
accuracies = demo.bootstrapForestModel(topNForestModel,"contentTrainPcaNetwork",
                                       "contentTestPcaNetwork",
                                       topNFeatureList,
                                       n=25
                                      );

NOTE: Simple Random Sampling is in effect.
NOTE: Using SEED=5678 for sampling.
NOTE: 1372882 bytes were written to the table "topNForestModelSplitAStore" in the caslib "CASUSER(brrees)".
Accuracy = 0.7930327868852459
NOTE: Simple Random Sampling is in effect.
NOTE: Using SEED=5679 for sampling.
NOTE: 1375986 bytes were written to the table "topNForestModelSplitAStore" in the caslib "CASUSER(brrees)".
Accuracy = 0.7889344262295082
NOTE: Simple Random Sampling is in effect.
NOTE: Using SEED=5680 for sampling.
NOTE: 1377594 bytes were written to the table "topNForestModelSplitAStore" in the caslib "CASUSER(brrees)".
Accuracy = 0.7889344262295082
NOTE: Simple Random Sampling is in effect.
NOTE: Using SEED=5681 for sampling.
NOTE: 1372866 bytes were written to the table "topNForestModelSplitAStore" in the caslib "CASUSER(brrees)".
Accuracy = 0.7889344262295082
NOTE: Simple Random Sampling is in effect.
NOTE: Using SEED=5682 for sampling.
NOTE: 1374442 bytes were written to the table "topNFo

## Autotune Forest Model Using Top N Split Based Features

In [32]:
topNForestModelAuto = f"topNForestModelAuto{topNCutoff}Split"

In [33]:
%%time
bestConfigTopN = demo.loadOrTuneForestModel(topNForestModelAuto,
                           "contentTrainPcaNetwork",
                           topNFeatureList
                          )
print(bestConfigTopN)

NOTE: Cloud Analytic Services made the file topNForestModelAuto12SplitAStore.sashdat available as table TOPNFORESTMODELAUTO12SPLITASTORE in caslib CASUSER(brrees).
NOTE: Cloud Analytic Services made the file topNForestModelAuto12Split.sashdat available as table TOPNFORESTMODELAUTO12SPLIT in caslib CASUSER(brrees).
Best Configuration

                                     Parameter       Value
Name                                                      
Evaluation                          Evaluation          80
NTREE                          Number of Trees          97
M                   Number of Variables to Try           4
BOOTSTRAP                            Bootstrap  0.20613305
MAXLEVEL                   Maximum Tree Levels          15
NBINS                           Number of Bins          44
Objective   Misclassification Error Percentage        9.22
CPU times: user 31.2 ms, sys: 0 ns, total: 31.2 ms
Wall time: 106 ms


In [34]:
resultsScoreTopNForestModelAuto=demo.scoreForestModel(topNForestModelAuto,"contentTestPcaNetwork")

Accuracy = 0.8634686346863468


### Bootstrap Runs

In [35]:
%%time
accuracies = demo.bootstrapForestModel(topNForestModelAuto,"contentTrainPcaNetwork",
                                       "contentTestPcaNetwork",
                                       topNFeatureList,
                                       bestConfigTopN,
                                       25
                                      );

NOTE: Simple Random Sampling is in effect.
NOTE: Using SEED=5678 for sampling.
NOTE: 4961890 bytes were written to the table "topNForestModelAuto12SplitAStore" in the caslib "CASUSER(brrees)".
Accuracy = 0.8627049180327869
NOTE: Simple Random Sampling is in effect.
NOTE: Using SEED=5679 for sampling.
NOTE: 4958370 bytes were written to the table "topNForestModelAuto12SplitAStore" in the caslib "CASUSER(brrees)".
Accuracy = 0.8668032786885246
NOTE: Simple Random Sampling is in effect.
NOTE: Using SEED=5680 for sampling.
NOTE: 4956810 bytes were written to the table "topNForestModelAuto12SplitAStore" in the caslib "CASUSER(brrees)".
Accuracy = 0.8647540983606558
NOTE: Simple Random Sampling is in effect.
NOTE: Using SEED=5681 for sampling.
NOTE: 4946458 bytes were written to the table "topNForestModelAuto12SplitAStore" in the caslib "CASUSER(brrees)".
Accuracy = 0.8586065573770492
NOTE: Simple Random Sampling is in effect.
NOTE: Using SEED=5682 for sampling.
NOTE: 4933626 bytes were writ

# Now, try the top 12 features by Leaf Based Feature Importance

In [36]:
topNFeatureList = rankedFeaturesLeafBased
topNFeatureList

['core_out_Rule_Learning',
 'core_out_Genetic_Algorithms',
 'core_out_Case_Based',
 'core_out_Reinforcement_Learning',
 'core_out_Probabilistic_Methods',
 'core_out_Theory',
 'deg_in_Genetic_Algorithms',
 'core_out_Neural_Networks',
 'deg_in_Rule_Learning',
 'pca3',
 'pca2',
 'deg_in_Case_Based']

## Train Neural Net Model Using Top N Leaf Based Features

In [37]:
deepLearnParam = AttributeDict({
    "randomSeed": 1337,
    "dropout": 0.5,
    "activation": "RECTIFIER",
    "outputActivation": "SOFTMAX",
    "denseLayers": [50, 50],
    "nOutputs": nClasses,
    "nEpochs": 100,
    "algoMethod": "ADAM",
    "useLocking": False
})

In [38]:
topNNnModel = "topNNnModelLeaf"
demo.defineNnModel(topNNnModel, deepLearnParam)

In [39]:
%%time
demo.trainNnModel(topNNnModel,"contentTrainPcaNetwork", topNFeatureList, deepLearnParam)

CPU times: user 62.5 ms, sys: 0 ns, total: 62.5 ms
Wall time: 1.17 s


Unnamed: 0,Descr,Value
0,Model Name,topnnnmodelleaf
1,Model Type,Deep Neural Network
2,Number of Layers,4
3,Number of Input Layers,1
4,Number of Output Layers,1
5,Number of Fully Connected Layers,2
6,Number of Weight Parameters,3450
7,Number of Bias Parameters,107
8,Total Number of Model Parameters,3557
9,Approximate Memory Cost for Training (MB),1

Unnamed: 0,Epoch,LearningRate,Loss,FitError
0,0.0,0.001,1.799952,0.699908
1,1.0,0.001,1.529331,0.567405
2,2.0,0.001,1.347158,0.467221
3,3.0,0.001,1.210284,0.404894
4,4.0,0.001,1.098087,0.351801
5,5.0,0.001,0.981098,0.311634
6,6.0,0.001,0.917063,0.267775
7,7.0,0.001,0.827239,0.242382
8,8.0,0.001,0.762579,0.228532
9,9.0,0.001,0.712670,0.213296

Unnamed: 0,casLib,Name,Rows,Columns,casTable
0,CASUSER(brrees),topNNnModelLeafWeights,3557,3,"CASTable('topNNnModelLeafWeights', caslib='CAS..."


In [40]:
demo.scoreNnModel(topNNnModel,"contentTestPcaNetwork")


                         Descr         Value
0  Number of Observations Read           542
1  Number of Observations Used           542
2  Misclassification Error (%)      12.73063
3                   Loss Error      0.512408
Accuracy = 0.8726936999999999


0.8726936999999999

### Bootstrap Runs

In [41]:
%%time
accuracies = demo.bootstrapNnModel(topNNnModel,"contentTrainPcaNetwork",
                                   "contentTestPcaNetwork",
                                   topNFeatureList,
                                   deepLearnParam,
                                   25
                                  );

NOTE: Simple Random Sampling is in effect.
NOTE: Using SEED=5678 for sampling.
Accuracy = 0.8709015999999999
NOTE: Simple Random Sampling is in effect.
NOTE: Using SEED=5679 for sampling.
Accuracy = 0.8709015999999999
NOTE: Simple Random Sampling is in effect.
NOTE: Using SEED=5680 for sampling.
Accuracy = 0.8729508
NOTE: Simple Random Sampling is in effect.
NOTE: Using SEED=5681 for sampling.
Accuracy = 0.8647541
NOTE: Simple Random Sampling is in effect.
NOTE: Using SEED=5682 for sampling.
Accuracy = 0.8811475
NOTE: Simple Random Sampling is in effect.
NOTE: Using SEED=5683 for sampling.
Accuracy = 0.8790984000000001
NOTE: Simple Random Sampling is in effect.
NOTE: Using SEED=5684 for sampling.
Accuracy = 0.8729508
NOTE: Simple Random Sampling is in effect.
NOTE: Using SEED=5685 for sampling.
Accuracy = 0.8790984000000001
NOTE: Simple Random Sampling is in effect.
NOTE: Using SEED=5686 for sampling.
Accuracy = 0.8668032999999999
NOTE: Simple Random Sampling is in effect.
NOTE: Using 

## Train Forest Model Using Top N Leaf Based Features

In [42]:
topNForestModel = "topNForestModelLeaf"

In [43]:
%%time
demo.trainForestModel(
    topNForestModel, "contentTrainPcaNetwork", topNFeatureList)

NOTE: 1323010 bytes were written to the table "topNForestModelLeafAStore" in the caslib "CASUSER(brrees)".
CPU times: user 15.6 ms, sys: 0 ns, total: 15.6 ms
Wall time: 303 ms


Unnamed: 0,Descr,Value
0,Number of Trees,50.0
1,Number of Selected Variables (M),4.0
2,Random Number Seed,12345.0
3,Bootstrap Percentage (%),63.212056
4,Number of Bins,50.0
5,Number of Variables,12.0
6,Confidence Level for Pruning,0.25
7,Max Number of Tree Nodes,41.0
8,Min Number of Tree Nodes,23.0
9,Max Number of Branches,2.0

Unnamed: 0,Variable,Importance,Std
0,core_out_Genetic_Algorithms,145.566501,83.829538
1,core_out_Neural_Networks,113.372868,87.590736
2,core_out_Probabilistic_Methods,108.853599,80.21883
3,core_out_Case_Based,88.623363,59.056901
4,core_out_Reinforcement_Learning,61.260201,43.96631
5,core_out_Rule_Learning,61.160922,34.722224
6,core_out_Theory,58.169529,51.766018
7,deg_in_Genetic_Algorithms,9.120373,43.726751
8,deg_in_Case_Based,8.536772,36.944579
9,pca2,4.991211,6.497107

Unnamed: 0,casLib,Name,Rows,Columns,casTable
0,CASUSER(brrees),topNForestModelLeaf,1660,35,"CASTable('topNForestModelLeaf', caslib='CASUSE..."


In [44]:
resultsScoreTopNForestModel=demo.scoreForestModel(topNForestModel,"contentTestPcaNetwork")

Accuracy = 0.7933579335793358


### Bootstrap Runs

In [45]:
%%time
accuracies = demo.bootstrapForestModel(topNForestModel,"contentTrainPcaNetwork",
                                       "contentTestPcaNetwork",
                                       topNFeatureList,
                                       n=25
                                      );

NOTE: Simple Random Sampling is in effect.
NOTE: Using SEED=5678 for sampling.
NOTE: 1323010 bytes were written to the table "topNForestModelLeafAStore" in the caslib "CASUSER(brrees)".
Accuracy = 0.7950819672131147
NOTE: Simple Random Sampling is in effect.
NOTE: Using SEED=5679 for sampling.
NOTE: 1322994 bytes were written to the table "topNForestModelLeafAStore" in the caslib "CASUSER(brrees)".
Accuracy = 0.7889344262295082
NOTE: Simple Random Sampling is in effect.
NOTE: Using SEED=5680 for sampling.
NOTE: 1313522 bytes were written to the table "topNForestModelLeafAStore" in the caslib "CASUSER(brrees)".
Accuracy = 0.7889344262295082
NOTE: Simple Random Sampling is in effect.
NOTE: Using SEED=5681 for sampling.
NOTE: 1311930 bytes were written to the table "topNForestModelLeafAStore" in the caslib "CASUSER(brrees)".
Accuracy = 0.7889344262295082
NOTE: Simple Random Sampling is in effect.
NOTE: Using SEED=5682 for sampling.
NOTE: 1308666 bytes were written to the table "topNForest

## Autotune Forest Model Using Top N Leaf Based Features

In [46]:
topNForestModelAuto = f"topNForestModelAuto{topNCutoff}Leaf"

In [47]:
%%time
bestConfigTopN = demo.loadOrTuneForestModel(topNForestModelAuto,
                           "contentTrainPcaNetwork",
                           topNFeatureList
                          )
print(bestConfigTopN)

NOTE: Cloud Analytic Services made the file topNForestModelAuto12LeafAStore.sashdat available as table TOPNFORESTMODELAUTO12LEAFASTORE in caslib CASUSER(brrees).
NOTE: Cloud Analytic Services made the file topNForestModelAuto12Leaf.sashdat available as table TOPNFORESTMODELAUTO12LEAF in caslib CASUSER(brrees).
Best Configuration

                                     Parameter       Value
Name                                                      
Evaluation                          Evaluation          72
NTREE                          Number of Trees          98
M                   Number of Variables to Try           4
BOOTSTRAP                            Bootstrap  0.20442693
MAXLEVEL                   Maximum Tree Levels          21
NBINS                           Number of Bins          50
Objective   Misclassification Error Percentage        9.83
CPU times: user 15.6 ms, sys: 31.2 ms, total: 46.9 ms
Wall time: 101 ms


In [48]:
resultsScoreTopNForestModelAuto=demo.scoreForestModel(topNForestModelAuto,"contentTestPcaNetwork")

Accuracy = 0.8708487084870848


### Bootstrap Runs

In [49]:
%%time
accuracies = demo.bootstrapForestModel(topNForestModelAuto,"contentTrainPcaNetwork",
                                       "contentTestPcaNetwork",
                                       topNFeatureList,
                                       bestConfigTopN,
                                       25
                                      );

NOTE: Simple Random Sampling is in effect.
NOTE: Using SEED=5678 for sampling.
NOTE: 4221130 bytes were written to the table "topNForestModelAuto12LeafAStore" in the caslib "CASUSER(brrees)".
Accuracy = 0.860655737704918
NOTE: Simple Random Sampling is in effect.
NOTE: Using SEED=5679 for sampling.
NOTE: 4207834 bytes were written to the table "topNForestModelAuto12LeafAStore" in the caslib "CASUSER(brrees)".
Accuracy = 0.8504098360655737
NOTE: Simple Random Sampling is in effect.
NOTE: Using SEED=5680 for sampling.
NOTE: 4209730 bytes were written to the table "topNForestModelAuto12LeafAStore" in the caslib "CASUSER(brrees)".
Accuracy = 0.8565573770491803
NOTE: Simple Random Sampling is in effect.
NOTE: Using SEED=5681 for sampling.
NOTE: 4196546 bytes were written to the table "topNForestModelAuto12LeafAStore" in the caslib "CASUSER(brrees)".
Accuracy = 0.8524590163934426
NOTE: Simple Random Sampling is in effect.
NOTE: Using SEED=5682 for sampling.
NOTE: 4210210 bytes were written t

# Session Cleanup

In [50]:
s.terminate();