### Regression Example for Python pipefitter

This example shows how to solve a regression problem with a decision tree using the Python pipefitter package. It also uses grid search for hyperparameter tuning.

In [1]:
import swat
import numpy as np
import pandas as pd


#### Generate Data

The idea comes from Chapter 10, "Boosting and Additive Trees," of *The Elements of Statistical Learning* by Trevor Hastie, Robert Tibshirani, and Jerome Friedman.

In [2]:
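mu, sigma = 0, 1  # mean and standard deviation

# Draw 50 feature vectors, each holding 1000 standard normal samples
allnumpys = list()
for i in range(50):
    st = np.random.normal(mu, sigma, 1000)
    allnumpys.append(st)

# Each list entry is one feature, so transpose to get 1000 rows x 50 columns
data = pd.DataFrame(allnumpys)
data = data.transpose()
data.columns = ['a' + str(i) for i in range(50)]

def f(x):
    # Target: for each of the first 10 features, add its square plus Gaussian noise
    sumn = 0
    for i in range(10):
        # Scalar noise draw, so the result is a float rather than a 1-element array
        sumn = sumn + x['a' + str(i)] * x['a' + str(i)] + 2 * np.random.normal(0, 1)
    return sumn

data['label'] = data.apply(f, axis=1)
data.head()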

| | a0 | a1 | a2 | a3 | a4 | a5 | a6 | a7 | a8 | a9 | ... | a41 | a42 | a43 | a44 | a45 | a46 | a47 | a48 | a49 | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.662716 | 0.30807 | 0.338617 | -0.351051 | -1.481292 | 0.557367 | 0.966436 | 0.200269 | -0.336749 | 1.288656 | ... | 0.703076 | -0.846608 | 1.609467 | -0.038384 | -1.29093 | 0.273557 | 0.280244 | 1.042739 | 1.067851 | 10.780648 |
| 1 | -0.271896 | 0.169343 | 1.071508 | -0.938864 | 1.030303 | 0.874336 | 0.319124 | 1.141288 | 1.254647 | 0.033753 | ... | 0.270631 | -0.406405 | -1.592481 | 0.062499 | 1.977884 | -0.830468 | 1.145762 | 0.691041 | -0.327328 | -1.121349 |
| 2 | 0.621875 | 1.087925 | 0.315371 | 1.258042 | -1.213612 | -0.890436 | -3.056506 | 1.280023 | -0.864962 | 0.496708 | ... | -0.340298 | -0.483659 | 0.037831 | -1.418239 | -0.801472 | 0.79563 | -0.308464 | -0.213733 | -1.141872 | 15.343394 |
| 3 | 0.189052 | 0.858708 | -1.147145 | -0.420938 | 0.829774 | -1.375461 | -0.407992 | -1.110958 | 0.474658 | 0.285635 | ... | -0.148302 | 0.309954 | -1.864935 | -0.024355 | -0.303272 | 0.774517 | 1.991609 | 0.100281 | -0.300906 | -1.601921 |
| 4 | -1.801008 | -1.152999 | 0.770565 | 1.70674 | 0.442727 | 0.455442 | -0.28685 | 0.232046 | -0.481169 | 0.128884 | ... | 1.21434 | 0.031385 | -1.023916 | -1.498513 | -0.14788 | 1.82093 | 0.056998 | 0.313696 | 1.12206 | 2.076204 |
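The same generating rule (the target is the sum of squares of the first 10 features, plus one `2 * N(0, 1)` noise draw per feature) can also be written without a row-wise `apply`. The following is a minimal sketch, not the notebook's own cell: the seeded `default_rng` generator and the names `X`, `noise`, `y`, and `data_vec` are assumptions introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Assumption: a seeded Generator is used for reproducibility; the notebook
# cell above uses the unseeded global np.random state instead.
rng = np.random.default_rng(0)

n_rows, n_features, n_signal = 1000, 50, 10

# Features: 1000 x 50 matrix of standard normal draws
X = rng.normal(0.0, 1.0, size=(n_rows, n_features))

# Noise: one 2 * N(0, 1) draw per signal feature, summed within each row
noise = 2.0 * rng.normal(0.0, 1.0, size=(n_rows, n_signal)).sum(axis=1)

# Target: sum of squares of the first 10 features plus the noise
y = (X[:, :n_signal] ** 2).sum(axis=1) + noise

data_vec = pd.DataFrame(X, columns=['a' + str(i) for i in range(n_features)])
data_vec['label'] = y
```

Only `a0` through `a9` influence the target; the other 40 columns are pure noise features that the tree has to learn to ignore.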


## SAS Viya Version

### Create Connections and Load Data

In [3]:
casconn = swat.CAS('sasserver.demo.sas.com', 5570, nworkers=1) 
casdata = casconn.upload_frame(data)

NOTE: Cloud Analytic Services made the uploaded file available as table TMPZ90EA2AZ in caslib CASUSER(lidong).
NOTE: The table TMPZ90EA2AZ has been created in caslib CASUSER(lidong) from binary data uploaded to Cloud Analytic Services.


Return the first 5 rows of `casdata`.

In [4]:
casdata.head()

| | a0 | a1 | a2 | a3 | a4 | a5 | a6 | a7 | a8 | a9 | ... | a41 | a42 | a43 | a44 | a45 | a46 | a47 | a48 | a49 | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.662716 | 0.30807 | 0.338617 | -0.351051 | -1.481292 | 0.557367 | 0.966436 | 0.200269 | -0.336749 | 1.288656 | ... | 0.703076 | -0.846608 | 1.609467 | -0.038384 | -1.29093 | 0.273557 | 0.280244 | 1.042739 | 1.067851 | 10.780648 |
| 1 | -0.271896 | 0.169343 | 1.071508 | -0.938864 | 1.030303 | 0.874336 | 0.319124 | 1.141288 | 1.254647 | 0.033753 | ... | 0.270631 | -0.406405 | -1.592481 | 0.062499 | 1.977884 | -0.830468 | 1.145762 | 0.691041 | -0.327328 | -1.121349 |
| 2 | 0.621875 | 1.087925 | 0.315371 | 1.258042 | -1.213612 | -0.890436 | -3.056506 | 1.280023 | -0.864962 | 0.496708 | ... | -0.340298 | -0.483659 | 0.037831 | -1.418239 | -0.801472 | 0.79563 | -0.308464 | -0.213733 | -1.141872 | 15.343394 |
| 3 | 0.189052 | 0.858708 | -1.147145 | -0.420938 | 0.829774 | -1.375461 | -0.407992 | -1.110958 | 0.474658 | 0.285635 | ... | -0.148302 | 0.309954 | -1.864935 | -0.024355 | -0.303272 | 0.774517 | 1.991609 | 0.100281 | -0.300906 | -1.601921 |
| 4 | -1.801008 | -1.152999 | 0.770565 | 1.70674 | 0.442727 | 0.455442 | -0.28685 | 0.232046 | -0.481169 | 0.128884 | ... | 1.21434 | 0.031385 | -1.023916 | -1.498513 | -0.14788 | 1.82093 | 0.056998 | 0.313696 | 1.12206 | 2.076204 |


Show information about the table, such as its creation time and number of rows.

In [5]:
casdata.tableinfo()

| | Name | Rows | Columns | Encoding | CreateTimeFormatted | ModTimeFormatted | JavaCharSet | CreateTime | ModTime | Global | Repeated | View | SourceName | SourceCaslib | Compressed | Creator | Modifier |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | TMPZ90EA2AZ | 1000 | 51 | utf-8 | 09May2017:15:20:29 | 09May2017:15:20:29 | UTF8 | 1809962000.0 | 1809962000.0 | 0 | 0 | 0 | | | 0 | lidong | |


### Estimator

Import the regression models: decision tree, random forest, and gradient boosting tree.

In [6]:
from pipefitter.estimator import DecisionTree, DecisionForest, GBTree

Create a `DecisionTree` object. This is a high-level object that has no knowledge of CAS or SAS.

In [7]:
params = dict(target='label', 
              inputs=['a'+str(i) for i in range(50)])

In [8]:
dtree = DecisionTree(max_depth=6, **params)
dtree

DecisionTree(alpha=0.0, cf_level=0.25, criterion=None, inputs=['a0', 'a1', 'a2', 'a3', 'a4', 'a5', 'a6', 'a7', 'a8', 'a9', 'a10', 'a11', 'a12', 'a13', 'a14', 'a15', 'a16', 'a17', 'a18', 'a19', 'a20', 'a21', 'a22', 'a23', 'a24', 'a25', 'a26', 'a27', 'a28', 'a29', 'a30', 'a31', 'a32', 'a33', 'a34', 'a35', 'a36', 'a37', 'a38', 'a39', 'a40', 'a41', 'a42', 'a43', 'a44', 'a45', 'a46', 'a47', 'a48', 'a49'], leaf_size=5, max_branches=2, max_depth=6, n_bins=20, nominals=[], prune=False, target='label', var_importance=False)
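One general way an estimator can stay free of backend knowledge is to store only its parameters and choose an implementation when `fit` receives a table. The sketch below illustrates that pattern only; it is not pipefitter's actual implementation, which this notebook does not show. Every name in it (`ToyEstimator`, `FakeCASTable`, `FakeSASData`, `fit_on_cas`, `fit_on_sas`, `BACKENDS`) is hypothetical.

```python
class FakeCASTable:
    """Hypothetical stand-in for a CAS table object."""


class FakeSASData:
    """Hypothetical stand-in for a SAS data set object."""


def fit_on_cas(table, **params):
    # A real backend would call the server here; this toy just tags the result
    return ('cas', params['max_depth'])


def fit_on_sas(table, **params):
    return ('sas', params['max_depth'])


# Hypothetical registry mapping a table's type name to a backend function
BACKENDS = {'FakeCASTable': fit_on_cas, 'FakeSASData': fit_on_sas}


class ToyEstimator:
    """Hypothetical estimator: holds parameters only, no backend details."""

    def __init__(self, target, inputs, max_depth=6):
        self.params = dict(target=target, inputs=list(inputs), max_depth=max_depth)

    def fit(self, table):
        # The backend is chosen from the type of the table passed in
        backend = BACKENDS[type(table).__name__]
        return backend(table, **self.params)


est = ToyEstimator(target='label', inputs=['a0', 'a1'])
result_cas = est.fit(FakeCASTable())
result_sas = est.fit(FakeSASData())
```

With this shape, the same estimator definition can be reused unchanged against either kind of table, which is how the notebook uses `dtree` in both the SAS Viya and SAS 9 sections.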

#### Decision Tree Fit and Score of CAS Table

Using the `DecisionTree` instance, we'll first run the `fit` method on the data set.  This will return a model object.

In [9]:
model = dtree.fit(casdata)
model

DecisionTreeModel(alpha=0.0, cf_level=0.25, criterion=None, inputs=['a0', 'a1', 'a2', 'a3', 'a4', 'a5', 'a6', 'a7', 'a8', 'a9', 'a10', 'a11', 'a12', 'a13', 'a14', 'a15', 'a16', 'a17', 'a18', 'a19', 'a20', 'a21', 'a22', 'a23', 'a24', 'a25', 'a26', 'a27', 'a28', 'a29', 'a30', 'a31', 'a32', 'a33', 'a34', 'a35', 'a36', 'a37', 'a38', 'a39', 'a40', 'a41', 'a42', 'a43', 'a44', 'a45', 'a46', 'a47', 'a48', 'a49'], leaf_size=5, max_branches=2, max_depth=6, n_bins=20, nominals=[], prune=False, target='label', var_importance=False)

The `score` method can then be called on the resulting model object.

In [10]:
score = model.score(casdata)
score

Target                                        label
Level                                      INTERVAL
Var                                   _DT_PredMean_
NBins                                           100
NObsUsed                                       1000
TargetCount                                    1000
TargetMiss                                        0
PredCount                                      1000
PredMiss                                          0
AverageAbsoluteError                        5.40905
AverageSquaredError                         45.1413
AverageSquaredLogarithmicError             0.609637
RootAverageAbsoluteError                    2.32574
RootAverageSquaredError                     6.71873
RootAverageSquaredLogarithmicError         0.780792
dtype: object
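The "Root" rows in this output are numerically consistent with being square roots of the corresponding averaged rows (6.71873² ≈ 45.1413 and 2.32574² ≈ 5.40905). As a minimal sketch, assuming the standard definitions of these error measures (the exact computation on the server is not shown in this notebook), they can be reproduced from true and predicted values as follows. The function name `regression_metrics` and the toy inputs are illustrative assumptions.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Compute error measures under their standard definitions (assumed)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    aae = np.mean(np.abs(err))   # average absolute error
    ase = np.mean(err ** 2)      # average squared error
    return {
        'AverageAbsoluteError': aae,
        'AverageSquaredError': ase,
        'RootAverageAbsoluteError': np.sqrt(aae),
        'RootAverageSquaredError': np.sqrt(ase),
    }


# Toy values, not taken from the notebook's data
metrics = regression_metrics([1.0, 2.0, 3.0], [1.0, 4.0, 2.0])
```

Note that these scores are computed on the same table used for fitting, so they describe in-sample error rather than performance on unseen data.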

### HyperParameter Tuning

The hyper-parameter tuning classes allow you to test multiple sets of parameters across
a set of estimators.

In [11]:
from pipefitter.model_selection import HyperParameterTuning

Define the parameter space.

In [12]:
param_grid = dict(
    max_depth=[6, 10],
    leaf_size=[3, 5],
)
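A grid like this stands for every combination of the listed values, here 2 × 2 = 4 parameter sets. The following sketch shows one way to enumerate them with the standard library; it illustrates what a grid search iterates over and is not pipefitter's internal code. The names `names` and `combos` are assumptions.

```python
from itertools import product

param_grid = dict(
    max_depth=[6, 10],
    leaf_size=[3, 5],
)

# Cartesian product of the value lists, re-labelled with the parameter names
names = list(param_grid)
combos = [dict(zip(names, values)) for values in product(*param_grid.values())]

for combo in combos:
    print(combo)
```

Because the number of combinations is the product of the list lengths, each extra parameter or value multiplies the number of models that have to be fitted.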

In [13]:
hpt = HyperParameterTuning(
          estimator=DecisionTree(target='label', 
              inputs=['a'+str(i) for i in range(50)]),
          param_grid=param_grid, 
          cv=3)
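The `cv=3` argument requests 3-fold cross-validation: the rows are split into three parts, and each part serves once as the held-out set while the other two are used for fitting. How pipefitter partitions the rows internally is not shown in this notebook, so the following is only a generic sketch of k-fold index splitting; `kfold_indices` and its `seed` parameter are hypothetical.

```python
import numpy as np


def kfold_indices(n_rows, k, seed=0):
    """Yield (train, valid) index arrays for k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_rows)
    folds = np.array_split(shuffled, k)  # fold sizes differ by at most one
    for i in range(k):
        valid = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, valid


splits = list(kfold_indices(1000, 3))
for train, valid in splits:
    print(len(train), len(valid))
```

Each parameter set is then fitted and scored once per fold, which is why the grid search output reports one list of fold scores per parameter set.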

In [14]:
hpt.gridsearch(casdata)

| | MeanScore | ScoreStd | Parameters | FoldScores | MeanClockTime |
|---|---|---|---|---|---|
| 0 | 64.033971 | 4.462896 | {'max_depth': 6, 'leaf_size': 5} | [67.66779226021174, 66.68907746010373, 57.7492... | 0.000574 |
| 1 | 66.068731 | 2.973981 | {'max_depth': 6, 'leaf_size': 3} | [69.36152128839937, 66.66353057701731, 62.1523... | 0.000578 |
| 2 | 73.500004 | 3.667657 | {'max_depth': 10, 'leaf_size': 5} | [74.63014496848004, 77.35032653322236, 68.5751... | 0.000586 |
| 3 | 77.890286 | 3.530998 | {'max_depth': 10, 'leaf_size': 3} | [77.81242056005406, 82.29285231553257, 73.6456... | 0.000602 |
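To read a result table like this programmatically, the rows can be ranked by `MeanScore`. The sketch below re-enters the four rows shown above by hand and assumes that a lower `MeanScore` is better, that is, that the score is an error measure. That assumption is not confirmed by the notebook, so check the scoring convention before relying on it. The frame name `results` and the split of `Parameters` into two columns are illustrative.

```python
import pandas as pd

# Values copied from the grid search output above
results = pd.DataFrame({
    'MeanScore': [64.033971, 66.068731, 73.500004, 77.890286],
    'ScoreStd': [4.462896, 2.973981, 3.667657, 3.530998],
    'max_depth': [6, 6, 10, 10],
    'leaf_size': [5, 3, 5, 3],
})

# Assumption: the score is an error measure, so the smallest value wins
best = results.loc[results['MeanScore'].idxmin()]

# Average score per tree depth, to see which parameter matters more
by_depth = results.groupby('max_depth')['MeanScore'].mean()
print(best)
print(by_depth)
```

In this table the gap between the two `max_depth` values is larger than the gap between the two `leaf_size` values at either depth.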


## SAS 9 Version

### Open SAS Session and Load SAS Data

In [15]:
import saspy
sas = saspy.SASsession(cfgname='tdi')

SAS Connection established. Subprocess id is 1706



In [16]:
sasdata = sas.dataframe2sasdata(data)

In [17]:
from pipefitter.estimator import DecisionTree, DecisionForest, GBTree

In [18]:
params = dict(target='label', 
              inputs=['a'+str(i) for i in range(50)])

In [19]:
dtree = DecisionTree(max_depth=6, **params)
dtree

DecisionTree(alpha=0.0, cf_level=0.25, criterion=None, inputs=['a0', 'a1', 'a2', 'a3', 'a4', 'a5', 'a6', 'a7', 'a8', 'a9', 'a10', 'a11', 'a12', 'a13', 'a14', 'a15', 'a16', 'a17', 'a18', 'a19', 'a20', 'a21', 'a22', 'a23', 'a24', 'a25', 'a26', 'a27', 'a28', 'a29', 'a30', 'a31', 'a32', 'a33', 'a34', 'a35', 'a36', 'a37', 'a38', 'a39', 'a40', 'a41', 'a42', 'a43', 'a44', 'a45', 'a46', 'a47', 'a48', 'a49'], leaf_size=5, max_branches=2, max_depth=6, n_bins=20, nominals=[], prune=False, target='label', var_importance=False)

#### Decision Tree Fit and Score of SAS Table

In [21]:
model = dtree.fit(sasdata)
model

DecisionTreeModel(alpha=0.0, cf_level=0.25, criterion=None, inputs=['a0', 'a1', 'a2', 'a3', 'a4', 'a5', 'a6', 'a7', 'a8', 'a9', 'a10', 'a11', 'a12', 'a13', 'a14', 'a15', 'a16', 'a17', 'a18', 'a19', 'a20', 'a21', 'a22', 'a23', 'a24', 'a25', 'a26', 'a27', 'a28', 'a29', 'a30', 'a31', 'a32', 'a33', 'a34', 'a35', 'a36', 'a37', 'a38', 'a39', 'a40', 'a41', 'a42', 'a43', 'a44', 'a45', 'a46', 'a47', 'a48', 'a49'], leaf_size=5, max_branches=2, max_depth=6, n_bins=20, nominals=[], prune=False, target='label', var_importance=False)

In [22]:
score = model.score(sasdata)
score

Target                               label
Level                             INTERVAL
Var                                P_label
NBins                                  100
NObsUsed                              1000
TargetCount                           1000
TargetMiss                               0
PredCount                             1000
PredMiss                                 0
SumSquaredError                    40496.7
Divsor                                1000
AverageSquaredError                40.4967
RootAverageSquaredError             6.3637
MeanPredictionValue                9.67758
StandardDeviationPrediction        4.19754
CoefficientVariationPrediction    0.433739
dtype: object
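Several rows of this output can be checked against each other. The sketch below assumes the usual relationships (average squared error as the sum of squared errors over the divisor, its root as the square root, and the coefficient of variation as standard deviation over mean). These definitions are assumptions, not taken from SAS documentation. The variable names are illustrative, and the numbers are copied from the output above.

```python
import math

# Values copied from the score output above
sum_squared_error = 40496.7
divisor = 1000
mean_pred = 9.67758
std_pred = 4.19754

ase = sum_squared_error / divisor   # compare with AverageSquaredError
rase = math.sqrt(ase)               # compare with RootAverageSquaredError
cv = std_pred / mean_pred           # compare with CoefficientVariationPrediction

print(round(ase, 4), round(rase, 4), round(cv, 6))
```

The computed values agree with the reported `AverageSquaredError`, `RootAverageSquaredError`, and `CoefficientVariationPrediction` rows to the precision shown.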

### HyperParameter Tuning

In [23]:
from pipefitter.model_selection import HyperParameterTuning

In [24]:
param_grid = dict(
    max_depth=[6, 10],
    leaf_size=[3, 5],
)

In [25]:
hpt = HyperParameterTuning(
          estimator=DecisionTree(target='label', 
              inputs=['a'+str(i) for i in range(50)]),
          param_grid=param_grid, 
          cv=3)

In [26]:
hpt.gridsearch(sasdata)

| | MeanScore | ScoreStd | Parameters | FoldScores | MeanClockTime |
|---|---|---|---|---|---|
| 0 | 67.518192 | 2.914376 | {'max_depth': 6, 'leaf_size': 3} | [64.3333611383824, 71.37622528899391, 66.84498... | 0.014081 |
| 1 | 67.518192 | 2.914376 | {'max_depth': 6, 'leaf_size': 5} | [64.3333611383824, 71.37622528899391, 66.84498... | 0.008604 |
| 2 | 81.668458 | 1.267523 | {'max_depth': 10, 'leaf_size': 3} | [80.6143748368767, 83.4511284043157, 80.939870... | 0.010123 |
| 3 | 81.668458 | 1.267523 | {'max_depth': 10, 'leaf_size': 5} | [80.6143748368767, 83.4511284043157, 80.939870... | 0.008844 |
