### Regression Example for Python pipefitter

This example shows how to solve a regression problem with a decision tree using the Python pipefitter package. It also uses grid search for hyperparameter tuning.

In [1]:
import swat
import numpy as np
import pandas as pd


#### Generate Data

The idea comes from Chapter 10, "Boosting and Additive Trees", of *The Elements of Statistical Learning* by Trevor Hastie, Robert Tibshirani, and Jerome Friedman.
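Reading the generating function `f` in the next cell, the target can be written compactly. This is a sketch of what the code computes, not a formula quoted from the book; the symbols $a_j$ (feature columns `a0` to `a9`) and $\varepsilon_j$ (the noise draws) are notation introduced here for illustration.

$$
y = \sum_{j=0}^{9} \left( a_j^2 + 2\,\varepsilon_j \right), \qquad a_j,\ \varepsilon_j \sim \mathcal{N}(0, 1) \ \text{independent}
$$

Since $\mathbb{E}[a_j^2] = 1$, $\operatorname{Var}(a_j^2) = 2$, and $\operatorname{Var}(2\,\varepsilon_j) = 4$, it follows that

$$
\mathbb{E}[y] = 10, \qquad \operatorname{Var}(y) = 10 \cdot 2 + 10 \cdot 4 = 60 .
$$

The remaining 40 columns (`a10` to `a49`) do not enter the target, so they act as pure noise features.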

In [2]:
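mu, sigma = 0, 1  # mean and standard deviation

# Draw 50 independent features, each with 1000 standard-normal samples
allnumpys = list()
for i in range(50):
    st = np.random.normal(mu, sigma, 1000)
    allnumpys.append(st)

# Each array is one row of the initial frame, so transpose to get 1000 rows x 50 columns
data = pd.DataFrame(allnumpys)
data = data.transpose()
data.columns = ['a' + str(i) for i in range(50)]


def f(x):
    # Target: sum of squares of the first 10 features, plus 2 * N(0, 1) noise per term
    sumn = 0
    for i in range(10):
        # np.random.normal(0, 1) returns a scalar, so the label stays a plain float
        sumn = sumn + x['a' + str(i)] * x['a' + str(i)] + 2 * np.random.normal(0, 1)
    return sumn


data['label'] = data.apply(f, axis=1)
data.head()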

| | a0 | a1 | a2 | a3 | a4 | a5 | a6 | a7 | a8 | a9 | ... | a41 | a42 | a43 | a44 | a45 | a46 | a47 | a48 | a49 | label |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | -0.163868 | 0.55166 | 0.755243 | 0.453293 | 0.312177 | -1.242281 | -0.953866 | -1.08095 | 0.768824 | 0.340052 | ... | 0.264166 | 0.699456 | -1.593044 | -0.374007 | -0.176346 | 1.086926 | 0.261823 | 0.420917 | 0.619162 | 2.611734 |
| 1 | 0.641373 | 0.089934 | 1.289161 | -0.114076 | 0.76064 | -0.423258 | -0.267495 | -0.676723 | -0.428593 | -1.400205 | ... | -0.450404 | 1.554618 | 0.02331 | 0.476761 | 0.273385 | -1.785905 | -2.270099 | 0.002988 | -1.709381 | 2.754268 |
| 2 | 0.892209 | -1.513641 | 1.506877 | -0.56347 | -1.632783 | -0.129873 | 0.183244 | 1.474889 | 0.613299 | 0.268912 | ... | -0.356278 | -0.553862 | 0.542841 | -0.031553 | 0.197168 | -0.419485 | 1.046704 | 0.458453 | -0.299734 | 18.649774 |
| 3 | -0.357151 | -1.724089 | -0.868914 | 0.124689 | -0.808981 | -0.546866 | 0.473728 | 0.497063 | 2.027501 | -0.10064 | ... | 0.819672 | 1.212975 | -0.017733 | -1.17564 | -0.929942 | 0.915668 | -0.493842 | 1.727418 | -0.245242 | 9.659525 |
| 4 | 0.898753 | 0.523708 | 1.438855 | -1.173597 | 0.512464 | -0.783214 | -0.427894 | 1.578074 | 0.356996 | -1.200643 | ... | -0.710959 | 0.286052 | 0.085302 | -0.916496 | -0.144651 | -1.58935 | 0.819149 | 1.039231 | 0.066407 | 10.332826 |
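Each label is built from ten squared standard normals plus ten noise terms of 2 * N(0, 1), so its mean should be close to 10 and its variance close to 60. The following is a minimal sketch that checks this by simulation, using only the standard library so it runs without NumPy or a SAS server. The helper `make_label` and the sample size are illustrative assumptions, not part of the notebook.

```python
import random
import statistics


def make_label(rng):
    """Draw one target value the same way f() does: sum of a_j**2 + 2 * noise."""
    total = 0.0
    for _ in range(10):
        a = rng.gauss(0, 1)                   # one feature value
        total += a * a + 2 * rng.gauss(0, 1)  # squared feature plus scaled noise
    return total


rng = random.Random(0)  # fixed seed so the check is repeatable
labels = [make_label(rng) for _ in range(20000)]

sample_mean = statistics.fmean(labels)
sample_var = statistics.pvariance(labels)
print(round(sample_mean, 2), round(sample_var, 2))
```

A variance near 60 is a useful yardstick: a model that always predicts the mean would have an average squared error of roughly that size on this data.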


## SAS Viya version

### Create Connections and Load Data

In [3]:
casconn = swat.CAS('sasserver.demo.sas.com', 5570, nworkers=1) 
casdata = casconn.upload_frame(data)

NOTE: Cloud Analytic Services made the uploaded file available as table TMPFY5Y_6HZ in caslib CASUSER(lidong).
NOTE: The table TMPFY5Y_6HZ has been created in caslib CASUSER(lidong) from binary data uploaded to Cloud Analytic Services.


Return the first 5 rows of `casdata`.

In [4]:
casdata.head()

| | a0 | a1 | a2 | a3 | a4 | a5 | a6 | a7 | a8 | a9 | ... | a41 | a42 | a43 | a44 | a45 | a46 | a47 | a48 | a49 | label |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | -0.163868 | 0.55166 | 0.755243 | 0.453293 | 0.312177 | -1.242281 | -0.953866 | -1.08095 | 0.768824 | 0.340052 | ... | 0.264166 | 0.699456 | -1.593044 | -0.374007 | -0.176346 | 1.086926 | 0.261823 | 0.420917 | 0.619162 | 2.611734 |
| 1 | 0.641373 | 0.089934 | 1.289161 | -0.114076 | 0.76064 | -0.423258 | -0.267495 | -0.676723 | -0.428593 | -1.400205 | ... | -0.450404 | 1.554618 | 0.02331 | 0.476761 | 0.273385 | -1.785905 | -2.270099 | 0.002988 | -1.709381 | 2.754268 |
| 2 | 0.892209 | -1.513641 | 1.506877 | -0.56347 | -1.632783 | -0.129873 | 0.183244 | 1.474889 | 0.613299 | 0.268912 | ... | -0.356278 | -0.553862 | 0.542841 | -0.031553 | 0.197168 | -0.419485 | 1.046704 | 0.458453 | -0.299734 | 18.649774 |
| 3 | -0.357151 | -1.724089 | -0.868914 | 0.124689 | -0.808981 | -0.546866 | 0.473728 | 0.497063 | 2.027501 | -0.10064 | ... | 0.819672 | 1.212975 | -0.017733 | -1.17564 | -0.929942 | 0.915668 | -0.493842 | 1.727418 | -0.245242 | 9.659525 |
| 4 | 0.898753 | 0.523708 | 1.438855 | -1.173597 | 0.512464 | -0.783214 | -0.427894 | 1.578074 | 0.356996 | -1.200643 | ... | -0.710959 | 0.286052 | 0.085302 | -0.916496 | -0.144651 | -1.58935 | 0.819149 | 1.039231 | 0.066407 | 10.332826 |


Show information about the table, such as its creation time and number of rows.

In [5]:
casdata.tableinfo()

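| | Name | Rows | Columns | Encoding | CreateTimeFormatted | ModTimeFormatted | JavaCharSet | CreateTime | ModTime | Global | Repeated | View | SourceName | SourceCaslib | Compressed | Creator | Modifier |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | TMPFY5Y_6HZ | 1000 | 51 | utf-8 | 09May2017:15:26:28 | 09May2017:15:26:28 | UTF8 | 1809963000.0 | 1809963000.0 | 0 | 0 | 0 | | | 0 | lidong | |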


### Estimator

Import regression models: decision tree, random forest and gradient boosting tree

In [6]:
from pipefitter.estimator import DecisionTree, DecisionForest, GBTree

Create a DecisionTree object. This object is the high-level object that has no knowledge of CAS or SAS.

In [7]:
params = dict(target='label', 
              inputs=['a'+str(i) for i in range(50)])

In [8]:
dtree = DecisionTree(max_depth=6, **params)
dtree

DecisionTree(alpha=0.0, cf_level=0.25, criterion=None, inputs=['a0', 'a1', 'a2', 'a3', 'a4', 'a5', 'a6', 'a7', 'a8', 'a9', 'a10', 'a11', 'a12', 'a13', 'a14', 'a15', 'a16', 'a17', 'a18', 'a19', 'a20', 'a21', 'a22', 'a23', 'a24', 'a25', 'a26', 'a27', 'a28', 'a29', 'a30', 'a31', 'a32', 'a33', 'a34', 'a35', 'a36', 'a37', 'a38', 'a39', 'a40', 'a41', 'a42', 'a43', 'a44', 'a45', 'a46', 'a47', 'a48', 'a49'], leaf_size=5, max_branches=2, max_depth=6, n_bins=20, nominals=[], prune=False, target='label', var_importance=False)

#### Decision Tree Fit and Score of CAS Table

Using the `DecisionTree` instance, we'll first run the `fit` method on the data set.  This will return a model object.

In [9]:
model = dtree.fit(casdata)
model

DecisionTreeModel(alpha=0.0, cf_level=0.25, criterion=None, inputs=['a0', 'a1', 'a2', 'a3', 'a4', 'a5', 'a6', 'a7', 'a8', 'a9', 'a10', 'a11', 'a12', 'a13', 'a14', 'a15', 'a16', 'a17', 'a18', 'a19', 'a20', 'a21', 'a22', 'a23', 'a24', 'a25', 'a26', 'a27', 'a28', 'a29', 'a30', 'a31', 'a32', 'a33', 'a34', 'a35', 'a36', 'a37', 'a38', 'a39', 'a40', 'a41', 'a42', 'a43', 'a44', 'a45', 'a46', 'a47', 'a48', 'a49'], leaf_size=5, max_branches=2, max_depth=6, n_bins=20, nominals=[], prune=False, target='label', var_importance=False)

The `score` method can then be called on the resulting model object.

In [10]:
score = model.score(casdata)
score

Target                                        label
Level                                      INTERVAL
Var                                   _DT_PredMean_
NBins                                           100
NObsUsed                                       1000
TargetCount                                    1000
TargetMiss                                        0
PredCount                                      1000
PredMiss                                          0
AverageAbsoluteError                        5.76552
AverageSquaredError                         52.6174
AverageSquaredLogarithmicError             0.718349
RootAverageAbsoluteError                    2.40115
RootAverageSquaredError                     7.25378
RootAverageSquaredLogarithmicError         0.847555
dtype: object
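In the output above, each `Root...` value equals the square root of the matching `Average...` value (for example, the square root of 52.6174 is about 7.2538). The sketch below recomputes metrics with these names from raw predictions in plain Python. The definitions are assumptions inferred from the metric names and that square-root relationship, not taken from the pipefitter or CAS documentation, and `regression_metrics` is a hypothetical helper.

```python
import math


def regression_metrics(y_true, y_pred):
    """Assumed definitions of the error summaries shown in the score output."""
    n = len(y_true)
    avg_abs = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    avg_sq = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
    return {
        "AverageAbsoluteError": avg_abs,
        "AverageSquaredError": avg_sq,
        "RootAverageAbsoluteError": math.sqrt(avg_abs),
        "RootAverageSquaredError": math.sqrt(avg_sq),
    }


# Tiny made-up example: the third prediction is off by 2
metrics = regression_metrics([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
for name, value in metrics.items():
    print(name, value)
```

Because this score is computed on the same table the model was fit on, it is a training error and tends to look better than the error on unseen data.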

### HyperParameter Tuning

The hyper-parameter tuning classes allow you to test multiple sets of parameters across
a set of estimators.

In [11]:
from pipefitter.model_selection import HyperParameterTuning

Define the parameter space.

In [12]:
param_grid = dict(
    max_depth=[6, 10],
    leaf_size=[3, 5],
)
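A grid like this is typically expanded into the Cartesian product of its value lists, which here gives four combinations, matching the four rows that the grid search returns. A minimal sketch of that expansion using only the standard library follows; how pipefitter enumerates the grid internally is an assumption here.

```python
from itertools import product

param_grid = dict(
    max_depth=[6, 10],
    leaf_size=[3, 5],
)

# Fix an order for the parameter names, then pair every value with every other
names = sorted(param_grid)
combos = [
    dict(zip(names, values))
    for values in product(*(param_grid[name] for name in names))
]

for combo in combos:
    print(combo)
```

The number of candidates is the product of the list lengths, so adding a third parameter with three values would raise it from 4 to 12, and each candidate is fit once per cross-validation fold.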

In [13]:
hpt = HyperParameterTuning(
          estimator=DecisionTree(target='label', 
              inputs=['a'+str(i) for i in range(50)]),
          param_grid=param_grid, 
          cv=3)

In [14]:
hpt.gridsearch(casdata)

| | MeanScore | ScoreStd | Parameters | FoldScores | MeanClockTime |
| --- | --- | --- | --- | --- | --- |
| 0 | 69.806799 | 3.704812 | {'leaf_size': 5, 'max_depth': 6} | [74.94611662324502, 66.49346708505026, 67.8597... | 0.000346 |
| 1 | 70.835907 | 4.448506 | {'leaf_size': 3, 'max_depth': 6} | [76.45918787273892, 70.38556283013965, 65.5875... | 0.000383 |
| 2 | 73.822943 | 3.661347 | {'leaf_size': 5, 'max_depth': 10} | [75.37561483948795, 68.72377631003239, 77.2592... | 0.000495 |
| 3 | 80.267741 | 5.550547 | {'leaf_size': 3, 'max_depth': 10} | [88.06205685853658, 76.2462445689743, 76.32933... | 0.000493 |
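With `cv=3`, each parameter set gets three fold scores, which are then summarized by a mean and a standard deviation. The sketch below illustrates the general k-fold idea with a trivial "predict the training mean" model and a squared-error score. It is not pipefitter's implementation: the fold assignment, the scoring metric, the use of a population standard deviation, and the helpers `kfold_indices` and `cross_validate` are all assumptions made for illustration, and the data is synthetic.

```python
import random
import statistics


def kfold_indices(n_rows, k, seed=0):
    """Shuffle row indices and deal them into k roughly equal folds."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate(y, k=3):
    """Score a mean-only predictor on each held-out fold (mean squared error)."""
    fold_scores = []
    for held_out in kfold_indices(len(y), k):
        held = set(held_out)
        train = [y[i] for i in range(len(y)) if i not in held]
        prediction = statistics.fmean(train)  # "model" fit on the other folds
        mse = statistics.fmean((y[i] - prediction) ** 2 for i in held_out)
        fold_scores.append(mse)
    return fold_scores


rng = random.Random(1)
y = [rng.gauss(10, 3) for _ in range(300)]  # synthetic target values

fold_scores = cross_validate(y, k=3)
mean_score = statistics.fmean(fold_scores)
score_std = statistics.pstdev(fold_scores)
print(fold_scores, mean_score, score_std)
```

A large spread across folds relative to the gap between parameter sets suggests that differences between candidates may not be meaningful on a data set of this size.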


## SAS 9 Version

### Open SAS Session and Load SAS Data

In [15]:
import saspy
sas = saspy.SASsession(cfgname='tdi')

SAS Connection established. Subprocess id is 5555



In [16]:
sasdata = sas.dataframe2sasdata(data)

In [17]:
params = dict(target='label', 
              inputs=['a'+str(i) for i in range(50)])

In [18]:
dtree = DecisionTree(max_depth=6, **params)
dtree

DecisionTree(alpha=0.0, cf_level=0.25, criterion=None, inputs=['a0', 'a1', 'a2', 'a3', 'a4', 'a5', 'a6', 'a7', 'a8', 'a9', 'a10', 'a11', 'a12', 'a13', 'a14', 'a15', 'a16', 'a17', 'a18', 'a19', 'a20', 'a21', 'a22', 'a23', 'a24', 'a25', 'a26', 'a27', 'a28', 'a29', 'a30', 'a31', 'a32', 'a33', 'a34', 'a35', 'a36', 'a37', 'a38', 'a39', 'a40', 'a41', 'a42', 'a43', 'a44', 'a45', 'a46', 'a47', 'a48', 'a49'], leaf_size=5, max_branches=2, max_depth=6, n_bins=20, nominals=[], prune=False, target='label', var_importance=False)

#### Decision Tree Fit and Score of SAS Table

In [19]:
model = dtree.fit(sasdata)
model

DecisionTreeModel(alpha=0.0, cf_level=0.25, criterion=None, inputs=['a0', 'a1', 'a2', 'a3', 'a4', 'a5', 'a6', 'a7', 'a8', 'a9', 'a10', 'a11', 'a12', 'a13', 'a14', 'a15', 'a16', 'a17', 'a18', 'a19', 'a20', 'a21', 'a22', 'a23', 'a24', 'a25', 'a26', 'a27', 'a28', 'a29', 'a30', 'a31', 'a32', 'a33', 'a34', 'a35', 'a36', 'a37', 'a38', 'a39', 'a40', 'a41', 'a42', 'a43', 'a44', 'a45', 'a46', 'a47', 'a48', 'a49'], leaf_size=5, max_branches=2, max_depth=6, n_bins=20, nominals=[], prune=False, target='label', var_importance=False)

In [20]:
score = model.score(sasdata)
score

Target                               label
Level                             INTERVAL
Var                                P_label
NBins                                  100
NObsUsed                              1000
TargetCount                           1000
TargetMiss                               0
PredCount                             1000
PredMiss                                 0
SumSquaredError                    47997.6
Divsor                                1000
AverageSquaredError                47.9976
RootAverageSquaredError            6.92803
MeanPredictionValue                10.1566
StandardDeviationPrediction        3.95756
CoefficientVariationPrediction    0.389654
dtype: object

### HyperParameter Tuning

In [21]:
param_grid = dict(
    max_depth=[6, 10],
    leaf_size=[3, 5],
)

In [22]:
hpt = HyperParameterTuning(
          estimator=DecisionTree(target='label', 
              inputs=['a'+str(i) for i in range(50)]),
          param_grid=param_grid, 
          cv=3)

In [23]:
hpt.gridsearch(sasdata)

| | MeanScore | ScoreStd | Parameters | FoldScores | MeanClockTime |
| --- | --- | --- | --- | --- | --- |
| 0 | 68.404924 | 4.419822 | {'leaf_size': 3, 'max_depth': 6} | [72.4244376641314, 70.54062996103379, 62.24970... | 0.016511 |
| 1 | 68.404924 | 4.419822 | {'leaf_size': 5, 'max_depth': 6} | [72.4244376641314, 70.54062996103379, 62.24970... | 0.009058 |
| 2 | 78.77297 | 9.506403 | {'leaf_size': 3, 'max_depth': 10} | [77.52525571760691, 90.9894958658754, 67.80415... | 0.008474 |
| 3 | 78.77297 | 9.506403 | {'leaf_size': 5, 'max_depth': 10} | [77.52525571760691, 90.9894958658754, 67.80415... | 0.019093 |
