### Regression Example for Python pipefitter

This example shows how to solve a regression problem with a decision tree using Python pipefitter. It uses grid search for hyperparameter tuning.

In [1]:
import swat
import numpy as np
import pandas as pd

#### Generate Data

The idea comes from Chapter 10, "Boosting and Additive Trees", of *The Elements of Statistical Learning* by Trevor Hastie, Robert Tibshirani, and Jerome Friedman.

In [2]:
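mu, sigma = 0, 1  # mean and standard deviation of each feature

# Draw 50 independent normal features with 1000 observations each
allnumpys = list()
for i in range(50):
    st = np.random.normal(mu, sigma, 1000)
    allnumpys.append(st)

# Each list entry is one feature, so transpose to get one column per feature
data = pd.DataFrame(allnumpys)
data = data.transpose()

def f(x):
    # Target: sum over the first 10 features of x_i^2 plus scaled Gaussian noise
    sumn = 0
    for i in range(10):
        # Draw a scalar so that the label is a float, not a length-1 array
        sumn = sumn + x[i]*x[i] + 2*np.random.normal(0, 1)
    return sumn

data['label'] = data.apply(f, axis=1)
data.head()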

,0,1,2,3,4,5,6,7,8,9,...,41,42,43,44,45,46,47,48,49,label
0,1.026961,-1.460769,-0.599474,0.631775,-0.924325,-0.929645,-0.450401,0.541806,1.097178,-0.179979,...,0.336655,1.020102,0.829641,0.671955,-0.111027,0.689708,-0.807005,1.762734,-0.83125,12.396742
1,0.849867,-0.675181,0.190271,-0.217543,1.091228,-0.322843,-0.148746,-0.666014,-0.089191,-0.679919,...,-0.015966,-0.257987,0.137701,1.553995,-0.281952,-1.217915,0.505365,0.710981,-0.785675,-3.81457
2,0.654742,0.730339,1.072677,0.524776,-0.908372,0.149268,1.105778,-2.04872,1.486944,2.03078,...,0.642224,-0.826536,-0.628561,-2.48972,-0.726688,0.220639,-0.273145,1.522935,0.203334,13.053647
3,-0.264717,0.716824,-0.054386,0.508544,1.760732,0.579096,-0.104291,-0.84591,-0.123437,-0.670075,...,-0.179121,0.169972,-0.832509,0.839313,1.815583,-1.087448,-0.946224,0.387775,0.740499,13.246403
4,-1.278897,-0.279804,1.543342,-0.327045,2.276871,0.6416,-1.158023,-0.457537,-1.973937,0.478836,...,0.374458,0.143975,-0.279526,-0.056836,-3.221961,1.188831,1.112162,1.361132,-0.065893,9.743054
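
By construction, the target is the sum of ten squared standard-normal features plus ten noise draws with standard deviation 2. Its expected value is therefore 10 and its variance is 10 × 2 + 10 × 4 = 60, of which 40 is pure noise that no model can explain. The sketch below checks these figures by simulation using only the standard library. The helper name `simulate_label`, the sample size, and the seed are assumptions made for this illustration and are not part of the original notebook.

```python
import random
import statistics

random.seed(0)  # arbitrary seed, used only to make the run reproducible

def simulate_label(n_signal=10, noise_scale=2.0):
    """One draw of the target: sum of x_i^2 + noise_scale * e_i over n_signal features."""
    total = 0.0
    for _ in range(n_signal):
        x = random.gauss(0, 1)  # feature value
        e = random.gauss(0, 1)  # independent noise draw
        total += x * x + noise_scale * e
    return total

labels = [simulate_label() for _ in range(20000)]
label_mean = statistics.mean(labels)
label_var = statistics.pvariance(labels)

print(f"mean     ~ {label_mean:.2f} (theory: 10)")
print(f"variance ~ {label_var:.2f} (theory: 60, of which 40 is noise)")
```

These two numbers give a rough yardstick for the error values reported later: an average squared error near 60 is about what predicting the mean would achieve, and 40 is the floor for any model on unseen data.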


## SAS Viya version

### Create Connections and Load Data

In [3]:
casconn = swat.CAS('sasserver.demo.sas.com', 5570, nworkers=1) 
casdata = casconn.upload_frame(data)

NOTE: Cloud Analytic Services made the uploaded file available as table TMPHM87RBQD in caslib CASUSERHDFS(lidong).
NOTE: The table TMPHM87RBQD has been created in caslib CASUSERHDFS(lidong) from binary data uploaded to Cloud Analytic Services.


Return the first 5 rows of `casdata`.

In [4]:
casdata.head()

,0,1,2,3,4,5,6,7,8,9,...,41,42,43,44,45,46,47,48,49,label
0,1.026961,-1.460769,-0.599474,0.631775,-0.924325,-0.929645,-0.450401,0.541806,1.097178,-0.179979,...,0.336655,1.020102,0.829641,0.671955,-0.111027,0.689708,-0.807005,1.762734,-0.83125,12.396742
1,0.849867,-0.675181,0.190271,-0.217543,1.091228,-0.322843,-0.148746,-0.666014,-0.089191,-0.679919,...,-0.015966,-0.257987,0.137701,1.553995,-0.281952,-1.217915,0.505365,0.710981,-0.785675,-3.81457
2,0.654742,0.730339,1.072677,0.524776,-0.908372,0.149268,1.105778,-2.04872,1.486944,2.03078,...,0.642224,-0.826536,-0.628561,-2.48972,-0.726688,0.220639,-0.273145,1.522935,0.203334,13.053647
3,-0.264717,0.716824,-0.054386,0.508544,1.760732,0.579096,-0.104291,-0.84591,-0.123437,-0.670075,...,-0.179121,0.169972,-0.832509,0.839313,1.815583,-1.087448,-0.946224,0.387775,0.740499,13.246403
4,-1.278897,-0.279804,1.543342,-0.327045,2.276871,0.6416,-1.158023,-0.457537,-1.973937,0.478836,...,0.374458,0.143975,-0.279526,-0.056836,-3.221961,1.188831,1.112162,1.361132,-0.065893,9.743054


Show information about the table, such as its creation time and number of rows.

In [5]:
casdata.tableinfo()

,Name,Rows,Columns,Encoding,CreateTimeFormatted,ModTimeFormatted,JavaCharSet,CreateTime,ModTime,Global,Repeated,View,SourceName,SourceCaslib,Compressed,Creator,Modifier
0,TMPHM87RBQD,1000,51,utf-8,04May2017:19:38:22,04May2017:19:38:22,UTF8,1809546000.0,1809546000.0,0,0,0,,,0,lidong,


### Estimator

Import the regression estimators: decision tree, random forest, and gradient boosting tree.

In [6]:
from pipefitter.estimator import DecisionTree, DecisionForest, GBTree

Create a `DecisionTree` object. This is a high-level object that has no knowledge of CAS or SAS.

In [7]:
params = dict(target='label', 
              inputs=[str(i) for i in range(50)])

In [8]:
dtree = DecisionTree(max_depth=6, **params)
dtree

DecisionTree(alpha=0.0, cf_level=0.25, criterion=None, inputs=['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45', '46', '47', '48', '49'], leaf_size=5, max_branches=2, max_depth=6, n_bins=20, nominals=[], prune=False, target='label', var_importance=False)

#### Decision Tree Fit and Score of CAS Table

Using the `DecisionTree` instance, we'll first run the `fit` method on the data set.  This will return a model object.

In [9]:
model = dtree.fit(casdata)
model

DecisionTreeModel(alpha=0.0, cf_level=0.25, criterion=None, inputs=['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45', '46', '47', '48', '49'], leaf_size=5, max_branches=2, max_depth=6, n_bins=20, nominals=[], prune=False, target='label', var_importance=False)
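
To give a rough idea of what fitting a regression tree involves, the sketch below finds a single split on one feature by minimising the sum of squared errors, while requiring at least `leaf_size` observations on each side. This is a textbook illustration written for this note, not the algorithm that CAS runs. The function names `sse` and `best_split` and the toy data are hypothetical.

```python
def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def best_split(xs, ys, leaf_size=1):
    """Return (threshold, total_sse) for the best split x <= threshold, or None."""
    pairs = sorted(zip(xs, ys))
    best = None
    # k is the number of observations sent to the left child
    for k in range(leaf_size, len(pairs) - leaf_size + 1):
        # A threshold cannot separate two equal x values
        if pairs[k - 1][0] == pairs[k][0]:
            continue
        left = [y for _, y in pairs[:k]]
        right = [y for _, y in pairs[k:]]
        total = sse(left) + sse(right)
        if best is None or total < best[1]:
            threshold = (pairs[k - 1][0] + pairs[k][0]) / 2
            best = (threshold, total)
    return best

# Toy data with an obvious jump between x = 0.5 and x = 1.8
xs = [0.1, 0.4, 0.5, 1.8, 2.0, 2.2]
ys = [1.0, 1.2, 0.9, 4.1, 3.9, 4.3]
print(best_split(xs, ys, leaf_size=2))
```

A full tree would repeat this search over all inputs and recurse into each child until a depth limit such as `max_depth` is reached or no split satisfies the leaf-size constraint.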

The `score` method can then be called on the resulting model object.

In [10]:
score = model.score(casdata)
score

Target                                        label
Level                                      INTERVAL
Var                                   _DT_PredMean_
NBins                                           100
NObsUsed                                       1000
TargetCount                                    1000
TargetMiss                                        0
PredCount                                      1000
PredMiss                                          0
AverageAbsoluteError                        5.45961
AverageSquaredError                         47.7999
AverageSquaredLogarithmicError             0.655048
RootAverageAbsoluteError                    2.33658
RootAverageSquaredError                     6.91375
RootAverageSquaredLogarithmicError          0.80935
dtype: object
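
The error measures in this output are related in a simple way: each `Root...` entry is the square root of the corresponding average, for example sqrt(47.7999) ≈ 6.91375. The sketch below defines the average absolute and squared errors in plain Python and checks the square-root relationship against the reported values. The helper names and the toy numbers are illustrative assumptions; the logarithmic variant is left out because the source does not say how it is computed for negative targets.

```python
import math

def average_absolute_error(actual, predicted):
    """Mean of |actual - predicted|."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def average_squared_error(actual, predicted):
    """Mean of (actual - predicted)^2."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# Toy values for illustration, not taken from the notebook
actual = [12.4, -3.8, 13.1, 13.2, 9.7]
predicted = [10.0, 2.0, 11.5, 14.0, 9.0]

aae = average_absolute_error(actual, predicted)
ase = average_squared_error(actual, predicted)
print("AAE:", aae, "root:", math.sqrt(aae))
print("ASE:", ase, "root:", math.sqrt(ase))

# (average, root) pairs copied from the score output
reported = {
    "AverageAbsoluteError": (5.45961, 2.33658),
    "AverageSquaredError": (47.7999, 6.91375),
    "AverageSquaredLogarithmicError": (0.655048, 0.80935),
}
for name, (avg, root) in reported.items():
    # Each reported root matches the square root of its average
    assert math.isclose(math.sqrt(avg), root, abs_tol=1e-4), name
```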

### HyperParameter Tuning

The hyper-parameter tuning classes allow you to test multiple sets of parameters across
a set of estimators.

In [11]:
from pipefitter.model_selection import HyperParameterTuning

Define the parameter space.

In [12]:
param_grid = dict(
    max_depth=[6, 10],
    leaf_size=[3, 5],
)
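
A grid search evaluates every combination of the listed values, so this grid yields 2 × 2 = 4 candidate settings. The sketch below enumerates them with `itertools.product`. It illustrates the idea only and makes no claim about how pipefitter expands the grid internally.

```python
from itertools import product

param_grid = dict(
    max_depth=[6, 10],
    leaf_size=[3, 5],
)

# Build one dict per combination of parameter values
names = list(param_grid)
candidates = [
    dict(zip(names, values))
    for values in product(*(param_grid[n] for n in names))
]

for c in candidates:
    print(c)
```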

In [13]:
hpt = HyperParameterTuning(
          estimator=DecisionTree(target='label', 
              inputs=[str(i) for i in range(50)]),
          param_grid=param_grid, 
          cv=3)
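
With `cv=3`, each candidate is presumably scored by 3-fold cross-validation: the rows are divided into three folds, and each fold serves once as the held-out set while the other two are used for fitting. A minimal sketch of such a partition follows. The helper `kfold_indices` is hypothetical, and how pipefitter or CAS actually assigns rows to folds is not stated in the source.

```python
def kfold_indices(n_rows, n_folds):
    """Yield (train_idx, test_idx) pairs for contiguous folds of near-equal size."""
    base, extra = divmod(n_rows, n_folds)
    start = 0
    for fold in range(n_folds):
        # The first `extra` folds get one additional row
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_rows))
        yield train_idx, test_idx
        start += size

# 1000 rows, as in the uploaded table
splits = list(kfold_indices(1000, 3))
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx))
```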

In [14]:
hpt.gridsearch(casdata)

,MeanScore,ScoreStd,Parameters,FoldScores,MeanClockTime
0,62.113374,2.5879,"{'max_depth': 6, 'leaf_size': 5}","[65.75939818905599, 60.24532944862384, 60.2942...",0.000566
1,63.896503,4.274422,"{'max_depth': 6, 'leaf_size': 3}","[69.83758939242621, 59.9135774358081, 61.98532...",0.000577
2,67.476578,4.470002,"{'max_depth': 10, 'leaf_size': 5}","[73.77481176868052, 64.00831623139457, 64.6048...",0.000593
3,74.464213,7.394592,"{'max_depth': 10, 'leaf_size': 3}","[84.58060851524692, 67.0219802969355, 71.95020...",0.00072
