## Heart Classification
| Info | Details |
| --- | --- |
| DATA | heart_disease.csv |
| DESCRIPTION | The data set contains measurements on 304 patients, covering factors that potentially indicate the presence or absence of heart disease. |
| PURPOSE | This example shows several binary classification modeling techniques for predicting heart disease. |
| SOURCE | Adapted from "Heart Disease prediction Random forest Classifier" (https://www.kaggle.com/code/mruanova/heart-disease-prediction-random-forest-classifier) by Mau Rua |



#### Loading the heart_disease data.

In [2]:
title 'Predicting heart disease using different modeling techniques';

filename heart url 'https://raw.githubusercontent.com/rachelnisbet/JupyterCodingCanvas/main/heart_disease.csv';

proc import
    datafile=heart
    out=heart_disease dbms=csv replace;
run;


NOTE: The infile HEART is:
      Filename=https://raw.githubusercontent.com/rachelnisbet/JupyterCodingCanvas/main/heart_disease.csv,
      Lrecl=32767,Recfm=Variable

NOTE: 304 records were read from the infile HEART.
      The minimum record length was 33.

In [3]:
proc format;
    value heart_disease
        0="No heart disease"
        1="With heart disease"
    ;
    value sex_format
        0="Female"
        1="Male"
    ;
run;
data heart_disease;
    format target heart_disease. sex sex_format.;
    set heart_disease;
run;


NOTE: Format HEART_DISEASE has been output.
NOTE: Format SEX_FORMAT has been output.
NOTE: There were 304 observations read from the data set WORK.HEART_DISEASE.
NOTE: The data set WORK.HEART_DISEASE has 304 observations and 14 variables.

#### Print a few rows to show the original data.

In [4]:
title2 'Portion of heart_disease data';
proc print data=heart_disease (obs=5); run;

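| Obs | target | sex | age | trestbps | chol | thalch | oldpeak | ca | cp | exang | slope | thal | restecg | fbs |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | No heart disease | Male | 63 | 145 | 233 | 150 | 2.3 | 0 | 3 | 0 | 0 | 1 | 0 | 1 |
| 2 | With heart disease | Male | 67 | 160 | 286 | 108 | 1.5 | 3 | 0 | 1 | 1 | 2 | 0 | 0 |
| 3 | With heart disease | Male | 67 | 120 | 229 | 129 | 2.6 | 2 | 0 | 1 | 1 | 0 | 0 | 0 |
| 4 | No heart disease | Male | 37 | 130 | 250 | 187 | 3.5 | 0 | 2 | 0 | 0 | 2 | 1 | 0 |
| 5 | No heart disease | Female | 41 | 130 | 204 | 172 | 1.4 | 0 | 1 | 0 | 2 | 2 | 0 | 0 |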


### Column Descriptions
| Variable | Description |
| --- | --- |
| age | age in years |
| sex | 1 = male; 0 = female |
| cp | chest pain type |
| trestbps | resting blood pressure (in mm Hg on admission to the hospital) |
| chol | serum cholesterol in mg/dl |
| fbs | fasting blood sugar > 120 mg/dl (1 = true; 0 = false) |
| restecg | resting electrocardiographic results |
| thalch | maximum heart rate achieved |
| exang | exercise-induced angina (1 = yes; 0 = no) |
| oldpeak | ST depression induced by exercise relative to rest |
| slope | the slope of the peak exercise ST segment |
| ca | number of major vessels (0-3) colored by fluoroscopy |
| thal | 3 = normal; 6 = fixed defect; 7 = reversible defect (in this file the three levels appear coded as 0, 1, 2) |
| target | presence of heart disease in the patient (1 = yes; 0 = no) |

#### Visualization of heart disease percentages through a pie chart
 - Use the TEMPLATE procedure to customize the pie chart.
 - Use the SGRENDER procedure to display the pie chart with the customized template.
 - Show the overall heart disease percentage and the heart disease percentage by sex.

In [5]:
proc template;
    define statgraph simplepie;
        begingraph;
            entrytitle "Heart Disease Percentage";
            layout region;
                piechart category=target / name="p"
                     datalabelcontent=(percent)
                     datalabellocation=inside
                     dataskin=sheen;
                discretelegend "p" / title="Target" halign=left valign=bottom;
            endlayout;
        endgraph;
    end;
run;
title2 'Percentage of heart disease in the data';
proc sgrender data=heart_disease
              template=simplepie;
run;

proc template;
    define statgraph simplepie;
        begingraph;
            entrytitle "Heart Disease Percentage";
            layout region;
                piechart category=target / group=sex name="p"
                     datalabelcontent=(percent)
                     datalabellocation=inside
                     dataskin=sheen;
                discretelegend "p" / title="Target" halign=left valign=bottom;
            endlayout;
        endgraph;
    end;
run;
title2 'Percentage of heart disease by sex in the data';
proc sgrender data=heart_disease
              template=simplepie;
run;
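
As a rough cross-check of the overall pie chart, the split can be derived from the summary statistics that PROC CORR reports for this data (N = 304 and a `target` sum of 139). The sketch below is plain Python rather than SAS, and its variable names are illustrative only.

```python
# Counts taken from the Simple Statistics output: N = 304, Sum of target = 139.
n_patients = 304
n_with_disease = 139
n_without_disease = n_patients - n_with_disease

# Percentages that the two pie slices should show.
pct_with = 100 * n_with_disease / n_patients
pct_without = 100 * n_without_disease / n_patients

print(f"With heart disease: {pct_with:.2f}%")
print(f"No heart disease:   {pct_without:.2f}%")
```

The grouped chart applies the same calculation within each level of `sex`.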




#### Correlation Analysis
 To better understand how different factors contribute to heart disease
 and how the factors correlate with each other, we will use two different
 visualization tools: a correlation heatmap and a pairwise scatter plot.
 
  First, we use the CORR procedure to create the correlation matrix.


In [6]:
title2 'Output from the CORR procedure';
ods output PearsonCorr=corr;
proc corr data = heart_disease;
    var target--fbs;
run;

14 Variables: target sex age trestbps chol thalch oldpeak ca cp exang slope thal restecg fbs

Simple Statistics

| Variable | N | Mean | Std Dev | Sum | Minimum | Maximum |
| --- | --- | --- | --- | --- | --- | --- |
| target | 304 | 0.45724 | 0.49899 | 139.0 | 0.0 | 1.0 |
| sex | 304 | 0.68092 | 0.46689 | 207.0 | 0.0 | 1.0 |
| age | 304 | 54.35197 | 9.15026 | 16523.0 | 28.0 | 77.0 |
| trestbps | 304 | 131.68421 | 17.57095 | 40032.0 | 94.0 | 200.0 |
| chol | 304 | 246.31579 | 52.10828 | 74880.0 | 126.0 | 564.0 |
| thalch | 304 | 149.72368 | 22.92726 | 45516.0 | 71.0 | 202.0 |
| oldpeak | 304 | 1.03618 | 1.16069 | 315.0 | 0.0 | 6.2 |
| ca | 299 | 0.67224 | 0.93744 | 201.0 | 0.0 | 3.0 |
| cp | 304 | 0.96053 | 1.03012 | 292.0 | 0.0 | 3.0 |
| exang | 304 | 0.32566 | 0.46939 | 99.0 | 0.0 | 1.0 |

Pearson Correlation Coefficients; Prob > |r| under H0: Rho=0; Number of Observations (each cell lists the coefficient, the p-value, and N; the p-value is blank on the diagonal)

| Variable | target | sex | age | trestbps | chol | thalch | oldpeak | ca | cp | exang | slope | thal | restecg | fbs |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| target | 1.00000  304 | 0.27414 <.0001 304 | 0.22847 <.0001 304 | 0.15090 0.0084 304 | 0.09102 0.1132 304 | -0.41962 <.0001 304 | 0.42607 <.0001 304 | 0.46044 <.0001 299 | -0.43348 <.0001 304 | 0.43305 <.0001 304 | -0.34116 <.0001 304 | -0.51025 <.0001 304 | -0.13455 0.0189 304 | 0.02648 0.6456 304 |
| sex | 0.27414 <.0001 304 | 1.00000  304 | -0.10264 0.0740 304 | -0.06462 0.2613 304 | -0.20313 0.0004 304 | -0.04495 0.4348 304 | 0.09994 0.0819 304 | 0.09318 0.1078 299 | -0.04686 0.4156 304 | 0.14440 0.0117 304 | -0.03525 0.5404 304 | -0.36331 <.0001 304 | -0.05733 0.3191 304 | 0.04687 0.4155 304 |
| age | 0.22847 <.0001 304 | -0.10264 0.0740 304 | 1.00000  304 | 0.28192 <.0001 304 | 0.22533 <.0001 304 | -0.40151 <.0001 304 | 0.20924 0.0002 304 | 0.36260 <.0001 299 | -0.06435 0.2634 304 | 0.09694 0.0916 304 | -0.16855 0.0032 304 | -0.10569 0.0657 304 | -0.10214 0.0754 304 | 0.12083 0.0352 304 |
| trestbps | 0.15090 0.0084 304 | -0.06462 0.2613 304 | 0.28192 <.0001 304 | 1.00000  304 | 0.12977 0.0236 304 | -0.04566 0.4276 304 | 0.18920 0.0009 304 | 0.09877 0.0882 299 | 0.04161 0.4698 304 | 0.06493 0.2591 304 | -0.11750 0.0406 304 | -0.12301 0.0320 304 | -0.11919 0.0378 304 | 0.17542 0.0021 304 |
| chol | 0.09102 0.1132 304 | -0.20313 0.0004 304 | 0.22533 <.0001 304 | 0.12977 0.0236 304 | 1.00000  304 | -0.01457 0.8003 304 | 0.05262 0.3606 304 | 0.11900 0.0397 299 | -0.07558 0.1888 304 | 0.06581 0.2526 304 | -0.00303 0.9580 304 | -0.01165 0.8397 304 | -0.14210 0.0131 304 | 0.01278 0.8243 304 |
| thalch | -0.41962 <.0001 304 | -0.04495 0.4348 304 | -0.40151 <.0001 304 | -0.04566 0.4276 304 | -0.01457 0.8003 304 | 1.00000  304 | -0.34584 <.0001 304 | -0.26425 <.0001 299 | 0.29033 <.0001 304 | -0.37985 <.0001 304 | 0.38844 <.0001 304 | 0.25811 <.0001 304 | 0.03419 0.5526 304 | -0.00994 0.8629 304 |
| oldpeak | 0.42607 <.0001 304 | 0.09994 0.0819 304 | 0.20924 0.0002 304 | 0.18920 0.0009 304 | 0.05262 0.3606 304 | -0.34584 <.0001 304 | 1.00000  304 | 0.29583 <.0001 299 | -0.14344 0.0123 304 | 0.28966 <.0001 304 | -0.57874 <.0001 304 | -0.32197 <.0001 304 | -0.05002 0.3848 304 | 0.00697 0.9037 304 |
| ca | 0.46044 <.0001 299 | 0.09318 0.1078 299 | 0.36260 <.0001 299 | 0.09877 0.0882 299 | 0.11900 0.0397 299 | -0.26425 <.0001 299 | 0.29583 <.0001 299 | 1.00000  299 | -0.22253 0.0001 299 | 0.14557 0.0117 299 | -0.11012 0.0572 299 | -0.24131 <.0001 299 | -0.10789 0.0624 299 | 0.14548 0.0118 299 |
| cp | -0.43348 <.0001 304 | -0.04686 0.4156 304 | -0.06435 0.2634 304 | 0.04161 0.4698 304 | -0.07558 0.1888 304 | 0.29033 <.0001 304 | -0.14344 0.0123 304 | -0.22253 0.0001 299 | 1.00000  304 | -0.38968 <.0001 304 | 0.11343 0.0482 304 | 0.25312 <.0001 304 | 0.04432 0.4414 304 | 0.09706 0.0912 304 |
| exang | 0.43305 <.0001 304 | 0.14440 0.0117 304 | 0.09694 0.0916 304 | 0.06493 0.2591 304 | 0.06581 0.2526 304 | -0.37985 <.0001 304 | 0.28966 <.0001 304 | 0.14557 0.0117 299 | -0.38968 <.0001 304 | 1.00000  304 | -0.25937 <.0001 304 | -0.32223 <.0001 304 | -0.06389 0.2668 304 | 0.02659 0.6442 304 |
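
To make the coefficients in this matrix concrete, here is a minimal Python sketch of the Pearson correlation formula (covariance divided by the product of the standard deviations). It is not how PROC CORR is implemented, and the function name `pearson_r` is made up for illustration. The inputs are only the five rows printed earlier, so the result will not match the full-data coefficient.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares around the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# age and thalch (maximum heart rate) from the five printed rows only.
age = [63, 67, 67, 37, 41]
thalch = [150, 108, 129, 187, 172]
r = pearson_r(age, thalch)
print(f"r(age, thalch) on five rows: {r:.3f}")
```

On these five rows the coefficient is strongly negative, which agrees in sign with the full-data value of -0.40151 reported above.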


#### Sort and transpose the output from the CORR procedure for plotting a heatmap.

In [7]:
proc sort data=Corr;
    by variable;
run;
proc transpose data=Corr out=Corr_trans(rename=(COL1=Corr)) name=Correlation;
    var target--fbs;
    by variable;
run;
proc sort data=Corr_trans;
    by variable correlation;
run;

NOTE: There were 14 observations read from the data set WORK.CORR.
NOTE: The data set WORK.CORR has 14 observations and 43 variables.
NOTE: There were 14 observations read from the data set WORK.CORR.
NOTE: The data set WORK.CORR_TRANS has 196 observations and 3 variables.
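
The sort and transpose steps turn the 14 x 14 correlation matrix into one row per (Variable, Correlation) pair, which is the long layout a heatmap needs. The Python sketch below illustrates that reshape on a placeholder matrix. The identity matrix is a stand-in for the real coefficients, not actual output.

```python
variables = [
    "target", "sex", "age", "trestbps", "chol", "thalch", "oldpeak",
    "ca", "cp", "exang", "slope", "thal", "restecg", "fbs",
]

# Placeholder wide matrix (1.0 on the diagonal), standing in for WORK.CORR.
wide = {
    row: {col: (1.0 if row == col else 0.0) for col in variables}
    for row in variables
}

# Wide to long: one record per pair, ordered by variable and then by correlation name.
long_rows = [
    {"Variable": row, "Correlation": col, "Corr": wide[row][col]}
    for row in sorted(wide)
    for col in sorted(wide[row])
]

print(len(long_rows))
```

The 196 records (14 x 14) match the observation count that the log reports for WORK.CORR_TRANS.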

#### Use the SGPLOT procedure to produce the heatmap.
 - A few variables show moderate positive or negative correlation with "target"
   (heart disease). For example, "ca" (0.46), "exang" (0.43), and "oldpeak" (0.43)
   are positively correlated with "target", while "thal" (-0.51) and "thalch" (-0.42)
   are negatively correlated with it.
 - Some variables are also correlated with each other. For example,
   "slope" has a fairly strong negative correlation with "oldpeak" (-0.58).

In [8]:
title2 'Heatmap of the correlation matrix';
proc sgplot data=Corr_trans noautolegend;
    heatmap x=variable y=Correlation / colorresponse=Corr discretex discretey x2axis;
    text x=Variable y=Correlation text=Corr  / textattrs=(size=5pt) x2axis;
    label correlation='Pearson Correlation';
    yaxis reverse display=(nolabel);
    x2axis display=(nolabel);
    gradlegend;
run;



#### Scatter Plot
The next tool for visualizing these relationships is the pairwise scatter plot matrix. Coloring the plotted points by "target" shows whether the "With heart disease" and "No heart disease" groups are distributed differently in each panel.

For demonstration purposes, we pick only the variables that show the strongest correlation with "target". In the pairwise scatter plots, the two groups ("With heart disease" vs. "No heart disease") appear to follow different distribution patterns.

In [9]:
title2 'Pairwise scatter plots for selected variables';
proc sgscatter data=heart_disease;
     matrix ca exang oldpeak thal thalch /group=target diagonal=(histogram kernel);
run;

/* Remove the display format from target so that later steps see the raw 0/1 values */
data heart_disease;
   set heart_disease;
   format target;
run;

In [15]:
cas;
caslib _all_ assign;
data casuser.heart_disease;
set heart_disease;
run;

NOTE: The session name identified with the SESSREF= SAS option is connected to Cloud Analytic Services. The default value for SESSREF= is CASAUTO.
NOTE: A SAS Library associated with a caslib can only reference library member names that conform to SAS Library naming conventions.
NOTE: CASLIB ACADEMIC for session CASAUTO will be mapped to SAS Library ACADEMIC.
NOTE: CASLIB ADML for session CASAUTO will be mapped to SAS Library ADML.
NOTE: CASLIB CASUSER(rachel.mclawhon@sas.com) for session CASAUTO will be mapped to SAS Library CASUSER.
NOTE: CASLIB CPML for session CASAUTO will be mapped to SAS Library CPML.
NOTE: CASLIB CRVA83

#### Partition
 Partition the data into training and test sets.
 It is common to split the input data into training and test data. The
 PARTITION procedure flags a random 80% sample of HEART_DISEASE with
 _PartInd_=1, and two DATA steps then split the flagged table into
 HEART_DISEASE_TRAIN (the flagged 80%) and HEART_DISEASE_TEST (the remaining 20%).

In [22]:
title2 'Create training and test data sets with the PARTITION procedure';
proc partition data=casuser.heart_disease seed=12345
    partind samppct=80;
    output out=casuser.heart_disease_part;
run;

data casuser.heart_disease_train(drop=_partind_);
    set casuser.heart_disease_part(where=(_partind_=1));
run;

data casuser.heart_disease_test(drop=_partind_);
    set casuser.heart_disease_part(where=(_partind_~=1));
run;




Simple Random Sampling Frequency

| Number of Obs | Number of Samples |
| --- | --- |
| 304 | 243 |

Output CAS Tables

| CAS Library | Name | Number of Rows | Number of Columns |
| --- | --- | --- | --- |
| CASUSER(rachel.mclawhon@sas.com) | HEART_DISEASE_PART | 304 | 15 |
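
The row counts above follow from simple random sampling without replacement: 80% of 304 rows rounds to 243 training rows, which leaves 61 for the test set. The Python sketch below mimics the indicator-column approach under that assumption. It reuses the seed value 12345, but Python's generator differs from the one in PROC PARTITION, so the selected rows will not be the same.

```python
import random

n_rows = 304
sample_pct = 80
n_train = round(n_rows * sample_pct / 100)  # 243 of 304 rows

# Same seed value as the SAS step, but a different random number generator.
rng = random.Random(12345)
train_idx = set(rng.sample(range(n_rows), n_train))

# Indicator column analogous to _PartInd_: 1 = sampled (training), 0 = not sampled.
part_ind = [1 if i in train_idx else 0 for i in range(n_rows)]

train_rows = [i for i, flag in enumerate(part_ind) if flag == 1]
test_rows = [i for i, flag in enumerate(part_ind) if flag != 1]

print(len(train_rows), len(test_rows))
```

Keeping a single flagged table and filtering it twice guarantees that the two subsets are disjoint and together cover every row.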


#### Modeling
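 Finally, we show five classification modeling techniques using five SAS procedures: LOGSELECT, TREESPLIT, GRADBOOST, FOREST, and SVMACHINE. Each model uses the predictors in the variable range age--fbs. Because sex precedes age in the data set, it is not among the predictors.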

#### LOGSELECT procedure

In [26]:
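title2 'Build classification model using PROC LOGSELECT';
/* Capture the fit statistics so the training misclassification rate can be read back */
ods output FitStatistics=logfitstat;
proc logselect data=casuser.heart_disease_train technique=lbfgs maxiter=1000 partfit;
    class target ca--fbs;
    model target = age--fbs;
    /* Save the fitted model as an analytic store for scoring */
    savestate rstore=casuser.logstore;
run;

/* Training accuracy = 1 - misclassification rate */
data _null_;
    set logfitstat;
    if rowid = 'MISCLASS' then
        call symputx('acc_train_logselect', (1-value));
run;

title3 'Score the model with ASTORE for the test data';
/* Score the test data and keep the actual target next to the prediction */
proc astore;
    score data=casuser.heart_disease_test rstore=casuser.logstore out=casuser.log_scoreout copyvars=(target);
run;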



| Model Information | |
| --- | --- |
| Data Source | HEART_DISEASE_TRAIN |
| Response Variable | target |
| Distribution | Binary |
| Link Function | Logit |
| Optimization Technique | Limited-memory BFGS |

| | |
| --- | --- |
| Number of Observations Read | 243 |
| Number of Observations Used | 240 |

Response Profile

| Ordered Value | target | Total Frequency |
| --- | --- | --- |
| 1 | 0 | 129 |
| 2 | 1 | 111 |

Class Level Information

| Class | Levels | Values |
| --- | --- | --- |
| ca | 4 | 0 1 2 3 |
| cp | 4 | 0 1 2 3 |
| exang | 2 | 0 1 |
| slope | 3 | 0 1 2 |
| thal | 3 | 0 1 2 |
| restecg | 3 | 0 1 2 |
| fbs | 2 | 0 1 |

Convergence criterion (FCONV=1E-7) satisfied.

| Dimensions | |
| --- | --- |
| Columns in Design | 27 |
| Number of Effects | 13 |
| Max Effect Columns | 4 |
| Rank of Design | 20 |
| Parameters in Optimization | 20 |

Testing Global Null Hypothesis: BETA=0

| Test | DF | Chi-Square | Pr > ChiSq |
| --- | --- | --- | --- |
| Likelihood Ratio | 19 | 175.6922 | <.0001 |

| Fit Statistics | |
| --- | --- |
| -2 Log Likelihood | 155.66719 |
| AIC (smaller is better) | 195.66719 |
| AICC (smaller is better) | 199.50281 |
| SBC (smaller is better) | 265.27997 |
| Average Square Error | 0.0957 |
| -2 Log L (Intercept-only) | 331.35938 |
| R-Square | 0.51908 |
| Max-rescaled R-Square | 0.69341 |
| McFadden's R-Square | 0.53022 |
| Misclassification Rate | 0.12083 |

Parameter Estimates

| Parameter | DF | Estimate | Standard Error | Chi-Square | Pr > ChiSq |
| --- | --- | --- | --- | --- | --- |
| Intercept | 1 | 0.61824 | 7.247215 | 0.0073 | 0.9320 |
| age | 1 | 0.044787 | 0.028791 | 2.4198 | 0.1198 |
| trestbps | 1 | -0.030386 | 0.012477 | 5.9309 | 0.0149 |
| chol | 1 | -0.003138 | 0.004175 | 0.5650 | 0.4523 |
| thalch | 1 | 0.026879 | 0.013078 | 4.2245 | 0.0398 |
| oldpeak | 1 | -0.520137 | 0.270440 | 3.6991 | 0.0544 |
| ca 0 | 1 | 2.273981 | 1.331737 | 2.9157 | 0.0877 |
| ca 1 | 1 | 0.011708 | 1.341405 | 0.0001 | 0.9930 |
| ca 2 | 1 | -0.712074 | 1.494391 | 0.2271 | 0.6337 |
| ca 3 | 0 | 0.0 | . | . | . |

Task Timing

| Task | Seconds | Percent |
| --- | --- | --- |
| Setup and Parsing | 0.0 | 3.19% |
| Levelization | 0.0 | 1.36% |
| Model Initialization | 0.0 | 0.80% |
| SSCP Computation | 0.0 | 0.70% |
| Model Fitting | 0.13 | 89.19% |
| Model Storing | 0.01 | 4.74% |
| Display | 0.0 | 0.01% |
| Cleanup | 0.0 | 0.00% |
| Total | 0.14 | 100.00% |

Output CAS Tables

| CAS Library | Name | Number of Rows | Number of Columns |
| --- | --- | --- | --- |
| CASUSER(rachel.mclawhon@sas.com) | LOGSTORE | 1 | 2 |

Output CAS Tables

| CAS Library | Name | Number of Rows | Number of Columns |
| --- | --- | --- | --- |
| CASUSER(rachel.mclawhon@sas.com) | LOG_SCOREOUT | 61 | 4 |

Task Timing

| Task | Seconds | Percent |
| --- | --- | --- |
| Loading the Store | 0.0 | 0.50% |
| Creating the State | 0.01 | 64.34% |
| Scoring | 0.0 | 32.96% |
| Total | 0.01 | 100.00% |
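
Several entries in the LOGSELECT fit statistics can be recomputed from the two -2 log-likelihood values, the 20 optimized parameters, and the 240 observations used. The Python sketch below applies the textbook definitions of these criteria. That PROC LOGSELECT uses exactly these formulas is an assumption, although the results agree with the reported values to the printed precision.

```python
import math

# Values copied from the LOGSELECT output.
neg2ll_model = 155.66719   # -2 Log Likelihood
neg2ll_null = 331.35938    # -2 Log L (Intercept-only)
n_params = 20              # Parameters in Optimization
n_used = 240               # Number of Observations Used

# Likelihood ratio chi-square: drop in -2 log-likelihood relative to the intercept-only model.
lr_chi_square = neg2ll_null - neg2ll_model

# Information criteria (textbook definitions).
aic = neg2ll_model + 2 * n_params
aicc = aic + 2 * n_params * (n_params + 1) / (n_used - n_params - 1)
sbc = neg2ll_model + n_params * math.log(n_used)

# McFadden's pseudo R-square.
mcfadden_r2 = 1 - neg2ll_model / neg2ll_null

# Training accuracy as stored in the acc_train_logselect macro variable.
train_accuracy = 1 - 0.12083

for name, value in [("LR chi-square", lr_chi_square), ("AIC", aic), ("AICC", aicc),
                    ("SBC", sbc), ("McFadden R2", mcfadden_r2),
                    ("Training accuracy", train_accuracy)]:
    print(f"{name}: {value:.5f}")
```

Note that only 240 of the 243 training rows are used. This is consistent with `ca` having missing values in the full data (N = 299 in the correlation output).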


#### Compute accuracy score:
The accuracy is the proportion of patients in the test data whose predicted heart disease status matches their actual status.

In [29]:
data _null_;
    retain matchSum 0;
    set casuser.log_scoreout end=last;
    match = (I_target = target);
    matchSum + match;
    if last then call symputx ('acc_test_logselect', (matchSum/_n_));
run;



NOTE: There were 61 observations read from the data set CASUSER.LOG_SCOREOUT.
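
The DATA step above counts matching rows and divides by the number of rows read. The same calculation in Python might look like the sketch below. The function name `accuracy` and the two label lists are hypothetical and are not taken from LOG_SCOREOUT.

```python
def accuracy(predicted, actual):
    """Proportion of rows whose predicted label equals the actual label."""
    matches = sum(1 for p, a in zip(predicted, actual) if p == a)
    return matches / len(actual)

# Hypothetical labels for eight patients (illustration only).
actual = [0, 1, 1, 0, 1, 0, 0, 1]
predicted = [0, 1, 0, 0, 1, 1, 0, 1]

acc = accuracy(predicted, actual)
print(f"accuracy = {acc:.2f}")
```

Because the divisor is the total row count, a row with a missing prediction counts as a miss rather than being dropped.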




####  TREESPLIT procedure

In [31]:
title2 'Build classification model using PROC TREESPLIT';
ods output treeperformance=treestat;
proc treesplit data=casuser.heart_disease_train;
    class target ca--fbs;
    model target = age--fbs;
    prune c45;
    savestate rstore=casuser.dtstore;
run;

data _null_;
    set treestat;
    call symputx('acc_train_treesplit', (1-MiscRate));
run;

title3 'Score the model with ASTORE for the test data';
proc astore;
    score data=casuser.heart_disease_test rstore=casuser.dtstore out=casuser.dt_scoreout copyvars=(target);
run;



| Model Information | |
| --- | --- |
| Split Criterion | IGR |
| Pruning Method | C45 |
| Max Branches per Node | 2 |
| Max Tree Depth | 10 |
| Tree Depth Before Pruning | |
| Tree Depth After Pruning | 8 |
| Number of Leaves Before Pruning | |
| Number of Leaves After Pruning | 16 |

| | Training |
| --- | --- |
| Number of Observations Read | 243 |
| Number of Observations Used | 243 |

Fit Statistics for Selected Tree

| | Number of Leaves | Misclassification Rate |
| --- | --- | --- |
| Training | 16 | 0.0988 |

Variable Importance

| Variable | Training Importance | Training Relative Importance | Count |
| --- | --- | --- | --- |
| thal | 21.2333 | 1.0 | 2 |
| oldpeak | 17.7536 | 0.8361 | 1 |
| ca | 10.678 | 0.5029 | 2 |
| thalch | 8.1056 | 0.3817 | 1 |
| age | 7.6147 | 0.3586 | 3 |
| cp | 6.2203 | 0.293 | 3 |
| chol | 4.6694 | 0.2199 | 1 |
| exang | 2.2222 | 0.1047 | 1 |
| trestbps | 1.8667 | 0.0879 | 1 |

Output CAS Tables

| CAS Library | Name | Number of Rows | Number of Columns |
| --- | --- | --- | --- |
| CASUSER(rachel.mclawhon@sas.com) | DT_SCOREOUT | 61 | 5 |

Task Timing

| Task | Seconds | Percent |
| --- | --- | --- |
| Loading the Store | 0.0 | 1.04% |
| Creating the State | 0.0 | 32.96% |
| Scoring | 0.0 | 61.67% |
| Total | 0.01 | 100.00% |
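
The relative importance column in the TREESPLIT output appears to be each variable's importance divided by the largest importance (`thal`). The Python sketch below checks that reading against the reported numbers. It assumes this normalization and should reproduce the column up to rounding.

```python
# Training importance values copied from the Variable Importance table.
importance = {
    "thal": 21.2333, "oldpeak": 17.7536, "ca": 10.678, "thalch": 8.1056,
    "age": 7.6147, "cp": 6.2203, "chol": 4.6694, "exang": 2.2222,
    "trestbps": 1.8667,
}

# Normalize by the most important variable.
top = max(importance.values())
relative = {name: round(value / top, 4) for name, value in importance.items()}

# Training accuracy as stored in the acc_train_treesplit macro variable.
train_accuracy = 1 - 0.0988

for name, value in relative.items():
    print(f"{name:10s} {value:.4f}")
print(f"training accuracy = {train_accuracy:.4f}")
```

A relative importance of 1.0 therefore marks only the top-ranked variable. It does not measure how much of the prediction that variable explains.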


#### Compute accuracy score:
The accuracy is the proportion of patients in the test data whose predicted heart disease status matches their actual status.

In [32]:
data _null_;
    retain matchSum 0;
    set casuser.dt_scoreout(keep=I_target target) end=last;
    match = (I_target = target);
    matchSum + match;
    if last then call symputx ('acc_test_treesplit', (matchSum/_n_));
run;



NOTE: Character values have been converted to numeric values at the places given by: (Line):(Column).
      683:14   (the I_target operand in "match = (I_target = target);")
NOTE: There were 61 observations read from the data set CASUSER.DT_SCOREOUT.




#### GRADBOOST procedure

In [34]:
title2 'Build classification model using PROC GRADBOOST';
ods output FitStatistics=gbfitstat;
proc gradboost data=casuser.heart_disease_train;
    input age--oldpeak / level=interval;
    input ca--fbs / level=nominal;
    target target / level=nominal;
    savestate rstore=casuser.gbstore;
run;

data _null_;
    set gbfitstat end=last;
    if last then
       call symputx('acc_train_gradboost', (1-MiscTrain));
run;

title3 'Score the model with ASTORE for the test data';
proc astore;
    score data=casuser.heart_disease_test rstore=casuser.gbstore out=casuser.gb_scoreout copyvars=(target);
run;



| Model Information | |
| --- | --- |
| Number of Trees | 100.0 |
| Learning Rate | 0.1 |
| Subsampling Rate | 0.5 |
| Number of Variables Per Split | 12.0 |
| Number of Bins | 50.0 |
| Number of Input Variables | 12.0 |
| Maximum Number of Tree Nodes | 25.0 |
| Minimum Number of Tree Nodes | 9.0 |
| Maximum Number of Branches | 2.0 |
| Minimum Number of Branches | 2.0 |

| | Training |
| --- | --- |
| Number of Observations Read | 243 |
| Number of Observations Used | 243 |

Variable Importance

| Variable | Importance | Std Dev Importance | Relative Importance |
| --- | --- | --- | --- |
| thal | 0.9887 | 2.6173 | 1.0 |
| cp | 0.9184 | 1.8216 | 0.9289 |
| ca | 0.8953 | 2.0225 | 0.9056 |
| age | 0.7413 | 1.0013 | 0.7497 |
| thalch | 0.7366 | 0.8573 | 0.745 |
| oldpeak | 0.6984 | 1.2011 | 0.7064 |
| chol | 0.6335 | 0.8217 | 0.6407 |
| trestbps | 0.3999 | 0.4736 | 0.4045 |
| exang | 0.24 | 0.6923 | 0.2427 |
| slope | 0.2392 | 0.4945 | 0.2419 |

Fit Statistics

| Number of Trees | Training Average Square Error | Training Misclassification Rate | Training Log Loss |
| --- | --- | --- | --- |
| 1 | 0.2267 | 0.2634 | 0.646 |
| 2 | 0.208 | 0.2058 | 0.608 |
| 3 | 0.1914 | 0.1646 | 0.574 |
| 4 | 0.1785 | 0.1646 | 0.547 |
| 5 | 0.166 | 0.1317 | 0.52 |
| 6 | 0.156 | 0.1276 | 0.497 |
| 7 | 0.1475 | 0.1111 | 0.477 |
| 8 | 0.1388 | 0.1235 | 0.457 |
| 9 | 0.1335 | 0.1276 | 0.444 |
| 10 | 0.1271 | 0.1235 | 0.428 |

Output CAS Tables

| CAS Library | Name | Number of Rows | Number of Columns |
| --- | --- | --- | --- |
| CASUSER(rachel.mclawhon@sas.com) | GB_SCOREOUT | 61 | 5 |

Task Timing

| Task | Seconds | Percent |
| --- | --- | --- |
| Loading the Store | 0.0 | 0.98% |
| Creating the State | 0.0 | 34.36% |
| Scoring | 0.0 | 60.72% |
| Total | 0.01 | 100.00% |
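
The GRADBOOST fit statistics track average square error and log loss as trees are added. The Python sketch below uses the usual definitions of both metrics for a binary target. It assumes the SAS columns follow the same definitions, and the labels and probabilities are made up for illustration.

```python
import math

def average_square_error(y_true, p_event):
    """Mean squared difference between the 0/1 label and the predicted event probability."""
    return sum((y - p) ** 2 for y, p in zip(y_true, p_event)) / len(y_true)

def log_loss(y_true, p_event, eps=1e-15):
    """Mean negative log-likelihood of the labels under the predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_event):
        p = min(max(p, eps), 1 - eps)  # guard against log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Hypothetical labels and predicted probabilities of heart disease.
y_true = [1, 0, 1, 0]
p_event = [0.9, 0.2, 0.6, 0.4]

ase = average_square_error(y_true, p_event)
ll = log_loss(y_true, p_event)
baseline = log_loss(y_true, [0.5] * len(y_true))  # constant 0.5 prediction gives ln 2

print(f"ASE = {ase:.4f}, log loss = {ll:.4f}, baseline log loss = {baseline:.4f}")
```

A constant prediction of 0.5 yields a log loss of ln 2 (about 0.693), which gives a reference point for the 0.646 reported after the first tree.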


#### Compute accuracy score:
The accuracy is the proportion of patients in the test data whose predicted heart disease status matches their actual status.

In [35]:
data _null_;
    retain matchSum 0;
    set casuser.gb_scoreout(keep=I_target target) end=last;
    match = (I_target = target);
    matchSum + match;
    if last then call symputx ('acc_test_gradboost', (matchSum/_n_));
run;



NOTE: Character values have been converted to numeric values at the places given by: (Line):(Column).
      746:14   (the I_target operand in "match = (I_target = target);")
NOTE: There were 61 observations read from the data set CASUSER.GB_SCOREOUT.




#### FOREST procedure


In [36]:
title2 'Build classification model using PROC FOREST';
ods output modelInfo=forestModel;
proc forest data=casuser.heart_disease_train;
    input age--oldpeak / level=interval;
    input ca--fbs / level=nominal;
    target target / level=nominal;
    savestate rstore=casuser.forstore;
run;

data _null_;
    set forestModel;
    if prxmatch('m/misclassification/i', description) then
       call symputx('acc_train_forest', (1-value));
run;

title3 'Score the model with ASTORE for the test data';
proc astore;
    score data=casuser.heart_disease_test rstore=casuser.forstore out=casuser.for_scoreout copyvars=(target);
run;



| Model Information | |
| --- | --- |
| Number of Trees | 100.0 |
| Number of Variables Per Split | 4.0 |
| Seed | 382667181.0 |
| Bootstrap Percentage | 60.0 |
| Number of Bins | 50.0 |
| Number of Input Variables | 12.0 |
| Maximum Number of Tree Nodes | 39.0 |
| Minimum Number of Tree Nodes | 21.0 |
| Maximum Number of Branches | 2.0 |
| Minimum Number of Branches | 2.0 |

| | Training |
| --- | --- |
| Number of Observations Read | 243 |
| Number of Observations Used | 243 |

Variable Importance

| Variable | Importance | Std Dev Importance | Relative Importance |
| --- | --- | --- | --- |
| thal | 9.5916 | 7.9085 | 1.0 |
| ca | 7.7555 | 6.6327 | 0.8086 |
| cp | 6.9495 | 7.0 | 0.7245 |
| oldpeak | 6.6517 | 5.5807 | 0.6935 |
| thalch | 5.4689 | 5.5065 | 0.5702 |
| age | 3.8725 | 4.5777 | 0.4037 |
| chol | 3.0794 | 2.5989 | 0.3211 |
| trestbps | 2.7227 | 2.7556 | 0.2839 |
| slope | 1.9575 | 2.8938 | 0.2041 |
| exang | 1.7938 | 3.8499 | 0.187 |

Fit Statistics

| Number of Trees | OOB Average Square Error | Training Average Square Error | OOB Misclassification Rate | Training Misclassification Rate | OOB Log Loss | Training Log Loss |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 0.229 | 0.1665 | 0.283 | 0.214 | 2.844 | 1.723 |
| 2 | 0.222 | 0.127 | 0.268 | 0.1811 | 2.923 | 0.713 |
| 3 | 0.213 | 0.1175 | 0.302 | 0.1646 | 2.34 | 0.446 |
| 4 | 0.188 | 0.1019 | 0.258 | 0.1399 | 1.74 | 0.323 |
| 5 | 0.18 | 0.0973 | 0.258 | 0.1152 | 1.447 | 0.317 |
| 6 | 0.169 | 0.094 | 0.257 | 0.1235 | 1.081 | 0.314 |
| 7 | 0.168 | 0.092 | 0.235 | 0.0988 | 1.077 | 0.311 |
| 8 | 0.165 | 0.0943 | 0.239 | 0.1111 | 0.985 | 0.317 |
| 9 | 0.16 | 0.0937 | 0.218 | 0.1152 | 0.808 | 0.316 |
| 10 | 0.156 | 0.0911 | 0.21 | 0.1029 | 0.798 | 0.311 |

Output CAS Tables

| CAS Library | Name | Number of Rows | Number of Columns |
| --- | --- | --- | --- |
| CASUSER(rachel.mclawhon@sas.com) | FOR_SCOREOUT | 61 | 5 |

Task Timing

| Task | Seconds | Percent |
| --- | --- | --- |
| Loading the Store | 0.0 | 0.92% |
| Creating the State | 0.0 | 35.60% |
| Scoring | 0.01 | 59.79% |
| Total | 0.01 | 100.00% |
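
The FOREST fit statistics report both out-of-bag (OOB) and training misclassification rates. The Python sketch below compares the two for the ten rows displayed above, as a rough way to see how optimistic the training error is. Only the displayed rows are used, so it says nothing about the later trees.

```python
# (number of trees, OOB misclassification rate, training misclassification rate)
rows = [
    (1, 0.283, 0.214), (2, 0.268, 0.1811), (3, 0.302, 0.1646),
    (4, 0.258, 0.1399), (5, 0.258, 0.1152), (6, 0.257, 0.1235),
    (7, 0.235, 0.0988), (8, 0.239, 0.1111), (9, 0.218, 0.1152),
    (10, 0.21, 0.1029),
]

# Gap between the OOB error and the training error at each forest size.
gaps = [(trees, round(oob - train, 4)) for trees, oob, train in rows]

# Forest size with the lowest OOB error among the displayed rows.
best_trees, best_oob = min(
    ((trees, oob) for trees, oob, _ in rows), key=lambda pair: pair[1]
)

for trees, gap in gaps:
    print(f"{trees:2d} trees: OOB - training = {gap:.4f}")
print(f"lowest displayed OOB error: {best_oob} at {best_trees} trees")
```

In every displayed row the OOB error is higher than the training error, which is why the test-set accuracy computed next is a more reliable guide than the training accuracy.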


#### Compute accuracy score:
The accuracy is the proportion of patients in the test data whose predicted heart disease status matches their actual status.

In [37]:
data _null_;
    retain matchSum 0;
    set casuser.for_scoreout(keep=I_target target) end=last;
    match = (I_target = target);
    matchSum + match;
    if last then call symputx ('acc_test_forest', (matchSum/_n_));
run;




NOTE: Character values have been converted to numeric values at the places given by: (Line):(Column).
      784:14   (the I_target operand in "match = (I_target = target);")
NOTE: There were 61 observations read from the data set CASUSER.FOR_SCOREOUT.




#### SVMACHINE procedure


In [38]:
title2 'Build classification model using PROC SVMACHINE';
ods output FitStatistics=svmstat;
proc svmachine data=casuser.heart_disease_train;
    input age--oldpeak / level=interval;
    input ca--fbs / level=nominal;
    target target / level=nominal;
    savestate rstore=casuser.svmstore;
run;

data _null_;
    set svmstat;
    if statistic = 'Accuracy' then
       call symputx('acc_train_svmachine', training);
run;

title3 'Score the model with ASTORE for the test data';
proc astore;
    score data=casuser.heart_disease_test rstore=casuser.svmstore out=casuser.svm_scoreout copyvars=(target);
run;


Model Information,Model Information.1
Task Type,C_CLAS
Optimization Technique,Interior Point
Scale,YES
Kernel Function,Linear
Penalty Method,C
Penalty Parameter,1
Maximum Iterations,25
Tolerance,1e-06

0,1
Number of Observations Read,243
Number of Observations Used,240

Training Results

| Description | Value |
| --- | --- |
| Inner Product of Weights | 6.42612887 |
| Bias | -0.58567 |
| Total Slack (Constraint Violations) | 83.180748 |
| Norm of Longest Vector | 3.04994215 |
| Number of Support Vectors | 100 |
| Number of Support Vectors on Margin | 82 |
| Maximum F | 3.03933673 |
| Minimum F | -3.4404135 |
| Number of Effects | 12 |
| Columns in Data Matrix | 26 |

Iteration History

| Iteration | Complementarity | Feasibility |
| --- | --- | --- |
| 1 | 1002169.0686 | 72048.095028 |
| 2 | 656.17292969 | 11.259101315 |
| 3 | 31.377090487 | 1.1259102e-07 |
| 4 | 0.874266041 | 1.3016352e-09 |
| 5 | 0.2181840299 | 1.870324e-10 |
| 6 | 0.1076473257 | 7.741563e-11 |
| 7 | 0.0536174885 | 2.890632e-11 |
| 8 | 0.0300194091 | 1.306466e-11 |
| 9 | 0.0188295598 | 6.894596e-12 |
| 10 | 0.0041353221 | 5.645484e-13 |

Misclassification Matrix (rows: observed class, columns: training prediction)

| Observed | Predicted 1 | Predicted 0 | Total |
| --- | --- | --- | --- |
| 1 | 85 | 26 | 111 |
| 0 | 8 | 121 | 129 |
| Total | 93 | 147 | 240 |

Fit Statistics

| Statistic | Training |
| --- | --- |
| Accuracy | 0.8583 |
| Error | 0.1417 |
| Sensitivity | 0.7658 |
| Specificity | 0.9380 |
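
As a check on how these statistics are defined, they can be reproduced from the training misclassification matrix. The sketch below assumes that class 1 (with heart disease) is treated as the positive class; that assumption is consistent with the reported values.

$$
\begin{aligned}
\text{Accuracy} &= \frac{85 + 121}{240} = \frac{206}{240} \approx 0.8583 \\
\text{Error} &= \frac{26 + 8}{240} = \frac{34}{240} \approx 0.1417 \\
\text{Sensitivity} &= \frac{85}{111} \approx 0.7658 \\
\text{Specificity} &= \frac{121}{129} \approx 0.9380
\end{aligned}
$$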

Output CAS Tables

| CAS Library | Name | Number of Rows | Number of Columns |
| --- | --- | --- | --- |
| CASUSER(rachel.mclawhon@sas.com) | SVMSTORE | 1 | 2 |

Output CAS Tables

| CAS Library | Name | Number of Rows | Number of Columns |
| --- | --- | --- | --- |
| CASUSER(rachel.mclawhon@sas.com) | SVM_SCOREOUT | 61 | 6 |

Task Timing

| Task | Seconds | Percent |
| --- | --- | --- |
| Loading the Store | 0.0 | 1.57% |
| Creating the State | 0.0 | 37.00% |
| Scoring | 0.0 | 57.18% |
| Total | 0.01 | 100.00% |


#### Compute accuracy score:
 The accuracy score is the percentage of patients in the test data whose predicted heart disease status matched their actual status.

In [39]:
data _null_;
    retain matchSum 0;
    set casuser.svm_scoreout(keep=I_target target) end=last;
    match = (I_target = target);
    matchSum + match;
    if last then call symputx ('acc_test_svmachine', (matchSum/_n_));
run;



819  data _null_;
820      retain matchSum 0;
821      set casuser.svm_scoreout(keep=I_target target) end=last;
822      match = (I_target = target);
823      matchSum + match;
824      if last then call symputx ('acc_test_svmachine', (matchSum/_n_));
825  run;

NOTE: Character values have been converted to numeric values at the places given by: (Line):(Column).
      822:14
NOTE: There were 61 observations read from the data set CASUSER.SVM_SCOREOUT.
NOTE: DATA statement used (Total process time):
      real time           0.02 seconds
      cpu time            0.02 seconds
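
The log note about character values being converted to numeric values points at the comparison of `I_target` with `target`. A minimal sketch of the same accuracy calculation with an explicit conversion follows. It assumes that `I_target` is the character column and that `target` is numeric, which the log does not state directly. The macro variable name `acc_test_svmachine_explicit` is hypothetical and is used only so that the recorded value is not overwritten.

```sas
/* Sketch only: assumes I_target is character and target is numeric. */
data _null_;
    set casuser.svm_scoreout(keep=I_target target) end=last;
    /* Convert the predicted label to numeric before comparing */
    match = (input(I_target, best32.) = target);
    /* The sum statement retains matchSum across observations */
    matchSum + match;
    /* acc_test_svmachine_explicit is a hypothetical name */
    if last then call symputx('acc_test_svmachine_explicit', matchSum / _n_);
run;

%put NOTE: explicit-conversion test accuracy = &acc_test_svmachine_explicit;
```

If the assumption holds, the value should equal the one stored in `acc_test_svmachine`, and the conversion note should no longer appear in the log.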




#### Comparison:
 We have now fit and scored models with five different SAS Viya procedures
 on the HEART_DISEASE data set, recording the training and test accuracy
 for each. The cell below collects the recorded accuracy values into one
 data set and displays them in a clustered bar chart for comparison.

In [42]:

%macro CreateComparison;
    %local allprocs i currentProc;
    %let allprocs = logselect treesplit gradboost forest svmachine;
    /* One row per procedure and data split (train or test) */
    data allMethods;
        length procname $16 type $8;
        %do i = 1 %to %sysfunc(countw(&allprocs));
            %let currentProc = %scan(&allprocs,&i);
            procname = "&currentProc";
            type = "train";
            /* Double ampersand: build the macro variable name first, then resolve it */
            accuracy = &&acc_train_&currentProc;
            output;
            type = "test";
            accuracy = &&acc_test_&currentProc;
            output;
        %end;
    run;
    /* Clustered bars: train and test accuracy side by side for each procedure */
    proc sgplot data=allMethods;
        vbar procname / response=accuracy group=type nostatlabel datalabel
                    groupdisplay=cluster dataskin=pressed;
        xaxis display=(nolabel);
        yaxis grid;
    run;
%mend;

title2 'Compare accuracy across all 5 procedures';
%CreateComparison;

title;
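
The bar chart is convenient for a visual comparison, but a table makes the gap between training and test accuracy easier to read. The sketch below assumes the `allMethods` data set created by `%CreateComparison` still exists in the WORK library. The names `allMethodsSorted`, `accuracyWide`, and `gap` are hypothetical and are not part of the original example.

```sas
/* Sketch only: allMethodsSorted, accuracyWide, and gap are hypothetical names. */
proc sort data=allMethods out=allMethodsSorted;
    by procname;
run;

/* Pivot to one row per procedure, with columns named train and test */
proc transpose data=allMethodsSorted out=accuracyWide(drop=_name_);
    by procname;
    id type;
    var accuracy;
run;

/* A large positive gap can indicate overfitting to the training data */
data accuracyWide;
    set accuracyWide;
    gap = train - test;
    format train test gap 6.4;
run;

proc print data=accuracyWide noobs;
run;
```

With only 61 test observations, small differences in test accuracy between procedures correspond to only a few patients, so the ranking should be read with caution.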