<font face="Verdana, cursive, sans-serif" >
<center><H1>Parametric and Non-parametric tests</H1></center>

<center><H2><font color='darkred'>How to quickly identify potential changes in the behavior of a service provider</font></H2></center>

<p>This documentation is powered by <b>Jupyter Notebooks</b>. To learn more about how to code SAS in the Jupyter Notebooks environment, please refer to <a href="https://github.com/sassoftware/sas_kernel">SAS Kernel for Jupyter</a>. Please note that you do <b>not</b> need Python or Jupyter Notebooks to use these <b>SAS</b> macros.</p>

<p>Imagine you have thousands of service providers (hospitals, clinics, labs, garages, etc.). How can we prioritize them and pick the ones to audit or investigate?</p> 

<p>There are a number of reasons why the cost of services might change, whether up or down. But before we dive into the <b>why?</b>, we first need to identify the <b>who?</b>. Luckily, plenty of statistical tools can help with the latter question. The former is less straightforward, but still tractable with help from subject matter experts (SMEs).</p>

<H3>The Who?</H3>

The key to understanding data is to understand its distribution. The shape of a distribution is usually governed by a few parameters. The best-known distribution is the Gaussian distribution, also known as the normal distribution, whose shape is governed by its mean and variance.

Increasing or decreasing the mean moves the bell curve to the right or left, while a high or low variance makes the bell curve wide or narrow.
<img src="./images/normal.png" >
ref: https://en.wikipedia.org/wiki/Normal_distribution
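
To make the role of the two parameters concrete, the normal density can be written as follows, where $\mu$ is the mean and $\sigma^2$ is the variance (the symbols are introduced here for illustration and are not used elsewhere in this notebook):

$$
f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
$$

Changing $\mu$ only shifts the curve along the x-axis, while changing $\sigma$ stretches or squeezes it around $\mu$.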

So, if we would like to know whether 2 samples are randomly drawn from the same normally distributed population, we test whether the 2 distributions are statistically different.
<br>If the data is continuous and the normality assumption holds, the following common parametric tests are available:
<img src="./images/ttest.png" style="width: 50%">
ref: http://www.indiana.edu/~statmath/


<table >
  <tr>
    <th style="text-align: left; vertical-align: middle;">Parametric Tests</th>
    <th style="text-align: left; vertical-align: middle;">When and Why</th> 
  </tr>
  <tr>
    <td style="text-align: left; vertical-align: middle;">One sample t-test</td>
    <td style="text-align: left; vertical-align: middle;">You would like to test whether the mean of a sample is equal to a constant 'c'; that is, was this data randomly sampled from a population with a mean of 'c'?</td>
  </tr>
  <tr>
    <td style="text-align: left; vertical-align: middle;">Paired sample t-test</td>
    <td style="text-align: left; vertical-align: middle;">You have 2 samples and would like to test whether their means are the same, but the 2 samples are correlated/dependent. For example, a group of students takes a test, receives training, and retakes the test afterward; you then compare the scores before and after.</td>
  </tr>
  <tr>
    <td style="text-align: left; vertical-align: middle;">Independent sample t-test</td>
    <td style="text-align: left; vertical-align: middle;">You have 2 samples and would like to test whether their means are the same, and the 2 samples are not correlated with or dependent on one another. For example, you have a group of female students and a group of male students, and you would like to test whether the average test score of the female students is statistically different from that of the male students.</td>
  </tr>
  <tr>
    <td style="text-align: left; vertical-align: middle;">ANOVA</td>
    <td style="text-align: left; vertical-align: middle;">ANalysis Of VAriance. Instead of comparing the means of 2 samples, you can compare the means of multiple samples at the same time. Rejecting the null hypothesis means that at least one mean is statistically different from the others.</td>
  </tr>
</table>
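
As a rough illustration of the table above, the first three tests are available in PROC TTEST and a one-way ANOVA can be run with PROC GLM. This is only a sketch: the data set WORK.SCORES, its variables (Group, Student, Before, After) and the constant 70 standing in for 'c' are all made up for this example.

```sas
/* Hypothetical data: 3 groups of 30 students, scored before and after training */
DATA WORK.SCORES;
CALL STREAMINIT(2017);
DO Group = 'A', 'B', 'C';
	DO Student = 1 TO 30;
		Before = RAND('NORMAL', 70, 10);
		After  = Before + RAND('NORMAL', 3, 5);
		OUTPUT;
	END;
END;
RUN;

/* One sample t-test: is the mean of Before equal to the constant 70? */
PROC TTEST DATA=WORK.SCORES H0=70;
VAR Before;
RUN;

/* Paired sample t-test: same students, before vs. after */
PROC TTEST DATA=WORK.SCORES;
PAIRED Before*After;
RUN;

/* Independent sample t-test: two unrelated groups */
PROC TTEST DATA=WORK.SCORES;
WHERE Group IN ('A' 'B');
CLASS Group;
VAR After;
RUN;

/* One-way ANOVA: compare the means of all three groups at once */
PROC GLM DATA=WORK.SCORES;
CLASS Group;
MODEL After = Group;
RUN;
QUIT;
```
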

<b>But what if the normality assumption does not hold?</b>

Outliers can skew the distribution. Data transformation can bring the data closer to a normal distribution. Common methods include outlier removal, winsorization, and square root, logarithmic, or logit transforms, among many others.
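
Two of these transformations, the logarithm and winsorization, can be sketched in a few lines of SAS. The data set WORK.RAW, the variable Amt and the 5th/95th percentile cut-offs are assumptions chosen for illustration, not part of the macros in this notebook.

```sas
/* Hypothetical right-skewed amounts */
DATA WORK.RAW;
CALL STREAMINIT(1);
DO i = 1 TO 500;
	Amt = EXP(RAND('NORMAL', 8, 1));
	OUTPUT;
END;
RUN;

/* Percentiles used as winsorization limits (5% and 95% are an arbitrary choice) */
PROC MEANS DATA=WORK.RAW NOPRINT;
VAR Amt;
OUTPUT OUT=WORK.CUTS P5=P5 P95=P95;
RUN;

DATA WORK.TRANSFORMED;
IF _N_ = 1 THEN SET WORK.CUTS(KEEP=P5 P95);   /* limits are retained for every row */
SET WORK.RAW;
IF Amt > 0 THEN LogAmt = LOG(Amt);            /* log is only defined for positive values */
IF NOT MISSING(Amt) THEN WinAmt = MIN(MAX(Amt, P5), P95);  /* clamp to [P5, P95] */
RUN;

/* Compare the shape of the raw and transformed variables */
PROC UNIVARIATE DATA=WORK.TRANSFORMED NORMAL;
VAR Amt LogAmt WinAmt;
HISTOGRAM / NORMAL;
RUN;
```
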

Yet sometimes even transformation cannot help, for example when you have ordinal or ranked data, or when the dataset is very small and you cannot afford to lose any observations.


<b>The non-parametric tests</b>

If the data is not normally distributed, non-parametric tests are the alternative. Non-parametric tests are sometimes called distribution-free tests because they do not assume any particular shape for the data.

However, it is also important to note that non-parametric tests are generally less powerful than their parametric counterparts when the parametric assumptions actually hold, meaning that a real difference is harder to detect (the null hypothesis is harder to reject). The following table pairs each parametric test with its non-parametric counterpart.


<table >
  <tr>
    <th style="text-align: left; vertical-align: middle;">Parametric Tests</th>
    <th style="text-align: left; vertical-align: middle;">Non-parametric Tests</th> 
  </tr>
  <tr>
    <td style="text-align: left; vertical-align: middle;">One sample t-test</td>
    <td style="text-align: left; vertical-align: middle;">Sign test or Wilcoxon signed-rank test (one sample)</td>
  </tr>
  <tr>
    <td style="text-align: left; vertical-align: middle;">Paired sample t-test</td>
    <td style="text-align: left; vertical-align: middle;">Wilcoxon signed-rank test</td>
  </tr>
  <tr>
    <td style="text-align: left; vertical-align: middle;">Independent sample t-test</td>
    <td style="text-align: left; vertical-align: middle;">Mann-Whitney U test (Wilcoxon rank-sum test)</td>
  </tr>
  <tr>
    <td style="text-align: left; vertical-align: middle;">ANOVA</td>
    <td style="text-align: left; vertical-align: middle;">Kruskal-Wallis Test</td>
  </tr>
</table>
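
The two-sample and multi-sample rows are covered by PROC NPAR1WAY later in this notebook. For the one-sample and paired rows, one option is the "Tests for Location" table of PROC UNIVARIATE, which reports the sign test and the signed-rank test next to Student's t. A minimal sketch, assuming a made-up data set WORK.PAIRS with variables Before, After and Diff:

```sas
/* Hypothetical paired scores */
DATA WORK.PAIRS;
CALL STREAMINIT(7);
DO Student = 1 TO 25;
	Before = RAND('NORMAL', 70, 10);
	After  = Before + RAND('NORMAL', 3, 5);
	Diff   = After - Before;
	OUTPUT;
END;
RUN;

/* One sample: is the location of Before equal to 70? */
ODS SELECT TestsForLocation;
PROC UNIVARIATE DATA=WORK.PAIRS MU0=70;
VAR Before;
RUN;

/* Paired samples: is the location of the differences equal to 0? */
ODS SELECT TestsForLocation;
PROC UNIVARIATE DATA=WORK.PAIRS MU0=0;
VAR Diff;
RUN;
```
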

<b>Very important note</b>
Non-parametric tests are distribution-free, not assumption-free. 
Other assumptions, such as random sampling and independent, identically distributed (iid) observations, still apply. Please study the full explanation <a href="http://sphweb.bumc.bu.edu/otlt/mph-modules/bs/bs704_nonparametric/BS704_Nonparametric_print.html">here</a> carefully.

<img src="./images/comp.png" style="width: 70%" >
An example of a service provider. Comparing claim costs for 2016 vs. 2017, the year 2017 appears to have more expensive claims. The Wilcoxon test also confirms that the two distributions are statistically different. The mean and median also increased in 2017.

<font face="Verdana, cursive, sans-serif" >
<H2><font color='darkred'>Part-1: Simple data exploration, parametric and non-parametric test </font></H2>

<font face="Verdana, cursive, sans-serif" >
<b>First, let's take a look at some mock-up data</b>

In [None]:
OPTION NOSOURCE NONOTES;

DATA WORK.CLAIMDATA ;
INFILE "C:\sasdata\claimdata_mockup.csv" DELIMITER	= ',' MISSOVER DSD LRECL=32767 FIRSTOBS=2;
INFORMAT Provider $1.  ClaimAmt Los best32. Year best32.;
INPUT Provider$ ClaimAmt Los Year;
RUN;

PROC SORT DATA=CLAIMDATA; BY Provider Year ;RUN;

PROC TABULATE DATA=CLAIMDATA MISSING;
CLASS YEAR Provider;
VAR ClaimAmt Los;
TABLE Provider, YEAR*ClaimAmt*(N MEAN MEDIAN);
RUN;

<font face="Verdana, cursive, sans-serif" >
<img src="./images/basic_stats.PNG" >
The above table shows that we have 8 providers, some with increasing and some with decreasing claim cost trends. Let's check whether the data is normally distributed.

In [None]:
*Take provider A and Year 2016 as an example;

PROC UNIVARIATE DATA=CLAIMDATA NORMAL PLOT;
VAR ClaimAmt;
WHERE Provider='A' AND YEAR=2016;
HISTOGRAM/NORMAL;
RUN; 


<font face="Verdana, cursive, sans-serif" >
<img src="./images/norm_test.PNG" style="width: 30%">
The above table shows that every normality test rejects the null hypothesis of normality. That is, the data is **NOT** normally distributed.
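
The same check can be repeated for every provider and year in one pass. The following is only a sketch: it assumes CLAIMDATA as loaded above, and the output data set name WORK.NORMTESTS is made up for this example.

```sas
PROC SORT DATA=CLAIMDATA; BY Provider Year; RUN;

ODS EXCLUDE ALL;   /* suppress printed output, the ODS OUTPUT data set is still created */
ODS OUTPUT TestsForNormality=WORK.NORMTESTS;
PROC UNIVARIATE DATA=CLAIMDATA NORMAL;
BY Provider Year;
VAR ClaimAmt;
RUN;
ODS EXCLUDE NONE;

/* One row per provider, year and normality test */
PROC PRINT DATA=WORK.NORMTESTS;
VAR Provider Year Test pValue;
RUN;
```
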
    
<p>Next, let's utilize PROC NPAR1WAY for non-parametric tests </p>

In [None]:
PROC NPAR1WAY DATA=CLAIMDATA Wilcoxon EDF ;
CLASS YEAR;
VAR ClaimAmt;
WHERE Provider='A' and YEAR in (2016 2017);
*EXACT; *Use this statement for small samples or heavily skewed data, but beware of the heavy computational requirements;
OUTPUT OUT=RESULTS;
RUN;

PROC PRINT DATA=RESULTS;RUN;

<font face="Verdana, cursive, sans-serif" >
<img src="./images/nonparam_test.PNG" style="width: 90%" >

2 out of 4 non-parametric tests are significant at alpha = 5%. We can conclude that the claim distribution of provider A in 2016 is statistically different from that in 2017. And, from the earlier basic statistics, both the mean and the total claim amount in 2017 are greater than in 2016.
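
The "2 out of 4" count can also be computed rather than read off the printout. This is a minimal sketch that assumes the RESULTS data set created above contains the p-value columns PT2_WIL, P_KW, P_KSA and P_KA (the same columns the macro in Part-2 keeps); WORK.RESULTS_FLAG is a made-up name.

```sas
DATA WORK.RESULTS_FLAG;
SET RESULTS(KEEP=PT2_WIL P_KW P_KSA P_KA);
/* A missing p-value compares as smaller than 0.05 in SAS, so guard against it first */
IF NMISS(PT2_WIL, P_KW, P_KSA, P_KA) = 0 THEN
	Significant = SUM(PT2_WIL < 0.05, P_KW < 0.05, P_KSA < 0.05, P_KA < 0.05);
ELSE Significant = 0;
RUN;

PROC PRINT DATA=WORK.RESULTS_FLAG; RUN;
```
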

<font face="Verdana, cursive, sans-serif" >
<H2><font color='darkred'>Part-2: SAS macro to perform non-parametric tests for all providers </font></H2>

The following macro performs the same calculation shown in the previous section, implemented so that we can quickly run the tests on multiple providers at once and then categorize the providers into bins.

In [None]:
******************
Define macros
******************;


%macro QUICK_LEFTJOIN(LTAB,RTAB,KEY,OUT=JOINED);
PROC SQL;
CREATE TABLE &OUT. AS
(
SELECT *
FROM &LTAB. LTAB
LEFT JOIN &RTAB. RTAB
ON LTAB.&KEY.=RTAB.&KEY.
) 
;QUIT;
%mend;

%macro NONPARAMETRIC_TESTS(DSNAME=,VAR=,BY=,GROUP=,OUT=NONPARA_TEST,CRI=0.05);

PROC SQL NOPRINT  ;
SELECT MAX(&GROUP.)
	INTO: LATEYEAR
	FROM &DSNAME.;
;QUIT;

ODS GRAPHICS OFF;
PROC SORT DATA= &DSNAME.; BY &BY. ;RUN;
PROC NPAR1WAY DATA=&DSNAME. Wilcoxon EDF  NOPRINT;
CLASS &GROUP. ;
VAR &VAR.;
BY &BY.;
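/* P_KA is the p-value of the Kuiper test, so the column named Kupiec below holds the Kuiper result */
OUTPUT OUT=NPAR(KEEP=&BY. PT2_WIL P_KW P_KSA P_KA RENAME=(PT2_WIL=Wilcoxon P_KW=KruskalWallis P_KSA=KolmogorovSmirnof P_KA=Kupiec));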
RUN;

PROC MEANS DATA=&DSNAME. NOPRINT ;
VAR &VAR.;
CLASS &GROUP. &BY.;
OUTPUT N=N MEAN=MEAN SUM=SUM OUT=STATS(WHERE=(NOT MISSING(&GROUP.) AND NOT MISSING(&BY.)) DROP=_TYPE_ _FREQ_);
RUN;

PROC SORT DATA=STATS; BY &BY. &GROUP. ; RUN;

DATA BSTATS(KEEP= N MEAN SUM &BY. Change_ClaimCnt Change_AvgClaimAmt Change_TotClaimAmt  
RENAME=(N=LateYear_ClaimCnt MEAN=LateYear_AvgClaimAmt SUM=LateYear_TotClaimAmt));
SET STATS;
BY &BY. &GROUP. ;
Change_ClaimCnt=LOG(N/LAG(N));
Change_AvgClaimAmt=LOG(MEAN/LAG(MEAN));
Change_TotClaimAmt=LOG(SUM/LAG(SUM));
IF FIRST.&BY. THEN 
	DO; 
		Change_ClaimCnt=.;
		Change_AvgClaimAmt=.; 	
		Change_TotClaimAmt=.;
	END;
IF LAST.&BY.;
IF MISSING(Change_ClaimCnt) AND MISSING(Change_AvgClaimAmt) AND MISSING(Change_TotClaimAmt) THEN DELETE;

RUN;

%QUICK_LEFTJOIN(LTAB=BSTATS,RTAB=NPAR,KEY=&BY.,OUT=&OUT.);

DATA &OUT.;
SET &OUT.;
IF NMISS(Wilcoxon,KruskalWallis,KolmogorovSmirnof,Kupiec)=0 THEN Significant=SUM(Wilcoxon<&CRI.,KruskalWallis<&CRI.,KolmogorovSmirnof<&CRI., Kupiec<&CRI.);
ELSE Significant=0;
IF Significant>=2 THEN FLAG_ISSIGNIFICANT=1; ELSE  FLAG_ISSIGNIFICANT=0;
RUN;

PROC FORMAT;
VALUE PCT_RANGE 
. = 'Z_ERROR'
LOW-0 = '0_Reduced'
0<-0.1 = 'A_(<10%]'
0.1<-0.2 = 'B_(10%-20%]'
0.2<-0.3='C_(20%-30%]'
0.3<-0.4='D_(30%-40%]'
0.4<-0.5='E_(40%-50%]'
0.5<-HIGH = 'F_(50%>)' 
;
RUN;

Footnote If at least 2 out of 4 non-parametric tests are significant at alpha &CRI. - Flag=1;
TITLE3 Distribution of All Providers;
PROC TABULATE DATA= &out. MISSING;
FORMAT Change_TotClaimAmt Change_AvgClaimAmt PCT_RANGE. ;
CLASS Change_TotClaimAmt Change_AvgClaimAmt FLAG_ISSIGNIFICANT ;
TABLE (FLAG_ISSIGNIFICANT*Change_AvgClaimAmt)ALL,Change_TotClaimAmt ALL PCTN;
RUN;

DATA HOSP_TAG(KEEP=FLAG_ISSIGNIFICANT &BY. Avg_BIN Tot_BIN);
SET &out.;
Avg_BIN=PUT(Change_AvgClaimAmt,PCT_RANGE.);
Tot_BIN=PUT(Change_TotClaimAmt,PCT_RANGE.);
RUN;

%QUICK_LEFTJOIN(LTAB=&DSNAME.,RTAB=HOSP_TAG,KEY=&BY.,OUT=JOINED);

TITLE3 Average Length of Stay ;
PROC TABULATE DATA= JOINED ;
CLASS Tot_BIN Avg_BIN FLAG_ISSIGNIFICANT  ;
VAR LOS ClaimAmt;
TABLE (FLAG_ISSIGNIFICANT*Avg_BIN='Change_AvgClaimAmt')ALL,Tot_BIN='Change_TotClaimAmt'*LOS*(MEAN) ALL PCTN;
WHERE &GROUP. = &LATEYEAR.;
RUN;

TITLE3 Average Claim Amount ;
PROC TABULATE DATA= JOINED ;
CLASS Tot_BIN Avg_BIN FLAG_ISSIGNIFICANT  ;
VAR LOS ClaimAmt;
TABLE (FLAG_ISSIGNIFICANT*Avg_BIN='Change_AvgClaimAmt')ALL,Tot_BIN='Change_TotClaimAmt'*ClaimAmt*(MEAN) ALL PCTN;
WHERE &GROUP. = &LATEYEAR.;
RUN;

PROC DELETE DATA= JOINED HOSP_TAG BSTATS STATS NPAR; RUN;

footnote;
title2;
title3;
ODS GRAPHICS ON;
%mend;


%macro NONPARAMETRIC_INDIVIDUAL_TEST(DSNAME=,VAR=,WHERE=,GROUP=);

ODS GRAPHICS ON;

PROC MEANS DATA=&DSNAME. N MEAN MEDIAN SUM;
CLASS &GROUP.;
VAR &VAR.;
WHERE &WHERE.;
RUN;

PROC SGPLOT DATA=&DSNAME. ;
  WHERE &WHERE. ;   
  HISTOGRAM &VAR. / GROUP=&GROUP. TRANSPARENCY=0.5 ;       
  DENSITY &VAR. / TYPE=KERNEL GROUP=&GROUP.; 
RUN;

PROC NPAR1WAY DATA=&DSNAME.  Wilcoxon EDF  ;
CLASS &GROUP. ;
VAR &VAR.;
WHERE  &WHERE.;  
RUN;

%mend;


<font face="Verdana, cursive, sans-serif" >
    
To keep things simple, we compare year-over-year (Y-o-Y) data.

In [None]:
DATA YEAR_1617 ;
SET CLAIMDATA;
IF YEAR IN (2016 2017) THEN OUTPUT YEAR_1617;
RUN;

*************************************
Compare claim 2016 vs. 2017
*************************************;
TITLE1 Non-Parametric Tests: Y-o-Y Comparison ;
%NONPARAMETRIC_TESTS(DSNAME=YEAR_1617,
                    VAR=CLAIMAMT,
                    BY=PROVIDER,
                    GROUP=YEAR);
                    
PROC PRINT DATA=NONPARA_TEST;RUN;

*************************************
To deepdive into 1 provider
*************************************;
%LET PROVIDER = "A";
TITLE1 Claim Behavior of Provider = &PROVIDER. ;
%NONPARAMETRIC_INDIVIDUAL_TEST(DSNAME=YEAR_1617,
                                VAR=CLAIMAMT,
                                WHERE=PROVIDER=&PROVIDER.,
                                GROUP=YEAR);

<font face="Verdana, cursive, sans-serif" >
<img src="./images/providers_quad.PNG" >
    
2 out of 8 providers show evidence of changing behavior between 2016 and 2017. 1 provider in particular is an extreme case, with a >50% increase in average claim and a >90% increase in total claim. 
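
Note that the macro stores each change as a log ratio, for example LOG(MEAN/LAG(MEAN)), so the bin limits are on the log scale. A log change of 0.5 corresponds to an actual increase of EXP(0.5) - 1, or roughly 65%. The sketch below converts the log changes back to ordinary percentages and lists the flagged providers, largest change first. It assumes the NONPARA_TEST data set produced above; WORK.SHORTLIST and the Pct_ variables are made-up names.

```sas
DATA WORK.SHORTLIST;
SET NONPARA_TEST;
WHERE FLAG_ISSIGNIFICANT = 1;                      /* keep flagged providers only */
Pct_AvgClaimAmt = EXP(Change_AvgClaimAmt) - 1;     /* log ratio back to percent change */
Pct_TotClaimAmt = EXP(Change_TotClaimAmt) - 1;
FORMAT Pct_AvgClaimAmt Pct_TotClaimAmt PERCENT8.1;
RUN;

PROC SORT DATA=WORK.SHORTLIST; BY DESCENDING Change_TotClaimAmt; RUN;

PROC PRINT DATA=WORK.SHORTLIST; RUN;
```
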

<font face="Verdana, cursive, sans-serif" >
<H3>Conclusion</H3>

Here we only show how to quickly identify and shortlist potential providers for the auditing process. 
What is not shown here is a deep dive into a particular provider. Please refer to <a href="https://analytics.knowledgehub.ageas.com/posts/2012571-sas-by-example-how-to-quickly-identify-potential-changes-in-the-behavior-of-a-s?version=33244">Ageas Knowledge Hub</a> for more discussion. 


