<font face="Verdana, cursive, sans-serif" >
<center><H1>Anomaly Detection with PRIDIT</H1></center>

<center><H2><font color='darkred'>How to convert categorical variables to RIDIT scores and implement PRIDIT </font></H2></center>

<p>This documentation is powered by <b>Jupyter Notebooks</b>. To learn more about how to code SAS in the Jupyter Notebook environment, please refer to <a href="https://github.com/sassoftware/sas_kernel">SAS Kernel for Jupyter</a>. Please note that you DO NOT need Python or Jupyter Notebooks in order to use these <b>SAS</b> macros.</p>

    
<p><b>What is RIDIT?</b>
<br>RIDIT is an approach for assigning numerical values to the categories of an ordinal categorical variable. It was introduced by Bross (1958), who coined the term RIDIT by analogy with logit and probit. Once a variable is transformed, it can be used as an input for modeling.
    
<p><b>The calculation of RIDIT</b>

<img src="./images/ridit_eq.png">

An illustrated calculation in Excel is provided <a href="https://github.com/swatakit/SAS-Tools/blob/master/RIDIT%20Transformation.xlsx">here</a>.

<br>The RIDIT scoring mechanism incorporates the ranked nature of the responses, that is, of the categories of an ordinal categorical variable. Assume the response categories are ordered by *decreasing* likelihood of fraud suspicion, so that a *higher* categorical response indicates a *lesser* suspicion of fraud. Let's take a look at a numeric example from Brockett et al. (2002).

<img src="./images/compare_ridit.png">

<br>The figure above shows that, for the first rule, RIDIT_1("yes") = -0.56, while for the second rule, RIDIT_9("yes") = -0.96. This indicates that a “yes” on the second indicator is more abnormal, and therefore more indicative of fraud, than a “yes” on the first indicator. The transformation thus yields RIDIT scores that can be included directly in a quantitative model and that make sense from an operational or expert perspective.
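
The two scores can be reproduced by hand. The block below is a sketch that assumes the figure uses the RIDIT definition of Brockett et al. (2002), where the score of a category is the share of responses in the categories ranked before it minus the share in the categories ranked after it. The symbols $p_{tj}$ and $B_{ti}$ and the response shares are notation and values inferred here, not read from the figure.

```latex
% Score of category i of indicator t (categories ordered by decreasing suspicion)
B_{ti} \;=\; \sum_{j<i} p_{tj} \;-\; \sum_{j>i} p_{tj}

% Binary indicator with "yes" ranked first and "no" ranked second
B_t(\text{yes}) = -\,p_t(\text{no}), \qquad B_t(\text{no}) = p_t(\text{yes})

% Shares implied by the two scores quoted above
B_1(\text{yes}) = -0.56 \;\Rightarrow\; p_1(\text{no}) = 0.56,\; p_1(\text{yes}) = 0.44
B_9(\text{yes}) = -0.96 \;\Rightarrow\; p_9(\text{no}) = 0.96,\; p_9(\text{yes}) = 0.04
```

Under this reading, a “yes” on the second indicator would occur in only about 4% of cases versus 44% for the first, which is why it receives the more extreme negative score.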

<p><b>What is PRIDIT?</b>

<br>PRIDIT stands for *Principal Component Analysis of RIDITs*. PRIDIT analysis combines principal component analysis with RIDIT scoring and produces an overall fraud suspicion score from a set of ordinal categorical fraud indicators. As such, it can be used to assemble these indicators into a single variable that can be included in any further analysis. Alternatively, it can be used as a filter approach to reduce the number of indicator variables included in further analysis, as well as in the final fraud suspicion score.

<p>This program demonstrates step-by-step anomaly detection with PRIDIT. The data provided here is a simple mock-up dataset with 4 variables: R1, R2, R3, and R4. Each variable holds either 1 or 0, where 1 indicates that a fraud detection rule was hit.
</p> 

<b>References:</b> 
<ul>
<li>Bross, Irwin D. J. (1958). **How to Use Ridit Analysis.** Biometrics, 14(1), 18–38. JSTOR.</li>
<li>Brockett, Patrick L., Derrig, Richard A., Golden, Linda L., Levine, Arnold, & Alpert, Mark (2002). **Fraud Classification Using Principal Component Analysis of RIDITs.** Journal of Risk and Insurance, 69, 341-371. doi:10.1111/1539-6975.00027.</li>
<li>Baesens, Bart, Van Vlasselaer, Véronique, & Verbeke, Wouter (2015). **Fraud Analytics Using Descriptive, Predictive, and Social Network Techniques: A Guide to Data Science for Fraud Detection.**</li>
</ul>


<p>In this post, I demonstrate the calculation step by step in 3 parts:</p>
<ul>
    <li><b>Part-1:</b> Demonstration of RIDIT transformation for 1 variable </li>
    <li><b>Part-2:</b> Demonstration of RIDIT transformation for multiple variables</li>
    <li><b>Part-3:</b> Implementation of PRIDIT</li>
</ul>


<font face="Verdana, cursive, sans-serif" >
<H2><font color='darkred'>Part-1: RIDIT transformation for 1 variable </font></H2>

<font face="Verdana, cursive, sans-serif" >
<b>First, let's load Ted Clay's macros</b>

In [None]:
OPTION NOSOURCE NONOTES;

%LET LOC_SASMACRO=C:\sasmacro;
LIBNAME MYDATA 'C:\sasdata\';
%INCLUDE "&LOC_SASMACRO.\NUMLIST.SAS";
%INCLUDE "&LOC_SASMACRO.\ARRAY.SAS";
%INCLUDE "&LOC_SASMACRO.\DO_OVER.SAS";


<font face="Verdana, cursive, sans-serif" >
<b>Create and preprocess a mock-up example</b>

In [2]:
******************************
Use a mockup dataset as example
*******************************;
DATA RULES_DATA;
    SET MYDATA.RULES_DATA;
RUN;

PROC PRINT DATA=RULES_DATA(OBS=5);RUN;


Obs,R1,R2,R3,R4
1,1,1,0,1
2,0,1,0,0
3,1,1,0,0
4,0,1,1,1
5,0,0,0,0


<font face="Verdana, cursive, sans-serif" >
<b>Get RIDIT score with PROC FREQ</b>
<br>
PROC FREQ also provides a RIDIT scoring option. To get the RIDIT scores for a categorical variable, we only need to specify <code>SCORES=RIDIT</code>. Before we proceed, there are a few technical points to be aware of:

<ol>
<li>The SCOROUT option displays row and column scores only when two-way table statistics are computed. The trick, therefore, is to cross the variable whose RIDIT scores we want with another categorical variable. This crossing does not affect the calculated RIDIT scores.</li>
<li>If the variable to be scored is binary (0 vs. 1), PROC FREQ orders its levels in ascending order by default, so 0 appears in the first row of the frequency table. For the RIDIT calculation we need 1, not 0, in the first row, so that 1 is the reference class. We therefore recode 0 to 2 to obtain the correct RIDIT scores.</li>
</ol>


In [None]:

DATA RULES_DATA;
    SET RULES_DATA;

    *Recode 0 to 2;
    IF R1=0 THEN R1=2;

    *Create a dummy variable;
    GROUP="A";
    IF _N_>15 THEN GROUP="B";
RUN;


ODS OUTPUT ROWSCORES = SCOROUT;
PROC FREQ DATA = RULES_DATA ;
TABLE R1 * GROUP / MISSING CHISQ SCORES=RIDIT SCOROUT;
RUN;


<font face="Verdana, cursive, sans-serif" >
<b>Output RIDIT score from PROC FREQ</b>

<img src="./images/ridit_12.png">


The RIDIT scores are stored in the dataset SCOROUT.
We then map them back onto each observation as a new variable, P_R1.

In [4]:
DATA _NULL_;
    SET SCOROUT;
    IF R1 = 1 THEN CALL SYMPUT("RIDIT_1", Score);
    IF R1 = 2 THEN CALL SYMPUT("RIDIT_2", Score);
RUN;

DATA RULES_DATA;
    SET RULES_DATA;
    IF R1 = 1 THEN P_R1=&RIDIT_1.;
    ELSE IF R1 = 2 THEN P_R1=&RIDIT_2.;
RUN;

PROC PRINT DATA=RULES_DATA(OBS=5);RUN;


Obs,R1,R2,R3,R4,GROUP,P_R1
1,1,1,0,1,A,0.18333
2,2,1,0,0,A,0.68333
3,1,1,0,0,A,0.18333
4,2,1,1,1,A,0.68333
5,2,0,0,0,A,0.68333
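
The scores returned by PROC FREQ can be cross-checked by hand. The sketch below assumes they follow Bross's definition (the share of observations in the earlier levels plus half the share in the current level) and that R1 has already been recoded to 1/2 with no missing values. The names R1_FREQ, R1_RIDIT, P, CUM_BELOW, and RIDIT_MANUAL are illustrative and not part of the original program.

```sas
/* Frequency table of R1: one row per level, with COUNT and PERCENT */
PROC FREQ DATA=RULES_DATA NOPRINT;
    TABLE R1 / OUT=R1_FREQ;
RUN;

/* Bross RIDIT: share of earlier levels + half the share of this level */
DATA R1_RIDIT;
    SET R1_FREQ;
    RETAIN CUM_BELOW 0;                 /* running share of the earlier levels */
    P = PERCENT / 100;                  /* share of observations at this level */
    RIDIT_MANUAL = CUM_BELOW + 0.5 * P;
    CUM_BELOW = CUM_BELOW + P;          /* carry the running share to the next level */
RUN;

PROC PRINT DATA=R1_RIDIT;
    VAR R1 COUNT P RIDIT_MANUAL;
RUN;
```

If the assumption holds, RIDIT_MANUAL should match the Score column in SCOROUT. In the output above, 0.18333 is half of the share of R1 = 1 (about 0.3667), and 0.68333 is that share plus half of the remaining 0.6333.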


<font face="Verdana, cursive, sans-serif" >
<H2><font color='darkred'>Part-2: RIDIT transformation for multiple variables </font></H2>
<p>Now that we understand how to calculate and recode 1 variable, here is how we can process multiple variables.
    
<p>Read and pre-process data. We set VARLIST = R1 to R4</p>

In [None]:

%LET VARLIST= R1 R2 R3 R4;

DATA RULES_DATA;
    SET MYDATA.RULES_DATA;

    *Recode 0 to 2;
    %DO_OVER(VALUES= &VARLIST.,
        PHRASE= IF ?=0 THEN ?=2; );

    *Dummy;
    GROUP="A";
    IF _N_>15 THEN GROUP="B";

RUN;


<font face="Verdana, cursive, sans-serif" >
<b>Define macros</b>

In [None]:

*Macro to calculate RIDIT scores for 1 variable;
%MACRO CALCULATE_RIDITS(DSNAME,VAR2CAL,VARTOCROSS);

    ODS OUTPUT ROWSCORES = SCOROUT;
    PROC FREQ DATA = &DSNAME.;
    TABLE &VAR2CAL. * &VARTOCROSS. / MISSING CHISQ SCORES=RIDIT SCOROUT;
    RUN;


    DATA _NULL_;
    SET SCOROUT;
    IF &VAR2CAL. = 1 THEN CALL SYMPUT("RIDIT_1", Score);
    IF &VAR2CAL. = 2 THEN CALL SYMPUT("RIDIT_2", Score);
    RUN;


    DATA &DSNAME.;
    SET &DSNAME.;

    IF &VAR2CAL. = 1 THEN P_&VAR2CAL.=&RIDIT_1.;
    ELSE IF &VAR2CAL. = 2 THEN P_&VAR2CAL.=&RIDIT_2.;

    RUN;

%MEND;

*Macro to loop through all variables;
%MACRO LOOP_CALCULATE_RIDITS(VALUES);
    /* keep the loop variables local to this macro */
    %LOCAL COUNT I VALUE;

    /* count the number of values in the string */
    %LET COUNT=%SYSFUNC(COUNTW(&VALUES));

    /* loop through the values and compute the RIDIT scores for each one */
    %DO I = 1 %TO &COUNT;
        %LET VALUE=%QSCAN(&VALUES,&I);
        %PUT &I. &VALUE.;
        %CALCULATE_RIDITS(DSNAME=RULES_DATA,VAR2CAL=%TRIM(&VALUE.),VARTOCROSS=GROUP);
    %END;

%MEND;

In [None]:
%LOOP_CALCULATE_RIDITS(&VARLIST.);

In [8]:
PROC PRINT DATA=RULES_DATA(OBS=5);RUN;

Obs,R1,R2,R3,R4,GROUP,P_R1,P_R2,P_R3,P_R4
1,1,1,2,1,A,0.18333,0.23333,0.7,0.25
2,2,1,2,2,A,0.68333,0.23333,0.7,0.75
3,1,1,2,2,A,0.18333,0.23333,0.7,0.75
4,2,1,1,1,A,0.68333,0.23333,0.2,0.25
5,2,2,2,2,A,0.68333,0.73333,0.7,0.75
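
The scores above lie between 0 and 1, whereas the example quoted from Brockett et al. (2002) uses scores between -1 and 1. Assuming the PROC FREQ scores follow Bross's definition, the two scales are related by B = 2 * P - 1, so the conversion is a one-line recode. The sketch below is optional; RULES_DATA_B and the B_ variables are illustrative names, not part of the original program.

```sas
/* Rescale the 0-to-1 RIDIT scores to the -1-to-1 scale */
DATA RULES_DATA_B;
    SET RULES_DATA;
    ARRAY RID{4} P_R1 P_R2 P_R3 P_R4;   /* scores created by the macro above */
    ARRAY BRK{4} B_R1 B_R2 B_R3 B_R4;   /* rescaled scores (hypothetical names) */
    DO K = 1 TO 4;
        BRK{K} = 2 * RID{K} - 1;
    END;
    DROP K;
RUN;

/* Sanity check: with no missing values, each P_ mean should be 0.5 and each B_ mean 0 */
PROC MEANS DATA=RULES_DATA_B N MEAN MIN MAX MAXDEC=4;
    VAR P_R1 P_R2 P_R3 P_R4 B_R1 B_R2 B_R3 B_R4;
RUN;
```

Because the rescaling is linear, a PCA on the correlation matrix gives the same components for either scale, so the steps in Part-3 can use the P_ variables as they are.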


<font face="Verdana, cursive, sans-serif" >
<H2><font color='darkred'>Part-3: Implementation of PRIDIT </font></H2>

<p>Now that we have the RIDIT scores, we proceed to run PCA on them.


In [None]:

%LET VARLIST_P= %DO_OVER(VALUES= &VARLIST., PHRASE=P_?);

ODS GRAPHICS ON;

PROC FACTOR DATA = RULES_DATA SIMPLE METHOD = PRIN PRIORS = ONE MINEIGEN = 1 SCREE ROUND ROTATE = VARIMAX ;
VAR &VARLIST_P.;
RUN;


<font face="Verdana, cursive, sans-serif" >

<img src="./images/pca_n2.png">

<img src="./images/pattern.png">
Using Kaiser's rule (keep all components with eigenvalues > 1), the results suggest retaining 2 principal components, which together explain 61.44% of the variance in the sample. We therefore retain 2 components and calculate the component scores for each observation in the dataset.
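
For reference, the quantities behind this step can be written out as follows. This is a sketch in notation introduced here ($Z$, $R$, $\lambda_k$, $v_k$, $m$, and $s_k$ are not from the original), assuming PROC FACTOR analyzes the correlation matrix of the $m = 4$ RIDIT variables with METHOD=PRIN and no rotation.

```latex
% Z: N x m matrix of standardized RIDIT scores; R: their correlation matrix
R = \frac{1}{N-1}\, Z^{\top} Z, \qquad
R\, v_k = \lambda_k v_k, \qquad
\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_m, \qquad
\sum_{k=1}^{m} \lambda_k = m

% Kaiser's rule: retain component k only if
\lambda_k > 1

% Share of variance explained by the first two components (m = 4)
\frac{\lambda_1 + \lambda_2}{4} = 0.6144
\;\Rightarrow\; \lambda_1 + \lambda_2 \approx 2.46

% Standardized score of each observation on component k
s_k = \frac{Z\, v_k}{\sqrt{\lambda_k}}
```

The sign of each eigenvector $v_k$ is arbitrary, so the direction of a component score (whether low or high values point to fraud) has to be read from the signs of the loadings in the factor pattern.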

In [None]:

PROC FACTOR DATA = RULES_DATA SIMPLE METHOD = PRIN PRIORS = ONE MINEIGEN = 1 SCREE ROUND SCORE OUT = PRIDIT NFACT=2 PREFIX=PRIDIT;
VAR &VARLIST_P.;
RUN;

PROC PRINT DATA=PRIDIT(OBS=5);RUN;

PROC UNIVARIATE DATA=PRIDIT ;
VAR PRIDIT1;
HISTOGRAM ;
RUN;

<font face="Verdana, cursive, sans-serif" >

<img src="./images/pridit.png">

<img src="./images/hist_pridit1.png">

<font color="red">
<b>Interpretation of the histogram</b>
Negative scores indicate class 1 (potential fraud) claims. </font>
<p>That is, the more negative the score, the higher the likelihood of fraud. For example, one could recommend investigating all cases with a score below -2, or all cases from the 1st to the 5th percentile when ranked in ascending order of component score.
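
Either selection rule can be applied directly to the PRIDIT dataset. The sketch below is one way to do it, assuming PRIDIT1 is the component to rank on. The dataset names PRIDIT_RANKED and PRIDIT_FLAGGED and the variables PCTL_PRIDIT1, FLAG_CUTOFF, and FLAG_PCTL are illustrative, and the cut-offs (-2 and the lowest five percentile groups) are the example values from the text rather than recommended thresholds.

```sas
/* Assign each observation to a percentile group of PRIDIT1 (0 = lowest scores, 99 = highest) */
PROC RANK DATA=PRIDIT OUT=PRIDIT_RANKED GROUPS=100;
    VAR PRIDIT1;
    RANKS PCTL_PRIDIT1;
RUN;

DATA PRIDIT_FLAGGED;
    SET PRIDIT_RANKED;
    /* Rule 1: absolute cut-off on the component score */
    FLAG_CUTOFF = (NOT MISSING(PRIDIT1) AND PRIDIT1 < -2);
    /* Rule 2: lowest five percentile groups (0 to 4) */
    FLAG_PCTL = (NOT MISSING(PCTL_PRIDIT1) AND PCTL_PRIDIT1 <= 4);
RUN;

/* Compare how many cases each rule selects */
PROC FREQ DATA=PRIDIT_FLAGGED;
    TABLE FLAG_CUTOFF * FLAG_PCTL / MISSING;
RUN;
```

The explicit MISSING checks matter because SAS treats a missing value as smaller than any number, so a bare comparison would flag missing scores as suspicious. On a small mock-up dataset the percentile groups are coarse, so the percentile rule may select very few cases.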