<font face="Verdana, cursive, sans-serif" >
<center><H1>Variable Selection with Information Value <br>and Application of Weight of Evidence</H1></center>

<center><H2><font color='darkred'>How to quickly identify candidate variables for a classification model</font></H2></center>

<p>This documentation is powered by <b>Jupyter Notebooks</b>. To learn more about how to code SAS in the Jupyter Notebooks environment, please refer to <a href="https://github.com/sassoftware/sas_kernel">SAS Kernel for Jupyter</a>. Please note that you do <b>not</b> need Python or Jupyter Notebooks in order to use these <b>SAS</b> macros.</p>

<p>Variable selection, or feature selection, is one of the most crucial parts of the data analytics process. Ideally, we would like the model to be <i>parsimonious</i>. That is, we should retain only a few relevant, predictive variables in the model. Variable selection therefore also covers variable elimination.</p> 

<p>Benefits of having a parsimonious model are:</p>
<ol>
<li>A clean, easy-to-understand model</li>
<li>A model that runs fast</li>
<li>Reduced overfitting and improved predictive performance</li>
<li>Less effort spent on unnecessary data collection and pre-processing</li>
</ol>

    
<p>The following list contains my go-to guidelines, which you may find useful. Note that the list is not exhaustive; for further study I recommend <b><i>A Practical Guideline to Dimension Reduction</i>, Patel (2016)</b>.</p>
<ol>
<li> Eliminate variables with many missing/invalid values. For example, if more than 50% of a variable's values are missing/invalid, you may consider removing it from your dataset or excluding it from the modeling process. I personally prefer to avoid imputing missing values, hence my threshold is usually set as high as 85% or more. Check out the
    <a href="https://nbviewer.jupyter.org/github/swatakit/SAS-Tools/blob/master/Missing%20Reports%20Notebooks.ipynb">sasmacro</a> and 
    <a href="https://nbviewer.jupyter.org/github/swatakit/Python-Tools/blob/master/Missing%20Reports.ipynb">python</a> tools that produce such a missing-value report.
    </li>
<li> Eliminate variables that are not relevant to the problem statement. This requires domain knowledge of the subject.</li>
<li> Eliminate variables that are heavily concentrated in one class. For example, if a variable COUNTRY takes the value THAILAND in 99% of records, it should not be in the model, as it provides almost no information.</li>
<li> Eliminate highly correlated variables. Using domain knowledge or a Spearman/Pearson correlation matrix, we can easily identify pairwise correlated variables. From each correlated group, keep only the variable deemed most relevant to the problem statement.</li>
<li>Apply dimensionality reduction techniques such as principal component analysis or factor analysis.</li>
<li>Apply a sequential selection strategy such as forward, backward or stepwise selection.</li>
<li>Apply regularized regression. Depending on the penalty parameter lambda, ridge regression shrinks large coefficients, while lasso regression can drop a variable completely by setting its coefficient to zero.</li>
<li>Identify potentially useful variables with statistical tools such as the chi-square criterion, <b>Weight of Evidence/Information Value</b>, impurity, information gain or variable importance.</li>
  </ol>
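
Guidelines 3 and 4 can be checked with standard procedures before any modeling. The following is a minimal sketch, not part of the macros in this post. It assumes the `TITANIC` dataset used in Part-1 is already loaded in `WORK`, and the variable lists are only illustrative choices.

```sas
/* Guideline 3: look for variables heavily concentrated in one class. */
/* MISSING reports missing values as a level of their own.            */
PROC FREQ DATA=TITANIC;
    TABLES SEX EMBARKED PCLASS / MISSING;
RUN;

/* Guideline 4: pairwise correlation among numeric candidates.        */
/* PEARSON and SPEARMAN request both correlation matrices.            */
PROC CORR DATA=TITANIC PEARSON SPEARMAN;
    VAR AGE FARE SIBSP PARCH;
RUN;
```

How concentrated or how correlated is "too much" remains a judgement call for the analyst; the procedures only surface the numbers.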

<p>In this post, I focus on the calculation of Weight of Evidence (WOE) and Information Value (IV). For classification modeling with a binary target variable (Yes vs. No, Fraud vs. Non-Fraud, Positive vs. Negative, 1 vs. 0), WOE/IV is a very simple tool that can help you shortlist candidate variables for the model.</p>

<p><b>Step-by-Step Calculation</b></p>
<ol>
<li>For a continuous variable, split the data into a number of bins (for a categorical variable, skip this step).</li>
<li>Calculate the number of events and non-events in each bin.</li>
<li>Calculate the % of events and the % of non-events in each bin, each as a share of the overall total of events and of non-events respectively.</li>
<li>Calculate WOE for each bin as the natural log of the ratio of % of non-events to % of events.</li>
    <img src="./images/woe.png" width=200px height=8px>
<li>Calculate IV by summing (% of non-events - % of events) * WOE over all bins.</li>
    <img src="./images/iv.png" width=300px height=10px>
</ol>

Reference: <a href="https://www.listendata.com">listendata</a>
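
Written out as formulas, the steps above look as follows. The symbols are introduced here only for illustration: N_i and E_i denote the number of non-events and events in bin i, which correspond to NONRESP and RESP in the macros later in this post.

```latex
\begin{aligned}
p^{N}_i &= \frac{N_i}{\sum_j N_j}, \qquad p^{E}_i = \frac{E_i}{\sum_j E_j} \\
\mathrm{WOE}_i &= \ln\!\left(\frac{p^{N}_i}{p^{E}_i}\right) \\
\mathrm{IV} &= \sum_i \left(p^{N}_i - p^{E}_i\right)\,\mathrm{WOE}_i
\end{aligned}
```

Each term of the IV sum is non-negative, because the difference and the logarithm always share the same sign. A bin therefore can only add to IV, never cancel another bin's contribution.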

<img src="./images/iv_bins_9.png" >

An illustrated calculation in Excel is provided <a href="https://github.com/swatakit/SAS-Tools/blob/master/WOE%20and%20IV%20Example.xlsx">here</a>. As the workbook shows, the calculated IV also depends on how the bins are sized. However, choosing optimal bin sizes is beyond the scope of this post.

As a rule of thumb, IV relates to predictive power as follows:
<img src="./images/iv_rthb.png" >

It is important to note that IV is merely a tool to suggest that a variable *may be* predictive; it does not necessarily mean that the variable will *stay* in the model. IV also has one very intriguing application: besides using IV to shortlist candidate variables, we can transform a raw variable into a <b>WOE-transformed variable</b> and use it as an input to a model. 

<img src="./images/woe_trans.PNG" >

As shown in the above picture, a raw value that is >401 can be recoded as -0.2123.<br>
<br><b>Benefits of WOE transformation</b>
<ol>
<li>Variable Reduction</li>
<li>Deal with missing values</li>
<li>Deal with extreme values</li>
<li>Linearize the variables</li>
<li>Increase predictive accuracy in logistic modelling</li>
</ol>

Reference: Sharma(2011), SSRN Electronic Journal,
<br>*Evidence in Favor of Weight of Evidence and Binning Transformations for Predictive Modelling.* 


<p>In this post, I demonstrate the WOE/IV calculation in two parts:</p>
<ul>
    <li><b>Part-1: Basic</b> - A simple macro to calculate IV for a numeric variable and a character variable</li>
    <li><b>Part-2: Advanced</b> - Using the power of the <code>DO_OVER</code> macro, published by <a href="http://www2.sas.com/proceedings/sugi31/040-31.pdf">Ted Clay</a>, <br>I demonstrate how to fully automate the WOE/IV calculation</li>
</ul>


<font face="Verdana, cursive, sans-serif" >
<H2><font color='darkred'>Part-1: Basic - A simple calculation of WOE/IV </font></H2>

<font face="Verdana, cursive, sans-serif" >
<b>First, let's load Ted Clay's macros (the MISSING_REPORTS macro is loaded in Part-2)</b>

In [None]:
OPTION NOSOURCE NONOTES;

%LET LOC_SASMACRO=C:\sasmacro;
%INCLUDE "&LOC_SASMACRO.\NUMLIST.SAS";
%INCLUDE "&LOC_SASMACRO.\ARRAY.SAS";
%INCLUDE "&LOC_SASMACRO.\DO_OVER.SAS";


<font face="Verdana, cursive, sans-serif" >
<b>Let's use the Titanic dataset as example data</b>

In [None]:
******************************
Use TITANIC dataset as example
*******************************;
%INCLUDE "&LOC_SASMACRO.\DATA_TITANIC.SAS";


<font face="Verdana, cursive, sans-serif" >
<b>Set target variable and dataset to calculate WOE/IV</b>

In [None]:
%LET TARGET4MODEL=SURVIVED;
%LET DSBASE=TITANIC;

*If your DSBASE is huge, you may consider keeping only TARGET4MODEL and some variables,in order to speed up the process;
DATA TARGET/*(KEEP=&TARGET4MODEL. <varlist>)*/;
    SET &DSBASE.;
RUN;


<font face="Verdana, cursive, sans-serif" >
<H3>Macro to calculate WOE/IV for CHARACTER</H3>
The following macro performs exactly the calculation shown in the Excel workbook, only implemented in SQL

In [None]:
%GLOBAL CNT;
%LET CNT=0;

*To keep list of var and iv;
DATA SOURCE_IV;
INFILE DATALINES dsd DELIMITER='|' MISSOVER ; 
INFORMAT VARNAME $100. ;
INPUT IV VARNAME$ ;
DATALINES ;
;
RUN;

*To keep list of var and woe;
DATA SOURCE_WOE;
INFILE DATALINES dsd DELIMITER='|' MISSOVER ; 
INFORMAT BIN VARNAME $100. ;
INPUT BIN NONRESP RESP TOT_NONRESP TOT_RESP VARNAME$ PCT_NONRESP PCT_RESP WOE IV ;
DATALINES ;
;
RUN;

In [None]:

%MACRO DOWOEIV_CHAR(TARGETVAR);

	%PUT TARGETVAR=&TARGETVAR.;
	%LET CNT=%EVAL(&CNT.+1);
	%PUT &CNT.;
	PROC SQL;
	CREATE TABLE TARGET_1 AS
	(
		SELECT 	&TARGET4MODEL.
				,&TARGETVAR.
				,SUM(&TARGET4MODEL.=0) AS TOTAL_NONRESP
				,SUM(&TARGET4MODEL.=1) AS TOTAL_RESP
		FROM TARGET
	)
	;
	CREATE TABLE WOE AS
	(
		SELECT 	&TARGETVAR.
				,SUM(&TARGET4MODEL.=0) AS NONRESP /*NONRESP*/
			   	,SUM(&TARGET4MODEL.=1) AS RESP /*RESP*/
				,MEAN(TOTAL_NONRESP) AS TOT_NONRESP /*TRICK: THE TOTAL IS THE SAME ON EVERY ROW, SO MEAN() JUST CARRIES IT THROUGH THE GROUP BY*/
				,MEAN(TOTAL_RESP) AS TOT_RESP
		FROM TARGET_1
		GROUP BY &TARGETVAR.
	)
	;
	RUN;QUIT;

	DATA _WOE_&CNT.(DROP=&TARGETVAR.);
	INFORMAT BIN VARNAME $100.;
	SET WOE;
	BIN=&TARGETVAR.;
		PCT_NONRESP=NONRESP/TOT_NONRESP;
		PCT_RESP=RESP/TOT_RESP;
		WOE=0;
		IF PCT_RESP>0 THEN	WOE=LOG(PCT_NONRESP/PCT_RESP);
		IV=(PCT_NONRESP-PCT_RESP)*WOE;
		VARNAME = "&TARGETVAR.";
	RUN;

	PROC MEANS DATA=_WOE_&CNT. NOPRINT; VAR IV; OUTPUT OUT=_IV_&CNT. SUM=IV; RUN;
	DATA  _IV_&CNT.(DROP= _TYPE_ _FREQ_);
	SET _IV_&CNT.;
		INFORMAT VARNAME $100.;
		VARNAME = "&TARGETVAR.";
	RUN;

	PROC APPEND DATA=_IV_&CNT. BASE=SOURCE_IV FORCE; RUN;
	PROC APPEND DATA=_WOE_&CNT. BASE=SOURCE_WOE FORCE; RUN;
		
	PROC DELETE DATA=TARGET_1  WOE  _WOE_&CNT. _IV_&CNT.  ; RUN;

%MEND;

In [None]:
%DOWOEIV_CHAR(SEX);
%DOWOEIV_CHAR(PCLASS);

In [7]:
*The following is the IV for each variable;
PROC PRINT DATA=SOURCE_IV;RUN;
*The following is the WOE for each BIN of each variable. As illustrated here, the raw value 'female' can be replaced with -1.52988;
PROC PRINT DATA=SOURCE_WOE;RUN;

Obs,VARNAME,IV
1,SEX,1.34168
2,PCLASS,0.50095

Obs,BIN,VARNAME,NONRESP,RESP,TOT_NONRESP,TOT_RESP,PCT_NONRESP,PCT_RESP,WOE,IV
1,female,SEX,81,233,549,342,0.14754,0.68129,-1.52988,0.81657
2,male,SEX,468,109,549,342,0.85246,0.31871,0.98383,0.52512
3,1,PCLASS,80,136,549,342,0.14572,0.39766,-1.00392,0.25293
4,2,PCLASS,97,87,549,342,0.17668,0.25439,-0.36448,0.02832
5,3,PCLASS,372,119,549,342,0.6776,0.34795,0.66648,0.2197
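
To see where a single row of the table comes from, the 'female' row can be recomputed by hand from its four counts. This is only an arithmetic check written for illustration, not part of the macros; the counts are copied from the output above.

```sas
/* Recompute the SEX='female' row of SOURCE_WOE from its counts */
DATA _NULL_;
    NONRESP=81;      RESP=233;      /* non-events and events in the bin */
    TOT_NONRESP=549; TOT_RESP=342;  /* totals over all bins             */
    PCT_NONRESP=NONRESP/TOT_NONRESP;
    PCT_RESP=RESP/TOT_RESP;
    WOE=LOG(PCT_NONRESP/PCT_RESP);  /* LOG() is the natural logarithm   */
    IV=(PCT_NONRESP-PCT_RESP)*WOE;
    /* Write the results to the SAS log with 5 decimals */
    PUT PCT_NONRESP= 8.5 PCT_RESP= 8.5 WOE= 8.5 IV= 8.5;
RUN;
```

The values written to the log should agree with the PCT_NONRESP, PCT_RESP, WOE and IV columns of the 'female' row.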


<font face="Verdana, cursive, sans-serif" >
<H3>Macro to calculate WOE/IV for NUMERIC</H3>
<br>A numeric variable needs to be discretized before proceeding with the same calculation. 
Let's take AGE as an example and bin it in two ways:
- Method 1: use __PROC RANK__ to bin into 3 groups, based on the ranks of the numeric values
- Method 2: use __PROC FORMAT__ to bin into groups, based on user-defined rules

In [None]:
%GLOBAL CNT;
%LET CNT=0;

*To keep list of var and iv;
DATA SOURCE_IV;
INFILE DATALINES DSD DELIMITER='|' MISSOVER ; 
INFORMAT VARNAME $100. ;
INPUT IV VARNAME$ BUCKET;
DATALINES ;
;
RUN;

*To keep list of var and WOE;
DATA SOURCE_WOE;
INFILE DATALINES DSD DELIMITER='|' MISSOVER ; 
INFORMAT VARNAME $100. ;
INPUT BIN MIN MAX NONRESP RESP TOT_NONRESP TOT_RESP VARNAME$ PCT_NONRESP PCT_RESP WOE IV BUCKET;
DATALINES ;
;
RUN;


*To keep list of var and iv;
DATA SOURCE_IVFMT;
INFILE DATALINES DSD DELIMITER='|' MISSOVER ; 
INFORMAT VARNAME $100. BUCKET $32. ;
INPUT IV VARNAME$ BUCKET$;
DATALINES ;
;
RUN;

*To keep list of var and WOE;
DATA SOURCE_WOEFMT;
INFILE DATALINES dsd DELIMITER='|' MISSOVER ; 
INFORMAT BIN VARNAME $100. BUCKET $32. ;
INPUT BIN NONRESP RESP TOT_NONRESP TOT_RESP VARNAME$ PCT_NONRESP PCT_RESP WOE IV BUCKET;
DATALINES ;
;
RUN;


In [None]:

%MACRO DOWOEIV_NUM(TARGETVAR,SBIN);

	%PUT SBIN=&SBIN.;
	%PUT TARGETVAR=&TARGETVAR.;
	%LET CNT=%EVAL(&CNT.+1);
	%PUT &CNT.;

	DATA TEMP;
	SET TARGET(KEEP=&TARGET4MODEL. &TARGETVAR.);
	RUN;

	PROC RANK DATA=TEMP GROUP=&SBIN. OUT=TARGET_BIN&SBIN.;
		VAR &TARGETVAR.;
		RANKS &TARGETVAR._R;
	RUN;

	PROC SQL;
	CREATE TABLE TARGET_BIN&SBIN._1 AS
	(
		SELECT 	&TARGET4MODEL.
				,&TARGETVAR.
				,&TARGETVAR._R
				,SUM(&TARGET4MODEL.=0) AS TOTAL_NONRESP
				,SUM(&TARGET4MODEL.=1) AS TOTAL_RESP
		FROM TARGET_BIN&SBIN.
	)
	;
	CREATE TABLE WOE AS
	(
		SELECT 	&TARGETVAR._R
				,MIN(&TARGETVAR.) AS MIN
				,MAX(&TARGETVAR.) AS MAX
				,SUM(&TARGET4MODEL.=0) AS NONRESP /*NONRESP*/
			   	,SUM(&TARGET4MODEL.=1) AS RESP /*RESP*/
				,MEAN(TOTAL_NONRESP) AS TOT_NONRESP
				,MEAN(TOTAL_RESP) AS TOT_RESP
		FROM TARGET_BIN&SBIN._1
		GROUP BY &TARGETVAR._R
	)
	;
	RUN;QUIT;

	DATA _WOE_&SBIN._&CNT.(RENAME=(&TARGETVAR._R=BIN));
		SET WOE;
		INFORMAT VARNAME $100.;
		PCT_NONRESP=NONRESP/TOT_NONRESP;
		PCT_RESP=RESP/TOT_RESP;
		WOE=0;
		IF PCT_RESP>0 THEN	WOE=LOG(PCT_NONRESP/PCT_RESP);
		IV=(PCT_NONRESP-PCT_RESP)*WOE;
		VARNAME = "&TARGETVAR.";
		BUCKET=&SBIN.;
	RUN;

	PROC MEANS DATA=_WOE_&SBIN._&CNT. NOPRINT; VAR IV; OUTPUT OUT=_IV_&SBIN._&CNT. SUM=IV; RUN;

	DATA  _IV_&SBIN._&CNT.(DROP= _TYPE_ _FREQ_);
	SET _IV_&SBIN._&CNT.;
		INFORMAT VARNAME $100.;
		VARNAME = "&TARGETVAR.";
		BUCKET=&SBIN.;
	RUN;

	PROC APPEND DATA=_IV_&SBIN._&CNT. BASE=SOURCE_IV FORCE; RUN;
	PROC APPEND DATA=_WOE_&SBIN._&CNT. BASE=SOURCE_WOE FORCE; RUN;

	PROC DELETE DATA=TARGET_BIN&SBIN._1  TARGET_BIN&SBIN. WOE  
	_IV_&SBIN._&CNT. _WOE_&SBIN._&CNT. TEMP; RUN;

%MEND;


%MACRO DOWOEIV_FMT(TARGETVAR,FMT);

    %PUT FMT=&FMT.;
	%PUT TARGETVAR=&TARGETVAR.;
	%LET CNT=%EVAL(&CNT.+1);
	%PUT &CNT.;

	DATA TEMP;
	SET TARGET(KEEP=&TARGET4MODEL. &TARGETVAR.);
	&TARGETVAR._BIN=PUT(&TARGETVAR. ,&FMT.);
	RUN;

	PROC SQL;
	CREATE TABLE TARGET_1 AS
	(
		SELECT 	&TARGET4MODEL.
				,&TARGETVAR._BIN
				,SUM(&TARGET4MODEL.=0) AS TOTAL_NONRESP
				,SUM(&TARGET4MODEL.=1) AS TOTAL_RESP
		FROM TEMP
	)
	;
	CREATE TABLE WOE AS
	(
		SELECT 	&TARGETVAR._BIN
				,SUM(&TARGET4MODEL.=0) AS NONRESP /*NONRESP*/
			   	,SUM(&TARGET4MODEL.=1) AS RESP /*RESP*/
				,MEAN(TOTAL_NONRESP) AS TOT_NONRESP/*TRICK, ONLY USE MEAN FUNCTION TO GET TOTAL OF R AND NR*/
				,MEAN(TOTAL_RESP) AS TOT_RESP
		FROM TARGET_1
		GROUP BY &TARGETVAR._BIN
	)
	;
	RUN;QUIT;


		DATA _WOE_&CNT.(DROP= &TARGETVAR._BIN);
		INFORMAT BIN VARNAME $100. BUCKET $32.;
		SET WOE;
			PCT_NONRESP=NONRESP/TOT_NONRESP;
			PCT_RESP=RESP/TOT_RESP;
			WOE=0;
			IF PCT_RESP>0 THEN	WOE=LOG(PCT_NONRESP/PCT_RESP);
			IV=(PCT_NONRESP-PCT_RESP)*WOE;
			VARNAME = "&TARGETVAR.";
			BUCKET="&FMT.";
			BIN=&TARGETVAR._BIN;
		RUN;

		PROC MEANS DATA=_WOE_&CNT. NOPRINT; VAR IV; OUTPUT OUT=_IV_&CNT. SUM=IV; RUN;

		DATA  _IV_&CNT.(DROP= _TYPE_ _FREQ_);
		SET _IV_&CNT.;
		INFORMAT BUCKET $32.;
			INFORMAT VARNAME $100.;
			VARNAME = "&TARGETVAR.";
			BUCKET="&FMT.";
		RUN;


	PROC APPEND DATA=_IV_&CNT. BASE=SOURCE_IVFMT FORCE; RUN;
	PROC APPEND DATA=_WOE_&CNT. BASE=SOURCE_WOEFMT FORCE; RUN;
		
	PROC DELETE DATA=TARGET_1  WOE  _WOE_&CNT. _IV_&CNT. TEMP ; RUN;

%MEND;

In [None]:
PROC FORMAT;

VALUE AGE_BIN
    0-<18='0_(<18)'
    18-<35='A_[18,35)'
    35-<60='B_[35,60)'
    60-HIGH='C_[60>>)'
    OTHER='Z_MISSING'
;

RUN;

%DOWOEIV_NUM(AGE,3); 
%DOWOEIV_FMT(AGE,AGE_BIN.); 

In [11]:
*The following is the IV for each variable;
PROC PRINT DATA=SOURCE_IV;RUN;
PROC PRINT DATA=SOURCE_IVFMT;RUN;

Obs,VARNAME,IV,BUCKET
1,AGE,0.039786,3

Obs,VARNAME,BUCKET,IV
1,AGE,AGE_BIN.,0.096919


In [12]:
*The following is the WOE for each BIN of each variable. As illustrated here, a missing AGE can be replaced with 0.40378;
PROC PRINT DATA=SOURCE_WOE;RUN;
PROC PRINT DATA=SOURCE_WOEFMT;RUN;

Obs,VARNAME,BIN,MIN,MAX,NONRESP,RESP,TOT_NONRESP,TOT_RESP,PCT_NONRESP,PCT_RESP,WOE,IV,BUCKET
1,AGE,.,.,.,125,52,549,342,0.22769,0.15205,0.40378,0.030542,3
2,AGE,0,0.42,22,133,98,549,342,0.24226,0.28655,-0.16791,0.007437,3
3,AGE,1,23.00,34,149,98,549,342,0.2714,0.28655,-0.05431,0.000823,3
4,AGE,2,34.50,80,142,94,549,342,0.25865,0.27485,-0.06076,0.000984,3

Obs,BIN,VARNAME,BUCKET,NONRESP,RESP,TOT_NONRESP,TOT_RESP,PCT_NONRESP,PCT_RESP,WOE,IV
1,0_(<18),AGE,AGE_BIN.,52,61,549,342,0.09472,0.17836,-0.63292,0.05294
2,"A_[18,35)",AGE,AGE_BIN.,231,135,549,342,0.42077,0.39474,0.06386,0.001662
3,"B_[35,60)",AGE,AGE_BIN.,122,87,549,342,0.22222,0.25439,-0.13517,0.004348
4,C_[60>>),AGE,AGE_BIN.,19,7,549,342,0.03461,0.02047,0.52524,0.007427
5,Z_MISSING,AGE,AGE_BIN.,125,52,549,342,0.22769,0.15205,0.40378,0.030542
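
The variable-level IV is simply the sum of the per-bin IV values, which is why the two binning schemes give different totals for the same variable. A minimal check, assuming `SOURCE_WOE` and `SOURCE_WOEFMT` still exist in `WORK` as printed above (the column name `IV_TOTAL` is chosen here for illustration):

```sas
PROC SQL;
    /* PROC RANK bins: one total per variable and bucket count */
    SELECT VARNAME, BUCKET, SUM(IV) AS IV_TOTAL
    FROM SOURCE_WOE
    GROUP BY VARNAME, BUCKET;

    /* PROC FORMAT bins: one total per variable and format */
    SELECT VARNAME, BUCKET, SUM(IV) AS IV_TOTAL
    FROM SOURCE_WOEFMT
    GROUP BY VARNAME, BUCKET;
QUIT;
```

The totals should match the IV values printed from `SOURCE_IV` and `SOURCE_IVFMT`.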


<font face="Verdana, cursive, sans-serif" >
<H2><font color='darkred'>Part-2: Advanced - Automate the calculation of WOE/IV </font></H2>

<p></p>

<p>Now that we understand how <code>DOWOEIV_CHAR()</code>, <code>DOWOEIV_NUM()</code> and <code>DOWOEIV_FMT()</code> work, we take things a little further by looping through all variables in a dataset. The macros carry out the following strategy:</p>

<ul>
<li>Calculate the % populated for each variable, and exclude variables that are less than 75% populated </li>
<li>If a variable is of type DATE or DATETIME, or is some kind of identifier such as Citizen ID, Customer ID or Name, exclude it </li>
<li>Calculate basic statistics for all remaining variables, and carefully decide how many BINS each variable can support </li>
<li>For the remaining variables</li>
    <ul>
      <li>For character variables and flags, calculate WOE/IV with <code>DOWOEIV_CHAR()</code></li>
      <li>For numeric variables, calculate WOE/IV with <code>DOWOEIV_NUM()</code> and, optionally for some variables, <code>DOWOEIV_FMT()</code></li>
    </ul>
</ul>

Once we have identified which variables go to which macro, we proceed with the automation

<font face="Verdana, cursive, sans-serif" >
<b>Define macros and missing/invalid values patterns</b>

In [None]:
**********************************************
Define formats for missing/invalid values.
***********************************************;
PROC FORMAT;
VALUE NM_MISS 
    .= '0' 
    99999999= '0'
    OTHER = '1'
;
VALUE $CH_MISS 
    '',' ','.','-','*'= '0' 
    'N/A','n/a','NA','N.A','-NA-','na','n.a.','n.a' = '0'
    'NULL','null','NONE','--NONE--' = '0'
    'unknown','UNKNOWN','Z_ERROR','Z_MISSING'= '0'
    '99999999','X','TESTUSER','U','C9999'= '0'
    'email@domain.com'= '0'
    OTHER = '1'
;
VALUE $NM_MISSLABEL
    '0'="MISS/INVALID"
    '1'="POPULATED"
;
RUN;

%INCLUDE "&LOC_SASMACRO.\MISSING_REPORTS.SAS";
%INCLUDE "&LOC_SASMACRO.\QUICKSTATS.SAS";

<font face="Verdana, cursive, sans-serif" >
<b>Take a look at a sample of the data</b>
<br>It is already obvious at this step that __NAME, PASSENGERID and TICKET__ are likely to be dropped from the IV calculation

In [14]:
PROC PRINT DATA=TITANIC(OBS=5);RUN;

Obs,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
3,3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
4,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S
5,5,0,3,"Allen, Mr. William Henry",male,35,0,0,373450,8.05,,S


In [None]:
%MISSING_REPORT(DSNAME=TITANIC,
                FMT_MISSNUM=NM_MISS.,
                FMT_MISSCHAR=$CH_MISS.); 

<font face="Verdana, cursive, sans-serif" >
The missing report shows that <b>CABIN</b> is only 22.9% populated, so it is likely to be dropped

In [16]:
PROC PRINT DATA=MSREPORT_TITANIC ; RUN;

Obs,VAR,MISS,P_MISS,OK,P_OK,TYPE,LENGTH
1,Age,177,19.9,714,80.1,1,8
2,Cabin,687,77.1,204,22.9,2,50
3,Embarked,2,0.2,889,99.8,2,50
4,Fare,0,0.0,891,100.0,1,8
5,Name,0,0.0,891,100.0,2,100
6,Parch,0,0.0,891,100.0,1,8
7,PassengerId,0,0.0,891,100.0,1,8
8,Pclass,0,0.0,891,100.0,1,8
9,Sex,0,0.0,891,100.0,2,6
10,SibSp,0,0.0,891,100.0,1,8


In [None]:
%QUICKSTATS(DSNAME=TITANIC,
			REPORTNAME=TITANIC,
			NLIMIT=20);

<font face="Verdana, cursive, sans-serif" >
<b>Quickstats shows that, for the numeric variables (TYPE=1):</b>
<ul>
<li>AGE and FARE can be discretized into 3, 5 or 10 bins</li>
<li>PARCH, PCLASS and SIBSP have only a few distinct values; rather than discretizing them into 3, 5 or 10 bins, it is better to treat each distinct value as its own bin (bins=NLEVELS) </li>
</ul>


In [18]:
PROC PRINT DATA=TITANIC_QSTATS_NM;RUN;

Obs,LIBNAME,MEMNAME,NAME,TYPE,FORMAT,INFORMAT,N,MIN,MAX,MEAN,STD,NLevels,NMissLevels,NNonMissLevels
1,WORK,TITANIC,Age,1,,BEST,714,0.42,80.000,29.699,14.526,89,1,88
2,WORK,TITANIC,Fare,1,,BEST,891,0.00,512.329,32.204,49.693,248,0,248
3,WORK,TITANIC,Parch,1,,BEST,891,0.00,6.000,0.382,0.806,7,0,7
4,WORK,TITANIC,PassengerId,1,,BEST,891,1.00,891.000,446.000,257.354,891,0,891
5,WORK,TITANIC,Pclass,1,,BEST,891,1.00,3.000,2.309,0.836,3,0,3
6,WORK,TITANIC,SibSp,1,,BEST,891,0.00,8.000,0.523,1.103,7,0,7
7,WORK,TITANIC,Survived,1,,BEST,891,0.00,1.000,0.384,0.487,2,0,2
8,WORK,TITANIC,Cabin,2,,$,.,.,.,.,.,148,1,147
9,WORK,TITANIC,Embarked,2,,$,.,.,.,.,.,4,1,3
10,WORK,TITANIC,Name,2,,$,.,.,.,.,.,891,0,891


In [19]:
PROC PRINT DATA=TITANIC_QSTATS_CH;RUN;

Obs,LIBNAME,MEMNAME,NAME,TYPE,FORMAT,INFORMAT,Frequency,Percent,CLASS
1,WORK,TITANIC,Embarked,2,,$,168,18.86,C
2,WORK,TITANIC,Embarked,2,,$,644,72.28,S
3,WORK,TITANIC,Embarked,2,,$,77,8.64,Q
4,WORK,TITANIC,Sex,2,,$,314,35.24,female
5,WORK,TITANIC,Sex,2,,$,577,64.76,male


<font face="Verdana, cursive, sans-serif" >
<b>Filter out variables that are not needed or did not pass the criteria for IV calculation</b>

In [20]:
DATA VARLIST_CHAR VARLIST_NUM;
    SET MSREPORT_TITANIC;

    *too much missing, drop it;
    IF P_OK<75 THEN DELETE;

    *Do not include the target in the IV calculation;
    IF VAR IN ('Survived' ) THEN DELETE;

    *this is passenger id,name, ticket code which has no meaning, drop it;
    IF VAR IN ('PassengerId' 'Name' 'Ticket') THEN DELETE;

    IF TYPE=2 OR VAR IN ('Parch', 'Pclass','SibSp') THEN OUTPUT VARLIST_CHAR;
    ELSE OUTPUT VARLIST_NUM;

RUN;

PROC PRINT DATA=VARLIST_CHAR;RUN;
PROC PRINT DATA=VARLIST_NUM;RUN;

Obs,VAR,MISS,P_MISS,OK,P_OK,TYPE,LENGTH
1,Embarked,2,0.2,889,99.8,2,50
2,Parch,0,0.0,891,100.0,1,8
3,Pclass,0,0.0,891,100.0,1,8
4,Sex,0,0.0,891,100.0,2,6
5,SibSp,0,0.0,891,100.0,1,8

Obs,VAR,MISS,P_MISS,OK,P_OK,TYPE,LENGTH
1,Age,177,19.9,714,80.1,1,8
2,Fare,0,0.0,891,100.0,1,8


<font face="Verdana, cursive, sans-serif" >
<H3>Macro to AUTOMATE WOE/IV for CHARACTER</H3>

In [None]:
%ARRAY(VARLIST, DATA=VARLIST_CHAR, VAR=VAR);
%LET VARLIST = %DO_OVER(VARLIST,PHRASE=?);;

DATA TARGET(KEEP=&TARGET4MODEL.  &VARLIST.);
SET &DSBASE.;

RUN;

PROC DELETE DATA=&TARGET4MODEL._WOE_CHAR 
                &TARGET4MODEL._IV_CHAR 
                SOURCE_IV SOURCE_WOE ; RUN;

DATA SOURCE_IV;
INFILE DATALINES dsd DELIMITER='|' MISSOVER ; 
INFORMAT VARNAME $100. ;
INPUT IV VARNAME$ ;
DATALINES ;
;
RUN;

DATA SOURCE_WOE;
INFILE DATALINES dsd delimiter='|' MISSOVER ; 
INFORMAT BIN VARNAME $100. ;
INPUT BIN NONRESP RESP TOT_NONRESP TOT_RESP VARNAME$ PCT_NONRESP PCT_RESP WOE IV ;
DATALINES ;
;
RUN;


In [None]:
%GLOBAL CNT;
%LET CNT=0;

%DO_OVER(VALUES=&VARLIST.,MACRO=DOWOEIV_CHAR);

DATA &TARGET4MODEL._WOE_CHAR; 
    SET SOURCE_WOE;
RUN;

DATA &TARGET4MODEL._IV_CHAR;
    SET SOURCE_IV;
RUN;

PROC SORT DATA=&TARGET4MODEL._IV_CHAR; BY DESCENDING IV; RUN;

PROC DELETE DATA=SOURCE_WOE SOURCE_IV ; RUN;
%CLEAN_DSLABEL(WORK,&TARGET4MODEL._WOE_CHAR);
%CLEAN_DSLABEL(WORK,&TARGET4MODEL._IV_CHAR);

%PUT IV CHARS ENDED..;

In [23]:
PROC PRINT DATA=&TARGET4MODEL._IV_CHAR;RUN;
PROC PRINT DATA=&TARGET4MODEL._WOE_CHAR;RUN;

Obs,VARNAME,IV
1,Sex,1.34168
2,Pclass,0.50095
3,SibSp,0.14243
4,Embarked,0.12237
5,Parch,0.11517

Obs,BIN,VARNAME,NONRESP,RESP,TOT_NONRESP,TOT_RESP,PCT_NONRESP,PCT_RESP,WOE,IV
1,,Embarked,0,2,549,342,0.0,0.00585,.,.
2,C,Embarked,75,93,549,342,0.13661,0.27193,-0.68840,0.09315
3,Q,Embarked,47,30,549,342,0.08561,0.08772,-0.02434,0.00005
4,S,Embarked,427,217,549,342,0.77778,0.6345,0.20360,0.02917
5,0,Parch,445,233,549,342,0.81056,0.68129,0.17375,0.02246
6,1,Parch,53,65,549,342,0.09654,0.19006,-0.67738,0.06335
7,2,Parch,40,40,549,342,0.07286,0.11696,-0.47329,0.02087
8,3,Parch,2,3,549,342,0.00364,0.00877,-0.87875,0.00451
9,4,Parch,4,0,549,342,0.00729,0.0,0.00000,0.00000
10,5,Parch,4,1,549,342,0.00729,0.00292,0.91301,0.00398


<font face="Verdana, cursive, sans-serif" >
<H3>Macro to AUTOMATE WOE/IV for NUMERIC</H3>

In [None]:
*For each numeric variable, determine how many bins we can use for the automated calculation. You can set different thresholds for NLevels;
DATA BINS;
SET TITANIC_QSTATS_NM;
WHERE TYPE=1;
IF NLevels>=3 THEN R3=1; ELSE R3=0;
IF NLevels>=50 THEN R5=1; ELSE R5=0; *set higher criteria for bin5,10;
IF NLevels>=100 THEN R10=1; ELSE R10=0; *set higher criteria for bin5,10;

*Set some customised bin for some numvar;
INFORMAT FMT $50.;
FMT='NA';
IF NAME IN ('Age') THEN FMT='AGE_BIN.';

RENAME NAME=VAR;
RUN;

%QUICK_LEFTJOIN(VARLIST_NUM,BINS,VAR,OUT=VARNUMFMT(KEEP=VAR R3 R5 R10 P_OK FMT));


<font face="Verdana, cursive, sans-serif" >
<b>A dictionary for the numeric-variable IV calculation</b>

After the above EDA, we set the R3-R10 and FMT flags for the IV calculation. Note that you can change the thresholds, or add more bin counts, with minor modifications



In [25]:
PROC PRINT DATA=VARNUMFMT;RUN;

Obs,VAR,P_OK,R3,R5,R10,FMT
1,Age,80.1,1,1,0,AGE_BIN.
2,Fare,100.0,1,1,1,


In [None]:
%ARRAY(VARLIST, DATA=VARNUMFMT, VAR=VAR);
%LET VARLIST = %DO_OVER(VARLIST,PHRASE=?);;

DATA TARGET(KEEP=&TARGET4MODEL.  &VARLIST.);
SET &DSBASE.;
RUN;


PROC DELETE DATA=&TARGET4MODEL._WOE
				&TARGET4MODEL._IV 
				SOURCE_IV SOURCE_WOE 
				SOURCE_IVFMT SOURCE_WOEFMT; RUN;


DATA SOURCE_IV;
INFILE DATALINES DSD DELIMITER='|' MISSOVER ; 
INFORMAT VARNAME $100. ;
INPUT IV VARNAME$ BUCKET;
DATALINES ;
;
RUN;

DATA SOURCE_WOE;
INFILE DATALINES DSD DELIMITER='|' MISSOVER ; 
INFORMAT VARNAME $100. ;
INPUT BIN MIN MAX NONRESP RESP TOT_NONRESP TOT_RESP VARNAME$ PCT_NONRESP PCT_RESP WOE IV BUCKET;
DATALINES ;
;
RUN;

DATA SOURCE_IVFMT;
INFILE DATALINES DSD DELIMITER='|' MISSOVER ; 
INFORMAT VARNAME $100. BUCKET $32. ;
INPUT IV VARNAME$ BUCKET$;
DATALINES ;
;
RUN;

DATA SOURCE_WOEFMT;
INFILE DATALINES DSD DELIMITER='|' MISSOVER ; 
INFORMAT BIN VARNAME $100. BUCKET $32. ;
INPUT BIN NONRESP RESP TOT_NONRESP TOT_RESP VARNAME$ PCT_NONRESP PCT_RESP WOE IV BUCKET;
DATALINES ;
;
RUN;


In [None]:

%GLOBAL CNT;
%LET CNT=0;


%MACRO LOOP_DOWOEIV(TARGETVAR);
	DATA TEMP;
		SET VARNUMFMT;
		IF VAR="&TARGETVAR." THEN
		DO;
			CALL SYMPUTX('R3',R3);
			CALL SYMPUTX('R5',R5);
			CALL SYMPUTX('R10',R10);
			CALL SYMPUTX('FMT',FMT);
		END;
	RUN;
	%IF &R3. %THEN %DOWOEIV_NUM(&TARGETVAR.,3);
	%IF &R5. %THEN %DOWOEIV_NUM(&TARGETVAR.,5);
	%IF &R10. %THEN %DOWOEIV_NUM(&TARGETVAR.,10);
	%IF &FMT. NE NA %THEN %DOWOEIV_FMT(&TARGETVAR.,&FMT.);

	%LET R3=0;
	%LET R5=0;
	%LET R10=0;
	%LET FMT=NA;


%MEND;

%DO_OVER(VALUES=&VARLIST.,MACRO=LOOP_DOWOEIV);



DATA &TARGET4MODEL._WOE;
SET SOURCE_WOE;
RUN;

DATA &TARGET4MODEL._IV;
SET SOURCE_IV;
RUN;

%CLEAN_DSLABEL(WORK,&TARGET4MODEL._WOE);
%CLEAN_DSLABEL(WORK,&TARGET4MODEL._IV);

*Remove redundant rows: bucket counts that give an identical IV for the same variable;
PROC SORT DATA=&TARGET4MODEL._IV; BY VARNAME DESCENDING IV BUCKET;RUN;
PROC SORT DATA=&TARGET4MODEL._IV NODUPKEY;BY VARNAME IV;RUN;

*Consolidate FMT with BIN;
DATA &TARGET4MODEL._IV;
SET &TARGET4MODEL._IV;
INFORMAT BUCKET_TMP $32.;
BUCKET_TMP=PUT(COMPRESS(BUCKET),32.);
DROP BUCKET;
RENAME BUCKET_TMP=BUCKET;
RUN;

DATA &TARGET4MODEL._IV;
SET &TARGET4MODEL._IV SOURCE_IVFMT;
RUN;

PROC SORT DATA=&TARGET4MODEL._IV; BY VARNAME DESCENDING IV; RUN;

DATA &TARGET4MODEL._WOE;
INFORMAT BUCKET_TMP $32. BIN_TMP $100.;
SET &TARGET4MODEL._WOE;
BUCKET_TMP=PUT(COMPRESS(BUCKET),32.);
DROP BUCKET;
RENAME BUCKET_TMP=BUCKET;
BIN_TMP=PUT(COMPRESS(BIN),100.);
DROP BIN;
RENAME BIN_TMP=BIN;
RUN;

DATA &TARGET4MODEL._WOE;
SET &TARGET4MODEL._WOE SOURCE_WOEFMT;
RUN;

PROC SORT DATA=&TARGET4MODEL._WOE; BY VARNAME BUCKET BIN; RUN;

PROC DELETE DATA=SOURCE_WOE SOURCE_IV TARGET SOURCE_IVFMT SOURCE_WOEFMT; RUN;

%PUT IV-NUM ENDED..;


In [28]:
PROC PRINT DATA=&TARGET4MODEL._IV;RUN;
PROC PRINT DATA=&TARGET4MODEL._WOE;RUN;

Obs,VARNAME,IV,BUCKET
1,Age,0.09692,AGE_BIN.
2,Age,0.08209,5
3,Age,0.03979,3
4,Fare,0.64099,10
5,Fare,0.50466,5
6,Fare,0.4003,3

Obs,BUCKET,BIN,VARNAME,MIN,MAX,NONRESP,RESP,TOT_NONRESP,TOT_RESP,PCT_NONRESP,PCT_RESP,WOE,IV
1,3,.,Age,.,.,125,52,549,342,0.22769,0.15205,0.40378,0.03054
2,3,0,Age,0.4200,22.000,133,98,549,342,0.24226,0.28655,-0.16791,0.00744
3,3,1,Age,23.0000,34.000,149,98,549,342,0.2714,0.28655,-0.05431,0.00082
4,3,2,Age,34.5000,80.000,142,94,549,342,0.25865,0.27485,-0.06076,0.00098
5,5,.,Age,.,.,125,52,549,342,0.22769,0.15205,0.40378,0.03054
6,5,0,Age,0.4200,18.000,69,70,549,342,0.12568,0.20468,-0.48768,0.03852
7,5,1,Age,19.0000,24.500,91,48,549,342,0.16576,0.14035,0.16637,0.00423
8,5,2,Age,25.0000,31.000,94,56,549,342,0.17122,0.16374,0.04466,0.00033
9,5,3,Age,32.0000,41.000,81,63,549,342,0.14754,0.18421,-0.22197,0.00814
10,5,4,Age,42.0000,80.000,89,53,549,342,0.16211,0.15497,0.04506,0.00032


In [None]:
DATA TITANIC_BIN;
SET TITANIC;
AGE_BIN = PUT(AGE,AGE_BIN.);
RUN;

<font face="Verdana, cursive, sans-serif" >
<H3>Conclusion</H3>
<br>
With WOE/IV techniques, problems with missing/invalid values can be alleviated. The above WOE table also serves as a dictionary for WOE transformation. We can see from the table that missing/invalid values are allocated to a bin of their own. With missing values taken care of, we have more choices for variable selection.

<br>Let's take <b>AGE</b> as an example. It can be recoded in two ways:

<ol>
<li>Recode raw AGE to AGE_WOE. For example, if the 3-bin version of AGE is selected, a missing value is recoded as 0.40378, an AGE in [0.42, 22] is recoded as -0.16791, and so on. AGE_WOE is then treated as a numeric variable</li>
<li>Recode raw AGE to AGE_BIN with <code>PUT()</code>, as in the sample shown here. AGE_BIN is then treated as a character variable</li>
</ol>
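
The first option, recoding AGE to AGE_WOE, could be sketched as follows. The WOE values are copied from the BUCKET=3 rows of the WOE table. The cut points 23 and 34.5 are assumptions read off the MIN/MAX columns of those rows, and `TITANIC_WOE` is a hypothetical output dataset name.

```sas
DATA TITANIC_WOE;
    SET TITANIC;
    /* Missing AGE has its own bin, so no imputation is needed */
    IF MISSING(AGE)  THEN AGE_WOE= 0.40378;
    ELSE IF AGE<23   THEN AGE_WOE=-0.16791;  /* bin 0: 0.42 to 22 */
    ELSE IF AGE<34.5 THEN AGE_WOE=-0.05431;  /* bin 1: 23 to 34   */
    ELSE                  AGE_WOE=-0.06076;  /* bin 2: 34.5 to 80 */
RUN;

PROC PRINT DATA=TITANIC_WOE(OBS=3);
    VAR AGE AGE_WOE;
RUN;
```

When scoring new data, the same cut points and WOE values must be reused rather than recomputed, so that the model sees the same encoding it was trained on.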




In [30]:
PROC PRINT DATA=TITANIC_BIN(OBS=3); 
VAR AGE AGE_BIN;
RUN;

Obs,Age,AGE_BIN
1,22,"A_[18,35)"
2,38,"B_[35,60)"
3,26,"A_[18,35)"
