Weight of evidence (WOE) and information value (IV) are techniques used to perform variable transformation and selection.
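
For reference, the quantities that the `Get_WOE` function computes later in this notebook can be written out as below. The symbols are notation introduced here for illustration: for bin $i$, $E_i$ is the number of events (Y=1) and $N_i$ is the number of non-events (Y=0).

```latex
\begin{aligned}
\text{Dist\_Event}_i &= \frac{E_i}{\sum_j E_j}, \qquad
\text{Dist\_NonEvent}_i = \frac{N_i}{\sum_j N_j} \\
\mathrm{WOE}_i &= \ln\left(\frac{\text{Dist\_Event}_i}{\text{Dist\_NonEvent}_i}\right) \\
\mathrm{IV} &= \sum_i \left(\text{Dist\_Event}_i - \text{Dist\_NonEvent}_i\right)\,\mathrm{WOE}_i
\end{aligned}
```

A bin with a positive WOE holds a larger share of events than of non-events; a negative WOE means the opposite.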

## Application
1. Mostly used in classification models for variable selection
2. Widely used in credit scoring to measure the separation of good vs. bad customers

In [1]:
import numpy as np
import pandas as pd

In [2]:
data = pd.read_csv("https://stats.idre.ucla.edu/stat/data/binary.csv")

In [3]:
data.head(3)

|   | admit | gre | gpa  | rank |
|---|-------|-----|------|------|
| 0 | 0     | 380 | 3.61 | 3    |
| 1 | 1     | 660 | 3.67 | 3    |
| 2 | 1     | 800 | 4.0  | 1    |


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   admit   400 non-null    int64  
 1   gre     400 non-null    int64  
 2   gpa     400 non-null    float64
 3   rank    400 non-null    int64  
dtypes: float64(1), int64(3)
memory usage: 12.6 KB


In [5]:
data.describe().T

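|       | count | mean   | std        | min   | 25%   | 50%   | 75%   | max   |
|-------|-------|--------|------------|-------|-------|-------|-------|-------|
| admit | 400.0 | 0.3175 | 0.466087   | 0.0   | 0.0   | 0.0   | 1.0   | 1.0   |
| gre   | 400.0 | 587.7  | 115.516536 | 220.0 | 520.0 | 580.0 | 660.0 | 800.0 |
| gpa   | 400.0 | 3.3899 | 0.380567   | 2.26  | 3.13  | 3.395 | 3.67  | 4.0   |
| rank  | 400.0 | 2.485  | 0.94446    | 1.0   | 2.0   | 2.0   | 3.0   | 4.0   |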


Perform missing value imputation before applying the WOE transformation.

This dataset doesn't have any missing values (`data.info()` reports 400 non-null entries for every column), so we proceed to the next steps.
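
For datasets that do contain gaps, a minimal sketch of the check-then-impute step might look like the following. The toy frame and the choice of median imputation are assumptions made for illustration; they are not part of this notebook's data or method.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with two missing values (not the admissions data)
toy = pd.DataFrame({
    "gre": [380.0, np.nan, 800.0, 640.0],
    "gpa": [3.61, 3.67, np.nan, 3.19],
})

# Count missing values per column before deciding whether to impute
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Median imputation is one simple option; other strategies may suit the data better
imputed = toy.fillna(toy.median(numeric_only=True))
print(imputed)
```

Running the same `isna().sum()` check on `data` should report zero for every column, which is consistent with the `data.info()` output above.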

In [6]:
def Get_WOE(df,X,Y):
    """
    Calculate WOE and IV for feature X of dataframe df against the binary target Y.
    The feature should be numerical in nature.
    Returns (woe_df, ivdf): the per-bin WOE table and a one-row frame with the total IV.
    Assumption : a feature with 3 to 9 unique values is grouped by each distinct value;
                 any other feature is cut into (up to) 10 quantile bins. (Can change accordingly)
    """
    
    woe_df = pd.DataFrame()
    tempdf = pd.DataFrame()
    
    if 2<(len(df[X].unique()))<10:                            ## if number of unique values gt 2 and lt 10 
        
        TotalY = df.groupby(X)[Y].count()
        SumOf1 = df.groupby(X)[Y].sum()
        SumOf0 = TotalY - SumOf1
        
        tempdf['Min'] = df[X].groupby(df[X]).min()
        tempdf['Max'] = df[X].groupby(df[X]).max()
    
    else:
        binned_x = pd.qcut(df[X],10,labels=False,duplicates='drop')  ## binning X variable into (up to) 10 quantile bins;
                                                                    ## labels=False returns integer bin codes, so dropping
                                                                    ## duplicate edges cannot cause a label-count mismatch
    
        TotalY = df.groupby(binned_x)[Y].count()                  ## Total Y cases in each bin
        SumOf1 = df.groupby(binned_x)[Y].sum()                    ## Sum of Y=1 cases in each bin
        SumOf0 = TotalY - SumOf1                                  ## Sum of Y=0 cases in each bin
    
        tempdf['Min'] = df[X].groupby(binned_x).min()             ## Minimum of each bin
        tempdf['Max'] = df[X].groupby(binned_x).max()              ## Maximum of each bin
    
    tempdf['Count'] = TotalY  
    tempdf['Event'] = SumOf1                                   ## Event-> Y=1
    tempdf['NonEvent'] = SumOf0                                ## NonEvent -> Y=0
    
    tempdf.insert(loc=0,column='Variable',value=X)                                   ## assigning Variable name
    
    tempdf['Dist_Event'] = tempdf['Event']/tempdf['Event'].sum()                     ## % of event in each bin (y=1)
    
    tempdf['Dist_NonEvent'] = tempdf['NonEvent']/tempdf['NonEvent'].sum()            ## % of nonevent in each bin (y=0)
    
    tempdf['WOE']= np.log(tempdf.Dist_Event/tempdf.Dist_NonEvent)
    
    tempdf['IV'] = tempdf['WOE']* (tempdf['Dist_Event'] - tempdf['Dist_NonEvent'])
    
    print("IV value of " + X + ":" + str(round(tempdf['IV'].sum(),5)))
    
    ivdf = pd.DataFrame({'Variable': [X], 
                         'IV': [tempdf['IV'].sum()]})
    
    woe_df = pd.concat([woe_df,tempdf],axis=0)
    
    return woe_df,ivdf
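
One detail of the quantile binning is worth illustrating: when a feature has many tied values, `duplicates='drop'` can leave fewer than ten bins, and a fixed list of ten labels then no longer matches the number of bins. The sketch below uses a made-up series (an assumption for illustration, not the admissions data) to show that situation and that `labels=False` avoids it.

```python
import pandas as pd

# Hypothetical feature in which half of the values are tied at 1
values = pd.Series([1] * 50 + list(range(2, 52)))

# A fixed list of 10 labels fails once duplicate edges have been dropped
try:
    pd.qcut(values, 10, labels=list(range(1, 11)), duplicates="drop")
    fixed_labels_ok = True
except ValueError as err:
    fixed_labels_ok = False
    print("qcut with fixed labels failed:", err)

# labels=False returns integer bin codes for however many bins remain
binned = pd.qcut(values, 10, labels=False, duplicates="drop")
print("Number of bins actually produced:", binned.nunique())
```

For the admissions data neither `gre` nor `gpa` appears to trigger this case, so the results shown in this notebook are unaffected.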

In [7]:
woe_frames = []
iv_frames = []

for feature in data.columns.difference(['admit']):
    WOE,IV = Get_WOE(df=data,X=feature,Y='admit')
    woe_frames.insert(0,WOE)        ## prepend, so the last feature processed comes first
    iv_frames.insert(0,IV)

WOE_DF = pd.concat(woe_frames,axis=0,ignore_index=True)
IV_DF  = pd.concat(iv_frames,axis=0,ignore_index=True)
        

IV value of gpa:0.27002
IV value of gre:0.31288
IV value of rank:0.29204


In [8]:
WOE_DF

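|   | Variable | Min   | Max   | Count | Event | NonEvent | Dist_Event | Dist_NonEvent | WOE       | IV       |
|---|----------|-------|-------|-------|-------|----------|------------|---------------|-----------|----------|
| 0 | rank     | 1.0   | 1.0   | 61    | 33    | 28       | 0.259843   | 0.102564      | 0.929588  | 0.146204 |
| 1 | rank     | 2.0   | 2.0   | 151   | 54    | 97       | 0.425197   | 0.355311      | 0.179558  | 0.012548 |
| 2 | rank     | 3.0   | 3.0   | 121   | 28    | 93       | 0.220472   | 0.340659      | -0.43511  | 0.052295 |
| 3 | rank     | 4.0   | 4.0   | 67    | 12    | 55       | 0.094488   | 0.201465      | -0.757142 | 0.080997 |
| 4 | gre      | 220.0 | 440.0 | 48    | 6     | 42       | 0.047244   | 0.153846      | -1.180625 | 0.125857 |
| 5 | gre      | 460.0 | 500.0 | 51    | 12    | 39       | 0.094488   | 0.142857      | -0.41337  | 0.019994 |
| 6 | gre      | 520.0 | 520.0 | 24    | 10    | 14       | 0.07874    | 0.051282      | 0.428812  | 0.011774 |
| 7 | gre      | 540.0 | 560.0 | 51    | 15    | 36       | 0.11811    | 0.131868      | -0.110184 | 0.001516 |
| 8 | gre      | 580.0 | 580.0 | 29    | 6     | 23       | 0.047244   | 0.084249      | -0.57845  | 0.021406 |
| 9 | gre      | 600.0 | 620.0 | 53    | 21    | 32       | 0.165354   | 0.117216      | 0.344071  | 0.016563 |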
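
As a sanity check, the `rank` rows can be recomputed directly from the Event and NonEvent counts printed above. This is only a sketch: it copies the counts from the table and skips the binning logic of `Get_WOE`; the variable names are chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Event / NonEvent counts for `rank`, copied from the WOE_DF output above
rank_counts = pd.DataFrame(
    {"Event": [33, 54, 28, 12], "NonEvent": [28, 97, 93, 55]},
    index=pd.Index([1, 2, 3, 4], name="rank"),
)

# Share of all events and of all non-events that falls into each rank
dist_event = rank_counts["Event"] / rank_counts["Event"].sum()
dist_non_event = rank_counts["NonEvent"] / rank_counts["NonEvent"].sum()

# WOE per rank, and each rank's contribution to the IV
rank_counts["WOE"] = np.log(dist_event / dist_non_event)
rank_counts["IV"] = rank_counts["WOE"] * (dist_event - dist_non_event)

print(rank_counts.round(6))
print("Total IV for rank:", round(rank_counts["IV"].sum(), 6))
```

The WOE and IV columns should agree with the first four rows of the table, and the total with the `IV value of rank` line printed earlier.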


In [9]:
IV_DF

|   | Variable | IV       |
|---|----------|----------|
| 0 | rank     | 0.292044 |
| 1 | gre      | 0.312882 |
| 2 | gpa      | 0.27002  |
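
Since IV is used here for variable selection, one way to act on these numbers is to rank the features and keep those above a cut-off. The sketch below re-enters the three IV values shown above; the threshold of 0.1 is a hypothetical choice for illustration, not a value taken from this notebook.

```python
import pandas as pd

# IV values copied from the IV_DF output above
iv_df = pd.DataFrame({
    "Variable": ["rank", "gre", "gpa"],
    "IV": [0.292044, 0.312882, 0.270020],
})

IV_THRESHOLD = 0.1  # hypothetical cut-off; tune for the problem at hand

# Rank features from most to least informative
ranked = iv_df.sort_values("IV", ascending=False).reset_index(drop=True)

# Keep only the features whose IV reaches the threshold
selected = ranked.loc[ranked["IV"] >= IV_THRESHOLD, "Variable"].tolist()

print(ranked)
print("Selected features:", selected)
```

With this threshold all three features are retained, with `gre` ranked first.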


## Transforming WOE variables into original data set