# AI Working Group
#### 9/28/18

In [1]:
import sys, os
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from IPython.display import display, Image, HTML

### Information Value (IV) and Weights of Evidence (WoE)

We'll look at techniques involving IV and WoE that can potentially assist the implementation of the MRM AI WG Idea "Variable Selection for PPNR Models" by Earvin.

First, let's cover a few fundamental concepts. An introduction to IV/WoE is available here:

https://www.listendata.com/2015/03/weight-of-evidence-woe-and-information.html

#### _The Information Value of x for measuring y is a number that attempts to quantify the predictive power of x in capturing y._

In [2]:
Image(url="https://cdn-images-1.medium.com/max/800/1*6Aw782wiyiFtzvK7EOY8CA.png",width=250)

In [3]:
Image(url="https://cdn-images-1.medium.com/max/800/1*9Gi0fGyTpxfwM2TpV4GZQQ.png",width=400)
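For reference, here is the convention implemented by the `calc_iv` function later in this notebook, written out with its column names (the index $i$ runs over the bins or levels of the feature):

$$
\text{Distribution Good}_i = \frac{\text{Good}_i}{\sum_j \text{Good}_j}, \qquad
\text{Distribution Bad}_i = \frac{\text{Bad}_i}{\sum_j \text{Bad}_j}
$$

$$
\mathrm{WoE}_i = \ln\left(\frac{\text{Distribution Good}_i}{\text{Distribution Bad}_i}\right), \qquad
\mathrm{IV} = \sum_i \left(\text{Distribution Good}_i - \text{Distribution Bad}_i\right)\,\mathrm{WoE}_i
$$

As a quick check against the `groups` output further down: for group A, $\ln(0.383613 / 0.365011) \approx 0.0497$, which matches the printed WoE.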

Good examples are here:

https://medium.com/@sundarstyles89/weight-of-evidence-and-information-value-using-python-6f05072e83eb

http://www.angoss.com/information-value-a-numerical-example/

In [4]:
Image(url="https://cdn-images-1.medium.com/max/1000/1*kP86RE7G0_0pi7E6bnlUZA.png",width=800)

On assessing the results from IV, one rule of thumb introduced by Naeem Siddiqi in his book "Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring" (link to PDF below) is as follows:

https://pdfs.semanticscholar.org/dd5c/7f59d20d9a00d4c93e3d6a7e9973f3462e7e.pdf

In [5]:
Image(url="https://cdn-images-1.medium.com/max/800/1*5S_5aAHWe0_knDGZUK3W8w.png",width=500)
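The rule of thumb can be turned into a small lookup. The sketch below assumes the thresholds as they are commonly quoted for this rule (0.02, 0.1, 0.3, 0.5); the function name `iv_strength`, the label wording, and the handling of values exactly on a boundary are our own choices, so defer to the table above if they differ.

```python
def iv_strength(iv):
    """Map an IV value to a rough label (thresholds are assumed, see note above)."""
    if iv < 0.02:
        return "not useful"
    if iv < 0.1:
        return "weak"
    if iv < 0.3:
        return "medium"
    if iv < 0.5:
        return "strong"
    return "suspicious"


# Print the label for a few illustrative IV values.
for value in (0.01, 0.05, 0.2, 0.4, 0.8):
    print(value, iv_strength(value))
```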

One key aspect of WoE and IV is the choice of binning (buckets) for the independent variables (features). There are many techniques for this, including decision trees. R has many packages that cater to this; one example is:

https://cran.r-project.org/web/packages/woeBinning/woeBinning.pdf

We will run the example below with arbitrary binning for 'age'. The other features already have natural binning, as they are categorical variables.
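As a contrast to hand-picked bin edges, here is a minimal sketch of quantile binning with `pd.qcut`, which chooses edges so that each bin holds roughly the same number of observations. The ages are synthetic (drawn uniformly for illustration, not taken from the dataset), and five bins is an arbitrary choice.

```python
import numpy as np
import pandas as pd

# Synthetic ages for illustration only (integers from 18 to 64).
rng = np.random.default_rng(0)
ages = pd.Series(rng.integers(18, 65, size=1000), name="age")

# Five quantile-based bins; each holds roughly 20% of the observations.
age_quantile_bins = pd.qcut(ages, q=5)
bin_counts = age_quantile_bins.value_counts().sort_index()
print(bin_counts)
```

Ties at the bin edges mean the counts are only approximately equal.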

### Numerical Example

The data for this numerical example was downloaded from here:

https://www.kaggle.com/varungitboi/employee-salary-dataset/version/1

We'll use Python's Pandas library in this example. For a review of Pandas, take a look here:

https://pandas.pydata.org/pandas-docs/stable/indexing.html

In [6]:
data = pd.read_csv('180928_employee_data.csv')
data.head()

Unnamed: 0.1,Unnamed: 0,id,groups,age,healthy_eating,active_lifestyle,salary
0,0,0,A,36,5,5,2297
1,1,1,A,55,3,5,1134
2,2,2,A,61,8,1,4969
3,3,3,O,29,3,6,902
4,4,4,O,34,6,2,3574


In [7]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 7 columns):
Unnamed: 0          1000 non-null int64
id                  1000 non-null int64
groups              1000 non-null object
age                 1000 non-null int64
healthy_eating      1000 non-null int64
active_lifestyle    1000 non-null int64
salary              1000 non-null int64
dtypes: int64(6), object(1)
memory usage: 54.8+ KB


In [8]:
data.describe()

Unnamed: 0.1,Unnamed: 0,id,age,healthy_eating,active_lifestyle,salary
count,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0
mean,499.5,499.5,41.155,4.944,5.683,2227.461
std,288.819436,288.819436,13.462995,2.013186,2.048587,1080.20976
min,0.0,0.0,18.0,0.0,0.0,553.0
25%,249.75,249.75,30.0,4.0,4.0,1360.0
50%,499.5,499.5,41.0,5.0,6.0,2174.0
75%,749.25,749.25,53.0,6.0,7.0,2993.75
max,999.0,999.0,64.0,10.0,10.0,5550.0


In [9]:
data['bin_class'] = 0
data.describe()

Unnamed: 0.1,Unnamed: 0,id,age,healthy_eating,active_lifestyle,salary,bin_class
count,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0
mean,499.5,499.5,41.155,4.944,5.683,2227.461,0.0
std,288.819436,288.819436,13.462995,2.013186,2.048587,1080.20976,0.0
min,0.0,0.0,18.0,0.0,0.0,553.0,0.0
25%,249.75,249.75,30.0,4.0,4.0,1360.0,0.0
50%,499.5,499.5,41.0,5.0,6.0,2174.0,0.0
75%,749.25,749.25,53.0,6.0,7.0,2993.75,0.0
max,999.0,999.0,64.0,10.0,10.0,5550.0,0.0


In [10]:
data.loc[data['salary']>data['salary'].mean(),'bin_class'] = 1
data.describe()

Unnamed: 0.1,Unnamed: 0,id,age,healthy_eating,active_lifestyle,salary,bin_class
count,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0
mean,499.5,499.5,41.155,4.944,5.683,2227.461,0.463
std,288.819436,288.819436,13.462995,2.013186,2.048587,1080.20976,0.498879
min,0.0,0.0,18.0,0.0,0.0,553.0,0.0
25%,249.75,249.75,30.0,4.0,4.0,1360.0,0.0
50%,499.5,499.5,41.0,5.0,6.0,2174.0,0.0
75%,749.25,749.25,53.0,6.0,7.0,2993.75,1.0
max,999.0,999.0,64.0,10.0,10.0,5550.0,1.0


In [11]:
bins = [0, 20, 30, 40, 50, 60, 150]
data['age_binned'] = pd.cut(data['age'],bins=bins)
data['age_class'] = np.searchsorted(bins,data['age'].values)
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
Unnamed: 0          1000 non-null int64
id                  1000 non-null int64
groups              1000 non-null object
age                 1000 non-null int64
healthy_eating      1000 non-null int64
active_lifestyle    1000 non-null int64
salary              1000 non-null int64
bin_class           1000 non-null int64
age_binned          1000 non-null category
age_class           1000 non-null int64
dtypes: category(1), int64(8), object(1)
memory usage: 71.5+ KB


In [12]:
data.head()

Unnamed: 0.1,Unnamed: 0,id,groups,age,healthy_eating,active_lifestyle,salary,bin_class,age_binned,age_class
0,0,0,A,36,5,5,2297,1,"(30, 40]",3
1,1,1,A,55,3,5,1134,0,"(50, 60]",5
2,2,2,A,61,8,1,4969,1,"(60, 150]",6
3,3,3,O,29,3,6,902,0,"(20, 30]",2
4,4,4,O,34,6,2,3574,1,"(30, 40]",3
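The `age_binned` and `age_class` columns encode the same binning in two ways: `pd.cut` produces right-closed intervals by default, and `np.searchsorted` (with its default `side='left'`) returns the 1-based position of the same interval. A small check on a few illustrative ages (made up here, including the edge value 20):

```python
import numpy as np
import pandas as pd

bins = [0, 20, 30, 40, 50, 60, 150]
ages = pd.Series([18, 20, 21, 36, 55, 61])  # illustrative ages, not from the dataset

binned = pd.cut(ages, bins=bins)                            # intervals such as (30, 40]
codes_from_cut = binned.cat.codes.to_numpy() + 1            # 1-based interval position
codes_from_search = np.searchsorted(bins, ages.to_numpy())  # default side='left'

print(codes_from_cut)
print(codes_from_search)
```

The two arrays agree as long as every age falls strictly above the first edge and at or below the last one.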


### IV Calculation: Categorical Target Variable

The function below is taken from this link:

https://www.kaggle.com/puremath86/iv-woe-starter-for-python

In [13]:
# Calculate information value
def calc_iv(df, feature, target, pr=False):
    """
    Set pr=True to enable printing of output.
    
    Output: 
      * iv: float,
      * data: pandas.DataFrame
    """

    lst = []

    df = df.copy()  # work on a copy so the caller's DataFrame is not modified
    df[feature] = df[feature].fillna("NULL")

    for i in range(df[feature].nunique()):
        val = list(df[feature].unique())[i]
        lst.append([feature,                                                        # Variable
                    val,                                                            # Value
                    df[df[feature] == val].count()[feature],                        # All
                    df[(df[feature] == val) & (df[target] == 0)].count()[feature],  # Good (think: Fraud == 0)
                    df[(df[feature] == val) & (df[target] == 1)].count()[feature]]) # Bad (think: Fraud == 1)

    data = pd.DataFrame(lst, columns=['Variable', 'Value', 'All', 'Good', 'Bad'])

    data['Share'] = data['All'] / data['All'].sum()
    data['Bad Rate'] = data['Bad'] / data['All']
    data['Distribution Good'] = (data['All'] - data['Bad']) / (data['All'].sum() - data['Bad'].sum())
    data['Distribution Bad'] = data['Bad'] / data['Bad'].sum()
    data['WoE'] = np.log(data['Distribution Good'] / data['Distribution Bad'])

    data = data.replace({'WoE': {np.inf: 0, -np.inf: 0}})

    data['IV'] = data['WoE'] * (data['Distribution Good'] - data['Distribution Bad'])

    data = data.sort_values(by=['Variable', 'Value'], ascending=[True, True])
    data.index = range(len(data.index))

    if pr:
        print(data)
        print('IV = ', data['IV'].sum())


    iv = data['IV'].sum()
    # print(iv)

    return iv, data

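Let's see how well each of the variables in the dataset explains whether one is in the higher or lower salary bracket. Note that in the output, 'Bad' simply denotes `bin_class == 1` (above-average salary) and 'Good' denotes `bin_class == 0`.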

In [14]:
iv, dt = calc_iv(data,'groups','bin_class',pr=True)

  Variable Value  All  Good  Bad  Share  Bad Rate  Distribution Good  \
0   groups     A  375   206  169  0.375  0.450667           0.383613   
1   groups    AB  125    73   52  0.125  0.416000           0.135940   
2   groups     B  125    67   58  0.125  0.464000           0.124767   
3   groups     O  375   191  184  0.375  0.490667           0.355680   

   Distribution Bad       WoE        IV  
0          0.365011  0.049706  0.000925  
1          0.112311  0.190945  0.004512  
2          0.125270 -0.004021  0.000002  
3          0.397408 -0.110933  0.004629  
IV =  0.0100676446715


In [15]:
iv, dt = calc_iv(data,'age_class','bin_class',pr=True)

    Variable  Value  All  Good  Bad  Share  Bad Rate  Distribution Good  \
0  age_class      1   64    35   29  0.064  0.453125           0.065177   
1  age_class      2  204    99  105  0.204  0.514706           0.184358   
2  age_class      3  208   112   96  0.208  0.461538           0.208566   
3  age_class      4  224   120  104  0.224  0.464286           0.223464   
4  age_class      5  216   123   93  0.216  0.430556           0.229050   
5  age_class      6   84    48   36  0.084  0.428571           0.089385   

   Distribution Bad       WoE        IV  
0          0.062635  0.039781  0.000101  
1          0.226782 -0.207112  0.008787  
2          0.207343  0.005880  0.000007  
3          0.224622 -0.005170  0.000006  
4          0.200864  0.131314  0.003701  
5          0.077754  0.139411  0.001622  
IV =  0.0142237075675


In [16]:
iv, dt = calc_iv(data,'healthy_eating','bin_class',pr=True)

          Variable  Value  All  Good  Bad  Share  Bad Rate  Distribution Good  \
0   healthy_eating      0   14    12    2  0.014  0.142857           0.022346   
1   healthy_eating      1   25    24    1  0.025  0.040000           0.044693   
2   healthy_eating      2   71    71    0  0.071  0.000000           0.132216   
3   healthy_eating      3  138   137    1  0.138  0.007246           0.255121   
4   healthy_eating      4  173   157   16  0.173  0.092486           0.292365   
5   healthy_eating      5  179   102   77  0.179  0.430168           0.189944   
6   healthy_eating      6  176    34  142  0.176  0.806818           0.063315   
7   healthy_eating      7  116     0  116  0.116  1.000000           0.000000   
8   healthy_eating      8   73     0   73  0.073  1.000000           0.000000   
9   healthy_eating      9   26     0   26  0.026  1.000000           0.000000   
10  healthy_eating     10    9     0    9  0.009  1.000000           0.000000   

    Distribution Bad       



In [17]:
iv, dt = calc_iv(data,'active_lifestyle','bin_class',pr=True)

            Variable  Value  All  Good  Bad  Share  Bad Rate  \
0   active_lifestyle      0    7     0    7  0.007  1.000000   
1   active_lifestyle      1   26     4   22  0.026  0.846154   
2   active_lifestyle      2   34     8   26  0.034  0.764706   
3   active_lifestyle      3   92    39   53  0.092  0.576087   
4   active_lifestyle      4  104    51   53  0.104  0.509615   
5   active_lifestyle      5  168    75   93  0.168  0.553571   
6   active_lifestyle      6  213   116   97  0.213  0.455399   
7   active_lifestyle      7  163    99   64  0.163  0.392638   
8   active_lifestyle      8  114    82   32  0.114  0.280702   
9   active_lifestyle      9   64    53   11  0.064  0.171875   
10  active_lifestyle     10   15    10    5  0.015  0.333333   

    Distribution Good  Distribution Bad       WoE        IV  
0            0.000000          0.015119  0.000000 -0.000000  
1            0.007449          0.047516 -1.853019  0.074246  
2            0.014898          0.056156 -1.32



There are many ways to implement this; in the end, it's trivial maths. It is nonetheless very useful for feature selection, especially in the old days when machines were not powerful enough to run advanced ML.
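To illustrate that point, here is a sketch of a more compact variant built on `pd.crosstab`. The helper name `iv_crosstab` is hypothetical. It follows the same convention as `calc_iv` (target 0 counted as Good, target 1 as Bad, infinite WoE replaced by 0). The toy frame is rebuilt from the Good/Bad counts printed for `groups` above, so the result should be close to the IV reported there.

```python
import numpy as np
import pandas as pd


def iv_crosstab(df, feature, target):
    """Hypothetical helper: WoE and IV per level of `feature` for a 0/1 `target`."""
    counts = pd.crosstab(df[feature], df[target]).reindex(columns=[0, 1], fill_value=0)
    dist_good = counts[0] / counts[0].sum()   # share of target == 0 in each level
    dist_bad = counts[1] / counts[1].sum()    # share of target == 1 in each level
    with np.errstate(divide="ignore", invalid="ignore"):
        woe = np.log(dist_good / dist_bad)
    woe = woe.replace([np.inf, -np.inf], 0.0)  # same convention as calc_iv
    table = pd.DataFrame({"Distribution Good": dist_good,
                          "Distribution Bad": dist_bad,
                          "WoE": woe})
    table["IV"] = (dist_good - dist_bad) * woe
    return table["IV"].sum(), table


# Rebuild a frame from the (Good, Bad) counts shown in the 'groups' output.
group_counts = {"A": (206, 169), "AB": (73, 52), "B": (67, 58), "O": (191, 184)}
rows = []
for level, (good, bad) in group_counts.items():
    rows += [(level, 0)] * good + [(level, 1)] * bad
toy = pd.DataFrame(rows, columns=["groups", "bin_class"])

iv_groups, table_groups = iv_crosstab(toy, "groups", "bin_class")
print(table_groups)
print("IV =", iv_groups)
```

Replacing an infinite WoE by 0 means that levels with no Good or no Bad observations contribute nothing to the IV, which is worth keeping in mind when reading the results for features with such levels.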

### IV Calculation: Continuous Target Variable

What follows extends the previous technique to continuous target variables. These links provide a good overview of how to do this:

https://support.sas.com/resources/papers/proceedings15/3242-2015.pdf

https://www.lexjansen.com/sesug/2014/SD-20.pdf
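As a rough sketch of the idea, and not necessarily the formulation used in the papers linked above, one simple analogue compares each bin's share of the target total with its share of the observations. Everything below is an assumption made for illustration: the helper name `iv_continuous`, the choice of shares, the requirement that the target be strictly positive, and the toy data.

```python
import numpy as np
import pandas as pd


def iv_continuous(df, feature, target):
    """Hypothetical helper: WoE/IV-style measure for a strictly positive continuous target."""
    grouped = df.groupby(feature)[target].agg(["sum", "count"])
    share_target = grouped["sum"] / grouped["sum"].sum()    # bin's share of the target total
    share_obs = grouped["count"] / grouped["count"].sum()   # bin's share of the observations
    woe = np.log(share_target / share_obs)
    table = pd.DataFrame({"Share Target": share_target,
                          "Share Obs": share_obs,
                          "WoE": woe})
    table["IV"] = (share_target - share_obs) * woe
    return table["IV"].sum(), table


# Toy data: two equally sized bins with different salary levels.
toy_cont = pd.DataFrame({"age_class": ["young", "young", "old", "old"],
                         "salary": [1000, 1000, 3000, 3000]})
iv_cont, table_cont = iv_continuous(toy_cont, "age_class", "salary")
print(table_cont)
print("IV =", iv_cont)
```

A bin whose share of the target total equals its share of the observations gets a WoE of 0, so a feature that does not separate salary levels at all would score an IV of 0 under this measure.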