# AI Working Group
#### 9/28/18

In [26]:
import sys, os
import pandas as pd
import numpy as np
import pandas.core.algorithms as algos
from pandas import Series
from scipy import stats
import matplotlib
import matplotlib.pyplot as plt
from IPython.display import display, Image, HTML

### Information Value (IV) and Weights of Evidence (WoE)

We'll look at techniques involving IV and WoE that can potentially assist the implementation of the MRM AI WG Idea "Variable Selection for PPNR Models" by Earvin.

First, let's cover a few fundamental concepts. A good introduction to IV/WoE is here:

https://www.listendata.com/2015/03/weight-of-evidence-woe-and-information.html

#### _The Information Value of x for measuring y is a number that attempts to quantify the predictive power of x in capturing y._

In [19]:
Image(url="https://cdn-images-1.medium.com/max/800/1*6Aw782wiyiFtzvK7EOY8CA.png",width=300)

In [24]:
Image(url="https://cdn-images-1.medium.com/max/800/1*9Gi0fGyTpxfwM2TpV4GZQQ.png",width=400)
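
For readers who cannot load the images, here are the two definitions written to match the `calc_iv` function used later in this notebook. The index $i$ runs over the bins (or categories) of a feature, and $\text{Good}_i$, $\text{Bad}_i$ are the counts of rows with target 0 and 1 in bin $i$. Some references put Bad over Good inside the logarithm instead; that flips the sign of WoE but leaves IV unchanged.

$$
\text{DistGood}_i = \frac{\text{Good}_i}{\sum_j \text{Good}_j},
\qquad
\text{DistBad}_i = \frac{\text{Bad}_i}{\sum_j \text{Bad}_j}
$$

$$
\mathrm{WoE}_i = \ln\!\left(\frac{\text{DistGood}_i}{\text{DistBad}_i}\right),
\qquad
\mathrm{IV} = \sum_i \left(\text{DistGood}_i - \text{DistBad}_i\right)\mathrm{WoE}_i
$$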

Good examples are here:

https://medium.com/@sundarstyles89/weight-of-evidence-and-information-value-using-python-6f05072e83eb

http://www.angoss.com/information-value-a-numerical-example/

In [25]:
Image(url="https://cdn-images-1.medium.com/max/1000/1*kP86RE7G0_0pi7E6bnlUZA.png",width=800)

To interpret an IV, one rule of thumb comes from Naeem Siddiqi's book "Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring" (PDF linked below); it is summarized in the image that follows the link:

https://pdfs.semanticscholar.org/dd5c/7f59d20d9a00d4c93e3d6a7e9973f3462e7e.pdf

In [23]:
Image(url="https://cdn-images-1.medium.com/max/800/1*5S_5aAHWe0_knDGZUK3W8w.png",width=500)
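
The image is not reproduced in text form, so as a rough sketch here are the cutoffs most often quoted for this rule of thumb. The function name `iv_strength`, the exact label wording, and the "suspicious" band above 0.5 are assumptions for illustration; check them against the book before relying on them.

```python
def iv_strength(iv):
    """Hypothetical helper: map an IV to a commonly quoted strength label."""
    if iv < 0.02:
        return "not useful for prediction"
    if iv < 0.1:
        return "weak"
    if iv < 0.3:
        return "medium"
    if iv <= 0.5:
        return "strong"
    # Assumption: many write-ups flag very high IVs as too good to be true
    return "suspicious"


for value in (0.01, 0.05, 0.2, 0.4, 0.8):
    print(value, "->", iv_strength(value))
```

Under these cutoffs, an IV of about 0.013 (as obtained for `age_class` later in this notebook) would fall in the lowest band.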

One key aspect of WoE and IV is the choice of binning (buckets) for the independent variables (features). Many techniques exist, including decision trees. R has several packages that cater to this; one example is woeBinning:

https://cran.r-project.org/web/packages/woeBinning/woeBinning.pdf

We will run the example below with arbitrary binning for 'age'. The other features already have natural binning as they are categorical variables.
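
As one alternative to hand-picked bin edges, equal-frequency (quantile) binning can be sketched with `pd.qcut`. The ages below are synthetic and only for illustration; they are not the dataset used in this notebook, and the choice of five bins is an assumption.

```python
import numpy as np
import pandas as pd

# Synthetic ages for illustration only (not the notebook's dataset)
rng = np.random.default_rng(0)
ages = pd.Series(rng.integers(18, 65, size=1000), name="age")

# Five bins holding roughly equal numbers of rows; edges come from the data
age_quantile_bins = pd.qcut(ages, q=5)

print(age_quantile_bins.value_counts().sort_index())
```

Because the edges are derived from the sample, they change when the data changes, which is worth keeping in mind when comparing IVs across datasets.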

### Numerical Example

The data for this numerical example was downloaded from here:

https://www.kaggle.com/varungitboi/employee-salary-dataset/version/1

We'll use Python's Pandas library in this example. For a review of Pandas, take a look here:

https://pandas.pydata.org/pandas-docs/stable/indexing.html

In [27]:
data = pd.read_csv('180928_employee_data.csv')
data.head()

Unnamed: 0.1,Unnamed: 0,id,groups,age,healthy_eating,active_lifestyle,salary
0,0,0,A,36,5,5,2297
1,1,1,A,55,3,5,1134
2,2,2,A,61,8,1,4969
3,3,3,O,29,3,6,902
4,4,4,O,34,6,2,3574


In [28]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 7 columns):
Unnamed: 0          1000 non-null int64
id                  1000 non-null int64
groups              1000 non-null object
age                 1000 non-null int64
healthy_eating      1000 non-null int64
active_lifestyle    1000 non-null int64
salary              1000 non-null int64
dtypes: int64(6), object(1)
memory usage: 54.8+ KB


In [29]:
data.describe()

Unnamed: 0.1,Unnamed: 0,id,age,healthy_eating,active_lifestyle,salary
count,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0
mean,499.5,499.5,41.155,4.944,5.683,2227.461
std,288.819436,288.819436,13.462995,2.013186,2.048587,1080.20976
min,0.0,0.0,18.0,0.0,0.0,553.0
25%,249.75,249.75,30.0,4.0,4.0,1360.0
50%,499.5,499.5,41.0,5.0,6.0,2174.0
75%,749.25,749.25,53.0,6.0,7.0,2993.75
max,999.0,999.0,64.0,10.0,10.0,5550.0


In [30]:
data['bin_class'] = 0
data.describe()

Unnamed: 0.1,Unnamed: 0,id,age,healthy_eating,active_lifestyle,salary,bin_class
count,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0
mean,499.5,499.5,41.155,4.944,5.683,2227.461,0.0
std,288.819436,288.819436,13.462995,2.013186,2.048587,1080.20976,0.0
min,0.0,0.0,18.0,0.0,0.0,553.0,0.0
25%,249.75,249.75,30.0,4.0,4.0,1360.0,0.0
50%,499.5,499.5,41.0,5.0,6.0,2174.0,0.0
75%,749.25,749.25,53.0,6.0,7.0,2993.75,0.0
max,999.0,999.0,64.0,10.0,10.0,5550.0,0.0


In [31]:
data.loc[data['salary']>data['salary'].mean(),'bin_class'] = 1
data.describe()

Unnamed: 0.1,Unnamed: 0,id,age,healthy_eating,active_lifestyle,salary,bin_class
count,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0
mean,499.5,499.5,41.155,4.944,5.683,2227.461,0.463
std,288.819436,288.819436,13.462995,2.013186,2.048587,1080.20976,0.498879
min,0.0,0.0,18.0,0.0,0.0,553.0,0.0
25%,249.75,249.75,30.0,4.0,4.0,1360.0,0.0
50%,499.5,499.5,41.0,5.0,6.0,2174.0,0.0
75%,749.25,749.25,53.0,6.0,7.0,2993.75,1.0
max,999.0,999.0,64.0,10.0,10.0,5550.0,1.0


In [38]:
bins = [0, 25, 35, 45, 55, 70, 150]
data['age_binned'] = pd.cut(data['age'],bins=bins)
data['age_class'] = np.searchsorted(bins,data['age'].values)
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
Unnamed: 0          1000 non-null int64
id                  1000 non-null int64
groups              1000 non-null object
age                 1000 non-null int64
healthy_eating      1000 non-null int64
active_lifestyle    1000 non-null int64
salary              1000 non-null int64
bin_class           1000 non-null int64
age_binned          1000 non-null category
age_class           1000 non-null int64
dtypes: category(1), int64(8), object(1)
memory usage: 71.5+ KB


In [44]:
data.head()

Unnamed: 0.1,Unnamed: 0,id,groups,age,healthy_eating,active_lifestyle,salary,bin_class,age_binned,age_class
0,0,0,A,36,5,5,2297,1,"(35, 45]",3
1,1,1,A,55,3,5,1134,0,"(45, 55]",4
2,2,2,A,61,8,1,4969,1,"(55, 70]",5
3,3,3,O,29,3,6,902,0,"(25, 35]",2
4,4,4,O,34,6,2,3574,1,"(25, 35]",2
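
The two binning columns above should agree: `pd.cut` builds right-closed intervals by default, and `np.searchsorted` with its default `side='left'` returns the 1-based position of the same interval. A small self-contained check on a few hand-picked ages (chosen for illustration, including the edge values 25 and 55):

```python
import numpy as np
import pandas as pd

bins = [0, 25, 35, 45, 55, 70, 150]
ages = pd.Series([18, 25, 26, 36, 55, 61])

binned = pd.cut(ages, bins=bins)             # intervals such as (25, 35]
codes = np.searchsorted(bins, ages.values)   # side='left' by default

# Category codes from pd.cut are 0-based, so shift by one to compare
cut_codes = binned.cat.codes.to_numpy() + 1

print(list(cut_codes))
print(list(codes))
```

Both lines should print the same sequence, with an age exactly on an edge (25 or 55) assigned to the lower interval.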


### IV Calculation: Categorical Target Variable

The function below is taken from this link:

https://www.kaggle.com/puremath86/iv-woe-starter-for-python

In [40]:
# Calculate information value
def calc_iv(df, feature, target, pr=False):
    """
    Set pr=True to enable printing of output.
    
    Output: 
      * iv: float,
      * data: pandas.DataFrame
    """

    lst = []

    # Copy first so the caller's DataFrame is not modified by fillna
    df = df.copy()
    df[feature] = df[feature].fillna("NULL")

    for val in df[feature].unique():
        lst.append([feature,                                                        # Variable
                    val,                                                            # Value
                    df[df[feature] == val].count()[feature],                        # All
                    df[(df[feature] == val) & (df[target] == 0)].count()[feature],  # Good (think: Fraud == 0)
                    df[(df[feature] == val) & (df[target] == 1)].count()[feature]]) # Bad (think: Fraud == 1)

    data = pd.DataFrame(lst, columns=['Variable', 'Value', 'All', 'Good', 'Bad'])

    data['Share'] = data['All'] / data['All'].sum()
    data['Bad Rate'] = data['Bad'] / data['All']
    data['Distribution Good'] = (data['All'] - data['Bad']) / (data['All'].sum() - data['Bad'].sum())
    data['Distribution Bad'] = data['Bad'] / data['Bad'].sum()
    data['WoE'] = np.log(data['Distribution Good'] / data['Distribution Bad'])

    data = data.replace({'WoE': {np.inf: 0, -np.inf: 0}})

    data['IV'] = data['WoE'] * (data['Distribution Good'] - data['Distribution Bad'])

    data = data.sort_values(by=['Variable', 'Value'], ascending=[True, True])
    data.index = range(len(data.index))

    if pr:
        print(data)
        print('IV = ', data['IV'].sum())


    iv = data['IV'].sum()
    # print(iv)

    return iv, data

In [1]:
iv, dt = calc_iv(data,'groups','bin_class',pr=True)


In [43]:
iv, dt = calc_iv(data,'age_class','bin_class',pr=True)

    Variable  Value  All  Good  Bad  Share  Bad Rate  Distribution Good  \
0  age_class      1  160    78   82  0.160  0.512500           0.145251   
1  age_class      2  227   125  102  0.227  0.449339           0.232775   
2  age_class      3  203   104   99  0.203  0.487685           0.193669   
3  age_class      4  220   123   97  0.220  0.440909           0.229050   
4  age_class      5  190   107   83  0.190  0.436842           0.199255   

   Distribution Bad       WoE        IV  
0          0.177106 -0.198281  0.006316  
1          0.220302  0.055070  0.000687  
2          0.213823 -0.099000  0.001995  
3          0.209503  0.089202  0.001744  
4          0.179266  0.105717  0.002113  
IV =  0.0128551452066
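
As a sanity check, the WoE and IV columns above can be recomputed directly from the printed Good and Bad counts. This is only a sketch of the arithmetic, with the counts copied by hand from the table:

```python
import numpy as np

# Good/Bad counts per age_class, copied from the printed table
good = np.array([78, 125, 104, 123, 107])
bad = np.array([82, 102, 99, 97, 83])

dist_good = good / good.sum()   # share of all target == 0 rows in each bin
dist_bad = bad / bad.sum()      # share of all target == 1 rows in each bin

woe = np.log(dist_good / dist_bad)
iv_per_bin = (dist_good - dist_bad) * woe

print(np.round(woe, 6))
print(np.round(iv_per_bin, 6))
print("IV =", iv_per_bin.sum())
```

Every term in the sum is non-negative, because the difference of the two shares and the log of their ratio always have the same sign.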


In [None]:
iv, dt = calc_iv(data,'healthy_eating','bin_class',pr=True)

In [None]:
iv, dt = calc_iv(data,'active_lifestyle','bin_class',pr=True)

There are many ways to implement this; in the end, it's trivial maths. It is nonetheless very useful for feature selection, and was especially so in the old days when machines were not powerful enough to run advanced ML.
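
To illustrate one such alternative, here is a minimal sketch that computes the same quantities with a single `groupby` instead of a loop. The name `iv_groupby` and the toy data are hypothetical. Like `calc_iv`, it sets infinite WoE values to 0; unlike `calc_iv`, it does not fill missing feature values, so rows with a missing feature are dropped by `groupby`.

```python
import numpy as np
import pandas as pd


def iv_groupby(df, feature, target):
    """Hypothetical helper: IV of `feature` against a 0/1 `target`."""
    table = df.groupby(feature)[target].agg(total="count", bad="sum")
    table["good"] = table["total"] - table["bad"]

    dist_good = table["good"] / table["good"].sum()
    dist_bad = table["bad"] / table["bad"].sum()

    # Bins with no good or no bad rows give +/-inf; zero them as calc_iv does
    woe = np.log(dist_good / dist_bad).replace([np.inf, -np.inf], 0.0)

    table["WoE"] = woe
    table["IV"] = woe * (dist_good - dist_bad)
    return table["IV"].sum(), table


# Toy data: group "a" is mostly 0s, group "b" is mostly 1s
toy = pd.DataFrame({
    "grp": ["a"] * 4 + ["b"] * 4,
    "y":   [0, 0, 0, 1, 0, 1, 1, 1],
})
iv_toy, table_toy = iv_groupby(toy, "grp", "y")
print(table_toy)
print("IV =", iv_toy)
```

In the toy data each group contributes 0.5 · ln 3, so the total IV is ln 3.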

### IV Calculation: Continuous Target Variable

What follows extends the previous technique to continuous target variables. These links provide a good overview of how to do this:

https://support.sas.com/resources/papers/proceedings15/3242-2015.pdf

https://www.lexjansen.com/sesug/2014/SD-20.pdf

http://opinions5.blogspot.com/2011/09/information-value.html