Data Set HMEQ
The HMEQ data set reports characteristics and delinquency information for 5,960
home equity loans. A home equity loan is a loan in which the obligor uses the equity
in their home as the underlying collateral. The data set contains the following variables:

BAD: 1 = applicant defaulted on loan or seriously delinquent; 0 = applicant paid loan

LOAN: Amount of the loan request

MORTDUE: Amount due on existing mortgage

VALUE: Value of current property

REASON: DebtCon = debt consolidation; HomeImp = home improvement

JOB: Occupational categories

YOJ: Years at present job

DEROG: Number of major derogatory reports

DELINQ: Number of delinquent credit lines

CLAGE: Age of oldest credit line in months

NINQ: Number of recent credit inquiries

CLNO: Number of credit lines

DEBTINC: Debt-to-income ratio


1. IMPORTING MODULES AND LOADING THE DATAFRAME

In [1]:
#IMPORTING MODULES
import pandas as pd
import numpy as np
from random import sample
import matplotlib.pyplot as plt
from scipy import stats
import statsmodels.api as sm

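#LOADING DATAFRAME
HMEQ = pd.read_csv('hmeq.csv')

#DRAWING A 40% RANDOM SAMPLE (no seed is set, so each run draws different rows)
s_hmeq = HMEQ.sample(frac=0.40)
#round() avoids comparing a float product with an integer length
if round(0.40*len(HMEQ)) == len(s_hmeq):
    print('alright')
    print(len(HMEQ), len(s_hmeq))

#SAMPLE TEST: THE SHARE OF EACH BAD CLASS (IN %) SHOULD BE CLOSE IN THE SAMPLE AND IN THE FULL DATA
print((s_hmeq['BAD'].value_counts()/s_hmeq['BAD'].count())*100)
print((HMEQ['BAD'].value_counts()/HMEQ['BAD'].count())*100)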

alright
5960 2384
0    80.075503
1    19.924497
Name: BAD, dtype: float64
0    80.050336
1    19.949664
Name: BAD, dtype: float64
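
Because `sample` is called without a seed, the rows drawn above differ on every run. The following is a minimal sketch of a reproducible, stratified draw, shown on a small synthetic frame rather than on hmeq.csv. The helper name `stratified_sample`, the seed 42 and the toy data are illustrative assumptions and are not part of the original notebook.

```python
import pandas as pd

def stratified_sample(df, target, frac, seed):
    """Sample `frac` of the rows inside each class of `target` (hypothetical helper)."""
    parts = [grp.sample(frac=frac, random_state=seed) for _, grp in df.groupby(target)]
    return pd.concat(parts)

# Synthetic stand-in for HMEQ: 80 good loans (BAD = 0) and 20 bad loans (BAD = 1).
toy = pd.DataFrame({'BAD': [0] * 80 + [1] * 20, 'LOAN': range(100)})

s1 = stratified_sample(toy, 'BAD', frac=0.40, seed=42)
s2 = stratified_sample(toy, 'BAD', frac=0.40, seed=42)

print(len(s1))                              # 40 rows: 32 good + 8 bad
print(s1['BAD'].mean(), toy['BAD'].mean())  # the bad rate is 0.2 in both
print(s1.index.equals(s2.index))            # the same seed returns the same rows
```

Sampling within each class keeps the share of bad loans identical to the full data, and a fixed `random_state` makes every later step repeatable.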


2. TREATMENT OF MISSING VALUES

In [2]:
#every column except BAD and LOAN has missing values; rows missing any numeric field listed in subset are dropped
#.copy() makes the result an independent DataFrame, so the fills below do not raise SettingWithCopyWarning
s_miss_hmeq = s_hmeq.dropna(subset=['MORTDUE', 'VALUE', 'YOJ', 'DEROG',
       'DELINQ', 'CLAGE', 'NINQ', 'CLNO', 'DEBTINC']).copy()

#a pie chart (not shown) showed more than 75% of REASON is DebtCon, so missing REASON gets that value
#missing JOB is assumed to mean structural unemployment
s_miss_hmeq['JOB'] = s_miss_hmeq['JOB'].fillna('UnEmp')
s_miss_hmeq['REASON'] = s_miss_hmeq['REASON'].fillna('DebtCon')

#creating checkpoint
nomiss_hmeq = s_miss_hmeq.copy()

#resetting the index, which has gaps after sampling and dropping rows
nomiss_hmeq.index = range(len(nomiss_hmeq))

#class balance of the sample before rows were dropped
print((s_hmeq['BAD'].value_counts()/s_hmeq['BAD'].count())*100)

0    80.075503
1    19.924497
Name: BAD, dtype: float64
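
The missing-value treatment in this section can be tried on a toy frame first. The sketch below counts missing entries per column, drops rows that lack a numeric field, and fills the two categorical columns by assigning the result back on an explicit copy, which avoids SettingWithCopyWarning. The toy values are made up for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'BAD': [0, 1, 0, 0],
    'VALUE': [50000.0, np.nan, 72000.0, 61000.0],
    'JOB': ['Office', None, 'Sales', None],
    'REASON': ['DebtCon', 'HomeImp', None, 'DebtCon'],
})

# Missing entries per column, to decide which columns to drop on or fill.
print(toy.isna().sum())

# Drop rows with a missing numeric field, working on an independent copy.
clean = toy.dropna(subset=['VALUE']).copy()

# Fill categoricals by assignment instead of inplace=True on a column selection.
clean['JOB'] = clean['JOB'].fillna('UnEmp')
clean['REASON'] = clean['REASON'].fillna(clean['REASON'].mode()[0])

# Renumber the rows 0..n-1 after the drop.
clean = clean.reset_index(drop=True)
print(clean)
```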


3. TREATMENT OF OUTLIERS

In [3]:
#creating a check point
pps_hmeq = nomiss_hmeq.copy()

#replaces outliers (values outside the 1.5*IQR fences) with the column median and returns the treated column
def outlier_treatment(x):
    median = x.quantile(0.50)
    Q1 = x.quantile(0.25)
    Q3 = x.quantile(0.75)
    IQR = Q3 - Q1
    t_min = Q1 - 1.5 * IQR
    t_max = Q3 + 1.5 * IQR
    #mask() swaps in the median wherever the condition is True and leaves the other values untouched
    return x.mask((x > t_max) | (x < t_min), median)

#the following columns had outliers as suggested by boxplots (not shown here!)
outlier_cols = ['LOAN', 'MORTDUE', 'VALUE', 'YOJ', 'CLAGE', 'NINQ', 'CLNO', 'DEBTINC']
for col in outlier_cols:
    #assigning the result back avoids writing into a slice of the dataframe
    pps_hmeq[col] = outlier_treatment(pps_hmeq[col])
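
To make the IQR rule in this section concrete, here is a small worked example on a made-up series of eight values. It computes the two fences and replaces the single value outside them with the median; the numbers are illustrative and do not come from HMEQ.

```python
import pandas as pd

x = pd.Series([10, 12, 13, 14, 15, 16, 18, 95])

# Quartiles and median with pandas' default linear interpolation.
q1, median, q3 = x.quantile([0.25, 0.50, 0.75])   # 12.75, 14.5, 16.5
iqr = q3 - q1                                     # 3.75

# Fences: anything below t_min or above t_max counts as an outlier.
t_min = q1 - 1.5 * iqr                            # 7.125
t_max = q3 + 1.5 * iqr                            # 22.125

is_outlier = (x < t_min) | (x > t_max)
treated = x.mask(is_outlier, median)

print(t_min, t_max)
print(treated.tolist())   # 95 lies above the upper fence and becomes 14.5
```

Replacing outliers with the median keeps the row count unchanged, at the price of pulling extreme but genuine observations toward the centre.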


In [4]:
pps_hmeq.columns.values

array(['BAD', 'LOAN', 'MORTDUE', 'VALUE', 'REASON', 'JOB', 'YOJ', 'DEROG',
       'DELINQ', 'CLAGE', 'NINQ', 'CLNO', 'DEBTINC'], dtype=object)

4. CLASSIFICATION


In [5]:
#CREATING A CHECK POINT
hmeq_class = pps_hmeq.copy()

# BIN THE CONTINUOUS VARIABLES: qcut GIVES EQUAL-FREQUENCY (QUANTILE) BINS, cut GIVES EQUAL-WIDTH BINS
hmeq_class['LOAN_bin'] = pd.qcut(hmeq_class['LOAN'], 6)
hmeq_class['MORTDUE_bin'] = pd.qcut(hmeq_class['MORTDUE'], 5)
hmeq_class['VALUE_bin'] = pd.qcut(hmeq_class['VALUE'], 5)
hmeq_class['YOJ_bin'] = pd.qcut(hmeq_class['YOJ'], 3)
hmeq_class['CLNO_bin'] = pd.cut(hmeq_class['CLNO'], 3)
hmeq_class['CLAGE_bin'] = pd.cut(hmeq_class['CLAGE'], 4)
hmeq_class['DEBTINC_bin'] = pd.cut(hmeq_class['DEBTINC'], 3)

# THE FOLLOWING CLASSES WERE CHOSEN BY INSPECTING THE VALUE_COUNTS OF EACH VARIABLE
hmeq_class.loc[hmeq_class['DEROG'] < 1, 'DEROG_bin'] = 'no'
hmeq_class.loc[hmeq_class['DEROG'] >= 1, 'DEROG_bin'] = 'yes'

hmeq_class.loc[hmeq_class['DELINQ'] < 1, 'DELINQ_bin'] = 'no'             
hmeq_class.loc[hmeq_class['DELINQ'] ==1, 'DELINQ_bin'] = 'one'
hmeq_class.loc[hmeq_class['DELINQ'] > 1, 'DELINQ_bin'] = 'yes'

hmeq_class.loc[hmeq_class['NINQ'] < 1, 'NINQ_bin'] = 'no'
hmeq_class.loc[hmeq_class['NINQ'] == 1, 'NINQ_bin'] = 'one'
hmeq_class.loc[hmeq_class['NINQ'] == 2, 'NINQ_bin'] = '3orless'
hmeq_class.loc[hmeq_class['NINQ'] == 3, 'NINQ_bin'] = '3orless'
hmeq_class.loc[hmeq_class['NINQ'] >= 4, 'NINQ_bin'] = '4ormore'

#FINALLY WE DROP THE UNREQUIRED COLUMNS
hmeq_class.drop(['LOAN', 'MORTDUE', 'VALUE', 'YOJ', 'CLNO', 'CLAGE', 'DEBTINC', 'DEROG', 'DELINQ', 'NINQ'], axis=1, inplace=True)
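
The cell above mixes `pd.qcut` and `pd.cut`, which bin very differently when a column is skewed. A minimal sketch on a made-up series with one large value shows the contrast.

```python
import pandas as pd

x = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])

# qcut places the edges at quantiles, so every bin holds the same number of rows.
by_quantile = pd.qcut(x, 3)

# cut splits the value range into equal widths, so skewed data fills the bins unevenly.
by_width = pd.cut(x, 3)

quantile_counts = by_quantile.value_counts().sort_index().tolist()
width_counts = by_width.value_counts().sort_index().tolist()

print(quantile_counts)   # [3, 3, 3]
print(width_counts)      # [8, 0, 1]
```

An empty or nearly empty bin matters in the next step, because a bin without goods or without bads has no finite WoE.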

 

5. CALCULATING WOE AND IV

In [6]:
def calculate_woe_iv(dataset, feature, target):
    #count all, good (target == 0) and bad (target == 1) rows for each distinct value of the feature
    lst = []
    for val in dataset[feature].unique():
        lst.append({
            'Value': val,
            'All': dataset[dataset[feature] == val].count()[feature],
            'Good': dataset[(dataset[feature] == val) & (dataset[target] == 0)].count()[feature],
            'Bad': dataset[(dataset[feature] == val) & (dataset[target] == 1)].count()[feature]
        })
    dset = pd.DataFrame(lst)
    #share of all goods and of all bads that fall into each value
    dset['Distr_Good'] = dset['Good'] / dset['Good'].sum()
    dset['Distr_Bad'] = dset['Bad'] / dset['Bad'].sum()
    #WoE = ln(Distr_Good / Distr_Bad); a value with no goods or no bads gives +/-inf, which is set to 0
    dset['WoE'] = np.log(dset['Distr_Good'] / dset['Distr_Bad'])
    dset = dset.replace({'WoE': {np.inf: 0, -np.inf: 0}})
    #IV is the sum of (Distr_Good - Distr_Bad) * WoE over the values of the feature
    dset['IV'] = (dset['Distr_Good'] - dset['Distr_Bad']) * dset['WoE']
    iv = dset['IV'].sum()
    dset = dset.sort_values(by='WoE')
    return dset, iv

IVlist = []

for col in hmeq_class.columns:
    if col == 'BAD': continue
    else:
        print('WoE and IV for column: {}'.format(col))
        df, iv = calculate_woe_iv(hmeq_class, col, 'BAD')
        IVlist.append({
            'parameter': col,
            'IV': '{:.2f}'.format(iv)})
        print(df)
        print('IV score: {:.2f}'.format(iv))
        print('\n')
print(pd.DataFrame(IVlist, columns= ['parameter', 'IV']))

WoE and IV for column: REASON
    All  Bad  Good    Value  Distr_Good  Distr_Bad       WoE        IV
1   404   41   363  HomeImp    0.278374   0.362832 -0.264973  0.022379
0  1013   72   941  DebtCon    0.721626   0.637168  0.124473  0.010513
IV score: 0.03


WoE and IV for column: JOB
   All  Bad  Good    Value  Distr_Good  Distr_Bad       WoE        IV
6   29    9    20    Sales    0.015337   0.079646 -1.647296  0.105935
1   44    6    38     Self    0.029141   0.053097 -0.599977  0.014373
0  516   53   463    Other    0.355061   0.469027 -0.278369  0.031724
4   34    0    34    UnEmp    0.026074   0.000000  0.000000  0.000000
3  370   24   346  ProfExe    0.265337   0.212389  0.222581  0.011785
5  172   11   161      Mgr    0.123466   0.097345  0.237705  0.006209
2  252   10   242   Office    0.185583   0.088496  0.740549  0.071898
IV score: 0.24


WoE and IV for column: LOAN_bin
   All  Bad  Good                Value  Distr_Good  Distr_Bad       WoE  \
2  242   27   215  (2999.999,
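
The WoE and IV figures printed for REASON can be checked by hand from the Good and Bad counts in that table. The sketch below rebuilds those counts in a toy frame and computes the same quantities with a vectorised helper. The name `woe_iv_table` is a hypothetical stand-in, not the notebook's `calculate_woe_iv`, and it does not handle bins with zero goods or zero bads.

```python
import numpy as np
import pandas as pd

def woe_iv_table(df, feature, target):
    """Vectorised WoE/IV table (hypothetical helper; 0 = good, 1 = bad)."""
    counts = pd.crosstab(df[feature], df[target])
    good, bad = counts[0], counts[1]
    distr_good = good / good.sum()          # share of all goods in each value
    distr_bad = bad / bad.sum()             # share of all bads in each value
    woe = np.log(distr_good / distr_bad)    # WoE = ln(Distr_Good / Distr_Bad)
    iv = (distr_good - distr_bad) * woe     # per-value contribution to IV
    return pd.DataFrame({'Good': good, 'Bad': bad, 'WoE': woe, 'IV': iv})

# REASON counts taken from the printed table: HomeImp 363 good / 41 bad, DebtCon 941 good / 72 bad.
toy = pd.DataFrame({
    'REASON': ['HomeImp'] * 404 + ['DebtCon'] * 1013,
    'BAD': [0] * 363 + [1] * 41 + [0] * 941 + [1] * 72,
})

table = woe_iv_table(toy, 'REASON', 'BAD')
total_iv = table['IV'].sum()

print(table.round(6))
print(round(total_iv, 2))   # 0.03, the IV score reported for REASON
```

A negative WoE (HomeImp) marks a value where bads are over-represented relative to goods; a positive WoE (DebtCon) marks the opposite.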

6. WOE TRANSFORMATION

In [7]:
#CREATING A CHECK POINT
hmeq_woe = hmeq_class.copy()

cols= ['REASON', 'JOB', 'LOAN_bin', 'MORTDUE_bin', 'VALUE_bin',
       'YOJ_bin', 'CLNO_bin', 'CLAGE_bin', 'DEBTINC_bin', 'DEROG_bin',
       'DELINQ_bin', 'NINQ_bin']
cols_woe = ['REASON_woe', 'JOB_woe', 'LOAN_woe', 'MORTDUE_woe', 'VALUE_woe',
       'YOJ_woe', 'CLNO_woe', 'CLAGE_woe', 'DEBTINC_woe', 'DEROG_woe',
       'DELINQ_woe', 'NINQ_woe']

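#transformation loop: look up each row's bin or category and write its WoE into the matching *_woe column
#columns are addressed by name, because positional lookups depend on a column order that differs between pandas versions
for col, col_woe in zip(cols, cols_woe):
    df, iv = calculate_woe_iv(hmeq_woe, col, 'BAD')
    for j in df.index:
        hmeq_woe.loc[hmeq_woe[col] == df.loc[j, 'Value'], col_woe] = df.loc[j, 'WoE']

# dropping unnecessary columns
hmeq_woe.drop(cols, axis=1, inplace=True)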
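
The WoE transformation is a lookup: each category is replaced by the WoE computed for it. A minimal sketch with `Series.map` on a toy column illustrates the idea, using three WoE values copied from the JOB table printed earlier. The toy rows are made up.

```python
import pandas as pd

toy = pd.DataFrame({'JOB': ['Office', 'Sales', 'Office', 'Mgr']})

# WoE per category, copied from the JOB output of calculate_woe_iv.
woe_by_value = {'Office': 0.740549, 'Sales': -1.647296, 'Mgr': 0.237705}

# map() replaces every category by its WoE; unknown categories would become NaN.
toy['JOB_woe'] = toy['JOB'].map(woe_by_value)

print(toy)
```

Checking the new column for NaN after such a lookup is a cheap way to catch categories that were not present when the WoE table was built.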

7. LOGISTIC REGRESSION

In [8]:
#CREATING A CHECK POINT
hmeq_logit = hmeq_woe.copy()

#MANUALLY ADDING INTERCEPT
hmeq_logit['intercept'] = 1.0

train_cols = hmeq_logit.columns[1:]

logit = sm.Logit(hmeq_logit['BAD'], hmeq_logit[train_cols])

result = logit.fit()

print(result.summary())

Optimization terminated successfully.
         Current function value: 0.227359
         Iterations 8
                           Logit Regression Results                           
Dep. Variable:                    BAD   No. Observations:                 1417
Model:                          Logit   Df Residuals:                     1404
Method:                           MLE   Df Model:                           12
Date:                Thu, 23 Jul 2020   Pseudo R-squ.:                  0.1826
Time:                        11:25:34   Log-Likelihood:                -322.17
converged:                       True   LL-Null:                       -394.14
                                        LLR p-value:                 9.590e-25
                  coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------
REASON_woe     -0.8196      0.605     -1.355      0.176      -2.006       0.366
JOB_woe        -0.8981    

8. CREATING THE SCORE-CARD 


Strictly speaking, a scorecard should not be built on parameters that have a low IV score or that fail the significance test at the 5% level.

We kept them anyway, because every run draws a new random sample, which changes the distribution and therefore the set of columns that would be dropped.
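
For a fixed sample, the screening described above could look like the sketch below. It uses the two IV scores visible in the output (REASON 0.03, JOB 0.24). The cut-off of 0.10 is an assumed rule-of-thumb value chosen for illustration, not a threshold taken from the notebook.

```python
import pandas as pd

# IVlist stores each IV as a formatted string, so it is converted to float first.
iv_table = pd.DataFrame({'parameter': ['REASON', 'JOB'], 'IV': ['0.03', '0.24']})
iv_table['IV'] = iv_table['IV'].astype(float)

MIN_IV = 0.10   # assumed cut-off for "predictive enough"; adjust to taste

keep = iv_table.loc[iv_table['IV'] >= MIN_IV, 'parameter'].tolist()
drop = iv_table.loc[iv_table['IV'] < MIN_IV, 'parameter'].tolist()

print('keep:', keep)   # ['JOB']
print('drop:', drop)   # ['REASON']
```

The same pattern applies to the p-values in `result.pvalues`, for example by keeping only the terms with a p-value below 0.05.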

In [9]:
#SCALING: 20 POINTS DOUBLE THE ODDS (Factor) AND A SCORE OF 600 CORRESPONDS TO GOOD:BAD ODDS OF 50:1 (Offset)
Factor = 20/np.log(2)
Offset = 600 - Factor * np.log(50)

o_col_name= ['REASON', 'JOB', 'LOAN', 'MORTDUE', 'VALUE',
       'YOJ', 'CLNO', 'CLAGE', 'DEBTINC', 'DEROG',
       'DELINQ', 'NINQ']

#number of characteristics (the intercept is the last parameter and is excluded)
n = len(result.params) - 1
intercept = result.params.iloc[-1]

parts = []
for i in range(len(cols)):
    df, iv = calculate_woe_iv(hmeq_class, cols[i], 'BAD')
    df['Parameter'] = o_col_name[i]
    #each characteristic gets its own WoE term plus an equal share of the intercept and of the offset
    df['Score'] = -(df['WoE'] * result.params.iloc[i] + intercept / n) * Factor + Offset / n
    parts.append(df[['Parameter', 'Value', 'Score']])

#DataFrame.append was removed in pandas 2.0; pd.concat keeps the original row labels
ScoreCard = pd.concat(parts)

print(ScoreCard)

  Parameter                 Value      Score
1    REASON               HomeImp  40.274342
0    REASON               DebtCon  49.484744
6       JOB                 Sales   3.852726
1       JOB                  Self  30.993075
0       JOB                 Other  39.327274
4       JOB                 UnEmp  46.540956
3       JOB               ProfExe  52.308947
5       JOB                   Mgr  52.700876
2       JOB                Office  65.731623
2      LOAN   (2999.999, 10300.0]  37.291589
5      LOAN    (17100.0, 20400.0]  45.346559
4      LOAN    (20400.0, 25900.0]  46.738402
3      LOAN    (13500.0, 17100.0]  48.769849
0      LOAN    (25900.0, 42200.0]  48.825128
1      LOAN    (10300.0, 13500.0]  54.576390
1   MORTDUE    (59905.0, 71370.6]  43.936246
2   MORTDUE   (5075.999, 45788.2]  43.987934
0   MORTDUE    (45788.2, 59905.0]  46.267622
3   MORTDUE   (94099.4, 154005.0]  49.053172
4   MORTDUE    (71370.6, 94099.4]  50.577737
2     VALUE  (27780.999, 68172.6]  38.949230
1     VALU

In [10]:
ScoreCard.to_csv('ScoreCard.csv', index = False)
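
The scaling constants used for the scorecard can be sanity-checked in isolation. The sketch below recomputes Factor and Offset, confirms that good:bad odds of 50:1 map to 600 points and that doubling the odds adds 20 points, and shows that the per-characteristic points add up to Offset minus Factor times the model's log-odds of BAD. The coefficients, intercept and WoE values are hypothetical placeholders, not the fitted values from the regression output.

```python
import numpy as np

PDO, BASE_SCORE, BASE_ODDS = 20, 600, 50
factor = PDO / np.log(2)
offset = BASE_SCORE - factor * np.log(BASE_ODDS)

def total_score(good_odds):
    """Score implied by the scaling for given good:bad odds."""
    return offset + factor * np.log(good_odds)

print(round(total_score(50), 6))    # 600.0
print(round(total_score(100), 6))   # 620.0, i.e. 20 points for doubled odds

# Hypothetical model with three characteristics (placeholders, not fitted values).
betas = np.array([-0.8, -0.9, -1.1])
intercept = -1.4
woes = np.array([0.12, 0.74, -0.30])   # WoE of one applicant's bins
n = len(betas)

# Points per characteristic, using the same formula as the scorecard cell.
points = -(woes * betas + intercept / n) * factor + offset / n

# The model's log-odds of BAD = 1 for this applicant.
log_odds_bad = woes @ betas + intercept

print(points.sum(), offset - factor * log_odds_bad)   # the two totals agree
```

Splitting the intercept and the offset evenly across characteristics is only a presentation choice: the total score is the same however they are distributed.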