# Credit Risk Modeling | Part 1: Preprocessing Data

<img src='https://youtrading.com/fr/wp-content/uploads/2022/10/GettyImages-1324277066.jpg'>

Credit risk modeling is important for financial institutions. Credit risk is the risk that a borrower will be unable to repay a loan, a credit card balance or another type of debt. In some cases, borrowers repay only part of the debt, so some of the principal and interest remains unpaid. Both statistical and machine learning techniques play an important role in credit risk modeling. Core skills include handling big data and advanced statistical modeling.

**Terms ⚠️**

There are two types of Internal Ratings-Based (IRB) approaches: Foundation IRB and Advanced IRB.  
**Foundation IRB**  
PD is estimated internally by the bank, while LGD and EAD are prescribed by the regulator.  
**Advanced IRB**  
PD, LGD, and EAD can be estimated internally by the bank itself.

*  PD: probability of default, modeled with logistic regression  
The probability of default is the likelihood that a borrower will default on a debt (credit card, mortgage or non-mortgage loan) over a one-year period. In simple terms, it is the expected probability that a customer fails to repay the loan. It is expressed as a percentage between 0% and 100%; the higher the probability, the higher the chance of default.  
*  LGD: loss given default, modeled with beta regression  
LGD is the share of the outstanding amount we expect to lose, expressed as a proportion of the total exposure when the borrower defaults. It is calculated as (1 - Recovery Rate).  
*  EAD: exposure at default, modeled with beta regression  
EAD is the amount we expect to be outstanding in the case of default, that is, the amount the borrower owes the bank at the time of default.  

We use the PD model to build scorecards that are used to accept or reject credit applications.

The three models are combined into a single measure used in bank management:
*  Expected Loss, EL = PD x LGD x EAD
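
As a quick arithmetic check of the formula above, here is a minimal sketch with made-up numbers. The PD, recovery rate and EAD values are illustrative assumptions and are not taken from the dataset.

```python
# Illustrative inputs (assumed values, not from the Lending Club data)
pd_estimate = 0.05        # probability of default: 5%
recovery_rate = 0.60      # share of the exposure recovered after default
lgd = 1 - recovery_rate   # loss given default = 1 - recovery rate
ead = 10_000.0            # exposure at default, in currency units

# Expected loss is the product of the three components
expected_loss = pd_estimate * lgd * ead
print(round(expected_loss, 2))
```

With these inputs, 5% of a 40% loss on an exposure of 10,000 gives an expected loss of about 200.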

There are two types of credit risk models: 
1.  Application model   
→	Whether to grant a loan or not, and at what interest rate  
→	Risk-based pricing: the higher the risk, the higher the interest rate  
2.  Behaviour model  
→	Whether to lend more money to an existing borrower. Application.   
→	Statistical models for estimating credit risk, represented in a simplified way as a scorecard, which is a simpler form of the PD model.   

**Dataset ▶**

The dataset contains more than 800,000 consumer loans issued from 2007 to 2015 by Lending Club, a large US peer-to-peer lending company. There are different versions of this dataset online; we use a version available on kaggle.com ([link](https://www.kaggle.com/wendykan/lending-club-loan-data/version/1)). Note that the dataset contains both discrete and continuous columns, which should be preprocessed accordingly.

We assume a scenario where data from 2007 to 2014 are available at the moment of building the initial Expected Loss models, and the 2015 data become available later from new applications. Therefore, the data is divided into two periods: (i) data from 2007 to 2014 and (ii) data from 2015.

Later, we investigate whether the 2015 applications have characteristics similar to those of the applications used to build the initial Probability of Default (PD) model.

**Target 🎯**

A prominent bank has asked us to build a credit risk model from the loan data, to provide them with a scorecard for use in their daily procedures as well as a pipeline to calculate expected loss. 

Here are the step-by-step instructions, in compliance with the Basel II requirements: 

**In Notebook L01** (this notebook)   
1-  Preprocessing - Converting columns into dummy variables by fine and coarse classing 
  
**In Notebook L02**  
2-  Build the PD model with logistic regression  
3-  Based on the PD model, provide a practical scorecard in CSV format  
    
**In Notebook L03**   
4-  Construct the LGD model with beta regression  
5-  Build the EAD model with beta regression  
6-  Calculate the expected loss after obtaining all models  
  
  
**In Notebook L04**  
7-  Check whether the models still perform well on the more recent data.   

# 1 Data Preparation

## Importing Libraries

In [None]:
#installing gdown package to download dataset stored in G Drive
!pip install gdown
# to upgrade
!pip install --upgrade gdown

In [2]:
#data handling libs
import numpy as np
import pandas as pd

#lib to download the g-drive data
import gdown

#sklearn libs
from sklearn.model_selection import train_test_split

#data viz
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

#importing custom-made functions
import sys #importing local functions in src folder
sys.path.append('../src/')
from functions import *

## Importing Data

In [1]:
#loading dataset from Gdrive
# fname_2007etp = "https://drive.google.com/file/d/16JXrTBSgEJH4_30zlFFBye1GRHh5h4O0/view?usp=share_link"
# fname_2007etp = 'https://drive.google.com/uc?id=' + fname_2007etp.split('/')[-2]

# #updated dataset
# #fname_2015 = "https://drive.google.com/file/d/1Fb7LFd97aJm0ySb0A48_znfe9KQzaUZO/view?usp=share_link"
# #fname_2015 = 'https://drive.google.com/uc?id=' + fname_2015.split('/')[-2]

In [None]:
# # downloading gdrive files
# url = fname_2007etp
# output = "loan_data_2007_2014.csv"
# gdown.download(url, output, quiet=False)

In [None]:
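# the download cells above are commented out, so we read the CSV by the file name used there
loan_data = pd.read_csv("loan_data_2007_2014.csv")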

## Exploring Data

In [None]:
# Uncomment below to set the pandas dataframe options to display all columns/ rows.
#pd.options.display.max_columns = None
#pd.options.display.max_rows = None

loan_data.shape

In [None]:
loan_data.head()

In [None]:
loan_data.tail()

In [None]:
loan_data.columns.values

In [None]:
# Displaying column names with non missing cases and datatype
loan_data.info()


There are 74 columns in our dataset, and many cells are empty. We therefore concentrate on a subset of the columns and handle their missing values in the coming sections.

After some trial and error, we found the following columns significant enough to keep for the predictive models. Preprocessing is therefore applied solely to these columns: 

Note: The preprocessing part is lengthy and takes some time to go through. However, it is repetitive and not complex.

---
**Discrete Variables**

1. 'grade': assigned loan grade  
2. 'sub_grade': LC assigned loan subgrade
3. 'home_ownership': the home ownership status provided by the borrower during registration. Values are: RENT, OWN, MORTGAGE, OTHER.
4. 'addr_state': The state provided by the borrower in the loan application
5. 'verification_status': Indicates if the borrower's income was verified by LC, not verified, or if the income source was verified
6. 'purpose': A category provided by the borrower for the loan request.
7. 'initial_list_status': The initial listing status of the loan. Possible values are W and F

---
**Continuous Variables**
1. 'term': number of payments on the loan. Values are in months and can be either 36 or 60.
2. 'emp_length': Employment length in years. Possible values are between 0 and 10 where 0 means less than one year and 10 means ten or more years.
3. 'int_rate': Interest rate on the loan
4. 'mths_since_earliest_cr_line': months since the borrower's earliest reported credit line was opened (derived from 'earliest_cr_line')
5. 'delinq_2yrs': The number of 30+ days past-due incidences of delinquency in the borrower's credit file for the past 2 years
6. 'inq_last_6mths': The number of inquiries in past 6 months (excluding auto and mortgage inquiries)
7. 'open_acc': The number of open credit lines in the borrower's credit file
8. 'pub_rec' : Number of derogatory public records
9. 'total_acc': The total number of credit lines currently in the borrower's credit file
10. 'acc_now_delinq': The number of accounts on which the borrower is now delinquent.
11. 'total_rev_hi_lim': Total revolving high credit/credit limit
12. 'annual_inc': The self-reported annual income provided by the borrower during registration.
13. 'dti': A ratio calculated using the borrower's total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower's self-reported monthly income.
14. 'mths_since_last_delinq': The number of months since the borrower's last delinquency.
15. 'mths_since_last_record': The number of months since the last public record.



Please note that the following column is not considered in the example herein but would have improved the prediction scores.
* 'mths_since_issue_d': Months since the loan was issued (derived from 'issue_d')


# 2 General Preprocessing

## a Discrete variables - Dummy columns

In [None]:
# Creating dummy variables: new columns with values 1 or 0.
# e.g. for Gender = Male we could use 0 = no, 1 = yes.
# A single dummy variable is sufficient for two categories;
# with 4 categories, only 3 dummies are needed in the prediction model.
# We use the pandas function get_dummies:
pd.get_dummies(loan_data['grade'], prefix = 'grade', prefix_sep = ':').head()
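
To make the naming scheme and the idea of a reference category concrete, here is a minimal sketch on a toy column. The three-grade DataFrame is hypothetical, and the choice of 'grade:A' as the reference category is only an assumption for illustration.

```python
import pandas as pd

# Toy column with three categories (hypothetical data)
toy = pd.DataFrame({'grade': ['A', 'B', 'C', 'A']})

# One indicator column per category, named 'grade:A', 'grade:B', ...
dummies = pd.get_dummies(toy['grade'], prefix='grade', prefix_sep=':').astype(int)

# For modelling, one category is left out as the reference category
model_inputs = dummies.drop(columns=['grade:A'])

print(dummies.columns.tolist())
print(model_inputs.columns.tolist())
```

With three categories, two dummy columns remain after dropping the reference, which matches the "n categories need n - 1 dummies" rule in the comments above.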

In [12]:
dummy_columns = ['grade','sub_grade','home_ownership','verification_status',
                 'loan_status','purpose','addr_state','initial_list_status']

# build the dummy columns for every listed variable and concatenate them side by side
df_Dummies = pd.concat(
    [pd.get_dummies(loan_data[col], prefix = col, prefix_sep = ':') for col in dummy_columns],
    axis=1)

In [None]:
#list of all dummy columns. 
df_Dummies.columns.values

In [14]:
#merging dummy columns with the main dataset
loan_data = pd.concat([loan_data,df_Dummies],axis = 1)

## b Continuous variables - Dt format conversion

In [None]:
#lets convert emp_length into integer
loan_data['emp_length'].unique()

In [26]:
# Clean and convert 'emp_length' to numeric
loan_data['emp_length_int'] = (
    loan_data['emp_length']
    .str.replace(r'\+ years', '', regex=True)        # '10+ years' -> '10' (replacing with '0' would give '100')
    .str.replace(r'< 1 year|n/a', '0', regex=True)   # '< 1 year' and 'n/a' -> '0'
    .str.replace(r' years?', '', regex=True)         # Remove ' year' or ' years'
)

# Convert to numeric type; anything that is still not a number becomes NaN
loan_data['emp_length_int'] = pd.to_numeric(loan_data['emp_length_int'], errors='coerce')


In [27]:
loan_data['emp_length_int']=pd.to_numeric(loan_data['emp_length_int'])
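
To see how the raw strings map to integers, the sketch below applies a regex-based conversion to a few sample values. The sample Series is an assumption; the expected mapping ('10+ years' to 10, '< 1 year' to 0) follows the description of 'emp_length' given earlier.

```python
import pandas as pd

# Sample raw values (hypothetical), including a missing entry
raw = pd.Series(['10+ years', '< 1 year', '1 year', '3 years', None])

cleaned = (
    raw
    .str.replace(r'\+ years', '', regex=True)   # '10+ years' -> '10'
    .str.replace(r'< 1 year', '0', regex=True)  # '< 1 year'  -> '0'
    .str.replace(r' years?', '', regex=True)    # '3 years'   -> '3'
)

# Missing values become NaN and are then filled with 0
emp_length_int = pd.to_numeric(cleaned, errors='coerce').fillna(0).astype(int)
print(emp_length_int.tolist())
```

Checking a handful of known inputs like this is a cheap way to confirm that no value ends up outside the 0 to 10 range.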

In [None]:
# date variables not in dt format
loan_data['earliest_cr_line']

In [29]:
# converting the date column with format %b-%y : Apr-03 => 2003-04-01
loan_data['earliest_cr_line_date']=pd.to_datetime(loan_data['earliest_cr_line'],format = '%b-%y')
# calculating the months elapsed up to a reference date taken as 2017-12-01
# an average month of 30.436875 days is the length np.timedelta64(1, 'M') stands for;
# newer pandas versions may reject that unit as a divisor
diff_cr_line = pd.to_datetime('2017-12-01') - loan_data['earliest_cr_line_date']
loan_data['mths_since_earliest_cr_line'] = round(diff_cr_line / pd.Timedelta(days=30.436875))

In [None]:
loan_data['mths_since_earliest_cr_line'].describe()

In [None]:
#finding out why there are negative values in our dataset
m1 = loan_data['mths_since_earliest_cr_line']<0
loan_data.loc[m1,['earliest_cr_line','earliest_cr_line_date','mths_since_earliest_cr_line']].head()

In [32]:
# it is necessary to handle the negative values. they arise because years such as 196x are read as 206x
# we take the maximum month difference to replace the negative values
loan_data.loc[m1,'mths_since_earliest_cr_line'] = loan_data.loc[:,'mths_since_earliest_cr_line'].max()
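
The negative values come from the two-digit year: with `%y`, values from 00 to 68 are placed in the 2000s. The sketch below reproduces this on three sample strings and shows one possible alternative to the max-replacement used above, namely shifting dates that lie after the reference date back by a century. This alternative is an assumption for illustration and is not what the notebook does.

```python
import pandas as pd

raw = pd.Series(['Apr-03', 'Jan-62', 'Dec-99'])   # sample values (hypothetical)
reference = pd.Timestamp('2017-12-01')

# 'Jan-62' is read as 2062 because two-digit years up to 68 map to 20xx
parsed = pd.to_datetime(raw, format='%b-%y')

# Alternative treatment: dates after the reference date are moved back 100 years
fixed = parsed.where(parsed <= reference, parsed - pd.DateOffset(years=100))
print(fixed.dt.year.tolist())
```

The century shift keeps the information that these are the oldest credit lines, at the cost of assuming no credit line in the data was opened before 1918.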

In [None]:
#let's convert term column to integer format
loan_data['term']

In [None]:
loan_data['term_int'] = loan_data['term'].str.replace(' months', '').astype(int)
loan_data['term_int'].describe()

In [None]:
# Assuming we are in December 2017
loan_data['issue_d_date'] = pd.to_datetime(loan_data['issue_d'], format = '%b-%y')

# calculating the month difference from 2017-12-01
# We take the difference between the two dates, express it in average months
# (30.436875 days, the length np.timedelta64(1, 'M') stands for) and round it.
diff_issue_d = pd.to_datetime('2017-12-01') - loan_data['issue_d_date']
loan_data['mths_since_issue_d'] = round(diff_issue_d / pd.Timedelta(days=30.436875))

# Showing some descriptive statistics for the values of a column.
loan_data['mths_since_issue_d'].describe()
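
Because every date here falls on the first day of a month, the month difference can also be computed exactly from the year and month parts, without dividing by an average month length. The sketch below shows this alternative on three sample issue dates; it is an illustration under that assumption, not the calculation used above.

```python
import pandas as pd

# Sample issue dates in the same 'Mon-yy' format (hypothetical values)
issue = pd.to_datetime(pd.Series(['Dec-14', 'Jan-15', 'Dec-07']), format='%b-%y')
reference = pd.Timestamp('2017-12-01')

# Whole calendar months between each issue date and the reference date
months = (reference.year - issue.dt.year) * 12 + (reference.month - issue.dt.month)
print(months.tolist())
```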

## c Checking for missing values or cleaning them

In [None]:
pd.options.display.max_rows = None
loan_data.isnull().sum().sort_values(ascending=False)

In [37]:
pd.options.display.max_rows = 100

In [38]:
# filling the missing values in the columns that will be used in our model.
# we use funded_amnt for the missing total_rev_hi_lim values
# funded_amnt: the total amount committed to that loan at that point in time.
# assigning the result back avoids calling fillna(inplace=True) on a column selection
loan_data['total_rev_hi_lim'] = loan_data['total_rev_hi_lim'].fillna(loan_data['funded_amnt'])

In [39]:
# for the missing values in annual_inc, mean value is considered.  
loan_data['annual_inc'] = loan_data['annual_inc'].fillna(loan_data['annual_inc'].mean())

In [40]:
# for the missing values below, we consider 0
zero_fill_cols = ['mths_since_earliest_cr_line', 'acc_now_delinq', 'total_acc', 'pub_rec',
                  'open_acc', 'inq_last_6mths', 'delinq_2yrs', 'emp_length_int']
loan_data[zero_fill_cols] = loan_data[zero_fill_cols].fillna(0)
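
The two imputation rules used in this section, mean for income and zero for count-like columns, can be checked on a toy frame. The values below are made up; the sketch assigns the filled column back rather than relying on `inplace=True`, which is a stylistic choice and not a requirement of the method.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'annual_inc': [40000.0, np.nan, 60000.0],   # hypothetical incomes
    'pub_rec':    [0.0, np.nan, 2.0],           # hypothetical record counts
})

# Mean imputation for income, zero imputation for the count column
toy['annual_inc'] = toy['annual_inc'].fillna(toy['annual_inc'].mean())
toy['pub_rec'] = toy['pub_rec'].fillna(0)
print(toy)
```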

# 3 PD Model Definition

Remember: EL = PD * LGD * EAD

First, we need to define what counts as a default, that is, which borrowers are "good" and which are "bad".

## a Data Preparation - Dependent Variable

The dependent variable is a good/bad flag based on the definition of a defaulted borrower: default versus non-default accounts.

In [None]:
# let's explore the loan_status column further and find out the proportion of each status
loan_data['loan_status'].value_counts()

In [None]:
loan_data['loan_status'].value_counts() / loan_data['loan_status'].count()

In [43]:
# if the result == 'Charged Off', 'Default', 'Does not meet the credit policy. Status: Charged Off.',
#                   'Late (31-120 days)', We take 0 (Bad). Otherwise = 1 (Good)
bad_def = ['Charged Off', 'Default','Does not meet the credit policy. Status: Charged Off.',
           'Late (31-120 days)']

#good is 1, bad is 0
loan_data['good_bad'] = np.where(loan_data['loan_status'].isin(bad_def), 0, 1) 

In [None]:
loan_data['good_bad'].head()

In [None]:
loan_data['good_bad'].value_counts() / loan_data['good_bad'].count() 

## b Train and Test Set Split


In [46]:
#splitting data into train and test datasets. 80:20 train to test ratio
inputs_train, inputs_test, targets_train, targets_test = train_test_split(
    loan_data.drop('good_bad',axis = 1), loan_data['good_bad'], test_size=0.2, 
    random_state= 75)

In [None]:
inputs_train.shape

In [None]:
inputs_test.shape

## c Selecting test or train set for preproc.


In [49]:
#first run it for train set and then for test set for the preprocessing
df_inputs_prepr = inputs_train
df_targets_prepr = targets_train
#test set
#df_inputs_prepr = inputs_test
#df_targets_prepr = targets_test

#next we will calculate Weight of Evidence (WoE)

# 4 Discrete Data Preparation - WoE

**Let's automate the calculation of WoE and IV**

**Weight of Evidence (WoE)**  
WoE shows to what extent an independent variable predicts the dependent variable.

Positive WOE means Distribution of Goods > Distribution of Bads  
Negative WOE means Distribution of Goods < Distribution of Bads  
Hint: the log of a number greater than 1 is positive; the log of a number between 0 and 1 is negative.  
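
Written out, a common definition of WoE for a category $i$ that agrees with the sign convention above is the following. The symbols are notation introduced here for illustration: $n^{\text{good}}_i$ and $n^{\text{bad}}_i$ are the numbers of good and bad borrowers in category $i$.

```latex
\mathrm{WoE}_i
  = \ln\!\left(
      \frac{n^{\text{good}}_i \,/\, \sum_j n^{\text{good}}_j}
           {n^{\text{bad}}_i \,/\, \sum_j n^{\text{bad}}_j}
    \right)
```

The numerator is the share of all goods that fall in category $i$ and the denominator is the share of all bads, so the ratio exceeds 1 (and WoE is positive) exactly when goods are over-represented in that category.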

Two ways to classify groups in WoE calculations  

1. Fine Classing: Create 10 to 20 bins/groups for a continuous independent variable and then calculate WOE and IV of the variable. 
2. Coarse Classing: Combine adjacent categories with similar WOE scores
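
The two steps can be sketched with `pd.cut` on a toy variable. The values and the hand-picked coarse edges below are assumptions; in practice the coarse edges would come from inspecting the WoE of each fine bin.

```python
import numpy as np
import pandas as pd

# Hypothetical continuous variable with 1,000 evenly spread values
values = pd.Series(np.linspace(5.0, 25.0, 1000), name='int_rate')

# Fine classing: 10 equal-width bins
fine = pd.cut(values, 10)

# Coarse classing: merge the fine bins into 3 wider groups with chosen edges
coarse = pd.cut(values, bins=[5.0, 11.0, 19.0, 25.0], include_lowest=True)

print(fine.cat.categories.size, coarse.cat.categories.size)
```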

---
**Information Value (IV)**

Information value is a useful measure for selecting important variables in a predictive model. It indicates how much information an independent variable contributes to explaining the dependent variable, and it helps to rank variables by importance. The IV is calculated using the following formula:  

IV = ∑ (% of non-events - % of events) * WOE

IV categories  
* Less than 0.02 => 	Not useful for prediction   
* 0.02 to 0.1	=> Weak predictive Power  
* 0.1 to 0.3	=> Medium predictive Power  
* 0.3 to 0.5 => 	Strong predictive Power  
* Greater than 0.5 =>	Suspicious predictive Power  
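
To make the WoE and IV definitions concrete, here is a minimal sketch on ten toy loans. The notebook's own `woe_discrete` helper lives in the `functions` module and is not shown here, so this is an independent illustration and not necessarily how that helper is implemented. Here "non-events" are good loans (1) and "events" are bad loans (0).

```python
import numpy as np
import pandas as pd

# Toy data: one categorical feature and a good (1) / bad (0) flag
toy = pd.DataFrame({
    'grade':    ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B'],
    'good_bad': [1, 1, 1, 0, 1, 1, 0, 0, 0, 1],
})

grp = toy.groupby('grade')['good_bad'].agg(n_obs='count', n_good='sum')
grp['n_bad'] = grp['n_obs'] - grp['n_good']

# Share of all goods / all bads that fall into each category
grp['prop_good'] = grp['n_good'] / grp['n_good'].sum()
grp['prop_bad'] = grp['n_bad'] / grp['n_bad'].sum()

# WoE per category and IV for the whole variable
grp['WoE'] = np.log(grp['prop_good'] / grp['prop_bad'])
iv = ((grp['prop_good'] - grp['prop_bad']) * grp['WoE']).sum()
print(grp[['prop_good', 'prop_bad', 'WoE']])
print(iv)
```

Grade A holds half of the goods but only a quarter of the bads, so its WoE is positive. A category with no bad (or no good) loans would produce an infinite WoE, which is one reason sparse categories are merged during coarse classing.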


## Part 1 Discrete Variables: Dummy Variables

In [None]:
df_temp = woe_discrete(df_inputs_prepr, 'grade',df_targets_prepr)
df_temp

  **Drawing the plots**

In [None]:
plot_by_woe(df_temp,0)

As we can see, WoE increases with increasing external credit grade. 
That means loans with greater external grade are better in general.

In [None]:
df_temp = woe_discrete(df_inputs_prepr, 'home_ownership',df_targets_prepr)
df_temp

In [None]:
plot_by_woe(df_temp,0)

We don't want separate dummy variables for NONE, OTHER and ANY.
They are underrepresented categories. 

Let's combine NONE, OTHER and ANY with RENT.

In [54]:
df_inputs_prepr['home_ownership:RENT_OTHER_NONE_ANY'] = sum ([df_inputs_prepr['home_ownership:RENT'], df_inputs_prepr['home_ownership:OTHER'],
                                                              df_inputs_prepr['home_ownership:NONE'], df_inputs_prepr['home_ownership:ANY'],])

In [None]:
df_inputs_prepr['addr_state'].unique()

In [None]:
df_temp = woe_discrete(df_inputs_prepr,'addr_state', df_targets_prepr)
df_temp

In [None]:
plot_by_woe(df_temp)

There are very few observations for the first two and the last two states, and there is no data for North Dakota.

In [58]:
# add an all-zero dummy column for North Dakota if it is missing
if 'addr_state:ND' not in df_inputs_prepr.columns:
  df_inputs_prepr['addr_state:ND'] = 0
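
The same situation can arise for any category that is present in one split but absent in the other. One general way to handle it, shown here as a sketch on hypothetical frames, is to reindex one frame to the column list of the other and fill the gaps with 0.

```python
import pandas as pd

# Hypothetical dummy frames: the second one has no 'addr_state:ND' column
train = pd.DataFrame({'addr_state:CA': [1, 0], 'addr_state:ND': [0, 1]})
test = pd.DataFrame({'addr_state:CA': [1, 1]})

# Add any absent column, filled with 0, and keep the same column order
test_aligned = test.reindex(columns=train.columns, fill_value=0)
print(test_aligned.columns.tolist())
```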

In [None]:
plot_by_woe(df_temp.iloc[2:-2,:])

Let's combine Nevada with Florida: NE, IA, NV => FL

If there is no data for a state, we go with the worst-case scenario. 

The last 4 columns can be regrouped together. 

In [None]:
plot_by_woe(df_temp.iloc[6:-6,:])

The rest of the states can be in the same group. We can separate NY and CA.

Looking at the number of borrowers, we try to regroup some states together. 

TX also has a high number of borrowers.

In [61]:
# We create the following categories:
# 'ND' 'NE' 'IA' 'NV' 'FL' 'HI' 'AL'
# 'NM' 'VA'
# 'NY'
# 'OK' 'TN' 'MO' 'LA' 'MD' 'NC'
# 'CA'
# 'UT' 'KY' 'AZ' 'NJ'
# 'AR' 'MI' 'PA' 'OH' 'MN'
# 'RI' 'MA' 'DE' 'SD' 'IN'
# 'GA' 'WA' 'OR'
# 'WI' 'MT'
# 'TX'
# 'IL' 'CT'
# 'KS' 'SC' 'CO' 'VT' 'AK' 'MS'
# 'WV' 'NH' 'WY' 'DC' 'ME' 'ID'

# 'ND_NE_IA_NV_FL_HI_AL' will be the reference category.

df_inputs_prepr['addr_state:ND_NE_IA_NV_FL_HI_AL'] = sum([df_inputs_prepr['addr_state:ND'], df_inputs_prepr['addr_state:NE'],
                                              df_inputs_prepr['addr_state:IA'], df_inputs_prepr['addr_state:NV'],
                                              df_inputs_prepr['addr_state:FL'], df_inputs_prepr['addr_state:HI'],
                                                          df_inputs_prepr['addr_state:AL']])

df_inputs_prepr['addr_state:NM_VA'] = sum([df_inputs_prepr['addr_state:NM'], df_inputs_prepr['addr_state:VA']])

df_inputs_prepr['addr_state:OK_TN_MO_LA_MD_NC'] = sum([df_inputs_prepr['addr_state:OK'], df_inputs_prepr['addr_state:TN'],
                                              df_inputs_prepr['addr_state:MO'], df_inputs_prepr['addr_state:LA'],
                                              df_inputs_prepr['addr_state:MD'], df_inputs_prepr['addr_state:NC']])

df_inputs_prepr['addr_state:UT_KY_AZ_NJ'] = sum([df_inputs_prepr['addr_state:UT'], df_inputs_prepr['addr_state:KY'],
                                              df_inputs_prepr['addr_state:AZ'], df_inputs_prepr['addr_state:NJ']])

df_inputs_prepr['addr_state:AR_MI_PA_OH_MN'] = sum([df_inputs_prepr['addr_state:AR'], df_inputs_prepr['addr_state:MI'],
                                              df_inputs_prepr['addr_state:PA'], df_inputs_prepr['addr_state:OH'],
                                              df_inputs_prepr['addr_state:MN']])

df_inputs_prepr['addr_state:RI_MA_DE_SD_IN'] = sum([df_inputs_prepr['addr_state:RI'], df_inputs_prepr['addr_state:MA'],
                                              df_inputs_prepr['addr_state:DE'], df_inputs_prepr['addr_state:SD'],
                                              df_inputs_prepr['addr_state:IN']])

df_inputs_prepr['addr_state:GA_WA_OR'] = sum([df_inputs_prepr['addr_state:GA'], df_inputs_prepr['addr_state:WA'],
                                              df_inputs_prepr['addr_state:OR']])

df_inputs_prepr['addr_state:WI_MT'] = sum([df_inputs_prepr['addr_state:WI'], df_inputs_prepr['addr_state:MT']])

df_inputs_prepr['addr_state:IL_CT'] = sum([df_inputs_prepr['addr_state:IL'], df_inputs_prepr['addr_state:CT']])

df_inputs_prepr['addr_state:KS_SC_CO_VT_AK_MS'] = sum([df_inputs_prepr['addr_state:KS'], df_inputs_prepr['addr_state:SC'],
                                              df_inputs_prepr['addr_state:CO'], df_inputs_prepr['addr_state:VT'],
                                              df_inputs_prepr['addr_state:AK'], df_inputs_prepr['addr_state:MS']])

df_inputs_prepr['addr_state:WV_NH_WY_DC_ME_ID'] = sum([df_inputs_prepr['addr_state:WV'], df_inputs_prepr['addr_state:NH'],
                                              df_inputs_prepr['addr_state:WY'], df_inputs_prepr['addr_state:DC'],
                                              df_inputs_prepr['addr_state:ME'], df_inputs_prepr['addr_state:ID']])
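
Since every grouped column above is a sum of existing dummy columns, the pattern can be wrapped in a small helper. `combine_dummies` below is a hypothetical function written for this illustration (it is not part of the notebook's `functions` module), and it silently skips categories whose dummy column does not exist.

```python
import pandas as pd

def combine_dummies(df, prefix, categories, sep=':'):
    """Sum the dummy columns '<prefix><sep><category>' that exist in df."""
    cols = [f'{prefix}{sep}{c}' for c in categories if f'{prefix}{sep}{c}' in df.columns]
    return df[cols].sum(axis=1).astype(int)

# Toy dummy columns (hypothetical data)
toy = pd.DataFrame({
    'addr_state:NM': [1, 0, 0],
    'addr_state:VA': [0, 1, 0],
    'addr_state:TX': [0, 0, 1],
})

toy['addr_state:NM_VA'] = combine_dummies(toy, 'addr_state', ['NM', 'VA'])
print(toy['addr_state:NM_VA'].tolist())
```

Skipping missing columns is a design choice that suits a test set lacking a rare category; raising an error instead would be the stricter option.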

## Part 2 Discrete Variables: Dummy Variables

**Let's repeat the same preprocessing on verification_status**

In [None]:
# 'verification_status'
df_temp = woe_discrete(df_inputs_prepr, 'verification_status', df_targets_prepr)
# We calculate weight of evidence.
df_temp


In [None]:
plot_by_woe(df_temp)
# We plot the weight of evidence values.

OK
**How about the purpose column?**

In [None]:
# 'purpose'
df_temp = woe_discrete(df_inputs_prepr, 'purpose', df_targets_prepr)
# We calculate weight of evidence.
df_temp

In [None]:
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.

In [66]:
# We combine 'educational', 'small_business', 'wedding', 'renewable_energy', 'moving', 'house' in one category: 'educ__sm_b__wedd__ren_en__mov__house'.
# We combine 'other', 'medical', 'vacation' in one category: 'oth__med__vacation'.
# We combine 'major_purchase', 'car', 'home_improvement' in one category: 'major_purch__car__home_impr'.
# We leave 'debt_consolidation' in a separate category.
# We leave 'credit_card' in a separate category.
# 'educ__sm_b__wedd__ren_en__mov__house' will be the reference category.
df_inputs_prepr['purpose:educ__sm_b__wedd__ren_en__mov__house'] = sum([df_inputs_prepr['purpose:educational'], df_inputs_prepr['purpose:small_business'],
                                                                 df_inputs_prepr['purpose:wedding'], df_inputs_prepr['purpose:renewable_energy'],
                                                                 df_inputs_prepr['purpose:moving'], df_inputs_prepr['purpose:house']])
df_inputs_prepr['purpose:oth__med__vacation'] = sum([df_inputs_prepr['purpose:other'], df_inputs_prepr['purpose:medical'],
                                             df_inputs_prepr['purpose:vacation']])
df_inputs_prepr['purpose:major_purch__car__home_impr'] = sum([df_inputs_prepr['purpose:major_purchase'], df_inputs_prepr['purpose:car'],
                                                        df_inputs_prepr['purpose:home_improvement']])

**initial list status**

In [None]:
# 'initial_list_status'
df_temp = woe_discrete(df_inputs_prepr, 'initial_list_status', df_targets_prepr)
df_temp

In [None]:
plot_by_woe(df_temp)
# We plot the weight of evidence values.

# 5 Continuous Data Preparation - WoE

In [None]:
df_inputs_prepr['term_int'].unique()

In [None]:
df_temp = woe_ordered_continuous ( df_inputs_prepr, 'term_int', df_targets_prepr)
df_temp

In [None]:
plot_by_woe(df_temp)

In [72]:
df_inputs_prepr['term:36'] = np.where((df_inputs_prepr['term_int']==36),1,0)
df_inputs_prepr['term:60'] = np.where((df_inputs_prepr['term_int']==60),1,0)

In [None]:
df_inputs_prepr['emp_length_int'].unique()

In [None]:
df_temp = woe_ordered_continuous (df_inputs_prepr,'emp_length_int',df_targets_prepr)
df_temp

In [None]:
plot_by_woe(df_temp)

In [76]:
# We create the following categories: '0', '1', '2 - 4', '5 - 6', '7 - 9', '10'
# '0' will be the reference category
df_inputs_prepr['emp_length:0'] = np.where(df_inputs_prepr['emp_length_int'].isin([0]), 1, 0)
df_inputs_prepr['emp_length:1'] = np.where(df_inputs_prepr['emp_length_int'].isin([1]), 1, 0)
df_inputs_prepr['emp_length:2-4'] = np.where(df_inputs_prepr['emp_length_int'].isin(range(2, 5)), 1, 0)
df_inputs_prepr['emp_length:5-6'] = np.where(df_inputs_prepr['emp_length_int'].isin(range(5, 7)), 1, 0)
df_inputs_prepr['emp_length:7-9'] = np.where(df_inputs_prepr['emp_length_int'].isin(range(7, 10)), 1, 0)
df_inputs_prepr['emp_length:10'] = np.where(df_inputs_prepr['emp_length_int'].isin([10]), 1, 0)
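
The same six categories can also be produced in one pass with `pd.cut` followed by `get_dummies`. This is an alternative sketch on sample values, not the approach used above; the bin edges are chosen so that the right-closed intervals match the categories listed in the comment.

```python
import pandas as pd

emp = pd.Series([0, 1, 3, 5, 8, 10], name='emp_length_int')   # sample values

# Right-closed integer bins: (-1,0], (0,1], (1,4], (4,6], (6,9], (9,10]
labels = ['0', '1', '2-4', '5-6', '7-9', '10']
binned = pd.cut(emp, bins=[-1, 0, 1, 4, 6, 9, 10], labels=labels)

# One dummy column per category, named 'emp_length:<label>'
dummies = pd.get_dummies(binned, prefix='emp_length', prefix_sep=':').astype(int)
print(dummies.columns.tolist())
```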

## Part 1 Continuous Variables - Dummy Variables

In [None]:
df_inputs_prepr['mths_since_issue_d'].unique()

In [78]:
#let's do fine classing first: to roughly group the values into categories
#second, we need to do coarse classing: determining final categories, combining few of initial fine classing
# categories into bigger categories, if needed
df_inputs_prepr['mths_since_issue_d_factor'] = pd.cut(df_inputs_prepr['mths_since_issue_d'],50)

In [None]:
df_inputs_prepr['mths_since_issue_d_factor'] 

In [None]:
df_temp = woe_ordered_continuous (df_inputs_prepr, 'mths_since_issue_d_factor',df_targets_prepr)
df_temp

In [None]:
plot_by_woe(df_temp,90)

In [82]:
# int_rate categories: '< 9.548', '9.548 - 12.025', '12.025 - 15.74', '15.74 - 20.281', '> 20.281'
df_inputs_prepr['int_rate:<9.548'] = np.where((df_inputs_prepr['int_rate'] <= 9.548), 1, 0)
df_inputs_prepr['int_rate:9.548-12.025'] = np.where((df_inputs_prepr['int_rate'] > 9.548) & (df_inputs_prepr['int_rate'] <= 12.025), 1, 0)
df_inputs_prepr['int_rate:12.025-15.74'] = np.where((df_inputs_prepr['int_rate'] > 12.025) & (df_inputs_prepr['int_rate'] <= 15.74), 1, 0)
df_inputs_prepr['int_rate:15.74-20.281'] = np.where((df_inputs_prepr['int_rate'] > 15.74) & (df_inputs_prepr['int_rate'] <= 20.281), 1, 0)
df_inputs_prepr['int_rate:>20.281'] = np.where((df_inputs_prepr['int_rate'] > 20.281), 1, 0)

In [None]:
# funded_amnt
df_inputs_prepr['funded_amnt_factor'] = pd.cut(df_inputs_prepr['funded_amnt'], 50)
# Here we do fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
df_temp = woe_ordered_continuous(df_inputs_prepr, 'funded_amnt_factor', df_targets_prepr)
# We calculate weight of evidence.
df_temp

In [None]:
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.

In [None]:
# mths_since_earliest_cr_line
df_inputs_prepr['mths_since_earliest_cr_line_factor'] = pd.cut(df_inputs_prepr['mths_since_earliest_cr_line'], 50)
# Here we do fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
df_temp = woe_ordered_continuous(df_inputs_prepr, 'mths_since_earliest_cr_line_factor', df_targets_prepr)
# We calculate weight of evidence.
df_temp

In [None]:
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.

In [87]:
# We create the following categories:
# < 140, # 141 - 164, # 165 - 247, # 248 - 270, # 271 - 352, # > 352
df_inputs_prepr['mths_since_earliest_cr_line:<140'] = np.where(df_inputs_prepr['mths_since_earliest_cr_line'].isin(range(140)), 1, 0)
df_inputs_prepr['mths_since_earliest_cr_line:141-164'] = np.where(df_inputs_prepr['mths_since_earliest_cr_line'].isin(range(140, 165)), 1, 0)
df_inputs_prepr['mths_since_earliest_cr_line:165-247'] = np.where(df_inputs_prepr['mths_since_earliest_cr_line'].isin(range(165, 248)), 1, 0)
df_inputs_prepr['mths_since_earliest_cr_line:248-270'] = np.where(df_inputs_prepr['mths_since_earliest_cr_line'].isin(range(248, 271)), 1, 0)
df_inputs_prepr['mths_since_earliest_cr_line:271-352'] = np.where(df_inputs_prepr['mths_since_earliest_cr_line'].isin(range(271, 353)), 1, 0)
# a plain comparison also captures the maximum value, which range(353, max) would exclude
df_inputs_prepr['mths_since_earliest_cr_line:>352'] = np.where(df_inputs_prepr['mths_since_earliest_cr_line'] >= 353, 1, 0)

In [None]:
# delinq_2yrs
df_temp = woe_ordered_continuous(df_inputs_prepr, 'delinq_2yrs', df_targets_prepr)
# We calculate weight of evidence.
df_temp

In [None]:
plot_by_woe(df_temp)
# We plot the weight of evidence values.

In [90]:
# Categories: 0, 1-3, >=4
df_inputs_prepr['delinq_2yrs:0'] = np.where((df_inputs_prepr['delinq_2yrs'] == 0), 1, 0)
df_inputs_prepr['delinq_2yrs:1-3'] = np.where((df_inputs_prepr['delinq_2yrs'] >= 1) & (df_inputs_prepr['delinq_2yrs'] <= 3), 1, 0)
df_inputs_prepr['delinq_2yrs:>=4'] = np.where((df_inputs_prepr['delinq_2yrs'] >= 4), 1, 0)

In [None]:
# inq_last_6mths
df_temp = woe_ordered_continuous(df_inputs_prepr, 'inq_last_6mths', df_targets_prepr)
# We calculate weight of evidence.
df_temp

In [None]:
plot_by_woe(df_temp)
# We plot the weight of evidence values.

In [93]:
# Categories: 0, 1 - 2, 3 - 6, > 6
df_inputs_prepr['inq_last_6mths:0'] = np.where((df_inputs_prepr['inq_last_6mths'] == 0), 1, 0)
df_inputs_prepr['inq_last_6mths:1-2'] = np.where((df_inputs_prepr['inq_last_6mths'] >= 1) & (df_inputs_prepr['inq_last_6mths'] <= 2), 1, 0)
df_inputs_prepr['inq_last_6mths:3-6'] = np.where((df_inputs_prepr['inq_last_6mths'] >= 3) & (df_inputs_prepr['inq_last_6mths'] <= 6), 1, 0)
df_inputs_prepr['inq_last_6mths:>6'] = np.where((df_inputs_prepr['inq_last_6mths'] > 6), 1, 0)

In [None]:
# open_acc
df_temp = woe_ordered_continuous(df_inputs_prepr, 'open_acc', df_targets_prepr)
# We calculate weight of evidence.
df_temp

In [None]:
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.

In [None]:
plot_by_woe(df_temp.iloc[ : 40, :], 90)
# We plot the weight of evidence values.

In [97]:
# Categories: '0', '1-3', '4-12', '13-17', '18-22', '23-25', '26-30', '>30'
df_inputs_prepr['open_acc:0'] = np.where((df_inputs_prepr['open_acc'] == 0), 1, 0)
df_inputs_prepr['open_acc:1-3'] = np.where((df_inputs_prepr['open_acc'] >= 1) & (df_inputs_prepr['open_acc'] <= 3), 1, 0)
df_inputs_prepr['open_acc:4-12'] = np.where((df_inputs_prepr['open_acc'] >= 4) & (df_inputs_prepr['open_acc'] <= 12), 1, 0)
df_inputs_prepr['open_acc:13-17'] = np.where((df_inputs_prepr['open_acc'] >= 13) & (df_inputs_prepr['open_acc'] <= 17), 1, 0)
df_inputs_prepr['open_acc:18-22'] = np.where((df_inputs_prepr['open_acc'] >= 18) & (df_inputs_prepr['open_acc'] <= 22), 1, 0)
df_inputs_prepr['open_acc:23-25'] = np.where((df_inputs_prepr['open_acc'] >= 23) & (df_inputs_prepr['open_acc'] <= 25), 1, 0)
df_inputs_prepr['open_acc:26-30'] = np.where((df_inputs_prepr['open_acc'] >= 26) & (df_inputs_prepr['open_acc'] <= 30), 1, 0)
df_inputs_prepr['open_acc:>=31'] = np.where((df_inputs_prepr['open_acc'] >= 31), 1, 0)

In [None]:
# pub_rec
df_temp = woe_ordered_continuous(df_inputs_prepr, 'pub_rec', df_targets_prepr)
# We calculate weight of evidence.
df_temp

In [None]:
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.

In [100]:
# Categories '0-2', '3-4', '>=5'
df_inputs_prepr['pub_rec:0-2'] = np.where((df_inputs_prepr['pub_rec'] >= 0) & (df_inputs_prepr['pub_rec'] <= 2), 1, 0)
df_inputs_prepr['pub_rec:3-4'] = np.where((df_inputs_prepr['pub_rec'] >= 3) & (df_inputs_prepr['pub_rec'] <= 4), 1, 0)
df_inputs_prepr['pub_rec:>=5'] = np.where((df_inputs_prepr['pub_rec'] >= 5), 1, 0)

In [None]:
# total_acc
df_inputs_prepr['total_acc_factor'] = pd.cut(df_inputs_prepr['total_acc'], 50)
# Here we do fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
df_temp = woe_ordered_continuous(df_inputs_prepr, 'total_acc_factor', df_targets_prepr)
# We calculate weight of evidence.
df_temp

In [None]:
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.

In [103]:
# Categories: '<=27', '28-51', '>51'
df_inputs_prepr['total_acc:<=27'] = np.where((df_inputs_prepr['total_acc'] <= 27), 1, 0)
df_inputs_prepr['total_acc:28-51'] = np.where((df_inputs_prepr['total_acc'] >= 28) & (df_inputs_prepr['total_acc'] <= 51), 1, 0)
df_inputs_prepr['total_acc:>=52'] = np.where((df_inputs_prepr['total_acc'] >= 52), 1, 0)

In [None]:
# acc_now_delinq
df_temp = woe_ordered_continuous(df_inputs_prepr, 'acc_now_delinq', df_targets_prepr)
# We calculate weight of evidence.
df_temp

In [None]:
plot_by_woe(df_temp)
# We plot the weight of evidence values.

In [106]:
# Categories: '0', '>=1'
df_inputs_prepr['acc_now_delinq:0'] = np.where((df_inputs_prepr['acc_now_delinq'] == 0), 1, 0)
df_inputs_prepr['acc_now_delinq:>=1'] = np.where((df_inputs_prepr['acc_now_delinq'] >= 1), 1, 0)

In [None]:
# total_rev_hi_lim
df_inputs_prepr['total_rev_hi_lim_factor'] = pd.cut(df_inputs_prepr['total_rev_hi_lim'], 2000)
# Here we do fine-classing: using the 'cut' method, we split the variable into 2000 categories by its values.
df_temp = woe_ordered_continuous(df_inputs_prepr, 'total_rev_hi_lim_factor', df_targets_prepr)
# We calculate weight of evidence.
df_temp

In [None]:
plot_by_woe(df_temp.iloc[: 50, : ], 90)
# We plot the weight of evidence values.

In [109]:
# Categories
# '<=5K', '5K-10K', '10K-20K', '20K-30K', '30K-40K', '40K-55K', '55K-95K', '>95K'
df_inputs_prepr['total_rev_hi_lim:<=5K'] = np.where((df_inputs_prepr['total_rev_hi_lim'] <= 5000), 1, 0)
df_inputs_prepr['total_rev_hi_lim:5K-10K'] = np.where((df_inputs_prepr['total_rev_hi_lim'] > 5000) & (df_inputs_prepr['total_rev_hi_lim'] <= 10000), 1, 0)
df_inputs_prepr['total_rev_hi_lim:10K-20K'] = np.where((df_inputs_prepr['total_rev_hi_lim'] > 10000) & (df_inputs_prepr['total_rev_hi_lim'] <= 20000), 1, 0)
df_inputs_prepr['total_rev_hi_lim:20K-30K'] = np.where((df_inputs_prepr['total_rev_hi_lim'] > 20000) & (df_inputs_prepr['total_rev_hi_lim'] <= 30000), 1, 0)
df_inputs_prepr['total_rev_hi_lim:30K-40K'] = np.where((df_inputs_prepr['total_rev_hi_lim'] > 30000) & (df_inputs_prepr['total_rev_hi_lim'] <= 40000), 1, 0)
df_inputs_prepr['total_rev_hi_lim:40K-55K'] = np.where((df_inputs_prepr['total_rev_hi_lim'] > 40000) & (df_inputs_prepr['total_rev_hi_lim'] <= 55000), 1, 0)
df_inputs_prepr['total_rev_hi_lim:55K-95K'] = np.where((df_inputs_prepr['total_rev_hi_lim'] > 55000) & (df_inputs_prepr['total_rev_hi_lim'] <= 95000), 1, 0)
df_inputs_prepr['total_rev_hi_lim:>95K'] = np.where((df_inputs_prepr['total_rev_hi_lim'] > 95000), 1, 0)

In [None]:
# installment
df_inputs_prepr['installment_factor'] = pd.cut(df_inputs_prepr['installment'], 50)
# Here we do fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
df_temp = woe_ordered_continuous(df_inputs_prepr, 'installment_factor', df_targets_prepr)
# We calculate weight of evidence.
df_temp

In [None]:
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.

##Part 2 Continuous Variables - Dummy Variables

In [None]:
# annual_inc
# After trial and error, we decided to fine-class only incomes of 140K or less, splitting them into 50 bins.
# .copy() gives us an independent frame, so adding the factor column does not trigger SettingWithCopyWarning.
df_inputs_prepr_temp = df_inputs_prepr.loc[df_inputs_prepr['annual_inc'] <= 140000, :].copy()
df_inputs_prepr_temp['annual_inc_factor'] = pd.cut(df_inputs_prepr_temp['annual_inc'], 50)
df_temp = woe_ordered_continuous(df_inputs_prepr_temp, 'annual_inc_factor', df_targets_prepr[df_inputs_prepr_temp.index])
df_temp

In [None]:
plot_by_woe(df_temp,90)

In [114]:
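# WoE is monotonic in income, so we coarse-class by round thresholds: 10K-wide bins up to 100K,
# 20K-wide bins up to 140K, and one category for everything above 140K (12 categories in total).
# Note that 'annual_inc:<20K' also includes incomes of exactly 20K.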
df_inputs_prepr['annual_inc:<20K'] = np.where((df_inputs_prepr['annual_inc'] <= 20000), 1, 0)
df_inputs_prepr['annual_inc:20K-30K'] = np.where((df_inputs_prepr['annual_inc'] > 20000) & (df_inputs_prepr['annual_inc'] <= 30000), 1, 0)
df_inputs_prepr['annual_inc:30K-40K'] = np.where((df_inputs_prepr['annual_inc'] > 30000) & (df_inputs_prepr['annual_inc'] <= 40000), 1, 0)
df_inputs_prepr['annual_inc:40K-50K'] = np.where((df_inputs_prepr['annual_inc'] > 40000) & (df_inputs_prepr['annual_inc'] <= 50000), 1, 0)
df_inputs_prepr['annual_inc:50K-60K'] = np.where((df_inputs_prepr['annual_inc'] > 50000) & (df_inputs_prepr['annual_inc'] <= 60000), 1, 0)
df_inputs_prepr['annual_inc:60K-70K'] = np.where((df_inputs_prepr['annual_inc'] > 60000) & (df_inputs_prepr['annual_inc'] <= 70000), 1, 0)
df_inputs_prepr['annual_inc:70K-80K'] = np.where((df_inputs_prepr['annual_inc'] > 70000) & (df_inputs_prepr['annual_inc'] <= 80000), 1, 0)
df_inputs_prepr['annual_inc:80K-90K'] = np.where((df_inputs_prepr['annual_inc'] > 80000) & (df_inputs_prepr['annual_inc'] <= 90000), 1, 0)
df_inputs_prepr['annual_inc:90K-100K'] = np.where((df_inputs_prepr['annual_inc'] > 90000) & (df_inputs_prepr['annual_inc'] <= 100000), 1, 0)
df_inputs_prepr['annual_inc:100K-120K'] = np.where((df_inputs_prepr['annual_inc'] > 100000) & (df_inputs_prepr['annual_inc'] <= 120000), 1, 0)
df_inputs_prepr['annual_inc:120K-140K'] = np.where((df_inputs_prepr['annual_inc'] > 120000) & (df_inputs_prepr['annual_inc'] <= 140000), 1, 0)
df_inputs_prepr['annual_inc:>140K'] = np.where((df_inputs_prepr['annual_inc'] > 140000), 1, 0)
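
The twelve `np.where` lines above all follow one pattern: one 0/1 column per right-closed interval. As a minimal sketch, the same dummies could be generated from a list of bin edges with `pd.cut` and `pd.get_dummies`. The helper name `interval_dummies` and the toy data are hypothetical and not part of this notebook.

```python
import numpy as np
import pandas as pd

def interval_dummies(series, edges, labels, prefix):
    """Hypothetical helper: one 0/1 column per right-closed interval (edges[i], edges[i+1]]."""
    # right=True reproduces the '> lower & <= upper' logic of the np.where lines.
    binned = pd.cut(series, bins=edges, labels=labels, right=True)
    # get_dummies creates one column per label, in label order; NaN rows get 0 everywhere.
    dummies = pd.get_dummies(binned, prefix=prefix, prefix_sep=':')
    return dummies.astype(int)

# Toy incomes chosen to sit on and around the bin boundaries.
toy = pd.DataFrame({'annual_inc': [15000, 20000, 20001, 95000, 140000, 250000]})

edges = [-np.inf, 20000, 30000, 40000, 50000, 60000, 70000, 80000,
         90000, 100000, 120000, 140000, np.inf]
labels = ['<20K', '20K-30K', '30K-40K', '40K-50K', '50K-60K', '60K-70K',
          '70K-80K', '80K-90K', '90K-100K', '100K-120K', '120K-140K', '>140K']

result = interval_dummies(toy['annual_inc'], edges, labels, 'annual_inc')
print(result.T)
```

Keeping the edges in one list makes it harder to leave a gap or an overlap between neighbouring categories. The explicit `np.where` version is easier to read line by line, which may be why the notebook uses it.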


In [None]:
# mths_since_last_delinq
# We have to create one category for missing values and do fine and coarse classing for the rest.
df_inputs_prepr_temp = df_inputs_prepr[pd.notnull(df_inputs_prepr['mths_since_last_delinq'])].copy()
df_inputs_prepr_temp['mths_since_last_delinq_factor'] = pd.cut(df_inputs_prepr_temp['mths_since_last_delinq'], 50)
df_temp = woe_ordered_continuous(df_inputs_prepr_temp, 'mths_since_last_delinq_factor', df_targets_prepr[df_inputs_prepr_temp.index])
# We calculate weight of evidence.
df_temp

In [None]:
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.

In [117]:
# Categories: Missing, 0-3, 4-30, 31-56, >=57
df_inputs_prepr['mths_since_last_delinq:Missing'] = np.where((df_inputs_prepr['mths_since_last_delinq'].isnull()), 1, 0)
df_inputs_prepr['mths_since_last_delinq:0-3'] = np.where((df_inputs_prepr['mths_since_last_delinq'] >= 0) & (df_inputs_prepr['mths_since_last_delinq'] <= 3), 1, 0)
df_inputs_prepr['mths_since_last_delinq:4-30'] = np.where((df_inputs_prepr['mths_since_last_delinq'] >= 4) & (df_inputs_prepr['mths_since_last_delinq'] <= 30), 1, 0)
df_inputs_prepr['mths_since_last_delinq:31-56'] = np.where((df_inputs_prepr['mths_since_last_delinq'] >= 31) & (df_inputs_prepr['mths_since_last_delinq'] <= 56), 1, 0)
df_inputs_prepr['mths_since_last_delinq:>=57'] = np.where((df_inputs_prepr['mths_since_last_delinq'] >= 57), 1, 0)
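
Comparisons against `NaN` evaluate to `False`, so a missing value gets 0 in every range dummy and is picked up only by the `Missing` column. A minimal sketch of a sanity check that every row falls into exactly one category, run here on toy data (the values are illustrative, not taken from the loan dataset):

```python
import numpy as np
import pandas as pd

# Toy values on the category boundaries, plus one missing value.
toy = pd.DataFrame({'mths_since_last_delinq': [np.nan, 0, 3, 4, 30, 31, 56, 57, 120]})
col = toy['mths_since_last_delinq']

toy['mths_since_last_delinq:Missing'] = np.where(col.isnull(), 1, 0)
toy['mths_since_last_delinq:0-3'] = np.where((col >= 0) & (col <= 3), 1, 0)
toy['mths_since_last_delinq:4-30'] = np.where((col >= 4) & (col <= 30), 1, 0)
toy['mths_since_last_delinq:31-56'] = np.where((col >= 31) & (col <= 56), 1, 0)
toy['mths_since_last_delinq:>=57'] = np.where((col >= 57), 1, 0)

# Each row should be flagged by exactly one dummy column.
dummy_cols = [c for c in toy.columns if c.startswith('mths_since_last_delinq:')]
rows_covered = toy[dummy_cols].sum(axis=1)
uncovered = toy.loc[rows_covered != 1]

print(rows_covered.tolist())
print(len(uncovered))
```

Because the bounds are integers, a fractional value such as 3.5 would fall between '0-3' and '4-30' and show up in `uncovered`. That is harmless for a month count, but it is worth checking for variables that are not whole numbers.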

In [None]:
# dti
# As with income, an initial examination shows that most values are 35 or lower.
# Hence, we create one category for values above 35 and apply fine and coarse classing
# to the observations with a dti of 35 or less.
df_inputs_prepr_temp = df_inputs_prepr.loc[df_inputs_prepr['dti'] <= 35, : ].copy()
# Here we do fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
df_inputs_prepr_temp['dti_factor'] = pd.cut(df_inputs_prepr_temp['dti'], 50)
# We calculate weight of evidence.
df_temp = woe_ordered_continuous(df_inputs_prepr_temp, 'dti_factor', df_targets_prepr[df_inputs_prepr_temp.index])

df_temp

In [None]:
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.

In [None]:
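# Categories: '<=1.4', '1.4-3.5', '3.5-7.7', '7.7-10.5', '10.5-16.1', '16.1-20.3', '20.3-21.7', '21.7-22.4', '22.4-35', '>35'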
df_inputs_prepr['dti:<=1.4'] = np.where((df_inputs_prepr['dti'] <= 1.4), 1, 0)
df_inputs_prepr['dti:1.4-3.5'] = np.where((df_inputs_prepr['dti'] > 1.4) & (df_inputs_prepr['dti'] <= 3.5), 1, 0)
df_inputs_prepr['dti:3.5-7.7'] = np.where((df_inputs_prepr['dti'] > 3.5) & (df_inputs_prepr['dti'] <= 7.7), 1, 0)
df_inputs_prepr['dti:7.7-10.5'] = np.where((df_inputs_prepr['dti'] > 7.7) & (df_inputs_prepr['dti'] <= 10.5), 1, 0)
df_inputs_prepr['dti:10.5-16.1'] = np.where((df_inputs_prepr['dti'] > 10.5) & (df_inputs_prepr['dti'] <= 16.1), 1, 0)
df_inputs_prepr['dti:16.1-20.3'] = np.where((df_inputs_prepr['dti'] > 16.1) & (df_inputs_prepr['dti'] <= 20.3), 1, 0)
df_inputs_prepr['dti:20.3-21.7'] = np.where((df_inputs_prepr['dti'] > 20.3) & (df_inputs_prepr['dti'] <= 21.7), 1, 0)
df_inputs_prepr['dti:21.7-22.4'] = np.where((df_inputs_prepr['dti'] > 21.7) & (df_inputs_prepr['dti'] <= 22.4), 1, 0)
df_inputs_prepr['dti:22.4-35'] = np.where((df_inputs_prepr['dti'] > 22.4) & (df_inputs_prepr['dti'] <= 35), 1, 0)
df_inputs_prepr['dti:>35'] = np.where((df_inputs_prepr['dti'] > 35), 1, 0)

In [None]:
# mths_since_last_record
# We have to create one category for missing values and do fine and coarse classing for the rest.
df_inputs_prepr_temp = df_inputs_prepr[pd.notnull(df_inputs_prepr['mths_since_last_record'])].copy()
# sum(df_inputs_prepr['mths_since_last_record'].isnull())  # optional: count the missing values
df_inputs_prepr_temp['mths_since_last_record_factor'] = pd.cut(df_inputs_prepr_temp['mths_since_last_record'], 50)
# Here we do fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
df_temp = woe_ordered_continuous(df_inputs_prepr_temp, 'mths_since_last_record_factor', df_targets_prepr[df_inputs_prepr_temp.index])
# We calculate weight of evidence.
df_temp

In [None]:
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.

In [None]:
# Categories: 'Missing', '0-2', '3-20', '21-31', '32-80', '81-86', '>86'
# Note: the last column is named '>=86', but it holds values strictly greater than 86.
df_inputs_prepr['mths_since_last_record:Missing'] = np.where((df_inputs_prepr['mths_since_last_record'].isnull()), 1, 0)
df_inputs_prepr['mths_since_last_record:0-2'] = np.where((df_inputs_prepr['mths_since_last_record'] >= 0) & (df_inputs_prepr['mths_since_last_record'] <= 2), 1, 0)
df_inputs_prepr['mths_since_last_record:3-20'] = np.where((df_inputs_prepr['mths_since_last_record'] >= 3) & (df_inputs_prepr['mths_since_last_record'] <= 20), 1, 0)
df_inputs_prepr['mths_since_last_record:21-31'] = np.where((df_inputs_prepr['mths_since_last_record'] >= 21) & (df_inputs_prepr['mths_since_last_record'] <= 31), 1, 0)
df_inputs_prepr['mths_since_last_record:32-80'] = np.where((df_inputs_prepr['mths_since_last_record'] >= 32) & (df_inputs_prepr['mths_since_last_record'] <= 80), 1, 0)
df_inputs_prepr['mths_since_last_record:81-86'] = np.where((df_inputs_prepr['mths_since_last_record'] >= 81) & (df_inputs_prepr['mths_since_last_record'] <= 86), 1, 0)
df_inputs_prepr['mths_since_last_record:>=86'] = np.where((df_inputs_prepr['mths_since_last_record'] > 86), 1, 0)

#6 Exporting CSV files - Preprocessed data

##a Exporting train dataset

In [124]:
# Run this cell for the train dataset first; the commented-out lines are the equivalent export for the test dataset.
inputs_train = df_inputs_prepr
inputs_train.to_csv('loan_data_inputs_train.csv')
targets_train.to_csv('loan_data_targets_train.csv')
#inputs_test = df_inputs_prepr
#inputs_test.to_csv('loan_data_inputs_test.csv')
#targets_test.to_csv('loan_data_targets_test.csv')
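
`to_csv` writes the DataFrame index as an unnamed first column by default, so the file needs `index_col=0` when it is read back. A minimal sketch of the round trip, using an in-memory buffer and toy data in place of the real files (the index values and rows are illustrative):

```python
import io
import pandas as pd

# Toy stand-in for the preprocessed inputs; the index mimics the original row labels.
toy_inputs = pd.DataFrame({'dti:<=1.4': [1, 0], 'dti:>35': [0, 1]}, index=[101, 205])

buffer = io.StringIO()
toy_inputs.to_csv(buffer)

# Reading without index_col turns the saved index into an extra 'Unnamed: 0' column.
buffer.seek(0)
naive = pd.read_csv(buffer)

# Reading with index_col=0 restores the original index.
buffer.seek(0)
reloaded = pd.read_csv(buffer, index_col=0)

print(naive.columns.tolist())
print(reloaded.equals(toy_inputs))
```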

##b Exporting test dataset

There are two options for preprocessing the test dataset.

Option 1:

 Go to the code in '3c', point df_inputs_prepr at inputs_test, and rerun every cell from there down to this point.

Option 2: 

  Use the Python code in the src folder and run the preprocessing with it.

In [None]:
import sys # importing local functions in src folder
sys.path.append('../src/')
from functions import cross_validate_score, score_ML_log
# preproc_input (used in the next cell) must also be available; import it here if it is defined in src/functions.


In [None]:
# Run the preprocessing on the test dataset.
inputs_test = preproc_input(inputs_test)
inputs_test.columns.values
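
A dummy column that exists in the train set can be absent from the test set if a category never occurs there. A minimal sketch of how the test columns could be compared with, and aligned to, the train columns before exporting; the column names and data below are toy examples, not the notebook's actual column list:

```python
import pandas as pd

# Toy stand-ins: column order and membership differ between train and test.
train_cols = ['grade:A', 'grade:B', 'home_ownership:OTHER']
toy_test = pd.DataFrame({'grade:B': [1, 0], 'grade:A': [0, 1]})

# Columns the model was trained on that the test set lacks.
missing = [c for c in train_cols if c not in toy_test.columns]

# Reorder to the train layout and fill absent dummies with 0.
aligned = toy_test.reindex(columns=train_cols, fill_value=0)

print(missing)
print(aligned)
```

Filling with 0 is appropriate only for dummy columns, where 0 means "not in this category". A missing numeric column would need separate handling.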

In [126]:
inputs_test.to_csv('loan_data_inputs_test.csv')
targets_test.to_csv('loan_data_targets_test.csv')
