## **Bank Credit Scoring**
(Credit Scoring and Analytics)

### **Scorecard Development Process**

### **1. Load Data**

This example uses the ["loan_default"](https://www.kaggle.com/datasets/nikhil1e9/loan-default/) dataset.

The dataset contains demographic and financial information about each borrower, together with a binary target variable.

The dataset contains the following columns:
1. `LoanID` : A unique identifier for each loan.
2. `Age` : The age of the borrower.
3. `Income` : The annual income of the borrower.
4. `LoanAmount` : The amount of money being borrowed.
5. `CreditScore` : The credit score of the borrower, indicating their creditworthiness.
6. `MonthsEmployed` : The number of months the borrower has been employed.
7. `NumCreditLines` : The number of credit lines the borrower has open.
8. `InterestRate` : The interest rate for the loan.
9. `LoanTerm` : The term length of the loan in months.
10. `DTIRatio` : The Debt-to-Income ratio, indicating the borrower's debt compared to their income.
11. `Education` : The highest level of education attained by the borrower (PhD, Master's, Bachelor's, High School).
12. `EmploymentType` : The type of employment status of the borrower (Full-time, Part-time, Self-employed, Unemployed).
13. `MaritalStatus` : The marital status of the borrower (Single, Married, Divorced).
14. `HasMortgage` : Whether the borrower has a mortgage (Yes or No).
15. `HasDependents` : Whether the borrower has dependents (Yes or No).
16. `LoanPurpose` : The purpose of the loan (Home, Auto, Education, Business, Other).
17. `HasCoSigner` : Whether the loan has a co-signer (Yes or No).
18. `Default` : The binary target variable indicating whether the loan defaulted (1) or not (0).

In [1]:
#load library and configuration
import pandas as pd 
import sys

#append a specific path to the system path
sys.path.append("../src")

In [2]:
#import the 'utils' module which contains utility functions
import utils

In [3]:
#load configuration or data using 'config_load()' function from the 'utils' module
config_data = utils.config_load()
#display the loaded configuration data
config_data

{'raw_dataset_path': '../data/raw/Loan_default.csv',
 'dataset_path': '../data/output/data.pkl',
 'predictors_set_path': '../data/output/predictors.pkl',
 'response_set_path': '../data/output/response.pkl',
 'train_path': ['../data/output/X_train.pkl', '../data/output/y_train.pkl'],
 'test_path': ['../data/output/X_test.pkl', '../data/output/y_test.pkl'],
 'data_train_path': '../data/output/training_data.pkl',
 'data_train_binned_path': '../data/output/bin_training_data.pkl',
 'crosstab_list_path': '../data/output/list_crosstab.pkl',
 'WOE_table_path': '../data/output/WOE_table.pkl',
 'IV_table_path': '../data/output/IV_table.pkl',
 'WOE_map_dict_path': '../data/output/WOE_map_dict.pkl',
 'X_train_woe_path': '../data/output/X_train_woe.pkl',
 'response_variable': 'Default',
 'test_size': 0.2,
 'numeric_col': ['Age',
  'Income',
  'LoanAmount',
  'MonthsEmployed',
  'NumCreditLines',
  'InterestRate',
  'LoanTerm',
  'DTIRatio'],
 'categoric_col': ['Education',
  'EmploymentType',
  'Ma
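
The `utils` module itself is not shown in this notebook. As a rough idea of what its helpers might do, here is a minimal sketch. The function bodies, the `config_path` argument, and the JSON file format are all assumptions made for illustration; the real module may read YAML or take no arguments at all.

```python
import json
import pickle
import tempfile
from pathlib import Path


# Hypothetical stand-ins for the helpers in `utils` (the real code is not shown)
def config_load(config_path):
    """Read a configuration file and return it as a dict (JSON is assumed here)."""
    with open(config_path, "r") as f:
        return json.load(f)


def pickle_dump(obj, file_path):
    """Serialize `obj` to `file_path`, creating the parent folder if needed."""
    Path(file_path).parent.mkdir(parents=True, exist_ok=True)
    with open(file_path, "wb") as f:
        pickle.dump(obj, f)


def pickle_load(file_path):
    """Load a previously pickled object from `file_path`."""
    with open(file_path, "rb") as f:
        return pickle.load(f)


# Round trip in a temporary folder so nothing is written to the project
with tempfile.TemporaryDirectory() as tmp:
    cfg_path = Path(tmp) / "config.json"
    cfg_path.write_text(json.dumps({"response_variable": "Default", "test_size": 0.2}))
    demo_config = config_load(cfg_path)

    dump_path = Path(tmp) / "output" / "data.pkl"
    pickle_dump([1, 2, 3], dump_path)
    restored = pickle_load(dump_path)

print(demo_config)
print(restored)
```

Keeping paths and parameters in one configuration file means every later step reads the same values instead of repeating them in each cell.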

In [4]:
def read_data():
    #load data from the specified path    
    data_path = config_data['raw_dataset_path']
    data = pd.read_csv(data_path)

    #print the shape of the loaded data to check its dimension
    print("Data shape       :", data.shape)

    #save (pickle dump) the loaded data to a specified path
    dump_path = config_data['dataset_path']
    utils.pickle_dump(data, dump_path)

    #return the loaded data
    return data

In [5]:
#load the dataset and display
loan_default = read_data()
loan_default.head()

Data shape       : (255347, 18)


Unnamed: 0,LoanID,Age,Income,LoanAmount,CreditScore,MonthsEmployed,NumCreditLines,InterestRate,LoanTerm,DTIRatio,Education,EmploymentType,MaritalStatus,HasMortgage,HasDependents,LoanPurpose,HasCoSigner,Default
0,I38PQUQS96,56,85994,50587,520,80,4,15.23,36,0.44,Bachelor's,Full-time,Divorced,Yes,Yes,Other,Yes,0
1,HPSK72WA7R,69,50432,124440,458,15,1,4.81,60,0.68,Master's,Full-time,Married,No,No,Other,Yes,0
2,C1OZ6DPJ8Y,46,84208,129188,451,26,3,21.17,24,0.31,Master's,Unemployed,Divorced,Yes,Yes,Auto,No,1
3,V2KKSFM3UN,32,31713,44799,743,0,3,7.07,24,0.23,High School,Full-time,Married,No,No,Business,No,0
4,EY08JDHTZP,60,20437,9139,633,8,4,6.51,48,0.73,Bachelor's,Unemployed,Divorced,No,Yes,Auto,No,0


This dataset has **255,347** loan records and **18 variables**. Only the variables considered relevant for evaluating loan default are used below; `LoanID` and `CreditScore` are not among the selected columns.

--> 15 predictors (potential characteristics)

- predictor 1 : `Age`
- predictor 2 : `Income`
- predictor 3 : `LoanAmount`
- predictor 4 : `MonthsEmployed`
- predictor 5 : `NumCreditLines`
- predictor 6 : `InterestRate`
- predictor 7 : `LoanTerm`
- predictor 8 : `DTIRatio`
- predictor 9 : `Education`
- predictor 10 : `EmploymentType`
- predictor 11 : `MaritalStatus`
- predictor 12 : `HasMortgage`
- predictor 13 : `HasDependents`
- predictor 14 : `LoanPurpose`
- predictor 15 : `HasCoSigner`

--> 1 response variable

- `Default` = `1` : the loan defaulted
- `Default` = `0` : the loan did not default
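
Comparing the 18 original columns with the 15 predictors and the response listed above, the two columns left out are `LoanID` and `CreditScore`. The sketch below shows one way to arrive at the same 16 columns by dropping those two. It uses only the first two rows displayed earlier, and the names `sample`, `excluded`, and `selected` are illustrative.

```python
import pandas as pd

# First two rows of the dataset, copied from the output shown above
sample = pd.DataFrame({
    "LoanID": ["I38PQUQS96", "HPSK72WA7R"],
    "Age": [56, 69],
    "Income": [85994, 50432],
    "LoanAmount": [50587, 124440],
    "CreditScore": [520, 458],
    "MonthsEmployed": [80, 15],
    "NumCreditLines": [4, 1],
    "InterestRate": [15.23, 4.81],
    "LoanTerm": [36, 60],
    "DTIRatio": [0.44, 0.68],
    "Education": ["Bachelor's", "Master's"],
    "EmploymentType": ["Full-time", "Full-time"],
    "MaritalStatus": ["Divorced", "Married"],
    "HasMortgage": ["Yes", "No"],
    "HasDependents": ["Yes", "No"],
    "LoanPurpose": ["Other", "Other"],
    "HasCoSigner": ["Yes", "Yes"],
    "Default": [0, 0],
})

# Columns that are not used as predictors or as the response
excluded = ["LoanID", "CreditScore"]
selected = sample.drop(columns=excluded)

# Everything except the response is a predictor
predictors = [col for col in selected.columns if col != "Default"]
print(len(predictors), "predictors:", predictors)
```

Dropping by name keeps the original column order, so the result matches an explicit column list as long as no new columns appear in the raw file.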


In [6]:
#create a copy of the 'loan_default' dataset
data_score = loan_default.copy()

#select specific columns of interest for credit scoring
data_score = data_score[['Age', 'Income', 'LoanAmount', 'MonthsEmployed', 'NumCreditLines', 'InterestRate',
                       'LoanTerm', 'DTIRatio', 'Education', 'EmploymentType', 'MaritalStatus',
                       'HasMortgage', 'HasDependents', 'LoanPurpose', 'HasCoSigner', 'Default']]
#display the resulting dataset containing only the selected columns
data_score

Unnamed: 0,Age,Income,LoanAmount,MonthsEmployed,NumCreditLines,InterestRate,LoanTerm,DTIRatio,Education,EmploymentType,MaritalStatus,HasMortgage,HasDependents,LoanPurpose,HasCoSigner,Default
0,56,85994,50587,80,4,15.23,36,0.44,Bachelor's,Full-time,Divorced,Yes,Yes,Other,Yes,0
1,69,50432,124440,15,1,4.81,60,0.68,Master's,Full-time,Married,No,No,Other,Yes,0
2,46,84208,129188,26,3,21.17,24,0.31,Master's,Unemployed,Divorced,Yes,Yes,Auto,No,1
3,32,31713,44799,0,3,7.07,24,0.23,High School,Full-time,Married,No,No,Business,No,0
4,60,20437,9139,8,4,6.51,48,0.73,Bachelor's,Unemployed,Divorced,No,Yes,Auto,No,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
255342,19,37979,210682,109,4,14.11,12,0.85,Bachelor's,Full-time,Married,No,No,Other,No,0
255343,32,51953,189899,14,2,11.55,24,0.21,High School,Part-time,Divorced,No,No,Home,No,1
255344,56,84820,208294,70,3,5.29,60,0.50,High School,Self-employed,Married,Yes,Yes,Auto,Yes,0
255345,42,85109,60575,40,1,20.90,48,0.44,High School,Part-time,Single,Yes,Yes,Other,No,0


### **2. Sample Splitting**

Before modeling, the data is split so that model performance can be measured on data the model has not seen, which makes overfitting visible.

The split produces two sets: training data and testing data. The training data is used to fit the model, while the testing data is used to check how well the fitted model generalizes and predicts correctly on unseen data.

In a classification setting it is also important to examine the proportion of each class in the response variable in order to identify class imbalance. A strong imbalance can hurt the performance of a classification model.

In [7]:
#check proportion of target variable
data_score['Default'].value_counts(normalize=True)

0    0.883872
1    0.116128
Name: Default, dtype: float64

The response variable `Default` is imbalanced: about 88.4% of the loans did not default and about 11.6% did.
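
To see why this matters, consider a naive model that always predicts "no default". The sketch below uses synthetic labels with roughly the same class proportions as above (the sample size of 100,000 is an arbitrary assumption) and shows that such a model looks accurate while never catching a single default.

```python
import numpy as np

# Synthetic labels with roughly the proportions reported above (sizes are assumed)
n_total = 100_000
n_default = 11_613
y_true = np.zeros(n_total, dtype=int)
y_true[:n_default] = 1

# A naive "model" that predicts no default for every loan
y_pred = np.zeros(n_total, dtype=int)

# Share of correct predictions overall
accuracy = (y_true == y_pred).mean()

# Share of actual defaults that were predicted as defaults
recall_default = ((y_pred == 1) & (y_true == 1)).sum() / (y_true == 1).sum()

print(f"accuracy       : {accuracy:.4f}")
print(f"default recall : {recall_default:.4f}")
```

Accuracy alone is therefore a weak yardstick on this dataset, and the class ratio should be kept identical across the training and testing sets.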

To keep the same ratio in the training and testing sets, use a stratified split based on the response variable `Default`.

In [8]:
def splitting_data(data):
    """
    Split the dataset into predictor variables (X) and the response variable (y)

    Parameters
    ----------
    data : DataFrame
        The dataset containing both predictor and response variable

    Returns
    -------
    X : DataFrame
        Predictor variables (feature)
    y : Series
        Response variable

    This function takes a dataset and separates it into predictor variables (X) and the response variable (y).
    It also saves the predictor variables and the response variable to pickle files.
    """

    #define response variable
    response_variable = config_data['response_variable']
    
    #extract the response variable (y) from dataset
    y = data[response_variable]

    #extract the predictor variables (X)
    X = data.drop(columns = [response_variable])
    
    #display the shape of X and y 
    print('y shape :', y.shape)
    print('X shape :', X.shape)

    #save the predictor variable (X) to a pickle file
    dump_path_predictors = config_data['predictors_set_path']
    utils.pickle_dump(X, dump_path_predictors)

    #save the response variable (y) to a pickle file
    dump_path_response = config_data['response_set_path']    
    utils.pickle_dump(y, dump_path_response)
    
    return X, y

In [9]:
X, y = splitting_data(data_score)

y shape : (255347,)
X shape : (255347, 15)


Split the predictors (X) and the response variable (y) into training and testing sets.

- Set `stratify = y` to stratify the split on the class proportions of the response `y`.
- Set `test_size = 0.2` for holding 20% of the sample as a testing set.
- Set `random_state = 42` for reproducibility.
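
The effect of `stratify` can be checked on a small synthetic example. The sketch below is illustrative only: the data, the names `X_demo` and `y_demo`, and the 11.6% positive rate are assumptions chosen to resemble the class balance of this dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic data with an imbalanced binary response (about 11.6% positives)
rng = np.random.default_rng(42)
n = 10_000
X_demo = pd.DataFrame({"feature": rng.normal(size=n)})
y_demo = pd.Series((rng.random(n) < 0.116).astype(int), name="Default")

# Stratified split: class proportions are preserved in both sets
_, _, y_tr_s, y_te_s = train_test_split(
    X_demo, y_demo, stratify=y_demo, test_size=0.2, random_state=42
)

# Plain random split: class proportions may drift between the sets
_, _, y_tr_r, y_te_r = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Absolute difference in positive rate between training and testing sets
gap_stratified = abs(y_tr_s.mean() - y_te_s.mean())
gap_random = abs(y_tr_r.mean() - y_te_r.mean())

print(f"gap with stratify    : {gap_stratified:.5f}")
print(f"gap without stratify : {gap_random:.5f}")
```

With stratification the gap is bounded by rounding of the per-class counts, whereas a plain random split leaves it to chance.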

In [10]:
#import library 
from sklearn.model_selection import train_test_split

Update the config file so that it includes the train and test data paths and the test size, then reload it.

In [11]:
config_data = utils.config_load()
config_data

{'raw_dataset_path': '../data/raw/Loan_default.csv',
 'dataset_path': '../data/output/data.pkl',
 'predictors_set_path': '../data/output/predictors.pkl',
 'response_set_path': '../data/output/response.pkl',
 'train_path': ['../data/output/X_train.pkl', '../data/output/y_train.pkl'],
 'test_path': ['../data/output/X_test.pkl', '../data/output/y_test.pkl'],
 'data_train_path': '../data/output/training_data.pkl',
 'data_train_binned_path': '../data/output/bin_training_data.pkl',
 'crosstab_list_path': '../data/output/list_crosstab.pkl',
 'WOE_table_path': '../data/output/WOE_table.pkl',
 'IV_table_path': '../data/output/IV_table.pkl',
 'WOE_map_dict_path': '../data/output/WOE_map_dict.pkl',
 'X_train_woe_path': '../data/output/X_train_woe.pkl',
 'response_variable': 'Default',
 'test_size': 0.2,
 'numeric_col': ['Age',
  'Income',
  'LoanAmount',
  'MonthsEmployed',
  'NumCreditLines',
  'InterestRate',
  'LoanTerm',
  'DTIRatio'],
 'categoric_col': ['Education',
  'EmploymentType',
  'Ma

In [12]:
def split_train_test():
    """
    Split the dataset into training and testing

    Returns
    -------
    X_train : pd.DataFrame
        Training predictor variables
    X_test : pd.DataFrame
        Testing predictor variables
    y_train : pd.Series
        Training response variable
    y_test : pd.Series
        Testing response variable
    """
    
    #load the X and y
    X = utils.pickle_load(config_data['predictors_set_path'])
    y = utils.pickle_load(config_data['response_set_path'])

    #split the data
    X_train, X_test, y_train, y_test = train_test_split(X,
                                                        y,
                                                        stratify = y,
                                                        test_size = config_data['test_size'],
                                                        random_state = 42)
    #validate splitting
    print('X_train shape :', X_train.shape)
    print('y_train shape :', y_train.shape)
    print('X_test shape  :', X_test.shape)
    print('y_test shape  :', y_test.shape)

    #dump data
    utils.pickle_dump(X_train, config_data['train_path'][0])
    utils.pickle_dump(y_train, config_data['train_path'][1])
    utils.pickle_dump(X_test, config_data['test_path'][0])
    utils.pickle_dump(y_test, config_data['test_path'][1])

    return X_train, X_test, y_train, y_test

In [13]:
#check the function
X_train, X_test, y_train, y_test = split_train_test()

X_train shape : (204277, 15)
y_train shape : (204277,)
X_test shape  : (51070, 15)
y_test shape  : (51070,)
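
These shapes can be sanity-checked by hand. The sketch below assumes that scikit-learn rounds a fractional test size up to the next whole row and gives the remainder to the training set, which is consistent with the output above.

```python
import math

n_samples = 255_347   # rows in the full dataset
test_size = 0.2       # fraction held out for testing

# 0.2 * 255347 = 51069.4, rounded up to a whole number of rows
n_test = math.ceil(test_size * n_samples)

# The training set receives the remaining rows
n_train = n_samples - n_test

print("train rows :", n_train)
print("test rows  :", n_test)
```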


Check the proportion of the response variable `Default` in the training and testing sets.

In [14]:
#check proportion of target variable on data training
y_train.value_counts(normalize = True)

0    0.883873
1    0.116127
Name: Default, dtype: float64

In [15]:
#check proportion of target variable on data testing
y_test.value_counts(normalize = True)

0    0.883865
1    0.116135
Name: Default, dtype: float64
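
Rather than reading the two outputs separately, the proportions can be placed side by side. The helper below is a hypothetical convenience function, not part of the project's `utils` module, and it is demonstrated on tiny made-up label sets.

```python
import pandas as pd


def compare_class_proportions(**label_sets):
    """Return a table with one column of class proportions per named label set."""
    table = pd.DataFrame(
        {name: labels.value_counts(normalize=True) for name, labels in label_sets.items()}
    )
    return table.sort_index()


# Tiny made-up label sets, both with 20% positives
y_a = pd.Series([0] * 8 + [1] * 2, name="Default")
y_b = pd.Series([0] * 4 + [1] * 1, name="Default")

summary = compare_class_proportions(train=y_a, test=y_b)
print(summary)
```

In the notebook it could be called as `compare_class_proportions(train=y_train, test=y_test)`, where nearly identical columns confirm that the stratified split worked.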