# Credit Risk Scorecard Model Development

## Introduction

The **Credit Risk Scorecard** model, built from the Lending Club dataset, is used to compute the Probability of Default (PD), a key input to Expected Credit Loss (ECL) calculations. The scorecard assesses credit characteristics of potential borrowers, such as credit history, income, and outstanding debts, and assigns a score to each. Combining these scores gives a total score for each borrower, which translates into an estimated Point-in-Time (PiT) PD. The PiT PD reflects the borrower's likelihood of default at a specific point in time, accounting for both current and foreseeable future conditions.
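
As a rough illustration of how a total score can map to a PD, the sketch below uses the common points-to-double-the-odds (PDO) scaling. The anchor values (`target_score=600`, `target_odds=50`, `pdo=20`) and all function names are assumptions chosen for illustration; they are not parameters taken from this notebook.

```python
import math

def scaling_constants(target_score=600.0, target_odds=50.0, pdo=20.0):
    """Return (factor, offset) so that score = offset + factor * ln(odds of non-default).

    All three defaults are illustrative assumptions.
    """
    factor = pdo / math.log(2)
    offset = target_score - factor * math.log(target_odds)
    return factor, offset

def pd_to_score(pd_value, factor, offset):
    """Convert a probability of default into a score (higher score = lower risk)."""
    odds_good = (1.0 - pd_value) / pd_value
    return offset + factor * math.log(odds_good)

def score_to_pd(score, factor, offset):
    """Invert the scaling: recover the probability of default from a score."""
    odds_good = math.exp((score - offset) / factor)
    return 1.0 / (1.0 + odds_good)

factor, offset = scaling_constants()
score = pd_to_score(0.05, factor, offset)
recovered_pd = score_to_pd(score, factor, offset)
```

Under this scaling, doubling the odds of non-default adds exactly `pdo` points to the score, which is what makes the scale easy to read.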

For a complete view of credit risk, it is also essential to estimate the Lifetime PD: the borrower's likelihood of default over the full life of the exposure, taking into account potential future changes in economic and financial conditions.
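
One simple way to relate period PDs to a Lifetime PD is to chain survival probabilities. The sketch below assumes each input is the conditional PD for one period (the probability of default in that period given survival up to it); the function name and the three-year figures are hypothetical, and this is not necessarily the method used later in this project.

```python
import math

def lifetime_pd(conditional_pds):
    """Lifetime PD = 1 - probability of surviving every period."""
    survival = math.prod(1.0 - p for p in conditional_pds)
    return 1.0 - survival

# Hypothetical conditional PDs for years 1, 2 and 3 of an exposure
three_year_pd = lifetime_pd([0.02, 0.03, 0.04])
```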

## Setup

### Import Libraries

In [98]:
from notebooks.probability_of_default.helpers.Developer import Developer
from notebooks.probability_of_default.helpers.scorecard_tasks import *
from notebooks.probability_of_default.helpers.model_development_tasks import *

from IPython.display import HTML

import validmind as vm

vm.init(
  api_host = "http://localhost:3000/api/v1/tracking",
  api_key = "2494c3838f48efe590d531bfe225d90b",
  api_secret = "4f692f8161f128414fef542cab2a4e74834c75d01b3a8e088a1834f2afcfe838",
  project = "clk00h0u800x9qjy67gduf5om"
)



2023-08-14 14:32:02,428 - INFO(validmind.api_client): Connected to ValidMind. Project: [6] Credit Risk Scorecard - Initial Validation (clk00h0u800x9qjy67gduf5om)


### Input Parameters

In [99]:
default_column = "default"

lending_club_url = "https://vmai.s3.us-west-1.amazonaws.com/datasets/lending_club_loan_data_2007_2014.csv"

preliminary_features_to_drop = [
    "id", "member_id", "funded_amnt", "emp_title", "url", "desc", "application_type",
    "title", "zip_code", "delinq_2yrs", "mths_since_last_delinq", "mths_since_last_record",
    "revol_bal", "total_rec_prncp", "total_rec_late_fee", "recoveries", "out_prncp_inv", "out_prncp", 
    "collection_recovery_fee", "next_pymnt_d", "initial_list_status", "pub_rec",
    "collections_12_mths_ex_med", "policy_code", "acc_now_delinq", "pymnt_plan",
    "tot_coll_amt", "tot_cur_bal", "total_rev_hi_lim", "last_pymnt_d", "last_credit_pull_d",
    'earliest_cr_line', 'issue_d']

final_features_to_drop = ['addr_state', 'total_rec_int', 'loan_amnt',
                    'funded_amnt_inv', 'dti', 'revol_util', 'total_pymnt', 
                    'total_pymnt_inv', 'last_pymnt_amnt', "inq_last_6mths"]

min_missing_percentage = 80

iqr_threshold = 1.5

### Register Developer Tasks

In [100]:
# Instantiate the Developer class
developer = Developer()

# Register developer tasks
developer.add_task(
    task_id="import_raw_data", 
    task=import_raw_data,
)

developer.add_task(
    task_id="drop_features",
    task=drop_features,  
)

developer.add_task(
    task_id="add_default_definition",
    task=add_default_definition,  
)

developer.add_task(
    task_id="convert_term_column",
    task=convert_term_column,  
)

developer.add_task(
    task_id="convert_emp_length_column",
    task=convert_emp_length_column,  
)

developer.add_task(
    task_id="convert_inq_last_6mths_column",
    task=convert_inq_last_6mths_column,  
)

developer.add_task(
    task_id="data_split",
    task=data_split,  
)

developer.add_task(
    task_id="drop_categories",
    task=drop_categories,  
)

developer.add_task(
    task_id="convert_to_woe",
    task=convert_to_woe,  
)

developer.add_task(
    task_id="add_constant",
    task=add_constant,  
)

developer.add_task(
    task_id="train_model",
    task=train_model,  
)

developer.add_task(
    task_id="remove_features_missing_values",
    task=remove_features_missing_values,  
)

developer.add_task(
    task_id="remove_iqr_outliers",
    task=remove_iqr_outliers,  
)

'remove_iqr_outliers'
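
The `Developer` helper itself is not shown in this notebook. As a rough mental model only, a task registry of this kind could look like the sketch below: the class name `TaskRegistry` and its internals are hypothetical, and returning the task id from `add_task` is inferred from the cell output above.

```python
class TaskRegistry:
    """Hypothetical stand-in for a helper that registers and runs named tasks."""

    def __init__(self):
        self._tasks = {}
        self.plan = []  # one record per executed task

    def add_task(self, task_id, task):
        # Store the callable under its id and echo the id back
        self._tasks[task_id] = task
        return task_id

    def execute_task(self, area_id, task_id, inputs, validation_tests=None):
        # Look up the callable, run it on the inputs, and log what was done
        result = self._tasks[task_id](*inputs)
        self.plan.append({
            "area_id": area_id,
            "task_id": task_id,
            "validation_tests": validation_tests or [],
        })
        return result

registry = TaskRegistry()
registry.add_task(task_id="double", task=lambda values: [v * 2 for v in values])
doubled = registry.execute_task("demo_area", "double", inputs=[[1, 2, 3]])
```

Keeping a record per executed task is what would let such a helper print a validation plan afterwards.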

## Model Development

### Data Description

In [101]:
df_raw = developer.execute_task(
    area_id = "data_description",
    task_id = "import_raw_data", 
    inputs = [lending_club_url],
    validation_tests = [
        "descriptive_statistics", 
        "missing_values_bar_plot",
        "iqr_outliers_table"]
)

INFO: Executing task 'import_raw_data'...



Importing raw data from: https://vmai.s3.us-west-1.amazonaws.com/datasets/lending_club_loan_data_2007_2014.csv




Data imported successfully with 466285 rows and 75 columns.


### Data Preparation

In [102]:
df_preparation = developer.execute_task(
    area_id = "data_preparation",
    task_id = "drop_features", 
    inputs = [df_raw, preliminary_features_to_drop]
)

INFO: Executing task 'drop_features'...



Dropped 33 columns.
Columns remaining after dropping: 42


In [103]:
df_preparation = developer.execute_task(
    area_id = "data_preparation",
    task_id = "add_default_definition", 
    inputs = [df_preparation, default_column]
)

INFO: Executing task 'add_default_definition'...



Converting 'loan_status' to target column...
Removed 239071 rows with undefined 'loan_status' values.
Converted 'loan_status' to 'default' and set its data type to integer.
'loan_status' column has been removed from the DataFrame.


SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[default_column] = df[default_column].astype(int)
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.drop(columns=["loan_status"], inplace=True)


In [104]:
df_preparation = developer.execute_task(
    area_id="data_preparation",
    task_id="remove_features_missing_values", 
    inputs=[df_preparation, min_missing_percentage]
)

INFO: Executing task 'remove_features_missing_values'...



Analyzing missing values in the dataset...
Found 18 features with more than 80% missing values.
Dropping the following columns: mths_since_last_major_derog, annual_inc_joint, dti_joint, verification_status_joint, open_acc_6m, open_il_6m, open_il_12m, open_il_24m, mths_since_rcnt_il, total_bal_il, il_util, open_rv_12m, open_rv_24m, max_bal_bc, all_util, inq_fi, total_cu_tl, inq_last_12m
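
The filter applied here can be pictured as a per-column missing-value percentage compared against `min_missing_percentage`. The sketch below is a dependency-free illustration on plain Python lists; the helper name `high_missing_columns` and the toy data are assumptions, and the notebook's own `remove_features_missing_values` works on a DataFrame and may differ in detail.

```python
def high_missing_columns(columns, min_missing_percentage):
    """Return names of columns whose share of missing values exceeds the threshold.

    columns: dict mapping column name -> list of values, with None marking a missing value.
    """
    flagged = []
    for name, values in columns.items():
        pct_missing = 100.0 * sum(v is None for v in values) / len(values)
        if pct_missing > min_missing_percentage:  # strictly "more than", as in the log above
            flagged.append(name)
    return flagged

# Toy data: 90%, 80% and 0% missing respectively
toy_columns = {
    "dti_joint": [None] * 9 + [12.5],
    "emp_length": [None] * 8 + [3, 10],
    "int_rate": [7.9, 11.4, 13.1, 9.2, 15.6, 8.8, 12.0, 10.5, 14.3, 6.9],
}
to_drop = high_missing_columns(toy_columns, min_missing_percentage=80)
```

With pandas, the same percentages are usually computed as `df.isna().mean() * 100`.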


In [105]:
df_preparation = developer.execute_task(
    area_id="data_preparation",
    task_id="convert_term_column", 
    inputs=[df_preparation]
)

df_preparation = developer.execute_task(
    area_id="data_preparation",
    task_id="convert_emp_length_column", 
    inputs=[df_preparation]
)

df_preparation = developer.execute_task(
    area_id="data_preparation",
    task_id="convert_inq_last_6mths_column", 
    inputs=[df_preparation]
)

INFO: Executing task 'convert_term_column'...

INFO: Executing task 'convert_emp_length_column'...

INFO: Executing task 'convert_inq_last_6mths_column'...



In [106]:
df_preparation = developer.execute_task(
    area_id="data_preparation",
    task_id="remove_iqr_outliers", 
    inputs=[df_preparation, default_column, iqr_threshold],
    validation_tests=[
        "class_imbalance",
        "missing_values_bar_plot",
        "iqr_outliers_bar_plot"]
)

INFO: Executing task 'remove_iqr_outliers'...
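
An `iqr_threshold` of 1.5 corresponds to the usual Tukey fences: values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR are treated as outliers. The sketch below shows the idea for a single numeric column; the function name and sample values are hypothetical, and the notebook's `remove_iqr_outliers` (which also receives the target column) is not shown and may behave differently.

```python
import statistics

def remove_outliers_iqr(values, threshold=1.5):
    """Keep only values inside [Q1 - threshold * IQR, Q3 + threshold * IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - threshold * iqr, q3 + threshold * iqr
    return [v for v in values if lower <= v <= upper]

# Hypothetical incomes in thousands, with one extreme value
incomes_k = [10, 12, 11, 13, 12, 11, 14, 95]
kept = remove_outliers_iqr(incomes_k, threshold=1.5)
```

Quartile estimates depend on the interpolation method, so fences computed by another library can differ slightly on small samples.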



### Data Sampling

In [107]:
df_train, df_test = developer.execute_task(
    area_id="data_sampling",
    task_id="data_split", 
    inputs=[df_preparation, default_column],
)

INFO: Executing task 'data_split'...



Training data has 143111 rows and 23 columns.
Test data has 35778 rows and 23 columns.
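
The row counts above are consistent with an 80/20 split. Whether `data_split` stratifies on the target is not shown; the sketch below assumes that it does, so that the default rate is similar in both sets. The function name, seed, and toy rows are illustrative.

```python
import random
from collections import defaultdict

def stratified_split(rows, target_key, test_size=0.2, seed=42):
    """Split rows into train/test while preserving the target's class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[target_key]].append(row)

    train, test = [], []
    for label in sorted(by_class):
        group = by_class[label][:]
        rng.shuffle(group)                       # shuffle within each class
        n_test = round(len(group) * test_size)   # same share taken from every class
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test

# Toy data: 100 rows with a 20% default rate
rows = [{"id": i, "default": int(i % 5 == 0)} for i in range(100)]
train_rows, test_rows = stratified_split(rows, "default")
```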


### Exploratory Data Analysis

In [108]:
df_train = developer.execute_task(
    area_id="exploratory_data_analysis",
    task_id="drop_categories", 
    inputs=[df_train],
  
)

df_test = drop_categories(df_test)

INFO: Executing task 'drop_categories'...



Rows retained with purpose 'debt_consolidation' or 'credit_card': 111115
Rows after removing grades 'F' or 'G': 109833
Rows after removing sub_grades starting with 'F' or 'G': 109833
Rows after removing home_ownership values 'OTHER', 'NONE', or 'ANY': 109746
Total rows dropped: 33365
Rows retained with purpose 'debt_consolidation' or 'credit_card': 27847
Rows after removing grades 'F' or 'G': 27543
Rows after removing sub_grades starting with 'F' or 'G': 27543
Rows after removing home_ownership values 'OTHER', 'NONE', or 'ANY': 27523
Total rows dropped: 8255


In [109]:
df_train_eda = developer.execute_task(
    area_id="exploratory_data_analysis",
    task_id="drop_features", 
    inputs=[df_train, final_features_to_drop],
    validation_tests=[
        "high_cardinality",
        "tabular_numerical_histograms", 
        "tabular_categorical_bar_plots",
        "target_rate_bar_plots",
        "chi_squared_features_table", 
        "anova_one_way_table", 
        "pearson_correlation_matrix", 
        "feature_target_correlation_plot",
        "woe_bin_table",
        "woe_bin_table",   # with different parameters
        "woe_bin_plots"]
)

df_test_eda = drop_features(df_test, final_features_to_drop)

INFO: Executing task 'drop_features'...



Dropped 10 columns.
Columns remaining after dropping: 14
Dropped 10 columns.
Columns remaining after dropping: 14


### Feature Engineering

In [110]:
import pandas as pd  # used below; otherwise only available through the star imports

from validmind.vm_models.test_context import TestContext
from validmind.tests.data_validation.WOEBinTable import WOEBinTable

params = {
    "breaks_adj": {
        "int_rate": [5,10,15]}  
     }

vm_df = vm.init_dataset(dataset=df_train_eda, target_column=default_column)
test_context = TestContext(dataset=vm_df)

metric = WOEBinTable(test_context, params=params)
metric.run()
woe_dic = metric.result.metric.value['woe_iv']
woe_df = pd.DataFrame(woe_dic)

2023-08-14 14:32:57,983 - INFO(validmind.client): Pandas dataset detected. Initializing VM Dataset instance...


Running with breaks_adj: {'int_rate': [5, 10, 15]}
Performing binning with breaks_adj: {'int_rate': [5, 10, 15]}
[INFO] creating woe binning ...


 (ColumnNames: emp_length)


In [111]:
df_train_feateng = developer.execute_task(
    area_id="feature_engineering",
    task_id="convert_to_woe", 
    inputs=[df_train_eda, woe_df, default_column],
)     

df_test_feateng = convert_to_woe(df_test_eda, woe_df, default_column)

INFO: Executing task 'convert_to_woe'...



Converting 13 features to WoE values.
[INFO] converting into woe values ...


 (ColumnNames: emp_length)


Successfully converted features to WoE values.
Converting 13 features to WoE values.
[INFO] converting into woe values ...


 (ColumnNames: emp_length)


Successfully converted features to WoE values.
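
For readers unfamiliar with Weight of Evidence (WoE), the sketch below computes it by hand for one binned feature. It assumes the convention WoE = ln(share of defaults in the bin / share of non-defaults in the bin); sign conventions differ between libraries, and the bin labels and counts here are made up.

```python
import math
from collections import Counter

def woe_table(bins, targets):
    """Compute WoE and the Information Value contribution for each bin.

    bins: bin label per row; targets: 1 = default, 0 = non-default.
    Assumes every bin contains at least one default and one non-default.
    """
    bad = Counter(b for b, t in zip(bins, targets) if t == 1)
    good = Counter(b for b, t in zip(bins, targets) if t == 0)
    total_bad, total_good = sum(bad.values()), sum(good.values())

    table = {}
    for label in sorted(set(bins)):
        dist_bad = bad[label] / total_bad
        dist_good = good[label] / total_good
        woe = math.log(dist_bad / dist_good)
        table[label] = {"woe": woe, "iv": (dist_bad - dist_good) * woe}
    return table

# Hypothetical interest-rate bins: 1 default in 10 loans vs 4 defaults in 10 loans
bins = ["rate<10"] * 10 + ["rate>=10"] * 10
targets = [1] + [0] * 9 + [1] * 4 + [0] * 6
table = woe_table(bins, targets)
total_iv = sum(row["iv"] for row in table.values())
```

Replacing each raw value with the WoE of its bin is what puts all features on a common log-odds scale before the logistic regression is fitted.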


### Model Training

In [112]:
df_train_feateng = developer.execute_task(
    area_id="model_training",
    task_id="add_constant", 
    inputs=[df_train_feateng]
)

df_test_feateng = add_constant(df_test_feateng)

INFO: Executing task 'add_constant'...



Added constant to dataframe. Number of columns went from 14 to 15.
Added constant to dataframe. Number of columns went from 14 to 15.


In [113]:
model_fit_candidate = developer.execute_task(
    area_id="model_training",
    task_id="train_model", 
    inputs=[df_train_feateng, default_column]
)

print(model_fit_candidate.summary())

INFO: Executing task 'train_model'...



Training the model with 14 features and 109746 data points.
Model trained successfully.
                 Generalized Linear Model Regression Results                  
Dep. Variable:                default   No. Observations:               109746
Model:                            GLM   Df Residuals:                   109732
Model Family:                Binomial   Df Model:                           13
Link Function:                  Logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -47702.
Date:                Mon, 14 Aug 2023   Deviance:                       95403.
Time:                        14:33:08   Pearson chi2:                 1.10e+05
No. Iterations:                     5   Pseudo R-squ. (CS):            0.08209
Covariance Type:            nonrobust                                         
                              coef    std err          z      P>|z|      [0.025      0.975]
------------------------------

In [114]:
model_features_to_drop = []

df_train_feateng = developer.execute_task(
    area_id="model_training",
    task_id="drop_features", 
    inputs=[df_train_feateng, model_features_to_drop]
)

df_test_feateng = drop_features(df_test_feateng, model_features_to_drop)

INFO: Executing task 'drop_features'...



Dropped 0 columns.
Columns remaining after dropping: 15
Dropped 0 columns.
Columns remaining after dropping: 15


In [115]:
model_fit_final = developer.execute_task(
    area_id="model_training",
    task_id="train_model", 
    inputs=[df_train_feateng, default_column],
    validation_tests = ["regression_coeffs_plot", 
                        "regression_models_coeffs", 
                        "log_regression_confusion_matrix", 
                        "regression_roc_curve", "gini_table", 
                        "logistic_reg_prediction_histogram", 
                        "logistic_reg_cumulative_prob", 
                        "scorecard_histogram"]
)

print(model_fit_final.summary())

INFO: Executing task 'train_model'...



Training the model with 14 features and 109746 data points.
Model trained successfully.
                 Generalized Linear Model Regression Results                  
Dep. Variable:                default   No. Observations:               109746
Model:                            GLM   Df Residuals:                   109732
Model Family:                Binomial   Df Model:                           13
Link Function:                  Logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -47702.
Date:                Mon, 14 Aug 2023   Deviance:                       95403.
Time:                        14:33:08   Pearson chi2:                 1.10e+05
No. Iterations:                     5   Pseudo R-squ. (CS):            0.08209
Covariance Type:            nonrobust                                         
                              coef    std err          z      P>|z|      [0.025      0.975]
------------------------------
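
The summary above describes a binomial GLM with a logit link, fitted on WoE-transformed features. A fitted model of that kind can be turned into scorecard points roughly as sketched below. Because the coefficient rows are not shown in the output above, the intercept, coefficients, and WoE values here are hypothetical, as are the scaling anchors (600 points at 50:1 odds, 20 points to double the odds).

```python
import math

def points_per_feature(intercept, coefs, woe_values, factor, offset):
    """Split the total score evenly across features.

    Assumes the model predicts the log-odds of default, so a higher logit means a lower score:
    total score = offset - factor * (intercept + sum(coef * woe)).
    """
    n = len(coefs)
    return {
        name: offset / n - factor * (coefs[name] * woe_values[name] + intercept / n)
        for name in coefs
    }

# Illustrative scaling constants (assumed, not from the notebook)
factor = 20 / math.log(2)
offset = 600 - factor * math.log(50)

# Hypothetical fitted values for a single borrower
intercept = -1.6
coefs = {"int_rate": 0.9, "annual_inc": 0.7}
woe_values = {"int_rate": 0.35, "annual_inc": -0.20}

points = points_per_feature(intercept, coefs, woe_values, factor, offset)
total_score = sum(points.values())

logit = intercept + sum(coefs[k] * woe_values[k] for k in coefs)
implied_pd = 1 / (1 + math.exp(-logit))
```

Spreading the intercept and offset evenly over the features is one common choice; it keeps every characteristic's points directly comparable.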

## Validation Plan

In [116]:
df_validation = developer.show_validation_plan()
display(HTML(df_validation.to_html(escape=False)))

| | Area ID | Task ID | Input | Output | Validation Tests |
|---|---|---|---|---|---|
| 0 | data_description | import_raw_data | lending_club_url | df_raw | descriptive_statistics, missing_values_bar_plot, iqr_outliers_table |
| 1 | data_preparation | drop_features | df_raw, preliminary_features_to_drop | df_preparation | none |
| 2 | data_preparation | add_default_definition | df_preparation, default_column | df_preparation | none |
| 3 | data_preparation | remove_features_missing_values | df_preparation, min_missing_percentage | df_preparation | none |
| 4 | data_preparation | convert_term_column | df_preparation | df_preparation | none |
| 5 | data_preparation | convert_emp_length_column | df_preparation | df_preparation | none |
| 6 | data_preparation | convert_inq_last_6mths_column | df_preparation | df_preparation | none |
| 7 | data_preparation | remove_iqr_outliers | df_preparation, default_column, iqr_threshold | df_preparation | class_imbalance, missing_values_bar_plot, iqr_outliers_bar_plot |
| 8 | data_sampling | data_split | df_preparation, default_column | df_train, df_test | none |
| 9 | exploratory_data_analysis | drop_categories | df_train | df_train | none |


## Save Datasets and Models

In [117]:
objects_to_store = {
    "df_raw": df_raw,
    "df_preparation": df_preparation,
    "df_train_eda": df_train_eda,
    "df_train_feateng": df_train_feateng,
    "df_test_feateng": df_test_feateng,
    "model_fit_final": model_fit_final,
    "df_validation": df_validation,
}

developer.save_objects_to_pickle(
    filename="datasets/scorecard_data_and_models.pkl", 
    objects_to_save=objects_to_store)

INFO: Saved 7 objects to datasets/scorecard_data_and_models.pkl


['df_raw',
 'df_preparation',
 'df_train_eda',
 'df_train_feateng',
 'df_test_feateng',
 'model_fit_final',
 'df_validation']