# Probability of Default Model PoC

## Introduction

## Step 1: Connect Notebook to ValidMind Project
This step prepares the environment for the analysis. First, **import** the required libraries and modules. Next, **connect** to the ValidMind MRM platform, which provides tools and services for model validation.

Finally, define and **configure** the use case by setting the parameters, data sources, and other settings used throughout the analysis.

#### Import Libraries

In [212]:
# Load API key and secret from environment variables
%load_ext dotenv
%dotenv .env

import zipfile
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, RepeatedStratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score, confusion_matrix, precision_recall_curve, auc
from sklearn.feature_selection import f_classif
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin
from scipy.stats import chi2_contingency
%matplotlib inline

The dotenv extension is already loaded. To reload it, use:
  %reload_ext dotenv


#### Connect Notebook to ValidMind Project

In [213]:

import validmind as vm

vm.init(
  api_host = "http://localhost:3000/api/v1/tracking",
  api_key = "2494c3838f48efe590d531bfe225d90b",
  api_secret = "4f692f8161f128414fef542cab2a4e74834c75d01b3a8e088a1834f2afcfe838",
  project = "clibjj9cl00056qy6tz2hkc6l"
)
  

2023-06-06 13:59:54,964 - INFO - api_client - Connected to ValidMind. Project: PD Model - Initial Validation (clibjj9cl00056qy6tz2hkc6l)
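The import cell loads a `.env` file, so the connection settings could be read from environment variables rather than written into the notebook. The sketch below shows one way to do that. The variable names (`VM_API_HOST`, `VM_API_KEY`, `VM_API_SECRET`, `VM_API_PROJECT`) and the helper `load_vm_settings` are hypothetical and not part of the ValidMind library; only the keyword names passed to `vm.init` are taken from the cell above.

```python
import os

# Hypothetical environment variable names; use whatever names your .env file defines.
ENV_TO_ARG = {
    "VM_API_HOST": "api_host",
    "VM_API_KEY": "api_key",
    "VM_API_SECRET": "api_secret",
    "VM_API_PROJECT": "project",
}


def load_vm_settings(env=None):
    """Build the keyword arguments for vm.init() from environment variables."""
    env = os.environ if env is None else env
    missing = [name for name in ENV_TO_ARG if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {arg: env[name] for name, arg in ENV_TO_ARG.items()}


# Demonstration with a stand-in mapping instead of the real environment.
example_env = {
    "VM_API_HOST": "http://localhost:3000/api/v1/tracking",
    "VM_API_KEY": "example-key",
    "VM_API_SECRET": "example-secret",
    "VM_API_PROJECT": "example-project",
}
settings = load_vm_settings(example_env)
print(sorted(settings))

# In the notebook this would become: vm.init(**load_vm_settings())
```

Failing early on a missing variable keeps a misconfigured environment from surfacing later as an authentication error.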


#### Explore Test Suites, Test Plans and Tests

In [173]:
vm.test_suites.list_suites()

ID,Name,Description,Test Plans
binary_classifier_full_suite,BinaryClassifierFullSuite,Full test suite for binary classification models.,"tabular_dataset_description, tabular_data_quality, binary_classifier_metrics, binary_classifier_validation, binary_classifier_model_diagnosis"
binary_classifier_model_validation,BinaryClassifierModelValidation,Test suite for binary classification models.,"binary_classifier_metrics, binary_classifier_validation, binary_classifier_model_diagnosis"
tabular_dataset,TabularDataset,Test suite for tabular datasets.,"tabular_dataset_description, tabular_data_quality"
time_series_dataset,TimeSeriesDataset,Test suite for time series datasets.,"time_series_data_quality, time_series_univariate, time_series_multivariate"
time_series_model_validation,TimeSeriesModelValidation,Test suite for time series model validation.,"regression_model_description, regression_models_evaluation, time_series_forecast, time_series_sensitivity"


In [174]:
vm.test_plans.list_plans()

ID,Name,Description
binary_classifier_metrics,BinaryClassifierMetrics,Test plan for sklearn classifier metrics
binary_classifier_validation,BinaryClassifierPerformance,Test plan for sklearn classifier models
binary_classifier_model_diagnosis,BinaryClassifierDiagnosis,Test plan for sklearn classifier model diagnosis tests
tabular_dataset_description,TabularDatasetDescription,Test plan to extract metadata and descriptive  statistics from a tabular dataset
tabular_data_quality,TabularDataQuality,Test plan for data quality on tabular datasets
time_series_data_quality,TimeSeriesDataQuality,Test plan for data quality on time series datasets
time_series_univariate,TimeSeriesUnivariate,Test plan to perform time series univariate analysis.
time_series_multivariate,TimeSeriesMultivariate,Test plan to perform time series multivariate analysis.
time_series_forecast,TimeSeriesForecast,Test plan to perform time series forecast tests.
time_series_sensitivity,TimeSeriesSensitivity,Test plan to perform time series forecast tests.


In [175]:
vm.test_plans.list_tests()

Test Type,ID,Name,Description
Metric,acf_pacf_plot,ACFandPACFPlot,Plots ACF and PACF for a given time series dataset.
Metric,auto_ar,AutoAR,Automatically detects the AR order of a time series using both BIC and AIC.
Metric,auto_ma,AutoMA,Automatically detects the MA order of a time series using both BIC and AIC.
Metric,auto_seasonality,AutoSeasonality,Automatically detects the optimal seasonal order for a time series dataset  using the seasonal_decompose method.
Metric,auto_stationarity,AutoStationarity,Automatically detects stationarity for each time series in a DataFrame  using the Augmented Dickey-Fuller (ADF) test.
Metric,classifier_in_sample_performance,ClassifierInSamplePerformance,Test that outputs the performance of the model on the training data.
Metric,classifier_out_of_sample_performance,ClassifierOutOfSamplePerformance,Test that outputs the performance of the model on the test data.
Metric,confusion_matrix,ConfusionMatrix,Confusion Matrix
Metric,dataset_correlations,DatasetCorrelations,Extracts the correlation matrix for a dataset. The following coefficients  are calculated:  - Pearson's R for numerical variables  - Cramer's V for categorical variables  - Correlation ratios for categorical-numerical variables
Metric,dataset_description,DatasetDescription,Collects a set of descriptive statistics for a dataset


## Step 2: Import Raw Data

#### Import Lending Club Dataset

In [216]:
# Specify the path to the CSV file
filepath = '/Users/juanvalidmind/Dev/datasets/lending club/data_2007_2014/loan_data_2007_2014.csv'
df = pd.read_csv(filepath)

# Perform operations on the DataFrame as needed
print(df.head())

Columns (19) have mixed types. Specify dtype option on import or set low_memory=False.


        id  member_id  loan_amnt  funded_amnt  funded_amnt_inv        term  \
0  1077501    1296599       5000         5000           4975.0   36 months   
1  1077430    1314167       2500         2500           2500.0   60 months   
2  1077175    1313524       2400         2400           2400.0   36 months   
3  1076863    1277178      10000        10000          10000.0   36 months   
4  1075358    1311748       3000         3000           3000.0   60 months   

   int_rate  installment grade sub_grade  ... total_bal_il il_util  \
0     10.65       162.87     B        B2  ...          NaN     NaN   
1     15.27        59.83     C        C4  ...          NaN     NaN   
2     15.96        84.33     C        C5  ...          NaN     NaN   
3     13.49       339.31     C        C1  ...          NaN     NaN   
4     12.69        67.79     B        B5  ...          NaN     NaN   

  open_rv_12m  open_rv_24m max_bal_bc all_util total_rev_hi_lim inq_fi  \
0         NaN          NaN        Na
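The mixed-types warning above suggests two remedies: passing `low_memory=False` or specifying `dtype` on import. A minimal sketch of both on a toy CSV follows. The toy data and the choice of `desc` as the mixed-type column are assumptions for illustration; the warning only identifies the real column by position (19).

```python
import io

import pandas as pd

# Toy CSV standing in for the Lending Club file (illustrative values only).
csv_text = "id,loan_amnt,desc\n1,5000,\n2,2500,some text\n3,2400,42\n"

# low_memory=False reads the file in a single pass, so each column gets one dtype.
# An explicit dtype for the ambiguous column removes the type guessing altogether.
toy_df = pd.read_csv(io.StringIO(csv_text), low_memory=False, dtype={"desc": str})

print(toy_df.dtypes)
print(toy_df)
```

With the explicit `dtype`, the value `42` in `desc` is kept as text rather than being parsed as a number.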

## Step 3: Describe Raw Data

The **Lending Club dataset** is a well-known dataset in the data science community. It originates from **Lending Club Corporation**, an American **peer-to-peer lending** company, which was the first peer-to-peer lender to register its offerings as securities with the Securities and Exchange Commission (SEC).

The dataset typically contains complete loan data for all loans issued, including the current loan status (Current, Late, Fully Paid, etc.) and the latest payment information. The original file is a matrix of about **890 thousand observations** and **75 variables**. However, this specific version of the dataset has been trimmed down to **466285 rows** and **52 columns**.

A brief description of some of the features:

1. `id`, `member_id`: A unique LC assigned ID for the loan listing and the borrower respectively.
2. `loan_amnt`: The listed amount of the loan applied for by the borrower.
3. `funded_amnt`: The total amount committed to that loan at that point in time.
4. `term`: The number of payments on the loan. Values are in months and can be either 36 or 60.
5. `int_rate`: Interest Rate on the loan.
6. `grade`, `sub_grade`: LC assigned loan grade and subgrade.
7. `emp_length`: Employment length in years.
8. `home_ownership`: The home ownership status provided by the borrower during registration. 
9. `annual_inc`: The self-reported annual income provided by the borrower during registration.
10. `loan_status`: Current status of the loan.

These and the remaining features provide a robust set of data for a variety of tasks such as risk modelling, credit analysis, and even socioeconomic studies.

### Summary of Raw Dataset

In [214]:
import pandas as pd
import numpy as np

def data_summary(df):
    # Initialize an empty dataframe to store the summary
    summary = pd.DataFrame()

    # Calculate the different metrics
    summary["Variable"] = df.columns
    summary["Number of Missing Values"] = df.isnull().sum().values
    summary["Number of Not Missing Values"] = df.shape[0] - summary["Number of Missing Values"]
    summary["Data Type"] = df.dtypes.values
    summary["Variable Type"] = ['Categorical' if str(x) == 'object' else 'Numerical' for x in df.dtypes.values]
    
    # Initialize lists to store min, mean, and max
    min_values = []
    mean_values = []
    max_values = []

    # Loop over all columns
    for col in df.columns:
        if str(df[col].dtype) == 'object':
            # If column is categorical, append 'N/A'
            min_values.append('N/A')
            mean_values.append('N/A')
            max_values.append('N/A')
        else:
            # If column is numerical, calculate min, mean, and max
            min_values.append(df[col].min())
            mean_values.append(df[col].mean())
            max_values.append(df[col].max())

    # Add the min, mean, and max values to the dataframe
    summary["Min Value"] = min_values
    summary["Mean Value"] = mean_values
    summary["Max Value"] = max_values

    return summary


In [215]:
summary = data_summary(df)  
pd.set_option('display.max_rows', None)
display(summary)


Unnamed: 0,Variable,Number of Missing Values,Number of Not Missing Values,Data Type,Variable Type,Min Value,Mean Value,Max Value
0,loan_amnt,0,466285,int64,Numerical,500,14317.277577,35000
1,funded_amnt,0,466285,int64,Numerical,500,14291.801044,35000
2,funded_amnt_inv,0,466285,float64,Numerical,0.0,14222.329888,35000.0
3,term,0,466285,int64,Numerical,36,42.605334,60
4,int_rate,0,466285,float64,Numerical,5.42,13.829236,26.06
5,installment,0,466285,float64,Numerical,15.67,432.061201,1409.99
6,grade,0,466285,object,Categorical,,,
7,sub_grade,0,466285,object,Categorical,,,
8,emp_length,21008,445277,object,Categorical,,,
9,home_ownership,0,466285,object,Categorical,,,


#### Summary of Categorical Variables

In [205]:
def categorical_summary(df):
    # Filter out categorical variables
    categorical_vars = df.select_dtypes(include='object')

    # Initialize an empty dataframe to store the summary
    summary = pd.DataFrame()

    # Calculate the different metrics
    summary["Categorical Variable"] = categorical_vars.columns
    summary["Num of Obs"] = categorical_vars.count().values
    summary["Num of Unique Values"] = categorical_vars.nunique().values
    summary["Unique Values"] = [df[col].unique() for col in categorical_vars.columns]
    summary["Missing Values (%)"] = (df[categorical_vars.columns].isnull().mean().values * 100).astype(int)

    # Sort by the percentage of missing values
    summary = summary.sort_values(by="Missing Values (%)", ascending=False)

    return summary

In [206]:
cat_summary = categorical_summary(df)
display(cat_summary)


Unnamed: 0,Categorical Variable,Num of Obs,Num of Unique Values,Unique Values,Missing Values (%)
2,emp_length,445277,11,"[10+ years, < 1 year, 1 year, 3 years, 8 years...",4
0,grade,466285,7,"[B, C, A, E, F, D, G]",0
1,sub_grade,466285,35,"[B2, C4, C5, C1, B5, A4, E1, F2, C3, B1, D1, A...",0
3,home_ownership,466285,6,"[RENT, OWN, MORTGAGE, OTHER, NONE, ANY]",0
4,verification_status,466285,3,"[Verified, Source Verified, Not Verified]",0
5,loan_status,466285,9,"[Fully Paid, Charged Off, Current, Default, La...",0
6,pymnt_plan,466285,2,"[n, y]",0
7,purpose,466285,14,"[credit_card, car, small_business, other, wedd...",0


#### Summary of Numerical Variables

In [203]:
def numerical_summary(df):
    # Filter out numerical variables
    numerical_vars = df.select_dtypes(include=['int64', 'float64'])

    # Initialize an empty dataframe to store the summary
    summary = pd.DataFrame()

    # Calculate the different metrics
    summary["Numerical Variable"] = numerical_vars.columns
    summary["Data Type"] = numerical_vars.dtypes.values
    summary["Num of Obs"] = numerical_vars.count().values
    summary["Min"] = numerical_vars.min().values
    summary["Mean"] = numerical_vars.mean().values
    summary["Max"] = numerical_vars.max().values
    summary["Missing Values (%)"] = (df[numerical_vars.columns].isnull().mean().values * 100).astype(int)

    # Sort by the percentage of missing values
    summary = summary.sort_values(by="Missing Values (%)", ascending=False)

    return summary



In [204]:
num_summary = numerical_summary(df)
display(num_summary)

Unnamed: 0,Numerical Variable,Data Type,Num of Obs,Min,Mean,Max,Missing Values (%)
10,mths_since_last_delinq,float64,215934,0.0,34.10443,188.0,53
21,total_rev_hi_lim,float64,396009,0.0,30379.087771,9999999.0,15
20,tot_cur_bal,float64,396009,0.0,138801.713385,8000078.0,15
19,tot_coll_amt,float64,396009,0.0,191.913517,9152545.0,15
12,pub_rec,float64,466256,0.0,0.160564,63.0,0
18,acc_now_delinq,float64,466256,0.0,0.004002,5.0,0
17,policy_code,int64,466285,1.0,1.0,1.0,0
16,collections_12_mths_ex_med,float64,466140,0.0,0.009085,20.0,0
15,total_acc,float64,466256,1.0,25.06443,156.0,0
14,revol_util,float64,465945,0.0,56.176947,892.3,0


### Variables Excluded

**Irrelevant Variables**

In [183]:
irrelevant_vars =  ['title', 'application_type', 'emp_title', 'zip_code', 'addr_state', 'earliest_cr_line', 'initial_list_status', 'last_pymnt_d', 'last_credit_pull_d']

**ID Related Variables**

In [184]:
id_vars = ['id', 'member_id', 'url']

**Variables with Large Missing Values**

In [185]:
import pandas as pd

def identify_missing_values(df, threshold=0.8):
    """
    Identify variables with a large number of missing values.
    Args:
        df (DataFrame): The DataFrame to check for missing values.
        threshold (float, optional): The proportion threshold for considering a variable having a large number of missing values. Default is 0.8.

    Returns:
        DataFrame with variables, their missing values count, threshold, percentage of missing values and pass/fail status.
    """
    missing_values = df.isnull().sum()
    missing_values_ratio = missing_values / len(df)
    data = {
        'Variable': missing_values.index,
        'Missing Values': missing_values.values,
        'Threshold': threshold,
        'Pct Missing Values': missing_values_ratio.values * 100,
        'Pass/Fail': missing_values_ratio.values > threshold
    }
    summary_df = pd.DataFrame(data)
    summary_df['Pass/Fail'] = summary_df['Pass/Fail'].map({True: 'Fail', False: 'Pass'})
    
    # Sorting the DataFrame by 'Pct Missing Values' column in descending order
    summary_df.sort_values(by='Pct Missing Values', ascending=False, inplace=True)
    return summary_df

In [186]:
summary = identify_missing_values(df, 0.7)
display(summary)

Unnamed: 0,Variable,Missing Values,Threshold,Pct Missing Values,Pass/Fail
73,inq_last_12m,466285,0.7,100.0,Fail
55,verification_status_joint,466285,0.7,100.0,Fail
59,open_acc_6m,466285,0.7,100.0,Fail
60,open_il_6m,466285,0.7,100.0,Fail
61,open_il_12m,466285,0.7,100.0,Fail
62,open_il_24m,466285,0.7,100.0,Fail
63,mths_since_rcnt_il,466285,0.7,100.0,Fail
54,dti_joint,466285,0.7,100.0,Fail
64,total_bal_il,466285,0.7,100.0,Fail
53,annual_inc_joint,466285,0.7,100.0,Fail


In [187]:
def get_failed_variables(summary_df):
    """
    Get a list of variables that failed the missing values threshold test.
    
    Args:
        summary_df (DataFrame): The summary DataFrame outputted by identify_missing_values function.

    Returns:
        List of variable names that have failed the missing values threshold test.
    """
    failed_variables = summary_df[summary_df['Pass/Fail'] == 'Fail']['Variable']
    return failed_variables.tolist()

In [188]:
large_missing_vars = get_failed_variables(summary)
print(large_missing_vars)

['inq_last_12m', 'verification_status_joint', 'open_acc_6m', 'open_il_6m', 'open_il_12m', 'open_il_24m', 'mths_since_rcnt_il', 'dti_joint', 'total_bal_il', 'annual_inc_joint', 'il_util', 'open_rv_12m', 'open_rv_24m', 'max_bal_bc', 'all_util', 'inq_fi', 'total_cu_tl', 'mths_since_last_record', 'mths_since_last_major_derog', 'desc']


**Post Default Variables**

In [189]:
post_default_vars = ['out_prncp', 'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int', 'total_rec_late_fee', 'recoveries', 'collection_recovery_fee', 'last_pymnt_amnt']

**Drop Excluded Variables**

In [190]:
def drop_columns(df, excluded_vars):
    # Get the intersection of excluded_vars and df's columns
    to_drop = set(excluded_vars) & set(df.columns)
    # Drop the columns and return the result
    return df.drop(columns=list(to_drop))

In [191]:
excluded_vars = irrelevant_vars + id_vars + large_missing_vars + post_default_vars
df = drop_columns(df, excluded_vars)

#### Summary of Variables after Exclusion

In [192]:
num_summary = numerical_summary(df)
display(num_summary)

Unnamed: 0,Numerical Variable,Data Type,Min,Mean,Max,Missing Values (%)
9,mths_since_last_delinq,float64,0.0,34.10443,188.0,53
20,total_rev_hi_lim,float64,0.0,30379.087771,9999999.0,15
19,tot_cur_bal,float64,0.0,138801.713385,8000078.0,15
18,tot_coll_amt,float64,0.0,191.913517,9152545.0,15
11,pub_rec,float64,0.0,0.160564,63.0,0
17,acc_now_delinq,float64,0.0,0.004002,5.0,0
16,policy_code,int64,1.0,1.0,1.0,0
15,collections_12_mths_ex_med,float64,0.0,0.009085,20.0,0
14,total_acc,float64,1.0,25.06443,156.0,0
13,revol_util,float64,0.0,56.176947,892.3,0


In [193]:
cat_summary = categorical_summary(df)
display(cat_summary)

Unnamed: 0,Categorical Variable,Num of Unique Values,Unique Values,Missing Values (%)
10,next_pymnt_d,100,"[nan, Feb-16, Jan-16, Sep-13, Feb-14, May-14, ...",48
3,emp_length,11,"[10+ years, < 1 year, 1 year, 3 years, 8 years...",4
0,term,2,"[ 36 months, 60 months]",0
1,grade,7,"[B, C, A, E, F, D, G]",0
2,sub_grade,35,"[B2, C4, C5, C1, B5, A4, E1, F2, C3, B1, D1, A...",0
4,home_ownership,6,"[RENT, OWN, MORTGAGE, OTHER, NONE, ANY]",0
5,verification_status,3,"[Verified, Source Verified, Not Verified]",0
6,issue_d,91,"[Dec-11, Nov-11, Oct-11, Sep-11, Aug-11, Jul-1...",0
7,loan_status,9,"[Fully Paid, Charged Off, Current, Default, La...",0
8,pymnt_plan,2,"[n, y]",0


### Format Dates

For each date column: convert it to datetime format, create a new column holding the difference between the model development date and that date, and then drop the original column.
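The three steps can be sketched on a toy frame as follows. This is a minimal illustration, not the notebook's final implementation: the helper `months_since` and the reference date `2015-01-01` are hypothetical, since the model development date is not stated here.

```python
import pandas as pd


def months_since(dates: pd.Series, reference_date: str) -> pd.Series:
    """Whole calendar months between each date and the reference date."""
    ref = pd.Timestamp(reference_date)
    return (ref.year - dates.dt.year) * 12 + (ref.month - dates.dt.month)


# Toy data in the same "Mon-yy" format as issue_d.
toy = pd.DataFrame({"issue_d": ["Dec-11", "Nov-11", "Jan-14"]})

# 1. Convert to datetime.
toy["issue_d"] = pd.to_datetime(toy["issue_d"], format="%b-%y")
# 2. Create the difference column (the reference date is an assumed placeholder).
toy["mths_since_issue_d"] = months_since(toy["issue_d"], "2015-01-01")
# 3. Drop the original date column.
toy = toy.drop(columns=["issue_d"])

print(toy)
```

Counting calendar months avoids dividing by an average month length, which would need rounding.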

In [194]:
def date_summary(df):
    # Filter out date variables
    date_vars = df.select_dtypes(include=['datetime64[ns]'])

    # Initialize an empty dataframe to store the summary
    summary = pd.DataFrame()

    # Calculate the different metrics
    summary["Variable"] = date_vars.columns
    summary["Data Type"] = date_vars.dtypes.values
    summary["Frequency"] = [df[col].asfreq('D').index.inferred_freq for col in date_vars.columns]
    summary["Min Date"] = date_vars.min().values
    summary["Max Date"] = date_vars.max().values
    summary["Missing Values (%)"] = (df[date_vars.columns].isnull().mean().values * 100).astype(int)

    # Sort by the percentage of missing values
    summary = summary.sort_values(by="Missing Values (%)", ascending=False)

    return summary

In [195]:
summary = date_summary(df)
display(summary)

Unnamed: 0,Variable,Data Type,Frequency,Min Date,Max Date,Missing Values (%)


In [196]:
def convert_to_datetime(df, column, date_format="%b-%y"):
    # Convert column to datetime format
    df[column] = pd.to_datetime(df[column], format=date_format)

In [197]:
# Define date columns
date_columns = ['next_pymnt_d', 'issue_d']

# Convert columns to datetime format
for col in date_columns:
    convert_to_datetime(df, col)

In [198]:
summary = date_summary(df)
display(summary)

Unnamed: 0,Variable,Data Type,Frequency,Min Date,Max Date,Missing Values (%)
1,next_pymnt_d,datetime64[ns],,2007-12-01,2016-03-01,48
0,issue_d,datetime64[ns],,2007-06-01,2014-12-01,0


### Correct Data Type

## Step 4: Univariate Analysis

### Histograms of Numerical Variables

### Bar Plots of Categorical Variables

## Step 4: Define Target Variable

#### Definition of Default

The definition of default for regulatory PD models follows the guidelines set by the **Basel Committee on Banking Supervision**.

As per the Basel II guidelines, a default is considered to have occurred with regard to a particular obligor when either or both of the following events have taken place:

1. The bank considers that the obligor is **unlikely to pay** its credit obligations to the banking group in full, without recourse by the bank to actions such as realizing security (if held).

2. The obligor is **past due more than 90 days** on any material credit obligation to the banking group. 

Basel II lists the obligor filing for **bankruptcy** or similar protection from creditors as one indication of unlikeliness to pay.
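To apply this definition to the dataset, `loan_status` has to be mapped to a binary target. The sketch below shows one possible mapping. Which statuses count as default is an assumption made here for illustration, not a decision stated in this notebook; the column name `default_flag` and the helper `flag_default` are hypothetical; and `Late (31-120 days)` is an assumed status label that only approximates the 90-days-past-due criterion.

```python
import pandas as pd

# Assumed set of statuses treated as default (illustrative, to be agreed with model owners).
DEFAULT_STATUSES = {"Charged Off", "Default", "Late (31-120 days)"}


def flag_default(loan_status: pd.Series) -> pd.Series:
    """Return 1 where the loan status is treated as a default, else 0."""
    return loan_status.isin(DEFAULT_STATUSES).astype(int)


toy = pd.DataFrame(
    {"loan_status": ["Fully Paid", "Charged Off", "Current", "Default", "Late (31-120 days)"]}
)
toy["default_flag"] = flag_default(toy["loan_status"])
print(toy)
```

Keeping the status set in one named constant makes the definition of default easy to review and change.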

## Step 3: Run Data Validation Test Suite on Raw Data

#### Explore the <...> Test Suite

#### Explore Test Plans

##### Connect Raw Dataset to ValidMind Platform

In [None]:
vm_dataset = vm.init_dataset(
    dataset=df,
    target_column=[''],
)

## 3. Data Collection

## 4. Data Description

The Lending Club dataset is a well-known dataset in the data science community. It originates from Lending Club Corporation, an American peer-to-peer lending company, which was the first peer-to-peer lender to register its offerings as securities with the Securities and Exchange Commission (SEC). 

The dataset typically contains complete loan data for all loans issued, including the current loan status (Current, Late, Fully Paid, etc.) and the latest payment information. The file is a matrix of about 890 thousand observations and 75 variables. However, this specific version of the dataset appears to have been trimmed down to 466285 entries and 52 features or columns.

A brief description of some of the features:

1. `id`, `member_id`: A unique LC assigned ID for the loan listing and the borrower respectively.
2. `loan_amnt`: The listed amount of the loan applied for by the borrower.
3. `funded_amnt`: The total amount committed to that loan at that point in time.
4. `term`: The number of payments on the loan. Values are in months and can be either 36 or 60.
5. `int_rate`: Interest Rate on the loan.
6. `grade`, `sub_grade`: LC assigned loan grade and subgrade.
7. `emp_length`: Employment length in years.
8. `home_ownership`: The home ownership status provided by the borrower during registration. 
9. `annual_inc`: The self-reported annual income provided by the borrower during registration.
10. `loan_status`: Current status of the loan.

These and the remaining features provide a robust set of data for a variety of tasks such as risk modelling, credit analysis, and even socioeconomic studies.

In [None]:
df.info()

## 5. Data Preprocessing

#### Format Dates

For each date column: convert it to datetime format, create a new column holding the difference between the model development date and that date, and then drop the original column.

In [None]:
def date_columns(df, column):
    '''
    function to convert a date column to datetime format and
    create a new column as a difference in months between a fixed reference date and the respective date
    '''
    # fixed reference date used as "today"
    today_date = pd.to_datetime('2020-08-01')
    # convert to datetime format
    df[column] = pd.to_datetime(df[column], format = "%b-%y")
    # calculate the difference in whole calendar months and add to a new column
    # (dividing by np.timedelta64(1, 'M') is avoided: month-unit timedeltas are
    # ambiguous and may be rejected by recent pandas versions)
    new_column = 'mths_since_' + column
    df[new_column] = ((today_date.year - df[column].dt.year) * 12
                      + (today_date.month - df[column].dt.month))
    # replace negative values (dates later than the reference date, e.g. two-digit
    # years parsed as 20xx) with the column maximum
    df[new_column] = df[new_column].where(~(df[new_column] < 0), df[new_column].max())
    # drop the original date column
    df.drop(columns = [column], inplace = True)

date_columns(df, 'earliest_cr_line')
date_columns(df, 'issue_d')
date_columns(df, 'last_pymnt_d')
date_columns(df, 'last_credit_pull_d')

#### Format Variable Values

Remove text from the `emp_length` column (e.g., years) and convert it to numeric.

In [None]:
# function to clean up the emp_length column, assign 0 to NaNs, and convert to numeric
def emp_length_converter(df, column):
    # regex=False treats each pattern as a literal string, so '+' needs no escaping
    df[column] = df[column].str.replace('+ years', '', regex=False)
    df[column] = df[column].str.replace('< 1 year', str(0), regex=False)
    df[column] = df[column].str.replace(' years', '', regex=False)
    df[column] = df[column].str.replace(' year', '', regex=False)
    # assign the result back instead of calling fillna(inplace=True) on a column selection
    df[column] = pd.to_numeric(df[column]).fillna(0)

emp_length_converter(df, 'emp_length')

Remove text from the `term` column and convert it to numeric.

In [None]:
# function to remove 'months' string from the 'term' column and convert it to numeric
def loan_term_converter(df, column):
    df[column] = pd.to_numeric(df[column].str.replace(' months', ''))

loan_term_converter(df, 'term')

#### Handle Missing Values

In [None]:
# drop columns with more than 80% null values
df.dropna(thresh = df.shape[0]*0.2, axis = 1, inplace = True)
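The `thresh` argument of `dropna` is the minimum number of non-null values a column needs in order to be kept, which is why a cutoff of 80% nulls is written as 20% of the row count. A small sketch on toy data follows; the frame and column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with 10 rows: "a" is complete, "b" is entirely null, "c" is 80% null.
toy = pd.DataFrame(
    {
        "a": range(10),
        "b": [np.nan] * 10,
        "c": [1.0, 2.0] + [np.nan] * 8,
    }
)

# Keep only columns with at least 20% non-null values (here: at least 2 of 10).
min_non_null = int(len(toy) * 0.2)
kept = toy.dropna(thresh=min_non_null, axis=1)

print(list(kept.columns))
```

A column that is exactly 80% null is kept; only columns with more than 80% nulls are dropped.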


In [None]:
df.info()

#### Handle Outliers

#### Correct Data Types

#### Convert Categorical Variables

## 6. Univariate Analysis

## 4.5. Training Data

### 4.5.1. Sampling

Splitting our data before any feature engineering prevents any data leakage from the test set to the training set and results in more accurate model evaluation.

#### Sampling Method

Split the data 80/20 into training and test sets, stratifying on the target so that the proportion of bad loans in each set matches that of the pre-split dataset.

In [None]:
X = loan_data.drop('good_bad', axis = 1)
y = loan_data['good_bad']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, 
                                                    random_state = 42, stratify = y)
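A quick way to confirm that stratification worked is to compare the target rate in each split with the rate before splitting. The sketch below does this on synthetic data; the feature values and the roughly 10% rate of ones are assumptions for illustration and do not reflect the Lending Club data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the loan data: one feature and a binary target.
rng = np.random.default_rng(42)
toy_X = pd.DataFrame({"loan_amnt": rng.integers(500, 35000, size=1000)})
toy_y = pd.Series((rng.random(1000) < 0.1).astype(int), name="good_bad")

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

# With stratify, the class proportions should be nearly identical across the splits.
print(f"overall: {toy_y.mean():.4f}  train: {y_tr.mean():.4f}  test: {y_te.mean():.4f}")
```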

## 4.6. Feature Engineering

In [None]:
# first divide training data into categorical and numerical subsets
X_train_cat = X_train.select_dtypes(include = 'object').copy()
X_train_num = X_train.select_dtypes(include = 'number').copy()


### 4.5.1. Missing Values

In [None]:
# since f_classif does not accept missing values, we will do a very crude imputation of missing values
X_train_num.fillna(X_train_num.mean(), inplace = True)
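When the same imputation is later applied to the test set, the fill values should come from the training data, so that no information leaks from the test set. A minimal sketch with toy frames follows; the column names and values are illustrative only.

```python
import numpy as np
import pandas as pd

train = pd.DataFrame(
    {"annual_inc": [50000.0, np.nan, 70000.0], "dti": [10.0, 20.0, np.nan]}
)
test = pd.DataFrame({"annual_inc": [np.nan, 40000.0], "dti": [np.nan, 5.0]})

# Compute the fill values on the training set only.
train_means = train.mean()

# Reuse the training means for both sets.
train_imputed = train.fillna(train_means)
test_imputed = test.fillna(train_means)

print(test_imputed)
```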

### 4.5.2. Feature Selection

We will perform feature selection to identify the most suitable features for our binary classification problem using the Chi-squared test for categorical features and ANOVA F-statistic for numerical features.

In [None]:
# define an empty dictionary to store chi-squared test results
chi2_check = {}

# loop over each column in the training set to calculate chi-statistic with the target variable
for column in X_train_cat:
    chi, p, dof, ex = chi2_contingency(pd.crosstab(y_train, X_train_cat[column]))
    chi2_check.setdefault('Feature',[]).append(column)
    chi2_check.setdefault('p-value',[]).append(round(p, 10))

# convert the dictionary to a DF
chi2_result = pd.DataFrame(data = chi2_check)
chi2_result.sort_values(by = ['p-value'], ascending = True, ignore_index = True, inplace = True)

# Calculate F Statistic and corresponding p values
F_statistic, p_values = f_classif(X_train_num, y_train)

# convert to a DF
ANOVA_F_table = pd.DataFrame(data = {'Numerical_Feature': X_train_num.columns.values,
					'F-Score': F_statistic, 'p values': p_values.round(decimals=10)})
ANOVA_F_table.sort_values(by = ['F-Score'], ascending = False, ignore_index = True, inplace = True)

# save the top 20 numerical features in a list
top_num_features = ANOVA_F_table.iloc[:20,0].to_list()

# calculate pair-wise correlations between them
corrmat = X_train_num[top_num_features].corr()
plt.figure(figsize=(10,10))
sns.heatmap(corrmat)

# save the names of columns to be dropped in a list
drop_columns_list = ANOVA_F_table.iloc[20:, 0].to_list()
drop_columns_list.extend(chi2_result.iloc[4:, 0].to_list())
drop_columns_list


In [None]:
# function to drop these columns
def col_to_drop(df, columns_list):
    df.drop(columns = columns_list, inplace = True)

# apply to X_train
col_to_drop(X_train, drop_columns_list)

### 4.5.3. Encoding of Categorical Variables

In [None]:
# function to create dummy variables
def dummy_creation(df, columns_list):
    df_dummies = []
    for col in columns_list:
        df_dummies.append(pd.get_dummies(df[col], prefix = col, prefix_sep = ':'))
    df_dummies = pd.concat(df_dummies, axis = 1)
    df = pd.concat([df, df_dummies], axis = 1)
    return df

# apply to our final four categorical variables
X_train.info()
# X_train = dummy_creation(X_train, ['grade', 'home_ownership', 'verification_status', 'purpose'])
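One thing to watch when the same encoding is applied to the test set: a category that is absent from the test data produces no dummy column there, so the two sets end up with different columns. The sketch below shows one way to align them with `reindex`. The toy frames are illustrative, and this alignment step is a suggestion rather than part of the notebook's pipeline.

```python
import pandas as pd

train = pd.DataFrame({"grade": ["A", "B", "C"]})
test = pd.DataFrame({"grade": ["A", "A"]})  # grades B and C never occur here

train_dummies = pd.get_dummies(train["grade"], prefix="grade", prefix_sep=":", dtype=int)

# Encode the test set, then force it onto the training columns;
# categories missing from the test set become all-zero columns.
test_dummies = pd.get_dummies(
    test["grade"], prefix="grade", prefix_sep=":", dtype=int
).reindex(columns=train_dummies.columns, fill_value=0)

print(test_dummies)
```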


## 4.6. Model Selection

### 4.6.1. Model Selection Criteria

### 4.6.2. Model Analysis

## 4.7. Model Testing

TBC

## 4.8. Model Adjustments

TBC

# 5. Model Implementation

- Packages and dependencies
- Setup of model development
- Deployment to production infrastructure 
- Model execution and reporting 

# 6. Model Use

TBC

# 7. Ongoing Monitoring

TBC

# 8. Model Governance

TBC
