# Group 1 - Project: Predicting Loan Default

Team members: Luca Matteucci, Santiago Mazzei, Srithijaa Sankepally, and Victor Floriano


## 1 - Business problem definition and data description


---



1.0 - **Problem Statement:**

Our goal is to predict what customers are more likely to default on their loan payments. By analyzing the Lending Club Loan dataset, we aim to understand the factors that contribute to loan defaults and late payments, as well as why some borrowers can't repay their loans on time, and to find out what helps borrowers succeed. We want to make the lending process better, reduce default risk, and increase profitability for lenders.

1.1 - **Data Source:**

We will use the Lending Club Loan dataset, which includes complete loan data for loans issued from 2007 to 2015. The Lending Club is a peer-to-peer lending company that matches people looking to invest money with people looking to borrow money. The dataset provides information about borrowers, loan characteristics, and loan performance. The data is freely available on Kaggle at: https://www.kaggle.com/datasets/adarshsng/lending-club-loan-data-csv




1.2 - **Data Description:**

The dataset contains approximately 2,260,668 observations and 145 variables (columns). The variables include information, such as borrower characteristics (e.g., credit scores, income, employment details), loan characteristics (e.g., loan amount, interest rate, purpose), and loan performance data (e.g., current loan status, delinquency history). Some of the variables we expect to be useful for our analysis include: `annual_inc`, `loan_amnt`, `int_rate`, `delinq_2yrs`, `purpose`, among others.

The dataset contains multiple data types, including float64 (105 instances), int64 (4 instances), and object (36 instances). Additionally, it includes several variables that necessitate preprocessing, such as:
1. Binary values assigned to 'Y'/'N'. (i.e. `hardship_flag`)
2. Object columns would work better as datetime. (i.e. `settlement_date`)
3. Variables with text input. (i.e. `desc`)

##2 - Import libraries, load the data, and sample from the original dataset


---
To avoid issues in our colab notebook, we decided to use a sample of the original data for train/test our models. Before moving to the next steps, make sure that the entire original dataset was loaded in the colab environment, the cell below with `loan_df.info()` should return DataFrame with 2,260,667 records.


In [1]:
#Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import gc
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

In [2]:
#Create drive and load the original dataset
from google.colab import drive
drive.mount('/content/drive')
df = pd.read_csv('/content/drive/MyDrive/BU_MSBA/BA810/Data/loan.csv')

Mounted at /content/drive


FileNotFoundError: ignored

In [None]:
#Display the number of entries, columns, and dtypes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2260668 entries, 0 to 2260667
Columns: 145 entries, id to settlement_term
dtypes: float64(105), int64(4), object(36)
memory usage: 2.4+ GB


Our target variable contains multiple values, we will transform this column into a binary 0/1 variable.

In [None]:
#Check distribution of target varible in the original dataset
print('Distribution of loan_status values in the original dataset:\n')
df['loan_status'].value_counts(normalize=True) * 100

Distribution of loan_status values in the original dataset:



Fully Paid                                             46.090448
Current                                                40.682444
Charged Off                                            11.574234
Late (31-120 days)                                      0.968608
In Grace Period                                         0.395989
Late (16-30 days)                                       0.165305
Does not meet the credit policy. Status:Fully Paid      0.087939
Does not meet the credit policy. Status:Charged Off     0.033663
Default                                                 0.001371
Name: loan_status, dtype: float64

To create a representative sample from our original dataset, we used the `train_test_split` method to create a stratified sample.

In [None]:
#Split the data into a left_out portion and our stratified sample
left_out, sample_df = train_test_split(df, test_size=0.1,
                                       stratify=df['loan_status'],
                                       random_state=42)

sample_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 226067 entries, 1624968 to 850605
Columns: 145 entries, id to settlement_term
dtypes: float64(105), int64(4), object(36)
memory usage: 251.8+ MB


After creating our sample we check the distribution of loan_status to verify if it resembled the original dataset. While there was small difference in the distribution of loan_status values, the distribution is very similar to the original dataset.

In [None]:
#Check loan_status distribution for sample_df
print('Distribution of loan_status values in the sample dataset:\n')
sample_df['loan_status'].value_counts(normalize=True) * 100

Distribution of loan_status values in the sample dataset:



Fully Paid                                             46.090318
Current                                                40.682629
Charged Off                                            11.574002
Late (31-120 days)                                      0.968739
In Grace Period                                         0.395900
Late (16-30 days)                                       0.165438
Does not meet the credit policy. Status:Fully Paid      0.088027
Does not meet the credit policy. Status:Charged Off     0.033618
Default                                                 0.001327
Name: loan_status, dtype: float64

To clean RAM space in our Colab environment, we proceeded to remove the original dataset and the unused data

In [None]:
#Remove original dataset and unused data
del df
del left_out

#Manually trigger garbage collection
gc.collect()

0

##3 - ADD OUR INITIAL DATA EXPLORATION HERE?


---



## 4 - Prepare Data Processing for Machine Learning


---


###4.1 - Drop columns with too many missing values

In our set on predictors, one issue was that a number of features had an excessive number of missing values, we decided to drop those columns.

1.  Among the various columns, `id` and `member_id` serve as unique identifiers. The `id` column is distinct for each loan record, while `member_id` uniquely identifies each borrower. Besides the extensive number of unique values, most data from these features was missing, so we dropped both columns.

2. For the remaining variables in our dataset, we adopted a threshold-based approach, deciding to eliminate any predictors where over 35% of the data was missing. Given the significant lack of available information for those predictors, imputations methods would yield untrustworthy results.
  * 58 columns had more than 35% of their data missing and were dropped from our analysis.

In [None]:
#Create a Series with the % of missing values for each feature
missing_percent = sample_df.isnull().mean() * 100

#Select only columns with more than 35% of their data missing
#and extract the names of columns to be dropped
columns_to_drop = missing_percent[missing_percent>35.0].index

#Drop columns from sample_df
sample_df.drop(columns=columns_to_drop, inplace=True)

###4.2 - Description of Remaining Columns and Summary of Further Modifications:

Based on the descriptions of the columns, it appears that columns with numeric data should be processed as numerical values, and those labeled as 'object' columns as categoric.

* `loan_status` : Our TARGET varible, the status of the loan. While multiple values appeared in the data, we re-mapped those to (0 = no default) or (1 = default) | categoric
* `loan_amnt`: The listed amount of the loan applied for by the borrower | **numeric**
* `funded_amnt` : The total amount committed to that loan at that point in time | **numeric**
* `funded_amnt_inv` : NO DESCRIPTION FOUND, values also seem very similar to funded amount | **dropped**
* `term` : Number of payments on the loan. Values are in months and can be either 36 or 60, re-mapped to '30' and '60' (originally ' 30 months' and ' 60 months') | **categoric**
* `int_rate`: The effective interest rate is equal to the interest rate on a Note reduced by Lending Club's estimate of the impact of uncollected interest prior to charge off | **numeric**
* `installment` : The monthly payment owed by the borrower if the loan originates | **numeric**
* `grade`: LC assigned loan grade, values from 'A' to 'F'. | **categoric**
* `sub_grade`: LC assigned loan subgrade (i.e. 'A5')| **categoric**
* `emp_title`: The job title supplied by the Borrower when applying for the loan, too many unique values | **dropped**
* `emp_length`: Employment length in years. Possible values are between 0 and 10. 0 means less than one year and 10 means ten or more years. | **categoric**
* `home_ownership`:  The home ownership status provided by the borrower during registration | **categoric**
* `annual_inc`: The self-reported annual income provided by the borrower during registration | **numeric**
* `verification_status` : Income was verified by LC, not verified, or if the income source was verified | **categoric**
* `issue_d` : NO DESCRIPTION FOUND, appears to be a date value, possible the date in which the loan was issued | **dropped**
* `pymnt_plan` : Payment plan 'y' or 'n' | **categoric**
* `purpose`: A category provided by the borrower for the loan request(14 unique categories) | **categoric**
* `title`: Loan title provided by the borrower, a text based varible (+8k unique entries), since NLP is outside of the score of this project we dropped this feature | **dropped**
* `zip_code`
* `addr_state`: The state provided by the borrower in the loan application | **categoric**
* `dti`:  Ratio calculated of borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower’s self-reported monthly income. | **numeric**
* `delinq_2yrs`: The number of 30+ days past-due incidences of delinquency in the borrower's credit file for the past 2 years | **numeric**
* `earliesst_cr_line`: The date the borrower's earliest reported credit line was opened

(Continue...)





In [None]:
sample_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 226067 entries, 1624968 to 850605
Data columns (total 87 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   loan_amnt                   226067 non-null  int64  
 1   funded_amnt                 226067 non-null  int64  
 2   funded_amnt_inv             226067 non-null  float64
 3   term                        226067 non-null  object 
 4   int_rate                    226067 non-null  float64
 5   installment                 226067 non-null  float64
 6   grade                       226067 non-null  object 
 7   sub_grade                   226067 non-null  object 
 8   emp_title                   209243 non-null  object 
 9   emp_length                  211314 non-null  object 
 10  home_ownership              226067 non-null  object 
 11  annual_inc                  226067 non-null  float64
 12  verification_status         226067 non-null  object 
 13  issue_d 

In [None]:
#Check for unique values
sample_df['term'].unique()

#Re-mapping values
sample_df['term'].replace({' 36 months':'36', ' 60 months':'60'}, inplace=True)

In [None]:
#When checking the possible values for emp_title we
#found that in our sample of around 200k records, there
#are more than 80k unique values, this would create problems when
#splitting the data and could over complicate our models, so
#we decided to drop this predictor.
print(sample_df['emp_title'].value_counts())

#Drop predictor
sample_df.drop(columns=['emp_title'], inplace=True)

Teacher                      3926
Manager                      3424
Owner                        2128
Registered Nurse             1601
RN                           1501
                             ... 
Meredith College                1
Manager Quality Assurance       1
business Development Rep        1
administrative technician       1
IEA                             1
Name: emp_title, Length: 81288, dtype: int64


In [None]:
#issue_d is the date the loan was funded.
#Too many unique values, it would slow down our models too much
print(sample_df['issue_d'].unique())

sample_df.drop(columns=['issue_d'], inplace=True)

['May-2017' 'Dec-2013' 'May-2015' 'Dec-2014' 'Dec-2017' 'Nov-2017'
 'Oct-2013' 'Oct-2016' 'Mar-2016' 'Apr-2015' 'May-2013' 'Apr-2014'
 'Sep-2018' 'Jun-2016' 'Apr-2017' 'Nov-2015' 'Apr-2016' 'Aug-2012'
 'Aug-2014' 'Dec-2015' 'Jul-2018' 'Sep-2017' 'Jan-2015' 'Nov-2018'
 'Sep-2015' 'Nov-2013' 'Oct-2014' 'Dec-2018' 'Jul-2017' 'Mar-2018'
 'Jul-2016' 'Oct-2017' 'Sep-2013' 'Jul-2015' 'Jun-2013' 'Jun-2014'
 'Feb-2017' 'Sep-2012' 'Aug-2018' 'Mar-2014' 'May-2018' 'May-2016'
 'Apr-2018' 'Jun-2015' 'Feb-2016' 'Aug-2016' 'Mar-2017' 'Nov-2014'
 'Aug-2013' 'Dec-2016' 'Jun-2018' 'Feb-2014' 'Jan-2017' 'Mar-2015'
 'Jul-2013' 'Dec-2011' 'Sep-2016' 'Jun-2017' 'Oct-2018' 'Feb-2012'
 'Nov-2016' 'Oct-2015' 'Feb-2018' 'May-2014' 'Jan-2018' 'Aug-2017'
 'Jan-2012' 'Jan-2016' 'Jan-2013' 'May-2012' 'Feb-2015' 'Sep-2009'
 'Feb-2013' 'Dec-2012' 'Apr-2013' 'Nov-2012' 'Jun-2012' 'Sep-2011'
 'Aug-2015' 'Aug-2011' 'Mar-2013' 'Jul-2014' 'Mar-2011' 'Aug-2009'
 'Sep-2014' 'Jan-2014' 'Oct-2009' 'Nov-2010' 'Aug-2010' 'Apr-2

In [None]:
#Loan status has multiple different values.
#For simplicity re-map to either default or no default
print(sample_df['loan_status'].unique())

#(MIGHT NEED TO CHANGE THIS)
sample_df['loan_status'].replace({'Current':0, 'Fully Paid':0,
                                  'Charged Off':1, 'Late (31-120 days)':1,
                                  'In Grace Period':0, 'Does not meet the credit policy. Status:Fully Paid':0,
                                  'Late (16-30 days)':1, 'Does not meet the credit policy. Status:Charged Off':1,
                                  'Default':1
}, inplace=True)

['Current' 'Fully Paid' 'Charged Off' 'Late (31-120 days)'
 'In Grace Period' 'Does not meet the credit policy. Status:Fully Paid'
 'Late (16-30 days)' 'Does not meet the credit policy. Status:Charged Off'
 'Default']


In [None]:
sample_df['loan_status'].value_counts()

Fully Paid                                             104195
Current                                                 91970
Charged Off                                             26165
Late (31-120 days)                                       2190
In Grace Period                                           895
Late (16-30 days)                                         374
Does not meet the credit policy. Status:Fully Paid        199
Does not meet the credit policy. Status:Charged Off        76
Default                                                     3
Name: loan_status, dtype: int64

In [None]:
#Check title feature
#title seems to contain explanations that the
#borrower provided for the loan, or simply 'thank you'
#messages (unstructured data). Given that NLP is outside of the scope of this
#project we decided to drop this feature
print(sample_df['title'].value_counts())

sample_df.drop(columns=['title'], inplace=True)

Debt consolidation                         115478
Credit card refinancing                     47121
Home improvement                            13657
Other                                       12678
Major purchase                               4406
                                            ...  
Pay-Off Santander Car Loan and New Roof         1
I WILL PAY YOU ON TIME - THANK YOU              1
Personal events                                 1
Loan to payoff credit card debt                 1
2000loan                                        1
Name: title, Length: 8682, dtype: int64


In [None]:
#HOW TO DEAL WITH ZIP CODE?
len(sample_df['zip_code'].unique())

898

In [None]:
#Maybe we could transform this to a year value?
sample_df['earliest_cr_line'].unique()

array(['Oct-2001', 'Nov-2003', 'Mar-1999', 'Oct-2007', 'Jun-1992',
       'Nov-2011', 'Feb-2009', 'Oct-1984', 'May-2005', 'Nov-2006',
       'Mar-1976', 'Dec-2000', 'Dec-1999', 'Oct-2008', 'Aug-2001',
       'May-2004', 'Feb-1988', 'Jul-1997', 'Jun-1989', 'Feb-1994',
       'Jun-2001', 'Jul-1999', 'Sep-2007', 'Jun-1999', 'Nov-1993',
       'Dec-1982', 'Jul-2006', 'Apr-1999', 'Nov-2007', 'Jul-1989',
       'Jan-1997', 'Nov-2005', 'Jul-2000', 'Apr-2006', 'Aug-1999',
       'Dec-2006', 'May-2012', 'Oct-1999', 'Apr-2009', 'Feb-1996',
       'Jul-1995', 'Feb-2004', 'Jun-2008', 'Oct-2005', 'May-2006',
       'Jan-2005', 'Mar-2002', 'Dec-2002', 'Aug-2011', 'Sep-2004',
       'Nov-2012', 'Nov-1990', 'Sep-1999', 'Jan-1978', 'Sep-1984',
       'Mar-2006', 'Dec-1996', 'Nov-2000', 'Jun-1985', 'Aug-1997',
       'Sep-1996', 'Nov-1984', 'Oct-1986', 'Nov-1996', 'Sep-1994',
       'Sep-1982', 'Mar-2012', 'Apr-2012', 'Feb-2012', 'May-2002',
       'Feb-2008', 'Aug-2010', 'Apr-1993', 'Jun-2000', 'Jun-19

###4.3 - Create Train/Test Split

In [None]:
#Train/test split

X = sample_df.drop('loan_status', axis=1)
y = sample_df['loan_status'].copy()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0, stratify=y)

#Create copies to experiment on before adding to pipeline
X_train_copy = X_train.copy();
y_train_copy = y_train.copy();

###4.4 - Imputation of Null Values

After removing the features with more than 35% missing values, some predictors still contained some null values. We decided to test SimpleImputer, a method that will later be applied in our pre-processing pipeline (SECTION X)


In [None]:
#Note: test will be done on numeric columns
#Create mean imputer
mean_imputer = SimpleImputer(strategy="mean")

#Select only numerical variables
X_train_copy_num = X_train_copy.select_dtypes(include=[np.number])

#Fit/Transform
X_train_copy_imp = mean_imputer.fit_transform(X_train_copy_num)

#Create dataframe with results
X_train_copy_imp_df = pd.DataFrame(X_train_copy_imp, columns=X_train_copy_num.columns, index=X_train_copy_num.index)

#Check for any null values remaining
print('Missing values remaining (numeric cols):', X_train_copy_imp_df.isnull().sum().sum())


Missing values remaining (numeric cols): 0


###4.4 - Feature Scaling

To prepare our data for models that require normalized values, we tested feature scaling in this section before applying it on our pre-processing pipeline

In [None]:
#Create standard scaler object
std_scaler = StandardScaler()

#Fit/Transform
X_train_copy_num_std_scaled = std_scaler.fit_transform(X_train_copy_num)

#Create dataframe with results
X_train_copy_num_std_scaled_df = pd.DataFrame(X_train_copy_num_std_scaled, columns=X_train_copy_num.columns, index=X_train_copy_num.index)
X_train_copy_num_std_scaled_df.head(5)

Unnamed: 0,loan_amnt,funded_amnt,funded_amnt_inv,int_rate,installment,annual_inc,dti,delinq_2yrs,inq_last_6mths,open_acc,...,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit
422327,-0.984018,-0.983687,-0.981291,0.099898,-0.905076,-0.3404,0.532062,-0.355482,0.472205,0.069344,...,-0.171136,2.147643,0.651981,-1.171893,2.375307,-0.12217,-0.8238,-0.760504,-0.391661,-0.780861
1750701,-0.570255,-0.569812,-0.567596,0.255178,-0.408099,-0.101021,0.542821,0.804096,1.592586,-0.989852,...,-0.171136,-0.041263,-0.126933,0.482326,-0.353796,-0.12217,-0.495279,-0.031372,1.694158,-0.414982
515423,1.629224,1.63026,1.631517,-0.063664,0.876241,0.311592,-0.066178,-0.355482,-0.648175,1.481605,...,-0.171136,-0.588489,0.651981,-0.253801,-0.353796,-0.12217,1.373387,0.88218,0.140612,1.434579
2147172,-0.330708,-0.330201,-0.328089,-0.452898,-0.198925,-0.164016,-0.632139,-0.355482,-0.648175,-0.813319,...,-0.171136,-0.588489,0.651981,0.667048,-0.353796,-0.12217,-0.83042,-0.694128,-0.664289,-0.850416
1963121,-0.439593,-0.439115,-0.436956,-0.436335,-0.319844,-0.3341,-0.293567,-0.355482,-0.648175,-0.813319,...,-0.171136,-0.588489,0.651981,1.585139,-0.353796,-0.12217,-0.899293,-0.826759,-0.703236,-0.972786
