# Group 1 - Project Proposal: Predicting Loan Default

Team members: Luca Matteucci, Santiago Mazzei, Srithijaa Sankepally, and Victor Floriano


**Problem Statement:**

Our goal is to predict which customers are more likely to default on their loan payments. By analyzing a Lending Club loan dataset, we aim to understand the factors that contribute to loan defaults and late payments, why some borrowers cannot repay their loans on time, and what helps borrowers succeed. Ultimately, we want to improve the lending process, reduce default risk, and increase profitability for lenders.

**Data Source:**

We will use the Lending Club Loan dataset, which includes complete loan data for loans issued from 2007 to 2015. Lending Club is a peer-to-peer lending company that matches people looking to invest money with people looking to borrow money. The dataset provides information about borrowers, loan characteristics, and loan performance. The data is freely available on Kaggle at: https://www.kaggle.com/datasets/adarshsng/lending-club-loan-data-csv




**Data Description:**

The dataset contains 2,260,668 observations and 145 variables (columns). The variables cover borrower characteristics (e.g., credit scores, income, employment details), loan characteristics (e.g., loan amount, interest rate, purpose), and loan performance data (e.g., current loan status, delinquency history). Some of the variables we expect to be useful for our analysis include `annual_inc`, `loan_amnt`, `int_rate`, `delinq_2yrs`, and `purpose`, among others.

The dataset contains multiple data types: float64 (105 columns), int64 (4 columns), and object (36 columns). It also includes several variables that require preprocessing, such as:
1. Binary values stored as 'Y'/'N' (e.g., `hardship_flag`).
2. Object columns that would work better as datetime (e.g., `settlement_date`).
3. Variables with free-text input (e.g., `desc`).
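The three preprocessing cases above could be handled roughly as follows. This is a minimal sketch on a made-up three-row frame rather than the real data: the sample values, the `%b-%Y` date format, and the derived `has_desc` column are all assumptions for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the real loan data; all values are made up.
sample = pd.DataFrame({
    "hardship_flag": ["N", "Y", "N"],
    "settlement_date": ["Jan-2018", None, "Mar-2019"],
    "desc": ["Consolidating credit cards", None, "  "],
})

# 1. Map 'Y'/'N' flags to 1/0.
sample["hardship_flag"] = sample["hardship_flag"].map({"Y": 1, "N": 0})

# 2. Parse month-year strings into datetimes (the format is an assumption);
#    missing or unparseable entries become NaT.
sample["settlement_date"] = pd.to_datetime(
    sample["settlement_date"], format="%b-%Y", errors="coerce"
)

# 3. Reduce free text to an indicator of whether a description was provided
#    (hypothetical derived column; richer text features are also possible).
sample["has_desc"] = sample["desc"].fillna("").str.strip().ne("").astype(int)

print(sample.dtypes)
```

Whether a simple indicator is enough for `desc`, or whether the text itself carries signal, is something we would need to check during exploration.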

**Loading the data and brief exploration:**

In [1]:
#Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
from google.colab import drive
drive.mount('/content/drive')
loan_df = pd.read_csv('/content/drive/MyDrive/BU_MSBA/BA810/Data/loan.csv')

Mounted at /content/drive



In [15]:
#Display the number of entries, columns, and dtypes
loan_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2260668 entries, 0 to 2260667
Columns: 145 entries, id to settlement_term
dtypes: float64(105), int64(4), object(36)
memory usage: 2.4+ GB


In [16]:
#Our target variable contains multiple values, we will
#transform this column into a binary 0/1 variable
loan_df['loan_status'].value_counts()

Fully Paid                                             1041952
Current                                                 919695
Charged Off                                             261655
Late (31-120 days)                                       21897
In Grace Period                                           8952
Late (16-30 days)                                         3737
Does not meet the credit policy. Status:Fully Paid        1988
Does not meet the credit policy. Status:Charged Off        761
Default                                                     31
Name: loan_status, dtype: int64
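One possible way to turn `loan_status` into the binary 0/1 target is sketched below. The grouping of statuses is an assumption, not a decision the proposal has made: charged-off and defaulted loans are treated as 1, fully paid loans as 0, and loans that are still open (current, late, or in grace period) are dropped because their outcome is not yet known. The helper name `encode_target` is hypothetical.

```python
import pandas as pd

# Assumed grouping of statuses; the proposal does not fix this mapping.
DEFAULT_STATUSES = {
    "Charged Off",
    "Default",
    "Does not meet the credit policy. Status:Charged Off",
}
PAID_STATUSES = {
    "Fully Paid",
    "Does not meet the credit policy. Status:Fully Paid",
}


def encode_target(status: pd.Series) -> pd.Series:
    """Return 1 for default, 0 for fully paid; drop loans with no final outcome."""
    resolved = status[status.isin(DEFAULT_STATUSES | PAID_STATUSES)]
    return resolved.isin(DEFAULT_STATUSES).astype(int)


# Small made-up example; in the notebook this would be loan_df['loan_status'].
statuses = pd.Series(
    ["Fully Paid", "Current", "Charged Off", "Default", "Late (16-30 days)"]
)
target = encode_target(statuses)
print(target)
```

Dropping open loans shrinks the dataset considerably (nearly a million rows are `Current`), so an alternative grouping, such as counting late loans as defaults, may be worth comparing.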

In [17]:
#Summary statistics of selected variables
loan_df.describe() #EDIT THIS TO CONTAIN ONLY A FEW VARIABLES

statistic,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,int_rate,installment,annual_inc,url,dti,...,deferral_term,hardship_amount,hardship_length,hardship_dpd,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,settlement_amount,settlement_percentage,settlement_term
count,0.0,0.0,2260668.0,2260668.0,2260668.0,2260668.0,2260668.0,2260664.0,0.0,2258957.0,...,10613.0,10613.0,10613.0,10613.0,8426.0,10613.0,10613.0,33056.0,33056.0,33056.0
mean,,,15046.93,15041.66,15023.44,13.09291,445.8076,77992.43,,18.8242,...,3.0,155.006696,3.0,13.686422,454.840802,11628.036442,193.606331,5030.606922,47.7756,13.148596
std,,,9190.245,9188.413,9192.332,4.832114,267.1737,112696.2,,14.18333,...,0.0,129.113137,0.0,9.728138,375.830737,7615.161123,198.694368,3692.027842,7.336379,8.192319
min,,,500.0,500.0,0.0,5.31,4.93,0.0,,-1.0,...,3.0,0.64,3.0,0.0,1.92,55.73,0.01,44.21,0.2,0.0
25%,,,8000.0,8000.0,8000.0,9.49,251.65,46000.0,,11.89,...,3.0,59.37,3.0,5.0,174.9675,5628.73,43.78,2227.0,45.0,6.0
50%,,,12900.0,12875.0,12800.0,12.62,377.99,65000.0,,17.84,...,3.0,119.04,3.0,15.0,352.605,10044.22,132.89,4172.855,45.0,14.0
75%,,,20000.0,20000.0,20000.0,15.99,593.32,93000.0,,24.49,...,3.0,213.26,3.0,22.0,622.7925,16114.94,284.18,6870.7825,50.0,18.0
max,,,40000.0,40000.0,40000.0,30.99,1719.83,110000000.0,,999.0,...,3.0,943.94,3.0,37.0,2680.89,40306.41,1407.86,33601.0,521.35,181.0
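To restrict the summary to a handful of variables, the column list can be passed before calling `describe()`. The sketch below uses a small made-up frame so it runs on its own; the choice of columns is an assumption, and in the notebook `sample` would be `loan_df`.

```python
import pandas as pd

# Hypothetical stand-in rows; values are illustrative only.
sample = pd.DataFrame({
    "loan_amnt": [8000, 12900, 20000, 40000],
    "int_rate": [9.49, 12.62, 15.99, 30.99],
    "annual_inc": [46000.0, 65000.0, 93000.0, None],
    "dti": [11.89, 17.84, 24.49, 999.0],
})

# Assumed shortlist of variables to summarize.
selected = ["loan_amnt", "int_rate", "annual_inc", "dti"]

# Transposing gives one row per variable, which is easier to read.
summary = sample[selected].describe().T
print(summary)
```

A narrower table also makes outliers easier to spot, such as the `dti` maximum of 999 and minimum of -1 visible in the full output.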


In [None]:
#ADD 1/2 PLOTS

**Anticipated Results:**

Our target variable will be whether a customer was charged off/defaulted on the loan or not. We are therefore dealing with a classification problem, and our models will return a class of 1 (default) or 0 (no default) based on each record's characteristics. We expect to apply pre-processing steps such as data cleaning, imputation of missing values, normalization, and feature selection, the last of which is needed because of the high dimensionality of our data. For feature selection, we plan to experiment with Sequential Feature Selection and Tree-based Feature Selection (both available in scikit-learn).
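The two feature selection approaches could look roughly like this in scikit-learn. This is a sketch on synthetic data from `make_classification`, not on the loan data; the estimators, the number of features to keep, and the fold count are all assumptions chosen to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel, SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the preprocessed loan features.
X, y = make_classification(
    n_samples=300, n_features=12, n_informative=4, random_state=0
)

# Tree-based selection: keep features whose importance is at least
# the mean importance (the default threshold for this estimator).
tree_selector = SelectFromModel(
    RandomForestClassifier(n_estimators=50, random_state=0)
)
X_tree = tree_selector.fit_transform(X, y)

# Sequential forward selection: greedily add the feature that most
# improves the cross-validated score until 4 features are chosen.
sfs = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=4,
    direction="forward",
    cv=3,
)
X_sfs = sfs.fit_transform(X, y)

print("Tree-based kept:", X_tree.shape[1], "features")
print("Sequential kept:", X_sfs.shape[1], "features")
```

Sequential selection refits the model many times, so on the full dataset it would likely need to run on a sample or on a pre-filtered set of columns.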

In order to predict which customers will default on their loans we plan to test the following models:
* Logistic Regression (sklearn-LogisticRegression) (params)
* k-Nearest Neighbors (sklearn-KNeighborsClassifier) (params)
* Decision Tree (sklearn-DecisionTreeClassifier) (params)
* Random Forest (sklearn-RandomForestClassifier) (params)

Beyond prediction, we also plan to assess the influence of loan amount and term on defaults and late payments, and to explore patterns in delinquency history and their impact on loan performance.
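A first-pass comparison of the four candidate models might be sketched as below. It uses synthetic, imbalanced data and near-default settings; the class weights, fold count, and F1 scoring are assumptions for illustration, and the k-NN and logistic models are wrapped in a scaling pipeline because both are sensitive to feature scale.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with roughly 20% positives (defaults).
X, y = make_classification(
    n_samples=400, n_features=10, weights=[0.8, 0.2], random_state=0
)

models = {
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "k-Nearest Neighbors": make_pipeline(
        StandardScaler(), KNeighborsClassifier()
    ),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Mean 5-fold cross-validated F1 for each model.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="f1").mean()
    for name, model in models.items()
}

for name, score in scores.items():
    print(f"{name}: F1 = {score:.3f}")
```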

We will then fine-tune our models' hyperparameters to achieve the best performance on selected metrics such as accuracy, precision, recall, and F1. To try multiple hyperparameter combinations, we anticipate using parameter search methods such as GridSearchCV and RandomizedSearchCV. Our final step will be to compare the tuned models and select the best one.
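As an illustration of the tuning step, the sketch below runs `GridSearchCV` over the regularization strength of a logistic regression. The data is synthetic, and the grid values, the F1 scoring choice, and the train/test split are assumptions rather than settings we have committed to.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the loan data.
X, y = make_classification(
    n_samples=400, n_features=10, weights=[0.8, 0.2], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Scaling lives inside the pipeline so it is refit on each CV training fold.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Assumed grid over the inverse regularization strength C.
param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(pipe, param_grid, scoring="f1", cv=5)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
report = classification_report(y_test, search.predict(X_test))
print(report)
```

Because defaults are the minority class, accuracy alone can look good for a model that rarely predicts a default, which is why the sketch scores on F1 and prints precision and recall for the held-out set.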


**Potential Implications:**

With the results of this analysis, we expect to be able to identify the biggest drivers of loan default, such as [INSERT FINDINGS FROM THE ANTICIPATED RESULTS], and to generate a model with relatively accurate predictions.

1. **Risk Assessment and Mitigation:** Identifying the factors contributing to loan defaults and late payments allows lending institutions to assess the risk associated with potential borrowers more effectively. This can lead to improved risk management practices.

2. **Market Competitiveness:** Borrowers can benefit from more competitive loan terms and rates based on their credit profiles and loan purposes, resulting in higher customer satisfaction. Investors may also find the platform more attractive, since understanding the factors that contribute to loan defaults can help them diversify their investments and potentially earn higher returns.

3. **Compliance with Ethics and Regulations:** Ensuring fair and transparent lending practices, supported by data-driven decisions, can help lending institutions comply with financial regulations and avoid legal issues. This also contributes to the promotion of ethical lending practices by reducing the likelihood of lending discrimination and bias.
