# Credit Risk & Loan Performance: Data Sampling and Exploratory Data Analysis

#### Author: Satveer Kaur
#### Date: 2025-10-19
#### Notebook Purpose:
This notebook focuses on **data sampling and exploratory data analysis (EDA)** for the LendingClub Accepted and Rejected Loans datasets.LendingClub Accepted and Rejected Loans datasets. 
The goal is to:
1. Create manageable **sample datasets** for faster analysis while maintaining representativeness of the original data.
2. Explore **key patterns, distributions, and correlations** that influence credit risk and loan performance.
3. Generate **business-relevant insights** to guide financial decision-making, such as borrower reliability, loan trends, and approval characteristics.
4. Prepare data summaries and visuals that will later feed into **dashboard creation and modeling.**

#### 1. Load Cleaned Datasets
**Purpose**: Import the **cleaned Accepted and Rejected Loans** CSV files and verify successful loading by checking their shape and basic structure. This ensures both datasets are ready for sampling and further exploratory analysis.

In [41]:
# Importing Libraries
import pandas as pd
from sklearn.model_selection import train_test_split  # for stratified sampling
import seaborn as sns
import matplotlib.pyplot as plt

# Load cleaned datasets
accepted_loans = pd.read_csv("../data/clean_data/accepted_loans_cleaned.csv", low_memory=False)
rejected_loans = pd.read_csv("../data/clean_data/rejected_loans_cleaned.csv")

print(f"Accepted Loans: {accepted_loans.shape}")
print(f"Rejected Loans: {rejected_loans.shape}")

accepted_loans['loan_status'].value_counts(normalize=True)

Accepted Loans: (2260701, 121)
Rejected Loans: (27648741, 7)


loan_status
Fully Paid                                             0.476298
Current                                                0.388521
Charged Off                                            0.118796
Late (31-120 days)                                     0.009496
In Grace Period                                        0.003732
Late (16-30 days)                                      0.001924
Does not meet the credit policy. Status:Fully Paid     0.000879
Does not meet the credit policy. Status:Charged Off    0.000337
Default                                                0.000018
Name: proportion, dtype: float64

#### 2. Create Sample Datasets
**Purpose:**  
The cleaned datasets are large, which can make visualization and analysis slower. To enable efficient exploratory data analysis (EDA), we create **representative samples** that retain the overall data distribution while reducing size.  

This approach allows for faster testing, plotting, and insight generation — especially useful when working on limited local resources.

##### 2.1 Drop Rows with NaN `loan_status` and `state` and change datatype to String
**Purpose:**  
Stratified sampling requires a valid target variable. Rows with missing `loan_status` cannot be assigned to a class and the column is converted to string to ensure proper class handling in the sample.

In [27]:
# Ensure 'loan_status' in accepted loans has no NaNs and is string type
accepted_loans.loc[:,'loan_status'] = accepted_loans.dropna(subset=['loan_status'])
accepted_loans['loan_status'] = accepted_loans['loan_status'].astype(str)
print(f"Accepted Loans: {accepted_loans.shape}, dtype: {accepted_loans['loan_status'].dtype}")

# Ensure 'state' in rejected loans has no NaNs and is string type
rejected_loans.loc[:,'state'] = rejected_loans.dropna(subset=['state'])
rejected_loans['state'] = rejected_loans['state'].astype(str)
print(f"Rejected Loans: {rejected_loans.shape}, dtype: {rejected_loans['state'].dtype}")


Accepted Loans: (2260701, 121), dtype: object
Rejected Loans: (27648741, 7), dtype: object


##### 2.2 Stratified Sampling for Accepted and Rejected Loans
**Purpose:**  
To reduce dataset size while preserving class proportions, we perform stratified sampling on accepted loans (`loan_status`) and rejected loans (`state`).

In [32]:
# Stratified sampling for accepted loans by loan_status
accepted_sample, _ = train_test_split(
    accepted_loans,
    test_size=0.9, # keep 10% for sample 
    stratify=accepted_loans['loan_status'],
    random_state=42
)

# Stratified sampling for rejected loans by state (Since no loan_status column)
rejected_sample, _ = train_test_split(
    rejected_loans,
    test_size=0.9915,
    stratify=rejected_loans['state'],
    random_state=42
)

print(f'Accepted Loans Sample: {accepted_sample.shape}')
print(f'Rejected Loans Sample: {rejected_sample.shape}')

Accepted Loans Sample: (226070, 121)
Rejected Loans Sample: (235014, 7)


#### 3. Exploratory Data Analysis (EDA)
**Purpose:**  
Understand the data distributions, identify patterns, spot anomalies, and get insights that will guide feature engineering and modeling.

##### 3.1 Overview of the Data
**Purpose:**  
Get a quick summary of the numeric and categorical features, including counts, distributions, and basic statistics.

In [34]:
# Numeric summary
accepted_sample.describe()

Unnamed: 0,amount_requested,funded_amount,funded_amount_invested,interest_rate,installment,annual_income,application_date,debt_to_income_ratio,delinquencies_2yrs,fico_range_low,...,sec_app_fico_range_high,sec_app_inq_last_6mths,sec_app_mort_acc,sec_app_open_acc,sec_app_revol_util,sec_app_open_act_il,sec_app_num_rev_accts,sec_app_chargeoff_within_12_mths,sec_app_collections_12_mths_ex_med,sec_app_mths_since_last_major_derog
count,226067.0,226067.0,226067.0,226067.0,226067.0,226067.0,0.0,225887.0,226066.0,226067.0,...,10730.0,10730.0,10730.0,10730.0,10559.0,10730.0,10730.0,10730.0,10730.0,3554.0
mean,15070.37803,15065.513321,15047.403243,13.092544,446.550096,77962.65,,18.844887,0.308459,698.57989,...,673.942311,0.626561,1.544268,11.555918,57.987461,3.019758,12.623486,0.04287,0.075303,37.447946
std,9216.516898,9215.136915,9219.251538,4.829759,268.001749,72516.15,,15.012531,0.876817,33.045582,...,44.428848,0.990618,1.756749,6.642762,25.499223,3.29085,8.228078,0.34544,0.382675,23.940992
min,500.0,500.0,0.0,5.31,14.77,0.0,,0.0,0.0,615.0,...,544.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,8000.0,8000.0,8000.0,9.49,252.14,46000.0,,11.9,0.0,675.0,...,649.0,0.0,0.0,7.0,39.6,1.0,7.0,0.0,0.0,17.0
50%,13000.0,12975.0,12875.0,12.62,378.2,65000.0,,17.83,0.0,690.0,...,674.0,0.0,1.0,10.0,60.4,2.0,11.0,0.0,0.0,36.0
75%,20000.0,20000.0,20000.0,15.99,593.83,93296.5,,24.49,0.0,715.0,...,699.0,1.0,2.0,15.0,78.4,4.0,17.0,0.0,0.0,56.0
max,40000.0,40000.0,40000.0,30.99,1714.54,10999200.0,,999.0,25.0,845.0,...,850.0,6.0,14.0,73.0,165.1,43.0,96.0,12.0,12.0,141.0


In [37]:
# Categorical summary of loan_status
accepted_sample['loan_status'].value_counts()

loan_status
Fully Paid                                             107675
Current                                                 87832
Charged Off                                             26856
Late (31-120 days)                                       2147
In Grace Period                                           843
Late (16-30 days)                                         435
Does not meet the credit policy. Status:Fully Paid        199
Does not meet the credit policy. Status:Charged Off        76
Default                                                     4
nan                                                         3
Name: count, dtype: int64

In [39]:
# Categorical summary of state
rejected_sample['state'].value_counts()

state
CA    27558
TX    21212
FL    18424
NY    16925
GA     9211
PA     8905
OH     8596
IL     8509
NC     7343
NJ     7253
MI     6423
VA     6280
MD     5055
AZ     5006
TN     4896
MA     4648
IN     4397
WA     4232
MO     4217
AL     4192
CO     3995
SC     3981
LA     3584
WI     3074
MN     3054
KY     2976
CT     2886
NV     2835
OK     2602
AR     2476
OR     2410
MS     2357
KS     1960
UT     1525
NM     1406
HI     1323
NH     1000
NE      959
RI      953
WV      903
DE      800
ID      682
ME      670
MT      605
AK      514
SD      484
DC      449
VT      438
WY      422
ND      405
IA        4
Name: count, dtype: int64

##### 3.2 Visualize Distributions
**Purpose:**  
Check how key variables are distributed and spot any anomalies or patterns. This includes numeric features (loan amounts, interest rates, terms) and categorical features (loan_status, state).