# Credit Risk & Loan Performance: Data Sampling and Exploratory Data Analysis

#### Author: Satveer Kaur
#### Date: 2025-10-19
#### Notebook Purpose:
This notebook focuses on **data sampling and initial exploratory data analysis (EDA)** for the LendingClub Accepted Loans dataset.
The goal is to:
1. Create **representative sample datasets** for faster, efficient exploration while preserving the distribution of key variables.
2. Conduct **initial EDA**, including univariate analysis of important numerical and categorical features, to understand data structure and detect potential issues.
3. Evaluate **variable distributions, proportions, and data balance** between accepted and rejected loans.
4. Prepare the groundwork for deeper **statistical analysis, feature engineering, and visualization** in subsequent notebooks.

#### 1. Load Cleaned Datasets
**Purpose**: Import the **cleaned Accepted Loans** CSV file and verify successful loading by checking its shape. This ensures the dataset is ready for sampling and further exploratory analysis.

In [42]:
# Importing Libraries
import pandas as pd
from sklearn.model_selection import train_test_split  # for stratified sampling
import seaborn as sns
import matplotlib.pyplot as plt

plt.style.use('ggplot')
# To see all the columns in the df (None removes the column display limit)
# pd.set_option('display.max_columns', None)


# Load cleaned datasets
accepted_loans = pd.read_csv("../data/clean_data/accepted_loans_cleaned.csv", low_memory=False)

print(f"Accepted Loans: {accepted_loans.shape}")

Accepted Loans: (2260701, 102)


#### 2. Create Sample Datasets
**Purpose:**  
The cleaned datasets are large, which can make visualization and analysis slower. To enable efficient exploratory data analysis (EDA), we create **representative samples** that retain the overall data distribution while reducing size.  

This approach allows for faster testing, plotting, and insight generation — especially useful when working on limited local resources.

##### 2.1 Drop Rows with NaN in `loan_status`, `annual_income`, and `fico_range_high`, and Ensure Correct Data Types
**Purpose:**  
For predictive risk modeling, the dataset must be complete for the most critical variables. Rows missing `loan_status` (the source of the target variable, `loan_status_grouped`, built in 2.2) or the core predictors (`annual_income`, `fico_range_high`) are unusable, as these fields are essential for measuring and pricing risk.

We drop these incomplete rows to maintain data quality, and then convert `loan_status` to a string (object) so the status mapping in the next step can be applied reliably.

In [43]:
# columns to drop na values from
critical_columns = [
    'loan_status',
    'annual_income',
    'fico_range_high'
]

# drop NaN values from columns
accepted_loans = accepted_loans.dropna(subset=critical_columns).copy()

# converting loan_status to string
accepted_loans['loan_status'] = accepted_loans['loan_status'].astype(str)

# Checking data types of columns
print(accepted_loans[['loan_status', 'annual_income', 'fico_range_high']].dtypes)

loan_status         object
annual_income      float64
fico_range_high    float64
dtype: object


##### 2.2 Create `loan_status_grouped` for Risk Classification
**Purpose:**  
This step groups the raw `loan_status` values into simplified buckets: **Fully Paid (Success), Charged Off (Failure), Current/Pending (Uncertain), and Other/Exclude (Irrelevant)**. This is essential for risk analysis as it accurately defines the outcomes needed for the target variable (is_default) and for subsequent filtering.

In [44]:
#  the complete mapping for the unique statuses  
status_mapping = {
    # Success
    'Fully Paid': 'Fully Paid',
    
    # Failure/Default (CRITICAL for losses)
    'Charged Off': 'Charged Off',
    'Default': 'Charged Off', # 'Default' is a final failure state
    'Does not meet the credit policy. Status:Charged Off': 'Charged Off', # Treat as a failure
    
    # Uncertain/Pending (CRITICAL for later filtering)
    'Current': 'Current/Pending',
    'In Grace Period': 'Current/Pending',
    'Late (31-120 days)': 'Current/Pending',
    'Late (16-30 days)': 'Current/Pending',
    
    # Other/Exclude (Irrelevant to core risk modeling)
    'Does not meet the credit policy. Status:Fully Paid': 'Other/Exclude',
}
 
#  Apply the mapping, defaulting to 'Other/Exclude' if any new status appears 
def group_loan_status_accurate(status):
    """Maps status using the dictionary, or defaults to 'Other/Exclude'."""
    status_clean = status.strip()
    return status_mapping.get(status_clean, 'Other/Exclude')

accepted_loans['loan_status_grouped'] = accepted_loans['loan_status'].apply(group_loan_status_accurate)

# Validate the distribution of the new column
print("Distribution of the new grouped loan status:")
print(accepted_loans['loan_status_grouped'].value_counts(normalize=True))

Distribution of the new grouped loan status:
loan_status_grouped
Fully Paid         0.476299
Current/Pending    0.403673
Charged Off        0.119151
Other/Exclude      0.000878
Name: proportion, dtype: float64
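
Because unknown statuses fall silently into `Other/Exclude`, it can help to list which raw values took that default path. The following is a minimal sketch on toy data, not the notebook's full mapping: `toy_mapping`, `statuses`, and the `"Settled"` status are hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical toy data; "Settled" stands in for a status the mapping has never seen.
statuses = pd.Series(["Fully Paid", "Charged Off ", "Current", "Settled"])

# Abbreviated, illustrative copy of the mapping idea (not the full notebook mapping).
toy_mapping = {
    "Fully Paid": "Fully Paid",
    "Charged Off": "Charged Off",
    "Current": "Current/Pending",
}

# Strip whitespace, map known statuses, and send everything else to the default bucket.
cleaned = statuses.str.strip()
grouped = cleaned.map(toy_mapping).fillna("Other/Exclude")

# Raw statuses that fell through to the default bucket and deserve a manual look.
unmapped = sorted(set(cleaned) - set(toy_mapping))

print(grouped.tolist())
print(unmapped)
```

If the list of unmapped values is non-empty on real data, the mapping dictionary is probably worth extending before the grouped column is used for filtering.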


##### 2.3 Stratified Sampling for Accepted Loans
**Purpose:**  
To create a smaller, manageable sample of the data for faster computation and exploration while preserving the proportions of the grouped loan outcomes (Fully Paid, Charged Off, Current/Pending, Other/Exclude). We use stratified sampling on the `loan_status_grouped` column, keeping 10% of the rows.

In [45]:
# Stratified sampling for accepted loans by loan_status_grouped
accepted_sample, _ = train_test_split(
    accepted_loans,
    test_size=0.9, # keep 10% for sample 
    stratify=accepted_loans['loan_status_grouped'],
    random_state=42
)

print(f'Accepted Loans Sample: {accepted_sample.shape}')

Accepted Loans Sample: (226066, 103)
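
One way to confirm that a stratified sample keeps the class shares is to compare normalized value counts before and after sampling. The sketch below uses a hypothetical toy frame and a pandas-only alternative (`groupby(...).sample`) rather than the `train_test_split` call used in this notebook; the names `toy`, `toy_sample`, and `comparison` are illustrative.

```python
import pandas as pd

# Hypothetical toy frame with an imbalanced outcome column.
toy = pd.DataFrame({
    "loan_status_grouped": ["Fully Paid"] * 600
    + ["Current/Pending"] * 300
    + ["Charged Off"] * 100,
    "loan_id": range(1000),
})

# Draw 10% within each outcome group (a pandas-only stratified sample).
toy_sample = toy.groupby("loan_status_grouped").sample(frac=0.1, random_state=42)

# Compare class shares in the full data and in the sample.
comparison = pd.DataFrame({
    "full": toy["loan_status_grouped"].value_counts(normalize=True),
    "sample": toy_sample["loan_status_grouped"].value_counts(normalize=True),
})
comparison["abs_diff"] = (comparison["full"] - comparison["sample"]).abs()

print(comparison)
```

The same comparison should apply to `accepted_loans` and `accepted_sample`; small non-zero differences are expected on real data because group sizes rarely divide evenly.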


##### 2.4 Create Binary Target (`is_default`) and Filter Out Uncertain Loans
**Purpose:**  
This step prepares the data for core risk analysis by filtering out all uncertain loans (Current/Pending) and keeping only definitive outcomes. We then create the binary target variable, is_default (1 = Charged Off, 0 = Fully Paid), which is the essential analytical flag used to accurately quantify default rates across different loan segments and calculate historical profitability.

In [46]:
accepted_sample_filtered = accepted_sample[
    accepted_sample['loan_status_grouped'].isin(['Fully Paid','Charged Off'])
].copy()

# 1 = Charged Off (default), 0 = Fully Paid
accepted_sample_filtered['is_default'] = (
    accepted_sample_filtered['loan_status_grouped'] == 'Charged Off'
).astype(int)

#### 3. Core Feature Engineering and Validation
**Purpose:**  
To transform raw, continuous data (**like Annual Income, FICO Score and DTI**) into discrete, risk-quantifying features (**like Income Brackets, FICO Bins and DTI Quintiles**). This process creates variables that are both highly predictive for risk analysis and directly actionable for underwriting and portfolio management policy.

##### 3.1 Create `annual_income` Brackets

In [47]:
# Fixed income tiers; the last edge sits just above the sample maximum so every row falls in a bin
income_labels = ['< $50k',' $50k - $100k',' $100k - $150k',' > $150k']
income_bins = [0, 50_000, 100_000, 150_000, accepted_sample_filtered['annual_income'].max()+1]
accepted_sample_filtered['income_brackets'] = pd.cut(
    accepted_sample_filtered['annual_income'], 
    bins=income_bins, 
    labels=income_labels, 
    include_lowest=True,
    right=False # Ensure 50k lands in the 50k-100k bin, not the <50k bin
)

print("Share of loans by income bracket:\n")
print(accepted_sample_filtered['income_brackets'].value_counts(normalize=True).map('{:.2%}'.format))

Share of loans by income bracket:

income_brackets
 $50k - $100k     50.64%
< $50k            28.70%
 $100k - $150k    14.42%
 > $150k           6.23%
Name: proportion, dtype: object
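
The output above shows how loans are spread across brackets, not how often each bracket defaults. A default-rate check in the same style as the DTI and FICO validations could look like the sketch below. It runs on hypothetical toy values (`toy`, `income_odr`) and uses labels without leading spaces, so it illustrates the pattern and does not reproduce notebook results.

```python
import pandas as pd

# Hypothetical toy loans: income plus a binary default flag.
toy = pd.DataFrame({
    "annual_income": [30_000, 45_000, 50_000, 80_000, 120_000, 200_000],
    "is_default": [1, 0, 1, 0, 0, 0],
})

# Left-closed bins, so exactly 50k falls in the 50k-100k bracket.
bins = [0, 50_000, 100_000, 150_000, toy["annual_income"].max() + 1]
labels = ["< $50k", "$50k - $100k", "$100k - $150k", "> $150k"]
toy["income_brackets"] = pd.cut(
    toy["annual_income"], bins=bins, labels=labels, right=False
)

# Observed default rate per bracket (mean of the 0/1 flag).
income_odr = toy.groupby("income_brackets", observed=True)["is_default"].mean()

print(income_odr.map("{:.2%}".format))
```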


##### 3.2 Create `dti_quintile` Brackets

In [48]:
accepted_sample_filtered['dti_quintile'] = pd.qcut(
    accepted_sample_filtered['debt_to_income_ratio'],
    q=5, # Creates 5 bins of equal population size
    labels=[
        'Q1 (Lowest DTI)',  # 0
        'Q2',               # 1
        'Q3',               # 2
        'Q4',               # 3
        'Q5 (Highest DTI)'  # 4
    ],
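    duplicates='drop'  # if duplicate edges are actually dropped, the five labels above no longer match and qcut raises a ValueError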
)
# validate dti_quintile 
dti_risk_analysis = accepted_sample_filtered.groupby('dti_quintile', observed=True)['is_default'].mean()
print('Observed Default Rate by DTI Quintile:\n ')
print(dti_risk_analysis.sort_values(ascending=False).map('{:.2%}'.format))


Observed Default Rate by DTI Quintile:
 
dti_quintile
Q5 (Highest DTI)    27.02%
Q4                  22.05%
Q3                  19.05%
Q2                  16.64%
Q1 (Lowest DTI)     15.29%
Name: is_default, dtype: object
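
`pd.qcut` with a fixed list of five labels only works while all five quantile edges are distinct. If the DTI column had many tied values, `duplicates='drop'` would leave fewer bins than labels. A minimal sketch of a more defensive pattern follows, assuming heavily tied toy data; `dti`, `codes`, `edges`, and `dti_bucket` are hypothetical names.

```python
import pandas as pd

# Hypothetical, heavily tied values: the zeros make several quantile edges coincide.
dti = pd.Series([0, 0, 0, 0, 0, 0, 5, 10, 15, 20], dtype=float)

# labels=False returns integer bin codes; retbins=True also returns the edges that survived.
codes, edges = pd.qcut(dti, q=5, labels=False, retbins=True, duplicates="drop")
n_bins = len(edges) - 1

# Build exactly as many labels as there are bins, then map codes to labels.
labels = [f"Q{i + 1}" for i in range(n_bins)]
dti_bucket = codes.map(dict(enumerate(labels)))

print(n_bins)
print(dti_bucket.value_counts().sort_index())
```

When fewer than five bins survive, the buckets are no longer quintiles, so the resulting labels should be read as ordered DTI groups and not as equal-population fifths.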


##### 3.3 Create `fico_score` Bins

In [49]:
# Calculate fico score - axis=1 (row-wise)
accepted_sample_filtered['fico_score'] = accepted_sample_filtered[['fico_range_high','fico_range_low']].mean(axis=1)
max_score = accepted_sample_filtered['fico_score'].max()
fico_bins = [
    0,
    670, # boundary for poor/fair
    740, # good/very good
    800, # very good/excellent
    max_score+1
]
fico_labels = [
    'Subprime/Poor (<670)', 
    'Good (670-739)', 
    'Very Good (740-799)', 
    'Excellent (800+)'
]
accepted_sample_filtered['fico_bin']= pd.cut(
    accepted_sample_filtered['fico_score'],
    bins=fico_bins,
    labels=fico_labels,
    right=False,  #The bins are inclusive on the left
    include_lowest=True
)
print("Top FICO bins created:\n")
print(accepted_sample_filtered['fico_bin'].value_counts())

Top FICO bins created:

fico_bin
Good (670-739)          96065
Subprime/Poor (<670)    24035
Very Good (740-799)     12985
Excellent (800+)         1526
Name: count, dtype: int64


In [50]:
# validate fico bins
fico_risk_analysis = accepted_sample_filtered.groupby('fico_bin', observed=True)['is_default'].mean()
print('Observed Default Rate by Fico Bins:\n ')
print(fico_risk_analysis.sort_values(ascending=False).map('{:.2%}'.format))

Observed Default Rate by Fico Bins:
 
fico_bin
Subprime/Poor (<670)    26.24%
Good (670-739)          20.01%
Very Good (740-799)     10.03%
Excellent (800+)         6.68%
Name: is_default, dtype: object
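
Since the bins are left-closed (`right=False`), a score that sits exactly on a boundary belongs to the higher tier. The sketch below checks that behavior on a few hypothetical boundary scores; the upper edge of 851 is an assumption chosen for the toy data, whereas the notebook derives it from the sample maximum.

```python
import pandas as pd

# Hypothetical scores placed on and just below each tier boundary.
scores = pd.Series([669.5, 670.0, 739.5, 740.0, 800.0])

bins = [0, 670, 740, 800, 851]  # 851 is an assumed upper edge for this toy example
labels = [
    "Subprime/Poor (<670)",
    "Good (670-739)",
    "Very Good (740-799)",
    "Excellent (800+)",
]

# right=False makes each interval [left, right), so 670 lands in "Good".
binned = pd.cut(scores, bins=bins, labels=labels, right=False)

for score, tier in zip(scores, binned):
    print(score, "->", tier)
```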


#### 4. Secondary Feature Engineering (Categorical)
**Purpose:**  
To clean and consolidate granular categorical data (e.g., `purpose`, `term`) into a limited set of high-level, actionable categories (e.g., 'Debt Cons', 'Other', '36 Mo'). This reduces noise, ensures every segment is large enough for robust risk analysis, and provides clear, digestible insights for policy recommendations.

##### 4.1 Simplify `purpose` Feature

In [51]:
top_purposes = ['debt_consolidation', 'credit_card', 'home_improvement']
# Keep the three largest purposes and collapse everything else into 'Other'
accepted_sample_filtered['purpose_grouped'] = accepted_sample_filtered['purpose'].where(
    accepted_sample_filtered['purpose'].isin(top_purposes), 'Other'
)

# Validate Loan Purpose Groups
purpose_risk_analysis = accepted_sample_filtered.groupby('purpose_grouped')['is_default'].mean()
print('Observed Default Rate by Loan Purpose Group:\n')
print(purpose_risk_analysis.sort_values(ascending=False).map('{:.2%}'.format))

Observed Default Rate by Loan Purpose Group:

purpose_grouped
debt_consolidation    21.21%
Other                 20.99%
home_improvement      18.00%
credit_card           16.83%
Name: is_default, dtype: object


##### 4.2 Clean `term`

In [52]:
accepted_sample_filtered['term_num'] =  (
    accepted_sample_filtered['term']
    .str.replace(' months','', regex=False)
    .str.strip()
    .astype(int)
)
# Validate Loan term
term_risk_analysis = accepted_sample_filtered.groupby('term_num')['is_default'].mean()
print('Observed Default Rate by Loan Term:\n')
print(term_risk_analysis.map('{:.2%}'.format))

Observed Default Rate by Loan Term:

term_num
36    16.07%
60    32.38%
Name: is_default, dtype: object
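
The string replacement above assumes every value ends in the literal text `' months'`. As a hedged alternative, a regular expression can pull out the digits regardless of surrounding whitespace or wording. This is a sketch on hypothetical values (`term`, `term_num_alt`), not a change to the notebook's method.

```python
import pandas as pd

# Hypothetical raw term strings with inconsistent whitespace.
term = pd.Series([" 36 months", "60 months", "36 months "])

# Extract the first run of digits and convert it to an integer.
term_num_alt = term.str.extract(r"(\d+)", expand=False).astype(int)

print(term_num_alt.tolist())  # [36, 60, 36]
```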


#### 5. Final Cleanup and Export
**Purpose:**  
To finalize the engineered dataset by selecting only the essential analytical features (risk segments, financial terms, and the target variable). This step establishes a data checkpoint by exporting the clean subset to a new file (`final_loan_data.csv`). Exporting preserves the data cleaning and feature engineering steps, so subsequent analysis notebooks (EDA, Visualization) can load a prepared, concise dataset without repeating the processing.

In [None]:
final_columns = [
    'is_default', # Target variable
    # key loan terms
    'amount_requested',
    'funded_amount',
    'interest_rate',
    'installment',
    'term_num',
    # core credit risk segments
    'fico_bin',
    'dti_quintile',
    'income_brackets',
    'purpose_grouped',
    # core credit risk scores
    'fico_score',
    'annual_income',
    'debt_to_income_ratio',
    # original risk grades
    'grade',
    'sub_grade',
    # other
    'total_bc_limit',
    'total_il_high_credit_limit',
    # date
    'application_date'
]

# Save the sampled, engineered dataset to CSV
final_loan_data = accepted_sample_filtered[final_columns].copy()
final_loan_data.to_csv('../data/sample_data/final_loan_data.csv', index=False)
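
One caveat when using a CSV as the checkpoint: CSV stores text only, so the ordered categorical dtype produced by `pd.cut` and `pd.qcut` is not kept, and the next notebook will read the bins back as plain strings. The sketch below shows the round trip on toy data and one way to restore the order after loading. `fico_order`, `toy`, and the in-memory buffer are hypothetical stand-ins for the real file.

```python
import io

import pandas as pd

# Tier order as defined for fico_bin, lowest to highest.
fico_order = [
    "Subprime/Poor (<670)",
    "Good (670-739)",
    "Very Good (740-799)",
    "Excellent (800+)",
]

# Hypothetical toy frame holding an ordered categorical column.
toy = pd.DataFrame({
    "fico_bin": pd.Categorical(
        ["Good (670-739)", "Excellent (800+)"],
        categories=fico_order,
        ordered=True,
    )
})

# Round-trip through CSV using an in-memory buffer in place of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(reloaded["fico_bin"].dtype)  # no longer categorical

# Restore the ordered categorical dtype after loading.
fico_dtype = pd.CategoricalDtype(categories=fico_order, ordered=True)
restored = reloaded["fico_bin"].astype(fico_dtype)
print(restored.cat.categories.tolist())
```

Restoring the order matters mainly for plots and sorted group summaries, where string columns would otherwise sort alphabetically.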

#### 6. Summary and Next Steps
##### Summary

This notebook successfully completed all necessary **Feature Engineering (FE)** and data preparation steps on the stratified sample, resulting in a dataset ready for advanced risk analysis.

1. Target Creation: The dataset was filtered to loans with definitive outcomes (Fully Paid vs. Charged Off) to create the binary target variable, is_default.
2. Core Segment Engineering: We transformed the most predictive continuous variables into discrete, actionable risk segments:
  - `fico_bin`: Standard industry tiers (e.g., Subprime, Good).
  - `dti_quintile`: Five equal-sized population groups based on Debt-to-Income (DTI).
  - `income_brackets`: Fixed tiers for Annual Income (< $50k, etc.).
3. Secondary Feature Cleaning: Critical categorical variables were cleaned and validated:
  - `purpose_grouped`: Consolidated rare `purpose` categories into a single 'Other' bucket.
  - `term_num`: Cleaned the loan term from a string to a numeric integer (36 or 60).
4. Validation: The `dti_quintile`, `fico_bin`, `purpose_grouped`, and `term_num` features were validated by calculating the Observed Default Rate (ODR) per segment, which confirmed the expected risk trends; `income_brackets` was checked by population share only.
5. Checkpoint: The final clean, engineered sample was saved as `final_loan_data.csv`, serving as the input for the next phase.

##### Next Steps

The next notebook will focus entirely on Exploratory Data Analysis (EDA) and Visualization to graphically present the key risk findings and prepare the data for predictive modeling.

**Planned Tasks:**
1. Risk Trend Visualization: Plot the Observed Default Rate (ODR) for every engineered segment (`fico_bin`, `dti_quintile`, `income_brackets`) to visually demonstrate their predictive power.
2. Univariate Analysis: Visualize the distributions of key features (`amount_requested`, `interest_rate`) to understand the borrower population.
3. Bivariate Analysis: Investigate correlations and interactions between features (e.g., Interest Rate vs. Default Rate across different FICO Bins).
4. Conclusion: Summarize the key findings that will directly inform the final predictive model design and business policy recommendations.

The next phase will be documented in [`3_Bivariate_EDA.ipynb`](3_Bivariate_EDA.IPYNB).