## Notebook 1: 01_data_cleaning.ipynb

#### Author: Satveer Kaur
#### Date: 2025-10-18
#### Notebook Purpose:
This notebook performs **initial data ingestion, cleaning, and structural preparation** on the full loan portfolio dataset.

#### Primary Goal:
To establish a stable, high-quality base DataFrame that is ready for feature engineering in subsequent notebooks.

#### Key Actions:
1.  **Ingestion:** Load the full, raw loan data to ensure total portfolio volume is preserved.
2.  **Data Hygiene:** Identify and manage missing values in critical columns (e.g., employment length, revolving utilization, issue date, loan status) to prevent calculation errors.
3.  **Structural Cleaning:** Standardize data types, convert date fields, and remove irrelevant identifier or highly sparse columns.

#### Context:
You are a data analyst at LendingClub, and accurate data preparation is critical for building reliable credit risk analytics models and dashboards. The clean output of this notebook serves as the foundation for quantifying loan performance.

#### 1. Setup and Data Ingestion
**Purpose:** Import the core library (`pandas`) required for data manipulation and cleaning, then load the full raw loan data so its quality can be assessed in the next section.

In [2]:
# Import necessary libraries
import pandas as pd
# import numpy as np

# Load full_loan_data CSV
df = pd.read_csv('../data/raw/full_loan_data.csv', low_memory=False)

#### 2. Initial Data Quality Assessment
**Purpose:** Quickly assess the data's structure, measure the scale of the missing-data problem, and locate columns that are immediate candidates for removal due to excessive nulls or irrelevance.

In [3]:
# check data shape and columns
print(f'Total rows: {df.shape[0]:,.0f} | Total columns: {df.shape[1]}')

# Analyze Missing Values (Top 20 Columns)
# Calculate the percentage of missing values for all columns
missing_data = df.isna().sum().sort_values(ascending=False)
missing_percentage = (missing_data / len(df)) * 100

# Combine into a DataFrame and display the top 20
missing_info = pd.DataFrame({
    'Missing Count': missing_data,
    'Missing Percent': missing_percentage
})

print('\nTop 20 Columns with Missing Data')
# Filter to show only columns with at least one missing value
display(missing_info[missing_info['Missing Count'] > 0].head(20).style.format({'Missing Count': '{:,.0f}', 'Missing Percent': '{:.2f}%'}))

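# Quick data type review (for cleaning planning)
print("\n--- Data Type Summary (Top) ---")
# Note: df.info() prints its report directly and returns None,
# so wrapping it in print() also emits a trailing 'None'
print(df.info(verbose=False, memory_usage='deep'))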

Total rows: 2,260,701 | Total columns: 151

Top 20 Columns with Missing Data


| Column | Missing Count | Missing Percent |
|---|---|---|
| member_id | 2260701 | 100.00% |
| orig_projected_additional_accrued_interest | 2252050 | 99.62% |
| hardship_reason | 2249784 | 99.52% |
| hardship_payoff_balance_amount | 2249784 | 99.52% |
| hardship_last_payment_amount | 2249784 | 99.52% |
| payment_plan_start_date | 2249784 | 99.52% |
| hardship_type | 2249784 | 99.52% |
| hardship_status | 2249784 | 99.52% |
| hardship_start_date | 2249784 | 99.52% |
| deferral_term | 2249784 | 99.52% |



--- Data Type Summary (Top) ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2260701 entries, 0 to 2260700
Columns: 151 entries, id to settlement_term
dtypes: float64(113), object(38)
memory usage: 5.9 GB
None


#### 3. Structural Cleaning and Irrelevant Column Removal 
**Purpose:** Streamline the DataFrame by dropping non-analytical, redundant, or excessively null columns that would introduce noise into the feature engineering process.

In [4]:
# Drop redundant and non-analytical columns
# These columns are unique identifiers, free-text fields, administrative flags,
# or hardship and debt-settlement fields recorded after origination, none of
# which are relevant to the initial risk assessment.
drop_cols = [
    'member_id', 'url', 'desc', 'title', 'pymnt_plan', 'initial_list_status',
    'out_prncp_inv', # post-issue metric

    # Hardship and debt-settlement fields (post-origination actions)
    'hardship_flag', 'hardship_type', 'hardship_reason', 'hardship_status',
    'deferral_term', 'hardship_amount', 'hardship_start_date', 'hardship_end_date',
    'payment_plan_start_date', 'hardship_length', 'hardship_dpd',
    'hardship_loan_status', 'orig_projected_additional_accrued_interest',
    'hardship_payoff_balance_amount', 'hardship_last_payment_amount',
    'disbursement_method', 'debt_settlement_flag', 'debt_settlement_flag_date',
    'settlement_status', 'settlement_date', 'settlement_amount',
    'settlement_percentage', 'settlement_term'
]

# errors='ignore' skips any listed column that is not present in the DataFrame
df.drop(columns=drop_cols, inplace=True, errors='ignore')

# Drop columns based on a high-null threshold:
# keep only columns whose share of nulls is below 70%
threshold = 0.7
# cols_before is counted after the explicit drop above, so the difference
# reported below covers only the high-null step
cols_before = df.shape[1]
df = df.loc[:, df.isnull().mean() < threshold]
cols_after = df.shape[1]

print('Dropped non-analytical and high-null columns.')
print(f'Columns remaining: {cols_after} (Dropped {cols_before - cols_after} columns)')

Dropped non-analytical and high-null columns.
Columns remaining: 102 (Dropped 19 columns)
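
The threshold rule in the cell above can be checked on a small example. The sketch below uses a toy DataFrame with made-up column names (`mostly_null`, `half_null`, `complete`), not the loan data, to show that `df.isnull().mean()` returns the fraction of nulls per column and that the `< threshold` mask removes every column at or above 70% null.

```python
import numpy as np
import pandas as pd

# Toy frame; column names and values are illustrative only.
toy = pd.DataFrame({
    "mostly_null": [np.nan, np.nan, np.nan, 1.0],  # 75% null
    "half_null": [np.nan, np.nan, 2.0, 3.0],       # 50% null
    "complete": [1.0, 2.0, 3.0, 4.0],              # 0% null
})

threshold = 0.7
null_share = toy.isnull().mean()  # fraction of nulls per column

# Columns at or above the threshold are removed; the rest are kept.
to_drop = null_share[null_share >= threshold].index.tolist()
kept = toy.loc[:, null_share < threshold]

print(null_share.to_dict())
print("Dropped:", to_drop)
print("Kept:", kept.columns.tolist())
```

Listing `to_drop` before filtering is optional, but it gives an audit trail of which columns the rule removed.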


#### 4. Column Renaming and Standardization
**Purpose:** Rename the remaining columns to use clear, standardized names that match analytical conventions and improve readability for the Tableau dashboard.

In [5]:
# Create a dictionary for renaming
rename_dict = {
    'loan_amnt': 'amount_requested',
    'funded_amnt': 'funded_amount',
    'funded_amnt_inv': 'funded_amount_invested',
    'int_rate': 'interest_rate',
    'emp_length': 'employment_length',
    'annual_inc': 'annual_income',
    'dti': 'debt_to_income_ratio',
    'addr_state': 'state',
    'fico_range_low': 'fico_low',          # Simplified name
    'fico_range_high': 'fico_high',        # Simplified name
    'delinq_2yrs': 'delinquencies_2yrs',
    'open_acc': 'open_accounts',
    'pub_rec': 'public_records',
    'revol_bal': 'revolving_balance',
    'revol_util': 'revolving_utilization',
    'total_acc': 'total_accounts',
    'issue_d': 'issue_date',
    'loan_status': 'loan_status'  # name unchanged; listed for completeness
}

df.rename(columns=rename_dict, inplace=True)

# Preview first 10 columns to confirm
print('Columns renamed for clarity.')
print('\nNew Columns:')
print(df.columns[:10].tolist())


Columns renamed for clarity.

New Columns:
['id', 'amount_requested', 'funded_amount', 'funded_amount_invested', 'term', 'interest_rate', 'installment', 'grade', 'sub_grade', 'emp_title']
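
Because `DataFrame.rename` ignores dictionary keys that are not present by default, a source column removed earlier (for example by the null threshold) would be skipped without any warning. A minimal sketch of a pre-rename check follows; the toy columns and the `rename_map` name are assumptions for illustration.

```python
import pandas as pd

# Empty toy frame that only carries column names; illustrative, not the loan data.
toy = pd.DataFrame(columns=["loan_amnt", "int_rate", "dti"])

rename_map = {
    "loan_amnt": "amount_requested",
    "int_rate": "interest_rate",
    "annual_inc": "annual_income",  # deliberately absent from the toy frame
}

# Keys with no matching column would be silently skipped by rename().
missing_keys = sorted(set(rename_map) - set(toy.columns))
if missing_keys:
    print(f"Rename keys not found in DataFrame: {missing_keys}")

renamed = toy.rename(columns=rename_map)
print(renamed.columns.tolist())
```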


#### 5. Data Type Finalization and Conversion
**Purpose:** Convert key columns to their correct numerical and datetime formats to enable calculations and time-series analysis. This is critical for preventing errors in later steps.

In [6]:
# Explicitly define columns that should be float/int
numeric_cols = [
    'amount_requested', 'funded_amount', 'funded_amount_invested', 'interest_rate',
    'installment', 'annual_income', 'debt_to_income_ratio', 'fico_low', 'fico_high',
    'delinquencies_2yrs', 'open_accounts', 'public_records', 'revolving_balance',
    'revolving_utilization', 'total_accounts', 'policy_code'
]

for col in numeric_cols:
    df[col] = pd.to_numeric(df[col], errors='coerce')  # invalid values are set to NaN

# Convert dates to datetime
df['issue_date'] = pd.to_datetime(df['issue_date'], format='%b-%Y', errors='coerce')  # invalid values are set to NaT

print('Data types converted to required format (datetime and numeric)')
print('\n--- Final Data Types for Key Fields ---')
print(df[['issue_date', 'annual_income', 'interest_rate']].dtypes)

Data types converted to required format (datetime and numeric)

--- Final Data Types for Key Fields ---
issue_date       datetime64[ns]
annual_income           float64
interest_rate           float64
dtype: object
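
Since `errors='coerce'` turns unparseable values into `NaN`/`NaT` without raising, it can help to count how many values were lost in conversion. The sketch below uses made-up sample values and assumes an English locale for the `%b` month abbreviation.

```python
import pandas as pd

# Illustrative raw values: two valid dates, one in the wrong format, one null.
raw_dates = pd.Series(["Dec-2018", "Jan-2016", "2018-12", None])
parsed_dates = pd.to_datetime(raw_dates, format="%b-%Y", errors="coerce")

# Values that were present before conversion but are NaT afterwards.
dates_coerced = int((parsed_dates.isna() & raw_dates.notna()).sum())

# Same check for a numeric column.
raw_numbers = pd.Series(["10.5", "n/a", None])
parsed_numbers = pd.to_numeric(raw_numbers, errors="coerce")
numbers_coerced = int((parsed_numbers.isna() & raw_numbers.notna()).sum())

print(f"Dates lost to coercion: {dates_coerced}")
print(f"Numbers lost to coercion: {numbers_coerced}")
```

A count of zero confirms the declared format matched every non-null value; a non-zero count points to rows worth inspecting before they are dropped later.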


#### 6. Managing Remaining Missing Values (Imputation)
**Purpose:** Finalize the data cleaning by handling the remaining null values in critical or useful columns using appropriate imputation strategies to maximize row retention for the Tableau dashboard.

In [7]:
# Fill NaN with 'UNKNOWN' to ensure all loans are kept and "missingness" becomes its own analytical category.
print(f"Imputing {df['employment_length'].isnull().sum():,.0f} nulls in employment_length.")
df['employment_length'] = df['employment_length'].fillna('UNKNOWN')

# Use the median to fill missing values, as it is less sensitive to outliers than the mean.
revol_util_median = df['revolving_utilization'].median()
df['revolving_utilization'] = df['revolving_utilization'].fillna(revol_util_median)

# Drop any rows where critical fields (loan_status, issue_date) are still null. 
# These rows are fundamentally useless for the analysis.
rows_before_drop = len(df)
df.dropna(subset=['issue_date', 'loan_status'], inplace=True)
rows_after_drop = len(df)

print('Remaining critical null values imputed or dropped.')
print(f'Total rows dropped due to critical nulls: {rows_before_drop - rows_after_drop:,}')
print(f'Total rows remaining after final clean-up: {len(df):,.0f}')

Imputing 146,940 nulls in employment_length.
Remaining critical null values imputed or dropped.
Total rows dropped due to critical nulls: 33
Total rows remaining after final clean-up: 2,260,668
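
To illustrate why the median is preferred over the mean here, the sketch below uses made-up utilization values containing one extreme outlier. The numbers are illustrative and are not drawn from the loan data.

```python
import numpy as np
import pandas as pd

# Toy utilization values with one extreme outlier and one missing entry.
util = pd.Series([10.0, 20.0, 30.0, 1000.0, np.nan])

mean_val = util.mean()      # pulled upward by the outlier
median_val = util.median()  # stays close to the typical values

filled = util.fillna(median_val)

# Categorical imputation: missing values become their own category.
emp = pd.Series(["10+ years", None, "< 1 year"]).fillna("UNKNOWN")

print(f"Mean: {mean_val} | Median: {median_val}")
print(filled.tolist())
print(emp.tolist())
```

With these values the mean is 265.0 and the median is 25.0, so filling with the mean would assign a value that none of the typical rows resemble.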


#### 7. Checkpoint and Export
**Purpose:** Save the cleaned, stabilized DataFrame to the processed folder. This file will be the input for the next notebook in the workflow.

In [8]:
# Save the cleaned dataframe for use in the next notebook
df.to_csv('../data/processed/clean_data_for_sampling.csv', index=False)

print('Notebook 1 Complete. Clean data Saved.')

Notebook 1 Complete. Clean data Saved.
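
One caveat when reloading this file: CSV stores no type information, so the datetime conversion from Section 5 is not preserved on disk. The sketch below demonstrates this with a toy frame and an in-memory buffer (both assumptions for illustration), and shows `parse_dates` as one way to restore the type on read.

```python
import io
import pandas as pd

# Toy frame with a datetime column; values are illustrative.
toy = pd.DataFrame({
    "issue_date": pd.to_datetime(["2018-12-01", "2016-01-01"]),
    "interest_rate": [13.56, 7.21],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Reading back without hints yields text, not datetimes.
buffer.seek(0)
plain = pd.read_csv(buffer)

# parse_dates restores the datetime type on load.
buffer.seek(0)
typed = pd.read_csv(buffer, parse_dates=["issue_date"])

print(plain["issue_date"].dtype)
print(typed["issue_date"].dtype)
```

The next notebook would therefore need to pass `parse_dates=['issue_date']` (or re-run the conversion) when it reads the processed file.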


#### 8. Summary and Next Steps
##### Summary

1. **Project Stabilization:** Loaded the full loan portfolio (2,260,701 rows, 151 columns) and stabilized its structure.

2. **Structural Cleaning:** Dropped highly sparse, redundant, and non-analytical columns, reducing the DataFrame from 151 to 102 columns.

3. **Data Hygiene:** Managed missing values in the targeted columns. Nulls in the categorical field `employment_length` (146,940 rows) were filled with `UNKNOWN` to preserve loan volume, nulls in the numerical field `revolving_utilization` were imputed with the median, and the 33 rows missing `issue_date` or `loan_status` were dropped.

4. **Output Integrity:** The final cleaned DataFrame is now structurally sound and contains 2,260,668 records, making it the reliable base for all subsequent exploration.

##### Next Steps: Defining Analytical Path

The data is now clean, but the core risk drivers must be identified and quantified. The next phase will be exploratory, focusing on isolating the best features for segmentation and risk analysis.

**Action:** Proceed to Notebook 02 to begin iterative investigation.

1. **Target Variable Definition:** The initial step will involve defining the binary target variable, `is_default`, by consolidating terminal `loan_status` categories (e.g., 'Charged Off', 'Default') to quantify the rate of non-performance.

2. **Analytical Efficiency:** To accelerate feature development and testing against the 2.26 million record dataset, a statistically representative sample will be created to optimize the iterative analysis loop.

3. **Feature Validation and Selection:** Investigation will commence on potential risk drivers (e.g., FICO Score, Debt-to-Income Ratio (DTI), and Annual Income) to identify features exhibiting a clear, monotonic relationship with the Observed Default Rate. These findings will determine the final segmentation features for reporting.
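
The three steps above could look roughly like the sketch below. It runs on a made-up 12-row frame, and every choice in it is an assumption rather than the method Notebook 02 will use: the set of statuses treated as default, the 50% sampling fraction, and the FICO band edges are all placeholders.

```python
import pandas as pd

# Toy data; statuses and scores are illustrative only.
toy = pd.DataFrame({
    "loan_status": ["Fully Paid", "Charged Off", "Fully Paid", "Default",
                    "Fully Paid", "Charged Off", "Fully Paid", "Fully Paid",
                    "Charged Off", "Fully Paid", "Fully Paid", "Fully Paid"],
    "fico_low": [700, 660, 720, 665, 690, 670, 740, 710, 695, 675, 730, 705],
})

# 1. Target definition (assumed status set).
default_statuses = {"Charged Off", "Default"}
toy["is_default"] = toy["loan_status"].isin(default_statuses).astype(int)

# 2. Stratified sample: the same fraction is drawn from each target class,
#    so the sample keeps the overall default rate.
sample = toy.groupby("is_default").sample(frac=0.5, random_state=42)

# 3. Observed default rate per FICO band (assumed band edges).
bands = pd.cut(toy["fico_low"], bins=[650, 680, 710, 750])
rate_by_band = toy.groupby(bands, observed=True)["is_default"].mean()

print(f"Full default rate: {toy['is_default'].mean():.2f}")
print(f"Sample default rate: {sample['is_default'].mean():.2f}")
print(rate_by_band)
```

In this toy data the default rate falls as the FICO band rises, which is the kind of monotonic pattern the feature validation step would look for.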
