# Credit Risk & Loan Performance: Data Cleaning

#### Author: Satveer Kaur
#### Date: 2025-10-18
#### Notebook Purpose:
This notebook performs **data cleaning and preprocessing** on the LendingClub Accepted and Rejected Loans datasets. 
The goal is to:
1. Align CSV columns with the SQL schema for database import.
2. Handle missing or inconsistent data.
3. Prepare cleaned sample datasets for reproducibility and GitHub.

#### Background:
You are a data analyst at LendingClub tasked with evaluating loan performance. 
Accurate data preparation is critical for risk analytics and downstream exploratory analysis and modeling.


#### 1. Setup
**Purpose**: Import the core libraries used for data cleaning and transformation, set display options, and create the output folders.
These tools help identify missing data, type mismatches, and inconsistencies.

In [15]:
# Import necessary libraries
import pandas as pd
import numpy as np
import os 

# Display options for cleaner output - show all columns
pd.set_option('display.max_columns', None)

# Create folders for outputs if not already present
os.makedirs('../data/clean_data', exist_ok=True)
os.makedirs('../data/sample_data', exist_ok=True)

#### 2. Load Raw Data

In [16]:
# Load the accepted and rejected loans CSVs
accepted_loans = pd.read_csv('../data/accepted_loans.csv', low_memory=False)
rejected_loans = pd.read_csv('../data/rejected_loans.csv')
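
Both files are large (the inspection output reports roughly 2.5 GB and 1.9 GB in memory), so it can help to select columns and declare types at load time. The sketch below illustrates the idea on a small in-memory stand-in for the rejected-loans file. The sample rows, the `keep` list, and the chosen dtypes are illustrative assumptions, not the loading code this notebook uses.

```python
import io
import pandas as pd

# Tiny stand-in for rejected_loans.csv (illustrative rows only)
raw = io.StringIO(
    "Amount Requested,Application Date,Loan Title,Risk_Score,"
    "Debt-To-Income Ratio,Zip Code,State,Employment Length,Policy Code\n"
    "1000.0,2007-05-26,Consolidating Debt,703.0,10%,010xx,MA,< 1 year,0.0\n"
    "6000.0,2007-05-27,waksman,698.0,38.64%,017xx,MA,< 1 year,0.0\n"
)

# Hypothetical selection: every column except the two dropped later in 4.1
keep = ["Amount Requested", "Application Date", "Risk_Score",
        "Debt-To-Income Ratio", "Zip Code", "State", "Employment Length"]

sample = pd.read_csv(
    raw,
    usecols=keep,                                        # never load unused columns
    dtype={"Zip Code": "string", "State": "category"},   # compact types for repeated text
    parse_dates=["Application Date"],                    # ISO dates parse directly
)
print(sample.dtypes)
```

For the real file, passing `chunksize=` to `pd.read_csv` is another option when the full frame does not fit comfortably in memory.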

#### 3. Inspect Data
**Purpose:** This section provides an overview of data size, structure, and variable types.
It helps determine which columns need cleaning, conversion, or renaming.

In [17]:
# Basic Info and Summary
print("Accepted Loans Info: ")
accepted_loans.info()

Accepted Loans Info: 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2260701 entries, 0 to 2260700
Columns: 151 entries, id to settlement_term
dtypes: float64(113), object(38)
memory usage: 2.5+ GB


In [18]:
# Check data types
accepted_loans.dtypes

id                        object
member_id                float64
loan_amnt                float64
funded_amnt              float64
funded_amnt_inv          float64
                          ...   
settlement_status         object
settlement_date           object
settlement_amount        float64
settlement_percentage    float64
settlement_term          float64
Length: 151, dtype: object

In [19]:
print("Rejected Loans Info")
rejected_loans.info()

Rejected Loans Info
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27648741 entries, 0 to 27648740
Data columns (total 9 columns):
 #   Column                Dtype  
---  ------                -----  
 0   Amount Requested      float64
 1   Application Date      object 
 2   Loan Title            object 
 3   Risk_Score            float64
 4   Debt-To-Income Ratio  object 
 5   Zip Code              object 
 6   State                 object 
 7   Employment Length     object 
 8   Policy Code           float64
dtypes: float64(3), object(6)
memory usage: 1.9+ GB


In [20]:
# Inspect first few rows of accepted_loans 
accepted_loans.head()

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,url,desc,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,fico_range_low,fico_range_high,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,next_pymnt_d,last_credit_pull_d,last_fico_range_high,last_fico_range_low,collections_12_mths_ex_med,mths_since_last_major_derog,policy_code,application_type,annual_inc_joint,dti_joint,verification_status_joint,acc_now_delinq,tot_coll_amt,tot_cur_bal,open_acc_6m,open_act_il,open_il_12m,open_il_24m,mths_since_rcnt_il,total_bal_il,il_util,open_rv_12m,open_rv_24m,max_bal_bc,all_util,total_rev_hi_lim,inq_fi,total_cu_tl,inq_last_12m,acc_open_past_24mths,avg_cur_bal,bc_open_to_buy,bc_util,chargeoff_within_12_mths,delinq_amnt,mo_sin_old_il_acct,mo_sin_old_rev_tl_op,mo_sin_rcnt_rev_tl_op,mo_sin_rcnt_tl,mort_acc,mths_since_recent_bc,mths_since_recent_bc_dlq,mths_since_recent_inq,mths_since_recent_revol_delinq,num_accts_ever_120_pd,num_actv_bc_tl,num_actv_rev_tl,num_bc_sats,num_bc_tl,num_il_tl,num_op_rev_tl,num_rev_accts,num_rev_tl_bal_gt_0,num_sats,num_tl_120dpd_2m,num_tl_30dpd,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,revol_bal_joint,sec_app_fico_range_low,sec_app_fico_range_high,sec_app_earliest_cr_line,sec_app_inq_last_6mths,sec_app_mort_acc,sec_app_open_acc,sec_app_revol_util,sec_app_open_act_il,sec_app_num_rev_accts,sec_app_chargeoff_within_12_mths,sec_app_collections_12_mths_ex_med,sec_app_mths_since_last_major_derog,hardship_flag,hardship_type,hardship_reason,hardship_status,deferral_term,hardship_amount,hardship_start_date,hardship_end_date,payment_plan_start_date,hardship_length,hardship_dpd,hardship_loan_status,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,disbursement_method,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term
0,68407277,,3600.0,3600.0,3600.0,36 months,13.99,123.03,C,C4,leadman,10+ years,MORTGAGE,55000.0,Not Verified,Dec-2015,Fully Paid,n,https://lendingclub.com/browse/loanDetail.acti...,,debt_consolidation,Debt consolidation,190xx,PA,5.91,0.0,Aug-2003,675.0,679.0,1.0,30.0,,7.0,0.0,2765.0,29.7,13.0,w,0.0,0.0,4421.723917,4421.72,3600.0,821.72,0.0,0.0,0.0,Jan-2019,122.67,,Mar-2019,564.0,560.0,0.0,30.0,1.0,Individual,,,,0.0,722.0,144904.0,2.0,2.0,0.0,1.0,21.0,4981.0,36.0,3.0,3.0,722.0,34.0,9300.0,3.0,1.0,4.0,4.0,20701.0,1506.0,37.2,0.0,0.0,148.0,128.0,3.0,3.0,1.0,4.0,69.0,4.0,69.0,2.0,2.0,4.0,2.0,5.0,3.0,4.0,9.0,4.0,7.0,0.0,0.0,0.0,3.0,76.9,0.0,0.0,0.0,178050.0,7746.0,2400.0,13734.0,,,,,,,,,,,,,,N,,,,,,,,,,,,,,,Cash,N,,,,,,
1,68355089,,24700.0,24700.0,24700.0,36 months,11.99,820.28,C,C1,Engineer,10+ years,MORTGAGE,65000.0,Not Verified,Dec-2015,Fully Paid,n,https://lendingclub.com/browse/loanDetail.acti...,,small_business,Business,577xx,SD,16.06,1.0,Dec-1999,715.0,719.0,4.0,6.0,,22.0,0.0,21470.0,19.2,38.0,w,0.0,0.0,25679.66,25679.66,24700.0,979.66,0.0,0.0,0.0,Jun-2016,926.35,,Mar-2019,699.0,695.0,0.0,,1.0,Individual,,,,0.0,0.0,204396.0,1.0,1.0,0.0,1.0,19.0,18005.0,73.0,2.0,3.0,6472.0,29.0,111800.0,0.0,0.0,6.0,4.0,9733.0,57830.0,27.1,0.0,0.0,113.0,192.0,2.0,2.0,4.0,2.0,,0.0,6.0,0.0,5.0,5.0,13.0,17.0,6.0,20.0,27.0,5.0,22.0,0.0,0.0,0.0,2.0,97.4,7.7,0.0,0.0,314017.0,39475.0,79300.0,24667.0,,,,,,,,,,,,,,N,,,,,,,,,,,,,,,Cash,N,,,,,,
2,68341763,,20000.0,20000.0,20000.0,60 months,10.78,432.66,B,B4,truck driver,10+ years,MORTGAGE,63000.0,Not Verified,Dec-2015,Fully Paid,n,https://lendingclub.com/browse/loanDetail.acti...,,home_improvement,,605xx,IL,10.78,0.0,Aug-2000,695.0,699.0,0.0,,,6.0,0.0,7869.0,56.2,18.0,w,0.0,0.0,22705.924294,22705.92,20000.0,2705.92,0.0,0.0,0.0,Jun-2017,15813.3,,Mar-2019,704.0,700.0,0.0,,1.0,Joint App,71000.0,13.85,Not Verified,0.0,0.0,189699.0,0.0,1.0,0.0,4.0,19.0,10827.0,73.0,0.0,2.0,2081.0,65.0,14000.0,2.0,5.0,1.0,6.0,31617.0,2737.0,55.9,0.0,0.0,125.0,184.0,14.0,14.0,5.0,101.0,,10.0,,0.0,2.0,3.0,2.0,4.0,6.0,4.0,7.0,3.0,6.0,0.0,0.0,0.0,0.0,100.0,50.0,0.0,0.0,218418.0,18696.0,6200.0,14877.0,,,,,,,,,,,,,,N,,,,,,,,,,,,,,,Cash,N,,,,,,
3,66310712,,35000.0,35000.0,35000.0,60 months,14.85,829.9,C,C5,Information Systems Officer,10+ years,MORTGAGE,110000.0,Source Verified,Dec-2015,Current,n,https://lendingclub.com/browse/loanDetail.acti...,,debt_consolidation,Debt consolidation,076xx,NJ,17.06,0.0,Sep-2008,785.0,789.0,0.0,,,13.0,0.0,7802.0,11.6,17.0,w,15897.65,15897.65,31464.01,31464.01,19102.35,12361.66,0.0,0.0,0.0,Feb-2019,829.9,Apr-2019,Mar-2019,679.0,675.0,0.0,,1.0,Individual,,,,0.0,0.0,301500.0,1.0,1.0,0.0,1.0,23.0,12609.0,70.0,1.0,1.0,6987.0,45.0,67300.0,0.0,1.0,0.0,2.0,23192.0,54962.0,12.1,0.0,0.0,36.0,87.0,2.0,2.0,1.0,2.0,,,,0.0,4.0,5.0,8.0,10.0,2.0,10.0,13.0,5.0,13.0,0.0,0.0,0.0,1.0,100.0,0.0,0.0,0.0,381215.0,52226.0,62500.0,18000.0,,,,,,,,,,,,,,N,,,,,,,,,,,,,,,Cash,N,,,,,,
4,68476807,,10400.0,10400.0,10400.0,60 months,22.45,289.91,F,F1,Contract Specialist,3 years,MORTGAGE,104433.0,Source Verified,Dec-2015,Fully Paid,n,https://lendingclub.com/browse/loanDetail.acti...,,major_purchase,Major purchase,174xx,PA,25.37,1.0,Jun-1998,695.0,699.0,3.0,12.0,,12.0,0.0,21929.0,64.5,35.0,w,0.0,0.0,11740.5,11740.5,10400.0,1340.5,0.0,0.0,0.0,Jul-2016,10128.96,,Mar-2018,704.0,700.0,0.0,,1.0,Individual,,,,0.0,0.0,331730.0,1.0,3.0,0.0,3.0,14.0,73839.0,84.0,4.0,7.0,9702.0,78.0,34000.0,2.0,1.0,3.0,10.0,27644.0,4567.0,77.5,0.0,0.0,128.0,210.0,4.0,4.0,6.0,4.0,12.0,1.0,12.0,0.0,4.0,6.0,5.0,9.0,10.0,7.0,19.0,6.0,12.0,0.0,0.0,0.0,4.0,96.6,60.0,0.0,0.0,439570.0,95768.0,20300.0,88097.0,,,,,,,,,,,,,,N,,,,,,,,,,,,,,,Cash,N,,,,,,


In [21]:
# Inspect first few rows of rejected_loans
rejected_loans.head()

Unnamed: 0,Amount Requested,Application Date,Loan Title,Risk_Score,Debt-To-Income Ratio,Zip Code,State,Employment Length,Policy Code
0,1000.0,2007-05-26,Wedding Covered but No Honeymoon,693.0,10%,481xx,NM,4 years,0.0
1,1000.0,2007-05-26,Consolidating Debt,703.0,10%,010xx,MA,< 1 year,0.0
2,11000.0,2007-05-27,Want to consolidate my debt,715.0,10%,212xx,MD,1 year,0.0
3,6000.0,2007-05-27,waksman,698.0,38.64%,017xx,MA,< 1 year,0.0
4,1500.0,2007-05-27,mdrigo,509.0,9.43%,209xx,MD,< 1 year,0.0
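
With 151 columns in the accepted data, reading `head()` is not enough to judge which columns are mostly empty. One possible way to summarize this is a per-column report of dtype and missing share. The helper name `missing_report` and the toy frame below are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Return dtype and percentage of missing values per column, most-missing first."""
    report = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing_pct": df.isna().mean().mul(100).round(2),
    })
    return report.sort_values("missing_pct", ascending=False)

# Toy frame standing in for accepted_loans (values are illustrative)
demo = pd.DataFrame({
    "loan_amnt": [3600.0, 24700.0, 20000.0, 35000.0],
    "member_id": [np.nan, np.nan, np.nan, np.nan],
    "mths_since_last_delinq": [30.0, 6.0, np.nan, np.nan],
})
print(missing_report(demo))
```

Columns that are entirely missing, such as `member_id` in the preview above, are natural candidates for the drop list in the next section.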


#### 4. Data Cleaning
**Purpose**: 
Prepare the raw datasets (`accepted_loans.csv` and `rejected_loans.csv`) for analysis and modeling by ensuring consistency, removing errors, and matching the SQL schema.

##### 4.1 Drop irrelevant columns


In [22]:
# Columns to drop in accepted_loans
drop_cols_accepted = [
    'member_id', 'url', 'desc', 'title', 'pymnt_plan', 'initial_list_status',
    'out_prncp_inv', 'hardship_flag', 'hardship_type', 'hardship_reason',
    'hardship_status', 'deferral_term', 'hardship_amount', 'hardship_start_date',
    'hardship_end_date', 'payment_plan_start_date', 'hardship_length', 'hardship_dpd',
    'hardship_loan_status', 'orig_projected_additional_accrued_interest',
    'hardship_payoff_balance_amount', 'hardship_last_payment_amount',
    'disbursement_method', 'debt_settlement_flag', 'debt_settlement_flag_date',
    'settlement_status', 'settlement_date', 'settlement_amount',
    'settlement_percentage', 'settlement_term'
]

accepted_loans_clean = accepted_loans.drop(columns=drop_cols_accepted, errors='ignore')  # Skip any listed column that is not present

# Columns to drop in rejected_loans
drop_cols_rejected = [
    'Loan Title', 'Policy Code'  # free-text or internal code
]

rejected_loans_clean = rejected_loans.drop(columns=drop_cols_rejected, errors='ignore')
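
Because `errors='ignore'` silently skips names that do not exist, a misspelled column name in a drop list goes unnoticed. A minimal sketch of one way to surface such names follows. The helper `drop_and_report` and the toy frame are hypothetical and only illustrate the check.

```python
import pandas as pd

def drop_and_report(df: pd.DataFrame, columns: list) -> tuple:
    """Drop the listed columns and return (new_frame, names_not_found)."""
    not_found = [c for c in columns if c not in df.columns]
    return df.drop(columns=columns, errors="ignore"), not_found

demo = pd.DataFrame({"id": [1], "url": ["x"], "desc": [None]})

# "member_idd" is a deliberate typo to show what the report catches
cleaned, not_found = drop_and_report(demo, ["url", "desc", "member_idd"])
print(list(cleaned.columns))
print(not_found)
```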

##### 4.2 Standardize Column Names

In [23]:
# Rename accepted_loans columns to match SQL schema 
accepted_loans_clean.rename(columns={
    'id': 'id',                                    # Unique identifier
    'loan_amnt': 'amount_requested',               # Requested loan amount
    'funded_amnt': 'funded_amount',                # Amount funded
    'funded_amnt_inv': 'funded_amount_invested',   # Amount invested by investors
    'term': 'term',                                # Loan term
    'int_rate': 'interest_rate',                   # Interest rate
    'installment': 'installment',                  # Monthly installment
    'grade': 'grade',                              # Loan grade
    'sub_grade': 'sub_grade',                      # Loan sub-grade
    'emp_length': 'employment_length',             # Years of employment
    'home_ownership': 'home_ownership',            # Home ownership status
    'annual_inc': 'annual_income',                 # Annual income
    'verification_status': 'verification_status',  # Income verification status
    'dti': 'debt_to_income_ratio',                 # Debt-to-Income ratio
    'zip_code': 'zip_code',                        # Masked ZIP code (first 3 digits, e.g. 190xx)
    'addr_state': 'state',                         # Two-letter state code
    'fico_range_low': 'fico_range_low',            # FICO lower bound
    'fico_range_high': 'fico_range_high',          # FICO upper bound
    'delinq_2yrs': 'delinquencies_2yrs',           # Delinquencies in last 2 years
    'open_acc': 'open_accounts',                   # Number of open accounts
    'pub_rec': 'public_records',                   # Public records
    'revol_bal': 'revolving_balance',              # Revolving balance
    'revol_util': 'revolving_utilization',         # Revolving utilization %
    'total_acc': 'total_accounts',                 # Total number of accounts
    'policy_code': 'policy_code',                  # Internal policy code
    'issue_d': 'application_date',                 # Month the loan was issued (e.g. Dec-2015)
    'loan_status': 'loan_status'                   # Current loan status (e.g. Fully Paid, Current)
}, inplace=True)

# Rename rejected_loans columns to match SQL schema
rejected_loans_clean.rename(columns={
    'Amount Requested': 'amount_requested',
    'Application Date': 'application_date',
    'Risk_Score': 'risk_score',
    'Debt-To-Income Ratio': 'debt_to_income_ratio',
    'Zip Code': 'zip_code',
    'State': 'state',
    'Employment Length': 'employment_length'
}, inplace=True)


# Preview first 10 columns of the cleaned frame to confirm the renames
accepted_loans_clean.columns[:10]


Index(['id', 'amount_requested', 'funded_amount', 'funded_amount_invested',
       'term', 'interest_rate', 'installment', 'grade', 'sub_grade',
       'emp_title'],
      dtype='object')
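
`DataFrame.rename` ignores mapping keys that are not present in the frame, for example a column that was already dropped in 4.1, so a stale or misspelled key produces no error. The sketch below shows one way to list such keys before renaming. The helper `check_rename` and the three-column toy frame are assumptions for illustration only.

```python
import pandas as pd

def check_rename(df: pd.DataFrame, mapping: dict) -> tuple:
    """Split a rename mapping into entries present in df and keys that are absent."""
    present = {old: new for old, new in mapping.items() if old in df.columns}
    absent = [old for old in mapping if old not in df.columns]
    return present, absent

demo = pd.DataFrame(columns=["id", "loan_amnt", "addr_state"])
mapping = {
    "loan_amnt": "amount_requested",
    "addr_state": "state",
    "title": "loan_title",   # hypothetical key for a column that no longer exists
}

present, absent = check_rename(demo, mapping)
renamed = demo.rename(columns=present)
print(list(renamed.columns))
print(absent)
```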

##### 4.3 Convert Data Types

In [24]:
# Convert Numeric columns in accepted_loans_clean to float
numeric_cols_accepted = [
    'amount_requested', 'funded_amount', 'funded_amount_invested', 'interest_rate',
    'installment', 'annual_income', 'debt_to_income_ratio', 'fico_range_low', 'fico_range_high',
    'delinquencies_2yrs', 'open_accounts', 'public_records', 'revolving_balance',
    'revolving_utilization', 'total_accounts', 'policy_code'
]
# Convert Numeric columns in rejected_loans_clean to float
numeric_cols_rejected = [
    'amount_requested', 'risk_score', 'debt_to_income_ratio'
]

# The rejected data stores DTI as text such as "38.64%"; strip the percent sign before numeric conversion
rejected_loans_clean['debt_to_income_ratio'] = rejected_loans_clean['debt_to_income_ratio'].astype(str).str.rstrip('%')

accepted_loans_clean[numeric_cols_accepted] = accepted_loans_clean[numeric_cols_accepted].apply(pd.to_numeric, errors='coerce')  # unparseable values become NaN
rejected_loans_clean[numeric_cols_rejected] = rejected_loans_clean[numeric_cols_rejected].apply(pd.to_numeric, errors='coerce')

# Convert dates to datetime
# Accepted loans: issue_d looks like "Dec-2015", so parse it as month abbreviation plus year
accepted_loans_clean['application_date'] = pd.to_datetime(accepted_loans_clean['application_date'], format='%b-%Y', errors='coerce')  # unparseable values become NaT
# Rejected loans: dates are already ISO formatted (e.g. 2007-05-26)
rejected_loans_clean['application_date'] = pd.to_datetime(rejected_loans_clean['application_date'], format='%Y-%m-%d', errors='coerce')


##### 4.4 Check Cleaned Data

In [25]:
accepted_loans_clean.info()
accepted_loans_clean.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2260701 entries, 0 to 2260700
Columns: 121 entries, id to sec_app_mths_since_last_major_derog
dtypes: datetime64[ns](1), float64(101), object(19)
memory usage: 2.0+ GB


Unnamed: 0,id,amount_requested,funded_amount,funded_amount_invested,term,interest_rate,installment,grade,sub_grade,emp_title,employment_length,home_ownership,annual_income,verification_status,application_date,loan_status,purpose,zip_code,state,debt_to_income_ratio,delinquencies_2yrs,earliest_cr_line,fico_range_low,fico_range_high,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_accounts,public_records,revolving_balance,revolving_utilization,total_accounts,out_prncp,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,next_pymnt_d,last_credit_pull_d,last_fico_range_high,last_fico_range_low,collections_12_mths_ex_med,mths_since_last_major_derog,policy_code,application_type,annual_inc_joint,dti_joint,verification_status_joint,acc_now_delinq,tot_coll_amt,tot_cur_bal,open_acc_6m,open_act_il,open_il_12m,open_il_24m,mths_since_rcnt_il,total_bal_il,il_util,open_rv_12m,open_rv_24m,max_bal_bc,all_util,total_rev_hi_lim,inq_fi,total_cu_tl,inq_last_12m,acc_open_past_24mths,avg_cur_bal,bc_open_to_buy,bc_util,chargeoff_within_12_mths,delinq_amnt,mo_sin_old_il_acct,mo_sin_old_rev_tl_op,mo_sin_rcnt_rev_tl_op,mo_sin_rcnt_tl,mort_acc,mths_since_recent_bc,mths_since_recent_bc_dlq,mths_since_recent_inq,mths_since_recent_revol_delinq,num_accts_ever_120_pd,num_actv_bc_tl,num_actv_rev_tl,num_bc_sats,num_bc_tl,num_il_tl,num_op_rev_tl,num_rev_accts,num_rev_tl_bal_gt_0,num_sats,num_tl_120dpd_2m,num_tl_30dpd,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,revol_bal_joint,sec_app_fico_range_low,sec_app_fico_range_high,sec_app_earliest_cr_line,sec_app_inq_last_6mths,sec_app_mort_acc,sec_app_open_acc,sec_app_revol_util,sec_app_open_act_il,sec_app_num_rev_accts,sec_app_chargeoff_within_12_mths,sec_app_collections_12_mths_ex_med,sec_app_mths_since_last_major_derog
0,68407277,3600.0,3600.0,3600.0,36 months,13.99,123.03,C,C4,leadman,10+ years,MORTGAGE,55000.0,Not Verified,2015-12-01,Fully Paid,debt_consolidation,190xx,PA,5.91,0.0,Aug-2003,675.0,679.0,1.0,30.0,,7.0,0.0,2765.0,29.7,13.0,0.0,4421.723917,4421.72,3600.0,821.72,0.0,0.0,0.0,Jan-2019,122.67,,Mar-2019,564.0,560.0,0.0,30.0,1.0,Individual,,,,0.0,722.0,144904.0,2.0,2.0,0.0,1.0,21.0,4981.0,36.0,3.0,3.0,722.0,34.0,9300.0,3.0,1.0,4.0,4.0,20701.0,1506.0,37.2,0.0,0.0,148.0,128.0,3.0,3.0,1.0,4.0,69.0,4.0,69.0,2.0,2.0,4.0,2.0,5.0,3.0,4.0,9.0,4.0,7.0,0.0,0.0,0.0,3.0,76.9,0.0,0.0,0.0,178050.0,7746.0,2400.0,13734.0,,,,,,,,,,,,,
1,68355089,24700.0,24700.0,24700.0,36 months,11.99,820.28,C,C1,Engineer,10+ years,MORTGAGE,65000.0,Not Verified,2015-12-01,Fully Paid,small_business,577xx,SD,16.06,1.0,Dec-1999,715.0,719.0,4.0,6.0,,22.0,0.0,21470.0,19.2,38.0,0.0,25679.66,25679.66,24700.0,979.66,0.0,0.0,0.0,Jun-2016,926.35,,Mar-2019,699.0,695.0,0.0,,1.0,Individual,,,,0.0,0.0,204396.0,1.0,1.0,0.0,1.0,19.0,18005.0,73.0,2.0,3.0,6472.0,29.0,111800.0,0.0,0.0,6.0,4.0,9733.0,57830.0,27.1,0.0,0.0,113.0,192.0,2.0,2.0,4.0,2.0,,0.0,6.0,0.0,5.0,5.0,13.0,17.0,6.0,20.0,27.0,5.0,22.0,0.0,0.0,0.0,2.0,97.4,7.7,0.0,0.0,314017.0,39475.0,79300.0,24667.0,,,,,,,,,,,,,
2,68341763,20000.0,20000.0,20000.0,60 months,10.78,432.66,B,B4,truck driver,10+ years,MORTGAGE,63000.0,Not Verified,2015-12-01,Fully Paid,home_improvement,605xx,IL,10.78,0.0,Aug-2000,695.0,699.0,0.0,,,6.0,0.0,7869.0,56.2,18.0,0.0,22705.924294,22705.92,20000.0,2705.92,0.0,0.0,0.0,Jun-2017,15813.3,,Mar-2019,704.0,700.0,0.0,,1.0,Joint App,71000.0,13.85,Not Verified,0.0,0.0,189699.0,0.0,1.0,0.0,4.0,19.0,10827.0,73.0,0.0,2.0,2081.0,65.0,14000.0,2.0,5.0,1.0,6.0,31617.0,2737.0,55.9,0.0,0.0,125.0,184.0,14.0,14.0,5.0,101.0,,10.0,,0.0,2.0,3.0,2.0,4.0,6.0,4.0,7.0,3.0,6.0,0.0,0.0,0.0,0.0,100.0,50.0,0.0,0.0,218418.0,18696.0,6200.0,14877.0,,,,,,,,,,,,,
3,66310712,35000.0,35000.0,35000.0,60 months,14.85,829.9,C,C5,Information Systems Officer,10+ years,MORTGAGE,110000.0,Source Verified,2015-12-01,Current,debt_consolidation,076xx,NJ,17.06,0.0,Sep-2008,785.0,789.0,0.0,,,13.0,0.0,7802.0,11.6,17.0,15897.65,31464.01,31464.01,19102.35,12361.66,0.0,0.0,0.0,Feb-2019,829.9,Apr-2019,Mar-2019,679.0,675.0,0.0,,1.0,Individual,,,,0.0,0.0,301500.0,1.0,1.0,0.0,1.0,23.0,12609.0,70.0,1.0,1.0,6987.0,45.0,67300.0,0.0,1.0,0.0,2.0,23192.0,54962.0,12.1,0.0,0.0,36.0,87.0,2.0,2.0,1.0,2.0,,,,0.0,4.0,5.0,8.0,10.0,2.0,10.0,13.0,5.0,13.0,0.0,0.0,0.0,1.0,100.0,0.0,0.0,0.0,381215.0,52226.0,62500.0,18000.0,,,,,,,,,,,,,
4,68476807,10400.0,10400.0,10400.0,60 months,22.45,289.91,F,F1,Contract Specialist,3 years,MORTGAGE,104433.0,Source Verified,2015-12-01,Fully Paid,major_purchase,174xx,PA,25.37,1.0,Jun-1998,695.0,699.0,3.0,12.0,,12.0,0.0,21929.0,64.5,35.0,0.0,11740.5,11740.5,10400.0,1340.5,0.0,0.0,0.0,Jul-2016,10128.96,,Mar-2018,704.0,700.0,0.0,,1.0,Individual,,,,0.0,0.0,331730.0,1.0,3.0,0.0,3.0,14.0,73839.0,84.0,4.0,7.0,9702.0,78.0,34000.0,2.0,1.0,3.0,10.0,27644.0,4567.0,77.5,0.0,0.0,128.0,210.0,4.0,4.0,6.0,4.0,12.0,1.0,12.0,0.0,4.0,6.0,5.0,9.0,10.0,7.0,19.0,6.0,12.0,0.0,0.0,0.0,4.0,96.6,60.0,0.0,0.0,439570.0,95768.0,20300.0,88097.0,,,,,,,,,,,,,


In [26]:
rejected_loans_clean.info()
rejected_loans_clean.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27648741 entries, 0 to 27648740
Data columns (total 7 columns):
 #   Column                Dtype         
---  ------                -----         
 0   amount_requested      float64       
 1   application_date      datetime64[ns]
 2   risk_score            float64       
 3   debt_to_income_ratio  float64       
 4   zip_code              object        
 5   state                 object        
 6   employment_length     object        
dtypes: datetime64[ns](1), float64(3), object(3)
memory usage: 1.4+ GB


Unnamed: 0,amount_requested,application_date,risk_score,debt_to_income_ratio,zip_code,state,employment_length
0,1000.0,2007-05-26,693.0,10.0,481xx,NM,4 years
1,1000.0,2007-05-26,703.0,10.0,010xx,MA,< 1 year
2,11000.0,2007-05-27,715.0,10.0,212xx,MD,1 year
3,6000.0,2007-05-27,698.0,38.64,017xx,MA,< 1 year
4,1500.0,2007-05-27,509.0,9.43,209xx,MD,< 1 year
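
Since `errors='coerce'` turns every unparseable value into `NaN` or `NaT` without raising, `info()` alone cannot show whether a conversion quietly emptied a column. One possible check is to count values that were present before the conversion and null afterwards. The helper `coercion_report` and the sample series below are illustrative assumptions.

```python
import pandas as pd

def coercion_report(before: pd.Series, after: pd.Series) -> dict:
    """Count values that were present before conversion but are null afterwards."""
    lost = before.notna() & after.isna()
    return {
        "rows": len(before),
        "lost": int(lost.sum()),
        "lost_pct": round(100 * lost.mean(), 2),
    }

# Toy DTI column: two valid percentages, one genuinely missing value, one bad token
raw_dti = pd.Series(["10%", "38.64%", None, "n/a"])
parsed = pd.to_numeric(raw_dti.str.rstrip("%"), errors="coerce")

print(coercion_report(raw_dti, parsed))
```

A large `lost_pct` for a column usually points to a format problem, such as a unit suffix or an unexpected date pattern, rather than to genuinely missing data.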


#### 5. Finalize Data Cleaning
**Purpose**: 
Wrap up the data cleaning process by confirming the dataset is ready for analysis and documenting any potential future cleaning or transformation steps that might be required during Exploratory Data Analysis (EDA) or dashboard development.

##### 5.1 Verify Final Structure

In [27]:
print("Accepted Loans Dataset:")
accepted_loans_clean.info()  # info() prints its report directly and returns None
print("\nRejected Loans Dataset:")
rejected_loans_clean.info()

Accepted Loans Dataset:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2260701 entries, 0 to 2260700
Columns: 121 entries, id to sec_app_mths_since_last_major_derog
dtypes: datetime64[ns](1), float64(101), object(19)
memory usage: 2.0+ GB

Rejected Loans Dataset:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27648741 entries, 0 to 27648740
Data columns (total 7 columns):
 #   Column                Dtype         
---  ------                -----         
 0   amount_requested      float64       
 1   application_date      datetime64[ns]
 2   risk_score            float64       
 3   debt_to_income_ratio  float64       
 4   zip_code              object        
 5   state                 object        
 6   employment_length     object        
dtypes: datetime64[ns](1), float64(3), object(3)
memory usage: 1.4+ GB


##### 5.2 Save Cleaned Data

In [29]:
# Save cleaned data to csv files
accepted_loans_clean.to_csv('../data/clean_data/accepted_loans_cleaned.csv', index=False)
rejected_loans_clean.to_csv('../data/clean_data/rejected_loans_cleaned.csv', index=False)
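
CSV files do not store column types, so the datetime conversion from 4.3 is lost on disk and has to be requested again when the cleaned files are read back. The round trip below illustrates this with an in-memory buffer. The toy frame is an assumption standing in for the cleaned data.

```python
import io
import pandas as pd

demo = pd.DataFrame({
    "amount_requested": [1000.0, 6000.0],
    "application_date": pd.to_datetime(["2007-05-26", "2007-05-27"]),
    "zip_code": ["010xx", "017xx"],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
demo.to_csv(buffer, index=False)

buffer.seek(0)
plain = pd.read_csv(buffer)          # application_date comes back as text

buffer.seek(0)
typed = pd.read_csv(
    buffer,
    parse_dates=["application_date"],   # restore the datetime type
    dtype={"zip_code": "string"},       # keep ZIP prefixes as text
)

print(plain.dtypes)
print(typed.dtypes)
```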

##### 5.3 Summary and Next Steps
##### Summary

- Removed irrelevant columns.  
- Standardized column names for consistency.  
- Converted data types to appropriate formats (numeric, datetime).  
- Verified that the cleaned datasets are ready for sampling and visualization.  

##### Next Steps

Although this notebook completes the initial data cleaning phase, additional transformations may be performed during later stages of the project:

- Handle missing values more contextually if they impact analysis.  
- Create derived or engineered variables for deeper insights.  
- Bin numeric features for improved visualization and model interpretability.  
- Detect and manage outliers during exploratory data analysis (EDA).  
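
As an illustration of the derived-variable and binning steps listed above, the sketch below converts `employment_length` labels to numeric years and groups `fico_range_low` into bands. The helper `employment_years`, the choice to map "< 1 year" to 0, and the cut points are all assumptions made for this example, not decisions taken in this project.

```python
import numpy as np
import pandas as pd

def employment_years(value):
    """Map labels such as '< 1 year', '3 years', '10+ years' to a number of years."""
    if pd.isna(value):
        return np.nan
    text = str(value).strip()
    if text.startswith("<"):
        return 0.0                       # assumption: treat "< 1 year" as 0
    digits = "".join(ch for ch in text if ch.isdigit())
    return float(digits) if digits else np.nan

demo = pd.DataFrame({
    "employment_length": ["10+ years", "< 1 year", "3 years", None],
    "fico_range_low": [675.0, 715.0, 785.0, 695.0],
})

demo["employment_years"] = demo["employment_length"].map(employment_years)

# Left-closed bins: [300, 670), [670, 740), [740, 850)
demo["fico_band"] = pd.cut(
    demo["fico_range_low"],
    bins=[300, 670, 740, 850],           # illustrative cut points
    labels=["below_670", "670_739", "740_plus"],
    right=False,
)
print(demo)
```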

The cleaned datasets are now ready for **sampling and exploratory data analysis**, which will be performed in the notebook [`2_data_sampling_and_exploration.ipynb`](2_data_sampling_and_exploration.ipynb).