# Credit Risk & Loan Performance: Data Cleaning

#### Author: Satveer Kaur
#### Date: 2025-10-17
#### Notebook Purpose:
This notebook performs **data cleaning and preprocessing** on the LendingClub Accepted and Rejected Loans datasets. 
The goal is to:
1. Align CSV columns with the SQL schema for database import.
2. Handle missing or inconsistent data.
3. Prepare cleaned sample datasets for reproducibility and GitHub.

#### Background:
You are a data analyst at LendingClub tasked with evaluating loan performance. 
Accurate data preparation is critical for risk analytics and downstream exploratory analysis and modeling.


#### 1. Setup

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np

#### 2. Load Raw Data

In [6]:
# Load accepted and rejected loans CSVs
accepted_loans = pd.read_csv('../data/accepted_loans.csv', low_memory=False)
rejected_loans = pd.read_csv('../data/rejected_loans.csv')

# Inspect first few rows of accepted_loans 
accepted_loans.head()

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,...,hardship_payoff_balance_amount,hardship_last_payment_amount,disbursement_method,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term
0,68407277,,3600.0,3600.0,3600.0,36 months,13.99,123.03,C,C4,...,,,Cash,N,,,,,,
1,68355089,,24700.0,24700.0,24700.0,36 months,11.99,820.28,C,C1,...,,,Cash,N,,,,,,
2,68341763,,20000.0,20000.0,20000.0,60 months,10.78,432.66,B,B4,...,,,Cash,N,,,,,,
3,66310712,,35000.0,35000.0,35000.0,60 months,14.85,829.9,C,C5,...,,,Cash,N,,,,,,
4,68476807,,10400.0,10400.0,10400.0,60 months,22.45,289.91,F,F1,...,,,Cash,N,,,,,,


In [5]:
# Inspect first few rows of rejected_loans
rejected_loans.head()

Unnamed: 0,Amount Requested,Application Date,Loan Title,Risk_Score,Debt-To-Income Ratio,Zip Code,State,Employment Length,Policy Code
0,1000.0,2007-05-26,Wedding Covered but No Honeymoon,693.0,10%,481xx,NM,4 years,0.0
1,1000.0,2007-05-26,Consolidating Debt,703.0,10%,010xx,MA,< 1 year,0.0
2,11000.0,2007-05-27,Want to consolidate my debt,715.0,10%,212xx,MD,1 year,0.0
3,6000.0,2007-05-27,waksman,698.0,38.64%,017xx,MA,< 1 year,0.0
4,1500.0,2007-05-27,mdrigo,509.0,9.43%,209xx,MD,< 1 year,0.0
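The rejected-loans headers contain spaces and hyphens (for example `Amount Requested` and `Debt-To-Income Ratio`), which usually need normalizing before a database import. The SQL schema itself is not shown in this notebook, so the sketch below assumes the target columns use lowercase snake_case; the helper `to_snake_case` is a hypothetical name introduced only for illustration.

```python
import re

import pandas as pd


def to_snake_case(name: str) -> str:
    """Lowercase a column label and collapse runs of non-alphanumerics into '_'."""
    return re.sub(r"[^0-9a-zA-Z]+", "_", name.strip()).strip("_").lower()


# Tiny illustrative frame that reuses header names from the rejected-loans file
sample = pd.DataFrame(
    {
        "Amount Requested": [1000.0],
        "Debt-To-Income Ratio": ["10%"],
        "Risk_Score": [693.0],
    }
)

# rename() accepts a callable and applies it to every column label
renamed = sample.rename(columns=to_snake_case)
print(list(renamed.columns))
```

If the actual schema uses different names, an explicit mapping dictionary passed to `rename(columns=...)` would be the safer choice, since it fails visibly when a header is missing from the mapping.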


#### 3. Inspect Data

In [13]:
# Basic Info and Summary
accepted_loans.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2260701 entries, 0 to 2260700
Columns: 151 entries, id to settlement_term
dtypes: float64(113), object(38)
memory usage: 2.5+ GB
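With 151 columns, `info()` omits the per-column non-null counts, and the `head()` preview above already shows columns such as `member_id` that appear empty. One possible way to quantify this is a per-column missing-value summary. The following is a minimal sketch on a toy frame; `missing_summary` is a hypothetical helper, and the toy values are for illustration only, not results from the real dataset.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and fractions, most-missing first."""
    summary = pd.DataFrame(
        {
            "n_missing": df.isna().sum(),
            "frac_missing": df.isna().mean(),
        }
    )
    return summary.sort_values("frac_missing", ascending=False)


# Toy frame (illustrative values only, not taken from the full dataset)
demo = pd.DataFrame(
    {
        "id": [68407277, 68355089, 68341763],
        "member_id": [np.nan, np.nan, np.nan],
        "loan_amnt": [3600.0, np.nan, 20000.0],
    }
)

report = missing_summary(demo)
print(report)
```

On the full data the same call would be `missing_summary(accepted_loans)`; columns with a missing fraction near 1.0 are candidates for dropping, subject to what the SQL schema expects.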


In [11]:
rejected_loans.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27648741 entries, 0 to 27648740
Data columns (total 9 columns):
 #   Column                Dtype  
---  ------                -----  
 0   Amount Requested      float64
 1   Application Date      object 
 2   Loan Title            object 
 3   Risk_Score            float64
 4   Debt-To-Income Ratio  object 
 5   Zip Code              object 
 6   State                 object 
 7   Employment Length     object 
 8   Policy Code           float64
dtypes: float64(3), object(6)
memory usage: 1.9+ GB
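The summary above shows `Application Date` and `Debt-To-Income Ratio` stored as `object`, even though the preview suggests they hold ISO dates and percentage strings. A possible conversion is sketched below on two values copied from the `head()` output. It assumes every date follows `YYYY-MM-DD` and every ratio ends in `%`; anything that fails to parse becomes `NaT`/`NaN` through `errors="coerce"` instead of raising.

```python
import pandas as pd

# Two rows copied from the rejected_loans preview above
demo = pd.DataFrame(
    {
        "Application Date": ["2007-05-26", "2007-05-27"],
        "Debt-To-Income Ratio": ["10%", "38.64%"],
    }
)

cleaned = demo.copy()

# Parse ISO-formatted date strings into datetime64 values
cleaned["Application Date"] = pd.to_datetime(
    cleaned["Application Date"], format="%Y-%m-%d", errors="coerce"
)

# Strip the trailing '%' and convert to float (values stay in percent units)
cleaned["Debt-To-Income Ratio"] = pd.to_numeric(
    cleaned["Debt-To-Income Ratio"].str.rstrip("%"), errors="coerce"
)

print(cleaned.dtypes)
```

Converting these columns also tends to lower memory usage compared with `object` storage, although the exact savings on the full 27.6 million rows would need to be measured.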
