# Two-Stage Loan Approval & Valuation System

## Project Overview
This project implements an end-to-end machine learning system that simulates how financial institutions evaluate loan applications.
The system follows a **two-stage modeling approach**:

1. **Stage 1 – Loan Approval (Classification)**  
   Determines whether a loan application should be approved or rejected.

2. **Stage 2 – Loan Valuation (Regression)**  
   Predicts the optimal loan amount for approved applicants.

This architecture mirrors real-world banking workflows and helps reduce financial risk while improving decision efficiency.
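
The two-stage flow can be sketched as a gate followed by a conditional valuation step. This is a minimal illustration, not the project's implementation: `two_stage_decision`, `toy_approve`, and `toy_value` are hypothetical names, and the score threshold and income multiplier are made-up stand-ins for trained models.

```python
from typing import Callable, Mapping, Optional

Applicant = Mapping[str, float]


def two_stage_decision(
    applicant: Applicant,
    approve: Callable[[Applicant], bool],
    value: Callable[[Applicant], float],
) -> Optional[float]:
    """Return a predicted loan amount, or None if the application is rejected."""
    # Stage 1: the classifier acts as a gate.
    if not approve(applicant):
        return None
    # Stage 2: valuation runs only for approved applicants.
    return value(applicant)


# Hypothetical stand-ins for trained models (illustrative rules only).
def toy_approve(applicant: Applicant) -> bool:
    return applicant["cibil_score"] >= 550  # assumed threshold


def toy_value(applicant: Applicant) -> float:
    return 3.0 * applicant["income_annum"]  # assumed multiplier


approved = two_stage_decision(
    {"cibil_score": 778, "income_annum": 9_600_000}, toy_approve, toy_value
)
rejected = two_stage_decision(
    {"cibil_score": 417, "income_annum": 4_100_000}, toy_approve, toy_value
)
```

Keeping the stages as separate callables means the regressor never sees rejected applications, either at training time or at inference time.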


## Business Objectives

The primary business objectives of this project are:

- Minimize financial risk by accurately identifying high-risk loan applicants
- Improve operational efficiency by automating loan approval decisions
- Ensure fair and consistent loan valuation using data-driven methods
- Reduce manual bias in the credit evaluation process

The classification stage prioritizes recall on the Rejected (high-risk) class, so that as few default-prone applicants as possible are approved by mistake.
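
To make the recall objective concrete, the sketch below computes the four classification metrics from confusion counts, treating Rejected as the positive class. That choice is an assumption made here so that a missed positive corresponds to a false approval; the function name `classification_metrics` and the toy labels are hypothetical.

```python
def classification_metrics(y_true, y_pred, positive):
    """Compute accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        # With Rejected as positive, a false negative is a false approval.
        "false_approvals": fn,
    }


# Toy labels for illustration only.
y_true = ["Rejected", "Rejected", "Rejected", "Approved", "Approved", "Approved"]
y_pred = ["Rejected", "Rejected", "Approved", "Approved", "Approved", "Rejected"]

metrics = classification_metrics(y_true, y_pred, positive="Rejected")
```

Raising recall on the Rejected class lowers `false_approvals` directly, usually at the cost of rejecting more creditworthy applicants, which shows up as lower precision.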

## Machine Learning Objectives

Stage 1 – Classification:
- Build a model to predict `loan_status`
- Optimize recall on the Rejected class to reduce false approvals
- Evaluate using Accuracy, Precision, Recall, and F1-score

Stage 2 – Regression:
- Predict `loan_amount` for approved loans
- Handle skewness and outliers in loan amounts
- Evaluate using RMSE, MAE, and R²
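
The regression metrics listed above can be written out in a few lines of NumPy. The sketch also shows a `log1p` transform, which is one common way to reduce right skew in monetary targets; whether the project uses it is an assumption. The helper name `regression_metrics` is hypothetical, and the three amounts are the approved rows visible in the data preview.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE and R^2 for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred

    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    return {
        "rmse": float(np.sqrt(np.mean(residuals ** 2))),
        "mae": float(np.mean(np.abs(residuals))),
        "r2": 1.0 - ss_res / ss_tot,
    }


# Small worked example with made-up predictions.
metrics = regression_metrics([10.0, 20.0, 30.0], [12.0, 18.0, 33.0])

# Assumed skew handling: fit on log1p(amount), invert with expm1 at prediction time.
amounts = np.array([29_900_000, 11_300_000, 12_800_000], dtype=float)
log_amounts = np.log1p(amounts)
restored = np.expm1(log_amounts)
```

If the model is trained on the log scale, the metrics should be computed after mapping predictions back with `expm1`, so that RMSE and MAE stay in currency units.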

## Project Workflow

1. Data Understanding & EDA
2. Data Preprocessing & Feature Engineering
3. Model Development (Classification & Regression)
4. Hyperparameter Tuning using GridSearchCV
5. Model Evaluation
6. Model Serialization
7. Deployment using Streamlit
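
Steps 3, 4, and 6 of the workflow above can be sketched together as follows. The estimator (`LogisticRegression`), the parameter grid, the synthetic data, and the `1 = approved, 0 = rejected` encoding are all assumptions chosen for illustration; the source does not state which models or grids the project uses. Serialization uses the standard-library `pickle` module here.

```python
import pickle

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumed encoding: 1 = approved, 0 = rejected).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Preprocessing and model live in one pipeline so scaling is refit per CV fold.
pipe = Pipeline(
    [
        ("scale", StandardScaler()),
        ("clf", LogisticRegression(max_iter=1000)),
    ]
)

# Score on recall of the rejected class (label 0) rather than accuracy.
rejected_recall = make_scorer(recall_score, pos_label=0)

search = GridSearchCV(
    pipe,
    param_grid={"clf__C": [0.1, 1.0, 10.0]},
    scoring=rejected_recall,
    cv=5,
)
search.fit(X, y)

# Serialize the tuned pipeline and load it back, as a deployment app would.
blob = pickle.dumps(search.best_estimator_)
restored = pickle.loads(blob)
```

Pickling the whole pipeline, rather than the bare classifier, keeps the fitted scaler with the model, so the deployed app applies the same preprocessing as training.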

## Dataset Description

The project uses a public **Loan Prediction Dataset** (`loan_approval_dataset.csv`, 4,269 rows and 13 columns) containing demographic, financial, and credit-related attributes of loan applicants.

### Target Variables
- **loan_status** → Binary target for loan approval (Approved / Rejected)
- **loan_amount** → Continuous target for loan valuation (used only for approved loans)
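
As a minimal sketch, the two targets could be derived as shown below, using the column names as they appear in the dataset once whitespace is stripped (`loan_status`, `loan_amount`). The rows are the first five records from the data preview; the `1 = Approved` encoding and the variable names are assumptions.

```python
import pandas as pd

# First five records of the dataset, restricted to a few columns.
sample = pd.DataFrame(
    {
        "income_annum": [9_600_000, 4_100_000, 9_100_000, 8_200_000, 9_800_000],
        "cibil_score": [778, 417, 506, 467, 382],
        "loan_amount": [29_900_000, 12_200_000, 29_700_000, 30_700_000, 24_200_000],
        "loan_status": ["Approved", "Rejected", "Rejected", "Rejected", "Rejected"],
    }
)

# Stage 1 target: binary approval flag (assumed encoding: 1 = Approved).
y_stage1 = (sample["loan_status"] == "Approved").astype(int)

# Stage 2 target: loan amount, restricted to approved applications only.
approved_rows = sample[sample["loan_status"] == "Approved"]
y_stage2 = approved_rows["loan_amount"]
```

Filtering before building the regression target keeps rejected applications out of the valuation model's training data.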


In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


In [3]:
df = pd.read_csv('loan_approval_dataset.csv')

In [4]:
df

| | loan_id | no_of_dependents | education | self_employed | income_annum | loan_amount | loan_term | cibil_score | residential_assets_value | commercial_assets_value | luxury_assets_value | bank_asset_value | loan_status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2 | Graduate | No | 9600000 | 29900000 | 12 | 778 | 2400000 | 17600000 | 22700000 | 8000000 | Approved |
| 1 | 2 | 0 | Not Graduate | Yes | 4100000 | 12200000 | 8 | 417 | 2700000 | 2200000 | 8800000 | 3300000 | Rejected |
| 2 | 3 | 3 | Graduate | No | 9100000 | 29700000 | 20 | 506 | 7100000 | 4500000 | 33300000 | 12800000 | Rejected |
| 3 | 4 | 3 | Graduate | No | 8200000 | 30700000 | 8 | 467 | 18200000 | 3300000 | 23300000 | 7900000 | Rejected |
| 4 | 5 | 5 | Not Graduate | Yes | 9800000 | 24200000 | 20 | 382 | 12400000 | 8200000 | 29400000 | 5000000 | Rejected |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4264 | 4265 | 5 | Graduate | Yes | 1000000 | 2300000 | 12 | 317 | 2800000 | 500000 | 3300000 | 800000 | Rejected |
| 4265 | 4266 | 0 | Not Graduate | Yes | 3300000 | 11300000 | 20 | 559 | 4200000 | 2900000 | 11000000 | 1900000 | Approved |
| 4266 | 4267 | 2 | Not Graduate | No | 6500000 | 23900000 | 18 | 457 | 1200000 | 12400000 | 18100000 | 7300000 | Rejected |
| 4267 | 4268 | 1 | Not Graduate | No | 4100000 | 12800000 | 8 | 780 | 8200000 | 700000 | 14100000 | 5800000 | Approved |


In [5]:
df.head()

| | loan_id | no_of_dependents | education | self_employed | income_annum | loan_amount | loan_term | cibil_score | residential_assets_value | commercial_assets_value | luxury_assets_value | bank_asset_value | loan_status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2 | Graduate | No | 9600000 | 29900000 | 12 | 778 | 2400000 | 17600000 | 22700000 | 8000000 | Approved |
| 1 | 2 | 0 | Not Graduate | Yes | 4100000 | 12200000 | 8 | 417 | 2700000 | 2200000 | 8800000 | 3300000 | Rejected |
| 2 | 3 | 3 | Graduate | No | 9100000 | 29700000 | 20 | 506 | 7100000 | 4500000 | 33300000 | 12800000 | Rejected |
| 3 | 4 | 3 | Graduate | No | 8200000 | 30700000 | 8 | 467 | 18200000 | 3300000 | 23300000 | 7900000 | Rejected |
| 4 | 5 | 5 | Not Graduate | Yes | 9800000 | 24200000 | 20 | 382 | 12400000 | 8200000 | 29400000 | 5000000 | Rejected |


In [6]:
df.columns

Index(['loan_id', ' no_of_dependents', ' education', ' self_employed',
       ' income_annum', ' loan_amount', ' loan_term', ' cibil_score',
       ' residential_assets_value', ' commercial_assets_value',
       ' luxury_assets_value', ' bank_asset_value', ' loan_status'],
      dtype='object')

In [8]:
df.shape

(4269, 13)

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4269 entries, 0 to 4268
Data columns (total 13 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   loan_id                    4269 non-null   int64 
 1    no_of_dependents          4269 non-null   int64 
 2    education                 4269 non-null   object
 3    self_employed             4269 non-null   object
 4    income_annum              4269 non-null   int64 
 5    loan_amount               4269 non-null   int64 
 6    loan_term                 4269 non-null   int64 
 7    cibil_score               4269 non-null   int64 
 8    residential_assets_value  4269 non-null   int64 
 9    commercial_assets_value   4269 non-null   int64 
 10   luxury_assets_value       4269 non-null   int64 
 11   bank_asset_value          4269 non-null   int64 
 12   loan_status               4269 non-null   object
dtypes: int64(10), object(3)
memory usage: 433.7+ KB


In [9]:
df.columns = df.columns.str.strip()

In [10]:
df.columns

Index(['loan_id', 'no_of_dependents', 'education', 'self_employed',
       'income_annum', 'loan_amount', 'loan_term', 'cibil_score',
       'residential_assets_value', 'commercial_assets_value',
       'luxury_assets_value', 'bank_asset_value', 'loan_status'],
      dtype='object')
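
The header cleanup above removes the leading spaces from the column names. If the string values carry the same padding, comparisons such as `== "Approved"` would silently match nothing. Whether they do is an assumption here, since the outputs above do not show it either way. The sketch below reproduces the situation on a tiny in-memory CSV (`raw` and `df_demo` are hypothetical names) and strips both headers and values.

```python
import io

import pandas as pd

# Tiny CSV with a space after each comma, mimicking the raw header format.
raw = (
    "loan_id, education, loan_status\n"
    "1, Graduate, Approved\n"
    "2, Not Graduate, Rejected\n"
)
df_demo = pd.read_csv(io.StringIO(raw))

# Without cleanup, the padded values do not equal the clean label.
matches_before = int((df_demo[" loan_status"] == "Approved").sum())

# Strip the column names, then the values of each text column.
df_demo.columns = df_demo.columns.str.strip()
for col in ["education", "loan_status"]:
    df_demo[col] = df_demo[col].str.strip()

matches_after = int((df_demo["loan_status"] == "Approved").sum())
```

A quick check such as `df["loan_status"].unique()` on the real data would confirm whether this extra step is needed.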