Since the dataset is imbalanced, evaluation focused on Precision, Recall,
and F1-score rather than accuracy alone. SMOTE was applied to the training
split only, so both classes are equally represented during training while
the test set keeps its original class distribution.


In [1]:
# Install kaggle
!pip install kaggle

from google.colab import files
files.upload()   # upload kaggle.json

!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

# Download dataset
!kaggle datasets download -d architsharma01/loan-approval-prediction-dataset

# Unzip
!unzip loan-approval-prediction-dataset.zip



Saving kaggle.json to kaggle.json
Dataset URL: https://www.kaggle.com/datasets/architsharma01/loan-approval-prediction-dataset
License(s): MIT
Downloading loan-approval-prediction-dataset.zip to /content
  0% 0.00/80.6k [00:00<?, ?B/s]
100% 80.6k/80.6k [00:00<00:00, 286MB/s]
Archive:  loan-approval-prediction-dataset.zip
  inflating: loan_approval_dataset.csv  


In [2]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import classification_report, confusion_matrix

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

In [3]:
df = pd.read_csv("loan_approval_dataset.csv")
df.head()

Unnamed: 0,loan_id,no_of_dependents,education,self_employed,income_annum,loan_amount,loan_term,cibil_score,residential_assets_value,commercial_assets_value,luxury_assets_value,bank_asset_value,loan_status
0,1,2,Graduate,No,9600000,29900000,12,778,2400000,17600000,22700000,8000000,Approved
1,2,0,Not Graduate,Yes,4100000,12200000,8,417,2700000,2200000,8800000,3300000,Rejected
2,3,3,Graduate,No,9100000,29700000,20,506,7100000,4500000,33300000,12800000,Rejected
3,4,3,Graduate,No,8200000,30700000,8,467,18200000,3300000,23300000,7900000,Rejected
4,5,5,Not Graduate,Yes,9800000,24200000,20,382,12400000,8200000,29400000,5000000,Rejected


In [6]:
# Fill numeric with median
for col in df.select_dtypes(include=np.number).columns:
    df[col] = df[col].fillna(df[col].median())

# Fill categorical with mode
for col in df.select_dtypes(include='object').columns:
    df[col] = df[col].fillna(df[col].mode()[0])
print(df.isnull().sum())

loan_id                      0
 no_of_dependents            0
 education                   0
 self_employed               0
 income_annum                0
 loan_amount                 0
 loan_term                   0
 cibil_score                 0
 residential_assets_value    0
 commercial_assets_value     0
 luxury_assets_value         0
 bank_asset_value            0
 loan_status                 0
dtype: int64
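
The median/mode imputation above can be checked on a tiny frame. This is a minimal sketch with made-up values; the column names mirror the dataset, and the text column is forced to `object` dtype (an assumption made here so that `select_dtypes(include="object")` behaves the same across pandas versions).

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; column names mirror the dataset.
toy = pd.DataFrame({
    "income_annum": [4100000.0, np.nan, 9600000.0, 8200000.0],
    "education": pd.Series(
        ["Graduate", None, "Graduate", "Not Graduate"], dtype="object"
    ),
})

# Numeric columns: fill gaps with the column median.
for col in toy.select_dtypes(include=np.number).columns:
    toy[col] = toy[col].fillna(toy[col].median())

# Text columns: fill gaps with the most frequent value.
for col in toy.select_dtypes(include="object").columns:
    toy[col] = toy[col].fillna(toy[col].mode()[0])

print(toy)
print(toy.isnull().sum())
```

The missing income becomes the median of the three known incomes, and the missing education becomes the most frequent label.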


In [9]:
# Encode categorical features
le = LabelEncoder()

for col in df.select_dtypes(include='object').columns:
    df[col] = le.fit_transform(df[col])
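
The classification reports in this notebook label the classes only as 0 and 1. `LabelEncoder` assigns integers in sorted order of the original values, so a short sketch like the following can recover the mapping. The sample values and the `le_status` name are assumptions for illustration; because the single `le` object above is refit for every column, it only remembers the last column it saw, and a separate encoder per column would be needed to keep each mapping.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of target values (whitespace already stripped).
statuses = ["Approved", "Rejected", "Rejected", "Approved", "Rejected"]

le_status = LabelEncoder()
encoded = le_status.fit_transform(statuses)

# classes_ is sorted, so index 0 is "Approved" and index 1 is "Rejected".
mapping = {label: int(code) for code, label in enumerate(le_status.classes_)}
print(mapping)
print(le_status.inverse_transform([0, 1]))
```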

In [12]:
# Clean column names by stripping whitespace
df.columns = df.columns.str.strip()

# Separate features and target, then make a stratified train/test split
X = df.drop("loan_status", axis=1)
y = df["loan_status"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
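
The effect of `stratify=y` can be illustrated on synthetic labels. This sketch assumes a 62/38 class ratio, which is roughly what the test-set supports in the reports suggest (531 vs. 323); the arrays and names here are made up for the demonstration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with a 62/38 class ratio (assumed for illustration).
rng = np.random.default_rng(42)
y_demo = np.array([0] * 620 + [1] * 380)
X_demo = rng.normal(size=(len(y_demo), 3))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratify, both splits keep the same share of class 1.
print("train share of class 1:", y_tr.mean())
print("test share of class 1:", y_te.mean())
```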

In [13]:
# Handle imbalanced data: oversample the training split only with SMOTE
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_train, y_train = smote.fit_resample(X_train, y_train)
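
To give an intuition for what SMOTE does, the sketch below generates synthetic minority points by interpolating between a sample and one of its nearest minority-class neighbours. It is a simplified NumPy illustration of the idea, not the `imblearn` implementation; the function name `smote_like_samples`, its parameters, and the sample points are all hypothetical.

```python
import numpy as np

def smote_like_samples(X_minority, n_new, k=3, seed=42):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (simplified SMOTE)."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    # Pairwise Euclidean distances within the minority class.
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_minority))
        neighbour = neighbours[base, rng.integers(k)]
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic[i] = X_minority[base] + gap * (
            X_minority[neighbour] - X_minority[base]
        )
    return synthetic

# Hypothetical minority-class points in two dimensions.
minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])
new_points = smote_like_samples(minority, n_new=5)
print(new_points)
```

Every synthetic point lies on a segment between two real minority samples, which is why oversampling is applied to the training split only: the test set should contain real observations.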

In [14]:
# Train Logistic Regression
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)

pred_lr = lr.predict(X_test)

print("Logistic Regression Results")
print(classification_report(y_test, pred_lr))

Logistic Regression Results
              precision    recall  f1-score   support

           0       0.84      0.86      0.85       531
           1       0.76      0.73      0.75       323

    accuracy                           0.81       854
   macro avg       0.80      0.80      0.80       854
weighted avg       0.81      0.81      0.81       854
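
`StandardScaler` is imported at the top of the notebook but never applied. Logistic regression is often sensitive to feature scale, and columns such as `income_annum` are in the millions, so one variation worth trying is to wrap the scaler and the model in a pipeline. The sketch below uses synthetic data as a stand-in; it does not claim any particular effect on the loan dataset, and the `scaled_lr` name is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the loan data; one feature is blown up to mimic
# columns that are in the millions.
X_demo, y_demo = make_classification(n_samples=600, n_features=6, random_state=42)
X_demo[:, 0] *= 1_000_000

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The scaler is fit on the training split only, then reused on the test split.
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lr.fit(X_tr, y_tr)
print("test accuracy:", scaled_lr.score(X_te, y_te))
```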



In [15]:
# Train Decision Tree
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)

pred_dt = dt.predict(X_test)

print("Decision Tree Results")
print(classification_report(y_test, pred_dt))

Decision Tree Results
              precision    recall  f1-score   support

           0       0.98      0.99      0.99       531
           1       0.98      0.97      0.98       323

    accuracy                           0.98       854
   macro avg       0.98      0.98      0.98       854
weighted avg       0.98      0.98      0.98       854
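
`DecisionTreeClassifier()` is created above without a `random_state`, so repeated runs may break ties between equally good splits differently. A minimal sketch of a reproducible setup, on synthetic data and with hypothetical variable names, also shows how `feature_importances_` can be inspected to see which features drive the tree.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=400, n_features=5, random_state=0)

# Fixing random_state makes tie-breaking between equally good splits repeatable.
tree_a = DecisionTreeClassifier(random_state=42).fit(X_demo, y_demo)
tree_b = DecisionTreeClassifier(random_state=42).fit(X_demo, y_demo)

same_size = tree_a.tree_.node_count == tree_b.tree_.node_count
same_preds = bool((tree_a.predict(X_demo) == tree_b.predict(X_demo)).all())
print("identical trees:", same_size and same_preds)
print("feature importances:", tree_a.feature_importances_.round(3))
```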



In [16]:
print(confusion_matrix(y_test, pred_dt))

[[525   6]
 [  9 314]]


In [17]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Logistic Regression evaluation
print("===== Logistic Regression =====")
print("Accuracy:", accuracy_score(y_test, pred_lr))
print("\nClassification Report:\n", classification_report(y_test, pred_lr))
print("Confusion Matrix:\n", confusion_matrix(y_test, pred_lr))


# Decision Tree evaluation
print("\n===== Decision Tree =====")
print("Accuracy:", accuracy_score(y_test, pred_dt))
print("\nClassification Report:\n", classification_report(y_test, pred_dt))
print("Confusion Matrix:\n", confusion_matrix(y_test, pred_dt))

===== Logistic Regression =====
Accuracy: 0.8126463700234192

Classification Report:
               precision    recall  f1-score   support

           0       0.84      0.86      0.85       531
           1       0.76      0.73      0.75       323

    accuracy                           0.81       854
   macro avg       0.80      0.80      0.80       854
weighted avg       0.81      0.81      0.81       854

Confusion Matrix:
 [[457  74]
 [ 86 237]]

===== Decision Tree =====
Accuracy: 0.9824355971896955

Classification Report:
               precision    recall  f1-score   support

           0       0.98      0.99      0.99       531
           1       0.98      0.97      0.98       323

    accuracy                           0.98       854
   macro avg       0.98      0.98      0.98       854
weighted avg       0.98      0.98      0.98       854

Confusion Matrix:
 [[525   6]
 [  9 314]]
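
The per-class figures in the reports follow directly from the confusion matrices. As a worked check, the sketch below recomputes precision, recall, F1 for class 1, and overall accuracy from the Logistic Regression matrix printed above (rows are true classes, columns are predicted classes, which is scikit-learn's convention).

```python
# Confusion matrix reported above for Logistic Regression:
# rows are true classes, columns are predicted classes.
cm = [[457, 74],
      [86, 237]]

tn, fp = cm[0]
fn, tp = cm[1]

precision = tp / (tp + fp)  # 237 / 311
recall = tp / (tp + fn)     # 237 / 323
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.2f} recall={recall:.2f} "
      f"f1={f1:.2f} accuracy={accuracy:.4f}")
# prints: precision=0.76 recall=0.73 f1=0.75 accuracy=0.8126
```

These values match the class 1 row of the Logistic Regression report and the accuracy of 0.8126 shown above.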


In [18]:
%%writefile README.md
# Loan Approval Prediction using Machine Learning

## Project Overview
This project builds a **Loan Approval Prediction** system using supervised machine learning techniques.
The goal is to predict whether a loan application will be **approved or rejected** based on applicant financial and personal information.

The project focuses on handling **imbalanced data**, preprocessing categorical variables, and evaluating models using **Precision, Recall, and F1-score**.

---

## Dataset
Dataset used: **Loan Approval Prediction Dataset (Kaggle)**
Link: https://www.kaggle.com/datasets/architsharma01/loan-approval-prediction-dataset

The dataset contains features such as:
- Number of Dependents
- Education
- Self Employment Status
- Annual Income
- Loan Amount
- Loan Term
- CIBIL Score
- Asset Values
- Loan Status (Target Variable)

---

## Objectives
- Handle missing values
- Encode categorical variables
- Address class imbalance using **SMOTE**
- Train classification models
- Compare **Logistic Regression** and **Decision Tree**
- Evaluate performance using **Precision, Recall, and F1-score**

---

## Tools & Libraries
- Python
- Pandas
- NumPy
- Scikit-learn
- Imbalanced-learn (SMOTE)

---

## Methodology
1. Load and explore dataset
2. Handle missing values using median and mode
3. Encode categorical variables using Label Encoding
4. Split dataset into training and testing sets
5. Apply **SMOTE** to handle class imbalance
6. Train Logistic Regression model
7. Train Decision Tree model
8. Evaluate models using classification metrics

---

## Results
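Both models were evaluated on the held-out test set (854 samples) using Precision, Recall, and F1-score to properly measure performance on imbalanced data.
- **Logistic Regression**: accuracy 0.81, macro F1 0.80
- **Decision Tree**: accuracy 0.98, macro F1 0.98

SMOTE was applied to the training set only; the test set keeps its original class distribution.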

---

## Conclusion
Machine learning models can predict loan approval decisions effectively on this dataset when proper preprocessing and imbalance handling are applied.
On imbalanced data, F1-score is a more reliable performance measure than accuracy alone.

---


Writing README.md
