# Task 4: Loan Approval Prediction

### Description:
- Dataset (Recommended): Loan-Approval-Prediction-Dataset (Kaggle)  
- Build a model to predict whether a loan application will be approved  
- Handle missing values and encode categorical features  
- Train a classification model and evaluate performance on imbalanced data  
- Focus on precision, recall, and F1-score  


### Step 1: Import Libraries

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report


### Step 2: Import Dataset

In [4]:
df = pd.read_csv(r"D:\eleevo internship\task4\loan_approval_dataset.csv")
df.head()

|   | loan_id | no_of_dependents | education | self_employed | income_annum | loan_amount | loan_term | cibil_score | residential_assets_value | commercial_assets_value | luxury_assets_value | bank_asset_value | loan_status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2 | Graduate | No | 9600000 | 29900000 | 12 | 778 | 2400000 | 17600000 | 22700000 | 8000000 | Approved |
| 1 | 2 | 0 | Not Graduate | Yes | 4100000 | 12200000 | 8 | 417 | 2700000 | 2200000 | 8800000 | 3300000 | Rejected |
| 2 | 3 | 3 | Graduate | No | 9100000 | 29700000 | 20 | 506 | 7100000 | 4500000 | 33300000 | 12800000 | Rejected |
| 3 | 4 | 3 | Graduate | No | 8200000 | 30700000 | 8 | 467 | 18200000 | 3300000 | 23300000 | 7900000 | Rejected |
| 4 | 5 | 5 | Not Graduate | Yes | 9800000 | 24200000 | 20 | 382 | 12400000 | 8200000 | 29400000 | 5000000 | Rejected |


### Step 3: Explore Dataset

In [5]:
# Shape of dataset
print("Shape:", df.shape)


Shape: (4269, 13)


In [6]:
# Info about columns
print("\nInfo:")
print(df.info())




Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4269 entries, 0 to 4268
Data columns (total 13 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   loan_id                    4269 non-null   int64 
 1    no_of_dependents          4269 non-null   int64 
 2    education                 4269 non-null   object
 3    self_employed             4269 non-null   object
 4    income_annum              4269 non-null   int64 
 5    loan_amount               4269 non-null   int64 
 6    loan_term                 4269 non-null   int64 
 7    cibil_score               4269 non-null   int64 
 8    residential_assets_value  4269 non-null   int64 
 9    commercial_assets_value   4269 non-null   int64 
 10   luxury_assets_value       4269 non-null   int64 
 11   bank_asset_value          4269 non-null   int64 
 12   loan_status               4269 non-null   object
dtypes: int64(10), object(3)
memory usage: 433.7+ KB
None


In [7]:
# Missing values
print("\nMissing Values:")
print(df.isnull().sum())



Missing Values:
loan_id                      0
 no_of_dependents            0
 education                   0
 self_employed               0
 income_annum                0
 loan_amount                 0
 loan_term                   0
 cibil_score                 0
 residential_assets_value    0
 commercial_assets_value     0
 luxury_assets_value         0
 bank_asset_value            0
 loan_status                 0
dtype: int64
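
The task description asks for missing-value handling, but the check above finds none, so nothing needs imputing here. For a version of the data that did have gaps, one common approach is median imputation for numeric columns and most-frequent imputation for categorical ones. The sketch below runs on a small hypothetical frame (`toy` and its values are made up for illustration), not on the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps; the real dataset has none
toy = pd.DataFrame({
    "income_annum": [9600000, np.nan, 9100000, 8200000],
    "education": ["Graduate", "Not Graduate", None, "Graduate"],
})

# Numeric columns: fill gaps with the column median
num_cols = toy.select_dtypes(include="number").columns
toy[num_cols] = toy[num_cols].fillna(toy[num_cols].median())

# Categorical columns: fill gaps with the most frequent value
cat_cols = toy.select_dtypes(exclude="number").columns
for col in cat_cols:
    toy[col] = toy[col].fillna(toy[col].mode().iloc[0])

print(toy.isnull().sum())
```

If imputation were needed on the real data, the fill values should be computed on the training split only and then applied to the test split, to avoid leaking test information.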


In [None]:
# Fix column names
# Remove spaces/newlines from column names
df.columns = df.columns.str.strip()

print("Columns after cleaning:")
print(df.columns)


Columns after cleaning:
Index(['loan_id', 'no_of_dependents', 'education', 'self_employed',
       'income_annum', 'loan_amount', 'loan_term', 'cibil_score',
       'residential_assets_value', 'commercial_assets_value',
       'luxury_assets_value', 'bank_asset_value', 'loan_status'],
      dtype='object')


In [None]:
print("\nTarget Variable Distribution:")
print(df['loan_status'].value_counts())



Target Variable Distribution:
loan_status
Approved    2656
Rejected    1613
Name: count, dtype: int64
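
To put a number on the imbalance, the counts can be normalized into proportions. This sketch re-enters the two counts from the output above by hand rather than reading the CSV, so it runs on its own.

```python
import pandas as pd

# Class counts copied from the value_counts() output above
counts = pd.Series({"Approved": 2656, "Rejected": 1613}, name="count")

# Share of each class in the full dataset
proportions = counts / counts.sum()

# Ratio of majority class size to minority class size
imbalance_ratio = counts.max() / counts.min()

print(proportions.round(3))
print(f"Majority/minority ratio: {imbalance_ratio:.2f}")
```

On the real DataFrame the same proportions come from `df['loan_status'].value_counts(normalize=True)`. A split of roughly 62% to 38% is a moderate imbalance, which is why precision, recall, and F1 are reported alongside accuracy later on.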


### Step 4: Preprocessing

In [None]:
# Encode categorical features (Step 3 found no missing values to handle)

# Drop loan_id (an identifier, not useful for prediction)
df = df.drop('loan_id', axis=1)

# Encode categorical variables.
# LabelEncoder assigns integer codes in sorted (alphabetical) order of the labels.
label_enc = LabelEncoder()

df['education'] = label_enc.fit_transform(df['education'])          # Graduate=0, Not Graduate=1
df['self_employed'] = label_enc.fit_transform(df['self_employed'])  # No=0, Yes=1
df['loan_status'] = label_enc.fit_transform(df['loan_status'])      # Approved=0, Rejected=1

print("Encoded dataset sample:")
print(df.head())

Encoded dataset sample:
   no_of_dependents  education  self_employed  income_annum  loan_amount  \
0                 2          0              0       9600000     29900000   
1                 0          1              1       4100000     12200000   
2                 3          0              0       9100000     29700000   
3                 3          0              0       8200000     30700000   
4                 5          1              1       9800000     24200000   

   loan_term  cibil_score  residential_assets_value  commercial_assets_value  \
0         12          778                   2400000                 17600000   
1          8          417                   2700000                  2200000   
2         20          506                   7100000                  4500000   
3          8          467                  18200000                  3300000   
4         20          382                  12400000                  8200000   

   luxury_assets_value  bank_asset_val
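
Because `LabelEncoder` sorts the labels before assigning codes, it is worth checking which integer each label received before interpreting any metric. The sketch below uses a few hypothetical sample values in place of the real `loan_status` column and reads the mapping from the fitted encoder's `classes_` attribute.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample values standing in for the real loan_status column
status = pd.Series(["Approved", "Rejected", "Rejected", "Approved"])

encoder = LabelEncoder()
codes = encoder.fit_transform(status)

# classes_ is sorted; the position of each label is its integer code
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)  # {'Approved': 0, 'Rejected': 1}

# Recover the original labels from the integer codes
decoded = encoder.inverse_transform(codes)
```

Keeping one encoder per column, instead of refitting a single shared one, preserves each mapping for later use with `inverse_transform`. If the string values carry leading whitespace as the column names did, applying `.str.strip()` to them first is a reasonable precaution; the sort order of these labels is the same either way.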

### Step 5: Train/Test Split

In [13]:
from sklearn.model_selection import train_test_split

# Features (everything except target)
X = df.drop('loan_status', axis=1)

# Target (loan_status: 0=Approved, 1=Rejected, since LabelEncoder assigns codes alphabetically)
y = df['loan_status']

# Split into training (80%) and testing (20%)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Training set size:", X_train.shape[0])
print("Testing set size:", X_test.shape[0])


Training set size: 3415
Testing set size: 854
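
The `stratify=y` argument is what keeps the class proportions nearly identical in both parts of the split. A minimal check, using synthetic labels with the same class counts as the dataset (the `_demo` arrays are placeholders, not the real features), might look like this:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as the dataset (0: 2656, 1: 1613)
y_demo = np.array([0] * 2656 + [1] * 1613)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Fraction of label 1 in each part; stratification keeps these close
full_rate = y_demo.mean()
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"full={full_rate:.3f} train={train_rate:.3f} test={test_rate:.3f}")
```

Without stratification the test set could by chance hold noticeably more or fewer minority-class rows, which makes per-class metrics less stable.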


### Step 6: Train Models 

In [14]:
log_reg = LogisticRegression(max_iter=1000, random_state=42)

# Train the model
log_reg.fit(X_train, y_train)

# Predictions
y_pred_log = log_reg.predict(X_test)

print("Logistic Regression model trained successfully")


Logistic Regression model trained successfully
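
The feature columns span very different scales (asset values in the millions next to a three-digit `cibil_score`), and logistic regression is sensitive to that. Two variations that may be worth trying are a scaled logistic regression with `class_weight="balanced"` and the `DecisionTreeClassifier` already imported in Step 1. The sketch below compares them on synthetic data from `make_classification`, used as a stand-in for the loan data; the hyperparameters such as `max_depth=5` are illustrative assumptions, and the scores it prints say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the loan data (not the real dataset)
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=11, n_informative=5,
    weights=[0.62, 0.38], random_state=42,
)
# Inflate a few features to mimic currency-sized columns next to small ones
X_syn[:, :4] *= 1_000_000

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42, stratify=y_syn
)

candidates = {
    # Scaling happens inside the pipeline, so it is fit on training data only
    "scaled_log_reg": make_pipeline(
        StandardScaler(),
        LogisticRegression(max_iter=1000, class_weight="balanced", random_state=42),
    ),
    # Trees do not need scaling; max_depth limits overfitting
    "decision_tree": DecisionTreeClassifier(
        max_depth=5, class_weight="balanced", random_state=42
    ),
}

scores = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    scores[name] = f1_score(y_te, model.predict(X_te))
    print(f"{name}: F1 = {scores[name]:.3f}")
```

To apply this to the notebook, replace the synthetic split with the existing `X_train`, `X_test`, `y_train`, and `y_test`.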


### Step 7: Model Evaluation (Logistic Regression)

In [15]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report

# Accuracy
acc = accuracy_score(y_test, y_pred_log)

# Precision, Recall, F1 (computed for the positive class, label 1 = Rejected, the default pos_label)
prec = precision_score(y_test, y_pred_log)
rec = recall_score(y_test, y_pred_log)
f1 = f1_score(y_test, y_pred_log)

print(f"Accuracy: {acc:.4f}")
print(f"Precision: {prec:.4f}")
print(f"Recall: {rec:.4f}")
print(f"F1 Score: {f1:.4f}")

# Detailed classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred_log))

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred_log)
print("\nConfusion Matrix:")
print(cm)


Accuracy: 0.8080
Precision: 0.8245
Recall: 0.6254
F1 Score: 0.7113

Classification Report:
              precision    recall  f1-score   support

           0       0.80      0.92      0.86       531
           1       0.82      0.63      0.71       323

    accuracy                           0.81       854
   macro avg       0.81      0.77      0.78       854
weighted avg       0.81      0.81      0.80       854


Confusion Matrix:
[[488  43]
 [121 202]]
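
The four headline scores can be re-derived by hand from the confusion matrix, which is a useful sanity check. In scikit-learn's layout, rows are actual labels and columns are predicted labels, and `precision_score`, `recall_score`, and `f1_score` treat label 1 as the positive class by default.

```python
# Confusion matrix from the output above: rows = actual, columns = predicted
cm = [[488, 43],
      [121, 202]]

tn, fp = cm[0]  # actual label 0: predicted 0, predicted 1
fn, tp = cm[1]  # actual label 1: predicted 0, predicted 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of rows predicted as 1, the share that are truly 1
recall = tp / (tp + fn)     # of rows that are truly 1, the share predicted as 1
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1 Score:  {f1:.4f}")
```

The results agree with the reported values (0.8080, 0.8245, 0.6254, 0.7113), confirming that the scalar metrics describe label 1, the row with support 323.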


# Summary of Task 4

---

## Task 4: Loan Approval Prediction  

### Description:
- Dataset: Loan-Approval-Prediction-Dataset (Kaggle)  
- Goal: Predict whether a loan application will be approved or rejected.  

### Steps & Outcomes:
1. **Preprocessing**
   - No missing values.  
   - Dropped `loan_id` (not useful for prediction).  
   - Encoded categorical columns:  
     - `education` → Graduate=0, Not Graduate=1  
     - `self_employed` → No=0, Yes=1  
     - `loan_status` (target) → Approved=0, Rejected=1 (`LabelEncoder` assigns codes alphabetically)  

2. **Train/Test Split**
   - Training set: 3415 samples  
   - Testing set: 854 samples  

3. **Model Training**
   - Algorithm: Logistic Regression  

4. **Evaluation**
   - Accuracy: **80.8%**  
   - Precision, recall, and F1 below are for the positive class, label 1, which is **Rejected** (the support of 531 vs. 323 in the report matches the Approved/Rejected split).  
   - Precision: **0.82** → When the model predicts Rejected, it is right about 82% of the time.  
   - Recall: **0.63** → Model misses ~37% of actual Rejected cases (they are predicted as Approved).  
   - F1-score: **0.71** → Balanced measure, moderate performance.  
   - Confusion Matrix (rows = actual, columns = predicted):  
     - Correctly predicted **488 Approved** and **202 Rejected**.  
     - Misclassified **43 Approved → Rejected** and **121 Rejected → Approved**.  

### Insights:
- Model **rarely rejects applicants who should be approved** (43 of 531, about 8%).  
- However, recall on the Rejected class is weaker → 121 of 323 applications that should be rejected (about 37%) are predicted as Approved, which is the costlier error for a lender.  
- In banking, this tradeoff is important:  
  - Higher recall on Rejected = fewer risky loans approved, safer for the bank.  
  - Higher precision on Rejected = fewer deserving applicants turned away.
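
One way to move along the precision/recall tradeoff without retraining is to change the decision threshold applied to `predict_proba`. The sketch below does this on synthetic data from `make_classification` (a stand-in, not the loan dataset); the thresholds 0.3, 0.5, and 0.7 are arbitrary illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the loan data (not the real dataset)
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=11, n_informative=5,
    weights=[0.62, 0.38], random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42, stratify=y_syn
)

model = LogisticRegression(max_iter=1000, random_state=42).fit(X_tr, y_tr)

# Predicted probability of label 1 for each test row
proba = model.predict_proba(X_te)[:, 1]

results = {}
for threshold in (0.3, 0.5, 0.7):
    # Predict label 1 only when its probability reaches the threshold
    y_hat = (proba >= threshold).astype(int)
    results[threshold] = (
        precision_score(y_te, y_hat, zero_division=0),
        recall_score(y_te, y_hat),
    )
    p, r = results[threshold]
    print(f"threshold={threshold:.1f}  precision={p:.3f}  recall={r:.3f}")
```

Lowering the threshold can only add rows to the predicted-positive set, so recall for label 1 never decreases as the threshold drops; precision usually falls in exchange. The right threshold depends on the relative cost of each error type, which is a business decision rather than a modeling one.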