<div style="background-color: #f0f8ff; padding: 20px; border-radius: 10px; text-align: center; border: 2px solid #4682b4;">
    <h1 style="color: #2c3e50; font-family: 'Helvetica Neue', sans-serif; font-size: 2.5em; text-shadow: 1px 1px 2px #bdc3c7;">
        Loan Approval Prediction System
    </h1>
    <p style="color: #34495e; font-family: 'Helvetica Neue', sans-serif; font-size: 1.2em; margin-top: -10px;">
        <strong>Created by:</strong> KHOUMRI Tarik, ZOUHARI Dyae Errahmane, MONTAKI Souhaib &amp; SAMY AYOUB
    </p>
    <img src="../static/images/dataset-cover.jpg" alt="Loan Approval" style="width: 70%; margin-top: 15px; border-radius: 10px; box-shadow: 0 4px 8px rgba(0,0,0,0.1);">
</div>

<h4 style="color: #6c6d6eff;">This notebook follows the CRISP-DM (Cross-Industry Standard Process for Data Mining) lifecycle.</h4>
<div style="text-align: center;">
    <img src="../static/images/crisp_process.jpg" alt="CRISP-DM" style="display: block; margin-left: auto; margin-right: auto; width: 50%;">
</div>

<h2>I. Business understanding </h2>

Objective: Predict loan approval from applicant data. The work covers preprocessing, handling missing values, encoding categorical features, and training binary classification models while addressing class imbalance.

<h5>Importing packages</h5>

In [2]:
import pandas as pd
import numpy as np


<h2>II. Data understanding </h2>

The loan approval dataset is a collection of financial records and related information used to determine whether individuals or organizations are eligible for a loan from a lending institution. It includes factors such as CIBIL score, income, employment status, loan term, loan amount, asset values, and loan status. The dataset is commonly used in machine learning and data analysis to build models that predict the likelihood of loan approval from these features.

<h4>1. Column inventory</h4>

<p style="margin-top:0.1rem; margin-bottom:1rem; color:#333;">
  Description of the dataset columns.
</p>

<table style="border-collapse:collapse; width:100%; max-width:1000px;">
  <thead>
    <tr style="background:#f2f4f7; text-align:left;">
      <th style="padding:8px 12px; border:1px solid #e1e4ea;">Variable</th>
      <th style="padding:8px 12px; border:1px solid #e1e4ea;">Description</th>
      <th style="padding:8px 12px; border:1px solid #e1e4ea;">Type</th>
      <th style="padding:8px 12px; border:1px solid #e1e4ea;">Role</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="padding:8px 12px; border:1px solid #e1e4ea;">loan_id</td>
      <td style="padding:8px 12px; border:1px solid #e1e4ea;">Unique identifier of the loan</td>
      <td style="padding:8px 12px; border:1px solid #e1e4ea;">Categorical (ID)</td>
      <td style="padding:8px 12px; border:1px solid #e1e4ea;">Ignore for modeling</td>
    </tr>
    <tr>
      <td style="padding:8px 12px; border:1px solid #e1e4ea;">no_of_dependents</td>
      <td style="padding:8px 12px; border:1px solid #e1e4ea;">Number of dependents</td>
      <td style="padding:8px 12px; border:1px solid #e1e4ea;">Numerical (discrete)</td>
      <td style="padding:8px 12px; border:1px solid #e1e4ea;">Feature</td>
    </tr>
    <tr>
      <td style="padding:8px 12px; border:1px solid #e1e4ea;">education</td>
      <td style="padding:8px 12px; border:1px solid #e1e4ea;">Education level (Graduate / Not Graduate)</td>
      <td style="padding:8px 12px; border:1px solid #e1e4ea;">Categorical</td>
      <td style="padding:8px 12px; border:1px solid #e1e4ea;">Feature</td>
    </tr>
    <tr>
      <td style="padding:8px 12px; border:1px solid #e1e4ea;">self_employed</td>
      <td style="padding:8px 12px; border:1px solid #e1e4ea;">Indicates if the applicant is self-employed (Yes/No)</td>
      <td style="padding:8px 12px; border:1px solid #e1e4ea;">Categorical</td>
      <td style="padding:8px 12px; border:1px solid #e1e4ea;">Feature</td>
    </tr>
    <tr>
      <td style="padding:8px 12px; border:1px solid #e1e4ea;">income_annum</td>
      <td style="padding:8px 12px; border:1px solid #e1e4ea;">Annual income</td>
      <td style="padding:8px 12px; border:1px solid #e1e4ea;">Numerical (continuous)</td>
      <td style="padding:8px 12px; border:1px solid #e1e4ea;">Feature</td>
    </tr>
    <tr>
      <td style="padding:8px 12px; border:1px solid #e1e4ea;">loan_amount</td>
      <td style="padding:8px 12px; border:1px solid #e1e4ea;">Requested loan amount</td>
      <td style="padding:8px 12px; border:1px solid #e1e4ea;">Numerical (continuous)</td>
      <td style="padding:8px 12px; border:1px solid #e1e4ea;">Feature</td>
    </tr>
    <tr>
      <td style="padding:8px 12px; border:1px solid #e1e4ea;">loan_term</td>
      <td style="padding:8px 12px; border:1px solid #e1e4ea;">Loan duration (e.g., in months)</td>
      <td style="padding:8px 12px; border:1px solid #e1e4ea;">Numerical (discrete)</td>
      <td style="padding:8px 12px; border:1px solid #e1e4ea;">Feature</td>
    </tr>
    <tr>
      <td style="padding:8px 12px; border:1px solid #e1e4ea;">cibil_score</td>
      <td style="padding:8px 12px; border:1px solid #e1e4ea;">Credit score (range 300–900)</td>
      <td style="padding:8px 12px; border:1px solid #e1e4ea;">Numerical (continuous)</td>
      <td style="padding:8px 12px; border:1px solid #e1e4ea;">Feature (very important)</td>
    </tr>
    <tr>
      <td style="padding:8px 12px; border:1px solid #e1e4ea;">residential_assets_value</td>
      <td style="padding:8px 12px; border:1px solid #e1e4ea;">Value of residential assets</td>
      <td style="padding:8px 12px; border:1px solid #e1e4ea;">Numerical (continuous)</td>
      <td style="padding:8px 12px; border:1px solid #e1e4ea;">Feature</td>
    </tr>
    <tr>
      <td style="padding:8px 12px; border:1px solid #e1e4ea;">commercial_assets_value</td>
      <td style="padding:8px 12px; border:1px solid #e1e4ea;">Value of commercial assets</td>
      <td style="padding:8px 12px; border:1px solid #e1e4ea;">Numerical (continuous)</td>
      <td style="padding:8px 12px; border:1px solid #e1e4ea;">Feature</td>
    </tr>
    <tr>
      <td style="padding:8px 12px; border:1px solid #e1e4ea;">luxury_assets_value</td>
      <td style="padding:8px 12px; border:1px solid #e1e4ea;">Value of luxury assets</td>
      <td style="padding:8px 12px; border:1px solid #e1e4ea;">Numerical (continuous)</td>
      <td style="padding:8px 12px; border:1px solid #e1e4ea;">Feature</td>
    </tr>
    <tr>
      <td style="padding:8px 12px; border:1px solid #e1e4ea;">bank_asset_value</td>
      <td style="padding:8px 12px; border:1px solid #e1e4ea;">Bank assets (savings/liquid assets)</td>
      <td style="padding:8px 12px; border:1px solid #e1e4ea;">Numerical (continuous)</td>
      <td style="padding:8px 12px; border:1px solid #e1e4ea;">Feature</td>
    </tr>
    <tr>
      <td style="padding:8px 12px; border:1px solid #e1e4ea;">loan_status</td>
      <td style="padding:8px 12px; border:1px solid #e1e4ea;">Loan status: Approved / Rejected</td>
      <td style="padding:8px 12px; border:1px solid #e1e4ea;">Categorical (target)</td>
      <td style="padding:8px 12px; border:1px solid #e1e4ea;">Target</td>
    </tr>
  </tbody>
</table>


<h4>2. Importing the dataset</h4>

In [6]:
Df = pd.read_csv(r"loan_approval_dataset.csv")
Df.head()

Unnamed: 0,loan_id,no_of_dependents,education,self_employed,income_annum,loan_amount,loan_term,cibil_score,residential_assets_value,commercial_assets_value,luxury_assets_value,bank_asset_value,loan_status
0,1,2,Graduate,No,9600000,29900000,12,778,2400000,17600000,22700000,8000000,Approved
1,2,0,Not Graduate,Yes,4100000,12200000,8,417,2700000,2200000,8800000,3300000,Rejected
2,3,3,Graduate,No,9100000,29700000,20,506,7100000,4500000,33300000,12800000,Rejected
3,4,3,Graduate,No,8200000,30700000,8,467,18200000,3300000,23300000,7900000,Rejected
4,5,5,Not Graduate,Yes,9800000,24200000,20,382,12400000,8200000,29400000,5000000,Rejected


In [7]:
Df.tail()

Unnamed: 0,loan_id,no_of_dependents,education,self_employed,income_annum,loan_amount,loan_term,cibil_score,residential_assets_value,commercial_assets_value,luxury_assets_value,bank_asset_value,loan_status
4264,4265,5,Graduate,Yes,1000000,2300000,12,317,2800000,500000,3300000,800000,Rejected
4265,4266,0,Not Graduate,Yes,3300000,11300000,20,559,4200000,2900000,11000000,1900000,Approved
4266,4267,2,Not Graduate,No,6500000,23900000,18,457,1200000,12400000,18100000,7300000,Rejected
4267,4268,1,Not Graduate,No,4100000,12800000,8,780,8200000,700000,14100000,5800000,Approved
4268,4269,1,Graduate,No,9200000,29700000,10,607,17800000,11800000,35700000,12000000,Approved


<h4>3. Describing the dataset</h4>

In [10]:
# Check the distribution of the numerical variables:
print("Descriptive statistics of the data:\n")
Df.describe().T

Descriptive statistics of the data:



Unnamed: 0,count,mean,std,min,25%,50%,75%,max
loan_id,4269.0,2135.0,1232.498,1.0,1068.0,2135.0,3202.0,4269.0
no_of_dependents,4269.0,2.498712,1.69591,0.0,1.0,3.0,4.0,5.0
income_annum,4269.0,5059124.0,2806840.0,200000.0,2700000.0,5100000.0,7500000.0,9900000.0
loan_amount,4269.0,15133450.0,9043363.0,300000.0,7700000.0,14500000.0,21500000.0,39500000.0
loan_term,4269.0,10.90045,5.709187,2.0,6.0,10.0,16.0,20.0
cibil_score,4269.0,599.9361,172.4304,300.0,453.0,600.0,748.0,900.0
residential_assets_value,4269.0,7472617.0,6503637.0,-100000.0,2200000.0,5600000.0,11300000.0,29100000.0
commercial_assets_value,4269.0,4973155.0,4388966.0,0.0,1300000.0,3700000.0,7600000.0,19400000.0
luxury_assets_value,4269.0,15126310.0,9103754.0,300000.0,7500000.0,14600000.0,21700000.0,39200000.0
bank_asset_value,4269.0,4976692.0,3250185.0,0.0,2300000.0,4600000.0,7100000.0,14700000.0


Note: the descriptive statistics show that the columns span very different ranges (for example, loan_term runs from 2 to 20 while loan_amount reaches 39,500,000), so the features should be normalized to a common scale before modeling.
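
One possible way to bring the columns onto a common range is min-max scaling, which maps each column to the interval [0, 1]. The sketch below is only an illustration: it works on a three-row sample copied from the `Df.head()` output above, and the helper name `min_max_scale` is hypothetical. Standardization (zero mean, unit variance) would be an equally reasonable choice.

```python
import pandas as pd

# Small sample copied from the Df.head() output above (illustration only).
sample = pd.DataFrame({
    "income_annum": [9600000, 4100000, 9100000],
    "loan_term": [12, 8, 20],
    "cibil_score": [778, 417, 506],
})


def min_max_scale(frame: pd.DataFrame) -> pd.DataFrame:
    """Rescale every numeric column to the [0, 1] interval."""
    col_min = frame.min()
    col_range = frame.max() - col_min
    # Note: a constant column has a zero range and would produce NaN here.
    return (frame - col_min) / col_range


scaled = min_max_scale(sample)
print(scaled.round(3))
```

In a modeling pipeline, the minimum and range would normally be computed on the training split only and then reused on the test split, so that no information leaks from the test data.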

In [12]:
Df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4269 entries, 0 to 4268
Data columns (total 13 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   loan_id                    4269 non-null   int64 
 1    no_of_dependents          4269 non-null   int64 
 2    education                 4269 non-null   object
 3    self_employed             4269 non-null   object
 4    income_annum              4269 non-null   int64 
 5    loan_amount               4269 non-null   int64 
 6    loan_term                 4269 non-null   int64 
 7    cibil_score               4269 non-null   int64 
 8    residential_assets_value  4269 non-null   int64 
 9    commercial_assets_value   4269 non-null   int64 
 10   luxury_assets_value       4269 non-null   int64 
 11   bank_asset_value          4269 non-null   int64 
 12   loan_status               4269 non-null   object
dtypes: int64(10), object(3)
memory usage: 433.7+ KB


There are no missing values. However, three columns (education, self_employed and loan_status) are stored as object, so they will need to be converted during the preprocessing phase.
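
As a rough sketch of what such a type conversion could look like, the example below casts the three text columns to the pandas `category` dtype. It runs on a toy frame rather than on `Df`, and it assumes clean column names without the leading space visible in the `Df.info()` output.

```python
import pandas as pd

# Toy frame mirroring the three object columns reported by Df.info().
sample = pd.DataFrame({
    "education": ["Graduate", "Not Graduate", "Graduate"],
    "self_employed": ["No", "Yes", "No"],
    "loan_status": ["Approved", "Rejected", "Rejected"],
})

# Columns to convert, listed explicitly to keep the sketch unambiguous.
text_cols = ["education", "self_employed", "loan_status"]

# Cast each text column to the category dtype.
converted = sample.astype({col: "category" for col in text_cols})

print(converted.dtypes)
print(list(converted["education"].cat.categories))
```

The `category` dtype records the set of allowed labels, which makes a later numeric encoding easier to check.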

In [14]:
print("Number of unique values in each column:")
print(Df.nunique())

# shape[0] is the number of rows, shape[1] the number of columns.
print(f"The shape of our data: {Df.shape[0]} rows and {Df.shape[1]} columns")

Number of unique values in each column:
loan_id                      4269
 no_of_dependents               6
 education                      2
 self_employed                  2
 income_annum                  98
 loan_amount                  378
 loan_term                     10
 cibil_score                  601
 residential_assets_value     278
 commercial_assets_value      188
 luxury_assets_value          379
 bank_asset_value             146
 loan_status                    2
dtype: int64
The shape of our data: 4269 rows and 13 columns


<h4>4. Checking data quality</h4>

<h5>4.1 Checking for duplicate rows</h5>

In [15]:
Df.duplicated().sum()

np.int64(0)

The dataset contains no duplicate rows.
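
Because loan_id takes a distinct value on every row (4269 unique values for 4269 rows), a full-row comparison can never flag two rows as identical. A complementary check, sketched below on a toy frame, drops the identifier before looking for duplicates. The toy data is invented for illustration and says nothing about the real dataset.

```python
import pandas as pd

# Toy frame: rows 0 and 2 share every value except loan_id.
sample = pd.DataFrame({
    "loan_id": [1, 2, 3],
    "cibil_score": [778, 417, 778],
    "loan_status": ["Approved", "Rejected", "Approved"],
})

# Full-row check: the unique identifier hides the repeated applicant data.
full_row_duplicates = sample.duplicated().sum()

# Drop the identifier before comparing rows.
feature_duplicates = sample.drop(columns="loan_id").duplicated().sum()

print(full_row_duplicates, feature_duplicates)
```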

<h5>4.2 Checking for missing values</h5>

In [16]:
Df.isna().sum()

loan_id                      0
 no_of_dependents            0
 education                   0
 self_employed               0
 income_annum                0
 loan_amount                 0
 loan_term                   0
 cibil_score                 0
 residential_assets_value    0
 commercial_assets_value     0
 luxury_assets_value         0
 bank_asset_value            0
 loan_status                 0
dtype: int64

<h5>4.3 Checking the column types</h5>

In [17]:
Df.dtypes

loan_id                       int64
 no_of_dependents             int64
 education                   object
 self_employed               object
 income_annum                 int64
 loan_amount                  int64
 loan_term                    int64
 cibil_score                  int64
 residential_assets_value     int64
 commercial_assets_value      int64
 luxury_assets_value          int64
 bank_asset_value             int64
 loan_status                 object
dtype: object

The categorical columns must be encoded as numerical values before a model can be trained on them.
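
Since each of the three categorical columns takes exactly two values, one simple option is to map every label to 0 or 1. The sketch below shows this on a toy frame; the dictionary `binary_maps` and the choice of which label becomes 1 are assumptions made for illustration, not the encoding used later in this notebook.

```python
import pandas as pd

# Toy frame with the two-valued categorical columns of the dataset.
sample = pd.DataFrame({
    "education": ["Graduate", "Not Graduate", "Graduate"],
    "self_employed": ["No", "Yes", "No"],
    "loan_status": ["Approved", "Rejected", "Rejected"],
})

# Hypothetical mappings; which label becomes 1 is an arbitrary choice.
binary_maps = {
    "education": {"Not Graduate": 0, "Graduate": 1},
    "self_employed": {"No": 0, "Yes": 1},
    "loan_status": {"Rejected": 0, "Approved": 1},
}

encoded = sample.copy()
for column, mapping in binary_maps.items():
    # Strip stray whitespace first so that every label matches a mapping key.
    encoded[column] = encoded[column].str.strip().map(mapping)

print(encoded)
```

Any label missing from a mapping would become NaN, so counting missing values after the encoding is a cheap sanity check.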

<h4>5. Exploratory data analysis (EDA)</h4>

This section is included in the Data understanding phase to better visualize the distribution of the data, get an overview of the correlations between variables, and understand the data more fully.
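
Two quick numerical summaries can support this exploration before any plotting: the share of each class in the target, and the correlation matrix of the numerical columns. The sketch below computes both on five rows copied from the `Df.head()` output above, so the printed numbers describe that small sample only, not the full dataset.

```python
import pandas as pd

# Five rows copied from the Df.head() output above (illustration only).
sample = pd.DataFrame({
    "income_annum": [9600000, 4100000, 9100000, 8200000, 9800000],
    "loan_amount": [29900000, 12200000, 29700000, 30700000, 24200000],
    "cibil_score": [778, 417, 506, 467, 382],
    "loan_status": ["Approved", "Rejected", "Rejected", "Rejected", "Rejected"],
})

# Share of each class in the target, useful for spotting class imbalance.
class_share = sample["loan_status"].value_counts(normalize=True)

# Pearson correlation between the numeric columns only.
correlation = sample.corr(numeric_only=True)

print(class_share)
print(correlation.round(2))
```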

<h2>III. Data preprocessing </h2>

<h4>1. Renaming columns</h4>

In [None]:
# The raw CSV headers carry a leading space (see the Df.info() output above).
# Replace them with clean names, in the same order as the 13 columns of Df.
headers = ["loan_id", "no_of_dependents", "education", "self_employed", "income_annum", "loan_amount", "loan_term",
           "cibil_score", "residential_assets_value", "commercial_assets_value", "luxury_assets_value",
           "bank_asset_value", "loan_status"]
Df.columns = headers