Step 1: Load the Dataset


In [32]:
import pandas as pd

# Load the dataset
df = pd.read_csv('loan_data.csv')

# Display the first few rows
print(df.head())

# Check the structure of the dataset (info() prints directly and returns None)
df.info()


    Loan_ID Gender Married  Dependents     Education Self_Employed  \
0  LP001002   Male     Yes           0      Graduate            No   
1  LP001003   Male     Yes           1      Graduate            No   
2  LP001005   Male     Yes           0      Graduate           Yes   
3  LP001006   Male      No           0  Not Graduate            No   
4  LP001008   Male     Yes           0      Graduate            No   

   ApplicantIncome  CoapplicantIncome  LoanAmount  Loan_Amount_Term  \
0             5849                  0         128               360   
1             4583               1508         128               360   
2             3000                  0          66               360   
3             2583               2358         120               360   
4             6000                  0         141               360   

   Credit_History Property_Area Loan_Status  
0               1         Urban           Y  
1               1         Rural           N  
2             

Loan_ID   Gender  Married  Dependents  Education     Self_Employed  ApplicantIncome  CoapplicantIncome  LoanAmount  Loan_Amount_Term  Credit_History  Property_Area  Loan_Status
LP001002  Male    Yes      0           Graduate      No             5849             0                  128         360               1               Urban          Y              1
LP001003  Male    Yes      1           Graduate      No             4583             1508               128         360               1               Rural          N              1
LP001046  Male    No       0           Not Graduate  Yes            4887             0                  133         360               1               Urban          Y              1
LP001043  Female  Yes      0           Graduate      No             3510             0                  76          360               0               Semiurban      N              1
LP001038  Male    Yes      1           Not Graduate  No             3596             0         

In [35]:
df.shape

(20, 13)

In [36]:
print(df.isnull().sum())

Loan_ID              0
Gender               0
Married              0
Dependents           0
Education            0
Self_Employed        0
ApplicantIncome      0
CoapplicantIncome    0
LoanAmount           0
Loan_Amount_Term     0
Credit_History       0
Property_Area        0
Loan_Status          0
dtype: int64


Step 2: Explore the Dataset

Let's look at the key features and the target variable, Loan_Status, which records whether the loan was approved (Y) or not approved (N).

Dataset Features: 

Some important features in this dataset include:

Gender: Male/Female (categorical)

Married: Yes/No (categorical)

Dependents: Number of dependents (treated as categorical and may include '3+', although this 20-row sample only contains 0, 1, and 2)

Education: Graduate/Not Graduate (categorical)

Self_Employed: Yes/No (categorical)

ApplicantIncome: Numeric income of the applicant

CoapplicantIncome: Numeric income of the coapplicant

LoanAmount: Numeric loan amount

Loan_Amount_Term: Term in months

Credit_History: 1 or 0 (whether the applicant has credit history)

Property_Area: Urban/Rural/Semiurban (categorical)

Loan_Status: Y (approved) or N (not approved); this is the target variable.
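
As a quick check on the feature list above, the sketch below lists the values found in each non-numeric column. It rebuilds a small frame from the five rows printed by df.head() and keeps only a few columns; the names sample and categorical_cols are illustrative and not part of the original notebook.

```python
import pandas as pd

# Five rows copied from the df.head() output; only a few columns are kept.
sample = pd.DataFrame({
    "Gender": ["Male", "Male", "Male", "Male", "Male"],
    "Married": ["Yes", "Yes", "Yes", "No", "Yes"],
    "Education": ["Graduate", "Graduate", "Graduate", "Not Graduate", "Graduate"],
    "ApplicantIncome": [5849, 4583, 3000, 2583, 6000],
    "Loan_Status": ["Y", "N", "Y", "Y", "Y"],
})

# Every column that is not numeric is a candidate categorical feature.
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

# Print the distinct values of each categorical column.
for col in categorical_cols:
    print(col, sorted(sample[col].unique()))
```

Running the same loop on the full df is one way to confirm which labels each column actually contains before choosing an encoding.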

In [12]:
# Get a statistical summary of numerical columns
print(df.describe())

# Check for missing values
print(df.isnull().sum())

# Check the distribution of the target variable (Loan_Status)
print(df['Loan_Status'].value_counts())

       Dependents  ApplicantIncome  CoapplicantIncome  LoanAmount  \
count   20.000000        20.000000          20.000000   20.000000   
mean     0.550000      4142.450000        1987.950000  134.650000   
std      0.825578      2428.969274        2893.234804   72.508638   
min      0.000000      1299.000000           0.000000   17.000000   
25%      0.000000      2895.750000           0.000000   98.750000   
50%      0.000000      3553.000000        1297.000000  126.500000   
75%      1.000000      4902.750000        2464.250000  145.250000   
max      2.000000     12841.000000       10968.000000  349.000000   

       Loan_Amount_Term  Credit_History  
count         20.000000       20.000000  
mean         333.000000        0.900000  
std           68.755861        0.307794  
min          120.000000        0.000000  
25%          360.000000        1.000000  
50%          360.000000        1.000000  
75%          360.000000        1.000000  
max          360.000000        1.000000  


Step 3: Data Preprocessing

3.1 Handling Missing Values

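First, we handle missing values, if any.

Missing values can either be filled (mean or median for numerical columns, mode for categorical ones) or the affected rows can be removed. The check in Step 1 found no missing values in this sample, so the fills below leave the data unchanged.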

In [13]:
# Check for missing values
print(df.isnull().sum())

# Drop the Loan_ID column as it's not useful for modeling
#df.drop('Loan_ID', axis=1, inplace=True)

# Fill missing values in 'LoanAmount' with the median.
# Assigning the result back avoids the chained inplace fillna, which
# recent pandas versions warn about and which may not modify df.
df['LoanAmount'] = df['LoanAmount'].fillna(df['LoanAmount'].median())

# Fill missing values in 'Loan_Amount_Term' with the mode
df['Loan_Amount_Term'] = df['Loan_Amount_Term'].fillna(df['Loan_Amount_Term'].mode()[0])

# Fill missing values in the remaining columns with the mode
for col in ['Gender', 'Married', 'Dependents', 'Self_Employed', 'Credit_History']:
    df[col] = df[col].fillna(df[col].mode()[0])


Loan_ID              0
Gender               0
Married              0
Dependents           0
Education            0
Self_Employed        0
ApplicantIncome      0
CoapplicantIncome    0
LoanAmount           0
Loan_Amount_Term     0
Credit_History       0
Property_Area        0
Loan_Status          0
dtype: int64
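
Because this sample has no missing values, the fills above do nothing visible. The sketch below applies the same median/mode strategy to a tiny made-up frame that does contain gaps. The helper impute_simple and the toy values are hypothetical and only illustrate the idea.

```python
import numpy as np
import pandas as pd

def impute_simple(frame):
    """Return a copy with numeric gaps filled by the median and others by the mode."""
    out = frame.copy()
    for col in out.columns:
        if out[col].isna().sum() == 0:
            continue  # nothing to fill in this column
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            out[col] = out[col].fillna(out[col].mode()[0])
    return out

# Made-up rows with one missing number and one missing category.
toy = pd.DataFrame({
    "LoanAmount": [128.0, np.nan, 66.0, 120.0],
    "Gender": ["Male", None, "Female", "Male"],
})

filled = impute_simple(toy)
print(filled)
```

The missing LoanAmount becomes 120.0 (the median of 128, 66, and 120) and the missing Gender becomes "Male" (the most frequent value).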


3.2 Encoding Categorical Variables

Categorical features such as Gender, Married, Dependents, Education, Self_Employed, and Property_Area must be converted to numeric values before most machine learning models can use them.

Label encoding and one-hot encoding are the two common options; here we use one-hot encoding.
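
To make the difference between the two options concrete, here is a small sketch on a made-up Property_Area column. It uses pandas category codes as a stand-in for label encoding (not scikit-learn's LabelEncoder), and the variable names are illustrative.

```python
import pandas as pd

# Made-up values for a single categorical column.
areas = pd.Series(["Urban", "Rural", "Semiurban", "Urban"], name="Property_Area")

# Label encoding: one integer per category, assigned in alphabetical order.
label_encoded = areas.astype("category").cat.codes

# One-hot encoding: one indicator column per category, minus the first one.
one_hot = pd.get_dummies(areas, prefix="Property_Area", drop_first=True)

print(label_encoded.tolist())
print(one_hot)
```

Label encoding implies an ordering (Rural < Semiurban < Urban) that a linear model would treat as meaningful, which is one reason to prefer one-hot encoding for unordered categories.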

In [14]:
# Convert categorical columns to numeric using pd.get_dummies() for one-hot encoding
df = pd.get_dummies(df, columns=['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed', 'Property_Area'], drop_first=True)

# Display the new DataFrame structure
print(df.head())


    Loan_ID  ApplicantIncome  CoapplicantIncome  LoanAmount  Loan_Amount_Term  \
0  LP001002             5849                  0         128               360   
1  LP001003             4583               1508         128               360   
2  LP001005             3000                  0          66               360   
3  LP001006             2583               2358         120               360   
4  LP001008             6000                  0         141               360   

   Credit_History Loan_Status  Gender_Male  Married_Yes  Dependents_1  \
0               1           Y         True         True         False   
1               1           N         True         True          True   
2               1           Y         True         True         False   
3               1           Y         True        False         False   
4               1           Y         True         True         False   

   Dependents_2  Education_Not Graduate  Self_Employed_Yes  \
0         Fa

3.3 Feature Scaling

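Scaling matters for logistic regression: scikit-learn's implementation is regularized by default and fitted with an iterative solver, so features on very different scales can slow convergence and make the penalty weigh coefficients unevenly. We’ll use StandardScaler to scale numerical features like ApplicantIncome, CoapplicantIncome, LoanAmount, and Loan_Amount_Term.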

In [15]:
from sklearn.preprocessing import StandardScaler

# Initialize the StandardScaler
scaler = StandardScaler()

# List of numerical features to be scaled
num_features = ['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Loan_Amount_Term']

# Scale the numerical features
df[num_features] = scaler.fit_transform(df[num_features])

# Check the scaled features
print(df[num_features].head())


   ApplicantIncome  CoapplicantIncome  LoanAmount  Loan_Amount_Term
0         0.720834          -0.704953   -0.094096          0.402895
1         0.186085          -0.170196   -0.094096          0.402895
2        -0.482562          -0.704953   -0.971380          0.402895
3        -0.658700           0.131224   -0.207294          0.402895
4         0.784615          -0.704953    0.089851          0.402895
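
The first scaled ApplicantIncome value can be reproduced by hand from the describe() output in Step 2. One detail to keep in mind: describe() reports the sample standard deviation (dividing by n - 1), while StandardScaler divides by the population standard deviation (dividing by n). The sketch below converts between the two; the variable names are illustrative.

```python
import math

n = 20                      # number of rows, from df.shape
mean = 4142.45              # ApplicantIncome mean from describe()
sample_std = 2428.969274    # ApplicantIncome std from describe() (ddof=1)
first_value = 5849          # ApplicantIncome in row 0

# Convert the sample standard deviation to the population one (ddof=0).
population_std = sample_std * math.sqrt((n - 1) / n)

# Standard score: subtract the mean, divide by the standard deviation.
z = (first_value - mean) / population_std
print(round(z, 4))
```

The result agrees with the 0.720834 printed for row 0 above, up to rounding of the inputs.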


Step 4: Build a Loan Eligibility Prediction Model

4.1 Split Data into Training and Test Sets
We’ll split the dataset into training and testing sets to train the model and evaluate its performance.

In [21]:
from sklearn.model_selection import train_test_split

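# Define the feature set (X) and target (y).
# Loan_ID is a string identifier, not a feature, and the model cannot
# convert it to a number, so it is dropped together with the target.
X = df.drop(columns=['Loan_ID', 'Loan_Status'])
y = df['Loan_Status']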

# Split the dataset into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
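
One caveat about the order of operations: Step 3.3 fitted the scaler on every row, so the test rows influenced the scaling statistics. A common alternative, sketched below on synthetic data, is to split first and fit the scaler on the training rows only. The arrays and names here (X_demo, y_demo, scaler_demo) are made up for illustration, and stratify is an optional extra that keeps the class proportions similar in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 20 rows, 2 numeric features, labels 'Y'/'N'.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=100.0, scale=20.0, size=(20, 2))
y_demo = np.array(["Y"] * 14 + ["N"] * 6)

# Split first, keeping the class proportions similar in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Fit the scaler on the training rows only, then reuse it on the test rows.
scaler_demo = StandardScaler().fit(X_tr)
X_tr_scaled = scaler_demo.transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)

print(X_tr_scaled.shape, X_te_scaled.shape)
```

With only 20 rows the practical difference is small, but the habit matters on larger datasets where leakage can inflate test scores.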




4.2 Train a Machine Learning Model (Logistic Regression)
For binary classification, we can start with Logistic Regression.

In [25]:
from sklearn.linear_model import LogisticRegression

# Initialize the Logistic Regression model
model = LogisticRegression()

# Train the model on the training data
model.fit(X_train, y_train)

# Make predictions on the test data
y_pred = model.predict(X_test)

# Mean accuracy on the test set
model.score(X_test, y_test)




0.5
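
A score of 0.5 should be read with the test-set size in mind. The arithmetic below, using the row count from df.shape and the test_size passed to train_test_split, shows how few predictions the score rests on. The variable names are illustrative.

```python
import math

n_rows = 20       # from df.shape
test_size = 0.2   # from the train_test_split call

# A fractional test_size is rounded up to a whole number of rows.
n_test = math.ceil(test_size * n_rows)

# With n_test predictions, accuracy can only take these values.
possible_accuracies = [k / n_test for k in range(n_test + 1)]

print(n_test)
print(possible_accuracies)
```

With four test rows, 0.5 means two correct predictions, and a single prediction changes the score by 0.25, so this figure says little about how the model would perform on more data.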

Step 5: Evaluate the Model
We will evaluate the model using accuracy, precision, recall, and F1-score.

In [39]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

# Calculate accuracy. The labels are the strings 'Y' and 'N', so the
# other metrics need pos_label='Y'; they stay commented out here to
# match the output shown below.
accuracy = accuracy_score(y_test, y_pred)
#precision = precision_score(y_test, y_pred, pos_label='Y')
#recall = recall_score(y_test, y_pred, pos_label='Y')
#f1 = f1_score(y_test, y_pred, pos_label='Y')

print(f"Accuracy: {accuracy:.2f}")
#print(f"Precision: {precision:.2f}")
#print(f"Recall: {recall:.2f}")
#print(f"F1 Score: {f1:.2f}")

Accuracy: 0.50
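
Precision, recall, and F1 can be computed on string labels as long as the positive class is named explicitly. The sketch below uses made-up labels, not the notebook's actual predictions, and treats 'Y' as the positive class.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Made-up labels for illustration only; 'Y' is treated as the positive class.
y_true_demo = ["Y", "Y", "Y", "N"]
y_pred_demo = ["Y", "Y", "N", "Y"]

# pos_label tells scikit-learn which string counts as the positive class.
precision_demo = precision_score(y_true_demo, y_pred_demo, pos_label="Y")
recall_demo = recall_score(y_true_demo, y_pred_demo, pos_label="Y")
f1_demo = f1_score(y_true_demo, y_pred_demo, pos_label="Y")

print(f"Precision: {precision_demo:.2f}")
print(f"Recall: {recall_demo:.2f}")
print(f"F1 Score: {f1_demo:.2f}")
```

In this toy case there are two true positives, one false positive, and one false negative, so all three metrics equal 2/3.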
