## Loan Approval Prediction

This project aims to predict loan approval using machine learning models. It involves data preprocessing, model training, evaluation, and prediction on test data. The project utilizes several Python libraries, including numpy, pandas, matplotlib, and seaborn.

### Dataset Download
The dataset can be downloaded from the following link: [Loan Approval Prediction Dataset](https://www.kaggle.com/datasets/sonalisingh1411/loan-approval-prediction?select=Training+Dataset.csv)


## Importing Necessary Libraries

In [65]:
# Importing required libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Importing Machine Learning algorithms
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

## Loading the training dataset

In [66]:
df = pd.read_csv('Loan Approval Dataset/Training Dataset.csv')

In [67]:
df.head()

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y


## Data Preprocessing

In [68]:
# Checking the number of missing values in each column
df.isnull().sum()

Loan_ID               0
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64

In [69]:
# Counting the occurrences of each unique value in the 'LoanAmount' column
df['LoanAmount'].value_counts()

120.0    20
110.0    17
100.0    15
160.0    12
187.0    12
         ..
240.0     1
214.0     1
59.0      1
166.0     1
253.0     1
Name: LoanAmount, Length: 203, dtype: int64

In [70]:
# Filling missing values in categorical columns with the mode (most frequent value)
categorical_columns = ['Gender', 'Dependents', 'Married', 'Education', 'Self_Employed', 'Loan_Amount_Term', 'Credit_History']
for column in categorical_columns:
    df[column].fillna(df[column].mode()[0], inplace=True)

In [71]:
# Filling missing values in the 'LoanAmount' column with the median value
df['LoanAmount'] = df['LoanAmount'].fillna(np.nanmedian(df['LoanAmount']))

In [72]:
df.isnull().sum()

Loan_ID              0
Gender               0
Married              0
Dependents           0
Education            0
Self_Employed        0
ApplicantIncome      0
CoapplicantIncome    0
LoanAmount           0
Loan_Amount_Term     0
Credit_History       0
Property_Area        0
Loan_Status          0
dtype: int64

In [73]:
df['Education'].value_counts()

Graduate        480
Not Graduate    134
Name: Education, dtype: int64

In [75]:
# Encoding categorical columns with numerical values using LabelEncoder
categorical_columns = ['Gender', 'Dependents', 'Married', 'Education', 'Self_Employed', 'Credit_History']
for column in categorical_columns:
    encoder = LabelEncoder()
    df[column] = encoder.fit_transform(df[column])

In [76]:
# Encoding the 'Loan_Status' column with numerical values using LabelEncoder
encoder = LabelEncoder()
df['Loan_Status'] = encoder.fit_transform(df['Loan_Status'])

In [26]:
df.head()

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,1,0,0,0,0,5849,0.0,128.0,360.0,1,Urban,1
1,LP001003,1,1,1,0,0,4583,1508.0,128.0,360.0,1,Rural,0
2,LP001005,1,1,0,0,1,3000,0.0,66.0,360.0,1,Urban,1
3,LP001006,1,1,0,1,0,2583,2358.0,120.0,360.0,1,Urban,1
4,LP001008,1,0,0,0,0,6000,0.0,141.0,360.0,1,Urban,1


In [27]:
df['Property_Area'].value_counts()

Semiurban    233
Urban        202
Rural        179
Name: Property_Area, dtype: int64

In [77]:
# Converting the 'Property_Area' column to dummy variables and concatenate them with the original DataFrame
dummy_data = pd.get_dummies(df['Property_Area'])
df = pd.concat([df, dummy_data], axis=1)

# Droping the original 'Property_Area' column from the DataFrame
df.drop(['Property_Area'], axis=1, inplace=True)

# Removing the 'Loan_ID' column from the DataFrame
df.drop(['Loan_ID'], axis=1, inplace=True)

In [78]:
df.head()

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Loan_Status,Rural,Semiurban,Urban
0,1,0,0,0,0,5849,0.0,128.0,360.0,1,1,0,0,1
1,1,1,1,0,0,4583,1508.0,128.0,360.0,1,0,1,0,0
2,1,1,0,0,1,3000,0.0,66.0,360.0,1,1,0,0,1
3,1,1,0,1,0,2583,2358.0,120.0,360.0,1,1,0,0,1
4,1,0,0,0,0,6000,0.0,141.0,360.0,1,1,0,0,1


## Dividing the dataset into features and target variable

In [79]:
# Separating features (X) and target (y)
X = df.drop(['Loan_Status'], axis=1)  
y = df['Loan_Status'] 

# Splitting data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=300)

In [80]:
# Transforming test data using the scaler fitted on training data
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

## Model Evaluation

In [81]:
# Initializing ML models
models = {
    'Logistic Regression': LogisticRegression(),
    'Decision Tree': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier(),
    'SVM': SVC(),
    'K-Nearest Neighbors': KNeighborsClassifier(),
    'Naive Bayes': GaussianNB()
}


In [82]:
# Training and evaluating models
results = {}
for model_name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    report = classification_report(y_test, y_pred, output_dict=True)
    results[model_name] = {
        'Accuracy': accuracy,
        'Precision': report['weighted avg']['precision'],
        'Recall': report['weighted avg']['recall'],
        'F1-Score': report['weighted avg']['f1-score']
    }

  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)


In [83]:
# Converting results to DataFrame for better display
results_df = pd.DataFrame(results).T
results_df = results_df.reset_index().rename(columns={'index': 'ML Algorithm'})

results_df

Unnamed: 0,ML Algorithm,Accuracy,Precision,Recall,F1-Score
0,Logistic Regression,0.869919,0.879855,0.869919,0.856353
1,Decision Tree,0.699187,0.741644,0.699187,0.713011
2,Random Forest,0.796748,0.788044,0.796748,0.790853
3,SVM,0.869919,0.879855,0.869919,0.856353
4,K-Nearest Neighbors,0.829268,0.822728,0.829268,0.824316
5,Naive Bayes,0.837398,0.830995,0.837398,0.826577


## Making Predictions on Test Data

In [84]:
test_df = pd.read_csv('Loan Approval Dataset/Test Dataset.csv')

for column in ['Gender', 'Dependents', 'Married', 'Education', 'Self_Employed', 'Loan_Amount_Term', 'Credit_History']:
    test_df[column].fillna(test_df[column].mode()[0], inplace=True)
test_df['LoanAmount'] = test_df['LoanAmount'].fillna(np.nanmedian(test_df['LoanAmount']))

for column in ['Gender', 'Dependents', 'Married', 'Education', 'Self_Employed', 'Credit_History']:
    encoder = LabelEncoder()
    test_df[column] = encoder.fit_transform(test_df[column])
    
dummy_data = pd.get_dummies(test_df['Property_Area'])
test_df = pd.concat([test_df, dummy_data], axis=1)
test_df.drop(['Property_Area'], axis=1, inplace=True)

loan_ids = test_df['Loan_ID']
test_df.drop(['Loan_ID'], axis=1, inplace=True)

scaler = StandardScaler()
scaler.fit(X_train)  

test_scaled_features = scaler.transform(test_df)

best_model = models['Logistic Regression']
test_predictions = best_model.predict(test_scaled_features)

# Converting predictions to DataFrame and saving to CSV file
predictions_df = pd.DataFrame({
    'Loan_ID': loan_ids,
    'Loan_Status': test_predictions
})
predictions_df.to_csv('Predicted_Loan_Status.csv', index=False)



In [85]:
predictions_df

Unnamed: 0,Loan_ID,Loan_Status
0,LP001015,1
1,LP001022,0
2,LP001031,1
3,LP001035,0
4,LP001051,1
...,...,...
362,LP002971,0
363,LP002975,1
364,LP002980,0
365,LP002986,0
