## Loan Approval Prediction
This project uses a loan approval dataset to build a predictive model that classifies loan applications as approved or rejected. Along the way, we compare several machine learning models and tune their hyperparameters to improve accuracy.

### Project Overview

Goal: To build a predictive model for loan approval using key features of applicants, such as income, education, credit history, and more.

Dataset: The loan dataset contains 614 entries and 13 columns: loan ID, gender, marital status, number of dependents, education level, self-employment status, applicant income, co-applicant income, loan amount, loan term, credit history, property area, and loan status (the target variable).

### Project Steps

1. Data Preparation:
   - Load the dataset.
   - Drop the `Loan_ID` column, as it is non-informative for prediction.
   - Remove any rows with missing values to ensure data quality.
   - Convert categorical variables to numerical using dummy variables.
2. Feature and Target Variables:
   - Set the feature matrix `X` by excluding the `Loan_Status` column.
   - Define the target variable `y` as `Loan_Status`.
3. Data Splitting:
   - Split the data into training and testing sets with an 80-20 ratio.
4. Model Pipeline:
   - Create a pipeline with a `MinMaxScaler` for feature scaling and a classifier.
   - Initially, use K-Nearest Neighbors (KNN) as the classifier.
5. Model Training:
   - Train a default KNN classifier using the pipeline and evaluate accuracy on the test set.
6. Hyperparameter Tuning:
   - Define a search space for the KNN classifier, varying `n_neighbors` from 1 to 10.
   - Use `GridSearchCV` with 5-fold cross-validation to find the optimal `n_neighbors` value.
7. Expanded Model Search:
   - Expand the search space to include other models like Logistic Regression and Random Forest.
   - Run another grid search to identify the best-performing model and its optimal hyperparameters.
8. Result Analysis:
   - Compare model performances and analyze the impact of hyperparameter tuning on accuracy.
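
The data preparation and feature/target steps above can be sketched on a tiny, made-up table. This is only an illustration under stated assumptions: the `toy` frame and its values are hypothetical and are not rows from the real dataset. It shows how `dropna` removes incomplete rows and how `get_dummies(..., drop_first=True)` turns `Loan_Status` into a single `Loan_Status_Y` column.

```python
import pandas as pd

# Hypothetical miniature of the loan data; the values are invented for illustration.
toy = pd.DataFrame({
    "Loan_ID": ["LP1", "LP2", "LP3", "LP4"],
    "Education": ["Graduate", "Not Graduate", "Not Graduate", "Graduate"],
    "ApplicantIncome": [5000, 3000, 4000, 6000],
    "LoanAmount": [120.0, None, 100.0, 150.0],  # one missing value
    "Loan_Status": ["Y", "N", "N", "Y"],
})

# Step 1: drop the identifier, drop incomplete rows, then one-hot encode.
prepared = toy.drop(columns="Loan_ID").dropna()
prepared = pd.get_dummies(prepared, drop_first=True)

# Step 2: separate the features from the encoded target column.
X = prepared.drop(columns="Loan_Status_Y")
y = prepared["Loan_Status_Y"]

print(prepared.columns.tolist())
print(X.shape, y.shape)
```

With `drop_first=True`, each categorical column loses its first category, so a two-level column such as `Loan_Status` becomes one indicator column rather than two.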

### Results Summary

Through iterative modeling and tuning, test accuracy rose from 78.12% (default KNN) to 79.17% (KNN with `n_neighbors=3`) and then to 82.29% with the model chosen by the expanded search, a Logistic Regression with C ≈ 21.54. This project demonstrates the importance of model selection and hyperparameter tuning in predictive performance.


# Loan Approval Prediction

## Project Description

This project aims to predict loan approvals using a dataset of applicant information, employing various machine learning techniques. By building a pipeline with multiple models and tuning their hyperparameters, we identified the optimal model to enhance predictive accuracy.

## Dataset

The dataset contains information on 614 loan applicants in 13 columns, including:
- **Applicant Income**
- **Co-Applicant Income**
- **Loan Amount**
- **Credit History**
- **Education**

The target variable, `Loan_Status`, indicates loan approval or rejection.

## Installation and Requirements

To run this project, you will need:
- Python 3.7+
- `pandas`, `numpy`, `scikit-learn`

Install the necessary libraries using:
```bash
pip install pandas numpy scikit-learn
```


## Project Workflow

1. Data Preprocessing: Load data, clean, and encode categorical features.
2. Modeling: Use a pipeline with MinMaxScaler and KNN, followed by Logistic Regression and Random Forest.
3. Hyperparameter Tuning: Tune parameters for optimal model selection.
4. Evaluation: Evaluate models and choose the best based on accuracy.

## Usage

1. Run the code cells in sequence in a Python environment (Jupyter Notebook or similar).
2. Review the output to understand model performance.

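## Results

The expanded grid search selected Logistic Regression (C ≈ 21.54) over KNN and Random Forest. It reached 82.29% accuracy on the test set, compared with 78.12% for the default KNN and 79.17% for KNN with the tuned `n_neighbors`.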

## License

This project is open-source under the MIT License.

In [1]:
# Import Libraries

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

In [5]:
# Load dataset
df_loan = pd.read_csv('Loan_Train.csv')
print(f"Dataset Shape: {df_loan.shape}")
df_loan.head()

Dataset Shape: (614, 13)


Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y


In [6]:
# Drop Loan_ID column (an identifier with no predictive value)
df_loan = df_loan.drop(columns='Loan_ID')

# Drop rows with missing data
df_loan = df_loan.dropna()

# Convert categorical features to dummy variables
df_loan = pd.get_dummies(df_loan, drop_first=True)
print(f"Dataset Shape after Cleaning: {df_loan.shape}")
df_loan.head()

Dataset Shape after Cleaning: (480, 15)


Unnamed: 0,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Gender_Male,Married_Yes,Dependents_1,Dependents_2,Dependents_3+,Education_Not Graduate,Self_Employed_Yes,Property_Area_Semiurban,Property_Area_Urban,Loan_Status_Y
1,4583,1508.0,128.0,360.0,1.0,True,True,True,False,False,False,False,False,False,False
2,3000,0.0,66.0,360.0,1.0,True,True,False,False,False,False,True,False,True,True
3,2583,2358.0,120.0,360.0,1.0,True,True,False,False,False,True,False,False,True,True
4,6000,0.0,141.0,360.0,1.0,True,False,False,False,False,False,False,False,True,True
5,5417,4196.0,267.0,360.0,1.0,True,True,False,True,False,False,True,False,True,True
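
Dropping incomplete rows reduced the data from 614 to 480 rows, so 134 applications were discarded. Before calling `dropna`, it can help to see which columns are responsible. The snippet below is a minimal sketch on a made-up frame (the `toy` name and its values are assumptions, not the real data) that counts missing values per column and reports how many rows a plain `dropna()` would remove.

```python
import pandas as pd

# Hypothetical frame with one gap in each column; not taken from the real dataset.
toy = pd.DataFrame({
    "Gender": ["Male", None, "Female", "Male"],
    "LoanAmount": [None, 128.0, 66.0, 120.0],
    "Credit_History": [1.0, 1.0, None, 1.0],
})

# Number of missing entries in each column.
missing_per_column = toy.isna().sum()

# Rows that survive when every row containing a missing value is removed.
rows_before = len(toy)
rows_after = len(toy.dropna())

print(missing_per_column)
print(f"Rows dropped: {rows_before - rows_after} of {rows_before}")
```

Because the gaps fall in different rows, three of the four toy rows are lost even though each column is missing only one value. The same effect may explain part of the row loss in the real data.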


In [7]:
# Separate the features from the encoded target column
x = df_loan.drop('Loan_Status_Y', axis=1)
y = df_loan['Loan_Status_Y']

# Split into train and test sets (80/20)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

print('Records in each set:\n','*'*15)
print(f'x train: {x_train.shape}\ny train: {y_train.shape}\nx test: {x_test.shape}\ny test: {y_test.shape}')

Records in each set:
 ***************
x train: (384, 14)
y train: (384,)
x test: (96, 14)
y test: (96,)


In [7]:
# Superseded by the next cell, which renames the scaler step to 'scaler'
pipeline = Pipeline([('min_max_scaler', MinMaxScaler()),
                     ('knn', KNeighborsClassifier())])

In [9]:
# Create a pipeline with MinMaxScaler and KNN
pipeline = Pipeline([
    ('scaler', MinMaxScaler()),
    ('knn', KNeighborsClassifier())
])

# Fit pipeline to the training data
pipeline.fit(x_train, y_train)

# Predict and evaluate accuracy
y_pred = pipeline.predict(x_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Default KNN Model Accuracy: {accuracy * 100:.2f}%")

Default KNN Model Accuracy: 78.12%


In [11]:
# Define parameter grid for KNN
param_grid = {'knn__n_neighbors': range(1, 11)}

# Perform grid search with 5-fold cross-validation
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy')
grid_search.fit(x_train, y_train)

# Best parameters and accuracy
print(f"Best n_neighbors: {grid_search.best_params_}")
best_knn_accuracy = accuracy_score(y_test, grid_search.predict(x_test))
print(f"KNN Best Model Accuracy: {best_knn_accuracy * 100:.2f}%")

Best n_neighbors: {'knn__n_neighbors': 3}
KNN Best Model Accuracy: 79.17%
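
The accuracy printed above is measured on the held-out test set, while the grid search itself picks `n_neighbors` by mean cross-validated accuracy on the training data. The sketch below shows one way to look at those cross-validation scores through `cv_results_` and `best_score_`. It runs on synthetic data from `make_classification`, and the `demo_` names are placeholders, so the numbers it prints say nothing about the loan dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the loan features (assumption: not the real data).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

demo_pipeline = Pipeline([
    ("scaler", MinMaxScaler()),
    ("knn", KNeighborsClassifier()),
])
demo_search = GridSearchCV(
    demo_pipeline,
    {"knn__n_neighbors": range(1, 11)},
    cv=5,
    scoring="accuracy",
)
demo_search.fit(X_demo, y_demo)

# One row per candidate value, ordered from best to worst mean CV accuracy.
cv_table = (
    pd.DataFrame(demo_search.cv_results_)[
        ["param_knn__n_neighbors", "mean_test_score", "std_test_score", "rank_test_score"]
    ]
    .sort_values("rank_test_score")
    .reset_index(drop=True)
)

print(cv_table)
print(f"Best mean CV accuracy: {demo_search.best_score_:.3f}")
```

Comparing `mean_test_score` with `std_test_score` gives a rough sense of whether the winning value is clearly better or only marginally ahead of its neighbors.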


In [13]:
best_model_accuracy = accuracy_score(y_test, grid_search.predict(x_test))
print(f"Best Model Accuracy: {best_model_accuracy*100:.2f}%")

Best Model Accuracy: 79.17%


In [13]:
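# Expanded search space with Logistic Regression and Random Forest
# The pipeline step keeps the name 'knn', but each dictionary swaps a different classifier into that step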
search_space = [
    {'scaler': [MinMaxScaler()], 'knn': [KNeighborsClassifier()], 'knn__n_neighbors': range(1, 11)},
    {'scaler': [MinMaxScaler()], 'knn': [LogisticRegression()], 'knn__C': np.logspace(-4, 4, 4)},
    {'scaler': [MinMaxScaler()], 'knn': [RandomForestClassifier()], 'knn__n_estimators': [10, 50, 100]}
]

# Expanded grid search with 5-fold cross-validation
expanded_grid_search = GridSearchCV(Pipeline([('scaler', MinMaxScaler()), ('knn', KNeighborsClassifier())]),
                                    search_space, cv=5, scoring='accuracy', n_jobs=-1)
expanded_grid_search.fit(x_train, y_train)

# Best model and accuracy
print(f"Best Model: {expanded_grid_search.best_estimator_}")
expanded_best_accuracy = accuracy_score(y_test, expanded_grid_search.predict(x_test))
print(f"Expanded Grid Search Best Model Accuracy: {expanded_best_accuracy * 100:.2f}%")

Best Model: Pipeline(steps=[('scaler', MinMaxScaler()),
                ('knn', LogisticRegression(C=21.54434690031882))])
Expanded Grid Search Best Model Accuracy: 82.29%
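
The expanded search works because `GridSearchCV` accepts a list of dictionaries, and a pipeline step name used as a key replaces the whole estimator in that step. Here the grids contain 10, 4, and 3 candidates, so 17 candidates are each fitted 5 times. The sketch below repeats the idea on synthetic data with a neutral step name, `classifier`, so the printed pipeline does not label a Logistic Regression as `knn`. The step name, the `max_iter=1000` setting, and the `random_state` values are assumptions added for this illustration and are not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the loan features (assumption: not the real data).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

model_pipeline = Pipeline([
    ("scaler", MinMaxScaler()),
    ("classifier", KNeighborsClassifier()),  # placeholder, replaced by each grid
])

# Each dictionary swaps one estimator into the 'classifier' step and tunes it.
model_space = [
    {"classifier": [KNeighborsClassifier()],
     "classifier__n_neighbors": range(1, 11)},
    {"classifier": [LogisticRegression(max_iter=1000)],
     "classifier__C": np.logspace(-4, 4, 4)},
    {"classifier": [RandomForestClassifier(random_state=0)],
     "classifier__n_estimators": [10, 50, 100]},
]

model_search = GridSearchCV(model_pipeline, model_space, cv=5, scoring="accuracy")
model_search.fit(X_demo, y_demo)

# Name of the winning estimator class and the total number of candidates tried.
best_name = type(model_search.best_estimator_.named_steps["classifier"]).__name__
n_candidates = len(model_search.cv_results_["params"])

print(best_name, model_search.best_params_)
print(f"Candidates evaluated: {n_candidates}")
```

Fixing `random_state` on the Random Forest makes repeated runs comparable, which matters when candidates score within a point or two of each other.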


In [15]:
# Summary of Results
print(f"Default KNN Accuracy: {accuracy * 100:.2f}%")
print(f"KNN with Best n_neighbors: {best_knn_accuracy * 100:.2f}%")
print(f"Best Model from Expanded Search: {expanded_grid_search.best_estimator_}")
print(f"Accuracy of Best Model from Expanded Search: {expanded_best_accuracy * 100:.2f}%")

Default KNN Accuracy: 78.12%
KNN with Best n_neighbors: 79.17%
Best Model from Expanded Search: Pipeline(steps=[('scaler', MinMaxScaler()),
                ('knn', LogisticRegression(C=21.54434690031882))])
Accuracy of Best Model from Expanded Search: 82.29%
