## Project Overview: Loan Approval Prediction Model

This project aims to build a machine learning model to predict the approval status of a loan application based on various demographic and financial attributes of the applicants. The dataset includes features such as the applicant's income, credit history, employment status, loan amount, and more. By leveraging this data, the goal is to help financial institutions or lenders assess the likelihood of loan approval more efficiently and accurately.

The process includes:
- **Data Ingestion**: Connecting to a MongoDB database and importing the data into a working environment.
- **Data Exploration and Preprocessing**: Cleaning and transforming the dataset, handling missing values, encoding categorical features, and preparing the data for modeling.
- **Modeling**: Selecting and training machine learning models to predict loan approvals.
- **Evaluation**: Evaluating each model's performance using accuracy on a held-out test set and mean accuracy over 5-fold cross-validation.
- **Potential Future Improvements**: Suggestions for enhancing the model, such as trying advanced algorithms, adding more features, or performing hyperparameter tuning.

By the end of this notebook, a trained machine learning model will be able to classify loan applications into approved or not approved, providing insights into the key factors driving these decisions.

---

## 1. Data Description

- **Gender**: 0 for Female, 1 for Male.
- **Married**: 0 for No, 1 for Yes.
- **Dependents**: Number of dependents (0, 1, 2, or 3+).
- **Education**: 1 for Graduate, 0 for Not Graduate.
- **Self_Employed**: 0 for No, 1 for Yes.
- **ApplicantIncome**: The applicant's income.
- **CoapplicantIncome**: Income of the co-applicant (if any).
- **LoanAmount**: Amount of the loan.
- **Loan_Amount_Term**: The term of the loan in months.
- **Credit_History**: 0 for No credit history, 1 for Good credit history.
- **Property_Area**: 0 for Rural, 1 for Urban, 2 for Semi-Urban.

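The values above are the encodings applied during preprocessing; the raw data stores most of these fields as strings. The target variable, **Loan_Status**, records whether the loan is approved (1 for approved, 0 for not approved).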

---

## 2. Data Ingestion

In this section, we connect to a MongoDB Atlas database to retrieve the loan application dataset. The data is stored in a MongoDB collection and is imported into the environment using the `pymongo` library.

The steps involved in data ingestion:
- **Connecting to MongoDB**: We establish a connection to the MongoDB cluster using the `MongoClient` class.
- **Retrieving the Dataset**: The data is fetched from the MongoDB collection and converted into a Pandas DataFrame for easy manipulation and analysis.
- **Initial Data Inspection**: We will inspect the dataset by viewing its structure, checking for missing values, and understanding the types of features present.

The `mongodb+srv://` connection string enables TLS by default, so traffic between the notebook and the Atlas cluster is encrypted.
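The retrieval step can also leave out MongoDB's `_id` field at query time instead of dropping it from the DataFrame later. The sketch below is only an illustration: `fetch_frame` is a hypothetical helper, and `FakeCollection` is an in-memory stand-in (not part of `pymongo`) so the example runs without a database. With a real collection, `find(filter, projection)` is the call being imitated.

```python
import pandas as pd


def fetch_frame(collection, query=None):
    """Load documents from a pymongo-style collection into a DataFrame.

    The projection {"_id": 0} asks MongoDB to omit the ObjectId field,
    so it does not need to be dropped from the DataFrame afterwards.
    """
    cursor = collection.find(query or {}, {"_id": 0})
    return pd.DataFrame(list(cursor))


class FakeCollection:
    """Hypothetical in-memory stand-in for a pymongo Collection (demo only)."""

    def __init__(self, docs):
        self._docs = docs

    def find(self, query, projection):
        # Handles only the exclusion projection used above; the query is ignored.
        excluded = {key for key, keep in projection.items() if not keep}
        return [{k: v for k, v in doc.items() if k not in excluded}
                for doc in self._docs]


docs = [
    {"_id": "a1", "Loan_ID": "LP001002", "ApplicantIncome": 5849},
    {"_id": "a2", "Loan_ID": "LP001003", "ApplicantIncome": 4583},
]
frame = fetch_frame(FakeCollection(docs))
print(frame.columns.tolist())
```

Passing the real `collection` object from the cell below in place of `FakeCollection(docs)` should behave the same way, assuming the documents share a flat structure.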

In [2]:
pip install "pymongo[srv]"

Collecting pymongo[srv]
  Downloading pymongo-4.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (22 kB)
Collecting dnspython<3.0.0,>=1.16.0 (from pymongo[srv])
  Downloading dnspython-2.6.1-py3-none-any.whl.metadata (5.8 kB)
Downloading dnspython-2.6.1-py3-none-any.whl (307 kB)
Downloading pymongo-4.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Installing collected packages: dnspython, pymongo
Successfully installed dnspython-2.6.1 pymongo-4.8.0
Note: you may need to restart the kernel to use updated packages.


In [41]:
import os

from pymongo import MongoClient
import pandas as pd

# Read the Atlas connection string from an environment variable instead of
# hard-coding credentials in the notebook (MONGODB_URI is an assumed name).
connection_string = os.environ["MONGODB_URI"]

# Connect to MongoDB Atlas using pymongo
client = MongoClient(connection_string)
db = client['my_database']  
collection = db['my_collection']

# Find all data within the collection
data = list(collection.find())
df = pd.DataFrame(data)
df.head(5)

Unnamed: 0,_id,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status,LoanAmount
0,66de3998c9b26f1e594e9fd5,LP001002,Male,No,0,Graduate,No,5849,0.0,360.0,1.0,Urban,Y,
1,66de3998c9b26f1e594e9fd6,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,360.0,1.0,Rural,N,128.0
2,66de3998c9b26f1e594e9fd7,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,360.0,1.0,Urban,Y,66.0
3,66de3998c9b26f1e594e9fd8,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,360.0,1.0,Urban,Y,120.0
4,66de3998c9b26f1e594e9fd9,LP001008,Male,No,0,Graduate,No,6000,0.0,360.0,1.0,Urban,Y,141.0


##### **Check the final five rows of the dataset.**

In [42]:
#Print the last 5 rows
df.tail(5)

Unnamed: 0,_id,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status,LoanAmount
609,66de3998c9b26f1e594ea236,LP002978,Female,No,0,Graduate,No,2900,0.0,360.0,1.0,Rural,Y,71.0
610,66de3998c9b26f1e594ea237,LP002979,Male,Yes,3+,Graduate,No,4106,0.0,180.0,1.0,Rural,Y,40.0
611,66de3998c9b26f1e594ea238,LP002983,Male,Yes,1,Graduate,No,8072,240.0,360.0,1.0,Urban,Y,253.0
612,66de3998c9b26f1e594ea239,LP002984,Male,Yes,2,Graduate,No,7583,0.0,360.0,1.0,Urban,Y,187.0
613,66de3998c9b26f1e594ea23a,LP002990,Female,No,0,Graduate,Yes,4583,0.0,360.0,0.0,Semiurban,N,133.0


---
## 3. Data Exploration and Preprocessing

In this section, we explore the dataset and perform necessary preprocessing steps to prepare the data for model training. Proper exploration and preprocessing help ensure the data is clean and structured in a way that maximizes the model's performance.

### Key Steps:

1. **Viewing Dataset Structure**: 
   - We start by using functions like `df.info()`, `df.head()`, and `df.describe()` to get an overview of the data, including data types, column names, and a summary of numerical values.
   
2. **Handling Missing Values**:
   - Missing values can negatively impact model performance, so we identify and handle them. This may involve filling missing values with mean/median values or dropping rows/columns with too many missing values.

3. **Encoding Categorical Variables**:
   - Many machine learning algorithms cannot handle categorical data directly. Therefore, categorical features (e.g., **Gender**, **Married**) are encoded as numerical values. Techniques like one-hot encoding or label encoding are applied depending on the model and the feature.

4. **Feature Scaling**:
   - Scaling numerical features (e.g., **ApplicantIncome**, **LoanAmount**) ensures that no single feature dominates others during model training. We use techniques like standardization or normalization as needed.

By the end of this step, the dataset will be clean and ready for training a machine learning model.
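As a small illustration of the encoding step, the sketch below one-hot encodes a three-level feature with `pd.get_dummies`. The toy rows are made up for the example; the notebook itself maps `Property_Area` to integers later on, so this shows an alternative, not the method used here.

```python
import pandas as pd

# Toy rows for illustration only (not taken from the loan dataset).
toy = pd.DataFrame({
    "Property_Area": ["Urban", "Rural", "Semiurban", "Urban"],
    "ApplicantIncome": [5849, 4583, 3000, 2583],
})

# One 0/1 column per category, so no artificial ordering is implied
# between Rural, Semiurban, and Urban.
encoded = pd.get_dummies(toy, columns=["Property_Area"], dtype=int)
print(encoded)
```

One-hot encoding tends to suit linear models, which would otherwise read an integer code as an ordered quantity; tree-based models are usually less sensitive to the choice.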


##### **Determine the dimensions of the dataset, including both the total rows and columns.**

In [43]:
df.shape

(614, 14)

In [44]:
# No of rows and columns
print("Number of Rows: ",df.shape[0])
print("Number of Columns: ",df.shape[1])

Number of Rows:  614
Number of Columns:  14


##### **Retrieve comprehensive details about our dataset, such as the total row count, column count, data types for each column, and memory usage.**

In [45]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 14 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   _id                614 non-null    object 
 1   Loan_ID            614 non-null    object 
 2   Gender             601 non-null    object 
 3   Married            611 non-null    object 
 4   Dependents         599 non-null    object 
 5   Education          614 non-null    object 
 6   Self_Employed      582 non-null    object 
 7   ApplicantIncome    614 non-null    int64  
 8   CoapplicantIncome  614 non-null    float64
 9   Loan_Amount_Term   600 non-null    float64
 10  Credit_History     564 non-null    float64
 11  Property_Area      614 non-null    object 
 12  Loan_Status        614 non-null    object 
 13  LoanAmount         592 non-null    float64
dtypes: float64(4), int64(1), object(9)
memory usage: 67.3+ KB


##### **Check Null Values In The Dataset**

In [46]:
# Dropping _id column and printing sum of null values for each column
df.drop('_id', axis=1, inplace=True)
df.isnull().sum()


Loan_ID               0
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
LoanAmount           22
dtype: int64

In [47]:
# Missing percentage per column
df.isnull().sum()*100 / len(df)

Loan_ID              0.000000
Gender               2.117264
Married              0.488599
Dependents           2.442997
Education            0.000000
Self_Employed        5.211726
ApplicantIncome      0.000000
CoapplicantIncome    0.000000
Loan_Amount_Term     2.280130
Credit_History       8.143322
Property_Area        0.000000
Loan_Status          0.000000
LoanAmount           3.583062
dtype: float64

##### **Handling the Missing Values**

In [48]:
df.drop('Loan_ID', axis=1, inplace=True)

In [49]:
df.head(1)

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status,LoanAmount
0,Male,No,0,Graduate,No,5849,0.0,360.0,1.0,Urban,Y,


In [50]:
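# Drop rows with missing values in columns whose missing percentage is below 5%
# (rows missing 'Married' are removed by this step as well, as the output below shows)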

columns = ['Gender','Dependents','LoanAmount','Loan_Amount_Term']
df = df.dropna(subset=columns)
df.isnull().sum()*100 / len(df)

Gender               0.000000
Married              0.000000
Dependents           0.000000
Education            0.000000
Self_Employed        5.424955
ApplicantIncome      0.000000
CoapplicantIncome    0.000000
Loan_Amount_Term     0.000000
Credit_History       8.679928
Property_Area        0.000000
Loan_Status          0.000000
LoanAmount           0.000000
dtype: float64

- After dropping those rows, only `Self_Employed` and `Credit_History` still contain missing values. Because more than 5% of their entries are missing, dropping the affected rows would discard too much data, so we impute the missing values with each column's most frequent value (the mode) instead.

In [51]:
print(df['Self_Employed'].unique())
print(df['Credit_History'].unique())

['No' 'Yes' nan]
[ 1.  0. nan]


In [52]:
print(df['Credit_History'].mode()[0])
print(df['Self_Employed'].mode()[0])

1.0
No


In [53]:
df['Self_Employed'] = df['Self_Employed'].fillna(df['Self_Employed'].mode()[0])
df['Credit_History'] = df['Credit_History'].fillna(df['Credit_History'].mode()[0])

##### **All missing values are handled.**

In [54]:
df.isnull().sum()*100 / len(df)

Gender               0.0
Married              0.0
Dependents           0.0
Education            0.0
Self_Employed        0.0
ApplicantIncome      0.0
CoapplicantIncome    0.0
Loan_Amount_Term     0.0
Credit_History       0.0
Property_Area        0.0
Loan_Status          0.0
LoanAmount           0.0
dtype: float64

##### **Managing categorical columns efficiently.**

In [55]:
df.sample(5)

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status,LoanAmount
98,Male,Yes,0,Not Graduate,No,4188,0.0,180.0,1.0,Semiurban,Y,115.0
518,Male,No,0,Graduate,No,4683,1915.0,360.0,1.0,Semiurban,N,185.0
65,Male,Yes,0,Graduate,No,5726,4595.0,360.0,1.0,Semiurban,N,258.0
604,Female,Yes,1,Graduate,No,12000,0.0,360.0,1.0,Semiurban,Y,496.0
407,Female,No,0,Not Graduate,No,2213,0.0,360.0,1.0,Rural,Y,66.0


In [56]:
# Replace the "3+" category with 3 and cast explicitly to int, rather than
# relying on pandas' deprecated implicit downcasting in replace()
df['Dependents'] = df['Dependents'].replace(to_replace="3+", value=3).astype(int)


In [57]:
df['Dependents'].unique()

array([1, 0, 2, 3])

---

## 4. Modeling

In this section, we select and train a machine learning model to predict loan approval status based on the preprocessed dataset. The choice of model is critical, and for this task, we will begin with a simple and interpretable algorithm, such as **Logistic Regression**.

### Key Steps:

1. **Model Selection**:
   - **Logistic Regression** was chosen for its simplicity, interpretability, and suitability for binary classification problems like loan approval.
   - Logistic Regression provides insight into the contribution of each feature to the final prediction, making it a good first choice for understanding the data.

2. **Model Training**:
   - The preprocessed dataset is split into training and testing sets, allowing the model to be trained on one portion of the data and evaluated on another.
   - The `fit()` function is used to train the model on the training data, learning the relationship between input features and the target variable (loan approval status).

3. **Evaluation**:
   - After training the model, we evaluate it using **accuracy** on the held-out test set and the mean accuracy over **5-fold cross-validation**. Together these indicate how well the model generalizes to unseen data.

### Why Logistic Regression?

Logistic Regression is computationally efficient and easy to interpret, making it an ideal baseline model for binary classification tasks. If needed, more complex models such as **Random Forest** or **Gradient Boosting** could be used in future iterations to improve performance.

By the end of this section, we will have a trained model capable of predicting whether a loan will be approved based on the provided features.
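Accuracy alone can hide how a classifier treats each class, which matters for loan decisions where wrongly approving and wrongly rejecting carry different costs. The sketch below computes precision, recall, and the confusion matrix with scikit-learn on a handful of made-up labels (1 = approved, 0 = not approved); it is an illustration and not output from this notebook's models.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Hypothetical true labels and predictions, for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 1, 0, 1, 0]

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)  # of predicted approvals, share that were correct
recall = recall_score(y_true, y_pred)        # of actual approvals, share that were found

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
print(f"tn={tn} fp={fp} fn={fn} tp={tp}")
```

The same calls could be applied to `y_test` and `y_pred` inside the evaluation helper defined later in this section.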


##### **Binary encoding.**

In [None]:
# Encode categorical variables as integers using map
# (Property_Area has three levels; the other columns are binary)
df['Gender'] = df['Gender'].map({'Male': 1, 'Female': 0}).astype('int')
df['Married'] = df['Married'].map({'Yes': 1, 'No': 0}).astype('int')
df['Education'] = df['Education'].map({'Graduate': 1, 'Not Graduate': 0}).astype('int')
df['Self_Employed'] = df['Self_Employed'].map({'Yes': 1, 'No': 0}).astype('int')
df['Loan_Status'] = df['Loan_Status'].map({'Y': 1, 'N': 0}).astype('int')
df['Property_Area'] = df['Property_Area'].map({'Rural':0,'Semiurban':2,'Urban':1}).astype('int')

In [62]:
df.head()

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status,LoanAmount
1,1,1,1,1,0,4583,1508.0,360.0,1.0,0,0,128.0
2,1,1,0,1,1,3000,0.0,360.0,1.0,1,1,66.0
3,1,1,0,0,0,2583,2358.0,360.0,1.0,1,1,120.0
4,1,0,0,1,0,6000,0.0,360.0,1.0,1,1,141.0
5,1,1,2,1,1,5417,4196.0,360.0,1.0,1,1,267.0


##### **Save the feature matrix in variable X and the target response in variable y.**

In [64]:
X = df.drop('Loan_Status', axis=1)
y = df['Loan_Status']

##### **Feature Scaling.**

In [65]:
X.head()

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,Loan_Amount_Term,Credit_History,Property_Area,LoanAmount
1,1,1,1,1,0,4583,1508.0,360.0,1.0,0,128.0
2,1,1,0,1,1,3000,0.0,360.0,1.0,1,66.0
3,1,1,0,0,0,2583,2358.0,360.0,1.0,1,120.0
4,1,0,0,1,0,6000,0.0,360.0,1.0,1,141.0
5,1,1,2,1,1,5417,4196.0,360.0,1.0,1,267.0


In [66]:
# making a list of columns that we need to scale
cols = ['ApplicantIncome','CoapplicantIncome','LoanAmount','Loan_Amount_Term']

In [67]:
from sklearn.preprocessing import StandardScaler

st = StandardScaler()
X[cols] = st.fit_transform(X[cols])

In [68]:
X.head()

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,Loan_Amount_Term,Credit_History,Property_Area,LoanAmount
1,1,1,1,1,0,-0.128694,-0.049699,0.279961,1.0,0,-0.214368
2,1,1,0,1,1,-0.394296,-0.545638,0.279961,1.0,1,-0.952675
3,1,1,0,0,0,-0.464262,0.229842,0.279961,1.0,1,-0.309634
4,1,0,0,1,0,0.109057,-0.545638,0.279961,1.0,1,-0.059562
5,1,1,2,1,1,0.011239,0.834309,0.279961,1.0,1,1.440866
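Note that the scaler above is fitted on all rows before any train/test split, so the mean and standard deviation of rows later used for testing influence the transformation. One common way to avoid this is to put the scaler inside a scikit-learn pipeline, so it is refitted on the training portion of every split. The sketch below uses synthetic data from `make_classification` as a stand-in for the loan features; the names `X_demo`, `y_demo`, and `pipe` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the loan dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

# The scaler is fitted only on the training folds inside each CV split.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X_demo, y_demo, cv=5)

print(scores.mean())
```

On a dataset of this size the effect on the reported scores is likely small, but the pipeline also guarantees that new applicants are scaled with the same statistics at prediction time.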


In [69]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score
import numpy as np

In [70]:
model_df = {}

def model_val(model,X,y):
    # split the dataset into training (80%) and test (20%) sets
    X_train,X_test,y_train,y_test = train_test_split(X,y,
                                                   test_size=0.20,
                                                   random_state=42)
    
    # train the model
    model.fit(X_train, y_train)
    
    # predict on the held-out test set
    y_pred = model.predict(X_test)
    
    # accuracy on the held-out test set
    print(f"{model} accuracy is {accuracy_score(y_test,y_pred)}")
    
    # Cross-validation gives a fairer comparison between algorithms: every row is
    # used for testing exactly once, and the per-fold scores are averaged.
    
    # 5-fold cross-validation (10-fold is also common in practice): the dataset is
    # divided into 5 parts, and each iteration trains on 4 parts and tests on the fifth.
    score = cross_val_score(model,X,y,cv=5)
    print(f"{model} Avg cross val score is {np.mean(score)}")
    model_df[model] = round(np.mean(score)*100,2)
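To make the cross-validation step concrete, the sketch below reproduces what `cross_val_score(..., cv=5)` does for a classifier: it splits the data with an unshuffled `StratifiedKFold`, fits a fresh copy of the estimator on four folds, and scores it on the fifth. The data is synthetic and the variable names are illustrative.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (assumption: not the loan dataset).
X_demo, y_demo = make_classification(n_samples=150, n_features=5, random_state=1)
estimator = LogisticRegression(max_iter=1000)

# For a classifier and an integer cv, scikit-learn uses unshuffled StratifiedKFold.
folds = StratifiedKFold(n_splits=5)
manual_scores = []
for train_idx, test_idx in folds.split(X_demo, y_demo):
    fold_model = clone(estimator).fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(fold_model.score(X_demo[test_idx], y_demo[test_idx]))

library_scores = cross_val_score(estimator, X_demo, y_demo, cv=5)
print(np.mean(manual_scores), np.mean(library_scores))
```

The two sets of per-fold scores should agree, which is why the averaged value is a reasonable single number for comparing algorithms.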
    

In [72]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()

# passing this model object of LogisticRegression Class in the function we've created
model_val(model,X,y)

LogisticRegression() accuracy is 0.8018018018018018
LogisticRegression() Avg cross val score is 0.8047829647829647


In [73]:
model_df

{LogisticRegression(): 80.48}

In [74]:
from sklearn import svm

model = svm.SVC()
model_val(model,X,y)

SVC() accuracy is 0.8018018018018018
SVC() Avg cross val score is 0.7938902538902539


In [75]:
model_df

{LogisticRegression(): 80.48, SVC(): 79.39}

In [76]:
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()
model_val(model,X,y)

print(model_df)

DecisionTreeClassifier() accuracy is 0.7387387387387387
DecisionTreeClassifier() Avg cross val score is 0.7143488943488943
{LogisticRegression(): 80.48, SVC(): 79.39, DecisionTreeClassifier(): 71.43}


In [77]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()
model_val(model,X,y)

print(model_df)

RandomForestClassifier() accuracy is 0.7567567567567568
RandomForestClassifier() Avg cross val score is 0.7903357903357904
{LogisticRegression(): 80.48, SVC(): 79.39, DecisionTreeClassifier(): 71.43, RandomForestClassifier(): 79.03}


In [78]:
from sklearn.ensemble import GradientBoostingClassifier

model = GradientBoostingClassifier()
model_val(model,X,y)

model_df

GradientBoostingClassifier() accuracy is 0.7927927927927928
GradientBoostingClassifier() Avg cross val score is 0.774004914004914


{LogisticRegression(): 80.48,
 SVC(): 79.39,
 DecisionTreeClassifier(): 71.43,
 RandomForestClassifier(): 79.03,
 GradientBoostingClassifier(): 77.4}

---

## 5. Model Tuning and Prediction

In this section, we fine-tune the hyperparameters of different machine learning models using **RandomizedSearchCV** to improve their performance. After tuning, we train the models with the optimized hyperparameters, save the final model, and use it to make predictions on new data.

### Hyperparameter Tuning

1. **Logistic Regression Tuning**:
   - We use `RandomizedSearchCV` to tune the regularization strength `C` of the Logistic Regression model, with `solver` fixed to `liblinear`.
   - The goal is to find the best combination of parameters that yields the highest model performance.
   

In [79]:
from sklearn.model_selection import RandomizedSearchCV

In [80]:
# Tune LogisticRegression: 20 log-spaced values of 'C', with 'solver' fixed to 'liblinear'

log_reg_grid = {"C": np.logspace(-4,4,20),
                "solver": ['liblinear']}

In [81]:
# RandomizedSearchCV takes an estimator and a parameter space. After fit(), it exposes the
# best parameter combination (best_params_) and its mean cross-validated score (best_score_).

rs_log_reg = RandomizedSearchCV(LogisticRegression(),
                   param_distributions = log_reg_grid,
                   n_iter=20, cv=5, verbose=True)

In [82]:
# Run the search: each candidate is evaluated with 5-fold cross-validation on X and y.

rs_log_reg.fit(X,y)

Fitting 5 folds for each of 20 candidates, totalling 100 fits


In [83]:
rs_log_reg.best_score_

0.8047829647829647

In [84]:
rs_log_reg.best_params_

{'solver': 'liblinear', 'C': 0.23357214690901212}
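Because the grid above lists exactly 20 values of `C` and `n_iter` is also 20, the search visits every value and behaves like a grid search. To sample `C` randomly from a continuous range instead, a distribution can be passed in place of a list. The sketch below uses `scipy.stats.loguniform` on synthetic data; the range, `n_iter`, and `random_state` are assumptions chosen for illustration.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (assumption: not the loan dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

search = RandomizedSearchCV(
    LogisticRegression(solver="liblinear"),
    # Draw C log-uniformly between 1e-4 and 1e4 rather than from a fixed list.
    param_distributions={"C": loguniform(1e-4, 1e4)},
    n_iter=10,
    cv=5,
    random_state=42,  # makes the sampled candidates reproducible
)
search.fit(X_demo, y_demo)

print(search.best_params_, search.best_score_)
```

Setting `random_state` also helps keep reported scores stable between notebook runs.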

2. **Support Vector Classifier (SVC) Tuning:**

    - For the SVC model, we tune `C` using **RandomizedSearchCV**, with `kernel` fixed to `linear`.

In [85]:
svc_grid = {'C':[0.25,0.50,0.75,1],
            "kernel":["linear"]}


In [86]:
rs_svc = RandomizedSearchCV(svm.SVC(),
                  param_distributions = svc_grid,
                  cv=5,
                  n_iter=4,  # the grid contains only 4 candidates
                  verbose=True)

In [87]:
rs_svc.fit(X,y)

Fitting 5 folds for each of 4 candidates, totalling 20 fits




In [88]:
rs_svc.best_score_

0.8066011466011467

In [89]:
rs_svc.best_params_

{'kernel': 'linear', 'C': 0.25}

3. **Random Forest Tuning**

    - For the Random Forest model, we tune several hyperparameters, namely `n_estimators`, `max_features`, `max_depth`, `min_samples_split`, and `min_samples_leaf`, to improve model performance.


In [90]:
rf_grid = {'n_estimators':np.arange(10,1000,10),
           'max_features':['log2','sqrt'],
           'max_depth':[None,3,5,10,20,30],
           'min_samples_split':[2,5,20,50,100],
           'min_samples_leaf':[1,2,5,10]
          }

In [91]:
rs_rf = RandomizedSearchCV(RandomForestClassifier(),
                  param_distributions = rf_grid,
                  cv=5,
                  n_iter=20,
                  verbose=True)

In [92]:
rs_rf.fit(X,y)

Fitting 5 folds for each of 20 candidates, totalling 100 fits


In [93]:
rs_rf.best_score_

0.8066175266175266

In [94]:
rs_rf.best_params_

{'n_estimators': 350,
 'min_samples_split': 20,
 'min_samples_leaf': 5,
 'max_features': 'sqrt',
 'max_depth': 30}

In [None]:
LogisticRegression score before hyperparameter tuning: 80.48
LogisticRegression score after hyperparameter tuning: 80.48

------------------------------------------------------
SVC score before hyperparameter tuning: 79.39
SVC score after hyperparameter tuning: 80.66

------------------------------------------------------
RandomForestClassifier score before hyperparameter tuning: 79.03
RandomForestClassifier score after hyperparameter tuning: 80.66

In [96]:
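# Rebuild the feature matrix from df, without the scaling applied earlier.
# Tree-based models such as Random Forest do not depend on feature scale,
# and the new applicant further below is passed in raw units.
X = df.drop('Loan_Status',axis=1)
y = df['Loan_Status']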

##### **Model Performance Comparison**
| Model                     | Before Tuning | After Tuning |
|---------------------------|---------------|--------------|
| Logistic Regression       | 80.48%        | 80.48%       |
| Support Vector Classifier | 79.39%        | 80.66%       |
| Random Forest Classifier  | 79.03%        | 80.66%       |

##### **Final Model: Random Forest**

- Following the tuning process, the **Random Forest Classifier** is chosen as the final model. Its mean cross-validated accuracy (80.66%) is marginally the highest and is essentially tied with the tuned SVC. The model is retrained on the full dataset using the hyperparameters identified during tuning.


In [97]:
# Hyperparameters taken from rs_rf.best_params_ above
rf = RandomForestClassifier(n_estimators = 350,
                            min_samples_split = 20,
                            min_samples_leaf = 5,
                            max_features = 'sqrt',
                            max_depth = 30)

In [98]:
rf.fit(X,y)
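A fitted random forest exposes `feature_importances_`, which gives a rough ranking of how much each feature contributed to the trees' splits. The sketch below shows the pattern on synthetic data with placeholder column names (`f0` to `f3`); applying the same two lines to `rf` and `X.columns` would rank the loan features, though the resulting values are not shown here.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with placeholder feature names (assumption).
feature_names = ["f0", "f1", "f2", "f3"]
X_demo, y_demo = make_classification(n_samples=300, n_features=4,
                                     n_informative=2, n_redundant=0,
                                     random_state=0)
X_demo = pd.DataFrame(X_demo, columns=feature_names)

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# Impurity-based importances sum to 1 across features; higher means more influential.
importances = pd.Series(forest.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```

Impurity-based importances can favor features with many distinct values, so they are best read as a first indication and not as a definitive ranking.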

In [99]:
import joblib

In [100]:
# saving our model by passing an instance of our model and giving it a name.

joblib.dump(rf,'loan_status_predictor_model')

['loan_status_predictor_model']

In [101]:
# In the future, we can make predictions using this saved model, as shown below

model = joblib.load('loan_status_predictor_model')

In [107]:
df = pd.DataFrame({
    'Gender':1,
    'Married':1,
    'Dependents':2,
    'Education':0,
    'Self_Employed':0,
    'ApplicantIncome':2889,
    'CoapplicantIncome':0.0,
    'Loan_Amount_Term':180,
    'Credit_History':0,
    'Property_Area':1,
    'LoanAmount':45
},index=[0])

In [108]:
df

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,Loan_Amount_Term,Credit_History,Property_Area,LoanAmount
0,1,1,2,0,0,2889,0.0,180,0,1,45


In [109]:
result = model.predict(df)

In [110]:
if result[0] == 1:
    print("Loan Approved")
else:
    print("Loan Not Approved")

Loan Not Approved


## Conclusion

In this project, we built and tuned several machine learning models to predict loan approval status from applicant features. We compared Logistic Regression, Support Vector Classifier, Decision Tree, Random Forest, and Gradient Boosting, then tuned three of them. After hyperparameter tuning, the **Random Forest Classifier** reached a mean 5-fold cross-validation accuracy of 80.66%, essentially tied with the tuned Support Vector Classifier and slightly ahead of Logistic Regression (80.48%), and was selected as the final model.

We demonstrated how to:
- Ingest data from MongoDB,
- Preprocess the dataset,
- Train and tune models using `RandomizedSearchCV`,
- Save the final model using `joblib`, and
- Make predictions on new data.

By using the tuned Random Forest model, we can now predict whether a loan will be approved based on the provided applicant data, aiding financial institutions in their decision-making processes. Future improvements could include testing more advanced algorithms, expanding the dataset, or further optimizing hyperparameters.
