# Loan Application Data - A Complete Machine Learning Solution

**Problem Statement**
India Housing Finance offers home loans for low-income housing and has a presence across urban, semi-urban and rural areas. When a customer applies for a home loan, the company validates the customer's eligibility. The company wants to automate this eligibility check using the details the customer provides in the online application form: Gender, Marital Status, Education, Number of Dependents, Income, Loan Amount, Credit History, and others.


**Objectives:**
1. Build a predictive model
2. Evaluate the model
3. Refine the model, as appropriate

**Working approach**
1. Importing required Libraries
2. Loading dataset
3. Descriptive analysis (shape, describe, missing data etc)
4. Exploratory Data Analysis
5. Variable analysis
6. Data Cleaning
7. Handling categorical data
8. Feature selection
9. Model building and Model diagnostics (Decision Tree)
10. Model performance and evaluations
11. Decision Tree visualization





**Importing Libraries**

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats


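%matplotlib inline
plt.rcParams['figure.figsize'] = 10, 12
# the "seaborn" style was renamed "seaborn-v0_8" in matplotlib 3.6; use whichever name is available
plt.style.use("seaborn-v0_8" if "seaborn-v0_8" in plt.style.available else "seaborn")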
import statsmodels.formula.api as sm
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

#suppressing all the warnings
import warnings
warnings.filterwarnings('ignore')

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_validate

**Loading dataset**

In [None]:
loan_data_df = pd.read_csv("/kaggle/input/loanapplicantdata/LoanApplicantData.csv")

In [None]:
#Reading top 5 rows of the data set
loan_data_df.head()

***Descriptive Analysis***

In [None]:
#number of observations and features, returned as (rows, columns)
loan_data_df.shape

In [None]:
#column names, non-null counts and data types for all 13 columns
loan_data_df.info()

**Checking for missing data**

In [None]:
#check whether each column has missing values:
#True if the column contains at least one missing value, False otherwise
loan_data_df.isnull().any()

In [None]:
## basic descriptive statistics
loan_data_df.describe()

**Descriptive Statistics** are coefficients that summarize a data set, which may represent an entire population or a sample of it. They are broken down into measures of central tendency (mean, median and mode) and measures of variability or spread (standard deviation, variance, minimum and maximum values, kurtosis and skewness).
- (a) The mean is greater than the median (the 50% row, i.e. the 50th percentile) for `ApplicantIncome`, `CoapplicantIncome` and `LoanAmount`, while for `Loan_Amount_Term` and `Credit_History` the mean lies below the median.
- (b) There is a notably large difference between the 75th percentile and the maximum for `ApplicantIncome`, `CoapplicantIncome` and `LoanAmount`.
- (c) Observations (a) and (b) suggest that the data set contains extreme values (outliers).
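
The outlier reasoning above can be checked numerically. The sketch below uses hypothetical income values (not the loan data) and Tukey's interquartile-range rule, which is one common convention rather than the method this notebook necessarily relies on.

```python
import pandas as pd

# Hypothetical income values: mostly modest, with one extreme entry.
incomes = pd.Series([2500, 3000, 3200, 3800, 4100, 4500, 5200, 81000])

mean_value = incomes.mean()
median_value = incomes.median()

# Tukey's rule: flag points more than 1.5 * IQR beyond the quartiles.
q1, q3 = incomes.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = incomes[(incomes < lower) | (incomes > upper)]

print("mean:", mean_value, "median:", median_value)
print("outliers:", outliers.tolist())
```

A single extreme value is enough to pull the mean well away from the median, which is why comparing the two is a quick first signal before plotting.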


**Mapping of target variable**
- The target (dependent) variable is categorical with two classes, 'Y' and 'N', so this is a binary classification problem for which Logistic Regression or a Decision Tree classifier are natural starting points. Both need numeric inputs, so we convert the class labels to numbers by category indexing: 'Y' becomes 1 and 'N' becomes 0.

In [None]:
#map loan status "Y" to 1 and "N" to 0
loan_data_df["Loan_Status"] = loan_data_df["Loan_Status"].map({"Y" : 1, "N" : 0})

In [None]:
loan_data_df["Loan_Status"].value_counts()

**Observations**
- The dataset consists of 614 observations and 13 variables; `Loan_Status` is the dependent variable and the others are independent variables.
- The 13 variables are a mix of categorical data (for example `Gender`, `Married`, `Education`, `Self_Employed`, `Property_Area` and the target `Loan_Status`) and numerical data (for example `ApplicantIncome`, `CoapplicantIncome` and `LoanAmount`).
- There are missing values in the dataset.
- The data set is imbalanced: the number of observations is not the same for the two classes of the target. Imbalance can be addressed with over-sampling, under-sampling, cost-sensitive learning or ensemble learning techniques. We will use stratified sampling so that the train and test sets keep the same class proportions.

**Exploratory Data Analysis**
Exploratory Data Analysis is the process of performing initial investigations on data to discover patterns, spot anomalies, test hypotheses and check assumptions with the help of summary statistics and graphical representations.

In [None]:
def NumericalVariables_targetPlots(df,segment_by,target_var = "Loan_Status"):
    """A function for plotting the distribution of numerical variables and its effect on Loan_Status"""
    
    fig, ax = plt.subplots(ncols= 2, figsize = (14,6))    

    #boxplot of the variable for each class of the target
    sns.boxplot(x = target_var, y = segment_by, data=df, ax=ax[0])
    ax[0].set_title("Comparison of " + segment_by + " vs " + target_var)
    
    #histogram of the variable (histplot replaces the deprecated distplot)
    sns.histplot(x = df[segment_by].dropna(), ax=ax[1])
    ax[1].set_title("Distribution of "+segment_by)
    ax[1].set_ylabel("Frequency")
    
    plt.show()

In [None]:
def CategoricalVariables_targetPlots(df, segment_by,invert_axis = False, target_var = "Loan_Status"):
    
    """A function for Plotting the effect of variables(categorical data) on Loan_Status """
    
    fig, ax = plt.subplots(ncols= 2, figsize = (14,6))
    
    #countplot of each category, split by the target variable
    #invert_axis swaps the axes so that long category names don't overlap
    if not invert_axis:
        sns.countplot(x = segment_by, data=df, hue=target_var, ax=ax[0])
    else:
        sns.countplot(y = segment_by, data=df, hue=target_var, ax=ax[0])
        
    ax[0].set_title("Comparison of " + segment_by + " vs " + target_var)
    
    #mean of the target (approval rate) for each category, drawn on the second axis
    if not invert_axis:
        sns.barplot(x = segment_by, y = target_var, data=df, ci=None, ax=ax[1])
        ax[1].set_ylabel("Average(" + target_var + ")")
    else:
        sns.barplot(y = segment_by, x = target_var, data=df, ci=None, ax=ax[1])
        ax[1].set_xlabel("Average(" + target_var + ")")
        
    ax[1].set_title("{} rate by {}".format(target_var, segment_by))
    plt.tight_layout()

    plt.show()

**Analyzing the variables**
- Numerical Variables

In [None]:
numeric_var_names = loan_data_df.select_dtypes(include="number").columns.tolist()
print(numeric_var_names)

**ApplicantIncome**

In [None]:
# check the distribution of ApplicantIncome and whether it is related to loan status

NumericalVariables_targetPlots(loan_data_df, segment_by="ApplicantIncome")

In [None]:
print("Max of Applicant Income:", loan_data_df["ApplicantIncome"].max())
print("Min of Applicant Income:", loan_data_df["ApplicantIncome"].min())

**Observations**
- Applicant income ranges from a minimum of 150 to a maximum of 81,000, so the distribution is very wide.
- Most applicants have an income between 0 and 10,000; very few have an income above 10,000.
- The box plot suggests that applicants with higher incomes are more likely to have their loan approved.

**CoapplicantIncome**

In [None]:
# check the distribution of CoapplicantIncome and whether it is related to loan status

NumericalVariables_targetPlots(loan_data_df, segment_by="CoapplicantIncome")

**Observations**
- Almost half of the co-applicants have an income of zero.
- In the box plot, `CoapplicantIncome` is not very useful for distinguishing the loan status.

**LoanAmount**

In [None]:
# check the distribution of LoanAmount and whether it is related to loan status

NumericalVariables_targetPlots(loan_data_df, segment_by="LoanAmount")

**Observations**
- `LoanAmount` follows an approximately normal distribution, apart from the tail of large values noted in the descriptive statistics.

**Credit_History**

In [None]:
# check the distribution of Credit_History and whether it is related to loan status

NumericalVariables_targetPlots(loan_data_df, segment_by="Credit_History")

**Analyzing the variables**
- Categorical Variables

In [None]:
categorical_var_names = loan_data_df.select_dtypes(include="object").columns.tolist()
print(categorical_var_names)

**Gender**

In [None]:
CategoricalVariables_targetPlots(loan_data_df,"Gender")

**Observations**
- Gender appears to be related to the loan status of an applicant: loan eligibility is higher for male customers than for female customers.
- The average loan status rate (the share of approved loans) is higher for male customers.

**Married**

In [None]:
CategoricalVariables_targetPlots(loan_data_df,"Married")

**Observations**
- Married customers are more likely to get the loan approved than unmarried customers.

**Dependents**

In [None]:
CategoricalVariables_targetPlots(loan_data_df,"Dependents")

**Observations**
- Across the whole dataset, about half (50%) of the bank's customers have no dependents. Very few customers have 3 or more dependents.
- As expected, the bank tends to approve loans for customers with no dependents, since they do not have to spend money on dependents.
- This feature could be an important variable in model building.

**Education**

In [None]:
CategoricalVariables_targetPlots(loan_data_df,"Education")

**Observations**
- The `Education` variable appears to play an important role in the loan status of a customer. As expected, banks tend to give loans to educated customers, who are more likely to get a job and repay the loan.
- The bank is conservative in giving loans to non-graduate applicants.
- `Education` is an important feature in deciding the loan eligibility of the applicant.

**Self_Employed**

In [None]:
CategoricalVariables_targetPlots(loan_data_df,"Self_Employed")

**Observations**
- Most of the bank's customers are not self-employed, meaning they are employed in public or private organisations.
- Self-employment comes with a degree of uncertainty, so the bank is less willing to give loans to self-employed applicants.

**Data Cleaning**
- Imputing missing values: impute `Gender`, `Married`, `Dependents` and `Self_Employed` with the mode, since they are categorical features. Mode imputation replaces missing data with the most frequent value in each column. Impute `LoanAmount`, `Loan_Amount_Term` and `Credit_History` with the median, because they are skewed and the median is less sensitive to extreme values than the mean. Median imputation computes the median of the non-missing values in a column and uses it to replace the missing values, treating each column separately and independently of the others. It can only be used with numeric data.
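
To illustrate why the median is preferred for skewed columns, here is a minimal sketch on hypothetical loan amounts (the values are invented for illustration and are not taken from the data set).

```python
import numpy as np
import pandas as pd

# Hypothetical loan amounts with two missing entries and one extreme value.
loan_amount = pd.Series([100.0, 120.0, np.nan, 130.0, 150.0, np.nan, 700.0])

mean_fill = loan_amount.mean()      # pulled upward by the 700 entry
median_fill = loan_amount.median()  # stays near the bulk of the data

# Fill the gaps with the median of the non-missing values.
imputed = loan_amount.fillna(median_fill)

print("mean:", mean_fill, "median:", median_fill)
print(imputed.tolist())
```

Filling with the mean here would insert a value larger than all but one of the observed amounts, whereas the median keeps the imputed entries typical of the column.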

In [None]:
#check for number of missing values
loan_data_df.isnull().sum()

In [None]:
#impute gender, married, dependents, Self_Employed with mode since they are categorical features

loan_data_df["Gender"] = loan_data_df["Gender"].fillna(loan_data_df["Gender"].mode()[0])
loan_data_df["Married"] = loan_data_df["Married"].fillna(loan_data_df["Married"].mode()[0])
loan_data_df["Dependents"] = loan_data_df["Dependents"].fillna(loan_data_df["Dependents"].mode()[0])
loan_data_df["Self_Employed"] = loan_data_df["Self_Employed"].fillna(loan_data_df["Self_Employed"].mode()[0])

In [None]:
#impute the LoanAmount, Loan_Amount_Term, Credit_History with median data because they are skewed

loan_data_df["LoanAmount"] = loan_data_df["LoanAmount"].fillna(loan_data_df["LoanAmount"].median())
loan_data_df["Loan_Amount_Term"] = loan_data_df["Loan_Amount_Term"].fillna(loan_data_df["Loan_Amount_Term"].median())
loan_data_df["Credit_History"] = loan_data_df["Credit_History"].fillna(loan_data_df["Credit_History"].median())

In [None]:
loan_data_df.head()

**Handling Categorical Data**
- Convert Categorical variables to encoded variables

In [None]:
#take a copy of the data
loan_data_encoded_df = loan_data_df.copy()

In [None]:
#convert 'Gender' -> Male to 1 and Female to 0
loan_data_encoded_df["Gender"] = loan_data_df["Gender"].map({"Male": 1, "Female": 0})

#convert the Married variable Yes to 1 and No to 0
loan_data_encoded_df["Married"] = loan_data_df["Married"].map({"Yes" : 1, "No" : 0})

#convert the Self_Employed variable Yes to 1 and No to 0
loan_data_encoded_df["Self_Employed"] = loan_data_df["Self_Employed"].map({"Yes" : 1, "No" : 0})

#education: there is an order involved. Graduate > Not Graduate
loan_data_encoded_df["Education"] = loan_data_df["Education"].map({"Graduate" : 2, "Not Graduate" : 1})

#property area: encoded as ordered integers. Rural = 1, Semiurban = 2, Urban = 3
loan_data_encoded_df["Property_Area"] = loan_data_df["Property_Area"].map({"Rural" : 1, "Semiurban" : 2, "Urban" : 3})

#replace dependents 3+ with 3.
loan_data_encoded_df["Dependents"] = loan_data_df["Dependents"].map({"3+" : 3, "1": 1, "2": 2, "0" : 0})

In [None]:
loan_data_encoded_df.head()

**Feature Selection**
- Correlation check: the statistical relationship between two variables is referred to as their correlation. A correlation can be positive, meaning both variables move in the same direction, or negative, meaning that when one variable's value increases, the other's decreases. A correlation near zero means the variables have no linear relationship.
- Remove the `Loan_ID` feature, since it has no predictive power.
- `Gender` and `Married` are correlated with each other, so keeping both adds little information and one of the pair can be dropped.
- None of the remaining features are strongly correlated with each other.
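
As a complement to reading a heatmap by eye, the correlation check can be sketched as a small helper that lists feature pairs whose absolute correlation exceeds a threshold. The function name `correlated_pairs`, the threshold values and the toy data are all hypothetical choices for illustration.

```python
import pandas as pd

def correlated_pairs(df, threshold=0.5):
    """Return (feature_a, feature_b, r) for pairs with |r| >= threshold."""
    corr = df.corr()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):   # upper triangle only, no self-pairs
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], round(float(r), 3)))
    return pairs

# Hypothetical toy data: b tracks a closely, c alternates independently.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": [2, 4, 6, 8, 10, 13],
    "c": [1, 0, 1, 0, 1, 0],
})
pairs = correlated_pairs(toy, threshold=0.8)
print(pairs)
```

With the notebook's own objects, the same helper could be called on `loan_data_encoded_df` once all columns are numeric; the threshold is a judgment call, not a fixed rule.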


In [None]:
#removing the loan Id feature, since it doesn't give any predictive power
loan_data_encoded_df.drop(["Loan_ID"], axis = 1, inplace=True)

In [None]:
#correlation matrix
corr_matrix = loan_data_encoded_df.corr()

ax = sns.heatmap(
    corr_matrix, 
    vmin=-1, vmax=1, center=0,
    cmap="coolwarm",
    square=True
)

ax.set_xticklabels(
    ax.get_xticklabels(),
    rotation=45,
    horizontalalignment='right'
);

**Observations**
- `Gender` and `Married` are correlated with each other, so keeping both adds little information and one of the pair can be dropped.
- None of the remaining features are strongly correlated with each other.

**Model Building and Model Diagnostics**
- Logistic Regression
- Decision Tree classifier
- Supervised learning: in supervised learning both the input and the desired output data are provided. Supervised problems are further grouped into regression and classification problems. Here we have a classification problem, because the output variable `Loan_Status` takes one of two values, 'Y' or 'N'. Logistic Regression and Decision Tree classifiers are both candidates; the model built below is a Decision Tree.
- We will use the scikit-learn library in Python for the classification algorithms.
- We will make separate sets of predictor and target variables.
- Splitting the dataset: the data is split into training data and test data. The training set contains known outputs, and the model learns from it so that it can generalize to other data later. The test set is held back to check the model's predictions on data it has not seen. We will do this using the `train_test_split` function with stratified sampling, which keeps the class proportions of the target the same in both sets.
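
The effect of stratified sampling can be seen on a small synthetic example. The labels below are hypothetical (70 approvals, 30 rejections) and only stand in for the real target.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 70 approvals (1) and 30 rejections (0).
y_demo = np.array([1] * 70 + [0] * 30)
X_demo = np.arange(100).reshape(-1, 1)

# stratify=y_demo asks for the same class proportions in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

print("train share of class 1:", y_tr.mean())
print("test share of class 1:", y_te.mean())
```

Without `stratify`, a random 20% test set could by chance contain noticeably more or fewer rejections than the full data, which makes test metrics harder to compare with training metrics.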

In [None]:
#use every column except Gender and the target Loan_Status as a predictor
featurecolumns = loan_data_encoded_df.columns.difference(['Gender', 'Loan_Status'])
featurecolumns

**Separating the Target and the Predictors**

In [None]:
X = loan_data_encoded_df[featurecolumns]
y = loan_data_encoded_df["Loan_Status"]

**Train-Test Split(Stratified Sampling of Y)**

In [None]:
#80% training data and 20% test data

X_train, X_test, y_train,y_test = train_test_split(X,y,test_size = 0.2,stratify = y,random_state = 100)

In [None]:
#check the proportion of data
y_train.value_counts(normalize = True) * 100

In [None]:
pd.DataFrame(y_train.value_counts(normalize = True) * 100).plot(kind = "bar")
plt.title("Distribution of Loan Status")
plt.show()

**Decision Tree**
- Average accuracy of pipeline with Decision Tree Classifier is 83.48%

In [None]:
#make a pipeline for decision tree model 

pipelines = {
    "clf": make_pipeline(DecisionTreeClassifier(max_depth=3,random_state=100))
}

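**Cross-Validation and Hyperparameter Tuning**
Cross-validation estimates how well a model generalizes by splitting the training data into folds, training on all but one fold and evaluating on the held-out fold, in turn for each fold. Hyperparameter tuning uses it to compare candidate settings: the model is trained and evaluated for each combination of the parameters, and the best-scoring combination is kept.

- Declare the hyperparameters used to fine-tune the Decision Tree classifier.

- A Decision Tree is built by a greedy algorithm: at each node it picks the locally best split rather than searching the entire space of possible decision trees. Left unconstrained, the tree keeps splitting until it fits the training data very closely, so we need criteria for stopping it at some point. We use the hyperparameters to limit (pre-prune) the decision tree.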
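
A minimal sketch that separates the two ideas: cross-validation scores one fixed configuration on held-out folds, and a grid search repeats that scoring for every candidate setting. The data is synthetic (`make_classification`), and the small `max_depth` grid is an assumption chosen only to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the loan data set).
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

# Cross-validation: score ONE fixed configuration on 5 different held-out folds.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
fold_scores = cross_val_score(tree, X_demo, y_demo, cv=5)
print("fold accuracies:", fold_scores.round(3))

# Grid search: repeat that cross-validation for every candidate setting
# and keep the setting with the best mean score.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 3, 4, 5]},
    cv=5,
)
search.fit(X_demo, y_demo)
print("best max_depth:", search.best_params_["max_depth"])
print("best mean CV accuracy:", round(search.best_score_, 3))
```

The number of model fits grows with the product of the grid sizes times the number of folds, so wide grids over several hyperparameters can take a while to run.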

In [None]:
decisiontree_hyperparameters = {
    "decisiontreeclassifier__max_depth": np.arange(3,12),
    "decisiontreeclassifier__max_features": np.arange(3,10),
    "decisiontreeclassifier__min_samples_split": [2,3,4,5,6,7,8,9,10,11,12,13,14,15],
    "decisiontreeclassifier__min_samples_leaf" : np.arange(1,3)
}

**Decision Tree classifier with gini index**
- Fit and tune models with cross-validation
- Now that we have our pipelines and hyperparameters dictionaries declared, we're ready to tune our models with cross-validation.

- We are doing 5 fold cross validation

In [None]:
#Create a grid-search object (with 5-fold cross-validation) from the decision tree pipeline and its hyperparameters

clf_model = GridSearchCV(pipelines['clf'], decisiontree_hyperparameters, cv=5, n_jobs=-1)

In [None]:
#fit the model with train data
clf_model.fit(X_train, y_train)

In [None]:
#Display the best parameters for Decision Tree Model
clf_model.best_params_

In [None]:
#Display the best score for the fitted model
clf_model.best_score_

**Model Performance Evaluation**
- We will now predict the test set results and check the accuracy of the model.
- The `classification_report()` function displays the precision, recall, F1-score and support for each class. Precision is the percentage of predicted positive instances that are actually positive. Recall is the percentage of actual positive instances that are predicted positive. The F1 score summarizes both and is the harmonic mean of the two measures.
- To inspect the errors, we use the `confusion_matrix` function of the `sklearn.metrics` module. The confusion matrix tabulates the predicted classes against the true classes, which shows the number of misclassifications of each kind.
- The ROC curve plots the true positive rate against the false positive rate (sensitivity against 1 - specificity) for each threshold. From the curve, we have a choice to make depending on the value we place on true positives and our tolerance for false positives. If we wish to give loans to more customers, we can increase the true positive rate by lowering the probability cut-off for classification. However, doing so also increases the false positive rate, so we need to find the optimum cut-off value.


In [None]:
#Predicting the test cases
bankloans_test_pred_dt = pd.DataFrame({'actual':y_test, 'predicted': clf_model.predict(X_test)})
bankloans_test_pred_dt = bankloans_test_pred_dt.reset_index(drop = True)

#predicted probability of class 1 (loan approved)
bankloans_test_pred_dt["predicted_prob"] = clf_model.predict_proba(X_test)[:, 1]

bankloans_test_pred_dt.head()

In [None]:
#classification report

print(metrics.classification_report(bankloans_test_pred_dt.actual, bankloans_test_pred_dt.predicted))

**Confusion Matrix**
- The confusion matrix tabulates the predicted classes against the true classes, showing how many observations ended up in the wrong class.

In [None]:
#confusion matrix
metrics.confusion_matrix(bankloans_test_pred_dt.actual,bankloans_test_pred_dt.predicted)

In [None]:
#Area Under ROC Curve, computed from the predicted probabilities rather than the hard 0/1 predictions

auc_score_test = metrics.roc_auc_score(bankloans_test_pred_dt.actual,bankloans_test_pred_dt.predicted_prob)
print("AUROC Score:",round(auc_score_test,4))

In [None]:
##Plotting the ROC Curve

fpr, tpr, thresholds = metrics.roc_curve(bankloans_test_pred_dt.actual,bankloans_test_pred_dt.predicted_prob,drop_intermediate=False)


plt.figure(figsize=(8, 6))
plt.plot( fpr, tpr, label='ROC curve (area = %0.4f)' % auc_score_test)
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate or [1 - True Negative Rate]')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic curve')
plt.legend(loc="lower right")
plt.show()

From the ROC curve, we have a choice to make depending on the value we place on true positives and our tolerance for false positives.

- If we wish to give loans to more customers, we can increase the true positive rate by lowering the probability cut-off for classification. However, doing so also increases the false positive rate, so we need to find the optimum cut-off value for classification.
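
One possible way to pick a cut-off is sketched below, using Youden's J statistic (true positive rate minus false positive rate). This criterion is an assumption made for illustration; a lender might instead weight false positives more heavily. The labels and probabilities are hypothetical.

```python
import numpy as np
from sklearn import metrics

# Hypothetical true labels and predicted probabilities of class 1.
y_true = np.array([0, 0, 0, 1, 0, 1, 1, 1, 1, 1])
y_prob = np.array([0.10, 0.30, 0.35, 0.40, 0.55, 0.60, 0.70, 0.80, 0.90, 0.95])

fpr, tpr, thresholds = metrics.roc_curve(y_true, y_prob)

# Youden's J (TPR - FPR) is one common criterion, not the only one.
best = np.argmax(tpr - fpr)
cutoff = thresholds[best]

# Classify as positive when the probability reaches the chosen cut-off.
y_pred = (y_prob >= cutoff).astype(int)
print("chosen cut-off:", cutoff)
print(metrics.confusion_matrix(y_true, y_pred))
```

With the notebook's objects, `y_true` and `y_prob` would correspond to the `actual` and `predicted_prob` columns of `bankloans_test_pred_dt`.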

**Metrics**
- Recall: the number of correctly classified positive examples divided by the total number of actual positive examples. High recall indicates that most examples of the positive class are correctly recognized.
- Precision: the number of correctly classified positive examples divided by the total number of predicted positive examples. High precision indicates that an example labeled as positive is indeed positive.
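
The definitions above can be worked through by hand from confusion-matrix counts. The counts below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
# Hypothetical confusion-matrix counts, for illustration only.
tp, fp, fn, tn = 80, 20, 5, 18

precision = tp / (tp + fp)   # correct positives / all predicted positives
recall = tp / (tp + fn)      # correct positives / all actual positives
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} accuracy={accuracy:.3f}")
```

In this example recall is high while precision is lower, which is the pattern to expect when a model approves most applicants: few good applicants are missed, but more of the approvals turn out to be wrong.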

In [None]:
#calculating the recall score
print("Recall Score:",round(metrics.recall_score(bankloans_test_pred_dt.actual,bankloans_test_pred_dt.predicted) * 100,3))

In [None]:
#calculating the precision score
print("Precision Score:",round(metrics.precision_score(bankloans_test_pred_dt.actual,bankloans_test_pred_dt.predicted) * 100,3))

In [None]:
#compute f1 score
metrics.f1_score(bankloans_test_pred_dt.actual, bankloans_test_pred_dt.predicted)

**Visualization of Decision Tree**

In [None]:
%pip install pydotplus

In [None]:
from io import StringIO
from IPython.display import Image
from sklearn.tree import export_graphviz
import pydotplus as pdot
import graphviz


In [None]:
#in-memory text buffer that will receive the dot data
dot_data = StringIO()

In [None]:
#saving into a variable to get graph
clf_best_model = clf_model.best_estimator_.named_steps['decisiontreeclassifier']

In [None]:
#export the decision tree along with the feature names into a dot file format

export_graphviz(clf_best_model,out_file=dot_data,filled=True,
                rounded=True,special_characters=True,feature_names = X_train.columns.values,class_names = ["No","Yes"])

In [None]:
#make a graph from dot file 
graph = pdot.graph_from_dot_data(dot_data.getvalue())

In [None]:
Image(graph.create_png())
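
If Graphviz is not available, a text rendering of the tree is a lightweight alternative. The sketch below uses scikit-learn's `export_text` on synthetic data with hypothetical feature names; with the notebook's objects one would pass `clf_best_model` and `list(X_train.columns)` instead.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data and hypothetical feature names.
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
feature_names = ["f0", "f1", "f2", "f3"]

# A shallow tree keeps the printed rules short.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_demo, y_demo)

# One line per node: the split condition, or the predicted class at a leaf.
rules = export_text(tree, feature_names=feature_names)
print(rules)
```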