**HOME LOAN PREDICTIONS**

**Executive summary**
1. A housing finance company provides home loans to its customers for houses in urban, semi-urban and rural areas.
2. The company checks a customer's eligibility after the customer applies for a loan. However, manual validation of eligibility takes a lot of time.
3. Hence, the company wants to automate the loan eligibility process based on customer information and identify the factors and customer segments that are eligible for a loan.
4. Banks give loans only to eligible customers so that they can be reasonably assured of getting the money back.
5. Hence, the more accurately we predict eligible customers, the more beneficial it is for the company.

**Detailed Overview of the Mortgage Approval & Funding Process:**
1.	Pre-Assessment Discussion (15 minute conversation)  
2.	Pre-Approval Kick-Off (takes us no more than 1 day)
3.	Opening a File (takes us no more than 1 day)
4.	Lender Underwriting (takes 1 - 7 days from our formal submission)
* Credit history - Your lender will want to make sure that when you've borrowed money, you've paid it back.
* Capital - Ensuring you’ve accumulated assets.
* Collateral - When it comes to a mortgage, you're putting your house up as collateral.
* Capacity - In short, capacity is debt servicing. For instance, your housing cost shouldn't exceed 30 to 32 per cent of your gross income, and all of your debts shouldn't exceed 40 to 42 per cent of your gross income.
* Character - An evaluation of all four previous C's as well as subjective and objective factors such as how long you have been in your job, what type of job you have and how long you have lived in your current residence.
5.	Conditional Commitment Processing (takes 2 - 4 days from lender approval)
6.	Pre-Closing (takes 7 - 10 days from 'file complete')
7.	Closing (typically by noon on the funding/possession date)
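
The capacity guideline under Lender Underwriting can be turned into a quick arithmetic check. The sketch below is illustrative only: the function names, the 32% and 42% default limits (the upper ends of the ranges quoted above) and the sample figures are assumptions, not an actual lender's rules.

```python
def debt_service_ratios(gross_income, housing_cost, other_debt_payments):
    """Return (housing ratio, total debt ratio) as fractions of gross income.

    All three amounts must cover the same period (for example, one month).
    """
    if gross_income <= 0:
        raise ValueError("gross income must be positive")
    housing_ratio = housing_cost / gross_income
    total_ratio = (housing_cost + other_debt_payments) / gross_income
    return housing_ratio, total_ratio


def within_capacity(gross_income, housing_cost, other_debt_payments,
                    housing_limit=0.32, total_limit=0.42):
    """Check both ratios against the (assumed) upper limits."""
    housing_ratio, total_ratio = debt_service_ratios(
        gross_income, housing_cost, other_debt_payments
    )
    return housing_ratio <= housing_limit and total_ratio <= total_limit


# Made-up monthly figures: income 6000, housing 1800, other debts 500.
print(debt_service_ratios(6000, 1800, 500))
print(within_capacity(6000, 1800, 500))
```

With these sample figures the housing ratio is 30% and the total debt ratio is about 38%, so both limits are met.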


**Problem Statement**
1. The company wants to automate the loan eligibility process (in real time) based on the customer details provided in the online application form.
2. These details include Gender, Marital Status, Education, Number of Dependents, Income, Loan Amount, Credit History and others.
3. To automate this process, the company has asked us to identify the customer segments that are eligible for a loan so that it can specifically target these customers. We have been provided with a partial data set for the analysis.

**Structured Analysis Planning:** 
The SMART (Specific, Measurable, Assignable, Relevant, Time-based) objective framework was used to analyze the data and understand the problem statement. The next step was to identify the independent variables and the dependent variable; the map below illustrates the process used to plan the project.

For this approach, we employed exploratory data analysis, which includes univariate and bivariate analysis.

**Assumptions:**
1. Customers with a higher salary have a greater chance of loan approval.
2. Graduate applicants have a better chance of loan approval than non-graduate applicants.
3. Married applicants have an advantage over single applicants for loan approval.
4. Applicants with fewer dependents have a higher probability of loan approval.
5. The lower the loan amount, the higher the chance of the loan being approved.

In [None]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
print(os.listdir("../input"))

In [None]:
HomeLoan_Test=pd.read_csv('../input/Test_Loan_Home.csv') #Reading the test dataset
HomeLoan_Test.head()

In [None]:
HomeLoan_Train=pd.read_csv('../input/Train_Loan_Home.csv') #Reading the train dataset
HomeLoan_Train.head()

In [None]:
#1. Identification of variables
HomeLoan_Train_list=HomeLoan_Train.columns.tolist()
for column_name in HomeLoan_Train_list: #print one column name per line
    print(column_name)

**Type of Variables:**
1. Input variables (Predictors): Gender, Married, Dependents, Education, Self_Employed, ApplicantIncome, CoapplicantIncome, LoanAmount, Loan_Amount_Term, Credit_History, Property_Area
2. Output variable (Target): Loan_Status

In [None]:
#Identifying the datatypes:
HomeLoan_Train.dtypes
HomeLoan_Train.info() #gives detailed information on dtypes

There are 3 different datatypes in the dataset: float64 (4 columns), int64 (1 column) and object (8 columns).

**Variable category:**
1. Categorical variables:  Loan_ID, Gender, Married, Dependents, Education, Self_Employed, Property_Area, Loan_Status
2. Continuous variables: ApplicantIncome, CoapplicantIncome, LoanAmount, Loan_Amount_Term
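
As a rough cross-check of the manual grouping above, column types can also be split programmatically. This is a minimal sketch on a tiny made-up frame (the values are invented for illustration; only the column names come from the dataset), and the "two or fewer distinct values" rule for spotting flag-like numeric columns is an assumption.

```python
import pandas as pd

# Tiny made-up sample that mimics a few columns of the training data.
sample = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "ApplicantIncome": [5849, 4583, 3000],
    "LoanAmount": [128.0, 66.0, 120.0],
    "Credit_History": [1.0, 1.0, 0.0],
})

# Numeric columns, and everything else (treated here as categorical).
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

# Numeric columns with very few distinct values often behave like categories.
flag_like_cols = [c for c in numeric_cols if sample[c].nunique() <= 2]

print("categorical:", categorical_cols)
print("numeric:", numeric_cols)
print("numeric but flag-like:", flag_like_cols)
```

A dtype-based split is only a starting point; a column stored as a number can still be categorical in meaning.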

**Data-Preprocessing steps:**

1. Handling missing values
2. Creation of Dummy variables
3. Replacing the data-values

In [None]:
HomeLoan_Train.isnull().sum()  #To check the complete list of missing values in each column.

In [None]:
HomeLoan_Train.fillna(HomeLoan_Train.mean(numeric_only=True), inplace=True) #Fill missing numeric values with the column mean

In [None]:
most_common=pd.get_dummies(HomeLoan_Train.Gender).sum().sort_values(ascending=False).index[0] #created dummies for the Gender column

def replace_most_common(x):
    if pd.isnull(x):
        return most_common
    else:
        return x

x1=HomeLoan_Train.Gender.map(replace_most_common)
print(x1)

In [None]:
HomeLoan_Train['Gender_Updated']=x1

In [None]:
HomeLoan_Train.drop('Gender', axis=1, inplace=True)

In [None]:
HomeLoan_Train.head()

In [None]:
most_common=pd.get_dummies(HomeLoan_Train.Self_Employed).sum().sort_values(ascending=False).index[0] #created dummies for the Self_Employed column

def replace_most_common(x):
    if pd.isnull(x):
        return most_common
    else:
        return x

x2=HomeLoan_Train.Self_Employed.map(replace_most_common)
print(x2)

In [None]:
HomeLoan_Train['Self_Employed_updated']=x2

In [None]:
HomeLoan_Train.head()

In [None]:
HomeLoan_Train.drop('Self_Employed', axis=1, inplace=True)

In [None]:
HomeLoan_Train.head()

In [None]:
most_common=pd.get_dummies(HomeLoan_Train.Dependents).sum().sort_values(ascending=False).index[0] #Created dummies for Dependents column

def replace_most_common(x):
    if pd.isnull(x):
        return most_common
    else:
        return x

x3=HomeLoan_Train.Dependents.map(replace_most_common)
print(x3)

In [None]:
HomeLoan_Train['Dependents_Updated']=x3

In [None]:
HomeLoan_Train.drop('Dependents',axis=1, inplace=True)

In [None]:
HomeLoan_Train.replace('3+','3', inplace=True)

In [None]:
HomeLoan_Train.head()

In [None]:
most_common=pd.get_dummies(HomeLoan_Train.Married).sum().sort_values(ascending=False).index[0] #Created dummies for Married column

def replace_most_common(x):
    if pd.isnull(x):
        return most_common
    else:
        return x

x4=HomeLoan_Train.Married.map(replace_most_common)
print(x4)

In [None]:
HomeLoan_Train['Married_updated']=x4

In [None]:
HomeLoan_Train.drop('Married',axis=1,inplace=True)

In [None]:
HomeLoan_Train.isnull().sum()
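
The four imputation cells above repeat the same pattern: find the most frequent value of a column and use it for the missing entries. A more compact alternative is sketched below on a made-up frame; `fill_with_mode` is a hypothetical helper name, not part of the notebook, and it returns a copy rather than adding and dropping columns.

```python
import numpy as np
import pandas as pd

# Made-up sample with one missing value per column.
toy = pd.DataFrame({
    "Gender": ["Male", np.nan, "Female", "Male"],
    "Married": ["Yes", "Yes", np.nan, "No"],
})


def fill_with_mode(frame, columns):
    """Return a copy in which missing values in `columns` are replaced
    by the most frequent value of each column."""
    filled = frame.copy()
    for col in columns:
        most_common = filled[col].mode(dropna=True).iloc[0]
        filled[col] = filled[col].fillna(most_common)
    return filled


filled = fill_with_mode(toy, ["Gender", "Married"])
print(filled)
print(filled.isnull().sum())
```

Because the column names are kept, the later cells would need `Gender` rather than `Gender_Updated` if this variant were adopted.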

In [None]:
x2=HomeLoan_Train.Married_updated #Plotting Married_Updated column for preliminary analysis
print(x2)
x2.value_counts().plot(kind='bar')
plt.xlabel('Marital Status', fontsize=16)
plt.ylabel('count', fontsize=16)
plt.title("HomeLoan_Marital Status")
plt.show()

**Marital Status:**

From the above results, we can see that most applicants in the training data are married; fewer are single. This plot shows applicant counts only, not approval rates.

In [None]:
x3=HomeLoan_Train.Gender_Updated #Plotting Gender_Updated column for preliminary analysis
print(x3)
x3.value_counts().plot(kind='bar')
plt.xlabel('Gender', fontsize=16)
plt.ylabel('count', fontsize=16)
plt.title('Gender As Primary Applicant')
plt.show()

**Gender Column:**
According to our analysis, Gender may influence home loan approval. The plot shows that far more primary applicants are men than women, which is consistent with lenders expecting men to be the lead borrowers on applications.

In [None]:
x5=HomeLoan_Train.Dependents_Updated #Plotting Dependents_Updated column for preliminary analysis
depend=['No_Dependents','1_Dependents','2_Dependents','3+_Dependents']
print(x5)
x5.value_counts().plot(kind='bar')
plt.xlabel('Dependents', fontsize=16)
plt.ylabel('count', fontsize=16)
plt.title('Dependents on Primary Applicant')
plt.show()

**Dependents:**
From the analysis, the number of dependents may affect home loan approval. Most applicants have no dependents, and applicants with fewer or no dependents may have a higher chance of approval.

In [None]:
#Relationship between Property area and loan status
x12=HomeLoan_Train.groupby(['Property_Area','Loan_Status']).Loan_Status.value_counts()
print(x12)

Property_Area=['Rural','Semiurban','Urban']
Loan_Status=['Yes', 'No']
pos=np.arange(len(Property_Area))
bar_width=0.35
Loan_Status_Yes=[110,179,133]
Loan_Status_NO=[69,54,69]

plt.bar(pos,Loan_Status_Yes,bar_width,color='blue',edgecolor='black')
plt.bar(pos+bar_width,Loan_Status_NO,bar_width,color='red',edgecolor='black')
plt.xticks(pos, Property_Area)
plt.xlabel('Property Area', fontsize=16)
plt.ylabel('Count', fontsize=16)
plt.title('Property Area vs Loan Status',fontsize=18)
plt.legend(Loan_Status,loc=1)
plt.show()

**Relationship between Property area and loan status:** From the above results, we can infer that the loan approval rate is highest for semi-urban houses (179 of 233, about 77%), followed by urban houses (133 of 202, about 66%) and rural houses (110 of 179, about 61%).
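
Raw counts can mislead when the groups differ in size, so it helps to convert them to approval rates. The sketch below reuses the counts hard-coded in the plotting cell above; the function name `approval_rates` is an assumption made for illustration.

```python
# Counts taken from the plotting cell above.
approved = {"Rural": 110, "Semiurban": 179, "Urban": 133}
rejected = {"Rural": 69, "Semiurban": 54, "Urban": 69}


def approval_rates(approved, rejected):
    """Share of approved applications within each group."""
    return {
        group: approved[group] / (approved[group] + rejected[group])
        for group in approved
    }


rates = approval_rates(approved, rejected)

# Print groups from highest to lowest approval rate.
for area, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{area}: {rate:.1%}")
```

The same function applies to the other count pairs in this notebook (gender, education, dependents and so on).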

In [None]:
#Relationship between Credit History and Loan Status: 
x6=HomeLoan_Train.groupby(['Credit_History','Loan_Status']).Loan_Status.value_counts()
print(x6)

Credit_History=['Bad','Medium','Good']
Loan_Status=['Yes', 'No']
pos=np.arange(len(Credit_History))
bar_width=0.35
Loan_Status_Yes=[7,37,378]
Loan_Status_NO=[82,13,97]

plt.bar(pos,Loan_Status_Yes,bar_width,color='navy',edgecolor='black')
plt.bar(pos+bar_width,Loan_Status_NO,bar_width,color='red',edgecolor='black')
plt.xticks(pos, Credit_History)
plt.xlabel('Credit History', fontsize=16)
plt.ylabel('Count', fontsize=16)
plt.title('Credit History vs Loan Status',fontsize=18)
plt.legend(Loan_Status,loc=2)
plt.show()
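
The bar heights in these cells are typed in by hand from the printed `groupby` output, which is easy to get out of sync with the data. One possible alternative, sketched here on a made-up frame, is to let `pd.crosstab` build the count table and the row-wise rates directly.

```python
import pandas as pd

# Made-up sample; the real notebook would pass HomeLoan_Train columns instead.
toy = pd.DataFrame({
    "Credit_History": [1.0, 1.0, 0.0, 1.0, 0.0, 1.0],
    "Loan_Status": ["Y", "Y", "N", "N", "N", "Y"],
})

# Rows: credit history value, columns: loan status, cells: number of applicants.
counts = pd.crosstab(toy["Credit_History"], toy["Loan_Status"])

# Same table, but each row is divided by its total (approval share per group).
rates = pd.crosstab(toy["Credit_History"], toy["Loan_Status"], normalize="index")

print(counts)
print(rates)
```

Calling `counts.plot(kind="bar")` on such a table would draw grouped bars without any hard-coded numbers.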

In [None]:
#Relationship between Gender and Loan Status:
x7=HomeLoan_Train.groupby(['Gender_Updated','Loan_Status']).Loan_Status.value_counts()
print(x7)

Gender_Updated=['Male', 'Female']
Loan_Status=['Yes', 'No']
pos=np.arange(len(Gender_Updated))
bar_width=0.30
Loan_Status_Yes=[347,75]
Loan_Status_NO=[155,37]

plt.bar(pos,Loan_Status_Yes,bar_width,color='blue',edgecolor='black')
plt.bar(pos+bar_width,Loan_Status_NO,bar_width,color='red',edgecolor='black')
plt.xticks(pos, Gender_Updated)
plt.xlabel('Gender', fontsize=16)
plt.ylabel('Count', fontsize=16)
plt.title('Gender vs Loan status',fontsize=18)
plt.legend(Loan_Status,loc=1)
plt.show()

**Relationship between Gender and Loan status:** From the data analysis, male primary applicants have a slightly higher loan approval rate (347 of 502, about 69%) than female primary applicants (75 of 112, about 67%).

In [None]:
#Relationship between education vs Loan status:
x8=HomeLoan_Train.groupby(['Education','Loan_Status']).Loan_Status.value_counts()
print(x8)

Education=['Graduate', 'Non-Graduate']
Loan_Status=['Yes', 'No']
pos=np.arange(len(Education))
bar_width=0.30
Loan_Status_Yes=[340,82]
Loan_Status_NO=[140,52]

plt.bar(pos,Loan_Status_Yes,bar_width,color='navy',edgecolor='black')
plt.bar(pos+bar_width,Loan_Status_NO,bar_width,color='red',edgecolor='black')
plt.xticks(pos, Education)
plt.xlabel('Education', fontsize=16)
plt.ylabel('Count', fontsize=16)
plt.title('Education vs Loan status',fontsize=18)
plt.legend(Loan_Status,loc=1)
plt.show()

**Relationship between education vs Loan status:** From the analysis, graduate applicants have a higher loan approval rate (340 of 480, about 71%) than non-graduate applicants (82 of 134, about 61%).

In [None]:
#Relationship between Self-Employed vs Loan_Status:
x9=HomeLoan_Train.groupby(['Self_Employed_updated','Loan_Status']).Loan_Status.value_counts()
print(x9)
Self_Employed=['Yes', 'No']
Loan_Status=['Yes', 'No']
pos=np.arange(len(Self_Employed))
bar_width=0.30
Loan_Status_Yes=[56,366]
Loan_Status_NO=[26,166]

plt.bar(pos,Loan_Status_Yes,bar_width,color='navy',edgecolor='black')
plt.bar(pos+bar_width,Loan_Status_NO,bar_width,color='red',edgecolor='black')
plt.xticks(pos, Self_Employed)
plt.xlabel('Self Employed', fontsize=16)
plt.ylabel('Count', fontsize=16)
plt.title('Self Employed vs Loan status',fontsize=18)
plt.legend(Loan_Status,loc=1)
plt.show()

**Relationship between Self-Employed vs Loan_Status:** From the data analysis, the approval rate for self-employed applicants (56 of 82, about 68%) was marginally lower than for salaried applicants (366 of 532, about 69%), so the difference is small.

In [None]:
#Relationship between Dependents vs Loan status
x10=HomeLoan_Train.groupby(['Dependents_Updated','Loan_Status']).Loan_Status.value_counts()
print(x10)

Dependents=['Dpdnt_No', 'Dpdnt_1', 'Dpdnt_2', 'Dpdnt_3']
Loan_Status=['Yes', 'No']
pos=np.arange(len(Dependents))
bar_width=0.30
Loan_Status_Yes=[247,66,76,33]
Loan_Status_NO=[113,36,25,18]

plt.bar(pos,Loan_Status_Yes,bar_width,color='navy',edgecolor='black')
plt.bar(pos+bar_width,Loan_Status_NO,bar_width,color='red',edgecolor='black')
plt.xticks(pos, Dependents)
plt.xlabel('Dependents', fontsize=16)
plt.ylabel('Count', fontsize=16)
plt.title('Dependents vs Loan status',fontsize=18)
plt.legend(Loan_Status,loc=1)
plt.show()

**Relationship between Dependents vs Loan status:** From the analysis, the number of dependents may affect home loan approval, but the pattern is not monotonic: the approval rate is about 69% with no dependents (247 of 360), 65% with one (66 of 102), 75% with two (76 of 101) and 65% with three or more (33 of 51).

In [None]:
#Relationship between marital status vs loan status
x11=HomeLoan_Train.groupby(['Married_updated','Loan_Status']).Loan_Status.value_counts()
print(x11)

MaritalStatus=['Yes', 'No']
Loan_Status=['Yes', 'No']
pos=np.arange(len(MaritalStatus))
bar_width=0.30
Loan_Status_Yes=[288,134]
Loan_Status_NO=[113,79]

plt.bar(pos,Loan_Status_Yes,bar_width,color='navy',edgecolor='black')
plt.bar(pos+bar_width,Loan_Status_NO,bar_width,color='red',edgecolor='black')
plt.xticks(pos, MaritalStatus)
plt.xlabel('Marital Status', fontsize=16)
plt.ylabel('Count', fontsize=16)
plt.title('Marital Status vs Loan status',fontsize=18)
plt.legend(Loan_Status,loc=1)
plt.show()

**Relationship between Marital status vs Loan status:** From the analysis, married customers account for most of the approved loans and have a higher approval rate (288 of 401, about 72%) than single customers (134 of 213, about 63%).

**Machine Learning Algorithms:**
Converting the categorical variables into numerical values using map function

In [None]:
HomeLoan_Train.Education=HomeLoan_Train.Education.map({'Not Graduate':0,'Graduate':1})

In [None]:
HomeLoan_Train.Property_Area=HomeLoan_Train.Property_Area.map({'Rural':0,'Semiurban':1,'Urban':2})

In [None]:
HomeLoan_Train.Loan_Status=HomeLoan_Train.Loan_Status.map({'N':0,'Y':1})

In [None]:
HomeLoan_Train.Self_Employed_updated=HomeLoan_Train.Self_Employed_updated.map({'No':0,'Yes':1})

In [None]:
HomeLoan_Train.Married_updated=HomeLoan_Train.Married_updated.map({'No':0,'Yes':1})

In [None]:
HomeLoan_Train.Gender_Updated=HomeLoan_Train.Gender_Updated.map({'Female':0,'Male':1})
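
One caveat with `Series.map` and a dictionary: any label missing from the dictionary silently becomes `NaN`. The sketch below shows a quick check on a made-up series; the misspelled label `"graduate"` is invented to trigger the problem.

```python
import pandas as pd

# Made-up labels; the last one deliberately does not match the mapping keys.
education = pd.Series(["Graduate", "Not Graduate", "Graduate", "graduate"])

encoded = education.map({"Not Graduate": 0, "Graduate": 1})

# Labels that the mapping did not cover end up as NaN in `encoded`.
unmapped = education[encoded.isna()].unique().tolist()

print(encoded)
print("unmapped labels:", unmapped)
```

Running a check like this after each mapping cell would catch unexpected labels before they reach the models.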

In [None]:
HomeLoan_Train.head()

In [None]:
HomeLoan_Train.drop(['Loan_ID'], axis=1, inplace=True) #Drop the column Loan_ID

In [None]:
HomeLoan_Train.head()

In [None]:
X=HomeLoan_Train.drop(['Loan_Status'], axis=1)

In [None]:
y=HomeLoan_Train.Loan_Status

In [None]:
import seaborn as sns

In [None]:
plt.figure(figsize=(10,10))
cor = HomeLoan_Train.corr(numeric_only=True).abs() #absolute correlations between the numeric columns
sns.heatmap(cor, annot=True, cmap=plt.cm.Reds)
plt.show()

In [None]:
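#Recursive feature elimination: keep the feature subset with the best training accuracy
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
max_accu=0
estimator = LogisticRegression()
for i in range(1,X.shape[1]+1):
    selector = RFE(estimator, n_features_to_select=i, step=1) #keyword argument is required in recent scikit-learn versions
    selector = selector.fit(X,y)
    accuracy = selector.score(X,y) #accuracy on the training data itself
    if max_accu < accuracy:
        sel_features = selector.support_
        max_accu = accuracy

X_sub = X.loc[:,sel_features] #selected columns (the models below still use all features)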

In [None]:
#Data Preprocessing
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_scaled = pd.DataFrame(sc_X.fit_transform(X), columns=X.columns)

In [None]:
#split train and test sets

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, random_state=0)
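
The cells above scale the full feature matrix before splitting it, so the scaler also sees the test rows. A common alternative, which is not what this notebook does, is to split first and wrap the scaler and classifier in a pipeline so the scaler is fitted on the training rows only. The sketch below uses synthetic data; `X_demo`, `y_demo` and the rule generating the labels are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for two numeric features (income-like and loan-like).
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=[5000.0, 150.0], scale=[2000.0, 50.0], size=(200, 2))
noise = rng.normal(size=200)
y_demo = (X_demo[:, 0] / 1000.0 - X_demo[:, 1] / 50.0 + noise > 2.0).astype(int)

# Split first, so the test rows never influence the scaling statistics.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=0, stratify=y_demo
)

# The pipeline fits the scaler on X_tr only and reuses it for X_te.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_tr, y_tr)
test_accuracy = pipe.score(X_te, y_te)
print(test_accuracy)
```

The same pipeline object can be passed to `GridSearchCV` or `cross_val_score`, in which case the scaler is refitted inside every fold.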

In [None]:
#import classifier
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()
classifier.fit(X_train,y_train)
y_pred = classifier.predict(X_test)
classifier.score(X_test,y_test) #classifier performance on test set

In [None]:
# importing performance measuring tools
from sklearn.metrics import accuracy_score, confusion_matrix,recall_score,precision_score,classification_report

recall_score(y_test,y_pred,average='macro')

In [None]:
cr=classification_report(y_test,y_pred)
print(cr)

In [None]:
confusion_matrix(y_test,y_pred)

In [None]:
accuracy_score(y_test,y_pred)

In [None]:
precision_score(y_test,y_pred,average='macro')
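
To make the relationship between the confusion matrix and the scores above explicit, the sketch below recomputes accuracy and the macro-averaged precision and recall by hand. The four counts are hypothetical, not the notebook's actual output, and follow scikit-learn's layout for labels 0 and 1: `[[tn, fp], [fn, tp]]`.

```python
def metrics_from_confusion(tn, fp, fn, tp):
    """Accuracy plus macro-averaged precision and recall for a binary task.

    Assumes every denominator is non-zero (both classes occur and are predicted).
    """
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total

    # Per-class precision and recall, treating each class as "positive" in turn.
    precision_pos = tp / (tp + fp)
    recall_pos = tp / (tp + fn)
    precision_neg = tn / (tn + fn)
    recall_neg = tn / (tn + fp)

    return {
        "accuracy": accuracy,
        "macro_precision": (precision_pos + precision_neg) / 2,
        "macro_recall": (recall_pos + recall_neg) / 2,
    }


# Hypothetical counts for illustration only.
scores = metrics_from_confusion(tn=20, fp=20, fn=5, tp=105)
print(scores)
```

Macro averaging weights both classes equally, which is why a weak recall on rejected loans pulls the macro recall well below the accuracy.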

In [None]:
#import KNeighborsClassifier
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=21,weights='distance',p=1)
model.fit(X_train,y_train)
model.score(X_test,y_test)

In [None]:
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier()
model.fit(X_train,y_train)
model.score(X_test,y_test)
param_dict=({'n_neighbors':range(3,11,2),'weights':['uniform','distance'],'p':[1,2,3,4,5]})
from sklearn.model_selection import GridSearchCV
best_model=GridSearchCV(model,param_dict,cv=5)
best_model.fit(X_scaled,y)
best_model.best_params_
best_model.best_score_

In [None]:
from sklearn.ensemble import RandomForestClassifier
model2 = RandomForestClassifier(max_depth=25)
model2.fit(X_train,y_train)
model2.score(X_test,y_test)
param_dict_2=({'n_estimators':range(2,50)})
from sklearn.model_selection import GridSearchCV
best_model=GridSearchCV(model2,param_dict_2,cv=5)
best_model.fit(X_scaled,y)
best_model.best_params_
best_model.best_score_

In [None]:
from sklearn.ensemble import AdaBoostClassifier
model3 = AdaBoostClassifier(n_estimators=20)
model3.fit(X_train,y_train)
model3.score(X_test,y_test)
param_dict_3=({'n_estimators':range(2,50)})
from sklearn.model_selection import GridSearchCV
best_model=GridSearchCV(model3,param_dict_3,cv=5)
best_model.fit(X_scaled,y)
best_model.best_params_
best_model.best_score_

In [None]:
#Support vector Machine model
from sklearn.svm import SVC
model_svc = SVC(kernel='linear',gamma=0.001,C=1.0)
model_svc.fit(X_train,y_train)
model_svc.score(X_test,y_test)

In [None]:
#Estimating the best model using Cross-validation
new_model=best_model.best_estimator_ #best estimator from the most recent grid search (AdaBoostClassifier)
from sklearn.model_selection import cross_val_score
cross_val_score(new_model, X_scaled,y,cv=5).mean()

In [None]:
print(new_model)

From the results, we can conclude that the **AdaBoostClassifier model** predicted loan status with 80.9% accuracy.

**Results of Machine Learning models:**

Logistic regression model: 83.7%

KNeighborsClassifier model: 80.1%

RandomForestClassifier model: 78.9%

AdaBoostClassifier model: 80.9%

Support vector Machine model: 83.1%

KFold cross_val_score: 80.9%
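
The figures above mix scores from a single train/test split with a cross-validated score, which makes them hard to compare directly. One way to put all models on the same footing is to evaluate each with the same cross-validation folds, as sketched below. The data is synthetic, and the hyperparameters are either copied from the cells above or left at their defaults; this is an illustration, not a re-run of the notebook's results.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the prepared feature matrix and target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(150, 4))
signal = X_demo[:, 0] + 0.5 * X_demo[:, 1] + rng.normal(scale=0.5, size=150)
y_demo = (signal > 0).astype(int)

models = {
    "LogisticRegression": LogisticRegression(),
    "KNeighborsClassifier": KNeighborsClassifier(n_neighbors=21, weights="distance", p=1),
    "RandomForestClassifier": RandomForestClassifier(max_depth=25, random_state=0),
    "AdaBoostClassifier": AdaBoostClassifier(n_estimators=20, random_state=0),
    "SVC": SVC(kernel="linear", C=1.0),
}

# The same shuffled, stratified folds are used for every model.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

cv_scores = {}
for name, model in models.items():
    pipeline = make_pipeline(StandardScaler(), model)
    cv_scores[name] = cross_val_score(pipeline, X_demo, y_demo, cv=cv).mean()

for name, score in cv_scores.items():
    print(f"{name}: {score:.3f}")
```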

**Conclusions**

1. The main purpose of this project is to classify and analyze loan applications.
2. Based on an analysis of the data set and the constraints of the banking sector, several graphs were generated and visualized.
3. From the data analysis, several conclusions were drawn, such as that the majority of loan applicants preferred short-term loans and that most clients applied for loans for debt consolidation.
4. For the predictive analysis, we used logistic regression, which in simple terms predicts the probability of an event by fitting data. We generated a confusion matrix and obtained accuracy, precision and recall scores of 83%, 85% and 73% for the model.
5. We also trained KNeighborsClassifier, RandomForestClassifier, AdaBoostClassifier and support vector classifier models, which scored 80.1%, 78.9%, 80.9% and 83.1% respectively.

**Recommendations**
1. Mortgage lenders appeared more inclined towards men than women, expecting men to be the lead borrowers on single or joint applications.
2. Applicants should improve their debt-to-income ratio.
3. For self-employed applicants, the lender relies on determining repayment ability, including expected income or assets, by reviewing tax returns and financial institution records.
4. Applicants with fewer or no dependents may have a higher chance of home loan approval.
5. Applicants with a good credit history had higher chances of loan approval.
6. Applicants with higher education had higher chances of loan approval.
7. Applicants who have stable jobs.