![BAIME banner](https://user-images.githubusercontent.com/47600826/89530907-9b3f6480-d7ef-11ea-9849-27617f6025cf.png)

# Predicting Loan Eligibility

![loan](https://images.financialexpress.com/2020/07/HOME-LOAN-HIKE.jpg)

## The problem
In this notebook we look at the data from this [Kaggle dataset](https://www.kaggle.com/gavincanacam/home-loan-predictions). 

A company called "Housing Finance company" wants to automate its loan eligibility process based on customer information, and to identify the customer segments that are eligible for a loan.

We will explore the dataset, check the features it contains, and build a model that predicts whether or not a loan would be approved, so that the process can be automated.

# Import the important libraries / packages
These packages are needed to load and use the dataset

In [None]:
import pandas as pd #we use this to load, read and transform the dataset
import numpy as np #we use this for statistical analysis
import matplotlib.pyplot as plt #we use this to visualize the dataset
import seaborn as sns #we use this to make countplots
import sklearn.metrics as sklm #This is to test the models

In [None]:
#here we load the train data
data = pd.read_csv(r'/kaggle/input/home-loan-predictions/Train_Loan_Home.csv')

#and immediately I would like to see what this dataset looks like
data.head()

In [None]:
#now let's look closer at the dataset we got
data.info()

It seems that we have a lot of text / category information (these are of the Dtype 'object') and a few numerical columns (Dtypes 'int64' and 'float64'). 

The last column 'Loan_Status' is the column we would like to predict. 

In [None]:
data.shape

The dataset consists of 614 rows and 13 columns. 

In [None]:
data.describe()

It seems that we have some strange outliers for the income and loan amounts. We will look and handle these later on. 

In [None]:
data.describe(include='O')

In [None]:
#Let's see what the options are in the text columns (the objects)
print('Gender: ' + str(data['Gender'].unique()))
print('Married: ' + str(data['Married'].unique()))
print('Dependents: '+ str(data['Dependents'].unique()))
print('Education: '+ str(data['Education'].unique()))
print('Self_Employed: '+ str(data['Self_Employed'].unique()))
print('Property_Area: '+ str(data['Property_Area'].unique()))

It seems there are several binary categorical columns, such as Gender, Married and Education.

# Loan Status in this Dataset

![approved or rejected](https://db3pap006files.storage.live.com/y4pVnKKIPUMfGtdOP-mIsJIDFD6QD9mNmC5br03t9oSX6uCFHlSgyrzOKvkBvemfQbgGRltJXJI1DygwGgxBzszvmqoQtfMhbsE_Ajl8VAnNDIy3BIOXRlTJAB3jdnZYTPtQFmMkHmo74vxcBUc_JjX1kW47Rp33UKov0MllAFFuPU-lzJypcr-s05Yv1bCIpcC9bwZsareXmkMCxxmCZBS67Ya2zrP2Ac3z3F0enmC6qo/stamp-2114884_1920.png?psid=1&width=192&height=65)

As Loan Status is the column we want to predict, let's explore this column in the training dataset. 

In [None]:
#first let's count the number of loans approved and rejected
Approved = data[data['Loan_Status'] == 'Y']['Loan_Status'].count()
Rejected = data[data['Loan_Status'] == 'N']['Loan_Status'].count()

#now let's put these results in a dataframe to visualize them
df = {"Count" : [Approved, Rejected]} #this is for the legend to be clear that it is counts
Status = pd.DataFrame(df, index=["Approved", "Rejected"])

#let's visualize the bar plot
ax = Status.plot(kind = 'bar', title = 'Status of the loans')

#here I want to add the labels to the bars, and to make them easier to read I've made them white
for p in ax.patches:
    ax.annotate(str(p.get_height()), (p.get_x() + p.get_width() / 2, p.get_height() - 30), color = 'white', fontweight = 'bold')

In [None]:
#let's see the percentages of the status:
status_share = data['Loan_Status'].value_counts(normalize = True) * 100 #share of each status in percent
print('The percentage of approved loans : %.2f' % status_share['Y'])
print('The percentage of rejected loans : %.2f' % status_share['N'])

It looks like the two classes are not well balanced in this set.
But as this is the only data we have, I will leave this as is for now. 

# Handling missing values
Let's continue with the missing values in this dataset. 
First we check where they are and how many there are.  

In [None]:
#let's look in what columns there are missing values 
data.isnull().sum().sort_values(ascending = False)

I will look closely at the top 3 here (as these have the most missing values) and I will drop the other missing value rows. 

In [None]:
#Let's look at the credit history in more detail to see what the best way is to handle these missing values
#I will use seaborn for the visualization (x, hue and data are passed by name, as recent seaborn versions require)
sns.countplot(x = 'Loan_Status', hue = 'Credit_History', data = data)

This looks like a good feature to use, as there is a clear difference in the size of the bars for yes and no, so let's look deeper!

In [None]:
print(pd.crosstab(data['Credit_History'],data['Loan_Status']))

In [None]:
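#share of each credit history value as a percentage of all rows (rows with a missing value count in the total)
history_counts = data['Credit_History'].value_counts()
print('The percentage of credit history yes : %.2f' % (100 * history_counts.loc[1.0] / len(data)))
print('The percentage of credit history no : %.2f' % (100 * history_counts.loc[0.0] / len(data)))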

It seems that if you have a credit history, the loan is more likely to be approved. 

Options for handling these missing values:
- Drop all the rows with missing values
- Fill the missing values with 0 (so no history), as nothing is known about these applicants
- Or use the most frequent value, which is 1 for the credit history

In this case, I tend to go for the most frequent value, as this is 86% of the dataset, so it is the most likely to be true.
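
To make the three options concrete, here is a minimal sketch on a toy column. The values in `toy` are made up for illustration and are not taken from the loan dataset; only the column name is reused.

```python
import numpy as np
import pandas as pd

# Toy column standing in for Credit_History (illustrative values only)
toy = pd.DataFrame({"Credit_History": [1.0, 1.0, np.nan, 0.0, 1.0, np.nan]})

# Option 1: drop the rows with a missing value
dropped = toy.dropna(subset=["Credit_History"])

# Option 2: treat a missing value as "no history"
filled_zero = toy["Credit_History"].fillna(0)

# Option 3: fill with the most frequent value (the mode)
most_frequent = toy["Credit_History"].mode()[0]
filled_mode = toy["Credit_History"].fillna(most_frequent)

print("rows left after dropping:", len(dropped))
print("share with history, filled with 0   :", filled_zero.mean())
print("share with history, filled with mode:", round(filled_mode.mean(), 2))
```

Dropping loses rows, filling with 0 pulls the share of applicants with a history down, and filling with the mode pushes it up, so the choice is a trade-off rather than a neutral step.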

In [None]:
data['Credit_History'] = data['Credit_History'].fillna(1)
data.isnull().sum().sort_values(ascending = False)

In [None]:
#Continue with Self_Employed
sns.countplot(x = 'Loan_Status', hue = 'Self_Employed', data = data)

As this seems to have no effect on the outcome, I will fill these with the most frequent value (so No).

In [None]:
data['Self_Employed'] = data['Self_Employed'].fillna('No')
data.isnull().sum().sort_values(ascending = False)

In [None]:
#Continue with LoanAmount. As this is a numeric, continuous column, I will use a scatterplot to see if there is a pattern / correlation.
plt.scatter(data['Loan_Status'], data['LoanAmount'])

In [None]:
#As the patterns look similar for yes and no, I will fill the missing values with the mean of the column
data['LoanAmount'] = data['LoanAmount'].fillna( data['LoanAmount'].mean())
data.isnull().sum().sort_values(ascending = False)

In [None]:
#Let's drop the rest of the missing values:
data.dropna(inplace = True)
data.shape

# Take a closer look at some of the features
Let's look at the outliers!

In [None]:
fig, (ax1, ax2, ax3) = plt.subplots(1,3, figsize = (20,10))
ax1.boxplot(data['ApplicantIncome'])
ax2.boxplot(data['CoapplicantIncome'])
ax3.boxplot(data['LoanAmount'])
plt.show()

In [None]:
#Look closely at the ApplicantIncome column.
plt.boxplot(data['ApplicantIncome'])

In [None]:
#We see that there are two large outliers here. 
#let's look closer at these two outliers
outliers = data[data['ApplicantIncome'] > 50000]
outliers.head()

In [None]:
#As these are just two rows and the status is not approved for both, I will remove these two rows for the model. 
data = data[data['ApplicantIncome'] < 50000]
#let's plot the applicant income again in a boxplot
plt.boxplot(data['ApplicantIncome'])

In [None]:
#still a lot of outliers above 25000. Let's look closer at those to decide whether we should keep them to get a good model performance
outliers = data[data['ApplicantIncome'] > 25000]
outliers.head()

These seem to be ok for the model, as 75% of them are approved. So let's keep them for now. 
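
The cut-offs used here (50000 and 25000) were read off the box plots by eye. As a complement, the sketch below flags outliers with the 1.5 × IQR rule, which is the default rule box plots use to draw the individual points beyond the whiskers. The helper `iqr_outliers` and the toy income values are hypothetical and not part of this notebook.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask that is True for values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Toy incomes (illustrative values only)
income = pd.Series([2500, 3000, 3200, 4000, 4500, 5000, 81000])
mask = iqr_outliers(income)

print(income[mask].tolist())  # [81000]
```

On a skewed column such as income this rule tends to flag many rows, so it is better suited for listing candidates to inspect than for deleting rows automatically.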

In [None]:
#Look closely at the CoApplicantIncome column.
plt.boxplot(data['CoapplicantIncome'])

In [None]:
#We see that there are a few large outliers here. 
#let's look closer at the ones above 25000
outliers = data[data['CoapplicantIncome'] > 25000]
outliers.head()

In [None]:
#As you can see, these are just two rows and the status is not approved, so I will remove these two rows for the model. 
data = data[data['CoapplicantIncome'] < 25000]
#let's plot the coapplicant income again in a boxplot
plt.boxplot(data['CoapplicantIncome'])

# Make all columns numeric
We need to make all columns numeric to use them in the models later on. 
This is what I will do now. 

In [None]:
#First make the target column (Loan_Status) numerical
data['Loan_Status'] = np.where((data['Loan_Status'] == 'Y'), 1, 0)

In [None]:
#Next we will drop the Loan_ID column as this will only confuse the model later on
data.drop('Loan_ID', axis=1, inplace=True)
data.info()

In [None]:
#Next, make all other columns numerical as well. 
data['Married'] = np.where((data['Married'] == 'Yes'), 1, 0)
data['Gender'] = np.where((data['Gender'] == 'Female'), 1, 0)
data['Education'] = np.where((data['Education'] == 'Graduate'), 1, 0)
data['Self_Employed'] = np.where((data['Self_Employed'] == 'Yes'), 1, 0)
#I saw that there was no big difference between the number of dependents if there are any, so no dependents = 0 and one or more dependents = 1
data['Dependents'] = np.where((data['Dependents'] == '0'), 0, 1)

In [None]:
#Lastly I want to change the Property_Area column, but I want to keep all three options. Therefore I will do this differently. 

def f(row):
  if row['Property_Area'] == "Rural":
    val = 1
  elif row['Property_Area'] == "Urban":
    val = 0
  else:
    val = 2
  return val

data['Property_Area'] = data.apply(f, axis=1)
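
The same encoding can also be written without a row-wise function. The sketch below uses a dictionary with `Series.map` on a toy frame and writes to a separate, hypothetical column `Property_Area_code` so the original values stay visible. The third category name, "Semiurban", is an assumption; any value that is not in the dictionary gets code 2, just like the `else` branch above.

```python
import pandas as pd

# Toy frame (illustrative values only; "Semiurban" is an assumed category name)
toy = pd.DataFrame({"Property_Area": ["Urban", "Rural", "Semiurban", "Urban"]})

# Same codes as the function f: Rural = 1, Urban = 0, everything else = 2
area_codes = {"Urban": 0, "Rural": 1}
toy["Property_Area_code"] = toy["Property_Area"].map(area_codes).fillna(2).astype(int)

print(toy["Property_Area_code"].tolist())  # [0, 1, 2, 0]
```

Note that either way the codes 0, 1 and 2 suggest an order between the areas that may not exist, which some models are sensitive to.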

In [None]:
data.info()

Right, so now all columns are numeric.

# Most important features
Let's continue by looking at the most important features according to three different tests. 
Then we will use the top ones to train and test our first model. 

In [None]:
#First we need to split the dataset in the y-column (the target) and the components (X), the independent columns. 
#This is needed as we need to use the X columns to predict the y in the model. 

X = data.iloc[:,0:11]  #independent columns 
y = data.iloc[:,-1]    #target column = Status of the loan

In [None]:
from sklearn.ensemble import ExtraTreesClassifier
model = ExtraTreesClassifier()
model.fit(X,y)
print(model.feature_importances_) #feature_importances_ is a built-in attribute of tree-based classifiers
#plot graph of feature importances for better visualization
feat_importances = pd.Series(model.feature_importances_, index=X.columns)
feat_importances.nlargest(10).plot(kind='barh')
plt.show()

In [None]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

#apply SelectKBest with the chi2 test to score the features (k='all' keeps every feature, the top 10 are printed below)
bestfeatures = SelectKBest(score_func=chi2, k='all')
fit = bestfeatures.fit(X,y)
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(X.columns)
#concat two dataframes for better visualization 
featureScores = pd.concat([dfcolumns,dfscores],axis=1)
featureScores.columns = ['Name of the column','Score']  #naming the dataframe columns
print(featureScores.nlargest(10,'Score'))  #print 10 best features

In [None]:
#get the correlations between all columns in the dataset
corrmat = data.corr()
plt.figure(figsize=(10,10))

#plot heat map
g=sns.heatmap(corrmat,annot=True,cmap="RdYlGn")

It seems that the three feature selection methods differ in which feature they rank as the most important.
For the first test I will keep:
- Credit history (high in all three tests and the highest in the correlation)
- Co-applicant income (high in two tests; negative in the correlation, but this is explainable, as no income for the spouse means more risk)
- Property area (high in two tests)
- Married (mentioned in two tests)

In a test, these 4 features gave better results than using all features. 
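
The comparison between the 4 selected features and all features is not shown in this notebook. The sketch below shows one way such a check could be run, using cross-validation on synthetic data from `make_classification`. The data, the choice of a decision tree and the 5 folds are all assumptions for illustration, so the printed scores say nothing about the loan dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: 11 features, of which the first 4 are informative
# (shuffle=False keeps the informative columns first)
X_toy, y_toy = make_classification(
    n_samples=300, n_features=11, n_informative=4, n_redundant=0,
    shuffle=False, random_state=0,
)

model = DecisionTreeClassifier(max_depth=3, random_state=0)

# Mean accuracy over 5 folds, once with all columns and once with the first 4
score_all = cross_val_score(model, X_toy, y_toy, cv=5).mean()
score_subset = cross_val_score(model, X_toy[:, :4], y_toy, cv=5).mean()

print("all 11 features :", round(score_all, 2))
print("first 4 features:", round(score_subset, 2))
```

Averaging over several folds makes the comparison less dependent on one particular train/test split.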

# Machine learning Model
As this is a binary problem (the status is either yes or no), I chose two classification models:
- Decision Tree
- K-nearest Neighbors

We can cross-check them with a logistic regression model.

For the record, I left out Random Forest, as it builds its trees from random samples, so the result is not the same each time you run the model.
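
If run-to-run variation is the main concern, one option (not used in this notebook) is to fix the `random_state` parameter of scikit-learn's `RandomForestClassifier`. The sketch below fits the same forest twice on synthetic data and checks that both runs give identical probabilities; the data and the parameter values are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (illustrative only)
X_toy, y_toy = make_classification(n_samples=200, n_features=4, random_state=0)

# Two forests with the same seed
first = RandomForestClassifier(n_estimators=20, random_state=25).fit(X_toy, y_toy)
second = RandomForestClassifier(n_estimators=20, random_state=25).fit(X_toy, y_toy)

same = np.array_equal(first.predict_proba(X_toy), second.predict_proba(X_toy))
print("identical predictions with a fixed seed:", same)
```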

In [None]:
#Load the chosen models here
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

#add the logistic regression for cross check
from sklearn.linear_model import LogisticRegression

# Split the dataset in train and test
Before we use the chosen models, we first split the dataset into a train and a test set.
We do this because we want to train the model on one part and check its accuracy on data it has not seen during training. 


In [None]:
from sklearn.model_selection import train_test_split

#First try with the 4 most important features
X_4 = data[['Credit_History', 'CoapplicantIncome', 'Married', 'Property_Area']] #independent columns chosen 
y = data.iloc[:,-1]    #target column = Status of the loan

#I want to hold out 30% of the data to perform the tests
X_train, X_test, y_train, y_test= train_test_split(X_4,y, test_size=0.3 , random_state = 25)

In [None]:
print('Shape of X_train is: ', X_train.shape)
print('Shape of X_test is: ', X_test.shape)
print('Shape of y_train is: ', y_train.shape)
print('Shape of y_test is: ', y_test.shape)

In [None]:
#Let's confirm that we use the same number of status approved versus disapproved in the test and train data.
#As approved is 1, this can be counted easily. 
print('The % approved status versus not approved in original_data :',data['Loan_Status'].value_counts().values/ len(data))
print('\nThe % approved status versus not approved in y_train :',y_train.value_counts().values/ len(y_train))
print('\nThe % approved status versus not approved in y_test :',y_test.value_counts().values/ len(y_test))

This looks about the same, let's continue. 
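
With a plain random split the proportions only match approximately. A minimal sketch of how they can be enforced, assuming the `stratify` parameter of `train_test_split` is acceptable here: the toy data below (70 approved, 30 rejected) is made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 70 approved (1) and 30 rejected (0), illustrative only
X_toy = pd.DataFrame({"feature": np.arange(100)})
y_toy = pd.Series([1] * 70 + [0] * 30)

# stratify=y_toy keeps the class proportions the same in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=25, stratify=y_toy
)

print("share approved in train:", round(y_tr.mean(), 2))
print("share approved in test :", round(y_te.mean(), 2))
```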

# Try and check the models 

In [None]:
#To check the models, I want to build a check matrix within two functions:
def score_model(probs, threshold):
    #turn the probability of class 1 (approved) into a 0/1 prediction using the threshold
    return np.array([1 if x > threshold else 0 for x in probs[:,1]])

def print_metrics(labels, probs, threshold):
    scores = score_model(probs, threshold)
    metrics = sklm.precision_recall_fscore_support(labels, scores)
    conf = sklm.confusion_matrix(labels, scores)
    #sklearn orders the classes as [0, 1]: rows are the actual class, columns the predicted class.
    #So 'positive' below refers to class 0 (rejected) and 'negative' to class 1 (approved).
    print('                 Confusion matrix')
    print('                 Score positive    Score negative')
    print('Actual positive    %6d' % conf[0,0] + '             %5d' % conf[0,1])
    print('Actual negative    %6d' % conf[1,0] + '             %5d' % conf[1,1])
    print('')
    print('DETAILS ACCURACY, PRECISION AND RECALL')
    print('Accuracy        %0.2f' % sklm.accuracy_score(labels, scores))
    print('AUC             %0.2f' % sklm.roc_auc_score(labels, probs[:,1]))
    print('Macro precision %0.2f' % ((metrics[0][0] + metrics[0][1]) / 2.0))
    print('Macro recall    %0.2f' % ((metrics[1][0] + metrics[1][1]) / 2.0))
    print(' ')
    print('           Positive      Negative')
    print('Num case   %6d' % metrics[3][0] + '        %6d' % metrics[3][1])
    print('Precision  %6.2f' % metrics[0][0] + '        %6.2f' % metrics[0][1])
    print('Recall     %6.2f' % metrics[1][0] + '        %6.2f' % metrics[1][1])
    print('F1         %6.2f' % metrics[2][0] + '        %6.2f' % metrics[2][1])
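
The matrix printed by `print_metrics` comes from `sklm.confusion_matrix`, which puts the actual classes in the rows and the predicted classes in the columns, both ordered as [0, 1]. The small sketch below shows that layout on made-up labels, so it is easier to see which cell is which; the arrays `actual` and `predicted` are illustrative only.

```python
import numpy as np
import sklearn.metrics as sklm

# Made-up labels: 0 = rejected, 1 = approved
actual    = np.array([0, 0, 0, 1, 1, 1, 1, 1])
predicted = np.array([0, 1, 1, 1, 1, 1, 1, 0])

conf = sklm.confusion_matrix(actual, predicted)
print(conf)
# [[1 2]   <- actual 0: 1 predicted as 0, 2 predicted as 1
#  [1 4]]  <- actual 1: 1 predicted as 0, 4 predicted as 1

# Flattened row by row: (actual 0, pred 0), (actual 0, pred 1), (actual 1, pred 0), (actual 1, pred 1)
rejected_ok, rejected_missed, approved_missed, approved_ok = conf.ravel()
print(rejected_ok, rejected_missed, approved_missed, approved_ok)  # 1 2 1 4
```

So the first row of the printed matrix always describes the rejected loans (class 0) and the second row the approved loans (class 1).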

# K-Nearest Neighbors

In [None]:
#Start with the K-Nearest Neighbors
K_n = KNeighborsClassifier()
K_n.fit(X_train, y_train)

In [None]:
#Now let's see how this model performs
prob_K = K_n.predict_proba(X_test)
print_metrics(y_test, prob_K, 0.3) 

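This model does not predict the rejected loans well. In the matrix above, 'positive' is class 0 (rejected): only 6 rejected loans are predicted correctly (true positives), against 43 that are predicted as approved (false negatives). On the other hand, 121 approved loans are predicted correctly (true negatives) and only 1 is predicted as rejected (false positive).  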
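
One possible reason for this pattern is the threshold of 0.3: every loan with a predicted approval probability above 0.3 is scored as approved. The sketch below applies the same rule as `score_model` to made-up probabilities with two thresholds; the `probs` array is illustrative and not output from the model.

```python
import numpy as np

def score_model(probs, threshold):
    # Same rule as in the notebook: class 1 if its probability exceeds the threshold
    return (probs[:, 1] > threshold).astype(int)

# Made-up probabilities, columns = [P(rejected), P(approved)]
probs = np.array([
    [0.80, 0.20],
    [0.65, 0.35],
    [0.55, 0.45],
    [0.40, 0.60],
    [0.10, 0.90],
])

low = score_model(probs, 0.3)
high = score_model(probs, 0.5)

print("threshold 0.3:", low.tolist())   # [0, 1, 1, 1, 1]
print("threshold 0.5:", high.tolist())  # [0, 0, 0, 1, 1]
```

A lower threshold approves more loans, which raises the number of rejected loans that are wrongly approved; whether 0.3 or another value fits best depends on which mistake is more costly.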

# Decision Tree

In [None]:
#Continue with the decision tree with a maximum depth of 3
D_tree = DecisionTreeClassifier(max_depth = 3)
D_tree.fit(X_train, y_train)

In [None]:
#let's see its performance
prob_D = D_tree.predict_proba(X_test)
print_metrics(y_test, prob_D, 0.3)

The accuracy seems to be higher: 19 rejected loans are now predicted correctly (the true positives in the matrix), but there is still room for improvement.

# Logistic regression

In [None]:
# logistic_regression model
logistic_mod = LogisticRegression(C = 1.0, class_weight = {0:0.45, 1:0.55}) 
logistic_mod.fit(X_train, y_train)

In [None]:
#Check the performance of the logistic regression model
probabilities = logistic_mod.predict_proba(X_test)
print_metrics(y_test, probabilities, 0.3) 

This model seems to perform worse than the decision tree model: 16 true positives and 1 false positive. 

# Conclusion:
We would need more data to make the models perform better. 

For now, the decision tree has the highest accuracy and precision scores with the 4 most important features. 
Therefore this would be the model to use for predicting the loan status.

# Predict on the test set

Test the model on the test dataset.
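
The notebook stops before this step, so the following is only a sketch of what it could look like. Two toy frames stand in for the cleaned train and test data, and the column `Predicted_Status` is a hypothetical name. The real test file would first need the same cleaning and encoding steps as the training data before the model can be applied.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

features = ["Credit_History", "CoapplicantIncome", "Married", "Property_Area"]

# Toy stand-in for the cleaned training data (illustrative values only)
train_toy = pd.DataFrame({
    "Credit_History":    [1, 1, 0, 0, 1, 0, 1, 1],
    "CoapplicantIncome": [0, 1500, 0, 2000, 1800, 0, 2500, 0],
    "Married":           [1, 1, 0, 1, 1, 0, 1, 0],
    "Property_Area":     [0, 2, 1, 1, 2, 0, 2, 1],
    "Loan_Status":       [1, 1, 0, 0, 1, 0, 1, 1],
})

# Toy stand-in for the cleaned test data: same feature columns, no Loan_Status
test_toy = pd.DataFrame({
    "Credit_History":    [1, 0, 1],
    "CoapplicantIncome": [1200, 0, 0],
    "Married":           [1, 0, 1],
    "Property_Area":     [2, 1, 0],
})

# Fit on the training part, then predict the status for the unseen rows
tree = DecisionTreeClassifier(max_depth=3, random_state=25)
tree.fit(train_toy[features], train_toy["Loan_Status"])
test_toy["Predicted_Status"] = tree.predict(test_toy[features])

print(test_toy[["Credit_History", "Predicted_Status"]])
```

In this toy data the status follows the credit history exactly, so the predictions simply mirror that column; on the real data the tree would combine several features.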