# Loan Prediction

> **Understanding the data** before training an algorithm on it is the right way to approach a problem. It **makes the ML problem-solving process much smoother and clearer** for both us and the machine. In this notebook, we've formed the following understanding of and insights into the data:
> 1. The data is `biased`.
> 2. The Credit History column has the `highest significance` for the target variable (Loan Status).
> 3. There are some `typos in the dataset` that lead to bad training (related to observation no. 8), especially in the Loan Amount Term column.
> 4. The `larger population applying` for loans is Male, Graduate, Not Self-employed, Married and with 0 dependents.
> 5. People who are Graduates and Not Self-employed have `better chances of getting a loan.`
> 6. People with a Property Area in `Semi-urban places` have `greater chances of loan approval.`
> 7. People with > 0 Dependents are `mostly Married`.
> 8. There's a slight `linear relationship` between `Loan Amount` and `Applicant Income`, possibly because people running larger businesses need larger loans for larger trades.
> 9. People with a Residual Income of 0 or less (considering EMIs and number of dependents) have `more negative Credit History records`. It is also *fascinating* that some of those people still get lucky with loans.
> 10. A `higher Applicant Income` alone does `not` mean `higher chances` of receiving a `loan`; *however*, people with `low Total Income` (Applicant Income + Co-applicant Income) have `lower chances` than people with higher Total Income.

## Predict Loan Eligibility for Dream Housing Finance company

I'm going to build an end-to-end machine learning model to classify whether someone is eligible for a loan, based on certain details.

I'm going to take the following approach:
1. Problem definition
2. Data
3. Evaluation
4. Features
5. Modelling
6. Experimentation

### 1. Problem definition
In a statement,
> Given details about a person, can we predict whether that person is eligible for a loan?

Business Understanding

> Dream Housing Finance company deals in all kinds of home loans. They have presence across all urban, semi urban and rural areas. Customer first applies for home loan and after that company validates the customer eligibility for loan.

> Company wants to automate the loan eligibility process (real time) based on customer detail provided while filling online application form. These details are Gender, Marital Status, Education, Number of Dependents, Income, Loan Amount, Credit History and others. To automate this process, they have provided a dataset to identify the customers segments that are eligible for loan amount so that they can specifically target these customers.

### 2. Data
The original data came from
> Analytics Vidhya platform
https://datahack.analyticsvidhya.com/contest/practice-problem-loan-prediction-iii/#ProblemStatement

### 3. Evaluation
> According to the problem, if we can reach at least 85% accuracy as a proof of concept, we can pursue the project and improve it.

> *`The problem specifies Accuracy as the evaluation metric`*

> Note: Check whether there are any patterns in the data for our model to learn

### 4. Features

#### Data Dictionary

In [None]:
'''
(Features)
Loan_ID             - Unique Loan ID                                String
Gender              - Gender of person                              Categorical(nominal): Male/ Female
Married             - Applicant married                             Categorical(nominal): Yes/No
Dependents          - Number of people dependent on that person     Categorical(nominal): 0/1/2/3+
Education           - Applicant Education                           Categorical(nominal): Graduate/ Not Graduate
Self_Employed       - Self employed                                 Categorical(nominal): Yes/No
ApplicantIncome     - Applicant income                              Numerical           : in $'s
CoapplicantIncome   - Coapplicant income                            Numerical           : in $'s
LoanAmount          - Loan amount                                   Numerical           : in thousands of $'s
Loan_Amount_Term    - Term of loan                                  Numerical           : in number of months
Credit_History      - Credit history meets guidelines               Categorical         : 1/0
Property_Area       - Borrower's property at stake location         Categorical         : Urban/ Semi Urban/ Rural

(Target)
Loan_Status         - Loan approved (Y/N)


Notes: 
1. Credit history => record of a borrower's responsible repayment of debts
2. A co-applicant refers to a person who applies along with the borrower for a loan. 
   This is done so that the income of the co-applicant can be used to supplement the borrower's income and increase his/her eligibility
3. Having dependents means you have higher commitments, which in turn lower your disposable income.
'''

In [None]:
# Standard imports
import numpy as np
import pandas as pd
from glob import glob
pd.set_option('display.max_columns',500)

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Custom tools
from plotting_helper import *

In [None]:
data_train, data_test = pd.read_csv('../input/loan-prediction-problem-dataset/train_u6lujuX_CVtuZ9i.csv', encoding='UTF-8'), pd.read_csv('../input/loan-prediction-problem-dataset/test_Y3wMUE5_7gLdaTN.csv', encoding='UTF-8')
data_train['which_data'] = 'data_train'
data_test['which_data'] = 'data_test'

In [None]:
# Combine all data to fill/analyse/transform/etc. according to whole dataset(train/test)
data_all = pd.concat([data_train,data_test], axis=0)    # test target col will contain all nan's
data_train.shape, data_test.shape, data_all.shape

In [None]:
data_all.head()

In [None]:
data_all.info()

In [None]:
# Fill missing values with -1 for now so that errors are avoided while casting dtype
data_all.fillna(value=-1, inplace=True)

In [None]:
# Check Max values to alter columns' dtypes accordingly
data_all.max()[['ApplicantIncome','CoapplicantIncome','LoanAmount','Loan_Amount_Term','Credit_History']]

In [None]:
# Reduce unnecessary memory usage
data_all['ApplicantIncome'] = data_all['ApplicantIncome'].astype('int32')
data_all['CoapplicantIncome'] = data_all['CoapplicantIncome'].astype('float32')
data_all[['LoanAmount','Loan_Amount_Term']] = data_all[['LoanAmount','Loan_Amount_Term']].astype('float16')
data_all['Credit_History'] = data_all['Credit_History'].astype('int8')
data_all[['ApplicantIncome','CoapplicantIncome','LoanAmount','Loan_Amount_Term','Credit_History']].dtypes

Reference for the above conversion decision: [Here](https://stackoverflow.com/questions/9696660/what-is-the-difference-between-int-int16-int32-and-int64)
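
The dtype choices above depend on each column's maximum fitting inside the target type. As a minimal sketch (the helper name `fits_in_dtype` and the sample values are hypothetical, not taken from the dataset), the representable range can be checked with NumPy's `iinfo` and `finfo`:

```python
import numpy as np

def fits_in_dtype(max_value, dtype):
    """Return True if max_value lies within the representable range of dtype."""
    dtype = np.dtype(dtype)
    # Integer types expose their limits via iinfo, floating types via finfo
    info = np.iinfo(dtype) if np.issubdtype(dtype, np.integer) else np.finfo(dtype)
    return bool(info.min <= max_value <= info.max)

# Hypothetical column maxima, for illustration only
print(fits_in_dtype(81_000, np.int32))    # wide enough for this value
print(fits_in_dtype(81_000, np.int16))    # too narrow for this value
print(fits_in_dtype(700.0, np.float16))   # within range, but precision is limited

# float16 cannot represent every integer above 2048 exactly
print(float(np.float16(2049)) == 2049.0)
```

Range is only half of the story for `float16`: values stay in range up to 65504, but precision drops quickly, so it is worth confirming that the stored values survive the cast unchanged.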

In [None]:
data_all.replace(-1,np.nan, inplace=True)

In [None]:
# Drop un-necessary cols
data_all.drop(labels=['Loan_ID'], inplace=True, axis=1)

In [None]:
# Is there any duplicate row? If so, it should be removed.
data_train.duplicated().any()

In [None]:
# Dependents should be clean and numeric; check its raw values
data_all['Dependents'].unique()

In [None]:
dependent_dict = {'0':0,'1':1,'2':2,'3+':3}
data_all['Dependents'] = data_all['Dependents'].map(dependent_dict)

--------
## EDA - Exploratory Data Analysis
--------

### Goal
Become a subject matter expert on the dataset

Checklist
- 1. What kind of data is present?
- 2. What features are most important in loan approval/ disapproval predictions ?

### 1. Univariate Analysis

Initial Heuristics: 
1. `Applicant Income`, `Credit History` Status and `Property Area` should be highly related to Loan Approval.
2. Lower dependency should be on number of `dependents` and `loan amount`
3. People who are not `Self Employed` i.e., with jobs should have higher chance of `loan approval` due to job stability and low risks, is this seen?

In [None]:
contineous_features = ['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Loan_Amount_Term',]
categorical_features = ['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed', 'Credit_History'\
                    , 'Property_Area', 'Loan_Status']

In [None]:
# Overall Categorical data distribution
plt.figure(figsize=(20,8))
for index, col in enumerate(categorical_features, start=1):
    plt.subplot(2,4,index)
    plt.title(col)
    plt.pie(data_all[col].value_counts().values,autopct='%1.0f%%', labels=data_all[col].value_counts().index)
plt.tight_layout()

**Distribution bias:**
1. `Gender`: Most people are Male
2. `Married`: Most people are Married
3. `Dependents`: Most people have 0 dependents
4. `Graduate`: Most people are Graduates
5. `Self-Employed`: Most people are not self-employed, i.e., they have jobs
6. `Credit History`: Most people have credit history 1
7. `Loan Status`: Most people get loans. **Our problem is `biased`**

**Near-equal distribution**
8. `Property Area`: Little difference in the number of applicants across the 3 property areas

*Note: Observations are in the context of people applying for loans*
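
Because the target is imbalanced and the metric is accuracy, it helps to know what a model that always predicts the majority class would score. A minimal sketch, using hypothetical toy labels rather than the real `Loan_Status` column (the helper `majority_baseline_accuracy` is illustrative only):

```python
import pandas as pd

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent class."""
    shares = pd.Series(labels).dropna().value_counts(normalize=True)
    return float(shares.max())

# Hypothetical labels, not taken from the dataset; None mimics unlabeled test rows
toy_labels = ['Y', 'Y', 'Y', 'N', 'Y', 'N', 'Y', None]
baseline = majority_baseline_accuracy(toy_labels)
print(f'Majority-class baseline accuracy: {baseline:.2f}')
```

Any trained model should be compared against this baseline, since on a biased target a high accuracy alone says little.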

In [None]:
# Overall continuous data distribution (ideally, should be normal/Gaussian distributed)
plt.figure(figsize=(20,5))
for index, col in enumerate(contineous_features, start=1):
    plt.subplot(1,4,index)
    plt.title(col)
    plt.xlabel('Range')
    plt.ylabel('Values')
    sns.distplot(data_all[col], kde_kws={'bw':0.1})
plt.tight_layout()  # called after the subplots exist so the layout is actually adjusted

**As Expected, finance related values are very skewed (& Not Gaussian)**

In [None]:
# Analysing validity of default mathematical outlier removal techniques
# IQR Ranges
Q1 = data_all[contineous_features].quantile(0.25)
Q3 = data_all[contineous_features].quantile(0.75)
IQR = Q3-Q1

lower_range = Q1 - 1.5*IQR
upper_range = Q3 + 1.5*IQR
iqr_range = pd.DataFrame(pd.concat([lower_range,upper_range], axis=1))
iqr_range.columns = ['IQR Lower','IQR Upper']

# Z-score range
means = data_all[contineous_features].mean()
stds = data_all[contineous_features].std()

lower_zscore = means - 3*stds
upper_zscore = means + 3*stds
z_range = pd.DataFrame(pd.concat([lower_zscore,upper_zscore], axis=1))
z_range.columns =['Z Lower','Z Upper']

# Both Into DataFrame
pd.concat([iqr_range,z_range], axis=1)

Since incomes can never be negative, a log transformation looks like a better option than these outlier-removal ranges: it barely changes the smaller values but compresses the larger ones, so we get a distribution closer to normal.
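
As a rough illustration of that effect, here is a minimal sketch on hypothetical right-skewed incomes (the values are made up for the example). It uses `np.log1p`, i.e. log(1 + x), so that a value of 0 stays defined:

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed incomes, for illustration only
incomes = pd.Series(
    [1500, 2500, 3000, 3800, 4200, 5000, 6000, 9000, 20000, 81000],
    dtype='float64',
)

# log1p compresses large values far more than small ones
log_incomes = np.log1p(incomes)

skew_before = incomes.skew()
skew_after = log_incomes.skew()
print(f'Skewness before: {skew_before:.2f}, after log1p: {skew_after:.2f}')

# The transform is invertible with expm1, so original values can be recovered
recovered = np.expm1(log_incomes)
```

The transformed values are not guaranteed to be normal, only less skewed, so it is still worth plotting them before relying on that assumption.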

### 2. Multivariate Analysis
### 2.1 Feature with Target

Heuristics: 
1. `Credit History` Status and `Property Area` should be highly related to `Loan Approval`.
2. People with a greater number of `Dependents`, or who are `Married`, should have higher chances of `Loan Approval`.
3. People who are not `Self Employed` should get higher chances of `Loan Approval`.

In [None]:
# Looking for patterns between any two features
sns.pairplot(data_all)

Seems like Loan Amount and Applicant Income have a slight linear relationship
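
A visual impression like this can be backed by a correlation coefficient. The sketch below uses a hypothetical toy frame (made-up values, same column names as the notebook) and `Series.corr`; Spearman is included because it is less sensitive to the extreme incomes seen earlier:

```python
import pandas as pd

# Hypothetical values, for illustration only
toy = pd.DataFrame({
    'ApplicantIncome': [2500, 3000, 4000, 5200, 6000, 8000, 12000, 20000],
    'LoanAmount':      [100, 90, 120, 160, 130, 210, 180, 350],
})

# Pearson measures linear association; Spearman measures monotonic (rank) association
pearson_r = toy['ApplicantIncome'].corr(toy['LoanAmount'])
spearman_r = toy['ApplicantIncome'].corr(toy['LoanAmount'], method='spearman')
print(f'Pearson: {pearson_r:.2f}, Spearman: {spearman_r:.2f}')
```

On the real data the same two calls on `data_all` would quantify how "slight" the relationship is; rows with missing values are skipped pairwise by `corr`.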

In [None]:
# Looking for individual features' relation with Loan Status
plt.figure(figsize=(20,10))
for index,col in enumerate(categorical_features, start=1):
    plt.subplot(2,4,index)
    plot_frequency_sns(data=data_all, feature_name=col, hue='Loan_Status', annotate=True, annotate_distance=5, annotate_rotation='horizontal', palette='Blues')
    plt.legend(['Loan Dis-approved','Loan Approved']);
plt.tight_layout()

In [None]:
# color palettes: BuGn_r,Blues,GnBu_d
for index,col in enumerate(categorical_features, start=1):
    data = pd.crosstab(data_all[col],data_all['Loan_Status'])
    data.div(data.sum(axis=1).astype(float), axis=0).plot(kind="bar", stacked=True, figsize=(4,4))

1. `Gender`: Males have a `slightly higher chance` of getting a loan
2. `Married`: Married people have a `higher chance` of getting a loan
3. `Dependents`: <font color='red'>No pattern</font>
4. `Graduate`: Graduates have a `higher chance` of getting a loan
5. `Self-Employed`: People who are not self-employed have a `slightly higher chance` of getting a loan
6. `Credit History`: People with credit history 1 have the `highest chance` of getting a loan
7. `Property Area`: People in Semi-urban areas have the `highest chance` of getting a loan; approval chances follow the order `Semi-urban > Urban > Rural`
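
The approval shares behind observations like these come from row-normalised crosstabs. As a minimal sketch on hypothetical toy data, `pd.crosstab` can do the normalisation directly with `normalize='index'`, which matches dividing each row by its sum as done in the cell above:

```python
import pandas as pd

# Hypothetical records, for illustration only
toy = pd.DataFrame({
    'Credit_History': [1, 1, 1, 1, 0, 0, 1, 0],
    'Loan_Status':    ['Y', 'Y', 'N', 'Y', 'N', 'N', 'Y', 'Y'],
})

# Each row sums to 1: share of approved / rejected loans per credit-history group
approval_rates = pd.crosstab(toy['Credit_History'], toy['Loan_Status'], normalize='index')

# Equivalent manual version, as used in the notebook
counts = pd.crosstab(toy['Credit_History'], toy['Loan_Status'])
manual_rates = counts.div(counts.sum(axis=1), axis=0)

print(approval_rates)
```

Either form works for the stacked bar plots; the `normalize` argument just saves the explicit division.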

Initial Heuristics Analysis: 

<font color='green'> 1. <b>Credit History</b> Status and <b>Property Area</b> should be <b>highly related</b> to Loan Approval.</font>

<font color='red'>2. People with Greater number of <b>Dependents</b></font> or <font color='green'> are <b>Married</b> should have higher chances for Loan Approval.</font>

<font color='green'>3. People who are <b>not Self Employed</b> should get higher chances of Loan Approval.</font>

### 2.2 Feature with Feature/+Target

Initial Heuristics:
1. People who are `Married` should be `Graduated`    
People who are `Married` should mostly not be `Self Employed`; they should take fewer risks      
If a person is `Married`, the couple's `Male` will be the one to register for the `loan`; otherwise the ratio is similar.
2. `Self Employed` might be more `Under Graduates`       
`Self Employed` people applying for loans might be greater in number than those with jobs    
`Self Employed` people would not take much higher loan amounts compared to `Not self employed`

3. People with >0 `Dependents` should be `Married`
4. `Applicant Income` and `Loan amount` should be related
5. `Applicant Income` should be related to `Property Area`
6. Higher `Applicant Income` and greater `Dependents` should have positive credit history
7. `Education` and `Applicant Income` should be related
8. `Self Employed` people might have lower `Applicant income`
9. Some kind of uneven distribution of `property area` according to number of `dependents`

In [None]:
# Looking for patterns by combining features two at a time
plt.figure(figsize=(20,10))
index=1
for cat_col,cat_hue in zip(['Married','Married','Married','Gender','Self_Employed','Self_Employed','Dependents','Property_Area']\
                           ,['Education','Self_Employed','Gender','Dependents','Education',None,'Married','Education']):
    plt.subplot(2,4,index)
    plot_frequency_sns(data=data_all, feature_name=cat_col, hue=cat_hue, annotate=True, annotate_distance=5, annotate_rotation='horizontal', palette='Blues')
    if cat_hue is not None:  # only add a legend when a hue column is used
        plt.legend(data_all[cat_hue].unique())
    index+=1
plt.tight_layout()

In [None]:
print(f'Married-Graduate: {275/72, 485/146}')
print(f'Self_Employed-Graduate: {626/181, 94/25}')
print(f'Dependents-Married: {(((124+146+79)/3)/((124+146+79+36+14+12)/6))*100:.2f} %')
print(273/69, 216/74, 274/75)
626/181, 94/25

In [None]:
sns.scatterplot(x=data_all['Self_Employed'], y=data_all['LoanAmount'])  # keyword x/y: newer seaborn rejects positional vectors
plt.axhline(y=300, c='green', alpha=0.3, linestyle='--');

<font color='red'>1. People who are `Married` should be `Graduated`</font>      
<font color='red'>People who are `Married` should not be `Self Employed`</font>, <font color='green'>Almost 2 times the number of the people seeking loans are married whether they are self employed or not</font>            
<font color='green'>If a person is Married, couple's Male will only register for loan else they have similar ratio.</font>        
<font color='red'>2. `Self Employed` might be more `Under Graduates`</font>     
<font color='red'>`Self Employed` people applying for loans might be greater in number than those with jobs</font>, <font color='green'>In fact, it is the opposite</font>    
<font color='green'>`Self Employed` people would not take much higher loan amounts compared to `Not self employed`</font>       
<font color='green'>3. People with `>0 Dependents` are mostly `Married`</font>


Let's apply credit history filters

In [None]:
data_all[  (data_all['Dependents']==0) & (data_all['Married']=='No') & (data_all['Education']=='Not Graduate')]['Credit_History'].value_counts(normalize=True)*100

In [None]:
data_all[  (data_all['Dependents']==0) & (data_all['Married']=='No') & (data_all['Education']=='Graduate')]['Credit_History'].value_counts(normalize=True)*100

**We can see that, with otherwise identical filters, graduates have a higher share of negative credit history (18%) than non-graduates (11%)**

In [None]:
fig,axs = plt.subplots(nrows=1,ncols=2,figsize=(20,5))

# Direct
data_all.groupby('Dependents')['Property_Area'].value_counts().plot(kind='bar', ax=axs[0]);

# More visual
dic = {}
for x in data_all.groupby('Property_Area')['Dependents']:
    dic[x[0]] = x[1].value_counts()
df = pd.DataFrame(dic)

df.plot(kind='bar', ax=axs[1])
plt.title('Number of dependents from different areas')
plt.xlabel('Number of dependents')
plt.ylabel('No. of people');

In [None]:
d = pd.crosstab(data_all['Credit_History'],data_all['Dependents']).T
dd = d.div(d.sum(axis=1), axis=0)*100
dd

**Most people have `0 dependents`**     
<font color='red'>9. Almost every area has a similar distribution of dependents</font>      
**24% of people with 3+ dependents have a negative credit history**

In [None]:
plt.figure(figsize=(20,4))
plt.subplot(1,2,1)
sns.regplot(x='ApplicantIncome', y='LoanAmount', data=data_all);
plt.axvline(x=22_000, c='green', alpha=0.3, linestyle='--');

plt.subplot(1,2,2)
plt.plot(data_all['LoanAmount'], marker="*", linestyle='')
plt.plot(data_all['ApplicantIncome'], marker=".", linestyle='');

In [None]:
plt.figure(figsize=(20,6))
loan_slice = data_all['LoanAmount'].iloc[50:100]
income_slice = data_all['ApplicantIncome'].iloc[50:100]
plt.plot(loan_slice / loan_slice.max(), marker="*", linestyle='-', label='LoanAmount')
plt.plot(income_slice / income_slice.max(), marker=".", linestyle='--', label='ApplicantIncome')
plt.xlabel('Data Points')
plt.ylabel('Amount (normalised)')
plt.title('LoanAmount vs ApplicantIncome')
plt.legend();

<font color='green'>4. <b>Loan amount</b> gets higher as <b>Applicant Income</b> gets higher, possibly because people running larger businesses need larger loans</font>

In [None]:
plt.figure(figsize=(20,8))
plt.title("Distribution of income according to applicant's property area ")
plt.axhline(y=22000, c='orange', alpha=0.7, linestyle='--')
sns.scatterplot(x=data_all['Property_Area'], y=data_all['ApplicantIncome']);

<font color='green'>5. The <b>Semi-urban</b> property area has a greater number of applicants with a high <b>Applicant Income</b></font>

The higher loan approval rate in semi-urban areas might also be because people living there have higher applicant incomes and hence can repay the loan amount more easily.

In [None]:
plt.figure(figsize=(20,4))
plt.subplot(1,2,1)

sns.scatterplot(x=data_all['Gender'], y=data_all['ApplicantIncome'])
max_female_income = data_all[ data_all['Gender']=='Female' ]['ApplicantIncome'].max()
plt.annotate(max_female_income, xy=(1,-1), xytext=(0.9,max_female_income+3000))  # label passed positionally: the `s=` keyword no longer exists in newer matplotlib
plt.axhline(y=max_female_income, c='orange', alpha=0.7, linestyle='--');

plt.subplot(1,2,2)
sns.scatterplot(x=data_all['Gender'], y=data_all['CoapplicantIncome'])
max_female_co_income = data_all[ data_all['Gender']=='Female' ]['CoapplicantIncome'][284]
plt.annotate(max_female_co_income, xy=(1,-1), xytext=(0.9,max_female_co_income+3000))  # label passed positionally: the `s=` keyword no longer exists in newer matplotlib
plt.axhline(y=max_female_co_income, c='orange', alpha=0.7, linestyle='--');

<font color='green'>`Female` applicants' `Incomes` are generally confined to a narrower range</font>

In [None]:
plt.figure(figsize=(20,4))
plt.subplot(1,2,1)
sns.scatterplot(x='ApplicantIncome', y='Dependents', data=data_all, hue='Credit_History');
plt.axvline(x=13_000, linestyle='--', c='g', alpha=0.3)
plt.axhline(y=2, linestyle='--', c='g', alpha=0.3)
plt.subplot(1,2,2)
plot_frequency_sns(data=data_all, feature_name="Dependents", hue="Credit_History", annotate=True, annotate_distance=3);

In [None]:
423/78, 125/20, 125/25, 63/20

<font color='green'> 6. People with <3 dependents and an income above ~13000 have positive credit history </font>

In [None]:
data_all.groupby('Education')['ApplicantIncome'].max(), data_all.groupby('Self_Employed')['ApplicantIncome'].max()

In [None]:
plt.figure(figsize=(20,4))
plt.subplot(1,2,1)
sns.scatterplot(x=data_all['Education'], y=data_all['ApplicantIncome'])
plt.axhline(18165, linestyle="--", alpha=0.3)

plt.subplot(1,2,2)
sns.scatterplot(x=data_all['Self_Employed'], y=data_all['ApplicantIncome'])
plt.axhline(39147, linestyle="--", alpha=0.3);

In [None]:
data_all.boxplot(column='ApplicantIncome', by = 'Education');
plt.ylabel('Applicant Income');
plt.title("");

<font color='green'> 7. <b>Graduates</b> have a large range of <b>income</b>, whereas non-graduates' incomes are limited </font>       
<font color='green'> 8. People who are <b>not self employed</b> have a large range of <b>income</b>, whereas self-employed people's incomes are limited </font>     
**Judging by the box plots, the typical (median) Applicant Income is about the same whether or not the individual is a Graduate**

## Feature Engineering

In [None]:
data_all.head(2)

In [None]:
# Turn Applicant Income into Categorical Feature
bins=[0,2500,4000,6000,81000]
group=['Low','Average','High', 'Very high']
data_all['Income_bin'] = pd.cut(data_all['ApplicantIncome'],bins,labels=group)

In [None]:
# Check its relation with Loan Status, percentage-wise
data = pd.crosstab(data_all['Income_bin'], data_all['Loan_Status'])
data.div(data.sum(axis=1), axis=0)*100

In [None]:
data.div(data.sum(axis=1), axis=0).plot(kind='bar', stacked=True);
plt.ylabel('Percentage');

It can be inferred that segmenting Applicant Income into bins does not reveal any effect on the chances of loan approval either, which <font color='red'>contradicts our hypothesis that a higher applicant income leads to higher chances of loan approval.</font>

In [None]:
# Let's apply the same concept on Co-applicant Income and form categories
bins=[0,1000,3000,42000] 
group=['Low','Average','High'] 
data_all['Coapplicant_Income_bin']=pd.cut(data_all['CoapplicantIncome'],bins,labels=group)

Coapplicant_Income_bin=pd.crosstab(data_all['Coapplicant_Income_bin'],data_all['Loan_Status']) 
Coapplicant_Income_bin.div(Coapplicant_Income_bin.sum(1).astype(float), axis=0).plot(kind="bar", stacked=True) 
plt.xlabel('CoapplicantIncome') 
plt.ylabel('Percentage');

It shows that if the coapplicant's income is lower, the chances of loan approval are higher. But this does not look right. A possible reason is that most applicants don't have a coapplicant, so their coapplicant income is 0 and loan approval does not depend on it. So we can create a new variable that combines the applicant's and coapplicant's incomes to visualize the combined effect of income on loan approval.
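
One detail worth checking when many values are exactly 0: `pd.cut` uses right-closed intervals by default, so with bins starting at 0 the first interval is `(0, 1000]` and a value of 0 gets no bin at all. A minimal sketch on hypothetical incomes (made-up values, same bins and labels as above) shows the difference `include_lowest=True` makes:

```python
import pandas as pd

# Hypothetical coapplicant incomes, for illustration only
toy_income = pd.Series([0, 500, 2000, 5000])
bins = [0, 1000, 3000, 42000]
labels = ['Low', 'Average', 'High']

# Default: intervals are (0, 1000], (1000, 3000], (3000, 42000], so 0 becomes NaN
default_bins = pd.cut(toy_income, bins, labels=labels)

# include_lowest=True closes the first interval on the left, so 0 lands in 'Low'
inclusive_bins = pd.cut(toy_income, bins, labels=labels, include_lowest=True)

print(pd.DataFrame({'income': toy_income, 'default': default_bins, 'inclusive': inclusive_bins}))
```

Whether zeros should count as 'Low' or be treated as "no coapplicant" is a modelling choice; the point is only that the default silently drops them from the crosstab.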

In [None]:
# Engineer Total Income feature and again make it categorical
data_all['Total_Income']=data_all['ApplicantIncome']+data_all['CoapplicantIncome']

bins=[0,2500,4000,6000,81000]
group=['Low','Average','High', 'Very high'] 
data_all['Total_Income_bin']=pd.cut(data_all['Total_Income'],bins,labels=group)

Total_Income_bin=pd.crosstab(data_all['Total_Income_bin'],data_all['Loan_Status'])
Total_Income_bin.div(Total_Income_bin.sum(1).astype(float), axis=0).plot(kind="bar", stacked=True)
plt.xlabel('Total_Income') 
plt.ylabel('Percentage');

<font color='green'>Finally, we can see that people with a low total income have lower chances of getting a loan than those in the other 3 bins</font>

In [None]:
# Let's feature engineer another column (note: Loan amount is in thousands)
interest_rate=0.08  # defined but not used below: this EMI is simply principal / term, ignoring interest
data_all['EMI'] = data_all.apply(lambda x: ((x['LoanAmount']*1000)/x['Loan_Amount_Term']) ,axis=1)
data_all['Residual_monthly_income'] = (data_all['Total_Income']/12)-(data_all['EMI'])
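
The cell above defines `interest_rate` but computes `EMI` as principal divided by term. For comparison, here is a minimal sketch of the standard amortised-loan payment formula. It assumes the 0.08 is a nominal annual rate compounded monthly, which the notebook does not state; the function name `amortized_emi` is hypothetical:

```python
def amortized_emi(principal, annual_rate, term_months):
    """Fixed monthly payment that repays `principal` over `term_months`.

    Assumes `annual_rate` is a nominal yearly rate compounded monthly.
    """
    if term_months <= 0:
        raise ValueError('term_months must be positive')
    monthly_rate = annual_rate / 12
    if monthly_rate == 0:
        # Without interest the payment is simply principal / term
        return principal / term_months
    growth = (1 + monthly_rate) ** term_months
    return principal * monthly_rate * growth / (growth - 1)

# Hypothetical loan: 128 (thousand) over 360 months at an assumed 8% annual rate
simple_emi = 128 * 1000 / 360
interest_emi = amortized_emi(128 * 1000, 0.08, 360)
print(f'Without interest: {simple_emi:.2f}, with interest: {interest_emi:.2f}')
```

With interest included the monthly payment is always higher than the simple ratio, so residual incomes computed from it would be lower than those in this notebook.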

In [None]:
# See their distributions
plt.figure(figsize=(20,4))
for i,col in enumerate(['Total_Income','EMI','Residual_monthly_income'],start=1):
    plt.subplot(1,3,i)
    sns.distplot(data_all[col], rug=True);

As expected, their distributions are skewed      
**<font color='green'>Interesting: some people also have negative residual monthly incomes</font>**

In [None]:
# Allocate negative status to people with -ve residual income and analyse with Credit History feature
data_all['Redisual_Status'] = data_all['Residual_monthly_income'].apply(lambda x: 0 if x<0 else 1)
d = pd.crosstab(data_all['Redisual_Status'], data_all['Credit_History'])
d.div(d.sum(axis=1), axis=0)*100

<font color='green'> As expected, people whose residual monthly income is negative have a higher share of negative credit history</font>

In [None]:
plot_frequency_sns(data=data_all, feature_name='Redisual_Status', annotate=True, annotate_distance=0, hue='Loan_Status')

**That's interesting: 85 people with negative residual income still get a loan**

In [None]:
data_all[ (data_all['Redisual_Status']==0) & (data_all['Loan_Status']=='Y')]['Credit_History'].value_counts()

<font color='green'> That makes sense if they have Credit History 1, </font><font color='red'>but why did the other 3 people get loans?</font>

In [None]:
data_all[ (data_all['Redisual_Status']==0) & (data_all['Loan_Status']=='Y') & (data_all['Credit_History']==0)]

Hmm... Their Total Income bins are High and Very High, their residual incomes are comparatively much lower than the others', and they are not self-employed

In [None]:
# Check for a pattern between the newly generated features and Loan Status
plt.figure(figsize=(20,4))
for i,col in enumerate(['Total_Income','EMI','Residual_monthly_income'],start=1):
    plt.subplot(1,3,i)
    sns.scatterplot(x=data_all['Loan_Status'], y=data_all[col])
    if i==3:
        plt.axhline(0, linestyle="--", alpha=0.3, c='g');

<font color='red'>No pattern found</font>

In [None]:
data_all[ data_all['EMI']==data_all['EMI'].max() ]

In [None]:
data_all['Loan_Amount_Term'].value_counts()

In [None]:
sns.distplot(data_all['LoanAmount']);

In [None]:
# Add another feature
data_all['Remaining_family_income'] = data_all.apply(lambda x: x['Residual_monthly_income']/x['Dependents'] if x['Dependents']!=0 else x['Residual_monthly_income'],axis=1)
sns.scatterplot(x=data_all['Loan_Status'], y=data_all['Remaining_family_income'])
plt.axhline(0, linestyle="--", alpha=0.3, c='g');

<font color='red'> We would expect more of the people with a Remaining Family Income below 0 to fall under Loan Status N</font>

In [None]:
# data_all['Loan_Status'] = data_all['Loan_Status'].map({'Y':1,'N':0})

In [None]:
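# Loan_Status holds 'Y'/'N' strings, so encode it as 1/0 (without modifying data_all) before correlating
sns.heatmap(data_all[['Dependents','Residual_monthly_income','Remaining_family_income','Credit_History','Loan_Status']].assign(Loan_Status=data_all['Loan_Status'].map({'Y':1,'N':0})).corr(), annot=True, cmap='YlGnBu');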

We have already analysed the relations with Credit History and Loan Status shown in this heatmap. Remaining Family Income and Residual Monthly Income are strongly correlated because one is derived from the other.

In [None]:
data_all['Safe_Applicant'] = data_all.apply(lambda x: 1 if (x['Education']=='Graduate' and x['Dependents']>0 and x['Married']=='Yes' and x['Self_Employed']=='No' and (x['Income_bin']=='High' or x['Income_bin']=='Very high')) else 0, axis=1)
plot_frequency_sns(data=data_all, feature_name='Safe_Applicant', annotate=True,annotate_distance=-4, hue='Loan_Status')

d = pd.crosstab(data_all['Safe_Applicant'], data_all['Loan_Status'])
dd = d.div(d.sum(axis=1), axis=0)*100
dd.plot.bar(stacked=True);

The safe applicant column does not have a major significance over Loan Status/Credit History