#### Srinivasaragavan Vijayaraghavan 



## Introduction
This case study aims to give you an idea of applying EDA in a real business scenario. In this case study, apart from applying the techniques that you have learnt in the EDA module, you will also develop a basic understanding of risk analytics in banking and financial services and understand how data is used to minimise the risk of losing money while lending to customers.

 

## Business Understanding

Loan providers find it hard to lend to people with insufficient or non-existent credit history, and some consumers take advantage of this by defaulting. Suppose you work for a consumer finance company that specialises in lending various types of loans to urban customers. You have to use EDA to analyse the patterns present in the data, so that applicants capable of repaying the loan are not rejected.

 

When the company receives a loan application, it has to decide whether to approve the loan based on the applicant’s profile. Two types of risk are associated with this decision:

If the applicant is likely to repay the loan, then not approving the loan results in a loss of business to the company

If the applicant is not likely to repay the loan, i.e. he/she is likely to default, then approving the loan may lead to a financial loss for the company.

 

The data given below contains the information about the loan application at the time of applying for the loan. It contains two types of scenarios:

The client with payment difficulties: he/she had late payment more than X days on at least one of the first Y instalments of the loan in our sample,

All other cases: All other cases when the payment is paid on time.

 

 

When a client applies for a loan, there are four types of decision that could be taken by the client or the company:

Approved: The company has approved the loan application.

Cancelled: The client cancelled the application sometime during approval. Either the client changed their mind about the loan, or in some cases the client was offered worse pricing because of a higher risk and did not want it.

Refused: The company rejected the loan (because the client does not meet its requirements, etc.).

Unused offer: The loan has been cancelled by the client, but at a different stage of the process.

In this case study, you will use EDA to understand how consumer attributes and loan attributes influence the tendency of default.

In [None]:
#Import the required Libraries.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import date, timedelta
pd.set_option('display.max_rows', 1000)
pd.set_option('display.max_columns', 1000)
pd.set_option('display.width', 5000)


### Reading the Dataset 

In [None]:

#Read the data in a dataframe
inp1= pd.read_csv(r"../input/loan-defaulter/previous_application.csv")
inp2= pd.read_csv(r"../input/loan-defaulter/application_data.csv")
cols_data= pd.read_csv(r"../input/loan-defaulter/columns_description.csv",encoding= 'unicode_escape')

### Let's see the Columns description first.

In [None]:
cols_data.head()

In [None]:
cols_data.rename(columns = {'ï»¿':'Serial No'}, inplace = True) 
cols_data.set_index('Serial No')


#### Now we have the descriptions for the columns in each dataset. Time to explore the data and draw insights.

### Exploring the Application Data (the current data)

In [None]:
# To know the Shape of the Dataset we are going to explore
inp2.shape

In [None]:
#Let's now see the columns we have, their datatypes and non-null counts
inp2.info()

In [None]:
#View sample data to see what the data set looks like
inp2.head(10)

In [None]:
inp2.describe()

In [None]:
# Cleaning the data 
# Explore for null values 
nullcolumns=inp2.isnull().sum()
nullcolumns

In [None]:
# Compute the null count per column, then keep the columns
# in which more than 50% of the rows are null
nullcolumns=inp2.isnull().sum()
nullcolumns=nullcolumns[nullcolumns.values>(0.5*len(inp2))]
nullcolumns

In [None]:
#Drop the columns identified above

nullcolumns = list(nullcolumns.index)
inp2.drop(labels=nullcolumns,axis=1,inplace=True)
print(len(nullcolumns))
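
The column filter above can also be expressed through the fraction of missing values per column. The sketch below runs on a small made-up frame (the column names are illustrative and not part of the loan data) and assumes a 50% threshold.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the application data; column names are illustrative only.
toy = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 4.0],
    "half_missing": [1.0, np.nan, 3.0, np.nan],
    "complete": [1, 2, 3, 4],
})

# isnull().mean() gives the fraction of missing values per column (0.0 to 1.0).
null_fraction = toy.isnull().mean()

# Select the columns whose missing fraction is strictly above 50% and drop them.
cols_to_drop = null_fraction[null_fraction > 0.5].index.tolist()
cleaned = toy.drop(columns=cols_to_drop)

print(null_fraction)
print(cols_to_drop)
```

Working with fractions rather than raw counts keeps the threshold independent of the number of rows.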

In [None]:
#Check the percentage of null values per column again after dropping columns

print((100*(inp2.isnull().sum()/len(inp2))))

We can see that the AMT_ANNUITY column has only a few null values, so it will be imputed.

In [None]:
#Box plot to check for outliers in AMT_ANNUITY
sns.boxplot(x=inp2.AMT_ANNUITY)
# Save before plt.show(); saving afterwards can write out an empty figure
plt.savefig('sample.jpg')
plt.show()

In [None]:
#Plot the distribution of AMT_CREDIT to see skew and outliers
# (histplot replaces the deprecated distplot)
sns.histplot(inp2.AMT_CREDIT, kde=True)
plt.show()

In [None]:
#Plot to see outliers in AMT_INCOME_TOTAL 
plt.figure(figsize=[8,2])
sns.boxplot(x=inp2.AMT_INCOME_TOTAL)
plt.show()

# Box plot with Seaborn
bplot=sns.boxplot(x=inp2.AMT_INCOME_TOTAL, 
                 width=0.5,
                 palette="colorblind")
 
# Overlay the individual points as a strip plot
bplot=sns.stripplot(x=inp2.AMT_INCOME_TOTAL,  
                   jitter=True, 
                   marker='o', 
                   alpha=0.5,
                   color='black')


#### The box plot shows that the column has many outliers, so the missing values will be imputed with the median rather than the mean.

In [None]:
#To find the median for the field AMT_ANNUITY
values=inp2['AMT_ANNUITY'].median()

values

In [None]:
# Fill the above value 24903 for all the missing values in AMT_ANNUITY
inp2.loc[inp2['AMT_ANNUITY'].isnull(),'AMT_ANNUITY']=values
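
To illustrate why the median is preferred when outliers are present, here is a minimal sketch on made-up values (not taken from the dataset). A single extreme value moves the mean a long way but leaves the median untouched.

```python
import numpy as np
import pandas as pd

# Illustrative annuity-like values with one extreme outlier and one missing entry.
s = pd.Series([20000.0, 22000.0, 25000.0, 27000.0, 1000000.0, np.nan])

mean_value = s.mean()      # pulled upward by the outlier
median_value = s.median()  # unaffected by the size of the outlier

# Impute the missing entry with the median.
filled = s.fillna(median_value)

print(mean_value, median_value)
```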

In [None]:
#Check the percentage of null values again after imputing AMT_ANNUITY

print((100*(inp2.isnull().sum()/len(inp2))))

In [None]:

# Removing rows in which 50% or more of the columns are null

nullrows=inp2.isnull().sum(axis=1)
nullrows=list(nullrows[nullrows.values>=0.5*inp2.shape[1]].index)
inp2.drop(labels=nullrows,axis=0,inplace=True)
print(len(nullrows))

In [None]:
#To check the datatype of all the columns 
inp2.dtypes

In [None]:
# We will remove unwanted columns from this dataset

unwanted=['FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE',
       'FLAG_PHONE', 'FLAG_EMAIL','REGION_RATING_CLIENT','REGION_RATING_CLIENT_W_CITY','CNT_FAM_MEMBERS',
       'DAYS_LAST_PHONE_CHANGE', 'FLAG_DOCUMENT_2', 'FLAG_DOCUMENT_3','FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_6',
       'FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_8', 'FLAG_DOCUMENT_9','FLAG_DOCUMENT_10', 'FLAG_DOCUMENT_11', 'FLAG_DOCUMENT_12',
       'FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_14', 'FLAG_DOCUMENT_15','FLAG_DOCUMENT_16', 'FLAG_DOCUMENT_17', 'FLAG_DOCUMENT_18',
       'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20', 'FLAG_DOCUMENT_21']

inp2.drop(labels=unwanted,axis=1,inplace=True)

In [None]:
inp2.info()

In [None]:
#view Sample data frame 
inp2.head(10)

From the sample rows above, we can see that the DAYS columns hold negative values and that the organization type column contains the value XNA. So the next step is to convert the negative values in the DAYS columns to positive and to check the variables for XNA.

In [None]:
#To handle -ve values in the DAYS columns in the inp2 Dataframe
days_cols = ['DAYS_BIRTH', 'DAYS_EMPLOYED','DAYS_REGISTRATION','DAYS_ID_PUBLISH']
inp2[days_cols] = inp2[days_cols].abs()
inp2.head(10)
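
Once the day counts are positive they are easier to read as years. The sketch below assumes that DAYS_BIRTH counts days before the application date. The `AGE_YEARS` column is a hypothetical helper that does not appear in the notebook, and the sample values are made up.

```python
import pandas as pd

# Made-up day counts relative to the application date (negative = in the past).
toy = pd.DataFrame({"DAYS_BIRTH": [-9461, -16765, -19046]})

# Convert to positive day counts.
toy["DAYS_BIRTH"] = toy["DAYS_BIRTH"].abs()

# Hypothetical helper column: whole years, using 365 days per year as an approximation.
toy["AGE_YEARS"] = toy["DAYS_BIRTH"] // 365

print(toy)
```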

In [None]:
inpsample=inp2.head(10)

In [None]:
#Add CurrentDate value to Data frame inp2
inp2['Current_date'] = pd.to_datetime('today',utc=False)
inp2['Current_date'] = inp2['Current_date'].dt.date

In [None]:
inp2.head()

### Handling placeholder values (XNA, i.e. not available) in the inp2 Dataframe with suitable techniques


In [None]:
#Categorical columns having these 'XNA' values
    
#CODE_GENDER 
inp2[inp2['CODE_GENDER']=='XNA'].shape

In [None]:
# Organization column

inp2[inp2['ORGANIZATION_TYPE']=='XNA'].shape

So, there are 4 rows with XNA in the gender column and 55374 rows with XNA in the organization type column.

In [None]:
# Describing the Gender column to check the number of females and males

inp2['CODE_GENDER'].value_counts()

Since females are the majority and only 4 rows have XNA values, we can update those rows with gender 'F'.

In [None]:
# Updating the column 'CODE_GENDER' with "F" for the dataset

inp2.loc[inp2['CODE_GENDER']=='XNA','CODE_GENDER']='F'
inp2['CODE_GENDER'].value_counts()
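
The same idea can be written without hard-coding 'F', by treating the placeholder as missing and filling with the most frequent value. This is a minimal sketch on made-up values, offered as an alternative rather than as the method used above.

```python
import numpy as np
import pandas as pd

# Made-up gender codes with one 'XNA' placeholder.
gender = pd.Series(["F", "M", "F", "XNA", "F", "M"])

# Treat the placeholder as a missing value.
gender = gender.replace("XNA", np.nan)

# mode() returns the most frequent value(s); take the first one.
most_common = gender.mode()[0]
gender = gender.fillna(most_common)

print(gender.value_counts())
```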

In [None]:
# Describing the organization type column

inp2['ORGANIZATION_TYPE'].describe()

The column has a total of 307511 rows, of which 55374 have the value 'XNA', i.e. about 18% of the column. Dropping these 55374 rows is therefore not expected to have a major impact on our dataset.

In [None]:
# Hence, dropping the 55374 rows that have 'XNA' in the organization type column

inp2=inp2.drop(inp2.loc[inp2['ORGANIZATION_TYPE']=='XNA'].index)
inp2[inp2['ORGANIZATION_TYPE']=='XNA'].shape

In [None]:

# Casting the listed variables to numeric in the dataset

numeric_columns=['TARGET','CNT_CHILDREN','AMT_INCOME_TOTAL','AMT_CREDIT','AMT_ANNUITY','REGION_POPULATION_RELATIVE','DAYS_BIRTH',
                'DAYS_EMPLOYED','DAYS_REGISTRATION','DAYS_ID_PUBLISH','HOUR_APPR_PROCESS_START','LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY',
       'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY']

inp2[numeric_columns]=inp2[numeric_columns].apply(pd.to_numeric)

inp2.head()

In [None]:
inp2.info()

### To Create Bins for AMT_INCOME_TOTAL ,AMT_CREDIT,AMT_ANNUITY


In [None]:
#These Bins are created to explore insights by cutting the amounts into specific class intervals 
#Creating bins for income amount

bins = [0,25000,50000,75000,100000,125000,150000,175000,200000,225000,250000,275000,300000,325000,350000,375000,400000,425000,450000,475000,500000,10000000000]
slot = ['0-25000', '25000-50000','50000-75000','75000-100000','100000-125000', '125000-150000', '150000-175000','175000-200000',
       '200000-225000','225000-250000','250000-275000','275000-300000','300000-325000','325000-350000','350000-375000',
       '375000-400000','400000-425000','425000-450000','450000-475000','475000-500000','500000 and above']

inp2['AMT_INCOME_RANGE']=pd.cut(inp2['AMT_INCOME_TOTAL'],bins,labels=slot)

In [None]:
# Creating bins for Credit amount

bins = [0,150000,200000,250000,300000,350000,400000,450000,500000,550000,600000,650000,700000,750000,800000,850000,900000,1000000000]
slots = ['0-150000', '150000-200000','200000-250000', '250000-300000', '300000-350000', '350000-400000','400000-450000',
        '450000-500000','500000-550000','550000-600000','600000-650000','650000-700000','700000-750000','750000-800000',
        '800000-850000','850000-900000','900000 and above']

inp2['AMT_CREDIT_RANGE']=pd.cut(inp2['AMT_CREDIT'],bins=bins,labels=slots)
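
Hand-typed labels can drift out of sync with the bin edges. One way to avoid that is to derive the labels from the edges. The sketch below uses a short, made-up list of edges, not the bins defined above.

```python
import pandas as pd

# Illustrative edges: 0, 25000, ..., 100000, plus an open-ended top bin.
edges = list(range(0, 100001, 25000)) + [float("inf")]

# Build "lo-hi" labels for the closed bins and a final "X and above" label.
labels = [f"{lo}-{hi}" for lo, hi in zip(edges[:-2], edges[1:-1])]
labels.append(f"{edges[-2]} and above")

income = pd.Series([18000, 25000, 60000, 99999, 450000])

# pd.cut uses right-closed intervals by default, so 25000 falls into '0-25000'.
income_range = pd.cut(income, bins=edges, labels=labels)

print(income_range.tolist())
```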

In [None]:
# Dividing the dataset into two dataset of  target=1(client with payment difficulties) and target=0(all other)

target0=inp2.loc[inp2["TARGET"]==0]
target1=inp2.loc[inp2["TARGET"]==1]

In [None]:
target0.head(10)

In [None]:
target1.head(10)

In [None]:

# Calculating the imbalance ratio
    
# The majority class is target0 and the minority class is target1

round(len(target0)/len(target1),3)

##### The imbalance ratio is about 10.055, i.e. roughly ten clients without payment difficulties for every client with payment difficulties
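
As a small illustration of the ratio, the sketch below computes it with `value_counts` on a made-up target column. The counts are invented for the example and do not reflect the dataset.

```python
import pandas as pd

# Made-up target column: 20 clients without difficulties (0), 2 with difficulties (1).
target = pd.Series([0] * 20 + [1] * 2)

counts = target.value_counts()

# Majority-to-minority ratio, and the share of the minority class.
imbalance_ratio = counts.loc[0] / counts.loc[1]
minority_share = target.mean()

print(imbalance_ratio, round(minority_share, 3))
```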

## Univariate analysis for categories for Target - 0 (clients with no payment difficulties)

In [None]:
# Count plotting in logarithmic scale

def uniplot(df,col,title,hue =None):
    
    sns.set_style('whitegrid')
    sns.set_context('talk')
    plt.rcParams["axes.labelsize"] = 20
    plt.rcParams['axes.titlesize'] = 22
    plt.rcParams['axes.titlepad'] = 30
    
    
    temp = pd.Series(data = hue)
    fig, ax = plt.subplots()
    width = len(df[col].unique()) + 7 + 4*len(temp.unique())
    fig.set_size_inches(width , 8)
    plt.xticks(rotation=45)
    plt.yscale('log')
    plt.title(title)
    ax = sns.countplot(data = df, x= col, order=df[col].value_counts().index,hue = hue,palette='magma') 
        
    plt.show()

In [None]:
# PLotting for income range

uniplot(target0,col='AMT_INCOME_RANGE',title='Distribution of income range',hue='CODE_GENDER')

Inferences and insights from the above plot:

Female counts are higher than male counts.

The income range from 100000 to 200000 has the highest number of credits.

Within that range, more females than males have credits.

Counts are very low for the income range 400000 and above.

In [None]:
# Plotting for Income type

uniplot(target0,col='NAME_INCOME_TYPE',title='Distribution of Income type',hue='CODE_GENDER')

Insights derived from the above plot:

For income types ‘working’, ‘commercial associate’ and ‘State Servant’ the number of credits is higher than for the others.

For these types, females have more credits than males.

There are fewer credits for income types ‘student’, ‘pensioner’, ‘Businessman’ and ‘Maternity leave’.

In [None]:

# Plotting for Contract type

uniplot(target0,col='NAME_CONTRACT_TYPE',title='Distribution of contract type',hue='CODE_GENDER')

Insights derived from the above plot:

Contract type ‘cash loans’ has a higher number of credits than contract type ‘Revolving loans’.

Females lead in applying for credits.

In [None]:

# Plotting for Organization type in logarithmic scale

sns.set_style('whitegrid')
sns.set_context('talk')
plt.figure(figsize=(15,30))
plt.rcParams["axes.labelsize"] = 20
plt.rcParams['axes.titlesize'] = 22
plt.rcParams['axes.titlepad'] = 30

plt.title("Distribution of Organization type for target - 0")

plt.xticks(rotation=90)
plt.xscale('log')

sns.countplot(data=target0,y='ORGANIZATION_TYPE',order=target0['ORGANIZATION_TYPE'].value_counts().index,palette='cool')

plt.show()


Insights inferred from the above plot:

Most clients who applied for credits are from the organization types ‘Business entity Type 3’, ‘Self employed’, ‘Other’, ‘Medicine’ and ‘Government’.

Fewer clients are from Industry type 8, type 6, type 10, Religion, and Trade type 5, type 4.

## Categorical Univariate Analysis in logarithmic scale for Target - 1 (clients with payment difficulties)

In [None]:
# PLotting for income range

uniplot(target1,col='AMT_INCOME_RANGE',title='Distribution of income range',hue='CODE_GENDER')

Points to be concluded from the above graph:

Male counts are higher than female counts.

The income range from 100000 to 200000 has the highest number of credits.

Within that range, more males than females have credits.

Counts are very low for the income range 400000 and above.

In [None]:
# Plotting for Income type

uniplot(target1,col='NAME_INCOME_TYPE',title='Distribution of Income type',hue='CODE_GENDER')


Points to be concluded from the above graph:

For income types ‘working’, ‘commercial associate’ and ‘State Servant’ the number of credits is higher than for the other type, ‘Maternity leave’.

For these types, females have more credits than males.

There are fewer credits for income type ‘Maternity leave’.

For target 1 there are no clients with income type ‘student’, ‘pensioner’ or ‘Businessman’, which means none of them made late payments in this data.

In [None]:
# Plotting for Contract type

uniplot(target1,col='NAME_CONTRACT_TYPE',title='Distribution of contract type',hue='CODE_GENDER')

Points to be concluded from the above graph:

Contract type ‘cash loans’ has a higher number of credits than contract type ‘Revolving loans’.

Here too, females lead in applying for credits.

For target 1, Revolving loans are held only by females.

In [None]:
# Plotting for Organization type

sns.set_style('whitegrid')
sns.set_context('talk')
plt.figure(figsize=(15,30))
plt.rcParams["axes.labelsize"] = 20
plt.rcParams['axes.titlesize'] = 22
plt.rcParams['axes.titlepad'] = 30

plt.title("Distribution of Organization type for target - 1")

plt.xticks(rotation=90)
plt.xscale('log')

sns.countplot(data=target1,y='ORGANIZATION_TYPE',order=target1['ORGANIZATION_TYPE'].value_counts().index,palette='cool')

plt.show()

Insights inferred from the above plot:

Most clients who applied for credits are from the organization types ‘Business entity Type 3’, ‘Self employed’ and ‘Other’.

Fewer clients are from Industry type 4, type 8, type 5, Religion, and Trade type 10, type 6.

In [None]:
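# Finding correlations between the numerical columns for both target 0 and 1
# (the first two columns, SK_ID_CURR and TARGET, are skipped)

target0_corr=target0.iloc[0:,2:]
target1_corr=target1.iloc[0:,2:]

# numeric_only=True (pandas >= 1.5) skips the non-numeric columns, which otherwise raise an error in recent pandas
target0cr=target0_corr.corr(method='spearman',numeric_only=True)
target1cr=target1_corr.corr(method='spearman',numeric_only=True)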
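
Spearman correlation works on ranks, so it captures any monotonic relationship, while Pearson measures only linear association. The sketch below contrasts the two on made-up data. It assumes pandas 1.5 or later for the `numeric_only` argument.

```python
import pandas as pd

toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [1, 8, 27, 64, 125],             # y = x**3: monotonic but not linear
    "label": ["a", "b", "c", "d", "e"],   # non-numeric column, skipped below
})

# numeric_only=True leaves the string column out of the correlation matrix.
pearson = toy.corr(method="pearson", numeric_only=True).loc["x", "y"]
spearman = toy.corr(method="spearman", numeric_only=True).loc["x", "y"]

print(round(pearson, 3), round(spearman, 3))
```

Because the ranks of `x` and `y` agree perfectly, the Spearman value reaches 1 while the Pearson value stays below it.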


In [None]:
# Correlation for target 0

target0cr

In [None]:
# Correlation for target 1

target1cr

In [None]:
# Plot the correlations above as a heat map, which suits visualising a matrix

def targets_corr(data,title):
    # figure size
    plt.figure(figsize=(15, 10))
    plt.rcParams['axes.titlesize'] = 25
    plt.rcParams['axes.titlepad'] = 70

    # heatmap with a color map of choice
    sns.heatmap(data, cmap="RdYlGn",annot=False)

    plt.title(title)
    plt.yticks(rotation=0)
    plt.show()



In [None]:
# For Target 0

targets_corr(data=target0cr,title='Correlation for target 0')

The correlation heatmap above suggests a number of observations:

Credit amount is negatively correlated with DAYS_BIRTH, which means the credit amount is higher for younger clients and vice versa.

Credit amount is negatively correlated with the number of children, which means the credit amount is higher for clients with fewer children and vice versa.

Income amount is negatively correlated with the number of children, which means income is higher for clients with fewer children and vice versa.

Clients in densely populated areas have fewer children.

Credit amount is higher in densely populated areas.

Income is also higher in densely populated areas.

In [None]:
# For Target 1

targets_corr(data=target1cr,title='Correlation for target 1')

The heat map for Target 1 shows much the same observations as for Target 0, with a few differences, listed below.

Clients whose permanent address does not match their contact address have fewer children, and vice versa.

Clients whose permanent address does not match their work address have fewer children, and vice versa.
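
Reading individual pairs off a large heat map is error-prone. One option is to list the pairs in order of absolute correlation. The sketch below does this for a small made-up correlation matrix; the variable names `A`, `B` and `C` are placeholders.

```python
import numpy as np
import pandas as pd

# Made-up symmetric correlation matrix.
corr = pd.DataFrame(
    [[1.0, 0.8, -0.1],
     [0.8, 1.0, 0.3],
     [-0.1, 0.3, 1.0]],
    index=["A", "B", "C"], columns=["A", "B", "C"],
)

# Indices of the strict upper triangle: each pair once, diagonal excluded.
rows, cols = np.triu_indices(len(corr), k=1)

pairs = pd.DataFrame({
    "var_1": corr.index[rows],
    "var_2": corr.columns[cols],
    "corr": corr.to_numpy()[rows, cols],
})

# Sort by absolute correlation, strongest first.
pairs = pairs.sort_values("corr", key=lambda s: s.abs(), ascending=False)

print(pairs)
```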

### Univariate Analysis for both the targets to explore insights 

In [None]:


# Box plotting for univariate variables analysis in logarithmic scale

def univariate_numerical(data,col,title):
    sns.set_style('whitegrid')
    sns.set_context('talk')
    plt.rcParams["axes.labelsize"] = 20
    plt.rcParams['axes.titlesize'] = 22
    plt.rcParams['axes.titlepad'] = 30
    
    plt.title(title)
    plt.yscale('log')
    # Plot the dataframe passed in (not always target0), with the values on the log-scaled y axis
    sns.boxplot(data=data, y=col)
    plt.show()

### For Target-0  Univariate Analysis

In [None]:

# Distribution of income amount

univariate_numerical(data=target0,col='AMT_INCOME_TOTAL',title='Distribution of income amount')

A few points can be concluded from the graph above.

Some outliers are noticed in income amount.

The third quartile is very slim for income amount.
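
The outliers seen in a box plot can also be counted numerically. The sketch below applies the common 1.5 × IQR rule, which is the default whisker rule in seaborn box plots, to made-up income values.

```python
import pandas as pd

# Made-up income values with one extreme entry.
income = pd.Series([90000, 112500, 135000, 157500, 180000, 202500, 1170000])

# First and third quartiles, and the interquartile range.
q1, q3 = income.quantile([0.25, 0.75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are flagged as outliers.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = income[(income < lower) | (income > upper)]

print(q1, q3, outliers.tolist())
```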

In [None]:
# Distribution of credit amount

univariate_numerical(data=target0,col='AMT_CREDIT',title='Distribution of credit amount')

In [None]:
#Plot to see outliers in AMT_CREDIT 
plt.figure(figsize=[8,2])
sns.boxplot(x=inp2.AMT_CREDIT)
plt.show()

# Box plot with Seaborn
bplot=sns.boxplot(x=inp2.AMT_CREDIT, 
                 width=0.5,
                 palette="colorblind")
 
# Overlay the individual points as a strip plot
bplot=sns.stripplot(x=inp2.AMT_CREDIT,  
                   jitter=True, 
                   marker='o', 
                   alpha=0.5,
                   color='black')


A few points can be concluded from the graph above.

Some outliers are noticed in credit amount.

The lower section of the box (first quartile to median) is larger than the upper section (median to third quartile) for credit amount, which suggests that most clients' credit amounts lie toward the lower end.

In [None]:
# Distribution of annuity amount

univariate_numerical(data=target0,col='AMT_ANNUITY',title='Distribution of Annuity amount')


A few points can be concluded from the graph above.


The lower section of the box (first quartile to median) is larger than the upper section (median to third quartile) for annuity amount, which suggests that most clients' annuity amounts lie toward the lower end.

### For Target-1  Univariate Analysis

In [None]:
# Distribution of income amount

univariate_numerical(data=target1,col='AMT_INCOME_TOTAL',title='Distribution of income amount')

A few points can be concluded from the graph above.


Some outliers are noticed in income amount.

The third quartile is very slim for income amount.

Most clients' incomes lie toward the lower end, around the first quartile.


In [None]:
# Distribution of credit amount

univariate_numerical(data=target1,col='AMT_CREDIT',title='Distribution of credit amount')

A few points can be concluded from the graph above.

Some outliers are noticed in credit amount.

The lower section of the box (first quartile to median) is larger than the upper section (median to third quartile) for credit amount, which suggests that most clients' credit amounts lie toward the lower end.

In [None]:
# Distribution of Annuity amount

univariate_numerical(data=target1,col='AMT_ANNUITY',title='Distribution of Annuity amount')


A few points can be concluded from the graph above.


Some outliers are noticed in annuity amount.

The lower section of the box (first quartile to median) is larger than the upper section (median to third quartile) for annuity amount, which suggests that most clients' annuity amounts lie toward the lower end.

### Bivariate analysis for Target 0 

In [None]:
# Box plotting for Credit amount

plt.figure(figsize=(16,12))
plt.xticks(rotation=45)
sns.boxplot(data =target0, x='NAME_EDUCATION_TYPE',y='AMT_CREDIT', hue ='NAME_FAMILY_STATUS',orient='v')
plt.title('Credit amount vs Education Status')
plt.show()

From the above box plot we can conclude that, among clients with an Academic degree, the family statuses 'civil marriage', 'marriage' and 'separated' have higher credit amounts than the others. Also, for Higher education, the family statuses 'marriage', 'single' and 'civil marriage' have more outliers. For civil marriage with an Academic degree, most of the credits lie in the third quartile.

In [None]:
# Box plotting for Income amount in logarithmic scale

plt.figure(figsize=(16,12))
plt.xticks(rotation=45)
plt.yscale('log')
sns.boxplot(data =target0, x='NAME_EDUCATION_TYPE',y='AMT_INCOME_TOTAL', hue ='NAME_FAMILY_STATUS',orient='v')
plt.title('Income amount vs Education Status')
plt.show()

From the above box plot, for education type 'Higher education' the income amount is roughly equal across family statuses, and it contains many outliers. Academic degree has fewer outliers, but its income amount is a little higher than that of Higher education. Lower secondary with civil marriage family status has a lower income amount than the others.

### Bivariate analysis for Target 1

In [None]:
# Box plotting for credit amount

plt.figure(figsize=(16,12))
plt.xticks(rotation=45)
sns.boxplot(data =target1, x='NAME_EDUCATION_TYPE',y='AMT_CREDIT', hue ='NAME_FAMILY_STATUS',orient='v')
plt.title('Credit Amount vs Education Status')
plt.show()

There are differences here between Target 0 and Target 1.

Among clients with an academic degree, only those with Married family status have higher credit than the other categories.

We can see many outliers in the other categories, such as Secondary, Incomplete higher, Higher education and Lower secondary.

Separated clients with a higher education background have high credits, as their third quartile is larger than that of the other categories and their counterparts.

In [None]:
# Box plotting for Income amount in logarithmic scale

plt.figure(figsize=(16,12))
plt.xticks(rotation=45)
plt.yscale('log')
sns.boxplot(data =target1, x='NAME_EDUCATION_TYPE',y='AMT_INCOME_TOTAL', hue ='NAME_FAMILY_STATUS',orient='v')
plt.title('Income amount vs Education Status')
plt.show()

From the above plot:

We can see that married customers with an Academic degree are the ones with higher income than all other categories.

We also could not spot any outliers for Academic degree.

Most customers with Lower secondary education, across family statuses, have low income compared to all others.

Widowed customers with Incomplete higher education have the lowest income of all.



# Analysis of Previous application Data 

In [None]:
#Reading the Data Dictionary for columns in previous application Data 
cols_data.rename(columns = {'Unnamed: 0':'Serial No'}, inplace = True) 
cols_data.set_index('Serial No')


In [None]:
#To check the shape of Previous application Dataframe 

inp1.shape

In [None]:
# Check for the columns in the inp1 Dataframe (Previous application)- henceforth called as inp1
inp1.columns

In [None]:
# Check the column stats
inp1.info()

In [None]:
# Describe to see the mean and min and max value in dataframe inp1
inp1.describe()

In [None]:
# Check for sample data from inp1

inp1.head(10)

From the data we can see that there are null values in the dataframe, as well as placeholder values such as XNA and XAP. These are handled further below.

In [None]:
# Cleaning the missing data

# Count the columns in which more than 30% of the rows are null

emptycol1=inp1.isnull().sum()
emptycol1=emptycol1[emptycol1.values>(0.3*len(inp1))]
len(emptycol1)

In [None]:
# Cleaning the data 
# Explore for null values 
nullcolumns=inp1.isnull().sum()
nullcolumns

In [None]:
#Check the percentage of null values in the columns of inp1 dataframe 
round(inp1.isnull().sum()/len(inp1)*100,2)

In [None]:
##To find the columns having more than 50% null values 
emptynullcol=inp1.isnull().sum()
emptynullcol=emptynullcol[emptynullcol.values>(0.5*len(inp1))]
emptynullcol

In [None]:
#Drop the columns identified above

emptynullcol = list(emptynullcol.index)
inp1.drop(labels=emptynullcol,axis=1,inplace=True)
print(len(emptynullcol))

In [None]:
#Check the percentage of null values in the columns of inp1 dataframe after dropping the columns with more than 50% null
round(inp1.isnull().sum()/len(inp1)*100,2)

With the mostly-null columns dropped, the next step is to inspect placeholder values and to correct the data and datatypes of a few columns.

In [None]:
#Check for XNA and XAP in column NAME_CASH_LOAN_PURPOSE
inp1.NAME_CASH_LOAN_PURPOSE.value_counts()

We found XAP (922661 rows) and XNA (67791 rows). These rows have to be deleted, as together they make up more than 30% of the data.

In [None]:
# Removing the rows with the values 'XNA' and 'XAP'

inp1=inp1.drop(inp1[inp1['NAME_CASH_LOAN_PURPOSE']=='XNA'].index)
inp1=inp1.drop(inp1[inp1['NAME_CASH_LOAN_PURPOSE']=='XAP'].index)

inp1.NAME_CASH_LOAN_PURPOSE.value_counts()
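
Before removing placeholder rows it helps to know how much of the data they cover. The sketch below measures that share with `isin` on a made-up frame and then filters the rows in one step, as an alternative to two separate drops.

```python
import pandas as pd

# Made-up previous applications; only the purpose column matters here.
prev = pd.DataFrame({
    "SK_ID_CURR": [1, 2, 3, 4, 5],
    "NAME_CASH_LOAN_PURPOSE": ["XAP", "Repairs", "XNA", "Education", "XAP"],
})

placeholders = ["XNA", "XAP"]
is_placeholder = prev["NAME_CASH_LOAN_PURPOSE"].isin(placeholders)

# Share of rows that carry a placeholder value.
placeholder_share = is_placeholder.mean()

# Keep only the rows with a real loan purpose.
prev_clean = prev[~is_placeholder]

print(placeholder_share)
print(prev_clean)
```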

In [None]:
# Check for shape after XNA and XAP handling 
inp1.shape

In [None]:
# Check for sample data in inp1
inp1.head(20)

In [None]:
#check Column info
inp1.info()

In [None]:
#Merging the Application dataset with previous application dataset
# suffixes: '_' is appended to overlapping column names from inp2 and 'x' to those from inp1

Master=pd.merge(left=inp2,right=inp1,how='inner',on='SK_ID_CURR',suffixes=('_','x'))
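
To show how suffixes behave in a merge, here is a minimal sketch on two made-up frames. It uses a different suffix pair from the merge above, `("", "_PREV")`, purely as an illustration: the left-hand column keeps its name and the right-hand one is marked as coming from the previous application.

```python
import pandas as pd

# Made-up current and previous applications sharing the key SK_ID_CURR.
current = pd.DataFrame({"SK_ID_CURR": [1, 2], "AMT_CREDIT": [500000, 250000]})
previous = pd.DataFrame({"SK_ID_CURR": [1, 1, 3], "AMT_CREDIT": [100000, 80000, 60000]})

# Inner join keeps only the ids present in both frames (here: id 1, twice).
merged = pd.merge(current, previous, how="inner", on="SK_ID_CURR",
                  suffixes=("", "_PREV"))

print(merged)
```

Each current application is repeated once per matching previous application, so the merged frame can have more rows than the current data.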

In [None]:
# Renaming the column names after merging

master1= Master.rename({'NAME_CONTRACT_TYPE_' : 'NAME_CONTRACT_TYPE','AMT_CREDIT_':'AMT_CREDIT','AMT_ANNUITY_':'AMT_ANNUITY',
                         'WEEKDAY_APPR_PROCESS_START_' : 'WEEKDAY_APPR_PROCESS_START',
                         'HOUR_APPR_PROCESS_START_':'HOUR_APPR_PROCESS_START','NAME_CONTRACT_TYPEx':'NAME_CONTRACT_TYPE_PREV',
                         'AMT_CREDITx':'AMT_CREDIT_PREV','AMT_ANNUITYx':'AMT_ANNUITY_PREV',
                         'WEEKDAY_APPR_PROCESS_STARTx':'WEEKDAY_APPR_PROCESS_START_PREV',
                         'HOUR_APPR_PROCESS_STARTx':'HOUR_APPR_PROCESS_START_PREV'}, axis=1)

In [None]:
# Check sample data in the master dataframe after the merge of inp2 and inp1
Master.head()

In [None]:
#check for columns in master dataframe 
Master.columns

In [None]:
master1.head()

In [None]:
master1.columns

In [None]:
# Removing unwanted columns for analysis

master1.drop(['SK_ID_CURR','WEEKDAY_APPR_PROCESS_START', 'HOUR_APPR_PROCESS_START','REG_REGION_NOT_LIVE_REGION', 
              'REG_REGION_NOT_WORK_REGION','LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY',
              'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY','WEEKDAY_APPR_PROCESS_START_PREV',
              'HOUR_APPR_PROCESS_START_PREV', 'FLAG_LAST_APPL_PER_CONTRACT','NFLAG_LAST_APPL_IN_DAY'],axis=1,inplace=True)

In [None]:
#view sample data after dropping unwanted columns 
master1.head()

In [None]:
# View columns 
master1.info()

In [None]:
master1.columns

## Performing univariate analysis

In [None]:
# Distribution of contract status by loan purpose in logarithmic scale
sns.set_style('whitegrid')
sns.set_context('talk')

plt.figure(figsize=(15,30))
plt.rcParams["axes.labelsize"] = 20
plt.rcParams['axes.titlesize'] = 22
plt.rcParams['axes.titlepad'] = 30
plt.xticks(rotation=90)
plt.xscale('log')
plt.title('Distribution of contract status with purposes')
ax = sns.countplot(data =master1, y= 'NAME_CASH_LOAN_PURPOSE', 
                   order=master1['NAME_CASH_LOAN_PURPOSE'].value_counts().index,hue ='NAME_CONTRACT_STATUS',palette='magma')


Points to be concluded from the above plot:

Most loan rejections came from the purpose 'Repairs'.

For education purposes, the numbers of approvals and rejections are equal.

Paying other loans and buying a new car have significantly more rejections than approvals.
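
Counts on a log scale make it hard to compare rejection rates across purposes of very different size. A row-normalised cross-tabulation shows the shares directly. The sketch below uses made-up rows with the same column names as the merged data.

```python
import pandas as pd

# Made-up applications: purpose and final contract status.
toy = pd.DataFrame({
    "NAME_CASH_LOAN_PURPOSE": ["Repairs"] * 4 + ["Education"] * 2,
    "NAME_CONTRACT_STATUS": ["Approved", "Refused", "Refused", "Refused",
                             "Approved", "Refused"],
})

# normalize="index" turns each row into shares that sum to 1 per purpose.
status_share = pd.crosstab(
    toy["NAME_CASH_LOAN_PURPOSE"],
    toy["NAME_CONTRACT_STATUS"],
    normalize="index",
)

print(status_share)
```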

In [None]:
# Distribution of loan purposes by target in logarithmic scale

sns.set_style('whitegrid')
sns.set_context('talk')

plt.figure(figsize=(15,30))
plt.rcParams["axes.labelsize"] = 20
plt.rcParams['axes.titlesize'] = 22
plt.rcParams['axes.titlepad'] = 30
plt.xticks(rotation=90)
plt.xscale('log')
plt.title('Distribution of purposes with target ')
ax = sns.countplot(data = master1, y= 'NAME_CASH_LOAN_PURPOSE', 
                   order=master1['NAME_CASH_LOAN_PURPOSE'].value_counts().index,hue = 'TARGET',palette='magma')


A few points we can conclude from the above plot:

Loans with purpose 'Repairs' face more difficulties in on-time payment.

For a few purposes, on-time payment is significantly more common than payment difficulties.

They are 'Buying a garage', 'Business development', 'Buying land', 'Buying a new car' and 'Education'.

Hence we can focus on these purposes, for which clients have minimal payment difficulties.
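
Because TARGET is coded 0/1, its mean within a group is the share of clients with payment difficulties in that group. The sketch below computes that share per loan purpose on made-up rows.

```python
import pandas as pd

# Made-up rows: loan purpose and whether the client had payment difficulties (1) or not (0).
toy = pd.DataFrame({
    "NAME_CASH_LOAN_PURPOSE": ["Repairs", "Repairs", "Repairs", "Repairs",
                               "Education", "Education"],
    "TARGET": [1, 0, 1, 0, 0, 0],
})

# Mean of a 0/1 column per group = share of rows with TARGET == 1.
difficulty_rate = (
    toy.groupby("NAME_CASH_LOAN_PURPOSE")["TARGET"]
       .mean()
       .sort_values(ascending=False)
)

print(difficulty_rate)
```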

## Performing bivariate analysis

In [None]:
# Bar plot of previous credit amount by loan purpose in logarithmic scale

plt.figure(figsize=(16,12))
plt.xticks(rotation=90)
plt.yscale('log')
sns.barplot(data =master1, x='NAME_CASH_LOAN_PURPOSE',hue='NAME_INCOME_TYPE',y='AMT_CREDIT_PREV',orient='v')
plt.title('Prev Credit amount vs Loan Purpose')
plt.show()


From the above we can conclude some points:

The credit amount for loan purposes like 'Buying a holiday home', 'Buying a land', 'Buying a new car' and 'Building a house' is higher.

The income type State servant has a significant amount of credit applied for.

'Money for a third person' and 'Hobby' have lower credit amounts applied for.

In [None]:
# Bar plot of previous credit amount vs housing type

plt.figure(figsize=(16,12))
plt.xticks(rotation=90)
sns.barplot(data =master1, y='AMT_CREDIT_PREV',hue='TARGET',x='NAME_HOUSING_TYPE')
plt.title('Prev Credit amount vs Housing type')
plt.show()

For housing type, office apartment has the higher credit for target 0 and co-op apartment has the higher credit for target 1. So we can conclude that the bank should avoid giving loans to clients with housing type co-op apartment, as they have difficulties in payment. The bank can focus mostly on housing types 'With parents', 'House / apartment' or 'Municipal apartment' for successful payments.

## CONCLUSION

#### 1. Banks should focus more on income types ‘Student’, ‘Pensioner’ and ‘Businessman’ with housing types other than ‘Co-op apartment’ for successful payments.

#### 2. Banks should focus less on income type ‘Working’, as it has the highest number of unsuccessful payments.

#### 3. Loans with purpose ‘Repairs’ also have a higher number of payments not made on time.

#### 4. Acquire as many clients as possible from housing type ‘With parents’, as they have the fewest unsuccessful payments.