# Business Understanding

#### Lenders find it hard to extend loans to people with insufficient or non-existent credit history, and some consumers take advantage of this by defaulting. Suppose you work for a consumer finance company which specialises in lending various types of loans to urban customers. You have to use EDA to analyse the patterns present in the data, so that applicants capable of repaying the loan are not rejected.

<img src=image.jfif>

 

#### When the company receives a loan application, it has to decide whether to approve the loan based on the applicant’s profile. Two types of risk are associated with the company’s decision:

- If the applicant is likely to repay the loan, then not approving the loan results in a loss of business to the company.
- If the applicant is not likely to repay the loan, i.e. he/she is likely to default, then approving the loan may lead to a financial loss for the company.

#### The data given below contains information about the loan application at the time of applying for the loan. It contains two types of scenarios:

- The client with payment difficulties: he/she had a late payment of more than X days on at least one of the first Y instalments of the loan in our sample.
- All other cases: the payments were made on time.

#### When a client applies for a loan, there are four types of decisions that could be taken by the client/company:

- Approved: The company has approved the loan application.
- Cancelled: The client cancelled the application sometime during approval. Either the client changed her/his mind about the loan, or in some cases the client received worse pricing because of a higher risk profile and did not want it.
- Refused: The company rejected the loan (because the client does not meet its requirements etc.).
- Unused offer: The loan has been cancelled by the client, but at a different stage of the process.

#### In this case study, we perform EDA to understand how consumer attributes and loan attributes influence the tendency to default.

## 1. Importing Required Libraries

In [None]:
#ignore warnings
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Import statements of all the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import random

## 2. Reading the application_data.csv data


### 'application_data.csv' contains all the information about the client at the time of application. The data records whether a client has payment difficulties. Let's analyse this data and look for patterns in it.

In [None]:
#Reading data from application_data.csv 
data= pd.read_csv("application_data.csv")

# Displaying the first 5 rows of data
data.head()

# Data Description

In [None]:
#Displaying the number of rows and columns in the data
data.shape

In [None]:
#Displaying data types of all columns
data.info()

In [None]:
#Summary of the dataset
data.describe()

## 3. Inspecting Missing Values

In [None]:
# Displaying the first 5 rows again before inspecting missing values
data.head()

In [None]:
# fraction of null values in each column (0.30 means 30 percent)
percentageOfNullValues=data.isnull().sum()/data.shape[0]
percentageOfNullValues
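
The value computed above is a fraction between 0 and 1, not a percentage. As a minimal sketch on a made-up toy frame (the column names `A`, `B` and `C` are illustrative only), `isnull().mean()` gives the same fraction in one call, and sorting it puts the sparsest columns first:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for application_data.csv (illustrative columns)
toy = pd.DataFrame({
    "A": [1.0, np.nan, 3.0, 4.0],
    "B": [np.nan, np.nan, np.nan, 1.0],
    "C": [1, 2, 3, 4],
})

# isnull().mean() is the per-column fraction of missing values;
# multiplying by 100 reports it as a percentage
missing_pct = (toy.isnull().mean() * 100).sort_values(ascending=False)
print(missing_pct)
```

Sorting makes it easier to see where a sensible cut-off might lie before choosing a threshold.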

## 4.Handling Missing Values

In [None]:
#Verifying the count of columns which have more than 30 percent null values
percentageOfNullValues[percentageOfNullValues>0.30].count()

## From the above step we can see that 50 columns in the application data have more than 30 percent null values. Keeping them may bias the analysis, so we continue by dropping them.

In [None]:
# Count of columns containing less than 30 percent null values
percentageOfNullValues[percentageOfNullValues<0.30].count()

# Let's consider only columns where the share of NaN values is less than 30 percent

In [None]:
# .copy() gives an independent dataframe, so later column assignments do not act on a slice
data=data[percentageOfNullValues[percentageOfNullValues<0.30].index].copy()
data.head()

In [None]:
# Verifying the shape of the dataframe after removing the columns which contain more than 30 percent null values
data.shape

# As we can see, the dataset no longer contains columns with more than 30 percent null values
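
The filtering step can also be wrapped in a small helper. This is a sketch only: the function name `drop_sparse_columns` and the toy column names are hypothetical, and it uses `<=` so that a column sitting exactly on the threshold is kept rather than silently dropped.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df: pd.DataFrame, threshold: float = 0.30) -> pd.DataFrame:
    """Keep only the columns whose fraction of missing values is <= threshold."""
    missing_fraction = df.isnull().mean()  # per-column fraction in [0, 1]
    keep = missing_fraction[missing_fraction <= threshold].index
    return df[keep].copy()

toy = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],  # 75% missing
    "some_missing": [1.0, np.nan, 3.0, 4.0],          # 25% missing
    "complete": [1, 2, 3, 4],                         # 0% missing
})
reduced = drop_sparse_columns(toy, threshold=0.30)
print(list(reduced.columns))
```

Returning a copy means the result can be modified freely without affecting the original frame.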

In [None]:
data.info()

# 5. Datatype Handling

In [None]:
#changing negative ages (stored in days) to positive values for further analysis
data['DAYS_BIRTH']=abs(data['DAYS_BIRTH'])
data['DAYS_BIRTH'].describe()

In [None]:
#changing negative values in days to positive days for further analysis
data['DAYS_EMPLOYED']=abs(data['DAYS_EMPLOYED'])
data['DAYS_EMPLOYED'].describe()

In [None]:
#changing negative days to positive days for further analysis
data['DAYS_REGISTRATION']=abs(data['DAYS_REGISTRATION'])
data['DAYS_REGISTRATION'].describe()

In [None]:
#changing negative days to positive for further analysis
data['DAYS_ID_PUBLISH']=abs(data['DAYS_ID_PUBLISH'])
data['DAYS_ID_PUBLISH'].describe()
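
One caveat when taking absolute values: in the public Home Credit data, which this file appears to resemble, `DAYS_EMPLOYED` uses a large positive placeholder (365243) for applicants with no employment record. Whether that holds for this file is an assumption; it can be checked with `data['DAYS_EMPLOYED'].max()`. The sketch below, on made-up rows, replaces the assumed placeholder with NaN and converts the day counts to years:

```python
import numpy as np
import pandas as pd

# Assumed placeholder for "no employment record"; confirm it against your own file
PLACEHOLDER = 365243

toy = pd.DataFrame({
    "DAYS_BIRTH": [-17500, -9000, -21000],
    "DAYS_EMPLOYED": [-3650, -730, PLACEHOLDER],
})

# Replace the placeholder with NaN before taking absolute values,
# otherwise it would read as roughly 1000 years of employment
toy["DAYS_EMPLOYED"] = toy["DAYS_EMPLOYED"].replace(PLACEHOLDER, np.nan).abs()

# Express both columns in years, which is easier to read than day counts
toy["AGE_YEARS"] = toy["DAYS_BIRTH"].abs() / 365
toy["YEARS_EMPLOYED"] = toy["DAYS_EMPLOYED"] / 365
print(toy)
```

If the placeholder is present and left in, it dominates means and correlations involving `DAYS_EMPLOYED`.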

In [None]:
#converting the data type of given categorical column
data['REG_REGION_NOT_LIVE_REGION'] = data['REG_REGION_NOT_LIVE_REGION'].astype(object)
data.dtypes

In [None]:
#Changing region datatype
data['REG_REGION_NOT_WORK_REGION'] = data['REG_REGION_NOT_WORK_REGION'].astype(object)

In [None]:
#Changing region datatype
data['LIVE_REGION_NOT_WORK_REGION'] = data['LIVE_REGION_NOT_WORK_REGION'].astype(object)

In [None]:
#Changing city datatype
data['REG_CITY_NOT_LIVE_CITY'] = data['REG_CITY_NOT_LIVE_CITY'].astype(object)

In [None]:
#Changing city datatype
data['REG_CITY_NOT_WORK_CITY'] = data['REG_CITY_NOT_WORK_CITY'].astype(object)

In [None]:
#Changing city datatype
data['LIVE_CITY_NOT_WORK_CITY']=data['LIVE_CITY_NOT_WORK_CITY'].astype(object)
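
The six conversions above follow the same pattern, so they could also be written as one call. A minimal sketch on a toy frame with made-up values, using the same column names:

```python
import pandas as pd

flag_columns = [
    "REG_REGION_NOT_LIVE_REGION", "REG_REGION_NOT_WORK_REGION",
    "LIVE_REGION_NOT_WORK_REGION", "REG_CITY_NOT_LIVE_CITY",
    "REG_CITY_NOT_WORK_CITY", "LIVE_CITY_NOT_WORK_CITY",
]

# Toy frame with the same flag columns (values are made up)
toy = pd.DataFrame({col: [0, 1, 0] for col in flag_columns})

# astype accepts a {column: dtype} mapping, so one call converts every flag
toy = toy.astype({col: object for col in flag_columns})
print(toy.dtypes)
```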

In [None]:
data.head()

# Target:

## Analysing the Target Column and verifying that target doesn't contain any nan values in it

- 1: Payment difficulties

- 0: Other case

- Target variable (1 - client with payment difficulties: he/she had late payment more than X days on at least one of the first Y installments of the loan in our sample, 0 - all other cases)

In [None]:
data['TARGET'].value_counts()

In [None]:
# multiply by 100 so that the printed values are percentages rather than fractions
print("Percentage of customers without payment difficulties (0): ",data['TARGET'].value_counts()[0]/data.shape[0]*100)
print("Percentage of customers with payment difficulties (1): ",data['TARGET'].value_counts()[1]/data.shape[0]*100)

In [None]:
data['TARGET'].value_counts(normalize=True).plot(kind='bar')

# Insights:

## Observations on column `TARGET` from application_data:

- Customers in category 1 `(Defaulters/Having payment difficulties)` make up about 8 percent

- Customers in category 0 `(Non-Defaulters)` make up about 92 percent
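
The imbalance can be summarised as a single ratio. The sketch below uses a synthetic target with the 92/8 split reported above, not the real column:

```python
import pandas as pd

# Synthetic target with the rough balance reported above (92 zeros, 8 ones)
target = pd.Series([0] * 92 + [1] * 8, name="TARGET")

share = target.value_counts(normalize=True) * 100  # percentage per class
imbalance_ratio = share.loc[0] / share.loc[1]      # non-defaulters per defaulter
print(share.round(1).to_dict(), round(imbalance_ratio, 1))
```

A ratio of this size means raw counts in the later plots are dominated by non-defaulters, so proportions are usually more informative than counts.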

# 6. Univariate Analysis

# Categorical Unordered univariate analysis

In [None]:
# df0 represents target==0 which means customers belonging to non-defaulters

# df1 represents target ==1 which means customers belonging to defaulters/having payment difficulties

df0=data[data['TARGET']==0]
df1=data[data['TARGET']==1]

# Since our output column (target) takes the values 0 and 1, we created df0 and df1 to represent the two classes

In [None]:
#Reusable univariate analysis method (univ_anal) that lets us perform further analysis with fewer lines of code

def univ_anal(variable_analysis):
    plt.figure(figsize=(15,5))
    plt.subplot(1,2,1)
    plt.title('Distribution for Non-Defaulter on '+variable_analysis,fontdict={'fontsize':13})
    sns.countplot(x=df0[variable_analysis])   #keyword x= works across seaborn versions
    plt.xticks(rotation=90)
    plt.subplot(1,2,2)
    plt.title('Distribution for Defaulter on '+variable_analysis,fontdict={'fontsize':13})
    sns.countplot(x=df1[variable_analysis])
    plt.xticks(rotation=90)
    plt.show()

In [None]:
#Distribution for Non-Defaulter and Defaulter on NAME_CONTRACT_TYPE
univ_anal('NAME_CONTRACT_TYPE')

In [None]:
#Distribution for Non-Defaulter and Defaulter on NAME_FAMILY_STATUS

univ_anal('NAME_FAMILY_STATUS')

# Insights:
 
- Married customers account for the largest number of defaulters, which is likely because this category also takes the most loans
- The `Single/Not married` category makes up a larger share of the defaulter plot than of the non-defaulter plot
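
Because the two count plots have different scales, proportions are easier to compare as numbers. The helper below is a sketch (the name `default_rate` and the toy rows are hypothetical) that reports the applicant count and the default rate for each category:

```python
import pandas as pd

def default_rate(df: pd.DataFrame, column: str, target: str = "TARGET") -> pd.DataFrame:
    """Return the applicant count and default rate (%) for each category of `column`."""
    summary = df.groupby(column)[target].agg(count="size", default_rate="mean")
    summary["default_rate"] = (summary["default_rate"] * 100).round(2)
    return summary.sort_values("default_rate", ascending=False)

toy = pd.DataFrame({
    "NAME_FAMILY_STATUS": ["Married"] * 6 + ["Single / not married"] * 4,
    "TARGET": [0, 0, 0, 0, 0, 1, 0, 0, 1, 1],
})
rates = default_rate(toy, "NAME_FAMILY_STATUS")
print(rates)
```

Reading the count next to the rate helps separate "many defaulters because many applicants" from "a high share of defaulters".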

In [None]:
plt.figure(figsize=(8,5))
plt.title('Distribution of Male versus Female in the data',fontdict={'fontsize':17})
sns.countplot(x=data.CODE_GENDER)   #keyword x= works across seaborn versions
plt.show()

In [None]:
#Distribution for Non-Defaulter and Defaulter on CODE_GENDER
univ_anal('CODE_GENDER')

# Insights:

- From the above two plots we can observe that the Female category in the dataset is about twice the size of the Male category. As a result, the Female category has the higher count among both defaulters and non-defaulters. Further analysis is done under bivariate analysis to infer results from the *`GENDER`* category.

In [None]:
#Distribution for Non-Defaulter and Defaulter on NAME_HOUSING_TYPE

univ_anal('NAME_HOUSING_TYPE')

# Insights:

- The `House/Apartments` category has the highest number of customers applying to the bank for loans
- The `Rental apartment` category takes a larger share among defaulters than among non-defaulters. In practice a large part of these customers' monthly income goes to rent, which may push them into default, so this needs to be considered while giving loans
- If we observe the scale of the plots, the `With parents` category also has a higher chance of falling under the default category
- The `House/Apartments` category also has many defaulters: their count is roughly 8-10% of the non-defaulter count

In [None]:
#Distribution for Non-Defaulter and Defaulter on NAME_INCOME_TYPE

univ_anal('NAME_INCOME_TYPE')

# Insights:

- The data shows almost the same behaviour across all categories for defaulters and non-defaulters
- On closer observation, the percentage of Pensioners in default is lower than in the other categories, so the bank can focus on this category to generate profits, for example by adjusting the loan amount

In [None]:
#Distribution for Non-Defaulter and Defaulter on ORGANIZATION_TYPE

plt.figure(figsize=(30,5))
plt.subplot(1,2,1)
plt.title('Distribution for Non-Defaulter on ORGANIZATION_TYPE',fontdict={'fontsize':13})
sns.countplot(x=df0['ORGANIZATION_TYPE'])   #keyword x= works across seaborn versions
plt.xticks(rotation=90)
plt.subplot(1,2,2)
plt.title('Distribution for Defaulter on ORGANIZATION_TYPE',fontdict={'fontsize':13})
sns.countplot(x=df1['ORGANIZATION_TYPE'])
plt.xticks(rotation=90)
plt.show()

# Insights:

- We can observe that in most categories the defaulter count is about 10 percent of the non-defaulter count.
- `Business Entity Type 3` and `Self-employed` have the highest defaulter counts
- `Business Entity Type 1` has a high defaulter count relative to the overall size of that category

In [None]:
#Distribution for Non-Defaulter and Defaulter on NAME_TYPE_SUITE
univ_anal('NAME_TYPE_SUITE')

# Insights:

- `NAME_TYPE_SUITE` records who accompanied the client when applying for the loan, and we can observe an almost identical pattern in the defaulter and non-defaulter categories.
- We can use this variable in the bivariate analysis to check whether it has any strong correlation with other variables in the dataset

# Categorical Ordered univariate analysis

In [None]:
univ_anal('WEEKDAY_APPR_PROCESS_START')

# Insights:

- The day of the week on which the application was started does not appear to influence the outcome (`TARGET`)

In [None]:
#Distribution for Non-Defaulter and Defaulter on NAME_EDUCATION_TYPE

univ_anal('NAME_EDUCATION_TYPE')

# Insights:

- The Academic degree category looks more profitable for banks, since the defaulter percentage is much lower in this category than in the others.
- As the education level increases, the defaulter count decreases. This is realistic: such customers are more likely to have settled into stable jobs and be able to repay their loans.
- Higher education is clearly associated with fewer defaulters, so the bank can take education level into account when giving loans.

# 7. Bivariate Analysis

## Numeric - Numeric analysis

- There are three ways to analyse two *`numeric-numeric`* variables together.
- **Scatter plot**: shows how one variable varies with the other.
- **Correlation matrix**: measures the strength of the linear relationship between each pair of numeric variables.
- **Pair plot**: a grid of scatter plots for all numeric variables in the data frame.

In [None]:
def multi_anal(groupby_var,target):  #reusable multi-variable analysis method
    plt.figure(figsize=(13,5))
    plt.subplot(1,2,1)
    plt.title('Distribution of Non-Defaulter Category\n'+' V/S '+groupby_var)
    sns.countplot(x=groupby_var,hue=target,data=df0)
    plt.xticks(rotation=90)
    plt.subplot(1,2,2)
    plt.title('Distribution of Defaulter Category\n'+' V/S '+groupby_var)
    sns.countplot(x=groupby_var,hue=target,data=df1)
    plt.xticks(rotation=90)
    plt.show()

In [None]:
def multivaranal_scatter(x,y):
    plt.figure(figsize=(15,5))
    plt.subplot(1,2,1)
    plt.title("NO payment Difficulties")
    plt.xlabel(x)
    plt.ylabel(y)
    plt.scatter(x,y,data=df0)
    plt.subplot(1,2,2)
    plt.title("Defaulter/Payment Difficulties")
    plt.xlabel(x)
    plt.ylabel(y)
    plt.scatter(x,y,data=df1)
    plt.show()

In [None]:
multivaranal_scatter('AMT_CREDIT','AMT_ANNUITY')

# Insights:

- We can easily observe that the credit amount of the loan (`AMT_CREDIT`) is linearly related to `AMT_ANNUITY`

In [None]:
multivaranal_scatter('OBS_30_CNT_SOCIAL_CIRCLE','OBS_60_CNT_SOCIAL_CIRCLE')

# Insights:

- We can observe that `OBS_30_CNT_SOCIAL_CIRCLE` and `OBS_60_CNT_SOCIAL_CIRCLE` are linearly related

In [None]:
multivaranal_scatter('AMT_ANNUITY','AMT_GOODS_PRICE')

# Insights:

- As expected, `AMT_ANNUITY` (the periodic repayment) grows with `AMT_GOODS_PRICE`, because the annuity payments together repay the loan amount, and our plot depicts the same
- When the scales of the two plots are compared, `AMT_ANNUITY` stays below about 140000 for defaulters, while non-defaulters reach higher values. Higher instalment amounts therefore have few defaulters, which benefits the bank through regular payments from customers

In [None]:
multivaranal_scatter('AMT_GOODS_PRICE','AMT_CREDIT')

# Insights:

- `AMT_GOODS_PRICE` and `AMT_CREDIT` are almost linearly related
- The defaulter plot shows some customers at higher goods prices. It is better to analyse other variables and reduce the loan amount for such customers to reduce the loss to the bank

# Numeric - Categorical


## `In our dataset TARGET is categorical. Let's compare it with the numeric variables in the dataset`

In [None]:
sns.boxplot(x='TARGET',y='DAYS_BIRTH',data=data)

# Insights:

- We can see that the younger age group has more payment difficulties than the older age group
- We can also observe that applicants older than roughly 47 years (17500/365 ≈ 47.9) have far fewer payment difficulties
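
The age effect can be quantified by binning `DAYS_BIRTH` into age groups and computing the default rate per group. This is a sketch on made-up rows; the bin edges are illustrative and are not taken from the analysis above.

```python
import pandas as pd

toy = pd.DataFrame({
    # DAYS_BIRTH already converted to positive day counts
    "DAYS_BIRTH": [9000, 10000, 12000, 15000, 18000, 20000, 22000, 23000],
    "TARGET":     [1,    0,     1,     0,     0,     0,     1,     0],
})

toy["AGE_YEARS"] = toy["DAYS_BIRTH"] / 365
# Illustrative bin edges; pd.cut uses right-closed intervals such as (20, 35]
toy["AGE_GROUP"] = pd.cut(toy["AGE_YEARS"], bins=[20, 35, 48, 70],
                          labels=["20-35", "35-48", "48-70"])

# Mean of a 0/1 target is the share of defaulters in each group
rate_by_age = toy.groupby("AGE_GROUP", observed=True)["TARGET"].mean() * 100
print(rate_by_age)
```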

In [None]:
sns.boxplot(x='TARGET',y='AMT_ANNUITY',data=data)

# Insights:

- `AMT_ANNUITY` describes the series of payments made by the customer. We can observe that higher instalments tend to be paid on time, since the higher annuity values fall under the non-defaulter category


In [None]:
data.groupby('TARGET')['DAYS_EMPLOYED'].mean().plot(kind='bar')

# Insights:

- The plot shows that customers who have worked at their company for a longer time are less likely to default, which can be considered while giving a loan

# Explanation

- AMT_INCOME_TOTAL: Total income of the client
- AMT_CREDIT: Credit amount of the loan
- AMT_ANNUITY: Series of equal amounts paid
- AMT_GOODS_PRICE: For consumer loans it is the price of the goods for which the loan is given

In [None]:
sns.boxplot(x='TARGET',y='AMT_GOODS_PRICE',data=data)

# Insights:

- `AMT_GOODS_PRICE` is the price of the goods for which the loan is given. We can observe that loans for higher-priced goods fall mostly under the non-default category (0), which is profitable for the bank.
- The plot shows a large set of non-defaulters to whom the bank gave loans for higher-priced goods. The bank should pay attention to this segment to make profits.

In [None]:
sns.boxplot(x='TARGET',y='CNT_FAM_MEMBERS',data=data)

# Insights:

- From the above plot we can see that an increase in the number of family members doesn't really affect the target variable
- Further analysis can be made on this in the multivariate analysis.

In [None]:

sns.boxplot(x='TARGET',y='HOUR_APPR_PROCESS_START',data=data)

# Insights:

- As observed from the above plot `HOUR_APPR_PROCESS_START` doesn't make much difference in defaulters and Non-defaulters. 

In [None]:

sns.boxplot(x='TARGET',y='DAYS_LAST_PHONE_CHANGE',data=data)

# Insights:

- Among defaulters, the *`phoneNumber`* was changed more recently before the application; even the 75th percentile is close to the loan processing day. We may consider this for further analysis to gain more insight

In [None]:
sns.boxplot(x='TARGET',y='AMT_ANNUITY',data=data)

In [None]:
df1.AMT_ANNUITY.describe()

# Insights:

- `AMT_ANNUITY` is the series of amounts paid at equal intervals. The plot and the summary statistics show that defaulters are concentrated at `AMT_ANNUITY` values below 149211.


# Categorical vs Categorical

In [None]:
def multi_anal(groupby_var,target):  #reusable multi-variable analysis method (same definition as earlier)
    plt.figure(figsize=(13,5))
    plt.subplot(1,2,1)
    plt.title('Distribution of Non-Defaulter Category\n'+' V/S '+groupby_var)
    sns.countplot(x=groupby_var,hue=target,data=df0)
    plt.xticks(rotation=90)
    plt.subplot(1,2,2)
    plt.title('Distribution of Defaulter Category\n'+' V/S '+groupby_var)
    sns.countplot(x=groupby_var,hue=target,data=df1)
    plt.xticks(rotation=90)

In [None]:

multi_anal('NAME_INCOME_TYPE','TARGET')

# Insights:

- We can understand from the above plot that pensioners are a safer category for the bank, and loans can be provided to them with less risk than to other categories
- `State servant` and `Commercial associate` are the next lowest-risk categories after `Pensioner`
- The `Unemployed` and `Maternity leave` categories stand out as the most likely to default

In [None]:
multi_anal('FLAG_OWN_CAR','TARGET')

# Insights:

- The customers are split roughly in half between those owning a car and those not owning one, which is quite realistic.

## Customer Owning a car:

- The percentage of car owners among defaulters is somewhat lower, but the difference is not drastic.

In [None]:
multi_anal('NAME_EDUCATION_TYPE','TARGET')

# Insights:

- The graph clearly shows that applicants with an Academic degree are less likely to be in the defaulter list, which is profitable for the company
- A minimal education level clearly implies a higher chance of incurring losses from that category of customers

In [None]:
plt.figure(figsize=(25,5))
data.groupby('ORGANIZATION_TYPE')['TARGET'].mean().plot(kind='bar')

# Insights:

- This clearly shows which `ORGANIZATION_TYPE` categories have the highest share of loan defaulters


# Correlation matrix

In [None]:
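# The first entries are each numeric variable's correlation with itself (equal to 1), so the slice skips past them
# numeric_only=True restricts the calculation to numeric columns, which newer pandas versions require when object columns are present
data.corr(numeric_only=True).abs().unstack().sort_values(ascending=False)[61:82]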

# Insights from Correlation table:

- DAYS_EMPLOYED                FLAG_EMP_PHONE                 0.999755
- FLAG_EMP_PHONE               DAYS_EMPLOYED                  0.999755
- OBS_60_CNT_SOCIAL_CIRCLE     OBS_30_CNT_SOCIAL_CIRCLE       0.998490
- OBS_30_CNT_SOCIAL_CIRCLE     OBS_60_CNT_SOCIAL_CIRCLE       0.998490
- AMT_GOODS_PRICE              AMT_CREDIT                     0.986968
- AMT_CREDIT                   AMT_GOODS_PRICE                0.986968
- REGION_RATING_CLIENT         REGION_RATING_CLIENT_W_CITY    0.950842
- REGION_RATING_CLIENT_W_CITY  REGION_RATING_CLIENT           0.950842
- CNT_FAM_MEMBERS              CNT_CHILDREN                   0.879161
- CNT_CHILDREN                 CNT_FAM_MEMBERS                0.879161
- LIVE_REGION_NOT_WORK_REGION  REG_REGION_NOT_WORK_REGION     0.860627
- REG_REGION_NOT_WORK_REGION   LIVE_REGION_NOT_WORK_REGION    0.860627
- DEF_30_CNT_SOCIAL_CIRCLE     DEF_60_CNT_SOCIAL_CIRCLE       0.860517
- DEF_60_CNT_SOCIAL_CIRCLE     DEF_30_CNT_SOCIAL_CIRCLE       0.860517
- LIVE_CITY_NOT_WORK_CITY      REG_CITY_NOT_WORK_CITY         0.825575
- REG_CITY_NOT_WORK_CITY       LIVE_CITY_NOT_WORK_CITY        0.825575
- AMT_ANNUITY                  AMT_GOODS_PRICE                0.775109
- AMT_GOODS_PRICE              AMT_ANNUITY                    0.775109
- AMT_CREDIT                   AMT_ANNUITY                    0.770138
- AMT_ANNUITY                  AMT_CREDIT                     0.770138
- FLAG_EMP_PHONE               DAYS_BIRTH                     0.619888
- DAYS_BIRTH                   FLAG_EMP_PHONE                 0.619888
- DAYS_EMPLOYED                DAYS_BIRTH                     0.615864
- DAYS_BIRTH                   DAYS_EMPLOYED                  0.615864
- FLAG_EMP_PHONE               FLAG_DOCUMENT_6                0.597732
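
Every pair appears twice in the table above because the correlation matrix is symmetric. As a sketch of one way to list each pair once, the helper below (the name `top_correlated_pairs` is hypothetical, and the data is synthetic) masks everything except the strict upper triangle before ranking:

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(df: pd.DataFrame, n: int = 10) -> pd.Series:
    """Return the n largest absolute correlations, listing each pair once."""
    corr = df.corr(numeric_only=True).abs()
    # Strict upper triangle: drops the diagonal and the mirrored pairs
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(ascending=False).head(n)

rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({
    "AMT_CREDIT": x,
    "AMT_GOODS_PRICE": x + rng.normal(scale=0.1, size=200),  # nearly collinear with x
    "CNT_CHILDREN": rng.normal(size=200),                    # unrelated noise
})
pairs = top_correlated_pairs(toy, n=3)
print(pairs)
```

This avoids having to guess how many self-correlations to skip with a positional slice.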


# Visualisation for the highly correlated variables

In [None]:
# Top 1 Correlated variables

data.plot.scatter(x='OBS_30_CNT_SOCIAL_CIRCLE',y='OBS_60_CNT_SOCIAL_CIRCLE')
plt.show()

In [None]:
# Top 2 Correlated variables

data.plot.scatter(x='AMT_CREDIT',y='AMT_GOODS_PRICE')
plt.show()

# Insights:

- `AMT_CREDIT` and `AMT_GOODS_PRICE` are linearly related. This is expected: when a customer approaches the bank for a loan, the bank decides the amount to lend based on the price of the goods and the customer's history

In [None]:
# Top 3 Correlated variables

data.plot.scatter(x='REGION_RATING_CLIENT_W_CITY',y='REGION_RATING_CLIENT')
plt.show()

In [None]:
# Top 4 Correlated variables

data.plot.scatter(x='CNT_FAM_MEMBERS',y='CNT_CHILDREN')
plt.show()

In [None]:
# Top 5 Correlated variables
data.plot.scatter(x='LIVE_REGION_NOT_WORK_REGION',y='REG_REGION_NOT_WORK_REGION')
plt.show()

# Insights:

- `LIVE_REGION_NOT_WORK_REGION` and `REG_REGION_NOT_WORK_REGION` are highly correlated.
- Because the correlation is positive, the two flags usually take the same value: when one is 0 the other is most often 0 as well

In [None]:
# Top 6 Correlated variables
data.plot.scatter(x='DEF_60_CNT_SOCIAL_CIRCLE',y='DEF_30_CNT_SOCIAL_CIRCLE')
plt.show()

In [None]:
# Top 7 Correlated variables

data.plot.scatter(x='REG_CITY_NOT_WORK_CITY',y='LIVE_CITY_NOT_WORK_CITY')
plt.show()

In [None]:
# Top 8 Correlated variables

data.plot.scatter(x='AMT_GOODS_PRICE',y='AMT_ANNUITY')
plt.show()

In [None]:
# Top 9 Correlated variables

data.plot.scatter(x='AMT_CREDIT',y='AMT_ANNUITY')
plt.show()

In [None]:
# Top 10 Correlated variables

data.plot.scatter(x='DAYS_BIRTH',y='FLAG_EMP_PHONE')
plt.show()

####   End of Correlation Analysis

### Let's analyse 'previous_application.csv', which contains information about the client’s previous loan data. It records whether the previous application was Approved, Cancelled, Refused or an Unused offer.

#### Analysing this data helps in understanding the bank's behavior in the previous loans

# Previous Application Dataset

In [None]:
#Reading the previous application dataset

prev_appl=pd.read_csv('previous_application.csv')
prev_appl

In [None]:
prev_appl.head()

# Data Description

In [None]:
#shape of the data
prev_appl.shape

In [None]:
prev_appl.info()

In [None]:
#statistical description of the data
prev_appl.describe()

# Identification and Treatment of Null values

In [None]:
percentageofnullvalues=((prev_appl.isnull().sum()/prev_appl.shape[0])*100).round(2)
percentageofnullvalues

# Let's consider only columns having less than 20 percent null values

In [None]:
# Keep only the columns having less than 20 percent null values

prev_appl=prev_appl[percentageofnullvalues[percentageofnullvalues<20].index]
prev_appl

In [None]:
# Verifying the column  count after removal of columns in the previous step 
prev_appl.shape

In [None]:
# Previous application top 5 rows

prev_appl.head()

In [None]:
# combining the 'previous application' dataframe  with 'application'  dataframe based on the Current application id (SK_ID_CURR)

merged_df=pd.merge(data,prev_appl,how='left',on='SK_ID_CURR')
merged_df

In [None]:
# shape of the combined dataframe
merged_df.shape
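
A left merge on `SK_ID_CURR` repeats each application once per previous application, so the row count grows. The sketch below uses made-up rows to show two optional `pd.merge` arguments that help check the result: `validate` and `indicator`.

```python
import pandas as pd

applications = pd.DataFrame({"SK_ID_CURR": [1, 2, 3], "TARGET": [0, 1, 0]})
previous = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2],
    "NAME_CONTRACT_STATUS": ["Approved", "Refused", "Approved"],
})

# validate="one_to_many" raises if SK_ID_CURR is duplicated on the left side;
# indicator=True adds a _merge column showing where each row came from
merged = pd.merge(applications, previous, how="left", on="SK_ID_CURR",
                  validate="one_to_many", indicator=True)

print(merged.shape)
print(merged["_merge"].value_counts())
```

Rows marked `left_only` are applicants with no previous application. Columns that share a name in both frames receive `_x` and `_y` suffixes by default.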

# NAME_CONTRACT_STATUS Analysis

In [None]:
merged_df.NAME_CONTRACT_STATUS.value_counts()

In [None]:
# merged_df0 represents target==0 which means customers belonging to non-defaulters

# merged_df1 represents target ==1 which means customers belonging to defaulters

merged_df0=merged_df[merged_df['TARGET']==0]
merged_df1=merged_df[merged_df['TARGET']==1]

# Univariate and Bivariate Analysis

In [None]:
# Analysis of NAME_CONTRACT_STATUS for non-defaulters and defaulters

plt.figure(figsize=(20,8))
plt.subplot(1,2,1)
plt.title('Distribution of Non-Defaulter Category\n'+' V/S '+'NAME_CONTRACT_STATUS',fontsize=20)
sns.countplot(x='TARGET',hue='NAME_CONTRACT_STATUS',data=merged_df0)
plt.subplot(1,2,2)
plt.title('Distribution of Defaulter Category\n'+' V/S '+'NAME_CONTRACT_STATUS',fontsize=20)
sns.countplot(x='TARGET',hue='NAME_CONTRACT_STATUS',data=merged_df1)
plt.show()

# Insights:

- After combining both dataframes, we can see that almost 65000 loans (plot 2) approved by the bank fall under the default category, which is a loss for the bank.
- Plot 1 covers the non-defaulter category, where the company refused almost 200000 loans even though the customers were capable of repaying them. This should be avoided to increase the bank's profits
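
The same comparison can be expressed as a table of shares rather than raw counts. This is a sketch on made-up rows (the status values are illustrative), using `pd.crosstab` with row-wise normalisation:

```python
import pandas as pd

toy = pd.DataFrame({
    "NAME_CONTRACT_STATUS": ["Approved", "Approved", "Approved", "Approved",
                             "Refused", "Refused", "Canceled", "Canceled"],
    "TARGET": [0, 0, 0, 1, 0, 1, 0, 0],
})

# normalize="index" turns each row into shares that sum to 1,
# so every status shows its own split between TARGET 0 and 1
status_vs_target = pd.crosstab(toy["NAME_CONTRACT_STATUS"], toy["TARGET"],
                               normalize="index") * 100
print(status_vs_target.round(1))
```

Row-wise shares make it possible to compare statuses with very different volumes on the same footing.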


In [None]:
#Analysis of NAME_EDUCATION_TYPE V/S NAME_CONTRACT_STATUS for non-defaulters and defaulters

plt.figure(figsize=(18,5))
plt.subplot(1,2,1)
plt.title('Distribution of Non-Defaulter Category on\n'+'NAME_EDUCATION_TYPE V/S NAME_CONTRACT_STATUS',fontsize=15)
sns.countplot(x='NAME_CONTRACT_STATUS',hue='NAME_EDUCATION_TYPE',data=merged_df0)
plt.xticks(rotation=90)
plt.subplot(1,2,2)
plt.title('Distribution of Default Category on\n'+'NAME_EDUCATION_TYPE V/S NAME_CONTRACT_STATUS',fontsize=15)
sns.countplot(x='NAME_CONTRACT_STATUS',hue='NAME_EDUCATION_TYPE',data=merged_df1)
plt.xticks(rotation=90)
plt.show()

# Insights:

- The bank should look closely at the `Secondary/Secondary special` category, since most customers come from this category in both the defaulter and non-defaulter lists. Other factors (variables) should therefore be considered before approving the loan


In [None]:
# Analysis of NAME_FAMILY_STATUS V/S NAME_CONTRACT_STATUS for non-defaulters and defaulters

plt.figure(figsize=(18,5))
plt.subplot(1,2,1)
plt.title('Distribution of Non-Defaulter Category on\n'+'NAME_FAMILY_STATUS V/S NAME_CONTRACT_STATUS',fontsize=15)
sns.countplot(x='NAME_CONTRACT_STATUS',hue='NAME_FAMILY_STATUS',data=merged_df0)
plt.xticks(rotation=90)
plt.subplot(1,2,2)
plt.title('Distribution of Defaulter Category on\n'+'NAME_FAMILY_STATUS V/S NAME_CONTRACT_STATUS',fontsize=15)
sns.countplot(x='NAME_CONTRACT_STATUS',hue='NAME_FAMILY_STATUS',data=merged_df1)
plt.xticks(rotation=90)
plt.show()

# Insights:

- From plot 1 we can understand that some loans were refused across different categories even though those customers were capable of repaying the loan
- From plot 2 we can say that the `Married` category has a high count of refused and cancelled loans compared with the other categories

In [None]:
# Analysis of NAME_HOUSING_TYPE V/S NAME_CONTRACT_STATUS for non-defaulters and defaulters

plt.figure(figsize=(18,5))
plt.subplot(1,2,1)
plt.title('Distribution of Non-Defaulter Category on\n'+'NAME_HOUSING_TYPE V/S NAME_CONTRACT_STATUS',fontsize=15)
sns.countplot(x='NAME_CONTRACT_STATUS',hue='NAME_HOUSING_TYPE',data=merged_df0)
plt.xticks(rotation=90)
plt.subplot(1,2,2)
plt.title('Distribution of Defaulter Category on\n'+'NAME_HOUSING_TYPE V/S NAME_CONTRACT_STATUS',fontsize=15)
sns.countplot(x='NAME_CONTRACT_STATUS',hue='NAME_HOUSING_TYPE',data=merged_df1)
plt.xticks(rotation=90)
plt.show()

# Insights:

- The bank has refused some loans across different categories even though the customers were actually capable of repaying. The bank may take other steps, such as offering a reduced loan amount to such customers, to increase profits.
- Plot 2 shows many loan approvals in the `House/Apartment` category for customers who later fall under the default category, which results in a loss for the bank. The bank should therefore note that having a `House/Apartment` doesn't necessarily imply that the customer will be able to repay the loan on time.

# Summary of Analysis:

## The bank should concentrate on the following categories while providing loans:

- The bank can take the risk of lending to applicants older than about 47, since our analysis found fewer defaulters in the older age groups than in the younger ones.
- The lower the client's education level, the higher the chance of falling in the defaulters list. This can be factored into lending decisions.
- Employees who have stayed in their current job for a longer period are less likely to default. This should be considered while providing a loan.
- Clients in the Apartment/House category still account for many defaulters, which needs to be considered to protect profits.
- The bank should reduce its focus on the ‘Working’ category, where we see a large percentage of people falling under defaulters.
- The Pensioner category is a good source of profitable lending.