
# CREDIT EDA case study






### Problem statement
Lending companies find it hard to approve loans for people with insufficient or non-existent credit history, and some consumers take advantage of this by defaulting.
When the company receives a loan application, it must decide whether to approve the loan based on the applicant’s profile. Two types of risk are associated with this decision:
- If the applicant is likely to repay the loan, then not approving the loan results in a loss of business to the company.
- If the applicant is not likely to repay the loan, i.e. he/she is likely to default, then approving the loan may lead to a financial loss for the company.


The business objective is to identify patterns that indicate whether a client has difficulty paying their installments.

## Table of contents:
* [Load Data](#load-data)
* [Understanding Data](#un-data)
    * [Data Imbalance](#data-imb)
* [Remove unnecessary columns](#remove-cols)
* [Set Index](#reset-ind)
* [Missing Values](#missing-val)
* [Classifying variables](#classfy-vars)
* [Cleaning and Handling Datatypes](#handling-dtypes)
* [Analysing for outliers](#outliers)
    * [Missing Values cntd...](#missing-val2)
* [Preparing Data](#prepare-data)
    * [Segmenting](#seg)
* [Analysis](#ana)
    * [Univariate analysis](#uni-ana)
    * [Bivariate and Multivariate analysis](#bi-ana)
    * [Correlation](#corr)
* [Previous application data](#prev_app)
* [Understanding data](#prev_und)
* [Dropping unnecessary columns](#prev_drop)
* [Missing values previous application data](#prev_missing)
* [Merging and Segmenting](#prev_merg)
* [Analysis of Previous and Current application data](#prev_ana)
* [Final Thoughts](#summary)

In [None]:
#importing necessary modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
#Font for graph labels, this is used for all graphs in this notebook
label_font = {"fontsize": 12, 
              "color" : "darkred",
             'weight': 'normal'}
#Font for graph title, this is used for all graphs in this notebook
title_font = {"fontsize": 15, 
              "color" : "darkred",
             'weight': 'normal'}

## Load data <a class="anchor" id="load-data"></a>



In [None]:
#Path to the application data CSV; adjust it if the file is stored elsewhere
app_df = pd.read_csv("../input/bank-loans-dataset/application_data.csv")
app_df.head()

## Understanding data <a class="anchor" id="un-data"></a>

To start understanding the dataset, we start by looking at the number of records and type of columns.

In [None]:
app_df.shape

In [None]:
#show_counts replaces the null_counts argument, which newer pandas versions no longer accept
app_df.info(verbose = True, show_counts=True)

In [None]:
#Types of loans
app_df["NAME_CONTRACT_TYPE"].value_counts()

- A cash loan is a term loan that the borrower receives as money and repays in fixed installments over a term.
- A revolving loan lets a client withdraw within a credit limit, repay, and withdraw again if necessary, e.g. credit cards.

In [None]:
#How many of them have defaulted for Revolving loans
app_df[(app_df["NAME_CONTRACT_TYPE"] == "Revolving loans") & (app_df["TARGET"] == 1)].shape[0]

In [None]:
#How many of them have defaulted for Cash loans
app_df[(app_df["NAME_CONTRACT_TYPE"] == "Cash loans") & (app_df["TARGET"] == 1)].shape[0]

### Data Imbalance <a class="anchor" id="data-imb"></a>

In [None]:
#Percentage of defaulter records in the given dataset
(app_df[(app_df["TARGET"] == 1)].shape[0]/app_df.shape[0])*100

With the minority class making up approximately 8% of the data, the dataset is moderately imbalanced.
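
The same check can be written with `value_counts(normalize=True)`, which returns the share of every class in one call. The snippet below is a minimal sketch on a made-up toy frame (`toy_df` is a stand-in for `app_df`, not the real data), so the numbers it prints are illustrative only.

```python
import pandas as pd

# Toy stand-in for app_df: 1 = defaulter, 0 = all other cases (made-up values)
toy_df = pd.DataFrame({"TARGET": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]})

# Share of each class as a percentage of all rows
class_share = toy_df["TARGET"].value_counts(normalize=True) * 100

# Ratio of the majority class to the minority class
imbalance_ratio = class_share.max() / class_share.min()

print(class_share)
print(round(imbalance_ratio, 1))
```

Reporting the ratio next to the percentage makes it easier to compare the imbalance across datasets of different sizes.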


## Remove unnecessary columns <a class="anchor" id="remove-cols"></a>

The mean, median and mode values describing the customers' living conditions do not seem useful for answering the question at hand, and all of these columns have 50% or more missing values. Hence these columns are dropped:

 'APARTMENTS_AVG',
       'BASEMENTAREA_AVG', 'YEARS_BEGINEXPLUATATION_AVG',
       'YEARS_BUILD_AVG', 'COMMONAREA_AVG', 'ELEVATORS_AVG',
       'ENTRANCES_AVG', 'FLOORSMAX_AVG', 'FLOORSMIN_AVG', 'LANDAREA_AVG',
       'LIVINGAPARTMENTS_AVG', 'LIVINGAREA_AVG',
       'NONLIVINGAPARTMENTS_AVG', 'NONLIVINGAREA_AVG', 'APARTMENTS_MODE',
       'BASEMENTAREA_MODE', 'YEARS_BEGINEXPLUATATION_MODE',
       'YEARS_BUILD_MODE', 'COMMONAREA_MODE', 'ELEVATORS_MODE',
       'ENTRANCES_MODE', 'FLOORSMAX_MODE', 'FLOORSMIN_MODE',
       'LANDAREA_MODE', 'LIVINGAPARTMENTS_MODE', 'LIVINGAREA_MODE',
       'NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAREA_MODE',
       'APARTMENTS_MEDI', 'BASEMENTAREA_MEDI',
       'YEARS_BEGINEXPLUATATION_MEDI', 'YEARS_BUILD_MEDI',
       'COMMONAREA_MEDI', 'ELEVATORS_MEDI', 'ENTRANCES_MEDI',
       'FLOORSMAX_MEDI', 'FLOORSMIN_MEDI', 'LANDAREA_MEDI',
       'LIVINGAPARTMENTS_MEDI', 'LIVINGAREA_MEDI',
       'NONLIVINGAPARTMENTS_MEDI', 'NONLIVINGAREA_MEDI',
       'FONDKAPREMONT_MODE', 'HOUSETYPE_MODE', 'TOTALAREA_MODE',
       'WALLSMATERIAL_MODE', 'EMERGENCYSTATE_MODE',

In [None]:
#Drop the 47 living-condition columns listed above (column positions 44 to 90)
app_df.drop(columns = app_df.columns[44:91], inplace = True)

In [None]:
app_df.info(verbose = True, show_counts=True)

The credit bureau enquiry columns can be reduced to one column by adding the HOUR, DAY, MONTH, QUARTER and YEAR values, after which the following columns can be dropped (note that the WEEK column is dropped without being added to the total).

AMT_REQ_CREDIT_BUREAU_HOUR, AMT_REQ_CREDIT_BUREAU_DAY, AMT_REQ_CREDIT_BUREAU_WEEK, AMT_REQ_CREDIT_BUREAU_MON, AMT_REQ_CREDIT_BUREAU_QRT, AMT_REQ_CREDIT_BUREAU_YEAR

But before adding them up, note that approximately 14% of the values seem to be missing in all of these columns. We need to check why these values are missing and whether we can do something about them.
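
How the missing values behave depends on how the columns are added. The sketch below uses a made-up two-column frame (only the column names are taken from the dataset) to compare the `+` operator with `DataFrame.sum`; it is an illustration of pandas behaviour, not a recommendation for this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; the last row has no enquiry data at all
toy_df = pd.DataFrame({
    "AMT_REQ_CREDIT_BUREAU_DAY": [0.0, 1.0, np.nan],
    "AMT_REQ_CREDIT_BUREAU_YEAR": [2.0, np.nan, np.nan],
})
cols = list(toy_df.columns)

# The "+" operator propagates NaN: one missing part makes the total missing
plus_total = toy_df[cols[0]] + toy_df[cols[1]]

# sum(axis=1) skips NaN by default, so an all-missing row becomes 0.0
skipna_total = toy_df[cols].sum(axis=1)

# min_count=1 keeps the total missing only when every part is missing
min_count_total = toy_df[cols].sum(axis=1, min_count=1)

print(pd.DataFrame({"plus": plus_total, "sum": skipna_total, "min_count": min_count_total}))
```

With `+`, a row stays missing whenever any part is missing, which keeps "no data" distinguishable from "zero enquiries".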

In [None]:
app_df[app_df["AMT_REQ_CREDIT_BUREAU_YEAR"].isna()].head()

In [None]:
app_df[app_df["AMT_REQ_CREDIT_BUREAU_YEAR"].isna()].tail()

<a class="anchor" id="cbe_mnar"></a>It seems that if the data for one credit enquiry column is missing, it is missing for all credit enquiry columns. This pattern could point to a human/machine error, or the data may simply not be available.

#### This is a case of Missing not at random
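
Looking at the head and tail only samples a few rows. A sketch of a check over every row, assuming a made-up toy frame in place of `app_df`, compares the number of rows with any enquiry value missing against the number with all of them missing:

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values standing in for the enquiry columns
toy_df = pd.DataFrame({
    "AMT_REQ_CREDIT_BUREAU_HOUR": [0.0, np.nan, 1.0, np.nan],
    "AMT_REQ_CREDIT_BUREAU_DAY": [0.0, np.nan, 0.0, np.nan],
    "AMT_REQ_CREDIT_BUREAU_YEAR": [3.0, np.nan, 2.0, np.nan],
})

missing_mask = toy_df.isna()
rows_any_missing = missing_mask.any(axis=1).sum()   # at least one column missing
rows_all_missing = missing_mask.all(axis=1).sum()   # every column missing

# If the two counts match, the columns are always missing together
missing_together = rows_any_missing == rows_all_missing
print(rows_any_missing, rows_all_missing, missing_together)
```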

In [None]:
#Are there any defaulters for whom the credit bureau enquiry data is missing?
app_df[app_df["AMT_REQ_CREDIT_BUREAU_YEAR"].isna()]["TARGET"].sum()

In [None]:
app_df[app_df["AMT_REQ_CREDIT_BUREAU_YEAR"].isna()]["TARGET"].sum()/app_df["TARGET"].sum()

#### 17% of the defaulters are missing the credit enquiry data

In [None]:
# create a new column adding up all credit enquiry values
app_df["CREDIT_BUREAU_ENQ_YEAR"] = app_df["AMT_REQ_CREDIT_BUREAU_HOUR"]+app_df["AMT_REQ_CREDIT_BUREAU_DAY"]+app_df["AMT_REQ_CREDIT_BUREAU_MON"]+app_df["AMT_REQ_CREDIT_BUREAU_QRT"]+app_df["AMT_REQ_CREDIT_BUREAU_YEAR"]

In [None]:
app_df.head()

In [None]:
app_df.drop(app_df.iloc[:, 69:75], inplace = True, axis = 1)

In [None]:
app_df.info()

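- OBS_30_CNT_SOCIAL_CIRCLE (float64)
- DEF_30_CNT_SOCIAL_CIRCLE (float64)
- OBS_60_CNT_SOCIAL_CIRCLE (float64)
- DEF_60_CNT_SOCIAL_CIRCLE (float64)

The above columns can be aggregated into one column: the two DEF counts are added together and the four original columns are dropped.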

In [None]:
app_df["DEF_IN_SOCIAL_CIRCLE"] = app_df["DEF_30_CNT_SOCIAL_CIRCLE"]+app_df["DEF_60_CNT_SOCIAL_CIRCLE"]
app_df[["DEF_30_CNT_SOCIAL_CIRCLE","DEF_60_CNT_SOCIAL_CIRCLE","DEF_IN_SOCIAL_CIRCLE"]].head()

In [None]:
app_df.drop(["OBS_30_CNT_SOCIAL_CIRCLE", "DEF_30_CNT_SOCIAL_CIRCLE", "OBS_60_CNT_SOCIAL_CIRCLE",
             "DEF_60_CNT_SOCIAL_CIRCLE"], inplace = True, axis = 1)
app_df.head()

The columns OWN_CAR_AGE, FLAG_MOBIL, FLAG_EMP_PHONE, FLAG_WORK_PHONE, FLAG_CONT_MOBILE, FLAG_PHONE and FLAG_EMAIL are not useful for answering the question.

In [None]:
app_df.drop(["OWN_CAR_AGE", "FLAG_MOBIL", "FLAG_EMP_PHONE", "FLAG_WORK_PHONE", "FLAG_CONT_MOBILE",
             "FLAG_PHONE", "FLAG_EMAIL"], inplace = True, axis = 1)

In [None]:
app_df.head(2)

The next set of columns, which provide information about document submission, also do not seem useful for our analysis, hence the columns listed below will be removed.

'FLAG_DOCUMENT_2', 'FLAG_DOCUMENT_3',
       'FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_6',
       'FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_8', 'FLAG_DOCUMENT_9',
       'FLAG_DOCUMENT_10', 'FLAG_DOCUMENT_11', 'FLAG_DOCUMENT_12',
       'FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_14', 'FLAG_DOCUMENT_15',
       'FLAG_DOCUMENT_16', 'FLAG_DOCUMENT_17', 'FLAG_DOCUMENT_18',
       'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20', 'FLAG_DOCUMENT_21'

In [None]:
app_df.drop(['FLAG_DOCUMENT_2', 'FLAG_DOCUMENT_3',
       'FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_6',
       'FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_8', 'FLAG_DOCUMENT_9',
       'FLAG_DOCUMENT_10', 'FLAG_DOCUMENT_11', 'FLAG_DOCUMENT_12',
       'FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_14', 'FLAG_DOCUMENT_15',
       'FLAG_DOCUMENT_16', 'FLAG_DOCUMENT_17', 'FLAG_DOCUMENT_18',
       'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20', 'FLAG_DOCUMENT_21'], inplace = True, axis = 1)

In [None]:
app_df.info()

<a class="anchor" id="ext_mar"></a>For the normalized external score columns EXT_SOURCE_1 and EXT_SOURCE_3, ~56% and ~19% of the data is missing, so these two columns are dropped. For EXT_SOURCE_2 we have more than 90% of the data available.

In [None]:
app_df.drop(['EXT_SOURCE_1','EXT_SOURCE_3'], inplace = True, axis = 1)

In [None]:
app_df.info()

The columns listed below do not seem to be useful:

NAME_TYPE_SUITE,
WEEKDAY_APPR_PROCESS_START,
HOUR_APPR_PROCESS_START,
REG_REGION_NOT_LIVE_REGION,
REG_REGION_NOT_WORK_REGION,
LIVE_REGION_NOT_WORK_REGION,
REG_CITY_NOT_LIVE_CITY,
REG_CITY_NOT_WORK_CITY,
LIVE_CITY_NOT_WORK_CITY            

In [None]:
app_df.drop(["NAME_TYPE_SUITE",
"WEEKDAY_APPR_PROCESS_START",
"HOUR_APPR_PROCESS_START",
"REG_REGION_NOT_LIVE_REGION",
"REG_REGION_NOT_WORK_REGION",
"LIVE_REGION_NOT_WORK_REGION",
"REG_CITY_NOT_LIVE_CITY",
"REG_CITY_NOT_WORK_CITY",
"LIVE_CITY_NOT_WORK_CITY"], inplace = True, axis = 1)

In [None]:
app_df.info()

## Set Index <a class="anchor" id="reset-ind"></a>

Setting the customer application ID as the Index for ease of access and ease of merging with previous application dataset.
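
To illustrate why an ID index helps with merging, here is a minimal sketch of an index-based join. The frames `app` and `prev` and all their values are hypothetical stand-ins (the previous application data is not loaded at this point), and the status column is only an assumed example.

```python
import pandas as pd

# Hypothetical current applications, indexed by the application ID
app = pd.DataFrame({
    "SK_ID_CURR": [100002, 100003, 100004],
    "TARGET": [1, 0, 0],
}).set_index("SK_ID_CURR")

# Hypothetical previous applications; one customer can appear several times
prev = pd.DataFrame({
    "SK_ID_CURR": [100002, 100003, 100003],
    "NAME_CONTRACT_STATUS": ["Approved", "Approved", "Refused"],
})

# join() matches prev's SK_ID_CURR column against app's index
merged = prev.join(app, on="SK_ID_CURR", how="inner")
print(merged)
```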

In [None]:
app_df.set_index("SK_ID_CURR", inplace = True)

In [None]:
app_df.head(2)

## Missing values <a class="anchor" id="missing-val"></a>

Missing values in a dataset can be of three types:
- MCAR: Missing completely at random. The reason a value is missing does not depend on any feature, observed or not.
- MAR: Missing at random. The reason a value is missing may be associated with other observed features.
- MNAR: Missing not at random. The reason a value is missing is related to the missing value itself or to something that was not recorded.

We saw an example of MNAR [CREDIT_BUREAU enquiry columns](#cbe_mnar), we aggregated these columns into one column for the ease of analysis.

We also saw that the [External Source Scoring columns](#ext_mar) had as much as 56% and 19% missing values.
We handled this situation by dropping the columns.

There are multiple ways of handling missing values:
- Leave the values marked as missing
- Drop rows and columns
- Impute values

If rows/columns with missing values are deleted, we may lose valuable information. Imputing may introduce data exaggeration and imbalance.
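
The three options listed above can be sketched as follows. The frame is a made-up toy example (only the column names come from the dataset), and the "Missing" label is an arbitrary choice.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values
toy_df = pd.DataFrame({
    "OCCUPATION_TYPE": ["Laborers", None, "Sales staff", None],
    "AMT_ANNUITY": [24000.0, 18000.0, np.nan, 30000.0],
})

# Option 1: keep the gap visible as its own category
flagged = toy_df["OCCUPATION_TYPE"].fillna("Missing")

# Option 2: drop the rows that have no value in a chosen column
dropped = toy_df.dropna(subset=["AMT_ANNUITY"])

# Option 3: impute a numeric column with its median
imputed = toy_df["AMT_ANNUITY"].fillna(toy_df["AMT_ANNUITY"].median())

print(flagged.tolist())
print(len(dropped))
print(imputed.tolist())
```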

In [None]:
#Percentage of missing values after removing the columns that are NOT useful for analysis
round(100*(app_df.isnull().sum()/app_df.shape[0]),4)

We see that OCCUPATION_TYPE has the highest percentage of missing values at 31%, followed by CREDIT_BUREAU_ENQ_YEAR at 13.5%. All the other columns have less than 1% missing values, which is negligible and will not greatly impact the analysis.

<b>Handling OCCUPATION_TYPE</b>




In [None]:
#Setting option to display all columns
pd.set_option('display.max_columns', 31)
#display customers with missing occupation type
app_df[app_df["OCCUPATION_TYPE"].isna()].head()

In [None]:
#display customers with missing occupation type
app_df[app_df["OCCUPATION_TYPE"].isna()].tail()

There <s>does not seem</s> **seems** to be a pattern with the missing values here. This is a case of Missing At Random.

OCCUPATION_TYPE is a categorical variable with ~31% missing values.
<s>One of the possible ways of handling this variable is by imputing the missing values with the most frequently occurring category (Mode).</s>

We came across a pattern in the missing values of the OCCUPATION_TYPE column, which is analysed further <a href="#missing-val2" class="anchor" id="back_to_missing">here</a>.

<b>Handling CREDIT_BUREAU_ENQ_YEAR</b>


In [None]:
app_df["CREDIT_BUREAU_ENQ_YEAR"].describe()

In [None]:
app_df[app_df["CREDIT_BUREAU_ENQ_YEAR"].isna()].head()

In [None]:
app_df[app_df["CREDIT_BUREAU_ENQ_YEAR"].isna()].tail()

Again, there does not seem to be an obvious pattern in the missing values of the credit bureau enquiry data.

As this is a numeric/continuous variable, we can impute the values with the mean or the median. However, as the max value (256) is far away from the 75th percentile, the median would be a better choice for imputation.


A safer way to handle these missing values would be to bin them as a separate "missing" category.
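
A minimal sketch of the binning idea, assuming made-up enquiry counts and arbitrary bin edges and labels (none of these are taken from the analysis): the numeric values are binned with `pd.cut`, and the rows without data get an explicit category instead of an imputed number.

```python
import numpy as np
import pandas as pd

# Made-up enquiry counts; NaN marks a customer without bureau data
enquiries = pd.Series([0.0, 1.0, 4.0, np.nan, 12.0], name="CREDIT_BUREAU_ENQ_YEAR")

# Bin edges and labels are illustrative assumptions
binned = pd.cut(enquiries, bins=[-1, 0, 5, np.inf], labels=["None", "1-5", "6+"])

# Add an explicit category for the rows that had no data
binned = binned.cat.add_categories("Missing").fillna("Missing")

print(binned.tolist())
```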

## Classifying variables <a class="anchor" id="classfy-vars"></a>

In [None]:
app_df.info()

In [None]:
app_df['REGION_RATING_CLIENT'].value_counts()

In [None]:
app_df['REGION_RATING_CLIENT_W_CITY'].value_counts()

In [None]:
app_df['DAYS_REGISTRATION'].value_counts()

At first look there seem to be 10 categorical variables and 20 continuous variables. But on a closer look, REGION_RATING_CLIENT and REGION_RATING_CLIENT_W_CITY are categorical rather than continuous variables. We can change the datatypes of these two columns to object (string).

The columns DAYS_REGISTRATION and CNT_FAM_MEMBERS seem to have been wrongly identified as floats. We can change their types to integer.

With this understanding we have 12 categorical variables and 18 continuous variables for analysis.
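
Numeric columns that behave like categories can also be spotted by counting their distinct values. The sketch below runs on a made-up toy frame, and the `max_levels` threshold is an arbitrary assumption that would need tuning on the real data.

```python
import pandas as pd

# Toy frame with made-up values
toy_df = pd.DataFrame({
    "REGION_RATING_CLIENT": [1, 2, 2, 3, 2, 1],
    "AMT_CREDIT": [406597.5, 1293502.5, 135000.0, 312682.5, 513000.0, 490495.5],
    "NAME_CONTRACT_TYPE": ["Cash loans"] * 5 + ["Revolving loans"],
})

# Assumed threshold: numeric columns with this many levels or fewer are candidates
max_levels = 3

numeric_cols = toy_df.select_dtypes(include="number").columns
levels = toy_df[numeric_cols].nunique()
categorical_like = levels[levels <= max_levels].index.tolist()

print(categorical_like)
```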

## Cleaning and Handling Datatypes <a class="anchor" id="handling-dtypes"></a>

In [None]:
#Converting REGION_RATING_CLIENT and REGION_RATING_CLIENT_W_CITY into object
app_df['REGION_RATING_CLIENT'] = app_df['REGION_RATING_CLIENT'].astype(object)
app_df['REGION_RATING_CLIENT_W_CITY'] = app_df['REGION_RATING_CLIENT_W_CITY'].astype(object)
app_df.dtypes

In [None]:
# converting DAYS_REGISTRATION and CNT_FAM_MEMBERS to integers
'''app_df['DAYS_REGISTRATION'] = app_df['DAYS_REGISTRATION'].astype(int)
app_df['CNT_FAM_MEMBERS'] = app_df['CNT_FAM_MEMBERS'].astype(int)
app_df.dtypes'''

We encountered a ValueError when trying to convert the above variables to integer:

Cannot convert non-finite values (NA or inf) to integer


Analysing the root cause, CNT_FAM_MEMBERS seems to have some missing values.
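
For reference, pandas also offers a nullable integer dtype, `"Int64"`, that can hold missing values. The sketch below, on a made-up series, reproduces the failure and shows that alternative; it is not the approach taken in this notebook, which drops the affected rows instead.

```python
import numpy as np
import pandas as pd

# Made-up family sizes with one missing value
fam_members = pd.Series([1.0, 2.0, np.nan, 4.0], name="CNT_FAM_MEMBERS")

# Plain int cannot represent NaN, so the conversion raises an error
try:
    fam_members.astype(int)
    conversion_failed = False
except (ValueError, TypeError):
    conversion_failed = True

# The nullable "Int64" dtype keeps the gap as <NA> and stores the rest as integers
nullable = fam_members.astype("Int64")

print(conversion_failed)
print(nullable)
```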

In [None]:
app_df[app_df["CNT_FAM_MEMBERS"].isna()]

In [None]:
#There are only two rows with missing values, hence dropping the CNT_FAM_MEMBERS missing value rows

app_df.drop(app_df[app_df["CNT_FAM_MEMBERS"].isna()].index.tolist(), inplace = True, axis = 0)


In [None]:
#Again trying to convert the datatypes
app_df['DAYS_REGISTRATION'] = app_df['DAYS_REGISTRATION'].astype(int)
app_df['CNT_FAM_MEMBERS'] = app_df['CNT_FAM_MEMBERS'].astype(int)
app_df.dtypes

Some of the numeric variables have negative values. It makes more sense to convert these into positive numbers.

- DAYS_BIRTH
- DAYS_EMPLOYED
- DAYS_REGISTRATION
- DAYS_ID_PUBLISH

In [None]:
#Converting negative values to positive and converting the scale from days to years.
app_df['DAYS_BIRTH']=abs(app_df['DAYS_BIRTH'])/365
app_df['DAYS_BIRTH'].describe()

In [None]:
#Converting negative values to positive and converting the scale from days to years.
app_df['DAYS_EMPLOYED']=abs(app_df['DAYS_EMPLOYED'])/365
app_df['DAYS_EMPLOYED'].describe()

The maximum value is 1000 years and there is a huge difference between mean and median, there seem to be some invalid data and outliers in DAYS_EMPLOYED variable. 

In [None]:
#Converting negative values to positive and converting the scale from days to years.
app_df['DAYS_REGISTRATION']=abs(app_df['DAYS_REGISTRATION'])/365
app_df['DAYS_REGISTRATION'].describe()

In [None]:
#Converting negative values to positive and converting the scale from days to years. 
app_df['DAYS_ID_PUBLISH']=abs(app_df['DAYS_ID_PUBLISH'])/365
app_df['DAYS_ID_PUBLISH'].describe()

#### Checking DAYS_EMPLOYED variable

In [None]:
app_df["DAYS_EMPLOYED"].quantile([.25,.5,.6,.7,.75,.77,.8,.85,.9,.99])

There is a big jump from the 80th percentile to the 85th percentile. A person employed for more than 85 years seems illogical.

In [None]:
app_df[app_df["DAYS_EMPLOYED"] > 85].shape[0]/app_df.shape[0]

18% of the values in column DAYS_EMPLOYED are invalid.

In [None]:
# We have changed the scale of few columns from days to years.
# Renaming the columns with meaningful names after changing the scale.
app_df.rename(columns = {"DAYS_BIRTH":"AGE_IN_YEARS",
"DAYS_EMPLOYED":"YEARS_EMPLOYED",
"DAYS_REGISTRATION":"YEARS_REGISTRATION",
"DAYS_ID_PUBLISH":"YEARS_ID_PUBLISHED"}, inplace = True) 
app_df.columns.values

## Analysing for outliers<a class="anchor" id="outliers"></a>


According to the definition:
Outliers are values that lie far from the next nearest data points.

Major approaches in handling outliers are 
- Imputation
- Deletion of outliers
- Binning of values
- Capping the outliers
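
One common rule of thumb for flagging outliers, which is the same rule a box plot uses for its whiskers, is the 1.5 x IQR fence. The sketch below applies it to made-up income values and then caps the flagged values at the fences; the analysis in this notebook caps at a percentile instead.

```python
import pandas as pd

# Made-up income values with one extreme entry
income = pd.Series(
    [90000, 112500, 135000, 157500, 180000, 202500, 225000, 9000000],
    dtype=float,
)

# Interquartile range and the 1.5 x IQR fences
q1, q3 = income.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag the values outside the fences, then cap them at the fences
is_outlier = (income < lower) | (income > upper)
capped = income.clip(lower=lower, upper=upper)

print(is_outlier.sum(), upper, capped.max())
```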

#### Analysing AMT_INCOME_TOTAL

In [None]:
# Analysing AMT_INCOME_TOTAL
app_df.AMT_INCOME_TOTAL.describe()

In [None]:
plt.figure(figsize=(10,2))
#Pass the data as x= so the box is drawn along the log-scaled x axis
sns.boxplot(x=app_df.AMT_INCOME_TOTAL)
plt.title('Distribution of Income')
plt.xscale('symlog')
plt.show()

While the box plot definitely indicates that we have outliers, it's not clear how many there are or how extreme they are.
Looking at the quantiles might help.

In [None]:
app_df["AMT_INCOME_TOTAL"].quantile([0.5, 0.7,.8, 0.9,0.95,0.99, 1])

Looking at the quantiles, the top 0.1% of values seem to be the outliers, reaching approximately 1.17x10^8.
Large total income values are definitely valid; however, they can mislead the analysis. We can resolve this by capping them at a reasonable value.

In [None]:
app_df[app_df["AMT_INCOME_TOTAL"] > app_df["AMT_INCOME_TOTAL"].quantile(0.99)]["AMT_INCOME_TOTAL"].value_counts()

In [None]:
# Resolving outliers by capping the values at 99th percentile
cap_val = app_df["AMT_INCOME_TOTAL"].quantile(0.99)
app_df["AMT_INCOME_TOTAL"] = app_df["AMT_INCOME_TOTAL"].clip(upper=cap_val)

In [None]:
app_df["AMT_INCOME_TOTAL"].quantile([0.5, 0.7,.8, 0.9,0.95,0.99, 1])

#### Analysing AMT_CREDIT

In [None]:
app_df.AMT_CREDIT.describe()

There is little difference between mean and median

In [None]:
app_df["AMT_CREDIT"].quantile([0.5, 0.7, 0.9,0.95,0.99, 1])

In [None]:
plt.figure(figsize=(10,2))
sns.boxplot(x=app_df["AMT_CREDIT"])
plt.title('Distribution of Credit amount')
plt.show()

The outliers lie beyond 3 million. We can cap the values at the 99th percentile or drop them.

#### Analysing AMT_ANNUITY 

In [None]:
#describe AMT_ANNUITY 
app_df.AMT_ANNUITY.describe()

In [None]:
plt.figure(figsize=(9,2))
sns.boxplot(x=app_df.AMT_ANNUITY)
plt.title('Distribution of Annuity Amount')
plt.show()

Plotting the AMT_ANNUITY column shows there are outliers above 200000.
As the mean and median are close, we can either replace the outliers with the median value or simply drop them.

#### Analysing YEARS_EMPLOYED

As we previously saw, 18% of the values of the variable YEARS_EMPLOYED are invalid.

In [None]:
app_df["YEARS_EMPLOYED"].describe()

In [None]:
plt.figure(figsize=(10,3))
sns.boxplot(x=app_df["YEARS_EMPLOYED"])
plt.title('Distribution of years employed')
plt.show()

The outliers here are invalid values: it is illogical for someone to be employed for 1000 years. Hence, these need to be treated as missing values and analysed further.
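
Treating the invalid values as missing can be sketched with `Series.where`, which keeps the values that satisfy a condition and sets the rest to NaN. The series below is made up, and the 85-year threshold mirrors the check used earlier in this notebook.

```python
import pandas as pd

# Made-up employment durations in years; 1000.67 stands for the invalid entries
years_employed = pd.Series([1.7, 8.3, 1000.67, 3.2, 1000.67], name="YEARS_EMPLOYED")

# Threshold borrowed from the earlier "> 85 years" check
max_plausible_years = 85

# where() keeps plausible values and replaces everything else with NaN
cleaned = years_employed.where(years_employed <= max_plausible_years)

print(cleaned.isna().sum())
print(years_employed.median(), cleaned.median())
```

Statistics such as the median skip NaN by default, so the invalid entries no longer distort them.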

In [None]:
app_df[app_df["YEARS_EMPLOYED"]>85].head(10)

In [None]:
app_df[app_df["YEARS_EMPLOYED"]>85][["NAME_INCOME_TYPE", "OCCUPATION_TYPE","ORGANIZATION_TYPE"]]

### Missing Values cntd...<a class="anchor" id="missing-val2"></a>

The data indicates that the value 1000 for employment years has been used to mark retired customers.
This analysis also leads us to the missing values of the "OCCUPATION_TYPE" and "ORGANIZATION_TYPE" variables.

We can handle these missing values by imputing them with a more meaningful term.

In [None]:
#Label pensioners as "Retired"; a boolean mask avoids a slow row-wise apply
app_df.loc[app_df["NAME_INCOME_TYPE"] == "Pensioner", "OCCUPATION_TYPE"] = "Retired"


In [None]:
#Pensioners have no employer, so set their organization type to "None"
app_df.loc[app_df["NAME_INCOME_TYPE"] == "Pensioner", "ORGANIZATION_TYPE"] = "None"

In [None]:
app_df[app_df["YEARS_EMPLOYED"]>85][["NAME_INCOME_TYPE", "OCCUPATION_TYPE","ORGANIZATION_TYPE"]]

In [None]:
#Recalculating missing values
round(100*(app_df.isnull().sum()/app_df.shape[0]),4)

If you navigated here from the missing values section, <a href="#back_to_missing" class="anchor">continue to missing values</a>.

## Preparing Data <a class="anchor" id="prepare-data"></a>

A histogram is a very helpful visual for understanding the distribution of variables. Before we start analysing the influence of the different variables on the target variable, we need to translate our numeric data into categorical data, as that is the best way of using it with histograms. This process is known as **Binning**.

We can consider the below mentioned variables for binning.
- AMT_INCOME_TOTAL                
- AMT_CREDIT                      
- AMT_ANNUITY
- AGE_IN_YEARS
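
pandas offers two binning helpers that behave differently: `pd.qcut` places the edges at quantiles, so the bin sizes follow the chosen percentages, while `pd.cut` uses fixed value edges, so the bin sizes depend on the distribution. The sketch below contrasts them on made-up amounts; the edges in the `pd.cut` call are arbitrary assumptions.

```python
import pandas as pd

# Made-up amounts with one large value
amounts = pd.Series([10, 20, 30, 40, 50, 60, 70, 1000], dtype=float)
labels = ["Low", "Medium", "High"]

# qcut: edges at the 25th and 75th percentiles
by_quantile = pd.qcut(amounts, q=[0, 0.25, 0.75, 1.0], labels=labels)

# cut: fixed edges chosen by hand (assumed values)
by_value = pd.cut(amounts, bins=[0, 100, 500, 1000], labels=labels)

print(by_quantile.value_counts(sort=False))
print(by_value.value_counts(sort=False))
```

On skewed data the fixed edges leave most rows in a single bin, which is one reason quantile edges suit the amount columns, while fixed edges suit a variable like age where round decade boundaries are easier to read.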

In [None]:
#Check for the value range in AMT_INCOME_TOTAL
app_df['AMT_INCOME_TOTAL'].describe()

In [None]:
# Creating bins for Income Amount
bins = [0, .25, .75, 1.]
labels = ['Low','Medium','High']
app_df['INCOME_AMT_RANGE']=pd.qcut(app_df['AMT_INCOME_TOTAL'],bins,labels=labels)

In [None]:
app_df[['AMT_INCOME_TOTAL', 'INCOME_AMT_RANGE']].head()

In [None]:
#Check for the value range in AMT_CREDIT
app_df['AMT_CREDIT'].describe()

In [None]:
#Binning Credit Amount
bins = [0, .25, .75, 1.]
labels = ['Low','Medium','High']
app_df['CREDIT_AMT_RANGE']=pd.qcut(app_df['AMT_CREDIT'],q=bins,labels=labels)

In [None]:
app_df[['AMT_CREDIT', 'CREDIT_AMT_RANGE']].head()

In [None]:
#Check for the value range in AMT_ANNUITY
app_df['AMT_ANNUITY'].describe()

In [None]:
#Binning Annuity Amount
bins = [0, .25, .85, 1.]
labels = ['Low','Medium','High']
app_df['ANNUITY_AMT_RANGE']=pd.qcut(app_df['AMT_ANNUITY'],q=bins,labels=labels)

In [None]:
app_df[["ANNUITY_AMT_RANGE","AMT_ANNUITY"]].head(10)

In [None]:
#Check for the value range in AGE_IN_YEARS
app_df['AGE_IN_YEARS'].describe()

In [None]:
np.arange(10,90,10,int)

In [None]:
#Binning Age
bins = [10, 20, 30, 40, 50, 60, 70, 80]
labels = ['10-20','20-30','30-40','40-50','50-60','60-70','70-80']
app_df['AGE_RANGE']=pd.cut(app_df['AGE_IN_YEARS'],bins=bins,labels=labels)

In [None]:
app_df[['AGE_IN_YEARS', 'AGE_RANGE']].head()

In [None]:
#Checking Dataframe.
app_df.head()

### Segmenting <a class="anchor" id="seg"></a>

Segmenting the data is an important step towards understanding it through visualization. Here we segment the data based on the target variable into two dataframes: one with the defaulter data and one with all other cases.

As we saw when we checked for data imbalance, around 8% of the customers have defaulted on payment.

In [None]:
# All the customers who have defaulted
defaulter_df = app_df[app_df['TARGET']==1]
# All other customers
other_df = app_df[app_df['TARGET']==0]

In [None]:
#Checking defaulter dataframe
defaulter_df.head()

In [None]:
#Checking all other cases dataframe
other_df.head()

## Analysis<a class="anchor" id="ana"></a>

In [None]:
#Checking the datatypes
app_df.dtypes

### Univariate Analysis<a class="anchor" id="uni-ana"></a>

Univariate Analysis is where we analyse one variable at a time. We can analyse both numerical and categorical variables.
Some of the categorical variables to analyse could be

- ORGANIZATION_TYPE                
- NAME_CONTRACT_TYPE    
- CODE_GENDER      
- FLAG_OWN_REALTY  
- OCCUPATION_TYPE 
- NAME_INCOME_TYPE                 
- NAME_EDUCATION_TYPE              
- NAME_FAMILY_STATUS               
- NAME_HOUSING_TYPE                
- INCOME_AMT_RANGE               
- CREDIT_AMT_RANGE               
- ANNUITY_AMT_RANGE              
- AGE_RANGE                      

Some of the numerical variables to analyse could be
 - CNT_CHILDREN
 - YEARS_EMPLOYED             
 - YEARS_ID_PUBLISHED
 - CNT_FAM_MEMBERS
 - CREDIT_BUREAU_ENQ_YEAR
 - AMT_GOODS_PRICE

In [None]:
# Comparing the occupation type between the two segments
plt.figure(figsize = (15, 10))
plt.suptitle("Influence of occupation", fontsize=18, color="darkred", weight='normal')
plt.subplot(2, 2, 1)
plt.title('Defaulters', title_font)
defaulter_df['OCCUPATION_TYPE'].value_counts().plot.bar()
plt.xlim([-1,10])
plt.xlabel("Occupation Type", label_font) #Set x axis label
plt.ylabel("Total Count", label_font) #Set y axis label
plt.xticks(rotation='vertical')

# subplot 2
plt.subplot(2, 2, 2)
plt.title('All other cases', title_font)
other_df['OCCUPATION_TYPE'].value_counts().plot.bar()
plt.xlim([-1,10])
plt.xlabel("Occupation Type", label_font) #Set x axis label
plt.ylabel("Total Count", label_font) #Set y axis label
plt.xticks(rotation='vertical')
plt.show()

**Inference :** Laborers are the largest category among defaulters, followed by sales staff and the retired group. The 2nd and 3rd categories each have approximately half as many defaulters as the first.
However, when we consider both segments, a high percentage of Laborers have repaid the loan on time. Hence, from this data we cannot conclusively say that Laborers have a higher probability of defaulting; they are equally likely to make their payments on time.

The Retired/Pensioners group of customers is more likely to make timely payments than to default.
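
Because the two segments differ greatly in size, comparing raw counts across the two bar charts can be hard to read. A possible numeric cross-check is the default rate per category: since TARGET is 0 or 1, its mean within a group is the share of defaulters. The sketch below uses a made-up toy frame, so its rates say nothing about the real data.

```python
import pandas as pd

# Toy frame with made-up values
toy_df = pd.DataFrame({
    "OCCUPATION_TYPE": ["Laborers"] * 5 + ["Sales staff"] * 3 + ["Retired"] * 2,
    "TARGET": [1, 1, 0, 0, 0, 1, 0, 0, 0, 0],
})

# Count, number of defaulters and default rate for every occupation
summary = (
    toy_df.groupby("OCCUPATION_TYPE")["TARGET"]
    .agg(applications="count", defaulters="sum", default_rate="mean")
    .sort_values("default_rate", ascending=False)
)

print(summary)
```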

In [None]:
#Plotting gender type
sns.countplot(x = "CODE_GENDER", hue = 'TARGET', data = app_df)
plt.title('Influence of Gender type', title_font)
plt.xlabel("Gender", label_font) #Set x axis label
plt.ylabel("Total Count", label_font) #Set y axis label
plt.show()

**Inference :** In the data we have, the female group has more customers who pay on time than the male group, and also slightly more customers who default. The female group leads in both segments.

In [None]:
# Comparing the income type between the two segments
plt.figure(figsize = (15, 10))
plt.suptitle("Influence of Income type", fontsize=18, color="darkred", weight='normal')
plt.subplot(2, 2, 1)
plt.title('Defaulters', title_font)
defaulter_df['NAME_INCOME_TYPE'].value_counts().plot.bar()
plt.xlabel("Income Type", label_font) #Set x axis label
plt.ylabel("Total Count", label_font) #Set y axis label
plt.xticks(rotation='vertical')

# subplot 2
plt.subplot(2, 2, 2)
plt.title('All other cases', title_font)
other_df['NAME_INCOME_TYPE'].value_counts().plot.bar()
plt.xlabel("Income Type", label_font) #Set x axis label
plt.ylabel("Total Count", label_font) #Set y axis label
plt.xticks(rotation='vertical')
plt.show()

**Inference :** Again, there is no strong pattern with income type. The working class is equally probable to pay on time or default. The Student and Businessman categories have never defaulted.

In [None]:
# Comparing the education level between the two segments
plt.figure(figsize = (15, 10))
plt.suptitle("Influence of Education level", fontsize=18, color="darkred", weight='normal')
plt.subplot(2, 2, 1)
plt.title('Defaulters', title_font)
defaulter_df['NAME_EDUCATION_TYPE'].value_counts().plot.bar()
plt.xlabel("Education level", label_font) #Set x axis label
plt.ylabel("Total Count", label_font) #Set y axis label
plt.xticks(rotation='vertical')

# subplot 2
plt.subplot(2, 2, 2)
plt.title('All other cases', title_font)
other_df['NAME_EDUCATION_TYPE'].value_counts().plot.bar()
plt.xlabel("Education level", label_font) #Set x axis label
plt.ylabel("Total Count", label_font) #Set y axis label
plt.xticks(rotation='vertical')
plt.show()

**Inference :** Customers with secondary education are equally probable to pay on time or default.
Customers with higher education are more probable to pay than to default. All other education levels are equally probable to pay or default on payment.

In [None]:
# Comparing the income range  between the two segments
plt.figure(figsize = (15, 10))
plt.suptitle("Influence of income range", fontsize=20, color="darkred", weight='normal')
plt.subplot(2, 2, 1)
plt.title('Defaulters', title_font)
sns.countplot(x=defaulter_df['INCOME_AMT_RANGE'],palette='muted')
plt.xlabel("Income Range", label_font) #Set x axis label
plt.ylabel("Total Count", label_font) #Set y axis label

# subplot 2
plt.subplot(2, 2, 2)
plt.title('All other cases', title_font)
sns.countplot(x=other_df['INCOME_AMT_RANGE'], palette='muted')
plt.xlabel("Income Range", label_font) #Set x axis label
plt.ylabel("Total Count", label_font) #Set y axis label
plt.show()

**Inference :** Clients in the medium income range are more probable to default than those in the other two income ranges (low, high), and clients in the high income category are the least probable to default on payments.

When we compare the income ranges between the two segments, the low and medium income ranges are almost equally probable to pay on time or default on payment, and the high income range is slightly more probable to pay on time than to default.
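
A row-normalised cross-tabulation puts both segments into one table, showing for each income range the percentage of customers in each TARGET class. This is a sketch on a made-up toy frame, offered as a possible complement to the paired count plots.

```python
import pandas as pd

# Toy frame with made-up values
toy_df = pd.DataFrame({
    "INCOME_AMT_RANGE": ["Low", "Low", "Medium", "Medium", "Medium", "Medium", "High", "High"],
    "TARGET": [0, 1, 0, 0, 0, 1, 0, 0],
})

# normalize="index" makes every row sum to 1; multiply by 100 for percentages
rate_table = pd.crosstab(
    toy_df["INCOME_AMT_RANGE"], toy_df["TARGET"], normalize="index"
) * 100

print(rate_table)
```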

In [None]:
# Comparing the credit range  between the two segments
plt.figure(figsize = (15, 10))
plt.suptitle("Influence of credit range", fontsize=18, color="darkred", weight='normal')
plt.subplot(2, 2, 1)
plt.title('Defaulters', title_font)
sns.countplot(x=defaulter_df['CREDIT_AMT_RANGE'],palette='muted')
plt.xlabel("Credit Range", label_font) #Set x axis label
plt.ylabel("Total Count", label_font) #Set y axis label

# subplot 2
plt.subplot(2, 2, 2)
plt.title('All other cases', title_font)
sns.countplot(x=other_df['CREDIT_AMT_RANGE'], palette='muted')
plt.xlabel("Credit Range", label_font) #Set x axis label
plt.ylabel("Total Count", label_font) #Set y axis label
plt.show()

**Inference :** Clients in the medium credit range are more probable to default than those in the other two credit ranges (low, high), and clients in the high credit category are the least probable to default on payments.

When we compare the credit ranges between the two segments, medium credit range customers are almost equally probable to pay on time or default on payment, while the low and high credit ranges are slightly more probable to pay on time than to default.

In [None]:
# Comparing the credit amount between the two segments
plt.figure(figsize = (15, 10))
plt.suptitle("Credit Amount analysis", fontsize=20, color="darkred", weight='normal')
plt.subplot(2, 2, 1)
plt.title('Defaulters', title_font)
plt.xlabel("Credit Amount", label_font) #Set x axis label
#plt.ylabel("Credit Amount", label_font) #Set y axis label
plt.hist(defaulter_df.AMT_CREDIT, bins = 15, rwidth=.8)
plt.xticks(rotation = 90)


# subplot 2
plt.subplot(2, 2, 2)
plt.title('All other cases', title_font)
plt.xlabel("Credit Amount", label_font) #Set x axis label
#plt.ylabel("Total Count", label_font) #Set y axis label
plt.hist(other_df.AMT_CREDIT, bins = 15, rwidth=.8)
plt.xticks(rotation = 90)
plt.show()

**Inference:** A larger number of the customers with a credit amount of around 500k have payment difficulties, while relatively more of the customers with credit below 500k are able to make timely payments.

In [None]:
# Comparing the age between the two segments
plt.figure(figsize = (15, 10))
plt.suptitle("Influence of age", fontsize=18, color="darkred", weight='normal')
plt.subplot(2, 2, 1)
plt.title('Defaulters', title_font)
sns.countplot(x=defaulter_df['AGE_RANGE'],palette='muted')
plt.xlabel("Age Range", label_font) #Set x axis label
plt.ylabel("Total Count", label_font) #Set y axis label

# subplot 2
plt.subplot(2, 2, 2)
plt.title('All other cases', title_font)
sns.countplot(x=other_df['AGE_RANGE'], palette='muted')
plt.xlabel("Age Range", label_font) #Set x axis label
plt.ylabel("Total Count", label_font) #Set y axis label
plt.show()

**Inference :** Customers in the 30-40 years age group are more likely to default on payment than customers in any other age group.
- Customers in the 20-30 years group are slightly more likely to default than to pay on time.
- Customers in the 30-40 years group are equally likely to default or pay on time.
- All other age groups are more likely to pay on time.
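
Count plots compare how each segment is distributed over the age groups. To measure how likely a group is to default, a per-group default rate is more direct. The sketch below uses a small invented frame (`toy_df` is hypothetical, not the notebook's `app_df`) and relies only on `TARGET` being 1 for payment difficulties.

```python
import pandas as pd

# Toy data standing in for app_df; all values are invented for illustration.
toy_df = pd.DataFrame({
    "AGE_RANGE": ["20-30", "20-30", "30-40", "30-40", "30-40", "40-50", "40-50", "40-50"],
    "TARGET":    [1,       0,       1,       0,       0,       0,       0,       1],
})

# TARGET is 1 for payment difficulties, so its mean per group is the default rate.
default_rate_by_age = (
    toy_df.groupby("AGE_RANGE")["TARGET"]
    .mean()
    .sort_values(ascending=False)
)
print(default_rate_by_age)
```

The same pattern should work for any categorical column, since it only needs a group key and the binary target.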

In [None]:
# Comparing the number of family members between the two segments
plt.figure(figsize = (15, 10))
plt.suptitle("No. of family members", fontsize=20, color="darkred", weight='normal')
plt.subplot(2, 2, 1)
plt.title('Defaulters', title_font)
sns.countplot(x=defaulter_df['CNT_FAM_MEMBERS'], palette='muted')
plt.xlim([-1,10])
plt.xlabel("No. of family members", label_font) #Set x axis label
plt.ylabel("Total Count", label_font) #Set y axis label

# subplot 2
plt.subplot(2, 2, 2)
plt.title('All other cases', title_font)
sns.countplot(x=other_df['CNT_FAM_MEMBERS'], palette='muted')
plt.xlim([-1,10])
plt.xlabel("No. of family members", label_font) #Set x axis label
plt.ylabel("Total Count", label_font) #Set y axis label
plt.show()

**Inference :** Looking at the graph, the number of family members of the customer does not seem to affect the payment capability, because we see both the segments have a similar pattern.

In [None]:
# Comparing the number of children between the two segments
plt.figure(figsize = (15, 10))
plt.suptitle("Number of children", fontsize=20, color="darkred", weight='normal')
plt.subplot(2, 2, 1)
plt.title('Defaulters', title_font)
plt.xlabel("No. of children", label_font) #Set x axis label
plt.ylabel("Total Count", label_font) #Set y axis label
sns.countplot(x=defaulter_df['CNT_CHILDREN'], palette='muted')
plt.xlim([-1,6])


# subplot 2
plt.subplot(2, 2, 2)
plt.title('All other cases', title_font)
plt.xlabel("No. of children", label_font) #Set x axis label
plt.ylabel("Total Count", label_font) #Set y axis label
sns.countplot(x=other_df['CNT_CHILDREN'], palette='muted')
plt.xlim([-1,6])
plt.show()


**Inference :** Looking at the graph, the number of children of the customer does not seem to affect the payment capability, because both segments follow a similar pattern: customers with no children form the majority in both segments, followed in the same order by customers with 1, 2, 3 or more children.

In [None]:
# Analysing years employed
# As we know, the column YEARS_EMPLOYED has an invalid value of 1000 years assigned to all pensioners.
# To remove the influence of the invalid value on our analysis, we omit the records with the
# invalid value.

emp_default = defaulter_df[defaulter_df["YEARS_EMPLOYED"]<100]["YEARS_EMPLOYED"]
emp_other = other_df[other_df["YEARS_EMPLOYED"]<100]["YEARS_EMPLOYED"]

In [None]:
#Excluding retired customers
print("-----Defaulters---")
print(emp_default.describe())
print("-----Others---")
print(emp_other.describe())

**Inference :** The mean and median do not fall very far apart for the defaulters. However, the percentiles and the maximum value indicate that the customers who **DID NOT default** were **employed for slightly more years** than the customers who **DID** default.
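
An alternative to filtering out the pensioner records is to mask the invalid value, so the rows stay available for other columns. This is a minimal sketch on invented data; the threshold of 100 years mirrors the filter used above, and `toy_df` is a hypothetical stand-in.

```python
import numpy as np
import pandas as pd

# Toy data; 1000 stands in for the invalid value assigned to pensioners.
toy_df = pd.DataFrame({
    "TARGET":         [1,   1,   1,      0,   0,   0,      0],
    "YEARS_EMPLOYED": [1.0, 3.0, 1000.0, 4.0, 8.0, 1000.0, 6.0],
})

# Keep values below 100 years and replace everything else with NaN,
# instead of dropping the whole record.
years_clean = toy_df["YEARS_EMPLOYED"].where(toy_df["YEARS_EMPLOYED"] < 100, np.nan)

# median() skips NaN by default, so the masked rows do not distort the result.
median_by_segment = years_clean.groupby(toy_df["TARGET"]).median()
print(median_by_segment)
```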

In [None]:
#Analysing the number of years since the customer changed the identity document
plt.figure(figsize = (15, 5))
# distplot is deprecated in recent seaborn versions; kdeplot draws the same density curve
sns.kdeplot(x=defaulter_df["YEARS_ID_PUBLISHED"], label='Default')
sns.kdeplot(x=other_df["YEARS_ID_PUBLISHED"], label='Others')
plt.title('Analysis of years since ID change', title_font)
plt.xlabel("No. of years since ID change", label_font) #Set x axis label
plt.legend() #Show which curve belongs to which segment
plt.show()

**Inference :** We see a similar trend for both segments. This may simply mean that, in the data we have, most customers changed the ID with which they applied for this loan between 10 and 15 years ago.

<ins>**Analysing the number of credit bureau enquiries of the customers in the past year**</ins>

In [None]:
print("Median for customers with payment issues : ", defaulter_df['CREDIT_BUREAU_ENQ_YEAR'].median())
print("Median for all other cases : ", other_df['CREDIT_BUREAU_ENQ_YEAR'].median())

In [None]:
print("--Value counts for customers with payment issues--")
print(defaulter_df['CREDIT_BUREAU_ENQ_YEAR'].value_counts())
print("--Value counts for all other cases--")
print(other_df['CREDIT_BUREAU_ENQ_YEAR'].value_counts())

**Inference:** The number of credit bureau enquiries in the year before the loan application is similar for both segments. Looking at the median and the value counts of both segments, we can say that the number of credit bureau enquiries does not indicate whether a customer may default on payment.

In [None]:
variables = ['NAME_FAMILY_STATUS','NAME_HOUSING_TYPE', "FLAG_OWN_REALTY", "NAME_CONTRACT_TYPE"]
plt.figure(figsize = (15, 40))

plt.subplots_adjust(hspace=0.8)
for position, column in enumerate(variables, start=1):
    plt.subplot(5, 2, position)
    sns.countplot(x = column, hue = 'TARGET', data = app_df)
    plt.xticks(rotation = 45)

**Inference:**

- Married customers make up the largest group making timely payments, followed by single customers. The group with the largest number of customers with payment difficulties is again the married group.
- Customers who live in a house/apartment make up the largest group making timely payments. The same group also defaults more than customers with all other housing types. All other customers are more likely to pay on time.
- More of the customers who own realty make timely payments. The same group also defaults more than the customers who do not own realty.
- More customers defaulted on cash loans.

### Bivariate and Multivariate Analysis<a class="anchor" id="bi-ana"></a>

Bivariate analysis is where we analyse two variables together to understand the data patterns.
We can analyse variables in the following combinations
- Numerical - Numerical
- Numerical - Categorical
- Categorical - Categorical

Multivariate analysis is where we analyse more than two variables together to understand the data patterns.

In [None]:
app_df.columns

<ins>**Numerical - Numerical analysis**</ins>

Scatter plots are generally used to analyse two numeric variables

In [None]:
# Analysing 'AMT_INCOME_TOTAL' and 'AMT_CREDIT'
plt.figure(figsize = (20, 12))
sns.scatterplot(x = "AMT_INCOME_TOTAL", y = "AMT_CREDIT", hue = "TARGET",
                data = app_df)
plt.title('Total income vs Credit', title_font)
plt.xlabel("Total Income", label_font) #Set x axis label
plt.ylabel("Credit amount", label_font) #Set y axis label
plt.show()

**Inference:** As the data is dense towards the beginning of the plot, we can say that more loans have been lent to the lower and lower-medium income ranges. We may hypothesise that people with high income generally apply for fewer loans.

Customers with payment difficulties seem to fall under all segments, like
- Low income low credit amount
- Low income moderate credit amount
- High income and low credit amount
- High income and medium credit amount.


In [None]:
# Analysing 'AMT_INCOME_TOTAL' and 'AMT_ANNUITY'
plt.figure(figsize = (20, 12))
sns.scatterplot(x = "AMT_INCOME_TOTAL", y = "AMT_ANNUITY", hue = "TARGET",
                data = app_df)
plt.title('Total income vs Annuity', title_font)
plt.xlabel("Total Income", label_font) #Set x axis label
plt.ylabel("Annuity amount", label_font) #Set y axis label
plt.show()

**Inference:** The annuity amount does not seem to be influenced by the total income for either segment of the customers.

In [None]:
# Analysing 'AMT_CREDIT' and 'AMT_GOODS_PRICE'
plt.figure(figsize = (10,5))
sns.scatterplot(x = "AMT_CREDIT", y = "AMT_GOODS_PRICE", hue = "TARGET",
                data = app_df)
plt.title('Credit vs Goods price', title_font)
plt.xlabel("Credit amount", label_font) #Set x axis label
plt.ylabel("Goods price", label_font) #Set y axis label
plt.show()

**Inference:** There is a clear correlation between the credit amount and the goods price: as the goods price increases, the credit amount increases too. Both segments follow the same pattern.
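
A scatter plot shows the relationship visually; a correlation coefficient per segment puts a number on it. The sketch below computes the Pearson correlation separately for each `TARGET` value on invented data (`toy_df` is hypothetical).

```python
import pandas as pd

# Toy data standing in for app_df; all amounts are invented.
toy_df = pd.DataFrame({
    "TARGET":          [0, 0, 0, 0, 1, 1, 1, 1],
    "AMT_CREDIT":      [100, 200, 300, 400, 150, 250, 350, 450],
    "AMT_GOODS_PRICE": [90, 180, 270, 360, 140, 230, 330, 420],
})

# Pearson correlation between credit and goods price, one value per segment.
corr_by_segment = {
    target: group["AMT_CREDIT"].corr(group["AMT_GOODS_PRICE"])
    for target, group in toy_df.groupby("TARGET")
}
print(corr_by_segment)
```

If both segments give a coefficient close to 1, that supports the reading that the two follow the same pattern.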

In [None]:
# Credit vs age for customers with payment difficulties
plt.figure(figsize = (8, 6))
sns.scatterplot(y = "AMT_CREDIT", x = "AGE_IN_YEARS",
                data = defaulter_df)
plt.title('Credit amount vs Age', title_font)
plt.xlabel("Age in years", label_font) #Set x axis label
plt.ylabel("Credit amount", label_font) #Set y axis label
plt.show()

**Inference:** There is no strong association between age of the customer and the credit amount lent to the customer for the customers with payment difficulties.

In [None]:
# Credit vs Annuity for all customers

plt.figure(figsize = (15, 10))
plt.suptitle("Credit vs Annuity", fontsize=20, color="darkred", weight='normal')
plt.subplot(2, 2, 1)
plt.title('Defaulters', title_font)
sns.scatterplot(y = "AMT_CREDIT", x = "AMT_ANNUITY",data = defaulter_df)
plt.xlabel("Annuity Amount", label_font) #Set x axis label
plt.ylabel("Credit Amount", label_font) #Set y axis label


# subplot 2
plt.subplot(2, 2, 2)
plt.title('All other cases', title_font)
sns.scatterplot(y = "AMT_CREDIT", x = "AMT_ANNUITY",data = other_df)  # palette has no effect without hue
plt.xlabel("Annuity Amount", label_font) #Set x axis label
plt.ylabel("Credit Amount", label_font) #Set y axis label
plt.show()

**Inference:** Annuity increases as the credit increases. However, neither of these variables seems to influence the payment capability of the customer.

<ins>**Numerical - Categorical analysis**</ins>

In [None]:
# Income vs payment information
sns.boxplot(data=app_df, x = "TARGET", y = "AMT_INCOME_TOTAL")
plt.title('Income vs payment information', title_font)
plt.xlabel("Payment", label_font) #Set x axis label
plt.ylabel("Total income", label_font) #Set y axis label
plt.xticks(np.array([0,1]),  ['All other cases','Defaulters'])
plt.show()

**Inference:** The median total income of the customers with payment difficulties is slightly lower than that of the customers with no payment difficulties. Also, more of the defaulting customers appear to fall above the median income of that segment.

In [None]:
# Mean income per segment (bar plot)
sns.barplot(data=app_df, x = "TARGET", y = "AMT_INCOME_TOTAL")
plt.title('Income vs payment information', title_font)
plt.xlabel("Payment", label_font) #Set x axis label
plt.ylabel("Total income", label_font) #Set y axis label
plt.xticks(np.array([0,1]),  ['All other cases','Defaulters'])
plt.show()

**Inference:** The mean income is higher for customers who pay on time, which suggests that customers with higher income are more likely to make payments on time.
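
The box plot and the bar plot can tell different stories because `sns.boxplot` marks the median while `sns.barplot` shows the mean by default, and the mean is sensitive to a few very large incomes. A minimal sketch on invented data (`toy_df` is hypothetical) shows the effect:

```python
import pandas as pd

# Toy incomes; one very large value in segment 0 pulls its mean upwards.
toy_df = pd.DataFrame({
    "TARGET":           [0, 0, 0, 0, 1, 1, 1, 1],
    "AMT_INCOME_TOTAL": [100, 120, 140, 1000, 110, 120, 130, 140],
})

# Mean and median side by side for each segment.
income_summary = toy_df.groupby("TARGET")["AMT_INCOME_TOTAL"].agg(["mean", "median"])
print(income_summary)
```

Here the medians are close (130 and 125) while the means differ widely (340 and 125), so it helps to check both before concluding that one segment earns more.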

In [None]:
app_df.info()

In [None]:
# Annuity vs payment information
sns.boxplot(data=app_df, x = "TARGET", y = "AMT_ANNUITY")
plt.title('Annuity vs payment information', title_font)
plt.xlabel("Payment", label_font) #Set x axis label
plt.ylabel("Annuity amount", label_font) #Set y axis label
plt.xticks(np.array([0,1]),  ['All other cases','Defaulters'])
plt.show()

In [None]:
app_df.groupby("TARGET")["AMT_ANNUITY"].aggregate(["mean","median"])

**Inference:** Both segments have a similar median. There are more customers with a higher annuity amount among those who made timely payments.

In [None]:
#Region rating vs payment information


plt.figure(figsize = (15, 10))
plt.subplot(2, 2, 1)
app_df.groupby(["REGION_RATING_CLIENT"])["TARGET"].sum().plot.barh()
plt.title('Region rating vs Defaulters', title_font)
plt.xlabel("Number of defaulters", label_font) #Set x axis label
plt.ylabel("Region rating", label_font) #Set y axis label

# subplot 2
plt.subplot(2, 2, 2)
sns.barplot(x = "AMT_INCOME_TOTAL", y = "REGION_RATING_CLIENT", data = app_df, orient="h")
plt.title('Region rating vs total income', title_font)
plt.xlabel("Total Income", label_font) #Set x axis label
plt.ylabel("Region rating", label_font) #Set y axis label
plt.show()

**Inference:** More of the customers who defaulted on payment come from regions with rating 2, and customers from regions with rating 1 have a higher income.

In [None]:
#Education vs Credit

plt.figure(figsize=(15,7))
plt.subplots_adjust(wspace=0.3)

plt.suptitle("Education vs Credit", fontsize=18, color="darkred", weight='normal')
plt.subplot(2, 2, 1)
sns.barplot(data =defaulter_df, x='NAME_EDUCATION_TYPE',y='AMT_CREDIT')
plt.title('Defaulters', title_font)
plt.xlabel("Education", label_font) #Set x axis label
plt.ylabel("Credit amount", label_font) #Set y axis label
plt.xticks(rotation=45)

plt.subplot(2, 2, 2)
sns.barplot(data =other_df, x='NAME_EDUCATION_TYPE',y='AMT_CREDIT')
plt.title('All other cases', title_font)
plt.xlabel("Education", label_font) #Set x axis label
plt.ylabel("Credit amount", label_font) #Set y axis label
plt.xticks(rotation=45)
plt.show()

In [None]:
app_df[app_df["NAME_EDUCATION_TYPE"] == "Academic degree"]["NAME_CONTRACT_TYPE"].value_counts()

**Inference:** Clientele with an academic degree have higher credit and have a higher probability to default on payments compared to other groups. Customers with an academic degree have applied for more cash loans than revolving loans.

In [None]:
#Contract type vs payment information
plt.figure(figsize = (15, 10))
plt.subplot(2, 2, 1)
app_df.groupby(["NAME_CONTRACT_TYPE"])["TARGET"].sum().plot.barh()
plt.title('Contract type vs Defaulters', title_font)
plt.xlabel("Number of defaulters", label_font) #Set x axis label
plt.ylabel("Contract type", label_font) #Set y axis label
plt.yticks(rotation=90)

# subplot 2
plt.subplot(2, 2, 2)
sns.barplot(x = "AMT_INCOME_TOTAL", y = "NAME_CONTRACT_TYPE", data = app_df, orient="h")
plt.title('Contract type vs total income', title_font)
plt.xlabel("Total Income", label_font) #Set x axis label
plt.ylabel("Contract type", label_font) #Set y axis label
plt.yticks(rotation=90)
plt.show()

**Inference:** Clientele who applied for revolving loans are much less likely to default on payment than clientele who applied for cash loans. Also, customers who applied for cash loans have a slightly higher income than other customers.

In [None]:
app_df.info()

<ins>**Categorical - Categorical analysis**</ins>

In [None]:
ed_fm = pd.pivot_table(data=app_df, index="NAME_EDUCATION_TYPE", 
                       columns="NAME_FAMILY_STATUS", values="TARGET")
plt.figure(figsize = (10, 7))
sns.heatmap(ed_fm, annot=True, cmap="RdYlGn")
plt.title('Education vs Family status', title_font)
plt.show()

**Inference:**
- Widowed customers seem to default less on payments irrespective of their education level
- Customers with an academic degree seem to default less on payments irrespective of their family status
- Customers with lower secondary education seem to default more across different family statuses, closely followed by secondary education and incomplete higher
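
Because `pd.pivot_table` aggregates with the mean by default and `TARGET` is binary, each heatmap cell is a default rate. Cells backed by very few customers can look extreme, so it may help to build a count table alongside. The sketch below uses invented data, and the minimum cell size of 2 is an arbitrary assumption for illustration.

```python
import pandas as pd

# Toy data standing in for app_df; all values are invented.
toy_df = pd.DataFrame({
    "NAME_EDUCATION_TYPE": ["Higher", "Higher", "Higher",
                            "Secondary", "Secondary", "Secondary", "Secondary"],
    "NAME_FAMILY_STATUS":  ["Married", "Married", "Single",
                            "Married", "Married", "Single", "Single"],
    "TARGET":              [0, 1, 0, 1, 1, 0, 1],
})

keys = dict(index="NAME_EDUCATION_TYPE", columns="NAME_FAMILY_STATUS", values="TARGET")
default_rate = pd.pivot_table(toy_df, aggfunc="mean", **keys)   # share of defaulters per cell
cell_count = pd.pivot_table(toy_df, aggfunc="count", **keys)    # customers behind each cell

# Hide rates that rest on fewer than MIN_CELL_SIZE customers (hypothetical threshold).
MIN_CELL_SIZE = 2
reliable_rate = default_rate.where(cell_count >= MIN_CELL_SIZE)
print(reliable_rate)
```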

### Correlation of multiple variables <a class="anchor" id="corr"></a>

In [None]:
# Mapping the group of payment related variables 
tg_amt = app_df[['TARGET', 'AMT_INCOME_TOTAL', 'AMT_CREDIT',
       'AMT_ANNUITY', 'AMT_GOODS_PRICE','EXT_SOURCE_2',
       'DEF_IN_SOCIAL_CIRCLE',]].corr()

plt.figure(figsize = (10, 7))
sns.heatmap(tg_amt, annot=True, cmap="RdYlGn")
plt.title('Correlation of payment related variables', title_font)
plt.show()

**Inference:**
- The number of defaulters in the social circle of a customer has a weak positive correlation with the target, meaning that the more defaulters there are in the client's social circle, the higher the probability that the client defaults.
- As the income, credit, annuity and goods price increase, the number of defaulters in the customer's social circle decreases: this variable has a negative correlation with all of these columns.
- Credit amount, annuity and goods price are positively correlated with each other, which means that one increases as the others do.
- The external score is negatively correlated with the target, which means there are fewer defaulters as the score increases.
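
When the question is which variables move with `TARGET`, the relevant column of the correlation matrix can be read as a sorted list instead of scanning the heatmap. This is a small sketch on invented data (`toy_df` is hypothetical):

```python
import pandas as pd

# Toy data; the values are invented so that the signs are easy to see.
toy_df = pd.DataFrame({
    "TARGET":               [0, 0, 0, 0, 1, 1, 1, 1],
    "EXT_SOURCE_2":         [0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1],
    "DEF_IN_SOCIAL_CIRCLE": [0, 0, 1, 0, 1, 0, 2, 1],
})

# Correlation of every variable with TARGET, from most negative to most positive.
target_corr = toy_df.corr()["TARGET"].drop("TARGET").sort_values()
print(target_corr)
```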

##  Previous application data <a class="anchor" id="prev_app"></a>

In [None]:
# loading previous_application file 
prev_app_df = pd.read_csv("../input/bank-loans-dataset/previous_application.csv")
prev_app_df.head()

#### Set index<a class="anchor" id="prev_set"></a>

In [None]:
#Setting SK_ID_PREV as index column
prev_app_df.set_index("SK_ID_PREV", inplace = True)

In [None]:

prev_app_df.head()

## Understanding data <a class="anchor" id="prev_und"></a>

In [None]:
prev_app_df.shape

In [None]:
prev_app_df.info()

In [None]:
len(prev_app_df["SK_ID_CURR"].unique())

<i>We started with 307511 current applications. We see that there are a few more current application IDs here.</i>

In [None]:
#Types of loans
prev_app_df["NAME_CONTRACT_TYPE"].value_counts()

**Consumer loans**  is a new category that was not found in current application data

In [None]:
#Why are some applications not the last application?
prev_app_df["FLAG_LAST_APPL_PER_CONTRACT"].value_counts()

In [None]:
#Checking for pattern
prev_app_df[prev_app_df["FLAG_LAST_APPL_PER_CONTRACT"]=="N"].head()

In [None]:
#Understanding with an example application id 185661
prev_app_df[prev_app_df["SK_ID_CURR"].isin([185661])][["NAME_CONTRACT_TYPE", "AMT_ANNUITY","FLAG_LAST_APPL_PER_CONTRACT","NAME_CONTRACT_STATUS"]]

In [None]:
#One more example application id 148658
prev_app_df[prev_app_df["SK_ID_CURR"].isin([148658])][["NAME_CONTRACT_TYPE", "AMT_ANNUITY","FLAG_LAST_APPL_PER_CONTRACT","NAME_CONTRACT_STATUS"]]

In [None]:
#What happened to the application status that were not the last application.
prev_app_df[prev_app_df["FLAG_LAST_APPL_PER_CONTRACT"]=="N"]["NAME_CONTRACT_STATUS"].value_counts()

When an application is "Refused" or "Canceled", FLAG_LAST_APPL_PER_CONTRACT is set to "N", meaning that these refused or canceled loans were applied for again.

In [None]:
prev_app_df["NFLAG_LAST_APPL_IN_DAY"].value_counts()

In [None]:
prev_app_df[prev_app_df["NFLAG_LAST_APPL_IN_DAY"]==0]["NAME_CONTRACT_STATUS"].value_counts()

In [None]:
prev_app_df[prev_app_df["NFLAG_LAST_APPL_IN_DAY"]==0]["FLAG_LAST_APPL_PER_CONTRACT"].value_counts()

We have 786 records that were not the last application of the customer on the given day, 
but we cannot be sure whether each one is an application for a new contract type or a duplicate record created by mistake. Hence, we do not remove these records.
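
One way to get a feel for possible duplicates is to flag rows that share the same values in a few identifying columns. The sketch below is only an illustration on invented data: the choice of key columns (`key_cols`) is an assumption, and a match on them does not prove that a record is an accidental duplicate.

```python
import pandas as pd

# Toy previous-application rows; all values are invented.
toy_prev = pd.DataFrame({
    "SK_ID_CURR":             [10, 10, 10, 11],
    "NAME_CONTRACT_TYPE":     ["Cash loans", "Cash loans", "Consumer loans", "Cash loans"],
    "AMT_APPLICATION":        [5000, 5000, 5000, 7000],
    "NFLAG_LAST_APPL_IN_DAY": [0, 1, 1, 1],
})

# Hypothetical set of columns that should differ between genuinely distinct applications.
key_cols = ["SK_ID_CURR", "NAME_CONTRACT_TYPE", "AMT_APPLICATION"]

# keep=False marks every member of a duplicate group, not only the repeats.
possible_dupes = toy_prev[toy_prev.duplicated(subset=key_cols, keep=False)]
print(possible_dupes)
```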

**<i>With the understanding we have of the previous application data we aim to answer below mentioned questions and analyse to find the variables influencing the TARGET</i>**
- What is the percentage of returning and new customers for defaulting and others segments?(NAME_CLIENT_TYPE)
- Compare the application status of previous loan applications for defaulting and others segments?(NAME_CONTRACT_STATUS)
- What are the different types of loans previously applied for defaulting and others segments?(NAME_CONTRACT_TYPE)
- Compare medians of expected termination days of previous loans for defaulting and others segments?(DAYS_TERMINATION)
- Compare if the client requested insurance during the previous application for defaulting and others segments?(NFLAG_INSURED_ON_APPROVAL)
- Compare through which channels the client was acquired on the previous application for defaulting and others segments?(CHANNEL_TYPE)
- Compare the credit amount and annuity amount for previously approved applications for defaulting and others segments?(AMT_CREDIT, AMT_ANNUITY)
- Compare the portfolios for previous applications for defaulting and others segments?(NAME_PORTFOLIO)
- Compare previous credit term for previously approved applications for defaulting and others segments?(CNT_PAYMENT)
- Compare Grouped interest rate for previous applications for defaulting and others segments?(NAME_YIELD_GROUP)
- Compare x-sell vs walk-in for both segments (NAME_PRODUCT_TYPE). This could be same as analysing new and returning customers.
- Compare purpose of the cash loan for both segments.(NAME_CASH_LOAN_PURPOSE).
- Compare the medians of application amount and credit amount for both segments.(AMT_APPLICATION, AMT_CREDIT)

## Dropping unnecessary columns <a class="anchor" id="prev_drop"></a>

In [None]:
#Not useful for answering the question at hand
prev_app_df.drop(["WEEKDAY_APPR_PROCESS_START", "HOUR_APPR_PROCESS_START",
                   "FLAG_LAST_APPL_PER_CONTRACT","NFLAG_LAST_APPL_IN_DAY",'AMT_DOWN_PAYMENT',
                   'AMT_GOODS_PRICE', 'RATE_DOWN_PAYMENT',  'DAYS_DECISION', 'NAME_TYPE_SUITE',
                   'NAME_GOODS_CATEGORY','SELLERPLACE_AREA', 'NAME_SELLER_INDUSTRY',
                   'PRODUCT_COMBINATION','DAYS_FIRST_DRAWING', 'DAYS_FIRST_DUE',
                   'DAYS_LAST_DUE_1ST_VERSION'],axis=1,inplace=True)

prev_app_df.head()

## Missing values previous application data <a class="anchor" id="prev_missing"></a>

In [None]:
round(100*(prev_app_df.isnull().sum()/prev_app_df.shape[0]),4)

We see that exactly the same number of AMT_ANNUITY and CNT_PAYMENT values are missing, and the same holds for DAYS_TERMINATION and NFLAG_INSURED_ON_APPROVAL.
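
Equal missing percentages do not by themselves show that the same rows are affected. A quick check is to compare the two missing-value masks row by row, as in this sketch on invented data (`toy_prev` is hypothetical):

```python
import numpy as np
import pandas as pd

# Toy previous-application rows; all values are invented.
toy_prev = pd.DataFrame({
    "AMT_ANNUITY":          [1000.0, np.nan, 2500.0, np.nan],
    "CNT_PAYMENT":          [12.0,   np.nan, 24.0,   np.nan],
    "NAME_CONTRACT_STATUS": ["Approved", "Canceled", "Approved", "Refused"],
})

# True only if both columns are missing on exactly the same rows.
same_rows_missing = (toy_prev["AMT_ANNUITY"].isna() == toy_prev["CNT_PAYMENT"].isna()).all()

# Which application statuses do the missing rows belong to?
missing_by_status = toy_prev.loc[toy_prev["AMT_ANNUITY"].isna(), "NAME_CONTRACT_STATUS"].value_counts()
print(same_rows_missing)
print(missing_by_status)
```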

#### Checking for patterns

In [None]:
prev_app_df[prev_app_df["AMT_ANNUITY"].isna()].head(10)

In [None]:
prev_app_df[(prev_app_df["AMT_ANNUITY"].isna()) | (prev_app_df["CNT_PAYMENT"].isna())]["NAME_CONTRACT_STATUS"].value_counts()

AMT_ANNUITY and CNT_PAYMENT are mostly missing for
- Canceled        305805
- Refused          40898
- Unused offer     25524

In [None]:
prev_app_df[prev_app_df["DAYS_TERMINATION"].isna()].head(10)

In [None]:
prev_app_df[prev_app_df["DAYS_TERMINATION"].isna()]["NAME_CONTRACT_STATUS"].value_counts()

In [None]:
prev_app_df[prev_app_df["NFLAG_INSURED_ON_APPROVAL"].isna()]["NAME_CONTRACT_STATUS"].value_counts()

In [None]:
prev_app_df[prev_app_df["DAYS_TERMINATION"].isna()]["NAME_CLIENT_TYPE"].value_counts()

In [None]:
prev_app_df[prev_app_df["DAYS_TERMINATION"].isna()]["NAME_CONTRACT_TYPE"].value_counts()

In [None]:
# isna().sum() counts the missing values; isna().count() would return the total number of rows
print(prev_app_df["DAYS_TERMINATION"].isna().sum())
print(prev_app_df["DAYS_LAST_DUE"].isna().sum())

It seems that all the applications with missing DAYS_TERMINATION data are **ongoing loans** which have not been terminated yet.
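
When counting missing values, note that `count()` on a boolean mask returns the number of rows, not the number of `True` entries. A short sketch on an invented series shows the difference:

```python
import numpy as np
import pandas as pd

# Toy column with three missing entries out of five.
days = pd.Series([-300.0, np.nan, -50.0, np.nan, np.nan], name="DAYS_TERMINATION")

n_rows = days.isna().count()    # length of the mask: every row, missing or not
n_missing = days.isna().sum()   # True counts as 1, so this is the number of missing values
n_present = days.count()        # count() on the original column skips NaN

print(n_rows, n_missing, n_present)
```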

**Dropping "RATE_INTEREST_PRIMARY" and "RATE_INTEREST_PRIVILEGED" as 99% of the data is missing**

In [None]:
prev_app_df.drop(["RATE_INTEREST_PRIMARY", "RATE_INTEREST_PRIVILEGED"],axis=1,inplace=True)

prev_app_df.head()

In [None]:
prev_app_df.info()

## Merging and Segmenting <a id="prev_merg"></a>

In [None]:
#Merge the previous application with the current application data file
master_df= app_df.merge(prev_app_df, how='left', on='SK_ID_CURR',suffixes=('_CURR', '_PREV'))
master_df.head()
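
A left merge against a table with several rows per customer repeats each current application once per previous application. The sketch below illustrates this on invented data; here `SK_ID_CURR` is a regular column in both toy frames, which is an assumption made for simplicity. `validate` and `indicator` are optional `merge` parameters that help to check the result.

```python
import pandas as pd

# Toy current and previous applications; all values are invented.
toy_app = pd.DataFrame({"SK_ID_CURR": [1, 2, 3], "TARGET": [0, 1, 0]})
toy_prev = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 3, 4],
    "NAME_CONTRACT_STATUS": ["Approved", "Refused", "Approved", "Approved"],
})

toy_master = toy_app.merge(
    toy_prev,
    how="left",
    on="SK_ID_CURR",
    validate="one_to_many",  # raises if a current application ID is duplicated
    indicator=True,          # adds a _merge column: "both" or "left_only"
)

# Rows are now previous applications, not customers.
rows_per_target = toy_master["TARGET"].value_counts()
customers_per_target = toy_master.drop_duplicates("SK_ID_CURR")["TARGET"].value_counts()
no_previous = (toy_master["_merge"] == "left_only").sum()
print(rows_per_target, customers_per_target, no_previous, sep="\n")
```

Counts taken on the merged frame therefore describe previous applications; customer-level figures need de-duplication on `SK_ID_CURR` first.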

In [None]:
master_df.info()

In [None]:
master_df["SK_ID_CURR"].unique().size

In [None]:
master_default_df = master_df[(master_df["TARGET"] == 1)]
master_default_df.shape

In [None]:
master_default_df.head()

In [None]:
master_other_df = master_df[(master_df["TARGET"] != 1)]
master_other_df.shape

In [None]:
master_df["TARGET"].value_counts()

In [None]:
master_other_df.head()

## Analysis of Previous and Current application data<a id="prev_ana"></a>

In [None]:
# Application amount vs target (approved applications with a non-zero AMT_APPLICATION below 600000)
sns.boxplot(y = "AMT_APPLICATION", x= "TARGET",
            data = master_df[(master_df["AMT_APPLICATION"] != 0) & 
                   (master_df["NAME_CONTRACT_STATUS"] == "Approved") &
                            (master_df["AMT_APPLICATION"] < 600000)])
plt.title('Application amount vs Target', title_font)
plt.xlabel("Payment", label_font) #Set x axis label
plt.ylabel("Application amount", label_font) #Set y axis label
plt.xticks(np.array([0,1]),  ['All other cases','Defaulters'])
plt.show() 

**Inference:** The medians of the application amount are almost the same for both segments. The application amount does not seem to influence the payment capability.

In [None]:
# Credit amount vs target (approved applications with a non-zero AMT_CREDIT_PREV below 1000000)
sns.boxplot(y = "AMT_CREDIT_PREV", x= "TARGET",
            data = master_df[(master_df["AMT_CREDIT_PREV"] != 0) & 
                   (master_df["NAME_CONTRACT_STATUS"] == "Approved") &
                            (master_df["AMT_CREDIT_PREV"] < 1000000)])
plt.title('Credit amount vs Target', title_font)
plt.xlabel("Payment", label_font) #Set x axis label
plt.ylabel("Credit amount", label_font) #Set y axis label
plt.xticks(np.array([0,1]),  ['All other cases','Defaulters'])
plt.show() 

**Inference:** The medians of the credit amount are almost the same for both segments. The credit amount does not seem to influence the payment capability.

<u>What is the percentage of returning and new customers for defaulting and others segments?(NAME_CLIENT_TYPE)</u>

If a customer has previous application data it means that customer was a returning customer when they applied for current loan. In this light there are very few customers with no previous application data.

In [None]:
#Fraction of customers with no previous applications
app_df[~app_df.index.isin(prev_app_df["SK_ID_CURR"])].shape[0]/app_df.shape[0]

Around 5% customers do not have any previous application data.

In [None]:
app_df[~app_df.index.isin(prev_app_df["SK_ID_CURR"])]["TARGET"].value_counts().plot.barh()
plt.title("New Customers vs Payment Capability", title_font)
plt.xlabel("Count", label_font)
#plt.ylabel("Target", label_font)
plt.yticks([0,1],["On time payment", "Default"])
plt.show()

**Inference:** A large percentage of first time customers have made timely payments

In [None]:
#New customers(when they applied for loans previously) who defaulted on payments
len(master_default_df[master_default_df["NAME_CLIENT_TYPE"] == "New"]["SK_ID_CURR"].unique())

In [None]:
#New customers(when they applied for loans previously) who made timely payments
len(master_other_df[master_other_df["NAME_CLIENT_TYPE"] == "New"]["SK_ID_CURR"].unique())

In [None]:
len(master_default_df[master_default_df["NAME_CLIENT_TYPE"] == "New"]["SK_ID_CURR"].unique())/len(master_other_df[master_other_df["NAME_CLIENT_TYPE"] == "New"]["SK_ID_CURR"].unique())

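**Inference:** The ratio of defaulting to non-defaulting customers, among those who were new to the system when they applied for previous loans, is approximately 0.096. That is, there are about 9.6 defaulters on the current loan for every 100 such customers who paid on time.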
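
The cell above divides the number of defaulting customers by the number of customers who paid on time, which gives a ratio between the two segments rather than a share of all such customers. The arithmetic below shows how the two relate; the counts are hypothetical and chosen only so that the ratio equals 0.096.

```python
# Hypothetical customer counts, picked so the ratio matches the figure above.
n_default_new = 96
n_other_new = 1000

# Defaulters per on-time customer (what the cell above computes).
ratio = n_default_new / n_other_new

# Share of defaulters among all customers who were new at the previous application.
share = n_default_new / (n_default_new + n_other_new)

print(f"ratio = {ratio:.3f}, share = {share:.3f}")
```

The share is always somewhat smaller than the ratio, since `share = ratio / (1 + ratio)`.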

<u>Compare the application status of previous loan applications for defaulting and others segments</u>

In [None]:
master_df["NAME_CONTRACT_STATUS"].value_counts()

In [None]:
master_other_df["NAME_CONTRACT_STATUS"].isna().sum()

In [None]:
#Application status of previous applications for both segments

plt.figure(figsize=(15,7))
plt.subplots_adjust(wspace=0.3)

plt.suptitle("Application Status", fontsize=18, color="darkred", weight='normal')
plt.subplot(2, 2, 1)
sns.countplot(data =master_default_df, x='NAME_CONTRACT_STATUS')
plt.title('Defaulters', title_font)
plt.xlabel("Application Status", label_font) #Set x axis label
plt.ylabel("Count", label_font) #Set y axis label
plt.xticks(rotation=45)

plt.subplot(2, 2, 2)
sns.countplot(data =master_other_df, x='NAME_CONTRACT_STATUS')
plt.title('All other cases', title_font)
plt.xlabel("Application status", label_font) #Set x axis label
plt.ylabel("Count", label_font) #Set y axis label
plt.xticks(rotation=45)
plt.show()

**Inference:** The percentage of canceled and refused applications is slightly higher for the defaulting customers than for the customers who made timely payments.
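
Since the two segments differ greatly in size, percentages are easier to compare than raw counts. A possible way to obtain them is `value_counts(normalize=True)` per segment, sketched here on invented data (`toy_master` is hypothetical):

```python
import pandas as pd

# Toy merged data: 4 previous applications of defaulters, 6 of other customers.
toy_master = pd.DataFrame({
    "TARGET": [1] * 4 + [0] * 6,
    "NAME_CONTRACT_STATUS": ["Approved", "Refused", "Refused", "Canceled",
                             "Approved", "Approved", "Approved", "Approved",
                             "Refused", "Canceled"],
})

# Share of each application status within each segment; every row sums to 1.
status_share = (
    toy_master.groupby("TARGET")["NAME_CONTRACT_STATUS"]
    .value_counts(normalize=True)
    .unstack(fill_value=0)
)
print(status_share)
```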

<u>What are the different types of loan applications previously applied for defaulting and others segments?(NAME_CONTRACT_TYPE)</u>

In [None]:

plt.figure(figsize=(15,7))
plt.subplots_adjust(wspace=0.3)

plt.suptitle("Types of loan applications", fontsize=18, color="darkred", weight='normal')
plt.subplot(2, 2, 1)
sns.countplot(data =master_default_df, x='NAME_CONTRACT_TYPE_PREV')
plt.title('Defaulters', title_font)
plt.xlabel("Application type", label_font) #Set x axis label
plt.ylabel("Count", label_font) #Set y axis label
plt.xticks(rotation=45)

plt.subplot(2, 2, 2)
sns.countplot(data =master_other_df, x='NAME_CONTRACT_TYPE_PREV')
plt.title('All other cases', title_font)
plt.xlabel("Application type", label_font) #Set x axis label
plt.ylabel("Count", label_font) #Set y axis label
plt.xticks(rotation=45)
plt.show()

**Inference:** Customers defaulted on cash loans the most and then consumer loans and revolving loans in that order.

<u>Compare medians of expected termination days of previous loans for defaulting and others segments?(DAYS_TERMINATION)</u>

In [None]:
master_df["DAYS_TERMINATION"].describe()

In [None]:
# Convert days to positive years (the column keeps its DAYS_TERMINATION name)
master_df["DAYS_TERMINATION"] = abs(master_df["DAYS_TERMINATION"])/365

In [None]:
master_df["DAYS_TERMINATION"].describe()

In [None]:
master_df[master_df["DAYS_TERMINATION"] > 8]["DAYS_TERMINATION"].value_counts()

In [None]:
# Years since termination of previous loans vs payment information
sns.boxplot(data = master_df[master_df["DAYS_TERMINATION"] < 8], x = "TARGET", y = "DAYS_TERMINATION")
plt.title('Years termination vs payment information', title_font)
plt.xlabel("Payment", label_font) #Set x axis label
plt.ylabel("Years", label_font) #Set y axis label
plt.xticks(np.array([0,1]),  ['All other cases','Defaulters'])
plt.show() 

**Inference:** The median termination time of previous loans is slightly lower for the defaulter group. This suggests that clients whose loans were terminated more recently (it is not clear whether they paid back all the money) are slightly more likely to default.

However, there are a lot of records with a value of 1000 years for the previous loan termination time, there seems to be a pattern but we were not able to figure it out. We will need more information about the DAYS_

<u>Compare if the client requested insurance during the previous application for defaulting and others segments?(NFLAG_INSURED_ON_APPROVAL)</u>

In [None]:
master_df[["NFLAG_INSURED_ON_APPROVAL"]].info()

In [None]:
master_df["NFLAG_INSURED_ON_APPROVAL"].value_counts()

In [None]:
master_df["NFLAG_INSURED_ON_APPROVAL"].isna().sum()/master_df.shape[0]

About 40% of the values of this variable are missing.
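
With this many missing values, it may be worth checking whether the gaps are spread evenly over the two segments before reading the count plot, since missing rows are not drawn. A minimal sketch on invented data (`toy_master` is hypothetical):

```python
import numpy as np
import pandas as pd

# Toy merged data; NaN marks a missing insurance flag.
toy_master = pd.DataFrame({
    "TARGET":                    [0, 0, 0, 0, 1, 1, 1],
    "NFLAG_INSURED_ON_APPROVAL": [1.0, 0.0, np.nan, np.nan, 0.0, np.nan, np.nan],
})

# The mean of a boolean mask is the share of True values, here the share missing.
missing_share = (
    toy_master["NFLAG_INSURED_ON_APPROVAL"].isna()
    .groupby(toy_master["TARGET"])
    .mean()
)
print(missing_share)
```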

In [None]:
sns.countplot(x = "NFLAG_INSURED_ON_APPROVAL", hue="TARGET", data = master_df)
plt.title("Insurance request", title_font)
plt.xlabel("Requested insurance", label_font)
plt.ylabel("Count", label_font)
plt.xticks([0.0, 1.0],["No", "Yes"])
plt.show()

**Inference:** There are more applications from defaulters who DID NOT request insurance than from those who did. The same is true for all other cases.

<u>Compare through which channels the client was acquired on the previous application for defaulting and others segments?(CHANNEL_TYPE)</u>

In [None]:
master_df["CHANNEL_TYPE"].value_counts()

In [None]:
master_df["CHANNEL_TYPE"].isna().sum()

In [None]:
plt.figure(figsize = (8,5))
sns.countplot(x = "CHANNEL_TYPE", hue="TARGET", data = master_df)
plt.title("Channel Type", title_font)
plt.xlabel("Channel Type", label_font)
plt.ylabel("Count", label_font)
plt.xticks(rotation = 90)
plt.show()

**Inference:** For both segments, the highest number of customers was acquired through credit and cash offices, followed by the country-wide and stone channels.

<u>Compare the portfolios for previous applications for defaulting and others segments?(NAME_PORTFOLIO)</u>

In [None]:
plt.figure(figsize = (8,5))
sns.countplot(x = "NAME_PORTFOLIO", hue="TARGET", 
              data = master_df[master_df["NAME_PORTFOLIO"] != "XNA"])
plt.title("Portfolio Type", title_font)
plt.xlabel("Portfolio", label_font)
plt.ylabel("Count", label_font)
plt.show()

**Inference:** Both the segments follow similar pattern again! 
Major share of defaulting and non defaulting customers are from Point of sale followed by cash and cards

<u>Compare previous credit term for previously approved applications for defaulting and others segments?(CNT_PAYMENT)</u>

In [None]:
# Loan term vs payment information
sns.boxplot(data = master_df, x = "TARGET", y = "CNT_PAYMENT")
plt.title('Loan term vs payment information', title_font)
plt.xlabel("Payment", label_font) #Set x axis label
plt.ylabel("Months", label_font) #Set y axis label
plt.xticks(np.array([0,1]),  ['All other cases','Defaulters'])
plt.show() 

**Inference:** Again, the median and quartiles are very similar for both segments, meaning that the defaulting behaviour of the clients does not depend on the term of the loan.
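
The quartiles read off the box plot can also be computed directly, which makes small differences between the segments easier to see. This sketch uses invented loan terms (`toy_master` is hypothetical):

```python
import pandas as pd

# Toy loan terms in months for both segments; all values are invented.
toy_master = pd.DataFrame({
    "TARGET":      [0] * 5 + [1] * 5,
    "CNT_PAYMENT": [6, 12, 12, 24, 36, 6, 12, 18, 24, 48],
})

# One row per segment, one column per quantile (0.25, 0.5, 0.75).
quartiles = (
    toy_master.groupby("TARGET")["CNT_PAYMENT"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
print(quartiles)
```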

<u>Compare the grouped interest rate of previous applications for the defaulting and other segments (NAME_YIELD_GROUP)</u>

In [None]:
plt.figure(figsize = (8,5))
sns.countplot(x = "NAME_YIELD_GROUP", hue="TARGET", 
              data = master_df[master_df["NAME_YIELD_GROUP"] != "XNA"])
plt.title("Interest Rate", title_font)
plt.xlabel("Interest Rate", label_font)
plt.ylabel("Count", label_font)
plt.show()

**Inference:** Clients in the middle and high interest rate groups seem to default more than clients in the other interest rate groups.
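
A count plot shows how many applications fall in each group, so a taller bar can simply reflect a more popular group. One way to compare groups on an equal footing is the share of defaulters within each group. The following is a minimal sketch on invented data (`toy_df` is a hypothetical stand-in for `master_df`, and the numbers carry no meaning):

```python
import pandas as pd

# Toy stand-in for master_df; values are invented for illustration only.
toy_df = pd.DataFrame({
    "NAME_YIELD_GROUP": ["low_normal"] * 4 + ["middle"] * 4 + ["high"] * 2,
    "TARGET": [0, 0, 0, 1, 0, 0, 1, 1, 0, 1],
})

# With TARGET coded 0/1, the group mean is the fraction of defaulters.
default_rate = (
    toy_df.groupby("NAME_YIELD_GROUP")["TARGET"]
    .mean()
    .sort_values(ascending=False)
)
print(default_rate)
```

The same pattern should work for any categorical column used in the count plots above, such as `CHANNEL_TYPE` or `NAME_PRODUCT_TYPE`.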

<u>Compare x-sell vs walk-in for both segments (NAME_PRODUCT_TYPE). This could be the same as analysing new and returning customers.</u>

In [None]:
plt.figure(figsize = (8,5))
sns.countplot(x = "NAME_PRODUCT_TYPE", hue="TARGET", 
              data = master_df[master_df["NAME_PRODUCT_TYPE"] != "XNA"])
plt.title("Product Type", title_font)
plt.xlabel("Product Type", label_font)
plt.ylabel("Count", label_font)
plt.show()

**Inference:** Customers with cross-sold products have defaulted more than walk-in customers.

<u>Compare the purpose of the cash loan for defaulting customers (NAME_CASH_LOAN_PURPOSE).</u>

In [None]:
plt.figure(figsize = (8,5))
sns.countplot(x = "NAME_CASH_LOAN_PURPOSE",
              data = master_default_df[~master_default_df["NAME_CASH_LOAN_PURPOSE"].isin(["XNA","XAP"])])
plt.title("Loan Purpose of defaulting customers", title_font)
plt.xlabel("Purpose", label_font)
plt.ylabel("Count", label_font)
plt.xticks(rotation = 90)
plt.show()

**Inference:** Repairs is the most common cash loan purpose among defaulting customers, which suggests that repair loans are more likely to end in default than loans for other purposes.

<u>Application amount vs credit amount for defaulting customers</u>

In [None]:
plt.figure(figsize = (10, 7))
sns.scatterplot(data=master_default_df, x="AMT_APPLICATION", y="AMT_CREDIT_PREV")
plt.title("Application amount vs Credit amount", title_font)
plt.ylabel("Credit amount", label_font)
plt.xlabel("Application amount", label_font)
plt.show()

**Inference:** As the application amount increases, the credit amount increases too.

<u>Education vs cash loan purpose</u>

In [None]:
ed_lp = pd.pivot_table(data=master_df, index="NAME_EDUCATION_TYPE", 
                       columns="NAME_CASH_LOAN_PURPOSE", values="TARGET")
plt.figure(figsize = (20, 10))
sns.heatmap(ed_lp, annot=True, cmap="RdYlGn_r")
plt.title('Education vs cash loan purpose', title_font)
plt.show()

**Inference:** 
- Clients with lower secondary education applying for furniture loans are the riskiest group.
- Customers with incomplete higher education applying for hobby loans or gasification/water supply loans are risky too.
- Customers with an academic degree seem to be the safest for all types of cash loans.
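
Each heatmap cell is the mean of `TARGET` for one education/purpose pair, so a cell backed by only a handful of applications can look extreme by chance. A rough way to check is to build a count table next to the mean table and mask thinly populated cells. The sketch below uses invented data and an arbitrary threshold `min_count` (both are assumptions for illustration):

```python
import pandas as pd

# Toy stand-in for master_df; values are invented for illustration only.
toy_df = pd.DataFrame({
    "NAME_EDUCATION_TYPE": ["Lower secondary", "Lower secondary", "Higher education",
                            "Higher education", "Higher education", "Higher education"],
    "NAME_CASH_LOAN_PURPOSE": ["Furniture", "Repairs", "Repairs",
                               "Repairs", "Repairs", "Furniture"],
    "TARGET": [1, 0, 0, 1, 0, 0],
})

# Default rate per cell (what the heatmap shows) and number of rows behind it.
rate = pd.pivot_table(toy_df, index="NAME_EDUCATION_TYPE",
                      columns="NAME_CASH_LOAN_PURPOSE", values="TARGET", aggfunc="mean")
count = pd.pivot_table(toy_df, index="NAME_EDUCATION_TYPE",
                       columns="NAME_CASH_LOAN_PURPOSE", values="TARGET", aggfunc="count")

min_count = 2  # arbitrary illustrative threshold
reliable_rate = rate.where(count >= min_count)  # cells with too few rows become NaN
print(reliable_rate)
```

Passing `reliable_rate` to `sns.heatmap` instead of the raw pivot would leave the sparse cells blank, which makes it easier to see which risky-looking combinations rest on enough applications.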

In [None]:
ed_lp = pd.pivot_table(data=master_df[master_df["NAME_CONTRACT_TYPE_PREV"] != "XNA"],
                       index="NAME_EDUCATION_TYPE", 
                       columns="NAME_CONTRACT_TYPE_PREV", values="TARGET")
plt.figure(figsize = (20, 7))
sns.heatmap(ed_lp, annot=True, cmap="RdYlGn_r")
plt.title('Education vs Loan Type', title_font)
plt.show()

**Inference:**
- A customer with lower secondary education and a revolving loan is more likely to default.
- A customer with an academic degree and a revolving loan is less likely to default.

### **Final Thoughts:**<a id="summary"></a>


We have credit application data together with the customers' payment behaviour. We are also given details of previous applications made by the customers who have currently applied for a loan.

The aim is to find out what kind of customer is more likely to default on payments, so that banks can take the necessary measures to reduce the losses incurred.

We have analysed the given data in multiple steps:

We started by understanding the given variables and removed the unnecessary ones that are not useful for answering the question at hand.

Then we looked at the missing values for each variable and tried to understand whether there is any pattern or reason behind them. We did find some patterns; for example, occupation type and organization type were missing for customers with income type Pensioner. We reported all the missing values and suggested appropriate actions to mitigate the risk of missing values affecting the analysis.
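
A pattern like this can be surfaced by computing the share of missing values per category. The following is only a sketch on invented data; the column names `NAME_INCOME_TYPE` and `OCCUPATION_TYPE` are assumed here, and `toy_df` is a hypothetical stand-in for the application data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the application data; values are invented for illustration only.
toy_df = pd.DataFrame({
    "NAME_INCOME_TYPE": ["Pensioner", "Pensioner", "Working", "Working", "State servant"],
    "OCCUPATION_TYPE": [np.nan, np.nan, "Laborers", np.nan, "Managers"],
})

# Fraction of rows with a missing occupation, per income type.
missing_share = (
    toy_df["OCCUPATION_TYPE"].isna()
    .groupby(toy_df["NAME_INCOME_TYPE"])
    .mean()
)
print(missing_share)
```

A share close to 1 for a single category suggests the value is structurally absent for that group rather than missing at random.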

After analysing the missing values, we classified the variables into numerical and categorical and corrected the wrongly inferred data types.

We followed this with outlier analysis. While analysing outliers we noticed some interesting patterns; one of them is that extremely high values (1000 years) are used in duration-related columns to indicate that the variable is not applicable for that application. We carefully excluded such records when analysing those columns.
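
One way to exclude such placeholder values is to turn them into missing values before summarising the column. This is a sketch under assumptions: the placeholder is taken to be 365243 days (roughly 1000 years), and `DAYS_EMPLOYED` and `toy_df` are illustrative names and data rather than the notebook's own.

```python
import pandas as pd

SENTINEL_DAYS = 365243  # assumed placeholder, about 1000 years expressed in days

# Toy duration column; values are invented for illustration only.
toy_df = pd.DataFrame({"DAYS_EMPLOYED": [-1200, -350, SENTINEL_DAYS, -4000]})

# Replace the placeholder with NaN so it does not distort summaries or plots.
employed_days = toy_df["DAYS_EMPLOYED"].mask(toy_df["DAYS_EMPLOYED"] == SENTINEL_DAYS)
print(employed_days.describe())
```

Masking instead of dropping rows keeps the other columns of those records available for the rest of the analysis.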

Then we prepared the data for analysis by binning some numerical variables into categorical ones and by converting some variables to a more analysis-friendly scale, such as converting days from birth into years of age.
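
As a minimal sketch of both preparation steps, assuming a `DAYS_BIRTH` column stored as negative day offsets and using invented values, bin edges, and labels:

```python
import pandas as pd

# Toy data; DAYS_BIRTH is assumed to hold negative offsets in days.
toy_df = pd.DataFrame({"DAYS_BIRTH": [-9000, -14000, -20000, -24000]})

# Convert days to whole years of age.
toy_df["AGE_YEARS"] = (-toy_df["DAYS_BIRTH"] / 365.25).astype(int)

# Bin the ages into decade-wide groups; edges and labels are illustrative choices.
age_bins = [20, 30, 40, 50, 60, 70]
age_labels = ["20-30", "30-40", "40-50", "50-60", "60-70"]
toy_df["AGE_GROUP"] = pd.cut(toy_df["AGE_YEARS"], bins=age_bins,
                             labels=age_labels, right=False)
print(toy_df)
```

With `right=False` each bin includes its lower edge and excludes its upper edge, so an age of exactly 30 falls in the `30-40` group.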

We further segmented the data into customers with payment difficulties and customers without payment difficulties. We performed univariate, bivariate and multivariate analysis of the cleaned and prepared data, and plotted graphs to gain a deeper understanding of the variables.

Then we performed similar cleaning and preparation steps on the previous application data and merged it with the current application data to continue the analysis.
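
The merge step can be sketched as follows. The key `SK_ID_CURR`, the inner join, and the `_PREV` suffix are assumptions made for this illustration (the suffix mirrors column names such as `AMT_CREDIT_PREV` used in the plots); the frames and values are invented.

```python
import pandas as pd

# Toy current and previous application tables; values are invented.
current_df = pd.DataFrame({
    "SK_ID_CURR": [1, 2, 3],
    "TARGET": [0, 1, 0],
    "AMT_CREDIT": [100, 200, 300],
})
previous_df = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2],
    "AMT_CREDIT": [50, 60, 70],
})

# Columns present in both tables get a _PREV suffix on the previous-application side.
merged = current_df.merge(previous_df, on="SK_ID_CURR",
                          how="inner", suffixes=("", "_PREV"))
print(merged)
```

One current application can match several previous ones, so each merged row is a (current, previous) pair and clients with many previous applications contribute more rows to any count or mean computed afterwards.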

We largely found that no single variable determines a customer's ability to pay. However, when we combined the datasets and analysed multiple variables together, some weak patterns emerged.

For example, clients with lower secondary education applying for furniture loans are more likely to default on loans.

Our analysis leads us to believe that the following customer variables deserve careful consideration before approving a loan:

- Education level of a customer : Academic degree holders are less likely to default
- Loan type and reason : Cash loans for repairs and revolving loans seem to be risky
- Interest rate : Customers in the high and middle interest rate groups are more likely to default
- Channel Type : Clients acquired through credit and cash offices are more likely to default than clients acquired elsewhere
- External score : The higher the score, the less likely the customer is to default on payment
- Defaulters in social circle : The fewer defaulters in the customer's social circle, the better
- Family status : Widowed customers seem to default less on payments irrespective of their education level
- Region rating : Clients living in regions rated 2 are more likely to default than others
- Occupation type : Laborers seem to default more than other occupation types
- Client Type : A larger number of new customers seem to make timely payments

