## 1. INTRODUCTION

This case study shows how EDA is applied in a real business scenario. Beyond applying EDA techniques, we will build a basic understanding of risk analytics in banking and financial services and see how data is used to minimise the risk of losing money when lending to customers.

### 1.1 BUSINESS UNDERSTANDING

Loan providers find it hard to lend to people with insufficient or non-existent credit history, and some consumers take advantage of this by defaulting. Suppose we work for a consumer finance company that specialises in lending various types of loans to urban customers. We have to use EDA to analyse the patterns present in the data so that applicants capable of repaying the loan are not rejected.
 
When the company receives a loan application, it has to decide whether to approve the loan based on the applicant’s profile. Two types of risk are associated with this decision:


- If the applicant is likely to repay the loan, then not approving the loan results in a loss of business to the company.

- If the applicant is not likely to repay the loan, i.e. he/she is likely to default, then approving the loan may lead to a financial loss for the company.

The data contains information about each loan application at the time of applying for the loan. It covers two types of scenarios:

1.	The client with payment difficulties: he/she had a late payment of more than X days on at least one of the first Y instalments of the loan in our sample.
2.	All other cases: the payments were made on time.

When a client applies for a loan, the client or the company can take one of four decisions:

1.	Approved: The company has approved the loan application.
2.	Cancelled: The client cancelled the application at some point during approval. Either the client changed their mind about the loan or, in some cases, a higher-risk client received worse pricing than they were willing to accept.
3.	Refused: The company rejected the loan (for example, because the client did not meet its requirements).
4.	Unused offer: The loan was cancelled by the client, but at a different stage of the process.

In this case study, we will use EDA to understand how consumer attributes and loan attributes influence the tendency of default.

### 1.2 BUSINESS OBJECTIVES

This case study aims to identify patterns that indicate whether a client will have difficulty paying their instalments. Such patterns can be used to take actions such as denying the loan, reducing the loan amount, or lending (to risky applicants) at a higher interest rate. This helps ensure that consumers capable of repaying the loan are not rejected. Identifying such applicants using EDA is the aim of this case study.

In other words, the company wants to understand the driving factors (or driver variables) behind loan default, i.e. the variables which are strong indicators of default.  The company can utilise this knowledge for its portfolio and risk assessment.

### 1.3 DATA UNDERSTANDING

Our dataset has 3 files as explained below: 

1.	'application_data.csv' contains all the information about the client at the time of application. The data records whether a client has payment difficulties.
2.	'previous_application.csv' contains information about the client’s previous loans. It records whether each previous application was Approved, Cancelled, Refused or an Unused offer.
3.	'columns_description.csv' is the data dictionary, which describes the meaning of the variables.


## 2. LOAN APPLICATION DATA

### 2.1 Read the data file

We will first read the application CSV file, which contains the customers' loan applications at the time of application. After performing certain analyses on this current data, we will look into the previous application details at a later point.

In [None]:
# Import the required libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.style as style
import seaborn as sns
import itertools
%matplotlib inline

In [None]:
# Filter out the warnings

import warnings
warnings.filterwarnings('ignore')

In [None]:
# Setting maximum rows and columns display size to 200 for better visibility of data 

pd.set_option('display.max_columns', 200)
pd.set_option('display.max_rows', 200)

In [None]:
# Read the application data file

application_df = pd.read_csv('../input/bank-loan-default/application_data.csv')
application_df.head()

### 2.2 Inspect the data frame

In this section we will perform the below activities -

1.	Inspect the application dataframe to understand the size of this dataset
2.	Look at the column info such as data type
3.	Summarise the numeric columns 


In [None]:
# Check the number of rows and columns in the dataframe
application_df.shape

In [None]:
# Check the column-wise info of the dataframe
application_df.info(verbose=True)


In [None]:
# Check the summary for the numeric columns
application_df.describe()

### 2.3 Data Cleaning & Imputation - Suggestions

In this section we will perform data quality checks by identifying missing values, incorrect data types etc., and suggest the best possible way to treat such data.

1. Check for any missing values and best possible metric to handle those missing values
2. Verify whether any column(s) has incorrect data type
3. For numerical columns, we will check for outliers
4. We will also perform binning of continuous variables


In [None]:
# Check for missing values in percentage 

round(100 * application_df.isnull().mean(),2)

In [None]:
# Extract the column names with more than 50% data missing and their respective missing value percentage
# (Series.iteritems() was removed in pandas 2.0; Series.items() is the replacement)

missing50 = list(filter(lambda x: x[1] > 50 , round(100 * application_df.isnull().sum() / len(application_df.index),2).items()))

# Extract the column names from the above list

cols_to_drop = [i[0] for i in missing50]
cols_to_drop

- Since the above columns have more than 50% of their data missing, it is reasonable to drop them; keeping them would distort the overall analysis.

In [None]:
# Remove the columns with more than 50% missing values

application_df.drop(cols_to_drop, axis = 1, inplace = True)

# Check the shape 

application_df.shape

In [None]:

# Check for % missing values for remaining columns

round(100 * application_df.isnull().sum() / len(application_df.index),2)

#### Missing value imputation  -


1. A few columns have a missing-value percentage very close to 50%. They are -

1. YEARS_BEGINEXPLUATATION_AVG     
2. FLOORSMAX_AVG                   
3. YEARS_BEGINEXPLUATATION_MODE    
4. FLOORSMAX_MODE                  
5. YEARS_BEGINEXPLUATATION_MEDI    
6. FLOORSMAX_MEDI                  
7. TOTALAREA_MODE                 
8. EMERGENCYSTATE_MODE             

These columns can also be dropped: with close to 50% of their data missing, any imputation would heavily bias the dataset and prevent us from drawing reliable insights.

In [None]:
# Drop the above columns 

cols_to_drop = ['YEARS_BEGINEXPLUATATION_AVG','FLOORSMAX_AVG','YEARS_BEGINEXPLUATATION_MODE','FLOORSMAX_MODE','YEARS_BEGINEXPLUATATION_MEDI','FLOORSMAX_MEDI','TOTALAREA_MODE','EMERGENCYSTATE_MODE']
application_df.drop(cols_to_drop, axis = 1, inplace = True)

# Check the shape 

application_df.shape

2. The OCCUPATION_TYPE column has 31.35% of its data missing.

OCCUPATION_TYPE is a categorical variable of object type. Because the missing-value percentage is high (31.35%), filling the gaps with the mode would bias the data. It is safer to create a new category, 'Unknown', for the missing values.

In [None]:
application_df['OCCUPATION_TYPE'].value_counts()

In [None]:
application_df['OCCUPATION_TYPE'] = application_df['OCCUPATION_TYPE'].fillna('Unknown')

In [None]:
application_df['OCCUPATION_TYPE'].value_counts()

3. Columns with around 13% missing values. From the list above, the columns below have approximately 13% of their values missing. Let's inspect them and suggest a possible imputation strategy for each. Columns to be considered -



1. AMT_REQ_CREDIT_BUREAU_HOUR
2. AMT_REQ_CREDIT_BUREAU_DAY
3. AMT_REQ_CREDIT_BUREAU_WEEK
4. AMT_REQ_CREDIT_BUREAU_MON
5. AMT_REQ_CREDIT_BUREAU_QRT
6. AMT_REQ_CREDIT_BUREAU_YEAR

In [None]:
application_df[['AMT_REQ_CREDIT_BUREAU_HOUR','AMT_REQ_CREDIT_BUREAU_DAY','AMT_REQ_CREDIT_BUREAU_WEEK',
    'AMT_REQ_CREDIT_BUREAU_MON','AMT_REQ_CREDIT_BUREAU_QRT','AMT_REQ_CREDIT_BUREAU_YEAR']].describe()

Here we can replace the missing values with the respective median for each of these columns. These columns represent 'Number of enquiries...', which must be a whole number. The mean would generally be fractional, so we use the median to fill the missing values.

In [None]:
cols_list = ['AMT_REQ_CREDIT_BUREAU_HOUR','AMT_REQ_CREDIT_BUREAU_DAY','AMT_REQ_CREDIT_BUREAU_WEEK',
    'AMT_REQ_CREDIT_BUREAU_MON','AMT_REQ_CREDIT_BUREAU_QRT','AMT_REQ_CREDIT_BUREAU_YEAR']

for col in cols_list:
    application_df[col] = application_df[col].fillna(application_df[col].median())

In [None]:
# We can drop the EXT_SOURCE_2 and EXT_SOURCE_3 columns as they are not required for this case study

application_df.drop(['EXT_SOURCE_2','EXT_SOURCE_3'], axis = 1, inplace = True)
application_df.columns

In [None]:

# Check for % missing values for remaining columns

round(100 * application_df.isnull().sum() / len(application_df.index),2)

Since the remaining missing values are very few compared to the total number of records, we can simply drop those rows to get rid of the missing data.

In [None]:
application_df.shape

In [None]:
application_df = application_df.dropna(axis=0, how='any')

In [None]:
application_df.shape

#### Data Type Correction -

There are certain columns in the data set which have incorrect data types. We can change them to appropriate data type.

Note: For some columns the data type cannot be changed until the missing data has been imputed. In those cases only a suggestion is provided.
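The reason for this ordering can be seen on a toy series. The sketch below is illustrative only: the values are made up, and the nullable `Int64` dtype is shown as an alternative that this notebook does not otherwise use.

```python
import numpy as np
import pandas as pd

# Toy series standing in for a count column that still has a missing value
enquiries = pd.Series([0.0, 2.0, np.nan, 1.0])

# A plain int cast fails while NaN is present
try:
    enquiries.astype(int)
    cast_failed = False
except (ValueError, TypeError):
    cast_failed = True

# The nullable Int64 dtype keeps the gap as <NA> instead of failing
nullable = enquiries.astype('Int64')

# After imputation the ordinary cast works
filled = enquiries.fillna(enquiries.median()).astype(int)

print(cast_failed)
print(nullable.tolist())
print(filled.tolist())
```

Imputing first and casting afterwards, as done in this notebook, keeps the columns as ordinary NumPy integers.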

1. The columns below represent the number of enquiries made to the Credit Bureau about the client. The values are stored as float, which is not appropriate for counts, so we change the data type to int.

- AMT_REQ_CREDIT_BUREAU_HOUR
- AMT_REQ_CREDIT_BUREAU_DAY
- AMT_REQ_CREDIT_BUREAU_WEEK
- AMT_REQ_CREDIT_BUREAU_MON
- AMT_REQ_CREDIT_BUREAU_QRT
- AMT_REQ_CREDIT_BUREAU_YEAR


In [None]:
cols_list = ['AMT_REQ_CREDIT_BUREAU_HOUR','AMT_REQ_CREDIT_BUREAU_DAY','AMT_REQ_CREDIT_BUREAU_WEEK',
    'AMT_REQ_CREDIT_BUREAU_MON','AMT_REQ_CREDIT_BUREAU_QRT','AMT_REQ_CREDIT_BUREAU_YEAR']

for col in cols_list:
    application_df[col] = application_df[col].astype(int)

2. Similarly, the DAYS_REGISTRATION column should be changed to int, as it holds a number of days.

In [None]:
# Changing DAYS_REGISTRATION column data type to int

application_df['DAYS_REGISTRATION'] = application_df['DAYS_REGISTRATION'].astype(int)

3. CNT_FAM_MEMBERS is a count of family members, so it should also be stored as int.

In [None]:
# Changing CNT_FAM_MEMBERS column data type to int

application_df['CNT_FAM_MEMBERS'] = application_df['CNT_FAM_MEMBERS'].astype(int)

4. Columns that store Yes/No values as 0 and 1 could be changed to the category data type for clearer plots and easier reading. This is offered as a suggestion only.
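A minimal sketch of that flag-to-category conversion, on a toy frame. The column names `FLAG_A` and `FLAG_B` are hypothetical stand-ins, not columns taken from the dataset.

```python
import pandas as pd

# Toy frame; FLAG_A and FLAG_B are hypothetical 0/1 flag columns
toy = pd.DataFrame({'FLAG_A': [0, 1, 1, 0], 'FLAG_B': [1, 1, 0, 1]})

flag_cols = ['FLAG_A', 'FLAG_B']
for col in flag_cols:
    # Map 0/1 to readable labels, then store the result as a category
    toy[col] = toy[col].map({0: 'No', 1: 'Yes'}).astype('category')

print(toy.dtypes)
```

Readable labels show up directly on plot axes, which is the main benefit of this conversion.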

In [None]:
# We can convert these DAYS columns into int data type as it is anyway going to be a whole number.

col_list = ['DAYS_BIRTH','DAYS_EMPLOYED','DAYS_REGISTRATION','DAYS_ID_PUBLISH','DAYS_LAST_PHONE_CHANGE']

for i in col_list:
    application_df[i] = application_df[i].astype(int)

In [None]:
# Verify the changes

application_df.info()

#### Data standardization -

1. Some columns represent a number of days but contain negative values. We fix that by replacing those values with their absolute values. The columns are -

- DAYS_BIRTH
- DAYS_EMPLOYED
- DAYS_REGISTRATION
- DAYS_ID_PUBLISH
- DAYS_LAST_PHONE_CHANGE

In [None]:
# Inspect the negative values in the DAYS columns

application_df[['DAYS_BIRTH','DAYS_EMPLOYED','DAYS_REGISTRATION','DAYS_ID_PUBLISH','DAYS_LAST_PHONE_CHANGE']].describe()

In [None]:
# Make a list of all DAYS columns
col_list = ['DAYS_BIRTH','DAYS_EMPLOYED','DAYS_REGISTRATION','DAYS_ID_PUBLISH','DAYS_LAST_PHONE_CHANGE']

# Replace the values with their respective absolute values
for i in col_list:
    application_df[i] = abs(application_df[i])

# Verify the changes
application_df[['DAYS_BIRTH','DAYS_EMPLOYED','DAYS_REGISTRATION','DAYS_ID_PUBLISH','DAYS_LAST_PHONE_CHANGE']].describe()

Note: We can create a new column based on DAYS_BIRTH to show the age of the applicant for better readability and then we can drop the DAYS_BIRTH column. Similarly we can convert the other DAYS columns to represent the value in years.


In [None]:
application_df['AGE'] = application_df['DAYS_BIRTH'] // 365
application_df.drop('DAYS_BIRTH', axis = 1, inplace = True)

In [None]:
application_df['YEAR_IN_SERVICE'] = application_df['DAYS_EMPLOYED'] // 365
application_df.drop('DAYS_EMPLOYED', axis = 1, inplace = True)

In [None]:
application_df['BANK_MEMBERSHIP_DURATION'] = application_df['DAYS_REGISTRATION'] // 365
application_df.drop('DAYS_REGISTRATION', axis = 1, inplace = True)

In [None]:
application_df.head()

2. We can also verify some of the categorical variables.

- CODE_GENDER

In [None]:
application_df['CODE_GENDER'].value_counts()

In [None]:
# Get rid of improper value XNA by replacing it with NaN - not using mode as that would be imputation

application_df['CODE_GENDER'] = application_df['CODE_GENDER'].replace('XNA',np.nan)

In [None]:
# Verify again

application_df['CODE_GENDER'].value_counts()

In [None]:
# Let's convert all amount related columns into lakhs

col_list = ['AMT_INCOME_TOTAL','AMT_CREDIT','AMT_ANNUITY','AMT_GOODS_PRICE']

for col in col_list:
    application_df[col] = application_df[col]/100000

#### Outlier Analysis -

Our data may contain exceptionally low or high values, termed outliers. It is important to identify such data points and treat them to avoid wrong interpretation. We are going to consider the columns below for outlier analysis. The DAYS columns were replaced above by their values in years, so the derived columns are used.

- AMT_INCOME_TOTAL
- AMT_CREDIT
- AMT_ANNUITY
- AMT_GOODS_PRICE
- AGE
- YEAR_IN_SERVICE
- BANK_MEMBERSHIP_DURATION

In [None]:
# Defining a function to plot outliers 

def outlier_plot(var,title,label):
    
    plt.figure(figsize = [8,5])
    plt.title(title, fontdict={'fontsize': 12, 'fontweight' : 5, 'color' : 'Brown'})
    sns.boxplot(y = var)
    plt.ylabel(label, fontdict={'fontsize': 10, 'fontweight' : 5, 'color' : 'Grey'})
    plt.show()


In [None]:
# Plotting boxplot on AMT_INCOME_TOTAL for outlier analysis

var = application_df['AMT_INCOME_TOTAL']
title = "Client's income"
label = 'Income in Lakhs'

outlier_plot(var,title,label)

- AMT_INCOME_TOTAL (income of the client) shows that some applicants have a very high income compared to others.

In [None]:
# Describe to check the summary

application_df['AMT_INCOME_TOTAL'].describe()

- There is a huge difference between the 75th percentile and the maximum value. Let's print the quantiles to check the difference between the 0.95 or 0.99 quantile and the maximum value.

In [None]:
# print the quantile (0.5, 0.7, 0.9, 0.95 and 0.99) of AMT_INCOME_TOTAL

application_df['AMT_INCOME_TOTAL'].quantile([0.5, 0.7, 0.9, 0.95, 0.99])

- AMT_INCOME_TOTAL - There is a huge difference between the 0.99 quantile and the maximum value, so there are definitely outliers. Since income varies from person to person, it would be good to decide on a cap value here and exclude the very high incomes.
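One way to apply such a cap is to clip values at a chosen quantile. The sketch below is illustrative only: the helper name `cap_at_quantile`, the toy values and the 0.8 threshold are assumptions made for the example, not choices made by this analysis.

```python
import pandas as pd

def cap_at_quantile(series, q=0.99):
    """Return a copy with values above the q-th quantile clipped to it."""
    cap = series.quantile(q)
    return series.clip(upper=cap)

# Toy incomes in lakhs with one extreme value
income = pd.Series([1.0, 1.5, 2.0, 2.5, 3.0, 1170.0])
capped = cap_at_quantile(income, q=0.8)

print(capped.tolist())
```

Clipping keeps every row and only pulls the extreme values down to the cap, whereas filtering would remove those applicants from the analysis altogether.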

In [None]:
# Plotting boxplot on AMT_CREDIT for outlier analysis

var = application_df['AMT_CREDIT']
title = "Credit amount of the loan"
label = "Amount in Lakhs"

outlier_plot(var,title,label)

- AMT_CREDIT (credit amount of the loan) has some outliers. Since the credit amount varies from person to person based on the loan applied for, eligibility and other factors, this is acceptable.
- Most applications have a credit amount in the lower range, below 5 lakhs.


In [None]:
# Describe to check the summary

application_df['AMT_CREDIT'].describe()

- The value rises after the 75th percentile, but not very sharply. Let's check the quantiles.

In [None]:
# print the quantile (0.5, 0.7, 0.9, 0.95 and 0.99) of AMT_CREDIT

application_df['AMT_CREDIT'].quantile([0.5, 0.7, 0.9, 0.95, 0.99])

- AMT_CREDIT - There are some high values beyond the 0.99 quantile, but they are not significantly high. We can replace them with the median.

In [None]:
# Plotting boxplot on AMT_ANNUITY for outlier analysis

var = application_df['AMT_ANNUITY']
title = "Loan annuity"
label = "Loan Annuity in Lakhs"

outlier_plot(var,title,label)

- AMT_ANNUITY (loan annuity) also has some outliers, but the values are fairly continuous. There is no sudden significant rise.

In [None]:
# Describe to check the summary

application_df['AMT_ANNUITY'].describe()

- Since there is no huge difference between the 75th percentile and the maximum value, and the mean and median are close, we can impute the outliers with the median.

In [None]:
# Plotting boxplot on AMT_GOODS_PRICE for outlier analysis

var = application_df['AMT_GOODS_PRICE']
title = "Goods Price"
label = "Amount in Lakhs"

outlier_plot(var,title,label)

In [None]:
# Describe to check the summary

application_df['AMT_GOODS_PRICE'].describe()

In [None]:
# Print the quantiles (0.5, 0.7, 0.9, 0.95 and 0.99) of AMT_GOODS_PRICE

application_df['AMT_GOODS_PRICE'].quantile([0.5, 0.7, 0.9, 0.95, 0.99])

- The mean and median are not very different. Also, from the quantiles, the 0.99 quantile and the maximum value are not far apart. So, we can impute the outliers with the median.
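The median replacement suggested for several columns above could be sketched as follows. This is an assumption-laden illustration: the notebook does not define what counts as an outlier, so the usual boxplot rule (1.5 times the IQR beyond the quartiles) is assumed, and the helper name and toy values are hypothetical.

```python
import pandas as pd

def replace_outliers_with_median(series, k=1.5):
    """Replace values outside [Q1 - k*IQR, Q3 + k*IQR] with the median."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    is_outlier = (series < q1 - k * iqr) | (series > q3 + k * iqr)
    # mask() swaps in the median wherever the condition is True
    return series.mask(is_outlier, series.median())

# Toy goods prices in lakhs with one extreme value
goods_price = pd.Series([2.0, 3.0, 4.0, 5.0, 6.0, 40.0])
cleaned = replace_outliers_with_median(goods_price)

print(cleaned.tolist())
```

Here the median is computed on the full series, including the outliers; the median is robust enough that this rarely matters, which is part of why it is preferred over the mean for this purpose.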

In [None]:
# Plotting boxplot on AGE for outlier analysis

var = application_df['AGE']
title = "Client's age"
label = "Age in years"

outlier_plot(var,title,label)

- AGE - The client's age shows no outliers at all. No imputation or treatment is required.

In [None]:
# Plotting boxplot on YEAR_IN_SERVICE for outlier analysis

var = application_df['YEAR_IN_SERVICE']
title = "Employment duration"
label = "Years in Service"

outlier_plot(var,title,label)

- YEAR_IN_SERVICE (employment duration) has huge outliers, clearly visible in the boxplot. Some data points show close to 1000 years in service, which is impossible.

In [None]:
# Describe to check the summary

application_df['YEAR_IN_SERVICE'].describe()

- There is a huge difference between the 75th percentile and the maximum value, which also explains the gap between the mean and the median. Let's check the quantiles.

In [None]:
# print the quantile (0.5, 0.7, 0.9, 0.95 and 0.99) of YEAR_IN_SERVICE

application_df['YEAR_IN_SERVICE'].quantile([0.5, 0.7, 0.9, 0.95, 0.99])

In [None]:
application_df['YEAR_IN_SERVICE'].quantile([0.5, 0.7, 0.8,0.85, 0.9])

- For YEAR_IN_SERVICE, there is no difference between the 0.90 quantile and the maximum value, but there is a huge difference between the 0.80 and 0.90 quantiles. Close to 20% of the data is therefore incorrect and unreliable. We can cap the value at the 0.80 quantile in this case.

In [None]:
# Plotting boxplot on BANK_MEMBERSHIP_DURATION for outlier analysis

var = application_df['BANK_MEMBERSHIP_DURATION']
title = "Bank membership duration"
label = "Registered for in years"

outlier_plot(var,title,label)

- BANK_MEMBERSHIP_DURATION - Some applicants have been with the bank for a very long time, close to 70 years, which is rare but not impossible. Some people stay loyal to the same bank for a lifetime.

In [None]:
# Describe to check the summary

application_df['BANK_MEMBERSHIP_DURATION'].describe()

- We don't see much difference between the mean and the median. So, we can replace the outliers with the median.

#### Binning

1. We may want to bin applicants' ages into categories to draw insights, such as whether loan defaulters fall mostly into certain age groups, or which age groups are more likely to repay on time.

In [None]:
# Check the AGE summary - AGE was derived above by dividing DAYS_BIRTH by 365

application_df['AGE'].describe()

In [None]:
# Binning AGE based on above summary

bins = [0,20,30,40,50,60,100]
labels = ['Below 20','20-30','30-40','40-50','50-60','Above 60']
application_df['AGE_GROUP'] = pd.cut(application_df['AGE'], bins = bins, labels = labels )


In [None]:
# Checking the values

application_df['AGE_GROUP'].value_counts().plot(kind='bar')
plt.title("No. of Loan Applicants Vs Age Group\n", fontdict={'fontsize': 20, 'fontweight' : 5, 'color' : 'Brown'})
plt.ylabel('No. of applicants', fontdict={'fontsize': 12, 'fontweight' : 5, 'color' : 'Grey'})
plt.xlabel('Age Group', fontdict={'fontsize': 12, 'fontweight' : 5, 'color' : 'Grey'})
plt.xticks(rotation=30)
plt.show()

- The largest number of loan applications comes from the 30-40 age group, and there are almost no applications from the below-20 group (which is understandable, as this group is much less likely to have an income).
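When reading these groups it helps to know how `pd.cut` treats values that sit exactly on a bin edge. The sketch below reuses the bins and labels from the cell above on a handful of made-up ages; with the default `right=True`, each bin includes its right edge and excludes its left edge.

```python
import pandas as pd

# Made-up ages chosen to sit on and just past the bin edges
ages = pd.Series([20, 21, 30, 31, 60, 61])
bins = [0, 20, 30, 40, 50, 60, 100]
labels = ['Below 20', '20-30', '30-40', '40-50', '50-60', 'Above 60']

# Default right=True: each bin is (left, right], so 30 lands in '20-30'
groups = pd.cut(ages, bins=bins, labels=labels)

print(groups.tolist())
```

So an applicant aged exactly 20 is labelled 'Below 20', and one aged exactly 30 is labelled '20-30'. Passing `right=False` would shift those boundary ages into the next group instead.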

2. Let's also bin the AMT_INCOME_TOTAL to categorize the total income of the applicants. 

Note: AMT_INCOME_TOTAL was already converted to lakhs above, which makes the values easier to read.

In [None]:
# Check the total income summary - values are already in lakhs (divided by 100,000 above)

application_df['AMT_INCOME_TOTAL'].describe()

In [None]:
# Binning AMT_INCOME_TOTAL based on above summary

bins = [0,1,2,5,10,20,50,1000]
labels = ['Upto 1L','1-2L','2-5L','5-10L','10-20L','20-50L','50L above']
application_df['INCOME_GROUP'] = pd.cut(application_df['AMT_INCOME_TOTAL'], bins = bins, labels = labels )


In [None]:
# Checking the values

application_df['INCOME_GROUP'].value_counts().plot(kind='bar')
plt.title("Number of Applications Vs Income Group\n", fontdict={'fontsize': 20, 'fontweight' : 5, 'color' : 'Brown'})
plt.ylabel('No. of applicants', fontdict={'fontsize': 12, 'fontweight' : 5, 'color' : 'Grey'})
plt.xlabel('Income Group', fontdict={'fontsize': 12, 'fontweight' : 5, 'color' : 'Grey'})
plt.xticks(rotation=30)
plt.show()

- Most loan applicants are from the lower income groups, i.e. up to 5 lakhs. The bank should focus on this group. Also, we can cap the value at 20L.

3. We will also categorize the credit amount of the loan (AMT_CREDIT) column

In [None]:
# Check the credit amount of the loan - values are already in lakhs (divided by 100,000 above)

application_df['AMT_CREDIT'].describe()

In [None]:
# Binning AMT_CREDIT based on above summary

bins = [0,1,5,10,20,30,40,50,100]
labels = ['Upto 1L','1-5L','5-10L','10-20L','20-30L','30-40L','40-50L','50L above']
application_df['CREDIT_GROUP'] = pd.cut(application_df['AMT_CREDIT'], bins = bins, labels = labels )

In [None]:
# Checking the values
application_df['CREDIT_GROUP'].value_counts().plot(kind='bar')
plt.title("Number of Applications Vs Credit Group\n", fontdict={'fontsize': 20, 'fontweight' : 5, 'color' : 'Brown'})
plt.ylabel('No. of applicants', fontdict={'fontsize': 12, 'fontweight' : 5, 'color' : 'Grey'})
plt.xlabel('Credit Group', fontdict={'fontsize': 12, 'fontweight' : 5, 'color' : 'Grey'})
plt.xticks(rotation=30)
plt.show()

- The number of applicants in the 1-20L credit range is very high. There are almost none above 30L.

### 2.4 Data Analysis

#### 2.4.1	Check the Imbalance percentage

What is Imbalance Percentage?

In our data set, there is a target column named 'TARGET'. It represents whether the client is a defaulter or not.
If we split the data set on this column and the distribution turns out to be 50-50, i.e. 50% of the applicants are defaulters and the other 50% are not, then the data set is BALANCED. In any other case, it is considered IMBALANCED.

In [None]:
# Checking imbalance percentage

application_df['TARGET'].value_counts(normalize = True)*100

In [None]:
# Plotting imbalance percentage

#Extracting the imbalance percentage
Repayment_Status = application_df['TARGET'].value_counts(normalize=True)*100

# Defining the x values
x= ['Others','Defaulters']

# Defining the y ticks
axes= plt.axes()
axes.set_ylim([0,100])
axes.set_yticks([10,20,30,40,50,60,70,80,90,100])

# Plotting barplot (recent seaborn releases require keyword arguments for x and y)
sns.barplot(x=x, y=Repayment_Status.values)

# Adding plot title, and x & y labels
plt.title('Imbalance Percentage\n', fontdict={'fontsize': 20, 'fontweight' : 5, 'color' : 'Brown'})
plt.xlabel("Borrower Category")
plt.ylabel("Percentage")

# Displaying the plot
plt.show()

- As per the above data, our data set is imbalanced, with almost 8% defaulters. The remaining 92% were able to repay their loans.
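The same imbalance can also be expressed as a single ratio of others to defaulters. The sketch below uses a synthetic target column with a 92/8 split chosen to mirror the percentages above; it does not read the real data.

```python
import pandas as pd

# Synthetic target: 92 non-defaulters (0) and 8 defaulters (1)
target = pd.Series([0] * 92 + [1] * 8)

counts = target.value_counts()
# Number of non-defaulters for every defaulter
imbalance_ratio = counts.loc[0] / counts.loc[1]

print(imbalance_ratio)
```

A ratio of about 11.5 means there are roughly eleven to twelve on-time payers for every defaulter, which is why raw counts alone can mislead in the plots that follow.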

#### 2.4.2	Segregate data based on TARGET column

Now, let's create 2 data sets to segregate our original data based on the TARGET column values to have defaulters in one dataframe and others in another.

In [None]:
# Creating data frame of Others

application_df0 = application_df[application_df['TARGET']==0]
application_df0.head()

In [None]:
# Creating data frame of Defaulters

application_df1 = application_df[application_df['TARGET']==1]
application_df1.head()

#### 2.4.3	Univariate Analysis

##### Categorical Variable Analysis

We will plot graphs of the below categorical variables to draw inferences-

- NAME_CONTRACT_TYPE
- CODE_GENDER
- OCCUPATION_TYPE
- NAME_INCOME_TYPE
- NAME_EDUCATION_TYPE
- NAME_FAMILY_STATUS
- NAME_HOUSING_TYPE
- INCOME_GROUP
- AGE_GROUP

In [None]:
# Defining a function to plot univariate categorical variables

def univariate_categorical_plot(category1, category2, xlabel):
    
    plt.figure(figsize = [15,7])
    plt.subplot(1,2,1)
    sns.countplot(x=category1)
    plt.title('Defaulters\n', fontdict={'fontsize': 20, 'fontweight' : 5, 'color' : 'Brown'})
    plt.xlabel(xlabel)
    plt.xticks(rotation=45, ha='right')
    
    plt.subplot(1,2,2)
    sns.countplot(x=category2)
    plt.title('Others\n', fontdict={'fontsize': 20, 'fontweight' : 5, 'color' : 'Brown'})
    plt.xlabel(xlabel)
    plt.xticks(rotation=45, ha='right')
    
    plt.show()


In [None]:
# Defining a function to plot defaulter percentage against univariate categorical variable

def perc_defaulter(col1, col2, title, xlabel):
    
    tempdf = application_df[[col1,col2]].groupby([col2], as_index=False).mean()

    tempdf[col1] = tempdf[col1]*100
    tempdf.sort_values(by=col1, ascending=False, inplace=True)

    sns.barplot(x=col2, y = col1, data = tempdf)
    plt.title(title, fontdict={'fontsize': 20, 'fontweight' : 5, 'color' : 'Brown'})
    plt.xlabel(xlabel)
    plt.ylabel('Defaulter %')
    plt.xticks(rotation=45, ha='right')
    plt.show()
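
The function above relies on the fact that the mean of a 0/1 column is the share of ones, so the group-wise mean of TARGET multiplied by 100 is the default percentage per category. A small sketch on made-up data (the `GROUP` column and its values are hypothetical):

```python
import pandas as pd

# Made-up data: group A has 1 defaulter out of 4, group B has 2 out of 4
toy = pd.DataFrame({
    'TARGET': [1, 0, 0, 0, 1, 1, 0, 0],
    'GROUP': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
})

# Mean of a binary column = fraction of ones; times 100 gives a percentage
rate = toy.groupby('GROUP')['TARGET'].mean() * 100

print(rate)
```

Unlike the count plots, this percentage is not affected by how many applicants fall in each category, which makes categories of different sizes comparable.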


##### NAME_CONTRACT_TYPE

In [None]:
# Analyzing w.r.t Contract Type column

category1 = application_df1['NAME_CONTRACT_TYPE']
category2 = application_df0['NAME_CONTRACT_TYPE']
xlabel = 'Contract Type'

univariate_categorical_plot(category1, category2, xlabel)

- Cash loans are the most common contract type in both cases.

In [None]:
# Plot the percentage of defaulters in each category

col1 = 'TARGET'
col2 = 'NAME_CONTRACT_TYPE'
title = 'Contract Type Vs Default percentage\n'
xlabel = 'Contract Type'

perc_defaulter(col1, col2, title, xlabel)

- Cash loans are more likely to end in failed repayment.

##### CODE_GENDER

In [None]:
# Analyzing w.r.t Gender column

category1 = application_df1['CODE_GENDER']
category2 = application_df0['CODE_GENDER']
xlabel = 'Gender'

univariate_categorical_plot(category1, category2, xlabel)

- In both cases, there are more female clients than male clients. To say which gender is more likely to default, we need to look at the percentage of defaulters in each gender category.

In [None]:
# Plot the percentage of defaulters in each category

col1 = 'TARGET'
col2 = 'CODE_GENDER'
title = 'Gender Vs Default percentage\n'
xlabel = 'Gender'

perc_defaulter(col1, col2, title, xlabel)


- As seen from the above plot, males have a comparatively higher percentage of defaulters than females.

##### OCCUPATION_TYPE

In [None]:
# Analyzing w.r.t Occupation column

category1 = application_df1['OCCUPATION_TYPE']
category2 = application_df0['OCCUPATION_TYPE']
xlabel = 'Occupation'

univariate_categorical_plot(category1, category2, xlabel)


- In both the defaulters and others categories, laborers are the largest group. Let's check the percentage of defaulters in each occupation category.

In [None]:

# Plot the percentage of defaulters in each category

col1 = 'TARGET'
col2 = 'OCCUPATION_TYPE'
title = 'Occupation Vs Default percentage\n'
xlabel = 'Occupation'

perc_defaulter(col1, col2, title, xlabel)

- As we see, low-skill laborers are most likely to default.

##### NAME_INCOME_TYPE

In [None]:
# Analyzing w.r.t Income Type column

category1 = application_df1['NAME_INCOME_TYPE']
category2 = application_df0['NAME_INCOME_TYPE']
xlabel = 'Income Source'

univariate_categorical_plot(category1, category2, xlabel)


- The numbers are high for Working people in both cases. Let's see the percentage of defaulters for these categories.

In [None]:
# Plot the percentage of defaulters in each category

col1 = 'TARGET'
col2 = 'NAME_INCOME_TYPE'
title = 'Income type Vs Default percentage\n'
xlabel = 'Income type'

perc_defaulter(col1, col2, title, xlabel)


- People in the Maternity leave and Unemployed categories are more likely to fail to repay.
- The Businessman and Student categories have the lowest chance of defaulting.

##### NAME_EDUCATION_TYPE

In [None]:
# Analyzing w.r.t Education Type column

category1 = application_df1['NAME_EDUCATION_TYPE']
category2 = application_df0['NAME_EDUCATION_TYPE']
xlabel = 'Education'

univariate_categorical_plot(category1, category2, xlabel)


- People with secondary education have the highest count in both cases. Let's plot the percentage graph.

In [None]:
# Plot the percentage of defaulters in each category

col1 = 'TARGET'
col2 = 'NAME_EDUCATION_TYPE'
title = 'Education Vs Default percentage\n'
xlabel = 'Education'

perc_defaulter(col1, col2, title, xlabel)


- People with lower secondary education are more likely to fail repayment.
- People with an academic degree or higher education, however, are mostly able to repay on time.

##### NAME_FAMILY_STATUS

In [None]:
# Analyzing w.r.t Family Status Type column

category1 = application_df1['NAME_FAMILY_STATUS']
category2 = application_df0['NAME_FAMILY_STATUS']
xlabel = 'Family Status'

univariate_categorical_plot(category1, category2, xlabel)

- Married applicants are the most numerous in both cases.

In [None]:
# Plot the percentage of defaulters in each category

col1 = 'TARGET'
col2 = 'NAME_FAMILY_STATUS'
title = 'Family Status Vs Default percentage\n'
xlabel = 'Family Status'

perc_defaulter(col1, col2, title, xlabel)


- People in a civil marriage or those who are single are more likely to default.

##### NAME_HOUSING_TYPE

In [None]:
# Analyzing w.r.t Housing Type column


category1 = application_df1['NAME_HOUSING_TYPE']
category2 = application_df0['NAME_HOUSING_TYPE']
xlabel = 'Housing Type'

univariate_categorical_plot(category1, category2, xlabel)

- The majority of applicants live in a house/apartment. Very few live in office or co-op apartments.

In [None]:
# Plot the percentage of defaulters in each category

col1 = 'TARGET'
col2 = 'NAME_HOUSING_TYPE'
title = 'House Type Vs Default percentage\n'
xlabel = 'House Type'

perc_defaulter(col1, col2, title, xlabel)

- Applicants who are likely to default are mostly those staying in a rented apartment or with parents, compared to other housing types.

##### INCOME_GROUP

In [None]:
# Analyzing w.r.t Income Group column

category1 = application_df1['INCOME_GROUP']
category2 = application_df0['INCOME_GROUP']
xlabel = 'Income Group'

univariate_categorical_plot(category1, category2, xlabel)


- The 1-2 lakh income range has the highest number of defaulters.

In [None]:
# Plot the percentage of defaulters in each category

col1 = 'TARGET'
col2 = 'INCOME_GROUP'
title = 'Income Group Vs Default percentage\n'
xlabel = 'Income Group'

perc_defaulter(col1, col2, title, xlabel)

- The lower the income group, the higher the chance of defaulting.
- The income group with the highest default percentage is 1-2 lakhs.

In [None]:
# Analyzing w.r.t Age Group column

category1 = application_df1['AGE_GROUP']
category2 = application_df0['AGE_GROUP']
xlabel = 'Age Group'

univariate_categorical_plot(category1, category2, xlabel)

- From these plots it seems people in the age range 30-40 are more likely to default.

In [None]:
# Plot the percentage of defaulters in each category

col1 = 'TARGET'
col2 = 'AGE_GROUP'
title = 'Age Group Vs Default percentage\n'
xlabel = 'Age Group'

perc_defaulter(col1, col2, title, xlabel)

- However, this plot shows that the percentage of loan defaults is highest in the 20-30 age group.
- The loan default percentage decreases as age increases.
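
The count plot and the percentage plot can point to different age groups because a large group can contain many defaulters while still having a low default rate. The sketch below uses made-up numbers (`toy_df` is hypothetical) to show the count of defaulters and the default rate side by side.

```python
import pandas as pd

# Made-up sample: the 30-40 group is larger, the 20-30 group is riskier
toy_df = pd.DataFrame({
    'AGE_GROUP': ['20-30'] * 4 + ['30-40'] * 10,
    'TARGET':    [1, 1, 0, 0]  + [1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
})

summary = toy_df.groupby('AGE_GROUP')['TARGET'].agg(
    applicants='size',    # number of applicants in the group
    defaulters='sum',     # number of defaulters (sum of the 0/1 flag)
    default_rate='mean',  # share of defaulters in the group
)
summary['default_rate'] = summary['default_rate'] * 100

print(summary)
```

Here the 30-40 group has more defaulters in absolute terms, but the 20-30 group has the higher default rate, which is the same pattern as in the two plots above.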

##### Numeric variable Analysis

Here we consider the numeric columns below and draw our conclusions from them.

- AMT_INCOME_TOTAL
- AMT_CREDIT 
- AMT_ANNUITY 
- AMT_GOODS_PRICE
- CNT_CHILDREN 
- DAYS_BIRTH

In [None]:
# Defining a function to plot univariate numerical columns

def univariate_numerical_plots(col1, col2, title, xlabel):
    # KDE curves only; sns.distplot is deprecated in recent seaborn releases
    sns.kdeplot(x=col1, label='Defaulters')
    sns.kdeplot(x=col2, label='Others')
    plt.title(title, fontdict={'fontsize': 20, 'fontweight' : 5, 'color' : 'Brown'})
    plt.xlabel(xlabel)
    plt.legend()
    plt.show()

In [None]:
# Plotting AMT_INCOME_TOTAL 

col1 = application_df1['AMT_INCOME_TOTAL']
col2 = application_df0['AMT_INCOME_TOTAL']
title = 'Total Income of the client\n'
xlabel = 'Total income in lakhs'

univariate_numerical_plots(col1, col2, title, xlabel)


- Most of the applicants are in low income range.

In [None]:
# Plotting AMT_CREDIT 

col1 = application_df1['AMT_CREDIT']
col2 = application_df0['AMT_CREDIT']
title = 'Credit amount\n'
xlabel = 'Credit amount in lakhs'

univariate_numerical_plots(col1, col2, title, xlabel)


- Most of the loans are given with credit amount less than 10 lakhs.

In [None]:
# Plotting AMT_ANNUITY 

col1 = application_df1['AMT_ANNUITY']
col2 = application_df0['AMT_ANNUITY']
title = 'Annuity\n'
xlabel = 'Annuity in lakhs'

univariate_numerical_plots(col1, col2, title, xlabel)


- Most loan annuities are below 50,000 (0.5 lakh).

In [None]:
# Plotting AMT_GOODS_PRICE 

col1 = application_df1['AMT_GOODS_PRICE']
col2 = application_df0['AMT_GOODS_PRICE']
title = 'Goods Price\n'
xlabel = 'Goods price in lakhs'

univariate_numerical_plots(col1, col2, title, xlabel)


- The goods price is mostly below 15 lakhs.

In [None]:
# Plotting CNT_CHILDREN 

sns.kdeplot(x=application_df1['CNT_CHILDREN'], label='Defaulters')
sns.kdeplot(x=application_df0['CNT_CHILDREN'], label='Others')
xlabel = 'Children'
ticks = [0,1,2,3,4,5,6,7,8,9,10,11]
plt.xticks(ticks)
plt.legend()
plt.title('Count of children\n', fontdict={'fontsize': 20, 'fontweight' : 5, 'color' : 'Brown'})
plt.show()

- The majority of applicants have no children. Those who do mostly have 3 or fewer.

In [None]:
# Plotting AGE (the applicant's age in years)

col1 = application_df1['AGE']
col2 = application_df0['AGE']
title = 'Age\n'
xlabel = 'Age in years'

univariate_numerical_plots(col1, col2, title, xlabel)


- Defaulters are more in 25-40 age group. Above 40, the number of defaulters tends to decrease.

#### 2.4.4	Correlation

In [None]:
corr_df1 = application_df1[['AMT_INCOME_TOTAL','AMT_CREDIT','AMT_ANNUITY','AMT_GOODS_PRICE','AGE','YEAR_IN_SERVICE','CNT_CHILDREN']].corr()
corr_df1

In [None]:
corr_df0 = application_df0[['AMT_INCOME_TOTAL','AMT_CREDIT','AMT_ANNUITY','AMT_GOODS_PRICE','AGE','YEAR_IN_SERVICE','CNT_CHILDREN']].corr()
corr_df0

In [None]:
# Plot correlation heatmap for numerical variables

plt.figure(figsize=[20,10])

plt.subplot(1,2,1)
sns.heatmap(corr_df1, cmap="YlGnBu", annot = True)
plt.title('Correlation - Defaulters\n', fontdict={'fontsize': 20, 'fontweight' : 5, 'color' : 'Brown'})
plt.xticks(rotation=45)

plt.subplot(1,2,2)
sns.heatmap(corr_df0, cmap="YlGnBu", annot = True)
plt.title('Correlation - Others\n', fontdict={'fontsize': 20, 'fontweight' : 5, 'color' : 'Brown'})
plt.xticks(rotation=45)

plt.show()

- AMT_CREDIT is strongly correlated with AMT_ANNUITY and AMT_GOODS_PRICE in both cases.
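
Besides reading the heatmap by eye, the strongest pairs can be listed directly from the correlation matrix. The following is a minimal sketch on synthetic data; the values in `toy_df` are randomly generated and only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the numeric columns; the values are random, not real data
rng = np.random.default_rng(0)
credit = rng.uniform(1, 10, size=200)
toy_df = pd.DataFrame({
    'AMT_CREDIT': credit,
    'AMT_GOODS_PRICE': credit * 0.9 + rng.normal(0, 0.1, size=200),
    'AMT_INCOME_TOTAL': rng.uniform(1, 5, size=200),
})

corr = toy_df.corr().abs()

# Keep only the upper triangle (k=1 drops the diagonal) so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
top_pairs = upper.stack().dropna().sort_values(ascending=False)

print(top_pairs)
```

The same steps could be applied to `corr_df1` and `corr_df0` to compare the ranking of pairs between defaulters and others.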

#### 2.4.5	Bivariate Analysis

We will perform 3 types of bivariate analysis to understand the data better and draw some important insights.

- Categorical - Categorical Analysis
- Categorical - Continuous Analysis
- Continuous - Continuous Analysis

##### Categorical - Categorical Analysis

Columns considered -

- NAME_CONTRACT_TYPE - CODE_GENDER
- NAME_INCOME_TYPE - NAME_CONTRACT_TYPE
- INCOME_GROUP - CODE_GENDER
- CODE_GENDER - FLAG_OWN_REALTY
- NAME_HOUSING_TYPE - FLAG_OWN_REALTY
- NAME_HOUSING_TYPE - NAME_FAMILY_STATUS

In [None]:
# Defining function for categorical - categorical variable plotting

def cat_cat_plot(var1, var2, label, legend):
    
    plt.figure(figsize=[20,5])
    
    plt.subplot(1,2,1)
    plt.title('Defaulters\n', fontdict={'fontsize': 20, 'fontweight' : 5, 'color' : 'Brown'})
    sns.countplot(x=var1, hue=var2, data=application_df1)
    plt.xlabel(label)
    plt.xticks(rotation = 45)
    plt.legend(title=legend, loc='upper right')

    plt.subplot(1,2,2)
    plt.title('Others\n', fontdict={'fontsize': 20, 'fontweight' : 5, 'color' : 'Brown'})
    sns.countplot(x=var1, hue=var2, data=application_df0)
    plt.xlabel(label)
    plt.xticks(rotation = 45)
    plt.legend(title=legend, loc='upper right')
    
    plt.show()

In [None]:
# NAME_CONTRACT_TYPE - CODE_GENDER

var1 = 'NAME_CONTRACT_TYPE'
var2 = 'CODE_GENDER'
label = 'Contract Type'
legend = 'Gender'

cat_cat_plot(var1, var2, label, legend)

- Most applicants have applied for cash loans.
- There are more female applicants than male applicants.


In [None]:
# NAME_INCOME_TYPE - NAME_CONTRACT_TYPE

var1 = 'NAME_INCOME_TYPE'
var2 = 'NAME_CONTRACT_TYPE'
label = 'Income Type'
legend = 'Contract type'

cat_cat_plot(var1, var2, label, legend)

- Across income types, cash loan seems to be the popular contract type.
- Most of the people who have taken loans are working class and they have taken cash loans mostly compared to revolving loans.
- People who have taken cash loans are likely to default as well

In [None]:
# INCOME_GROUP - CODE_GENDER

var1 = 'INCOME_GROUP'
var2 = 'CODE_GENDER'
label = 'Income Group'
legend = 'Gender'

cat_cat_plot(var1, var2, label, legend)

- Female applicants have repaid on time more often than male applicants.

In [None]:
# CODE_GENDER - FLAG_OWN_REALTY

var1 = 'CODE_GENDER'
var2 = 'FLAG_OWN_REALTY'
label = 'Gender'
legend = 'Own house?'

cat_cat_plot(var1, var2, label, legend)

- Female borrowers are more likely to own flat/house.
- Since the female count is higher in both cases, we cannot be sure from the counts alone that they are more likely to default.
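
When one group is simply larger, raw counts can be misleading. One way to compare groups on equal terms is a row-normalised cross-tabulation. The sketch below uses made-up data (`toy_df` is hypothetical) and `pd.crosstab` with `normalize='index'`.

```python
import pandas as pd

# Made-up sample: more female applicants overall
toy_df = pd.DataFrame({
    'CODE_GENDER': ['F'] * 10 + ['M'] * 4,
    'TARGET':      [1, 1, 1, 0, 0, 0, 0, 0, 0, 0] + [1, 1, 0, 0],
})

# Raw counts are dominated by group size
counts = pd.crosstab(toy_df['CODE_GENDER'], toy_df['TARGET'])

# Row-normalised shares: each row sums to 1, so groups of different size are comparable
shares = pd.crosstab(toy_df['CODE_GENDER'], toy_df['TARGET'], normalize='index')

print(counts)
print(shares)
```

In this toy sample the female group has more defaulters in absolute numbers, yet a lower share of defaulters than the male group.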

In [None]:
# NAME_HOUSING_TYPE - FLAG_OWN_REALTY

var1 = 'NAME_HOUSING_TYPE'
var2 = 'FLAG_OWN_REALTY'
label = 'Housing Type'
legend = 'Own house?'

cat_cat_plot(var1, var2, label, legend)

- People who own a house/flat and are staying in own property are likely to make repayments.


In [None]:
# NAME_HOUSING_TYPE - NAME_FAMILY_STATUS

var1 = 'NAME_HOUSING_TYPE'
var2 = 'NAME_FAMILY_STATUS'
label = 'Housing Type'
legend = 'Family Status'

cat_cat_plot(var1, var2, label, legend)

- Married loan applicants are mostly staying in house/apartment.
- Married people staying in house/apartments are the group with maximum number of loan applications.
- Single and civil marriage applicants are more likely to default.


##### Categorical - Continuous Analysis

Columns considered -

- NAME_CONTRACT_TYPE - AMT_CREDIT
- NAME_INCOME_TYPE - AMT_CREDIT
- NAME_EDUCATION_TYPE - AMT_ANNUITY
- NAME_HOUSING_TYPE - AMT_CREDIT
- OCCUPATION_TYPE - AMT_CREDIT

In [None]:
# Defining function for categorical - Continuous variable plotting

def cat_cont_plot(var1, var2, xlabel, ylabel):
    
    plt.figure(figsize=(20,5))
    plt.subplot(1,2,1)
    plt.title('Defaulters\n', fontdict={'fontsize': 20, 'fontweight' : 5, 'color' : 'Brown'})
    sns.boxplot(x=var1,y=var2, data=application_df1)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.xticks(rotation=45)
    
    plt.subplot(1,2,2)
    plt.title('Others\n', fontdict={'fontsize': 20, 'fontweight' : 5, 'color' : 'Brown'})
    sns.boxplot(x=var1,y=var2, data=application_df0)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.xticks(rotation=45)
    
    plt.show()

In [None]:
# NAME_CONTRACT_TYPE - AMT_CREDIT

var1 = 'NAME_CONTRACT_TYPE'
var2 = 'AMT_CREDIT'

xlabel = 'Contract Type'
ylabel = 'Credit Amount'

cat_cont_plot(var1, var2, xlabel, ylabel)

- Loan credit amount for cash loan is higher than that of revolving loans.
- Cash loan is favourite among all genders.

In [None]:
# NAME_INCOME_TYPE - AMT_CREDIT

var1 = 'NAME_INCOME_TYPE'
var2 = 'AMT_CREDIT'

xlabel = 'Income Type'
ylabel = 'Credit Amount'

cat_cont_plot(var1, var2, xlabel, ylabel)

- Loan amount taken by businessman is higher compared to the other income types.
- The median credit amount is somewhat similar for the working, commercial associate, state servant and pensioner income types.
- People with maternity leave income type tend to default with higher credit amount.

In [None]:
# NAME_EDUCATION_TYPE - AMT_ANNUITY

var1 = 'NAME_EDUCATION_TYPE'
var2 = 'AMT_ANNUITY'
xlabel = 'Education Type'
ylabel = 'Annuity'

cat_cont_plot(var1, var2, xlabel, ylabel)

- People having academic degree and higher education have more loan annuity amount compared to the other groups in both the default and non-default section.

In [None]:
# NAME_HOUSING_TYPE - AMT_CREDIT

var1 = 'NAME_HOUSING_TYPE'
var2 = 'AMT_CREDIT'
xlabel = 'Housing Type'
ylabel = 'Credit Amount'

cat_cont_plot(var1, var2, xlabel, ylabel)

- The loan credit amount is comparatively higher for people living in houses/apartments, municipal apartments and office apartments.

In [None]:
# OCCUPATION_TYPE - AMT_CREDIT

var1 = 'OCCUPATION_TYPE'
var2 = 'AMT_CREDIT'
xlabel = 'Occupation Type'
ylabel = 'Credit Amount'

cat_cont_plot(var1, var2, xlabel, ylabel)

- Managers and Accountants have comparatively higher credit amount.
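
The box plots can be backed up with numbers by computing the quartiles per category. This is a small sketch on made-up values (`toy_df` is hypothetical); with the real data the same call could be made on `application_df1` or `application_df0`.

```python
import pandas as pd

# Made-up credit amounts in lakhs for two occupation types
toy_df = pd.DataFrame({
    'OCCUPATION_TYPE': ['Managers'] * 3 + ['Drivers'] * 3,
    'AMT_CREDIT':      [8.0, 10.0, 30.0] + [3.0, 4.0, 5.0],
})

# Quartiles per category: the same numbers a box plot draws as the box and its median line
credit_quartiles = (
    toy_df.groupby('OCCUPATION_TYPE')['AMT_CREDIT']
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)

print(credit_quartiles)
```

Comparing medians rather than means keeps a few very large loans from distorting the picture.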

##### Continuous - Continuous Analysis

Here we have considered the below continuous value columns for plotting the graphs -

- AMT_CREDIT
- AMT_ANNUITY
- AMT_GOODS_PRICE
- AMT_INCOME_TOTAL

In [None]:
# Defining function for Continuous - continuous plot

def cont_cont_plot(col1, col2, xlabel, ylabel):
    
    plt.figure(figsize=[20,5])
    plt.subplot(1,2,1)
    plt.title('Defaulters\n', fontdict={'fontsize': 20, 'fontweight' : 5, 'color' : 'Brown'})
    sns.scatterplot(x = col1, y = col2, data = application_df1)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.xticks(rotation=45)

    plt.subplot(1,2,2)
    plt.title('Others\n', fontdict={'fontsize': 20, 'fontweight' : 5, 'color' : 'Brown'})
    sns.scatterplot(x = col1, y = col2, data = application_df0)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.xticks(rotation=45)
    
    plt.show()

In [None]:
# AMT_CREDIT-AMT_ANNUITY

col1 = 'AMT_CREDIT'
col2 = 'AMT_ANNUITY'
xlabel = 'Credit Amount'
ylabel = 'Annuity'

cont_cont_plot(col1, col2, xlabel, ylabel)

- AMT_CREDIT and AMT_ANNUITY seem to be correlated.

In [None]:
# AMT_CREDIT-AMT_GOODS_PRICE

col1 = 'AMT_CREDIT'
col2 = 'AMT_GOODS_PRICE'
xlabel = 'Credit Amount'
ylabel = 'Goods Price'

cont_cont_plot(col1, col2, xlabel, ylabel)

- AMT_CREDIT and AMT_GOODS_PRICE seem to be correlated.

In [None]:
# AMT_CREDIT-AMT_INCOME_TOTAL

col1 = 'AMT_CREDIT'
col2 = 'AMT_INCOME_TOTAL'
xlabel = 'Credit Amount'
ylabel = 'Total income'

cont_cont_plot(col1, col2, xlabel, ylabel)

- AMT_CREDIT and AMT_INCOME_TOTAL do not seem to be correlated.

## 3. PREVIOUS LOAN APPLICATION DATA

We also have the previous application histories of the applicants. Let's explore that and see if we could find any trend.

### 3.1 Read and Inspect the Previous Data

In [None]:
# Read the previous data file

previous_df = pd.read_csv('../input/bank-loan-default/previous_application.csv')
previous_df.head()

In [None]:
# Check the number of rows and columns in the dataframe
previous_df.shape

In [None]:
# Check the column-wise info of the dataframe
previous_df.info(verbose=True)

In [None]:
# Check the summary for the numeric columns
previous_df.describe()

### 3.2 Data Cleaning & Imputation - Suggestions

In [None]:
# Check for missing values in percentage 

round(100 * previous_df.isnull().mean(),2)

In [None]:
# Extract the column names with more than 40% missing data
# (Series.iteritems() was removed in pandas 2.0, so we filter the Series directly)

missing_pct = round(100 * previous_df.isnull().mean(), 2)
cols_to_drop = missing_pct[missing_pct > 40].index.tolist()
cols_to_drop

In [None]:
# Remove the columns with more than 40% missing values

previous_df.drop(cols_to_drop, axis = 1, inplace = True)

# Check the shape 

previous_df.shape

In [None]:
# Check the missing values for remaining
round(100 * previous_df.isnull().mean(),2)

- Missing values in AMT_ANNUITY, AMT_GOODS_PRICE and CNT_PAYMENT can be replaced with the mean or the median.
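
Which of the two to use depends on how skewed the column is. A quick sketch with made-up numbers shows why the median is often the safer choice when a column has a few very large values.

```python
import numpy as np
import pandas as pd

# Made-up annuity values with one missing entry and one extreme value
annuity = pd.Series([1.0, 1.2, 1.1, 0.9, np.nan, 25.0], name='AMT_ANNUITY')

# The single extreme value pulls the mean up; the median stays near the typical value
print('mean  :', annuity.mean())
print('median:', annuity.median())

# Fill the missing entry with the median
filled = annuity.fillna(annuity.median())
print(filled)
```

With a value like 25.0 in the column, mean imputation would insert a number that is far from a typical annuity, whereas the median stays close to the bulk of the data.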

In [None]:
previous_df[['AMT_ANNUITY','AMT_GOODS_PRICE','CNT_PAYMENT']].describe()

In [None]:
col_list = ['AMT_ANNUITY','AMT_GOODS_PRICE','CNT_PAYMENT']

for col in col_list:
    previous_df[col] = previous_df[col].fillna(previous_df[col].median())

In [None]:
# For less than 1% missing values, we can delete the rows
previous_df.dropna(axis=0, how='any',inplace = True)

In [None]:
# Check the missing values for remaining
round(100 * previous_df.isnull().mean(),2)

### 3.3 Data Standardisation

In [None]:
# Convert all amount-related columns to lakhs

col_list = ['AMT_ANNUITY','AMT_APPLICATION','AMT_CREDIT','AMT_GOODS_PRICE']

for col in col_list:
    previous_df[col] = previous_df[col]/100000

### 3.4 Outlier Analysis

- For the columns AMT_ANNUITY, AMT_GOODS_PRICE and CNT_PAYMENT, let's plot the outliers for better understanding.

In [None]:
# Box plot AMT_ANNUITY

var = previous_df['AMT_ANNUITY']
title = 'Annuity Amount\n'
label = 'Amount in lakhs'

outlier_plot(var,title,label)

- Annuity seems to have some higher data points.

In [None]:
# Check Summary

previous_df['AMT_ANNUITY'].describe()

In [None]:
# Check the quantiles

previous_df['AMT_ANNUITY'].quantile([0.5,0.7,0.90,0.95,0.99])

- The outliers can be capped at the 0.99 quantile.
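
The notebook only suggests capping and does not apply it. As an illustration of how such a cap could be applied, the sketch below clips a made-up series at its 0.99 quantile using `Series.clip`.

```python
import pandas as pd

# Made-up values: 99 ordinary points and one extreme outlier
annuity = pd.Series(list(range(1, 100)) + [10_000], dtype=float)

# Value at the 0.99 quantile, used as the upper cap
cap = annuity.quantile(0.99)

# Values above the cap are replaced by the cap; all other values are unchanged
capped = annuity.clip(upper=cap)

print('cap:', cap)
print('max before:', annuity.max(), '| max after:', capped.max())
```

Capping keeps every row in the data, unlike dropping the outliers, so the row count stays the same.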

In [None]:
# Box plot AMT_GOODS_PRICE

var = previous_df['AMT_GOODS_PRICE']
title = 'Goods Price\n'
label = 'Amount in lakhs'

outlier_plot(var,title,label)

In [None]:
# Check summary

previous_df['AMT_GOODS_PRICE'].describe()

In [None]:
# Check the quantiles

previous_df['AMT_GOODS_PRICE'].quantile([0.5,0.7,0.90,0.95,0.99])

- There are some very highly priced goods above the 0.95 quantile. Here we can set a cap value to ignore very high goods prices.

In [None]:
# Box plot CNT_PAYMENT

var = previous_df['CNT_PAYMENT']
title = 'Term of previous credit\n'
label = 'Term'

outlier_plot(var,title,label)

In [None]:
# Check summary

previous_df['CNT_PAYMENT'].describe()

## 4. MERGING DATA SETS

Merge the application data frame and the previous application data frame.

### 4.1 Merging the data sets

In [None]:
# Merge both application_df and previous_df

finaldf = pd.merge(application_df, previous_df, on='SK_ID_CURR', how = 'inner')

# verify

finaldf.head()

In [None]:
# Check the column info

finaldf.info(verbose=True)

In [None]:
# Rename the duplicated columns

finaldf = finaldf.rename({'NAME_CONTRACT_TYPE_y':'NAME_CONTRACT_TYPE_PREV',
                         'AMT_ANNUITY_y':'AMT_ANNUITY_PREV',
                        'AMT_CREDIT_y':'AMT_CREDIT_PREV',
                         'AMT_GOODS_PRICE_y':'AMT_GOODS_PRICE_PREV',
                         'NAME_TYPE_SUITE_y':'NAME_TYPE_SUITE_PREV',
                         'NAME_TYPE_SUITE_x':'NAME_TYPE_SUITE_CURR',
                         'AMT_GOODS_PRICE_x':'AMT_GOODS_PRICE_CURR',
                         'AMT_ANNUITY_x':'AMT_ANNUITY_CURR',
                         'AMT_CREDIT_x':'AMT_CREDIT_CURR',
                         'NAME_CONTRACT_TYPE_x':'NAME_CONTRACT_TYPE_CURR'}, axis=1)
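
Two things are worth keeping in mind about this merge: one applicant can have several previous applications, so rows of the current application are repeated, and overlapping column names get the default `_x`/`_y` suffixes. The sketch below uses two tiny made-up frames (`curr` and `prev` are hypothetical) to show both effects, and uses the `suffixes` argument of `pd.merge` as an alternative to renaming afterwards.

```python
import pandas as pd

# Made-up current applications (one row per applicant)
curr = pd.DataFrame({
    'SK_ID_CURR': [1, 2, 3],
    'AMT_CREDIT': [5.0, 2.0, 7.0],
    'TARGET':     [0, 1, 0],
})

# Made-up previous applications (applicant 1 has two, applicant 3 has none)
prev = pd.DataFrame({
    'SK_ID_CURR': [1, 1, 2],
    'AMT_CREDIT': [1.0, 3.0, 2.5],
})

merged = pd.merge(
    curr, prev,
    on='SK_ID_CURR',
    how='inner',
    suffixes=('_CURR', '_PREV'),  # name overlapping columns directly
    validate='one_to_many',       # raises if SK_ID_CURR is duplicated in curr
)

print(merged)
```

Because applicant 1 appears twice in the result, any count or sum of `TARGET` on the merged frame counts applications rather than distinct applicants.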

In [None]:
#Verify

finaldf.info(verbose=True)

In [None]:
# Remove unwanted columns

finaldf.drop(['REGION_POPULATION_RELATIVE','REGION_RATING_CLIENT', 'REGION_RATING_CLIENT_W_CITY', 
              'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION',
              'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY','REG_CITY_NOT_WORK_CITY',
              'LIVE_CITY_NOT_WORK_CITY','OBS_30_CNT_SOCIAL_CIRCLE','DEF_30_CNT_SOCIAL_CIRCLE','OBS_60_CNT_SOCIAL_CIRCLE',
              'DEF_60_CNT_SOCIAL_CIRCLE','DAYS_LAST_PHONE_CHANGE','AMT_REQ_CREDIT_BUREAU_HOUR','AMT_REQ_CREDIT_BUREAU_DAY',
              'AMT_REQ_CREDIT_BUREAU_WEEK','AMT_REQ_CREDIT_BUREAU_MON','AMT_REQ_CREDIT_BUREAU_QRT','AMT_REQ_CREDIT_BUREAU_YEAR',
             'FLAG_DOCUMENT_2','FLAG_DOCUMENT_3', 'FLAG_DOCUMENT_4','FLAG_DOCUMENT_5', 
              'FLAG_DOCUMENT_6','FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_8',
              'FLAG_DOCUMENT_9', 'FLAG_DOCUMENT_10','FLAG_DOCUMENT_11',
              'FLAG_DOCUMENT_12', 'FLAG_DOCUMENT_13','FLAG_DOCUMENT_14',
              'FLAG_DOCUMENT_15','FLAG_DOCUMENT_16','FLAG_DOCUMENT_17','FLAG_DOCUMENT_18','FLAG_DOCUMENT_19',
              'FLAG_DOCUMENT_20','FLAG_DOCUMENT_21','FLAG_MOBIL','FLAG_EMP_PHONE', 'FLAG_WORK_PHONE','FLAG_CONT_MOBILE', 
              'FLAG_PHONE','FLAG_EMAIL', 'FLAG_LAST_APPL_PER_CONTRACT',
              'NFLAG_LAST_APPL_IN_DAY', 'SELLERPLACE_AREA','WEEKDAY_APPR_PROCESS_START_x',
              'WEEKDAY_APPR_PROCESS_START_y','HOUR_APPR_PROCESS_START_x','HOUR_APPR_PROCESS_START_y'],axis=1,inplace=True)

In [None]:
#Verify

finaldf.info(verbose=True)

### 4.2 Imbalance Percentage

In [None]:
# Plotting imbalance percentage

#Extracting the imbalance percentage
Repayment_Status = finaldf['TARGET'].value_counts(normalize=True)*100

# Defining the x values
x= ['Others','Defaulters']

# Defining the y ticks
axes= plt.axes()
axes.set_ylim([0,100])
axes.set_yticks([10,20,30,40,50,60,70,80,90,100])

# Plotting barplot
sns.barplot(x=x, y=Repayment_Status.values)

# Adding plot title, and x & y labels
plt.title('Imbalance Percentage\n', fontdict={'fontsize': 20, 'fontweight' : 5, 'color' : 'Brown'})
plt.xlabel("Borrower Category")
plt.ylabel("Percentage")

# Displaying the plot
plt.show()

The merged data is also imbalanced.
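
The imbalance can also be expressed as a single ratio. The sketch below uses a made-up `TARGET` series and, as an assumption about how one might make the bar labels robust, maps the labels explicitly instead of relying on the order returned by `value_counts`.

```python
import pandas as pd

# Made-up target column: 9 non-defaulters for every defaulter
target = pd.Series([0] * 9 + [1] * 1, name='TARGET')

counts = target.value_counts()
pct = target.value_counts(normalize=True) * 100

# Map each TARGET value to its label explicitly, independent of sort order
labelled = pct.reindex([0, 1]).rename({0: 'Others', 1: 'Defaulters'})

# Imbalance ratio: non-defaulters per defaulter
ratio = counts.loc[0] / counts.loc[1]

print(labelled)
print('Imbalance ratio:', ratio)
```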

### 4.3 Univariate Analysis

##### Categorical Analysis

In [None]:
#NAME_CONTRACT_STATUS

tempdf = finaldf[['TARGET','NAME_CONTRACT_STATUS']].groupby(['NAME_CONTRACT_STATUS'], as_index=False).sum()
tempdf.sort_values(by='TARGET', ascending=False, inplace=True)

sns.barplot(x='NAME_CONTRACT_STATUS', y = 'TARGET', data = tempdf)
plt.title('Previous Contract Status\n',fontdict={'fontsize': 20, 'fontweight' : 5, 'color' : 'Brown'} )
plt.xlabel('Loan status')
plt.ylabel('Number of defaults')
plt.xticks(rotation=45, ha='right')

plt.show()

- A high number of defaulters had their loans approved in the past.
- The number of defaulters is lowest among those who did not use a previous offer.

In [None]:
# NAME_CONTRACT_TYPE_PREV

tempdf = finaldf[['TARGET','NAME_CONTRACT_TYPE_PREV']].groupby(['NAME_CONTRACT_TYPE_PREV'], as_index=False).sum()
tempdf.sort_values(by='TARGET', ascending=False, inplace=True)

sns.barplot(x='NAME_CONTRACT_TYPE_PREV', y = 'TARGET', data = tempdf)
plt.title('Previous Contract Type\n',fontdict={'fontsize': 20, 'fontweight' : 5, 'color' : 'Brown'})
plt.xlabel('Contract type')
plt.ylabel('Number of defaulters')
plt.xticks(rotation=45, ha='right')

plt.show()

- High number of defaults in case of cash loans followed by consumer loans in previous applications data.

In [None]:
# NAME_CASH_LOAN_PURPOSE

tempdf = finaldf[finaldf['NAME_CASH_LOAN_PURPOSE'] != 'XAP']
tempdf = tempdf[tempdf['NAME_CASH_LOAN_PURPOSE'] != 'XNA']
tempdf = tempdf[['TARGET','NAME_CASH_LOAN_PURPOSE']].groupby(['NAME_CASH_LOAN_PURPOSE'], as_index=False).sum()

plt.figure(figsize=[20,10])
sns.barplot(x='NAME_CASH_LOAN_PURPOSE', y = 'TARGET', data = tempdf)
plt.title('Cash Loan Purpose\n',fontdict={'fontsize': 20, 'fontweight' : 5, 'color' : 'Brown'})
plt.xlabel('Purpose')
plt.ylabel('Number of defaulters')
plt.xticks(rotation=90, ha='right')

plt.show()

- Loan applications with purposes such as Repairs, Urgent needs and Others account for the highest number of defaults.
- Purposes such as buying a garage, buying a home and hobby show very few defaults.

##### Numerical Analysis

In [None]:
# Numerical data analysis

sns.kdeplot(x=finaldf['AMT_CREDIT_PREV'])
plt.title('Credit Amount of Previous Applications\n', fontdict={'fontsize': 20, 'fontweight' : 5, 'color' : 'Brown'})
plt.xlabel('Credit Amount')
plt.show()

- In the past, most loans had a credit amount in the lower range, i.e. below 1 lakh.

In [None]:
# Numerical data analysis

sns.kdeplot(x=finaldf['AMT_ANNUITY_PREV'])
plt.title('Annuity of Previous Applications\n', fontdict={'fontsize': 20, 'fontweight' : 5, 'color' : 'Brown'})
plt.xlabel('Annuity')
plt.show()

- Previous applications annuity was also mostly below 1 lakh.

### 4.4 Correlation

In [None]:
# Check the correlation

corrdf = finaldf[['AMT_ANNUITY_PREV','AMT_APPLICATION','AMT_CREDIT_PREV','AMT_GOODS_PRICE_PREV']].corr()
corrdf

In [None]:
# Plot correlation heatmap for numerical variables

plt.figure(figsize=[10,5])

sns.heatmap(corrdf, cmap="YlGnBu", annot = True)
plt.title('Correlation - Previous Applications\n', fontdict={'fontsize': 20, 'fontweight' : 5, 'color' : 'Brown'})

plt.show()

- From the above plot we can see that AMT_CREDIT_PREV is highly correlated with AMT_APPLICATION and AMT_GOODS_PRICE_PREV.

### 4.5 Bivariate Analysis

##### Categorical - Categorical

In [None]:
# NAME_CASH_LOAN_PURPOSE - TARGET

tempdf = finaldf[finaldf['NAME_CASH_LOAN_PURPOSE'] != 'XAP']
tempdf = tempdf[tempdf['NAME_CASH_LOAN_PURPOSE'] != 'XNA']

plt.figure(figsize=[20,10])
plt.title('Cash Loan Purpose\n', fontdict={'fontsize': 20, 'fontweight' : 5, 'color' : 'Brown'})
sns.countplot(x='NAME_CASH_LOAN_PURPOSE', hue='TARGET', data=tempdf)
plt.xlabel('Cash loan purpose')
plt.xticks(rotation = 90)
plt.show()

- More loans are taken for repairs than for any other purpose.
- The chance of default is also higher for loans taken for repairs.

In [None]:
# NAME_CONTRACT_STATUS - CODE_REJECT_REASON

tempdf = finaldf[finaldf['NAME_CONTRACT_STATUS'] == 'Refused']

plt.figure(figsize=[20,10])
plt.title('Rejection reason\n', fontdict={'fontsize': 20, 'fontweight' : 5, 'color' : 'Brown'})
sns.countplot(x='CODE_REJECT_REASON', hue='TARGET', data=tempdf)
plt.xlabel('Rejection reason')
plt.xticks(rotation = 90)
plt.show()

- Most applications were rejected with rejection code HC. This code also has the highest number of defaulters.
- Very few applications were rejected by the system.

##### Categorical - Continuous

In [None]:
# NAME_CASH_LOAN_PURPOSE -  AMT_CREDIT_PREV


plt.figure(figsize=(20,10))
plt.title('Cash Loan Purpose Vs Credit Amount\n', fontdict={'fontsize': 20, 'fontweight' : 5, 'color' : 'Brown'})
sns.barplot(x=finaldf['NAME_CASH_LOAN_PURPOSE'], y=finaldf['AMT_CREDIT_PREV'], hue=finaldf['TARGET'] )
plt.xlabel('Cash Loan Purpose')
plt.ylabel('Credit amount')
plt.xticks(rotation=90)
plt.legend(title='Is defaulter?', loc= 'upper right')
plt.show()

- People taking cash loans with high credit amount but have refused to name the purpose are more likely to default.
- Less chances of defaulting in case of home loans.

In [None]:
# NAME_CONTRACT_STATUS - AMT_INCOME_TOTAL

plt.figure(figsize=(10,8))
plt.title('Contract Status Vs Total Income\n', fontdict={'fontsize': 20, 'fontweight' : 5, 'color' : 'Brown'})
sns.barplot(x=finaldf['NAME_CONTRACT_STATUS'], y=finaldf['AMT_INCOME_TOTAL'], hue=finaldf['TARGET'] )
plt.xlabel('Contract Status')
plt.ylabel('Total Income')
plt.legend(title='Is defaulter?', loc= 'upper right')
plt.show()

- This graph shows the people who have unused offers are more likely to default even though they have comparatively high total income.

##### Continuous - Continuous

In [None]:
# AMT_CREDIT_PREV - AMT_APPLICATION

plt.figure(figsize=[20,5])
plt.title('Previous Credit Amount Vs Amount Applied\n', fontdict={'fontsize': 20, 'fontweight' : 5, 'color' : 'Brown'})
sns.scatterplot(x = finaldf['AMT_CREDIT_PREV'], y = finaldf['AMT_APPLICATION'])
plt.xlabel('Credit Amount')
plt.ylabel('Amount Applied')
plt.show()

In [None]:
# AMT_CREDIT_PREV - AMT_GOODS_PRICE_PREV

plt.figure(figsize=[20,5])
plt.title('Previous Credit Amount Vs Goods Price\n', fontdict={'fontsize': 20, 'fontweight' : 5, 'color' : 'Brown'})
sns.scatterplot(x = finaldf['AMT_CREDIT_PREV'], y = finaldf['AMT_GOODS_PRICE_PREV'])
plt.xlabel('Credit Amount')
plt.ylabel('Goods Price')
plt.show()

- As we can see from the above 2 plots, the credit amount is highly correlated with the goods price and with the amount applied for by the client on previous loan applications.
- With an increase in credit amount, applied amount and goods price, the tendency to default decreases. 
- High chances of defaulting for lower credit amount, applied amount and goods price.
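
The scatter plots above do not show `TARGET`, so one way to check how the default rate changes with credit amount is to split the amount into bands and compute the default rate in each band. The sketch below does this with `pd.qcut` on made-up data; `toy_df` and the `CREDIT_BAND` column are hypothetical names used only for illustration.

```python
import pandas as pd

# Made-up previous credit amounts (in lakhs) with a default flag
toy_df = pd.DataFrame({
    'AMT_CREDIT_PREV': [0.2, 0.3, 0.4, 0.5, 1.0, 1.5, 2.0, 2.5, 5.0, 6.0, 7.0, 8.0],
    'TARGET':          [1,   1,   0,   0,   1,   0,   0,   0,   0,   0,   0,   0],
})

# Split into three equally sized bands by credit amount
toy_df['CREDIT_BAND'] = pd.qcut(toy_df['AMT_CREDIT_PREV'], q=3, labels=['low', 'mid', 'high'])

# Default percentage within each band
rate = toy_df.groupby('CREDIT_BAND', observed=True)['TARGET'].mean() * 100

print(rate)
```

On the real data, the same grouping on `finaldf` would show whether the default rate actually falls as the credit amount rises.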


## 5. CONCLUSION

From the above analysis, we have gathered the below insights -

1. People who are more likely to default-

- Age : Young people – 25 to 35 age group
- Income : Lower income group with a total income of less than 5 lakhs
- Occupation : Low-skill labourers, drivers, waiters/barmen staff
- Education : Lower / secondary education
- Gender : Males
- Income type : On maternity leave and unemployed
- Family status : Civil marriage, single/unmarried
- Housing type : Rented apartment or with parents
- Contract type: Cash loan 
- Cash loan purpose : Repairs and urgent needs
- Previous loan status : Approved 


2. People who will repay on time-

- Age : Older people – above 50
- Income : Higher income group 
- Occupation : Managers, High-skilled tech staff, Accountants
- Education : Higher education and academic degree
- Gender : Females
- Income type : Working class, businessmen and students
- Family status : Married
- Housing type : Own house/apartment
- Contract type: Revolving loan 
- Cash loan purpose : Buying garage, home etc.
- Previous loan status : Unused offer

Hence, we can safely conclude that -

- Young males with lower secondary education and of lower income group and staying with parents or in a rented house, applying for low-range cash contract, should be denied.

- Females are likely to repay, but not if they are on maternity leave. Hence, the bank can reduce the loan amount for female applicants who are on maternity leave.

- Since people taking cash loans for repairs and urgent needs are more likely to default, bank can refuse them.

- Since the people who have unused offers are more likely to default even though they have comparatively high total income, they can be offered loan at a higher interest rate.

- Banks can target businessmen, students and working class people with academic degree/ higher education as they have no difficulty in repayment.

- The bank can also approve loans taken for buying a home or a garage, as there is less chance of defaulting.
