In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# <font color = brown> Bank Loan Services Assignment </font>

#### Table of contents
> 1. [Introduction](#intro)
> 2. [Import Required Libraries](#iml)
> 3. [Reading and understanding the dataset](#rud)
>> a. [Importing datasets](#imds)<br>
>> b. [Inspecting Dataframes](#insdf)<br>
> 4. [Data Cleaning and Imputation](#dci)
>> a. [Null value Calculation](#nvc)<br>
>> b. [Analyzing and Dropping Irrelevant Variables in application data](#adivad)<br>
>> c. [Analyzing and Dropping Irrelevant Variables in previous_data](#adivpd)<br>
>> d. [Standardize Values](#sv)<br>
>> e. [Data Type Conversion](#dtc)<br>
>> f. [Outlier Detection](#od)<br>
>> g. [Null Value Data Imputation](#nvdi)<br>
> 5. [Data Analysis](#da)
>> a. [Imbalance Analysis](#ia)<br>
>> b. [Plotting Functions](#pf)<br>
>> c. [Categorical Variables Analysis](#cva)<br>
>> d. [Numerical Variables Analysis](#nva)<br>
> 6. [Merged Dataframes Analysis](#mda)<br>
> 7. [Conclusions](#con)

### 1. INTRODUCTION <a id='intro'></a>
   - This case study aims to give you an idea of applying EDA in a real business scenario. In this case study, apart from applying the techniques that you have learnt in the EDA module, you will also develop a basic understanding of risk analytics in banking and financial services and understand how data is used to minimise the risk of losing money while lending to customers.

### 1.1 Business Understanding
- Loan providers find it hard to lend to people with insufficient or non-existent credit history, and some consumers take advantage of this by defaulting. Suppose you work for a consumer finance company which specialises in lending various types of loans to urban customers. You have to use EDA to analyse the patterns present in the data, so that applicants who are capable of repaying the loan are not rejected.

When the company receives a loan application, it has to decide whether to approve the loan based on the applicant’s profile. Two types of risks are associated with the bank’s decision:

- If the applicant is likely to repay the loan, then not approving the loan results in a loss of business to the company

- If the applicant is not likely to repay the loan, i.e. he/she is likely to default, then approving the loan may lead to a financial loss for the company.

The data given below contains the information about the loan application at the time of applying for the loan. It contains two types of scenarios:

- The client with payment difficulties: he/she had late payment more than X days on at least one of the first Y instalments of the loan in our sample,

- All other cases: All other cases when the payment is paid on time.

When a client applies for a loan, there are four types of decisions that could be taken by the client or the company:

#### Approved:
The company has approved the loan application.

#### Cancelled:
The client cancelled the application at some point during approval. Either the client changed their mind about the loan or, in some cases, a higher risk rating led to worse pricing that the client did not want.

#### Refused:
The company rejected the loan (because the client does not meet its requirements, etc.).

#### Unused offer:
The loan was cancelled by the client, but at a different stage of the process.



### 1.2 Business Objective

The case study aims to identify patterns which indicate if a client has difficulty paying their installments which may be used for taking actions such as denying the loan, reducing the amount of loan, lending (to risky applicants) at a higher interest rate, etc. This will ensure that the consumers capable of repaying the loan are not rejected. Identification of such applicants using EDA is the aim of this case study.

In other words, the company wants to understand the driving factors (or driver variables) behind loan default, i.e. the variables which are strong indicators of default. The company can utilise this knowledge for its portfolio and risk assessment.

### 1.3 Data Understanding

1. 'application_data.csv'
   It contains all the information about the client at the time of application. The data records whether a client has payment difficulties.

2. 'previous_application.csv'
   It contains information about the client’s previous loan data, including whether the previous application was Approved, Cancelled, Refused or an Unused offer.

3. 'columns_description.csv'
   It is the data dictionary which describes the meaning of the variables.

## 2. Import required libraries <a id= 'iml'></a>

In [None]:
# Filtering out the warnings

import warnings

warnings.filterwarnings('ignore')

In [None]:
# Importing the required libraries

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import itertools
import matplotlib.style as style
from mpl_toolkits.mplot3d import Axes3D
import os # accessing directory structure
import plotly
import plotly.express as px
# graphs to be inline
%matplotlib inline
# setting up plot style 
style.use('seaborn-poster')
style.use('Solarize_Light2')

In [None]:
#setting display format

pd.options.display.float_format='{:.2f}'.format
pd.set_option('display.max_rows',50)
pd.set_option('display.max_columns',122)

## <font color= green> 3. Reading and Understanding the dataset </font> <a id ='rud'></a>

#### <font color= blue> 3.a. Importing datasets </font> <a id ='imds'></a>


In [None]:
# reading csv file and read first five rows
application_data=pd.read_csv('/kaggle/input/loan-defaulter/application_data.csv',na_values='XNA')
print(application_data.head())


In [None]:
# reading csv file and read first five rows
previous_data=pd.read_csv('/kaggle/input/loan-defaulter/previous_application.csv',na_values='XNA')
print(previous_data.head())


#### <font color= blue> 3.b. Inspecting dataframes </font> <a id ='insdf'></a>

In [None]:
# Database dimension
print("Dimensions of application_data     :",application_data.shape)
print("Dimensions of previous_data        :",previous_data.shape)

#Database size
print("Size of application_data          :",application_data.size)
print("Size of previous_data             :",previous_data.size)

In [None]:
# Understanding the datatypes and info
application_data.info(verbose=True)

###### Shape and size of application data
There are a total of 122 attributes, of which:
- 65 columns are of datatype float
- 41 columns are of datatype int
- 16 columns are of datatype object

In [None]:
# Understanding the datatypes and info of previous data
previous_data.info(verbose=True)

###### Shape and size of previous data
There are a total of 37 variables, of which:
- 15 columns are of datatype float
- 6 columns are of datatype int
- 16 columns are of datatype object

In [None]:
# Checking the columns with float datatype
application_data.select_dtypes('float').info()

In [None]:
# Checking the columns with int datatype
application_data.select_dtypes(np.number,exclude='float').info()

In [None]:
# Checking the columns with object datatype
application_data.select_dtypes(exclude=np.number).info()

In [None]:
# checking datatypes for the columns in previous_data
previous_data.info()

In [None]:
#checking for duplicate rows using SK_ID_CURR which is the unique key in application data
application_data.SK_ID_CURR.duplicated().sum()

- **There are no duplicate records in application_data**
- **For each loan application new ID is generated.**

In [None]:
#checking for duplicate rows using SK_ID_PREV which is the unique key in previous data
previous_data.SK_ID_PREV.duplicated().sum()

- **There are no duplicate records in application and previous data.**

In [None]:
#Describing application data
application_data.describe()

After a preliminary inspection of application_data, without examining null values:
> 1. The DAYS_BIRTH, DAYS_EMPLOYED, DAYS_REGISTRATION and DAYS_ID_PUBLISH columns (attributes) hold negative values.
> 2. The maximum value of DAYS_EMPLOYED is 365243, which is about 1000 years. It is an outlier, as no person could be employed for that long.
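
The 1000-year figure can be checked with one line of arithmetic. This is a minimal sketch; the variable names are illustrative and not part of the notebook:

```python
# Maximum of DAYS_EMPLOYED as reported by describe()
max_days_employed = 365243

# Whole years, using the same 365-day year as the rest of the notebook
years_employed = max_days_employed // 365
print(years_employed)
```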

In [None]:
#Describing previous data
previous_data.describe()

- **CNT_PAYMENT describes the term of payment in months, so it should have an integer datatype.**

### <font color= green> 4. Data Cleaning & Imputation </font> <a id='dci'></a>

#### <font color = blue> 4.a. Null value Calculation</font> <a id= 'nvc'></a>

- **4.a.i. application_data dataframe**

In [None]:
# Percentage null value calculation for each column
round(application_data.isnull().sum() / application_data.shape[0] * 100.00,2)

In [None]:
# finding  null values
application_data.isnull().sum()

In [None]:
# null value calculation for integer datatype columns
application_data.select_dtypes(np.number,exclude='float').isnull().sum()

- **There are no null values in the 41 integer datatype columns.**

In [None]:
application_data.select_dtypes(np.number,exclude='int64').isnull().sum()

In [None]:
# percentage null value calculation for float datatype columns
round(application_data.select_dtypes(np.number,exclude='int64').isnull().sum()/ application_data.shape[0] * 100.00,2)

In [None]:
# plotting pointplot for Percentage of Missing values with float datatype in application data
null_application_data = pd.DataFrame((application_data.select_dtypes(np.number,exclude='int64').isnull().sum())*100/application_data.shape[0]).reset_index()
null_application_data.columns = ['Column Name', 'Null Values Percentage']
fig = plt.figure(figsize=(18,6))
ax = sns.pointplot(x="Column Name",y="Null Values Percentage",data=null_application_data,orient='v',color='green')
plt.xticks(rotation =90,fontsize =7)
ax.axhline(50, ls='--',color='red')
plt.title("Percentage of Missing values with float datatype in application data")
plt.ylabel("Null Values PERCENTAGE")
plt.xlabel("COLUMNS")
plt.show()

In [None]:
# more than or equal to 50% null value columns with float datatype
nullcol_application_data = null_application_data[null_application_data["Null Values Percentage"]>=50]
nullcol_application_data.reset_index()

**38 float datatype columns have 50% or more null values.**
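
The 50% cut-off can be reproduced on a small frame. The sketch below is illustrative only: `toy`, `null_pct` and `to_drop` are hypothetical names, and the values are made up rather than taken from application_data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for application_data (values are made up)
toy = pd.DataFrame({
    "A": [1.0, np.nan, np.nan, np.nan],   # 75% missing
    "B": [1.0, 2.0, np.nan, np.nan],      # 50% missing
    "C": [1.0, 2.0, 3.0, np.nan],         # 25% missing
})

# isna().mean() gives the fraction of missing rows per column
null_pct = toy.isna().mean() * 100

# Columns at or above the 50% cut-off used in this notebook
to_drop = null_pct[null_pct >= 50].index.tolist()
print(null_pct.round(2))
print(to_drop)
```

`isna().mean() * 100` is equivalent to the `isnull().sum() / shape[0] * 100` expression used in the cells above, just shorter.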

In [None]:
# percentage null value calculation for each categorical column in application data
round(application_data.select_dtypes(exclude=np.number).isnull().sum()/ application_data.shape[0] * 100.00,2)

**Variables with object datatype and 50% or more missing values can be dropped.**
- FONDKAPREMONT_MODE -----          68.39 % <br>
- HOUSETYPE_MODE          -----     50.18 % <br>
- WALLSMATERIAL_MODE           -----50.84 % 


In [None]:
# plotting pointplot for Percentage of Missing values with object datatype in application data
null_application_data = pd.DataFrame((application_data.select_dtypes(exclude=np.number).isnull().sum())*100/application_data.shape[0]).reset_index()
null_application_data.columns = ['Column Name', 'Null Values Percentage']
fig = plt.figure(figsize=(18,6))
ax = sns.pointplot(x="Column Name",y="Null Values Percentage",data=null_application_data,orient='v',color='green')
plt.xticks(rotation =90,fontsize =7)
ax.axhline(50, ls='--',color='red')
ax.axhline(15, ls='--',color='red')
plt.title("Percentage of Missing values with object datatype in application data")
plt.ylabel("Null Values PERCENTAGE")
plt.xlabel("COLUMNS")
plt.show()

**Attributes with object datatype and 50% or more missing values can be dropped, so these variables will be added to the unwanted_application list.**
- FONDKAPREMONT_MODE -----          68.39 % <br>
- HOUSETYPE_MODE          -----     50.18 % <br>
- WALLSMATERIAL_MODE           -----50.84 % <br>


In [None]:
# checking null values for categorical variables in application data
print(application_data.select_dtypes(exclude=np.number).isnull().sum())

**Number of null values in object datatype columns below the 50% threshold. These values can be imputed:**
1. Columns with fewer than 15% null values can be imputed with their mode:
   - CODE_GENDER  ----           4
   - NAME_TYPE_SUITE  ----                1292
2. Columns with 15% or more null values can be imputed with a new category:
   - OCCUPATION_TYPE  ----               96391
   - ORGANIZATION_TYPE    ----           55374
   - EMERGENCYSTATE_MODE  -----         145755
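
The two-tier rule above can be sketched as a small helper. `impute_categorical`, its `threshold` and `new_label` parameters, and the sample data are all assumptions made for illustration; the notebook itself imputes each column by hand in section 4.g.

```python
import numpy as np
import pandas as pd

def impute_categorical(series, threshold=15.0, new_label="Unknown"):
    """Fill nulls with the mode below `threshold` percent missing, else with `new_label`."""
    pct_missing = series.isna().mean() * 100
    if pct_missing < threshold:
        return series.fillna(series.mode()[0])
    return series.fillna(new_label)

# Made-up data: 10% missing in the first column, 40% in the second
low = pd.Series(["M", "F", "F", "F", "M", "F", "F", "M", "F", np.nan])
high = pd.Series(["Driver", np.nan, "Cook", np.nan, "Driver",
                  np.nan, "Cook", np.nan, "Driver", "Cook"])

low_filled = impute_categorical(low)     # gap takes the most frequent value
high_filled = impute_categorical(high)   # gaps take the new label
print(low_filled.tolist())
print(high_filled.tolist())
```

A separate label for heavily missing columns keeps the imputed rows distinguishable from genuine observations, which is the motivation the notebook gives for OCCUPATION_TYPE.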

- **4.a.ii. previous_data dataframe**

In [None]:
# Percentage null value calculation for each column in previous data
round(previous_data.isnull().sum() / previous_data.shape[0] * 100.00,2)

In [None]:
null_previous_data =pd.DataFrame((previous_data.isnull().sum())*100/previous_data.shape[0]).reset_index()
null_previous_data.columns = ['Column Name', 'Null Values Percentage']
fig = plt.figure(figsize=(18,6))
ax = sns.pointplot(x="Column Name",y="Null Values Percentage",data=null_previous_data,orient='v',color='green')
plt.xticks(rotation =90,fontsize =7)
ax.axhline(50, ls='--',color='red')
ax.axhline(15, ls='--',color='red')
plt.title("Percentage of Missing values previous data")
plt.ylabel("Null Values PERCENTAGE")
plt.xlabel("COLUMNS")
plt.show()


From the plot we can see that columns with more than 50% null values lie above the upper red line, while columns with more than 15% but less than 50% null values lie between the two red lines.<br>
**There are 7 columns in the previous_data dataframe where more than 50% of the values are missing.**
- Columns with less than 50% null values can be imputed:
1. Columns with less than 15% null values can be imputed with their mode:<br>
NAME_CONTRACT_TYPE-----0.02<br>
CODE_REJECT_REASON-----0.31<br>
NAME_CLIENT_TYPE-------0.12<br>
PRODUCT_COMBINATION----0.02
2. Columns with 15% or more null values can be imputed with a new category:<br>
NAME_CASH_LOAN_PURPOSE-40.59<br>
NAME_PAYMENT_TYPE------37.56<br>
NAME_TYPE_SUITE--------49.12<br>
NAME_PORTFOLIO---------22.29<br>
NAME_YIELD_GROUP-------30.97



In [None]:
# more than or equal to 50% empty columns
nullcol_previous = null_previous_data[null_previous_data["Null Values Percentage"]>=50]
print(nullcol_previous.reset_index())
print('length :',len(nullcol_previous))

#### <font color= blue>4.b. Analyzing and Dropping Irrelevant Variables in application data</font><a id='adivad'></a>

**There are 38 float datatype columns with 50% or more missing values.**
**These mostly relate to the applicant's locality, so they can be dropped.**<br>

In [None]:
len(nullcol_application_data)

In [None]:
#creating list of unnecessary columns in application data
unwanted_application = nullcol_application_data["Column Name"].tolist()
unwanted_application
#len(unwanted_application)

In [None]:
#updating unwanted application list with the object type variables possessing 50% and more missing values
unwanted_application= unwanted_application+ ['FONDKAPREMONT_MODE','HOUSETYPE_MODE','WALLSMATERIAL_MODE']

##### Flag Document

In [None]:

col_doc = [ 'FLAG_DOCUMENT_2', 'FLAG_DOCUMENT_3','FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_6','FLAG_DOCUMENT_7', 
           'FLAG_DOCUMENT_8', 'FLAG_DOCUMENT_9','FLAG_DOCUMENT_10', 'FLAG_DOCUMENT_11', 'FLAG_DOCUMENT_12','FLAG_DOCUMENT_13',
           'FLAG_DOCUMENT_14', 'FLAG_DOCUMENT_15','FLAG_DOCUMENT_16', 'FLAG_DOCUMENT_17', 'FLAG_DOCUMENT_18',
           'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20', 'FLAG_DOCUMENT_21']
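# copy the slice so that relabelling TARGET does not touch application_data
df_flag = application_data[col_doc+["TARGET"]].copy()

length = len(col_doc)

df_flag["TARGET"] = df_flag["TARGET"].replace({1:"Defaulter",0:"Repayer"})

fig = plt.figure(figsize=(21,24))

# one count plot per document flag, split by repayment status
for i,j in itertools.zip_longest(col_doc,range(length)):
    plt.subplot(5,4,j+1)
    # x is passed by keyword so the call works across seaborn versions
    ax = sns.countplot(x=df_flag[i],hue=df_flag["TARGET"],palette=["y","b"])
    plt.yticks(fontsize=8)
    plt.xlabel("")
    plt.ylabel("")
    plt.title(i)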

**The above graphs show that in most loan applications the client did not submit any FLAG_DOCUMENT_X other than FLAG_DOCUMENT_3. Thus, except for FLAG_DOCUMENT_3, we can drop these columns. The data suggests that a borrower who has submitted FLAG_DOCUMENT_3 is less likely to default on the loan.**
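
The same reading can be obtained numerically rather than from the plots. A rough sketch on made-up flags follows; `flags` and `submitted_pct` are hypothetical names, and the values are not taken from application_data.

```python
import pandas as pd

# Made-up binary flags standing in for the FLAG_DOCUMENT_X columns
flags = pd.DataFrame({
    "FLAG_DOCUMENT_2": [0, 0, 0, 0],
    "FLAG_DOCUMENT_3": [1, 1, 0, 1],
    "FLAG_DOCUMENT_4": [0, 0, 0, 1],
})

# The mean of a 0/1 column is the share of rows where the flag is set
submitted_pct = flags.mean() * 100
print(submitted_pct.sort_values(ascending=False))
```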

In [None]:
#updating unnecessary columns in application data
col_doc.remove('FLAG_DOCUMENT_3')
unwanted_application= unwanted_application+ col_doc
len(unwanted_application)

**contact Parameters**

In [None]:

# checking whether the contact flags (mobile phone, work phone, email etc.) are correlated with TARGET
contact_col = ['FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE',
       'FLAG_PHONE', 'FLAG_EMAIL','TARGET']
Contact_corr = application_data[contact_col].corr()
fig = plt.figure(figsize=(8,8))
ax = sns.heatmap(Contact_corr,
            xticklabels=Contact_corr.columns,
            yticklabels=Contact_corr.columns,
            annot = True,
            cmap ="RdYlGn",
            linewidth=1)

- None of the contact flags (mobile phone, email etc.) shows a notable linear correlation with loan repayment; thus these columns can be dropped
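
As an alternative to reading the heatmap, the correlation of each flag with TARGET can be listed directly. This is only a sketch on made-up data; `toy` and `corr_with_target` are illustrative names.

```python
import pandas as pd

# Made-up contact flags and target (not real application data)
toy = pd.DataFrame({
    "FLAG_PHONE": [1, 0, 1, 0, 1, 0, 1, 0],
    "FLAG_EMAIL": [0, 0, 1, 1, 0, 0, 1, 1],
    "TARGET":     [0, 0, 0, 1, 0, 1, 0, 0],
})

# Absolute Pearson correlation of every column with TARGET, strongest first
corr_with_target = (
    toy.corr()["TARGET"]
       .drop("TARGET")
       .abs()
       .sort_values(ascending=False)
)
print(corr_with_target)
```

Pearson correlation only captures linear association, so a value near zero does not by itself rule out other kinds of dependence.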

In [None]:
#updating unwanted application list with six contact variables
contact_col.remove('TARGET') 
unwanted_application = unwanted_application + contact_col
len(unwanted_application)

- <font color ='purple'>**Checkpoint** <br>
    38 float datatype with 50% and above missing values columns+ <br>
    19 FLAG_DOCUMENT_X columns + <br>
    3 columns with the object type variables possessing 50% and more missing values+<br>
    6 contact columns =<br>    66 unwanted_application columns</font> 
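
The checkpoint arithmetic can be verified in a few lines. The dictionary keys below are descriptive labels chosen for this sketch; the counts are the ones stated above.

```python
# Column counts taken from the checkpoint above
dropped = {
    "float columns with >= 50% nulls": 38,
    "FLAG_DOCUMENT_X columns": 19,
    "object columns with >= 50% nulls": 3,
    "contact columns": 6,
}

total_dropped = sum(dropped.values())
remaining = 122 - total_dropped   # application_data starts with 122 attributes
print(total_dropped, remaining)
```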

In [None]:
# Dropping the irrelevant columns from application_data
application_data.drop(labels=unwanted_application,axis=1,inplace=True)

In [None]:
# Inspecting the dataframe after removal of unnecessary columns
application_data.shape

#### <font color=blue>4.c. Analyzing and Dropping Irrelevant variables in previous_data </font> <a id = 'adivpd'></a>

In [None]:
# Getting the 7 columns which have 50% or more missing values
Unwanted_previous = nullcol_previous["Column Name"].tolist()
Unwanted_previous

In [None]:
# Listing down columns which are not needed
Unnecessary_previous = ['WEEKDAY_APPR_PROCESS_START','HOUR_APPR_PROCESS_START',
                        'FLAG_LAST_APPL_PER_CONTRACT','NFLAG_LAST_APPL_IN_DAY','DAYS_FIRST_DRAWING',
                        'DAYS_FIRST_DUE','DAYS_LAST_DUE_1ST_VERSION','DAYS_LAST_DUE','DAYS_TERMINATION','NFLAG_INSURED_ON_APPROVAL']

In [None]:
Unwanted_previous = Unwanted_previous + Unnecessary_previous
len(Unwanted_previous)

In [None]:
# Dropping the Irrelevant columns from previous data
previous_data.drop(labels=Unwanted_previous,axis=1,inplace=True)
# Inspecting the dataframe after removal of Irrelevant columns
previous_data.shape

**Checkpoint :**
 - 37 total columns - 17 irrelevant columns= 20 columns 

#### <font color=blue>4.d. Standardize Values </font> <a id = 'sv'></a>


**Strategy for application data:**<br>
Convert DAYS_EMPLOYED, DAYS_REGISTRATION and DAYS_ID_PUBLISH from negative to positive, as a count of days should not be negative.<br>
Convert DAYS_BIRTH from negative to positive values and calculate age and create categorical bins columns<br>
Categorize the amount variables into bins<br>
Convert region rating column and few other columns to categorical

In [None]:
# Converting Negative days to positive days
date_col = ['DAYS_BIRTH','DAYS_EMPLOYED','DAYS_REGISTRATION','DAYS_ID_PUBLISH']

for col in date_col:
    application_data[col] = abs(application_data[col])

In [None]:
# Binning Numerical Columns to create a categorical column

# Creating bins for income amount
application_data['AMT_INCOME_TOTAL']=application_data['AMT_INCOME_TOTAL']/100000

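# the last edge is open-ended so that incomes above 1.1M still fall in '1M Above' instead of becoming NaN
bins = [0,1,2,3,4,5,6,7,8,9,10,np.inf]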
slot = ['0-100K','100K-200K', '200k-300k','300k-400k','400k-500k','500k-600k','600k-700k','700k-800k','800k-900k','900k-1M', '1M Above']

application_data['AMT_INCOME_RANGE']=pd.cut(application_data['AMT_INCOME_TOTAL'],bins,labels=slot)

In [None]:
application_data['AMT_INCOME_RANGE'].value_counts(normalize=True)*100

**More than 50% of loan applicants have an income in the 100K-200K range. Almost 92% of loan applicants have an income below 300K.**
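
How `pd.cut` treats values that sit exactly on a bin edge affects these percentages. The sketch below uses a shortened, made-up version of the income bins to show the default behaviour, where each bin includes its right edge and excludes its left edge.

```python
import pandas as pd

# Made-up incomes, already divided by 100000
income = pd.Series([0.0, 0.5, 1.0, 1.5])

# Shortened illustrative bins and labels
bins = [0, 1, 2]
labels = ['0-100K', '100K-200K']

# Default right=True: each bin is (left, right], so 1.0 lands in '0-100K'
# and 0.0 falls outside the first bin and becomes NaN
income_range = pd.cut(income, bins, labels=labels)
print(income_range)
```

Passing `include_lowest=True`, as the employment binning further down does, would place 0.0 in the first bin instead.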

In [None]:
# Creating bins for Credit amount
application_data['AMT_CREDIT']=application_data['AMT_CREDIT']/100000

bins = [0,1,2,3,4,5,6,7,8,9,10,100]
slots = ['0-100K','100K-200K', '200k-300k','300k-400k','400k-500k','500k-600k','600k-700k','700k-800k',
       '800k-900k','900k-1M', '1M Above']

application_data['AMT_CREDIT_RANGE']=pd.cut(application_data['AMT_CREDIT'],bins=bins,labels=slots)

In [None]:
#checking the binning of data and % of data in each category
application_data['AMT_CREDIT_RANGE'].value_counts(normalize=True)*100

**More than 16% of loan applicants have taken a loan of more than 1M.**

In [None]:
# Creating bins for Age
application_data['AGE'] = application_data['DAYS_BIRTH'] // 365
bins = [0,20,30,40,50,100]
slots = ['0-20','20-30','30-40','40-50','50 above']

application_data['AGE_GROUP']=pd.cut(application_data['AGE'],bins=bins,labels=slots)

In [None]:
#checking the binning of data and % of data in each category
application_data['AGE_GROUP'].value_counts(normalize=True)*100

**31% loan applicants have age above 50 years. More than 70% of loan applicants have age over 40 years.**

In [None]:
# Creating bins for Employment Time
application_data['YEARS_EMPLOYED'] = application_data['DAYS_EMPLOYED'] // 365
bins = [0,5,10,20,30,40,50,60,150]
slots = ['0-5','5-10','10-20','20-30','30-40','40-50','50-60','60 above']

application_data['EMPLOYMENT_YEAR']=pd.cut(application_data['YEARS_EMPLOYED'],bins=bins,labels=slots,include_lowest=True)

In [None]:
application_data[application_data.EMPLOYMENT_YEAR.isnull()].head()

In [None]:
#checking the binning of data and % of data in each category
application_data['EMPLOYMENT_YEAR'].value_counts(normalize=True)*100

**55% of the loan applicants have work experience within 0-5 years and almost 80% of them have less than 10 years of work experience**

In [None]:
#Checking the number of unique values each column possess to identify categorical columns in application Data
application_data.nunique().sort_values()

**Strategy for previous_data:**
- Convert DAYS_DECISION from negative to positive values and create categorical bins columns.
- Convert the CNT_PAYMENT column to integer, as it is a number of months. It has 22.29% missing values, so these need to be imputed before the datatype is converted.
- Convert loan purpose and few other columns to categorical.


In [None]:
#Checking the number of unique values each column possess to identify categorical columns
previous_data.nunique().sort_values() 

In [None]:
#previous_data.CNT_PAYMENT=previous_data.CNT_PAYMENT.astype(int).value_counts()
previous_data.CNT_PAYMENT.unique()

In [None]:
#Converting negative days to positive days 
previous_data['DAYS_DECISION'] = abs(previous_data['DAYS_DECISION'])

In [None]:
#grouping DAYS_DECISION into 400-day buckets, e.g. 388 will be grouped as 0-400
days_lower = previous_data['DAYS_DECISION'] - (previous_data['DAYS_DECISION'] % 400)
previous_data['DAYS_DECISION_GROUP'] = days_lower.astype(str) + '-' + (days_lower + 400).astype(str)

In [None]:
previous_data['DAYS_DECISION_GROUP'].value_counts(normalize=True)*100

**Almost 37% loan applicants have applied for a new loan within 0-400 days of previous loan decision**
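
The 400-day grouping can be checked on a handful of values. A minimal sketch with made-up day counts follows; `days`, `width`, `lower`, `upper` and `group` are illustrative names.

```python
import pandas as pd

days = pd.Series([0, 388, 400, 1250])   # made-up DAYS_DECISION values

width = 400
lower = days - (days % width)            # start of the bucket
upper = lower + width                    # end of the bucket
group = lower.astype(str) + '-' + upper.astype(str)
print(group.tolist())
```

Each label covers the half-open range from its lower bound up to, but not including, its upper bound, so 400 belongs to `400-800` rather than `0-400`.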

In [None]:
previous_data.info()

#### <font color=blue>4.e.Data Type Conversion </font> <a id = 'dtc'></a>

In [None]:
#Conversion of Object and Numerical columns to Categorical Columns
categorical_columns = ['NAME_CONTRACT_TYPE','CODE_GENDER','NAME_TYPE_SUITE','NAME_INCOME_TYPE','NAME_EDUCATION_TYPE',
                       'NAME_FAMILY_STATUS','NAME_HOUSING_TYPE','OCCUPATION_TYPE','WEEKDAY_APPR_PROCESS_START',
                       'ORGANIZATION_TYPE','FLAG_OWN_CAR','FLAG_OWN_REALTY','LIVE_CITY_NOT_WORK_CITY',
                       'REG_CITY_NOT_LIVE_CITY','REG_CITY_NOT_WORK_CITY','REG_REGION_NOT_WORK_REGION',
                       'LIVE_REGION_NOT_WORK_REGION','REGION_RATING_CLIENT',
                       'REGION_RATING_CLIENT_W_CITY'
                      ]
for col in categorical_columns:
    application_data[col] =pd.Categorical(application_data[col])

In [None]:
# inspecting the column types if the above conversion is reflected
application_data.info()

In [None]:
#Converting Categorical columns from Object to categorical 
Catgorical_col_p = ['NAME_CONTRACT_STATUS','NAME_PAYMENT_TYPE',
                    'CODE_REJECT_REASON','NAME_CLIENT_TYPE','NAME_PORTFOLIO'
                ,'CHANNEL_TYPE','NAME_YIELD_GROUP','PRODUCT_COMBINATION',
                    'NAME_CONTRACT_TYPE','DAYS_DECISION_GROUP']

for col in Catgorical_col_p:
    previous_data[col] =pd.Categorical(previous_data[col])

#### <font color=blue>4.f. Outlier Detection</font><a id ='od'></a>

In [None]:
from plotly.subplots import make_subplots
import plotly.graph_objects as go
# the figure size is set in fig.update_layout below; no matplotlib figure is needed for plotly

fig = make_subplots(
    rows=3, cols=3,
    subplot_titles=("Plot 1", "Plot 2", "Plot 3", "Plot 4", "Plot 5", "Plot 6",'', "Plot 7"))                    

fig.add_trace(go.Box(y=application_data.AMT_ANNUITY,name='AMT_ANNUITY', boxmean=True),
              row=1, col=1)

fig.add_trace(go.Box(y=application_data.AMT_INCOME_TOTAL,name='AMT_INCOME_TOTAL',boxmean=True),
              row=1, col=2)

fig.add_trace(go.Box(y=application_data.AMT_CREDIT,name='AMT_CREDIT',boxmean=True),
              row=1, col=3)

fig.add_trace(go.Box(y=application_data.AMT_GOODS_PRICE,name='AMT_GOODS_PRICE',boxmean=True),
              row=2, col=1)

fig.add_trace(go.Box(y=application_data.DAYS_EMPLOYED,name='DAYS_EMPLOYED',boxmean=True),
              row=2, col=2)

fig.add_trace(go.Box(y=application_data.CNT_CHILDREN,name='CNT_CHILDREN',boxmean=True),
              row=2, col=3)

fig.add_trace(go.Box(y=application_data.DAYS_BIRTH,name='DAYS_BIRTH',boxmean=True),
              row=3, col=2)

fig.update_layout(height=700, width=900,
                  title_text="Outlier Detection",showlegend=False)

fig.show()

#### <font color=blue>4.g. Null Value Data Imputation</font><a id ='nvdi'></a>

**Strategy for application_data:**
1. To impute null values in categorical variables with a low null percentage, mode() is used to fill in the most frequent item.
2. To impute null values in categorical variables with a high null percentage, a new category is created.
3. To impute null values in numerical variables with a low null percentage, median() is used because:
   - there are no outliers in the columns
   - the mean returned decimal values while the median returned whole numbers, and the columns hold numbers of requests

In [None]:
# checking the null value % of each column in the application_data dataframe
round(application_data.isnull().sum() / application_data.shape[0] * 100.00,2)

**Impute categorical variable 'NAME_TYPE_SUITE' which has lower null percentage(0.42%) with the most frequent category using mode()[0]:**

In [None]:
application_data['NAME_TYPE_SUITE'].describe()

In [None]:
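# assigning the result back avoids relying on an in-place fill of a column selection
application_data['NAME_TYPE_SUITE'] = application_data['NAME_TYPE_SUITE'].fillna(application_data['NAME_TYPE_SUITE'].mode()[0])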

**Impute categorical variable 'OCCUPATION_TYPE' which has higher null percentage(31.35%) with a new category as assigning to any existing category might influence the analysis:**

In [None]:
application_data['OCCUPATION_TYPE'] = application_data['OCCUPATION_TYPE'].cat.add_categories('Unknown')
application_data['OCCUPATION_TYPE'] = application_data['OCCUPATION_TYPE'].fillna('Unknown')
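
The `add_categories` call matters because OCCUPATION_TYPE was converted to a categorical column in section 4.e. A small sketch on made-up data shows what happens with and without it; `occupation` and `rejected` are illustrative names.

```python
import numpy as np
import pandas as pd

occupation = pd.Series(pd.Categorical(["Driver", np.nan, "Cook"]))

# Filling with a label that is not yet a category is rejected by pandas
# (the exception type differs between pandas versions, so both are caught)
try:
    occupation.fillna("Unknown")
    rejected = False
except (TypeError, ValueError):
    rejected = True

# Registering the label first makes the fill valid
occupation = occupation.cat.add_categories("Unknown").fillna("Unknown")
print(rejected, occupation.tolist())
```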

**Impute numerical variables with the median, as the results of describe() show no outliers, mean() returns decimal values, and these columns hold the number of enquiries made, which cannot be a decimal:**
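
The difference between the two statistics is easy to see on a short series of counts. The numbers below are made up for illustration and do not come from the AMT_REQ_CREDIT_BUREAU columns.

```python
import numpy as np
import pandas as pd

# Made-up enquiry counts with two missing entries
enquiries = pd.Series([0, 0, 1, 0, 2, np.nan, 0, np.nan])

mean_value = enquiries.mean()      # a fraction, not a valid count
median_value = enquiries.median()  # a whole number here

# Filling with the median keeps every entry a whole number
filled = enquiries.fillna(median_value)
print(mean_value, median_value, filled.tolist())
```

Note that a median can also be fractional when the two middle values differ, so it is worth checking the result before relying on it for count columns.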

In [None]:
application_data[['AMT_REQ_CREDIT_BUREAU_HOUR','AMT_REQ_CREDIT_BUREAU_DAY',
               'AMT_REQ_CREDIT_BUREAU_WEEK','AMT_REQ_CREDIT_BUREAU_MON',
               'AMT_REQ_CREDIT_BUREAU_QRT','AMT_REQ_CREDIT_BUREAU_YEAR']].describe()

In [None]:
# imputing with median value
amount = ['AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY','AMT_REQ_CREDIT_BUREAU_WEEK','AMT_REQ_CREDIT_BUREAU_MON',
         'AMT_REQ_CREDIT_BUREAU_QRT','AMT_REQ_CREDIT_BUREAU_YEAR']

for col in amount:
    application_data[col] = application_data[col].fillna(application_data[col].median())

##### Imputations 
- EXT_SOURCE_2 and EXT_SOURCE_3 null values can be imputed with the mean of the respective columns, as the normalised credit scores lie between 0 and 1

In [None]:
# imputing EXT_SOURCE_2 with mean (0.50)
application_data['EXT_SOURCE_2'] = application_data['EXT_SOURCE_2'].fillna(application_data['EXT_SOURCE_2'].mean())

In [None]:
application_data.EXT_SOURCE_2.isnull().sum()

In [None]:
# imputing EXT_SOURCE_3 with mean (0.50)
application_data['EXT_SOURCE_3'] = application_data['EXT_SOURCE_3'].fillna(application_data['EXT_SOURCE_3'].mean())

In [None]:
application_data.EXT_SOURCE_3.isnull().sum()

In [None]:
application_data.EXT_SOURCE_3.head()

In [None]:
# average of the two external scores kept in the dataframe (vectorised instead of a row-wise apply)
application_data['CREDIT_SCORE'] = (application_data['EXT_SOURCE_2'] + application_data['EXT_SOURCE_3']) / 2

In [None]:
application_data['CREDIT_SCORE'].head(10)

In [None]:
# checking the null value % of each column in application data dataframe
round(application_data.isnull().sum() / application_data.shape[0] * 100.00,2)

- Records with a null EMPLOYMENT_YEAR are dropped, because for those applicants the years employed comes out as 1000

In [None]:
application_data.dropna(subset=['EMPLOYMENT_YEAR'],inplace=True)
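
The null EMPLOYMENT_YEAR values come from the binning step: 1000 years lies beyond the last bin edge, so `pd.cut` returns NaN for it. A small sketch on made-up rows, reusing the bin edges from section 4.d, illustrates this (`toy` and `cleaned` are illustrative names):

```python
import pandas as pd

# Made-up DAYS_EMPLOYED values; 365243 is the maximum seen in describe()
toy = pd.DataFrame({"DAYS_EMPLOYED": [730, 3650, 365243]})
toy["YEARS_EMPLOYED"] = toy["DAYS_EMPLOYED"] // 365

bins = [0, 5, 10, 20, 30, 40, 50, 60, 150]
labels = ['0-5', '5-10', '10-20', '20-30', '30-40', '40-50', '50-60', '60 above']
toy["EMPLOYMENT_YEAR"] = pd.cut(toy["YEARS_EMPLOYED"], bins=bins,
                                labels=labels, include_lowest=True)

# 1000 years is above the last edge (150), so its bin is NaN and the row is dropped
cleaned = toy.dropna(subset=["EMPLOYMENT_YEAR"])
print(cleaned)
```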

In [None]:
# checking the null value % of each column in application data dataframe
round(application_data.isnull().sum() / application_data.shape[0] * 100.00,2)

**Strategy for previous_data:**
1. To impute null values and negative values in AMT_DOWN_PAYMENT
2. To impute null values in numerical columns, we analysed the loan status and assigned values.
3. To impute null values in continuous variables, we plotted the distribution of the columns and used:
   - median if the distribution is skewed
   - mode if the distribution pattern is preserved
4. To impute null values in categorical columns, mode is used.

In [None]:
# checking the null value % of each column in the previous_data dataframe
round(previous_data.isnull().sum() / previous_data.shape[0] * 100.00,2)

**Impute AMT_ANNUITY with median as the distribution is greatly skewed**

In [None]:
plt.figure(figsize=(6,6))
sns.kdeplot(previous_data['AMT_ANNUITY'])
plt.show()

In [None]:
previous_data['AMT_ANNUITY'] = previous_data['AMT_ANNUITY'].fillna(previous_data['AMT_ANNUITY'].median())
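
Besides the density plot, skewness can be checked numerically. The sketch below uses made-up amounts to show why the median is the safer fill value for a right-skewed column; `amounts` is an illustrative name.

```python
import pandas as pd

# Made-up right-skewed amounts: most values are small, one is very large
amounts = pd.Series([100, 120, 130, 150, 160, 5000])

print("skew  :", amounts.skew())    # positive for a right-skewed distribution
print("mean  :", amounts.mean())    # pulled upwards by the large value
print("median:", amounts.median())  # stays close to the typical value
```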

In [None]:
plt.figure(figsize=(6,6))
sns.kdeplot(previous_data['AMT_GOODS_PRICE'][pd.notnull(previous_data['AMT_GOODS_PRICE'])])
plt.show()

- AMT_GOODS_PRICE has a positive linear correlation with AMT_CREDIT so impute with the corresponding values

In [None]:
# correlation matrix of AMT_CREDIT and AMT_GOODS_PRICE in the previous_data dataframe
previous_data[['AMT_CREDIT','AMT_GOODS_PRICE']].corr()

In [None]:
sns.heatmap(previous_data[['AMT_CREDIT','AMT_GOODS_PRICE']].corr(),annot=True,cmap='Greens')

In [None]:
sns.scatterplot(x=previous_data.AMT_CREDIT,y=previous_data.AMT_GOODS_PRICE)

In [None]:
# imputing AMT_GOODS_PRICE with AMT_CREDIT
previous_data['AMT_GOODS_PRICE'] = previous_data['AMT_GOODS_PRICE'].fillna(previous_data['AMT_CREDIT'])

**Impute CNT_PAYMENT with 0, as the NAME_CONTRACT_STATUS of these records indicates that most of these loans were never started**

In [None]:
previous_data.loc[previous_data['CNT_PAYMENT'].isnull(),'NAME_CONTRACT_STATUS'].value_counts()

In [None]:
previous_data['CNT_PAYMENT'] = previous_data['CNT_PAYMENT'].fillna(0)

In [None]:
previous_data.NAME_PAYMENT_TYPE.value_counts()

In [None]:
#imputing with the mode
previous_data['NAME_PAYMENT_TYPE'] = previous_data['NAME_PAYMENT_TYPE'].fillna(previous_data['NAME_PAYMENT_TYPE'].mode()[0])

In [None]:
previous_data.NAME_PORTFOLIO.value_counts()

In [None]:
#imputing with the mode
previous_data['NAME_PORTFOLIO'] = previous_data['NAME_PORTFOLIO'].fillna(previous_data['NAME_PORTFOLIO'].mode()[0])

In [None]:
previous_data.NAME_YIELD_GROUP.value_counts()

In [None]:
#imputing with the mode
previous_data['NAME_YIELD_GROUP'] = previous_data['NAME_YIELD_GROUP'].fillna(previous_data['NAME_YIELD_GROUP'].mode()[0])

In [None]:
previous_data.NAME_CLIENT_TYPE.value_counts()

In [None]:
#imputing with the mode
previous_data['NAME_CLIENT_TYPE'] = previous_data['NAME_CLIENT_TYPE'].fillna(previous_data['NAME_CLIENT_TYPE'].mode()[0])

In [None]:
previous_data.CODE_REJECT_REASON.value_counts()

In [None]:
#imputing with the mode
previous_data['CODE_REJECT_REASON'] = previous_data['CODE_REJECT_REASON'].fillna(previous_data['CODE_REJECT_REASON'].mode()[0])

In [None]:
#imputing with the mode
previous_data['NAME_CASH_LOAN_PURPOSE'] = previous_data['NAME_CASH_LOAN_PURPOSE'].fillna(previous_data['NAME_CASH_LOAN_PURPOSE'].mode()[0])

In [None]:
# checking the null value % of each column in previous_data dataframe
round(previous_data.isnull().sum() / previous_data.shape[0] * 100.00,2)

### 5.<font color=green> Data Analysis</font> <a id = 'da'></a>

**Strategy:**
- The data analysis is planned as follows:

1. Imbalance in Data
2. Categorical Data Analysis
>Categorical segmented Univariate Analysis<br>
>Categorical Bi/Multivariate analysis
3. Numeric Data Analysis
>Bifurcation of the data based on TARGET<br>
>Correlation Matrix<br>
>Numerical segmented Univariate Analysis<br>
>Numerical Bi/Multivariate analysis

#### <font color=blue>5.a.Imbalance Analysis</font> <a id = 'ia'></a>

In [None]:
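# Count repayers (TARGET == 0) and defaulters (TARGET == 1)
Imbalance = application_data["TARGET"].value_counts()

plt.figure(figsize=(10,4))
x= ['Repayer','Defaulter']
# Pass x and y by keyword; counts are looked up by label so the bar order is explicit
sns.barplot(x=x, y=[Imbalance.get(0, 0), Imbalance.get(1, 0)], palette= ['g','r'])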
plt.xlabel("Loan Repayment Status")
plt.ylabel("Count of Repayers & Defaulters")
plt.title("Imbalance Plotting")
plt.show()

#### <font color=blue>5.b.Plotting Functions</font> <a id = 'pf'></a>

- The following common functions are customized to perform uniform analysis and are called for all plots.

In [None]:
#function for plotting number of defaulters and repayers categorically(univariate) 
import plotly.graph_objects as go
from plotly.subplots import make_subplots
def univariate_hist(dataframe,feature,vertical=True):
    
    defaulters=dataframe[feature].where(dataframe.TARGET==1)
    repayers=dataframe[feature].where(dataframe.TARGET==0)
    
    fig = go.Figure()
    if(vertical):
        
        fig.add_trace(go.Histogram(x=defaulters,histnorm='',name='Defaulters', marker_color='crimson', opacity=0.75))
        fig.add_trace(go.Histogram(x=repayers,histnorm='',name='Repayers',marker_color='green', opacity=0.75))
        fig.update_layout(title_text='Univariate categorical Analysis: '+feature, # title of plot
                  xaxis_title_text=feature, # xaxis label
                  yaxis_title_text='COUNT', # yaxis label
                  bargap=0.2, # gap between bars of adjacent location coordinates
                  bargroupgap=0.1 # gap between bars of the same location coordinates
                 )
    else :
        
        fig.add_trace(go.Histogram(y=defaulters,histnorm='',name='Defaulters', marker_color='crimson', opacity=0.75))
        fig.add_trace(go.Histogram(y=repayers,histnorm='',name='Repayers',marker_color='green', opacity=0.75))
# The two histograms are drawn on top of another
        fig.update_layout(title_text='Univariate categorical Analysis : '+feature, # title of plot
                  xaxis_title_text='COUNT', # xaxis label
                  yaxis_title_text= feature, # yaxis label
                  bargap=0.2, # gap between bars of adjacent location coordinates
                  bargroupgap=0.1 # gap between bars of the same location coordinates
                 )
    

    fig.show()

In [None]:
# function for Percentage of Defaulters for categorical variables
def perc_defaulter(dataframe,feature,vertical=True):
    
    cat_perc = dataframe[[feature, 'TARGET']].groupby([feature],as_index=False).mean()
    cat_perc["TARGET"] = cat_perc["TARGET"]*100
    cat_perc.sort_values(by='TARGET', ascending=False, inplace=True)
    if (vertical): 
        
        fig = px.bar(cat_perc, x=feature ,y="TARGET",height=400,text="TARGET")
        fig.update_traces(marker_color='rgb(158,202,225)', marker_line_color='rgb(8,48,107)',
                  marker_line_width=1.5, opacity=0.6,texttemplate='%{text:.2s}'+'%', textposition='outside')
        fig.update_layout(title_text='Percentage of Defaulters')
    else :
        
        fig = px.bar(cat_perc, x="TARGET" ,y=feature,height=400,text="TARGET")
        fig.update_traces(marker_color='rgb(150,202,225)', marker_line_color='rgb(8,48,107)',
                  marker_line_width=1.5, opacity=0.6,texttemplate='%{text:.2s}'+'%', textposition='outside')
        fig.update_layout(title_text='Percentage of Defaulters')
    fig.show()

In [None]:

def bivariate_bar(x,y,df,hue,figsize):
    
    plt.figure(figsize=figsize)
    sns.barplot(x=x,
                  y=y,
                  data=df, 
                  hue=hue, 
                  palette =['g','r'])     
        
    # Defining aesthetics of Labels and Title of the plot using style dictionaries
    plt.xlabel(x,fontdict={'fontsize' : 10, 'fontweight' : 3, 'color' : 'Blue'})    
    plt.ylabel(y,fontdict={'fontsize' : 10, 'fontweight' : 3, 'color' : 'Blue'})    
    plt.title(f"{y} by {x}", fontdict={'fontsize' : 15, 'fontweight' : 5, 'color' : 'Blue'}) 
    plt.xticks(rotation=90, ha='right')
    plt.legend(labels = ['Repayer','Defaulter'])
    plt.show()

In [None]:
# function for plotting repetitive rel plots in bivariate numerical analysis on application_data

def bivariate_rel(x,y,data, hue, kind, palette, legend,figsize):
    
    plt.figure(figsize=figsize)
    sns.relplot(x=x, 
                y=y, 
                data=data,   # dataframe supplied by the caller
                hue=hue,     # grouping column supplied by the caller
                kind=kind,
                palette = ['g','r'],
                legend = False)
    plt.legend(['Repayer','Defaulter'])
    plt.xticks(rotation=90, ha='right')
    plt.show()

In [None]:


def univariate_merged(col,df,hue,palette,ylog,figsize):
    plt.figure(figsize=figsize)
    ax=sns.countplot(x=col, 
                  data=df,
                  hue= hue,
                  palette= palette,
                  order=df[col].value_counts().index)
    

    if ylog:
        plt.yscale('log')
        plt.ylabel("Count (log)",fontdict={'fontsize' : 10, 'fontweight' : 3, 'color' : 'Blue'})     
    else:
        plt.ylabel("Count",fontdict={'fontsize' : 10, 'fontweight' : 3, 'color' : 'Blue'})       

    plt.title(col , fontdict={'fontsize' : 15, 'fontweight' : 5, 'color' : 'Blue'}) 
    plt.legend(loc = "upper right")
    plt.xticks(rotation=90, ha='right')
    
    plt.show()

#### <font color=blue>5.c.Categorical Variables Analysis </font> <a id = 'cva'></a>

In [None]:
contract_counts = application_data.NAME_CONTRACT_TYPE.value_counts()
fig = px.pie(values=contract_counts.values, names=contract_counts.index, title='Percentage of types of loans')
fig.show()

In [None]:
target_counts = application_data.TARGET.value_counts()
fig=px.pie(values=target_counts.values,names=target_counts.index.map({0:'REPAYERS',1:'DEFAULTERS'}),title='Percentage of clients with payment difficulties')
fig.show()

In [None]:
# Checking the contract type based on loan repayment status
univariate_hist(application_data,'NAME_CONTRACT_TYPE')

In [None]:
# Percentage of defaulters in Cash and revolving loans
perc_defaulter(application_data,'NAME_CONTRACT_TYPE')

- Inferences:
1. Contract type: revolving loans are only a small fraction (10.3%) of the total number of loans, and about 5% of them are in default.
2. Cash loans have a default rate of about 9%, which is higher than that of revolving loans.
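
The default rates quoted in these inferences are group means of the binary TARGET column scaled to percent, which is what `perc_defaulter` plots. A minimal sketch on made-up rows (the counts are illustrative, not the real data):

```python
import pandas as pd

toy = pd.DataFrame({
    "NAME_CONTRACT_TYPE": ["Cash loans"] * 10 + ["Revolving loans"] * 4,
    "TARGET": [1] + [0] * 9 + [0] * 4,   # 1 = defaulter, 0 = repayer
})
# Default rate per category: mean of a 0/1 column is the share of 1s
default_rate = (toy.groupby("NAME_CONTRACT_TYPE")["TARGET"].mean() * 100).round(1)
# Share of each category among all loans
share_of_loans = (toy["NAME_CONTRACT_TYPE"].value_counts(normalize=True) * 100).round(1)
print(default_rate)
print(share_of_loans)
```

Keeping the two quantities separate avoids mixing up "share of all loans" with "share of loans in default", which are easy to confuse when reading the charts.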

In [None]:
# Checking the gender based count on loan repayment status
univariate_hist(application_data,'CODE_GENDER')

In [None]:
# Percentage of defaulters by gender
perc_defaulter(application_data,'CODE_GENDER')

###### - Inferences:
- The number of female clients is almost double the number of male clients. Based on the percentage of defaulted credits, males have a higher chance of not repaying their loans (10%) compared with females (7.6%).

In [None]:
# Checking the car-ownership based count on loan repayment status
univariate_hist(application_data,'FLAG_OWN_CAR')

In [None]:
# Percentage of defaulters based on car owned by applicants
perc_defaulter(application_data,'FLAG_OWN_CAR')

- Clients who own a car are about half as many as clients who do not own one. Based on the default percentage, car ownership has little bearing on loan repayment: the default rates of the two groups differ by only about 2 percentage points.

In [None]:
#plotting sunburst for defaulters'count in Occupation type
fig = px.sunburst(application_data, path=['OCCUPATION_TYPE','NAME_CONTRACT_TYPE'], values=application_data.TARGET==1)
fig.show()

In [None]:
# Percentage of defaulters based on occupation type of applicants
perc_defaulter(application_data,'OCCUPATION_TYPE',False)

- Inferences:
1. Most of the loans are taken by Laborers, followed by Sales staff. IT staff take the fewest loans.
2. The category with the highest percentage of unrepaid loans is Low-skill Laborers (above 17%), followed by Drivers, Waiters/barmen staff, Security staff, Laborers and Cooking staff.

In [None]:
# Checking the HOUSING based count on loan repayment status
univariate_hist(application_data,'NAME_HOUSING_TYPE',False)

In [None]:
# Percentage of defaulters based on housing of applicants
perc_defaulter(application_data,'NAME_HOUSING_TYPE',False)

**Inferences:**
1. The majority of applicants live in a House/apartment.
2. People living in office apartments have the lowest default rate.
3. People living with parents (12%) and in rented apartments (13%) have a higher probability of defaulting.

In [None]:
# Checking the REGION based count on loan repayment status
univariate_hist(application_data,'REGION_RATING_CLIENT')

In [None]:
# Percentage of defaulters based on REGION of applicants
perc_defaulter(application_data,'REGION_RATING_CLIENT')

**Inferences**:
1. Most of the applicants live in a Region_Rating 2 area.
2. Region_Rating 3 has the highest default rate (12%).
3. Applicants living in Region_Rating 1 have the lowest probability of defaulting, and are thus safer for loan approval.

In [None]:
# Checking the DOCUMENT SUBMISSION based count on loan repayment status
univariate_hist(application_data,'FLAG_DOCUMENT_3')

In [None]:
# Percentage of defaulters based on DOCUMENT SUBMISSION of applicants
perc_defaulter(application_data,'FLAG_DOCUMENT_3')

**Inferences:**
1. There is no significant relationship between submitting document 3 and repayment. In fact, applicants who submitted the document defaulted slightly more often (9.3%) than those who did not (6.6%).

In [None]:
# Checking the FAMILY STATUS based count on loan repayment status
univariate_hist(application_data,'NAME_FAMILY_STATUS',False)

In [None]:
# Percentage of defaulters based on FAMILY STATUS of applicants
perc_defaulter(application_data,'NAME_FAMILY_STATUS',False)

**Inferences:**
1. Most of the people who have taken a loan are married, followed by Single/not married and Civil marriage.
2. In terms of percentage of defaulters, Civil marriage and Single/not married have the highest rate of non-repayment (10%), with Widow the lowest (excluding Unknown).

In [None]:
# Checking the INCOME TYPE based count on loan repayment status
univariate_hist(application_data,'NAME_INCOME_TYPE',False)

In [None]:
# Percentage of defaulters based on INCOME TYPE of applicants
perc_defaulter(application_data,'NAME_INCOME_TYPE',False)

**Inferences:**
1. Most loan applicants have income type Working, followed by Commercial associate, Pensioner and State servant.
2. Applicants with income type Maternity leave have a default rate of almost 40%, followed by Working (9.6%).
3. Students and Businessmen, though few in number, have no defaults on record. These two categories are therefore the safest for providing loans.

In [None]:
# Checking the EDUCATION TYPE based count on loan repayment status
univariate_hist(application_data,'NAME_EDUCATION_TYPE',False)

In [None]:
# Percentage of defaulters based on EDUCATION TYPE of applicants
perc_defaulter(application_data,'NAME_EDUCATION_TYPE',False)

**Inferences:**
1. The majority of clients have Secondary / secondary special education, followed by clients with Higher education. Only a very small number have an academic degree.
2. The Lower secondary category, although rare, has the highest default rate (14%). People with an Academic degree have a 2.2% default rate.

In [None]:
# Checking the ORGANIZATION_TYPE based count on loan repayment status
univariate_hist(application_data,'ORGANIZATION_TYPE',False)

In [None]:
# Percentage of defaulters based on ORGANIZATION_TYPE of applicants
perc_defaulter(application_data,'ORGANIZATION_TYPE',False)

**Inferences:**
1. Organizations with the highest percentage of unrepaid loans are Transport: type 3 (15.75%), Industry: type 13 (13.5%), Industry: type 8 (12.5%) and Restaurant (less than 12%). Self-employed people have a relatively high default rate (10.17%); they should either not be approved or be offered loans at a higher interest rate to mitigate the risk of default.
2. Most of the people applying for loans are from Business Entity Type 3.
3. Although Business Entity Type 2 has more loans disbursed, it has a comparatively lower default rate (8.5%).
4. The following organization types have fewer defaulters and are thus safer for providing loans:
- Trade Type 4
- Security Ministries

In [None]:
# Checking AGE_GROUP based count on loan repayment status
univariate_hist(application_data,'AGE_GROUP',False)

In [None]:
# Percentage of defaulters based on  AGE GROUP of applicants
perc_defaulter(application_data,'AGE_GROUP',False)

**Inferences:**
1. People in the 20-40 age groups have a higher probability of defaulting.
2. People above the age of 50 have a low probability of defaulting.
3. Loans given to applicants above the age of 40 are safer, as the default rate is comparatively lower.
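
The AGE_GROUP column used here is created earlier in the notebook. As a rough sketch of how such a column can be derived, assuming DAYS_BIRTH holds negative day counts relative to the application date and using illustrative ten-year bins (the notebook's actual bin edges and labels may differ):

```python
import pandas as pd

toy = pd.DataFrame({"DAYS_BIRTH": [-9000, -12000, -16500, -20000, -24000]})  # made-up values
age_years = toy["DAYS_BIRTH"].abs() // 365           # whole years of age
bins = [20, 30, 40, 50, 60, 70]                      # assumed bin edges
labels = ["20-30", "30-40", "40-50", "50-60", "60-70"]
# right=False makes each bin closed on the left, e.g. [20, 30)
toy["AGE_GROUP"] = pd.cut(age_years, bins=bins, labels=labels, right=False)
print(toy)
```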

In [None]:
# Checking employed years count on loan repayment status
univariate_hist(application_data,'EMPLOYMENT_YEAR')

In [None]:
# Percentage of defaulters based on  year employed of applicants
perc_defaulter(application_data,'EMPLOYMENT_YEAR')

**Inferences:**
1. The majority of applicants have been employed for 0-5 years. This group also has the highest default rate, at 10%.
2. The default rate decreases gradually as years of employment increase; applicants with 40+ years of experience have a default rate below 1%.

In [None]:
# Checking amount credited count on loan repayment status
univariate_hist(application_data,'AMT_CREDIT_RANGE',False)

In [None]:
# Percentage of defaulters based on  amount credited of applicants
perc_defaulter(application_data,'AMT_CREDIT_RANGE',False)

**Inferences:**
1. More than 80% of the loans provided are for amounts below 900,000.
2. People who get loans of 300-600k tend to default more than others.

In [None]:
# Checking income range count on loan repayment status
univariate_hist(application_data,'AMT_INCOME_RANGE',False)

In [None]:
# Percentage of defaulters based on  income range of applicants
perc_defaulter(application_data,'AMT_INCOME_RANGE',False)

**Inferences:**
1. 90% of the applicants have a total income below 300,000.
2. Applicants with income below 300,000 have a high probability of defaulting.
3. Applicants with income above 700,000 are less likely to default.

In [None]:
# Checking  childrens' count on loan repayment status
univariate_hist(application_data,'CNT_CHILDREN')

In [None]:
# Percentage of defaulters based on children count of applicants
perc_defaulter(application_data,'CNT_CHILDREN')

**Inferences:**
1. Most of the applicants do not have children.
2. Very few clients have more than 3 children.
3. Clients with more than 4 children have a very high default rate; child counts of 9 and 11 show a 100% default rate.
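
Since very few clients have more than 3 children, the extreme rates for large child counts rest on small groups, so it helps to read each rate next to its group size. A sketch on made-up rows, where the `min_count` threshold is an arbitrary assumption:

```python
import pandas as pd

toy = pd.DataFrame({
    "CNT_CHILDREN": [0] * 8 + [1] * 4 + [9] * 2,
    "TARGET":       [0] * 7 + [1] + [0] * 3 + [1] + [1, 1],   # 1 = defaulter
})
# Group size and default rate side by side
summary = toy.groupby("CNT_CHILDREN")["TARGET"].agg(count="count", default_pct="mean")
summary["default_pct"] = summary["default_pct"] * 100
min_count = 5   # assumed cut-off below which a rate is treated as unreliable
summary["reliable"] = summary["count"] >= min_count
print(summary)
```

In this toy data the 100% rate for 9 children comes from only two rows, so it is flagged as unreliable rather than read as a firm rule.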

In [None]:
# Checking  family members' count on loan repayment status
univariate_hist(application_data,'CNT_FAM_MEMBERS')

In [None]:
# Percentage of defaulters based on family members' count of applicants
perc_defaulter(application_data,'CNT_FAM_MEMBERS')

**Inferences:**
1. The number of family members follows the same trend as children: having more family members increases the risk of defaulting.

#### <font color=blue>5.d.Numerical Variables Analysis </font> <a id = 'nva'></a>

In [None]:
# Income type vs Income TOTAL
bivariate_bar("NAME_INCOME_TYPE","AMT_INCOME_TOTAL",application_data,"TARGET",(18,10))


**Inferences:**
- Businessmen have the highest income. The 95% confidence interval that the bar plot draws by default suggests that their average income lies between roughly 4 lakhs and slightly above 10 lakhs.
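
The error bars that seaborn's `barplot` draws by default are a bootstrapped 95% confidence interval of the mean. A minimal NumPy sketch of that idea on made-up incomes (the sample values, the seed and the number of resamples are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
incomes = np.array([400_000, 450_000, 900_000, 1_000_000, 2_250_000], dtype=float)  # made-up

# Resample with replacement and record the mean of each resample
boot_means = np.array([
    rng.choice(incomes, size=incomes.size, replace=True).mean()
    for _ in range(1000)
])
# The middle 95% of the resampled means is the confidence interval
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(ci_low, incomes.mean(), ci_high)
```

A wide interval like this one usually signals a small group, so the bar height for such a category should be read with caution.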

In [None]:
# Bifurcating the application_data dataframe based on Target value 0 and 1 for correlation and other analysis
cols_for_correlation = ['NAME_CONTRACT_TYPE', 'CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY','CREDIT_SCORE',
                        'CNT_CHILDREN', 'AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE', 
                        'NAME_TYPE_SUITE', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS',
                        'NAME_HOUSING_TYPE', 'REGION_POPULATION_RELATIVE', 'DAYS_BIRTH', 'DAYS_EMPLOYED', 
                        'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH', 'OCCUPATION_TYPE', 'CNT_FAM_MEMBERS', 'REGION_RATING_CLIENT',
                        'REGION_RATING_CLIENT_W_CITY', 'WEEKDAY_APPR_PROCESS_START', 'HOUR_APPR_PROCESS_START',
                        'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION', 'LIVE_REGION_NOT_WORK_REGION', 
                        'REG_CITY_NOT_LIVE_CITY', 'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY', 'ORGANIZATION_TYPE',
                        'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE', 'DAYS_LAST_PHONE_CHANGE', 'FLAG_DOCUMENT_3', 
                        'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK',
                        'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR']


Repayer_df = application_data.loc[application_data['TARGET']==0, cols_for_correlation] # Repayers
Defaulter_df = application_data.loc[application_data['TARGET']==1, cols_for_correlation] # Defaulters

In [None]:
fig = plt.figure(figsize=(12,12))
ax = sns.heatmap(Repayer_df.select_dtypes(include='number').corr(), cmap="RdYlGn",annot=False,linewidth =1)


In [None]:
fig = plt.figure(figsize=(12,12))
ax = sns.heatmap(Defaulter_df.select_dtypes(include='number').corr(), cmap="RdYlGn",annot=False,linewidth =1)


In [None]:
! pip install tabulate

In [None]:
# Getting the top 10 correlation for the Repayers data

corr_repayer = Repayer_df.select_dtypes(include='number').corr()
corr_repayer = corr_repayer.where(np.triu(np.ones(corr_repayer.shape),k=1).astype(bool))
corr_df_repayer = corr_repayer.unstack().reset_index()
corr_df_repayer.columns =['VAR1','VAR2','Correlation']
corr_df_repayer.dropna(subset = ["Correlation"], inplace = True)
corr_df_repayer["Correlation"]=corr_df_repayer["Correlation"].abs() 
corr_df_repayer.sort_values(by='Correlation', ascending=False, inplace=True) 
print(corr_df_repayer.head(10).to_markdown())

In [None]:
# Getting the top 10 correlation for the Defaulter data

corr_Defaulter = Defaulter_df.select_dtypes(include='number').corr()
corr_Defaulter = corr_Defaulter.where(np.triu(np.ones(corr_Defaulter.shape),k=1).astype(bool))
corr_df_Defaulter = corr_Defaulter.unstack().reset_index()
corr_df_Defaulter.columns =['VAR1','VAR2','Correlation']
corr_df_Defaulter.dropna(subset = ["Correlation"], inplace = True)
corr_df_Defaulter["Correlation"]=corr_df_Defaulter["Correlation"].abs()
corr_df_Defaulter.sort_values(by='Correlation', ascending=False, inplace=True)
print(corr_df_Defaulter.head(10).to_markdown())

**Inferences:**
1. Credit amount is highly correlated with goods price.
2. The correlation between loan annuity and credit amount is slightly lower for defaulters (0.75) than for repayers (0.77).
3. Repayers also show a higher correlation for number of days employed (0.62) than defaulters (0.58).
4. The correlation between the client's total income and the credit amount drops severely among defaulters, whereas it is 0.33 among repayers.
5. The correlation between DAYS_BIRTH and credit score falls to 0.26 for defaulters, compared with 0.30 for repayers.
6. The correlation between age and days registration is lower for defaulters (0.24) than for repayers (0.30).
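
The ranking behind these figures keeps only the upper triangle of the correlation matrix, so each pair of variables is counted once. A compact sketch on made-up numeric columns, using the built-in `bool` for the mask:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "AMT_CREDIT":       [100.0, 200.0, 300.0, 400.0, 500.0],
    "AMT_GOODS_PRICE":  [90.0, 210.0, 290.0, 410.0, 480.0],
    "AMT_INCOME_TOTAL": [50.0, 40.0, 70.0, 30.0, 60.0],
})
corr = toy.corr().abs()
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)   # True strictly above the diagonal
top_pairs = (corr.where(mask)        # masked cells become NaN
                 .unstack()
                 .dropna()
                 .sort_values(ascending=False))
print(top_pairs)
```

With three columns there are exactly three unique pairs, and the diagonal (each variable with itself) is excluded by `k=1`.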

- Numerical univariate Analysis

In [None]:
fig = px.scatter(application_data, x="CREDIT_SCORE", y="AMT_CREDIT",color='TARGET',opacity=0.5 ,marginal_y="violin",
           marginal_x="histogram")
fig.show()

**Inferences**<br> 
- Credit score does not seem to have an impact on defaulters.
- However, the density of repayers with a normalised credit score above 0.5 is high irrespective of loan amount.

In [None]:
fig = px.scatter_matrix(application_data, dimensions=['AMT_INCOME_TOTAL','AMT_CREDIT',
                         'AMT_ANNUITY', 'AMT_GOODS_PRICE'], color="TARGET",opacity=0.5)
fig.show()

**Inferences:**
1. When AMT_ANNUITY > 15000 and AMT_GOODS_PRICE > 3M, there are fewer defaulters.
2. AMT_CREDIT and AMT_GOODS_PRICE are highly correlated: in the scatterplot most of the points fall along a line.
3. There are very few defaulters for AMT_CREDIT > 3M.

### 6.<font color=green> Merged Dataframes Analysis</font> <a id = 'mda'></a>

In [None]:
#merge both the dataframe on SK_ID_CURR with Inner Joins
loan_process_df = pd.merge(application_data, previous_data, how='inner', on='SK_ID_CURR')
loan_process_df.head()

In [None]:
print('Shape of merged dataframe : ',loan_process_df.shape)
print('Size of merged dataframe : ',loan_process_df.size)

In [None]:
loan_process_df.info()

In [None]:
loan_process_df.describe()

In [None]:


L0 = loan_process_df[loan_process_df['TARGET']==0] # Repayers
L1 = loan_process_df[loan_process_df['TARGET']==1] # Defaulters

In [None]:
univariate_merged("NAME_CASH_LOAN_PURPOSE",L0,"NAME_CONTRACT_STATUS",["#548235","#FF0000","#0070C0","#FFFF00"],True,(18,7))

univariate_merged("NAME_CASH_LOAN_PURPOSE",L1,"NAME_CONTRACT_STATUS",["#548235","#FF0000","#0070C0","#FFFF00"],True,(18,7))

**Inferences:**
1. Loan purpose has a high number of unknown values (XAP).
2. Loans taken for the purpose of Repairs seem to have the highest default rate.
3. A very high number of applications with purpose Repairs or Other have been rejected by the bank or refused by the client. This suggests that the bank treats repairs as high risk: the applications are either rejected, or offered at a very high interest rate that clients cannot afford, so they refuse the loan.

In [None]:
# Percentage of defaulters based on previous contract status of merged dataframe
perc_defaulter(loan_process_df,'NAME_CONTRACT_STATUS',False)

In [None]:
# Percentage of defaulters based on previous client type of merged dataframe
perc_defaulter(loan_process_df,'NAME_CLIENT_TYPE',False)

In [None]:
#plotting sunburst for defaulters'count in type of client with type of loans and its status 
fig = px.sunburst(loan_process_df, path=['NAME_CLIENT_TYPE','NAME_CONTRACT_TYPE_x','NAME_CONTRACT_STATUS'], values=loan_process_df.TARGET==1)
fig.show()

**Inferences**
1. A large number of repeat clients whose cash loans were approved are defaulters. At the same time, a nearly equal number of repeat clients were refused or cancelled.
2. A large number of new applicants approved for revolving loans are defaulters.
3. 13% of the clients who were previously refused are defaulters.
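
The `_x` suffix in `NAME_CONTRACT_TYPE_x` comes from the merge: both frames have a column of that name, so pandas appends `_x` for the left frame (current application) and `_y` for the right frame (previous application). An inner join also drops applicants with no previous application. A toy illustration with made-up IDs and values:

```python
import pandas as pd

current = pd.DataFrame({
    "SK_ID_CURR": [1, 2, 3],
    "NAME_CONTRACT_TYPE": ["Cash loans", "Revolving loans", "Cash loans"],
    "TARGET": [0, 1, 0],
})
previous = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2],
    "NAME_CONTRACT_TYPE": ["Consumer loans", "Cash loans", "Cash loans"],
    "NAME_CONTRACT_STATUS": ["Approved", "Refused", "Approved"],
})
merged = pd.merge(current, previous, how="inner", on="SK_ID_CURR")
print(merged)
```

Applicant 1 appears twice in the result, once per previous application, so counts on the merged frame are counts of previous applications rather than of clients.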

In [None]:
# Checking the Contract Status based on loan repayment status and whether there is any business loss or financial loss
univariate_merged("NAME_CONTRACT_STATUS",loan_process_df,"TARGET",['g','r'],False,(12,8))
g = loan_process_df.groupby("NAME_CONTRACT_STATUS")["TARGET"]
df1 = pd.concat([g.value_counts(),round(g.value_counts(normalize=True).mul(100),2)],axis=1, keys=('Counts','Percentage'))
df1['Percentage'] = df1['Percentage'].astype(str) +"%" # adding percentage symbol in the results for understanding
print(df1.to_markdown())

**Inferences:**
1. 90% of the clients whose previous application was cancelled have actually repaid the loan. Revisiting the interest rates could increase business opportunity with these clients.
2. 88% of the clients who were previously refused a loan have paid back the loan in the current case.
3. The refusal reason should be recorded for further analysis, as these clients could turn into repaying customers.

### 7.<font color=green> Conclusions</font> <a id = 'con'></a>

- After analysing the datasets, a few client attributes stand out that can help the bank identify whether a client will repay the loan. The analysis is summarised below with the contributing factors and their categorization:

**Decisive factors indicating that an applicant will be a Repayer:**
1. NAME_EDUCATION_TYPE: Academic degree has fewer defaults.
2. NAME_INCOME_TYPE: Students and Businessmen have no defaults.
3. REGION_RATING_CLIENT: Rating 1 is safer.
4. ORGANIZATION_TYPE: Clients with Trade Type 4 and 5 have defaulted less than 3%.
5. DAYS_BIRTH: People above the age of 50 have a low probability of defaulting.
6. DAYS_EMPLOYED: Clients with 40+ years of experience have a default rate below 1%.
7. AMT_INCOME_TOTAL: Applicants with income above 700,000 are less likely to default.
8. CNT_CHILDREN: People with zero to two children tend to repay their loans.
9. CREDIT_SCORE: People with an average to high credit score are less likely to default.

**Decisive factors indicating that an applicant will be a Defaulter:**
1. CODE_GENDER: Men have a relatively higher default rate.
2. NAME_FAMILY_STATUS: People in a civil marriage or who are single default more often.
3. NAME_EDUCATION_TYPE: People with Lower Secondary and Secondary education default more often.
4. NAME_INCOME_TYPE: Clients who are on Maternity leave or Unemployed default more often.
5. REGION_RATING_CLIENT: People living in Rating 3 regions have the highest default rate.
6. OCCUPATION_TYPE: Avoid Low-skill Laborers, Drivers, Waiters/barmen staff, Security staff, Laborers and Cooking staff, as their default rate is high.
7. ORGANIZATION_TYPE: Organizations with the highest percentage of unrepaid loans are Transport: type 3 (15.75%), Industry: type 13 (13.5%), Industry: type 8 (12.5%) and Restaurant (less than 12%). Self-employed people have a relatively high default rate (10.17%); they should either not be approved or be offered loans at a higher interest rate to mitigate the risk of default.
8. DAYS_BIRTH: Avoid young people in the 20-40 age group, as they have a higher probability of defaulting.
9. DAYS_EMPLOYED: People with less than 5 years of employment have a high default rate.
10. CNT_CHILDREN & CNT_FAM_MEMBERS: Clients with 9 or more children default 100% of the time, and hence their applications are to be rejected.
11. AMT_GOODS_PRICE: When the credit amount goes beyond 3M, there is an increase in defaulters.