## Introduction:
This case study shows how exploratory data analysis (EDA) applies to a real business scenario. Along the way, we build a basic understanding of risk analytics in banking and financial services and see how data is used to minimise the risk of losing money when lending to customers.


## Business Objectives:
This case study aims to identify patterns that indicate whether a client will have difficulty paying their installments. Such patterns can inform actions such as denying the loan, reducing the loan amount, or lending to risky applicants at a higher interest rate, while ensuring that applicants who are capable of repaying the loan are not rejected. Identifying such applicants using EDA is the aim of this case study. The company wants to understand the driving factors (or driver variables) behind loan default, i.e. the variables that are strong indicators of default. The company can use this knowledge for its portfolio and risk assessment.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
# Importing Libraries
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
pd.set_option('display.max_columns',150)
pd.set_option('display.max_info_columns', 150)
pd.set_option('display.max_rows',150)

## 1. Importing Datasets

In [None]:
# Loading the application_data dataset
app_data = pd.read_csv("../input/loan-defaulter/application_data.csv")

## 2. Understanding Data

In [None]:
# Display top 5 rows of app_data dataframe
app_data.head()

In [None]:
# Printing shape of application_data dataset
print(f'Shape of app_data : {app_data.shape}')

In [None]:
app_data.info()

> `app_data.info()` only shows the data types of the columns here, with no information about null values. Let's use `describe()` to get some more insight.

In [None]:
app_data.describe()

>Some columns of the application dataset also have null values, as their `count` values (the number of non-null entries) differ.
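
As a complement to reading the `count` row of `describe()`, null counts can be tabulated directly. The sketch below is illustrative only: `null_summary` is a hypothetical helper and `toy` is made-up data standing in for `app_data`.

```python
import numpy as np
import pandas as pd

def null_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column null counts and percentages, highest first."""
    summary = pd.DataFrame({
        "null_count": df.isnull().sum(),
        "null_pct": (df.isnull().mean() * 100).round(2),
    })
    return summary.sort_values("null_pct", ascending=False)

# Toy frame standing in for app_data (values are made up)
toy = pd.DataFrame({
    "AMT_CREDIT": [100.0, 200.0, 300.0, 400.0],
    "AMT_GOODS_PRICE": [90.0, np.nan, 250.0, 380.0],
    "OCCUPATION_TYPE": ["Laborers", None, None, "Drivers"],
})
print(null_summary(toy))
```

On the real data the same call would be `null_summary(app_data)`.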

## 3. Data Preprocessing (Cleaning and Fixing Data)

###### For app_data dataframe

In [None]:
# Removing unwanted columns
unwanted_col=['FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE',
       'FLAG_PHONE', 'FLAG_EMAIL', 'CNT_FAM_MEMBERS', 'REGION_RATING_CLIENT',
       'REGION_RATING_CLIENT_W_CITY','DAYS_LAST_PHONE_CHANGE', 'FLAG_DOCUMENT_2', 'FLAG_DOCUMENT_3','FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_6',
       'FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_8', 'FLAG_DOCUMENT_9','FLAG_DOCUMENT_10', 'FLAG_DOCUMENT_11', 'FLAG_DOCUMENT_12',
       'FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_14', 'FLAG_DOCUMENT_15','FLAG_DOCUMENT_16', 'FLAG_DOCUMENT_17', 'FLAG_DOCUMENT_18',
       'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20', 'FLAG_DOCUMENT_21','OBS_30_CNT_SOCIAL_CIRCLE','DEF_30_CNT_SOCIAL_CIRCLE',
        'OBS_60_CNT_SOCIAL_CIRCLE','DEF_60_CNT_SOCIAL_CIRCLE','EXT_SOURCE_2','EXT_SOURCE_3']
app_data.drop(unwanted_col, inplace=True, axis=1)

In [None]:
# Column wise Null percentage in app_data
null_percentage_app = round(app_data.isnull().sum()/app_data.shape[0]*100, 2)
print(null_percentage_app)

In [None]:
# Getting list of columns which have more than or equal to 45% missing values
app_colToDrop = list(null_percentage_app[null_percentage_app >= 45].index)

In [None]:
print(f'No. of columns to drop: {len(app_colToDrop)}')
app_colToDrop

In [None]:
# Dropping columns having 45% or more missing values
app_data.drop(app_colToDrop, axis= 1, inplace=True)

In [None]:
# Rechecking column wise Null percentage in app_data
null_percentage_app = round(app_data.isnull().sum()/app_data.shape[0]*100, 2)
print(null_percentage_app)
print(f'New shape of app_data: {app_data.shape}')

In [None]:
# Getting columns with missing values between 0% and 45%
missing_value_col_app = list((null_percentage_app[null_percentage_app > 0].index))
missing_value_col_app

In [None]:
print(app_data.AMT_GOODS_PRICE.describe())
# KDE curve only; sns.distplot is deprecated in recent seaborn releases
sns.kdeplot(app_data.AMT_GOODS_PRICE)

In [None]:
sns.boxplot(y = app_data.AMT_GOODS_PRICE)

> AMT_GOODS_PRICE is the price of the goods for which the loan was taken, which can be an important metric for identifying defaulters. This column shows a high standard deviation, has a multi-modal distribution that is right-skewed (long tail toward high prices), and contains outliers, as can be seen in the boxplot.

> The **median would be ideally** suited for imputing this column, but only 0.09% of its values are missing, so I would recommend **removing the rows with missing values**.
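
The two options mentioned above (dropping the rows or imputing the median) can be sketched as follows. The `prices` frame is made-up toy data, not the real column, and the notebook itself does not apply either step at this point.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the AMT_GOODS_PRICE column (values are made up)
prices = pd.DataFrame(
    {"AMT_GOODS_PRICE": [450000.0, 225000.0, np.nan, 675000.0, 1350000.0]}
)

# Option 1 (the recommendation above): drop the few rows with a missing price
dropped = prices.dropna(subset=["AMT_GOODS_PRICE"])

# Option 2: impute with the median, which is less sensitive to outliers than the mean
median_price = prices["AMT_GOODS_PRICE"].median()
imputed = prices.fillna({"AMT_GOODS_PRICE": median_price})

print(len(prices), len(dropped), median_price)
```

With so few missing rows, both options should give nearly the same distribution; dropping simply avoids inventing values.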

In [None]:
print(app_data.OCCUPATION_TYPE.value_counts())
app_data.OCCUPATION_TYPE.value_counts().plot(kind ='bar', figsize = (10,8))

> OCCUPATION_TYPE is an important categorical column with around 31.35% missing values. Ideally we would replace them with the **mode (most frequent category)**, but in this analysis the best step is to **remove the rows with missing values** in OCCUPATION_TYPE, as occupation plays a major role in whether a client defaults.
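
One way to see why mode imputation is unattractive when roughly a third of the values are missing: it inflates the share of the most frequent category. The sketch below uses a made-up `occ` series, so the numbers are illustrative only.

```python
import pandas as pd

# Toy stand-in for OCCUPATION_TYPE with 4 of 10 values missing
occ = pd.Series(
    ["Laborers", "Laborers", "Drivers", None, None,
     "Managers", None, "Laborers", "Drivers", None],
    name="OCCUPATION_TYPE",
)

# Shares among the observed values (missing values are excluded by default)
share_observed = occ.value_counts(normalize=True)

# Shares after filling every missing value with the mode
mode_value = occ.mode()[0]
share_mode_imputed = occ.fillna(mode_value).value_counts(normalize=True)

print(share_observed)
print(share_mode_imputed)
```

In this toy case the share of the most frequent category rises from 50% to 70% purely because of the imputation.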

In [None]:
# print(app_data.EXT_SOURCE_3.describe())
# sns.boxplot(y = app_data.EXT_SOURCE_3)

In [None]:
# sns.distplot(app_data.EXT_SOURCE_3, hist= False)

> Recommendation to use **Mean** to impute missing values of EXT_SOURCE_3 as there are not 

In [None]:
app_data.info()

In [None]:

# Value counts and bar plots for the last six columns with missing values
Amt_req_credit = missing_value_col_app[-6:]
# One figure with one stacked subplot per column
fig, axes = plt.subplots(len(Amt_req_credit), 1, figsize=(5, 30))
for ax, col in zip(axes, Amt_req_credit):
    print('\n' + col)
    print(f'No. of unique values: {app_data[col].nunique()}')
    print(app_data[col].value_counts())
    app_data[col].value_counts().plot(kind='bar', ax=ax, title=col)
plt.tight_layout()

> Use the **mode** to impute missing values in these columns, as they are categorical columns:
> 'AMT_REQ_CREDIT_BUREAU_HOUR',
> 'AMT_REQ_CREDIT_BUREAU_DAY',
> 'AMT_REQ_CREDIT_BUREAU_WEEK',
> 'AMT_REQ_CREDIT_BUREAU_MON',
> 'AMT_REQ_CREDIT_BUREAU_QRT',
> 'AMT_REQ_CREDIT_BUREAU_YEAR'
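
The mode imputation recommended above could look like the following minimal sketch. It uses a made-up `toy` frame with only two of the six columns; the notebook itself does not apply this step.

```python
import numpy as np
import pandas as pd

bureau_cols = ["AMT_REQ_CREDIT_BUREAU_HOUR", "AMT_REQ_CREDIT_BUREAU_YEAR"]

# Toy stand-in for app_data (values are made up)
toy = pd.DataFrame({
    "AMT_REQ_CREDIT_BUREAU_HOUR": [0.0, 0.0, np.nan, 1.0, 0.0],
    "AMT_REQ_CREDIT_BUREAU_YEAR": [1.0, np.nan, 2.0, 1.0, np.nan],
})

# mode() returns one row per tied mode; the first row holds the first mode of each column
modes = toy[bureau_cols].mode().iloc[0]

# fillna with a Series fills each column with the value under its own name
filled = toy.fillna(modes)
print(filled)
```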

### Checking Data and DataTypes of the Columns

In [None]:
app_data

###### > Fixing Days_Employed Col

In [None]:
app_data.DAYS_EMPLOYED.value_counts()

> The first value, 365243 days, is roughly 1000 years, which is not possible. So we replace it with 0.
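
The notebook replaces the placeholder with 0. As an alternative sketch (an assumption, not what the notebook does), the placeholder can be marked as missing so that it does not read as "zero days employed". The `toy` frame and the `DAYS_EMPLOYED_CLEAN` and `YEARS_EMPLOYED` column names are hypothetical.

```python
import numpy as np
import pandas as pd

SENTINEL = 365243  # placeholder value observed in DAYS_EMPLOYED

# Toy stand-in for app_data (values are made up)
toy = pd.DataFrame({"DAYS_EMPLOYED": [-637, -1188, SENTINEL, -3039, SENTINEL]})

# Mark the placeholder as missing, then flip the sign so days are positive
days = toy["DAYS_EMPLOYED"].replace(SENTINEL, np.nan)
toy["DAYS_EMPLOYED_CLEAN"] = -days
toy["YEARS_EMPLOYED"] = (toy["DAYS_EMPLOYED_CLEAN"] / 365.25).round(1)

print(toy)
print(f"Sentinel in years: {SENTINEL / 365.25:.0f}")
```

Which choice is better depends on the later analysis: 0 keeps the rows in numeric summaries, while NaN keeps them out of averages and quantiles.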

In [None]:
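# Assign the result back instead of relying on an in-place replace through attribute access
app_data['DAYS_EMPLOYED'] = app_data['DAYS_EMPLOYED'].replace(365243, 0)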

In [None]:
app_data.DAYS_EMPLOYED = -1*app_data.DAYS_EMPLOYED

In [None]:
app_data.DAYS_EMPLOYED.value_counts()

In [None]:
# Checking ORGANIZATION_TYPE col
app_data.ORGANIZATION_TYPE.value_counts(normalize = True)*100

> The value **'XNA'**, which to our knowledge stands for **'Not Available'**, can be treated as an invalid or missing value. It accounts for 18% of all rows, which we can **remove from our analysis**.

In [None]:
# Removing XNA rows from ORGANIZATION_TYPE.
app_data.drop(app_data[app_data.ORGANIZATION_TYPE == 'XNA'].index, axis=0, inplace=True)
app_data.ORGANIZATION_TYPE.value_counts()

In [None]:
# Checking CODE_GENDER col
app_data.CODE_GENDER.value_counts(normalize = True)*100

> Here too the value **'XNA'**, which to our knowledge stands for **'Not Available'**, can be treated as an invalid or missing value. It accounts for 0.0013% of all rows, which we will **remove from our analysis**.

In [None]:
# Removing XNA rows from the CODE_GENDER.
app_data.drop(app_data[app_data.CODE_GENDER == 'XNA'].index, axis=0, inplace=True)
app_data.CODE_GENDER.value_counts()

In [None]:
# Defining fun. to fix other col.
Other_colToFix = ['DAYS_BIRTH','DAYS_ID_PUBLISH','DAYS_REGISTRATION']
def fixCol(arr):
    for i in arr:
        app_data[i] = -1 * app_data[i]
        print('\n'+i)
        print(app_data[i].value_counts())

In [None]:
fixCol(Other_colToFix)

In [None]:
# Casting all other columns data type to numeric data type

num_col=['TARGET',
          'CNT_CHILDREN',
          'AMT_INCOME_TOTAL',
          'AMT_CREDIT','AMT_ANNUITY',
          'REGION_POPULATION_RELATIVE','DAYS_BIRTH',
          'DAYS_EMPLOYED',
          'DAYS_REGISTRATION',
          'DAYS_ID_PUBLISH',
          'HOUR_APPR_PROCESS_START',
          'LIVE_REGION_NOT_WORK_REGION',
          'REG_CITY_NOT_LIVE_CITY',
          'REG_CITY_NOT_WORK_CITY',
          'LIVE_CITY_NOT_WORK_CITY']

app_data[num_col]=app_data[num_col].apply(pd.to_numeric)
app_data.head(5)


#### Finding Outliers for Numerical Col

In [None]:
For_Outliers = list(enumerate(['AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE']))

In [None]:
for i in For_Outliers:
    print('\n'+i[1])
    print('-'*30)
    print(app_data[i[1]].describe())

> 1. For AMT_INCOME_TOTAL: There is a lot of variation between the 75th percentile and the maximum, so this column very probably has outliers, which can be confirmed with a boxplot later.

> 2. For AMT_CREDIT: There is a considerable amount of variation across the quartiles, but `describe()` does not give a clear picture of the outliers in this column. We need a boxplot for that.

> 3. For AMT_ANNUITY: This column has a large amount of variation in the last quartile, which suggests that it contains outliers; this can be confirmed with a boxplot later.

> 4. For AMT_GOODS_PRICE: For this column, `describe()` does not give any strong information about outliers.

In [None]:
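# One figure with one horizontal boxplot per column
fig, axes = plt.subplots(len(For_Outliers), 1, figsize=(10,15))
for i in For_Outliers:
    # i[0] is the position in the list, i[1] the column name
    sns.boxplot(x=app_data[i[1]], ax=axes[i[0]])
plt.tight_layout()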

> 1. For AMT_INCOME_TOTAL: This column has an outlier, visible as a single point on the extreme right of its boxplot. We can remove it.
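
A boxplot conventionally draws points beyond 1.5 times the interquartile range (IQR) from the box as outliers. The same rule can be applied numerically; `iqr_outlier_mask` is a hypothetical helper and `income` is made-up data, so this is a sketch rather than the notebook's method.

```python
import pandas as pd

def iqr_outlier_mask(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values lying more than k * IQR outside the first or third quartile."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

# Toy incomes with one extreme value (made up)
income = pd.Series(
    [90_000, 112_500, 135_000, 157_500, 180_000, 202_500, 225_000, 117_000_000],
    dtype="float64",
)
mask = iqr_outlier_mask(income)
print(income[mask])
```

Rows could then be removed with `income[~mask]`, or on the real data with a boolean filter on `app_data`.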

In [None]:
app_data.AMT_CREDIT.quantile([0.5, 0.8, 0.85, 0.9, 0.95, 0.97, 0.99,1])

In [None]:
app_data[app_data.AMT_CREDIT > 1200000].AMT_CREDIT.describe()

> 2. For AMT_CREDIT: This column has some outliers, but they are significant for our analysis. There is a large gap between the 99th percentile and the maximum value. Here we can **cap the outliers** at some value.
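
Capping can be sketched with `Series.clip`. The data below is made up, and the 0.90 quantile is used only because the toy sample is tiny; on the real column a higher quantile such as 0.99 would be the natural choice given the note above.

```python
import pandas as pd

# Toy stand-in for AMT_CREDIT (values are made up)
credit = pd.Series(
    [100_000, 250_000, 300_000, 450_000, 500_000,
     600_000, 750_000, 900_000, 1_200_000, 4_050_000],
    name="AMT_CREDIT", dtype="float64",
)

cap = credit.quantile(0.90)        # threshold chosen for the toy sample only
capped = credit.clip(upper=cap)    # values above the cap are set to the cap

print(cap)
print(capped.describe())
```

Capping keeps every row, which matters here because the large credits are considered meaningful rather than erroneous.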

In [None]:
app_data.AMT_ANNUITY.quantile([0.5, 0.8, 0.85, 0.9, 0.95, 0.97, 0.99, 1])

In [None]:
app_data[app_data.AMT_ANNUITY > 60000].AMT_ANNUITY.describe()

> 3. For AMT_ANNUITY: The bulk of the distribution is smooth, but there are some outliers in the top 1%. The **best approach is to remove the extreme values and cap the rest**.

In [None]:
app_data.AMT_GOODS_PRICE.quantile([0.5, 0.8, 0.85, 0.9, 0.95, 0.97, 0.99, 1])

In [None]:
app_data[app_data.AMT_GOODS_PRICE > 1800000].AMT_GOODS_PRICE.dropna().describe()

> 4. For AMT_GOODS_PRICE: The bulk of the distribution is smooth, but there are some outliers in the top 1%. The best approach is **to remove the extreme values and cap the rest**.
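
"Remove the extreme values and cap the rest" can be expressed as a two-threshold rule. The helper `trim_and_cap` and both quantile defaults are assumptions chosen for illustration; the notebook does not fix specific thresholds.

```python
import pandas as pd

def trim_and_cap(s: pd.Series, cap_q: float = 0.99, drop_q: float = 0.999) -> pd.Series:
    """Drop values above the drop_q quantile, then cap the rest at the cap_q quantile."""
    cap_value = s.quantile(cap_q)
    drop_value = s.quantile(drop_q)
    kept = s[s.isna() | (s <= drop_value)]   # missing values are left untouched
    return kept.clip(upper=cap_value)

# Toy data: 1..1000 plus one extreme value (made up)
s = pd.Series(list(range(1, 1001)) + [1_000_000], dtype="float64")
out = trim_and_cap(s)
print(len(s), len(out), out.max())
```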

#### Binning

In [None]:
# Creating bins for AMT_INCOME_TOTAL
bins = [0,100000,200000,300000,400000,500000,600000,700000,800000, 900000, 1000000, 100000000]
slot = ['0-100000','100000-200000','200000-300000','300000-400000','400000-500000','500000-600000',
        '600000-700000','700000-800000','800000-900000','900000-1000000', '1000000 +']

app_data['AMT_INCOME_RANGE']=pd.cut(app_data['AMT_INCOME_TOTAL'], bins, labels=slot)

In [None]:
app_data['AMT_INCOME_RANGE'].value_counts()

In [None]:
# Creating bins for AMT_CREDIT

bins = [0,150000,200000,250000,300000,350000,400000,450000,500000,550000,600000,650000,700000,750000,800000,850000,900000,1000000000]
slots = ['0-150000', '150000-200000','200000-250000', '250000-300000', '300000-350000', '350000-400000','400000-450000',
        '450000-500000','500000-550000','550000-600000','600000-650000','650000-700000','700000-750000','750000-800000',
        '800000-850000','850000-900000','900000 and above']

app_data['AMT_CREDIT_RANGE']=pd.cut(app_data['AMT_CREDIT'], bins=bins, labels=slots)
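
A small sketch of how `pd.cut` assigns values to bins may help when reading the ranges above. The edges, labels and `income` values here are made up; the open-ended top bin uses `np.inf` instead of a large number, which is a design choice of this sketch rather than of the notebook.

```python
import numpy as np
import pandas as pd

edges = [0, 100_000, 200_000, 300_000, np.inf]
# Build "lo-hi" labels for the closed bins and "lo+" for the open-ended last bin
labels = [f"{lo}-{hi}" for lo, hi in zip(edges[:-2], edges[1:-1])] + [f"{edges[-2]}+"]

income = pd.Series([50_000, 100_000, 100_001, 250_000, 2_000_000])
binned = pd.cut(income, bins=edges, labels=labels)
print(binned)
```

By default the intervals are closed on the right, so 100,000 falls into `0-100000` and a value of exactly 0 would get no bin unless `include_lowest=True` is passed.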

In [None]:
app_data.head()

## 4. Analysis 

In [None]:
# Checking for Imbalance dataset w.r.t. TARGET col
value_Count_Target = app_data.TARGET.value_counts(normalize = True)*100
value_Count_Target.plot(kind= 'bar')
print(value_Count_Target)

> This is a highly imbalanced dataset: approx. 92% of the rows have a target value of 0 and only approx. 8% have a target value of 1.
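
The imbalance can also be summarised as a single ratio. The `target` series below is synthetic, built to mirror the approximate 92/8 split quoted above.

```python
import pandas as pd

# Synthetic target mirroring the approximate 92% / 8% split
target = pd.Series([0] * 92 + [1] * 8, name="TARGET")

counts = target.value_counts()
imbalance_ratio = counts.loc[0] / counts.loc[1]   # non-defaulters per defaulter
default_rate = target.mean()                      # share of rows with TARGET == 1

print(f"Imbalance ratio: {imbalance_ratio:.1f} : 1, default rate: {default_rate:.0%}")
```

Because of this imbalance, raw counts for Target_1 will always look small next to Target_0, so shares or rates are often easier to compare than counts.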

In [None]:
# Splitting dataset w.r.t. Target == 0 & Target == 1
Target_0 = app_data[app_data.TARGET == 0]
Target_1 = app_data[app_data.TARGET == 1]

In [None]:
Target_0.head()

In [None]:
Target_1.head()

### Univariate Analysis for Categorical Variable

In [None]:
# Defining a function to plot the countplot for different categories
def UniVarCatPlot(title, hue = None, rotation=None, col_y = None, col_x = None):
    sns.set_style('whitegrid')
    sns.set_context('talk')
    plt.rcParams["axes.labelsize"] = 20
    plt.rcParams['axes.titlesize'] = 30
    plt.rcParams['axes.titlepad'] = 30
    
    if col_x:
        col_name = col_x
        plt.figure(figsize=(30,25))
    else:
        col_name = col_y
        plt.figure(figsize=(15,38))

    #   1st subplot for Target_0
    plt.subplots_adjust(hspace=0.5)
    plt.subplot(2,1,1)
    
    title1 = title + ' for Target_0 (Client with NO Payment Difficulty)'
    plt.title(title1)

    #   Log scale on the count axis (y for vertical bars, x for horizontal bars)
    if col_x:
        plt.yscale('log')
        plt.xticks(rotation = rotation)
    else:
        plt.xscale('log')
        plt.yticks(rotation = rotation)
        
    sns.countplot(data = Target_0, x = col_x, y = col_y, order=Target_0[col_name].value_counts().index, hue=hue, palette='dark')
    

    #   2nd subplot for Target_1
    plt.subplot(2,1,2)
    plt.xticks(rotation = rotation)
    title2 = title + ' for Target_1 (Client with Payment Difficulty)'
    plt.title(title2)

    #   Log scale on the count axis (y for vertical bars, x for horizontal bars)
    if col_x:
        plt.yscale('log')
        plt.xticks(rotation = rotation)
    else:
        plt.xscale('log')
        plt.yticks(rotation = rotation)
        
    sns.countplot(data = Target_1, x = col_x, y= col_y, order=Target_1[col_name].value_counts().index, hue=hue, palette='dark')
    plt.legend(loc = 'upper right', fontsize = 'large')
    plt.show()

In [None]:
# Count plot for income range wrt gender
UniVarCatPlot(col_x = 'AMT_INCOME_RANGE', title= 'Count Plot for Income Range', hue='CODE_GENDER')

> Observations from the count plot:
> 1. More applications for credit come from females.
> 2. The majority of incomes lie between 0 and 400,000.
> 3. The count is low in the 900,000-1,000,000 interval, but rises suddenly in the 1,000,000+ interval, which spans a much wider range than the other bins.

In [None]:
# Count Plot for contract type wrt gender
UniVarCatPlot(col_x = 'NAME_CONTRACT_TYPE', title= 'Count Plot for Contract Type', hue='CODE_GENDER')

For Target_0
> The count for contract type 'Cash loans' is significantly higher than for 'Revolving loans'.
> Females have the higher count in this category.

For Target_1
> The count for contract type 'Cash loans' is significantly higher than for 'Revolving loans'.
> Females have the higher count in this category, with no males in 'Revolving loans'.


In [None]:
# Count Plot for type of education wrt gender
UniVarCatPlot(col_x = 'NAME_EDUCATION_TYPE', title= 'Count Plot for Education Type', hue='CODE_GENDER')

>For both Target_0 and Target_1 the count plot shows a similar pattern, with the Secondary education type having the highest count.

In [None]:
# Count Plot for different housing types wrt gender
UniVarCatPlot(col_x = 'NAME_HOUSING_TYPE', title= 'Count Plot for Different Housing Type', hue='CODE_GENDER')

>For both Target_0 and Target_1 the count plot shows a similar pattern, with the House/apartment housing type having the highest count.

In [None]:
# Count Plot for occupation type
UniVarCatPlot(col_x = 'OCCUPATION_TYPE', title= 'Count Plot for Different Occupation Type', rotation= 45)

For Target_0
>1. Laborers have the highest count.

For Target_1
>1. Similar to Target_0, Laborers have the highest count.

This suggests the bank should be careful in giving loans to the top 5 occupation types, i.e. Laborers, Sales staff, Drivers, Core staff and Managers.
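
Counts alone mostly reflect how many applicants each occupation has. A complementary view, sketched below on made-up data, is the default rate per occupation (the mean of `TARGET` within each group); the `by_occ` name and the numbers are illustrative assumptions.

```python
import pandas as pd

# Toy stand-in for app_data (values are made up)
toy = pd.DataFrame({
    "OCCUPATION_TYPE": ["Laborers"] * 6 + ["Drivers"] * 2 + ["Managers"] * 2,
    "TARGET":          [0, 0, 0, 0, 0, 1, 0, 1, 0, 0],
})

# Applications and default rate per occupation, riskiest first
by_occ = (
    toy.groupby("OCCUPATION_TYPE")["TARGET"]
       .agg(applications="count", default_rate="mean")
       .sort_values("default_rate", ascending=False)
)
print(by_occ)
```

In this toy example the occupation with the most applications is not the one with the highest default rate, which is why both columns are worth reading together.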

In [None]:
UniVarCatPlot(col_y = 'ORGANIZATION_TYPE', title= 'Count Plot for Different Organization Type')

For Target_0
>1. Most clients who applied for credit are from the organization types 'Business Entity Type 3', 'Self-employed', 'Other', 'Medicine' and 'Government'.
>2. Fewer clients are from Industry type 8, type 6 and type 10, Religion, and Trade type 5 and type 4.

For Target_1
>1. Most clients who applied for credit are from the organization types 'Business Entity Type 3', 'Self-employed', 'Other', 'Medicine' and 'Government'.
>2. Fewer clients are from Industry type 8, type 6 and type 10, Religion, and Trade type 5 and type 4.
>3. The distribution of organization types is the same as for Target_0.

In [None]:
# Finding Correlation between variables for Target_0
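corr_0 = Target_0.corr(numeric_only=True)  # skip non-numeric columns (needed in pandas >= 2.0)
# sns.heatmap(corr_0)
corr_0_df = corr_0.where(np.triu(np.ones(corr_0.shape), k=1).astype(bool))  # np.bool was removed from NumPy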
corr_0_df = corr_0_df.unstack().reset_index()
corr_0_df.columns = ['Variable_1', 'Variable_2', 'Correlation']
corr_0_df.dropna(subset = ['Correlation'], inplace = True)
corr_0_df['Correlation'] = round(corr_0_df['Correlation'],2)
corr_0_df['Correlation'] = abs(corr_0_df['Correlation'])
corr_0_df.sort_values(by = 'Correlation', ascending = False).head(10)

In [None]:
# Finding Correlation between variables for Target_1
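corr_1 = Target_1.corr(numeric_only=True)  # skip non-numeric columns (needed in pandas >= 2.0)
corr_1_df = corr_1.where(np.triu(np.ones(corr_1.shape), k=1).astype(bool))  # np.bool was removed from NumPy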
corr_1_df = corr_1_df.unstack().reset_index()
corr_1_df.columns = ['Variable_1', 'Variable_2', 'Correlation']
corr_1_df.dropna(subset = ['Correlation'], inplace = True)
corr_1_df['Correlation'] = round(corr_1_df['Correlation'],2)
corr_1_df['Correlation'] = abs(corr_1_df['Correlation'])
corr_1_df.sort_values(by = 'Correlation', ascending = False).head(10)

>The highest correlations are almost the same in the Target_0 and Target_1 dataframes, and occur between the same variables.
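
The two cells above repeat the same steps for each subset. A possible refactoring is sketched below; `top_correlations` is a hypothetical helper, the `toy` frame is synthetic, and `numeric_only=True` assumes pandas 1.5 or newer.

```python
import numpy as np
import pandas as pd

def top_correlations(df: pd.DataFrame, n: int = 10) -> pd.DataFrame:
    """Return the n variable pairs with the largest absolute correlation."""
    corr = df.corr(numeric_only=True).abs()
    # Keep only the strict upper triangle so each pair appears once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).unstack().dropna().reset_index()
    pairs.columns = ["Variable_1", "Variable_2", "Correlation"]
    return pairs.sort_values("Correlation", ascending=False).head(n).reset_index(drop=True)

# Synthetic data: two strongly related columns, one unrelated, one non-numeric
rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({
    "AMT_CREDIT": x,
    "AMT_GOODS_PRICE": 0.9 * x + rng.normal(scale=0.05, size=200),
    "CNT_CHILDREN": rng.normal(size=200),
    "NAME_CONTRACT_TYPE": ["Cash loans"] * 200,   # ignored by corr
})
top = top_correlations(toy, n=2)
print(top)
```

On the notebook's data this would be called as `top_correlations(Target_0)` and `top_correlations(Target_1)`.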

In [None]:
# Correlation Matrix for Target_0
plt.figure(figsize=(10, 8))
plt.rcParams['axes.titlesize'] = 20
plt.rcParams['axes.titlepad'] = 30
plt.rcParams['xtick.labelsize']=12
plt.rcParams['ytick.labelsize']=12
plt.title('Correlation of Target 0')
sns.heatmap(corr_0.iloc[2:-6,2:-6], cmap ='RdYlGn')

In [None]:
# Correlation Matrix for Target_1
plt.figure(figsize=(10, 8))
plt.rcParams['axes.titlesize'] = 20
plt.rcParams['axes.titlepad'] = 30
plt.rcParams['xtick.labelsize']=12
plt.rcParams['ytick.labelsize']=12
plt.title('Correlation of Target 1')
sns.heatmap(corr_1.iloc[2:-6,2:-6], cmap ='RdYlGn')

> We can see that the correlation matrices for Target_0 and Target_1 are very similar.

### Univariate Analysis for Numerical Variable

In [None]:
# Defining a function to plot the boxplot for different numerical col.
def UniVarNumPlot(title, col):
    sns.set_style('whitegrid')
    sns.set_context('talk')
    plt.figure(figsize=(15,6))
    plt.rcParams["axes.labelsize"] = 12
    plt.rcParams['xtick.labelsize']=12
    plt.rcParams['ytick.labelsize']=12
    plt.rcParams['axes.titlesize'] = 14
    plt.rcParams['axes.titleweight'] = 12
    plt.rcParams['axes.titlepad'] = 30
    

    #   1st subplot for Target_0
    plt.subplots_adjust(wspace=1.5)
    plt.subplot(1,2,1)
    title1 = title + ' for Target_0'
    plt.title(title1)
    plt.yscale('log')
    # y= draws a vertical box, so the log-scaled y-axis carries the values
    sns.boxplot(data = Target_0, y = col)
    

    #   2nd subplot for Target_1
    plt.subplot(1,2,2)
    title2 = title + ' for Target_1'
    plt.title(title2)
    plt.yscale('log')
    sns.boxplot(data = Target_1, y = col)
    plt.show()

In [None]:
# Boxplot for Total Income for Target_0 and Target_1
UniVarNumPlot(col ='AMT_INCOME_TOTAL', title = 'Distribution of Client Income' )

For Target_0 Client Income

>Some outliers are noticed in the income amount.
The third quartile is very slim for the income amount.


For Target_1 Client Income

>Some outliers are noticed in the income amount.
The third quartile is very slim for the income amount.
Most of the clients' incomes fall in the first quartile.

In [None]:
# Boxplot for Credit Amt for Target_0 and Target_1
UniVarNumPlot(col ='AMT_CREDIT', title = 'Distribution of Credit Amount' )

> Both Target_0 and Target_1 have a similar type of boxplot.

> Some outliers are noticed in the credit amount.
The first quartile is bigger than the third quartile for the credit amount, which means most of the clients' credits fall in the first quartile.


In [None]:
# Boxplot for Annuity Amt for Target_0 and Target_1
UniVarNumPlot(col ='AMT_ANNUITY', title = 'Distribution of Annuity Amount' )

> For Target_0
Some outliers are noticed in the annuity amount.
The first quartile is bigger than the third quartile for the annuity amount, which means most of the clients' annuities fall in the first quartile.

> For Target_1
Some outliers are noticed in the annuity amount.
The first quartile is bigger than the third quartile for the annuity amount, which means most of the clients' annuities fall in the first quartile.

> Both Target_0 and Target_1 have a similar type of boxplot for the annuity amount.

#### Bivariate analysis for numerical variables

In [None]:
# Function for Bi- variate Boxplot analysis
def BiVarPlot(data, col_x, col_y, hue, title, scale=None):
    plt.rcParams["axes.labelsize"] = 12
    plt.rcParams["axes.labelpad"] = 12
    plt.rcParams['xtick.labelsize']=12
    plt.rcParams['ytick.labelsize']=12
    plt.rcParams['axes.titlesize'] = 14
    plt.rcParams['axes.titleweight'] = 12
    plt.rcParams['axes.titlepad'] = 30
    plt.figure(figsize=(16,12))
    plt.xticks(rotation=0)
    if scale:
        plt.yscale(scale)
    sns.boxplot(data = data, x= col_x, y= col_y, hue= hue, orient='v')
    plt.title(title)
    plt.show()

In [None]:
# Box plotting for Credit amount for Target_0
BiVarPlot(data = Target_0,
          col_x='NAME_EDUCATION_TYPE',
          col_y='AMT_CREDIT',
          hue='NAME_FAMILY_STATUS',
          title='Credit amount vs Education Status')

> From the above box plot we can say that, for the 'Academic degree' education type, the family statuses 'Civil marriage', 'Married' and 'Separated' have higher credit amounts than the others.

> Most of the outliers are from the education types 'Higher education' and 'Secondary'.

> For 'Civil marriage' with an 'Academic degree', most of the credit amounts lie in the third quartile.

>All of the above-mentioned categories of people have no difficulty in paying back the loan.

In [None]:
# Box plotting for Credit amount for Target_1
BiVarPlot(data = Target_1,
          col_x='NAME_EDUCATION_TYPE',
          col_y='AMT_CREDIT',
          hue='NAME_FAMILY_STATUS',
          title='Credit amount vs Education Status')

> Married people with an Academic degree have a higher credit amount than all other education types, while the other family statuses have negligible credit. This suggests that married people with an Academic degree have more difficulty in paying back the loan.

> Most of the outliers are in 'Secondary / secondary special'.

In [None]:
# Box plotting for Income amount in logarithmic scale for Target_0
BiVarPlot(data = Target_0,
          col_x='NAME_EDUCATION_TYPE',
          col_y='AMT_INCOME_TOTAL',
          hue='NAME_FAMILY_STATUS',
          title='Income amount vs Education Status',
          scale='log')

>From the above boxplot, for the education type 'Higher education' the income amount is mostly equal across family statuses; the same goes for 'Secondary' education.

>'Academic degree' has fewer outliers, but its income amount is a little higher than that of 'Higher education'. 'Lower secondary' has a lower income amount than the others.



In [None]:
# Box plotting for Income amount in logarithmic scale for Target_1
BiVarPlot(data = Target_1,
          col_x='NAME_EDUCATION_TYPE',
          col_y='AMT_INCOME_TOTAL',
          hue='NAME_FAMILY_STATUS',
          title='Income amount vs Education Status',
          scale='log')

## 5. Read previous_application data

In [None]:
# Loading the previous_application dataset
prev_application = pd.read_csv('../input/loan-defaulter/previous_application.csv')

In [None]:
# display top 5 rows of prev_application dataframe
prev_application.head()

In [None]:
# Printing shape of the prev_application dataset
print(f'Shape of prev_application : {prev_application.shape}')

In [None]:
prev_application.info()

In [None]:
prev_application.describe()

In [None]:
# Column wise Null percentage in prev_application
null_percentage_prev = round(prev_application.isnull().sum()/prev_application.shape[0]*100, 2)
print(null_percentage_prev)

>Looks like there are some missing values present in some columns of the previous_application dataset

In [None]:
# Getting the list of columns with 35% or more null values
prev_colToDrop = list(null_percentage_prev[null_percentage_prev >= 35].index)

In [None]:
print(f'No. of columns to drop: {len(prev_colToDrop)}')
prev_colToDrop

In [None]:
# Dropping the columns from prev_application dataframe
prev_application.drop(prev_colToDrop, axis= 1, inplace = True)

In [None]:
# Rechecking column wise Null percentage in prev_application
null_percentage_prev = round(prev_application.isnull().sum()/prev_application.shape[0]*100, 2)
print(null_percentage_prev)
print(f'New shape of prev_application: {prev_application.shape}')

In [None]:
# Getting columns with missing values between 0% and 35%
missing_value_col_prev = list(null_percentage_prev[null_percentage_prev > 0].index)
missing_value_col_prev

In [None]:
round(prev_application.NAME_CASH_LOAN_PURPOSE.value_counts()/prev_application.shape[0]*100,2)

> The values 'XNA' and 'XAP' make up the majority; combined, they account for 96% of the rows.

In [None]:
# Dropping rows containing 'XNA' and 'XAP'
prev_application=prev_application.drop(prev_application[prev_application['NAME_CASH_LOAN_PURPOSE']=='XNA'].index)
prev_application=prev_application.drop(prev_application[prev_application['NAME_CASH_LOAN_PURPOSE']=='XAP'].index)

In [None]:
prev_application.shape

In [None]:
# Merging prev_application and app_data on SK_ID_CURR
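# Overlapping columns get '_' (from app_data) or 'o' (from prev_application) appended;
# suffixes must be passed as a tuple, a plain string raises a TypeError in recent pandas
merged_df = pd.merge(left = app_data, right=prev_application, how='inner', on='SK_ID_CURR', suffixes=('_', 'o'))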

In [None]:
merged_df.head()

In [None]:
# Renaming the col
merged_df = merged_df.rename({'NAME_CONTRACT_TYPE_' : 'NAME_CONTRACT_TYPE',
                         'AMT_CREDIT_':'AMT_CREDIT',
                         'AMT_ANNUITY_':'AMT_ANNUITY',
                         'WEEKDAY_APPR_PROCESS_START_' : 'WEEKDAY_APPR_PROCESS_START',
                         'HOUR_APPR_PROCESS_START_':'HOUR_APPR_PROCESS_START',
                         'NAME_CONTRACT_TYPEo':'NAME_CONTRACT_TYPE_PREV',
                         'AMT_CREDITo':'AMT_CREDIT_PREV',
                         'AMT_ANNUITYo':'AMT_ANNUITY_PREV',
                         'WEEKDAY_APPR_PROCESS_STARTo':'WEEKDAY_APPR_PROCESS_START_PREV',
                         'HOUR_APPR_PROCESS_STARTo':'HOUR_APPR_PROCESS_START_PREV'}, axis=1)
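
As an alternative sketch to merging and then renaming, the final names can be produced directly through `suffixes`. The two small frames are made up, and the `_PREV` suffix is chosen here to match the names used after the rename above.

```python
import pandas as pd

# Toy stand-ins for app_data and prev_application (values are made up)
current = pd.DataFrame({"SK_ID_CURR": [1, 2], "AMT_CREDIT": [500.0, 800.0]})
previous = pd.DataFrame({"SK_ID_CURR": [1, 1, 3], "AMT_CREDIT": [100.0, 150.0, 90.0]})

# Left columns keep their names, overlapping right columns get "_PREV"
merged = pd.merge(current, previous, how="inner", on="SK_ID_CURR",
                  suffixes=("", "_PREV"))
print(merged)
```

Note that an inner merge keeps only clients present in both frames and repeats the current application once per previous application, so the merged frame is not one row per client.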

In [None]:
merged_df

In [None]:
# Removing Unwanted columns
merged_df.drop(['WEEKDAY_APPR_PROCESS_START',
              'HOUR_APPR_PROCESS_START',
              'REG_REGION_NOT_LIVE_REGION', 
              'REG_REGION_NOT_WORK_REGION',
              'LIVE_REGION_NOT_WORK_REGION',
              'REG_CITY_NOT_LIVE_CITY',
              'REG_CITY_NOT_WORK_CITY', 
              'LIVE_CITY_NOT_WORK_CITY',
              'WEEKDAY_APPR_PROCESS_START_PREV',
              'HOUR_APPR_PROCESS_START_PREV', 
              'FLAG_LAST_APPL_PER_CONTRACT',
              'NFLAG_LAST_APPL_IN_DAY',
              'NAME_GOODS_CATEGORY',
              'SELLERPLACE_AREA',
              'NAME_SELLER_INDUSTRY'],axis=1,inplace=True)

In [None]:
merged_df.info()

In [None]:
# Function for Univariate Analysis on merged_df using count plot
def UniVariatePlot(dataframe, col_y, hue, title):
    sns.set_style('whitegrid')
    sns.set_context('talk')

    plt.figure(figsize=(15,25))
    plt.rcParams["axes.labelsize"] = 20
    plt.rcParams['axes.titlesize'] = 22
    plt.rcParams['axes.titlepad'] = 30
    
    plt.xscale('log')
    plt.title(title)
    # Use the dataframe passed in rather than the global merged_df
    sns.countplot(data = dataframe,
                  y= col_y,
                  order=dataframe[col_y].value_counts().index,
                  hue = hue,
                  palette='dark')

In [None]:
# Distribution of contract status in logarithmic scale
UniVariatePlot(dataframe=merged_df,
               col_y = 'NAME_CASH_LOAN_PURPOSE', 
               hue='NAME_CONTRACT_STATUS', 
               title= 'Distribution of contract status with purposes')

> Most loan rejections came from the purpose 'Repairs'.

> For education purposes, the numbers of approvals and rejections are equal.

> Paying other loans and buying a new car have significantly more rejections than approvals.
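
Since the plot uses a log scale, approval and refusal shares per purpose can be easier to compare in a table. The sketch below uses made-up rows; on the notebook's data the two columns would come from `merged_df`.

```python
import pandas as pd

# Toy stand-in for merged_df (values are made up)
toy = pd.DataFrame({
    "NAME_CASH_LOAN_PURPOSE": ["Repairs", "Repairs", "Repairs", "Repairs",
                               "Education", "Education"],
    "NAME_CONTRACT_STATUS":   ["Refused", "Refused", "Refused", "Approved",
                               "Approved", "Refused"],
})

# Row-normalised crosstab: each row sums to 1 across contract statuses
status_share = pd.crosstab(
    toy["NAME_CASH_LOAN_PURPOSE"],
    toy["NAME_CONTRACT_STATUS"],
    normalize="index",
)
print(status_share)
```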

In [None]:
# Distribution of Target
UniVariatePlot(dataframe=merged_df, 
               col_y='NAME_CASH_LOAN_PURPOSE', 
               hue='TARGET', 
               title='Distribution of loan purpose with Target')

> Loans with the purpose 'Repairs' face more difficulties with on-time payment.

> There are a few purposes where clients with no payment difficulty significantly outnumber clients facing payment difficulties: 'Buying a garage', 'Business development', 'Buying land', 'Buying a new car' and 'Education'. Hence we can focus on these purposes, for which clients have minimal payment difficulties.

In [None]:
UniVariatePlot(dataframe=merged_df, 
               col_y='OCCUPATION_TYPE', 
               hue='NAME_CONTRACT_STATUS', 
               title='Distribution of occupation type with contract status')

> Refused contracts outnumber approved ones in every occupation type.

> Laborers have the highest count, while IT staff have the lowest count for credit.

In [None]:
UniVariatePlot(dataframe=merged_df, 
               col_y='OCCUPATION_TYPE', 
               hue='TARGET', 
               title='Distribution of occupation type with Target')

> All occupation types have a higher count for target_0, i.e. no difficulty in paying the loan.

> IT staff show a significant positive difference between target_0 and target_1, which suggests that IT staff are a safer occupation type to lend to.

#### Performing Bivariate Analysis

In [None]:
# Function for Bivariate Analysis on merged_df using boxplot
def BiVariatePlot(dataframe, col_x, col_y, hue, title):
    sns.set_style('whitegrid')
    sns.set_context('talk')

    plt.figure(figsize=(16,12))
    plt.rcParams["axes.labelsize"] = 20
    plt.rcParams['axes.titlesize'] = 22
    plt.rcParams['axes.titlepad'] = 30
    
    plt.yscale('log')
    plt.title(title)
    plt.xticks(rotation=90)
    sns.boxplot(data = dataframe,
                x = col_x,
                y= col_y,
                hue = hue,
                orient='v')
    plt.show()

In [None]:
# Box plotting for Credit amount in logarithmic scale
BiVariatePlot(dataframe=merged_df,
              col_x='NAME_CASH_LOAN_PURPOSE',
              col_y='AMT_CREDIT_PREV',
              hue='NAME_INCOME_TYPE',
              title='Prev Credit amount vs Loan Purpose')

> Credit amounts for 'Buying a new car', 'Buying a holiday home / land', 'Buying a house or an annex', 'Buying a home' and 'Buying a garage' are higher than for other loan purposes. This implies that people take more credit for buying new things and assets.

> Clients with the income types Student and Pensioner have negligible credit.

> Commercial associates and State servants have applied for significant amounts of credit.

In [None]:
# Bar plot of previous credit amount vs housing type in logarithmic scale
plt.figure(figsize=(16,12))
plt.xticks(rotation=90)
plt.yscale('log')
sns.barplot(data =merged_df, y='AMT_CREDIT_PREV',hue='TARGET',x='NAME_HOUSING_TYPE')
plt.title('Prev Credit amount vs Housing type')
plt.show()

> Here 'Office apartment' and 'House/apartment' have higher credit for target type 0, i.e. clients with no payment difficulties, while 'Co-op apartment' has higher credit for target type 1, i.e. clients with payment difficulties.

> So the bank should avoid giving loans to clients with the Co-op apartment housing type and focus more on the Office apartment and House/apartment housing types.
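
By default `sns.barplot` shows the mean of the y variable per group, so the bars above can be cross-checked with a pivot table. The rows below are made up and the resulting numbers say nothing about the real data.

```python
import pandas as pd

# Toy stand-in for merged_df (values are made up)
toy = pd.DataFrame({
    "NAME_HOUSING_TYPE": ["Co-op apartment", "Co-op apartment",
                          "House / apartment", "House / apartment", "House / apartment"],
    "TARGET": [1, 0, 0, 0, 1],
    "AMT_CREDIT_PREV": [400_000.0, 200_000.0, 300_000.0, 500_000.0, 250_000.0],
})

# Mean previous credit per housing type and target value
mean_credit = toy.pivot_table(index="NAME_HOUSING_TYPE", columns="TARGET",
                              values="AMT_CREDIT_PREV", aggfunc="mean")
# Group sizes, to judge how much data each mean rests on
group_sizes = toy.groupby(["NAME_HOUSING_TYPE", "TARGET"]).size()

print(mean_credit)
print(group_sizes)
```

Reading the group sizes next to the means helps avoid drawing conclusions from housing types with very few rows.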

**RECOMMENDATION TO BANK**
1. Banks should focus more on the income types 'Student', 'Pensioner' and 'Businessman', and should avoid the 'Co-op apartment' housing type, for successful payments.

2. Banks should focus less on the income type 'Working', as it has the highest number of unsuccessful payments.

3. Loans with the purpose 'Repairs' also have a higher number of payments that are not made on time.

4. Get as many clients as possible from the housing type 'With parents', as they have the fewest unsuccessful payments.