           

# BUSINESS UNDERSTANDING 
    
Lending companies find it hard to approve loans for people with insufficient or non-existent credit history, and some consumers take advantage of this by defaulting on their loans. Suppose you work for a consumer finance company that specializes in lending various types of loans to urban customers. You have to use EDA to analyze the patterns present in the data, so that applicants capable of repaying the loan are not rejected.
When the company receives a loan application, it has to decide whether to approve the loan based on the applicant's profile. Two types of risk are associated with this decision:
1. If the applicant is likely to repay the loan, then not approving the loan results in a loss of business for the company.
2. If the applicant is not likely to repay the loan, i.e. he/she is likely to default, then approving the loan may lead to a financial loss for the company.

To reduce the risks associated with lending, we can perform a detailed Exploratory Data Analysis on the given datasets. This can reveal the patterns among defaulters and let us draw insights from them.
    



In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
# Importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
import itertools
from pandas.api.types import is_string_dtype
from pandas.api.types import is_numeric_dtype


    
# Loading Datasets for Analysis
There are 3 datasets given:
1. columns_description.csv: describes the columns in both datasets
2. application_data.csv: contains information on the current loan application and on whether the client had difficulty repaying the loan
3. previous_application.csv: contains information on the client's previous loan applications


In [None]:
df_columns = pd.read_csv('../input/loan-defaulter/columns_description.csv')
df_application = pd.read_csv('../input/loan-defaulter/application_data.csv')
df_prev_app = pd.read_csv('../input/loan-defaulter/previous_application.csv')

# Getting Information on the data

In [None]:
# Displays all the rows
pd.set_option("display.max_rows", None)
df_columns

In [None]:
# General information on the application dataset: number of rows, number of columns and column dtypes
df_application.info()

In [None]:
# General information on the previous application dataset: number of rows, number of columns and column dtypes
df_prev_app.info()


    
# Data Cleaning
* Data cleaning involves finding columns or rows with null values and either removing them or replacing the nulls with a suitable value. This makes the analysis more reliable.
* Whether a column is deleted depends on whether it adds value to our analysis.
* As a rule of thumb, removing columns with more than 30% missing data seems reasonable, though there is no fixed rule for it.
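
The thresholding idea above can be sketched on a toy frame. The column names, the values and the 30% cut-off below are illustrative assumptions, not taken from the loan data.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical columns standing in for a real dataset.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "mostly_missing": [np.nan, np.nan, np.nan, 4.0, np.nan],  # 80% null
    "some_missing": [1.0, np.nan, 3.0, 4.0, 5.0],             # 20% null
})

# Percentage of null values per column.
null_pct = toy.isnull().mean().mul(100)

# Columns whose null share exceeds the chosen threshold (assumed to be 30% here).
threshold = 30.0
to_drop = null_pct[null_pct > threshold].index.tolist()

# Drop them and keep the rest.
cleaned = toy.drop(columns=to_drop)
print(null_pct.round(1).to_dict())
print(to_drop, list(cleaned.columns))
```

A column just under the threshold still carries nulls, so it may need imputation or row-wise handling later.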
    


In [None]:
# GRAPHICAL REPRESENTATION OF NULL VALUES IN A COLUMN IN ANY GIVEN DATASET
# NULL VALUES IN APPLICATION DATASET
df_app_null = pd.DataFrame(100*df_application.isnull().sum()/len(df_application)).reset_index(0)
df_app_null.columns = ['COLUMNS', 'PERCENTAGE OF NULL VALUES']
fig,ax = plt.subplots()
fig.set_size_inches(20,5)
fig.tight_layout(pad=3)

plt.yticks(np.arange(0, max(df_app_null['PERCENTAGE OF NULL VALUES']+20), 3))
plt.xticks(rotation=90,fontsize=8)
plt.title('PERCENTAGE OF NULL VALUES IN GIVEN COLUMN')
apn = sns.pointplot( x="COLUMNS",y="PERCENTAGE OF NULL VALUES",data=df_app_null,color='indigo')

In [None]:
# GRAPHICAL REPRESENTATION OF NULL VALUES IN A COLUMN IN ANY GIVEN DATASET
# NULL VALUES IN PREVIOUS APPLICATION DATASET
df_prev_null = pd.DataFrame(100*df_prev_app.isnull().sum()/len(df_prev_app)).reset_index(0)
df_prev_null.columns = ['COLUMNS', 'PERCENTAGE OF NULL VALUES']
fig,ax = plt.subplots()
fig.set_size_inches(20,5)
fig.tight_layout(pad=3)
plt.yticks(np.arange(0, max(df_prev_null['PERCENTAGE OF NULL VALUES']+20), 5))
plt.xticks(rotation=90,fontsize=8)
plt.title('PERCENTAGE OF NULL VALUES IN GIVEN COLUMN')
apn = sns.pointplot( x="COLUMNS",y="PERCENTAGE OF NULL VALUES",data=df_prev_null,color='indigo')

# Finding and dropping columns with more than 40% (application data) or 30% (previous application data) null values

In [None]:
# FINDING COLUMNS IN APPLICATION DATASET WITH MORE THAN 40% NULL VALUES
df_null = df_application.isnull().sum()/len(df_application)*100
df_null = df_null[df_null.values>40.0]

# The index of the filtered Series already holds the names of the columns to drop
drop_c = df_null.index.tolist()

# Also dropping EXT_SOURCE_2 and EXT_SOURCE_3: they are normalised scores whose meaning is not known
drop_c.append('EXT_SOURCE_2')
drop_c.append('EXT_SOURCE_3')
# NO OF COLUMNS THAT WOULD BE DELETED
print(len(drop_c))


In [None]:
# DROPPING COLUMNS
df_application.drop(columns=drop_c,inplace=True)

In [None]:
#CHECKING FINAL VERSION OF DATA
df_application.info()

In [None]:
# FINDING COLUMNS IN PREVIOUS DATASET WITH MORE THAN 30% NULL VALUES
prev_null = df_prev_app.isnull().sum()/len(df_prev_app)*100
prev_null = prev_null[prev_null.values>30.0]

# The index of the filtered Series already holds the names of the columns to drop
drop_pc = prev_null.index.tolist()

print(len(drop_pc))
df_prev_app.drop(columns=drop_pc,inplace=True)

In [None]:
#CHECKING FINAL VERSION OF DATA
df_prev_app.info()


    
# DATA PREPROCESSING
* Some columns store a number of days instead of years, and the amount columns hold continuous values.
* For ease of analysis and visualization, we will convert the day columns to years and create bins for the continuous columns.
* We will also merge both datasets into a new one to look for further insights.
* Separate dataframes for defaulters and non-defaulters will also be made for analysis.
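
As a rough sketch of the two conversions described above, here is how floor division and `pd.cut` behave on a few made-up values. The day offsets and the shortened bin lists are hypothetical.

```python
import pandas as pd

# Hypothetical day offsets: negative values count days before the application.
days_birth = pd.Series([-9461, -16765, -19046])

# Floor division by -365 flips the sign and keeps whole years only.
age_years = days_birth // -365
print(age_years.tolist())

# Caveat: a large positive placeholder value would turn into a negative "year" count.
placeholder_years = pd.Series([365243]) // -365
print(placeholder_years.tolist())

# pd.cut assigns each value to a right-closed interval such as (20, 25].
age_bins = [15, 20, 25, 30, 35, 40, 45, 50, 55]
age_group = pd.cut(age_years, age_bins)
print(age_group.astype(str).tolist())

# Values outside the outermost bin edges fall into no bin and become NaN.
out_of_range = pd.cut(pd.Series([10000, 60000]), [25000, 50000, 75000])
print(out_of_range.isna().tolist())
```

Because values below the first bin edge become NaN, it is worth checking how many rows of each binned column end up without a bin.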
    
   

In [None]:
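# Convert days to whole years (the values are negative day offsets, hence the division by -365)
# Note: the columns keep their DAYS_ names but hold years from here on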
df_application['DAYS_ID_PUBLISH']= df_application['DAYS_ID_PUBLISH']//-365
df_application['DAYS_EMPLOYED']= df_application['DAYS_EMPLOYED']//-365
df_application['DAYS_LAST_PHONE_CHANGE']= df_application['DAYS_LAST_PHONE_CHANGE']//-365
df_application['DAYS_BIRTH']= df_application['DAYS_BIRTH']//-365

In [None]:
# CREATING BINS; SOME NUMERICAL COLUMNS ARE DIVIDED INTO RANGES BASED ON THESE BINS
bins = [25000,   50000,   75000,  100000,  125000,  150000,
        175000,  200000,  225000,  250000,  275000,  300000,  325000, 
        350000,  375000,  400000,  425000,  450000,  475000,  500000, 10000000000]
df_application['AMT_INCOME'] = pd.cut(df_application['AMT_INCOME_TOTAL'], bins)
df_application['AMT_CRED'] = pd.cut(df_application['AMT_CREDIT'], bins)
df_application['AMT_GOODS'] = pd.cut(df_application['AMT_GOODS_PRICE'], bins)
df_application['AMT_ANN'] = pd.cut(df_application['AMT_ANNUITY'], bins)

In [None]:
# TARGET AND THE BINNED COLUMNS ARE CONVERTED TO STRING TYPE FOR EASIER VISUALIZATION
df_application['TARGET'] = df_application.TARGET.astype(str)
df_application['AMT_INCOME'] = df_application.AMT_INCOME.astype(str)
df_application['AMT_CRED'] = df_application.AMT_CRED.astype(str)
df_application['AMT_GOODS'] = df_application.AMT_GOODS.astype(str)
df_application['AMT_ANN'] = df_application.AMT_ANN.astype(str)

In [None]:
# CREATING BINS FOR AGE
age_bin = [15,20,25,30,35,40,45,50,55,60,65,70,75,80,85]
df_application['AGE_YEARS'] = pd.cut(df_application['DAYS_BIRTH'], age_bin)
df_application['AGE_YEARS'] = df_application['AGE_YEARS'].astype(str)

In [None]:
# CREATING NEW DATAFRAME BY MERGING APPLICATION AND PREVIOUS DATA TO FIND INSIGHTS IF ANY
df_merge = pd.merge(df_prev_app, df_application, how='inner', on = 'SK_ID_CURR')

In [None]:
# MAKING SEPARATE DATAFRAME FOR DEFAULTER
df_default = df_application[df_application['TARGET']== '1']
df_default.info()

In [None]:
# MAKING SEPARATE DATAFRAME FOR NONDEFAULTER
df_nondefault = df_application[df_application['TARGET']== '0']
df_nondefault.info()


    
# Univariate Analysis
* Univariate analysis looks at one column at a time, showing its distribution as a histogram or bar graph.
* Here I'll run a loop to get an overview of all the columns.
* Selected columns are then analyzed in more detail.
* In the application data, the 'TARGET' variable tells us whether the given ID has defaulted or not. Considering it the most important feature, we will draw our conclusions based on it.
    
  

In [None]:
# PLOTTING DISTRIBUTION OF TARGET VARIABLE
fig,ax= plt.subplots(2,1)
fig.set_size_inches(8, 10)
fig.tight_layout(pad=6)

target_df = df_application['TARGET'].value_counts(normalize=True).rename().mul(100).reset_index()
bar_t = target_df.plot(ax=ax[0],kind='bar',colormap='tab10',title='PERCENTAGE OF TARGETS')
bar_t.set_ylabel('PERCENT')
bar_t.set_xlabel('TARGET')
bar_t.get_legend().remove()
for p in bar_t.patches:
    bar_t.annotate(f'{p.get_height():.2f}%',
                   (p.get_x() + p.get_width() / 2,
                    p.get_height()), ha='center', va='center',
                   size=10, xytext=(0, 5),
                   textcoords='offset points')

bar_t2 =  df_application['TARGET'].value_counts().plot(ax=ax[1], kind='bar',colormap='tab10',title='TOTAL NUMBER OF TARGETS')
bar_t2.set_ylabel('COUNT')
bar_t2.set_xlabel('TARGET')
for p in bar_t2.patches:
    bar_t2.annotate(str(p.get_height()), (p.get_x() + p.get_width() / 2,
                    p.get_height()), ha='center', va='center',
                   size=10, xytext=(0, 5),
                   textcoords='offset points')



#### KEYPOINTS: 


* Most of the clients, i.e. 91.93%, are non-defaulters, while only about 8.07% have defaulted on repayment.

* The data is therefore highly imbalanced.

* We will have to investigate further and study the other features in relation to TARGET to find the pattern of defaulters.
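
To make the imbalance concrete, a small sketch on a made-up target column shows the class shares and the number of non-defaulters per defaulter. The 23-to-2 split below is hypothetical and only roughly mirrors the proportions above.

```python
import pandas as pd

# Hypothetical target column: 1 = defaulted, 0 = repaid.
target = pd.Series([0] * 23 + [1] * 2, name="TARGET")

# Share of each class in percent.
share = target.value_counts(normalize=True).mul(100)

# How many non-defaulters there are for every defaulter.
imbalance_ratio = (target == 0).sum() / (target == 1).sum()

print(share.round(2).to_dict())
print(f"non-defaulters per defaulter: {imbalance_ratio:.1f}")
```

With a ratio this skewed, raw counts of defaulters per category are dominated by category size, which is why the later cells look at percentages within each category.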

  
   

In [None]:
# BIRD'S EYE VIEW OF DATA, LOOPING THROUGH COLUMNS AND CREATING GRAPHS
for column in df_application:
    if is_numeric_dtype(df_application[column]):
        # displot is a figure-level function and opens its own figure
        sns.displot(data=df_application, x=column,hue='TARGET', kind='kde',palette="viridis")
    elif is_string_dtype(df_application[column]):
        plt.figure(column)
        sns.countplot(data=df_application, x=column, hue='TARGET',palette='viridis')
        plt.yscale('log')
    else:
        continue
    # Title and tick rotation are set after plotting so that they land on the new axes
    plt.title(column)
    plt.xticks(rotation=90)


    
# Elaborating few graphs with insights drawn from them    
   

In [None]:
# PLOTTING CONTRACT_TYPE
fig,ax = plt.subplots()
fig.set_size_inches(15,6)
sns.countplot(data=df_application,x='NAME_CONTRACT_TYPE',hue='TARGET',palette='winter',edgecolor='black',
             order=df_application['NAME_CONTRACT_TYPE'].value_counts().index)
plt.title('COMPARING CONTRACT TYPE IN TARGET VARIABLE')
plt.ylabel('COUNT')
plt.yscale('log')

In [None]:
# STATISTICAL ANALYSIS
print(df_application.groupby(['NAME_CONTRACT_TYPE']).TARGET.value_counts(normalize=True).mul(100))
df_application.NAME_CONTRACT_TYPE.value_counts(normalize=True).mul(100)



#### KEYPOINTS: 
* 90.48% of clients took cash loans, of which 8.3% were defaulters.
* 9.52% took revolving loans, of which 5.47% were defaulters.
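
Each keypoint pairs two different percentages: the share of a category among all clients, and the default rate within that category. A minimal sketch on made-up data (the `contract` column and its values are hypothetical) shows how the two numbers can be computed side by side:

```python
import pandas as pd

# Hypothetical sample: 8 cash loans (2 defaulted) and 2 revolving loans (1 defaulted).
toy = pd.DataFrame({
    "contract": ["Cash"] * 8 + ["Revolving"] * 2,
    "TARGET": ["0", "0", "0", "0", "0", "0", "1", "1", "0", "1"],
})

# Share of each category among all clients, in percent.
category_share = toy["contract"].value_counts(normalize=True).mul(100)

# Default rate within each category: each row of the table sums to 100.
default_rate = pd.crosstab(toy["contract"], toy["TARGET"], normalize="index").mul(100)

print(category_share.to_dict())
print(default_rate)
```

`normalize="index"` is what makes the percentages relative to each category rather than to the whole dataset.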
  
  

In [None]:
# PLOTTING CODE_GENDER
fig,ax = plt.subplots()
fig.set_size_inches(15,6)
sns.countplot(data=df_application,x='CODE_GENDER',hue='TARGET',palette='winter',edgecolor='black',
             order=df_application['CODE_GENDER'].value_counts().index)
plt.title('COMPARING GENDER IN TARGET VARIABLE')
plt.ylabel('COUNT')
plt.yscale('log')

In [None]:
# STATISTICAL ANALYSIS
print(df_application.groupby(['CODE_GENDER']).TARGET.value_counts(normalize=True).mul(100))
df_application.CODE_GENDER.value_counts(normalize=True).mul(100)



#### KEYPOINTS: 
* 65.83% of clients were female, of which 6.99% defaulted.
* 34.16% were male, of which 10.14% defaulted.

  
    
  

In [None]:
# PLOTTING CNT_CHILDREN
fig,ax = plt.subplots()
fig.set_size_inches(15,6)
sns.countplot(data=df_application,x='CNT_CHILDREN',hue='TARGET',palette='winter',edgecolor='black',
             order=df_application['CNT_CHILDREN'].value_counts().index)
plt.title('COMPARING COUNT OF CHILDREN IN TARGET VARIABLE')
plt.ylabel('COUNT')
plt.yscale('log')

In [None]:
# STATISTICAL ANALYSIS
print(df_application.groupby(['CNT_CHILDREN']).TARGET.value_counts(normalize=True).mul(100))
df_application.CNT_CHILDREN.value_counts(normalize=True).mul(100)



#### KEYPOINTS: 
* 70.03% of clients had no children; 7.71% of them were defaulters.
* 28.57% of clients with 6 children had defaulted.
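
A high default rate in a rare category can rest on very few clients, so it helps to look at the rate together with the group size. The sketch below uses made-up data, and the minimum group size of 5 is an arbitrary assumption:

```python
import pandas as pd

# Hypothetical sample: 10 clients without children, 2 clients with 6 children.
toy = pd.DataFrame({
    "children": [0] * 10 + [6] * 2,
    "defaulted": [0] * 9 + [1] + [1, 0],
})

# Default rate (in percent) and number of clients per group.
summary = toy.groupby("children")["defaulted"].agg(rate="mean", n="size")
summary["rate"] = summary["rate"].mul(100)

# Keep only groups with enough clients to trust the rate (assumed cut-off).
min_n = 5
reliable = summary[summary["n"] >= min_n]

print(summary)
print(reliable.index.tolist())
```

Here the group with 6 children shows a 50% rate, but it comes from only two clients and is filtered out.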
  

In [None]:
# PLOTTING FAMILY_STATUS
fig,ax = plt.subplots()
fig.set_size_inches(12,6)
sns.countplot(data=df_application,x='NAME_FAMILY_STATUS',hue='TARGET',palette='winter',edgecolor='black',
             order=df_application['NAME_FAMILY_STATUS'].value_counts().index)
plt.title('COMPARING FAMILY STATUS IN TARGET VARIABLE')
plt.ylabel('COUNT')
plt.yscale('log')

In [None]:
# STATISTICAL ANALYSIS
print(df_application.groupby(['NAME_FAMILY_STATUS']).TARGET.value_counts(normalize=True).mul(100))
df_application.NAME_FAMILY_STATUS.value_counts(normalize=True).mul(100)



#### KEYPOINTS: 
* 63.88% were married, of which 7.56% defaulted.
* 9.94% of those in a civil marriage and 9.81% of single/not married clients had defaulted.
* 5.8% of widows had defaulted.
 

In [None]:
# PLOTTING SUITE_TYPE
fig,ax = plt.subplots()
fig.set_size_inches(12,6)
sns.countplot(data=df_application,x='NAME_TYPE_SUITE',hue='TARGET',palette='winter',edgecolor='black')
plt.title('COMPARING SUITE TYPE IN TARGET VARIABLE')
plt.ylabel('COUNT')
plt.yscale('log')

In [None]:
# STATISTICAL ANALYSIS
print(df_application.groupby(['NAME_TYPE_SUITE']).TARGET.value_counts(normalize=True).mul(100))
df_application.NAME_TYPE_SUITE.value_counts(normalize=True).mul(100)



#### KEYPOINTS: 
* 81.16% were unaccompanied while applying for the loan, of which 8.18% had defaulted.
* 9.83% of those with suite type Other_B had also defaulted.

  

In [None]:
# PLOTTING INCOME_TYPE
fig,ax = plt.subplots()
fig.set_size_inches(13,6)
sns.countplot(data=df_application,x='NAME_INCOME_TYPE',hue='TARGET',palette='winter',edgecolor='black',
             order=df_application['NAME_INCOME_TYPE'].value_counts().index)
plt.title('COMPARING INCOME TYPE IN TARGET VARIABLE')
plt.ylabel('COUNT')
plt.yscale('log')

In [None]:
# STATISTICAL ANALYSIS
print(df_application.groupby(['NAME_INCOME_TYPE']).TARGET.value_counts(normalize=True).mul(100))
df_application.NAME_INCOME_TYPE.value_counts(normalize=True).mul(100)



#### KEYPOINTS: 
* 51.63% of clients have the Working income type, of which 9.59% have defaulted.
* 40% of clients on Maternity leave and 36.36% of Unemployed clients had defaulted.
* The Student and business income types do not seem to default.

   

In [None]:
# PLOTTING EDUCATION_TYPE
fig,ax = plt.subplots()
fig.set_size_inches(12,6)
sns.countplot(data=df_application,x='NAME_EDUCATION_TYPE',hue='TARGET',palette='winter',edgecolor='black')
plt.title('COMPARING EDUCATION TYPE IN TARGET VARIABLE')
plt.ylabel('COUNT')
plt.yscale('log')

In [None]:
# STATISTICAL ANALYSIS
print(df_application.groupby(['NAME_EDUCATION_TYPE']).TARGET.value_counts(normalize=True).mul(100))
df_application.NAME_EDUCATION_TYPE.value_counts(normalize=True).mul(100)



#### KEYPOINTS: 
* 71.01% have Secondary / secondary special education, and 8.94% of them have defaulted; 10.92% of people with Lower secondary education have defaulted.
* 5.35% of those with Higher education and 1.83% of those with an Academic degree had defaulted.
   

In [None]:
# PLOTTING HOUSING_TYPE
fig,ax = plt.subplots()
fig.set_size_inches(12,6)
sns.countplot(data=df_application,x='NAME_HOUSING_TYPE',hue='TARGET',palette='winter',edgecolor='black',
             order=df_application['NAME_HOUSING_TYPE'].value_counts().index)
plt.title('COMPARING HOUSING TYPE IN TARGET VARIABLE')
plt.ylabel('COUNT')
plt.yscale('log')

In [None]:
# STATISTICAL ANALYSIS
print(df_application.groupby(['NAME_HOUSING_TYPE']).TARGET.value_counts(normalize=True).mul(100))
df_application.NAME_HOUSING_TYPE.value_counts(normalize=True).mul(100)


#### KEYPOINTS: 
* Of the 88.73% who had a house/apartment, 7.8% defaulted.
* Approximately 12% of those who lived in a rented apartment or with parents defaulted: 12.31% for Rented apartment and 11.7% for With parents.

  

In [None]:
# PLOTTING OCCUPATION_TYPE
fig,ax = plt.subplots()
fig.set_size_inches(20,6)
sns.countplot(data=df_application,x='OCCUPATION_TYPE',hue='TARGET',palette='winter',edgecolor='black',
             order=df_application['OCCUPATION_TYPE'].value_counts().index)
plt.title('COMPARING OCCUPATION TYPE IN TARGET VARIABLE')
plt.xticks(ha='center', rotation=77,fontsize=15)
plt.ylabel('COUNT')
plt.yscale('log')

In [None]:
# STATISTICAL ANALYSIS
print(df_application.groupby(['OCCUPATION_TYPE']).TARGET.value_counts(normalize=True).mul(100))
df_application.OCCUPATION_TYPE.value_counts(normalize=True).mul(100)



#### KEYPOINTS: 
* Labourers constitute 26.14% of clients, of which 10.58% defaulted. 17.15% of Low-skill Labourers have also defaulted.
  

In [None]:
# PLOTTING ORGANIZATION_TYPE
fig,ax = plt.subplots()
fig.set_size_inches(16,16)
sns.countplot(data=df_application,y='ORGANIZATION_TYPE',hue='TARGET',palette='winter',edgecolor='black',
             order=df_application['ORGANIZATION_TYPE'].value_counts().index)
plt.title('COMPARING ORGANIZATION TYPE IN TARGET VARIABLE')
plt.yticks(fontsize=11)
plt.xlabel('COUNT')
plt.xscale('log')

In [None]:
# STATISTICAL ANALYSIS
print(df_application.groupby(['ORGANIZATION_TYPE']).TARGET.value_counts(normalize=True).mul(100))
df_application.ORGANIZATION_TYPE.value_counts(normalize=True).mul(100)



#### KEYPOINTS: 
* Business Type 3 constitutes 22.11% of clients, of which 9.29% had defaulted.
* Transport Type 3 (15.75%), Industry Type 13 (13.43%) and Industry Type 8 (12.5%) had among the highest default rates.
  

In [None]:
# PLOTTING PROCESS_START HOUR
fig,ax = plt.subplots()
fig.set_size_inches(12,6)
sns.countplot(data=df_application,x='HOUR_APPR_PROCESS_START',hue='TARGET',palette='winter')
plt.title('HOUR PROCESS START TIME')
plt.yscale('log')
plt.ylabel('COUNT')

In [None]:
# STATISTICAL ANALYSIS
print(df_application.groupby(['HOUR_APPR_PROCESS_START']).TARGET.value_counts(normalize=True).mul(100))
df_application.HOUR_APPR_PROCESS_START.value_counts(normalize=True).mul(100)


#### KEYPOINTS: 
* Most applications are started between 10 and 12 hrs.
* The default rate is highest for applications started at 0, 5, 6, 7 and 23 hrs.



    
# Bivariate Analysis
* Bivariate analysis shows the relation between two variables, which can be categorical, numerical or one of each.
* We will analyze a few important features here.
* We will mainly study defaulters for insights.
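
Besides looking at defaulters only, a two-way table of default rates is another way to relate two categorical variables to the target. This is a sketch on made-up data with hypothetical column names, not the approach used in the cells below:

```python
import pandas as pd

# Hypothetical sample with two categorical features and a 0/1 default flag.
toy = pd.DataFrame({
    "gender": ["F", "F", "F", "F", "M", "M", "M", "M"],
    "contract": ["Cash", "Cash", "Revolving", "Revolving",
                 "Cash", "Cash", "Revolving", "Revolving"],
    "defaulted": [0, 1, 0, 0, 1, 1, 0, 1],
})

# Mean of the 0/1 flag per (gender, contract) cell, expressed in percent.
rate = toy.pivot_table(index="gender", columns="contract",
                       values="defaulted", aggfunc="mean").mul(100)
print(rate)
```

Each cell is the default rate for that combination, so differences between rows or columns are not affected by how many clients each group contains.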
   

# GENDER ANALYSIS

In [None]:
# DISTRIBUTION OF GENDER IN CONTRACT TYPE
fig, axes = plt.subplots()
fig.set_size_inches(10,5)
bi_chart = df_default['CODE_GENDER'].groupby(df_default['NAME_CONTRACT_TYPE']).value_counts(normalize=True).mul(100).rename('PERCENT').reset_index()
bar_chart = sns.barplot(x='NAME_CONTRACT_TYPE', y='PERCENT', hue='CODE_GENDER', palette='plasma', data=bi_chart)


In [None]:
# STATISTICAL ANALYSIS
print(df_default.groupby(['CODE_GENDER']).NAME_CONTRACT_TYPE.value_counts(normalize=True).mul(100))
df_default.groupby(['NAME_CONTRACT_TYPE']).CODE_GENDER.value_counts(normalize=True).mul(100)

In [None]:
# DISTRIBUTION OF GENDER ACROSS AGE (DAYS_BIRTH NOW HOLDS AGE IN YEARS)
ax = sns.displot(data=df_default, x='DAYS_BIRTH', hue='CODE_GENDER', kind='kde',height=5,aspect=2,palette='rocket',fill=True);
plt.xticks(np.arange(0, 82, 2));
plt.title('DISTRIBUTION OF GENDER ACROSS AGE')


In [None]:
# STATISTICAL ANALYSIS
print(df_default.groupby(['CODE_GENDER']).AGE_YEARS.value_counts(normalize=True).mul(100))
df_default.groupby(['AGE_YEARS']).CODE_GENDER.value_counts(normalize=True).mul(100)

In [None]:
# DISTRIBUTION OF GENDER BY AMT_CRED
fig, axes = plt.subplots()
fig.set_size_inches(10,5)
bar_chart = sns.countplot(x='AMT_CRED', hue='CODE_GENDER', data=df_default, palette='plasma',
                         order = df_default['AMT_CRED'].value_counts().index);
plt.yscale('log')
plt.ylabel('COUNT')
plt.title('DISTRIBUTION OF GENDER ACROSS CREDIT AMOUNT')
plt.xticks(rotation=90);

In [None]:
# STATISTICAL ANALYSIS
print(df_default.groupby(['CODE_GENDER']).AMT_CRED.value_counts(normalize=True).mul(100))
df_default.groupby(['AMT_CRED']).CODE_GENDER.value_counts(normalize=True).mul(100)



#### KEYPOINTS: 
* Among defaulters, 56.53% of cash loans and 65.02% of revolving loans were taken by females.
* Females outnumbered males among defaulters in every credit amount range.
* Most female defaulters belonged to the 30-40 age group and most male defaulters to the 25-35 age group. The female count was also higher than the male count in the older age groups.



# INCOME TYPE ANALYSIS

In [None]:
# DISTRIBUTION OF INCOME TYPE IN CONTRACT TYPE

fig, axes = plt.subplots()
fig.set_size_inches(10,5)
bar_chart = sns.countplot(x='NAME_INCOME_TYPE', hue='NAME_CONTRACT_TYPE', data=df_default, palette='plasma',
                         order = df_default['NAME_INCOME_TYPE'].value_counts().index)
bar_chart.legend(bbox_to_anchor=(1, 1), loc='upper left',title = 'NAME_CONTRACT_TYPE')
plt.yscale('log')
plt.title('DISTRIBUTION OF INCOME TYPE ACROSS CONTRACT TYPE')
plt.ylabel('COUNT')
plt.xticks(rotation=90);

In [None]:
# STATISTICAL ANALYSIS
print(df_default.groupby(['NAME_INCOME_TYPE']).NAME_CONTRACT_TYPE.value_counts(normalize=True).mul(100))
df_default.groupby(['NAME_CONTRACT_TYPE']).NAME_INCOME_TYPE.value_counts(normalize=True).mul(100)

In [None]:
# DISTRIBUTION OF INCOME TYPE IN CREDIT AMOUNT
plt.figure(figsize=(18,6))
plt.xticks(rotation=90)
plt.title('COMPARING INCOME TYPE WRT CREDIT AMOUNT')
sns.boxplot(data=df_application,y= 'AMT_CREDIT', x='NAME_INCOME_TYPE', hue='TARGET', palette='husl',notch=True,
            order=df_application['NAME_INCOME_TYPE'].value_counts().index);

In [None]:
# DISTRIBUTION OF INCOME TYPE IN DIFFERENT AGE GROUPS
plt.figure(figsize=(18,6))
plt.xticks(rotation=90)
plt.title('DISTRIBUTION OF INCOME TYPE IN AGE GROUPS')
sns.boxplot(data=df_application,y= 'DAYS_BIRTH', x='NAME_INCOME_TYPE', hue='TARGET', palette='husl',notch=True,
           order=df_application['NAME_INCOME_TYPE'].value_counts().index);

In [None]:
# DISTRIBUTION OF INCOME TYPE WRT DAYS OF REGISTRATION
plt.figure(figsize=(18,6))
plt.xticks(rotation=90)
plt.title('DISTRIBUTION OF INCOME TYPE WRT DAYS OF REGISTRATION')

sns.boxplot(data=df_application,y= 'DAYS_REGISTRATION', x='NAME_INCOME_TYPE', hue='TARGET', palette='husl',notch=True,
           order=df_application['NAME_INCOME_TYPE'].value_counts().index);


    
#### KEYPOINTS:

* Most defaulters took cash loans. Among clients on maternity leave, those who took cash loans and were aged about 35-40 defaulted.

* Unemployed clients who took cash loans defaulted more, and no unemployed defaulters are observed for revolving loans.

* Among clients with maternity leave as income type, the defaulters had the higher credit amounts.

* Clients with less time between the application and their last change of registration defaulted more.
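
The box plots above are read mostly by comparing medians and quartiles between defaulters and non-defaulters. As a sketch, the same numbers can be printed directly; the data and the `income_type` and `credit` column names are made up for illustration:

```python
import pandas as pd

# Hypothetical sample: credit amounts by income type and target.
toy = pd.DataFrame({
    "income_type": ["Working"] * 4 + ["Pensioner"] * 4,
    "TARGET": ["0", "0", "1", "1"] * 2,
    "credit": [100, 300, 200, 400, 150, 250, 500, 700],
})

# Quartiles per (income type, target) group: the numbers a box plot draws.
box_stats = (toy.groupby(["income_type", "TARGET"])["credit"]
                .describe()[["25%", "50%", "75%"]])
print(box_stats)
```

Printing the quartiles next to the plot makes it easier to state a difference between the two target groups in numbers rather than by eye.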


# EDUCATION TYPE ANALYSIS

In [None]:
# DISTRIBUTION OF OCCUPATION TYPE IN EDUCATION TYPE
fig, axes = plt.subplots()
fig.set_size_inches(10,5)
b_e=sns.countplot(x='NAME_EDUCATION_TYPE', hue='OCCUPATION_TYPE', data=df_default, palette='plasma',
                         order = df_default['NAME_EDUCATION_TYPE'].value_counts().index)
b_e.legend(bbox_to_anchor=(1, 1), loc='upper left',title = 'OCCUPATION_TYPE')
plt.ylabel('COUNT')
plt.title('DISTRIBUTION OF OCCUPATION TYPE ACROSS EDUCATION TYPE')
plt.yscale('log')
plt.xticks(rotation=45,ha='center');

In [None]:
# STATISTICAL ANALYSIS
print(df_default.groupby(['NAME_EDUCATION_TYPE']).OCCUPATION_TYPE.value_counts(normalize=True).mul(100))
df_default.groupby(['OCCUPATION_TYPE']).NAME_EDUCATION_TYPE.value_counts(normalize=True).mul(100)

In [None]:
# DISTRIBUTION OF EDUCATION TYPE IN DIFFERENT AGE GROUPS
plt.figure(figsize=(18,6))
plt.xticks(rotation=90)
plt.title('DISTRIBUTION OF EDUCATION TYPE WRT AGE')

sns.boxplot(data=df_application,y= 'DAYS_BIRTH', x='NAME_EDUCATION_TYPE', hue='TARGET', palette='husl',notch=True,
           order=df_application['NAME_EDUCATION_TYPE'].value_counts().index);

In [None]:
# STATISTICAL ANALYSIS
print(df_default.groupby(['NAME_EDUCATION_TYPE']).AGE_YEARS.value_counts(normalize=True).mul(100))
df_default.groupby(['AGE_YEARS']).NAME_EDUCATION_TYPE.value_counts(normalize=True).mul(100)


    
#### KEYPOINTS:

* Most defaulters had Secondary / secondary special education, and the fewest had an Academic degree.

* Among defaulters with Lower secondary education, there were no IT staff, Private service staff, Realty agents or Secretaries.

* The only defaulters with an Academic degree were accountants and sales staff. Their age can also be narrowed down to approximately 34-35 years.



# SUITE TYPE ANALYSIS

In [None]:
# DISTRIBUTION OF OCCUPATION TYPE IN SUITE TYPE
fig, axes = plt.subplots()
fig.set_size_inches(10,5)
b_e=sns.countplot(x='NAME_TYPE_SUITE', hue='OCCUPATION_TYPE', data=df_default, palette='plasma',
                         order = df_default['NAME_TYPE_SUITE'].value_counts().index)
b_e.legend(bbox_to_anchor=(1, 1), loc='upper left',title = 'OCCUPATION_TYPE')
plt.ylabel('COUNT')
plt.title('DISTRIBUTION OF OCCUPATION TYPE ACROSS SUITE TYPE')
plt.yscale('log')
plt.xticks(rotation=45,ha='center');

In [None]:
# STATISTICAL ANALYSIS
print(df_default.groupby(['NAME_TYPE_SUITE']).OCCUPATION_TYPE.value_counts(normalize=True).mul(100))
df_default.groupby(['OCCUPATION_TYPE']).NAME_TYPE_SUITE.value_counts(normalize=True).mul(100)

In [None]:
# DISTRIBUTION OF SUITE TYPE WRT LAST PHONE CHANGE
plt.figure(figsize=(18,6))
plt.xticks(rotation=90)
plt.title('DISTRIBUTION OF SUITE TYPE WRT LAST PHONE CHANGE')

sns.boxplot(data=df_application,y= 'DAYS_LAST_PHONE_CHANGE', x='NAME_TYPE_SUITE', hue='TARGET', palette='husl',notch=True,
           order=df_application['NAME_TYPE_SUITE'].value_counts().index);

In [None]:
# DISTRIBUTION OF SUITE TYPE WRT PROCESS START HOUR
plt.figure(figsize=(18,6))
plt.xticks(rotation=90)
plt.title('DISTRIBUTION OF SUITE TYPE WRT PROCESS START HOUR')

sns.boxplot(data=df_application,y= 'HOUR_APPR_PROCESS_START', x='NAME_TYPE_SUITE', hue='TARGET', palette='husl',notch=True,
           order=df_application['NAME_TYPE_SUITE'].value_counts().index);


    
#### KEYPOINTS:

* Among defaulters, those with suite type Spouse/partner never had IT staff as occupation.
* Defaulters with suite type Children never had the occupation Private service staff, Realty agents, Secretaries, Waiters/barmen staff, IT staff or HR staff.
* With suite type Other_B, no defaulters were Accountants, Realty agents, Secretaries or IT staff.
* Realty agents with suite type Group of people always defaulted. Defaulters with suite type Group of people were never IT staff, High skill tech staff, Accountants, Waiters/barmen staff or HR staff.
* Defaulters with suite type Other_B, Other_A and Group of people had changed their phone close to the application date.
* Most defaulters with suite type Group of people had applied for the loan at around 12 hrs.


# COMPARING THE AMOUNT OF GOODS AND CREDIT TAKEN IN THE AMOUNT RANGE

In [None]:
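# COUNT OF CLIENTS PER AMOUNT RANGE FOR INCOME, CREDIT AND GOODS PRICE, SPLIT BY TARGET
# (all three panels use the ordering of the AMT_CRED ranges so that the x-axes line up)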
fig, ax = plt.subplots(3,1,sharex=True, sharey=True)

fig.set_size_inches(20, 15)
fig.tight_layout(pad=4)
fig.suptitle('COMPARING INCOME RANGE, CREDIT AMOUNT AND AMOUNT OF GOODS', fontsize=20)
plt.xticks(rotation=77, ha='center')
plt.ylabel('COUNT')

ax[0]=sns.countplot(data = df_application, x= 'AMT_INCOME', order=df_application['AMT_CRED'].value_counts().index,hue = 'TARGET',palette='hot',ax=ax[0])
ax[1]=sns.countplot(data = df_application, x= 'AMT_CRED', order=df_application['AMT_CRED'].value_counts().index,
              hue = 'TARGET',palette='rocket', ax=ax[1])
ax[2]=sns.countplot(data = df_application, x= 'AMT_GOODS', order=df_application['AMT_CRED'].value_counts().index,
              hue = 'TARGET',palette='winter', ax=ax[2])
ax[0].set(ylabel="COUNT")
ax[1].set(ylabel="COUNT")
ax[2].set(ylabel="COUNT")
plt.yscale('log')

In [None]:
print(df_default.groupby(['TARGET'])['AMT_INCOME'].value_counts(normalize=True).mul(100))
print(df_default.groupby(['TARGET'])['AMT_CRED'].value_counts(normalize=True).mul(100))
print(df_default.groupby(['TARGET'])['AMT_GOODS'].value_counts(normalize=True).mul(100))


    
#### KEYPOINTS:

* About 33.58% of defaulters had applied for a loan with a goods price in the range (500000, 10000000000], but 49.22% had taken a credit amount in the same range.

* This suggests that for roughly 15.64% of defaulters the credit amount fell in a higher range than the goods price.

* A similar disparity can be observed in the other price ranges as well.
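
Comparing the shares of two binned columns only hints at how often credit exceeds the goods price. A more direct check compares the two amounts row by row. The sketch below uses the column names from the application data, but the values are made up:

```python
import pandas as pd

# Hypothetical amounts; one row has no goods price recorded.
toy = pd.DataFrame({
    "AMT_CREDIT": [600000.0, 450000.0, 520000.0, 300000.0],
    "AMT_GOODS_PRICE": [550000.0, 450000.0, 480000.0, None],
})

# Only rows where both amounts are known can be compared.
valid = toy.dropna(subset=["AMT_CREDIT", "AMT_GOODS_PRICE"])

# Share of comparable rows in which the credit is larger than the goods price.
credit_exceeds = valid["AMT_CREDIT"] > valid["AMT_GOODS_PRICE"]
pct_credit_exceeds = credit_exceeds.mean() * 100
print(f"{pct_credit_exceeds:.2f}% of comparable rows have credit above goods price")
```

Applied to `df_default`, the same comparison would give the actual share of defaulters whose credit exceeded the goods price, independent of the chosen bins.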



# ANALYZING PREVIOUS DATASET

# CONTRACT TYPE ANALYSIS

In [None]:
# DESCRIBING CONTRACT TYPE UNDER CONTRACT STATUS
fig, axes = plt.subplots()
fig.set_size_inches(10,5)
c_prev =sns.countplot(x='NAME_CONTRACT_TYPE', hue='NAME_CONTRACT_STATUS', data=df_prev_app, palette='hot',
                  order = df_prev_app['NAME_CONTRACT_TYPE'].value_counts().index)
c_prev.legend(bbox_to_anchor=(1, 1), loc='upper left',title = 'NAME_CONTRACT_STATUS')
plt.title('COMPARING CONTRACT STATUS IN CONTRACT TYPE')
plt.ylabel('COUNT')
plt.yscale('log')

In [None]:
print(df_prev_app.groupby(['NAME_CONTRACT_TYPE']).NAME_CONTRACT_STATUS.value_counts(normalize=True).mul(100))
df_prev_app.groupby(['NAME_CONTRACT_STATUS']).NAME_CONTRACT_TYPE.value_counts(normalize=True).mul(100)

In [None]:
# STATISTICAL ANALYSIS
df_merge.groupby(['NAME_CONTRACT_TYPE_x']).TARGET.value_counts(normalize=True).mul(100)


    
#### KEYPOINTS:

* 85.92% of consumer loans were approved, while applications with contract type XNA were never approved.
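
The `NAME_CONTRACT_TYPE_x` column used above comes from the merge: both datasets have a column with that name, so pandas adds the default `_x` (left) and `_y` (right) suffixes. A small sketch with made-up rows shows this, and also that one client can appear several times after the merge:

```python
import pandas as pd

# Hypothetical previous applications: client 1 applied twice, client 4 has no current application.
prev = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2, 4],
    "NAME_CONTRACT_TYPE": ["Consumer loans", "Cash loans", "Consumer loans", "Cash loans"],
})

# Hypothetical current applications: client 3 has no previous application.
app = pd.DataFrame({
    "SK_ID_CURR": [1, 2, 3],
    "NAME_CONTRACT_TYPE": ["Cash loans", "Revolving loans", "Cash loans"],
    "TARGET": ["0", "1", "0"],
})

# Inner merge keeps only clients present in both frames.
merged = pd.merge(prev, app, how="inner", on="SK_ID_CURR")
print(merged.columns.tolist())
print(merged["SK_ID_CURR"].tolist())
```

Since a client with several previous applications is repeated, TARGET percentages computed on the merged data are per previous application, not per client.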



# FLAG LAST APPL PER CONTRACT ANALYSIS

In [None]:
# DESCRIBING FLAG LAST APPL PER CONTRACT UNDER CONTRACT STATUS
fig, axes = plt.subplots()
fig.set_size_inches(10,5)
b_e=sns.countplot(x='NAME_CONTRACT_STATUS', hue='FLAG_LAST_APPL_PER_CONTRACT', data=df_prev_app, palette='hot',
                         order = df_prev_app['NAME_CONTRACT_STATUS'].value_counts().index)
b_e.legend(bbox_to_anchor=(1, 1), loc='upper left',title = 'FLAG_LAST_APPL_PER_CONTRACT')
plt.title('COMPARING FLAG LAST APPLICATION PER CONTRACT IN CONTRACT STATUS')
plt.ylabel('COUNT')
plt.yscale('log')

In [None]:
# STATISTICAL ANALYSIS
print(df_prev_app.groupby(['FLAG_LAST_APPL_PER_CONTRACT']).NAME_CONTRACT_STATUS.value_counts(normalize=True).mul(100))
df_prev_app.groupby(['NAME_CONTRACT_STATUS']).FLAG_LAST_APPL_PER_CONTRACT.value_counts(normalize=True).mul(100)

In [None]:
df_merge.groupby(['FLAG_LAST_APPL_PER_CONTRACT']).TARGET.value_counts(normalize=True).mul(100)


    
#### KEYPOINTS:

* Every approved application had the flag for last application per contract set to yes. Applications with the flag set to no were mostly refused.



# PAYMENT TYPE ANALYSIS

In [None]:
# DESCRIBING PAYMENT TYPE IN CONTRACT STATUS
fig, axes = plt.subplots()
fig.set_size_inches(10,5)
b_e=sns.countplot(x='NAME_CONTRACT_STATUS', hue='NAME_PAYMENT_TYPE', data=df_prev_app, palette='hot',
                         order = df_prev_app['NAME_CONTRACT_STATUS'].value_counts().index)
b_e.legend(bbox_to_anchor=(1, 1), loc='upper left',title = 'NAME_PAYMENT_TYPE')
plt.title('COMPARING PAYMENT TYPE IN CONTRACT STATUS')
plt.ylabel('COUNT')
plt.yscale('log')

In [None]:
# STATISTICAL ANALYSIS
print(df_prev_app.groupby(['NAME_CONTRACT_STATUS']).NAME_PAYMENT_TYPE.value_counts(normalize=True).mul(100))
df_prev_app.groupby(['NAME_PAYMENT_TYPE']).NAME_CONTRACT_STATUS.value_counts(normalize=True).mul(100)

In [None]:
# STATISTICAL ANALYSIS 
df_merge.groupby(['NAME_PAYMENT_TYPE']).TARGET.value_counts(normalize=True).mul(100)


    
#### KEYPOINTS:

* Cash through the bank was the most common payment type among approved applications, at 78.82%. 98.97% of cancelled applications had payment type XNA; XNA applications were approved less often and were mostly cancelled or refused.


# GOODS CATEGORY ANALYSIS

In [None]:
# DESCRIBING GOODS CATEGORY IN CONTRACT STATUS
fig, axes = plt.subplots()
fig.set_size_inches(10,5)
b_e=sns.countplot(x='NAME_CONTRACT_STATUS', hue='NAME_GOODS_CATEGORY', data=df_prev_app, palette='hot',
                         order = df_prev_app['NAME_CONTRACT_STATUS'].value_counts().index)
b_e.legend(bbox_to_anchor=(1, 1), loc='upper left',title = 'NAME_GOODS_CATEGORY')
plt.title('COMPARING GOODS CATEGORY IN CONTRACT STATUS')
plt.ylabel('COUNT')
plt.yscale('log')

In [None]:
# STATISTICAL ANALYSIS
print(df_prev_app.groupby(['NAME_CONTRACT_STATUS']).NAME_GOODS_CATEGORY.value_counts(normalize=True).mul(100))
df_prev_app.groupby(['NAME_GOODS_CATEGORY']).NAME_CONTRACT_STATUS.value_counts(normalize=True).mul(100)

In [None]:
# STATISTICAL ANALYSIS
df_merge.groupby(['NAME_GOODS_CATEGORY']).TARGET.value_counts(normalize=True).mul(100)

   
#### KEYPOINTS:

* The goods categories Fitness, Other and Medicine had some of the highest approval rates. Only 81% of Insurance applications were approved. The XNA category accounted for 99% of cancelled applications, and only 43% of XNA applications were approved.
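
Approval rates for rare categories can be misleading, since a category with a handful of rows can reach 100% by chance. The sketch below reports the row count next to each approval rate and filters out small categories. The frame `toy_prev` and the threshold `MIN_ROWS` are hypothetical and only serve the illustration.

```python
import pandas as pd

# Toy stand-in for df_prev_app; rows are invented for illustration only.
toy_prev = pd.DataFrame({
    "NAME_GOODS_CATEGORY": [
        "Fitness",
        "Mobile", "Mobile", "Mobile", "Mobile",
        "XNA", "XNA", "XNA", "XNA",
    ],
    "NAME_CONTRACT_STATUS": [
        "Approved",
        "Approved", "Approved", "Approved", "Refused",
        "Approved", "Canceled", "Canceled", "Refused",
    ],
})

MIN_ROWS = 3  # hypothetical support threshold, to be tuned on the real data

# Approval rate (percent) and number of rows per goods category.
summary = (
    toy_prev.assign(is_approved=toy_prev["NAME_CONTRACT_STATUS"].eq("Approved"))
    .groupby("NAME_GOODS_CATEGORY")["is_approved"]
    .agg(approval_pct="mean", n_rows="size")
)
summary["approval_pct"] = summary["approval_pct"].mul(100)

# Keep only categories with enough rows, ranked by approval rate.
reliable = summary[summary["n_rows"] >= MIN_ROWS].sort_values(
    "approval_pct", ascending=False
)

print(summary)
print(reliable)
```

In this toy data the single Fitness row gives a 100% approval rate, which is why it is dropped before ranking.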


# PORTFOLIO ANALYSIS

In [None]:
# DESCRIBING PORTFOLIO IN CONTRACT STATUS
fig, axes = plt.subplots()
fig.set_size_inches(10,5)
b_e=sns.countplot(x='NAME_CONTRACT_STATUS', hue='NAME_PORTFOLIO', data=df_prev_app, palette='hot',
                         order = df_prev_app['NAME_CONTRACT_STATUS'].value_counts().index)
b_e.legend(bbox_to_anchor=(1, 1), loc='upper left',title = 'NAME_PORTFOLIO')
plt.title('COMPARING PORTFOLIO IN CONTRACT STATUS')
plt.ylabel('COUNT')
plt.yscale('log')

In [None]:
# STATISTICAL ANALYSIS
print(df_prev_app.groupby(['NAME_CONTRACT_STATUS']).NAME_PORTFOLIO.value_counts(normalize=True).mul(100))
df_prev_app.groupby(['NAME_PORTFOLIO']).NAME_CONTRACT_STATUS.value_counts()

In [None]:
# STATISTICAL ANALYSIS
df_merge.groupby(['NAME_PORTFOLIO']).TARGET.value_counts(normalize=True).mul(100)


    
#### KEYPOINTS:

* POS was the most-approved portfolio and XNA the least approved. Most refusals were in the Cash portfolio.
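
"Most approved" can refer to raw counts or to rates, and the two rankings do not always agree. The sketch below puts counts and row percentages side by side for a toy frame (`toy_prev`, invented rows) so both readings can be compared.

```python
import pandas as pd

# Toy stand-in for df_prev_app; rows are invented for illustration only.
toy_prev = pd.DataFrame({
    "NAME_PORTFOLIO": ["POS", "POS", "POS", "Cash", "Cash", "Cash", "Cash", "XNA"],
    "NAME_CONTRACT_STATUS": [
        "Approved", "Approved", "Refused",
        "Approved", "Refused", "Refused", "Canceled",
        "Canceled",
    ],
})

# Wide table of raw counts: one row per portfolio, one column per status.
counts = (
    toy_prev.groupby(["NAME_PORTFOLIO", "NAME_CONTRACT_STATUS"])
    .size()
    .unstack(fill_value=0)
)

# Row percentages derived from the same counts.
pct = counts.div(counts.sum(axis=1), axis=0).mul(100).round(2)

# Counts and percentages in a single table.
side_by_side = counts.join(pct.add_suffix(" (%)"))
print(side_by_side)
```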



# YIELD GROUP ANALYSIS

In [None]:
# DESCRIBING YIELD GROUP IN CONTRACT STATUS
fig, axes = plt.subplots()
fig.set_size_inches(10,5)
b_e=sns.countplot(x='NAME_CONTRACT_STATUS', hue='NAME_YIELD_GROUP', data=df_prev_app, palette='hot',
                         order = df_prev_app['NAME_CONTRACT_STATUS'].value_counts().index)
b_e.legend(bbox_to_anchor=(1, 1), loc='upper left',title = 'NAME_YIELD_GROUP')
plt.title('COMPARING YIELD GROUP IN CONTRACT STATUS')
plt.ylabel('COUNT')
plt.yscale('log')

In [None]:
# STATISTICAL ANALYSIS
print(df_prev_app.groupby(['NAME_CONTRACT_STATUS']).NAME_YIELD_GROUP.value_counts(normalize=True).mul(100))
df_prev_app.groupby(['NAME_YIELD_GROUP']).NAME_CONTRACT_STATUS.value_counts(normalize=True).mul(100)

In [None]:
# STATISTICAL ANALYSIS
df_merge.groupby(['NAME_YIELD_GROUP']).TARGET.value_counts(normalize=True).mul(100)

    
#### KEYPOINTS:

* 59.22% of applications with yield group XNA were cancelled. 31.16% of approved applications came from the middle yield group, followed by 28.84% from the high yield group.


# PRODUCT COMBINATION ANALYSIS

In [None]:
# DESCRIBING PRODUCT COMBINATION IN CONTRACT STATUS
fig, axes = plt.subplots()
fig.set_size_inches(10,5)
b_e=sns.countplot(x='NAME_CONTRACT_STATUS', hue='PRODUCT_COMBINATION', data=df_prev_app, palette='hot',
                         order = df_prev_app['NAME_CONTRACT_STATUS'].value_counts().index)
b_e.legend(bbox_to_anchor=(1, 1), loc='upper left',title = 'PRODUCT_COMBINATION')
plt.title('COMPARING PRODUCT COMBINATION IN CONTRACT STATUS')
plt.ylabel('COUNT')
plt.yscale('log')

In [None]:
# STATISTICAL ANALYSIS
print(df_prev_app.groupby(['NAME_CONTRACT_STATUS']).PRODUCT_COMBINATION.value_counts(normalize=True).mul(100))
df_prev_app.groupby(['PRODUCT_COMBINATION']).NAME_CONTRACT_STATUS.value_counts(normalize=True).mul(100)

In [None]:
# STATISTICAL ANALYSIS
df_merge.groupby(['PRODUCT_COMBINATION']).TARGET.value_counts(normalize=True).mul(100)

    
#### KEYPOINTS:

* The product combination POS household with interest made up 22.50% of approved applications. Cash dominated the cancelled status with 81.89%. Only 34.49% of Card Street applications were approved.



# SELLER INDUSTRY ANALYSIS

In [None]:
# DESCRIBING SELLER INDUSTRY IN CONTRACT STATUS
fig, axes = plt.subplots()
fig.set_size_inches(10,5)
b_e=sns.countplot(x='NAME_CONTRACT_STATUS', hue='NAME_SELLER_INDUSTRY', data=df_prev_app, palette='hot',
                         order = df_prev_app['NAME_CONTRACT_STATUS'].value_counts().index)
b_e.legend(bbox_to_anchor=(1, 1), loc='upper left',title = 'NAME_SELLER_INDUSTRY')
plt.title('COMPARING SELLER INDUSTRY IN CONTRACT STATUS')
plt.ylabel('COUNT')
plt.yscale('log')

In [None]:
# STATISTICAL ANALYSIS
print(df_prev_app.groupby(['NAME_CONTRACT_STATUS']).NAME_SELLER_INDUSTRY.value_counts(normalize=True).mul(100))
df_prev_app.groupby(['NAME_SELLER_INDUSTRY']).NAME_CONTRACT_STATUS.value_counts(normalize=True).mul(100)

In [None]:
# STATISTICAL ANALYSIS
df_merge.groupby(['NAME_SELLER_INDUSTRY']).TARGET.value_counts(normalize=True).mul(100)


    
#### KEYPOINTS:

* Only 40.90% of applications with seller industry XNA were approved, and 36.68% were cancelled. Most unused offers came from the Connectivity industry. Only 65.60% of MLM partners applications were approved.


# CASH LOAN PURPOSE ANALYSIS

In [None]:
# DESCRIBING CASH LOAN PURPOSE IN CONTRACT STATUS
fig, axes = plt.subplots()
fig.set_size_inches(10,5)
b_e=sns.countplot(x='NAME_CONTRACT_STATUS', hue='NAME_CASH_LOAN_PURPOSE', data=df_prev_app, palette='hot',
                         order = df_prev_app['NAME_CONTRACT_STATUS'].value_counts().index)
b_e.legend(bbox_to_anchor=(1, 1), loc='upper left',title = 'NAME_CASH_LOAN_PURPOSE')
plt.title('COMPARING CASH LOAN PURPOSE IN CONTRACT STATUS')
plt.ylabel('COUNT')
plt.yscale('log')


In [None]:
# STATISTICAL ANALYSIS
print(df_prev_app.groupby(['NAME_CONTRACT_STATUS']).NAME_CASH_LOAN_PURPOSE.value_counts(normalize=True).mul(100))
df_prev_app.groupby(['NAME_CASH_LOAN_PURPOSE']).NAME_CONTRACT_STATUS.value_counts(normalize=True).mul(100)

In [None]:
# STATISTICAL ANALYSIS
df_merge.groupby(['NAME_CASH_LOAN_PURPOSE']).TARGET.value_counts(normalize=True).mul(100)


    
#### KEYPOINTS:

* By cash loan purpose, 78.5% of XAP and 42.13% of XNA applications were approved, whereas 80.42% of Payments on other loans and 73.33% of Refusal to name the goal applications were refused.
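
When the question is only "which categories default most", the two-row output of `value_counts(normalize=True)` per group can be collapsed into one number. Assuming `TARGET` is coded 0/1 with 1 marking a client with payment difficulties, the group mean of `TARGET` is the default rate. The sketch uses a toy frame (`toy_merge`, invented rows) in place of `df_merge`.

```python
import pandas as pd

# Toy stand-in for df_merge; rows are invented for illustration only.
# Assumption: TARGET is 0/1 and 1 marks a client with payment difficulties.
toy_merge = pd.DataFrame({
    "NAME_CASH_LOAN_PURPOSE": [
        "XAP", "XAP", "XAP", "XAP",
        "Hobby", "Hobby",
        "Repairs", "Repairs", "Repairs", "Repairs",
    ],
    "TARGET": [0, 0, 0, 1, 1, 0, 0, 0, 1, 0],
})

# Mean of a 0/1 column is the share of ones, i.e. the default rate per group.
default_rate = (
    toy_merge.groupby("NAME_CASH_LOAN_PURPOSE")["TARGET"]
    .mean()
    .mul(100)
    .sort_values(ascending=False)
)
print(default_rate)
```

Sorting the result gives the ranking of purposes by default rate directly.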


<div style="color:black;
           display:fill;
           border-radius:0.5px;
           background-color:lightsteelblue;
           font-size:20px;
           font-weight:600;
           font-family:Garamond;
           letter-spacing:0.05px">

<p style="padding:px;
              color:black;">
    

<h1>CONCLUDING THOUGHTS</h1><br>
> Overall there were more female clients.<br> 
> The IT staff, Private service staff, Realty agents and secretaries with lower secondary education, can be trusted.<br>
  Students and businessmen never defaulted.<br>
> Banks should pay attention to:<br>
&nbsp;&nbsp;    > The age group of 25-40. <br>
&nbsp;&nbsp;    > People with high credit amount for low goods price and low income.<br>
&nbsp;&nbsp;    > Female defaulters who were associated with high credit amount.<br>
&nbsp;&nbsp;    > People with occupation type as Laborers, Low skill laborers.<br>
&nbsp;&nbsp;    > People whose organization type is Business Entity Type 3, XNA or Self-employed.<br>
&nbsp;&nbsp;    > Realty agents with suite type group of people.<br>
&nbsp;&nbsp;    > Maternity income type with cash loans.<br>
&nbsp;&nbsp;    > People with Academic degree who took higher credits and with occupation type accountants and sales.<br>

> From Previous application:<br>
&nbsp;&nbsp;    > XNA contract type, XNA payment type defaulted more.<br>
&nbsp;&nbsp;    > Lower default rates in product combination as POS industry without interest, seller industry as clothing and tourism, portfolio as cars.<br>
&nbsp;&nbsp;    > Higher default rates were found in Refusal to name the goal, Hobby, Car repairs, Gasification/water supply, Money for a third person and Payments on other loans in that order.<br>
</p>
</div>