# Risk Analytics Case Study - EDA
No Machine Learning

## Business Objectives

This case study aims to identify patterns that indicate whether a client will have difficulty paying their installments. Such patterns can inform actions such as denying the loan, reducing the loan amount, or lending to risky applicants at a higher interest rate, while ensuring that applicants capable of repaying the loan are not rejected. Identifying such applicants using EDA is the aim of this case study.

In other words, the company wants to understand the driving factors (or driver variables) behind loan default, i.e. the variables which are strong indicators of default.  The company can utilise this knowledge for its portfolio and risk assessment.

### Loading Libraries

In [None]:
#import the warnings.

import warnings
warnings.filterwarnings('ignore')

In [None]:
##import the useful libraries.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

### Importing DATA and Observing Datasets

In [None]:
# Read application_data dataset and check the first five rows 
# Contains all the information of the client at the time of application. The data is about whether a client has payment difficulties.

ad = pd.read_csv('../input/credit-eda-case-study/application_data.csv')
ad.head()

In [None]:
# Read previous_application dataset and check the first five rows
# Contains information about the client’s previous loan data. 
# Contains the data whether the previous application had been Approved, Cancelled, Refused or Unused offer.

pa = pd.read_csv('../input/credit-eda-case-study/previous_application.csv')
pa.head()

## **Inspect the Dataframe**
Inspect the dataframe for dimensions, null-values, and summary of different numeric columns.



### Observing Data Set and Cleaning of application_data Dataset

In [None]:
# Check the number of rows and columns in application

print("Numbers of rows in application data set are:", len(ad))
print("Numbers of column in application data set are:", len(ad.columns))

In [None]:
# Widen output display to see more columns

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

In [None]:
# Check the column-wise info of the application_data dataframe

ad.info()

In Application dataframe we have 65 float attributes, 41 integer attributes and 16 Object attributes.

In [None]:
#percentage of missing values in application_data

ad.isnull().sum() * 100 / len(ad)

In [None]:
#drop columns with very high Null values and other irrelevant columns in application_data

ad.drop(['FLAG_DOCUMENT_2','FLAG_DOCUMENT_3', 'FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_6', 'FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_8', 
        'FLAG_DOCUMENT_9','FLAG_DOCUMENT_10', 'FLAG_DOCUMENT_11', 'FLAG_DOCUMENT_12', 'FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_14', 'FLAG_DOCUMENT_15', 
        'FLAG_DOCUMENT_16','FLAG_DOCUMENT_17', 'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20', 'FLAG_DOCUMENT_21','OWN_CAR_AGE',
        'EXT_SOURCE_1','EXT_SOURCE_2','EXT_SOURCE_3','APARTMENTS_AVG', 'BASEMENTAREA_AVG', 'YEARS_BEGINEXPLUATATION_AVG','YEARS_BUILD_AVG',
        'COMMONAREA_AVG', 'ELEVATORS_AVG', 'ENTRANCES_AVG','FLOORSMAX_AVG', 'FLOORSMIN_AVG', 'LANDAREA_AVG','LIVINGAPARTMENTS_AVG', 
        'LIVINGAREA_AVG', 'NONLIVINGAPARTMENTS_AVG','NONLIVINGAREA_AVG', 'APARTMENTS_MODE', 'BASEMENTAREA_MODE','YEARS_BEGINEXPLUATATION_MODE',
        'YEARS_BUILD_MODE', 'COMMONAREA_MODE','ELEVATORS_MODE', 'ENTRANCES_MODE', 'FLOORSMAX_MODE', 'FLOORSMIN_MODE','LANDAREA_MODE', 'LIVINGAPARTMENTS_MODE',
        'LIVINGAREA_MODE','NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAREA_MODE', 'APARTMENTS_MEDI','BASEMENTAREA_MEDI', 'YEARS_BEGINEXPLUATATION_MEDI',
        'YEARS_BUILD_MEDI','COMMONAREA_MEDI', 'ELEVATORS_MEDI', 'ENTRANCES_MEDI', 'FLOORSMAX_MEDI','FLOORSMIN_MEDI', 'LANDAREA_MEDI', 'LIVINGAPARTMENTS_MEDI',
        'LIVINGAREA_MEDI', 'NONLIVINGAPARTMENTS_MEDI', 'NONLIVINGAREA_MEDI','FONDKAPREMONT_MODE', 'HOUSETYPE_MODE', 'TOTALAREA_MODE','WALLSMATERIAL_MODE',
        'EMERGENCYSTATE_MODE'
        ], axis =1, inplace=True)
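
The high-null part of a drop list like the one above could also be derived from the missing-value percentages rather than typed by hand. The following is a minimal sketch on a tiny synthetic frame; the helper name `high_null_columns` and the 40% cut-off are assumptions for illustration (the notebook only states a 40% threshold for the previous application data), and columns dropped for being irrelevant would still need to be listed explicitly.

```python
import numpy as np
import pandas as pd

def high_null_columns(df, threshold_pct=40.0):
    """Return names of columns whose share of missing values exceeds threshold_pct.

    Hypothetical helper, not part of the original notebook.
    """
    # The mean of a boolean mask is the fraction of missing values per column
    null_pct = df.isnull().mean() * 100
    return null_pct[null_pct > threshold_pct].index.tolist()

# Tiny synthetic frame standing in for application_data
demo = pd.DataFrame({
    "AMT_CREDIT": [1.0, 2.0, 3.0, 4.0],
    "OWN_CAR_AGE": [np.nan, np.nan, np.nan, 5.0],  # 75% missing
    "AMT_ANNUITY": [10.0, np.nan, 30.0, 40.0],     # 25% missing
})

cols_to_drop = high_null_columns(demo, threshold_pct=40.0)
demo_clean = demo.drop(columns=cols_to_drop)
print(cols_to_drop)
```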

In [None]:
# Check the number of columns after dropping high-null and other irrelevant columns in application_data

len(ad.columns)

In [None]:
# Check info of application data after dropping irrelevant columns

ad.info()

In [None]:
# Drop rows from columns with very negligible Null values in application

ad.dropna(subset=['AMT_ANNUITY','AMT_GOODS_PRICE','NAME_TYPE_SUITE'], inplace=True)

In [None]:
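# Impute negligible Null values of AMT_INCOME_TOTAL with mean
# (assign the result back instead of calling fillna(inplace=True) on a column selection)

ad['AMT_INCOME_TOTAL'] = ad['AMT_INCOME_TOTAL'].fillna(ad['AMT_INCOME_TOTAL'].mean())
ad.AMT_INCOME_TOTAL.isnull().sum()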

In [None]:
# Inspect the value counts of OCCUPATION_TYPE (no imputation is done in this cell)

ad.OCCUPATION_TYPE.value_counts()

In [None]:
# Check gender values

print("Data before rectifying undefined value: \n\n",ad.CODE_GENDER.value_counts())

# Replace the negligible undefined values ('XNA') in gender with the mode (most frequent value, 'F')

ad['CODE_GENDER'] = ad['CODE_GENDER'].replace('XNA', 'F')
ad.CODE_GENDER.value_counts()

In [None]:
# Convert client age from days to years

ad.DAYS_BIRTH = round(abs(ad.DAYS_BIRTH/365))  # abs() is used because DAYS_BIRTH is stored as negative days
ad.DAYS_BIRTH.head()

In [None]:
# Final shape of application data after cleaning
ad.shape

### Observing Data Set and Cleaning of Previous Application Dataset

In [None]:
# Describe previous application data

print("Numbers of rows in previous application data set are:", len(pa))
print("Numbers of column in previous application data set are:", len(pa.columns))

In [None]:
# Check the column-wise info of the previous application dataframe
pa.info()

In Previous Application dataframe we have 15 float attributes, 6 integer attributes and 16 Object attributes.

In [None]:
# Check the summary for the numeric columns of previous application
Numeric_column_pa = pa.select_dtypes(include=np.number)
Numeric_column_pa.describe()


In [None]:
# Percentage of Null values in previous application
pa.isnull().sum()* 100 / len(pa)

In [None]:
# Drop columns with very high Null values (more than 40% of null value)and other irrelevant columns in previous application
pa.drop(['AMT_DOWN_PAYMENT','RATE_DOWN_PAYMENT','RATE_INTEREST_PRIMARY','RATE_INTEREST_PRIVILEGED', 'SK_ID_PREV', 'WEEKDAY_APPR_PROCESS_START', 'HOUR_APPR_PROCESS_START', 
         'DAYS_FIRST_DRAWING', 'DAYS_FIRST_DUE', 'DAYS_LAST_DUE_1ST_VERSION', 'DAYS_LAST_DUE', 'DAYS_TERMINATION', 'NFLAG_INSURED_ON_APPROVAL', 
         'NAME_TYPE_SUITE'], axis =1, inplace=True)

In [None]:
# Checking numbers of columns after dropping very high Null values and other irrelevant columns in previous application
len(pa.columns)

In [None]:
# Drop rows from columns with very negligible Null values in previous application
pa.dropna(subset=['PRODUCT_COMBINATION','AMT_CREDIT'], inplace=True)
pa.PRODUCT_COMBINATION.isnull().sum()

In [None]:
# Describe AMT_ANNUITY of previous application
print(pa.AMT_ANNUITY.describe())

# Percentage of Null values in AMT_ANNUITY
print('\nPercentage of Null values in AMT_ANNUITY:',round(pa.AMT_ANNUITY.isnull().sum()* 100 / len(pa.AMT_ANNUITY),2))

In [None]:
# Describe AMT_ANNUITY
pa.AMT_ANNUITY.describe()

In [None]:
# Check NAME_SELLER_INDUSTRY. More than half its values are undefined
pa.NAME_SELLER_INDUSTRY.value_counts(normalize=True)

### Identifying outliers in the dataset and adding inferences about them

### Checking outliers and performing univariate analysis of Application_data dataframe

In [None]:
# Extracting the numeric features from application data

numeric_features_ad = []
for col in ad.columns:
    if ad[col].dtype == float or ad[col].dtype == int:
        numeric_features_ad.append(col)
        
print(numeric_features_ad)
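
The dtype loop above depends on how `float` and `int` map to NumPy dtypes on the platform. A hedged alternative is `select_dtypes`, which this notebook already uses for the previous application data. The sketch below runs on a small synthetic frame, not the real dataset.

```python
import pandas as pd

# Synthetic stand-in for application_data with mixed dtypes
demo = pd.DataFrame({
    "AMT_CREDIT": [100.0, 200.0],
    "CNT_CHILDREN": [0, 2],
    "CODE_GENDER": ["F", "M"],
})

# "number" matches every integer and float dtype, regardless of bit width
numeric_cols = demo.select_dtypes(include="number").columns.tolist()
print(numeric_cols)
```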

In [None]:
# check outliers in total income amount of application_data

plt.figure(figsize=[15,3])
sns.boxplot(x=ad.AMT_INCOME_TOTAL)
plt.title('Income Distribution Of The Clients', fontsize=15)
plt.xlabel('Income Amount', fontsize=12)
plt.xscale('log')
plt.show()

**Inference:**
* The income distribution of clients has some extreme outlier values beyond 99th percentile.

In [None]:
# Check quantiles of AMT_INCOME_TOTAL

ad.AMT_INCOME_TOTAL.quantile([0.5,0.6,0.7,0.8,0.9,0.95,0.99,0.995])

In [None]:
# Description of columns where AMT_INCOME_TOTAL is above 99th percentile

ad[ad.AMT_INCOME_TOTAL>472500].describe()

In [None]:
# Description of columns where AMT_INCOME_TOTAL is above the 99.5th percentile

ad[ad.AMT_INCOME_TOTAL>630000].describe()

**Inference:** 
* Out of 306207 applicants, 0.99% earn more than the 99th percentile income, which is 472500. 
* Only 0.46% of applicants earn more than the 99.5th percentile, which is 630000.
* This value is far from our mean value of 147600. The boxplot also shows values well beyond the 99th percentile. Based on this analysis, income above 630000 should be treated as an outlier. 
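
The percentile checks above follow one pattern: take a quantile as a threshold, then count the rows above it. A minimal sketch of that pattern, assuming a hypothetical helper `upper_tail_summary` and synthetic data in place of `AMT_INCOME_TOTAL`:

```python
import pandas as pd

def upper_tail_summary(series, q=0.99):
    """Summarise the values lying above the q-th quantile of a numeric Series.

    Hypothetical helper for illustration only.
    """
    threshold = series.quantile(q)
    above = series[series > threshold]
    n_valid = series.notna().sum()
    return {
        "threshold": threshold,
        "count_above": int(above.size),
        "pct_above": round(above.size / n_valid * 100, 2),
    }

# Synthetic incomes 1..100 instead of the real column
incomes = pd.Series(range(1, 101), dtype=float)
summary = upper_tail_summary(incomes, q=0.99)
print(summary)
```

Reading the threshold from `quantile` each time avoids hard-coding values such as 472500, which would go stale if the cleaning steps change.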

In [None]:
# plot the boxplot of Amount Credit

plt.figure(figsize=[15,3])
sns.boxplot(x=ad.AMT_CREDIT)
plt.title('Credit Amount Of The Loan', fontsize=15)
plt.xlabel('Amount', fontsize=12)
plt.show()

**Inference:**
* The median almost divides the IQR equally, but the max whisker is much longer than the min whisker.
* There are many outliers in credit amount as per the boxplot, the most significant ones being above 2.5 on the plot's axis (which is scaled by 1e6, i.e. about 2500000).

In [None]:
# Check quantiles of AMT_CREDIT

ad.AMT_CREDIT.quantile([0.5,0.6,0.7,0.8,0.9,0.95,0.99,0.995])

In [None]:
# Description of columns where AMT_CREDIT is above 99th percentile

ad[ad.AMT_CREDIT>1842768].describe()

**Inference:** 
* Out of 306207 applications, 0.99% have a loan amount above the 99th percentile, which is 1842768.
* This value is far from our mean value: 513531.
* Based on this analysis, loan amounts beyond 1842768 should be treated as outliers.

In [None]:
# plot the boxplot of Amount Annuity

plt.figure(figsize=[15,3])
sns.boxplot(x=ad.AMT_ANNUITY)
plt.title('Loan Annuity', fontsize=15)
plt.xlabel('Amount', fontsize=12)
plt.show()

**Inference:** 
* The median almost divides the IQR equally, but the max whisker is noticeably longer than the min whisker.
* There are many outliers in loan annuity as per the boxplot, the most significant one being above 250000.

In [None]:
# Check quantiles of AMT_ANNUITY

ad.AMT_ANNUITY.quantile([0.5,0.6,0.7,0.8,0.9,0.95,0.99,0.995])

In [None]:
# Check mean of AMT_ANNUITY

round(ad.AMT_ANNUITY.mean(),2)

In [None]:
# Description of columns where AMT_ANNUITY is above 99th percentile

ad[ad.AMT_ANNUITY>70006.5].describe()

**Inference:** 
* Out of 306207 applicants, 0.99% have a loan annuity above the 99th percentile, which is 70006.5. 
* This value is far from our mean value: 27122. 
* Based on this analysis, loan annuities beyond 70006.5 should be treated as outliers.

In [None]:
#check value counts of REGION_RATING_CLIENT
ad.REGION_RATING_CLIENT.value_counts()

In [None]:
#check value counts of REG_REGION_NOT_LIVE_REGION
print(ad.REG_REGION_NOT_LIVE_REGION.value_counts())

In [None]:
#check value counts of HOUR_APPR_PROCESS_START
print(ad.HOUR_APPR_PROCESS_START.value_counts())

In [None]:
#check value counts of REG_REGION_NOT_WORK_REGION
ad.REG_REGION_NOT_WORK_REGION.value_counts()

In [None]:
#check value counts of REGION_RATING_CLIENT_W_CITY
ad.REGION_RATING_CLIENT_W_CITY.value_counts()

In [None]:
#check value counts of LIVE_REGION_NOT_WORK_REGION
ad.LIVE_REGION_NOT_WORK_REGION.value_counts()

In [None]:
#check value counts of REG_CITY_NOT_LIVE_CITY
ad.REG_CITY_NOT_LIVE_CITY.value_counts()

In [None]:
#check value counts of REG_CITY_NOT_WORK_CITY
ad.REG_CITY_NOT_WORK_CITY.value_counts()

In [None]:
#check value counts of LIVE_CITY_NOT_WORK_CITY
ad.LIVE_CITY_NOT_WORK_CITY.value_counts()

In [None]:
# Drop LIVE_CITY_NOT_WORK_CITY, REG_CITY_NOT_WORK_CITY, REG_CITY_NOT_LIVE_CITY, LIVE_REGION_NOT_WORK_REGION, REG_REGION_NOT_WORK_REGION and REG_REGION_NOT_LIVE_REGION
# because they contain only 0 and 1 values and provide no insight for identifying defaulters. NAME_TYPE_SUITE is dropped as well.
ad.drop(['LIVE_CITY_NOT_WORK_CITY','REG_CITY_NOT_WORK_CITY','REG_CITY_NOT_LIVE_CITY','LIVE_REGION_NOT_WORK_REGION', 'REG_REGION_NOT_WORK_REGION', 'REG_REGION_NOT_LIVE_REGION',
         'NAME_TYPE_SUITE'], axis =1, inplace=True)

In [None]:
#check shape of application data
ad.shape

In [None]:
# Checking the marital status of the applicants
print("Checking Data count of NAME_FAMILY_STATUS ")
print(ad.NAME_FAMILY_STATUS.value_counts())

# Replacing Civil marriage with Married (assign the result back rather than using inplace on a column selection)
ad["NAME_FAMILY_STATUS"] = ad["NAME_FAMILY_STATUS"].replace("Civil marriage", "Married")

# Check the count again
print("\n\nChecking Data count of NAME_FAMILY_STATUS after replacing ")
print(ad.NAME_FAMILY_STATUS.value_counts())

# Plot marital status of applicants
plt.figure(figsize=[15,7])
plt.title('Marital Status of the Applicants', fontsize=20)
sns.countplot(x=ad["NAME_FAMILY_STATUS"])
plt.xlabel('Marital Status', fontsize=15)
plt.ylabel('Count', fontsize=15)
plt.show()

**Inference:** 
* The most loan applications come from married people.
* The fewest loan applications come from widows.

In [None]:
# Checking the number of children of the applicants
print("Checking Data count of CNT_CHILDREN ")
print(ad.CNT_CHILDREN.value_counts())
print(ad.CNT_CHILDREN.describe())

# Boxplot of children count of applicants
plt.figure(figsize=[15,3])
sns.boxplot(x=ad.CNT_CHILDREN)
plt.title('Children Count of Applicants', fontsize=15)
plt.xlabel('No. of Children', fontsize=12)
plt.show()

# Check quantiles of CNT_CHILDREN
print("Check quantiles of CNT_CHILDREN")
print(ad.CNT_CHILDREN.quantile([0.5,0.6,0.7,0.8,0.9,0.95,0.99,0.9995]))

In [None]:
# Checking the data of applicants having more than 3 kids
ad[ad.CNT_CHILDREN>3].describe()

#Plotting distribution of children
#sns.countplot(ad['CNT_CHILDREN'])

**Inference:** Out of 306207 applicants, only 553 have more than 3 kids, which we can treat as outliers in the number of kids.

In [None]:
# Checking the income type of the applicants
print("Checking Data count of NAME_INCOME_TYPE ")
print(ad.NAME_INCOME_TYPE.value_counts())
print(ad.NAME_INCOME_TYPE.describe())

# Plot income type of the applicants (log scale on the count axis)
plt.figure(figsize=[15,7])
plt.title('Income type of the Applicants', fontsize=20)
g = sns.countplot(x=ad["NAME_INCOME_TYPE"])
g.set_xticklabels(g.get_xticklabels(), rotation=45)
plt.xlabel('Income Type', fontsize=15)
plt.ylabel('Count', fontsize=15)
plt.yscale('log')
plt.show()

**Inference:** Most applications come from three income types: Working, State servant and Commercial associate.

In [None]:
# Plot children count, family size and family status against TARGET side by side
plt.figure(figsize=(20, 20))

plt.subplot(3, 3, 1, ylim=(0, 210000))
sns.countplot(x=ad.CNT_CHILDREN, hue=ad["TARGET"])
plt.xticks(rotation=45)

plt.subplot(3, 3, 2, ylim=(0, 210000))
sns.countplot(x=ad.CNT_FAM_MEMBERS, hue=ad["TARGET"])
plt.xticks(rotation=45)

plt.subplot(3, 3, 3, ylim=(0, 210000))
sns.countplot(x=ad.NAME_FAMILY_STATUS, hue=ad["TARGET"])
plt.xticks(rotation=45)

# Apply the layout once, after all subplots are drawn
plt.tight_layout()
plt.show()

In [None]:
ad.groupby(["NAME_FAMILY_STATUS","CNT_CHILDREN","CNT_FAM_MEMBERS"])["TARGET"].sum()/len(ad)*100

**Inference:** Most of the defaulters are from the Married and Single categories.

### Checking outliers and performing univariate analysis of Previous application dataframe

In [None]:
# plot the boxplot of Amount Annuity

plt.figure(figsize=[15,3])
sns.boxplot(x=pa.AMT_ANNUITY)
plt.title('Annuity Of Previous Application', fontsize=15)
plt.xlabel('Amount', fontsize=12)
plt.show()

In [None]:
# Check quantiles of Amount Annuity
print("Check quantiles of AMT_ANNUITY")
print(pa.AMT_ANNUITY.quantile([0.5,0.6,0.7,0.8,0.9,0.95,0.99]))

In [None]:
#Checking the data of Amount Annuity having more than 69685 amount
pa[pa.AMT_ANNUITY>69685].describe()


**Inference:** There are 12981 values above the 99th percentile. The boxplot shows many outliers in AMT_ANNUITY, the most significant one being above 300000.


In [None]:
# plot the boxplot of Amount Credit

plt.figure(figsize=[15,3])
sns.boxplot(x=pa.AMT_CREDIT)
plt.title('Final Credit Amount On The Previous Application', fontsize=15)
plt.xlabel('Amount', fontsize=12)
plt.show()

In [None]:
# Check quantiles of Amount Credit
print("Check quantiles of AMT_CREDIT")
print(pa.AMT_CREDIT.quantile([0.5,0.6,0.7,0.8,0.9,0.95,0.99]))

In [None]:
#Checking the data of Amount Credit for having more than 1515415.50
pa[pa.AMT_CREDIT>1515415].describe()

**Inference:** There are 16703 outliers in Amount Credit as per boxplot, most significant ones being above 4000000.

In [None]:
# plot the boxplot of Amount Application of previous applicants

plt.figure(figsize=[15,3])
sns.boxplot(x=pa.AMT_APPLICATION)
plt.title('Credit Asked By Client On The Previous Application', fontsize=15)
plt.xlabel('Amount', fontsize=12)
plt.show()

In [None]:
# Check quantiles of Amount Application of previous applicants
print("Check quantiles of AMT_APPLICATION")
print(pa.AMT_APPLICATION.quantile([0.5,0.6,0.7,0.8,0.9,0.95,0.99]))

In [None]:
# Checking the data of previous applications with AMT_APPLICATION above 1350000
pa[pa.AMT_APPLICATION>1350000.00].describe()

**Inference:** There are 15952 outliers in Amount Application as per boxplot but most significant ones being above 4000000.

In [None]:
# #Correlation between Income, Loan Amount and Amount Annuity
# pd.pivot_table(data=ad, index ="AMT_INCOME_TOTAL" , columns ="NAME_FAMILY_STATUS", values = "AMT_CREDIT")

## Creating bins

In [None]:
# Create income bins from AMT_INCOME_TOTAL

# The last edge is np.inf so that incomes above 500000 fall into '450000 and above' instead of becoming NaN
ad['AMT_INCOME_BINS'] = pd.cut(ad.AMT_INCOME_TOTAL, [0,50000,100000,150000,200000,250000,300000,350000,400000,450000,np.inf],
                                labels=['under 50000', '50000 to 100000','100000 to 150000','150000 to 200000','200000 to 250000','250000 to 300000',
                                        '300000 to 350000', '350000 to 400000', '400000 to 450000', '450000 and above'])

# Value counts of income bins

ad['AMT_INCOME_BINS'].value_counts()

In [None]:
# Create Age bins from DAYS_BIRTH
# Bin edges are chosen to match the labels one-to-one; np.inf keeps ages above 65 in the last bin
labels=['20 to 30','30 to 35', '35 to 40', '40 to 45','45 to 50','50 to 55', '55 to 60','60 to 65','Above 65']
ad['DAYS_BIRTH_BINS'] = pd.cut(ad.DAYS_BIRTH, [20,30,35,40,45,50,55,60,65,np.inf], labels=labels)

# Value counts of Age bins
ad['DAYS_BIRTH_BINS'].value_counts()
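
When binning with `pd.cut`, two details are easy to miss: the number of labels must be one less than the number of edges, and by default each bin includes its right edge but not its left. The sketch below illustrates both on made-up income values; the edges and labels are illustrative and are not the ones used in this notebook.

```python
import numpy as np
import pandas as pd

# Made-up incomes, including one exactly on an edge and one far above the last finite edge
incomes = pd.Series([40000, 50000, 75000, 600000])

edges = [0, 50000, 100000, np.inf]  # np.inf makes the last bin open-ended
labels = ['under 50000', '50000 to 100000', '100000 and above']

# Default right=True: bins are (0, 50000], (50000, 100000], (100000, inf)
income_bins = pd.cut(incomes, bins=edges, labels=labels)
print(income_bins.tolist())
```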

## Identify if there is data imbalance in the data. Find the ratio of data imbalance.

Hint: How will you analyse the data in case of data imbalance? You can plot more than one type of plot to analyse the different aspects due to data imbalance. For example, you can choose your own scale for the graphs, i.e. one can plot in terms of percentage or absolute value. Do this analysis for the ‘Target variable’ in the dataset ( clients with payment difficulties and all other cases). Use a mix of univariate and bivariate analysis etc.

Explain the results of univariate, segmented univariate, bivariate analysis, etc. in business terms.

Hint: Since there are a lot of columns, you can run your analysis in loops for the appropriate columns and find the insights.
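
One way to follow the looping hint is to compute, for each categorical column, the share of clients with payment difficulties per category. The sketch below uses a hypothetical helper `default_rate_by` and a tiny synthetic frame; only the column names mirror the application data.

```python
import pandas as pd

def default_rate_by(df, columns, target="TARGET"):
    """Map each column name to a Series of default rates (%) per category.

    Hypothetical helper for illustration only.
    """
    rates = {}
    for col in columns:
        # TARGET is 0/1, so its mean within a group is the share of defaulters
        pct = df.groupby(col)[target].mean() * 100
        rates[col] = pct.round(2).sort_values(ascending=False)
    return rates

# Synthetic data: 8 clients, 2 of them with payment difficulties
demo = pd.DataFrame({
    "TARGET":       [1, 0, 0, 0, 1, 0, 0, 0],
    "CODE_GENDER":  ["M", "M", "F", "F", "F", "F", "F", "F"],
    "FLAG_OWN_CAR": ["Y", "N", "N", "N", "N", "Y", "Y", "N"],
})

rates = default_rate_by(demo, ["CODE_GENDER", "FLAG_OWN_CAR"])
for col, pct in rates.items():
    print(col, pct.to_dict())
```

Because the data is imbalanced, rates within each category are usually easier to compare than raw counts, which are dominated by the larger class.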

### Bivariate and Multivariate Analysis on application_data

In [None]:
# Check distribution of Target variable. 1 - client with payment difficulties, 0 - all other cases
target_pct = round(ad.TARGET.value_counts()/ad.TARGET.value_counts().sum()*100,2)
target_pct

In [None]:
# Plot Target variable
plt.figure(figsize=[6,6])
explode = (0,0.2)
mylabels = ['Other cases','Payment difficulties']
plt.pie(ad.TARGET.value_counts(), explode =explode, labels=target_pct)
plt.title('Percentage Share Of Clients With And Without Payment Difficulties', fontsize=15)
plt.legend(mylabels)
plt.show()

In [None]:
# Creating dataframes for both target values
ad_target_0 = ad.loc[ad.TARGET==0]
ad_target_1 = ad.loc[ad.TARGET==1]

print('Shape of target 0 (No Payment difficulties):',ad_target_0.shape)
print('Shape of target 1 (Payment difficulties):',ad_target_1.shape)

In [None]:
# Finding the ratio of data imbalance of Target
print('Ratio of data imbalance between target 0 and target 1 is:', str(round(len(ad_target_0)/len(ad_target_1)))+':1')
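
The ratio above is rounded to a whole number, which can hide a meaningful fraction. As a small sketch on synthetic labels (the 92/8 split is made up and is not the dataset's actual distribution), the same ratio can be read directly from `value_counts` and kept unrounded:

```python
import pandas as pd

# Synthetic target column: 92 non-defaulters and 8 defaulters
target = pd.Series([0] * 92 + [1] * 8)

counts = target.value_counts()
share_pct = (counts / counts.sum() * 100).round(2)

# Number of target-0 clients for every target-1 client
imbalance_ratio = counts.loc[0] / counts.loc[1]
print(f"Ratio of data imbalance: {imbalance_ratio:.2f}:1")
```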

In [None]:
# Finding percentage of applicant genders

gender_pct = round(ad.CODE_GENDER.value_counts()/ad.CODE_GENDER.value_counts().sum()*100,2)
gender_pct

In [None]:
# Plot gender variable

plt.figure(figsize=[6,6])
mylabels = ['Female','Male']
plt.pie(ad.CODE_GENDER.value_counts(), labels=gender_pct)
plt.title('Gender Percentage Share Of Clients', fontsize=15)
plt.legend(mylabels)
plt.show()

In [None]:
# Finding the ratio of data imbalance of gender

print('Ratio of data imbalance between female and male is:', str(round(len(ad.loc[ad.CODE_GENDER=='F'])/len(ad.loc[ad.CODE_GENDER=='M'])))+':1')

In [None]:
# Finding percentage of car ownership

car_pct = round(ad.FLAG_OWN_CAR.value_counts()/ad.FLAG_OWN_CAR.value_counts().sum()*100,2)
car_pct

In [None]:
# Plot car ownership variable

plt.figure(figsize=[6,6])
mylabels = ['No','Yes']
plt.pie(ad.FLAG_OWN_CAR.value_counts(), labels=car_pct)
plt.title('Car Ownership Percentage Share Of Clients', fontsize=15)
plt.legend(mylabels)
plt.show()

In [None]:
# Finding the ratio of data imbalance of car ownership

print('Ratio of data imbalance between Yes and No of car ownership is: 1:'+ str(round(len(ad.loc[ad.FLAG_OWN_CAR=='N'])/len(ad.loc[ad.FLAG_OWN_CAR=='Y']))))

In [None]:
# Finding percentage of realty ownership

realty_pct = round(ad.FLAG_OWN_REALTY.value_counts()/ad.FLAG_OWN_REALTY.value_counts().sum()*100,2)
realty_pct

In [None]:
# Plot realty ownership variable

plt.figure(figsize=[6,6])
mylabels = ['Yes','No']
plt.pie(ad.FLAG_OWN_REALTY.value_counts(), labels=realty_pct)
plt.title('Realty Ownership Share Of Clients', fontsize=15)
plt.legend(mylabels)
plt.show()

In [None]:
# Finding the ratio of data imbalance of realty ownership

print('Ratio of data imbalance of Yes and No of realty ownership:', str(round(len(ad.loc[ad.FLAG_OWN_REALTY=='Y'])/len(ad.loc[ad.FLAG_OWN_REALTY=='N'])))+':1')

In [None]:
# Plot Education Type Of Defaulters

plt.figure(figsize=[15,7])
sns.countplot(x=ad_target_1.NAME_EDUCATION_TYPE)
plt.title('Education Type Of Defaulters\n', fontsize=20)
plt.xlabel('Education Type', fontsize=15)
plt.ylabel('Count', fontsize=15)
plt.yscale('log')
plt.show()

**Inference:**
* Most defaulters come from Secondary and Higher education backgrounds.
* The fewest defaulters come from an Academic degree background.

In [None]:
# Check contract type values

print('Target 0\n',ad_target_0.NAME_CONTRACT_TYPE.value_counts())
print('\nTarget 1\n',ad_target_1.NAME_CONTRACT_TYPE.value_counts())

# Plot gender-wise distribution of contract type for both targets

fig, (ax1, ax2) = plt.subplots(1,2, figsize=[15,7], sharey=True)
sns.countplot(data=ad_target_0, x='NAME_CONTRACT_TYPE', hue='CODE_GENDER', palette='dark:salmon', ax=ax1).set(title='Without Payment Difficulties', xlabel='Contract Type', ylabel='Count')
sns.countplot(data=ad_target_1, x='NAME_CONTRACT_TYPE', hue='CODE_GENDER', palette='dark:salmon_r', ax=ax2).set(title='With Payment Difficulties', xlabel='Contract Type', ylabel='Count')
plt.suptitle('Gender-wise Distribution Of Loan Contract Type v. Default\n', fontsize=20)
plt.show()

**Inference:**
* Demand for cash loans is significantly higher than revolving loans.
* Demand from female clients for both types of loan contracts is almost twice that from male clients. However, the default rate is almost equal.

In [None]:
# Plot gender-wise Income Distribution Of Defaulters

plt.figure(figsize=[15,7])
sns.countplot(data=ad_target_1, x='AMT_INCOME_BINS', hue='CODE_GENDER', palette='dark:salmon')
plt.title('Gender-wise Income Distribution Of Defaulters\n', fontsize=20)
plt.xlabel('Income Range', fontsize=15)
plt.ylabel('Count', fontsize=15)
plt.xticks(rotation=90)
plt.show()

**Inference:**
* Most defaulters for both Male and Female come from the 100000 to 150000 income range.
* There are more female defaulters in income range 250000 and below.
* There are more male defaulters in income range 250000 and above.

In [None]:
# Plot distribution Of Occupation Type v. Default

fig, (ax1, ax2) = plt.subplots(1,2, figsize=[15,15], sharey=True)
sns.countplot(data=ad_target_0, y='OCCUPATION_TYPE', palette='crest', ax=ax1).set(title='Without Payment Difficulties', ylabel='Occupation Type', xlabel='Count')
sns.countplot(data=ad_target_1, y='OCCUPATION_TYPE', palette='magma', ax=ax2).set(title='With Payment Difficulties', ylabel='Occupation Type', xlabel='Count')
plt.suptitle('Distribution Of Occupation Type v. Default', fontsize=20)
plt.show()

**Inference:**
Among different occupations,
* Cooking staff had the least count of clients with difficulty in payment of loan.
* Labourers had the highest count of clients with difficulty in payment of loan.

## **Numerical - Categorical Analysis**

## Find the top 10 correlation for the Client with payment difficulties and all other cases (Target variable).

Note that you have to find the top correlations by segmenting the data frame with respect to the target variable, then find the top correlations within each segment and check whether they offer any insight. Say there are 5+1 (target) variables in a dataset: Var1, Var2, Var3, Var4, Var5, Target. If you have to find the top 3 correlations, they could be: Var1 & Var2, Var2 & Var3, Var1 & Var3. The target variable will not feature in these correlations because it is a categorical variable, not a continuous variable that increases or decreases.
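
The segment-wise top-correlation search described above can be sketched as follows. This is a minimal illustration on synthetic data, assuming a hypothetical helper `top_correlations`; on the real data the segments would be the target 0 and target 1 subsets of the application data and `n` would be 10.

```python
import numpy as np
import pandas as pd

def top_correlations(df, n=10):
    """Return the n strongest absolute pairwise correlations among numeric columns.

    Hypothetical helper for illustration only.
    """
    corr = df.select_dtypes(include="number").corr().abs()
    # Keep the upper triangle only (k=1 excludes the diagonal) so each pair appears once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna().sort_values(ascending=False)
    return pairs.head(n)

# Synthetic stand-in for the application data
demo = pd.DataFrame({
    "TARGET":          [0, 0, 0, 0, 1, 1, 1, 1],
    "AMT_CREDIT":      [100, 200, 300, 400, 150, 250, 350, 450],
    "AMT_GOODS_PRICE": [90, 180, 270, 360, 140, 230, 330, 420],
    "CNT_CHILDREN":    [0, 2, 1, 0, 1, 0, 2, 1],
})

# Segment by TARGET, then rank the correlations within each segment (TARGET itself is excluded)
top_by_target = {
    target_value: top_correlations(segment.drop(columns="TARGET"), n=3)
    for target_value, segment in demo.groupby("TARGET")
}
for target_value, top in top_by_target.items():
    print("TARGET =", target_value)
    print(top)
```

Comparing the two ranked lists side by side shows whether any variable pair behaves differently for clients with payment difficulties.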

In [None]:
# Plot income distribution of non-defaulters for different family status

plt.figure(figsize=[15,7])
sns.boxplot(data=ad_target_0, x='NAME_EDUCATION_TYPE', y='AMT_INCOME_TOTAL',hue='NAME_FAMILY_STATUS')
plt.title('Income Distribution Of Non-Defaulters For Different Family Status\n', fontsize=20)
plt.xlabel('Education Type', fontsize=15)
plt.ylabel('Income', fontsize=15)
plt.yscale('log')
plt.show()

**Inference:**
* Non-married clients with academic degrees have a much higher minimum whisker than all other categories.
* Married clients with higher education or secondary/secondary special education have significant outliers on the higher side.
* There are no lower outliers for any category.

In [None]:
# Plot income distribution of defaulters for different family status

plt.figure(figsize=[15,7])
sns.boxplot(data=ad_target_1, x='NAME_EDUCATION_TYPE', y='AMT_INCOME_TOTAL',hue='NAME_FAMILY_STATUS')
plt.title('Income Distribution Of Defaulters For Different Family Status', fontsize=20)
plt.xlabel('Education Type', fontsize=15)
plt.ylabel('Income', fontsize=15)
plt.yscale('log')
plt.show()

**Inference:**
* For majority of defaulting clients, across all education types, the income is comparatively on the lower side compared to non-defaulters.
* Exceptions to this are outliers among married clients with higher education or secondary/secondary special education, who are defaulting despite higher income.
* Widows and separated clients with an academic degree appear to be facing the least payment difficulties.

In [None]:
# Plot Correlation Distribution Of Family Size v. Loan Amount

fig, (ax1, ax2) = plt.subplots(1,2, figsize=[15,7], sharey=True)
sns.scatterplot(data=ad_target_0,x='AMT_CREDIT', y='CNT_FAM_MEMBERS', ax=ax1).set(title='Without Payment Difficulties', xlabel='Credit Amount Of The Loan', ylabel='Family Size Of Client')
sns.scatterplot(data=ad_target_1,x='AMT_CREDIT', y='CNT_FAM_MEMBERS', ax=ax2).set(title='With Payment Difficulties', xlabel='Credit Amount Of The Loan', ylabel='Family Size Of Client')
plt.suptitle('Correlation Distribution Of Family Size v. Loan Amount', fontsize=20)
plt.show()

**Inference:**
* There is no correlation between family size of client and the credit amount, for both defaulters and non-defaulters.
* In fact, clients with extremely large families haven't had payment difficulties.

In [None]:
# Plot Correlation Distribution Of Goods Price v. Loan Amount

fig, (ax1,ax2) = plt.subplots(1,2, figsize=[15,7], sharey=True)
sns.scatterplot(data=ad_target_0,x='AMT_GOODS_PRICE', y='AMT_CREDIT', ax=ax1).set(title='Without Payment Difficulties', ylabel='Credit Amount Of The Loan', xlabel='Price Of The Goods for Which The Loan Is Given')
sns.scatterplot(data=ad_target_1,x='AMT_GOODS_PRICE', y='AMT_CREDIT', ax=ax2).set(title='With Payment Difficulties', ylabel='Credit Amount Of The Loan', xlabel='Price Of The Goods for Which The Loan Is Given')
plt.suptitle('Correlation Distribution Of Goods Price v. Loan Amount', fontsize=20)
plt.show()

**Inference:**
* There is a linear correlation between credit amount of loan and price of goods for which the loan is given. 
* This shows that when the price of the goods increases, the credit amount of the loan also increases.

In [None]:
# Plot Correlation Distribution Of clients' region population v. Loan Amount

fig, (ax1,ax2) = plt.subplots(1,2, figsize=[15,7], sharey=True)
sns.scatterplot(data=ad_target_0,y='AMT_CREDIT', x='REGION_POPULATION_RELATIVE', ax=ax1).set(title='Without Payment Difficulties', ylabel='Credit Amount', xlabel='Normalized Region Population')
sns.scatterplot(data=ad_target_1,y='AMT_CREDIT', x='REGION_POPULATION_RELATIVE', ax=ax2).set(title='With Payment Difficulties', ylabel='Credit Amount', xlabel='Normalized Region Population')
plt.suptitle('Correlation Distribution Of Region Population v. Loan Amount', fontsize=20)
plt.show()

**Inference:**
* For clients without payment difficulties, there is no visible correlation between client's region population and credit amount.
* For clients with payment difficulties, those with a higher credit amount and very low region population show noticeable outliers, compared to clients in regions with higher population.

In [None]:
# Numeric features from application data
numeric_features_ad

In [None]:
# Correlation HeatMap of Applicant Data
ad_heatmap_data = [ad.AMT_INCOME_TOTAL,ad.AMT_CREDIT,ad.TARGET,ad.CODE_GENDER,ad.CNT_CHILDREN,ad.AMT_ANNUITY,ad.DAYS_EMPLOYED,
              ad.REGION_RATING_CLIENT,ad.REGION_POPULATION_RELATIVE,ad.FLAG_MOBIL,ad.NAME_HOUSING_TYPE,ad.DAYS_BIRTH]
ad_heatmap_headers = ["AMT_INCOME_TOTAL","AMT_CREDIT","TARGET","CODE_GENDER","CNT_CHILDREN","AMT_ANNUITY","DAYS_EMPLOYED",
              "REGION_RATING_CLIENT","REGION_POPULATION_RELATIVE","FLAG_MOBIL","NAME_HOUSING_TYPE","DAYS_BIRTH"]

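ad_heatmap = pd.concat(ad_heatmap_data, axis=1, keys=ad_heatmap_headers)

# CODE_GENDER and NAME_HOUSING_TYPE are text columns, so restrict the correlation to numeric columns
corr = ad_heatmap.select_dtypes(include=np.number).corr()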
plt.figure(figsize=(15,10))
plt.title('Correlation Heatmap Of Application Data', fontsize=20)
sns.heatmap(corr, annot = True, cmap="Oranges")
plt.show()

**Inference:** 
1. High-income clients take loans with larger credit amounts and pay larger loan annuities as well.
2. Clients in areas with high population density pay higher loan annuities.
3. The chance of payment difficulty is very low for high-income clients. 
4. If the client lives in a high-population area and has a high number of kids, there is a higher chance of payment difficulty.

In [None]:
# Analysing DAYS_BIRTH values (converted to years)

ad.DAYS_BIRTH.describe()

In [None]:
# Plotting Distribution Of Ages based on Payment Difficulties

fig, (ax1,ax2) = plt.subplots(1,2, figsize=[15,7], sharey=True)
chart1 = sns.countplot(x=ad_target_0['DAYS_BIRTH_BINS'],ax=ax1,palette="rocket").set(title='Not facing Payment Difficulties',ylabel='Count', xlabel='Age Groups')
chart2 = sns.countplot(x=ad_target_1['DAYS_BIRTH_BINS'],ax=ax2,palette="rocket").set(title='Facing Payment Difficulties',ylabel='Count', xlabel='Age Groups')
plt.suptitle('Distribution Of Ages based on Payment Difficulties', fontsize=20)
plt.show()

**Inference:**
* The 55 to 60 age group has the highest count both among applicants facing payment difficulties and among applicants not facing them.
* 20 to 30 group and above 65 group face least difficulty in payment.

In [None]:
# Extracting the numeric features from previous application data

numeric_features_pa = []
for col in pa.columns:
    if pa[col].dtype == float or pa[col].dtype == int:
        numeric_features_pa.append(col)
        
print(numeric_features_pa)
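
As an alternative to looping over dtypes, pandas can select numeric columns directly. This is a small sketch on a hypothetical `toy` frame (the column values are made up); `select_dtypes(include="number")` also picks up narrower numeric types such as `int32`, which an equality check against `int` can miss.

```python
import pandas as pd

# Hypothetical toy frame with two numeric columns and one text column.
toy = pd.DataFrame({
    "AMT_CREDIT": [1000.0, 2500.5],
    "CNT_PAYMENT": [12, 24],
    "NAME_CONTRACT_STATUS": ["Approved", "Refused"],
})

# Keep every column whose dtype is numeric, whatever its width.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
print(numeric_cols)
```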

In [None]:
# Checking top 5 rows of previous application data

pa.head(5)

In [None]:
#HeatMap of Previous Applicant Data
pa_heatmap_data = [pa.AMT_ANNUITY,pa.AMT_APPLICATION,pa.AMT_CREDIT,pa.AMT_GOODS_PRICE,pa.NFLAG_LAST_APPL_IN_DAY,pa.DAYS_DECISION,pa.SELLERPLACE_AREA,pa.CNT_PAYMENT]
pa_heatmap_headers = [ 'AMT_ANNUITY', 'AMT_APPLICATION', 'AMT_CREDIT', 'AMT_GOODS_PRICE', 'NFLAG_LAST_APPL_IN_DAY', 'DAYS_DECISION', 'SELLERPLACE_AREA', 'CNT_PAYMENT']

pa_heatmap = pd.concat(pa_heatmap_data, axis=1, keys=pa_heatmap_headers)

corr = pa_heatmap.corr()
plt.figure(figsize=(15,10))
sns.heatmap(corr, annot = True, cmap="YlOrBr")
plt.title('Correlation Heatmap Of Previous Application', fontsize=20)
plt.show()

**Inference:**

1. Highest Correlations:
 * Credit amount and Application Amount: Suggests that most of the loan amounts sanctioned matched the amount the client applied for.
 * Goods Price and Application Amount: Suggests that goods price has a possible correlation with loan amount applied.
 * Credit amount and Goods Price: Above two observations naturally suggest correlation between these two. Same is proven in the heatmap.


2. Lowest Correlations:
 * Last application in day flag and sellerplace area appear to have no correlation with other columns.

In [None]:
# Checking housing-type value counts
ad.NAME_HOUSING_TYPE.value_counts()

In [None]:
# Plot Housing Type vs. Payment difficulty

plt.figure(figsize=(15,7))
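# Pass the column as x= explicitly; newer seaborn versions no longer accept it as a positional argument
sns.countplot(x=ad['NAME_HOUSING_TYPE'],hue=ad['TARGET'])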
plt.xticks(rotation=45)
plt.title("Housing Type vs. Payment difficulty",fontsize=20)
plt.xlabel('Housing Type', fontsize=15)
plt.ylabel('Count', fontsize=15)
plt.show()

In [None]:
# Calculating value count percentage of different housing types for both targets
print('Housing Type percentage for target 0:\n',ad_target_0.NAME_HOUSING_TYPE.value_counts()* 100 / len(ad_target_0))
print('Housing Type percentage for target 1:\n',ad_target_1.NAME_HOUSING_TYPE.value_counts()* 100 / len(ad_target_1))

**Inference:**
* Most applicants live in a house or apartment; however, when target 0 is compared with target 1, those living with parents or in a rented home account for a higher percentage among clients with payment difficulty than among those without. 
* Therefore, along with House / apartment, these two housing types can also be treated as indicators of default. 
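
The percentages above are shares within each target group. A complementary view is the default rate within each housing type, which a row-normalised crosstab gives directly. The sketch below uses a hypothetical `toy` frame with made-up values, not the real dataset.

```python
import pandas as pd

# Hypothetical toy data; category names mirror NAME_HOUSING_TYPE, values are made up.
toy = pd.DataFrame({
    "NAME_HOUSING_TYPE": ["House / apartment"] * 6
                         + ["With parents"] * 2
                         + ["Rented apartment"] * 2,
    "TARGET": [0, 0, 0, 0, 0, 1, 0, 1, 1, 0],
})

# normalize="index" makes each row sum to 1, so column 1 is the default rate per housing type.
rate_by_housing = (
    pd.crosstab(toy["NAME_HOUSING_TYPE"], toy["TARGET"], normalize="index")[1]
    .sort_values(ascending=False)
)
print(rate_by_housing)
```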

## **Merging application data with previous application data**

In [None]:
print("Applicant data shape: ",ad.shape)
print("Previous applicant data shape: ",pa.shape)

In [None]:
#Merging the two datasets using a left join (every previous application is kept)
merge_ad_pa = pd.merge(pa,ad, how='left', on = 'SK_ID_CURR')

In [None]:
#Shape of merged dataframe
merge_ad_pa.shape
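
A left join keeps every previous application, including those whose `SK_ID_CURR` has no match in the application data (their `TARGET` becomes NaN). The sketch below shows one way to check this on hypothetical frames `pa_toy` and `ad_toy` with made-up IDs; `validate` and `indicator` are standard `pd.merge` parameters.

```python
import pandas as pd

# Hypothetical miniature versions of the two datasets.
pa_toy = pd.DataFrame({"SK_ID_PREV": [1, 2, 3, 4], "SK_ID_CURR": [100, 100, 101, 999]})
ad_toy = pd.DataFrame({"SK_ID_CURR": [100, 101, 102], "TARGET": [0, 1, 0]})

merged = pd.merge(
    pa_toy, ad_toy,
    how="left", on="SK_ID_CURR",
    validate="many_to_one",  # raises if SK_ID_CURR is duplicated on the right side
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

# Rows without a matching application have no TARGET value.
unmatched = int((merged["_merge"] == "left_only").sum())
print(len(merged), unmatched)
```

If `unmatched` is large, filters such as `TARGET == 1` silently drop those rows, which is worth knowing before reading the plots that follow.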

In [None]:
#Heatmap
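# Restrict to numeric columns; pandas 2.0 and later raise an error when corr() meets text columns
corr = merge_ad_pa.select_dtypes(include='number').corr()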
plt.figure(figsize=(25,20))
sns.heatmap(corr, annot = True, cmap="twilight_shifted")
plt.title('Correlation Heatmap Of Merged Data', fontsize=20)
plt.show()

* 'Number of Children' is highly correlated with 'Loan Annuity', 'Previous applicant credit amount' and 'Goods price', which suggests that more applications are received from applicants with a higher number of kids. 
* Based on the heatmap, the attributes below show the strongest correlation with the TARGET attribute, although every value is small in absolute terms:
    * DAYS_DECISION - 0.04
    * DAYS_REGISTRATION - 0.043
    * DAYS_ID_PUBLISH - 0.051
    * FLAG_EMP_PHONE - 0.049
    * REGION_RATING_CLIENT - 0.057
    * REGION_RATING_CLIENT_W_CITY - 0.06
    * DAYS_LAST_PHONE_CHANGE - 0.06
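
Rather than reading these values off a large heatmap, the `TARGET` column of the correlation matrix can be ranked directly. This is a minimal sketch on randomly generated, hypothetical data (the `toy` frame and its values are not from the dataset), ranking by absolute value so that negative correlations are not overlooked.

```python
import numpy as np
import pandas as pd

# Hypothetical random data; only the column names echo the real dataset.
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    "TARGET": rng.integers(0, 2, n),
    "DAYS_DECISION": rng.normal(size=n),
    "REGION_RATING_CLIENT": rng.integers(1, 4, n),
    "NAME_CONTRACT_TYPE": rng.choice(["Cash loans", "Revolving loans"], n),
})

# Correlation of every numeric column with TARGET, excluding TARGET itself.
target_corr = toy.select_dtypes(include="number").corr()["TARGET"].drop("TARGET")

# Order by absolute strength while keeping the original sign.
ranked = target_corr.reindex(target_corr.abs().sort_values(ascending=False).index)
print(ranked)
```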

### Let's break down the merged data for easier understanding and further examination of the relevant attributes

In [None]:
#Examining DAYS_DECISION
merge_ad_pa[merge_ad_pa["TARGET"] == 1].DAYS_DECISION.describe()

In [None]:
# Distribution plot of both targets against DAYS_DECISION

fig, (ax1,ax2) = plt.subplots(1,2, figsize=[15,7], sharey=True)
sns.distplot(merge_ad_pa[merge_ad_pa["TARGET"] == 1].DAYS_DECISION.abs(),ax=ax1).set(title='Facing Payment Difficulties')

sns.distplot(merge_ad_pa[merge_ad_pa["TARGET"] == 0].DAYS_DECISION.abs(),ax=ax2).set(title='Not facing Payment Difficulties')
plt.suptitle('DAYS_DECISION', fontsize=20)
plt.show()

**Inference:** 
Both categories, defaulters and non-defaulters, show a similar distribution, so we can ignore this attribute.
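
A visual comparison of two distributions can be backed up with a few numbers. The sketch below compares quartiles per target group on a hypothetical `toy` frame with made-up values; similar quartiles support treating the attribute as uninformative.

```python
import pandas as pd

# Hypothetical toy data; DAYS_DECISION values are made up.
toy = pd.DataFrame({
    "TARGET": [0, 0, 0, 0, 1, 1, 1, 1],
    "DAYS_DECISION": [-100, -400, -900, -2000, -150, -350, -1000, -1800],
})
toy["DAYS_DECISION_ABS"] = toy["DAYS_DECISION"].abs()

# One row per target group, one column per quantile.
quantiles = (
    toy.groupby("TARGET")["DAYS_DECISION_ABS"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
print(quantiles)
```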

In [None]:
#Examining DAYS_REGISTRATION

merge_ad_pa[merge_ad_pa["TARGET"] == 1].DAYS_REGISTRATION.describe()

In [None]:
# Distribution plot of both targets against DAYS_REGISTRATION

fig, (ax1,ax2) = plt.subplots(1,2, figsize=[15,7], sharey=True)
sns.distplot(merge_ad_pa[merge_ad_pa["TARGET"] == 1].DAYS_REGISTRATION.abs(),ax=ax1).set(title='Facing Payment Difficulties')
sns.distplot(merge_ad_pa[merge_ad_pa["TARGET"] == 0].DAYS_REGISTRATION.abs(),ax=ax2).set(title='Not facing Payment Difficulties')
plt.suptitle('DAYS_REGISTRATION', fontsize=20)
plt.show()

**Inference:** 
Both categories, defaulters and non-defaulters, show a similar distribution, so we can ignore this attribute.

In [None]:
#Examining DAYS_ID_PUBLISH
merge_ad_pa[merge_ad_pa["TARGET"] == 1].DAYS_ID_PUBLISH.describe()
 

In [None]:
# Distribution plot of both targets against DAYS_ID_PUBLISH

fig, (ax1,ax2) = plt.subplots(1,2, figsize=[15,7], sharey=True)
sns.distplot(merge_ad_pa[merge_ad_pa["TARGET"] == 1].DAYS_ID_PUBLISH.abs(),ax=ax1).set(title='Facing Payment Difficulties')
sns.distplot(merge_ad_pa[merge_ad_pa["TARGET"] == 0].DAYS_ID_PUBLISH.abs(),ax=ax2).set(title='Not facing Payment Difficulties')
plt.suptitle('How many days before the application did client change the identity document with which he applied for the loan', fontsize=20)
plt.show()

**Inference:** 
Both categories, defaulters and non-defaulters, show a similar distribution, so we can ignore this attribute.

In [None]:
#Examining FLAG_EMP_PHONE

merge_ad_pa[merge_ad_pa["TARGET"] == 1].FLAG_EMP_PHONE.describe()

In [None]:
# Pie chart to check if defaulter clients provided a work phone number

plt.figure(figsize=[6,6])
explode = (0,0.2)
# Labels follow value_counts() order (most frequent first), assuming 1 ('Yes') is the more common value
mylabels = ['Yes','No']
plt.title('Client provided work phone number', fontsize=20)
merge_ad_pa[merge_ad_pa["TARGET"] == 1].FLAG_EMP_PHONE.value_counts().plot.pie(labels=mylabels, explode=explode)
plt.show()

**Inference:** 
Defaulters also provide their contact details, so we cannot infer anything from this attribute.
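
Hard-coded pie labels rely on `value_counts()` returning the values in the expected order. One way to avoid that dependency is to derive the labels from the index itself. The sketch below uses a hypothetical `flags` series and a hypothetical `label_map`.

```python
import pandas as pd

# Hypothetical flag values standing in for FLAG_EMP_PHONE among defaulters.
flags = pd.Series([1, 1, 1, 0, 1, 0, 1], name="FLAG_EMP_PHONE")

counts = flags.value_counts()

# Map each flag value to its label so the labels always follow the data.
label_map = {1: "Yes", 0: "No"}
counts.index = counts.index.map(label_map)
print(counts)

# counts.plot.pie() would now take its slice labels from this index.
```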

In [None]:
# Examining REGION_RATING_CLIENT

merge_ad_pa[merge_ad_pa["TARGET"] == 1].REGION_RATING_CLIENT.describe()

In [None]:
# Pie chart to check defaulter clients vs. region rating of their residency

plt.figure(figsize=[6,6])
mylabels=['Rating 2', 'Rating 3', 'Rating 1']
plt.title('Defaulters v. Region Rating Of Where They Live', fontsize=20)
merge_ad_pa[merge_ad_pa["TARGET"] == 1].REGION_RATING_CLIENT.value_counts().plot.pie(labels=mylabels)
plt.show()

**Inference:** 
Clients who live in regions rated 2 account for the largest share of clients with payment difficulty. 

In [None]:
#Examining REGION_RATING_CLIENT_W_CITY

merge_ad_pa[merge_ad_pa["TARGET"] == 1].REGION_RATING_CLIENT_W_CITY.describe()

In [None]:
# Pie chart to check defaulter clients vs. region rating of their residency along with city

plt.figure(figsize=[6,6])
mylabels=['Rating 2', 'Rating 3', 'Rating 1']
plt.title('Defaulters v. Region And City Rating Of Where They Live', fontsize=20)
merge_ad_pa[merge_ad_pa["TARGET"] == 1].REGION_RATING_CLIENT_W_CITY.value_counts().plot.pie(labels=mylabels)
plt.show()

**Inference:** 
Clients who live in cities and regions rated 2 account for the largest share of clients with payment difficulty.
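
A pie chart of defaulters shows where most defaulters live, which is not the same as which rating carries the highest risk. The sketch below contrasts the two measures on a hypothetical `toy` frame with made-up values: the share of defaulters per rating versus the default rate within each rating.

```python
import pandas as pd

# Hypothetical toy data; ratings and targets are made up for illustration.
toy = pd.DataFrame({
    "REGION_RATING_CLIENT": [1, 1, 2, 2, 2, 2, 2, 2, 3, 3],
    "TARGET":               [0, 0, 0, 0, 0, 0, 1, 1, 1, 0],
})

# What the pie chart shows: how defaulters are split across ratings.
share_of_defaulters = (
    toy.loc[toy["TARGET"] == 1, "REGION_RATING_CLIENT"].value_counts(normalize=True)
)

# Risk per rating: fraction of clients in each rating who default.
default_rate = toy.groupby("REGION_RATING_CLIENT")["TARGET"].mean()

print(share_of_defaulters)
print(default_rate)
```

In this toy data, rating 2 holds most of the defaulters while rating 3 has the higher default rate, so it can help to check both measures on the real data.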

In [None]:
#Examining DAYS_LAST_PHONE_CHANGE

merge_ad_pa[merge_ad_pa["TARGET"] == 1].DAYS_LAST_PHONE_CHANGE.describe()

In [None]:
# Distribution plot of both targets against DAYS_LAST_PHONE_CHANGE

fig, (ax1,ax2) = plt.subplots(1,2, figsize=[20,7], sharey=True)
sns.distplot(merge_ad_pa[merge_ad_pa["TARGET"] == 1].DAYS_LAST_PHONE_CHANGE.abs(),ax=ax1).set(title='Facing Payment Difficulties')
sns.distplot(merge_ad_pa[merge_ad_pa["TARGET"] == 0].DAYS_LAST_PHONE_CHANGE.abs(),ax=ax2).set(title='Not facing Payment Difficulties')
plt.suptitle('How Many Days Before Application Did Client Last Change Phone', fontsize=20)
plt.show()

**Inference:** 
Clients who last changed their phone within a few days of applying are more likely to default. 

## Top 10 Correlations for Defaulters

In [None]:
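#Top 10 correlation
# Restrict to numeric columns; pandas 2.0 and later raise an error when corr() meets text columns
top10_merge = merge_ad_pa.select_dtypes(include='number').corr().unstack().sort_values(ascending=False).drop_duplicates()

#Starting index from 1 to skip the self-correlation value of 1.0 that sorts to the top
top10_merge[1:11]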

In [None]:
# Assigning dataframe as per target 0 and target 1 variable 
target0_merge_ad_pa  = merge_ad_pa[merge_ad_pa["TARGET"] == 0]
target1_merge_ad_pa  = merge_ad_pa[merge_ad_pa["TARGET"] == 1]

In [None]:
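#Top 10 correlation of applicants who did not face problems with payment
# Restrict to numeric columns; pandas 2.0 and later raise an error when corr() meets text columns
top10_merge_target0 = target0_merge_ad_pa.select_dtypes(include='number').corr().unstack().sort_values(ascending=False).drop_duplicates()

#Starting index from 1 to skip the self-correlation value of 1.0 that sorts to the top
top10_merge_target0[1:11]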

In [None]:
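#Top 10 correlation of applicants who faced problems with payment
# Restrict to numeric columns; pandas 2.0 and later raise an error when corr() meets text columns
top10_merge_target1 = target1_merge_ad_pa.select_dtypes(include='number').corr().unstack().sort_values(ascending=False).drop_duplicates()

#Starting index from 1 to skip the self-correlation value of 1.0 that sorts to the top
top10_merge_target1[1:11]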
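
`drop_duplicates()` removes repeated correlation values, which also drops distinct column pairs that happen to share the same value, and sorting in descending order ignores strong negative correlations. An alternative sketch keeps only the upper triangle of the matrix, so each pair appears exactly once, and ranks by absolute value. The `toy` frame and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; values are made up for illustration.
toy = pd.DataFrame({
    "AMT_CREDIT":      [100.0, 200.0, 300.0, 400.0, 500.0],
    "AMT_GOODS_PRICE": [90.0, 210.0, 290.0, 410.0, 480.0],
    "AMT_ANNUITY":     [10.0, 18.0, 33.0, 41.0, 47.0],
    "CNT_CHILDREN":    [2.0, 0.0, 1.0, 0.0, 3.0],
})
corr = toy.corr()

# True strictly above the diagonal: drops self-correlations and mirrored pairs.
upper_mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# Masked-out cells become NaN; stack to (column, column) pairs and drop the NaNs.
pairs = corr.where(upper_mask).stack().dropna()

# Rank by absolute strength while keeping the sign of each correlation.
top_pairs = pairs.reindex(pairs.abs().sort_values(ascending=False).index).head(3)
print(top_pairs)
```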

# FINAL OBSERVATIONS

### Important Columns For The Bank To Watch Out For Against Defaults:
* AMT_INCOME_TOTAL
    * Clients with higher income are less likely to have payment difficulties, so the low income groups are more likely to default. 
    * Most defaulters, both male and female, come from the 100000 to 150000 income range.
        * There are more female defaulters in the income range of 250000 and below.
        * There are more male defaulters in the income range of 250000 and above.
* AMT_CREDIT
    * Among clients with payment difficulties, those with higher credit amounts in regions with very low population show noticeable outliers compared to clients in regions with higher population.
* NAME_FAMILY_STATUS
    * Non-married clients with academic degrees have a much higher minimum whisker than all other categories.
    * Married clients with higher education or secondary/secondary special education have significant outliers on the higher side.
* CNT_CHILDREN
    * Clients who live in high-population areas and have a large number of kids have a higher chance of facing payment difficulty.
* NAME_EDUCATION_TYPE
    * Most defaulters came from Secondary and Higher education backgrounds.
    * The fewest defaulters came from an Academic degree background.
* OCCUPATION_TYPE
    * Cooking staff had the lowest count of clients with difficulty in payment of loan.
    * Labourers had the highest count of clients with difficulty in payment of loan.
* NAME_HOUSING_TYPE
    * Most applicants live in a house or apartment; however, when target 0 is compared with target 1, those living with parents or in a rented home account for a higher percentage among clients with payment difficulty than among those without.
    * Therefore, along with House / apartment, these two housing types can also be treated as indicators of default.
    
* Apart from the above, the following attributes can also help identify defaulters: 
    * DAYS_LAST_PHONE_CHANGE
        * Clients who changed their phone within a few days of applying are more likely to default.
    * REGION_RATING_CLIENT and REGION_RATING_CLIENT_W_CITY
        * Clients who live in cities and regions rated 2 account for the largest share of defaulters.

### Important Columns For The Bank To Increase Revenue And Clients:
* DAYS_BIRTH
    * 55 to 60 age bracket has the highest non-default count in absolute numbers.
    * Their default rate is not substantially higher than that of other groups, so they can be targeted to grow client numbers.
* NAME_FAMILY_STATUS
    * More focus can be given to widowed and separated clients, as they are observed to take a good number of loans with a much lower default rate than married or single clients.
* NAME_EDUCATION_TYPE
    * Non-married clients with higher education or secondary/secondary special education have significant outliers on the upper side and can be tapped as less risky clients.
    * Clients with an academic degree also have high income overall.
* OCCUPATION_TYPE
    * Cooking staff and private service staff provide very good volume as well as a very low chance of payment issues.
* REGION_POPULATION_RELATIVE
    * Cities with medium to high region population have a very low default rate relative to loan amounts. They can be focused on for bigger loans to increase revenue.


# ****END****

Hope this will help people in some way. Please don't forget to upvote if this notebook helped you. Thank you.