# Loan Defaulter

This case study shows how EDA is applied in a real business scenario. It also builds a basic understanding of risk analytics in banking and financial services, and of how data is used to minimise the risk of losing money while lending to customers.

It aims to identify patterns that indicate whether a client is likely to have difficulty paying their installments. Such patterns can inform actions such as denying the loan, reducing the loan amount, or lending to risky applicants at a higher interest rate, while ensuring that applicants capable of repaying the loan are not rejected. Identifying such applicants using EDA is the aim of this case study.

In [None]:
# Suppressing the warnings
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Importing the required Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.style as style
import seaborn as sns
import itertools
%matplotlib inline

# Styling the plot
style.use('ggplot')
sns.set_style('whitegrid')

In [None]:
# Adjusting Output Views
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
pd.set_option('display.expand_frame_repr', False)

In [None]:
# Importing the Dataset
app = pd.read_csv('../input/loan-defaulter/application_data.csv')
prev = pd.read_csv('../input/loan-defaulter/previous_application.csv')

# 1. Understanding the dataset

### 1.1.a. Inspecting Application Dataset

In [None]:
app.head()

In [None]:
# Data types of each of the column in application data
app.info(verbose= True)

In [None]:
# Checking statistical information about the numerical columns
app.describe()

### 1.1.b. Inspecting Previous Application Dataset

In [None]:
prev.head()

In [None]:
# Data types of each of the column in previous application data
prev.info(verbose= True)

In [None]:
# Checking statistical information about numerical columns in previous application dataset
prev.describe()

# 2. Data Cleaning

### 2.1.a. Checking NULL values and Unnecessary Variables in Application dataset

In [None]:
# Percentage of null values in each column
round((app.isnull().sum()/len(app)*100.00), 4)

`Observation: There are many columns with more than 40% missing values.`

In [None]:
# Let us plot a bar graph to look at the proportion of null values with a benchmark of 40%
plt.figure(figsize = [25,7])
plt.title("Proportion of Null Values in Application Dataset",fontsize=20)
plt.xlabel("Column Names", fontsize=15)
plt.ylabel("Percentage of null values", fontsize= 15)
plt.xticks(rotation=90)
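# Recent seaborn versions require x and y to be passed as keyword arguments
null_pct = round((app.isnull().sum()/app.shape[0])*100, 2)
ax = sns.barplot(x=null_pct.index, y=null_pct.values, color = 'blue')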
ax.axhline(40, ls='--',color='red')
plt.show()

In [None]:
# Storing the columns names along their %age of NULL values
app_miss = pd.DataFrame((app.isnull().sum()/len(app))*100).reset_index()
app_miss.columns = ['Column Name', 'Null Value %age']

In [None]:
# Columns having 40% or more Null Values
app_miss_40 = app_miss[app_miss['Null Value %age']>=40]
app_miss_40

In [None]:
len(app_miss_40)

`Observation: 49 columns have more than 40% NULL values, most of which relate to apartment details and won't help our analysis. Hence, we will drop these columns.`
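
The threshold check above can be wrapped in a small helper. This is a minimal sketch only: `null_percentage`, `columns_above_threshold` and the toy frame are hypothetical names and values introduced for illustration, not part of the notebook's data.

```python
import pandas as pd

def null_percentage(df: pd.DataFrame) -> pd.Series:
    """Percentage of missing values per column, highest first."""
    return (df.isnull().mean() * 100).sort_values(ascending=False)

def columns_above_threshold(df: pd.DataFrame, threshold: float = 40.0) -> list:
    """Names of columns whose share of missing values is at least `threshold` percent."""
    pct = null_percentage(df)
    return pct[pct >= threshold].index.tolist()

# Toy frame standing in for the application data (made-up values)
toy = pd.DataFrame({
    "A": [1, None, None, 4],      # 50% missing
    "B": [1, 2, 3, 4],            # 0% missing
    "C": [None, None, None, 1],   # 75% missing
})
to_drop = columns_above_threshold(toy, threshold=40.0)
print(to_drop)
```

On the real data the same call would be `columns_above_threshold(app)`, assuming `app` has been loaded as in the cells above.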

In [None]:
app_missing = pd.DataFrame((app.isnull().sum()/len(app))*100).reset_index()
app_missing.columns = ['Column Name', 'Percentage of NULL values']

In [None]:
# Columns having NULL Values
app_missing = app_missing[app_missing['Percentage of NULL values'] > 0]
app_missing

In [None]:
# AMT_ANNUITY
app.AMT_ANNUITY.isnull().sum()

In [None]:
# Only 12 rows have missing values, so we can delete these records
app = app[~app.AMT_ANNUITY.isnull()]

In [None]:
# AMT_GOODS_PRICE
app.AMT_GOODS_PRICE.isnull().sum()

In [None]:
plt.title('AMT_GOODS_PRICE')
ax = sns.boxplot(y = app.AMT_GOODS_PRICE)
plt.show()

In [None]:
# The boxplot clearly shows many outliers, so we impute the NULL values with the median
app.AMT_GOODS_PRICE = app.AMT_GOODS_PRICE.fillna(app.AMT_GOODS_PRICE.median())
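
To illustrate why the median is preferred over the mean when outliers are present, here is a small sketch on made-up prices (the values are hypothetical and only chosen to make the effect visible):

```python
import pandas as pd

# Five ordinary prices, one extreme outlier and one missing value (made-up data)
prices = pd.Series([100.0, 120.0, 110.0, 130.0, 10_000.0, None])

mean_value = prices.mean()      # pulled upwards by the single outlier
median_value = prices.median()  # stays close to the typical price

# Impute the missing entry with the median, as done for AMT_GOODS_PRICE
filled = prices.fillna(median_value)
print(mean_value, median_value)
```

The mean lands far above every ordinary price, while the median remains representative of the bulk of the data.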

In [None]:
# NAME_TYPE_SUITE
app.NAME_TYPE_SUITE.isnull().sum()

In [None]:
# NAME_TYPE_SUITE is a categorical variable with a low NULL percentage,
# so we will use the mode to impute its NULL values
app.NAME_TYPE_SUITE = app.NAME_TYPE_SUITE.fillna(app.NAME_TYPE_SUITE.mode()[0])

In [None]:
# OCCUPATION_TYPE
# This column has more than 30% NULL values. So, we will make a separate category 'Unknown' for the NULL values
app.OCCUPATION_TYPE = app.OCCUPATION_TYPE.fillna('Unknown')

In [None]:
# CNT_FAM_MEMBERS
app.CNT_FAM_MEMBERS.isnull().sum()

In [None]:
# Just 2 records have missing values, so we will delete these rows.
app = app[~app.CNT_FAM_MEMBERS.isnull()]

In [None]:
# DAYS_LAST_PHONE_CHANGE
app.DAYS_LAST_PHONE_CHANGE.isnull().sum()

In [None]:
# Just 1 record has a missing value, so we will delete this row.
app = app[~app.DAYS_LAST_PHONE_CHANGE.isnull()]

In [None]:
# AMT variables
app[['AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY','AMT_REQ_CREDIT_BUREAU_WEEK','AMT_REQ_CREDIT_BUREAU_MON',
         'AMT_REQ_CREDIT_BUREAU_QRT','AMT_REQ_CREDIT_BUREAU_YEAR']].describe()

In [None]:
# These columns hold the number of enquiries, which can only be integers.
# So, we impute the NULL values with the median, as the median is an integer
amt = ['AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY','AMT_REQ_CREDIT_BUREAU_WEEK',
       'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT','AMT_REQ_CREDIT_BUREAU_YEAR']

for i in amt:
    # Assign the result back; an in-place fillna on a column selection may not update app
    app[i] = app[i].fillna(app[i].median())

In [None]:
# SOCIAL variables
app[['OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE',
     'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE']].describe()

In [None]:
# These variables hold the number of people, which can only be integers.
# So, we impute the NULL values with the median.
soc = ['OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE',
       'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE']

for i in soc:
    # Assign the result back; an in-place fillna on a column selection may not update app
    app[i] = app[i].fillna(app[i].median())

In [None]:
# Checking for the NULL values again
print(app.isnull().sum())
print('Shape: ', app.shape)

In [None]:
# Let's check the relationship of the EXT_SOURCE variables with the TARGET variable
source = app[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'TARGET']]

In [None]:
sns.heatmap(source.corr(),
            xticklabels= source.columns,
            yticklabels= source.columns,
            annot=True,
            cmap = 'Reds')
plt.show()

`Observation: Since the correlation of the EXT_SOURCE variables with the TARGET variable is very low, we can drop them.`

In [None]:
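# Adding the remaining SOURCE variables to the above 49 variables which are to be dropped
# (EXT_SOURCE_1 is not added again here, as the totals below imply it is already among the 49)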
app_unwanted = app_miss_40['Column Name'].to_list() + ['EXT_SOURCE_2', 'EXT_SOURCE_3']
len(app_unwanted)

In [None]:
# Importance of FLAG_DOCUMENTS
app_flag = app.loc[:, 'FLAG_DOCUMENT_2':'FLAG_DOCUMENT_21'].copy()   # copy, so that adding TARGET does not touch app
app_flag['TARGET'] = app['TARGET']

In [None]:
# For ease of understanding, replacing the 1 and 0 in TARGET variable with 'Defaulter' and 'Non-Defaulter' respectively
app_flag['TARGET'] = app_flag['TARGET'].replace({1 : 'Defaulter', 0 : 'Non-Defaulter'})

In [None]:
# Plotting the countplot of FLAG_DOCUMENTS with TARGET VARIABLE
fig = plt.figure(figsize=(25,25))
for a, b in zip(app_flag, range(0,20)):                         # range(0,20) : for 20 FLAG_DOCUMENTS
    plt.subplot(4,5,b+1)
    ax = sns.countplot(x = app_flag[a], hue = app_flag['TARGET'], palette = ['blue', 'pink'])
    plt.ylabel("")

`Observation: People who submitted FLAG_DOCUMENT_3 are more likely not to default on the loan, so this must be an important document. The other FLAG_DOCUMENTS follow a similar pattern: even applicants who did not submit them were able to repay the loan. So, we should retain the FLAG_DOCUMENT_3 column and drop the remaining FLAG_DOCUMENTS.`
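
Count plots show absolute numbers, which can hide differences in default rate when one group is much larger than the other. As a complementary check, a grouped mean of `TARGET` gives the default percentage for each flag value. The sketch below uses a made-up frame, and `default_rate_by_value` is a hypothetical helper name:

```python
import pandas as pd

def default_rate_by_value(df: pd.DataFrame, column: str, target: str = "TARGET") -> pd.Series:
    """Percentage of defaulters (target == 1) for each value of `column`."""
    return df.groupby(column)[target].mean().mul(100).round(2)

# Made-up data: 4 applicants who submitted the document, 4 who did not
toy = pd.DataFrame({
    "FLAG_DOCUMENT_3": [1, 1, 1, 1, 0, 0, 0, 0],
    "TARGET":          [0, 0, 0, 1, 0, 0, 1, 1],
})
rates = default_rate_by_value(toy, "FLAG_DOCUMENT_3")
print(rates)
```

Applied to the notebook's data, `default_rate_by_value(app, 'FLAG_DOCUMENT_3')` would return the two rates to compare, assuming `TARGET` is still numeric at that point.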

In [None]:
flag_cols = ['FLAG_DOCUMENT_2', 'FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_6',
             'FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_8', 'FLAG_DOCUMENT_9', 'FLAG_DOCUMENT_10',
             'FLAG_DOCUMENT_11', 'FLAG_DOCUMENT_12','FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_14',
             'FLAG_DOCUMENT_15', 'FLAG_DOCUMENT_16', 'FLAG_DOCUMENT_17', 'FLAG_DOCUMENT_18',
             'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20', 'FLAG_DOCUMENT_21']

In [None]:
app_unwanted = app_unwanted + flag_cols
len(app_unwanted)

In [None]:
# Importance of Contact-related Columns
contact = app.loc[:, 'FLAG_MOBIL':'FLAG_EMAIL'].copy()   # copy, so that adding TARGET does not touch app
contact['TARGET'] = app['TARGET']

In [None]:
# Correlation of Contact-related information with the TARGET variable
plt.figure(figsize = (8,7))
sns.heatmap(contact.corr(),
            cmap="Reds",
            annot=True)
plt.show()

`Observation: As the above heatmap shows, contact-related information has no significant correlation with the TARGET variable and would not affect our analysis. So, we can drop these columns.`

In [None]:
contact_cols = ['FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE',
       'FLAG_PHONE', 'FLAG_EMAIL']
app_unwanted = app_unwanted + contact_cols
len(app_unwanted)

In [None]:
# Dropping these 76 variables from our Application Dataset
app.drop(labels=app_unwanted, axis=1, inplace=True)

In [None]:
# Checking the shape of our dataframe after dropping these columns
app.shape

`Observation: After dropping these unwanted variables, we are left with 46 variables in our application dataset.`

In [None]:
# Checking for the Numerical columns in the remaining dataset
app.describe()

`Observation: The DAYS columns hold negative values. So, let's convert them into positive values.`

In [None]:
# DAYS Columns
app.loc[:,"DAYS_BIRTH": "DAYS_ID_PUBLISH"] = abs(app.loc[:,"DAYS_BIRTH": "DAYS_ID_PUBLISH"])
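
The label slice above depends on the DAYS columns sitting next to each other in the frame. As an alternative sketch, the columns can be selected by their name prefix; note that on the real data this would also pick up any other column starting with `DAYS_`, which may or may not be desired. The toy frame below is made up for illustration.

```python
import pandas as pd

# Made-up frame with two DAYS columns and one unrelated column
toy = pd.DataFrame({
    "DAYS_BIRTH": [-12000, -9000],
    "AMT_CREDIT": [100000.0, 250000.0],
    "DAYS_ID_PUBLISH": [-300, -4500],
})

# Select by prefix instead of by position, then take absolute values
days_cols = [c for c in toy.columns if c.startswith("DAYS_")]
toy[days_cols] = toy[days_cols].abs()
print(toy)
```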

### 2.1.b. Checking NULL values and Unnecessary Variables in Previous Application dataset

In [None]:
# Percentage of null values in each column
round((prev.isnull().sum()/len(prev)*100.00), 4)

In [None]:
# Let us plot a bar graph to look at the proportion of null values with a benchmark of 40%
plt.figure(figsize = [25,7])
plt.title("Proportion of Null Values in Previous Application Dataset",fontsize=20)
plt.xlabel("Columns", fontsize=15)
plt.ylabel("%null values", fontsize= 15)
plt.xticks(rotation=90)
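# Recent seaborn versions require x and y to be passed as keyword arguments
prev_null_pct = round((prev.isnull().sum()/prev.shape[0])*100, 2)
ax = sns.barplot(x=prev_null_pct.index, y=prev_null_pct.values, color = 'blue')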
ax.axhline(40, ls='--',color='red')
plt.show()

`Observation: There are quite a few columns with more than 40% missing values.`

In [None]:
# Storing columns names along their %age of NULL values
prev_miss = pd.DataFrame((prev.isnull().sum()/len(prev))*100).reset_index()
prev_miss.columns = ['Column Name', 'Null Value %age']

In [None]:
# Columns having 40% or more Null Values
prev_miss_40 = prev_miss[prev_miss['Null Value %age']>=40]
prev_miss_40

In [None]:
len(prev_miss_40)

`Observation: 11 columns have more than 40% NULL values. Hence, we will drop these columns.`

In [None]:
prev_miss[prev_miss['Null Value %age'] < 40]

In [None]:
# Adding 4 more variables to the unwanted list which won't help in our analysis
prev_unwanted = prev_miss_40['Column Name'].to_list() + ['WEEKDAY_APPR_PROCESS_START','HOUR_APPR_PROCESS_START',
                        'FLAG_LAST_APPL_PER_CONTRACT','NFLAG_LAST_APPL_IN_DAY']
len(prev_unwanted)

In [None]:
# Dropping these unwanted variables from our Previous Application dataset
prev.drop(labels = prev_unwanted, axis=1, inplace=True)

In [None]:
prev.info(verbose=True)

In [None]:
# Handling NULL values in the previous application dataframe.
# Checking the percentage of null values in each column.
(prev.isnull().sum()/len(prev)).sort_values(ascending=False)*100

In [None]:
# AMT_GOODS_PRICE
prev.AMT_GOODS_PRICE.plot.box();

In [None]:
#As there are many outliers in this column, we will use median to impute null values.
prev.AMT_GOODS_PRICE = prev.AMT_GOODS_PRICE.fillna(prev.AMT_GOODS_PRICE.median())
prev.AMT_GOODS_PRICE.isnull().sum()

In [None]:
# CNT_PAYMENT
prev.CNT_PAYMENT.plot.box();

In [None]:
prev.CNT_PAYMENT.describe()

In [None]:
# Checking the NAME_CONTRACT_STATUS against NULL values in the CNT_PAYMENT.
prev[prev.CNT_PAYMENT.isnull()]["NAME_CONTRACT_STATUS"].value_counts()

In [None]:
# Many of these applications were canceled or refused, meaning the loan never started,
# so there cannot be any payments against them.
prev.CNT_PAYMENT = prev.CNT_PAYMENT.fillna(0)
prev.CNT_PAYMENT.isnull().sum()

In [None]:
# AMT_ANNUITY
prev.AMT_ANNUITY.isnull().sum()
prev.AMT_ANNUITY.plot.box();

In [None]:
# As we can see from the boxplot, AMT_ANNUITY has many outliers, 
# therefore we will impute null values using median in this case too.
prev.AMT_ANNUITY=prev.AMT_ANNUITY.fillna(prev.AMT_ANNUITY.median())
prev.AMT_ANNUITY.isnull().sum()

In [None]:
# PRODUCT_COMBINATION
# As this is a categorical column, we will use mode to impute the missing values.
prev.PRODUCT_COMBINATION = prev.PRODUCT_COMBINATION.fillna(prev.PRODUCT_COMBINATION.mode()[0])
prev.PRODUCT_COMBINATION.isnull().sum()

In [None]:
# AMT_CREDIT
prev.AMT_CREDIT.isnull().sum()

In [None]:
# As there is only a single missing value, we will drop this record
prev = prev[~prev.AMT_CREDIT.isnull()]

In [None]:
prev.shape

### 2.2.a. Inspecting Data Types of Variables in Application Dataset

In [None]:
app.info()

In [None]:
app.nunique().sort_values()

In [None]:
# Converting categorical columns to the category dtype
cat_col = ['NAME_CONTRACT_TYPE','CODE_GENDER','NAME_TYPE_SUITE','NAME_INCOME_TYPE','NAME_EDUCATION_TYPE',
           'NAME_FAMILY_STATUS','NAME_HOUSING_TYPE','OCCUPATION_TYPE','WEEKDAY_APPR_PROCESS_START',
           'ORGANIZATION_TYPE','FLAG_OWN_CAR','FLAG_OWN_REALTY','LIVE_CITY_NOT_WORK_CITY',
           'REG_CITY_NOT_LIVE_CITY','REG_CITY_NOT_WORK_CITY','REG_REGION_NOT_WORK_REGION',
           'LIVE_REGION_NOT_WORK_REGION','REGION_RATING_CLIENT','REGION_RATING_CLIENT_W_CITY']
for a in cat_col:
    app[a] = pd.Categorical(app[a])

In [None]:
app.info()

### 2.2.b. Inspecting Data Types of Variables in Previous Application Dataset

In [None]:
#Converting Categorical columns from Object to categorical 
cat_col = ['NAME_CONTRACT_TYPE', 'NAME_CASH_LOAN_PURPOSE','NAME_CONTRACT_STATUS','NAME_PAYMENT_TYPE',
                    'CODE_REJECT_REASON','NAME_CLIENT_TYPE','NAME_GOODS_CATEGORY','NAME_PORTFOLIO',
                   'NAME_PRODUCT_TYPE','CHANNEL_TYPE','NAME_SELLER_INDUSTRY','NAME_YIELD_GROUP','PRODUCT_COMBINATION']

for i in cat_col:
    prev[i] =pd.Categorical(prev[i])

In [None]:
prev.dtypes

### 2.3.a. Data Engineering on Application Dataset

In [None]:
# Calculating client's age in year format
app['AGE'] = app['DAYS_BIRTH'] // 365

In [None]:
# AGE column
bins = [0, 20, 30, 40, 50, 100]
labels = ['0-20', '20-30', '30-40', '40-50', '50+']
app['AGE_GROUP'] = pd.cut(app.AGE, bins = bins, labels= labels)
app.AGE_GROUP.value_counts(normalize=True)*100
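
One detail of `pd.cut` worth keeping in mind for all the binning in this section: by default the intervals are open on the left and closed on the right, so a value equal to the lowest bin edge falls outside every bin and becomes NaN. The sketch below shows this on made-up ages, along with the `include_lowest=True` option that closes the first interval on the left.

```python
import pandas as pd

ages = pd.Series([0, 20, 21, 30, 55])   # made-up values
bins = [0, 20, 30, 40, 50, 100]
labels = ['0-20', '20-30', '30-40', '40-50', '50+']

# Default behaviour: intervals are (0, 20], (20, 30], ... so 0 is not binned
default_cut = pd.cut(ages, bins=bins, labels=labels)

# include_lowest=True makes the first interval [0, 20]
with_lowest = pd.cut(ages, bins=bins, labels=labels, include_lowest=True)

print(pd.DataFrame({"age": ages, "default": default_cut, "include_lowest": with_lowest}))
```

The same applies to any column whose values can equal the first bin edge, for example a value of 0 when the bins start at 0.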

In [None]:
# AMT_CREDIT
bins = [0,100000,200000,300000,400000,500000,600000,700000,800000,900000,1000000,10000000]
labels = ['0-100K','100K-200K', '200K-300K','300K-400K','400K-500K','500K-600K','600K-700K','700K-800K',
       '800K-900K','900K-1M', '1M Above']
app['AMT_CREDIT_RANGE'] = pd.cut(app.AMT_CREDIT, bins=bins, labels=labels)
app.AMT_CREDIT_RANGE.value_counts(normalize=True)*100

In [None]:
# AMT_INCOME_TOTAL
bins = [0,100000,200000,300000,400000,500000,600000,700000,800000,900000,1000000,10000000]
labels = ['0-100K','100K-200K', '200K-300K','300K-400K','400K-500K','500K-600K','600K-700K','700K-800K',
       '800K-900K','900K-1M', '1M Above']
app['AMT_INCOME_RANGE'] = pd.cut(app.AMT_INCOME_TOTAL, bins=bins, labels=labels)
app.AMT_INCOME_RANGE.value_counts(normalize=True)*100

In [None]:
# AMT_INCOME_RANGE: dropping the rows whose income falls outside the bins defined above
app = app[~app.AMT_INCOME_RANGE.isnull()]

In [None]:
# YEARS_EMPLOYED
app['YEARS_EMPLOYED'] = app['DAYS_EMPLOYED'] // 365
bins = [0,5,10,20,30,40,50,60,1000]
labels = ['0-5','5-10','10-20','20-30','30-40','40-50','50-60','60 above']
app['EMPLOYMENT_YEAR_RANGE'] = pd.cut(app['YEARS_EMPLOYED'],bins=bins,labels=labels)
app.EMPLOYMENT_YEAR_RANGE.value_counts(normalize = True)*100

In [None]:
plt.title('YEARS_EMPLOYED')
ax = sns.histplot(x=app.YEARS_EMPLOYED, kde=True);   # distplot is deprecated in recent seaborn versions
ax.set_facecolor('white')

`Observation: YEARS_EMPLOYED has values of approximately 1000, indicating that the person has been working for 1000 years, which is impossible. Hence this column has a lot of incorrect values. So, we will drop 'DAYS_EMPLOYED', 'YEARS_EMPLOYED' and 'EMPLOYMENT_YEAR_RANGE' so that they do not interfere with our analysis later on.`
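
An alternative to dropping the column outright is to treat the impossible values as missing. The sketch below assumes, hypothetically, that they all stem from a single placeholder value (`PLACEHOLDER_DAYS`, roughly 1000 years); this should be checked with `value_counts()` on the real column before relying on it. The series is made up.

```python
import numpy as np
import pandas as pd

# Assumed placeholder for "no employment information"; verify on the real data first
PLACEHOLDER_DAYS = 365243

# Made-up employment durations in days, already converted to positive values
days_employed = pd.Series([1200, 365243, 3000, 365243, 450])

# Replace the placeholder with NaN so that it no longer distorts the distribution
cleaned = days_employed.replace(PLACEHOLDER_DAYS, np.nan)
years_employed = cleaned // 365
print(years_employed)
```

Whether to keep the cleaned column or drop it, as this notebook does, depends on how many rows carry the placeholder.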

In [None]:
app.drop(['DAYS_EMPLOYED', 'YEARS_EMPLOYED', 'EMPLOYMENT_YEAR_RANGE'], axis = 1, inplace=True)

### 2.3.b. Data Engineering on Previous Application Dataset

In [None]:
# DAYS_DECISION
prev.DAYS_DECISION.describe()

In [None]:
# Converting Negative days value to Positive
prev.DAYS_DECISION = abs(prev.DAYS_DECISION)

In [None]:
sns.histplot(x=prev.DAYS_DECISION, kde=True);   # distplot is deprecated in recent seaborn versions

In [None]:
# Binning the DAYS_DECISION to get a better look at the variable
bins = [0,500,1000,1500,2000,2500,3000]
labels = ['0-500', '500-1000', '1000-1500', '1500-2000', '2000-2500', '2500-3000']
prev['DAYS_DECISION_GROUP'] = pd.cut(prev.DAYS_DECISION, bins = bins, labels=labels)

In [None]:
prev.DAYS_DECISION_GROUP.value_counts()

In [None]:
prev.info()

In [None]:
prev.nunique().sort_values()

### 2.4.a. Outliers in Application Dataset

In [None]:
app.dtypes

In [None]:
plt.figure(figsize=(20,10))

col_1 = ['AMT_ANNUITY','AMT_INCOME_TOTAL','AMT_CREDIT','AMT_GOODS_PRICE','CNT_CHILDREN','DAYS_BIRTH']
for idx, col in enumerate(col_1, start=1):
    plt.subplot(2,3,idx)
    ax = sns.boxplot(y=app[col])
    plt.title(col)
    plt.ylabel("")
    ax.set_facecolor('white')

In [None]:
round(app[['AMT_ANNUITY','AMT_INCOME_TOTAL','AMT_CREDIT','AMT_GOODS_PRICE', 'CNT_CHILDREN','DAYS_BIRTH']].describe(), 2)

Observation:
- AMT_INCOME_TOTAL: There is an outlier very far from the rest of the data. Since this is an income column, the dataset may include some wealthy people.
- 75% of applicants have taken a credit amount of less than 800K.
- AMT_GOODS_PRICE also has a number of outliers, but these goods prices match the credit amounts, since applicants may have taken the loan to purchase those goods.
- The DAYS_BIRTH column doesn't have any outliers.
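
The boxplots flag outliers visually using the 1.5 × IQR whisker rule. To put a number on them, the same rule can be applied directly. This is a minimal sketch on made-up values; `iqr_outlier_count` is a hypothetical helper name.

```python
import pandas as pd

def iqr_outlier_count(series: pd.Series, k: float = 1.5) -> int:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR], the usual boxplot whisker rule."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return int(((series < lower) | (series > upper)).sum())

# Made-up incomes: seven ordinary values and one extreme one
incomes = pd.Series([90, 100, 110, 120, 130, 140, 150, 5000])
n_outliers = iqr_outlier_count(incomes)
print(n_outliers)
```

On the notebook's data this could be called per column, for example `iqr_outlier_count(app['AMT_INCOME_TOTAL'])`.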

### 2.4.b. Outliers in Previous Application Dataset

In [None]:
plt.figure(figsize = (20, 10))

col_2 = ['AMT_ANNUITY','AMT_APPLICATION','AMT_CREDIT','AMT_GOODS_PRICE','DAYS_DECISION','CNT_PAYMENT','SELLERPLACE_AREA']

for idx, col in enumerate(col_2, start=1):
    plt.subplot(2,4,idx)
    ax = sns.boxplot(y=prev[col])
    plt.title(col)
    plt.ylabel("")
    ax.set_facecolor('white')

In [None]:
round(prev[['AMT_ANNUITY','AMT_APPLICATION','AMT_CREDIT','AMT_GOODS_PRICE','SELLERPLACE_AREA',
            'DAYS_DECISION','CNT_PAYMENT']].describe(), 2)

Observation:
- DAYS_DECISION has only a few outliers, indicating that the previous application decisions were taken within a few days of when the application was last made.
- AMT_ANNUITY, AMT_APPLICATION, AMT_CREDIT, AMT_GOODS_PRICE and SELLERPLACE_AREA are compactly distributed but have a considerable number of very large outliers.
- AMT_GOODS_PRICE has a number of outliers. To buy these goods, people apply for larger loans (which explains the outliers in AMT_APPLICATION) and hence receive larger credit amounts, which explains the outliers in the AMT_CREDIT column.

# 3. Data Analysis

### 3.1.a. Data Imbalance in Application Dataset

In [None]:
# Checking if the data is imbalanced
# value_counts() sorts by frequency, so the majority class (Non-Defaulter, 0) comes first
Imbalance = app.TARGET.value_counts()

ax = Imbalance.plot.pie(autopct='%1.1f%%', figsize=(10,10),
                        labels=["Non-Defaulters","Defaulters"], shadow=True,
                        explode=(0.2,0), legend=False, colors=["#DC143C","#FFF8DC"])
ax.set_ylabel("")
ax.set_title("Imbalance Plotting")
ax.set_facecolor("white")
plt.show()

In [None]:
# Calculating the percentage of defaulters and non-Defaulters in the dataset, using the target column.
# Non-Defaulter ---> 0 
# Defaulter ---> 1
Defaulter_percent = round((np.sum(app.TARGET) / len(app))*100, 2)
Non_Defaulter_percent = round((app.shape[0]-np.sum(app.TARGET))/app.shape[0]*100,2)
print("Percentage of Non-Defaulter and Defaulter records:",
      Non_Defaulter_percent,"% and",Defaulter_percent, '%')
print("Imbalance Ratio of Non-Defaulter to Defaulter in the data is:",
      round(Non_Defaulter_percent/Defaulter_percent,2),": 1") 

### 3.2. Univariate Analysis

#### 3.2.a. Categorical Variables

In [None]:
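# Customized function for Univariate Categorical Analysis

def cat_univ(data, var, broad=False, log = False,
             loc = 'upper right', label_rotate = False):
    # data         : DataFrame containing `var` and the TARGET column
    # var          : categorical variable under analysis
    # broad        : use a wider figure
    # log          : use a log scale on the count axis
    # loc          : legend location for the count plot
    # label_rotate : rotate the x tick labels by 90 degrees
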
    if broad:
        fig, ax = plt.subplots(1,2, figsize = (20,7))
    else:
        fig, ax = plt.subplots(1,2,figsize=(10,5))

    def_perc = data[[var,"TARGET"]].groupby(var, as_index=False).mean()
    def_perc['TARGET'] = def_perc['TARGET']*100
    def_perc.sort_values(by='TARGET', ascending=False, inplace=True)
    
    # Subplot 1
    ax1 = sns.countplot(data = data, x = var, hue = 'TARGET', ax = ax[0])
    ax1.set_title(var, fontdict = {'fontsize': 13, 'color': 'red'})
    ax1.legend(['Non-Defaulter', 'Defaulter'], loc = loc)
    ax1.set_ylabel('Number of Customers')
    if (label_rotate):
        ax1.set_xticklabels(ax1.get_xticklabels(), rotation = 90)
    
    if log:                              # Using log scale to increase the readability of the graph
        ax1.set_yscale('log')
        ax1.set_ylabel("Count (log)",fontdict={'fontsize' : 10, 'fontweight' : 3})  
    
    # Subplot 2
    ax2 = sns.barplot(data=def_perc ,x=var, y="TARGET", palette='Set1', ax = ax[1], order=def_perc[var])
    ax2.set_title('Percentage Defaulters', fontdict = {'fontsize': 13, 'color': 'red'})
    ax2.set_ylabel('Percentage of Defaulters')
    if (label_rotate):
        ax2.set_xticklabels(ax2.get_xticklabels(), rotation = 90)

In [None]:
# NAME_CONTRACT_TYPE
cat_univ(app, 'NAME_CONTRACT_TYPE', broad = True)

Observation:
- Very few customers have revolving loans, and about 5% of them have not repaid the loan.
- Approximately 8% of people with cash loans have not repaid the loan.

In [None]:
# CODE_GENDER
cat_univ(app, 'CODE_GENDER')

Observation:
- Female customers have taken more loans, but male customers have a higher percentage of defaulters.

In [None]:
# FLAG_OWN_CAR
cat_univ(app, 'FLAG_OWN_CAR')

Observation:
- People who don't own a car have taken more loans, but the percentage of defaulters is almost the same for both categories.

In [None]:
# FLAG_OWN_REALTY
cat_univ(app, 'FLAG_OWN_REALTY')

Observation:
- People who own realty have taken more loans, but the percentage of defaulters is almost the same for both categories.

In [None]:
# NAME_HOUSING_TYPE
cat_univ(app, 'NAME_HOUSING_TYPE', label_rotate=True, broad = True)

Observation:
- People living in a House/Apartment have taken the most loans.
- The percentage of people who have not repaid their loans is higher among those living in rented apartments or with their parents.

In [None]:
# NAME_FAMILY_STATUS
cat_univ(app, 'NAME_FAMILY_STATUS', label_rotate=True, broad = True)

Observation:
- Married people have taken the most loans.
- Applicants in a civil marriage and single/unmarried applicants have the highest default rates.
- Widows have the lowest default rate.

In [None]:
# NAME_EDUCATION_TYPE
cat_univ(app, 'NAME_EDUCATION_TYPE', loc = 'upper left', label_rotate=True, broad = True)

Observation: 
- People with Secondary/secondary special education have taken the most loans.
- People with Lower secondary education have taken very few loans but have the highest default percentage.
- People with an Academic degree have a default rate of less than 2%.

In [None]:
# NAME_INCOME_TYPE
cat_univ(app, 'NAME_INCOME_TYPE', loc = 'upper left', label_rotate=True, broad = True)

Observation:
- Working people have the highest number of loans.
- Applicants on maternity leave have taken significantly fewer loans but have a default rate of approximately 40%, the highest of any category.
- Unemployed applicants have a default rate of more than 35%.

In [None]:
# REGION_RATING_CLIENT
cat_univ(app, 'REGION_RATING_CLIENT', broad = True)

Observation:
- Most of the people who have applied for loans live in regions with REGION_RATING_CLIENT 2.
- Applicants living in regions rated 1 have the lowest default rate, whereas applicants living in regions rated 3 have the highest.

In [None]:
# OCCUPATION_TYPE
cat_univ(app, 'OCCUPATION_TYPE',log = True, loc ='upper left', broad = True, label_rotate=True)

Observation:
- Laborers have taken the most loans, followed by Sales staff, Core staff and Managers.
- Low-skill Laborers have the highest default rate (approx. 17%), followed by Waiters/Barmen staff, Drivers and Secondary staff.
- For a very high number of applications, the occupation type is unknown.

In [None]:
# ORGANIZATION_TYPE
cat_univ(app, 'ORGANIZATION_TYPE', broad=True, log = True, label_rotate=True)

Observation:
- Most of the people who applied for a loan are from Business Entity Type 3.
- For a very high number of applications, the organization type information is missing (XNA).
- Industry type 12 and Trade type 4 have fewer defaulters (less than 4%), so applicants from these organizations can be trusted.
- Transport type 3 has more than 15% defaulters, making it the category with the highest default rate.

In [None]:
# FLAG_DOCUMENT_3
cat_univ(app, 'FLAG_DOCUMENT_3', log = False)

Observation:
- The percentage of defaulters is similar for both groups; it does not depend on whether the applicant has submitted FLAG_DOCUMENT_3.

In [None]:
# AGE_GROUP
cat_univ(app, 'AGE_GROUP', log = True, broad = True)

Observation:
- All age groups from 20 to 50+ have applied for loans, and most defaulters are in the 20-40 age group.

In [None]:
# AMT_CREDIT_RANGE
cat_univ(app, 'AMT_CREDIT_RANGE', label_rotate=True, broad = True)

Observation:
- People with credit amount between 400K-600K tend to default more than others.

In [None]:
# AMT_INCOME_RANGE
cat_univ(app, 'AMT_INCOME_RANGE', broad=True, label_rotate = True)

Observations:
- Most of the applicants have an income of less than 300K.
- Applicants with an income of more than 700K are less likely to default.
- Applicants with an income of less than 300K are more likely to default.

#### 3.2.b. Numerical Univariate and Bivariate Analysis

In [None]:
# Dividing the dataset on the basis of TARGET column
Non_Defaulter_df = app.loc[app['TARGET']==0] # Non-Defaulters
Defaulter_df = app.loc[app['TARGET']==1] # Defaulters

In [None]:
# Plotting pairplots for the AMOUNT with respect to the TARGET variable.
amt = app[[ 'AMT_INCOME_TOTAL','AMT_CREDIT','AMT_ANNUITY', 'AMT_GOODS_PRICE', 'TARGET']]
ax = sns.pairplot(amt, hue = 'TARGET', palette='husl')
ax.fig.legend(['Non-Defaulter', 'Defaulter'])
plt.show()

Observations from individual Distribution plots `(Univariate Numerical Analysis)`:
- Most loans are given for a goods price below 10 lakhs.
- Most people pay an annuity below 50000 for the credit loan.
- The credit amount of the loan is mostly less than 10 lakhs.
- The repayer and defaulter distributions overlap in all the plots, and hence we cannot use any of these variables in isolation to make a decision.

Observations from Scatter Plots `(Bivariate Analysis)`:
- AMT_CREDIT and AMT_GOODS_PRICE are highly correlated: the points form a straight line, showing a linear relationship between the two.
- As AMT_CREDIT and AMT_GOODS_PRICE exceed 3M, the proportion of defaulters decreases significantly.
- When AMT_ANNUITY > 150K and AMT_CREDIT > 3M, the percentage of defaulters decreases.

### 3.4. Correlation

#### 3.4.a. Correlation for Non-Defaulters

In [None]:
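# Reassigning instead of dropping in place, since Non_Defaulter_df is a slice of app
Non_Defaulter_df = Non_Defaulter_df.drop(['SK_ID_CURR', 'TARGET', 'DAYS_BIRTH'], axis = 1)
# Restricting the correlation to numeric columns, as the frame also holds categorical columns
Non_Defaulter_corr = Non_Defaulter_df.corr(numeric_only = True)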

In [None]:
# Correlation for Non-Defaulter
mask = np.zeros_like(Non_Defaulter_corr)
mask[np.triu_indices_from(mask)] = True
f, ax = plt.subplots(figsize = (11,9))
with sns.axes_style('white'):
    ax = sns.heatmap(Non_Defaulter_corr, mask = mask, square=True, linewidths=1, cmap = 'YlGnBu')

Observation for Non-Defaulter Data:
- These variables are intercorrelated with each other:
  1. AMT_CREDIT
  2. AMT_INCOME_TOTAL
  3. AMT_GOODS_PRICE
  4. AMT_ANNUITY
- A high degree of correlation can be seen between CNT_CHILDREN and CNT_FAM_MEMBERS.

In [None]:
# Top-10 Correlation for the Non-Defaulter
Non_Defaulter_corr_10 = Non_Defaulter_corr.unstack().reset_index()
Non_Defaulter_corr_10.columns = ['Column 1', 'Column 2', 'Correlation']
Non_Defaulter_corr_10['Correlation'] = abs(Non_Defaulter_corr_10['Correlation'])
Non_Defaulter_corr_10 = Non_Defaulter_corr_10[Non_Defaulter_corr_10['Correlation'] != 1 ]
Non_Defaulter_corr_10.sort_values(by = 'Correlation', ascending = False, inplace = True)
Non_Defaulter_corr_10.drop_duplicates(subset = 'Correlation', keep = 'first', inplace = True)
Non_Defaulter_corr_10.head(10)
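
The cell above removes mirrored pairs by dropping duplicate correlation values, which would also discard two different pairs that happen to share exactly the same value. A hedged alternative sketch keeps only the upper triangle of the matrix, so each pair appears once by construction. `top_correlations` and the toy frame are hypothetical and used only for illustration.

```python
import numpy as np
import pandas as pd

def top_correlations(df: pd.DataFrame, k: int = 10) -> pd.Series:
    """Return the k largest absolute pairwise correlations, one entry per column pair."""
    corr = df.select_dtypes("number").corr().abs()
    # Boolean mask of the strict upper triangle: excludes the diagonal and mirrored pairs
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    return pairs.sort_values(ascending=False).head(k)

# Made-up numeric data
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 11],
    "c": [5, 3, 4, 1, 2],
})
top = top_correlations(toy, k=2)
print(top)
```

The result is a Series indexed by column pairs, so `top.reset_index()` gives a three-column table similar to the one built above.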

#### 3.4.b. Correlation for Defaulters

In [None]:
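# Reassigning instead of dropping in place, since Defaulter_df is a slice of app
Defaulter_df = Defaulter_df.drop(['SK_ID_CURR', 'TARGET', 'DAYS_BIRTH'], axis = 1)
# Restricting the correlation to numeric columns, as the frame also holds categorical columns
Defaulter_corr = Defaulter_df.corr(numeric_only = True)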

In [None]:
# Correlation for Defaulter
mask = np.zeros_like(Defaulter_corr)
mask[np.triu_indices_from(mask)] = True
f, ax = plt.subplots(figsize = (11,9))
with sns.axes_style('white'):
    ax = sns.heatmap(Defaulter_corr, mask = mask, square=True, linewidths=1, cmap = 'YlGnBu')

Observation for Defaulter Data:
- These variables are intercorrelated with each other:
  1. AMT_CREDIT
  2. AMT_INCOME_TOTAL
  3. AMT_GOODS_PRICE
  4. AMT_ANNUITY
- A high degree of correlation can be seen between CNT_CHILDREN and CNT_FAM_MEMBERS.

In [None]:
# Top-10 Correlation for the Defaulter
Defaulter_corr_10 = Defaulter_corr.unstack().reset_index()
Defaulter_corr_10.columns = ['Column 1', 'Column 2', 'Correlation']
Defaulter_corr_10['Correlation'] = abs(Defaulter_corr_10['Correlation'])
Defaulter_corr_10 = Defaulter_corr_10[Defaulter_corr_10['Correlation'] != 1 ]
Defaulter_corr_10.sort_values(by = 'Correlation', ascending = False, inplace = True)
Defaulter_corr_10.drop_duplicates(subset = 'Correlation', keep = 'first', inplace = True)
Defaulter_corr_10.head(10)

### 3.5. Merged DataFrame

In [None]:
# Merging application DataFrame and Previous Application dataframe.
combined = pd.merge(app,prev,how="inner",on="SK_ID_CURR")
combined.head()
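
Because one client can have several previous applications, the inner merge repeats each client's application row once per previous application, and drops clients that appear in only one of the two tables. The sketch below shows this on made-up frames; the `validate` argument is an optional safeguard that raises an error if `SK_ID_CURR` is not unique on the left side.

```python
import pandas as pd

# Made-up stand-ins for the two datasets
app_toy = pd.DataFrame({"SK_ID_CURR": [1, 2, 3], "TARGET": [0, 1, 0]})
prev_toy = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2, 4],
    "NAME_CONTRACT_STATUS": ["Approved", "Refused", "Approved", "Canceled"],
})

# Inner merge: keeps only clients present in both frames
merged = pd.merge(app_toy, prev_toy, how="inner", on="SK_ID_CURR",
                  validate="one_to_many")

# Number of merged rows per client, i.e. how often each application row is repeated
rows_per_client = merged.groupby("SK_ID_CURR").size()
print(merged)
print(rows_per_client)
```

Columns that exist in both frames under the same name, such as `AMT_CREDIT` or `NAME_CONTRACT_TYPE` here, receive the `_x` and `_y` suffixes in the merged result.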

In [None]:
combined.shape

In [None]:
# Checking the Statistics of the combined dataframe.
round(combined.describe(),2)

In [None]:
# Dividing combined dataset on the basis of target column.
combined_Def= combined[combined.TARGET==1] # Defaulters
combined_Non_Def= combined[combined.TARGET==0] # Non-Defaulters

In [None]:
# Checking shape of these datasets.
print(" Combined_Def:",combined_Def.shape,"\n","Combined_Non_Def:", combined_Non_Def.shape)

In [None]:
# Customized function for plotting loan purpose vs loan status for defaulters and non-defaulters separately.
def combined_univ(data, col, hue, log):
    
    plt.figure(figsize=(20,7))
    ax=sns.countplot(x=col, 
                  data=data,
                  hue= hue,
                  palette= 'Set1',
                  order=data[col].value_counts().index)
    

    if log:
        plt.yscale('log')
        
    if data is combined_Def:
        plt.title("Purpose of loan vs diff. Loan Status for Defaulters" ,
                  fontdict={'fontsize' : 20, 'fontweight' : 5, 'color' : 'Blue'})
    elif data is combined_Non_Def:
        plt.title("Purpose of loan vs diff. Loan Status for Non-Defaulters" ,
                  fontdict={'fontsize' : 20, 'fontweight' : 5, 'color' : 'Blue'})
    else:
        plt.title("Purpose of loan vs diff. Loan for all the applicants",
                  fontdict={'fontsize' : 20, 'fontweight' : 5, 'color' : 'Blue'})

    plt.legend(loc = "upper right")
    plt.xticks(rotation=90, ha='right')
    
    plt.show()

In [None]:
# Countplot for all the applicants
combined_univ(data = combined, col = 'NAME_CASH_LOAN_PURPOSE', hue = 'NAME_CONTRACT_STATUS',log=True)
# countplot for Defaulters
combined_univ(data = combined_Def, col = 'NAME_CASH_LOAN_PURPOSE', hue = 'NAME_CONTRACT_STATUS', log = True)

Observation:
- The loan purpose is unknown for a very large number of applicants.
- Banks have rejected a large number of applications for Repairs and Other purposes, and applicants have also refused these offers more often.
- For a few purposes, the proportion of Non-Defaulters is significantly higher.
   
   They are:
   1. 'Buying a garage'
   2. 'Business development'
   3. 'Buying land'
   4. 'Buying a new car'
   5. 'Education'
   
   Hence we can focus on these purposes, for which the default percentage is lower.
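
Count plots on a log scale make proportions hard to read, so the claim about default percentages can be checked numerically. The sketch below is one possible way to do that, assuming `TARGET` is coded 1 for defaulters as in the cells above; the helper name `default_rate_by` and the toy rows are hypothetical.

```python
import pandas as pd

def default_rate_by(data, col, target='TARGET'):
    """Number of applications and share of defaulters for each category of col."""
    summary = data.groupby(col)[target].agg(applications='count',
                                            default_rate='mean')
    # Lowest default rate first
    return summary.sort_values(by='default_rate')

# Toy rows (hypothetical values), used only to show the call
toy = pd.DataFrame({
    'NAME_CASH_LOAN_PURPOSE': ['Repairs', 'Repairs', 'Education',
                               'Education', 'Education', 'Repairs'],
    'TARGET': [1, 0, 0, 0, 0, 1],
})
rates = default_rate_by(toy, 'NAME_CASH_LOAN_PURPOSE')
print(rates)
```

On the merged data the call would be `default_rate_by(combined, 'NAME_CASH_LOAN_PURPOSE')`. Categories with very few applications should be read with caution, which is why the count is reported next to the rate.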

In [None]:
# Checking the Contract Status based on loan repayment status and whether there is any business loss or financial loss.
plt.figure(figsize = (10,5))
ax=sns.countplot(data=combined,x="NAME_CONTRACT_STATUS",hue="TARGET",palette=["blue","pink"])
ax.legend(['Non-Defaulter', 'Defaulter'])
plt.title("Contract Status vs TARGET" , fontdict={'fontsize' : 15, 'fontweight' : 5, 'color' : 'Blue'})
plt.show()

Observation:
- 90% of clients who previously cancelled a loan have successfully repaid their current loan. The bank should record the
  reason for these cancellations and adjust its policies accordingly, as these clients can be potential customers
  for the bank.
- A major portion of clients who were previously refused a loan have paid back their current loan. The refusal reason should be recorded for further analysis, as these clients could turn into repaying customers.
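
Percentages like these can be read off a row-normalised cross-tabulation rather than estimated from bar heights. The following is a minimal sketch on toy rows with hypothetical values; the column labels `Non-Defaulter %` and `Defaulter %` are assumptions added for readability.

```python
import pandas as pd

# Toy rows (hypothetical values) standing in for the merged DataFrame
toy = pd.DataFrame({
    'NAME_CONTRACT_STATUS': ['Approved', 'Approved', 'Refused', 'Refused',
                             'Refused', 'Refused', 'Canceled', 'Canceled'],
    'TARGET': [0, 0, 0, 0, 0, 1, 0, 1],
})

# normalize='index' makes each row (previous contract status) sum to 1
share = pd.crosstab(toy['NAME_CONTRACT_STATUS'], toy['TARGET'],
                    normalize='index').mul(100).round(1)
share.columns = ['Non-Defaulter %', 'Defaulter %']
print(share)
```

Replacing `toy` with `combined` gives the repayment share for each previous contract status in the notebook's data.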

In [None]:
# Plotting the relationship between the number of people in the client's social circle
# who defaulted in the last 60 days (DEF_60_CNT_SOCIAL_CIRCLE) and the contract status.
plt.figure(figsize = (10,5))
ax= sns.pointplot(data=combined, x="NAME_CONTRACT_STATUS", y='DEF_60_CNT_SOCIAL_CIRCLE',
                  hue="TARGET", palette=["blue","pink"])

Observation:
- Client groups with an average DEF_60_CNT_SOCIAL_CIRCLE of 0.13 or higher tend to default more, so the client's social circle should be analysed before granting the loan.
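
By default `sns.pointplot` draws the mean of `y` for each `x`/`hue` group, so the values behind the plot can be tabulated with a group-by. The sketch below uses toy rows with hypothetical values; missing counts are skipped by `mean()`.

```python
import numpy as np
import pandas as pd

# Toy rows (hypothetical values) standing in for the merged DataFrame
toy = pd.DataFrame({
    'NAME_CONTRACT_STATUS': ['Approved'] * 4 + ['Refused'] * 4,
    'TARGET': [0, 0, 1, 1, 0, 0, 1, 1],
    'DEF_60_CNT_SOCIAL_CIRCLE': [0, 0, 1, 0, 0, 1, 1, np.nan],
})

# Mean social-circle default count per previous contract status and TARGET
means = (toy.groupby(['NAME_CONTRACT_STATUS', 'TARGET'])['DEF_60_CNT_SOCIAL_CIRCLE']
            .mean()
            .unstack('TARGET'))
print(means)
```

Running the same group-by on `combined` shows the group means from which a threshold such as the one in the observation can be read.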

# 4. Conclusions

`After analysing the datasets, we can see that quite a few variables indicate whether a client is likely to repay the loan.`

Factors which indicate that the person will be a `Non-Defaulter` are:
1. NAME_EDUCATION_TYPE: People with an Academic degree default less often than other people.
2. NAME_INCOME_TYPE: Students and Businessmen have no defaults.
3. NAME_FAMILY_STATUS: Widows are least likely to default on their loans.
4. AMT_INCOME_TOTAL: Customers with an income between 700K and 800K are least likely to default.
5. ORGANIZATION_TYPE: Clients with Trade type 4 and 6, Industry type 12, and Transport type 1 are least likely to default.

Factors which indicate that the person will be a `Defaulter` are:
1. CODE_GENDER: Male Customers are more likely to default than females.
2. NAME_FAMILY_STATUS: People who are single or in a civil marriage are more likely to default.
3. NAME_INCOME_TYPE: Clients who are on maternity leave or are Unemployed are most likely to default on their payments.
4. NAME_HOUSING_TYPE: People who live in rented apartments or with their parents are more likely to default on the loan.
5. OCCUPATION_TYPE: Low-Skilled Labourers are most likely to default on the loan.
6. AGE_GROUP: People in the 20-40 age group are most likely to default on the loan.

The following variables identify categories of applicants who tend to default. Rather than refusing them outright, the bank can lend to them at a higher interest rate to cushion the default risk and prevent business loss:

  1. NAME_HOUSING_TYPE: People living in rented apartments take a large number of loans but also have a higher default rate, so shutting them out completely would be a loss for the business.
  2. AMT_CREDIT: A large share of applicants earn between 100K and 200K, and they also have a higher default rate, so charging them a higher interest rate would make sense.
  3. NAME_EDUCATION_TYPE: People with Secondary/Special education account for the largest percentage of loan applications, so keeping a nominal interest rate for them would help the business.
  4. NAME_CASH_LOAN_PURPOSE: Loans taken for Repairs have a higher default rate, so the bank charges these clients a higher interest rate, which they cannot bear and which leads them to cancel the loan at other stages of the application.
  5. OCCUPATION_TYPE: Quite a few low-skilled and other labourers apply for loans, and they also have a higher default rate. Since these applicants also have low incomes, the bank should set a moderate interest rate that does not itself push them into default.

`More Suggestions:`
- A significant number of people who had cancelled a loan application have now turned into Repayers. The bank could collect information on what made them cancel and improve those services for its clients.
- Almost 85% of the clients who were refused a loan in a previous application have repaid their loan or have no difficulty repaying it. Refusing these clients any further would be bad for business, so the bank should recheck the reasons behind these refusals.