### Background
Suppose you are a Data Analyst at a private bank or a loan distribution firm. Your organization receives many applications every day. While processing them, you sometimes reject applicants who would have repaid their loans on time and end up sanctioning loans to applicants who later turn out to be defaulters.

### Datasets
You are now provided with two datasets:
> 1. Current_app: This file gives you information on the existing loan applications, including whether or not the clients have payment difficulties.
> 2. Previous_app: This file contains information on the previous loan applications, with the status of each application being Approved, Cancelled, Refused or Unused offer.

***Exploratory Data Analysis is really fun!*** You get to choose how to approach the problem within the defined objectives. In this analysis, you are required to identify patterns in the loan applications and recommend how the bank/firm can build its loan portfolio and avoid giving loans to defaulters. You also have to recommend ways in which the bank/firm can maximize the loans it sanctions to clients who can repay their installments. This can be really tricky, since new clients with no credit history may take advantage of the bank and turn out to be defaulters in the future.

**Let's explore the analysis and approach that I have considered here. If you have comments or doubts, feel free to add them!**

### Importing Important Libraries

In [None]:
# Importing the required libraries

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

pd.set_option('display.max_rows',200)


# Filtering out the warnings

import warnings
warnings.filterwarnings('ignore')

In [None]:
#Listing the dataset files available for our analysis
#(numpy and pandas are already imported in the cell above)

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

## <u> Steps to Analysis </u>

### Working on Application Data file & Previous Data file
1. Understanding the dataset

2. Data Cleaning (Null Value treatment, Reporting Outliers, Normalize data type for columns)

3. Univariate Analysis

4. Bivariate Analysis

5. Correlation for Key Attributes


#### Reporting Conclusions based on the Analysis for Application & Previous Data file

*Comments/Inferences will be reported at required instances throughout the notebook.*

## <u>Application Dataset </u>

In [None]:
#Reading "current_app.csv" (the application data) into the dataframe
appdf = pd.read_csv('/kaggle/input/credit-analysis/current_app.csv')
appdf.head()

In [None]:
#Checking for the shape of the dataframe
appdf.shape

### <u> Comments:</u>
- Nature of contract/loan type is mentioned in the column: NAME_CONTRACT_TYPE
- Total number of Rows is 307511 & columns is 122

In [None]:
#Evaluating the amount of null values in the Application_data file across the 122 columns

appdf.isnull().sum()

In [None]:
#Evaluating the percentage of null values in each column of Application_data

round(((appdf.isnull().sum() / len(appdf))*100),2)

In [None]:
#Dropping columns in which more than 50% of the values are null
#(thresh is the minimum number of non-null values a column needs in order to be kept)

appdf = appdf.dropna(thresh = (len(appdf)*0.50), axis=1)
appdf.head()

### <u> Comments:</u>
- 40 columns had over 50% of their values as null. They have been dropped in the above step.
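
To make the behaviour of `thresh` concrete, here is a minimal sketch on a hypothetical toy frame (the column names and values are made up for illustration, not taken from the dataset). It suggests that a column with exactly 50% nulls survives and only columns with more than 50% nulls are dropped:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: columns with 0%, 50% and 75% missing values
toy = pd.DataFrame({
    "full":        [1, 2, 3, 4],
    "half_null":   [1.0, np.nan, 3.0, np.nan],
    "mostly_null": [np.nan, np.nan, np.nan, 4.0],
})

# thresh = minimum number of NON-null values a column must have to be kept
min_non_null = int(len(toy) * 0.50)
kept = toy.dropna(thresh=min_non_null, axis=1)

print(list(kept.columns))
```

Only `mostly_null` falls below the threshold of 2 non-null values, so it is the only column removed.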

### <u> Recommendation:</u>
- Columns with less than 13% null values can be imputed with the mean or mode, depending on the column attributes
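
The recommendation above could be sketched roughly as follows. This is only an illustration on a hypothetical toy frame (the values are invented); it assumes the mean for a numeric column and the mode for a categorical one:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one numeric and one categorical column with gaps
toy = pd.DataFrame({
    "AMT_ANNUITY":        [100.0, 200.0, np.nan, 900.0],
    "NAME_FAMILY_STATUS": ["Married", None, "Married", "Widow"],
})

imputed = toy.copy()

# Numeric column: fill the gaps with the mean of the observed values
imputed["AMT_ANNUITY"] = imputed["AMT_ANNUITY"].fillna(imputed["AMT_ANNUITY"].mean())

# Categorical column: fill the gaps with the most frequent value (mode)
most_frequent = imputed["NAME_FAMILY_STATUS"].mode()[0]
imputed["NAME_FAMILY_STATUS"] = imputed["NAME_FAMILY_STATUS"].fillna(most_frequent)

print(imputed)
```

For heavily skewed amounts, the median may be a safer choice than the mean; that is a judgment call rather than something this notebook tests.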



In [None]:
#Checking the datatypes of the columns in the dataframe
appdf.info()

### <u> Comments:</u>
- Observing the columns and their data types reveals that the data type of a few columns needs to be updated
- Columns like DAYS_BIRTH should not contain negative values (for age and the like), so these values need to be made positive
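
As a rough sketch of both points, the snippet below uses a hypothetical toy frame (invented values). It assumes pandas' nullable `Int64` dtype as one possible way to get integers while keeping missing values; this is an alternative technique, not the conversion used in this notebook:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: a float count with a gap, and negative day offsets
toy = pd.DataFrame({
    "CNT_FAM_MEMBERS": [2.0, np.nan, 4.0],
    "DAYS_BIRTH":      [-9461, -16765, -19046],
})

# Nullable integer dtype: whole numbers stay integers, NaN becomes <NA>
toy["CNT_FAM_MEMBERS"] = toy["CNT_FAM_MEMBERS"].astype("Int64")

# Negative day offsets -> positive age in completed years
toy["AGE_YEARS"] = (toy["DAYS_BIRTH"].abs() // 365.25).astype(int)

print(toy)
```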

In [None]:
#Changing data type to integer. errors='ignore' leaves a column unchanged when the
#conversion fails, which is what happens for columns that still contain NAs

int_cols = ['DAYS_REGISTRATION', 'CNT_FAM_MEMBERS', 'OBS_30_CNT_SOCIAL_CIRCLE',
            'DEF_30_CNT_SOCIAL_CIRCLE', 'DAYS_LAST_PHONE_CHANGE', 'AMT_REQ_CREDIT_BUREAU_HOUR']
for col in int_cols:
    appdf[col] = appdf[col].astype(int, errors='ignore')

#Making the values positive & converting age in days to 'age in years'
appdf['DAYS_BIRTH'] = abs(appdf['DAYS_BIRTH'])//365.25


### <u> Reporting Outliers for Continuous Variables:</u>

In [None]:
def appdf_boxplot_outlier(var_cont):
    
    plt.figure(figsize=(12,6))
    
    #The y-axis of a boxplot shows the values of the variable, not a count of cases
    sns.boxplot(y=var_cont, data=appdf, palette='Spectral')
    plt.title('Distribution of '+ '%s' %var_cont, weight='bold', fontsize=10)
    plt.ylabel(var_cont)
    
    plt.show()

In [None]:
#Count of Family Members
appdf_boxplot_outlier('CNT_FAM_MEMBERS')

**Inference:** By the IQR formula, the upper limit for the count of family members is 4.5 (i.e. 4 members) and the lower limit is 0.5 (i.e. 1 member). The higher counts, exceeding 8, are most likely joint families.
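
The IQR limits quoted above can be reproduced with a small helper. This is a sketch on a hypothetical toy series whose quartiles are 2 and 3 (the values are invented, and `iqr_fences` is an illustrative name, not a function used elsewhere in this notebook):

```python
import pandas as pd

def iqr_fences(values: pd.Series, k: float = 1.5):
    """Return (lower, upper) limits at Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical family-member counts with Q1 = 2 and Q3 = 3
toy = pd.Series([1, 2, 2, 2, 2, 3, 3, 3, 4, 9], name="CNT_FAM_MEMBERS")

lower, upper = iqr_fences(toy)
outliers = toy[(toy < lower) | (toy > upper)]

print(lower, upper)        # limits from the IQR formula
print(outliers.tolist())   # values outside the limits
```

With Q1 = 2 and Q3 = 3 the IQR is 1, so the limits come out as 0.5 and 4.5, matching the figures in the inference.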

In [None]:
appdf['CNT_FAM_MEMBERS'].describe()

In [None]:
#Credit Amount
appdf_boxplot_outlier('AMT_CREDIT')

In [None]:
appdf['AMT_CREDIT'].describe()

## <u>Comments</u>

**1. Recommendation 1:** Treatment for outliers: *Outliers* for the **Count of Family Members & Credit Amount** can be ignored

**2. Recommendation 2:** Binning Continuous variables: Continuous variables can be binned into categories so that we can perform bivariate categorical-categorical analysis. This would help prevent the outliers from skewing the analysis. We can add bins using the code snippet below:

pd.qcut(appdf['AMT_INCOME_TOTAL'],q=5, labels= ['Below Avg', 'Average', "Above Avg", "Good", 'Better']).value_counts().plot(kind='bar')
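
To see what the quantile binning does, here is a small self-contained sketch on hypothetical income values (invented numbers, in thousands). It uses the same labels as the snippet above and includes one extreme value to show where an outlier ends up:

```python
import pandas as pd

# Hypothetical incomes with one extreme value at the end
income = pd.Series([30, 45, 60, 75, 90, 110, 135, 180, 250, 900],
                   name="AMT_INCOME_TOTAL")

labels = ["Below Avg", "Average", "Above Avg", "Good", "Better"]

# q=5 cuts at the 20th, 40th, 60th and 80th percentiles,
# so each bin receives roughly the same number of rows
income_bin = pd.qcut(income, q=5, labels=labels)

counts = income_bin.value_counts().sort_index()
print(counts)
```

The extreme value simply lands in the top bin instead of stretching the scale, which is the reason binning softens the effect of outliers.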


In [None]:
#Checking for percentage of null values
print(round(100*(appdf.isnull().sum()/len(appdf)),2))

### <u> Comments:</u>
- Let us now check for data imbalance in the TARGET column to understand how the application data is distributed between defaulters [TARGET = 1] and non-defaulters [TARGET = 0]

In [None]:
#Checking for data imbalance for the Target Column

Defaulter = round((appdf['TARGET'].value_counts()[1]/len(appdf)),2)
Non_Defaulter = round((appdf['TARGET'].value_counts()[0]/len(appdf)),2)
explode= (0.1,0.1)
client = [Defaulter, Non_Defaulter]
labels = 'Defaulter', 'Non-Defaulter'
plt.pie(client, labels=labels, explode=explode, autopct='%1.1f%%', startangle=90)

plt.show()

### <u> Comments:</u>
- We understand from the plot that the client list is immensely imbalanced on the **TARGET** column, with an imbalance ratio of about 11.5 (i.e. 92% non-defaulters compared to 8% defaulters).
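
The imbalance ratio is simply the count of the majority class divided by the count of the minority class. A minimal sketch, assuming a hypothetical target column with the same 92/8 split as the rounded percentages above:

```python
import pandas as pd

# Hypothetical TARGET column: 92 non-defaulters (0) and 8 defaulters (1)
target = pd.Series([0] * 92 + [1] * 8, name="TARGET")

counts = target.value_counts()

# Majority class count divided by minority class count
imbalance_ratio = counts.loc[0] / counts.loc[1]

print(imbalance_ratio)
```

On the real data the ratio may differ slightly, because the 92% and 8% figures are rounded.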





### <u>Let us now analyze the dataset w.r.t Inferential Statistics</u>

In [None]:
#splitting the dataframe into two data sets

appdf0 = appdf[appdf.TARGET==0] #dataset for non-defaulter clients
appdf1 = appdf[appdf.TARGET==1] #dataset for defaulter clients


## <u>Univariate Categorical Features</u>

#### Following plots were plotted:
- Contract Type
- Gender
- Family Status
- Housing Type
- Education Type

In [None]:
def appdf_plot_unnivariate_cat(var):
    
    plt.figure(figsize=(16,6))
    
    #A shared order keeps the categories in the same position in both the subplots
    order = appdf[var].value_counts().index
    
    plt.subplot(1, 2, 1)
    #x is passed by keyword since recent seaborn versions no longer accept it positionally
    sns.countplot(x=var, data=appdf0, palette= 'Spectral', order= order)
    plt.title('Distribution of '+ '%s' %var +' for Non-Defaulters', weight='bold', fontsize=10)
    plt.xlabel(var)
    plt.xticks(rotation=90)
    plt.ylabel('Number of cases for non-defaulter clients')
    
    plt.subplot(1, 2, 2)
    sns.countplot(x=var, data=appdf1, palette='Spectral', order= order)
    plt.title('Distribution of '+ '%s' %var +' for Defaulters', weight='bold',fontsize=10)
    plt.xlabel(var)
    plt.xticks(rotation=90)
    plt.ylabel('Number of cases for defaulter clients')
    
    plt.show()

In [None]:
appdf_plot_unnivariate_cat('NAME_CONTRACT_TYPE')

**Inference:** People opt for cash loans more than revolving loans. Further, revolving loans have fewer defaulter clients as compared to non-defaulters.

In [None]:
appdf_plot_unnivariate_cat('CODE_GENDER')

**Inference:** Females opt for loans more than males, and the number of female defaulters is also higher than the number of male defaulters. Note that these plots show counts, so they do not by themselves tell us which group defaults at a higher rate.

In [None]:
appdf_plot_unnivariate_cat('NAME_FAMILY_STATUS')

**Inference:** Maximum number of loans are taken by people who are married.

In [None]:
appdf_plot_unnivariate_cat('NAME_HOUSING_TYPE')

**Inference:** People living in apartments or houses that they own are more likely to take loans. This is most likely because they own a house that they can pledge as collateral for the loan. This category also has the largest number of defaulters.

In [None]:
appdf_plot_unnivariate_cat('NAME_EDUCATION_TYPE')

**Inference:** People with Secondary / secondary special education take more loans, and defaulter cases are also higher in this category.

## <u>Univariate Continuous Features</u>

#### Following plots were plotted:
- Count of Children
- Total Income
- Credit Amount
- Annuity Amount
- Age

In [None]:
def appdf_plot_unnivariate_cont(var):
    
    plt.figure(figsize=(16,6))
    
    #histplot with kde=True is used in place of the deprecated sns.distplot
    plt.subplot(1, 2, 1)
    sns.histplot(appdf0[var].dropna(), bins=50, kde=True, stat='density', color='tab:orange')
    plt.title('Distribution of '+ '%s' %var +' for Non-Defaulters', weight='bold', fontsize=10)
    plt.xlabel(var)
    plt.xticks(rotation=90)
    plt.ylabel('Density for non-defaulter clients')
    
    plt.subplot(1, 2, 2)
    sns.histplot(appdf1[var].dropna(), bins=50, kde=True, stat='density', color='tab:orange')
    plt.title('Distribution of '+ '%s' %var +' for Defaulters', weight='bold',fontsize=10)
    plt.xlabel(var)
    plt.xticks(rotation=90)
    plt.ylabel('Density for defaulter clients')
    
    plt.show()

In [None]:
appdf_plot_unnivariate_cont('CNT_CHILDREN')

**Inference:** Most loans are taken by people with no children. The number of people taking loans falls as the number of children increases.

In [None]:
appdf_plot_unnivariate_cont('AMT_INCOME_TOTAL')

In [None]:
appdf_plot_unnivariate_cont('AMT_CREDIT')

In [None]:
appdf_plot_unnivariate_cont('AMT_ANNUITY')

In [None]:
appdf_plot_unnivariate_cont('DAYS_BIRTH')

**Inference:**

## <u>Bivariate Categorical-Categorical</u>

#### Following plots were plotted:
- Family Status VS Education
- Housing Type VS Family Status
- Education VS Gender
- Housing Type VS Education
- Housing Type VS Gender

In [None]:
def appdf_plot_bivariate_cat_cat(var,var_hue):
    
    plt.figure(figsize=(16,6))
    
    #Shared orders keep the categories and the hue levels aligned across both the subplots
    order = appdf[var].value_counts().index
    hue_order = appdf[var_hue].value_counts().index
    
    plt.subplot(1, 2, 1)
    sns.countplot(x=var, hue=var_hue, data=appdf0, palette= 'Spectral', order= order, hue_order=hue_order)
    plt.title('Distribution of '+ '%s' %var +' for Non-Defaulters', weight='bold', fontsize=10)
    plt.xlabel(var)
    plt.xticks(rotation=90)
    plt.ylabel('Number of cases for non-defaulter clients')
    
    plt.subplot(1, 2, 2)
    sns.countplot(x=var, hue=var_hue, data=appdf1, palette='Spectral', order= order, hue_order=hue_order)
    plt.title('Distribution of '+ '%s' %var +' for Defaulters', weight='bold',fontsize=10)
    plt.xlabel(var)
    plt.xticks(rotation=90)
    plt.ylabel('Number of cases for defaulter clients')
    
    plt.show()

In [None]:
appdf_plot_bivariate_cat_cat('NAME_FAMILY_STATUS','NAME_EDUCATION_TYPE')

**Inference:** The trend seems to be similar for both the cases.

In [None]:
appdf_plot_bivariate_cat_cat('NAME_HOUSING_TYPE','NAME_FAMILY_STATUS')

**Inference:** The maximum number of people who take loans are married and live in a house or apartment.

In [None]:
appdf_plot_bivariate_cat_cat('NAME_EDUCATION_TYPE','CODE_GENDER')

**Inference:** Even though more women default than men in absolute numbers, the share of women who default (out of all women who have taken a loan) is lower than the share of men who default (out of all men who have taken a loan).
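
The difference between counts and rates can be checked with a simple group-by. Below is a sketch on hypothetical numbers (invented purely to illustrate the pattern, not taken from the dataset), where women have more defaulters in total but a lower default rate:

```python
import pandas as pd

# Hypothetical toy data: 200 female and 100 male applicants
toy = pd.DataFrame({
    "CODE_GENDER": ["F"] * 200 + ["M"] * 100,
    "TARGET":      [1] * 14 + [0] * 186 + [1] * 10 + [0] * 90,
})

# Since TARGET is 0/1, its sum is the number of defaulters
# and its mean is the default rate within each group
summary = toy.groupby("CODE_GENDER")["TARGET"].agg(
    defaulters="sum", applicants="count", default_rate="mean"
)

print(summary)
```

Running the same aggregation on `appdf` would be one way to confirm the inference above with actual rates instead of reading them off the bar heights.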

In [None]:
appdf_plot_bivariate_cat_cat('NAME_HOUSING_TYPE','NAME_EDUCATION_TYPE')

**Inference:** The trend seems to be similar in both cases.

In [None]:
appdf_plot_bivariate_cat_cat('NAME_FAMILY_STATUS','CODE_GENDER')

**Inference:** Married women take the most loans. And even though women in general take more loans than men, single men default more than single women.

## <u>Bivariate Categorical-Continuous</u>

#### Following Variables were plotted:
- Education VS Income 
- Education VS Credit Amount
- Gender VS Credit Amount
- Housing Type VS Count of Children
- Education VS Count of Children

In [None]:
def appdf_plot_bivariate_cat_cont(var_cat,var_cont):
    
    plt.figure(figsize=(16,6))
    
    #Left: non-defaulters
    plt.subplot(1, 2, 1)
    sns.boxplot(x=var_cat,y=var_cont, data=appdf0, palette='Spectral', order= appdf0[var_cat].value_counts().index)
    plt.title('Distribution of '+ '%s' %var_cat +' for Non-Defaulters', weight='bold', fontsize=10)
    plt.xlabel(var_cat)
    plt.xticks(rotation=90)
    plt.ylabel('%s' %var_cont+' for non-defaulter clients')
    
    #Right: defaulters
    plt.subplot(1, 2, 2)
    sns.boxplot(x=var_cat, y=var_cont, data=appdf1, palette='Spectral', order= appdf1[var_cat].value_counts().index)
    plt.title('Distribution of '+ '%s' %var_cat +' for Defaulters', weight='bold',fontsize=10)
    plt.xlabel(var_cat)
    plt.xticks(rotation=90)
    plt.ylabel('%s' %var_cont+' for defaulter clients')
    
    plt.show()

In [None]:
appdf_plot_bivariate_cat_cont('NAME_EDUCATION_TYPE','AMT_INCOME_TOTAL')

In [None]:
appdf_plot_bivariate_cat_cont('NAME_EDUCATION_TYPE','AMT_CREDIT')

**Inference:** The Credit Amount is generally higher for people with an Academic degree.

In [None]:
appdf_plot_bivariate_cat_cont('CODE_GENDER','AMT_CREDIT')

**Inference**: Men and Women have similar credit amounts.

In [None]:
appdf_plot_bivariate_cat_cont('NAME_HOUSING_TYPE','CNT_CHILDREN')

**Inference:** The range is 0 to 2 for the Count of Children; however, the outliers among the non-defaulters are higher than those among the defaulters.

In [None]:
appdf_plot_bivariate_cat_cont('NAME_EDUCATION_TYPE','CNT_CHILDREN')

## <u>Bivariate Continuous-Continuous</u>
#### Following Variables were plotted:
- Credit Amount VS Age
- Income VS Credit Amount
- Income VS Annuity
- Income VS Age
- Age VS Annuity

In [None]:
def appdf_plot_bivariate_cont_cont(var_cont1,var_cont2):
    
    plt.figure(figsize=(18,6))
    
    #Same plot for both groups: non-defaulters on the left, defaulters on the right
    for i, (df, label) in enumerate([(appdf0, 'Non-Defaulters'), (appdf1, 'Defaulters')], start=1):
        plt.subplot(1, 2, i)
        sns.scatterplot(x=var_cont1, y=var_cont2, data=df)
        plt.title('Distribution of '+ '%s' %var_cont1 +' for '+ label, weight='bold', fontsize=10)
        plt.xticks(rotation=90)
        plt.ylabel('Distribution of '+ '%s' %var_cont2)
        
        #Axis limits at Q1 - 1.5*IQR and Q3 + 1.5*IQR keep the outliers out of view
        for var, set_lim in ((var_cont1, plt.xlim), (var_cont2, plt.ylim)):
            q1, q3 = df[var].quantile(.25), df[var].quantile(.75)
            whisker = 1.5*(q3-q1)
            set_lim(q1-whisker, q3+whisker)
    
    plt.show()

In [None]:
appdf_plot_bivariate_cont_cont('AMT_CREDIT','DAYS_BIRTH')

**Inference:** No inference could be made.

In [None]:
appdf_plot_bivariate_cont_cont('AMT_INCOME_TOTAL','AMT_CREDIT')

**Inference:** Users usually enter the income amount as a round number, which is why the points line up along round values of income.
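
One rough way to check this observation numerically is to measure the share of incomes that are exact multiples of some step. The sketch below uses hypothetical income values, and both the helper name `round_value_share` and the step of 500 are assumptions made for illustration:

```python
import pandas as pd

def round_value_share(values: pd.Series, step: float = 500) -> float:
    """Fraction of non-null values that are exact multiples of `step`."""
    values = values.dropna()
    return float((values % step == 0).mean())

# Hypothetical incomes: four round figures and one odd figure
income = pd.Series([90000.0, 112500.0, 135000.0, 67500.0, 101234.5],
                   name="AMT_INCOME_TOTAL")

share = round_value_share(income)
print(share)
```

A share close to 1 on the real column would support the idea that incomes are mostly self-reported round figures.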

In [None]:
appdf_plot_bivariate_cont_cont('AMT_INCOME_TOTAL','AMT_ANNUITY')

**Inference:** No inference could be made.

In [None]:
appdf_plot_bivariate_cont_cont('AMT_INCOME_TOTAL','DAYS_BIRTH')

**Inference:** No inference could be made.

In [None]:
appdf_plot_bivariate_cont_cont('DAYS_BIRTH','AMT_ANNUITY')

**Inference:** No inference could be made.

## <u>Correlation (Application Dataset)</u>

In [None]:
def correlation_heatmap(var):
    plt.figure(figsize=(12,8))
    cor = var.corr()

    sns.heatmap(cor,annot=True,linewidths=.5,cbar_kws={"orientation": "horizontal"},cmap="Reds")
    plt.show()
    #Steps to obtain the top correlations: take each pair of columns once (x < y),
    #skip undefined (NaN) values and rank the pairs by absolute correlation
    pairs = [(cor.index[x], cor.columns[y], abs(cor.iloc[x,y]))
             for x in range(len(cor)) for y in range(x+1, len(cor))
             if pd.notnull(cor.iloc[x,y])]
    a=sorted(pairs, key=lambda x: x[2],reverse=True)
    print("Top Ten Correlations are:")
    #Slicing avoids an IndexError when fewer than ten pairs are available
    for i, (col1, col2, _) in enumerate(a[:10]):
        print('%d. '%(i+1)+col1+' and '+col2)

### <u>For Non Defaulters</u>

In [None]:
correlation_heatmap(appdf0[['CNT_CHILDREN','AMT_INCOME_TOTAL','AMT_CREDIT','AMT_ANNUITY','DAYS_BIRTH','DAYS_EMPLOYED','AMT_GOODS_PRICE']])

**Inference:** The top correlation is between the credit amount and the goods price. A higher goods price goes with a higher credit amount.

### <u>For Defaulters</u>

In [None]:
correlation_heatmap(appdf1[['CNT_CHILDREN','AMT_INCOME_TOTAL','AMT_CREDIT','AMT_ANNUITY','DAYS_BIRTH','DAYS_EMPLOYED','AMT_GOODS_PRICE']])

**Inference:** The top position again goes to the relationship between Credit Amount and Goods Price. The correlations for Total Income vs Credit Amount, Total Income vs Annuity and Total Income vs Goods Price are much lower for defaulters than for non-defaulters.
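
To compare a single pair of variables across the two groups without reading values off two heatmaps, something like the following could be used. It is a sketch on hypothetical toy data (invented values), and `pair_corr_by_group` is an illustrative helper, not part of this notebook:

```python
import pandas as pd

def pair_corr_by_group(df, group_col, x, y):
    """Pearson correlation between x and y, computed separately per group."""
    return pd.Series({key: g[x].corr(g[y]) for key, g in df.groupby(group_col)})

# Hypothetical data: income tracks credit closely for TARGET = 0
# and only loosely for TARGET = 1
toy = pd.DataFrame({
    "TARGET":           [0, 0, 0, 0, 1, 1, 1, 1],
    "AMT_INCOME_TOTAL": [1, 2, 3, 4, 1, 2, 3, 4],
    "AMT_CREDIT":       [2, 4, 6, 8, 5, 1, 4, 2],
})

by_group = pair_corr_by_group(toy, "TARGET", "AMT_INCOME_TOTAL", "AMT_CREDIT")
print(by_group)
```

Applied to `appdf`, the same helper would return one correlation per TARGET value for the chosen pair of columns.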

## <u>Previous Application Dataset </u>

In [None]:
prevappdf = pd.read_csv('/kaggle/input/credit-analysis/previous_app.csv')
prevappdf.head()


In [None]:
#Checking the % of null values across different columns

round(((prevappdf.isnull().sum() / len(prevappdf))*100),2)

### <u>Recommendations</u>
- We observe that 14 of 37 columns have null values. It is highly recommended to drop columns with null values ***greater than 50%***


In [None]:
#The following code can be used to drop the columns with null values:

#prevappdf = prevappdf.dropna(thresh=len(prevappdf)*0.50,axis=1)


## <u>Univariate Categorical Features</u>
#### Following are the variables analysed:
- Contract Type
- Process Start Day
- Loan Purpose
- Contract Status
- Payment Type

In [None]:
def prevappdf_plot_unnivariate_cat(var):
    
    plt.figure(figsize=(12,6))
    
    #x is passed by keyword since recent seaborn versions no longer accept it positionally
    sns.countplot(x=var, data=prevappdf, palette= 'Spectral', order= prevappdf[var].value_counts().index)
    plt.title('Distribution of '+ '%s' %var, weight='bold', fontsize=10)
    plt.xlabel(var)
    plt.xticks(rotation=90)
    plt.ylabel('Number of cases for clients')

    plt.show()

In [None]:
prevappdf_plot_unnivariate_cat('NAME_CONTRACT_TYPE')

**Inference:** Revolving loans are applied for less often than the other contract types.

In [None]:
prevappdf_plot_unnivariate_cat('WEEKDAY_APPR_PROCESS_START')

**Inference:** Fewer loans are applied for on Sunday than on the other days of the week.

In [None]:
prevappdf_plot_unnivariate_cat('NAME_CASH_LOAN_PURPOSE')

**Inference:** No inference made.

In [None]:
prevappdf_plot_unnivariate_cat('NAME_CONTRACT_STATUS')

**Inference:** Most loan applications are approved.

In [None]:
prevappdf_plot_unnivariate_cat('NAME_PAYMENT_TYPE')

**Inference:** Loans are predominantly paid by the customers through the bank.

## <u>Univariate Continuous Features</u>
#### Following are the variables analysed:
- Amount Application
- Credit Amount
- Down Payment Amount
- Annuity Amount
- Interest Rate


In [None]:
def prevappdf_plot_unnivariate_cont(var):
    
    plt.figure(figsize=(12,6))
    
    #histplot with kde=True is used in place of the deprecated sns.distplot
    sns.histplot(prevappdf[var].dropna(), bins=50, kde=True, stat='density', color='tab:orange')
    plt.title('Distribution of '+ '%s' %var, weight='bold', fontsize=10)
    plt.xlabel(var)
    plt.xticks(rotation=90)
    plt.ylabel('Density for clients')
    
    plt.show()

In [None]:
prevappdf_plot_unnivariate_cont('AMT_APPLICATION')

In [None]:
prevappdf_plot_unnivariate_cont('AMT_CREDIT')

In [None]:
prevappdf_plot_unnivariate_cont('AMT_DOWN_PAYMENT')

In [None]:
prevappdf_plot_unnivariate_cont('AMT_ANNUITY')

In [None]:
prevappdf_plot_unnivariate_cont('RATE_INTEREST_PRIMARY')

## <u>Bivariate Analysis Categorical-Categorical</u>
#### Following are the variables analysed:
- Contract Type VS Contract Status
- Weekday VS Rejection Reason
- Weekday VS Contract Status
- Payment Type VS Contract Status
- Contract Type VS Payment Type

In [None]:
def prevappdf_plot_bivariate_cat_cat(var,var_hue):
    
    plt.figure(figsize=(12,6))
    
    #x is passed by keyword since recent seaborn versions no longer accept it positionally
    sns.countplot(x=var, hue=var_hue, data=prevappdf, palette= 'Spectral', order= prevappdf[var].value_counts().index, hue_order=prevappdf[var_hue].value_counts().index)
    plt.title('Distribution of '+ '%s' %var , weight='bold', fontsize=10)
    plt.xlabel(var)
    plt.xticks(rotation=90)
    plt.ylabel('Number of cases for clients')
    
    plt.show()

In [None]:
prevappdf_plot_bivariate_cat_cat('NAME_CONTRACT_TYPE','NAME_CONTRACT_STATUS')

**Inference:** The majority of the loans approved are Consumer Loans.

In [None]:
prevappdf_plot_bivariate_cat_cat('WEEKDAY_APPR_PROCESS_START','CODE_REJECT_REASON')

**Inference:** The trend seems to be the same on all days. Sunday has fewer cases.

In [None]:
prevappdf_plot_bivariate_cat_cat('WEEKDAY_APPR_PROCESS_START','NAME_CONTRACT_STATUS')

**Inference:** The trend seems to be the same on all days. Sunday has fewer cases.

In [None]:
prevappdf_plot_bivariate_cat_cat('NAME_PAYMENT_TYPE','NAME_CONTRACT_STATUS')

**Inference:** The approved loans are predominantly paid through the bank.

In [None]:
prevappdf_plot_bivariate_cat_cat('NAME_CONTRACT_TYPE','NAME_PAYMENT_TYPE')

## <u>Bivariate Analysis Categorical-Continuous</u>
#### Following are the variables analysed:
- Contract Type VS Amount Application
- Weekday VS Interest Rate
- Payment Type VS Credit Amount
- Contract Status VS Amount Application
- Contract Status VS Credit Amount

In [None]:
def prevappdf_plot_bivariate_cat_cont(var_cat,var_cont):
    
    plt.figure(figsize=(12,6))
    
    sns.boxplot(x=var_cat,y=var_cont, data=prevappdf, palette='Spectral', order= prevappdf[var_cat].value_counts().index)
    plt.title('Distribution of '+ '%s' %var_cat, weight='bold', fontsize=10)
    plt.xlabel(var_cat)
    plt.xticks(rotation=90)
    #The previous application data covers all clients, not only the defaulters
    plt.ylabel('%s' %var_cont+' for clients')
    plt.show()

In [None]:
prevappdf_plot_bivariate_cat_cont('NAME_CONTRACT_TYPE','AMT_APPLICATION')

**Inference:** The amount applied for in cash loans is usually higher than in consumer or revolving loans.

In [None]:
prevappdf_plot_bivariate_cat_cont('WEEKDAY_APPR_PROCESS_START','RATE_INTEREST_PRIMARY')

In [None]:
prevappdf_plot_bivariate_cat_cont('NAME_PAYMENT_TYPE','AMT_CREDIT')

In [None]:
prevappdf_plot_bivariate_cat_cont('NAME_CONTRACT_STATUS','AMT_APPLICATION')

**Inference:** The amount applied for in refused applications is usually on the higher side.

In [None]:
prevappdf_plot_bivariate_cat_cont('NAME_CONTRACT_STATUS','AMT_CREDIT')

**Inference:** People with higher credit amounts are also refused loans.

## <u>Bivariate Analysis Continuous-Continuous</u>
#### Following are the variables analysed:
- Amount Application VS Credit Amount
- Amount Application VS Interest Rate
- Credit Amount VS Interest Rate
- Amount Application VS Annuity Amount
- Credit Amount VS Annuity Amount

In [None]:
def prevappdf_plot_bivariate_cont_cont(var_cont1,var_cont2):
    
    plt.figure(figsize=(12,6))
    
    #A single plot is drawn here, so no subplot grid is needed
    sns.scatterplot(x=var_cont1,y=var_cont2, data=prevappdf)
    plt.title('Distribution of '+ '%s' %var_cont1, weight='bold', fontsize=10)
    #plt.xlabel(var_cont1)
    plt.xticks(rotation=90)
    plt.ylabel('Distribution of '+ '%s' %var_cont2)
    
    #IQR-based limits (Q1 - 1.5*IQR and Q3 + 1.5*IQR) for both the axes
    xIQR=1.5*(prevappdf[var_cont1].quantile(.75)-prevappdf[var_cont1].quantile(.25))
    xlowerlim=prevappdf[var_cont1].quantile(.25)-xIQR
    xupperlim=prevappdf[var_cont1].quantile(.75)+xIQR
    
    yIQR=1.5*(prevappdf[var_cont2].quantile(.75)-prevappdf[var_cont2].quantile(.25))
    ylowerlim=prevappdf[var_cont2].quantile(.25)-yIQR
    yupperlim=prevappdf[var_cont2].quantile(.75)+yIQR
    #Here the limits are not applied, since the outliers give a sense of trend to the plot
    #plt.ylim(ylowerlim,yupperlim)
    #plt.xlim(xlowerlim,xupperlim)
    plt.show()

In [None]:
prevappdf_plot_bivariate_cont_cont('AMT_APPLICATION','AMT_CREDIT')

In [None]:
prevappdf_plot_bivariate_cont_cont('AMT_APPLICATION','RATE_INTEREST_PRIMARY')

In [None]:
prevappdf_plot_bivariate_cont_cont('AMT_CREDIT','RATE_INTEREST_PRIMARY')

In [None]:
prevappdf_plot_bivariate_cont_cont('AMT_APPLICATION','AMT_ANNUITY')

In [None]:
prevappdf_plot_bivariate_cont_cont('AMT_CREDIT','AMT_ANNUITY')

## <u> Correlation (Previous Application Dataset)</u>

In [None]:
correlation_heatmap(prevappdf[['AMT_APPLICATION','AMT_CREDIT','AMT_ANNUITY','RATE_INTEREST_PRIMARY','CNT_PAYMENT']])

## <u> Recommendations to the Bank </u>
    
1. Single males who apply for loans are more likely to turn out to be defaulters than single females. We recommend that the loan manager scrutinize their applications carefully.

2. Customers living with parents find it more difficult to pay loans on time than customers with other housing options. Hence their applications must also be looked at carefully.

3. Consumer loans are far less likely to be refused than cash loans. It is recommended that consumer loan applications be reviewed closely before they are refused.

4. The percentage of loan applications getting approved on Saturday is higher than on other days of the week. The bank should check whether its employees are stringent enough while approving loans on Saturday.
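
Percentage-style statements like the one about Saturday can be checked with a row-normalised cross-tabulation rather than by comparing bar heights. Here is a minimal sketch on hypothetical toy data (invented rows, using column names from the previous application dataset):

```python
import pandas as pd

# Hypothetical previous applications: four on Saturday, four on Monday
toy = pd.DataFrame({
    "WEEKDAY_APPR_PROCESS_START": ["SATURDAY"] * 4 + ["MONDAY"] * 4,
    "NAME_CONTRACT_STATUS": ["Approved", "Approved", "Approved", "Refused",
                             "Approved", "Approved", "Refused", "Cancelled"],
})

# normalize="index" turns each weekday's row into shares that sum to 1
status_share = pd.crosstab(
    toy["WEEKDAY_APPR_PROCESS_START"],
    toy["NAME_CONTRACT_STATUS"],
    normalize="index",
)

approval_rate = status_share["Approved"].sort_values(ascending=False)
print(approval_rate)
```

Running the same cross-tabulation on `prevappdf` would give the approval share per weekday directly.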


***Hope you found this helpful! Do let me know in the comments what could be improved, or if you have any doubts!***