# CREDIT EDA CASE STUDY


***

### Objective 

This case study aims to identify patterns that indicate whether a client is likely to have difficulty paying their installments. Such patterns can inform actions such as denying the loan, reducing the loan amount, or lending to risky applicants at a higher interest rate, while ensuring that applicants capable of repaying the loan are not rejected. Identifying such applicants using EDA is the aim of this case study.
In other words, the company wants to understand the driving factors (driver variables) behind loan default, i.e. the variables that are strong indicators of default. The company can use this knowledge for its portfolio and risk assessment.

#### Import Libraries 

In [None]:
#Suppress warnings.
import warnings
warnings.filterwarnings('ignore')

In [None]:
#Import the required Libraries.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
#seaborn settings
sns.set_style("whitegrid")
sns.set_context("talk")

#### Loading Datasets 

In [None]:
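#Setting the views to inspect large Dataframe
pd.set_option('display.max_columns', None)
#use the full option name; the short form "max_rows" is ambiguous in recent pandas versions
pd.set_option('display.max_rows', None)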

In [None]:
#Loading the Present and previous application files
app=pd.read_csv('/kaggle/input/loan-defaulter/application_data.csv')
pre_app=pd.read_csv('/kaggle/input/loan-defaulter/previous_application.csv')


#### Inspecting Data 

In [None]:
#Info of the Current application('app') Dataset
app.info(verbose=True)

In [None]:
# summary of numeric columns for present application
app.describe()


In [None]:
#info of the previous app ('pre_app') Dataset
pre_app.info()

In [None]:
#Numerical summary of pre_app Dataset
pre_app.describe()

#### Handling missing and invalid values

Missing values can be handled in the following ways:
1. Drop the observations (appropriate only when values are missing completely at random, MCAR)
2. Replace or impute the values with the mean, median, etc.
3. Create a separate level for missing categorical data
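
The three options above can be sketched on a small toy frame. The column names and values here are hypothetical and are not taken from the case-study data; the snippet only illustrates the pandas calls involved.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values (not from the case-study data)
toy = pd.DataFrame({
    "income": [100.0, np.nan, 300.0, 200.0],
    "occupation": ["Laborers", None, "Drivers", "Laborers"],
})

# 1. Drop every row that contains a missing value
dropped = toy.dropna()

# 2. Impute a numeric column with its median
income_filled = toy["income"].fillna(toy["income"].median())

# 3. Add an explicit level for missing categorical values
occupation_filled = toy["occupation"].fillna("Unknown")

print(len(dropped))
print(income_filled.tolist())
print(occupation_filled.tolist())
```

Which option fits depends on why the values are missing; dropping rows is only safe when the missingness is unrelated to the data.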

In [None]:
# Missing Values % 
null_pct1=app.isna().mean().round(4)
null_pct2=pre_app.isna().mean().round(4)


In [None]:
#Missing Value % of 'application' Dataset
null_pct1

In [None]:
#Missing Value % of 'Previous application' Dataset
null_pct2

In [None]:
#Dropping columns with missing values more than 50% in both datasets.
missing_features1 = null_pct1[null_pct1 > 0.50].index
app.drop(missing_features1, axis=1, inplace=True)

missing_features2 = null_pct2[null_pct2 > 0.50].index
pre_app.drop(missing_features2, axis=1, inplace=True)

#### Inspecting Datatypes and Unique Values 

In [None]:
#Checking datatypes and unique values of application dataset
dataT=pd.DataFrame(app.nunique(),columns=['Unique_Values']).reset_index()
uniq=pd.DataFrame(app.dtypes,columns=['dtypes']).reset_index()
res=pd.merge(dataT,uniq,on='index').sort_values(by=['Unique_Values']).reset_index(drop=True)
res

In [None]:
#Checking datatypes and unique values of previous application dataset
p_dataT=pd.DataFrame(pre_app.nunique(),columns=['Unique_Values']).reset_index()
p_uniq=pd.DataFrame(pre_app.dtypes,columns=['dtypes']).reset_index()
p_res=pd.merge(p_dataT,p_uniq,on='index').sort_values(by=['Unique_Values']).reset_index(drop=True)
p_res

#### Datatype Correction 

In [None]:
#datatype correction in application dataset
app['NAME_CONTRACT_TYPE']=app.NAME_CONTRACT_TYPE.astype('category')
app['REGION_RATING_CLIENT_W_CITY']=app.REGION_RATING_CLIENT_W_CITY.astype('category')
app['CODE_GENDER']=app.CODE_GENDER.astype('category')
app['REGION_RATING_CLIENT']=app.REGION_RATING_CLIENT.astype('category')
app['NAME_EDUCATION_TYPE']=app.NAME_EDUCATION_TYPE.astype('category')
app['NAME_HOUSING_TYPE']=app.NAME_HOUSING_TYPE.astype('category')
app['NAME_INCOME_TYPE']=app.NAME_INCOME_TYPE.astype('category')
app['OCCUPATION_TYPE']=app.OCCUPATION_TYPE.astype('category')
app['CNT_FAM_MEMBERS']=app.CNT_FAM_MEMBERS.astype('category')
app['ORGANIZATION_TYPE']=app.ORGANIZATION_TYPE.astype('category')


In [None]:
#datatype correction in pre_application dataset
pre_app['NAME_CONTRACT_STATUS']=pre_app.NAME_CONTRACT_STATUS.astype('category')
pre_app['NAME_CLIENT_TYPE']=pre_app.NAME_CLIENT_TYPE.astype('category')
pre_app['NAME_CONTRACT_TYPE']=pre_app.NAME_CONTRACT_TYPE.astype('category')

In [None]:
#Converting age to absolute value in years
app['DAYS_BIRTH']=abs(app.DAYS_BIRTH)//365
app.DAYS_BIRTH.head()

In [None]:
#Converting days employed to absolute values in years
app['DAYS_EMPLOYED']=abs(app.DAYS_EMPLOYED)//365
app.DAYS_EMPLOYED.head()

#### Checking for Outliers 

In [None]:
#Setting plot size
plt.figure(figsize = (25, 18))

#Creating Subplots

#1.Total Income
plt.subplot(3,3,1)
plt.title("TOTAL INCOME", fontsize=20)
#pass the column as x= explicitly; positional data is interpreted differently across seaborn versions
sns.boxplot(x=app.AMT_INCOME_TOTAL,color='g')

#2.Credit Amount
plt.subplot(3,3,2)
plt.title("CREDIT AMOUNT", fontsize=20)
sns.boxplot(x=app.AMT_CREDIT,color='r')

#3.Years Employed
plt.subplot(3,3,3)
plt.title("YEARS EMPLOYED", fontsize=20)
sns.boxplot(x=app.DAYS_EMPLOYED,color='c')

#4.Annuity Amount
plt.subplot(3,3,4)
plt.title("ANNUITY AMOUNT", fontsize=20)
sns.boxplot(x=app.AMT_ANNUITY,color='m')

#5.Age
plt.subplot(3,3,5)
plt.title("AGE", fontsize=20)
sns.boxplot(x=app.DAYS_BIRTH,color='g')


plt.tight_layout()
plt.show()

In [None]:
#Checking the outlier in Days employed column
app.DAYS_EMPLOYED.value_counts()

<div class="alert alert-block alert-info">

**Observation:** 

1. The Days Employed column has an invalid value of '1000' for a large number of entries. These can be treated as missing values.

2. The outliers in Income, Credit and Annuity are most likely genuine values. These columns could be binned for analysis.
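
One way to treat such a placeholder as missing is to replace it with NaN before computing statistics. This is a minimal sketch on hypothetical values; the notebook itself does not perform this step, and the series name is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical years-employed values; 1000 plays the role of the invalid placeholder
years_employed = pd.Series([3, 1000, 12, 1000, 7])

# Replace the placeholder with NaN so summary statistics skip it
cleaned = years_employed.replace(1000, np.nan)

print(cleaned.isna().sum())  # number of placeholder entries
print(cleaned.median())      # median over the valid entries only
```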

#### Binning Continuous Value 

Age and income are binned into fixed intervals to simplify the analysis.
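
As a small illustration of how `pd.cut` assigns values to intervals, the sketch below uses the same age edges and labels as this notebook on a few hypothetical ages. By default the intervals are closed on the right, so an age of exactly 30 falls into the '<30' bin.

```python
import pandas as pd

# Hypothetical ages in years
ages = pd.Series([25, 30, 31, 45, 69])

# Same bin edges and labels as used for AGE_GROUP; intervals are (0, 30], (30, 40], ...
age_group = pd.cut(ages, bins=[0, 30, 40, 50, 60, 70],
                   labels=['<30', '30-40', '40-50', '50-60', '60-70'])

print(age_group.tolist())
```

Passing `right=False` to `pd.cut` would close the intervals on the left instead, if that boundary behaviour is preferred.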

In [None]:
 
#1.age
app['AGE_GROUP']=pd.cut(app.DAYS_BIRTH,[0,30,40,50,60,70],labels=['<30','30-40','40-50','50-60','60-70'])

In [None]:
#2.INCOME
app['INCOME_GROUP']=pd.cut(app.AMT_INCOME_TOTAL,[0,100000,200000,500000,1000000,117000000],
                           labels=['Upto 1L','1L-2L','2L-5L','5L-10L','more than 10L'])

#### Inspecting Data Imbalance 

In [None]:
#Checking Imbalance
app.TARGET.value_counts(normalize=True)

<div class="alert alert-block alert-info">

**Observation:** The data is imbalanced, with a ratio of about 92:8 between other cases and those with payment difficulties.

- The data is split by Target value (1 and 0) to better understand the characteristics that drive payment difficulty.
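
To make the ratio concrete, here is a sketch on a hypothetical target column with 23 zeros and 2 ones, which reproduces a 92:8 split. The frame and its values are illustrative only.

```python
import pandas as pd

# Hypothetical target column: 23 non-defaulters (0) and 2 defaulters (1)
toy = pd.DataFrame({"TARGET": [0] * 23 + [1] * 2})

# Share of each class
shares = toy["TARGET"].value_counts(normalize=True)

# Split into the two groups, as done for app1 and app0
defaulters = toy[toy["TARGET"] == 1]
others = toy[toy["TARGET"] == 0]

print(shares.loc[0], shares.loc[1])
print(len(defaulters), len(others))
```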

In [None]:
#Dividing the data into two sets based on the target variable (those with payment difficulty and other cases) 
app1=app[app['TARGET']==1]
app0=app[app['TARGET']==0]


# ANALYSIS 

#### Top Correlations 

In [None]:
corr_app1=app1[['TARGET','NAME_CONTRACT_TYPE','REGION_RATING_CLIENT_W_CITY','CODE_GENDER','REGION_RATING_CLIENT',
               'NAME_EDUCATION_TYPE','NAME_HOUSING_TYPE','NAME_FAMILY_STATUS','NAME_INCOME_TYPE','DEF_30_CNT_SOCIAL_CIRCLE',
               'CNT_CHILDREN','CNT_FAM_MEMBERS','OCCUPATION_TYPE','EXT_SOURCE_3','AMT_GOODS_PRICE','AMT_INCOME_TOTAL',
               'AMT_CREDIT','DAYS_EMPLOYED','AMT_ANNUITY','DAYS_BIRTH']]


In [None]:
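#Plotting a heatmap of correlations between the numeric columns
#(categorical columns are excluded; TARGET is constant within app1, so its correlations are NaN)
plt.figure(figsize=(20,15))
sns.heatmap(corr_app1.select_dtypes(include='number').corr(),cmap="YlGnBu",annot=True)
plt.show()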

<div class="alert alert-block alert-info">

**Observations:**

- Credit Amount and Goods Price have the highest correlation.
- A few other relevant, strongly correlated pairs:
    - Credit and Annuity amount
    - Goods price and Annuity amount
    

## Univariate Analysis 

### I. Numerical


A box plot is used to compare the distribution and key statistics, such as the median and IQR, of a variable between the two target classes.
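
The statistics a box plot draws can also be computed directly. The sketch below uses a hypothetical series; the 1.5 x IQR whisker limits are the usual matplotlib/seaborn default and are stated here as an assumption about the plotting defaults.

```python
import pandas as pd

# Hypothetical values
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])

# Quartiles: the box spans q1..q3 and the line inside it is the median
q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Points beyond these limits are drawn as outliers (fliers) by default
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

print(q1, median, q3, iqr)
print(lower_limit, upper_limit)
```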

In [None]:
#defining plotting function for box plots
def univariate_num(variable,label_orientation=False):
    
    #setting subplots & fig size
    fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(20,4))
    
    #plot1
    sns.boxplot(ax=ax1,data=app1,x=variable,showfliers=False)
    ax1.set_title('PAYMENT DEFAULTERS')
    #plot2
    sns.boxplot(ax=ax2,data=app0,x=variable,showfliers=False)
    ax2.set_title('NON-DEFAULTERS')
    #rotate tick labels on both axes (plt.xticks only affects the current axes)
    if label_orientation:
        for ax in (ax1, ax2):
            ax.tick_params(axis='x', labelrotation=90)

    plt.show() 


#### 1. Income 

<font color='red'>Note: Incomes of 1,000,000 and above are excluded from these plots</font> 


In [None]:
#plot size
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(20,8))
#plot 1 (histplot with a KDE overlay replaces the deprecated distplot)
sns.histplot(app1.AMT_INCOME_TOTAL[app1.AMT_INCOME_TOTAL <1000000],kde=True,stat='density',ax=ax1)
ax1.set_title('PAYMENT DEFAULTERS')
#plot2
sns.histplot(app0.AMT_INCOME_TOTAL[app0.AMT_INCOME_TOTAL <1000000],kde=True,stat='density',ax=ax2)
ax2.set_title('NON-DEFAULTERS')

plt.show()

<div class="alert alert-block alert-info">

**Observations:**

- Defaulters show a distinct peak in the lower income range (1L-2L).
    


#### 2. Annuity 

In [None]:
#plot size
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(20,8))
#plot 1 (histplot with a KDE overlay replaces the deprecated distplot)
sns.histplot(app1.AMT_ANNUITY,kde=True,stat='density',ax=ax1)
ax1.set_title('PAYMENT DEFAULTERS')
#plot2
sns.histplot(app0.AMT_ANNUITY,kde=True,stat='density',ax=ax2)
ax2.set_title('NON-DEFAULTERS')

plt.show()

<div class="alert alert-block alert-info">

**Observations:**

- Annuity amounts of defaulters are concentrated in a narrower range, whereas those of non-defaulters extend to higher amounts.


#### 3. Credit Amount 

In [None]:
#plot size
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(20,8))
#plot 1 (histplot with a KDE overlay replaces the deprecated distplot)
sns.histplot(app1.AMT_CREDIT,kde=True,stat='density',ax=ax1)
ax1.set_title('PAYMENT DEFAULTERS')
#plot2
sns.histplot(app0.AMT_CREDIT,kde=True,stat='density',ax=ax2)
ax2.set_title('NON-DEFAULTERS')

plt.show()

<div class="alert alert-block alert-info">

**Observations:**

- Defaulters are concentrated in the lower credit amount range.


#### 4. Ext Source Score 2 & 3

In [None]:
univariate_num('EXT_SOURCE_2',label_orientation=False)
univariate_num('EXT_SOURCE_3',label_orientation=False)

<div class="alert alert-block alert-info">

**Observations:**

- Payment Defaulters: the median EXT_SOURCE_3 score is less than 0.4
- Repayers: the median EXT_SOURCE_3 score is greater than 0.5


<div class="alert alert-block alert-info">

**Observations:**

- Payment Defaulters: the median EXT_SOURCE_2 score is less than 0.5
- Repayers: the median EXT_SOURCE_2 score is greater than 0.5


#### 5. Years Employed

<font color='red'>Note: Outliers suppressed</font> 


In [None]:
univariate_num('DAYS_EMPLOYED',label_orientation=False)

<div class="alert alert-block alert-info">

**Observations:**

- Payment Defaulters have typically (by median) been employed for less than 3 years
- Repayers have typically (by median) been employed for 5+ years


### II Categorical 

Bar charts are used to summarize the distributions of categorical variables. They give a general picture of the characteristics of the population that takes loans.

In [None]:
#Defining a function for univariate-categorical plots
def univariate_cat(variable,Title,label_orientation=False):

    plt.figure(figsize=(7,7))
    sns.countplot(data=app,x=variable,order=app[variable].value_counts().index)
    plt.title(Title)
    
    if(label_orientation==True):
        plt.xticks(rotation=90)

    plt.show()

#### 1. Gender

In [None]:
#Distribution of Gender who availed a Loan.
univariate_cat('CODE_GENDER','GENDER')

<div class="alert alert-block alert-info">

**Observations:**

- Females take more loans than males.

#### 2. Education  

In [None]:
#Education distribution of people who availed a loan.
univariate_cat('NAME_EDUCATION_TYPE','EDUCATION TYPE',label_orientation=True)

<div class="alert alert-block alert-info">

**Observations:**
- Applicants with secondary education take the most loans.
- The other education categories account for much smaller proportions.
    
 

#### 3.Age 

In [None]:
#Age distribution of people who availed a loan
univariate_cat('AGE_GROUP','AGE GROUP')

<div class="alert alert-block alert-info">

**Observations:**

- The 30-40 age group takes the most loans.
- The 60-70 age group takes the fewest.

#### 4.Income  

In [None]:
#Income group distribution of people who availed a loan.
univariate_cat('INCOME_GROUP','INCOME GROUPS',label_orientation=True)

<div class="alert alert-block alert-info">

**Observations:**

- People with income between 1 and 2 Lakhs take the most loans.


#### 5.Dependants 

In [None]:
#Distribution of dependants of loan availers.
univariate_cat('CNT_FAM_MEMBERS','DEPENDANTS',label_orientation=True)


<div class="alert alert-block alert-info">

**Observations:**

- Applicants with a family size of 2 (CNT_FAM_MEMBERS) take the most loans.


## Bivariate Analysis 

### I Numerical-Numerical

A scatter plot is used to examine the relationship between two numeric variables.
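
A visual impression from a scatter plot can be cross-checked with a Pearson correlation coefficient. The sketch below uses hypothetical credit and annuity values; the column names are illustrative.

```python
import pandas as pd

# Hypothetical values in which annuity rises with credit
toy = pd.DataFrame({
    "credit": [100, 200, 300, 400, 500],
    "annuity": [12, 19, 33, 41, 48],
})

# Pearson correlation between the two columns (close to 1 for a strong positive relation)
r = toy["credit"].corr(toy["annuity"])

print(round(r, 3))
```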

In [None]:
def bivariate_num_num(variable_x, variable_y,label_orientation=False):
    
    fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(20,8))
    
    sns.scatterplot(ax=ax1,data=app1,x=variable_x,y=variable_y)
    ax1.set_title('PAYMENT DEFAULTERS')
    
    sns.scatterplot(ax=ax2,data=app0,x=variable_x,y=variable_y)
    ax2.set_title('NON-DEFAULTERS')
    #rotate tick labels on both axes (plt.xticks only affects the current axes)
    if label_orientation:
        for ax in (ax1, ax2):
            ax.tick_params(axis='x', labelrotation=90)
    plt.tight_layout()

    plt.show() 

#### 1. Income vs Credit  


In [None]:
bivariate_num_num('AMT_INCOME_TOTAL','AMT_CREDIT')


<div class="alert alert-block alert-info">

**Observations:**

- The majority of payment defaulters are in the low income group.

- In many cases a higher credit amount is given to low income applicants; this should be looked into.


#### 2. Goods Price Vs Credit Amount 

In [None]:
bivariate_num_num('AMT_GOODS_PRICE','AMT_CREDIT')

#### 3.Income Vs Annuity 

In [None]:
bivariate_num_num('AMT_INCOME_TOTAL','AMT_ANNUITY')


<div class="alert alert-block alert-info">

**Observations:**

- No significant correlation observed.
- High annuity amounts observed for low income.


#### 4.Income vs EXT_SOURCE_3 

In [None]:
bivariate_num_num('AMT_INCOME_TOTAL','EXT_SOURCE_3')


<div class="alert alert-block alert-info">

**Observations:**

- No significant correlation between income and ext_source_3


#### 5. Age vs EXT_SOURCE_3 

In [None]:
bivariate_num_num('DAYS_BIRTH','EXT_SOURCE_3')


<div class="alert alert-block alert-info">

**Observations:**

- No correlation observed between age (DAYS_BIRTH) and EXT_SOURCE_3.


#### 6. Credit amount vs Annuity 

In [None]:
bivariate_num_num('AMT_CREDIT','AMT_ANNUITY')


<div class="alert alert-block alert-info">

**Observations:**

- As expected a positive correlation is observed between credit and annuity amount.


### ii CATEGORICAL-NUMERICAL

The relationship between a categorical variable and a numerical variable is analyzed.
A box plot shows the distribution of the numerical variable within each category, for both target values.

In [None]:
#defining plotting function
def bivariate_cat_num(variable_x, variable_y,hue1,label_orientation=False,V_layout=False):
    #Setting layout
    if V_layout==True:
        fig, (ax1, ax2) = plt.subplots(nrows=2, figsize=(20,15))
    else:
        fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(20,15))
    #plotting graph
    s=sns.boxplot(ax=ax1,data=app1,x=variable_x,y=variable_y,hue=hue1)
    ax1.set_title('PAYMENT DEFAULTERS')
    if(label_orientation==True):
        s.set_xticklabels(s.get_xticklabels(),rotation=90)
    
    
    s=sns.boxplot(ax=ax2,data=app0,x=variable_x,y=variable_y,hue=hue1)
    ax2.set_title('NON-DEFAULTERS')
    if(label_orientation==True):
        s.set_xticklabels(s.get_xticklabels(),rotation=90)
    plt.tight_layout()

    plt.show() 

#### 1. EDUCATION TYPE VS EXT_SOURCE_3

In [None]:
bivariate_cat_num('NAME_EDUCATION_TYPE','EXT_SOURCE_3','NAME_FAMILY_STATUS',V_layout=True)


<div class="alert alert-block alert-info">

**Observations:**

- Separated applicants have defaulted significantly despite having a higher average score.
- Non-defaulters have an average score greater than 0.4.


#### 2. Education vs  Credit Amount

In [None]:
bivariate_cat_num('NAME_EDUCATION_TYPE','AMT_CREDIT','NAME_FAMILY_STATUS',V_layout=True)


<div class="alert alert-block alert-info">

**Observations:**

- Higher Education category has received the highest credit amount.
    



#### 3. Gender vs Score 

In [None]:
bivariate_cat_num('CODE_GENDER','EXT_SOURCE_3','NAME_CONTRACT_TYPE')


<div class="alert alert-block alert-info">

**Observations:**

- Non-defaulters have a higher average score (greater than 0.5) than those with payment difficulties (less than 0.4).


#### 4.EDUCATION VS SCORE 

In [None]:
bivariate_cat_num('NAME_EDUCATION_TYPE','EXT_SOURCE_3','CODE_GENDER',label_orientation=True)


<div class="alert alert-block alert-info">

**Observations:**

- Academic degree holders with an EXT_SOURCE_3 score below 0.4 are most likely to default.
- Lower secondary males are the highest defaulters.


#### 5.Region rating Vs EXT_SOURCE_2 

In [None]:
bivariate_cat_num('REGION_RATING_CLIENT_W_CITY','EXT_SOURCE_2','CODE_GENDER',label_orientation=True)


<div class="alert alert-block alert-info">

**Observations:**

- Those with payment difficulty have lower EXT_SOURCE_2 average score.
- Those with region rating 3 and score less than 0.5 will most likely have payment difficulty.


### iii Cat-Cat 

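A pivot table is used to group the data by two categorical variables and compute the mean target value for each combination, i.e. the share of applicants with payment difficulty in that group. This gives an idea of the characteristics of the people with payment difficulty.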
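
Because TARGET is 0 or 1, its mean within a group is the fraction of that group with payment difficulty. A minimal sketch on a hypothetical frame (the rows are invented for illustration):

```python
import pandas as pd

# Hypothetical applicants
toy = pd.DataFrame({
    "INCOME_GROUP": ["Upto 1L", "Upto 1L", "1L-2L", "1L-2L", "1L-2L", "Upto 1L"],
    "AGE_GROUP":    ["<30", "<30", "<30", "30-40", "30-40", "30-40"],
    "TARGET":       [1, 0, 0, 0, 1, 0],
})

# Mean of TARGET per (income group, age group) = default rate of that cell
rates = pd.pivot_table(toy, values="TARGET", index="INCOME_GROUP",
                       columns="AGE_GROUP", aggfunc="mean")

print(rates)
```

Cells built from very few rows, as in this toy frame, give unstable rates, which is why group sizes should be checked alongside the table.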

In [None]:
#defining plotting functions
def bivariate_cat_cat(val,ind,col,title,label_orientation=False):
    
    #pivot table function
    table1=pd.pivot_table(data=app,values=val,index=ind,columns=col,aggfunc='mean')
    table1.plot(kind='bar',stacked=True,figsize=[20,10])
    plt.title(title)
    if(label_orientation==True):
        plt.xticks(rotation=45)
    plt.show() 
    print(table1)
    print(' ')
    print(app[ind].value_counts())

#### 1.AGE VS INCOME 

In [None]:
bivariate_cat_cat("TARGET",'INCOME_GROUP','AGE_GROUP','AGE VS INCOME')


<div class="alert alert-block alert-info">

**Observations:**

- Applicants under 30 earning less than 1 Lakh are the most likely to have payment difficulties.
- Income group 5L-10L shows the lowest chance of payment difficulty.
- Income group 'more than 10L' consists of outliers and is therefore treated as isolated events.
                    


#### 2. AGE VS INCOME TYPE 

In [None]:
bivariate_cat_cat("TARGET",'NAME_INCOME_TYPE','AGE_GROUP','AGE VS INCOME TYPE')


<div class="alert alert-block alert-info">

**Observations:**

- The Unemployed category shows the highest chance of payment difficulty.
- Pensioners in the <30 age group are at greater risk of payment difficulty.
- The values for 'Unemployed', 'Student', 'Businessman' and 'Maternity leave' are inconclusive because these categories have very few applicants.


#### 3. FAMILY STATUS VS EDUCATION TYPE 

In [None]:
bivariate_cat_cat("TARGET",'NAME_EDUCATION_TYPE','NAME_FAMILY_STATUS','FAMILY STATUS VS EDUCATION TYPE')


<div class="alert alert-block alert-info">

**Observations:**

- Lower secondary education combined with Civil marriage or Single status is the riskiest category for payment difficulty.
- Widows show a lower percentage of payment difficulty across all education types.
- The Academic degree education type shows a lower chance of payment difficulty.

#### 4.AGE VS HOUSING TYPE

In [None]:
bivariate_cat_cat("TARGET",'NAME_HOUSING_TYPE','AGE_GROUP','AGE VS HOUSING TYPE')


<div class="alert alert-block alert-info">

**Observations:**

- People living in Office apartments have the lowest risk of payment difficulty.
- People living in Rented apartments have the highest chance of payment difficulty.
- The <30 age group living in rented apartments shows a greater chance of payment difficulty.

#### 5. GENDER VS OCCUPATION TYPE 

In [None]:
bivariate_cat_cat("TARGET",'OCCUPATION_TYPE','CODE_GENDER','GENDER VS OCCUPATION TYPE')


<div class="alert alert-block alert-info">

**Observations:**

- Low-skill laborers are the riskiest group.
- Male realty agents show a 17% likelihood of payment difficulty.
- Female Waiter/Barmen staff are more likely to have payment difficulty than male Waiter/Barmen staff.
- Accountants are the least risky.

### Merging current and previous Dataset 

In [None]:
#Merging two DF
app_merge=pd.merge(app,pre_app, on='SK_ID_CURR',how='inner')
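
Columns that exist in both frames (other than the join key) receive the pandas default suffixes `_x` (left frame, current application) and `_y` (right frame, previous application) after the merge above. This is why names such as `AMT_CREDIT_y` appear in the following sections. A sketch with hypothetical rows:

```python
import pandas as pd

# Hypothetical current and previous applications sharing the AMT_CREDIT column name
current = pd.DataFrame({"SK_ID_CURR": [1, 2, 3], "AMT_CREDIT": [500, 800, 300]})
previous = pd.DataFrame({"SK_ID_CURR": [1, 1, 3], "AMT_CREDIT": [200, 250, 100]})

# Inner join: one row per matching previous application; client 2 has none and is dropped
merged = pd.merge(current, previous, on="SK_ID_CURR", how="inner")

print(merged.columns.tolist())
print(len(merged))
```

A client with several previous applications appears several times in the merged frame, so counts in the merged data are counts of previous applications rather than of clients.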

#### Inspecting the DF 

In [None]:
app_merge.head()

In [None]:
#info
app_merge.info(verbose=True)


In [None]:
#Checking the missing value %
app_merge.isnull().mean().round(4)*100

In [None]:
#Dividing the Dataset into two based on Target
app1_merge=app_merge[app_merge.TARGET==1]
app0_merge=app_merge[app_merge.TARGET==0]


## Univariate Analysis 

### i.Categorical


In [None]:
#defining plots
def merge_univariate_cat(variable,hue1,V_layout=False,label_orientation=False):
    #setting layouts
    if V_layout==True:
        fig, (ax1, ax2) = plt.subplots(nrows=2, figsize=(20,15))
    else:
        fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(15,10))
    
    #plotting
    s=sns.countplot(ax=ax1,data=app0_merge,x=variable,hue=hue1,order=app0_merge[variable].value_counts().index)
    ax1.set_title("NON-DEFAULTERS")
    if(label_orientation==True):
        s.set_xticklabels(s.get_xticklabels(),rotation=90)
    
    s=sns.countplot(ax=ax2,data=app1_merge,x=variable,hue=hue1,order=app1_merge[variable].value_counts().index)
    ax2.set_title("PAYMENT DEFAULTERS")
    if(label_orientation==True):
        s.set_xticklabels(s.get_xticklabels(),rotation=90)
    


    plt.show()

#### 1.NAME_CONTRACT_STATUS 

In [None]:
merge_univariate_cat('NAME_CONTRACT_STATUS','CODE_GENDER')


<div class="alert alert-block alert-info">

**Observations:**

- Among defaulters whose previous loan applications were approved, the proportion of males is higher.
- As expected, payment defaulters had a higher proportion of previously refused loan applications.



#### 2. NAME_CLIENT_TYPE 

In [None]:
merge_univariate_cat('NAME_CLIENT_TYPE','NAME_CONTRACT_STATUS')


<div class="alert alert-block alert-info">

**Observations:**

- Among repeat customers with loan repayment difficulty, a higher ratio had previous applications rejected.
 


#### 3. NAME_CONTRACT_TYPE 

In [None]:
merge_univariate_cat('NAME_CONTRACT_TYPE_y','NAME_CONTRACT_STATUS',label_orientation=True)


<div class="alert alert-block alert-info">

**Observations:**

- Consumer/Retail loan applicants form the larger share in both groups (those with and without payment difficulty).



### ii NUMERIC 

In [None]:
#defining plotting functions
def merge_univariate_num(variable,label_orientation=False):
    #setting plots
    fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(20,4))
    
    sns.boxplot(ax=ax1,data=app1_merge,x=variable,showfliers=False)
    ax1.set_title('PAYMENT DEFAULTERS')
    
    sns.boxplot(ax=ax2,data=app0_merge,x=variable,showfliers=False)
    ax2.set_title('NON-DEFAULTERS')
    #rotate tick labels on both axes (the original referenced an undefined variable s)
    if label_orientation:
        for ax in (ax1, ax2):
            ax.tick_params(axis='x', labelrotation=90)

    plt.show() 


#### 1. AMT_CREDIT_y 

In [None]:
merge_univariate_num('AMT_CREDIT_y')


<div class="alert alert-block alert-info">

**Observations:**

- Up to the 75th percentile, applicants were credited lower amounts (less than 220K) in both segments.

In [None]:
merge_univariate_num('AMT_APPLICATION')


<div class="alert alert-block alert-info">

**Observations:**

- Up to the 75th percentile, applicants applied for lower amounts of credit (less than 200K) in both segments.

In [None]:
merge_univariate_num('AMT_ANNUITY_y')


<div class="alert alert-block alert-info">

**Observations:**

- Up to the 75th percentile, applicants were approved for loans with lower annuities (longer-term and/or low-interest loans).

## Bivariate Analysis 

### i) Categorical-Numerical

In [None]:
#defining plotting functions
def merge_bivariate_cat_num(variable_x, variable_y,label_orientation=False,V_layout=False):
    #setting layout
    if V_layout==True:
        fig, (ax1, ax2) = plt.subplots(nrows=2, figsize=(20,15))
    else:
        fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(20,10))
    
    sns.boxplot(ax=ax1,data=app1_merge,x=variable_x,y=variable_y,showfliers=False)
    ax1.set_title('PAYMENT DEFAULTERS')
    
    sns.boxplot(ax=ax2,data=app0_merge,x=variable_x,y=variable_y,showfliers=False)
    ax2.set_title('NON-DEFAULTERS')
    #rotate tick labels on both axes (the original referenced an undefined variable s)
    if label_orientation:
        for ax in (ax1, ax2):
            ax.tick_params(axis='x', labelrotation=90)
    plt.tight_layout()

    plt.show() 

#### 1. Name_contract_status vs AMT_CREDIT_Y 

In [None]:
merge_bivariate_cat_num('NAME_CONTRACT_STATUS', 'AMT_CREDIT_y')


<div class="alert alert-block alert-info">

**Observations:**
    
- Surprisingly, previously approved applications had lower credit amounts (AMT_CREDIT_y) up to the 75th percentile, whereas both the 75th percentile and the highest credit amount of previously rejected applications are much higher. This could be flagged as a potential leakage, where previous rejections are approved for higher credit without due consideration of the reasons behind those rejections.

#### 2.NAME_CONTRACT_STATUS VS EXT_SOURCE_3 

In [None]:
merge_bivariate_cat_num('NAME_CONTRACT_STATUS','EXT_SOURCE_3')


<div class="alert alert-block alert-info">

**Observations:**

- Based on the box plot, the external credit score appears reliable: above the 25th percentile, applicants whose previous applications were approved fall in a higher score bracket than those whose previous applications were rejected. However, the credit score should always be considered together with other drivers when assessing customer default.



#### 3.NAME_CONTRACT_STATUS VS EXT_SOURCE_2 

In [None]:
merge_bivariate_cat_num('NAME_CONTRACT_STATUS','EXT_SOURCE_2')


<div class="alert alert-block alert-info">

**Observations:**

- Based on the box plot, the external credit score appears reliable: above the 25th percentile, applicants whose previous applications were approved fall in a higher score bracket than those whose previous applications were rejected. However, the credit score should always be considered together with other drivers when assessing customer default.





### ii) NUMERICAL-NUMERICAL 

In [None]:
#defining plotting functions
def merge_bivariate_num_num(variable_x, variable_y,label_orientation=False):
    
    fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(20,8))
    
    sns.scatterplot(ax=ax1,data=app1_merge,x=variable_x,y=variable_y)
    ax1.set_title('PAYMENT DEFAULTERS')
    
    sns.scatterplot(ax=ax2,data=app0_merge,x=variable_x,y=variable_y)
    ax2.set_title('NON-DEFAULTERS')
    #rotate tick labels on both axes (the original referenced an undefined variable s)
    if label_orientation:
        for ax in (ax1, ax2):
            ax.tick_params(axis='x', labelrotation=90)
    plt.tight_layout()

    plt.show() 

#### 1. APPLICATION AMOUNT VS CREDIT AMOUNT 

In [None]:
merge_bivariate_num_num('AMT_APPLICATION','AMT_CREDIT_y')


<div class="alert alert-block alert-info">

**Observations:**

- As expected, the amount credited is directly correlated with the loan amount applied for. There are no outliers where an exceptional amount was credited against a lower amount applied for.



#### 2. ANNUITY AMOUNT VS CREDIT AMOUNT 

In [None]:
merge_bivariate_num_num('AMT_ANNUITY_y','AMT_CREDIT_y')


<div class="alert alert-block alert-info">

**Observations:**

- A small increase in annuity corresponds to a larger increase in the amount credited. So for a moderately risky customer, the credit amount can be left unchanged provided the applicant is willing to accept a higher annuity, a higher interest charge and/or a shorter loan term.



#### 3. CREDIT AMOUNT VS GOODS PRICE 

In [None]:
merge_bivariate_num_num('AMT_GOODS_PRICE_y','AMT_CREDIT_y')


<div class="alert alert-block alert-info">

**Observations:**

- As expected, the amount credited is directly correlated with the price of the goods under loan consideration. There are no outliers where an exceptional amount was credited for a low-value good. From a forensic perspective, the visualization does not suggest fraud or collusion between an applicant and a bank employee, or a violation of bank norms.



### iii) CATEGORICAL-CATEGORICAL 

In [None]:
#defining plotting functions
def merge_bivariate_cat_cat(val,ind,col,title,label_orientation=False):
    
    
    table1=pd.pivot_table(data=app_merge,values=val,index=ind,columns=col,aggfunc='mean')
    table1.plot(kind='bar',stacked=True,figsize=[20,10])
    plt.title(title)
    if(label_orientation==True):
        plt.xticks(rotation=45)
    plt.show() 
    print(table1)
    print(' ')
    print(app_merge[ind].value_counts())

#### 1.PORTFOLIO VS CONTRACT STATUS 

In [None]:
merge_bivariate_cat_cat("TARGET",'NAME_PORTFOLIO','NAME_CONTRACT_STATUS','PORTFOLIO VS CONTRACT STATUS')


<div class="alert alert-block alert-info">

**Observations:**

- POS portfolio loans have the highest payment defaults, thereby having a low approval rate.
- Defaults on Car loans are very rare.



#### 2.GENDER VS CONTRACT STATUS 

In [None]:
merge_bivariate_cat_cat("TARGET",'CODE_GENDER','NAME_CONTRACT_STATUS','GENDER VS CONTRACT STATUS')


<div class="alert alert-block alert-info">

**Observations:**

- Males had a higher proportion of their previous loans rejected than females.
    

#### 3.YIELD GROUP VS CONTRACT STATUS 

In [None]:
merge_bivariate_cat_cat("TARGET",'NAME_YIELD_GROUP','NAME_CONTRACT_STATUS','YIELD GROUP VS CONTRACT STATUS')


<div class="alert alert-block alert-info">

**Observations:**

- High yield loans have a high risk of defaulting.
- The refusal rate is low for Low action loans.



# Miscellaneous Plots to help identify potentially less risky customers

As part of the miscellaneous analysis, we profiled customers as low risk or high risk, segmented the merged data into applicants with and without payment difficulty, and performed bivariate analysis on these segments.
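
The profiling rests on a per-client approval rate computed from previous applications. The sketch below shows the idea on hypothetical rows; it mirrors the notebook's choice of counting 'Approved' and 'Canceled' as 1, and the 0.75 cut-off is the one used later in this section.

```python
import pandas as pd

# Hypothetical previous applications for two clients
prev = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 1, 1, 2, 2],
    "NAME_CONTRACT_STATUS": ["Approved", "Approved", "Canceled",
                             "Refused", "Refused", "Approved"],
})

# 1 for 'Approved' or 'Canceled', 0 otherwise
prev["Status"] = prev["NAME_CONTRACT_STATUS"].isin(["Approved", "Canceled"]).astype(int)

# Mean of the 0/1 flag per client = share of previous applications counted as approved
approval_rate = prev.groupby("SK_ID_CURR")["Status"].mean()

# Clients at or above the 0.75 cut-off
high_approval = approval_rate[approval_rate >= 0.75]

print(approval_rate.to_dict())
print(high_approval.index.tolist())
```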

In [None]:
#Filtering out rows having contract status 'Unused offer'
#(.copy() avoids a SettingWithCopyWarning when a column is added later)
pre_app_F = pre_app[~(pre_app['NAME_CONTRACT_STATUS']=='Unused offer')].copy()
pre_app_F.NAME_CONTRACT_STATUS.value_counts()


In [None]:
#Quantifying Contract status into a new variable 'Status' (1 for 'Approved' or 'Canceled', 0 otherwise)
pre_app_F["Status"]= pre_app_F['NAME_CONTRACT_STATUS'].isin(['Approved','Canceled']).astype(int)
pre_app_F.Status.value_counts()

In [None]:
#Finding avg approval rate of each applicants using groupby function
pre_app_new = pre_app_F.groupby('SK_ID_CURR')['Status'].mean()
pre_app_new.head(5)

In [None]:
#merging the values to the main dataset
pre_app_F=pd.merge(app,pre_app_new, on='SK_ID_CURR',how='inner')


In [None]:
# * app_merge_11 is all customers from merged data with payment difficulty
# * app_merge_01 is all customers from merged data with no payment difficulty
app_merge_11=pre_app_F[pre_app_F['TARGET']==1]
app_merge_01=pre_app_F[pre_app_F['TARGET']==0]

In [None]:
# * app_merge_1: high risk customers, with payment difficulty and less than 75% of previous loan applications approved
# * app_merge_0: low risk customers, with no payment difficulty and at least 75% of previous loan applications approved
app_merge_1=app_merge_11[app_merge_11['Status']<.75 ]
app_merge_0=app_merge_01[app_merge_01['Status']>=.75 ]
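
The steps above (filter, flag, average per applicant, merge, segment) can be traced end to end on a tiny example. The sketch below uses two hypothetical frames, `prev` and `cur`, with invented rows; only the column names and the 0.75 threshold come from the cells above.

```python
import pandas as pd

# Hypothetical previous-application records for three applicants.
prev = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 1, 1, 2, 2, 3],
    "NAME_CONTRACT_STATUS": ["Approved", "Approved", "Approved", "Refused",
                             "Refused", "Approved", "Unused offer"],
})
# Hypothetical current applications with the payment-difficulty flag.
cur = pd.DataFrame({"SK_ID_CURR": [1, 2, 3], "TARGET": [0, 1, 0]})

# Drop unused offers, then flag Approved/Canceled as 1 (same rule as above).
prev = prev[prev["NAME_CONTRACT_STATUS"] != "Unused offer"].copy()
prev["Status"] = prev["NAME_CONTRACT_STATUS"].isin(["Approved", "Canceled"]).astype(int)

# The mean of a 0/1 flag per applicant is the share of applications counted as approved.
ratio = prev.groupby("SK_ID_CURR")["Status"].mean().reset_index()

# The inner merge keeps only applicants that still have previous applications.
merged = cur.merge(ratio, on="SK_ID_CURR", how="inner")

low_risk = merged[(merged["TARGET"] == 0) & (merged["Status"] >= 0.75)]
high_risk = merged[(merged["TARGET"] == 1) & (merged["Status"] < 0.75)]
print(merged)
```

Applicant 3 disappears from `merged` because the only previous record was an unused offer, which is worth keeping in mind when reading segment sizes.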

### Cont - Cont Bivariate Analysis

#### 1. Credit Score vs Previous Loan Approved:Rejection ratio


In [None]:
pre_app_F[['EXT_SOURCE_3','Status']].corr()

<div class="alert alert-block alert-info">
    
**Observations:**
The correlation between the credit score (EXT_SOURCE_3) and the previous application Approved:Rejection ratio is weak; we expected a stronger positive correlation. The scatter plot below visualizes this and shows no evident relation between the credit score and the previous application status.
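
`DataFrame.corr()` defaults to the Pearson coefficient, which only captures linear association. As a hedged cross-check, a rank-based (Spearman) coefficient can be computed the same way. The toy values below are invented and say nothing about the real data; they only show the calls and how missing scores are handled.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values; EXT_SOURCE_3 contains a missing score on purpose.
toy = pd.DataFrame({
    "EXT_SOURCE_3": [0.1, 0.2, np.nan, 0.4, 0.5, 0.9],
    "Status":       [0.0, 0.5, 1.0, 0.6, 0.75, 1.0],
})

# corr() ignores rows where either value is missing (pairwise-complete observations).
pearson = toy.corr(method="pearson")
spearman = toy.corr(method="spearman")

# Number of rows that actually enter the calculation.
n_used = toy.dropna().shape[0]
print(pearson, spearman, n_used, sep="\n")
```

If the two coefficients differ noticeably on the real data, the relation may be monotonic but not linear.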


In [None]:
plt.figure(figsize=(20,8))
plt.subplot(1,2,1)
plt.title('Customers with Payment Difficulty')
#x and y must come from the same segment
sns.scatterplot(x=app_merge_11.EXT_SOURCE_3, y=app_merge_11.Status)

plt.subplot(1,2,2)
plt.title('Customers with No Payment Difficulty')
sns.scatterplot(x=app_merge_01.EXT_SOURCE_3, y=app_merge_01.Status)

plt.show()

<div class="alert alert-block alert-info">

 **Observations:**
     As the scatter plots show, there is no evident relation between the credit score and the previous application status.

#### 2. Income Vs Approved:Rejection ratio (between target = 0 & target =1)

In [None]:
plt.figure(figsize=(20,8))
plt.subplot(1,2,1)
plt.title('Customers with Payment Difficulty')
#Filter the rows once so that x and y come from the same subset (income below 1,000,000)
sns.scatterplot(x='AMT_INCOME_TOTAL', y='Status', data=app_merge_11[app_merge_11.AMT_INCOME_TOTAL < 1000000])

plt.subplot(1,2,2)
plt.title('Customers with No Payment Difficulty')
sns.scatterplot(x='AMT_INCOME_TOTAL', y='Status', data=app_merge_01[app_merge_01.AMT_INCOME_TOTAL < 1000000])
plt.show()

<div class="alert alert-block alert-info">

 **Observations:**
    The low income group (200K to 400K) showed the most payment defaults, whereas the high income group mostly repaid on time. Still, a good share of low income applicants made no defaults on their current loan. Surprisingly, the previous application rejection ratio has no visible influence on this.
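
A scatter plot shows where points fall but not the default rate within each income band. One possible way to quantify the observation is to bin income and average the `TARGET` flag per band. The sketch below uses invented rows and hypothetical cut points; `INCOME_BAND` is an illustrative column name, not one from the dataset.

```python
import pandas as pd

# Hypothetical toy sample; incomes and flags are invented for illustration.
toy = pd.DataFrame({
    "AMT_INCOME_TOTAL": [90_000, 150_000, 250_000, 300_000, 450_000, 800_000],
    "TARGET":           [1,      1,       1,       0,       0,       0],
})

# Hypothetical cut points; adjust to whatever bands the analysis needs.
bins = [0, 200_000, 400_000, 1_000_000]
labels = ["<200K", "200K-400K", "400K-1M"]
toy["INCOME_BAND"] = pd.cut(toy["AMT_INCOME_TOTAL"], bins=bins, labels=labels)

# The mean of a 0/1 flag is the default rate; size gives the band's sample count.
summary = toy.groupby("INCOME_BAND", observed=False)["TARGET"].agg(["mean", "size"])
summary = summary.rename(columns={"mean": "default_rate", "size": "applicants"})
print(summary)
```

Reporting the applicant count next to the rate helps avoid over-reading bands with very few customers.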


#### 3. Credit Amount Vs Approved:Rejection ratio (between target = 0 & target =1)

In [None]:
plt.figure(figsize=(20,8))
plt.subplot(1,2,1)
#plt.yticks(rotation = 45)
plt.title('Customers with Payment Difficulty')
sns.scatterplot(y=app_merge_11.AMT_CREDIT, x=app_merge_11.Status)

plt.subplot(1,2,2)
#plt.yticks(rotation = 45)
plt.title('Customers with No Payment Difficulty')
sns.scatterplot(y=app_merge_01.AMT_CREDIT, x=app_merge_01.Status)
plt.tight_layout()
plt.show()

#### 4. Income VS Credit Score (between High Risk & Low Risk Customers)

In [None]:
plt.figure(figsize=(20,8))
plt.subplot(1,2,1)
plt.title('High Risk Customers')
#Filter the rows once so that x and y come from the same subset (income below 1,000,000)
sns.scatterplot(x='AMT_INCOME_TOTAL', y='EXT_SOURCE_3', data=app_merge_1[app_merge_1.AMT_INCOME_TOTAL < 1000000])

plt.subplot(1,2,2)
plt.title('Low Risk Customers')
sns.scatterplot(x='AMT_INCOME_TOTAL', y='EXT_SOURCE_3', data=app_merge_0[app_merge_0.AMT_INCOME_TOTAL < 1000000])
plt.show()

<div class="alert alert-block alert-info">

 **Observations:**
High risk customers (those with a higher rejection ratio) are concentrated in the low income group.
As the second graph shows, the high income group is generally less risky; a customer's income is a major driving factor of their creditworthiness.

#### 5. Income VS Age of Customers (between High Risk & Low Risk Customers)

In [None]:
plt.figure(figsize=(20,8))
plt.subplot(1,2,1)
plt.title('High Risk Customers')
#Filter the rows once so that x and y come from the same subset (income below 1,000,000)
sns.scatterplot(x='AMT_INCOME_TOTAL', y='DAYS_BIRTH', data=app_merge_1[app_merge_1.AMT_INCOME_TOTAL < 1000000])

plt.subplot(1,2,2)
plt.title('Low Risk Customers')
sns.scatterplot(x='AMT_INCOME_TOTAL', y='DAYS_BIRTH', data=app_merge_0[app_merge_0.AMT_INCOME_TOTAL < 1000000])
plt.show()

<div class="alert alert-block alert-info">

   **Observations:**
 Age has no evident influence on a customer's risk exposure, whether viewed against the customer's income or the previous loan rejection ratio.
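
Because `DAYS_BIRTH` is measured in days, the y-axis of the plots above is hard to read as an age. A minimal sketch of converting it to years and banding it follows; it assumes the column still holds day counts (negative or positive, hence `abs()`), and `AGE_YEARS` and `AGE_BAND` are hypothetical names used only here.

```python
import pandas as pd

# Hypothetical toy values in days; the sign is handled by abs().
toy = pd.DataFrame({"DAYS_BIRTH": [-9000, -12000, -16500, -21000]})

# Convert days to (approximate) years.
toy["AGE_YEARS"] = toy["DAYS_BIRTH"].abs() / 365.25

# Left-closed bands: [0, 30), [30, 40), and so on.
toy["AGE_BAND"] = pd.cut(
    toy["AGE_YEARS"],
    bins=[0, 30, 40, 50, 60, 100],
    labels=["<30", "30-40", "40-50", "50-60", "60+"],
    right=False,
)
print(toy)
```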


#### 6.AGE GROUP VS PREVIOUS APPLICATION STATUS

In [None]:
plt.figure(figsize=(20,8))
plt.subplot(1,2,1)
sns.boxplot(x = "AGE_GROUP",y='Status',data = app_merge_11,hue='CODE_GENDER')
#plt.xticks(rotation='vertical')
#plt.ylabel(" Counts")
plt.title("with payment difficulty")
plt.subplot(1,2,2)
sns.boxplot(x = "AGE_GROUP",y='Status',data = app_merge_01,hue='CODE_GENDER')
#plt.xticks(rotation='vertical')
#plt.ylabel(" Counts")
plt.title("no payment difficulty")
plt.show()

<div class="alert alert-block alert-info">
    
  **Observations:**
No considerable difference in approval ratio is observed between the age groups, but the plots still allow us to profile customers across the age groups.

<div class="alert alert-block alert-info">
    
  **Observations:**
    
1. Customers with a higher rejection ratio who defaulted on the current application (the lower part of graph 1, age group 20-50) have high risk profiles. Their loans could be rejected if their external credit score is low, their income is relatively low and the credit amount applied for is high.
     
2. More than 25% of the people who made no default on the current loan have a lower rejection ratio (higher approval ratio, > 0.75) across all age groups. They are low risk customers who could be given higher credit amounts in future applications. One thing to keep in mind is the number of approved loans a particular customer holds to date.
     
3. Customers with a higher rejection ratio who have not defaulted on the current application (the lower part of graph 2, across age groups) have moderate risk profiles. They could be granted loans if their external credit score is high and reliable, their income is relatively high and the credit amount applied for is low.
 
    a. They could be granted loans with a smaller credit amount.
    
    b. They could be granted credit at a higher interest rate, provided their income is higher and their credit score is reliably higher.
        

4. Customers who had a higher approval ratio but defaulted on the current application could be granted loans with a smaller credit amount, since a higher credit amount would attract higher interest charges and further stress the customer financially.
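
The four profiles in the list above follow from two yes/no questions: did the customer default on the current loan, and were at least 75% of the previous applications approved? A hedged sketch of assigning them with `np.select` is shown below; the toy rows and the `RISK_PROFILE` column are illustrative assumptions, and the labels borrow the wording used in the recommendations.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows, one per profile.
toy = pd.DataFrame({
    "TARGET": [1, 0, 0, 1],
    "Status": [0.40, 0.90, 0.50, 0.80],
})

defaulted = toy["TARGET"] == 1
mostly_approved = toy["Status"] >= 0.75

# Conditions are checked in order; they are mutually exclusive here.
conditions = [
    defaulted & ~mostly_approved,    # point 1: high rejection ratio, defaulted
    ~defaulted & mostly_approved,    # point 2: high approval ratio, no default
    ~defaulted & ~mostly_approved,   # point 3: high rejection ratio, no default
    defaulted & mostly_approved,     # point 4: high approval ratio, defaulted
]
labels = ["High", "Low", "Moderately low", "Medium"]

# The string default keeps the result dtype consistent with the labels.
toy["RISK_PROFILE"] = np.select(conditions, labels, default="Unknown")
print(toy)
```

Credit score, income and credit amount would still need to be checked separately, as each point above notes.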

#### <font color=red>**RISKY APPLICANTS:**</font>
<div class="alert alert-block alert-info">
    
 
    

- The mean EXT_SOURCE_3 score is less than 0.4
- The mean EXT_SOURCE_2 score is less than 0.5
- Have been employed for an average of less than 3 years
- Men with lower secondary education
- Age group <30 earning less than 1 lakh and living in rented apartments
- Those with region rating 3 and a score less than 0.5
- Unemployed category and pensioners in the age group <30 are at greater risk of payment difficulty
- Lower secondary education: civil marriage and single people
- People living in rented apartments
- Age group <30 living in rented apartments shows a greater chance of payment difficulty
- Low-skill laborers
- Male realty agents
- Female waiter/barmen staff

<div class="alert alert-block alert-success">

**Conclusion**<br>
We identified the following variables as significant drivers of credit default by an applicant:
1. Income
2. Previous Loan Rejection Ratio
3. External Credit Score
4. Education Category
5. Family Status
6. Occupation Type
7. No. of Approved Loans currently outstanding

We also identified the following patterns on ‘if a client has difficulty paying their installments’:

- Males exhibited more payment difficulty, although there were more female applicants
- Applicants living in rented apartments or with parents exhibited a greater chance of payment default
- Low-skill laborers are prone to payment default
- As expected, unemployed and lower income applicants exhibited a higher payment default tendency



<div class="alert alert-block alert-success">

**Recommendations**
    
Based on the customer profiles and credit default drivers, we recommend:<br>
- High Risk profiles: Loan applications could be rejected if the external credit score is low, income levels are low and the credit amount applied for is high.

- Low Risk profiles: They should be extended higher credit amounts in future applications. The number of previously approved loans a customer holds to date should be checked.

- Moderately Low Risk profiles:
  - Could be granted loans with a smaller credit amount.
  - Could be granted credit at a higher interest rate, provided income is higher and the credit score is reliably higher.

- Medium Risk profiles: Could be granted loans with a smaller credit amount, since a higher credit amount would attract a higher annuity and further stress an already defaulting customer.

