In [1]:
# Filtering out warnings
import warnings
warnings.filterwarnings('ignore')

In [1]:
# Importing the Required Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly
import plotly.express as px
%matplotlib inline


In [1]:
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
pd.set_option('display.expand_frame_repr', False)

# <font color = blue > EDA Case Study </font>

### <font color = Green > Business Understanding </font>

Loan providers find it hard to lend to people with an insufficient or non-existent credit history, and some consumers take advantage of this by defaulting. **PS: Suppose you work for a consumer finance company which specialises in lending various types of loans to urban customers. You have to use EDA to analyse the patterns present in the data. This will ensure that the applicants capable of repaying the loan are not rejected.**


When the company receives a loan application, it has to decide whether to approve the loan based on the applicant’s profile. Two types of risk are associated with this decision:

+ If the applicant is likely to repay the loan, then not approving the loan results in a loss of business to the company

+ If the applicant is not likely to repay the loan, i.e. he/she is likely to default, then approving the loan may lead to a financial loss for the company.

The data given below contains information about each loan application at the time of applying. It covers two types of scenarios:

**The client with payment difficulties:** he/she had a late payment of more than X days on at least one of the first Y instalments of the loan in our sample.

**All other cases:** all other cases, in which payments were made on time.

When a client applies for a loan, there are four types of decisions that could be taken by the client or the company:

+ **Approved**: The company has approved the loan application.

+ **Cancelled**: The client cancelled the application sometime during approval. Either the client changed her/his mind about the loan or, in some cases, received worse pricing because of a higher risk profile and did not want it.

+ **Refused**: The company had rejected the loan (because the client does not meet their requirements etc.).

+ **Unused offer**: The loan has been cancelled by the client, but at a different stage of the process.

In this case study, you will use EDA to understand how **consumer attributes** and **loan attributes** influence the tendency of default.


### <font color = Green > Business Objectives </font>

This case study aims to **identify patterns which indicate if a client has difficulty paying their installments**, which may be used for taking actions such as denying the loan, reducing the amount of loan, lending (to risky applicants) at a higher interest rate, etc. This will ensure that the consumers capable of repaying the loan are not rejected. Identification of such applicants using EDA is the aim of this case study. **What are your driving variables?**

In other words, the company wants to understand the driving factors (or driver variables) behind loan default, i.e. the variables which are strong indicators of default.  The company can utilise this knowledge for its portfolio and risk assessment.

### <font color = Green > Results Expected by Learners: </font>
    
+ Present the overall approach of the analysis in a presentation. Mention the problem statement and the analysis approach briefly.

+ Identify the missing data and use an appropriate method to deal with it (remove the columns or replace the missing values with an appropriate value).

**Hint:** In EDA it is not necessary to replace missing values, but if you do have to replace them, decide what the approach should be and state it clearly.

+ Identify if there are outliers in the dataset, and mention why you think each is an outlier. Again, remember that for this exercise it is not necessary to remove any data points.

+ Identify if there is data imbalance in the data. Find the ratio of data imbalance.

**Hint:** How will you analyse the data in case of data imbalance? You can plot more than one type of plot to analyse the different aspects due to data imbalance. For example, you can choose your own scale for the graphs, i.e. one can plot in terms of percentage or absolute value. Do this analysis for the **‘Target variable’** in the dataset **( clients with payment difficulties and all other cases).** Use a mix of univariate and bivariate analysis etc.

**Hint:** Since there are a lot of columns, you can run your analysis in loops for the appropriate columns and find the insights.

+ Explain the results of univariate, segmented univariate, bivariate analysis, etc. in business terms.

+ Find the top 10 correlations for the **Client with payment difficulties and all other cases (Target variable)**. Note that you have to segment the data frame w.r.t. the target variable, find the top correlations within each segment, and check whether they yield any insight. Say there are 5+1 (target) variables in a dataset: **Var1, Var2, Var3, Var4, Var5, Target.** If you have to find the top 3 correlations, they could be: **Var1 & Var2, Var2 & Var3, Var1 & Var3.** The target variable will not feature in these correlations, as it is a categorical variable and not a continuous variable which is increasing or decreasing.

+ Include visualisations and summarise the most important results in the presentation. You are free to choose the graphs which explain the numerical/categorical variables. Insights should explain why the variable is important for differentiating the **clients with payment difficulties from all other cases.**
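The segmented top-correlation task described in the list above could be sketched roughly as follows. This is a minimal illustration on made-up data: the helper name `top_correlations` and the toy columns are assumptions for the example, not part of the case study.

```python
import numpy as np
import pandas as pd

def top_correlations(df, n=10):
    """Return the n strongest absolute pairwise correlations among numeric columns."""
    corr = df.select_dtypes(include="number").corr().abs()
    # Keep only the upper triangle so each pair appears once and the diagonal is dropped.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(ascending=False).head(n)

# Toy data standing in for the application table (values are made up).
rng = np.random.default_rng(0)
toy = pd.DataFrame({"Var1": rng.normal(size=200)})
toy["Var2"] = toy["Var1"] * 2 + rng.normal(scale=0.1, size=200)
toy["Var3"] = rng.normal(size=200)
toy["Target"] = rng.integers(0, 2, size=200)

# Segment by the target first, then rank correlations inside each segment.
top_by_segment = {
    label: top_correlations(segment.drop(columns="Target"), n=3)
    for label, segment in toy.groupby("Target")
}
```

Dropping the target column before ranking mirrors the note that the target itself does not feature in the correlations.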

##  Task - 1: Reading and understanding the dataset - 1

### Subtask 1.1: Importing the current application data.

In [1]:
# Read the csv file using 'read_csv'. 
appl_df = pd.read_csv('../input/application-data/application_data.csv')
appl_df.head()

###  Subtask 1.2: Inspecting the Dataframe


In [1]:
# Check the number of rows and columns in the dataframe

appl_df.shape

In [1]:
# Check the column-wise info and datatypes of the dataframe

appl_df.info(verbose=True)

###  Subtask 1.3: Dealing with Null Values in Columns above 35%

In [1]:
# Checking Null Values

null_col = round((appl_df.isnull().sum()*100/len(appl_df)).sort_values(ascending = False),2)
null_col_35 = null_col[null_col>35] 
print(null_col_35)
print()
print("Number of Columns having missing values more than 35% :",len(null_col_35))

In [1]:
null_col_35.index

# Some of these columns may seem relevant, but since more than 35% of their data is missing, let's drop them.

In [1]:
# Dropping Columns Having Missing Values more than 35%

appl_df.drop(columns = null_col_35.index, inplace = True)

In [1]:
appl_df.shape # Now there are 73 columns remaining. 
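The threshold-based drop above can also be wrapped in a small reusable helper. The sketch below uses a made-up three-column frame; the function name `drop_sparse_columns` and its `threshold_pct` parameter are hypothetical names chosen for illustration.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df, threshold_pct=35.0):
    """Return (df without columns whose missing share exceeds threshold_pct, dropped names)."""
    missing_pct = df.isnull().mean() * 100
    to_drop = missing_pct[missing_pct > threshold_pct].index
    return df.drop(columns=to_drop), list(to_drop)

toy = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],   # 75% missing
    "some_missing": [1.0, np.nan, 3.0, 4.0],           # 25% missing
    "complete": [1, 2, 3, 4],                          # 0% missing
})
cleaned, dropped = drop_sparse_columns(toy)
```

Returning a new frame rather than modifying in place keeps the original available for comparison.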

### Subtask 1.4: Dealing with rest of the Null Values in Columns 

In [1]:
def remain_null_col(df):
    # Percentage of missing values per column of the dataframe passed in
    return round(df.isnull().sum()*100/len(df),4)

In [1]:
# Checking null values greater than 0, including the minuscule ones

remain_null_col(appl_df)[remain_null_col(appl_df)>0]

In [1]:
# Identifying the spread of Annuity Amount Range

appl_df['AMT_ANNUITY'].describe()

In [1]:
appl_df['AMT_ANNUITY'].mode()

In [1]:
sns.set_style("dark")
plt.style.use("ggplot")
plt.figure(figsize=[10,6])
sns.distplot(appl_df['AMT_ANNUITY'], rug = True, color = 'royalblue')
plt.show()

### <font color = Green > INSIGHT </font>

Since there is a large difference between the mean and the max (which indicates outliers) and the median is smaller than the mean, I will replace the missing values in AMT_ANNUITY with the mode.

In [1]:
appl_df['AMT_ANNUITY'].isnull().sum()

In [1]:
appl_df['AMT_ANNUITY'].mode()[0]

In [1]:
# Assigning the result back avoids chained-assignment issues with inplace=True on a column
appl_df['AMT_ANNUITY'] = appl_df['AMT_ANNUITY'].fillna(appl_df['AMT_ANNUITY'].mode()[0])

In [1]:
appl_df['AMT_ANNUITY'].isna().sum()

### <font color = Green > WAY FORWARD </font>

Similarly, let's check in one go all the columns having fewer than 35% null values, except NAME_TYPE_SUITE & OCCUPATION_TYPE as they are categorical:

In [1]:
appl_df[['AMT_GOODS_PRICE','CNT_FAM_MEMBERS','EXT_SOURCE_2','EXT_SOURCE_3','OBS_30_CNT_SOCIAL_CIRCLE',      
'DEF_30_CNT_SOCIAL_CIRCLE',    
'OBS_60_CNT_SOCIAL_CIRCLE',       
'DEF_60_CNT_SOCIAL_CIRCLE',       
'DAYS_LAST_PHONE_CHANGE',         
'AMT_REQ_CREDIT_BUREAU_HOUR',   
'AMT_REQ_CREDIT_BUREAU_DAY', 
'AMT_REQ_CREDIT_BUREAU_WEEK', 
'AMT_REQ_CREDIT_BUREAU_MON',     
'AMT_REQ_CREDIT_BUREAU_QRT',
'AMT_REQ_CREDIT_BUREAU_YEAR']].describe()

### <font color = Green > Rule Followed </font>

All of these are numerical columns, and their missing values are filled using the following rules:

1. If mean ≈ median, substitute the mean.

2. If mean != median, substitute the median.

3. But if there is a huge difference between the mean and the max, substitute the mode.
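One way to make the three rules above explicit is a small helper like the sketch below. The thresholds `close_tol` and `max_ratio` are illustrative assumptions (the notebook applies the rules by eye and defines no cut-offs), and `choose_fill_value` is a hypothetical name.

```python
import pandas as pd

def choose_fill_value(series, close_tol=0.05, max_ratio=10.0):
    """Pick a fill value following the three rules above.

    close_tol and max_ratio are illustrative thresholds, not values from the notebook.
    """
    mean, median, maximum = series.mean(), series.median(), series.max()
    if mean != 0 and maximum / mean > max_ratio:
        return "mode", series.mode()[0]          # rule 3: max far above the mean
    if abs(mean - median) <= close_tol * abs(median):
        return "mean", mean                      # rule 1: mean and median agree
    return "median", median                      # rule 2: skewed, use the median

# Made-up columns, each ending with one missing value.
symmetric = pd.Series([10, 11, 12, 13, 14, None])
skewed = pd.Series([1, 1, 2, 2, 3, 9, None])
heavy_tail = pd.Series([1] * 10 + [2] * 9 + [1000, None])

rule_symmetric, _ = choose_fill_value(symmetric)
rule_skewed, _ = choose_fill_value(skewed)
rule_heavy, fill_heavy = choose_fill_value(heavy_tail)
filled = heavy_tail.fillna(fill_heavy)
```

Rule 3 is checked first here because it is phrased as an override of the other two; a different ordering would be an equally defensible reading.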

### <font color = Green > Insights </font>

+ Most of the above columns seem to record a certain type of response from the customers, hence should be discrete and not continuous. Therefore their missing values are replaced with the median and not the mean, as the mean is a decimal.
+ The null values of the remaining columns will be replaced by the mean value.
+ Columns like 'DAYS_LAST_PHONE_CHANGE' will be dealt with later, while converting them into absolute values.

In [1]:
plt.figure(figsize=[10,6])
sns.distplot(appl_df['AMT_GOODS_PRICE'], rug = True, color = 'royalblue')
plt.show()

In [1]:
appl_df['AMT_GOODS_PRICE'].mode()

### <font color = Green > Insight </font>

The mode is exactly the same as the median, hence adding 'AMT_GOODS_PRICE' to the median-replacement list.

In [1]:
replace_median_val_df = appl_df[['AMT_GOODS_PRICE', 'CNT_FAM_MEMBERS', 'OBS_30_CNT_SOCIAL_CIRCLE',      
'DEF_30_CNT_SOCIAL_CIRCLE',    
'OBS_60_CNT_SOCIAL_CIRCLE',       
'DEF_60_CNT_SOCIAL_CIRCLE',
'AMT_REQ_CREDIT_BUREAU_HOUR',
'AMT_REQ_CREDIT_BUREAU_DAY',
'AMT_REQ_CREDIT_BUREAU_WEEK',
'AMT_REQ_CREDIT_BUREAU_MON',
'AMT_REQ_CREDIT_BUREAU_QRT',
'AMT_REQ_CREDIT_BUREAU_YEAR']]

In [1]:
# Imputing the median values all at once for the above selected columns.
# Assigning the filled column back avoids chained-assignment warnings.

for i in replace_median_val_df:
    appl_df[i] = appl_df[i].fillna(appl_df[i].median())

### <font color = Green > Way Forward </font>

Checking for remaining null values again:

In [1]:
remain_null_col(appl_df)[remain_null_col(appl_df)>0]

In [1]:
appl_df['NAME_TYPE_SUITE'].value_counts()

In [1]:
plt.figure(figsize = [10,6])
appl_df['NAME_TYPE_SUITE'].value_counts().plot.bar(color = 'royalblue', width = 0.6)
plt.title("Type of People Accompanied the Prospect", fontdict={"fontsize":20}, pad =20)
plt.show()


### <font color = Green > Insight </font>

+ Here we can see that most of our prospects visited alone while filling in the loan application.
+ Otherwise, most of them were accompanied by their family members or spouse/partner.
+ We do not replace the missing values in NAME_TYPE_SUITE, as doing so would skew the category distribution; we leave the null values as they are and work with the information available.
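If the missing values are left in place, it can still help to keep them visible when summarising the column. A minimal sketch on made-up values follows; the placeholder label "Missing" is a hypothetical choice, not something the notebook uses.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "NAME_TYPE_SUITE": ["Unaccompanied", "Family", np.nan, "Unaccompanied", np.nan],
})

# dropna=False keeps the missing share visible instead of silently excluding it.
share = toy["NAME_TYPE_SUITE"].value_counts(normalize=True, dropna=False) * 100

# An explicit placeholder label (hypothetical name "Missing") is an alternative
# that leaves the counts of the observed categories untouched.
labelled = toy["NAME_TYPE_SUITE"].fillna("Missing")
```

Either view shows how much of the column is unknown without pretending those rows belong to an existing category.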

In [1]:
appl_df['OCCUPATION_TYPE'].value_counts()

In [1]:
plt.figure(figsize = [12,6])
appl_df['OCCUPATION_TYPE'].value_counts().plot.bar(color = 'royalblue', width = 0.6)
plt.title("Type of Occupations", fontdict={"fontsize":20}, pad =20)
plt.show()

### <font color = Green > Insight </font>

+ Our major prospects come from the blue-collar category, followed by the white-collar ones.
+ The categories are quite vague; we cannot judge someone's suitability for a loan on the basis of occupation alone.
+ We do not replace the missing values in OCCUPATION_TYPE, as doing so would skew the category distribution; we leave the null values as they are and work with the information available.

In [1]:
remain_null_col(appl_df)

### <font color = Red >  Dataframe is almost free of Null Values in Rows and Columns:  </font>


In [1]:
appl_df.shape

###  Subtask 1.5: Numeric and Categorical Analysis: 

In [1]:
# Check the summary for the numeric columns 

appl_df.describe()

### <font color = Green > Insights </font>

+ The numbers in 'AMT_INCOME_TOTAL', 'AMT_CREDIT' & 'AMT_GOODS_PRICE' are too big, compromising their readability. Let's convert them into lakhs.
+ 'DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH' & 'DAYS_LAST_PHONE_CHANGE' have negative values, so we will convert them into absolute values.
+ Convert DAYS_BIRTH to AGE in years and DAYS_EMPLOYED to YEARS_EMPLOYED, and bin CNT_CHILDREN into CHILD_COUNTS and CNT_FAM_MEMBERS into FAMILY_COUNTS.
+ Will analyse the relevance of 'EXT_SOURCE_2' and 'EXT_SOURCE_3'.

In [1]:
# Analysing categorical values

appl_df.select_dtypes(include=['object']).describe()

### <font color = Green > Insights </font>


### Subtask 1.6: Standardising Values

In [1]:
appl_df.head()

#### The numbers in 'AMT_INCOME_TOTAL', 'AMT_CREDIT' & 'AMT_GOODS_PRICE' are too big, compromising their readability. Let's convert them into Lakh Rupees.

In [1]:
appl_df['AMT_INCOME_TOTAL'] = round(appl_df['AMT_INCOME_TOTAL']/100000,3)

In [1]:
appl_df['AMT_CREDIT'] = round(appl_df['AMT_CREDIT']/100000,3)

In [1]:
appl_df['AMT_GOODS_PRICE'] = round(appl_df['AMT_GOODS_PRICE']/100000,3)

In [1]:
appl_df[['AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_GOODS_PRICE']].describe()

In [1]:
appl_df['AMT_INCOME_TOTAL'].sort_values().value_counts(normalize = True)*100

#### 'DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH' & 'DAYS_LAST_PHONE_CHANGE' have negative values, so we convert them into absolute values.

In [1]:
days_related = ['DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH' , 'DAYS_LAST_PHONE_CHANGE']
appl_df[days_related] = abs(appl_df[days_related])

In [1]:
appl_df[days_related].describe()

#### Convert DAYS_BIRTH to AGE in years and DAYS_EMPLOYED to YEARS_EMPLOYED, and bin CNT_CHILDREN into CHILD_COUNTS and CNT_FAM_MEMBERS into FAMILY_COUNTS.

In [1]:
appl_df[['DAYS_BIRTH', 'DAYS_EMPLOYED', 'CNT_CHILDREN', 'CNT_FAM_MEMBERS']]

In [1]:
appl_df["AGE"] = appl_df["DAYS_BIRTH"]/365
bins = [0,20,25,30,35,40,45,50,55,60,100]
slots = ["0-20","20-25","25-30","30-35","35-40","40-45","45-50","50-55","55-60","60 Above"]

appl_df["AGE_GROUP"] = pd.cut(appl_df["AGE"], bins=bins, labels=slots)

In [1]:
appl_df["AGE_GROUP"].value_counts(normalize= True)*100

### <font color = Green > Insight </font>

The majority of the clients belong to the 30 to 45 years age group.

In [1]:
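# Creating "YEARS_EMPLOYED" and the binned column "EMPLOYEMENT_YEARS" from "DAYS_EMPLOYED".
# Bins are right-closed, so values outside (0, 50] years, including exact zeros, become NaN.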

appl_df["YEARS_EMPLOYED"] = appl_df["DAYS_EMPLOYED"]/365
bins = [0,5,10,15,20,25,30,50]
slots = ["0-5","5-10","10-15","15-20","20-25","25-30","30 Above"]

appl_df["EMPLOYEMENT_YEARS"] = pd.cut(appl_df["YEARS_EMPLOYED"], bins=bins, labels=slots)

In [1]:
appl_df["EMPLOYEMENT_YEARS"].value_counts(normalize= True)*100


### <font color = Green > Insight </font>

The majority of our clients have been employed for between 1 and 15 years.

In [1]:
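# Creating column "FAMILY_COUNTS" from "CNT_FAM_MEMBERS".
# Bins are right-closed, e.g. "2-4" covers (2, 4]; the bin edges now match the labels
# and the final open-ended bin catches anything above 20.

appl_df["FAMILY_COUNTS"] = pd.cut(appl_df.CNT_FAM_MEMBERS, [0,2,4,6,8,10,12,14,16,18,20,np.inf], labels=["0-2", "2-4", "4-6", "6-8", "8-10", "10-12", "12-14", "14-16", "16-18", "18-20", "20+"])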
appl_df.FAMILY_COUNTS.value_counts(normalize= True)*100

### <font color = Green > Insight </font>

The majority of our clients have a family of four or fewer members.

In [1]:
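# Creating column "CHILD_COUNTS" from "CNT_CHILDREN".
# The first bin starts below zero so that clients with 0 children are kept
# (with right-closed bins starting at 0 they would become NaN); 6+ is open-ended.

appl_df["CHILD_COUNTS"] = pd.cut(appl_df.CNT_CHILDREN, [-1,0,1,2,3,4,5,np.inf], labels=["0", "1", "2", "3", "4", "5", "6+"])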
appl_df.CHILD_COUNTS.value_counts(normalize= True)*100

### <font color = Green > Insight </font>

The majority of our clients have no children or have one or two children.
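Because `pd.cut` uses right-closed intervals by default, a value equal to the lowest bin edge is not assigned to any bin. The sketch below, on made-up counts, illustrates the effect and one way around it; the bin edges and labels are chosen only for the example.

```python
import numpy as np
import pandas as pd

counts = pd.Series([0, 0, 1, 2, 3, 12])

# Default right-closed bins starting at 0: the first interval is (0, 1], so 0 is left out,
# and 12 lies beyond the last edge.
naive = pd.cut(counts, bins=[0, 1, 2, 3])

# Starting below zero and ending at infinity keeps every observation.
safe = pd.cut(counts, bins=[-1, 0, 1, 2, np.inf], labels=["0", "1", "2", "3+"])
```

Comparing `naive.isna().sum()` with `safe.isna().sum()` is a quick check that no rows were lost to binning.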

In [1]:
appl_df.head()

#### Analyse 'EXT_SOURCE_2','EXT_SOURCE_3' relevance?

In [1]:
appl_df[['EXT_SOURCE_2','EXT_SOURCE_3', 'TARGET']].corr()

In [1]:
plt.figure(figsize = [8,6])
sns.heatmap(appl_df[['EXT_SOURCE_2','EXT_SOURCE_3', 'TARGET']].corr(), annot = True, cmap = "RdYlGn")
plt.show()

### <font color = Green > Insight </font>

There seems to be no strong correlation between EXT_SOURCE_2, EXT_SOURCE_3 and the TARGET variable, hence dropping them.

In [1]:
appl_df.drop(['EXT_SOURCE_2','EXT_SOURCE_3'], axis = 1, inplace = True)

In [1]:
appl_df.head()

In [1]:
# Also, creating a separate list of numeric column names for further analysis

num_cols = list(appl_df.describe().columns)

In [1]:
len(appl_df.describe().columns)

In [1]:
# Also, creating a separate list of categorical column names for further analysis
# (a list comprehension keeps the original column order, unlike a set difference)

cat_cols = [col for col in appl_df.columns if col not in num_cols]

In [1]:
len(cat_cols)

### Subtask 1.7: Identifying Outliers

In [1]:
# calling numerical columns: 
num_cols

In [1]:
# Only taking the columns that are relevant to analyse:

outliers_num_cols = [ 'CNT_CHILDREN',
 'AMT_INCOME_TOTAL',
 'AMT_CREDIT',
 'AMT_ANNUITY',
 'AMT_GOODS_PRICE',
 'AGE',
 'YEARS_EMPLOYED',
 'DAYS_REGISTRATION']

In [1]:
list(enumerate(outliers_num_cols))

In [1]:
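plt.figure(figsize=[20,22])

# One horizontal box plot per column; a 4x2 grid fits the 8 selected columns.
for n,col in enumerate(outliers_num_cols):
    plt.subplot(4,2,n+1)
    sns.boxplot(x = appl_df[col])
    plt.xlabel("")
    plt.ylabel("")
    plt.title(col)

# Adjust the layout once, after all subplots have been drawn
plt.tight_layout()
plt.show()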


### <font color = Green > Insight </font>

1. AMT_INCOME_TOTAL has a huge number of outliers, which means a few of the clients have a very high income level.
2. CNT_CHILDREN also has a large number of outliers; some clients report more than 5 children, which may explain their need for a loan.
3. AGE seems to be the only column with no outliers. We have also binned it into the categorical column AGE_GROUP, which we will analyse further against the TARGET column.
4. 'AMT_CREDIT', 'AMT_ANNUITY' and 'AMT_GOODS_PRICE' are other columns having some outliers.
5. YEARS_EMPLOYED has an extreme outlier: nobody can have started employment roughly 350000 days (nearly 1000 years) before the application, so this value cannot be a real employment length.
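The box plots flag outliers visually; the same whisker rule can be applied numerically. The sketch below uses made-up employment lengths and assumes the extreme value is a placeholder (365243 is used here as a hypothetical stand-in); `iqr_outlier_mask` is an illustrative helper name.

```python
import numpy as np
import pandas as pd

def iqr_outlier_mask(series, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR], the usual box-plot whisker rule."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Made-up employment lengths in days; 365243 plays the role of the placeholder value.
days_employed = pd.Series([400, 900, 1500, 2100, 3000, 365243])
mask = iqr_outlier_mask(days_employed)

# Treating the placeholder as missing keeps it out of summary statistics.
cleaned = days_employed.replace(365243, np.nan)
```

Since the exercise does not require removing data points, flagging with a mask (rather than dropping rows) keeps the original data intact.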

### Subtask 1.8: Identifying Imbalance

In [1]:
appl_df['TARGET'].value_counts()


In [1]:
appl_df['TARGET'].value_counts(normalize=True)*100


### <font color = Green > Insight </font>

+ There is an imbalance in the TARGET class.
+ As we know, 1 is a client with payment difficulties, i.e. a likely defaulter, and 0 is all other cases, i.e. we can assume a repayer.
+ About 92% of the clients are repayers, whereas about 8% have payment difficulties.

In [1]:
## Let's calculate the ratio of imbalance

(appl_df.TARGET==0).sum()/(appl_df.TARGET==1).sum()
# appl_df[appl_df.TARGET==0].shape[0] / appl_df[appl_df.TARGET==1].shape[0]

### <font color = Green > Ratio of Imbalance: 11.39 </font>
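The imbalance ratio and the class percentages are two views of the same counts. A small sketch on made-up labels (92 zeros and 8 ones, chosen only for round numbers) shows how they relate:

```python
import pandas as pd

# Made-up labels: 0 = all other cases, 1 = payment difficulties.
target = pd.Series([0] * 92 + [1] * 8)

counts = target.value_counts()
imbalance_ratio = counts[0] / counts[1]          # majority-to-minority ratio
minority_share = counts[1] / counts.sum() * 100  # percentage of class 1

# The two views are linked: minority share = 100 / (1 + ratio).
share_from_ratio = 100 / (1 + imbalance_ratio)
```

Applying the same relation to a ratio of 11.39 gives a minority share of roughly 8%.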



In [1]:
appl_df['NEW_TARGET'] = appl_df['TARGET'].apply(lambda x: 'Repayer' if x == 0 else 'Defaulter')
appl_df['NEW_TARGET'].value_counts(normalize=True)*100

In [1]:
plt.figure(figsize = [14,6])
appl_df['NEW_TARGET'].value_counts().plot.barh(color = 'royalblue')
plt.ylabel("Loan Repayment Categories")
plt.xlabel("Count")
plt.title("Repayer vs Defaulter")
plt.show()

### Subtask 1.9: Sorting Imbalance into two different dataframes: 


+ Let's split the two loan repayment categories into two different datasets for individual comparison with the driving variables.

In [1]:
appl_df.head()

In [1]:
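# .copy() makes each subset an independent dataframe, so later edits do not raise SettingWithCopyWarning
Repayer_0 = appl_df.loc[appl_df['NEW_TARGET'] == 'Repayer'].copy()
Defaulter_1 = appl_df.loc[appl_df['NEW_TARGET'] == 'Defaulter'].copy()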

In [1]:
Repayer_0.head()

In [1]:
Defaulter_1.head()

### Subtask 1.10: Categorical Analysis

In [1]:
# Calling categorical columns: 

cat_cols

In [1]:
# Univariate categorical analysis for a better understanding of those columns:

for i in appl_df.columns:
    if appl_df[i].dtypes ==  "object":
        print(appl_df[i].value_counts(normalize = True)*100)
        plt.figure(figsize=[6,6])
        appl_df[i].value_counts(normalize = True).plot.pie(labeldistance=None)
        plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
        #plt.tight_layout()
        plt.show()

### <font color = Green > Insight </font>

+ More cash loans are offered than revolving loans.
+ More females than males have taken loans.
+ Most applicants don't own cars.
+ Most applicants own living quarters.
+ Most applicants came unaccompanied for the loan application.
+ While most applicants are working class, 18% are pensioners.
+ Most have secondary education.
+ Most are married.
+ Most of them have mentioned their occupation type.
+ There are some unwanted XNA values in Gender and Organization Type; let's replace the ones in Gender.

In [1]:
Repayer_0['CODE_GENDER'].value_counts()

In [1]:
# Replacing the 'XNA' gender values with 'F', the most frequent gender
Repayer_0.loc[Repayer_0['CODE_GENDER']=='XNA','CODE_GENDER']='F'
Repayer_0['CODE_GENDER'].value_counts()

In [1]:
def univariate_categorical(data , col , title, hue):
    sns.set_style('whitegrid')
    sns.set_context('talk')
    plt.rcParams["axes.labelsize"] = 10
    plt.rcParams['axes.titlesize'] = 12
    plt.rcParams['axes.titlepad'] = 15
    
    #fig, ax = plt.subplots(1,2, sharey = True)
    plt.figure(figsize=(10,8))
    sns.countplot(data = data, x = col, order = data[col].value_counts().index, hue = hue, palette='magma') 
    plt.xticks(rotation=45)
    plt.title(title) 
    plt.show()


### Analysis In Respect to Repayers: 

In [1]:
univariate_categorical(Repayer_0, col = 'NAME_CONTRACT_TYPE', title = 'Distribution of Contract Type', hue = 'CODE_GENDER')

### <font color = Green > Points to Note from Graph </font>

+ In the repayer dataset, cash loans account for far more credits than revolving loans.
+ More females than males are seeking cash loans.


In [1]:
univariate_categorical(Repayer_0, col = 'NAME_INCOME_TYPE', title = 'Distribution of Income Type', hue = 'CODE_GENDER')

### <font color = Green > Points to Note from Graph </font>

+ Among repayers, females from the working class hold more credits than males.
+ Students, the unemployed, businessmen and applicants on maternity leave hold far fewer credits.


In [1]:
univariate_categorical(Repayer_0, col = 'OCCUPATION_TYPE', title = 'Distribution of OCCUPATION_TYPE', hue = 'CODE_GENDER')

### <font color = Green > Points to Note from Graph </font>

+ Among males, Laborers hold the most credits.
+ Among females, Sales Staff and Core Staff hold the most credits.

In [1]:
univariate_categorical(Repayer_0, col = 'EMPLOYEMENT_YEARS', title = 'Distribution of EMPLOYEMENT_YEARS', hue = 'CODE_GENDER')

### <font color = Green > Points to Note from Graph </font>

+ Most males and females had been working for 0-5 years before the loan application.
+ Applicants who have been working for more than 5 years could be given more weight when granting a loan.

In [1]:
univariate_categorical(Repayer_0, col = 'AGE_GROUP', title = 'Distribution of Age Group', hue = 'CODE_GENDER')

### <font color = Green > Points to Note from Graph </font>

+ In almost all age groups except below 20 years, more females than males need a loan.
+ It would be interesting to know the reason why.


In [1]:
univariate_categorical(Repayer_0, col = 'NAME_HOUSING_TYPE', title = 'Distribution of Housing Type', hue = 'CODE_GENDER')

### <font color = Green > Points to Note from Graph </font>

+ Most repayers, both male and female, own a house/apartment.


### <font color = Green > Consolidated Insights For Repayers Dataset: </font>

+ In the repayer dataset, cash loans account for far more credits than revolving loans.
+ More females than males are seeking cash loans.
+ Under Income Type, females from the working class hold more credits than males.
+ Students, the unemployed, businessmen and applicants on maternity leave hold far fewer credits.
+ Among males, Laborers hold the most credits.
+ Among females, Sales Staff and Core Staff hold the most credits.
+ Most males and females had been working for 0-5 years before the loan application.
+ Applicants who have been working for more than 5 years could be given more weight when granting a loan.
+ In almost all age groups except below 20 years, more females than males need a loan.
+ Most repayers, both male and female, own a house/apartment.

## Categorical Univariate Analysis In Respect to Defaulters: 

In [1]:
univariate_categorical(Defaulter_1, col = 'NAME_CONTRACT_TYPE', title = 'Distribution of Contract Type', hue = 'CODE_GENDER')

### <font color = Green > Points to Note from Graph </font>

+ In the defaulter dataset too, cash loans account for far more credits than revolving loans.
+ Here also, more females than males have payment difficulties with cash loans.


In [1]:
univariate_categorical(Defaulter_1, col = 'NAME_INCOME_TYPE', title = 'Distribution of Income Type', hue = 'CODE_GENDER')

### <font color = Green > Points to Note from Graph </font>

+ Most males and females with payment difficulties come from the working class.

In [1]:
univariate_categorical(Defaulter_1, col = 'OCCUPATION_TYPE', title = 'Distribution of OCCUPATION_TYPE', hue = 'CODE_GENDER')

### <font color = Green > Points to Note from Graph </font>

+ Male Laborers, who hold the most credits among male repayers, also face the most payment difficulties.
+ The same holds for females from Sales Staff.


In [1]:
univariate_categorical(Defaulter_1, col = 'EMPLOYEMENT_YEARS', title = 'Distribution of EMPLOYEMENT_YEARS', hue = 'CODE_GENDER')

### <font color = Green > Points to Note from Graph </font>

+ Most males and females who face payment difficulties have been in their job for 0-5 years.

In [1]:
univariate_categorical(Defaulter_1, col = 'AGE_GROUP', title = 'Distribution of AGE_GROUP', hue = 'CODE_GENDER')

### <font color = Green > Points to Note from Graph </font>

+ Most male and female defaulters are aged 30 and above.

In [1]:
univariate_categorical(Defaulter_1, col = 'NAME_HOUSING_TYPE', title = 'Distribution of Housing Type', hue = 'CODE_GENDER')

### <font color = Green > Points to Note from Graph </font>

+ Similarly, most defaulters of both genders own a house/apartment.

### <font color = Green > Consolidated Insights For Defaulter dataset: </font>

+ More females than males have payment difficulties with cash loans.
+ Most males and females with payment difficulties come from the working class.
+ Male Laborers, who hold the most credits among male repayers, also face the most payment difficulties. The same holds for females from Sales Staff.
+ Most males and females who face payment difficulties have been in their job for 0-5 years.
+ Most male and female defaulters are aged 30 and above.
+ Similarly, most defaulters of both genders own a house/apartment.

### Checking their Income Level with respect to their Education Status, for males and females separately.

In [1]:
pd.pivot_table(data = Repayer_0, index = 'NAME_EDUCATION_TYPE', columns = 'CODE_GENDER', values = 'AMT_INCOME_TOTAL', aggfunc = 'sum')

In [1]:
# aggfunc is passed by name and stacked as a real boolean (the string "True" only worked by being truthy)
Repayer_Status = pd.pivot_table(data = Repayer_0, index = 'NAME_EDUCATION_TYPE', columns = 'CODE_GENDER', values = 'AMT_INCOME_TOTAL', aggfunc = 'sum')
Repayer_Status.plot(kind='bar', stacked=True, figsize=[12,8])
plt.title("Income Level vs Education Type for Repayers")
plt.show()

In [1]:
pd.pivot_table(data = Defaulter_1, index = 'NAME_EDUCATION_TYPE', columns = 'CODE_GENDER', values = 'AMT_INCOME_TOTAL', aggfunc = 'sum')

In [1]:
Defaulter_Status = pd.pivot_table(data = Defaulter_1, index = 'NAME_EDUCATION_TYPE', columns = 'CODE_GENDER', values = 'AMT_INCOME_TOTAL', aggfunc = 'sum')
Defaulter_Status.plot(kind='bar', stacked=True, figsize=[12,8])
plt.title("Income Level vs Education Type for Defaulters")
plt.show()

### <font color = Green > Insights for Income Level vs Education Type: </font>

+ Females who have completed higher education tend to pay back the loan more often than men.
+ Men who have completed lower secondary education tend to default on the loan more than females.
+ Among those who have completed Secondary/Secondary Special education, the summed income of females is higher than that of males, for both defaulters and repayers.
+ Females with an academic degree tend not to default on the loan.
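Summed income grows with group size, so statements about who tends to default can be cross-checked with a per-group default rate. The sketch below is one possible check on made-up rows, not the notebook's method; it relies only on the fact that the mean of a 0/1 target within a group equals that group's default rate.

```python
import pandas as pd

# Made-up rows standing in for the application data.
toy = pd.DataFrame({
    "NAME_EDUCATION_TYPE": ["Higher", "Higher", "Secondary", "Secondary", "Secondary", "Higher"],
    "CODE_GENDER": ["F", "M", "F", "M", "M", "F"],
    "TARGET": [0, 0, 1, 1, 0, 0],
})

# The mean of a 0/1 target within a group is that group's default rate (here in percent).
default_rate = pd.pivot_table(
    data=toy,
    index="NAME_EDUCATION_TYPE",
    columns="CODE_GENDER",
    values="TARGET",
    aggfunc="mean",
) * 100
```

A rate is comparable across groups of different sizes, which a sum is not.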

### Checking their Income Level with respect to their Family Status, for males and females separately.

In [1]:
pd.pivot_table(data = Repayer_0, index = 'NAME_FAMILY_STATUS', columns = 'CODE_GENDER', values = 'AMT_INCOME_TOTAL', aggfunc = 'sum')

In [1]:
# aggfunc is passed by name and stacked as a real boolean (the string "True" only worked by being truthy)
Repayer_Status = pd.pivot_table(data = Repayer_0, index = 'NAME_FAMILY_STATUS', columns = 'CODE_GENDER', values = 'AMT_INCOME_TOTAL', aggfunc = 'sum')
Repayer_Status.plot(kind='bar', stacked=True, figsize=[12,8])
plt.title("Income Level vs Family Status for Repayers")
plt.show()

In [1]:
pd.pivot_table(data = Defaulter_1, index = 'NAME_FAMILY_STATUS', columns = 'CODE_GENDER', values = 'AMT_INCOME_TOTAL', aggfunc = 'sum')

In [1]:
Defaulter_Status = pd.pivot_table(data = Defaulter_1, index = 'NAME_FAMILY_STATUS', columns = 'CODE_GENDER', values = 'AMT_INCOME_TOTAL', aggfunc = 'sum')
Defaulter_Status.plot(kind='bar', stacked=True, figsize=[12,8])
plt.title("Income Level vs Family Status for Defaulters")
plt.show()

### <font color = Green > Insights for Income Level vs Family Status: </font>

+ We see that the majority of the repayers are married.
+ Females outnumber males among both repayers and defaulters.
+ Single women appear to be financially stable on their own.
+ Men are also present in the Widow category.
+ The largest difference between males and females is seen in the Widow category.
+ Single men tend to default on the loan more than single women.

### Subtask 1.12: Numerical Analysis 

In [1]:
num_cols

In [1]:
# pairplot creates its own figure, so no separate plt.figure() call is needed;
# its built-in legend already maps colours and markers to NEW_TARGET.
g=sns.pairplot(appl_df[["AMT_INCOME_TOTAL", "AMT_CREDIT", "AMT_ANNUITY", "AMT_GOODS_PRICE", "NEW_TARGET"]],
            vars = ["AMT_INCOME_TOTAL", "AMT_CREDIT", "AMT_ANNUITY", "AMT_GOODS_PRICE"],
            hue = "NEW_TARGET", 
            markers = ['o', 's'])
g.fig.set_size_inches(14,12)
plt.show()

### <font color = Green > Insights: </font>

+ When the annuity is more than 15K and the goods price is more than 20 lakhs, there are fewer defaulters in that zone.
+ Loan amount and goods price are highly correlated; most of the data points lie along a straight line.
+ There are fewer defaulters for loan amounts greater than 20 lakhs.
+ Loan amount and loan annuity are also highly correlated.

In [1]:
# Plotting the numerical columns related to amount as distribution plot to see density

amt_values = appl_df[[ 'AMT_INCOME_TOTAL','AMT_CREDIT','AMT_ANNUITY', 'AMT_GOODS_PRICE']]

fig = plt.figure(figsize=(18,12))

for n, col in enumerate(amt_values.columns):
    plt.subplot(2,2,n+1)
    # distplot(hist=False) is deprecated; kdeplot draws the same density curve
    sns.kdeplot(x = Defaulter_1[col], label ="Defaulter")
    sns.kdeplot(x = Repayer_0[col], label ="Repayer")
    plt.title(col, fontdict={'fontsize' : 15, 'fontweight' : 5})
    plt.legend()


plt.show()

### <font color = Green > Insights: </font>

+ Most loans are given for goods priced below 10 lakhs.
+ Most people pay an annuity of less than 50K for the credit loan.
+ The credit amount of the loan is mostly less than 10 lakhs.
+ Apart from the majority, there are a few who have a higher income level.

In [1]:
plt.figure(figsize=[15,8])
sns.scatterplot(data = Repayer_0, x = 'AMT_INCOME_TOTAL', y = 'AMT_CREDIT', color = "royalblue")
plt.title('Income Level vs Credit Amount of the Loan (Repayers)', fontsize = 20, color = 'Brown')
plt.xlabel("Income Level", fontdict={'fontsize': 15, 'fontweight' : 5, 'color' : 'Brown'})
plt.ylabel("Credit Amount of the Loan", fontdict={'fontsize': 15, 'fontweight' : 5, 'color' : 'Brown'})
plt.show()

In [1]:
plt.figure(figsize=[15,8])
sns.scatterplot(data = Defaulter_1, x = 'AMT_INCOME_TOTAL', y = 'AMT_CREDIT', color = "royalblue")
plt.title('Income Level vs Credit Amount of the Loan', fontsize = 20, color = 'Brown')
plt.xlabel("Income Level", fontdict={'fontsize': 15, 'fontweight' : 5, 'color' : 'Brown'})
plt.ylabel("Credit Amount of the Loan", fontdict={'fontsize': 15, 'fontweight' : 5, 'color' : 'Brown'})
plt.show()

### <font color = Green > Insights for Income Level vs Credit Amount of the Loan: </font>

+ For repayers, the credit amount ranges from 0 to 25 lakhs, while the income level ranges from 0 to 15 lakhs.
+ One repayer has an income level of 175 lakhs and a credit amount of exactly 7 lakhs.
+ Repayers have a decent income level relative to their credit amount, which makes it easier for the bank to collect on their loans.
+ The same cannot be said of defaulters, whose income levels are low.
+ One defaulter with an income level of around 1170 lakhs has a credit amount of around 5.5-6 lakhs.
+ Defaulters' credit amounts range from 0 to 20 lakhs, which can mean a loss for the bank because both the amounts and the number of clients are large.

In [1]:
plt.figure(figsize=[15,8])
sns.scatterplot(data = Repayer_0, x = 'AMT_INCOME_TOTAL', y = 'AMT_ANNUITY', color = "royalblue")
plt.title('Income Level vs Loan Annuity Range', fontsize = 20, color = 'Brown')
plt.xlabel("Income Level", fontdict={'fontsize': 15, 'fontweight' : 5, 'color' : 'Brown'})
plt.ylabel("Loan Annuity Range", fontdict={'fontsize': 15, 'fontweight' : 5, 'color' : 'Brown'})
plt.show()

In [1]:
plt.figure(figsize=[15,8])
sns.scatterplot(data = Defaulter_1, x = 'AMT_INCOME_TOTAL', y = 'AMT_ANNUITY', color = "royalblue")
plt.title('Income Level vs Loan Annuity Range', fontsize = 20, color = 'Brown')
plt.xlabel("Income Level", fontdict={'fontsize': 15, 'fontweight' : 5, 'color' : 'Brown'})
plt.ylabel("Loan Annuity Range", fontdict={'fontsize': 15, 'fontweight' : 5, 'color' : 'Brown'})
plt.show()

### <font color = Green > Insights for Income Level vs Loan Annuity Range: </font>

+ The annuity is capped at about 1 lakh for the majority of repayers.
+ Within that cap, a few repayers have an income level above 25 lakhs, and two have an income level of 125 lakhs.
+ The annuity is capped at about 80K for the majority of defaulters.
+ There is an outlier with an income level of more than 1000 lakhs who has defaulted with an annuity under 40K. This could be a data error.
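
A suspicious record like the one in the last point can be pulled out for manual inspection. This is a minimal sketch, assuming a toy frame `toy_defaulters` with made-up values in lakhs; the IDs and amounts are hypothetical:

```python
import pandas as pd

# Toy stand-in for Defaulter_1 (amounts in lakhs); values are made up for illustration.
toy_defaulters = pd.DataFrame({
    "SK_ID_CURR": [101, 102, 103, 104, 105],
    "AMT_INCOME_TOTAL": [1.2, 2.5, 1170.0, 0.9, 3.1],
    "AMT_ANNUITY": [0.25, 0.30, 0.26, 0.15, 0.40],
})

# Show the largest incomes so the extreme record can be inspected by hand.
top_incomes = toy_defaulters.nlargest(2, "AMT_INCOME_TOTAL")
print(top_incomes)

# Ratio of the maximum to the median: a very large ratio hints at a data-entry error.
ratio = toy_defaulters["AMT_INCOME_TOTAL"].max() / toy_defaulters["AMT_INCOME_TOTAL"].median()
print(round(ratio, 1))
```

Whether such a row is corrected, capped or dropped is a judgment call that depends on what the source system says about that client.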

## Checking the Range of Credit Amount applied for by the Prospects with respect to their Education Type and Family Status

In [1]:
plt.figure(figsize=(16,12))
plt.xticks(rotation=45)
sns.boxplot(data = Repayer_0, x='NAME_EDUCATION_TYPE',y='AMT_CREDIT', hue ='NAME_FAMILY_STATUS',orient='v')
plt.title('Credit amount vs Education Status For Repayers')
plt.show()

In [1]:
plt.figure(figsize=(16,12))
plt.xticks(rotation=45)
sns.boxplot(data = Defaulter_1, x='NAME_EDUCATION_TYPE',y='AMT_CREDIT', hue ='NAME_FAMILY_STATUS',orient='v')
plt.title('Credit amount vs Education Status For Defaulters')
plt.show()

### <font color = Green > Insights for Credit amount vs Education Status for Repayers & Defaulters: </font>

+ Under Repayers: Married, Civil Marriage and Separated clients with an Academic Degree seem to have applied for larger credit amounts compared to others.
+ The median for the Lower Secondary education type is low across all family statuses compared to the others, which means these clients opt for lower credit amounts.
+ Under Defaulters: only Married clients with an Academic Degree seem to have applied for larger credit amounts, followed by those with Higher Education.
+ One client under Secondary/Secondary Special with Civil Marriage status has a credit amount of over 40 lakhs, which could be a loss for the bank.
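
The medians that the box plots show can be tabulated directly, which makes the comparison between education types easier to read. This is a minimal sketch on a toy frame `toy_repayers` with invented values; only the column names come from the notebook:

```python
import pandas as pd

# Toy stand-in for Repayer_0; credit amounts are invented for illustration.
toy_repayers = pd.DataFrame({
    "NAME_EDUCATION_TYPE": ["Academic degree", "Academic degree", "Lower secondary",
                            "Lower secondary", "Higher education", "Higher education"],
    "NAME_FAMILY_STATUS": ["Married", "Married", "Married",
                           "Separated", "Married", "Separated"],
    "AMT_CREDIT": [12.0, 14.0, 3.0, 4.0, 8.0, 6.0],
})

# Median credit per (education type, family status), reshaped into a table.
median_credit = (
    toy_repayers
    .groupby(["NAME_EDUCATION_TYPE", "NAME_FAMILY_STATUS"])["AMT_CREDIT"]
    .median()
    .unstack()
)
print(median_credit)
```

Combinations that do not occur in the data show up as `NaN` in the resulting table.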

## Checking the Level of Income of the Prospects with respect to their Education Type and Family Status

In [1]:
plt.figure(figsize=(16,12))
plt.xticks(rotation=45)
plt.yscale('log')
sns.boxplot(data =Repayer_0, x='NAME_EDUCATION_TYPE',y='AMT_INCOME_TOTAL', hue ='NAME_FAMILY_STATUS',orient='v')
plt.title('Income amount vs Education Status for Repayers')
plt.show()

In [1]:
plt.figure(figsize=(16,12))
plt.xticks(rotation=45)
plt.yscale('log')
sns.boxplot(data = Defaulter_1, x='NAME_EDUCATION_TYPE',y='AMT_INCOME_TOTAL', hue ='NAME_FAMILY_STATUS',orient='v')
plt.title('Income amount vs Education Status For Defaulters')
plt.show()

### <font color = Green > Insights for Income Amount vs Education Status for Repayers & Defaulters: </font>

+ Similarly, under Repayers: most clients with an Academic Degree seem to have a higher income level.
+ Under Defaulters: most clients, whatever their family status, do not have an Academic Degree, except the married ones. The rest, from any education type, have a moderate income level.

#  Task - 2: Reading and understanding the dataset - 2

### Subtask 2.1: Importing the previous application data.

In [1]:
# Read the csv file using 'read_csv'. 
pappl_df = pd.read_csv('../input/previous-application/previous_application.csv')
pappl_df.head()

In [1]:
# Check the number of rows and columns in the dataframe

pappl_df.shape

In [1]:
# Check the column-wise info and datatypes of the dataframe
pappl_df.info(verbose=True)

### Subtask 2.2:  Dealing with Null Values in Columns above 35%

In [1]:
# Checking Null Values

pnull_col = round((pappl_df.isnull().sum()*100/len(pappl_df)).sort_values(ascending = False),2)
pnull_col_35 = pnull_col[pnull_col>35] 
print(pnull_col_35)
print()
print("Number of Columns having missing values more than 35% :",len(pnull_col_35))

In [1]:
pnull_col_35.index

# Even though some of these columns may seem relevant, more than 35% of their data is missing, so let's drop them.
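
The threshold-based drop can be wrapped in a small reusable helper. This is a minimal sketch; `drop_sparse_columns` and the toy frame are hypothetical names introduced only for illustration, and the 35% default mirrors the cut-off used in this notebook:

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df, threshold_pct=35.0):
    """Return (trimmed copy, dropped names) for columns whose null share exceeds threshold_pct."""
    null_pct = df.isnull().mean() * 100
    to_drop = null_pct[null_pct > threshold_pct].index
    return df.drop(columns=to_drop), list(to_drop)

# Toy frame with invented values.
toy = pd.DataFrame({
    "A": [1, 2, 3, 4],
    "B": [np.nan, np.nan, 3, 4],   # 50% null
    "C": [1, np.nan, 3, 4],        # 25% null
})
trimmed, dropped = drop_sparse_columns(toy)
print(dropped)
```

Returning the list of dropped names alongside the trimmed frame keeps a record of what was removed, which is useful when a column later turns out to matter.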

In [1]:
# Dropping Columns Having Missing Values more than 35%

pappl_df.drop(columns = pnull_col_35.index, inplace = True)

In [1]:
pappl_df.shape # Now there are 26 columns remaining. 

#### Dealing with rest of the Null Values in Columns

In [1]:
def premain_null_col(df):
    # Percentage of null values per column of the frame passed in
    return round(df.isnull().sum()*100/len(df),4)

In [1]:
premain_null_col(pappl_df)[premain_null_col(pappl_df)>0]

In [1]:
pappl_df['AMT_ANNUITY'].describe()

In [1]:
pappl_df['AMT_ANNUITY'].mode()

In [1]:
sns.set_style("dark")
plt.style.use("ggplot")
plt.figure(figsize=[10,6])
sns.distplot(pappl_df['AMT_ANNUITY'], rug = True, color = 'royalblue')
plt.show()

### <font color = Green > Insights: </font>

The missing values in AMT_ANNUITY will be replaced with the mean value.

In [1]:
pappl_df['AMT_GOODS_PRICE'].describe()

In [1]:
plt.figure(figsize=[10,6])
sns.distplot(pappl_df['AMT_GOODS_PRICE'], rug = True, color = 'royalblue')
plt.show()

In [1]:
pappl_df['AMT_GOODS_PRICE'].mode()

### <font color = Green > Insights: </font>

The missing values in AMT_GOODS_PRICE will be replaced with the mode value.

In [1]:
pappl_df['CNT_PAYMENT'].describe()

In [1]:
plt.figure(figsize=[10,6])
sns.distplot(pappl_df['CNT_PAYMENT'], rug = True, color = 'royalblue')
plt.show()

In [1]:
pappl_df['CNT_PAYMENT'].mode()

### <font color = Green > Insights: </font>

The missing values in CNT_PAYMENT will be replaced with the mean value.

In [1]:
pappl_df['PRODUCT_COMBINATION'].value_counts()

### <font color = Green > Insights: </font>

PRODUCT_COMBINATION has very few null values, so they are left as they are.
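
The imputation choices noted in the insights above (mean for AMT_ANNUITY and CNT_PAYMENT, mode for AMT_GOODS_PRICE) can be sketched on a toy frame. The frame `toy_prev` and its values are assumptions made for illustration only:

```python
import numpy as np
import pandas as pd

# Toy stand-in for pappl_df; values are invented.
toy_prev = pd.DataFrame({
    "AMT_ANNUITY": [10.0, 20.0, np.nan, 30.0],
    "CNT_PAYMENT": [12.0, np.nan, 24.0, 12.0],
    "AMT_GOODS_PRICE": [45000.0, 45000.0, np.nan, 90000.0],
})

# Mean imputation; assigning the result back avoids chained-assignment pitfalls.
for col in ["AMT_ANNUITY", "CNT_PAYMENT"]:
    toy_prev[col] = toy_prev[col].fillna(toy_prev[col].mean())

# Mode imputation; mode() returns a Series, so take its first entry.
goods_mode = toy_prev["AMT_GOODS_PRICE"].mode()[0]
toy_prev["AMT_GOODS_PRICE"] = toy_prev["AMT_GOODS_PRICE"].fillna(goods_mode)

print(toy_prev.isnull().sum())
```

The mean is computed from the non-null entries only, because pandas skips `NaN` by default.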

In [1]:
repl_mean_val_df = pappl_df[['AMT_ANNUITY', 'CNT_PAYMENT']]

In [1]:
for i in repl_mean_val_df.columns:
    # Assign the filled column back instead of using chained indexing
    pappl_df[i] = pappl_df[i].fillna(pappl_df[i].mean())

In [1]:
pappl_df['AMT_GOODS_PRICE'].mode()[0]

In [1]:
pappl_df['AMT_GOODS_PRICE'] = pappl_df['AMT_GOODS_PRICE'].fillna(pappl_df['AMT_GOODS_PRICE'].mode()[0])

In [1]:
premain_null_col(pappl_df)

#### There are almost no Null Values left in the Previous Application Dataset

### Subtask 2.3: Numerical and Categorical Analysis


In [1]:
pappl_df.describe()

### <font color = Green > Insights: </font>

+ Some of the values are large, but they are left unchanged as fixing them is not required here.

In [1]:
pappl_df.select_dtypes(include=['object']).describe()

### <font color = Green > Insights: </font>

+ We need to focus on Previous Cash Loan Purpose & Contract Status columns.

In [1]:
pappl_df.columns

### Subtask 2.4: Identifying Outliers


In [1]:
outliers_check = [ 'AMT_ANNUITY', 'AMT_APPLICATION', 'AMT_CREDIT', 'AMT_GOODS_PRICE' , 'CNT_PAYMENT']

In [1]:
list(enumerate(outliers_check))

In [1]:
plt.figure(figsize=([20,22]))

for n,col in enumerate(outliers_check):
    plt.subplot(3,2,n+1)
    sns.boxplot(x=pappl_df[col], orient = "h")
    plt.xlabel("")
    plt.ylabel("")
    plt.title(col)

### <font color = Green > Insights: </font>

+ The amounts in Annuity, Application, Credit and Goods Price have a huge number of outliers.
+ CNT_PAYMENT has outliers too, but fewer compared to the others.
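
The box plots flag points beyond 1.5 times the interquartile range from the quartiles, and the share of such points can be counted per column. This is a minimal sketch; `iqr_outlier_share` is a hypothetical helper and the series holds invented values:

```python
import pandas as pd

def iqr_outlier_share(series, k=1.5):
    """Percentage of non-null values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    s = series.dropna()
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    mask = (s < q1 - k * iqr) | (s > q3 + k * iqr)
    return mask.mean() * 100

# Toy amounts with one obviously extreme value.
toy_amounts = pd.Series([10, 12, 11, 13, 12, 11, 10, 12, 13, 100], name="AMT_CREDIT")
share = iqr_outlier_share(toy_amounts)
print(share)
```

Applied to each entry of `outliers_check`, such a helper would turn "a huge number of outliers" into a percentage per column.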

### Subtask 2.5: Merging both datasets: 
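
Because one current application can have several previous applications, an inner merge on `SK_ID_CURR` repeats the current-application row once per match and drops clients without a match. The sketch below illustrates this on two toy frames with invented IDs; the `validate` argument is an optional safeguard that is not used in this notebook:

```python
import pandas as pd

# Toy stand-ins for appl_df and pappl_df; IDs and statuses are invented.
toy_curr = pd.DataFrame({"SK_ID_CURR": [1, 2, 3], "TARGET": [0, 1, 0]})
toy_prev = pd.DataFrame({
    "SK_ID_PREV": [10, 11, 12, 13],
    "SK_ID_CURR": [1, 1, 2, 4],
    "NAME_CONTRACT_STATUS": ["Approved", "Refused", "Approved", "Canceled"],
})

# validate="one_to_many" raises if SK_ID_CURR is duplicated on the left side.
toy_loan = pd.merge(toy_curr, toy_prev, how="inner", on="SK_ID_CURR",
                    validate="one_to_many")
print(toy_loan)
```

In the toy result, client 1 appears twice and client 3 disappears, so counts taken on the merged frame are counts of previous applications rather than of clients.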

In [1]:
pappl_df.SK_ID_PREV.isnull().sum()

In [1]:
appl_df.SK_ID_CURR.isnull().sum()

In [1]:
loan_df = pd.merge(left = appl_df, right = pappl_df, how = 'inner', on = 'SK_ID_CURR' )

In [1]:
loan_df.head()

In [1]:
loan_df.shape

In [1]:
loan_df.info(verbose=True)

### Subtask 2.6: Dropping Irrelevant Columns: 


In [1]:
##Deleting all the Flag columns 

flag_cols = [c for c in loan_df.columns if c.startswith("FLAG")]
loan_df.drop(columns=flag_cols, inplace=True)
        
loan_df.shape

In [1]:
##Deleting all the AMT_REQ columns 

amt_req_cols = [c for c in loan_df.columns if c.startswith("AMT_REQ")]
loan_df.drop(columns=amt_req_cols, inplace=True)
        
loan_df.shape

In [1]:
# Bifurcating the loan_df based on Target Value

In [1]:
Repayer_L0 = loan_df.loc[loan_df['NEW_TARGET'] == 'Repayer']
Defaulter_L1 = loan_df.loc[loan_df['NEW_TARGET'] == 'Defaulter']

In [1]:
Repayer_L0.head()

In [1]:
Defaulter_L1.head()

### Subtask 2.7: Final Analysis: 


In [1]:
loan_df['NAME_CONTRACT_STATUS'].value_counts(normalize = True)*100

In [1]:
# Plotting Name contract status to check % of default & Repayers, i.e. Target 0 and 1

contract_status = loan_df['NAME_CONTRACT_STATUS'].unique()

for i in contract_status:
    print("Repayer and Defaulter for : ",i)
    plt.figure(figsize=[8,5])
    print(loan_df[(loan_df['NAME_CONTRACT_STATUS']==i)].NEW_TARGET.value_counts(normalize=True)*100)
    loan_df[(loan_df['NAME_CONTRACT_STATUS']==i)].NEW_TARGET.value_counts().plot.pie(normalize=True)
    plt.legend()
    plt.show()

### <font color = Green > Insights: </font>

1. 7.5% of the applicants with a previously Approved loan have defaulted.
2. 88% of the clients who were Refused in the previous application have repaid the current loan.
3. Also, there are defaulters whose previous applications were Refused, Cancelled or Unused. This indicates that the company approved the current loan and is facing default on these loans.
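
The percentages printed per contract status in the loop can also be obtained in a single table. This is a minimal sketch using `pd.crosstab` on a toy frame with invented rows:

```python
import pandas as pd

# Toy stand-in for loan_df; the rows are invented.
toy_loan = pd.DataFrame({
    "NAME_CONTRACT_STATUS": ["Approved"] * 4 + ["Refused"] * 4,
    "NEW_TARGET": ["Repayer", "Repayer", "Repayer", "Defaulter",
                   "Repayer", "Repayer", "Defaulter", "Defaulter"],
})

# Row-normalised cross-tabulation: each row sums to 100.
status_vs_target = pd.crosstab(
    toy_loan["NAME_CONTRACT_STATUS"], toy_loan["NEW_TARGET"], normalize="index"
) * 100
print(status_vs_target)
```

With `normalize="index"`, each row gives the repayer and defaulter shares within one previous contract status.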

In [1]:
#Checking "NAME_CONTRACT_STATUS", "NAME_INCOME_TYPE",aggregating on Target

res=pd.pivot_table(data=loan_df,index="NAME_CONTRACT_STATUS",columns="NAME_INCOME_TYPE",values='TARGET', aggfunc="sum")
plt.figure(figsize=(17,6))
sns.heatmap(res, annot=True,cmap='Greens', fmt="g")
plt.show()

#### Note: Since TARGET = 1 denotes a default, a higher value in the above matrix means more defaulters in that cell.
### <font color = Green > Insights: </font>

+ Working applicants with an Approved contract status have defaulted in large numbers.
+ 8906 Commercial associates who were Approved earlier have defaulted.
+ 11389 Working applicants who were Refused earlier and 8738 who were Cancelled earlier have defaulted.
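
The heatmap above sums `TARGET`, so each cell is a count of defaulters and large groups naturally show large numbers. Aggregating with the mean instead gives the default rate per cell. This is a minimal sketch on a toy frame with invented rows, offered as a complement rather than as the method used in this notebook:

```python
import pandas as pd

# Toy stand-in for loan_df; the rows are invented.
toy_loan = pd.DataFrame({
    "NAME_CONTRACT_STATUS": ["Approved"] * 6 + ["Refused"] * 2,
    "NAME_INCOME_TYPE": ["Working"] * 4 + ["Pensioner"] * 2 + ["Working", "Pensioner"],
    "TARGET": [1, 0, 0, 0, 0, 0, 1, 0],
})

# Number of defaulters per cell (what aggfunc="sum" produces).
counts = pd.pivot_table(toy_loan, index="NAME_CONTRACT_STATUS",
                        columns="NAME_INCOME_TYPE", values="TARGET", aggfunc="sum")

# Default rate per cell in percent (mean of a 0/1 column).
rates = pd.pivot_table(toy_loan, index="NAME_CONTRACT_STATUS",
                       columns="NAME_INCOME_TYPE", values="TARGET", aggfunc="mean") * 100
print(counts)
print(rates)
```

In the toy data both Working cells contain one defaulter, yet their default rates differ, which is the distinction a count-based heatmap cannot show.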

In [1]:
#Checking "NAME_CONTRACT_STATUS", "AGE_GROUP",aggregating on Target

res=pd.pivot_table(data=loan_df,index="NAME_CONTRACT_STATUS",columns="AGE_GROUP",values='TARGET', aggfunc="sum")
plt.figure(figsize=(17,6))
sns.heatmap(res, annot=True,cmap='Greens', fmt="g")
plt.show()

#### Note: Since TARGET = 1 denotes a default, a higher value in the above matrix means more defaulters in that cell.
    
### <font color = Green > Insights: </font>

+ Approved loans in the age groups 25-30, 30-35, 35-40 and 40-45 have higher defaults.
+ Applicants aged 25 to 50 who were Refused or Cancelled earlier have defaulted this time.

In [1]:
## Checking for Defaults who were Approved Earlier:

In [1]:
loan_df[(loan_df["NAME_CONTRACT_STATUS"] == 'Approved') & (loan_df['TARGET'] == 1)]

In [1]:
approved_default = loan_df[(loan_df["NAME_CONTRACT_STATUS"] == 'Approved') & (loan_df['TARGET'] == 1)]

In [1]:
# Checking for Repayers who were Refused Earlier: 

In [1]:
loan_df[(loan_df["NAME_CONTRACT_STATUS"] == 'Refused') & (loan_df['TARGET'] == 0)]

In [1]:
refused_repayers = loan_df[(loan_df["NAME_CONTRACT_STATUS"] == 'Refused') & (loan_df['TARGET'] == 0)]

In [1]:
columns = [ 'AGE_GROUP', 'EMPLOYEMENT_YEARS', 'NAME_INCOME_TYPE',  'NAME_EDUCATION_TYPE', 'OCCUPATION_TYPE', 
           'ORGANIZATION_TYPE', 'NAME_HOUSING_TYPE', 'NAME_FAMILY_STATUS', 'CHILD_COUNTS', 'FAMILY_COUNTS']


In [1]:
# Gathering all Defaulter % for Previously Approved with respect to the above variables:

for i in columns:
    print("Defaulter % for Previously Approved: ", i)
    plt.figure(figsize=[8,5])
    print(approved_default[i].value_counts(normalize=True)*100)
    approved_default[i].value_counts().plot.bar()
    plt.legend()
    plt.show()

### <font color = Green > Insights for Defaulter % for Previously Approved: </font>

+ Age Group - 30-35, followed by 35-45
+ Employment Years - 0-5
+ Income Type - Working
+ Education Type - Secondary Special
+ Occupation Type - Labourers
+ Organization Type - Business type 3
+ Housing Type - House/Apartment
+ Family Status - Married
+ Child Count - 0 and 1/2
+ Family Count - less than 2 and 2-4

In [1]:
# Gathering all Repayer % for Previously Refused with respect to the above variables:

for i in columns:
    print("Repayer % for Previously Refused for: ", i)
    plt.figure(figsize=[8,5])
    print(refused_repayers[i].value_counts(normalize=True)*100)
    refused_repayers[i].value_counts().plot.bar()
    plt.legend()
    plt.show()

### <font color = Green > Insights for Repayer % for Previously Refused: </font>

+ Age Group - 30-45 and above 60
+ Employment Years - 0-5 and 5-10
+ Income Type - Working
+ Education Type - Secondary Special
+ Occupation Type - Labourers, followed by Sales Staff and Core Staff
+ Organization Type - Business type 3
+ Housing Type - House/Apartment
+ Family Status - Married
+ Child Count - 0 and 1/2
+ Family Count - less than 2 and 2-4
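
Several categories top both lists above, which may simply reflect that they are the largest categories overall. One way to check is to compare a category's share within a subgroup against its share in the whole data. This is a minimal sketch on a toy frame with invented rows; the column names `overall_%`, `defaulter_%` and `ratio` are hypothetical:

```python
import pandas as pd

# Toy stand-in for loan_df; the rows are invented.
toy_loan = pd.DataFrame({
    "NAME_INCOME_TYPE": ["Working"] * 6 + ["Pensioner"] * 4,
    "TARGET": [1, 1, 0, 0, 0, 0, 0, 0, 0, 1],
})

# Share of each income type in the whole data and among defaulters only.
overall_share = toy_loan["NAME_INCOME_TYPE"].value_counts(normalize=True) * 100
defaulter_share = (
    toy_loan.loc[toy_loan["TARGET"] == 1, "NAME_INCOME_TYPE"]
    .value_counts(normalize=True) * 100
)

comparison = pd.DataFrame({"overall_%": overall_share, "defaulter_%": defaulter_share})
# A ratio above 1 means the category is over-represented among defaulters.
comparison["ratio"] = comparison["defaulter_%"] / comparison["overall_%"]
print(comparison.round(2))
```

A ratio close to 1 suggests the category is prominent only because it is common, while a ratio well above 1 points to a genuinely higher concentration of defaulters.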



### Subtask 2.8: Case Summary: 

+ Females are more likely to repay, whereas males are more likely to default.
+ 11% of the clients who were refused in the previous application did not repay the current loan.
+ People with an Academic Degree are less likely to default, but that does not guarantee repayment. Proper scrutiny is needed, taking other parameters into account.
+ People with the Secondary Special education type account for more defaulters.
+ Students and Businessmen (Income Type) have no defaults.
+ People above the age of 50 are less likely to default.
+ Prospects with 30+ years of experience have a default rate of less than 1%.
+ Banks should focus less on the income type ‘Working’, as it has the highest number of defaulters.


