# EDA Case Study

In [None]:
# Filtering out the warnings

import warnings

warnings.filterwarnings('ignore')

In [None]:
pwd

In [None]:
# Importing the required libraries

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.gridspec import GridSpec

In [None]:
# Setting options to view large data

pd.set_option('display.max_columns', 300) 
pd.set_option('display.max_rows', 300) 
pd.set_option('display.width', 1000)

In [None]:
# Read the csv file using 'read_csv'. application_data.csv 
app_data = pd.read_csv('../input/loan-defaulter/application_data.csv')

## Data Analysis

Now that we have loaded the dataset, we will inspect it and then move on to data manipulation, cleaning, analysis, and visualisation to get various insights about the data.

#### Checking the Application Data

In [None]:
# Check the number of rows and columns in the application_data dataframe
app_data.shape

In [None]:
# Check the column-wise info of the application_data dataframe
app_data.info(verbose=True, show_counts=True)

In [None]:
# Check the summary for the numeric columns application_data dataframe
app_data.describe()

### Checking the Null value in Application Data


In [None]:
null_app=round(app_data.isna().sum()/len(app_data)*100,2)
null_app.sort_values(ascending=False).head(100)

In [None]:
# Check the percentage of Null value

null_app=round(app_data.isnull().sum()/len(app_data)*100,2)
null_app.sort_values(ascending=False).head(100)

In [None]:
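# Cleaning the missing data

# Listing the columns with many null values
# Note: len(null_app) is the number of columns (122), so the cut-off below is
# 0.5 * 122 = 61 missing values per column, not a percentage of the rows

null_app = app_data.isnull().sum()
null_app = null_app[null_app.values>(0.5*len(null_app))]
len(null_app)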

In [None]:
# Removing those 64 columns

# null_app already holds only the columns selected above, so its index is the drop list
null_app = list(null_app.index)
app_data.drop(labels=null_app,axis=1,inplace=True)
print(len(null_app))

In [None]:
# Check the Application data frame after removal

app_data.shape

##### Here we have removed 64 columns, so the column count has dropped from 122 to 58
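
If the goal is to drop columns by their share of missing values rather than by a raw count, `isnull().mean()` gives that share directly. The sketch below uses a small made-up frame; `toy`, `threshold` and `cols_to_drop` are illustrative names and the 50% cut-off is an assumption, not a value taken from this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for app_data (values are made up)
toy = pd.DataFrame({
    'a': [1.0, np.nan, np.nan, np.nan],   # 75% missing
    'b': [1.0, 2.0, np.nan, 4.0],         # 25% missing
    'c': [1.0, 2.0, 3.0, 4.0],            # 0% missing
})

threshold = 0.5                            # hypothetical cut-off: more than 50% missing
null_share = toy.isnull().mean()           # fraction of missing values per column
cols_to_drop = null_share[null_share > threshold].index.tolist()
toy_clean = toy.drop(columns=cols_to_drop)
print(cols_to_drop)
```

Applying a share-based rule to the real data would select a different set of columns than the count-based rule used above, so the column counts reported in this notebook would change.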

In [None]:
# Checking the percentage of null values remaining in app_data

app_data.isnull().sum()/len(app_data)*100

In [None]:
# Checking again the null values in the various column

app_data.isnull().sum()

Only three columns still have missing values, and in very small numbers, so we will leave them as they are. The columns are:

*   AMT_ANNUITY - 12
*   CNT_FAM_MEMBERS - 2
*   DAYS_LAST_PHONE_CHANGE - 1

In [None]:
# To get the total count of null values in 'AMT_ANNUITY' column

app_data['AMT_ANNUITY'].isnull().sum()

Since the number of null values is very low, we will not impute or change them and will use the data as it is. The cells below show how the column could be imputed if needed.

In [None]:
# Checking the Outliers for Annuity Amount

sns.boxplot(data= app_data, y='AMT_ANNUITY')
plt.yscale('log')
plt.title('Distribution of Annuity Amount')
plt.ylabel('Total Annuity')
plt.show()

From the above graph:
* There seem to be outliers in the Annuity Amount column
* Within the box, the range below the median appears larger than the range above it
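
A box plot flags points beyond 1.5 times the interquartile range from the box. As a rough sketch of that rule on made-up numbers (the `values` series is illustrative, not taken from `app_data`):

```python
import pandas as pd

values = pd.Series([10, 12, 13, 14, 15, 16, 18, 200])  # made-up numbers

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1                      # interquartile range

lower = q1 - 1.5 * iqr             # lower whisker limit
upper = q3 + 1.5 * iqr             # upper whisker limit
outliers = values[(values < lower) | (values > upper)]
print(outliers.tolist())
```

The same few lines could be pointed at a column such as `AMT_ANNUITY` to count how many rows fall outside the whiskers.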

In [None]:
app_data['AMT_ANNUITY'].describe()

Because 'AMT_ANNUITY' contains outliers, the mean is not a suitable value for replacing the missing entries. The median, which is 24903.00, is a better choice in this case.
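
To see why the median is preferred when outliers are present, here is a minimal sketch on a made-up series with one extreme value. The numbers are illustrative only.

```python
import numpy as np
import pandas as pd

s = pd.Series([100.0, 110.0, 120.0, np.nan, 10000.0])  # made-up values, one outlier

mean_val = s.mean()        # pulled far upwards by the single extreme value
median_val = s.median()    # stays close to the typical values

filled = s.fillna(median_val)   # returns a new series with the gap filled
print(mean_val, median_val)
```

Here the mean lands far above three of the four observed values, while the median stays among them.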

In [None]:
# app_data['AMT_ANNUITY'] = app_data['AMT_ANNUITY'].fillna(app_data['AMT_ANNUITY'].median())

In [None]:
# To get the total count of null values in 'CNT_FAM_MEMBERS' column

app_data['CNT_FAM_MEMBERS'].isnull().sum()

In [None]:
# Checking the outliers for the column

sns.boxplot(data= app_data, y='CNT_FAM_MEMBERS')
plt.title('Distribution Family Members')
plt.ylabel('Total Family Members')
plt.show()

From the above graph:
* There are only a few outliers in the distribution of family members
* The most extreme outlier in this column is a family of around 20 members

In [None]:
sns.boxplot(data= app_data, y='AMT_CREDIT')
plt.title('Distribution Credit Amount')
plt.yscale('log')
plt.ylabel('Total Credit')
plt.show()

From the graph, we can conclude that:
* There are outliers present in the column.
* Most clients appear to be concentrated in the lower part of the range.

In [None]:
app_data['AMT_INCOME_TOTAL'].describe()

In [None]:
sns.boxplot(data= app_data, y='AMT_INCOME_TOTAL')
plt.title('Distribution Income Amount')
plt.yscale('log')
plt.ylabel('Total Income')
plt.show()

The Amount of Income graph shows that:
* There are significant outliers present in the income data.
* Within the box, the ranges below and above the median appear roughly equal.
    

In [None]:
# It is important to drop columns which are not relevant to our analysis

app_data.drop(columns=['FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE',
       'FLAG_PHONE', 'FLAG_EMAIL', 'REGION_RATING_CLIENT', 'REGION_RATING_CLIENT_W_CITY',
       'DAYS_LAST_PHONE_CHANGE', 'FLAG_DOCUMENT_2', 'FLAG_DOCUMENT_3', 'FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_6',
       'FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_8', 'FLAG_DOCUMENT_9', 'FLAG_DOCUMENT_10', 'FLAG_DOCUMENT_11', 'FLAG_DOCUMENT_12',
       'FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_14', 'FLAG_DOCUMENT_15', 'FLAG_DOCUMENT_16', 'FLAG_DOCUMENT_17', 'FLAG_DOCUMENT_18',
       'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20', 'FLAG_DOCUMENT_21'], inplace=True)

### The columns listed above have been removed; the remaining 29 columns will be analysed against the **Target** column

In [None]:
app_data.shape

#### Here we have removed 29 more columns, so the column count has dropped from 58 to 29

### Checking the categorical columns in the Application dataframe

In [None]:
app_data.nunique().sort_values()

In [None]:
# 'CODE_GENDER' column
# Checking the XNA values; each can be either Male or Female
# Since the most common value is Female, we will replace those 'XNA' values with Female

app_data['CODE_GENDER'].value_counts()

In [None]:
# Since only four records have XNA, we impute the most common value, Female (F), for those 4 records
app_data['CODE_GENDER'] = app_data['CODE_GENDER'].replace('XNA','F')

In [None]:
#checking again the CODE_GENDER value counts
app_data['CODE_GENDER'].value_counts()

### The imputation above uses the following code


> app_data['CODE_GENDER'] = app_data['CODE_GENDER'].replace('XNA','F')



In [None]:
# There are 2 Unknown values in the 'Name_Family_Status' column
# This can be replaced with the common value(mode)

app_data['NAME_FAMILY_STATUS'].value_counts()

The code below can be used to impute the NAME_FAMILY_STATUS column:
> app_data['NAME_FAMILY_STATUS'] = app_data['NAME_FAMILY_STATUS'].replace('Unknown','Married')

In [None]:
# XNA values are available in the 'Organization Type' column

app_data['ORGANIZATION_TYPE'].value_counts()

#### From the above we can see that there are a lot of XNA values in **ORGANIZATION_TYPE**, i.e. **XNA - 55374**, which make up **18% of the total rows**. Hence we can take the decision of **dropping these rows** (but this should be done **after approval** from other stakeholders)

### Checking the Numerical Columns available in the Application dataset

In [None]:
num_col=app_data.describe().columns
num_col

In [None]:
#checking DAYS_BIRTH column
app_data['DAYS_BIRTH'].value_counts()

In [None]:
# Creating Age column based on Days of Birth
# DAYS_BIRTH is numeric and negative, so we take the absolute value before converting to years

app_data['AGE']=(app_data['DAYS_BIRTH'].abs()/365).round(2).astype(int)

In [None]:
app_data['AGE'].describe()

By analysing the AGE column we can infer that:

*   Max age is **69 Years**
*   Min age is **20 Years**
*   Mean age is **~43 Years**
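
The conversion from a negative day offset to whole years can be sketched on a few made-up values. The `days` series below is illustrative, and a year is approximated as 365 days, as in the notebook.

```python
import pandas as pd

days = pd.Series([-7489, -15000, -25229])   # made-up negative day offsets

# Drop the sign, convert to years, and truncate to whole years
age_years = (days.abs() / 365).astype(int)
print(age_years.tolist())
```

Truncation means that someone 20.5 years old is recorded as 20, which is why the derived column only takes integer values.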

In [None]:
# DAYS_REGISTRATION
# Creating REGISTRATION_SINCE column (in whole years) based on DAYS_REGISTRATION

app_data['REGISTRATION_SINCE']=(app_data['DAYS_REGISTRATION'].abs()/365).round(2).astype(int)

In [None]:
app_data['REGISTRATION_SINCE'].describe()

By analysing the REGISTRATION_SINCE column we can infer that:

* Max time since registration is **67 Years**
* Min time since registration is **0 Years**
* Mean time since registration is **~13.1 Years**

In [None]:
#Creating a box_Plot to check the OutLiers

sns.boxplot(data= app_data, y='REGISTRATION_SINCE')
plt.title('Distribution Registration')
plt.yscale('log')
plt.ylabel('Registration Since')
plt.show()

Conclusions from the graph:
* Most customers are concentrated in the lower part of the range
* There are outliers present in the Registration column

In [None]:
# Creating WORKING_SINCE column (in whole years) based on DAYS_EMPLOYED
app_data['WORKING_SINCE']=(app_data['DAYS_EMPLOYED'].abs()/365).round(2).astype(int)

In [None]:
#Creating a box_Plot to check the Outliers

sns.boxplot(data= app_data, y='WORKING_SINCE')
plt.title('Distribution Working Since')
plt.yscale('log')
plt.ylabel('Working Since')
plt.show()

From the above graph, we can see that:
* The outliers in the column lie at a value of 1000 years, which is not a plausible working duration.
* Within the box, the range below the median appears larger than the range above it.

In [None]:
#checking the value counts for all the records of WORKING_SINCE columns 

app_data['WORKING_SINCE'].value_counts()

#### 55374 rows have a value of 1000 years

In [None]:
from IPython.display import Image
Image(filename='../input/screenshot/Org_type.png')

In [None]:
from IPython.display import Image
Image(filename='../input/screenshot/Working_Since.png')

### As derived above, ORGANIZATION_TYPE and WORKING_SINCE show a discrepancy in exactly the same number of records: 55374 records have a WORKING_SINCE of 1000 years, and in the ORGANIZATION_TYPE column these records are XNA
### Hence it is reasonable to drop these rows from the app_data dataframe
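
Matching counts suggest, but do not prove, that the two anomalies sit on the same rows. A minimal sketch of a row-level check on made-up data (the `toy` frame is illustrative; the column names mirror the notebook's):

```python
import pandas as pd

toy = pd.DataFrame({
    'ORGANIZATION_TYPE': ['XNA', 'Medicine', 'XNA', 'Government'],
    'WORKING_SINCE':     [1000, 4, 1000, 12],
})

xna_mask = toy['ORGANIZATION_TYPE'] == 'XNA'     # rows with the placeholder category
long_mask = toy['WORKING_SINCE'] == 1000         # rows with the implausible duration

same_rows = bool((xna_mask == long_mask).all())  # True only if both masks agree on every row
n_both = int((xna_mask & long_mask).sum())       # rows carrying both anomalies
print(same_rows, n_both)
```

Running the same comparison on `app_data` before dropping rows would confirm whether the 55374 records are identical in both columns.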

In [None]:
# Dropping the rows with XNA values; 55374 divided by 307511 is 18%, and dropping them should not have a significant impact on the dataset
app_data = app_data.drop(app_data.loc[app_data['ORGANIZATION_TYPE']=='XNA'].index)
app_data[app_data['ORGANIZATION_TYPE']=='XNA'].shape


#### We have dropped these 55374 records; let's check the ORGANIZATION_TYPE and WORKING_SINCE columns respectively

In [None]:
numeric_columns=['TARGET','CNT_CHILDREN','AMT_INCOME_TOTAL','AMT_CREDIT','AMT_ANNUITY','REGION_POPULATION_RELATIVE',
                'DAYS_EMPLOYED','DAYS_REGISTRATION','DAYS_ID_PUBLISH','HOUR_APPR_PROCESS_START','LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY',
       'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY']

In [None]:
app_data[numeric_columns].info()

These columns are already numeric. If they were not, we could convert them with the **pd.to_numeric** function.
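
As a small illustration of `pd.to_numeric`, the sketch below converts a made-up series of strings. The `raw` values are hypothetical, and `errors='coerce'` is one possible choice that turns unparseable entries into NaN instead of raising an error.

```python
import pandas as pd

raw = pd.Series(['10', '20.5', 'n/a'])            # made-up strings

# Entries that cannot be parsed become NaN instead of raising an error
converted = pd.to_numeric(raw, errors='coerce')
print(converted.dtype)
```

With the default `errors='raise'`, the same call would stop at `'n/a'`, which can be preferable when unexpected values should not pass silently.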

### Bucketing

In [None]:
# Creating the bucket for Age Group
# For reference, we can change the binning to quartile as well
# Note: pd.cut uses right-closed intervals by default, so the first bin is (20, 30] and an AGE of exactly 20 is left unbinned (NaN)

app_data['AGE_GROUP']=pd.cut(x=app_data['AGE'],bins=[20,30,40,50,60,70], labels=['20 to 30', '31 to 40', '41 to 50', '51 to 60', '61 to 70'])

In [None]:
app_data.AGE_GROUP.value_counts()

#### Here we can observe from the new AGE_GROUP column that 
* The most applications come from the 31 to 40 age group, i.e. 82550
* The fewest applications come from the 61 to 70 age group, i.e. 4483
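
The way `pd.cut` treats bin edges affects which rows end up in each bucket. The sketch below uses the same bins and labels as the AGE_GROUP cell on a few made-up ages and compares the default behaviour with `include_lowest=True`.

```python
import pandas as pd

ages = pd.Series([20, 21, 30, 31, 70])            # made-up ages on or near bin edges
bins = [20, 30, 40, 50, 60, 70]
labels = ['20 to 30', '31 to 40', '41 to 50', '51 to 60', '61 to 70']

# Default: intervals are (20, 30], (30, 40], ... so exactly 20 falls outside
default_cut = pd.cut(ages, bins=bins, labels=labels)

# include_lowest=True closes the first interval on the left: [20, 30]
inclusive_cut = pd.cut(ages, bins=bins, labels=labels, include_lowest=True)

print(default_cut.tolist())
print(inclusive_cut.tolist())
```

Only the first bin differs between the two calls; every other edge value lands in the bucket whose label it matches.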

In [None]:
# Creating the bucket for Working Since
# For reference, we can change the binning to quartile as well
# Note: with the default right-closed intervals the first bin is (1, 10], so values of 0 and 1 are left unbinned (NaN)

app_data['WORKING_AGE_GROUP']=pd.cut(x=app_data['WORKING_SINCE'],bins=[1,10,20,30,40,50], labels=['1 to 10', '11 to 20', '21 to 30', '31 to 40', '41 to 50'])

In [None]:
app_data.WORKING_AGE_GROUP.value_counts()

#### Here we can observe from the new WORKING_AGE_GROUP column that 

* The most applications come from the 1 to 10 WORKING_AGE_GROUP, i.e. 148807
* The fewest applications come from the 41 to 50 WORKING_AGE_GROUP, i.e. 175

In [None]:
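# REGISTRATION_SINCE
# Creating the bucket for REGISTRATION_SINCE_GROUP
# For reference, we can change the binning to quartile as well
# Note: with the default right-closed intervals the first bin is (1, 10], so values of 0 and 1 are left unbinned (NaN)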

app_data['REGISTRATION_SINCE_GROUP']=pd.cut(x=app_data['REGISTRATION_SINCE'],bins=[1,10,20,30,40,50,60], labels=['1 to 10', '11 to 20', '21 to 30', '31 to 40', '41 to 50','51 to 60'])

In [None]:
app_data['REGISTRATION_SINCE_GROUP'].value_counts()

#### Here we can observe from the new REGISTRATION_SINCE_GROUP column that 

* The most applications come from the 1 to 10 REGISTRATION_SINCE_GROUP, i.e. 90235
* The fewest applications come from the 51 to 60 REGISTRATION_SINCE_GROUP, i.e. 20

In [None]:
# We will check the Amt_Income_Total and Amt_Credit

app_data[['AMT_INCOME_TOTAL','AMT_CREDIT']].describe()

In [None]:
# Binning the income based on quantile

app_data['INCOME_GROUP']=pd.qcut(app_data['AMT_INCOME_TOTAL'],q=[0,0.1,0.3,0.6,0.8,1.0], labels=['Very Low','Low','Medium','High','Very High'])
app_data['INCOME_GROUP'].dtypes

In [None]:
app_data['INCOME_GROUP'].value_counts()

#### Here we can observe from the new INCOME_GROUP column that 
* The most applications come from the Medium INCOME_GROUP, i.e. 91768
* The fewest applications come from the Very Low INCOME_GROUP, i.e. 38905
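
Because the quantile edges used for the binning are unevenly spaced, the buckets are not expected to hold equal shares of the data. A minimal sketch with the same `q` and labels on made-up values from 1 to 100:

```python
import pandas as pd

income = pd.Series(range(1, 101))                 # made-up values 1..100

groups = pd.qcut(
    income,
    q=[0, 0.1, 0.3, 0.6, 0.8, 1.0],               # bucket widths: 10%, 20%, 30%, 20%, 20%
    labels=['Very Low', 'Low', 'Medium', 'High', 'Very High'],
)
counts = groups.value_counts()
print(counts)
```

With these edges the Medium bucket covers 30% of the values and Very Low only 10%, so Medium being the largest group and Very Low the smallest follows largely from how the bins are defined.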

In [None]:
# Binning the Amount Credit

app_data['AMT_CREDIT'].value_counts()

In [None]:
app_data['CREDIT_GROUP']=pd.qcut(app_data['AMT_CREDIT'],q=[0,0.1,0.3,0.6,0.8,1.0], labels=['Very Low','Low','Medium','High','Very High'])
app_data['CREDIT_GROUP'].dtypes

In [None]:
app_data['CREDIT_GROUP'].value_counts()

#### Here we can observe from the new CREDIT_GROUP column that 
* The most applications are for the Medium CREDIT_GROUP, i.e. 74748
* The fewest applications are for the Very Low CREDIT_GROUP, i.e. 26125

In [None]:
app_data['CODE_GENDER'].value_counts()

### Splitting the dataset into two: target1 (clients with payment difficulties) and target0 (the rest)


In [None]:
target0_app_data=app_data.loc[app_data["TARGET"]==0]
target1_app_data=app_data.loc[app_data["TARGET"]==1]

### Calculating Imbalance percentage for target column 

In [None]:
round(len(target0_app_data)/len(target1_app_data),2)

In [None]:
from matplotlib.pyplot import figure
figure(figsize=(8, 6), dpi=80)
y = np.array([len(target0_app_data), len(target1_app_data)])
mylabels = ["Non-Defaulters", "Defaulters"]

plt.pie(y, labels = mylabels,  autopct="%.1f%%", explode=[0.05]*2 )
plt.title('Imbalance Percentage for Target Column')
plt.show()

After calculating the imbalance for the target column, we can clearly infer that: 
* Target_1, i.e. people with payment difficulties, are around 10.55 times fewer than the Non-Defaulters
* **Defaulters are 8.7% of the total data and Non-Defaulters are 91.3%**
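
The ratio and the percentages describe the same imbalance in two ways. The sketch below shows how both follow from class counts, using a made-up target series of 1000 rows built from the rounded percentages above, so its ratio differs slightly from the 10.55 computed on the real data.

```python
import pandas as pd

# Made-up target column: 913 non-defaulters (0) and 87 defaulters (1)
target = pd.Series([0] * 913 + [1] * 87)

share = target.value_counts(normalize=True) * 100         # class share in percent
ratio = (target == 0).sum() / (target == 1).sum()          # non-defaulters per defaulter
print(share.round(1).to_dict(), round(ratio, 2))
```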

### Calculating Gender Imbalance 


In [None]:
gender_type_counts = app_data['CODE_GENDER'].value_counts()

In [None]:
gender_type_counts

In [None]:
from matplotlib.pyplot import figure
figure(figsize=(8, 6), dpi=80)

mylabels = ["Female", "Male"]

plt.pie(gender_type_counts, labels = mylabels,  autopct="%.1f%%", explode=[0.05]*2 )
plt.title('Gender Imbalance')
plt.show()

From the pie chart above, we can conclude that:
* Females make up **62.3% and Males 37.7%** of the total customers in the dataset.
* Clearly, more customers are female

# UNIVARIATE ANALYSIS

## Creating Box Plots for AMT_CREDIT , AMT_INCOME_TOTAL

In [None]:
# Box plotting for univariate variables analysis in logarithmic scale

def univariate_Analysis(data,col,title):
    from matplotlib.pyplot import figure
    figure(figsize=(8, 6), dpi=80)
    sns.set_style('darkgrid')
    sns.set_context('talk')
    plt.rcParams["axes.labelsize"] = 20
    plt.rcParams['axes.titlesize'] = 22
    plt.rcParams['axes.titlepad'] = 30
    
    plt.title(title)
    plt.yscale('log')
    sns.boxplot(data=data, y=col)
    plt.show()

In [None]:
# Univariate Analysis - Box plot For AMT_INCOME_TOTAL to check the outliers 

univariate_Analysis(data=target0_app_data,col='AMT_INCOME_TOTAL',title='Distribution of income amount for Non Defaulters')
univariate_Analysis(data=target1_app_data,col='AMT_INCOME_TOTAL',title='Distribution of income amount for Defaulters')

We can infer the points below from the above graphs:
* Both box plots are quite similar, although we can see a lot of outliers in both graphs. 
* However, for the Defaulters the income box extends further above the median, towards the third quartile, than below it

In [None]:
# Univariate Analysis - Box plot For AMT_CREDIT to check the outliers 

univariate_Analysis(data=target0_app_data,col='AMT_CREDIT',title='Distribution of AMT_CREDIT for Non Defaulters')
univariate_Analysis(data=target1_app_data,col='AMT_CREDIT',title='Distribution of AMT_CREDIT for Defaulters')

We can infer the points below from the above graphs of **Amount Credit for Defaulters and non-Defaulters**:

* Both box plots are quite similar, although we can see a lot of outliers in both graphs. 
* However, the Defaulters box plot shows lower maximum values than the non-Defaulters box plot.

In [None]:
# Plotting a graph for Organization type

sns.set_style('dark')
sns.set_context('talk')
plt.figure(figsize=(20,30))
plt.rcParams["axes.labelsize"] = 20
plt.rcParams['axes.titlesize'] = 22
plt.rcParams['axes.titlepad'] = 30

plt.title("Distribution by Organization type for target0(Non Defaulters)")

plt.xticks(rotation=45)
plt.xscale('log')

sns.countplot(data=target0_app_data,y='ORGANIZATION_TYPE',order=target0_app_data['ORGANIZATION_TYPE'].value_counts().index,palette='pastel')

plt.show()

We can infer the points below from the above graph of **Target 0 (Non-Defaulters) with ORGANIZATION_TYPE**:

* The most customers are from the Business Entity 3, Self Employed, Medicine, and Government organization types
* The fewest customers are from Industry type 8, Trade type 5, and Industry type 13 

In [None]:
# Plotting a graph for Organization type

sns.set_style('dark')
sns.set_context('talk')
plt.figure(figsize=(20,30))
plt.rcParams["axes.labelsize"] = 20
plt.rcParams['axes.titlesize'] = 22
plt.rcParams['axes.titlepad'] = 30
plt.title("Distribution by Organization type for target1(Defaulters)")

plt.xticks(rotation=45)
plt.xscale('log')

sns.countplot(data=target1_app_data,y='ORGANIZATION_TYPE',order=target1_app_data['ORGANIZATION_TYPE'].value_counts().index,palette='pastel')

plt.show()

We can infer the points below from the above graph of **Target 1 (Defaulters) with ORGANIZATION_TYPE**:

* The most customers are from the Business Entity 3, Business type 2, and Self Employed organization types
* The fewest customers are from Industry type 8, Trade type 4, and Trade type 5

In [None]:
# Helper that draws a count plot of `col` (optionally split by `hue`) on a log-scaled y axis
def plotting(df,col,title,hue =None):
    
    sns.set_style('whitegrid')
    sns.set_context('talk')
    plt.rcParams["axes.labelsize"] = 18
    plt.rcParams['axes.titlesize'] = 20
    plt.rcParams['axes.titlepad'] = 28
    temp = pd.Series(data = hue)
    fig, ax = plt.subplots()
    width = len(df[col].unique()) + 7 + 5*len(temp.unique())
    fig.set_size_inches(width , 7)
    plt.xticks(rotation=45)
    plt.yscale('log')
    plt.title(title)
    sns.countplot(data = df, x= col, order=df[col].value_counts().index,hue = hue,palette='pastel') 
        
    plt.show()

In [None]:
#Plotting a Graph for NAME_FAMILY_STATUS with Target Group 0 and Target Group 1
plotting(target0_app_data,col='NAME_FAMILY_STATUS',title='Number of Customer Distributed by NAME_FAMILY_STATUS for Target Group-0(Non Defaulters)')
plotting(target1_app_data,col='NAME_FAMILY_STATUS',title='Number of Customer Distributed by NAME_FAMILY_STATUS for Target Group-1(Defaulters)')

From the **Family Status graphs for Defaulters and non-Defaulters**, we can conclude that:
* Married customers form the largest group in both target groups, with a far higher count among the non-Defaulters.
* Single / not married customers follow closely.
* An Unknown status appears only for the non-Defaulters, where it has the lowest count.
* For the Defaulters, the Widow status has the lowest count.

In [None]:
#Plotting a Graph for REGISTRATION_SINCE_GROUP with Target Group 0 and Target Group 1
plotting(target0_app_data,col='REGISTRATION_SINCE_GROUP',title='Number of Customer Distributed by REGISTRATION_SINCE_GROUP for Target Group-0(Non Defaulters)')
plotting(target1_app_data,col='REGISTRATION_SINCE_GROUP',title='Number of Customer Distributed by REGISTRATION_SINCE_GROUP for Target Group-1(Defaulters)')


We can infer the points below from the above graphs of **Target 0 (Non-Defaulters) and Target 1 (Defaulters) with REGISTRATION_SINCE_GROUP**.
* The plots for Defaulters and Non-Defaulters are quite similar; however, there are very few values in the 51-60 range for the Defaulters.
* Most Defaulters as well as Non-Defaulters have been registered for 1 to 20 years.
* This suggests that the older the registration, the lower the chance of default, although the plots show raw counts rather than default rates.
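
Counts per group depend on how many customers fall in each group, so a default rate per group can make the comparison clearer. A minimal sketch on made-up data (the `toy` frame and its values are illustrative; the column names mirror the notebook's):

```python
import pandas as pd

toy = pd.DataFrame({
    'REGISTRATION_SINCE_GROUP': ['1 to 10'] * 8 + ['11 to 20'] * 2,
    'TARGET':                   [1, 1, 0, 0, 0, 0, 0, 0] + [1, 0],
})

# Number of defaulters per group (what a count plot of target1 shows)
default_count = toy.groupby('REGISTRATION_SINCE_GROUP')['TARGET'].sum()

# Share of defaulters per group (mean of a 0/1 column)
default_rate = toy.groupby('REGISTRATION_SINCE_GROUP')['TARGET'].mean()
print(default_count.to_dict(), default_rate.to_dict())
```

In this toy example the larger group has more defaulters in absolute terms but the lower default rate, which is the kind of difference a count plot alone cannot show.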

In [None]:
#Plotting a Graph for WORKING_AGE_GROUP with Target Group 0 and Target Group 1
plotting(target0_app_data,col='WORKING_AGE_GROUP',title='Number of Customer Distributed by WORKING_AGE_GROUP for Target Group-0(Non Defaulters)')
plotting(target1_app_data,col='WORKING_AGE_GROUP',title='Number of Customer Distributed by WORKING_AGE_GROUP for Target Group-1(Defaulters)')


We can infer the points below from the above graphs of **Target 0 (Non-Defaulters) and Target 1 (Defaulters) with WORKING_AGE_GROUP.**

* The plots for Defaulters and Non-Defaulters are quite similar.
* The counts per working-years group are far greater for the non-Defaulters than for the Defaulters.
* Most Defaulters as well as Non-Defaulters have been working for 1 to 10 years. 
* Very few applications in either target group fall within 41-50 working years.

In [None]:
#Plotting a Graph for NAME_CONTRACT_TYPE with Target Group 0 and Target Group 1
plotting(target0_app_data,col='NAME_CONTRACT_TYPE',title='Number of Customer Distributed by NAME_CONTRACT_TYPE for Target Group-0(Non Defaulters)')
plotting(target1_app_data,col='NAME_CONTRACT_TYPE',title='Number of Customer Distributed by NAME_CONTRACT_TYPE for Target Group-1(Defaulters)')



We can infer the points below from the above graphs of **Target 0 (Non-Defaulters) and Target 1 (Defaulters) with NAME_CONTRACT_TYPE.**

* The plots for Defaulters and Non-Defaulters are quite similar; however, the Cash loans count is higher for the non-Defaulters.
* More customers take Cash loans as their contract type, and most of the Defaulters also have Cash loans.

In [None]:
#Plotting a Graph for FLAG_OWN_CAR with Target Group 0 and Target Group 1
plotting(target0_app_data,col='FLAG_OWN_CAR',title='Number of Customer Distributed by FLAG_OWN_CAR for Target Group-0(Non Defaulters)')
plotting(target1_app_data,col='FLAG_OWN_CAR',title='Number of Customer Distributed by FLAG_OWN_CAR for Target Group-1(Defaulters)')


We can infer the points below from the above graphs of **Target 0 (Non-Defaulters) and Target 1 (Defaulters) with FLAG_OWN_CAR.**
* The plots for Defaulters and Non-Defaulters are quite similar.
* Most loan customers do not own a car, so most Defaulters are also non-car owners; this holds for both target groups.

In [None]:
#Plotting a Graph for FLAG_OWN_REALTY with Target Group 0 and Target Group 1
plotting(target0_app_data,col='FLAG_OWN_REALTY',title='Number of Customer Distributed by FLAG_OWN_REALTY for Target Group-0(Non Defaulters)')
plotting(target1_app_data,col='FLAG_OWN_REALTY',title='Number of Customer Distributed by FLAG_OWN_REALTY for Target Group-1(Defaulters)')



We can infer the points below from the above graphs of **Target 0 (Non-Defaulters) and Target 1 (Defaulters) with FLAG_OWN_REALTY.**
* Most loan customers own a house or flat (FLAG_OWN_REALTY = Y), so most Defaulters are also house/flat owners; the ordering is the same in both target groups.

# BI-VARIATE ANALYSIS

In [None]:
#Plotting a Graph for Income Range with Target Group 0
plotting(target0_app_data,col='INCOME_GROUP',title='Number of Customer Distributed by Income Range for Target Group-0(Non Defaulters)',hue='CODE_GENDER')

### From the graph of Income Distribution for the non-Defaulters with CODE_GENDER:

* In all the income groups, female customers outnumber male customers.
* The most applications come from the Medium INCOME_GROUP.
* The fewest applications come from the Low INCOME_GROUP.
* This suggests that women hold more loans than men in every income group, although counts alone do not measure the chance of getting a loan.
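
The counts behind a count plot with a `hue` can also be read from a cross-tabulation. As a sketch on made-up rows (the `toy` frame is illustrative), `pd.crosstab` returns the counts, and `normalize='index'` turns them into shares within each income group:

```python
import pandas as pd

toy = pd.DataFrame({
    'INCOME_GROUP': ['Low', 'Low', 'Low', 'High', 'High'],
    'CODE_GENDER':  ['F', 'F', 'M', 'F', 'M'],
})

# Raw counts per income group and gender
counts = pd.crosstab(toy['INCOME_GROUP'], toy['CODE_GENDER'])

# Row-wise shares: each income group sums to 1
shares = pd.crosstab(toy['INCOME_GROUP'], toy['CODE_GENDER'], normalize='index')
print(counts)
print(shares)
```

Shares make groups of different sizes comparable, which is harder to judge from bar heights on a log-scaled axis.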

In [None]:
#Plotting a Graph for Income Range with Target Group 1
plotting(target1_app_data,col='INCOME_GROUP',title='Number of Customer Distributed by Income Range for Target Group-1(Defaulters)',hue='CODE_GENDER')


### From the graph of Income Distribution for the Defaulters with CODE_GENDER:

* Female customers again outnumber male customers overall, but males are higher in the High and Very High income groups.
* Except for the Medium income group, the number of customers decreases as income increases.
* The most applications come from the Medium INCOME_GROUP.
* The fewest applications come from the Very High INCOME_GROUP.

In [None]:
#Plotting a Graph for Credit Range with Target Group 0
plotting(target0_app_data,col='CREDIT_GROUP',title='Number of Customer Distributed by CREDIT GROUP for Target Group-0(Non Defaulters)',hue='CODE_GENDER')


### From the graph of Credit Distribution for the non-Defaulters with CODE_GENDER:

* The customer distribution across credit groups is dominated by Females, followed by Males.
* Apart from the Medium credit group, the Very High, Low and High credit groups show roughly equal counts.
* The most applications come from the Medium CREDIT_GROUP.
* The fewest applications come from the Very Low CREDIT_GROUP.

In [None]:
#Plotting a Graph for Credit Range with Target Group 1
plotting(target1_app_data,col='CREDIT_GROUP',title='Number of Customer Distributed by CREDIT GROUP for Target Group-1(Defaulters)',hue='CODE_GENDER')


### From the graph of Credit Distribution for the Defaulters with CODE_GENDER:

* The customer counts across credit groups are higher for Females, followed by Males.
* The most applications come from the Medium CREDIT_GROUP.
* The fewest applications come from the Very Low CREDIT_GROUP.
    * This is the same as for the non-Defaulters group.

In [None]:
#Plotting a Graph for AGE_GROUP with Target Group 0
plotting(target0_app_data,col='AGE_GROUP',title='Number of Customer Distributed by AGE GROUP for Target Group-0(Non Defaulters)',hue='CODE_GENDER')


### From the graph of Age Group Distribution for the non-Defaulters with CODE_GENDER:

* The 31-40 age group is the largest in this distribution, followed closely by the 41-50 age group, with little difference between them.
* The 20 to 30 and 51 to 60 age groups differ only minimally.
* The 61-70 age group has the lowest count in the dataset.

In [None]:
#Plotting a Graph for AGE_GROUP with Target Group 1
plotting(target1_app_data,col='AGE_GROUP',title='Number of Customer Distributed by AGE GROUP for Target Group-1(Defaulters)',hue='CODE_GENDER')


### From the graph of Age Group Distribution for the Defaulters with CODE_GENDER:

* Here too, female customers outnumber male customers in every age group.
* The most applications come from the 31 to 40 AGE_GROUP.
* The fewest applications come from the 61 to 70 AGE_GROUP.

In [None]:
#Plotting a Graph for NAME_EDUCATION_TYPE with Target Group 0
plotting(target0_app_data,col='NAME_EDUCATION_TYPE',title='Number of Customer Distributed by NAME_EDUCATION_TYPE for Target Group-0(Non Defaulters)',hue='CODE_GENDER')


### From the graph of Education Type for the non-Defaulters with CODE_GENDER:

* For the non-Defaulters, Secondary / secondary special accounts for the most applications for both genders.
* It is followed by Higher education and Incomplete higher.
* For Lower secondary, there is minimal difference between Males and Females.
* Academic degree accounts for the fewest applications.

In [None]:
#Plotting a Graph for NAME_EDUCATION_TYPE with Target Group 1
plotting(target1_app_data,col='NAME_EDUCATION_TYPE',title='Number of Customer Distributed by NAME_EDUCATION_TYPE for Target Group-1(Defaulters)',hue='CODE_GENDER')


### From the graph of Education Type for the Defaulters with CODE_GENDER:

* For the Defaulters, Secondary / secondary special accounts for the most applications for both genders.
* It is followed by Higher education, Incomplete higher and Lower secondary.
* However, for Academic degree in the Defaulters group, very few applications come from Males.

In [None]:
#Plotting a Graph for INCOME_GROUP with Target Group 0
plotting(target0_app_data,col='INCOME_GROUP',title='Number of Customer Distributed by INCOME_GROUP and AGE_GROUP for Target Group-0(NON Defaulters)',hue='AGE_GROUP')


### We can infer the points below from the distribution of Income Group for the non-Defaulters with AGE_GROUP:

* Medium income customers account for the most applications in every age group, with the 31-40 age group the highest and 61-70 the lowest, followed by the Very High income group.
* There are no significant differences between the High, Very Low and Low income groups; in each, the 31-40 age group is the highest and 61-70 the lowest.


In [None]:
#Plotting a Graph for INCOME_GROUP with Target Group 1
plotting(target1_app_data,col='INCOME_GROUP',title='Number of Customer Distributed by INCOME_GROUP and AGE_GROUP for Target Group-1(Defaulters)',hue='AGE_GROUP')


### We can infer the points below from the distribution of Income Group for the Defaulters with AGE_GROUP:
* Most Defaulters are in the Medium income group, within the 31-40 and 20-30 age groups.
* There are no significant differences between the High, Very Low and Low income groups; in each, the 31-40 age group is the highest and 61-70 the lowest.


* **The distribution of Income Group within Age Group does not differ significantly between the two target groups (Defaulters and non-Defaulters).**

# Multivariate Analysis with Heatmap

### Finding the correlation between 'AMT_INCOME_TOTAL','AMT_CREDIT','AGE','WORKING_SINCE','REGISTRATION_SINCE' columns

In [None]:
target0_app_data[['AMT_INCOME_TOTAL','AMT_CREDIT','AGE','WORKING_SINCE','REGISTRATION_SINCE']].corr()

In [None]:
x = target0_app_data[['AMT_INCOME_TOTAL','AMT_CREDIT','AGE','WORKING_SINCE','REGISTRATION_SINCE']].corr()

In [None]:
plt.figure(figsize=(20, 10))
sns.set_style('whitegrid')
sns.set_context('talk')
plt.rcParams['axes.titlesize'] = 23
plt.rcParams['axes.titlepad'] = 20
plt.rcParams["axes.labelsize"] = 20
plt.title("Finding the correlation between columns through Heatmap for Target Group-0(NON Defaulters)")
sns.heatmap(x,annot=True,cmap='Pastel2')
plt.show()

#### From the Heatmap above, we can find the following correlations between columns for the NON-DEFAULTERS:

* AMT_CREDIT and AMT_INCOME_TOTAL are positively correlated, with a coefficient of about **0.33 (33%)**
* They are followed by AMT_CREDIT and WORKING_SINCE at **0.087 (8.7%)**
* Some columns are negatively correlated, such as AMT_INCOME_TOTAL and REGISTRATION_SINCE at **-0.034 (-3.4%)**

In [None]:
x1 = target1_app_data[['AMT_INCOME_TOTAL','AMT_CREDIT','AGE','WORKING_SINCE','REGISTRATION_SINCE']].corr()


In [None]:
plt.figure(figsize=(20, 10))
sns.set_style('whitegrid')
sns.set_context('talk')
plt.rcParams['axes.titlesize'] = 23
plt.rcParams['axes.titlepad'] = 20
plt.rcParams["axes.labelsize"] = 20
plt.title("Finding the correlation between columns through Heatmap for Target Group-1(Defaulters)")
sns.heatmap(x1,annot=True,cmap='Pastel2')
plt.show()

#### From the heatmap above, we can find the following correlations between columns for the DEFAULTERS:

* AGE and WORKING_SINCE are positively correlated, with a correlation coefficient of about 0.31 (**31%**).
* They are followed by AGE and REGISTRATION_SINCE at about 0.024 (**2.4%**).
* Some columns are negatively correlated, such as AMT_INCOME_TOTAL and WORKING_SINCE.

In [None]:
# Correlation matrix for an extended set of columns
# ('REG_REGION_NOT_LIVE_REGION' is listed only once to avoid a duplicated row/column in the heatmap)
B = target0_app_data[['AMT_INCOME_TOTAL','AMT_CREDIT','AGE','WORKING_SINCE','REGISTRATION_SINCE','CNT_CHILDREN',
                      'REG_REGION_NOT_LIVE_REGION','AMT_ANNUITY','REGION_POPULATION_RELATIVE',
                      'REG_REGION_NOT_WORK_REGION','LIVE_REGION_NOT_WORK_REGION','REG_CITY_NOT_LIVE_CITY',
                      'REG_CITY_NOT_WORK_CITY','LIVE_CITY_NOT_WORK_CITY']].corr()

In [None]:
plt.figure(figsize=(20, 10))
sns.set_style('whitegrid')
sns.set_context('talk')
plt.rcParams['axes.titlesize'] = 23
plt.rcParams['axes.titlepad'] = 20
plt.rcParams["axes.labelsize"] = 20
plt.title("Finding the correlation between columns through Heatmap for Target Group-0(Non-Defaulters)")
sns.heatmap(B,annot=False,cmap='YlGnBu')
plt.show()

#### From the heatmap above, we can find the following correlations between columns for the NON-DEFAULTERS:

* LIVE_REGION_NOT_WORK_REGION and REG_REGION_NOT_WORK_REGION are positively correlated.
* They are followed by REG_CITY_NOT_WORK_CITY and LIVE_CITY_NOT_WORK_CITY, and by AMT_ANNUITY and AMT_CREDIT.
* Some columns are negatively correlated, such as CNT_CHILDREN and AGE.

In [None]:
# Correlation matrix for an extended set of columns
# ('REG_REGION_NOT_LIVE_REGION' is listed only once to avoid a duplicated row/column in the heatmap)
B1 = target1_app_data[['AMT_INCOME_TOTAL','AMT_CREDIT','AGE','WORKING_SINCE','REGISTRATION_SINCE','CNT_CHILDREN',
                       'REG_REGION_NOT_LIVE_REGION','AMT_ANNUITY','REGION_POPULATION_RELATIVE',
                       'REG_REGION_NOT_WORK_REGION','LIVE_REGION_NOT_WORK_REGION','REG_CITY_NOT_LIVE_CITY',
                       'REG_CITY_NOT_WORK_CITY','LIVE_CITY_NOT_WORK_CITY']].corr()

In [None]:
plt.figure(figsize=(20, 10))
sns.set_style('whitegrid')
sns.set_context('talk')
plt.rcParams['axes.titlesize'] = 23
plt.rcParams['axes.titlepad'] = 20
plt.rcParams["axes.labelsize"] = 20
plt.title("Finding the correlation between columns through Heatmap for Target Group-1(Defaulters)")
sns.heatmap(B1,annot=False,cmap='YlGnBu')
plt.show()

#### From the heatmap above, we can find the following correlations between columns for the DEFAULTERS:

* LIVE_REGION_NOT_WORK_REGION and REG_REGION_NOT_WORK_REGION are positively correlated.
* They are followed by REG_CITY_NOT_WORK_CITY and LIVE_CITY_NOT_WORK_CITY, and by AMT_ANNUITY and AMT_CREDIT.
* Some columns are negatively correlated, such as AGE and CNT_CHILDREN, REG_CITY_NOT_LIVE_CITY and AGE, and so on.
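
Since the two heatmaps look very similar, one way to see where the groups actually differ is to subtract one correlation matrix from the other. This is only a sketch on synthetic data: the `TARGET` split and column names follow the notebook, the values are random.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 400
# Hypothetical frame with a TARGET flag; all values are synthetic.
demo = pd.DataFrame({
    'TARGET': rng.integers(0, 2, n),
    'AGE': rng.integers(21, 70, n),
    'CNT_CHILDREN': rng.integers(0, 4, n),
    'AMT_CREDIT': rng.normal(500_000, 100_000, n),
})
cols = ['AGE', 'CNT_CHILDREN', 'AMT_CREDIT']

corr_0 = demo.loc[demo['TARGET'] == 0, cols].corr()
corr_1 = demo.loc[demo['TARGET'] == 1, cols].corr()

# Entries far from zero mark pairs whose correlation differs between the groups.
corr_diff = corr_1 - corr_0
print(corr_diff.round(3))
```

With the notebook's variables this would presumably be `B1 - B`, which could then be passed to `sns.heatmap` with a diverging colour map.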

## Previous Application Data

In [None]:
# Read the csv file using 'read_csv'. previous_application.csv

prev_app = pd.read_csv('../input/loan-defaulter/previous_application.csv')
prev_app.head(3)

In [None]:
# Checking the number of rows and columns in the previous_application dataframe

prev_app.shape

In [None]:
# Checking the column wise info from the dataframe

prev_app.info()

In [None]:
# Checking the summary for the numeric columns in the dataframe

prev_app.describe()

### Checking the Null value in Previous Application data

In [None]:
# Check the percentage of Null value

null_prev=round(prev_app.isnull().sum()/len(prev_app)*100,2)
null_prev.sort_values(ascending=False).head(100)

In [None]:
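# Cleaning the missing data: selecting the columns to drop
# Note: the threshold below is 0.5 * len(null_prev_app), i.e. half the number of columns,
# so it selects every column whose null count exceeds that value (it is not 50% of the rows)

null_prev_app = prev_app.isnull().sum()
null_prev_app = null_prev_app[null_prev_app.values>(0.5*len(null_prev_app))]
len(null_prev_app)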

In [None]:
# Listing the selected columns with their null counts

null_prev_app

In [None]:
# Dropping the columns selected above

# Every selected null count is already well above 0.5, so no further filtering is needed
null_prev_app = list(null_prev_app.index)
prev_app.drop(labels=null_prev_app,axis=1,inplace=True)
print(len(null_prev_app))

In [None]:
# Check the Previous Application data after removal
prev_app.shape

#### The number of columns has been reduced from 37 to 22 after dropping the columns with missing values
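
For comparison, a threshold expressed as a share of rows can be written with `isnull().mean()`. The sketch below uses a toy frame, not the notebook's data, and the 0.5 cut-off is just the value discussed above. Applied to `prev_app`, it may select a different set of columns than the cells above.

```python
import numpy as np
import pandas as pd

# Toy frame: 'mostly_empty' is 75% missing, 'some_missing' is 25% missing.
demo = pd.DataFrame({
    'complete': [1, 2, 3, 4],
    'some_missing': [1.0, np.nan, 3.0, 4.0],
    'mostly_empty': [np.nan, np.nan, np.nan, 4.0],
})

# isnull().mean() gives the fraction of missing rows per column.
null_fraction = demo.isnull().mean()
cols_to_drop = null_fraction[null_fraction > 0.5].index.tolist()

cleaned = demo.drop(columns=cols_to_drop)
print(cols_to_drop)
print(cleaned.shape)
```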

In [None]:
# Checking the data types of the columns

prev_app.dtypes

#### The data types in the data frame look good, so we are not changing the data type of any column

In [None]:
# Checking the remaining percentage of null values per column in the Previous Application data

100*(prev_app.isnull().sum()/len(prev_app.index))

* From the above output, we can see that the column 'AMT_CREDIT' contains very few null values. Although they will not affect our further analysis, we can impute this column for the purpose of our study.

In [None]:
prev_app['AMT_CREDIT'].value_counts()

In [None]:
# Checking the Outliers for Credit Amount in the Previous Application data

sns.boxplot(data=prev_app, y='AMT_CREDIT')
plt.yscale('log')
plt.title('Distribution of Credit Amount')
plt.ylabel('Total Credit')
plt.show()

From the above graph:
* We can see that there are outliers in the distribution.

#### Because the column has outliers, the median can be used to impute AMT_CREDIT:

> prev_app['AMT_CREDIT'] = prev_app['AMT_CREDIT'].fillna(prev_app['AMT_CREDIT'].median())
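
To illustrate why the median is the safer fill value when outliers are present, here is a small sketch on a toy series (the numbers are made up). A single extreme value pulls the mean far away from the typical values, while the median barely moves.

```python
import numpy as np
import pandas as pd

# Toy series with one extreme value and one missing value.
credit = pd.Series([100.0, 120.0, 110.0, 10_000.0, np.nan])

print(credit.mean())    # pulled upwards by the outlier
print(credit.median())  # stays close to the typical values

# Assign the result back rather than relying on inplace=True on a column selection.
credit_filled = credit.fillna(credit.median())
print(credit_filled)
```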

In [None]:
prev_app['NAME_CLIENT_TYPE'].value_counts()

* The XNA category accounts for 0.11% of the values in this column, so it is better to treat those values as follows:

* Convert the XNA category to the Repeater category, because Repeater is the most common value (mode)

#### The code below can be used for the imputation

In [None]:
# Assigning the result back instead of using inplace=True on a column selection
prev_app['NAME_CLIENT_TYPE'] = prev_app['NAME_CLIENT_TYPE'].replace('XNA','Repeater')

In [None]:
# Updated values in 'NAME_CLIENT_TYPE' column

prev_app['NAME_CLIENT_TYPE'].value_counts()

In [None]:
# Checking the value in 'NAME_CASH_LOAN_PURPOSE' column

prev_app['NAME_CASH_LOAN_PURPOSE'].value_counts()

#### The values XAP and XNA account for the following percentages:
* XAP makes up 55% of the data
* XNA makes up 40.5% of the data

Since these placeholder values cover too large a share of the column to impute, it is better to drop these rows
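
As a sketch of how such shares can be obtained, `value_counts(normalize=True)` returns fractions directly. The series below is a toy example whose proportions were picked by hand, so its output is not the notebook's.

```python
import pandas as pd

# Toy column with 20 rows: 11 'XAP', 8 'XNA', 1 'Repairs'.
purpose = pd.Series(['XAP'] * 11 + ['XNA'] * 8 + ['Repairs'])

# normalize=True returns fractions; multiply by 100 for percentages.
share = (purpose.value_counts(normalize=True) * 100).round(2)
print(share)

# Keep only the rows that carry a real loan purpose.
kept = purpose[~purpose.isin(['XAP', 'XNA'])]
print(len(kept))
```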

In [None]:
# Removing the rows where 'NAME_CASH_LOAN_PURPOSE' is 'XNA' or 'XAP' for our further analysis

prev_app = prev_app[~prev_app['NAME_CASH_LOAN_PURPOSE'].isin(['XNA','XAP'])].copy()

prev_app.shape

In [None]:
# Checking the values for Name_Portfolio column

prev_app['NAME_PORTFOLIO'].value_counts()

* The number of XNA values is too small to be worth imputing. Thus we can ignore these values and proceed with our analysis.

In [None]:
prev_app['DAYS_DECISION'].value_counts()

* As we can see from the above output, the values in this column are negative, so they need to be converted to absolute values

In [None]:
# Converting the negative day counts to absolute values, then to whole years
# (the column is numeric, so .abs() is used; a string replace of '-' would have no effect)

prev_app['YEARS_DECISION']=(prev_app['DAYS_DECISION'].abs()/365).round(2).astype(int)
prev_app['YEARS_DECISION'].describe()
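
A short worked example of the same conversion on a few hand-picked day counts (not taken from the dataset) shows how the values are truncated to whole years. Note that rounding to two decimals before `astype(int)` turns 364 days (0.997 years) into 1.

```python
import pandas as pd

# Hypothetical DAYS_DECISION values: negative day offsets relative to the application.
days_decision = pd.Series([-73, -364, -730, -1000])

years_decision = (days_decision.abs() / 365).round(2).astype(int)
print(years_decision.tolist())  # [0, 1, 2, 2]
```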

## MERGING THE APPLICATION AND PREVIOUS APPLICATION DATASETS

In [None]:
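# suffixes must be a pair: '_' is appended to overlapping columns from app_data, 'x' to those from prev_app
merge= pd.merge(left=app_data, right=prev_app, how='inner', on='SK_ID_CURR',suffixes=('_','x'))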
merge.head()

### Renaming the Columns after Merging

In [None]:
merge=merge.rename({'NAME_CONTRACT_TYPE_':'NAME_CONTRACT_TYPE','AMT_CREDIT_':'AMT_CREDIT','AMT_ANNUITY_':'AMT_ANNUITY',
                         'WEEKDAY_APPR_PROCESS_START_' : 'WEEKDAY_APPR_PROCESS_START',
                         'HOUR_APPR_PROCESS_START_':'HOUR_APPR_PROCESS_START','NAME_CONTRACT_TYPEx':'NAME_CONTRACT_TYPE_PREV',
                         'AMT_CREDITx':'AMT_CREDIT_PREV','AMT_ANNUITYx':'AMT_ANNUITY_PREV',
                         'WEEKDAY_APPR_PROCESS_STARTx':'WEEKDAY_APPR_PROCESS_START_PREV',
                         'HOUR_APPR_PROCESS_STARTx':'HOUR_APPR_PROCESS_START_PREV'}, axis=1)
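
As an alternative sketch, the suffixes can be chosen so that the overlapping columns get their final names during the merge and no rename step is needed. The two frames below are toy stand-ins, and the `('', '_PREV')` pair is an assumption chosen to match the naming used in this notebook.

```python
import pandas as pd

# Toy stand-ins for the current and previous application tables.
current = pd.DataFrame({'SK_ID_CURR': [1, 2], 'AMT_CREDIT': [500, 800]})
previous = pd.DataFrame({'SK_ID_CURR': [1, 1, 3], 'AMT_CREDIT': [100, 150, 900]})

# Left-hand columns keep their names; right-hand overlaps get '_PREV'.
merged = pd.merge(current, previous, how='inner', on='SK_ID_CURR',
                  suffixes=('', '_PREV'))
print(merged.columns.tolist())
print(len(merged))
```

The inner join keeps only customers present in both tables, and a customer with several previous applications appears once per previous application.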

### Removing the Unwanted columns to analyse further

In [None]:
merge.drop(['SK_ID_CURR','WEEKDAY_APPR_PROCESS_START', 'HOUR_APPR_PROCESS_START','REG_REGION_NOT_LIVE_REGION', 
              'REG_REGION_NOT_WORK_REGION','LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY',
              'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY','WEEKDAY_APPR_PROCESS_START_PREV',
              'HOUR_APPR_PROCESS_START_PREV', 'FLAG_LAST_APPL_PER_CONTRACT','NFLAG_LAST_APPL_IN_DAY'],axis=1,inplace=True)

## Performing Univariate Analysis

### Analysing the NAME_CONTRACT_STATUS column

In [None]:
merge['NAME_CONTRACT_STATUS'].value_counts()

In [None]:
# Checking the percentage value for Contract Status 

df= round((merge['NAME_CONTRACT_STATUS'].value_counts()/merge['NAME_CONTRACT_STATUS'].count())*100,2)
df

### Distribution of Contract Status Against the Purposes

In [None]:

sns.set_style('dark')
sns.set_context('talk')
plt.figure(figsize=(17,32))
plt.rcParams["axes.labelsize"] = 22
plt.rcParams['axes.titlesize'] = 24
plt.rcParams['axes.titlepad'] = 32
plt.xticks(rotation=45)
plt.xscale('log')

plt.title('Distribution Of Contract Status With Purposes ')
ax = sns.countplot(data = merge, y= 'NAME_CASH_LOAN_PURPOSE', 
                   order=merge['NAME_CASH_LOAN_PURPOSE'].value_counts().index,hue = 'NAME_CONTRACT_STATUS',palette='pastel')

#### From the above graph, the following points can be taken:
* Loan applications for the purpose of 'Repairs' have the most rejections.
* A significant number of rejections can also be seen for 'Other' purposes, followed closely by loans for the purpose of 'Urgent Need'.
* There are also applications where the applicant refused to name the goal of the application; these have the fewest approvals and rejections.

In summary, we can see that the number of rejections exceeds the number of approved loans.
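
Because the count plot uses a log scale, shares are hard to judge by eye. A row-normalised cross-tabulation gives the percentage of each status per purpose. This is a minimal sketch on a toy frame: the column names follow the notebook, the rows are invented.

```python
import pandas as pd

demo = pd.DataFrame({
    'NAME_CASH_LOAN_PURPOSE': ['Repairs', 'Repairs', 'Repairs', 'Other', 'Other'],
    'NAME_CONTRACT_STATUS': ['Refused', 'Refused', 'Approved', 'Approved', 'Refused'],
})

# normalize='index' makes every row (purpose) sum to 1; multiply by 100 for percentages.
status_share = pd.crosstab(demo['NAME_CASH_LOAN_PURPOSE'],
                           demo['NAME_CONTRACT_STATUS'],
                           normalize='index') * 100
print(status_share.round(2))
```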

### Distribution of Loan Purposes Against Target

In [None]:
sns.set_style('dark')
sns.set_context('talk')
plt.figure(figsize=(17,32))
plt.rcParams["axes.labelsize"] = 22
plt.rcParams['axes.titlesize'] = 24
plt.rcParams['axes.titlepad'] = 32
plt.xticks(rotation=45)
plt.xscale('log')

plt.title('Distribution Of Loan Purposes With Target')
ax = sns.countplot(data = merge, y= 'NAME_CASH_LOAN_PURPOSE', 
                   order=merge['NAME_CASH_LOAN_PURPOSE'].value_counts().index,hue = 'TARGET',palette='pastel')

#### Conclusions from the graph:

* Applications for the purpose of Repairs face the most difficulties with on-time payment.
* For purposes such as 'Buying a land', 'Buying a garage', 'Buying a new car' and so on, loans repaid on time significantly outnumber loans with payment difficulties.

**In summary, the bank can focus on the purposes that lead to the fewest payment difficulties and the highest repayment.**
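
Since `TARGET` is a 0/1 flag, its mean per purpose is the share of customers with payment difficulties, which allows the purposes to be ranked directly. The frame below is a toy example with invented rows.

```python
import pandas as pd

demo = pd.DataFrame({
    'NAME_CASH_LOAN_PURPOSE': ['Repairs'] * 4 + ['Buying a garage'] * 4,
    'TARGET': [1, 1, 0, 0, 0, 0, 0, 1],
})

# The mean of a 0/1 column is the fraction of ones, i.e. the default rate.
default_rate = (demo.groupby('NAME_CASH_LOAN_PURPOSE')['TARGET']
                    .mean()
                    .mul(100)
                    .sort_values(ascending=False))
print(default_rate)
```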

## Performing Bivariate Analysis

### Plotting for the Previous Credit Amount and the Loan Purposes

In [None]:
plt.figure(figsize=(20,12))

sns.barplot(data =merge, x='NAME_CASH_LOAN_PURPOSE',hue='NAME_INCOME_TYPE',y='AMT_CREDIT', palette='pastel')
plt.xticks(rotation=90)
plt.xlabel('Loan Purposes', fontsize=20)
plt.ylabel('Amount Credit', fontsize=20)
plt.yscale('log')
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.title('Previous Credit amount vs Loan Purpose')
plt.show()

#### Inferences:
* The credit amount for the purposes of buying a new car, buying a home and buying land is higher for the majority of the income types.
* State servants applied for more credit in most of the loan purposes.
* The fewest applications come from the purpose 'Money for the Third Person' across all of the income types.

### Plotting for the Previous Credit Amount and the Housing Type

In [None]:
plt.figure(figsize=(20,12))

sns.barplot(data =merge, x='NAME_HOUSING_TYPE',hue='TARGET',y='AMT_CREDIT_PREV', palette='pastel')
plt.xticks(rotation=90)
plt.xlabel('Housing Type', fontsize=20)
plt.ylabel('Amount Credit Previous', fontsize=20)
plt.yscale('log')
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.title('Previous Credit amount vs Housing Type')
plt.show()

#### From the graph, we can conclude that:

**Office apartments show a higher previous credit amount, especially for Non-Defaulters compared with Defaulters. We can conclude that the bank should avoid approving loans for Co-op apartments, as these customers have a greater chance of not making payments. The bank should focus on the House/Apartment and With parents housing types, as these customers are more likely to make their payments.**
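
`sns.barplot` draws the mean of `y` for each group by default, so the bar heights can be checked against a pivot table of means. The sketch below uses a toy frame with invented amounts.

```python
import pandas as pd

demo = pd.DataFrame({
    'NAME_HOUSING_TYPE': ['Office apartment', 'Office apartment',
                          'Co-op apartment', 'Co-op apartment'],
    'TARGET': [0, 1, 0, 1],
    'AMT_CREDIT_PREV': [400.0, 300.0, 200.0, 250.0],
})

# One row per housing type, one column per target group, mean credit in each cell.
mean_credit = demo.pivot_table(index='NAME_HOUSING_TYPE', columns='TARGET',
                               values='AMT_CREDIT_PREV', aggfunc='mean')
print(mean_credit)
```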


## Findings from Application Dataset Analysis.

- From the imbalance percentage of the target column, we can clearly infer that Target_1, i.e. people with payment difficulties, are around 10.55 times fewer than the Non-Defaulters
- Defaulters are 8.7% of the total data and Non-Defaulters are 91.3%
- From the box plots, we found that there are many outliers in both numerical columns, AMT_CREDIT and AMT_INCOME_TOTAL
- Analysing the ORGANIZATION_TYPE column, we can conclude that most customers are from the Business Entity 3, Self Employed, Medicine and Government organization types
- We can infer from the first plot of Non-Defaulters that Married is a bit higher than the other FAMILY_STATUS values, while the rest have very similar counts
- We can clearly see that most of the Defaulters as well as Non-Defaulters have been registered for 1 to 20 years
- We can clearly see that most of the Defaulters as well as Non-Defaulters have been working for 1 to 10 years
- We can clearly infer that more customers take Cash loans as their contract type
- Most of the loan customers do not own a car, hence most of the Defaulters are non-car owners
- Most of the loan customers do not own realty, hence most of the Defaulters are non-realty owners
- One evident point is that the chances of getting a loan are higher for females in each income group
- We can derive that in all the credit groups, female customers outnumber male customers
- In all the age groups as well, female customers outnumber male customers
- Most of the Non-Defaulter customers are from the Medium income group with AGE_GROUP 31-40 and 41-50
- Most of the Defaulter customers are from the Medium income group with AGE_GROUP 31-40 and 20-30
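
The imbalance ratio and the percentage split are two views of the same counts. A minimal sketch on a toy `TARGET` column (23 invented rows, not the dataset) shows how both are derived.

```python
import pandas as pd

# Toy TARGET column: 21 non-defaulters (0) and 2 defaulters (1).
target = pd.Series([0] * 21 + [1] * 2)

counts = target.value_counts()

# How many non-defaulters there are for every defaulter.
imbalance_ratio = counts[0] / counts[1]

# Percentage share of each class.
share = (counts / counts.sum() * 100).round(2)

print(imbalance_ratio)
print(share)
```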

## Below are the major steps we have done during the analysis:

- [X] There were a total of 307K rows and 122 columns in the application data
- [X] Around 64 columns had more than 50% of their records as null, hence we dropped these columns
- [X] Furthermore, we dropped some more columns which do not look very important to our analysis
- [X] We were then left with 29 columns, on which the further analysis has been done
- [X] Going further, we found 55374 rows with data as 'XNA' in DAYS_EMPLOYED and ORGANIZATION_TYPE, hence we dropped these rows as they were around 18% of the total data and would impact the analysis
- [X] We have done bucketing for some of the columns and created further columns: AGE_GROUP, INCOME_GROUP, CREDIT_GROUP, WORKING_AGE_GROUP, REGISTRATION_SINCE_GROUP
- [X] We then divided the app_data set in two parts: 1- target1 (clients with payment difficulties), 2- target0 (the rest)
- [X] We found the imbalance percentage for the target column, which came out to be 10.55
- [X] Defaulters are 8.7% of the total data and Non-Defaulters are 91.3%
- [X] We then started our UNIVARIATE ANALYSIS
- [X] Created box plots for AMT_CREDIT and AMT_INCOME_TOTAL
- [X] Created count plots for UNIVARIATE ANALYSIS of these columns: ORGANIZATION_TYPE, NAME_FAMILY_STATUS, REGISTRATION_SINCE_GROUP, WORKING_AGE_GROUP, NAME_CONTRACT_TYPE, FLAG_OWN_CAR, FLAG_OWN_REALTY
- [X] We then started our BIVARIATE ANALYSIS
- [X] Created count plots for BIVARIATE ANALYSIS of these columns: INCOME_GROUP with CODE_GENDER, CREDIT_GROUP with CODE_GENDER, AGE_GROUP with CODE_GENDER, NAME_EDUCATION_TYPE with CODE_GENDER, INCOME_GROUP with AGE_GROUP
- [X] We then did some Multivariate Analysis with heatmaps
- [X] The correlations between these columns were analysed: AMT_INCOME_TOTAL, AMT_CREDIT, AGE, WORKING_SINCE, REGISTRATION_SINCE
- [X] The heatmaps depict these correlations

#### Note - All the plots are depicted for both the Defaulter and Non-Defaulter data sets
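
The bucketing step mentioned above can be sketched with `pd.cut`. The bin edges and labels here are assumptions inferred from the age groups named in this notebook (20-30, 31-40 and so on); the actual edges used earlier may differ.

```python
import pandas as pd

# Hypothetical ages; the bin edges and labels are assumptions for illustration.
age = pd.Series([22, 30, 31, 45, 58, 66])
bins = [20, 30, 40, 50, 60, 70]
labels = ['20-30', '31-40', '41-50', '51-60', '61-70']

# Intervals are right-closed, so 30 falls in '20-30'; include_lowest keeps age 20 in the first bin.
age_group = pd.cut(age, bins=bins, labels=labels, include_lowest=True)
print(age_group.tolist())
```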


## CONCLUSION FOR THE CREDIT EDA CASE STUDY.

1. The bank should approve more loans for the housing types House/Apartment, Office Apartment, or With parents, as they have fewer payment difficulties.
2. The bank can focus more on 'Working' females, as they submitted most of the applications, and less on Pensioners and the 61-70 age range.
3. The bank should also provide more loans to 'Business Entity Type 3' and 'Self Employed' customers.