# Bank Loan Exploratory Data Analysis
****
By: Santh Raul and Ramlal Naik

# I. Problem Statement:
* If the applicant is likely to repay the loan, then not approving the loan results in a loss of business to the company
* If the applicant is not likely to repay the loan, i.e. he/she is likely to default, then approving the loan may lead to a financial loss for the company.

The company wants to understand the driving factors (or driver variables) behind loan default, i.e. the variables that are strong indicators of default.


###  II. Import Libraries and set required parameters

In [None]:
# Import libraries
import numpy as np
print('numpy version\t:', np.__version__)
import pandas as pd
print('pandas version\t:', pd.__version__)
import matplotlib.pyplot as plt
import seaborn as sns
print('seaborn version\t:', sns.__version__)
from scipy import stats

import os

pd.set_option('display.max_columns', 200) # display up to 200 columns
pd.set_option('display.max_rows',150) # display up to 150 rows of a DataFrame or Series
pd.options.display.float_format = '{:.4f}'.format # show floats in fixed-point notation, e.g. 4.225108e+11 as 422510842796.0000

import warnings
warnings.filterwarnings('ignore') # suppress all warnings, e.g. those caused by version mismatches

import random

###  1. Data Importing

In [None]:
# # Sample data to overcome Memory Error
# # Less RAM: Reduce the data: It's completely fine to take a sample of the data to work on this case study
# # Random Sampling to get a random sample of data from the complete data
# filename = "application_data.csv"# This file is available in the same location as the jupyter notebook

# # Count the number of rows in my file
# num_lines = sum(1 for i in open(filename))
# # The number of rows that I wanted to load
# size = num_lines//2

# # Create a random indices between these two numbers

# random.seed(10)
# skip_id = random.sample(range(1, num_lines), num_lines-size)

# df_app = pd.read_csv(filename, skiprows = skip_id)
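
The commented-out sampling idea above can be tried on a small in-memory file first. The following is a minimal sketch, assuming a toy CSV (`csv_text` is made up for illustration and stands in for `application_data.csv`); it keeps the header and skips a random half of the data rows.

```python
import io
import random

import pandas as pd

# Toy CSV standing in for application_data.csv (hypothetical contents).
csv_text = "SK_ID_CURR,TARGET\n" + "\n".join(f"{i},{i % 2}" for i in range(100, 120))

# Count all lines, then derive the number of data rows (header excluded).
num_lines = sum(1 for _ in io.StringIO(csv_text))
n_rows = num_lines - 1
keep = n_rows // 2

# Choose line indices to skip; index 0 is the header, so sampling starts at 1.
random.seed(10)
skip_id = random.sample(range(1, num_lines), n_rows - keep)

sample_df = pd.read_csv(io.StringIO(csv_text), skiprows=skip_id)
print(sample_df.shape)
```

Seeding the generator makes the sample reproducible between runs.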

In [None]:
# read data
df_app = pd.read_csv('../input/credit-card/application_data.csv')

Get a first overview of the data

In [None]:
# get shape of data (rows, columns)
print(df_app.shape)

In [None]:
df_app.dtypes.value_counts()

In [None]:
# get some insights of data
df_app.head()

In [None]:
df_app.info()

In [None]:
# get the count, size and Unique value in each column of application data
df_app.agg(['count','size','nunique'])

### 2. Data Quality Check and Missing Values

#### 2.a. Find the percentage of missing values of the columns

In [None]:
# function to get the percentage of null values per column
def column_wise_null_percentage(df):
    output = round(df.isnull().sum()/len(df.index)*100,2)
    return output

In [None]:
# get the percentage of missing values for all columns
NA_col = column_wise_null_percentage(df_app)
NA_col

In [None]:
# identify columns only with null values
NA_col = NA_col[NA_col>0]
NA_col
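
An equivalent way to get the same percentages is to take the mean of the boolean null mask. This is a minimal sketch on a made-up frame (`toy` is hypothetical), not a replacement for the helper above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with missing entries in two of three columns.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": ["x", "y", None, "z"],
    "c": [1, 2, 3, 4],
})

# The mean of a boolean mask is the fraction of True values per column.
null_pct = (toy.isnull().mean() * 100).round(2)

# Keep only the columns that actually have missing values, largest first.
null_pct = null_pct[null_pct > 0].sort_values(ascending=False)
print(null_pct)
```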

In [None]:
# graphical representation of columns having % null values
plt.figure(figsize= (20,4),dpi=300)
NA_col.plot(kind = 'bar')
plt.title('Columns having null values')
plt.ylabel('% null values')
plt.show()
# plt.savefig('filename.png', dpi=300)

#### 2.b. Identify and remove columns with high missing percentage (>50%)

In [None]:
# Get the column with null values more than 50%
NA_col_50 = NA_col[NA_col>50]
print("Number of columns having null value more than 50% :", len(NA_col_50.index))
print(NA_col_50)

* All columns whose missing-value percentage is above 50% are dropped from the DataFrame:

```
       'OWN_CAR_AGE', 'EXT_SOURCE_1', 'APARTMENTS_AVG', 'BASEMENTAREA_AVG',
       'YEARS_BUILD_AVG', 'COMMONAREA_AVG', 'ELEVATORS_AVG', 'ENTRANCES_AVG',
       'FLOORSMIN_AVG', 'LANDAREA_AVG', 'LIVINGAPARTMENTS_AVG',
       'LIVINGAREA_AVG', 'NONLIVINGAPARTMENTS_AVG', 'NONLIVINGAREA_AVG',
       'APARTMENTS_MODE', 'BASEMENTAREA_MODE', 'YEARS_BUILD_MODE',
       'COMMONAREA_MODE', 'ELEVATORS_MODE', 'ENTRANCES_MODE', 'FLOORSMIN_MODE',
       'LANDAREA_MODE', 'LIVINGAPARTMENTS_MODE', 'LIVINGAREA_MODE',
       'NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAREA_MODE', 'APARTMENTS_MEDI',
       'BASEMENTAREA_MEDI', 'YEARS_BUILD_MEDI', 'COMMONAREA_MEDI',
       'ELEVATORS_MEDI', 'ENTRANCES_MEDI', 'FLOORSMIN_MEDI', 'LANDAREA_MEDI',
       'LIVINGAPARTMENTS_MEDI', 'LIVINGAREA_MEDI', 'NONLIVINGAPARTMENTS_MEDI',
       'NONLIVINGAREA_MEDI', 'FONDKAPREMONT_MODE', 'HOUSETYPE_MODE',
       'WALLSMATERIAL_MODE'
```

In [None]:
# removed 41 columns having null percentage more than 50%.
df_app = df_app.drop(NA_col_50.index, axis =1)
df_app.shape

#### 2.c. Identify columns with few missing values (<15%)

In [None]:
# Get columns having <15% null values
NA_col_15 = NA_col[NA_col<15]
print("Number of columns having null value less than 15% :", len(NA_col_15.index))
print(NA_col_15)

In [None]:
NA_col_15.index

* The columns with fewer than 15% null values are:

> 'AMT_GOODS_PRICE', 'NAME_TYPE_SUITE', 'EXT_SOURCE_2','OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE',
       'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE','AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY',
       'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_MON','AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR'

* These columns should be imputed with suitable values, as explained in the following cells.

In [None]:
# summary statistics of the columns having <15% null values
df_app[NA_col_15.index].describe()

In [None]:
# count unique values in the columns having <15% null values
df_app[NA_col_15.index].nunique().sort_values(ascending=False)

* **Seven variables were selected for the imputation analysis.**
<br>Continuous variables:
``````
> 'EXT_SOURCE_2','AMT_GOODS_PRICE'
``````
Categorical variables:
`````````
> 'OBS_30_CNT_SOCIAL_CIRCLE','OBS_60_CNT_SOCIAL_CIRCLE','DEF_60_CNT_SOCIAL_CIRCLE','DEF_30_CNT_SOCIAL_CIRCLE','NAME_TYPE_SUITE'
`````````


##### Continuous variables:

In [None]:
# Box plot for continuous variables
plt.figure(figsize=(12,4))
sns.boxplot(x=df_app['EXT_SOURCE_2'])  # pass the data explicitly as x to get a horizontal box plot
plt.show()

In [None]:
plt.figure(figsize=(12,4))
sns.boxplot(x=df_app['AMT_GOODS_PRICE'])  # pass the data explicitly as x to get a horizontal box plot
plt.show()

Inference from the box plots:
* 'EXT_SOURCE_2' has no outliers, and there is no significant difference between its mean and median. However, the data look right-skewed, so missing values can be imputed with the median value: 0.565
* 'AMT_GOODS_PRICE' has a significant number of outliers, so missing values should be imputed with the median value: 450000
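
The reason for preferring the median when outliers are present can be seen on a few made-up numbers. This is a minimal sketch, assuming a toy column (the values and the `AMT_GOODS_PRICE_IMPUTED` column name are hypothetical).

```python
import numpy as np
import pandas as pd

# Hypothetical prices with one very large value and one missing entry.
toy = pd.DataFrame(
    {"AMT_GOODS_PRICE": [100000.0, 450000.0, np.nan, 900000.0, 4000000.0]}
)

# Both statistics ignore NaN by default.
median_price = toy["AMT_GOODS_PRICE"].median()
mean_price = toy["AMT_GOODS_PRICE"].mean()

# Fill the gap with the median, which the large value barely moves.
toy["AMT_GOODS_PRICE_IMPUTED"] = toy["AMT_GOODS_PRICE"].fillna(median_price)
print(median_price, mean_price)
```

On this toy data the single large value pulls the mean well above the median, which is why the median is the more robust fill value for a column with outliers.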


##### Categorical variables:

In [None]:
# identify the most frequent value (mode) of each column
print('Maximum frequency categorical values are,')
print('NAME_TYPE_SUITE: ',df_app['NAME_TYPE_SUITE'].mode()[0])
print('OBS_30_CNT_SOCIAL_CIRCLE:', df_app['OBS_30_CNT_SOCIAL_CIRCLE'].mode()[0])
print('DEF_30_CNT_SOCIAL_CIRCLE:', df_app['DEF_30_CNT_SOCIAL_CIRCLE'].mode()[0])
print('OBS_60_CNT_SOCIAL_CIRCLE:', df_app['OBS_60_CNT_SOCIAL_CIRCLE'].mode()[0])
print('DEF_60_CNT_SOCIAL_CIRCLE:', df_app['DEF_60_CNT_SOCIAL_CIRCLE'].mode()[0])


For categorical variables, missing values should be imputed with the most frequent value (the mode).<br>
So the values to be imputed are:<br>
NAME_TYPE_SUITE:  Unaccompanied<br>
OBS_30_CNT_SOCIAL_CIRCLE: 0.0 <br>
DEF_30_CNT_SOCIAL_CIRCLE: 0.0<br>
OBS_60_CNT_SOCIAL_CIRCLE: 0.0<br>
DEF_60_CNT_SOCIAL_CIRCLE: 0.0<br>
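
Mode imputation for several columns can be done in one call by passing a dictionary of fill values. This is a minimal sketch on made-up rows (`toy` is hypothetical); the notebook itself only reports the modes and does not apply them.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with one missing value per column.
toy = pd.DataFrame({
    "NAME_TYPE_SUITE": ["Unaccompanied", "Family", None, "Unaccompanied"],
    "OBS_30_CNT_SOCIAL_CIRCLE": [0.0, 2.0, 0.0, np.nan],
})

# mode() returns a Series because ties are possible; take the first entry.
modes = {col: toy[col].mode()[0] for col in toy.columns}

# fillna accepts a {column: value} mapping and fills each column separately.
imputed = toy.fillna(value=modes)
print(modes)
```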


In [None]:
# Remove unwanted columns from application dataset for better analysis.
# Each column is listed once; the original list repeated a few names.

unwanted=['FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE','FLAG_PHONE', 'FLAG_EMAIL',
          'REGION_RATING_CLIENT','REGION_RATING_CLIENT_W_CITY','CNT_FAM_MEMBERS',
          'FLAG_DOCUMENT_2', 'FLAG_DOCUMENT_3','FLAG_DOCUMENT_4',
          'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_6','FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_8', 'FLAG_DOCUMENT_9','FLAG_DOCUMENT_10',
          'FLAG_DOCUMENT_11','FLAG_DOCUMENT_12','FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_14', 'FLAG_DOCUMENT_15',
          'FLAG_DOCUMENT_16', 'FLAG_DOCUMENT_17', 'FLAG_DOCUMENT_18','FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20',
          'FLAG_DOCUMENT_21','EXT_SOURCE_2','EXT_SOURCE_3','YEARS_BEGINEXPLUATATION_AVG','FLOORSMAX_AVG','YEARS_BEGINEXPLUATATION_MODE',
          'FLOORSMAX_MODE','YEARS_BEGINEXPLUATATION_MEDI','FLOORSMAX_MEDI','TOTALAREA_MODE','EMERGENCYSTATE_MODE']

df_app.drop(labels=unwanted,axis=1,inplace=True)

In [None]:
df_app.shape

In [None]:
df_app.head()

Some columns contain the value 'XNA', which means 'Not Available'. We therefore need to find which columns contain it and how many rows are affected.

In [None]:
# For Code Gender column

print('CODE_GENDER: ',df_app['CODE_GENDER'].unique())
print('No of values: ',df_app[df_app['CODE_GENDER']=='XNA'].shape[0])

XNA_count = df_app[df_app['CODE_GENDER']=='XNA'].shape[0]
per_XNA = round(XNA_count/len(df_app.index)*100,3)

print('% of XNA Values:',  per_XNA)

print('maximum frequency data :', df_app['CODE_GENDER'].describe().top)

Since 'F' (female) is the majority category and only 2 rows have XNA values, we could impute those rows with 'F' without any noticeable impact on the dataset. Dropping those rows has no noticeable impact either, and that is what the next cell does.

In [None]:
# Drop the rows that have 'XNA' in column 'CODE_GENDER'

df_app = df_app.drop(df_app.loc[df_app['CODE_GENDER']=='XNA'].index)
df_app[df_app['CODE_GENDER']=='XNA'].shape

In [None]:
# For Organization column
print('No of XNA values: ', df_app[df_app['ORGANIZATION_TYPE']=='XNA'].shape[0])

XNA_count = df_app[df_app['ORGANIZATION_TYPE']=='XNA'].shape[0]
per_XNA = round(XNA_count/len(df_app.index)*100,3)

print('% of XNA Values:',  per_XNA)

df_app['ORGANIZATION_TYPE'].describe()


For column 'ORGANIZATION_TYPE', 27737 of the 153755 rows have the value 'XNA', which is about 18% of the column.
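
Because 'XNA' is a string, `isnull()` does not count it as missing. One option is to convert the placeholder to a real missing value so the usual null checks pick it up. This is a minimal sketch on made-up rows (`toy` is hypothetical), not a step the notebook performs.

```python
import numpy as np
import pandas as pd

# Hypothetical rows containing the 'XNA' placeholder.
toy = pd.DataFrame({
    "CODE_GENDER": ["F", "M", "XNA", "F"],
    "ORGANIZATION_TYPE": ["XNA", "School", "XNA", "Bank"],
})

# Share of 'XNA' placeholders per column, in percent.
xna_pct = (toy == "XNA").mean() * 100

# Replace the placeholder with NaN so isnull() and fillna() can handle it.
cleaned = toy.replace("XNA", np.nan)
print(xna_pct)
print(cleaned.isnull().sum())
```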

In [None]:
# # Dropping the rows have 'XNA' values in the organization type column

# df_app = df_app.drop(df_app.loc[df_app['ORGANIZATION_TYPE']=='XNA'].index)
# df_app[df_app['ORGANIZATION_TYPE']=='XNA'].shape

#### 2.d. Check the data types of all columns and change them where needed

In [None]:
df_app.head()

In [None]:
# Casting variable into numeric in the dataset

numeric_columns=['TARGET','CNT_CHILDREN','AMT_INCOME_TOTAL','AMT_CREDIT','AMT_ANNUITY','REGION_POPULATION_RELATIVE',
                 'DAYS_BIRTH','DAYS_EMPLOYED','DAYS_REGISTRATION','DAYS_ID_PUBLISH','HOUR_APPR_PROCESS_START',
                 'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY','REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY',
                'DAYS_LAST_PHONE_CHANGE']

df_app[numeric_columns]=df_app[numeric_columns].apply(pd.to_numeric)
df_app.head(5)


The following age/day-count columns hold negative values, which need to be converted to positive values.

```
'DAYS_BIRTH','DAYS_EMPLOYED','DAYS_REGISTRATION','DAYS_ID_PUBLISH','DAYS_LAST_PHONE_CHANGE',
```

In [None]:
# Convert negative values into positive values for all day-count columns at once
days_columns = ['DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH', 'DAYS_LAST_PHONE_CHANGE']
df_app[days_columns] = df_app[days_columns].abs()

#### 2.e Checking the outlier for numerical variables:

In [None]:
# describe numeric columns
df_app[numeric_columns].describe()

In [None]:
# Box plot for selected columns
features = ['CNT_CHILDREN', 'AMT_INCOME_TOTAL','AMT_CREDIT','AMT_ANNUITY','DAYS_EMPLOYED', 'DAYS_REGISTRATION']

plt.figure(figsize = (20, 15), dpi=300)
for i in enumerate(features):
    plt.subplot(3, 2, i[0]+1)
    sns.boxplot(x = i[1], data = df_app)
plt.show()

From the above box plots and the describe() output, the following numeric columns have outliers:
~~~~~~~~~
CNT_CHILDREN, AMT_INCOME_TOTAL, AMT_CREDIT, AMT_ANNUITY, DAYS_EMPLOYED, DAYS_REGISTRATION
~~~~~~~~~

* The first quartile is almost missing for CNT_CHILDREN, which means most of the data lie in the first quartile.

* AMT_INCOME_TOTAL has a single very high data point as an outlier; removing this point will drastically change the box plot for further analysis.

* The first quartile is slim compared to the third quartile for AMT_CREDIT, AMT_ANNUITY, DAYS_EMPLOYED and DAYS_REGISTRATION. This means the data are skewed towards the first quartile.
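
The points a box plot draws as outliers follow the usual 1.5 × IQR rule, which can also be computed directly. This is a minimal sketch, assuming made-up income values; `iqr_outlier_mask` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

def iqr_outlier_mask(series, k=1.5):
    """Return a boolean mask marking values outside the k * IQR fences."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Hypothetical incomes with one extreme value.
income = pd.Series(
    [90000, 112500, 135000, 157500, 180000, 202500, 117000000],
    name="AMT_INCOME_TOTAL",
)

mask = iqr_outlier_mask(income)
print(income[mask])
```

Counting `mask.sum()` per column gives a numeric complement to reading the box plots by eye.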

#### 2.f. Bin Creation

Creating bins for the continuous variables 'AMT_INCOME_TOTAL' and 'AMT_CREDIT'

In [None]:
bins = [0,100000,200000,300000,400000,500000,10000000000]
slot = ['<100000', '100000-200000','200000-300000','300000-400000','400000-500000', '500000 and above']

df_app['AMT_INCOME_RANGE']=pd.cut(df_app['AMT_INCOME_TOTAL'],bins,labels=slot)

In [None]:
bins = [0,100000,200000,300000,400000,500000,600000,700000,800000,900000,10000000000]
slot = ['<100000', '100000-200000','200000-300000','300000-400000','400000-500000', '500000-600000',
        '600000-700000','700000-800000','800000-900000','900000 and above']

df_app['AMT_CREDIT_RANGE']=pd.cut(df_app['AMT_CREDIT'],bins,labels=slot)
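
Hand-typed labels can drift away from the bin edges. One way to avoid that is to generate the labels from the edges themselves. This is a minimal sketch with made-up edges and amounts (all values are hypothetical); note that `pd.cut` bins are right-inclusive by default, so 100000 falls into the first bin.

```python
import numpy as np
import pandas as pd

# Hypothetical bin edges; np.inf keeps the last bin open-ended.
edges = [0, 100000, 200000, 300000, np.inf]

# Build labels from the edges so text and boundaries always agree.
labels = [f"{lo}-{hi}" for lo, hi in zip(edges[:-2], edges[1:-1])]
labels.append(f"{edges[-2]} and above")

amounts = pd.Series([50000, 100000, 100001, 250000, 900000])
binned = pd.cut(amounts, bins=edges, labels=labels)
print(binned.tolist())
```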

### 3. Analysis:

In [None]:
# Dividing the dataset into two dataset of  target=1(client with payment difficulties) and target=0(all other)

target0_df=df_app.loc[df_app["TARGET"]==0]
target1_df=df_app.loc[df_app["TARGET"]==1]

In [None]:
# insights from number of target values

percentage_defaulters= round(100*len(target1_df)/(len(target0_df)+len(target1_df)),2)

percentage_nondefaulters=round(100*len(target0_df)/(len(target0_df)+len(target1_df)),2)

print('Count of target0_df:', len(target0_df))
print('Count of target1_df:', len(target1_df))


print('Percentage of people who repaid their loan: ', percentage_nondefaulters, '%' )
print('Percentage of people who did not repay their loan: ', percentage_defaulters, '%' )

In [None]:
# Calculating the imbalance ratio (majority count / minority count)

# The majority class is target0 and the minority class is target1

imb_ratio = round(len(target0_df)/len(target1_df),2)

print('Imbalance Ratio:', imb_ratio)

The Imbalance ratio is 11.48
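
The same ratio and class shares can be read straight from `value_counts` without splitting the frame first. This is a minimal sketch on a made-up target column (the 23:2 split is hypothetical and chosen only for easy arithmetic).

```python
import pandas as pd

# Hypothetical target column: 23 non-defaulters and 2 defaulters.
target = pd.Series([0] * 23 + [1] * 2, name="TARGET")

# Absolute counts per class, then majority / minority.
counts = target.value_counts()
imb_ratio = round(counts.loc[0] / counts.loc[1], 2)

# Class shares in percent.
share_pct = (target.value_counts(normalize=True) * 100).round(2)
print(imb_ratio)
print(share_pct)
```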

#### 3.a Univariate analysis

Categorical univariate analysis on a logarithmic scale, split by TARGET (0 = no payment difficulties, 1 = payment difficulties)

In [None]:
# Count plotting in logarithmic scale

def uniplot(df,col,title,hue =None):
    
    sns.set_style('whitegrid')
    sns.set_context('talk')
    plt.rcParams["axes.labelsize"] = 14
    plt.rcParams['axes.titlesize'] = 16
    plt.rcParams['axes.titlepad'] = 14
    
    
    temp = pd.Series(data = hue)
    fig, ax = plt.subplots()
    width = len(df[col].unique()) + 7 + 4*len(temp.unique())
    fig.set_size_inches(width , 8)
    plt.xticks(rotation=45)
    plt.yscale('log')
    plt.title(title)
    ax = sns.countplot(data = df, x= col, order=df[col].value_counts().index,hue = hue) 
        
    plt.show()

In [None]:
# Categorical univariate analysis on a logarithmic scale

features = ['AMT_INCOME_RANGE', 'AMT_CREDIT_RANGE','NAME_INCOME_TYPE','NAME_CONTRACT_TYPE']
plt.figure(figsize = (20, 15))

for i in enumerate(features):
    plt.subplot(2, 2, i[0]+1)
    plt.subplots_adjust(hspace=0.5)
    sns.countplot(x = i[1], hue = 'TARGET', data = df_app)
    
    plt.rcParams['axes.titlesize'] = 16
    
    plt.xticks(rotation = 45)
    plt.yscale('log')
    

##### Insights:

> AMT_INCOME_RANGE:
    * People with an income of 100000-200000 hold the highest number of loans and also the highest number of defaulters.
    * The income segment above 500000 has fewer defaulters.

> AMT_CREDIT_RANGE:
    * Loans below 100000 have fewer defaulters.
    * Loans above 100000 show an almost equal percentage of defaulters across the ranges.

> NAME_INCOME_TYPE:
    * Students, pensioners and businessmen have a higher percentage of loan repayment.
    * Working, State servant and Commercial associate have a higher default percentage.
    * The Maternity category has significantly more problems with repayment.

> NAME_CONTRACT_TYPE:
    * Contract type ‘Cash loans’ has a higher number of credits than contract type ‘Revolving loans’.
    * The graphs above show that Revolving loans are a small share compared to Cash loans, but the percentage of non-payment for Revolving loans is comparatively high.
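
Statements about the percentage of defaulters are hard to read off a log-scaled count plot, so it can help to compute the rate per category directly. This is a minimal sketch on made-up rows (the data below is hypothetical and does not reproduce the notebook's results); it relies on TARGET being 0/1, so the group mean is the default rate.

```python
import pandas as pd

# Hypothetical rows; not taken from the real dataset.
toy = pd.DataFrame({
    "NAME_CONTRACT_TYPE": ["Cash loans"] * 6 + ["Revolving loans"] * 4,
    "TARGET": [1, 0, 0, 0, 0, 0, 1, 1, 0, 0],
})

# The mean of a 0/1 column is the share of ones, i.e. the default rate.
default_rate = (toy.groupby("NAME_CONTRACT_TYPE")["TARGET"].mean() * 100).round(2)
print(default_rate)
```

The same pattern applies to any of the categorical columns plotted above, for example `AMT_INCOME_RANGE` or `NAME_INCOME_TYPE`.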

In [None]:
# Categorical univariate analysis on a linear scale

features = ['CODE_GENDER','FLAG_OWN_CAR']
plt.figure(figsize = (20, 10))

for i in enumerate(features):
    plt.subplot(2, 2, i[0]+1)
    plt.subplots_adjust(hspace=0.5)
    sns.countplot(x = i[1], hue = 'TARGET', data = df_app)
     
    plt.rcParams['axes.titlesize'] = 16
    plt.xticks(rotation = 45)
#     plt.yscale('log')

##### Insights:
> CODE_GENDER:
    * The percentage of defaulters is higher among males than among females.


> FLAG_OWN_CAR:
    * People who own a car have a higher percentage of defaulters.


#### Univariate analysis: continuous variables

In [None]:
# Univariate analysis for continuous variables

features = ['AMT_ANNUITY','AMT_GOODS_PRICE','DAYS_BIRTH','DAYS_EMPLOYED','DAYS_LAST_PHONE_CHANGE','DAYS_ID_PUBLISH']
plt.figure(figsize = (15, 20))

for i in enumerate(features):
    plt.subplot(3, 2, i[0]+1)
    plt.subplots_adjust(hspace=0.5)
    sns.boxplot(x = 'TARGET', y = i[1], data = df_app)
    

Inference:
* DAYS_BIRTH: older applicants have a higher probability of repayment.
* Some outliers are observed in 'AMT_ANNUITY', 'AMT_GOODS_PRICE', 'DAYS_EMPLOYED' and 'DAYS_LAST_PHONE_CHANGE'.
* Fewer outliers are observed in 'DAYS_BIRTH' and 'DAYS_ID_PUBLISH'.
* The first quartile is smaller than the third quartile in 'AMT_ANNUITY', 'AMT_GOODS_PRICE' and 'DAYS_LAST_PHONE_CHANGE'.
* DAYS_ID_PUBLISH: people who changed their ID recently are relatively more prone to default.
* DAYS_EMPLOYED has a single very high data point as an outlier. Removing this point will drastically change the box plot for further analysis.

#### 3.b. Bivariate analysis for numerical variables

**For Target 0**

In [None]:
# Box plotting for Credit amount

plt.figure(figsize=(16,12))
plt.xticks(rotation=45)
sns.boxplot(data =target0_df, x='NAME_EDUCATION_TYPE',y='AMT_CREDIT', hue ='NAME_FAMILY_STATUS',orient='v')
plt.title('Credit amount vs Education Status')
plt.show()

* Applicants with an academic degree and family status 'civil marriage', 'marriage' or 'separated' have higher credit amounts than others.
* Higher education with family status 'marriage', 'single' and 'civil marriage' has more outliers. For civil marriage with an academic degree, most of the credits lie in the third quartile.

In [None]:
# Box plotting for Income amount in logarithmic scale

plt.figure(figsize=(16,12))
plt.xticks(rotation=45)
plt.yscale('log')
sns.boxplot(data =target0_df, x='NAME_EDUCATION_TYPE',y='AMT_INCOME_TOTAL', hue ='NAME_FAMILY_STATUS',orient='v')
plt.title('Income amount vs Education Status')
plt.show()

* For education type 'Higher education', the income amount is mostly equal across family statuses, and it contains many outliers.
* Academic degree has fewer outliers, but its income amount is a little higher than that of Higher education.
* Lower secondary with civil marriage family status has a lower income amount than others.

**For Target 1**

In [None]:
# Box plotting for credit amount

plt.figure(figsize=(15,10))
plt.xticks(rotation=45)
sns.boxplot(data =target1_df, x='NAME_EDUCATION_TYPE',y='AMT_CREDIT', hue ='NAME_FAMILY_STATUS',orient='v')  # use the target=1 subset in this section
plt.title('Credit Amount vs Education Status')
plt.show()

* Observations are quite similar to Target 0.
* Applicants with an academic degree and family status 'civil marriage', 'marriage' or 'separated' have higher credit amounts than others.
* Most of the outliers are from education types 'Higher education' and 'Secondary'.
* For civil marriage with an academic degree, most of the credits lie in the third quartile.

In [None]:
# Box plotting for Income amount in logarithmic scale

plt.figure(figsize=(16,12))
plt.xticks(rotation=45)
plt.yscale('log')
sns.boxplot(data =target1_df, x='NAME_EDUCATION_TYPE',y='AMT_INCOME_TOTAL', hue ='NAME_FAMILY_STATUS',orient='v')  # use the target=1 subset in this section
plt.title('Income amount vs Education Status')
plt.show()

* There is also some similarity with Target 0.
* For education type 'Higher education', the income amount is mostly equal across family statuses.
* Academic degree has fewer outliers, but its income amount is a little higher than that of Higher education.
* Lower secondary has a lower income amount than others.

#### 3.c. Correlation:

Getting top 10 correlation between variables

In [None]:
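# Top 10 correlated variables: target 0 dataframe

corr = target0_df.select_dtypes(include='number').corr()  # correlate numeric columns only
corrdf = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))  # builtin bool; the np.bool alias is missing in some NumPy releases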
corrdf = corrdf.unstack().reset_index()
corrdf.columns = ['Var1', 'Var2', 'Correlation']
corrdf.dropna(subset = ['Correlation'], inplace = True)
corrdf['Correlation'] = round(corrdf['Correlation'], 2)
corrdf['Correlation'] = abs(corrdf['Correlation'])
corrdf.sort_values(by = 'Correlation', ascending = False).head(10)

In [None]:
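# Top 10 correlated variables: target 1 dataframe

corr = target1_df.select_dtypes(include='number').corr()  # correlate numeric columns only
corrdf = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))  # builtin bool; the np.bool alias is missing in some NumPy releases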
corrdf = corrdf.unstack().reset_index()
corrdf.columns = ['Var1', 'Var2', 'Correlation']
corrdf.dropna(subset = ['Correlation'], inplace = True)
corrdf['Correlation'] = round(corrdf['Correlation'], 2)
corrdf['Correlation'] = abs(corrdf['Correlation'])
corrdf.sort_values(by = 'Correlation', ascending = False).head(10)


* From the above correlation analysis, it is inferred that the highest correlation (1.0) is between OBS_60_CNT_SOCIAL_CIRCLE and OBS_30_CNT_SOCIAL_CIRCLE and between FLOORSMAX_MEDI and FLOORSMAX_AVG, which is the same for both datasets.
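
Since the same steps are run once per target subset, they can be wrapped in a small helper. This is a minimal sketch, assuming a made-up frame; `top_correlations` is a hypothetical function name, and the upper-triangle mask is what keeps each pair from being listed twice.

```python
import numpy as np
import pandas as pd

def top_correlations(df, n=10):
    """Return the n variable pairs with the largest absolute correlation."""
    corr = df.select_dtypes(include="number").corr().abs()
    # Keep only entries above the diagonal so each pair appears once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = (
        upper.stack()
        .dropna()
        .rename_axis(["Var1", "Var2"])
        .reset_index(name="Correlation")
    )
    return pairs.sort_values("Correlation", ascending=False).head(n).reset_index(drop=True)

# Hypothetical data: b is exactly 2 * a, c is loosely negatively related.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [5, 3, 4, 1, 2],
    "label": list("vwxyz"),
})

top = top_correlations(toy, n=2)
print(top)
```

Calling it as `top_correlations(target0_df)` and `top_correlations(target1_df)` would avoid repeating the cell body for each subset.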

### 4. Read previous application data and merge it with application data

In [None]:
# Reading the dataset of previous application

df_prev=pd.read_csv('../input/credit-card/previous_application.csv')

In [None]:
#explore the dataset
df_prev.columns

In [None]:
# get shape of data (rows, columns)
df_prev.shape

In [None]:
# get the type of dataset
df_prev.dtypes

In [None]:
# displaying the information of previous application dataset
df_prev.info()

In [None]:
# Describing the previous application dataset
df_prev.describe()

In [None]:
# Finding percentage of null values columns
NA_col_pre = column_wise_null_percentage(df_prev)

In [None]:
# identify columns only with null values
NA_col_pre = NA_col_pre[NA_col_pre>0]
NA_col_pre

In [None]:
# graphical representation of columns having % null values
plt.figure(figsize= (20,4),dpi=300)
NA_col_pre.plot(kind = 'bar')
plt.title('Columns having null values')
plt.ylabel('% null values')
plt.show()

In [None]:
# Get the column with null values more than 50%
NA_col_pre = NA_col_pre[NA_col_pre>50]
print("Number of columns having null value more than 50% :", len(NA_col_pre.index))
print(NA_col_pre)

* All columns whose missing-value percentage is above 50% are dropped from the DataFrame:
``````    
    'AMT_DOWN_PAYMENT', 'RATE_DOWN_PAYMENT', 'RATE_INTEREST_PRIMARY','RATE_INTEREST_PRIVILEGED'
``````

In [None]:
# removed 4 columns having null percentage more than 50%.
df_prev = df_prev.drop(NA_col_pre.index, axis =1)
df_prev.shape

In [None]:
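# Merging the application dataset with the previous application dataset
# suffixes must be a 2-tuple: ('_', 'x') appends '_' to overlapping application columns
# and 'x' to overlapping previous-application columns, matching the rename in the next cell

df_comb = pd.merge(left=df_app,right=df_prev,how='inner',on='SK_ID_CURR',suffixes=('_', 'x'))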
df_comb.shape

In [None]:
df_comb.head()

In [None]:
# Renaming the column names after merging from combined df

df_comb = df_comb.rename({'NAME_CONTRACT_TYPE_' : 'NAME_CONTRACT_TYPE','AMT_CREDIT_':'AMT_CREDIT','AMT_ANNUITY_':'AMT_ANNUITY',
                         'WEEKDAY_APPR_PROCESS_START_' : 'WEEKDAY_APPR_PROCESS_START',
                         'HOUR_APPR_PROCESS_START_':'HOUR_APPR_PROCESS_START','NAME_CONTRACT_TYPEx':'NAME_CONTRACT_TYPE_PREV',
                         'AMT_CREDITx':'AMT_CREDIT_PREV','AMT_ANNUITYx':'AMT_ANNUITY_PREV',
                         'WEEKDAY_APPR_PROCESS_STARTx':'WEEKDAY_APPR_PROCESS_START_PREV',
                         'HOUR_APPR_PROCESS_STARTx':'HOUR_APPR_PROCESS_START_PREV'}, axis=1)
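
An alternative to renaming after the merge is to pass descriptive suffixes to `pd.merge` directly. This is a minimal sketch on made-up frames (`app` and `prev` are hypothetical stand-ins for the two datasets); an empty left suffix keeps the application column names unchanged.

```python
import pandas as pd

# Hypothetical stand-ins for the application and previous-application data.
app = pd.DataFrame({"SK_ID_CURR": [1, 2], "AMT_CREDIT": [500.0, 800.0]})
prev = pd.DataFrame({"SK_ID_CURR": [1, 1, 3], "AMT_CREDIT": [100.0, 200.0, 300.0]})

# Overlapping columns from the right frame get '_PREV'; left names stay as-is.
merged = pd.merge(app, prev, how="inner", on="SK_ID_CURR", suffixes=("", "_PREV"))
print(merged)
```

With this form, every overlapping column gets a consistent name, including ones that are easy to miss in a manual rename mapping.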


In [None]:
# Removing unwanted columns from combined df for analysis

df_comb.drop(['SK_ID_CURR','WEEKDAY_APPR_PROCESS_START', 'HOUR_APPR_PROCESS_START','REG_REGION_NOT_LIVE_REGION', 
              'REG_REGION_NOT_WORK_REGION','LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY',
              'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY','WEEKDAY_APPR_PROCESS_START_PREV',
              'HOUR_APPR_PROCESS_START_PREV', 'FLAG_LAST_APPL_PER_CONTRACT','NFLAG_LAST_APPL_IN_DAY'],axis=1,inplace=True)

**Performing univariate analysis**

In [None]:
# Distribution of contract status in logarithmic scale

sns.set_style('whitegrid')
sns.set_context('talk')

plt.figure(figsize=(10,25),dpi = 300)
plt.rcParams["axes.labelsize"] = 20
plt.rcParams['axes.titlesize'] = 22
plt.rcParams['axes.titlepad'] = 30
plt.xticks(rotation=90)
plt.xscale('log')
plt.title('Distribution of contract status with purposes')
ax = sns.countplot(data = df_comb, y= 'NAME_CASH_LOAN_PURPOSE', 
                   order=df_comb['NAME_CASH_LOAN_PURPOSE'].value_counts().index,hue = 'NAME_CONTRACT_STATUS') 

Points to conclude from the above plot:

* Most loan rejections come from the purpose 'Repairs'.
* For education purposes, the numbers of approvals and rejections are equal.
* Paying other loans and buying a new car have significantly more rejections than approvals.
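
The approval and rejection shares per purpose can also be tabulated, which is easier to compare than bar lengths on a log axis. This is a minimal sketch on made-up rows (the data is hypothetical and does not reproduce the plot above).

```python
import pandas as pd

# Hypothetical rows; not taken from the real dataset.
toy = pd.DataFrame({
    "NAME_CASH_LOAN_PURPOSE": ["Repairs", "Repairs", "Repairs", "Education", "Education"],
    "NAME_CONTRACT_STATUS": ["Refused", "Refused", "Approved", "Approved", "Refused"],
})

# normalize="index" turns each row into shares that sum to 1 per purpose.
status_share = pd.crosstab(
    toy["NAME_CASH_LOAN_PURPOSE"],
    toy["NAME_CONTRACT_STATUS"],
    normalize="index",
) * 100
print(status_share.round(2))
```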

In [None]:
# Distribution of loan purposes by target in logarithmic scale

sns.set_style('whitegrid')
sns.set_context('talk')

plt.figure(figsize=(10,30),dpi = 300)
plt.rcParams["axes.labelsize"] = 20
plt.rcParams['axes.titlesize'] = 22
plt.rcParams['axes.titlepad'] = 30
plt.xticks(rotation=90)
plt.xscale('log')
plt.title('Distribution of purposes with target ')
ax = sns.countplot(data = df_comb, y= 'NAME_CASH_LOAN_PURPOSE', 
                   order=df_comb['NAME_CASH_LOAN_PURPOSE'].value_counts().index,hue = 'TARGET') 

A few points we can conclude from the above plot:

* Loans with purpose 'Repairs' face more difficulties in on-time payment.
* For a few purposes, on-time payment is significantly higher than payment difficulty: 'Buying a garage', 'Business development', 'Buying land', 'Buying a new car' and 'Education'. Hence we can focus on these purposes, for which clients have minimal payment difficulties.

**Bivariate analysis**

In [None]:
# Box plotting for Credit amount in logarithmic scale

plt.figure(figsize=(20,15),dpi = 300)
plt.xticks(rotation=90)
plt.yscale('log')
sns.boxplot(data =df_comb, x='NAME_CASH_LOAN_PURPOSE',hue='NAME_INCOME_TYPE',y='AMT_CREDIT_PREV',orient='v')
plt.title('Prev Credit amount vs Loan Purpose')
plt.show()

From the above we can conclude:

* The credit amount for loan purposes such as 'Buying a home', 'Buying a land', 'Buying a new car' and 'Building a house' is higher.
* Applicants with income type 'State servant' apply for a significant amount of credit.
* 'Money for a third person' and 'Hobby' have lower credit amounts applied for.

In [None]:
# Bar plot of previous credit amount vs housing type, split by target (linear scale)

plt.figure(figsize=(15,15),dpi = 150)
plt.xticks(rotation=90)
sns.barplot(data =df_comb, y='AMT_CREDIT_PREV',hue='TARGET',x='NAME_HOUSING_TYPE')
plt.title('Prev Credit amount vs Housing type')
plt.show()

For housing type, 'Office apartment' has the higher credit for target 0 and 'Co-op apartment' has the higher credit for target 1. So we can conclude that the bank should avoid giving loans to applicants with housing type 'Co-op apartment', as they have difficulties in payment. The bank can focus mostly on housing types 'With parents', 'House\apartment' or 'Municipal apartment' for successful payments.
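
By default `sns.barplot` draws the mean of the y variable per group, so the numbers behind the bars can be reproduced with a groupby. This is a minimal sketch on made-up rows (the data is hypothetical); showing the group size next to the mean is a suggestion, since a mean over very few rows is a weak basis for a recommendation.

```python
import pandas as pd

# Hypothetical rows; not taken from the real dataset.
toy = pd.DataFrame({
    "NAME_HOUSING_TYPE": ["Office apartment"] * 3 + ["Co-op apartment"] * 3,
    "TARGET": [0, 0, 1, 0, 1, 1],
    "AMT_CREDIT_PREV": [200000.0, 220000.0, 150000.0, 120000.0, 180000.0, 160000.0],
})

# Mean previous credit and number of rows per housing type and target.
summary = (
    toy.groupby(["NAME_HOUSING_TYPE", "TARGET"])["AMT_CREDIT_PREV"]
    .agg(["mean", "count"])
)
print(summary)
```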

# 5. Conclusion/Recommendation:

**1. Banks should focus more on income types ‘Student’, ‘Pensioner’ and ‘Businessman’, with housing types other than ‘Co-op apartment’, for successful payments.**

**2. Banks should focus less on income type ‘Working’, as it has the highest number of unsuccessful payments.**

**3. In loan purpose ‘Repairs’:**

> a. Despite the higher number of rejections for loan purpose 'Repairs', difficulties in paying on time are still observed.<br> 
b. There are a few cases where the delay in loan payment is significantly high.<br> 
c. Banks should continue to exercise caution when giving loans for this purpose.

**4. Banks should avoid giving loans to applicants with housing type ‘Co-op apartment’, as they have difficulties in payment.**

**5. Banks can focus mostly on housing types ‘With parents’, ‘House\apartment’ and ‘Municipal apartment’ for successful payments.**
