# Credit EDA Case Study

![](https://static.toiimg.com/photo/imgsize-115259,msid-62940208/62940208.jpg)

## Business Objective:
Lenders find it hard to extend loans to people with insufficient or non-existent credit history. Some consumers take advantage of this and default on their loans.


This case study aims to identify patterns that indicate whether a client is likely to have difficulty paying their installments. Such patterns can inform actions like denying the loan, reducing the loan amount, or lending to risky applicants at a higher interest rate, while ensuring that applicants capable of repaying the loan are not rejected. Identifying such applicants through EDA is the aim of this case study.

In [None]:
import warnings

warnings.filterwarnings('ignore')

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
#reading application_data.csv
app = pd.read_csv("../input/application-datacsv/application_data.csv")
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

In [None]:
#feel of the data
app.head()

## 1. Data Understanding and Data-Cleaning.

In [None]:
app.shape

In [None]:
app.info()

In [None]:
#checking datatypes
app.dtypes

#### 1.1 Handling Nulls - Removing unwanted columns

In [None]:
#checking percentage of null values in all columns
app_nulls = app.isnull().sum() * 100/len(app)

#Number of columns having null values
len(app_nulls[app_nulls.values>0])

In [None]:
app_nulls[app_nulls.values>0]

In [None]:
print('Least % of null values in the data set: ', min(app_nulls.values))
print('Most % of null values in the data set: ', max(app_nulls.values))

There are **67** columns in the dataset containing null values.

As a common rule of thumb, columns with more than 25-30% missing data can be dropped, with exceptions depending on the use case.

**Occupation Type** sits on the borderline with 31% null values, but we believe it plays a significant role in determining whether a person will default. Hence we drop only columns with more than 40% missing values (no column has a null percentage between 31% and 40%).

In [None]:
app_nulls = app_nulls[app_nulls.values > 40]
len(app_nulls)

A total of 49 columns have more than 40% null values; we treat them as insignificant to our analysis.

In [None]:
#Removing columns with more than 40 percent null values in the dataset
app.drop(app_nulls.index, axis=1, inplace = True)
app.shape
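The threshold rule used above can be sketched on a toy frame. The frame `toy`, the helper name `drop_sparse_columns` and its default threshold are illustrative assumptions, not part of the case-study data.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df, threshold_pct=40.0):
    """Return a copy of df without columns whose null percentage exceeds threshold_pct."""
    null_pct = df.isnull().mean() * 100          # share of nulls per column, in percent
    to_drop = null_pct[null_pct > threshold_pct].index
    return df.drop(columns=to_drop)

# Hypothetical data: column "b" is 60% null, column "c" is 20% null
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [1, np.nan, np.nan, np.nan, 5],
    "c": [1, np.nan, 3, 4, 5],
})
cleaned = drop_sparse_columns(toy)
print(cleaned.columns.tolist())
```

Returning a new frame instead of dropping in place keeps the original available for comparison, which can be handy while the threshold is still being tuned.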

In [None]:
app.isnull().sum()[app.isnull().sum()>0]

Of the columns that still contain nulls, all except AMT_ANNUITY, AMT_GOODS_PRICE, CNT_FAM_MEMBERS, OCCUPATION_TYPE and NAME_TYPE_SUITE seem insignificant. Hence we drop them.

In [None]:
app_nulls=list(app.isnull().sum()[app.isnull().sum()>600].index)
app_nulls

In [None]:
#keeping NAME_TYPE_SUITE and OCCUPATION_TYPE (removed from the drop list) and adding DAYS_LAST_PHONE_CHANGE to the drop list
app_nulls.remove('NAME_TYPE_SUITE')
app_nulls.remove('OCCUPATION_TYPE')
app_nulls.append('DAYS_LAST_PHONE_CHANGE')

In [None]:
app.drop(labels=app_nulls,axis=1,inplace=True)
app.isnull().sum()[app.isnull().sum()>0]

On further inspection, several columns hold only 0/1 or N/Y indicator values.
Most of these flag columns are of no significance to the rest of our analysis, so we drop them.

In [None]:
#Fetch all indicator FLAG columns
flag_col = app.filter(regex='^FLAG',axis=1).columns.tolist()

flag_col

Other than FLAG_OWN_CAR and FLAG_OWN_REALTY, the flag columns seem insignificant for further analysis. Hence we drop every column in flag_col except FLAG_OWN_CAR and FLAG_OWN_REALTY.

In [None]:
flag_col.remove('FLAG_OWN_CAR')
flag_col.remove('FLAG_OWN_REALTY')

In [None]:
#Delete the remaining indicator FLAG columns as they are not relevant to our analysis
app.drop(flag_col, axis = 1, inplace = True)

In [None]:
app.shape

#### 1.2 Handling Nulls - Filling in appropriate values for analysis

In [None]:
app.isnull().sum()[app.isnull().sum() > 0]

In [None]:
#calculating mean, median and mode for AMT_ANNUITY and AMT_GOODS_PRICE
print("AMT_ANNUITY")
print('Mean: ', app['AMT_ANNUITY'].mean())
print('Median: ', app['AMT_ANNUITY'].median())
print('Mode: ', app['AMT_ANNUITY'].mode())

print("----------------------------------")
print("AMT_GOODS_PRICE")
print('Mean: ', app['AMT_GOODS_PRICE'].mean())
print('Median: ', app['AMT_GOODS_PRICE'].median())
print('Mode: ', app['AMT_GOODS_PRICE'].mode())

In [None]:
app[app['AMT_ANNUITY'].isnull()].head()

Looking at AMT_CREDIT for the rows where AMT_ANNUITY is null, replacing the null values with the mode (9000.0) does not seem appropriate. Hence we replace them with the median (24903.0).

In [None]:
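#Assigning the result back instead of a chained in-place fill, which newer pandas versions may not apply to the frame
app['AMT_ANNUITY'] = app['AMT_ANNUITY'].fillna(app['AMT_ANNUITY'].median())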

Imputing AMT_GOODS_PRICE

In [None]:
app.AMT_GOODS_PRICE.value_counts()

In [None]:
#The median and mode are both 450000.0, which is already the most frequent value, so imputing with 450000.0
#again might distort the analysis. Hence imputing with the mean.
app['AMT_GOODS_PRICE'] = app['AMT_GOODS_PRICE'].fillna(app['AMT_GOODS_PRICE'].mean())

The column OCCUPATION_TYPE with 96391 null values needs to be imputed.

In [None]:
app.OCCUPATION_TYPE.value_counts()

Null values for OCCUPATION_TYPE are replaced by 'Not Specified'.

In [None]:
app['OCCUPATION_TYPE'] = app['OCCUPATION_TYPE'].fillna('Not Specified')

The column NAME_TYPE_SUITE needs to be imputed

In [None]:
app.NAME_TYPE_SUITE.value_counts()

In [None]:
app['NAME_TYPE_SUITE'] = app['NAME_TYPE_SUITE'].fillna('Unaccompanied')

Since CNT_FAM_MEMBERS cannot be a fraction, replacing null values by median. 

In [None]:
app['CNT_FAM_MEMBERS'] = app['CNT_FAM_MEMBERS'].fillna(app['CNT_FAM_MEMBERS'].median())
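As a minimal sketch of the imputation choices in this subsection, the toy frame below (hypothetical values and column names, chosen only for illustration) fills a skewed numeric column with its median and a categorical column with an explicit placeholder.

```python
import numpy as np
import pandas as pd

# Hypothetical data, not taken from application_data.csv
toy = pd.DataFrame({
    "annuity": [9000.0, 9000.0, 25000.0, 40000.0, np.nan],
    "occupation": ["Laborers", None, "Drivers", None, "Laborers"],
})

# Numeric column: the median is less sensitive to skew than the mean or the mode.
# Assigning the result back avoids relying on chained in-place modification.
toy["annuity"] = toy["annuity"].fillna(toy["annuity"].median())

# Categorical column: a placeholder keeps "missing" visible as its own category.
toy["occupation"] = toy["occupation"].fillna("Not Specified")

print(toy)
```

Keeping a separate placeholder category, rather than filling with the most frequent occupation, leaves open the question of whether missing occupations behave differently from the rest.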

#### 1.3 Correcting Datatypes

In [None]:
app.dtypes

In [None]:
app.head(1)

In [None]:
#CNT_FAM_MEMBERS is a count, so it should not be a float. Converting to integer
app['CNT_FAM_MEMBERS'] = app['CNT_FAM_MEMBERS'].astype(int)

In [None]:
#DAYS_EMPLOYED, DAYS_REGISTRATION, DAYS_ID_PUBLISH should be positive values. Converting to absolute values
app['DAYS_EMPLOYED'] = app['DAYS_EMPLOYED'].abs()
app['DAYS_REGISTRATION'] = app['DAYS_REGISTRATION'].abs()
app['DAYS_ID_PUBLISH'] = app['DAYS_ID_PUBLISH'].abs()

#### 1.4 Categorizing continuous variables into discrete intervals - Binning

The DAYS_BIRTH column can be used to derive the age of the customer.

In [None]:
#Dividing by -365.25 to include leap years
app['Age'] = app['DAYS_BIRTH'] //-365.25

In [None]:
app.drop('DAYS_BIRTH', axis = 1, inplace = True)

Binning Age

In [None]:
app.Age.describe()

In [None]:
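#Minimum age is 20 years and maximum is 69, so using 6 bin edges (20, 30, ..., 70), which gives 5 intervals.
#include_lowest=True keeps age 20 in the first interval; its label is then shown as (19.999, 30.0].
app['AgeGroup'] = pd.cut(app['Age'], bins=np.linspace(20, 70, 6), include_lowest=True)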

In [None]:
app['AgeGroup']=app['AgeGroup'].astype(str)
app['AgeGroup'].value_counts()
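A small sketch of how `pd.cut` treats bin edges may help when reading the age groups. The ages below are made up; only the edges mirror the `np.linspace(20, 70, 6)` call used in this section.

```python
import numpy as np
import pandas as pd

edges = np.linspace(20, 70, 6)              # 6 edges: 20, 30, 40, 50, 60, 70
ages = pd.Series([20.0, 25.0, 30.0, 69.0])  # hypothetical ages

# By default every interval is open on the left and closed on the right: (left, right]
default_bins = pd.cut(ages, bins=edges)

# include_lowest=True makes the first interval also contain the lowest edge
closed_bins = pd.cut(ages, bins=edges, include_lowest=True)

print(len(edges) - 1)                 # number of intervals produced by 6 edges
print(default_bins.isna().tolist())   # which ages fell outside every interval
print(closed_bins.isna().tolist())
```

With the default settings, a value exactly equal to the lowest edge is left unbinned, so it is worth checking for missing group labels after binning.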

AMT_INCOME_TOTAL Statistics

In [None]:
app.AMT_INCOME_TOTAL.describe()

There is a large gap between the 75th percentile and the maximum value, so we need to check for outliers before deciding how to handle this column.

### 1.5 Identifying and handling Outliers. 

To check for outliers, we first split the columns into two groups: numerical columns and categorical columns.

In [None]:
numeric_data=app.select_dtypes(include=np.number)
numeric_cols=numeric_data.columns

numeric_data.head(2)

In [None]:
categoric_data=app.select_dtypes(exclude=np.number)
categoric_cols=categoric_data.columns

categoric_data.head(2)

In [None]:
print('App data: ', app.shape)
print('Numeric data: ', numeric_data.shape)
print('Categoric data: ', categoric_data.shape)

No columns were missed during the split.

In [None]:
numeric_data.describe()

#### <FONT COLOR='BROWN'> CNT_CHILDREN</FONT>

Checking for outliers in numeric fields where there is a significant difference in the min and max values.

In [None]:
plt.figure(figsize = (15,2))
sns.boxplot(numeric_data.CNT_CHILDREN)
plt.show()

In [None]:
#The values beyond the 99th percentile are sparse and not continuous.
print(numeric_data.CNT_CHILDREN.quantile(0.99))
print('Number of rows with more than 3 children:', len(numeric_data[numeric_data.CNT_CHILDREN>3]))

In [None]:
#Among applicants with more than 3 children the values are widely spread, so we cap the column:
#every value above 4 is set to 5, read as "5 or more". Also, in India more than 4 children is rare.
app.loc[app['CNT_CHILDREN'] > 4, ['CNT_CHILDREN']] = 5

All rows with CNT_CHILDREN > 4 are now set to 5, which stands for "5 or more".

#### <FONT COLOR='BROWN'> AMT_INCOME_TOTAL</FONT>

In [None]:
plt.figure(figsize = (10,2))
sns.boxplot(app.AMT_INCOME_TOTAL)
plt.show()

In [None]:
app.AMT_INCOME_TOTAL.quantile(0.99)

In [None]:
app[app.AMT_INCOME_TOTAL>app.AMT_INCOME_TOTAL.quantile(0.99)].sort_values(by='AMT_INCOME_TOTAL', ascending=False).head(10)

In [None]:
#There is a large gap between the 99th percentile and the maximum value. Hence detecting outliers with the IQR rule (Q3 + 1.5 * IQR)
iqr=app.AMT_INCOME_TOTAL.quantile(0.75)-app.AMT_INCOME_TOTAL.quantile(0.25)
AMT_INCOME_TOTAL_OUTLIER=app.AMT_INCOME_TOTAL.quantile(0.75)+(iqr*1.5)
AMT_INCOME_TOTAL_OUTLIER

In [None]:
print(len(app[app.AMT_INCOME_TOTAL>AMT_INCOME_TOTAL_OUTLIER]))
print(len(app[app.AMT_INCOME_TOTAL>app.AMT_INCOME_TOTAL.quantile(0.99)]))

The IQR upper fence for AMT_INCOME_TOTAL is 337500 and the 99th percentile is 472500; 14035 and 3014 rows lie above these values respectively. Incomes from 337500.0 to 472500.0 are therefore still fairly normal, but anything above 472500.0 would bias the average income. Hence we replace values higher than 472500.0 with 472500.0 + 10000.0, which keeps the column continuous while making the capped rows distinctly identifiable. We use the 99th percentile rather than 337500.0 as the cut-off because it is the higher of the two.

In [None]:
app.loc[app['AMT_INCOME_TOTAL'] > 472500.0, ['AMT_INCOME_TOTAL']] = 472500.0+10000.0
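The outlier handling used for the amount columns follows one pattern: compute the IQR upper fence, compare it with a high percentile, then cap values above the chosen cut-off. A minimal sketch on hypothetical numbers, with helper names (`iqr_upper_fence`, `cap_above`) that are assumptions for illustration:

```python
import pandas as pd

def iqr_upper_fence(s, k=1.5):
    """Upper outlier fence: Q3 + k * (Q3 - Q1)."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    return q3 + k * (q3 - q1)

def cap_above(s, threshold, cap_value):
    """Replace every value strictly above threshold with cap_value; other values are kept."""
    return s.mask(s > threshold, cap_value)

# Hypothetical incomes with one extreme value
income = pd.Series([100.0, 120.0, 140.0, 160.0, 180.0, 5000.0])
fence = iqr_upper_fence(income)
capped = cap_above(income, threshold=fence, cap_value=fence)

print(fence)
print(capped.tolist())
```

Passing the cut-off and the replacement value separately allows the "threshold plus a fixed offset" variant used in this section, for example `cap_above(s, q99, q99 + 10000.0)`.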

In [None]:
plt.figure(figsize = (10,2))
sns.boxplot(app.AMT_INCOME_TOTAL)
plt.show()

In [None]:
#Binning AMT_INCOME_TOTAL into income range categories for ease of analysis
app['IncomeRange'] = pd.qcut(app['AMT_INCOME_TOTAL'], q=[0,0.25,0.50,0.90,1], labels=['Low','Average','High','Very High'])

In [None]:
app.IncomeRange.head()
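To see what the uneven quantile list means for group sizes, here is a small sketch on 100 distinct made-up values; the quantile edges and labels match the `pd.qcut` call in this section, everything else is hypothetical.

```python
import pandas as pd

labels = ["Low", "Average", "High", "Very High"]
values = pd.Series(range(1, 101), dtype=float)   # 1.0 .. 100.0, all distinct

# Quantile edges 0-25%, 25-50%, 50-90%, 90-100%
bands = pd.qcut(values, q=[0, 0.25, 0.50, 0.90, 1], labels=labels)

# Count how many values land in each band
counts = {label: int((bands == label).sum()) for label in labels}
print(counts)
```

The "High" band spans 40% of the data by construction, so it will always be the largest group. On real data with many repeated values, two quantile edges can coincide; `pd.qcut` then raises an error unless `duplicates='drop'` is passed.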

#### <FONT COLOR='BROWN'> AMT_CREDIT</FONT>

In [None]:
plt.figure(figsize = (10,2))
sns.boxplot(app.AMT_CREDIT)
plt.show()

In [None]:
#Calculating outlier
iqr=app.AMT_CREDIT.quantile(0.75)-app.AMT_CREDIT.quantile(0.25)
AMT_CREDIT_OUTLIER=app.AMT_CREDIT.quantile(0.75)+(iqr*1.5)
AMT_CREDIT_OUTLIER

The IQR upper fence is higher than the 75th percentile.

In [None]:
app.AMT_CREDIT.quantile(0.99)

The IQR upper fence is much lower than the 99th percentile, and the 99th percentile is much lower than the maximum value.
Hence we replace values above the 99th percentile with 1854000.0 + 10000.0.

In [None]:
app.loc[app['AMT_CREDIT'] > 1854000.0, ['AMT_CREDIT']] = 1854000.0+ 10000.0

In [None]:
plt.figure(figsize = (10,2))
sns.boxplot(app.AMT_CREDIT)
plt.show()

#### <font color='brown'> AMT_ANNUITY</FONT>

In [None]:
plt.figure(figsize = (10,2))
sns.boxplot(app.AMT_ANNUITY)
plt.show()

In [None]:
app.AMT_ANNUITY.quantile(0.99)

The 99th percentile is higher than the 75th percentile but much lower than the maximum value.

In [None]:
len(app[app.AMT_ANNUITY>app.AMT_ANNUITY.quantile(0.99)])

In [None]:
#Calculating outlier:
iqr=app.AMT_ANNUITY.quantile(0.75)-app.AMT_ANNUITY.quantile(0.25)
AMT_ANNUITY_OUTLIER=app.AMT_ANNUITY.quantile(0.75)+(iqr*1.5)
AMT_ANNUITY_OUTLIER

The IQR upper fence is lower than the 99th percentile and much higher than the 75th percentile. Hence we replace values higher than 70006.5 with 70006.5 + 1000.0.

In [None]:
app.loc[app['AMT_ANNUITY'] > 70006.5, ['AMT_ANNUITY']] = 70006.5+ 1000.0

In [None]:
plt.figure(figsize = (10,2))
sns.boxplot(app.AMT_ANNUITY)
plt.show()

In [None]:
len(app[app.AMT_ANNUITY>70000.6])

A total of 3081 rows have a value above the 99th percentile.

#### <font color='brown'> CNT_FAM_MEMBERS </font>

In [None]:
plt.figure(figsize = (15,2))
sns.boxplot(numeric_data.CNT_FAM_MEMBERS)
plt.show()

In [None]:
#The values beyond the 99th percentile are sparse and not continuous.
print(numeric_data.CNT_FAM_MEMBERS.quantile(0.99))
print('Number of rows with a value above the 99th percentile:', len(numeric_data[numeric_data.CNT_FAM_MEMBERS>numeric_data.CNT_FAM_MEMBERS.quantile(0.99)]))

In [None]:
max(numeric_data.CNT_FAM_MEMBERS)

In [None]:
numeric_data[numeric_data.CNT_FAM_MEMBERS>5].CNT_FAM_MEMBERS.quantile([0.25,0.50,0.75,0.99])

There is a huge difference between the 99th percentile of the family member count and the maximum.
Hence we replace values above 5 with 6 (among the rows above the 99th percentile, 6 is roughly the 75th percentile of the family size).

In [None]:
app.loc[app['CNT_FAM_MEMBERS'] > 5, ['CNT_FAM_MEMBERS']] = 6

In [None]:
plt.figure(figsize= (10,2))
sns.boxplot(app.CNT_FAM_MEMBERS)
plt.show()

#### <font color='brown'> AMT_GOODS_PRICE </FONT>

In [None]:
plt.figure(figsize= (10,2))
sns.boxplot(app.AMT_GOODS_PRICE)
plt.show()

In [None]:
numeric_data.AMT_GOODS_PRICE.quantile([0.25,0.50,0.75,0.99])

In [None]:
max(numeric_data.AMT_GOODS_PRICE)

There is a huge difference between the 75th percentile and the 99th percentile value.
Also there is a huge difference between the 99th percentile and the maximum value.

Calculating outliers

In [None]:
iqr=numeric_data.AMT_GOODS_PRICE.quantile(0.75)-numeric_data.AMT_GOODS_PRICE.quantile(0.25)
AMT_GOODS_PRICE_OUTLIER=numeric_data.AMT_GOODS_PRICE.quantile(0.75)+(1.5 * iqr)
AMT_GOODS_PRICE_OUTLIER

The IQR upper fence is lower than the 99th percentile. Hence we replace values above the 99th percentile with a uniform
value of 1800000.0 + 10000.0.

In [None]:
app.loc[app['AMT_GOODS_PRICE']>1800000.0,['AMT_GOODS_PRICE']]=1800000.0+10000.0

In [None]:
plt.figure(figsize= (10,2))
sns.boxplot(app.AMT_GOODS_PRICE)
plt.show()

#### <font color='brown'> DAYS_EMPLOYED </font>

In [None]:
plt.figure(figsize = (10,2))
sns.boxplot(numeric_data.DAYS_EMPLOYED)
plt.show()

In [None]:
numeric_data.DAYS_EMPLOYED.quantile([0.0,0.25,0.50,0.75,0.99])

In [None]:
max(numeric_data.DAYS_EMPLOYED)

There is a huge difference between 75th percentile and 99th percentile and the 99th percentile and the max values are the same.

In [None]:
iqr=numeric_data.DAYS_EMPLOYED.quantile(0.75)-numeric_data.DAYS_EMPLOYED.quantile(0.25)
DAYS_EMPLOYED_OUTLIER=numeric_data.DAYS_EMPLOYED.quantile(0.75)+(1.5 * iqr)
DAYS_EMPLOYED_OUTLIER

The IQR upper fence is much smaller than the 99th percentile/max value.

In [None]:
len(app[app.DAYS_EMPLOYED>12868])

Since 56357 rows have values above the IQR upper fence, they cannot be ignored.

In [None]:
app.DAYS_EMPLOYED[app.DAYS_EMPLOYED>12868].value_counts()

Among the value counts of DAYS_EMPLOYED above the calculated upper fence, the value 365243 appears. Divided by 365, it gives more than 1000, and employment for 1000 years is not possible. During univariate and bivariate analysis we can check whether applicants with payment difficulties tend to supply such implausible values or whether these are simply missing values. Hence we do not treat this outlier for now.
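The arithmetic behind this observation, plus one possible way to flag the value, is sketched below. Treating 365243 as a missing-value placeholder is an assumption made only for illustration; the case study itself leaves the column untreated at this stage. The sample series is hypothetical.

```python
import pandas as pd

SENTINEL = 365243                 # value seen in the value counts above
years_implied = SENTINEL / 365    # a little over 1000 years
print(years_implied)

# Hypothetical DAYS_EMPLOYED values (already converted to positive numbers)
days = pd.Series([1200, 365243, 4500, 365243])

is_placeholder = days.eq(SENTINEL)        # boolean flag, usable as its own feature
days_clean = days.mask(is_placeholder)    # flagged entries become NaN
years_employed = days_clean / 365.25

print(is_placeholder.sum(), years_employed.round(1).tolist())
```

Keeping the boolean flag alongside the cleaned column makes it possible to compare default rates between rows with and without the implausible value later on.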

In [None]:
categoric_data.head()

## 2. Data Analysis: Analysing Data Imbalance on Target Variable

In [None]:
#Since a few columns were dropped, again splitting our categoric and numeric columns in lists
numeric_data=app.select_dtypes(include=np.number)
numeric_cols=numeric_data.columns

categoric_data=app.select_dtypes(exclude=np.number)
categoric_cols=categoric_data.columns

In [None]:
class_values = (app['TARGET'].value_counts()/app['TARGET'].value_counts().sum())*100
print(class_values)

Only 8.07% of the records are cases where payment was not made on time, so the data is heavily imbalanced toward Target = 0.
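The class shares and an imbalance ratio can be computed directly with `value_counts(normalize=True)`. The sketch below uses a made-up target column with 8 positives out of 100, not the real data.

```python
import pandas as pd

# Hypothetical target: 92 on-time payers (0) and 8 applicants with difficulties (1)
target = pd.Series([0] * 92 + [1] * 8)

share = target.value_counts(normalize=True) * 100   # percentage per class
imbalance_ratio = share.loc[0] / share.loc[1]       # majority share / minority share

print(share.round(2).to_dict())
print(round(imbalance_ratio, 2))
```

A ratio well above 1 is a reminder that raw counts for the two groups should not be compared directly; proportions within each group are more informative.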

In [None]:
#Creating two dataframes for Target = 0 and Target = 1 for univariate and bivariate analysis.

app_T0 = app[app.TARGET == 0]
app_T1 = app[app.TARGET == 1]
app.head()

In [None]:
app_T0.shape

In [None]:
app_T1.shape

### 2.1 Univariate Analysis

#### 2.1.1 Categorical variables.

In [None]:
# Using the categorical columns for applicants who made payments on time.
plt.style.use('ggplot')
# Plotting a bar chart for each categorical variable
for column in categoric_cols:
    plt.figure(figsize=(20,4))
    plt.subplot(121)
    app_T0[column].value_counts().plot(kind='bar')
    plt.title(column)

### Inferences from univariate analysis of categorical variables: applicants making payments on time

1. More than 250,000 applicants applied for Cash loans, and only a small proportion applied for Revolving loans.
2. More than 175,000 applicants are female and slightly fewer than 100,000 are male; very few (4) applicants have a third or unknown gender.
3. More than 175,000 applicants do not own a car, and slightly fewer than 100,000 own one.
4. Slightly fewer than 200,000 applicants own a house or flat, whereas fewer than 100,000 do not.
5. Almost 250,000 applicants were unaccompanied when applying. The gap to the second most common category, Family, is large, and the remaining categories are almost negligible.
6. The top 3 income types among applicants are Working, Commercial associate and Pensioner, with Working the highest.
7. Applicants with Secondary / secondary special education account for the most loans, followed by those with Higher education.
8. Married applicants account for most loans, followed at a large distance by Single applicants.
9. Applicants living in their own house or apartment account for the most loans; the other housing categories have small counts.
10. The largest group of applicants did not state their occupation type. The next largest groups are Laborers, Sales staff and Core staff.
11. Apart from weekends, applications are spread almost evenly across the days of the week, with Sunday noticeably the lowest and Tuesday the highest.
12. The largest group of applicants works in Business Entity Type 3 organizations; the next two largest groups are unknown organization types and Self-employed respectively.
13. Applicants in the age groups 30-40, 40-50 and 50-60 account for most applications.
14. Relatively few applicants have very high or average income; most have either high or low income.

In [None]:
# Using the categorical columns for applicants with payment difficulties, i.e. applicants who were
# more than X days late on at least one of the first Y installments of the loan.
plt.style.use('ggplot')
# Plotting a bar chart for each categorical variable
for column in categoric_cols:
    plt.figure(figsize=(20,4))
    plt.subplot(121)
    app_T1[column].value_counts().plot(kind='bar')
    plt.title(column)

### Inferences from univariate analysis of categorical variables: applicants with payment difficulties

These applicants were more than X days late on at least one of the first Y installments of the loan.

1. More than 20,000 applicants applied for Cash loans, and only a small proportion applied for Revolving loans.
2. More than 14,000 applicants are female and slightly more than 10,000 are male.
3. Almost 17,500 applicants do not own a car, and slightly fewer than 7,500 own one.
4. More than 16,000 applicants own a house or flat, whereas slightly fewer than 8,000 do not.
5. More than 20,000 applicants were unaccompanied when applying. The gap to the second most common category, Family, is large, and the remaining categories are almost negligible.
6. The top 3 income types among applicants are Working, Commercial associate and Pensioner, with Working the highest.
7. Applicants with Secondary / secondary special education account for the most loans, followed by those with Higher education.
8. Married applicants account for most loans, followed at a large distance by Single applicants.
9. Applicants living in their own house or apartment account for the most loans; the other housing categories have small counts.
10. The largest group of applicants did not state their occupation type. The next largest groups are Laborers, Sales staff and Drivers.
11. Apart from weekends, applications are spread almost evenly across the days of the week, with Sunday noticeably the lowest and Tuesday the highest.
12. The largest group of applicants works in Business Entity Type 3 organizations; the next two largest groups are unknown organization types and Self-employed respectively.
13. Applicants in the age groups 30-40, 20-30 and 40-50 account for most applications.
14. Relatively few applicants have very high or average income; most have either high or low income.

#### 2.1.2 Continuous variables.

In [None]:
numeric_cols=list(numeric_data.columns)
numeric_cols.remove('SK_ID_CURR')
numeric_cols.remove('TARGET')

In [None]:
app_T0.describe()

In [None]:
for column in numeric_cols:
    plt.figure(figsize=(20,5))
    plt.subplot(121)
    sns.distplot(app_T0[column])
    plt.title(column)

### Inferences from univariate analysis of continuous variables: applicants making payments on time

1. Applicants with no children far outnumber those with a few children or with 5 or more. Most applicants therefore have few people financially dependent on them, which should lessen loan payment difficulties.
2. The density is highest for applicants earning between 100,000 and 220,000; they are the most likely to apply for loans and pay them on time. The distribution is skewed to the right, which shows that applicants with higher salaries are present as well.
3. The credit amount has its highest density between 0.045x10^6 and 0.053x10^6, with a right skew caused by the presence of the capped value
   1.864000e+06. Loans with such high credit amounts are also repaid.
4. The KDE for AMT_ANNUITY roughly resembles a normal distribution with a right skew, caused by a small number of values above 34749.0 and the capped maximum value of 71006.5.
5. There is no clear pattern for AMT_GOODS_PRICE or REGION_POPULATION_RELATIVE.
6. For DAYS_EMPLOYED, values above 12780 make little sense because that is equivalent to 35 years, yet many rows have values above 12780. Hence nothing can be derived from this column either.
7. The density is highest for applicants who changed their registration 0-5000 days before applying, and for applicants who changed the identity document used for the application 4000-4600 days before applying.
8. Most applicants have 2 family members.
9. REGION_RATING_CLIENT and REGION_RATING_CLIENT_W_CITY are 2 for most applicants.
10. Most applicants applied between 10 AM and 12:30 PM.
11. The columns REG_REGION_NOT_LIVE_REGION, REG_REGION_NOT_WORK_REGION, LIVE_REGION_NOT_WORK_REGION, REG_CITY_NOT_LIVE_CITY, REG_CITY_NOT_WORK_CITY and LIVE_CITY_NOT_WORK_CITY check whether the addresses given by the applicant match; they match in most cases here.
12. Age is almost evenly distributed, with maximum density between 35 and 43.

In [None]:
app_T1.describe()

In [None]:
for column in numeric_cols:
    plt.figure(figsize=(20,5))
    plt.subplot(121)
    sns.distplot(app_T1[column])
    plt.title(column)

### Inferences from univariate analysis of continuous variables: applicants with payment difficulties

These applicants were more than X days late on at least one of the first Y installments of the loan.

1. Applicants with no children far outnumber those with a few children or with 5 or more, so most applicants in this group also have few people financially dependent on them.
2. The density is highest for applicants earning between 100,000 and 170,000. The distribution is skewed to the right, which shows that applicants with higher salaries are present as well.
3. The credit amount has its highest density between 0.25x10^6-0.051x10^6, with a right skew caused by the presence of the capped value
   1.864000e+06. Loans with such high credit amounts also appear among applicants with payment difficulties.
4. The KDE for AMT_ANNUITY roughly resembles a normal distribution with a right skew, caused by a small number of values above 32976.0 and the capped maximum value of 71006.5.
5. There is no clear pattern for AMT_GOODS_PRICE or REGION_POPULATION_RELATIVE.
6. For DAYS_EMPLOYED, values above 12780 make little sense because that is equivalent to 35 years, yet many rows have values above 12780. Hence nothing can be derived from this column either.
7. The density is highest for applicants who changed their registration 0-5000 days before applying, and for applicants who changed the identity document used for the application 4000-4600 days before applying.
8. Most applicants have 2 family members.
9. REGION_RATING_CLIENT and REGION_RATING_CLIENT_W_CITY are 2 for most applicants.
10. Most applicants applied between 10 AM and 12:30 PM.
11. The columns REG_REGION_NOT_LIVE_REGION, REG_REGION_NOT_WORK_REGION, LIVE_REGION_NOT_WORK_REGION, REG_CITY_NOT_LIVE_CITY, REG_CITY_NOT_WORK_CITY and LIVE_CITY_NOT_WORK_CITY check whether the addresses given by the applicant match; they match in most cases here.
12. Age is almost evenly distributed, with maximum density between 25 and 38.

### 2.2 Bivariate and Multivariate Analysis

In [None]:
for column in categoric_cols:
    plt.figure(figsize=(30,6))
    plt.subplot(121)
    sns.countplot(data=app, x=column, hue='TARGET')
    plt.title(column)    
    plt.xticks(rotation=90)

### Inferences:

1. Applicants opting for Cash loans and paying the amount back make up a larger share than applicants opting for Revolving loans.
2. The proportion of males applying for loans and having payment difficulties is much higher than that of females.
3. Applicants who have difficulty paying back their loans mostly come unaccompanied when applying.
4. Applicants with Working income outnumber every other category, but most applicants with payment difficulties also come from the Working category. Only a negligible number of applicants who are unemployed, students, businessmen or on maternity leave have applied for loans or have payment difficulties.
5. Applicants with Secondary / secondary special education have applied for most of the loans but are also the group facing the most payment difficulties.
6. Married applicants apply for the most loans but tend to have payment difficulties as well.
7. Labourers and applicants who have not specified their occupation type tend to face difficulties paying back the loan; the occupations with the fewest applications and the fewest difficulties include IT staff and HR staff.
8. Applicants from Business Entity Type 3 organizations, unknown organizations and the self-employed have applied for the most loans, in that order; the most applicants with problems paying back the amount come from Business Entity Type 3, the self-employed and unknown organizations, in that order.
9. Applicants in the age groups 30-40, 40-50 and 50-60 have applied for the most loans, but it is the 30-40 age group that has the most difficulty paying back the loans.
10. Applicants with high income and low income have applied for the most loans, with far more applications from the high-income group. However, the two groups have almost the same number of applicants facing payment difficulties, which means that a larger proportion of low-income applicants have difficulty paying the amount.
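Several inferences above compare proportions rather than raw counts. A per-category default rate makes that comparison explicit; the sketch below uses a tiny hypothetical frame in which two groups have the same number of defaulters but different rates.

```python
import pandas as pd

# Hypothetical data: 4 low-income and 8 high-income applicants, 2 defaulters in each group
toy = pd.DataFrame({
    "IncomeRange": ["Low"] * 4 + ["High"] * 8,
    "TARGET":      [1, 1, 0, 0,   1, 1, 0, 0, 0, 0, 0, 0],
})

# For a 0/1 target, the group mean is the share of applicants with payment difficulties
summary = toy.groupby("IncomeRange")["TARGET"].agg(
    applicants="count",
    defaulters="sum",
    default_rate="mean",
)
print(summary)
```

On the real data the same `groupby(...)["TARGET"].mean()` pattern could be applied to any categorical column; count plots alone can hide differences in rate when group sizes differ.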

In [None]:
numeric_data_bivar=numeric_data.filter(['TARGET', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL',
       'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE',
       'DAYS_EMPLOYED','DAYS_REGISTRATION',
       'DAYS_ID_PUBLISH', 'CNT_FAM_MEMBERS', 'REGION_RATING_CLIENT',
       'REGION_RATING_CLIENT_W_CITY', 'Age'],axis=1)

sns_plot=sns.pairplot(data=numeric_data_bivar, hue='TARGET')
plt.show()

In [None]:
numeric_data_bivar1=numeric_data[numeric_data.TARGET==0].filter(['TARGET', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL',
       'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE',
       'DAYS_EMPLOYED','DAYS_REGISTRATION',
       'DAYS_ID_PUBLISH', 'CNT_FAM_MEMBERS', 'REGION_RATING_CLIENT',
       'REGION_RATING_CLIENT_W_CITY', 'Age'],axis=1)
#from IPython.display import Image
sns_plot=sns.pairplot(data=numeric_data_bivar1, hue='TARGET')
plt.show()

In [None]:
numeric_data_bivar1=numeric_data[numeric_data.TARGET==1].filter(['TARGET', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL',
       'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE',
       'DAYS_EMPLOYED','DAYS_REGISTRATION',
       'DAYS_ID_PUBLISH', 'CNT_FAM_MEMBERS', 'REGION_RATING_CLIENT',
       'REGION_RATING_CLIENT_W_CITY', 'Age'],axis=1)
#from IPython.display import Image
sns_plot=sns.pairplot(data=numeric_data_bivar1, hue='TARGET')
plt.show()

#### Few other visualizations for more inferences:

In [None]:
numeric_data.columns

In [None]:
categoric_data.columns

In [None]:
sns.barplot(data=app,x='FLAG_OWN_CAR',y='Age')
plt.show()

In [None]:
sns.barplot(data=app,x='FLAG_OWN_REALTY',y='Age')
plt.show()

In [None]:
#catplot creates its own figure, so the size is set through height and aspect instead of plt.figure
sns.catplot(data=app,x='IncomeRange',y='REGION_RATING_CLIENT',height=4,aspect=2)
plt.show()

In [None]:
sns.barplot(data=app,x='NAME_FAMILY_STATUS',y='Age')
plt.xticks(rotation=90)
plt.show()

In [None]:
sns.scatterplot(data=app,x='AMT_GOODS_PRICE',y='AMT_CREDIT',hue='TARGET')
plt.show()

In [None]:
sns.scatterplot(data=app,x='AMT_GOODS_PRICE',y='AMT_ANNUITY',hue='TARGET')
plt.show()

In [None]:
sns.scatterplot(data=app,x='AMT_CREDIT',y='AMT_ANNUITY',hue='TARGET')
plt.show()

### Inferences

The following inferences were developed by saving the pairplot as an image and zooming into each graph in it. Inferences derived from the bar graphs above are also included.

1. With an increasing number of children, applicants start facing payment problems irrespective of income, credit amount, annuity and goods price, except when the applicant's salary is very high and their age is relatively high.
2. Payment becomes difficult for the applicant when 'AMT_CREDIT and AMT_ANNUITY', 'AMT_CREDIT and AMT_GOODS_PRICE' or 'AMT_ANNUITY and AMT_GOODS_PRICE' rise together.
3. 'AMT_CREDIT and AMT_ANNUITY', 'AMT_CREDIT and AMT_GOODS_PRICE' and 'AMT_ANNUITY and AMT_GOODS_PRICE' have a rising, if not completely linear, relationship with few exceptions.
4. A larger DAYS_REGISTRATION indicates a higher chance of the loan being repaid, irrespective of AMT_INCOME_TOTAL, AMT_CREDIT, AMT_ANNUITY and AMT_GOODS_PRICE.
5. An increasing relationship is established between AMT_GOODS_PRICE and AMT_CREDIT.
6. Fewer family members and more days of registration lead to easier repayment of the loan amount. Furthermore, CNT_FAM_MEMBERS is not related to the other variables.
7. For every value of REGION_RATING_CLIENT and REGION_RATING_CLIENT_W_CITY, a larger DAYS_REGISTRATION indicates more reliable repayment of loans.
8. Age doesn't matter for owning a car or a house/apartment.
9. Older loan applicants are mostly either widowed or separated.

## 3. Correlation between Target and other variables

In app_T0 and app_T1, we drop the 2 columns below for the following reasons:
1. SK_ID_CURR: calculating a correlation for the ID makes no sense.
2. TARGET: we have split the dataframes by target 0 and 1, so the variance of TARGET within each is zero, which gives NaN when calculating the correlation.
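The second reason can be checked on a toy frame: a column with a single constant value has zero variance, so its Pearson correlation with anything is undefined. The values below are hypothetical.

```python
import pandas as pd

# Hypothetical subset where TARGET is constant, as in a frame filtered to one class
toy = pd.DataFrame({
    "TARGET": [0, 0, 0, 0],
    "AMT_CREDIT": [100.0, 200.0, 300.0, 400.0],
    "AMT_ANNUITY": [10.0, 19.0, 32.0, 41.0],
})

corr = toy.corr()
print(corr)
```

Every entry in the TARGET row and column comes out as NaN, while the correlation between the two amount columns is computed normally, which is why dropping TARGET before calling `corr` loses nothing.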

In [None]:
app_T0=app_T0.drop(['SK_ID_CURR', 'TARGET'], axis=1)

In [None]:
app_T1=app_T1.drop(['SK_ID_CURR', 'TARGET'], axis=1)

In [None]:
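#Restricting to numeric columns explicitly, since newer pandas versions raise an error on non-numeric columns in corr()
corr_app = app_T0.select_dtypes(include=np.number).corr().round(2).abs()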
corr_app

In [None]:
#plot heatmap to identify the correlation between different variables in the dataset for Target = 0
plt.figure(figsize = (18,6))
sns.heatmap(app_T0.select_dtypes(include=np.number).corr().round(3), annot = True, fmt='.2g',cmap= 'coolwarm')
plt.show()

In [None]:
#finding the top 10 correlation pairs for Target = 0
#head(20) is used because unstack() lists every pair twice, as (A, B) and (B, A)
corr_T0 = corr_app[corr_app!=1].unstack().sort_values(ascending = False).head(20)
print("The top 10 correlation pairs for Target = 0 are:")
corr_T0

### The top 10 correlation pairs for applicants who made their payments on time - 
- AMT_GOODS_PRICE              AMT_CREDIT                     0.99
- REGION_RATING_CLIENT         REGION_RATING_CLIENT_W_CITY    0.95
- CNT_CHILDREN                 CNT_FAM_MEMBERS                0.88
- LIVE_REGION_NOT_WORK_REGION  REG_REGION_NOT_WORK_REGION     0.86
- REG_CITY_NOT_WORK_CITY       LIVE_CITY_NOT_WORK_CITY        0.83
- AMT_CREDIT                   AMT_ANNUITY                    0.79
- AMT_GOODS_PRICE              AMT_ANNUITY                    0.79
- DAYS_EMPLOYED                Age                            0.63
- REGION_RATING_CLIENT         REGION_POPULATION_RELATIVE     0.54
- REGION_RATING_CLIENT_W_CITY  REGION_POPULATION_RELATIVE     0.54
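
The `head(20)` in the cell above compensates for `unstack()` listing every pair twice. As a minimal sketch (not part of the original analysis), the helper below keeps only the upper triangle of the correlation matrix so that each pair appears once. The function name `top_corr_pairs` and the toy dataframe are hypothetical; in this notebook it would be called on `app_T0` or `app_T1`.

```python
import numpy as np
import pandas as pd

def top_corr_pairs(df, n=10):
    """Return the n largest absolute correlations, one entry per pair."""
    corr = df.corr(numeric_only=True).abs()
    # Keep only the cells strictly above the diagonal: this drops the
    # self-correlations and the mirrored (B, A) copy of each pair.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return upper.stack().dropna().sort_values(ascending=False).head(n)

# Toy data (hypothetical): 'b' is built from 'a', 'c' is independent noise.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
demo = pd.DataFrame({
    'a': base,
    'b': base * 2 + rng.normal(scale=0.1, size=200),
    'c': rng.normal(size=200),
})

pairs = top_corr_pairs(demo, n=3)
print(pairs)
```

With this approach `n=10` really means ten distinct pairs, so the printed result can be copied into the summary list without removing duplicates by hand.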

## 3.2 Finding correlation of variables for Target = 1

In [None]:
corr_app_t1 = abs(round(app_T1.corr(),2))
corr_app_t1

In [None]:
#plot heatmap to identify the correlation between different variables in the dataset for Target = 1
plt.figure(figsize = (15,6))
sns.heatmap(round(app_T1.corr(),3), annot = True, fmt='.2g',cmap= 'coolwarm')
plt.show()

In [None]:
#finding the top 10 correlation pairs for Target = 1
#head(20) is used because unstack() lists every pair twice, as (A, B) and (B, A)
corr_T1 = corr_app_t1[corr_app_t1!=1].unstack().sort_values(ascending = False).head(20)
print("The top 10 correlation pairs for Target = 1 are:")
corr_T1

### The top 10 correlation pairs for Defaulters are - 
- AMT_CREDIT                   AMT_GOODS_PRICE                0.98
- REGION_RATING_CLIENT_W_CITY  REGION_RATING_CLIENT           0.96
- CNT_FAM_MEMBERS              CNT_CHILDREN                   0.88
- LIVE_REGION_NOT_WORK_REGION  REG_REGION_NOT_WORK_REGION     0.85
- REG_CITY_NOT_WORK_CITY       LIVE_CITY_NOT_WORK_CITY        0.78
- AMT_ANNUITY                  AMT_GOODS_PRICE                0.76
- AMT_CREDIT                   AMT_ANNUITY                    0.76
- DAYS_EMPLOYED                Age                            0.58
- REG_REGION_NOT_LIVE_REGION   REG_REGION_NOT_WORK_REGION     0.50
- REG_CITY_NOT_LIVE_CITY       REG_CITY_NOT_WORK_CITY         0.47

### The top 10 correlation pairs are almost the same for both Defaulters and Non Defaulters with below observations: 
For applicants who have difficulties in paying back the amounts, there is very little correlation between their total income (AMT_INCOME_TOTAL) and the series of payments made at equal intervals (AMT_ANNUITY).

Goods price increases with credit, annuity increases with credit, and goods price increases with annuity, though not entirely linearly. Since the pattern resembles linearity and the correlations are relatively high, these 3 variables are directly or indirectly related to each other.

For defaulters, there is an increasing relation between the address-mismatch flags: where an applicant's permanent address (region or city) doesn't match the contact address, it also tends not to match the work address.
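
To make the comparison between the two groups less dependent on reading two lists side by side, one option is to subtract the two correlation matrices and rank the pairs by how much they differ. This is only a sketch: `corr_gap` and the toy dataframes are hypothetical names, and in this notebook the inputs would be `app_T0` and `app_T1`.

```python
import numpy as np
import pandas as pd

def corr_gap(df_a, df_b, n=5):
    """Rank variable pairs by the absolute difference in correlation
    between two dataframes that share the same numeric columns."""
    gap = (df_a.corr(numeric_only=True) - df_b.corr(numeric_only=True)).abs()
    # Upper triangle only, so that each pair is reported once.
    upper = gap.where(np.triu(np.ones(gap.shape, dtype=bool), k=1))
    return upper.stack().dropna().sort_values(ascending=False).head(n)

# Toy data (hypothetical): x and y move together in group0 only.
rng = np.random.default_rng(42)
n_rows = 500
x0 = rng.normal(size=n_rows)
group0 = pd.DataFrame({
    'x': x0,
    'y': x0 + rng.normal(scale=0.1, size=n_rows),
    'z': rng.normal(size=n_rows),
})
group1 = pd.DataFrame({
    'x': rng.normal(size=n_rows),
    'y': rng.normal(size=n_rows),
    'z': rng.normal(size=n_rows),
})

gaps = corr_gap(group0, group1, n=3)
print(gaps)
```

Pairs at the top of this ranking are the ones whose relationship changes most between non-defaulters and defaulters, which is what the observations above try to capture.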

## 4. Analysing the previous_application.csv data
**We need to create a new dataset by joining the previous application data to get the loan history of each customer**

In [None]:
p_app = pd.read_csv("../input/credit-card/previous_application.csv")

In [None]:
p_app.head()

In [None]:
p_app.shape

In [None]:
#Obtaining columns with null value percentage higher than 40
p_nulls = p_app.isnull().sum() * 100 / len(p_app)

p_nulls = p_nulls[p_nulls > 40]

p_nulls

In [None]:
#dropping columns having null value percentage over 40
p_app.drop(p_nulls.index, inplace = True, axis = 1)

In [None]:
p_app.shape
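
The same 40% null threshold is applied to both `application_data` and `previous_application`. A small helper can keep the two steps consistent. This is a sketch under the assumption that the threshold is a percentage; the name `drop_sparse_columns` and the toy dataframe are hypothetical.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df, threshold=40.0):
    """Drop columns whose share of nulls (in %) is above threshold.
    Returns the cleaned frame and the null % of the dropped columns."""
    null_pct = df.isnull().mean() * 100
    to_drop = null_pct[null_pct > threshold].index
    return df.drop(columns=to_drop), null_pct[to_drop]

# Toy data (hypothetical): 'b' is 50% null, 'c' is 25% null.
demo = pd.DataFrame({
    'a': [1, 2, 3, 4],
    'b': [np.nan, np.nan, 1.0, 2.0],
    'c': [np.nan, 1.0, 2.0, 3.0],
})

cleaned, dropped = drop_sparse_columns(demo, threshold=40.0)
print(list(cleaned.columns))
print(dropped)
```

Returning a new dataframe instead of using `inplace = True` leaves the raw data untouched, which makes it easier to re-run the cell with a different threshold.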

In [None]:
p_app.NAME_CASH_LOAN_PURPOSE.value_counts()

In [None]:
p_app.CODE_REJECT_REASON.value_counts()

In [None]:
p_app[p_app['CODE_REJECT_REASON'] == 'XAP']['NAME_CONTRACT_STATUS'].unique()

In [None]:
p_app[p_app['NAME_CASH_LOAN_PURPOSE'] == 'XNA']['NAME_CONTRACT_TYPE'].unique()

In [None]:
#Removing rows where NAME_CASH_LOAN_PURPOSE = XNA, which is treated as a null value
p_app=p_app.drop(p_app[p_app['NAME_CASH_LOAN_PURPOSE']=='XNA'].index)

**There is no clear indication of what the XAP category of cash loans means - we will handle it as and when applicable.**

In [None]:
p_app.NAME_CASH_LOAN_PURPOSE.value_counts()
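
Placeholder codes such as XNA and XAP are not counted by `isnull()`, so they do not show up in the null percentages computed earlier. As a rough sketch, the share of such codes per text column can be measured before deciding which rows to drop. The helper name `placeholder_share` and the toy dataframe are hypothetical, and the list of placeholder codes is an assumption based on the values seen above.

```python
import pandas as pd

def placeholder_share(df, placeholders=('XNA', 'XAP')):
    """Percentage of rows per non-numeric column holding a placeholder code."""
    text_cols = df.select_dtypes(exclude='number')
    return text_cols.isin(list(placeholders)).mean().mul(100).round(2)

# Toy data (hypothetical) with the same column names as previous_application.
demo = pd.DataFrame({
    'NAME_CASH_LOAN_PURPOSE': ['XNA', 'Repairs', 'XAP', 'XNA'],
    'NAME_CONTRACT_TYPE': ['Cash loans', 'Cash loans', 'XNA', 'Consumer loans'],
    'AMT_CREDIT': [1000.0, 2000.0, 1500.0, 3000.0],
})

share = placeholder_share(demo)
print(share)
```

A column where most rows hold a placeholder is a candidate for the same treatment as a column with a high null percentage.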

### 4.1 Merging current and previous data sets

In [None]:
merge_app = app.merge(p_app, on = 'SK_ID_CURR', how = 'inner')

In [None]:
merge_app.shape

In [None]:
merge_app.head()
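
One point worth keeping in mind when reading the merged data: a customer with several previous applications appears once per previous application after the inner merge, so counts of TARGET in the merged data are weighted by application history. The toy frames below are hypothetical and only illustrate the fan-out. `validate='one_to_many'` is a pandas option that raises an error if `SK_ID_CURR` is not unique on the left side.

```python
import pandas as pd

# Toy stand-ins (hypothetical) for application_data and previous_application.
cur = pd.DataFrame({
    'SK_ID_CURR': [1, 2, 3],
    'TARGET': [0, 1, 0],
})
prev = pd.DataFrame({
    'SK_ID_CURR': [1, 1, 2, 4],
    'NAME_CONTRACT_STATUS': ['Approved', 'Refused', 'Approved', 'Canceled'],
})

# Inner merge keeps only customers present on both sides.
merged = cur.merge(prev, on='SK_ID_CURR', how='inner', validate='one_to_many')

# Number of merged rows per current customer.
per_client = merged.groupby('SK_ID_CURR').size()
print(merged)
print(per_client)
```

Customer 1 contributes two rows, customer 3 (no history) and previous-only customer 4 disappear. Comparing `merge_app['SK_ID_CURR'].nunique()` with `len(merge_app)` gives the same picture on the real data.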

### 4.2 Cleaning the data for analysis

In [None]:
list(merge_app.columns)

In [None]:
#Removing unwanted (or previously analysed) columns to proceed with analysis
merge_app.drop(['SK_ID_PREV',
 'NAME_CONTRACT_TYPE_y',
 'WEEKDAY_APPR_PROCESS_START_y',
 'HOUR_APPR_PROCESS_START_y',
 'FLAG_LAST_APPL_PER_CONTRACT',
 'NFLAG_LAST_APPL_IN_DAY','DAYS_DECISION',
 'NAME_PAYMENT_TYPE','NAME_CLIENT_TYPE',
 'NAME_GOODS_CATEGORY',
 'NAME_PORTFOLIO',
 'CHANNEL_TYPE',
 'SELLERPLACE_AREA',
 'NAME_SELLER_INDUSTRY',
 'CNT_PAYMENT',
 'NAME_YIELD_GROUP',
 'PRODUCT_COMBINATION',
 'REGION_RATING_CLIENT',
 'REGION_RATING_CLIENT_W_CITY',
 'WEEKDAY_APPR_PROCESS_START_x',
 'HOUR_APPR_PROCESS_START_x',
 'REG_REGION_NOT_LIVE_REGION',
 'REG_REGION_NOT_WORK_REGION',
 'LIVE_REGION_NOT_WORK_REGION',
 'REG_CITY_NOT_LIVE_CITY',
 'REG_CITY_NOT_WORK_CITY',
 'LIVE_CITY_NOT_WORK_CITY'], axis = 1, inplace =True)

In [None]:
merge_app.shape

In [None]:
merge_app.columns

In [None]:
merge_app["NAME_CONTRACT_STATUS"].value_counts()

### 4.3 Univariate Analysis

#### We have already analysed the attributes of application_data. From the previous_application data, we can obtain the status of the loans previously applied for by the customers for additional insights.

In [None]:
#Plotting counts of NAME_CASH_LOAN_PURPOSE
plt.figure(figsize=(15,6))
plt.subplot(121)
merge_app['NAME_CASH_LOAN_PURPOSE'].value_counts().plot(kind='bar', color = 'green')
plt.title('NAME_CASH_LOAN_PURPOSE')
plt.show()

The meaning of the XAP loan purpose is unclear and it hinders the analysis of trends, so rows with this value are removed.

In [None]:
#Removing XAP value rows as value seems to be randomly assigned or is not applicable 
merge_app=merge_app.drop(merge_app[merge_app['NAME_CASH_LOAN_PURPOSE']=='XAP'].index)

In [None]:
#obtaining result only for defaulters
merge_appT1 = merge_app[merge_app.TARGET == 1]

In [None]:
plot_merge = merge_app.filter(['SK_ID_CURR', 'TARGET', 'NAME_CONTRACT_TYPE_x', 'CODE_GENDER', 'NAME_EDUCATION_TYPE', 'OCCUPATION_TYPE','ORGANIZATION_TYPE', 'AgeGroup',
       'IncomeRange', 'NAME_CASH_LOAN_PURPOSE', 'NAME_CONTRACT_STATUS',
       'CODE_REJECT_REASON'], axis = 1)

plot_merge.drop('SK_ID_CURR', axis = 1, inplace = True)

plot_merge.info()

In [None]:
#Plotting counts of NAME_CASH_LOAN_PURPOSE again after removing XAP value entries
plt.figure(figsize=(15,6))
plt.subplot(121)
merge_app['NAME_CASH_LOAN_PURPOSE'].value_counts().plot(kind='bar', color = 'green')
plt.title('NAME_CASH_LOAN_PURPOSE')
plt.show()

In [None]:
#Plotting counts of NAME_CONTRACT_STATUS
plt.figure(figsize=(12,4))
plt.subplot(121)
merge_app['NAME_CONTRACT_STATUS'].value_counts().plot(kind='bar', color = 'green')
plt.title('NAME_CONTRACT_STATUS')
plt.show()

- Loans were requested most often for Repairs, Urgent needs and unspecified categories. Almost no loans were requested for a third person or by customers who refused to reveal the purpose of the loan.
- Most of the loans previously applied for by existing customers were refused.

**Univariate Analysis for defaulters**

In [None]:
#Plotting counts of NAME_CASH_LOAN_PURPOSE only for defaulter
plt.figure(figsize=(15,6))
plt.subplot(121)
merge_appT1['NAME_CASH_LOAN_PURPOSE'].value_counts().plot(kind='bar', color = 'green')
plt.title('NAME_CASH_LOAN_PURPOSE')
plt.show()

In [None]:
#Plotting counts of NAME_INCOME_TYPE only for defaulter
plt.figure(figsize=(15,6))
plt.subplot(121)
merge_appT1['NAME_INCOME_TYPE'].value_counts().plot(kind='bar', color = 'green')
plt.title('NAME_INCOME_TYPE')
plt.show()

In [None]:
#Plotting counts of CODE_GENDER only for defaulter
plt.figure(figsize=(15,6))
plt.subplot(121)
merge_appT1['CODE_GENDER'].value_counts().plot(kind='bar', color = 'green')
plt.title('CODE_GENDER')
plt.show()

### Defaulters mostly belong to the Working category, applied most often for loans for Repairs, and females show a significantly higher count.
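
The bar charts above show counts of defaulters, so a category can dominate simply because it has more applicants. A default rate per category separates the two effects. The sketch below is illustrative only: `default_rate` and the toy dataframe are hypothetical, and in this notebook the function would be applied to `merge_app` with columns such as `CODE_GENDER` or `NAME_INCOME_TYPE`.

```python
import pandas as pd

def default_rate(df, column, target='TARGET'):
    """Applications and default rate (%) for each value of column."""
    out = df.groupby(column)[target].agg(applications='size', default_rate='mean')
    out['default_rate'] = (out['default_rate'] * 100).round(2)
    return out.sort_values('default_rate', ascending=False)

# Toy data (hypothetical): more female defaulters by count,
# but a higher default rate among males.
demo = pd.DataFrame({
    'CODE_GENDER': ['F', 'F', 'F', 'F', 'F', 'F', 'M', 'M'],
    'TARGET':      [0,   0,   0,   0,   1,   1,   1,   0],
})

rates = default_rate(demo, 'CODE_GENDER')
print(rates)
```

In the toy data females have two defaulters against one for males, yet the male default rate is higher, which is why counts and rates should be read together.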

### 4.4 Bivariate Analysis

In [None]:
#Dividing the data by NAME_CONTRACT_STATUS into separate data frames for better clarity 
merge_app_Refused = plot_merge[plot_merge["NAME_CONTRACT_STATUS"]  == 'Refused']
merge_app_Approved = plot_merge[plot_merge["NAME_CONTRACT_STATUS"]  == 'Approved']
merge_app_Canceled = plot_merge[plot_merge["NAME_CONTRACT_STATUS"]  == 'Canceled']
merge_app_Unused = plot_merge[plot_merge["NAME_CONTRACT_STATUS"]  == 'Unused offer']

In [None]:
print(merge_app_Refused.shape)
print(merge_app_Approved.shape)
print(merge_app_Canceled.shape)
print(merge_app_Unused.shape)

In [None]:
#plotting against Target for categorical variables to analyse trend
status_frames = [('Unused Loans', merge_app_Unused),
                 ('Approved Loans', merge_app_Approved),
                 ('Refused Loans', merge_app_Refused),
                 ('Canceled Loans', merge_app_Canceled)]

for column in plot_merge.columns:
    if column != 'TARGET':
        #one row of four panels per variable, one panel per contract status
        fig, axes = plt.subplots(1, 4, figsize=(15, 5), sharey = True)
        for ax, (title, frame) in zip(axes, status_frames):
            sns.countplot(ax = ax, x=column, hue='TARGET', data=frame)
            ax.set_title(title)
            ax.tick_params(axis='x', labelrotation=90)
        plt.show()
        

In [None]:
#For better understanding, re-plotting NAME_CASH_LOAN_PURPOSE vs TARGET.
plt.figure(figsize=(20,10))
sns.countplot(x=merge_app['NAME_CASH_LOAN_PURPOSE'],hue=merge_app['TARGET'],data=merge_app)
plt.xticks(rotation = 90)
plt.show()

In [None]:
#For better understanding, re-plotting NAME_CASH_LOAN_PURPOSE vs TARGET for only Refused loans
plt.figure(figsize=(20,10))
sns.countplot(x=merge_app_Refused['NAME_CASH_LOAN_PURPOSE'],hue=merge_app_Refused['TARGET'],data=merge_app_Refused)
plt.xticks(rotation = 90)
plt.show()

In [None]:
#Visualizing gender, income, education along with target for approved loans.
#Rows are filtered first so that x and hue come from the same subset of applicants.
approved_high = merge_app_Approved[merge_app_Approved['IncomeRange']=='High']

fig, axes = plt.subplots(1, 2, figsize=(15, 5), sharey = True)
sns.countplot(ax = axes[0], x='NAME_EDUCATION_TYPE', hue='TARGET', data=approved_high[approved_high['CODE_GENDER']=='M'])
axes[0].set_title('High income, males, education, Target - approved loans')

sns.countplot(ax = axes[1], x='NAME_EDUCATION_TYPE', hue='TARGET', data=approved_high[approved_high['CODE_GENDER']=='F'])
axes[1].set_title('High income, females, education, Target - approved loans')

for ax in axes:
    ax.tick_params(axis='x', labelrotation=90)
plt.show()

In [None]:
#Visualizing gender, income, education along with target for approved loans.
#Rows are filtered first so that x and hue come from the same subset of applicants.
approved_low = merge_app_Approved[merge_app_Approved['IncomeRange']=='Low']

fig, axes = plt.subplots(1, 2, figsize=(15, 5), sharey = True)
sns.countplot(ax = axes[0], x='NAME_EDUCATION_TYPE', hue='TARGET', data=approved_low[approved_low['CODE_GENDER']=='M'])
axes[0].set_title('Low income, males, education, Target - approved loans')

sns.countplot(ax = axes[1], x='NAME_EDUCATION_TYPE', hue='TARGET', data=approved_low[approved_low['CODE_GENDER']=='F'])
axes[1].set_title('Low income, females, education, Target - approved loans')

for ax in axes:
    ax.tick_params(axis='x', labelrotation=90)
plt.show()

In [None]:
#Visualizing gender, income, education along with target for refused loans.
#Rows are filtered first so that x and hue come from the same subset of applicants.
refused_high = merge_app_Refused[merge_app_Refused['IncomeRange']=='High']

fig, axes = plt.subplots(1, 2, figsize=(15, 5), sharey = True)
sns.countplot(ax = axes[0], x='NAME_EDUCATION_TYPE', hue='TARGET', data=refused_high[refused_high['CODE_GENDER']=='M'])
axes[0].set_title('High income, males, education, Target - refused loans')

sns.countplot(ax = axes[1], x='NAME_EDUCATION_TYPE', hue='TARGET', data=refused_high[refused_high['CODE_GENDER']=='F'])
axes[1].set_title('High income, females, education, Target - refused loans')

for ax in axes:
    ax.tick_params(axis='x', labelrotation=90)
plt.show()

In [None]:
#Visualizing gender, income, education along with target for refused loans.
#Rows are filtered first so that x and hue come from the same subset of applicants.
refused_low = merge_app_Refused[merge_app_Refused['IncomeRange']=='Low']

fig, axes = plt.subplots(1, 2, figsize=(15, 5), sharey = True)
sns.countplot(ax = axes[0], x='NAME_EDUCATION_TYPE', hue='TARGET', data=refused_low[refused_low['CODE_GENDER']=='M'])
axes[0].set_title('Low income, males, education, Target - refused loans')

sns.countplot(ax = axes[1], x='NAME_EDUCATION_TYPE', hue='TARGET', data=refused_low[refused_low['CODE_GENDER']=='F'])
axes[1].set_title('Low income, females, education, Target - refused loans')

for ax in axes:
    ax.tick_params(axis='x', labelrotation=90)
plt.show()

### Inferences:


- <font color='red'>In the case of refused loans, a very large proportion of applicants across all age groups were refused even though they would not have ended up defaulting.</font>

- Out of all the loans the bank has approved, it has made proportionally more losses on Cash loans than on Revolving loans, because there were applicants who couldn't pay their installments on time.
- Out of all the loans the bank has approved, it has made proportionally more losses on loans approved to males.
- The ratio of Refused loans is higher for defaulting males, although the count of applications is higher for the female population.
- Customers with secondary or secondary special education show the highest counts in Refused as well as Approved loans, and to a lesser extent in Canceled loans.
- Banks have made proportionally more losses by approving loans to labourers, drivers and not-specified occupation types. Banks have also refused proportionally more loans for these 3 categories and for sales staff.
- In the case of approved loans, the ratio of applicants who repay to applicants who don't is the same for the 30-40 and 40-50 age groups.
- The highest percentage of defaulters is observed where Cash loans were requested for Repairs. The highest percentage of refusals is in the same category. Given the proportionally high number of defaulters, more care should be taken when giving loans for repairs, other purposes, buying a used car, and buying a house or an annex.
- Refusal of loans is highest for customers earning a high income. This rate is significantly high for people in the 30-40 age group.
- The ratio of applicants who pay on time to applicants who have difficulties paying on time is almost the same for approved and refused loans, which points to an efficiency problem. The rules and parameters need to be stricter while approving loans.
- Males with high income and secondary education are more likely to have difficulties paying back the amount or to default. They are followed by males with low income, then females with high income, and then females with low income.
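
Several of the inferences above are about proportions rather than counts. A row-normalised crosstab gives those proportions directly, as a complement to the count plots. This is a minimal sketch: `target_share_by` and the toy dataframe are hypothetical, and in this notebook the function would be applied to `plot_merge` or `merge_app`.

```python
import pandas as pd

def target_share_by(df, column, target='TARGET'):
    """Share of each TARGET value (in %) within every value of column."""
    return pd.crosstab(df[column], df[target], normalize='index').mul(100).round(1)

# Toy data (hypothetical) using the notebook's column names.
demo = pd.DataFrame({
    'NAME_CONTRACT_STATUS': ['Approved'] * 4 + ['Refused'] * 4,
    'TARGET':               [0, 0, 0, 1,       0, 0, 1, 1],
})

share = target_share_by(demo, 'NAME_CONTRACT_STATUS')
print(share)
```

Each row sums to 100, so the column for TARGET = 0 in the Refused row is the share of refused applicants who did not go on to have payment difficulties.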

In [None]:
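#plot heatmap to identify the correlation between different variables in the merged dataset
#numeric_only=True skips the text columns, which newer pandas versions no longer ignore silently
plt.figure(figsize = (18,6))
sns.heatmap(round(merge_app.corr(numeric_only=True),3), annot = True, fmt='.2g',cmap= 'coolwarm')
plt.show()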

A very high correlation is observed between the application amount and the goods price, including the goods price from already processed loans. We can also observe a correlation trend similar to the one seen while analysing the application data.

In [None]:
#finding the top 10 correlation pairs for each status of the previous loan
#numeric_only=True skips the text columns, which newer pandas versions no longer ignore silently
refused = round(merge_app[merge_app["NAME_CONTRACT_STATUS"] == 'Refused'].corr(numeric_only=True),3)
approved = round(merge_app[merge_app["NAME_CONTRACT_STATUS"] == 'Approved'].corr(numeric_only=True),3)
canceled = round(merge_app[merge_app["NAME_CONTRACT_STATUS"] == 'Canceled'].corr(numeric_only=True),3)
unused = round(merge_app[merge_app["NAME_CONTRACT_STATUS"] == 'Unused offer'].corr(numeric_only=True),3)

#drop_duplicates() removes the mirrored (B, A) copy of each pair, but also any other pair with the same rounded value
print(refused[refused!=1].unstack().sort_values(ascending = False).drop_duplicates().head(10))
print('---------------------------------------------------------------------------------------')
print(approved[approved!=1].unstack().sort_values(ascending = False).drop_duplicates().head(10))
print('---------------------------------------------------------------------------------------')
print(canceled[canceled!=1].unstack().sort_values(ascending = False).drop_duplicates().head(10))
print('---------------------------------------------------------------------------------------')
print(unused[unused!=1].unstack().sort_values(ascending = False).drop_duplicates().head(10))
print('---------------------------------------------------------------------------------------')


We can observe a similar correlation trend for all loans previously issued to existing customers irrespective of the status of the loan.

# _*`Overall Recommendations:`*_

- The ratio of applicants who pay on time to applicants who have difficulties paying on time is almost the same for approved and refused loans. This is an efficiency problem which, if left unchanged, will compound the losses the bank is facing. The rules and parameters need to be stricter while approving loans, and while rejecting loans the concerned officials need to go through other parameters to conclude whether the person has a high chance of repaying.
- The bank has lost a good amount of profit by not approving revolving loans to applicants who would have paid the installments on time. The bank needs to increase the number of revolving loans, depending on its cash reserve, so as to derive continuous profit from them.
- Males with high income and secondary education are more likely to have difficulties paying back the amount or to default. Hence more attention to detail is needed while approving their loans.
- Approve more loans to applicants with a high annual income, secondary or higher education and a lower credit amount. This category is the least likely to default.
- Most of the loans previously applied for by existing customers were refused. Now that they have applied again, loans can be approved if their total income has increased, or if the credit amount or the annuity has decreased such that repayment is now possible. Loans can also be approved if the target variable for those applicants now shows 0.
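
The last recommendation can be turned into a simple filter on the merged data. The sketch below rests on assumptions that should be checked against the actual columns: it assumes that after the merge the current and previous amounts carry the suffixes `_x` (current application) and `_y` (previous application), e.g. `AMT_CREDIT_x` and `AMT_CREDIT_y`. The previous application data has no income column, so only credit and annuity are compared. The function name `reconsider_candidates` and the toy dataframe are hypothetical.

```python
import pandas as pd

def reconsider_candidates(merged):
    """Previously refused applications where the current credit amount
    or the current annuity is lower than in the refused application."""
    refused = merged[merged['NAME_CONTRACT_STATUS'] == 'Refused']
    lower_credit = refused['AMT_CREDIT_x'] < refused['AMT_CREDIT_y']
    lower_annuity = refused['AMT_ANNUITY_x'] < refused['AMT_ANNUITY_y']
    return refused[lower_credit | lower_annuity]

# Toy data (hypothetical); _x = current application, _y = previous one.
demo = pd.DataFrame({
    'SK_ID_CURR': [1, 2, 3],
    'NAME_CONTRACT_STATUS': ['Refused', 'Refused', 'Approved'],
    'AMT_CREDIT_x': [100.0, 300.0, 100.0],
    'AMT_CREDIT_y': [200.0, 200.0, 200.0],
    'AMT_ANNUITY_x': [10.0, 30.0, 10.0],
    'AMT_ANNUITY_y': [20.0, 20.0, 20.0],
})

candidates = reconsider_candidates(demo)
print(candidates)
```

This only narrows down the list of applicants worth a second look. The final decision would still depend on the other parameters discussed in the recommendations above.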