## **Objective**: 

This analysis aims to identify patterns that indicate whether a client will have difficulty paying their installments. Such patterns can inform actions such as denying the loan, reducing the loan amount, or lending to risky applicants at a higher interest rate, while ensuring that applicants capable of repaying the loan are not rejected. Identifying such applicants using EDA is the aim of this case study.

In other words, the company wants to understand the driving factors (or driver variables) behind loan default, i.e. the variables which are strong indicators of default.  The company can utilise this knowledge for its portfolio and risk assessment.

We will be looking at the process for cleaning the data, and visualizing several parameters so we can gain an understanding of the driving factors. We will be documenting all the inferences we can make based on our observations.

In [None]:
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
pd.set_option('display.max_columns',100)
pd.set_option('display.max_rows', 500)

In [None]:
#Reading previous application file
prev=pd.read_csv('../input/home-credit-default-risk/previous_application.csv')

In [None]:
#Reading application_data file 
app=pd.read_csv('../input/home-credit-default-risk/application_train.csv')

## Inspecting the dataframe

In [None]:
app.shape

In [None]:
app.head()

In [None]:
app.describe()

In [None]:
#column-wise null count in the application data
100*round(app.isnull().sum()/len(app),4)

In [None]:
#retaining only those columns where the null percentage is less than 50
app=app.loc[:,100*round(app.isnull().sum()/len(app),4)<50]

In [None]:
#inspecting
app.shape

#### After dropping the columns where more than half the rows contained null values, we are left with 81 columns (we started with 122)
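
The null-percentage filter used above can be illustrated on a tiny frame. This is only a sketch: the `toy` data and the `null_pct` helper are made up for illustration and are not part of the notebook.

```python
import numpy as np
import pandas as pd

def null_pct(df):
    """Percentage of null values per column (hypothetical helper)."""
    return (df.isnull().mean() * 100).round(2)

# Toy data: column "a" is 75% null, "b" is 25% null, "c" has no nulls
toy = pd.DataFrame({
    "a": [1, np.nan, np.nan, np.nan],
    "b": [1, 2, np.nan, 4],
    "c": [1, 2, 3, 4],
})

pct = null_pct(toy)
# Keep only the columns where fewer than half of the rows are null
kept = toy.loc[:, pct < 50]
print(pct)
print(list(kept.columns))
```

Computing the percentages once and reusing them keeps the threshold in a single place, which makes it easier to try a different cut-off later.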

In [None]:
#checking null percentages again 
100*round(app.isnull().sum()/len(app),4)

### We can now try imputing the missing values for columns where the null percentage is less than 14%

In [None]:
#Checking what these columns are, in which we can safely impute values
cols_to_impute = list(app.loc[:,(100*round(app.isnull().sum()/len(app),4) > 0) & (100*round(app.isnull().sum()/len(app),4) <14)].columns)
cols_to_impute

In [None]:
#Checking what these columns look like, and what data they hold
for i in enumerate(cols_to_impute):
    print(i[1],'\n')
    print((app[i[1]].describe()))
    print('\n')

### We can make some observations here
* NAME_TYPE_SUITE is a categorical variable and cannot be imputed with a numerical statistic such as the mean or median 
* The rest of the columns are numerical in nature, and some appear to have outliers. 

To further study the presence of these outliers, we'll use box plots

In [None]:
#We create another list of columns with only the numerical variables that we wish to impute null values for
num_cols_to_impute=cols_to_impute.copy()
num_cols_to_impute.remove('NAME_TYPE_SUITE')

In [None]:
plt.figure(figsize=(15,15))

for i in enumerate(num_cols_to_impute):
    plt.subplot(3,4,i[0]+1)
    sns.boxplot(y=i[1],data=app)

plt.show()

In [None]:
#Let's also visualize the categorical variable and see what the spread is like
sns.countplot(y='NAME_TYPE_SUITE',data=app)
plt.show()

### We can take the following observations from the visual analysis:
* All the numerical variables discussed have a large number of outliers except for EXT_SOURCE_2. We can use the mean to impute the null values in that column since its spread is even. For the rest, we'll have to use the median. Let's take a look at some of these columns: 
    * 'OBS_30_CNT_SOCIAL_CIRCLE' - The spread is heavily concentrated near the lower end. In the description of the column above, the 50th percentile is 0, the 75th percentile is 2 and the max value is very high at 348. This leads to a skewed mean, hence we can impute with the median, which is 0.
    * The columns 'DEF_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE' and 'DEF_60_CNT_SOCIAL_CIRCLE' show a similar spread. A very large share of the data is at zero, and a few outliers span to higher values. We can impute the null values with 0 for these as well. 
    * For the column 'AMT_REQ_CREDIT_BUREAU_YEAR', however, imputing with zero won't be wise. The data is more spread out in this column, and there are many occurrences of values greater than zero, as can be seen from the box plot. The median of this column is 1, and we can use 1 as the imputing value while filling its nulls.
* For the one categorical column we looked at (NAME_TYPE_SUITE), the category "Unaccompanied" is the clear choice for imputation. 
    * A categorical variable can be imputed using the mode, i.e. the value with the highest frequency of occurrence. The countplot shows that "Unaccompanied" has the highest frequency and can be used for this imputation. 
* For the column 'AMT_GOODS_PRICE' we need to proceed with caution since this variable is critical to the analysis. Imputing a wrong value here could heavily skew the results.
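
The imputation choices listed above (median for skewed counts, mean for evenly spread values, mode for a categorical column) could be applied roughly as follows. This is a minimal sketch on made-up values; the `impute` helper and the `toy` frame are hypothetical and the notebook itself does not perform this step here.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `app`; the values are invented for illustration
toy = pd.DataFrame({
    "OBS_30_CNT_SOCIAL_CIRCLE": [0, 0, 2, np.nan, 348],
    "EXT_SOURCE_2": [0.2, 0.4, np.nan, 0.6, 0.8],
    "NAME_TYPE_SUITE": ["Unaccompanied", "Family", "Unaccompanied", None, "Unaccompanied"],
})

def impute(df, median_cols, mean_cols, mode_cols):
    """Return a copy of df with nulls filled by median, mean or mode per column."""
    out = df.copy()
    for col in median_cols:
        out[col] = out[col].fillna(out[col].median())
    for col in mean_cols:
        out[col] = out[col].fillna(out[col].mean())
    for col in mode_cols:
        # mode() ignores nulls and returns a Series; take its first entry
        out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out

filled = impute(
    toy,
    median_cols=["OBS_30_CNT_SOCIAL_CIRCLE"],
    mean_cols=["EXT_SOURCE_2"],
    mode_cols=["NAME_TYPE_SUITE"],
)
print(filled)
```

Working on a copy leaves the original frame untouched, so the imputed and raw distributions can still be compared afterwards.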

In [None]:
#Let's check if any numerical columns have negative values which don't make sense (for example, negative age)
app.min()

In [None]:
#Inspecting the data types
app.dtypes

In [None]:
#Fixing data types of columns which appear to be incorrectly formatted
app['DAYS_REGISTRATION'] = app['DAYS_REGISTRATION'].astype('int64')
app.CNT_FAM_MEMBERS = pd.to_numeric(app.CNT_FAM_MEMBERS, errors = 'coerce')
app['CNT_FAM_MEMBERS'] = app['CNT_FAM_MEMBERS'].fillna(0).astype('int64')



### Observation:
 
 There are several columns like 'DAYS_BIRTH','DAYS_EMPLOYED','DAYS_ID_PUBLISH','DAYS_LAST_PHONE_CHANGE','DAYS_REGISTRATION' which contain a value representing a number of days. A count of days cannot be negative, so these need to be fixed. We can also convert the 'DAYS_BIRTH' column to years, since years are a more natural measure of a person's age than days. 

In [None]:
app.DAYS_BIRTH.describe()

In [None]:
#Let's see what this looks like with absolute values
app.DAYS_BIRTH.apply(lambda x: abs(x)).describe()

In [None]:
# Converting 'DAYS_BIRTH' to 'AGE'.

# Dividing by np.timedelta64(1,'Y') is rejected by recent pandas versions,
# so we divide the absolute day count by the mean year length in days instead
app['AGE'] = (app['DAYS_BIRTH'].abs() / 365.2425).round()

In [None]:
app['AGE'] = app['AGE'].astype('int64')

In [None]:
#Inspecting again to ensure the calculation was correct
app.AGE.describe()

In [None]:
#We can drop the original days column for age
app.drop('DAYS_BIRTH',axis=1,inplace=True)

In [None]:
#Now, we'll change the rest of the days columns to positive values.

days_columns = ['DAYS_ID_PUBLISH','DAYS_LAST_PHONE_CHANGE','DAYS_REGISTRATION']
for col in days_columns:
    app[col] = app[col].abs()
   

In [None]:
app.AMT_ANNUITY.isna().sum()

In [None]:
#Changing AMT_ANNUITY and AMT_CREDIT units to thousand 
#there are some NaNs in AMT_ANNUITY. We will be dropping them
app.dropna(subset=['AMT_ANNUITY'],inplace=True)
app.AMT_ANNUITY = app.AMT_ANNUITY.apply(lambda x:round(x/1000))
app.AMT_CREDIT = app.AMT_CREDIT.apply(lambda x: round(x/1000))

In [None]:
#Changing the units of client's income to thousands

app['AMT_INCOME_TOTAL_original']=app['AMT_INCOME_TOTAL'] #backing up the column if needed in future
app['AMT_INCOME_TOTAL']=app['AMT_INCOME_TOTAL'].apply(lambda x:round(x/1000))

In [None]:
app['AMT_INCOME_TOTAL'].describe()

In [None]:
#The box plot can be used to identify the outliers easily in the income column.
sns.boxplot(x=app.AMT_INCOME_TOTAL)
plt.show()

In [None]:
#The income is concentrated towards the lower end, indicating that a lot of loan seekers belong to the lower income brackets (mainly middle class folks)
#sns.distplot is deprecated; kdeplot draws the same density curve
sns.kdeplot(app['AMT_INCOME_TOTAL'])
plt.show()

In [None]:
#This is the outlier that skews our data to the higher end
app['AMT_INCOME_TOTAL'].max()   

In [None]:
#We can bin the values in the income column to deal with the outliers
# bins = pd.IntervalIndex.from_tuples([(0, 50), (50, 120), (120, 250),(250,500),(500,5000),(5000,120000)])
app['INCOME_CATEGORY']=pd.cut(app['AMT_INCOME_TOTAL'], bins=[0,50,120,250,500,5000,120000], labels = ['Lower','LowerMiddle','UpperMiddle', 'Upper','Rich','UberRich'])

In [None]:
sns.countplot(y=app.INCOME_CATEGORY)

In [None]:
app.INCOME_CATEGORY.value_counts()

### Observations
* Loan applications are crowded in the middle class, specifically the upper middle class with an income in the range of 120-250 thousand. 
* There are clearly a few **outliers** in this data; we have binned them in the "UberRich" category 
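
To make the bin edges concrete, here is a small sketch of how `pd.cut` assigns values with the same bins and labels as above. The `income_k` values are made up; note that intervals are right-inclusive by default, so an income of exactly 50 falls in 'Lower' and exactly 120 in 'LowerMiddle'.

```python
import pandas as pd

# Made-up incomes in thousands, chosen to sit on and around the bin edges
income_k = pd.Series([45, 50, 51, 120, 300, 117000])

labels = ['Lower', 'LowerMiddle', 'UpperMiddle', 'Upper', 'Rich', 'UberRich']
cats = pd.cut(income_k, bins=[0, 50, 120, 250, 500, 5000, 120000], labels=labels)

print(pd.DataFrame({"income_k": income_k, "category": cats}))
```

Values outside the outermost edges (for example an income of 0) would come back as NaN, which is worth checking with `cats.isnull().sum()` after binning.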

In [None]:
# Binning the age column
app['AGE_CAT']=pd.cut(app['AGE'], bins=[0,30,40,50,60,70], labels = ['0 - 30','30 - 40', '40 - 50','50 - 60','60 +'])

In [None]:
app.AGE_CAT.value_counts()

In [None]:
#The days employed column has some negative values. We need to fix this
#Also, the days as such would be of little value to us during the analysis. Instead, we can convert this column to years
app.DAYS_EMPLOYED.describe()

In [None]:
#Let's see the spread of data when viewed as years
ax=sns.boxplot(x=app['DAYS_EMPLOYED'].abs()//365)
ax.set(xlabel='Years Employed')
plt.show()

### Observation
There is definitely something wrong here. We see an applicant with around 1000 years of employment experience! This is clearly an error and we need to deal with this value so it doesn't impact the analysis. Let's see what this row is.
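
One way to check whether such an extreme value is a one-off or a repeated placeholder is to count how often each implausible value occurs. The sketch below uses a made-up sample of DAYS_EMPLOYED and an arbitrary 100-year sanity threshold; both are assumptions for illustration only.

```python
import pandas as pd

# Made-up sample; 365243 is the extreme value seen in the box plot
days = pd.Series([-637, -1188, 365243, -225, 365243, -3039], name="DAYS_EMPLOYED")

years = days.abs() / 365      # employment length in years
suspicious = years > 100      # hypothetical sanity threshold

# How many rows are implausible, and which raw values do they carry?
suspicious_values = days[suspicious].value_counts()
print(suspicious.sum())
print(suspicious_values)
```

If a single raw value accounts for all the implausible rows, it is more likely a placeholder used by the data provider than a set of independent data-entry errors.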

In [None]:
#Here, we can observe that there is not one, but 55374 such rows which are causing outliers in the column. Since this is a lot of rows, we will not delete them. Instead, let's change these to NaN
app.loc[app['DAYS_EMPLOYED'].abs()==365243,'DAYS_EMPLOYED']=np.nan

In [None]:
#Binning the days employed column into YEARS_EMPLOYED
app['YEARS_EMPLOYED']=pd.cut(app['DAYS_EMPLOYED'].apply(lambda x: abs(x)), bins=[0,365,5*365,10*365,25*365,45*365,1001*365], labels = ['Upto 1 Year','1 - 5 Years','5 - 10 Years', '10 - 25 Years','25 - 45 Years','45+ Years'])
app['YEARS_EMPLOYED'].value_counts()

In [None]:
#Let's visualize the results for a better view
plt.figure(figsize=(10,5))
ax = sns.countplot(x=app['YEARS_EMPLOYED'])
ax.set(xlabel='Years Employed',ylabel='Number of Applicants')
plt.show()

### Observation:
* We can note that the largest number of loan applicants fall in the range of 1-5 years in terms of work experience. There is also a significant chunk of people in the 45+ years experience range who apply for loans
* We have dealt with the outliers in AMT_INCOME_TOTAL and DAYS_EMPLOYED by binning the values

In [None]:
#Let's convert the AMT_GOODS_PRICE to thousands as well, for ease of analysis 
app['AMT_GOODS_PRICE']=app['AMT_GOODS_PRICE'].apply(lambda x:(x//1000))

We can easily observe that the value of goods against which a loan is acquired ranges from 40k to 4050k

In [None]:
#Let's see what the spread of the goods value looks like
sns.kdeplot(app.AMT_GOODS_PRICE.dropna())
plt.show()

In [None]:
#We can spot some outliers in the goods price column
plt.figure(figsize=(15,5))
sns.boxplot(x=app['AMT_GOODS_PRICE'])
plt.show()

In [None]:
#Upon inspecting the spread of the data using qcut, we can see that high values may not be outliers after all, since there are quite a few values located in the top 10% range.
pd.qcut(app['AMT_GOODS_PRICE'],q=[0,0.2,0.4,0.6,0.8,0.9,1]).value_counts()

In [None]:
app.AMT_GOODS_PRICE.describe()

In [None]:
#Let's see what the family member count column looks like. We'll be excluding the zero value here since that was used to fill NaNs previously
plt.figure(figsize=(15,5))
sns.boxplot(x=app.loc[app['CNT_FAM_MEMBERS']>0].CNT_FAM_MEMBERS)
plt.show()

### Observation
There are some **outliers** in the number of family members, these indicate some extraordinarily large families, and may be one-off cases.

In [None]:
#Let's now see how the credit amount is spread out
#AMT_CREDIT was already converted to thousands above, so it is plotted as-is
plt.figure(figsize=(15,5))
sns.boxplot(x=app.AMT_CREDIT)
plt.show()

### Observations
* The **outlier(s)** seem to be in sync with those in the AMT_GOODS_PRICE column
* This makes sense, since a client that purchases high value goods would need a high value loan, hence causing the outliers in the credit amount column too

In [None]:
#Observing outliers in the annuity
plt.figure(figsize=(15,5))
sns.boxplot(x=app.AMT_ANNUITY)
plt.show()

### Observation
We can observe quite a few outliers in the annuity column. These would again be due to high value goods purchased by clients, needing high value credit, in turn causing a high annuity amount for the loan.

# STARTING ANALYSIS PART

In [None]:
app.info()

In [None]:
#Checking the imbalance percentage of the dataset
app['TARGET'].value_counts(normalize = True)*100

### Observation
* Around 8% of the records have the target variable as 1, and the rest 92% have the target variable as 0
* Hence, our data is **highly imbalanced**
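
The degree of imbalance can also be expressed as a single ratio of the majority share to the minority share. The sketch below uses a made-up target column with the approximate 92/8 split noted above, not the real data.

```python
import pandas as pd

# Made-up target column with roughly the split observed above
target = pd.Series([0] * 92 + [1] * 8, name="TARGET")

share = target.value_counts(normalize=True) * 100
# Number of non-defaulters per defaulter
imbalance_ratio = share.loc[0] / share.loc[1]

print(share)
print(round(imbalance_ratio, 2))
```

A ratio this far from 1 is the reason raw counts of the two classes are hard to compare on a shared axis.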

In [None]:
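#Separating the data into two data frames based on the target values 0 and 1
#.copy() gives independent frames, so columns can be added to them later without a SettingWithCopyWarning
app1 = app[app['TARGET']==1].copy()
app0 = app[app['TARGET']==0].copy()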

In [None]:
#Looking for trends in the column NAME_INCOME_TYPE against the target variable. The spread seems to be even across both.
plt.figure(figsize=(12,5))
sns.countplot(y = 'NAME_INCOME_TYPE', data = app, hue='TARGET' )
plt.show()

### Planning
Due to the high imbalance in the data, using the 'hue' parameter while plotting may not yield discernible results at all. Hence, we can make use of the subplot functionality to view the plots side by side, for target value 0 vs target value 1

In [None]:
# Analysis for continuous variables
cont_var = ['AMT_INCOME_TOTAL','AMT_CREDIT','AMT_ANNUITY','AMT_GOODS_PRICE']

plt.figure(figsize=(10,10))

for i in enumerate(cont_var):
    plt.subplot(2,2,i[0]+1)
    sns.boxplot(y = i[1],x='TARGET',  data = app)

plt.show()

In [None]:
plt.figure(figsize = (10, 8))

# KDE plot of loans that were repaid on time
sns.kdeplot(app.loc[app['TARGET'] == 0, 'AGE'], label = 'target == 0')

# KDE plot of loans which were not repaid on time
sns.kdeplot(app.loc[app['TARGET'] == 1, 'AGE'], label = 'target == 1')

# Labeling of plot
plt.xlabel('Age (years)'); plt.ylabel('Density'); plt.title('Distribution of Ages');
plt.savefig('ages')

### Observation: 
* Younger applicants are a little more likely to default on a payment than older ones

In [None]:
#We can see some outliers in AMT_INCOME_TOTAL. Let's visualize this using the binned variable we created.
#The spread seems to be quite similar here.
plt.figure(figsize=(16,5))
plt.subplot(1,2,1)
sns.countplot(x='INCOME_CATEGORY', data=app0)
plt.title('NOT DEFAULTED')
plt.subplot(1,2,2)
sns.countplot(x='INCOME_CATEGORY', data=app1)
plt.title('DEFAULTED')
plt.show()

In [None]:
#Checking the spread of the total credit amount against the target variable 
plt.figure(figsize=(20,6))
sns.kdeplot(app0.AMT_CREDIT,color='green')
sns.kdeplot(app1.AMT_CREDIT,color='red')
plt.title('Distribution of credit amount of applicants, Defaulted in Red, Non-Defaulted in Green')
plt.show()

In [None]:
#Checking the income distribution for target 0 vs 1; it turns out to be similar
plt.figure(figsize=(10,10))
plt.subplot(2,1,1)
plt.title('Non Defaulted Applications')
sns.countplot(y='INCOME_CATEGORY',data=app0)
plt.subplot(2,1,2)
plt.title('Defaulted Applications')
sns.countplot(y='INCOME_CATEGORY',data=app1)

In [None]:
app.info()

In [None]:
# Analysis of categorical variables
cat_list = ['NAME_CONTRACT_TYPE','CODE_GENDER','NAME_TYPE_SUITE','NAME_INCOME_TYPE','NAME_EDUCATION_TYPE','NAME_FAMILY_STATUS',
            'NAME_HOUSING_TYPE','WEEKDAY_APPR_PROCESS_START','OCCUPATION_TYPE', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY',
            'ORGANIZATION_TYPE', 'AGE_CAT'
           ]

plt.figure(figsize=(20,100))
for idx, col in enumerate(cat_list):
    # left panel: applicants who did not default
    plt.subplot(13,2,2*idx+1)
    plt.title('TARGET = 0')
    plt.xticks(rotation = 90)
    sns.countplot(x=col, data=app0.sort_values(by=col))
    # right panel: applicants who defaulted
    plt.subplot(13,2,2*idx+2)
    plt.title('TARGET = 1')
    plt.xticks(rotation = 90)
    sns.countplot(x=col, data=app1.sort_values(by=col))
plt.show()

### Observations
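* In absolute numbers, more defaulters are female than male. Since these are count plots, this also reflects how many applicants there are of each gender, not only how often each group defaults
* The labourer class has a high number of defaults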
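
Because count plots mix group size with default tendency, it can help to look at the default rate per category as well. The sketch below uses a made-up frame in which one group has more defaulters in absolute terms but a lower default rate; the numbers say nothing about the real data.

```python
import pandas as pd

# Made-up data: 10 'F' applicants with 3 defaults, 4 'M' applicants with 2 defaults
toy = pd.DataFrame({
    "CODE_GENDER": ["F"] * 10 + ["M"] * 4,
    "TARGET": [1, 1, 1, 0, 0, 0, 0, 0, 0, 0] + [1, 1, 0, 0],
})

grouped = toy.groupby("CODE_GENDER")["TARGET"]
default_counts = grouped.sum()        # number of defaulters per group
default_rate = grouped.mean() * 100   # percentage of each group that defaulted

print(default_counts)
print(default_rate)
```

Since TARGET is 0 or 1, its mean per group is exactly the share of defaulters, so the same one-liner works for any of the categorical columns in `cat_list`.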

In [None]:
app['CODE_GENDER'].value_counts()

In [None]:
app.loc[app['CODE_GENDER'] == 'XNA','CODE_GENDER'] = 'F'
app['CODE_GENDER'].value_counts()

In [None]:
#Observing the documents submitted by those who did not default vs. those who defaulted
flag_list = ['FLAG_DOCUMENT_2','FLAG_DOCUMENT_3', 'FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5','FLAG_DOCUMENT_6', 'FLAG_DOCUMENT_7',
             'FLAG_DOCUMENT_8','FLAG_DOCUMENT_9', 'FLAG_DOCUMENT_10', 'FLAG_DOCUMENT_11','FLAG_DOCUMENT_12','FLAG_DOCUMENT_13',
             'FLAG_DOCUMENT_14','FLAG_DOCUMENT_15', 'FLAG_DOCUMENT_16', 'FLAG_DOCUMENT_17','FLAG_DOCUMENT_18', 
             'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20','FLAG_DOCUMENT_21']

app0.loc[:, flag_list].sum(axis=0).plot.barh()
plt.show()
app1.loc[:, flag_list].sum(axis=0).plot.barh()
plt.show()

In [None]:
#The spread is very similar for both groups, so let's combine these flags into a single column
app.loc[:,'TOTAL_DOCS']=app.loc[:, flag_list].sum(axis=1)
app0.loc[:,'TOTAL_DOCS']=app0.loc[:, flag_list].sum(axis=1)
app1.loc[:,'TOTAL_DOCS']=app1.loc[:, flag_list].sum(axis=1)

In [None]:
# Now dropping the FLAG_DOCUMENT columns as we have already created a column for that.
app.drop(flag_list, axis=1, inplace = True)
app0.drop(flag_list, axis=1, inplace = True)
app1.drop(flag_list, axis=1, inplace = True)

In [None]:
#There is some data from external sources present in the dataset. These seem to be ratings given by credit rating agencies.
#Let's see how the data links to these ratings.
plt.figure(figsize = (10, 12))

# iterate through the sources
for i, source in enumerate([ 'EXT_SOURCE_2', 'EXT_SOURCE_3']): #EXT_SOURCE_1 was dropped since it had several null values
    
    # create a new subplot for each source
    plt.subplot(2, 1, i + 1)
    # plot repaid loans
    sns.kdeplot(app.loc[app['TARGET'] == 0, source], label = 'target == 0')
    # plot loans that were not repaid
    sns.kdeplot(app.loc[app['TARGET'] == 1, source], label = 'target == 1')
    
    # Label the plots
    plt.title('Distribution of %s by Target Value' % source)
    plt.xlabel('%s' % source); plt.ylabel('Density');
    
plt.tight_layout(h_pad = 2.5)

### Observation:
* We can note here that EXT_SOURCE_3 seems to be relatively better at rating a client's repayment tendency.
* The plot for defaulted payments is skewed towards the left hand side, indicating that a low credit score can be used as an indicator of a tendency to fail on repayment

In [None]:
plt.figure(figsize=(15,6))
sns.kdeplot(app.loc[app['TARGET'] == 0, 'DAYS_EMPLOYED'].abs()/365, label = 'target == 0')
sns.kdeplot(app.loc[app['TARGET'] == 1, 'DAYS_EMPLOYED'].abs()/365, label = 'target == 1')
plt.title('Years of employment of applicant compared against whether the applicant defaulted or not')
plt.show()

### Observation
* We can see that the defaulting applicants are skewed heavily towards the left end, indicating that less experienced people with < 10 years of work experience are much more likely to default on a loan payment

In [None]:
app.info()

In [None]:
#Plotting education level and work experience 
plt.figure(figsize=(8,8))
plt.subplot(211)
sns.countplot(y='NAME_EDUCATION_TYPE',hue='TARGET',data=app)
plt.ylabel('Education Level')
plt.legend(title='Target',loc='lower right')
plt.subplot(212)
sns.countplot(y='YEARS_EMPLOYED',hue='TARGET',data=app)
plt.ylabel('Years in Employment')
plt.legend(title='Target',loc='lower right')
plt.show()

In [None]:
app.info()

In [None]:
#Checking the income vs credit amount against the target variable 
plt.figure(figsize=(12,6))
sns.scatterplot(x='AMT_INCOME_TOTAL_original', y='AMT_CREDIT', data=app.loc[app.AMT_INCOME_TOTAL < 2500], hue='TARGET',alpha=0.3)
plt.xlabel('Total Income of the Applicant')
plt.ylabel('Total Credit Amount')
plt.title('Both DEFAULTED and NOT DEFAULTED')
plt.show()

In [None]:
#Checking the income vs credit amount against the target variable 
plt.figure(figsize=(12,6))
sns.scatterplot(x='AMT_INCOME_TOTAL_original', y='AMT_CREDIT', data=app1.loc[app1.AMT_INCOME_TOTAL < 2500], alpha=0.3,color='orange')
plt.xlabel('Total Income of the Applicant')
plt.title('DEFAULTED')
plt.ylabel('Total Credit Amount')
plt.show()

In [None]:
#Let's also observe the same using the binned column we created
sns.scatterplot(x='INCOME_CATEGORY', y='AMT_CREDIT', data=app, hue='TARGET',alpha=0.3)

### Observation
* Applicants in the lower income brackets were found to be more likely to default on the payment.
* Among lower middle class applicants, the default percentage is seen to be higher when they apply for high value loans with a credit amount of >2000k

In [None]:
#Let's check the correlation matrix for the two cases, one where target variable is 1, and second where target variable is 0
corr0=app0[['CNT_CHILDREN', 'AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY',
       'AMT_GOODS_PRICE',
       'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH', 'CNT_FAM_MEMBERS',
       'REGION_RATING_CLIENT',
       'HOUR_APPR_PROCESS_START',
       'LIVE_CITY_NOT_WORK_CITY', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'TOTALAREA_MODE',
       'AMT_REQ_CREDIT_BUREAU_HOUR',
       'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK',
       'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT',
       'AMT_REQ_CREDIT_BUREAU_YEAR', 'AGE',]].corr()
corr1=app1[['CNT_CHILDREN', 'AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY',
       'AMT_GOODS_PRICE',
       'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH', 'CNT_FAM_MEMBERS',
       'REGION_RATING_CLIENT',
       'HOUR_APPR_PROCESS_START',
       'LIVE_CITY_NOT_WORK_CITY', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'TOTALAREA_MODE',
       'AMT_REQ_CREDIT_BUREAU_HOUR',
       'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK',
       'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT',
       'AMT_REQ_CREDIT_BUREAU_YEAR', 'AGE',]].corr()

In [None]:
corr0 = corr0.where(np.triu(np.ones(corr0.shape), k=1).astype(bool))
corrdf0 = corr0.unstack().reset_index()
corrdf0.head()

In [None]:
corrdf0.columns = ['VAR1', 'VAR2', 'Correlation']
corrdf0.dropna(subset = ['Correlation'], inplace = True)
corrdf0['Correlation'] = round(corrdf0['Correlation'], 2)
# We will be using the absolute value of the correlation coefficients since we are only interested in the strength of the correlation.
# The direction in which the entities are correlated is currently not of concern
corrdf0['Correlation'] = corrdf0['Correlation'].abs()

In [None]:
corr1 = corr1.where(np.triu(np.ones(corr1.shape), k=1).astype(bool))
corrdf1 = corr1.unstack().reset_index()
corrdf1.head()

In [None]:
corrdf1.columns = ['VAR1', 'VAR2', 'Correlation']
corrdf1.dropna(subset = ['Correlation'], inplace = True)
corrdf1['Correlation'] = round(corrdf1['Correlation'], 2)
# We will be using the absolute value of the correlation coefficients since we are only interested in the strength of the correlation.
# The direction in which the entities are correlated is currently not of concern
corrdf1['Correlation'] = corrdf1['Correlation'].abs()

In [None]:
#Inspecting the TOP 10 correlated variables in the app1 dataframe, which holds applicants who defaulted
corrdf1.sort_values(by = 'Correlation', ascending = False).head(10)

In [None]:
#Inspecting the TOP 10 correlated variables in the app0 dataframe, which holds applicants who did not default
corrdf0.sort_values(by = 'Correlation', ascending = False).head(10)

### Observation:
* Both the data sets are showing a similar set of variables with a high correlation value
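
The masking, unstacking and ranking steps above are repeated once per group. They could be wrapped in a single helper so the same logic is applied to both frames. The sketch below is one possible way to do that; `top_pairs` and the random `toy` data are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

def top_pairs(df, k=10):
    """Return the k variable pairs with the largest absolute correlation."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair appears once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(ascending=False).head(k)

# Toy data: "b" is built from "a", while "c" is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

top = top_pairs(toy, k=1)
print(top)
```

With such a helper, the two groups could be compared by calling it on the same column subset of `app0` and `app1` and checking which pairs appear in both results.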

In [None]:
#Income vs Annuity
#Here we observe that income and annuity show a low|low trend for the defaulters. 
#High income earners who get loans with higher annuity values end up defaulting much less often than those..
#...in the low income bracket, even when the latter opt for a lower annuity
plt.figure(figsize=(15,8))
plt.subplot(1,2,1)
plt.title('DID NOT DEFAULT')
sns.scatterplot(x='AMT_ANNUITY',y='AMT_INCOME_TOTAL',data=app0,alpha=0.1,color='green')
plt.ylabel('Income in Thousands')
plt.xlabel('Annuity in Thousands')
plt.subplot(1,2,2)
plt.title('DEFAULTED')
sns.scatterplot(x='AMT_ANNUITY',y='AMT_INCOME_TOTAL',data=app1[app1.AMT_INCOME_TOTAL < 20000],alpha=0.1,color='red')
plt.ylabel('Income in Thousands')
plt.xlabel('Annuity in Thousands')
plt.show()

In [None]:
#AMT_GOODS_PRICE vs AMT_ANNUITY
plt.figure(figsize=(15,10))
plt.subplot(1,2,1)
plt.title('DID NOT DEFAULT')
sns.scatterplot(x='AMT_ANNUITY',y='AMT_GOODS_PRICE',data=app0)
plt.subplot(1,2,2)
plt.title('DEFAULTED')
sns.scatterplot(x='AMT_ANNUITY',y='AMT_GOODS_PRICE',data=app1)
plt.show()

In [None]:
#AMT_GOODS_PRICE vs AMT_INCOME_TOTAL
#Here we observe that low income earning applicants who seek to buy high value goods are more likely to default on payments
plt.figure(figsize=(11,10))
plt.subplot(211)
plt.title('DID NOT DEFAULT')
sns.scatterplot(x='AMT_GOODS_PRICE',y='AMT_INCOME_TOTAL',data=app0[app0.AMT_INCOME_TOTAL < 10000],alpha=0.1, color='green')
plt.xlabel('Goods Price in Thousands')
plt.ylabel('Income in Thousands')
plt.subplot(212)
plt.title('DEFAULTED')
sns.scatterplot(x='AMT_GOODS_PRICE',y='AMT_INCOME_TOTAL',data=app1[app1.AMT_INCOME_TOTAL < 10000],alpha=0.1, color='red')
plt.xlabel('Goods Price in Thousands')
plt.ylabel('Income in Thousands')
plt.savefig('appllesss')
plt.show()

In [None]:
#Let's check the pairplot for amount variables once, to spot any trends
amt = app[[ 'AMT_INCOME_TOTAL','AMT_CREDIT',
                         'AMT_ANNUITY', 'AMT_GOODS_PRICE',"TARGET"]]
amt = amt[(amt["AMT_GOODS_PRICE"].notnull()) & (amt["AMT_ANNUITY"].notnull())]
sns.pairplot(amt,hue="TARGET",palette=["b","r"])
plt.show()

### Observation
* A lot of defaulters were concentrated in the low income region.
* The annuity amount for the loans issued to these defaulters was also low, yet they defaulted on the payments.

In [None]:
app.head()

In [None]:
prev.head()

In [None]:
prev.shape

In [None]:
prev.describe()

### Cleaning the data in previous application file

In [None]:
#Inspecting null values
100*round(prev.isnull().sum()/len(prev),4)

In [None]:
#Retaining only those columns with less than 50% null values
prev=prev.loc[:,100*round(prev.isnull().sum()/len(prev),4)<50]

In [None]:
100*round(prev.isnull().sum()/len(prev),4)

In [None]:
#Dropping columns that we'll not be using 
prev.drop(['DAYS_FIRST_DRAWING','DAYS_FIRST_DUE','DAYS_LAST_DUE_1ST_VERSION','DAYS_LAST_DUE','DAYS_TERMINATION'],axis=1,inplace=True)

In [None]:
prev.shape

In [None]:
#Some columns can be imputed
#Checking what these columns are, in which we can safely impute values
cols_to_impute = list(prev.loc[:,(100*round(prev.isnull().sum()/len(prev),4) > 0) & (100*round(prev.isnull().sum()/len(prev),4) <25)].columns)
cols_to_impute

In [None]:
#Checking what these columns look like, and what data they hold
for i in enumerate(cols_to_impute):
    print(i[1],'\n')
    print((prev[i[1]].describe()))
    print('\n')

### Observation
* The data in 3 of these columns is numerical, and 1 is categorical with 17 unique values.

In [None]:
#We create another list of columns with only the numerical variables that we wish to impute null values for
num_cols_to_impute=cols_to_impute.copy()
num_cols_to_impute.remove('PRODUCT_COMBINATION')

In [None]:
plt.figure(figsize=(15,8))

for i in enumerate(num_cols_to_impute):
    plt.subplot(1,3,i[0]+1)
    sns.boxplot(y=i[1],data=prev)

plt.show()

In [None]:
#Let's also visualize the categorical variable and see what the spread is like
sns.countplot(y='PRODUCT_COMBINATION',data=prev)
plt.show()

### Observations
* If we wish to impute, we'll need to use the median for the columns AMT_ANNUITY and AMT_GOODS_PRICE since there are a significant number of outliers in the data which would skew the mean
* For imputing null values in the categorical variable PRODUCT_COMBINATION, we can go with the mode of the data, which is 'Cash'
* CNT_PAYMENT has an even spread, and we can choose mean for imputing values in this column

In [None]:
#Let's check if any numerical columns have negative values which don't make sense (for example, negative age)
prev.min()

In [None]:
prev.info()

In [None]:
#DAYS_DECISION signifies the number of days since the decision about the previous application was made
#this cannot be negative, let's change it to positive
prev['DAYS_DECISION']=prev.DAYS_DECISION.abs()

### Starting the analysis of the previous application file

In [None]:
#Let's see drill-down of each type of loan by the status
ax = pd.crosstab(prev["NAME_CONTRACT_TYPE"],prev["NAME_CONTRACT_STATUS"]).plot(kind="barh",figsize=(10,7),stacked=True)
plt.xticks(rotation =0)
plt.ylabel("count")
plt.title("Count of application status by application type")
plt.show()

### Observation:
* We see a huge number of canceled loans in the cash loans sector, and the same sector also attracts a lot of refused loans
* The highest number of approved loans lie in the consumer loans sector
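
Since contract types differ a lot in volume, the same crosstab can also be normalised per row to compare the share of each status within a contract type. The sketch below uses a small made-up frame with the same column names; the proportions are illustrative only.

```python
import pandas as pd

# Made-up previous applications: 4 cash loans and 4 consumer loans
toy = pd.DataFrame({
    "NAME_CONTRACT_TYPE": ["Cash loans"] * 4 + ["Consumer loans"] * 4,
    "NAME_CONTRACT_STATUS": [
        "Canceled", "Canceled", "Refused", "Approved",
        "Approved", "Approved", "Approved", "Refused",
    ],
})

# normalize="index" turns each row into proportions that sum to 1
share = pd.crosstab(
    toy["NAME_CONTRACT_TYPE"], toy["NAME_CONTRACT_STATUS"], normalize="index"
) * 100

print(share.round(1))
```

The result can be passed to `.plot(kind="barh", stacked=True)` in the same way as the count-based crosstab above.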

In [None]:
#Let's see the exact type of cash loans that get cancelled
plt.figure(figsize=(12,8))
sns.countplot(y='NAME_CASH_LOAN_PURPOSE',data=prev[(prev['NAME_CONTRACT_STATUS']=='Canceled') & (prev['NAME_CONTRACT_TYPE']=='Cash loans') & (prev['NAME_CASH_LOAN_PURPOSE'] != 'XNA')])
plt.xlabel('Number of Cancelled Loans')
plt.ylabel('Purpose of Loan')
plt.title('Cancelled Cash Loans - A Breakdown by purpose')
plt.savefig('CancelledPurposeCashLoan') #saving after the title is set so that it appears in the saved image
plt.show()

In [None]:
#Let's see the count of previous loans available for the current loans we have in the app dataframe
x = prev.groupby("SK_ID_CURR")["SK_ID_PREV"].count().reset_index()
plt.figure(figsize=(13,7))
ax = sns.histplot(x["SK_ID_PREV"],color="red",kde=True,stat="density") #sns.distplot is deprecated
plt.title("Current loan ID having previous loan applications")
plt.show()

In [None]:
#It's also observed that the loan application amount, and the actual amount credited was not the same.
#We see in the plots that these two factors differed by both positive and negative values
#This implies that while there were instances when the bank granted a loan for an amount less than the application amount, 
#there were also instances when the loan was granted for an amount higher than application amount
plt.figure(figsize=(12,13))
plt.subplot(211)
ax = sns.kdeplot(prev["AMT_APPLICATION"],color="b",linewidth=3)
ax = sns.kdeplot(prev[prev["AMT_CREDIT"].notnull()]["AMT_CREDIT"],color="r",linewidth=3)
plt.title("Previous loan amounts applied and loan amounts credited.")

plt.subplot(212)
diff = (prev["AMT_CREDIT"] - prev["AMT_APPLICATION"]).reset_index()
diff = diff[diff[0].notnull()]
ax1 = sns.kdeplot(diff[0],color="g",linewidth=3,label = "difference in amount requested by client and amount credited")
plt.title("difference in amount requested by client and amount credited")
plt.axvline(0,color="black",linestyle="dashed",label = "Zero")

In [None]:
prev.NAME_CONTRACT_STATUS.value_counts()

In [None]:
#Analysis of difference in credit amount and application amount seen against the application status
plt.figure(figsize=(12,13))
a1=sns.kdeplot(prev[prev.NAME_CONTRACT_STATUS == 'Approved']['AMT_CREDIT'] -
               prev[prev.NAME_CONTRACT_STATUS == 'Approved']['AMT_APPLICATION'],
               label = 'Approved', linewidth= 1)
a2=sns.kdeplot(prev[prev.NAME_CONTRACT_STATUS == 'Canceled']['AMT_CREDIT'] -
               prev[prev.NAME_CONTRACT_STATUS == 'Canceled']['AMT_APPLICATION'], 
               label = 'Canceled', linewidth= 1)
a3=sns.kdeplot(prev[prev.NAME_CONTRACT_STATUS == 'Refused']['AMT_CREDIT'] -
               prev[prev.NAME_CONTRACT_STATUS == 'Refused']['AMT_APPLICATION'], 
               label = 'Refused', linewidth= 1)
a4=sns.kdeplot(prev[prev.NAME_CONTRACT_STATUS == 'Unused offer']['AMT_CREDIT'] -
               prev[prev.NAME_CONTRACT_STATUS == 'Unused offer']['AMT_APPLICATION'], 
               label = 'Unused offer', linewidth= 1)
plt.axvline(0,color="black",linestyle="dashed",label = "Zero")
plt.title('Difference between credit and application amount against status of contract')

In [None]:
plt.figure(figsize=(12,13))
a1=sns.kdeplot(prev[prev.NAME_CONTRACT_STATUS == 'Approved']['AMT_CREDIT'], label = 'Approved', linewidth=3)
a2=sns.kdeplot(prev[prev.NAME_CONTRACT_STATUS == 'Canceled']['AMT_CREDIT'], label = 'Canceled', linewidth=3)
a3=sns.kdeplot(prev[prev.NAME_CONTRACT_STATUS == 'Refused']['AMT_CREDIT'], label = 'Refused', linewidth=3)
a4=sns.kdeplot(prev[prev.NAME_CONTRACT_STATUS == 'Unused offer']['AMT_CREDIT'], label = 'Unused offer', linewidth=3)
plt.title('Amount of Credit compared against contract status')

### Observation
* A large share of the cancelled applications is concentrated at low credit amounts

In [None]:
prev.info()

In [None]:
#let's see how the bank is treating its different types of clients
plt.figure(figsize=(10,6))
sns.countplot(y='NAME_CLIENT_TYPE',data=prev,hue='NAME_CONTRACT_STATUS')
plt.xlabel('Number of loans')
plt.ylabel('Type of Client')
plt.title('Status of each loan vs Type of Client (From Previous Data)')
plt.legend(title='Contract Status', loc= 'lower right')
plt.show()

In [None]:
#We now try to see how these approved loans turned out in each category.
newapprovals=prev[(prev['NAME_CONTRACT_STATUS']=='Approved')][['SK_ID_CURR','NAME_CLIENT_TYPE']]

In [None]:
newapprovals.head()

In [None]:
merged_new_approvals = newapprovals.merge(app, how='left', left_on='SK_ID_CURR', right_on='SK_ID_CURR')

In [None]:
#Keeping only the columns we need from the merge of previous and current application data
new_approvals_by_default = merged_new_approvals[['SK_ID_CURR','NAME_CLIENT_TYPE','TARGET']]
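
A left merge keeps every approved previous application, so a client with several previous loans appears several times, and any `SK_ID_CURR` missing from the application data ends up with a null `TARGET`. A small sketch on toy frames (`toy_prev` and `toy_app` are hypothetical stand-ins, not the real data) of how both effects could be checked:

```python
import pandas as pd

# Toy stand-ins: client 1 has two previous loans, client 3 is absent from the application data
toy_prev = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2, 3],
    "NAME_CLIENT_TYPE": ["New", "Repeater", "Repeater", "New"],
})
toy_app = pd.DataFrame({"SK_ID_CURR": [1, 2], "TARGET": [0, 1]})

# indicator=True adds a _merge column recording where each row was found
merged = toy_prev.merge(toy_app, how="left", on="SK_ID_CURR", indicator=True)

# Previous loans with no matching current application (TARGET is NaN for these)
unmatched = (merged["_merge"] == "left_only").sum()

# Number of merged rows contributed by each client
rows_per_client = merged.groupby("SK_ID_CURR").size()

print(unmatched)
print(rows_per_client)
```

Count plots silently skip the null `TARGET` rows, so it is worth knowing how many there are before reading the chart.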

In [None]:
#This plot shows all the loans which were marked as "Approved" in the previous applications, 
#and their statuses in the current applications.
plt.figure(figsize=(10,6))
ax=sns.countplot(y='NAME_CLIENT_TYPE',data=new_approvals_by_default,hue='TARGET')
plt.title('Previous Approved Loans - Current Status (Defaulted or Not)')
plt.ylabel('Type of Client')
plt.xlabel('Number of occurrences')
plt.legend(title='Target Variable',loc='lower right')
plt.show()

### Observation
* We see that the highest number of loan approvals occurs for repeat clients, while loans are very rarely refused to new customers.
* Repeat clients also account for the largest number of defaults.
* There is also a significant chunk of "New" clients with defaulted payments.
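
These observations are based on counts, and the largest group will usually show the most defaults simply because it has the most loans. A minimal sketch, on toy data with made-up values, of how counts and default rates could be compared side by side:

```python
import pandas as pd

# Toy data: repeat clients have more defaults in absolute terms but a lower rate
toy = pd.DataFrame({
    "NAME_CLIENT_TYPE": ["Repeater"] * 8 + ["New"] * 3,
    "TARGET": [0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1],
})

# TARGET is 0/1, so its mean within a group is that group's default rate
summary = toy.groupby("NAME_CLIENT_TYPE")["TARGET"].agg(
    n_loans="count",
    n_defaults="sum",
    default_rate="mean",
)
print(summary)
```

Applied to `new_approvals_by_default`, the same aggregation would show whether repeat clients default more often or are simply more numerous.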

In [None]:
#Checking how the category of goods being bought is impacting the loan approval status 
plt.figure(figsize=(10,12))
sns.countplot(y='NAME_GOODS_CATEGORY',hue='NAME_CONTRACT_STATUS',data=prev[prev.NAME_GOODS_CATEGORY!='XNA'])

In [None]:
#Let's see the pair plots between some numerical variables in the previous application data
amtp = prev[[ 'AMT_ANNUITY','AMT_APPLICATION','AMT_CREDIT', 'AMT_GOODS_PRICE', 'NAME_CONTRACT_STATUS']]
amtp = amtp[(amtp["AMT_GOODS_PRICE"].notnull()) & (amtp["AMT_ANNUITY"].notnull())]
sns.pairplot(amtp,hue="NAME_CONTRACT_STATUS")
plt.show()

In [None]:
#Let's check the correlation matrix for the previous application dataset
corr=prev[['AMT_ANNUITY','AMT_APPLICATION','AMT_CREDIT','AMT_GOODS_PRICE','HOUR_APPR_PROCESS_START','NFLAG_LAST_APPL_IN_DAY',
           'DAYS_DECISION','SELLERPLACE_AREA','CNT_PAYMENT','NFLAG_INSURED_ON_APPROVAL']].corr()


In [None]:
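#Keep only the upper triangle so that each pair appears once (np.bool was removed from NumPy, so we use the built-in bool)
corr = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))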
corrdf = corr.unstack().reset_index()
corrdf.head()

In [None]:
corrdf.columns = ['VAR1', 'VAR2', 'Correlation']
corrdf.dropna(subset = ['Correlation'], inplace = True)
corrdf['Correlation'] = round(corrdf['Correlation'], 2)
# We use the absolute value of the correlation coefficients since we are only interested in the strength of the relationship.
# The direction in which the entities are correlated is currently not of concern
corrdf['Correlation'] = corrdf['Correlation'].abs()

In [None]:
#Inspecting the top-10 correlated entities
corrdf.sort_values(by = 'Correlation', ascending = False).head(10)
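
The steps above (mask the upper triangle, flatten, take absolute values, sort) can also be written as one short chain. A sketch on synthetic data, where the columns `A`, `B` and `C` are invented for illustration:

```python
import numpy as np
import pandas as pd

# Synthetic data: B is almost a multiple of A, C is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "A": a,
    "B": 2 * a + rng.normal(scale=0.1, size=200),
    "C": rng.normal(size=200),
})

corr = toy.corr()

# Boolean mask that is True strictly above the diagonal
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# Flatten to (VAR1, VAR2) pairs, drop the masked cells, rank by strength
pairs = corr.where(mask).stack().dropna().abs().sort_values(ascending=False)
print(pairs)
```

The result is a Series indexed by variable pairs, so `pairs.head(10)` gives the ten strongest relationships.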

In [None]:
#AMT_APPLICATION vs AMT_ANNUITY
#Here we can notice a steady rise in the annuity amount with a rise in the application amount
plt.figure(figsize=(10,5))
sns.scatterplot(x='AMT_APPLICATION',y='AMT_ANNUITY',data=prev)
plt.show()

In [None]:
#AMT_ANNUITY vs AMT_CREDIT
#We see an almost proportional increase in annuity amount with an increase in the credit amount 
plt.figure(figsize=(10,5))
sns.scatterplot(y='AMT_ANNUITY',x='AMT_CREDIT',data=prev)
plt.show()

In [None]:
prev.columns

In [None]:
#Let's check the Amount of credit requested in various goods_categories
plt.figure(figsize =(15,10))
sns.barplot(x ='AMT_CREDIT', y="NAME_GOODS_CATEGORY", data = prev )
plt.show()

### Observation
* We have a major outlier here! House Construction has a far higher average credit amount than every other category

In [None]:
#Let's see the number of loans in this category and the status of these loans
sns.countplot(x='NAME_CONTRACT_STATUS', data = prev[prev.NAME_GOODS_CATEGORY == 'House Construction'])

In [None]:
prev[prev.NAME_GOODS_CATEGORY=='House Construction']

### Observation
* We get only one record, and as expected, the bank has refused this loan, which requested an exceptionally large credit amount
* This is an outlier in the data, but it seems to be a genuine record since the application was refused.
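
A bar plot of `AMT_CREDIT` shows the mean per category, so a category with a single extreme record can dominate the chart. A small sketch on toy data (category names other than House Construction and all amounts are made up) of how the record count and the median could be shown next to the mean:

```python
import pandas as pd

# Toy data: one very large application in a category with a single record
toy = pd.DataFrame({
    "NAME_GOODS_CATEGORY": ["Mobile"] * 4 + ["Furniture"] * 3 + ["House Construction"],
    "AMT_CREDIT": [50, 60, 55, 65, 120, 100, 110, 5000],
})

# Count, mean and median per category, ordered by mean
summary = (
    toy.groupby("NAME_GOODS_CATEGORY")["AMT_CREDIT"]
    .agg(["count", "mean", "median"])
    .sort_values("mean", ascending=False)
)
print(summary)
```

A `count` of 1 next to the largest mean flags immediately that the bar rests on a single application.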

In [None]:
prev['PRODUCT_COMBINATION'].value_counts().plot(kind = 'barh')

In [None]:
#Cash loans are seeing the most cancellations, while almost all the unused offers lie in the 'POS mobile with interest' combo
plt.figure(figsize=(8,12))
sns.countplot(y='PRODUCT_COMBINATION', data = prev, hue = 'NAME_CONTRACT_STATUS')

In [None]:
prev['CNT_PAYMENT'].value_counts()

In [None]:
#We will bin the CNT_PAYMENT variable to analyze it categorically
prev['CNT_PAYMENT_BINNED']=pd.cut(prev['CNT_PAYMENT'], bins=[0,10,20,30,40,50,60,70,80,90], 
                         labels = ['0 - 10','10 - 20', '20 - 30','30 - 40','40 - 50','50 - 60','60 - 70','70 - 80','80 - 90'])
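
By default `pd.cut` builds intervals that are open on the left and closed on the right, so with these bins a term of exactly 0 falls outside the first interval and becomes NaN. A sketch on a toy Series (values chosen for illustration) of the default behaviour next to `include_lowest=True`:

```python
import numpy as np
import pandas as pd

# Toy payment terms, including the edge values 0, 10 and 60
terms = pd.Series([0, 6, 10, 12, 60, 84, np.nan])

# Same edges and label style as the binning above, generated instead of typed out
edges = list(range(0, 100, 10))
labels = [f"{lo} - {hi}" for lo, hi in zip(edges[:-1], edges[1:])]

# Default: (0, 10], (10, 20], ... so 0 is not assigned to any bin
default_bins = pd.cut(terms, bins=edges, labels=labels)

# include_lowest=True closes the first interval on the left, so 0 lands in '0 - 10'
closed_low = pd.cut(terms, bins=edges, labels=labels, include_lowest=True)

print(pd.DataFrame({"term": terms, "default": default_bins, "include_lowest": closed_low}))
```

Whether zeros should be binned or left out depends on what a term of 0 means in the data, so this is a choice to make after inspecting the value counts rather than a fix.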

In [None]:
prev['CNT_PAYMENT_BINNED'].value_counts()

In [None]:
#We see an interesting pattern when analyzing the term of previous credit against the contract statuses in the previous application dataset
ax = pd.crosstab(prev["CNT_PAYMENT_BINNED"],prev["NAME_CONTRACT_STATUS"]).plot(kind="barh",figsize=(12,5),stacked=True)
plt.ylabel("Term of Previous Credit")
plt.xlabel('Number of applications')
plt.title('Distribution of applications by contract status vs term of previous credit')
plt.show()

In [None]:
#We try to cross-reference this column against the application dataset to see how these approved loans are performing

In [None]:
newapprovals=prev[(prev['NAME_CONTRACT_STATUS']=='Approved')][['SK_ID_CURR','CNT_PAYMENT_BINNED']]

In [None]:
newapprovals.head()

In [None]:
merged_new_approvals = newapprovals.merge(app, how='left', left_on='SK_ID_CURR', right_on='SK_ID_CURR')

In [None]:
#Keeping only the columns we need from the merge of previous and current application data
new_approvals_by_default = merged_new_approvals[['SK_ID_CURR','CNT_PAYMENT_BINNED','TARGET']]

In [None]:
#Now we see the current status of the approved loans against the term of the previous credit
plt.figure(figsize=(12,5))
sns.countplot(x='CNT_PAYMENT_BINNED', hue='TARGET', data=new_approvals_by_default)
plt.xlabel('Term of previous credit')
plt.ylabel('Number of applications')
plt.title('Approved applications and their default status vs Term of Previous Credit')