This notebook contains an exploratory data analysis (EDA) of the credit-card-approval-prediction datasets, in support of the following modeling notebook:

https://www.kaggle.com/hungndo/credit-modeling-models

In [None]:
import numpy as np 
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
application = pd.read_csv('/kaggle/input/credit-card-approval-prediction/application_record.csv')
credit = pd.read_csv('/kaggle/input/credit-card-approval-prediction/credit_record.csv')

In [None]:
len(set(application['ID']))  # number of unique applicant IDs in application: 438510

In [None]:
len(set(credit['ID']))  # number of unique IDs in credit

In [None]:
len(set(application['ID']).intersection(set(credit['ID'])))  # number of IDs the two tables share

In [None]:
# Restrict both tables to the IDs that appear in both datasets
ids = set(application['ID']).intersection(set(credit['ID']))
application = application[application['ID'].isin(ids)]
credit = credit[credit['ID'].isin(ids)]

In [None]:
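def barplot(label, value, title='', normalize=False):
    """Bar chart of value per label; with normalize=True, show percentages of the total."""
    fig = plt.figure(figsize=(len(value) * 2, 5))
    ax = fig.add_subplot()
    if normalize:
        ax.bar(label, value / value.sum() * 100)
    else:
        ax.bar(label, value)
    ax.set_title(title)

def histogram(value, bins=10, title='', normalize=False):
    """Histogram of value; with normalize=True, plot a density instead of raw counts."""
    fig = plt.figure(figsize=(10, 5))
    ax = fig.add_subplot()
    # density=True rescales the bars so that the histogram integrates to 1
    ax.hist(value, bins=bins, density=normalize)
    ax.set_title(title)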


# **APPLICATION RECORD**

In [None]:
application.info()

! OCCUPATION_TYPE contains null values that need to be handled

**Categorical**

In [None]:
x = application['CODE_GENDER'].value_counts()
barplot(x.index,x,'Male/Female')

In [None]:
x = application['FLAG_OWN_CAR'].value_counts()
barplot(x.index,x, 'Own car')

In [None]:
x = application['FLAG_OWN_REALTY'].value_counts()
barplot(x.index,x,'Own realty')

In [None]:
x = application['NAME_INCOME_TYPE'].value_counts()
barplot(x.index,x,'income type')

In [None]:
x = application['NAME_EDUCATION_TYPE'].value_counts()
barplot(x.index,x,'education type')

In [None]:
x = application['NAME_FAMILY_STATUS'].value_counts()
barplot(x.index,x,'family status')

In [None]:
x = application['NAME_HOUSING_TYPE'].value_counts()
barplot(x.index,x,'housing type')

In [None]:
pd.DataFrame(application['OCCUPATION_TYPE']).info()

OCCUPATION_TYPE has fewer non-null entries than rows, so it contains null values.
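
One possible way to handle these nulls is to keep the rows and treat a missing occupation as its own category. The sketch below runs on a toy frame; the rows and the placeholder label `'Unknown'` are assumptions for illustration, not values from the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the application table (illustrative values only).
toy = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'OCCUPATION_TYPE': ['Laborers', np.nan, 'Managers', np.nan],
})

# Share of rows with a missing occupation.
missing_share = toy['OCCUPATION_TYPE'].isna().mean()

# Keep every row and mark missing occupations with a placeholder label.
# 'Unknown' is an assumed name, not a category that exists in the dataset.
toy['OCCUPATION_TYPE'] = toy['OCCUPATION_TYPE'].fillna('Unknown')

print(missing_share)
print(toy['OCCUPATION_TYPE'].value_counts())
```

Whether a placeholder category, imputation, or dropping the column works best would have to be decided during modeling.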

In [None]:
x = application['OCCUPATION_TYPE'].value_counts()
barplot(x.index,x,'occupation type')

Although the last few categories are very small compared to the others, they may still add some value to the model without causing overfitting. However, converting OCCUPATION_TYPE into dummy variables adds 18 more columns, which can be expensive. One option is therefore to combine the smallest categories into a single generic value ("others").
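
A minimal sketch of that merging step, assuming a frequency cutoff decides which categories count as small. The helper `merge_rare`, the 5% cutoff, and the toy counts are all hypothetical choices made for illustration.

```python
import pandas as pd

def merge_rare(series, min_share=0.05, other_label='others'):
    """Replace categories whose relative frequency is below min_share."""
    shares = series.value_counts(normalize=True)
    rare = shares[shares < min_share].index
    # Series.where keeps values where the condition holds and
    # substitutes other_label everywhere else.
    return series.where(~series.isin(rare), other_label)

# Toy occupation column (made-up counts).
toy = pd.Series(
    ['Laborers'] * 50 + ['Managers'] * 45 + ['Realty agents'] * 3 + ['HR staff'] * 2
)

merged = merge_rare(toy, min_share=0.05)
print(merged.value_counts())
```

Missing values pass through unchanged, so this step can run before or after null handling.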

In [None]:
application['FLAG_MOBIL'].value_counts()

# => Every applicant owns a mobile phone, so FLAG_MOBIL is constant. The column adds no information and can be dropped.
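
The same check can be applied to every column at once. The sketch below uses a toy frame with made-up values and drops any column that holds a single distinct value.

```python
import pandas as pd

# Toy frame: FLAG_MOBIL is constant, the other columns are not.
toy = pd.DataFrame({
    'ID': [1, 2, 3],
    'FLAG_MOBIL': [1, 1, 1],
    'FLAG_PHONE': [0, 1, 0],
})

# A column with one distinct value (nulls counted as a value) cannot
# help a model separate the rows.
constant_cols = [col for col in toy.columns if toy[col].nunique(dropna=False) == 1]
toy = toy.drop(columns=constant_cols)

print(constant_cols)
print(list(toy.columns))
```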

In [None]:
x = application['FLAG_WORK_PHONE'].value_counts()
barplot(x.index,x, title='FLAG WORK PHONE')

In [None]:
x = application['FLAG_PHONE'].value_counts()
barplot(x.index,x, title='FLAG PHONE')

In [None]:
x = application['FLAG_EMAIL'].value_counts()
barplot(x.index, x, title='FLAG EMAIL')

**Quantitative**

In [None]:
histogram(application['CNT_CHILDREN'], bins=20, title='Count Children')

In [None]:
histogram(application['CNT_FAM_MEMBERS'], bins=20, title='CNT_FAM_MEMBERS')

In [None]:
histogram(application['DAYS_BIRTH'], bins=100, title='DAYS_BIRTH')

# Convert DAYS_BIRTH to age in years for easier interpretation. The plot is mirrored horizontally because DAYS_BIRTH is negative while age is positive.
histogram(-application['DAYS_BIRTH']/365, bins=100, title='AGE')

The age distribution looks approximately normal.
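
A visual impression like this can be backed by numbers such as the sample skewness and excess kurtosis, both of which are close to 0 for a normal distribution. The sketch below uses synthetic ages (the mean, spread, and sample size are made up); on the real data the input would be `-application['DAYS_BIRTH']/365`.

```python
import numpy as np
import pandas as pd

# Synthetic ages drawn from a normal distribution (illustrative only).
rng = np.random.default_rng(0)
ages = pd.Series(rng.normal(loc=44, scale=11, size=10_000))

# pandas reports Fisher (excess) kurtosis, so a normal sample gives about 0.
skewness = ages.skew()
excess_kurtosis = ages.kurt()

print(f'skewness: {skewness:.3f}')
print(f'excess kurtosis: {excess_kurtosis:.3f}')
```

Values far from 0 would suggest that the bell shape is only superficial.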

In [None]:
histogram(application['AMT_INCOME_TOTAL'],bins=100, title='AMT_INCOME_TOTAL')

In [None]:
histogram(application['DAYS_EMPLOYED'], bins=100, title='DAYS_EMPLOYED')

# Convert to years for easier interpretation
histogram(application['DAYS_EMPLOYED']/365, bins=100, title='YEARS_EMPLOYED')

This does not mean that some people have been employed for 1000 years!
According to the dataset description, DAYS_EMPLOYED counts backwards from the current day (0), and a positive value means the person is currently unemployed.
Let's plot DAYS_EMPLOYED again after removing the unemployed applicants.

In [None]:
# Keep only employed applicants (DAYS_EMPLOYED <= 0)
employed_days = application.loc[application['DAYS_EMPLOYED'] <= 0, 'DAYS_EMPLOYED']
histogram(employed_days, bins=100, title='DAYS_EMPLOYED')
histogram(employed_days / 365, bins=100, title='YEARS_EMPLOYED')
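
For preprocessing, the unemployed rows do not have to be removed. One option is to record employment status in a separate binary column and neutralise the positive placeholder. The sketch below is a minimal version on toy data; the column names `IS_UNEMPLOYED` and `YEARS_EMPLOYED` and the toy values are hypothetical.

```python
import pandas as pd

# Toy values: negative = days employed so far, positive = currently unemployed.
toy = pd.DataFrame({'DAYS_EMPLOYED': [-1200, -365, 365243, -30]})

# Binary flag: 1 when DAYS_EMPLOYED is positive (unemployed), else 0.
toy['IS_UNEMPLOYED'] = (toy['DAYS_EMPLOYED'] > 0).astype(int)

# Employment length in positive years; unemployed rows get 0 instead of
# a huge placeholder value that would distort the distribution.
toy['YEARS_EMPLOYED'] = (-toy['DAYS_EMPLOYED'] / 365).where(
    toy['DAYS_EMPLOYED'] <= 0, 0.0
)

print(toy)
```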

# **CREDIT RECORD**

In [None]:
credit.info()

There is already a great vintage analysis of the credit dataset at https://www.kaggle.com/rikdifos/eda-vintage-analysis, so I'm going to reuse that work for this project, including its definition of good/bad credit.
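
For orientation only, a good/bad label per customer could be derived roughly as sketched below. Everything here is an assumption: the toy rows, the reading of `STATUS` codes `'2'` to `'5'` as 60 or more days overdue, and the rule that a single such month makes a customer bad. The linked vintage analysis remains the reference for the actual definition.

```python
import pandas as pd

# Toy credit history: one row per customer per month (made-up values).
toy_credit = pd.DataFrame({
    'ID':             [1, 1, 1, 2, 2, 3],
    'MONTHS_BALANCE': [0, -1, -2, 0, -1, 0],
    'STATUS':         ['C', '0', '2', 'X', '0', '5'],
})

# Assumed rule: these codes mark a seriously overdue month.
BAD_STATUSES = {'2', '3', '4', '5'}

# A customer is labelled bad (1) if any month falls into BAD_STATUSES.
toy_credit['IS_BAD_MONTH'] = toy_credit['STATUS'].isin(BAD_STATUSES)
labels = (
    toy_credit.groupby('ID')['IS_BAD_MONTH']
    .any()
    .astype(int)
    .rename('BAD')
)

print(labels)
```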

# **OTHERS**

In [None]:
fig = plt.figure(figsize=(10,5))
ax = fig.add_subplot()
ax.set_title('Correlation Plot', fontsize=20)
sns.heatmap(application[['CNT_CHILDREN', 'AMT_INCOME_TOTAL', 'DAYS_BIRTH', 'DAYS_EMPLOYED', 'FLAG_WORK_PHONE', 
                         'FLAG_PHONE', 'FLAG_EMAIL', 'CNT_FAM_MEMBERS']].corr(), ax=ax)

There's a strong correlation between DAYS_EMPLOYED and DAYS_BIRTH, and between CNT_FAM_MEMBERS and CNT_CHILDREN.
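
Pairs like these can also be listed programmatically instead of being read off the heatmap. The sketch below is one way to do it; the helper `high_corr_pairs`, the 0.8 threshold, and the toy values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df, threshold=0.8):
    """Return (col_a, col_b, corr) for column pairs with |corr| >= threshold."""
    corr = df.corr()
    # Keep only the strict upper triangle so each pair appears once
    # and the diagonal (always 1.0) is ignored.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    # NaN entries (masked cells) never pass this comparison.
    pairs = pairs[pairs.abs() >= threshold]
    return [(a, b, round(float(v), 3)) for (a, b), v in pairs.items()]

# Toy numeric columns (made-up values).
toy = pd.DataFrame({
    'CNT_CHILDREN':    [0, 1, 2, 0, 3, 1],
    'CNT_FAM_MEMBERS': [2, 3, 4, 1, 5, 2],
    'FLAG_EMAIL':      [0, 1, 0, 1, 0, 1],
})

pairs_found = high_corr_pairs(toy)
print(pairs_found)
```

For each reported pair, one of the two columns is a candidate for removal or for being combined into a single feature.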

# **TO DO FOR DATA PREPROCESSING**

* Drop all cases that are not shared between two datasets before processing any data
* Drop FLAG_MOBIL column
* Handle null values in OCCUPATION_TYPE
* Consider merging the last few values of OCCUPATION_TYPE to some generic value ("others")
* Handle unemployed cases in DAYS_EMPLOYED (we can probably add a binary column that specifies whether the person is employed)
* Research whether the skewed distributions of DAYS_EMPLOYED, CNT_CHILDREN and CNT_FAM_MEMBERS can affect the model; if so, apply a log transform
* Handle multicollinearity after plotting the correlation chart
* Consider using weight of evidence (WOE) to remove columns that have low weights
* For the credit dataset, classify customers as good or bad
* Encode variables
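
As a rough illustration of the WOE item above, the sketch below computes the weight of evidence per category and the resulting information value (IV) for one categorical feature against a binary target. The helper `woe_iv`, the target column name `BAD`, the 0.5 smoothing constant, and the toy counts are assumptions; WOE is taken here as ln(share of goods / share of bads).

```python
import numpy as np
import pandas as pd

def woe_iv(df, feature, target):
    """WOE per category and total IV; target is 1 for bad, 0 for good."""
    counts = df.groupby(feature)[target].agg(bad='sum', total='count')
    counts['good'] = counts['total'] - counts['bad']
    # Add 0.5 to every count so empty cells do not produce log(0).
    good = counts['good'] + 0.5
    bad = counts['bad'] + 0.5
    dist_good = good / good.sum()
    dist_bad = bad / bad.sum()
    woe = np.log(dist_good / dist_bad)
    iv = float(((dist_good - dist_bad) * woe).sum())
    return woe, iv

# Toy data: car owners have relatively fewer bad outcomes (made-up counts).
toy = pd.DataFrame({
    'FLAG_OWN_CAR': ['Y'] * 50 + ['N'] * 50,
    'BAD': [0] * 40 + [1] * 10 + [0] * 30 + [1] * 20,
})

woe, iv = woe_iv(toy, 'FLAG_OWN_CAR', 'BAD')
print(woe)
print(f'IV: {iv:.3f}')
```

A feature whose IV is close to 0 separates good from bad customers poorly and is a candidate for removal; the cutoff to use is a modeling decision.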

# **ACKNOWLEDGEMENTS**

* https://www.kaggle.com/rikdifos/eda-vintage-analysis