In [None]:
import numpy as np
import pandas as pd
import os
import pandas_profiling as ppf
import matplotlib.pyplot as plt
import gc

In [None]:
appl = pd.read_csv("/kaggle/input/credit-card-approval-prediction/application_record.csv")
cred = pd.read_csv("/kaggle/input/credit-card-approval-prediction/credit_record.csv")

Let's quickly check if we have any missing data

And we have missing values only in the `OCCUPATION_TYPE` column of the application data

In [None]:
print(appl.isna().sum().loc[appl.isna().sum() > 0])
print(cred.isna().sum().loc[cred.isna().sum() > 0])

Actually, we can just replace the `nan` values with `'NA'` here - and use it as a new category

In [None]:
appl = appl.fillna('NA')
print(appl.OCCUPATION_TYPE.unique())
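
As a small illustration of the check-and-fill step, here is a minimal sketch on a made-up frame (`toy_appl` and its values are assumptions, not the Kaggle data). It fills only the `OCCUPATION_TYPE` column, which would leave other columns untouched if missing values ever showed up elsewhere:

```python
import numpy as np
import pandas as pd

# toy stand-in for the application data (values are invented)
toy_appl = pd.DataFrame({
    "ID": [1, 2, 3],
    "OCCUPATION_TYPE": ["Managers", np.nan, "Drivers"],
})

# count missing values per column and keep only the columns that have any
missing_counts = toy_appl.isna().sum()
missing_counts = missing_counts[missing_counts > 0]
print(missing_counts)

# treat "missing" as its own category, for this column only
toy_appl["OCCUPATION_TYPE"] = toy_appl["OCCUPATION_TYPE"].fillna("NA")
print(toy_appl["OCCUPATION_TYPE"].unique())
```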

### Finalizing the response variable

Here, the label is not given - so, we analyze each customer's behaviour after the loan was sanctioned to derive a label for whether the loan should have been given or not

> Here, we replace the values `C` and `X` (which signify either a paid loan or no loan for the month) with the value `-1` - so that the whole column can be cast to `int`

In [None]:
# loan paid / not existing is set as -1
cred.STATUS = cred.STATUS.replace(['C','X'], -1).astype('int8')
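
A minimal sketch of the same encoding on a toy series (the values are invented for illustration). Mapping `C` and `X` to the string `'-1'` first keeps the column homogeneous before the cast - this is an alternative way to write the cell above, not a correction of it:

```python
import pandas as pd

# toy STATUS values as they come from the CSV, i.e. as strings
toy_status = pd.Series(["0", "C", "X", "2"], name="STATUS")

# 'C' (paid off) and 'X' (no loan) both become -1, then everything is cast to int8
encoded = toy_status.replace({"C": "-1", "X": "-1"}).astype("int8")
print(encoded.tolist())
```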

Now, we calculate the count of each `STATUS` for each customer
> this information will be used to see how many times a customer falls into a `STATUS` that's not worthwhile

In [None]:
# pivot by status
df = (cred.pivot_table(values='MONTHS_BALANCE', index='ID', columns='STATUS', aggfunc='count')
          .fillna(0)
          .assign(TOTAL_MONTHS=lambda x: x.sum(axis=1)).reset_index())
df.columns.name = ''
df.columns = ['STATUS_'+str(i) if i not in ['ID', 'TOTAL_MONTHS'] else i for i in df.columns]

# join back to see if the loan already terminated
# if latest (max) MONTHS_BALANCE != 0, then loan must have ended
cred = (cred.groupby(['ID'], as_index=False)['MONTHS_BALANCE'].max()
            .assign(LOAN_TERMINATED = lambda x: x.MONTHS_BALANCE<0)[['ID', 'LOAN_TERMINATED']].astype('int')
            .merge(df, how='inner', on='ID'))
del df
gc.collect()

cred.sample(2)
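
To make the shape of the resulting table concrete, here is a minimal sketch of the same pivot on a toy credit record (two invented customers; `toy`, `counts` and `toy_summary` are hypothetical names). Customer 1 has a record at month `0`, so the loan is still running; customer 2's latest record is at month `-3`, so it counts as terminated:

```python
import pandas as pd

toy = pd.DataFrame({
    "ID":             [1, 1, 1, 2, 2],
    "MONTHS_BALANCE": [0, -1, -2, -3, -4],
    "STATUS":         [0, 0, -1, 2, -1],
})

# one row per customer, one column per status, cells = number of months in that status
counts = (toy.pivot_table(values="MONTHS_BALANCE", index="ID",
                          columns="STATUS", aggfunc="count")
             .fillna(0))
counts.columns = [f"STATUS_{c}" for c in counts.columns]
counts["TOTAL_MONTHS"] = counts.sum(axis=1)

# the loan is treated as terminated when the latest month on record is before month 0
terminated = ((toy.groupby("ID")["MONTHS_BALANCE"].max() < 0)
              .astype(int)
              .rename("LOAN_TERMINATED"))

toy_summary = counts.join(terminated).reset_index()
print(toy_summary)
```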

In [None]:
%%capture

cred_profile = ppf.ProfileReport(cred)

So, after we normalize the counts in each status by `TOTAL_MONTHS`, we see that 52% of all statuses are in `-1` i.e. either **paid off or no loan** - these are obviously the customers we want

And, 46% are in `0` - which is **1-29 days past due** - so, this is very common - but, we may still want to limit how often it happens

1.2% are in `1` - which is **30-59 days past due** - this seems to be the anomaly territory we want to avoid

The rest are overdue for longer durations or outright write-offs - given their relatively low frequency and propensity for loss, we may want to outright tag the customers with these behaviours as anomalies

In [None]:
cred_norm = cred.copy()
for i in cred_norm.columns:
    if 'STATUS' in i:
        cred_norm[i] = cred_norm[i] / cred_norm['TOTAL_MONTHS']
        
cred_norm.describe()
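
The normalization itself is just a row-wise division. A minimal sketch on invented counts (`toy_counts` is a hypothetical stand-in for `cred`), with a sanity check that the status shares of each customer add up to 1:

```python
import pandas as pd

toy_counts = pd.DataFrame({
    "ID":           [1, 2],
    "STATUS_-1":    [1, 1],
    "STATUS_0":     [2, 0],
    "STATUS_2":     [0, 1],
    "TOTAL_MONTHS": [3, 2],
})

status_cols = [c for c in toy_counts.columns if c.startswith("STATUS_")]

# divide every status count by the customer's total number of months
toy_norm = toy_counts.assign(
    **{c: toy_counts[c] / toy_counts["TOTAL_MONTHS"] for c in status_cols}
)

# each customer's shares should sum to 1
print(toy_norm[status_cols].sum(axis=1))
```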

The profiling report below captures the behaviour described above with even more detail

Looking at correlation, we see that (as expected) statuses `2`, `3`, `4`, `5` are very highly correlated as well
Although `STATUS_2` shares a medium correlation with `STATUS_1` as well

This gives credence to the idea that we should dock points for having too many `STATUS_1`

**In a proper business context, this needs to be decided after looking into the economic impact and/or consulting the business knowledge of the client**

Here, we will go ahead with a bit of analysis paired with our own rationale

In [None]:
%%capture

cred_norm_profile = ppf.ProfileReport(cred_norm)

In [None]:
cred_norm_profile

Let's see how many customers have statuses `>1` - overdue by 60 days or more

> we calculate the percentage of statuses `>1` normalized by total duration

Counting every customer with at least 1 status of `2 or higher` gives a `1.45%` risky population

In [None]:
# STATUS_2 and upwards are highly correlated - and also, signifies overdue - so, should be of concern

cols_2to5    = [i for i in cred.columns if ('STATUS' in i) and (int(i.split('_')[1]) >= 2)]
concern_cols = cols_2to5

def check_risky_population(concern_cols, min_cutoff=0, max_cutoff=50, step=5):
    df = pd.DataFrame(['False', 'True'], columns=['IS_RISKY'])

    for i in range(min_cutoff, max_cutoff, step):
        _ = (cred_norm.assign(IS_RISKY = lambda x: x[concern_cols].apply(sum, axis=1) > (i/100))
                      .groupby(['IS_RISKY'], as_index=False)['ID'].count()
            )
        df['at '+str(i)+'% cutoff'] = _['ID'] * 100 / cred_norm['ID'].count()

    return df

check_risky_population(concern_cols)
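
The cutoff sweep boils down to: sum the shares in the columns of concern, compare against the cutoff, and take the mean of the resulting boolean. A minimal sketch on invented shares (`risky_share` and `toy_norm` are hypothetical, and the cutoffs are arbitrary):

```python
import pandas as pd

toy_norm = pd.DataFrame({
    "ID":       [1, 2, 3, 4],
    "STATUS_2": [0.0, 0.10, 0.0, 0.30],
    "STATUS_3": [0.0, 0.00, 0.0, 0.10],
})

def risky_share(frame, concern_cols, cutoffs):
    """Percent of customers whose summed share in `concern_cols` exceeds each cutoff (in %)."""
    total = frame[concern_cols].sum(axis=1)
    return pd.Series({f"at {c}% cutoff": (total > c / 100).mean() * 100 for c in cutoffs})

shares = risky_share(toy_norm, ["STATUS_2", "STATUS_3"], cutoffs=[0, 5, 20])
print(shares)
```

Taking the mean of a boolean series gives the fraction of `True` values directly, so there is no need to group and count.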

So, let's look a bit more into `STATUS_1` - 30-59 days past due (on average it shows up in 1.2% of all the months)
> it also shows medium correlation with `STATUS_2`

We look at what happens if we classify as risky everyone whose share of `STATUS_1` months is above a given cutoff

Deeming people with more than `30%` of months in `STATUS_1` as risky gives a further `0.68%` (including overlaps)

In [None]:
concern_cols = [i for i in cred.columns if ('STATUS' in i) and (int(i.split('_')[1]) == 1)]

check_risky_population(concern_cols)

Also, we see that at the `0.3` decile of `STATUS_1` the average % for being in statuses `>=2` shoots up to `2.4%` - while `STATUS_-1` (i.e. no debt) is down to `17%` 

> Now, based on this, if we want to be more **conservative**, we could choose `0.5` as the cut-off as well - because there `STATUS_-1` reaches just `10%`

But, with increasing decile, the corresponding count also decreases quite rapidly

In [None]:
cols = ['STATUS_1_decile', 'STATUS_-1', 'STATUS_0', 'STATUS_1', 'STATUS_2to5','ID']

(cred_norm.assign(STATUS_1_decile = lambda x: x.STATUS_1.round(1))
          .assign(STATUS_2to5 = lambda x: x[cols_2to5].apply(sum, axis=1))
          .groupby(['STATUS_1_decile'], as_index=False).agg({i:('mean' if i!='ID' else 'count') for i in cols})
)
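
Note that the "decile" here is simply the `STATUS_1` share rounded to one decimal, i.e. a 0.1-wide bucket rather than a quantile. A minimal sketch of that bucketing on invented shares (`toy_norm` and the output column names are hypothetical):

```python
import pandas as pd

toy_norm = pd.DataFrame({
    "ID":        [1, 2, 3, 4],
    "STATUS_1":  [0.02, 0.04, 0.31, 0.28],
    "STATUS_-1": [0.60, 0.50, 0.20, 0.10],
})

# round the STATUS_1 share to one decimal and summarise each bucket
bucketed = (toy_norm.assign(STATUS_1_bucket=lambda x: x["STATUS_1"].round(1))
                    .groupby("STATUS_1_bucket", as_index=False)
                    .agg(N=("ID", "count"), MEAN_PAID=("STATUS_-1", "mean")))
print(bucketed)
```

Keeping the count next to the mean makes it easier to see when a bucket's average rests on very few customers.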

Let's do the same for `STATUS_0` - 1-29 days past due

And as we had seen before (`STATUS_0` had 46% of all months) - this is a very **common** scenario indeed

In [None]:
concern_cols = [i for i in cred.columns if ('STATUS' in i) and (int(i.split('_')[1]) == 0)]

check_risky_population(concern_cols, min_cutoff=50, max_cutoff=100, step=5)

So, finally we go ahead and label as risky
> customers who have `>=1` instances of statuses `>=2`

> or, customers who have more than `30%` of their total months as `STATUS_1`

And in total we get `2.03%` as risky!

> btw this again needs to be validated with business / domain knowledge

**iteration 2 note:**

Sadly, when we model the data - the performance becomes incredibly flaky - the good performance in CV and validation doesn't carry forward to test

One reason might be that we have been incredibly loose with the definition of bad customer - so, a lot of the customers might have been termed as bad although they share a very similar profile to good ones

To test for that quickly, we will just restrict our criterion right now - only to statuses `>=4` - and see how that affects!

<s>**iteration 3 note:**

iteration 2 shows that status 4 and 5 are indeed better as labels - because we see a well enough performance transfer for valid to test

Let's try and include `status 3` as well - and see how that carries over </s>

In [None]:
STATUS_1_limit = 0.3

In [None]:
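# iteration 2: only statuses 4 and 5 count towards the risky label
cols_4to5 = [i for i in cred.columns if ('STATUS' in i) and (int(i.split('_')[1]) >= 4)]
# reuse the old name so the cells below pick up the stricter definition
# (anything called `STATUS_2to5` from here on covers statuses 4-5 only)
cols_2to5 = cols_4to5

# a share can never exceed 1, so this switches the STATUS_1 rule off
STATUS_1_limit = 1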

In [None]:
cred_norm = cred_norm.assign(IS_RISKY = lambda x: ((x.STATUS_1 > STATUS_1_limit)
                                                   | (x[cols_2to5].sum(axis=1) > 0)
                                                  ).astype('int8')
                            )

(cred_norm.groupby("IS_RISKY", as_index=False)['ID'].count()
          .assign(PERCENT = lambda x: x['ID'] * 100 / cred_norm.ID.count())
)
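
To see how the two iterations differ, here is a minimal sketch of the labelling rule on three invented customers (`label_risky` and `toy_norm` are hypothetical names). The rule is an OR of two conditions: too large a share of `STATUS_1` months, or any month at all in the severe statuses:

```python
import pandas as pd

toy_norm = pd.DataFrame({
    "ID":       [1, 2, 3],
    "STATUS_1": [0.0, 0.5, 0.1],
    "STATUS_4": [0.0, 0.0, 0.2],
    "STATUS_5": [0.0, 0.0, 0.0],
})

def label_risky(frame, severe_cols, status_1_limit):
    """1 if the STATUS_1 share exceeds the limit or any severe status occurred, else 0."""
    too_many_status_1 = frame["STATUS_1"] > status_1_limit
    any_severe = frame[severe_cols].sum(axis=1) > 0
    return (too_many_status_1 | any_severe).astype("int8")

severe = ["STATUS_4", "STATUS_5"]
with_status_1_rule = label_risky(toy_norm, severe, status_1_limit=0.3)
without_status_1_rule = label_risky(toy_norm, severe, status_1_limit=1)
print(with_status_1_rule.tolist(), without_status_1_rule.tolist())
```

Customer 2 is flagged only while the `STATUS_1` rule is active; with the limit set to `1` that rule can never fire, since a share cannot exceed 1.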

Now, let's quickly check what impact `TOTAL_MONTHS` i.e. the duration of the loan has

Interestingly, with increasing loan duration, the proportion of `STATUS_-1` increases - while that for `STATUS_0` decreases
> although their `sum` holds to be pretty stable

And as expected, the proportion of `STATUS_2to5` increases - leading to increase in `IS_RISKY` as well!
> this can be used to set separate criteria (by duration) for `IS_RISKY` as well

*sidenote:* `LOAN_TERMINATED` as expected is inversely related to the loan duration

In [None]:
cols = ['STATUS_-1', 'STATUS_0', 'STATUS_1', 'STATUS_2to5', 'LOAN_TERMINATED', 'IS_RISKY', 'ID']

(cred_norm.assign(loan_duration = pd.cut(cred_norm['TOTAL_MONTHS'],12))
          .assign(STATUS_2to5 = lambda x: x[cols_2to5].apply(sum, axis=1))
          .groupby(['loan_duration'], as_index=False)
          .agg({i:(('mean' if i not in ['IS_RISKY'] else ['mean', 'sum']) 
                   if i!= 'ID' else 'count'
                  ) for i in cols
               })
)

Finally, let's have a quick look at `LOAN_TERMINATED` (i.e. loans that aren't updated in the recent months)

Interestingly, we see that loans that are still continuing generally have a higher `STATUS_2to5` - consequently higher `IS_RISKY`
> this information can be harnessed while deciding the rules for a loan being risky or not, as well

In [None]:
cols = ['STATUS_-1', 'STATUS_0', 'STATUS_1', 'STATUS_2to5', 'IS_RISKY', 'TOTAL_MONTHS', 'ID']

(cred_norm.assign(STATUS_2to5 = lambda x: x[cols_2to5].apply(sum, axis=1))
          .groupby(['LOAN_TERMINATED'], as_index=False)
          .agg({i:('mean' if i!= 'ID' else 'count') for i in cols
               })
)

### Joining the label to the application data

Once we use the credit information to decide whether a customer is risky or not, we need to actually join it to the application data to prepare the dataset for use in our ML model

In [None]:
print(len(appl), len(appl.ID.unique()))
print(len(cred_norm), len(cred_norm.ID.unique()))
print(f"# of IDs present in both: {len(set(appl.ID).intersection(set(cred_norm.ID)))}")

Let's just check first for the overlap in these 2 datasets

So, we have duplicate `ID`s in the application data - this might be because the same customer has submitted multiple applications
> or it might be genuine duplicate data that needs to be removed

But, given the overlap between the application data and credit record data is only `36K` - let's just **merge** them first before looking into the duplicates

And once we merge the datasets (inner join), we see that the duplicate issue is gone

In [None]:
df = appl.merge(cred_norm[['ID', 'IS_RISKY']], how='inner', on='ID')

assert len(df) == len(df.ID.unique())
df.sample(2)
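
A minimal sketch of why the duplicates can disappear after an inner join, on invented frames (`toy_appl` and `toy_labels` are hypothetical). Here the duplicated `ID` simply has no credit record, so the inner join drops it - whether that is what happens in the real data is exactly what the `assert` above verifies:

```python
import pandas as pd

toy_appl = pd.DataFrame({
    "ID":               [1, 2, 3, 3],
    "AMT_INCOME_TOTAL": [100.0, 200.0, 300.0, 300.0],
})
toy_labels = pd.DataFrame({"ID": [1, 2, 9], "IS_RISKY": [0, 1, 0]})

# IDs that appear more than once in the application data
dup_ids = toy_appl.loc[toy_appl["ID"].duplicated(keep=False), "ID"].unique()
print(dup_ids)

# the inner join keeps only IDs present in both frames
merged = toy_appl.merge(toy_labels, how="inner", on="ID")
print(len(merged), merged["ID"].is_unique)
```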

Let's just look into the data profile now

In [None]:
%%capture
df_profile = ppf.ProfileReport(df)

In [None]:
df_profile

Ok, so, `FLAG_MOBIL` has only a single *unique* value - so, it can be dropped easily

Let's also convert `DAYS_BIRTH` to age (in years) - and round it to one decimal place

We also see that `CNT_CHILDREN` and `CNT_FAM_MEMBERS` are highly correlated - instead of dropping one, one easy way to decorrelate them would be to create a new feature that tracks the number of family members excluding children
> although there still might be correlation, if it's likely that people with larger families also have more children

Looking at the ratio of income and age also should be a good indicator of financial wellness

`DAYS_EMPLOYED > 0` means currently unemployed - let's create a feature for this
And then let's also convert it to years

Plus, the ratio of income and working years, and ratio of working years and age can be interesting to look at too

In [None]:
df['Age'] = (df['DAYS_BIRTH'] * -1 / 365).round(1)
df['CNT_FAM_MEMBERS_MINUS_CHILDREN'] = df['CNT_FAM_MEMBERS'] - df['CNT_CHILDREN']
df['Income_Age_ratio'] = df['AMT_INCOME_TOTAL'] / df['Age']
df['Is_Unemployed'] = (df['DAYS_EMPLOYED'] > 0).astype('int8')
df.loc[df.DAYS_EMPLOYED > 0, 'DAYS_EMPLOYED'] = 0
df['Years_Employed'] = (df['DAYS_EMPLOYED'] * -1 / 365).round(1)
df['Employed_Age_ratio'] = df['Years_Employed'] / df['Age']
df['Employed_Income_ratio'] = df['Years_Employed'] / df['AMT_INCOME_TOTAL']


df =  df.drop(columns=['FLAG_MOBIL', 'DAYS_BIRTH', 'CNT_FAM_MEMBERS', 'DAYS_EMPLOYED'])
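
A minimal worked example of these conversions on two invented applicants (the numbers are chosen so the arithmetic is easy to follow; `toy` is a hypothetical frame). The first is 14600 days old and has worked for 3650 days; the second has a positive `DAYS_EMPLOYED`, which marks them as unemployed:

```python
import pandas as pd

toy = pd.DataFrame({
    "DAYS_BIRTH":       [-14600, -21900],
    "DAYS_EMPLOYED":    [-3650, 1000],
    "AMT_INCOME_TOTAL": [100000.0, 50000.0],
})

# days are counted backwards from the application date, hence the sign flip
toy["Age"] = (-toy["DAYS_BIRTH"] / 365).round(1)
toy["Is_Unemployed"] = (toy["DAYS_EMPLOYED"] > 0).astype("int8")

# positive values are capped at 0 before converting to years
toy["Years_Employed"] = (-toy["DAYS_EMPLOYED"].clip(upper=0) / 365).round(1)
toy["Employed_Age_ratio"] = toy["Years_Employed"] / toy["Age"]
print(toy[["Age", "Is_Unemployed", "Years_Employed", "Employed_Age_ratio"]])
```

Using `clip(upper=0)` is one way to cap the positive values without overwriting `DAYS_EMPLOYED` in place.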

Now, let's actually look at the intersection of people who are unemployed, and people for whom we had `NA` as `OCCUPATION_TYPE`

Well, all the people termed as **Unemployed** actually have `NA` as their occupation

So, we can get rid of the `Is_Unemployed` feature and replace the `NA` for them with `Unemployed`

Also, ordering by `IS_RISKY` % gives us an indication about encoding these categorical values
> Although interestingly `IT staff` seems to be most risky - but, that might be the effect of the small sample size

In [None]:
(df.groupby(['Is_Unemployed','OCCUPATION_TYPE'], as_index=False)
   .agg({'ID':'count', 'IS_RISKY':'mean'})
   .sort_values(by=['IS_RISKY'], ascending=False)
)

In [None]:
df.loc[df.Is_Unemployed == 1, 'OCCUPATION_TYPE'] = 'Unemployed'
df = df.drop(columns=['Is_Unemployed'])

Next, we look at the interaction between `OCCUPATION_TYPE` and `FLAG_WORK_PHONE`
> the idea being - a job without a work phone might be an indication of job quality

And indeed we see that `IT staff` without work phone is one of the **most risky** one here!

In [None]:
(df.groupby(['OCCUPATION_TYPE','FLAG_WORK_PHONE'], as_index=False)
   .agg({'ID':'count', 'IS_RISKY':'mean'})
   .sort_values(by=['IS_RISKY'], ascending=False)
)

Let's check the same for other categorical features as well

We should probably drop the `Student` category here - since there are so few of them - and all of them are `0`s

Although this needs a discussion with client to see if they are aligned to this (for any number of business reasons)

In [None]:
def sort_by_risk(x): 
    return (df.groupby(x, as_index=False)
              .agg({'ID':'count', 'IS_RISKY':'mean'})
              .sort_values(by=['IS_RISKY'], ascending=False)
           )

sort_by_risk(['NAME_INCOME_TYPE'])
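
Since several of these rankings hinge on very small groups, it can help to flag them explicitly. A minimal sketch on invented data, where `risk_by_group`, the `min_count` threshold and the output column names are all assumptions for illustration:

```python
import pandas as pd

toy = pd.DataFrame({
    "NAME_INCOME_TYPE": ["Working"] * 4 + ["Student"] + ["Pensioner"] * 3,
    "IS_RISKY":         [0, 1, 0, 0,      0,            0, 0, 1],
})

def risk_by_group(frame, col, min_count):
    """Risk rate per category, sorted by risk, with a flag for groups smaller than `min_count`."""
    out = (frame.groupby(col, as_index=False)
                .agg(N=("IS_RISKY", "count"), RISK_RATE=("IS_RISKY", "mean"))
                .sort_values("RISK_RATE", ascending=False))
    out["SMALL_SAMPLE"] = out["N"] < min_count
    return out

table = risk_by_group(toy, "NAME_INCOME_TYPE", min_count=2)
print(table)
```

What counts as "too small" is a judgment call, so the threshold is left as a parameter.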

In [None]:
print(df.shape, df.loc[df.NAME_INCOME_TYPE != 'Student', :].shape)
df = df.loc[df.NAME_INCOME_TYPE != 'Student', :]

The same argument applies to `NAME_EDUCATION_TYPE` as well - where we drop the `Academic degree` category

In [None]:
sort_by_risk(['NAME_EDUCATION_TYPE'])

In [None]:
print(df.shape, df.loc[df.NAME_EDUCATION_TYPE != 'Academic degree', :].shape)
df = df.loc[df.NAME_EDUCATION_TYPE != 'Academic degree', :]

In [None]:
sort_by_risk(['NAME_FAMILY_STATUS'])

In [None]:
sort_by_risk(['NAME_HOUSING_TYPE'])

**Finally, a further data profile with all the new features**

We see that the newly created features (from income) are highly correlated with income - for the obvious reason
But, they capture different facets of income - which is why they are created in the first place!

The same goes for features created from years employed

In [None]:
%%capture
df_profile = ppf.ProfileReport(df)

In [None]:
df_profile

In [None]:
df.to_csv("fc_cc_data_v1.csv", index=False)

#### Issues to be careful about

The data has `GENDER` as a variable - and this raises a very important issue : [NY Times reporting on DHH's tweet on Apple Card's gender bias](https://www.nytimes.com/2019/11/10/business/Apple-credit-card-investigation.html)

And this may warrant a discussion with the stakeholders before deploying a model in the real world.
But, for now, we will just **note** this as something to look more into later on!