# Reviewing the default rate

This EDA analyses the default rate in the AMEX credit card default prediction data. If the default rate moves over time, that movement could provide insight into the flows of default activity.

Inspired by @ambrosm's [AMEX EDA which makes sense](https://www.kaggle.com/code/ambrosm/amex-eda-which-makes-sense).

### Objectives
* Build the default rate time series
* Understand the flow and stock positions

After reviewing the data by customer_ID, there appear to be no customers that flow into or out of default.

In [None]:
# Import packages
import pandas as pd
import numpy as np
import pickle
from matplotlib import pyplot as plt
import os
import plotly.express as px
import plotly.graph_objects as go

# The labels

We start by reading the labels for the training data. There are neither missing values nor duplicated customer_IDs. Of the 458913 customer_IDs, about 340000 (74 %) have a label of 0 (good customer, no default) and about 119000 (26 %) have a label of 1 (bad customer, default).

We know that the good customers have been subsampled by a factor of 20; this means that in reality there are 6.8 million good customers. 98 % of the customers are good; 2 % are bad.

**Insight:**
- The classes are imbalanced. A StratifiedKFold for cross-validation is recommended.
- Because the classes are imbalanced, accuracy would be a bad metric to evaluate a classifier. The [competition metric](https://www.kaggle.com/competitions/amex-default-prediction/discussion/327464) is a mix of the area under the ROC curve (AUC) and recall.

In [None]:
train_labels = pd.read_csv('../input/amex-default-prediction/train_labels.csv')
train_labels.head(2)

In [None]:
# Check for missing data and duplicated customer_IDs
train_labels.isna().any().any(), train_labels.customer_ID.duplicated().any()

In [None]:
label_stats = pd.DataFrame({'absolute': train_labels.target.value_counts(),
              'relative': train_labels.target.value_counts() / len(train_labels)})
label_stats['absolute upsampled'] =  label_stats.absolute * np.array([20, 1])
label_stats['relative upsampled'] = label_stats['absolute upsampled'] / label_stats['absolute upsampled'].sum()
label_stats

# The data

The dataset of this competition has a considerable size. If you read the original csv files, the data barely fits into memory. That's why we read the data from @munumbutt's [AMEX-Feather-Dataset](https://www.kaggle.com/datasets/munumbutt/amexfeather). In this [Feather](https://arrow.apache.org/docs/python/feather.html) file, the floating point precision has been reduced from 64 bit to 16 bit. And reading a Feather file is faster than reading a csv file because the Feather file format is binary.

There are 5.5 million rows for training and 11 million rows of test data.

In [None]:
%%time
train = pd.read_feather('../input/amexfeather/train_data.ftr')
test = pd.read_feather('../input/amexfeather/test_data.ftr')
with pd.option_context("display.min_rows", 6):
    display(train)
    display(test)

The target column of the train dataframe corresponds to the target column of train_labels.csv. In the csv file of the train data, there is no target column; it has been joined into the Feather file as a convenience.

S_2 is the statement date. All train statement dates are between March of 2017 and March of 2018 (13 months), and no statement dates are missing. All test statement dates are between April of 2018 and October of 2019. This means that the statement dates of train and test don't overlap:

In [None]:
train.dtypes

In [None]:
print('Train statement dates: ', train.S_2.min(), train.S_2.max(), train.S_2.isna().any())
print('Test statement dates: ',  test.S_2.min(), test.S_2.max(), test.S_2.isna().any())

**Insight:**
- The test data come from a different phase in the economic cycle than the training data. Our models have no way of learning the effect of the economic cycle.

This could be worth reviewing. We could bring in key US macroeconomic data for the training and test periods. An exogenous component of this kind might help the model account for variation it cannot otherwise see, and could provide further insight into how customers default.

In [None]:
# Memory footprint of the training dataframe
from humanize import naturalsize
# deep=True (a boolean, not the string 'True') also counts the contents of object columns
size = train.memory_usage(deep=True).sum()
print(size)
print(naturalsize(size))

In [None]:
print(f'Train data memory usage: {naturalsize(train.memory_usage(deep=True).sum())}')
print(f'Test data memory usage:  {naturalsize(test.memory_usage(deep=True).sum())}')

The training data takes 2.8 GBytes of RAM. The test data is twice the size of the training data.

**Insight:**
- With that much data, we need to have an eye on memory efficiency. Avoid keeping unnecessary copies of the data in memory, and avoid keeping unnecessary copies of models!
- Whereas most machine learning algorithms expect the whole training data to be in memory, we don't need to load all the test data at once. The test data can be processed in batches.
- You may want to separate training and inference code into two notebooks so that you never have training and test data in memory at the same time.
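
The batching idea can be sketched as follows. This is a minimal illustration, not the approach used in this notebook: `iter_customer_batches` and the `feature` column are hypothetical names, the demo frame is made up, and a real pipeline would load each batch from disk instead of slicing a dataframe that is already in memory.

```python
import numpy as np
import pandas as pd

def iter_customer_batches(df, n_batches, id_col="customer_ID"):
    """Yield slices of df so that all rows of one customer land in the same batch."""
    ids = df[id_col].unique()
    for id_chunk in np.array_split(ids, n_batches):
        yield df[df[id_col].isin(id_chunk)]

# Tiny stand-in for the test data (hypothetical values)
demo = pd.DataFrame({
    "customer_ID": ["a", "a", "b", "c", "c", "c", "d"],
    "feature": [0.1, 0.3, 0.5, 0.2, 0.4, 0.6, 0.9],
})

# Stand-in for inference: one aggregated value per customer, computed batch by batch
parts = [batch.groupby("customer_ID")["feature"].mean()
         for batch in iter_customer_batches(demo, n_batches=2)]
batched_result = pd.concat(parts)
print(batched_result)
```

Splitting on customer IDs rather than on row positions keeps all statements of a customer together, which matters whenever features are aggregated per customer.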

The info function shows that most other features have missing values:


In [None]:
train.info(max_cols=200, show_counts=True, memory_usage='deep')

**Insight:**
- There are many columns with missing values: Dropping all columns which have missing values is not a sensible strategy.
- There are many rows with missing values: Dropping all rows which have missing values is not a sensible strategy.
- Many decision-tree based algorithms can deal with missing values. If we choose such a model, we don't need to change the missing values.
- Neural networks and other estimators cannot deal with missing values. If we choose such a model, we need to impute values. See [this guide](https://www.kaggle.com/code/parulpandey/a-guide-to-handling-missing-values-in-python) for an overview of the many imputation options.
- Most features are 16-bit floats. The original data (in the csv file) has higher precision. By rounding it to 16-bit precision, some information is lost. To make this information loss more tangible: Every float16 number between 1 and 2 is a multiple of 1/1024. These numbers have only three digits behind the decimal point! This precision is enough to start the competition; maybe we'll have to switch to higher precision towards the end.
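
The float16 precision claim can be checked directly with NumPy. The sample value below is arbitrary and only serves as an illustration.

```python
import numpy as np

# Gap between 1.0 and the next representable float16 value
step = float(np.finfo(np.float16).eps)
print(step, 1 / 1024)

# Rounding a higher-precision value to float16 snaps it onto the 1/1024 grid
original = 1.2345678
rounded = float(np.float16(original))
print(original, "->", rounded)
print("multiple of 1/1024:", (rounded * 1024).is_integer())
```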

# Building default data

When reviewing the target variable as a time series, we want to see how the pattern of defaults changes as time progresses. Do we see more defaults at the beginning or at the end of a cycle?

A few items to be aware of:
- Need to check if each customer ID has a record for each month
- Statement dates are spread throughout the month, which may distort a monthly assessment of default. Adding a start-of-month date keeps monthly reviews consistent
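
Both checks can be sketched on a small made-up frame. The column names `S_2_month` and `has_gap` are assumptions introduced for this illustration, as are the customer IDs and dates.

```python
import pandas as pd

# Small illustrative frame; customer IDs and dates are made up
statements = pd.DataFrame({
    "customer_ID": ["a", "a", "a", "b", "b"],
    "S_2": pd.to_datetime(["2017-03-09", "2017-04-21", "2017-05-02",
                           "2017-03-30", "2017-05-15"]),
})

# Map every statement date to the first day of its month
statements["S_2_month"] = statements["S_2"].dt.to_period("M").dt.to_timestamp()

# Compare the number of distinct statement months with the span they cover
per_customer = statements.groupby("customer_ID")["S_2_month"].agg(["min", "max", "nunique"])
span = ((per_customer["max"].dt.year - per_customer["min"].dt.year) * 12
        + (per_customer["max"].dt.month - per_customer["min"].dt.month) + 1)
per_customer["has_gap"] = per_customer["nunique"] < span
print(per_customer)
```

A customer with `has_gap` set to True is missing at least one month between their first and last statement.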

In [None]:
# Review the target variable across time
t_data = train.loc[:, ['customer_ID', 'S_2', 'target']]
t_data.head()

In [None]:
# Review the numbers of records by customer ID
t_data.groupby(['customer_ID'])['target'].count().value_counts()

In [None]:
# What proportion of customers falls into each combination of
# statement count ('count') and number of default-flagged statements ('sum')?
t_data_summ = t_data.groupby(['customer_ID']).agg({
                    'target': ['count', 'sum']
                }).value_counts(normalize=True).to_frame(name='prop')
t_data_summ

The two largest groups are the customers with 13 statements whose status is either non-default on every statement (64.6%) or default on every statement (19.5%). For the defaulters this creates a challenge: we cannot tell when they first entered default. Entering default is the event we are trying to understand, and we are looking for variables that signal an increased likelihood of it.
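
A quick way to test whether any customer changes status at all is to count the distinct target values per customer. The sketch below uses a made-up frame; on the real data the same two lines would run against `t_data`.

```python
import pandas as pd

# Made-up example: customer "a" is always performing, customer "b" is always in default
demo = pd.DataFrame({
    "customer_ID": ["a", "a", "a", "b", "b"],
    "target": [0, 0, 0, 1, 1],
})

# Number of distinct target values seen for each customer
labels_per_customer = demo.groupby("customer_ID")["target"].nunique()

# Customers whose target takes more than one value, i.e. who change status
changing_customers = labels_per_customer[labels_per_customer > 1]
print(len(changing_customers))
```

If no customer has more than one distinct value, the target is constant per customer and carries no information about when default was entered.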

In [None]:
# Count total, performing and defaulted statements per statement date. Let's review this with a visualization
def summary(x):
    result = {
                'total': x['target'].count()                     # all statements on this date
                ,'perf': (np.where(x['target']==0,1,0)).sum()    # statements with target == 0
                ,'def': (np.where(x['target']==1,1,0)).sum()     # statements with target == 1
            }
    return pd.Series(result).round(4)

# Apply the summary function grouped by the statement date
time_d = t_data.groupby(['S_2']).apply(summary)
time_d = time_d.reset_index()

fig = px.line(time_d, x='S_2', y="total", title='Customer statements by day')
fig.show()

In [None]:
# Review the stock positions over time
fig = px.area(time_d, x='S_2', y=["perf", "def"], title='Default stock positions across time', groupnorm='percent')

fig.show()

The default stock positions are stable over time, with no spikes. A stock position shows where the customer is at a given point in time. For this analysis that is either a performing or a default status.

The next stage is to review the flows into and out of default.
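
The stock view can also be expressed as an explicit monthly default rate, that is, the share of statements in a month that carry a default flag. This is a sketch on made-up data; the column names `month`, `defaults` and `default_rate` are assumptions introduced here.

```python
import pandas as pd

# Made-up statements: three in March 2017, four in April 2017
demo = pd.DataFrame({
    "S_2": pd.to_datetime(["2017-03-05", "2017-03-20", "2017-03-28",
                           "2017-04-02", "2017-04-18", "2017-04-25", "2017-04-27"]),
    "target": [0, 1, 0, 0, 0, 1, 0],
})

# Bucket statements by the first day of their month
demo["month"] = demo["S_2"].dt.to_period("M").dt.to_timestamp()

# Stock per month: number of statements and number flagged as default
stock = demo.groupby("month")["target"].agg(total="count", defaults="sum")
stock["default_rate"] = stock["defaults"] / stock["total"]
print(stock)
```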

In [None]:
# Need to sort the data by customer_ID and date
t_data = t_data.sort_values(by = ['customer_ID', 'S_2'], ignore_index=True)
t_data.head(20)

In [None]:
# Review movement between consecutive records. Note we will have to go back and ensure that each customer has statements for consecutive months
def default_rules(df):
    # Flag rows whose previous row belongs to the same customer (df must be sorted by customer_ID and S_2)
    df['cust_c'] = np.where(df.customer_ID.shift(1) == df.customer_ID, 1, 0)
    # Same customer, previous statement performing and current statement in default
    df_cond_np = np.array((df.cust_c == 1) & ((df.target.shift(1) == 0) &  (df.target == 1)))
    # Apply the numpy array to create a default flow variable
    df['def_flow'] = np.where(df_cond_np, 1, 0)
    # Same customer, previous statement in default and current statement performing
    df_cond_p = np.array((df.cust_c == 1) & ((df.target.shift(1) == 1) &  (df.target == 0)))
    # Apply the numpy array to create a performing flow variable
    df['perf_flow'] = np.where(df_cond_p, 1, 0)
    return df

t_data_1 = default_rules(t_data)
t_data_1.head(50)

In [None]:
# It looks like there are no flows into or out of default, so every customer stays in one state across all statements.
# This is consistent with target being a per-customer label joined from train_labels.csv
t_data_1[['def_flow', 'perf_flow']].value_counts()

In [None]:
# Confirm that def_flow does not vary over time
time_d1 = t_data_1.groupby(['S_2'])['def_flow'].sum()
time_d1 = time_d1.reset_index()

fig = px.line(time_d1, x='S_2', y="def_flow", title='Flows into default by day')
fig.show()