# Introduction

In this notebook I do some exploratory data analysis for the recent [American Express - Default Prediction](https://www.kaggle.com/competitions/amex-default-prediction) competition. It's still a work in progress and I'll update it whenever I can. I hope you find it useful!

## Some Simple Information:
### About the Task:
- We're going to predict the *probability* of a future payment default, for each customer_ID.

### About the Target:
- On the competition page, the hosts state: `The target binary variable is calculated by observing 18 months performance window after the latest credit card statement, and if the customer does not pay due amount in 120 days after their latest statement date it is considered a default event.`
- The negative class has also been subsampled at 5% to deal with the class imbalance, but the data should still be fairly imbalanced.

### About the Metric:
- The competition has its own custom metric, whose details can be found [here](https://www.kaggle.com/competitions/amex-default-prediction/overview/evaluation).
- Because the negative class was downsampled, the negative labels are given a weight of 20 in the metric.
- Python code for the custom competition metric can be found [here](https://www.kaggle.com/code/inversion/amex-competition-metric-python).
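
The weight of 20 follows directly from the 5% subsampling rate (1 / 0.05 = 20). Below is a minimal sketch, on made-up labels, of how such weights could be attached to the target to estimate the default rate before subsampling. The helper names `sample_weights` and `weighted_default_rate` are hypothetical and are not part of the competition code:

```python
import numpy as np

NEG_SAMPLING_RATE = 0.05                # negatives were subsampled at 5%
NEG_WEIGHT = 1 / NEG_SAMPLING_RATE      # -> 20, the weight given to negatives


def sample_weights(y):
    """Weight 20 for negatives (target == 0), weight 1 for positives."""
    y = np.asarray(y)
    return np.where(y == 0, NEG_WEIGHT, 1.0)


def weighted_default_rate(y):
    """Share of defaults after re-weighting the subsampled negatives."""
    y = np.asarray(y)
    w = sample_weights(y)
    return float((w * y).sum() / w.sum())


# Toy labels (illustrative only): 1 default and 3 non-defaults.
y_toy = [1, 0, 0, 0]
print("raw rate:", np.mean(y_toy), "weighted rate:", weighted_default_rate(y_toy))
```

With the weights applied, each remaining negative stands in for 20 customers, so the weighted rate is much lower than the raw rate seen in the subsampled data.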

### About the Data:
- Both the train and test sets are pretty big, so loading and working with them in limited-memory environments needs to be taken into account.
- Features are anonymized and normalized. The anonymized features fall into these general categories:
    * D_* = Delinquency variables
    * S_* = Spend variables
    * P_* = Payment variables
    * B_* = Balance variables
    * R_* = Risk variables
- The data consists mostly of continuous features, but the hosts list some categorical features:
    - `['B_30', 'B_38', 'D_114', 'D_116', 'D_117', 'D_120', 'D_126', 'D_63', 'D_64', 'D_66', 'D_68']`
    - Besides these there are datetime, ID and label columns.
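
One common way to ease the memory pressure mentioned above is to store continuous columns with a narrower float type. The sketch below uses a toy frame with made-up values, and `downcast_floats` is a hypothetical helper. Whether float32 precision is acceptable for every feature is an assumption you should check for your own use case:

```python
import numpy as np
import pandas as pd


def downcast_floats(df):
    """Return a copy of df with float64 columns stored as float32."""
    float_cols = df.select_dtypes(include="float64").columns
    return df.astype({col: "float32" for col in float_cols})


# Toy frame standing in for the real data (values are made up).
toy = pd.DataFrame({
    "customer_ID": ["a", "b", "c"],
    "P_2": np.array([0.93, 0.88, 0.41], dtype="float64"),
    "B_2": np.array([0.01, 0.02, 0.95], dtype="float64"),
})
small = downcast_floats(toy)
print(toy.memory_usage(deep=True).sum(), "->", small.memory_usage(deep=True).sum())
```

Non-float columns such as the ID are left untouched, and the original frame is not modified.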

# Loading Packages

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from tqdm import tqdm
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import seaborn as sns

from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
import lightgbm as lgb

import gc


orange_black = [
    '#fdc029', '#df861d', '#FF6347', '#aa3d01', '#a30e15', '#800000', '#171820'
]

# Setting plot styling.
plt.style.use('ggplot')

plt.rcParams['figure.figsize'] = (18, 14)
plt.rcParams['figure.dpi'] = 300
plt.rcParams["axes.grid"] = True
plt.rcParams["grid.color"] = orange_black[0]
plt.rcParams["grid.alpha"] = 0.5
plt.rcParams["grid.linestyle"] = '--'
plt.rcParams["font.family"] = "monospace"

plt.rcParams['axes.edgecolor'] = 'black'
plt.rcParams['figure.frameon'] = False
plt.rcParams['axes.spines.left'] = True
plt.rcParams['axes.spines.bottom'] = True
plt.rcParams['axes.spines.top'] = False
plt.rcParams['axes.spines.right'] = False
plt.rcParams['axes.linewidth'] = 1.0

import warnings
warnings.filterwarnings("ignore")

# Reading Data

In [None]:
train = pd.read_feather('../input/parquet-files-amexdefault-prediction/train_data.ftr')
gc.collect()

In [None]:
test = pd.read_feather('../input/parquet-files-amexdefault-prediction/test_data.ftr')
gc.collect()

Here we check whether any `customer_ID` appears in both the train and test sets.


In [None]:
print(f"Number of overlapping IDs between Train and Test set: {train['customer_ID'].isin(test['customer_ID']).sum()}")


We're going to separate the columns by general data type here: categorical, continuous, date and target. We also treat continuous columns with 10 or fewer unique values as categorical-like and add them to the categoricals.

In [None]:
all_cols = train.columns.to_list()
cat_cols = ['B_30', 'B_38', 'D_114', 'D_116', 'D_117', 'D_120', 'D_126', 'D_63', 'D_64', 'D_66', 'D_68']

str_cols = ["customer_ID"]
date_cols = ["S_2"]
targ_col = ["target"]

cont_cols = [col for col in all_cols if col not in cat_cols + str_cols + date_cols + targ_col]

# finding other categorical cols
sec_cats = [col for col in cont_cols if train[col].nunique()<=10 and col not in cat_cols]
cat_cols = cat_cols + sec_cats

for col in sec_cats:
    cont_cols.remove(col)

# setting categorical cols
train[cat_cols] = train[cat_cols].astype('str').astype('category')
test[cat_cols] = test[cat_cols].astype('str').astype('category')

In [None]:
train_test_nonmatch = []
for col in cat_cols:
    if train[col].nunique(dropna=False) != test[col].nunique(dropna=False):
        train_test_nonmatch.append(col)
    else:
        continue

In [None]:
def show_nomatch(train, test, cols):
    fig, axes = plt.subplots(len(cols),2, sharey=True)
    axes = axes.flatten()
    
    for j in zip(cols, axes[1::2], axes[::2]):
        order1 = train[j[0]].value_counts().index
        order2 = test[j[0]].value_counts().index

        # Pass the column as x=; newer seaborn versions treat the first positional argument as `data`.
        sns.countplot(x=train[j[0]],ax=j[2], label='train', palette=orange_black, order=order1)
        sns.countplot(x=test[j[0]],ax=j[1], label='test', palette=orange_black, order=order2)
        j[1].set_yticklabels([])
        j[1].set_ylabel('')
        j[2].set_ylabel('')
        
    col_tit = ['Train', 'Test']
        
    for i, ax in enumerate(axes[:2]):
        ax.set_title("{}".format(col_tit[i]), fontweight='bold', fontsize=12)
    plt.suptitle('Unique Categorical Values by Train/Test Splits', fontweight='bold', fontsize=16)
    plt.tight_layout()
        
show_nomatch(train, test, train_test_nonmatch)

We can see that there are some values in the train set that aren't represented in the test set. We can drop the rows containing them.

In [None]:
shape_old = train.shape[0]
for col in train_test_nonmatch:
    drop = set(train[col].value_counts().index) - set(test[col].value_counts().index)
    for i in list(drop):
        train = train[train[col]!=i]
        train[col] = train[col].cat.remove_unused_categories()
        
train.reset_index(drop=True, inplace=True)
shape_new = train.shape[0]
print(f"Train Rows dropped: {shape_old - shape_new}")
del shape_old, shape_new
gc.collect()

# Missing Values

In [None]:
miss = train.isna().sum().sort_values(ascending=False).reset_index(drop=False).rename({0:'missing_count'}, axis=1)
miss['miss_ratio'] = (miss['missing_count'] / train.shape[0]) * 100

sns.barplot(data=miss[:25], x= 'miss_ratio', y='index', palette='YlOrBr_r', linewidth=0.7, edgecolor=".2")
plt.title('Missing Value Ratios Per Column')
plt.ylabel('Column')
plt.xlabel('Percent of Missing Values')
del miss
plt.show()


In [None]:
miss = train.isna().sum(axis=1).reset_index(drop=False).rename({0:'missing_count'}, axis=1)
miss['miss_ratio'] = (miss['missing_count'] / (train.shape[1]-1)) * 100

sns.histplot(miss.miss_ratio, stat='percent', bins=50, linewidth=0.7, edgecolor=".2", color=orange_black[0])
plt.title("Distribution of Missing Percentage by Row")
plt.xlabel('Missing Value Ratio per Row')
plt.ylabel("Percent of Rows")
del miss
plt.show()
gc.collect()

# Target Column

In [None]:
train_labels = pd.read_csv('../input/amex-default-prediction/train_labels.csv')
train_labels['target'] = train_labels['target'].astype('uint8')
train = train.merge(train_labels, how='left', on='customer_ID')

del train_labels
gc.collect()

In [None]:
g=sns.countplot(x=train.target, palette='autumn',  linewidth=0.7, edgecolor=".2")

# Adding percentages

total = float(len(train.target))

for p in g.patches:
    height = p.get_height()
    g.text(p.get_x() + p.get_width() / 2.,
            height + height/100,
            '{:1.2f}%'.format((height / total) * 100),
            ha='center',
           fontsize=8,
          bbox=dict(boxstyle='round',facecolor='black', alpha=0.5))

plt.ylabel('Count')    
plt.title('Target Distribution', weight='bold')
plt.tight_layout()

In [None]:
correlations = train.corrwith(train['target']).iloc[:-1].to_frame()
correlations['Abs Corr'] = correlations[0].abs()
sorted_correlations = correlations.sort_values('Abs Corr', ascending=False)['Abs Corr']
fig, ax = plt.subplots(figsize=(6,8))
sns.heatmap(sorted_correlations.iloc[1:].to_frame()[sorted_correlations>=.25], cmap='inferno', annot=True, vmin=-1, vmax=1, ax=ax, cbar=False)

plt.ylabel('Feature')
plt.title('Feature Correlations With Target')
del correlations, sorted_correlations
plt.show()

# Parsing Date

In [None]:
train['S_2']= pd.to_datetime(train['S_2'], format='%Y-%m-%d')
test['S_2']= pd.to_datetime(test['S_2'], format='%Y-%m-%d')
train['day_name'] = train['S_2'].dt.day_name().astype('category')

In [None]:
sns.countplot(data=train, x='day_name', hue='target', palette=orange_black[1::5])
plt.title("Weekday of the Dataset Entries by Target")
plt.show()

# Categorical Data

In [None]:
def categorical_dist(df, t_df, cols, rows, columns):
    fig, axes = plt.subplots(rows, columns, figsize=(30, 25), constrained_layout=True)
    axes = axes.flatten()
    for j in zip(cols, axes[1::2], axes[::2]):
        order = df[j[0]].value_counts().index
        sns.countplot(data=df, x=str(j[0]), ax=j[2], order=order, hue='target', palette=orange_black[2::3])
        sns.countplot(data=t_df, x=str(j[0]),ax=j[1], order=order, label='Test', color=orange_black[0], alpha=0.8)
        sns.countplot(data=df, x=str(j[0]),ax=j[1], order=order, label='Train', color=orange_black[-1], alpha=0.8)
        j[1].set_yticklabels([])
        j[1].set_ylabel('')
        j[2].set_ylabel('')
        j[1].legend(title='Dataset')
    col_tit = ['Train by Target', 'Train vs Test']    
    for i, ax in enumerate(axes[:2]):
        ax.set_title("{}".format(col_tit[i]), fontweight='bold', fontsize=18)
    plt.suptitle('Unique Categorical Values by Train/Test Splits', fontweight='bold', fontsize=16)

    plt.tight_layout()
    
    
categorical_dist(train, test, cat_cols, 6, 2)

In [None]:
corr = train[cont_cols].fillna(0).sample(frac=0.1, random_state=42).corr()
sns.clustermap(corr, metric="correlation", figsize=(20, 20), dendrogram_ratio=(.1, .2), cmap="coolwarm")
plt.suptitle('Correlations Between Features', fontsize=24, weight='bold')
plt.show()

In [None]:
corr = corr.abs()

corrs = corr.unstack()
pair = corrs.sort_values(ascending=False)
pair = pair.reset_index(name='correlation').rename(columns={'level_0': 'feature_a', 'level_1': 'feature_b', 0: 'correlation'})
pair = pair[pair['feature_a'] != pair['feature_b']].iloc[::2,:]
pair['Features'] = pair.feature_a +' / '+ pair.feature_b

In [None]:
fig, axes = plt.subplots(figsize=(25,20))
sns.barplot(data=pair[:50], x= 'correlation', y='Features', palette='YlOrBr_r', linewidth=0.7, edgecolor=".2")
plt.title('Highest Correlation Pairs Between Features')
plt.ylabel('Pairs')
plt.xlabel('Correlation')

plt.show()

del corr, pair
gc.collect()

Let's get to the continuous features next. There are too many of them to inspect at once, so it's better to group them as the hosts did, if you recall:

- D_* = Delinquency variables
- S_* = Spend variables
- P_* = Payment variables
- B_* = Balance variables
- R_* = Risk variables



In [None]:
cont_d = [col for col in train.columns.tolist() if col.startswith('D') and col not in cat_cols]
cont_s = [col for col in train.columns.tolist() if col.startswith('S') and col not in cat_cols]
cont_p = [col for col in train.columns.tolist() if col.startswith('P') and col not in cat_cols]
cont_b = [col for col in train.columns.tolist() if col.startswith('B') and col not in cat_cols]
cont_r = [col for col in train.columns.tolist() if col.startswith('R') and col not in cat_cols]


cat_d = [col for col in train.columns.tolist() if col.startswith('D') and col in cat_cols]
cat_s = [col for col in train.columns.tolist() if col.startswith('S') and col in cat_cols]
cat_p = [col for col in train.columns.tolist() if col.startswith('P') and col in cat_cols]
cat_b = [col for col in train.columns.tolist() if col.startswith('B') and col in cat_cols]
cat_r = [col for col in train.columns.tolist() if col.startswith('R') and col in cat_cols]

spec_cont = [cont_d,cont_s,cont_p,cont_b,cont_r]
spec_cat = [cat_d,cat_s,cat_p,cat_b,cat_r]
spec_cat = [col for col in spec_cat if col != []]

# Delinquency Variables

In [None]:
class ContiniousDist():
    def __init__(self, train, test_df):
        self.train = train
        self.test = test_df
        
        self.df_0 = self.train[self.train['target']==0]
        self.df_1 = self.train[self.train['target']==1]
        
    def histplot(self, cols, title:str='', figsize:tuple=(45, 90), row_factor:int=5, n_cols:int=5):
        fig, axes = plt.subplots(len(cols)//row_factor, n_cols, figsize=figsize, constrained_layout=True)
        axes = axes.flatten()
        for i, col in tqdm(enumerate(cols)):
            axes[i].hist(self.df_0[col], bins=100, alpha=0.5, color=orange_black[1], density=True,  linewidth=0.2, edgecolor=".2", label='Target: 0', histtype='stepfilled')
            axes[i].hist(self.df_1[col], bins=100, alpha=0.6, color=orange_black[4], density=True,  linewidth=0.2, edgecolor=".2", label='Target: 1', histtype='stepfilled')
            axes[i].hist(self.test[col], bins=100, alpha=0.75, color=orange_black[-1], density=True,  linewidth=0.2, edgecolor=".2", label='Test', histtype='stepfilled')
            axes[i].set_title(col)
            axes[i].legend()
        plt.suptitle(title, fontsize=25)
        plt.show()
        
    def ecdf(self, cols, title:str='', figsize:tuple=(30, 60), row_factor:int=5, n_cols:int=5, test=False):
        fig, axes = plt.subplots(len(cols)//row_factor, n_cols, figsize=figsize, constrained_layout=True)
        axes = axes.flatten()
        for i, col in tqdm(enumerate(cols)):
            if test==False:
                sns.ecdfplot(self.df_0[col].dropna(), color=orange_black[1], ax=axes[i], label='Target: 0')
                sns.ecdfplot(self.df_1[col].dropna(), color=orange_black[-1], ax=axes[i], label='Target: 1')
            else:
                sns.ecdfplot(self.train[col].dropna(), color=orange_black[1], ax=axes[i], label='Train')
                sns.ecdfplot(self.test[col].dropna(), color=orange_black[-1], ax=axes[i], label='Test')

            axes[i].set_title(col)
            axes[i].legend()
        plt.suptitle(title)
        plt.show()

In [None]:
hist = ContiniousDist(train, test)

In [None]:
hist.histplot(cont_d[:80], title="Delinquency Variable Distributions")
gc.collect()

The distributions for the two targets and the test set don't look very distinct overall, but I can see differences in some features (like D_47, D_52, D_59, D_74 etc.). These features should be helpful for models or feature engineering.

In [None]:
hist.ecdf(['D_47','D_52', 'D_59', 'D_74'], title="Empirical Cumulative Distributions for Interesting Delinquency Variables", figsize=(12, 12), row_factor=2, n_cols=2)

We can clearly see that the distributions of these features differ between the targets, which is good.
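
If you want a number to go with the visual impression from the ECDF plots, one option is the two-sample Kolmogorov-Smirnov statistic, which is the largest vertical gap between two ECDFs. This is a sketch on synthetic data rather than on the competition features, and the variable names are only illustrative:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Synthetic stand-ins for one feature split by target (not real data).
feat_target0 = rng.normal(loc=0.0, scale=1.0, size=5000)
feat_target1 = rng.normal(loc=0.5, scale=1.0, size=5000)   # shifted distribution
feat_same = rng.normal(loc=0.0, scale=1.0, size=5000)      # same distribution as target 0

# A statistic near 0 means nearly identical ECDFs; larger values mean a bigger gap.
ks_shifted = ks_2samp(feat_target0, feat_target1).statistic
ks_similar = ks_2samp(feat_target0, feat_same).statistic
print(f"shifted: {ks_shifted:.3f}, similar: {ks_similar:.3f}")
```

On the real columns you would drop missing values first, as the ECDF plots do, before passing the two samples to `ks_2samp`.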

# Spend Variables

In [None]:
hist.histplot(spec_cont[1][:21], title="Spend Variable Distributions", row_factor=3, n_cols=3)
gc.collect()

We can see some odd shapes here too. The easiest one to see is the time feature (S_2): notice that the test set covers a later and wider interval of the timeline. On S_11 you can notice a shift between the train and test splits, which could be related to time. You can also notice at first glance that S_25 has a different distribution for target 1 than for target 0. Some feature distributions are clustered in tightly knit bins around discrete values, which might indicate random noise injection by the hosts, as @raddar pointed out.
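
To illustrate what such noise injection could look like, here is a sketch on a synthetic feature: a few discrete levels plus small uniform noise. The noise width of 0.01 and the levels themselves are assumptions made for the example, not confirmed properties of the competition data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

NOISE_WIDTH = 0.01   # assumed noise width, for illustration only

# Synthetic feature: four discrete levels plus uniform noise in [0, NOISE_WIDTH).
levels = rng.choice([0.0, 0.25, 0.5, 1.0], size=10_000)
noisy = pd.Series(levels + rng.uniform(0, NOISE_WIDTH, size=10_000))

n_raw = noisy.nunique()
# Flooring to the noise width collapses each tight cluster back into one bin.
n_binned = np.floor(noisy / NOISE_WIDTH).nunique()
print(f"unique raw values: {n_raw}, unique bins after flooring: {n_binned}")
```

A feature that has thousands of unique raw values but only a handful of bins after flooring is a candidate for this kind of check on the real data.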

In [None]:
hist.ecdf(['S_7','S_11'], title="Empirical Cumulative Distributions for Interesting Spend Variables", figsize=(8, 4), row_factor=2, n_cols=2, test=True)
gc.collect()

I picked S_7 as an arbitrary variable to show what similar distributions between the two sets look like; meanwhile you can see the difference in S_11 more clearly above.

# Payment Variables

In [None]:
hist.histplot(cont_p, title="Payment Variable Distributions", figsize=(12, 4), row_factor=3, n_cols=3)

There isn't a big difference between train and test for the payment variables, but we can see an apparent distinction in P_2 between the targets.

# Balance Variables

In [None]:
hist.histplot(cont_b[:36], title="Balance Variable Distributions", row_factor=4, n_cols=4)
gc.collect()

In [None]:
hist.ecdf(['B_2','B_16'], title="Empirical Cumulative Distributions for Interesting Balance Variables", figsize=(8, 4), row_factor=2, n_cols=2, test=False)
gc.collect()

In [None]:
hist.ecdf(['B_2','B_16'], title="Empirical Cumulative Distributions for Interesting Balance Variables", figsize=(8, 4), row_factor=2, n_cols=2, test=True)
gc.collect()

Again we can see clear differences between the target distributions, while the test set looks similar to the train set.

# Risk Variables

In [None]:
hist.histplot(cont_r, title="Risk Variable Distributions", row_factor=4, n_cols=4, figsize=(45, 60))
gc.collect()

In [None]:
hist.ecdf(['R_4','R_27'], title="Empirical Cumulative Distributions for Interesting Risk Variables", figsize=(8, 4), row_factor=2, n_cols=2, test=False)
gc.collect()

In [None]:
del hist
gc.collect()

# Adversarial Validation 

In [None]:
class Adversarial:
    def __init__(self):
        self.params =  {
            'objective': 'binary',  
            'boosting_type': 'gbdt',
            'n_jobs': -1,
            'metric' :   'auc',
            'max_bin' : 128,
            'verbose': -1
        }
    
    def _plot_roc_feat(self, y_trues, y_preds, labels, cols, clf, idxs:tuple=(), x_max:float=1.0):
        fig = plt.figure(constrained_layout=True, figsize=(12, 12))
        grid = gridspec.GridSpec(ncols=4, nrows=2, figure=fig)
        ax1 = fig.add_subplot(grid[0, :2])    
        for i, y_pred in enumerate(y_preds):
            y_true = y_trues[i]
            fpr, tpr, thresholds = roc_curve(y_true, y_pred)
            auc = roc_auc_score(y_true, y_pred)
            ax1.plot(fpr, tpr, label='%s; AUC=%.3f' % (labels[i], auc), marker='o', markersize=1)
        ax1.legend()
        ax1.grid()
        ax1.plot(np.linspace(0, 1, 20), np.linspace(0, 1, 20), linestyle='--')
        ax1.set_title('Adversarial ROC curve')
        ax1.set_xlabel('False Positive Rate')
        ax1.set_xlim([-0.01, x_max])
        _ = ax1.set_ylabel('True Positive Rate')
        
        ax2 = fig.add_subplot(grid[0, 2:])

        feature_imp = pd.DataFrame(sorted(zip(clf.feature_importance(),cols)), columns=['Value','Feature'])
        sns.barplot(x="Value", y="Feature", data=feature_imp.sort_values(by="Value", ascending=False).iloc[:25,:], ax=ax2, palette='YlOrBr_r', linewidth=0.7, edgecolor=".2")
        ax2.set_title('Most Distinctive Features Between Train and Test')
        
        
        ax3 = fig.add_subplot(grid[1, :])
        ax3.hist([train.loc[idxs[0], 'S_2'], test.loc[idxs[1], 'S_2']], bins=pd.date_range("2017-03-01", "2019-11-01", freq="MS"), histtype="barstacked",
                label=['Training', 'Test'], color=[orange_black[0], orange_black[-1]])
        ax3.set_xticks(pd.date_range("2017-03-01", "2019-11-01", freq="QS"))
        ax3.legend()
        ax3.set_title('Train Test Sample Intervals')
        plt.show()

    def validate(self, train, test, method:str='full'):
        train['set'] = 0
        test['set'] = 1

        train['set'] = train['set'].astype('uint8')
        test['set'] = test['set'].astype('uint8')

        if method=='full':
            a = train.drop(['customer_ID', 'S_2', 'day_name', 'target'], axis=1).sample(frac=0.1, random_state=42)
            b = test.drop(['customer_ID', 'S_2'], axis=1).sample(frac=0.1, random_state=42)
            full = pd.concat((a, b), axis=0)
            a_idx = a.index
            b_idx = b.index

        elif method=='far_gap':
            a = train[train.S_2<train.S_2.quantile(0.05)].drop(['customer_ID', 'S_2', 'day_name', 'target'], axis=1)
            b = test[test.S_2>test.S_2.quantile(0.95)].drop(['customer_ID', 'S_2'], axis=1)
            full = pd.concat((a, b), axis=0)
            a_idx = a.index
            b_idx = b.index

        elif method=='close_gap':
            a = train[train.S_2>train.S_2.quantile(0.95)].drop(['customer_ID', 'S_2', 'day_name', 'target'], axis=1)
            b = test[test.S_2<test.S_2.quantile(0.05)].drop(['customer_ID', 'S_2'], axis=1)
            full = pd.concat((a, b), axis=0)
            a_idx = a.index
            b_idx = b.index
            
        del a, b

        X_train, X_test, y_train, y_test = train_test_split(full.drop(columns='set'), full['set'], stratify=full['set'], shuffle=True, random_state=42)

        del full
        gc.collect()

        train_data = lgb.Dataset(data=X_train, label=y_train, params={'verbose': -1},
                             categorical_feature=cat_cols,
                     free_raw_data=False
                    )
        val_data = lgb.Dataset(data=X_test, label=y_test, params={'verbose': -1},
                             categorical_feature=cat_cols,
                     free_raw_data=False
                          )
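        clf = lgb.train(
            params=self.params,
            train_set=train_data,
            valid_sets=[train_data,val_data],
            # `verbose_eval` is no longer accepted by newer LightGBM versions; this callback silences per-iteration logging instead.
            callbacks=[lgb.log_evaluation(period=0)],
            )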

        y_pred = clf.predict(X_test)
        
        self._plot_roc_feat(
            y_trues=[y_test],
            y_preds=[y_pred],
            labels=[f'{method} data'],
            cols=X_train.columns.tolist(),
            clf = clf,
            idxs = (a_idx, b_idx)
        )

        del X_train, X_test, y_train, y_test, train_data, val_data
        gc.collect()
        
adv = Adversarial()

In [None]:
adv.validate(train, test, method='full')

This isn't looking good: our model does perfectly on a task where it has to predict whether a given sample comes from the train or the test set. We can see which features matter most to the model when separating train from test. Among the top features I can see some familiar names that we spotted by looking at the distributions. I suspect these features are related to the time variable directly or indirectly, but either way we cannot assume the train and test sets were sampled randomly for now. Let's take a closer look...
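
For readers new to the idea, here is a stripped-down sketch of adversarial validation on synthetic data. It swaps in a scikit-learn logistic regression instead of the LightGBM model used in this notebook, and `adversarial_auc` is a hypothetical helper. An AUC near 0.5 means the two samples are hard to tell apart, while an AUC near 1.0 means they are easy to separate:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def adversarial_auc(sample_a, sample_b, seed=42):
    """Train a classifier to tell sample_a (label 0) from sample_b (label 1)."""
    X = np.vstack([sample_a, sample_b])
    y = np.r_[np.zeros(len(sample_a)), np.ones(len(sample_b))]
    X_tr, X_va, y_tr, y_va = train_test_split(X, y, stratify=y, random_state=seed)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return roc_auc_score(y_va, clf.predict_proba(X_va)[:, 1])


rng = np.random.default_rng(0)
base = rng.normal(size=(2000, 5))
same = rng.normal(size=(2000, 5))        # drawn from the same distribution
shifted = rng.normal(size=(2000, 5))
shifted[:, 0] += 2.0                     # one feature drifts, as a time shift might cause

auc_same = adversarial_auc(base, same)
auc_shifted = adversarial_auc(base, shifted)
print(f"same distribution AUC: {auc_same:.3f}, shifted AUC: {auc_shifted:.3f}")
```

A single drifting feature is enough to push the AUC well above 0.5, which is why the feature importances of the adversarial model are worth inspecting.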

In [None]:
adv.validate(train, test, method='close_gap')

So when we close the time gap between our adversarial samples, we see a different story: the AUC score decreases, so the model has a slightly harder time separating the train and test sets. But it still has a great score... You can notice that some features gain importance as the gap tightens, so these features are not likely to be related to time shifts.

In [None]:
adv.validate(train, test, method='far_gap')

And when we take the earliest part of the training data and the latest part of the test data, our adversarial model gets a perfect score! While D_59 keeps its status as the most important feature, some features like S_27 and D_121 get a huge boost, so we can assume these features are more time related than others.

In [None]:
del adv
gc.collect()

# Work in Progress

These are my first insights into this vast dataset. I hope to do more EDA on the rest of the features and some modelling on them. I hope you find it useful, best of luck everyone :)