<div style="padding:20px;color:white;margin:0;font-size:175%;text-align:center;display:fill;border-radius:5px;background-color:#016CC9;overflow:hidden;font-weight:500">[EDA] American Express: Comparison of Training and Test Data</div>

# <b><span style='color:#4B4B4B'>1 |</span><span style='color:#016CC9'> Notes</span></b>
This notebook follows the structure of the notebook linked [here](https://www.kaggle.com/code/kellibelcher/amex-default-prediction-eda-lgbm-baseline).
If you upvote this notebook, please upvote the referenced notebook as well.


This notebook compares the training and test data in terms of feature distributions (density plots) and missing values.

# <b><span style='color:#4B4B4B'>2 |</span><span style='color:#016CC9'> Data Overview</span></b>
The binary target is calculated by observing an 18-month performance window after the latest credit card statement. If the customer does not pay the amount due within 120 days of their latest statement date, it is considered a default event.

The dataset contains aggregated profile features for each customer at each statement date. Features are anonymized and normalized, and fall into the following general categories:  
**`D_*`:** Delinquency variables  
**`S_*`:** Spend variables  
**`P_*`:** Payment variables  
**`B_*`:** Balance variables  
**`R_*`:** Risk variables  
The following features are categorical: `B_30`, `B_38`, `D_63`, `D_64`, `D_66`, `D_68`, `D_114`, `D_116`, `D_117`, `D_120`, `D_126`.

There are a total of 190 variables in the dataset with approximately 450,000 customers in the training set and 925,000 in the test set. Due to the dataset size, I will use the compressed version of the train and test sets provided by @munumbutt's [AMEX-Feather-Dataset](https://www.kaggle.com/datasets/munumbutt/amexfeather) and take the last statement for each customer.
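
As a minimal sketch of "taking the last statement for each customer", the toy frame below uses made-up values and only the real column names `customer_ID`, `S_2`, and `P_2`. The explicit sort is an assumption added for safety: `groupby(...).tail(1)` on its own returns the last row in file order, which is the latest statement only if the file is already sorted by date within each customer.

```python
import pandas as pd

# Toy statement history; the real data has one row per customer per statement date.
statements = pd.DataFrame({
    "customer_ID": ["a", "a", "a", "b", "b"],
    "S_2": pd.to_datetime(["2017-03-01", "2017-04-01", "2017-05-01",
                           "2017-03-15", "2017-04-15"]),
    "P_2": [0.91, 0.93, 0.95, 0.40, 0.38],
})

# Sort explicitly so that tail(1) really is the most recent statement,
# rather than relying on the row order of the file.
last = (statements.sort_values(["customer_ID", "S_2"])
                  .groupby("customer_ID")
                  .tail(1)
                  .set_index("customer_ID"))
print(last)
```

Each customer keeps exactly one row, so the result can be indexed by `customer_ID` as in the loading cell that follows.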

In [None]:
import numpy as np 
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.colors
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from plotly.offline import init_notebook_mode
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold 
from sklearn.metrics import roc_auc_score, roc_curve, auc
from lightgbm import LGBMClassifier, early_stopping, log_evaluation
import warnings, gc
warnings.filterwarnings("ignore")
init_notebook_mode(connected=True)

temp=dict(layout=go.Layout(font=dict(family="Franklin Gothic", size=12), 
                           height=500, width=1000))

train = pd.read_feather('../input/amexfeather/train_data.ftr')
train = train.groupby('customer_ID').tail(1).set_index('customer_ID')

print("The training data begins on {} and ends on {}.".format(train['S_2'].min().strftime('%m-%d-%Y'),train['S_2'].max().strftime('%m-%d-%Y')))
print("There are {:,.0f} customers in the training set and {} features.".format(train.shape[0],train.shape[1]))

test = pd.read_feather('../input/amexfeather/test_data.ftr')
test = test.groupby('customer_ID').tail(1).set_index('customer_ID')

print("\nThe test data begins on {} and ends on {}.".format(test['S_2'].min().strftime('%m-%d-%Y'),test['S_2'].max().strftime('%m-%d-%Y')))
print("There are {:,.0f} customers in the test set and {} features.".format(test.shape[0],test.shape[1]))

del test['S_2']
gc.collect()

# Turn anonymized names into readable titles, e.g. 'D_39' -> 'Delinquency 39'.
# Any prefix other than D, S, P, B falls back to 'Risk', as before.
prefix_names={'D':'Delinquency','S':'Spend','P':'Payment','B':'Balance'}
titles=[prefix_names.get(str(i)[0],'Risk')+' '+str(i).split('_')[1] for i in train.columns[:-1]]
cat_cols=['Balance 30', 'Balance 38', 'Delinquency 63', 'Delinquency 64', 'Delinquency 66', 'Delinquency 68',
          'Delinquency 114', 'Delinquency 116', 'Delinquency 117', 'Delinquency 120', 'Delinquency 126', 'Target','category']
test.columns=titles[1:]
titles.append('Target')
train.columns=titles

train = train.assign(category=0)  # Source flag: 0 = training data
train.pop('Spend 2')  # Drop the statement date (S_2); it is not compared here

test = test.assign(Target="")  # Test data has no labels; use an empty placeholder
test = test.assign(category=1)  # Source flag: 1 = test data

# <b><span style='color:#4B4B4B'>3 |</span><span style='color:#016CC9'> Exploratory Data Analysis</span></b>

In [None]:
# Share of rows per data source; only training rows (category 0) exist at this point
target=train.category.value_counts(normalize=True)
target.rename(index={0:'train',1:'test'},inplace=True)
pal, color=['#016CC9','#DEB078'], ['#8DBAE2','#EDD3B3']

In [None]:
# From here on, `train` holds both sets; `category` tells them apart (0 = train, 1 = test)
train = pd.concat([train,test])

# <b><div style='padding:15px;background-color:#4B4B4B;color:white;border-radius:5px;font-size:60%'>3.1 EDA of Delinquency Variables</div></b>

In [None]:
cols=[col for col in train.columns if (col.startswith(('D','T','c'))) & (col not in cat_cols[:-1])]
plot_df=train[cols]
fig, ax = plt.subplots(18,5, figsize=(16,54))
fig.suptitle('Distribution of Delinquency Variables',fontsize=16)
row=0
col=[0,1,2,3,4]*18
for i, column in enumerate(plot_df.columns[:-1]):
    if (i!=0)&(i%5==0):
        row+=1
    sns.kdeplot(x=column, hue='category', palette=pal[::-1], hue_order=[0,1], 
                label=['train','test'], data=plot_df, 
                fill=True, linewidth=2, legend=False, ax=ax[row,col[i]])
    ax[row,col[i]].tick_params(left=False,bottom=False)
    ax[row,col[i]].set(title='\n\n{}'.format(column), xlabel='', ylabel=('Density' if i%5==0 else ''))
for i in range(2,5):
    ax[17,i].set_visible(False)
handles, _ = ax[0,0].get_legend_handles_labels() 
fig.legend(labels=['train','test'], handles=reversed(handles), ncol=2, bbox_to_anchor=(0.18, 0.983))
sns.despine(bottom=True, trim=True)
plt.tight_layout(rect=[0, 0.2, 1, 0.99])

**Delinquency 142: The bumps on the right-hand side of the density curve differ between the training and test data, which confirms that the two distributions are not the same.**
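
A visual difference like this can be backed by a number. The sketch below is not part of the original analysis: it uses synthetic data and a hand-written two-sample Kolmogorov-Smirnov statistic (the largest gap between the two empirical CDFs) to show how a shifted right tail shows up as a larger value. The function name `ks_statistic` and the generated arrays are hypothetical.

```python
import numpy as np

def ks_statistic(a, b):
    """Largest vertical gap between the empirical CDFs of a and b (NaNs dropped)."""
    a = np.sort(a[~np.isnan(a)])
    b = np.sort(b[~np.isnan(b)])
    grid = np.concatenate([a, b])  # the gap can only change at observed values
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))

rng = np.random.default_rng(0)
train_like = rng.normal(0.0, 1.0, 5000)
# Synthetic stand-in for a changed right tail: 10% of values moved far to the right
test_like = np.concatenate([rng.normal(0.0, 1.0, 4500), rng.normal(4.0, 0.5, 500)])

same = ks_statistic(train_like, rng.normal(0.0, 1.0, 5000))
shifted = ks_statistic(train_like, test_like)
print(f"same distribution: {same:.3f}, shifted tail: {shifted:.3f}")
```

On the real data, the two arrays would be the train and test values of one feature, for example `train.loc[train.category==0, 'Delinquency 142']` converted to a float NumPy array.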

# <b><div style='padding:15px;background-color:#4B4B4B;color:white;border-radius:5px;font-size:60%'>3.2 EDA of Spend Variables</div></b>

In [None]:
cols=[col for col in train.columns if (col.startswith(('S','T','c'))) & (col not in cat_cols[:-1])]
plot_df=train[cols]
fig, ax = plt.subplots(5,5, figsize=(16,20))
fig.suptitle('Distribution of Spend Variables',fontsize=16)
row=0
col=[0,1,2,3,4]*5
for i, column in enumerate(plot_df.columns[:-1]):
    if (i!=0)&(i%5==0):
        row+=1
    sns.kdeplot(x=column, hue='category', palette=pal[::-1], hue_order=[0,1], 
                label=['train','test'], data=plot_df, 
                fill=True, linewidth=2, legend=False, ax=ax[row,col[i]])
    ax[row,col[i]].tick_params(left=False,bottom=False)
    ax[row,col[i]].set(title='\n\n{}'.format(column), xlabel='', ylabel=('Density' if i%5==0 else ''))
for i in range(1,5):
    ax[4,i].set_visible(False)
handles, _ = ax[0,0].get_legend_handles_labels() 
fig.legend(labels=['train','test'], handles=reversed(handles), ncol=2, bbox_to_anchor=(0.18, 0.985))
sns.despine(bottom=True, trim=True)
plt.tight_layout(rect=[0, 0.2, 1, 0.99])

**Spend 11: The density curves of the training and test data have different shapes. Rescaling the feature might make the two distributions comparable.**
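
One way to probe the scaling idea is to standardize the feature within each data source and check whether the quantiles then line up. This is a sketch on synthetic data, and per-source z-scoring is an assumption chosen for illustration, not a method taken from this notebook. The `feature` column is hypothetical; `category` follows the notebook's convention (0 = train, 1 = test).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
base = rng.gamma(shape=2.0, scale=1.0, size=4000)
df = pd.DataFrame({
    # Hypothetical feature: test values have the same shape, stretched and shifted
    "feature": np.concatenate([base[:2000], 3.0 * base[2000:] + 1.0]),
    "category": [0] * 2000 + [1] * 2000,  # 0 = train, 1 = test
})

# z-score the feature separately within each source
grouped = df.groupby("category")["feature"]
df["scaled"] = (df["feature"] - grouped.transform("mean")) / grouped.transform("std")

# Compare a few quantiles of train and test, before and after scaling
qs = [0.1, 0.25, 0.5, 0.75, 0.9]
raw_q = df.groupby("category")["feature"].quantile(qs).unstack(level=0)
scaled_q = df.groupby("category")["scaled"].quantile(qs).unstack(level=0)
raw_gap = (raw_q[0] - raw_q[1]).abs().max()
scaled_gap = (scaled_q[0] - scaled_q[1]).abs().max()
print(f"largest quantile gap, raw: {raw_gap:.3f}, scaled: {scaled_gap:.3f}")
```

If the gap stays large after scaling, the two distributions differ in shape and not just in location or spread, and scaling alone is unlikely to reconcile them.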

# <b><div style='padding:15px;background-color:#4B4B4B;color:white;border-radius:5px;font-size:60%'>3.3 EDA of Payment Variables</div></b>

In [None]:
cols=[col for col in train.columns if (col.startswith(('P','T','c'))) & (col not in cat_cols[:-1])]
plot_df=train[cols]
fig, ax = plt.subplots(1,3, figsize=(16,5))
fig.suptitle('Distribution of Payment Variables',fontsize=16)
for i, col in enumerate(plot_df.columns[:-1]):
    sns.kdeplot(x=col, hue='category', palette=pal[::-1], hue_order=[0,1], 
                label=['train','test'], data=plot_df, 
                fill=True, linewidth=2, legend=False, ax=ax[i])
    ax[i].tick_params(left=False,bottom=False)
    ax[i].set(title='{}'.format(col), xlabel='', ylabel=('Density' if i==0 else ''))
handles, _ = ax[0].get_legend_handles_labels() 
fig.legend(labels=['train','test'], handles=reversed(handles), ncol=2, bbox_to_anchor=(0.18, 1))
sns.despine(bottom=True, trim=True)
plt.tight_layout(rect=[0, 0.2, 1, 0.99])

# <b><div style='padding:15px;background-color:#4B4B4B;color:white;border-radius:5px;font-size:60%'>3.4 EDA of Balance Variables</div></b>

In [None]:
cols=[col for col in train.columns if (col.startswith(('B','T','c'))) & (col not in cat_cols[:-1])]
plot_df=train[cols]
fig, ax = plt.subplots(8,5, figsize=(16,32))
fig.suptitle('Distribution of Balance Variables',fontsize=16)
row=0
col=[0,1,2,3,4]*8
for i, column in enumerate(plot_df.columns[:-1]):
    if (i!=0)&(i%5==0):
        row+=1
    sns.kdeplot(x=column, hue='category', palette=pal[::-1], hue_order=[0,1], 
                label=['train','test'], data=plot_df, 
                fill=True, linewidth=2, legend=False, ax=ax[row,col[i]])
    ax[row,col[i]].tick_params(left=False,bottom=False)
    ax[row,col[i]].set(title='\n\n{}'.format(column), xlabel='', ylabel=('Density' if i%5==0 else ''))
for i in range(3,5):
    ax[7,i].set_visible(False)
handles, _ = ax[0,0].get_legend_handles_labels() 
fig.legend(labels=['train','test'], handles=reversed(handles), ncol=2, bbox_to_anchor=(0.18, 0.984))
sns.despine(bottom=True, trim=True)
plt.tight_layout(rect=[0, 0.2, 1, 0.99])

# <b><div style='padding:15px;background-color:#4B4B4B;color:white;border-radius:5px;font-size:60%'>3.5 EDA of Risk Variables</div></b>

In [None]:
cols=[col for col in train.columns if (col.startswith(('R','T','c'))) & (col not in cat_cols[:-1])]
plot_df=train[cols]
fig, ax = plt.subplots(6,5, figsize=(16,24))
fig.suptitle('Distribution of Risk Variables',fontsize=16)
row=0
col=[0,1,2,3,4]*6
for i, column in enumerate(plot_df.columns[:-1]):
    if (i!=0)&(i%5==0):
        row+=1
    sns.kdeplot(x=column, hue='category', palette=pal[::-1], hue_order=[0,1], 
                label=['train','test'], data=plot_df, 
                fill=True, linewidth=2, legend=False, ax=ax[row,col[i]])
    ax[row,col[i]].tick_params(left=False,bottom=False)
    ax[row,col[i]].set(title='\n\n{}'.format(column), xlabel='', ylabel=('Density' if i%5==0 else ''))
for i in range(3,5):
    ax[5,i].set_visible(False)
handles, _ = ax[0,0].get_legend_handles_labels() 
fig.legend(labels=['train','test'], handles=reversed(handles), ncol=2, bbox_to_anchor=(0.18, 0.984))
sns.despine(bottom=True, trim=True)
plt.tight_layout(rect=[0, 0.2, 1, 0.99])

# <b><div style='padding:15px;background-color:#4B4B4B;color:white;border-radius:5px;font-size:60%'>3.6 EDA of Categorical Variables</div></b>

In [None]:
rgb=['rgba'+str(matplotlib.colors.to_rgba(i,0.7)) for i in pal]

In [None]:
fig = make_subplots(rows=4, cols=3, 
                    subplot_titles=cat_cols[:-2], 
                    vertical_spacing=0.1)
row=0
c=[1,2,3]*5
plot_df=train[cat_cols].copy()  # copy so the dtype changes below do not touch `train`
for i,col in enumerate(cat_cols[:-2]):
    if i%3==0:
        row+=1
    plot_df[col]=plot_df[col].astype(object)
    df=plot_df.groupby(col)['category'].value_counts().rename('count').reset_index().replace('',np.nan)
    
    fig.add_trace(go.Bar(x=df[df.category==0][col], y=df[df.category==0]['count'],
                         marker_color=rgb[1], marker_line=dict(color=pal[1],width=2), 
                         hovertemplate='Value %{x} Frequency = %{y}',
                         name='train', showlegend=(True if i==0 else False)),
                  row=row, col=c[i])
    fig.add_trace(go.Bar(x=df[df.category==1][col], y=df[df.category==1]['count'],
                         marker_color=rgb[0], marker_line=dict(color=pal[0],width=2),
                         hovertemplate='Value %{x} Frequency = %{y}',
                         name='test', showlegend=(True if i==0 else False)),
                  row=row, col=c[i])
    if i%3==0:
        fig.update_yaxes(title='Frequency',row=row,col=c[i])
fig.update_layout(template=temp,title="Distribution of Categorical Variables",
                  legend=dict(orientation="h",yanchor="bottom",y=1.03,xanchor="right",x=0.2),
                  barmode='group',height=1500,width=900)
fig.show()

# <b><div style='padding:15px;background-color:#4B4B4B;color:white;border-radius:5px;font-size:60%'>Missing Values</div></b>

In [None]:
null=round((train[train.category==0].isna().sum()/train[train.category==0].shape[0]*100),2).sort_values(ascending=False)
null=null.to_frame().rename(columns={0:'Missing_train %'})

null2=round((train[train.category==1].isna().sum()/train[train.category==1].shape[0]*100),2).sort_values(ascending=False)
null2=null2.to_frame().rename(columns={0:'Missing_test %'})


# Join the two tables on the feature name and compute the gap (train minus test)
null = pd.merge(null,null2,left_index=True,right_index=True)
null['diff %'] = null['Missing_train %'] - null['Missing_test %']

pd.set_option('display.max_rows', None)  # show every feature, not a truncated table

# Alternative view, largest gap first: null.sort_values(by='diff %', ascending=False)
null.sort_index()

The diff column is the training percentage minus the test percentage. If diff is positive, the training data has the larger share of missing values; if diff is negative, the test data has the larger share.
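
The sign convention can be checked on a toy example. The frame below is made up (`feat_a` and `feat_b` are hypothetical features); it computes the same three columns as the table above, using a `groupby` on the source flag instead of two separate filters.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "feat_a": [1.0, np.nan, np.nan, 4.0, 5.0, 6.0, np.nan, 8.0],
    "feat_b": [1.0, 2.0, 3.0, 4.0, np.nan, np.nan, np.nan, 8.0],
    "category": [0, 0, 0, 0, 1, 1, 1, 1],  # 0 = train, 1 = test
})

# Percentage of missing values per feature within each source
missing = (df.drop(columns="category").isna()
             .groupby(df["category"]).mean().mul(100).T
             .rename(columns={0: "Missing_train %", 1: "Missing_test %"}))
missing["diff %"] = missing["Missing_train %"] - missing["Missing_test %"]
print(missing)
```

Here `feat_a` is missing more often in the training rows, so its diff is positive, while `feat_b` is missing only in the test rows, so its diff is negative.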

In [None]:
plt.rcParams["figure.figsize"] = (20, 10)
sns.scatterplot(data=null, x='Missing_train %', y='Missing_test %')
plt.show()

# <b><div style='padding:15px;background-color:#4B4B4B;color:white;border-radius:5px;font-size:60%'>Conclusion</div></b>

# <b><span style='color:#4B4B4B'>1 |</span><span style='color:#016CC9'> Scaling Delinquency 142 and Spend 11 may improve the accuracy of the model.</span></b>

# <b><span style='color:#4B4B4B'>2 |</span><span style='color:#016CC9'> Balance 29 and Spend 9 should be handled with care.</span></b>