# AMEX Competition - Data Exploration

This notebook was created during a live coding session on Twitch.

Watch the VOD of this stream and follow for future streams [here](https://www.twitch.tv/medallionstallion_).

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
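# The 'seaborn' style was renamed to 'seaborn-v0_8' in matplotlib 3.6;
# on older matplotlib versions use plt.style.use('seaborn') instead.
plt.style.use('seaborn-v0_8')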
color_pal = sns.color_palette()

## Reading in the Dataset
We will use the Parquet version of the dataset created by @odins0n for some data exploration. Compared with CSV, Parquet is faster to read, more compact on disk, and preserves the dtype of each column across reads and writes.

[Learn more about it in this YouTube video I made.](https://www.youtube.com/watch?v=u4rsA5ZiTls)

We will subsample the training data so that the notebook does not run out of memory.

In [None]:
train = pd.read_parquet('../input/amex-parquet/train_data.parquet')
print(f'The full training data shape is: {train.shape}')
train = train.sample(100_000, random_state=529)
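
The cell above samples individual rows. If the data holds several statement rows per `customer_ID` (an assumption here), a row-level sample can split a customer's history. A minimal sketch of sampling whole customers instead, using a made-up toy frame rather than the real data:

```python
import pandas as pd

# Toy stand-in for the training data: several statement rows per customer.
toy = pd.DataFrame({
    'customer_ID': ['a', 'a', 'b', 'b', 'b', 'c', 'd', 'd'],
    'P_2': [0.9, 0.8, 0.4, 0.5, 0.3, 0.7, 0.2, 0.1],
})

# Sample customers, not rows, so each sampled customer keeps all statements.
sampled_ids = (toy['customer_ID']
               .drop_duplicates()
               .sample(2, random_state=529))
toy_sample = toy[toy['customer_ID'].isin(sampled_ids)]
print(toy_sample)
```

Whether this matters depends on the analysis: for per-row distributions a row sample is fine, while anything that follows a customer over time needs the full history.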

# About the Data:
The features are broken down into types. We will explore each one:

- D_* = Delinquency variables
- S_* = Spend variables
- P_* = Payment variables
- B_* = Balance variables
- R_* = Risk variables

In [None]:
d_feats = [c for c in train.columns if c.startswith('D_')]
s_feats = [c for c in train.columns if c.startswith('S_')]
p_feats = [c for c in train.columns if c.startswith('P_')]
b_feats = [c for c in train.columns if c.startswith('B_')]
r_feats = [c for c in train.columns if c.startswith('R_')]

In [None]:
print(f'Number of Delinquency variables: {len(d_feats)}')
print(f'Number of Spend variables: {len(s_feats)}')
print(f'Number of Payment variables: {len(p_feats)}')
print(f'Number of Balance variables: {len(b_feats)}')
print(f'Number of Risk variables: {len(r_feats)}')
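
The five list comprehensions above can also be collapsed into one pass over the columns. This is only a sketch on a hypothetical column list; the `groups` dictionary is an illustrative name, not part of the notebook. Note that a prefix match also puts the date column `S_2` in the Spend group.

```python
from collections import defaultdict

# Hypothetical subset of column names, for illustration only.
columns = ['customer_ID', 'S_2', 'P_2', 'D_39', 'B_1', 'R_1', 'D_41', 'target']

known_prefixes = {'D', 'S', 'P', 'B', 'R'}
groups = defaultdict(list)
for col in columns:
    prefix = col.split('_')[0]  # 'D_39' -> 'D', 'customer_ID' -> 'customer'
    if prefix in known_prefixes:
        groups[prefix].append(col)

for prefix, cols in sorted(groups.items()):
    print(prefix, len(cols), cols)
```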

# Distribution of the Target

In [None]:
pct_default = train['target'].mean()
print(f'{pct_default * 100:0.2f}% of the training data defaults')
train['target'].value_counts() \
    .plot(kind='barh',
          title='Distribution of Target',
          color=color_pal[1])
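
For a numeric view of the same class balance, `value_counts(normalize=True)` returns the share of each class directly. A small sketch on a made-up target column:

```python
import pandas as pd

# Made-up binary target, 1 = default.
target = pd.Series([0, 0, 0, 1, 0, 1, 0, 0], name='target')

# Share of each class; for a 0/1 target the share of 1s equals the mean.
balance = target.value_counts(normalize=True)
print(balance)
```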

# Null Values by Feature

In [None]:
feat_groups = {
    'Delinquency': d_feats,
    'Spend': s_feats,
    'Payment': p_feats,
    'Balance': b_feats,
    'Risk': r_feats,
}
fig, axs = plt.subplots(5, 1, figsize=(10, 20))
for i, (name, cols) in enumerate(feat_groups.items()):
    # isna().mean() is the fraction (0 to 1) of null rows in each feature
    train[cols].isna().mean() \
        .plot(kind='hist', bins=20, color=color_pal[i], ax=axs[i])
    axs[i].set_title(f'Null Values in {name} variables', fontsize=20)
    axs[i].set_xlabel('Fraction of Null Values')
plt.tight_layout()
plt.show()
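
The histograms can be complemented with a small table of null fractions per feature group. The sketch below uses a made-up frame and groups features by the first letter of the column name, which assumes the `D_`, `S_`, `P_`, `B_`, `R_` naming described earlier.

```python
import numpy as np
import pandas as pd

# Made-up features with some missing values.
toy = pd.DataFrame({
    'D_1': [1.0, np.nan, np.nan, 4.0],
    'D_2': [1.0, 2.0, 3.0, 4.0],
    'B_1': [np.nan, 2.0, 3.0, 4.0],
})

# Fraction of null rows per feature, then summarized per prefix letter.
null_frac = toy.isna().mean()
summary = null_frac.groupby(null_frac.index.str[0]).agg(['mean', 'max'])
print(summary)
```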

In [None]:
# Look at 10 D features
train[d_feats[:10]].describe().T

In [None]:
ax = train[d_feats[:10]] \
    .plot(kind='kde', figsize=(10, 5))
ax.set_title('Distribution of 10 D_ features')
ax.set_xlim(-0.5, 1)
plt.show()

In [None]:
ax = train[p_feats] \
    .plot(kind='kde', figsize=(10, 5))
ax.set_title('Distribution of Payment features')
ax.set_xlim(-0.5, 1.5)
plt.show()

# Plot Each Feature by Target

In [None]:
fig, axs = plt.subplots(1, 3, figsize=(10, 3))
for i in range(3):
    train.groupby('target')[p_feats[i]] \
        .plot(kind='kde',
              title=p_feats[i], alpha=0.5, ax=axs[i])
    axs[i].legend()
fig.suptitle('Distribution of Payment Features by Target',
             y=1.05, fontsize=14)
plt.show()
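
Alongside the density plots, a per-class summary puts numbers on how far apart the two groups are. A minimal sketch with invented values for a single payment feature:

```python
import pandas as pd

# Invented values; 'P_2' here is only a placeholder column name.
toy = pd.DataFrame({
    'P_2': [0.9, 0.8, 0.7, 0.3, 0.2, 0.4],
    'target': [0, 0, 0, 1, 1, 1],
})

# Summary statistics of the feature within each target class.
by_target = toy.groupby('target')['P_2'].agg(['mean', 'median', 'std'])
print(by_target)
```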

# Non-Numeric Features

There are some features that are non-numeric!

They are:
- S_2
- D_63
- D_64

In [None]:
train.select_dtypes('object').columns

In [None]:
train['S_2_date'] = pd.to_datetime(train['S_2'])

## S_2 (Date) Feature Exploration

In [None]:
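# Average the 0/1 target per statement date; plotting the raw rows would
# only draw a line jumping between 0 and 1.
train.groupby('S_2_date')['target'].mean() \
    .plot(figsize=(15, 5), lw=1, alpha=0.5,
          title='Mean Target by Date')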
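
Daily values can be noisy, so one option is to aggregate the date to a coarser grain such as the month. This sketch uses invented dates and targets; the `month` column is an illustrative name.

```python
import pandas as pd

# Invented statement dates and targets.
toy = pd.DataFrame({
    'S_2': ['2017-03-09', '2017-03-20', '2017-04-07', '2017-04-28'],
    'target': [0, 1, 0, 0],
})

# Parse the strings, bucket by calendar month, and average the target.
toy['S_2_date'] = pd.to_datetime(toy['S_2'])
toy['month'] = toy['S_2_date'].dt.to_period('M')
monthly_rate = toy.groupby('month')['target'].mean()
print(monthly_rate)
```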

# Categorical Feature Exploration

In [None]:
train.groupby('D_63')['target'].value_counts() \
    .unstack() \
    .sort_values(0) \
    .plot(kind='barh', stacked=True,
          title='D_63 Feature by Target')
plt.show()

In [None]:
train.groupby('D_64')['target'].value_counts() \
    .unstack() \
    .sort_values(0) \
    .plot(kind='barh', stacked=True,
          title='D_64 Feature by Target')
plt.show()
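
Stacked counts make it hard to compare categories of very different sizes. A default rate per category, shown next to the row count, is one way around that. The category values below are invented for illustration.

```python
import pandas as pd

# Invented category values and targets.
toy = pd.DataFrame({
    'D_63': ['CO', 'CO', 'CR', 'CR', 'CO', 'XZ'],
    'target': [0, 1, 0, 0, 0, 1],
})

# Mean of a 0/1 target is the default rate; size shows how many rows back it.
rate = toy.groupby('D_63')['target'].agg(default_rate='mean', n='size')
print(rate.sort_values('default_rate', ascending=False))
```

Rates from very small categories (such as the single `XZ` row here) should be read with caution.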

# Find the Correlation of Features with Target

In [None]:
# Pearson correlation of each float32 feature with the target.
# Nulls are filled with 0 first, which can bias the estimate for
# features with many missing values.
numeric_feats = train.select_dtypes('float32').columns
feat_corrs = {}
for f in numeric_feats:
    feat_corrs[f] = np.corrcoef(train[f].fillna(0), train['target'])[0, 1]
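
Filling nulls with 0 treats a missing value as a real observation of 0, which can shift the correlation. An alternative, sketched here on invented data, is `DataFrame.corrwith`, which for each column uses only the rows where both the feature and the target are present.

```python
import numpy as np
import pandas as pd

# Invented data: f1 has missing values, f2 is complete.
toy = pd.DataFrame({
    'f1': [0.1, 0.2, np.nan, 0.4, 0.5, np.nan],
    'f2': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    'target': [0, 0, 1, 1, 1, 0],
})
feats = ['f1', 'f2']

# Pairwise-complete correlation: rows with a null feature are skipped.
corr_pairwise = toy[feats].corrwith(toy['target'])
# Same calculation after filling nulls with 0, as in the loop above.
corr_filled = toy[feats].fillna(0).corrwith(toy['target'])

print(pd.DataFrame({'pairwise': corr_pairwise, 'filled_0': corr_filled}))
```

The two columns agree for `f2`, which has no nulls, and differ for `f1`. Neither is correct in general; the choice depends on whether missingness itself carries signal.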

# Most and Least Correlated Features with the Target

In [None]:
pd.Series(feat_corrs).abs().sort_values(ascending=False).head(25) \
    .sort_values() \
    .plot(kind='barh', title='Top Correlated Features with Target')

In [None]:
pd.Series(feat_corrs).abs().sort_values(ascending=True).head(25) \
    .plot(kind='barh', title='Least Correlated Features with Target')
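
Taking `.abs()` before plotting hides whether a feature moves with or against the target. A sketch of ranking by magnitude while keeping the sign, using invented correlation values and placeholder feature names:

```python
import pandas as pd

# Invented correlations; the feature names are placeholders.
corrs = pd.Series({'feat_a': -0.6, 'feat_b': 0.5, 'feat_c': 0.1})

# Sort on absolute value via `key`, but keep the signed values in the result.
ranked = corrs.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked)
```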

In [None]:
ax = train.groupby('target')['D_48'] \
    .plot(kind='kde',
          title='D_48', alpha=0.5)