# American Express - Default Prediction
Whether out at a restaurant or buying tickets to a concert, modern life counts on the convenience of a credit card to make daily purchases. It saves us from carrying large amounts of cash and also can advance a full purchase that can be paid over time. How do card issuers know we’ll pay back what we charge? That’s a complex problem with many existing solutions—and even more potential improvements, to be explored in this competition.
## Introduction
Credit default prediction is central to managing risk in a consumer lending business. It allows lenders to optimize lending decisions, which leads to a better customer experience and sound business economics. Models already exist to help manage this risk, but there is room to build models that outperform those currently in use.

In this competition, we apply supervised machine learning to predict credit default. Specifically, we use an industrial-scale dataset to build binary classification models that challenge the current model in production. The training, validation, and testing datasets include time-series behavioral data and anonymized customer profile information. Beyond a baseline model, we will explore a range of techniques and methodologies to improve on it, through feature engineering and by using the data more organically within the model.
### Objective
The objective of this competition is to predict the probability that a customer will not pay back their credit card balance in the future, based on their monthly customer profile. The binary target variable is calculated by observing an 18-month performance window after the latest credit card statement: if the customer does not pay the amount due within 120 days of their latest statement date, it is considered a default event.
### Evaluation Criteria
If successful, our solution may, once implemented, yield a better experience for cardholders by making it easier for them to be approved for a new credit card. Top solutions may even challenge the credit default prediction model used by American Express, the world's largest payment card issuer.

## Data Preprocessing
### Project Setup and Configuration
#### Notebook Configuration

In [None]:
# Change working directory to project root
import os

if os.path.basename(os.getcwd()) == "notebooks":
    os.chdir("../")

#### Import Packages

In [None]:
# Import required packages
import pandas as pd
import numpy as np

import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

### Data Description
#### Files
| Filename | Description |
|-----------------|-----------------------|
|`train_data.csv`   | training data with multiple statement dates per `customer_ID`|
|`train_labels.csv` | target label for each `customer_ID`|
|`test_data.csv`    | corresponding test data; the goal is to predict the `target` label for each `customer_ID`|
|`sample_submission.csv` | sample submission file in the correct format|
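
Because `train_data.csv` holds several statement rows per `customer_ID` while `train_labels.csv` holds one label per customer, the two files are naturally joined many-to-one. The following is a minimal sketch of that join on toy stand-in frames; the IDs and values are made up, and the variable names (`statements`, `labels`, `merged`) are illustrative rather than taken from the notebook.

```python
import pandas as pd

# Toy stand-ins for train_data.csv and train_labels.csv (values are made up)
statements = pd.DataFrame(
    {
        "customer_ID": ["a", "a", "b"],
        "S_2": pd.to_datetime(["2017-03-09", "2017-04-07", "2017-03-15"]),
        "P_2": [0.93, 0.94, 0.41],
    }
)
labels = pd.DataFrame({"customer_ID": ["a", "b"], "target": [0, 1]})

# Attach each customer's single label to all of their statement rows.
# validate="many_to_one" raises if a customer_ID has more than one label.
merged = statements.merge(
    labels, on="customer_ID", how="left", validate="many_to_one"
)
print(merged)
```

The `validate` argument is optional, but it is a cheap guard against accidentally duplicating rows when a label file contains repeated keys.
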
#### Feature/Target Variables
The dataset contains aggregated profile features for each customer at each statement date. 

Features are anonymized and normalized, and fall into the following general categories:
| Prefix | Feature Type |
|:------:|--------------|
|`D_*`| Delinquency |
|`S_*`| Spend |
|`P_*`| Payment |
|`B_*`| Balance |
|`R_*`| Risk |

The following features are categorical:

`B_30`, `B_38`, `D_114`, `D_116`, `D_117`, `D_120`, `D_126`, `D_63`, `D_64`, `D_66`, `D_68`
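
If the data is loaded from the raw CSV files, these columns would need to be marked as categorical explicitly. The sketch below shows one way to do that; the helper name `cast_categoricals` and the toy values are assumptions for illustration only.

```python
import pandas as pd

# Categorical feature names as listed in the data description
CATEGORICAL_FEATURES = [
    "B_30", "B_38", "D_114", "D_116", "D_117", "D_120",
    "D_126", "D_63", "D_64", "D_66", "D_68",
]


def cast_categoricals(df, columns=CATEGORICAL_FEATURES):
    """Return a copy of df with the listed columns (if present) as category dtype."""
    present = [col for col in columns if col in df.columns]
    return df.astype({col: "category" for col in present})


# Toy frame with made-up values; only B_30 and D_63 are in the categorical list
toy = pd.DataFrame(
    {"B_30": [0.0, 1.0, 2.0], "D_63": ["x", "y", "x"], "P_2": [0.9, 0.8, 0.7]}
)
toy = cast_categoricals(toy)
print(toy.dtypes)
```
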


**Objective:** For each `customer_ID`, predict the probability of a future payment default (`target == 1`).

**Note:** The negative class (`target == 0`) has been *subsampled at 5%*, and thus *receives a 20x weighting in the scoring metric*.
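
To illustrate what the 20x weighting means in practice, the sketch below builds per-row weights from a toy label vector and compares the raw default rate with the weighted one. It is only an illustration of the weighting idea, not a reproduction of the competition's scoring metric, and the labels are made up.

```python
import numpy as np

# Toy labels (made up): 3 non-defaults, 2 defaults
target = np.array([0, 0, 0, 1, 1])

# Each subsampled negative stands in for 20 customers; positives count once
weights = np.where(target == 0, 20.0, 1.0)

# Default rate in the subsampled data versus the rate implied after re-weighting
raw_rate = target.mean()
weighted_rate = np.average(target, weights=weights)

print(f"raw default rate:      {raw_rate:.4f}")
print(f"weighted default rate: {weighted_rate:.4f}")
```

The same `weights` array could be passed as sample weights to a model or metric that accepts them, so that the subsampled negatives count as much as they would in the full population.
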

**Data Source (AMEX):**

American Express is a globally integrated payments company. As the largest payment card issuer in the world, they provide customers with access to products, insights, and experiences that enrich lives and build business success.

### Load AMEX Datasets
**Source (Raw Data):**
- https://www.kaggle.com/competitions/amex-default-prediction/data

**Source (Compressed Data):**
- https://www.kaggle.com/datasets/munumbutt/amexfeather

// TODO: Add information regarding the source, advantages and limitations in using the compressed datasets

In [3]:
# Load compressed datasets
# Source: https://www.kaggle.com/datasets/munumbutt/amexfeather

train = pd.read_feather("./data/raw/train_data.ftr")
test = pd.read_feather("./data/raw/test_data.ftr")

In [17]:
# Preview the first five rows of the training data
train.head()

|  | customer_ID | S_2 | P_2 | D_39 | B_1 | B_2 | R_1 | S_3 | D_41 | B_3 | ... | D_137 | D_138 | D_139 | D_140 | D_141 | D_142 | D_143 | D_144 | D_145 | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0000099d6bd597052cdcda90ffabf56573fe9d7c79be5f... | 2017-03-09 | 0.938477 | 0.001734 | 0.008728 | 1.006836 | 0.009224 | 0.124023 | 0.008774 | 0.004707 | ... | NaN | NaN | 0.002426 | 0.003706 | 0.003819 | NaN | 0.000569 | 0.00061 | 0.002674 | 0 |
| 1 | 0000099d6bd597052cdcda90ffabf56573fe9d7c79be5f... | 2017-04-07 | 0.936523 | 0.005775 | 0.004925 | 1.000977 | 0.006153 | 0.126709 | 0.000798 | 0.002714 | ... | NaN | NaN | 0.003956 | 0.003166 | 0.005032 | NaN | 0.009575 | 0.005493 | 0.009216 | 0 |
| 2 | 0000099d6bd597052cdcda90ffabf56573fe9d7c79be5f... | 2017-05-28 | 0.954102 | 0.091492 | 0.021652 | 1.009766 | 0.006817 | 0.123962 | 0.007599 | 0.009422 | ... | NaN | NaN | 0.003269 | 0.007328 | 0.000427 | NaN | 0.003429 | 0.006985 | 0.002604 | 0 |
| 3 | 0000099d6bd597052cdcda90ffabf56573fe9d7c79be5f... | 2017-06-13 | 0.960449 | 0.002455 | 0.013687 | 1.00293 | 0.001372 | 0.117188 | 0.000685 | 0.005531 | ... | NaN | NaN | 0.006119 | 0.004517 | 0.003201 | NaN | 0.008423 | 0.006527 | 0.009598 | 0 |
| 4 | 0000099d6bd597052cdcda90ffabf56573fe9d7c79be5f... | 2017-07-16 | 0.947266 | 0.002483 | 0.01519 | 1.000977 | 0.007607 | 0.11731 | 0.004654 | 0.009308 | ... | NaN | NaN | 0.003672 | 0.004944 | 0.008888 | NaN | 0.00167 | 0.008125 | 0.009827 | 0 |
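
The preview shows several statement rows for the same `customer_ID`, while the prediction is made once per customer. One possible way to collapse the statement history, sketched here on made-up toy data, is to keep each customer's latest statement and add simple per-customer aggregates. The column `P_2_mean` and the choice of aggregates are assumptions for illustration, not the notebook's method.

```python
import pandas as pd

# Toy statement-level data; IDs, dates, and values are made up
toy = pd.DataFrame(
    {
        "customer_ID": ["a", "b", "a", "b", "b"],
        "S_2": pd.to_datetime(
            ["2017-04-07", "2017-03-15", "2017-03-09", "2017-05-02", "2017-04-11"]
        ),
        "P_2": [0.5, 0.25, 1.0, 0.75, 0.5],
    }
)

# Keep the most recent statement row for each customer
latest = (
    toy.sort_values(["customer_ID", "S_2"])
    .drop_duplicates("customer_ID", keep="last")
    .set_index("customer_ID")
)

# Add a per-customer mean of P_2 over all statements (aligned on customer_ID)
mean_p2 = toy.groupby("customer_ID")["P_2"].mean()
summary = latest.assign(P_2_mean=mean_p2)
print(summary)
```

Using `drop_duplicates(..., keep="last")` after sorting keeps whole rows intact, whereas `groupby(...).last()` would pick the last non-null value column by column.
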


In [14]:
# Print the shape of the DataFrame for the training set
print(f"Training Data: Shape == {train.shape}")
print(
    f"\nThe training set consists of {train.shape[0]} observations with {train.shape[1] - 1} features and 1 target variable."
)

Training Data: Shape == (5531451, 191)

The training set consists of 5531451 observations with 190 features and 1 target variable.


In [28]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5531451 entries, 0 to 5531450
Columns: 191 entries, customer_ID to target
dtypes: category(11), datetime64[ns](1), float16(177), int64(1), object(1)
memory usage: 2.0+ GB
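
Most columns are stored as `float16`, which keeps the memory footprint small at the cost of numeric precision. The sketch below, on synthetic values in the range [0, 1], compares memory use and rounding error between `float64` and `float16`; it is an illustration of the trade-off under these assumptions, not a measurement on the AMEX data.

```python
import numpy as np
import pandas as pd

# Synthetic values in [0, 1], roughly the range of the normalized features
values = pd.Series(np.linspace(0.0, 1.0, 1_000_000), dtype="float64")

# Downcast to half precision, as in the compressed dataset
compact = values.astype("float16")

bytes_64 = values.memory_usage(index=False)
bytes_16 = compact.memory_usage(index=False)
print(f"float64: {bytes_64} bytes, float16: {bytes_16} bytes")

# Largest absolute rounding error introduced by the downcast
max_error = (values - compact.astype("float64")).abs().max()
print(f"max absolute rounding error: {max_error:.6f}")
```

Half precision uses a quarter of the memory of `float64`, but values are only resolved to roughly three significant decimal digits, which is worth keeping in mind when computing differences or ratios between features.
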


In [None]:
# Preview the first five rows of the testing set
test.head(5)

In [16]:
# Print the shape of the DataFrame for the testing dataset
print(f"Testing Data: Shape == {test.shape}")
print(
    f"\nThe testing set consists of {test.shape[0]} observations with {test.shape[1]} features."
)

Testing Data: Shape == (11363762, 190)

The testing set consists of 11363762 observations with 190 features.


In [27]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11363762 entries, 0 to 11363761
Columns: 190 entries, customer_ID to D_145
dtypes: category(11), datetime64[ns](1), float16(177), object(1)
memory usage: 4.0+ GB


In [44]:
# 1-2) Count the incomplete (missing or null) values in each column
# 3-4) Divide by the number of observations and multiply by 100 to get a percentage
#   5) Sort in descending order to make the most incomplete features easy to spot
pct_incomplete = (
    train.isna().sum().div(len(train)).mul(100).sort_values(ascending=False)
)

# Select the incomplete features (threshold: >= 20% missing)
incomplete_features = set(pct_incomplete[pct_incomplete >= 20].index)



In [47]:
# Print the count of incomplete features
print(
    f"{len(incomplete_features)} features have 20% or more missing or null values.\n"
)

# Print the names of the features where 20% or more of the values are missing or null
print(f"Incomplete Features: \n{incomplete_features}")

34 features have 20% or more missing or null values.

Incomplete Features: 
{'B_17', 'D_43', 'D_88', 'D_110', 'D_105', 'D_53', 'D_135', 'R_9', 'B_39', 'D_87', 'D_82', 'D_138', 'B_42', 'S_9', 'D_76', 'D_73', 'D_108', 'D_136', 'D_132', 'D_142', 'D_77', 'D_46', 'S_27', 'D_42', 'R_26', 'D_137', 'B_29', 'D_66', 'D_111', 'D_50', 'D_49', 'D_56', 'D_106', 'D_134'}
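
One possible next step is to drop the features that cross the missingness threshold. The sketch below wraps the same percentage calculation in a hypothetical helper, `incomplete_columns`, and applies it to a toy frame with made-up values; whether dropping, imputing, or flagging is the right treatment for the real data is left open here.

```python
import numpy as np
import pandas as pd


def incomplete_columns(df, threshold_pct=20.0):
    """Return the sorted names of columns whose missing share is >= threshold_pct."""
    pct_missing = df.isna().mean().mul(100)
    return sorted(pct_missing[pct_missing >= threshold_pct].index)


# Toy frame with made-up values: D_42 is 50% missing, S_3 is 10% missing
toy = pd.DataFrame(
    {
        "D_42": [np.nan] * 5 + [0.1] * 5,
        "S_3": [np.nan] + [0.2] * 9,
        "P_2": [0.9] * 10,
    }
)

to_drop = incomplete_columns(toy)
reduced = toy.drop(columns=to_drop)
print(f"Dropped: {to_drop}")
print(f"Remaining: {list(reduced.columns)}")
```

Returning a sorted list rather than a set keeps the printed order stable between runs, which makes the output easier to compare.
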
