# American Express - Default Prediction 
## Predict If A Customer Will Default in the Future ...
The objective of this competition is to predict the probability that a customer does not pay back their credit card balance amount in the future based on their monthly customer profile. The binary target variable is calculated by observing an 18-month performance window after the latest credit card statement; if the customer does not pay the amount due within 120 days after their latest statement date, it is considered a default event.

<img style="float: center;" src="https://img.freepik.com/free-vector/brain-with-digital-circuit-programmer-with-laptop-machine-learning-artificial-intelligence-digital-brain-artificial-thinking-process-concept-vector-isolated-illustration_335657-2246.jpg?w=2000" width = '550'>
<a href='https://www.freepik.com/vectors/machine-learning'>Machine learning vector created by vectorjuice - www.freepik.com</a>

#### Data Description
The dataset contains aggregated profile features for each customer at each statement date. Features are anonymized and normalized, and fall into the following general categories:

* D_* = Delinquency variables
* S_* = Spend variables
* P_* = Payment variables
* B_* = Balance variables
* R_* = Risk variables

With the following features being categorical:

**['B_30', 'B_38', 'D_114', 'D_116', 'D_117', 'D_120', 'D_126', 'D_63', 'D_64', 'D_66', 'D_68']**

Your task is to predict, for each customer_ID, the probability of a future payment default (target = 1).

Note that the negative class has been subsampled for this dataset at 5%, and thus receives a 20x weighting in the scoring metric.
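
To make the 20x weighting concrete, here is a minimal sketch on made-up labels (the `target` values are invented for illustration and are not competition data). It only shows how weighting each negative by 20 undoes the 5% subsampling when estimating the underlying default rate.

```python
import numpy as np

# Hypothetical labels: 1 = default, 0 = no default (toy values, not real data).
target = np.array([1, 0, 0, 1, 0, 0, 0, 1])

# Negatives were subsampled at 5%, so each one stands in for 20 customers.
weight = np.where(target == 0, 20, 1)

raw_rate = target.mean()                                # rate in the subsampled data
weighted_rate = (target * weight).sum() / weight.sum()  # rate after re-weighting

print(f"raw default rate:      {raw_rate:.4f}")
print(f"weighted default rate: {weighted_rate:.4f}")
```

The weighted rate is much lower than the raw one, which is why the metric treats the two classes differently.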

**Files**
* train_data.csv - training data with multiple statement dates per customer_ID
* train_labels.csv - target label for each customer_ID
* test_data.csv - corresponding test data; your objective is to predict the target label for each customer_ID
* sample_submission.csv - a sample submission file in the correct format

---

## My Strategy, or How I Will Approach This Competition...
We have data for many customers, with many data points for each of them, but only one target label per customer ID, so aggregation will be required. From here there are quite a lot of possibilities; this is what I will follow in this notebook...

#### Loading the Datasets
The datasets are massive, so I will rely on optimized datasets from other Kagglers, stored in Feather format, to make my life easier in this competition.

#### Quick EDA
The typical analysis that I always like to complete to understand the dataset better...
* Information about the datasets, size and others.
* Simple visualization of the first few records.
* Statistical analysis of the data using describe.
* Visualization of the number of NaNs.
* Understanding the number of unique values.

#### Exploring the Target Variable
Nothing in particular; the dataset seems to be quite imbalanced, so I will get back to this part later...

#### Structuring the Datasets
Here is where everything happens. Because we have time-based data with multiple points per customer, we need to aggregate the information in a way that's practical:
* Statistical aggregation for numeric features
* Only keep the last known record for analysis
* Statistical aggregation for categorical features
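
The three aggregation ideas above can be sketched on a toy frame. This is only an illustration under assumptions: the column names mimic the competition's naming, the values are invented, and the particular statistics (`mean`, `std`, `count`, `nunique`, ...) are my choice for the example rather than what the notebook ends up using.

```python
import pandas as pd

# Toy statement-level data (invented values, two customers).
df = pd.DataFrame({
    "customer_ID": ["a", "a", "a", "b", "b"],
    "S_2": pd.to_datetime(["2017-01-31", "2017-02-28", "2017-03-31",
                           "2017-02-28", "2017-03-31"]),
    "P_2": [0.9, 0.8, 0.7, 0.2, 0.4],
    "D_63": ["CO", "CO", "CR", "CL", "CL"],
})
df = df.sort_values(["customer_ID", "S_2"])

# 1) Statistical aggregation for numeric features.
num_agg = df.groupby("customer_ID")["P_2"].agg(["mean", "std", "min", "max"])
num_agg.columns = [f"P_2_{c}" for c in num_agg.columns]

# 2) Last known record per customer (relies on the sort by statement date).
last = (df.groupby("customer_ID").tail(1)
          .set_index("customer_ID")
          .drop(columns="S_2"))

# 3) Statistical aggregation for categorical features.
cat_agg = df.groupby("customer_ID")["D_63"].agg(["count", "nunique", "last"])
cat_agg.columns = [f"D_63_{c}" for c in cat_agg.columns]

# One row per customer, ready to be joined with the labels.
agg = pd.concat([num_agg, cat_agg, last.add_suffix("_last_record")], axis=1)
print(agg)
```

Each strategy yields one row per `customer_ID`, so the results line up with the one-label-per-customer target.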

#### Feature Engineering
At this point, the only thing I can consider feature engineering is the aggregation of the datasets, as I mentioned in the previous point:
* Statistical aggregation
* Only keep the last known record for analysis

#### Label Encoding
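Because there are quite a lot of categorical variables and this is an NN model, I will use the following encoding technique:
* One-hot encoder, fit only on the train dataset and applied to test (the code in section 7 currently uses `OrdinalEncoder`, with the one-hot version left commented out)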
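
The fit-on-train, apply-to-test pattern can be sketched as follows. The data is invented, and `handle_unknown="ignore"` is an assumption on my side about how unseen test categories should be treated.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy categorical columns (invented values); the notebook uses `cat_features`.
train = pd.DataFrame({"D_63": ["CO", "CR", "CL"], "B_30": [0.0, 1.0, 0.0]})
test = pd.DataFrame({"D_63": ["CO", "XX"], "B_30": [1.0, 0.0]})  # "XX" is unseen

# Fit on the training data only, then reuse the fitted encoder on test.
# With handle_unknown="ignore", an unseen category becomes all zeros
# within that feature's block of columns.
encoder = OneHotEncoder(handle_unknown="ignore")
train_ohe = encoder.fit_transform(train).toarray()
test_ohe = encoder.transform(test).toarray()

print(encoder.categories_)
print(train_ohe.shape, test_ohe.shape)
```

Both matrices share the same column layout because it is defined by the categories seen during `fit`.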

#### Fill NaNs
At this point, just to get started, I will fill everything with zeros, which is probably not a good idea.
* Fill NaNs with 0

#### Model Development and Training
I'm going to start with an NN: in the last few competitions NN models have been working quite well, and we have a lot of data.
* Simple NN tested, layer after layer.
* I also tested a more complex NN with skip connections, which I learned from AmbrosM.

#### Predictions and Submission
Not many details here, just the simple average of all the predictions across multiple folds.
* Average predictions across 5 folds
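
As a minimal sketch of the averaging step (the per-fold numbers are invented, and `fold_predictions` is a hypothetical stand-in for the list of test predictions collected during training):

```python
import numpy as np

# Hypothetical test predictions from 5 folds for 4 test customers.
fold_predictions = [
    np.array([0.10, 0.80, 0.30, 0.50]),
    np.array([0.20, 0.90, 0.20, 0.50]),
    np.array([0.10, 0.70, 0.40, 0.50]),
    np.array([0.30, 0.80, 0.30, 0.50]),
    np.array([0.30, 0.80, 0.30, 0.50]),
]

# Stack to shape (n_folds, n_customers) and average over the fold axis.
final_prediction = np.mean(fold_predictions, axis=0)
print(final_prediction)
```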

---

## Updates
#### 05/28/2022
* Built the initial model using neural nets and a simple aggregation strategy (last data point).
* Evaluated the model and submitted it for ranking.

#### 05/29/2022
* Improve model architecture.
* Really dive deep into feature engineering (not much here; memory is a big challenge)

#### 05/30/2022
* ...

---

## Resources, Inspiration
I have taken ideas or learned quite a lot from the notebooks below; please check them out too if you like my work.

* https://www.kaggle.com/code/ambrosm/amex-keras-quickstart-1-training/notebook
* ...
* ...
* ...

---

# 1.0 Loading Model Libraries...

In [None]:
%%time
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
%%time
import datetime # ...

---

# 2.0 Setting the Notebook Parameters and Default Configuration...

In [None]:
%%time
# I like to disable my Notebook Warnings.
import warnings
warnings.filterwarnings('ignore')

In [None]:
%%time
# Notebook Configuration...

# Amount of data we want to load into the Model...
DATA_ROWS = None
# Dataframe, the amount of rows and cols to visualize...
NROWS = 50
NCOLS = 15
# Main data location path...
BASE_PATH = '...'

In [None]:
%%time
# Configure notebook display settings to use 5 decimal places so tables look nicer.
pd.options.display.float_format = '{:,.5f}'.format
pd.set_option('display.max_columns', NCOLS) 
pd.set_option('display.max_rows', NROWS)

---

# 3.0 Loading the Dataset Information (Using Feather)...

In [None]:
%%time
# Load the Feather data files and the CSV labels into Pandas DataFrames...
trn_data = pd.read_feather('../input/parquet-files-amexdefault-prediction/train_data.ftr')
trn_lbls = pd.read_csv('/kaggle/input/amex-default-prediction/train_labels.csv').set_index('customer_ID')

tst_data = pd.read_feather('../input/parquet-files-amexdefault-prediction/test_data.ftr')

In [None]:
%%time
sub = pd.read_csv('/kaggle/input/amex-default-prediction/sample_submission.csv')

---

# 4.0 Exploring the Dataset, Quick EDA...

In [None]:
%%time
# Explore the shape of the DataFrame...
trn_data.shape

In [None]:
%%time
# Display simple information of the variables in the dataset...
trn_data.info()

In [None]:
%%time
# Display the first few rows of the DataFrame...
trn_data.head()

In [None]:
%%time
# Display the Min Date...
trn_data['S_2'].min()

In [None]:
%%time
# Display the Max Date...
trn_data['S_2'].max()

In [None]:
%%time
# Generate a simple statistical summary of the DataFrame, Only Numerical...
trn_data.describe()

In [None]:
%%time
# Calculates the total number of missing values...
trn_data.isnull().sum().sum()

In [None]:
%%time
# Display the number of missing values by variable...
trn_data.isnull().sum()

In [None]:
%%time
# Display the number of unique values for each variable...
trn_data.nunique()

In [None]:
%%time
# Display the number of unique values for each variable, sorted by quantity...
trn_data.nunique().sort_values(ascending = True)

---

# 5.0 Understanding the Target Variable...

In [None]:
%%time
# Explore the shape of the DataFrame...
trn_lbls.shape

In [None]:
%%time
# Display simple information of the variables in the dataset...
trn_lbls.info()

In [None]:
%%time
# Check how well balanced the dataset is
trn_lbls['target'].value_counts()

In [None]:
%%time
# Check some statistics on the target variable
trn_lbls['target'].describe()

---

# 6.0 Structuring Data for the Model (Aggregations and More)

## 6.1 Training Dataset...

In [None]:
%%time
# We have 458913 customers and 458913 train labels...

In [None]:
%%time
# Calculates the number of statements (records) available per customer...
trn_num_statements = trn_data.groupby('customer_ID').size().sort_index()

In [None]:
%%time
# Review some of the information created...
trn_num_statements

In [None]:
%%time
# Create a new dataset based on aggregated information
trn_agg_data = (trn_data
                .groupby('customer_ID')
                .tail(1)
                .set_index('customer_ID', drop=True)
                .sort_index()
                .drop(['S_2'], axis='columns'))

# Merge the labels from the labels dataframe
trn_agg_data['target'] = trn_lbls.target
trn_agg_data['num_statements'] = trn_num_statements

trn_agg_data.reset_index(inplace = True, drop = True) # forget the customer_IDs

In [None]:
%%time
trn_agg_data.head()

---

## 6.2 Test Dataset...

In [None]:
%%time
# Calculates the number of statements (records) available per customer...
tst_num_statements = tst_data.groupby('customer_ID').size().sort_index()

In [None]:
%%time
# Create a new dataset based on aggregated information
tst_agg_data = (tst_data
                .groupby('customer_ID')
                .tail(1)
                .set_index('customer_ID', drop=True)
                .sort_index()
                .drop(['S_2'], axis='columns'))

# Attach the number of statements per customer (the test set has no labels)
tst_agg_data['num_statements'] = tst_num_statements

tst_agg_data.reset_index(inplace = True, drop = True) # forget the customer_IDs

In [None]:
%%time
tst_agg_data.head()

---

# 7.0 Label / One-Hot Encoding the Categorical Variables...

## 7.1 One Hot Encoding Configuration...

In [None]:
%%time
from sklearn.preprocessing import StandardScaler, QuantileTransformer, OneHotEncoder, OrdinalEncoder

In [None]:
%%time
# One-hot Encoding Configuration
cat_features = ['B_30', 'B_38', 'D_114', 'D_116', 'D_117', 'D_120', 'D_126', 'D_63', 'D_64', 'D_66', 'D_68']

#trn_agg_data[cat_features] = trn_agg_data[cat_features].astype(object)
trn_not_cat_features = [f for f in trn_agg_data.columns if f not in cat_features]
tst_not_cat_features = [f for f in tst_agg_data.columns if f not in cat_features]

In [None]:
%%time
trn_agg_data[cat_features].head()

In [None]:
%%time
#encoder = OneHotEncoder(drop = 'first', sparse = False, dtype = np.float32, handle_unknown = 'ignore')
encoder = OrdinalEncoder()
trn_encoded_features = encoder.fit_transform(trn_agg_data[cat_features])
#feat_names = list(encoder.get_feature_names())

## 7.2 Train Dataset One Hot Encoding...

In [None]:
%%time
# Wrap the encoded categorical features (ordinal codes) in a DataFrame
trn_encoded_features = pd.DataFrame(trn_encoded_features)
#trn_encoded_features.columns = feat_names

In [None]:
%%time
trn_agg_data = pd.concat([trn_agg_data[trn_not_cat_features], trn_encoded_features], axis = 1)
trn_agg_data.head(5)

---

## 7.3 Test Dataset One-Hot Encoding...

In [None]:
%%time
tst_agg_data[cat_features].head()

In [None]:
%%time
# Apply the encoder fitted on train to the test categorical features
tst_encoded_features = encoder.transform(tst_agg_data[cat_features])
tst_encoded_features = pd.DataFrame(tst_encoded_features)
#tst_encoded_features.columns = feat_names

In [None]:
%%time
tst_agg_data = pd.concat([tst_agg_data[tst_not_cat_features], tst_encoded_features], axis = 1)
tst_agg_data.head()

---

# 8.0 Pre-Processing the Data, Fill NaNs for model functionality...

In [None]:
%%time
# Impute missing values
trn_agg_data.fillna(value = 0, inplace = True)
tst_agg_data.fillna(value = 0, inplace = True)

---

# 9.0 Feature Selection for Baseline Model...

In [None]:
%%time
features = [f for f in trn_agg_data.columns if f != 'target' and f != 'customer_ID']

---

# 10.0 NN Development

In [None]:
%%time
# Release some memory by deleting the original DataFrames...
import gc
del trn_data, tst_data
gc.collect()

## 10.1 Loading Specific Model Libraries...

In [None]:
%%time
import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import ReduceLROnPlateau, LearningRateScheduler, EarlyStopping
from tensorflow.keras.layers import Dense, Input, InputLayer, Add, BatchNormalization, Dropout, Concatenate
from tensorflow.keras.utils import plot_model
from sklearn.metrics import log_loss

from sklearn.preprocessing import StandardScaler, RobustScaler, MinMaxScaler
import random

---

## 10.2 Amex Metric, Function...

In [None]:
%%time
# From https://www.kaggle.com/code/inversion/amex-competition-metric-python

def amex_metric(y_true, y_pred, return_components=False) -> float:
    """Amex metric for ndarrays"""
    
    def top_four_percent_captured(df) -> float:
        """Corresponds to the recall for a threshold of 4 %"""
        
        df['weight'] = df['target'].apply(lambda x: 20 if x==0 else 1)
        four_pct_cutoff = int(0.04 * df['weight'].sum())
        df['weight_cumsum'] = df['weight'].cumsum()
        df_cutoff = df.loc[df['weight_cumsum'] <= four_pct_cutoff]
        return (df_cutoff['target'] == 1).sum() / (df['target'] == 1).sum()
    
    
    def weighted_gini(df) -> float:
        df['weight'] = df['target'].apply(lambda x: 20 if x==0 else 1)
        df['random'] = (df['weight'] / df['weight'].sum()).cumsum()
        total_pos = (df['target'] * df['weight']).sum()
        df['cum_pos_found'] = (df['target'] * df['weight']).cumsum()
        df['lorentz'] = df['cum_pos_found'] / total_pos
        df['gini'] = (df['lorentz'] - df['random']) * df['weight']
        return df['gini'].sum()

    
    def normalized_weighted_gini(df) -> float:
        """Corresponds to 2 * AUC - 1"""
        
        df2 = pd.DataFrame({'target': df.target, 'prediction': df.target})
        df2.sort_values('prediction', ascending=False, inplace=True)
        return weighted_gini(df) / weighted_gini(df2)

    
    df = pd.DataFrame({'target': y_true.ravel(), 'prediction': y_pred.ravel()})
    df.sort_values('prediction', ascending=False, inplace=True)
    g = normalized_weighted_gini(df)
    d = top_four_percent_captured(df)

    if return_components: return g, d, 0.5 * (g + d)
    return 0.5 * (g + d)
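
The top-4% capture component can be checked by hand on a toy example. This is a standalone sketch that mirrors the logic of `top_four_percent_captured` with invented labels and scores; it is not a replacement for the metric above.

```python
import numpy as np
import pandas as pd

# Toy data: 10 defaults and 10 non-defaults, ranked perfectly by the scores.
y_true = np.array([1] * 10 + [0] * 10)
y_pred = np.linspace(0.99, 0.01, 20)

df = (pd.DataFrame({"target": y_true, "prediction": y_pred})
        .sort_values("prediction", ascending=False))

# Negatives count 20x, positives 1x, as in the competition metric.
df["weight"] = np.where(df["target"] == 0, 20, 1)

# 4% of the total weight: int(0.04 * (10 * 1 + 10 * 20)) = int(8.4) = 8.
cutoff = int(0.04 * df["weight"].sum())

# Share of all positives that fall inside the top-4% weighted slice.
in_top = df["weight"].cumsum() <= cutoff
captured = df.loc[in_top, "target"].sum() / df["target"].sum()
print(cutoff, captured)
```

Even with a perfect ranking only 8 of the 10 defaults fit into the slice here, so this component depends on the class mix as well as on ranking quality.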

---

## 10.3 Defining the NN Model Architecture...

## 10.3.1 Architecture 01, Simple NN

In [None]:
%%time
def nn_model():
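    '''
    Simple feed-forward NN: a stack of Dense + BatchNormalization layers
    with a sigmoid output for the default probability.
    '''
    regularization = 4e-4
    activation_func = 'swish'
    inputs = Input(shape = (len(features),))  # shape must be a tuple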
    
    x = Dense(256, 
              #use_bias  = True, 
              kernel_regularizer = tf.keras.regularizers.l2(regularization), 
              activation = activation_func)(inputs)
    
    x = BatchNormalization()(x)
    
    x = Dense(64, 
              #use_bias  = True, 
              kernel_regularizer = tf.keras.regularizers.l2(regularization), 
              activation = activation_func)(x)
    
    x = BatchNormalization()(x)
    
    x = Dense(64, 
          #use_bias  = True, 
          kernel_regularizer = tf.keras.regularizers.l2(regularization), 
          activation = activation_func)(x)
    
    x = BatchNormalization()(x)

    x = Dense(32, 
              #use_bias  = True, 
              kernel_regularizer = tf.keras.regularizers.l2(regularization), 
              activation = activation_func)(x)
    
    x = BatchNormalization()(x)

    x = Dense(1, 
              #use_bias  = True, 
              #kernel_regularizer = tf.keras.regularizers.l2(regularization),
              activation = 'sigmoid')(x)
    
    model = Model(inputs, x)
    
    return model

---

## 10.3.2 Architecture 02, Concatenated NN

In [None]:
%%time
def nn_model():
    regularization = 4e-4
    activation_func = 'swish'
    inputs = Input(shape = (len(features),))  # shape must be a tuple

    x0 = Dense(256,
               kernel_regularizer = tf.keras.regularizers.l2(regularization), 
               activation = activation_func)(inputs)
    x1 = Dense(128,
               kernel_regularizer = tf.keras.regularizers.l2(regularization),
               activation = activation_func)(x0)
    x1 = Dense(64,
               kernel_regularizer = tf.keras.regularizers.l2(regularization),
               activation = activation_func)(x1)
    x1 = Dense(32,
           kernel_regularizer = tf.keras.regularizers.l2(regularization),
           activation = activation_func)(x1)
    
    x1 = Concatenate()([x1, x0])
    x1 = Dropout(0.1)(x1)
    
    x1 = Dense(16, kernel_regularizer=tf.keras.regularizers.l2(regularization),activation=activation_func,)(x1)
    
    x1 = Dense(1, 
              #kernel_regularizer=tf.keras.regularizers.l2(regularization),
              activation='sigmoid')(x1)
    
    model = Model(inputs, x1)
    
    return model
    

---

## 10.4 Visualizing the Model Structure...

In [None]:
%%time
architecture = nn_model()
architecture.summary()

In [None]:
%%time
plot_model(nn_model(), show_layer_names = False, show_shapes = True, dpi = 60)

---

## 10.5 Defining Model Training Parameters...

In [None]:
%%time
# Defining model parameters...
BATCH_SIZE         = 2048
EPOCHS             = 192 
EPOCHS_COSINEDECAY = 192 
DIAGRAMS           = True
USE_PLATEAU        = False
INFERENCE          = False
VERBOSE            = 0 
TARGET             = 'target'

---

## 10.6 Defining the Model Training Configuration...

In [None]:
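%%time
# Defining model training function...
def fit_model(X_train, y_train, X_val, y_val, run = 0):
    '''
    Scale the data, train one model on a fold, score it on the validation
    split and append the test predictions to the global `predictions` list.
    '''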
    lr_start = 0.01
    start_time = datetime.datetime.now()
    
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)

    epochs = EPOCHS    
    lr = ReduceLROnPlateau(monitor = 'val_loss', factor = 0.7, patience = 4, verbose = VERBOSE)
    es = EarlyStopping(monitor = 'val_loss',patience = 12, verbose = 1, mode = 'min', restore_best_weights = True)
    tm = tf.keras.callbacks.TerminateOnNaN()
    callbacks = [lr, es, tm]
    
    # Cosine Learning Rate Decay
    if not USE_PLATEAU:
        epochs = EPOCHS_COSINEDECAY
        lr_end = 0.0002

        def cosine_decay(epoch):
            if epochs > 1:
                w = (1 + math.cos(epoch / (epochs - 1) * math.pi)) / 2
            else:
                w = 1
            return w * lr_start + (1 - w) * lr_end
        
        lr = LearningRateScheduler(cosine_decay, verbose = 0)
        callbacks = [lr, tm]
    
    # Model Initialization...
    model = nn_model()
    optimizer_func = tf.keras.optimizers.Adam(learning_rate = lr_start)
    loss_func = tf.keras.losses.BinaryCrossentropy()
    model.compile(optimizer = optimizer_func, loss = loss_func)
    
    
    X_val = scaler.transform(X_val)
    validation_data = (X_val, y_val)
    
    history = model.fit(X_train, 
                        y_train, 
                        validation_data = validation_data, 
                        epochs          = epochs,
                        verbose         = VERBOSE,
                        batch_size      = BATCH_SIZE,
                        shuffle         = True,
                        callbacks       = callbacks
                       )
    
    history_list.append(history.history)
    
    print(f'Training Loss: {history_list[-1]["loss"][-1]:.5f}, Validation Loss: {history_list[-1]["val_loss"][-1]:.5f}')
    callbacks, es, lr, tm, history = None, None, None, None, None
    
    
    y_val_pred = model.predict(X_val, batch_size = BATCH_SIZE, verbose = VERBOSE).ravel()
    amex_score = amex_metric(y_val.values, y_val_pred, return_components = False)
    
    print(f'Fold {run}.{fold} | {str(datetime.datetime.now() - start_time)[-12:-7]}'
          f'| Amex Score: {amex_score:.5f}')
    
    print('')
    
    score_list.append(amex_score)
    
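    # Scale the test features with this fold's scaler and store flat (1-D) predictions...
    tst_data_scaled = scaler.transform(tst_agg_data[features])
    tst_pred = model.predict(tst_data_scaled, batch_size = BATCH_SIZE, verbose = VERBOSE).ravel()
    predictions.append(tst_pred)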
    
    return model
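
To see what the cosine schedule inside `fit_model` produces, here is a standalone sketch of the same formula. The default values (`epochs = 192`, `lr_start = 0.01`, `lr_end = 0.0002`) are taken from the configuration above; turning them into function arguments is my own change for illustration.

```python
import math

def cosine_decay(epoch, epochs=192, lr_start=0.01, lr_end=0.0002):
    """Learning rate for a given epoch, decaying from lr_start to lr_end."""
    if epochs > 1:
        # w goes smoothly from 1 (first epoch) to 0 (last epoch).
        w = (1 + math.cos(epoch / (epochs - 1) * math.pi)) / 2
    else:
        w = 1
    return w * lr_start + (1 - w) * lr_end

# Learning rate at the start, around the middle and at the end of training.
for epoch in (0, 95, 191):
    print(epoch, f"{cosine_decay(epoch):.6f}")
```

The schedule starts at `lr_start`, ends at `lr_end`, and changes most quickly around the middle of training.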

---

## 10.7 Creating a Model Training Loop and Cross Validating in 5 Folds... 

In [None]:
%%time
from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score, roc_curve
import math

# Create empty lists to store NN information...
history_list = []
score_list   = []
predictions  = []

# Define kfolds for training purposes...
kf = KFold(n_splits = 5)

for fold, (trn_idx, val_idx) in enumerate(kf.split(trn_agg_data)):
    X_train, X_val = trn_agg_data.iloc[trn_idx][features], trn_agg_data.iloc[val_idx][features]
    y_train, y_val = trn_agg_data.iloc[trn_idx][TARGET], trn_agg_data.iloc[val_idx][TARGET]
    
    fit_model(X_train, y_train, X_val, y_val)
    
print(f'Mean Amex Score across folds: {np.mean(score_list):.5f}')

---

# 11.0 Model Prediction and Submissions

In [None]:
%%time
sub.head()

In [None]:
%%time
sub['prediction'] = np.array(predictions).mean(axis = 0)

In [None]:
%%time
sub.to_csv('my_submission.csv', index = False)

In [None]:
%%time
sub.head()

---