# ⚡⚡ PyTorch Quickstart for the American Express - Default Prediction competition
This notebook shows how to define and train a PyTorch LSTM that leverages the time-series structure of the data.

I expect Deep Learning models to dominate in this competition, so here's a simple LSTM architecture.

The hyperparameters were barely tuned, so this baseline leaves room for improvement.

**Please consider upvoting if you find this work helpful. Don't fork without upvoting!**



## Why PyTorch Lightning?

Lightning is simply organized PyTorch code. There's NO new framework to learn. For more details about Lightning visit the repo:

https://github.com/PyTorchLightning/pytorch-lightning

Run on CPUs, GPU clusters, or TPUs without any code changes.



# Imports


In [None]:
import pandas as pd
import gc
import numpy as np

# Torch and Sklearn
import pytorch_lightning as pl
import torch
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from torch.utils.data import DataLoader, TensorDataset
from torchmetrics import Metric
from pytorch_lightning.loggers import TensorBoardLogger
from pytorch_lightning.callbacks import ModelCheckpoint
from pytorch_lightning import Trainer
import torch.nn.functional as F
from torch import nn, Tensor
from torchmetrics.utilities import rank_zero_warn

# Typing 
from typing import Optional

In [None]:
# File system
train_file   = "../input/amex-data-integer-dtypes-100k-cid-per-chunk/train_chunk_0.parquet"
train_labels = "../input/amex-default-prediction/train_labels.csv"
model_output_folder = './experiment'

# Data
batch_size   = 1028
num_workers  = 4
epochs = 5

# Model 
in_features=188
hidden_dim=128
num_layers=2
learning_rate=1e-3

# Data

Reading and preprocessing the data

We read the data from @raddar's [dataset](https://www.kaggle.com/datasets/raddar/amex-data-integer-dtypes-parquet-format), which I split into chunks in this [dataset](https://www.kaggle.com/datasets/what5up/amex-data-integer-dtypes-100k-cid-per-chunk). @raddar has denoised the data, so we can achieve better results with his dataset than with the original competition CSV files.

We also convert the dataframe into a 3D-tensor dataset as highlighted by Chris Deotte [here](https://www.kaggle.com/competitions/amex-default-prediction/discussion/327828) 
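
As an illustration of the padding and reshaping idea, here is a minimal sketch on toy data. The column names `f1` and `f2`, the two toy customers, and the shortened sequence length `SEQ_LEN = 3` are assumptions made for readability; the notebook itself works with 13 statements and 188 features per customer.

```python
import numpy as np
import pandas as pd

SEQ_LEN = 3  # illustrative only; the notebook uses 13 statements per customer

# Toy data: customer "a" has 3 statements, customer "b" has only 1.
df = pd.DataFrame({
    "customer_ID": ["a", "a", "a", "b"],
    "S_2": pd.to_datetime(["2017-01-01", "2017-02-01", "2017-03-01", "2017-03-01"]),
    "f1": [1.0, 2.0, 3.0, 4.0],
    "f2": [10.0, 20.0, 30.0, 40.0],
})

# One empty padding row per missing statement.
counts = df.groupby("customer_ID")["S_2"].count()
missing = [cid for cid, n in counts.items() for _ in range(SEQ_LEN - n)]
padding = pd.DataFrame({"customer_ID": missing, "S_2": pd.NaT})

# Sort by customer, then by date. Padding rows (NaT) come first,
# so the most recent statement is the last time step of each sequence.
padded = (pd.concat([padding, df], ignore_index=True)
          .sort_values(["customer_ID", "S_2"], na_position="first"))

# Fill the padded feature values with 0 and reshape to
# (n_customers, SEQ_LEN, n_features).
features = ["f1", "f2"]
x = padded[features].fillna(0).to_numpy().reshape(-1, SEQ_LEN, len(features))

print(x.shape)
print(x[1])  # customer "b": two zero-padded steps, then the real statement
```

The reshape only works because every customer has exactly `SEQ_LEN` rows after padding and the rows are grouped by customer.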



In [None]:
def load_train_df(train_file, train_labels):
    train = pd.read_parquet(train_file)
    train['S_2'] = pd.to_datetime(train['S_2'])
    tmp = train[['customer_ID','S_2']].groupby('customer_ID').count()

    missing_cids = []
    for nb_available_rows in range(1, 14):
        cids = tmp[tmp['S_2'] == nb_available_rows].index.values
        batch_missing_cids = [cid for cid in cids for _ in range(13 - nb_available_rows)]
        missing_cids.extend(batch_missing_cids)

    train_part2 = train.iloc[:len(missing_cids)].copy()
    train_part2.loc[:] = np.nan
    train_part2['customer_ID'] = missing_cids

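    # Padding rows are concatenated with the real statements
    train = pd.concat([train_part2, train])

    train_labels = pd.read_csv(train_labels)
    train = pd.merge(train, train_labels, how='inner', on='customer_ID')

    # Sort by customer, then by statement date. Padding rows (NaT dates) come first,
    # so the most recent statement is always the last of the 13 time steps.
    train = train.sort_values(['customer_ID', 'S_2'], na_position='first')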
    return train

train_df = load_train_df(train_file, train_labels)

In [None]:
class DataModule(pl.LightningDataModule):

    def __init__(self, all_data: pd.DataFrame, batch_size: int = batch_size, num_workers: int = num_workers):
        super().__init__()
        self.all_data = all_data
        self.batch_size = batch_size
        self.num_workers = num_workers
        self.sc = StandardScaler()

    def prepare_data(self):
        pass

    def setup(self, stage=None):
        # All data columns except customer_ID, S_2, and target are features
        features = self.all_data.columns[2:-1]
        self.all_data[features] = self.sc.fit_transform(self.all_data[features])
        self.all_data[features] = self.all_data[features].fillna(0)
        
        # https://www.kaggle.com/competitions/amex-default-prediction/discussion/327828 !! Many Thanks @Chris Deotte for your sharing
        all_tensor_x = torch.reshape(torch.tensor(self.all_data[features].to_numpy()), (-1, 13, 188)).float()
        all_tensor_y = torch.tensor(self.all_data.groupby('customer_ID').first()['target'].to_numpy()).float()

        X_trainval, X_test, y_trainval, y_test = train_test_split(all_tensor_x, all_tensor_y, test_size=0.1, random_state=1)
        X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.1, random_state=1)

        # TRAIN
        self.train_tensor = TensorDataset(X_train, y_train)
        # VAL
        self.val_tensor = TensorDataset(X_val, y_val)
        # TEST
        self.test_tensor = TensorDataset(X_test, y_test)

    def train_dataloader(self):
        # Reshuffle the training set at every epoch
        return DataLoader(self.train_tensor, batch_size=self.batch_size, num_workers=self.num_workers, shuffle=True)

    def val_dataloader(self):
        return DataLoader(self.val_tensor, batch_size=self.batch_size, num_workers=self.num_workers)

    def test_dataloader(self):
        return DataLoader(self.test_tensor, batch_size=self.batch_size, num_workers=self.num_workers)

# Metrics : [Implementation Source](https://www.kaggle.com/code/rohanrao/amex-competition-metric-implementations)
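
To make the metric easier to follow, here is a NumPy sketch that mirrors the steps of the torch implementation used in this notebook: the mean of the normalized weighted Gini `g` and the default rate captured in the top 4% `d`, with negatives weighted 20 and positives weighted 1. The function name `amex_metric_np` and the toy data are illustrative assumptions, not part of the competition code.

```python
import numpy as np

def amex_metric_np(y_true, y_pred):
    """NumPy sketch of the competition metric: 0.5 * (g + d)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n_pos = y_true.sum()
    n_neg = y_true.shape[0] - n_pos

    # Sort targets by descending prediction
    order = np.argsort(-y_pred, kind="stable")
    target = y_true[order]

    # Negatives weigh 20, positives weigh 1
    weight = 20.0 - 19.0 * target
    cum_norm_weight = np.cumsum(weight / weight.sum())

    # d: share of positives found within the top 4% of cumulative weight
    d = target[cum_norm_weight <= 0.04].sum() / n_pos

    # g: weighted Gini normalized by its value under a perfect ranking
    lorentz = np.cumsum(target / n_pos)
    gini = ((lorentz - cum_norm_weight) * weight).sum()
    gini_max = 10 * n_neg * (1 - 19 / (n_pos + 20 * n_neg))
    g = gini / gini_max

    return 0.5 * (g + d)

# Toy check: 2 defaults among 100 customers
y_true = np.zeros(100)
y_true[:2] = 1.0
perfect = amex_metric_np(y_true, np.linspace(1.0, 0.0, 100))         # defaults ranked first
reversed_score = amex_metric_np(y_true, np.linspace(0.0, 1.0, 100))  # defaults ranked last
print(perfect, reversed_score)
```

In this toy case a perfect ranking scores 1.0 up to floating-point error, because both defaults fall inside the top 4% and the Gini equals its maximum.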

In [None]:
## https://www.kaggle.com/code/rohanrao/amex-competition-metric-implementations

class AmexMetric(Metric):
    is_differentiable: Optional[bool] = False

    # Set to True if the metric reaches its optimal value when it is maximized.
    # Set to False if it reaches its optimal value when it is minimized.
    higher_is_better: Optional[bool] = True

    # Set to True if the metric during 'update' requires access to the global metric
    # state for its calculations. If not, setting this to False indicates that all
    # batch states are independent and we will optimize the runtime of 'forward'
    full_state_update: bool = True

    def __init__(self):
        super().__init__()
        
        self.add_state("all_true", default=[], dist_reduce_fx="cat")
        self.add_state("all_pred", default=[], dist_reduce_fx="cat")

        rank_zero_warn(
            "Metric `Amex` will save all targets and predictions in buffer."
            " For large datasets this may lead to large memory footprint."
        )

    def update(self, y_pred: torch.Tensor, y_true: torch.Tensor):
        
        y_true = y_true.double()
        y_pred = y_pred.double()
        
        self.all_true.append(y_true)
        self.all_pred.append(y_pred)
        
    def compute(self):
        y_true = torch.cat(self.all_true)
        y_pred = torch.cat(self.all_pred)
        # count of positives and negatives
        n_pos = y_true.sum()
        n_neg = y_pred.shape[0] - n_pos

        # sorting by descending prediction values
        indices = torch.argsort(y_pred, dim=0, descending=True)
        preds, target = y_pred[indices], y_true[indices]

        # filter the top 4% by cumulative row weights
        weight = 20.0 - target * 19.0
        cum_norm_weight = (weight / weight.sum()).cumsum(dim=0)
        four_pct_filter = cum_norm_weight <= 0.04

        # default rate captured at 4%
        d = target[four_pct_filter].sum() / n_pos

        # weighted gini coefficient
        lorentz = (target / n_pos).cumsum(dim=0)
        gini = ((lorentz - cum_norm_weight) * weight).sum()

        # max weighted gini coefficient
        gini_max = 10 * n_neg * (1 - 19 / (n_pos + 20 * n_neg))

        # normalized weighted gini coefficient
        g = gini / gini_max
        
        return 0.5 * (g + d)


def amex_metric(y_true: pd.DataFrame, y_pred: pd.DataFrame) -> float:
    def top_four_percent_captured(y_true: pd.DataFrame, y_pred: pd.DataFrame) -> float:
        df = (pd.concat([y_true, y_pred], axis='columns')
              .sort_values('prediction', ascending=False))
        df['weight'] = df['target'].apply(lambda x: 20 if x == 0 else 1)
        four_pct_cutoff = int(0.04 * df['weight'].sum())
        df['weight_cumsum'] = df['weight'].cumsum()
        df_cutoff = df.loc[df['weight_cumsum'] <= four_pct_cutoff]
        return (df_cutoff['target'] == 1).sum() / (df['target'] == 1).sum()

    def weighted_gini(y_true: pd.DataFrame, y_pred: pd.DataFrame) -> float:
        df = (pd.concat([y_true, y_pred], axis='columns')
              .sort_values('prediction', ascending=False))
        df['weight'] = df['target'].apply(lambda x: 20 if x == 0 else 1)
        df['random'] = (df['weight'] / df['weight'].sum()).cumsum()
        total_pos = (df['target'] * df['weight']).sum()
        df['cum_pos_found'] = (df['target'] * df['weight']).cumsum()
        df['lorentz'] = df['cum_pos_found'] / total_pos
        df['gini'] = (df['lorentz'] - df['random']) * df['weight']
        return df['gini'].sum()

    def normalized_weighted_gini(y_true: pd.DataFrame, y_pred: pd.DataFrame) -> float:
        y_true_pred = y_true.rename(columns={'target': 'prediction'})
        return weighted_gini(y_true, y_pred) / weighted_gini(y_true, y_true_pred)

    g = normalized_weighted_gini(y_true, y_pred)
    d = top_four_percent_captured(y_true, y_pred)

    return 0.5 * (g + d)

# Model

In [None]:
class LSTMClassifier(nn.Module):
    """Very simple implementation of LSTM-based time-series classifier."""

    def __init__(self, input_dim, hidden_dim, num_layers, output_dim, device):
        super().__init__()
        self.num_layers = num_layers
        self.hidden_dim = hidden_dim
        self.rnn = nn.LSTM(input_dim, hidden_dim, num_layers, batch_first=True)
        self.fc1 = nn.Linear(hidden_dim, 100)
        self.fc2 = nn.Linear(100, output_dim)
        self.device = device

    def forward(self, x):
        h0, c0 = self.init_hidden(x)
        out, (_, _) = self.rnn(x, (h0, c0))
        out = F.relu(self.fc1(out[:, -1, :]))
        out = torch.sigmoid(self.fc2(out))
        return out

    def init_hidden(self, x):
        # Zero initial states, created on the same device and dtype as the input
        batch_size = x.size(0)
        h0 = torch.zeros(self.num_layers, batch_size, self.hidden_dim, device=x.device, dtype=x.dtype)
        c0 = torch.zeros(self.num_layers, batch_size, self.hidden_dim, device=x.device, dtype=x.dtype)
        return h0, c0


class TsLstmLightning(pl.LightningModule):
    def __init__(self, in_features, hidden_dim, num_layers, learning_rate):
        super(TsLstmLightning, self).__init__()

        self.learning_rate = learning_rate

        self.train_amex_metric = AmexMetric()
        self.val_amex_metric   = AmexMetric()

        self.model = LSTMClassifier(in_features, hidden_dim, num_layers, 1, device = self.device)

        self.num_parameters = count_parameters(self.model)
        
        print(f"Trainable params: {self.num_parameters:,}")

        self.loss_fn = nn.BCELoss(reduction="mean")

    def forward(self, x):
        res = self.model(x)
        return res

    def training_step(self, batch, batch_idx):
        X, target = batch
        preds = self(X)  # (batch_size, 1)
        preds = preds.squeeze(1)

        loss = self.loss_fn(preds, target)
        
        self.train_amex_metric.update(preds, target) 

        self.log_dict({'train_loss': loss, 'train_amex_metric': self.train_amex_metric }, on_step=True, on_epoch=True, prog_bar=True, logger=True)

        return {'loss': loss}

    def validation_step(self, batch, batch_idx):
        with torch.no_grad():
            X, target = batch
            preds = self(X)  
            preds = preds.squeeze(1)

            loss = self.loss_fn(preds, target)
            
            self.val_amex_metric.update(preds, target)

            self.log_dict({'val_loss': loss, 'val_amex_metric': self.val_amex_metric}, on_step=True, on_epoch=True, prog_bar=True, logger=True)

            return {'loss': loss}


    def predict_step(self, batch, batch_idx, dataloader_idx=None):
        with torch.no_grad():
            X = batch[0]
            preds = self(X)
            return preds.detach().cpu()

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=self.learning_rate)
        lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1)
        return [optimizer], [lr_scheduler]


def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Training Loop 

In [None]:
print(f"Train Shape: {train_df.shape}")
dm = DataModule(train_df, batch_size=batch_size)

model = TsLstmLightning(in_features=in_features, hidden_dim=hidden_dim, num_layers=num_layers, learning_rate=learning_rate)

logger = TensorBoardLogger(model_output_folder, name=f"logs", default_hp_metric=True)

checkpoint_callback = ModelCheckpoint(dirpath= "mycheckpoints", save_top_k=1, save_weights_only=True, save_last=False, verbose=True,
                                      monitor='val_loss_epoch', mode='min')

callbacks = [checkpoint_callback]

trainer = Trainer(
    gpus=[0] if torch.cuda.is_available() else None,
    max_epochs=epochs,
    benchmark=False,
    deterministic=True,
    callbacks=callbacks,
    logger=logger)

trainer.fit(model, dm)

In [None]:
# TODO : TPU SUPPORT --> https://www.kaggle.com/code/justusschock/pytorch-on-tpu-with-pytorch-lightning/notebook
# !curl https://raw.githubusercontent.com/pytorch/xla/master/contrib/scripts/env-setup.py -o pytorch-xla-env-setup.py
# !python pytorch-xla-env-setup.py --version 1.7 --apt-packages libomp5 libopenblas-dev

# Train Monitoring


In [None]:
# TensorBoard is temporarily disabled on Kaggle (https://www.kaggle.com/product-feedback/89671).
# Download the logs and run this on your own computer to follow the metrics.
%load_ext tensorboard
%tensorboard --logdir {model_output_folder}

# Possible Next Steps


1. Handle missing values better (e.g., do not drop customers that have fewer than 13 records, do not fill N/A with 0, replace previous -1 values with N/A, ...)
2. Enhance the model (more LSTM layers / 1D CNN / Transformers / hyperparameter optimisation / ...)
3. Better understand the predictive features through feature engineering (feature selection or permutation feature importance, for instance)
4. Combine 1 and 2: unsupervised Transformer pre-training on time series (https://arxiv.org/abs/2010.02803)
5. Model ensemble with other classification techniques
6. Enable Cross-validation (Stratified K-Fold, ...)
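
As a starting point for the cross-validation item, here is a minimal sketch using scikit-learn's `StratifiedKFold` on customer-level arrays. The random arrays stand in for the `(n_customers, 13, 188)` tensor and the per-customer targets built in `DataModule.setup`; the sizes, the 25% default rate, and the variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Stand-in data: one 13-step sequence and one binary target per customer
rng = np.random.default_rng(0)
n_customers, seq_len, n_features = 200, 13, 188
X = rng.normal(size=(n_customers, seq_len, n_features)).astype(np.float32)
y = (rng.random(n_customers) < 0.25).astype(np.float32)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

folds = []
# split() only needs the number of samples and the labels to stratify on,
# so a dummy 1D array is passed in place of the 3D feature tensor.
for fold, (train_idx, val_idx) in enumerate(skf.split(np.zeros(n_customers), y)):
    X_train, y_train = X[train_idx], y[train_idx]
    X_val, y_val = X[val_idx], y[val_idx]
    folds.append((train_idx, val_idx))
    print(f"fold {fold}: train={len(train_idx)} val={len(val_idx)} "
          f"val default rate={y_val.mean():.3f}")
    # A fresh model and Trainer would be fitted here for each fold.
```

Each customer lands in exactly one validation fold, and stratification keeps the default rate roughly constant across folds.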
