# Baseline model - XGBoost

This notebook applies a few simple prediction techniques to get a feel for how the classification challenge works.

Inspired by @ambrosm's [AMEX EDA which makes sense](https://www.kaggle.com/code/ambrosm/amex-eda-which-makes-sense). Parts of the analysis are also drawn from @cdeotte's [XGBoost starter](https://www.kaggle.com/code/cdeotte/xgboost-starter-0-793/notebook).

## Objectives
* Build baseline model
* Take a reduced dataset with training variables
* Add the competition evaluation metric to the XGBoost model

## Updates
* `eval_metric`: now uses `aucpr` (area under the precision-recall curve)

In [None]:
# Import packages
import gc
import os
import pickle

import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# XGBoost: native API plus the scikit-learn wrapper and plotting helper
import xgboost as xgb
from xgboost import XGBClassifier, plot_importance

import cupy, cudf # GPU libraries

In [None]:
print('RAPIDS version',cudf.__version__)

# The data

The dataset for this competition is large. If you read the original CSV files, the data barely fits into memory, so we read the data from @munumbutt's [AMEX-Feather-Dataset](https://www.kaggle.com/datasets/munumbutt/amexfeather) instead. In this [Feather](https://arrow.apache.org/docs/python/feather.html) file, the floating-point precision has been reduced from 64 bit to 16 bit. Reading a Feather file is also faster than reading a CSV file because the Feather format is binary.
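As a rough illustration of why the reduced precision matters, the sketch below casts a synthetic float64 column to float16 and compares memory use and rounding error. The column name and values are made up for illustration; the Feather dataset was prepared by its author and may have used a different procedure.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for one numeric feature (hypothetical name and values)
rng = np.random.default_rng(0)
demo = pd.DataFrame({"feature_demo": rng.random(1_000_000)})  # float64 by default

# Memory used by the column before and after the downcast
bytes_64 = demo["feature_demo"].memory_usage(index=False)
demo_16 = demo.astype({"feature_demo": np.float16})
bytes_16 = demo_16["feature_demo"].memory_usage(index=False)
print(f"float64: {bytes_64:,} bytes, float16: {bytes_16:,} bytes")

# Largest rounding error introduced by the downcast
diff = demo["feature_demo"] - demo_16["feature_demo"].astype(np.float64)
max_err = float(np.abs(diff).max())
print(f"max absolute rounding error: {max_err:.6f}")
```

The trade-off is a four-fold smaller column in exchange for only about three significant decimal digits, which is usually acceptable for tree-based models that split on thresholds.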

There are about 5.5 million rows of training data and about 11 million rows of test data.

In [None]:
# Original import strategy
# %%time
# train = pd.read_feather('../input/amexfeather/train_data.ftr')
# test = pd.read_feather('../input/amexfeather/test_data.ftr')
# with pd.option_context("display.min_rows", 6):
#     display(train)
#     display(test)

In [None]:
# Fill value for NaNs, used by read_file_int below
NAN_VALUE = -99 # will fit in int8

In [None]:
# Reader functions with garbage collection to help release memory after data processing
# GPU version using cuDF. At present this only works for integer-only features.
def read_file_int(path = '', usecols = None):
    # LOAD DATAFRAME
    if usecols is not None: df = cudf.read_feather(path, columns=usecols)
    else: df = cudf.read_feather(path)
    # REDUCE DTYPE FOR CUSTOMER AND DATE
#   df['customer_ID'] = df['customer_ID'].str[-16:].str.hex_to_int().astype('int64')
    df.S_2 = cudf.to_datetime(df.S_2)
    # CREATE OVERALL ROW MISS VALUE
    features = [x for x in df.columns.values if x not in ['customer_ID', 'target']]
    df['n_missing'] = df[features].isna().sum(axis=1)
    # FILL NAN
    df = df.fillna(NAN_VALUE) 
    # KEEP ONLY THE LAST STATEMENT PER CUSTOMER UNTIL FUTURE TIME SERIES WORK BEGINS
    df_out = df.groupby(['customer_ID']).nth(-1).reset_index(drop=True)
    print('shape of data:', df_out.shape)
    del df
    _ = gc.collect()
    return df_out

# CPU version, used so that the categorical features are read with pandas
def read_file_cpu(path = '', usecols = None):
    # LOAD DATAFRAME
    if usecols is not None: df = pd.read_feather(path, columns=usecols)
    else: df = pd.read_feather(path)
    # REDUCE DTYPE FOR CUSTOMER AND DATE
#   df['customer_ID'] = df['customer_ID'].str[-16:].str.hex_to_int().astype('int64')
    df.S_2 = pd.to_datetime(df.S_2)
    # CREATE OVERALL ROW MISS VALUE
    features = [x for x in df.columns.values if x not in ['customer_ID', 'target']]
    df['n_missing'] = df[features].isna().sum(axis=1)
    # FILL NAN
#     features_num = [x for x in df._get_numeric_data().columns.values if x not in ['customer_ID', 'target']]
#     df = df[features_num].fillna(NAN_VALUE) 
    # KEEP ONLY THE LAST STATEMENT PER CUSTOMER UNTIL FUTURE TIME SERIES WORK BEGINS
    df_out = df.groupby(['customer_ID']).nth(-1).reset_index(drop=True)
    print('shape of data:', df_out.shape)
    del df
    _ = gc.collect()
    return df_out

In [None]:
print('Reading train data...')
TRAIN_PATH = '../input/amexfeather/train_data.ftr'
train_df = read_file_cpu(path = TRAIN_PATH)

print('Reading test data...')
TEST_PATH = '../input/amexfeather/test_data.ftr'
test_df = read_file_cpu(path = TEST_PATH)

The target column of the train dataframe corresponds to the target column of train_labels.csv. In the csv file of the train data, there is no target column; it has been joined into the Feather file as a convenience.

S_2 is the statement date. All train statement dates fall between March 2017 and March 2018 (13 months), and no statement dates are missing. All test statement dates fall between April 2018 and October 2019, so the statement dates of train and test don't overlap.
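A quick way to confirm this kind of claim is to compare the date ranges directly. The sketch below uses two tiny hand-made frames in place of the real data (the frame names and dates are illustrative only). On the real data the same comparison would need the full statement-level `S_2` columns, before the reader functions reduce each customer to a single row.

```python
import pandas as pd

# Hand-made stand-ins for the statement dates of train and test (illustrative values)
train_demo = pd.DataFrame({"S_2": pd.to_datetime(["2017-03-01", "2017-11-15", "2018-03-31"])})
test_demo = pd.DataFrame({"S_2": pd.to_datetime(["2018-04-01", "2019-01-20", "2019-10-31"])})

train_min, train_max = train_demo["S_2"].min(), train_demo["S_2"].max()
test_min, test_max = test_demo["S_2"].min(), test_demo["S_2"].max()

print(f"train: {train_min.date()} to {train_max.date()}")
print(f"test:  {test_min.date()} to {test_max.date()}")

# The ranges are disjoint when the latest train date precedes the earliest test date
no_overlap = train_max < test_min
print("no overlap:", no_overlap)
```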

In [None]:
# Understanding the file size of one file
from humanize import naturalsize
size = train_df.memory_usage(deep=True).sum()
print(size)
print(naturalsize(size))

In [None]:
print(f'Train data memory usage: {naturalsize(train_df.memory_usage(deep=True).sum())} ')
print(f'Test data memory usage:  {naturalsize(test_df.memory_usage(deep=True).sum())}')

## Feature Analysis

Removing many of the highly correlated features reduced model performance, so for the time being the features are retained. More analysis is required to understand whether this feature reduction makes sense.

In [None]:
# Correlation matrix (numeric columns only; dates and categoricals are excluded)
corr = train_df.select_dtypes(include='number').corr()
# Mask the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))
# Add the mask to the heatmap
fig, ax = plt.subplots(1, 1, figsize=(20,14))
sns.heatmap(corr, mask=mask, center=0, linewidths=1, annot=True, fmt=".2f", ax=ax)
plt.show()

In [None]:
# Remove highly correlated features
corr_matrix = corr.abs()
# Create a boolean mask and apply it
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
tri_df = corr_matrix.mask(mask)

# List column names of highly correlated features (|r| > 0.7)
to_drop = [c for c in tri_df.columns if any(tri_df[c] > 0.7)]
print(f'Number of features: {len(to_drop)} \n {to_drop}')

In [None]:
# Remove the highly correlated features
# train_df = train_df.drop(to_drop, axis=1)
# train_df.shape

## Dataset prepared for analysis

In [None]:
target = train_df['target']
train_df = train_df.drop(['target'], axis=1)
train_df.shape

### Competition metric performance

The numpy metric for evaluation has been taken from @rohanrao [AMEX: Competition Metric Implementations](https://www.kaggle.com/code/rohanrao/amex-competition-metric-implementations)
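As a compact summary of what the function below computes (the symbols are introduced here for illustration and follow the variable names in the code): with rows sorted by descending prediction, $y_i$ the label of row $i$ and $n_{pos}$ the number of defaults,

$$
w_i = 20 - 19\,y_i, \qquad
c_i = \frac{\sum_{j \le i} w_j}{\sum_{j} w_j}, \qquad
D = \frac{1}{n_{pos}} \sum_{i:\, c_i \le 0.04} y_i, \qquad
M = \tfrac{1}{2}\,(G + D)
$$

Here $D$ is the share of all defaults captured in the top 4% of weighted rows, $G$ is the weighted Gini coefficient divided by its maximum attainable value (`gini / gini_max` in the code), and $M$ is the value the function returns.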

In [None]:
def amex_metric_numpy(y_true: np.ndarray, y_pred: np.ndarray) -> float:

    # count of positives and negatives
    n_pos = np.sum(y_true)
    n_neg = y_true.shape[0] - n_pos

    # sorting by descending prediction values
    indices = np.argsort(y_pred)[::-1]
    preds, target = y_pred[indices], y_true[indices]

    # filter the top 4% by cumulative row weights
    weight = 20.0 - target * 19.0
    cum_norm_weight = (weight / weight.sum()).cumsum()
    four_pct_mask = cum_norm_weight <= 0.04

    # default rate captured at 4%
    d = np.sum(target[four_pct_mask]) / n_pos

    # weighted gini coefficient
    lorentz = (target / n_pos).cumsum()
    gini = ((lorentz - cum_norm_weight) * weight).sum()

    # max weighted gini coefficient
    gini_max = 10 * n_neg * (1 - 19 / (n_pos + 20 * n_neg))

    # normalized weighted gini coefficient
    g = gini / gini_max

    return 0.5 * (g + d)

# Building baseline

This model provides an early building block for what to expect from the challenge. For initial model discovery, only the last entry for each customer_ID is used.
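To illustrate what "last entry per customer" means, here is a minimal sketch on a toy frame (the IDs and values are hypothetical). It sorts explicitly and uses `groupby(...).tail(1)`, which is a different call from the `nth(-1)` used in the reader functions above; those rely on the rows already being in date order within each customer.

```python
import pandas as pd

# Toy statement history (hypothetical IDs and values); customer "b" is deliberately out of order
statements = pd.DataFrame({
    "customer_ID": ["a", "a", "b", "b", "b", "c"],
    "S_2": pd.to_datetime(["2017-03-05", "2017-04-05",
                           "2017-03-10", "2017-05-10", "2017-04-10",
                           "2017-06-01"]),
    "P_2": [0.91, 0.88, 0.45, 0.40, 0.43, 0.77],
})

# Sort so the most recent statement is last within each customer, then keep that row
last_statement = (
    statements.sort_values(["customer_ID", "S_2"])
              .groupby("customer_ID")
              .tail(1)
              .reset_index(drop=True)
)
print(last_statement)
```

Sorting first makes the result independent of the order in which rows happen to be stored.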

In [None]:
num_features = train_df._get_numeric_data().columns
num_features

In [None]:
# Create the arrays for features and the target: X, y
X, y = train_df._get_numeric_data(), target

In [None]:
# Create the training and test datasets
X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y, 
                                                    test_size = 0.2, 
                                                    random_state=100,
                                                    stratify=y)

In [None]:
# Instantiate the classifier. tree_method='gpu_hist' trains on the GPU
xg_cl = XGBClassifier(objective='binary:logistic', 
                      n_estimators=10,
                      seed=123,
                      use_label_encoder=False,
                      eval_metric='aucpr', # updated to make use of the aucpr option
                      early_stopping_rounds=10,
                      tree_method='gpu_hist',
                      enable_categorical=True
                      )
eval_set = [(X_test, y_test)]

In [None]:
# Fit the classifier
xg_cl.fit(X_train, y_train, eval_set=eval_set, verbose=True)

In [None]:
# Predict the labels of the test set
preds = xg_cl.predict(X_test)
preds_prob = xg_cl.predict_proba(X_test)[:,1]

# Compute accuracy
accuracy = accuracy_score(y_test, preds)
print(f'accuracy: {accuracy: .2%}')

In [None]:
# Review the important features
# print(xg_cl.feature_importances_)
def plot_features(booster, figsize, max_num_features=15):    
    fig, ax = plt.subplots(1,1,figsize=figsize)
    return plot_importance(booster=booster, ax=ax, max_num_features=max_num_features)
plot_features(xg_cl, (10,14))
plt.show()

In [None]:
print('Metric Evaluation Values\n')
print(f'Numpy: {amex_metric_numpy(y_test.to_numpy().ravel(), preds_prob)}')

### Boosting and CV methods
Let's make use of the native booster API and the built-in CV method.

In [None]:
# Understanding weighted class imbalance
from collections import Counter

counter = Counter(y)
print(counter)

# estimate scale_pos_weight value
estimate = counter[0] / counter[1]
print('Estimate: %.3f' % estimate)

In [None]:
# Create the DMatrix objects for the train split, the hold-out split and the competition test data
d_train = xgb.DMatrix(data=X_train, label=y_train)
d_test = xgb.DMatrix(data=X_test, label=y_test)
xgd_test = xgb.DMatrix(data=test_df._get_numeric_data())

# Create the parameter dictionary: params. NOTE: have to explicitly provide the objective param
params = {"objective":"binary:logistic", 
          "max_depth": 6,
          "eval_metric":'aucpr', # updated to make use of the aucpr option
          "tree_method":'gpu_hist',
          "predictor": 'gpu_predictor',
#           "scale_pos_weight": 30,
         }

# Reviewing the AUCPR metric
# Perform cross_validation: cv_results
cv_results = xgb.cv(dtrain=d_train, params=params,
                    nfold=5, num_boost_round=10, 
                    metrics="aucpr", as_pandas=True, seed=123)

# Print cv_results
print(cv_results)

# Print the mean test AUCPR of the final boosting round
print((cv_results["test-aucpr-mean"]).iloc[-1])

In [None]:
# Review the train method
params = {
    "objective":"binary:logistic", 
    "max_depth": 6,
    "eval_metric":'aucpr', 
    "tree_method":'gpu_hist',
    "predictor": 'gpu_predictor',
#     "scale_pos_weight": 30,    
}

# Train. verbose_eval=0 switches off the log outputs; early stopping monitors the last entry in evals ('test')
xgb_clf = xgb.train(
    params,
    d_train,
    num_boost_round=5000,
    evals=[(d_train, 'train'), (d_test, 'test')],
    early_stopping_rounds=10,
    verbose_eval=0
)

In [None]:
# predict
y_pred = xgb_clf.predict(d_test)

# Compute and print metrics
print('Metric Evaluation Values\n')
print(f'Numpy: {amex_metric_numpy(y_test.to_numpy().ravel(), y_pred)}')

The booster has helped to improve model performance. The next step is to try adding a custom objective for this challenge.
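One possible first step in that direction is to expose the competition metric as an evaluation callback. This is a sketch of an evaluation metric only, not of a custom training objective. It assumes the callback receives the predictions and an object with a `get_label()` method, which is the shape XGBoost uses for custom metrics. In recent XGBoost releases such a callable can be passed to `xgb.train` through `custom_metric` (older releases used `feval`), with `maximize=True` so that early stopping treats higher values as better, but check the documentation of the installed version before relying on this. `amex_eval` and `FakeDMatrix` are hypothetical names, and the labels and scores are synthetic.

```python
import numpy as np

def amex_metric(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Compact copy of the numpy metric defined earlier in this notebook."""
    n_pos = np.sum(y_true)
    n_neg = y_true.shape[0] - n_pos
    order = np.argsort(y_pred)[::-1]          # descending predictions
    target = y_true[order]
    weight = 20.0 - target * 19.0             # non-defaults weigh 20, defaults 1
    cum_norm_weight = (weight / weight.sum()).cumsum()
    d = np.sum(target[cum_norm_weight <= 0.04]) / n_pos
    lorentz = (target / n_pos).cumsum()
    gini = ((lorentz - cum_norm_weight) * weight).sum()
    gini_max = 10 * n_neg * (1 - 19 / (n_pos + 20 * n_neg))
    return 0.5 * (gini / gini_max + d)

def amex_eval(preds: np.ndarray, dmatrix) -> tuple:
    """Metric callback: returns a (name, value) pair for the given predictions."""
    labels = np.asarray(dmatrix.get_label())
    return "amex", float(amex_metric(labels, preds))

class FakeDMatrix:
    """Hypothetical stand-in for xgb.DMatrix so the sketch runs without XGBoost."""
    def __init__(self, label):
        self._label = np.asarray(label, dtype=float)

    def get_label(self):
        return self._label

# Synthetic labels and noisy but informative scores
labels = np.array([0, 0, 1, 0, 1, 0, 0, 0, 0, 1] * 20)
rng = np.random.default_rng(0)
scores = labels * 0.5 + rng.random(labels.size) * 0.6

name, value = amex_eval(scores, FakeDMatrix(labels))
print(name, round(value, 4))
```

Because the metric depends only on the ordering of the predictions, it gives the same value whether the callback receives probabilities or raw margins.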

## Rank Order table

Evaluating model performance for rank-ordering tasks, taking guidance from [7 Important Model Performance Measures](https://www.k2analytics.co.in/7-important-model-performance-measures/#:~:text=Rank%20Order%20Table%20is%20a,from%20Non%2DChurners%2C%20etc.).
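One practical wrinkle when bucketing scores into deciles is ties: if many rows share the same probability, quantile edges coincide and fewer than ten buckets survive. The sketch below shows this on synthetic scores (illustrative only) and compares cutting on the raw scores with cutting on a tie-broken rank.

```python
import numpy as np
import pandas as pd

# Synthetic scores: half the rows share the same low probability (illustrative only)
scores = pd.Series([0.01] * 50 + list(np.linspace(0.10, 0.90, 50)), name="prob")

# Quantile cut on the raw scores: duplicate bin edges are dropped, so fewer buckets remain
by_prob = pd.qcut(scores, 10, labels=False, duplicates="drop")

# Quantile cut on a tie-broken rank: ten equal-sized buckets,
# but tied rows are split between buckets by their original position
by_rank = pd.qcut(scores.rank(method="first"), 10, labels=False)

print("buckets from raw scores:", by_prob.nunique())
print("buckets from ranks:     ", by_rank.nunique())
print(by_rank.value_counts().sort_index().tolist())
```

Neither option is strictly better: the first keeps tied rows together at the cost of uneven buckets, while the second keeps bucket sizes equal but separates rows the model could not distinguish.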

In [None]:
train_df._get_numeric_data().columns

In [None]:
# Lets build using the X_test data - this was to check and see if the code worked. Now going to score up the train_df to get a larger sample
# 1. Predict probability
# rank_data = X_test.copy()
# rank_data['target'] = y_test
# rank_data['prob'] = preds_prob
# rank_data.head()

rank_data = train_df._get_numeric_data().copy()  # copy so the new columns are not added to a view of train_df
xgd_rank = xgb.DMatrix(data=train_df._get_numeric_data())
rank_data['prob'] = xgb_clf.predict(xgd_rank)
rank_data['target'] = target
rank_data.head()

In [None]:
# First create the decile value by prob
rank = rank_data.loc[:, ['target', 'prob']]
rank["ranks"] = rank['prob'].rank(method="first")

# These notes were written when only the X_test dataframe was used. With train_df being scored, the probabilities can be tried again
# First method bunches the final three buckets into one as there are a lot of low probs
rank['decile'] = pd.qcut(rank.prob, 10, labels=False, duplicates='drop') 
# Second method uses the rank instead; however, the ordering of tied values in this rank is still arbitrary
# An alternative might be to put the 'prob' in order and then sort by the target
# rank['decile'] = pd.qcut(rank.ranks, 10, labels=False)
# Reviewing the lowest probability
min_prob = np.min(rank.prob)
rank.loc[(rank.prob == min_prob)].head()

In [None]:
# Create a rank_order table
def rank_order(df: pd.DataFrame, y: str, target: str) -> pd.DataFrame:
    
    rank = df.groupby('decile').apply(lambda x: pd.Series([
        np.min(x[y]),
        np.max(x[y]),
        np.mean(x[y]),
        np.size(x[y]),
        np.sum(x[target]),
        np.size(x[target][x[target]==0]),
    ],
        index=(["min_prob","max_prob","avg_prob",
               "cnt_cust","cnt_def","cnt_non_def"])
    )).reset_index()
    rank = rank.sort_values(by='decile', ascending=False)
    rank["drate"] = round(rank["cnt_def"]*100/rank["cnt_cust"], 2)
    rank["cum_cust"] = np.cumsum(rank["cnt_cust"])
    rank["cum_def"] = np.cumsum(rank["cnt_def"])
    rank["cum_non_def"] = np.cumsum(rank["cnt_non_def"])
    rank["cum_cust_pct"] = round(rank["cum_cust"]*100/np.sum(rank["cnt_cust"]), 2)
    rank["cum_def_pct"] = round(rank["cum_def"]*100/np.sum(rank["cnt_def"]), 2)
    rank["cum_non_def_pct"] = round(rank["cum_non_def"]*100/np.sum(rank["cnt_non_def"]), 2)
    rank["KS"] = round(rank["cum_def_pct"] - rank["cum_non_def_pct"],2)
    # Cumulative lift: share of defaults captured relative to share of customers reviewed
    rank["Lift"] = round(rank["cum_def_pct"] / rank["cum_cust_pct"],2)
    return rank

rank_gains_table = rank_order(rank, "prob", "target")
rank_gains_table

### Make submission

In [None]:
# Score up the test dataset
test_preds = xgb_clf.predict(xgd_test)
test_preds[:10]  # preview the first few predictions

In [None]:
# Make submission
sub_data = pd.read_csv('../input/amex-default-prediction/sample_submission.csv')
sub_data.head()

In [None]:
sub_data['prediction'] = test_preds
sub_data.head()

In [None]:
# Submission file
sub_data.to_csv('submission.csv', index=False)