# Santander Customer Transaction Prediction

Authors: Martin Korytak, Thomas Parnell

In this challenge, the Santander team invited Kagglers to help them identify which customers will make a specific transaction in the future, irrespective of the amount of money involved. The data provided for this competition has the same structure as the real data Santander have available to solve this problem.

In this kernel, we show how IBM's **Snap ML** library can be used to **accelerate training**. We demonstrate that Snap ML lets us construct a high-ranking submission in much less time than the equivalent approach using scikit-learn. We use the heuristic introduced in this [kernel](https://www.kaggle.com/yag320/list-of-fake-samples-and-public-private-lb-split) and extend the idea presented in this [kernel](https://www.kaggle.com/cdeotte/200-magical-models-santander-0-920). Furthermore, this kernel uses a fixed `seed` to make the results easily reproducible.

For more information about the capabilities of Snap ML please visit the project homepage at: https://www.zurich.ibm.com/snapml/

### Import and Install Necessary Packages

In [None]:
!pip install snapml

In [None]:
import numpy as np
import pandas as pd
from sklearn import decomposition
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import StratifiedKFold
from sklearn.utils.class_weight import compute_sample_weight
import sklearn.ensemble as skl
from os import path
import time
from scipy.stats import normaltest
from scipy.special import logit
import seaborn as sns
from matplotlib import pyplot as plt
import snapml
import multiprocessing
from typing import List, Dict, Union
from tqdm import tqdm

In [None]:
np.random.seed(130720)

### Load Data Sets

The data sets are ready for use as soon as they are loaded into memory. The only difference between `train.csv` and `test.csv` is that `train.csv` also contains a `target` column with the ground truth. The only column in both data sets that we do not use during training is `ID_code`.

In [None]:
path_to_folder = '/kaggle/input/santander-customer-transaction-prediction'
df = pd.read_csv(path.join(path_to_folder, 'train.csv'))
df_test = pd.read_csv(path.join(path_to_folder, 'test.csv'))

id_codes = df_test['ID_code'] # we need to keep ID_code for a submission of our predictions
df_test.drop('ID_code', axis=1, inplace=True)
df.drop('ID_code', axis=1, inplace=True)
features = df.drop('target', axis=1)
columns = features.columns
target = df.target

### Create Auxiliary Functions

In [None]:
def time_decorator(function):
    """
    Decorator which measures the time spent within a particular function.
    :param function: timed function
    :return: wrapped function which returns a tuple `(result, elapsed time in seconds)`
    """
    def timed(*args, **kwargs):
        ts = time.time_ns()
        result = function(*args, **kwargs)
        te = time.time_ns()
        return result, (te - ts) * 1e-9

    return timed
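
Note that a function wrapped by `time_decorator` returns a tuple `(result, elapsed_seconds)` instead of the bare result, which is why the training loops later unpack two values. A minimal sketch of this behaviour, with the decorator re-declared so the snippet is self-contained; `slow_square` is a hypothetical function used only for illustration:

```python
import time


def time_decorator(function):
    # Same idea as the decorator above: call `function` and also report the elapsed time.
    def timed(*args, **kwargs):
        ts = time.time_ns()
        result = function(*args, **kwargs)
        te = time.time_ns()
        return result, (te - ts) * 1e-9  # nanoseconds -> seconds

    return timed


@time_decorator
def slow_square(x):  # hypothetical example function
    time.sleep(0.01)
    return x * x


value, seconds = slow_square(3)  # the caller must unpack both values
print(value, f'{seconds:.3f} s')
```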

In [None]:
def get_real_synthetic_testing_samples(testing_set: pd.DataFrame) -> tuple:
    # idea from https://www.kaggle.com/yag320/list-of-fake-samples-and-public-private-lb-split
    """
    Discovers which rows are artificially created and which are real.
    :param testing_set: dataframe of testing samples
    :return: indices of real and synthetic samples
    """
    counts = np.zeros(shape=testing_set.shape)
    for i, col in enumerate(testing_set.columns):
        counts[:, i] = testing_set[col].map(testing_set[col].value_counts())

    is_row_real_sample = np.any(counts == 1, axis=1)
    fake_samples_indices = np.argwhere(~is_row_real_sample).flatten()
    real_samples_indices = np.argwhere(is_row_real_sample).flatten()
    return real_samples_indices, fake_samples_indices
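
To make the heuristic concrete, here is a small sketch on a toy test set. The values are made up for illustration, and the logic mirrors the function above: a row counts as real as soon as one of its values occurs exactly once in its column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy test set: rows 0 and 1 each hold a value that is unique
# in its column, rows 2 and 3 consist only of values seen in other rows.
toy_test = pd.DataFrame({
    'var_0': [1.5, 2.0, 2.0, 2.0],
    'var_1': [0.3, 0.7, 0.3, 0.3],
})

# For every cell, look up how often its value occurs in its column.
counts = np.column_stack([
    toy_test[col].map(toy_test[col].value_counts()).to_numpy()
    for col in toy_test.columns
])

is_real = np.any(counts == 1, axis=1)  # at least one unique value in the row
real_idx = np.flatnonzero(is_real)
fake_idx = np.flatnonzero(~is_real)
print('real:', real_idx, 'synthetic:', fake_idx)
```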

In [None]:
def get_aggregated_features(column: pd.Series, column_name: str, decimals: int, agg_function: str) -> pd.Series:
    """
    Computes `agg_function` for each unique (rounded) value in `column`.
    :param column: a column which will be used for calculation
    :param column_name: a name of the column in original dataframe
    :param decimals: a number of decimal places for rounding values before group by function is called
    :param agg_function: a function which will be used for calculating new values in the new column
    :return: a series indexed by the rounded unique values of `column`, holding the value of `agg_function` for each of them
    """
    return column.to_frame().groupby(column.round(decimals))[column_name].agg(agg_function)
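
The aggregate is computed once per unique rounded value and then has to be mapped back onto the rows, which is what `add_features_in_batch` does next. A minimal sketch of both steps for the `count` aggregate; the column values are hypothetical, and the simpler `Series.groupby` form is used here instead of the `to_frame()` variant above:

```python
import pandas as pd

col = pd.Series([0.1234, 0.1234, 0.1299, 0.5000], name='var_0')  # hypothetical values

# Step 1: aggregate per unique value after rounding to 2 decimal places.
rounded = col.round(2)
counts_per_value = col.groupby(rounded).agg('count')

# Step 2: broadcast the aggregate back to the rows to obtain the new feature.
var_0_count = rounded.map(counts_per_value)
print(var_0_count.tolist())
```

Rounding controls how coarse the bins are: fewer decimal places merge more values into the same group.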

In [None]:
def add_features_in_batch(data_set_for_calculation: pd.DataFrame, all_dataframes: List, columns: np.array, decimal_places: Dict) -> List:
    """
    Adds new columns per each specified feature in a particular batch.
    :param data_set_for_calculation: a batch of data for calculating new features
    :param all_dataframes: a list of dataframes where new features will be added
    :param columns: a list of column names
    :param decimal_places: a mapping from aggregate function name to the number of decimal places used for rounding before grouping
    :return: a list of dataframes with added new features
    """
    for col in tqdm(columns):
        for feature in decimal_places.keys():
            x = get_aggregated_features(data_set_for_calculation[col], col, decimal_places[feature], feature)
            for dataframe in all_dataframes:
                dataframe[f'{col}_{feature}'] = dataframe[col].round(decimal_places[feature]).map(x)
                if feature == 'std':
                    dataframe.fillna(value=0, inplace=True)  # std might introduce some NaN values
    return all_dataframes

In [None]:
def estimate_counts_based_on_real_testing_samples(training_and_validation_sets: List[pd.DataFrame], testing_set: pd.DataFrame, columns: List) -> List:
    # idea from https://www.kaggle.com/cdeotte/200-magical-models-santander-0-920
    """
    Add new columns to all provided dataframes with counts of particular entries.
    :param training_and_validation_sets: list of dataframes with training and validation rows
    :param testing_set: dataframe of testing set
    :param columns: names of columns in original data set
    :return: new dataframes with additional columns
    """
    real_samples_indices, _ = get_real_synthetic_testing_samples(testing_set)

    # estimating counts using only real testing samples
    data_set_for_estimation = pd.concat([*training_and_validation_sets, testing_set.loc[real_samples_indices]])
    # create a copy of provided dataframes
    all_dataframes = [*[d.copy(deep=True) for d in training_and_validation_sets], testing_set.copy(deep=True)]

    decimals = {  # optimal number of decimal places for each new feature
        'count': 4,
        'mean': 4,
        'std': 4,
        'sum': 4,
        'min': 2,
        'max': 3
    }

    args = []
    n_processes = 10

    columns = np.array(columns)  # convert to NumPy array due to the need to access indices
    batch_size = int(np.ceil(len(columns) / n_processes))
    for i in range(n_processes):
        indices = list(range(batch_size * i, batch_size * (i + 1)))
        args.append((data_set_for_estimation.iloc[:, indices], [dataframe.iloc[:, indices] for dataframe in all_dataframes], columns[indices], decimals))

    with multiprocessing.Pool(processes=n_processes) as p:
        chunks = p.starmap(add_features_in_batch, args)

    all_dataframes = [pd.concat(d, axis=1) for d in zip(*chunks)]
    return all_dataframes
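
The batching above works because the number of columns divides evenly by `n_processes`. With a ceiling-based batch size and a column count that is not divisible, the last index ranges would run past the final column and positional indexing would fail. A hedged sketch of an alternative split using `np.array_split`, which tolerates uneven sizes; the sizes below are hypothetical:

```python
import numpy as np

n_columns, n_processes = 23, 10  # hypothetical sizes; 23 is not divisible by 10

# array_split distributes the remainder over the first batches instead of failing.
batches = np.array_split(np.arange(n_columns), n_processes)

for i, batch in enumerate(batches):
    print(i, batch.tolist())
```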

In [None]:
def get_column_indices(df: pd.DataFrame, query_cols: List) -> np.array:
    # idea from: https://stackoverflow.com/a/38489403
    """
    Retrieve indices of given columns.
    :param df: dataframe from which this function retrieves indices
    :param query_cols: names of the columns whose positional indices are requested
    :return: array with indices corresponding to `query_cols`
    """
    cols = df.columns.values
    sidx = np.argsort(cols)
    return sidx[np.searchsorted(cols, query_cols, sorter=sidx)]
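
The `argsort`/`searchsorted` combination can be hard to read at first, so here is a small sketch of what it computes. The column layout is hypothetical, and the result is compared against pandas' `Index.get_indexer`, which returns the same positions for names that exist:

```python
import numpy as np
import pandas as pd

cols = np.array(['var_1', 'var_0', 'var_1_count', 'var_0_count'])  # hypothetical column order
query = ['var_1', 'var_1_count']

sidx = np.argsort(cols)                           # positions that would sort the names
pos_in_sorted = np.searchsorted(cols, query, sorter=sidx)
positions = sidx[pos_in_sorted]                   # translate back to original positions

print(positions, pd.Index(cols).get_indexer(query))
```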

In [None]:
@time_decorator
def train(model: Union[snapml.RandomForestClassifier, skl.RandomForestClassifier], X_train: np.ndarray, y_train: np.array, weights: np.array):
    """
    Train a model using `X_train` data set.
    :param model: instance of a machine learning model which implements `fit` method
    :param X_train: training examples
    :param y_train: training labels
    :param weights: weight of each example in `X_train`
    """
    model.fit(X_train, y_train, sample_weight=weights)

In [None]:
def create_model(library: str, params: Dict, r_seed: int, n_cpus: int) -> Union[snapml.RandomForestClassifier, skl.RandomForestClassifier]:
    """
    Build a machine learning model using Snap ML or scikit-learn library.
    :param library: name of library which is going to be used
    :param params: hyper-parameters with optimal values of random forest
    :param r_seed: random seed of a particular model
    :param n_cpus: number of CPUs available for training
    :return: created model
    """
    if library == 'snapml':
        return snapml.RandomForestClassifier(**params, use_gpu=False, random_state=r_seed, n_jobs=n_cpus)
    elif library == 'sklearn':
        return skl.RandomForestClassifier(**params, random_state=r_seed, n_jobs=n_cpus)
    raise ValueError(f'Unknown library: {library}')  # fail loudly instead of returning None

In [None]:
def plot_barchart(sklearn_time: float, snapml_time: float):
    """
    Plot bar chart with training time for Snap ML and scikit-learn library.
    :param sklearn_time: training time of scikit-learn model in seconds
    :param snapml_time: training time of Snap ML model in seconds
    """
    plt.bar(0.5, sklearn_time, 0.4, ecolor='black')
    plt.bar(1.5, snapml_time, 0.4, ecolor='black')
    plt.xticks([0.5, 1.5], ['scikit-learn', 'Snap ML'])
    plt.ylabel('Training Time (s)')
    plt.xlabel('Machine Learning Libraries')
    plt.title('Training Time of Snap ML and scikit-learn Library')
    plt.show()

In [None]:
def plot_roc_curve(y_val: np.array, y_pred: np.array):
    """
    Plot ROC curve with baseline score and AUC score.
    :param y_val: true labels
    :param y_pred: predicted scores (e.g. probabilities or summed log-odds), higher means more likely positive
    """
    fpr, tpr, _ = roc_curve(y_val,  y_pred)
    auc = roc_auc_score(y_val, y_pred)
    plt.plot(fpr, tpr, label=f'target 1, auc={auc:.4f}', color='r')
    plt.plot([0, 1], [0, 1], color='k', linestyle='dashed')
    plt.legend(loc=4)
    plt.ylabel('TPR - "probability of detection"')
    plt.xlabel('FPR - "probability of false alarm"')
    plt.show()

### Exploratory Data Analysis

First of all, let's explore the data set. We can see that all feature columns contain continuous numbers.

In [None]:
df.head()

In [None]:
df.describe()

We can confirm that there are no missing values either in `train.csv` or `test.csv`.

In [None]:
print(f'Number of NaN entries in train: {sum(df.isnull().sum())}, test: {sum(df_test.isnull().sum())}.')

The features are essentially uncorrelated, and we will leverage this fact in the training section. No pair of features shows a notable positive or negative correlation.

In [None]:
corr = np.tril(features.corr(), k=-1) # strictly lower triangle; the diagonal of ones and the upper triangle are set to 0
print(f'Maximal correlation: {corr.max():.4f}, minimal correlation {corr.min():.4f}.')
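
One caveat about the cell above: `np.tril` fills the diagonal and the upper triangle with zeros, so the reported maximum can never be below 0 and the minimum never above 0. A small sketch that extracts only the actual pairwise correlations, shown on synthetic stand-in data rather than the competition data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-in: 4 independent columns, not the real data set.
toy = pd.DataFrame(rng.normal(size=(1000, 4)), columns=[f'var_{i}' for i in range(4)])

corr_matrix = toy.corr().to_numpy()
rows, cols = np.tril_indices_from(corr_matrix, k=-1)  # strictly below the diagonal
pairwise = corr_matrix[rows, cols]                     # one value per feature pair

print(f'Pairs: {pairwise.size}, max: {pairwise.max():.4f}, min: {pairwise.min():.4f}')
```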

Let's create the first plot to build some intuition about the data. As we can see, the individual features are not normally distributed, even though they may appear so at first sight.

In [None]:
for col in features.columns:
    features[col].hist(alpha=0.5, bins=30);
    
# check normality with statistics -> data is not normally distributed
alpha = 0.05
for col in features.columns:
    stat, p = normaltest(features[col])
    if p > alpha: # null hypothesis: feature column comes from a normal distribution
        print(f'p-value was {p} > {alpha}, null hypothesis cannot be rejected for feature {col}, so column {col} is consistent with a normal distribution.')

The means of the features are normally distributed, whereas the standard deviations of the features are not. We can confirm this both by looking at the plots and by using a normality test.

In [None]:
features.mean().plot.hist(bins=10);
print(normaltest(features.mean())) # means of features are normally distributed

In [None]:
features.std().plot.hist(bins=10);
print(normaltest(features.std())) # standard deviations of features are not normally distributed

According to the box plot, there appear to be some outliers, but none of them lies far from the whiskers, so we do not remove any rows from the original data set.

In [None]:
features.plot.box(vert=False, figsize=(13, 32));

The data set is highly imbalanced and we need to keep this in mind when training a model.

In [None]:
sns.countplot(x=target, hue='target', data=df)
plt.show()
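
One common way to account for the imbalance is to weight each sample inversely to the frequency of its class, which is what scikit-learn's `'balanced'` mode does: `n_samples / (n_classes * np.bincount(y))`. A minimal sketch of that formula with NumPy only; the labels are hypothetical and do not reflect the real class ratio:

```python
import numpy as np

y_toy = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])  # hypothetical labels

class_counts = np.bincount(y_toy)                         # samples per class
class_weights = y_toy.size / (class_counts.size * class_counts)
sample_weights = class_weights[y_toy]                     # one weight per sample

print(class_weights)
print(sample_weights[y_toy == 0].sum(), sample_weights[y_toy == 1].sum())
```

With these weights, both classes contribute the same total weight to the training objective.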

Using kernel PCA with a polynomial kernel on a 1% sample of the data, we can see that samples with `target=1` are scattered across the whole figure rather than forming a separate cluster. This may make accurate predictions more difficult.

In [None]:
portion_of_data = features.sample(frac=0.01, random_state=1)
print(target.loc[portion_of_data.index].value_counts())  # select by index label
pca = decomposition.KernelPCA(n_components=2, kernel='poly')
X = pca.fit_transform(portion_of_data)

# boolean masks as plain NumPy arrays, aligned with the rows of X
target_one_condition = (target.loc[portion_of_data.index] == 1).to_numpy()
target_zero_condition = (target.loc[portion_of_data.index] == 0).to_numpy()
plt.scatter(X[target_zero_condition, 0], X[target_zero_condition, 1], label='0');
plt.scatter(X[target_one_condition, 0], X[target_one_condition, 1], label='1');
plt.legend();

### Feature Engineering

To prepare the data set for training, we first determine which test samples are real and which are synthetic. The detection rule is simple: a sample is real if and only if at least one of its feature values is unique within the corresponding column of the test set. We then create several new features using aggregate functions. A closer look shows that many values repeat within a column, so they can be grouped into bins. The new features are computed with the `count`, `mean`, `std`, `sum`, `min` and `max` functions, respectively. We also tried rounding the values to fewer decimal places before grouping, but in most cases this brought no additional improvement in AUC score. Please note that the original precision is 4 decimal places.

In [None]:
X, X_test = estimate_counts_based_on_real_testing_samples([features], df_test, columns)

### Build and Train Models

In this section, we train a random forest classifier using Snap ML as well as the equivalent classifier from scikit-learn. Finally, we compare the training time and the predictive performance of both models.

Here, we use the fact that the features are uncorrelated: we can build and train a separate model for each original feature, together with the new features derived from it in the `Feature Engineering` section. At the end, we add the predictions of all models together as log-odds. We also use 5-fold cross-validation to reduce the variance of the estimate. Because the data set is imbalanced, we use stratified cross-validation together with the `sample_weight` parameter of the `fit` method. The optimal hyper-parameters were found beforehand using a random search.
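
The training loops below combine the per-feature models by summing the logit (log-odds) of their predicted probabilities, which is a reasonable choice when each model contributes roughly independent evidence. A minimal sketch of that combination step with made-up probabilities; `logit` is re-implemented with NumPy here and follows the same definition as `scipy.special.logit`:

```python
import numpy as np


def logit(p):
    # Log-odds of a probability p, log(p / (1 - p)).
    return np.log(p / (1.0 - p))


# Hypothetical positive-class probabilities from three single-feature models
# (rows) for four samples (columns).
probs = np.array([
    [0.50, 0.60, 0.20, 0.50],
    [0.50, 0.70, 0.30, 0.40],
    [0.50, 0.55, 0.10, 0.60],
])

score = logit(probs).sum(axis=0)  # one combined score per sample
ranking = np.argsort(-score)      # most likely positive first
print(score.round(3), ranking)
```

A probability of 0.5 contributes nothing to the sum, while confident models shift the score strongly. Probabilities of exactly 0 or 1 would produce infinite log-odds, so clipping may be needed in other settings.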

In [None]:
optimal_hyperparams = {
    'max_depth': 5,
    'min_samples_leaf': 386,
    'n_estimators': 88
}

In [None]:
# create array with training columns for each base learner
features = [get_column_indices(X, X.filter(regex=fr'{col}(?!\d)').columns) for col in columns]

# cast all pd.DataFrames to NumPy arrays since Snap ML does not support pd.DataFrame
X = X.to_numpy()
X_test = X_test.to_numpy()
y = target.to_numpy()
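
The negative lookahead `(?!\d)` in the regular expression above is what keeps `var_1` from also matching `var_10`, `var_11` and so on. A small sketch with the standard `re` module, using `re.search` semantics and a hypothetical list of column names:

```python
import re

# Hypothetical column names after feature engineering.
column_names = ['var_1', 'var_1_count', 'var_1_mean', 'var_10', 'var_10_count', 'var_11', 'var_21']

pattern = re.compile(r'var_1(?!\d)')  # 'var_1' that is not followed by another digit
selected = [name for name in column_names if pattern.search(name)]
print(selected)
```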

In [None]:
n_cpus = multiprocessing.cpu_count() # get number of available CPUs

In [None]:
verbose = True
k = 5 # number of folds
y_pred_logit_snapml = np.zeros(X_test.shape[0])
validation_auc = 0
snapml_fit_time = 0

cv = StratifiedKFold(n_splits=k, shuffle=True, random_state=1)
for n_fold, (train_indices, val_indices) in enumerate(cv.split(X, y), start=1):
    print(f'{n_fold}. fold (out of {k}) is running.')
    
    X_train = X[train_indices, :]
    X_val = X[val_indices, :]
    y_train = y[train_indices]
    y_val = y[val_indices]
    
    w_train = compute_sample_weight('balanced', y_train)

    tmp_test = np.zeros(shape=X_test.shape[0])
    tmp_val = np.zeros(shape=X_val.shape[0])
    for idx in tqdm(range(len(columns))):       
        rf = create_model('snapml', optimal_hyperparams, idx, n_cpus)
        _, t_fit = train(rf, X_train[:, features[idx]], y_train, w_train)
        snapml_fit_time += t_fit
        
        tmp_val += logit(rf.predict_proba(X_val[:, features[idx]])[:, 1])
        tmp_test += logit(rf.predict_proba(X_test[:, features[idx]])[:, 1])    
    y_pred_logit_snapml += tmp_test
    
    validation_auc += roc_auc_score(y_val, tmp_val)
    
    if verbose:
        plot_roc_curve(y_val, tmp_val)
    
print(f'The average AUC estimated from {k} validation splits is: {validation_auc / k:.4f}.')

In [None]:
verbose = True
k = 5 # number of folds
y_pred_logit_skl = np.zeros(X_test.shape[0])
validation_auc = 0
sklearn_fit_time = 0

cv = StratifiedKFold(n_splits=k, shuffle=True, random_state=1)
for n_fold, (train_indices, val_indices) in enumerate(cv.split(X, y), start=1):
    print(f'{n_fold}. fold (out of {k}) is running.')
    
    X_train = X[train_indices, :]
    X_val = X[val_indices, :]
    y_train = y[train_indices]
    y_val = y[val_indices]
    
    w_train = compute_sample_weight('balanced', y_train)

    tmp_test = np.zeros(shape=X_test.shape[0])
    tmp_val = np.zeros(shape=X_val.shape[0])
    for idx in tqdm(range(len(columns))):        
        rf = create_model('sklearn', optimal_hyperparams, idx, n_cpus)
        _, t_fit = train(rf, X_train[:, features[idx]], y_train, w_train)
        sklearn_fit_time += t_fit
        
        tmp_val += logit(rf.predict_proba(X_val[:, features[idx]])[:, 1])
        tmp_test += logit(rf.predict_proba(X_test[:, features[idx]])[:, 1])    
    y_pred_logit_skl += tmp_test
    
    validation_auc += roc_auc_score(y_val, tmp_val)
    
    if verbose:
        plot_roc_curve(y_val, tmp_val)
    
print(f'The average AUC estimated from {k} validation splits is: {validation_auc / k:.4f}.')

In [None]:
print(f'[scikit-learn] training time: {sklearn_fit_time:.2f} (s)')
print(f'[Snap ML] training time: {snapml_fit_time:.2f} (s)')
print(f'Speed-up: {sklearn_fit_time / snapml_fit_time:.2f}x')

In [None]:
plot_barchart(sklearn_fit_time, snapml_fit_time)

From the training above it is clear that (1) both models perform very similarly in terms of AUC score and (2) the Snap ML model **trains significantly faster** than the scikit-learn model.

### Submission

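The final step is to submit our Snap ML predictions to the competition to see our position on the leaderboard. Note that the `target` column contains summed log-odds rather than probabilities. This does not affect the score, because the competition is evaluated on AUC, which depends only on the ranking of the predictions.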

In [None]:
filepath = 'submission.csv'

In [None]:
submission = pd.DataFrame({
        "ID_code": id_codes,
        "target": y_pred_logit_snapml
    })
submission.to_csv(filepath, index=False)

In [None]:
submission.head()
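
The submitted scores are sums of log-odds rather than probabilities. Since AUC depends only on how the samples are ranked, any strictly increasing transformation of the scores leaves it unchanged. A minimal sketch that checks this on hypothetical labels and probabilities, with a small pairwise AUC helper (`auc_from_scores` is an illustrative function, not part of any library):

```python
import numpy as np


def auc_from_scores(y_true, scores):
    # Fraction of (positive, negative) pairs ranked correctly; ties count as 0.5.
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size


y_true = np.array([0, 0, 1, 0, 1, 1])               # hypothetical labels
probs = np.array([0.1, 0.4, 0.35, 0.2, 0.8, 0.9])   # hypothetical probabilities
log_odds = np.log(probs / (1 - probs))              # strictly increasing transform

auc_probs = auc_from_scores(y_true, probs)
auc_log_odds = auc_from_scores(y_true, log_odds)
print(auc_probs, auc_log_odds)
```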