## Introduction

The main task in this dataset is to predict which transactions are fraudulent, based on several numerical features. A major challenge is the severe class imbalance: only 0.17% of the transactions are fraudulent. With so few positive examples, it is difficult to train a supervised classifier with good discriminative power.

In this notebook, we take an alternative approach by employing an autoencoder-based anomaly detection method to classify transactions. The underlying assumption is that if the features carry some relevance for predicting fraud, then fraudulent transactions are likely to be outliers within the feature space when compared to non-fraudulent transactions. To implement this approach, we train the autoencoder only on the non-fraudulent transactions. Subsequently, the reconstruction error derived from this autoencoder serves as the foundation for constructing a decision function to classify transactions as fraudulent or not. 

Given the dataset's substantial class imbalance, we will evaluate model performance using the area under the precision-recall curve, which is a suitable metric for assessing model quality in such imbalanced scenarios.
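
To give a sense of scale for this metric, the sketch below uses synthetic labels and uninformative random scores (both are assumptions for illustration, not the notebook's data). It summarizes the precision-recall curve with `average_precision_score` and shows that a scorer with no skill lands close to the positive rate, while its ROC AUC stays near 0.5. With roughly 0.17% positives, the no-skill baseline for the precision-recall area is therefore very low.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)

# Synthetic setup (assumption): same order of imbalance as the dataset.
n_samples = 200_000
positive_rate = 0.0017

labels = (rng.random(n_samples) < positive_rate).astype(int)
random_scores = rng.random(n_samples)  # scores that carry no information

baseline_pr = average_precision_score(labels, random_scores)
baseline_roc = roc_auc_score(labels, random_scores)

print(f'Observed positive rate:        {labels.mean():.4f}')
print(f'Average precision (no skill):  {baseline_pr:.4f}')
print(f'ROC AUC (no skill):            {baseline_roc:.4f}')
```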

In [None]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import tensorflow as tf
import keras

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import QuantileTransformer
from sklearn.metrics import confusion_matrix, precision_recall_curve, auc

import joblib

from typing import Dict, Optional, List, Tuple
from numbers import Number

plt.style.use("ggplot")
plt.rcParams.update({'figure.dpi': 150})

## Loading the data

**Note**: We drop the `Time` feature since we don't know how to interpret it. 

In [None]:
raw_df = pd.read_csv('/kaggle/input/creditcardfraud/creditcard.csv').drop('Time', axis=1)
raw_df.head()

In [None]:
print(f'Number of observations: {raw_df.shape[0]}')

In [None]:
neg, pos = np.bincount(raw_df['Class'])
total = neg + pos
print(f'Number of positive observations: {pos} ({100*pos / total:.2f}% of total)')

## Generating training, validation and test splits

In [None]:
train_df, test_df = train_test_split(raw_df, test_size=0.2, random_state=1, stratify=raw_df['Class'])
train_df, val_df = train_test_split(train_df, test_size=0.2, random_state=2, stratify=train_df['Class'])

In [None]:
y_train = train_df.pop('Class').values
y_val = val_df.pop('Class').values
y_test = test_df.pop('Class').values

In [None]:
for tag, labels in zip(['training', 'validation', 'test'], [y_train, y_val, y_test]):
    print(f'Percentage of positive class observations in {tag} set: {labels.mean()*100:.3f}%')

## Basic EDA

In [None]:
def filter_greater_than(series:pd.Series,threshold:Number) -> pd.Series:
    '''
    Returns the elements of `series` that are greater than `threshold`.
    This function can be used with the `.pipe` method.
    '''
    return series[series>threshold]


def get_missing_percentage(df:pd.DataFrame) -> pd.Series:
    '''
    Returns the percentage of missing values for each column
    in `df` that has at least one missing entry.
    '''
    
    return (
        (df.isnull().sum()/df.shape[0]*100)
        .sort_values(ascending=False)
        .pipe(filter_greater_than,threshold=0)
        .round(3)
    )


for tag, df in zip(['training', 'validation', 'test'], [train_df, val_df, test_df]):
    print(f'Percentage of missing entries per column in {tag} set (if any):')
    print(get_missing_percentage(df))
    print()

In [None]:
skew_columns = train_df.skew()
pos_skew = {}
neg_skew = {}

for column, skew in skew_columns.items():
    if skew > 1:
        pos_skew[column] = skew
    elif skew < -1:
        neg_skew[column] = skew
        
print(f'Number of columns that are positively skewed: {len(pos_skew)}')
print(f'Number of columns that are negatively skewed: {len(neg_skew)}')

For simplicity, we preprocess all the columns through the `QuantileTransformer` in scikit-learn, so that the transformed features are (roughly) normally distributed across the training set.

In [None]:
qt_transform = QuantileTransformer(output_distribution='normal')
X_train = qt_transform.fit_transform(train_df)
X_val = qt_transform.transform(val_df)
X_test = qt_transform.transform(test_df)
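
As a rough illustration of what this transformation does, the sketch below applies `QuantileTransformer` to a synthetic, heavily right-skewed column and compares the skewness before and after. The column name `amount_like` and the log-normal data are assumptions made for the example; they are not taken from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(0)

# Hypothetical right-skewed feature (log-normal), standing in for a real column.
skewed = pd.DataFrame(
    {'amount_like': rng.lognormal(mean=0.0, sigma=1.0, size=10_000)}
)

# Map the empirical quantiles of the column onto those of a standard normal.
qt = QuantileTransformer(output_distribution='normal', random_state=0)
transformed = pd.DataFrame(qt.fit_transform(skewed), columns=skewed.columns)

skew_before = skewed['amount_like'].skew()
skew_after = transformed['amount_like'].skew()

print(f'Skewness before transform: {skew_before:.2f}')
print(f'Skewness after transform:  {skew_after:.2f}')
```

As in the notebook, the transformer should be fitted on the training split only and then applied to the validation and test splits.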

## Autoencoder architecture

In [None]:
def encoder(inputs, params={}):
    x = inputs
    for i in range(params.get('n_hidden_encoder', 2)):
        x = keras.layers.Dense(params.get(f'hsize_encoder{i}', 64//2**i), activation=None)(x)
        x = keras.layers.BatchNormalization()(x)
        x = keras.layers.ReLU()(x)
        x = keras.layers.Dropout(params.get(f'dropout_encoder{i}', 0.05))(x)
    
    return x

def decoder(inputs, input_dim, params={}):
    x = inputs
    n_hidden_decoder = params.get('n_hidden_decoder', 2)
    for i in range(n_hidden_decoder):
        x = keras.layers.Dense(params.get(f'hsize_decoder{i}', 64//2**(n_hidden_decoder-i-1)), activation=None)(x)
        x = keras.layers.BatchNormalization()(x)
        x = keras.layers.ReLU()(x)
        x = keras.layers.Dropout(params.get(f'dropout_decoder{i}', 0.05))(x)
    
    # output
    x = keras.layers.Dense(input_dim, activation=None)(x)
    return x

def bottleneck_layer(inputs, params={}):
    x = keras.layers.Dense(params.get('hsize_bottleneck', 16), activation=None)(inputs)
    x = keras.layers.BatchNormalization()(x)
    x = keras.layers.ReLU(name='bottleneck')(x)
    
    return x


def dense_autoencoder(input_dim, params={}):
    inputs = keras.Input(shape=(input_dim,))
    encoder_output = encoder(inputs, params)
    bottleneck_output = bottleneck_layer(encoder_output, params)
    decoder_output = decoder(bottleneck_output, input_dim, params)
    
    model = keras.Model(inputs=inputs, outputs=decoder_output)
    
    
    model.compile(
        optimizer=keras.optimizers.Adam(
            learning_rate=params.get('learning_rate', 1e-3),
        ),
        loss='mean_squared_error'
    )

    return model

In [None]:
keras.backend.clear_session()
ae_model = dense_autoencoder(X_train.shape[1])
ae_model.summary()
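
To make the default layer sizes easier to see without building the model, the helper below recomputes the `Dense` widths from the same `params` keys and default formulas used in `encoder`, `bottleneck_layer` and `decoder`. The function `layer_widths` is a hypothetical helper introduced only for this illustration, and the input dimension of 29 assumes the 30 original columns minus `Time` and `Class` plus nothing else.

```python
def layer_widths(input_dim, params=None):
    """Return the Dense layer widths implied by `params`, in forward order."""
    params = params or {}
    n_enc = params.get('n_hidden_encoder', 2)
    n_dec = params.get('n_hidden_decoder', 2)

    # Encoder widths halve at each layer: 64, 32, ...
    enc = [params.get(f'hsize_encoder{i}', 64 // 2**i) for i in range(n_enc)]
    bottleneck = [params.get('hsize_bottleneck', 16)]
    # Decoder widths mirror the encoder: ..., 32, 64
    dec = [
        params.get(f'hsize_decoder{i}', 64 // 2**(n_dec - i - 1))
        for i in range(n_dec)
    ]
    # The final layer maps back to the input dimension.
    return enc + bottleneck + dec + [input_dim]


default_widths = layer_widths(29)
print(default_widths)
```

With the defaults, the network narrows from 64 to 32 to a 16-unit bottleneck and then widens symmetrically back to the input dimension.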

## Training the autoencoder

As mentioned in the introduction, we train the autoencoder only on the non-fraudulent transactions, with the objective of minimizing the reconstruction error. The assumption is that the network learns the patterns typical of legitimate transactions and therefore reconstructs them more accurately than fraudulent ones.

To optimize the training process and ensure the best model performance, we employ two key techniques:

1. Early stopping: We monitor the reconstruction error on the non-fraudulent transactions in the validation set and halt training once it has not improved for 15 consecutive epochs, restoring the weights from the best epoch. This limits overfitting to the training data and helps generalization to unseen data.
2. Learning rate scheduler: When the same validation error has not improved for 5 consecutive epochs, the learning rate is halved, down to a minimum of 1e-5.

Together, these techniques help the autoencoder capture the patterns of legitimate transactions without overfitting, so that its reconstruction error can serve as an anomaly score for flagging suspicious transactions.

In [None]:
EPOCHS = 100
BATCH_SIZE = 128
    
    
# callbacks - reduce lr on plateau and early stopping
early_stopping = keras.callbacks.EarlyStopping(
    monitor='val_loss',
    mode='min',
    verbose=True,
    patience=15,
    restore_best_weights=True
)
reduce_lr = keras.callbacks.ReduceLROnPlateau(
    monitor='val_loss',
    mode='min',
    verbose=True,
    patience=5,
    factor= 0.5,
    min_lr = 1e-5
)

history = ae_model.fit(
    X_train[y_train==0,:],
    X_train[y_train==0,:],
    batch_size=BATCH_SIZE,
    epochs=EPOCHS,
    verbose=2,
    callbacks = [early_stopping, reduce_lr], 
    validation_data=(X_val[y_val==0], X_val[y_val==0])
)
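
The effect of the plateau-based schedule can be illustrated with a simplified simulation. The function `simulate_plateau_schedule` below is a hypothetical stand-in written for this note, not Keras's implementation: it ignores options such as `min_delta` and `cooldown`, and the validation losses are made up.

```python
def simulate_plateau_schedule(val_losses, lr=1e-3, factor=0.5,
                              patience=5, min_lr=1e-5):
    """Return the learning rate in effect after each epoch (simplified)."""
    best = float('inf')
    wait = 0
    lrs = []
    for loss in val_losses:
        if loss < best:
            # Improvement: remember it and reset the patience counter.
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                # Plateau detected: shrink the rate, but not below min_lr.
                lr = max(lr * factor, min_lr)
                wait = 0
        lrs.append(lr)
    return lrs


# Made-up losses: two improving epochs, then ten epochs without improvement.
toy_losses = [1.0, 0.9] + [0.95] * 10
toy_lrs = simulate_plateau_schedule(toy_losses)
print(toy_lrs)
```

In this toy run the rate is halved after the fifth and again after the tenth epoch without improvement. With `patience=15` for early stopping, training can therefore go through up to two learning rate reductions before it is halted.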

In [None]:
fig, axs = plt.subplots(1, 1, figsize=(4,3))
metric = 'loss'
_ = axs.plot(history.epoch, history.history[f'{metric}'], label='Training')
_ = axs.plot(history.epoch, history.history[f'val_{metric}'], label='Validation')                 
_ = axs.legend()
_ = axs.set_xlabel('Epoch')
_ = axs.set_ylabel('Reconstruction loss')
fig.tight_layout()

## Reconstruction loss across classes

For each observation, we define the reconstruction loss as the sum of the squared differences between its features and its corresponding reconstructed output from the autoencoder.

In [None]:
def get_reconst_loss(X):
    X_reconst = ae_model.predict(X, batch_size=256)
    return (
        ((X-X_reconst)**2).sum(axis=1)
    )
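
A small worked example of this per-observation loss is sketched below. The two toy arrays are made-up values standing in for the inputs and the autoencoder output. The example also shows the per-row mean of the squared errors, which is what the `mean_squared_error` training loss averages over; the two quantities differ only by a constant factor (the number of features), so they rank observations identically.

```python
import numpy as np

# Toy inputs and reconstructions (hypothetical values): 2 observations, 2 features.
X_toy = np.array([[1.0, 2.0],
                  [0.0, 0.0]])
X_toy_reconst = np.array([[1.0, 1.0],
                          [3.0, 4.0]])

squared_error = (X_toy - X_toy_reconst) ** 2

# Per-observation loss as defined in this notebook: sum over features.
row_sum = squared_error.sum(axis=1)
# Per-observation mean squared error: the same quantity divided by the number of features.
row_mean = squared_error.mean(axis=1)

print('Sum of squared errors per row: ', row_sum)
print('Mean of squared errors per row:', row_mean)
```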

The cell below plots, for each class, a histogram of log(1 + reconstruction loss) on the training set. The y-axis shows the density within each class. As anticipated, fraudulent transactions tend to have markedly higher reconstruction losses than non-fraudulent ones, which supports using the reconstruction loss as a criterion for classifying transactions as fraudulent or legitimate.

In [None]:
reconst_train = get_reconst_loss(X_train)

fig, ax = plt.subplots(1,1, figsize=(6,4))
_ = sns.histplot(
    x=np.log1p(reconst_train[y_train==0]), bins = 20, stat='density',
    kde=True,alpha = 0.5, label=0, ax=ax
)
_ = sns.histplot(
    x=np.log1p(reconst_train[y_train==1]), bins = 20, stat='density',
    kde=True, alpha = 0.75, label=1, ax=ax
)
_ = ax.legend()
_ = ax.set_xlabel('log(1+ Reconstruction Loss)')

## Evaluating the performance of the model

In [None]:
reconst_val = get_reconst_loss(X_val)
reconst_test = get_reconst_loss(X_test)

### Area under the precision recall curve

We can use the reconstruction error to get the precision and recall at various thresholds. The function `pr_auc_score` computes the area under the precision recall curve.

In [None]:
def pr_auc_score(labels, predictions):
    # compute precision recall at several thresholds
    precision, recall, _ = precision_recall_curve(labels, predictions)
    
    return auc(recall, precision)
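
Note that `auc` integrates the curve with the trapezoidal rule, which linearly interpolates between points of the precision-recall curve. scikit-learn also provides `average_precision_score`, a step-wise summary of the same curve that avoids this interpolation. The sketch below compares the two on a tiny made-up example; `trapezoid_pr_auc` is just a local copy of the function above so that the snippet is self-contained.

```python
import numpy as np
from sklearn.metrics import auc, average_precision_score, precision_recall_curve


def trapezoid_pr_auc(labels, scores):
    """Area under the PR curve using trapezoidal interpolation."""
    precision, recall, _ = precision_recall_curve(labels, scores)
    return auc(recall, precision)


# Made-up labels and scores, for illustration only.
toy_labels = np.array([0, 0, 1, 1])
toy_scores = np.array([0.1, 0.4, 0.35, 0.8])

trapezoid_area = trapezoid_pr_auc(toy_labels, toy_scores)
step_area = average_precision_score(toy_labels, toy_scores)

print(f'Trapezoidal PR AUC: {trapezoid_area:.4f}')
print(f'Average precision:  {step_area:.4f}')
```

The two summaries generally differ, so results should only be compared when they were computed with the same one.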

In [None]:
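# Use log1p, as in the plots, so that a zero reconstruction loss cannot produce -inf.
# The PR curve depends only on the ranking of the scores, so the monotone transform does not change the area.
print(f'Area under PR curve for training set: {pr_auc_score(y_train, np.log1p(reconst_train)):.3f}')
print(f'Area under PR curve for validation set: {pr_auc_score(y_val, np.log1p(reconst_val)):.3f}')
print(f'Area under PR curve for test set: {pr_auc_score(y_test, np.log1p(reconst_test)):.3f}')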

### Precision recall curves

In [None]:
def plot_pr_curve(name, labels, predictions, ax, **kwargs):
    precision, recall, _ = precision_recall_curve(labels, predictions)

    _ = ax.plot(recall, precision, label=name, linewidth=2, **kwargs)
    _ = ax.set_ylabel('Precision')
    _ = ax.set_xlabel('Recall')
    _ = ax.grid(True)
    _ = ax.set_aspect('equal')
    
fig, ax = plt.subplots(1, 1, figsize=(4, 4))


plot_pr_curve("Training", y_train, np.log1p(reconst_train), ax)
plot_pr_curve("Validation", y_val, np.log1p(reconst_val), ax)
plot_pr_curve("Test", y_test, np.log1p(reconst_test), ax)
_ = ax.legend(loc='lower left')

### Confusion matrix

Finally, we evaluate the confusion matrix on the test set, using various thresholds on the log(1+reconstruction loss) to generate the classes.

In [None]:
thresholds = [1.5, 2.5, 3.5, 4.5]

fig, axs = plt.subplots(1, len(thresholds), figsize=(4.5*len(thresholds), 4))
for i, threshold in enumerate(thresholds):
    _ = sns.heatmap(confusion_matrix(y_test, np.log1p(reconst_test) > threshold), annot=True, ax=axs[i], fmt='g')
    _ = axs[i].set_ylabel('Actual')
    _ = axs[i].set_xlabel('Predicted')
    _ = axs[i].set_title(f'Confusion matrix @ {threshold}')
    
fig.tight_layout()
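
The thresholds above are fixed by hand. One possible alternative, sketched below, is to choose a single threshold on the validation set, for example the one that maximizes the F1 score, and only then apply it to the test set. The helper `pick_threshold_max_f1` and the toy data are assumptions introduced for this illustration; F1 is one of several reasonable selection criteria.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve


def pick_threshold_max_f1(labels, scores):
    """Return the (threshold, F1) pair with the highest F1 on the given data."""
    precision, recall, thresholds = precision_recall_curve(labels, scores)
    # The final (precision=1, recall=0) point has no threshold, so drop it.
    precision, recall = precision[:-1], recall[:-1]
    denom = precision + recall
    # Guard against 0/0 when both precision and recall are zero.
    f1 = np.divide(2 * precision * recall, denom,
                   out=np.zeros_like(denom), where=denom > 0)
    best = int(np.argmax(f1))
    return thresholds[best], f1[best]


# Toy, perfectly separable scores (made up for the example).
toy_labels = np.array([0, 0, 0, 1, 1])
toy_scores = np.array([0.1, 0.2, 0.3, 0.8, 0.9])

best_threshold, best_f1 = pick_threshold_max_f1(toy_labels, toy_scores)
print(f'Best threshold: {best_threshold}, F1: {best_f1:.2f}')
```

In this notebook, the call would be `pick_threshold_max_f1(y_val, np.log1p(reconst_val))`. scikit-learn's thresholds follow the convention `score >= threshold`, so the comparison used to generate predictions should match.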

## Further directions

There are several directions to improve upon this work.

1. Feature selection: The current reconstruction loss function treats all features as equally important. However, it's highly probable that not all features hold the same predictive power. In fact, some features may not even be relevant for our task. Therefore, considering feature selection strategies becomes essential to align the reconstruction task more effectively with the classification task.
2. Calibration: In our approach, we didn't calculate class probabilities; instead, we relied on thresholds for the reconstruction loss. Calibration techniques such as Platt scaling or isotonic regression could map the reconstruction loss to an estimated probability of fraud, which would make thresholds easier to interpret and compare. Because these mappings are monotone, they are not expected to change the ranking of transactions much, so the benefit lies in interpretable scores rather than in better separation of the two classes.
3. Hyperparameter tuning: Tuning the autoencoder hyperparameters could improve performance. The tuning objective should be classification quality (for example, the validation PR AUC) rather than reconstruction quality alone.
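
As a minimal sketch of the calibration idea, the example below fits scikit-learn's `IsotonicRegression` to map an anomaly score to an estimated probability of fraud. The scores and labels are synthetic (two Gaussian clusters chosen for illustration); in the notebook they would be replaced by `np.log1p(reconst_val)` and `y_val`, which is an assumption about how the calibrator would be fitted rather than something done above.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)

# Synthetic validation scores (assumption): negatives score low, positives high.
val_labels = np.concatenate([np.zeros(2000, dtype=int), np.ones(100, dtype=int)])
val_scores = np.concatenate([
    rng.normal(1.0, 1.0, 2000),  # stand-in for legitimate transactions
    rng.normal(4.0, 1.0, 100),   # stand-in for fraudulent transactions
])

# Fit a monotone, non-decreasing map from score to probability in [0, 1].
calibrator = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds='clip')
calibrator.fit(val_scores, val_labels)

# Evaluate the fitted map on an evenly spaced grid of scores.
score_grid = np.linspace(val_scores.min(), val_scores.max(), 50)
calibrated_probs = calibrator.predict(score_grid)

for score, prob in zip(score_grid[::10], calibrated_probs[::10]):
    print(f'score={score:5.2f} -> estimated P(fraud)={prob:.3f}')
```

The calibrator should be fitted on data that the autoencoder was not trained on, and its quality checked on a further held-out split.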