In [None]:
##### Kaggle European Credit Card Dataset
#############################################

In [None]:
'''
############################################################################################
Overview-

In this project, I'm predicting  fraud using three ML models: Random Forest , XGBoost, and a Multi-layer
Perceptron (MLP). I tried to use a consistent processing and eval pipeline to ensure that the comparisons
between them are  fair. Also, to address the dataset's severe class imbalanc, which is typical in fraud
datasets, two sampling techniques—SMOTE  & Random UnderSampling- are applied to improve minority class
representation.

Feature Engineering: Ihe "Time" feature is removed, because it lacks informational value. 
Additionally, features V22 to V28 are dropped due to low variance.  (Will be  detailed in final report).


*NOTE on Dataset Downsampling:  
Due to compute and time constraints associated with training some models on full dataset, a
stratified sample of the data was used for some model configs. This allowed for faster training
without compromising the class balance, ensuring the model still learns patterns from  both classes.
I had to do this to avoid system overheating and runtime limitations (e.g., timeouts when I also tried with colab)
while still providing a representative dataset for training and eval.


Train-Test Split: A train-test split of 80-20 is performed with stratification. 
Stratification ensures that training and testing sets maintain the original dataset's class
distribution, helping the model generalize well without bias toward one class.

###########################################################################################

Avoiding Data Leakage-
Data leakage, where information from test set can influence the training set, is carefully avoided through
the following measures:

Proper Train-Test Split: The test data is kept separate from the training data before any transformations,
sampling/scaling which ensures model evaluation is fair and representative of real-world performance.

Pipeline use: The transformation and sampling process is wrapped up in a pipeline, meaning that data
transformations (scaling, sampling) are applied only on the training data during cross-val and model training.
This approach avoids access to information from test set that could bias the models.


Model Pipeline:

Pipeline Construction: A pipeline is created, which includes:
Scaling: The StandardScaler scales features. 

Sampling Techniques: Two sampling methods are tested independently:
-SMOTE: This technique synthesizes new instances for the minority class, balancing the class distribution.
-Random Under-Sampling: This technique reduces majority class by randomly sampling it down to a size thts
closer to the minority class.

#########################################################################################

Parameter Tuning: 
-GridSearchCV is used to optimize model hyperparameters by systematically testing different combinations
of params and evaluating each combinations performance. This process is important because hyperparameters
(e.g., the number of trees in a Random Forest or the max depth of trees) can significantly impact a model.
-After evaluating all combos, GridSearchCV selects the one with the highest score for the selected metric (ROC AUC).
This best model is then re-trained on the entire training set using this  configuration, making it ready for the
final test sett.
-Using the stored metrics in cv_results_  we can extract metrics from GridSearchCV run,allowing
analysis of model performance across various metrics and parameters.

##########################################################################################

Cross-Val: Cross-validation with roc_auc as the primary metric is applied to ensure the model’s reliability.
By using 3-fold cross-validation, we measure performance across different folds, which gives a more robust
indication of the model's effectiveness.

With imbalanced classification tasks, it’s usually useful to use Stratified K-Fold CV  Stratified
K-Fold ensures each fold maintains same class proportion as the entire dataset, which is useful
when dealing with rare events like fraud. This ensured each fold has a similar balance between
fraud and non-fraud, leading to more consistent training and evaluation across folds.

Evaluation: Each model configuration is evaluated with: metrics like precision, recall, and F1-score
to assess model performance. ROC AUC Score (Area Under the Receiver Operating Characteristic curve),
is used because it is well-suited for imbalanced datasets and measures the model’s ability to distinguish
between classes.

Training and Inference Time: The average times for training and inference are recorded to understand
the computational efficiency of each sampling approach.

Satability: Shows the SD of roc_auc across folds, giving insight into model stability.

Final Test Set Performance: The final model from GridSearchCV is evaluated on the test set for metrics
such as precision, recall, F1-score, and ROC AUC to confirm its effectiveness on unseen data.

####################################################################################


******** NOTE on RandomUndersampler: ********

Across all models, the use of RandomUnderSampler consistently resulted in unusaable performance. This
is because of significant loss of information from the majority class during undersampling, which reduced
dataset size and limited the model's ability to learn. Initial dataset size reduction in some
cases also contributed to this.  This was a big lesson learned.
'''


In [None]:
#### RANDOM FOREST ##########################

In [1]:
# Import necessary libraries
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report, roc_auc_score
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import make_scorer, accuracy_score, precision_score, recall_score, f1_score
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import confusion_matrix



# Load the dataset
file_path = r'C:\Users\ssain\Downloads\CCfraud\creditcard.csv'
df = pd.read_csv(file_path)

# 'Time' removed - no informational value
df = df.drop(columns=['Time'])
df = df.drop(columns=[f'V{i}' for i in range(22, 29)])  # dropped (low variance)

# Downsample dataset for faster processing
# Group by 'Class' (fraud or not) and take a 50% sample from each group
df_sampled = df.groupby('Class', group_keys=False).apply(lambda x: x.sample(frac=0.5, random_state=42))

# # Split data into features and target
X = df_sampled.drop(['Class'], axis=1)
y = df_sampled['Class']


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# Define parameter grid and scorers
param_grid_rf = {
    'classifier__n_estimators': [50, 100], # number of trees
    'classifier__max_depth': [10],  # Max depth
}

# eval metrics for GridsearchCV
scoring = {
    'accuracy': make_scorer(accuracy_score),
    'precision': make_scorer(precision_score),
    'recall': make_scorer(recall_score),
    'f1': make_scorer(f1_score),
    'roc_auc': 'roc_auc'
}


# Function to Evaluate models with both samplers
def evaluate_model_with_sampler(sampler, sampler_name):
    print(f"Evaluating model with {sampler_name}...")
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        (sampler_name, sampler),
        ('classifier', RandomForestClassifier(class_weight='balanced', random_state=42))
    ])
    
    # Stratified K-Folds ensures balanced class dist.
    strat_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
    
    # Perform hyperparam tuning, refit based on ROCAUC 
    grid_search = GridSearchCV(pipeline, param_grid_rf, cv=strat_cv, scoring=scoring, refit='roc_auc', return_train_score=True)
    grid_search.fit(X_train, y_train)
    
    # Display best parameters and cross-validation results
    print(f"Best parameters for {sampler_name}: {grid_search.best_params_}")
    cv_results = grid_search.cv_results_
    best_index = grid_search.best_index_
    roc_auc_mean = cv_results['mean_test_roc_auc'][best_index]
    roc_auc_std = cv_results['std_test_roc_auc'][best_index]
    print(f"Cross-Validation ROC AUC Mean for {sampler_name}: {roc_auc_mean:.3f}")
    print(f"Cross-Validation ROC AUC Std Dev for {sampler_name}: {roc_auc_std:.3f}")
    
    # Extract training and inference times
    train_time = cv_results['mean_fit_time'][best_index]
    inference_time = cv_results['mean_score_time'][best_index]

    # Evaluate on test set
    best_model = grid_search.best_estimator_
    y_pred = best_model.predict(X_test)
    y_proba = best_model.predict_proba(X_test)[:, 1]
    
    # Print confusion matrix
    cm = confusion_matrix(y_test, y_pred)
    print(f"Confusion Matrix for {sampler_name}:")
    print(cm)
    
    print("Test Set Metrics:")
    print(classification_report(y_test, y_pred))
    print(f"Test Set ROC AUC for {sampler_name}: {roc_auc_score(y_test, y_proba):.3f}")
    print(f"Average Training Time for {sampler_name}: {train_time:.3f} seconds")
    print(f"Average Inference Time for {sampler_name}: {inference_time:.3f} seconds\n")

# Evaluate with SMOTE
evaluate_model_with_sampler(SMOTE(random_state=42), 'smote')

# Evaluate with RandomUnderSampler
evaluate_model_with_sampler(RandomUnderSampler(random_state=42), 'undersample')



Evaluating model with smote...
Best parameters for smote: {'classifier__max_depth': 10, 'classifier__n_estimators': 100}
Cross-Validation ROC AUC Mean for smote: 0.974
Cross-Validation ROC AUC Std Dev for smote: 0.013
Confusion Matrix for smote:
[[28405    27]
 [    7    42]]
Test Set Metrics:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     28432
           1       0.61      0.86      0.71        49

    accuracy                           1.00     28481
   macro avg       0.80      0.93      0.86     28481
weighted avg       1.00      1.00      1.00     28481

Test Set ROC AUC for smote: 0.983
Average Training Time for smote: 40.664 seconds
Average Inference Time for smote: 0.315 seconds

Evaluating model with undersample...
Best parameters for undersample: {'classifier__max_depth': 10, 'classifier__n_estimators': 100}
Cross-Validation ROC AUC Mean for undersample: 0.971
Cross-Validation ROC AUC Std Dev for undersample: 0.021
Confu

In [None]:
############################################################################################

In [None]:
################### XG BOOST ###########################

In [3]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.metrics import classification_report, roc_auc_score
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import make_scorer, accuracy_score, precision_score, recall_score, f1_score
from xgboost import XGBClassifier


# Load the dataset
file_path = r'C:\Users\ssain\Downloads\CCfraud\creditcard.csv'
df = pd.read_csv(file_path)

# Drop the 'Time' column and keep PCA features (need them in this case)
df = df.drop(columns=['Time'])

# Downsample the dataset for quicker training
# Sample 50% of the data from each class to maintain balance
df_sampled = df.groupby('Class', group_keys=False).apply(lambda x: x.sample(frac=0.5, random_state=42))

# Split data into features and target using the sampled data
X = df_sampled.drop(['Class'], axis=1)
y = df_sampled['Class']

# Split into training and test sets with stratification
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# Define resamplers
resamplers = {
    'SMOTE': SMOTE(random_state=42),
    'RandomUnderSampler': RandomUnderSampler(random_state=42)
}

# Define model and parameter grid
param_grid_xgb = {
    'classifier__n_estimators': [100, 200], # boosting rounds
    'classifier__max_depth': [4, 6],        # tree depth
    'classifier__learning_rate': [0.01, 0.1],
     #scaling for class imbalance
    'classifier__scale_pos_weight': [1, len(y_train[y_train == 0]) / len(y_train[y_train == 1])]
}

# Set up scoring metrics
scoring = {
    'accuracy': make_scorer(accuracy_score),
    'precision': make_scorer(precision_score),
    'recall': make_scorer(recall_score),
    'f1': make_scorer(f1_score),
    'roc_auc': 'roc_auc'
}


 # Use Stratified K-Folds for Cross-Validation
strat_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Loop through resamplers
for name, resampler in resamplers.items():
    print(f"\nResampling Method: {name}")
    
    # Define pipeline  !!!!!!
    pipeline_xgb = Pipeline([
        ('scaler', StandardScaler()),
        ('resampler', resampler),
        ('classifier', XGBClassifier(
            objective='binary:logistic',
            eval_metric='aucpr',
            random_state=42
        ))
    ])
    
   
    
    # Perform GridSearchCV to find the best hyperparameters
    #  refit on precision (gave better performance than ROCAUC etc.)
    grid_search_xgb = GridSearchCV(pipeline_xgb, param_grid_xgb, cv=strat_cv, scoring=scoring, refit='precision')
    grid_search_xgb.fit(X_train, y_train)
    
    # Evaluate the best model on the test set !!
    best_model_xgb = grid_search_xgb.best_estimator_
    y_pred = best_model_xgb.predict(X_test)
    y_pred_proba = best_model_xgb.predict_proba(X_test)[:, 1]
    
    # Print confusion matrix
    cm = confusion_matrix(y_test, y_pred)
    print(f"Confusion Matrix for {name}:")
    print(cm)

    # Print classification report and ROC AUC for the test set
    print("Test Set Metrics with Default Threshold:")
    print(classification_report(y_test, y_pred))
    print(f"Test Set ROC AUC: {roc_auc_score(y_test, y_pred_proba):.3f}")
    
    # Extract best model index and retrieve its metrics
    best_index = grid_search_xgb.best_index_
    train_time_best = grid_search_xgb.cv_results_['mean_fit_time'][best_index]
    inference_time_best = grid_search_xgb.cv_results_['mean_score_time'][best_index]
    cv_roc_auc_mean = grid_search_xgb.best_score_
    cv_roc_auc_std = grid_search_xgb.cv_results_['std_test_roc_auc'][best_index]
    
    # Display metrics for the best model
    print(f"Best Parameters: {grid_search_xgb.best_params_}")
    print(f"Mean Train Time (s): {train_time_best}")
    print(f"Mean Inference Time (s): {inference_time_best}")
    print(f"CV ROC AUC Mean: {cv_roc_auc_mean}")
    print(f"CV ROC AUC Std Dev: {cv_roc_auc_std}")




Resampling Method: SMOTE
Confusion Matrix for SMOTE:
[[28420    12]
 [    9    40]]
Test Set Metrics with Default Threshold:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     28432
           1       0.77      0.82      0.79        49

    accuracy                           1.00     28481
   macro avg       0.88      0.91      0.90     28481
weighted avg       1.00      1.00      1.00     28481

Test Set ROC AUC: 0.977
Best Parameters: {'classifier__learning_rate': 0.1, 'classifier__max_depth': 6, 'classifier__n_estimators': 200, 'classifier__scale_pos_weight': 1}
Mean Train Time (s): 2.0349721908569336
Mean Inference Time (s): 0.09009852409362792
CV ROC AUC Mean: 0.7538088547189821
CV ROC AUC Std Dev: 0.019381475787937003

Resampling Method: RandomUnderSampler
Confusion Matrix for RandomUnderSampler:
[[27425  1007]
 [    4    45]]
Test Set Metrics with Default Threshold:
              precision    recall  f1-score   support

      

In [None]:
# RandomUndersampler giving extremely low precision
# When undersampling, we're drastically reducing the number of non-fraudulent
# samples to match the minority (fraudulent) class, which is likely causing 
# severe information loss.

In [None]:
########################### MLP ############################################

In [None]:
################################################
##  MLP ########################################

# This Multi-Layer Perceptron (MLP) is a feedforward neural net designed for binary classification.
# In this implementation, the Multi-Layer Perceptron (MLP) consists of an input layer, two hidden layers with
# ReLU activation, and an output layer with a sigmoid activation function for binary classif. The hidden
# layers are designed with a moderate number of neurons to balance complexity and computational efficiency, while
# dropout layers are added to prevent overfitting. This design was chosen to capture relationships
# in the data while remaining computationally practical given resource constraints.


# Note:create_model is a factory function used by KerasClassifier to generate new model instances dynamically.
# Although we only define create_model once, it’s called multiple times by KerasClassifier during training,
# cross-validation, and hyperparameter search.


In [4]:
import pandas as pd
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score
import tensorflow as tf
from scikeras.wrappers import KerasClassifier
from sklearn.model_selection import StratifiedKFold
from imblearn.under_sampling import RandomUnderSampler
from sklearn.metrics import confusion_matrix
from tensorflow.keras.callbacks import EarlyStopping



# Load the dataset
file_path = r'C:\Users\ssain\Downloads\CCfraud\creditcard.csv'
df = pd.read_csv(file_path)

# Downsample the dataset for quicker training
# Sample 50% of the data from each class to maintain balance
df_sampled = df.groupby('Class', group_keys=False).apply(lambda x: x.sample(frac=0.5, random_state=42))
# Train-test split (80% train, 20%test)



# Separate features (X) and target (y)
X = df_sampled.drop(columns=['Class'])
y = df_sampled['Class']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)



# Define a function to create the models
# This builds MLP using Tensorflow/Keras
def create_model(neurons=32, dropout_rate=0.3, optimizer='adam'):
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(X_train.shape[1],)),  # Input layer
        tf.keras.layers.Dense(neurons, activation='relu'), # First hidden layer
        tf.keras.layers.Dropout(dropout_rate),            # Dropout for regularization
        tf.keras.layers.Dense(neurons // 2, activation='relu'), #2nd hidden
        tf.keras.layers.Dropout(dropout_rate),
        tf.keras.layers.Dense(1, activation='sigmoid')  #output layer for binary classif.
    ])
    model.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])
    return model


# Define EarlyStopping
early_stopping = EarlyStopping(
    monitor='val_loss',  # Track validation loss
    patience=3,          # Stop after 3 epochs with no improvement
    restore_best_weights=True  # Roll back to the best weights
)

# Wrap the model using KerasClassifier
# This makes model compatible with scikit learn
model = KerasClassifier(model=create_model, verbose=0, random_state=42,
                       callbacks=[early_stopping],validation_split=0.2)


# Define hyperparameter grid for RandomizedSearchCV

param_grid = {
    'mlp__model__neurons': [64, 128], #neurons in first hidden
    'mlp__model__dropout_rate': [0.3], 
    'mlp__epochs': [50],  # No. of epochs 
    'mlp__batch_size': [128],
    'mlp__model__optimizer': ['adam']  #optimizer
        
}

# Define stratified K-fold CrossVal
# Ensures class dist. preserved in every fold
stratified_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

# Define sampling methods
sampling_methods = {
    "SMOTE": SMOTE(),
    "RandomUnderSampler": RandomUnderSampler()
}

# Loop through each sampling method
# Loop trains and evaluates with both sampling methods
for name, sampler in sampling_methods.items():
    print(f"\Results with {name}...")
    
    
    # Create pipeline with the current sampler
    pipeline = Pipeline([
        ('scaler', StandardScaler()),  # standardize features
        ('sampler', sampler),          # Apply sampler
        ('mlp', model)                 # use the mlp as the model
    ])
    
    # Use RandomizedSearchCV (to reduce compute/train time)
    # RANDOMLY samples specified number of combos from hyperparameter grid
    grid = RandomizedSearchCV(
        estimator=pipeline,
        param_distributions=param_grid,
        cv=stratified_cv,
        verbose=1,
        n_jobs=1,
        n_iter=2,   #number of random hyperparamater combos to try
        scoring='roc_auc'  #optimize for roc_auc
    )
    
    # Fit the RandomizedSearchCV
    grid_result = grid.fit(X_train, y_train)
    
    # Make predictions on test set using best model found
    y_pred = grid_result.best_estimator_.predict(X_test) #pred class labels
    y_prob = grid_result.best_estimator_.predict_proba(X_test)[:, 1] #pred probabs
    
    # Print evaluation metrics
    print(f"\nResults with {name}:")
    print(f"Best Parameters: {grid_result.best_params_}")
    print(f"Test Accuracy: {accuracy_score(y_test, y_pred)}")
    print(classification_report(y_test, y_pred))
    print(f"ROC-AUC Score: {roc_auc_score(y_test, y_prob):.4f}")
    
    # Print confusion matrix
    cm = confusion_matrix(y_test, y_pred)
    print(f"Confusion Matrix for {name}:")
    print(cm)
    
    # Access the index of the best model
    best_index = grid.best_index_

    # Extract metrics for the best model
    best_train_time = grid.cv_results_['mean_fit_time'][best_index]  # Training time
    best_inference_time = grid.cv_results_['mean_score_time'][best_index]  # Inference time
    best_roc_auc_std = grid.cv_results_['std_test_score'][best_index]  # Std dev of CV ROC-AUC

    # Print results for the best model
    print(f"Best Model Train Time (s): {best_train_time:.4f}s")
    print(f"Best Model Inference Time (s): {best_inference_time:.4f}s")
    print(f"Best Model CV ROC AUC Std Dev(stability): {best_roc_auc_std:.4f}")
    


\Results with SMOTE...
Fitting 3 folds for each of 2 candidates, totalling 6 fits

Results with SMOTE:
Best Parameters: {'mlp__model__optimizer': 'adam', 'mlp__model__neurons': 64, 'mlp__model__dropout_rate': 0.3, 'mlp__epochs': 50, 'mlp__batch_size': 128}
Test Accuracy: 0.9991924440855307
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     28424
           1       0.81      0.77      0.79        57

    accuracy                           1.00     28481
   macro avg       0.91      0.89      0.90     28481
weighted avg       1.00      1.00      1.00     28481

ROC-AUC Score: 0.9486
Confusion Matrix for SMOTE:
[[28414    10]
 [   13    44]]
Best Model Train Time (s): 19.1080s
Best Model Inference Time (s): 0.3840s
Best Model CV ROC AUC Std Dev(stability): 0.0095
\Results with RandomUnderSampler...
Fitting 3 folds for each of 2 candidates, totalling 6 fits

Results with RandomUnderSampler:
Best Parameters: {'mlp__model__optimizer': 'adam