# Financial Fraud Detection System

This notebook implements a robust fraud detection system for financial transactions. The system is designed to identify fraudulent transactions while minimizing false positives, with an emphasis on handling imbalanced data, temporal validation, and model explainability.

### Dataset
The dataset used is the 'Credit Card Fraud Detection' dataset available at [Kaggle](https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud).

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, TimeSeriesSplit
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE
from sklearn.decomposition import PCA
from sklearn.random_projection import SparseRandomProjection
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.metrics import classification_report, roc_auc_score, precision_recall_curve, f1_score, confusion_matrix
from sklearn.utils.class_weight import compute_class_weight
import warnings

warnings.filterwarnings("ignore")

In [2]:
# Load the dataset
df = pd.read_csv('./Datasets/creditcard.csv')
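# 'Time' holds seconds elapsed since the first transaction in the dataset, so the
# resulting datetimes are offsets from 1970-01-01 rather than real timestamps.
df['Time'] = pd.to_datetime(df['Time'], unit='s')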
print(df.head())

                 Time        V1        V2        V3        V4        V5  \
0 1970-01-01 00:00:00 -1.359807 -0.072781  2.536347  1.378155 -0.338321   
1 1970-01-01 00:00:00  1.191857  0.266151  0.166480  0.448154  0.060018   
2 1970-01-01 00:00:01 -1.358354 -1.340163  1.773209  0.379780 -0.503198   
3 1970-01-01 00:00:01 -0.966272 -0.185226  1.792993 -0.863291 -0.010309   
4 1970-01-01 00:00:02 -1.158233  0.877737  1.548718  0.403034 -0.407193   

         V6        V7        V8        V9  ...       V21       V22       V23  \
0  0.462388  0.239599  0.098698  0.363787  ... -0.018307  0.277838 -0.110474   
1 -0.082361 -0.078803  0.085102 -0.255425  ... -0.225775 -0.638672  0.101288   
2  1.800499  0.791461  0.247676 -1.514654  ...  0.247998  0.771679  0.909412   
3  1.247203  0.237609  0.377436 -1.387024  ... -0.108300  0.005274 -0.190321   
4  0.095921  0.592941 -0.270533  0.817739  ... -0.009431  0.798278 -0.137458   

        V24       V25       V26       V27       V28  Amount  Class  
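
Because `Time` is an offset from the first transaction rather than a calendar timestamp, it can be more useful to derive relative features from it. The following is a minimal sketch on made-up values; the column names `elapsed` and `hour_of_day` are hypothetical, and `hour_of_day` counts hours from the first transaction, not clock time.

```python
import pandas as pd

# Hypothetical elapsed-seconds values standing in for the 'Time' column.
demo = pd.DataFrame({"Time": [0.0, 1.0, 3600.0, 90000.0]})

# Keep the raw seconds and express them as a duration instead of a date.
demo["elapsed"] = pd.to_timedelta(demo["Time"], unit="s")

# Hour within a 24-hour cycle, measured from the first transaction.
demo["hour_of_day"] = (demo["Time"] // 3600 % 24).astype(int)

print(demo)
```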

## 1. Data Preparation
### Handling Class Imbalance and Temporal Splitting

In [3]:
# Check class distribution
print("Class distribution:")
print(df['Class'].value_counts())

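# Separate features from the target; 'Time' is dropped from the feature set
X = df.drop(['Class', 'Time'], axis=1)
y = df['Class']

# Standardize the features. Note: the scaler is fit on the full dataset,
# so statistics from the later test rows influence the transform.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Handle class imbalance using SMOTE. Note: resampling runs before the split,
# and SMOTE appends its synthetic fraud samples after the original rows.
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_scaled, y)

# Unshuffled 80/20 split. Because the synthetic samples sit at the end of
# X_resampled, the last 20% consists of synthetic fraud rows, not later transactions.
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.2, shuffle=False)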

Class distribution:
Class
0    284315
1       492
Name: count, dtype: int64
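
One way to keep the evaluation free of leakage is to split chronologically first and fit every preprocessing step on the training rows only. The sketch below illustrates that ordering on synthetic data. It uses class weights instead of SMOTE to avoid extra dependencies, so it is not the notebook's method; the same ordering would apply to SMOTE by calling `fit_resample` on the training portion alone. All `*_demo` names are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.utils.class_weight import compute_class_weight

# Synthetic stand-in for a time-ordered feature matrix (hypothetical data).
rng = np.random.default_rng(42)
n_samples, n_features = 1000, 5
X_demo = rng.normal(loc=3.0, scale=2.0, size=(n_samples, n_features))
y_demo = np.zeros(n_samples, dtype=int)
y_demo[::50] = 1  # 2% positives, spread evenly over time

# 1. Chronological split: the first 80% of rows train, the last 20% test.
split_at = int(n_samples * 0.8)
X_tr, X_te = X_demo[:split_at], X_demo[split_at:]
y_tr, y_te = y_demo[:split_at], y_demo[split_at:]

# 2. Fit the scaler on the training rows only, then apply it to both parts.
scaler_demo = StandardScaler().fit(X_tr)
X_tr_scaled = scaler_demo.transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)

# 3. Derive the rebalancing weights from the training labels only.
class_weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_tr
)
weight_by_class = dict(zip([0, 1], class_weights))
print(weight_by_class)
```

The resulting dictionary could be passed as `class_weight` to a classifier such as `RandomForestClassifier`.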


## 2. Feature Engineering and Selection

In [4]:
from sklearn.random_projection import SparseRandomProjection
from sklearn.ensemble import RandomForestClassifier
import numpy as np

# Project the training features onto 20 sparse random components
random_projection = SparseRandomProjection(n_components=20, random_state=42)
X_random_projected = random_projection.fit_transform(X_train)

# Fit a small random forest on the projected data (n_jobs=-1 uses all CPU cores)
rf = RandomForestClassifier(n_estimators=50, max_depth=10, random_state=42, n_jobs=-1)
rf.fit(X_random_projected, y_train)
feature_importances = rf.feature_importances_

# The importances refer to the 20 projected components, not to the original columns
print("Feature Importances (Random Forest):")
print(feature_importances)

Feature Importances (Random Forest):
[0.00557093 0.04684572 0.00882326 0.00594302 0.14088451 0.00566561
 0.24330383 0.08351342 0.08192156 0.01129714 0.0090376  0.03274946
 0.01663589 0.17708479 0.02397017 0.07875894 0.00348874 0.01095785
 0.00280231 0.01074526]
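
The raw array is easier to read once the components are ranked. The sketch below copies the values printed above and lists the five most important projected components by index; the indices refer to random-projection components, not to the original `V1`..`V28` or `Amount` columns.

```python
import numpy as np

# Importances copied from the output above (one value per projected component).
importances = np.array([
    0.00557093, 0.04684572, 0.00882326, 0.00594302, 0.14088451,
    0.00566561, 0.24330383, 0.08351342, 0.08192156, 0.01129714,
    0.0090376, 0.03274946, 0.01663589, 0.17708479, 0.02397017,
    0.07875894, 0.00348874, 0.01095785, 0.00280231, 0.01074526,
])

# Sort component indices from most to least important and keep the top five.
top_components = np.argsort(importances)[::-1][:5]
for idx in top_components:
    print(f"component {idx}: {importances[idx]:.4f}")

# Share of total importance carried by those five components.
top_share = importances[top_components].sum() / importances.sum()
print(f"top-5 share: {top_share:.2%}")
```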


## 3. Build Ensemble Detection System

## 4. Evaluation

In [7]:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.metrics import precision_recall_curve, roc_auc_score, f1_score, confusion_matrix
from sklearn.model_selection import train_test_split

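# Use only the first 10,000 rows to keep training fast. These are original
# (non-synthetic) transactions, so the 2,000-row test split contains few fraud cases.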
X_train, X_test, y_train, y_test = train_test_split(
    X_resampled[:10000], y_resampled[:10000], test_size=0.2, shuffle=False
)

# Build ensemble models with reduced complexity
models = {
    'RandomForest': RandomForestClassifier(n_estimators=50, max_depth=10, n_jobs=-1, random_state=42),
    'GradientBoosting': GradientBoostingClassifier(n_estimators=50, max_depth=3, random_state=42),
    'XGBoost': XGBClassifier(n_estimators=50, max_depth=3, random_state=42, eval_metric='logloss')  # CPU training; use_label_encoder is deprecated in recent XGBoost releases
}

# Evaluate models
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]

    auc_score = roc_auc_score(y_test, y_prob)
    f1 = f1_score(y_test, y_pred)
    cm = confusion_matrix(y_test, y_pred)

    results[name] = {
        'ROC AUC': auc_score,
        'F1 Score': f1,
        'Confusion Matrix': cm
    }

# Display evaluation metrics
for model_name, metrics in results.items():
    print(f"Model: {model_name}")
    print(f"  ROC AUC: {metrics['ROC AUC']:.4f}")
    print(f"  F1 Score: {metrics['F1 Score']:.4f}")
    print("  Confusion Matrix:")
    print(metrics['Confusion Matrix'])
    print("\n")

Model: RandomForest
  ROC AUC: 1.0000
  F1 Score: 0.9286
  Confusion Matrix:
[[1985    2]
 [   0   13]]


Model: GradientBoosting
  ROC AUC: 0.9992
  F1 Score: 0.8966
  Confusion Matrix:
[[1984    3]
 [   0   13]]


Model: XGBoost
  ROC AUC: 0.9999
  F1 Score: 0.8889
  Confusion Matrix:
[[1985    2]
 [   1   12]]
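
With only 13 fraud cases in the test split, precision and recall are more informative than ROC AUC. The sketch below recomputes them from the confusion matrices printed above, assuming scikit-learn's `[[TN, FP], [FN, TP]]` layout, and reproduces the reported F1 scores.

```python
import numpy as np

# Confusion matrices copied from the output above, laid out as [[TN, FP], [FN, TP]].
confusion = {
    "RandomForest": np.array([[1985, 2], [0, 13]]),
    "GradientBoosting": np.array([[1984, 3], [0, 13]]),
    "XGBoost": np.array([[1985, 2], [1, 12]]),
}

summary = {}
for name, cm in confusion.items():
    tn, fp, fn, tp = cm.ravel()
    precision = tp / (tp + fp)  # share of flagged transactions that are fraud
    recall = tp / (tp + fn)     # share of fraud cases that were flagged
    f1 = 2 * precision * recall / (precision + recall)
    summary[name] = (precision, recall, f1)
    print(f"{name}: precision={precision:.4f}, recall={recall:.4f}, F1={f1:.4f}")
```

Each false positive moves precision by several points at this sample size, so differences between the three models should be read with caution.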




## 5. Structural Risk Minimization

In [8]:
# Use hyperparameter tuning to balance model complexity and performance
from sklearn.model_selection import GridSearchCV

param_grid = {
    'max_depth': [3, 5, 10],
    'n_estimators': [50, 100, 200]
}

grid_search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, scoring='f1', cv=3)
grid_search.fit(X_train, y_train)
print("Best parameters for Random Forest:", grid_search.best_params_)
print("Best F1 Score:", grid_search.best_score_)


Best parameters for Random Forest: {'max_depth': 5, 'n_estimators': 100}
Best F1 Score: 0.8888888888888888
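
The grid search above uses the default 3-fold split, which does not respect row order. For time-ordered data, one option is to pass a `TimeSeriesSplit` as `cv`, so that each fold validates on rows that come after its training rows. The following is a sketch on synthetic data with a reduced grid; `X_demo`, `y_demo`, and `small_grid` are hypothetical, and the printed scores say nothing about the credit card dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Synthetic, time-ordered stand-in data (hypothetical).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)

# Each fold trains on an initial segment and validates on the rows that follow it.
tscv = TimeSeriesSplit(n_splits=3)
for train_idx, val_idx in tscv.split(X_demo):
    print(f"train rows 0-{train_idx.max()}, validation rows {val_idx.min()}-{val_idx.max()}")

# A smaller grid than the one above, so the sketch runs quickly.
small_grid = {"max_depth": [3, 5], "n_estimators": [20, 50]}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    small_grid,
    scoring="f1",
    cv=tscv,
)
search.fit(X_demo, y_demo)
print("Best parameters:", search.best_params_)
```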


## 6. Monitoring System for Concept Drift

In [12]:
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', RandomForestClassifier(random_state=42))
])

# Train the pipeline model
pipeline.fit(X_train, y_train)

# Define monitor_drift function
def monitor_drift(new_data, pipeline, threshold=0.1):
    """
    Flag possible drift when the spread of predicted fraud probabilities is large.

    This is a simple heuristic: it looks only at the new batch and does not
    compare it against a reference distribution.

    Parameters:
    - new_data (array-like): Feature rows to score, in the format the pipeline was trained on
    - pipeline (Pipeline): The trained pipeline containing preprocessing and model
    - threshold (float): Standard deviation of the predicted probabilities above which drift is flagged

    Returns:
    - bool: True if drift is flagged, False otherwise
    """
    predictions = pipeline.predict_proba(new_data)[:, 1]  # Probabilities for the positive class
    drift_detected = np.std(predictions) > threshold  # Compare the spread against the threshold
    return drift_detected

# Example usage of monitor_drift
drift = monitor_drift(X_test, pipeline, threshold=0.1)
print("Concept drift detected." if drift else "No concept drift detected.")

No concept drift detected.
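
A drift check is usually more informative when it compares new scores against a reference window. The sketch below uses the Population Stability Index (PSI), which is a different technique from the standard-deviation check in `monitor_drift`. The function name, the beta-distributed stand-in scores, and the bin count are all assumptions; a commonly cited rule of thumb treats PSI below 0.1 as stable and above 0.25 as a substantial shift.

```python
import numpy as np

def population_stability_index(reference, current, n_bins=10, eps=1e-6):
    """PSI between two 1-D samples, using bins taken from the reference quantiles."""
    reference = np.asarray(reference, dtype=float)
    current = np.asarray(current, dtype=float)
    # Interior bin edges at evenly spaced reference quantiles.
    inner_edges = np.quantile(reference, np.linspace(0, 1, n_bins + 1)[1:-1])
    ref_bins = np.searchsorted(inner_edges, reference, side="right")
    cur_bins = np.searchsorted(inner_edges, current, side="right")
    ref_counts = np.bincount(ref_bins, minlength=n_bins)
    cur_counts = np.bincount(cur_bins, minlength=n_bins)
    # Clip proportions away from zero so the logarithm stays finite.
    ref_share = np.clip(ref_counts / ref_counts.sum(), eps, None)
    cur_share = np.clip(cur_counts / cur_counts.sum(), eps, None)
    return float(np.sum((cur_share - ref_share) * np.log(cur_share / ref_share)))

# Hypothetical score samples standing in for predicted fraud probabilities.
rng = np.random.default_rng(42)
reference_scores = rng.beta(2, 5, size=5000)  # scores at training time
same_scores = rng.beta(2, 5, size=5000)       # new batch, same distribution
shifted_scores = rng.beta(5, 2, size=5000)    # new batch, shifted distribution

psi_same = population_stability_index(reference_scores, same_scores)
psi_shifted = population_stability_index(reference_scores, shifted_scores)
print(f"PSI (same distribution): {psi_same:.4f}")
print(f"PSI (shifted distribution): {psi_shifted:.4f}")
```

In practice the reference sample would be the pipeline's predicted probabilities on a held-out window from training time, and the current sample its predictions on the newest batch.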


## Conclusion
This notebook implements a financial fraud detection workflow: handling class imbalance with SMOTE, splitting the data without shuffling, projecting features with sparse random projection, comparing three tree-based ensembles, tuning hyperparameters, and running a simple drift check. The models are compared on ROC AUC, F1 score, and confusion matrices, which together show the tradeoff between false positives and recall.