## Step 1 – Load and Inspect the Data

In this step, we load the cleaned dataset `Fraud_Cleaned.csv` into a pandas DataFrame and perform an initial inspection. 
We check for missing values, confirm data types, and analyze the class distribution of the target variable `fraud`.
This step ensures the dataset is clean and structurally ready for training and evaluation.

### Goals of this step:
- Ensure no missing values are present
- Verify that all feature columns are properly formatted
- Analyze class imbalance in the target column
- Review basic feature distributions and dataset size
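
The checks listed above could look roughly like the following sketch. It runs on a tiny, made-up stand-in frame (`df_demo`) so that it is self-contained; the `amount` and `n_items` columns and all values are hypothetical, and only the `fraud` column name is taken from this notebook.

```python
import pandas as pd

# Hypothetical stand-in for Fraud_Cleaned.csv (values invented for illustration).
df_demo = pd.DataFrame({
    "amount": [12.5, 80.0, 33.1, 5.0, 240.0, 19.9],
    "n_items": [3, 10, 4, 1, 25, 2],
    "fraud": [0, 0, 0, 0, 1, 0],
})

# Missing values per column and in total
missing_per_column = df_demo.isna().sum()
n_missing = int(missing_per_column.sum())

# Relative class frequencies of the target
class_share = df_demo["fraud"].value_counts(normalize=True)

print(f"Rows, columns: {df_demo.shape}")
print(f"Missing values in total: {n_missing}")
print(df_demo.dtypes)        # data type of every column
print(class_share)           # share of each class in `fraud`
print(df_demo.describe())    # basic distribution statistics
```

On the real data, the same calls would be applied to `df` after `pd.read_csv`.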

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the CSV file (adjust the path if needed)
df = pd.read_csv("../Data/Fraud_Cleaned.csv")

# Separate the features from the target variable
X = df.drop("fraud", axis=1)  # all columns except the target
y = df["fraud"]               # target (0 = no fraud, 1 = fraud)

# Stratified 70/15/15 split into training, validation, and test sets
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.30, random_state=42, stratify=y)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.50, random_state=42, stratify=y_temp)

# Optional: print the shapes as a sanity check
print(f"Training data: {X_train.shape}")
print(f"Validation data: {X_val.shape}")
print(f"Test data: {X_test.shape}")

Training data: (348635, 9)
Validation data: (74707, 9)
Test data: (74708, 9)


## Step 2 – Data Splitting (Train, Validation, Test)

To prepare for model training and evaluation, we split the dataset into three parts:

- **70%** Training Set: Used to fit the model
- **15%** Validation Set: Used for threshold tuning and cost evaluation
- **15%** Test Set: Used for final model evaluation

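We use stratified sampling so that the share of the target class `fraud` stays the same across all splits. This matters because the classes are strongly imbalanced: only about 4.8% of cases are fraud. The split is done in two stages: 30% of the rows are held out first, and that hold-out is then divided evenly into the validation and test sets.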
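
To illustrate what stratification buys us, here is a minimal sketch on synthetic data. The frame `X_demo`, the series `y_demo`, and the 5% positive rate are assumptions made for the example and are not the notebook's data. It applies the same two-stage 70/15/15 split and compares the positive rate in each part.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced toy data (roughly 5% positives); purely illustrative.
rng = np.random.default_rng(42)
n = 10_000
X_demo = pd.DataFrame({"feature": rng.normal(size=n)})
y_demo = pd.Series((rng.random(n) < 0.05).astype(int), name="fraud")

# Stage 1: hold out 30% of the rows; stage 2: halve the hold-out into validation and test.
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=42, stratify=y_demo
)
X_va, X_te, y_va, y_te = train_test_split(
    X_tmp, y_tmp, test_size=0.50, random_state=42, stratify=y_tmp
)

# The positive rate should be nearly identical in every part.
rates = {"full": y_demo.mean(), "train": y_tr.mean(), "val": y_va.mean(), "test": y_te.mean()}
for name, rate in rates.items():
    print(f"{name:>5}: {rate:.4f}")
```

Without `stratify`, the rates in the smaller parts could drift noticeably from the overall rate, which would make validation and test metrics for the minority class harder to compare.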

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Random forest model; class_weight='balanced' reweights the classes to offset the imbalance
rf = RandomForestClassifier(random_state=42, class_weight='balanced', n_jobs=-1)

# Define the hyperparameter search space
param_dist = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, 30, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2']
}

# RandomizedSearchCV setup
random_search = RandomizedSearchCV(
    estimator=rf,
    param_distributions=param_dist,
    n_iter=30,
    cv=5,
    scoring='f1',  # F1 score as the initial metric
    verbose=1,
    random_state=42,
    n_jobs=-1
)

# Train the model with hyperparameter tuning
random_search.fit(X_train, y_train)

# Keep the best model
best_rf = random_search.best_estimator_

# Show the best parameters
best_params = random_search.best_params_
best_params

Fitting 5 folds for each of 30 candidates, totalling 150 fits


## Step 3 – Threshold Tuning Based on Cost

After training the classification model, we do not simply use the default threshold of 0.5.  
Instead, we perform **cost-based threshold tuning** on the validation set, considering the business impact:

- A **False Positive (FP)** triggers an unnecessary check of an honest customer, which is expensive and damages the customer relationship
- A **False Negative (FN)** means a fraudster goes undetected, which also carries a cost
- In our case, **one FP is weighted 5× as heavily as one FN**

We calculate the total cost for each threshold from 0.01 to 0.99 (in steps of 0.01) and select the one with the lowest total cost on the validation set.
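
Written out, the quantity being minimized can be stated as follows. The symbols $C$, $t$, and $t^{*}$ are notation introduced here for illustration; the 5:1 weighting is the one stated above.

$$
C(t) = 5 \cdot \mathrm{FP}(t) + \mathrm{FN}(t), \qquad t^{*} = \arg\min_{t \in \{0.01,\, 0.02,\, \dots,\, 0.99\}} C(t)
$$

Here $\mathrm{FP}(t)$ and $\mathrm{FN}(t)$ are the false-positive and false-negative counts on the validation set when a case is flagged as fraud whenever its predicted probability is at least $t$. As a made-up example, a threshold that produces 10 false positives and 40 false negatives has a cost of $5 \cdot 10 + 40 = 90$.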

In [7]:
from sklearn.metrics import confusion_matrix
import numpy as np

# Predicted fraud probabilities on the validation set
y_val_proba = best_rf.predict_proba(X_val)[:, 1]

thresholds = np.arange(0.01, 1.00, 0.01)
costs = []

for t in thresholds:
    y_pred = (y_val_proba >= t).astype(int)
    # labels=[0, 1] keeps the matrix 2x2 even if only one class occurs
    tn, fp, fn, tp = confusion_matrix(y_val, y_pred, labels=[0, 1]).ravel()
    cost = fp * 5 + fn  # one FP counts as much as five FNs
    costs.append(cost)

# Threshold with the lowest total cost
optimal_threshold = thresholds[np.argmin(costs)]
print(f"Optimal threshold: {optimal_threshold:.2f}, minimum cost: {min(costs)}")

Optimal threshold: 0.64, minimum cost: 491


## Step 4 – Final Model Evaluation on Test Set

Now that we have identified the optimal threshold based on cost minimization, we evaluate the model performance on the **unseen test set**.

We use the trained model and apply the optimal threshold to classify test instances. We then calculate:
- The **confusion matrix**
- The **total cost**: (False Positives × 5 + False Negatives)
- **Precision**, **Recall**, and **F1-score** for the minority class (`fraud = 1`)

Because the test set was used neither for fitting the model nor for choosing the threshold, this gives a realistic estimate of how the model would perform on new data.
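
As a small worked illustration of how these quantities follow from the four confusion-matrix counts, consider the sketch below. The labels and probabilities are made up, and the threshold of 0.64 is simply reused as an assumed value.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Made-up labels and predicted fraud probabilities for ten cases.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_prob = np.array([0.10, 0.20, 0.70, 0.30, 0.05, 0.40, 0.90, 0.80, 0.50, 0.65])
threshold = 0.64  # assumed threshold for this illustration

# Flag a case as fraud when its probability reaches the threshold.
y_hat = (y_prob >= threshold).astype(int)

tn, fp, fn, tp = confusion_matrix(y_true, y_hat, labels=[0, 1]).ravel()
cost = fp * 5 + fn                      # same weighting as in the notebook

# Class-1 metrics computed by hand from the counts
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}, cost={cost}")
print(f"precision={precision:.2f}, recall={recall:.2f}, f1={f1:.2f}")

# The hand-computed values agree with scikit-learn's metric functions.
assert np.isclose(precision, precision_score(y_true, y_hat))
assert np.isclose(recall, recall_score(y_true, y_hat))
assert np.isclose(f1, f1_score(y_true, y_hat))
```

In this toy case one honest customer is flagged and one fraud case is missed, so the cost is 5 + 1 = 6.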

In [8]:
from sklearn.metrics import classification_report, confusion_matrix

# Predict on the test set using the tuned threshold
y_test_proba = best_rf.predict_proba(X_test)[:, 1]
y_test_pred = (y_test_proba >= optimal_threshold).astype(int)

# Compute the confusion matrix and the final cost
tn, fp, fn, tp = confusion_matrix(y_test, y_test_pred).ravel()
final_cost = fp * 5 + fn
print(f"Final cost (FP×5 + FN): {final_cost}")

# Classification report (class 1 is the one of interest)
report = classification_report(y_test, y_test_pred)
print("Classification report:")
print(report)

Final cost (FP×5 + FN): 567
Classification report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     71149
           1       0.99      0.91      0.95      3559

    accuracy                           1.00     74708
   macro avg       0.99      0.95      0.97     74708
weighted avg       0.99      1.00      0.99     74708



In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import RocCurveDisplay, PrecisionRecallDisplay

# 1. Cost curve
plt.figure(figsize=(10, 6))
plt.plot(thresholds, costs, label="Total cost (FP×5 + FN)")
plt.axvline(x=optimal_threshold, color='red', linestyle='--', label=f"Optimal threshold = {optimal_threshold:.2f}")
plt.xlabel("Threshold")
plt.ylabel("Cost")
plt.title("Cost-Based Threshold Optimization (Validation Set)")
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()

# 2. Confusion matrix on the test set
conf_matrix = confusion_matrix(y_test, y_test_pred)
plt.figure(figsize=(6, 5))
sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Blues", cbar=False,
            xticklabels=["No Fraud", "Fraud"], yticklabels=["No Fraud", "Fraud"])
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix (Test Set)")
plt.tight_layout()
plt.show()

# 3. ROC curve
RocCurveDisplay.from_estimator(best_rf, X_test, y_test)
plt.title("ROC Curve (Test Set)")
plt.grid(True)
plt.tight_layout()
plt.show()

# 4. Precision-recall curve
PrecisionRecallDisplay.from_estimator(best_rf, X_test, y_test)
plt.title("Precision-Recall Curve (Test Set)")
plt.grid(True)
plt.tight_layout()
plt.show()