#### Loading the data and reusing the transaction-type filter:

In [2]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.ensemble import RandomForestClassifier

df_synth = pd.read_csv("data/PS_20174392719_1491204439457_log.csv")

In [3]:
df_synth.columns

Index(['step', 'type', 'amount', 'nameOrig', 'oldbalanceOrg', 'newbalanceOrig',
       'nameDest', 'oldbalanceDest', 'newbalanceDest', 'isFraud',
       'isFlaggedFraud'],
      dtype='object')

In [4]:
df_synth_ct = df_synth[df_synth["type"].isin(["CASH_OUT", "TRANSFER"])].copy()
df_synth_ct["type_code"] = (df_synth_ct["type"] == "TRANSFER").astype(int)

features = ["amount", "oldbalanceOrg", "newbalanceOrig", "type_code"]
X = df_synth_ct[features]
Y = df_synth_ct["isFraud"]

print(Y.value_counts(normalize = True))

isFraud
0    0.997035
1    0.002965
Name: proportion, dtype: float64


In [5]:
#### Train-test split:

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42, stratify=Y)

In [32]:
# Unweighted logistic
clf = LogisticRegression(max_iter=1000, n_jobs=-1)
clf.fit(X_train, Y_train)

# Class-weighted logistic
clf_w = LogisticRegression(max_iter=1000, class_weight="balanced", n_jobs=-1)
clf_w.fit(X_train, Y_train)
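
For reference, scikit-learn documents `class_weight="balanced"` as weighting each class by `n_samples / (n_classes * count)`. The following is a minimal sketch of that formula on a toy label vector; the 8:2 split and the name `y_toy` are illustrative assumptions, not the PaySim ratio.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels: 8 legitimate (0) and 2 fraud (1); illustrative only.
y_toy = np.array([0] * 8 + [1] * 2)

# Weight for class c under "balanced": n_samples / (n_classes * count(c)).
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y_toy)

# The same quantity computed by hand, for comparison.
manual = len(y_toy) / (2 * np.bincount(y_toy))

print(dict(zip([0, 1], weights)))
```

Applied to the class proportions printed earlier (0.997035 vs 0.002965), the same formula weights each fraud roughly 336 times more heavily than each legitimate transaction.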

#### Baseline RandomForestClassifier:

In [7]:
rf = RandomForestClassifier(
    n_estimators=200,
    max_depth=None,
    n_jobs=-1,
    random_state=42,
    class_weight=None
)
rf.fit(X_train, Y_train)
Y_pred_rf = rf.predict(X_test)

print("Confusion Matrix (RandomForest, default threshold):")
print(confusion_matrix(Y_test, Y_pred_rf))
print()
print("Classification Report (RandomForest, default threshold):")
print(classification_report(Y_test, Y_pred_rf, digits=4))

Confusion Matrix (RandomForest, default threshold):
[[552301    138]
 [   178   1465]]

Classification Report (RandomForest, default threshold):
              precision    recall  f1-score   support

           0     0.9997    0.9998    0.9997    552439
           1     0.9139    0.8917    0.9026      1643

    accuracy                         0.9994    554082
   macro avg     0.9568    0.9457    0.9512    554082
weighted avg     0.9994    0.9994    0.9994    554082



#### The baseline RandomForest reaches fraud precision of ~0.91 and recall of ~0.89 at the default 0.5 threshold. Because it is trained on the same four features and the same split as the logistic models, the comparison isolates what a non-linear ensemble adds in sensitivity to fraud.

- This is a large improvement in recall over the unweighted logistic baseline (recall ~0.42), and in alert volume over the class-weighted logistic model, which needed tens of thousands of alerts to reach high recall.

- Confusion matrix: TN = 552,301; FP = 138; FN = 178; TP = 1,465.
- The model therefore misses only 178 of 1,643 frauds and raises 138 false alarms.
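
The precision, recall, and F1 figures in the classification report follow directly from these four counts. A short worked check, using only the numbers from the confusion matrix above:

```python
# Counts copied from the RandomForest confusion matrix above.
tn, fp, fn, tp = 552_301, 138, 178, 1_465

precision = tp / (tp + fp)  # share of alerts that are real fraud
recall = tp / (tp + fn)     # share of real frauds that are caught
f1 = 2 * precision * recall / (precision + recall)
alerts = tp + fp            # total transactions flagged for review

print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f} alerts={alerts}")
```

The alert count (1,603 out of 554,082 test transactions) is the quantity an investigation team would actually have to work through.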

#### RandomForest vs logistic regression:

The RandomForest model achieves fraud precision of ~0.91 and recall of ~0.89 with only 138 false positives. The unweighted logistic regression reaches similar precision (~0.92) but only ~0.42 recall, and the class-weighted logistic model reaches high recall only at the cost of more than 40k false positives. This demonstrates that a non-linear ensemble can deliver both high recall and a manageable alert volume on the same PaySim feature set, which is much closer to what AML teams require in practice.

## Model comparison overview:

| Model | Threshold | Fraud precision | Fraud recall | False positives | False negatives | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| Logistic regression (unweighted) | 0.50 | ~0.92 | ~0.42 | 58 | 958 | Very few alerts, but misses most frauds. |
| Logistic regression (class-weighted) | 0.50 | ~0.03 | ~0.87 | 43,846 | 216 | High recall, but operationally unusable alert volume. |
| Logistic regression (class-weighted) | 0.80 | ~0.07 | ~0.73 | 15,885 | 449 | More balanced, but still low precision and many alerts. |
| RandomForest | 0.50 | ~0.91 | ~0.89 | 138 | 178 | High recall with very low false positives; best AML fit. |

- Overall, the RandomForest model matches the unweighted logistic model on fraud precision (~0.91 vs ~0.92) while more than doubling its recall (~0.89 vs ~0.42), and it slightly exceeds the recall of the class-weighted model with far fewer false positives (138 vs 43,846). This is consistent with fraud-detection practice, where tree ensembles are widely used to capture complex, non-linear risk patterns in transactional data.
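
The trade-off between the 0.50 and 0.80 thresholds for the class-weighted model can be reproduced in miniature. The sketch below uses a synthetic dataset from `make_classification` as a stand-in (it is not PaySim, and the names `X_demo`, `y_demo`, and `sweep` are assumptions for illustration); it only shows the mechanics of thresholding predicted probabilities.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for a fraud dataset (NOT PaySim).
X_demo, y_demo = make_classification(
    n_samples=20_000, n_features=6, weights=[0.97, 0.03], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

model = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)
prob = model.predict_proba(X_te)[:, 1]

# Apply each threshold to the same probabilities and count the outcomes.
sweep = []
for thr in (0.5, 0.8):
    pred = (prob >= thr).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_te, pred, labels=[0, 1]).ravel()
    sweep.append({"threshold": thr, "alerts": tp + fp, "FP": fp, "FN": fn})
    print(sweep[-1])
```

Raising the threshold can only shrink the set of flagged transactions, so false positives fall while false negatives can only stay the same or rise. This is the same pattern as the two class-weighted rows in the table.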

## Limitations & future work:

- The models in this notebook are trained on the PaySim synthetic mobile money dataset, which is designed to resemble real transactional data but cannot fully capture the diversity of real-world fraud and AML behaviour.
- Only a small set of simple, transaction-level features (amount, origin balances, and transaction type) is used, without any explicit temporal, customer-history, or network/graph-based information, even though such features are known to be important in production fraud systems.
- Future work could therefore extend this baseline by engineering richer behavioural features, evaluating models with PR-AUC/ROC-AUC and cost-sensitive metrics, and eventually testing on more realistic synthetic or anonymised AML datasets that include customer and network context.
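
As one possible direction for richer features, the sketch below derives three candidates from columns that already exist in the PaySim schema. The feature names (`orig_balance_error`, `amount_to_balance`, `orig_txn_count`) and the toy values are hypothetical; none of them has been evaluated in this notebook.

```python
import pandas as pd

# Tiny hand-made frame using PaySim column names; values are illustrative.
toy = pd.DataFrame({
    "step": [1, 1, 2],
    "nameOrig": ["C1", "C1", "C2"],
    "amount": [100.0, 50.0, 500.0],
    "oldbalanceOrg": [100.0, 0.0, 800.0],
    "newbalanceOrig": [0.0, 0.0, 300.0],
})

# Hypothetical feature 1: mismatch between the balance change and the amount.
toy["orig_balance_error"] = (
    toy["oldbalanceOrg"] - toy["amount"] - toy["newbalanceOrig"]
)

# Hypothetical feature 2: share of the origin balance moved (NaN if balance is 0).
toy["amount_to_balance"] = toy["amount"] / toy["oldbalanceOrg"].where(
    toy["oldbalanceOrg"] > 0
)

# Hypothetical feature 3: running transaction count per originating customer,
# assuming rows are already in time order (sorted by `step`).
toy["orig_txn_count"] = toy.groupby("nameOrig").cumcount() + 1

print(toy)
```

Any history-based feature such as the running count would need to be computed using past transactions only, so that no information from later steps leaks into training.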

### Computing ROC-AUC score and Average Precision score:

In [34]:
from sklearn.metrics import roc_auc_score, average_precision_score

def eval_model(name, Y_true, Y_prob, Y_pred, c_fp=10, c_fn=1000):
    # `name` labels the row; c_fp and c_fn are per-error costs for the linear cost metric.
    tn, fp, fn, tp = confusion_matrix(Y_true, Y_pred).ravel()
    roc = roc_auc_score(Y_true, Y_prob)
    pr = average_precision_score(Y_true, Y_prob)
    total_cost = c_fp * fp + c_fn * fn
    return {
        "model": name,
        "ROC_AUC": roc,
        "PR_AUC": pr,
        "precision": tp / (tp + fp) if (tp + fp) > 0 else 0,
        "recall": tp / (tp + fn) if (tp + fn) > 0 else 0,
        "FP": fp,
        "FN": fn,
        "total_cost": total_cost
    }

results = []

# 1) Logistic Unweighted:

Y_prob_log = clf.predict_proba(X_test)[:, 1]
Y_pred_log = (Y_prob_log >= 0.5).astype(int)
results.append(eval_model("Logistic (unweighted, 0.5)", Y_test, Y_prob_log, Y_pred_log))

# 2) Logistic Weighted, threshold 0.5 & 0.8:

Y_prob_w = clf_w.predict_proba(X_test)[:, 1]
for thr in [0.5, 0.8]:
    Y_pred_w_thr = (Y_prob_w >= thr).astype(int)
    results.append(eval_model(f"Logistic (weighted, {thr})", Y_test, Y_prob_w, Y_pred_w_thr))

df_synth_results = pd.DataFrame(results)
df_synth_results
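
The `total_cost` term in `eval_model` is a linear cost, `c_fp * FP + c_fn * FN`. As a worked example, the sketch below applies it to the false-positive and false-negative counts from the model comparison table, using the same default costs (10 per false positive, 1000 per false negative). These cost values are illustrative placeholders rather than calibrated business figures, and the constants `C_FP` and `C_FN` are names introduced here.

```python
import pandas as pd

C_FP, C_FN = 10, 1000  # same defaults as eval_model; illustrative, not calibrated

# (FP, FN) counts copied from the model comparison table.
counts = {
    "Logistic (unweighted, 0.5)": (58, 958),
    "Logistic (class-weighted, 0.5)": (43_846, 216),
    "Logistic (class-weighted, 0.8)": (15_885, 449),
    "RandomForest (0.5)": (138, 178),
}

# Linear cost per model, sorted from cheapest to most expensive.
costs = pd.Series(
    {model: C_FP * fp + C_FN * fn for model, (fp, fn) in counts.items()},
    name="total_cost",
).sort_values()

print(costs)
```

Under these assumed costs the RandomForest is cheapest (179,380) and the unweighted logistic model is most expensive (958,580), because missed frauds dominate the total. A different cost ratio could change the ordering of the logistic variants.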